Posted: 14/06/19

Opinion Piece by Dr Teresa Lynn: The Irish language needs to be digitally supported and machine translation is just one way of doing this.

The job of the translator has undoubtedly changed over the past twenty years through the introduction of technology. This is a universal change and in Ireland, how we embrace this change is crucial.

The use of Machine Translation (MT) is part of this change and has become standard practice across the world. While artificial intelligence technologies often make the average tech user feel uncomfortable, it is important to separate the myths from the science.

The purpose of machine translation is two-fold. One purpose is to get the gist or sense of a piece of text in a language unknown to you; you might jump online to do this. The other is for pre-translation purposes in a professional environment, where the translation from an MT system is then proof-read or post-edited by a professional translator.

When that crucial second step in a pre-translation context is skipped or overlooked by those publishing translated texts, the onus and responsibility for quality is on those who decide to go ahead and publish, regardless. 

You can forgive an Irish speaker’s frustration when they see official signs such as "All Passengers Departing From These Gates Please Be Patient" translated (and unedited) as "Gach paisinéirí ag imeacht ó na geataí le do thoil a n-othar" ("All Passengers Departing From The Gates, Please, Their Patient"). It’s not surprising, therefore, that MT systems may indirectly have a bad reputation.  

However, by understanding the workings of MT, we can quickly see why we should not have high expectations beyond gisting or pre-translation, especially when it comes to Irish.

The design of an MT system is what we term "data-driven" - it works by engineers feeding (training) the system with data (previously professionally translated text) from which it learns how to predict a new translation.

It’s a game of maths, probabilities and chance. The more examples of translations it sees, the better the predictions and the higher the accuracy. MT systems for some language pairs - English-French, for example - work extremely well, as there is an abundance of training data available.
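The idea of learning from previous translations can be illustrated with a deliberately tiny sketch. It assumes a hypothetical list of already word-aligned English-French pairs and simply counts them; real MT systems are far more sophisticated, and the function names and data here are invented for illustration only.

```python
from collections import Counter, defaultdict


def train(aligned_pairs):
    """Count how often each source word was translated as each target word."""
    counts = defaultdict(Counter)
    for source_word, target_word in aligned_pairs:
        counts[source_word][target_word] += 1
    return counts


def predict(counts, source_word):
    """Return the most frequent translation seen in training, with its estimated probability."""
    seen = counts.get(source_word)
    if not seen:
        # No training examples for this word: the system has nothing to go on.
        return None, 0.0
    target_word, n = seen.most_common(1)[0]
    return target_word, n / sum(seen.values())


# Hypothetical toy training data (previously translated word pairs).
training_data = [
    ("house", "maison"),
    ("house", "maison"),
    ("house", "domicile"),
    ("cat", "chat"),
]

model = train(training_data)
print(predict(model, "house"))
print(predict(model, "dog"))
```

With this toy data, "house" is predicted from three examples, while "dog" has no examples at all and cannot be translated. That second case is a small-scale picture of what happens when training data is scarce.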

However, when a sufficient number of previous translations are not readily available, the translation system suffers. This is an issue facing minority languages and languages for which there is a lack of available digital content. Irish falls into this category.

"Domain" is the term used in the machine translation world to describe the genre of text being translated; it is influenced by features of the text such as the terminology used, register and sometimes style.

Text in parliamentary proceedings (legal domain) differs greatly from text found in government annual reports (public administration domain) or that found in the Roddy Doyle trilogies (literary domain – although Roddy has added challenges of colloquial speech and Hiberno-English terms!).

While free online systems are "open-domain", tuning an MT system with text from a specific domain can significantly improve the quality of automated translation. This is particularly important for minority languages for which there just is not enough translated text available to work with.
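The effect of domain tuning can be mimicked with another toy sketch. It assumes hypothetical general-purpose and in-domain word pairs, and uses a crude stand-in for tuning: in-domain examples are simply counted with a higher weight. The weight of 5 and all names and data are assumptions made for illustration, not how any particular system is tuned.

```python
from collections import Counter


def count_pairs(pairs, weight=1):
    """Count (source, target) pairs, letting each example count `weight` times."""
    counts = Counter()
    for pair in pairs:
        counts[pair] += weight
    return counts


def best_translation(counts, source_word):
    """Pick the target word with the highest count for the given source word."""
    candidates = {t: n for (s, t), n in counts.items() if s == source_word}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical open-domain data: "bank" is mostly the financial sense.
general_data = [("bank", "banque")] * 3 + [("bank", "rive")]
# Hypothetical in-domain data (say, river engineering reports).
domain_data = [("bank", "rive")]

open_domain = count_pairs(general_data)
# Crude stand-in for tuning: in-domain examples get extra weight (assumed value).
tuned = open_domain + count_pairs(domain_data, weight=5)

print(best_translation(open_domain, "bank"))
print(best_translation(tuned, "bank"))
```

The open-domain counts favour "banque", while the tuned counts favour "rive", the sense that matters in the target domain. Even a small amount of in-domain text can shift the system's choices when it is given priority.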

A few years ago, the Department of Culture, Heritage and the Gaeltacht (DCHG) looked into the potential of using MT to assist their translators in meeting the growing demand for English-Irish translations.

While a free online system such as Google Translate is not an option for a government department (non-secure site, below par quality), researchers at Dublin City University (ADAPT Centre) proposed trialling a pilot system - Tapadóir - that was tuned specifically to the type of text their translators usually encounter (public administration domain).

Tapadóir was used to pre-translate the text so that professional translators would only need to correct (post-edit) the system’s suggested translation. The system reaches a quality bar that has been internationally recognised as suitable for a professional post-editing setting, mainly due to being specifically trained with old translations held by DCHG.

On a European level, this approach for training MT systems is being embraced (eTranslation) and is already available for use by those in public administration. The English-Irish version of the eTranslation system is, however, still below par compared to most other EU languages. The main reason for this is the lack of sufficient English-Irish parallel data available to train the system. 

To this end, for the past few years, the same DCU researchers (with the support of DCHG) have been working with public bodies and Irish language organisations across the country in search of bilingual data contributions to help train both national and European machine translation systems for Irish.

In fact, a national data portal has just been launched which will facilitate this collection. ELRI Ireland will let users upload their own bilingual data and terminologies, and anyone working with the Irish language within public administration can request an account for the portal.

The majority of public administration data is open data under the PSI Directive and can be shared freely. Other text collections that have sharing restrictions may be uploaded through specific licenses that determine restricted usage (e.g. for training machine translation systems only). 

The Irish language needs to be digitally supported and machine translation is just one way of doing this. The improvement of the quality and reliability of Irish automated translation systems relies on the efforts and cooperation of those who carry out or oversee translations of English/Irish text. 

By embracing the notion of meitheal (cooperation) we will be able to see our language supported on a par with our European counterparts.  

Originally featured on RTÉ Brainstorm 
