Over the past five years, the machine translation (MT) community has become aware of the potential of neural machine translation (NMT) to sustain the increases in output quality that had appeared to plateau under statistical MT (SMT) (Kenny 2018). This has led an increasing number of MT providers and research groups to focus their energies and resources on developing NMT systems.

Early studies on NMT quality demonstrated that, in general, this MT paradigm yields higher automatic evaluation metric scores than its predecessor, SMT (Bahdanau et al. 2014; Jean et al. 2015; Bojar et al. 2016; Koehn and Knowles 2017). NMT has also been shown to provide a jump in fluency when compared with SMT (Bentivogli et al. 2016; Toral and Sánchez-Cartagena 2017). This increased fluency has quickly made NMT the preferred MT paradigm for assimilation, as is evident from the move to NMT by many major online MT providers. Where MT for dissemination is concerned, that is, when text is “machine translated as an intermediate step in production” (Forcada 2010), we might reasonably assume that the reported increase in quality would result in a concomitant productivity boost. However, studies such as Castilho et al. (2018) reported that, compared with phrase-based SMT (PBSMT) systems, NMT delivers only minor gains in productivity and reductions in technical effort, relative to the improvements seen in automatic metric scores and human fluency evaluations.

The rule of thumb for MT deployment suggested by Way (2018) is that “the degree of human involvement required—or warranted—in a particular translation scenario will depend on the purpose, value and shelf-life of the content.” However, positive evaluations of NMT for assimilation, alongside occasionally hyperbolic reports in the media (as reported in Castilho et al. 2017; Toral et al. 2018), have pushed raw and post-edited MT into use-cases for which MT would previously have been considered inappropriate (Schmidtke 2016; Guerberof 2018). The rise of NMT as the state of the art has been accompanied by growing awareness in the community of the need to improve methodologies and procedures for translation quality assessment on an ongoing basis, with a view to overcoming the limitations of both automatic metrics and human approaches, limiting the overhyping of NMT, and explaining the somewhat paradoxical results for NMT for dissemination (Läubli et al. 2018; Moorkens et al. 2018).

The current special issue attempts to address the latter point. Due to the novelty of NMT, little is known as yet about how humans—especially translation professionals, translation students, and end-users—engage with NMT output. Will the same types of errors that occur in SMT and rule-based MT (RBMT) systems recur in NMT output? Will translators take longer or become faster when post-editing (or otherwise processing) NMT output? Is cognitive effort higher or lower when processing NMT output? What is the end-user experience with NMT systems like? How does post-editing (PE) NMT output compare with using translation memories (TMs) and adapting their fuzzy matches? This special issue aims to address these and similar questions around human factors in NMT by bringing together a collection of novel articles offering state-of-the-art research on a wide range of topics related to translation quality, including PE, error analysis, and the application of controlled languages in pre-processing. The articles adopt multiple complementary perspectives to tackle the issues at hand and cover a variety of language pairs and domains, showing the wide applicability of NMT to real-life tasks.

While most of the papers in this special issue focus on specific aspects of PE, it also includes contributions that consider more closely the role of interactive MT, error analysis and controlled language in the human factors of NMT.

1 Post-editing

PE effort (temporal, technical and cognitive, as per Krings 2001) with NMT output is usually reported in comparison with other translation approaches, i.e. human translation (HT) with or without TM matches, or PE of other MT systems. While NMT PE differs markedly from HT on the cognitive, temporal and technical levels, when compared with PE of SMT output or editing of TM matches, research does not yet indicate that it is significantly faster in all scenarios. In this special issue, several articles investigate the differences between translating with the aid of NMT and other translation approaches.
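For readers less familiar with how these effort dimensions are typically operationalised, the following is a minimal, purely illustrative Python sketch (not taken from any article in this issue; the segment texts, timings and variable names are hypothetical) of simple per-segment proxies for temporal and technical effort: seconds per source word, and normalised character-level edit distance between the raw MT output and its post-edited version.

```python
# Illustrative sketch only: per-segment proxies for temporal and technical
# PE effort (cf. Krings 2001). Inputs and names are hypothetical.

def char_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def effort_proxies(source: str, mt_output: str, post_edited: str,
                   editing_seconds: float) -> dict:
    """Temporal effort as seconds per source word; technical effort as
    normalised character-level edit distance between MT and its post-edit."""
    src_words = max(len(source.split()), 1)
    norm = max(len(mt_output), len(post_edited), 1)
    return {
        "temporal_s_per_word": editing_seconds / src_words,
        "technical_norm_edits": char_edit_distance(mt_output, post_edited) / norm,
    }

# Example with made-up values: one short segment, 12 seconds of editing.
print(effort_proxies("Das ist ein kurzer Test .",
                     "This is an short test .",
                     "This is a short test .",
                     editing_seconds=12.0))
```

Individual studies differ in how they segment, normalise and threshold such measures, which is one reason why results across the articles summarised below are not directly comparable.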

Jia et al. compare fluency, accuracy and PE effort of Google’s PBSMT and NMT engines for English-to-Chinese translation of two news texts. Their findings suggest that post-editing NMT reduces temporal, technical, and cognitive effort for this language pair and text type. Interestingly, they also find a strong correlation between recently and independently proposed pause-based metrics of cognitive effort, and that translation from scratch is more prone to variation in speed depending on source-text complexity.

Sánchez-Gijón et al. investigate the differences between PE of a generic NMT system and translation using TM matches in English-to-Spanish technical translation, in terms of edit time and edit distance, as well as translators’ perceptions of NMT for productivity, considering in particular how these dimensions vary with segment length. Their findings show that while NMT PE requires less editing than TM segments, it takes longer on average. The authors note that translators who perceived MT as boosting their productivity actually performed better when post-editing MT segments than translators who perceived MT to be a poor resource.

Koponen et al. combine a product-based and a process-based approach to verify whether different editing patterns exist when post-editing NMT, SMT and RBMT output. They find that whereas NMT involves the greatest number of word-form changes and word-substitution edits, RBMT shows more deletion edits, and SMT more insertions. The effort indicators show a slight increase in keystrokes per word for NMT output, and a slight decrease in average pause length for NMT compared with the other systems. The authors argue that studies of PE quality and effort should identify preferential edits, participant errors, and individual differences in process metrics.
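Indicators of this kind are typically derived from keystroke logs. As a purely illustrative sketch (the log format, pause threshold and names below are assumptions, not those of the tools used in the studies in this issue), the following computes keystrokes per source word and mean pause length from a list of timestamped key events.

```python
# Illustrative sketch only: process-based PE effort indicators from a
# hypothetical keystroke log of (timestamp_in_seconds, key) events.

from typing import List, Tuple

def process_indicators(keylog: List[Tuple[float, str]], source: str,
                       pause_threshold: float = 1.0) -> dict:
    """Keystrokes per source word and mean pause length; gaps between
    consecutive keystrokes of at least pause_threshold seconds count as pauses."""
    src_words = max(len(source.split()), 1)
    gaps = [t2 - t1 for (t1, _), (t2, _) in zip(keylog, keylog[1:])]
    pauses = [g for g in gaps if g >= pause_threshold]
    return {
        "keystrokes_per_word": len(keylog) / src_words,
        "mean_pause_length_s": sum(pauses) / len(pauses) if pauses else 0.0,
        "pause_count": len(pauses),
    }

# Example with a made-up seven-event log for a two-word source segment.
log = [(0.0, "T"), (0.2, "h"), (0.4, "i"), (0.6, "s"),
       (3.1, " "), (3.3, "i"), (3.5, "s")]
print(process_indicators(log, "Das ist"))
```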

Herbig et al. explore how multiple modalities for measuring cognitive load, including eye-, skin- and heart-based indicators, might be combined to predict the level of perceived cognitive load during NMT PE. Their results show that PE time strongly correlates with perceived cognitive load and, moreover, that a combined multimodal approach is able to estimate cognitive load during PE without interrupting the process to collect manual ratings.

2 Interactive MT

Interactive and adaptive MT is one possible alternative to PE for employing MT for dissemination; Green (2016) called PE a “broken usability model” wherein MT suggestions “prime translators” (Green et al. 2013). Daems and Macken compare interactive adaptive SMT and NMT with respect to quality, translation process, perceived usability, and translators’ attitude towards an interactive translation tool. The authors find that even though SMT suggestions contain more errors than NMT suggestions, neither translation time nor effort is significantly affected by the difference in quality. They argue that the differences found may be due to individual differences between translators, and that, while fewer errors were found in NMT output, these “could be harder to detect and to solve”. Despite this, users prefer to work with NMT output. Improved usability, even without increased productivity, may still make a move from interactive SMT to interactive NMT worthwhile.

Knowles et al. also explore interactive NMT, but there are two important differences between the two articles: first, Knowles et al. compare interactive NMT with PE of NMT output, whereas Daems and Macken compare it against interactive SMT; in addition, the computer-assisted translation (CAT) tool employed by Knowles et al. is a research system (CASMACAT), while that used by Daems and Macken is a commercial offering (Lilt). Specifically, Knowles et al. investigate whether translators’ productivity increases in a setting that makes use of interactive translation prediction (ITP) with an NMT system. They find that over half of the eight participating translators are faster when using neural ITP, and that most of them prefer it over PE. The authors thus argue that ITP is a viable alternative to PE.

3 Error analysis

Error analysis of NMT systems has also been on the radar of the MT field. Several papers have carried out automatic (Bentivogli et al. 2016; Toral and Sánchez-Cartagena 2017) or human error annotation (Burchardt et al. 2017; Klubička et al. 2017; Popović 2017; Castilho et al. 2018) in order to compare phrase-based and neural approaches for different language pairs and domains. In this issue, Calixto and Liu present an extensive error analysis of several MT systems, including two text-only systems that fall into the PBSMT and NMT paradigms, and a set of multi-modal NMT models which use not only text but also visual information extracted from images. The error taxonomy is based on that of Vilar et al. (2006), with a few adjustments. Their goal is to verify whether the multi-modal engine makes fewer errors than the other systems when translating Flickr image descriptions. Their findings suggest that adding global and local visual features into NMT significantly improves the output and, moreover, that the mistranslation and wrong-sense error types—which are arguably the most damaging for the translation of image descriptions—were drastically reduced in the multi-modal systems. Finally, they find that the multi-modal systems improved the translation not only of terms with a strong visual connotation, but also of terms without a visual interpretation.

4 Controlled language

Controlled languages (CLs) for MT have been widely investigated for SMT and RBMT systems (O’Brien 2006; Aikawa et al. 2007; Temnikova and Orasan 2009; Temnikova 2012). However, the effect of CL on NMT has, to the best of our knowledge, not yet been investigated. In this issue, Marzouk and Hansen-Schirra examine the impact of CL rules on the output quality of NMT for the German-to-English language pair, compared with that of four other MT systems that fall under the RBMT, SMT and hybrid paradigms. Their findings suggest that CL does not have a positive impact on Google’s NMT system (GNMT): its output contained the fewest errors and reached the highest quality levels both before and after CL application, but applying some CL rules led to a marginal increase in the number of errors and a decrease in quality.

In sum, the findings of the articles collected in this special issue demonstrate that there is still a large amount of research to be done on human factors for NMT systems. As in many research areas and applications that involve professional translators, the experiments with PE, especially with the recent NMT paradigm, have limitations such as small sample sizes, time constraints, and threats to ecological validity (e.g. the tools used in the research may not be the same as those used by translators in production). Further efforts are therefore required to generalize the results that this special issue brings to the community, so that the evidence provided by research filters through to practising translators and to translator training programmes that need to keep abreast of technological progress. This does not mean, however, that the current results are not to be trusted; rather, it reinforces the need for further investigation with larger sample sizes, more professional translators, larger groups of translation students, end-users, different levels of experience (e.g. in PE), and further language pairs and application domains.

The articles herein are presented in the context that it is still early days in the development of NMT. From the outset, the development of MT was proposed as an interdisciplinary pursuit. Weaver’s choice of Norbert Wiener, a proponent of interdisciplinary research, as interlocutor in 1947 suggests that he foresaw MT development as requiring a broad combination of skills. Linguists were deeply involved in RBMT development and, much later, in the ecosystem of pre- and post-processing tools that eventually grew around SMT. The early development of NMT has not involved a great deal of linguistic input, perhaps due to the complex nature of the systems and the high barriers to entry (in cost and expertise). In that short time, there have been changes to architectures (Vaswani et al. 2017) and training data (Sennrich et al. 2016) that have been motivated by an engineering rather than a linguistic focus. Integrating input from non-engineers, which may be vaguely defined, will be difficult, but our hope is that the articles in this special issue will provide feedback for interesting avenues of future development while also showcasing contemporary research in the area of NMT and human factors.

As co-editors, we hope that this publication will help to instigate and inspire further work that expands our knowledge and understanding of the phenomena involved in NMT for dissemination. At the same time, given the obvious applicability of these studies to real-world scenarios, this special issue also aims to be relevant to interested professional translators, post-editors, project managers in language service providers, translation students, trainers and scholars, with a view to promoting the wider uptake of translation technologies informed by research-based good practice. This inclusive approach reflects the combined interests of the co-editors of the special issue, who are all, to different extents, not only involved in MT, PE and human factors research, but also actively engaged in translator training, e.g. as part of academic programmes, industry-facing initiatives, and lifelong professional development activities. In a similar vein, we see this special issue as a timely and forward-looking attempt to bring academic research, teaching and professional practice closer together, to the mutual benefit of these neighbouring communities.