Measuring Machine Translation Quality in the Era of Neural

19 January 2021
Posted: 03/04/18

In a recent paper, Professor Andy Way, Deputy Director of the ADAPT Centre for Digital Content Technology, unpacked quality expectations for machine translation (MT).

Rather than presenting heavily technical research, Way discusses quality evaluation for MT and explains why it is an important issue to address as neural MT (NMT) continues to develop into a major industry-changer.

“Companies often overlook how disruptive a technology MT actually is: it impacts not just technically trained staff, but also project managers, sales and marketing, the training team, finance employees, and of course post-editors and quality reviewers,” Way said in his paper. “All of this should be taken on board beforehand if the correct decision is to be taken with full knowledge of the expected return on investment, but in practice it rarely is.”

For NMT, one of the major concerns is Bilingual Evaluation Understudy (BLEU), the long-standing automatic evaluation metric used in the majority of MT research.

BLEU’s Limitations

BLEU emerged as the de facto automatic evaluation metric because of its prevalence: the easiest way to show gains in MT research is to report the same score used in previous work.

When it comes to NMT, however, the improvements over predecessor MT, not to mention the differences in design (NMT systems are encoder-decoder models that often operate on subword or character-level units), make BLEU even less suited to quantifying output quality. Aside from the problem that BLEU compares MT output to a single human reference translation, Way illustrates its limitations more concretely through a sample reference translation and three sample MT outputs.

The reference translation is: “The President frequently makes his vacation in Crawford Texas.”

The MT outputs are:

  1. George Bush often takes a holiday in Crawford Texas
  2. holiday often Bush a takes George in Crawford Texas
  3. George rhododendron often takes a holiday in Crawford Texas

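Way notes that outputs 1, 2, and 3 would all get the same BLEU score, due to inherent limitations in how BLEU calculates scores: it counts word sequences shared with the reference and is blind to word order and meaning elsewhere in the sentence.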
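
As a rough illustration of why the three outputs are indistinguishable to BLEU, the minimal sketch below counts clipped n-gram matches against the single reference, which is the core quantity BLEU builds on. The tokenizer (lowercasing, dropping periods, splitting on whitespace) and the function names are assumptions made for this example rather than anything taken from Way's paper. The full metric also combines these precisions with a geometric mean and a brevity penalty, which this sketch leaves out.

```python
from collections import Counter
from fractions import Fraction


def tokenize(sentence):
    # Assumed tokenizer: lowercase, drop periods, split on whitespace.
    return sentence.lower().replace(".", "").split()


def ngrams(tokens, n):
    # All contiguous n-word sequences in the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    # Share of candidate n-grams that also occur in the reference,
    # with each n-gram credited at most as often as the reference has it.
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return Fraction(matched, total) if total else Fraction(0)


reference = tokenize(
    "The President frequently makes his vacation in Crawford Texas.")
outputs = [
    "George Bush often takes a holiday in Crawford Texas",
    "holiday often Bush a takes George in Crawford Texas",
    "George rhododendron often takes a holiday in Crawford Texas",
]

# One list of 1-gram to 4-gram precisions per MT output.
precisions = [
    [clipped_precision(tokenize(out), reference, n) for n in range(1, 5)]
    for out in outputs
]

for out, scores in zip(outputs, precisions):
    print(out, [str(p) for p in scores])
```

Under these assumptions, each output shares only "in Crawford Texas" with the reference, so all three yield the same precisions (1/3, 1/4, 1/7, and 0 for n = 1 to 4). Each output is also the same length as the reference, so the brevity penalty would not separate them either.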

He proposes that the best way to assess MT output is to consider two factors:

  1. Fitness for purpose of the translations
  2. Perishability of the content

In his own words: “how will the translation be used, and for how long will we need to consult that translation?”

Demand for NMT Quality Metrics

Way went on to explain in his paper that “n-gram-based metrics such as BLEU are insufficient to truly demonstrate the benefits of NMT over [phrase-based, statistical, and hybrid] MT.”

He explained that existing research on NMT’s gains over predecessor technology shows significant improvements in various areas, yet the corresponding overall BLEU gains amount to only around 2 points.

Additionally, on human-machine interaction, Way says that MT and translation memory (TM) fuzzy matching are already common tools in a human translator’s arsenal, so much so that this “compels MT developers to begin to output translations from their MT systems with an accompanying estimation of quality that makes sense to translators.”

In that regard, “while BLEU score is undoubtedly of use to MT developers, outputting a target sentence with a BLEU score of (say) 0.435 is pretty meaningless to a translator.”

Furthermore, this affects pricing and pay. “Translators are used to being paid different rates depending on the level of fuzzy match suggested by the TM system for each input string,” Way writes in his paper.
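
To make the contrast with a raw BLEU score concrete, here is a minimal sketch of the kind of fuzzy match percentage a translator might see. It assumes a word-level edit distance normalised by the longer segment. Commercial TM tools use their own formulas, and the function names and example sentences here are hypothetical.

```python
def edit_distance(a, b):
    # Word-level Levenshtein distance between two token lists,
    # computed row by row to keep memory use small.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(source, tm_source):
    # Assumed scoring: 100% for identical segments, falling as more
    # word edits are needed relative to the longer segment.
    a, b = source.lower().split(), tm_source.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


# Hypothetical new segment and a similar segment stored in the TM.
new_segment = "Press the red button to stop the machine"
tm_segment = "Press the green button to stop the machine"
print(f"{fuzzy_match(new_segment, tm_segment):.1f}% match")
```

One changed word out of eight gives an 87.5% match, a figure a translator can relate directly to expected editing effort in a way that a sentence-level BLEU score does not allow.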

Looking ahead, Way writes: “If NMT does become the new state-of-the-art as the field expects, one can anticipate that further new evaluation metrics tuned more precisely to this paradigm will appear sooner rather than later.”


Originally featured on Slator.