Document-Level MachinE TransLation EvAluation

Why Document-Level Evaluation?

One of the biggest challenges for Machine Translation (MT) is handling discourse dependencies and the wider context of a document. The purpose of DELA (Document-level MachinE TransLation EvAluation) is to revolutionize current practices in the machine translation evaluation field and to demonstrate that the time is ripe for switching to document-level assessments.

Increasing efforts have been made to add discourse to neural machine translation (NMT) systems. However, the results reported for those attempts are somewhat limited, as evaluation is still mostly performed at the sentence level using single references, which cannot capture the improvements of those systems.

By assessing translation with document-level evaluation, it is possible to assess suprasentential context as well as textual cohesion and coherence errors (such as mistranslation of ambiguous words and errors in gender and number agreement), which at times cannot be recognized at the sentence level.

Misevaluations Example

Example of the context span necessary for S1 from EN-PT (see Fig 1). When S1 is evaluated as a single sentence, both the MT and the human translation (HT) appear correct. Adding one preceding sentence (1+S1) or two preceding sentences (2+S1) does not solve the problem, since “it” is still not defined, so in those cases both MT and HT would still be judged as ‘correct’. It is only when 3 sentences are added before S1 that we can identify what “it” is (chair, which is feminine in Portuguese), and therefore evaluate the sentence properly. In 3+S1, the green highlights in S1 are correctly translated (blue) in HT with all the agreement, while the MT does not agree in gender and mistranslates the verb ‘to be’ and the term ‘backwards’ (red).
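The context spans described above (S1, 1+S1, 2+S1, 3+S1) can be built mechanically by prepending an increasing number of preceding sentences to the sentence under evaluation. The following is a minimal Python sketch of that idea, not the procedure used in the project: the function name `context_spans`, its parameters, and the toy English sentences are all assumptions made for illustration and are not the data shown in Fig 1.

```python
from typing import List


def context_spans(sentences: List[str], target: int, max_before: int) -> List[List[str]]:
    """Return the target sentence preceded by 0, 1, ..., max_before sentences.

    Hypothetical helper: spans stop growing at the start of the document.
    """
    if not 0 <= target < len(sentences):
        raise IndexError("target index out of range")
    spans = []
    # Never ask for more preceding sentences than the document contains.
    for k in range(min(max_before, target) + 1):
        spans.append(sentences[target - k:target + 1])
    return spans


# Toy document (invented for illustration): the referent of "It" in the
# last sentence only appears three sentences earlier.
doc = [
    "The chair arrived today.",
    "The box was damaged.",
    "The courier apologised.",
    "It was leaning backwards.",
]

for k, span in enumerate(context_spans(doc, target=3, max_before=3)):
    label = "S1" if k == 0 else f"{k}+S1"
    print(f"{label}: {' '.join(span)}")
```

Each printed span could then be shown to an evaluator in turn; in this toy case only the last span (3+S1) contains the noun that “It” refers to.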



The main objective of the DELA project is to test existing human and automatic sentence-level metrics at the document level and to define best practices for document-level machine translation evaluation. DELA will also gather translators’ requirements to design a translation evaluation tool that provides an environment for translators to assess MT quality at the document level with human evaluation metrics. In addition, the tool will offer scores from the automatic evaluation metrics identified in the project as best suited for document-level evaluation.

The key work packages to be conducted in DELA are:
  • WP1 – Testing context span for document-level evaluation
  • WP2 – Construction of context-aware challenge test sets
  • WP3 – New-generation document-level human evaluation metrics
  • WP4 – New-generation document-level automatic evaluation metrics
  • WP5 – Specifications for document-level evaluation tool


Timeline

[Figure: Timeline for the DELA project (Document-level MachinE TransLation EvAluation)]


Castilho, S., Cavalheiro Camargo, J. L., Menezes, M., and Way, A. (2021). DELA Corpus - A Document-Level Corpus Annotated with Context-Related Issues. In Proceedings of the Sixth Conference on Machine Translation, pages 571–582. Association for Computational Linguistics (ACL), November.

Castilho, S. (2021). Towards Document-Level Human MT Evaluation: On the Issues of Annotator Agreement, Effort and Misevaluation. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 34–45.

There are also some publications from pilot experiments derived from DELA:

Sheila Castilho, Maja Popovic, and Andy Way. 2020. On context span needed for machine translation evaluation. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 3735–3742.

Sheila Castilho. 2020. On the Same Page? Comparing Inter-Annotator Agreement in Sentence and Document Level Human Machine Translation Evaluation. In Proceedings of the Fifth Conference on Machine Translation (WMT20). EMNLP 2020. Online.


Sheila Castilho won Researcher of the Year in the ADAPT Recognition Awards 2021 for her impactful work on the IRC fellowship and in Machine Translation. Her work is consistently recognised by the MT community, as seen in her increasing citations and in her voice being sought at major conferences.







  • DELA stands for Document-Level MachinE TransLation EvAluation

    DELA is an Irish Research Council (IRC)-funded project, under the Postdoctoral Fellowship scheme, awarded to Dr. Sheila Castilho. It will run from the 1st of October 2020 to the 30th of September 2022. The project aims to find new ways to evaluate machine translation with context.