Post-editing the output of a statistical machine translation (SMT) system to obtain high-quality translation has become an increasingly common application of SMT, which we henceforth refer to as post-editing-based SMT (PE-SMT). PE-SMT is often deployed as an incrementally retrained system that incorporates knowledge from human post-edits as early as possible, augmenting the SMT models in order to reduce post-editing (PE) time. In this scenario, the order in which input segments are presented plays an important role in reducing the overall PE time. Within an active learning (AL) framework, this paper provides an empirical study of several typical segment prioritization methods, namely cross-entropy difference (CED), n-grams, perplexity (PPL), and translation confidence, and verifies their performance on different data sets and language pairs. Experiments in a simulated setting show that translation confidence performs best, with average absolute decreases of 1.72-4.55 TER points compared to a sequentially post-edited, incrementally retrained SMT system.
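To make the confidence-based prioritization concrete, the following is a minimal sketch (not the paper's actual implementation) of how segments might be ordered for post-editing by translation confidence. It assumes the decoder exposes per-token log-probabilities for each translation hypothesis; the length-normalised geometric mean of token probabilities is one common confidence proxy, and the least-confident segments are presented to the post-editor first so their corrections can be fed back into retraining earliest.

```python
import math

def sentence_confidence(token_logprobs):
    """Length-normalised confidence: geometric mean of token probabilities."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

def prioritize(segments):
    """Order segments for post-editing, least-confident translations first."""
    return sorted(segments, key=lambda s: sentence_confidence(s["logprobs"]))

# Hypothetical segments with decoder token log-probabilities.
segments = [
    {"id": 1, "logprobs": [-0.1, -0.2, -0.1]},  # fairly confident translation
    {"id": 2, "logprobs": [-1.5, -2.0, -1.2]},  # low-confidence translation
    {"id": 3, "logprobs": [-0.5, -0.7, -0.4]},
]
order = [s["id"] for s in prioritize(segments)]
# The low-confidence segment (id 2) is scheduled for post-editing first.
```

The field names (`id`, `logprobs`) and the choice of confidence estimator are illustrative assumptions; real confidence estimation for SMT typically draws on richer decoder features such as word posterior probabilities from n-best lists.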
This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Science Foundation Ireland through the ADAPT Centre (Grant 13/RC/2106) (www.adaptcentre.ie) at Dublin City University and Trinity College Dublin; European Commission Grant 610879 for the Falcon project
ID Code: 23216
Deposited On: 01 May 2019 15:31 by Thomas Murtagh. Last Modified: 20 May 2021 13:59