Crawl and crowd to bring machine translation to under-resourced languages

  • Original Paper
Language Resources and Evaluation

Abstract

We present a widely applicable methodology to bring machine translation (MT) to under-resourced languages in a cost-effective and rapid manner. Our proposal relies on web crawling to automatically acquire parallel data to train statistical MT systems if any such data can be found for the language pair and domain of interest. If that is not the case, we resort to (1) crowdsourcing to translate small amounts of text (hundreds of sentences), which are then used to tune statistical MT models, and (2) web crawling of vast amounts of monolingual data (millions of sentences), which are then used to build language models for MT. We apply these to two respective use-cases for Croatian, an under-resourced language that has gained relevance since it recently attained official status in the European Union. The first use-case regards tourism, given the importance of this sector to Croatia’s economy, while the second has to do with tweets, due to the growing importance of social media. For tourism, we crawl parallel data from 20 web domains using two state-of-the-art crawlers and explore how to combine the crawled data with bigger amounts of general-domain data. Our domain-adapted system is evaluated on a set of three additional tourism web domains and it outperforms the baseline in terms of automatic metrics and/or vocabulary coverage. In the social media use-case, we deal with tweets from the 2014 edition of the soccer World Cup. We build domain-adapted systems by (1) translating small amounts of tweets to be used for tuning by means of crowdsourcing and (2) crawling vast amounts of monolingual tweets. These systems outperform the baseline (Microsoft Bing) by 7.94 BLEU points (5.11 TER) for Croatian-to-English and by 2.17 points (1.94 TER) for English-to-Croatian on a test set translated by means of crowdsourcing. A complementary manual analysis sheds further light on these results.

Notes

  1. According to a study (Rehm and Uszkoreit 2013) that analyses the state of language technology support for 30 European languages in four areas (machine translation, speech, text analytics and language resources), Croatian is given the lowest mark out of five (weak/none) for three of the areas and the second lowest mark (fragmentary) for the remaining area (language resources).

  2. http://www.theguardian.com/technology/2014/jul/15/twitter-world-cup-tweets-germany-brazil.

  3. http://abumatran.eu.

  4. http://nlp.ffzg.hr/resources/corpora/hrenwac/.

  5. http://commoncrawl.org.

  6. https://sites.google.com/site/amtworkshop2010/home.

  7. http://nlp.fi.muni.cz/trac/spiderling.

  8. http://nlp.ffzg.hr/resources/corpora/setimes/.

  9. http://opus.lingfil.uu.se.

  10. http://nlp.ffzg.hr/resources/corpora/ted-talks/.

  11. http://nlp.ffzg.hr/resources/corpora/hrenwac/.

  12. http://nlp.ilsp.gr/redmine/projects/ilsp-fc.

  13. http://code.google.com/p/language-detection/.

  14. http://sourceforge.net/projects/bitextor/.

  15. http://www.httrack.com/.

  16. http://tika.apache.org/.

  17. http://code.google.com/p/boilerpipe/.

  18. https://github.com/saffsd/langid.py.

  19. \(S(D_j,D_i)\) is also obtained, since score \(S(\cdot )\) is not symmetric.

  20. http://mokk.bme.hu/resources/hunalign/.

  21. For our task, paragraphs are blocks of text which may contain more than one sentence.

  22. http://www.nltk.org/.

  23. The English–Croatian bilingual lexicon available at http://sourceforge.net/projects/bitextor/files/bitextor/bitextor-4.0/dictionaries/ was used for sentence alignment with hunalign. In addition, this tool was run with the option bisent to ensure one-to-one sentence alignments.

  24. http://www.gala-global.org/oscarStandards/tmx/tmx14b.html.

  25. The comparison between the sentences was performed on lowercased text from which non-alphabetic characters (spaces, punctuation, and numbers) were removed.

  26. Fuzzy match scores measure the similarity between two strings by using the Levenshtein distance (Sikes 2007) to detect the elements (words in our case) matching between them.

  27. http://nlp.ffzg.hr/resources/corpora/hrenwac/.

  28. http://nlp.ffzg.hr/resources/corpora/setimes/.

  29. http://zeljko.agic.me/resources/.

  30. http://www.statmt.org/wmt14/translation-task.html.

  31. https://github.com/moses-smt/mosesdecoder/tree/RELEASE-2.1.1.

  32. http://www.statmt.org/wmt13/test.tgz.

  33. ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v13a.pl.

  34. http://www.umiacs.umd.edu/~snover/terp/.

  35. http://www.ark.cs.cmu.edu/MT/paired_bootstrap_v13a.tar.gz.

  36. While one might intuitively think that lower OOVs should correlate with better scores in terms of automatic MT evaluation metrics, this is not always the case as there are many other factors at play. MT evaluation metrics take into account word order, shifts, n-gram matching, etc. On top of these, a sizable portion of OOVs tend to be named entities, which in many cases are fine to be left untranslated, and if so whether the MT system covers them or not will not have any impact on the score produced by the MT metric.

  37. http://cngl.ie/brazilator/#/about.

  38. https://hub.microsofttranslator.com/.

  39. https://github.com/nljubesi/tweetcat.

  40. http://crowdflower.com/.

  41. http://hdl.handle.net/11356/1049.
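
Notes 25 and 26 describe two string comparisons: an exact comparison on lowercased text with non-alphabetic characters removed, and a fuzzy match score based on a word-level Levenshtein distance. The following is a minimal Python sketch of both, not the authors' implementation. The function names are hypothetical, and the normalisation of the distance into a score between 0 and 1 (one minus the distance divided by the length of the longer sentence) is an assumption, since the notes do not state the exact formula.

```python
def normalize(sentence):
    """Lowercase the text and keep alphabetic characters only (cf. note 25)."""
    return "".join(ch for ch in sentence.lower() if ch.isalpha())


def word_levenshtein(a, b):
    """Levenshtein distance between two word lists, with words as edit units."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(s1, s2):
    """Similarity in [0, 1]; the normalisation used here is an assumption."""
    a, b = s1.lower().split(), s2.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    return 1.0 - word_levenshtein(a, b) / longest
```

Under these assumptions, two six-word sentences that differ in a single word receive a score of 1 - 1/6, while `normalize` maps sentences that differ only in case, spacing, punctuation or digits to the same string.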

References

  • Achananuparp, P., Hu, X., & Shen, X. (2008). The evaluation of sentence similarity measures. In I. Y. Song, J. Eder & T. Nguyen (Eds.), Data warehousing and knowledge discovery (Vol. 5182, pp. 305–316). Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. doi:10.1007/978-3-540-85836-2_29.

  • Ambati, V., & Vogel, S. (2010). Can crowds build parallel corpora for machine translation systems? In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s mechanical turk (pp. 62–65). Los Angeles: Association for Computational Linguistics. http://www.aclweb.org/anthology/W10-0710.

  • Axelrod, A., He, X., & Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 conference on empirical methods in natural language processing (pp. 355–362). Edinburgh, Scotland, UK: Association for Computational Linguistics. http://www.aclweb.org/anthology/D11-1033.

  • Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The wacky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226. doi:10.1007/s10579-009-9081-4.

  • Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., et al. (2014). Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation (pp. 12–58). Association for Computational Linguistics, Baltimore, Maryland, USA. http://www.aclweb.org/anthology/W/W14/W14-3302.

  • Boleda, G., Bott, S., Meza, R., Castillo, C., Badia, T., & López, V. (2006). CUCWeb: A Catalan corpus built from the Web. In Proceedings of the 2nd international workshop on Web as Corpus. http://aclweb.org/anthology/W06-1704.

  • Callison-Burch, C. (2009). Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 286–295). EMNLP 2009, 6–7 August 2009, Singapore, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL. http://www.aclweb.org/anthology/D09-1030.

  • Chakrabarti, S. (2003). Mining the Web: Discovering knowledge from hypertext data. Massachusetts: Morgan Kaufmann.

  • Esplà-Gomis, M., Klubička, F., Ljubešić, N., Ortiz-Rojas, S., Papavassiliou, V., & Prokopidis, P. (2014). Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk & S. Piperidis (Eds.), Proceedings of the ninth international conference on language resources and evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland.

  • Esplà-Gomis, M., & Forcada, M. L. (2010). Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with bitextor. The Prague Bulletin of Mathematical Linguistics, 93, 77–86.

  • Fišer, D., Tavčar, A., & Erjavec, T. (2014). slowcrowd: A crowdsourcing tool for lexicographic tasks. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk & S. Piperidis (Eds.), Proceedings of the ninth international conference on language resources and evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland.

  • Gao, Q., & Vogel, S. (2008). Parallel implementations of word alignment tool. In Software engineering, testing, and quality assurance for natural language processing (pp. 49–57). Association for Computational Linguistics.

  • Graham, Y., Baldwin, T., Moffat, A., & Zobel, J. (2014). Is machine translation getting better over time? In Proceedings of the 14th conference of the European Chapter of the Association for Computational Linguistics (pp. 443–451), Gothenburg, Sweden. http://www.aclweb.org/anthology/E14-1047.

  • Hasler, E., Haddow, B., & Koehn, P. (2011). Margin infused relaxed algorithm for Moses. The Prague Bulletin of Mathematical Linguistics, 96, 69–78.

  • Hassan, H., & Menezes, A. (2013). Social text normalization using contextual graph random walks. In Proceedings of the 51st annual meeting of the association for computational linguistics (Vol. 1: Long Papers, pp. 1577–1586), Association for Computational Linguistics, Sofia, Bulgaria. http://www.aclweb.org/anthology/P13-1155.

  • Heafield, K. (2011). KenLM: Faster and smaller language model queries. In Proceedings of the sixth workshop on statistical machine translation (pp. 187–197). Association for Computational Linguistics.

  • Irvine, A., & Klementiev, A. (2010). Using mechanical turk to annotate lexicons for less commonly used languages. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s mechanical turk (pp. 108–113). Association for Computational Linguistics, Stroudsburg, PA, USA. http://www.aclweb.org/anthology/W10-0717.pdf.

  • Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3), 333–347.

  • Klubička, F., & Ljubešić, N. (2014). Using crowdsourcing in building a morphosyntactically annotated and lemmatized silver standard corpus of Croatian. In T. Erjavec & J. Ž. Gros (Eds.), Language technologies: Proceedings of the 17th International Multiconference Information Society IS2014. Ljubljana, Slovenia.

  • Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 conference on empirical methods in natural language processing (pp. 388–395). EMNLP 2004, A meeting of SIGDAT, a Special Interest Group of the ACL, held in conjunction with ACL 2004, 25–26 July 2004, Barcelona, Spain, ACL. http://www.aclweb.org/anthology/W04-3250.

  • Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions (pp. 177–180). http://dl.acm.org/citation.cfm?id=1557769.1557821.

  • Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate detection using shallow text features. In Proceedings of the third ACM international conference on Web search and data mining (pp. 441–450). New York, NY, USA.

  • Laranjeira, B., Moreira, V., Villavicencio, A., Ramisch, C., & José Finatto, M. (2014). Comparing the quality of focused crawlers and of the translation resources obtained from them. In Proceedings of the ninth international conference on language resources and evaluation (LREC-2014). European Language Resources Association (ELRA).

  • Ljubešić, N., & Erjavec, T. (2011). hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene. In Text, speech and dialogue—14th international conference, TSD 2011 (pp. 395–402). Pilsen, Czech Republic: Springer, Lecture Notes in Computer Science.

  • Ljubešić, N., & Klubička, F. (2014). bs, hr, srWaC—Web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9) (pp. 29–35). Gothenburg, Sweden: Association for Computational Linguistics.

  • Ljubešić, N., & Kranjčić, D. (2015). Discriminating between closely related languages on Twitter. Informatica, 39(1), 1–8.

  • Ljubešić, N., Fišer, D., & Erjavec, T. (2014). TweetCaT: A tool for building Twitter corpora of smaller languages. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk & S. Piperidis (Eds.), Proceedings of the ninth international conference on language resources and evaluation (LREC’14). European Language Resources Association (ELRA), Reykjavik, Iceland.

  • Ma, X., & Liberman, M. (1999). BITS: A method for bilingual text search over the web. In Machine Translation Summit VII (pp. 538–542), Singapore.

  • Munro, R. (2010). Crowdsourced translation for emergency response in Haiti: The global collaboration of local knowledge. In AMTA workshop on collaborative crowdsourcing for translation, Denver, Colorado.

  • Munteanu, S. D., & Marcu, D. (2006). Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics (pp. 81–88). Association for Computational Linguistics.

  • Nie, J. Y., Simard, M., Isabelle, P., & Durand, R. (1999). Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval (pp. 74–81), ACM, Berkeley, California, USA, SIGIR’99.

  • Papavassiliou, V., Prokopidis, P., & Thurmair, G. (2013). A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In Proceedings of the sixth workshop on building and using comparable corpora (pp. 43–51), Association for Computational Linguistics, Sofia, Bulgaria. http://www.aclweb.org/anthology/W13-2506.

  • Papavassiliou, V., Prokopidis, P., Esplà-Gomis, M., & Ortiz-Rojas, S. (2014). D3.2. corpora acquisition software. Public deliverable, The Abu-MaTran Project (PIAP- GA-2012-324414).

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 311–318). doi:10.3115/1073083.1073135.

  • Pecina, P., Toral, A., & van Genabith, J. (2012). Simple and effective parameter tuning for domain adaptation of statistical machine translation. In Proceedings of the 24th international conference on computational linguistics (Coling 2012), Coling 2012 Organizing Committee (pp. 2209–2224). Mumbai, India.

  • Rarrick, S., Quirk, C., & Lewis, W. (2011). MT detection in web-scraped parallel corpora. In Proceedings of MT Summit XIII, Asia-Pacific Association for Machine Translation. http://research.microsoft.com/pubs/153367/MT-Summit-Detection_Lewis_0819.pdf.

  • Rehm, G., & Uszkoreit, H. (2013). META-NET Strategic Research Agenda for Multilingual Europe 2020 Incorporated. Springer.

  • Resnik, P., & Smith, N. A. (2003). The Web as a parallel corpus. Computational Linguistics, 29(3), 349–380.

  • Resnik, P., Buzek, O., Kronrod, Y., Hu, C., Quinn, A. J., & Bederson, B. B. (2013). Using targeted paraphrasing and monolingual crowdsourcing to improve translation. ACM Trans Intell Syst Technol, 4(3), 38:1–38:21. doi:10.1145/2483669.2483671.

  • Rubino, R., Toral, A., Sánchez-Cartagena, V. M., Ferrández-Tordera, J., Ortiz Rojas, S., Ramírez-Sánchez, G., et al. (2014). Abu-MaTran at WMT 2014 translation task: Two-step data selection and RBMT-style synthetic rules. In Proceedings of the ninth workshop on statistical machine translation (pp. 171–177).

  • Rubino, R., Pirinen, T., Esplà-Gomis, M., Ljubešić, N., Ortiz Rojas, S., Papavassiliou, V., et al. (2015). Abu-MaTran at WMT 2015 translation task: Morphological segmentation and web crawling. In Proceedings of the tenth workshop on statistical machine translation, Association for Computational Linguistics, Lisbon, Portugal (pp. 184–191). http://aclweb.org/anthology/W15-3022.

  • Sennrich, R. (2012). Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th conference of the European chapter of the association for computational linguistics (pp. 539–549). http://dl.acm.org/citation.cfm?id=2380816.2380881.

  • Sikes, R. (2007). Fuzzy matching in theory and practice. MultiLingual, 18(6), 39–43.

  • Skadina, I., Vasiljevs, A., Skadins, R., Gaizauskas, R., & Tufis, D. (2010). Analysis and evaluation of comparable corpora for under resourced areas of machine translation. In Proceedings of the 3rd workshop on building and using comparable corpora. Applications of parallel and comparable corpora in natural language engineering and the humanities (pp. 6–14).

  • Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Weischedel, R. (2006). A study of translation error rate with targeted human annotation. In Proceedings of the association for machine translation in the Americas.

  • Snow, R., O’Connor, B., Jurafsky, D., & Ng, A.Y. (2008). Cheap and fast - but is it good? evaluating non-expert annotations for natural language tasks. In 2008 conference on empirical methods in natural language processing, EMNLP 2008, Proceedings of the conference, 25–27 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL (pp. 254–263). http://www.aclweb.org/anthology/D08-1027.

  • Stolcke, A., Zheng, J., Wang, W., & Abrash, V. (2011). SRILM at sixteen: Update and outlook. In Proceedings of IEEE automatic speech recognition and understanding workshop (p. 5).

  • Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In A. Kilgarriff & S. Sharoff (Eds.), Proceedings of the seventh Web as Corpus Workshop (WAC7), Lyon (pp. 39–43).

  • Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M., & Laurikkala, J. (2008). Focused web crawling in the acquisition of comparable corpora. Inf Retr, 11(5), 427–445. doi:10.1007/s10791-008-9058-8.

  • Tiedemann, J. (2009). News from OPUS: A collection of multilingual parallel corpora with tools and interfaces. Recent Advances in Natural Language Processing, 5, 237–248.

  • Toral, A., Rubino, R., Esplà-Gomis, M., Pirinen, T., Way, A., & Ramirez-Sanchez, G. (2014). Extrinsic evaluation of web-crawlers in machine translation: A case study on Croatian–English for the tourism domain. In Proceedings of the 17th Conference of the European Association for Machine Translation (EAMT) (pp. 221–224).

  • Toral, A., Wu, X., Pirinen, T., Qiu, Z., Bicici, E., & Du, J. (2015). Dublin City University at the TweetMT 2015 shared task. In Proceedings of TweetMT@SEPLN, La Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN).

  • Tyers, F. M., & Alperen, M. S. (2010). South-East European Times: A parallel corpus of Balkan languages. In Proceedings of the LREC workshop on exploitation of multilingual resources and tools for Central and (South-) Eastern European Languages (pp. 49–53).

  • Varga, D., Németh, L., Halácsy, P., Kornai, A., Trón, V., & Nagy, V. (2005). Parallel corpora for medium density languages. In Recent advances in natural language processing (pp. 590–596). Borovets, Bulgaria.

  • Wasala, A., Schäler, R., Buckley, J., Weerasinghe, R., & Exton, C. (2013). Building multilingual language resources in web localisation: A crowdsourcing approach. In I. Gurevych & J. Kim (Eds.), The people’s Web meets NLP, theory and applications of natural language processing (pp. 69–99). Berlin, Heidelberg: Springer. doi:10.1007/978-3-642-35085-6_3.

  • Zaidan, O. F., & Callison-Burch, C. (2011). Crowdsourcing translation: Professional quality from non-professionals. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (Vol. 1, pp. 1220–1229). Association for Computational Linguistics, Stroudsburg, PA, USA, HLT ’11. http://dl.acm.org/citation.cfm?id=2002472.2002626.

  • Zbib, R., Markiewicz, G., Matsoukas, S., Schwartz, R. M., & Makhoul, J. (2013). Systematic comparison of professional and crowdsourced reference translations for machine translation. In HLT-NAACL (pp. 612–616).

  • Zhechev, V. (2012). Machine translation infrastructure and post-editing performance at Autodesk. In AMTA 2012 workshop on post-editing technology and practice (WPTP 2012) (pp. 87–96), San Diego, USA.

Acknowledgments

This research is supported by the European Union Seventh Framework Programme FP7/2007–2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran) and by the ADAPT Centre for Digital Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded under the European Regional Development Fund.

Corresponding author

Correspondence to Antonio Toral.

Cite this article

Toral, A., Esplá-Gomis, M., Klubička, F. et al. Crawl and crowd to bring machine translation to under-resourced languages. Lang Resources & Evaluation 51, 1019–1051 (2017). https://doi.org/10.1007/s10579-016-9363-6

  • DOI: https://doi.org/10.1007/s10579-016-9363-6
