On May 15th, we released the latest version (v2.6) of the Irish Universal Dependency treebank. This was an important milestone for Irish language technology, yet the date passed, as with previous releases, relatively uncelebrated and with little awareness among the general Irish-speaking public. Why is this? Well, the value of treebanks is hard to quantify or demonstrate at the best of times. Traditionally, their value was recognised mainly by linguists who wished to test linguistic theories or analyse how words interact with each other in a given language. However, over the past 15 years or so of the Artificial Intelligence revolution, treebanks have played a central role in helping computers to learn how to understand human language. As a result, the group of people interested in getting their hands on them has broadened to include computational linguists and software engineers. In short, treebanks now play a more significant role in digitally supporting languages.
First though, what exactly is a treebank? It’s a collection of written text that has been marked up (annotated, labelled) with information that tells you something about the linguistic structure of each sentence. If we’re talking about dependency treebanks, then that annotation usually tells you about the grammatical role that each word plays in the sentence. For example, take ‘the shark attacked the surfer’: in this sentence ‘shark’ is the ‘subject’ as it is the one doing the attacking, ‘surfer’ is the ‘object’ being attacked, and so on. Words are therefore connected to each other in specific ways that help us decipher the intended meaning.
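To make this concrete, here is a minimal sketch of how the shark sentence might be recorded in the ten-column CoNLL-U layout used by Universal Dependencies, and how the subject and object links can be read back out. The rows are hand-written for illustration and are not taken from the Irish treebank; the helper names `parse_conllu` and `describe` are hypothetical, and the reader is deliberately simplified (it ignores multiword-token ranges and empty nodes, which real treebank files can contain).

```python
# Hand-written illustration in the CoNLL-U column layout.
# This is NOT data from the Irish treebank.
ROWS = [
    # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    ("1", "The", "the", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "shark", "shark", "NOUN", "_", "_", "3", "nsubj", "_", "_"),
    ("3", "attacked", "attack", "VERB", "_", "_", "0", "root", "_", "_"),
    ("4", "the", "the", "DET", "_", "_", "5", "det", "_", "_"),
    ("5", "surfer", "surfer", "NOUN", "_", "_", "3", "obj", "_", "_"),
]
# CoNLL-U separates columns with tabs and tokens with newlines.
CONLLU = "\n".join("\t".join(row) for row in ROWS)


def parse_conllu(text):
    """Read one sentence into a list of token dicts (simplified reader)."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines and sentence-level comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tokens.append({
            "id": int(cols[0]),    # position of the word in the sentence
            "form": cols[1],       # the word as written
            "head": int(cols[6]),  # id of the word it depends on (0 = root)
            "deprel": cols[7],     # grammatical role, e.g. nsubj or obj
        })
    return tokens


def describe(tokens):
    """Return (word, relation, head word) triples for a parsed sentence."""
    forms = {t["id"]: t["form"] for t in tokens}
    forms[0] = "ROOT"  # artificial head of the main verb
    return [(t["form"], t["deprel"], forms[t["head"]]) for t in tokens]


tokens = parse_conllu(CONLLU)
relations = describe(tokens)
for form, deprel, head in relations:
    print(f"{form:10s} --{deprel}--> {head}")
```

Each row says which word a token attaches to and in what role: ‘shark’ is linked to ‘attacked’ as `nsubj` (subject) and ‘surfer’ is linked to the same verb as `obj` (object).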
For some languages, such as English, the order in which words appear together is very important. The fact that ‘the shark’ appears before the verb helps us to know that it’s the subject. If the order were switched, a very different scene would be painted! However, word order does not have the same effect in all languages. In many languages with freer word order, words change shape by adding affixes (prefixes, suffixes, infixes) in a way that tells you what role they are playing in the sentence.
When we learn a language as children, many of these grammar rules become implicit in our understanding, through extensive exposure to the language, repetition or correction from others. Some other rules are formally taught at school, but once we reach proficiency we’re not aware of our language processing skills. Our highly tuned cognitive processes often only come to light when we mishear something and need to ask for clarification, or when we’re speaking to a learner of our language and their mistakes cause confusion, and so on. Of course, learning a second language as adults makes us fully aware of grammar rules, especially when we forget how to follow them!
A treebank makes this ‘syntactic’ information about the structure and rules of a language explicit. When the treebank is given to a computer as ‘training’ data, specific algorithms help the system to ‘learn’ the structure of the language. The resulting system is called a parser. Given sufficient data, a parser can analyse new sentences accurately enough to feed other applications such as chatbots, summarisation tools, language learning apps, grammar checkers, question-answering systems, sentiment analysis, text mining, automated translation, etc.
Our treebank includes 2,924 trees (annotated Irish sentences). This might seem relatively small, given that a well-resourced language like Czech has over 85,000 trees in one treebank. However, this low number is simply a reflection of the lack of funding in this area, the lack of understanding of the need for an Irish treebank and the lack of skilled linguists who can contribute. Since the previous release, it took two full-time annotators six months of labelling, discussion and review to produce 1,161 new trees.
The Universal Dependency treebank collection contains data for 92 languages. Having an Irish treebank in this collection means that while Ireland makes slow progress in skilling up a new generation of Irish computational linguists, researchers and developers from across the world can contribute to improving parsing for Irish.
The current round of annotations has been funded by the Department of Culture, Heritage and the Gaeltacht through the GaelTech project at the ADAPT Centre, Dublin City University. As is always encouraged in the field of Irish Language Technology, the data is open-source and available for use under a Creative Commons license: CC BY-SA 3.0.
We have also made annotation documentation available to anyone who might want to work with the data, or become a linguistic annotator in the future.
-Dr. Teresa Lynn, Research Fellow
Annotators: Jason Phelan and Sarah McGuinness
Technical Support: Abigail Walsh