Last weekend I was in Dublin for the first Celtic Language Technology Workshop, which was part of COLING 2014. I am still digesting it all and have yet to follow everything up, but here’s a very brief summary.
Elaine Uí Dhonnchadha (DCU) started with an overview of Irish language technology and a plea for open resources. William Lamb and Sammy Danso (both Edinburgh) gave a two-hander about POS-tagging a Scottish Gaelic corpus of nearly 90 000 tokens. The corpus will be coming out later this year. Having taken part in annotation tasks in the past, I was taken aback that they reported a kappa coefficient of 0.98 for two annotators on this task. (A kappa coefficient, for those who don’t know, is a modified agreement score that takes into account chance agreement. I say “a” kappa coefficient because there is more than one way of working them out. If your kappa is zero, that means your agreement is no better than chance, even though your percentage agreement might be something like 75%. Perfect agreement leads to a kappa of 1. Kappas of 0.7 or greater are pretty good, in general.) Monica Ward (also DCU) talked about building resources for the teaching of Irish: a past-tense teaching game for primary school children and a grammar checker for adult learners built around Kevin Scannell’s Gramadóir. Lastly, before the coffee break, Teresa Lynn (DCU) talked about cross-lingual transfer dependency parsing of Irish. On the face of it, this sounds like something that shouldn’t work at all. You take a model that has been trained on a completely different language, steam off the lexical information (the actual word forms), remove the dependency labels, and keep only the part-of-speech tags and unlabelled dependencies. Surprisingly, it kind of works. Even more surprisingly, the best results came from a model trained on Indonesian, which doesn’t seem to be VSO.
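For anyone who wants to see the kappa arithmetic spelled out, here is a minimal sketch of one common variant, Cohen’s kappa for two annotators. I don’t know which variant was used for the Gaelic corpus, so treat this purely as an illustration; the function name and the toy tag lists are made up for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labels independently according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label throughout;
        # kappa is undefined here, so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (invented): annotator A tags everything as a noun,
# annotator B tags six of the eight tokens as nouns.
annotator_a = ["N"] * 8
annotator_b = ["N"] * 6 + ["V"] * 2

print(cohen_kappa(annotator_a, annotator_b))
```

In this toy case the two annotators agree on 75% of the tokens, but because annotator A only ever says “noun”, that is exactly the agreement you would expect by chance, and the kappa comes out at zero.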
At coffee, quite by chance, I was surprised and delighted to meet Mark Steedman (Edinburgh), who invented CCG.
After coffee, Thierry Poibeau (CNRS and DTAL, Cambridge) talked about mutations in Breton. Discussion ensued about whether the anomalous behaviour of, how do I put this, “professional nouns” like kiger (butcher) was semantic or not. I suspect it’s more like word classes, like animacy in the Slavic languages or the Australian language that famously has the class of women, fire and dangerous things. Kevin Scannell (Saint Louis University), whom I am astonished not to have come across before, talked about machine translation from Scottish Gaelic to Irish and the various spelling reforms in Irish that have made it look rather less like Scottish Gaelic than it used to. Finally, Caoimhín Ó Donnaíle (Sabhal Mòr Ostaig) talked about the Multidict site, which has videos for language learners and a splendid interactive wrapper for websites that links to external dictionaries. The idea is that you click on a word in one frame and the dictionary entry appears in another. When I was a first-year undergraduate in 1994, Caoimhín’s websites were some of the first I remember ever seeing, probably with the Mosaic browser.
After lunch, an invited talk from Kevin Scannell, largely about Manx, but also about building NLP resources from social media and web crawling. Much of the Manx data is revived Manx, but there’s also text from Skeealyn Vannin, which was compiled by the Irish Folklore Commission in the late 1940s from native speakers. Michal Boleslav Měchura (DCU) talked about breis.focloir.ie, an online Irish grammar database. The last of the full papers was given by Sarah Cooper (Bangor), describing an app, Paldaruo, for crowdsourcing speech recognition training data. The microphones on tablets are good enough these days to record speech from ordinary members of the public. In the first instance, the recognition is being used to control a wee robot arm connected to a Raspberry Pi.
Then the “poster boaster” session, featuring me, Michal again on onomastics, Francis Tyers (Tromsø) on within-tweet language detection, and Delyth Prys (Bangor) talking about the DECHE Corpus of Welsh Scholarly Writing. Then pub.
An enormous amount to take in, and lots to follow up on. I do hope this’ll be the first of many.