What happens if you build a natural language model using modern Gaelic and try it out on late-19th/early-20th century texts?

The grammar is pretty much the same as it is today, but one thing that has changed is the spelling: slightly if you’re a human being, hugely if you’re a machine.

One wee word that has changed spelling is seo (this). Until recently it was pronounced the same but spelt so.

As an experiment I transcribed Seann Sgoil, the first chapter of William Watson’s Rosg Gàidhlig (‘Gaelic Prose’) from 1915, and ran it through a model trained on the UD (Universal Dependencies) version of ARCOSG, the Annotated Reference Corpus of Scottish Gaelic, to see how it performed.

There are about 2000 words in the whole chapter, and 12 of them are so. In every instance, it’s seo in the old spelling. Sometimes it’s pronominal (tha so freagarrach, ‘this is suitable’), sometimes adverbial (am Fear-teagaisg an so, ‘the teacher here’), but usually it’s a determiner (na daoine so, ‘these people’).

Ten times out of twelve the parser has decided it’s a conjunction, once it’s decided it’s a preposition, and the other time it’s decided it’s a foreign word. (This last one might be my fault, the result of inconsistent tagging in the training data.) All of these are wrong. The conjunctions are probably there because the English word so appears in otherwise Gaelic sentences in the conversational training data.
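A tally like the one above can be made by counting the UPOS column for a given word form in the parser's CoNLL-U output. What follows is only a minimal sketch: the function name `upos_counts` and the two-sentence sample are invented for illustration (the tags in it are hand-written, not real parser output), and it assumes the standard ten-column, tab-separated CoNLL-U layout.

```python
from collections import Counter


def upos_counts(conllu_text, form):
    """Count the UPOS tags assigned to every token whose FORM matches `form`."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        if cols[1].lower() == form:
            counts[cols[3]] += 1  # column 4 (index 3) is UPOS
    return counts


def row(idx, form, upos):
    """Build a token line with only ID, FORM and UPOS filled in."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", "_", "_", "_", "_"])


# Hand-made sample for illustration only; these tags are NOT real parser output.
sample = "\n".join([
    "# text = na daoine so",
    row(1, "na", "DET"),
    row(2, "daoine", "NOUN"),
    row(3, "so", "CCONJ"),
    "",
    "# text = tha so freagarrach",
    row(1, "tha", "VERB"),
    row(2, "so", "ADP"),
    row(3, "freagarrach", "ADJ"),
])

print(upos_counts(sample, "so"))
```

Run over the parsed chapter rather than the toy sample, the same function would give the breakdown of tags for so in one go.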

Dè nì mi? (What will I do?) I need training data that includes pre-GOC texts, that is, texts that predate the Gaelic Orthographic Conventions, with so tagged correctly and an indication somewhere that it’s really seo, at least for the human reader. This is not the only problem with old texts. More blog posts will follow.
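One way the “indication somewhere” might be recorded is in the MISC column of the CoNLL-U file, which holds pipe-separated key=value notes. The sketch below is a guess at how that could look, not a settled convention: the key name `ModernForm` and the function `annotate_modern_form` are hypothetical, chosen purely for illustration.

```python
def annotate_modern_form(conllu_text, old_form, modern_form, key="ModernForm"):
    """Add `key=modern_form` to the MISC column of tokens whose FORM is `old_form`.

    `key` is a hypothetical attribute name, not an established UD convention.
    """
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        is_token = len(cols) == 10 and not line.startswith("#")
        if is_token and cols[1].lower() == old_form:
            note = f"{key}={modern_form}"
            # MISC is column 10 (index 9); "_" means it is empty.
            cols[9] = note if cols[9] == "_" else f"{cols[9]}|{note}"
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


# Hand-made token line for illustration: determiner "so" with an empty MISC column.
token = "\t".join(["3", "so", "_", "DET", "_", "_", "_", "_", "_", "_"])
annotated = annotate_modern_form(token, "so", "seo")
print(annotated.split("\t")[9])
```

This would need to be applied only to the pre-GOC texts. Run over the conversational data, it would wrongly mark the English so as well.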
