s(e)o

What happens if you build a natural language model using modern Gaelic and try it out on late-19th/early-20th century texts?

The grammar is pretty much the same as it is today, but one thing that has changed, slightly if you're a human being and hugely if you're a machine, is the spelling.

One wee word that has changed spelling is seo (this). Until recently it was pronounced the same but spelt so.

As an experiment I transcribed Seann Sgoil, the first chapter from William Watson’s Rosg Gàidhlig (‘Gaelic Prose’) from 1915 and ran it through a model trained on the UD version of the ARCOSG to see how it performed.

There are about 2000 words in the whole chapter, and 12 of them are so. Every one of them is the old spelling of seo. Sometimes it's pronominal, tha so freagarrach 'this is suitable', sometimes adverbial, am Fear-teagaisg an so 'the teacher here', but usually it's a determiner, na daoine so 'these people'.

Ten times out of twelve the parser has decided it's a conjunction, once that it's a preposition, and the other time that it's a foreign word. (This last one might be down to my own inconsistent tagging in the training data.) All of these are wrong. The conjunction readings are probably because the English word so appears in otherwise Gaelic sentences in the conversational training data.

Dè nì mi? (What do I do?) I need training data that includes pre-GOC (Gaelic Orthographic Conventions) texts, with so tagged correctly and an indication somewhere, at least for the human reader, that it's really seo. This is not the only problem with old texts. More blog posts will follow.
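One way of recording that would be to keep so as the surface form but put the modern spelling in the lemma column and a note in the MISC column, so that the so in na daoine so would get a line something like the one below. This is only a sketch of the idea, with the ModernForm attribute invented for illustration; it is not a decision about how the treebank should actually do it.

3	so	seo	DET	_	_	2	det	_	ModernForm=seo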

The tidyverse and Universal Dependencies

If you use R you may be familiar with the tidyverse family of packages.

I particularly like them because they make it easy to write compact yet readable code with not a single for loop or if statement in sight.

As a worked example, here is some recent code I have written to evaluate the Gaelic Universal Dependencies corpus. My suspicion is that as I make the training set bigger, and the annotation of the development and test sets more consistent with more automated checks, it will become easier for a parser to parse the texts correctly. (I could in fact use the udpipe_accuracy function for this, but what follows is a simple example for demonstration purposes.)

Firstly, some dependencies:

library(tidyverse)
library(lubridate)
library(udpipe)

lubridate is a horribly-named package that makes handling dates and times easier. udpipe is an R wrapper for the UDPipe library, which is written in C++.

today <- today()   # from lubridate; used to date-stamp the model file
udpipe_train(file = str_glue("{today}.udpipe"),
             files_conllu_training = "../ud/gd_arcosg-ud-train.conllu",
             files_conllu_holdout = "../ud/gd_arcosg-ud-dev.conllu")

This trains a model on the training set, using the dev set to work out when to stop training, and writes it to a file named after today's date.

ud_gd <- udpipe_load_model(str_glue("{today}.udpipe"))
gold <- udpipe_read_conllu('../ud/gd_arcosg-ud-test.conllu')
text <- gold %>%
  select(sentence, sentence_id) %>%
  unique() %>%
  rename(old_sentence_id = sentence_id)

This loads in the model we generated earlier, reads the test set into a data frame called gold, and then generates another data frame consisting only of the original text and the sentence IDs. It then renames the sentence ID column to prevent udpipe clobbering the sentence IDs from the source file with its own sequential IDs. While a CoNLL-U file is stanza-based, a mixture of tab-separated lines and comments that actually contain data, a tidy data frame has to be consistent all the way through: every word in the original file is an observation and has its own row with all of its properties, including the sentence it belongs to. This is why we need the unique() call, to collapse those rows down to one per sentence.
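To see why, here is a tiny made-up example of the word-per-row shape and what unique() does to it: the first two rows come from the same sentence, so they collapse into one.

tibble(sentence = c("Tha seo math.", "Tha seo math.", "Dè nì mi?"),
       sentence_id = c(1, 1, 2)) %>%
  unique()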

Also note the pipe operator %>%. It is inspired by the pipe operator in Unix: whatever is piped in (on the left) becomes the first argument of the function on the right, so dì_dhùblaichte <- rud_sam_bith %>% unique() is the same as dì_dhùblaichte <- unique(rud_sam_bith). Put like that it seems trivial, but it's really useful for long chains of these.
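For example, a longer chain reads top to bottom as a list of steps. This one is made up, but it runs against the text data frame from above:

text %>%
  filter(str_detect(sentence, "seo")) %>%             # keep sentences containing "seo"
  mutate(n_tokens = str_count(sentence, "\\S+")) %>%  # rough token count
  arrange(desc(n_tokens)) %>%                         # longest first
  head(5)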

result <- text %>% pmap_dfr(function(sentence, old_sentence_id)
  udpipe_annotate(ud_gd, sentence) %>% as.data.frame() %>% mutate(sentence_id = old_sentence_id))

There is quite a lot going on here, but this is the heart of the code. It pipes the text data frame into pmap_dfr, which applies a function to every row of the frame; here that is an anonymous function in which udpipe_annotate takes the ud_gd model, annotates the sentence, and converts the result into a data frame. pmap_dfr then binds all of these data frames together into one big one. The mutate at the end overwrites udpipe's own sequential sentence IDs with the original ones, so that the result can be lined up against the gold data again.
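If pmap_dfr is unfamiliar, here is a toy example that has nothing to do with the corpus: each row of the input becomes one call to the function, and the little data frames that come back are bound into a single one.

tibble(x = 1:3, y = c("a", "b", "c")) %>%
  pmap_dfr(function(x, y) tibble(label = str_c(y, x), squared = x^2))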

Sgiobalta, nach e? (Tidy, isn't it?)

las <- gold %>% inner_join(result, by=c("sentence_id", "token_id", "head_token_id", "dep_rel")) %>% count()/nrow(gold)
ulas <- gold %>% inner_join(result, by=c("sentence_id", "token_id", "head_token_id")) %>% count()/nrow(gold)

Lastly we get the LAS (labelled attachment score: the proportion of words attached to the correct head word with the correct kind of relation) and the UAS (unlabelled attachment score: the proportion attached to the correct head word, regardless of the kind of relation). The inner join throws out all the wrong answers; counting the rows that survive and dividing by the number of rows in the gold file gives a score between 0 and 1, where 1 is perfect agreement.
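Note that count() returns a one-row data frame rather than a plain number, so las and ulas end up as tiny data frames too. If you want bare numbers, an equivalent formulation is:

las <- nrow(inner_join(gold, result,
                       by = c("sentence_id", "token_id", "head_token_id", "dep_rel"))) / nrow(gold)
ulas <- nrow(inner_join(gold, result,
                        by = c("sentence_id", "token_id", "head_token_id"))) / nrow(gold)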

The scores are slightly lower than those reported by udpipe_accuracy. I am not sure why yet; I will investigate.

Universal Dependencies for Scottish Gaelic

Available now at: https://universaldependencies.org/treebanks/gd_arcosg/index.html

I am happy to report that Scottish Gaelic now features among the treebanks of the Universal Dependencies project in releases 2.5 and 2.6. I have generally followed the annotation scheme for Irish, with some additions to cope with constructions that differ between the two languages.

The treebank is based on the ARCOSG corpus, which is half-and-half prose and speech. My paper presented at CLTW this year deals with the prose half, so I thought it would be worthwhile to report some of the features of the speech subcorpora.

The first is sentence-splitting. ARCOSG is divided up into clauses rather than sentences. The prose subcorpora all have punctuation, so by and large I've relied on an automatic and pretty simplistic sentence-splitting algorithm to do the job for me. Occasionally a closing double-quote ends up in the wrong tree, but this is easy to fix. The speech subcorpora, lacking full stops, are something else entirely. The sentence-splitter splits on changes of speaker, and in principle I could just take every utterance as a single tree with lots of parataxis relations, but this would give me ridiculously big trees in some cases. Better to split where it feels like a new utterance. Subsequently I found some guidelines in the wild for this (http://ldp-uchicago.github.io/docs/guides/transcription/sect_4.html), and I am relieved to find that the rules match more or less what I was doing, except that I didn't have the original recordings to work with.
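The simplistic splitter for the punctuated prose amounts to something like the sketch below. This is not the actual code, and teacsa is a made-up variable standing for a chunk of raw text:

library(stringr)
# split on whitespace that follows sentence-final punctuation
abairtean <- str_split(teacsa, "(?<=[.!?])\\s+")[[1]]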

The second is tag questions, like fhios agad ‘you know’ and nach e ‘isn’t it’. These are very common. I’ve elected to simply relate them to the rest of the sentence with a parataxis relation rather than a more specific parataxis:tag relation. Maybe that would help?

The third is words that the transcriber wasn't able to transcribe. These are captured as [?]. If it's not possible to work out from context what relation they bear to the rest of the utterance, I have used the completely generic dep relation that Universal Dependencies provides.
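Schematically, such a token ends up looking something like the second line below in CoNLL-U terms, attached to something with the generic dep relation. This is an invented two-word illustration, not a line from the treebank, and the X tag is a guess.

1	tha	bi	VERB	_	_	0	root	_	_
2	[?]	[?]	X	_	_	1	dep	_	_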

Number four is football commentary. Lots of it looks like this:

MacLare gu Johnson ma-thà ‘MacLare to Johnson indeed’

Johnson leatha a-mach an taobh-sa gu MacStay ‘Johnson with it out the side to MacStay’

s07_005 and s07_006, gd_arcosg-ud-train.conllu

What’s going on there grammatically? I have a solution for now: treat the player as the root and attach the PPs with the obl relation, but is this the most UD way of doing it? If not, I should at least have been consistent enough for it to be fixed automatically.
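For the first example above, that analysis comes out along the lines of the sketch below, with MacLare as the root, gu attached to Johnson as a case marker, and Johnson attached to MacLare with obl. This is a hand-written illustration of the approach, not the annotation as it appears in the treebank.

1	MacLare	MacLare	PROPN	_	_	0	root	_	_
2	gu	gu	ADP	_	_	3	case	_	_
3	Johnson	Johnson	PROPN	_	_	1	obl	_	_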

An editor for dependency treebanks

I was pleased to meet Johannes Heinecke at the International Congress of Celtic Studies in Bangor last week. As well as producing a dependency treebank for Welsh, he has written a rather smart editor for CoNLL-U files, which are pretty much the standard these days for dependency trees.

Screengrab of Johannes Heinecke's CoNLL-U editor. The tree is for the sentence "Cuir d' ainm ri seo."

I managed to get it working this morning on a Mac running macOS Mojave 10.14.6 with a minimum of hassle. You will need Java, Apache Maven, and Homebrew (to install wget). One small surprise is that if you edit a file in a git repository then, by default, every time you edit the tree the new file is committed, which makes the commit history look a bit busy.

The second-best bit is that you can see non-projective relations at a glance, which I certainly can’t do in emacs.

The best bit, as someone who recently wrote a paper where all the arrows in the dependency diagrams pointed the wrong way and didn’t notice until the referees pointed it out, is that there is a wee button you can click on to get a TikZ version of the tree for pasting into LaTeX.

Training a dependency parser on gdbank

A very quick note to say that I’ve trained MaltParser, a dependency parser, with the current gdbank sentences (a mere 1223 tokens spread across 70-odd sentences), the Universal POS tagging scheme and the current Universal-ish gdbank dependency annotation scheme, and then seen how it performed on an unseen test set of 8 sentences containing 276 tokens taken from an article in The Scotsman from a few years ago.

It got 196 (71%) of the heads right, 207 (75%) of the dependency types right, and both the head and the dependency right in 187 (68%) of cases. My initial impression is that the main problems are subordinators and my having mis-POS-tagged a few words, but there will be a confusion matrix soon.
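For the confusion matrix, something as simple as the following should do, assuming the gold and predicted POS tags have been read into two aligned vectors; the variable names here are made up:

# cross-tabulate gold tags against predicted tags;
# the off-diagonal cells are the mis-taggings
table(gold = gold_upos, predicted = predicted_upos)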

MaltParser cheat mode

If you train MaltParser using the learnwo flowchart in place of learn, it does all the same things, except that it writes out the sentences as it reads them in.

This means that if you have, ahem, misformatted any of your input, you can see exactly which misformatting MaltParser is complaining about, because it will be in the first sentence that hasn’t been written to stdout.

Installing MaltParser on Mac OS X 10.6.8

MaltParser is a dependency parser and it’s available here: http://www.maltparser.org/download.html

If you try to run the ready-built jar under Mac OS X 10.6.8 and you haven’t updated to Java 1.7, you’ll get an unsupported major.minor version number error. However, if you simply edit the Java version references in the build.xml file to read 1.6, and type

ant dist

to build with ant, then it will whirr away for a bit and build fine.