The tidyverse and Universal Dependencies

If you use R you may be familiar with the tidyverse family of packages.

I particularly like them because they make it easy to write compact yet readable code with not a single for loop or if statement in sight.

As a worked example, here is some recent code I have written to evaluate the Gaelic Universal Dependencies corpus. My suspicion is that as I make the training set bigger, and the annotations of the development set and the test set more consistent with more automated checks, it will be easier for a parser to parse the texts correctly. (I can in fact use the udpipe_accuracy method, but this is a simple example for demonstration purposes.)

Firstly, some dependencies:

library(tidyverse)
library(lubridate)
library(udpipe)

lubridate is a horribly-named package that makes handling time easier. udpipe is an R wrapper for the udpipe package written in C++.

udpipe_train(file = str_glue("{today}.udpipe"),
files_conllu_training = "../ud/gd_arcosg-ud-train.conllu",
files_conllu_holdout = "../ud/gd_arcosg-ud-dev.conllu")

This outputs a model based on the training set using the dev set to work out when to stop training.

ud_gd <- udpipe_load_model(str_glue("{today}.udpipe"))
gold <- udpipe_read_conllu('../ud/gd_arcosg-ud-test.conllu')
text <- gold %>% select(sentence, sentence_id) %>% unique() %>% rename(old_sentence_id = sentence_id)

This loads in the model we generated earlier, sets up a data frame in gold and then generates another dataframe consisting only of the original text and the sentence IDs. Then it renames the sentence ID column to prevent udpipe clobbering the sentence IDs from the source file with its own sequential IDs. While a CoNLL-U file is stanza-based and a mixture of tab-separated lines and comments that actually contain data, a tidy csv file has to be consistent all the way through. This means that every word in the original file is an observation and has its own row with all of its properties, including the sentence it belongs to. This is why we need the unique() statement to create a data frame that is one sentence per row.

Also note the pipe operator %>%. This is inspired by the pipe operator in Unix, and how it works is that what is piped into it (on the left of it) is the first argument to the function, hence d?_dh?blaichte <- rud_sam_bith %>% unique() is the same as d?-dh?blaichte <- unique(rud_sam_bith). Put like that it seems trivial but it's really useful for long chains of these.

result <- text %>% pmap_dfr(function(sentence, old_sentence_id) udpipe_annotate(ud_gd, sentence) %>% as.data.frame() %>% mutate(sentence_id = old_sentence_id))

There is quite a lot going on here, but this is the heart of the code. It pipes the text data frame into pmap_dfr, which maps every row in the frame onto a function, here an anonymous one where udpipe takes the ud_gd model and annotates each sentence, converts that into a data frame, and then pmap_dfr binds all of these data frames together into one big one. The last stage in the pipeline renames the sentence ID column back to the original for readability.

Sgiobalta, nach e?

las <- gold %>% inner_join(result, by=c("sentence_id", "token_id", "head_token_id", "dep_rel")) %>% count()/nrow(gold)
ulas <- gold %>% inner_join(result, by=c("sentence_id", "token_id", "head_token_id")) %>% count()/nrow(gold)

Lastly we get the LAS (labelled attachment score, the number of words that have been attached to the correct word with the correct kind of relation) and the UAS (unlabelled attachment score, simply those ones that are attached to the correct word regardless of the kind of relation) by doing a join that throws out all the wrong answers, counts the rows and divides them by the rows in the original file to get a score, maximum 1 for perfect agreement, minimum 0.

The scores are slightly lower than those in udpipe_accuracy. Not sure why. Will investigate.

gdbank: CCG and dependency structures in Scottish Gaelic

I have been working on a small corpus of Scottish Gaelic sentences. The words in them are all annotated with categorial grammar types and dependency relations. It’s available on Google Code?GitHub and there is a more detailed description in this paper and this poster, which I presented at CLTW in Dublin.

What format is it in?

CoNLL-X is a tab-separated plain-text format for annotating text with dependency relations. It was developed for the 10th Computational Natural Language Learning meeting in 2006. I have abused the format slightly by putting the categorial grammar types in the “features” column.

Which standard?did you use for the dependency annotation?

After swithering between the Briscoe and Carroll (GR) scheme and Teresa Lynn at DCU’s?scheme and a mixture of the two I eventually opted for the Universal Dependency Scheme, which is based on the Stanford scheme. This has the merit of making inter-language comparisons straightforward.

Which standard did you use for the CCG annotations?

One based very closely on CCGBank with slight modifications for Gaelic.

How large is it?

Currently there are 40?sentences and 612?tokens (roughly speaking, words and punctuation marks).

Is it POS-tagged?

Ish.?The CoNLL-X format has two columns for this, though, a coarse POS tagset (simply whether something is a noun, a verb, an adposition or whatever) and a more fine-grained one that would include number, tense and so forth. I use the Universal POS tagset for both columns for now.

How many annotators did you have?

Ahem. Just me. This is a shortcoming.

Update 2015-08-24: migrated to GitHub (see above).