The tidyverse and Universal Dependencies

If you use R, you may be familiar with the tidyverse family of packages.

I particularly like them because they make it easy to write compact yet readable code with not a single for loop or if statement in sight.

As a worked example, here is some code I wrote recently to evaluate the Gaelic Universal Dependencies corpus. My suspicion is that as I make the training set bigger, and use more automated checks to make the annotations of the development and test sets more consistent, it will be easier for a parser to parse the texts correctly. (I can in fact use the udpipe_accuracy function, but this is a simple example for demonstration purposes.)

Firstly, some dependencies:


lubridate is a horribly named package that makes handling dates and times easier. udpipe is an R wrapper for UDPipe, which is written in C++.
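Going by the functions used in the rest of this post, the setup would look roughly like the sketch below. The today variable that names the model file is not defined anywhere in the text, so the assignment from lubridate's today() is an assumption.

```r
# Assumed setup, reconstructed from the functions used later in the post.
library(tidyverse)  # dplyr, purrr, stringr and the %>% pipe
library(lubridate)  # date handling
library(udpipe)     # R wrapper around UDPipe

# Assumption: `today` holds the current date and is used to name the model file.
today <- today()
```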

udpipe_train(file = str_glue("{today}.udpipe"),
             files_conllu_training = "../ud/gd_arcosg-ud-train.conllu",
             files_conllu_holdout = "../ud/gd_arcosg-ud-dev.conllu")

This trains a model on the training set, using the dev set to work out when to stop training, and writes it to a file named after today's date.

ud_gd <- udpipe_load_model(str_glue("{today}.udpipe"))
gold <- udpipe_read_conllu('../ud/gd_arcosg-ud-test.conllu')
text <- gold %>%
  select(sentence, sentence_id) %>%
  unique() %>%
  rename(old_sentence_id = sentence_id)

This loads the model we generated earlier, reads the test set into a data frame called gold, and then builds a second data frame containing only the original text and the sentence IDs. The sentence ID column is renamed to prevent udpipe clobbering the sentence IDs from the source file with its own sequential IDs. While a CoNLL-U file is stanza-based, a mixture of tab-separated lines and comments that actually contain data, a tidy data frame has to be consistent all the way through. This means that every word in the original file is an observation and has its own row with all of its properties, including the sentence it belongs to. This is why we need the unique() call: it gives us a data frame with one sentence per row.
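To see the one-row-per-word versus one-row-per-sentence difference on something small, here is a minimal sketch. The gold_toy data frame and its contents are made up for illustration; the real frame returned by udpipe_read_conllu has many more columns.

```r
library(dplyr)

# Toy stand-in for the frame returned by udpipe_read_conllu():
# one row per token, with the sentence text repeated on every row.
gold_toy <- tibble(
  sentence_id = c("s1", "s1", "s1", "s2", "s2"),
  sentence    = c("Tha mi fuar", "Tha mi fuar", "Tha mi fuar",
                  "Chan eil", "Chan eil"),
  token_id    = c("1", "2", "3", "1", "2"),
  token       = c("Tha", "mi", "fuar", "Chan", "eil")
)

# Same pipeline as above: keep two columns, drop the repeats,
# and move the original IDs out of udpipe's way.
text_toy <- gold_toy %>%
  select(sentence, sentence_id) %>%
  unique() %>%
  rename(old_sentence_id = sentence_id)

print(text_toy)
```

Five token rows go in and two sentence rows come out, with the IDs now in old_sentence_id.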

Also note the pipe operator %>%. This is inspired by the pipe operator in Unix: whatever is piped into it (on its left) becomes the first argument of the function on its right. Hence dì_dhùblaichte <- rud_sam_bith %>% unique() is the same as dì_dhùblaichte <- unique(rud_sam_bith). Put like that it seems trivial, but it is really useful for long chains of these.
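A quick way to convince yourself of the equivalence is to run both forms on a toy vector (the vector here is just an illustration):

```r
library(dplyr)  # re-exports the %>% pipe from magrittr

x <- c(3, 1, 3, 2, 1)

# Piped form: read left to right.
piped <- x %>% unique() %>% sort()

# Nested form: read inside out.
nested <- sort(unique(x))

identical(piped, nested)  # TRUE
```

With two steps the nested form is still readable; with five or six, the piped form wins.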

result <- text %>%
  pmap_dfr(function(sentence, old_sentence_id)
    udpipe_annotate(ud_gd, sentence) %>%
      as.data.frame() %>%
      mutate(sentence_id = old_sentence_id))

There is quite a lot going on here, but this is the heart of the code. It pipes the text data frame into pmap_dfr, which calls a function once for every row in the frame. Here the function is an anonymous one in which udpipe takes the ud_gd model, annotates the sentence and converts the result into a data frame; pmap_dfr then binds all of these data frames together into one big one. The last stage of the inner pipeline overwrites udpipe's own sentence ID with the original one, so that the output can be matched against gold.
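The row-by-row mapping is easier to follow without a parser in the way. In the sketch below, fake_annotate is a hypothetical stand-in for the udpipe_annotate and as.data.frame steps: it just splits on spaces and, like a fresh annotation run, hands back its own sentence ID.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for udpipe_annotate() %>% as.data.frame().
# It knows nothing about the original IDs and always returns sentence_id 1.
fake_annotate <- function(sentence) {
  tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  tibble(
    sentence_id = 1L,
    token_id    = as.character(seq_along(tokens)),
    token       = tokens
  )
}

text_toy <- tibble(
  sentence        = c("Tha mi fuar", "Chan eil"),
  old_sentence_id = c("s1", "s2")
)

# pmap_dfr matches the columns of text_toy to the function arguments by name,
# calls the function once per row, and row-binds the results.
result_toy <- text_toy %>%
  pmap_dfr(function(sentence, old_sentence_id)
    fake_annotate(sentence) %>%
      mutate(sentence_id = old_sentence_id))

print(result_toy)
```

Each call returns a small data frame of tokens, and the mutate step is what puts s1 and s2 back where the stand-in had written 1.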

Sgiobalta, nach e? (Neat, isn't it?)

las <- gold %>%
  inner_join(result, by = c("sentence_id", "token_id", "head_token_id", "dep_rel")) %>%
  count() / nrow(gold)
uas <- gold %>%
  inner_join(result, by = c("sentence_id", "token_id", "head_token_id")) %>%
  count() / nrow(gold)

Lastly we get the LAS (labelled attachment score, the proportion of words that have been attached to the correct word with the correct kind of relation) and the UAS (unlabelled attachment score, the proportion attached to the correct word regardless of the kind of relation). Each is computed with a join that throws out all the wrong answers; we then count the remaining rows and divide by the number of rows in the original file. The score has a maximum of 1 for perfect agreement and a minimum of 0.
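Here is the same join trick on a four-word toy sentence, as a minimal sketch. Both data frames are invented for illustration, and I use nrow() rather than count() so that the scores come out as plain numbers.

```r
library(dplyr)

gold_toy <- tibble(
  sentence_id   = c("s1", "s1", "s1", "s1"),
  token_id      = c("1", "2", "3", "4"),
  head_token_id = c("0", "1", "1", "3"),
  dep_rel       = c("root", "nsubj", "obj", "amod")
)

# Pretend parser output: token 3 has the right head but the wrong label,
# and token 4 has the wrong head.
parsed_toy <- tibble(
  sentence_id   = c("s1", "s1", "s1", "s1"),
  token_id      = c("1", "2", "3", "4"),
  head_token_id = c("0", "1", "1", "1"),
  dep_rel       = c("root", "nsubj", "obl", "amod")
)

# LAS: head and relation must both match, so tokens 1 and 2 survive the join.
las_toy <- nrow(inner_join(gold_toy, parsed_toy,
  by = c("sentence_id", "token_id", "head_token_id", "dep_rel"))) / nrow(gold_toy)

# UAS: only the head must match, so token 3 survives as well.
uas_toy <- nrow(inner_join(gold_toy, parsed_toy,
  by = c("sentence_id", "token_id", "head_token_id"))) / nrow(gold_toy)

las_toy  # 0.5
uas_toy  # 0.75
```

The UAS can never be lower than the LAS, because every row that survives the stricter join also survives the looser one.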

The scores are slightly lower than those in udpipe_accuracy. Not sure why. Will investigate.
