What happens if you build a natural language model using modern Gaelic and try it out on late-19th/early-20th century texts?

The grammar is pretty much the same as it is today, but one thing that has changed, slightly if you’re a human being and hugely if you’re a machine, is the spelling.

One wee word that has changed spelling is seo (this). Until recently it was pronounced the same but spelt so.

As an experiment I transcribed Seann Sgoil, the first chapter from William Watson’s Rosg Gàidhlig (‘Gaelic Prose’) from 1915 and ran it through a model trained on the UD version of the ARCOSG to see how it performed.

There are about 2000 words in the whole chapter, and 12 of them are so. In every instance, it’s seo in the old spelling. Sometimes it’s pronominal, tha so freagarrach, ‘this is suitable’, sometimes adverbial, am Fear-teagaisg an so, ‘the teacher here’ but usually it’s a determiner, na daoine so, ‘these people’.

Ten times out of twelve the parser has decided it’s a conjunction, once it’s decided it’s a preposition, and the other time it’s decided it’s a foreign word. (This latter might be my fault and inconsistent tagging in the training data.) All of these are wrong. The conjunctions are probably because the English language word so appears in otherwise Gaelic sentences in the conversational training data.

Dè nì mi? I need training data with pre-GOC texts in it with so tagged correctly and an indication somewhere that it’s really seo, at least for the human reader. This is not the only problem with old texts. More blog posts will follow.
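One way of recording that so is really seo is to carry the modern spelling in a separate column next to the original token. The following is only a sketch: the lookup table old_to_new and the column name modern_form are hypothetical, not part of ARCOSG or of the UD guidelines.

```r
library(dplyr)

# Hypothetical lookup table of pre-GOC spellings and their modern equivalents.
old_to_new <- tibble(token = "so", modern_form = "seo")

# Toy tokens from the phrase "na daoine so" ('these people').
tokens <- tibble(token_id = 1:3, token = c("na", "daoine", "so"))

# Keep the original token and add the modern spelling alongside it;
# tokens that are not in the lookup table keep their own spelling.
normalised <- tokens %>%
  left_join(old_to_new, by = "token") %>%
  mutate(modern_form = coalesce(modern_form, token))

print(normalised)
```

A blind lookup like this would also rewrite the English word so wherever it turns up in a Gaelic sentence, so it is a starting point for hand correction rather than a fix.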

Facle of the day, 10 October 2023: GOILE

Do you know Facle? It is an unofficial Wordle in Gaelic, and perhaps a little easier because of the spelling rules.

Goile means “stomach” or, metaphorically, “appetite”. As usual I have to ask whether there is anything interesting or odd in DASG, and I was not disappointed.

is mór an pian do goile super na h’oidhce

Regimen Sanitatis Salernitanum, 16th century

Regimen Sanitatis Salernitanum? Although the text may well have been written in Montpellier, the medical school at Salerno in Italy was important in the Middle Ages, and so people associated the Rules of Health with Salerno.

Clann ‘ic Beatha (the Macbeaths) were physicians to the kings of Scotland, to clansfolk and to the Lords of the Isles, and this is a manuscript from the Vade mecum (= “come along with me”, a collection of useful texts) of Iain MacBheatha. The text was translated from Latin, and Aodh Uí Cendainn, a professional scribe from Ireland, copied out all the texts in the Vade mecum even though he was not a physician.

You can read more on the Internet Archive: Regimen sanitatis : the rule of health : a Gaelic medical manuscript of the early sixteenth century or perhaps older : from the Vade mecum of the famous Macbeaths.

Typesetting f in Scottish Gaelic

Lower-case f is an inconveniently-shaped letter. In order to stop the top-right-hand side of the letter bumping into a tall successor you need special ligatures in English at least for ff, fl, fi, ffl, ffi and those five are sufficiently venerable to appear in the Alphabetic Presentation Forms block of Unicode.
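For reference, the five ligatures sit at consecutive code points, U+FB00 to U+FB04, at the start of that block. A small sketch in base R that lists them (the data frame and its column names are just for illustration):

```r
# The five f-ligatures occupy U+FB00 to U+FB04 in Alphabetic Presentation Forms.
codes <- 0xFB00:0xFB04

ligatures <- data.frame(
  sequence = c("ff", "fi", "fl", "ffi", "ffl"),  # the letters each ligature replaces
  glyph = intToUtf8(codes, multiple = TRUE),     # the precomposed ligature character
  codepoint = sprintf("U+%04X", codes),
  stringsAsFactors = FALSE
)

print(ligatures)
```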

What about Irish and Gaelic, where f lenites in many sentences to fh? In my old Flamingo copy of The Best of Myles, there is no special ligature for it, so there’s a huge gap in the middle of the word seanfhocal “old word”. How does it look in your browser?

seanfhocal, fhàgail, fhèin

The two easy options in hot metal printing are to make your own fh or to choose a typeface where the widest part of the f is the crossbar.

Digital typography gives us another option, which is to simply print the h over the f’s ball terminal, which is a terminal in the shape of a ball, though I do keep misreading it in my head as being like “poet laureate”. How bad this looks depends mostly, I think, on whether you’ve had this drawn to your attention (sorry) and how new your spectacles are.

Scottish Gaelic provides another difficult letter combination:

fàs, fàinne, fàilte

This deserves a ligature of its own in fonts where the top of the f overshoots.

See also: Typography Deconstructed’s Type Glossary.

The tidyverse and Universal Dependencies

If you use R you may be familiar with the tidyverse family of packages.

I particularly like them because they make it easy to write compact yet readable code with not a single for loop or if statement in sight.

As a worked example, here is some recent code I have written to evaluate the Gaelic Universal Dependencies corpus. My suspicion is that as I make the training set bigger, and the annotations of the development set and the test set more consistent with more automated checks, it will be easier for a parser to parse the texts correctly. (I can in fact use the udpipe_accuracy method, but this is a simple example for demonstration purposes.)

Firstly, some dependencies:
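Judging by the functions used further down, the setup is presumably something along these lines. This is a sketch rather than the original listing; in particular, the assumption is that today is a variable holding the current date, since it is interpolated into the model filename below.

```r
library(tidyverse)  # dplyr, purrr and stringr (for str_glue) in one go
library(lubridate)  # date handling
library(udpipe)     # R wrapper for UDPipe

# Assumption: `today` is the current date, so that str_glue("{today}.udpipe")
# names the model file after the day it was trained.
today <- today()
```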


lubridate is a horribly-named package that makes handling time easier. udpipe is an R wrapper for the udpipe package written in C++.

udpipe_train(file = str_glue("{today}.udpipe"),
             files_conllu_training = "../ud/gd_arcosg-ud-train.conllu",
             files_conllu_holdout = "../ud/gd_arcosg-ud-dev.conllu")

This outputs a model based on the training set using the dev set to work out when to stop training.

ud_gd <- udpipe_load_model(str_glue("{today}.udpipe"))
gold <- udpipe_read_conllu('../ud/gd_arcosg-ud-test.conllu')
text <- gold %>%
  select(sentence, sentence_id) %>%
  unique() %>%
  rename(old_sentence_id = sentence_id)

This loads in the model we generated earlier, sets up a data frame in gold and then generates another dataframe consisting only of the original text and the sentence IDs. Then it renames the sentence ID column to prevent udpipe clobbering the sentence IDs from the source file with its own sequential IDs. While a CoNLL-U file is stanza-based and a mixture of tab-separated lines and comments that actually contain data, a tidy csv file has to be consistent all the way through. This means that every word in the original file is an observation and has its own row with all of its properties, including the sentence it belongs to. This is why we need the unique() statement to create a data frame that is one sentence per row.
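To see why unique() is needed, here is a toy version of the same step. The data frame gold_toy is a hand-made stand-in for what udpipe_read_conllu returns, with only three columns, and the sentence IDs are made up.

```r
library(dplyr)

# One row per token, with the full sentence text repeated on every row.
gold_toy <- tibble(
  sentence_id = c("s1", "s1", "s1", "s2", "s2", "s2"),
  sentence = c(rep("na daoine so", 3), rep("tha so freagarrach", 3)),
  token = c("na", "daoine", "so", "tha", "so", "freagarrach")
)

# Dropping the token column leaves three identical rows per sentence;
# unique() collapses them so that there is one row per sentence.
text_toy <- gold_toy %>%
  select(sentence, sentence_id) %>%
  unique() %>%
  rename(old_sentence_id = sentence_id)

print(text_toy)
```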

Also note the pipe operator %>%. It is inspired by the Unix pipe: whatever is piped into it (on its left) becomes the first argument to the function on its right, hence dì_dhùblaichte <- rud_sam_bith %>% unique() is the same as dì_dhùblaichte <- unique(rud_sam_bith). Put like that it seems trivial, but it's really useful for long chains of these.
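A quick check of that equivalence on a toy vector (the values are arbitrary):

```r
library(magrittr)

rud_sam_bith <- c("so", "seo", "so")

# The piped and the nested forms give the same result.
piped <- rud_sam_bith %>% unique()
nested <- unique(rud_sam_bith)
identical(piped, nested)

# With a longer chain the piped form reads left to right,
# whereas the nested form has to be read inside out.
chained <- rud_sam_bith %>% unique() %>% toupper() %>% rev()
inside_out <- rev(toupper(unique(rud_sam_bith)))
identical(chained, inside_out)
```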

result <- text %>%
  pmap_dfr(function(sentence, old_sentence_id) {
    udpipe_annotate(ud_gd, sentence) %>%
      as.data.frame() %>%
      mutate(sentence_id = old_sentence_id)
  })

There is quite a lot going on here, but this is the heart of the code. It pipes the text data frame into pmap_dfr, which maps every row in the frame onto a function, here an anonymous one where udpipe takes the ud_gd model and annotates each sentence, converts that into a data frame, and then pmap_dfr binds all of these data frames together into one big one. The last stage in the pipeline renames the sentence ID column back to the original for readability.
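The same pattern can be tried without a trained model. In this sketch toy_annotate is a hypothetical stand-in for udpipe_annotate followed by as.data.frame: it just splits on spaces and, like udpipe, assigns a sentence ID of its own that then gets overwritten.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the parser: one row per whitespace-separated token.
toy_annotate <- function(sentence) {
  tokens <- strsplit(sentence, " ")[[1]]
  tibble(sentence_id = "1", token_id = seq_along(tokens), token = tokens)
}

text_toy <- tibble(
  sentence = c("na daoine so", "tha so freagarrach"),
  old_sentence_id = c("s1", "s2")
)

# pmap_dfr calls the function once per row, matching columns to arguments
# by name, and row-binds the resulting data frames into one.
result_toy <- text_toy %>%
  pmap_dfr(function(sentence, old_sentence_id) {
    toy_annotate(sentence) %>%
      mutate(sentence_id = old_sentence_id)
  })

print(result_toy)
```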

Sgiobalta, nach e?

las <- gold %>%
  inner_join(result, by = c("sentence_id", "token_id", "head_token_id", "dep_rel")) %>%
  count() / nrow(gold)
uas <- gold %>%
  inner_join(result, by = c("sentence_id", "token_id", "head_token_id")) %>%
  count() / nrow(gold)

Lastly we get the LAS (labelled attachment score, the number of words that have been attached to the correct word with the correct kind of relation) and the UAS (unlabelled attachment score, simply those ones that are attached to the correct word regardless of the kind of relation) by doing a join that throws out all the wrong answers, counts the rows and divides them by the rows in the original file to get a score, maximum 1 for perfect agreement, minimum 0.
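A worked toy example of the same calculation, with a made-up four-word sentence in which the parser gets one head wrong and one label wrong. The relations are arbitrary and only there to make the arithmetic visible.

```r
library(dplyr)

gold_toy <- tibble(
  sentence_id = "s1",
  token_id = c("1", "2", "3", "4"),
  head_token_id = c("2", "0", "2", "3"),
  dep_rel = c("nsubj", "root", "obj", "nmod")
)

# Token 3 has the right head but the wrong label; token 4 has the wrong head.
pred_toy <- tibble(
  sentence_id = "s1",
  token_id = c("1", "2", "3", "4"),
  head_token_id = c("2", "0", "2", "2"),
  dep_rel = c("nsubj", "root", "obl", "nmod")
)

# The join keeps only the rows on which gold and prediction agree.
las_toy <- nrow(inner_join(gold_toy, pred_toy,
                           by = c("sentence_id", "token_id", "head_token_id", "dep_rel"))) /
  nrow(gold_toy)
uas_toy <- nrow(inner_join(gold_toy, pred_toy,
                           by = c("sentence_id", "token_id", "head_token_id"))) /
  nrow(gold_toy)

las_toy  # 2 of 4 tokens have the correct head and label
uas_toy  # 3 of 4 tokens have the correct head
```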

The scores are slightly lower than those in udpipe_accuracy. Not sure why. Will investigate.

Universal dependencies for Scottish Gaelic

Available now at: https://universaldependencies.org/treebanks/gd_arcosg/index.html

I am happy to report that Scottish Gaelic now features among the treebanks of the Universal Dependencies project in releases 2.5 and 2.6. I have generally followed the annotation scheme for Irish, with some additions to cope with constructions that differ between the two languages.

The treebank is based on the ARCOSG corpus, which is half-and-half prose and speech. My paper presented at CLTW this year deals with the prose half so I thought it would be worthwhile to report some of the features of the speech subcorpora.

The first is sentence-splitting. ARCOSG is divided up into clauses rather than sentences. The prose subcorpora all have punctuation, so by-and-large I’ve relied on an automatic and pretty simplistic sentence-splitting algorithm to do the job for me. Occasionally a closing double-quote ends up in the wrong tree, but this is easy to fix. The speech subcorpora, lacking full stops, are something else entirely. The sentence-splitter splits on changes of speaker and in principle I could just take every utterance as a single tree with lots of parataxis relations, but this would give me ridiculously big trees in some cases. Better to split where it feels like a new utterance. Subsequently I have found some guidelines in the wild for this: http://ldp-uchicago.github.io/docs/guides/transcription/sect_4.html I am relieved to find that the rules match more or less what I was doing, except that I didn’t have the original recordings to work with.
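For the prose side, a simplistic punctuation-based splitter can be sketched in a few lines. This is not the algorithm actually used on the treebank, just an illustration of the idea: a new sentence starts after every token that is a full stop, question mark or exclamation mark.

```r
library(dplyr)

tokens <- tibble(
  token = c("Tha", "so", "freagarrach", ".", "Sgiobalta", ",", "nach", "e", "?")
)

# A token starts a new sentence if the previous token was sentence-final
# punctuation; the running total of those boundaries gives the sentence number.
split_tokens <- tokens %>%
  mutate(sentence_id = cumsum(lag(token %in% c(".", "?", "!"), default = FALSE)) + 1)

print(split_tokens)
```

A splitter like this has nothing to go on in the speech subcorpora, which is why those had to be split by hand.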

The second is tag questions, like fhios agad ‘you know’ and nach e ‘isn’t it’. These are very common. I’ve elected to simply relate them to the rest of the sentence with a parataxis relation rather than a more specific parataxis:tag relation. Maybe that would help?

The third is words that the transcriber wasn’t able to transcribe. These are captured as [?]. If it’s not possible to work out from context what relation they bear to the rest of the utterance then Universal Dependencies has a completely generic dep relation I have used.

Number four is football commentary. Lots of it looks like this:

MacLare gu Johnson ma-thà ‘MacLare to Johnson indeed’

Johnson leatha a-mach an taobh-sa gu MacStay ‘Johnson with it out the side to MacStay’

s07_005 and s07_006, gd_arcosg-ud-train.conllu

What’s going on there grammatically? I have a solution for now: treat the player as the root and attach the PPs with the obl relation, but is this the most UD way of doing it? If not, then I should have been consistent enough to fix it automatically.

Call for Participation: 3rd Celtic Language Technology Workshop (@MT Summit 2019)

Workshop: 3rd Celtic Language Technology Workshop

Date: 19th August 2019

Location: Dublin, Ireland

Website: http://cl.indiana.edu/cltw19/

Registration: https://www.mtsummit2019.com/registration

We invite you to participate in the Third Celtic Language Technology Workshop, sponsored by Mozilla and the Irish Government Department of Culture, Heritage and the Gaeltacht.

  • invited talks by Claudia Soria, Italian National Research Council, “The Digital Language Survival Kit” & Kelly Davis, Mozilla, “Common Voice”
  • oral presentations on a range of technological advances and exploration for Celtic Languages (machine translation, treebanking, CALL, etc.)
  • a CLTW community discussion
  • social excursion and networking event

Please visit the workshop webpage for details on accepted papers: http://cl.indiana.edu/cltw19/

The full programme will be announced soon.

This workshop is co-located with MT Summit 2019; registration is available on the conference website: https://www.mtsummit2019.com/registration

Workshop Organizers and Program Committee Chairs

Teresa Lynn, Dublin City University

Delyth Prys, University of Bangor

Colin Batchelor, Royal Society of Chemistry

Francis M. Tyers, Indiana University and Higher School of Economics

An editor for dependency treebanks

I was pleased to meet Johannes Heinecke at the International Congress of Celtic Studies in Bangor last week. As well as producing a dependency treebank for Welsh, he has written a rather smart editor for CoNLL-U files, which are pretty much the standard these days for dependency trees.

Screengrab of Johannes Heinecke's CoNLL-U editor. The tree is for the sentence "Cuir d' ainm ri seo."

I managed to get it working this morning on a Mac running Mac OS Mojave 10.14.6 with a minimum of hassle. You will need Java, Apache Maven, and Homebrew in order to install wget. One small surprise is that if you edit a file in a git repository then by default every time you edit the tree, the new file is committed, which makes the commit history look a bit busy.

The second best bit is that you can see non-projective relations at a glance, which I certainly can’t do in emacs.

The best bit, as someone who recently wrote a paper where all the arrows in the dependency diagrams pointed the wrong way and didn’t notice until the referees pointed it out, is that there is a wee button you can click on to get a tikz version of the tree for pasting into LaTeX.


Unless I indicate otherwise, all these examples are taken from Gareth King’s Intermediate Welsh (London: Routledge, 1996). The analyses are mine, as are the errors.

I don’t think I ever mastered the word mai, and reading up on it, I think it’s because I never mastered changes of word order. The verb doesn’t have to go first in the sentence. Take the title of Menna Elfyn’s Ibsen translation Y Fenyw Ddaeth o’r Môr, where the NP, ‘the woman’, comes before the dependent form of the verb, ddaeth not daeth. The opening stage directions have lots of PPs before the independent form of the verb, like this:

  • Ar y chwith mae feranda dan do llydan. ‘On the left there is a veranda under a broad roof.’
  • Yn y tu blaen, ac o gwmpas y tŷ, mae gardd. ‘In front, and around the house, is a garden.’
  • Islaw’r feranda, mae polyn baner. ‘Below the veranda, there is a flagpole’.

and so on. Now ordinary subordinate clauses, which I did get the hang of, look like this:

Dw i'n meddwl   fod                  Ron  yn dod yfory
--------------  -------------------  ---  ------------
S[n]/NP/S[sub]  S[sub]/S[asp]/NP/NP  NP   S[asp]/NP

which is the same as a declarative clause, except it can be an argument to meddwl or credu or another verb of thinking, feeling and so on. But what if we’re emphasizing Ron? Then we have the word mai before Ron before the dependent form of mae, which in this case is sy. So how do we handle this? There is a back door in CCG which is the unary type-changing rule. It’s not the done thing, but if I gather examples of them hopefully someone who understands these things better can refactor the grammar into a cleverer shape. Here are three type-changing rules, which add a feature FRONTED:

  • S[dep]/NP → S[dcl, +FRONTED]\NP (blocked for mae)
  • S[dep]/NP → S[dcl, +FRONTED]\S[n]/NP (not blocked for mae). Example: Gwaethygu mae’r sefyllfa yn Ne Ewrop.
  • PP → S[+FRONTED]/S. Example: Menna Elfyn’s scene setting above.

The idea here is that mai (and its South Walian counterpart taw) has the type S[sub]/S[dcl, +FRONTED], which is to say that it only takes a declarative clause if there’s something in front of the verb.

That feels as if I’ve learnt something.

Every one’s a clitic: a general treatment of one family of fused words in Welsh

I’ve been starting to look at Welsh through the lens of CCG, largely because if I did manage to learn how to use words like mai, sydd, sef and bod (as a conjunction) correctly in my youth I have forgotten now.

I have to know what’s going on in the simpler clauses that these words are joining together first, though. So far the analysis from Scottish Gaelic, for example, word order, verbal nouns being clauses of type S[n]/NP/NP or S[n]/NP and particles like yn or wedi being type-changers, carries through, partly because I made sure I read up on how people have treated the verbal noun in Welsh beforehand. However the example sentences I’ve been looking at have pronouns attached to clitic particles, hi’n, to articles, e’r and to possessive pronouns, fe’ch.

This needn’t be a problem for dependency grammars, where you can have as many edges coming out of a single node as you like, but it looks tricky for constituency parsers where you expect the sentence to be of the form VP NP, but part of the fused word is in the VP and part of it is in the NP. At this stage it would be very easy to decide to change the tokenization rules so that e and ‘r are separate words, but one thing CCG is good at is assigning categories, possibly baroque and frightening ones to words that reflect what the words do in a sentence.

Let’s take Rydyn nhw’n dod ‘they are coming’. dod is an intransitive verbal noun which I take to be S[n]/NP. Rydyn is the independent verb ‘to be’, present tense, third person, and expects an NP for the subject and either an adjectival phrase or an aspectual phrase. I’ve written this as S[dcl]/S[asp]/NP/NP. On their own, nhw ‘they’ and yn (aspect marker) are NP and S[asp]/NP/S[n]/NP respectively. But what are they when combined? The way to answer this is to treat parsing the sentence as a mathematical puzzle. We know the solution is S[dcl], and at each stage of the proof we are allowed one of the allowed moves in CCG, application, substitution, type-raising or composition, and then we solve for Q in the below. I had a hunch that backwards crossed composition combined with type-raising would be the way to go here. Let’s try type-raising dod first. We want a backslash so we can try backwards crossed composition, Y/Z X\Y -> X/Z

Rydyn               nhw'n              dod
S[dcl]/S[asp]/NP/NP Q S[n]/NP
(try D = S[asp]/NP)

So, X = S[asp]/NP and Y = S[asp]/NP/S[n]/NP. Q = Y/Z. We know that X/Z = S[asp]/NP/NP, so…

S[asp]/NP/S[n]/NP/NP S[asp]/NP\S[asp]/NP/S[n]/NP
S[dcl] ✓

The first thing I want to observe is that this would be clearer if everything were coloured in. The second thing is that the type of nhw’n, if you take the type of nhw to be A and ‘n to be B, is B/A. This feels like the sort of result that is obvious to someone who is more proficient than me. But is it generalizable? Let’s try the simpler construction in Gwerthodd e’r oergell – ‘he sold the fridge’. Here e is NP and ‘r, the article is NP/N, and oergell, an indefinite fridge, is N, so A = NP, B = NP/N and B/A = NP/N/NP.

Gwerthodd      e'r     oergell
S[dcl]/NP/NP NP/N/NP N
try type-raising with NP
Y = NP/N, X = NP, Y = NP
S[dcl] ✓

I think that’s a result. Next up: look into Lambek’s product operator and sort out what’s going on with the eich… chi construction.
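As a sanity check on the categories used in this post, the un-fused versions of both sentences can be reduced to S[dcl] with forward application alone. The sketch below is deliberately crude: categories are flat strings, forward_apply is a made-up helper rather than part of any CCG library, and matching the argument as a suffix of the functor only works because none of these categories needs brackets to disambiguate it.

```r
# Forward application, X/Y Y -> X, on flat category strings.
# Returns NA if the functor is not looking for this argument.
forward_apply <- function(functor, argument) {
  suffix <- paste0("/", argument)
  if (!endsWith(functor, suffix)) {
    return(NA_character_)
  }
  substr(functor, 1, nchar(functor) - nchar(suffix))
}

# Rydyn nhw yn dod, with nhw and yn as separate words.
yn_dod <- forward_apply("S[asp]/NP/S[n]/NP", "S[n]/NP")    # S[asp]/NP
rydyn_nhw <- forward_apply("S[dcl]/S[asp]/NP/NP", "NP")    # S[dcl]/S[asp]/NP
coming <- forward_apply(rydyn_nhw, yn_dod)                 # S[dcl]

# Gwerthodd e 'r oergell, with e and 'r as separate words.
r_oergell <- forward_apply("NP/N", "N")                    # NP
gwerthodd_e <- forward_apply("S[dcl]/NP/NP", "NP")         # S[dcl]/NP
sold <- forward_apply(gwerthodd_e, r_oergell)              # S[dcl]

c(coming, sold)
```

The fused words are exactly the cases where this simple left-to-right application stops being enough and type-raising and composition have to come in.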

Geàrr Ghràmar na Gàidhlig

Tha mi air a bhith a’ leughadh Geàrr Ghràmar na Gàidhlig le Richard A. V. Cox. Tha e glè dhlùth, mhionaideach is 492 duilleagan a dh’fhaide is e anns a’ Ghàidhlig air fad. Mar sin tha sanas bhriathar ann is tha na teirmichean teicnigeach nas soilleire na anns a’ Bheurla. Dè tha apocope, syncope is aphaeresis a’ ciallachadh? Teasgadh deiridh, teasgadh meadhain is teasgadh toisich.

I have been reading Richard A. V. Cox’s Geàrr Ghràmar na Gàidhlig (‘Short Grammar of Gaelic’). It’s very dense, very detailed and 492 pages long, not to mention entirely in Gaelic. To this end there is a glossary of the technical vocabulary, which is generally easier to work out than the corresponding vocabulary in English: apocope, syncope and aphaeresis are teasgadh deiridh, teasgadh meadhain and teasgadh toisich.