I read the news today. To be precise, I’ve been looking at the BBC website’s news in Gaelic and I’ve spotted a grammatical theme among a large proportion of the headlines and standfirsts:

  • Fiosrachadh ga shireadh mu ghoid charbad phoilis “information sought about the theft of a police car”
  • Ceathrar gan toirt far Beinn Nibheis “Four people taken from the top of Ben Nevis”
  • Teaghlach de cheathrar gan toirt far Beinn Nibheis […] (standfirst for the foregoing) “Family of four taken from the top of Ben Nevis”
  • Duine ga lorg air a’ Chliseam “Person found on Clisham [mountain on Harris]”
  • Leasachadh Beinn Uais ga dhiùltadh “Ben Wyvis development turned down”

Here the aspect marker ag preceding a verbal noun has merged with the possessive pronoun that is the direct object of the verbal noun in question (sireadh, toirt, lorg and diùltadh), leniting it if it’s the masculine ga. Put a form of bi at the front and you have a full sentence, but it need not be passive in that case. They could be, maybe absurdly:

  • Information seeks him about the theft of a police car
  • Four people take them from the top of Ben Nevis
  • Family of four take them from the top of Ben Nevis
  • Person finds him on Clisham or Person finds it on Clisham
  • Ben Wyvis development turns him down

These have a look of machine translation about them, don’t they?

gdbank: CCG and dependency structures in Scottish Gaelic

I have been working on a small corpus of Scottish Gaelic sentences. The words in them are all annotated with categorial grammar types and dependency relations. It’s available on GitHub and there is a more detailed description in this paper and this poster, which I presented at CLTW in Dublin.

What format is it in?

CoNLL-X is a tab-separated plain-text format for annotating text with dependency relations. It was developed for the 10th Computational Natural Language Learning meeting in 2006. I have abused the format slightly by putting the categorial grammar types in the “features” column.
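To make the layout concrete, here is a minimal sketch of reading CoNLL-X in Python, assuming the standard ten tab-separated columns and a blank line between sentences. The sample sentence and all of its annotations (lemmas, tags, CCG types, relation labels) are made up for illustration and are not taken from the corpus; `parse_conllx` and `Token` are hypothetical names.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One CoNLL-X row; the ten fields follow the standard column order."""
    id: int
    form: str
    lemma: str
    cpostag: str   # coarse POS tag
    postag: str    # fine-grained POS tag
    feats: str     # normally morphological features; here it holds the CCG type
    head: int      # id of the head token, 0 for the root
    deprel: str
    phead: str
    pdeprel: str


def parse_conllx(text: str) -> List[List[Token]]:
    """Split CoNLL-X text into sentences, each a list of Token rows."""
    sentences, current = [], []
    for raw in text.splitlines():
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        current.append(Token(int(fields[0]), fields[1], fields[2], fields[3],
                             fields[4], fields[5], int(fields[6]), fields[7],
                             fields[8], fields[9]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative rows only: the tags, types and relations are assumptions.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "Tha", "bi", "VERB", "VERB", "(S[dcl]/S[adj])/NP", "0", "root", "_", "_"],
    ["2", "i", "i", "PRON", "PRON", "NP", "1", "nsubj", "_", "_"],
    ["3", "snog", "snog", "ADJ", "ADJ", "S[adj]", "1", "xcomp", "_", "_"],
    ["4", ".", ".", "PUNCT", "PUNCT", ".", "1", "punct", "_", "_"],
]) + "\n"

sentences = parse_conllx(SAMPLE)
for token in sentences[0]:
    print(token.form, token.feats, token.head, token.deprel)
```

Keeping the CCG type in the sixth column means any ordinary CoNLL-X reader will still load the file; it will simply treat the type as an opaque feature string.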

Which standard did you use for the dependency annotation?

After swithering between the Briscoe and Carroll (GR) scheme, the scheme of Teresa Lynn at DCU, and a mixture of the two, I eventually opted for the Universal Dependencies scheme, which is based on the Stanford scheme. This has the merit of making inter-language comparisons straightforward.

Which standard did you use for the CCG annotations?

One based very closely on CCGBank with slight modifications for Gaelic.

How large is it?

Currently there are 40 sentences and 612 tokens (roughly speaking, words and punctuation marks).

Is it POS-tagged?

Ish. The CoNLL-X format has two columns for this, though: a coarse POS tagset (simply whether something is a noun, a verb, an adposition or whatever) and a more fine-grained one that would include number, tense and so forth. I use the Universal POS tagset for both columns for now.

How many annotators did you have?

Ahem. Just me. This is a shortcoming.

Update 2015-08-24: migrated to GitHub (see above).

CLTW2014: report

Last weekend I was in Dublin for the first Celtic Language Technology Workshop, which was part of COLING2014. I am still digesting and still to follow up everything, but here’s a very brief summary.

Elaine Uí Dhonnchadha (DCU) started with an overview of Irish language technology and a plea for open resources. William Lamb and Sammy Danso (both Edinburgh) gave a two-hander about POS-tagging a Scottish Gaelic corpus of nearly 90 000 tokens. This will be coming out later this year. Having taken part in annotation tasks in the past, I was taken aback that they reported a kappa coefficient of 0.98 for two annotators on this task. (A kappa coefficient, for those who don’t know, is a modified agreement score that takes into account chance agreement. I say “a” kappa coefficient because there is more than one way of working them out. If your kappa is zero, that means your agreement is no better than chance, even though your percentage agreement might be something like 75%. Perfect agreement leads to a kappa of 1. Kappas of 0.7 or greater are pretty good, in general.) Monica Ward (also DCU) talked about building resources for the teaching of Irish: a past-tense teaching game for primary school children and a grammar checker for adult learners built around Kevin Scannell’s Gramadóir. Lastly before the coffee break Teresa Lynn (DCU) talked about cross-lingual transfer dependency parsing of Irish. On the face of it, this sounds like something that shouldn’t work at all. You take a model that has been trained on a completely different language, steam off the lexical information (which words relate to which), remove the dependency labels, and keep only the part-of-speech tags and unlabelled dependencies. Surprisingly, it kind-of works. Even more surprisingly the best results came from a model trained on Indonesian, which doesn’t seem to be VSO.
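For anyone who wants to see the kappa arithmetic mentioned above, here is a minimal sketch of one common variant, Cohen’s kappa for two annotators. Choosing Cohen’s version is my assumption; I don’t know which variant was used for the reported figure, and the toy labels below are invented. The function name `cohen_kappa` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected) for two annotators."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Proportion of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data: 75% raw agreement, but no better than chance.
annotator_1 = ["N", "N", "N", "V"]
annotator_2 = ["N", "N", "N", "N"]
print(cohen_kappa(annotator_1, annotator_2))

# Perfect agreement on a mixed set of labels.
print(cohen_kappa(["N", "V", "N", "V"], ["N", "V", "N", "V"]))
```

In the first toy case the second annotator calls everything a noun, so the three agreements are exactly what chance would predict and kappa comes out at zero despite 75% raw agreement.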

At coffee, quite by chance, I was surprised and delighted to meet Mark Steedman (Edinburgh), who invented CCG.

After coffee, Thierry Poibeau (CNRS and DTAL, Cambridge) talked about mutations in Breton. Discussion ensued about whether the anomalous behaviour of, how do I put this, “professional nouns” like kiger (butcher) was semantic or not. I suspect it’s more like word classes, like animacy in the Slavic languages or the Australian language that famously has the class of women, fire and dangerous things. Kevin Scannell (Saint Louis University), who I am astonished to have not come across before, talked about machine translation from Scottish Gaelic to Irish and the various spelling reforms in Irish that have made it look rather less like Scottish Gaelic than it used to. Finally, Caoimhín Ó Donnaíle (Sabhal Mòr Ostaig) talked about the Multidict site, which has videos for language learners and a splendid interactive wrapper for websites that links to external dictionaries. The idea is that you click on a word in one frame and the dictionary entry appears in another. When I was a first-year undergraduate in 1994, Caoimhín’s websites were some of the first I remember ever seeing, probably with the Mosaic browser.

After lunch, an invited talk from Kevin Scannell, largely about Manx, but also about building NLP resources from social media and web crawling. Much of the Manx data is revived Manx, but there’s also text from Skeealyn Vannin, which was compiled by the Irish Folklore Commission in the late 1940s from native speakers. Michal Boleslav Měchura (DCU) talked about breis.focloir.ie, an online Irish grammar database. The last of the full papers was given by Sarah Cooper (Bangor), describing an app, Paldaruo, for crowdsourcing speech recognition training data. The microphones on tablets are good enough these days to record speech from ordinary members of the public, in the first instance to control a wee robot arm connected to a Raspberry Pi.

Then the “poster boaster” session, featuring me, Michal again on onomastics, Francis Tyers (Tromsø) on within-tweet language detection and Delyth Prys (Bangor), talking about the DECHE Corpus of Welsh Scholarly Writing. Then pub.

An enormous amount to take in, and lots to follow up on. I do hope this’ll be the first of many.


Quick note on visccg and UTF-8 for openccg beginners

If you can’t get UTF-8 to work in visccg (looking at Stack Overflow suggests this might be a Python-on-the-Mac thing, but I wouldn’t swear to it) you can still edit .ccg files in Your Favourite Text Editor and ccg2xml will still process them fine.

The Coordinate Structure Constraint: evidence from Irish


Ross’s 1967 MIT thesis Constraints on Variables in Syntax introduced, among other things, the Coordinate Structure Constraint. It is a generalization of the intuitive notion that coordinators (in English, “and”, “but”, “or” and so on) coordinate nouns with nouns (“fish and chips”) and verbs with verbs (“come and go”), and it excludes sentences like “Whose tax did the nurse polish her trombone and the plumber compute?”

While keeping my eyes peeled for examples of non-constituent coordination in Gaelic, I’ve been reading Mícheál Ó Siadhail’s Learning Irish, which has some examples of what the author calls “idiomatic uses of agus”. (I should note that I have a blogpost in preparation with examples from William Lamb’s Scottish Gaelic, including the constructions that are examples of “cosubordination”.) These first five coordinate non-constituents:

Bhí Bríd ann agus í tinn. (1)

Tá Cáit ansin agus leabhar mór aici. (2)

D’imigh Máirtín amach agus gan aon chóta air. (3)

Bhí an bosca ansin is mé ag tíocht abhaile. (4)

Bhí an lá gearr is thú ag imeacht thart mar sin. (5)

(1) coordinates NP + existential ANN with NP + ADJ. (2) coordinates NP + ADV with NP + PP. (3) coordinates ADV with PP. (4) coordinates NP + ADV with NP + small clause. (5) coordinates NP + ADJ with NP + small clause. Sadly there are no counterexamples of uses that are unidiomatic. The conjunction is also shows up in constructions with chomh (like Gaelic cho, which is similar):

chomh sásta is a bhí Máirtín (6)

“as pleased as Martin was”. Here is coordinates ADJ with a direct relative clause.

There are also some non-coordinative-looking uses:

An bhfuil sé míle as seo go Carna? Tá agus deich míle! (7)

Tá mé ag imeacht anois. Tá agus mise! (8)

Is maith liom an áit seo. Is maith agus liomsa! (9)

Any account of coordination in Irish at least has to be able to cope with examples (1) to (6). I hunt on for examples in Gaelic.

An interesting case of coordination

A few weeks ago I spotted this from @BBCAimsir (the weather in Gaelic) on Twitter:

which said (just in case the embedding stops working at some future date):

Tha i blàth agus sinn air 20C a ruighinn an Glaschu agus na Criochan.

Literally “It is warm and we have reached 20 degrees Celsius in Glasgow and the Borders”. What is interesting about it is that it’s coordinating two non-constituents, in English “it… warm” and “we… reached”. This is the sort of thing that CCG is good at handling.

I wonder how common non-constituent coordination like this is in Gaelic, though?

Celtic Language Technology Workshop at COLING 2014

There hasn’t been an enormous amount of work done on the Celtic languages in the fields of computational linguistics or natural language processing, so I was very pleased to see that COLING this year has a workshop on them: http://fionlive2.dcu.ie/cltw2014/

I was even more pleased, and really rather surprised, to have a short paper accepted. More on this soon, but for the full story you’ll have to come to Dublin in August.

Many thanks to one of the organizers, Teresa Lynn, who drew my attention to this meeting in the first place.

What particles do

Most words in categorial grammar are functions. In English, a transitive verb such as “eats” is a function that takes two NP arguments and gives you a clause, S, back. The notation for this is (S\NP)/NP: it takes the object NP to its right and then the subject NP to its left. (Aside: This is rather like defining a function in a programming language, except that void isn’t a type.)

What does this functional approach tell us about particles like a and chan? To answer this I’ll need to set out the different sorts of clauses I’ve seen in Scottish Gaelic. The notation here is based on CCGbank, which itself is based on that of the Penn Treebank, and I’ve marked new ones as such.

  • S[adj]: predicative adjective. Example: snog in Tha i snog.
  • S[dcl]: ordinary declarative sentence. Tha i snog.
  • S[q]: polar question. A bheil i snog?
  • (new) S[neg]: negative declarative. Chan eil i snog.
  • (new) S[negq]: negative polar question. Nach eil i snog?
  • S[wh]: wh-question. Ciamar a tha thu?
  • (new) S[n]: verbal noun-headed small clause. iarraidh cofaidh in Tha mi ag iarraidh cofaidh.
  • S[em]: embedded declarative. a tha thu in Ciamar a tha thu?
  • (new) S[dep]: dependent verb-headed clause. bheil i snog in A bheil i snog?
  • (new) S[a]: a-infinitive. a bhith a’ dannsadh

The five new ones need some explanation. S[neg] and S[negq] are motivated by the clear fourfold division of ordinary sentences into positive, interrogative, negative and interrogative negative. S[n], relating as it does to a verbal noun, replaces S[ed], S[pss] and S[ng] in the CCGbank scheme for English. S[a] is somewhat like S[to] in the CCGbank scheme but not exactly the same as it contains a verbal noun somewhere, and lastly S[dep] presents a phenomenon we simply don’t get in English.

So what do particles do here? Let’s take a few examples from last week’s An Litir Bheag:


Here cha is a function mapping a dependent clause to a negative sentence.


There is a lot going on there. I’ve thought of adverbs as taking a sentence in and giving you a sentence back. Hence gu, when it serves to make an adverb out of an adjective, takes S[adj] as its argument and gives you a function that takes an S[dcl] and gives you back an S[dcl]. na, as in “that which”, is a function that takes an S[dcl] and gives you an NP back. I’m using shorthands for conjunctions and PPs, but these are both described in the literature.
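The idea of a particle as a function can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the slash directions (chan as S[neg]/S[dep], na as NP/S[dcl]) are inferred from the fact that these particles come before their clause, and the names `Atom`, `Func` and `forward_apply` are hypothetical rather than taken from any CCG toolkit.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """An atomic category such as NP or S[dep]."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Func:
    """A function category: result/arg looks right, result\\arg looks left."""
    result: "Category"
    slash: str
    arg: "Category"

    def __str__(self) -> str:
        return f"({self.result}{self.slash}{self.arg})"


Category = Union[Atom, Func]


def forward_apply(left: Category, right: Category) -> Optional[Category]:
    """Forward application: X/Y followed by Y gives X; otherwise None."""
    if isinstance(left, Func) and left.slash == "/" and left.arg == right:
        return left.result
    return None


# Assumed lexical categories for two particles (see the lead-in).
CHAN = Func(Atom("S[neg]"), "/", Atom("S[dep]"))   # negative particle
NA = Func(Atom("NP"), "/", Atom("S[dcl]"))         # "that which"

# "Chan" + "eil i snog" (a dependent clause) gives a negative sentence.
print(forward_apply(CHAN, Atom("S[dep]")))
# The same particle cannot apply to an ordinary declarative clause.
print(forward_apply(CHAN, Atom("S[dcl]")))
# "na" + a declarative clause gives a noun phrase.
print(forward_apply(NA, Atom("S[dcl]")))
```

The failed application in the middle is the point of having S[dep] as its own clause type: the type system, not a separate rule, stops chan from combining with tha.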

Potential point for discussion: I’ve treated ag, a’, air, gu and ri when they introduce verbal nouns as PP/S[n]. But maybe they should be a clause type of their own. Needs more thought.

What the meaning of “is” is

This is the Scottish Gaelic is, often pronounced and written ’s, not the English “is”. It’s a copula, and you can say things like Is mise Càilean, or ’S math sin, but usually the constructions are more complicated than that and those two patterns are rare.

We have the clefted construction Is + e + NP + (for example) a tha + PP[ann] to equate the NP and the innards of the PP, where e is pretty much an expletive like a lot of uses of “it” and “there” in English.

There are “quirky” constructions where the object looks like a subject, and the subject is expressed with a PP. Is toil leam biadh innseanach and Is toil leam a bhith a’ dannsadh are examples, meaning “I like Indian food” and “I like dancing”. (Examples from Teach Yourself Gaelic, 2nd edn). My list so far of the words that can go in the toil slot, and what sort of PP they take, is this:

  • PP[le]: toil (n), toigh (adj), caomh (adj), fhèarr (adj), mhath (adj)
  • PP[air]: beag (adj), lugha (adj)
  • PP[do]: fhiach (adj), urrainn (n), chòir (n), aithne (n), àbhaist (n), mhiann (n)

I expect there are more! To the best of my (admittedly very limited) knowledge, a difference between Scottish and Irish Gaelic is that Irish Gaelic only takes adjectives in the toil slot. They are a bit various in what sort of clausal complements they take, which is a matter for another blog posting.

The other important construction with is is where it’s followed by ann in order to emphasize something that doesn’t normally go in that position, a bit like 把 in Chinese. This is very often a PP, for example from here: ‘s ann às an Fhraing is Ameireagaidh a tha ise “It is from France and America she is from”. I think ann here is really the fused PP for ann + e.

In summary:

  • Is + NP + NP (rare)
  • Is + ADJ + NP (also rare)
  • Is + N[toil]/ADJ[toil] + PP + SUBJ
  • Is + e + NP + a BI + PP[ann]
  • Is + ann + PP/ADJ/ADV/NP[temporal] + a BI + PP[ann]

Have I missed any?

Hope, expectation, responsibility

Even though bi is the verb for “to be”, you can’t usually use it with two noun phrases, certainly not to say that one of them is the other. But there is a class of nouns that go quite happily with another noun as arguments of bi. I think what might be going on is that they’re being used adverbially, like an diugh (today) or an làthair (present). Let’s take this phrase from the Scotsman (source) a few years ago (slightly edited because Johnston Press have mislaid their diacritics):

Thuirt am Ministear a tha an urra ris a’ Ghàidhlig, Peter Peacock:

“Said the minister responsible for Gaelic, Peter Peacock:” is what this means. It’s a clefted construction, as is so often the case in Gaelic and Irish. Tha am Ministear an urra ris a’ Ghàidhlig “The minister is responsible for Gaelic” would be the unclefted version.

Another example from the same piece:

Tha mi an dòchas gum bi duine làidir ann a sheasas suas riutha, a sheasas airson na Gàidhlig, airson nan Gàidheal ‘s an aghaidh an riaghaltais ma tha sin a dhìth.

“I hope that there will be strong people who will stand up for them, stand for Gaelic, for the Gaels and against the government if need be.” This is unclefted and clearer than the previous sentence. At the very beginning we have tha, mi and an dòchas gu… as the verb and two noun phrases.

And one from the BBC:

Chuir ministear eile aig Eaglais na h-Alba fios chun na h-eaglaise gu bheil e an dùil fàgail air sgàth cùis nam ministearan gèidh.

“Another minister in the Church of Scotland has sent word to the church that he expects to leave on account of the matter of gay clergy.” Here we have bheil, the dependent form of bi, followed by e, “he” and an dùil fàgail, “the expectation to leave”.

So that means that bi fits the following patterns (out of my head and double-checked with William Lamb’s Scottish Gaelic):

  1. bi + NP + PP: for expressing locations, for possession, for many verbal constructions if we take ag/a’ and friends to be prepositions (otherwise there is a 1b: bi + NP + AspP), and for linking two nouns: tha mi nam oileanach and ‘s e oileanach a th’annam
  2. bi + NP + ADJ: tha sinn toilichte, tha i brèagha and so on
  3. bi + NP + ADV[loc]: tha an cat a-staigh
  4. bi + NP + NP[dòchas]: the examples we’ve seen above and a few more. Wilson McLeod on Twitter has helpfully pointed out that dùil and urra (as shown above), eisimeil and crochadh are in this set of nouns.

I wonder whether there are any more? I will keep looking.

