Dependency structures in Irish Gaelic

Quick note to say that Teresa Lynn at DCU has been working on a project based on dependency treebanks for Irish. This is relevant to this blog because Irish Gaelic is very closely related to Scottish Gaelic and much of the grammar is similar, and there has also been work in the past (Clark and Curran 2007, Table 2, for example) on deriving dependency structures from CCG lexical structures.

Here are two papers I’ve had a quick look at:

Resources present and future

Excitingly, William Lamb at the University of Edinburgh tells me in the comments on this earlier post that he has been funded by Bòrd na Gàidhlig to work on a tagset and corpus for Scottish Gaelic.

I have been delighted to be pointed to his 2003 Scottish Gaelic (2nd edn, Lincom Europa, Munich), which is exactly the sort of book I have been looking for. Worth careful study.

Ambiguity everywhere

Much of the basic grammatical machinery of Gaelic consists of overloaded words. This is nothing unusual, of course; in English, for example, to is both a preposition and marks the infinitive, but there seems to be an awful lot of it going on in Gaelic. One of the more striking examples is an. This can be:

  • the definite article: an t-eilean
  • an interrogative particle: An do chòrd e riut?
  • the interrogative form of is: An toil leat ball-coise?
  • a possessive pronoun (their): an càr

Do has several meanings too:

  • a possessive pronoun (your): do bhaidhseagal
  • a preposition: do Ghlaschu
  • a past-tense marking particle: An do chòrd e riut?

A has at least the following meanings and there may well be some I’ve missed:

  • numerical particle: a h-aon
  • vocative particle: a Mhàiri
  • the infinitive particle: an uinneag a dhùnadh
  • an interrogative particle: A bheil thu a’ dannsadh?
  • two possessive pronouns (her and his): a chàr, a h-athair
  • relative particle: Dè an t-ainm a tha ort?

not to mention its homophonous friend a’:

  • definite article: anns a’ chidsin
  • the participle particle: Tha mi a’ dol

If I want to start part-of-speech-tagging Gaelic text, as a preliminary to building a grammar, I’m going to need to write some guidelines as to when each of these words is what.
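As a starting point for such guidelines, the readings listed above can be written down as a lookup table. This is only a sketch: the tag labels are hypothetical names rather than an established tagset, and the function simply reports which readings are possible without choosing between them.

```python
# Candidate readings for the overloaded words listed above.
# The tag labels are hypothetical; they are not taken from any existing tagset.
CANDIDATE_TAGS = {
    "an": ["definite article", "interrogative particle",
           "interrogative copula", "possessive pronoun (their)"],
    "do": ["possessive pronoun (your)", "preposition", "past-tense particle"],
    "a": ["numerical particle", "vocative particle", "infinitive particle",
          "interrogative particle", "possessive pronoun (her)",
          "possessive pronoun (his)", "relative particle"],
    "a'": ["definite article", "participle particle"],
}


def candidate_tags(token):
    """Return every reading listed for a token, or an empty list if unlisted."""
    normalised = token.lower().replace("\u2019", "'")  # treat ’ and ' alike
    return CANDIDATE_TAGS.get(normalised, [])


def ambiguous_tokens(sentence):
    """Pair each token that has more than one listed reading with its readings."""
    return [(tok, candidate_tags(tok)) for tok in sentence.split()
            if len(candidate_tags(tok)) > 1]


for token, tags in ambiguous_tokens("An do chòrd e riut?"):
    print(token, "->", ", ".join(tags))
```

The guidelines would then amount to rules for picking one reading from each list in context.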

 

It’s fine

This confused me, so I mention it in case it confuses anyone else.

If predicative adjectives have type S[adj]\NP (because they come after the noun), NPs have type NP and the predicative copula has type (S[dcl]/(S[adj]\NP))/NP, then how do we cope with sentences that only have one NP? Where I went astray was assuming that if you have a word of type X/Y, then there has to be a Y somewhere in that sentence.

Not true! Tha i brèagha “it’s fine” (talking about the weather) is a good and simple example.

Tha                       i    brèagha
V                         NP   ADJ
(S[dcl]/(S[adj]\NP))/NP   NP   S[adj]\NP
S[dcl]/(S[adj]\NP)             S[adj]\NP
S[dcl]

In this case, tha is of type (X/Y)/Z, and just applies forward to the Z on its right and then to the Y next on the right. It just so happens that Y is a non-atomic type, S[adj]\NP, so the NP inside it is never looked for separately.
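The derivation above can be sketched as two steps of forward application (X/Y followed by Y gives X). This is a minimal illustration, not how OpenCCG represents categories: the tuple encoding and the helper names fwd, bwd and forward_apply are assumptions made for the example.

```python
# Atomic categories are strings; complex ones are (result, slash, argument) tuples.
def fwd(result, argument):
    """Build the category result/argument."""
    return (result, "/", argument)


def bwd(result, argument):
    """Build the category result\\argument."""
    return (result, "\\", argument)


def forward_apply(functor, argument):
    """Forward application: X/Y followed by Y gives X, otherwise None."""
    if isinstance(functor, tuple) and functor[1] == "/" and functor[2] == argument:
        return functor[0]
    return None


ADJ_PRED = bwd("S[adj]", "NP")             # S[adj]\NP, the type of brèagha
THA = fwd(fwd("S[dcl]", ADJ_PRED), "NP")   # (S[dcl]/(S[adj]\NP))/NP

step1 = forward_apply(THA, "NP")           # tha + i
step2 = forward_apply(step1, ADJ_PRED)     # (tha i) + brèagha
print(step1)
print(step2)
```

The second step matches the whole of S[adj]\NP as a single argument, which is why only one NP is needed in the sentence.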

Now I’ve understood this I can worry about more complicated things.

Useful site full of worked English-language examples

Much of the literature on categorial grammar focuses on things that are difficult to handle in other frameworks and isn’t necessarily helpful if you want to find something simple. However, there are lots and lots of worked examples on the Groningen Meaning Bank Explorer. More about how it works here.

Three sorts of PP

Le means “with”, roughly, but if you want to say “with X”, there are three different ways of doing it.

  1. le Alasdair: “with Alasdair”. This is the form used before a noun phrase that doesn’t begin with a definite article.
  2. leis an nurs: “with the nurse”. Le becomes leis before a noun phrase beginning with a definite article.
  3. leam: “with me”. This is a PP all of its very own, and there’s one for each personal pronoun, including, confusingly, leis for “with him”.

So this means that for a full grammar we need to mark the NP with whether it begins with certain determiners. Leis, and friends gus, ris and anns don’t in fact go with all determiners in Gaelic. They go with gach “each”, as in Leis gach deagh dhùrachd “with every good wish” and mo “my” but not, say, numbers.

Let’s, then, provisionally type the forms of le as follows:

  • le: PP[le]/NP[AN−]
  • leis: PP[le]/NP[AN+]
  • leam: PP[le,1s]

Reminder: we need the features like le to keep track of what sort of preposition it is for agreement with words like toil (to like), and 1s to keep track of who it is liking what.
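The three cases can be sketched as a small function that picks the right form of le. The list of NP-initial words that select leis is an assumption pieced together from the post (forms of the definite article, gach and mo), and the key "3sm" is a hypothetical label alongside the 1s used above; only the pronoun forms quoted in the post are included.

```python
# Fused preposition + pronoun forms quoted above ("3sm" is a hypothetical label).
PRONOUN_FORMS = {"1s": "leam", "3sm": "leis"}

# NP-initial words assumed to select leis: article forms, gach and mo.
# A bare word list cannot tell the article an from possessive an ("their").
LEIS_TRIGGERS = {"an", "am", "a'", "na", "nan", "nam", "gach", "mo"}


def with_phrase(np=None, pronoun=None):
    """Return the 'with X' phrase for a noun phrase or a pronoun feature."""
    if pronoun is not None:
        return PRONOUN_FORMS[pronoun]
    first_word = np.split()[0].lower().replace("\u2019", "'")
    prep = "leis" if first_word in LEIS_TRIGGERS else "le"
    return f"{prep} {np}"


print(with_phrase("Alasdair"))
print(with_phrase("an nurs"))
print(with_phrase(pronoun="1s"))
```

In the grammar itself this choice would be carried by a feature on the NP rather than by inspecting its first word.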

More on this, with, I hope, a shorter temporal gap before the next post than this time round.

Why do we bother with grammatical frameworks?

Most natural languages, like English, French, Chukchi, Basque, Gaelic, Italian, Russian, Latgalian, Finnish, Tamil and so forth, can be reasonably well modelled by a context-free grammar, which is the sort of grammar that people write computer languages in. Parsers for these are ten-a-penny. They have to be, otherwise you couldn’t run C, Perl, PHP, Python, Haskell or whatever. So a question you might be asking is why people don’t use these parsers for natural languages, and instead go off and invent grammatical frameworks like HPSG, LFG, CCG and so on.

One important reason is agreement, by which I mean that verbs in English, say, agree for number and in a limited way for person. What does this mean in practice? Well, if you’re writing a context-free grammar to handle sentences like “The lady vanishes”, then you can’t just say:

S → NP VP

because that overgenerates. That would allow “The lady vanish”, “The ladies vanishes”, “I vanishes” and so on, because each of these has the form NP VP. “The lady” is an NP (noun phrase), as are “The ladies” and “I”. The rest of each of these sentences is a VP (verb phrase). So our grammar has to also say:

S → NP_3rdsg VP_3rdsg

S → NP_non3rdsg VP_non3rdsg

and the same applies to every rule you have in the grammar. Modern grammatical frameworks use feature structures to look after all of this. They let you insist that whatever features words have, like number (singular, plural, and in Slovene dual) or person (I, you, he/she), have to agree, so you can write rules like this:

S → NP VP

and let the lexicon, the collection of the words themselves, handle the details.
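To make that concrete, here is a much-simplified stand-in for feature structures: plain dictionaries and a membership check rather than real unification. The lexicon entries and the function name is_sentence are made up for the illustration.

```python
# Every person/number combination except third person singular.
NON_3SG = {(1, "sg"), (2, "sg"), (1, "pl"), (2, "pl"), (3, "pl")}

# The lexicon carries the agreement details; the rule itself stays S -> NP VP.
LEXICON = {
    "the lady":   {"cat": "NP", "person": 3, "number": "sg"},
    "the ladies": {"cat": "NP", "person": 3, "number": "pl"},
    "I":          {"cat": "NP", "person": 1, "number": "sg"},
    "vanishes":   {"cat": "VP", "agr": {(3, "sg")}},
    "vanish":     {"cat": "VP", "agr": NON_3SG},
}


def is_sentence(subject, predicate):
    """Apply the single rule S -> NP VP, checking agreement from the lexicon."""
    np, vp = LEXICON[subject], LEXICON[predicate]
    if np["cat"] != "NP" or vp["cat"] != "VP":
        return False
    return (np["person"], np["number"]) in vp["agr"]


for subject, predicate in [("the lady", "vanishes"), ("the lady", "vanish"),
                           ("the ladies", "vanishes"), ("I", "vanishes")]:
    print(subject, predicate, is_sentence(subject, predicate))
```

Adding a new verb or noun means adding a lexicon entry, not another pair of grammar rules.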

A first attempt at the copula

Having got OpenCCG working, we can now start doing what we’re here for. To say “Calum is a teacher”, or “I am a teacher”, you have to say the at-first-glance rather odd:

  • ‘S e tidsear a th’ann Calum.
  • ‘S e tidsear a th’annam.

The unwary might translate those as “It is a teacher that is in Calum” and “It is a teacher that is in me”, but really tha + ann means “there is”. annam is a preposition marked for person, which I don’t think I’ve mentioned before. I’ve kind of implemented this, but it does overgenerate like mad. Overgeneration is when your grammar allows sentences that aren’t grammatical.

copula.ccg contains the grammar so far. Here are some highlights:

Getting OpenCCG to work on the Mac

OpenCCG is a Java/Python toolkit for working on combinatory categorial grammar, so is ideal for this exercise.

Update 2014-07-14: if you’re using OpenCCG 0.95, the latest version, on Mac OS X 10.6.8, then as long as you have Python 2.x and Java installed and you follow the build instructions exactly, it should Just Work.

It comes with instructions for getting it to work under Unix and Windows, but on the Mac, or at least on the one I’m using, there’s a small amount of fiddling needed. Here it is:

  • You may not already have a recent version of Python, which you can get from http://www.python.org/download/releases/2.7.1/ as a .dmg, which has a friendly hand-holdy installation process.
  • Environment variables:
    • cd to the directory that you’ve downloaded openccg to, type pwd, and set OPENCCG_HOME to it using export.
    • export JAVA_HOME=/usr (this surprised me, but it works on Mac OS X 10.4.11)
    • export PATH="$PATH:$OPENCCG_HOME/bin"
  • You will also need to fetch lex.py and yacc.py from sourceforge: http://openccg.cvs.sourceforge.net/viewvc/openccg/openccg/bin/ and put them in the bin folder in your OpenCCG installation.
  • If you then follow the instructions in the README file and get an error about the wrong class number, you’ll have to rebuild it. Try typing ant at the command line and see what happens. I don’t remember installing ant, which means that it might come on the Mac by default. If not, you’ll have to go to http://ant.apache.org/. Good luck! Update 2014-07-13: do NOT attempt to build by typing ‘ant’ at the command line. This does not work. Make sure you type ‘ccg-build’. Only issue the ‘ant’ command if you want to see whether ant is installed on your machine.

It comes with some minuscule test grammars including Basque and Turkish.

But what can we tell from the 100 top word tokens?

  • 26 are prepositions of some sort
  • 23 are nouns
  • 10 are conjunctions
  • 10 are verbs
  • 5 are articles
  • 7 are adjectives
  • 7 are pronouns
  • 4 are preverbal particles
  • 2 are adverbs

The number of prepositions is unusually high and indicates that PPs (prepositional phrases) do an awful lot of the work in a Gaelic sentence. The number of verbs seems pretty low, and in fact many of them are forms of the verbs “to be” that we’ve seen earlier. This is because the verb “to be” typically does much of the rest of the work. More examples of this to come.

The article doesn’t mark gender (of which there are two, masculine and feminine) but it does mark the two numbers (singular and plural). So how come there are five articles listed?

Well, an is the singular, na does double duty for “of the” and “the” plural. nan does “of the” plural. Before a labial consonant, an becomes am and nan becomes nam. This warns us that our system will have to take into account initial consonants to get this right.
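The labial rule can be sketched in a few lines. This assumes that “labial” here means an initial b, f, m or p, and it deliberately ignores lenition and the other article forms; the function name assimilate is made up for the example.

```python
# Initial consonants assumed to count as labial for this rule.
LABIALS = ("b", "f", "m", "p")


def assimilate(article, noun):
    """Turn an into am, and nan into nam, before a noun starting with a labial.

    Lenited nouns and the a' form of the article are not handled here.
    """
    if article in ("an", "nan") and noun.lower().startswith(LABIALS):
        return article[:-1] + "m"
    return article


print(assimilate("an", "balach"), "balach")
print(assimilate("an", "cat"), "cat")
print(assimilate("nan", "balach"), "balach")
```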

There are also some duplicates. “Scotland” is Alba normally and h-Alba after na, as in Banca na h-Alba “Bank of Scotland”. duine (person) has a weird-looking plural, daoine. dùthaich has the genitive form dùthcha. baile (town) has a lenited form (I will come to this, but not today) bhaile. So we see that Gaelic is not only morphologically rich, but instead of adding case endings and whatnot to the ends of words, like in Hungarian or Turkish, modifies the insides of words instead.

That will do for the now.