Hope, expectation, responsibility

Even though bi is the verb for “to be”, you can’t usually use it with two noun phrases, certainly not to say that one of them is the other. But there is a class of nouns that go quite happily with another noun as arguments of bi. I think what might be going on is that they’re being used adverbially, like an diugh (today) or an làthair (present). Let’s take this phrase from the Scotsman (source) a few years ago (slightly edited because Johnston Press have mislaid their diacritics):

Thuirt am Ministear a tha an urra ris a’ Ghàidhlig, Peter Peacock:

“Said the minister responsible for Gaelic, Peter Peacock:” is what this means. It’s a clefted construction, as is so often the case in Gaelic and Irish. Tha am Ministear an urra ris a’ Ghàidhlig “The minister is responsible for Gaelic” would be the unclefted version.

Another example from the same piece:

Tha mi an dòchas gum bi duine làidir ann a sheasas suas riutha, a sheasas airson na Gàidhlig, airson nan Gàidheal ‘s an aghaidh an riaghaltais ma tha sin a dhìth.

“I hope that there will be strong people who will stand up for them, stand for Gaelic, for the Gaels and against the government if need be.” This is unclefted and clearer than the previous sentence. At the very beginning we have tha, mi, and an dòchas gu… as the verb and two noun phrases.

And one from the BBC:

Chuir ministear eile aig Eaglais na h-Alba fios chun na h-eaglaise gu bheil e an dùil fàgail air sgàth cùis nam ministearan gèidh.

“Another minister in the Church of Scotland has sent word to the church that he expects to leave on account of the matter of gay clergy.” Here we have bheil, the dependent form of bi, followed by e, “he” and an dùil fàgail, “the expectation to leave”.

So that means that bi fits the following patterns (out of my head and double-checked with William Lamb’s Scottish Gaelic):

  1. bi + NP + PP: for expressing locations, for possession, for many verbal constructions if we take ag/a’ and friends to be prepositions (otherwise there is a 1b: bi + NP + AspP), and for linking two nouns: tha mi nam oileanach and ‘s e oileanach a th’annam
  2. bi + NP + ADJ: tha sinn toilichte, tha i brèagha and so on
  3. bi + NP + ADV[loc]: tha an cat a-staigh
  4. bi + NP + NP[dòchas]: the examples we’ve seen above and a few more. Wilson McLeod on Twitter has helpfully pointed out that dùil and urra (as shown above), eisimeil and crochadh are in this set of nouns.

I wonder whether there are any more? I will keep looking.

Resources present and future

Excitingly, William Lamb at the University of Edinburgh tells me in the comments on this earlier post that he has been funded by Bòrd na Gàidhlig to work on a tagset and corpus for Scottish Gaelic.

I have been delighted to be pointed to his 2003 Scottish Gaelic (2nd edn, Lincom Europa, Munich), which is exactly the sort of book I have been looking for. Worth careful study.

Ambiguity everywhere

Much of the basic grammatical machinery of Gaelic consists of overloaded words. This is nothing unusual, of course; in English, for example, to is both a preposition and the infinitive marker, but there seems to be an awful lot of it going on in Gaelic. One of the more striking examples is an. This can be:

  • the definite article: an t-eilean
  • an interrogative particle: An do chòrd e riut?
  • the interrogative form of is: An toil leat ball-coise?
  • a possessive pronoun (their): an càr

Do has several meanings too:

  • a possessive pronoun (your): do bhaidhseagal
  • a preposition: do Ghlaschu
  • a past-tense marking particle: An do chòrd e riut?

A has at least the following meanings and there may well be some I’ve missed:

  • numerical particle: a h-aon
  • vocative particle: a Mhàiri
  • the infinitive particle: an uinneag a dhùnadh
  • an interrogative particle: A bheil thu a’ dannsadh?
  • two possessive pronouns (his and her): a chàr, a h-athair
  • relative particle: Dè an t-ainm a tha ort?

not to mention its homophonous friend a’:

  • definite article: anns a’ chidsin
  • the participle particle: Tha mi a’ dol

If I want to start part-of-speech-tagging Gaelic text, as a preliminary to building a grammar, I’m going to need to write some guidelines as to when each of these words is what.
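As a starting point for such guidelines, the lists above can be written down as a lookup table from word form to candidate readings. This is only a sketch: the tag names are invented here for illustration and are not an established tagset, and the table covers just the four words discussed.

```python
# Candidate readings for the overloaded words listed above.
# The tag names are hypothetical labels made up for this sketch.
CANDIDATE_TAGS = {
    "an": ["definite_article", "interrogative_particle",
           "copula_interrogative", "possessive_3pl"],
    "do": ["possessive_2sg", "preposition", "past_particle"],
    "a": ["numerical_particle", "vocative_particle", "infinitive_particle",
          "interrogative_particle", "possessive_3sg_masc",
          "possessive_3sg_fem", "relative_particle"],
    "a’": ["definite_article", "participle_particle"],
}


def candidate_tags(token):
    """Return every reading listed for a token, or [] if it is not in the table."""
    # Treat a plain ASCII apostrophe the same as the typographic one.
    key = token.lower().replace("'", "’")
    return CANDIDATE_TAGS.get(key, [])


def is_ambiguous(token):
    """True when the table lists more than one reading for the token."""
    return len(candidate_tags(token)) > 1
```

A tagger would still need context to pick among the candidates; the table only records what the choices are.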

 

It’s fine

This confused me, so I mention it in case it confuses anyone else.

If predicative adjectives have type S[adj]\NP (because they come after the noun), NPs have type NP and the predicative copula has type (S[dcl]/(S[adj]\NP))/NP, then how do we cope with sentences that only have one NP? Where I went astray was assuming that if you have a word of type X/Y, then there has to be a Y somewhere in that sentence.

Not true! Tha i brèagha “it’s fine” (talking about the weather) is a good and simple example.

Tha                        i     brèagha
V                          NP    ADJ
(S[dcl]/(S[adj]\NP))/NP    NP    S[adj]\NP
S[dcl]/(S[adj]\NP)               S[adj]\NP
S[dcl]

In this case, tha is of type (X/Y)/Z: forward application consumes the Z to its right, and the resulting X/Y then consumes the Y next to the right. It just so happens that Y is a non-atomic type.
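The derivation above can be checked mechanically. The following is a minimal sketch, assuming categories are written as plain strings and only forward slashes are ever applied; the function names are made up for illustration and have nothing to do with OpenCCG’s own machinery.

```python
def strip_outer(cat):
    """Remove one pair of parentheses if they wrap the whole category."""
    if cat.startswith("(") and cat.endswith(")"):
        depth = 0
        for i, ch in enumerate(cat):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(cat) - 1:
                    return cat  # first "(" closes early: not an outer pair
        return cat[1:-1]
    return cat


def split_forward(cat):
    """Split X/Y at its rightmost top-level slash; None unless it is a forward slash."""
    cat = strip_outer(cat)
    depth = 0
    for i in range(len(cat) - 1, -1, -1):
        ch = cat[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
        elif depth == 0 and ch in "/\\":
            if ch == "/":
                return strip_outer(cat[:i]), strip_outer(cat[i + 1:])
            return None  # the main functor looks leftwards
    return None


def forward_apply(left, right):
    """X/Y applied to Y gives X; return None if the rule does not fit."""
    parts = split_forward(left)
    if parts is not None and parts[1] == strip_outer(right):
        return parts[0]
    return None


tha = r"(S[dcl]/(S[adj]\NP))/NP"
i = "NP"
breagha = r"S[adj]\NP"

step1 = forward_apply(tha, i)          # tha consumes the NP
step2 = forward_apply(step1, breagha)  # the result consumes the adjective
print(step1)
print(step2)
```

The point the sketch makes is that the NP mentioned inside S[adj]\NP is never looked for as a separate word: the whole of S[adj]\NP is matched as a single argument.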

Now I’ve understood this I can worry about more complicated things.

Three sorts of PP

Le means “with”, roughly, but if you want to say “with X”, there are three different ways of doing it.

  1. le Alasdair: “with Alasdair”. This is the form used before a noun phrase that doesn’t begin with a definite article.
  2. leis an nurs: “with the nurse”. Le becomes leis before a noun phrase beginning with a definite article.
  3. leam: “with me”. This is a PP all of its very own, and there’s one for each personal pronoun, including, confusingly, leis for “with him”.

So this means that for a full grammar we need to mark the NP with whether it begins with certain determiners. Leis, and friends gus, ris and anns don’t in fact go with all determiners in Gaelic. They go with gach “each”, as in Leis gach deagh dhùrachd “with every good wish” and mo “my” but not, say, numbers.

Let’s, then, provisionally type the forms of le as follows:

  • le: PP[le]/NP[AN−]
  • leis: PP[le]/NP[AN+]
  • leam: PP[le,1s]

Reminder: we need the features like le to keep track of what sort of preposition it is for agreement with words like toil (to like), and 1s to keep track of who it is liking what.
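To make the three-way choice concrete, here is a rough sketch of how the right form of le might be picked. It assumes the only thing that matters is the first word of the noun phrase, and the set of triggering words is taken straight from the discussion above (article forms, gach and mo); it is illustrative rather than exhaustive, and the function names are hypothetical.

```python
# First words that, per the discussion above, call for leis rather than le.
# Assumption: checking the first word of the NP is enough.
LEIS_TRIGGERS = {"an", "am", "a’", "na", "nan", "nam", "gach", "mo"}

# Pronominal forms mentioned in the text; the other persons are left out
# rather than guessed at.
PRONOMINAL_FORMS = {"1s": "leam", "3sm": "leis"}


def le_with_np(noun_phrase):
    """Return 'le NP' or 'leis NP' depending on how the NP begins."""
    first_word = noun_phrase.split()[0].lower().replace("'", "’")
    preposition = "leis" if first_word in LEIS_TRIGGERS else "le"
    return f"{preposition} {noun_phrase}"


def le_with_pronoun(person):
    """Return the one-word PP for a person code such as '1s'."""
    return PRONOMINAL_FORMS[person]


print(le_with_np("Alasdair"))
print(le_with_np("an nurs"))
print(le_with_pronoun("1s"))
```

In the grammar itself this choice would be made by the feature on the NP rather than by string matching, which is exactly what the provisional types are for.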

More on this, with, I hope, a shorter temporal gap before the next post than this time round.

Why do we bother with grammatical frameworks?

Most natural languages, like English, French, Chukchi, Basque, Gaelic, Italian, Russian, Latgalian, Finnish, Tamil and so forth, can be reasonably well modelled by a context-free grammar, which is the sort of grammar that people write computer languages in. Parsers for these are ten-a-penny. They have to be, otherwise you couldn’t run C, Perl, PHP, Python, Haskell or whatever. So a question you might be asking is why people don’t use these parsers for natural languages, and instead go off and invent grammatical frameworks like HPSG, LFG, CCG and so on.

One important reason is agreement, by which I mean that verbs in English, say, agree for number and in a limited way for person. What does this mean in practice? Well, if you’re writing a context-free grammar to handle sentences like “The lady vanishes”, then you can’t just say:

S → NP VP

because that overgenerates. That would allow “The lady vanish”, “The ladies vanishes”, “I vanishes” and so on, because each of these has the form NP VP. “The lady” is an NP (noun phrase), as are “The ladies” and “I”. The rest of these sentences are all VPs (verb phrases). So our grammar has to also say:

S → NP_3rdsg VP_3rdsg

S → NP_non3rdsg VP_non3rdsg

and the same applies to every rule you have in the grammar. Modern grammatical frameworks use feature structures to look after all of this. They let you insist that whatever features words have, like number (singular, plural, and in Slovene dual) or person (I, you, he/she), must agree, so you can write rules like this:

S → NP VP

and let the lexicon, the collection of the words themselves, handle the details.
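Here is a toy illustration of that idea, assuming feature structures are flat dictionaries and that a word may list several alternative agreement options. Real frameworks use much richer structures; the lexicon entries and function names below are made up for the example.

```python
def unify(a, b):
    """Merge two flat feature dicts; return None if a shared feature clashes."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


# Each word has a category and a list of alternative agreement options.
LEXICON = {
    "the lady": ("NP", [{"person": 3, "number": "sg"}]),
    "the ladies": ("NP", [{"person": 3, "number": "pl"}]),
    "I": ("NP", [{"person": 1, "number": "sg"}]),
    "vanishes": ("VP", [{"person": 3, "number": "sg"}]),
    # "vanish" is everything except third person singular.
    "vanish": ("VP", [{"number": "pl"},
                      {"person": 1, "number": "sg"},
                      {"person": 2, "number": "sg"}]),
}


def is_sentence(subject, predicate):
    """One rule, S -> NP VP, with agreement left entirely to the lexicon."""
    subject_cat, subject_options = LEXICON[subject]
    predicate_cat, predicate_options = LEXICON[predicate]
    if (subject_cat, predicate_cat) != ("NP", "VP"):
        return False
    return any(unify(s, p) is not None
               for s in subject_options
               for p in predicate_options)


for subject in ("the lady", "the ladies", "I"):
    for predicate in ("vanishes", "vanish"):
        print(subject, predicate, is_sentence(subject, predicate))
```

The grammar has a single rule however many features are added; only the lexical entries grow.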

A first attempt at the copula

Having got OpenCCG working, we can now start doing what we’re here for. To say “Calum is a teacher”, or “I am a teacher”, you have to say the at-first-glance rather odd:

  • ‘S e tidsear a th’ann Calum.
  • ‘S e tidsear a th’annam.

The unwary might translate those as “It is a teacher that is in Calum” and “It is a teacher that is in me”, but really tha + ann means “there is”. annam is a preposition marked for person, which I don’t think I’ve mentioned before. I’ve kind of implemented this, but it does overgenerate like mad. Overgeneration is when your grammar allows sentences that aren’t grammatical.

copula.ccg contains the grammar so far.

But what can we tell from the 100 top word tokens?

  • 26 are prepositions of some sort
  • 23 are nouns
  • 10 are conjunctions
  • 10 are verbs
  • 5 are articles
  • 7 are adjectives
  • 7 are pronouns
  • 4 are preverbal particles
  • 2 are adverbs

The number of prepositions is unusually high and indicates that PPs (prepositional phrases) do an awful lot of the work in a Gaelic sentence. The number of verbs seems pretty low, and in fact many of them are forms of the verbs “to be” that we’ve seen earlier. This is because the verb “to be” typically does much of the rest of the work. More examples of this to come.

The article doesn’t mark gender (of which there are two, masculine and feminine) but it does mark the two numbers (singular and plural). So how come there are five articles listed?

Well, an is the singular, na does double duty for “of the” and “the” plural. nan does “of the” plural. Before a labial consonant, an becomes am and nan becomes nam. This warns us that our system will have to take into account initial consonants to get this right.
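The labial rule on its own is easy to sketch. The snippet below assumes that the labial consonants are b, f, m and p, and deliberately ignores everything else the article does (lenition, gender, the t- and h- prefixes), so it is a simplification rather than a full account; the function name is hypothetical.

```python
# Assumption: these are the initial consonants that count as labial.
LABIALS = ("b", "f", "m", "p")


def assimilate_article(article, noun):
    """Turn an/nan into am/nam when the noun starts with a labial consonant."""
    if article in ("an", "nan") and noun.lower().startswith(LABIALS):
        return article[:-1] + "m"
    return article


print(assimilate_article("an", "ministear"), "ministear")
print(assimilate_article("nan", "ministearan"), "ministearan")
print(assimilate_article("an", "cat"), "cat")
```

Even this much shows why the grammar cannot treat the article as a single fixed string: its form depends on the word that follows.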

There are also some duplicates. “Scotland” is Alba normally and h-Alba after na, as in Banca na h-Alba “Bank of Scotland”. duine (person) has a weird-looking plural, daoine. dùthaich has the genitive form dùthcha. baile (town) has a lenited form (I will come to this, but not today) bhaile. So we see that Gaelic is not only morphologically rich, but instead of adding case endings and whatnot to the ends of words, like in Hungarian or Turkish, modifies the insides of words instead.

That will do for the now.

What kind of language is this? The top 100 word tokens in Gaelic

I downloaded all of the Gaelic wikipedia. This is not hard. It is at http://dumps.wikimedia.org/gdwiki/latest/ and you probably want gdwiki-latest-pages-articles.xml.bz2, which contains all the text.

Now I can do word-token counts on it, using terrible code like the following:

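#!/usr/bin/perl
use strict;
use warnings;

my %count;    # token => number of occurrences
while (my $line = <>) {
    # Split on an ad hoc set of separators: whitespace, brackets, ".", "=", "," and "?"
    for my $token (split /[\s\)\(\.=\,\?]/, $line) {
        next if $token eq '';    # adjacent separators yield empty fields; skip them
        $count{lc $token}++;
    }
}
# Print tokens from most to least frequent
for my $key (sort { $count{$b} <=> $count{$a} } keys %count) {
    print " $key $count{$key}\n";
}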

Note the entirely ad hoc collection of characters to split on.
The list is here, and you will see that the first noun is baile (town) at number 25, which tells you more about Wikipedia than it does about Gaelic. You will also see that an, e and is are, as we have seen, ambiguous between parts of speech, and that I can’t quite work out what to do with a.