Interrogative frequencies in DASG

One aspect of Gaelic I want to look at more closely is interrogatives. Just as all the wh- words in English (who, when, why, what, how) go to the front of the sentence, so do all the c- words in Gaelic and the word order in the rest of the sentence changes as well. This is not universal, however. In Chinese, one simply substitutes the word for ‘what’ in the ordinary sentence order, just as when we’re particularly surprised in English we might say “You ate what?”.

In order to see how they work exactly, we need example sentences, so I’ve been looking in DASG. One easy first step is to look at frequencies in this table:

Interrogative  Count  English  Observations
cò             9122   who      noisy; lots of prefixes and parts of words
ciod           4587   what
cia            2363   how      also cia mar in older texts, cia fhad ‘how long’, cia mhòr ‘how big’
dè             403    what     also ‘God’
ciamar         273    how
càit           182    where    also genitive of cat meaning ‘cat’
carson         133    why
càite          90     where
cuin           59     when
cuine          15     when

These are the results of accent-insensitive searches, as the older texts haven’t had their spelling modernized or made consistent. The results surprised me a great deal, for a number of reasons. Firstly, ciod ‘what’, which I don’t recall seeing terribly often in the present day, is the most numerous interrogative, mostly occurring in a single document, a history of Scotland. One of the very first words you learn in Gaelic is its modern counterpart dè, which only has about 200 (judged by eye) instances as an interrogative in DASG. This is a similar number to càit(e), carson, cuin(e) and ciamar: ‘where’, ‘why’, ‘when’ and ‘how’. Secondly, there is the enormous number of hits for cia ‘how’, which on a cursory inspection are often exclamations (‘how swift’, ‘how long’, ‘how horrible’) or an old spelling of ciamar, in addition to the more familiar cia mheud ‘how many’. Thirdly, nearly all of the instances of dè meaning ‘what’ are from a single work, Saoghal Bana-mharaiche, describing the Gaelic from the coast of Easter Ross.
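To make “accent-insensitive” concrete, here is a minimal Python sketch of how such a count could be done on plain text. It is not how DASG itself searches: the function names and the sample sentences are made up for illustration, and it matches whole words only, which sidesteps the prefix noise noted for cò but still cannot tell dè ‘what’ from its homographs.

```python
import re
import unicodedata
from collections import Counter

# The interrogatives from the table, written without accents because
# they are compared against accent-stripped text.
INTERROGATIVES = {"co", "ciod", "cia", "de", "ciamar", "cait", "carson",
                  "caite", "cuin", "cuine"}


def strip_accents(text):
    """Return text with combining marks removed, e.g. 'càite' -> 'caite'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_interrogatives(text):
    """Count whole-word, accent-insensitive hits for each interrogative."""
    words = re.findall(r"\w+", strip_accents(text.lower()))
    return Counter(w for w in words if w in INTERROGATIVES)


# Invented sample with grave, acute and unaccented spellings mixed together.
sample = "Dè tha thu ag iarraidh? Càite a bheil thu? Caite an robh e? Dé tha sin?"
print(count_interrogatives(sample))
```

Stripping accents before matching is what lets dè, dé and de fall together, which is also exactly why the table picks up ‘God’ and the genitive of cat.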

I’ll leave you with a new meaning I’d never seen before for gu. This can be gu the preposition, gu the subordinator (as in gu bheil), gu the aspect marker or gu the adverbializer, but Gu dè tha thu? from DASG31, Ugam agus bhuam, is clearly none of these. As explained here, what is going on is this: the Gaelic for ‘what’ used to be ciod e, like the Irish cad é, and over time this became dè. Gu dè is a variant of this. It’s another one of those pesky multiword expressions.

[Edit 2015-01-03 to clarify reason for looking at interrogatives and add another meaning of gu.]

Three sorts of PP

Le means “with”, roughly, but if you want to say “with X”, there are three different ways of doing it.

  1. le Alasdair: “with Alasdair”. This is the form used before a noun phrase that doesn’t begin with a definite article.
  2. leis an nurs: “with the nurse”. Le becomes leis before a noun phrase beginning with a definite article.
  3. leam: “with me”. This is a PP all of its very own, and there’s one for each personal pronoun, including, confusingly, leis for “with him”.

So this means that for a full grammar we need to mark the NP with whether it begins with certain determiners. Leis, and friends gus, ris and anns, don’t in fact go with all determiners in Gaelic. They go with gach “each”, as in Leis gach deagh dhùrachd “with every good wish”, and mo “my”, but not, say, numbers.

Let’s, then, provisionally type the forms of le as follows:

  • le: PP_le/NP_AN−
  • leis: PP_le/NP_AN+
  • leam: PP_le,1s

Reminder: we need features like le to keep track of what sort of preposition it is, for agreement with words like toil (to like), and 1s to keep track of who is doing the liking.
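As a rough illustration of that typing, here is a small Python sketch in which the NP carries the AN feature and the choice between le and leis falls out of it. The class, the function and the feature label 3sm are hypothetical names of mine, and only the forms mentioned above are listed.

```python
from dataclasses import dataclass


@dataclass
class NP:
    text: str
    an_plus: bool  # True if the NP begins with one of the leis-triggering determiners


# Pronominal forms are whole PPs in themselves. Only the two forms named in
# this post are listed; the keys are assumed person/number/gender labels.
PRONOMINAL = {"1s": "leam", "3sm": "leis"}


def pp_le(np=None, person=None):
    """Build a 'with X' PP from either an NP or a person feature."""
    if person is not None:
        return PRONOMINAL[person]
    prep = "leis" if np.an_plus else "le"
    return f"{prep} {np.text}"


print(pp_le(NP("Alasdair", an_plus=False)))  # le Alasdair
print(pp_le(NP("an nurs", an_plus=True)))    # leis an nurs
print(pp_le(person="1s"))                    # leam
```

The point of putting the feature on the NP rather than on the preposition is that gus, ris and anns could then reuse it unchanged.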

More on this, with, I hope, a shorter temporal gap before the next post than this time round.

Why do we bother with grammatical frameworks?

Most natural languages, like English, French, Chukchi, Basque, Gaelic, Italian, Russian, Latgalian, Finnish, Tamil and so forth, can be reasonably well modelled by a context-free grammar, which is the sort of grammar that people write computer languages in. Parsers for these are ten-a-penny. They have to be, otherwise you couldn’t run C, Perl, PHP, Python, Haskell or whatever. So a question you might be asking is why people don’t use these parsers for natural languages, and instead go off and invent grammatical frameworks like HPSG, LFG, CCG and so on.

One important reason is agreement, by which I mean that verbs in English, say, agree for number and in a limited way for person. What does this mean in practice? Well, if you’re writing a context-free grammar to handle sentences like “The lady vanishes”, then you can’t just say:

S → NP VP

because that overgenerates. That would allow “The lady vanish”, “The ladies vanishes”, “I vanishes” and so on, because each of these has the form NP VP. “The lady” is an NP (noun phrase), as are “The ladies” and “I”. The rest of each sentence is a VP (verb phrase). So our grammar has to also say:

S → NP_3rdsg VP_3rdsg

S → NP_non3rdsg VP_non3rdsg

and the same applies to every rule you have in the grammar. Modern grammatical frameworks use feature structures to look after all of this. They let you insist that whatever features words have, like number (singular, plural and, in Slovene, dual) or person (I, you, he/she), have to agree, so you can write rules like this:

S → NP VP

and let the lexicon, the collection of the words themselves, handle the details.
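Here is a toy Python sketch of that idea, not the machinery of any particular framework: one rule for S, with the agreement facts kept in the lexicon as feature dictionaries. The lexicon entries and function names are my own, and the unification is the simplest possible kind (flat features, no nesting).

```python
# Each word maps to a list of possible feature bundles.
NP_LEXICON = {
    "the lady":   [{"person": 3, "number": "sg"}],
    "the ladies": [{"person": 3, "number": "pl"}],
    "I":          [{"person": 1, "number": "sg"}],
}

VP_LEXICON = {
    "vanishes": [{"person": 3, "number": "sg"}],
    # "vanish" is everything except third person singular.
    "vanish":   [{"number": "pl"},
                 {"person": 1, "number": "sg"},
                 {"person": 2, "number": "sg"}],
}


def unify(a, b):
    """Merge two feature dicts, or return None if a shared feature clashes."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def is_sentence(np, vp):
    """The single rule: an NP and a VP make an S if their features unify."""
    return any(
        unify(n, v) is not None
        for n in NP_LEXICON.get(np, [])
        for v in VP_LEXICON.get(vp, [])
    )


print(is_sentence("the lady", "vanishes"))    # True
print(is_sentence("the lady", "vanish"))      # False
print(is_sentence("the ladies", "vanishes"))  # False
print(is_sentence("I", "vanish"))             # True
```

Adding a new feature, such as gender, means touching the lexicon only; the rule itself stays as it is.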

Getting OpenCCG to work on the Mac

OpenCCG is a Java/Python toolkit for working on combinatory categorial grammar, so it is ideal for this exercise.

Update 2014-07-14: if you’re using OpenCCG 0.95, the latest version, on Mac OS X 10.6.8, then as long as you have Python 2.x and Java installed and you follow the build instructions exactly, it should Just Work.

It comes with instructions for getting it to work under Unix and Windows, but on the Mac, or at least on the one I’m using, there’s a small amount of fiddling needed. Here it is:

  • You may not already have a recent version of Python, which you can get from http://www.python.org/download/releases/2.7.1/ as a .dmg, which has a friendly hand-holdy installation process.
  • Environment variables:
    • export JAVA_HOME=/usr (this surprised me, but it works on Mac OS X 10.4.11)
    • cd to the directory that you’ve downloaded openccg to, type pwd, and set OPENCCG_HOME to it using export.
    • export PATH="$PATH:$OPENCCG_HOME/bin" (do this after setting OPENCCG_HOME, and use straight quotes, not curly ones)
  • You will also need to fetch lex.py and yacc.py from SourceForge: http://openccg.cvs.sourceforge.net/viewvc/openccg/openccg/bin/ and put them in the bin folder in your OpenCCG installation.
  • If you then follow the instructions in the README file and get an error about the wrong class number, you’ll have to rebuild it. Try typing ant at the command line and see what happens. I don’t remember installing ant, which means that it might come on the Mac by default. If not, you’ll have to go to http://ant.apache.org/. Good luck! Update 2014-07-13: do NOT attempt to build by typing ‘ant’ at the command line. This does not work. Make sure you type ‘ccg-build’. Only issue the ‘ant’ command if you want to see whether ant is installed on your machine.

It comes with some minuscule test grammars including Basque and Turkish.

But what can we tell from the 100 top word tokens?

  • 26 are prepositions of some sort
  • 23 are nouns
  • 10 are conjunctions
  • 10 are verbs
  • 5 are articles
  • 7 are adjectives
  • 7 are pronouns
  • 4 are preverbal particles
  • 2 are adverbs

The number of prepositions is unusually high and indicates that PPs (prepositional phrases) do an awful lot of the work in a Gaelic sentence. The number of verbs seems pretty low, and in fact many of them are forms of the verbs “to be” that we’ve seen earlier. This is because the verb “to be” typically does much of the rest of the work. More examples of this to come.

The article doesn’t mark gender (of which there are two, masculine and feminine) but it does mark the two numbers (singular and plural). So how come there are five articles listed?

Well, an is the singular, na does double duty for “of the” and “the” plural. nan does “of the” plural. Before a labial consonant, an becomes am and nan becomes nam. This warns us that our system will have to take into account initial consonants to get this right.
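A minimal Python sketch of that last rule, assuming the labials in question are b, f, m and p. The function name is mine, and it deliberately ignores lenited nouns and the other shapes the article can take, so it is a statement of the problem more than a solution.

```python
# Assumed set of labial initials that turn an/nan into am/nam.
LABIALS = ("b", "f", "m", "p")


def article_form(base, noun):
    """Return am/nam instead of an/nan before a labial-initial noun."""
    if base not in ("an", "nan"):
        return base  # na and anything else is left alone in this sketch
    if noun.lower().startswith(LABIALS):
        return base[:-1] + "m"
    return base


print(article_form("an", "baile"), "baile")  # am baile
print(article_form("an", "duine"), "duine")  # an duine
print(article_form("nan", "daoine"))         # nan
```

Even this tiny version needs to look at the noun before it can choose the article, which is the dependency the grammar will have to encode.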

There are also some duplicates. “Scotland” is Alba normally and h-Alba after na, as in Banca na h-Alba “Bank of Scotland”. duine (person) has a weird-looking plural, daoine. dùthaich (country) has the genitive form dùthcha. baile (town) has a lenited form (I will come to this, but not today) bhaile. So we see that Gaelic is not only morphologically rich but, instead of adding case endings and whatnot to the ends of words as Hungarian or Turkish do, it modifies the insides of words.

That will do for the now.

What kind of language is this? The top 100 word tokens in Gaelic

I downloaded all of the Gaelic Wikipedia. This is not hard. It is at http://dumps.wikimedia.org/gdwiki/latest/ and you probably want gdwiki-latest-pages-articles.xml.bz2, which contains all the text.

Now I can do word-token counts on it, using terrible code like the following:

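#!/usr/bin/perl
use strict;
use warnings;

my %count;
while (my $line = <>) {
    # Split on an ad hoc set of separators: whitespace ) ( . = , ?
    for my $token (split /[\s)(.=,?]/, $line) {
        next if $token eq '';    # adjacent separators yield empty strings
        $count{lc $token}++;
    }
}

# Print tokens from most to least frequent
for my $token (sort { $count{$b} <=> $count{$a} } keys %count) {
    print " $token $count{$token}\n";
}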

Note the entirely ad hoc collection of characters to split on.
The list is here, and you will see that the first noun is baile (town) at number 25, which tells you more about Wikipedia than it does about Gaelic. But also that an, e and is are, as we have seen, ambiguous between parts of speech, and that I can’t quite work out what to do with a.
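For anyone who would rather not unpack the dump first, here is a rough Python sketch of the same count that reads the .bz2 file directly. It is not the script that produced the list: the function names are mine, it uses the same ad hoc separators as the Perl, and like the Perl it makes no attempt to strip the XML or wiki markup.

```python
import bz2
import re
from collections import Counter

# Same ad hoc separator set as the Perl script: whitespace ) ( . = , ?
SEPARATORS = re.compile(r"[\s)(.=,?]")


def count_tokens(lines):
    """Count lowercased tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tok.lower() for tok in SEPARATORS.split(line) if tok)
    return counts


def count_dump(path):
    """Count tokens in a bz2-compressed dump without unpacking it first."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return count_tokens(handle)
```

Calling count_dump("gdwiki-latest-pages-articles.xml.bz2").most_common(100) should then give a top-100 list in the same spirit as the one above, though the exact numbers will depend on the date of the dump.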