What kind of language is this? The top 100 word tokens in Gaelic

I downloaded all of the Gaelic Wikipedia. This is not hard. It is at http://dumps.wikimedia.org/gdwiki/latest/ and you probably want gdwiki-latest-pages-articles.xml.bz2, which contains all the text.

Now I can do word-token counts on it, using terrible code like the following:

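#!/usr/bin/perl -w
use strict;
my %list;
while (<>) {
    my @tokens = split(/[\s\)\(\.=\,\?]/);    # ad hoc delimiter set
    for my $token (@tokens) {
        next if $token eq '';                 # skip empty fields
        $list{$token}++;
    }
}
# Print tokens by descending count.
foreach my $key (sort { $list{$b} <=> $list{$a} } (keys %list)) {
    print " $key $list{$key}\n";
}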

Note the entirely ad hoc collection of characters to split on.
The list is here, and you will see that the first noun is baile (town) at number 25, which tells you more about Wikipedia than it does about Gaelic. But also that an, e and is are, as we have seen, ambiguous between parts of speech, and that I can’t quite work out what to do with a.
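For anyone who would rather not unpack the dump first, here is a rough sketch of the same count in Python rather than Perl. It is not the script I actually ran: the function names count_tokens and top_tokens_from_dump are made up for illustration, the path is wherever you saved the file, and I am assuming the dump is UTF-8. The delimiter set is the same ad hoc one as above.

```python
import bz2
import re
from collections import Counter

# Same ad hoc delimiter set as the Perl above: whitespace, ( ) . = , ?
DELIMITERS = re.compile(r"[\s)(.=,?]")

def count_tokens(lines):
    """Count word tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # re.split leaves empty strings between adjacent delimiters; drop them.
        counts.update(tok for tok in DELIMITERS.split(line) if tok)
    return counts

def top_tokens_from_dump(path, n=100):
    """Stream a .bz2 dump (assumed UTF-8) and return the n most frequent tokens."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return count_tokens(handle).most_common(n)

if __name__ == "__main__":
    sample = ["Tha mi sgìth.", "Chan eil mi sgìth, a bheil thu sgìth?"]
    for token, count in count_tokens(sample).most_common(2):
        print(token, count)
```

Note that this still counts XML markup and wiki syntax as tokens, exactly as the Perl does, so the usual caveats about the resulting list apply.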

To be and to be (2)

And there is another verb “to be”, like this:

  • Positive: Is mise Calum. (I am Calum.)
  • Interrogative: An tusa Ealasaid? (Are you Elizabeth?)
  • Negative: Cha mise Calum. (I amn’t Calum.)
  • Negative interrogative: Nach esan Uilleam? (Isn’t he William?)

However, is doesn’t have type (S/NP)/NP because you can’t say

*Is tidsear Calum. (Calum is a teacher.)

(The star indicates that a sentence is ungrammatical.) You also can’t say

* Is tidsear mi. (I am a teacher.)

So we can, for now, assign it type (S/Nper)/PN, where Nper is a personal pronoun and PN is a proper noun. But let me assure you it gets much harder.

Coming soon: How to say that I am a teacher.

(Edit 2013-10-24: changed the word order in the ungrammatical examples to make them merely ungrammatical and not both ungrammatical and weird.)

What is the simplest parser that could possibly work?

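Behind this blog is a happy half hour or so I spent on Friday evening writing a bit of code to do forward composition in a really simple-minded way. Forward composition, as I’m using the term here (the textbooks call this particular rule forward application, and reserve “composition” for X/Y Y/Z → X/Z), is applying the following rule: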

X/Y Y → X

So I, being simple minded, wrote a bit of code that …
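To make the rule X/Y Y → X concrete, here is a minimal sketch of a reducer that uses nothing else. This is not the code from the post itself; the tuple representation of categories and the names forward_apply and reduce_left_to_right are my own assumptions for illustration. It simply folds the categories together from left to right and gives up as soon as the rule does not fit.

```python
# A category is either an atomic string such as "S" or "NP",
# or a tuple (result, "/", argument) standing for result/argument.

def forward_apply(left, right):
    """Apply X/Y Y -> X; return X, or None if the rule does not fit."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def reduce_left_to_right(categories):
    """Greedily fold a sequence of categories using only the rule above."""
    if not categories:
        return None
    current = categories[0]
    for nxt in categories[1:]:
        current = forward_apply(current, nxt)
        if current is None:
            return None
    return current

# (S/NP)/NP followed by two NPs, the verb-initial pattern
verb = (("S", "/", "NP"), "/", "NP")
print(reduce_left_to_right([verb, "NP", "NP"]))  # S
print(reduce_left_to_right(["NP", verb, "NP"]))  # None
```

The attraction for a verb-initial language is that a single left-to-right pass is enough when every functor looks rightwards; anything that needs a backslash is simply out of reach for this sketch.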

To be and to be (1)

First up, “to be”.

Bi has three forms in the present tense, used according to whether the sentence is positive, interrogative, negative or negative interrogative. The choice is driven by particles, and I don’t yet know whether to call them a VMOD or a P or something else.

  • Positive: Tha mi sgìth (I am tired). Tha is independent so is (S/NP)/ADJ.
  • Negative: Chan eil mi sgìth (I am not tired). Eil is dependent, so its type will be ((S\VMOD)/NP)/ADJ.
  • Negative interrogative: Nach eil thu sgìth? (Are you not tired?) Same as above.
  • Interrogative: A bheil thu sgìth? (Are you tired?) Bheil is dependent again but whereas most verbs just have a dependent and an independent form in any one tense, bi has two dependent forms. So, what to do?

Whatever happens, we have a tree at the top which says something like (Vspec Vbar). I don’t especially like this because eil thu sgìth, the Vbar, isn’t a constituent. I’m willing to lay money that there are no song titles that begin “Eil”. A coordination test for constituency, incidentally, isn’t decisive in English because one thing that categorial grammar is good at is non-constituent coordination, say in “Mary loves pizza and Tim rice”.

So either we commit ourselves to a type S/Vbar for a or chan or nach, which is potentially good because you could parse the entire sentence with forward composition (of which more tomorrow), or we assign eil type ((S\Vspec_neg)/NP)/ADJ and bheil type ((S\Vspec_int)/NP)/ADJ. More types will be needed for all of these forms, because bi isn’t just for adjectives!

Tomorrow: what is the simplest parser that could possibly work?

The slashes

In most frameworks you quickly get familiar with notation like PP (prepositional phrase), NP (noun phrase), VT (transitive verb), ADJ (adjective) and so forth. Categorial grammar, however, bristles with things like (S\NP)\((S\NP)/(S[adj]\NP)). What’s going on here?

Aside: hopefully these are the last examples I give in English.

“Mary loves pizza”. “Mary” is a singular personal name, “pizza” is a mass noun, and those are both sorts of NP. What about “loves”? It’s (S\NP)/NP. A simpler example is “Ice melts”. “Ice” is an NP, and “melts” here is S\NP. The backslash in Y\X means “give me something of type X to my left and I’ll give you a Y”.

So (S\NP)/NP, with a forward slash and a backslash, takes NPs to the left and right, and gives you an S, or a sentence.
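Here is a small sketch of how the two slashes can be checked mechanically. The representation (nested tuples) and the names combine and derives are assumptions of mine, not anything standard: a category is an atomic string, or a tuple (result, slash, argument), and the function tries every order of combining adjacent items until it either reaches S or runs out of options.

```python
def combine(left, right):
    """Try forward application (X/Y Y -> X), then backward (Y X\\Y -> X)."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

def derives(categories, goal="S"):
    """Return True if some order of adjacent combinations reduces to goal."""
    if len(categories) == 1:
        return categories[0] == goal
    for i in range(len(categories) - 1):
        merged = combine(categories[i], categories[i + 1])
        if merged is not None:
            rest = categories[:i] + [merged] + categories[i + 2:]
            if derives(rest, goal):
                return True
    return False

NP = "NP"
melts = ("S", "\\", NP)              # S\NP
loves = (("S", "\\", NP), "/", NP)   # (S\NP)/NP

print(derives([NP, melts]))        # True:  "Ice melts"
print(derives([NP, loves, NP]))    # True:  "Mary loves pizza"
print(derives([loves, NP, NP]))    # False: verb-initial order fails with this type
```

The last line is the interesting one: with the English type for “loves”, putting the verb first cannot yield an S, because the subject is expected on the left.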

In principle Gaelic verbs should have type (S/NP)/NP, but I have never seen a sentence exactly like this. “Mary loves pizza”, after all, is only OK because “loves” is stative. Unless you do the marketing for McDonald’s.

Categorial Grammar of Gaelic

Categorial grammar is a promising framework for doing things like really fast parsing of English and handling coordination (words like “and”, “or”) in a principled way. And I’ve recently realized that I’m not going to understand how it works unless I have a go.

So, why Gaelic? I don’t know of any parsers that have been written for Gaelic. It hasn’t attracted the attention that, say, Basque has. It’s reasonably well resourced in the sense that there are dictionaries and grammars, a smallish Wikipedia, teaching materials, and I have a Sorley MacLean Collected and a Julie Fowlis CD. None of this is the Penn Treebank, of course, but it all helps.

Now, I don’t speak much Gaelic beyond “please”, “thank you”, “where is Spot?”, “it’s wet today” and “cheerio the now!” so I’ll hopefully learn a bit more about Gaelic in the process.

Also, Gaelic is VSO, which is a bit unusual.