Why do we bother with grammatical frameworks?

Most natural languages, like English, French, Chukchi, Basque, Gaelic, Italian, Russian, Latgalian, Finnish, Tamil and so forth, can be reasonably well modelled by a context-free grammar, which is the sort of grammar that people define computer languages with. Parsers for these are ten-a-penny. They have to be, otherwise you couldn’t run C, Perl, PHP, Python, Haskell or whatever. So a question you might be asking is why people don’t just use these parsers for natural languages, and instead go off and invent grammatical frameworks like HPSG, LFG, CCG and so on.

One important reason is agreement, by which I mean that verbs in English, say, agree with their subjects in number and, in a limited way, in person. What does this mean in practice? Well, if you’re writing a context-free grammar to handle sentences like “The lady vanishes”, then you can’t just say:

S → NP VP

because that overgenerates. That would allow “The lady vanish”, “The ladies vanishes”, “I vanishes” and so on, because each of these has the form NP VP. “The lady” is an NP (noun phrase), as are “The ladies” and “I”. The rest of each sentence is a VP (verb phrase). So our grammar has to say instead:

S → NP_3rdsg VP_3rdsg

S → NP_non3rdsg VP_non3rdsg

and the same applies to every rule you have in the grammar, so the number of rules multiplies with every feature you add. Modern grammatical frameworks use feature structures to look after all of this. They let you insist that whatever features words have, like number (singular, plural, and in Slovene dual) or person (first, second, third: I, you, he/she), have to agree, so you can write rules like this:

S → NP VP

and let the lexicon, the collection of the words themselves, handle the details.
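
To make that concrete, here is a minimal Python sketch of agreement by feature unification. It is an illustration only, not how HPSG, LFG or OpenCCG actually implement it: the flat dictionaries, the `unify` and `sentence_ok` helpers and the tiny lexicon are all made up for the example. “vanish” gets several entries because “anything but third person singular” cannot be written as a single flat feature set.

```python
def unify(a, b):
    """Merge two flat feature dicts; return None if any feature clashes."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged

# Hypothetical lexicon: phrase -> (category, list of possible feature sets).
LEXICON = {
    "the lady":   ("NP", [{"num": "sg", "per": 3}]),
    "the ladies": ("NP", [{"num": "pl", "per": 3}]),
    "I":          ("NP", [{"num": "sg", "per": 1}]),
    "vanishes":   ("VP", [{"num": "sg", "per": 3}]),
    "vanish":     ("VP", [{"num": "pl"},
                          {"num": "sg", "per": 1},
                          {"num": "sg", "per": 2}]),
}

def sentence_ok(np, vp):
    """The single rule S -> NP VP, with agreement left to the lexicon."""
    np_cat, np_feats = LEXICON[np]
    vp_cat, vp_feats = LEXICON[vp]
    if (np_cat, vp_cat) != ("NP", "VP"):
        return False
    # Grammatical if some pair of feature sets unifies without a clash.
    return any(unify(f, g) is not None for f in np_feats for g in vp_feats)

for np, vp in [("the lady", "vanishes"), ("the lady", "vanish"),
               ("the ladies", "vanishes"), ("I", "vanish")]:
    print(np, vp, sentence_ok(np, vp))
```

The grammar itself stays at one rule; adding a new feature means touching lexical entries, not multiplying rules.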

Posted in grammar, preliminaries

A first attempt at the copula

Having got OpenCCG working, we can now start doing what we’re here for. To say “Calum is a teacher”, or “I am a teacher”, you have to say the at-first-glance rather odd:

  • ’S e tidsear a th’ann Calum.
  • ’S e tidsear a th’annam.

The unwary might translate those as “It is a teacher that is in Calum” and “It is a teacher that is in me”, but really tha + ann means “there is”. annam is a preposition marked for person, which I don’t think I’ve mentioned before. I’ve kind of implemented this, but it does overgenerate like mad. Overgeneration is when your grammar allows sentences that aren’t grammatical.

copula.ccg contains the grammar so far. Here are some highlights: […]

Posted in grammar

Getting OpenCCG to work on the Mac

OpenCCG is a Java/Python toolkit for working on combinatory categorial grammar, so it is ideal for this exercise. It comes with instructions for getting it to work under Unix and Windows, but on the Mac, or at least on the one I’m using, there’s a small amount of fiddling needed. Here it is:

  • You may not already have a recent version of Python, which you can get from http://www.python.org/download/releases/2.7.1/ as a .dmg, which has a friendly hand-holdy installation process.
  • Environment variables:
    • cd to the directory that you’ve downloaded openccg to, type pwd, and set OPENCCG_HOME to the result using export.
    • export PATH="$PATH:$OPENCCG_HOME/bin" (do this after setting OPENCCG_HOME, and use plain straight quotes)
    • export JAVA_HOME=/usr (this surprised me, but it works on Mac OS X 10.4.11)
  • You will also need to fetch lex.py and yacc.py from sourceforge: http://openccg.cvs.sourceforge.net/viewvc/openccg/openccg/bin/ and put them in the bin folder in your OpenCCG installation.
  • If you then follow the instructions in the README file and get an error about the wrong class number you’ll have to rebuild it. Try typing ant at the command line and see what happens. I don’t remember installing ant, which means that it might come on the Mac by default. If not, you’ll have to go to http://ant.apache.org/. Good luck!

It comes with some minuscule test grammars including Basque and Turkish.

Posted in other people's code, preliminaries

But what can we tell from the 100 top word tokens?

  • 26 are prepositions of some sort
  • 23 are nouns
  • 10 are conjunctions
  • 10 are verbs
  • 7 are adjectives
  • 7 are pronouns
  • 5 are articles
  • 4 are preverbal particles
  • 2 are adverbs

The number of prepositions is unusually high and indicates that PPs (prepositional phrases) do an awful lot of the work in a Gaelic sentence. The number of verbs seems pretty low, and in fact many of them are forms of the verbs “to be” that we’ve seen earlier. This is because the verb “to be” typically does much of the rest of the work. More examples of this to come.

The article doesn’t mark gender (of which there are two, masculine and feminine) but it does mark the two numbers (singular and plural). So how come there are five articles listed?

Well, an is the singular; na does double duty for “of the” and for plural “the”; and nan does plural “of the”. Before a labial consonant, an becomes am and nan becomes nam. This warns us that our system will have to take into account initial consonants to get this right.
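
The labial rule on its own is easy to sketch in Python. This is a rough illustration under stated assumptions, not part of any grammar here: the function name `article_before` is hypothetical, the labials are taken to be b, f, m and p, and lenition and case are ignored entirely, so it only makes sense for unlenited nouns.

```python
# Assumed set of labial initials that trigger an -> am and nan -> nam.
LABIALS = ("b", "f", "m", "p")

def article_before(article, noun):
    """Return the article form to use before `noun` (labial rule only)."""
    if article in ("an", "nan") and noun.lower().startswith(LABIALS):
        return article[:-1] + "m"   # an -> am, nan -> nam
    return article                  # na and everything else unchanged

for article, noun in [("an", "baile"), ("an", "duine"), ("nan", "daoine")]:
    print(article_before(article, noun), noun)
```

A real system would have to decide the article from case, number and lenition together; this only shows that the choice depends on the first letter of the following word.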

There are also some duplicates. “Scotland” is Alba normally and h-Alba after na, as in Banca na h-Alba “Bank of Scotland”. duine (person) has a weird-looking plural, daoine. dùthaich (country) has the genitive form dùthcha. baile (town) has a lenited form (I will come to this, but not today) bhaile. So we see that Gaelic is not only morphologically rich but, rather than adding case endings and whatnot to the ends of words, as Hungarian or Turkish do, modifies the insides of words.

That will do for the now.

Posted in grammar, preliminaries

What kind of language is this? The top 100 word tokens in Gaelic

I downloaded all of the Gaelic Wikipedia. This is not hard. It is at http://dumps.wikimedia.org/gdwiki/latest/ and you probably want gdwiki-latest-pages-articles.xml.bz2, which contains all the text.

Now I can do word-token counts on it, using terrible code like the following:

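#!/usr/bin/perl
use strict;
use warnings;

# Count word tokens read from the files named on the command line (or stdin).
my %list;
while (<>) {
    # Split on whitespace and an ad hoc set of punctuation characters.
    my @tokens = split(/[\s\)\(\.=\,\?]/);
    for my $token (@tokens) {
        next if $token eq '';    # adjacent separators yield empty tokens
        $list{lc($token)}++;
    }
}
# Print each token with its count, most frequent first.
foreach my $key (sort { $list{$b} <=> $list{$a} } keys %list) {
    print " $key $list{$key}\n";
}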

Note the entirely ad hoc collection of characters to split on.
The list is here, and you will see that the first noun is baile (town) at number 25, which tells you more about Wikipedia than it does about Gaelic. It also shows that an, e and is are, as we have seen, ambiguous between parts of speech, and that I can’t quite work out what to do with a.
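
For anyone who would rather not decompress the dump first, the same count can be sketched in Python. This is an equivalent written for illustration, not the script I actually ran: the names `count_tokens` and `count_dump` are made up, the split characters are the same ad hoc set as above, and the dump is assumed to be UTF-8.

```python
import bz2
import re
from collections import Counter

# Same ad hoc separators as the Perl script: whitespace ) ( . = , ?
SPLIT_RE = re.compile(r"[\s)(.=,?]")

def count_tokens(lines):
    """Count lower-cased tokens in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for token in SPLIT_RE.split(line):
            if token:                      # skip empty strings between separators
                counts[token.lower()] += 1
    return counts

def count_dump(path):
    """Count tokens in a .bz2 file, reading it as UTF-8 text line by line."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return count_tokens(handle)

if __name__ == "__main__":
    sample = ["Tha an baile (an baile) mòr.", "An e?"]
    for token, n in count_tokens(sample).most_common(3):
        print(token, n)
```

Calling `count_dump("gdwiki-latest-pages-articles.xml.bz2").most_common(100)` would give a top-100 list, XML markup and all.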

Posted in grammar, preliminaries

Arabic

A better-resourced language that is VSO is Arabic, and I noticed today that Chris Brew’s group have a paper on converting the Penn Arabic Treebank into CCG. I didn’t know that Arabic had resumptive pronouns. Gaelic doesn’t, but they might be useful in posts explaining gapping later on.

Posted in grammar, not gaelic, other people's code

To be and to be (2)

And there is another verb “to be”, like this:

  • Positive: Is mise Calum. (I am Calum.)
  • Interrogative: An tusa Ealasaid? (Are you Elizabeth?)
  • Negative: Cha mise Calum. (I amn’t Calum.)
  • Negative interrogative: Nach esan Uilleam? (Isn’t he William?)

However, is doesn’t have type (S/NP)/NP because you can’t say

*Is Calum tidsear. (Calum is a teacher.)

(The star indicates that a sentence is ungrammatical.) You also can’t say

*Is mise tidsear. (I am a teacher.)

So we can, for now, assign it type (S/PN)/Nper, where Nper is a personal pronoun and PN is a proper noun: is takes the pronoun immediately to its right first, and then the proper noun. But let me assure you it gets much harder.

Coming soon: How to say that I am a teacher.

Posted in grammar

What is the simplest parser that could possibly work?

Behind this blog is a happy half hour or so I spent on Friday evening writing a bit of code to do what I’ve been calling forward composition in a really simple-minded way. Strictly speaking, the rule I implemented is forward application (forward composition proper is X/Y Y/Z → X/Z):

X/Y Y → X

So I, being simple minded, wrote a bit of code that […]
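
For readers who want something to poke at, here is a minimal Python sketch of a reducer that uses only the rule X/Y Y → X. It is not the code from this post, which is cut off above. The tuple representation of categories and the names `fwd`, `apply_forward` and `reduce_forward` are assumptions made for the example, and the greedy left-to-right strategy is a simplification, not a complete parser.

```python
def fwd(result, arg):
    """Build the category result/arg as a (result, '/', arg) tuple."""
    return (result, "/", arg)

def apply_forward(left, right):
    """Apply X/Y Y -> X; return X, or None if the rule does not fit."""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None

def reduce_forward(cats):
    """Greedily combine the leftmost adjacent pair that fits, until stuck."""
    cats = list(cats)
    changed = True
    while changed and len(cats) > 1:
        changed = False
        for i in range(len(cats) - 1):
            out = apply_forward(cats[i], cats[i + 1])
            if out is not None:
                cats[i:i + 2] = [out]   # replace the pair with its result
                changed = True
                break
    return cats

# (S/B)/A followed by A and B reduces to S in two steps.
print(reduce_forward([fwd(fwd("S", "B"), "A"), "A", "B"]))
# A particle of type S/Vbar followed by a Vbar reduces to S.
print(reduce_forward([fwd("S", "Vbar"), "Vbar"]))
```

A sequence counts as parsed when the list comes back as a single S. Anything needing a backslash or real composition is left unreduced.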

Posted in code

To be and to be (1)

First up, “to be”.

Bi has three forms in the present tense (tha, eil and bheil), spread over four sentence types: positive, interrogative, negative and negative interrogative. The last three are introduced by particles, and I don’t know whether those are a VMOD or a P or a what.

  • Positive: Tha mi sgìth (I am tired). Tha is independent, so is (S/ADJ)/NP: it takes the subject NP to its right first, then the adjective.
  • Negative: Chan eil mi sgìth (I am not tired). Eil is dependent, so its type will be ((S\VMOD)/ADJ)/NP.
  • Negative interrogative: Nach eil thu sgìth? (Are you not tired?) Same as above.
  • Interrogative: A bheil thu sgìth? (Are you tired?) Bheil is dependent again, but whereas most verbs just have a dependent and an independent form in any one tense, bi has two dependent forms. So, what to do?

Whatever happens, we have a tree at the top which says something like (Vspec Vbar). I don’t especially like this because eil thu sgìth, the Vbar, isn’t a constituent. I’m willing to lay money that there are no song titles that begin “Eil”. A coordination test for constituency, incidentally, isn’t decisive in English because one thing that categorial grammar is good at is non-constituent coordination, say in “Mary loves pizza and Tim rice”.

So either we commit ourselves to a type S/Vbar for a or chan or nach, which is potentially good because you could parse the entire sentence with forward composition, of which more tomorrow, or we assign eil type ((S\Vspec_neg)/ADJ)/NP and bheil type ((S\Vspec_int)/ADJ)/NP. More types will be needed for all of these forms, because bi isn’t just for adjectives!

Tomorrow: what is the simplest parser that could possibly work?

Posted in grammar, open questions

The slashes

In most frameworks you quickly get familiar with notation like PP (prepositional phrase), NP (noun phrase), VT (transitive verb), ADJ (adjective) and so forth. Categorial grammar, however, bristles with things like (S\NP)\((S\NP)/(S[adj]\NP)). What’s going on here?

Aside: hopefully these are the last examples I give in English.

“Mary loves pizza”. “Mary” is a singular personal name, “pizza” is a mass noun, and those are both sorts of NP. What about “loves”? It’s (S\NP)/NP. A simpler example is “Ice melts”. “Ice” is an NP, and “melts” here is S\NP. The backslash in Y\X means “give me something of type X to my left and I’ll give you a Y”. The forward slash in Y/X means the same thing, except that the X has to be on the right.

So (S\NP)/NP, with a forward slash and a backslash, takes an NP to its right, then an NP to its left, and gives you an S, or a sentence.
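
Here is the derivation of “Mary loves pizza” spelled out as a small Python sketch. The tuple encoding of categories and the helper names `fwd`, `bwd` and `combine` are assumptions made for the illustration; it has nothing to do with how OpenCCG represents categories internally.

```python
def fwd(result, arg):
    """result/arg: looks for arg immediately to the right."""
    return (result, "/", arg)

def bwd(result, arg):
    """result\\arg: looks for arg immediately to the left."""
    return (result, "\\", arg)

def combine(left, right):
    """Try forward application, then backward application; None if neither fits."""
    # Forward application: X/Y Y -> X
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    # Backward application: Y X\Y -> X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

loves = fwd(bwd("S", "NP"), "NP")   # (S\NP)/NP
melts = bwd("S", "NP")              # S\NP

loves_pizza = combine(loves, "NP")  # "loves pizza" is S\NP
print(loves_pizza)
print(combine("NP", loves_pizza))   # "Mary" + "loves pizza"
print(combine("NP", melts))         # "Ice melts"
```

Note that “loves pizza” ends up with exactly the same type as “melts”, which is the point: a transitive verb plus its object behaves like an intransitive verb.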

In principle Gaelic verbs should have type (S/NP)/NP, but I have never seen a sentence exactly like this. “Mary loves pizza”, after all, is only OK because “loves” is stative. Unless you do the marketing for McDonald’s.

Posted in grammar