But what can we tell from the 100 top word tokens?

  • 26 are prepositions of some sort
  • 23 are nouns
  • 10 are conjunctions
  • 10 are verbs
  • 5 are articles
  • 7 are adjectives
  • 7 are pronouns
  • 4 are preverbal particles
  • 2 are adverbs

The number of prepositions is unusually high and indicates that PPs (prepositional phrases) do an awful lot of the work in a Gaelic sentence. The number of verbs seems pretty low, and in fact many of them are forms of the verbs “to be” that we’ve seen earlier. This is because the verb “to be” typically does much of the rest of the work. More examples of this to come.

The article doesn’t mark gender (of which there are two, masculine and feminine) but it does mark the two numbers (singular and plural). So how come there are five articles listed?

Well, an is the singular, na does double duty for “of the” and “the” plural. nan does “of the” plural. Before a labial consonant, an becomes am and nan becomes nam. This warns us that our system will have to take into account initial consonants to get this right.

There are also some duplicates. “Scotland” is Alba normally and h-Alba after na, as in Banca na h-Alba “Bank of Scotland”. duine (person) has a weird-looking plural, daoine. d?thaich has the genitive form d?thcha. baile (town) has a lenited form (I will come to this, but not today) bhaile. So we see that Gaelic is not only morphologically rich, but instead of adding case endings and whatnot to the ends of words, like in Hungarian or Turkish, modifies the insides of words instead.

That will do for the now.

What kind of language is this? The top 100 word tokens in Gaelic

I downloaded all of the Gaelic wikipedia. This is not hard. It is at http://dumps.wikimedia.org/gdwiki/latest/ and you probably want gdwiki-latest-pages-articles.xml.bz2, which contains all the text.

Now I can do word-token counts on it, using terrible code like the following:

#!/usr/bin/perl -w
my %list;
while (<>) {
@tokens = split(/[\s\)\(\.=\,\?]/);
for $token (@tokens)
{
$list{lc($token)}++;
}
}
foreach $key (sort { $list{$b} <=> $list{$a} } (keys %list)) {
print " $key $list{$key}\n";
}

Note the entirely ad hoc collection of characters to split on.
The list is here, and you will see that the first noun is baile (town) at number 25, which tells you more about Wikipedia than it does about Gaelic. But also that an, e and is are, as we have seen, ambiguous between parts of speech, and that I can’t quite work out what to do with a.