What kind of language is this? The top 100 word tokens in Gaelic

I downloaded all of the Gaelic Wikipedia. This is not hard. It is at http://dumps.wikimedia.org/gdwiki/latest/ and you probably want gdwiki-latest-pages-articles.xml.bz2, which contains all the text.

Now I can do word-token counts on it, using terrible code like the following:

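#!/usr/bin/perl -w
my %list;
while (<>) {
    # Split each line on whitespace and a few punctuation characters.
    my @tokens = split(/[\s\)\(\.=\,\?]/);
    for my $token (@tokens) {
        $list{$token}++;    # count one occurrence of this token
    }
}
# Print every token with its count, most frequent first.
foreach my $key (sort { $list{$b} <=> $list{$a} } (keys %list)) {
    print " $key $list{$key}\n";
}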

Note the entirely ad hoc collection of characters to split on.
The list is here, and you will see that the first noun is baile (town) at number 25, which tells you more about Wikipedia than it does about Gaelic. It also shows that an, e and is are, as we have seen, ambiguous between parts of speech, and that I can’t quite work out what to do with a.
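
For anyone bothered by the ad hoc split characters, here is a rough sketch of a slightly less ad hoc alternative. It is not the code that produced the list above. Instead of splitting on delimiters, it matches runs of Unicode letters, optionally joined by an internal apostrophe or hyphen (so t-each stays one token). That matching rule, the lowercasing, the cut-off at 100 and the sub name tokens are all my own assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open qw(:std :encoding(UTF-8));   # decode input and encode output as UTF-8

# Word tokens in a line: runs of Unicode letters, optionally joined by
# internal apostrophes or hyphens. This rule is an assumption, not the
# original split pattern.
sub tokens {
    my ($line) = @_;
    return $line =~ /\p{L}+(?:['’-]\p{L}+)*/g;
}

# Count lowercased tokens, reading the text line by line from standard input.
my %count;
while (my $line = <STDIN>) {
    $count{ lc($_) }++ for tokens($line);
}

# Rank by descending count, breaking ties alphabetically, and keep the top 100.
my @ranked = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
splice(@ranked, 100) if @ranked > 100;
printf "%s %d\n", $_, $count{$_} for @ranked;
```

Saved as, say, tokens.pl (a made-up name), it can be fed the dump with bzcat gdwiki-latest-pages-articles.xml.bz2 | perl tokens.pl. Two caveats: the XML and wiki markup are still in the input, so tag and template names get counted as words, and a leading or trailing apostrophe is dropped, so a' and a end up as the same token.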
