I have been working on a small corpus of Scottish Gaelic sentences. The words in them are all annotated with categorial grammar types and dependency relations. It’s available on GitHub, and there is a more detailed description in this paper and this poster, which I presented at CLTW in Dublin.
What format is it in?
CoNLL-X, a tab-separated plain-text format for annotating text with dependency relations. It was developed for the shared task at the tenth Conference on Computational Natural Language Learning in 2006. I have abused the format slightly by putting the categorial grammar types in the “features” column.
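To make the layout concrete, here is a minimal sketch of reading such a file in Python. CoNLL-X has ten tab-separated columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences. The names `Token` and `parse_conllx` are made up for this example, and the sample sentence and its annotations are invented for illustration; they are not copied from the corpus and may not match how it analyses the same words.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One CoNLL-X row, keeping only the columns this sketch needs."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    ccg_type: str  # FEATS column, which this corpus uses for the CCG type
    head: int      # ID of the head token, 0 for the root
    deprel: str


def parse_conllx(text: str) -> List[List[Token]]:
    """Split CoNLL-X text into sentences, each a list of Token rows."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        current.append(Token(
            id=int(fields[0]), form=fields[1], lemma=fields[2],
            cpostag=fields[3], postag=fields[4], ccg_type=fields[5],
            head=int(fields[6]), deprel=fields[7],
        ))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical rows, for illustration only (not taken from the corpus).
ROWS = [
    ["1", "Tha", "bi", "VERB", "VERB", "(S[dcl]/ADJP)/NP", "0", "root", "_", "_"],
    ["2", "mi", "mi", "PRON", "PRON", "NP", "1", "nsubj", "_", "_"],
    ["3", "sgìth", "sgìth", "ADJ", "ADJ", "ADJP", "1", "xcomp", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"

if __name__ == "__main__":
    for sentence in parse_conllx(SAMPLE):
        for tok in sentence:
            print(tok.id, tok.form, tok.ccg_type, tok.head, tok.deprel)
```

Keeping the CCG type as an opaque string is deliberate here: any tool that expects ordinary CoNLL-X features in that column will need to be told otherwise.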
Which standard did you use for the dependency annotation?
After swithering between the Briscoe and Carroll (GR) scheme, Teresa Lynn’s scheme from DCU, and a mixture of the two, I eventually opted for the Universal Dependencies scheme, which is based on the Stanford scheme. This has the merit of making inter-language comparisons straightforward.
Which standard did you use for the CCG annotations?
One based very closely on CCGBank with slight modifications for Gaelic.
How large is it?
Currently there are 40 sentences and 612 tokens (roughly speaking, words and punctuation marks).
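Counts like these are easy to reproduce from the file itself, since every non-blank line is one token and blank lines separate sentences. The following is a small sketch under that assumption; `corpus_size` is a name invented for this example, and the toy input uses placeholder rows rather than real corpus data.

```python
from typing import Tuple


def corpus_size(text: str) -> Tuple[int, int]:
    """Return (sentences, tokens) for CoNLL-X text.

    Every non-blank line counts as a token; a run of non-blank
    lines counts as one sentence.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in text.splitlines():
        if line.strip():
            tokens += 1
            if not in_sentence:
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return sentences, tokens


def _row(token_id: int, form: str) -> str:
    """Build a placeholder ten-column row (illustrative values only)."""
    return "\t".join([str(token_id), form, "_", "_", "_", "_", "0", "_", "_", "_"])


# Two toy sentences with three and two tokens respectively.
TOY = "\n".join([
    _row(1, "a"), _row(2, "b"), _row(3, "."),
    "",
    _row(1, "c"), _row(2, "."),
]) + "\n"

if __name__ == "__main__":
    print(corpus_size(TOY))
```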
Is it POS-tagged?
Ish. The CoNLL-X format has two columns for this, though: a coarse POS tag (simply whether something is a noun, a verb, an adposition or whatever) and a more fine-grained one that would include number, tense and so forth. For now I use the Universal POS tagset in both columns.
How many annotators did you have?
Ahem. Just me. This is a shortcoming.
Update 2015-08-24: migrated to GitHub (see above).