Available now at: https://universaldependencies.org/treebanks/gd_arcosg/index.html
I am happy to report that Scottish Gaelic now features among the treebanks of the Universal Dependencies project in releases 2.5 and 2.6. I have generally followed the annotation scheme for Irish, with some additions to cope with constructions that differ between the two languages.
The treebank is based on the ARCOSG corpus, which is half-and-half prose and speech. My paper presented at CLTW this year deals with the prose half so I thought it would be worthwhile to report some of the features of the speech subcorpora.
The first is sentence-splitting. ARCOSG is divided up into clauses rather than sentences. The prose subcorpora all have punctuation, so by and large I’ve relied on an automatic and pretty simplistic sentence-splitting algorithm to do the job for me. Occasionally a closing double-quote ends up in the wrong tree, but this is easy to fix. The speech subcorpora, lacking full stops, are something else entirely. The sentence-splitter splits on changes of speaker, and in principle I could just take every speaker turn as a single tree with lots of parataxis relations, but this would give me ridiculously big trees in some cases. Better to split where it feels like a new utterance. I have since found some guidelines in the wild for this at http://ldp-uchicago.github.io/docs/guides/transcription/sect_4.html and I am relieved to find that the rules match more or less what I was doing, except that I didn’t have the original recordings to work with.
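To make the idea concrete, here is a minimal sketch of a splitter of that simplistic kind. It is not the script used for the treebank: the function name, the (speaker, form) input format and the set of sentence-final punctuation are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical input format: (speaker id, or None for prose; word form).
Token = Tuple[Optional[str], str]

# Assumed set of sentence-final punctuation marks.
SENTENCE_FINAL = {".", "?", "!"}


def split_sentences(tokens: List[Token]) -> List[List[str]]:
    """Start a new sentence after final punctuation or when the speaker changes."""
    sentences: List[List[str]] = []
    current: List[str] = []
    previous_speaker: Optional[str] = None
    for speaker, form in tokens:
        # A change of speaker closes whatever has been collected so far.
        if current and speaker != previous_speaker:
            sentences.append(current)
            current = []
        current.append(form)
        previous_speaker = speaker
        # Sentence-final punctuation closes the sentence immediately.
        if form in SENTENCE_FINAL:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Placeholder tokens, not taken from ARCOSG.
speech = [("A", "one"), ("A", "two"), ("B", "three"), ("A", "four")]
quoted = [(None, '"'), (None, "Hello"), (None, "."), (None, '"'), (None, "Bye"), (None, ".")]
print(split_sentences(speech))
print(split_sentences(quoted))
```

On the quoted example the closing double-quote lands at the start of the second sentence, which is the sort of slip described above. With no punctuation at all, the only split points left are the changes of speaker.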
The second is tag questions, like fhios agad ‘you know’ and nach e ‘isn’t it’. These are very common. I’ve elected to simply relate them to the rest of the sentence with a parataxis relation rather than a more specific parataxis:tag relation. Maybe the more specific relation would help?
The third is words that the transcriber wasn’t able to transcribe. These are captured as [?]. If it’s not possible to work out from context what relation they bear to the rest of the utterance, I have used Universal Dependencies’ completely generic dep relation.
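If you want to see how often this happens, something like the following sketch would do. It assumes only the standard ten-column CoNLL-U layout (FORM in the second column, DEPREL in the eighth). The function names and the toy sentence are made up for illustration and are not taken from the treebank.

```python
from collections import Counter


def read_sentences(text):
    """Yield each sentence as a list of 10-column token rows."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Skip multiword-token ranges (1-2) and empty nodes (1.1).
            if cols[0].isdigit():
                sentence.append(cols)
    if sentence:
        yield sentence


def untranscribed_relations(text):
    """Count the DEPREL values assigned to tokens whose FORM is [?]."""
    counts = Counter()
    for sentence in read_sentences(text):
        for cols in sentence:
            if cols[1] == "[?]":
                counts[cols[7]] += 1
    return counts


def row(idx, form, head, deprel):
    """Build a token line; columns not needed here are left as '_'."""
    return "\t".join([str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


# Toy data with placeholder forms, invented for this example.
TOY = "\n".join([
    "# sent_id = toy-1",
    row(1, "word", 0, "root"),
    row(2, "[?]", 1, "dep"),
    "",
    "# sent_id = toy-2",
    row(1, "word", 0, "root"),
    row(2, "[?]", 1, "obj"),
    "",
])

print(untranscribed_relations(TOY))
```

Pointing untranscribed_relations at the text of one of the .conllu files would show how many [?] tokens ended up with dep and how many got something more specific from context.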
Number four is football commentary. Lots of it looks like this:
MacLare gu Johnson ma-th? ‘MacLare to Johnson indeed’
Johnson leatha a-mach an taobh-sa gu MacStay ‘Johnson with it out the side to MacStay’
(s07_005 and s07_006, ud_gd-arcosg-train.conllu)
What’s going on there grammatically? I have a solution for now: treat the player as the root and attach the PPs with the obl relation, but is this the most UD way of doing it? If not, then I should have been consistent enough to fix it automatically.
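Here is a rough sketch of what an automatic fix could look like, assuming the annotation really is consistent. The toy tree is my reconstruction of the analysis described above for the first three words of the first example, not a copy of the treebank rows: the PROPN and ADP tags are assumptions, and the preposition hangs off its noun as case in the usual UD way. The function names are hypothetical, and the replacement label is left as a parameter because I don't know what the right one would be.

```python
def read_sentences(text):
    """Yield each sentence as a list of 10-column token rows."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Skip multiword-token ranges (1-2) and empty nodes (1.1).
            if cols[0].isdigit():
                sentence.append(cols)
    if sentence:
        yield sentence


def has_player_root(sentence):
    """True if the single root is a proper noun with at least one obl dependent."""
    roots = [cols for cols in sentence if cols[6] == "0"]
    if len(roots) != 1 or roots[0][3] != "PROPN":
        return False
    root_id = roots[0][0]
    return any(cols[7] == "obl" and cols[6] == root_id for cols in sentence)


def relabel_root_obl(sentence, new_deprel):
    """Return a copy with every obl attached to the root relabelled."""
    root_id = next(cols[0] for cols in sentence if cols[6] == "0")
    return [
        cols[:7] + [new_deprel] + cols[8:]
        if cols[7] == "obl" and cols[6] == root_id
        else list(cols)
        for cols in sentence
    ]


def row(idx, form, upos, head, deprel):
    """Build a token line; columns not needed here are left as '_'."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])


# Reconstructed toy tree: player as root, PP attached with obl.
TOY = "\n".join([
    "# text = MacLare gu Johnson",
    row(1, "MacLare", "PROPN", 0, "root"),
    row(2, "gu", "ADP", 3, "case"),
    row(3, "Johnson", "PROPN", 1, "obl"),
    "",
])

for sentence in read_sentences(TOY):
    if has_player_root(sentence):
        # "nmod" is an arbitrary placeholder, not a recommendation.
        for cols in relabel_root_obl(sentence, "nmod"):
            print("\t".join(cols))
```

The check is deliberately narrow (proper-noun root with an obl hanging directly off it) so that ordinary clauses are left alone. A real version would also need to keep the comment lines and write the file back out.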