Headline passive

I read the news today. To be precise, I’ve been looking at the BBC website’s news in Gaelic and I’ve spotted a grammatical theme among a large proportion of the headlines and standfirsts:

  • Fiosrachadh ga shireadh mu ghoid charbad phoilis “Information sought about the theft of a police car”
  • Ceathrar gan toirt far Beinn Nibheis “Four people taken from the top of Ben Nevis”
  • Teaghlach de cheathrar gan toirt far Beinn Nibheis […] (standfirst for the foregoing) “Family of four taken from the top of Ben Nevis”
  • Duine ga lorg air a’ Chliseam “Person found on Clisham [mountain on Harris]”
  • Leasachadh Beinn Uais ga dhiùltadh “Ben Wyvis development turned down”

Here the aspect marker ag preceding a verbal noun has merged with the possessive pronoun that is the direct object of the verbal noun in question (sireadh, toirt, lorg and diùltadh), leniting it if the pronoun is masculine (ga). Put a form of bi at the front and you have a full sentence, but it need not be passive in that case. They could be, maybe absurdly:

  • Information seeks him about the theft of a police car
  • Four people take them from the top of Ben Nevis
  • Family of four take them from the top of Ben Nevis
  • Person finds him on Clisham or Person finds it on Clisham
  • Ben Wyvis development turns him down

These have a look of machine translation about them, don’t they?

gdbank: CCG and dependency structures in Scottish Gaelic

I have been working on a small corpus of Scottish Gaelic sentences. The words in them are all annotated with categorial grammar types and dependency relations. It’s available on GitHub (originally on Google Code) and there is a more detailed description in this paper and this poster, which I presented at CLTW in Dublin.

What format is it in?

CoNLL-X is a tab-separated plain-text format for annotating text with dependency relations. It was developed for the 10th Computational Natural Language Learning meeting in 2006. I have abused the format slightly by putting the categorial grammar types in the “features” column.
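
To make the layout concrete, here is a minimal sketch of reading such a file in Python. It assumes the usual ten CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences. The names `Token` and `parse_conllx` are made up for illustration, and the sample sentence with its annotations is invented, not taken from the corpus.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str   # coarse POS tag
    postag: str    # fine-grained POS tag
    feats: str     # in this corpus: the categorial grammar type
    head: int      # ID of the head token, 0 for the root
    deprel: str    # dependency relation to the head


def parse_conllx(text: str) -> List[List[Token]]:
    """Split CoNLL-X text into sentences, each a list of Token tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():           # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                             cols[4], cols[5], int(cols[6]), cols[7]))
    if current:                        # file may not end with a blank line
        sentences.append(current)
    return sentences


# Invented example rows (hypothetical annotation, for illustration only).
ROWS = [
    ("1", "Tha", "bi", "VERB", "VERB", r"(S[dcl]/(S[adj]\NP))/NP", "0", "root", "_", "_"),
    ("2", "mi", "mi", "PRON", "PRON", "NP", "1", "nsubj", "_", "_"),
    ("3", "sgìth", "sgìth", "ADJ", "ADJ", r"S[adj]\NP", "1", "xcomp", "_", "_"),
    ("4", ".", ".", ".", ".", ".", "1", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"

for token in parse_conllx(SAMPLE)[0]:
    print(token.form, token.feats, token.head, token.deprel)
```

Keeping the categorial grammar type as an opaque string in `feats` means any ordinary CoNLL-X reader can still load the file; it just won't know what the string means.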

Which standard did you use for the dependency annotation?

After swithering between the Briscoe and Carroll (GR) scheme, Teresa Lynn’s scheme at DCU, and a mixture of the two, I eventually opted for the Universal Dependency Scheme, which is based on the Stanford scheme. This has the merit of making inter-language comparisons straightforward.

Which standard did you use for the CCG annotations?

One based very closely on CCGBank with slight modifications for Gaelic.

How large is it?

Currently there are 40 sentences and 612 tokens (roughly speaking, words and punctuation marks).
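
Counts like these are easy to recompute from the file itself, since each non-blank line is one token and blank lines separate sentences. The following is a rough sketch under that assumption; `count_sentences_and_tokens` and the placeholder rows are hypothetical and not part of the corpus tooling.

```python
def count_sentences_and_tokens(conllx_text):
    """Return (sentences, tokens) for CoNLL-X text.

    Every non-blank line counts as one token; a run of non-blank
    lines counts as one sentence.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in conllx_text.splitlines():
        if line.strip():
            tokens += 1
            if not in_sentence:        # first token of a new sentence
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return sentences, tokens


def row(index, form):
    # Placeholder row: only ID and FORM are filled in, the rest are "_".
    return "\t".join([str(index), form] + ["_"] * 8)


sample = "\n".join([
    row(1, "Tha"), row(2, "mi"), row(3, "sgìth"), row(4, "."),
    "",
    row(1, "Tha"), row(2, "i"), row(3, "fuar"), row(4, "."),
]) + "\n"

print(count_sentences_and_tokens(sample))  # (2, 8)
```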

Is it POS-tagged?

Ish. The CoNLL-X format has two columns for this, though: a coarse POS tag (simply whether something is a noun, a verb, an adposition or whatever) and a more fine-grained one that would include number, tense and so forth. I use the Universal POS tagset for both columns for now.

How many annotators did you have?

Ahem. Just me. This is a shortcoming.

Update 2015-08-24: migrated to GitHub (see above).