Interrogative frequencies in DASG

One aspect of Gaelic I want to look at more closely is interrogatives. Just as all the wh- words in English (who, when, why, what, how) go to the front of the sentence, so do all the c- words in Gaelic, and the word order in the rest of the sentence changes as well. This is not universal, however. In Chinese, one simply substitutes the word for ‘what’ in the ordinary sentence order, just as when we’re particularly surprised in English we might say “You ate what?”.

In order to see how they work exactly, we need example sentences, so I’ve been looking in DASG. One easy first step is to look at frequencies in this table:

Interrogative   Count   English   Observations
cò              9122    who       noisy; lots of prefixes and parts of words
ciod            4587    what
cia             2363    how       also cia mar in older texts, cia fhad ‘how long’, cia mhòr ‘how big’
dè              403     what      also ‘God’
ciamar          273     how
càit            182     where     also genitive of cat meaning ‘cat’
carson          133     why
càite           90      where
cuin            59      when
cuine           15      when

These are the results of accent-insensitive searches, as the older texts haven’t had their spelling modernized or made consistent. The results surprised me a great deal, for three reasons. Firstly, ciod ‘what’, which I don’t recall seeing terribly often in the present day, is the most numerous interrogative, mostly occurring in a single document, a history of Scotland. One of the very first words you learn in Gaelic is its modern counterpart dè, which only has about 200 (judged by eye) instances as an interrogative in DASG. This is a similar number to càit(e), carson, cuin(e) and ciamar: ‘where’, ‘why’, ‘when’ and ‘how’. Secondly, there is an enormous number of hits for cia ‘how’, which on a cursory inspection are often exclamations (‘how swift’, ‘how long’, ‘how horrible’) or an old spelling of ciamar, in addition to the more familiar cia mheud ‘how many’. Thirdly, nearly all of the instances of  meaning ‘what’ are from a single work, Saoghal Bana-mharaiche, describing the Gaelic from the coast of Easter Ross.
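For anyone who wants to try the same kind of count on their own text, here is a minimal sketch of accent-insensitive counting. It is not how DASG's search works internally (I don't know that); the function names and the sample text are made up for illustration. Unlike a substring search, it only counts whole words, which sidesteps the "parts of words" noise noted in the table.

```python
import re
import unicodedata
from collections import Counter
from typing import Sequence


def strip_accents(text: str) -> str:
    """Remove combining marks so that 'càit' and 'cait' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_forms(text: str, forms: Sequence[str]) -> Counter:
    """Count whole-word, accent- and case-insensitive matches of each form."""
    # Map the accent-stripped spelling back to the spelling we were asked for.
    wanted = {strip_accents(form).lower(): form for form in forms}
    counts = Counter({form: 0 for form in forms})
    for token in re.findall(r"\w+", strip_accents(text).lower()):
        if token in wanted:
            counts[wanted[token]] += 1
    return counts


# Hypothetical sample text, not taken from DASG.
sample = "Càit a bheil thu? Cait an robh e? Ciamar a tha thu? Carson?"
forms = ["càit", "ciamar", "carson", "cuin"]
result = count_forms(sample, forms)
for form in forms:
    print(form, result[form])
```

Both spellings of the first word are counted together, which is the point of searching without accents in unmodernized texts.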

I’ll leave you with a new meaning I’d never seen before for gu. This can be gu the preposition, gu the subordinator (as in gu bheil), gu the aspect marker or gu the adverbializer, but Gu dè tha thu? from DASG31, Ugam agus bhuam, is clearly none of these. As explained here, what is going on is this: the Gaelic for ‘what’ used to be ciod e, like the Irish cad é, and over time this became dè. Gu dè is a variant of this. It’s another one of those pesky multiword expressions.

[Edit 2015-01-03 to clarify reason for looking at interrogatives and add another meaning of gu.]

Posted in grammar, preliminaries | Leave a comment

DASG and the second comparative

If you haven’t come across Dachaigh airson Stòras na Gàidhlig/Digital Archive of Scottish Gaelic, you should stop reading this and go straight there.

Welcome back. It contains eight and a half million words and is a resource I keep coming back to. In my first investigation, I’m looking for the second comparative, which I had never seen before last weekend. Here’s an example:

Is feairrde na stamagan srubag dheth

(The stomachs are better for a wee drink in them.) It’s explained in Gillie’s Elements of Scottish Gaelic Grammar, as differing from the normal comparative (“Xer”) in that it means “Xer by that” or “Xer because of that”. If you search for a word, DASG gives you a concordance so you can look at the local context of words.

Some second comparatives in DASG: feairrd, feairrde, misd, bigid, lughaid. An ambiguous word that might be a second comparative: mòid. I look forward to a POS-tagged version of DASG.
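A concordance of the kind mentioned above is easy to sketch for your own text. This is only an illustration of the keyword-in-context idea, not DASG's implementation; the function name `kwic` and the `width` parameter are my own.

```python
import re
from typing import List


def kwic(text: str, keyword: str, width: int = 3) -> List[str]:
    """Return keyword-in-context lines with up to `width` tokens either side."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# The example sentence quoted above; a real search would run over corpus text.
sentence = "Is feairrde na stamagan srubag dheth"
hits = kwic(sentence, "feairrde", width=2)
for line in hits:
    print(line)
```

Running the same function over each of the forms listed above would give a quick way to eyeball whether an ambiguous form is really being used as a second comparative.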

Posted in grammar | 2 Comments

Training a dependency parser on gdbank

A very quick note to say that I’ve trained MaltParser, a dependency parser, on the current gdbank sentences (a mere 1223 tokens spread across 70-odd sentences), using the Universal POS tagging scheme and the current Universal-ish gdbank dependency annotation scheme. I then looked at how it performed on an unseen test set of 8 sentences containing 276 tokens, taken from an article in The Scotsman from a few years ago.

It got 196 (71%) of the heads right and 207 (75%) of the dependency types right, and both the head and the dependency type were right in 187 (68%) of cases. My initial impression is that the main problems are subordinators and my having mis-POS-tagged a few words, but there will be a confusion matrix soon.
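The three figures above correspond to what are usually called the unlabelled attachment score (heads), label accuracy (dependency types) and labelled attachment score (both). Here is a minimal sketch of how they can be computed from gold and predicted arcs. The function name and the toy arcs are invented; only the counts 196, 207, 187 and 276 come from the experiment.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Dict[str, float]:
    """Compare gold and predicted arcs token by token."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    total = len(gold)
    heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labels = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    both = sum(g == p for g, p in zip(gold, predicted))
    return {
        "uas": heads / total,             # head correct
        "label_accuracy": labels / total,  # dependency type correct
        "las": both / total,              # head and dependency type correct
    }


# Toy, made-up arcs for a four-token sentence.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nsubj"), (2, "amod")]
scores = attachment_scores(gold, pred)
print(scores)

# Checking the percentages quoted in the post from the raw counts.
reported = {
    name: round(100 * count / 276)
    for name, count in [("heads", 196), ("labels", 207), ("both", 187)]
}
print(reported)
```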

Posted in dependency parsing, maltparser | Leave a comment

MaltParser cheat mode

If you train MaltParser using the learnwo flowchart in place of learn, it does all the same things, except that it writes out the sentences as it reads them in.

This means that if you have, ahem, misformatted any of your input, you can see exactly which misformatting MaltParser is complaining about, because it will be in the first sentence that hasn’t been written to stdout.
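A complementary trick is to check the file yourself before handing it to the parser. The sketch below is my own rough pre-check, not anything MaltParser provides. It assumes CoNLL-X training data with ten tab-separated columns, token IDs counting up from 1 in each sentence, and a numeric HEAD column. The two sample lines are made up.

```python
from typing import Iterable, Optional, Tuple


def first_bad_line(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (line number, reason) for the first malformed line, else None."""
    expected_id = 1
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            expected_id = 1  # a blank line ends a sentence
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            return number, f"expected 10 tab-separated fields, found {len(fields)}"
        if fields[0] != str(expected_id):
            return number, f"expected token ID {expected_id}, found {fields[0]!r}"
        if not fields[6].isdigit():
            return number, f"HEAD should be a non-negative integer, found {fields[6]!r}"
        expected_id += 1
    return None


# Hypothetical input: the second line uses spaces where tabs should be.
good = "1\tTha\t_\tVERB\tVERB\t_\t0\troot\t_\t_"
bad = "2 mi _ PRON PRON _ 1 nsubj _ _"
problem = first_bad_line([good, bad])
print(problem)
```

This only catches the kinds of slip listed above; anything subtler still needs the parser's own error messages.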

Posted in dependency parsing, maltparser | Leave a comment

Installing MaltParser on Mac OS X 10.6.8

MaltParser is a dependency parser and it’s available here: http://www.maltparser.org/download.html

If you try to run the ready-built jar under Mac OS X 10.6.8 and you haven’t updated to Java 1.7, you’ll get a major.minor version number error. However, if you simply edit references in the build.xml file to read 1.6, and type

ant dist

to build with ant, then it will whirr away for a bit and build fine.

Posted in dependency parsing, maltparser | Leave a comment

Headline passive

I read the news today. To be precise, I’ve been looking at the BBC website’s news in Gaelic and I’ve spotted a grammatical theme among a large proportion of the headlines and standfirsts:

  • Fiosrachadh ga shireadh mu ghoid charbad phoilis “information sought about the theft of a police car”
  • Ceathrar gan toirt far Beinn Nibheis “Four people taken from the top of Ben Nevis”
  • Teaghlach de cheathrar gan toirt far Beinn Nibheis […] (standfirst for the foregoing) “Family of four taken from the top of Ben Nevis”
  • Duine ga lorg air a’ Chliseam “Person found on Clisham [mountain on Harris]”
  • Leasachadh Beinn Uais ga dhiùltadh “Ben Wyvis development turned down”

Here the aspect marker ag preceding a verbal noun has merged with the possessive pronoun that is the direct object of the verbal noun in question (sireadh, toirt, lorg and diùltadh), leniting it if the pronoun is masculine (ga). Put a form of bi at the front and you have a full sentence, but it need not be passive in that case. The headlines could instead be read, maybe absurdly, as:

  • Information seeks him about the theft of a police car
  • Four people take them from the top of Ben Nevis
  • Family of four take them from the top of Ben Nevis
  • Person finds him on Clisham or Person finds it on Clisham
  • Ben Wyvis development turns him down

These have a look of machine translation about them, don’t they?

Posted in grammar | Leave a comment

gdbank: CCG and dependency structures in Scottish Gaelic

I have been working on a small corpus of Scottish Gaelic sentences. The words in them are all annotated with categorial grammar types and dependency relations. It’s available on GitHub and there is a more detailed description in this paper and this poster, which I presented at CLTW in Dublin.

What format is it in?

CoNLL-X is a tab-separated plain-text format for annotating text with dependency relations. It was developed for the 10th Computational Natural Language Learning meeting in 2006. I have abused the format slightly by putting the categorial grammar types in the “features” column.
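To make the column layout concrete, here is a minimal sketch of reading one CoNLL-X token line into named fields. The ten column names are the standard CoNLL-X ones; the `Token` class, the `parse_token` function and the sample line (including its categorial type) are invented for illustration and are not taken from gdbank.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str   # coarse-grained part of speech
    postag: str    # fine-grained part of speech
    feats: str     # normally morphological features; here the categorial type
    head: int      # ID of the head token, 0 for the root
    deprel: str
    phead: str
    pdeprel: str


def parse_token(line: str) -> Token:
    """Split one tab-separated CoNLL-X line into its ten columns."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 10:
        raise ValueError(f"expected 10 columns, found {len(f)}")
    return Token(int(f[0]), f[1], f[2], f[3], f[4], f[5], int(f[6]), f[7], f[8], f[9])


# Hypothetical line; the category in the sixth column is only illustrative.
line = "1\tTha\tbi\tVERB\tVERB\ts[dcl]/np\t0\troot\t_\t_"
token = parse_token(line)
print(token.form, token.feats, token.head, token.deprel)
```

Keeping the categorial type in the features column means that any tool that reads CoNLL-X can still load the file, even if it ignores that column.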

Which standard did you use for the dependency annotation?

After swithering between the Briscoe and Carroll (GR) scheme, Teresa Lynn at DCU’s scheme and a mixture of the two, I eventually opted for the Universal Dependency Scheme, which is based on the Stanford scheme. This has the merit of making inter-language comparisons straightforward.

Which standard did you use for the CCG annotations?

One based very closely on CCGBank with slight modifications for Gaelic.

How large is it?

Currently there are 40 sentences and 612 tokens (roughly speaking, words and punctuation marks).

Is it POS-tagged?

Ish. The CoNLL-X format has two columns for this, though, a coarse POS tagset (simply whether something is a noun, a verb, an adposition or whatever) and a more fine-grained one that would include number, tense and so forth. I use the Universal POS tagset for both columns for now.

How many annotators did you have?

Ahem. Just me. This is a shortcoming.

Update 2015-08-24: migrated to GitHub (see above).

Posted in gdbank | Leave a comment

CLTW2014: report

Last weekend I was in Dublin for the first Celtic Language Technology Workshop, which was part of COLING2014. I am still digesting and still to follow up everything, but here’s a very brief summary.

Elaine Uí Dhonnchadha (DCU) started with an overview of Irish language technology and a plea for open resources. William Lamb and Sammy Danso (both Edinburgh) gave a two-hander about POS-tagging a Scottish Gaelic corpus of nearly 90 000 tokens. This will be coming out later this year. Having taken part in annotation tasks in the past, I was taken aback that they reported a kappa coefficient of 0.98 for two annotators on this task. (A kappa coefficient, for those who don’t know, is a modified agreement score that takes into account chance agreement. I say “a” kappa coefficient because there is more than one way of working them out. If your kappa is zero, that means your agreement is no better than chance, even though your percentage agreement might be something like 75%. Perfect agreement leads to a kappa of 1. Kappas of 0.7 or greater are pretty good, in general.) Monica Ward (also DCU) talked about building resources for the teaching of Irish: a past-tense teaching game for primary school children and a grammar checker for adult learners built around Kevin Scannell’s Gramadóir. Lastly before the coffee break Teresa Lynn (DCU) talked about cross-lingual transfer dependency parsing of Irish. On the face of it, this sounds like something that shouldn’t work at all. You take a model that has been trained on a completely different language, steam off the lexical information (which words relate to which), remove the dependency labels, and keep only the part-of-speech tags and unlabelled dependencies. Surprisingly, it kind-of works. Even more surprisingly the best results came from a model trained on Indonesian, which doesn’t seem to be VSO.
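To make the kappa remark concrete, here is a minimal sketch of Cohen's kappa for two annotators. This is one of the several variants alluded to above, and I don't know which one was used in the talk. The toy labels are made up to show how 75% raw agreement can still give a kappa of zero.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations: annotator A always says N, annotator B says N three times.
annotator_a = ["N", "N", "N", "N"]
annotator_b = ["N", "N", "N", "V"]
agreement = sum(x == y for x, y in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa(annotator_a, annotator_b)
print(agreement, kappa)
```

Here the annotators agree on three tokens out of four, but since one of them never varies, that is exactly the agreement you would expect by chance.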

At coffee, quite by chance, I was surprised and delighted to meet Mark Steedman (Edinburgh), who invented CCG.

After coffee, Thierry Poibeau (CNRS and DTAL, Cambridge) talked about mutations in Breton. Discussion ensued about whether the anomalous behaviour of, how do I put this, “professional nouns” like kiger (butcher) was semantic or not. I suspect it’s more like word classes, like animacy in the Slavic languages or the Australian language that famously has the class of women, fire and dangerous things. Kevin Scannell (Saint Louis University), who I am astonished to have not come across before, talked about machine translation from Scottish Gaelic to Irish and the various spelling reforms in Irish that have made it look rather less like Scottish Gaelic than it used to. Finally, Caoimhín Ó Donnaíle (Sabhal Mòr Ostaig), talked about the Multidict site, which has videos for language learners and a splendid interactive wrapper for websites that links to external dictionaries. The idea is that you click on a word in one frame and the dictionary entry appears in another. When I was a first-year undergraduate in 1994, Caoimhín’s websites were some of the first I remember ever seeing, probably with the Mosaic browser.

After lunch, an invited talk from Kevin Scannell, largely about Manx, but also about building NLP resources from social media and web crawling. Much of the Manx data is revived Manx, but there’s also text from Skeealyn Vannin, which was compiled by the Irish Folklore Commission in the late 1940s from native speakers. Michal Boleslav Měchura (DCU) talked about breis.focloir.ie, an online Irish grammar database. The last of the full papers was given by Sarah Cooper (Bangor), describing an app, Paldaruo, for crowdsourcing speech recognition training data. The microphones on tablets are good enough these days to record speech from ordinary members of the public, in the first instance to control a wee robot arm connected to a Raspberry Pi.

Then the “poster boaster” session, featuring me, Michal again on onomastics, Francis Tyers (Tromsø) on within-tweet language detection and Delyth Prys (Bangor), talking about the DECHE Corpus of Welsh Scholarly Writing. Then pub.

An enormous amount to take in, and lots to follow up on. I do hope this’ll be the first of many.

 

Posted in conferences | Leave a comment

Quick note on visccg and UTF-8 for openccg beginners

If you can’t get UTF-8 to work in visccg (looking at Stack Overflow suggests this might be a Python-on-the-Mac thing, but I wouldn’t swear to it) you can still edit .ccg files in Your Favourite Text Editor and ccg2xml will still process them fine.

Posted in openccg | 2 Comments

The Coordinate Structure Constraint: evidence from Irish

Previously.

Ross’s 1967 MIT thesis Constraints on Variables in Syntax introduced, among other things, the Coordinate Structure Constraint. This generalizes the intuitive notion that coordinators (in English, “and”, “but”, “or” and so on) coordinate nouns with nouns (“fish and chips”) and verbs with verbs (“come and go”), so as to exclude sentences like “Whose tax did the nurse polish her trombone and the plumber compute?”

While keeping my eyes peeled for examples of non-constituent coordination in Gaelic, I’ve been reading Mícheál Ó Siadhail’s Learning Irish, which has some examples of what the author calls “idiomatic uses of agus”. (I should note that I have a blogpost in preparation with examples from William Lamb’s Scottish Gaelic, including the constructions that are examples of “cosubordination”.) These first five coordinate non-constituents:

Bhí Bríd ann agus í tinn. (1)

Tá Cáit ansin agus leabhar mór aici. (2)

D’imigh Máirtín amach agus gan aon chóta air. (3)

Bhí an bosca ansin is mé ag tíocht abhaile. (4)

Bhí an lá gearr is thú ag imeacht thart mar sin. (5)

(1) coordinates NP + existential ANN with NP + ADJ. (2) coordinates NP + ADV with NP + PP. (3) coordinates ADV with PP. (4) coordinates NP + ADV with NP + small clause. (5) coordinates NP + ADJ with NP + small clause. Sadly there are no counterexamples of uses that are unidiomatic. The short form is also shows up in constructions with chomh (like the similar Gaelic cho):

chomh sásta is a bhí Máirtín (6)

“as pleased as Martin was”. Here is coordinates ADJ with a direct relative clause.

There are also some non-coordinative-looking uses:

An bhfuil sé míle as seo go Carna? Tá agus deich míle! (7)

Tá mé ag imeacht anois. Tá agus mise! (8)

Is maith liom an áit seo. Is maith agus liomsa! (9)

Any account of coordination in Irish at least has to be able to cope with examples (1) to (6). I hunt on for examples in Gaelic.

Posted in grammar, irish gaelic | Leave a comment