I was pleased to meet Johannes Heinecke at the International Congress of Celtic Studies in Bangor last week. As well as producing a dependency treebank for Welsh, he has written a rather smart editor for CoNLL-U files, which are pretty much the standard these days for dependency trees.
I managed to get it working this morning on a Mac running Mac OS Mojave 10.14.6 with a minimum of hassle. You will need Java and Apache Maven, plus Homebrew in order to install wget. One small surprise is that if you edit a file in a git repository then by default the new file is committed every time you edit the tree, which makes the commit history look a bit busy.
The second best bit is that you can see non-projective relations at a glance, which I certainly can’t do in emacs.
The best bit, as someone who recently wrote a paper where all the arrows in the dependency diagrams pointed the wrong way and didn’t notice until the referees pointed it out, is that there is a wee button you can click on to get a tikz version of the tree for pasting into LaTeX.
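For anyone without a graphical editor to hand, non-projective relations can also be found mechanically. What follows is a minimal sketch, not anything taken from the editor itself: it assumes each sentence is given as a list of head indices (1-based, with 0 for the root), and the function name is my own invention. An arc counts as non-projective if some word lying between the head and the dependent is not dominated by that head.

```python
def non_projective_arcs(heads):
    """Return (head, dependent) pairs for the non-projective arcs in one sentence.

    heads[i] is the head of token i+1 (tokens are numbered from 1);
    a head of 0 marks the root. This input layout is an assumption.
    """
    def dominated_by(node, ancestor):
        # Walk up from node towards the root, looking for ancestor.
        seen = set()
        while node != 0 and node not in seen:
            if node == ancestor:
                return True
            seen.add(node)  # guards against cycles in malformed input
            node = heads[node - 1]
        return False

    result = []
    for dependent, head in enumerate(heads, start=1):
        if head == 0:
            continue
        lo, hi = sorted((head, dependent))
        # Every token strictly between head and dependent must sit under head.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append((head, dependent))
    return result


# Token 2 hangs off token 4, but token 3 in between hangs off token 1.
print(non_projective_arcs([0, 4, 1, 1]))
```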
A very quick note to say that I’ve trained MaltParser, a dependency parser, with the current gdbank sentences (a mere 1223 tokens spread across 70-odd sentences), the Universal POS tagging scheme and the current Universal-ish gdbank dependency annotation scheme, and then seen how it performed on an unseen test set of 8 sentences containing 276 tokens taken from an article in The Scotsman from a few years ago.
It got 196 (71%) of the heads right, 207 (75%) of the dependency types right, and both the head and the dependency were right in 187 (68%) of cases. My initial impression is that the main problems are subordinators and my having mis-POS-tagged a few words, but there will be a confusion matrix soon.
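These three figures are what are usually called the unlabelled attachment score (heads), label accuracy (dependency types) and labelled attachment score (both). Here is a minimal sketch of how they can be computed, assuming the gold and predicted analyses are available as lists of (head, dependency type) pairs, one per token. The function name and the data layout are my assumptions, not MaltParser output.

```python
def attachment_scores(gold, predicted):
    """Compare two analyses given as lists of (head, deprel) pairs per token."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    total = len(gold)
    heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labels = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    both = sum(g == p for g, p in zip(gold, predicted))
    return {"UAS": heads / total, "LA": labels / total, "LAS": both / total}


# Toy example: the third token has the right label but the wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
predicted = [(2, "nsubj"), (0, "root"), (1, "obj")]
print(attachment_scores(gold, predicted))

# The percentages quoted above, recomputed from the raw counts.
for name, correct in [("heads", 196), ("types", 207), ("both", 187)]:
    print(f"{name}: {correct / 276:.0%}")
```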
If you train MaltParser using the learnwo flowchart in place of learn, it does all the same things, except that it writes out the sentences as it reads them in.
This means that if you have, ahem, misformatted any of your input, you can see exactly which misformatting MaltParser is complaining about, because it will be in the first sentence that hasn’t been written to stdout.
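A complementary check can be run before MaltParser sees the file at all. This is only a sketch, and it assumes the training data is in the tab-separated, ten-column CoNLL-X layout with a blank line between sentences. The function name is hypothetical, and it only catches wrong column counts, not every kind of misformatting.

```python
def first_bad_sentence(lines, n_columns=10):
    """Return (sentence number, line number, line) for the first token line
    whose tab-separated column count differs from n_columns, or None."""
    sentence_no = 1
    in_sentence = False
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if in_sentence:
                sentence_no += 1
                in_sentence = False
            continue
        in_sentence = True
        if len(line.split("\t")) != n_columns:
            return sentence_no, line_no, line
    return None


good = "\t".join(["1", "Tha", "_", "VERB", "VERB", "_", "0", "root", "_", "_"])
bad = "\t".join(["1", "mi", "_", "PRON"])  # columns missing
print(first_bad_sentence([good + "\n", "\n", bad + "\n"]))
```

To run it on a real file, pass the open file object: `first_bad_sentence(open(path, encoding="utf-8"))`, where `path` is wherever your training data lives.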
MaltParser is a dependency parser and it’s available here: http://www.maltparser.org/download.html
If you try to run the ready-built jar under Mac OS X 10.6.8 and you haven’t updated to Java 1.7, you’ll get a major.minor version number error. However, if you simply edit the references in the build.xml file to read 1.6 and then build with ant, it will whirr away for a bit and build fine.
Quick note to say that Teresa Lynn at DCU has been working on a project based on dependency treebanks for Irish. This is relevant to this blog because Irish Gaelic is very closely related to Scottish Gaelic and much of the grammar is similar, and there has also been work in the past (Clark and Curran 2007, Table 2, for example) on deriving dependency structures from CCG lexical structures.
Here are two papers I’ve had a quick look at:
Much of the literature on categorial grammar focuses on things that are difficult to handle in other frameworks and isn’t necessarily helpful if you want to find something simple. However, there are lots and lots of worked examples on the Groningen Meaning Bank Explorer. More about how it works here.
OpenCCG is a java/python toolkit for working on combinatory categorial grammar, so is ideal for this exercise.
Update 2014-07-14: if you’re using OpenCCG 0.95, the latest version, on Mac OS X 10.6.8, then as long as you have Python 2.x and Java installed and you follow the build instructions exactly, it should Just Work.
It comes with instructions for getting it to work under Unix and Windows, but on the Mac, or at least on the one I’m using, there’s a small amount of fiddling needed. Here it is:
- You may not already have a recent version of python, which you can get from http://www.python.org/download/releases/2.7.1/ as a .dmg, which has a friendly hand-holdy installation process.
- Environmental variables:
  - export JAVA_HOME=/usr (this surprised me, but it works on Mac OS X 10.4.11)
  - export PATH="$PATH:$OPENCCG_HOME/bin"
  - cd to the directory that you’ve downloaded openccg to, type pwd, and set OPENCCG_HOME to the path it prints.
You will also need to fetch yacc.py from sourceforge: http://openccg.cvs.sourceforge.net/viewvc/openccg/openccg/bin/ and put them in the bin folder in your OpenCCG installation.
If you then follow the instructions in the README file and get an error about the wrong class number you’ll have to rebuild it. Try typing ant at the command line and see what happens. I don’t remember installing ant, which means that it might come on the Mac by default. If not, you’ll have to go to http://ant.apache.org/. Good luck!
Update 2014-07-13: do NOT attempt to build by typing ‘ant’ at the command line. This does not work. Make sure you type ‘ccg-build’. Only issue the ‘ant’ command if you want to see whether ant is installed on your machine.
It comes with some minuscule test grammars including Basque and Turkish.
A better-resourced language that is VSO is Arabic, and I noticed today that Chris Brew’s group have a paper on converting the Penn Arabic Treebank into CCG. I didn’t know that Arabic had resumptive pronouns. Gaelic doesn’t, but they might be useful in posts explaining gapping later on.