06.07.2014 Views

A Treebank-based Investigation of IPP-triggering Verbs in Dutch


methods. When combined with supervised machine learning methods, such richly annotated language resources, including treebanks, play a key role in modern computational linguistics. The availability of large-scale treebanks in recent years has contributed to the blossoming of data-driven approaches to robust and practical parsing.

On the other hand, the creation of detailed and consistent syntactic annotations on a large scale turns out to be a challenging task.¹ From the choice of the appropriate linguistic framework and the design of the annotation scheme to the choice of the text source and the working protocols on the synchronization of the parallel development, as well as quality assurance, each of the steps in the entire annotation procedure presents non-trivial challenges that can impede the successful production of such resources.

The aim of the DeepBank project is to overcome some of the limitations and shortcomings which are inherent in manual corpus annotation efforts, such as the German Negra/Tiger Treebank ([2]), the Prague Dependency Treebank ([11]), and the TüBa-D/Z.² All of these have stimulated research in various sub-fields of computational linguistics where corpus-based empirical methods are used, but at a high cost of development and with limits on the level of detail in the syntactic and semantic annotations that can be consistently sustained. The central difference in the DeepBank approach is to adopt the dynamic treebanking methodology of Redwoods [18], which uses a grammar to produce full candidate analyses and has human annotators disambiguate to identify and record the correct analyses, with the disambiguation choices recorded at the granularity of constituent words and phrases. This localized disambiguation enables the treebank annotations to be repeatedly refined by making corrections and improvements to the grammar; the changes are then projected throughout the treebank by reparsing the corpus and re-applying the disambiguation choices, leaving a relatively small number of new disambiguation choices for manual disambiguation.
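The reparse-and-reapply cycle described above can be illustrated with a minimal Python sketch. It is not the Redwoods or [incr tsdb()] implementation. It assumes, purely for illustration, that each candidate analysis is reduced to a set of labelled token spans and that an annotator decision accepts or rejects one such span. The names `Analysis`, `Decisions`, `apply_decisions`, and `open_discriminants`, the example sentence, and the span labels are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumption: a disambiguation choice refers to a labelled span
# (start, end, label) over the tokens, end-exclusive.
Span = tuple[int, int, str]


@dataclass(frozen=True)
class Analysis:
    """One candidate analysis, reduced to the labelled spans it contains."""
    name: str
    spans: frozenset


@dataclass
class Decisions:
    """Annotator choices, stored per span rather than per whole tree."""
    accepted: set = field(default_factory=set)
    rejected: set = field(default_factory=set)


def apply_decisions(candidates, decisions):
    """Filter candidates by the stored choices.

    Choices whose span no longer occurs in any candidate (for example after
    a grammar change) are ignored instead of rejecting every candidate.
    """
    live = set().union(*(a.spans for a in candidates))
    must_have = decisions.accepted & live
    must_not = decisions.rejected & live
    return [
        a for a in candidates
        if must_have <= a.spans and not (must_not & a.spans)
    ]


def open_discriminants(candidates):
    """Spans that still distinguish the remaining candidates from each other."""
    if not candidates:
        return set()
    in_any = set().union(*(a.spans for a in candidates))
    in_all = set(candidates[0].spans).intersection(*(a.spans for a in candidates))
    return in_any - in_all


# Tokens: I(0) saw(1) the(2) man(3) with(4) the(5) telescope(6)
shared = {(0, 7, "S"), (1, 7, "VP"), (4, 7, "PP")}

# Grammar version 1: two candidates differing in PP attachment.
v1 = [
    Analysis("vp_attach", frozenset(shared | {(2, 4, "NP")})),
    Analysis("np_attach", frozenset(shared | {(2, 7, "NP")})),
]

# The annotator makes one local choice: the long NP is correct.
decisions = Decisions(accepted={(2, 7, "NP")})
chosen_v1 = apply_decisions(v1, decisions)

# Grammar version 2 (hypothetical revision): a second lexical entry for "saw"
# introduces a new ambiguity that the stored choice does not cover.
v2 = [
    Analysis("vp_attach_see", frozenset(shared | {(2, 4, "NP"), (1, 2, "V_see")})),
    Analysis("np_attach_see", frozenset(shared | {(2, 7, "NP"), (1, 2, "V_see")})),
    Analysis("np_attach_saw", frozenset(shared | {(2, 7, "NP"), (1, 2, "V_saw")})),
]

# Re-applying the old choice after reparsing removes the VP-attachment reading;
# only the new lexical ambiguity is left for the annotator.
remaining_v2 = apply_decisions(v2, decisions)
todo = open_discriminants(remaining_v2)
print([a.name for a in remaining_v2], sorted(todo))
```

Storing choices per span rather than per whole tree is what lets a decision survive a grammar revision in this sketch: the old choice still prunes the reparsed candidates, and only the spans returned by `open_discriminants` require new manual input.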

For the English DeepBank annotation task, we make extensive use of resources in the DELPH-IN repository³, including the PET unification-based parser ([4]), the ERG plus a regular-expression preprocessor ([1]), the LKB grammar development platform ([7]), and the [incr tsdb()] competence and performance profiling system ([17]), which includes the treebanking tools used for disambiguation and inspection. Using these resources, the task of treebank construction shifts from a labor-intensive task of drawing trees from scratch to a more intelligence-demanding task of choosing among candidate analyses to either arrive at the desired analysis or reject all candidates as ill-formed. The DeepBank approach should be differentiated from so-called treebank conversion approaches, which derive a new treebank

¹ Besides [18], which we draw more on for the remainder of the paper, similar work has been done in the HPSG framework for Dutch [22]. Moreover, there is quite a lot of related research in the LFG community, e.g., in the context of the ParGram project: [9] for German, [14] for English, and the (related) Trepil project, e.g., [20] for Norwegian.

² http://www.sfs.nphil.uni-tuebingen.de/en_tuebadz.shtml

³ http://www.delph-in.net
