A Treebank-based Investigation of IPP-triggering Verbs in Dutch
methods. When combined with supervised machine learning methods, such richly
annotated language resources, including treebanks, play a key role in modern computational
linguistics. The availability of large-scale treebanks in recent years has
contributed to the blossoming of data-driven approaches to robust and practical
parsing.
On the other hand, the creation of detailed and consistent syntactic annotations
on a large scale turns out to be a challenging task.¹ From the choice of the appropriate
linguistic framework and the design of the annotation scheme to the choice
of the text source and the working protocols for synchronizing parallel
development, as well as quality assurance, each step in the annotation
procedure presents non-trivial challenges that can impede the successful production
of such resources.
The aim of the DeepBank project is to overcome some of the limitations and
shortcomings that are inherent in manual corpus annotation efforts, such as the
German Negra/Tiger Treebank ([2]), the Prague Dependency Treebank ([11]), and
the TüBa-D/Z.² All of these have stimulated research in various sub-fields of computational
linguistics where corpus-based empirical methods are used, but at a high
cost of development and with limits on the level of detail in the syntactic and semantic
annotations that can be consistently sustained. The central difference in
the DeepBank approach is that it adopts the dynamic treebanking methodology of Redwoods
[18], which uses a grammar to produce full candidate analyses and has
human annotators disambiguate among them to identify and record the correct analyses, with
the disambiguation choices recorded at the granularity of constituent words and
phrases. This localized disambiguation enables the treebank annotations to be repeatedly
refined by making corrections and improvements to the grammar. The
changes are then projected throughout the treebank by reparsing the corpus and
re-applying the disambiguation choices, leaving a relatively small number of new
choices for manual disambiguation.
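The re-application step described above can be illustrated with a minimal sketch. It is not the Redwoods or [incr tsdb()] implementation: the encoding of a discriminant as a labeled word span, the function names `apply_decisions` and `open_discriminants`, and the toy analyses are all assumptions made for illustration. The sketch stores each annotator decision as an accepted or rejected local property, filters the candidate analyses of a reparsed sentence against those decisions, and reports which properties still need a manual choice.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Assumption: a discriminant is a labeled span (start, end, label) over the
# words of a sentence. Real discriminants are richer; this is a toy encoding.
Discriminant = Tuple[int, int, str]
Analysis = FrozenSet[Discriminant]


def apply_decisions(
    candidates: List[Analysis], decisions: Dict[Discriminant, bool]
) -> List[Analysis]:
    """Keep only the candidates consistent with every recorded decision."""
    surviving = []
    for analysis in candidates:
        consistent = all(
            (disc in analysis) == accepted
            for disc, accepted in decisions.items()
        )
        if consistent:
            surviving.append(analysis)
    return surviving


def open_discriminants(candidates: List[Analysis]) -> Set[Discriminant]:
    """Properties that still distinguish the surviving candidates."""
    if not candidates:
        return set()
    union = set().union(*candidates)
    shared = set.intersection(*(set(c) for c in candidates))
    return union - shared


# Toy sentence: "I saw the man with the telescope" (word indices 0..6).
vp_attach = frozenset({(2, 4, "NP"), (4, 7, "PP"), (1, 7, "VP")})
np_attach = frozenset({(2, 7, "NP"), (4, 7, "PP"), (1, 7, "VP")})

# The annotator rejected the NP spanning "the man with the telescope".
decisions = {(2, 7, "NP"): False}
first_pass = apply_decisions([vp_attach, np_attach], decisions)

# After a hypothetical grammar revision, reparsing yields one extra candidate.
new_reading = frozenset(
    {(2, 4, "NP"), (4, 7, "PP"), (1, 7, "VP"), (1, 4, "VP")}
)
second_pass = apply_decisions([vp_attach, np_attach, new_reading], decisions)
remaining = open_discriminants(second_pass)
```

In this toy case the stored decision still eliminates the NP-attachment reading after reparsing, and only the single property introduced by the revised grammar is left open for the annotator.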
For the English DeepBank annotation task, we make extensive use of resources
in the DELPH-IN repository,³ including the PET unification-based parser ([4]), the
ERG plus a regular-expression preprocessor ([1]), the LKB grammar development
platform ([7]), and the [incr tsdb()] competence and performance profiling
system ([17]), which includes the treebanking tools used for disambiguation and
inspection. Using these resources, the task of treebank construction shifts from the
labor-intensive task of drawing trees from scratch to the more intelligence-demanding
task of choosing among candidate analyses, either to arrive at the desired analysis
or to reject all candidates as ill-formed. The DeepBank approach should be differentiated
from so-called treebank conversion approaches, which derive a new treebank
¹ Besides [18], on which we draw more in the remainder of the paper, similar work has been done
in the HPSG framework for Dutch [22]. Moreover, there is a considerable body of related research in the LFG
community, e.g., in the context of the ParGram project ([9] for German, [14] for English) and the
related Trepil project (e.g., [20] for Norwegian).
² http://www.sfs.nphil.uni-tuebingen.de/en_tuebadz.shtml
³ http://www.delph-in.net