A Treebank-based Investigation of IPP-triggering Verbs in Dutch
Figure 2: Screenshot of the interface for working with unrecognized words
clear that this is a family name, since in both contexts it is preceded by the word
Fru ‘Mrs.’. For each context, there is a hyperlink to the position of the occurrence
in the running text so that the string may be viewed in an even broader context. An
additional hyperlink will display the page in the photographic scan of the original,
which sometimes reveals that characters are missing in the OCR version.
Typos and OCR errors can be systematically corrected in this interface, even
though they can also be edited directly in the text interface in figure 1. Variant
spellings that could be common, such as the earlier mentioned plasert, may be
added to the morphology since we would like to be able to parse the sentence even
though the word has a nonstandard orthography. Such spellings also get a special
tag in the morphology so that they are marked as deviant; we would not want to
produce these forms if the grammar were to be used for generation.
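
The asymmetry between parsing and generation described above can be illustrated with a minimal sketch. It is not the project's actual morphology: the `Entry` class, the `deviant` flag, the toy `LEXICON`, and the functions `analyse` and `generate` are hypothetical names, and the standard form and lemma paired with plasert are assumed for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    form: str              # surface form as it appears in the text
    lemma: str             # lemma the form is analysed as
    deviant: bool = False  # True for nonstandard spellings


# Hypothetical toy lexicon. "plasert" is the variant spelling mentioned in the
# text; the standard form and lemma given here are assumptions.
LEXICON = [
    Entry("plassert", "plassere"),
    Entry("plasert", "plassere", deviant=True),
]


def analyse(form):
    """Return every entry matching a surface form, deviant or not."""
    return [e for e in LEXICON if e.form == form]


def generate(lemma):
    """Return surface forms for a lemma, skipping deviant spellings."""
    return [e.form for e in LEXICON if e.lemma == lemma and not e.deviant]
```

With this layout, `analyse("plasert")` still finds an entry, so the sentence can be parsed, while `generate("plassere")` yields only the standard spelling.
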
The context will sometimes suggest that the word is part of a multiword expression.
The annotator can then select adjacent words to be added as a fixed multiword
entity. Not all multiwords may be treated in this way, of course; some require special
treatment in both the lexicon and the grammar. Identifying multiword entities is
important with respect to parsing, since those unrecognized by the morphology will
be analyzed compositionally by the parser unless they have already been entered
into the LFG lexicon.
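
One way to picture the effect of registering a fixed multiword entity is a pre-parsing step that joins the selected adjacent words into a single token, so that the parser no longer analyzes them compositionally. The sketch below is only an illustration under that assumption: the function `merge_multiwords`, the longest-match strategy, and the underscore used to join the words are hypothetical and not taken from the system described here.

```python
def merge_multiwords(tokens, multiwords):
    """Join token sequences listed in `multiwords` into single tokens.

    `multiwords` is an iterable of word tuples. Longer entries are tried
    first, so the longest match wins at each position.
    """
    known = {tuple(mw) for mw in multiwords}
    lengths = sorted({len(mw) for mw in known}, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for n in lengths:
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in known:
                # Assumed convention: join the parts with an underscore.
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No multiword entry starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

For example, with the hypothetical entry `("New", "York")`, the token list `["han", "bor", "i", "New", "York"]` becomes `["han", "bor", "i", "New_York"]`. Multiwords that need special treatment in the lexicon and grammar fall outside what such a simple joining step can handle.
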
Not all OCR, typographical and spelling errors are captured by morphological
preprocessing. This may be for instance because the string is an existing word. For
example, when the word form term ‘term’ occurs in the context term lyset ‘term
the light’, it is clear to a native speaker that the original text must have read tenn
lyset ‘light the light’. This may be confirmed by consulting the image of the original
text. Such errors may be spotted both during text cleanup and during further word