06.07.2014 Views

A Treebank-based Investigation of IPP-triggering Verbs in Dutch



Figure 2: Screenshot of the interface for working with unrecognized words

clear that this is a family name, since in both contexts it is preceded by the word Fru ‘Mrs.’. For each context, there is a hyperlink to the position of the occurrence in the running text so that the string may be viewed in an even broader context. An additional hyperlink will display the page in the photographic scan of the original, which sometimes reveals that characters are missing in the OCR version.
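As a rough illustration of the context display described above, the following sketch lists every occurrence of an unrecognized string together with its character offset (which a link into the running text could use) and a window of surrounding text. The function name `contexts`, the window width, and the sample sentence with the name Hansen are assumptions made for this example, not part of the system described here.

```python
import re

def contexts(text, word, width=20):
    """Return (offset, left, right) for each whole-word occurrence of `word`.

    `offset` is the character position in the running text; `left` and
    `right` are up to `width` characters of context on either side.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    hits = []
    for match in pattern.finditer(text):
        start, end = match.start(), match.end()
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        hits.append((start, left, right))
    return hits

# Hypothetical sample text: the unknown word is preceded by "Fru" both times.
sample = "Fru Hansen kom. Da sa Fru Hansen nei."
for offset, left, right in contexts(sample, "Hansen", width=10):
    print(f"{offset:>4}  {left}[Hansen]{right}")
```

Seeing that the left context ends in Fru in every hit is the kind of evidence an annotator would use to classify the string as a family name.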

Typos and OCR errors can be systematically corrected in this interface, even though they can also be edited directly in the text interface in figure 1. Variant spellings that could be common, such as the earlier mentioned plasert, may be added to the morphology since we would like to be able to parse the sentence even though the word has a nonstandard orthography. Such spellings also get a special tag in the morphology so that they are marked as deviant; we would not want to produce these forms if the grammar were to be used for generation.
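The asymmetry between analysis and generation can be sketched as follows. This is a minimal illustration, not the morphology actually used: the `Entry` structure, the tag name `deviant`, the other tag names, and the assumption that plasert is a variant of the standard form plassert (lemma plassere) are all introduced here for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    form: str
    lemma: str
    tags: frozenset

# Hypothetical lexicon: the variant spelling shares lemma and features
# with the standard form but carries an extra "deviant" tag.
LEXICON = [
    Entry("plassert", "plassere", frozenset({"verb", "past-participle"})),
    Entry("plasert", "plassere",
          frozenset({"verb", "past-participle", "deviant"})),
]

def analyze(form):
    """Analysis accepts every listed form, deviant or not."""
    return [e for e in LEXICON if e.form == form]

def generate(lemma, tags):
    """Generation returns matching forms but skips deviant spellings."""
    wanted = frozenset(tags)
    return [
        e.form
        for e in LEXICON
        if e.lemma == lemma and wanted <= e.tags and "deviant" not in e.tags
    ]

print(analyze("plasert")[0].lemma)
print(generate("plassere", {"verb", "past-participle"}))
```

With this arrangement the nonstandard spelling can be parsed, while a generator drawing on the same lexicon only ever produces the standard form.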

The context will sometimes suggest that the word is part of a multiword expression. The annotator can then select adjacent words to be added as a fixed multiword entity. Not all multiwords can be treated in this way, of course; some require special treatment in both the lexicon and the grammar. Identifying multiword entities is important for parsing, since those unrecognized by the morphology will be analyzed compositionally by the parser unless they have already been entered into the LFG lexicon.
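One simple way to picture the effect of registering a fixed multiword entity is a preprocessing step that merges the selected adjacent words into a single token before parsing. The sketch below assumes a greedy longest-match strategy and uses the Norwegian expression i dag ‘today’ purely as an illustration; the function name `merge_multiwords` and the data layout are hypothetical and do not describe the actual preprocessing.

```python
def merge_multiwords(tokens, multiwords):
    """Merge adjacent tokens that form a registered fixed multiword entity.

    `multiwords` is a set of token tuples. At each position the longest
    registered sequence wins; unmatched tokens are passed through as-is.
    """
    max_len = max((len(m) for m in multiwords), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in multiwords:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No multiword starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical entry added by an annotator.
registered = {("i", "dag")}
print(merge_multiwords(["han", "kom", "i", "dag"], registered))
```

Tokens that are not merged in this way reach the parser individually, which corresponds to the compositional analysis mentioned above.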

Not all OCR, typographical and spelling errors are captured by morphological preprocessing. This may be, for instance, because the erroneous string happens to be an existing word. For example, when the word form term ‘term’ occurs in the context term lyset ‘term the light’, it is clear to a native speaker that the original text must have read tenn lyset ‘light the light’. This may be confirmed by consulting the image of the original text. Such errors may be spotted both during text cleanup and during further word
