A Treebank-based Investigation of IPP-triggering Verbs in Dutch
Figure 1: Screenshot showing OCR-read text in the preprocessing interface
heading. Another common cause of sentences being incorrectly marked as headings
is missing end punctuation, a not uncommon OCR error with some fonts.
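To illustrate why missing end punctuation leads to false headings, here is a minimal sketch of one possible heading heuristic. The rule itself, the function name `looks_like_heading` and the punctuation set are assumptions made for illustration; the text does not specify how the system actually detects headings.

```python
# Hypothetical heuristic, not the system's documented rule:
# a non-empty paragraph that does not end in sentence-final
# punctuation is guessed to be a heading.
END_PUNCTUATION = (".", "!", "?", ":")

def looks_like_heading(paragraph: str) -> bool:
    """Return True if the paragraph lacks end punctuation."""
    # Ignore surrounding whitespace and trailing quotes or brackets.
    text = paragraph.strip().rstrip("\"'»”)")
    return bool(text) and not text.endswith(END_PUNCTUATION)
```

Under this assumed rule, an ordinary sentence whose final period was lost in OCR is indistinguishable from a real heading, which is the kind of error the annotator then corrects by hand.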
For each paragraph, a number of buttons provide the most common correction
choices: deleting one or more paragraphs, linking paragraphs, editing within a
paragraph, adding missing punctuation, marking a paragraph as a title or marking
a presumed title as a paragraph. The [Show] button causes the scanned image of
the text to appear on the interface; this is useful when it is necessary to check the
original.
Unknown words are marked in boldface and shaded with a yellow background
in the text, making them easy for the annotator to spot. In figure 1 this is the case
with the string plasert, an archaic spelling of plassert ‘placed’.
Unknown words are not only highlighted in the text but also presented
in an alphabetical list, as shown at the top of figure 2. There is also a list of extracted
uncertain (suspicious) words, e.g. the numeral 10, which the OCR software sometimes
produces instead of the intended word lo ‘laughed’. The lists of unknown
and uncertain words shrink once the initial text cleanup is done, so that they
display only words that need further categorization.
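The two lists can be pictured with a small sketch. It assumes a lexicon given as a set of lowercase word forms and a hand-made set of suspicious strings; the function name `find_flagged_words`, both data structures and the sample text are hypothetical and not taken from the system described here.

```python
import re

def find_flagged_words(text, lexicon, suspicious):
    """Return (unknown, uncertain) word lists for an OCR-read text.

    lexicon: set of known lowercase word forms (assumed format).
    suspicious: set of strings that OCR often produces by mistake.
    """
    tokens = re.findall(r"\w+", text)
    # Unknown words: not in the lexicon, not already flagged as
    # suspicious, and not plain numbers. Sorted alphabetically.
    unknown = sorted(
        {t for t in tokens
         if t.lower() not in lexicon
         and t not in suspicious
         and not t.isdigit()},
        key=str.lower,
    )
    # Uncertain words: valid-looking strings that are likely OCR errors.
    uncertain = sorted({t for t in tokens if t in suspicious})
    return unknown, uncertain

# Made-up sample input built from the examples in the text.
sample = "Han plasert boken og 10"
known = {"han", "boken", "og", "plassert", "lo"}
unknown, uncertain = find_flagged_words(sample, known, {"10"})
```

Once the annotator corrects plasert or replaces 10 with lo and the function is run again, both lists become shorter, which matches the shrinking behaviour described above.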
3.2 Categorizing unknown words
Foreign words, named entities, productive compounds, multiword expressions, neologisms,
misspellings and variant spellings are examples of the types of words that
may not be recognized automatically in the initial preprocessing phase and that need
to be categorized. The interface for this processing step is shown in figure 2, where
the selection of the string Bergom has opened the Store as window, in which the
annotator can make appropriate choices to correct the word or add it to the lexicon.
Each occurrence of a chosen word in the text is shown with its immediate context
in the Context(s) window. In the case of the string Bergom, the contexts make it