06.07.2014 Views

A Treebank-based Investigation of IPP-triggering Verbs in Dutch



Figure 1: Screenshot showing OCR-read text in the preprocessing interface

heading. Another common cause of sentences being incorrectly marked as headings is missing end punctuation, a not uncommon OCR error with some fonts.
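The effect of missing end punctuation can be illustrated with a minimal sketch. It assumes, purely for illustration, that a paragraph is treated as a heading whenever it does not end in a sentence-final mark; the function name and the set of marks are hypothetical and are not taken from the system described here.

```python
# Assumed set of sentence-final marks; the source does not specify one.
END_PUNCTUATION = (".", "!", "?", ":", "…")


def looks_like_heading(paragraph: str) -> bool:
    """Guess that a paragraph is a heading when it lacks end punctuation."""
    text = paragraph.strip()
    # Ignore trailing closing quotes or brackets before checking the last mark.
    text = text.rstrip("\"'»”’)]")
    # An empty paragraph is never a heading; otherwise check the final character.
    return bool(text) and not text.endswith(END_PUNCTUATION)


if __name__ == "__main__":
    for sample in ["Kapittel 1", "Han lo.", "Han lo"]:
        print(repr(sample), looks_like_heading(sample))
```

Under this assumption, a sentence whose final period was dropped by the OCR software ("Han lo") is indistinguishable from a genuine heading ("Kapittel 1"), which is why the annotator needs a way to add the missing punctuation or to re-mark a presumed title as a paragraph.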

For each paragraph, a number of buttons provide the most common correction choices: deleting one or more paragraphs, linking paragraphs, editing within a paragraph, adding missing punctuation, marking a paragraph as a title or marking a presumed title as a paragraph. The [Show] button causes the scanned image of the text to appear on the interface; this is useful when it is necessary to check the original.

Unknown words are marked in boldface and shaded with a yellow background in the text, making them easy for the annotator to spot. In figure 1 this is the case with the string plasert, an archaic spelling of plassert ‘placed’.

Unknown words are not only highlighted in the text but are also presented in an alphabetical list, as shown at the top of figure 2. There is also a list of extracted uncertain (suspicious) words, e.g. the numeral 10, which is sometimes produced by the OCR software instead of the intended word lo ‘laughed’. The lists of unknown and uncertain words shrink as the initial text cleanup proceeds, so that in the end they display only words that need to be further categorized.
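How the two lists might be derived from a text can be sketched as follows. This is only an illustration under stated assumptions: the toy lexicon, the suspicious-token set, the tokenizer and the function name `review_lists` are all hypothetical, and the actual system presumably relies on a full lexicon and morphological analysis instead.

```python
import re

# Hypothetical toy lexicon; a real system would consult a full-form lexicon.
LEXICON = {"han", "lo", "og", "boken", "ble", "plassert", "på", "bordet"}

# Hypothetical list of tokens that are valid strings but typical OCR misreadings,
# e.g. the numeral "10" produced instead of the word "lo".
SUSPICIOUS = {"10"}

TOKEN_RE = re.compile(r"\w+")


def review_lists(text: str) -> tuple[list[str], list[str]]:
    """Return (unknown words, suspicious words), each deduplicated and sorted."""
    tokens = TOKEN_RE.findall(text)
    # Unknown: not in the lexicon (case-insensitively) and not already suspicious.
    unknown = {t for t in tokens if t.lower() not in LEXICON and t not in SUSPICIOUS}
    suspicious = {t for t in tokens if t in SUSPICIOUS}
    # sorted() orders by code point, which only approximates alphabetical order.
    return sorted(unknown), sorted(suspicious)


if __name__ == "__main__":
    print(review_lists("Han 10 og boken ble plasert på bordet"))
```

Once the annotator corrects plasert or replaces 10 with lo, rerunning the function on the cleaned text returns shorter lists, which mirrors the shrinking lists described above.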

3.2 Categorizing unknown words

Foreign words, named entities, productive compounds, multiword expressions, neologisms, misspellings and variant spellings are examples of the types of words that may not be recognized automatically in the initial preprocessing phase and that need to be categorized. The interface for this processing step is shown in figure 2, where the selection of the string Bergom has opened the Store as window, in which the annotator can make appropriate choices to correct the word or add it to the lexicon. Each occurrence of a chosen word in the text is shown with its immediate context in the Context(s) window. In the case of the string Bergom, the contexts make it
