05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh

PhD thesis - School of Informatics - University of Edinburgh

PhD thesis - School of Informatics - University of Edinburgh

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Chapter 4. System Extension to a New Language 108<br />

4.3.4 Post-processing Module<br />

The final system component which required adapting to French is the post-processing<br />

module. It is designed to resolve language classification ambiguities and classify some<br />

single-character tokens. As for German, some time was invested in analysing the<br />

core system output <strong>of</strong> the French development data in order to generate these post-<br />

processing rules. The individual contribution <strong>of</strong> each <strong>of</strong> the following rules on the<br />

system performance on the French development data is discussed in Section 4.4.<br />

The most general rules are designed to disambiguate one-character tokens and in-<br />

terlingual homographs. They are flagged as English if followed by a hyphen and an<br />

English token (e-mail or joint-venture). Furthermore, typical English function words<br />

are flagged as English, including prepositions, pronouns, conjunctions and articles, as<br />

these belong to a closed class and are easily recognisable. This also avoids having to<br />

extend the core system to these categories which not only prevents some output er-<br />

rors but also improves the performance <strong>of</strong> the POS tagger which is <strong>of</strong>ten unable to<br />

process foreign function words correctly. In the post-processing step, their POS tags<br />

are therefore corrected. Any words in the closed class <strong>of</strong> English function words that<br />

are ambiguous with respect to their language such as an (in French year) or but (in<br />

French goal) are only flagged as English inclusions if their surrounding context is al-<br />

ready classified as English by the system. Similarly, the possessive marker ’s is flagged<br />

as English if it is preceded by an English token. Moreover, several rules are designed<br />

to automatically deal with names <strong>of</strong> currencies (e.g. Euro) and units <strong>of</strong> measurement<br />

(e.g. Km). Such instances are prevented from being identified as English inclusions.<br />

Similarly to the German post-processing module, abbreviation extraction (Schwartz<br />

and Hearst, 2003) is applied in order to relate language information between abbrevi-<br />

ations or acronyms and their definitions. Subsequently, post-processing is applied to<br />

guarantee that each pair and earlier plus later mentions <strong>of</strong> either the definition or the<br />

abbreviation/acronym are assigned the same language tag within a document.<br />

When analysing the errors which the system made in the development data, it was<br />

also observed that foreign person names (e.g. Graham Cluley) are frequently identi-<br />

fied as English inclusions. In the experiments described in this <strong>thesis</strong>, the system is<br />

evaluated on identifying English inclusions. These are defined as any English words<br />

in the text except for person and location names. Therefore, the evaluation data does

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!