

3.3.1 Processing Paradigm

The underlying processing paradigm of the English inclusion classifier is XML-based. As a markup language for NLP tasks, XML is expressive and flexible yet constrainable. Furthermore, there exists a wide range of XML-based tools for NLP applications which lend themselves to a modular, pipelined approach to processing whereby linguistic knowledge is computed and added incrementally as XML annotations. Moreover, XML's character encoding capabilities facilitate multilingual processing. As illustrated in Figure 3.1, the system for processing German text is essentially a UNIX pipeline which converts HTML files to XML and applies a sequence of modules: a pre-processing module for tokenisation and POS tagging, followed by a lexicon lookup, a search engine module, post-processing and an optional document consistency check which all add linguistic markup and classify tokens as either German or English. The pipeline is composed partly of calls to LT-TTT2 and LT-XML2 (Grover et al., 2006)⁴ for tokenisation and sentence splitting. In addition, non-XML public-domain tools such as the TnT tagger (Brants, 2000b) were integrated and their output incorporated into the XML markup. The primary advantage of this architecture is the ability to integrate the output of already existing tools with that of new modules specifically tailored to the task in an organised fashion. The XML output can be searched to find specific instances or to acquire counts of occurrences using the LT-XML2 tools.
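The incremental-annotation idea behind this architecture can be illustrated with a minimal sketch. The sketch below is not the LT-TTT2/LT-XML2 implementation used in the thesis; the element and attribute names (w, s, pos, lang) and the dummy tagging and lookup logic are assumptions chosen purely for illustration. Each module consumes the XML document produced by the previous one, adds its own attributes to the token elements, and passes the result on, so modules can be added, removed or reordered independently.

```python
# Illustrative sketch of the incremental XML annotation paradigm. The element
# and attribute names (<s>, <w>, pos="...", lang="...") are assumptions, not
# the actual markup produced by the LT-TTT2/LT-XML2 pipeline described above.
import xml.etree.ElementTree as ET

def tokenise(doc: ET.Element) -> ET.Element:
    """First module: wrap each whitespace-separated token in a <w> element."""
    for s in doc.iter("s"):
        tokens = (s.text or "").split()
        s.text = None
        for tok in tokens:
            w = ET.SubElement(s, "w")
            w.text = tok
    return doc

def pos_tag(doc: ET.Element) -> ET.Element:
    """Second module: add a (dummy) part-of-speech attribute to each token."""
    for w in doc.iter("w"):
        w.set("pos", "NN" if w.text and w.text[0].isupper() else "X")
    return doc

def lexicon_lookup(doc: ET.Element, english_lexicon: set) -> ET.Element:
    """Third module: mark tokens found in a (toy) English lexicon."""
    for w in doc.iter("w"):
        w.set("lang", "en" if w.text.lower() in english_lexicon else "de")
    return doc

if __name__ == "__main__":
    doc = ET.fromstring("<doc><s>Das neue Software Update ist da</s></doc>")
    # The pipeline: each step reads and returns the same XML document,
    # adding one layer of annotation at a time.
    doc = tokenise(doc)
    doc = pos_tag(doc)
    doc = lexicon_lookup(doc, english_lexicon={"software", "update"})
    print(ET.tostring(doc, encoding="unicode"))
```

Searching the annotated output for specific instances or counting occurrences, as done with the LT-XML2 tools above, then amounts to a simple XPath-style query over the added attributes, e.g. doc.findall(".//w[@lang='en']") in this sketch.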

3.3.2 Pre-processing Module

All downloaded Web documents are first of all cleaned up using TIDY⁵ to remove HTML markup and any non-textual information and then converted into XML. Alternatively, the input into the classifier can be in simple text format which is subsequently converted into XML format. The resulting XML pages simply contain the textual information of each article. Subsequently, all documents are passed through a series of pre-processing steps implemented using the LT-XML2 and LT-TTT2 tools (Grover et al., 2006) with the output of each step encoded in XML.

Two rule-based grammars which were developed specifically for German are used<br />

⁴ These tools are improved upgrades of the LT-TTT and LT-XML toolsets (Grover et al., 2000; Thompson et al., 1997) and are available under GPL as LT-TTT2 and LT-XML2 at: http://www.ltg.ed.ac.uk
⁵ http://tidy.sourceforge.net
