PhD thesis - School of Informatics - University of Edinburgh
Chapter 3. Tracking English Inclusions in German
3.3.1 Processing Paradigm
The underlying processing paradigm of the English inclusion classifier is XML-based. As a markup language for NLP tasks, XML is expressive and flexible yet constrainable. Furthermore, there exists a wide range of XML-based tools for NLP applications which lend themselves to a modular, pipelined approach to processing whereby linguistic knowledge is computed and added incrementally as XML annotations. Moreover, XML's character encoding capabilities facilitate multilingual processing. As illustrated in Figure 3.1, the system for processing German text is essentially a UNIX pipeline which converts HTML files to XML and applies a sequence of modules: a pre-processing module for tokenisation and POS tagging, followed by a lexicon lookup, a search engine module, post-processing and an optional document consistency check, all of which add linguistic markup and classify tokens as either German or English. The pipeline is composed partly of calls to LT-TTT2 and LT-XML2 (Grover et al., 2006)4 for tokenisation and sentence splitting. In addition, non-XML public-domain tools such as the TnT tagger (Brants, 2000b) were integrated and their output incorporated into the XML markup. The primary advantage of this architecture is the ability to integrate the output of already existing tools with that of new modules specifically tailored to the task in an organised fashion. The XML output can be searched to find specific instances or to acquire counts of occurrences using the LT-XML2 tools.
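The incremental-annotation idea can be sketched in a few lines of Python. The element names and the toy POS tagset below are illustrative assumptions, not the actual LT-XML2 markup: each pipeline stage reads the XML produced so far and adds one further layer of attributes without disturbing the earlier ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical token markup, as an earlier tokenisation stage might emit it.
doc = ET.fromstring("<s><w>Der</w><w>Computer</w><w>ist</w><w>neu</w></s>")

# Stage 1: a toy POS tagger adds one annotation layer (pos attributes).
toy_pos = {"Der": "ART", "Computer": "NN", "ist": "VAFIN", "neu": "ADJD"}
for w in doc.iter("w"):
    w.set("pos", toy_pos.get(w.text, "UNK"))

# Stage 2: a toy lexicon lookup adds a language attribute on top,
# leaving the stage-1 markup untouched.
english_lexicon = {"computer"}
for w in doc.iter("w"):
    w.set("lang", "en" if w.text.lower() in english_lexicon else "de")

print(ET.tostring(doc, encoding="unicode"))
```

Because each stage only adds attributes, the enriched document can still be queried downstream, e.g. to count all tokens classified as English.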
3.3.2 Pre-processing Module
All downloaded Web documents are first cleaned with TIDY5 to remove HTML markup and any non-textual information, and then converted into XML. Alternatively, the input to the classifier can be plain text, which is subsequently converted into XML format. The resulting XML pages contain only the textual information of each article. Subsequently, all documents are passed through a series of pre-processing steps implemented using the LT-XML2 and LT-TTT2 tools (Grover et al., 2006), with the output of each step encoded in XML.
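The effect of this cleanup step can be sketched in Python using the standard-library HTML parser; this is an illustration of what the TIDY stage achieves (discarding tags and non-textual content, keeping running text), not the tool itself, and the element name `text` is an assumption.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class TextExtractor(HTMLParser):
    """Collects only textual content, skipping script/style elements."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

html = ("<html><head><script>x=1</script></head>"
        "<body><p>Ein Text.</p></body></html>")
parser = TextExtractor()
parser.feed(html)

# Wrap the surviving text in a minimal XML document for the pipeline.
root = ET.Element("text")
root.text = " ".join(parser.chunks)
print(ET.tostring(root, encoding="unicode"))
```

The resulting XML carries only the article text, which is what the subsequent tokenisation and tagging steps operate on.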
Two rule-based grammars which were developed specifically for German are used<br />
4 These tools are improved upgrades of the LT-TTT and LT-XML toolsets (Grover et al., 2000; Thompson et al., 1997) and are available under GPL as LT-TTT2 and LT-XML2 at: http://www.ltg.ed.ac.uk
5 http://tidy.sourceforge.net