PhD thesis - School of Informatics - University of Edinburgh
Chapter 5. Parsing English Inclusions 139<br />
5.4 Parsing Experiments with a Hand-crafted Grammar<br />
A second set of parsing experiments involves a German parser based on a hand-crafted<br />
grammar, using the Lexical Functional Grammar (LFG) formalism, developed at the<br />
University of Stuttgart. The nature of parsing German sentences containing English<br />
inclusions with this monolingual parser will be analysed in detail. The aim is to<br />
determine whether inclusions pose as much difficulty as they do with a monolingual<br />
treebank-induced parser and to test whether additional knowledge about this<br />
language-mixing phenomenon can be exploited to overcome this problem. Considering that the<br />
treebank-induced parser sees at least some inclusions in the training data, although they are<br />
sparse, a hand-written symbolic parser is expected to have even more difficulty in<br />
dealing with English inclusions as it generally does not contain rules that handle foreign<br />
material. The parser is briefly introduced before the experiments are presented.<br />
5.4.1 Parser<br />
The Xerox Linguistic Environment (XLE) is the underlying parsing platform used in<br />
the following set of experiments (Maxwell and Kaplan, 1993). This platform<br />
functions in conjunction with a hand-written large-scale LFG of German developed<br />
by Butt et al. (2002) and improved, for example, by Dipper (2003), Rohrer and Forst<br />
(2006) and Forst and Kaplan (2006). The version of the German grammar used here<br />
contains 274 LFG-style rules compiled into an automaton with 6,584 states and 22,241<br />
arcs. Before parsing, the input is first tokenised and normalised. Subsequently,<br />
string-based multi-word identification is carried out, followed by morphological<br />
analysis, analysis guessing for unknown words and lexically-based multi-word identification<br />
(Rohrer and Forst, 2006; Forst and Kaplan, 2006). Forst and Kaplan (2006) improved<br />
the parsing coverage for this grammar from 68.3% to 73.4% on sentences 8,001 to<br />
10,000 of the TIGER corpus by revising the integrated tokeniser.<br />
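The order of the preprocessing steps described above can be illustrated with a minimal sketch. It is not the XLE implementation or its API: the function names, the tag labels and the three toy tables (`STRING_MWES`, `LEXICON`, `LEXICAL_MWES`) are assumptions introduced purely for illustration, and each stage is reduced to a simple lookup.

```python
from typing import List, Optional, Tuple

# All tables below are hypothetical toy data, not taken from the German LFG.
STRING_MWES = {("New", "York")}          # fixed strings joined before morphology
LEXICON = {"Der": "DET", "Konzern": "NOUN", "sitzt": "VERB",
           "in": "PREP", "New_York": "PROPN", "ad": "FM", "hoc": "FM"}
LEXICAL_MWES = {("ad", "hoc"): "ADV"}    # multi-words identified after morphology

Analysis = Tuple[str, Optional[str]]     # (token, tag); tag is None if unknown


def tokenise(text: str) -> List[str]:
    """Split the input on whitespace."""
    return text.split()


def normalise(tokens: List[str]) -> List[str]:
    """Strip surrounding punctuation and drop tokens that become empty."""
    stripped = (t.strip(".,;:!?\"") for t in tokens)
    return [t for t in stripped if t]


def join_string_mwes(tokens: List[str]) -> List[str]:
    """Merge adjacent tokens that form a listed fixed string."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in STRING_MWES:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def analyse_morphology(tokens: List[str]) -> List[Analysis]:
    """Look each token up in the toy lexicon; unknown words get None."""
    return [(t, LEXICON.get(t)) for t in tokens]


def guess_unknown(analyses: List[Analysis]) -> List[Analysis]:
    """Assign a placeholder analysis to every word the lexicon did not cover."""
    return [(t, tag if tag is not None else "GUESSED") for t, tag in analyses]


def join_lexical_mwes(analyses: List[Analysis]) -> List[Analysis]:
    """Merge adjacent analysed tokens that form a listed lexical multi-word."""
    out, i = [], 0
    while i < len(analyses):
        pair = tuple(tok for tok, _ in analyses[i:i + 2])
        if pair in LEXICAL_MWES:
            out.append(("_".join(pair), LEXICAL_MWES[pair]))
            i += 2
        else:
            out.append(analyses[i])
            i += 1
    return out


def preprocess(text: str) -> List[Analysis]:
    """Run the stages in the order given in the text."""
    tokens = join_string_mwes(normalise(tokenise(text)))
    return join_lexical_mwes(guess_unknown(analyse_morphology(tokens)))
```

In this sketch, a word that is missing from the toy lexicon, such as the English inclusion in `preprocess("Der Download startet ad hoc.")`, only receives the placeholder tag `GUESSED`. This is meant to show where unknown foreign material enters the pipeline, not how the actual guesser analyses it.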
The parser outputs Prolog-encoded constituent-structure (c-structure) and<br />
functional-structure (f-structure) analyses for each sentence. These two<br />
representation levels are fundamental to the linguistic theory of LFG and encode the syntactic<br />
properties of sentences. For in-depth introductions to LFG, see Falk (2001), Bresnan<br />
(2001), Dalrymple (2001) and Dalrymple et al. (1995). While c-structures represent<br />
the word order and phrasal grouping of a sentence in a tree, f-structures encode the