05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh



Chapter 5. Parsing English Inclusions

5.4 Parsing Experiments with a Hand-crafted Grammar

A second set of parsing experiments involves a German parser based on a hand-crafted grammar, using the Lexical Functional Grammar (LFG) formalism, developed at the University of Stuttgart. The nature of parsing German sentences containing English inclusions with this monolingual parser will be analysed in detail. The aim is to determine whether inclusions pose as much difficulty as they do with a monolingual treebank-induced parser and to test whether additional knowledge about this language-mixing phenomenon can be exploited to overcome this problem. Considering that the treebank-induced parser sees at least some inclusions in the training data, although they are sparse, a hand-written symbolic parser is expected to have even more difficulty in dealing with English inclusions as it generally does not contain rules that handle foreign material. Before focussing on the experiments, the parser is briefly introduced.

5.4.1 Parser

The Xerox Linguistic Environment (XLE) is the underlying parsing platform used in the following set of experiments (Maxwell and Kaplan, 1993). This platform functions in conjunction with a hand-written large-scale LFG of German developed by Butt et al. (2002) and improved, for example, by Dipper (2003), Rohrer and Forst (2006) and Forst and Kaplan (2006). The version of the German grammar used here contains 274 LFG-style rules compiled into an automaton with 6,584 states and 22,241 arcs. Before parsing, the input is first tokenised and normalised. Subsequently, string-based multi-word identification is carried out, followed by morphological analysis, analysis guessing for unknown words and lexically-based multi-word identification (Rohrer and Forst, 2006; Forst and Kaplan, 2006). Forst and Kaplan (2006) improved the parsing coverage for this grammar from 68.3% to 73.4% on sentences 8,001 to 10,000 of the TIGER corpus by revising the integrated tokeniser.
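The order of the pre-processing steps described above can be illustrated with a minimal sketch. It is not the XLE implementation: every function name, the toy lexicon and the toy multi-word list are hypothetical placeholders, chosen only to show how the stages feed into one another before parsing.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Token = Dict[str, Optional[str]]

# Hypothetical toy resources; the real grammar uses full-scale components.
TOY_LEXICON = {"der": "ART", "parser": "NN", "analysiert": "VVFIN"}
TOY_MULTIWORDS: Dict[Tuple[str, ...], str] = {("new", "york"): "NE"}


def tokenise(text: str) -> List[Token]:
    # Split into word tokens and single punctuation marks.
    return [{"form": t} for t in re.findall(r"\w+|[^\w\s]", text)]


def normalise(tokens: List[Token]) -> List[Token]:
    # Add a lower-cased form used for all later look-ups.
    for tok in tokens:
        tok["norm"] = tok["form"].lower()
    return tokens


def string_multiwords(tokens: List[Token]) -> List[Token]:
    # Merge adjacent token pairs that match a known multi-word string.
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t["norm"] for t in tokens[i:i + 2])
        if pair in TOY_MULTIWORDS:
            merged.append({
                "form": " ".join(t["form"] for t in tokens[i:i + 2]),
                "norm": " ".join(pair),
                "tag": TOY_MULTIWORDS[pair],
            })
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def morphological_analysis(tokens: List[Token]) -> List[Token]:
    # Look up each remaining token; unknown words get tag None.
    for tok in tokens:
        if "tag" not in tok:
            tok["tag"] = TOY_LEXICON.get(tok["norm"])
    return tokens


def guess_unknown(tokens: List[Token]) -> List[Token]:
    # Stand-in for analysis guessing: mark words the lexicon did not cover.
    for tok in tokens:
        if tok["tag"] is None:
            tok["tag"] = "GUESSED"
    return tokens


def lexical_multiwords(tokens: List[Token]) -> List[Token]:
    # Placeholder for lexically-based multi-word identification (no-op here).
    return tokens


STAGES: List[Callable] = [
    tokenise, normalise, string_multiwords,
    morphological_analysis, guess_unknown, lexical_multiwords,
]


def preprocess(text: str) -> List[Token]:
    # Run the stages in the order given in the text.
    data = text
    for stage in STAGES:
        data = stage(data)
    return data
```

Calling `preprocess("Der Parser analysiert New York Trends.")` in this sketch merges "New York" into a single token at the string-based stage, while "Trends", which is absent from the toy lexicon, only receives an analysis at the guessing stage. This is the kind of path an unrecognised English inclusion would be expected to take through such a pipeline.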

The parser outputs Prolog-encoded constituent-structure (c-structure) and functional-structure (f-structure) analyses for each sentence. These two representation levels are fundamental to the linguistic theory of LFG and encode the syntactic properties of sentences. For in-depth introductions to LFG, see Falk (2001), Bresnan (2001), Dalrymple (2001) and Dalrymple et al. (1995). While c-structures represent the word order and phrasal grouping of a sentence in a tree, f-structures encode the
