05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh

PhD thesis - School of Informatics - University of Edinburgh

PhD thesis - School of Informatics - University of Edinburgh

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Chapter 5. Parsing English Inclusions 119<br />

inclusions in 2,948 <strong>of</strong> 40,020 sentences (7.4%), 596 <strong>of</strong> which contained at least one<br />

multi-word inclusion. This sub-set <strong>of</strong> 596 sentences is the focus <strong>of</strong> the work reported<br />

in the remainder <strong>of</strong> this chapter, and will be referred to as the inclusion set.<br />

A gold standard parse tree for a sentence containing a typical multi-word English<br />

inclusion is shown in Figure 5.1. It can be seen that the tree is relatively flat, which<br />

is a common characteristic <strong>of</strong> TIGER treebank annotations (Brants et al., 2002). The<br />

non-terminal nodes <strong>of</strong> the tree represent a combination <strong>of</strong> both the phrase categories<br />

(e.g. noun phrase (NP)) and the grammatical functions (e.g. subject (SB)). In the exam-<br />

ple sentence, the English inclusion is contained in a proper noun (PN) phrase with a<br />

grammatical function <strong>of</strong> type noun kernel element (NK). Each terminal node is POS-<br />

tagged as a named entity (NE) with the grammatical function <strong>of</strong> type proper noun com-<br />

ponent (PNC).<br />

5.2.1 Data Sets<br />

For the following experiments, two different data sets are used:<br />

1. the inclusion set, i.e. the sentences containing multi-word English inclusions<br />

recognised by the inclusion classifier, and<br />

2. a stratified sample <strong>of</strong> sentences randomly extracted from the TIGER corpus, with<br />

strata for different sentence lengths.<br />

For the stratified sample, the strata were chosen so that the sentence length distribu-<br />

tion <strong>of</strong> the random set matches that <strong>of</strong> the inclusion set. The average sentence length <strong>of</strong><br />

this random set and the inclusion set is therefore the same at 28.4 tokens. This type <strong>of</strong><br />

sampling is necessary as parsing performance is correlated with sentence length. The<br />

inclusion set has a higher average sentence length than a completely random sample<br />

<strong>of</strong> sentences extracted from the TIGER corpus (as is displayed in Figure 5.2) which<br />

only amounts to 17.6 tokens. Both the inclusion set and the stratified random set con-<br />

sist <strong>of</strong> 596 sentences and do not overlap. They are used in the experiments with the<br />

treebank-induced parser and the hand-crafted grammar described in Sections 5.3 and<br />

5.4, respectively.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!