
from another already existing one, such as the Penn Treebank, mapping from one format to another, and often from one linguistic framework to another, adapting and often enriching the annotations semi-automatically. In contrast, the English DeepBank resource is constructed by taking as input only the original ‘raw’ WSJ text, sentence-segmented to align with the segmentation in the PTB for ease of comparison, but making no reference to any of the PTB annotations, so that we maintain a fully independent annotation pipeline, important for later evaluation of the quality of our annotations over held-out sections.

2 DeepBank

The process of DeepBank annotation of the Wall Street Journal corpus is organised into iterations of a cycle of parsing, treebanking, error analysis and grammar/treebank updates, with the goal of maximizing the accuracy of annotation through successive refinement.

Parsing  Each section of the WSJ corpus is first parsed with the PET parser using the ERG, with lexical entries for unknown words added on the fly based on a conventional part-of-speech tagger, TnT [3]. Analyses are ranked using a maximum-entropy model built using the TADM [15] package, originally trained on out-of-domain treebanked data, and later improved in accuracy for this task by including a portion of the emerging DeepBank itself as training data. A maximum of 500 highest-ranking analyses is recorded for each sentence, with this limit motivated both by practical constraints on data storage costs for each parse forest and by the processing capacity of the [incr tsdb()] treebanking tool. The existing parse-ranking model has proven to be accurate enough to ensure that the desired analysis is almost always in these top 500 readings if it is licensed by the grammar at all.
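As a rough, illustrative sketch of this ranking and pruning step (not the actual PET/TADM machinery), the fragment below scores competing analyses with a log-linear model and keeps only the highest-ranking ones; the feature names, weights, and the rank_analyses helper are invented for the example.

```python
from typing import Dict, List, Tuple

# Illustrative sketch of maximum-entropy (log-linear) parse ranking with a
# top-k cut-off, as described above. Feature names and weights are invented;
# a real model (e.g. one trained with TADM) would supply them.

def score(features: Dict[str, int], weights: Dict[str, float]) -> float:
    """Unnormalised log-linear score: dot product of feature counts and weights."""
    return sum(count * weights.get(name, 0.0) for name, count in features.items())

def rank_analyses(analyses: List[Dict[str, int]],
                  weights: Dict[str, float],
                  limit: int = 500) -> List[Tuple[float, int]]:
    """Return (score, index) pairs for at most `limit` highest-ranking analyses."""
    scored = sorted(((score(f, weights), i) for i, f in enumerate(analyses)),
                    reverse=True)
    return scored[:limit]

# Toy usage with made-up feature vectors for three competing analyses.
weights = {"rule:subj-head": 0.8, "rule:noun-compound": -0.3, "lex:unknown-word": -1.2}
analyses = [
    {"rule:subj-head": 1, "lex:unknown-word": 1},
    {"rule:subj-head": 1},
    {"rule:noun-compound": 2},
]
print(rank_analyses(analyses, weights))
```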

For each analysis in each parse forest, we record the exact derivation tree, which identifies the specific lexical entries and the lexical and syntactic rules applied to license that analysis, comprising a complete ‘recipe’ sufficient to reconstruct the full feature structure given the relevant version of the grammar. This approach enables relatively efficient storage of each parse forest without any loss of detail.
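A minimal sketch of what such a derivation ‘recipe’ could look like as a data structure follows; the class names, rule identifiers, and lexical-entry names are hypothetical and serve only to illustrate storing rule and entry identifiers rather than full feature structures.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical derivation-tree 'recipe': each node names only the grammar rule
# (or lexical entry) that licensed it, so the full feature structure can later
# be rebuilt by re-applying the same version of the grammar.

@dataclass
class LexNode:
    lexical_entry: str          # identifier of the lexical entry in the grammar
    surface: str                # the token it covers

@dataclass
class RuleNode:
    rule: str                   # identifier of the lexical or syntactic rule
    daughters: List["DerivNode"]

DerivNode = Union[LexNode, RuleNode]

def rules_used(node: DerivNode) -> List[str]:
    """Collect all rule and lexical-entry identifiers in a derivation tree."""
    if isinstance(node, LexNode):
        return [node.lexical_entry]
    return [node.rule] + [r for d in node.daughters for r in rules_used(d)]

# A toy derivation for "dogs bark", using invented rule and entry names.
tree = RuleNode("subj-head", [
    RuleNode("plur-noun", [LexNode("dog_n1", "dogs")]),
    RuleNode("3pl-verb", [LexNode("bark_v1", "bark")]),
])
print(rules_used(tree))
```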

Treebanking  For each sentence of the corpus, the parsing results are then manually disambiguated by the human annotators, using the [incr tsdb()] treebanking tool, which presents the annotator with a set of binary decisions, called discriminants, on the inclusion or exclusion of candidate lexical or phrasal elements for the desired analysis. This discriminant-based approach of [6] enables rapid reduction of the parse forest to either the single desired analysis, or to rejection of the whole forest for sentences where the grammar has failed to propose a viable analysis.⁴ On average, given n candidate trees, log₂ n decisions are needed in order to fully disambiguate the parse forest.

⁴ For some sentences, an annotator may be unsure about the correctness of the best available analysis, in which case the analysis can still be recorded in the treebank, but with a lower ‘confidence’ level.

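To make the discriminant mechanism concrete, the sketch below reduces a toy forest by yes/no decisions on individual discriminants; each decision discards the trees inconsistent with it, so a forest of, say, 500 trees typically needs only about nine decisions (log₂ 500 ≈ 9). This is a simplified illustration, not the [incr tsdb()] implementation, and the discriminant names and toy data are invented.

```python
import math

# Each candidate tree is represented simply as the set of discriminants
# (candidate lexical or phrasal elements) it contains. Every yes/no decision
# on one discriminant keeps only the trees consistent with that decision.

def disambiguate(trees, decide):
    """Reduce a list of trees (sets of discriminants) by binary decisions.

    decide(discriminant) returns True to require it, False to rule it out.
    Returns the surviving trees and the number of decisions made.
    """
    decisions = 0
    remaining = list(trees)
    while len(remaining) > 1:
        # Pick a discriminant that still distinguishes some remaining trees.
        candidates = {d for t in remaining for d in t}
        splitting = next((d for d in candidates
                          if any(d in t for t in remaining)
                          and any(d not in t for t in remaining)), None)
        if splitting is None:
            break  # no discriminant separates the remaining trees
        keep = decide(splitting)
        remaining = [t for t in remaining if (splitting in t) == keep]
        decisions += 1
    return remaining, decisions

# Toy forest of four candidate analyses; the annotator accepts 'pp-attach-verb'
# and rejects the others, leaving one tree after about log2(4) = 2 decisions.
forest = [
    {"pp-attach-verb", "noun-compound"},
    {"pp-attach-verb"},
    {"pp-attach-noun", "noun-compound"},
    {"pp-attach-noun"},
]
answers = {"pp-attach-verb": True, "pp-attach-noun": False, "noun-compound": False}
result, n_decisions = disambiguate(forest, lambda d: answers.get(d, False))
print(result, n_decisions, math.ceil(math.log2(len(forest))))
```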
