4.2 Pronominal Anaphora Resolution

In order to use a machine learning method, a suitable annotated corpus is needed. As noted in the introduction, we use part of the Eus3LB Corpus. This corpus contains 349 annotated pronominal anaphora and is distinct from the data we use to develop and evaluate our pronominal and nominal coreference resolution systems. The method used to create training instances is similar to the one explained in [25]. Positive instances are created for each annotated anaphor and its antecedents, while negative instances are created by pairing each annotated anaphor with each of the preceding noun phrases that lie between the anaphor and the antecedent. Altogether, we have 968 instances; 349 of them are positive and the rest (619) are negative.
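To make the pairing concrete, the following is a minimal sketch of this instance-creation scheme. The Mention class and the extract_features helper are illustrative placeholders, not our actual code; the real features come from the linguistic processing system defined in [4].

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    position: int  # linear position of the NP or pronoun in the document

def extract_features(anaphor: Mention, candidate: Mention) -> dict:
    # Placeholder feature extraction; the actual features are obtained
    # from the linguistic processing system defined in [4].
    return {"distance": anaphor.position - candidate.position}

def build_training_instances(anaphor: Mention, antecedent: Mention,
                             noun_phrases: list) -> list:
    """One positive instance for (anaphor, antecedent), plus one negative
    instance for every NP lying between the antecedent and the anaphor."""
    instances = [(extract_features(anaphor, antecedent), 1)]  # positive
    for np in noun_phrases:
        if antecedent.position < np.position < anaphor.position:
            instances.append((extract_features(anaphor, np), 0))  # negative
    return instances
```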

The method we use to create testing instances is the same as the one we use to create training instances, with one exception. Since we cannot know a priori what the antecedent of each pronoun is, we create instances for each possible anaphor (pronoun) and the eight noun phrases nearest to it. We choose the eight nearest NPs because experiments on our training set revealed that the antecedent lies within this distance 97% of the time. Therefore, for each anaphor we have eight candidate antecedents, i.e., eight instances. The features used are obtained from the linguistic processing system defined in [4].
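A corresponding sketch of test-instance creation, reusing the Mention and extract_features placeholders above. Restricting the candidates to NPs that precede the anaphor is an assumption made for illustration; the text above only speaks of the eight nearest NPs.

```python
def build_test_instances(anaphor: Mention, noun_phrases: list,
                         window: int = 8) -> list:
    """Pair an anaphor with its `window` nearest candidate NPs (eight in our
    setting). Candidates are assumed here to precede the anaphor, mirroring
    the training-instance creation."""
    preceding = [np for np in noun_phrases if np.position < anaphor.position]
    preceding.sort(key=lambda np: anaphor.position - np.position)
    nearest = preceding[:window]
    return [(np, extract_features(anaphor, np)) for np in nearest]
```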

The next step is to use a machine learning approach to determine the most probable antecedents for each anaphor. After consulting with linguists, we decided to return a ranking of the five most probable antecedents, with the most probable one highlighted. These coreference clusters are then displayed in the MMAX2 tool. In this manner, the annotator selects the correct antecedent for each anaphor from a set of five candidates instead of having to find it in the whole text, which saves time and improves performance. Consequently, we will be able to create a larger tagged corpus faster, as well as obtain better models for applying and improving the machine learning approach.
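The ranking step could look roughly like the sketch below, again reusing the helpers sketched above. The `score` callable stands in for whatever trained classifier returns the probability that a candidate is the antecedent; no particular learner is assumed here.

```python
from typing import Callable

def rank_antecedents(score: Callable[[dict], float], anaphor: Mention,
                     noun_phrases: list, top_n: int = 5) -> list:
    """Rank candidate antecedents by the probability that a trained model
    assigns to the coreferent class and keep the top_n (five in our setting).
    The first element is the one highlighted for the annotator in MMAX2."""
    candidates = build_test_instances(anaphor, noun_phrases)
    ranked = sorted(candidates, key=lambda item: score(item[1]), reverse=True)
    return [np for np, _features in ranked[:top_n]]
```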

Figure 2: Pronominal coreference example.

Figure 2 shows an anaphor (hark, 'he/she/it' in English) and its five most probable antecedents in the MMAX2 window. The annotator can choose the correct antecedent of the pronominal anaphor with a few clicks.
