
Between Binding and Accommodation

to develop a classification scheme which is taught to two or more annotators, who are then to classify data independently into the different categories. The resulting annotations are compared, the assumption being that a high degree of inter-annotator agreement on a classification task guarantees the reliability of our categories. If, on the other hand, there is a low degree of inter-annotator agreement for one or more of the categories, or a consistent confusion between one or more categories, we can question the validity of the category, or question the correctness of the definition or delimitation of the category. Improving category definitions, or changing the categories looked for, can often improve inter-annotator agreement. Inter-annotator agreement can be statistically measured by calculating the Kappa value.⁷
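For reference, the most widely cited form of this statistic for two annotators is Cohen's Kappa (the text does not specify which variant Poesio & Vieira used):

\[ \kappa = \frac{P_o - P_e}{1 - P_e} \]

where \(P_o\) is the observed proportion of agreement between the annotators and \(P_e\) is the proportion of agreement expected by chance, computed from each annotator's distribution over the categories; it is through this chance-agreement term that the categories and their relative frequencies enter the measure.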

Poesio & Vieira (1998) carried out two different large-scale annotation experiments on definite NPs in the Wall Street Journal, in an attempt to find a classification that is reliable and whose results could be used as a key for the evaluation of a program for definite description classification and resolution, implemented in Vieira & Poesio (2000).

Each of the two experiments used a different definition of bridging. The first experiment (2 annotators, 1,040 NPs) used a classification with five categories that referred explicitly to surface characteristics of the definite NPs. In this experiment, they included co-reference relationships that demanded some type of inference as bridging, but treated co-reference relationships between noun phrases with the same head noun as anaphoric. They obtained a Kappa score of 0.68 for the first experiment. 204 NPs were identified as "associative", i.e. bridging anaphors (cf. Poesio et al. 1997). With the definition used in this experiment, bridging examples thus made up about 20% of all the NPs studied (204 of 1,040). This gives an idea of how frequent potential bridging examples actually are in written text. A more restrictive definition would probably result in a smaller percentage of the examples being considered bridging.

The second experiment (3 annotators, 464 definite NPs) used a different annotation scheme that asked subjects to classify definite noun phrases based on a semantic definition of the relationship to anchors and antecedents rather than on surface form. Co-reference relationships were not considered bridging. In this experiment, linguistically naive annotators were used, and this could, along with the different annotation categories, account for the somewhat poorer agreement score, which was K = 0.58. It is interesting to note that 164 out of 464 NPs in this experiment were classified as co-referential by all three coders (for which there was 95% agreement), but only 7 were classified as bridging by all three, though the numbers of bridging descriptions identified by the individual annotators were 40, 29 and 49,

⁷ The Kappa statistic takes the agreement as well as the number of categories in a classification task into the equation. A Kappa value between 0.6 and 0.8 is supposed to signify some degree of agreement, and a value over 0.8 should allow conclusions to be drawn. See also Carletta et al. (1997) and Poesio & Vieira (1998) for an explanation of how the Kappa value can be calculated for an annotation task.
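As a rough computational illustration of the formula given above, the following Python sketch computes Cohen's Kappa for two annotators. The function and the toy category labels are invented for illustration only; they are not drawn from Poesio & Vieira's data, and the sketch is not a reconstruction of their actual procedure.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items.

    Illustrative sketch only; category names and data are hypothetical.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)

    # Observed agreement: proportion of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: expected overlap given each annotator's category distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    return (p_o - p_e) / (1 - p_e)

# Toy example with three hypothetical categories of definite NPs.
ann_1 = ["coref", "coref", "bridging", "new", "coref", "new", "bridging", "coref"]
ann_2 = ["coref", "new",   "bridging", "new", "coref", "new", "coref",    "coref"]
print(round(cohens_kappa(ann_1, ann_2), 2))  # prints 0.6
```

For three annotators, as in the second experiment, a multi-annotator generalization such as Fleiss' Kappa would be needed instead of the two-annotator formula sketched here.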

