12.09.2013 Views

Programme booklet (pdf)


PRESENTATION ABSTRACTS

The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation

Theijssen, Daphne and van Halteren, Hans and Boves, Lou and Oostdijk, Nelleke

Radboud University Nijmegen

Abstract

In the dative alternation in English, speakers and writers choose between the prepositional dative construction ('I gave the ball to him') and the double object construction ('I gave him the ball'). Logistic regression models have already been shown to predict over 90% of the choices correctly (e.g. Bresnan et al. 2007).
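
For reference, a logistic regression model of this choice has the standard form sketched below. The notation is introduced here for illustration and is not taken from the abstract: x_1, ..., x_k stand for the encoded properties of one dative instance, the beta values are the fitted coefficients, and treating the prepositional dative as the modelled outcome is an arbitrary coding choice.

```latex
P(\text{prepositional dative} \mid x)
  = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \sum_{i=1}^{k} \beta_i x_i)\bigr)}
```

Under the usual 0.5 threshold, a prediction counts as correct when the construction with the higher estimated probability is the one the speaker or writer actually used.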

Collecting dative instances from a corpus and encoding them with the required information is a costly procedure. We therefore developed a semi-automatic approach to do this, consisting of three steps: (1) automatically extracting dative candidates, (2) manually approving or rejecting these candidates, and (3) automatically annotating the approved candidates with the required information. The resulting data sets are noisier than data sets that have been checked completely manually, but the approach can yield much larger data sets.

We compare the effect of data set size and noisiness on the accuracy of predicting the dative alternation. We employ a 'manual' set of 2,877 instances in spoken English, taken from Switchboard (Godfrey et al. 1992) by Bresnan et al. (2007) and from ICE-GB (Greenbaum 1996) by Theijssen (2010). In addition, we use a 'semi-automatic' set with 7,755 instances from Switchboard, ICE-GB and BNC (BNC Consortium 2007). We compare the learning curves of various machine learning algorithms by randomly selecting subsets of the data and extending them with 500 instances each time. We do this for different levels of noisiness, i.e. varying the proportion of 'semi-automatic' instances (0%, 25%, 50%, 75%, 100%). The results will be presented at the conference.
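
The subset construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `learning_curve_subsets`, the fixed seed, and the placeholder instances are all hypothetical, and it assumes that every increment of 500 contains the same share of 'semi-automatic' instances, which the abstract does not specify.

```python
import random


def learning_curve_subsets(manual, semi_auto, semi_auto_share, step=500, seed=0):
    """Yield nested training subsets that grow by `step` instances each time.

    Assumed mixing scheme: every increment holds
    round(step * semi_auto_share) semi-automatic instances and the
    remainder manual instances. Growth stops as soon as either pool
    cannot supply a full increment.
    """
    if step <= 0:
        raise ValueError("step must be positive")

    rng = random.Random(seed)
    manual = list(manual)
    semi_auto = list(semi_auto)
    rng.shuffle(manual)      # random selection without replacement
    rng.shuffle(semi_auto)

    n_semi = round(step * semi_auto_share)
    n_manual = step - n_semi

    subset = []
    m_pos = s_pos = 0
    while (m_pos + n_manual <= len(manual)
           and s_pos + n_semi <= len(semi_auto)):
        subset = (subset
                  + manual[m_pos:m_pos + n_manual]
                  + semi_auto[s_pos:s_pos + n_semi])
        m_pos += n_manual
        s_pos += n_semi
        yield list(subset)   # copy, so callers cannot alter later subsets


# Placeholder instances with the set sizes given in the abstract.
manual = [("manual", i) for i in range(2877)]
semi_auto = [("semi", i) for i in range(7755)]

for share in (0.0, 0.25, 0.5, 0.75, 1.0):
    sizes = [len(s) for s in learning_curve_subsets(manual, semi_auto, share)]
    print(f"{share:.0%} semi-automatic: {len(sizes)} points, largest subset {sizes[-1]}")
```

Each yielded subset would then be used to train and evaluate a classifier, giving one point on the learning curve for that noise level. Under this assumed scheme the smaller pool limits how far each curve can extend, so curves for different noise levels end at different subset sizes.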

References

BNC Consortium (2007). The British National Corpus, version 3 (BNC XML Edition). Oxford University Computing Services.

Bresnan, Joan, Anna Cueni, Tatiana Nikitina and R. Harald Baayen (2007). Predicting the Dative Alternation. In Bouma, Gerlof, Irene Kraemer and Joost Zwarts (eds.), Cognitive
