Programme booklet (pdf)
PRESENTATION ABSTRACTS<br />
The more the merrier? How data set size and noisiness<br />
affect the accuracy of predicting the dative alternation<br />
Abstract<br />
Theijssen, Daphne and van Halteren, Hans and Boves, Lou and<br />
Oostdijk, Nelleke<br />
Radboud University Nijmegen<br />
In the dative alternation in English, speakers and writers choose between the<br />
prepositional dative construction ('I gave the ball to him') and the double object
construction ('I gave him the ball'). Logistic regression models have already been shown<br />
to be able to predict over 90% of the choices correctly (e.g. Bresnan et al. 2007).<br />
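For readers unfamiliar with the model family, a binary logistic regression of the kind referred to above has the following general form. The notation is an assumption made for illustration and is not taken from the abstract: PD stands for the prepositional dative, x_1, ..., x_k for the encoded properties of an instance, and beta_0, ..., beta_k for the fitted coefficients.

```latex
P(\mathrm{PD} \mid \mathbf{x})
  = \frac{1}{1 + \exp\bigl(-(\beta_0 + \sum_{i=1}^{k} \beta_i x_i)\bigr)}
```

Under this formulation, an instance would typically be predicted as a prepositional dative when the estimated probability exceeds 0.5, and as a double object construction otherwise.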
Collecting dative instances from a corpus and encoding them with the required<br />
information is a costly procedure. We therefore developed a semi-automatic approach<br />
to do this, consisting of three steps: (1) automatically extracting dative candidates, (2)<br />
manually approving or rejecting these candidates, and (3) automatically annotating the<br />
approved candidates with the required information. The resulting data sets are noisier<br />
than data sets that have been checked completely manually, but the approach can<br />
yield much larger data sets.<br />
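The three steps above can be sketched as a small pipeline. This is a toy illustration only, not the authors' implementation: the verb list, the `Candidate` class, the function names, and the "contains 'to'" labelling heuristic are all hypothetical, and a real system would presumably work on parsed corpus data and annotate far richer information.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Candidate:
    sentence: str
    verb: str
    construction: Optional[str] = None  # filled in by step 3


# Hypothetical toy list of dative verbs (assumption, not from the abstract).
DATIVE_VERBS = {"gave", "sent", "told"}


def extract_candidates(sentences: Iterable[str]) -> Iterator[Candidate]:
    """Step 1: automatically propose every sentence containing a dative verb."""
    for sentence in sentences:
        for token in sentence.lower().split():
            if token in DATIVE_VERBS:
                yield Candidate(sentence, token)
                break  # one candidate per sentence in this toy version


def review(candidates: Iterable[Candidate],
           approve: Callable[[Candidate], bool]) -> List[Candidate]:
    """Step 2: keep only candidates a human annotator approves.

    `approve` stands in for the manual decision.
    """
    return [c for c in candidates if approve(c)]


def annotate(candidate: Candidate) -> Candidate:
    """Step 3: automatically label the construction with a crude heuristic.

    A 'to' after the verb is taken as a prepositional dative. This is
    deliberately naive and will mislabel some sentences, which is one way
    noise can enter an automatically annotated data set.
    """
    tokens = candidate.sentence.lower().rstrip(".").split()
    after_verb = tokens[tokens.index(candidate.verb) + 1:]
    candidate.construction = (
        "prepositional" if "to" in after_verb else "double_object"
    )
    return candidate
```

Only the second step requires human time, which is why such a procedure can scale to larger data sets at the cost of some annotation errors in the automatic steps.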
We compare the effect of data set size and noisiness on the accuracy of predicting the<br />
dative alternation. We employ a 'manual' set of 2,877 instances in spoken English,<br />
taken from Switchboard (Godfrey et al. 1992) by Bresnan et al. (2007) and from ICE-GB
(Greenbaum 1996) by Theijssen (2010). In addition, we use a 'semi-automatic' set with<br />
7,755 instances from Switchboard, ICE-GB and BNC (BNC Consortium 2007). We<br />
compare the learning curves of various machine learning algorithms by randomly<br />
selecting subsets of the data and extending them with 500 instances each time. We do<br />
this for different levels of noisiness, i.e. varying the proportion of 'semi-automatic'<br />
instances (0%, 25%, 50%, 75%, 100%). The results will be presented at the conference.<br />
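The subset construction described above can be sketched as follows. The abstract does not say how the proportion of 'semi-automatic' instances is held fixed as the subsets grow, so this sketch makes its own assumptions: both pools are shuffled once, and each larger subset extends the previous one by taking longer prefixes of the two shuffled pools. The function name and parameters are hypothetical.

```python
import random
from typing import List, Sequence


def learning_curve_subsets(manual: Sequence, semi_auto: Sequence,
                           semi_fraction: float, step: int = 500,
                           seed: int = 0) -> List[list]:
    """Return nested subsets of size step, 2*step, ... with a fixed
    proportion of semi-automatic instances.

    Growth stops as soon as either pool cannot supply its share.
    """
    rng = random.Random(seed)
    manual = list(manual)
    semi_auto = list(semi_auto)
    rng.shuffle(manual)      # random selection: shuffle once, then
    rng.shuffle(semi_auto)   # take growing prefixes of each pool

    subsets = []
    size = step
    while True:
        n_semi = round(size * semi_fraction)
        n_manual = size - n_semi
        if n_semi > len(semi_auto) or n_manual > len(manual):
            break
        # Prefixes only ever grow, so each subset contains the previous one.
        subsets.append(manual[:n_manual] + semi_auto[:n_semi])
        size += step
    return subsets
```

Calling the function once per noise level (0.0, 0.25, 0.5, 0.75, 1.0) and training a classifier on each returned subset would give one learning curve per level. Note that, under these assumptions, the smaller pool limits how far each curve can extend: with 2,877 manual instances, a curve using only manual data cannot go beyond 2,500 instances in steps of 500.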
References<br />
BNC Consortium (2007). The British National Corpus, version 3 (BNC XML Edition).<br />
Oxford University Computing Services.<br />
Bresnan, Joan, Anna Cueni, Tatiana Nikitina and R. Harald Baayen (2007). Predicting the
Dative Alternation. In Bouma, Gerlof, Irene Kraemer and Joost Zwarts (eds.), Cognitive<br />