PhD thesis - School of Informatics - University of Edinburgh

Chapter 2. Background and Theory

expression matching and a combination thereof to automatically extract anglicisms in Norwegian text. The test set, a random subset of 10,000 tokens from a neologism archive (Wangensteen, 2002), was manually annotated by the author for anglicisms. For this binary classification, anglicisms were defined as either English words or compounds containing at least one element of English origin. Based on this annotation, the test data contained 563 tokens classified as anglicisms.

Using lexicon lookup only, Andersen determines that exact matching against a lexicon undergenerates in detecting anglicisms, resulting in low recall (6.75%). Conversely, fuzzy matching overgenerates, resulting in low precision (8.39%). The character n-gram matching is based on a chargram list of 1,074 items consisting of 4-6 characters which frequently occur in the British National Corpus (BNC). As these are typical English letter sequences, any word in the test set containing such a chargram is classified as English. This method leads to a higher precision of 74.73% but a relatively low recall of 36.23%. Finally, regular expression matching based on English orthographic patterns results in a precision of 60.6% and a recall of 39.0%.
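A minimal sketch of the chargram and orthographic-pattern heuristics described above. The chargram set and the regular expressions here are small hypothetical examples chosen for illustration, not Andersen's actual BNC-derived list or patterns:

```python
import re

# Hypothetical sample chargrams (4-6 characters); Andersen's actual list
# contains 1,074 items derived from frequent letter sequences in the BNC.
ENGLISH_CHARGRAMS = {"tion", "ight", "ough", "ation"}

# Hypothetical patterns for English orthography: letters and sequences
# that are rare in native Norwegian spelling.
ENGLISH_PATTERNS = [re.compile(p) for p in (r"w", r"c(?![hk])", r"th", r"sh$")]

def chargram_match(token: str) -> bool:
    """Classify a token as English if it contains a listed chargram."""
    token = token.lower()
    return any(gram in token for gram in ENGLISH_CHARGRAMS)

def regex_match(token: str) -> bool:
    """Classify a token as English if any orthographic pattern fires."""
    token = token.lower()
    return any(p.search(token) for p in ENGLISH_PATTERNS)

def combined_match(token: str) -> bool:
    """One simple way to combine the heuristics: flag if either fires."""
    return chargram_match(token) or regex_match(token)
```

Each heuristic trades precision against recall on its own; the combination flags a token when either detector fires, which raises recall at some cost in precision.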

On the 10,000 word test set of the neologism archive (Wangensteen, 2002), the best method of combining character n-gram and regular expression matching yields an accuracy of 96.32%. Simply assuming that the data does not contain any anglicisms yields an accuracy of 94.47%. Andersen’s reported accuracy score is therefore misleadingly high. In fact, the best F-score, which is calculated based on the number of recognised and target anglicisms only, amounts to only 59.4 (P = 75.8%, R = 48.8%). However, this result is unsurprisingly low, as no differentiation is made between full-word anglicisms and tokens with mixed-lingual morphemes in the gold standard.
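The arithmetic behind these figures is straightforward to reconstruct; a short sketch using the precision, recall, and class counts quoted above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-score)."""
    return 2 * precision * recall / (precision + recall)

# Best reported configuration: P = 75.8%, R = 48.8%.
best_f = f1(0.758, 0.488)  # ~0.594, i.e. the reported F-score of 59.4

# Majority-class baseline: label every token as a non-anglicism,
# assuming 563 anglicisms among the 10,000 test tokens.
baseline_accuracy = (10_000 - 563) / 10_000  # 0.9437
```

This illustrates why accuracy is a weak measure on such skewed data: a classifier that never predicts the minority class is already correct on the vast majority of tokens, so the F-score over the anglicism class is the more informative figure.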

A shortcoming of Andersen’s work, and other reviewed studies, is that the methods are not evaluated on unseen test data. The knowledge of previous evaluations could have affected the design of later algorithms. This could easily be tested on another set of data that was not used during the development stage. It would also be interesting to investigate how the methods devised by Andersen perform on running text instead of a collection of neologisms extracted from text. While Andersen’s work is already applied in a language identification module as part of a classification tool for neologisms, language identification on running text could exploit knowledge of the surrounding text. Applied in such a way, anglicism detection would also allow lexicographers to examine the use of borrowings in context.
