22.07.2013 Views

Automatic Mapping Clinical Notes to Medical - RMIT University

Automatic Mapping Clinical Notes to Medical - RMIT University

Automatic Mapping Clinical Notes to Medical - RMIT University

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Improved Default Sense Selection for Word Sense Disambiguation<br />

Tobias HAWKER and Matthew HONNIBAL<br />

Language Technology Research Group<br />

School of Information Technologies<br />

<strong>University</strong> of Sydney<br />

{<strong>to</strong>by,mhonn}@it.usyd.edu.au<br />

Abstract<br />

small, and most words will have only a few examples.<br />

The low inter-annota<strong>to</strong>r agreement exac-<br />

Supervised word sense disambiguation erbates this problem, as the already small samples<br />

has proven incredibly difficult. Despite are thus also somewhat noisy. These difficulties<br />

significant effort, there has been little suc- mean that different sense frequency heuristics can<br />

cess at using contextual features <strong>to</strong> accu- significantly change the performance of a ‘baserately<br />

assign the sense of a word. Instead, line’ system. In this paper we discuss several such<br />

few systems are able <strong>to</strong> outperform the de- heuristics, and find that most outperform the comfault<br />

sense baseline of selecting the highmonly used first-sense strategy, one by as much as<br />

est ranked WordNet sense. In this pa- 1.3%.<br />

per, we suggest that the situation is even<br />

worse than it might first appear: the highest<br />

ranked WordNet sense is not even the<br />

best default sense classifier. We evaluate<br />

several default sense heuristics, using<br />

supersenses and SemCor frequencies <strong>to</strong><br />

achieve significant improvements on the<br />

WordNet ranking strategy.<br />

The sense ranks in WordNet are derived from<br />

semantic concordance texts used in the construction<br />

of the database. Most senses have explicit<br />

counts listed in the database, although sometimes<br />

the counts will be reported as 0. In these cases,<br />

the senses are presumably ranked by the lexicographer’s<br />

intuition. Usually these counts are higher<br />

than the frequency of the sense in the SemCor<br />

1 Introduction<br />

sense-tagged corpus (Miller et al., 1993), although<br />

not always. This introduces the first alternative<br />

Word sense disambiguation is the task of select- heuristic: using SemCor frequencies where availing<br />

the sense of a word intended in a usage conable, and using the WordNet sense ranking when<br />

text. This task has proven incredibly difficult: there are no examples of the word in SemCor. We<br />

in the SENSEVAL 3 all words task, only five of find that this heuristic performs significantly better<br />

the entered systems were able <strong>to</strong> use the contex- than the first-sense strategy.<br />

tual information <strong>to</strong> select word senses more ac- Increasing attention is also being paid <strong>to</strong> coarse<br />

curately than simply selecting the sense listed as grained word senses, as it is becoming obvious<br />

most likely in WordNet (Fellbaum, 1998). Suc- that WordNet senses are <strong>to</strong>o fine grained (Hovy<br />

cessful WSD systems mostly fall back <strong>to</strong> the first- et al., 2006). Kohomban and Lee (2005) explore<br />

sense strategy unless the system was very confi- finding the most general hypernym of the sense<br />

dent in over-ruling it (Hoste et al., 2001).<br />

being used, as a coarser grained WSD task. Sim-<br />

Deciding which sense of a word is most likely, ilarly, Ciaramita and Altun (2006) presents a sys-<br />

irrespective of its context, is therefore a crucial tem that uses sequence tagging <strong>to</strong> assign ‘super-<br />

task for word sense disambiguation. The decision senses’ — lexical file numbers — <strong>to</strong> words. Both<br />

is complicated by the high cost (Chklovski and of these systems compare their performance <strong>to</strong> a<br />

Mihalcea, 2002) and low inter-annota<strong>to</strong>r agree- baseline of selecting the coarse grained parent of<br />

ment (Snyder and Palmer, 2004) of sense-tagged the first-ranked fine grained sense. Ciaramita and<br />

corpora. The high cost means that the corpora are Altun also use this baseline as a feature in their<br />

Proceedings of the 2006 Australasian Language Technology Workshop (ALTW2006), pages 9–15.<br />

9

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!