Automatic Extraction of Examples for Word Sense Disambiguation


CHAPTER 2. BASIC APPROACHES TO WORD SENSE DISAMBIGUATION

automatic mapping to the most recent version is usually also provided. Other such resources are: Hector (Atkins, 1993), Longman Dictionary of Contemporary English (Procter, 1978), BalkaNet (Stamou et al., 2002), etc.

Theoretically, each dictionary could be used as a sense inventory for WSD. However, several problems arise. First, dictionaries are not always freely available for research, which is why WordNet became the de facto standard sense inventory over the last decade. Whether it is well suited to that role is still debated. Since WordNet distinguishes between the senses of each word in an extremely fine-grained manner, it is often hard to use for WSD, and there are cases where a coarser distinction is desirable. Calzolari et al. (2002) even argue that the use of WordNet as a sense inventory in WSD yields worse results than using traditional dictionaries. However, it is not WordNet itself but the predefined sense inventory as such that appears to hinder supervised word sense disambiguation. There have been many attempts to solve the latter problem, and although none of them has completely succeeded, WordNet will continue to be the standard sense inventory for WSD.

Another problem with respect to sense inventories is their compatibility. Each dictionary has its own granularity and representation of senses, which are normally extremely hard, if not impossible, to map onto each other. Thus systems that use different sense inventories cannot be compared directly, since their performance is bound to the inventory they use. Since this issue is already well known, evaluation exercises (see Chapter 4) either use a single sense inventory for all participating systems or require that inventories that differ from the standard one provide a mapping to it.
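The mapping requirement described above can be illustrated with a minimal sketch. It assumes a system that uses its own fine-grained sense labels and a hand-written table that maps them onto a standard inventory. All sense labels, the mapping table and the function names are invented for illustration and do not come from any real inventory or evaluation exercise.

```python
from typing import Dict, List

# Hypothetical mapping from a system's own fine-grained sense labels
# to the labels of a shared standard inventory (all labels are invented).
SYSTEM_TO_STANDARD: Dict[str, str] = {
    "bank/river_edge": "bank%1",
    "bank/sloping_land": "bank%1",
    "bank/financial_institution": "bank%2",
    "bank/building": "bank%2",
}


def map_to_standard(labels: List[str], mapping: Dict[str, str]) -> List[str]:
    """Translate system labels into the standard inventory.

    Labels without a mapping become 'UNMAPPED' and can never match the gold label.
    """
    return [mapping.get(label, "UNMAPPED") for label in labels]


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of instances whose predicted sense equals the gold sense."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy data: four occurrences of "bank", tagged by the system and by annotators.
system_output = [
    "bank/river_edge",
    "bank/building",
    "bank/sloping_land",
    "bank/financial_institution",
]
gold_standard = ["bank%1", "bank%2", "bank%2", "bank%2"]

mapped = map_to_standard(system_output, SYSTEM_TO_STANDARD)
score = accuracy(mapped, gold_standard)
print(mapped, score)
```

Only after the mapping step do the system's answers and the gold annotations share a label set, which is what makes scores from different systems comparable. The sketch also shows the cost of the approach: distinctions the system makes below the granularity of the standard inventory are collapsed and no longer affect the score.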
2.3.2 Source Corpora

One of the biggest problems for supervised word sense disambiguation (the knowledge acquisition bottleneck problem) is that there are very few annotated corpora that can be used to train good and reliable broad-coverage systems. This is because the creation of such corpora requires highly laborious human effort. The method's heavy dependence on the available corpora is the reason why supervised WSD is extremely difficult, if not impossible, for languages other than English. Below follows a brief description of the main data sources for supervised WSD.

Senseval provides several corpora, not only for English but also for languages such as Italian, Basque, Catalan, Chinese, Romanian and Spanish. The most recent edition of Senseval (Senseval-3) resulted in the following annotated corpora:

- English all words - 5000 words from Penn Treebank text (Marcus et al., 1993) were tagged with WordNet senses.
- English lexical sample - 57 words collected via the Open Mind Word Expert interface (Mihalcea and Chklovski, 2003) with the WordNet sense inventory.
- Italian all words - 5000 words from the Italian Treebank (Bosco et al., 2000) semantically tagged according to the sense repository of ItalWordNet (Roventini et al., 2000).
- Italian lexical sample - 45 words with a sense inventory specially developed for the task (Italian MultiWordNet for Senseval-3).
- Basque lexical sample - 40 words with a sense inventory manually linked to WordNet.
- Catalan lexical sample - 45 words with a sense inventory manually linked to WordNet.
- Spanish lexical sample - 45 words for which the sense inventory was again specially developed and manually linked to WordNet.
- Chinese lexical sample - 20 words with a sense inventory according to the HowNet knowledge base (Dong, 1998).
- Romanian lexical sample - 50 words for which senses are collected from the new Romanian WordNet, or DEX (a widely recognized Romanian dictionary). The data is collected via the OMWE (Romanian edition) (see Section 4.5.2).
- Swedish lexical sample - the task was unfortunately cancelled, so no corpora were provided.

SemCor - (Miller et al., 1993) is a lot broader in coverage than Senseval. Around 23 346 words are gathered from the Brown Corpus (about 80%) and the novel The Red Badge of Courage (about 20%), and WordNet is used as the sense inventory.

The Open Mind Word Expert - (Mihalcea and Chklovski, 2003) consists of 230 words from the Penn Treebank, the Los Angeles Times collection, Open Mind Common Sense and others. Here, too, WordNet is used as the sense repository. This corpus, however, grows daily, since it is being created mostly by volunteers who manually annotate examples on the Web. Further information about the OMWE can be found in Section 4.5.2.

The DSO corpus - Ng and Lee (1996) gathered 191 words from the Brown Corpus and the Wall Street Journal and annotated them with senses according to WordNet.

Hector - (Atkins, 1993) accounts for about 300 words from a 20M-word pilot for the British National Corpus (BNC), for which the sense inventory is likewise taken from Hector.

HKUST-Chinese has approximately 38 725 sentences, again from the HowNet knowledge base (Dong, 1998).
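To make the scale of these resources easier to compare, the Senseval-3 lexical sample tasks listed in this section can be held in a small data structure. The following is only a sketch: the word counts and inventory descriptions are copied from the list in this section, while the class name, field names and helper function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LexicalSampleTask:
    """One Senseval-3 lexical sample task (illustrative record type)."""
    language: str
    words: int
    sense_inventory: str


# Figures taken from the Senseval-3 list in this section.
SENSEVAL3_LEXICAL_SAMPLE: List[LexicalSampleTask] = [
    LexicalSampleTask("English", 57, "WordNet"),
    LexicalSampleTask("Italian", 45, "Italian MultiWordNet for Senseval-3"),
    LexicalSampleTask("Basque", 40, "manually linked to WordNet"),
    LexicalSampleTask("Catalan", 45, "manually linked to WordNet"),
    LexicalSampleTask("Spanish", 45, "specially developed, manually linked to WordNet"),
    LexicalSampleTask("Chinese", 20, "HowNet"),
    LexicalSampleTask("Romanian", 50, "Romanian WordNet or DEX"),
]


def total_words(tasks: List[LexicalSampleTask]) -> int:
    """Sum the number of target words over all tasks."""
    return sum(task.words for task in tasks)


total = total_words(SENSEVAL3_LEXICAL_SAMPLE)
smallest = min(SENSEVAL3_LEXICAL_SAMPLE, key=lambda task: task.words)
largest = max(SENSEVAL3_LEXICAL_SAMPLE, key=lambda task: task.words)

print(f"Target words across all lexical sample tasks: {total}")
print(f"Smallest task: {smallest.language} ({smallest.words} words)")
print(f"Largest task: {largest.language} ({largest.words} words)")
```

Even taken together, the lexical sample tasks cover only a few hundred target words, which gives a concrete sense of the knowledge acquisition bottleneck discussed at the start of this section.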
