Views
3 years ago

Towards Word Sense Disambiguation of Polish - Proceedings of the ...

Towards Word Sense Disambiguation of Polish - Proceedings of the ...

Towards Word Sense Disambiguation of Polish - Proceedings of the

Proceedings of the International Multiconference onComputer Science and Information Technology pp. 73–78ISBN 978-83-60810-14-9ISSN 1896-7094Towards Word Sense Disambiguation of PolishDominik Baś, Bartosz Broda, Maciej PiaseckiInstitute of Applied Informatics, Wrocław University of Technology, PolandEmail: {bartosz.broda, maciej.piasecki}@pwr.wroc.plAbstract—We compare three different methods of Word SenseDisambiguation applied to the disambiguation of a selectedset of 13 Polish words. The selected words express differentproblems for sense disambiguation. As it is hard to find worksfor Polish in this area, our goal was to analyse applicabilityand limitations of known methods in relation to Polish andPolish language resources and tools. The obtained results are verypositive, as using limited resources, we achieved the accuracy ofsense disambiguation greatly exceeding the baseline of the mostfrequent sense. For the needs of experiments a small corpus ofrepresentative examples was manually collected and annotatedwith senses drawn from plWordNet. Different representationsof context of word occurrences were also experimentally tested.Examples of limitations and advantages of the applied methodsare discussed.I. INTRODUCTIONPOLYSEMY of word forms seems to be an intrinsic featureof the natural language and can be observed in anynatural language including Polish. It is a stumbling block forsemantic text processing and complicates access to meaningsin a semantic lexicon. One needs an algorithm choosing themost appropriate meaning of the given word form in relation tothe given context. This need was noticed as early as in 50s, andcontinuous works on Word Sense Disambiguation (henceforthabbreviated to WSD) have been performed since 70s, e.g. theWord Expert system of Wilks [1].Research on WSD has been conducted for English but alsofor many other languages. Despite its importance, it is veryhard to find published works, working systems or practicalresults for Polish. Development of the corpus annotated byword senses and next development of a WSD algorithm isplanned as a part of the project on the construction of theNational Corpus of Polish [2].Our main objective is to developed a robust WSD methodfor Polish which will be based on the Polish wordnet calledplWordNet [3], as the source of lexical meanings. However, assuch a far going enterprise requires a lot of time and workload,we decided to start with a much more limited experiment.The goal of the work presented here was to develop a WSDalgorithm for selected Polish words going in the line of lexicalsample task of Senseval Evaluation Exercises [4]. Also, lexicalsample task is one approach to minimise the utilised resources,especially manual work.As we wanted to achieve a relatively high accuracy, from thevery beginning we assumed a supervised learning model anda construction of a small training corpus, in which selectedwords were manually annotated with plWordNet synsets. Inthe case of the supervised learning algorithms one can controlwhich senses are learned and identified in text. Nevertheless,research on unsupervised sense discovery have been alsoperformed in parallel, e.g. [5], [6].II. CONSTRUCTION OF THE TRAINING CORPUSAs none of the existing Polish corpora has been semanticallyannotated, we decided to select subcorpus of the IPI PANCorpus (IPIC) [7] and extend it with annotation by word sensesfor occurrences of chosen words. We selected 13 different baseword forms corresponding to several polysemous lexemes andhomonyms. Choosing the subset, we tried to represent the varietyof different problems for WSD. We selected polysemouslexemes possessing from 2 to 7 senses, where some of thesenses have homonymous character, i.e. those senses representseparate homonyms of the same morphological base form. Weincluded homonyms, as they express very different meaningsand it should be easy to differentiate between their senses.The important factor in selection was also the coverage ofplWordNet, which still does not describe some rare senses.Finally, we also tried to compose the set of words as consistingof words that are intuitively less or more difficult for possibleWSD algorithms. The selected set includes:1) agent (4 senses, English translation: agent),2) automat (3: automaton, automatic machine andmachine-gun),3) dziób (4: beak, bow, mouth (semantically marked) andface (semantically marked)),4) język (2: language or tongue),5) klasa (6: class, classroom, form, grade),6) linia (6: line, route, figure, contour),7) pole (3: field),8) policja (2: police, police station),9) powód (2: reason and protection (person)),10) sztuka (6: art, play, craft, item),11) zamek (4: castle, door lock, zipper and breechblock),12) zbiór (7: set, collection, harvest and file),13) zespół (5: team, gruop, band, company, complex andunit).Henceforth the selected base word forms will be calledtraining words.For sense inventory we used plWordNet - a resource underdevelopment, but it has already reached the size of 14 677lexical units and is publicly available for research [3]. plWord-Net follows the general scheme of the Princeton WordNetbut it was constructed from scratch in bottom-up way. Thestarting point was the list of about 10 000 most frequent wordbase forms from IPIC. As a result plWordNet includes the978-83-60810-14-9/08/$25.00 c○ 2008 IEEE 73

Towards clustering-based word sense discrimination - Lugos
Word sense disambiguation with pattern learning and automatic ...
BOOTSTRAPPING IN WORD SENSE DISAMBIGUATION
Similarity-based Word Sense Disambiguation
Towards a hybrid approach to word-sense disambiguation in ...
Word Sense Disambiguation - cs547pa1
Using Lexicon Definitions and Internet to Disambiguate Word Senses
Word sense disambiguation and information retrieval - CiteSeerX
Fine-Grained Word Sense Disambiguation Based on Parallel ...
Nominal Taxonomies and Word Sense Disambiguation - Machine ...
Selective Sampling for Example-based Word Sense Disambiguation
Word Sense Disambiguation The problem of WSD - PEOPLE
Word Sense Disambiguation: An Empirical Survey - International ...
Using LazyBoosting for Word Sense Disambiguation - TALP - UPC
Word Sense Disambiguation is Fundamentally Multidimensional
Semi-supervised Word Sense Disambiguation ... - ResearchGate
Performance Metrics for Word Sense Disambiguation
Soft Word Sense Disambiguation
WORD SENSE DISAMBIGUATION - Leffa
Knowledge-based biomedical word sense disambiguation ...
Using unsupervised word sense disambiguation to ... - INESC-ID
Knowledge-based Biomedical Word Sense Disambiguation: a ...
A Word-Sense Disambiguated Multilingual Wikipedia Corpus - UPC
word sense disambiguation and recognizing textual entailment with ...
Using Machine Learning Algorithms for Word Sense Disambiguation ...
Word Sense Disambiguation with a Corpus-based Semantic Network
Word Sense Disambiguation Using Selectional Restriction - Ijsrp.org
Exploiting Rules for Word Sense Disambiguation in Machine ... - sepln