13.07.2015 Views

Dictionary Alignment for Context-sensitive Word Glossing


Figure 4: Overview of the Lexeed–WordNet sense alignment method

translated Lexeed definition sentences, including spurious translations for Japanese function words such as ka, ga and no that have no literal translation in English, we predict that an appropriate form of term weighting should improve the performance of our method.

As a first attempt at term weighting, we experimented with the classic SMART formulation of TF-IDF (Salton, 1971), treating the vector associated with each definition sentence as a single document.

4.3 Word stopping

As mentioned in the previous section, commonly-occurring, semantically-bleached words are a source of noise in the naive cosine similarity scoring method. One conventional way of countering their impact is to filter them out of the vectors, based on a stop word list. For our experiments, we use the stop word list provided by the Snowball project.6

6 http://snowball.tartarus.org/

4.4 POS filtering

Another source of possible noise is the translations of Japanese function words. As all the Lexeed definition sentences are POS tagged, it is a relatively simple process to filter out all Japanese function words, focusing on prefixes, suffixes and particles.

4.5 Lemmatisation, stemming and normalisation

In its basic form, our vector space model treats each distinct word as a unique term, ignoring the obvious similarity between inflectional variants of the same word, such as dragon and dragons. To remove such inflectional variation, we experiment with lemmatising all words found in both the Lexeed and WordNet vectors, using morph (Minnen et al., 2001).
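The term weighting and word stopping steps described above can be illustrated with a minimal sketch. It assumes a plain tf × log(N/df) weighting as a stand-in for the SMART formulation (the exact variant is not given in this section), and the stop list, function names and sample glosses are hypothetical, not taken from the paper or from the Snowball list.

```python
import math
from collections import Counter

# Tiny illustrative stop list (assumption); the paper uses the Snowball list.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "that", "or"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in stop_words]


def tfidf_vectors(definitions):
    """Weight terms by tf * log(N / df), one definition sentence per document."""
    docs = [Counter(tokenize(d)) for d in definitions]
    n_docs = len(docs)
    # Document frequency: number of definitions in which each term occurs.
    df = Counter(term for doc in docs for term in doc)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical glosses, used only to exercise the functions.
glosses = ["the recent past", "the day before today", "the near past"]
vectors = tfidf_vectors(glosses)
sim_related = cosine(vectors[0], vectors[2])
sim_unrelated = cosine(vectors[0], vectors[1])
```

With this weighting, a term that occurs in every definition receives a weight of zero, which has a similar effect to stop-word removal for the most frequent words.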
For similar reasons, we also experiment with the Porter stemmer, noting that stemming will further reduce the set of terms but potentially introduce spurious matches.

As part of this process (with both lemmatisation and stemming), we remove all punctuation from the definition sentences.

4.6 Lexical relations

Both the Lexeed and WordNet sense inventories are described in the form of hierarchies, making it possible to complement the sense definitions with those from neighbouring senses. The intuition behind this is that the sense granularity in the two sense inventories can vary greatly, such that a single sense in Lexeed is split across multiple WordNet synsets, which we can readily uncover by considering each sense as not a single point in WordNet but a semantic neighbourhood. For example, the second sense of the word kinou in Figure 5, which literally means “near past”, should be aligned with the second sense of yesterday, which is defined as “the recent past”. This alignment is more self-evident, however, when we observe that the hypernym of each of the two senses is defined as “past”.

In our current experiments, we only look at the utility of hypernymy. For a given sense Lexeed–
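The idea in Section 4.6 of treating a sense as a semantic neighbourhood can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses unweighted bags of words, and the sense identifiers, the data layout, and the glosses for yesterday_1 and day are hypothetical toy data. Only the glosses “near past”, “the recent past” and “past” come from the example above.

```python
import math
from collections import Counter


def bag(text):
    """Lowercased bag of words with punctuation replaced by spaces."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower()
    )
    return Counter(cleaned.split())


def cosine(u, v):
    """Cosine similarity between two Counters (missing terms count as 0)."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def neighbourhood(sense_id, inventory):
    """Words of a sense definition plus those of its hypernym, if it has one."""
    sense = inventory[sense_id]
    words = bag(sense["definition"])
    hypernym = sense.get("hypernym")
    if hypernym is not None:
        words += bag(inventory[hypernym]["definition"])
    return words


def best_alignment(source_id, source_inv, target_inv, candidates):
    """Return the candidate whose neighbourhood is most similar to the source."""
    source_words = neighbourhood(source_id, source_inv)
    return max(
        candidates,
        key=lambda c: cosine(source_words, neighbourhood(c, target_inv)),
    )


# Toy inventories (hypothetical identifiers and layout).
lexeed = {
    "kinou_2": {"definition": "near past", "hypernym": "kako"},
    "kako": {"definition": "past", "hypernym": None},
}
wordnet = {
    "yesterday_1": {"definition": "the day immediately before today",
                    "hypernym": "day"},
    "yesterday_2": {"definition": "the recent past", "hypernym": "past"},
    "day": {"definition": "a period of twenty-four hours", "hypernym": None},
    "past": {"definition": "past", "hypernym": None},
}

aligned = best_alignment("kinou_2", lexeed, wordnet,
                         ["yesterday_1", "yesterday_2"])
```

In this toy setting, adding the hypernym definitions increases the overlap between the two “past” senses, since the shared word is counted in both the sense and its hypernym.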
