Dictionary Alignment for Context-sensitive Word Glossing

More documents

Recommendations

Info

Figure 3: Example of normalisation of the translationstring; we stop at “rook” as WordNet has amatching entry for itWhen building the baseline for our evaluation, weused the SemCor corpus—a subset of the Browncorpus annotated with WordNet senses—to derivethe frequency counts of each WordNet synset (Landeset al., 1998). Section 5 discusses this process inmore detail.4 Proposed MethodsOur basic alignment method, along with various extensions,is outlined below.4.1 Basic alignment method using cosinesimilarityIn this paper, we align a semantic database ofJapanese (Lexeed) with a semantic network of English(WordNet) at the sense level. First, we useLexeed to find all possible senses of a given word,and retrieve the definition sentences for each.Since all the definition sentences are in Japanese,we use EDICT as a pivot to convert Lexeed definitionsentences into English. In this process, allpossible translations of all Japanese words foundin the definition sentences are returned, along withtheir POS classes. For every translation returned,we find entries in WordNet that match the translationand POS category. If there is no match for thegiven POS, we relax this constraint and search forentries in WordNet that match the translation but notthe POS.Problems arise when WordNet does not have amatching entry for the translation. This situationdoesn’t distinguish between hyponyms and troponyms, however,we treat the two identically.usually happens when the translation returned byEDICT is comprised of more than one English word.For a Japanese verb, e.g., the English translation inEDICT almost always begins with the auxiliary to(e.g. nomu is translated as to drink). WordNet doesnot contain a verbal entry for to drink, but does containan entry for drink. To handle this case of partialmatch, we locate the longest right word substring ofthe EDICT translation which is indexed in WordNet.A related problem is when the translation containsdomain or collocational information in parentheses.For example, ryuu is translated as both dragon andpromoted rook (shogi). The first translation has amatching entry in WordNet but the second translationdoes not. In this second case, there is no rightword substring which matches in WordNet, as weend up with rook (shogi) and then (shogi), neitherof which is contained in WordNet. In order to dealwith this situation, we first normalise the translationstrings by removing all the brackets and queryWordNet with the normalised string. Should therebe a matching entry, we stop here. If not, we thenremove all strings between brackets, and apply thelongest right word substring heuristic as above. Anillustration of this process is given in Figure 3.In the worst case of WordNet not having a matchingentry for any right word substring, we discardthe translation.At this point, we have aligned a given Japaneseword with (hopefully) one or more English words,but are still no closer to inducing sense alignmentpairs. In order to produce the sense alignments, wegenerate all pairings of Lexeed senses with WordNetsynsets for each WordNet-matched word translation.For each such pair, we compile out the Lexeed definitionsentence(s) word-translated into English, andthe WordNet glosses, and convert each into a simplevector of term frequencies. We then measure thesimilarity of each vector pair using cosine similarity.An overview of this alignment process is presentedin Figure 4.4.2 Weighting terms using TF-IDF mechanismThe basic alignment method does not use any formof term weighting, and thus overemphasises commonfunction words such as the, which and and,and downplays the impact of rare words. As we expectto have a large amount of noise in the word-128
Figure 4: Overview of the Lexeed–WordNet sense alignment methodtranslated Lexeed definition sentences, includingspurious translations for Japanese function wordssuch as ka, ga and no that have no literal translationin English, we predict that an appropriate formof term weighting should improve the performanceof our method.As a first attempt at term weighting, we experimentedwith the classic SMART formulation of TF-IDF (Salton, 1971), treating the vector associatedwith each definition sentence as a single document.4.3 Word stoppingAs mentioned in the previous section, commonlyoccurringsemantically-bleached words are a sourceof noise in the naive cosine similarity scoringmethod. One conventional way of countering theirimpact is to filter them out of the vectors, based on astop word list. For our experiments, we use the stopword list provided by the Snowball project. 64.4 POS filteringAnother source of possible noise is the translationsof Japanese function words. As all the Lexeed definitionsentences are POS tagged, it is a relativelysimple process to filter out all Japanese functionwords, focusing on prefixes, suffixes and particles.4.5 Lemmatisation, stemming andnormalisationIn its basic form, our vector space model treats distinctword as a unique term, including ignoring the6 http://snowball.tartarus.org/obvious similarity between inflectional variants ofthe same word, such as dragon and dragons. Toremove such inflectional variation, we experimentwith lemmatising all words found in both the Lexeedand WordNet vectors, using morph (Minnen etal., 2001). For similar reasons, we also experimentwith the Porter stemmer, noting that stemming willfurther reduce the set of terms but potential introducespurious matches.As part of this process (with both lemmatisationand stemming), we remove all punctuation from thedefinition sentences.4.6 Lexical relationsBoth the Lexeed and WordNet sense inventories aredescribed in the form of hierarchies, making it possibleto complement the sense definitions with thosefrom neighbouring senses. The intuition behind thisis that the sense granularity in the two sense inventoriescan vary greatly, such that a single sensein Lexeed is split across multiple WordNet synsets,which we can readily uncover by considering eachsense as not a single point in WordNet but a semanticneighbourhood. For example, the second senseof the word kinou in Figure 5, which literally means“near past”, should be aligned with the second senseof yesterday, which is defined as “the recent past”.This alignment is more self-evident, however, whenwe observe that the hypernym of each of the twosenses is defined as “past”.In our current experiments, we only look at theutility of hypernymy. For a given sense Lexeed–129
Page 1: Proceedings of the Australasian Lan
Page 6 and 7: Figure 5: The output of word-transl
Page 8 and 9: Method Precision Recall F-scoreBase

Dictionary Alignment for Context-sensitive Word Glossing

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?