06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...



94 Chapter 3. Discovering Semantic Relatedness

initial LU clusters generated by processing the lists of the k most similar LUs, see II.1 and II.2. Only the groups dissimilar to other selected groups are added to the set of committees, because the committees should ideally describe all senses of the input LUs, see II.3. The set of committees is also iteratively extended in order to cover the senses of all input LUs, see the condition in II.4.

Committees only define senses. They are not the final lemma groups we will extract. The final lemma groups, ideally sets of near-synonyms, are extracted in Phase III on the basis of committees. Each lemma can be assigned to one of several groups by its similarity to the corresponding committees. It is assumed that each sense of a polysemous lemma corresponds to some subset of features which describe the given LU assigned to some committee c (the next sense of e has been identified). CBC attempts to identify the features that describe sense c of e and remove them before the extraction of the other senses of e. The idea behind this operation is to remove sense c from the representation of e, in order to make the other senses more prominent.

The original implementation of the overlap and remove operations is straightforward: the values of all features in the intersection are simply set to 0 (Pantel, 2003). We found this technique too radical. It would be correct if the association of features and senses were strict, but that is very rarely the case. Mostly, one feature derived from a lexico-syntactic dependency corresponds to several senses in different degrees.

After a manual inspection of data collected in a co-incidence matrix, we concluded that it is hard to expect any group of features to encode some sense unambiguously. Some features also have low, accidental values, while some are very high. Finally, vector similarity is influenced by the whole vector, especially when we analyse the absolute values of similarity by comparing them to a threshold such as σ in step 2b of the CBC algorithm.

Assuming that a group of features and some part of their strength are associated with a sense just recorded, we wanted to look for an estimation of the extent to which feature values should be reduced. The best option seems to be the extraction of some association of features with senses, but for that we need an independent source of knowledge for grouping features, as was done in (Tomuro et al., 2007). Unfortunately, this is not possible in the case of a language with limited resources like Polish. Instead, we tested two simple heuristics ($w(f_i)$ is the value of the feature $f_i$, and $v_c(f_i)$ is the value of $f_i$ in the committee centroid, see footnote 25):

• minimal value: $w(f_i) = w(f_i) - \min(w(f_i), v_c(f_i))$

Footnote 25: The centroid features are calculated as the average of the features of the vectors in the committee.
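
As an illustration only, the original zeroing removal (Pantel, 2003) and the minimal-value heuristic can be compared on toy data. The sketch below rests on several assumptions that are not stated in the text: feature vectors are stored as sparse Python dicts, all feature values are non-negative, and a feature missing from a committee vector counts as 0 when the centroid is averaged. The function names are hypothetical and are not taken from the authors' implementation.

```python
def centroid(vectors):
    """Average each feature over the committee's vectors.

    Assumption: a feature absent from a vector contributes 0 to the average.
    """
    features = set().union(*vectors)
    return {f: sum(v.get(f, 0.0) for v in vectors) / len(vectors)
            for f in features}


def remove_zeroing(w, v_c):
    """Original CBC removal: set every feature shared with the centroid to 0."""
    return {f: (0.0 if v_c.get(f, 0.0) > 0 else x) for f, x in w.items()}


def remove_minimal(w, v_c):
    """Minimal-value heuristic: w(f_i) = w(f_i) - min(w(f_i), v_c(f_i)).

    Assumption: feature values are non-negative.
    """
    return {f: x - min(x, v_c.get(f, 0.0)) for f, x in w.items()}


# Toy committee of two vectors and one LU vector (made-up values).
committee = [{"a": 2.0, "b": 4.0}, {"a": 4.0, "c": 2.0}]
v_c = centroid(committee)            # a: 3.0, b: 2.0, c: 1.0
w = {"a": 5.0, "b": 1.0, "d": 3.0}

zeroed = remove_zeroing(w, v_c)      # a: 0.0, b: 0.0, d: 3.0
reduced = remove_minimal(w, v_c)     # a: 2.0, b: 0.0, d: 3.0
```

In this toy case the strong feature "a" keeps part of its value under the minimal-value heuristic, whereas zeroing discards it entirely, which is the behaviour the text describes as too radical.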
