A Wordnet from the Ground Up

3.5. Sense Discovery by Clustering

…in language data. The indirect evaluation defined in (Pantel, 2003; Lin and Pantel, 2002) will measure the level of resemblance between the division into senses made by the linguists constructing the wordnet and the division extracted via clustering.

We wanted to evaluate the algorithm's ability to reconstruct plWordNet synsets. That would confirm the applicability of the algorithm in the semi-automatic construction of wordnets. We put nouns from plWordNet on the input list of nouns (E in the algorithm). Because plWordNet is constructed bottom-up, the list consisted of the 13298 most frequent nouns in IPIC plus some most general nouns, see Section 3.4.5. The constraints were parameterised by 96142 features (41599 adjectives and participles, and 54543 nouns).

Several thresholds used in the CBC algorithm (plus a few more in the evaluation) are the major difficulty in its exact re-implementation. No method of optimising CBC with respect to the thresholds was proposed in (Pantel, 2003; Lin and Pantel, 2002),^27 and the values of all thresholds in (Pantel, 2003) were established experimentally. There was also no discussion of their dependence on the applied tools, the corpus and the characteristics of the given language.

Broda et al. (2008) performed such an analysis for Polish. Here we will outline only the most important conclusions. Experiments with using RWF instead of PMI showed that RWF gives higher precision (38.81% versus 22.37%), but leads to fewer words being assigned to groups (744 versus 2980). The value of σ, which controls when to stop assigning words to a committee (step 2b in Phase III of the algorithm), must be carefully selected for each type of MSR separately. As the value of σ increases, the precision increases too, but the number of words clustered drops significantly. When we make σ small and tune θ_ElCom (the threshold below which a word "is not similar" to any committee), we get relatively good precision but more words clustered. We found that, contrary to the statement and chart in (Pantel, 2003), tuning both thresholds was important in our case (cf. the sketches below).

The experiments confirmed our intuition that removing overlapping features in Phase III of CBC is too radical. The application of both proposed heuristics was tested experimentally and resulted in increased precision. The minimal-value heuristic increased the precision from 38.8% to 41.0% on 695 words clustered. The use of the ratio heuristic improves the result even further: the precision rises to 42.5% on 701 words clustered. A manual inspection of the results showed that the algorithm tends to produce too many overlapping senses when it uses the ratio heuristic.

Because of the indirect nature of the evaluation proposed in (Pantel, 2003), we wanted to evaluate CBC in a more direct and intuitive way.
We assumed that proper clustering …

^27 Automating this process is very difficult, because the whole process is computationally very expensive. A full iteration takes 5–7 hours on a 2.13 GHz PC with 6 GB of RAM, which makes, say, an application of Genetic Algorithms barely possible.
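The following is a minimal sketch, in Python, of the kind of feature weighting compared above: standard PMI over word-feature co-occurrence counts, plus a generic rank-based re-weighting loosely in the spirit of RWF. The rank_weights function and its k − rank scheme are our illustrative assumption only; the exact Rank Weight Function is the one defined by Broda et al. (2008).

# A sketch of feature weighting (ours, for illustration): PMI over word-feature
# counts, and a generic rank-based re-weighting loosely inspired by RWF.
import math
from collections import defaultdict

def pmi(cooc, word_totals, feat_totals, total):
    """cooc[(word, feature)] -> joint count; returns PMI weights, negatives clipped to 0."""
    weights = {}
    for (w, f), c in cooc.items():
        p_wf = c / total
        p_w = word_totals[w] / total
        p_f = feat_totals[f] / total
        weights[(w, f)] = max(0.0, math.log2(p_wf / (p_w * p_f)))
    return weights

def rank_weights(weights, k=100):
    """Replace raw weights by k - rank + 1 within each word's top-k features
    (a hypothetical rank scheme, not the exact RWF of Broda et al., 2008)."""
    by_word = defaultdict(list)
    for (w, f), v in weights.items():
        by_word[w].append((v, f))
    ranked = {}
    for w, feats in by_word.items():
        feats.sort(reverse=True)
        for rank, (_, f) in enumerate(feats[:k], start=1):
            ranked[(w, f)] = k - rank + 1
    return ranked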
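The role of σ and of the removal of overlapping features in Phase III can be summarised roughly as follows. This is a schematic sketch, not the original implementation: the names (Committee, assign_senses), the default threshold values and the cosine stand-in for the MSR are our assumptions, and the comment marks where the minimal-value and ratio heuristics would soften the feature removal.

# A schematic rendering of the word-to-committee assignment in Phase III of CBC
# (after Pantel, 2003); names, defaults and the cosine stand-in for the MSR are ours.
from dataclasses import dataclass
import math

@dataclass
class Committee:
    name: str
    centroid: dict            # feature -> weight

def cosine(u, v):
    num = sum(u[f] * v[f] for f in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def assign_senses(word_vec, committees, sigma=0.2, max_senses=5):
    """Assign one word to committees (senses) until no committee is similar enough."""
    vec = dict(word_vec)
    senses = []
    for _ in range(max_senses):
        if not vec or not committees:
            break
        best = max(committees, key=lambda c: cosine(vec, c.centroid))
        if cosine(vec, best.centroid) < sigma:
            break                      # the word is "not similar" to any committee
        senses.append(best.name)
        # Original CBC: remove every feature shared with the committee centroid,
        # so that the next assignment can capture a different sense.  The
        # minimal-value and ratio heuristics discussed above would remove only
        # part of the weight here (their exact form follows Broda et al., 2008).
        for f in list(vec):
            if f in best.centroid:
                del vec[f]
    return senses

Removing every shared feature outright, as in the loop above, is exactly the step that the experiments reported above found too radical.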
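Finally, the synset-based evaluation can be conveyed by a deliberately simplified stand-in: an assignment of a word to a discovered sense counts as correct when the sense overlaps sufficiently with some plWordNet synset containing that word. The overlap criterion and the min_overlap parameter are our simplification; the metric actually applied follows Pantel (2003).

# A simplified stand-in (ours) for the synset-based evaluation: a word-sense
# assignment counts as correct when the induced sense shares at least
# min_overlap members with some gold synset containing the word.
def sense_precision(assignments, synsets, min_overlap=2):
    """assignments: word -> list of induced senses (sets of words);
    synsets: list of gold synsets (sets of words)."""
    correct = total = 0
    for word, senses in assignments.items():
        gold = [s for s in synsets if word in s]
        for sense in senses:
            total += 1
            if any(len(sense & s) >= min_overlap for s in gold):
                correct += 1
    return correct / total if total else 0.0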
