06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...

A Wordnet from the Ground Up - School of Information Technology ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

3.5. Sense Discovery by Clustering 93II. Find committeesering only e’s features above <strong>the</strong> threshold θ MI of Mutual Information, cf(Manning and Schütze, 2001).1. Extract a set of unique word clusters by average link clustering 23 , onehighest-scoring cluster per list.2. Sort clusters in descending order and for each cluster calculate a vectorrepresentation on <strong>the</strong> basis of its elements.3. Going down <strong>the</strong> list clusters in sorted order, extend an initially empty setC of committees with clusters similar to any previously added committeebelow <strong>the</strong> threshold θ 1 .4. For each e ∈ E, if <strong>the</strong> similarity of e to any committee in C is below <strong>the</strong>threshold θ 2 , add e to <strong>the</strong> set of residues R.5. If R ≠ ∅, repeat Phase II with C (possibly ≠ ∅) and E = R.III. Assign elements to clustersFor each e in <strong>the</strong> initial input set E:1. S = identify θ T 200 = 200 committees most similar to e,2. while S ≠ ∅(a) find a cluster c ∈ S most similar to e,(b) exit <strong>the</strong> loop if <strong>the</strong> similarity of e and c is below <strong>the</strong> threshold σ,(c) if c “is not similar” to any committee in C 24 , assign e to c and remove<strong>from</strong> e its features that overlap with c’s features,(d) remove c <strong>from</strong> S.CBC has three main phases, marked by Roman numerals in <strong>the</strong> outline. In <strong>the</strong>initial Phase I, data that represent <strong>the</strong> semantic similarity of LUs are prepared. Here,CBC strongly depends on <strong>the</strong> quality of <strong>the</strong> applied MSR – <strong>the</strong> most important CBCparameter – and <strong>the</strong> MSR is transformed by taking into consideration only some features(<strong>the</strong> threshold θ MI ) and <strong>the</strong> k most similar LUs.In <strong>the</strong> next two phases, <strong>the</strong> set of possible senses is first extracted by means ofcommittees; next, LUs are assigned to committees. A committee is an LU clusterintended to express some sense by means of a cluster vector representation derived<strong>from</strong> features that describe <strong>the</strong> LUs included in it. Committees are selected <strong>from</strong> <strong>the</strong>23 Average link clustering is also referred to as group-average agglomerative clustering (Manningand Schütze, 2001).24 We interpret this as c’s similarity being below an unmentioned threshold θ ElCom .

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!