
A Wordnet from the Ground Up



5.1. Weaving the Full-fledged Structure

5. Selected groups of new lemmas were loaded into WNW, and the Algorithm of Activation-area Attachment [AAA] was run to generate suggestions of attachment areas.
6. Linguists worked freely with the lemma groups; they browsed suggestions in any order and edited the wordnet structure.
7. At any moment of the process, linguists could re-run AAA to get possibly better suggestions for those new lemmas that had not been edited yet.
8. Linguists notified the coordinator about finishing work with particular groups; the coordinator could then analyse the results using the same WNW system (accessing it via the Internet, just like the linguists).

The whole process of extracting data sets (the sources of evidence for AAA), performed in steps 1-2, took approximately 25 days on a standard PC (3GHz, 4GB RAM, one single-core processor). The time could be reduced to 2-4 days by applying a grid of at least several PCs. This one-time operation is computationally very intensive, but it prepares all data sets except classifiers at the beginning of a long-term expansion process. It is done once per list of new lemmas, independent of the size of the list. Classifier training, which is to be repeated several times as the wordnet grows, is much less computationally demanding than the other tasks. AAA is performed on the server, not on the linguists’ PCs. It takes 10-20 minutes on a PC-class server.

Clustering (step 4) is optional from the point of view of the WNW application, which can work efficiently with a list of several thousand new lemmas. Clustering is necessary for people: a huge flat list is just too difficult to comprehend, and it is practically impossible to organise several weeks of work around it.

The idea behind clustering was to divide the initial list into lemma groups in such a way that each group consists of lemmas with senses belonging to one domain common to all of them (at least the intersection of the lemma senses should belong to one domain). There is no perfect clustering algorithm, but manual grouping would be too laborious to be feasible. We applied an off-the-shelf implementation of clustering algorithms in the Cluto package (Karypis, 2002). The input to the clustering algorithms consisted of values which describe the semantic relatedness of lemma pairs, acquired from MSR_GRWF. We experimented with different algorithms. After a manual inspection of the results, we selected graph-based clustering. We did not evaluate the quality of clustering exhaustively: the mechanism played only a minor, supporting role. Due to the properties of the clustering algorithms, we repeated the process several times, each time getting some groups and a large set of ‘outliers’, which then became the input to the next run. All in all, 92 groups were constructed, and the obtained groups were loaded into WNW.
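The repeated clustering described above (some groups accepted in each run, the outliers passed on to the next run) can be illustrated with a minimal Python sketch. It is not the Cluto procedure used in the project: as a stand-in for graph-based clustering it takes connected components of a thresholded relatedness graph. The function names, the min_size cut-off, the lowering of the threshold between runs and the toy relatedness values are all assumptions made for illustration only.

```python
from itertools import combinations


def connected_groups(lemmas, relatedness, threshold):
    """Group lemmas by connected components of a thresholded relatedness graph.

    `relatedness` maps frozenset({a, b}) to a score (hypothetical data layout);
    missing pairs are treated as unrelated.
    """
    neighbours = {lemma: set() for lemma in lemmas}
    for a, b in combinations(lemmas, 2):
        if relatedness.get(frozenset((a, b)), 0.0) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for lemma in lemmas:
        if lemma in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [lemma], []
        seen.add(lemma)
        while stack:
            current = stack.pop()
            component.append(current)
            for other in neighbours[current] - seen:
                seen.add(other)
                stack.append(other)
        groups.append(sorted(component))
    return groups


def cluster_in_rounds(lemmas, relatedness, thresholds, min_size=2):
    """Cluster repeatedly; each run re-clusters the outliers of the previous run."""
    accepted, remaining = [], list(lemmas)
    for threshold in thresholds:
        outliers = []
        for group in connected_groups(remaining, relatedness, threshold):
            if len(group) >= min_size:
                accepted.append(group)   # keep as a lemma group
            else:
                outliers.extend(group)   # too small: try again in the next run
        remaining = outliers
        if not remaining:
            break
    return accepted, sorted(remaining)


# Toy data, invented for this example only.
toy_lemmas = ["cat", "dog", "car", "truck", "apple", "pear", "idea"]
toy_relatedness = {
    frozenset(("cat", "dog")): 0.9,
    frozenset(("car", "truck")): 0.8,
    frozenset(("apple", "pear")): 0.4,
}
groups, outliers = cluster_in_rounds(toy_lemmas, toy_relatedness, [0.7, 0.3])
print(groups, outliers)
```

In this toy run the first pass accepts two groups, and the second pass, with a lower threshold, forms one more group from the outliers and leaves a single lemma ungrouped. In practice such leftovers would still need manual handling.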
