A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


Chapter 3. Discovering Semantic Relatedness

drawback of ROCK is that it sometimes produces a very deep and unbalanced hierarchy. On the other hand, GHSOM assigned pairs of documents to one cluster which did not appear together in any manually created category more often than ROCK did.

We wanted to label document groups clustered in a hierarchical tree with representative words. Words describing groups of documents closer to the root of the tree should be more general than words used for the documents in the leaves. Ideally, we would obtain a basic hypernymy structure for plWordNet (or at least instances of the is-a relation) out of the assigned labels.

Keyword extraction can be supervised or unsupervised. Supervised algorithms require ample manually constructed resources. We applied only such unsupervised methods as try to capture statistical properties of word occurrences in order to identify the words which best describe a given document. The statistics can be counted locally, using data from a single document only, or estimated from a large body of text. To benefit from both local and global strategies, we extended the method proposed by Indyka-Piasecka (2004) with the algorithm of Matsuo and Ishizuka (2004) into a hybrid keyword extraction method.

Indyka-Piasecka (2004) assigns a weight w to every lemma l that occurs in each document of a group. Additionally, lemmas are filtered on the basis of their document frequency df_l, that is, the number of documents in which lemma l occurred. Neither rare nor frequent lemmas are good discriminators of document content (cf. Indyka-Piasecka, 2004).
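This document-frequency filter can be sketched as follows; the relative-frequency bounds `lo` and `hi` are illustrative assumptions, not values given in the text:

```python
def filter_by_df(df_counts, n_docs, lo=0.05, hi=0.5):
    """Discard lemmas that are too rare or too frequent to discriminate
    document content.  df_counts maps lemma -> document frequency df_l;
    the lo/hi relative-frequency bounds are illustrative assumptions."""
    return {l for l, df in df_counts.items() if lo <= df / n_docs <= hi}
```

For example, with 100 documents, a mid-frequency lemma is kept while a near-ubiquitous stopword and a one-off lemma are both discarded.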
The weight w is calculated using two weighting schemes, tf.idf and cue validity:

tf.idf_{l,d} = tf_{l,d} × log(N / df_l)    (3.6)

cv = tf_group / tf    (3.7)

where tf and df denote term frequency and document frequency, N is the number of documents, and tf_group is the lemma's term frequency within the document group.

Matsuo and Ishizuka (2004) used a three-step process to assign a weight to every lemma. First, all words in a document are reduced to their lemmas (basic morphological forms) and filtered on the basis of term frequency and a stoplist. Then lemmas in a document are clustered using two algorithms. If two lemmas have similar distributions, they belong to the same group. As a measure of the similarity of the probability distributions of two lemmas l_1 and l_2, Matsuo and Ishizuka (2004) used the Jensen–Shannon divergence.^22 Lemmas are also clustered when they follow a similar co-occurrence pattern with other lemmas. This can be measured using Mutual Information.

22 The Jensen–Shannon divergence is a symmetrised and smoothed version of the Kullback–Leibler divergence (Manning and Schütze, 2001).
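The two weighting schemes of Eqs. 3.6 and 3.7 can be sketched directly; the function and parameter names below are my own, with tf_total standing for the lemma's term frequency in the whole collection:

```python
import math

def tf_idf(tf_ld, df_l, n_docs):
    """tf.idf weight of lemma l in document d (Eq. 3.6):
    tf.idf_{l,d} = tf_{l,d} * log(N / df_l)."""
    return tf_ld * math.log(n_docs / df_l)

def cue_validity(tf_group, tf_total):
    """Cue validity (Eq. 3.7): the lemma's term frequency inside the
    document group relative to its frequency in the whole collection."""
    return tf_group / tf_total
```

Note that a lemma occurring in every document gets tf.idf = 0 (log(N/N) = 0), which is consistent with filtering out overly frequent lemmas.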
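The Jensen–Shannon divergence used in the distributional clustering step is, as the footnote says, a symmetrised and smoothed Kullback–Leibler divergence: each distribution is compared against their mixture. A minimal sketch over discrete distributions given as equal-length probability lists:

```python
import math

def kl(p, q):
    """Kullback–Leibler divergence of discrete distributions p and q
    (natural log); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen–Shannon divergence: average KL divergence of p and q
    against the mixture m = (p + q) / 2.  The mixture smooths away
    zero probabilities, so jsd is always finite and symmetric."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give a divergence of 0, and disjoint distributions give the maximum value log 2 (with natural logarithms).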
