A Wordnet from the Ground Up

3.4. Measures of Semantic Relatedness

• text windows – the context is a window (a whole document in some special cases); co-occurrences with particular LUs serve as constraints;
• lexico-syntactic constraints – the context is a sentence, clause or phrase; lexico-syntactic relations serve as constraints.

Mohammad and Hirst (2006) write that this distinguishes between measures of semantic relatedness and semantic similarity, but we feel that intermediate methods are quite conceivable. For example, one can combine lexico-syntactic constraints with co-occurrences in the description of context. So, there is a continuum of methods with these two extremes.

In the seminal paper on Latent Semantic Analysis [LSA] (Landauer and Dumais, 1997), a context is simply the whole document (longer documents were truncated to a predefined size). The created co-incidence matrix (also called co-occurrence matrix) describes nouns⁷ by the frequencies of their occurrences across documents. Rows correspond to nouns (60 768), columns to documents (30 473), and a cell M[n_i, d_j] stores the number of occurrences of the noun n_i in the document d_j. The initial cell values are then weighted by the logent function (Section 3.4.4) and the whole matrix is transformed by Singular Value Decomposition (SVD) (Berry, 1992) to a matrix of reduced dimensions. The SVD transformation not only improves efficiency by reducing the row size but also – much more importantly – emphasises relatedness⁸ between particular nouns, or its absence. The final MSR value is calculated by comparing, using the cosine measure, rows of the reduced matrix that describe particular LUs. The relatively good result of 64.4% achieved in the Test of English as a Foreign Language (discussed briefly in Section 3.3) may have been due to the high quality of the corpus: Grolier Encyclopedia.

In order to overcome the corpus size restriction induced in LSA by the application of SVD, Schütze (1998) proposed a method called Word Space. A text window moves across documents. At each position of the window, statistics are collected: co-occurrence of the word in the centre of the context with a number of meaning bearers (selected general words). Turney (2001) used the Altavista search engine to search for co-occurrences of LUs in millions of documents on the Internet and thus to calculate an MSR.

Experiments performed on Polish data (Piasecki and Broda, 2007) suggest that text-window contexts described by LU co-occurrences result in MSRs that produce

⁷ Only noun word forms were described in the experiment of Landauer and Dumais (1997).
⁸ Landauer and Dumais (1997) wrote about “similarity”, but in keeping with the earlier remarks we prefer to talk about semantic relatedness, because LSA is a typical text-window MSR.
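To make the LSA pipeline described above more tangible, the short Python sketch below builds a word-by-document count matrix for an invented toy corpus, applies one common formulation of log-entropy weighting (the exact logent function is discussed in Section 3.4.4 and may differ in detail), reduces the matrix with truncated SVD, and compares the reduced word vectors with the cosine measure. The corpus, the parameter k and all identifiers are illustrative assumptions, not the setup of Landauer and Dumais (1997).

```python
import numpy as np

# Toy corpus: two intuitive topics (pets vs. cars), invented purely for illustration.
docs = [
    "cat dog pet animal",
    "dog animal bone pet",
    "car engine wheel road",
    "engine car road fuel",
    "cat pet fur animal",
]

# Word-by-document count matrix M[w, d], analogous to the noun-by-document matrix of LSA.
vocab = sorted({w for d in docs for w in d.split()})
w_index = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        M[w_index[w], j] += 1

# Log-entropy weighting (one common formulation, assumed here): local weight
# log(1 + f), global weight 1 minus the normalised entropy of the word's
# distribution over documents.
p = M / M.sum(axis=1, keepdims=True)          # every word occurs at least once
logp = np.log(np.where(p > 0, p, 1.0))        # log(1) = 0 where p == 0
entropy = -(p * logp).sum(axis=1)
global_w = 1.0 - entropy / np.log(len(docs))
W = np.log1p(M) * global_w[:, None]

# Truncated SVD: keep k dimensions; word i is described by row i of U_k * S_k.
k = 2
U, S, _ = np.linalg.svd(W, full_matrices=False)
word_vecs = U[:, :k] * S[:k]

def relatedness(w1, w2):
    """Cosine measure between the reduced-space vectors of two words."""
    a, b = word_vecs[w_index[w1]], word_vecs[w_index[w2]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(relatedness("cat", "dog"))      # expected to be relatively high
print(relatedness("cat", "engine"))   # expected to be relatively low
```

With only a handful of invented documents the numbers themselves are meaningless; the point is the shape of the pipeline: raw counts, a weighting step, dimensionality reduction, and a cosine comparison of the rows that describe particular LUs.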
