06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...



78 Chapter 3. Discovering Semantic Relatedness

• and a corpus of large texts in Polish (about 214 million tokens) collected from the Internet; only documents containing a small percentage of erroneous word forms (tested manually) and not duplicated in the other two corpora were included in the collected corpus.

Henceforth, we will refer to all three corpora used together as the joint corpus.

For nominal MSR, four types of lexico-morphosyntactic constraints have been used (Section 3.4.3): AdjC, NcC, NmgC and VsbC. Lists of lexical elements have been defined as the combination of one-word LUs in the joint corpus and two-word LUs assumed for the expansion of plWordNet. We also used 63328 adjectives and adjectival participles for AdjC, 199250 one-word and two-word nominal LUs for NcC and NmgC, and 29564 verbs for VsbC.

Evaluation of the extracted nominal MSRs was performed on the basis of the WBST+H and EWBST tests presented in Section 3.3. The WBST+H test used for the final version of the nominal MSR, generated from plWordNet version 11.2008, consisted of 9486 questions; EWBST had 8689. Table 3.9 shows the results of the WBST+H and EWBST.
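How a test of this kind turns into a single accuracy figure can be sketched roughly as follows. The sketch assumes, as is usual for WBST-style tests, that each question pairs a target LU with one correct answer and several distractors, and that the MSR "answers" by choosing the candidate with the highest relatedness value. The exact question format is defined in Section 3.3 and is not reproduced here; the names `wbst_accuracy` and `toy_relatedness`, as well as the toy words and scores, are invented for illustration and are not taken from the book.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical question format: (target LU, correct answer, distractors).
Question = Tuple[str, str, Sequence[str]]


def wbst_accuracy(
    questions: Sequence[Question],
    relatedness: Callable[[str, str], float],
) -> float:
    """Return the percentage of questions answered correctly.

    A question counts as correct only if the correct answer scores
    strictly higher than every distractor, so ties count as errors.
    """
    if not questions:
        raise ValueError("at least one question is required")
    correct = 0
    for target, answer, distractors in questions:
        answer_score = relatedness(target, answer)
        if all(answer_score > relatedness(target, d) for d in distractors):
            correct += 1
    return 100.0 * correct / len(questions)


# Toy relatedness values (assumption: made up, symmetric lookup, 0.0 if unknown).
_TOY_SCORES = {
    ("dom", "budynek"): 0.70,
    ("dom", "rzeka"): 0.10,
    ("dom", "kot"): 0.20,
    ("dom", "chleb"): 0.05,
    ("kot", "kocur"): 0.30,
    ("kot", "rzeka"): 0.10,
    ("kot", "chleb"): 0.15,
}


def toy_relatedness(a: str, b: str) -> float:
    """Look up a toy score for the pair in either order."""
    return _TOY_SCORES.get((a, b), _TOY_SCORES.get((b, a), 0.0))


toy_questions = [
    ("dom", "budynek", ["rzeka", "kot", "chleb"]),  # answer wins: 0.70
    ("kot", "kocur", ["dom", "rzeka", "chleb"]),    # "dom" scores 0.20, answer 0.30
    ("kot", "kocur", ["kot", "rzeka", "chleb"]),    # unknown pair scores 0.0
    ("dom", "rzeka", ["budynek", "kot", "chleb"]),  # distractor "budynek" wins
]

accuracy = wbst_accuracy(toy_questions, toy_relatedness)
print(f"accuracy: {accuracy:.2f}%")
```

Counting ties as errors is a conservative design choice made for this sketch; a real test harness might instead break ties at random.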
Table 3.10 includes the results in relation to particular types of constraints (we constructed several coincidence matrices, from which different MSRs were built and tested).

                 all LUs        LUs with frequency ≥ 1000
    WBST+H       88.14          92.28
    EWBST        69.75          75.43

Table 3.9: The accuracy of the nominal MSR based on the generalised RWF and Lin's version of MI

    Constraint   ≥ 10^3   all LUs
    AdjC         90.90    84.98
    NcC          88.81    80.67
    NmgC         76.25    65.10
    VsbC         79.17    65.89
    all          92.28    88.14

Table 3.10: The accuracy [%] of nominal MSRs based on different morphosyntactic constraints; all MSRs use Generalised RWF based on Lin's MI. "≥ 10^3" means more frequent than 1000

The best results achieved for the nominal MSR in WBST+H and EWBST (Table 3.9) are close to the average human results: 86.64% and 71.34%, respectively (Section 3.3). Both tests can clearly be interpreted from the perspective of practical application of the MSR. It can distinguish between semantically related and unrelated LUs with an accuracy of 88.14% (WBST+H), and between semantically closely related and more remotely related LUs with an accuracy of 69.75% (EWBST). Moreover, our comparison of the MSR_GRWF with several other MSRs based on methods proposed in the literature
