06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


4.4. Benefits of Extracted Patterns

occ=88 rel=0.0060688
(hypo:subst:acc) interp który być (hyper:subst:inst)
(hypo:subst:acc) interp which is (hyper:subst:inst)

The plWordNet-related precision of the Espresso/Estratto algorithm is lower when measured on Polish corpora than the precision reported by Pantel and Pennacchiotti (2006). This might be due to a slightly different approach to precision evaluation, which was performed partially on the basis of the much smaller plWordNet. On the other hand, the results of the manual evaluation are similar to the results reported in (Pantel and Pennacchiotti, 2006). The results for different similarity measures based on reliability suggest that PMI gives the best results for the given test suite.

The adjustment of the pattern scheme to the characteristic features of Polish improved the precision over Espresso patterns using only word forms and parts of speech as features.

Estratto, the proposed modification of Espresso, succeeded in extracting hypernymy and antonymy from IPIC and the Rzeczpospolita corpus. Attempts to extract meronymy were unsuccessful. Meronymic pairs are present on the MSR-produced lists of the LUs most semantically related to a given one but, since the pattern-based attempts failed, we have no additional source of knowledge to separate meronymic pairs from those lists.

We tested several parameters that have a significant influence on the Estratto algorithm. The most important of them appeared to be:
• the number of seed instances,
• the confidence threshold,
• the number of the top k patterns preserved between the subsequent iterations.

The number of seed instances should exceed 10. The confidence threshold strongly depends on the corpus; for example, for IPIC the best value found was about 2.6. Each time the algorithm is applied to a new corpus, both seed instances and the measure of confidence must be redefined. The number of the top k patterns should be low, around 8. Such a number results in a more stable representation of the semantic relation. It is still unclear how to explore patterns that seem to be correct and are close to the top. Those patterns usually disappear in the next iterations, so some instances are also excluded from final results.

Espresso/Estratto is an intrinsically weakly supervised algorithm. That is true even though the preparation of an appropriate set of seeds leading to the extraction of patterns producing a large and diverse set of extracted instances might require even some initial
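
The interplay of the parameters discussed above (seed instances, the top k patterns, the confidence threshold) can be illustrated with a minimal Python sketch. It is not the Estratto implementation: it assumes the general Espresso-style scheme in which pattern reliability is a PMI-weighted average over instances of known reliability, and all function names, the co-occurrence table and the toy pattern labels are hypothetical. The defaults k=8 and threshold 2.6 are simply the values quoted in the text; the text stresses that the threshold is corpus-dependent.

```python
import math
from collections import defaultdict


def pmi_table(cooc):
    """PMI for every (instance, pattern) pair, from raw co-occurrence counts."""
    n = sum(cooc.values())
    inst_total = defaultdict(int)
    pat_total = defaultdict(int)
    for (inst, pat), count in cooc.items():
        inst_total[inst] += count
        pat_total[pat] += count
    return {
        (inst, pat): math.log(count * n / (inst_total[inst] * pat_total[pat]))
        for (inst, pat), count in cooc.items()
    }


def pattern_reliability(pmis, inst_rel):
    """Average PMI-weighted reliability of the instances each pattern matched.

    inst_rel maps instances of known reliability (e.g. seeds, set to 1.0)
    to their scores; unknown instances contribute 0.
    """
    max_pmi = max(pmis.values(), default=0.0)
    if max_pmi <= 0:  # assumption: avoid dividing by a non-positive maximum
        max_pmi = 1.0
    sums = defaultdict(float)
    for (inst, pat), value in pmis.items():
        sums[pat] += (value / max_pmi) * inst_rel.get(inst, 0.0)
    n_inst = max(len(inst_rel), 1)
    return {pat: total / n_inst for pat, total in sums.items()}


def top_k_patterns(pat_rel, k=8):
    """Keep only the k most reliable patterns for the next iteration."""
    return sorted(pat_rel, key=pat_rel.get, reverse=True)[:k]


def confident_instances(inst_conf, threshold=2.6):
    """Keep instances whose confidence score reaches the threshold."""
    return {inst for inst, score in inst_conf.items() if score >= threshold}


# Toy data (hypothetical): two hypernymy seeds and one unrelated pair.
cooc = {
    (("pies", "zwierzę"), "P1"): 4,
    (("kot", "zwierzę"), "P1"): 3,
    (("pies", "zwierzę"), "P2"): 1,
    (("stół", "noga"), "P2"): 4,
}
seeds = {("pies", "zwierzę"): 1.0, ("kot", "zwierzę"): 1.0}

reliability = pattern_reliability(pmi_table(cooc), seeds)
kept = top_k_patterns(reliability, k=1)
```

With a small k, a pattern that is correct but ranked just below the cut-off is dropped together with the instances only it would have matched, which is the effect described above for patterns that are close to the top.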
