A Wordnet from the Ground Up
4.4. Benefits of Extracted Patterns

occ=88 rel=0.0060688
(hypo:subst:acc) interp który być (hyper:subst:inst)
(hypo:subst:acc) interp which is (hyper:subst:inst)

The plWordNet-related precision of the Espresso/Estratto algorithm is lower when measured on Polish corpora than the precision reported by Pantel and Pennacchiotti (2006). This might be due to a slightly different approach to precision evaluation, which was performed partially on the basis of the much smaller plWordNet. On the other hand, the results of the manual evaluation are similar to the results reported in (Pantel and Pennacchiotti, 2006). The results for different similarity measures based on reliability suggest that PMI gives the best results for the given test suite.

The adjustment of the pattern scheme to the characteristic features of Polish improved the precision over Espresso patterns using only word forms and parts of speech as features.

Estratto, the proposed modification of Espresso, succeeded in extracting hypernymy and antonymy from IPIC and the Rzeczpospolita corpus. Attempts to extract meronymy were unsuccessful.
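The reliability-based scoring discussed above can be illustrated with a small sketch. It assumes the general form used in Espresso (Pantel and Pennacchiotti, 2006), in which the reliability of a pattern is the average, over instances, of each instance's reliability weighted by its PMI with the pattern relative to the maximum PMI. The function names, the corpus-normalized PMI variant and all toy values are assumptions made for illustration, not the settings used in the experiments; the discounting of PMI for rare events is omitted.

```python
import math


def pmi(joint, inst_count, pat_count, total):
    """PMI of an instance and a pattern from raw co-occurrence counts.

    Assumption: the standard corpus-normalized form
    log(N * c(i, p) / (c(i) * c(p))).
    """
    return math.log((joint * total) / (inst_count * pat_count))


def pattern_reliability(pattern, inst_reliability, pmi_table, max_pmi):
    """Average PMI-weighted reliability of the instances for one pattern.

    Instance/pattern pairs missing from pmi_table contribute zero.
    """
    weighted = sum(
        pmi_table.get((inst, pattern), 0.0) / max_pmi * rel
        for inst, rel in inst_reliability.items()
    )
    return weighted / len(inst_reliability)


# Hypothetical toy data: two seed instances (reliability 1.0), two patterns.
seeds = {("cat", "animal"): 1.0, ("rose", "flower"): 1.0}
pmi_table = {
    (("cat", "animal"), "p1"): 2.0,
    (("rose", "flower"), "p1"): 1.0,
    (("cat", "animal"), "p2"): 0.5,
}
max_pmi = max(pmi_table.values())

scores = {
    p: pattern_reliability(p, seeds, pmi_table, max_pmi) for p in ("p1", "p2")
}
```

With these toy values the hypothetical pattern p1 scores 0.75 and p2 scores 0.125, because p1 co-occurs strongly with both seeds while p2 is weakly associated with only one of them.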
Meronymic pairs are present on the MSR-produced list of the LUs most semantically related to a given one, but with the failure of the pattern-based attempts we have no additional source of knowledge to separate meronymic pairs from those lists.

We tested several parameters that have a significant influence on the Estratto algorithm. The most important of them appeared to be:
• the number of seed instances,
• the confidence threshold,
• the number of the top k patterns preserved between subsequent iterations.

The number of seed instances should exceed 10. The confidence threshold strongly depends on the corpus; for example, for IPIC the best value found was about 2.6. Each time the algorithm is applied to a new corpus, both the seed instances and the measure of confidence must be redefined. The number of the top k patterns should be low, around 8. Such a number results in a more stable representation of the semantic relation. It is still unclear how to explore patterns that seem to be correct and are close to the top. Those patterns usually disappear in the next iterations, so some instances are also excluded from the final results.

Espresso/Estratto is an intrinsically weakly supervised algorithm. That is true even though the preparation of an appropriate set of seeds leading to the extraction of patterns producing a large and diverse set of extracted instances might require even some initial