06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


4.3. Generic Patterns Verified Statistically

The unmarked order in a Polish sentence is Subject-Verb-Object, so simple lexico-syntactic patterns might work much as they do in English. Anything more complicated, such as rich morphosyntactic agreement or even slightly relaxed word/phrase order (usually meaning-preserving), needs additional work. We put more emphasis on the morphosyntactic description of pattern elements in terms of the tagset in the IPI PAN Corpus [IPIC] (Przepiórkowski, 2004). The categories include a finer-grained list of parts of speech and dozens of values of several grammatical categories (case, number, gender, person, degree, tense, aspect).

Multiword LU occurrences also get morphosyntactic descriptions; see Section 3.4.5 for how this worked with MSR extraction. The linear order of LUs in a Polish sentence need not be correlated with their role in asymmetric lexico-semantic relations. For example, many patterns mark the hypernym and the hyponym by different cases, while their relative positions change. We could generate specific patterns for all the different combinations, but we can also look for a generalization of a group of patterns based on the morphosyntactic properties.

Following Espresso, patterns are flat and describe a sentence as a sequence of word forms or at most groups of word forms. They are not based on any deeper description of the syntactic structure. The alphabet comprises three types of symbols:
• an asterisk symbol * indicating a place for zero or more tokens,
• root,
• and markers of matching locations, variables in Espresso, which require more structure in Estratto.

The empty symbol represents any LU (represented by any of its word forms). The root represents an actual basic morphological word form of an LU and its grammatical class. This takes care of the likely ambiguity.
A matching location includes a partial morphosyntactic description (a reduced version of the IPIC morphosyntactic tag, with selected category values) that represents all LUs with a matching morphosyntactic description. Grammatical classes in IPIC are too fine-grained, so we introduced “macro” symbols, such as noun, which jointly represents the following grammatical classes: substantives, gerunds, foreign nominals and depreciative nouns.

As in Espresso, there are always two matching locations, at the beginning and the end of a pattern. Patterns do not describe the left and right context of a potential instance.

Matching locations also encode the roles of both LUs identified, for example:

(hypo:subst:nom) jest (hyper:subst:inst)
(hyper:subst:inst) jest (hypo:subst:nom)
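The two example patterns can be read as small programs over a tagged token sequence. The following Python sketch is only an illustration of that reading, not the Estratto implementation: the names Token, Wildcard, Root, Loc, MACROS and find_instances are hypothetical, the sentences are hand-tagged toy data, and the class abbreviations ger, depr and xxs in the noun macro are assumed to be the IPIC labels for gerunds, depreciative nouns and foreign nominals. As a simplification, a root is compared with either the surface form or the base form of a token, so that the printed form "jest" can be used directly.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed expansion of the "noun" macro symbol into IPIC-style class labels.
MACROS = {"noun": {"subst", "ger", "depr", "xxs"}}

@dataclass(frozen=True)
class Token:
    form: str                   # word form as it occurs in the text
    base: str                   # basic morphological form
    cls: str                    # grammatical class, e.g. "subst"
    case: Optional[str] = None  # e.g. "nom", "inst"

@dataclass(frozen=True)
class Wildcard:                 # the * symbol: zero or more tokens
    pass

@dataclass(frozen=True)
class Root:                     # a concrete LU, optionally with its class
    text: str
    cls: Optional[str] = None

@dataclass(frozen=True)
class Loc:                      # matching location: role plus partial tag
    role: str
    cls: str
    case: Optional[str] = None

def fits(elem, tok):
    """Check a single Root or Loc element against a single token."""
    if isinstance(elem, Root):
        same_word = elem.text in (tok.form.lower(), tok.base)
        return same_word and elem.cls in (None, tok.cls)
    classes = MACROS.get(elem.cls, {elem.cls})
    return tok.cls in classes and elem.case in (None, tok.case)

def match_at(pattern, tokens, i, found):
    """Match the pattern starting exactly at position i; return roles or None."""
    if not pattern:
        return found
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, Wildcard):
        for j in range(i, len(tokens) + 1):   # skip zero or more tokens
            result = match_at(rest, tokens, j, found)
            if result is not None:
                return result
        return None
    if i >= len(tokens) or not fits(head, tokens[i]):
        return None
    if isinstance(head, Loc):
        found = {**found, head.role: tokens[i].base}
    return match_at(rest, tokens, i + 1, found)

def find_instances(pattern, tokens):
    """Try every start position; left and right context is not described."""
    hits = (match_at(pattern, tokens, i, {}) for i in range(len(tokens)))
    return [hit for hit in hits if hit is not None]

# The two patterns from the text: roles follow case, not position.
P1 = (Loc("hypo", "subst", "nom"), Root("jest"), Loc("hyper", "subst", "inst"))
P2 = (Loc("hyper", "subst", "inst"), Root("jest"), Loc("hypo", "subst", "nom"))
# A generalized variant using the macro symbol and the wildcard.
P3 = (Loc("hypo", "noun", "nom"), Root("jest"), Wildcard(),
      Loc("hyper", "noun", "inst"))

pies = Token("pies", "pies", "subst", "nom")
jest = Token("jest", "być", "fin")
duzym = Token("dużym", "duży", "adj", "inst")
ssakiem = Token("ssakiem", "ssak", "subst", "inst")

S1 = [pies, jest, ssakiem]          # "Pies jest ssakiem"
S2 = [ssakiem, jest, pies]          # "Ssakiem jest pies"
S3 = [pies, jest, duzym, ssakiem]   # "Pies jest dużym ssakiem"

print(find_instances(P1, S1))
print(find_instances(P2, S2))
print(find_instances(P3, S3))
```

In this sketch both word orders yield the same role assignment, because the roles are attached to the case values in the matching locations rather than to the positions of the LUs.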
