06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


Chapter 4. Extracting Relation Instances

Still, lexico-syntactic patterns are worth attention from the perspective of expanding a wordnet. Their accuracy in identifying instances of wordnet relations is much higher than the accuracy of the MSRlist(x,k) lists (constructed using an MSR); see the manual evaluation in Section 3.4.5. Each pattern is also focused on instances of one semantic relation, mostly hypernymy. In this way, patterns can be a valuable source of knowledge, complementary to MSRs. This has been the approach in Caraballo (1999, 2001) (Section 4.5.3).

4.1 Lexico-Morphosyntactic Patterns

There are few known applications of the pattern-based paradigm to Polish. Martinek (1997) and Ceglarek and Rutkowski (2006) presented attempts to apply patterns to MRDs. In an application to a corpus, Dernowicz (2007) performed a simple experiment: using lexico-syntactic patterns to extract meronymic and hypernymic pairs for a very limited set of words in a limited domain. The major obstacles are the lack of available electronic dictionaries and problems with preprocessing text in Polish. When processing English, a chunker can pick out noun phrases and thus reduce the complexity of the text before patterns are applied. Nearly fixed word order also helps. It is not so in Polish: no chunkers or shallow parsers are available (see the discussion in Section 3.4.3).
The word order is much freer; as an example, consider two almost synonymous expressions that both point to the same instance of hyponymy:

wieloryb jest ssakiem (whale[case=nom] is mammal[case=inst])
ssakiem jest wieloryb (mammal[case=inst] is whale[case=nom])

Polish verb arguments are marked by grammatical case, so the morphosyntactic properties of words encode much of the syntactic structure (Section 3.4.3). We can therefore reapply to pattern identification the processing methods presented there. We have built an IPIC subcorpus (we named it HC) with some 80000 sentences that plWordNet signals as containing LU pairs linked by hypernymy. Hearst's patterns (Hearst, 1992) were the first inspiration for discovering patterns for Polish. We also manually examined several hundred sentences from HC, looking for characteristic configurations of LUs. Rather naturally, we were guided by language intuitions. The Estratto algorithm (see the next section) automatically extracts patterns and instances; its results were another interesting source of ideas. Although Estratto produces rather generic patterns, some such simple patterns became a starting point for the manual development of more refined versions.

A pattern imposes lexico-morphosyntactic constraints on the elements of a text fragment and their mutual dependencies. We implemented the constraints in the JOSKIPI
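
The case-based constraint behind the two wieloryb/ssak examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the JOSKIPI implementation: the `Token` class, its fields, and the function `extract_jest_hypernymy` are hypothetical names, and the input is assumed to be already tokenised, lemmatised, and morphosyntactically disambiguated. The sketch ignores position and assigns the hyponym and hypernym roles by grammatical case alone.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical tagged token; not a structure from the source."""
    form: str   # surface form as it appears in the text
    lemma: str  # base form
    pos: str    # coarse part of speech, e.g. "noun" or "verb"
    case: str   # grammatical case, e.g. "nom" or "inst"; "" if not applicable


def extract_jest_hypernymy(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) lemma pairs for 'N jest N' fragments.

    The noun in the nominative is taken as the hyponym and the noun in the
    instrumental as the hypernym, whichever side of 'jest' each stands on.
    """
    pairs = []
    for i, tok in enumerate(sentence):
        if tok.form.lower() != "jest":
            continue
        if i == 0 or i == len(sentence) - 1:
            continue  # 'jest' needs a neighbour on both sides
        left, right = sentence[i - 1], sentence[i + 1]
        if left.pos != "noun" or right.pos != "noun":
            continue
        by_case = {left.case: left, right.case: right}
        # Accept only one nominative noun plus one instrumental noun.
        if set(by_case) == {"nom", "inst"}:
            pairs.append((by_case["nom"].lemma, by_case["inst"].lemma))
    return pairs


# The two word orders from the text, with hand-assigned tags (an assumption).
whale = Token("wieloryb", "wieloryb", "noun", "nom")
mammal = Token("ssakiem", "ssak", "noun", "inst")
is_verb = Token("jest", "być", "verb", "")

nom_first = [whale, is_verb, mammal]   # wieloryb jest ssakiem
inst_first = [mammal, is_verb, whale]  # ssakiem jest wieloryb

print(extract_jest_hypernymy(nom_first))
print(extract_jest_hypernymy(inst_first))
```

Both word orders yield the same pair, because the roles are read off the case tags and not off the positions. A realistic pattern would also have to allow modifiers between the nouns and the verb and cope with tagging errors, which this sketch does not attempt.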
