24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


3.2. Lexical-Semantic Information Extraction

$$\mathrm{pmi}(i, p) = \log \frac{|x, p, y|}{|x, *, y| \, |*, p, *|}$$

The reliability of instance $i$, $r_\iota(i)$, is given by:

$$r_\iota(i) = \frac{\sum_{p \in P'} \frac{\mathrm{pmi}(i, p)}{\max_{\mathrm{pmi}}} * r_\pi(p)}{|P|}$$

3. Select the most reliable patterns.

4. Extract new instances after applying the selected patterns to the corpus. The most reliable instances are added to the seed set.
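The PMI and instance reliability formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, argument names, and the example counts are hypothetical, and the co-occurrence counts are assumed to have been collected from the corpus beforehand.

```python
import math


def pmi(count_xpy, count_xy, count_p):
    """pmi(i, p) = log(|x, p, y| / (|x, *, y| * |*, p, *|)).

    count_xpy: times the instance {x, y} occurs with pattern p
    count_xy:  times x and y co-occur with any pattern
    count_p:   times pattern p occurs with any pair
    """
    return math.log(count_xpy / (count_xy * count_p))


def instance_reliability(pmi_by_pattern, pattern_reliability, max_pmi, n_patterns):
    """Sum of (pmi(i, p) / max_pmi) * r_pi(p) over the selected patterns, divided by |P|."""
    total = sum(
        (value / max_pmi) * pattern_reliability[p]
        for p, value in pmi_by_pattern.items()
    )
    return total / n_patterns


# Hypothetical values for one instance and two selected patterns.
example_pmi = {"p1": 2.0, "p2": 1.0}
example_pattern_reliability = {"p1": 1.0, "p2": 0.5}
reliability = instance_reliability(
    example_pmi, example_pattern_reliability, max_pmi=2.0, n_patterns=2
)
```

Dividing each PMI value by the maximum PMI keeps the contribution of every pattern on a comparable scale before it is weighted by the pattern's own reliability.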

Ittoo and Bouma (2010) present a study on the acquisition of part-whole relations. Using an algorithm inspired by Espresso, they note that special attention should be paid to the choice of seed relations. Given that there are different subtypes of part-whole relations (e.g. member-of, contained-in, located-in), they confirm that, if the initial set of seeds mixes pairs of different subtypes, the algorithm fails to capture these subtypes. But even when they carefully select seeds of only one subtype, part-whole relations of other subtypes are discovered.

Relation extraction from the Web

In the last decade, with the explosion of available electronic content, researchers felt the need to develop systems that acquire open-domain facts from large collections of text, including the Web. Given the size of the data to exploit, these systems, whose final goal was to turn the texts into a large knowledge base, had to be robust and scalable.

An earlier approach to this problem was the Dual Iterative Pattern Expansion (DIPRE, Brin (1998)), a weakly-supervised technique for extracting a structured relation from the Web. DIPRE bootstraps from an initial set of seed examples, which is the only training data required. For instance, for the extraction of locationOf(location, organisation) relations, the following seeds could be provided: {Redmond, Microsoft}, {Cupertino, Apple}, {Armonk, IBM}, {Seattle, Boeing} and {Santa Clara, Intel}. After finding all close occurrences of the related entities in the collection, patterns where they co-occur are used to extract new pairs holding the same relation. Snowball (Agichtein and Gravano, 2000) is a weakly-supervised system for extracting structured data from textual documents built on the idea of DIPRE, but extending it to incorporate automatic pattern and pair evaluation.
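The bootstrapping idea behind DIPRE can be illustrated with a heavily simplified sketch. It is not Brin's implementation: the toy corpus, the function names, the assumption that the organisation always precedes the location in a sentence, and the capitalised-word heuristic for entity names are all hypothetical choices made for this example.

```python
import re


def induce_patterns(sentences, pairs):
    """Collect the text found between the two members of each known pair."""
    patterns = set()
    for sentence in sentences:
        for location, organisation in pairs:
            match = re.search(
                re.escape(organisation) + r"(.+?)" + re.escape(location), sentence
            )
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(sentences, patterns):
    """Extract (location, organisation) pairs joined by a known middle context."""
    # Assumption: an entity name is a run of capitalised words.
    name = r"([A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
    pairs = set()
    for middle in patterns:
        regex = re.compile(name + re.escape(middle) + name)
        for sentence in sentences:
            for organisation, location in regex.findall(sentence):
                pairs.add((location, organisation))
    return pairs


def bootstrap(sentences, seeds, iterations=2):
    """Alternate between inducing patterns from pairs and pairs from patterns."""
    pairs = set(seeds)
    for _ in range(iterations):
        patterns = induce_patterns(sentences, pairs)
        pairs |= apply_patterns(sentences, patterns)
    return pairs


# Toy corpus (invented sentences) and two of the seeds mentioned in the text.
corpus = [
    "Microsoft is based in Redmond.",
    "Apple is based in Cupertino.",
    "Boeing is based in Seattle.",
    "Intel has its headquarters in Santa Clara.",
    "IBM has its headquarters in Armonk.",
]
seeds = {("Redmond", "Microsoft"), ("Santa Clara", "Intel")}
found = bootstrap(corpus, seeds)
```

Starting from two seeds, the sketch learns two middle contexts and recovers the remaining three pairs from the toy corpus. A system such as Snowball would additionally score patterns and pairs before accepting them, which this sketch omits.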

KnowItAll (Etzioni et al., 2004) is an autonomous, domain-independent system that extracts facts, concepts, and relationships from the Web. The only domain-specific input to KnowItAll is a set of predicates that constitute its focus and a set of generic domain-independent extraction patterns. KnowItAll uses the extraction patterns with classes (e.g. cities, movies) in order to generate extraction rules specific for each class of instances to extract. A web search engine is queried with keywords in each rule, and the rule is applied to extract information from the retrieved pages. The likelihood of each candidate fact is later assessed with a kind of PMI-IR (Turney, 2001), using an estimation of the search engine hit counts.
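A hit-count-based assessment of this kind can be sketched as a ratio between the number of hits for a candidate together with a discriminating phrase and the number of hits for the candidate alone. The sketch below is an assumption about the general shape of such a score, not KnowItAll's code: the lookup table stands in for a search engine, and all queries and counts are invented.

```python
# Hypothetical hit counts standing in for search engine estimates.
HIT_COUNTS = {
    "Lisbon": 1000,
    "city of Lisbon": 50,
    "Banana": 2000,
    "city of Banana": 1,
}


def hit_count(query, counts=HIT_COUNTS):
    """Return the (invented) number of hits for a query, or 0 if unknown."""
    return counts.get(query, 0)


def pmi_ir_score(instance, discriminator, counts=HIT_COUNTS):
    """Ratio of hits for discriminator + instance to hits for the instance alone."""
    instance_hits = hit_count(instance, counts)
    if instance_hits == 0:
        return 0.0  # no evidence at all for this candidate
    joint_hits = hit_count(f"{discriminator} {instance}", counts)
    return joint_hits / instance_hits


lisbon_score = pmi_ir_score("Lisbon", "city of")
banana_score = pmi_ir_score("Banana", "city of")
```

With these invented counts, the candidate that co-occurs more often with the discriminator phrase, relative to its overall frequency, receives the higher score.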
