Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...
3.2. Lexical-Semantic Information Extraction<br />
\[ \mathrm{pmi}(i, p) = \log \frac{|x, p, y|}{|x, *, y| \, |*, p, *|} \]<br />
The reliability of instance i, \(r_l(i)\), is given by:<br />
\[ r_l(i) = \frac{\sum_{p \in P'} \frac{\mathrm{pmi}(i, p)}{max_{pmi}} * r_\pi(p)}{|P|} \]<br />
3. Select the most reliable patterns.<br />
4. Extract new instances after applying the selected patterns to the corpus. The<br />
most reliable instances are added to the seed set.<br />
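The instance reliability formula above can be illustrated with a minimal Python sketch. The function names, the dictionary layout, and all numeric values are assumptions made for this illustration; they are not taken from Espresso or from the study described here.

```python
import math


def pmi(count_xpy, count_xy, count_p):
    """pmi(i, p) = log(|x, p, y| / (|x, *, y| * |*, p, *|)), from raw counts."""
    return math.log(count_xpy / (count_xy * count_p))


def instance_reliability(pmi_by_pattern, pattern_reliability, max_pmi, n_patterns):
    """Sum pmi(i, p) / max_pmi * r_pi(p) over the patterns, divided by |P|.

    pmi_by_pattern: hypothetical dict mapping each pattern p to pmi(i, p)
    pattern_reliability: hypothetical dict mapping each pattern p to r_pi(p)
    """
    total = sum(
        (pmi_by_pattern[p] / max_pmi) * pattern_reliability[p]
        for p in pmi_by_pattern
    )
    return total / n_patterns


# Toy values, invented for illustration only.
toy_pmi = {"X is part of Y": 2.0, "Y has X": 1.0}
toy_pattern_reliability = {"X is part of Y": 0.8, "Y has X": 0.4}
toy_score = instance_reliability(
    toy_pmi, toy_pattern_reliability, max_pmi=2.0, n_patterns=2
)
```

With these toy values, the first pattern contributes 0.8 and the second 0.2, so the instance scores 0.5 after dividing by the two patterns.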
Ittoo and Bouma (2010) present a study on the acquisition of part-whole relations.<br />
Using an algorithm inspired by Espresso, they note that the seed relations must be<br />
chosen with special care. Given that there are different subtypes of part-whole<br />
relations (e.g. member-of, contained-in, located-in), they confirm that, if the<br />
initial set of seeds mixes pairs of different subtypes, the algorithm fails to<br />
capture these subtypes. Yet, even when they carefully select seeds of<br />
only one subtype, part-whole relations of other subtypes are discovered.<br />
Relation extraction from the Web<br />
In the last decade, with the explosion of available electronic content, researchers<br />
felt the need to develop systems that acquire open-domain facts from large<br />
collections of text, including the Web. Given the size of the data to exploit, these<br />
systems, whose final goal was to turn the texts into a large knowledge base, had to<br />
be robust and scalable.<br />
An early approach to this problem was Dual Iterative Pattern Relation Expansion<br />
(DIPRE; Brin, 1998), a weakly-supervised technique for extracting a structured<br />
relation from the Web. DIPRE bootstraps from an initial set of seed examples,<br />
which are the only training data required. For instance, for the extraction of<br />
locationOf(location, organisation) relations, the following seeds could be provided:<br />
{Redmond, Microsoft}, {Cupertino, Apple}, {Armonk, IBM}, {Seattle, Boeing} and<br />
{Santa Clara, Intel}. After all close occurrences of the related entities are found in<br />
the collection, the patterns in which they co-occur are used to extract new pairs holding<br />
the same relation. Snowball (Agichtein and Gravano, 2000) is a weakly-supervised<br />
system for extracting structured data from textual documents. It builds on the idea of<br />
DIPRE and extends it with automatic evaluation of patterns and pairs.<br />
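The bootstrapping loop described above can be sketched in Python as follows. This is a heavily simplified illustration, not DIPRE itself: a pattern is reduced to the text found between an organisation and a location, entity candidates are assumed to be capitalised words, and the toy corpus and all function names are invented for this example.

```python
import re

# Toy corpus, invented for illustration only.
CORPUS = [
    "Microsoft is headquartered in Redmond.",
    "Apple is headquartered in Cupertino.",
    "Intel is headquartered in Santa Clara.",
    "Boeing, based in Seattle, builds aircraft.",
    "IBM, based in Armonk, sells services.",
]

# Seed (location, organisation) pairs.
SEEDS = {("Redmond", "Microsoft"), ("Seattle", "Boeing")}


def induce_patterns(corpus, pairs):
    """Collect the short text found between a known organisation and its location."""
    patterns = set()
    for sentence in corpus:
        for location, organisation in pairs:
            regex = re.escape(organisation) + r"(.{1,30}?)" + re.escape(location)
            match = re.search(regex, sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(corpus, patterns):
    """Use each middle context to extract new (location, organisation) pairs."""
    pairs = set()
    for sentence in corpus:
        for middle in patterns:
            # Assumption: entities are capitalised words; locations may span several.
            regex = r"([A-Z]\w+)" + re.escape(middle) + r"([A-Z]\w+(?: [A-Z]\w+)*)"
            for organisation, location in re.findall(regex, sentence):
                pairs.add((location, organisation))
    return pairs


def bootstrap(corpus, seeds, iterations=2):
    """Alternate between pattern induction and pair extraction."""
    pairs = set(seeds)
    for _ in range(iterations):
        patterns = induce_patterns(corpus, pairs)
        pairs |= apply_patterns(corpus, patterns)
    return pairs


found_pairs = bootstrap(CORPUS, SEEDS)
```

On this toy corpus, the two seeds yield the contexts " is headquartered in " and ", based in ", which are enough to recover all five pairs listed above. The sketch accepts every pattern and every pair; a pattern and pair evaluation step of the kind Snowball adds would filter them before each new iteration.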
KnowItAll (Etzioni et al., 2004) is an autonomous, domain-independent system<br />
that extracts facts, concepts, and relationships from the Web. The only domain-specific<br />
input to KnowItAll is a set of predicates that constitute its focus, which it combines<br />
with a set of generic, domain-independent extraction patterns. KnowItAll instantiates<br />
these patterns with classes (e.g. cities, movies) to generate extraction rules specific to each<br />
class of instances to extract. A web search engine is queried with the keywords of each<br />
rule, and the rule is applied to extract information from the retrieved pages. The<br />
likelihood of each candidate fact is then assessed with a kind of PMI-IR (Turney,<br />
2001), estimated from search engine hit counts.<br />
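As a rough illustration of a hit-count-based assessment in the spirit of PMI-IR, the sketch below scores a candidate instance by the ratio between the hits for the instance together with a discriminator phrase and the hits for the instance alone. The discriminator phrase, the hit counts, and the function names are all hypothetical; no real search engine is queried, and the exact scoring used by KnowItAll may differ.

```python
# Hypothetical hit counts, invented for illustration only.
# A real system would obtain such estimates from a search engine.
HIT_COUNTS = {
    "Paris": 1000,
    "city of Paris": 200,
    "Banana": 500,
    "city of Banana": 1,
}


def pmi_ir_score(instance, discriminator, hit_counts):
    """Ratio of hits(discriminator + instance) to hits(instance)."""
    instance_hits = hit_counts.get(instance, 0)
    if instance_hits == 0:
        return 0.0  # no evidence for an instance that is never found
    joint_hits = hit_counts.get(f"{discriminator} {instance}", 0)
    return joint_hits / instance_hits


paris_score = pmi_ir_score("Paris", "city of", HIT_COUNTS)
banana_score = pmi_ir_score("Banana", "city of", HIT_COUNTS)
```

Under these made-up counts, "Paris" receives a much higher score than "Banana" as a candidate city, which is the kind of contrast such a measure is meant to expose.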