24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


3.2. Lexical-Semantic Information Extraction

points that should be considered in the automatic extraction of knowledge from a dictionary, and its conversion into a computational format. They describe three approaches for creating a knowledge base from a dictionary, which vary in the amount of initial knowledge required and in the quality of the extracted information:

• Co-occurrences make it possible to establish associations between words, without requiring initial linguistic information.

• A grammar with a collection of linguistic patterns makes it possible, for instance, to identify the genus (hypernym) and the differentia for each dictionary entry.

• Hand-coding the lexical entries of a controlled vocabulary (about 5% of the knowledge base), and iterating through the remaining words, makes it possible to derive a network of semantic units.
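The second approach can be illustrated with a minimal Python sketch. It is not the grammar used in the cited work: the function name `split_definition`, the marker list and the sample definitions are assumptions made for illustration. The sketch takes the last word before the first differentia marker as the genus and the remainder as the differentia.

```python
# Illustrative assumptions: these word lists are not taken from any cited work.
DETERMINERS = {"a", "an", "the"}
DIFFERENTIA_MARKERS = {"that", "which", "who", "with", "for", "used", "having"}


def split_definition(definition):
    """Split a definition into (genus, differentia) with a naive pattern.

    The genus is the last word before the first differentia marker;
    everything from the marker onwards is the differentia.
    """
    tokens = definition.lower().rstrip(".").split()
    # Drop a leading determiner, if any.
    if tokens and tokens[0] in DETERMINERS:
        tokens = tokens[1:]
    # Find the first differentia marker; if none, the whole text is the head.
    for i, token in enumerate(tokens):
        if token in DIFFERENTIA_MARKERS:
            head, differentia = tokens[:i], tokens[i:]
            break
    else:
        head, differentia = tokens, []
    genus = head[-1] if head else None
    return genus, " ".join(differentia)


print(split_definition("a domesticated animal that barks"))
print(split_definition("A tool used for cutting."))
```

A real grammar would rely on part-of-speech information to find the head noun; this sketch only shows the idea of separating genus and differentia.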

While the third approach results in a rich semantic structure, it needs a substantial amount of initial linguistic knowledge. The first approach produces a much simpler resource, but does not require hand-coded knowledge.
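To make the first approach concrete, the following Python sketch counts how often two words co-occur in the same dictionary entry. It assumes a toy dictionary and whitespace tokenisation; the function name `cooccurrence_counts` and the data are hypothetical, and no linguistic knowledge is used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(definitions):
    """Count, for each unordered word pair, the entries where both words occur.

    `definitions` maps a headword to its definition text. The headword is
    counted as part of its own entry.
    """
    counts = Counter()
    for headword, definition in definitions.items():
        words = {headword.lower()} | set(definition.lower().split())
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts


# Toy data, made up for illustration.
toy_dictionary = {
    "dog": "animal that barks",
    "cat": "animal that meows",
}
counts = cooccurrence_counts(toy_dictionary)
print(counts[("animal", "dog")])
print(counts[("animal", "that")])
```

The resulting associations are untyped: the counts say that "dog" and "animal" are related, but not how, which is why this approach yields a simpler resource.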

Ide and Véronis (1995) are very critical of the research on information extraction from dictionaries. They point out that dictionaries use inconsistent conventions to represent knowledge and that the definitions are not as consistent as they should be. Since dictionaries are the result of the work of several lexicographers over several years, they express the same thing in many different ways. Reviews and updates increase the probability of inconsistencies.

In order to assess the information extracted from dictionaries, Ide and Véronis (1995) performed a quantitative evaluation of automatically extracted hypernymy relations. As hypernymy is the least arguable semantic relation and the easiest to extract, the authors believed that, if the results were poor for hypernymy, they would be even poorer for more complex domains and less clearly defined relations. The evaluation consisted of comparing an “ideal” hierarchy, manually created, with hierarchies extracted from five dictionaries. The extraction procedure was based on the heuristics of Chodorow et al. (1985), which resulted in tangled hierarchies, later disambiguated manually. Inspection showed that these hierarchies were seriously incomplete and that there were difficulties at the higher levels:

• Some words were (relatively randomly) attached too high in the hierarchy; some heads of definitions were not the hypernym of the definiendum, but the “whole” that contains it; overlaps that should occur between concepts are sometimes missing.

• All the heads separated by the conjunction “or” are considered to be hypernyms, which sometimes causes problems in the resulting hierarchy; circularity tends to occur at the highest levels, possibly when lexicographers lack terms to designate certain concepts.
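Circularity of the kind mentioned above can be detected with a depth-first search over the extracted hypernymy graph. The following Python sketch is an illustration under stated assumptions, not the procedure of the cited authors: the function name `find_cycle` and the sample graph are hypothetical, and the graph is given as a mapping from each word to its list of hypernyms.

```python
def find_cycle(hypernyms):
    """Return one cycle as a list of words (first word repeated at the end),
    or None if the hypernymy graph has no cycle."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(word):
        state[word] = IN_PROGRESS
        path.append(word)
        for parent in hypernyms.get(word, ()):
            parent_state = state.get(parent, UNSEEN)
            if parent_state == IN_PROGRESS:
                # The parent is already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if parent_state == UNSEEN:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[word] = DONE
        return None

    for word in list(hypernyms):
        if state.get(word, UNSEEN) == UNSEEN:
            cycle = visit(word)
            if cycle:
                return cycle
    return None


# Made-up example: two words defined in terms of each other.
graph = {"tool": ["implement"], "implement": ["tool"], "hammer": ["tool"]}
print(find_cycle(graph))
```

The recursive search is adequate for small graphs; a very deep hierarchy would call for an iterative version to avoid Python's recursion limit.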

The authors state that hierarchies with these kinds of problems are likely to be unusable in NLP systems and discuss means to refine them automatically. Merging the hierarchies of the five dictionaries and introducing “covert categories” drastically reduces the problems, from 55-70% to 6%. Other problems are minimised by considering “empty heads” and patterns occurring at the beginning of the definition that denote the part-of relation; or by using more complex grammars or broad-coverage parsers instead of static string patterns for extraction.
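The treatment of “empty heads” and part-of patterns can be sketched in Python as follows. The list of empty heads, the regular expressions and the function name `classify_head` are assumptions chosen for illustration; they are static string patterns of exactly the kind the text says should ideally give way to a proper grammar or parser.

```python
import re

# Illustrative assumption: a small list of "empty" heads that carry no content.
EMPTY_HEADS = ("kind", "type", "sort", "variety", "form")

DETERMINER = r"(?:an?\s+|the\s+)?"
PART_OF = re.compile(
    r"^" + DETERMINER + r"part\s+of\s+" + DETERMINER + r"(\w+)",
    re.IGNORECASE,
)
EMPTY_HEAD = re.compile(
    r"^" + DETERMINER + r"(?:" + "|".join(EMPTY_HEADS) + r")\s+of\s+"
    + DETERMINER + r"(\w+)",
    re.IGNORECASE,
)


def classify_head(definition):
    """Return (relation, word) for definitions starting with a known pattern.

    Naive on purpose: the word right after "of" is taken as the target, so a
    modifier such as "large" in "a kind of large hammer" would be picked
    instead of the noun.
    """
    match = PART_OF.match(definition)
    if match:
        return ("part-of", match.group(1).lower())
    match = EMPTY_HEAD.match(definition)
    if match:
        # Skip the empty head and use the following word as the hypernym.
        return ("hypernym", match.group(1).lower())
    return None


print(classify_head("a part of the hand"))
print(classify_head("a kind of hammer used by masons"))
```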
