Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...
3.2. Lexical-Semantic Information Extraction
points that should be considered in the automatic extraction of knowledge from a
dictionary, and its conversion into a computational format. They describe three approaches
for creating a knowledge base from a dictionary, which vary in the amount of knowledge
initially required and in the quality of the extracted information:
• Co-occurrences enable the establishment of associations between words, without
requiring initial linguistic information.
• A grammar with a collection of linguistic patterns makes it possible, for instance, to
identify the genus (hypernym) and the differentia of each dictionary entry.
• Hand-coding the lexical entries of a controlled vocabulary (about 5% of the
knowledge base), and iterating through the remaining words, makes it possible to derive
a network of semantic units.
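As an illustration of the second approach, the sketch below splits a definition into genus and differentia with a single string pattern. The pattern, the function name `split_definition` and the example definition are assumptions made for this illustration, not taken from the cited work. It simply takes the first word after the article as the genus, so a definition with a modifier before the head noun (e.g. "a large bird") would be analysed incorrectly, which is one reason richer grammars are used in practice.

```python
import re

# Hypothetical pattern: an article, then the first word is taken as the
# genus and everything after it as the differentia.
GENUS_PATTERN = re.compile(
    r"^(?:an?|the)\s+(?P<genus>[a-z]+)(?:\s+(?P<differentia>.+))?$"
)


def split_definition(definition):
    """Return (genus, differentia) for a definition, or None if no match."""
    text = definition.strip().lower().rstrip(".")
    match = GENUS_PATTERN.match(text)
    if match is None:
        return None
    return match.group("genus"), match.group("differentia") or ""


if __name__ == "__main__":
    print(split_definition("A bird that cannot fly."))
```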
While the third approach results in a rich semantic structure, it needs a substantial
amount of initial linguistic knowledge. The first approach produces a much simpler
resource, but does not require hand-coded knowledge.
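To make the first approach concrete, the following minimal sketch counts how often two words occur in the same dictionary entry, using no linguistic knowledge beyond whitespace tokenisation. The function name `cooccurrence_counts` and the toy entries are assumptions for illustration. Real systems would typically add weighting and filtering that are not shown here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(definitions):
    """Count unordered word pairs that appear in the same entry.

    `definitions` maps a headword to the text of its definition.
    """
    counts = Counter()
    for headword, text in definitions.items():
        # Treat the headword and the definition words as one bag of words;
        # sorting gives every pair a canonical (alphabetical) order.
        words = sorted({headword, *text.lower().split()})
        counts.update(combinations(words, 2))
    return counts


if __name__ == "__main__":
    toy = {"dog": "domestic animal", "cat": "small domestic animal"}
    for pair, n in cooccurrence_counts(toy).most_common(3):
        print(pair, n)
```

The resulting pair counts only express association strength. They do not say which relation holds between the two words, which is why this resource is simpler than the ones produced by the other two approaches.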
Ide and Véronis (1995) are very critical of the research on information extraction
from dictionaries. They point out that dictionaries use inconsistent conventions to represent
knowledge and that the definitions are not as consistent as they should be.
Since they are the result of several lexicographers' work over several years, dictionaries
express the same thing in many different ways. Reviews and updates increase the
probability of inconsistencies.
In order to assess the information extracted from dictionaries,
Ide and Véronis (1995) performed a quantitative evaluation of automatically
extracted hypernymy relations. As hypernymy is the least arguable semantic relation
and the easiest to extract, the authors believed that, if their results were poor,
they would be poorer for more complex domains and less clearly defined relations.
The evaluation consisted of comparing an “ideal” hierarchy, manually created, with
hierarchies extracted from five dictionaries. The extraction procedure was based on
the heuristics of Chodorow et al. (1985), which resulted in tangled hierarchies, later
disambiguated manually. Inspection showed that these hierarchies were
seriously incomplete and that there were difficulties at the higher levels:
• Some words were (relatively randomly) attached too high in the hierarchy;
some heads of definitions were not the hypernym of the definiendum, but the
“whole” that contains it; overlaps that should occur between concepts are
sometimes missing.
• All the heads separated by the conjunction or are considered to be hypernyms,
but this sometimes causes problems that become apparent in the resulting hierarchy;
circularity tends to occur in the highest levels, possibly when lexicographers lack
terms to designate certain concepts.
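Two of the issues above lend themselves to a small sketch: treating every head joined by or as a hypernym, and detecting circular definitions in the resulting hierarchy. The code below is an illustration under assumed names (`heads_from_phrase`, `find_cycle`) and toy data. It is not the procedure of Chodorow et al. (1985) or of Ide and Véronis (1995); the cycle check is an ordinary depth-first search.

```python
import re


def heads_from_phrase(head_phrase):
    """Split a head phrase such as 'tool or implement' into its heads."""
    return [h for h in re.split(r"\s+or\s+", head_phrase.strip()) if h]


def find_cycle(hypernyms):
    """Return one cycle as a list of words, or None if there is none.

    `hypernyms` maps each word to the list of its hypernyms.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}
    path = []

    def visit(word):
        state[word] = GREY
        path.append(word)
        for parent in hypernyms.get(word, []):
            if state.get(parent, WHITE) == GREY:
                # The parent is already on the current path: circularity.
                return path[path.index(parent):] + [parent]
            if state.get(parent, WHITE) == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[word] = BLACK
        return None

    for word in list(hypernyms):
        if state.get(word, WHITE) == WHITE:
            cycle = visit(word)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    toy = {"hammer": heads_from_phrase("tool or implement"),
           "tool": ["implement"], "implement": ["tool"]}
    print(find_cycle(toy))
```

The recursive search is adequate for small examples. A hierarchy extracted from a full dictionary could be deep enough to require an iterative version.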
The authors state that hierarchies with these kinds of problems are likely to be unusable
in NLP systems and discuss means to refine them automatically. Merging the
hierarchies of the five dictionaries and introducing “covert categories” drastically
reduces the rate of problems from 55-70% to 6%. Other problems are minimised
by considering “empty heads” and patterns occurring at the beginning of the definition
that denote the part-of relation, or by using more complex grammars or broad-coverage
parsers instead of static string patterns for extraction.
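The refinements mentioned above can be sketched roughly as follows. The list of empty heads, the part-of patterns and the function names are all hypothetical, chosen only for illustration. Merging is shown as a plain union of the hypernym sets found in each dictionary, which is an assumption about how such a merge could work rather than the authors' actual procedure, and covert categories are not modelled at all.

```python
import re

# Hypothetical "empty heads": words that head a definition syntactically
# but contribute little meaning, so the noun after "of" is used instead.
EMPTY_HEADS = {"kind", "type", "sort", "variety"}

# Hypothetical definition-initial patterns.
ARTICLE = r"(?:a |an |the )?"
PART_OF = re.compile(rf"^{ARTICLE}(?:part|member|piece) of {ARTICLE}(\w+)")
X_OF_Y = re.compile(rf"^{ARTICLE}(\w+) of {ARTICLE}(\w+)")


def classify(definition):
    """Return (relation, target) for a definition, or None if undecided."""
    text = definition.lower()
    match = PART_OF.match(text)
    if match:
        return ("part-of", match.group(1))
    match = X_OF_Y.match(text)
    if match and match.group(1) in EMPTY_HEADS:
        return ("hypernym", match.group(2))
    return None


def merge_hierarchies(hierarchies):
    """Union the hypernyms proposed for each word by several dictionaries."""
    merged = {}
    for hierarchy in hierarchies:
        for word, parents in hierarchy.items():
            merged.setdefault(word, set()).update(parents)
    return merged


if __name__ == "__main__":
    print(classify("A kind of hammer used by masons"))
    print(classify("Part of a wheel"))
    print(merge_hierarchies([{"hammer": ["tool"]},
                             {"hammer": ["implement"], "saw": ["tool"]}]))
```

Static string patterns like these break easily on unseen phrasings, which is consistent with the suggestion to move to more complex grammars or broad-coverage parsers.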