Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...
Chapter 7
Moving from term-based to synset-based relations

Typical information extraction (IE) systems are capable of acquiring concept instances
and information about these concepts from large collections of text. Whether
these systems aim for the automatic acquisition of lexical-semantic relations (e.g.
Chodorow et al. (1985); Hearst (1992); Pantel and Pennacchiotti (2006)), of knowledge
on specific domains (e.g. Pustejovsky et al. (2002); Wiegand et al. (2012)),
or the extraction of open-domain facts (e.g. Agichtein and Gravano (2000); Banko
et al. (2007); Etzioni et al. (2011)), they typically represent concepts as terms, which
are lexical items identified by their lemma. This is also how CARTÃO is structured.
There, semantic relations are denoted by relational triples t = {a R b},
where the arguments (a and b) are terms whose meaning is connected by a relation
described by R. As we have done throughout this thesis, we refer to this
representation as term-based triples (tb-triples).
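
As a minimal sketch of this representation, a tb-triple can be held as a plain record of two lemmas and a relation name. The class name `TBTriple`, the helper `relations_of`, the relation labels and the English example triples are assumptions made for illustration; they are not taken from CARTÃO.

```python
from typing import NamedTuple


class TBTriple(NamedTuple):
    """A term-based triple t = {a R b}: two lemmas linked by a relation name."""
    a: str
    relation: str
    b: str


# Hypothetical English examples, not extracted from CARTÃO.
triples = [
    TBTriple("bank", "hypernym-of", "savings bank"),
    TBTriple("bank", "part-of", "river"),
]


def relations_of(term, triples):
    """Return every triple in which `term` occurs as an argument."""
    return [t for t in triples if term in (t.a, t.b)]


# Both triples are retrieved for the lemma "bank", although each one
# holds for a different sense of the word.
bank_triples = relations_of("bank", triples)
```

Because the arguments are only lemmas, nothing in the record says which sense of "bank" each relation applies to.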

The problem is that a simple term is usually not enough to unambiguously refer
to a concept, because the same word might have different meanings and different
words might have the same meaning. On the one hand, this problem is not severe in
the extraction of domain knowledge, where, based on the “one sense per discourse”
assumption (Gale et al., 1992), ambiguity is low. On the other hand, when dealing
with broad-coverage knowledge, if ambiguities are not handled, it becomes impractical
to formalise the extracted information and to accomplish tasks such as inference
for discovering new knowledge.

Therefore, to make IE systems more useful, a new step, which can be seen
as a kind of WSD, is needed. Originally baptised as ontologising (Pantel, 2005),
this step aims at moving from knowledge structured in terms, identified by their
orthographical form, towards an ontological structure, organised in concepts. This
is done by associating the terms with a representation of their meaning.
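
To make the search space of this step concrete, the sketch below lists every way of attaching the arguments of one tb-triple to the synsets that contain them. It assumes that meanings are represented as synsets, modelled here as sets of lemmas; the toy thesaurus, the function names and the relation label are hypothetical. It only enumerates candidates and does not implement any particular ontologising algorithm.

```python
from itertools import product

# Toy thesaurus: each synset is a frozenset of lemmas (hypothetical data).
thesaurus = [
    frozenset({"bank", "financial institution"}),
    frozenset({"bank", "riverside", "shore"}),
    frozenset({"river", "stream"}),
]


def candidate_synsets(term, thesaurus):
    """Return the synsets that include `term` as one of their lemmas."""
    return [synset for synset in thesaurus if term in synset]


def candidate_sb_triples(tb_triple, thesaurus):
    """Pair every candidate synset of `a` with every candidate synset of `b`."""
    a, relation, b = tb_triple
    pairs = product(candidate_synsets(a, thesaurus),
                    candidate_synsets(b, thesaurus))
    return [(synset_a, relation, synset_b) for synset_a, synset_b in pairs]


# "bank" appears in two synsets and "river" in one, so there are
# two candidate attachments for this triple.
candidates = candidate_sb_triples(("bank", "part-of", "river"), thesaurus)
```

Choosing among such candidates is the disambiguation problem that the ontologising step has to solve.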

After the steps presented in the previous chapters, we are left with a lexical
network, CARTÃO, with tb-triples extracted from text (chapter 4), and with a
thesaurus with synsets (chapters 5 and 6). While the synsets can be seen as concepts
and their possible lexicalisations, the identification of the correct sense(s) of
the arguments of a tb-triple for which the relation is valid is not straightforward.
However, whereas most WSD techniques rely on the context where the words to
be disambiguated occur to find their most adequate sense, the tb-triples do not
provide their extraction context. While we could recover the context for some of