24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...



Chapter 2. Background Knowledge

IE. The input of an OIE system is a corpus and the output is a set of facts, represented as relational triples t = {e1, relation phrase, e2}. Neither annotated data nor a prior specification of the relations to extract is required.

To learn a classifier, an OIE system starts by identifying the noun phrases in several thousand sentences of the input corpus. The parse structure of the words connecting the noun phrases is also analysed. Each such sequence is labelled as a positive or negative example of a trustworthy relation, according to predefined heuristics. The positive and negative examples are then used to establish triples, in which a pair of noun phrases is connected by a relation phrase. The triples are mapped into feature vectors, which are used as the input of the classifier. Extraction itself needs only a single pass over the input corpus: each pair of noun phrases is used as the arguments of a triple, the text connecting them is used as the relation phrase, and the triples classified as trustworthy are extracted.
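The extraction pass described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of any particular OIE system: the noun phrases are assumed to be already identified and given as end-exclusive token spans, only consecutive noun phrases are paired, and `toy_classifier` is a hypothetical stand-in for the learned classifier.

```python
from typing import Callable, List, NamedTuple, Tuple


class Triple(NamedTuple):
    e1: str
    relation_phrase: str
    e2: str


def candidate_triples(tokens: List[str], np_spans: List[Tuple[int, int]]) -> List[Triple]:
    """Build one candidate per pair of consecutive noun phrases (a simplification)."""
    spans = sorted(np_spans)
    candidates = []
    for (start1, end1), (start2, end2) in zip(spans, spans[1:]):
        if end1 >= start2:
            continue  # no connecting text between the two noun phrases
        candidates.append(Triple(
            " ".join(tokens[start1:end1]),   # first argument
            " ".join(tokens[end1:start2]),   # text connecting the noun phrases
            " ".join(tokens[start2:end2]),   # second argument
        ))
    return candidates


def toy_classifier(triple: Triple) -> bool:
    """Hypothetical stand-in for the learned classifier: accept short relation phrases."""
    return len(triple.relation_phrase.split()) <= 4


def extract(tokens: List[str], np_spans: List[Tuple[int, int]],
            is_trustworthy: Callable[[Triple], bool] = toy_classifier) -> List[Triple]:
    """Single pass: keep only the candidates classified as trustworthy."""
    return [t for t in candidate_triples(tokens, np_spans) if is_trustworthy(t)]


tokens = "Lisbon is the capital of Portugal".split()
np_spans = [(0, 1), (2, 4), (5, 6)]  # "Lisbon", "the capital", "Portugal"
facts = extract(tokens, np_spans)
```

In a real system, the classifier would operate on feature vectors derived from each candidate triple rather than on a length threshold.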

2.4 Remarks on this section

In this section, we bridge the theoretical work described in the previous sections with the work developed in the scope of this thesis. The first part targets the knowledge representation in our work and the second is about the information extraction techniques applied.

2.4.1 Knowledge representation in our work

In our work, instances of semantic relations are first extracted as relational triples t = {w1, R, w2}, which can be seen both as logical predicates and as the edges of a lexical network. The relation types are those that typically hold between word senses, including synonymy, hypernymy, several types of meronymy and most of the relations introduced in section 2.1.2. At this stage, the arguments of these relations are lexical items, described by their orthographical form; word senses are not handled.
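As a minimal sketch of the lexical-network view, the triples t = {w1, R, w2} can be stored as labelled edges between orthographical forms. The relation names and the example words below are illustrative assumptions, not data taken from this work.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class RelTriple(NamedTuple):
    w1: str
    relation: str
    w2: str


def build_network(triples: Iterable[RelTriple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples as outgoing labelled edges, keyed by the orthographical form of w1."""
    network = defaultdict(list)
    for t in triples:
        network[t.w1].append((t.relation, t.w2))
    return dict(network)


# Hypothetical relation labels and toy Portuguese examples.
triples = [
    RelTriple("carro", "synonym-of", "automóvel"),
    RelTriple("veículo", "hypernym-of", "carro"),
    RelTriple("roda", "part-of", "carro"),
]
network = build_network(triples)
```

Because nodes are keyed by orthographical form alone, all senses of an ambiguous word share a single node, which mirrors the statement that word senses are not handled at this stage.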

On the other hand, the final resource of this work, Onto.PT, can be seen as a lexical ontology, as we have adopted a model inspired by Princeton WordNet (see more about this resource in section 3.1.1). In order to represent natural language concepts, Onto.PT groups synonymous words in synsets. This part of the resource can thus be seen as a thesaurus. As for other semantic relations, Onto.PT includes several predefined types, established between synsets. Given that the presence of a lexical item in a synset defines a possible sense of this item, different senses of the same word are recognised.
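The synset-based model can be illustrated with a small sketch in which the senses of a word are simply the synsets that contain it. The synset identifiers, relation label, and words are toy assumptions for illustration and are not taken from Onto.PT.

```python
from typing import Dict, FrozenSet, List, Tuple

Synset = FrozenSet[str]

# Toy synsets: "banco" appears in two of them, so it has two possible senses.
synsets: Dict[str, Synset] = {
    "s1": frozenset({"banco", "assento"}),               # seat / bench
    "s2": frozenset({"banco", "instituição bancária"}),  # financial institution
    "s3": frozenset({"móvel", "peça de mobiliário"}),    # piece of furniture
}

# Other semantic relations hold between synsets, not between words.
synset_relations: List[Tuple[str, str, str]] = [("s3", "hypernym-of", "s1")]


def senses(word: str, synsets: Dict[str, Synset]) -> List[str]:
    """Return the identifiers of every synset that contains the word."""
    return sorted(sid for sid, members in synsets.items() if word in members)


banco_senses = senses("banco", synsets)
```

Here `senses("banco", synsets)` yields two synset identifiers, while the hypernymy relation attaches only to the synset for the seat sense.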

2.4.2 Information Extraction techniques in our work

We have only exploited dictionaries for the extraction of semantic relations. For this purpose, we used symbolic techniques over the dictionary definitions (see section 4). We recall that dictionaries provide a wide coverage of the lexicon and that they are structured in words and meanings. Moreover, definitions tend to use simple vocabulary and follow regularities, which makes most of them easily predictable. Therefore, after careful observation, we manually encoded a set of semantic patterns, organised in grammars, for processing them. Despite the labour involved in the manual creation of the grammars, we could take advantage of one of the pros of
