24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 2. Background Knowledge

the appropriate data. The slots can be, for instance, a table in a relational database.

2.3.2 Information Extraction Techniques

To increase the portability of IE systems, the development of IE techniques currently centres on machine learning (Moens, 2006). The alternative is to adopt approaches based on handcrafted knowledge. Existing IE techniques may be classified into three groups (Moens, 2006), namely: symbolic techniques, supervised machine learning and unsupervised machine learning.

All these techniques exploit patterns: recurring sequences of events, evidenced by the objects in text. Pattern recognition deals with the classification of objects into categories, according to the values of a selected number of their features. When using machine learning techniques, after selecting the features to be exploited, extraction rules are automatically learned from a collection of examples where the features are annotated. Alternatively, the features might be used for the manual construction of static rules. According to the types of features to explore, patterns in text can be classified into the following categories:

• Lexical patterns: features relative to the attributes of lexical items. For instance, whether two lexical items co-occur in a context, or whether a word is capitalised. The latter feature is especially productive for NER.

• Syntactic patterns: features include the part-of-speech (POS) of the lexical items (e.g. noun, verb, preposition), and the type of phrase (e.g. noun phrase, verb phrase, prepositional phrase).

• Semantic patterns: features that denote semantic classifications in information units. These include multi-word patterns that may be used for the extraction of semantic relations, such as works at (e.g. Hugo works at CISUC), for denoting an employment relation, or is a (e.g. apple is a fruit) for denoting hyponymy.

• Discourse patterns: features are values computed by using text fragments, such as the distance between two entities in a document. It is assumed that the distance is inversely proportional to the semantic relatedness.
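As an illustration of the feature types listed above, the following minimal Python sketch computes one lexical, one semantic and one discourse feature. It is not taken from the source: the function names, the two-token context window, the pattern table and the use of token positions as the distance measure are all assumptions made for the example.

```python
import re


def lexical_features(tokens, index):
    """Lexical features of tokens[index]: capitalisation and co-occurring words."""
    token = tokens[index]
    # Assumed context: up to two tokens on each side of the target token.
    window = tokens[max(0, index - 2):index] + tokens[index + 1:index + 3]
    return {
        "is_capitalised": token[:1].isupper(),
        "context": [w.lower() for w in window],
    }


# Hypothetical multi-word semantic patterns, mapped to relation names.
SEMANTIC_PATTERNS = {
    "works at": "employment",
    "is a": "hyponymy",
}


def semantic_relations(sentence):
    """Return (left, relation, right) triples for each pattern found."""
    triples = []
    for phrase, relation in SEMANTIC_PATTERNS.items():
        match = re.search(r"(\w+) " + re.escape(phrase) + r" (\w+)", sentence)
        if match:
            triples.append((match.group(1), relation, match.group(2)))
    return triples


def token_distance(tokens, first, second):
    """Discourse feature: number of token positions between two entities."""
    return abs(tokens.index(first) - tokens.index(second))
```

For the sentence Hugo works at CISUC, the lexical features flag Hugo as capitalised, the semantic pattern yields an employment triple, and the distance between the two entities is three tokens. A real system would replace the single-word capture groups with proper entity recognition.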

In the rest of this section, we will briefly describe the three groups of IE techniques.

Symbolic techniques<br />

This group of IE techniques relies on handcrafted symbolic knowledge. Rules for IE are written by someone who is familiar with the formalisms of the IE system and, especially, with the data on which the IE system will run.

Symbolic techniques are often implemented through partial parsing and finite-state automata (Partee et al., 1990). As the name suggests, partial parsing refers to situations when only part of the text is analysed, while the rest is skipped. The analysed part is anticipated through handcrafted patterns, which are in turn translated into the rules of a finite-state automaton.
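The idea of partial parsing driven by a finite-state automaton can be sketched as follows. This is only an illustrative example, not the implementation of any particular IE system: the handcrafted pattern (a capitalised word, then works at, then a capitalised word), the transition table and all function names are assumptions introduced here.

```python
def token_class(token):
    """Map a token to the symbol read by the automaton."""
    if token[:1].isupper():
        return "CAP"
    return token.lower()


# Transition table of a hypothetical automaton for "<CAP> works at <CAP>".
TRANSITIONS = {
    (0, "CAP"): 1,
    (1, "works"): 2,
    (2, "at"): 3,
    (3, "CAP"): 4,
}
ACCEPTING = {4}


def run_from(tokens, start):
    """Run the automaton from tokens[start]; return the end index on accept, else None."""
    state = 0
    for position in range(start, len(tokens)):
        state = TRANSITIONS.get((state, token_class(tokens[position])))
        if state is None:
            return None  # no transition: the pattern does not start here
        if state in ACCEPTING:
            return position + 1
    return None


def partial_parse(text):
    """Analyse only the fragments that match the pattern; skip everything else."""
    tokens = text.replace(".", " ").split()
    matches = []
    start = 0
    while start < len(tokens):
        end = run_from(tokens, start)
        if end is None:
            start += 1  # skipped token: not part of any anticipated pattern
        else:
            matches.append((tokens[start], "works at", tokens[end - 1]))
            start = end
    return matches
```

Applied to a text such as "Yesterday it rained. Hugo works at CISUC and likes coffee.", only the fragment Hugo works at CISUC is analysed, and the remaining tokens are skipped without being parsed.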
