Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...
22 Chapter 2. Background Knowledge<br />
the appropriate data. The slots can be, for instance, a table in a relational<br />
database.<br />
2.3.2 Information Extraction Techniques<br />
To increase the portability of IE systems, the development of IE techniques<br />
is currently centred on machine learning (Moens, 2006). The alternative<br />
is to adopt approaches based on handcrafted knowledge. Existing IE techniques<br />
may be classified into three groups (Moens, 2006), namely: symbolic techniques,<br />
supervised machine learning and unsupervised machine learning.<br />
All these techniques exploit patterns – recurring sequences of events, evidenced<br />
by the objects in text. Pattern recognition deals with the classification of objects<br />
into categories, according to the values of a selected number of their features. When<br />
using machine learning techniques, after selecting the features to be exploited, extraction<br />
rules are automatically learned from a collection of examples where the<br />
features are annotated. Alternatively, the features might be used for the manual<br />
construction of static rules. According to the types of features to explore, patterns<br />
in text can be classified into <strong>the</strong> following categories:<br />
• Lexical patterns: features relative to the attributes of lexical items. For<br />
instance, if two lexical items co-occur in a context, or if a word is capitalised<br />
or not. The latter feature is especially productive for NER.<br />
• Syntactic patterns: features include the part-of-speech (POS) of the lexical<br />
items (e.g. noun, verb, preposition), and the type of phrase (e.g. noun phrase,<br />
verb phrase, prepositional phrase).<br />
• Semantic patterns: features that denote semantic classifications in information<br />
units. These include multi-word patterns that may be used for the extraction<br />
of semantic relations, such as works at (e.g. Hugo works at CISUC), for<br />
denoting an employment relation, or is a (e.g. apple is a fruit) for denoting<br />
hyponymy.<br />
• Discourse patterns: features are values computed by using text fragments,<br />
such as the distance between two entities in a document. It is assumed that<br />
the distance is inversely proportional to the semantic relatedness.<br />
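To make these feature types more concrete, the following minimal Python sketch computes one lexical feature (capitalisation), applies the two semantic patterns mentioned above (works at, is a), and derives a discourse score that decreases with token distance. It is an illustration only: the function names, the regular expressions and the 1/distance scoring are assumptions made for this example, not part of any system described in the text.

```python
import re

# Hypothetical surface patterns; the relation labels are illustrative assumptions.
SEMANTIC_PATTERNS = {
    "employment": re.compile(r"(\w+) works at (\w+)"),
    "hyponymy": re.compile(r"(\w+) is an? (\w+)"),
}


def is_capitalised(token):
    """Lexical feature: does the token start with an upper-case letter?"""
    return token[:1].isupper()


def extract_relations(sentence):
    """Semantic patterns: return (relation, arg1, arg2) triples found in the sentence."""
    triples = []
    for relation, pattern in SEMANTIC_PATTERNS.items():
        for match in pattern.finditer(sentence):
            triples.append((relation, match.group(1), match.group(2)))
    return triples


def relatedness(tokens, a, b):
    """Discourse feature: score inversely proportional to the token distance.

    Assumes both a and b occur in tokens; the 1/distance form is one simple
    way to encode the inverse-proportionality assumption.
    """
    distance = abs(tokens.index(a) - tokens.index(b))
    return 1.0 / distance if distance else 1.0


tokens = "Hugo works at CISUC".split()
capitalised = [t for t in tokens if is_capitalised(t)]
relations = extract_relations("Hugo works at CISUC")
score = relatedness(tokens, "Hugo", "CISUC")
```

Syntactic features are left out of the sketch because they require a POS tagger or chunker, which would have to come from an external resource.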
In the rest of this section, we briefly describe the three groups of IE techniques.<br />
Symbolic techniques<br />
This group of IE techniques relies on handcrafted symbolic knowledge. Rules for IE<br />
are written by someone who is familiar with the formalisms of the IE systems and,<br />
especially, with the data on which the IE system will run.<br />
Symbolic techniques are often implemented through partial parsing and finite<br />
state automata (Partee et al., 1990). As the name suggests, partial parsing refers<br />
to situations where only part of the text is analysed, while the rest is skipped.<br />
The analysed part is anticipated through handcrafted patterns, which are in turn<br />
translated into the rules of a finite state automaton.
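
As an illustration of this idea, the sketch below encodes one handcrafted pattern (a capitalised name, followed by works at, followed by another capitalised name) as the transition table of a small deterministic finite state automaton, and runs it over a token sequence. Tokens from which the automaton cannot reach its accepting state are simply skipped, which is what makes the parse partial. The state numbering, the predicates and the function names are assumptions made for this example, not taken from any particular IE system.

```python
def is_name(token):
    """Predicate: token starts with an upper-case letter."""
    return token[:1].isupper()


def word(expected):
    """Build a predicate that accepts exactly one (case-insensitive) word."""
    return lambda token: token.lower() == expected


# Handcrafted pattern "<Name> works at <Name>" as automaton rules:
# state -> list of (predicate on the next token, next state).
TRANSITIONS = {
    0: [(is_name, 1)],
    1: [(word("works"), 2)],
    2: [(word("at"), 3)],
    3: [(is_name, 4)],
}
ACCEPT = 4


def run_from(tokens, start):
    """Run the automaton from tokens[start]; return the end index on acceptance."""
    state, i = 0, start
    while state != ACCEPT:
        if i >= len(tokens):
            return None  # input exhausted before acceptance
        for predicate, next_state in TRANSITIONS[state]:
            if predicate(tokens[i]):
                state = next_state
                break
        else:
            return None  # no rule applies to this token
        i += 1
    return i


def partial_parse(tokens):
    """Return (employee, employer) pairs; unmatched text is skipped."""
    matches, i = [], 0
    while i < len(tokens):
        end = run_from(tokens, i)
        if end is None:
            i += 1  # this part of the text is not analysed
        else:
            matches.append((tokens[i], tokens[end - 1]))
            i = end
    return matches


sentence = "Yesterday we heard that Hugo works at CISUC in Coimbra"
pairs = partial_parse(sentence.split())
```

Only the span Hugo works at CISUC is analysed here; the surrounding words are passed over without any attempt to assign them structure.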