29.06.2013 Views

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...


Semantic, Terminological and Linguistic Features for Sentence Classification

Tobias Wunner and Paul Buitelaar

Unit for Natural Language Processing,
Digital Enterprise Research Institute, National University of Ireland, Galway

Email: firstname.lastname@deri.org

1. Introduction

Specialized domains often use shared vocabularies to agree on the terminology and semantics of the models used in the community. However, most of the information and knowledge in these domains is usually available only as unstructured plain text in the form of natural language. Here the vocabularies can be used in a semantically guided information extraction process to automatically create structured content. This is known as ontology-based information extraction (OBIE), where the ontology is the vocabulary. A crucial problem in these domains is the shallow semantic, terminological and linguistic (STL) modeling of their vocabularies, which makes it difficult to match vocabulary terms to natural language text objects.

2. Classification with STL Lexicon Objects<br />

To carry out the OBIE task in STL fashion, we propose a two-step approach. In the first step we generate an STL ontology-lexicon in the Semantic Web compliant lemon¹ (Lexicon Model for Ontologies) format from the original domain vocabulary, as described in [1], [2], using an extended version of the Lemon Generator². This provides us with a set of rich lexicon objects that represent the S, T and L context of the vocabulary terms. In the second step we use the STL features of these lexicon objects to train a classifier for sentence classification [3].
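The second step can be illustrated with a minimal sketch. The text does not say which classifier is used, so a plain binary perceptron stands in here as an assumption. Each sentence is assumed to be already reduced to the set of STL feature identifiers it matched, and the training data below is invented purely for illustration.

```python
from collections import defaultdict


def train_perceptron(examples, epochs=10):
    """Train a binary perceptron on (feature_id_set, label) pairs, label in {+1, -1}."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            score = bias + sum(weights[f] for f in features)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Mistake-driven update: shift the weights towards the true label.
                for f in features:
                    weights[f] += label
                bias += label
    return dict(weights), bias


def predict(model, features):
    """Return +1 if the sentence is classified as expressing the concept, else -1."""
    weights, bias = model
    score = bias + sum(weights.get(f, 0.0) for f in features)
    return 1 if score > 0 else -1


# Invented toy data (not from the paper): sets of matched STL feature ids,
# labeled +1 if the sentence expresses the target concept and -1 otherwise.
examples = [
    ({"L11", "S12", "L2"}, 1),
    ({"L11", "L2"}, 1),
    ({"S12"}, -1),
    (set(), -1),
]

model = train_perceptron(examples)
print(predict(model, {"L11", "S12", "L2"}))  # 1
print(predict(model, {"S12"}))               # -1
```

Representing a sentence as a set of feature identifiers keeps the sketch independent of how the S, T and L features are matched. Any classifier that accepts sparse binary features could replace the perceptron.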

3. Financial Reporting Domain<br />

In this work we focus on vocabularies in the accounting domain, i.e. GAAP (Generally Accepted Accounting Principles) taxonomies that are used to create and interpret financial reports in the machine-processable XBRL (eXtensible Business Reporting Language) format. In order to apply STL enrichment, we first convert these vocabularies into the Semantic Web RDF ontology format. In particular, we consider the following two vocabularies:

Vocabulary                                           Concepts
IFRS³ (International Financial Reporting Standards)  2768
xEBR (European Business Registers)                   147
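The conversion of a vocabulary concept into RDF could look roughly like the sketch below, which serializes one concept as Turtle using only string formatting. The namespace URI, the choice of `rdfs:Class` and `rdfs:label`, and the helper names `local_name` and `concept_to_turtle` are all assumptions made for illustration; the text only shows the `ifrs:` prefix and does not describe the actual conversion.

```python
import re

# Hypothetical namespace URI; the source only shows the prefix "ifrs:".
PREFIXES = (
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix ifrs: <http://example.org/ifrs#> .\n"
)


def local_name(label):
    """Build a CamelCase local name from a term label."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    return "".join(w[:1].upper() + w[1:] for w in words)


def concept_to_turtle(label):
    """Serialize one vocabulary concept as a Turtle statement (assumed modeling).

    Assumes the label contains no double quotes or backslashes.
    """
    subject = "ifrs:" + local_name(label)
    pairs = [("a", "rdfs:Class"), ("rdfs:label", f'"{label}"@en')]
    body = " ;\n    ".join(f"{p} {o}" for p, o in pairs)
    return f"{subject} {body} ."


print(PREFIXES)
print(concept_to_turtle("Land And Buildings"))
```

A real conversion would normally rely on an RDF library and the official taxonomy namespaces. The hand-written serialization here only shows the shape of the output.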

¹ http://www.monnet-project.eu/lemon
² http://monnetproject.deri.ie/Lemon-Editor
³ http://www.ifrs.org

A typical IFRS accounting term is:

Decrease Through Classified As Held For Sale Land And Buildings

Applying the STL enrichment adds semantic, terminological and linguistic structure in the form of annotations to the example term:

Level                          Annotation
S1: taxonomic context          S11  ifrs:LandAndBuildings
                               S12  is-a dbpedia:Asset
T1: sub-term                   T11  ifrs:LandAndBuildings
                               T12  term variation = properties
L1: tokenization & stemming    L11  decrease/Preposition
                               L12  classified/Verb_pastTense
                               L13  as/Preposition
                               L14  held/Verb_present
                               …
L2: subcat frame               L2   “hold for sale”/Verb_Phrase

Given such an STL-enriched representation of the vocabulary term, we could then select features to classify the following sentence:

This [decrease]_L11 was due to lower gains realized on the sale of foreclosed [assets]_S12 [held for sale]_L2.
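How the annotations of the example term might be turned into features for this sentence can be sketched as follows. The feature identifiers come from the annotation table above. The surface forms attached to them (for instance "assets" for S12 and "held for sale" for L2) are hand-written assumptions standing in for the lemmatization and ontology lookup a real system would need, and the function name `extract_features` is hypothetical. Only a subset of the annotations is included.

```python
import re

# Feature ids from the annotation table; the surface forms are assumptions.
STL_FEATURES = {
    "S12": ["asset", "assets"],                # is-a dbpedia:Asset
    "T11": ["land and buildings"],             # sub-term ifrs:LandAndBuildings
    "T12": ["properties"],                     # term variation
    "L11": ["decrease"],                       # token of the term
    "L12": ["classified"],                     # token of the term
    "L2": ["hold for sale", "held for sale"],  # subcat frame
}


def extract_features(sentence, lexicon=STL_FEATURES):
    """Return the ids of all STL features with a whole-word match in the sentence."""
    text = sentence.lower()
    found = set()
    for feature_id, surface_forms in lexicon.items():
        for form in surface_forms:
            if re.search(r"\b" + re.escape(form) + r"\b", text):
                found.add(feature_id)
                break
    return found


sentence = ("This decrease was due to lower gains realized on the sale "
            "of foreclosed assets held for sale.")
features = extract_features(sentence)
print(sorted(features))  # ['L11', 'L2', 'S12']

# Fixed-order binary vector, as a classifier would typically expect.
feature_order = sorted(STL_FEATURES)
vector = [1 if f in features else 0 for f in feature_order]
print(vector)  # [1, 0, 1, 1, 0, 0]
```

The matched identifiers agree with the bracketed annotations in the example sentence. Which of these features a classifier should actually use is the selection question raised in the text.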

4. Future work<br />

In future work we plan to evaluate this approach on a corpus of financial reports of wind energy companies (UNLP Wind Energy Corpus), which contains concepts from the IFRS and xEBR vocabularies. We also plan to extend our approach along the cross-lingual dimension towards Cross-lingual Ontology-based Information Extraction (CLOBIE), making use of multilingual vocabularies.

References<br />

[1] Wunner, T., Buitelaar, P. and O’Riain, S. Semantic, Terminological and Linguistic Interpretation of XBRL. In Proceedings of the Workshop on Reuse and Adaptation of Ontologies and Terminologies at the 17th International Conference on Knowledge Engineering and Knowledge Management (EKAW), Lisbon.

[2] McCrae, J., Spohr, D. and Cimiano, P. (2011). Linking Lexical Resources and Ontologies on the Semantic Web with lemon. In Proceedings of the 2011 Extended Semantic Web Conference.

[3] Khoo, A., Marom, Y. and Albrecht, D. (2006). Experiments with Sentence Classification. In Proceedings of the 2006 Australasian Language Technology Workshop, pages 18-25.
