NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
Semantic, Terminological and Linguistic Features for Sentence Classification
Tobias Wunner and Paul Buitelaar
Unit for Natural Language Processing,
Digital Enterprise Research Institute, National University of Ireland, Galway
Email: firstname.lastname@deri.org
1. Introduction
Specialized domains often use shared vocabularies to
agree on the terminology and semantics of the models used
in the community. However, most of the information
and knowledge in these domains is usually available only as
unstructured plain text in natural language.
Here the vocabularies can be used in a semantically
guided information extraction process to automatically
create structured content. This is known as ontology-based
information extraction (OBIE), where the
ontology is the vocabulary. A crucial problem of these
vocabularies is their shallow semantic, terminological and
linguistic (STL) modeling, which makes it difficult to
match them to natural language text.
2. Classification with STL Lexicon Objects<br />
To carry out the OBIE task in STL fashion we propose
a two-step approach. In the first step we generate an STL
ontology-lexicon, in the Semantic Web-compliant
lemon¹ (Lexicon Model for Ontologies) format, from
the original domain vocabulary as described in [1], [2],
using an extended version of the Lemon Generator².
This provides us with a set of rich lexicon objects that
represent the S, T and L context of the vocabulary
terms. In the second step we use the STL features of these
objects to train a classifier for sentence classification [3].
3. Financial Reporting Domain<br />
In this work we focus on vocabularies in the accounting
domain, i.e. GAAP (Generally Accepted Accounting
Principles) taxonomies that are used to create and
interpret financial reports in the machine-processable
XBRL (eXtensible Business Reporting Language)
format. In order to apply STL enrichment we first
convert these vocabularies into the Semantic Web RDF
ontology format. In particular we consider the following
two vocabularies:
Vocabulary                                              Concepts
IFRS³ (International Financial Reporting Standards)     2768
xEBR (European Business Registers)                      147
A typical IFRS accounting term is:<br />
Decrease Through Classified As Held For Sale Land And Buildings

¹ http://www.monnet-project.eu/lemon
² http://monnetproject.deri.ie/Lemon-Editor
³ http://www.ifrs.org

Applying the STL enrichment adds semantic,
terminological and linguistic structure, in the form of
annotations, to the example term:
level                          annotation
S1: taxonomic context          S11  ifrs:LandAndBuildings
                               S12  is-a dbpedia:Asset
T1: sub-term                   T11  ifrs:LandAndBuildings
                               T12  term variation = properties
L1: tokenization & stemming    L11  decrease/Preposition
                               L12  classified/Verb_pastTense
                               L13  as/Preposition
                               L14  held/Verb_present
                               …
L2: subcat frame               L2   “hold for sale”/Verb_Phrase
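
The annotations listed for the example term can be held in a small in-memory structure. The following is a minimal sketch only: it is not the lemon RDF model, and the class names `Annotation` and `LexiconEntry` and the method `by_layer` are hypothetical names chosen for illustration. The annotation values are copied from the listing, and the elided entries ("…") are left out rather than guessed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """One annotation row (hypothetical helper, not part of lemon)."""
    key: str    # annotation id from the listing, e.g. "S12"
    value: str  # annotation content, copied from the listing

    @property
    def layer(self):
        """First letter of the key: 'S', 'T' or 'L'."""
        return self.key[0]


@dataclass
class LexiconEntry:
    """A vocabulary term together with its STL annotations."""
    term: str
    annotations: list = field(default_factory=list)

    def by_layer(self, layer):
        """Return the annotations belonging to one layer ('S', 'T' or 'L')."""
        return [a for a in self.annotations if a.layer == layer]


entry = LexiconEntry(
    term="Decrease Through Classified As Held For Sale Land And Buildings",
    annotations=[
        Annotation("S11", "ifrs:LandAndBuildings"),
        Annotation("S12", "is-a dbpedia:Asset"),
        Annotation("T11", "ifrs:LandAndBuildings"),
        Annotation("T12", "term variation = properties"),
        Annotation("L11", "decrease/Preposition"),
        Annotation("L12", "classified/Verb_pastTense"),
        Annotation("L13", "as/Preposition"),
        Annotation("L14", "held/Verb_present"),
        Annotation("L2", "hold for sale/Verb_Phrase"),
    ],
)

# List the linguistic-layer annotation ids of the example term.
print([a.key for a in entry.by_layer("L")])
```

Grouping by the first letter of the key keeps the S, T and L layers separable, so that a later feature-selection step could draw on each layer independently.
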
Given such an STL-enriched representation of the
vocabulary term, we could then select features to
classify the following sentence:
This [decrease]_L11 was due to lower gains realized
on the sale of foreclosed [assets]_S12 [held for sale]_L2.
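
How the three marked annotations might fire on this sentence can be sketched as a binary feature extractor. This is an illustrative sketch under stated assumptions, not the method evaluated in this work: the trigger phrases in `TRIGGERS`, the toy lemma table `LEMMAS` (standing in for a real stemmer or lemmatizer) and all function names are hypothetical.

```python
import re

# Hypothetical trigger phrases (in lemma form) for three annotations of the
# example term; the choice of triggers is an assumption of this sketch.
TRIGGERS = {
    "L11": ["decrease"],             # token-level annotation
    "S12": ["asset"],                # label of the superclass dbpedia:Asset
    "L2": ["hold", "for", "sale"],   # subcategorization frame "hold for sale"
}

# Toy lemma table standing in for a real lemmatizer or stemmer.
LEMMAS = {"held": "hold", "assets": "asset"}


def lemmatize(sentence):
    """Lowercase, split into word tokens, and map each token to its lemma."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [LEMMAS.get(tok, tok) for tok in tokens]


def contains_sequence(tokens, phrase):
    """Return True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def stl_features(sentence):
    """Map a sentence to binary features, one per STL annotation."""
    tokens = lemmatize(sentence)
    return {key: int(contains_sequence(tokens, phrase))
            for key, phrase in TRIGGERS.items()}


sentence = ("This decrease was due to lower gains realized "
            "on the sale of foreclosed assets held for sale.")
features = stl_features(sentence)
print(features)  # {'L11': 1, 'S12': 1, 'L2': 1}
```

A feature dictionary of this kind could serve as input to any standard sentence classifier; matching on lemmas rather than surface forms is what lets "held for sale" activate the "hold for sale" frame.
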
4. Future work<br />
In future work we plan to evaluate this approach on a
corpus of financial reports of wind energy companies
(UNLP Wind Energy Corpus), which contains concepts
from the IFRS and xEBR vocabularies. We also plan to
extend our approach along the cross-lingual dimension,
towards Cross-lingual Ontology-based Information
Extraction (CLOBIE), drawing on multilingual
vocabularies.
References<br />
[1] Wunner, T., Buitelaar, P. and O’Riain, S. Semantic,
Terminological and Linguistic Interpretation of XBRL. In
Proceedings of the Workshop on Reuse and Adaptation of
Ontologies and Terminologies at the 17th International
Conference on Knowledge Engineering and Knowledge
Management (EKAW), Lisbon.
[2] McCrae, J., Spohr, D. and Cimiano, P. (2011). Linking
Lexical Resources and Ontologies on the Semantic Web with
lemon. In Proceedings of the 2011 Extended Semantic Web
Conference.
[3] Khoo, A., Marom, Y. and Albrecht, D. (2006). Experiments with
Sentence Classification. In Proceedings of the 2006
Australasian Language Technology Workshop, pages 18-25.