NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
A Corpus Framework For Cross-Lingual Search
Nitish Aggarwal, Tobias Wunner, Paul Buitelaar
Unit for Natural Language Processing,
Digital Enterprise Research Institute, National University of Ireland, Galway
Email: firstname.lastname@deri.org
1. Introduction
Cross-lingual queries on text documents based on<br />
specialized domain vocabularies are complex and<br />
dependent on the semantic, terminological and<br />
linguistic (STL) features of the vocabulary and<br />
language. The challenge for a cross-lingual search is to<br />
retrieve corpus objects which best match the user query<br />
based on these features. Therefore both the corpus and<br />
the query need to be STL enriched [1].<br />
2. Method<br />
In this work we present a framework to carry out an STL
enrichment process for document, sentence and token
corpus objects. The implementation is based on the
blackboard architecture pattern, with the corpus as the
blackboard and S, T and L annotators acting on the
corpus to perform the STL enrichment process (Figure 1).
Figure 1. STL corpus framework
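The blackboard arrangement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names (CorpusObject, Corpus, tokenizer, term_annotator, concept_annotator), the toy vocabulary and the whitespace tokenizer are all assumptions made for illustration. The point it shows is that annotators never call each other; each one reads from and writes to the shared corpus, so the S annotator can build on what the T annotator left behind.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CorpusObject:
    """A document, sentence or token stored on the blackboard."""
    kind: str   # "document", "sentence" or "token"
    text: str
    lang: str
    annotations: Dict[str, str] = field(default_factory=dict)


class Corpus:
    """The blackboard: holds corpus objects and runs registered annotators."""

    def __init__(self) -> None:
        self.objects: List[CorpusObject] = []
        self.annotators: List[Callable[["Corpus"], None]] = []

    def register(self, annotator: Callable[["Corpus"], None]) -> None:
        self.annotators.append(annotator)

    def enrich(self) -> None:
        # Each annotator only sees the corpus, never another annotator.
        for annotator in self.annotators:
            annotator(self)


# Hypothetical toy vocabulary: (language, term) -> concept identifier.
VOCAB = {
    ("es", "activos financieros"): "IFRS_FinancialAssets",
    ("en", "financial assets"): "IFRS_FinancialAssets",
}


def tokenizer(corpus: Corpus) -> None:
    """L annotator (toy stand-in): whitespace tokenization of sentences."""
    for obj in list(corpus.objects):  # copy, because we append while looping
        if obj.kind == "sentence":
            for word in obj.text.split():
                corpus.objects.append(CorpusObject("token", word, obj.lang))


def term_annotator(corpus: Corpus) -> None:
    """T annotator: marks sentences that contain a known vocabulary term."""
    for obj in corpus.objects:
        if obj.kind != "sentence":
            continue
        for lang, term in VOCAB:
            if lang == obj.lang and term in obj.text.lower():
                obj.annotations["term"] = term


def concept_annotator(corpus: Corpus) -> None:
    """S annotator: maps the term written by the T annotator to a concept."""
    for obj in corpus.objects:
        term = obj.annotations.get("term")
        if term is not None:
            obj.annotations["concept"] = VOCAB[(obj.lang, term)]


corpus = Corpus()
corpus.objects.append(
    CorpusObject("sentence", "Los activos financieros aumentaron", "es")
)
for annotator in (tokenizer, term_annotator, concept_annotator):
    corpus.register(annotator)
corpus.enrich()
```

In this sketch the registration order matters, because the S annotator depends on the "term" annotation. A fuller blackboard would typically add a control component that decides which annotator runs next.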
We implemented L and T annotators for NLP
processing such as tokenization and part-of-speech
tagging for English, German, Spanish and Dutch,
and an S annotator that enriches the corpus
objects with vocabulary annotations. On the query side
we manually implemented a set of queries with
different STL features, as shown in Table 1.
Type      Example
S query   val=IFRS_FinancialAssets, lang=DE
T query   val=activos financieros, lang=ES
L query   part-of-speech=Verb
L query   lemma=finance, part-of-speech=Noun
Table 1. Implemented STL queries
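One way to read the queries in Table 1 is as sets of feature constraints that an enriched corpus object must satisfy. The sketch below follows that reading and is an assumption, not the authors' query engine: the function names (matches, search), the dictionary keys ("concept", "term", "lemma", "pos", "lang") and the sample objects are hypothetical, and the single "val" field of the table is split into "concept" for S queries and "term" for T queries.

```python
from typing import Dict, List

Features = Dict[str, str]

# Hypothetical enriched corpus objects, as an STL enrichment might leave them.
objects: List[Features] = [
    {"text": "finanzielle Vermögenswerte", "lang": "DE",
     "concept": "IFRS_FinancialAssets"},
    {"text": "activos financieros", "lang": "ES",
     "term": "activos financieros", "concept": "IFRS_FinancialAssets"},
    {"text": "finances", "lang": "EN", "lemma": "finance", "pos": "Noun"},
    {"text": "financed", "lang": "EN", "lemma": "finance", "pos": "Verb"},
]


def matches(obj: Features, query: Features) -> bool:
    """True if the object carries every feature value the query asks for."""
    return all(obj.get(key) == value for key, value in query.items())


def search(corpus_objects: List[Features], query: Features) -> List[Features]:
    """Return all corpus objects that satisfy the query constraints."""
    return [obj for obj in corpus_objects if matches(obj, query)]


# The four query types of Table 1, expressed as feature constraints.
s_query = {"concept": "IFRS_FinancialAssets", "lang": "DE"}
t_query = {"term": "activos financieros", "lang": "ES"}
l_query_pos = {"pos": "Verb"}
l_query_lemma = {"lemma": "finance", "pos": "Noun"}

s_hits = [obj["text"] for obj in search(objects, s_query)]
t_hits = [obj["text"] for obj in search(objects, t_query)]
pos_hits = [obj["text"] for obj in search(objects, l_query_pos)]
lemma_hits = [obj["text"] for obj in search(objects, l_query_lemma)]
```

Because the S query matches on the concept rather than on the surface string, it retrieves the German object even though the concept identifier itself is language-independent; this is the cross-lingual aspect the enrichment is meant to enable.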
3. Data-Set<br />
We have constructed a multi-lingual finance data set<br />
consisting of financial reports from Wind Energy<br />
companies (UNLP Wind Energy Corpus) and<br />
vocabularies for English, German, and Spanish. The<br />
corpus comprises 96 financial reports and 1421 news<br />
texts from 9 different wind energy companies. We also<br />
used two financial vocabularies with STL-enriched
terms. The first vocabulary is the International Financial
Reporting Standards (IFRS), which is used worldwide to
create financial reports in XBRL (eXtensible Business
Reporting Language) format. The second vocabulary is
developed by the xEBR (XBRL European Business<br />
Registers) group to describe legal enterprise entities<br />
within Europe.<br />
Vocabulary   Terms   Examples
IFRS         2487    Financial assets
                     Amortization Computer Software
xEBR         147     Financial fixed assets
                     Company address, Country
Table 2. Financial vocabularies
4. Future work<br />
In future work we plan to develop a broader set of
annotators and to evaluate our approach on the UNLP
Wind Energy corpus with queries constructed from the
IFRS and xEBR vocabularies. In particular, we want
to explore different combinations of the S, T and L
features of the queries using the framework.
We also plan to extend the framework's STL enrichment
to the vocabulary side, based on the lemon (lexicon
model for ontologies) Generator¹ developed by the
MONNET project², to facilitate richer STL searches.
5. References<br />
[1] Wunner, T., Buitelaar, P., O’Riain, S. Semantic,
Terminological and Linguistic Interpretation of XBRL.
In Proceedings of the Workshop on Reuse and
Adaptation of Ontologies and Terminologies at the 17th
International Conference on Knowledge Engineering
and Knowledge Management (EKAW), Lisbon.
[2] Cimiano et al. (2010). D2.1 Ontology-Lexicon<br />
Model. Monnet Project Deliverable.<br />
1 http://monnetproject.deri.ie/Lemon-Editor<br />
2 http://www.monnet-project.eu/