29.06.2013 Views

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...


A Corpus Framework For Cross-Lingual Search

Nitish Aggarwal, Tobias Wunner, Paul Buitelaar

Unit for Natural Language Processing,
Digital Enterprise Research Institute, National University of Ireland, Galway

Email: firstname.lastname@deri.org

1. Introduction

Cross-lingual queries over text documents that rely on specialized domain vocabularies are complex: they depend on the semantic, terminological and linguistic (STL) features of both the vocabulary and the language. The challenge for cross-lingual search is to retrieve the corpus objects that best match the user query on the basis of these features. Both the corpus and the query therefore need to be STL-enriched [1].

2. Method

In this work we present a framework that carries out an STL enrichment process for document, sentence and token corpus objects. The implementation follows the blackboard architecture pattern: the corpus serves as the blackboard, and S, T and L annotators act on it to perform the STL enrichment (Fig. 1).

Figure 1. STL corpus framework
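The blackboard arrangement described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `CorpusObject`, `Blackboard`, `l_tokenize` and `s_vocabulary` are hypothetical, whitespace splitting stands in for real tokenization, and the one-entry vocabulary (including the pairing of the concept identifier with its English and Spanish labels) is assumed for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class CorpusObject:
    """A document, sentence or token on the blackboard (hypothetical structure)."""
    kind: str   # "document", "sentence" or "token"
    text: str
    lang: str
    annotations: Dict[str, str] = field(default_factory=dict)
    children: List["CorpusObject"] = field(default_factory=list)


# An annotator reads the blackboard and writes its annotations back to it.
Annotator = Callable[["Blackboard"], None]


class Blackboard:
    """Shared corpus that the S, T and L annotators enrich in turn."""

    def __init__(self, documents: List[CorpusObject]) -> None:
        self.documents = documents

    def walk(self) -> Iterator[CorpusObject]:
        """Yield every corpus object, depth first."""
        stack = list(reversed(self.documents))
        while stack:
            obj = stack.pop()
            yield obj
            stack.extend(reversed(obj.children))

    def run(self, annotators: List[Annotator]) -> None:
        for annotate in annotators:
            annotate(self)


def l_tokenize(board: Blackboard) -> None:
    """L annotator: attach token objects to each document (whitespace split)."""
    for obj in list(board.walk()):  # snapshot, because children are added below
        if obj.kind == "document":
            obj.children = [CorpusObject("token", t, obj.lang) for t in obj.text.split()]


# Assumed toy vocabulary: concept id -> label per language.
VOCAB = {"IFRS_FinancialAssets": {"en": "financial assets", "es": "activos financieros"}}


def s_vocabulary(board: Blackboard) -> None:
    """S annotator: tag documents that mention a vocabulary label with its concept id."""
    for obj in board.walk():
        if obj.kind != "document":
            continue
        for concept, labels in VOCAB.items():
            label = labels.get(obj.lang)
            if label and label in obj.text.lower():
                obj.annotations["concept"] = concept


board = Blackboard([
    CorpusObject("document", "Total financial assets increased", "en"),
    CorpusObject("document", "Los activos financieros aumentaron", "es"),
])
board.run([l_tokenize, s_vocabulary])
```

In this sketch the annotators never call each other; they communicate only through the shared corpus, so further annotators can be appended to the list passed to `run` without changing the existing ones.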

We implemented L and T annotators for NLP processing, such as tokenization and part-of-speech tagging, for English, German, Spanish and Dutch, and an S annotator that enriches the corpus objects with vocabulary annotations. On the query side, we manually implemented a set of queries with different STL features, as shown in Table 1.

Type      Example
S query   val=IFRS_FinancialAssets, lang=DE
T query   val=activos financieros, lang=ES
L query   part-of-speech=Verb
L query   lemma=finance, part-of-speech=Noun

Table 1. Implemented STL queries
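One way to read the queries in Table 1 is as feature constraints evaluated against annotated corpus objects. The sketch below assumes a hypothetical `Item` record whose field names (`surface`, `lemma`, `pos`, `lang`, `concept`) are illustrative and do not come from the framework; the sample items and their annotations are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Item:
    """An annotated token or term span (hypothetical record)."""
    surface: str
    lemma: str
    pos: str
    lang: str
    concept: Optional[str] = None  # S annotation: vocabulary concept id, if any


def matches(item: Item, query: Dict[str, str]) -> bool:
    """An item matches when every feature named in the query has the requested value."""
    return all(getattr(item, feature) == value for feature, value in query.items())


def search(items: List[Item], query: Dict[str, str]) -> List[str]:
    """Return the surface forms of all items that satisfy the query."""
    return [item.surface for item in items if matches(item, query)]


# Illustrative items, as the S, T and L annotators might have left them.
items = [
    Item("finance", "finance", "Noun", "EN"),
    Item("financed", "finance", "Verb", "EN"),
    Item("activos financieros", "activo financiero", "Noun", "ES",
         concept="IFRS_FinancialAssets"),
]

verbs = search(items, {"pos": "Verb"})                          # L query
nouns = search(items, {"lemma": "finance", "pos": "Noun"})      # L query
spanish = search(items, {"concept": "IFRS_FinancialAssets", "lang": "ES"})  # S query
```

Because a query is just a set of feature constraints, S, T and L features can be combined in a single query by adding keys to the dictionary.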


3. Data-Set

We have constructed a multilingual finance data set consisting of financial reports from wind energy companies (the UNLP Wind Energy Corpus) and vocabularies for English, German and Spanish. The corpus comprises 96 financial reports and 1421 news texts from 9 different wind energy companies. We also used two financial vocabularies with STL-enriched terms. The first is the International Financial Reporting Standards (IFRS) vocabulary, which is used worldwide to create financial reports in XBRL (eXtensible Business Reporting Language) format. The second was developed by the xEBR (XBRL European Business Registers) group to describe legal enterprise entities within Europe.

Vocabulary   Terms   Examples
IFRS         2487    Financial assets
                     Amortization Computer Software
xEBR         147     Financial fixed assets
                     Company address, Country

Table 2. Financial vocabularies

4. Future work

In future work we plan to develop a broader set of annotators and to evaluate our approach on the UNLP Wind Energy Corpus with queries constructed from the IFRS and xEBR vocabularies. In particular, we want to use the framework to explore different combinations of S, T and L features in the queries.

We also plan to extend the framework's STL enrichment to the vocabulary side, based on the lemon (lexicon model for ontologies) Generator¹ developed by the MONNET project², to facilitate richer STL searches.

5. References

[1] Wunner, T., Buitelaar, P., O’Riain, S. Semantic, Terminological and Linguistic Interpretation of XBRL. In Proceedings of the Workshop on Reuse and Adaptation of Ontologies and Terminologies at the 17th International Conference on Knowledge Engineering and Knowledge Management (EKAW), Lisbon.

[2] Cimiano et al. (2010). D2.1 Ontology-Lexicon Model. Monnet Project Deliverable.

¹ http://monnetproject.deri.ie/Lemon-Editor
² http://www.monnet-project.eu/
