
Ghent, February 11th 2011

21st meeting of Computational Linguistics in the Netherlands

CLIN-21 was supported by:


Welcome!

For the first time in its over 20-year history, the "Computational Linguistics in the Netherlands" meeting is being held in the beautiful city of Ghent. This year's edition of the meeting is hosted by the Language and Translation Technology Team of University College Ghent.

CLIN-21 will cover a broad spectrum of areas related to natural language and computation. The programme features 55 talks, organized in 5 parallel sessions, and 18 posters on different aspects of computational linguistics. We are delighted that Massimo Poesio from the University of Essex has agreed to give us his vision on current anaphora resolution research. Leonoor van der Beek will present a booklet on the history of language and speech technology in the Netherlands and Flanders. At the CLIN meeting, we will also announce the winner of the STIL Thesis Prize 2011, which is awarded to the best MA thesis in computational linguistics or its applications.

This booklet contains the presentation and poster abstracts for this year's CLIN, as well as the programme schedule. The abstracts are ordered alphabetically.

We hope that CLIN-21 will provide a rewarding forum for the presentation of interesting work and new ideas in the domain of computational linguistics and natural language processing, with stimulating and provocative discussions of successes, failures and new directions. We thank you for your support and participation and wish you a pleasant and fruitful conference!

The CLIN-21 organizing committee
Veronique Hoste
Els Lefever
Kathelijne Denturck
Peter Velaerts



Table of Contents

Welcome!
Table of Contents
Programme
Keynote speaker
  Rethinking anaphora
Presentation Abstracts
  A discriminative syntactic model for source permutation via tree transduction for statistical machine translation
  A Generalized Lexical Acquisition Technique for Improved Parsing Accuracy
  A Semantic Vector Space for Modelling Word Meaning in Context
  A Toolkit for Visualizing the Coherence of Tree-based Reordering with Word-Alignments
  A U-DOP approach to modeling language acquisition
  Age and Gender Prediction on Netlog Data
  Aligning translation divergences through semantic role projection
  An Aboutness-based Dependency Parser for Dutch
  An exploration of n-gram relationships for transliteration identification
  Application of a Constraint Conditional Model for improving the performance of a Sequence Tagger: A Case Study
  Automatic terminology extraction: methods and practical applications
  Automatically Constructing a Wordnet for Dutch
  Automatically determining phonetic distances
  Building a Gold Standard for Dutch Spelling Correction
  Clustering customer questions
  Collecting and using a corpus of lyrics and their moods
  Combined Qualitative and Quantitative Error Analysis in Multi-Topic Authorship Attribution
  Combining e-learning and natural language processing. The example of automated dictation exercises
  Computing Semantic Relations from Heterogeneous Information Sources
  Computing the meaning of multi-word expressions for semantic inference
  Cross-Domain Dutch Coreference Resolution
  Dmesure: a readability platform for French as a foreign language
  Essentials of person names
  Extraction of Historical Events from Unstructured Texts
  Finding Statistically Motivated Features Influencing Subtree Alignment Performance
  From Tokens to Text Entities: Line-based Parsing of Resumes using Conditional Random Fields
  Im chattin :-) u wanna NLP it: Analyzing Reduction in Chat
  Language Evolution and SA-OT: The case of sentential negation
  Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus
  Measuring the Impact of Controlled Language on Machine Translation Text via Readability and Comprehensibility
  Memory-based text completion
  Overlap-based Phrase Alignment for Language Transformation
  Parse and Tag Somali Pirates
  Personalized Knowledge Discovery: Combining Social Media and Domain Ontologies
  Recent Advances in Memory-Based Machine Translation
  Reversible stochastic attribute-value grammars
  Robust Rhymes? The Stability of Authorial Style in Medieval Narratives
  Rule Induction for Synchronous Tree-Substitution Grammars in Machine Translation
  Search in the Lassy Small Corpus
  Simple Measures of Domain Similarity for Parsing
  SSLD: A smart tool for sms compression
  Subtrees as a new type of context in Word Space Models
  Successful extraction of opposites by means of textual patterns with part-of-speech information only
  Syntactic Analysis of Dutch via Web Services
  Technology recycling between Dutch and Afrikaans
  Technology recycling for closely related languages: Dutch and Afrikaans
  The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation
  The use of structure discovery methods to detect syntactic change
  Treatments of the Dutch verb cluster in formal and computational linguistics
  TTNWW: de facto standards for Dutch in the context of CLARIN
  TTNWW: NLP Tools for Dutch as Webservices in a Workflow
  Using corpora tools to analyze gradable nouns in Dutch
  Using easy distributed computing for data-intensive processing
  What is the use of multidocument spatiotemporal analysis?
  Without a doubt no uncomplicated task: Negation cues and their scope
Poster Abstracts
  A database for lexical orthographic errors in French
  A Posteriori Agreement as a Quality Measure for Readability Prediction Systems
  A TN/ITN Framework for Western European languages
  An Examination of Cross-Cultural Similarities and Differences from Social Media Data with respect to Language Use
  Authorship Verification of Quran
  CLAM: Computational Linguistics Application Mediator
  Discriminative features in reversible stochastic attribute-value grammars
  Fietstas: a web service for text analysis
  FoLiA: Format for Linguistic Annotation
  How can computational linguistics help determine the core meaning of then in oral speech?
  Of mathematicians and physicists: the history of language and speech technology in the Netherlands and Flanders
  On the difficulty of making concreteness concrete
  ParaSense or how to use parallel corpora for Cross-Lingual Word Sense Disambiguation
  "Pattern", a web mining module for Python
  Semantic role labeling of gene regulation events
  Source Verification in Quran
  Towards a language-independent data-driven compound decomposition tool
  Towards improving the precision of a relation extraction system by processing negation and speculation
List of Participants


Programme

09:00 - 09:30   Registration and coffee (Foyer)

Parallel sessions, 09:30 - 10:50

Room 328: Social media (chair: Richard Beaufort)
  09:30 - 09:50   Age and Gender Prediction on Netlog Data
                  C. Peersman, W. Daelemans, L. Van Vaerenbergh
  09:50 - 10:10   Im chattin :-) u wanna NLP it: Analyzing Reduction in Chat
                  H. van Halteren, G. Martell, C. Du, Y. Gu, J. Kobben, L. Panjaitan, L. Schubotz, K. Vasylenko, Y. Vladimirova
  10:10 - 10:30   Personalized Knowledge Discovery: Combining Social Media and Domain Ontologies
                  Th. Markus, E. Westerhout, P. Monachesi
  10:30 - 10:50   Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus
                  S. Schrauwen, W. Daelemans

Room 303: Syntax and parsing (chair: Menno van Zaanen)
  09:30 - 09:50   The use of structure discovery methods to detect syntactic change
                  L. ten Bosch, M. Versteegh
  09:50 - 10:10   An Aboutness-based Dependency Parser for Dutch
                  C.H.A. Koster
  10:10 - 10:30   A Generalized Lexical Acquisition Technique for Improved Parsing Accuracy
                  K. Cholakov, G. van Noord, V. Kordoni, Y. Zhang
  10:30 - 10:50   Search in the Lassy Small Corpus
                  G. van Noord, D. de Kok, J. van der Linde

Room 313: Lexical semantics (chair: Tanja Gaustad)
  09:30 - 09:50   A Semantic Vector Space for Modelling Word Meaning in Context
                  K. Heylen, D. Speelman, D. Geeraerts
  09:50 - 10:10   Automatically Constructing a Wordnet for Dutch
                  T. Van de Cruys
  10:10 - 10:30   Successful extraction of opposites by means of textual patterns with part-of-speech information only
                  A. Lobanova
  10:30 - 10:50   Computing the meaning of multi-word expressions for semantic inference
                  C. Cremers

Room 317: Machine translation (chair: Lieve Macken)
  09:30 - 09:50   Rule Induction for Synchronous Tree-Substitution Grammars in Machine Translation
                  V. Vandeghinste, S. Martens
  09:50 - 10:10   Recent Advances in Memory-Based Machine Translation
                  M. van Gompel, A. van den Bosch, P. Berck
  10:10 - 10:30   A discriminative syntactic model for source permutation via tree transduction for statistical machine translation
                  M. Khalilov, Kh. Sima'an, G.M. de Buy Wenniger
  10:30 - 10:50   Measuring the Impact of Controlled Language on Machine Translation Text via Readability and Comprehensibility
                  St. Doherty

Room 403: Methodology (chair: Erik Tjong Kim Sang)
  09:30 - 09:50   Using easy distributed computing for data-intensive processing
                  J. Van den Bogaert
  09:50 - 10:10   Application of a Constraint Conditional Model for improving the performance of a Sequence Tagger: A Case Study
                  T. Stokman
  10:10 - 10:30   Technology recycling between Dutch and Afrikaans
                  L. Augustinus, G. van Huyssteen, S. Pilon
  10:30 - 10:50   Technology recycling for closely related languages: Dutch and Afrikaans
                  S. Pilon, G. van Huyssteen

10:50 - 11:10   Coffee break (Foyer)
11:10 - 11:20   Welcome (Auditorium)
11:20 - 11:30   STIL thesis prize (Auditorium)
11:30 - 12:30   Invited talk: Rethinking anaphora
                M. Poesio (Auditorium)
12:30 - 12:50   Of mathematicians and physicists: the history of language and speech technology in the Netherlands and Flanders
                L. van der Beek (Auditorium)
12:50 - 13:50   Lunch break (Restaurant)

Parallel sessions, 13:50 - 15:10

Room 328: Information extraction (chair: Eline Westerhout)
  13:50 - 14:10   Extraction of Historical Events from Unstructured Texts
                  R. Segers, M. van Erp, L. van der Meij
  14:10 - 14:30   Clustering customer questions
                  F. Nauze
  14:30 - 14:50   Parse and Tag Somali Pirates
                  M. van Erp, V. Malaisé, W. van Hage, V. Osinga, J.M. Coleto
  14:50 - 15:10   From Tokens to Text Entities: Line-based Parsing of Resumes using Conditional Random Fields
                  M. Rotaru

Room 303: Syntax and parsing (chair: Antal van den Bosch)
  13:50 - 14:10   Reversible stochastic attribute-value grammars
                  D. de Kok, G. van Noord, B. Plank
  14:10 - 14:30   The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation
                  D. Theijssen, H. van Halteren, L. Boves, N. Oostdijk
  14:30 - 14:50   Simple Measures of Domain Similarity for Parsing
                  B. Plank, G. van Noord
  14:50 - 15:10   Treatments of the Dutch verb cluster in formal and computational linguistics
                  F. Van Eynde

Room 313: Semantics (chair: Kris Heylen)
  13:50 - 14:10   Computing Semantic Relations from Heterogeneous Information Sources
                  A. Panchenko
  14:10 - 14:30   Essentials of person names
                  M. Schraagen
  14:30 - 14:50   Subtrees as a new type of context in Word Space Models
                  M. Smets, D. Speelman, D. Geeraerts
  14:50 - 15:10   Using corpora tools to analyze gradable nouns in Dutch
                  N. Ruiz, E. Weiffenbach

Room 317: Translation (chair: Vincent Vandeghinste)
  13:50 - 14:10   Finding Statistically Motivated Features Influencing Subtree Alignment Performance
                  G. Kotzé
  14:10 - 14:30   A Toolkit for Visualizing the Coherence of Tree-based Reordering with Word-Alignments
                  G. Maillette de Buy Wenniger
  14:30 - 14:50   Overlap-based Phrase Alignment for Language Transformation
                  S. Wubben, A. van den Bosch, E. Krahmer
  14:50 - 15:10   Aligning translation divergences through semantic role projection
                  T. Vanallemeersch

Room 403: Discourse (chair: Kim Luyckx)
  13:50 - 14:10   Cross-Domain Dutch Coreference Resolution
                  O. De Clercq, V. Hoste
  14:10 - 14:30   What is the use of multidocument spatiotemporal analysis?
                  I. Schuurman, V. Vandeghinste
  14:30 - 14:50   Language Evolution and SA-OT: The case of sentential negation
                  A. Lopopolo, T. Biro
  14:50 - 15:10   Without a doubt no uncomplicated task: Negation cues and their scope
                  R. Morante, S. Schrauwen, W. Daelemans

15:10 - 16:10   Poster session & coffee break (Foyer)

Parallel sessions, 16:10 - 17:10

Room 328: Standards/CLARIN (chair: Gertjan van Noord)
  16:10 - 16:30   TTNWW: de facto standards for Dutch in the context of CLARIN
                  I. Schuurman, M. Kemps-Snijders
  16:30 - 16:50   TTNWW: NLP Tools for Dutch as Webservices in a Workflow
                  M. Kemps-Snijders, I. Schuurman
  16:50 - 17:10   Syntactic Analysis of Dutch via Web Services
                  E. Tjong Kim Sang

Room 303: Syntax/Spelling (chair: Roser Morante)
  16:10 - 16:30   A U-DOP approach to modeling language acquisition
                  M. Smets
  16:30 - 16:50   Memory-based text completion
                  A. van den Bosch
  16:50 - 17:10   Building a Gold Standard for Dutch Spelling Correction
                  T. Gaustad, A. van den Bosch

Room 313: Lexical semantics/Discourse (chair: Tim Van de Cruys)
  16:10 - 16:30   An exploration of n-gram relationships for transliteration identification
                  P. Nabende
  16:30 - 16:50   Combined Qualitative and Quantitative Error Analysis in Multi-Topic Authorship Attribution
                  K. Luyckx, W. Daelemans
  16:50 - 17:10   Robust Rhymes? The Stability of Authorial Style in Medieval Narratives
                  M. Kestemont, W. Daelemans, D. Sandra

Room 317: Beyond text (chair: Paola Monachesi)
  16:10 - 16:50   Combining e-learning and natural language processing. The example of automated dictation exercises
                  R. Beaufort, S. Roekhaut
  16:50 - 17:10   Collecting and using a corpus of lyrics and their moods
                  M. van Zaanen

Room 403: Tools (chair: Els Lefever)
  16:10 - 16:50   Automatic terminology extraction: methods and practical applications
                  D. de Vries
  16:50 - 17:10   SSLD: A smart tool for sms compression
                  L.-A. Cougnon, R. Beaufort

17:10 - 19:00   Drinks (Foyer)


Keynote speaker

Rethinking anaphora

Massimo Poesio

Current models of the anaphora resolution task achieve mediocre results for all but the simpler aspects of the task, such as coreference proper (i.e. linking proper names into coreference chains). One of the reasons for this state of affairs is the drastically simplified picture of the task at the basis of existing annotated resources and models, e.g. the assumption that human subjects by and large agree on anaphoric judgments. In this talk I will present the current state of our efforts to collect more realistic judgments about anaphora through the Phrase Detectives online game, and to develop models of anaphora resolution that do not rely on the total agreement assumption.


Presentation Abstracts


A discriminative syntactic model for source permutation via tree transduction for statistical machine translation

Khalilov, Maxim and Sima'an, Khalil and Maillette de Buy Wenniger, Gideon
ILLC-UvA

Word ordering is still one of the most challenging problems in statistical machine translation (SMT). In most existing work, a word reordering model is implicitly or explicitly incorporated into a translation system based on flat or hierarchical representations of phrases. By contrast, this study addresses the word ordering problem via source-side permutation prior to translation, using hierarchical and syntactic structures.

Our work is driven by the idea that reordering the source sentence as a pre-translation step minimizes the need for reordering during translation and may bridge long-distance order differences, which are outside the scope of commonly used reordering models. Given a word-aligned parallel corpus, we define source string permutation as the task of statistically learning to unfold the crossing alignments between sentence pairs in the parallel corpus.

This work contributes an approach for learning source string permutation via transfer of the source syntax tree, i.e. we define source permutation as the problem of learning how to transfer a given source parse tree into a parse tree that minimizes the divergence from the target word order.

We present a novel discriminative, probabilistic tree transduction model, and contribute a set of empirical oracle results (upper bounds on translation performance) for English-to-Dutch source string permutation under sequence and parse tree constraints. Finally, the translation performance of our learning model is shown to significantly outperform the state-of-the-art phrase-based system.

Corresponding author: maxkhalilov@gmail.com
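
To make the notion of "crossing alignments" concrete, here is a minimal Python sketch (an illustration, not the authors' model) that counts the crossing link pairs a source permutation would need to unfold:

```python
def crossing_pairs(alignment):
    """Count pairs of alignment links that cross each other.

    `alignment` is a set of (source_pos, target_pos) index pairs.
    Two links (s1, t1) and (s2, t2) cross when s1 < s2 but t1 > t2;
    a source permutation that removes all crossings makes the
    alignment monotone with the target word order.
    """
    links = sorted(alignment)
    return sum(1
               for i, (s1, t1) in enumerate(links)
               for s2, t2 in links[i + 1:]
               if s1 < s2 and t1 > t2)

# Toy English-Dutch pair where the verb moves to the end on the target side:
# links she-zij, has-heeft, read-gelezen, the-het, book-boek
print(crossing_pairs({(0, 0), (1, 1), (2, 3), (3, 4), (4, 2)}))  # -> 2
```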


A Generalized Lexical Acquisition Technique for Improved Parsing Accuracy

Cholakov, Kostadin (1) and van Noord, Gertjan (1) and Kordoni, Valia (2) and Zhang, Yi (3)
(1) University of Groningen
(2) DFKI, Germany
(3) University of Saarland, Germany

Unknown words are a major issue for large-scale precision grammars of natural language. In Cholakov and van Noord (COLING 2010) we proposed a maximum entropy based classification algorithm for acquiring lexical entries for all forms in the paradigm of a given unknown word, and we tested its performance on the Dutch Alpino grammar. The study showed an increase in parsing accuracy when our method was applied. However, the general applicability of our approach has been a major point of criticism: it has been considered too specific, and its application to other systems and languages doubtful.

In this presentation, we argue that our method can be applied to any precision grammar provided that the following conditions are fulfilled: a finite set of labels onto which unknown words are mapped; large corpora; a parser which analyses various contexts of a given unknown word and provides syntactic constraints used as features in the classification process; and a morphological component which generates the paradigm(s) of the unknown word. We show that fulfilling these conditions allows us to successfully apply our approach to other large-scale grammars, where it leads to a significant increase in parsing accuracy on test sets of sentences containing unknown words, compared with the accuracy achieved by the default unknown-word handling methods of these grammars.

This provides strong support for our claim that our approach is general enough to be applied to various languages and precision grammars.

Corresponding author: kcholakov@gmail.com
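
The classification step described above can be sketched as follows. This is an illustrative stand-in using scikit-learn's logistic regression (a maximum entropy classifier) with invented feature names and labels, not the actual Alpino-based system:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: for each unknown word, syntactic-constraint
# features collected from parses of several of its contexts, paired with
# the lexical-type label the word should receive (all names invented).
train_features = [
    {"follows_det": True, "heads_pp": False, "suffix=en": False},  # noun-like
    {"follows_det": False, "heads_pp": True, "suffix=en": True},   # verb-like
]
train_labels = ["noun(de)", "verb(intrans)"]

# Maximum entropy classification = multinomial logistic regression.
model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(train_features, train_labels)

print(model.predict([{"follows_det": True, "heads_pp": False, "suffix=en": False}]))
```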


A Semantic Vector Space for Modelling Word Meaning in Context

Heylen, Kris and Speelman, Dirk and Geeraerts, Dirk
QLVL, Katholieke Universiteit Leuven

Semantic vector spaces have become the mainstay of modelling word meaning in statistical NLP. They encode the semantics of words through high-dimensional vectors that record the co-occurrence of those words with context features in a large corpus. Vector comparison then allows for the calculation of, e.g., semantic similarity between words. Most semantic vector spaces represent word meaning on the type (or lemma) level, i.e. their vectors generalize over all occurrences of a word. However, the meaning of words can differ considerably between contexts due to polysemy or vagueness. Therefore, many applications, like Word Sense Disambiguation (WSD) or Textual Entailment, require that word meaning be modelled on the token level, i.e. the level of individual occurrences. In this paper, we present a semantic vector space model that represents the meaning of word tokens by taking the word type vector and reweighting it based on the words observed in the token's immediate vicinity. More specifically, we give a bigger weight to the context features in the original type vector that are semantically similar to the context features observed around the token. This semantic similarity between context features is calculated based on the original word-type-by-context-feature matrix. We explore the performance of this model in a WSD task by visualizing how well the model separates the different meanings of polysemous words in a Multi-Dimensional Scaling solution. We also compare our model to other token-level semantic vector spaces as proposed by Schütze (1998) and Erk & Padó (2008).

References

Erk, K. & S. Padó. 2008. A Structured Vector Space Model for Word Meaning in Context. EMNLP Proceedings, 897-906.
Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97-124.

Corresponding author: kris.heylen@arts.kuleuven.be
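
The reweighting idea described above can be sketched in a few lines of numpy. The matrix layout (feature vectors taken from rows of the same co-occurrence matrix) and the use of a maximum over observed features are simplifying assumptions for illustration, not the authors' exact weighting scheme:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def token_vector(type_vec, observed_features, M):
    """Reweight a word-type vector into a token vector.

    type_vec: co-occurrence vector of the target word (length = n features)
    observed_features: indices of the context features seen around this token
    M: the type-by-feature co-occurrence matrix; since context features are
       words here, its rows also serve as vectors for the features themselves
    Each dimension j of the type vector is scaled by the similarity between
    feature j and the most similar feature observed in the token's context.
    """
    weights = np.array([
        max(cosine(M[j], M[o]) for o in observed_features)
        for j in range(len(type_vec))
    ])
    return type_vec * weights

# Toy 3-feature space; rows of M double as feature vectors.
M = np.array([[2., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
print(token_vector(M[0], observed_features=[1], M=M))
```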


A Toolkit for Visualizing the Coherence of Tree-based Reordering with Word-Alignments

Maillette de Buy Wenniger, Gideon
Institute for Logic, Language and Computation

Tree-based reordering constitutes an important motivation for the increasing interest in syntax-driven machine translation. It has often been argued that tree-based reordering might provide a more effective approach for bridging the word-order differences between source and target sentences. One major approach, known as ITG (Inversion Transduction Grammar), allows permuting the order of the subtrees dominated by the children of any node in the tree. In practice, it has often been observed that word-alignments usually cohere only to a certain degree with this kind of tree-based reordering, i.e. there are cases of word-alignments that cannot be fully explained by tree-based reordering when the tree is fixed a priori. This presentation describes a toolkit for visualizing alignment graphs that consist of a word-alignment together with a source or target tree. More importantly, the toolkit provides a facility for visualizing the coherence of a word-alignment with tree-based reordering, highlighting nodes and word-alignments that are incompatible with one another. The tool allows visualizing the tree-based reordered source/target string as well as the reordered tree. Using our toolkit, we will also present results pertaining to the coverage of the ITG assumption on the word-alignments of a Europarl corpus, which is a very common starting point for training translation systems. We will also dwell on the breakdown of the types of incompatibility into general classes and discuss what that implies for training hierarchical translation models on this type of data.

Corresponding author: gemdbw@gmail.com


A U-DOP approach to modeling language acquisition

Smets, Margaux

In linguistics, there is a debate between empiricists and nativists: the former believe that language is acquired from experience, the latter that there is an innate component for language. The main arguments adduced by nativists are Arguments from Poverty of Stimulus: it is claimed that children acquire certain phenomena which they cannot learn on the basis of experience alone, and that therefore there has to be some innate component for language. In this thesis, we show that at least for certain phenomena that are often used in such arguments, it is possible to explain how children acquire them on the basis of experience alone, viz. with an Unsupervised Data-Oriented Parsing (U-DOP) approach to language.

In the first part of the thesis, we develop concrete implementations of U-DOP, and contribute to the field of unsupervised parsing with two innovations. First, we develop an algorithm that performs syntactic category labeling and parsing simultaneously, and second, we devise a new methodology for unsupervised parsing, which can in principle be applied to any unsupervised parsing algorithm, and which produces the best results reported on the ATIS corpus so far, with a promising outlook for even better results.

In the second part of the thesis, we then use these concrete implementations to show how the acquisition of certain phenomena can be explained in an empirical way. We look in detail at wh-questions, and then show that the U-DOP approach is more general than the nativist account by looking at other phenomena.

Corresponding author: margauxsmets@gmail.com


Age and Gender Prediction on Netlog Data

Peersman, Claudia (1,2) and Daelemans, Walter (1) and Van Vaerenbergh, Leona (2)
(1) CLiPS, University of Antwerp
(2) Artesis University College Antwerp

In recent years millions of people have started using social networking sites such as Netlog to support their personal and professional communications, creating digital communities. However, a common characteristic of these digital communities is that users can easily provide a false name, age, gender and location in order to hide their true identity. This way, social networking sites can be used by people with criminal intentions (e.g. paedophiles) to support their activities online.

In the context of the DAPHNE project (Defending Against Paedophiles in Heterogeneous Network Environments), we present the first results of a machine learning approach for age and gender prediction on a corpus of posts from the social network site Netlog. We investigate which types of linguistic and stylistic features are effective for age and gender prediction, given the specific characteristics of (Dutch) chat language, and compare the effectiveness of different machine learning techniques for age and gender prediction on the Netlog data.

We will conclude our presentation by discussing how these results will guide future research in the DAPHNE project.

Corresponding author: claudia.peersman@ua.ac.be


Aligning translation divergences through semantic role projection

Vanallemeersch, Tom
Centrum voor Computerlinguïstiek, K.U.Leuven

We investigate whether an alignment method based on cross-lingual semantic annotation projection improves over approaches for linguistically uninformed word alignment and purely syntax-based tree alignment, specifically in the area of translation divergences. We apply an SRL system which annotates English sentences with PropBank and NomBank rolesets (verbal and nominal predicates and their semantic roles), and project the predicates and roles to Dutch and French using intersective GIZA++ word alignment. We create additional alignment links by detecting the auxiliary words of predicates (auxiliary, modal and support verbs) in parse trees and by detecting potential Dutch or French predicates based on projected roles. Finally, we investigate whether additional links can be created by training an SRL system on the projected predicates and roles and applying it to the Dutch and French parse trees.

Corresponding author: tallem@ccl.kuleuven.be
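
The projection step can be illustrated with a minimal sketch (toy indices, not the paper's pipeline): role labels attached to source tokens travel along intersective alignment links to the corresponding target tokens.

```python
def project_roles(src_roles, alignment):
    """Project semantic role labels from source to target tokens.

    src_roles: {source_index: role_label}, e.g. PropBank-style labels
    alignment: set of (source_index, target_index) links, e.g. the
               intersection of two GIZA++ alignment directions
    Returns {target_index: role_label}; roles on unaligned tokens
    are simply dropped.
    """
    projected = {}
    for s, t in alignment:
        if s in src_roles:
            projected[t] = src_roles[s]
    return projected

# "gave" is the predicate, "book" its ARG1; both aligned into the target.
print(project_roles({1: "PRED", 3: "ARG1"}, {(0, 0), (1, 2), (3, 4)}))
# -> {2: 'PRED', 4: 'ARG1'}
```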


An Aboutness-based Dependency Parser for Dutch

Koster, Cornelis H.A.
Radboud Universiteit Nijmegen

Dupira (the Dutch Parser for IR Applications) is a new dependency parser for Dutch, which was developed at the University of Nijmegen, based on the older Amazon grammar and lexicon.

Dupira is a rule-based parser, which is generated by means of the AGFL parser generator from the Dupira grammar, lexicon and fact tables. By means of transductions which are specified in the grammar (and can be modified), the parser transduces sentences to dependency trees.

Dupira was developed for applications in Information Retrieval (IR) rather than in Linguistics, and for that reason has the following properties:

- the dependency model of Dupira expresses the aboutness of a sentence rather than describing its complete syntactic structure;
- therefore it is highly suitable for extracting factoids from running text;
- it is also possible to extract dependency triples, which can be used as high-accuracy terms for text categorization and full-text search;
- Dupira performs certain aboutness-preserving normalizing transformations, including de-passivization and de-topicalization, in order to enhance recall;
- it makes extensive use of subcategorization preferences to resolve, where possible, the attachment of preposition phrases;
- it is highly robust, both lexically and syntactically, and fast enough for practical applications.

In this presentation, we discuss the aboutness-based dependency model and the way in which the grammar describes the Dutch language. We show by means of examples the transduction of clauses and phrases. We report the availability of Dupira version 0.8 in the public domain and the plans we have for further development, and discuss some of its applications.

Corresponding author: kees@cs.ru.nl


An exploration of n-gram relationships for transliteration identification

Nabende, Peter
University of Groningen

Transliteration identification aims at building quality bilingual lexicons for complementing and improving performance in various NLP applications, including Machine Translation (MT) and Cross-Language Information Retrieval (CLIR). The main task is to identify matching words across different writing systems from a given data source. Recent evaluations (Kumaran et al., 2010) show that no single approach achieves a consistently best performance in identifying transliterations for different language pairs: an approach that leads to the identification of quality matches between English and Russian may result in many incorrect matches between English and Chinese. In this paper, we conduct experimental settings utilizing n-gram relationships for computing candidate transliteration similarity scores, which are subsequently evaluated for choosing potential transliteration matches. We use datasets from the 2010 transliteration generation shared task (Li et al., 2010) for five language pairs: English-Russian, English-Chinese, English-Hindi, English-Tamil, and English-Kannada. For each language pair, we explore various n-gram relationships, starting from the unigram case up to higher-order n-grams. Results show that higher-order n-grams lead to better transliteration identification quality across all languages; which higher-order n-grams perform best, however, differs per language pair. For example, a pair trigram model outperforms a pair 4-gram model on an English-Russian dataset, while the reverse is true for an English-Pinyin (Romanized Chinese) dataset. The results are promising as we aim at eliciting such n-gram correspondences for use in more complex stochastic models, such as pair Hidden Markov Models (pair HMMs), which we postulate may lead to even better transliteration identification quality.

Corresponding author: p.nabende@rug.nl
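
As one concrete instance of an n-gram relationship between candidate transliterations, a Dice coefficient over character n-grams can serve as a similarity score. This simple measure is an illustration, not the paper's exact scoring model:

```python
from collections import Counter

def ngrams(word, n):
    """All character n-grams of a word, e.g. bigrams of 'anna' -> an, nn, na."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def dice_similarity(a, b, n=2):
    """Dice coefficient over the character n-gram multisets of two words."""
    ga, gb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    overlap = sum((ga & gb).values())
    total = sum(ga.values()) + sum(gb.values())
    return 2 * overlap / total if total else 0.0

# Scoring a Latin-script word against romanized candidates: higher = better match.
print(dice_similarity("peter", "petr"))   # ~0.57
print(dice_similarity("peter", "maria"))  # 0.0
```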


Application of a Constraint Conditional Model for improving the performance of a Sequence Tagger: A Case Study

Stokman, Tim
Textkernel

Natural language classifiers often ignore natural global constraints arising from the nature of the domain. These constraints are often hard to learn because they are global, while most classifiers make local decisions. Given that the classifier can produce a probability distribution over the possible assignments, we can search this solution space for a set of assignments satisfying the natural constraints. On this solution space, we can formulate an ILP model that maximizes the probability of the assignment while conforming to all constraints formulated in the model.

We apply this approach to a number of datasets and develop a set of natural constraints for these datasets. We will expand on the practical implementation issues encountered and show the performance improvements that arise from using the constraint conditional model approach.

Corresponding author: timstokman@gmail.com
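
The idea of searching the classifier's solution space under a global constraint can be sketched with exhaustive search. A real system would formulate this as the ILP mentioned above; the tagging task and constraint here are invented for illustration:

```python
from itertools import product

def constrained_best(probs, labels, satisfies):
    """Pick the label sequence with maximal probability that meets a
    global constraint the local classifier cannot enforce by itself.

    probs: per-token distributions, probs[i][label] = P(label | token i)
    satisfies: predicate over a full label sequence (the global constraint)
    Exhaustive search for clarity; ILP or beam search scales better.
    """
    best, best_p = None, -1.0
    for seq in product(labels, repeat=len(probs)):
        if not satisfies(seq):
            continue
        p = 1.0
        for i, lab in enumerate(seq):
            p *= probs[i][lab]
        if p > best_p:
            best, best_p = seq, p
    return best

# Toy constraint: a tagged segment must contain exactly one 'NAME' token.
probs = [{"NAME": 0.6, "O": 0.4}, {"NAME": 0.5, "O": 0.5}]
print(constrained_best(probs, ["NAME", "O"],
                       lambda s: s.count("NAME") == 1))  # ('NAME', 'O')
```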


Automatic terminology extraction: methods and practical applications

de Vries, Dennis
GridLine

Some of the useful applications of terminology that GridLine provides are automatic assignment of keywords to documents, integration of thesauri with search engines, enrichment of search queries using a multilingual thesaurus, and aiding writers in the correct use of terminology.

The problem many organisations face, though, is that they do not have a list of their specific terminology. Creating these lists manually, by examining company documentation and interviewing experts, is very expensive and time-consuming. Therefore, GridLine developed instruments for automatically extracting organisation-specific terminology from document collections. Additionally, after extracting terminology, we can build a thesaurus by automatically determining semantic relations between terms.

For the extraction of terms and semantic relations we use a variety of linguistic methods (lemmatizer, POS tagger, compound splitter) and statistical methods (unithood, termhood). In this presentation I will give a brief overview of these extraction techniques and show some examples of projects we did for our customers. In particular, I will talk about Termtreffer, an easy-to-use application for term extraction which we made for the Nederlandse Taalunie. In this application, users can extract terms from documents using custom combinations of linguistic and statistical modules for term extraction. Additionally, they can manage, analyze and edit the resulting terminology lists.

GridLine is a growing company based in the center of Amsterdam. Currently we are market leader in Dutch language technology.

Corresponding author: dennis@gridline.nl
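
As an illustration of a statistical termhood measure (a generic textbook-style example; the abstract does not specify GridLine's actual formulas), candidate terms can be ranked by how much more frequent they are in domain text than in a reference corpus:

```python
from collections import Counter

def termhood(candidates, domain_tokens, reference_tokens):
    """Rank term candidates by relative frequency ratio: how much more
    frequent a candidate is in the domain corpus than in a reference
    corpus. One of many statistical 'termhood' measures; add-one
    smoothing keeps unseen reference words from dividing by zero."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = len(domain_tokens), len(reference_tokens)
    scores = {c: ((dom[c] + 1) / n_dom) / ((ref[c] + 1) / n_ref)
              for c in candidates}
    return sorted(scores.items(), key=lambda kv: -kv[1])

domain = "de hypotheek rente van de hypotheek".split()
reference = "de kat zat op de mat".split()
print(termhood(["hypotheek", "de"], domain, reference))
# 'hypotheek' outranks the function word 'de'
```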


Automatically Constructing a Wordnet for Dutch

Van de Cruys, Tim
INRIA & Université Paris VII

In this talk, we describe the automatic construction of a wordnet for Dutch by combining a number of different sources of semantic information. First, a number of unsupervised and semi-supervised techniques are presented for the extraction of different kinds of semantic information. This includes techniques based on distributional similarity and clustering, but also techniques that extract semantic information from semi-structured and multilingual resources (such as Wiktionary). The second part describes how the output of these techniques may be combined with the structure of the original Princeton WordNet for English, which allows for the automatic construction of a wordnet for Dutch. Contrary to existing resources, the extracted resource also includes named entities. The resource is evaluated against CORNETTO, an existing, manually constructed wordnet for Dutch.

Corresponding author: Tim.Van_de_Cruys@inria.fr


Automatically determining phonetic distances

Wieling, Martijn and Margaretha, Eliza and Nerbonne, John
University of Groningen

This study seeks to induce the distance between phonetic segments based on their correspondences in dialect atlas material. In other words, we induce information about the physical realization of sounds from their dialectal distributions.

We algorithmically align segments in pairs of pronunciations at various sites in order to identify corresponding sounds. We then apply an information-theoretic measure, Pointwise Mutual Information, in order to automatically determine phonetic distances based on the relative frequency of correspondences. We repeat these steps until the alignments (and segment distances) stabilize.

We evaluate the quality of the obtained phonetic distances by comparing them to acoustic vowel distances. For two separate dialect datasets, Dutch and German, we find highly significant correlations between the induced phonetic distances and the acoustic distances, indicating that the frequency of correspondence in dialect material conveys information about the constitution of sounds. We close with some speculations about the usefulness of the method.

Corresponding author: m.b.wieling@rug.nl
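
The Pointwise Mutual Information step can be sketched as follows, assuming aligned segment pairs have already been harvested from the pronunciation alignments (one iteration only; the full method re-aligns with the new distances until they stabilize):

```python
import math
from collections import Counter

def pmi_distances(aligned_pairs):
    """Turn aligned segment pairs into PMI-based segment distances.

    aligned_pairs: list of (segment_a, segment_b) correspondences
    harvested from pairwise pronunciation alignments.
    PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ): frequently corresponding
    segments get high PMI, which we convert into a distance by negating
    and shifting so that 0 means 'most similar'.
    """
    pair_counts = Counter(aligned_pairs)
    seg_counts = Counter(s for pair in aligned_pairs for s in pair)
    n_pairs, n_segs = len(aligned_pairs), 2 * len(aligned_pairs)
    pmi = {
        (x, y): math.log2((c / n_pairs) /
                          ((seg_counts[x] / n_segs) * (seg_counts[y] / n_segs)))
        for (x, y), c in pair_counts.items()
    }
    max_pmi = max(pmi.values())
    return {pair: max_pmi - v for pair, v in pmi.items()}

print(pmi_distances([("a", "a"), ("a", "a"), ("a", "e"), ("o", "u")]))
```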


Building a Gold Standard for Dutch Spelling Correction

Gaustad, Tanja and van den Bosch, Antal
TiCC, Tilburg University

The main question in the NWO project "Implicit Linguistics" is whether abstract linguistic representations are necessary as an intermediate step in NLP models and systems. To investigate this, we focus on text-to-text processing tasks, i.e. processes which map form to form. In particular, we are investigating Dutch spelling correction, where a corrupted text is converted to a clean version of the same text.

In order to test the quality of a spelling corrector, a gold standard is needed. This, however, does not yet exist for Dutch. For this reason, we set out to build such a gold standard, containing a mixed selection of texts in which we aim to mark all errors and their corrections. In this talk, we will present the gold standard, including inter-annotator agreement and other statistics relating to the data used. Furthermore, we will present first results from applying our language model WOPR to the corpus, comparing it against two baselines: a high-precision known-error list and a context-insensitive lexical baseline. Evaluation is performed in terms of precision and recall on detection and correction on full text.

Corresponding author: T.Gaustad@uvt.nl


Clustering customer questions

Nauze, Fabrice
Q-go

Q-go's natural language search technology powers the search box of many corporate websites. Its NLP technology allows customers to ask questions in their own words and returns a small set of relevant answers. Hundreds of millions of questions have already been processed and answered with Q-go's solution, providing us with a mine of data. In order to improve our knowledge of what customers are asking and to help further refine our core systems, Q-go needs a way to automatically cluster relevant queries from large sets of customer questions.

To achieve this goal we tested several standard clustering methods on sets of customer questions. The outline of the talk is as follows. First, we will explain the specific challenges one has to face when clustering customer questions (very short queries, typos, etc.). We will then present the clustering algorithms that have been tested (among others k-means, GAAC hierarchical clustering, and mini-batch k-means). Thirdly, we will outline two different types of heuristics, used in the first case to improve the quality of the vector representations feeding the clustering algorithms, and in the second to overcome the curse of dimensionality. Finally, the different methods will be evaluated and compared with respect to processing speed and intrinsic quality of clustering (as well as its practical usefulness).

Corresponding author: fabrice.nauze@q-go.com
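
A minimal version of one of the tested setups, using scikit-learn's MiniBatchKMeans over TF-IDF vectors. The character n-gram featurization is an assumption, chosen here because it softens the effect of typos in very short queries:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "how do I reset my password",
    "forgot my password, what now",
    "what are the transfer fees",
    "how much does a transfer cost",
]

# Very short queries yield sparse vectors; character n-grams soften typos.
vectors = TfidfVectorizer(analyzer="char_wb",
                          ngram_range=(3, 4)).fit_transform(questions)
km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for question, label in zip(questions, km.labels_):
    print(label, question)
```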


Collecting and using a corpus of lyrics and their moods

van Zaanen, Menno
Tilburg University

Recently, there has been an increase in the availability of music in digital formats. This has led to music collections that are different in nature from those in the past. Collections are typically larger and consist of a selection of individual pieces instead of complete albums. Since playing any musical piece from the collection can be done without physically changing the medium, listeners create playlists that allow them to identify a subset of the collection and determine the order in which the pieces are played.

People creating playlists often want to group pieces based on their emotional load (such as happy or sad). Creating such playlists, however, is time-consuming and requires knowledge of the music in the collection, since emotional information is not explicitly encoded with the pieces. We will describe a system that analyzes musical pieces and, based on the lyrics, classifies them into their corresponding mood class. This system is developed and evaluated using a corpus of lyrics of songs and their corresponding mood. The mood tags were collected by social tagging of musical pieces using the Moody iTunes plugin, developed by the company Earth People within the Crayon Room project. Starting from a list of (artist, title, mood) triples, the corresponding lyrics of the songs have been collected. This has led to a corpus containing the lyrics of 5,631 songs, which will be made publicly available.

Corresponding author: mvzaanen@uvt.nl


Combined Qualitative and Quantitative Error Analysis in Multi-Topic Authorship Attribution

Luyckx, Kim and Daelemans, Walter
CLiPS, Antwerp University

In authorship attribution, function words are considered the ideal feature type to deal with the complexity of the task. There is a consensus that they are topic-neutral, highly frequent, and not under the author's conscious control. However, it has been shown that hardly any of these allegedly topic-neutral features are in fact topic-neutral. Topic seems to be hard to separate from authorial style, irrespective of the type of features used to predict authorship. Although function words are robust to limited data and provide good indicators of authorship, the a priori exclusion of content words causes a lot of useful information to be disregarded.

We discuss experiments in multi-topic authorship attribution and zoom in on the features that constitute the attribution model. Qualitative analysis of results is typically lacking in authorship attribution studies, since many studies focus on performance but refrain from going into detail about the features selected.

In this talk, we show that high performance does not always imply a viable approach. More specifically, we zoom in on unique identifiers: features that occur exclusively with a specific authorship class in training and uniquely identify a test instance by the same author. Although a coincidence (the frequency of a feature in an unseen test set, or the topic of that test set, cannot be predicted), topic-related unique identifiers provide the model with an unfair advantage that will not scale. However, the absence of unique identifiers does not necessarily imply a scalable approach. Although this talk focuses on authorship attribution, we think any task in text mining would benefit from consistent error analysis.

Corresponding author: kim.luyckx@ua.ac.be
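
The unique identifiers discussed above can be found mechanically. This small sketch (with invented toy features) returns every training feature that occurs with exactly one authorship class:

```python
from collections import defaultdict

def unique_identifiers(docs):
    """Find features that occur with exactly one authorship class.

    docs: list of (author, features) pairs from the training data.
    Returns {feature: author} for every feature seen with one author
    only; such 'unique identifiers' trivially decide any test text
    containing them, which can inflate attribution scores.
    """
    seen = defaultdict(set)
    for author, features in docs:
        for f in features:
            seen[f].add(author)
    return {f: next(iter(a)) for f, a in seen.items() if len(a) == 1}

train = [("A", {"the", "whale", "of"}), ("B", {"the", "scarlet", "of"})]
print(unique_identifiers(train))  # {'whale': 'A', 'scarlet': 'B'}
```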


Combining e-learning and natural language processing. The example of automated dictation exercises

Beaufort, Richard and Roekhaut, Sophie
UCL CENTAL

E-learning is a way of delivering education based on the use of electronic tools and content, either delivered on CD-ROMs or managed through network connections. The idea behind e-learning is to improve both the learning and its management. To this end, exercises are frequently automated. Of course, the ideal automation would involve the three distinct steps of an exercise: its preparation, its realization (by the student) and its correction.

Up to now, automation has led to exercises, like gap-fill texts or multiple-choice tests, which limit the kinds of knowledge that can be assessed. This is due to the fact that the correction step, for some exercises, is far from easy to automate. An eloquent example of such an exercise is dictation, the activity where the teacher reads a passage aloud and the learners write it down. While automatically reading an unknown text aloud is not a problem as long as a reliable text-to-speech synthesis system is available, the accurate correction of a learner's copy can quickly become a nightmare.

The correction of a dictation copy involves two steps: first, the detection of the actual locations of errors; second, the classification of these errors. In this paper, we present a way of automating these two steps. The detection step is based on a finite-state string alignment between the copy and the original. The classification step is the best result of a finite-state intersection between all possible automatic analyses of an error and the single analysis of the corresponding correct form.

Corresponding author: richard.beaufort@uclouvain.be
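
The detection step can be approximated with a classic dynamic-programming string alignment. The finite-state machinery of the actual system is beyond a short sketch, but the alignment it computes is of this kind:

```python
def align(copy, original):
    """Align a learner's copy with the original dictation text and
    report the error sites. Classic edit-distance alignment (a
    dynamic-programming stand-in for the finite-state alignment the
    abstract describes); returns (copy_word, original_word) pairs
    that differ, with None marking insertions and deletions."""
    n, m = len(copy), len(original)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (copy[i - 1] != original[j - 1]))
    pairs, i, j = [], n, m
    while i or j:
        if i and j and d[i][j] == d[i - 1][j - 1] + (copy[i - 1] != original[j - 1]):
            pairs.append((copy[i - 1], original[j - 1])); i, j = i - 1, j - 1
        elif i and d[i][j] == d[i - 1][j] + 1:
            pairs.append((copy[i - 1], None)); i -= 1
        else:
            pairs.append((None, original[j - 1])); j -= 1
    return [(a, b) for a, b in reversed(pairs) if a != b]

print(align("il mange les pomme".split(), "il mange les pommes".split()))
# -> [('pomme', 'pommes')]
```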


Abstract<br />

38<br />

CLIN 21 – CONFERENCE PROGRAMME<br />

Computing Semantic Relations from Heterogeneous<br />

Information Sources<br />

Panchenko, Alexander<br />

UCL CENTAL<br />

Computation of semantic relations between terms or concepts is a general problem in Natural Language Processing and a subtask of automatic thesaurus construction.

This work describes and compares the available heterogeneous information sources which can be used for mining semantic relations, such as texts, electronic dictionaries and encyclopedias, lexical ontologies and thesauri, folksonomies, word surface forms, query logs of search engines, and so forth. Most existing algorithms use a single information source for extracting semantic knowledge: Distributional Analysis relies on text, Extended Lesk uses dictionary definitions, the Jiang-Conrath distance employs a semantic network such as WordNet, and so on. We show that different methods capture different aspects of the terms' relatedness: while one acquires similarities of word contexts, others capture similarities of syntactic contexts, term definitions, surface forms, etc.

In these settings, there is a need for a general model capable of aggregating different aspects of semantic similarity from all available information sources and methods in an optimal and consistent way. We discuss how such a model can be implemented with a linear combination, and using tensors (i.e. multi-way arrays). We describe two ways of using tensors for the calculation of semantic relations in the context of multiple information sources, which we call the "adjacency tensor" and the "feature tensor". The sparse tensor factorization methods PARAFAC, Non-negative Tensor Factorization (NTF) and Memory-Efficient Tucker (MET) are suggested in order to fuse information about terms from different methods and information sources. We conclude that tensors can be used for representing terms, while tensor factorizations can serve to generalize data about terms' relatedness.
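A minimal sketch of the linear-combination variant (the tensor factorizations themselves require dedicated toolkits; matrix names and weights below are illustrative, not the paper's):

    import numpy as np

    def combine_similarities(matrices, weights):
        """Linearly aggregate term-by-term similarity matrices, one per
        information source (e.g. distributional, definition-based, WordNet)."""
        combined = np.zeros_like(matrices[0], dtype=float)
        for m, w in zip(matrices, weights):
            rng = m.max() - m.min()
            # Normalise each source to [0, 1] so the weights are comparable.
            combined += w * ((m - m.min()) / rng if rng else m)
        return combined

    # A "feature tensor" in the sense of the abstract can then be formed by
    # stacking the per-source matrices along a third mode:
    #   tensor = np.stack(matrices, axis=2)   # terms x terms x sources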

Corresponding author: alexander.panchenko@student.uclouvain.be


Computing the meaning of multi-word expressions for semantic inference
Cremers, Crit
Leiden University

The immense diversity of multi-word expressions in every language imposes heavy requirements on the lexicon, the grammar and their interface for deep semantic analysis to be feasible. The lexicon for meaning-driven NLP is huge and phrasal.

We present a model for dealing with extended lexical units in a parser/generator for Dutch that aims at the logical computation of entailments and presuppositions. The model consists of three components: a fiat architecture for the computational lexicon, an efficient organization of online lexical retrieval, and a selectional method of semantic underspecification. In the fiat lexicon, all combinatory instances of all lexical varieties of all (semantically) relevant constructions are spelled out as feature-value graphs. Each of the feature-value graphs contains all combinatory information needed for syntactic and semantic processing. This constructicon is produced offline. It is managed online by a retrieval system that selects contextually required and adequate constructions in linear time. The underspecification allows the combinatory result to be disambiguated by evaluating the structure of the representations.

The model will be demonstrated by reference to two particular constructions: the Dutch way-construction (Poss 2009) and the Dutch honger-construction. The first ("jij hebt je een weg uit de gevangenis geslijmd", lit. "you have slimed yourself a way out of prison") exemplifies an intriguing combination of lexical restrictions, productivity, structure sensitivity and semantic specificity. The second ("ik heb geen erg grote honger", lit. "I have no very big hunger", i.e. "I am not very hungry") exemplifies transcategorial semantic effects, where open modification of a noun phrase requires translation into propositional and state-sensitive operators.

Corresponding author: c.l.j.m.cremers@hum.leidenuniv.nl


Cross-Domain Dutch Coreference Resolution
De Clercq, Orphée and Hoste, Véronique
LT3 Language and Translation Technology Team, University College Ghent

For the STEVIN-funded SoNaR project, a Dutch reference corpus of 500 million words is being built. At the same time, a one-million-word subset is progressively enriched with semantic information: named entities, coreference, spatio-temporal relations and semantic roles. As a prerequisite, existing schemes and systems developed for Dutch are to be reused to the fullest extent possible. In this talk we present the ongoing task of annotating this subset with coreference information, following existing guidelines for Dutch (Bouma et al. 2007).

The basis for our coreference resolver is an existing mention-pair approach (Hoste 2005, Hendrickx et al. 2008) for Dutch. One of the main challenges in the domain of coreference resolution is portability across different domains and languages. Since one of the great advantages of the SoNaR corpus is its diversity (the 1MW subset itself comprises six text types), we trained our system both on each text type separately and on all types combined. We will report cross-type (e.g. administrative, external communication, instructive, journalistic text) cross-validation results for different NP types and present an extensive qualitative error analysis. We compare performance when providing perfect markables, derived from deep parsing (Alpino; Van Noord et al. 2006), with automatically generated markables, and investigate the added value of integrating additional semantic information resulting from other annotation layers.
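A minimal sketch of how a mention-pair resolver of this kind creates its classification instances (features, names and the toy example are illustrative, not those of the actual system):

    def make_instances(mentions, max_dist=10):
        """Pair each anaphor with preceding candidate antecedents; a binary
        classifier later decides 'coreferent or not' for every pair."""
        instances = []
        for j, ana in enumerate(mentions):
            for i in range(max(0, j - max_dist), j):
                ante = mentions[i]
                instances.append({
                    "distance": j - i,                           # mentions apart
                    "string_match": ante["text"].lower() == ana["text"].lower(),
                    "same_np_type": ante["np_type"] == ana["np_type"],
                    "label": ante["entity"] == ana["entity"],    # gold coreference
                })
        return instances

    mentions = [
        {"text": "de onderzoekster", "np_type": "common", "entity": 1},
        {"text": "zij", "np_type": "pronoun", "entity": 1},
        {"text": "het corpus", "np_type": "common", "entity": 2},
    ]
    print(len(make_instances(mentions)))  # -> 3 candidate pairs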

References

G. Bouma, W. Daelemans, I. Hendrickx, V. Hoste, and A. Mineur. 2007. The COREA project: Manual for the annotation of coreference in Dutch texts. Technical report, University of Groningen.

I. Hendrickx, V. Hoste, and W. Daelemans. 2008. Semantic and Syntactic Features for Anaphora Resolution for Dutch. In Lecture Notes in Computer Science, Volume 4919, Proceedings of the CICLing-2008 conference, pages 351–361. Berlin: Springer Verlag.

V. Hoste. 2005. Optimization Issues in Machine Learning of Coreference Resolution. Ph.D. thesis, Antwerp University.

G. van Noord, I. Schuurman, and V. Vandeghinste. 2006. Syntactic Annotation of Large Corpora in STEVIN. In Proceedings of LREC 2006, Genoa.

Corresponding author: orphee.declercq@hogent.be


Dmesure: a readability platform for French as a foreign language
François, Thomas
UCL CENTAL

It is a well-known fact that reading practice improves the reading abilities of L1 as well as L2 students. However, once one strays from textbooks, matching individuals with texts of an adequate level of difficulty is far from easy. FFL teachers have all, at some time or other, wasted time carrying out such a task.

Our research aims to provide the community with a web platform, called Dmesure, able to retrieve texts from the web on a specific topic and at a specific readability level. We will present the current version of this platform, in which texts are first retrieved through the Yahoo search engine before being assessed for difficulty using the readability measure described in François (2009). It is worth noting that the output of Dmesure is compliant with the proficiency scale set in the Common European Framework of Reference for Languages, which makes this tool very convenient for FFL teachers.

We also address some specific problems encountered when applying a readability measure to web texts. Among them, we consider the influence of boilerplate on the readability measure, and some ways to reject pages whose language diverges too much from the norm. To conclude, we show that, because Dmesure has been developed from a participative perspective, it makes it possible to collect new texts annotated by teachers. The built-in readability model can therefore be retrained periodically on this enhanced corpus.

Corresponding author: thomas.francois@uclouvain.be


Essentials of person names
Schraagen, Marijn
Leiden Institute of Advanced Computer Science

The frequency of spelling variation and errors in person names is relatively high compared to normal vocabulary. Standardization of a name to some base form, or core, could be useful in named entity matching or record linkage. Two types of core are investigated: the semantic core and the syntactic core. The semantic core approach exploits the idea that names in Dutch, especially surnames, have meaning: a surname is usually based on a first name, a location, a profession or a personal characteristic. The semantic component is subject to heavy and unpredictable modifications due to suffixes, inflections and compounding; suffix-removal techniques are therefore less successful for names than for standard vocabulary. The semantic component itself is, however, relatively stable, and the set of semantic categories is reasonably restricted. Therefore, a word-list approach can be applied to names, which avoids learning or designing complex suffix-removal rules. An alternative approach is to extract the syntactic core of a name. The syntactic core is the (possibly discontinuous) character sequence that remains constant or phonetically equivalent in all variants of a name. The syntactic core can be analysed on various linguistic levels. An advantage of the syntactic approach is that a word list is not needed, and therefore the procedure can also be applied to unknown names or names without a vocabulary-based meaning component (such as first names). Algorithms for extracting semantic and syntactic cores are discussed, and an application is provided for the problem of record linkage in data mining.
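A minimal sketch of the syntactic-core idea, approximated here as the longest (possibly discontinuous) character subsequence shared by all attested variants; the algorithms discussed in the talk may differ, e.g. in handling phonetic equivalence, and pairwise reduction is only a heuristic for more than two variants:

    from functools import reduce

    def lcs(a, b):
        """Longest common subsequence of two strings (dynamic programming)."""
        table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a):
            for j, cb in enumerate(b):
                table[i + 1][j + 1] = (table[i][j] + ca if ca == cb else
                                       max(table[i][j + 1], table[i + 1][j], key=len))
        return table[-1][-1]

    def syntactic_core(variants):
        """Character material that survives in every spelling variant."""
        return reduce(lcs, variants)

    print(syntactic_core(["Janssen", "Jansen", "Janszen"]))  # -> Jansen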

Corresponding author: schraage@liacs.nl


Extraction of Historical Events from Unstructured Texts
Segers, Roxane and van Erp, Marieke and van der Meij, Lourens
VU University Amsterdam

Historiography revolves around events, as these express important points of change in reality. We postulate that events can play an important role in improving automated search and data integration in the historical domain, as events connect information about who did what, where and when.

We present a pattern-based approach to automatically extract historical named events like "French Revolution" and "Second World War" from unstructured texts in Dutch. The extracted events are the backbone of a structured event thesaurus that will consist of events with their time, place and participants.

In our approach we make a distinction between external and internal event patterns. For collecting external event patterns like 'during the', we retrieved text snippets for a number of seed events. We ranked the pattern candidates by their frequency and co-occurrence with different events. Next, we ran the pattern collection over a domain-specific corpus. We evaluated the precision of the extracted historical event candidates by the number of patterns that extracted the event and the confidence score of these patterns.

The extracted events were used as input for obtaining event-internal patterns. We classified and analysed the events based on their morpho-syntactic structure: this yielded patterns such as "Massacre of Y". To expand these patterns, we used the heads of the events to query WordNet: this yielded new internal patterns such as "Bloodbath of Y".

As a result we obtained a library of external and internal patterns that can be used to extract named events from unstructured texts. The combination of internal and external patterns is vital, as the combined library outperforms each pattern type on its own.
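A minimal sketch of the external-pattern ranking step, assuming patterns are the words immediately preceding a seed event and are scored by frequency and by how many distinct seeds they co-occur with (the exact scoring in the paper may differ):

    from collections import Counter

    def rank_patterns(snippets):
        """snippets: (text, seed_event) pairs retrieved for the seed events."""
        freq, seeds_seen = Counter(), {}
        for text, seed in snippets:
            left = text.split(seed)[0].split()
            pattern = " ".join(left[-2:])        # e.g. "during the"
            if pattern:
                freq[pattern] += 1
                seeds_seen.setdefault(pattern, set()).add(seed)
        # Prefer patterns that are frequent AND generalise across seeds.
        return sorted(freq, key=lambda p: (len(seeds_seen[p]), freq[p]),
                      reverse=True)

    snippets = [("riots during the French Revolution", "French Revolution"),
                ("losses during the Second World War", "Second World War")]
    print(rank_patterns(snippets))  # -> ['during the']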

Corresponding author: rh.segers@vu.nl


Finding Statistically Motivated Features Influencing Subtree Alignment Performance
Kotzé, Gideon
University of Groningen

We present results of an ongoing investigation involving a manually aligned parallel treebank and an automatic tree aligner. Using the parallel treebank as a test set, we establish features that show a significant correlation with alignment performance. Our conclusion is that lexical features generally have a more significant influence than tree features. We present these findings with a discussion of their significance and with reference to possible applications in the alignment of parallel texts for machine translation.

Corresponding author: g.j.kotze@rug.nl


From Tokens to Text Entities: Line-based Parsing of Resumes using Conditional Random Fields
Rotaru, Mihai
Textkernel NL

Resumes (curricula vitae) form a challenging category of semi-structured documents. Regardless of the language, most resumes tend to be structured in sections: e.g. experience, education, skills, personal. Consequently, the first task of a resume information extraction system is to segment the resume into sections. We cast the section segmentation problem as a sequence labeling problem. In this paper, we show practical results that compare two approaches. The first approach works at the word level and uses Hidden Markov Models (HMMs) with words as the HMM observations. The second approach works at the line level and uses Conditional Random Fields (CRFs) and a variety of features computed for each line. We find that the CRF approach significantly outperforms the HMM approach on this real-world task, and that the improvement is also reflected in the later stages of our resume information extraction pipeline. In addition, this result generalizes across several languages after porting the corresponding CRF features. The main advantages of the CRF approach are the expressiveness of the features (e.g. they easily express information that spans multiple words) and the fact that it makes the practical assumption that a line of text belongs to a single section.
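A minimal sketch of line-level features for such a labeller, shown here with the open-source sklearn-crfsuite package rather than the software used in the paper; the feature names, and the assumed inputs resumes and labels, are illustrative:

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    def line_features(line):
        """Per-line features; expressing information that spans several
        words is exactly what the line-level view makes easy."""
        tokens = line.split()
        return {
            "n_tokens": len(tokens),
            "all_caps": line.isupper(),
            "has_year": any(t.isdigit() and len(t) == 4 for t in tokens),
            "has_email": "@" in line,
            "first_word": tokens[0].lower() if tokens else "",
        }

    # Assumed given: resumes (list of strings) and labels (one section
    # label per line per resume, e.g. experience/education/skills).
    X = [[line_features(l) for l in r.splitlines()] for r in resumes]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, labels)
    sections = crf.predict(X)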

Corresponding author: rotaru@textkernel.nl


Im chattin :-) u wanna NLP it: Analyzing Reduction in Chat
van Halteren, Hans (1) and Martell, Craig (2) and Du, Caixia (3) and Gu, Yan (3) and Kobben, Johan (3) and Panjaitan, Leequisach (3) and Schubotz, Louise (3) and Vasylenko, Kateryna (4)
(1) Radboud University Nijmegen
(2) Naval Postgraduate School, Monterey
(3) ReMa L&C, RUN/UvT
(4) ReMa L&C

Modern NLP research attempts to cover the whole spectrum from written to spoken text. Right in the middle we find chat text, a written text type which has many similarities with spoken text. One of these is spelling variation, often reduction, e.g. nite instead of night. It is clear that, if we ever want to analyze or generate chat text, we have to understand the factors behind this spelling behavior, be it user experience with SMS, peer-group identification through speech-like spelling, or something else.

This paper contributes by studying spelling reduction in chat text. We investigated cases of reduction in the NPS Chat Corpus. After identifying various types in 2000 posts from the publicly available part of the corpus, we focused on four frequent phenomena: a) wanna (want to) and gonna (going to); b) ya and u (you); c) g-dropping in present participles, e.g. findin for finding; and d) apostrophe dropping in enclitics, e.g. hes for he's. For these, we automatically extracted all occurrences of both reduced and full forms in 1 million words from the complete corpus. For each, we also determined features which could influence the choice between the alternating forms, such as the poster's age group and the immediate context in the post. On the basis of this, we built regression models to find out which of the features show a significant influence.
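A minimal sketch of such a regression (with statsmodels, our choice here; the toy data and predictors are purely illustrative):

    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per occurrence of an alternating form, e.g. "u" vs "you".
    df = pd.DataFrame({
        "reduced":   [1, 0, 1, 0, 1, 0, 0, 1],
        "age_group": ["teens", "teens", "teens", "20s",
                      "20s", "20s", "30s", "30s"],
        "post_len":  [4, 12, 5, 6, 14, 9, 15, 7],
    })
    model = smf.logit("reduced ~ C(age_group) + post_len", data=df).fit()
    print(model.summary())  # which predictors significantly affect reduction?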

In the paper, we present the main findings and relate them to those identified in the literature as being active in spoken text.

Corresponding author: hvh@let.ru.nl


Language Evolution and SA-OT: The case of sentential negation
Lopopolo, Alessandro and Biro, Tamas
University of Amsterdam

Simulated Annealing Optimality Theory (SA-OT) is a recent extension of the OT framework: it adds a model of performance to a theory of linguistic competence. Our aim is to show how SA-OT can be useful for language evolution simulations. Performance error is a central concept in this model, and it is considered to be one of the causes of variation and evolution. In performance, speakers accept sacrificing precision in order to enhance communicative strength, and the performance errors influence the language learning of the next generation.

To test the potential of SA-OT, we have chosen to model the evolution of sentential negation. The background is Jespersen's Cycle (JC), in which the evolution of sentential negation follows three stages (1. pre-verbal, 2. discontinuous, and 3. post-verbal). Our starting point is the treatment of JC by de Swart (2010) in terms of traditional OT. Her model predicts six stages: the three above-mentioned pure stages, as well as three intermediate, mixed stages. Yet there are no convincing empirical data for an intermediate stage between stages 1 and 3.

Therefore, we advance a novel computational model for JC, based on SA-OT. It reproduces the three pure and the two observed mixed stages, and correctly predicts the lack of an intermediate stage between 1 and 3. This result makes different predictions for the evolution of sentential negation, and confirms the validity of SA-OT as a computational model for language evolution.

Corresponding author: A.Lopopolo@student.uva.nl


Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus
Schrauwen, Sarah and Daelemans, Walter
CLiPS, Antwerp University

Sentiment analysis deals with the computational treatment of opinion, sentiment and subjectivity. We constructed and manually annotated a corpus, the Dutch Netlog Corpus, with data extracted from the social networking website Netlog. The corpus was annotated on three levels: 'valence' (expressing the opinion of the writer; we distinguish between 'positive', 'negative', 'both', 'neutral' and 'n/a'), and additionally language performance, which is divided into two areas: 'performance' ('standard', 'dialect' and 'n/a') and 'chat' ('chat', 'non-chat' and 'n/a'). We tackle sentiment analysis as a text classification task and employ two simple feature sets (the most frequent and the most informative words of the corpus) and three supervised classifiers implemented in the Natural Language Toolkit (the Naïve Bayes, Maximum Entropy and Decision Tree classifiers). The highest accuracy obtained for valence classification on the entire data set is 65.1%.

We suggest three factors leading to errors in valence classification. First, the nature of the data affects results, since most of the corpus consists of dialect and chat language, which is more difficult to predict. Second, the number of classes to predict is larger for valence classification (five classes) than for performance or chat classification (three classes), which makes the task harder. Third, the skewed class distribution of the corpus probably has the biggest influence on the results. We suspect that more training data will alleviate these problems.
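A minimal sketch of the Naïve Bayes variant of this setup in NLTK (the tiny feature set and toy Dutch posts are illustrative, not the corpus itself):

    import nltk

    def features(post, vocabulary):
        """Word-presence features over a fixed vocabulary of frequent words."""
        words = set(post.lower().split())
        return {w: (w in words) for w in vocabulary}

    train = [("wat een leuke foto", "positive"),
             ("dit is echt slecht", "negative"),
             ("gewoon een dagje thuis", "neutral")]
    vocabulary = ["leuke", "slecht", "echt", "foto"]
    train_set = [(features(t, vocabulary), label) for t, label in train]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(features("echt een leuke dag", vocabulary)))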

Corresponding author: sarah.schrauwen@gmail.com


Measuring the Impact of Controlled Language on Machine Translation Text via Readability and Comprehensibility
Doherty, Stephen
Centre for Next Generation Localisation, Dublin

This paper describes a recent study of the readability and comprehensibility of English software documentation which has been translated into French by Matrex, a state-of-the-art (phrase-based statistical) machine translation system. The primary aim of the study is to examine what effects, if any, the application of controlled language (CL) rules to the source language texts has on the readability and comprehensibility of the machine translation output. Our hypothesis is that the application of CL rules results in an observable increase in the readability and comprehensibility of the target language text.

Our approach consisted of a three-pronged evaluation of the texts by means of (i) readability indices in both the source and target languages; (ii) an eye-tracking measurement of readability; and (iii) a post-task qualitative measurement of comprehensibility, using recall and Likert-scale human evaluations. We also looked at correlations between automatic machine translation evaluation metrics (e.g. BLEU, GTM) and the evaluation results mentioned above, in an attempt to bridge the gap between human and automatic approaches to evaluation.

The paper will first describe some background and context in the relevant research areas, followed by a presentation of the methods employed, with a particular focus on the measurement of readability via eye tracking and tentative results in this regard.
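As an illustration of the indices in step (i), one widely used readability formula (our example; the paper does not specify which indices it uses) is the Flesch Reading Ease score,

    \mathrm{FRE} = 206.835 - 1.015 \cdot \frac{\#\,\text{words}}{\#\,\text{sentences}} - 84.6 \cdot \frac{\#\,\text{syllables}}{\#\,\text{words}}

where higher scores indicate easier text; adaptations with rescaled coefficients exist for French (e.g. Kandel and Moles).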

Corresponding author: stephen.doherty2@mail.dcu.ie


Memory-based text completion
van den Bosch, Antal
Tilburg University

The commonly accepted technology for fast and efficient word completion is the prefix tree, or trie. As a word is keyed in, the trie can be queried for unicity points and best guesses. We present three improvements over the normal prefix trie in experiments in which we measure the percentage of keypresses saved on both in-domain and out-of-domain test text, emulating a perfectly alert user who selects correct suggestions promptly. First, we train a suffix trie that tests backwards from the most recent keypresses. Conditioned on first letters, the suffix trie model yields about 10% more saved keypresses than the baseline character-saving percentage on in-domain test data. Second, the suffix trie model can be straightforwardly extended to test on characters of previous words. Adding this context yields another 10% increase in character savings. Third, when we train the context-rich suffix trie model to complete the current word and predict the next one in one go, character savings go up another 4%. In a learning experiment on Dutch texts we observe character savings of up to 44% on in-domain test data, where the baseline prefix trie savings percentage is 19%. On out-of-domain Twitter data, the prefix trie baseline of 19% is only mildly surpassed by the suffix trie variants, at 24% character savings. We develop an explanation for the spectacular success of the suffix trie approach on in-domain data, and review the applicability of the approach in real-world text entry contexts.
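A minimal sketch of the baseline prefix trie with its unicity-point completion (simplified: no end-of-word markers and no frequency-based best guesses; the suffix-trie variants reverse the lookup direction and add context):

    class TrieNode:
        def __init__(self):
            self.children = {}

    def build_trie(words):
        root = TrieNode()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
        return root

    def complete(root, prefix):
        """Follow the prefix, then extend while the continuation is unique
        (the 'unicity point' logic of word completion)."""
        node = root
        for ch in prefix:
            if ch not in node.children:
                return prefix          # unseen prefix: nothing to suggest
            node = node.children[ch]
        out = prefix
        while len(node.children) == 1:
            ch, node = next(iter(node.children.items()))
            out += ch
        return out

    trie = build_trie(["keypress", "keyboard", "keypad"])
    print(complete(trie, "keyb"))      # -> "keyboard": 5 keypresses saved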

Corresponding author: Antal.vdnBosch@uvt.nl


Overlap-based Phrase Alignment for Language Transformation
Wubben, Sander and van den Bosch, Antal and Krahmer, Emiel
Tilburg University

In this talk we present our work on the task of paraphrasing from an old variant of a language to a modern variant. One of the tasks we consider is paraphrasing the Canterbury Tales from Middle English to Modern English. We approach this as a translation task and therefore use Machine Translation techniques. Current state-of-the-art Machine Translation systems rely heavily on statistical word alignment. The alignment package most commonly used is GIZA++, which is used to train IBM Models 1 to 5 and an HMM word alignment model. The benefit of using statistical word alignment is that no assumptions need to be made about the parallel corpus and that it generally produces better results when fed more data. This holds for the task of paraphrasing as well. However, when we consider monolingual parallel corpora, it might be naive to use only statistics when we can in fact exploit the fact that both sides of the corpus are in the same (or at least a similar) language, and therefore likely to exhibit a certain amount of overlap. We investigate the feasibility of using overlap measures to align words and phrases in monolingual corpora, and how this method holds up against purely statistical alignment in a Machine Translation framework.
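A minimal sketch of one possible overlap measure between candidate word pairs (difflib's ratio is our illustrative choice; the talk investigates which measures actually work best):

    from difflib import SequenceMatcher

    def overlap(a, b):
        """Crude string overlap in [0, 1] between a Middle English word
        and a Modern English candidate."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    print(overlap("nyght", "night"))   # 0.8: plausible alignment
    print(overlap("nyght", "tales"))   # low: implausible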

Corresponding author: s.wubben@uvt.nl


Parse and Tag Somali Pirates
van Erp, Marieke and Malaisé, Véronique and van Hage, Willem and Osinga, Vincent and Coleto, Juan Manuel
VU University Amsterdam

Events are the most prevalent complex entities described in user-contributed social network activities, newswire, commercial infringement reports, etc. Unfortunately, due to the nature of free text, event descriptions can take many forms, making querying for or reasoning over them difficult.

We present an approach for event extraction from piracy attack reports issued by the International Chamber of Commerce (ICC-CCS [1]). As the piracy attack reports are semi-structured, we can treat the extraction task as a segmentation and labelling problem. We extract information from the reports about participants, weapons, locations, times and types of events, and store the information as structured event instances. We argue that an event model is not only an intuitive representation for such information, enabling automatic analysis of and reasoning over the attacks and their components, but also a very powerful tool for knowledge and data integration. We show that the event model enables automatic analysis of the data, so that questions such as "How did the weapon use of pirates evolve over time?" can be answered.

[1] http://www.icc-ccs.org

Corresponding author: marieke@cs.vu.nl


Personalized Knowledge Discovery: Combining Social Media and Domain Ontologies
Markus, Thomas and Westerhout, Eline and Monachesi, Paola
Utrecht University

We present a system that facilitates knowledge discovery by means of structured domain ontologies. The user can discover new concepts and relations by exploring an expert-approved ontological structure which has been automatically enriched with new concepts, relations and lexicalisations originating from social media. The system also interlinks, on the fly, the conceptual knowledge in the ontology with noisy data coming from social media at the conceptual level.

Our ontology enrichment methodology identifies salient terms using similarity measures and determines the appropriate word sense for each term by employing a disambiguation algorithm. The appropriate relation between the new concept (word sense) and the existing ones is extracted either from DBpedia or from text documents retrieved from the web. The disambiguation algorithm is also used to store the original context of each term, that is, the term itself, its meaning, and the associated person and resource. These personalised contexts are stored using the MOAT semantic vocabulary.

The enriched ontology and the disambiguation methodology allow us to give a personalised semantic interpretation to each search result in the context of the enriched domain ontology and the user. The amount of conceptual overlap between a document and the person using the system is employed to offer personalised recommendations of documents.

The advantages that this approach brings to students have been evaluated as part of a university course with a large group of students and a separate control group.

Corresponding author: Thomas.Markus@phil.uu.nl


Recent Advances in Memory-Based Machine Translation
van Gompel, Maarten and van den Bosch, Antal and Berck, Peter
Tilburg University

We present advances in research on Memory-based Machine Translation (MBMT), a form of machine translation in which the translation model takes the form of approximate k-nearest neighbour classifiers. These classifiers are trained to map words or phrases in context to a target word or phrase. The modelling of source-side context is a key feature distinguishing this approach from standard Statistical Machine Translation (SMT).
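A minimal sketch of the core classification idea: a toy 1-nearest-neighbour mapping of a source word in its local context to a target word (an illustration of the principle, not the PBMBMT implementation):

    def knn_translate(instance, memory):
        """instance and examples: (left word, source word, right word) tuples;
        pick the target of the most similar stored example."""
        def sim(a, b):
            return sum(x == y for x, y in zip(a, b))
        example, target = max(memory, key=lambda ex: sim(ex[0], instance))
        return target

    memory = [  # ((left, source, right), target) pairs from aligned data
        (("ik", "zie", "de"), "see"),
        (("wij", "zien", "de"), "see"),
        (("de", "zee", "en"), "sea"),
    ]
    print(knn_translate(("jij", "zie", "de"), memory))  # -> see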

In 2010 we released the open source PBMBMT (phrase-based memory-based machine translation) system. PBMBMT embraces the concept of phrases, as opposed to the single words or fixed n-grams that earlier work in memory-based machine translation focused on. PBMBMT employs a phrase translation table generated by Moses as the basis for the generation of training and test instances for our classifiers. We present an automatic method for hyperparameter optimisation, and investigate the use of example weighting in the memory-based classifier. We critically measure and compare the performance of our latest system against its precursor systems and a state-of-the-art competitor.

A recent branch of research has focused on the language model component of PBMBMT. As PBMBMT can work with both the well-known SRILM software and WOPR, the memory-based language model, we performed a learning curve experiment with both language models to investigate the effect of the amount of training data. Our results challenge the common "more data is better" belief.

Corresponding author: proycon@anaproy.nl


Reversible stochastic attribute-value grammars
de Kok, Daniël and van Noord, Gertjan and Plank, Barbara
University of Groningen

Attribute-value grammars have been advocated because they are reversible: their declarative nature ensures that the same grammar can in principle be used for parsing and generation.

In recent years, attribute-value grammars have been extended with conditional models to perform parse disambiguation and fluency ranking. However, since such models are conditioned on a sentence or a logical form, reversibility is sacrificed.

We propose a framework for reversible stochastic attribute-value grammars. In this framework, a single statistical model is used for both parse disambiguation and fluency ranking. We argue that this framework is more appropriate, since it recognizes that preferences are shared between the production and comprehension components. For instance, if fluency ranking and disambiguation had different preferences with respect to subject fronting in Dutch, communication would become problematic.
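The contrast can be made concrete with a schematic log-linear rendering (our notation, not necessarily the authors'): directional models use two separately trained weight vectors, whereas a reversible model shares one vector of weights \lambda across both tasks,

    p(t \mid s) = \frac{\exp\bigl(\sum_i \lambda_i f_i(s, t)\bigr)}{Z(s)} \quad \text{(parse disambiguation)}, \qquad
    p(s \mid t) = \frac{\exp\bigl(\sum_i \lambda_i f_i(s, t)\bigr)}{Z(t)} \quad \text{(fluency ranking)},

with s a sentence, t a parse or logical form, f_i features shared over (s, t) pairs, and Z the direction-specific normalisation.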

We provide experimental results which show that the performance of a reversible model does not differ significantly from that of directional models for parse disambiguation and fluency ranking. We also show that fluency ranking models can be improved by adding annotated parse disambiguation training data, and vice versa.

Corresponding author: d.j.a.de.kok@rug.nl


Robust Rhymes? The Stability of Authorial Style in Medieval Narratives
Kestemont, Mike and Daelemans, Walter and Sandra, Dominiek
CLiPS, University of Antwerp

We explore the application of stylometric methods developed for modern texts to rhymed medieval narratives (Jacob of Maerlant and Lodewijk of Velthem, ca. 1260-1330). Because of the peculiarities of medieval text transmission, we propose to use highly frequent rhyme words for authorship attribution. First, we demonstrate that these offer important benefits, being relatively content-independent and well spread over texts. Subsequent experimentation shows that correspondence analyses can indeed detect authorial differences using highly frequent rhyme words. Finally, we demonstrate for Maerlant's oeuvre that the stylistic stability of these highly frequent rhyme words should not be exaggerated, since their distribution correlates significantly with the internal structure of that oeuvre.

Corresponding author: mike.kestemont@ua.ac.be


Rule Induction for Synchronous Tree-Substitution Grammars in Machine Translation
Vandeghinste, Vincent and Martens, Scott
Centrum voor Computerlinguïstiek, K.U.Leuven

Data-driven machine translation systems are evolving from string-based towards tree-based systems, such as the PaCo-MT system. In this system, the source language sentence is parsed using a monolingual parser. This parse tree then needs to be converted or transduced into one or more target language parse trees, from which one or more target language sentences can be generated.

Rules are induced from phrasal alignments in an automatically parsed version of the English and Dutch portions of the Europarl treebank. The extraction procedure assumes that the subtrees bounded by alignments between phrasal nodes in the two syntactic tree structures are suitable as rules for a Synchronous Tree Substitution Grammar. The maximum number of such trees is extracted, given the alignments between the two sentences, and collected over the entire corpus. Rules that occur multiple times are inserted into the transducer as tree substitution rules. Minimally small tree substitution rules (those consisting of a single node and its parent) are used to induce translations where the extracted rules have insufficient coverage.

Corresponding author: vincent@ccl.kuleuven.be


Search in the Lassy Small Corpus
van Noord, Gertjan and de Kok, Daniël and van der Linde, Jelmer
University of Groningen

A few months ago, the STEVIN Lassy project yielded its most important results: Lassy Small, a corpus of 1 million words with syntactic annotations that have been manually verified and corrected, and Lassy Large, a corpus of 1.5 billion words with automatically assigned syntactic structures. The syntactic annotations include part-of-speech tags, lemmas and dependency annotations of the type developed earlier in CGN and D-Coi.

In this presentation we focus on the Lassy Small corpus, and introduce a stand-alone portable tool called DACT which can be used to browse the syntactic annotations in an attractive graphical form, and to search for sentences according to a number of search criteria, which can be specified elegantly by means of queries formulated in XPath, the W3C standard query language for XML documents. We provide a number of linguistically relevant examples of such queries, and we review the criticism of Lai and Bird (2010), which they take as motivation to introduce LPath, an extension of XPath. We will argue that such an extension is not required if string positions are explicitly encoded as XML attributes, as is the case in Lassy Small.
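A minimal sketch of running such a query outside DACT, assuming the Alpino/Lassy XML format in which syntactic nodes are <node> elements carrying rel, cat, begin and end attributes (the query itself, NP subjects, is just one linguistically relevant example):

    from lxml import etree

    tree = etree.parse("example.xml")  # one Lassy Small annotation file
    # All noun-phrase subjects: nodes with dependency relation 'su'
    # and category 'np'.
    for node in tree.xpath('//node[@rel="su" and @cat="np"]'):
        print(node.get("begin"), node.get("end"))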

DACT is freely available for various platforms, including Mac OS and recent versions of Windows.

Corresponding author: g.j.m.van.noord@rug.nl


Simple Measures of Domain Similarity for Parsing
Plank, Barbara and van Noord, Gertjan
University of Groningen

It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data for parsing a given domain is data that matches that domain (Sekine 1997, Gildea 2001). Hence, an important task is to select appropriate training domains. However, most previous work on domain adaptation has relied on the implicit assumption that domains are somehow given.

With the growth of the web, more and more data is becoming available, and automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. We consider various ways to automatically acquire related training data for a given test article, and compare automatic measures to human-annotated meta-data. The results show that a very simple measure of similarity based on word frequencies works surprisingly well.
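A minimal sketch of one such measure, instantiated here as the cosine between relative word-frequency distributions (an illustrative choice, not necessarily the paper's exact measure; article and corpora are assumed inputs):

    import math
    from collections import Counter

    def freq_vector(text):
        """Relative word frequencies of a text."""
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def cosine(p, q):
        dot = sum(v * q[w] for w, v in p.items() if w in q)
        norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm(p) * norm(q))

    # Assumed given: article (str) and corpora (dict of name -> str).
    ranked = sorted(corpora,
                    key=lambda name: cosine(freq_vector(article),
                                            freq_vector(corpora[name])),
                    reverse=True)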

Corresponding author: b.plank@rug.nl


SSLD: A smart tool for SMS compression
Cougnon, Louise-Amélie and Beaufort, Richard
UCLouvain - IL&C - Cental

Since 2009, we have been designing a methodology to semi-automatically develop a dictionary based on a corpus of SMS. Such a dictionary can be used to help systems translate from standard language into SMS language, a procedure which has so far been seen as an entertaining activity; our methodology can, however, also be employed for more serious purposes such as text message summarising and compression tools. Our first results were encouraging (Cougnon and Beaufort, 2010) but focused only on French data (from Belgium, Quebec, Switzerland and La Réunion). Thanks to the sms4science project, which aims at collecting SMS corpora from all over the world, we now have German, Dutch and Italian text messages at our disposal. The aim of this paper is to describe our three-step approach to the extraction of dictionary entries from the various corpora and to detail the smart manual sorting performed on the dictionary. The results will give us the opportunity to test our initially French-based methodology on other languages and to find out whether our approach is generic, i.e. applicable to all languages. This question also paves the way for a panorama of the SMS phenomena observed in the dictionary that recur across the languages. Finally, we propose ways in which our methodology could be further improved.

Corresponding author: louise-amelie.cougnon@uclouvain.be


Subtrees as a new type of context in Word Space Models
Smets, Margaux and Speelman, Dirk and Geeraerts, Dirk
QLVL, K.U.Leuven

In Word Space Models (WSMs) there are traditionally two types of contexts that can be used: (i) lexical co-occurrences ('bag-of-words models') and (ii) syntactic dependencies. In general, models with the second type of context seem to perform better. However, there are some problems with these models. In the first place, a choice has to be made as to which contexts to include: only subject/verb and verb/object relations, or also other dependencies. Second, in contrast with bag-of-words models, the syntactic models are supervised: they require quite large resources (a dependency parser, a manually annotated corpus, etc.), which might not be available for every language.

The contexts we propose for use in WSMs are subtrees as defined in the framework of Data-Oriented Parsing. Subtrees can capture both bag-of-words (co-occurrence) information and syntactic information. Moreover, they are not limited to specific types of dependencies, but rather take entire structures into account.

At first sight, it might seem that the resource problem of dependency-based WSMs remains in this framework. After all, we first need the 'correct' tree for a sentence before we can extract subtrees from it. However, in our experiments we show how the entire algorithm can be made unsupervised by using an unsupervised parser as a preprocessing step.

In the presentation, I will first discuss in detail the workings of this new type of WSM. Next, I will present some initial results from experiments with parameters such as the accuracy of the parser in the preprocessing step, the maximum subtree depth, the minimum subtree frequency, and considering only subtrees with the highest variance.

Corresponding author: margauxsmets@gmail.com


Successful extraction of opposites by means of textual patterns with part-of-speech information only
Lobanova, Anna
Department of Artificial Intelligence, University of Groningen

We present an automatic method for the extraction of opposites (e.g., rich - poor, top - bottom, buy - sell) by means of textual patterns that contain only part-of-speech information about the target word pairs, e.g., 'difference between <noun> and <noun>'. Our preliminary results suggest that this method outperforms a pattern-based method that uses dependency patterns [2] (requiring more sophisticated data preprocessing), especially for opposites expressed by nouns and verbs.

Starting with small seed sets, we automatically acquired textual patterns from a 450-million-word version of the Twente Nieuws Corpus of Dutch [4]. All patterns were automatically evaluated based on their overall frequency and the number of times they contained seed pairs. The best patterns were used to find candidate pairs. All found pairs were automatically scored based on their frequency and co-occurrence in reliable patterns. In addition, the highest-scoring pairs (score >= 0.9) were evaluated by two human judges. The precision scores for the top-100 found pairs were 0.61 for adjective-adjective pairs, 0.63 for noun-noun pairs and 0.52 for verb-verb pairs. When more pairs were considered, the precision was still higher than that reported in previous studies: for the top-500 pairs, the precision was 0.42 for adjective-adjective pairs, 0.33 for noun-noun pairs and 0.49 for verb-verb pairs.

This method needs fewer pre-processing steps than dependency patterns and can easily be applied to vast data collections. The results can benefit many NLP applications, including the augmentation of computational lexical resources, contrast identification [3,5], and the detection of paraphrases and contradictions [1].
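A minimal sketch of applying one such pattern to POS-tagged text; the toy Dutch fragment means 'the difference between top and bottom', and the tags and pattern are illustrative:

    def match_pattern(tagged, trigger, slot_tag="NOUN"):
        """Find X/Y pairs around a trigger word sequence where both slots
        carry the required tag. tagged is a list of (word, tag)."""
        pairs, k = [], len(trigger)
        for i in range(1, len(tagged) - k):
            if ([w for w, _ in tagged[i:i + k]] == trigger
                    and tagged[i - 1][1] == slot_tag
                    and tagged[i + k][1] == slot_tag):
                pairs.append((tagged[i - 1][0], tagged[i + k][0]))
        return pairs

    tagged = [("het", "DET"), ("verschil", "NOUN"), ("tussen", "PREP"),
              ("top", "NOUN"), ("en", "CONJ"), ("bodem", "NOUN")]
    print(match_pattern(tagged, ["en"]))  # -> [('top', 'bodem')]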

Corresponding author: a.lobanova@ai.rug.nl


Syntactic Analysis of Dutch via Web Services
Tjong Kim Sang, Erik
University of Groningen

Alpino is a general-purpose syntactic parser for Dutch sentences. At the moment, using the parser requires installing the parser software on a local machine. In the CLARIN project TTNWW, we are developing a web service interface to the parser which will simplify access for future users. The service provides access for client software via standard protocols like SOAP, and exchanges XML-encoded text data between the client machine and the server where the parser is run. In this presentation, we present the current status of this project.

Corresponding author: erikt@xs4all.nl


Technology recycling between Dutch and Afrikaans
Augustinus, Liesbeth (1) and van Huyssteen, Gerhard (2) and Pilon, Suléne (3)
(1) Centre for Computational Linguistics (CCL), K.U.Leuven
(2) Centre for Text Technology (CTexT), North-West University, Potchefstroom, South Africa
(3) School for Languages, North-West University, Vanderbijlpark, South Africa

Resource development for resource-scarce languages can be fast-tracked by recycling existing technologies for closely related languages. The main issue dealt with here is the recycling of Dutch technologies for Afrikaans. The possibilities of technology transfer are investigated by focusing on the D2AC-A2DC project. After exploring the architecture and functioning of D2AC, a Dutch-to-Afrikaans convertor, attention turns to the development and performance of A2DC, an Afrikaans-to-Dutch convertor. The latter tool is then used to improve the annotation of Afrikaans text with Dutch technologies. In particular, the performance of part-of-speech tagging and chunking is considered. The accuracies of both tagger and chunker improve significantly if the data are first converted with A2DC before being sent through the tools for Dutch analysis.

Corresponding author: liesbeth@ccl.kuleuven.be


Technology recycling for closely related languages: Dutch and Afrikaans
Pilon, Suléne (1) and Van Huyssteen, Gerhard (2)
(1) North-West University (VTC)
(2) North-West University (PC)

If two languages (L1 and L2) are similar enough, the development of technologies for L2 can be expedited by recycling existing L1 resources. This process is called technology recycling, and its success is greatly dependent on the degree of similarity between the two languages in question. Other strategies can, however, be employed to improve the efficiency of L1 technologies on L2 data, and in this research we experiment with one such strategy, viz. lexical conversion as a pre-processing step. We explore the possibility of using rule-based lexical conversion to improve the accuracy of Dutch technologies when annotating Afrikaans data. The rationale is that Dutch technologies should perform better on Afrikaans data that appears more Dutch-like, even if the conversion does not yield a good Dutch translation. To do the lexical conversion, we developed an Afrikaans-to-Dutch convertor (A2DC) which obtains an accuracy of more than 72% when converting Afrikaans words to Dutch. For our experiment we use a state-of-the-art Dutch POS tagger and parser to annotate raw Afrikaans data. The same data is then converted with A2DC and once again annotated with the Dutch technologies. In both experiments the conversion has a notably positive effect on the performance of the Dutch technologies. The biggest difference is observed in the POS tagging task, with the overall accuracy increasing from 62.6% when annotating raw Afrikaans data to 80.6% when annotating converted data, while the parsing f-score improves from 0.44 (raw data) to more than 0.68 (converted data).

Corresponding author: sulene.pilon@nwu.ac.za


The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation
Theijssen, Daphne and van Halteren, Hans and Boves, Lou and Oostdijk, Nelleke
Radboud University Nijmegen

In the dative alternation in English, speakers and writers choose between the prepositional dative construction ('I gave the ball to him') and the double object construction ('I gave him the ball'). Logistic regression models have already been shown to be able to predict over 90% of the choices correctly (e.g. Bresnan et al. 2007).

Collecting dative instances from a corpus and encoding them with the required information is a costly procedure. We therefore developed a semi-automatic approach, consisting of three steps: (1) automatically extracting dative candidates, (2) manually approving or rejecting these candidates, and (3) automatically annotating the approved candidates with the required information. The resulting data sets are noisier than data sets that have been checked completely manually, but the approach can yield much larger data sets.

We compare the effect of data set size and noisiness on the accuracy of predicting the dative alternation. We employ a 'manual' set of 2,877 instances in spoken English, taken from Switchboard (Godfrey et al. 1992) by Bresnan et al. (2007) and from ICE-GB (Greenbaum 1996) by Theijssen (2010). In addition, we use a 'semi-automatic' set of 7,755 instances from Switchboard, ICE-GB and the BNC (BNC Consortium 2007). We compare the learning curves of various machine learning algorithms by randomly selecting subsets of the data and extending them with 500 instances each time. We do this for different levels of noisiness, i.e. varying the proportion of 'semi-automatic' instances (0%, 25%, 50%, 75%, 100%). The results are presented at the conference.
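A minimal sketch of the learning-curve setup with scikit-learn (illustrative; the paper compares several learners, and X and y are assumed to be the encoded instances and construction choices):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Assumed given: X (n x d feature matrix) and y (0/1 construction choice),
    # pooled from the 'manual' and 'semi-automatic' instance sets.
    rng = np.random.default_rng(0)
    for size in range(500, len(y) + 1, 500):
        idx = rng.choice(len(y), size=size, replace=False)  # random subset
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X[idx], y[idx], cv=10).mean()
        print(size, round(acc, 3))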

References

BNC Consortium (2007). The British National Corpus, version 3 (BNC XML Edition). Oxford University Computing Services.

Bresnan, Joan, Anna Cueni, Tatiana Nikitina and R. Harald Baayen (2007). Predicting the Dative Alternation. In Bouma, Gerlof, Irene Kraemer and Joost Zwarts (eds.), Cognitive Foundations of Interpretation, Royal Netherlands Academy of Science, Amsterdam, pp. 69-94.

Godfrey, John J., Edward C. Holliman and Jane McDaniel (1992). Switchboard: Telephone speech corpus for research and development. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), pp. 517-520.

Greenbaum, Sidney (ed.) (1996). Comparing English Worldwide: The International Corpus of English. Oxford, U.K.

Theijssen, Daphne (2010). Variable selection in Logistic Regression: The British English dative alternation. In Icard, Thomas and Reinhard Muskens (eds.), Interfaces: Explorations in Logic, Language and Computation. Lecture Notes in Computer Science (subseries Lecture Notes in Artificial Intelligence), volume 6211, Springer.

Corresponding author: d.theijssen@let.ru.nl


The use of structure discovery methods to detect syntactic change
ten Bosch, Louis and Versteegh, Maarten
Radboud University Nijmegen

A well-known problem in linguistics concerns the description and analysis of diachronic changes in syntactic constructions. In Western European languages, such changes have occurred a number of times over the last few centuries. In this presentation we give an overview of quantitative methods for analyzing historical text corpora. Our overview will include parsing-related methods, Bayesian methods and Latent Semantic Analysis, with a special focus on methods that do not take syntactic trees as a starting point. In addition, we will pay attention to two methodological approaches, viz. the contrastive approach, in which two independent analyses of two corpora are compared, and the single-model approach, in which changes in syntactic structure are interpreted as the result of a (possibly biased) competition within a single model.

We will compare the various methods by presenting different analyses of the same text material.

Corresponding author: l.tenbosch@let.ru.nl


Treatments of the Dutch verb cluster in formal and computational linguistics
Van Eynde, Frank
K.U.Leuven

The Dutch verb cluster has always been a challenge for formal and computational linguistics, since the sentences which contain one display a rather dramatic discrepancy between surface structure, on the one hand, and semantic structure, on the other, as illustrated amongst others by the cross-serial dependencies in sentences with an AcI verb, such as 'zien' in '...dat ik haar de honden heb zien voederen' (...that I saw her feed the dogs).

In multistratal frameworks, such as transformational grammar, the discrepancy is accounted for in terms of movement. More specifically, there is a level of syntactic structure which straightforwardly reflects the semantic relations, called deep structure or D-structure, and there is a series of transformations which map D-structures onto surface structures. The transformations either move the verbs, as in Arnold Evers' analysis, or their arguments, as in Jan-Wouter Zwart's analysis.

In monostratal frameworks, such as GPSG and HPSG, the discrepancy between surface structure and semantic structure is handled in terms of the inheritance of valence requirements, allowing the verbs in the cluster to take over the unfulfilled valence requirements of their verbal complement. This approach was pioneered by Mark Johnson in GPSG and by Erhard Hinrichs and Tsuneko Nakazawa in HPSG. Applications to Dutch are spelled out in work by Gerrit Rentier and by Gosse Bouma and Gertjan van Noord.

In the Dutch treebanks, such as those of CGN and Lassy, the treatment of the verb cluster is monostratal, but the device used to bridge the discrepancy between surface structure and semantic structure is more reminiscent of multistratal analyses, allowing the existence of crossing dependencies and hence the postulation of discontinuous constituents. The talk will give a survey of the existing treatments and provide a comparative evaluation.

Corresponding author: frank.vaneynde@ccl.kuleuven.be


TTNWW: de facto standards for Dutch in the context of CLARIN
Schuurman, Ineke (1) and Kemps-Snijders, Marc (2)
(1) Centrum voor Computerlinguïstiek, K.U.Leuven
(2) Meertens Instituut, Amsterdam

The Flemish and Dutch CLARIN groups have started TTNWW, a larger pilot project (2010-2012), in which several existing resources (both text and speech) and de facto standards for Dutch are a) used to help HSS researchers address new research needs, while b) these resources and standards are adapted to or mapped onto the standards adopted by CLARIN, embedded in a workflow and presented as a web service designed for these HSS researchers.

TTNWW involves technology partners (5 speech, 10 text) and user groups (4 speech, 2 text), spread over Flanders and the Netherlands.

The text part of TTNWW focuses on the recognition of all kinds of names in various types of texts, such as Dutch novels and archaeological documents, in the latter case in combination with temporal analysis. All 'lower' levels are also taken care of.

The project will provide the CLARIN community with ample feedback with respect to the standards and technologies proposed in the European context, and promote the de facto standards for Dutch NLP as used in CGN and several STEVIN projects.

In this presentation we will concentrate on various standards for written language and the mappings between them, especially PoS (CGN/D-Coi), MAF and ISOcat.

Corresponding author: ineke.schuurman@ccl.kuleuven.be


TTNWW: NLP Tools for Dutch as Webservices in a Workflow

Kemps-Snijders, Marc (1) and Schuurman, Ineke (2)
(1) Meertens Instituut, Amsterdam
(2) CCL, K.U.Leuven

The Flemish and Dutch CLARIN groups have started TTNWW, a larger pilot project (2010-2012), in which several existing resources (both text and speech) and de facto standards for Dutch are a) used to help HSS researchers address new research needs, while b) these resources and standards are adapted to or mapped onto the standards adopted by CLARIN, embedded in a workflow and presented as a web service designed for these HSS researchers.

In TTNWW, technology partners (5 speech, 10 text) and user groups (4 speech, 2 text) are involved, spread over Flanders and the Netherlands.

To develop the functionalities for both the speech and text parts of the project, the services delivered by each of the partners will be combined in a workflow approach that allows for flexible combinations of processes. Efforts in this area are embedded in the CLARIN effort to describe web services for easy discovery and profile matching, i.e. offering possible combinations of available resources and web services for specific tasks.

In this presentation we will focus on methods for workflow construction and the description of web services, and place them in the international perspective of CLARIN.
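
To make the workflow idea concrete, the sketch below chains two hypothetical RESTful NLP services, a tokenizer feeding a named-entity tagger. The URLs and the assumption that both services return JSON are illustrative only; they are not the actual TTNWW or CLARIN interfaces.

    import requests

    # Hypothetical service endpoints: these URLs are illustrative only,
    # not the actual TTNWW/CLARIN web services.
    TOKENIZER_URL = "http://example.org/services/tokenize"
    NER_URL = "http://example.org/services/ner"

    def run_workflow(text):
        # Step 1: send raw text to the tokenizer service.
        tokens = requests.post(TOKENIZER_URL, data={"text": text}).json()
        # Step 2: feed the tokenized output into the named-entity tagger.
        entities = requests.post(NER_URL, json={"tokens": tokens}).json()
        return entities

    if __name__ == "__main__":
        print(run_workflow("Ineke Schuurman werkt in Leuven."))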

Corresponding author: marc.kemps.snijders@meertens.knaw.nl



Using corpus tools to analyze gradable nouns in Dutch

Ruiz, Nicholas and Weiffenbach, Edgar
University of Groningen

Morzycki (2009) claims that degree readings of size adjectives, as in "a big idiot", are not merely the "consequence of some extragrammatical phenomenon", but rather can be attributed to syntax, which imposes positional restrictions on the availability of degree readings. We expand on Morzycki (2009) by introducing a corpus-based analysis of Dutch to verify Morzycki's claims and to extend them to the semantic domain.

Using LASSY, a syntactically annotated Dutch corpus developed inter alia under the STEVIN programme, we extract syntactic and semantic properties of noun phrases containing the adjectives 'gigantisch', 'kolossaal' and 'reusachtig', and manually annotate each adjective-noun pair with a gradable or non-gradable label.

Using these features, we construct a statistical model based on logistic regression and find that the semantic role, definiteness, and particular semantic noun groups derived from Cornetto (a Dutch WordNet with referential relations) have a significant effect on the likelihood that an adjective-noun pair is interpreted by the reader as having a degree reading.
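
A minimal sketch of the kind of model described: logistic regression over categorical features of adjective-noun pairs. The feature names and values here are illustrative stand-ins, not the authors' actual LASSY/Cornetto feature set.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy training data: each adjective-noun pair is described by categorical
    # features; labels mark whether annotators judged the reading gradable.
    pairs = [
        {"adj": "gigantisch", "definite": "no", "noun_group": "person"},
        {"adj": "kolossaal", "definite": "yes", "noun_group": "building"},
        {"adj": "reusachtig", "definite": "no", "noun_group": "person"},
        {"adj": "gigantisch", "definite": "yes", "noun_group": "building"},
    ]
    labels = [1, 0, 1, 0]  # 1 = gradable (degree) reading, 0 = literal size reading

    vec = DictVectorizer()
    X = vec.fit_transform(pairs)

    model = LogisticRegression().fit(X, labels)

    # Predict the reading for an unseen pair.
    test = vec.transform({"adj": "gigantisch", "definite": "no", "noun_group": "person"})
    print(model.predict_proba(test))  # probability of each class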

Corresponding author: nicholas.ruiz@gmail.com


Using easy distributed computing for data-intensive processing

Van den Bogaert, Joachim
Centre for Computational Linguistics, K.U. Leuven

Given the large amounts of data we have to cope with when computing useful data from large corpora, and the difficulties and costs involved in running parallel code on traditional parallel computing infrastructure, we present different frameworks that may be used to facilitate easy distributed computing. Using string-to-tree alignment (GHKM), frequent subtree mining, and distributed Moses decoding as example cases, we demonstrate how applications and algorithms can be scaled up and scaled out with these frameworks. We consider both the creation of an embarrassingly parallel solution and the re-design of an existing algorithm to fit the MapReduce paradigm.
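
As a minimal illustration of the MapReduce paradigm mentioned above, the sketch below counts word frequencies with explicit map, shuffle and reduce phases; a real framework (e.g. Hadoop) would distribute these phases over many machines.

    from collections import defaultdict

    def map_phase(document):
        # Emit (word, 1) pairs for every token in a document.
        return [(word, 1) for word in document.split()]

    def shuffle(pairs):
        # Group all values by key; a framework does this across the network.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Sum the counts for one word.
        return key, sum(values)

    documents = ["de kat zat op de mat", "de hond zag de kat"]
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(counts)  # {'de': 4, 'kat': 2, ...}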

Corresponding author: joachim@ccl.kuleuven.be



What is the use of multidocument spatiotemporal analysis?

Schuurman, Ineke and Vandeghinste, Vincent
Centrum voor Computerlinguïstiek, K.U.Leuven

In the AMASS++ project (IWT), the central research topic is multilingual multimedia multidocument summarization: how, in a huge digitized news archive, can a journalist find documents dealing with the same events and have a summary made of them in order to assess their usefulness for a specific purpose? Or: how can a newspaper deliver personalized (inter)national and local news to a specific subscriber?

In this presentation we will show how spatiotemporal analysis can assist in such tasks, even when the input consists of nothing more than raw PoS-tagged texts (contrary to the SoNaR project, where many levels of annotation are available, all of them corrected).

Corresponding author: ineke.schuurman@ccl.kuleuven.be


Without a doubt no uncomplicated task: Negation cues and their scope

Morante, Roser and Schrauwen, Sarah and Daelemans, Walter
CLiPS, University of Antwerp

Although negation has been extensively treated from a theoretical perspective (Klima 1964, Horn 1989, Tottie 1991, van der Wouden 1997) and its processing is thought to be relevant for natural language processing systems (Morante and Sporleder 2010), there is a lack of annotated resources, and no publicly available annotation guidelines can be found that describe in detail how to annotate negation-related aspects. In this talk we present a corpus annotated with negation cues and their scope, we describe the guidelines that we have defined, and we comment on the linguistic aspects of the annotation process. The annotated corpus contains the detective stories The Hound of the Baskervilles and The Adventure of Wisteria Lodge by Conan Doyle. Part of the corpus has already been annotated with other layers of semantic information (semantic roles, coreference) for the SemEval task Linking Events and Their Participants in Discourse (Ruppenhofer et al., 2010). We first describe the expression of negation in this corpus and compare it with the expression of negation in biomedical documents. Then we comment in detail on several aspects of the negation phenomenon: how to determine what counts as a negation cue, how to mark the scope, and how to determine whether an event is negated. We will show that marking the cues is not a matter of lexical look-up, because some cues are ambiguous, and that contextual and discourse-level features play a role in finding the scope. Additionally, we show that finding negated events depends on the semantic class, mood and tense of the predicates involved, on the modality of the event clause, and on the syntactic construction. Finally, we comment on the most difficult aspects of the annotation process, such as determining when prepositions like "save" or "except" act as negation cues.

Corresponding author: roser.morante@ua.ac.be


Poster Abstracts



A database for lexical orthographic errors in French

Manguin, Jean-Luc
GREYC, Université de Caen, France

This work describes the construction of a database of lexical orthographic errors in French. The construction uses various techniques from the field of NLP for a goal in the field of psycholinguistics, where it is often difficult and time-consuming to collect enough data from experiments with real people. Here the data are collected online: they come from the requests made to an online dictionary. In this huge amount of data (about 160 million words, 4 million distinct forms), we can find enough errors to obtain good statistics for a deep study of errors. The questions developed here are the link between a "bad" form and its correction, and the classification of errors into a small number of types. Several programs and techniques are involved in achieving these tasks: detection of graphic neighbours, phonetization, and pattern matching. The combination of these techniques leads to 70% of corrections with no ambiguity, and 80% if we accept that the system may give several possible corrections. The classification of errors is also useful for predicting where errors may appear in words, and thus for understanding how children learn orthography.
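
A minimal sketch of one of the techniques mentioned, the detection of graphic neighbours: candidate corrections are lexicon words within Levenshtein distance 1 of the misspelled form. The lexicon here is a toy stand-in for a full dictionary.

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution
            prev = curr
        return prev[-1]

    # Toy French lexicon; a real system would use a full dictionary.
    lexicon = {"langage", "bagage", "garage", "language"}

    def graphic_neighbours(form, max_dist=1):
        return [w for w in lexicon if levenshtein(form, w) <= max_dist]

    print(graphic_neighbours("langgage"))  # e.g. ['langage', 'language']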

Corresponding author: jean-luc.manguin@unicaen.fr



A Posteriori Agreement as a Quality Measure for Readability Prediction Systems

van Oosten, Philip and Hoste, Véronique and Tanghe, Dries
LT3, University College Ghent

All readability research is ultimately concerned with the question whether it is possible for a prediction system to automatically determine the level of readability of an unseen text.

A significant problem for such a system is that readability might depend in part on the reader. If different readers assess the readability of texts in fundamentally different ways, there is insufficient a priori agreement to justify the correctness of a readability prediction system based on the texts assessed by those readers. We built a data set of readability assessments by expert readers, clustered the experts into groups with greater a priori agreement, and then measured for each group whether classifiers trained only on data from this group exhibited a classification bias.

As this was found to be the case, the classification mechanism cannot be unproblematically generalized to a different user group.
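
A minimal sketch of the a priori agreement computation that motivates the clustering step: pairwise Cohen's kappa between expert annotators. The labels below are toy placeholders; the actual data set and clustering procedure are the authors'.

    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    # Toy readability assessments: each expert labels the same five texts
    # as 'easy' or 'hard'. Real data would cover many more texts and experts.
    experts = {
        "expert_a": ["easy", "easy", "hard", "hard", "easy"],
        "expert_b": ["easy", "hard", "hard", "hard", "easy"],
        "expert_c": ["hard", "easy", "easy", "hard", "hard"],
    }

    # Pairwise a priori agreement; low-kappa pairs would end up in
    # different annotator clusters.
    for (name1, labels1), (name2, labels2) in combinations(experts.items(), 2):
        kappa = cohen_kappa_score(labels1, labels2)
        print(f"{name1} vs {name2}: kappa = {kappa:.2f}")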

Corresponding author: philip.vanoosten@hogent.be


A TN/ITN Framework for Western European languages

Chesi, Cristiano (1) and Cho, Hyongsil (2) and Baldewijns, Daan (2) and Braga, Daniela (1)
(1) Microsoft Language Development Center
(2) Microsoft Language Development Center, ISCTE-Lisbon University Institute, Portugal

(Inverse) Text Normalization, (I)TN, is an essential module in Text-to-Speech (TTS) and Speech Recognition (SR) systems, and it requires both a significant development timeline and deep linguistic expertise (Mikheev 2000, Palmer 2010).

In this work, we describe an efficient multilingual (I)TN framework that is rule-based, hierarchical and modular. The core system is composed of a large set of optimized Finite-State Transducers (FSTs) that are compiled following the Normalization Maps developed by Language Experts (LEs) for each language. Such maps are built using a proprietary tool (TNAuthoringTool, Patent Serial No. 12/361,114) that allows the LEs to express terminal normalizations at a high level (e.g. Term_1: "1" > "one") and to easily combine such terminals by means of hierarchical, weighted rules. These rules can be ordered sets of terminals or other rules, each one ranked according to its relevance so as to prevent interference in specific contexts (e.g. Rule_1: "21-12-2010" > "twenty-first of December two thousand ten" vs. Rule_2: "23-29" > "from twenty three to twenty nine"). Such rules are clustered under a small number of Top-Level rules that constitute the entry states of the compiled FSTs.

The core set of Top-Level rules developed covers Numerals, Ordinals, Dates and Time, Telephone numbers, Measurements and Web-related terms (e.g. URLs, email, acronyms). Here, we focus on the ambiguity resolution implemented in three languages (English, French, Italian) in the normalization of web-search specific terms and mobile text messages. Accuracy and coverage of the FSTs are evaluated against very large collections of BING queries and SMS corpora.
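
To illustrate the kind of ranked, rule-based normalization described (not the proprietary FST tool itself), the sketch below applies ordered regular-expression rules so that the more specific date pattern wins over the generic number-range pattern.

    import re

    # Toy terminal maps; a real system covers all numbers, ordinals, months.
    NUM = {"2010": "two thousand ten", "23": "twenty three", "29": "twenty nine"}
    ORD = {"21": "twenty-first"}
    MONTH = {"12": "December"}

    def normalize(text):
        # Rule 1 (specific, ranked first): dd-mm-yyyy dates.
        m = re.fullmatch(r"(\d{2})-(\d{2})-(\d{4})", text)
        if m and m.group(1) in ORD and m.group(2) in MONTH and m.group(3) in NUM:
            return f"the {ORD[m.group(1)]} of {MONTH[m.group(2)]} {NUM[m.group(3)]}"
        # Rule 2 (generic, ranked lower): numeric ranges.
        m = re.fullmatch(r"(\d+)-(\d+)", text)
        if m and m.group(1) in NUM and m.group(2) in NUM:
            return f"from {NUM[m.group(1)]} to {NUM[m.group(2)]}"
        return text

    print(normalize("21-12-2010"))  # the twenty-first of December two thousand ten
    print(normalize("23-29"))       # from twenty three to twenty nine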

Corresponding author: v-crches@microsoft.com



An Examination of Cross-Cultural Similarities and Differences from Social Media Data with respect to Language Use

Elahi, Mohammad Fazleh and Monachesi, Paola

We present a methodology for analyzing cross-cultural similarities and differences using language as a medium, love as the domain, social media as the data source, and 'terms' (emotions and sentiments) and 'topics' as cultural features. We discuss the techniques necessary for the creation of the social data corpus, from which emotion terms have been extracted using NLP techniques. Topics of love discussion were then extracted from the corpus by means of Latent Dirichlet Allocation (LDA). Finally, on the basis of these features, a cross-cultural comparison was carried out. For the purpose of cross-cultural analysis, the experimental focus was on comparing data from a culture from the East (India) with a culture from the West (United States of America). Similarities and differences between these cultures have been analyzed with respect to the usage of emotions, their intensities, and the topics used in love discussions on social media. Findings include: (i) Indians are more emotional than Americans, but Americans express themselves with stronger emotion terms than Indians; (ii) in discussions on common topics related to love (Wedding, Same Sex, etc.), the conversations of Indians and Americans relate to the particular traditions and recent issues of their culture; and (iii) Indians and Americans also use some terms and topics that are related only to their own culture.
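
A minimal sketch of the LDA step with gensim; the documents here are a toy stand-in for the preprocessed social media corpus.

    from gensim import corpora, models

    # Toy documents standing in for preprocessed social media posts about love.
    texts = [
        ["wedding", "family", "tradition", "love"],
        ["love", "heart", "valentine", "gift"],
        ["wedding", "ceremony", "family", "blessing"],
    ]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    # Train a two-topic LDA model and inspect the topics.
    lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)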

Corresponding author: rmf_ku@yahoo.com


Authorship Verification of Quran

Shokrollahi-Far, Mahmoud
Tilburg University

The Holy Quran, as the cultural heritage of the Islamic world, has long been the focus of scholarly disputes, some of them unresolved to this day, such as whether Prophet Mohammad himself authored the book. This paper reports on a line of research that approaches such disputes as a text classification task, in this case authorship verification. To induce classifiers for this verification task, SVM and Naive Bayes machines have been trained on tagged corpora of Quranic texts bootstrapped by Mobin, a morpho-syntactic tagger developed for Arabic. The algorithm applied to the task is an efficient enhancement of the algorithms applied so far to authorship verification, and it seems applicable to authorship problems for other Arabic texts. The results have not verified Prophet Mohammad's authorship of the Quran.
verified the authorship of Quran by Prophet Mohammad.<br />

Corresponding author: m.shokrollahifar@uvt.nl



CLAM: Computational Linguistics Application Mediator

van Gompel, Maarten and Reynaert, Martin and van den Bosch, Antal
TiCC, Tilburg University

The Computational Linguistics Application Mediator (CLAM) allows you to quickly and transparently transform your Natural Language Processing application into a RESTful webservice with which automated clients can communicate, but which at the same time acts as a modern web application with which human end-users can interact directly. CLAM takes a description of your system and wraps itself around it. It allows both automated clients and human end-users to upload input files to your application, start your application with specific parameters, and download or directly view the output files produced by your application after it has completed execution. Rich support for metadata and provenance data is also provided.

CLAM is set up in a universal fashion, making it flexible enough to be wrapped around a wide range of computational linguistic applications. These applications are treated as a black box, of which only the parameters, input formats and output formats need to be described. The applications themselves need not be network-aware in any way, nor aware of CLAM. The handling and validation of input is taken care of by CLAM.
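
To give an impression of what talking to such a RESTful service looks like, here is a schematic client sketch; the base URL, endpoint paths and status handling are illustrative assumptions, not CLAM's documented API.

    import time
    import requests

    # Hypothetical CLAM-wrapped service; the URL and paths below are
    # illustrative assumptions, not CLAM's actual REST interface.
    BASE = "http://localhost:8080/myservice/myproject"

    requests.put(BASE)  # create a project
    with open("input.txt", "rb") as f:  # upload an input file
        requests.post(BASE + "/input/input.txt", files={"file": f})
    requests.post(BASE, data={"parameter1": "value"})  # start execution

    # Poll until the (assumed) status endpoint reports completion,
    # then download an output file.
    while "done" not in requests.get(BASE + "/status").text:
        time.sleep(5)
    output = requests.get(BASE + "/output/result.txt").text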

Corresponding author: proycon@anaproy.nl


Discriminative features in reversible stochastic attribute-value grammars

de Kok, Daniël
University of Groningen

Reversible stochastic attribute-value grammars use one model for parse disambiguation and fluency ranking. Such a model encodes preferences with respect to syntax, fluency, and appropriateness of logical forms as weighted features. This framework is appropriate if similar preferences are used in parsing and generation.

Reversible models incorporate features that are specific to parse disambiguation or fluency ranking, as well as features that are used for both tasks. One particular concern with respect to such models is that much of their discriminatory power may be provided by task-specific features. If this is true, the premise that similar preferences are used in parsing and generation does not hold.

A detailed analysis of features could give us more insight into the true reversibility of stochastic attribute-value grammars. However, as De Kok (2010) argued, such feature-based models are very opaque, due to their enormous size and the tendency to spread weight mass among overlapping features. Feature selection methods can be used to extract a subset of features that do not overlap.

In this work, we compare gain-informed feature selection (Berger et al., 1996; Zhou et al., 2003; De Kok, 2010), grafting (Perkins et al., 2003) and grafting-light (Zhu et al., 2010) in performing selection on reversible models. We then use the most effective method to extract a list of features ranked by their discriminatory power. We show that only a very small number of features is required to produce an effective model for parsing and generation. We also provide a qualitative and quantitative analysis of these features.

Corresponding author: d.j.a.de.kok@rug.nl



Fietstas: a web service for text analysis

Jijkoun, Valentin and de Rijke, Maarten and Vishneuski, Andrei
University of Amsterdam

We present Fietstas, an open-access web service for text analysis, created with the idea of simplifying the building of text-intensive applications. As a web service, Fietstas consists of (1) a simple content management component, where users can upload their content with metadata, (2) a collection of text processing components, from tokenization to named entity extraction and normalization, (3) a component for accessing and visualizing document processing results (e.g. as XML or HTML), and (4) a data analysis component for generating term-cloud-based summaries and timelines. The functionality is available through an easy-to-use REST interface (i.e. through standard HTTP requests), and moreover a number of APIs are available (e.g. for Python and Perl). In this presentation, we briefly describe Fietstas and demonstrate how its functionality can be used in a simple web application (news search and analysis).

Corresponding author: jijkoun@uva.nl


FoLiA: Format for Linguistic Annotation

van Gompel, Maarten and Reynaert, Martin and van den Bosch, Antal
TiCC, Tilburg University

We present FoLiA, an XML-based annotation format suitable for the representation of written language resources. The format builds on the work put into the D-Coi/SoNaR format, but greatly extends it to accommodate a wide variety of linguistic annotations. The objective is to present a rich annotation format based on a single unifying notation paradigm that does not commit to any particular tagset, but instead offers maximum flexibility and extensibility. In doing so, we replace the many ad-hoc formats present in the field with a single well-structured format.

FoLiA will be proposed as a candidate CLARIN standard.

Corresponding author: proycon@anaproy.nl



How can computational linguistics help determine the core meaning of then in oral speech?

Vallee, Michael
EDC

Research on the connective then has mainly focused on its temporal interpretation. However, little has been done on then in oral speech, and more precisely in questions, orders or inferential sentences. It seems important to show that computational linguistics can really help determine whether there is one way to describe the connective in these contexts, or whether there are as many different connectives then as there are linguistic structures.

To do so, I will use the prosody of questions and sentences in oral speech to demonstrate that the speaker expresses surprise or a contradiction with what was uttered prior to the connective in these contexts. To illustrate this perspective, consider the utterance "Now then you listen to me". I will show that the phonological structure revealed by software helps to describe how then works. In the example above, the underlying structure before the utterance was "you-not-listen to me", which was not expected by the speaker. In that case, the connective marks the different viewpoints of the speaker and the hearer.

Corresponding author: m.vallee@yahoo.fr


Of mathematicians and physicists: the history of language and speech technology in the Netherlands and Flanders

van der Beek, Leonoor
Q-go

When did language technology come into being in the Low Countries? Why are language and speech technology (LST) located in different faculties at Dutch and Flemish universities? What was the impact of Lernout & Hauspie on the LST industry in the Netherlands? From September 2009 until September 2010, I investigated these and many other questions related to the history of LST in the Netherlands and Flanders. I interviewed the pioneers of our field and compiled from their stories a diverse picture of the struggle against the limitations of immature computer technology, of boundless optimism and deep disappointment, and of academic friendships and fights. From Adriaan van Wijngaarden to Jo Lernout, and from PHLIQA via Eurotra to CGN. I'll sketch the project and the approach I've taken, and give away some of the highlights of the book 'Van Rekenmachine tot Taalautomaat' (in Dutch), which will tell the full story.

Corresponding author: vdbeek@gmail.com



On the difficulty of making concreteness concrete

van Halteren, Hans and Theijssen, Daphne and Oostdijk, Nelleke and Boves, Lou
Radboud University Nijmegen

As analysis and annotation progress to deeper linguistic levels, matters prove ever more difficult. It not only becomes harder to get machines to provide proper analyses, but also to define exactly what we want. Whereas there appears to be consensus on what plural nouns are (morpho-syntax) or what relative clauses are (syntax), this is certainly not the case for semantic properties like concreteness. When reading papers referring to such concepts, one is unlikely to notice any problems. Bresnan et al. (2007), for example, simply use the concreteness of a noun as a given and draw conclusions about the significance of its influence on choices in the dative alternation.

However, once we ourselves attempt to annotate for concreteness, we run headlong into the absence of any clear definition of concreteness. Bresnan refers to Garretson (2003), where all we get is a vague (and somewhat circular) description and some examples. Looking further, we find lists, such as in the MRC Psycholinguistic Database (Coltheart, 1981), as well as procedures, such as Xing et al.'s (2010) procedure based on WordNet, all apparently leading to values for the property concreteness. But we can only wonder to what degree these various definitions and procedures lead to the same results. In this paper, therefore, we take a number of procedures that yield concreteness values and examine a) to what degree they overlap in their annotation of corpus data (here: SemCor) and b) to what degree they lead to the same conclusions about the influence of concreteness on syntactic processes (here: the dative alternation).
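
A minimal sketch of one possible WordNet-based procedure (in the spirit of, but not necessarily identical to, Xing et al. 2010): treat a noun as concrete if any of its senses descends from the physical_entity synset.

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    PHYSICAL = wn.synset("physical_entity.n.01")

    def is_concrete(noun):
        # A noun counts as concrete here if any sense has physical_entity
        # among its hypernym ancestors. This is one heuristic among many,
        # which is precisely the problem the paper addresses.
        for sense in wn.synsets(noun, pos=wn.NOUN):
            if PHYSICAL in sense.closure(lambda s: s.hypernyms()):
                return True
        return False

    for word in ["table", "idea", "dog", "freedom"]:
        print(word, is_concrete(word))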

Corresponding author: hvh@let.ru.nl


ParaSense, or how to use parallel corpora for Cross-Lingual Word Sense Disambiguation

Lefever, Els and Hoste, Véronique
LT3, University College Ghent

Cross-lingual Word Sense Disambiguation (WSD) consists of selecting the correct translation of an ambiguous word in a given context. In this talk we present a set of experiments for a classification-based WSD system that uses evidence from multiple languages to choose a translation label for an ambiguous target word in one of the five supported languages (viz. Italian, Spanish, French, Dutch and German). Instead of using a predefined monolingual sense inventory such as WordNet, we use a language-independent framework and build up our sense inventory by means of the aligned translations from the parallel corpus Europarl.

The information that is used to train and test our classifier contains the well-known WSD local context features of the English input sentences, as well as translation features from the other languages.

Our results show that the multilingual approach outperforms classification experiments that only take into account the more traditional monolingual WSD features. In addition, our results are competitive with those of the best systems that participated in the SemEval-2 "Cross-Lingual Word Sense Disambiguation" task.

Corresponding author: els.lefever@hogent.be



"Pattern", a web mining module for Python<br />

De Smedt, Tom and Daelemans, Walter<br />

CLiPS, University of Antwerp<br />

"Pattern" is a mash-up package for the Python programming language that bundles<br />

fast, regular expressions-based functionality for NLP and data-mining tasks. It consists<br />

of the following modules:<br />

1) pattern.web: provides easy access to Google, Yahoo, Bing, Twitter, Wikipedia,<br />

Flickr, RSS + a robust HTML DOM parser.<br />

2) pattern.en: tools for verb inflection, noun pluralization/singularization, a WordNet<br />

interface, a fast tagger/chunker based on regular expressions.<br />

3) pattern.table: for working with datasheets (e.g. MS Excel) and CSV-files.<br />

4) pattern.search: regular expressions for syntax and semantics. For example:<br />

"BRAND|NP VP JJ+" matches any sentence in which a noun phrase containing a<br />

brand name is followed by a verb phrase followed by one or more adjectives, e.g.<br />

"the new iPhone will be amazing", "Doritos taste cheesy", ...<br />

5) pattern.vector: corpus tools for tf-idf, cosine similarity, vector space search and<br />

LSA.<br />

6) pattern.graph: for exploring graphs and semantic networks.<br />

The package can be used and extended for harvesting online data, opinion mining,<br />

building semantic networks using a machine learning approach, and so on.<br />
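
A small usage sketch of the pattern.en module described above; the function names follow the abstract, but exact signatures may differ per version, so consult the package documentation.

    # Illustrative use of pattern.en; assumes Pattern 2.x as described above.
    from pattern.en import parse, pluralize, singularize

    print(pluralize("chunk"))      # 'chunks'
    print(singularize("taggers"))  # 'tagger'

    # The tagger/chunker returns a tagged string with part-of-speech
    # and chunk labels.
    print(parse("The new iPhone will be amazing."))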

Corresponding author: tom.desmedt@ua.ac.be


Semantic role labeling of gene regulation events

Morante, Roser
CLiPS, University of Antwerp

This poster describes work in progress on semantic role labeling of gene regulation events. Semantic role labeling (SRL) is a natural language processing task that consists of identifying the arguments of predicates within a sentence and assigning a semantic role to them (Màrquez et al., 2008). This task can support the extraction of relations from biomedical texts. Recent research has produced a rich variety of SRL systems to process general-domain corpora. However, fewer systems have been developed to process biomedical corpora (Tzong-Han Tsai et al., 2007; Bethard et al., 2008). In this abstract, we present preliminary results of a new system that is trained on the GREC corpus (Thompson et al., 2009). The system performs argument identification and semantic role assignment in a single step, assuming gold-standard event identification. We provide cross-validation and cross-domain results.

Corresponding author: roser.morante@ua.ac.be



Source Verification in Quran

Shokrollahi-Far, Mahmoud
Tilburg University

The revelation of the Holy Quran took place either in Mecca or in Medina, and accordingly the chapters, and even individual verses, of the book have been classified as either Meccan or Medinan. This crucial classification helps the scholars of the Quran in many areas, including the exegesis of the book. Among the one hundred and fourteen chapters of the Quran, thirty-two are still disputed as to whether they are Meccan or Medinan. More fundamentally, scholars have long disputed over the features that would discriminate between these two classes. This paper reports on a line of research that applies text classification, in this case source verification, to help resolve such Quranic disputes. For this binary TC task, classifiers have been induced by training SVM and Naive Bayes machines on tagged corpora of Quranic texts bootstrapped by Mobin, a morpho-syntactic tagger developed for Arabic. This research has not only uncovered the required distinctive grammatical features, but has also led to a successful classification of the disputed chapters and verses as Meccan or Medinan.

Corresponding author: m.shokrollahifar@uvt.nl


Towards a language-independent data-driven compound decomposition tool

Réveil, Bert (1) and Macken, Lieve (2)
(1) ELIS, Ghent University
(2) LT3, Language and Translation Technology Team

Compounding is a highly productive process in Dutch that poses a challenge for various NLP applications, such as terminology extraction, continuous speech recognition and automated word alignment. The present work therefore proposes a language-independent, data-driven decomposition tool that tries to segment compounds into their meaningful parts.

The basic version of this tool first determines a list of eligible compound constituents (so-called heads and tails), relying solely on word frequency information extracted from a large text corpus. The decomposition algorithm then recursively attempts to decompose the compounds, allowing only two-part head-tail divisions in each iteration. For example, the noun 'postzegelverzamelaar' is first split into 'postzegel' + 'verzamelaar', followed by an additional decomposition of 'postzegel' into 'post' + 'zegel'.

Apart from the basic version, an extended version of the tool is assessed that uses PoS information as a means to restrict the list of possible heads and tails. The performance of both versions is evaluated in two large-scale decomposition experiments, one on the E-lex compound list and one on a word list that contains specific vocabulary from the automotive domain. As the presented decomposition tool relies only on word frequency and PoS information, it is expected that it can easily be adapted to new domains and languages.
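
A minimal sketch of the recursive head-tail splitting described above; the frequency lexicon and thresholds are toy stand-ins, whereas the real tool derives constituent lists from a large corpus.

    # Toy frequency lexicon standing in for corpus-derived constituent lists.
    FREQ = {"post": 900, "zegel": 400, "postzegel": 300, "verzamelaar": 250}

    MIN_FREQ = 100   # threshold for a string to count as an eligible constituent
    MIN_LEN = 3      # avoid spurious one- or two-letter parts

    def decompose(word):
        # Try every two-part head-tail division; recurse on the parts.
        for i in range(MIN_LEN, len(word) - MIN_LEN + 1):
            head, tail = word[:i], word[i:]
            if FREQ.get(head, 0) >= MIN_FREQ and FREQ.get(tail, 0) >= MIN_FREQ:
                return decompose(head) + decompose(tail)
        return [word]  # no valid split found: treat as atomic

    print(decompose("postzegelverzamelaar"))  # ['post', 'zegel', 'verzamelaar']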

Corresponding author: breveil@elis.ugent.be



Towards improving the precision of a relation extraction system by processing negation and speculation

Van Asch, Vincent and Morante, Roser and Daelemans, Walter
CLiPS, University of Antwerp

In this poster we present BiographTA, a system that extracts biological relations from PubMed abstracts. The relation extraction system has been designed to process abstracts in which biological relations from multiple databases have been annotated automatically on the basis of an in-sentence co-occurrence criterion. It performs a relation identification task learning from noisy data, since a proportion of the automatically annotated relations in the training corpus is incorrect. The system cannot be evaluated on noisy data. For this reason, in order to develop and evaluate the system, we gathered a corpus of PubMed abstracts annotated with the gold biological relations of the BioInfer corpus.

Additionally, one of the text mining goals in the Biograph project is to develop techniques that make it possible to perform large-scale relation extraction starting from the smallest possible amount of manually annotated data while obtaining the highest possible precision. This is why we add to the relation extraction system a module that processes negation and speculation cues. We present experiments aimed at testing whether processing the scope of negation and speculation cues results in a higher precision of the extracted relations. Results show that the negation and speculation detection module increases precision by 2.93 points at the cost of a 0.68-point decrease in recall.

Corresponding author: Vincent.VanAsch@ua.ac.be

List of Participants


Liesbeth Augustinus | liesbeth@ccl.kuleuven.be | CCL, K.U.Leuven
Daan Baldewijns | v-daanb@microsoft.com | Microsoft Language Development Center
Kim Bauters | kim.bauters@ugent.be | Ghent University
Richard Beaufort | richard.beaufort@uclouvain.be | UCL CENTAL
Peter Berck | P.J.Berck@UvT.nl | TiCC, Tilburg University
Tamas Biro | t.s.biro@uva.nl | ACLC, University of Amsterdam
Jelke Bloem | j.bloem.3@student.rug.nl | University of Groningen
Cristiano Chesi | v-crches@microsoft.com | Microsoft Language Development Center, Porto Salvo; ISCTE-Lisbon University Institute, Portugal
Kostadin Cholakov | k.cholakov@rug.nl | University of Groningen
Louise-Amélie Cougnon | louise-amelie.cougnon@uclouvain.be | CENTAL, IL&C, UCLouvain
Crit Cremers | c.l.j.m.cremers@hum.leidenuniv.nl | Leiden University
Walter Daelemans | walter.daelemans@ua.ac.be | CLiPS, University of Antwerp
Orphée De Clercq | orphee.declercq@hogent.be | LT3, University College Ghent
Martine De Cock | Martine.DeCock@UGent.be | Ghent University
Daniël de Kok | d.j.a.de.kok@rug.nl | University of Groningen
Tom De Smedt | tomdesmedt@gmail.com | CLiPS, Universiteit Antwerpen
Herwig De Smet | herwig.desmet@kdg.be | OptiFox 7th Framework Europe
Dennis de Vries | dennis@gridline.nl | GridLine
Saskia Debergh | saskia.debergh@intersystems.com | i.Know nv
Johannes Deleu | johannes.deleu@intec.ugent.be | IBCN, IBBT & Ghent University
Thomas Demeester | thomas.demeester@ugent.be | Ghent University
Bart Desmet | bart.desmet@hogent.be | LT3, University College Ghent
Brecht Desplanques | brecht.desplanques@elis.ugent.be | ELIS, Ghent University
Peter Dirix | peter.dirix@nuance.com | Nuance
Stephen Doherty | stephen.doherty2@mail.dcu.ie | Dublin City University
Marius Doornenbal | m.doornenbal@elsevier.com | Reed Elsevier
Frederik Durant | frederik.durant@tomtom.com | TomTom
Mohammad Fazleh Elahi | rmf_ku@yahoo.com
Thomas François | thomas.francois@uclouvain.be | CENTAL, UCLouvain
Tanja Gaustad | T.Gaustad@uvt.nl | TiCC, Tilburg University
Olga Gordeeva | ogordeeva@gmail.com | Acapela Group
Kris Heylen | kris.heylen@arts.kuleuven.be | QLVL, K.U.Leuven
Maarten Hijzelendoorn | p.m.hijzelendoorn@hum.leidenuniv.nl | Leiden University
Veronique Hoste | veronique.hoste@hogent.be | LT3, University College Ghent
Steve Hunt | s.j.hunt@tilburguniversity.nl | TiCC, Tilburg University
Marc Kemps-Snijders | marc.kemps.snijders@meertens.knaw.nl | Meertens Instituut
Mike Kestemont | mike.kestemont@ua.ac.be | CLiPS, University of Antwerp
Maxim Khalilov | maxkhalilov@gmail.com | ILLC, University of Amsterdam
Henny Klein | E.H.Klein@rug.nl | University of Groningen
Cornelis H.A. Koster | kees@cs.ru.nl | Radboud Universiteit Nijmegen
Gideon Kotzé | g.j.kotze@rug.nl | University of Groningen
Mark Kroon | mark.kroon@actonomy.com | Actonomy
Reinier Lamers | lamers@textkernel.nl | Textkernel
Els Lefever | els.lefever@hogent.be | LT3, University College Ghent
Anna Lobanova | a.lobanova@ai.rug.nl | AI, University of Groningen
Alessandro Lopopolo | A.Lopopolo@student.uva.nl | ACLC, University of Amsterdam
Kim Luyckx | kim.luyckx@ua.ac.be | CLiPS, University of Antwerp
Lieve Macken | lieve.macken@hogent.be | LT3, University College Ghent
Gideon Maillette de Buy Wenniger | gemdbw@gmail.com | ILLC, University of Amsterdam
Véronique Malaisé | vmalaise@vu.nl | VU University Amsterdam
Jean-Luc Manguin | jean-luc.manguin@unicaen.fr | CNRS, Université de Caen
Eliza Margaretha | e.margaretha@student.rug.nl | University of Groningen
Thomas Markus | Thomas.Markus@phil.uu.nl | Utrecht University
Scott Martens | scott@ccl.kuleuven.be | CCL, K.U.Leuven
Dieneke Meijer | dieneke.meijer@agentschapnl.nl | Agentschap NL / STEVIN
Sien Moens | sien.moens@cs.kuleuven.be | K.U.Leuven CW
Paola Monachesi | P.Monachesi@uu.nl | Utrecht University
Roser Morante | roser.morante@ua.ac.be | CLiPS, University of Antwerp
Peter Nabende | p.nabende@rug.nl | University of Groningen
Fabrice Nauze | fabrice.nauze@rightnow.com | Q-go / Rightnow
John Nerbonne | j.nerbonne@rug.nl | University of Groningen
Jan Odijk | j.odijk@uu.nl | UiL-OTS, Universiteit Utrecht
Leequisach Panjaitan | leequisach.panjaitan@yahoo.com
Claudia Peersman | claudia.peersman@ua.ac.be | CLiPS, University of Antwerp & Artesis
Suléne Pilon | sulene.pilon@nwu.ac.za | North-West University (VTC)
Barbara Plank | b.plank@rug.nl | University of Groningen
Massimo Poesio | poesio@essex.ac.uk | University of Essex
Bert Réveil | breveil@elis.ugent.be | DSSP, ELIS, Ghent University
Mihai Rotaru | rotaru@textkernel.nl | Textkernel
Nicholas Ruiz | nicholas.ruiz@gmail.com | University of Groningen
Marijn Schraagen | schraage@liacs.nl | Leiden University
Louise Schubotz | louise_schubotz@gmx.de | Radboud University Nijmegen
Ineke Schuurman | ineke.schuurman@ccl.kuleuven.be | CCL, K.U.Leuven
Roxane Segers | r.h.segers@vu.nl | VU University Amsterdam
Binyam Seyoum | binephrem@gmail.com | Addis Ababa University
Margaux Smets | margauxsmets@gmail.com | QLVL, K.U.Leuven
Martijn Spitters | spitters@textkernel.nl | Textkernel
Peter Spyns | pspyns@taalunie.org | Nederlandse Taalunie
Tim Stokman | timstokman@gmail.com | Textkernel
Dries Tanghe | dwiesje@hotmail.com | LT3, University College Ghent
Tristan Thomas Teunissen | tristan@w3lab.nl | BA
Daphne Theijssen | d.theijssen@let.ru.nl | Radboud University Nijmegen
Erik Tjong Kim Sang | erikt@xs4all.nl | University of Groningen
Fabian Triefenbach | fabian.triefenbach@elis.ugent.be | ELIS, Ghent University
Frederik Vaassen | frederik.vaassen@ua.ac.be | CLiPS, University of Antwerp
Michaël Vallée | m.vallee@yahoo.fr | EDC Paris
Vincent Van Asch | Vincent.VanAsch@ua.ac.be | CLiPS, University of Antwerp
Matje van de Camp | M.M.v.d.Camp@uvt.nl | TiCC, Tilburg University
Tim Van de Cruys | tv234@cam.ac.uk | University of Cambridge
Marjan Van de Kauter | marjan.vandekauter@hogent.be | LT3, University College Ghent
Anne van de Wetering | amvdwetering@hotmail.com | University of Groningen
Joachim Van den Bogaert | joachim@ccl.kuleuven.be | CCL, K.U.Leuven
Antal van den Bosch | Antal.vdnBosch@uvt.nl | TiCC, Tilburg University
Leonoor van der Beek | leonoor.vanderbeek@rightnow.com | Q-go / Rightnow
Marieke van Erp | Marieke@cs.vu.nl | VU University Amsterdam
Frank Van Eynde | frank.vaneynde@ccl.kuleuven.be | CCL, K.U.Leuven
Maarten van Gompel | proycon@anaproy.nl | TiCC, Tilburg University
Hans van Halteren | hvh@let.ru.nl | Radboud University Nijmegen
Gertjan van Noord | g.j.m.van.noord@rug.nl | University of Groningen
Philip van Oosten | philip.vanoosten@hogent.be | LT3, University College Ghent
Menno van Zaanen | mvzaanen@uvt.nl | TiCC, Tilburg University
Tom Vanallemeersch | tallem@ccl.kuleuven.be | CCL, K.U.Leuven
Vincent Vandeghinste | vincent@ccl.kuleuven.be | CCL, K.U.Leuven
Klaar Vanopstal | klaar.vanopstal@hogent.be | LT3, University College Ghent
Kateryna Vasylenko | Katyaknu1986@mail.ru | Nijmegen University
Peter Velaerts | peter.velaerts@hogent.be | LT3, University College Ghent
Suzan Verberne | s.verberne@let.ru.nl | Radboud University Nijmegen
Reinder Verlinde | R.Verlinde@Elsevier.com | Elsevier
Yuliya Vladimirova | vladimirova83@gmail.com
Tim Wauters | tim.wauters@intec.ugent.be | IBBT & Ghent University
Edgar Weiffenbach | s1422022@student.rug.nl | CLCG, University of Groningen
Eline Westerhout | elinewesterhout@gmail.com | Utrecht University
Thomas Wielfaert | thomas.wielfaert@ugent.be
Martijn Wieling | m.b.wieling@rug.nl | University of Groningen
Sander Wubben | s.wubben@uvt.nl | TiCC, Tilburg University
Jakub Zavrel | zavrel@textkernel.nl | Textkernel
