
Ghent, February 11th 2011

21st meeting of Computational Linguistics in the Netherlands

CLIN-21 was supported by:


Welcome!

For the first time in its over 20-year history, the "Computational Linguistics in the Netherlands" meeting is being held in the beautiful city of Ghent. This year's edition of the meeting is hosted by the Language and Translation Technology Team of University College Ghent.

CLIN-21 will cover a broad spectrum of areas related to natural language and computation. The programme features 55 talks, organized in 5 parallel sessions, and 18 posters on different aspects of computational linguistics. We are delighted that Massimo Poesio from the University of Essex has agreed to give us his vision on current anaphora resolution research. Leonoor van der Beek will present a booklet on the history of language and speech technology in the Netherlands and Flanders. At the CLIN meeting, we will also announce the winner of the STIL Thesis Prize 2011, which is awarded to the best MA thesis in computational linguistics or its applications.

This booklet contains the presentation and poster abstracts for this year's CLIN, as well as the programme schedule. The abstracts are ordered alphabetically.

We hope that CLIN-21 will provide a rewarding forum for the presentation of interesting work and new ideas in the domain of computational linguistics and natural language processing, with stimulating and provocative discussions of successes, failures and new directions. We thank you for your support and participation and wish you a pleasant and fruitful conference!

The CLIN-21 organizing committee
Veronique Hoste
Els Lefever
Kathelijne Denturck
Peter Velaerts



Table of Contents

Welcome!
Table of Contents
Programme
Keynote speaker
  Rethinking anaphora
Presentation Abstracts
  A discriminative syntactic model for source permutation via tree transduction for statistical machine translation
  A Generalized Lexical Acquisition Technique for Improved Parsing Accuracy
  A Semantic Vector Space for Modelling Word Meaning in Context
  A Toolkit for Visualizing the Coherence of Tree-based Reordering with Word-Alignments
  A U-DOP approach to modeling language acquisition
  Age and Gender Prediction on Netlog Data
  Aligning translation divergences through semantic role projection
  An Aboutness-based Dependency Parser for Dutch
  An exploration of n-gram relationships for transliteration identification
  Application of a Constraint Conditional Model for improving the performance of a Sequence Tagger: A Case Study
  Automatic terminology extraction: methods and practical applications
  Automatically Constructing a Wordnet for Dutch
  Automatically determining phonetic distances
  Building a Gold Standard for Dutch Spelling Correction
  Clustering customer questions
  Collecting and using a corpus of lyrics and their moods
  Combined Qualitative and Quantitative Error Analysis in Multi-Topic Authorship Attribution
  Combining e-learning and natural language processing. The example of automated dictation exercises
  Computing Semantic Relations from Heterogeneous Information Sources
  Computing the meaning of multi-word expressions for semantic inference
  Cross-Domain Dutch Coreference Resolution
  Dmesure: a readability platform for French as a foreign language
  Essentials of person names
  Extraction of Historical Events from Unstructured Texts
  Finding Statistically Motivated Features Influencing Subtree Alignment Performance
  From Tokens to Text Entities: Line-based Parsing of Resumes using Conditional Random Fields
  Im chattin :-) u wanna NLP it: Analyzing Reduction in Chat
  Language Evolution and SA-OT: The case of sentential negation
  Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus
  Measuring the Impact of Controlled Language on Machine Translation Text via Readability and Comprehensibility
  Memory-based text completion
  Overlap-based Phrase Alignment for Language Transformation
  Parse and Tag Somali Pirates
  Personalized Knowledge Discovery: Combining Social Media and Domain Ontologies
  Recent Advances in Memory-Based Machine Translation
  Reversible stochastic attribute-value grammars
  Robust Rhymes? The Stability of Authorial Style in Medieval Narratives
  Rule Induction for Synchronous Tree-Substitution Grammars in Machine Translation
  Search in the Lassy Small Corpus
  Simple Measures of Domain Similarity for Parsing
  SSLD: A smart tool for sms compression
  Subtrees as a new type of context in Word Space Models
  Successful extraction of opposites by means of textual patterns with part-of-speech information only
  Syntactic Analysis of Dutch via Web Services
  Technology recycling between Dutch and Afrikaans
  Technology recycling for closely related languages: Dutch and Afrikaans
  The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation
  The use of structure discovery methods to detect syntactic change
  Treatments of the Dutch verb cluster in formal and computational linguistics
  TTNWW: de facto standards for Dutch in the context of CLARIN
  TTNWW: NLP Tools for Dutch as Webservices in a Workflow
  Using corpora tools to analyze gradable nouns in Dutch
  Using easy distributed computing for data-intensive processing
  What is the use of multidocument spatiotemporal analysis?
  Without a doubt no uncomplicated task: Negation cues and their scope
Poster Abstracts
  A database for lexical orthographic errors in French
  A Posteriori Agreement as a Quality Measure for Readability Prediction Systems
  A TN/ITN Framework for Western European languages
  An Examination of Cross-Cultural Similarities and Differences from Social Media Data with respect to Language Use
  Authorship Verification of Quran
  CLAM: Computational Linguistics Application Mediator
  Discriminative features in reversible stochastic attribute-value grammars
  Fietstas: a web service for text analysis
  FoLiA: Format for Linguistic Annotation
  How can computational linguistics help determine the core meaning of then in oral speech?
  Of mathematicians and physicists: the history of language and speech technology in the Netherlands and Flanders
  On the difficulty of making concreteness concrete
  ParaSense or how to use parallel corpora for Cross-Lingual Word Sense Disambiguation
  "Pattern", a web mining module for Python
  Semantic role labeling of gene regulation events
  Source Verification in Quran
  Towards a language-independent data-driven compound decomposition tool
  Towards improving the precision of a relation extraction system by processing negation and speculation
List of Participants


Programme

09:00 - 09:30   Registration and coffee (Foyer)

Parallel sessions, 09:30 - 10:50

Room 328: Social media (chair: Richard Beaufort)
  09:30 - 09:50   Age and Gender Prediction on Netlog Data
                  C. Peersman, W. Daelemans, L. Van Vaerenbergh
  09:50 - 10:10   Im chattin :-) u wanna NLP it: Analyzing Reduction in Chat
                  H. van Halteren, G. Martell, C. Du, Y. Gu, J. Kobben, L. Panjaitan, L. Schubotz, K. Vasylenko, Y. Vladimirova
  10:10 - 10:30   Personalized Knowledge Discovery: Combining Social Media and Domain Ontologies
                  Th. Markus, E. Westerhout, P. Monachesi
  10:30 - 10:50   Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus
                  S. Schrauwen, W. Daelemans

Room 303: Syntax and parsing (chair: Menno van Zaanen)
  09:30 - 09:50   The use of structure discovery methods to detect syntactic change
                  L. ten Bosch, M. Versteegh
  09:50 - 10:10   An Aboutness-based Dependency Parser for Dutch
                  C.H.A. Koster
  10:10 - 10:30   A Generalized Lexical Acquisition Technique for Improved Parsing Accuracy
                  K. Cholakov, G. van Noord, V. Kordoni, Y. Zhang
  10:30 - 10:50   Search in the Lassy Small Corpus
                  G. van Noord, D. de Kok, J. van der Linde

Room 313: Lexical semantics (chair: Tanja Gaustad)
  09:30 - 09:50   A Semantic Vector Space for Modelling Word Meaning in Context
                  K. Heylen, D. Speelman, D. Geeraerts
  09:50 - 10:10   Automatically Constructing a Wordnet for Dutch
                  T. Van de Cruys
  10:10 - 10:30   Successful extraction of opposites by means of textual patterns with part-of-speech information only
                  A. Lobanova
  10:30 - 10:50   Computing the meaning of multi-word expressions for semantic inference
                  C. Cremers

Room 317: Machine translation (chair: Lieve Macken)
  09:30 - 09:50   Rule Induction for Synchronous Tree-Substitution Grammars in Machine Translation
                  V. Vandeghinste, S. Martens
  09:50 - 10:10   Recent Advances in Memory-Based Machine Translation
                  M. van Gompel, A. van den Bosch, P. Berck
  10:10 - 10:30   A discriminative syntactic model for source permutation via tree transduction for statistical machine translation
                  M. Khalilov, Kh. Sima'an, G.M. de Buy Wenniger
  10:30 - 10:50   Measuring the Impact of Controlled Language on Machine Translation Text via Readability and Comprehensibility
                  St. Doherty

Room 403: Methodology (chair: Erik Tjong Kim Sang)
  09:30 - 09:50   Using easy distributed computing for data-intensive processing
                  J. Van den Bogaert
  09:50 - 10:10   Application of a Constraint Conditional Model for improving the performance of a Sequence Tagger: A Case Study
                  T. Stokman
  10:10 - 10:30   Technology recycling between Dutch and Afrikaans
                  L. Augustinus, G. van Huyssteen, S. Pilon
  10:30 - 10:50   Technology recycling for closely related languages: Dutch and Afrikaans
                  S. Pilon, G. van Huyssteen

10:50 - 11:10   Coffee break (Foyer)
11:10 - 11:20   Welcome (Auditorium)
11:20 - 11:30   STIL thesis prize (Auditorium)
11:30 - 12:30   Invited talk: Rethinking anaphora
                M. Poesio (Auditorium)
12:30 - 12:50   Of mathematicians and physicists: the history of language and speech technology in the Netherlands and Flanders
                L. van der Beek (Auditorium)
12:50 - 13:50   Lunch break (Restaurant)

Parallel sessions, 13:50 - 15:10

Room 328: Information extraction (chair: Eline Westerhout)
  13:50 - 14:10   Extraction of Historical Events from Unstructured Texts
                  R. Segers, M. van Erp, L. van der Meij
  14:10 - 14:30   Clustering customer questions
                  F. Nauze
  14:30 - 14:50   Parse and Tag Somali Pirates
                  M. van Erp, V. Malaisé, W. van Hage, V. Osinga, J.M. Coleto
  14:50 - 15:10   From Tokens to Text Entities: Line-based Parsing of Resumes using Conditional Random Fields
                  M. Rotaru

Room 303: Syntax and parsing (chair: Antal van den Bosch)
  13:50 - 14:10   Reversible stochastic attribute-value grammars
                  D. de Kok, G. van Noord, B. Plank
  14:10 - 14:30   The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation
                  D. Theijssen, H. van Halteren, L. Boves, N. Oostdijk
  14:30 - 14:50   Simple Measures of Domain Similarity for Parsing
                  B. Plank, G. van Noord
  14:50 - 15:10   Treatments of the Dutch verb cluster in formal and computational linguistics
                  F. Van Eynde

Room 313: Semantics (chair: Kris Heylen)
  13:50 - 14:10   Computing Semantic Relations from Heterogeneous Information Sources
                  A. Panchenko
  14:10 - 14:30   Essentials of person names
                  M. Schraagen
  14:30 - 14:50   Subtrees as a new type of context in Word Space Models
                  M. Smets, D. Speelman, D. Geeraerts
  14:50 - 15:10   Using corpora tools to analyze gradable nouns in Dutch
                  N. Ruiz, E. Weiffenbach

Room 317: Translation (chair: Vincent Vandeghinste)
  13:50 - 14:10   Finding Statistically Motivated Features Influencing Subtree Alignment Performance
                  G. Kotzé
  14:10 - 14:30   A Toolkit for Visualizing the Coherence of Tree-based Reordering with Word-Alignments
                  G. Maillette de Buy Wenniger
  14:30 - 14:50   Overlap-based Phrase Alignment for Language Transformation
                  S. Wubben, A. van den Bosch, E. Krahmer
  14:50 - 15:10   Aligning translation divergences through semantic role projection
                  T. Vanallemeersch

Room 403: Discourse (chair: Kim Luyckx)
  13:50 - 14:10   Cross-Domain Dutch Coreference Resolution
                  O. De Clercq, V. Hoste
  14:10 - 14:30   What is the use of multidocument spatiotemporal analysis?
                  I. Schuurman, V. Vandeghinste
  14:30 - 14:50   Language Evolution and SA-OT: The case of sentential negation
                  A. Lopopolo, T. Biro
  14:50 - 15:10   Without a doubt no uncomplicated task: Negation cues and their scope
                  R. Morante, S. Schrauwen, W. Daelemans

15:10 - 16:10   Poster session & coffee break (Foyer)

Parallel sessions, 16:10 - 17:10

Room 328: Standards/CLARIN (chair: Gertjan van Noord)
  16:10 - 16:30   TTNWW: de facto standards for Dutch in the context of CLARIN
                  I. Schuurman, M. Kemps-Snijders
  16:30 - 16:50   TTNWW: NLP Tools for Dutch as Webservices in a Workflow
                  M. Kemps-Snijders, I. Schuurman
  16:50 - 17:10   Syntactic Analysis of Dutch via Web Services
                  E. Tjong Kim Sang

Room 303: Syntax/Spelling (chair: Roser Morante)
  16:10 - 16:30   A U-DOP approach to modeling language acquisition
                  M. Smets
  16:30 - 16:50   Memory-based text completion
                  A. van den Bosch
  16:50 - 17:10   Building a Gold Standard for Dutch Spelling Correction
                  T. Gaustad, A. van den Bosch

Room 313: Lexical semantics/Discourse (chair: Tim Van de Cruys)
  16:10 - 16:30   An exploration of n-gram relationships for transliteration identification
                  P. Nabende
  16:30 - 16:50   Combined Qualitative and Quantitative Error Analysis in Multi-Topic Authorship Attribution
                  K. Luyckx, W. Daelemans
  16:50 - 17:10   Robust Rhymes? The Stability of Authorial Style in Medieval Narratives
                  M. Kestemont, W. Daelemans, D. Sandra

Room 317: Beyond text (chair: Paola Monachesi)
  16:10 - 16:50   Combining e-learning and natural language processing. The example of automated dictation exercises
                  R. Beaufort, S. Roekhaut
  16:50 - 17:10   Collecting and using a corpus of lyrics and their moods
                  M. van Zaanen

Room 403: Tools (chair: Els Lefever)
  16:10 - 16:50   Automatic terminology extraction: methods and practical applications
                  D. de Vries
  16:50 - 17:10   SSLD: A smart tool for sms compression
                  L.-A. Cougnon, R. Beaufort

17:10 - 19:00   Drinks (Foyer)


Keynote speaker

Rethinking anaphora

Massimo Poesio

Current models of the anaphora resolution task achieve mediocre results for all but the simpler aspects of the task, such as coreference proper (i.e. linking proper names into coreference chains). One of the reasons for this state of affairs is the drastically simplified picture of the task at the basis of existing annotated resources and models, e.g. the assumption that human subjects by and large agree on anaphoric judgments. In this talk I will present the current state of our efforts to collect more realistic judgments about anaphora through the Phrase Detectives online game, and to develop models of anaphora resolution that do not rely on the total agreement assumption.


Presentation Abstracts


A discriminative syntactic model for source permutation via tree transduction for statistical machine translation

Khalilov, Maxim and Sima'an, Khalil and Maillette de Buy Wenniger, Gideon
ILLC-UvA

Word ordering is still one of the most challenging problems in statistical machine translation (SMT). In most existing work, a word reordering model is implicitly or explicitly incorporated into a translation system based on flat or hierarchical representations of phrases. By contrast, this study addresses the word ordering problem via source-side permutation prior to translation, using hierarchical and syntactic structures.

Our work is driven by the idea that reordering the source sentence as a pre-translation step minimizes the need for reordering during translation and may bridge long-distance order differences, which are outside the scope of commonly used reordering models. Given a word-aligned parallel corpus, we define source string permutation as the task of statistically learning to unfold the crossing alignments between sentence pairs in the parallel corpus.

This work contributes an approach for learning source string permutation via transfer of the source syntax tree, i.e. we define source permutation as the problem of learning how to transfer a given source parse tree into a parse tree that minimizes the divergence from the target word order.

We present a novel discriminative, probabilistic tree transduction model, and contribute a set of empirical oracle results (upper bounds on translation performance) for English-to-Dutch source string permutation under sequence and parse tree constraints. Finally, the translation performance of our learning model is shown to significantly outperform the state-of-the-art phrase-based system.

Corresponding author: maxkhalilov@gmail.com
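
To make the notion of "crossing alignments" concrete, here is a minimal Python sketch (an illustration, not the authors' model) that counts the crossing link pairs a source permutation would need to unfold:

```python
def crossing_pairs(alignment):
    """Count pairs of alignment links that cross each other.

    `alignment` is a set of (source_pos, target_pos) index pairs.
    Two links (s1, t1) and (s2, t2) cross when s1 < s2 but t1 > t2;
    a source permutation that removes all crossings makes the
    alignment monotone with the target word order.
    """
    links = sorted(alignment)
    return sum(1
               for i, (s1, t1) in enumerate(links)
               for s2, t2 in links[i + 1:]
               if s1 < s2 and t1 > t2)

# Toy English-Dutch pair where the verb moves to the end on the target side:
# links she-zij, has-heeft, read-gelezen, the-het, book-boek
print(crossing_pairs({(0, 0), (1, 1), (2, 3), (3, 4), (4, 2)}))  # -> 2
```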


A Generalized Lexical Acquisition Technique for Improved Parsing Accuracy

Cholakov, Kostadin (1) and van Noord, Gertjan (1) and Kordoni, Valia (2) and Zhang, Yi (3)
(1) University of Groningen
(2) DFKI, Germany
(3) University of Saarland, Germany

Unknown words are a major issue for large-scale precision grammars of natural language. In Cholakov and van Noord (COLING 2010) we proposed a maximum entropy based classification algorithm for acquiring lexical entries for all forms in the paradigm of a given unknown word, and we tested its performance on the Dutch Alpino grammar. The study showed an increase in parsing accuracy when our method was applied. However, the general applicability of our approach has been a major point of criticism: it has been considered too specific, and its application to other systems and languages doubtful.

In this presentation, we argue that our method can be applied to any precision grammar provided that the following conditions are fulfilled: a finite set of labels onto which unknown words are mapped; large corpora; a parser which analyses various contexts of a given unknown word and provides syntactic constraints used as features in the classification process; and a morphological component which generates the paradigm(s) of the unknown word. We show that fulfilling these conditions allows us to successfully apply our approach to other large-scale grammars, where it leads to a significant increase in parsing accuracy on test sets of sentences containing unknown words, compared with the accuracy achieved by the default unknown-word handling methods of these grammars.

This provides strong support for our claim that our approach is general enough to be applied to various languages and precision grammars.

Corresponding author: kcholakov@gmail.com
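
The classification step described above can be sketched as follows. This is an illustrative stand-in using scikit-learn's logistic regression (a maximum entropy classifier) with invented feature names and labels, not the actual Alpino-based system:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: for each unknown word, syntactic-constraint
# features collected from parses of several of its contexts, paired with
# the lexical-type label the word should receive (all names invented).
train_features = [
    {"follows_det": True, "heads_pp": False, "suffix=en": False},  # noun-like
    {"follows_det": False, "heads_pp": True, "suffix=en": True},   # verb-like
]
train_labels = ["noun(de)", "verb(intrans)"]

# Maximum entropy classification = multinomial logistic regression.
model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(train_features, train_labels)

print(model.predict([{"follows_det": True, "heads_pp": False, "suffix=en": False}]))
```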


A Semantic Vector Space for Modelling Word Meaning in Context

Heylen, Kris and Speelman, Dirk and Geeraerts, Dirk
QLVL, Katholieke Universiteit Leuven

Semantic vector spaces have become the mainstay of modelling word meaning in statistical NLP. They encode the semantics of words through high-dimensional vectors that record the co-occurrence of those words with context features in a large corpus. Vector comparison then allows for the calculation of, e.g., semantic similarity between words. Most semantic vector spaces represent word meaning on the type (or lemma) level, i.e. their vectors generalize over all occurrences of a word. However, the meaning of words can differ considerably between contexts due to polysemy or vagueness. Therefore, many applications, like Word Sense Disambiguation (WSD) or Textual Entailment, require that word meaning be modelled on the token level, i.e. the level of individual occurrences. In this paper, we present a semantic vector space model that represents the meaning of word tokens by taking the word type vector and reweighting it based on the words observed in the token's immediate vicinity. More specifically, we give a bigger weight to the context features in the original type vector that are semantically similar to the context features observed around the token. This semantic similarity between context features is calculated based on the original word-type-by-context-feature matrix. We explore the performance of this model in a WSD task by visualizing how well the model separates the different meanings of polysemous words in a Multi-Dimensional Scaling solution. We also compare our model to other token-level semantic vector spaces as proposed by Schütze (1998) and Erk & Padó (2008).

References

Erk, K. & S. Padó. 2008. A Structured Vector Space Model for Word Meaning in Context. EMNLP Proceedings, 897-906.
Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97-124.

Corresponding author: kris.heylen@arts.kuleuven.be
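
The reweighting idea described above can be sketched in a few lines of numpy. The matrix layout (feature vectors taken from rows of the same co-occurrence matrix) and the use of a maximum over observed features are simplifying assumptions for illustration, not the authors' exact weighting scheme:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def token_vector(type_vec, observed_features, M):
    """Reweight a word-type vector into a token vector.

    type_vec: co-occurrence vector of the target word (length = n features)
    observed_features: indices of the context features seen around this token
    M: the type-by-feature co-occurrence matrix; since context features are
       words here, its rows also serve as vectors for the features themselves
    Each dimension j of the type vector is scaled by the similarity between
    feature j and the most similar feature observed in the token's context.
    """
    weights = np.array([
        max(cosine(M[j], M[o]) for o in observed_features)
        for j in range(len(type_vec))
    ])
    return type_vec * weights

# Toy 3-feature space; rows of M double as feature vectors.
M = np.array([[2., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
print(token_vector(M[0], observed_features=[1], M=M))
```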


A Toolkit for Visualizing the Coherence of Tree-based Reordering with Word-Alignments

Maillette de Buy Wenniger, Gideon
Institute for Logic, Language and Computation

Tree-based reordering constitutes an important motivation for the increasing interest in syntax-driven machine translation. It has often been argued that tree-based reordering might provide a more effective approach for bridging the word-order differences between source and target sentences. One major approach, known as ITG (Inversion Transduction Grammar), allows permuting the order of the subtrees dominated by the children of any node in the tree. In practice, it has often been observed that word-alignments usually cohere only to a certain degree with this kind of tree-based reordering, i.e. there are cases of word-alignments that cannot be fully explained by tree-based reordering when the tree is fixed a priori. This presentation describes a toolkit for visualizing alignment graphs that consist of a word-alignment together with a source or target tree. More importantly, the toolkit provides a facility for visualizing the coherence of a word-alignment with tree-based reordering, highlighting nodes and word-alignments that are incompatible with one another. The tool allows visualizing the tree-based reordered source/target string as well as the reordered tree. Using our toolkit, we will also present results pertaining to the coverage of the ITG assumption on the word-alignments of a Europarl corpus, which is a very common starting point for training translation systems. We will also dwell on the breakdown of the types of incompatibility into general classes and discuss what that implies for training hierarchical translation models on this type of data.

Corresponding author: gemdbw@gmail.com


A U-DOP approach to modeling language acquisition

Smets, Margaux

In linguistics, there is a debate between empiricists and nativists: the former believe that language is acquired from experience, the latter that there is an innate component for language. The main arguments adduced by nativists are Arguments from Poverty of Stimulus: it is claimed that children acquire certain phenomena which they cannot learn on the basis of experience alone, and that therefore there has to be some innate component for language. In this thesis, we show that at least for certain phenomena that are often used in such arguments, it is possible to explain how children acquire them on the basis of experience alone, viz. with an Unsupervised Data-Oriented Parsing (U-DOP) approach to language.

In the first part of the thesis, we develop concrete implementations of U-DOP, and contribute to the field of unsupervised parsing with two innovations. First, we develop an algorithm that performs syntactic category labeling and parsing simultaneously, and second, we devise a new methodology for unsupervised parsing, which can in principle be applied to any unsupervised parsing algorithm, and which produces the best results reported on the ATIS corpus so far, with a promising outlook for even better results.

In the second part of the thesis, we then use these concrete implementations to show how the acquisition of certain phenomena can be explained in an empirical way. We look in detail at wh-questions, and then show that the U-DOP approach is more general than the nativist account by looking at other phenomena.

Corresponding author: margauxsmets@gmail.com


Age and Gender Prediction on Netlog Data

Peersman, Claudia (1,2) and Daelemans, Walter (1) and Van Vaerenbergh, Leona (2)
(1) CLiPS, University of Antwerp
(2) Artesis University College Antwerp

In recent years millions of people have started using social networking sites such as Netlog to support their personal and professional communications, creating digital communities. However, a common characteristic of these digital communities is that users can easily provide a false name, age, gender and location in order to hide their true identity. This way, social networking sites can be used by people with criminal intentions (e.g. paedophiles) to support their activities online.

In the context of the DAPHNE project (Defending Against Paedophiles in Heterogeneous Network Environments), we present the first results of a machine learning approach for age and gender prediction on a corpus of posts from the social network site Netlog. We investigate which types of linguistic and stylistic features are effective for age and gender prediction, given the specific characteristics of (Dutch) chat language, and compare the effectiveness of different machine learning techniques for age and gender prediction on the Netlog data.

We will conclude our presentation by discussing how these results will guide future research in the DAPHNE project.

Corresponding author: claudia.peersman@ua.ac.be


Aligning translation divergences through semantic role projection

Vanallemeersch, Tom
Centrum voor Computerlinguïstiek, K.U.Leuven

We investigate whether an alignment method based on cross-lingual semantic annotation projection improves over approaches for linguistically uninformed word alignment and purely syntax-based tree alignment, specifically in the area of translation divergences. We apply an SRL system which annotates English sentences with PropBank and NomBank rolesets (verbal and nominal predicates and their semantic roles), and project the predicates and roles to Dutch and French using intersective GIZA++ word alignment. We create additional alignment links by detecting the auxiliary words of predicates (auxiliary, modal and support verbs) in parse trees and by detecting potential Dutch or French predicates based on projected roles. Finally, we investigate whether additional links can be created by training an SRL system on the projected predicates and roles and applying it to the Dutch and French parse trees.

Corresponding author: tallem@ccl.kuleuven.be
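
The projection step can be illustrated with a minimal sketch (toy indices, not the paper's pipeline): role labels attached to source tokens travel along intersective alignment links to the corresponding target tokens.

```python
def project_roles(src_roles, alignment):
    """Project semantic role labels from source to target tokens.

    src_roles: {source_index: role_label}, e.g. PropBank-style labels
    alignment: set of (source_index, target_index) links, e.g. the
               intersection of two GIZA++ alignment directions
    Returns {target_index: role_label}; roles on unaligned tokens
    are simply dropped.
    """
    projected = {}
    for s, t in alignment:
        if s in src_roles:
            projected[t] = src_roles[s]
    return projected

# "gave" is the predicate, "book" its ARG1; both aligned into the target.
print(project_roles({1: "PRED", 3: "ARG1"}, {(0, 0), (1, 2), (3, 4)}))
# -> {2: 'PRED', 4: 'ARG1'}
```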


An Aboutness-based Dependency Parser for Dutch

Koster, Cornelis H.A.
Radboud Universiteit Nijmegen

Dupira (the Dutch Parser for IR Applications) is a new dependency parser for Dutch, which was developed at the University of Nijmegen, based on the older Amazon grammar and lexicon.

Dupira is a rule-based parser, which is generated by means of the AGFL parser generator from the Dupira grammar, lexicon and fact tables. By means of transductions which are specified in the grammar (and can be modified), the parser transduces sentences to dependency trees.

Dupira was developed for applications in Information Retrieval (IR) rather than in Linguistics, and for that reason has the following properties:

- the dependency model of Dupira expresses the aboutness of a sentence rather than describing its complete syntactic structure;
- therefore it is highly suitable for extracting factoids from running text;
- it is also possible to extract dependency triples, which can be used as high-accuracy terms for text categorization and full-text search;
- Dupira performs certain aboutness-preserving normalizing transformations, including de-passivization and de-topicalization, in order to enhance recall;
- it makes extensive use of subcategorization preferences to resolve, where possible, the attachment of preposition phrases;
- it is highly robust, both lexically and syntactically, and fast enough for practical applications.

In this presentation, we discuss the aboutness-based dependency model and the way in which the grammar describes the Dutch language. We show by means of examples the transduction of clauses and phrases. We report the availability of Dupira version 0.8 in the public domain and the plans we have for further development, and discuss some of its applications.

Corresponding author: kees@cs.ru.nl


An exploration of n-gram relationships for transliteration identification

Nabende, Peter
University of Groningen

Transliteration identification aims at building quality bilingual lexicons for complementing and improving performance in various NLP applications, including Machine Translation (MT) and Cross-Language Information Retrieval (CLIR). The main task is to identify matching words across different writing systems from a given data source. Recent evaluations (Kumaran et al., 2010) show that no single approach achieves a consistently best performance in identifying transliterations for different language pairs: an approach that leads to the identification of quality matches between English and Russian may result in many incorrect matches between English and Chinese. In this paper, we conduct experimental settings utilizing n-gram relationships for computing candidate transliteration similarity scores, which are subsequently evaluated for choosing potential transliteration matches. We use datasets from the 2010 transliteration generation shared task (Li et al., 2010) for five language pairs: English-Russian, English-Chinese, English-Hindi, English-Tamil, and English-Kannada. For each language pair, we explore various n-gram relationships, starting from the unigram case up to higher-order n-grams. Results show that higher-order n-grams lead to better transliteration identification quality across all languages; which higher-order n-grams perform best, however, differs per language pair. For example, a pair trigram model outperforms a pair 4-gram model on an English-Russian dataset, while the reverse is true for an English-Pinyin (Romanized Chinese) dataset. The results are promising as we aim at eliciting such n-gram correspondences for use in more complex stochastic models, such as pair Hidden Markov Models (pair HMMs), which we postulate may lead to even better transliteration identification quality.

Corresponding author: p.nabende@rug.nl
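
As one concrete instance of an n-gram relationship between candidate transliterations, a Dice coefficient over character n-grams can serve as a similarity score. This simple measure is an illustration, not the paper's exact scoring model:

```python
from collections import Counter

def ngrams(word, n):
    """All character n-grams of a word, e.g. bigrams of 'anna' -> an, nn, na."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def dice_similarity(a, b, n=2):
    """Dice coefficient over the character n-gram multisets of two words."""
    ga, gb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    overlap = sum((ga & gb).values())
    total = sum(ga.values()) + sum(gb.values())
    return 2 * overlap / total if total else 0.0

# Scoring a Latin-script word against romanized candidates: higher = better match.
print(dice_similarity("peter", "petr"))   # ~0.57
print(dice_similarity("peter", "maria"))  # 0.0
```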


Application of a Constraint Conditional Model for improving the performance of a Sequence Tagger: A Case Study

Stokman, Tim
Textkernel

Natural language classifiers often ignore natural global constraints arising from the nature of the domain. These constraints are often hard to learn because they are global, while most classifiers make local decisions. Given that the classifier can produce a probability distribution over the possible assignments, we can search this solution space for a set of assignments satisfying the natural constraints. On this solution space, we can formulate an ILP model that maximizes the probability of the assignment while conforming to all constraints formulated in the model.

We apply this approach to a number of datasets and develop a set of natural constraints for these datasets. We will expand on the practical implementation issues encountered and show the performance improvements that arise from using the constraint conditional model approach.

Corresponding author: timstokman@gmail.com
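
The idea of searching the classifier's solution space under a global constraint can be sketched with exhaustive search. A real system would formulate this as the ILP mentioned above; the tagging task and constraint here are invented for illustration:

```python
from itertools import product

def constrained_best(probs, labels, satisfies):
    """Pick the label sequence with maximal probability that meets a
    global constraint the local classifier cannot enforce by itself.

    probs: per-token distributions, probs[i][label] = P(label | token i)
    satisfies: predicate over a full label sequence (the global constraint)
    Exhaustive search for clarity; ILP or beam search scales better.
    """
    best, best_p = None, -1.0
    for seq in product(labels, repeat=len(probs)):
        if not satisfies(seq):
            continue
        p = 1.0
        for i, lab in enumerate(seq):
            p *= probs[i][lab]
        if p > best_p:
            best, best_p = seq, p
    return best

# Toy constraint: a tagged segment must contain exactly one 'NAME' token.
probs = [{"NAME": 0.6, "O": 0.4}, {"NAME": 0.5, "O": 0.5}]
print(constrained_best(probs, ["NAME", "O"],
                       lambda s: s.count("NAME") == 1))  # ('NAME', 'O')
```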


Automatic terminology extraction: methods and practical applications

de Vries, Dennis
GridLine

Some of the useful applications of terminology that GridLine provides are automatic assignment of keywords to documents, integration of thesauri with search engines, enrichment of search queries using a multilingual thesaurus, and aiding writers in the correct use of terminology.

The problem many organisations face, though, is that they do not have a list of their specific terminology. Creating these lists manually, by examining company documentation and interviewing experts, is very expensive and time-consuming. Therefore, GridLine developed instruments for automatically extracting organisation-specific terminology from document collections. Additionally, after extracting terminology, we can build a thesaurus by automatically determining semantic relations between terms.

For the extraction of terms and semantic relations we use a variety of linguistic methods (lemmatizer, POS tagger, compound splitter) and statistical methods (unithood, termhood). In this presentation I will give a brief overview of these extraction techniques and show some examples of projects we did for our customers. In particular, I will talk about Termtreffer, an easy-to-use application for term extraction which we made for the Nederlandse Taalunie. In this application, users can extract terms from documents using custom combinations of linguistic and statistical modules for term extraction. Additionally, they can manage, analyze and edit the resulting terminology lists.

GridLine is a growing company based in the center of Amsterdam. Currently we are market leader in Dutch language technology.

Corresponding author: dennis@gridline.nl
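
As an illustration of a statistical termhood measure (a generic textbook-style example; the abstract does not specify GridLine's actual formulas), candidate terms can be ranked by how much more frequent they are in domain text than in a reference corpus:

```python
from collections import Counter

def termhood(candidates, domain_tokens, reference_tokens):
    """Rank term candidates by relative frequency ratio: how much more
    frequent a candidate is in the domain corpus than in a reference
    corpus. One of many statistical 'termhood' measures; add-one
    smoothing keeps unseen reference words from dividing by zero."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = len(domain_tokens), len(reference_tokens)
    scores = {c: ((dom[c] + 1) / n_dom) / ((ref[c] + 1) / n_ref)
              for c in candidates}
    return sorted(scores.items(), key=lambda kv: -kv[1])

domain = "de hypotheek rente van de hypotheek".split()
reference = "de kat zat op de mat".split()
print(termhood(["hypotheek", "de"], domain, reference))
# 'hypotheek' outranks the function word 'de'
```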


Automatically Constructing a Wordnet for Dutch

Van de Cruys, Tim
INRIA & Université Paris VII

In this talk, we describe the automatic construction of a wordnet for Dutch by combining a number of different sources of semantic information. First, a number of unsupervised and semi-supervised techniques are presented for the extraction of different kinds of semantic information. This includes techniques based on distributional similarity and clustering, but also techniques that extract semantic information from semi-structured and multilingual resources (such as Wiktionary). The second part describes how the output of these techniques may be combined with the structure of the original Princeton WordNet for English, which allows for the automatic construction of a wordnet for Dutch. Contrary to existing resources, the extracted resource also includes named entities. The resource is evaluated against CORNETTO, an existing, manually constructed wordnet for Dutch.

Corresponding author: Tim.Van_de_Cruys@inria.fr


Automatically determining phonetic distances

Wieling, Martijn and Margaretha, Eliza and Nerbonne, John
University of Groningen

This study seeks to induce the distance between phonetic segments based on their correspondences in dialect atlas material. In other words, we induce information about the physical realization of sounds from their dialectal distributions.

We algorithmically align segments in pairs of pronunciations at various sites in order to identify corresponding sounds. We then apply an information-theoretic measure, Pointwise Mutual Information, in order to automatically determine phonetic distances based on the relative frequency of correspondences. We repeat these steps until the alignments (and segment distances) stabilize.

We evaluate the quality of the obtained phonetic distances by comparing them to acoustic vowel distances. For two separate dialect datasets, Dutch and German, we find highly significant correlations between the induced phonetic distances and the acoustic distances, indicating that the frequency of correspondence in dialect material conveys information about the constitution of sounds. We close with some speculations about the usefulness of the method.

Corresponding author: m.b.wieling@rug.nl
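
The Pointwise Mutual Information step can be sketched as follows, assuming aligned segment pairs have already been harvested from the pronunciation alignments (one iteration only; the full method re-aligns with the new distances until they stabilize):

```python
import math
from collections import Counter

def pmi_distances(aligned_pairs):
    """Turn aligned segment pairs into PMI-based segment distances.

    aligned_pairs: list of (segment_a, segment_b) correspondences
    harvested from pairwise pronunciation alignments.
    PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ): frequently corresponding
    segments get high PMI, which we convert into a distance by negating
    and shifting so that 0 means 'most similar'.
    """
    pair_counts = Counter(aligned_pairs)
    seg_counts = Counter(s for pair in aligned_pairs for s in pair)
    n_pairs, n_segs = len(aligned_pairs), 2 * len(aligned_pairs)
    pmi = {
        (x, y): math.log2((c / n_pairs) /
                          ((seg_counts[x] / n_segs) * (seg_counts[y] / n_segs)))
        for (x, y), c in pair_counts.items()
    }
    max_pmi = max(pmi.values())
    return {pair: max_pmi - v for pair, v in pmi.items()}

print(pmi_distances([("a", "a"), ("a", "a"), ("a", "e"), ("o", "u")]))
```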


Building a Gold Standard for Dutch Spelling Correction

Gaustad, Tanja and van den Bosch, Antal
TiCC, Tilburg University

The main question in the NWO project "Implicit Linguistics" is whether abstract linguistic representations are necessary as an intermediate step in NLP models and systems. To investigate this, we focus on text-to-text processing tasks, i.e. processes which map form to form. In particular, we are investigating Dutch spelling correction, where a corrupted text is converted to a clean version of the same text.

In order to test the quality of a spelling corrector, a gold standard is needed. This, however, does not yet exist for Dutch. For this reason, we set out to build such a gold standard, containing a mixed selection of texts in which we aim to mark all errors and their corrections. In this talk, we will present the gold standard, including inter-annotator agreement and other statistics relating to the data used. Furthermore, we will present first results from applying our language model WOPR to the corpus, comparing it against two baselines: a high-precision known-error list and a context-insensitive lexical baseline. Evaluation is performed in terms of precision and recall on detection and correction on full text.

Corresponding author: T.Gaustad@uvt.nl


Clustering customer questions

Nauze, Fabrice
Q-go

Q-go's natural language search technology powers the search box of many corporate websites. Its NLP technology allows customers to ask questions in their own words and returns a small set of relevant answers. Hundreds of millions of questions have already been processed and answered with Q-go's solution, providing us with a mine of data. In order to improve our knowledge of what customers are asking and to help further refine our core systems, Q-go needs a way to automatically cluster relevant queries from large sets of customer questions.

To achieve this goal we tested several standard clustering methods on sets of customer questions. The outline of the talk is as follows. First, we will explain the specific challenges one has to face when clustering customer questions (very short queries, typos, etc.). We will then present the clustering algorithms that have been tested (among others k-means, GAAC hierarchical clustering, and mini-batch k-means). Thirdly, we will outline two different types of heuristics, used in the first case to improve the quality of the vector representations feeding the clustering algorithms, and in the second to overcome the curse of dimensionality. Finally, the different methods will be evaluated and compared with respect to processing speed and intrinsic quality of clustering (as well as its practical usefulness).

Corresponding author: fabrice.nauze@q-go.com
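
A minimal version of one of the tested setups, using scikit-learn's MiniBatchKMeans over TF-IDF vectors. The character n-gram featurization is an assumption, chosen here because it softens the effect of typos in very short queries:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

questions = [
    "how do I reset my password",
    "forgot my password, what now",
    "what are the transfer fees",
    "how much does a transfer cost",
]

# Very short queries yield sparse vectors; character n-grams soften typos.
vectors = TfidfVectorizer(analyzer="char_wb",
                          ngram_range=(3, 4)).fit_transform(questions)
km = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for question, label in zip(questions, km.labels_):
    print(label, question)
```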


Collecting and using a corpus of lyrics and their moods

van Zaanen, Menno
Tilburg University

Recently, there has been an increase in the availability of music in digital formats. This has led to music collections that are different in nature from those in the past. Collections are typically larger and consist of a selection of individual pieces instead of complete albums. Since playing any musical piece from the collection can be done without physically changing the medium, listeners create playlists that allow them to identify a subset of the collection and determine the order in which the pieces are played.

People creating playlists often want to group pieces based on their emotional load (such as happy or sad). Creating such playlists, however, is time-consuming and requires knowledge of the music in the collection, since emotional information is not explicitly encoded with the pieces. We will describe a system that analyzes musical pieces and, based on the lyrics, classifies them into their corresponding mood class. This system is developed and evaluated using a corpus of lyrics of songs and their corresponding mood. The mood tags were collected by social tagging of musical pieces using the Moody iTunes plugin, developed by the company Earth People within the Crayon Room project. Starting from a list of (artist, title, mood) triples, the corresponding lyrics of the songs have been collected. This has led to a corpus containing the lyrics of 5,631 songs, which will be made publicly available.

Corresponding author: mvzaanen@uvt.nl


Combined Qualitative and Quantitative Error Analysis in Multi-Topic Authorship Attribution

Luyckx, Kim and Daelemans, Walter
CLiPS, Antwerp University

In authorship attribution, function words are considered the ideal feature type to deal with the complexity of the task. There is a consensus that they are topic-neutral, highly frequent, and not under the author's conscious control. However, it has been shown that hardly any of these allegedly topic-neutral features are in fact topic-neutral. Topic seems to be hard to separate from authorial style, irrespective of the type of features used to predict authorship. Although function words are robust to limited data and provide good indicators of authorship, the a priori exclusion of content words causes a lot of useful information to be disregarded.

We discuss experiments in multi-topic authorship attribution and zoom in on the features that constitute the attribution model. Qualitative analysis of results is typically lacking in authorship attribution studies, since many studies focus on performance but refrain from going into detail about the features selected.

In this talk, we show that high performance does not always imply a viable approach. More specifically, we zoom in on unique identifiers: features that occur exclusively with a specific authorship class in training and uniquely identify a test instance by the same author. Although a coincidence (the frequency of a feature in an unseen test set, or the topic of that test set, cannot be predicted), topic-related unique identifiers provide the model with an unfair advantage that will not scale. However, the absence of unique identifiers does not necessarily imply a scalable approach. Although this talk focuses on authorship attribution, we think any task in text mining would benefit from consistent error analysis.

Corresponding author: kim.luyckx@ua.ac.be
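
The unique identifiers discussed above can be found mechanically. This small sketch (with invented toy features) returns every training feature that occurs with exactly one authorship class:

```python
from collections import defaultdict

def unique_identifiers(docs):
    """Find features that occur with exactly one authorship class.

    docs: list of (author, features) pairs from the training data.
    Returns {feature: author} for every feature seen with one author
    only; such 'unique identifiers' trivially decide any test text
    containing them, which can inflate attribution scores.
    """
    seen = defaultdict(set)
    for author, features in docs:
        for f in features:
            seen[f].add(author)
    return {f: next(iter(a)) for f, a in seen.items() if len(a) == 1}

train = [("A", {"the", "whale", "of"}), ("B", {"the", "scarlet", "of"})]
print(unique_identifiers(train))  # {'whale': 'A', 'scarlet': 'B'}
```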


Combining e-learning and natural language processing. The example of automated dictation exercises

Beaufort, Richard and Roekhaut, Sophie
UCL CENTAL

E-learning is a way of delivering education based on the use of electronic tools and content, either delivered on CD-ROMs or managed through network connections. The idea behind e-learning is to improve both the learning and its management. To this end, exercises are frequently automated. Of course, the ideal automation would involve the three distinct steps of an exercise: its preparation, its realization (by the student) and its correction.

Up to now, automation has led to exercises, like gap-fill texts or multiple-choice tests, which limit the kinds of knowledge that can be assessed. This is due to the fact that the correction step, for some exercises, is far from easy to automate. An eloquent example of such an exercise is dictation, the activity where the teacher reads a passage aloud and the learners write it down. While automatically reading an unknown text aloud is not a problem as long as a reliable text-to-speech synthesis system is available, the accurate correction of a learner's copy can quickly become a nightmare.

The correction of a dictation copy involves two steps: first, the detection of the actual locations of errors; second, the classification of these errors. In this paper, we present a way of automating these two steps. The detection step is based on a finite-state string alignment between the copy and the original. The classification step is the best result of a finite-state intersection between all possible automatic analyses of an error and the single analysis of the corresponding correct form.

Corresponding author: richard.beaufort@uclouvain.be
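
The detection step can be approximated with a classic dynamic-programming string alignment. The finite-state machinery of the actual system is beyond a short sketch, but the alignment it computes is of this kind:

```python
def align(copy, original):
    """Align a learner's copy with the original dictation text and
    report the error sites. Classic edit-distance alignment (a
    dynamic-programming stand-in for the finite-state alignment the
    abstract describes); returns (copy_word, original_word) pairs
    that differ, with None marking insertions and deletions."""
    n, m = len(copy), len(original)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (copy[i - 1] != original[j - 1]))
    pairs, i, j = [], n, m
    while i or j:
        if i and j and d[i][j] == d[i - 1][j - 1] + (copy[i - 1] != original[j - 1]):
            pairs.append((copy[i - 1], original[j - 1])); i, j = i - 1, j - 1
        elif i and d[i][j] == d[i - 1][j] + 1:
            pairs.append((copy[i - 1], None)); i -= 1
        else:
            pairs.append((None, original[j - 1])); j -= 1
    return [(a, b) for a, b in reversed(pairs) if a != b]

print(align("il mange les pomme".split(), "il mange les pommes".split()))
# -> [('pomme', 'pommes')]
```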


Abstract<br />

38<br />

CLIN 21 – CONFERENCE PROGRAMME<br />

Computing Semantic Relations from Heterogeneous<br />

Information Sources<br />

Panchenko, Alexander<br />

UCL CENTAL<br />

Computation of semantic relations between terms or concepts is a general problem in Natural Language Processing and a subtask of automatic thesaurus construction.

This work describes and compares the available heterogeneous information sources which can be used for mining semantic relations, such as texts, electronic dictionaries and encyclopedias, lexical ontologies and thesauri, folksonomies, word surface forms, query logs of search engines, and so forth. Most existing algorithms use a single information source for extracting semantic knowledge: Distributional Analysis relies on text, Extended Lesk uses dictionary definitions, the Jiang-Conrath distance employs a semantic network such as WordNet, and so on. We show that different methods capture different aspects of the terms' relatedness: while one acquires similarities of word contexts, others capture similarities of syntactic contexts, term definitions, surface forms, etc.

In these settings, there is a need for a general model capable of aggregating different aspects of semantic similarity from all available information sources and methods in an optimal and consistent way. We discuss how such a model can be implemented with a linear combination, and using tensors (i.e. multi-way arrays). We describe two ways of using tensors for the calculation of semantic relations in the context of multiple information sources, which we call the "adjacency tensor" and the "feature tensor". The sparse tensor factorization methods PARAFAC, Non-negative Tensor Factorization (NTF) and Memory-Efficient Tucker (MET) are suggested in order to fuse information about terms from different methods and information sources. We conclude that tensors can be used for representing terms, while tensor factorizations can serve to generalize data about terms' relatedness.
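A minimal sketch of the linear-combination variant (the tensor factorizations themselves require dedicated toolkits; matrix names and weights below are illustrative, not the paper's):

    import numpy as np

    def combine_similarities(matrices, weights):
        """Linearly aggregate term-by-term similarity matrices, one per
        information source (e.g. distributional, definition-based, WordNet)."""
        combined = np.zeros_like(matrices[0], dtype=float)
        for m, w in zip(matrices, weights):
            rng = m.max() - m.min()
            # Normalise each source to [0, 1] so the weights are comparable.
            combined += w * ((m - m.min()) / rng if rng else m)
        return combined

    # A "feature tensor" in the sense of the abstract can then be formed by
    # stacking the per-source matrices along a third mode:
    #   tensor = np.stack(matrices, axis=2)   # terms x terms x sources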

Corresponding author: alexander.panchenko@student.uclouvain.be


Computing the meaning of multi-word expressions for semantic inference
Cremers, Crit
Leiden University

The immense diversity of multi-word expressions in every language imposes heavy requirements on the lexicon, the grammar and their interface for deep semantic analysis to be feasible. The lexicon for meaning-driven NLP is huge and phrasal.

We present a model for dealing with extended lexical units in a parser/generator for Dutch that aims at the logical computation of entailments and presuppositions. The model consists of three components: a fiat architecture for the computational lexicon, an efficient organization of online lexical retrieval, and a selectional method of semantic underspecification. In the fiat lexicon, all combinatory instances of all lexical varieties of all (semantically) relevant constructions are spelled out as feature-value graphs. Each of the feature-value graphs contains all combinatory information needed for syntactic and semantic processing. This constructicon is produced offline. It is managed online by a retrieval system that selects contextually required and adequate constructions in linear time. The underspecification allows the combinatory result to be disambiguated by evaluating the structure of the representations.

The model will be demonstrated by reference to two particular constructions: the Dutch way-construction (Poss 2009) and the Dutch honger-construction. The first ("jij hebt je een weg uit de gevangenis geslijmd", lit. "you have slimed yourself a way out of prison") exemplifies an intriguing combination of lexical restrictions, productivity, structure sensitivity and semantic specificity. The second ("ik heb geen erg grote honger", lit. "I have no very big hunger", i.e. "I am not very hungry") exemplifies transcategorial semantic effects, where open modification of a noun phrase requires translation into propositional and state-sensitive operators.

Corresponding author: c.l.j.m.cremers@hum.leidenuniv.nl


Cross-Domain Dutch Coreference Resolution
De Clercq, Orphée and Hoste, Véronique
LT3 Language and Translation Technology Team, University College Ghent

For the STEVIN-funded SoNaR project, a Dutch reference corpus of 500 million words is being built. At the same time, a one-million-word subset is progressively enriched with semantic information: named entities, coreference, spatio-temporal relations and semantic roles. As a prerequisite, existing schemes and systems developed for Dutch are to be reused to the fullest extent possible. In this talk we present the ongoing task of annotating this subset with coreference information, following existing guidelines for Dutch (Bouma et al. 2007).

The basis for our coreference resolver is an existing mention-pair approach (Hoste 2005, Hendrickx et al. 2008) for Dutch. One of the main challenges in the domain of coreference resolution is portability across different domains and languages. Since one of the great advantages of the SoNaR corpus is its diversity (the 1MW subset itself comprises six text types), we trained our system both on each text type separately and on all types combined. We will report cross-type (e.g. administrative, external communication, instructive, journalistic text) cross-validation results for different NP types and present an extensive qualitative error analysis. We compare performance when providing perfect markables, derived from deep parsing (Alpino; Van Noord et al. 2006), with automatically generated markables, and investigate the added value of integrating additional semantic information resulting from other annotation layers.
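A minimal sketch of how a mention-pair resolver of this kind creates its classification instances (features, names and the toy example are illustrative, not those of the actual system):

    def make_instances(mentions, max_dist=10):
        """Pair each anaphor with preceding candidate antecedents; a binary
        classifier later decides 'coreferent or not' for every pair."""
        instances = []
        for j, ana in enumerate(mentions):
            for i in range(max(0, j - max_dist), j):
                ante = mentions[i]
                instances.append({
                    "distance": j - i,                           # mentions apart
                    "string_match": ante["text"].lower() == ana["text"].lower(),
                    "same_np_type": ante["np_type"] == ana["np_type"],
                    "label": ante["entity"] == ana["entity"],    # gold coreference
                })
        return instances

    mentions = [
        {"text": "de onderzoekster", "np_type": "common", "entity": 1},
        {"text": "zij", "np_type": "pronoun", "entity": 1},
        {"text": "het corpus", "np_type": "common", "entity": 2},
    ]
    print(len(make_instances(mentions)))  # -> 3 candidate pairs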

References

G. Bouma, W. Daelemans, I. Hendrickx, V. Hoste, and A. Mineur. 2007. The COREA project: Manual for the annotation of coreference in Dutch texts. Technical report, University of Groningen.

I. Hendrickx, V. Hoste, and W. Daelemans. 2008. Semantic and Syntactic Features for Anaphora Resolution for Dutch. In Lecture Notes in Computer Science, Volume 4919, Proceedings of the CICLing-2008 conference, pages 351–361. Berlin: Springer Verlag.

V. Hoste. 2005. Optimization Issues in Machine Learning of Coreference Resolution. Ph.D. thesis, Antwerp University.

G. van Noord, I. Schuurman, and V. Vandeghinste. 2006. Syntactic Annotation of Large Corpora in STEVIN. In Proceedings of LREC 2006, Genoa.

Corresponding author: orphee.declercq@hogent.be


Dmesure: a readability platform for French as a foreign language
François, Thomas
UCL CENTAL

It is a well-known fact that reading practice improves the reading abilities of L1 as well as L2 students. However, once one strays from textbooks, matching individuals with texts of an adequate level of difficulty is far from easy. FFL teachers have all, at some time or other, wasted time carrying out such a task.

Our research aims to provide the community with a web platform, called Dmesure, able to retrieve texts from the web on a specific topic and at a specific readability level. We will present the current version of this platform, in which texts are first retrieved through the Yahoo search engine before being assessed for difficulty using the readability measure described in François (2009). It is worth noting that the output of Dmesure is compliant with the proficiency scale set in the Common European Framework of Reference for Languages, which makes this tool very convenient for FFL teachers.

We also address some specific problems encountered when applying a readability measure to web texts. Among them, we consider the influence of boilerplate on the readability measure, and some ways to reject pages whose language diverges too much from the norm. To conclude, we show that, because Dmesure has been developed from a participative perspective, it makes it possible to collect new texts annotated by teachers. The built-in readability model can therefore be retrained periodically on this enhanced corpus.

Corresponding author: thomas.francois@uclouvain.be


Essentials of person names
Schraagen, Marijn
Leiden Institute of Advanced Computer Science

The frequency of spelling variation and errors in person names is relatively high compared to normal vocabulary. Standardization of a name to some base form, or core, could be useful in named entity matching or record linkage. Two types of core are investigated: the semantic core and the syntactic core. The semantic core approach exploits the idea that names in Dutch, especially surnames, have meaning: a surname is usually based on a first name, a location, a profession or a personal characteristic. The semantic component is subject to heavy and unpredictable modifications due to suffixes, inflections and compounding; suffix-removal techniques are therefore less successful for names than for standard vocabulary. The semantic component itself is, however, relatively stable, and the set of semantic categories is reasonably restricted. Therefore, a word-list approach can be applied to names, which avoids learning or designing complex suffix-removal rules. An alternative approach is to extract the syntactic core of a name. The syntactic core is the (possibly discontinuous) character sequence that remains constant or phonetically equivalent in all variants of a name. The syntactic core can be analysed on various linguistic levels. An advantage of the syntactic approach is that a word list is not needed, and therefore the procedure can also be applied to unknown names or names without a vocabulary-based meaning component (such as first names). Algorithms for extracting semantic and syntactic cores are discussed, and an application is provided for the problem of record linkage in data mining.
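A minimal sketch of the syntactic-core idea, approximated here as the longest (possibly discontinuous) character subsequence shared by all attested variants; the algorithms discussed in the talk may differ, e.g. in handling phonetic equivalence, and pairwise reduction is only a heuristic for more than two variants:

    from functools import reduce

    def lcs(a, b):
        """Longest common subsequence of two strings (dynamic programming)."""
        table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a):
            for j, cb in enumerate(b):
                table[i + 1][j + 1] = (table[i][j] + ca if ca == cb else
                                       max(table[i][j + 1], table[i + 1][j], key=len))
        return table[-1][-1]

    def syntactic_core(variants):
        """Character material that survives in every spelling variant."""
        return reduce(lcs, variants)

    print(syntactic_core(["Janssen", "Jansen", "Janszen"]))  # -> Jansen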

Corresponding author: schraage@liacs.nl


Extraction of Historical Events from Unstructured Texts
Segers, Roxane and van Erp, Marieke and van der Meij, Lourens
VU University Amsterdam

Historiography revolves around events, as these express important points of change in reality. We postulate that events can play an important role in improving automated search and data integration in the historical domain, as events connect information about who did what, where and when.

We present a pattern-based approach to automatically extract historical named events like "French Revolution" and "Second World War" from unstructured texts in Dutch. The extracted events are the backbone of a structured event thesaurus that will consist of events with their time, place and participants.

In our approach we make a distinction between external and internal event patterns. For collecting external event patterns like 'during the', we retrieved text snippets for a number of seed events. We ranked the pattern candidates by their frequency and co-occurrence with different events. Next, we ran the pattern collection over a domain-specific corpus. We evaluated the precision of the extracted historical event candidates by the number of patterns that extracted the event and the confidence score of these patterns.

The extracted events were used as input for obtaining event-internal patterns. We classified and analysed the events based on their morpho-syntactic structure: this yielded patterns such as "Massacre of Y". To expand these patterns, we used the heads of the events to query WordNet: this yielded new internal patterns such as "Bloodbath of Y".

As a result we obtained a library of external and internal patterns that can be used to extract named events from unstructured texts. The combination of internal and external patterns is vital, as the combined library outperforms each pattern type on its own.
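A minimal sketch of the external-pattern ranking step, assuming patterns are the words immediately preceding a seed event and are scored by frequency and by how many distinct seeds they co-occur with (the exact scoring in the paper may differ):

    from collections import Counter

    def rank_patterns(snippets):
        """snippets: (text, seed_event) pairs retrieved for the seed events."""
        freq, seeds_seen = Counter(), {}
        for text, seed in snippets:
            left = text.split(seed)[0].split()
            pattern = " ".join(left[-2:])        # e.g. "during the"
            if pattern:
                freq[pattern] += 1
                seeds_seen.setdefault(pattern, set()).add(seed)
        # Prefer patterns that are frequent AND generalise across seeds.
        return sorted(freq, key=lambda p: (len(seeds_seen[p]), freq[p]),
                      reverse=True)

    snippets = [("riots during the French Revolution", "French Revolution"),
                ("losses during the Second World War", "Second World War")]
    print(rank_patterns(snippets))  # -> ['during the']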

Corresponding author: rh.segers@vu.nl


Finding Statistically Motivated Features Influencing Subtree Alignment Performance
Kotzé, Gideon
University of Groningen

We present results of an ongoing investigation involving a manually aligned parallel treebank and an automatic tree aligner. Using the parallel treebank as a test set, we establish features that show a significant correlation with alignment performance. Our conclusion is that lexical features generally have a more significant influence than tree features. We present these findings with a discussion of their significance and with reference to possible applications in the alignment of parallel texts for machine translation.

Corresponding author: g.j.kotze@rug.nl


From Tokens to Text Entities: Line-based Parsing of Resumes using Conditional Random Fields
Rotaru, Mihai
Textkernel NL

Resumes (curricula vitae) form a challenging category of semi-structured documents. Regardless of the language, most resumes tend to be structured in sections: e.g. experience, education, skills, personal. Consequently, the first task of a resume information extraction system is to segment the resume into sections. We cast the section segmentation problem as a sequence labeling problem. In this paper, we show practical results that compare two approaches. The first approach works at the word level and uses Hidden Markov Models (HMMs) with words as the HMM observations. The second approach works at the line level and uses Conditional Random Fields (CRFs) and a variety of features computed for each line. We find that the CRF approach significantly outperforms the HMM approach on this real-world task, and that the improvement is also reflected in the later stages of our resume information extraction pipeline. In addition, this result generalizes across several languages after porting the corresponding CRF features. The main advantages of the CRF approach are the expressiveness of the features (e.g. they easily express information that spans multiple words) and the fact that it makes the practical assumption that a line of text belongs to a single section.
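A minimal sketch of line-level features for such a labeller, shown here with the open-source sklearn-crfsuite package rather than the software used in the paper; the feature names, and the assumed inputs resumes and labels, are illustrative:

    import sklearn_crfsuite  # pip install sklearn-crfsuite

    def line_features(line):
        """Per-line features; expressing information that spans several
        words is exactly what the line-level view makes easy."""
        tokens = line.split()
        return {
            "n_tokens": len(tokens),
            "all_caps": line.isupper(),
            "has_year": any(t.isdigit() and len(t) == 4 for t in tokens),
            "has_email": "@" in line,
            "first_word": tokens[0].lower() if tokens else "",
        }

    # Assumed given: resumes (list of strings) and labels (one section
    # label per line per resume, e.g. experience/education/skills).
    X = [[line_features(l) for l in r.splitlines()] for r in resumes]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, labels)
    sections = crf.predict(X)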

Corresponding author: rotaru@textkernel.nl


Im chattin :-) u wanna NLP it: Analyzing Reduction in Chat
van Halteren, Hans (1) and Martell, Craig (2) and Du, Caixia (3) and Gu, Yan (3) and Kobben, Johan (3) and Panjaitan, Leequisach (3) and Schubotz, Louise (3) and Vasylenko, Kateryna (4)
(1) Radboud University Nijmegen
(2) Naval Postgraduate School, Monterey
(3) ReMa L&C, RUN/UvT
(4) ReMa L&C

Modern NLP research attempts to cover the whole spectrum from written to spoken text. Right in the middle we find chat text, a written text type which has many similarities with spoken text. One of these is spelling variation, often reduction, e.g. nite instead of night. It is clear that, if we ever want to analyze or generate chat text, we have to understand the factors behind this spelling behavior, be it user experience with SMS, peer-group identification through speech-like spelling, or something else.

This paper contributes by studying spelling reduction in chat text. We investigated cases of reduction in the NPS Chat Corpus. After identifying various types in 2000 posts from the publicly available part of the corpus, we focused on four frequent phenomena: a) wanna (want to) and gonna (going to); b) ya and u (you); c) g-dropping in present participles, e.g. findin for finding; and d) apostrophe dropping in enclitics, e.g. hes for he's. For these, we automatically extracted all occurrences of both reduced and full forms in 1 million words from the complete corpus. For each, we also determined features which could influence the choice between the alternating forms, such as the poster's age group and the immediate context in the post. On the basis of this, we built regression models to find out which of the features show a significant influence.
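A minimal sketch of such a regression (with statsmodels, our choice here; the toy data and predictors are purely illustrative):

    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per occurrence of an alternating form, e.g. "u" vs "you".
    df = pd.DataFrame({
        "reduced":   [1, 0, 1, 0, 1, 0, 0, 1],
        "age_group": ["teens", "teens", "teens", "20s",
                      "20s", "20s", "30s", "30s"],
        "post_len":  [4, 12, 5, 6, 14, 9, 15, 7],
    })
    model = smf.logit("reduced ~ C(age_group) + post_len", data=df).fit()
    print(model.summary())  # which predictors significantly affect reduction?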

In the paper, we present the main findings and relate them to those identified in the literature as being active in spoken text.

Corresponding author: hvh@let.ru.nl


Language Evolution and SA-OT: The case of sentential negation
Lopopolo, Alessandro and Biro, Tamas
University of Amsterdam

Simulated Annealing Optimality Theory (SA-OT) is a recent extension of the OT framework: it adds a model of performance to a theory of linguistic competence. Our aim is to show how SA-OT can be useful for language evolution simulations. Performance error is a central concept in this model, and it is considered to be one of the causes of variation and evolution. In performance, speakers accept sacrificing precision in order to enhance communicative strength, and the performance errors influence the language learning of the next generation.

To test the potential of SA-OT, we have chosen to model the evolution of sentential negation. The background is Jespersen's Cycle (JC), in which the evolution of sentential negation follows three stages (1. pre-verbal, 2. discontinuous, and 3. post-verbal). Our starting point is the treatment of JC by de Swart (2010) in terms of traditional OT. Her model predicts six stages: the three above-mentioned pure stages, as well as three intermediate, mixed stages. Yet there are no convincing empirical data for an intermediate stage between stages 1 and 3.

Therefore, we advance a novel computational model for JC, based on SA-OT. It reproduces the three pure and the two observed mixed stages, and correctly predicts the lack of an intermediate stage between 1 and 3. This result makes different predictions for the evolution of sentential negation, and confirms the validity of SA-OT as a computational model for language evolution.

Corresponding author: A.Lopopolo@student.uva.nl


Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus
Schrauwen, Sarah and Daelemans, Walter
CLiPS, Antwerp University

Sentiment analysis deals with the computational treatment of opinion, sentiment and subjectivity. We constructed and manually annotated a corpus, the Dutch Netlog Corpus, with data extracted from the social networking website Netlog. The corpus was annotated on three levels: 'valence' (expressing the opinion of the writer; we distinguish between 'positive', 'negative', 'both', 'neutral' and 'n/a'), and additionally language performance, which is divided into two areas: 'performance' ('standard', 'dialect' and 'n/a') and 'chat' ('chat', 'non-chat' and 'n/a'). We tackle sentiment analysis as a text classification task and employ two simple feature sets (the most frequent and the most informative words of the corpus) and three supervised classifiers implemented in the Natural Language Toolkit (the Naïve Bayes, Maximum Entropy and Decision Tree classifiers). The highest accuracy obtained for valence classification on the entire data set is 65.1%.

We suggest three factors leading to errors in valence classification. First, the nature of the data affects results, since most of the corpus consists of dialect and chat language, which is more difficult to predict. Second, the number of classes to predict is larger for valence classification (five classes) than for performance or chat classification (three classes), which makes the task harder. Third, the skewed class distribution of the corpus probably has the biggest influence on the results. We suspect that more training data will alleviate these problems.
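A minimal sketch of the Naïve Bayes variant of this setup in NLTK (the tiny feature set and toy Dutch posts are illustrative, not the corpus itself):

    import nltk

    def features(post, vocabulary):
        """Word-presence features over a fixed vocabulary of frequent words."""
        words = set(post.lower().split())
        return {w: (w in words) for w in vocabulary}

    train = [("wat een leuke foto", "positive"),
             ("dit is echt slecht", "negative"),
             ("gewoon een dagje thuis", "neutral")]
    vocabulary = ["leuke", "slecht", "echt", "foto"]
    train_set = [(features(t, vocabulary), label) for t, label in train]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(features("echt een leuke dag", vocabulary)))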

Corresponding author: sarah.schrauwen@gmail.com


Measuring the Impact of Controlled Language on Machine Translation Text via Readability and Comprehensibility
Doherty, Stephen
Centre for Next Generation Localisation, Dublin

This paper describes a recent study of the readability and comprehensibility of English software documentation which has been translated into French by Matrex, a state-of-the-art (phrase-based statistical) machine translation system. The primary aim of the study is to examine what effects, if any, the application of controlled language (CL) rules to the source language texts has on the readability and comprehensibility of the machine translation output. Our hypothesis is that the application of CL rules results in an observable increase in the readability and comprehensibility of the target language text.

Our approach consisted of a three-pronged evaluation of the texts by means of (i) readability indices in both the source and target languages; (ii) an eye-tracking measurement of readability; and (iii) a post-task qualitative measurement of comprehensibility, using recall and Likert-scale human evaluations. We also looked at correlations between automatic machine translation evaluation metrics (e.g. BLEU, GTM) and the evaluation results mentioned above, in an attempt to bridge the gap between human and automatic approaches to evaluation.

The paper will first describe some background and context in the relevant research areas, followed by a presentation of the methods employed, with a particular focus on the measurement of readability via eye tracking and tentative results in this regard.
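As an illustration of the indices in step (i), one widely used readability formula (our example; the paper does not specify which indices it uses) is the Flesch Reading Ease score,

    \mathrm{FRE} = 206.835 - 1.015 \cdot \frac{\#\,\text{words}}{\#\,\text{sentences}} - 84.6 \cdot \frac{\#\,\text{syllables}}{\#\,\text{words}}

where higher scores indicate easier text; adaptations with rescaled coefficients exist for French (e.g. Kandel and Moles).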

Corresponding author: stephen.doherty2@mail.dcu.ie


Memory-based text completion
van den Bosch, Antal
Tilburg University

The commonly accepted technology for fast and efficient word completion is the prefix tree, or trie. As a word is keyed in, the trie can be queried for unicity points and best guesses. We present three improvements over the normal prefix trie in experiments in which we measure the percentage of keypresses saved on both in-domain and out-of-domain test text, emulating a perfectly alert user who selects correct suggestions promptly. First, we train a suffix trie that tests backwards from the most recent keypresses. Conditioned on first letters, the suffix trie model yields about 10% more saved keypresses than the baseline character-saving percentage on in-domain test data. Second, the suffix trie model can be straightforwardly extended to test on characters of previous words. Adding this context yields another 10% increase in character savings. Third, when we train the context-rich suffix trie model to complete the current word and predict the next one in one go, character savings go up another 4%. In a learning experiment on Dutch texts we observe character savings of up to 44% on in-domain test data, where the baseline prefix trie savings percentage is 19%. On out-of-domain Twitter data, the prefix trie baseline of 19% is only mildly surpassed by the suffix trie variants, at 24% character savings. We develop an explanation for the spectacular success of the suffix trie approach on in-domain data, and review the applicability of the approach in real-world text entry contexts.
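A minimal sketch of the baseline prefix trie with its unicity-point completion (simplified: no end-of-word markers and no frequency-based best guesses; the suffix-trie variants reverse the lookup direction and add context):

    class TrieNode:
        def __init__(self):
            self.children = {}

    def build_trie(words):
        root = TrieNode()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
        return root

    def complete(root, prefix):
        """Follow the prefix, then extend while the continuation is unique
        (the 'unicity point' logic of word completion)."""
        node = root
        for ch in prefix:
            if ch not in node.children:
                return prefix          # unseen prefix: nothing to suggest
            node = node.children[ch]
        out = prefix
        while len(node.children) == 1:
            ch, node = next(iter(node.children.items()))
            out += ch
        return out

    trie = build_trie(["keypress", "keyboard", "keypad"])
    print(complete(trie, "keyb"))      # -> "keyboard": 5 keypresses saved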

Corresponding author: Antal.vdnBosch@uvt.nl


Overlap-based Phrase Alignment for Language Transformation
Wubben, Sander and van den Bosch, Antal and Krahmer, Emiel
Tilburg University

In this talk we present our work on the task of paraphrasing from an old variant of a language to a modern variant. One of the tasks we consider is paraphrasing the Canterbury Tales from Middle English to Modern English. We approach this as a translation task and therefore use Machine Translation techniques. Current state-of-the-art Machine Translation systems rely heavily on statistical word alignment. The alignment package most commonly used is GIZA++, which is used to train IBM Models 1 to 5 and an HMM word alignment model. The benefit of using statistical word alignment is that no assumptions need to be made about the parallel corpus and that it generally produces better results when fed more data. This holds for the task of paraphrasing as well. However, when we consider monolingual parallel corpora, it might be naive to use only statistics when we can in fact exploit the fact that both sides of the corpus are in the same (or at least a similar) language, and therefore likely to exhibit a certain amount of overlap. We investigate the feasibility of using overlap measures to align words and phrases in monolingual corpora, and how this method holds up against purely statistical alignment in a Machine Translation framework.
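A minimal sketch of one possible overlap measure between candidate word pairs (difflib's ratio is our illustrative choice; the talk investigates which measures actually work best):

    from difflib import SequenceMatcher

    def overlap(a, b):
        """Crude string overlap in [0, 1] between a Middle English word
        and a Modern English candidate."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    print(overlap("nyght", "night"))   # 0.8: plausible alignment
    print(overlap("nyght", "tales"))   # low: implausible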

Corresponding author: s.wubben@uvt.nl


Parse and Tag Somali Pirates
van Erp, Marieke and Malaisé, Véronique and van Hage, Willem and Osinga, Vincent and Coleto, Juan Manuel
VU University Amsterdam

Events are the most prevalent complex entities described in user-contributed social network activities, newswire, commercial infringement reports, etc. Unfortunately, due to the nature of free text, event descriptions can take many forms, making querying for or reasoning over them difficult.

We present an approach for event extraction from piracy attack reports issued by the International Chamber of Commerce (ICC-CCS [1]). As the piracy attack reports are semi-structured, we can treat the extraction task as a segmentation and labelling problem. We extract information from the reports about participants, weapons, locations, times and types of events, and store the information as structured event instances. We argue that an event model is not only an intuitive representation for such information, enabling automatic analysis of and reasoning over the attacks and their components, but also a very powerful tool for knowledge and data integration. We show that the event model enables automatic analysis of the data, so that questions such as "How did the weapon use of pirates evolve over time?" can be answered.

[1] http://www.icc-ccs.org

Corresponding author: marieke@cs.vu.nl


Personalized Knowledge Discovery: Combining Social Media and Domain Ontologies
Markus, Thomas and Westerhout, Eline and Monachesi, Paola
Utrecht University

We present a system that facilitates knowledge discovery by means of structured domain ontologies. The user can discover new concepts and relations by exploring an expert-approved ontological structure which has been automatically enriched with new concepts, relations and lexicalisations originating from social media. The system also interlinks, on the fly, the conceptual knowledge in the ontology with noisy data coming from social media at the conceptual level.

Our ontology enrichment methodology identifies salient terms using similarity measures and determines the appropriate word sense for each term by employing a disambiguation algorithm. The appropriate relation between the new concept (word sense) and the existing ones is extracted either from DBpedia or from text documents retrieved from the web. The disambiguation algorithm is also used to store the original context of each term, that is, the term itself, its meaning, and the associated person and resource. These personalised contexts are stored using the MOAT semantic vocabulary.

The enriched ontology and the disambiguation methodology allow us to give a personalised semantic interpretation to each search result in the context of the enriched domain ontology and the user. The amount of conceptual overlap between a document and the person using the system is employed to offer personalised recommendations of documents.

The advantages that this approach brings to students have been evaluated as part of a university course with a large group of students and a separate control group.

Corresponding author: Thomas.Markus@phil.uu.nl


Recent Advances in Memory-Based Machine Translation
van Gompel, Maarten and van den Bosch, Antal and Berck, Peter
Tilburg University

We present advances in research on Memory-based Machine Translation (MBMT), a form of machine translation in which the translation model takes the form of approximate k-nearest neighbour classifiers. These classifiers are trained to map words or phrases in context to a target word or phrase. The modelling of source-side context is a key feature distinguishing this approach from standard Statistical Machine Translation (SMT).
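A minimal sketch of the core classification idea: a toy 1-nearest-neighbour mapping of a source word in its local context to a target word (an illustration of the principle, not the PBMBMT implementation):

    def knn_translate(instance, memory):
        """instance and examples: (left word, source word, right word) tuples;
        pick the target of the most similar stored example."""
        def sim(a, b):
            return sum(x == y for x, y in zip(a, b))
        example, target = max(memory, key=lambda ex: sim(ex[0], instance))
        return target

    memory = [  # ((left, source, right), target) pairs from aligned data
        (("ik", "zie", "de"), "see"),
        (("wij", "zien", "de"), "see"),
        (("de", "zee", "en"), "sea"),
    ]
    print(knn_translate(("jij", "zie", "de"), memory))  # -> see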

In 2010 we released the open source PBMBMT (phrase-based memory-based machine translation) system. PBMBMT embraces the concept of phrases, as opposed to the single words or fixed n-grams that earlier work in memory-based machine translation focused on. PBMBMT employs a phrase translation table generated by Moses as the basis for the generation of training and test instances for our classifiers. We present an automatic method for hyperparameter optimisation, and investigate the use of example weighting in the memory-based classifier. We critically measure and compare the performance of our latest system against its precursor systems and a state-of-the-art competitor.

A recent branch of research has focused on the language model component of PBMBMT. As PBMBMT can work with both the well-known SRILM software and WOPR, the memory-based language model, we performed a learning curve experiment with both language models to investigate the effect of the amount of training data. Our results challenge the common "more data is better" belief.

Corresponding author: proycon@anaproy.nl


Reversible stochastic attribute-value grammars
de Kok, Daniël and van Noord, Gertjan and Plank, Barbara
University of Groningen

Attribute-value grammars have been advocated because they are reversible: their declarative nature ensures that the same grammar can in principle be used for parsing and generation.

In recent years, attribute-value grammars have been extended with conditional models to perform parse disambiguation and fluency ranking. However, since such models are conditioned on a sentence or a logical form, reversibility is sacrificed.

We propose a framework for reversible stochastic attribute-value grammars. In this framework, a single statistical model is used for both parse disambiguation and fluency ranking. We argue that this framework is more appropriate, since it recognizes that preferences are shared between the production and comprehension components. For instance, if fluency ranking and disambiguation had different preferences with respect to subject fronting in Dutch, communication would become problematic.
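The contrast can be made concrete with a schematic log-linear rendering (our notation, not necessarily the authors'): directional models use two separately trained weight vectors, whereas a reversible model shares one vector of weights \lambda across both tasks,

    p(t \mid s) = \frac{\exp\bigl(\sum_i \lambda_i f_i(s, t)\bigr)}{Z(s)} \quad \text{(parse disambiguation)}, \qquad
    p(s \mid t) = \frac{\exp\bigl(\sum_i \lambda_i f_i(s, t)\bigr)}{Z(t)} \quad \text{(fluency ranking)},

with s a sentence, t a parse or logical form, f_i features shared over (s, t) pairs, and Z the direction-specific normalisation.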

We provide experimental results which show that the performance of a reversible model does not differ significantly from that of directional models for parse disambiguation and fluency ranking. We also show that fluency ranking models can be improved by adding annotated parse disambiguation training data, and vice versa.

Corresponding author: d.j.a.de.kok@rug.nl


Robust Rhymes? The Stability of Authorial Style in Medieval Narratives
Kestemont, Mike and Daelemans, Walter and Sandra, Dominiek
CLiPS, University of Antwerp

We explore the application of stylometric methods developed for modern texts to rhymed medieval narratives (Jacob of Maerlant and Lodewijk of Velthem, ca. 1260-1330). Because of the peculiarities of medieval text transmission, we propose to use highly frequent rhyme words for authorship attribution. First, we demonstrate that these offer important benefits, being relatively content-independent and well spread over texts. Subsequent experimentation shows that correspondence analyses can indeed detect authorial differences using highly frequent rhyme words. Finally, we demonstrate for Maerlant's oeuvre that the stylistic stability of these highly frequent rhyme words should not be exaggerated, since their distribution correlates significantly with the internal structure of that oeuvre.

Corresponding author: mike.kestemont@ua.ac.be


Rule Induction for Synchronous Tree-Substitution Grammars in Machine Translation
Vandeghinste, Vincent and Martens, Scott
Centrum voor Computerlinguïstiek, K.U.Leuven

Data-driven machine translation systems are evolving from string-based towards tree-based systems, such as the PaCo-MT system. In this system, the source language sentence is parsed using a monolingual parser. This parse tree then needs to be converted or transduced into one or more target language parse trees, from which one or more target language sentences can be generated.

Rules are induced from phrasal alignments in an automatically parsed version of the English and Dutch portions of the Europarl treebank. The extraction procedure assumes that the subtrees bounded by alignments between phrasal nodes in the two syntactic tree structures are suitable as rules for a Synchronous Tree Substitution Grammar. The maximum number of such trees is extracted, given the alignments between the two sentences, and collected over the entire corpus. Rules that occur multiple times are inserted into the transducer as tree substitution rules. Minimally small tree substitution rules (those consisting of a single node and its parent) are used to induce translations where the extracted rules have insufficient coverage.

Corresponding author: vincent@ccl.kuleuven.be


Search in the Lassy Small Corpus
van Noord, Gertjan and de Kok, Daniël and van der Linde, Jelmer
University of Groningen

A few months ago, the STEVIN Lassy project yielded its most important results: Lassy Small, a corpus of 1 million words with syntactic annotations that have been manually verified and corrected, and Lassy Large, a corpus of 1.5 billion words with automatically assigned syntactic structures. The syntactic annotations include part-of-speech tags, lemmas and dependency annotations of the type developed earlier in CGN and D-Coi.

In this presentation we focus on the Lassy Small corpus, and introduce a stand-alone portable tool called DACT which can be used to browse the syntactic annotations in an attractive graphical form, and to search for sentences according to a number of search criteria, which can be specified elegantly by means of queries formulated in XPath, the W3C standard query language for XML documents. We provide a number of linguistically relevant examples of such queries, and we review the criticism of Lai and Bird (2010), which they take as motivation to introduce LPath, an extension of XPath. We will argue that such an extension is not required if string positions are explicitly encoded as XML attributes, as is the case in Lassy Small.
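A minimal sketch of running such a query outside DACT, assuming the Alpino/Lassy XML format in which syntactic nodes are <node> elements carrying rel, cat, begin and end attributes (the query itself, NP subjects, is just one linguistically relevant example):

    from lxml import etree

    tree = etree.parse("example.xml")  # one Lassy Small annotation file
    # All noun-phrase subjects: nodes with dependency relation 'su'
    # and category 'np'.
    for node in tree.xpath('//node[@rel="su" and @cat="np"]'):
        print(node.get("begin"), node.get("end"))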

DACT is freely available for various platforms, including Mac OS and recent versions of Windows.

Corresponding author: g.j.m.van.noord@rug.nl


Simple Measures of Domain Similarity for Parsing
Plank, Barbara and van Noord, Gertjan
University of Groningen

It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data for parsing a given domain is data that matches that domain (Sekine 1997, Gildea 2001). Hence, an important task is to select appropriate training domains. However, most previous work on domain adaptation has relied on the implicit assumption that domains are somehow given.

With the growth of the web, more and more data is becoming available, and automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. We consider various ways to automatically acquire related training data for a given test article, and compare automatic measures to human-annotated meta-data. The results show that a very simple measure of similarity based on word frequencies works surprisingly well.
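A minimal sketch of one such measure, instantiated here as the cosine between relative word-frequency distributions (an illustrative choice, not necessarily the paper's exact measure; article and corpora are assumed inputs):

    import math
    from collections import Counter

    def freq_vector(text):
        """Relative word frequencies of a text."""
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def cosine(p, q):
        dot = sum(v * q[w] for w, v in p.items() if w in q)
        norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm(p) * norm(q))

    # Assumed given: article (str) and corpora (dict of name -> str).
    ranked = sorted(corpora,
                    key=lambda name: cosine(freq_vector(article),
                                            freq_vector(corpora[name])),
                    reverse=True)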

Corresponding author: b.plank@rug.nl


SSLD: A smart tool for SMS compression
Cougnon, Louise-Amélie and Beaufort, Richard
UCLouvain - IL&C - Cental

Since 2009, we have been designing a methodology to semi-automatically develop a dictionary based on a corpus of SMS. Such a dictionary can be used to help systems translate from standard language into SMS language, a procedure which has so far been seen as an entertaining activity; our methodology can, however, also be employed for more serious purposes such as text message summarising and compression tools. Our first results were encouraging (Cougnon and Beaufort, 2010) but focused only on French data (from Belgium, Quebec, Switzerland and La Réunion). Thanks to the sms4science project, which aims at collecting SMS corpora from all over the world, we now have German, Dutch and Italian text messages at our disposal. The aim of this paper is to describe our three-step approach to the extraction of dictionary entries from the various corpora and to detail the smart manual sorting performed on the dictionary. The results will give us the opportunity to test our initially French-based methodology on other languages and to find out whether our approach is generic, i.e. applicable to all languages. This question also paves the way for a panorama of the SMS phenomena observed in the dictionary that recur across the languages. Finally, we propose ways in which our methodology could be further improved.

Corresponding author: louise-amelie.cougnon@uclouvain.be


Subtrees as a new type of context in Word Space Models
Smets, Margaux and Speelman, Dirk and Geeraerts, Dirk
QLVL, K.U.Leuven

In Word Space Models (WSMs) there are traditionally two types of contexts that can be used: (i) lexical co-occurrences ('bag-of-words models') and (ii) syntactic dependencies. In general, models with the second type of context seem to perform better. However, there are some problems with these models. In the first place, a choice has to be made as to which contexts to include: only subject/verb and verb/object relations, or also other dependencies. Second, in contrast with bag-of-words models, the syntactic models are supervised: they require quite large resources (a dependency parser, a manually annotated corpus, etc.), which might not be available for every language.

The contexts we propose for use in WSMs are subtrees as defined in the framework of Data-Oriented Parsing. Subtrees can capture both bag-of-words (co-occurrence) information and syntactic information. Moreover, they are not limited to specific types of dependencies, but rather take entire structures into account.

At first sight, it might seem that the resource problem of dependency-based WSMs remains in this framework. After all, we first need the 'correct' tree for a sentence before we can extract subtrees from it. However, in our experiments we show how the entire algorithm can be made unsupervised by using an unsupervised parser as a preprocessing step.

In the presentation, I will first discuss in detail the workings of this new type of WSM. Next, I will present some initial results from experiments with parameters such as the accuracy of the parser in the preprocessing step, the maximum subtree depth, the minimum subtree frequency, and considering only subtrees with the highest variance.

Corresponding author: margauxsmets@gmail.com


Successful extraction of opposites by means of textual patterns with part-of-speech information only
Lobanova, Anna
Department of Artificial Intelligence, University of Groningen

We present an automatic method for the extraction of opposites (e.g., rich - poor, top - bottom, buy - sell) by means of textual patterns that contain only part-of-speech information about the target word pairs, e.g., 'difference between <noun> and <noun>'. Our preliminary results suggest that this method outperforms a pattern-based method that uses dependency patterns [2] (requiring more sophisticated data preprocessing), especially for opposites expressed by nouns and verbs.

Starting with small seed sets, we automatically acquired textual patterns from a 450-million-word version of the Twente Nieuws Corpus of Dutch [4]. All patterns were automatically evaluated based on their overall frequency and the number of times they contained seed pairs. The best patterns were used to find candidate pairs. All found pairs were automatically scored based on their frequency and co-occurrence in reliable patterns. In addition, the highest-scoring pairs (score >= 0.9) were evaluated by two human judges. The precision scores for the top-100 found pairs were 0.61 for adjective-adjective pairs, 0.63 for noun-noun pairs and 0.52 for verb-verb pairs. When more pairs were considered, the precision was still higher than that reported in previous studies: for the top-500 pairs, the precision was 0.42 for adjective-adjective pairs, 0.33 for noun-noun pairs and 0.49 for verb-verb pairs.

This method needs fewer pre-processing steps than dependency patterns and can easily be applied to vast data collections. The results can benefit many NLP applications, including the augmentation of computational lexical resources, contrast identification [3,5], and the detection of paraphrases and contradictions [1].
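A minimal sketch of applying one such pattern to POS-tagged text; the toy Dutch fragment means 'the difference between top and bottom', and the tags and pattern are illustrative:

    def match_pattern(tagged, trigger, slot_tag="NOUN"):
        """Find X/Y pairs around a trigger word sequence where both slots
        carry the required tag. tagged is a list of (word, tag)."""
        pairs, k = [], len(trigger)
        for i in range(1, len(tagged) - k):
            if ([w for w, _ in tagged[i:i + k]] == trigger
                    and tagged[i - 1][1] == slot_tag
                    and tagged[i + k][1] == slot_tag):
                pairs.append((tagged[i - 1][0], tagged[i + k][0]))
        return pairs

    tagged = [("het", "DET"), ("verschil", "NOUN"), ("tussen", "PREP"),
              ("top", "NOUN"), ("en", "CONJ"), ("bodem", "NOUN")]
    print(match_pattern(tagged, ["en"]))  # -> [('top', 'bodem')]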

Corresponding author: a.lobanova@ai.rug.nl


Syntactic Analysis of Dutch via Web Services
Tjong Kim Sang, Erik
University of Groningen

Alpino is a general-purpose syntactic parser for Dutch sentences. At the moment, using the parser requires installing the parser software on a local machine. In the CLARIN project TTNWW, we are developing a web service interface to the parser which will simplify access for future users. The service provides access for client software via standard protocols like SOAP, and exchanges XML-encoded text data between the client machine and the server where the parser is run. In this presentation, we present the current status of this project.

Corresponding author: erikt@xs4all.nl


Technology recycling between Dutch and Afrikaans
Augustinus, Liesbeth (1) and van Huyssteen, Gerhard (2) and Pilon, Suléne (3)
(1) Centre for Computational Linguistics (CCL), K.U.Leuven
(2) Centre for Text Technology (CTexT), North-West University, Potchefstroom, South Africa
(3) School for Languages, North-West University, Vanderbijlpark, South Africa

Resource development for resource-scarce languages can be fast-tracked by recycling existing technologies for closely related languages. The main issue dealt with here is the recycling of Dutch technologies for Afrikaans. The possibilities of technology transfer are investigated by focusing on the D2AC-A2DC project. After exploring the architecture and functioning of D2AC, a Dutch-to-Afrikaans convertor, attention turns to the development and performance of A2DC, an Afrikaans-to-Dutch convertor. The latter tool is then used to improve the annotation of Afrikaans text with Dutch technologies. In particular, the performance of part-of-speech tagging and chunking is considered. The accuracies of both tagger and chunker improve significantly if the data are first converted with A2DC before being sent through the tools for Dutch analysis.

Corresponding author: liesbeth@ccl.kuleuven.be


Technology recycling for closely related languages: Dutch and Afrikaans
Pilon, Suléne (1) and Van Huyssteen, Gerhard (2)
(1) North-West University (VTC)
(2) North-West University (PC)

If two languages (L1 and L2) are similar enough, the development of technologies for L2 can be expedited by recycling existing L1 resources. This process is called technology recycling, and its success is greatly dependent on the degree of similarity between the two languages in question. Other strategies can, however, be employed to improve the efficiency of L1 technologies on L2 data, and in this research we experiment with one such strategy, viz. lexical conversion as a pre-processing step. We explore the possibility of using rule-based lexical conversion to improve the accuracy of Dutch technologies when annotating Afrikaans data. The rationale is that Dutch technologies should perform better on Afrikaans data that appears more Dutch-like, even if the conversion does not yield a good Dutch translation. To do the lexical conversion, we developed an Afrikaans-to-Dutch convertor (A2DC) which obtains an accuracy of more than 72% when converting Afrikaans words to Dutch. For our experiment we use a state-of-the-art Dutch POS tagger and parser to annotate raw Afrikaans data. The same data is then converted with A2DC and once again annotated with the Dutch technologies. In both experiments the conversion has a notably positive effect on the performance of the Dutch technologies. The biggest difference is observed in the POS tagging task, with the overall accuracy increasing from 62.6% when annotating raw Afrikaans data to 80.6% when annotating converted data, while the parsing f-score improves from 0.44 (raw data) to more than 0.68 (converted data).

Corresponding author: sulene.pilon@nwu.ac.za


The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation
Theijssen, Daphne and van Halteren, Hans and Boves, Lou and Oostdijk, Nelleke
Radboud University Nijmegen

In the dative alternation in English, speakers and writers choose between the prepositional dative construction ('I gave the ball to him') and the double object construction ('I gave him the ball'). Logistic regression models have already been shown to be able to predict over 90% of the choices correctly (e.g. Bresnan et al. 2007).

Collecting dative instances from a corpus and encoding them with the required information is a costly procedure. We therefore developed a semi-automatic approach, consisting of three steps: (1) automatically extracting dative candidates, (2) manually approving or rejecting these candidates, and (3) automatically annotating the approved candidates with the required information. The resulting data sets are noisier than data sets that have been checked completely manually, but the approach can yield much larger data sets.

We compare the effect of data set size and noisiness on the accuracy of predicting the dative alternation. We employ a 'manual' set of 2,877 instances in spoken English, taken from Switchboard (Godfrey et al. 1992) by Bresnan et al. (2007) and from ICE-GB (Greenbaum 1996) by Theijssen (2010). In addition, we use a 'semi-automatic' set of 7,755 instances from Switchboard, ICE-GB and the BNC (BNC Consortium 2007). We compare the learning curves of various machine learning algorithms by randomly selecting subsets of the data and extending them with 500 instances each time. We do this for different levels of noisiness, i.e. varying the proportion of 'semi-automatic' instances (0%, 25%, 50%, 75%, 100%). The results are presented at the conference.
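A minimal sketch of the learning-curve setup with scikit-learn (illustrative; the paper compares several learners, and X and y are assumed to be the encoded instances and construction choices):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Assumed given: X (n x d feature matrix) and y (0/1 construction choice),
    # pooled from the 'manual' and 'semi-automatic' instance sets.
    rng = np.random.default_rng(0)
    for size in range(500, len(y) + 1, 500):
        idx = rng.choice(len(y), size=size, replace=False)  # random subset
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X[idx], y[idx], cv=10).mean()
        print(size, round(acc, 3))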

References

BNC Consortium (2007). The British National Corpus, version 3 (BNC XML Edition). Oxford University Computing Services.

Bresnan, Joan, Anna Cueni, Tatiana Nikitina and R. Harald Baayen (2007). Predicting the Dative Alternation. In Bouma, Gerlof, Irene Kraemer and Joost Zwarts (eds.), Cognitive Foundations of Interpretation, Royal Netherlands Academy of Science, Amsterdam, pp. 69-94.

Godfrey, John J., Edward C. Holliman and Jane McDaniel (1992). Switchboard: Telephone speech corpus for research and development. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), pp. 517-520.

Greenbaum, Sidney (ed.) (1996). Comparing English Worldwide: The International Corpus of English. Oxford, U.K.

Theijssen, Daphne (2010). Variable selection in Logistic Regression: The British English dative alternation. In Icard, Thomas and Reinhard Muskens (eds.), Interfaces: Explorations in Logic, Language and Computation. Lecture Notes in Computer Science (subseries Lecture Notes in Artificial Intelligence), volume 6211, Springer.

Corresponding author: d.theijssen@let.ru.nl


The use of structure discovery methods to detect syntactic change
ten Bosch, Louis and Versteegh, Maarten
Radboud University Nijmegen

A well-known problem in linguistics concerns the description and analysis of diachronic changes in syntactic constructions. In Western European languages, such changes have occurred a number of times over the last few centuries. In this presentation we give an overview of quantitative methods for analyzing historical text corpora. Our overview will include parsing-related methods, Bayesian methods and Latent Semantic Analysis, with a special focus on methods that do not take syntactic trees as a starting point. In addition, we will pay attention to two methodological approaches, viz. the contrastive approach, in which two independent analyses of two corpora are compared, and the single-model approach, in which changes in syntactic structure are interpreted as the result of a (possibly biased) competition within a single model.

We will compare the various methods by presenting different analyses of the same text material.

Corresponding author: l.tenbosch@let.ru.nl


Treatments of the Dutch verb cluster in formal and computational linguistics
Van Eynde, Frank
K.U.Leuven

The Dutch verb cluster has always been a challenge for formal and computational linguistics, since the sentences which contain one display a rather dramatic discrepancy between surface structure, on the one hand, and semantic structure, on the other, as illustrated amongst others by the cross-serial dependencies in sentences with an AcI verb, such as 'zien' in '...dat ik haar de honden heb zien voederen' (...that I saw her feed the dogs).

In multistratal frameworks, such as transformational grammar, the discrepancy is accounted for in terms of movement. More specifically, there is a level of syntactic structure which straightforwardly reflects the semantic relations, called deep structure or D-structure, and there is a series of transformations which map D-structures onto surface structures. The transformations either move the verbs, as in Arnold Evers' analysis, or their arguments, as in Jan-Wouter Zwart's analysis.

In monostratal frameworks, such as GPSG and HPSG, the discrepancy between surface structure and semantic structure is handled in terms of the inheritance of valence requirements, allowing the verbs in the cluster to take over the unfulfilled valence requirements of their verbal complement. This approach was pioneered by Mark Johnson in GPSG and by Erhard Hinrichs and Tsuneko Nakazawa in HPSG. Applications to Dutch are spelled out in work by Gerrit Rentier and by Gosse Bouma and Gertjan van Noord.

In the Dutch treebanks, such as those of CGN and Lassy, the treatment of the verb cluster is monostratal, but the device used to bridge the discrepancy between surface structure and semantic structure is more reminiscent of multistratal analyses, allowing the existence of crossing dependencies and hence the postulation of discontinuous constituents. The talk will give a survey of the existing treatments and provide a comparative evaluation.

Corresponding author: frank.vaneynde@ccl.kuleuven.be


TTNWW: de facto standards for Dutch in the context of CLARIN
Schuurman, Ineke (1) and Kemps-Snijders, Marc (2)
(1) Centrum voor Computerlinguïstiek, K.U.Leuven
(2) Meertens Instituut, Amsterdam

The Flemish and Dutch CLARIN groups have started TTNWW, a larger pilot project (2010-2012), in which several existing resources (both text and speech) and de facto standards for Dutch are a) used to help HSS researchers address new research needs, while b) these resources and standards are adapted to or mapped onto the standards adopted by CLARIN, embedded in a workflow and presented as a web service designed for these HSS researchers.

TTNWW involves technology partners (5 speech, 10 text) and user groups (4 speech, 2 text), spread over Flanders and the Netherlands.

The text part of TTNWW focuses on the recognition of all kinds of names in various types of texts, such as Dutch novels and archaeological documents, in the latter case in combination with temporal analysis. All 'lower' levels are also taken care of.

The project will provide the CLARIN community with ample feedback with respect to the standards and technologies proposed in the European context, and promote the de facto standards for Dutch NLP as used in CGN and several STEVIN projects.

In this presentation we will concentrate on various standards for written language and the mappings between them, especially PoS (CGN/D-Coi), MAF and ISOcat.

Corresponding author: ineke.schuurman@ccl.kuleuven.be


TTNWW: NLP Tools for Dutch as Webservices in a Workflow

Kemps-Snijders, Marc (1) and Schuurman, Ineke (2)
(1) Meertens Instituut, Amsterdam
(2) CCL, K.U.Leuven

The Flemish and Dutch CLARIN groups have started TTNWW, a larger pilot project (2010-2012), in which several existing resources (both text and speech) and de facto standards for Dutch are a) used to help HSS researchers address new research needs, while b) these resources and standards are adapted to or mapped onto the standards adopted by CLARIN, embedded in a workflow and presented as a web service designed for these HSS researchers.

In TTNWW, technology partners (5 speech, 10 text) and user groups (4 speech, 2 text) are involved, spread over Flanders and the Netherlands.

To develop the functionalities for both the speech and text parts of the project, the services delivered by each of the partners will be combined in a workflow approach that allows for flexible combinations of processes. Efforts in this area are embedded in the CLARIN effort to describe web services for easy discovery and profile matching, i.e. offering possible combinations of available resources and web services for specific tasks.

In this presentation we will focus on methods for workflow construction and the description of web services, and place them in the international perspective of CLARIN.
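
To make the workflow idea concrete, the sketch below chains two hypothetical RESTful NLP services, a tokenizer feeding a named-entity tagger. The URLs and the assumption that both services return JSON are illustrative only; they are not the actual TTNWW or CLARIN interfaces.

    import requests

    # Hypothetical service endpoints: these URLs are illustrative only,
    # not the actual TTNWW/CLARIN web services.
    TOKENIZER_URL = "http://example.org/services/tokenize"
    NER_URL = "http://example.org/services/ner"

    def run_workflow(text):
        # Step 1: send raw text to the tokenizer service.
        tokens = requests.post(TOKENIZER_URL, data={"text": text}).json()
        # Step 2: feed the tokenized output into the named-entity tagger.
        entities = requests.post(NER_URL, json={"tokens": tokens}).json()
        return entities

    if __name__ == "__main__":
        print(run_workflow("Ineke Schuurman werkt in Leuven."))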

Corresponding author: marc.kemps.snijders@meertens.knaw.nl



Using corpus tools to analyze gradable nouns in Dutch

Ruiz, Nicholas and Weiffenbach, Edgar
University of Groningen

Morzycki (2009) claims that degree readings of size adjectives, as in "a big idiot", are not merely the "consequence of some extragrammatical phenomenon", but rather can be attributed to syntax, which imposes positional restrictions on the availability of degree readings. We expand on Morzycki (2009) by introducing a corpus-based analysis of Dutch to verify Morzycki's claims and to extend them to the semantic domain.

Using LASSY, a syntactically annotated Dutch corpus developed inter alia under the STEVIN programme, we extract syntactic and semantic properties of noun phrases containing the adjectives 'gigantisch', 'kolossaal' and 'reusachtig', and manually annotate each adjective-noun pair with a gradable or non-gradable label.

Using these features, we construct a statistical model based on logistic regression and find that the semantic role, definiteness, and particular semantic noun groups derived from Cornetto (a Dutch WordNet with referential relations) have a significant effect on the likelihood that an adjective-noun pair is interpreted by the reader as having a degree reading.
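
A minimal sketch of the kind of model described: logistic regression over categorical features of adjective-noun pairs. The feature names and values here are illustrative stand-ins, not the authors' actual LASSY/Cornetto feature set.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Toy training data: each adjective-noun pair is described by categorical
    # features; labels mark whether annotators judged the reading gradable.
    pairs = [
        {"adj": "gigantisch", "definite": "no", "noun_group": "person"},
        {"adj": "kolossaal", "definite": "yes", "noun_group": "building"},
        {"adj": "reusachtig", "definite": "no", "noun_group": "person"},
        {"adj": "gigantisch", "definite": "yes", "noun_group": "building"},
    ]
    labels = [1, 0, 1, 0]  # 1 = gradable (degree) reading, 0 = literal size reading

    vec = DictVectorizer()
    X = vec.fit_transform(pairs)

    model = LogisticRegression().fit(X, labels)

    # Predict the reading for an unseen pair.
    test = vec.transform({"adj": "gigantisch", "definite": "no", "noun_group": "person"})
    print(model.predict_proba(test))  # probability of each class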

Corresponding author: nicholas.ruiz@gmail.com


Using easy distributed computing for data-intensive processing

Van den Bogaert, Joachim
Centre for Computational Linguistics, K.U. Leuven

Given the large amounts of data we have to cope with when computing useful data from large corpora, and the difficulties and costs involved in running parallel code on traditional parallel computing infrastructure, we present different frameworks that may be used to facilitate easy distributed computing. Using string-to-tree alignment (GHKM), frequent subtree mining, and distributed Moses decoding as example cases, we demonstrate how applications and algorithms can be scaled up and scaled out with these frameworks. We consider both the creation of an embarrassingly parallel solution and the re-design of an existing algorithm to fit the MapReduce paradigm.
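
As a minimal illustration of the MapReduce paradigm mentioned above, the sketch below counts word frequencies with explicit map, shuffle and reduce phases; a real framework (e.g. Hadoop) would distribute these phases over many machines.

    from collections import defaultdict

    def map_phase(document):
        # Emit (word, 1) pairs for every token in a document.
        return [(word, 1) for word in document.split()]

    def shuffle(pairs):
        # Group all values by key; a framework does this across the network.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Sum the counts for one word.
        return key, sum(values)

    documents = ["de kat zat op de mat", "de hond zag de kat"]
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(counts)  # {'de': 4, 'kat': 2, ...}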

Corresponding author: joachim@ccl.kuleuven.be



What is the use of multidocument spatiotemporal analysis?

Schuurman, Ineke and Vandeghinste, Vincent
Centrum voor Computerlinguïstiek, K.U.Leuven

In the AMASS++ project (IWT), the central research topic is multilingual multimedia multidocument summarization: how, in a huge digitized news archive, can a journalist find documents dealing with the same events and have a summary made of them in order to assess their usefulness for a specific purpose? Or: how can a newspaper deliver personalized (inter)national and local news to a specific subscriber?

In this presentation we will show how spatiotemporal analysis can assist in such tasks, even when the input consists of nothing more than raw PoS-tagged texts (contrary to the SoNaR project, where many levels of annotation are available, all of them corrected).

Corresponding author: ineke.schuurman@ccl.kuleuven.be


Without a doubt no uncomplicated task: Negation cues and their scope

Morante, Roser and Schrauwen, Sarah and Daelemans, Walter
CLiPS, University of Antwerp

Although negation has been extensively treated from a theoretical perspective (Klima 1964, Horn 1989, Tottie 1991, van der Wouden 1997) and its processing is thought to be relevant for natural language processing systems (Morante and Sporleder 2010), there is a lack of annotated resources, and no publicly available annotation guidelines can be found that describe in detail how to annotate negation-related aspects. In this talk we present a corpus annotated with negation cues and their scope, we describe the guidelines that we have defined, and we comment on the linguistic aspects of the annotation process. The annotated corpus contains the detective stories The Hound of the Baskervilles and The Adventure of Wisteria Lodge by Conan Doyle. Part of the corpus has already been annotated with other layers of semantic information (semantic roles, coreference) for the SemEval task Linking Events and Their Participants in Discourse (Ruppenhofer et al., 2010). We first describe the expression of negation in this corpus and compare it with the expression of negation in biomedical documents. Then we comment in detail on several aspects of the negation phenomenon: how to determine what counts as a negation cue, how to mark the scope, and how to determine whether an event is negated. We will show that marking the cues is not a matter of lexical look-up, because some cues are ambiguous, and that contextual and discourse-level features play a role in finding the scope. Additionally, we show that finding negated events depends on the semantic class, mood and tense of the predicates involved, on the modality of the event clause, and on the syntactic construction. Finally, we comment on the most difficult aspects of the annotation process, such as determining when prepositions like "save" or "except" act as negation cues.

Corresponding author: roser.morante@ua.ac.be


Poster Abstracts



A database for lexical orthographic errors in French

Manguin, Jean-Luc
GREYC, Université de Caen, France

This work describes the construction of a database of lexical orthographic errors in French. The construction uses various techniques from the field of NLP for a goal in the field of psycholinguistics, where it is often difficult and time-consuming to collect enough data from experiments with real people. Here the data are collected online: they come from the requests made to an online dictionary. In this huge amount of data (about 160 million words, 4 million distinct forms), we can find enough errors to obtain good statistics for a deep study of errors. The questions developed here are the link between a "bad" form and its correction, and the classification of errors into a small number of types. Several programs and techniques are involved in achieving these tasks: detection of graphic neighbours, phonetization, and pattern matching. The combination of these techniques leads to 70% of corrections with no ambiguity, and 80% if we accept that the system may give several possible corrections. The classification of errors is also useful for predicting where errors may appear in words, and thus for understanding how children learn orthography.
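
A minimal sketch of one of the techniques mentioned, the detection of graphic neighbours: candidate corrections are lexicon words within Levenshtein distance 1 of the misspelled form. The lexicon here is a toy stand-in for a full dictionary.

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution
            prev = curr
        return prev[-1]

    # Toy French lexicon; a real system would use a full dictionary.
    lexicon = {"langage", "bagage", "garage", "language"}

    def graphic_neighbours(form, max_dist=1):
        return [w for w in lexicon if levenshtein(form, w) <= max_dist]

    print(graphic_neighbours("langgage"))  # e.g. ['langage', 'language']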

Corresponding author: jean-luc.manguin@unicaen.fr



A Posteriori Agreement as a Quality Measure for Readability Prediction Systems

van Oosten, Philip and Hoste, Véronique and Tanghe, Dries
LT3, University College Ghent

All readability research is ultimately concerned with the question whether it is possible for a prediction system to automatically determine the level of readability of an unseen text.

A significant problem for such a system is that readability might depend in part on the reader. If different readers assess the readability of texts in fundamentally different ways, there is insufficient a priori agreement to justify the correctness of a readability prediction system based on the texts assessed by those readers. We built a data set of readability assessments by expert readers, clustered the experts into groups with greater a priori agreement, and then measured for each group whether classifiers trained only on data from this group exhibited a classification bias.

As this was found to be the case, the classification mechanism cannot be unproblematically generalized to a different user group.
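
A minimal sketch of the a priori agreement computation that motivates the clustering step: pairwise Cohen's kappa between expert annotators. The labels below are toy placeholders; the actual data set and clustering procedure are the authors'.

    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    # Toy readability assessments: each expert labels the same five texts
    # as 'easy' or 'hard'. Real data would cover many more texts and experts.
    experts = {
        "expert_a": ["easy", "easy", "hard", "hard", "easy"],
        "expert_b": ["easy", "hard", "hard", "hard", "easy"],
        "expert_c": ["hard", "easy", "easy", "hard", "hard"],
    }

    # Pairwise a priori agreement; low-kappa pairs would end up in
    # different annotator clusters.
    for (name1, labels1), (name2, labels2) in combinations(experts.items(), 2):
        kappa = cohen_kappa_score(labels1, labels2)
        print(f"{name1} vs {name2}: kappa = {kappa:.2f}")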

Corresponding author: philip.vanoosten@hogent.be


A TN/ITN Framework for Western European languages

Chesi, Cristiano (1) and Cho, Hyongsil (2) and Baldewijns, Daan (2) and Braga, Daniela (1)
(1) Microsoft Language Development Center
(2) Microsoft Language Development Center, ISCTE-Lisbon University Institute, Portugal

(Inverse) Text Normalization, (I)TN, is an essential module in Text-to-Speech (TTS) and Speech Recognition (SR) systems, and it requires both a significant development timeline and deep linguistic expertise (Mikheev 2000, Palmer 2010).

In this work, we describe an efficient multilingual (I)TN framework that is rule-based, hierarchical and modular. The core system is composed of a large set of optimized Finite-State Transducers (FSTs) that are compiled following the Normalization Maps developed by Language Experts (LEs) for each language. Such maps are built using a proprietary tool (TNAuthoringTool, Patent Serial No. 12/361,114) that allows the LEs to express terminal normalizations at a high level (e.g. Term_1: "1" > "one") and to easily combine such terminals by means of hierarchical, weighted rules. These rules can be ordered sets of terminals or other rules, each one ranked according to its relevance so as to prevent interference in specific contexts (e.g. Rule_1: "21-12-2010" > "twenty-first of December two thousand ten" vs. Rule_2: "23-29" > "from twenty three to twenty nine"). Such rules are clustered under a small number of Top-Level rules that constitute the entry states of the compiled FSTs.

The core set of Top-Level rules developed covers Numerals, Ordinals, Dates and Time, Telephone numbers, Measurements and Web-related terms (e.g. URLs, email, acronyms). Here, we focus on the ambiguity resolution implemented in three languages (English, French, Italian) in the normalization of web-search specific terms and mobile text messages. Accuracy and coverage of the FSTs are evaluated against very large collections of BING queries and SMS corpora.
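
To illustrate the kind of ranked, rule-based normalization described (not the proprietary FST tool itself), the sketch below applies ordered regular-expression rules so that the more specific date pattern wins over the generic number-range pattern.

    import re

    # Toy terminal maps; a real system covers all numbers, ordinals, months.
    NUM = {"2010": "two thousand ten", "23": "twenty three", "29": "twenty nine"}
    ORD = {"21": "twenty-first"}
    MONTH = {"12": "December"}

    def normalize(text):
        # Rule 1 (specific, ranked first): dd-mm-yyyy dates.
        m = re.fullmatch(r"(\d{2})-(\d{2})-(\d{4})", text)
        if m and m.group(1) in ORD and m.group(2) in MONTH and m.group(3) in NUM:
            return f"the {ORD[m.group(1)]} of {MONTH[m.group(2)]} {NUM[m.group(3)]}"
        # Rule 2 (generic, ranked lower): numeric ranges.
        m = re.fullmatch(r"(\d+)-(\d+)", text)
        if m and m.group(1) in NUM and m.group(2) in NUM:
            return f"from {NUM[m.group(1)]} to {NUM[m.group(2)]}"
        return text

    print(normalize("21-12-2010"))  # the twenty-first of December two thousand ten
    print(normalize("23-29"))       # from twenty three to twenty nine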

Corresponding author: v-crches@microsoft.com



An Examination of Cross-Cultural Similarities and Differences from Social Media Data with respect to Language Use

Elahi, Mohammad Fazleh and Monachesi, Paola

We present a methodology for analyzing cross-cultural similarities and differences using language as a medium, love as the domain, social media as the data source, and 'terms' (emotions and sentiments) and 'topics' as cultural features. We discuss the techniques necessary for the creation of the social data corpus, from which emotion terms have been extracted using NLP techniques. Topics of love discussion were then extracted from the corpus by means of Latent Dirichlet Allocation (LDA). Finally, on the basis of these features, a cross-cultural comparison was carried out. For the purpose of cross-cultural analysis, the experimental focus was on comparing data from a culture from the East (India) with a culture from the West (United States of America). Similarities and differences between these cultures have been analyzed with respect to the usage of emotions, their intensities, and the topics used in love discussions on social media. Findings include: (i) Indians are more emotional than Americans, but Americans express themselves with stronger emotion terms than Indians; (ii) in discussions on common topics related to love (Wedding, Same Sex, etc.), the conversations of Indians and Americans relate to the particular traditions and recent issues of their culture; and (iii) Indians and Americans also use some terms and topics that are related only to their own culture.
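
A minimal sketch of the LDA step with gensim; the documents here are a toy stand-in for the preprocessed social media corpus.

    from gensim import corpora, models

    # Toy documents standing in for preprocessed social media posts about love.
    texts = [
        ["wedding", "family", "tradition", "love"],
        ["love", "heart", "valentine", "gift"],
        ["wedding", "ceremony", "family", "blessing"],
    ]

    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    # Train a two-topic LDA model and inspect the topics.
    lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)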

Corresponding author: rmf_ku@yahoo.com


Authorship Verification of Quran

Shokrollahi-Far, Mahmoud
Tilburg University

The Holy Quran, as the cultural heritage of the Islamic world, has long been the focus of scholarly disputes, some of them unresolved to this day, such as whether Prophet Mohammad himself authored the book. This paper reports on a line of research that approaches such disputes as a text classification task, in this case authorship verification. To induce classifiers for this verification task, SVM and Naive Bayes machines have been trained on tagged corpora of Quranic texts bootstrapped by Mobin, a morpho-syntactic tagger developed for Arabic. The algorithm applied to the task is an efficient enhancement of the algorithms applied so far to authorship verification, and it seems applicable to authorship problems for other Arabic texts. The results have not verified Prophet Mohammad's authorship of the Quran.
verified the authorship of Quran by Prophet Mohammad.<br />

Corresponding author: m.shokrollahifar@uvt.nl



CLAM: Computational Linguistics Application Mediator

van Gompel, Maarten and Reynaert, Martin and van den Bosch, Antal
TiCC, Tilburg University

The Computational Linguistics Application Mediator (CLAM) allows you to quickly and transparently transform your Natural Language Processing application into a RESTful webservice with which automated clients can communicate, but which at the same time acts as a modern web application with which human end-users can interact directly. CLAM takes a description of your system and wraps itself around it. It allows both automated clients and human end-users to upload input files to your application, start your application with specific parameters, and download or directly view the output files produced by your application after it has completed execution. Rich support for metadata and provenance data is also provided.

CLAM is set up in a universal fashion, making it flexible enough to be wrapped around a wide range of computational linguistic applications. These applications are treated as a black box, of which only the parameters, input formats and output formats need to be described. The applications themselves need not be network-aware in any way, nor aware of CLAM. The handling and validation of input is taken care of by CLAM.
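
To give an impression of what talking to such a RESTful service looks like, here is a schematic client sketch; the base URL, endpoint paths and status handling are illustrative assumptions, not CLAM's documented API.

    import time
    import requests

    # Hypothetical CLAM-wrapped service; the URL and paths below are
    # illustrative assumptions, not CLAM's actual REST interface.
    BASE = "http://localhost:8080/myservice/myproject"

    requests.put(BASE)  # create a project
    with open("input.txt", "rb") as f:  # upload an input file
        requests.post(BASE + "/input/input.txt", files={"file": f})
    requests.post(BASE, data={"parameter1": "value"})  # start execution

    # Poll until the (assumed) status endpoint reports completion,
    # then download an output file.
    while "done" not in requests.get(BASE + "/status").text:
        time.sleep(5)
    output = requests.get(BASE + "/output/result.txt").text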

Corresponding author: proycon@anaproy.nl


Discriminative features in reversible stochastic attribute-value grammars

de Kok, Daniël
University of Groningen

Reversible stochastic attribute-value grammars use one model for parse disambiguation and fluency ranking. Such a model encodes preferences with respect to syntax, fluency, and appropriateness of logical forms as weighted features. This framework is appropriate if similar preferences are used in parsing and generation.

Reversible models incorporate features that are specific to parse disambiguation or fluency ranking, as well as features that are used for both tasks. One particular concern with respect to such models is that much of their discriminatory power may be provided by task-specific features. If this is true, the premise that similar preferences are used in parsing and generation does not hold.

A detailed analysis of features could give us more insight into the true reversibility of stochastic attribute-value grammars. However, as De Kok (2010) argued, such feature-based models are very opaque, due to their enormous size and the tendency to spread weight mass among overlapping features. Feature selection methods can be used to extract a subset of features that do not overlap.

In this work, we compare gain-informed feature selection (Berger et al., 1996; Zhou et al., 2003; De Kok, 2010), grafting (Perkins et al., 2003) and grafting-light (Zhu et al., 2010) in performing selection on reversible models. We then use the most effective method to extract a list of features ranked by their discriminatory power. We show that only a very small number of features is required to produce an effective model for parsing and generation. We also provide a qualitative and quantitative analysis of these features.

Corresponding author: d.j.a.de.kok@rug.nl



Fietstas: a web service for text analysis

Jijkoun, Valentin and de Rijke, Maarten and Vishneuski, Andrei
University of Amsterdam

We present Fietstas, an open-access web service for text analysis, created with the idea of simplifying the building of text-intensive applications. As a web service, Fietstas consists of (1) a simple content management component, where users can upload their content with metadata, (2) a collection of text processing components, from tokenization to named entity extraction and normalization, (3) a component for accessing and visualizing document processing results (e.g. as XML or HTML), and (4) a data analysis component for generating term-cloud-based summaries and timelines. The functionality is available through an easy-to-use REST interface (i.e. through standard HTTP requests), and moreover a number of APIs are available (e.g. for Python and Perl). In this presentation, we briefly describe Fietstas and demonstrate how its functionality can be used in a simple web application (news search and analysis).

Corresponding author: jijkoun@uva.nl


FoLiA: Format for Linguistic Annotation

van Gompel, Maarten and Reynaert, Martin and van den Bosch, Antal
TiCC, Tilburg University

We present FoLiA, an XML-based annotation format suitable for the representation of written language resources. The format builds on the work put into the D-Coi/SoNaR format, but greatly extends it to accommodate a wide variety of linguistic annotations. The objective is to present a rich annotation format based on a single unifying notation paradigm that does not commit to any particular tagset, but instead offers maximum flexibility and extensibility. In doing so, we replace the many ad-hoc formats present in the field with a single well-structured format.

FoLiA will be proposed as a candidate CLARIN standard.

Corresponding author: proycon@anaproy.nl



How can computational linguistics help determine the core meaning of then in oral speech?

Vallee, Michael
EDC

Research on the connective then has mainly focused on its temporal interpretation. However, little has been done on then in oral speech, and more precisely in questions, orders or inferential sentences. It seems important to show that computational linguistics can really help determine whether there is one way to describe the connective in these contexts, or whether there are as many different connectives then as there are linguistic structures.

To do so, I will use the prosody of questions and sentences in oral speech to demonstrate that the speaker expresses surprise or a contradiction with what was uttered prior to the connective in these contexts. To illustrate this perspective, consider the utterance "Now then you listen to me". I will show that the phonological structure revealed by software helps to describe how then works. In the example above, the underlying structure before the utterance was "you-not-listen to me", which was not expected by the speaker. In that case, the connective marks the different viewpoints of the speaker and the hearer.

Corresponding author: m.vallee@yahoo.fr


Of mathematicians and physicists: the history of language and speech technology in the Netherlands and Flanders

van der Beek, Leonoor
Q-go

When did language technology come into being in the Low Countries? Why are language and speech technology (LST) located in different faculties at Dutch and Flemish universities? What was the impact of Lernout & Hauspie on the LST industry in the Netherlands? From September 2009 until September 2010, I investigated these and many other questions related to the history of LST in the Netherlands and Flanders. I interviewed the pioneers of our field and compiled from their stories a diverse picture of the struggle against the limitations of immature computer technology, of boundless optimism and deep disappointment, and of academic friendships and fights. From Adriaan van Wijngaarden to Jo Lernout, and from PHLIQA via Eurotra to CGN. I'll sketch the project and the approach I've taken, and give away some of the highlights of the book 'Van Rekenmachine tot Taalautomaat' (in Dutch), which will tell the full story.

Corresponding author: vdbeek@gmail.com



On the difficulty of making concreteness concrete

van Halteren, Hans and Theijssen, Daphne and Oostdijk, Nelleke and Boves, Lou
Radboud University Nijmegen

As analysis and annotation progress to deeper linguistic levels, matters prove ever more difficult. It not only becomes harder to get machines to provide proper analyses, but also to define exactly what we want. Whereas there appears to be consensus on what plural nouns are (morpho-syntax) or what relative clauses are (syntax), this is certainly not the case for semantic properties like concreteness. When reading papers referring to such concepts, one is unlikely to notice any problems. Bresnan et al. (2007), for example, simply use the concreteness of a noun as a given and draw conclusions about the significance of its influence on choices in the dative alternation.

However, once we ourselves attempt to annotate for concreteness, we run headlong into the absence of any clear definition of concreteness. Bresnan refers to Garretson (2003), where all we get is a vague (and somewhat circular) description and some examples. Looking further, we find lists, such as in the MRC Psycholinguistic Database (Coltheart, 1981), as well as procedures, such as Xing et al.'s (2010) procedure based on WordNet, all apparently leading to values for the property concreteness. But we can only wonder to what degree these various definitions and procedures lead to the same results. In this paper, therefore, we take a number of procedures that yield concreteness values and examine a) to what degree they overlap in their annotation of corpus data (here: SemCor) and b) to what degree they lead to the same conclusions about the influence of concreteness on syntactic processes (here: the dative alternation).
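
A minimal sketch of one possible WordNet-based procedure (in the spirit of, but not necessarily identical to, Xing et al. 2010): treat a noun as concrete if any of its senses descends from the physical_entity synset.

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    PHYSICAL = wn.synset("physical_entity.n.01")

    def is_concrete(noun):
        # A noun counts as concrete here if any sense has physical_entity
        # among its hypernym ancestors. This is one heuristic among many,
        # which is precisely the problem the paper addresses.
        for sense in wn.synsets(noun, pos=wn.NOUN):
            if PHYSICAL in sense.closure(lambda s: s.hypernyms()):
                return True
        return False

    for word in ["table", "idea", "dog", "freedom"]:
        print(word, is_concrete(word))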

Corresponding author: hvh@let.ru.nl


ParaSense, or how to use parallel corpora for Cross-Lingual Word Sense Disambiguation

Lefever, Els and Hoste, Véronique
LT3, University College Ghent

Cross-lingual Word Sense Disambiguation (WSD) consists of selecting the correct translation of an ambiguous word in a given context. In this talk we present a set of experiments for a classification-based WSD system that uses evidence from multiple languages to choose a translation label for an ambiguous target word in one of the five supported languages (viz. Italian, Spanish, French, Dutch and German). Instead of using a predefined monolingual sense inventory such as WordNet, we use a language-independent framework and build up our sense inventory by means of the aligned translations from the parallel corpus Europarl.

The information that is used to train and test our classifier contains the well-known WSD local context features of the English input sentences, as well as translation features from the other languages.

Our results show that the multilingual approach outperforms classification experiments that only take into account the more traditional monolingual WSD features. In addition, our results are competitive with those of the best systems that participated in the SemEval-2 "Cross-Lingual Word Sense Disambiguation" task.

Corresponding author: els.lefever@hogent.be



"Pattern", a web mining module for Python<br />

De Smedt, Tom and Daelemans, Walter<br />

CLiPS, University of Antwerp<br />

"Pattern" is a mash-up package for the Python programming language that bundles<br />

fast, regular expressions-based functionality for NLP and data-mining tasks. It consists<br />

of the following modules:<br />

1) pattern.web: provides easy access to Google, Yahoo, Bing, Twitter, Wikipedia,<br />

Flickr, RSS + a robust HTML DOM parser.<br />

2) pattern.en: tools for verb inflection, noun pluralization/singularization, a WordNet<br />

interface, a fast tagger/chunker based on regular expressions.<br />

3) pattern.table: for working with datasheets (e.g. MS Excel) and CSV-files.<br />

4) pattern.search: regular expressions for syntax and semantics. For example:<br />

"BRAND|NP VP JJ+" matches any sentence in which a noun phrase containing a<br />

brand name is followed by a verb phrase followed by one or more adjectives, e.g.<br />

"the new iPhone will be amazing", "Doritos taste cheesy", ...<br />

5) pattern.vector: corpus tools for tf-idf, cosine similarity, vector space search and<br />

LSA.<br />

6) pattern.graph: for exploring graphs and semantic networks.<br />

The package can be used and extended for harvesting online data, opinion mining,<br />

building semantic networks using a machine learning approach, and so on.<br />
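
A small usage sketch of the pattern.en module described above; the function names follow the abstract, but exact signatures may differ per version, so consult the package documentation.

    # Illustrative use of pattern.en; assumes Pattern 2.x as described above.
    from pattern.en import parse, pluralize, singularize

    print(pluralize("chunk"))      # 'chunks'
    print(singularize("taggers"))  # 'tagger'

    # The tagger/chunker returns a tagged string with part-of-speech
    # and chunk labels.
    print(parse("The new iPhone will be amazing."))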

Corresponding author: tom.desmedt@ua.ac.be


Semantic role labeling of gene regulation events

Morante, Roser
CLiPS, University of Antwerp

This poster describes work in progress on semantic role labeling of gene regulation events. Semantic role labeling (SRL) is a natural language processing task that consists of identifying the arguments of predicates within a sentence and assigning a semantic role to them (Màrquez et al., 2008). This task can support the extraction of relations from biomedical texts. Recent research has produced a rich variety of SRL systems to process general-domain corpora. However, fewer systems have been developed to process biomedical corpora (Tzong-Han Tsai et al., 2007; Bethard et al., 2008). In this abstract, we present preliminary results of a new system that is trained on the GREC corpus (Thompson et al., 2009). The system performs argument identification and semantic role assignment in a single step, assuming gold-standard event identification. We provide cross-validation and cross-domain results.

Corresponding author: roser.morante@ua.ac.be



Source Verification in Quran

Shokrollahi-Far, Mahmoud
Tilburg University

The revelation of the Holy Quran took place either in Mecca or in Medina, and accordingly the chapters, and even individual verses, of the book have been classified as either Meccan or Medinan. This crucial classification helps the scholars of the Quran in many areas, including the exegesis of the book. Among the one hundred and fourteen chapters of the Quran, thirty-two are still disputed as to whether they are Meccan or Medinan. More fundamentally, scholars have long disputed over the features that would discriminate between these two classes. This paper reports on a line of research that applies text classification, in this case source verification, to help resolve such Quranic disputes. For this binary TC task, classifiers have been induced by training SVM and Naive Bayes machines on tagged corpora of Quranic texts bootstrapped by Mobin, a morpho-syntactic tagger developed for Arabic. This research has not only uncovered the required distinctive grammatical features, but has also led to a successful classification of the disputed chapters and verses as Meccan or Medinan.

Corresponding author: m.shokrollahifar@uvt.nl


Towards a language-independent data-driven compound decomposition tool

Réveil, Bert (1) and Macken, Lieve (2)
(1) ELIS, Ghent University
(2) LT3, Language and Translation Technology Team

Compounding is a highly productive process in Dutch that poses a challenge for various NLP applications, such as terminology extraction, continuous speech recognition and automated word alignment. The present work therefore proposes a language-independent, data-driven decomposition tool that tries to segment compounds into their meaningful parts.

The basic version of this tool first determines a list of eligible compound constituents (so-called heads and tails), relying solely on word frequency information extracted from a large text corpus. The decomposition algorithm then recursively attempts to decompose the compounds, allowing only two-part head-tail divisions in each iteration. For example, the noun 'postzegelverzamelaar' is first split into 'postzegel' + 'verzamelaar', followed by an additional decomposition of 'postzegel' into 'post' + 'zegel'.

Apart from the basic version, an extended version of the tool is assessed that uses PoS information as a means to restrict the list of possible heads and tails. The performance of both versions is evaluated in two large-scale decomposition experiments, one on the E-lex compound list and one on a word list that contains specific vocabulary from the automotive domain. As the presented decomposition tool relies only on word frequency and PoS information, it is expected that it can easily be adapted to new domains and languages.
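
A minimal sketch of the recursive head-tail splitting described above; the frequency lexicon and thresholds are toy stand-ins, whereas the real tool derives constituent lists from a large corpus.

    # Toy frequency lexicon standing in for corpus-derived constituent lists.
    FREQ = {"post": 900, "zegel": 400, "postzegel": 300, "verzamelaar": 250}

    MIN_FREQ = 100   # threshold for a string to count as an eligible constituent
    MIN_LEN = 3      # avoid spurious one- or two-letter parts

    def decompose(word):
        # Try every two-part head-tail division; recurse on the parts.
        for i in range(MIN_LEN, len(word) - MIN_LEN + 1):
            head, tail = word[:i], word[i:]
            if FREQ.get(head, 0) >= MIN_FREQ and FREQ.get(tail, 0) >= MIN_FREQ:
                return decompose(head) + decompose(tail)
        return [word]  # no valid split found: treat as atomic

    print(decompose("postzegelverzamelaar"))  # ['post', 'zegel', 'verzamelaar']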

Corresponding author: breveil@elis.ugent.be



Towards improving the precision of a relation extraction system by processing negation and speculation

Van Asch, Vincent and Morante, Roser and Daelemans, Walter
CLiPS, University of Antwerp

In this poster we present BiographTA, a system that extracts biological relations from PubMed abstracts. The relation extraction system has been designed to process abstracts in which biological relations from multiple databases have been annotated automatically on the basis of an in-sentence co-occurrence criterion. It performs a relation identification task learning from noisy data, since a proportion of the automatically annotated relations in the training corpus is incorrect. The system cannot be evaluated on noisy data. For this reason, in order to develop and evaluate the system, we gathered a corpus of PubMed abstracts annotated with the gold biological relations of the BioInfer corpus.

Additionally, one of the text mining goals in the Biograph project is to develop techniques that make it possible to perform large-scale relation extraction starting from the smallest possible amount of manually annotated data while obtaining the highest possible precision. This is why we add to the relation extraction system a module that processes negation and speculation cues. We present experiments aimed at testing whether processing the scope of negation and speculation cues results in a higher precision of the extracted relations. Results show that the negation and speculation detection module increases precision by 2.93 points at the cost of a 0.68-point decrease in recall.

Corresponding author: Vincent.VanAsch@ua.ac.be

List of Participants


Liesbeth Augustinus | liesbeth@ccl.kuleuven.be | CCL, K.U.Leuven
Daan Baldewijns | v-daanb@microsoft.com | Microsoft Language Development Center
Kim Bauters | kim.bauters@ugent.be | Ghent University
Richard Beaufort | richard.beaufort@uclouvain.be | UCL CENTAL
Peter Berck | P.J.Berck@UvT.nl | TiCC, Tilburg University
Tamas Biro | t.s.biro@uva.nl | ACLC, University of Amsterdam
Jelke Bloem | j.bloem.3@student.rug.nl | University of Groningen
Cristiano Chesi | v-crches@microsoft.com | Microsoft Language Development Center, Porto Salvo; ISCTE-Lisbon University Institute, Portugal
Kostadin Cholakov | k.cholakov@rug.nl | University of Groningen
Louise-Amélie Cougnon | louise-amelie.cougnon@uclouvain.be | CENTAL, IL&C, UCLouvain
Crit Cremers | c.l.j.m.cremers@hum.leidenuniv.nl | Leiden University
Walter Daelemans | walter.daelemans@ua.ac.be | CLiPS, University of Antwerp
Orphée De Clercq | orphee.declercq@hogent.be | LT3, University College Ghent
Martine De Cock | Martine.DeCock@UGent.be | Ghent University
Daniël de Kok | d.j.a.de.kok@rug.nl | University of Groningen
Tom De Smedt | tomdesmedt@gmail.com | CLiPS, Universiteit Antwerpen
Herwig De Smet | herwig.desmet@kdg.be | OptiFox 7th Framework Europe
Dennis de Vries | dennis@gridline.nl | GridLine
Saskia Debergh | saskia.debergh@intersystems.com | i.Know nv
Johannes Deleu | johannes.deleu@intec.ugent.be | IBCN, IBBT & Ghent University
Thomas Demeester | thomas.demeester@ugent.be | Ghent University
Bart Desmet | bart.desmet@hogent.be | LT3, University College Ghent
Brecht Desplanques | brecht.desplanques@elis.ugent.be | ELIS, Ghent University
Peter Dirix | peter.dirix@nuance.com | Nuance
Stephen Doherty | stephen.doherty2@mail.dcu.ie | Dublin City University
Marius Doornenbal | m.doornenbal@elsevier.com | Reed Elsevier
Frederik Durant | frederik.durant@tomtom.com | TomTom
Mohammad Fazleh Elahi | rmf_ku@yahoo.com
Thomas François | thomas.francois@uclouvain.be | CENTAL, UCLouvain
Tanja Gaustad | T.Gaustad@uvt.nl | TiCC, Tilburg University
Olga Gordeeva | ogordeeva@gmail.com | Acapela Group
Kris Heylen | kris.heylen@arts.kuleuven.be | QLVL, K.U.Leuven
Maarten Hijzelendoorn | p.m.hijzelendoorn@hum.leidenuniv.nl | Leiden University
Veronique Hoste | veronique.hoste@hogent.be | LT3, University College Ghent
Steve Hunt | s.j.hunt@tilburguniversity.nl | TiCC, Tilburg University
Marc Kemps-Snijders | marc.kemps.snijders@meertens.knaw.nl | Meertens Instituut
Mike Kestemont | mike.kestemont@ua.ac.be | CLiPS, University of Antwerp
Maxim Khalilov | maxkhalilov@gmail.com | ILLC, University of Amsterdam
Henny Klein | E.H.Klein@rug.nl | University of Groningen
Cornelis H.A. Koster | kees@cs.ru.nl | Radboud Universiteit Nijmegen
Gideon Kotzé | g.j.kotze@rug.nl | University of Groningen
Mark Kroon | mark.kroon@actonomy.com | Actonomy
Reinier Lamers | lamers@textkernel.nl | Textkernel
Els Lefever | els.lefever@hogent.be | LT3, University College Ghent
Anna Lobanova | a.lobanova@ai.rug.nl | AI, University of Groningen
Alessandro Lopopolo | A.Lopopolo@student.uva.nl | ACLC, University of Amsterdam
Kim Luyckx | kim.luyckx@ua.ac.be | CLiPS, University of Antwerp
Lieve Macken | lieve.macken@hogent.be | LT3, University College Ghent
Gideon Maillette de Buy Wenniger | gemdbw@gmail.com | ILLC, University of Amsterdam
Véronique Malaisé | vmalaise@vu.nl | VU University Amsterdam
Jean-Luc Manguin | jean-luc.manguin@unicaen.fr | CNRS, Université de Caen
Eliza Margaretha | e.margaretha@student.rug.nl | University of Groningen
Thomas Markus | Thomas.Markus@phil.uu.nl | Utrecht University
Scott Martens | scott@ccl.kuleuven.be | CCL, K.U.Leuven
Dieneke Meijer | dieneke.meijer@agentschapnl.nl | Agentschap NL / STEVIN
Sien Moens | sien.moens@cs.kuleuven.be | K.U.Leuven CW
Paola Monachesi | P.Monachesi@uu.nl | Utrecht University
Roser Morante | roser.morante@ua.ac.be | CLiPS, University of Antwerp
Peter Nabende | p.nabende@rug.nl | University of Groningen
Fabrice Nauze | fabrice.nauze@rightnow.com | Q-go / Rightnow
John Nerbonne | j.nerbonne@rug.nl | University of Groningen
Jan Odijk | j.odijk@uu.nl | UiL-OTS, Universiteit Utrecht
Leequisach Panjaitan | leequisach.panjaitan@yahoo.com
Claudia Peersman | claudia.peersman@ua.ac.be | CLiPS, University of Antwerp & Artesis
Suléne Pilon | sulene.pilon@nwu.ac.za | North-West University (VTC)
Barbara Plank | b.plank@rug.nl | University of Groningen
Massimo Poesio | poesio@essex.ac.uk | University of Essex
Bert Réveil | breveil@elis.ugent.be | DSSP, ELIS, Ghent University
Mihai Rotaru | rotaru@textkernel.nl | Textkernel
Nicholas Ruiz | nicholas.ruiz@gmail.com | University of Groningen
Marijn Schraagen | schraage@liacs.nl | Leiden University
Louise Schubotz | louise_schubotz@gmx.de | Radboud University Nijmegen
Ineke Schuurman | ineke.schuurman@ccl.kuleuven.be | CCL, K.U.Leuven
Roxane Segers | r.h.segers@vu.nl | VU University Amsterdam
Binyam Seyoum | binephrem@gmail.com | Addis Ababa University
Margaux Smets | margauxsmets@gmail.com | QLVL, K.U.Leuven
Martijn Spitters | spitters@textkernel.nl | Textkernel
Peter Spyns | pspyns@taalunie.org | Nederlandse Taalunie
Tim Stokman | timstokman@gmail.com | Textkernel
Dries Tanghe | dwiesje@hotmail.com | LT3, University College Ghent
Tristan Thomas Teunissen | tristan@w3lab.nl | BA
Daphne Theijssen | d.theijssen@let.ru.nl | Radboud University Nijmegen
Erik Tjong Kim Sang | erikt@xs4all.nl | University of Groningen
Fabian Triefenbach | fabian.triefenbach@elis.ugent.be | ELIS, Ghent University
Frederik Vaassen | frederik.vaassen@ua.ac.be | CLiPS, University of Antwerp
Michaël Vallée | m.vallee@yahoo.fr | EDC Paris
Vincent Van Asch | Vincent.VanAsch@ua.ac.be | CLiPS, University of Antwerp
Matje van de Camp | M.M.v.d.Camp@uvt.nl | TiCC, Tilburg University
Tim Van de Cruys | tv234@cam.ac.uk | University of Cambridge
Marjan Van de Kauter | marjan.vandekauter@hogent.be | LT3, University College Ghent
Anne van de Wetering | amvdwetering@hotmail.com | University of Groningen
Joachim Van den Bogaert | joachim@ccl.kuleuven.be | CCL, K.U.Leuven
Antal van den Bosch | Antal.vdnBosch@uvt.nl | TiCC, Tilburg University
Leonoor van der Beek | leonoor.vanderbeek@rightnow.com | Q-go / Rightnow
Marieke van Erp | Marieke@cs.vu.nl | VU University Amsterdam
Frank Van Eynde | frank.vaneynde@ccl.kuleuven.be | CCL, K.U.Leuven
Maarten van Gompel | proycon@anaproy.nl | TiCC, Tilburg University
Hans van Halteren | hvh@let.ru.nl | Radboud University Nijmegen
Gertjan van Noord | g.j.m.van.noord@rug.nl | University of Groningen
Philip van Oosten | philip.vanoosten@hogent.be | LT3, University College Ghent
Menno van Zaanen | mvzaanen@uvt.nl | TiCC, Tilburg University
Tom Vanallemeersch | tallem@ccl.kuleuven.be | CCL, K.U.Leuven
Vincent Vandeghinste | vincent@ccl.kuleuven.be | CCL, K.U.Leuven
Klaar Vanopstal | klaar.vanopstal@hogent.be | LT3, University College Ghent
Kateryna Vasylenko | Katyaknu1986@mail.ru | Nijmegen University
Peter Velaerts | peter.velaerts@hogent.be | LT3, University College Ghent
Suzan Verberne | s.verberne@let.ru.nl | Radboud University Nijmegen
Reinder Verlinde | R.Verlinde@Elsevier.com | Elsevier
Yuliya Vladimirova | vladimirova83@gmail.com
Tim Wauters | tim.wauters@intec.ugent.be | IBBT & Ghent University
Edgar Weiffenbach | s1422022@student.rug.nl | CLCG, University of Groningen
Eline Westerhout | elinewesterhout@gmail.com | Utrecht University
Thomas Wielfaert | thomas.wielfaert@ugent.be
Martijn Wieling | m.b.wieling@rug.nl | University of Groningen
Sander Wubben | s.wubben@uvt.nl | TiCC, Tilburg University
Jakub Zavrel | zavrel@textkernel.nl | Textkernel
