Programme booklet

lt3.hogent.be

Ghent, February 11th 2011

21st meeting of Computational Linguistics in the Netherlands


CLIN-21 was supported by:


Welcome!

For the first time in its over 20-year history, the “Computational Linguistics in the Netherlands” meeting is being held in the beautiful city of Ghent. This year’s edition of the meeting is hosted by the Language and Translation Technology Team of University College Ghent.

CLIN-21 will cover a broad spectrum of areas related to natural language and computation. The program features 55 talks, organized in 5 parallel sessions, and 18 posters on different aspects of computational linguistics. We are delighted that Massimo Poesio from the University of Essex has accepted to present his vision of current anaphora resolution research. Leonoor van der Beek will present a booklet on the history of language and speech technology in the Netherlands and Flanders. At the CLIN meeting, we will also announce the winner of the STIL Thesis Prize 2011, awarded to the best MA thesis in computational linguistics or its applications.

This booklet contains the presentation and poster abstracts for this year’s CLIN, as well as the program schedule. The abstracts are ordered alphabetically.

We hope that CLIN-21 will provide a rewarding forum for the presentation of interesting work and new ideas in the domain of computational linguistics and natural language processing, with stimulating and provocative discussions of successes, failures and new directions. We thank you for your support and participation and wish you a pleasant and fruitful conference!

The CLIN-21 organizing committee

Véronique Hoste

Els Lefever

Kathelijne Denturck

Peter Velaerts



Table of Contents

Welcome!

Programme

Keynote speaker
  Rethinking anaphora

Presentation Abstracts
  A discriminative syntactic model for source permutation via tree transduction for statistical machine translation
  A Generalized Lexical Acquisition Technique for Improved Parsing Accuracy
  A Semantic Vector Space for Modelling Word Meaning in Context
  A Toolkit for Visualizing the Coherence of Tree-based Reordering with Word-Alignments
  A U-DOP approach to modeling language acquisition
  Age and Gender Prediction on Netlog Data
  Aligning translation divergences through semantic role projection
  An Aboutness-based Dependency Parser for Dutch
  An exploration of n-gram relationships for transliteration identification
  Application of a Constraint Conditional Model for improving the performance of a Sequence Tagger: A Case Study
  Automatic terminology extraction: methods and practical applications
  Automatically Constructing a Wordnet for Dutch
  Automatically determining phonetic distances
  Building a Gold Standard for Dutch Spelling Correction
  Clustering customer questions
  Collecting and using a corpus of lyrics and their moods
  Combined Qualitative and Quantitative Error Analysis in Multi-Topic Authorship Attribution
  Combining e-learning and natural language processing. The example of automated dictation exercises
  Computing Semantic Relations from Heterogeneous Information Sources
  Computing the meaning of multi-word expressions for semantic inference
  Cross-Domain Dutch Coreference Resolution
  Dmesure: a readability platform for French as a foreign language
  Essentials of person names
  Extraction of Historical Events from Unstructured Texts
  Finding Statistically Motivated Features Influencing Subtree Alignment Performance
  From Tokens to Text Entities: Line-based Parsing of Resumes using Conditional Random Fields
  Im chattin :-) u wanna NLP it: Analyzing Reduction in Chat
  Language Evolution and SA-OT: The case of sentential negation
  Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus
  Measuring the Impact of Controlled Language on Machine Translation Text via Readability and Comprehensibility
  Memory-based text completion
  Overlap-based Phrase Alignment for Language Transformation
  Parse and Tag Somali Pirates
  Personalized Knowledge Discovery: Combining Social Media and Domain Ontologies
  Recent Advances in Memory-Based Machine Translation
  Reversible stochastic attribute-value grammars
  Robust Rhymes? The Stability of Authorial Style in Medieval Narratives
  Rule Induction for Synchronous Tree-Substitution Grammars in Machine Translation
  Search in the Lassy Small Corpus
  Simple Measures of Domain Similarity for Parsing
  SSLD: A smart tool for sms compression
  Subtrees as a new type of context in Word Space Models
  Successful extraction of opposites by means of textual patterns with part-of-speech information only
  Syntactic Analysis of Dutch via Web Services
  Technology recycling between Dutch and Afrikaans
  Technology recycling for closely related languages: Dutch and Afrikaans
  The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation
  The use of structure discovery methods to detect syntactic change
  Treatments of the Dutch verb cluster in formal and computational linguistics
  TTNWW: de facto standards for Dutch in the context of CLARIN
  TTNWW: NLP Tools for Dutch as Webservices in a Workflow
  Using corpora tools to analyze gradable nouns in Dutch
  Using easy distributed computing for data-intensive processing
  What is the use of multidocument spatiotemporal analysis?
  Without a doubt no uncomplicated task: Negation cues and their scope

Poster Abstracts
  A database for lexical orthographic errors in French
  A Posteriori Agreement as a Quality Measure for Readability Prediction Systems
  A TN/ITN Framework for Western European languages
  An Examination of Cross-Cultural Similarities and Differences from Social Media Data with respect to Language Use
  Authorship Verification of Quran
  CLAM: Computational Linguistics Application Mediator
  Discriminative features in reversible stochastic attribute-value grammars
  Fietstas: a web service for text analysis
  FoLiA: Format for Linguistic Annotation
  How can computational linguistics help determine the core meaning of then in oral speech?
  Of mathematicians and physicists: the history of language and speech technology in the Netherlands and Flanders
  On the difficulty of making concreteness concrete
  ParaSense or how to use parallel corpora for Cross-Lingual Word Sense Disambiguation
  "Pattern", a web mining module for Python
  Semantic role labeling of gene regulation events
  Source Verification in Quran
  Towards a language-independent data-driven compound decomposition tool
  Towards improving the precision of a relation extraction system by processing negation and speculation

List of Participants


Programme

Friday, February 11th 2011

09:00-09:30  Registration and coffee (Foyer)

Morning parallel sessions, 09:30-10:50 (talks at 09:30, 09:50, 10:10 and 10:30)

Social media (Room 328; chair: Richard Beaufort)
  Age and Gender Prediction on Netlog Data (C. Peersman, W. Daelemans, L. Van Vaerenbergh)
  Im chattin :-) u wanna NLP it: Analyzing Reduction in Chat (H. van Halteren, G. Martell, C. Du, Y. Gu, J. Kobben, L. Panjaitan, L. Schubotz, K. Vasylenko, Y. Vladimirova)
  Personalized Knowledge Discovery: Combining Social Media and Domain Ontologies (Th. Markus, E. Westerhout, P. Monachesi)
  Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus (S. Schrauwen, W. Daelemans)

Syntax and parsing (Room 303; chair: Menno van Zaanen)
  The use of structure discovery methods to detect syntactic change (L. ten Bosch, M. Versteegh)
  An Aboutness-based Dependency Parser for Dutch (C.H.A. Koster)
  A Generalized Lexical Acquisition Technique for Improved Parsing Accuracy (K. Cholakov, G. van Noord, V. Kordoni, Y. Zhang)
  Search in the Lassy Small Corpus (G. van Noord, D. de Kok, J. van der Linde)

Lexical semantics (Room 313; chair: Tanja Gaustad)
  A Semantic Vector Space for Modelling Word Meaning in Context (K. Heylen, D. Speelman, D. Geeraerts)
  Automatically Constructing a Wordnet for Dutch (T. Van de Cruys)
  Successful extraction of opposites by means of textual patterns with part-of-speech information only (A. Lobanova)
  Computing the meaning of multi-word expressions for semantic inference (C. Cremers)

Machine translation (Room 317; chair: Lieve Macken)
  Rule Induction for Synchronous Tree-Substitution Grammars in Machine Translation (V. Vandeghinste, S. Martens)
  Recent Advances in Memory-Based Machine Translation (M. van Gompel, A. van den Bosch, P. Berck)
  A discriminative syntactic model for source permutation via tree transduction for statistical machine translation (M. Khalilov, Kh. Sima'an, G.M. de Buy Wenniger)
  Measuring the Impact of Controlled Language on Machine Translation Text via Readability and Comprehensibility (St. Doherty)

Methodology (Room 403; chair: Erik Tjong Kim Sang)
  Using easy distributed computing for data-intensive processing (J. Van den Bogaert)
  Application of a Constraint Conditional Model for improving the performance of a Sequence Tagger: A Case Study (T. Stokman)
  Technology recycling between Dutch and Afrikaans (L. Augustinus, G. van Huyssteen, S. Pilon)
  Technology recycling for closely related languages: Dutch and Afrikaans (S. Pilon, G. van Huyssteen)

10:50-11:10  Coffee break (Foyer)
11:10-11:20  Welcome (Auditorium)
11:20-11:30  STIL thesis prize (Auditorium)
11:30-12:30  Invited talk: Rethinking anaphora (M. Poesio) (Auditorium)
12:30-12:50  Of mathematicians and physicists: the history of language and speech technology in the Netherlands and Flanders (L. van der Beek) (Auditorium)
12:50-13:50  Lunch break (Restaurant)

Afternoon parallel sessions, 13:50-15:10 (talks at 13:50, 14:10, 14:30 and 14:50)

Information extraction (Room 328; chair: Eline Westerhout)
  Extraction of Historical Events from Unstructured Texts (R. Segers, M. van Erp, L. van der Meij)
  Clustering customer questions (F. Nauze)
  Parse and Tag Somali Pirates (M. van Erp, V. Malaisé, W. van Hage, V. Osinga, J.M. Coleto)
  From Tokens to Text Entities: Line-based Parsing of Resumes using Conditional Random Fields (M. Rotaru)

Syntax and parsing (Room 303; chair: Antal van den Bosch)
  Reversible stochastic attribute-value grammars (D. de Kok, G. van Noord, B. Plank)
  The more the merrier? How data set size and noisiness affect the accuracy of predicting the dative alternation (D. Theijssen, H. van Halteren, L. Boves, N. Oostdijk)
  Simple Measures of Domain Similarity for Parsing (B. Plank, G. van Noord)
  Treatments of the Dutch verb cluster in formal and computational linguistics (F. Van Eynde)

Semantics (Room 313; chair: Kris Heylen)
  Computing Semantic Relations from Heterogeneous Information Sources (A. Panchenko)
  Essentials of person names (M. Schraagen)
  Subtrees as a new type of context in Word Space Models (M. Smets, D. Speelman, D. Geeraerts)
  Using corpora tools to analyze gradable nouns in Dutch (N. Ruiz, E. Weiffenbach)

Translation (Room 317; chair: Vincent Vandeghinste)
  Finding Statistically Motivated Features Influencing Subtree Alignment Performance (G. Kotzé)
  A Toolkit for Visualizing the Coherence of Tree-based Reordering with Word-Alignments (G. Maillette de Buy Wenniger)
  Overlap-based Phrase Alignment for Language Transformation (S. Wubben, A. van den Bosch, E. Krahmer)
  Aligning translation divergences through semantic role projection (T. Vanallemeersch)

Discourse (Room 403; chair: Kim Luyckx)
  Cross-Domain Dutch Coreference Resolution (O. De Clercq, V. Hoste)
  What is the use of multidocument spatiotemporal analysis? (I. Schuurman, V. Vandeghinste)
  Language Evolution and SA-OT: The case of sentential negation (A. Lopopolo, T. Biro)
  Without a doubt no uncomplicated task: Negation cues and their scope (R. Morante, S. Schrauwen, W. Daelemans)

15:10-16:10  Poster session and coffee break (Foyer)

Late afternoon parallel sessions, 16:10-17:10 (talks at 16:10, 16:30 and 16:50)

Standards/CLARIN (Room 328; chair: Gertjan van Noord)
  TTNWW: de facto standards for Dutch in the context of CLARIN (I. Schuurman, M. Kemps-Snijders)
  TTNWW: NLP Tools for Dutch as Webservices in a Workflow (M. Kemps-Snijders, I. Schuurman)
  Syntactic Analysis of Dutch via Web Services (E. Tjong Kim Sang)

Syntax/Spelling (Room 303; chair: Roser Morante)
  A U-DOP approach to modeling language acquisition (M. Smets)
  Memory-based text completion (A. van den Bosch)
  Building a Gold Standard for Dutch Spelling Correction (T. Gaustad, A. van den Bosch)

Lexical semantics/Discourse (Room 313; chair: Tim Van de Cruys)
  An exploration of n-gram relationships for transliteration identification (P. Nabende)
  Combined Qualitative and Quantitative Error Analysis in Multi-Topic Authorship Attribution (K. Luyckx, W. Daelemans)
  Robust Rhymes? The Stability of Authorial Style in Medieval Narratives (M. Kestemont, W. Daelemans, D. Sandra)

Beyond text (Room 317; chair: Paola Monachesi)
  Combining e-learning and natural language processing. The example of automated dictation exercises (R. Beaufort, S. Roekhaut)
  Collecting and using a corpus of lyrics and their moods (M. van Zaanen)

Tools (Room 403; chair: Els Lefever)
  Automatic terminology extraction: methods and practical applications (D. de Vries)
  SSLD: A smart tool for sms compression (L.-A. Cougnon, R. Beaufort)

17:10-19:00  Drinks (Foyer)


Keynote speaker

Rethinking anaphora

Massimo Poesio

Current models of the anaphora resolution task achieve mediocre results for all but the simplest aspects of the task, such as coreference proper (i.e. linking proper names into coreference chains). One of the reasons for this state of affairs is the drastically simplified picture of the task at the basis of existing annotated resources and models, e.g. the assumption that human subjects by and large agree on anaphoric judgments. In this talk I will present the current state of our efforts to collect more realistic judgments about anaphora through the Phrase Detectives online game, and to develop models of anaphora resolution that do not rely on the total agreement assumption.


Presentation Abstracts




A discriminative syntactic model for source permutation via tree transduction for statistical machine translation

Khalilov, Maxim and Sima'an, Khalil and Maillette de Buy Wenniger, Gideon

ILLC-UvA

Word ordering is still one of the most challenging problems in statistical machine translation (SMT). In most existing work, a word reordering model is implicitly or explicitly incorporated into a translation system based on flat or hierarchical representations of phrases. By contrast, this study addresses the word ordering problem via source-side permutation prior to translation, using hierarchical and syntactic structures.

Our work is driven by the idea that reordering the source sentence as a pre-translation

step minimizes the need for reordering during translation and may bridge long-distance

order differences, which are outside the scope of commonly used reordering models.

Given a word-aligned parallel corpus, we define the source string permutation as the

task of statistically learning to unfold the crossing alignments between sentence pairs

in the parallel corpus.

This work contributes an approach for learning source string permutation via transfer

of the source syntax tree, i.e. we define source permutation as the problem of learning

how to transfer a given source parse-tree into a parse-tree that minimizes the

divergence from target word-order.

We present a novel discriminative, probabilistic tree transduction model, and contribute a set of empirical oracle results (upper bounds on translation performance) for English-to-Dutch source string permutation under sequence and parse tree constraints. Finally, the translation performance of our learning model is shown to significantly outperform the state-of-the-art phrase-based system.

Corresponding author: maxkhalilov@gmail.com
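
For readers who want a concrete picture of what "unfolding the crossing alignments" amounts to, here is a much simpler, purely alignment-based baseline (not the discriminative tree transduction model presented in this talk; the example data is invented): each source token is reordered by the average position of its aligned target words.

    def permute_source(source_tokens, alignment):
        # alignment: set of (source_index, target_index) pairs, e.g. from GIZA++.
        def mean_target(i):
            targets = [t for s, t in alignment if s == i]
            return sum(targets) / len(targets) if targets else i
        order = sorted(range(len(source_tokens)), key=mean_target)
        return [source_tokens[i] for i in order]

    # Dutch "hij heeft het boek gelezen" aligned to English "he has read the book":
    print(permute_source(["hij", "heeft", "het", "boek", "gelezen"],
                         {(0, 0), (1, 1), (4, 2), (2, 3), (3, 4)}))
    # -> ['hij', 'heeft', 'gelezen', 'het', 'boek'] (English-like word order)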



A Generalized Lexical Acquisition Technique for Improved Parsing Accuracy

Cholakov, Kostadin (1) and van Noord, Gertjan (1) and Kordoni, Valia (2) and Zhang, Yi (3)

(1) University of Groningen; (2) DFKI, Germany; (3) University of Saarland, Germany

Unknown words are a major issue for large-scale precision grammars of natural

language. In Cholakov and van Noord (COLING 2010) we proposed a maximum entropy

based classification algorithm for acquiring lexical entries for all forms in the paradigm

of a given unknown word and we tested its performance on the Dutch Alpino grammar.

The study showed an increase in parsing accuracy when our method was applied.

However, the general applicability of our approach has been a major point of criticism: it has been considered too specific, and its application to other systems and languages doubtful.

In this presentation, we argue that our method can be applied to any precision grammar provided that the following conditions are fulfilled: a finite set of labels onto which unknown words are mapped; large corpora; a parser which analyses various contexts of a given unknown word and provides syntactic constraints used as features in the classification process; and a morphological component which generates the paradigm(s) of the unknown word. We show that fulfilling these conditions allows us to successfully apply our approach to other large-scale grammars, where it leads to a significant increase in parsing accuracy on test sets of sentences containing unknown words, compared with the accuracy achieved by the default unknown-word handling methods of these grammars.

This provides strong support for our claim that our approach is general enough to be

applied to various languages and precision grammars.

Corresponding author: kcholakov@gmail.com




A Semantic Vector Space for Modelling Word Meaning in Context

Heylen, Kris and Speelman, Dirk and Geeraerts, Dirk

QLVL, Katholieke Universiteit Leuven

Semantic vector spaces have become the mainstay of modelling word meaning in statistical NLP. They encode the semantics of words through high-dimensional vectors that record the co-occurrence of those words with context features in a large corpus. Vector comparison then allows for the calculation of, e.g., semantic similarity between words. Most semantic vector spaces represent word meaning on the type (or lemma) level, i.e. their vectors generalize over all occurrences of a word. However, the meaning of words can differ considerably between contexts due to polysemy or vagueness. Therefore, many applications, like Word Sense Disambiguation (WSD) or Textual Entailment, require that word meaning be modelled on the token level, i.e. the level of individual occurrences. In this paper, we present a semantic vector space model that represents the meaning of word tokens by taking the word type vector and reweighting it based on the words observed in the token's immediate vicinity. More specifically, we give a bigger weight to the context features in the original type vector that are semantically similar to the context features observed around the token. This semantic similarity between context features is calculated based on the original word-type-by-context-feature matrix. We explore the performance of this model in a WSD task by visualizing how well the model separates the different meanings of polysemous words in a Multi-Dimensional Scaling solution. We also compare our model to other token-level semantic vector spaces as proposed by Schütze (1998) and Erk & Padó (2008).

References

Erk, K. & S. Padó. 2008. A Structured Vector Space Model for Word Meaning in Context. EMNLP Proceedings, 897-906.

Schütze, H. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97-124.

Corresponding author: kris.heylen@arts.kuleuven.be
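
To make the reweighting idea concrete, here is a minimal sketch of one plausible reading of the procedure (not necessarily the authors' exact weighting scheme; the cosine similarity function and the data layout are assumptions):

    import numpy as np

    def cosine(u, v):
        n = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / n) if n else 0.0

    def token_vector(type_vec, observed_features, M, col):
        # type_vec: the word type's co-occurrence vector (one cell per context feature).
        # observed_features: the context features seen around this token occurrence.
        # M: word-type-by-context-feature matrix; column M[:, j] characterises feature j.
        # col: maps a context feature to its column index in M.
        weights = np.array([
            max((cosine(M[:, j], M[:, col[f]]) for f in observed_features), default=0.0)
            for j in range(len(type_vec))
        ])
        # Features similar to the observed context keep their weight; others shrink.
        return type_vec * weights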



A Toolkit for Visualizing the Coherence of Tree-based Reordering with Word-Alignments

Maillette de Buy Wenniger, Gideon

Institute for Logic, Language and Computation

Tree-based reordering constitutes an important motivation for the increasing interest

in syntax-driven machine translation. It has often been argued that tree-based

reordering might provide a more effective approach for bridging the word-order

differences between source and target sentences. One major approach known as ITG

(Inversion Transduction Grammar) allows permuting the order of the subtrees

dominated by the children of any node in the tree. In practice, it has often been

observed that the word-alignments usually cohere only to a certain degree with this

kind of tree-based reordering, i.e., there are cases of word-alignments that cannot be

fully explained with tree-based reordering when the tree is fixed a priori. This

presentation describes a toolkit for visualizing alignment graphs that consist of a word-alignment

together with a source or target tree. More importantly, the toolkit provides

a facility for visualizing the coherence of word-alignment with tree-based reordering,

highlighting nodes and word-alignments that are incompatible with one another. The

tool allows visualizing the tree-based reordered source/target string as well as the

reordered tree. Using our toolkit, we will also present results pertaining to the

coverage of the ITG assumption of the word-alignments of a Europarl corpus, which is a

very common starting point for training translation systems. We will also dwell on the

break-down of the types of incompatibility into general classes and discuss what that

implies for training hierarchical translation models on this type of data.

Corresponding author: gemdbw@gmail.com




A U-DOP approach to modeling language acquisition

Smets, Margaux

In linguistics, there is a debate between empiricists and nativists: the former believe that language is acquired from experience, the latter that there is an innate component for language. The main arguments adduced by nativists are Arguments from the Poverty of Stimulus: it is claimed that children acquire certain phenomena which they cannot learn on the basis of experience alone, and that therefore there has to be some innate component for language. In this thesis, we show that at

least for certain phenomena that are often used in such arguments, it is possible to

explain how children acquire them on the basis of experience alone, viz. with an

Unsupervised Data-Oriented Parsing (U-DOP) approach to language.

In the first part of the thesis, we develop concrete implementations of U-DOP, and

contribute to the field of unsupervised parsing with two innovations. First, we develop

an algorithm that performs syntactic category labeling and parsing simultaneously, and

second, we devise a new methodology for unsupervised parsing, which can in principle

be applied to any unsupervised parsing algorithm, and which produces the best results

reported on the ATIS-corpus so far, with a promising outlook for even better results.

In the second part of the thesis, we then use these concrete implementations to show

how the acquisition of certain phenomena can be explained in an empirical way. We

look in detail at wh-questions, and then show that the U-DOP approach is more general

than the nativist account by looking at other phenomena.

Corresponding author: margauxsmets@gmail.com



Age and Gender Prediction on Netlog Data

Peersman, Claudia (1,2) and Daelemans, Walter (1) and Van Vaerenbergh, Leona (2)

(1) CLiPS, University of Antwerp; (2) Artesis University College Antwerp

In recent years millions of people have started using social networking sites such as

Netlog to support their personal and professional communications, creating digital

communities. However, a common characteristic of these digital communities is that

users can easily provide a false name, age, gender and location in order to hide their

true identity. This way, social networking sites can be used by people with criminal

intentions (e.g., paedophiles) to support their activities online.

In the context of the DAPHNE project (Defending Against Paedophiles in

Heterogeneous Network Environments), we present first results of a machine learning

approach for age and gender prediction on a corpus of posts on the social network site

Netlog. We investigate which types of linguistic and stylistic features are effective for

age and gender prediction, given the specific characteristics of (the Dutch) chat

language and compare the effectiveness of different machine learning techniques for

age and gender prediction on the Netlog data.

We will conclude our presentation by discussing how these results will guide future

research in the DAPHNE project.

Corresponding author: claudia.peersman@ua.ac.be




Aligning translation divergences through semantic role projection

Vanallemeersch, Tom

Centrum voor Computerlinguïstiek, K.U.Leuven

We investigate whether an alignment method based on cross-lingual semantic

annotation projection improves over approaches for linguistically uninformed word

alignment and purely syntax-based tree alignment, specifically in the area of

translation divergences. We apply an SRL system which annotates English sentences

with PropBank and NomBank rolesets (verbal and nominal predicates and their

semantic roles), and project the predicates and roles to Dutch and French using

intersective GIZA++ word alignment. We create additional alignment links by detecting

the auxiliary words of predicates (auxiliary, modal and support verbs) in parse trees

and by detecting potential Dutch or French predicates based on projected roles. Finally,

we investigate whether additional links can be created by training an SRL system on the

projected predicates and roles and applying it to the Dutch and French parse trees.

Corresponding author: tallem@ccl.kuleuven.be



An Aboutness-based Dependency Parser for Dutch

Koster, Cornelis H.A.

Radboud Universiteit Nijmegen

Dupira (the Dutch Parser for IR Applications) is a new Dependency Parser for Dutch,

which was developed at the University of Nijmegen, based on the older Amazon

grammar and lexicon.

Dupira is a rule-based parser, which is generated by means of the AGFL parser

generator from the Dupira grammar, lexicon and fact tables. By means of transductions

which are specified in the grammar (and can be modified), the parser transduces

sentences to dependency trees.

Dupira was developed for applications in Information Retrieval (IR) rather than in

Linguistics, and for that reason has the following properties:

- the dependency model of Dupira expresses the aboutness of a sentence rather than describing its complete syntactic structure;
- therefore it is highly suitable for extracting factoids from running text;
- it is also possible to extract dependency triples, which can be used as high-accuracy terms for text categorization and full-text search;
- Dupira performs certain aboutness-preserving normalizing transformations, including de-passivization and de-topicalization, in order to enhance recall;
- it makes extensive use of subcategorization preferences to resolve, where possible, the attachment of Preposition Phrases;
- it is highly robust, both lexically and syntactically, and fast enough for practical applications.

In this presentation, we discuss the aboutness-based dependency model and the way

in which the grammar describes the Dutch language. We show by means of examples

the transduction of clauses and phrases. We report the availability of Dupira Version

0.8 in the public domain and the plans we have for further development, and discuss

some of its applications.

Corresponding author: kees@cs.ru.nl




An exploration of n-gram relationships for transliteration identification

Nabende, Peter

University of Groningen

Transliteration identification aims at building quality bilingual lexicons for

complementing and improving performance in various NLP applications including

Machine Translation (MT) and Cross Language Information Retrieval (CLIR). The main

task is to identify matching words across different writing systems from a given data

source. Recent evaluations (Kumaran et al., 2010) show that no single approach

achieves a consistently best performance in identifying transliterations from different

language pairs: an approach that leads to the identification of quality matches between

English and Russian, may result in many incorrect matches between English and

Chinese. In this paper, we conduct experimental settings of utilizing n-gram

relationships for computing candidate transliteration similarity scores which are

subsequently evaluated for choosing potential transliteration matches. We use

datasets from the 2010 transliteration generation shared task (Li et al., 2010) for five

language pairs: English-Russian, English-Chinese, English-Hindi, English-Tamil, and

English-Kannada. For each language pair, we explore various n-gram relationships

starting from the unigram case to higher order n-grams. Results show that higher order

n-grams lead to better transliteration identification quality across all languages,

however, for different language pairs, the higher order n-grams outperform each

other. For example, a pair trigram model outperforms a pair 4-gram model on an

English-Russian dataset while the reverse is true for an English-Pinyin(Romanized

Chinese) dataset. The results are promising as we aim at eliciting such n-gram

correspondences for use in more complex stochastic models such as pair Hidden

Markov Models (Pair HMMs) that we postulate may lead to even better transliteration

identification quality.

Corresponding author: p.nabende@rug.nl
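
By way of illustration only, one very simple n-gram relationship (far simpler than the relationships and pair HMMs explored in this talk, and applied here to romanized toy data) is character n-gram overlap, scored with a Dice coefficient:

    def char_ngrams(word, n):
        # All overlapping character n-grams of a word.
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    def dice(source, target, n=2):
        a, b = char_ngrams(source, n), char_ngrams(target, n)
        if not a or not b:
            return 0.0
        overlap = sum(min(a.count(g), b.count(g)) for g in set(a))
        return 2 * overlap / (len(a) + len(b))

    # Rank target-side candidates for one source name (invented, romanized data):
    print(sorted(["moskva", "minsk", "kiev"], key=lambda c: -dice("moscow", c)))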



Application of a Constraint Conditional Model for improving the performance of a Sequence Tagger: A Case Study

Stokman, Tim

Textkernel

Natural language classifiers often ignore natural global constraints arising from the nature of the domain. These constraints are often hard for the classifier to learn because the constraints are global, while most classifiers make local decisions. Given that the classifier can produce a probability distribution over the possible assignments, we can search this solution space for a set of assignments satisfying the natural constraints. Given this solution space, we can formulate an ILP model that maximizes the probability of the

assignment while conforming to all constraints formulated in the model.

We apply this approach to a number of datasets and develop a set of natural

constraints for these datasets. We will expand on the practical implementation issues

encountered and show the performance improvements that arise from using the

constraint conditional model approach.

Corresponding author: timstokman@gmail.com
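
The idea can be illustrated with a deliberately tiny, exhaustive version of the search (a real system would use an ILP solver; the label set, the distributions and the constraint below are all invented for the example):

    from itertools import product
    from math import log

    LABELS = ["O", "NAME"]
    # Per-token label distributions from a hypothetical local classifier:
    probs = [{"O": 0.6, "NAME": 0.4},
             {"O": 0.3, "NAME": 0.7},
             {"O": 0.45, "NAME": 0.55}]

    def satisfies(seq):
        # Example global constraint: exactly one token is labelled NAME.
        return seq.count("NAME") == 1

    # Most probable label sequence that also satisfies the global constraint.
    best = max((s for s in product(LABELS, repeat=len(probs)) if satisfies(s)),
               key=lambda s: sum(log(p[l]) for p, l in zip(probs, s)))
    print(best)  # ('O', 'NAME', 'O')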




Automatic terminology extraction: methods and practical applications

de Vries, Dennis

GridLine

Some of the useful applications of terminology that GridLine provides are automatic assignment of keywords to documents, integration of thesauri with search engines, enrichment of search queries using a multi-lingual thesaurus, and aiding writers in the correct use of terminology.

The problem that many organisations face, though, is that they do not have a list of their specific terminology. Creating these lists manually by examining company documentation and interviewing experts is very expensive and time-consuming.

Therefore, GridLine developed instruments for automatically extracting organisation

specific terminology from document collections. Additionally, after extracting

terminology, we can build a thesaurus by automatically determining semantic relations

between terms.

For extraction of terms and semantic relations we use a variety of linguistic methods

(lemmatizer, POS-tagger, compound splitter) and statistical methods (unithood,

termhood). In this presentation I will give a brief overview of these extraction

techniques and show some examples of projects we did for our customers. In

particular, I will talk about Termtreffer, an easy-to-use application for term extraction

which we made for the Nederlandse Taalunie. In this application, users can extract

terms from documents using custom combinations of linguistic and statistical modules

for term extraction. Additionally they can manage, analyze and edit the resulting

terminology lists.

GridLine is a growing company based in the center of Amsterdam. Currently we are

market leader in Dutch language technology.

Corresponding author: dennis@gridline.nl
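
As a rough illustration of the statistical side, one common termhood heuristic (a sketch only; the abstract does not specify which measures Termtreffer combines) compares a candidate's relative frequency in the organisation's documents with a general reference corpus:

    from collections import Counter

    def termhood(candidate, domain_tokens, reference_tokens):
        d, r = Counter(domain_tokens), Counter(reference_tokens)
        # Add-one smoothing so unseen words do not divide by zero.
        p_domain = (d[candidate] + 1) / (len(domain_tokens) + len(d))
        p_reference = (r[candidate] + 1) / (len(reference_tokens) + len(r))
        return p_domain / p_reference  # well above 1 suggests a domain-specific term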



Automatically Constructing a Wordnet for Dutch

Van de Cruys, Tim

INRIA & Université Paris VII

In this talk, we describe the automatic construction of a wordnet for Dutch by

combining a number of different sources of semantic information. First, a number of

unsupervised and semi-supervised techniques are presented for the extraction of

different kinds of semantic information. This includes techniques based on

distributional similarity and clustering, but also techniques that extract semantic

information from semi-structured and multilingual resources (such as Wiktionary). The

second part describes how the output of these techniques may be combined with the

structure of the original Princeton WordNet for English, which allows for the automatic

construction of a wordnet for Dutch. Contrary to existing resources, the extracted

resource also includes named entities. The resource is evaluated against CORNETTO, an existing, manually constructed wordnet for Dutch.

Corresponding author: Tim.Van_de_Cruys@inria.fr

31


Automatically determining phonetic distances

Wieling, Martijn and Margaretha, Eliza and Nerbonne, John

University of Groningen

This study seeks to induce the distance between phonetic segments based on their

correspondences in dialect atlas material. In other words, we induce information about

the physical realization of sounds from their dialectal distributions.

We algorithmically align segments in pairs of pronunciations at various sites in order to

identify corresponding sounds. We then apply an information-theoretic measure,

Pointwise Mutual Information, in order to automatically determine phonetic distances

based on the relative frequency of correspondences. We repeat these steps until the

alignments (and segment distances) stabilize.

We evaluate the quality of the obtained phonetic distances by comparing them to acoustic vowel distances. For two separate dialect datasets, Dutch and German, we find highly significant correlations between the induced phonetic distances and the acoustic distances, indicating that the frequency of correspondence in dialect material conveys information about the constitution of sounds. We close with some speculations about the usefulness of the method.

Corresponding author: m.b.wieling@rug.nl
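
In outline, and leaving out the iteration and the conversion into normalized distances, the PMI step could look like this (a sketch under the assumption that alignment has already produced segment correspondence pairs):

    from collections import Counter
    from math import log

    def pmi_scores(aligned_pairs):
        # aligned_pairs: list of (segment_a, segment_b) tuples from the alignments.
        pair_n = Counter(aligned_pairs)
        seg_n = Counter(s for pair in aligned_pairs for s in pair)
        total = len(aligned_pairs)
        pmi = {}
        for (a, b), n in pair_n.items():
            p_ab = n / total
            p_a, p_b = seg_n[a] / (2 * total), seg_n[b] / (2 * total)
            pmi[(a, b)] = log(p_ab / (p_a * p_b))
        # High PMI = frequent correspondence = small phonetic distance;
        # in practice the scores are rescaled into distances and re-fed
        # into the aligner until the alignments stabilize.
        return pmi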



Building a Gold Standard for Dutch Spelling Correction

Gaustad, Tanja and van den Bosch, Antal

TiCC, Tilburg University

The main question in the NWO project "Implicit Linguistics" is whether abstract

linguistic representations are necessary as an intermediate step in NLP models and

systems. To investigate this, we focus on text-to-text processing tasks, i.e. processes

which map form to form. In particular, we are investigating Dutch spelling correction

where a corrupted text is converted to a clean version of the same text.

In order to test the quality of a spelling corrector, a gold standard is needed. This, however, does not yet exist for Dutch. For this reason, we set out to build such a gold standard, containing a mixed selection of texts in which we aim to mark all errors and their corrections. In this talk, we will present the gold standard, including inter-annotator agreement and other statistics relating to the data used. Furthermore, we will present first results of applying our language model WOPR to the corpus, comparing it against two baselines: a high-precision known-error list and a context-insensitive lexical baseline. Evaluation is performed in terms of precision and recall on detection and correction on full text.

Corresponding author: T.Gaustad@uvt.nl
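
For concreteness, detection can be scored as follows (a minimal sketch; the evaluation described above also covers correction, which would compare (position, correction) pairs instead of bare positions):

    def precision_recall(predicted, gold):
        # predicted, gold: sets of token positions flagged as spelling errors.
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        return precision, recall

    print(precision_recall({1, 5, 9}, {1, 9, 12}))  # (0.666..., 0.666...)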



Clustering customer questions

Nauze, Fabrice

Q-go

Q-go’s natural language search technology powers the search box of many corporate

websites. Its NLP technology allows customers to ask questions in their own words and

returns a small set of relevant answers. Hundreds of millions of questions have already

been processed and answered with Q-go’s solution providing us with a mine of data.

In order to improve our knowledge of what customers are asking and to help further

refine our core systems, Q-go needs a way to automatically cluster relevant queries

from large sets of customer questions.

To achieve this goal we tested several standard clustering methods on sets of customer

questions. The outline of the talk will be the following.

First we will explain the specific challenges one has to face when clustering customer questions (very short queries, typos, etc.). We will then present the clustering algorithms that have been tested (among others k-Means, GAAC hierarchical clustering, and mini-batch k-Means). Thirdly, we will outline two different types of heuristics, used in the first case to improve the quality of the vector representations feeding the clustering algorithms, and in the second to overcome the curse of dimensionality. Finally, the different methods will be evaluated and compared with respect to processing speed and intrinsic quality of clustering (as well as practical usefulness).

Corresponding author: fabrice.nauze@q-go.com
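
A bare-bones version of such an experiment might look like this (illustrative only; it assumes scikit-learn is available and omits the heuristics for short, noisy queries that the talk discusses):

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    questions = ["How do I reset my password?",
                 "Forgot password, what now?",
                 "What are your opening hours?",
                 "When are you open on Saturday?"]
    X = TfidfVectorizer().fit_transform(questions)  # queries as tf-idf vectors
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for cluster, question in zip(labels, questions):
        print(cluster, question)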



Collecting and using a corpus of lyrics and their moods

van Zaanen, Menno

Tilburg University

Recently, there has been an increase in availability of music in digital formats. This has

led to music collections that are different in nature than in the past. Collections are

typically larger and consist of a selection of individual pieces instead of complete

albums. Since playing any musical piece from the collection can be done without

physically changing the medium, listeners create playlists that allow them to identify a

subset of the collection and determine the order in which the pieces are played.

People creating playlists often want to group pieces based on their emotional load

(such as happy or sad). Creating such playlists, however, is time-consuming and

requires knowledge of the music in the collection, since emotional information is not

explicitly encoded with the pieces. We will describe a system that analyzes musical

pieces and, based on the lyrics, classifies them into their corresponding mood class.

This system is developed and evaluated using a corpus of lyrics of songs and their

corresponding mood. The mood tags were collected by social tagging of musical pieces

using the Moody iTunes plugin, developed by the company Earth People within

the Crayon Room project. Starting from a list of artist, title and mood triples, the

corresponding lyrics of the songs have been collected. This has led to a corpus

containing the lyrics of 5,631 songs, which will be made publicly available.

Corresponding author: mvzaanen@uvt.nl




Combined Qualitative and Quantitative Error Analysis in Multi-Topic Authorship Attribution

Luyckx, Kim and Daelemans, Walter

CLiPS, University of Antwerp

In authorship attribution, function words are considered the ideal feature type to deal

with the complexity of the task. There is a consensus that they are topic-neutral, highly

frequent, and not under the author's conscious control. However, it has been shown

that hardly any of these allegedly topic-neutral features are in fact topic-neutral. Topic

seems to be hard to 'separate' from authorial style, irrespective of the type of features

used to predict authorship. Although function words are robust to limited data and

provide good indicators of authorship, the a-priori exclusion of content words causes a

lot of useful information to be disregarded.

We discuss experiments in multi-topic authorship attribution and zoom in on the

features that constitute the attribution model. Qualitative analysis of results is typically

lacking in authorship attribution studies, since many studies focus on performance, but

refrain from going into detail about the features selected.

In this talk, we show that high performance does not always imply a viable approach. More specifically, we zoom in on unique identifiers: features that occur exclusively with a specific authorship class in training and uniquely identify a test instance by the same author. Although a coincidence (the frequency of a feature in an unseen test set, or the topic of that test set, cannot be predicted), topic-related unique identifiers provide the model with an unfair advantage that will not scale. However, the absence of unique identifiers does not necessarily imply a scalable approach. Although this talk focuses on

authorship attribution, we think any task in text mining would benefit from consistent

error analysis.

Corresponding author: kim.luyckx@ua.ac.be
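
A simple way to surface such features (a sketch of the diagnostic, with an assumed data layout) is to collect, per feature, the set of training authors it occurs with:

    from collections import defaultdict

    def unique_identifiers(train_docs):
        # train_docs: iterable of (author, features) pairs, one per training text.
        authors_per_feature = defaultdict(set)
        for author, features in train_docs:
            for f in features:
                authors_per_feature[f].add(author)
        # Features seen with exactly one author: suspicious 'give-away' cues.
        return {f for f, a in authors_per_feature.items() if len(a) == 1}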



Combining e-learning and natural language processing. The example of automated dictation exercises

Beaufort, Richard and Roekhaut, Sophie

UCL CENTAL

E-learning is a way of delivering education based on the use of electronic tools and

content, either delivered on CD-ROMs or managed through network connections. The

idea behind e-learning is to improve both the learning and its management. To this

end, exercises are frequently automated. Of course, the ideal automation would

involve the three distinct steps of an exercise: its preparation, its realization (by the

student) and its correction.

Up to now, automation has led to exercises, like gap-fill texts or multiple-choice tests,

which limit the kinds of knowledge that can be assessed. This is due to the fact that the

correction step, for some exercises, is far from easy to automate.

An eloquent example of such an exercise is dictation, the activity in which the teacher reads a passage aloud and the learners write it down. While automatically reading an unknown text aloud is not a problem as long as a reliable text-to-speech synthesis system is available, the accurate correction of a learner's copy can quickly become a nightmare.

The correction of a dictation copy involves two steps: first, the detection of the actual locations of errors; second, the classification of these errors. In this paper, we present a way of automating these two steps. The detection step is based on a finite-state string alignment between the copy and the original. The classification step is the best result of a finite-state intersection between all possible automatic analyses of an error and the single analysis of the corresponding correct form.

Corresponding author: richard.beaufort@uclouvain.be
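
To give a feel for the detection step, here is a word-level sketch using plain edit-distance alignment (the system described above works with finite-state machines and classifies errors far more finely; the French example is invented):

    import difflib

    def detect_errors(original, copy):
        ref, hyp = original.split(), copy.split()
        matcher = difflib.SequenceMatcher(a=ref, b=hyp)
        # Every non-matching opcode marks a place where the copy deviates.
        return [(op, ref[i1:i2], hyp[j1:j2])
                for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

    print(detect_errors("le chat dort sur le lit", "le chas dort sur lit"))
    # [('replace', ['chat'], ['chas']), ('delete', ['le'], [])]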



Computing Semantic Relations from Heterogeneous Information Sources

Panchenko, Alexander

UCL CENTAL

Computation of semantic relations between terms or concepts is a general problem in

Natural Language Processing and a subtask of automatic thesaurus construction.

This work describes and compares available heterogeneous information sources which can be used for mining semantic relations, such as texts, electronic dictionaries and encyclopedias, lexical ontologies and thesauri, folksonomies, surface forms of words, query logs of search engines, and so forth. Most existing algorithms use a single information source for extracting semantic knowledge: Distributional Analysis relies on text, Extended Lesk uses dictionary definitions, the Jiang-Conrath distance employs a semantic network such as WordNet, and so on. We show that different methods capture different aspects of the terms’ relatedness: while one acquires similarities of word contexts, others capture similarities of syntactic contexts, term definitions, surface forms, etc.

In these settings, there is a need for a general model capable of aggregating different aspects of semantic similarity from all available information sources and methods in an optimal and consistent way. We discuss how such a model can be implemented with a

linear combination, and using tensors (i.e. multi-way arrays). We describe two ways of

using tensors for calculation of semantic relations in the context of multiple

information sources, which we call “adjacency tensor” and “feature tensor”. The sparse

tensor factorization methods PARAFAC, Non-negative Tensor Factorization (NTF), and

Memory-Efficient Tucker (MET) are suggested in order to fuse information about

terms from different methods and information sources. We conclude that tensors can

be used for representing terms, while tensor factorizations can serve to generalize data

about terms’ relatedness.

Corresponding author: alexander.panchenko@student.uclouvain.be
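
The simplest aggregator mentioned above, a linear combination, fits in a few lines (the sources and weights below are invented placeholders, not the tensor-based models the talk develops):

    def combined_similarity(t1, t2, sources, weights):
        # sources: functions (term, term) -> similarity; weights: one per source.
        return sum(w * sim(t1, t2) for sim, w in zip(sources, weights))

    corpus_sim = lambda a, b: 0.7    # stand-in for distributional similarity
    gloss_sim = lambda a, b: 0.4     # stand-in for Extended Lesk on definitions
    network_sim = lambda a, b: 0.9   # stand-in for Jiang-Conrath on WordNet
    print(combined_similarity("car", "automobile",
                              [corpus_sim, gloss_sim, network_sim],
                              [0.5, 0.2, 0.3]))  # -> 0.7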



Computing the meaning of multi-word expressions for semantic inference

Cremers, Crit

Leiden University

The immense diversity of multi-word expressions in every language imposes heavy requirements on the lexicon, the grammar and their interface for deep semantic analysis to be feasible. The lexicon for meaning-driven NLP is huge and phrasal.

We present a model for dealing with extended lexical units in a parser/generator for

Dutch that aims at the logical computation of entailments and presuppositions. The

model consists of three components: a fiat architecture for the computational lexicon,

an efficient organization of on-line lexical retrieval and a selectional method of

semantic underspecification. In the fiat lexicon, all combinatory instances of all lexical

varieties of all (semantically) relevant constructions are spelled out as feature-value

graphs. Each of the feature-value graphs contains all combinatory information needed

for synatctic and semantic processing. This constructicon is produced off line. It is

managed on line by a retrieval system that selects contextually required and adequate

constructions in linear time. The underspecification allows to disambiguate the

combinatory result by evaluating the structure of the representations.

The model will be demonstrated by reference to two particular constructions: the

Dutch way- (Poss 2009) and the Dutch honger-construction. The first one – “jij hebt je

een weg uit de gevangenis geslijmd” - exemplifies an intriguing combination of lexical

restrictions, productivity, structure sensitivity and semantic specificity. The second – “ik

heb geen erg grote honger” - exemplifies transcategorial semantic effects, where open

modification of a noun phrase requires translation into propositional and state-sensitive

operators.

Corresponding author: c.l.j.m.cremers@hum.leidenuniv.nl



Cross-Domain Dutch Coreference Resolution

De Clercq, Orphée and Hoste, Véronique

LT3 Language and Translation Technology Team, University College Ghent

For the STEVIN funded SoNaR project, a Dutch reference corpus of 500 million words is

being built. At the same time, a one-million-word subset is progressively enriched with

semantic information: named entities, coreference, spatio-temporal relations and

semantic roles. As a prerequisite, existing schemes and systems developed for Dutch

are to be reused to the fullest extent possible. In this talk we present the ongoing task

of annotating this subset with coreference information, following existing guidelines

for Dutch (Bouma et al. 2007).

The basis for our coreference resolver is an existing mention-pair approach (Hoste

2005, Hendrickx et al. 2008) for Dutch. One of the main challenges in the domain of

coreference resolution is portability across different domains and languages. Since one of the great advantages of the SoNaR corpus is its diversity (the 1MW subset itself comprises six text types), we decided to train our system on each text type separately and on all types combined. We will report cross-type (e.g. administrative,

external communication, instructive, journalistic text) cross-validation results for

different NP types and present an extensive qualitative error analysis. We compare

performance when providing perfect markables, derived from deep parsing (Alpino,

Van Noord et al. 2006), with automatically generated markables, and investigate the

added value of integrating additional semantic information resulting from other

annotation layers.

References

G. Bouma, W. Daelemans, I. Hendrickx, V. Hoste, and A. Mineur. 2007. The COREA project: Manual for the annotation of coreference in Dutch texts. Technical report, University of Groningen.

I. Hendrickx, V. Hoste, and W. Daelemans. 2008. Semantic and Syntactic Features for

Anaphora Resolution for Dutch. In Lecture Notes in Computer Science, Volume 4919,

Proceedings of the CICLing-2008 conference, pages 351–361. Berlin: Springer Verlag.


PRESENTATION ABSTRACTS

V. Hoste. 2005. Optimization Issues in Machine Learning of Coreference Resolution.

Ph.D. thesis, Antwerp University.

G. Van Noord, I. Schuurman, and V. Vandeghinste. 2006. Syntactic Annotation of Large

Corpora in STEVIN. In Proceedings of LREC 2006, Genoa.

Corresponding author: orphee.declercq@hogent.be




Dmesure: a readability platform for French as a foreign language

François, Thomas

UCL CENTAL

It is a well-known fact that reading practice improves the reading abilities of L1

students as well as L2 students. However, once one strays from textbooks, matching

individuals with texts of an adequate level of difficulty is far from an easy task. FFL

teachers have all, at some time or other, wasted time carrying out such a task.

Our research aims to provide the community with a web platform, called Dmesure, able to retrieve texts from the web on a specific topic and at a specific readability level. We will present the current version of this platform, in which texts are first retrieved through the Yahoo search engine, before being assessed for difficulty using the readability measure described in François (2009). It is worth noting that the output of Dmesure complies with the proficiency scale set in the Common European Framework of Reference for Languages, which makes this tool very convenient for FFL teachers.

We also address some specific problems encountered when applying a readability measure to web texts. Among them, we consider the influence of boilerplate on the readability measure, and some ways to reject pages whose language diverges too much from the norm. To conclude, we show that because Dmesure has been developed in a participative perspective, it can collect new texts annotated by teachers. The built-in readability model can therefore be retrained periodically with this enhanced corpus.
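
The retrieve-then-filter pipeline can be pictured as follows. This is a schematic sketch under our own assumptions: search_web() and cefr_level() are placeholders, and Dmesure itself relies on the Yahoo search engine and the readability measure of Francois (2009) rather than the toy scorer shown here.

    # Schematic sketch of a retrieve-then-filter readability pipeline.
    # search_web() and cefr_level() are placeholders, not the Dmesure API.

    def search_web(topic):
        """Placeholder: return candidate page texts for a topic."""
        return ["Le chat dort.", "La situation macroeconomique demeure incertaine."]

    def cefr_level(text):
        """Placeholder scorer mapping a text to a CEFR level; Dmesure uses
        the statistical readability measure of Francois (2009) instead."""
        words = text.split()
        avg_word_len = sum(len(w) for w in words) / len(words)
        return "A2" if avg_word_len < 6 else "B2"

    def retrieve(topic, target_level):
        # Keep only pages whose estimated level matches the learner's level.
        return [t for t in search_web(topic) if cefr_level(t) == target_level]

    print(retrieve("chats", "A2"))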

Corresponding author: thomas.francois@uclouvain.be


Essentials of person names

Schraagen, Marijn

Leiden Institute of Advanced Computer Science

The frequency of spelling variation and errors in person names is relatively high,

compared to normal vocabulary. Standardization of a name to some base form, or

core, could be useful in named entity matching or record linkage. Two types of core are

investigated: the semantic core and the syntactic core. The semantic core approach

exploits the idea that names in Dutch, especially surnames, have meaning: a surname is

usually based on a first name, a location, a profession or a personal characteristic. The

semantic component is subject to heavy and unpredictable modifications due to

suffixes, inflections and compounding; suffix removal techniques are therefore less

successful for names than for standard vocabulary. The semantic component itself is

however relatively stable, and the set of semantic categories is reasonably restricted.

Therefore, a word list approach can be applied for names, which avoids learning or

designing complex suffix removal rules. An alternative approach is to extract the

syntactic core of a name. The syntactic core is the (possibly discontinuous) character

sequence that remains constant or phonetically equivalent in all variants of a name.

The syntactic core can be analysed on various linguistic levels. An advantage of the

syntactic approach is that a word list is not needed, and therefore the procedure can

also be applied to unknown names or names without a vocabulary-based meaning

component (such as first names). Algorithms for extracting semantic and syntactic

cores are discussed, and an application is provided for the problem of record linkage in

data mining.
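
As an illustration of the syntactic-core idea, a (possibly discontinuous) character sequence shared by all variants of a name can be computed as an iterated longest common subsequence. The sketch below is a minimal reading of that idea with invented variants; phonetic equivalence is not modelled.

    # Sketch: extract a "syntactic core" as the character subsequence shared
    # by all spelling variants of a name (phonetic equivalence not modelled).

    def lcs(a, b):
        """Longest common subsequence of two strings, by dynamic programming."""
        dp = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a):
            for j, cb in enumerate(b):
                if ca == cb:
                    dp[i + 1][j + 1] = dp[i][j] + ca
                else:
                    dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
        return dp[-1][-1]

    def core(variants):
        result = variants[0]
        for v in variants[1:]:
            result = lcs(result, v)
        return result

    # Hypothetical variants of one surname:
    print(core(["Jansen", "Janssen", "Jansens"]))  # -> "Jansen"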

Corresponding author: schraage@liacs.nl


Extraction of Historical Events from Unstructured Texts

Segers, Roxane and van Erp, Marieke and van der Meij, Lourens

VU University Amsterdam

Historiography revolves around events as these express important change points in

reality. We postulate that events can play an important role in improving automated

search and data integration in the historic domain as events connect information about

who did what, where and when.

We present a pattern based approach to automatically extract historical named events

like "French Revolution" and "Second World War" from unstructured texts in Dutch.

The extracted events are the backbone of a structured event thesaurus that will consist

of events with their time, place and participants.

In our approach we make a distinction between external and internal event patterns.

For collecting external event patterns like 'during the', we retrieved text snippets for a

number of seed events. We ranked the pattern candidates by their frequency and co-occurrence with different events. Next, we ran the pattern collection over a domain-specific corpus. We evaluated the precision of the extracted historical event candidates

by the number of patterns that extracted the event and the confidence score of these

patterns.

The extracted events were used as input for obtaining event internal patterns. We

classified and analysed the events based on their morpho-syntactic structure: this

yielded patterns such as "Massacre of Y". To expand these patterns, we used the head

of the events to iteratively query Wordnet: this yielded new internal patterns such as "Bloodbath

of Y".

As a result we obtained a library of external and internal patterns that can be used to

extract named events from unstructured texts. The presented combination of internal

and external patterns is vital as our combined library outperforms each pattern type on

its own.
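
The external-pattern step can be sketched as follows; this toy re-implementation of the seed-and-rank idea uses invented snippets and reduces the confidence score to a count of distinct seeds per pattern.

    # Toy sketch of seed-based pattern extraction: collect the words
    # immediately preceding known seed events and rank patterns by how many
    # distinct seeds they occur with. Invented snippets; not the authors' code.
    import re
    from collections import defaultdict

    seeds = ["Franse Revolutie", "Tweede Wereldoorlog"]
    snippets = [
        "tijdens de Franse Revolutie werden veel archieven vernietigd",
        "tijdens de Tweede Wereldoorlog werd de stad gebombardeerd",
        "na de Franse Revolutie veranderde het bestuur",
    ]

    pattern_seeds = defaultdict(set)
    for s in snippets:
        for seed in seeds:
            m = re.search(r"(\w+ \w+) " + re.escape(seed), s)
            if m:
                pattern_seeds[m.group(1)].add(seed)

    # Patterns seen with more distinct seeds are taken to be more reliable.
    for pat, found in sorted(pattern_seeds.items(), key=lambda kv: -len(kv[1])):
        print(pat, len(found))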

Corresponding author: rh.segers@vu.nl


Finding Statistically Motivated Features Influencing

Subtree Alignment Performance

Kotzé, Gideon

University of Groningen

We present results of an ongoing investigation of a manually aligned parallel treebank

and an automatic tree aligner. Using the parallel treebank as a test set, we establish which features show a significant correlation with alignment performance. Our conclusion is that lexical features generally have a more significant

influence than tree features. We present these findings with a discussion of their

significance and with reference to possible useful applications in the alignment of

parallel texts for machine translation.

Corresponding author: g.j.kotze@rug.nl


From Tokens to Text Entities: Line-based Parsing of

Resumes using Conditional Random Fields

Rotaru, Mihai

Textkernel NL

Resumes (Curriculum Vitae) form a challenging category of semi-structured

documents. Regardless of the language, most resumes tend to be structured in

sections: e.g. experience, education, skills, personal. Consequently, the first task of a

resume information extraction system is to segment the resume into sections. We cast

the section segmentation problem as a sequence labeling problem. In this paper, we

show practical results that compare two approaches. The first approach works at the

word level and uses Hidden Markov Models (HMM) with words as the HMM

observations. The second approach works at the line level and uses Conditional

Random Fields (CRF) and a variety of features computed for each line. We find that

the CRF approach outperforms the HMM approach significantly on this real world task,

and that the improvement is also reflected in the later stages of our resume

information extraction pipeline. In addition, this result generalizes across several

languages after porting the corresponding CRF features. The main advantages of the

CRF approach are the expressiveness of the features (e.g. easily express information

that spans multiple words), and the fact that it makes the practical assumption that a

line of text belongs to a single section.
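
The line-level formulation translates directly into standard CRF tooling. Below is a minimal sketch with the third-party sklearn-crfsuite package (our choice of library, not necessarily the authors') and invented line features.

    # Sketch: label each resume line with its section, using per-line feature
    # dicts and a linear-chain CRF (pip install sklearn-crfsuite).
    import sklearn_crfsuite

    def line_features(line):
        return {
            "n_tokens": len(line.split()),
            "has_year": any(tok.isdigit() and len(tok) == 4 for tok in line.split()),
            "all_caps": line.isupper(),
        }

    resume = ["EXPERIENCE", "2008 software engineer at Acme", "EDUCATION",
              "2005 MSc computer science"]
    labels = ["experience", "experience", "education", "education"]

    X = [[line_features(l) for l in resume]]   # one sequence = one resume
    y = [labels]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))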

Corresponding author: rotaru@textkernel.nl


Im chattin :-) u wanna NLP it: Analyzing Reduction in Chat

van Halteren, Hans 1 and Martell, Craig 2 and Du, Caixia 3 and Gu, Yan 3

and Kobben, Johan 3 and Panjaitan, Leequisach 3 and Schubotz, Louise 3 and Vasylenko, Kateryna 4

1 Radboud University Nijmegen

2 Naval Postgraduate School, Monterey

3 ReMa L&C, RUN/UvT

4 ReMa L&C

Modern NLP research attempts to cover the whole spectrum from written to spoken

text. Right in the middle we find chat text, a written text type which has many

similarities with spoken text. One of these is spelling variation, often reduction, e.g.

nite instead of night. It is clear that, if we ever want to analyze or generate chat text,

we have to understand the factors behind this spelling behavior, whether it reflects user experience with SMS, peer-group identification through speech-like spelling, or something else.

This paper contributes by studying spelling reduction in chat text. We investigated

cases of reduction in the NPS Chat Corpus. After identifying various types in 2000 posts

from the publicly available part of the corpus, we focused on four frequent

phenomena: a) wanna (want to) and gonna (going to), b) ya and u (you), c) g-drop in

present participles, e.g. findin for finding, and d) apostrophe drop in enclitics, e.g. hes for

he’s. For these, we automatically extracted all occurrences of both reduced and full

forms in 1Mw from the complete corpus. For each, we also determined features which

could be of influence on the choice between the alternating forms, such as the poster’s

age group and immediate context in the post. On the basis of this, we built regression

models to find out which of the features show a significant influence.

In the paper, we present the main findings and relate them to those identified in the

literature as being active in spoken text.

Corresponding author: hvh@let.ru.nl


Language Evolution and SA-OT: The case of sentential

negation

Lopopolo, Alessandro and Biro, Tamas

University of Amsterdam

Simulated Annealing Optimality Theory (SA-OT) is a recent update of the OT

framework, and it adds a model of performance to a theory of linguistic competence.

Our aim is to show how SA-OT can be useful for Language Evolution simulations.

Performance error is a central concept in this model, and it is considered to be one of

the causes of variation and evolution. In performance, speakers accept sacrificing

precision in order to enhance communicative strength, and the performance errors

influence the language learning of the next generation.

In order to test the potential of SA-OT, we have chosen to model the evolution of

sentential negation. The background is based on Jespersen's Cycle (JC). In JC, the

evolution of sentential negation follows three stages (1. pre-verbal, 2. discontinuous,

and 3. post-verbal). Our starting point is the treatment of JC by de Swart (2010) in

terms of traditional OT. Her model predicts six stages: the three above-mentioned pure

stages, as well as three intermediate, mixed stages. Yet, there are no convincing

empirical data for an intermediate stage between stages 1 and 3.

Therefore, we advance a novel, computational model for JC, based on SA-OT. It

reproduces the three pure and the two observed mixed stages, whereas it correctly

predicts the lack of an intermediate stage between 1 and 3. This result makes different

predictions for the evolution of sentential negation, and confirms the validity of SA-OT

as a computational model for language evolution.

Corresponding author: A.Lopopolo@student.uva.nl


Machine Learning Approaches to Sentiment Analysis

Using the Dutch Netlog Corpus

Schrauwen, Sarah and Daelemans, Walter

CLiPS, Antwerp University

Sentiment analysis deals with the computational treatment of opinion, sentiment and

subjectivity. We constructed and manually annotated a corpus, the Dutch Netlog

Corpus, with data extracted from the social networking website Netlog. This corpus

was annotated on three levels: ‘valence’ (expressing the opinion of the writer: we

distinguish between ‘positive’, ‘negative’, ‘both’, ‘neutral’ and ‘n/a’) and additionally

language performance, which is divided into two areas: ‘performance’ (‘standard’,

‘dialect’ and ‘n/a’) and ‘chat’ (‘chat’, ‘non-chat’ and ‘n/a’). We tackle sentiment analysis

as a text classification task and employ two simple feature sets (the most frequent and

the most informative words of the corpus) and three supervised classifiers

implemented from the Natural Language ToolKit (the Naïve Bayes, Maximum Entropy

and Decision Tree classifiers). The highest obtained accuracy score for valence

classification with the entire data set is 65.1%.
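
As a minimal illustration of this setup, the sketch below trains NLTK's Naive Bayes classifier on word-presence features over invented posts; the Netlog data itself is not reproduced here.

    # Toy sketch of valence classification with NLTK's Naive Bayes classifier,
    # using word-presence features (invented posts, not the Netlog corpus).
    from nltk import NaiveBayesClassifier

    def features(post):
        return {"has(%s)" % w: True for w in post.lower().split()}

    train = [
        (features("echt super leuk"), "positive"),
        (features("wat een slechte dag"), "negative"),
        (features("gewoon een dag"), "neutral"),
    ]
    clf = NaiveBayesClassifier.train(train)
    print(clf.classify(features("super leuke dag")))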

We suggest three factors leading to errors in valence classification. First, the nature of

the data affects results, since most of the corpus is made up of dialect and chat

language, which is more difficult to predict. Second, the number of classes to predict

from is larger for valence classification (five classes) than for performance or chat

classification (three classes), and is therefore also more difficult to process. Third, the

skewed class distribution of the corpus probably has the biggest influence on the

results. We suspect that more training data will solve these current problems.

Corresponding author: sarah.schrauwen@gmail.com


Measuring the Impact of Controlled Language on Machine

Translation Text via Readability and Comprehensibility

Doherty, Stephen

Centre for Next Generation Localisation, Dublin

This paper describes a recent study of the readability and comprehensibility of English

software documentation, which has been translated into French by Matrex, a state-of-the-art (phrase-based statistical) machine translation system. The primary aim of the

study is to examine what, if any, effects there are on the readability and

comprehensibility of the machine translation output following the application of

controlled language (CL) rules on the source language texts. Our hypothesis is that the

application of CL rules would result in an observable increase in readability and

comprehensibility of the target language text.

Our approach consisted of a three-pronged evaluation of the texts by means of (i)

readability indices in both the source and target languages; (ii) an eye-tracking measurement of readability; and (iii) a post-task qualitative measurement of comprehensibility, using recall and Likert-scale human evaluations. We also looked at

correlations between automatic machine translation evaluation metrics (e.g. BLEU,

GTM etc.) and the evaluation results mentioned above in an attempt to bridge the gap

between human and automatic approaches to evaluation.

The paper will first describe some background and context in the relevant research

areas, followed by a presentation of the methods employed with a particular focus on

the measurement of readability via eye tracking and tentative results in this regard.

Corresponding author: stephen.doherty2@mail.dcu.ie


Memory-based text completion

van den Bosch, Antal

Tilburg University

The commonly accepted technology for fast and efficient word completion is the prefix

tree, or trie. As a word is keyed in, the trie can be queried for unicity points and best

guesses. We present three improvements over the normal prefix trie in experiments in

which we measure the percentage of keypresses saved on both in-domain and out-of-domain

test text, emulating a perfectly alert user who would select correct suggestions

promptly. First, we train a suffix trie that tests backwards from the most recent

keypresses. Conditioned on first letters, the suffix trie model yields about 10% more

saved keypresses than the baseline character saving percentage on in-domain test

data. Second, the suffix trie model can be straightforwardly extended to testing on

characters of previous words. Adding this context yields another 10% increase in

character savings. Third, when we train the context-rich suffix trie model to complete

the current word and predict the next one in one go, character savings go up another

4%. In a learning experiment on Dutch texts we observe character savings of up to 44%

on in-domain test data where the baseline prefix tree savings percentage is 19%. On

out-of-domain Twitter data, the prefix trie baseline of 19% is only mildly surpassed by

the suffix tree variants to 24% character savings. We develop an explanation for the

spectacular success of the suffix tree approach on in-domain data, and review the

applicability of the approach in real-world text entry contexts.
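
For readers unfamiliar with the baseline, a prefix trie with unicity points can be coded compactly. The sketch below shows the plain prefix-trie baseline only; the suffix-trie and context extensions described above are not reproduced.

    # Baseline sketch: a prefix trie that completes a word once the keyed-in
    # prefix has become unique (its "unicity point").

    class Trie:
        def __init__(self):
            self.children = {}
            self.is_word = False

        def insert(self, word):
            node = self
            for ch in word:
                node = node.children.setdefault(ch, Trie())
            node.is_word = True

        def complete(self, prefix):
            """Return the completion if the prefix is unambiguous, else None."""
            node = self
            for ch in prefix:
                if ch not in node.children:
                    return None
                node = node.children[ch]
            out = prefix
            while len(node.children) == 1 and not node.is_word:
                ch, node = next(iter(node.children.items()))
                out += ch
            return out if node.is_word and not node.children else None

    t = Trie()
    for w in ["keyboard", "keypress", "kettle"]:
        t.insert(w)
    print(t.complete("keyb"))  # -> "keyboard"
    print(t.complete("key"))   # -> None (still ambiguous)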

Corresponding author: Antal.vdnBosch@uvt.nl


Overlap-based Phrase Alignment for Language

Transformation

Wubben, Sander and van den Bosch, Antal and Krahmer, Emiel

Tilburg University

In this talk we will present our work on the task of paraphrasing from an old variant of

a language to a modern variant. One of the tasks we consider is paraphrasing the

Canterbury Tales from Middle English to Modern English. We approach this task as a

translation task and therefore we use Machine Translation techniques. The current

state-of-the-art Machine Translation systems rely heavily on statistical word alignment.

The alignment package most commonly used is GIZA++, which is used to train IBM

Models 1 to 5 and an HMM word alignment model. The benefit of using statistical word

alignment is that no assumptions need to be made about the parallel corpus and that it

generally produces better results when being fed more data. This holds for the task of

paraphrasing as well. However, when we consider monolingual parallel corpora, it

might be naive to only use statistics when we can in fact utilize the attribute that both

sides of the corpus are in the same (or at least similar) language, and therefore likely to

exhibit a certain amount of overlap. We will investigate the feasibility of using overlap

measures to align words and phrases in monolingual corpora and how this method

holds up against pure statistical alignment in a Machine Translation framework.
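
One simple instantiation of such an overlap measure is a Dice coefficient over character bigrams. The sketch below is our own illustration of the idea, with a few invented word pairs, and is not the authors' aligner.

    # Sketch: align each Middle English word to the Modern English word with
    # the highest character-bigram Dice overlap (illustration only).

    def bigrams(word):
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def dice(a, b):
        ba, bb = bigrams(a), bigrams(b)
        if not ba or not bb:
            return 0.0
        return 2 * len(ba & bb) / (len(ba) + len(bb))

    middle = ["whan", "aprille", "shoures"]
    modern = ["when", "april", "showers"]
    for m in middle:
        best = max(modern, key=lambda w: dice(m, w))
        print(m, "->", best, round(dice(m, best), 2))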

Corresponding author: s.wubben@uvt.nl


Parse and Tag Somali Pirates

van Erp, Marieke and Malaisé, Véronique and van Hage, Willem and

Osinga, Vincent and Coleto, Juan Manuel

VU University Amsterdam

Events are the most prevalent complex entities described in user-contributed social network activities, newswire, commercial infringement reports, etc. Unfortunately, due

to the nature of free text, event descriptions can take many forms, making querying for

or reasoning over them difficult.

We present an approach for event extraction from piracy attack reports issued by the

International Chamber of Commerce (ICC-CCS[1]). As the piracy attack reports are

semi-structured, we can treat the extraction task as a segmentation and labelling

problem. We extract information from the reports about participants, weapons,

locations, times and types of events, and store the information as structured event

instances. We argue that an event model is not only an intuitive representation for

such information, enabling automatic analysis and reasoning over the attacks and their

components, but also a very powerful tool for knowledge and data integration. We

show that the event model enables automatic analysis of the data, so questions such as

"How did the weapon use of pirates evolve over time?" can be answered.

[1] http://www.icc-ccs.org

Corresponding author: marieke@cs.vu.nl


Personalized Knowledge Discovery: Combining Social

Media and Domain Ontologies

Markus, Thomas and Westerhout, Eline and Monachesi, Paola

Utrecht University

We present a system that facilitates knowledge discovery by means of structured

domain ontologies. The user can discover new concepts and relations by exploring an

expert approved ontological structure which has been automatically enriched with new

concepts, relations and lexicalisations originating from social media. The system also interlinks, on the fly, the conceptual knowledge in the ontology with noisy data coming from social media at the conceptual level.

Our ontology enrichment methodology identifies salient terms using similarity

measures and determines the appropriate word senses for each term by employing a

disambiguation algorithm. The appropriate relation between the new concept (word

sense) and the existing ones is either extracted from DBpedia or from text documents

retrieved from the web. The disambiguation algorithm is also used to store the original

context of each term, that is, the term itself, its meaning, associated person and

resource. These personalised contexts are stored using the MOAT semantic vocabulary.

The enriched ontology and the disambiguation methodology allow us to give a

personalised semantic interpretation to each search result in the context of the

enriched domain ontology and the user. The amount of conceptual overlap between a

document and the person using the system is employed to offer personalised

recommendation of documents.

The advantages that this approach brings to students have been evaluated as part of a

university course with a large group of students and a separate control group.

Corresponding author: Thomas.Markus@phil.uu.nl


Recent Advances in Memory-Based Machine Translation

van Gompel, Maarten and van den Bosch, Antal and Berck, Peter

Tilburg University

We present advances in research on Memory-based Machine Translation (MBMT), a

form of machine translation in which the translation model takes the form of

approximate k-nearest neighbour classifiers. These classifiers are trained to map words

or phrases in context to a target word or phrase. The modelling of source-side context

is a key feature distinguishing this approach from standard Statistical Machine

Translation (SMT).

In 2010 we released the open source PBMBMT (phrase-based memory-based machine

translation) system. PBMBMT embraces the concept of phrases, as opposed to the

single words or fixed n-grams that earlier work in memory-based machine translation

focused on. PBMBMT employs a phrase translation table generated by Moses as the

basis for the generation of training and test instances for our classifiers. We present an

automatic method for hyperparameter optimisation, and investigate the usage of

example weighting in the memory-based classifier. We critically measure and compare

the performance of our latest system against its precursor systems, and a state-of-the-art competitor.

A recent branch of research has focused on the language model component of

PBMBMT. As PBMBMT can work with both the well known SRILM software and WOPR,

the memory-based language model, we performed a learning curve experiment with

both language models to investigate the effect of the amount of training data. Our

results challenge the common "more data is better" belief.

Corresponding author: proycon@anaproy.nl


Reversible stochastic attribute-value grammars

de Kok, Daniël and van Noord, Gertjan and Plank, Barbara

University of Groningen

Attribute-value grammars have been advocated because they are reversible. Their

declarative nature ensures that the same grammar can in principle be used for parsing

and generation.

In more recent years, attribute-value grammars have been extended with conditional

models to perform parse disambiguation and fluency ranking. However, since such

models are conditioned on a sentence or a logical form, reversibility is sacrificed.

We propose a framework for reversible stochastic attribute-value grammars. In this

framework, a single statistical model is used for parse disambiguation and fluency

ranking. We argue that this framework is more appropriate, since it recognizes that

preferences are shared between production and comprehension components. For

instance, if fluency ranking and disambiguation had different preferences with

respect to subject fronting in Dutch, communication would become problematic.

We provide experimental results that show that the performance of a reversible model

does not differ significantly from directional models for parse disambiguation and

fluency ranking. We also show that fluency ranking models can be improved by adding

annotated parse disambiguation training data, and vice versa.

Corresponding author: d.j.a.de.kok@rug.nl


Robust Rhymes? The Stability of Authorial Style in

Medieval Narratives

Kestemont, Mike and Daelemans, Walter and Sandra, Dominiek

CLiPS, University of Antwerp

We explore the application of stylometric methods developed for modern texts to

rhymed medieval narratives (Jacob of Maerlant and Lodewijk of Velthem, ca. 1260-

1330). Because of the peculiarities of medieval text transmission, we propose to use

highly frequent rhyme words for authorship attribution. First, we shall demonstrate

that these offer important benefits, being relatively content-independent and well spread over texts. Subsequent experimentation shows that correspondence analyses

can indeed detect authorial differences using highly frequent rhyme words. Finally, we

demonstrate for Maerlant’s oeuvre that these highly frequent rhyme words’ stylistic

stability should not be exaggerated since their distribution significantly correlates with

the internal structure of that oeuvre.

Corresponding author: mike.kestemont@ua.ac.be


Rule Induction for Synchronous Tree-Substitution

Grammars in Machine Translation

Vandeghinste, Vincent and Martens, Scott

Centrum voor Computerlinguïstiek - KULeuven

Data-driven machine translation systems are evolving from string-based systems

towards tree-based systems, such as the PaCo-MT system. In this system, the source

language sentence is parsed using a monolingual parser. This parse tree needs to be

converted or transduced into one or more target language parse trees from which one

or more target language sentences can be generated.

Rules are induced from phrasal alignments in an automatically parsed version of the

English and Dutch portions of the Europarl treebank. The procedure for extraction

assumes that the subtrees bounded by alignments between phrasal nodes in the two

syntactic tree-structures are suitable as rules for a Synchronous Tree Substitution

Grammar. The maximum number of such trees is extracted, given the alignments

between the two sentences, and collected over the entire corpus. Rules that occur

multiple times are inserted into the transducer as tree substitution rules. Minimally

small tree substitution rules -- those consisting of a single node and its parent -- are

used to induce translations where the extracted rules have insufficient coverage.

Corresponding author: vincent@ccl.kuleuven.be


Search in the Lassy Small Corpus

van Noord, Gertjan and de Kok, Daniel and van der Linde, Jelmer

University of Groningen

A few months ago, the STEVIN Lassy project yielded its most important results: Lassy

Small - a corpus of 1 million words with syntactic annotations which have been

manually verified and corrected, and Lassy Large - a corpus of 1.5 billion words with

automatically assigned syntactic structures. Syntactic annotations include part-of-speech

tags, lemma and dependency annotations of the type developed earlier in CGN

and D-Coi.

In this presentation we focus on the Lassy Small corpus, and introduce a stand-alone

portable tool called DACT which can be used to browse the syntactic annotations in an

attractive graphical form, and to search for sentences according to a number of search

criteria, which can be specified elegantly by means of search queries formulated in

XPATH, the WWW standard query language for XML documents. We provide a number

of linguistically relevant examples of such queries, and we review the criticism of Lai

and Bird (2010) which they take as motivation to introduce LPATH, an extension of

XPATH. We will argue that such an extension is not required if string positions are

explicitly encoded as XML attributes, as is the case in Lassy Small.
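
The argument can be made concrete with a small example. The sketch below assumes the Alpino/Lassy XML format, in which nested node elements carry rel, cat, word, begin and end attributes (the fragment itself is invented), and shows that an ordering constraint needs only plain XPath once string positions are explicit.

    # Sketch: querying Lassy-style dependency XML with plain XPath via lxml.
    from lxml import etree

    xml = """
    <alpino_ds>
      <node cat="smain" begin="0" end="3">
        <node rel="su" cat="np" begin="2" end="3" word="Jan"/>
        <node rel="hd" pos="verb" begin="1" end="2" word="slaapt"/>
        <node rel="mod" pos="adv" begin="0" end="1" word="nu"/>
      </node>
    </alpino_ds>
    """
    tree = etree.fromstring(xml)

    # All NP subjects:
    print(tree.xpath('//node[@rel="su" and @cat="np"]/@word'))

    # Because string positions are explicit attributes, ordering constraints
    # need no LPATH extension: subjects that follow their head verb.
    print(tree.xpath('//node[@rel="su"][number(@begin) >'
                     ' number(../node[@rel="hd"]/@begin)]/@word'))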

DACT is freely available for various platforms, including Mac OS and recent versions of

Windows.

Corresponding author: g.j.m.van.noord@rug.nl


Simple Measures of Domain Similarity for Parsing

Plank, Barbara and van Noord, Gertjan

University of Groningen

It is well known that parsing accuracy suffers when a model is applied to out-of-domain

data. It is also known that the most beneficial data to parse a given domain is data that

matches the domain (Sekine 1997, Gildea 2001). Hence, an important task is to select

appropriate domains. However, most previous work on domain adaptation relied on

the implicit assumption that domains are somehow given.

With the growth of the web, more and more data is becoming available, and automatic

ways to select data that is beneficial for a new (unknown) target domain are becoming

attractive. We consider various ways to automatically acquire related training data for

a given test article, and compare automatic measures to human-annotated meta-data.

The results show that a very simple measure of similarity based on word frequencies

works surprisingly well.
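
A measure of the kind alluded to can be as simple as cosine similarity between relative word-frequency distributions; the following sketch is our own minimal version and not necessarily the exact measure used in the paper.

    # Sketch: rank candidate training corpora by word-frequency similarity
    # to a test article (cosine over relative frequencies).
    import math
    from collections import Counter

    def freq_vector(text):
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def cosine(p, q):
        dot = sum(p[w] * q[w] for w in p if w in q)
        norm = math.sqrt(sum(v * v for v in p.values())) * \
               math.sqrt(sum(v * v for v in q.values()))
        return dot / norm

    article = "the patient was treated with a new drug"
    corpora = {
        "medical": "the drug was given to the patient in the trial",
        "sports": "the team won the match after a late goal",
    }
    for name, text in corpora.items():
        print(name, round(cosine(freq_vector(article), freq_vector(text)), 3))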

Corresponding author: b.plank@rug.nl


SSLD: A smart tool for sms compression

Cougnon, Louise-Amélie and Beaufort, Richard

UCLouvain - IL&C - Cental

Since 2009, we have been designing a methodology to semi-automatically develop a

dictionary based on a corpus of SMS. Such a dictionary can be used to help systems

translate from standard into sms language, a procedure which has so far been seen as

an entertaining activity; our methodology can also be employed for more serious

purposes such as text message summarising and compression tools. Our first results

were encouraging (Cougnon and Beaufort, 2010) but only focused on French data

(from Belgium, Quebec, Switzerland and La Réunion). Thanks to the sms4science

project that aims at collecting sms corpora from all over the world, we now have at our

disposal German, Dutch and Italian text messages. The aim of this paper is to describe

our three-step approach to the extraction of the dictionary entries from the various

corpora and to detail the smart manual sorting performed on the dictionary. The

results will give us the opportunity to test our initially French-based methodology on

other languages and to find out whether our approach is generic, i.e. applicable to all

languages. This question also paves the way for a panorama of sms phenomena observed in the dictionary that occur across the languages. Finally, we

propose ways in which our methodology could be further improved.

Corresponding author: louise-amelie.cougnon@uclouvain.be


Subtrees as a new type of context in Word Space Models

Smets, Margaux and Speelman, Dirk and Geeraerts, Dirk

QLVL, K.U.Leuven

In Word Space Models (WSMs) there are traditionally two types of contexts that can be

used: (i) lexical co-occurrences (`bag-of-words models') and (ii) syntactic dependencies.

In general, models with the second type of contexts seem to perform better. However,

there are some problems with these models. In the first place, a choice has to be made

which contexts to include: only subject/verb and verb/object relations, or also other dependencies. Second, in contrast with bag-of-words models, the syntactic models are

supervised: they require quite large resources (a dependency parser, a manually

annotated corpus, ...), which might not be available for each language.

The contexts we propose for use in WSMs are subtrees as defined in the framework of

Data-Oriented-Parsing. Subtrees can capture both bag-of-words (co-occurrence)

information, and syntactic information. Moreover, they are not limited to specific types

of dependencies, but rather take entire structures into account.

At first sight, it might seem that the problem of resources for dependency-WSMs

remains in this framework. After all, we first need the `correct' tree for a sentence,

before we can extract subtrees from it. However, in our experiments we show how the

entire algorithm can be made unsupervised by using an unsupervised parser as a

preprocessing step.

In the presentation, I will first discuss in detail the workings of this new type of WSM.

Next, I will present some initial results from experiments with parameters such as the

accuracy of the parser in the preprocessing step, the maximum subtree depth, the

minimum subtree frequency, and considering only subtrees with the highest variance.

Corresponding author: margauxsmets@gmail.com


Successful extraction of opposites by means of textual

patterns with part-of-speech information only.

Lobanova, Anna

Department of Artificial Intelligence, University of Groningen

We present an automatic method for extraction of opposites (e.g., rich - poor, top -

bottom, buy - sell) by means of textual patterns that only contain part-of-speech

information about target word pairs, e.g., "difference between X and Y", where X and Y are part-of-speech slots. Our

preliminary results suggest that this method outperforms a pattern-based method that

uses dependency patterns [2] (requiring more sophisticated data preprocessing),

especially for opposites expressed by nouns and verbs.

Starting with small seed sets, we automatically acquired textual patterns from a 450

million word version of Twente Nieuws Corpus of Dutch [4]. All patterns were

automatically evaluated based on their overall frequency and the number of times they

contained seed pairs. Best patterns were used to find candidate pairs. All found pairs

were automatically scored based on their frequency and co-occurrence in reliable

patterns. In addition, pairs with the highest scores (0.9 or above) were evaluated by two human

judges. The precision scores for the top-100 found pairs were 0.61 for adjective-adjective

pairs, 0.63 for noun-noun pairs and 0.52 for verb-verb pairs. When more pairs

were considered, the precision was still higher than that reported in previous studies.

Namely, for the top-500 pairs, the precision was 0.42 for adjective-adjective pairs, 0.33

for noun-noun pairs and 0.49 for verb-verb pairs.

This method needs fewer pre-processing steps than dependency patterns and can easily

be applied to vast data collections. The results can benefit many NLP applications

including augmentation of computational lexical resources, contrast identification

[3,5], detection of paraphrases and contradictions [1] and others.
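
The pattern-matching step can be illustrated as follows; the POS-tagged sentences and the single Dutch pattern are invented for the example, and the reliability scoring is reduced to raw counts.

    # Toy sketch of POS-only pattern extraction of opposites: a pattern like
    # "verschil tussen <ADJ> en <ADJ>" applied to (word, POS) tagged text.
    from collections import Counter

    tagged = [
        [("verschil", "N"), ("tussen", "P"), ("arm", "ADJ"), ("en", "C"),
         ("rijk", "ADJ")],
        [("verschil", "N"), ("tussen", "P"), ("warm", "ADJ"), ("en", "C"),
         ("koud", "ADJ")],
    ]

    pairs = Counter()
    for sent in tagged:
        for i in range(len(sent) - 4):
            w = [t[0] for t in sent[i:i + 5]]
            p = [t[1] for t in sent[i:i + 5]]
            if w[0] == "verschil" and w[1] == "tussen" and w[3] == "en" \
                    and p[2] == "ADJ" and p[4] == "ADJ":
                pairs[(w[2], w[4])] += 1

    print(pairs.most_common())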

Corresponding author: a.lobanova@ai.rug.nl


Syntactic Analysis of Dutch via Web Services

Tjong Kim Sang, Erik

University of Groningen

Alpino is a general-purpose syntactic parser for Dutch sentences. At this moment, using

the parser requires installation of the parser software on a local machine. In the CLARIN

project TTNWW, we develop a web service interface to the parser which will simplify

access for future users. The service provides access for client software via standard

protocols like SOAP and exchanges XML-encoded text data between the client machine

and the server where the parser is run. In this presentation, we present the current

status of this project.

Corresponding author: erikt@xs4all.nl


Technology recycling between Dutch and Afrikaans

Augustinus, Liesbeth 1 and van Huyssteen, Gerhard 2 and Pilon, Suléne 3

1 Centre for Computational Linguistics (CCL), K.U.Leuven
2 Centre for Text Technology (CTexT), North-West University, Potchefstroom, South Africa
3 School for Languages, North-West University, Vanderbijlpark, South Africa

Resource development for resource-scarce languages can be fast-tracked by recycling

existing technologies for closely-related languages. The main issue dealt with is the

recycling of Dutch technologies for Afrikaans. The possibilities of technology transfer

are investigated by focusing on the D2AC-A2DC project. After exploring the

architecture and functioning of D2AC, a Dutch-to-Afrikaans convertor, attention then turns to the development and performance of A2DC, an Afrikaans-to-Dutch

convertor. The latter tool is then used to improve the annotation of Afrikaans text with

Dutch technologies. In particular, the performance of part-of-speech tagging and

chunking has been considered. The accuracies of both tagger and chunker improve

significantly if the data are first converted with A2DC before they are sent through the

tools for Dutch analysis.

Corresponding author: liesbeth@ccl.kuleuven.be


Technology recycling for closely related languages: Dutch

and Afrikaans

Pilon, Suléne 1 and Van Huyssteen, Gerhard 2

1 North-West University (VTC)

2 North-West University (PC)

If two languages (L1 and L2) are similar enough, the development of technologies for

L2 can be expedited by recycling existing L1 resources. This process is called technology

recycling and the success thereof is greatly dependent on the degree of similarity

between the two languages in question. Other strategies can, however, be employed

to improve the efficiency of L1 technologies on L2 data and in this research we

experiment with one such strategy, viz. lexical conversion as pre-processing step. We

explore the possibility of using rule-based lexical conversion to improve the accuracy of

Dutch technologies when annotating Afrikaans data. The rationale here is that Dutch

technologies should perform better on Afrikaans data that appears more Dutch-like,

even if the conversion does not yield a good Dutch translation. To do the lexical

conversion, we developed an Afrikaans to Dutch convertor (A2DC) which obtains an

accuracy of more than 72% when converting Afrikaans words to Dutch. For our

experiment we use a state-of-the-art Dutch POS tagger and parser to annotate raw

Afrikaans data. The same data is then converted with A2DC and once again annotated

with the Dutch technologies. In both experiments the conversion has a notably positive

effect on the performance of the Dutch technologies. The biggest difference is

observed in the POS tagging task with the overall accuracy increasing from 62.6% when

annotating raw Afrikaans data to 80.6% when annotating converted data, while the

parsing f-score improves from 0.44 (raw data) to more than 0.68 (converted data).
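
The flavour of such a conversion can be conveyed with two textbook Afrikaans-Dutch correspondences; the rules below are chosen purely for illustration and are far removed from the real A2DC rule set.

    # Illustrative Afrikaans-to-Dutch respelling in the spirit of A2DC.
    # Two textbook correspondences only; the real rule set is far larger.
    import re

    RULES = [
        (re.compile(r"sk"), "sch"),   # Afrikaans 'skool' -> Dutch 'school'
        (re.compile(r"y"), "ij"),     # Afrikaans 'ys'    -> Dutch 'ijs'
    ]

    def convert(word):
        for pattern, repl in RULES:
            word = pattern.sub(repl, word)
        return word

    print([convert(w) for w in ["skool", "ys", "my"]])
    # -> ['school', 'ijs', 'mij']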

Corresponding author: sulene.pilon@nwu.ac.za


The more the merrier? How data set size and noisiness

affect the accuracy of predicting the dative alternation

Theijssen, Daphne and van Halteren, Hans and Boves, Lou and

Oostdijk, Nelleke

Radboud University Nijmegen

In the dative alternation in English, speakers and writers choose between the

prepositional dative construction ('I gave the ball to him' and the double object

construction ('I gave him the ball'). Logistic regression models have already been shown

to be able to predict over 90% of the choices correctly (e.g. Bresnan et al. 2007).

Collecting dative instances from a corpus and encoding them with the required

information is a costly procedure. We therefore developed a semi-automatic approach

to do this, consisting of three steps: (1) automatically extracting dative candidates, (2)

manually approving or rejecting these candidates, and (3) automatically annotating the

approved candidates with the required information. The resulting data sets are noisier

than data sets that have been checked completely manually, but the approach can

yield much larger data sets.

We compare the effect of data set size and noisiness on the accuracy of predicting the

dative alternation. We employ a 'manual' set of 2,877 instances in spoken English,

taken from Switchboard (Godfrey et al. 1992) by Bresnan et al (2007) and from ICE-GB

(Greenbaum 1996) by Theijssen (2010). In addition, we use a 'semi-automatic' set with

7,755 instances from Switchboard, ICE-GB and BNC (BNC Consortium 2007). We

compare the learning curves of various machine learning algorithms by randomly

selecting subsets of the data and extending them with 500 instances each time. We do

this for different levels of noisiness, i.e. varying the proportion of 'semi-automatic'

instances (0%, 25%, 50%, 75%, 100%). The results are presented at the conference.
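
The learning-curve procedure can be sketched generically. The code below uses synthetic data and scikit-learn's logistic regression purely to show the shape of the experiment; it does not reproduce the dative data or feature encoding.

    # Generic sketch of the learning-curve setup: grow the training set in
    # steps of 500 and track held-out accuracy (synthetic data for brevity).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8000, 10))
    y = (X[:, 0] + 0.5 * rng.normal(size=8000) > 0).astype(int)
    X_test, y_test = X[7000:], y[7000:]

    for n in range(500, 7001, 500):
        clf = LogisticRegression(max_iter=1000).fit(X[:n], y[:n])
        print(n, round(clf.score(X_test, y_test), 3))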

References

BNC Consortium (2007). The British National Corpus, version 3 (BNC XML Edition).

Oxford University Computing Services.

Bresnan, Joan, Anna Cueni, Tatiana Nikitina and R. Harald Baayen (2007). Predicting the

Dative Alternation. In Bouma, Gerlof, Irene Kraemer and Joost Zwarts (eds.), Cognitive

Foundations of Interpretation, Royal Netherlands Academy of Science, Amsterdam, pp. 69-94.

Godfrey, John J., Edward C. Holliman and Jane McDaniel (1992). Switchboard:

Telephone speech corpus for research and development. Proceedings of the

International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), pp.

517–20.

Greenbaum, Sidney (ed.) (1996). Comparing English Worldwide: The International

Corpus of English. Oxford, U.K.

Theijssen, Daphne (2010). Variable selection in Logistic Regression: The British English

dative alternation. In Icard, Thomas and Reinhard Muskens (eds.), Interfaces:

Explorations in Logic, Language and Computation. Series: Lecture Notes in Computer

Science (subseries: Lecture Notes in Artificial Intelligence), volume 6211, Springer.

Corresponding author: d.theijssen@let.ru.nl


The use of structure discovery methods to detect syntactic

change

ten Bosch, Louis and Versteegh, Maarten

Radboud University Nijmegen

A well-known problem in linguistics deals with the description and analysis of

diachronic changes in syntactic constructions. In Western European languages, such

changes have occurred a number of times over the last few centuries. In this

presentation we present an overview of quantitative methods for analyzing historical

text corpora. Our overview will include parsing-related methods, Bayesian methods

and Latent Semantic Analysis, with special focus on methods that do not take syntactic

trees as a starting point. In addition, we will pay attention to two methodological

approaches, viz. the contrastive approach, according to which two different

independent analyses of two corpora are compared, and the single-model approach,

according to which changes in syntactic structure are interpreted as the result of a

(possibly biased) competition within a single model.

We will compare the various methods by presenting different analyses of the same text

material.

Corresponding author: l.tenbosch@let.ru.nl


Treatments of the Dutch verb cluster in formal and

computational linguistics

Van Eynde, Frank

K.U.Leuven

The Dutch verb cluster has always been a challenge for formal and computational

linguistics, since the sentences which contain one display a rather dramatic discrepancy

between surface structure, on the one hand, and semantic structure, on the other

hand, as illustrated amongst others by the cross-serial dependencies in sentences with

an AcI verb, such as 'zien' in '...dat ik haar de honden heb zien voederen' (... that I saw

her feed the dogs).

In multistratal frameworks, such as transformational grammar, the discrepancy is

accounted for in terms of movement. More specifically, there is a level of syntactic

structure which straightforwardly reflects the semantic relations, called deep structure

or D-structure, and there is a series of transformations which map D-structures onto

surface structures. The transformations either move the verbs, as in Arnold Evers'

analysis, or their arguments, as in Jan-Wouter Zwart's analysis.

In monostratal frameworks, such as GPSG and HPSG, the discrepancy between surface

stucture and semantic structure is handled in terms of the inheritance of valence

requirements, allowing the verbs in the cluster to take over the unfulfilled valence

requirements of their verbal complement. This approach was pioneered by Mark

Johnson in GPSG and by Erhard Hinrichs and Tsuneko Nakazawa in HPSG. Applications

to Dutch are spelled out in work by Gerrit Rentier and by Gosse Bouma and Gertjan van

Noord.

In the Dutch treebanks, such as those of CGN and Lassy, the treatment of the verb

cluster is monostratal, but the device to bridge the discrepancy between surface

structure and semantic structure is more reminiscent of multistratal analyses, allowing

the existence of crossing dependencies and hence the postulation of discontinuous

constituents. The talk will give a survey of the existing treatments and provide a

comparative evaluation.

Corresponding author: frank.vaneynde@ccl.kuleuven.be


TTNWW: de facto standards for Dutch in the context of

CLARIN

Schuurman, Ineke 1 and Kemps-Snijders, Marc 2

1 Centrum voor Computerlinguïstiek, K.U.Leuven

2 Meertens Instituut, Amsterdam

The Flemish and Dutch CLARIN groups have started TTNWW, a larger pilot project

(2010-2012), in which several existing resources (both text and speech) and de facto

standards for Dutch are a) used to help HSS researchers to address new research

needs, while b) these resources and standards are adapted to/mapped onto the

standards adopted by CLARIN, embedded in a work flow and presented as a web

service designed for these HSS researchers.

In TTNWW technology partners (5 speech, 10 text) and user groups (4 speech, 2 text)

are involved, spread over Flanders and the Netherlands.

The text part of TTNWW focusses on recognition of all kinds of names in various types

of texts, such as Dutch novels and archaeological documents, the latter case in

combination with temporal analysis. All 'lower' levels are also taken care of.

The project will provide the CLARIN community with ample feedback with respect to

the standards and technologies proposed in the European context and promote the de

facto standards for Dutch NLP as used in CGN and several STEVIN-projects.

In this presentation we will concentrate on various standards for written language and

mapping between these, especially PoS (CGN/D-Coi), MAF and ISOcat.

Corresponding author: ineke.schuurman@ccl.kuleuven.be


TTNWW: NLP Tools for Dutch as Webservices in a

Workflow

Kemps-Snijders, Marc 1 and Schuurman, Ineke 2

1 Meertens Instituut, Amsterdam

2 CCL, K.U.Leuven

The Flemish and Dutch CLARIN groups have started TTNWW, a larger pilot project

(2010-2012), in which several existing resources (both text and speech) and de facto

standards for Dutch are a) used to help HSS researchers to address new research

needs, while b) these resources and standards are adapted to/mapped onto the

standards adopted by CLARIN, embedded in a work flow and presented as a web

service designed for these HSS researchers.

In TTNWW technology partners (5 speech, 10 text) and user groups (4 speech, 2 text)

are involved, spread over Flanders and the Netherlands.

To develop the functionalities for both the speech and text parts of the project, services that are delivered by each of the partners will be combined in a workflow approach

allowing for flexible combinations of processes. Efforts in this area are embedded in

the CLARIN effort to describe web services for easy discovery and profile matching, i.e.

offering possible combinations of available resources and web services for specific

tasks.

In this presentation we will focus on methods for workflow construction, description of

web services and place them in the international perspective of CLARIN.

Corresponding author: marc.kemps.snijders@meertens.knaw.nl


Using corpora tools to analyze gradable nouns in Dutch.

Ruiz, Nicholas and Weiffenbach, Edgar

University of Groningen

Morzycki (2009) claims that degree readings of size adjectives, such as "a big idiot" are

not merely the "consequence of some extragrammatical phenomenon," but rather can

be attributed to syntax, which gives positional restrictions on the availability of degree

readings. We expand on Morzycki (2009) by introducing a corpus-based analysis in

Dutch to verify Morzycki's claims and to extend his claim to the semantic domain.

Using LASSY, a syntactically annotated Dutch corpus developed inter alia under the

STEVIN programme, we extract syntactic and semantic properties of noun phrases

consisting of adjectives "gigantisch", "kolossaal", and "reusachtig" and manually

annotate each adjective-noun pair with a gradable or non-gradable label.

Using these features, we construct a statistical model based on logistic regression and

find that the semantic role, definiteness, and particular semantic noun groups derived

from Cornetto (a Dutch WordNet with referential relations) have a significant effect on

the likelihood that an adjective-noun pair is interpreted by the reader to have a degree

reading.

Corresponding author: nicholas.ruiz@gmail.com


Using easy distributed computing for data-intensive

processing

Van den Bogaert, Joachim

Centre for Computational Linguistics, K.U. Leuven

Given the large amounts of data we are coping with when computing useful data from

large corpora, and the difficulties and costs it takes to run parallel code with traditional

parallel computing, we will present different frameworks that may be used to facilitate

easy distributed computing. Using string-to-tree alignment (GHKM), frequent subtree

mining, and distributed Moses decoding as example cases, we will demonstrate how

applications and algorithms may be scaled up and scaled out with these frameworks.

We will consider both the creation of an embarrassingly parallel solution and the redesign

of an existing algorithm to fit the mapreduce paradigm.
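
The mapreduce re-design can be previewed on the classic word-count example; the sketch below emulates the map and reduce phases with a local process pool standing in for a real distributed framework.

    # Sketch of the mapreduce pattern with a process pool standing in for a
    # distributed framework: map emits (word, 1) pairs, reduce sums per key.
    from collections import defaultdict
    from multiprocessing import Pool

    def map_phase(line):
        return [(word, 1) for word in line.split()]

    def run(lines):
        with Pool() as pool:
            mapped = pool.map(map_phase, lines)   # map step, in parallel
        counts = defaultdict(int)
        for pairs in mapped:                      # shuffle + reduce step
            for word, n in pairs:
                counts[word] += n
        return dict(counts)

    if __name__ == "__main__":
        corpus = ["de kat slaapt", "de hond slaapt", "de kat eet"]
        print(run(corpus))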

Corresponding author: joachim@ccl.kuleuven.be


What is the use of multidocument spatiotemporal

analysis?

Schuurman, Ineke and Vandeghinste, Vincent

Centrum voor Computerlinguïstiek, K.U.Leuven

In the project AMASS++ (IWT) the central research topic concerns multilingual

multimedia multidocument summarization: how, in a huge digitized news archive, can a

journalist find documents dealing with the same events and have a summary made of

them in order to assess their usefulness for a specific purpose? Or: how can a newspaper deliver personalized (inter)national and local news to a specific subscriber?

In this presentation we will show how spatiotemporal analysis can be of assistance in

such tasks, even when the input just consists of raw PoS-tagged texts (i.e. contrary to

the SoNaR-project where many levels of annotation are available, all of them

corrected).

Corresponding author: ineke.schuurman@ccl.kuleuven.be


Without a doubt no uncomplicated task: Negation cues

and their scope

Morante, Roser and Schrauwen, Sarah and Daelemans, Walter

CLiPS - University of Antwerp

Although negation has been extensively treated from a theoretical perspective (Klima

1964, Horn 1989, Tottie 1991, van der Wouden 1997) and its processing is thought to

be relevant for natural language processing systems (Morante and Sporleder 2010),

there is a lack of annotated resources and no publicly available annotation guidelines

can be found that describe in detail how to annotate negation related aspects. In this

talk we present a corpus annotated with negation cues and their scope, we describe

the guidelines that we have defined and we comment on the linguistic aspects of the

annotation process. The annotated corpus contains the detective stories The Hound of

the Baskervilles and The Adventure of Wisteria Lodge by Conan Doyle. Part of the

corpus has already been annotated with other layers of semantic information

(semantic roles, coreference) for the SemEval Task Linking Events and Their

Participants in Discourse (Ruppenhofer et al., 2010). We first describe the expression

of negation in this corpus and compare it with the expression of negation in biomedical

documents. Then we comment in detail on several aspects related to the negation

phenomenon: how to determine what negation cues are, how to mark the scope, and

how to determine whether an event is negated. We will show that marking the cues is

not a matter of lexical look-up because some cues are ambiguous, and that contextual

and discourse level features play a role in finding the scope. Additionally, we show

that finding negated events depends on the semantic class of the predicates being

involved, their mood and tense, on the modality of the event clause and on the

syntactic constructions. Finally, we comment on the most difficult aspects of the

annotation process, like determining when prepositions like "save" or "except" act as

negation cues.

Corresponding author: roser.morante@ua.ac.be


Poster Abstracts


A database for lexical orthographic errors in French

Manguin, Jean-Luc

GREYC - Univ.de Caen - France

This work describes the construction of a database of lexical orthographic errors in French. The construction uses different techniques from the field of NLP for a goal in the field of psycholinguistics, where it is often difficult and time-consuming to collect enough data from experiments with real people. Here the data are collected on-line and come from the requests made to an on-line dictionary. In this huge amount of data (about 160 million words, 4 million distinct forms), we can find enough errors to have good statistics for a deep study of errors. The questions developed here are the link between a "bad" form and its correction, and the classification of errors into a small number of types. Several programs and techniques are involved to achieve these tasks: detection of graphic neighbours, phonetization, and pattern matching. The combination of these techniques leads us to 70% of corrections with no ambiguity, and 80% if we accept that the system gives several possible corrections. The classification of errors is also useful for predicting where errors may appear in words, and thus for understanding how children learn orthography.

Corresponding author: jean-luc.manguin@unicaen.fr


A Posteriori Agreement as a Quality Measure for

Readability Prediction Systems

van Oosten, Philip and Hoste, Véronique and Tanghe, Dries

LT3, University College Ghent

All readability research is ultimately concerned with the research question whether it is

possible for a prediction system to automatically determine the level of readability of

an unseen text.

A significant problem for such a system is that readability might depend in part on the

reader. If different readers assess the readability of texts in fundamentally different ways, there is insufficient a priori agreement to justify the correctness of a readability prediction system based on the texts assessed by those readers. We built a data set of readability assessments by expert readers. We clustered the experts into groups with

greater a priori agreement and then measured for each group whether classifiers

trained only on data from this group exhibited a classification bias.

As this was found to be the case, the classification mechanism cannot be

unproblematically generalized to a different user group.

Corresponding author: philip.vanoosten@hogent.be


A TN/ITN Framework for Western European languages

Chesi, Cristiano 1 and Cho, Hyongsil 2 and Baldewijns, Daan 2 and Braga, Daniela 1

1 Microsoft Language Development Center

2 Microsoft Language Development Center, ISCTE-Lisbon University

Institute, Portugal

(Inverse) Text Normalization, (I)TN, is an essential module in Text-to-Speech (TTS) and

Speech Recognition (SR) systems and it requires both a significant development

timeline and a deep linguistic expertise (Mikheev 2000, Palmer 2010).

In this work, we describe an efficient multilingual (I)TN framework that is rule-based,

hierarchical and modular: the core system is composed of a large set of optimized

Finite-State Transducers (FSTs) that are compiled following the Normalization Maps

developed by Language Experts (LEs) for each language: such maps are built using a

proprietary tool (TNAuthoringTool, Patent Serial No. 12/361,114) that allows the LEs to

express terminals normalization at high level (e.g. Term_1: “1” > “one”) and easily

combine such terminals by means of hierarchical, weighted rules: these rules can be

ordered sets of terminals or other rules, each one ranked according to their relevance

so as to prevent interference in specific contexts (e.g. Rule_1: “21-12-2010” > “twenty-first of december two thousand ten“ vs. Rule_2: “23-29” > “from twenty three to

twenty nine”). Such rules are clustered under a small number of Top-Level rules that

will be the entry states of the compiled FSTs.

The core set of Top-Level rules developed covers Numerals, Ordinals, Dates and Time,

Telephone numbers, Measurements and Web-related terms (e.g. URLs, email,

acronyms). Here, we focus on the ambiguity resolution implemented in three

languages (English, French, Italian) in the normalization of web-search specific terms

and mobile text messages. Accuracy and coverage of the FSTs are evaluated against

very large BING queries collections and SMS corpora.
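
To make the rule-ordering idea concrete, here is a toy sketch in which a more specific date rule outranks a generic number-range rule; the tiny word lexicon covers only the two example inputs and is in no way the patented authoring tool.

    import re

    # A toy sketch of ordered normalization rules (not the patented
    # TNAuthoringTool): rules are tried in rank order, so the more specific
    # date rule fires before the generic number-range rule.
    WORDS = {"21": "twenty-first", "12": "december", "2010": "two thousand ten",
             "23": "twenty three", "29": "twenty nine"}  # toy lexicon, demo inputs only

    RULES = [  # ordered: most specific first
        (re.compile(r"\b(\d{1,2})-(\d{1,2})-(\d{4})\b"),  # Rule_1: dates
         lambda m: "%s of %s %s" % (WORDS[m.group(1)], WORDS[m.group(2)], WORDS[m.group(3)])),
        (re.compile(r"\b(\d+)-(\d+)\b"),                  # Rule_2: number ranges
         lambda m: "from %s to %s" % (WORDS[m.group(1)], WORDS[m.group(2)])),
    ]

    def normalize(text):
        for pattern, rewrite in RULES:
            text = pattern.sub(rewrite, text)
        return text

    print(normalize("21-12-2010"))  # twenty-first of december two thousand ten
    print(normalize("23-29"))       # from twenty three to twenty nine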

Corresponding author: v-crches@microsoft.com



An Examination of Cross-Cultural Similarities and Differences from Social Media Data with respect to Language Use

Elahi, Mohammad Fazleh and Monachesi, Paola

We present a methodology for analyzing cross-cultural similarities and differences using language as a medium, love as the domain, social media as a data source, and 'Terms' (emotions and sentiments) and 'Topics' as cultural features. We discuss the techniques necessary for the creation of the social data corpus from which emotion terms have been extracted using NLP techniques. Topics of love discussion were then extracted from the corpus by means of Latent Dirichlet Allocation (LDA). Finally, on the basis of these features, a cross-cultural comparison was carried out. For the purpose of cross-cultural analysis, the experimental focus was on comparing data from a culture from the East (India) with a culture from the West (United States of America). Similarities and differences between these cultures have been analyzed with respect to the usage of emotions, their intensities, and the topics used during love discussions in social media. Findings include: (i) Indians are more emotional than Americans, but Americans express themselves with stronger emotion terms than Indians; (ii) in discussions on common topics related to love (Wedding, Same Sex, etc.), the conversations of Indians and Americans relate to the particular traditions and recent issues of their culture; and (iii) Indians and Americans each also use some terms and topics that are related only to their own culture.
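
For the topic extraction step, a minimal LDA illustration with the gensim library is given below; the three toy posts and the number of topics are assumptions for the example, not the paper's corpus or settings.

    # Illustrative only (not the authors' pipeline): extracting discussion
    # topics from tokenized social media posts with LDA, using gensim.
    from gensim import corpora, models

    posts = [                                   # toy tokenized posts
        ["wedding", "love", "family", "tradition"],
        ["love", "heart", "miss", "you"],
        ["wedding", "dress", "ceremony", "love"],
    ]
    dictionary = corpora.Dictionary(posts)
    bow_corpus = [dictionary.doc2bow(post) for post in posts]
    lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)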

Corresponding author: rmf_ku@yahoo.com

Authorship Verification of Quran

Shokrollahi-Far, Mahmoud

Tilburg University

The Holy Quran, as the cultural heritage of the Islamic world, has long been the focus of scholarly disputes, some still unresolved, such as whether Prophet Mohammad himself authored the book. This paper reports a research trend that approaches such disputes as a text classification task, in this case authorship verification. To induce classifiers for this verification task, SVM and Naive Bayes machines have been trained on tagged corpora of Quranic texts bootstrapped by Mobin, a morpho-syntactic tagger developed for Arabic. The algorithm applied for the task is an efficient enhancement of algorithms previously applied to authorship verification, and it seems applicable to authorship problems for other Arabic texts. The results have not verified Prophet Mohammad as the author of the Quran.
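
A generic sketch of this kind of SVM / Naive Bayes text classification pipeline, using scikit-learn, is shown below; the tag sequences and labels are placeholders, not the Quranic corpora or the output of the Mobin tagger.

    # A generic sketch of an SVM / Naive Bayes verification pipeline in
    # scikit-learn; training data below are placeholders only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    train_texts = ["NOUN VERB PRON NOUN", "VERB NOUN PRON",   # attributed texts
                   "ADJ NOUN ADJ", "ADJ ADJ VERB"]            # reference texts
    train_labels = [1, 1, 0, 0]    # 1 = attributed author, 0 = reference corpus

    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(train_texts)

    for clf in (MultinomialNB(), LinearSVC()):
        clf.fit(X, train_labels)
        print(type(clf).__name__, clf.predict(vectorizer.transform(["PRON VERB NOUN"])))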

Corresponding author: m.shokrollahifar@uvt.nl


CLAM: Computational Linguistics Application Mediator

van Gompel, Maarten and Reynaert, Martin and van den Bosch, Antal

TiCC, Tilburg University

The Computational Linguistics Application Mediator (CLAM) allows you to quickly and transparently turn your Natural Language Processing application into a RESTful webservice with which automated clients can communicate, but which at the same time acts as a modern web application with which human end-users can interact directly. CLAM takes a description of your system and wraps itself around it. It allows both automated clients and human end-users to upload input files to your application, start your application with specific parameters, and download or directly view the output files produced after execution has completed. Rich support for metadata and provenance data is also provided.

CLAM is set up in a universal fashion, making it flexible enough to be wrapped around a wide range of computational linguistic applications. These applications are treated as a black box, of which only the parameters, input formats, and output formats need to be described. The applications themselves need not be network-aware in any way, nor aware of CLAM. The handling and validation of input is taken care of by CLAM.
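
By way of illustration, an automated client might drive such a RESTful service roughly as follows; the endpoint paths, parameter names and URL are hypothetical, so consult the actual CLAM documentation for the real API.

    # A hedged sketch of an automated client driving a CLAM-style RESTful
    # webservice with the "requests" library. The endpoint paths, parameter
    # names and URL below are hypothetical, not the actual CLAM API.
    import requests

    BASE = "http://localhost:8080/myservice"    # hypothetical service URL
    project = "demo"

    requests.put("%s/%s" % (BASE, project))     # create a project
    with open("input.txt", "rb") as f:          # upload an input file
        requests.post("%s/%s/input/input.txt" % (BASE, project), data=f)
    requests.post("%s/%s" % (BASE, project), data={"param": "1"})  # start the run

    status = requests.get("%s/%s" % (BASE, project))  # poll status / fetch results
    print(status.status_code)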

Corresponding author: proycon@anaproy.nl

Discriminative features in reversible stochastic attribute-value grammars

de Kok, Daniël

University of Groningen

Reversible stochastic attribute-value grammars use one model for parse disambiguation and fluency ranking. Such a model encodes preferences with respect to syntax, fluency, and appropriateness of logical forms as weighted features. This framework is appropriate if similar preferences are used in parsing and generation.

Reversible models incorporate features that are specific to parse disambiguation and fluency ranking, as well as features that are used for both tasks. One particular concern with respect to such models is that much of their discriminatory power may be provided by task-specific features. If this is true, the premise that similar preferences are used in parsing and generation does not hold.

A detailed analysis of features could give us more insight into the true reversibility of stochastic attribute-value grammars. However, as De Kok (2010) argued, such feature-based models are very opaque due to their enormous size and the tendency to spread weight mass among overlapping features. Feature selection methods can be used to extract a subset of features that do not overlap.

In this work, we compare gain-informed feature selection (Berger et al., 1996; Zhou et al., 2003; De Kok, 2010), grafting (Perkins et al., 2003), and grafting-light (Zhu et al., 2010) in performing selection on reversible models. We then use the most effective method to extract a list of features ranked by their discriminatory power. We show that only a very small number of features is required to produce an effective model for parsing and generation. We also provide a qualitative and quantitative analysis of these features.
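
The flavour of gain-informed selection can be conveyed with a small greedy sketch: at each step, add the feature whose inclusion yields the largest score gain. The scoring function below is a stand-in, not the maximum-entropy gain estimates of the cited methods.

    # A conceptual sketch of greedy, gain-informed forward selection (not
    # the cited implementations): at each step, add the feature whose
    # inclusion yields the largest score gain on held-out data.
    def greedy_select(features, score, max_features=5):
        """score(selected) -> float; higher is better."""
        selected = []
        while len(selected) < max_features:
            current = score(selected)
            best_gain, best_feature = 0.0, None
            for f in features:
                if f in selected:
                    continue
                gain = score(selected + [f]) - current
                if gain > best_gain:
                    best_gain, best_feature = gain, f
            if best_feature is None:   # no remaining feature improves the model
                break
            selected.append(best_feature)
        return selected

    # Toy scoring function: pretend three features carry all the useful signal.
    useful = {"f1": 0.30, "f7": 0.20, "f9": 0.05}
    score = lambda sel: sum(useful.get(f, 0.0) for f in sel)
    print(greedy_select(["f%d" % i for i in range(10)], score))  # ['f1', 'f7', 'f9']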

Corresponding author: d.j.a.de.kok@rug.nl


Fietstas: a web service for text analysis

Jijkoun, Valentin and de Rijke, Maarten and Vishneuski, Andrei

University of Amsterdam

We present Fietstas: an open-access web service for text analysis, created to simplify the building of text-intensive applications. As a web service, Fietstas consists of (1) a simple content management component, where users can upload their content with metadata, (2) a collection of text processing components, from tokenization to named entity extraction and normalization, (3) a component for accessing and visualizing document processing results (e.g., as XML or HTML), and (4) a data analysis component for generating term-cloud-based summaries and timelines. The functionality is available through an easy-to-use REST interface (i.e., through standard HTTP requests), and a number of APIs are available (e.g., for Python and Perl). In this presentation, we briefly describe Fietstas and demonstrate how its functionality can be used in a simple web application (news search and analysis).

Corresponding author: jijkoun@uva.nl


FoLiA: Format for Linguistic Annotation

van Gompel, Maarten and Reynaert, Martin and van den Bosch, Antal

TiCC, Tilburg University

We present FoLiA, an XML-based annotation format suitable for the representation of written language resources. The format builds upon the work put into the D-Coi/SoNaR format, but greatly extends it to accommodate a wide variety of linguistic annotations. The objective is to present a rich annotation format based on a single unifying notation paradigm that does not commit to any particular tagset, but instead offers maximum flexibility and extensibility. In doing so, we replace the many ad-hoc formats present in the field with a single well-structured format.

FoLiA will be proposed as a candidate CLARIN standard.

Corresponding author: proycon@anaproy.nl


How can computational linguistics help determine the core meaning of then in oral speech?

Vallee, Michael

EDC

Research on the connective then has mainly focused on its temporal interpretation. However, little has been done on then in oral speech, and more precisely in questions, orders or inferential sentences. It seems important to show that computational linguistics can help determine whether there is one way to describe the connective in these contexts, or whether there are as many different connectives then as there are linguistic structures.

To do so, I will use the prosody of questions and sentences in oral speech to demonstrate that the speaker expresses surprise or contradiction with what was uttered prior to the connective in these contexts. To illustrate this perspective, let's consider the utterance "Now then you listen to me". I will show that the phonological structure revealed by software helps to describe how then works. For instance, in the example above, the underlying structure before the utterance was "you-not-listen to me", which was not expected by the speaker. In that case, the connective marks the different viewpoints of the speaker and the hearer.

Corresponding author: m.vallee@yahoo.fr

Of mathematicians and physicists: the history of language and speech technology in the Netherlands and Flanders

van der Beek, Leonoor

Q-go

When did language technology come into being in the Low Countries? Why are language and speech technology (LST) located in different faculties at Dutch and Flemish universities? What was the impact of Lernout and Hauspie on the LST industry in the Netherlands? From September 2009 until September 2010, I investigated these and many other questions related to the history of LST in the Netherlands and Flanders. I interviewed the pioneers in our field and compiled from their stories a diverse picture of struggle against the limitations of immature computer technology, of boundless optimism and deep disappointment, and of academic friendships and fights. From Adriaan van Wijngaarden to Jo Lernout, and from PHLIQA via Eurotra to CGN. I'll sketch the project and the approach I've taken, and give away some of the highlights of the book `Van Rekenmachine tot Taalautomaat' (in Dutch), which will tell the full story.

Corresponding author: vdbeek@gmail.com



On the difficulty of making concreteness concrete

van Halteren, Hans and Theijssen, Daphne and Oostdijk, Nelleke and Boves, Lou

Radboud University Nijmegen

As analysis and annotation progress to deeper linguistic levels, matters prove ever more difficult. It not only becomes harder to get machines to provide proper analyses, but also to define exactly what we want. Whereas there appears to be consensus on what plural nouns are (morpho-syntax) or what relative clauses are (syntax), this is certainly not the case for semantic properties like concreteness. When reading papers referring to such concepts, one is unlikely to notice any problems. Bresnan et al. (2007), e.g., simply use the concreteness of a noun as a given and draw conclusions about the significance of its influence on choices in the dative alternation.

However, once we ourselves attempt to annotate for concreteness, we run headlong into the absence of any clear definition of concreteness. Bresnan refers to Garretson (2003), where all we get is a vague (and somewhat circular) description and some examples. Looking further, we find lists, such as in the MRC Psycholinguistic Database (Coltheart, 1981), as well as procedures, such as Xing et al.'s (2010) procedure based on WordNet, all apparently leading to values for the property concreteness. But we can only wonder to what degree these various definitions and procedures lead to the same results. In this paper, therefore, we take a number of procedures that yield concreteness values and examine a) to what degree they overlap in their annotation of corpus data (here: Semcor) and b) to what degree they lead to the same conclusions about the influence of concreteness on syntactic processes (here: the dative alternation).
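
As an illustration of measuring overlap between such procedures, rank correlation over shared items is one simple option; the nouns and scores below are invented toy values, not data from the paper.

    # An illustrative sketch (not the paper's evaluation): comparing two
    # procedures that assign concreteness values to the same nouns, using
    # Spearman rank correlation; all scores below are invented toy values.
    from scipy.stats import spearmanr

    nouns = ["table", "idea", "dog", "freedom", "letter"]
    mrc_style = [600, 250, 640, 210, 580]          # list-based ratings (toy)
    wordnet_based = [0.9, 0.2, 0.95, 0.15, 0.7]    # procedure-based scores (toy)

    rho, p = spearmanr(mrc_style, wordnet_based)
    print("Spearman rho = %.2f (p = %.3f)" % (rho, p))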

Corresponding author: hvh@let.ru.nl

ParaSense or how to use parallel corpora for Cross-Lingual Word Sense Disambiguation

Lefever, Els and Hoste, Véronique

LT3, University College Ghent

Cross-Lingual Word Sense Disambiguation (WSD) consists of selecting the correct translation of an ambiguous word in a given context. In this talk we present a set of experiments for a classification-based WSD system that uses evidence from multiple languages to define a translation label for an ambiguous target word in one of the five supported languages (viz. Italian, Spanish, French, Dutch and German). Instead of using a predefined monolingual sense inventory such as WordNet, we use a language-independent framework and build up our sense inventory by means of the aligned translations from the parallel corpus Europarl.

The information that is used to train and test our classifier contains the well-known WSD local context features of the English input sentences, as well as translation features from the other languages.

Our results show that the multilingual approach outperforms classification experiments that merely take into account the more traditional monolingual WSD features. In addition, our results are competitive with those of the best systems that participated in the SemEval-2 "Cross-Lingual Word Sense Disambiguation" task.
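
Schematically, combining local context features with translation features might look as follows in scikit-learn; the feature names, the toy instances and the classifier choice are assumptions for the sketch, not the actual ParaSense system.

    # A schematic sketch of combining monolingual context features with
    # translation features from other languages; all data are toy values.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import BernoulliNB

    instances = [   # ambiguous English "bank", with French/Dutch translation features
        {"lemma-1": "river", "lemma+1": "overflowed", "fr": "rive", "nl": "oever"},
        {"lemma-1": "central", "lemma+1": "account", "fr": "banque", "nl": "bank"},
    ]
    labels = ["Ufer", "Bank"]   # German translation labels

    vec = DictVectorizer()
    X = vec.fit_transform(instances)
    clf = BernoulliNB().fit(X, labels)

    test = {"lemma-1": "muddy", "lemma+1": "flooded", "fr": "rive", "nl": "oever"}
    print(clf.predict(vec.transform([test])))   # ['Ufer']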

Corresponding author: els.lefever@hogent.be



"Pattern", a web mining module for Python

De Smedt, Tom and Daelemans, Walter

CLiPS, University of Antwerp

"Pattern" is a mash-up package for the Python programming language that bundles

fast, regular expressions-based functionality for NLP and data-mining tasks. It consists

of the following modules:

1) pattern.web: provides easy access to Google, Yahoo, Bing, Twitter, Wikipedia,

Flickr, RSS + a robust HTML DOM parser.

2) pattern.en: tools for verb inflection, noun pluralization/singularization, a WordNet

interface, a fast tagger/chunker based on regular expressions.

3) pattern.table: for working with datasheets (e.g. MS Excel) and CSV-files.

4) pattern.search: regular expressions for syntax and semantics. For example:

"BRAND|NP VP JJ+" matches any sentence in which a noun phrase containing a

brand name is followed by a verb phrase followed by one or more adjectives, e.g.

"the new iPhone will be amazing", "Doritos taste cheesy", ...

5) pattern.vector: corpus tools for tf-idf, cosine similarity, vector space search and

LSA.

6) pattern.graph: for exploring graphs and semantic networks.

The package can be used and extended for harvesting online data, opinion mining,

building semantic networks using a machine learning approach, and so on.
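
A brief usage sketch based on the module inventory above is given below. Pattern 2.x targeted Python 2, and the function names shown reflect the documented API of the time; exact signatures may differ between versions.

    # A usage sketch based on the module inventory above; written for the
    # Python 2 era in which Pattern 2.x was released. Function names follow
    # the documented API of the time, but signatures may differ per version.
    from pattern.en import pluralize, singularize, conjugate, parsetree
    from pattern.search import search

    print(pluralize("leaf"))        # 'leaves'
    print(singularize("wolves"))    # 'wolf'
    print(conjugate("be", "3sg"))   # 'is'

    # pattern.search matches syntactic patterns over parsed text:
    tree = parsetree("the new iPhone will be amazing")
    for match in search("NP VP JJ", tree):
        print(match)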

Corresponding author: tom.desmedt@ua.ac.be


Semantic role labeling of gene regulation events

Morante, Roser

CLiPS, University of Antwerp

This poster describes work in progress on semantic role labeling of gene regulation events. Semantic role labeling (SRL) is a natural language processing task that consists of identifying the arguments of predicates within a sentence and assigning a semantic role to them (Màrquez et al., 2008). This task can support the extraction of relations from biomedical texts. Recent research has produced a rich variety of SRL systems to process general domain corpora. However, fewer systems have been developed to process biomedical corpora (Tzong-Han Tsai et al., 2007; Bethard et al., 2008). In this abstract, we present preliminary results of a new system that is trained on the GREC corpus (Thompson et al., 2009). The system performs argument identification and semantic role assignment in a single step, assuming gold standard event identification. We provide cross-validation and cross-domain results.

Corresponding author: roser.morante@ua.ac.be



Source Verification in Quran

Shokrollahi-Far, Mahmoud

Tilburg University

The Holy Quran was revealed either in Mecca or in Medina, and its chapters, and even individual verses, have accordingly been classified as either Meccan or Medinan. This crucial classification helps Quranic scholars in many areas, including the exegesis of the book. Among the one hundred and fourteen chapters of the Quran, thirty-two are still disputed as to whether they are Meccan or Medinan. Moreover, scholars have long disputed which features discriminate between these two classes. This paper reports a research trend that applies text classification, in this case source verification, to help resolve such Quranic disputes. For this binary TC task, classifiers have been induced by training SVM and Naive Bayes machines on tagged corpora of Quranic texts bootstrapped by Mobin, a morpho-syntactic tagger developed for Arabic. This research has not only explored the required distinctive grammatical features, but also led to a successful classification of the disputed chapters and verses as Meccan or Medinan.

Corresponding author: m.shokrollahifar@uvt.nl

Towards a language-independent data-driven compound decomposition tool

Réveil, Bert 1 and Macken, Lieve 2

1 ELIS, Ghent University

2 LT3, Language and Translation Technology Team

Compounding is a highly productive process in Dutch that poses a challenge for various NLP applications such as terminology extraction, continuous speech recognition, and automated word alignment. The present work therefore proposes a language-independent, data-driven decomposition tool that tries to segment compounds into their meaningful parts.

The basic version of this tool initially determines a list of eligible compound constituents (so-called heads and tails), relying solely on word frequency information extracted from a large text corpus. The decomposition algorithm then recursively attempts to decompose the compounds, allowing only two-part head-tail divisions in each iteration. For example, the noun 'postzegelverzamelaar' is first split into 'postzegel' + 'verzamelaar', followed by an additional decomposition of 'postzegel' into 'post' + 'zegel'.

Apart from the basic version, an extended version of the tool is assessed that uses PoS information as a means to restrict the list of possible heads and tails. The performance of both versions is evaluated in two large-scale decomposition experiments, one on the E-lex compound list and one on a word list that contains specific vocabulary from the automotive domain. As the presented decomposition tool relies only on word frequency and PoS information, it is expected that it can be easily adapted to new domains and languages.
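
A minimal sketch of the frequency-based head-tail splitting idea is given below; the frequency lexicon, thresholds and scoring are illustrative assumptions, not the authors' tool.

    # A minimal sketch of frequency-based head-tail splitting (not the
    # authors' tool): recursively split a word wherever both parts are
    # frequent enough in a corpus-derived lexicon. Frequencies are toy values.
    FREQ = {"post": 900, "zegel": 400, "postzegel": 150, "verzamelaar": 80}

    def split(word, min_len=3, min_freq=50):
        best = None
        for i in range(min_len, len(word) - min_len + 1):
            head, tail = word[:i], word[i:]
            if FREQ.get(head, 0) >= min_freq and FREQ.get(tail, 0) >= min_freq:
                score = min(FREQ[head], FREQ[tail])   # prefer the most reliable split
                if best is None or score > best[0]:
                    best = (score, head, tail)
        if best is None:
            return [word]                             # no plausible two-part division
        _, head, tail = best
        return split(head) + split(tail)

    print(split("postzegelverzamelaar"))   # ['post', 'zegel', 'verzamelaar']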

Corresponding author: breveil@elis.ugent.be


Towards improving the precision of a relation extraction system by processing negation and speculation

Van Asch, Vincent and Morante, Roser and Daelemans, Walter

CLiPS - University of Antwerp

In this poster we present BiographTA, a system that extracts biological relations from PubMed abstracts. The relation extraction system has been designed to process abstracts in which biological relations from multiple databases have been annotated automatically based on an in-sentence co-occurrence criterion. It performs a relation identification task learning from noisy data, since a proportion of the automatically annotated relations in the training corpus is incorrect. The system cannot be evaluated on noisy data; for this reason, in order to develop and evaluate the system, we gathered a corpus of PubMed abstracts annotated with the gold biological relations of the Bioinfer corpus.

Additionally, one of the text mining goals in the Biograph project is to develop techniques that allow large-scale relation extraction to start from the smallest possible amount of manually annotated data while obtaining the highest possible precision. This is why we add a module that processes negation and speculation cues to the relation extraction system. We present experiments aimed at testing whether processing the scope of negation and speculation cues results in a higher precision of the extracted relations. Results show that the negation and speculation detection module increases precision by 2.93 at the cost of decreasing recall by 0.68.

Corresponding author: Vincent.VanAsch@ua.ac.be

LIST OF PARTICIPANTS

Liesbeth Augustinus liesbeth@ccl.kuleuven.be CCL - K.U. Leuven
Daan Baldewijns v-daanb@microsoft.com Microsoft Language Development Center
Kim Bauters kim.bauters@ugent.be Ghent University
Richard Beaufort richard.beaufort@uclouvain.be UCL CENTAL
Peter Berck P.J.Berck@UvT.nl TiCC, Tilburg University
Tamas Biro t.s.biro@uva.nl ACLC, University of Amsterdam
Jelke Bloem j.bloem.3@student.rug.nl University of Groningen
Cristiano Chesi v-crches@microsoft.com Microsoft Language Development Center, Porto Salvo - ISCTE-Lisbon University Institute, Portugal
Kostadin Cholakov k.cholakov@rug.nl University of Groningen
Louise-Amélie Cougnon louiseamelie.cougnon@uclouvain.be CENTAL - IL&C, UCLouvain
Crit Cremers c.l.j.m.cremers@hum.leidenuniv.nl Leiden University
Walter Daelemans walter.daelemans@ua.ac.be CLiPS, University of Antwerp
Orphée De Clercq orphee.declercq@hogent.be LT3, University College Ghent
Martine De Cock Martine.DeCock@UGent.be Ghent University
Daniël de Kok d.j.a.de.kok@rug.nl University of Groningen
Tom De Smedt tomdesmedt@gmail.com CLiPS, Universiteit Antwerpen
Herwig De Smet herwig.desmet@kdg.be OptiFox 7th Framework Europe
Dennis de Vries dennis@gridline.nl GridLine
Saskia Debergh saskia.debergh@intersystems.com i.Know nv
Johannes Deleu johannes.deleu@intec.ugent.be IBCN, IBBT & Ghent University
Thomas Demeester thomas.demeester@ugent.be Ghent University
Bart Desmet bart.desmet@hogent.be LT3, University College Ghent
Brecht Desplanques brecht.desplanques@elis.ugent.be ELIS, Ghent University
Peter Dirix peter.dirix@nuance.com Nuance
Stephen Doherty stephen.doherty2@mail.dcu.ie Dublin City University
Marius Doornenbal m.doornenbal@elsevier.com Reed Elsevier
Frederik Durant frederik.durant@tomtom.com TomTom
Mohammad Fazleh Elahi rmf_ku@yahoo.com
Thomas François thomas.francois@uclouvain.be CENTAL, UCLouvain
Tanja Gaustad T.Gaustad@uvt.nl TiCC, Tilburg University
Olga Gordeeva ogordeeva@gmail.com Acapela Group
Kris Heylen kris.heylen@arts.kuleuven.be QLVL, K.U.Leuven
Maarten Hijzelendoorn p.m.hijzelendoorn@hum.leidenuniv.nl Leiden University
Veronique Hoste veronique.hoste@hogent.be LT3, University College Ghent
Steve Hunt s.j.hunt@tilburguniversity.nl TiCC, Tilburg University
Marc Kemps-Snijders marc.kemps.snijders@meertens.knaw.nl Meertens Instituut
Mike Kestemont mike.kestemont@ua.ac.be CLiPS, University of Antwerp
Maxim Khalilov maxkhalilov@gmail.com ILLC, University of Amsterdam
Henny Klein E.H.Klein@rug.nl University of Groningen
Cornelis H.A. Koster kees@cs.ru.nl Radboud Universiteit Nijmegen
Gideon Kotzé g.j.kotze@rug.nl University of Groningen
Mark Kroon mark.kroon@actonomy.com Actonomy
Reinier Lamers lamers@textkernel.nl Textkernel
Els Lefever els.lefever@hogent.be LT3, University College Ghent
Anna Lobanova a.lobanova@ai.rug.nl AI, University of Groningen
Alessandro Lopopolo A.Lopopolo@student.uva.nl ACLC, University of Amsterdam
Kim Luyckx kim.luyckx@ua.ac.be CLiPS, University of Antwerp
Lieve Macken lieve.macken@hogent.be LT3, University College Ghent
Gideon Maillette de Buy Wenniger gemdbw@gmail.com ILLC, University of Amsterdam
Véronique Malaisé vmalaise@vu.nl VU University Amsterdam
Jean-Luc Manguin jean-luc.manguin@unicaen.fr CNRS - Université de Caen
Eliza Margaretha e.margaretha@student.rug.nl University of Groningen
Thomas Markus Thomas.Markus@phil.uu.nl Utrecht University
Scott Martens scott@ccl.kuleuven.be CCL, K.U.Leuven
Dieneke Meijer dieneke.meijer@agentschapnl.nl Agentschap NL / STEVIN
Sien Moens sien.moens@cs.kuleuven.be K.U.Leuven CW
Paola Monachesi P.Monachesi@uu.nl Utrecht University
Roser Morante roser.morante@ua.ac.be CLiPS, University of Antwerp
Peter Nabende p.nabende@rug.nl University of Groningen
Fabrice Nauze fabrice.nauze@rightnow.com Q-go / Rightnow
John Nerbonne j.nerbonne@rug.nl University of Groningen
Jan Odijk j.odijk@uu.nl UiL-OTS Universiteit Utrecht
Leequisach Panjaitan leequisach.panjaitan@yahoo.com
Claudia Peersman claudia.peersman@ua.ac.be CLiPS, University of Antwerp & Artesis
Suléne Pilon sulene.pilon@nwu.ac.za North-West University (VTC)
Barbara Plank b.plank@rug.nl University of Groningen
Massimo Poesio poesio@essex.ac.uk University of Essex
Bert Réveil breveil@elis.ugent.be DSSP, ELIS, Ghent University
Mihai Rotaru rotaru@textkernel.nl Textkernel
Nicholas Ruiz nicholas.ruiz@gmail.com University of Groningen
Marijn Schraagen schraage@liacs.nl Leiden University
Louise Schubotz louise_schubotz@gmx.de Radboud University Nijmegen
Ineke Schuurman ineke.schuurman@ccl.kuleuven.be CCL, K.U.Leuven
Roxane Segers r.h.segers@vu.nl VU University Amsterdam
Binyam Seyoum binephrem@gmail.com Addis Ababa University
Margaux Smets margauxsmets@gmail.com QLVL, K.U.Leuven
Martijn Spitters spitters@textkernel.nl Textkernel
Peter Spyns pspyns@taalunie.org Nederlandse Taalunie
Tim Stokman timstokman@gmail.com Textkernel
Dries Tanghe dwiesje@hotmail.com LT3, University College Ghent
Tristan Thomas Teunissen tristan@w3lab.nl BA
Daphne Theijssen d.theijssen@let.ru.nl Radboud University Nijmegen
Erik Tjong Kim Sang erikt@xs4all.nl University of Groningen
Fabian Triefenbach fabian.triefenbach@elis.ugent.be ELIS, Ghent University
Frederik Vaassen frederik.vaassen@ua.ac.be CLiPS, University of Antwerp
Michaël Vallée m.vallee@yahoo.fr EDC Paris
Vincent Van Asch Vincent.VanAsch@ua.ac.be CLiPS, University of Antwerp
Matje van de Camp M.M.v.d.Camp@uvt.nl TiCC, Tilburg University
Tim Van de Cruys tv234@cam.ac.uk University of Cambridge
Marjan Van de Kauter marjan.vandekauter@hogent.be LT3, University College Ghent
Anne van de Wetering amvdwetering@hotmail.com University of Groningen
Joachim Van den Bogaert joachim@ccl.kuleuven.be CCL - K.U. Leuven
Antal van den Bosch Antal.vdnBosch@uvt.nl TiCC, Tilburg University
Leonoor van der Beek leonoor.vanderbeek@rightnow.com Q-go / Rightnow
Marieke van Erp Marieke@cs.vu.nl VU University Amsterdam
Frank Van Eynde frank.vaneynde@ccl.kuleuven.be CCL, K.U.Leuven
Maarten van Gompel proycon@anaproy.nl TiCC, Tilburg University
Hans van Halteren hvh@let.ru.nl Radboud University Nijmegen
Gertjan van Noord g.j.m.van.noord@rug.nl University of Groningen
Philip van Oosten philip.vanoosten@hogent.be LT3, University College Ghent
Menno van Zaanen mvzaanen@uvt.nl TiCC, Tilburg University
Tom Vanallemeersch tallem@ccl.kuleuven.be CCL, K.U.Leuven
Vincent Vandeghinste vincent@ccl.kuleuven.be CCL, K.U.Leuven
Klaar Vanopstal klaar.vanopstal@hogent.be LT3, University College Ghent
Kateryna Vasylenko Katyaknu1986@mail.ru Nijmegen University
Peter Velaerts peter.velaerts@hogent.be LT3, University College Ghent
Suzan Verberne s.verberne@let.ru.nl Radboud University Nijmegen
Reinder Verlinde R.Verlinde@Elsevier.com Elsevier
Yuliya Vladimirova vladimirova83@gmail.com
Tim Wauters tim.wauters@intec.ugent.be IBBT & Ghent University
Edgar Weiffenbach s1422022@student.rug.nl CLCG, University of Groningen
Eline Westerhout elinewesterhout@gmail.com Utrecht University
Thomas Wielfaert thomas.wielfaert@ugent.be
Martijn Wieling m.b.wieling@rug.nl University of Groningen
Sander Wubben s.wubben@uvt.nl TiCC, Tilburg University
Jakub Zavrel zavrel@textkernel.nl Textkernel
