06.07.2014 Views

A Treebank-based Investigation of IPP-triggering Verbs in Dutch


Also, it interacts with external resources and NLP tools. It is well suited for dependency graphs, but its visualization does not seem to display constituent trees or (recursive) feature-value graphs, nor does it seem to support parse selection.

In the INESS project, we have probably been the first to develop a fully web-based infrastructure for treebanking. This approach enables annotation as a distributed effort without any installation on the annotators’ side and offers immediate online exploration of treebanks from anywhere. In earlier publications, various aspects of this infrastructure have been presented, but specific interface aspects of the annotation process have not been systematically described.

In the remainder of this paper, we will describe some aspects of the annotation interface that we have implemented. Section 2 provides an overview of the INESS project and reviews some of the web-based annotation features that have been described in earlier publications. After that we present new developments: in section 3 we discuss the preprocessing of texts to be parsed in the Norwegian treebank, and in section 4 the integrated issue tracking system is presented.

2 The INESS treebanking infrastructure

INESS, the Norwegian Infrastructure for the Exploration of Syntax and Semantics, is a project cofunded by the Norwegian Research Council and the University of Bergen. It is aimed at providing a virtual eScience laboratory for linguistic research [19]. It provides an environment for both annotation and exploration and runs on its own HPC cluster.

One of the missions of the INESS project is to host treebanks for many different languages (currently 24), annotated in various formalisms, in a unified, accessible system on the web. Some treebanks are currently only small test suites, while others are quite large, for instance the TIGER treebank, which consists of about 50,000 sentences of German newspaper text. There are dependency treebanks, constituency treebanks, and LFG treebanks; furthermore, a number of parallel treebanks are available, annotated at sentence level and experimentally at phrase level [10]. We have implemented middleware for online visualization of various kinds of structures and for powerful search functionality in these treebanks [14].

The second mission of the INESS project is to develop the first large treebank for Norwegian. This treebank is being built by automatic parsing with an LFG grammar, NorGram [8, 9], on the XLE platform [13]. An LFG grammar produces two parallel levels of syntactic analysis; the c(onstituent)-structure is a phrase structure tree, while the f(unctional)-structure is an attribute–value matrix with information about grammatical features and functions [4].
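As an illustration only, the two levels of analysis can be pictured with a minimal Python sketch. The English toy sentence, the class `CNode`, the helper `get_path` and all feature names and values below are assumptions made for this example; they are not NorGram output and do not reflect the XLE data format.

```python
from dataclasses import dataclass, field


@dataclass
class CNode:
    """A node in a toy c-structure (phrase structure tree)."""
    label: str
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the terminal labels from left to right."""
        if not self.children:
            return [self.label]
        words = []
        for child in self.children:
            words.extend(child.leaves())
        return words


# Hypothetical c-structure for the toy sentence "Mary sleeps".
c_structure = CNode("S", [
    CNode("NP", [CNode("N", [CNode("Mary")])]),
    CNode("VP", [CNode("V", [CNode("sleeps")])]),
])

# Hypothetical f-structure: an attribute-value matrix as a nested dict.
# Grammatical functions (SUBJ) have f-structures as values;
# grammatical features (TENSE, NUM) have atomic values.
f_structure = {
    "PRED": "sleep<SUBJ>",
    "TENSE": "pres",
    "SUBJ": {"PRED": "Mary", "NUM": "sg"},
}


def get_path(fs, path):
    """Follow a sequence of attributes through a nested f-structure."""
    for attr in path:
        fs = fs[attr]
    return fs
```

In this sketch, `c_structure.leaves()` recovers the word sequence from the tree, while `get_path(f_structure, ["SUBJ", "NUM"])` looks up a feature of the subject without reference to word order or constituency.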

Deep analysis with a large lexicon and a large grammar can often lead to massive ambiguity; a sentence may have hundreds or thousands of possible analyses. Therefore, the LFG Parsebanker was developed in the LOGON and TREPIL projects to enable efficient semiautomatic disambiguation [17, 21]. We have developed discriminants for LFG analyses, as described earlier [20].
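The general idea of discriminant-based disambiguation can be sketched as follows, assuming that each candidate analysis is reduced to a set of elementary properties. This is a simplified illustration, not the LFG Parsebanker's implementation; the function names and the property strings are hypothetical.

```python
def discriminants(analyses):
    """Return the properties that hold for some, but not all, analyses.

    `analyses` maps an analysis name to its set of properties.
    """
    if not analyses:
        return set()
    property_sets = list(analyses.values())
    all_props = set().union(*property_sets)
    shared = set.intersection(*property_sets)
    return all_props - shared


def decide(analyses, prop, accept):
    """Keep only the analyses consistent with one annotator decision.

    If `accept` is True, analyses lacking `prop` are discarded;
    if False, analyses containing `prop` are discarded.
    """
    return {name: props for name, props in analyses.items()
            if (prop in props) == accept}


# Three hypothetical analyses of one ambiguous sentence.
analyses = {
    "a1": {"PP-attach=VP", "saw=V"},
    "a2": {"PP-attach=NP", "saw=V"},
    "a3": {"PP-attach=NP", "saw=N"},
}

# The annotator accepts NP attachment, which rules out a1.
remaining = decide(analyses, "PP-attach=NP", True)
# The annotator then rejects the noun reading, which rules out a3.
remaining = decide(remaining, "saw=N", False)
```

Each decision on a single discriminant can eliminate many analyses at once, so in a setup like this the annotator would not need to inspect the full analyses one by one.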

In our ongoing annotation work with the LFG Parsebanker, we are basically developing the grammar, the lexicon and a gold standard treebank in tandem [18,
