A Treebank-based Investigation of IPP-triggering Verbs in Dutch
Also, it interacts with external resources and NLP tools. It is well suited for dependency
graphs, but its visualization does not seem to display constituent trees or
(recursive) feature-value graphs, nor does it seem to support parse selection.
In the INESS project, we have probably been the first to develop a fully web-based
infrastructure for treebanking. This approach enables annotation as a distributed
effort without any installation on the annotators’ side and offers immediate
online exploration of treebanks from anywhere. In earlier publications, various
aspects of this infrastructure have been presented, but specific interface aspects of
the annotation process have not been systematically described.
In the remainder of this paper, we describe some aspects of the annotation
interface that we have implemented. Section 2 provides an overview of the INESS
project and reviews some of the web-based annotation features that have been described
in earlier publications. After that we present new developments: in section
3 we discuss the preprocessing of texts to be parsed for the Norwegian treebank, and
in section 4 we present the integrated issue tracking system.
2 The INESS treebanking infrastructure
INESS, the Norwegian Infrastructure for the Exploration of Syntax and Semantics,
is a project cofunded by the Norwegian Research Council and the University of
Bergen. It is aimed at providing a virtual eScience laboratory for linguistic research
[19]. It provides an environment for both annotation and exploration and runs on
its own HPC cluster.
One of the missions of the INESS project is to host treebanks for many different
languages (currently 24), annotated in various formalisms, in a unified, accessible
system on the web. Some treebanks are currently only small test suites, while
others are quite large, for instance the TIGER treebank, which consists of about
50,000 sentences of German newspaper text. There are dependency treebanks, constituency
treebanks, and LFG treebanks; furthermore, a number of parallel treebanks
are available, annotated at sentence level and experimentally at phrase level [10].
We have implemented middleware for online visualization of various kinds of structures
and for powerful search functionality in these treebanks [14].
The second mission of the INESS project is to develop the first large treebank
for Norwegian. This treebank is being built by automatic parsing with an LFG grammar,
NorGram [8, 9], on the XLE platform [13]. An LFG grammar produces two
parallel levels of syntactic analysis; the c(onstituent)-structure is a phrase structure
tree, while the f(unctional)-structure is an attribute–value matrix with information
about grammatical features and functions [4].
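To make the two levels concrete, the following minimal Python sketch represents a c-structure as a labelled tree and an f-structure as nested attribute-value pairs. The English example sentence, the class `Node`, the helpers `terminal_yield` and `lookup`, and the feature names are illustrative assumptions; they are not taken from NorGram or from XLE output.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A c-structure node: a category label with child nodes or words."""
    label: str
    children: list = field(default_factory=list)  # Node or str entries


def terminal_yield(node):
    """Return the words at the leaves of the tree, left to right."""
    words = []
    for child in node.children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(terminal_yield(child))
    return words


def lookup(fs, path):
    """Follow a path of attributes through a nested f-structure."""
    value = fs
    for attr in path:
        value = value[attr]
    return value


# Hypothetical c-structure for "Mary sleeps".
c_structure = Node("S", [
    Node("NP", [Node("N", ["Mary"])]),
    Node("VP", [Node("V", ["sleeps"])]),
])

# Hypothetical f-structure for the same sentence; values are either
# atomic features or embedded f-structures (here the subject).
f_structure = {
    "PRED": "sleep<SUBJ>",
    "TENSE": "pres",
    "SUBJ": {"PRED": "Mary", "NUM": "sg", "PERS": 3},
}

print(terminal_yield(c_structure))
print(lookup(f_structure, ["SUBJ", "NUM"]))
```

The tree encodes word order and constituency, while the nested dictionary encodes grammatical functions and features independently of surface order.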
Deep analysis with a large lexicon and a large grammar can often lead to massive
ambiguity; a sentence may have hundreds or thousands of possible analyses.
Therefore, the LFG Parsebanker was developed in the LOGON and TREPIL
projects to enable efficient semiautomatic disambiguation [17, 21]. We have developed
discriminants for LFG analyses, as described earlier [20].
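The general idea of discriminant-based disambiguation can be sketched as follows. This is a generic illustration and not the LFG Parsebanker implementation: each analysis is reduced to a set of elementary properties, a discriminant is a property that holds for some but not all remaining analyses, and each annotator decision on a discriminant filters the set of analyses. The example sentence, the property strings and the function names are assumptions made for this sketch.

```python
def discriminants(analyses):
    """Properties that distinguish analyses: present in some, not in all."""
    all_props = set().union(*analyses.values())
    shared = set.intersection(*analyses.values())
    return all_props - shared


def apply_decision(analyses, prop, accept):
    """Keep analyses that agree with the annotator's decision on prop."""
    return {
        name: props
        for name, props in analyses.items()
        if (prop in props) == accept
    }


# Hypothetical analyses of "I saw the man with the telescope",
# differing only in where the prepositional phrase attaches.
analyses = {
    "a1": {"SUBJ(see)=I", "OBJ(see)=man", "ADJUNCT(see)=with"},
    "a2": {"SUBJ(see)=I", "OBJ(see)=man", "ADJUNCT(man)=with"},
}

print(sorted(discriminants(analyses)))

# The annotator accepts the reading in which the PP modifies "man".
remaining = apply_decision(analyses, "ADJUNCT(man)=with", accept=True)
print(sorted(remaining))
```

Because properties shared by all analyses are never offered as discriminants, the annotator only decides on points where the analyses differ, which is what makes this kind of disambiguation efficient when a sentence has many analyses.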
In our ongoing annotation work with the LFG Parsebanker, we are basically
developing the grammar, the lexicon and a gold standard treebank in tandem [18,