A Treebank-based Investigation of IPP-triggering Verbs in Dutch


methods. When combined with supervised machine learning methods, such richly annotated language resources including treebanks play a key role in modern computational linguistics. The availability of large-scale treebanks in recent years has contributed to the blossoming of data-driven approaches to robust and practical parsing. On the other hand, the creation of detailed and consistent syntactic annotations on a large scale turns out to be a challenging task.¹ From the choice of the appropriate linguistic framework and the design of the annotation scheme to the choice of the text source and the working protocols on the synchronization of the parallel development, as well as quality assurance, each of the steps in the entire annotation procedure presents non-trivial challenges that can impede the successful production of such resources.

The aim of the DeepBank project is to overcome some of the limitations and shortcomings which are inherent in manual corpus annotation efforts, such as the German Negra/Tiger Treebank ([2]), the Prague Dependency Treebank ([11]), and the TüBa-D/Z.² All of these have stimulated research in various sub-fields of computational linguistics where corpus-based empirical methods are used, but at a high cost of development and with limits on the level of detail in the syntactic and semantic annotations that can be consistently sustained. The central difference in the DeepBank approach is to adopt the dynamic treebanking methodology of Redwoods [18], which uses a grammar to produce full candidate analyses, and has human annotators disambiguate to identify and record the correct analyses, with the disambiguation choices recorded at the granularity of constituent words and phrases. This localized disambiguation enables the treebank annotations to be repeatedly refined by making corrections and improvements to the grammar, with the changes then projected throughout the treebank by reparsing the corpus and re-applying the disambiguation choices, with a relatively small number of new disambiguation choices left for manual disambiguation.

For the English DeepBank annotation task, we make extensive use of resources in the DELPH-IN repository,³ including the PET unification-based parser ([4]), the ERG plus a regular-expression preprocessor ([1]), the LKB grammar development platform ([7]), and the [incr tsdb()] competence and performance profiling system ([17]), which includes the treebanking tools used for disambiguation and inspection. Using these resources, the task of treebank construction shifts from a labor-intensive task of drawing trees from scratch to a more intelligence-demanding task of choosing among candidate analyses to either arrive at the desired analysis or reject all candidates as ill-formed.

The DeepBank approach should be differentiated from so-called treebank conversion approaches, which derive a new treebank from another already existing one, such as the Penn Treebank, mapping from one format to another, and often from one linguistic framework to another, adapting and often enriching the annotations semi-automatically. In contrast, the English DeepBank resource is constructed by taking as input only the original ‘raw’ WSJ text, sentence-segmented to align with the segmentation in the PTB for ease of comparison, but making no reference to any of the PTB annotations, so that we maintain a fully independent annotation pipeline, important for later evaluation of the quality of our annotations over held-out sections.

2 DeepBank

The process of DeepBank annotation of the Wall Street Journal corpus is organised into iterations of a cycle of parsing, treebanking, error analysis and grammar/treebank updates, with the goal of maximizing the accuracy of annotation through successive refinement.

Parsing

Each section of the WSJ corpus is first parsed with the PET parser using the ERG, with lexical entries for unknown words added on the fly based on a conventional part-of-speech tagger, TnT [3]. Analyses are ranked using a maximum-entropy model built using the TADM [15] package, originally trained on out-of-domain treebanked data, and later improved in accuracy for this task by including a portion of the emerging DeepBank itself for training data. A maximum of 500 highest-ranking analyses are recorded for each sentence, with this limit motivated both by practical constraints on data storage costs for each parse forest and by the processing capacity of the [incr tsdb()] treebanking tool. The existing parse-ranking model has proven to be accurate enough to ensure that the desired analysis is almost always in these top 500 readings if it is licensed by the grammar at all.

For each analysis in each parse forest, we record the exact derivation tree, which identifies the specific lexical entries and the lexical and syntactic rules applied to license that analysis, comprising a complete ‘recipe’ sufficient to reconstruct the full feature structure given the relevant version of the grammar. This approach enables relatively efficient storage of each parse forest without any loss of detail.

Treebanking

For each sentence of the corpus, the parsing results are then manually disambiguated by the human annotators, using the [incr tsdb()] treebanking tool which presents the annotator with a set of binary decisions, called discriminants, on the inclusion or exclusion of candidate lexical or phrasal elements for the desired analysis. This discriminant-based approach of [6] enables rapid reduction of the parse forest to either the single desired analysis, or to rejection of the whole forest for sentences where the grammar has failed to propose a viable analysis.⁴ On average, given n candidate trees, log₂ n decisions are needed in order to

¹ Besides [18], which we draw more on for the remainder of the paper, similar work has been done in the HPSG framework for Dutch [22]. Moreover, there is quite a lot of related research in the LFG community, e.g., in the context of the ParGram project: [9] for German, [14] for English, and the (related) Trepil project, e.g., [20] for Norwegian.
² http://www.sfs.nphil.uni-tuebingen.de/en_tuebadz.shtml
³ http://www.delph-in.net
⁴ For some sentences, an annotator may be unsure about the correctness of the best available analysis, in which case the analysis can still be recorded in the treebank, but with a lower ‘confidence’
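
The discriminant-based disambiguation described in the Treebanking paragraph can be illustrated with a small sketch. This is not the [incr tsdb()] implementation. It is a toy model under stated assumptions: each candidate analysis is represented as a plain set of property strings, a discriminant is any property found in some but not all of the remaining analyses, and the annotator is simulated by a hypothetical `oracle` function that answers yes or no for a given property. The names `discriminants`, `decide`, `disambiguate`, and the `bitK=V` property labels are invented for illustration. Rejection of the whole forest is not modelled.

```python
import math


def discriminants(forest):
    """Return the properties held by some, but not all, remaining analyses."""
    all_props = set().union(*forest.values())
    return {
        p for p in all_props
        if 0 < sum(p in props for props in forest.values()) < len(forest)
    }


def decide(forest, prop, accept):
    """Apply one binary decision: keep analyses with prop (accept) or without it."""
    return {name: props for name, props in forest.items()
            if (prop in props) == accept}


def disambiguate(forest, oracle):
    """Ask the oracle about discriminants until one analysis is left.

    Returns the remaining analyses and the number of decisions made.
    """
    decisions = 0
    while len(forest) > 1:
        candidates = discriminants(forest)
        if not candidates:
            break  # the remaining analyses cannot be told apart
        # Prefer the discriminant that splits the forest most evenly.
        prop = min(
            sorted(candidates),
            key=lambda p: abs(
                2 * sum(p in props for props in forest.values()) - len(forest)
            ),
        )
        forest = decide(forest, prop, oracle(prop))
        decisions += 1
    return forest, decisions


# Toy forest (an assumption): 8 analyses, each described by 3 binary properties.
forest = {
    f"t{i}": {f"bit{k}={(i >> k) & 1}" for k in range(3)}
    for i in range(8)
}
gold = forest["t5"]  # the analysis the simulated annotator wants

remaining, n_decisions = disambiguate(forest, lambda prop: prop in gold)
print(sorted(remaining), n_decisions)  # ['t5'] 3
```

In this toy forest every discriminant splits the remaining analyses in half, so 8 candidates are reduced to one in 3 decisions, matching log₂ 8. Real parse forests are less evenly structured, which is why the source states the figure as an average.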
