A Treebank-based Investigation of IPP-triggering Verbs in Dutch


statistical parsers. Since data sparsity is the biggest challenge in data-oriented parsing, parsers perform poorly when they are trained on a small data set or when the domains of the training and test data differ. Brown et al. [2] pioneered the use of word clustering for language modeling. Later, word clustering was widely used in various NLP applications, including parsing [3, 7, 9]. Brown word clustering provides a coarse level of word representation in which words that are similar to each other are assigned to the same cluster. In this type of annotation, the corresponding clusters are used instead of the words themselves.

POS Tag: The syntactic functions of words at the sentence level are the most basic linguistic knowledge that the parser learns; therefore, they play a very important role in the performance of a parser. The quality of the POS tags assigned to the words and the amount of information they contain have a direct effect on parser performance. This knowledge can be represented either coarsely, such as Noun, Verb, Adjective, etc., or finely, to include morpho-syntactic and semantic information, such as Noun-Single, Verb-Past, Adjective-superlative, etc. A fine-grained representation increases the size of the tag set and makes it harder for a statistical POS tagger to disambiguate the correct labels.

Phrase Structure: Depending on the formalism used as the backbone of a treebank, the labels of the nodes at the phrasal level can be either fine- or coarse-grained. The annotated data in the Penn Treebank [11] provides relatively coarse-grained phrase structures in which mostly the types of the phrasal constituents, such as NP, VP, etc., are determined. Admittedly, the latest version of the Penn Treebank also adds syntactic functions to the labels, but this information is not available for all nodes.
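The difference between fine- and coarse-grained labels described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a fine-grained label is a main category followed by hyphen-separated refinements, as in the examples Noun-Single or Verb-Past from the text and in Penn Treebank function-tagged labels such as NP-SBJ. The function name `coarsen_label` is hypothetical, and this is not the procedure used to modify the treebank in the experiments.

```python
def coarsen_label(label: str, sep: str = "-") -> str:
    """Return the main category of a fine-grained label.

    Assumption: the main category is everything before the first
    separator. Labels that start with the separator (for example
    "-NONE-" in Penn-style data) are returned unchanged.
    """
    if label.startswith(sep):
        return label
    return label.split(sep, 1)[0]


# Fine-grained POS tags taken from the examples in the text.
fine_pos = ["Noun-Single", "Verb-Past", "Adjective-superlative"]
coarse_pos = [coarsen_label(tag) for tag in fine_pos]

# Penn-style phrase labels with optional function tags.
fine_phrases = ["NP-SBJ", "PP-LOC", "VP"]
coarse_phrases = [coarsen_label(label) for label in fine_phrases]

print(coarse_pos)
print(coarse_phrases)
```

Mapping several fine labels onto one coarse label shrinks the label set, which is the trade-off discussed above: less information per label, but fewer labels for the tagger or parser to disambiguate.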
In contrast, annotated data of deep formalisms such as HPSG provides a fine-grained representation of phrase structures, since the types of head-daughter dependencies are defined explicitly for all nodes, as in the Bulgarian treebank [15] and the Persian treebank [6]. Representing dependency information on constituents makes it harder for the parser to disambiguate the types of the dependencies, because the size of the constituent label set increases.

3 Evaluation

3.1 Data Set and Tool

The Bijankhan Corpus¹ contains more than 2.5 million word tokens and is manually POS tagged with a rich set of 586 tags containing morpho-syntactic and semantic information [1]; the assigned tags are organized hierarchically, following the EAGLES guidelines [10]. The tag set has 14 main syntactic categories, which increases to 15 when clitics are split from their hosts.

The Persian Treebank (PerTreeBank)² is a treebank for Persian developed in the framework of HPSG [13] such that the basic properties of HPSG are simulated but without feature structures. This treebank contains 1016 trees with 27659 word tokens from the Bijankhan Corpus, and it was developed semi-automatically via a bootstrapping approach [6]. In this treebank, the types of head-daughter dependencies are defined according to the basic HPSG schemas, namely head-subject, head-complement, head-adjunct, and head-filler; it is therefore a hierarchical treebank that represents subcategorization requirements. Moreover, traces for scrambled or extraposed elements and empty nodes for ellipses are explicitly marked.

¹ http://ece.ut.ac.ir/dbrg/bijankhan/
² http://hpsg.fu-berlin.de/~ghayoomi/PTB.html

The Stanford Parser is a Java implementation of a lexicalized, probabilistic natural language parser [8]. It comprises an optimized probabilistic context-free grammar (PCFG) parser, lexicalized dependency parsers, and a lexicalized PCFG parser. Following Collins [5], heads must be provided to the parser; this was done semi-automatically for Persian based on the labels of the head-daughters. The parsing results are evaluated with Evalb³, which reports the standard bracketing metrics: precision, recall, and F-measure.

The SRILM Toolkit contains an implementation of the Brown word clustering algorithm [2], and it is used to cluster the lexical items in the Bijankhan Corpus [16].

3.2 Setup of the Experiments

Section 2 described the three possible annotation dimensions for parsing. In the following, we describe the data preparation and the setup of our experiments to study the effect of each dimension’s annotation on parsing performance. Besides preparing the PerTreeBank in the Penn Treebank style explained in Ghayoomi [7], we modified the treebank in three directions for our experiments. The SRILM toolkit is used to cluster the words in the Bijankhan Corpus.
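Once the words are clustered, the lexical dimension can be coarsened by replacing each word with its cluster. The following is a minimal sketch of that substitution. The two-column "word cluster" input format, the cluster names, the English toy words, and the `C_UNK` fallback for unseen words are all assumptions made for illustration; they are not SRILM's output format and not the exact procedure used in the experiments.

```python
def load_clusters(lines):
    """Build a word-to-cluster mapping from lines of the form 'word cluster'.

    The two-column format is an assumption for this sketch.
    Lines with fewer than two fields are skipped.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def to_clusters(tokens, mapping, unknown="C_UNK"):
    """Replace every token with its cluster id, or a fallback if unseen."""
    return [mapping.get(token, unknown) for token in tokens]


# Toy cluster assignments (hypothetical, not from the Bijankhan Corpus).
cluster_lines = [
    "book C12",
    "letter C12",
    "read C7",
    "wrote C7",
]
mapping = load_clusters(cluster_lines)

sentence = ["she", "read", "the", "letter"]
print(to_clusters(sentence, mapping))
```

Because "book" and "letter" share a cluster in this toy mapping, a parser trained on the cluster ids treats them identically, which is how clustering is meant to mitigate data sparsity.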
Since the Brown algorithm requires a predefined number of clusters, we set the number of clusters to 700, based on the extended model of the Brown algorithm proposed by Ghayoomi [7] to treat homographs distinctly. To provide a coarse-grained representation of the morpho-syntactic information of the words, only the main POS tags of the words (the 15 tags) are used instead of all 586 tags. To provide simple head-daughter relations as coarse-grained phrase structures, only the types of dependencies on adjunct-daughters and complement-daughters, as well as the types of clauses on marker-daughters, are removed, without any changes to the other head-daughter relations.

In each model, we train the Stanford parser with the Persian data. Since PerTreeBank is currently the only available annotated data set for Persian with constituents, this data set is used for both training and testing. 10-fold cross-validation is performed to avoid any overlap between the two data sets.

3.3 Results

In the first step of our experiments, we trained the Stanford parser with PerTreeBank without any changes to the annotation granularities (Model 1) and took it as the baseline model. To further study the effect of each annotation dimension,

³ http://nlp.cs.nyu.edu/evalb/
