A Treebank-based Investigation of IPP-triggering Verbs in Dutch

More documents

Recommendations

Info

Portuguese Component A first step in the construction of the Portuguese part of the ParDeepBank consisted in obtaining a corpus of raw sentences that is parallel to the WSJ corpus, on which the PTB is based. To this end, the WSJ text was translated from English into Portuguese. This translation was performed by two professional translators. Each portion of the corpus that was translated by one of the translators was subsequently reviewed by the other translator in order to double-check their outcome and enforce consistency among the translators. Given that the original English corpus results from the gathering of newspaper texts, more specifically from the Wall Street Journal, a newspaper specialized on economic and business matters, the translators were instructed to perform the translation as if the result of their work was to be published in a Portuguese newspaper of a similar type, suitable to be read by native speakers of Portuguese. A second recommendation was that each English sentence should be translated into a Portuguese sentence if possible and if that would not clash with the first recommendation concerning the “naturalness” of the outcome. As the Portuguese corpus was obtained, it entered a process of dynamic annotation, analogous to the one applied to the Redwoods treebank. With the support of the annotation environment [incr tsdb()] [21], the Portuguese Resource Grammar LXGram was used to support the association of sentences with their deep grammatical representation. For each sentence the grammar provides a parse forest; the correct parse if available, can be selected and stored by deciding on a number of binary semantic discriminants that differentiate the different parses in the respective parse forest. The translation of the WSJ corpus into Portuguese is completed, and at the time of writing the present paper, only a portion of these sentences had been parsed and annotated. While this is an ongoing endeavor, at present the Portuguese part of the ParDeepBank includes more than 1,000 sentences. The sentences are treebanked resorting to the annotation methodology that has been deemed in the literature as better ensuring the reliability of the dataset produced. They are submitted to a process of double blind annotation followed by adjudication. Two different annotators annotate each sentence in an independent way, without having access to each other’s decisions. For those sentences over whose annotation they happen to disagree, a third element of the annotation team, the adjudicator, decides which one of the two different grammatical representations for the same sentence, if any, is the suitable one to be stored. The annotation team consists of experts graduated in Linguistics or Language studies, specifically hired to perform the annotation task on a full time basis. Bulgarian Component Bulgarian part of ParDeepBank was produced in a similar way to the Portuguese part. First, translations of WSJ texts were performed in two steps. During the first step the text was translated by one translator. We could not afford professional translators. We have hired three students in translation studies(one PhD student and two master students studying at English Department of the 104
Sofia University). Each sentence was translated by one of them. The distribution of sentences included whole articles. The reason for this is the fact that one of the main problems during the translation turned out to be the named entities in the text. Bulgarian news tradition changed a lot in the last decades moving from transliteration of foreign names to acceptance of some names in their original form. Because of the absence of strict rules, we have asked the translators to do search over the web for existing transliteration of a given name in the original text. If such did not exist, the translator had two possibilities: (1) to transliterate the name according to its pronunciation in English; or (2) to keep the original form in English. The first option was mainly used for people and location names. The second was more appropriate for acronyms. Translating a whole article ensured that the names have been handled in the same way. Additionally, the translators had to translate each original sentence into just one sentence in Bulgarian, in spite of its complex structure. The second phase of the translation process is the editing of the translations by a professional editor. The idea is the editor to ensure the “naturalness” of the text. The editor also has a very good knowledge of English. At the moment we have translated section 00 to 03. The treebanking is done in two complementary ways. First, the sentences are processed by BURGER. If the sentence is parsed, then the resulting parses are loaded in the environment [incrs tsdb()] and the selection of the correct analysis is done similarly to English and Portuguese cases. If the sentence cannot be processed by BURGER or all analyses produced by BURGER are not acceptable, then the sentence is processed by the Bulgarian language pipeline. It always produces some analysis, but in some cases it contains errors. All the results from the pipeline are manually checked via the visualization tool within the CLaRK system. After the corrections have been repaired, the MRS analyzer is applied. For the creation of the current version of ParDeepBank we have concentrated on the intersection between the English DeepBank and Portuguese DeepBank. At the moment we have processed 328 sentences from this intersection. 5 Dynamic Parallel Treebanking Having dynamic treebanks as components of ParDeepBank we cannot rely on fixed parallelism on the level of syntactic and semantic structures. They are subject to change during the further development of the resource grammars for the three languages. Thus, similarly to [28] we rely on alignment done on the sentence and word levels. The sentence level alignment is ensured by the mechanisms of translation of the original data from English to Portuguese and Bulgarian. The word level could be done in different ways as described in this section. For Bulgarian we are following the word level alignment guidelines presented in [29] and [30]. The rules follow the guidelines for segmentation of BulTreeBank. This ensures a good interaction between the word level alignment and the parsing mechanism for Bulgarian. The rules are tested by manual annotation of several 105
Page 1 and 2:
A Treebank-based Investigation of I
Page 3 and 4:
participle and a (te-)infinitival c
Page 5 and 6:
Some verbs occur twice, since they
Page 7 and 8:
Profiling Feature Selection for Nam
Page 9 and 10:
prepositional objects (FOPP, OPP).
Page 11 and 12:
the limited size of annotated data
Page 13 and 14:
with high precision typically have
Page 15 and 16:
‘This was “not significantly”
Page 17 and 18:
The preposition durch typically has
Page 19 and 20:
Non-Projective Structures in Indian
Page 21 and 22:
the sequential order of nodes in a
Page 23 and 24:
extra-posed relative clause that ge
Page 25 and 26:
Experiments on Dependency Parsing o
Page 27 and 28:
for mitigating the effects of spars
Page 29 and 30:
obtained with MALTParser is 76.61%
Page 31 and 32:
as a standardised serialisation for
Page 33 and 34:
constituency and dependency, possib
Page 35 and 36:
SynAF and/or in . However, they sha
Page 37 and 38:
shows how some elements that are no
Page 39 and 40:
Example
Page 41 and 42:
‏ Example 8: represent
Page 43 and 44:

Page 45 and 46:
Chinese) as the direct object NP.
Page 47 and 48: Example 15: Tokens and Word Forms
Page 49 and 50: In a second experiment, a dataset w
Page 51 and 52:
Page 53 and 54: [3] Leech G. N., Barnett, R. & Kahr
Page 55 and 56: Effectively long-distance dependenc
Page 57 and 58: In French, another case of eLDD is
Page 59 and 60: elativization, it-clefts or questio
Page 61 and 62: 4.2.3 Annotation methodology Becaus
Page 63 and 64: Number of occurrences in FTB +SEQTB
Page 65 and 66: producing treebank based LFG approx
Page 67 and 68: Logical Form Representation for Lin
Page 69 and 70: gerundives for a total amount of so
Page 71 and 72: object+indirect object/object one.
Page 73 and 74: (VP (VB patch) ) ) ) (VP (VBZ is) (
Page 75 and 76: Types Adverb. Adject. Verbs Nouns T
Page 77 and 78: Eventually we may comment that ther
Page 79 and 80: DeepBank: A Dynamically Annotated T
Page 81 and 82: from another already existing one,
Page 83 and 84: to parse now does. The extra manual
Page 85 and 86: In the derivation show in Figure 1,
Page 87 and 88: 4 Patching Coverage Gaps with An Ap
Page 89 and 90: will enable further improvements in
Page 91 and 92: ParDeepBank: Multiple Parallel Deep
Page 93 and 94: 2 The ParDeepBank The PTB has emerg
Page 95 and 96: undergone extensive scientific scru
Page 97: the second combines the structures
Page 101 and 102: data, and improvements in the infra
Page 103 and 104: The Effect of Treebank Annotation G
Page 105 and 106: ut without feature structures. This
Page 107 and 108: only the lexicon is fine-grained to
Page 109 and 110: Automatic Coreference Annotation in
Page 111 and 112: manually annotated Czech sentences.
Page 113 and 114: citizens of Bilbao] are very involv
Page 115 and 116: 4.1.3 Coreference Selector Module T
Page 117 and 118: Nominal P R F1 MUC 75.33% 81.33% 78
Page 119 and 120: [9] G. Doddington, A. Mitchell, M.
Page 121 and 122: Analyzing the Most Common Errors in
Page 123 and 124: Graph 1 shows results of subsequent
Page 125 and 126: in the semantic type). This situati
Page 127 and 128: Will a Parser Overtake Achilles? Fi
Page 129 and 130: annotation is also based on a depen
Page 131 and 132: Set Name Sentences Tokens % Train/T
Page 133 and 134: dency relations (PRED, OBJ, SBJ, AD
Page 135 and 136: ton and Stacklazy), we trained a mo
Page 137 and 138: Feature Function Column Name Addres
Page 139 and 140: Using Parallel Treebanks for Machin
Page 141 and 142: phrases are generated by a shallow
Page 143 and 144: Figure 2: A Kybot p
Page 145 and 146: SUJ NP SENT MOD PP NP PP PP NP NP P
Page 147 and 148: Checkpoint Instances Google PT Our
Page 149 and 150:
6 Conclusions In this paper we have
Page 151 and 152:
An integrated web-based treebank an
Page 153 and 154:
16] and we have earlier reported fr
Page 155 and 156:
Figure 2: Screenshot of the interfa
Page 157 and 158:
ple subcategorization frames may be
Page 159 and 160:
4 Integrated interface for annotati
Page 161 and 162:
of the 5th International Conference
Page 163 and 164:
Translational divergences and their
Page 165 and 166:
posed so far, Cyrus’ work did not
Page 167 and 168:
allowed an appropriate coverage of
Page 169 and 170:
damentales [...] 10 (the universal
Page 171 and 172:
for all lex_pair(s,t) do if head an
Page 173 and 174:
the treatment of typical translatio
Page 175 and 176:
Building a treebank of noisy user-g
Page 177 and 178:
Phenomenon Attested example Std. co
Page 179 and 180:
Figure 1: French Social Media Bank
Page 181 and 182:
Impact of treebank characteristics
Page 183 and 184:
Det Adj N (a) Danish Det Adj N (b)
Page 185 and 186:
sv: Bestämmelserna i detta avtal f
Page 187 and 188:
100 90 80 Unlabelled attachment 70
Page 189 and 190:
vidner , tilhørere og tiltalte wit
Page 191 and 192:
mance is not inconsiderable, as was
Page 193 and 194:
Genitives in Hindi Treebank: An Att
Page 195 and 196:
A genitive can occur with complex p
Page 197 and 198:
quite high. This motivates us towar
show all

A Treebank-based Investigation of IPP-triggering Verbs in Dutch

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?