Acknowledgments

This work was partly funded by the French ANR project EDyLex (ANR-09-CORD-008).

References

[1] Abeillé, A., Clément, L., and Toussenel, F. (2003). Building a Treebank for French. Kluwer, Dordrecht.
[2] Bies, A., Mott, J., Warner, C., and Kulick, S. (2012). English web treebank. Technical report, Linguistic Data Consortium, Philadelphia, PA, USA.
[3] Candito, M. and Crabbé, B. (2009). Improving generative statistical parsing with semi-supervised word clustering. In Proc. of IWPT’09, Paris, France.
[4] Candito, M., Nivre, J., Denis, P., and Henestroza, E. (2010). Benchmarking of statistical dependency parsers for French. In Proc. of CoLing’10, Beijing, China.
[5] Chrupała, G., Dinu, G., and van Genabith, J. (2008). Learning morphology with Morfette. In Proc. of LREC 2008, Marrakech, Morocco.
[6] Crabbé, B. and Candito, M. (2008). Expériences d’analyse syntaxique statistique du français. In Proc. of TALN’08, pages 45–54, Senlis, France.
[7] Denis, P. and Sagot, B. (2009). Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort. In Proc. of PACLIC, Hong Kong, China.
[8] Foster, J. (2010). “cba to check the spelling”: Investigating parser performance on discussion forum posts. In Proc. of HLT/NAACL’10, Los Angeles, USA.
[9] Foster, J., Cetinoglu, O., Wagner, J., Le Roux, J., Hogan, S., Nivre, J., Hogan, D., and van Genabith, J. (2011a). #hardtoparse: POS tagging and parsing the twitterverse. In Proc. of the AAAI 2011 Workshop On Analyzing Microtext.
[10] Foster, J., Cetinoglu, O., Wagner, J., Le Roux, J., Nivre, J., Hogan, D., and van Genabith, J. (2011b). From news to comment: Resources and benchmarks for parsing the language of web 2.0. In Proc. of IJCNLP, Chiang Mai, Thailand.
[11] Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., and Smith, N. A. (2011). Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proc. of ACL’11, Portland, USA.
[12] Petrov, S., Barrett, L., Thibaux, R., and Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. In Proc. of ACL’06, Sydney, Australia.
[13] Seddah, D., Sagot, B., Candito, M., Mouilleron, V., and Combet, V. (2012). The French social media bank: a treebank of noisy user generated content. In Proc. of CoLing’12, Mumbai, India.
Impact of treebank characteristics on cross-lingual parser adaptation

Arne Skjærholt, Lilja Øvrelid
Department of informatics, University of Oslo
{arnskj,liljao}@ifi.uio.no

Abstract

Treebank creation can benefit from the use of a parser. Recent work on cross-lingual parser adaptation has presented results that make this a viable option as an early-stage preprocessor in cases where there are no (or not enough) resources available to train a statistical parser for a language. In this paper we examine cross-lingual parser adaptation between three highly related languages: Swedish, Danish and Norwegian. Our focus on related languages allows for an in-depth study of factors influencing the performance of the adapted parsers and we examine the influence of annotation strategy and treebank size. Our results show that a simple conversion process can give very good results, and with a few simple, linguistically informed, changes to the source data, even better results can be obtained, even with a source treebank that is quite differently annotated from the target treebank. We also show that for languages with large amounts of lexical overlap, delexicalisation is not necessary; indeed lexicalised parsers outperform delexicalised parsers, and that for some treebanks it is possible to convert dependency relations and create a labelled cross-lingual parser.
1 Introduction

It is well known that when annotating new resources, the best way to go about it is not necessarily to annotate raw data, but rather to correct automatically annotated material (Chiou et al. [2], Fort and Sagot [3]). This can yield important speed gains, and often also improvements in measures of annotation quality such as inter-annotator agreement. However, if we wish to create a treebank for a language with no pre-existing resources, we face something of a Catch-22: on the one hand, we have no data to train a statistical parser to automatically annotate sentences, but on the other hand we really would like to have one. In this setting, one option is cross-lingual parser adaptation.

1 The code used to obtain the data in this paper is available from https://github.com/arnsholt/tlt11-experiments.