Morphosyntactic annotation of the C-ORAL Brasil Corpus - VISL
Morphosyntactic annotation of the C-ORAL Brasil Corpus - VISL
Morphosyntactic annotation of the C-ORAL Brasil Corpus - VISL
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
<strong>Morphosyntactic</strong> <strong>annotation</strong><br />
<strong>of</strong> <strong>the</strong> C-<strong>ORAL</strong> <strong>Brasil</strong> <strong>Corpus</strong><br />
Eckard Bick<br />
University <strong>of</strong> Sou<strong>the</strong>rn Denmark<br />
eckhard.bick@mail.dk
Introduction<br />
● Spoken language corpora are becoming more available, growing in<br />
scope and size<br />
● To fully exploit such corpora linguistically, grammatical <strong>annotation</strong> is<br />
needed in addition to phonetic-prosodic <strong>annotation</strong><br />
● Problems<br />
Speech data are notationally ”colourful” and use different<br />
strategies for retaining non-text information in transcription<br />
Spoken language has many non-standard features, both in<br />
grammar and lexicon
Some comparable projects<br />
● C-<strong>ORAL</strong> sister projects, using statistical taggers, no syntax<br />
Italian: PiTagger (Moneglia et al. 2004) with lexicon (107.000 standard<br />
lemmas and 2000 non-standard forms) and training corpus (50.000<br />
words ~ 25% <strong>of</strong> C-<strong>ORAL</strong> size)<br />
European Portuguese: Brill tagger (Brill 1993) with written Portuguese<br />
training corpus (250.000 words)<br />
● Arabic treebank, i.e. syntax (Maamouri et al. 2010)<br />
manual selection <strong>of</strong> analyzer suggestion, followed by automatic parsing<br />
not directly comparable: broadcast news, not spontaneous speech<br />
● Our approach: Constraint Grammar <strong>annotation</strong><br />
context-driven rules ra<strong>the</strong>r than statistics
Why Constraint Grammar<br />
● robust <strong>annotation</strong> methodology (Karlsson et al. 1994)<br />
focus on disambiguation - no breakdowns due to generative rules<br />
token-based <strong>annotation</strong>, one tag per feature --> easy to filter<br />
modular methodology --> allows <strong>the</strong> combination <strong>of</strong> grammars targeting<br />
different linguistic levels<br />
● existing parser for Portuguese: PALAVRAS (Bick 2000)<br />
has been succesfully used on a large variety <strong>of</strong> genres (Linguateca<br />
project, <strong>Corpus</strong>Eye)<br />
has proven adaptability in <strong>the</strong> face <strong>of</strong> non-standard data, e.g. historical<br />
texts (Bick & Módolo 2005)<br />
keeps <strong>annotation</strong> levels seperate, while allowing <strong>the</strong>m to interact<br />
– e.g. prosodic versus syntactic information
CG in comparable projects<br />
● O<strong>the</strong>r projects with CG <strong>annotation</strong> <strong>of</strong> speech corpora<br />
Estonian speech corpus (Müürisep & Uibo 2006)<br />
Nordic Dialect <strong>Corpus</strong> (Bondi et al. 2009), hybrid method<br />
– (a) written-language CG used on Oslo dialaect<br />
– (b) manual correction<br />
– (c) train Decision Tree Tagger (Schmid 1994) for o<strong>the</strong>r dialects<br />
Spanish section <strong>of</strong> European C-<strong>ORAL</strong><br />
– CG-inspired rules for PoS disambiguation <strong>of</strong> morphological output<br />
from <strong>the</strong> GRAMPAL system (Moreno 2003)<br />
Early experiments with PALAVRAS (Bick 1998)<br />
– NURC corpus (”Norma Lingüística Urbana Culta”, Castilho 1993)
The point <strong>of</strong> departure: PALAVRAS<br />
● Modular Constraint Grammar (CG) parser with a hierarchically<br />
structured sets <strong>of</strong> contextual rules<br />
morphosyntactic tagging, dependency trees<br />
6000 rules, full lexical support, semantics<br />
O DET M S @>N #1->3<br />
último ADJ M S @>N #2->3<br />
diagnóstico N M S @SUBJ> #3->9<br />
elaborado V PCP2 M S @ICL-N< #4->3<br />
por PRP @4<br />
a DET F S @>N #6->7<br />
Comissão=Nacional PROP F S @P< #7->5<br />
não ADV @ADVL> #8->9<br />
deixa V PR 3S @FMV #9->0<br />
dúvidas N F P @9<br />
$. #11->0
TEXT<br />
Morphology<br />
Analyzer<br />
Cohorts<br />
“”<br />
“crear” V PR 3S IND<br />
“crear” V IMP 2S<br />
“creer” V PR 1/3S SUBJ<br />
sem. roles<br />
...<br />
Lexica<br />
polysemy<br />
Disambiguation<br />
Mapping<br />
Disambiguation<br />
Disambiguation<br />
Mapping<br />
CG flowchart<br />
Syntax<br />
Mapping<br />
PSG<br />
Dep.<br />
External<br />
e.g.DTT<br />
Substitution<br />
external<br />
modules<br />
tagger
Text flow normalisation
Text flow: problems<br />
● nesting and overlapping markers (<strong>the</strong> latter also<br />
problematic in xml)<br />
● focus marker é_que (2% <strong>of</strong> turns) transcribed as que -><br />
need for disambiguation<br />
● syntax needs (separate) prepositions<br />
built-in ordinary contractions: do, nele, pelo ...<br />
corpus-specific: pa (pra), pro, pum, naquea ...<br />
difficult ambiguity: pra (para vs. para_a)<br />
● post-tokenization (coral-inter) with support from<br />
normalization lexicon for <strong>the</strong> most difficult contractions,<br />
e.g. né = não é
Lexical and orthographical<br />
normalization<br />
● Parser's treatment <strong>of</strong> unknow wordforms:<br />
Affix derivations<br />
Variants: br vs. pt, accents, orthogr. reform<br />
● Special needs for C-<strong>ORAL</strong> speech corpus<br />
phonetically transcribed word forms<br />
grammatical variants (-amos -> -amo)<br />
● Solutions: two-level <strong>annotation</strong> and specialized standardisation modules<br />
meninim OALT menininho [menino] N M S<br />
coral.inter: second preprocessor with systematic and item-based changes<br />
postlex_pt: postprocessor with morphological analyzer using separate lexicon (2000<br />
entries) and overriding PALAVRAS' heuristics
(a1) emedebê MDB (phonetic abbreviations)<br />
(b3) inda ainda ( (word-initial changes)<br />
(b4) roz arroz<br />
(d2) fazido feito (overregularization)<br />
● Multi-word strings: effect also tokenization, but help disambiguate <strong>the</strong>ir<br />
parts, e.g.<br />
n' = não (not em) in: n'=era, n'=ocê<br />
● Non-systematic new words and names:<br />
(a1) fazeção N F S<br />
(a2) zenes N M P # termo de jogo<br />
(a3) caça-talentos N M S<br />
(a5) superbem-arrumada ADJ F S<br />
(b) mil-oitocentos-e=vovó=gostosa NUM M/F P<br />
(c1) remote N M S # estrangeirismo<br />
(c2) completed ADJ M/F S/P # estrangeirismo<br />
(c5) anche ADV # estrangeirismo<br />
(d1) tu=tu X # onomatopéia<br />
(e2) TIM PROP F S # company<br />
(e3) Tim<strong>of</strong>tol PROP M S
unknown / nonstandard<br />
input<br />
coral.pre<br />
tokenization<br />
PALAVRAS preprocessor<br />
e.g. complex prp, names<br />
coral.inter<br />
normalization<br />
<br />
PALAVRAS analyzer<br />
heuristics<br />
postlex_pt<br />
analysis <strong>of</strong><br />
”unknown” words<br />
CG disambiguation<br />
● Add-on lexicon used to add<br />
readings ra<strong>the</strong>r than override<br />
<strong>the</strong> parsers, allowing for<br />
contextual disambiguation, even<br />
<strong>of</strong> unintended ambiguity:<br />
pô --> pôs ---> verb vs.<br />
interjection plural (allowed in C-<br />
<strong>ORAL</strong>)<br />
CG syntax<br />
pt_forms<br />
_coral<br />
newlex_pt<br />
dependency
Syntax<br />
● Problem: syntactic noise: ah, eeh, uh<br />
Solution: two-level <strong>annotation</strong><br />
● Problem:Syntactic <strong>annotation</strong> needs long-distance contexts, so how can existing rules<br />
be made to work on a speech corpus<br />
> 80% unbounded/global CG rules in syntax<br />
but <strong>the</strong> corpus lacks sentence segmentation and punctuation to delimit <strong>the</strong>se rules<br />
Solution: Exploit prosodic information by not moving it to a meta-level, but ra<strong>the</strong>r<br />
change it into punctuation<br />
● // (major prosodic break) --> semicolon<br />
● / (s<strong>of</strong>t prosodic break) -->
prosodic ”break markers”:<br />
rule-based disambiguation<br />
● --> comma<br />
● --> meta-level<br />
(a) between a noun or a nominative pronoun or a conjunction to <strong>the</strong><br />
left, and a finite verb to <strong>the</strong> right, a prosodic /-marker is treated<br />
as (subject - verb case)<br />
(b) prosodic /-markers between a noun and ano<strong>the</strong>r np are treated<br />
as (appositions)<br />
(c) / between a prenominal and its head is treated as (np<br />
cohesion), e.g. 388 cases <strong>of</strong> article + <br />
... tipo Zé=Mourinho falando<br />
assim não <br />
o [o] DET M S @>N<br />
<br />
[campeonato] N M S @SUBJ><br />
d' OALT de [de] PRP @N<<br />
ocês OALT vocês [você] PERS M/F 3P NOM/PIV @P<<br />
é [ser] V PR 3S IND VFIN @FMV
Evaluation<br />
● random transcription file (1895 tokens)<br />
● eval_cg tool raw analysis file and revised version<br />
● challenge: alignment in <strong>the</strong> face <strong>of</strong> punctuation ambiguity
Effectiveness <strong>of</strong> using prosodic break<br />
markers as punctuation<br />
● standard run: pause/break disambiguation<br />
● no-break: /-marks ignored<br />
● no-sentence: both /- and //-marks ignored<br />
● all-break: all /-marks as commas, no disambiguation
using prosodic breaks for syntax:<br />
results<br />
● prosodic break markers do help <strong>the</strong> parser<br />
● more so for syntax than PoS/morphology (wider contextual<br />
scope with corresponding segmentation needs)<br />
● pause/break disambiguation more relevant for syntax than PoS
graphical search interface
concordances
statistics
contrastive searches: N dela - N dele
comparative search across corpora
cqp speak: [pos="ADJ" & func="N
Conclusions<br />
● A standard written-language parser (PALAVRAS) can be used to<br />
assign morphosyntactic tags to transcribed speech data<br />
● For optimal performans, certain adaptations are necessary:<br />
orthographical normalization<br />
lexicon extensions<br />
syntactic segmentation (disambiguation <strong>of</strong> prosodic breaks)<br />
● Under optimal conditions, almost-normal F-scores can be<br />
achieved (98.6% PoS, 95% syntax, 99% lemmatization)
Outlook<br />
● Annotation scheme does preserve <strong>the</strong> original prosodictranscriptional<br />
information (speech flow, retractions, overlaps<br />
etc.) encoded as metatagging alongside morphosyntactic tags,<br />
but how to make this accessible to GUI searches<br />
searchable now on with grammar: prosodic breaks<br />
not searchable now: speech flow, retractions, overlaps<br />
● Wish lish: higher level <strong>annotation</strong>:<br />
Dependency (done)<br />
semantic classes and roles<br />
anaphoric relations ....
References 1<br />
Bick, Eckhard. 2000. The Parsing System Palavras - Automatic Grammatical Analysis <strong>of</strong><br />
Portuguese in a Constraint Grammar Framework, Aarhus: Aarhus University Press<br />
Bick, Eckhard. 1998. Tagging Speech Data - Constraint Grammar Analysis <strong>of</strong> Spoken<br />
Portuguese, in: Proceedings <strong>of</strong> <strong>the</strong> 17th Scandinavian Conference <strong>of</strong> Linguistics (Odense<br />
1998)<br />
Bick, Eckhard & Marcelo Módolo. 2005. Letters and Editorials: A grammatically<br />
annotated corpus <strong>of</strong> 19th century Brazilian Portuguese. In: Claus Pusch & Johannes<br />
Kabatek & Wolfgang Raible (eds.) Romance <strong>Corpus</strong> Linguistics II: Corpora and Historical<br />
Linguistics (Proceedings <strong>of</strong> <strong>the</strong> 2nd Freiburg Workshop on Romance <strong>Corpus</strong> stics, Sept.<br />
2003). pp. 271-280. Tübingen: Gun<strong>the</strong>r Narr Verlag.<br />
Brill, Eric. 1992. A simple rule-based part <strong>of</strong> speech tagger. In: Proceedings <strong>of</strong> <strong>the</strong><br />
workshop on Speech and Natural Language. HLT '91, Morristown, NJ, USA: Association<br />
for Computational Linguistics, pp.112–116<br />
Castilho, Ataliba de (ed.), 1993. Gramática do Português Falado, vol.3, Campinas:<br />
Editora da Unicamp.<br />
Johannessen, Janne Bondi, Joel Priestley, Kristin Hagen, Tor Anders Åfarli, and Øystein<br />
Alexander Vangsnes. 2009. The Nordic Dialect <strong>Corpus</strong> - an Advanced Research Tool. In<br />
Jokinen, Kristiina and Eckhard Bick (eds.): Proceedings <strong>of</strong> <strong>the</strong> 17th Nordic Conference <strong>of</strong><br />
Computational Linguistics NODALIDA 2009. NEALT Proceedings Series Volume 4<br />
Karlsson, Fred & Voutilainen, Atro & Heikkilä, Juka & Anttila, Arto. 1995. Constraint<br />
Grammar, A Language-Independent System for Parsing Unrestricted Text. Berlin: Mouton<br />
de Gruyter.
References 2<br />
Maamouri, Mohamed et al. 2010. From Speech to Trees: Applying Treebank Annotation<br />
to Arabic Broadcast News. In: Proceedings <strong>of</strong> LREC 2010, Valletta, Malta, May 2010.<br />
Moreno, A. & J.M. Guirão. 2003. "Tagging a spontaneous speech corpus <strong>of</strong> Spanish". In:<br />
Proceedings <strong>of</strong> <strong>the</strong> International Conference on Recent Advances in Natural Language<br />
Processing.). Borovets, Bulgaria, 2003. p. 292-296.<br />
Müürisep, Kaili and Uibo, Heli (2006). "Shallow Parsing <strong>of</strong> Spoken Estonian Using<br />
Constraint Grammar". In: P.J.Henriksen & P.R.Skadhauge, Proceedings <strong>of</strong> NODALIDA-<br />
2005 special session on treebanking. Copenhagen Studies in Language #33/2006<br />
Moneglia, M., A. Panunzi, E. Picchi, 2004, Using PiTagger for Lemmatization and PoS<br />
Tagging <strong>of</strong> a Spontaneous Speech <strong>Corpus</strong> : C-Oral-Rom Italian. In M.T. Lino et al. (eds.),<br />
Proceedings <strong>of</strong> <strong>the</strong> 4th LREC Conference, vol. 2, ELRA, Paris, pp. 563-566.<br />
Raso, Tommaso & Heliana Mello. 2010. The C-<strong>ORAL</strong> BRASIL corpus. In: Massimo<br />
Moneglia & Alessandro Panunzi (eds): Bootstrapping Infromation from Corpora in a<br />
Cross-Linguistic Perspective. Universitá degli studi di Firenze, Biblioteca Digitale.<br />
Schmid, Helmut. 1994. Probabilistic Part-<strong>of</strong>-Speech Tagging Using Decision Trees.<br />
Proceedings <strong>of</strong> <strong>the</strong> International Conference on New Methods in Language Processing<br />
1994. pp. 44-49.
visl.sdu.dk (parsers)<br />
corp.hum.sdu.dk (corpora)