06.08.2013 Views

Morphosyntactic annotation of the C-ORAL Brasil Corpus - VISL


Morphosyntactic annotation
of the C-ORAL Brasil Corpus

Eckhard Bick
University of Southern Denmark
eckhard.bick@mail.dk


Introduction
● Spoken language corpora are becoming more available, growing in scope and size
● To fully exploit such corpora linguistically, grammatical annotation is needed in addition to phonetic-prosodic annotation
● Problems
  Speech data are notationally "colourful" and use different strategies for retaining non-text information in transcription
  Spoken language has many non-standard features, both in grammar and lexicon


Some comparable projects
● C-ORAL sister projects, using statistical taggers, no syntax
  Italian: PiTagger (Moneglia et al. 2004) with lexicon (107,000 standard lemmas and 2000 non-standard forms) and training corpus (50,000 words ~ 25% of C-ORAL size)
  European Portuguese: Brill tagger (Brill 1992) with written Portuguese training corpus (250,000 words)
● Arabic treebank, i.e. syntax (Maamouri et al. 2010)
  manual selection of analyzer suggestion, followed by automatic parsing
  not directly comparable: broadcast news, not spontaneous speech
● Our approach: Constraint Grammar annotation
  context-driven rules rather than statistics


Why Constraint Grammar
● robust annotation methodology (Karlsson et al. 1995)
  focus on disambiguation - no breakdowns due to generative rules
  token-based annotation, one tag per feature --> easy to filter
  modular methodology --> allows the combination of grammars targeting different linguistic levels
● existing parser for Portuguese: PALAVRAS (Bick 2000)
  has been successfully used on a large variety of genres (Linguateca project, CorpusEye)
  has proven adaptability in the face of non-standard data, e.g. historical texts (Bick & Módolo 2005)
  keeps annotation levels separate, while allowing them to interact
    – e.g. prosodic versus syntactic information


CG in comparable projects
● Other projects with CG annotation of speech corpora
  Estonian speech corpus (Müürisep & Uibo 2006)
  Nordic Dialect Corpus (Johannessen et al. 2009), hybrid method
    – (a) written-language CG used on Oslo dialect
    – (b) manual correction
    – (c) train Decision Tree Tagger (Schmid 1994) for other dialects
  Spanish section of European C-ORAL
    – CG-inspired rules for PoS disambiguation of morphological output from the GRAMPAL system (Moreno & Guirão 2003)
  Early experiments with PALAVRAS (Bick 1998)
    – NURC corpus ("Norma Lingüística Urbana Culta", Castilho 1993)


The point of departure: PALAVRAS
● Modular Constraint Grammar (CG) parser with hierarchically structured sets of contextual rules
  morphosyntactic tagging, dependency trees
  6000 rules, full lexical support, semantics

O DET M S @>N #1->3
último ADJ M S @>N #2->3
diagnóstico N M S @SUBJ> #3->9
elaborado V PCP2 M S @ICL-N< #4->3
por PRP @<PASS #5->4
a DET F S @>N #6->7
Comissão=Nacional PROP F S @P< #7->5
não ADV @ADVL> #8->9
deixa V PR 3S @FMV #9->0
dúvidas N F P @<ACC #10->9
$. #11->0
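The tagged lines above follow a regular layout: word form, part of speech and morphology, a syntactic function tag starting with @, and a dependency link written as #self->head. As a rough illustration of how such output could be read programmatically, here is a minimal Python sketch. The `Token` class, the regular expression and the helper names are assumptions made for this example, not part of PALAVRAS or its tooling.

```python
import re
from dataclasses import dataclass

# Assumed layout: form, space-separated tags (one starting with "@"), then "#self->head".
LINE_RE = re.compile(r"^(?P<form>\S+)\s+(?P<tags>.*?)\s*#(?P<idx>\d+)->(?P<head>\d+)$")


@dataclass
class Token:
    form: str
    pos: str
    morph: list
    func: str
    idx: int
    head: int


def parse_line(line):
    """Parse one annotated line into a Token (hypothetical helper)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    tags = m.group("tags").split()
    func = next((t for t in tags if t.startswith("@")), "")
    plain = [t for t in tags if not t.startswith("@")]
    pos = plain[0] if plain else ""
    return Token(m.group("form"), pos, plain[1:], func,
                 int(m.group("idx")), int(m.group("head")))


def dependents(tokens, head_idx):
    """Return the word forms whose dependency head is head_idx."""
    return [t.form for t in tokens if t.head == head_idx]


# Four lines taken from the slide example.
SAMPLE = [
    "O DET M S @>N #1->3",
    "último ADJ M S @>N #2->3",
    "diagnóstico N M S @SUBJ> #3->9",
    "deixa V PR 3S @FMV #9->0",
]
tokens = [parse_line(line) for line in SAMPLE]
print(dependents(tokens, 3))
```

Because every feature is a separate tag on the token line, filtering by part of speech, function or head reduces to simple list comprehensions over the parsed tokens.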


CG flowchart
[Figure: CG flowchart. Box labels: TEXT; Morphology; Analyzer; Lexica; Cohorts; Disambiguation; Mapping; Syntax; polysemy; sem. roles; PSG; Dep.; External tagger (e.g. DTT); Substitution; external modules. Example cohort: "crear" V PR 3S IND; "crear" V IMP 2S; "creer" V PR 1/3S SUBJ]


Text flow normalisation


Text flow: problems
● nesting and overlapping markers (the latter also problematic in xml)
● focus marker é_que (2% of turns) transcribed as que -> need for disambiguation
● syntax needs (separate) prepositions
  built-in ordinary contractions: do, nele, pelo ...
  corpus-specific: pa (pra), pro, pum, naquea ...
  difficult ambiguity: pra (para vs. para_a)
● post-tokenization (coral.inter) with support from normalization lexicon for the most difficult contractions, e.g. né = não é
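The contraction handling described above can be pictured as a lexicon lookup that returns one or more candidate splits per token, leaving the choice between candidates to later contextual rules. The following Python sketch is only an illustration: the table contents beyond the forms named on the slide (for example the expansions of pro, pum and naquea) are my own assumptions, and `expand` is a hypothetical helper, not the coral.inter module.

```python
# Spelling variants mapped to a better-known contracted form (assumed mappings).
VARIANTS = {"pa": "pra", "naquea": "naquela"}

# Each contraction maps to a list of candidate splits; more than one
# candidate means the form is ambiguous and needs contextual disambiguation.
CONTRACTIONS = {
    "do": [["de", "o"]],
    "nele": [["em", "ele"]],
    "pelo": [["por", "o"]],
    "pro": [["para", "o"]],           # assumed expansion
    "pum": [["para", "um"]],          # assumed expansion
    "naquela": [["em", "aquela"]],    # assumed expansion
    "né": [["não", "é"]],
    "pra": [["para"], ["para", "a"]],  # para vs. para_a
}


def expand(token):
    """Return all candidate token sequences for a possibly contracted form."""
    key = token.lower()
    key = VARIANTS.get(key, key)
    options = CONTRACTIONS.get(key)
    if options is None:
        return [[token]]
    return [list(parts) for parts in options]


for form in ["do", "pra", "pa", "né", "casa"]:
    print(form, expand(form))
```

Returning every candidate instead of picking one keeps the preprocessor simple and leaves ambiguous cases such as pra to rules that can see the surrounding context.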


Lexical and orthographical normalization
● Parser's treatment of unknown wordforms:
  Affix derivations
  Variants: br vs. pt, accents, orthogr. reform
● Special needs for C-ORAL speech corpus
  phonetically transcribed word forms
  grammatical variants (-amos -> -amo)
● Solutions: two-level annotation and specialized standardisation modules
  meninim OALT menininho [menino] N M S
  coral.inter: second preprocessor with systematic and item-based changes
  postlex_pt: postprocessor with morphological analyzer using separate lexicon (2000 entries) and overriding PALAVRAS' heuristics


(a1) emedebê -> MDB (phonetic abbreviations)
(b3) inda -> ainda (word-initial changes)
(b4) roz -> arroz
(d2) fazido -> feito (overregularization)
● Multi-word strings: also affect tokenization, but help disambiguate their parts, e.g.
  n' = não (not em) in: n'=era, n'=ocê
● Non-systematic new words and names:
  (a1) fazeção N F S
  (a2) zenes N M P # termo de jogo
  (a3) caça-talentos N M S
  (a5) superbem-arrumada ADJ F S
  (b) mil-oitocentos-e=vovó=gostosa NUM M/F P
  (c1) remote N M S # estrangeirismo
  (c2) completed ADJ M/F S/P # estrangeirismo
  (c5) anche ADV # estrangeirismo
  (d1) tu=tu X # onomatopéia
  (e2) TIM PROP F S # company
  (e3) Timoftol PROP M S
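Two-level annotation keeps the transcribed form and adds the standardised form after an OALT marker, so that the analysis can be attached to the standard form without losing the original. A minimal Python sketch of that idea follows. The `NORMALIZATION` table reuses the slide's examples, while `TOY_ANALYSES` and `two_level` are hypothetical stand-ins for the real lexicon and analyzer.

```python
# Non-standard form -> standard form, using examples from the slides.
NORMALIZATION = {
    "emedebê": "MDB",
    "inda": "ainda",
    "roz": "arroz",
    "fazido": "feito",
    "meninim": "menininho",
}

# Stand-in for the morphological analyzer: standard form -> analysis string.
# Only one entry, taken from the slide; a real analyzer would cover the lexicon.
TOY_ANALYSES = {"menininho": "[menino] N M S"}


def two_level(form):
    """Build a two-level annotation line: original form, OALT + standard form, analysis."""
    standard = NORMALIZATION.get(form)
    target = standard if standard is not None else form
    parts = [form]
    if standard is not None:
        parts += ["OALT", standard]
    analysis = TOY_ANALYSES.get(target, "")
    if analysis:
        parts.append(analysis)
    return " ".join(parts)


print(two_level("meninim"))
print(two_level("roz"))
```

Keeping both levels on one line means searches can target either the speaker's actual form or the standardised one.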


[Figure: processing pipeline for unknown / nonstandard input. Box labels: coral.pre (tokenization); PALAVRAS preprocessor (e.g. complex prp, names); coral.inter (normalization); PALAVRAS analyzer (heuristics); postlex_pt (analysis of "unknown" words); CG disambiguation; CG syntax; dependency. Lexicon resources: pt_forms_coral, newlex_pt]
● Add-on lexicon used to add readings rather than override the parser's, allowing for contextual disambiguation, even of unintended ambiguity:
  pô --> pôs ---> verb vs. interjection plural (allowed in C-ORAL)


Syntax
● Problem: syntactic noise: ah, eeh, uh
  Solution: two-level annotation
● Problem: Syntactic annotation needs long-distance contexts, so how can existing rules be made to work on a speech corpus?
  > 80% unbounded/global CG rules in syntax
  but the corpus lacks sentence segmentation and punctuation to delimit these rules
  Solution: Exploit prosodic information by not moving it to a meta-level, but rather changing it into punctuation
● // (major prosodic break) --> semicolon
● / (soft prosodic break) --> comma or meta-level marker, decided by rule


prosodic "break markers": rule-based disambiguation
● break --> comma
● pause --> meta-level
  (a) between a noun or a nominative pronoun or a conjunction to the left, and a finite verb to the right, a prosodic /-marker is treated as pause (subject - verb case)
  (b) prosodic /-markers between a noun and another np are treated as break (appositions)
  (c) / between a prenominal and its head is treated as pause (np cohesion), e.g. 388 cases of article + /-marker

... tipo Zé=Mourinho falando assim não
o [o] DET M S @>N
[campeonato] N M S @SUBJ>
d' OALT de [de] PRP @N<
ocês OALT vocês [você] PERS M/F 3P NOM/PIV @P<
é [ser] V PR 3S IND VFIN @FMV
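Rules (a) to (c) can be read as a small decision procedure over the tokens on either side of a /-marker. The Python sketch below is one possible reading, not the actual CG rules. The `Tok` type, the tag sets, the fallback to a comma when no rule fires, and the placement of the marker in the sample are all assumptions made for illustration.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    form: str
    pos: str                          # e.g. "N", "DET", "V", "PERS", "KC", "MARK"
    feats: frozenset = frozenset()    # e.g. {"VFIN"}, {"NOM"}


NP_START = {"DET", "N", "PROP", "ADJ"}   # assumed set of np-initial word classes


def classify_slash(left, right):
    """Return "pause" (meta-level) or "break" (comma) for a /-marker."""
    # Rule (a): subject-like material before a finite verb -> no comma.
    if right.pos == "V" and "VFIN" in right.feats:
        if left.pos in {"N", "KC", "KS"} or (left.pos == "PERS" and "NOM" in left.feats):
            return "pause"
    # Rule (c): prenominal (here only articles/determiners) before its head.
    if left.pos == "DET" and right.pos in {"N", "PROP"}:
        return "pause"
    # Rule (b): noun followed by another np -> apposition, keep a comma.
    if left.pos == "N" and right.pos in NP_START:
        return "break"
    return "break"   # assumed default; the slides do not state one


def to_punctuation(tokens):
    """Replace // by ';' and / by ',' or a '<pause>' meta token."""
    out = []
    for i, tok in enumerate(tokens):
        if tok.form == "//":
            out.append(";")
        elif tok.form == "/":
            left = tokens[i - 1] if i > 0 else None
            right = tokens[i + 1] if i + 1 < len(tokens) else None
            if left and right and classify_slash(left, right) == "pause":
                out.append("<pause>")
            else:
                out.append(",")
        else:
            out.append(tok.form)
    return out


sample = [
    Tok("o", "DET"), Tok("/", "MARK"), Tok("campeonato", "N"),
    Tok("d'", "PRP"), Tok("ocês", "PERS", frozenset({"NOM"})),
    Tok("é", "V", frozenset({"VFIN"})), Tok("//", "MARK"),
]
print(to_punctuation(sample))
```

The point of the ordering is that the cohesion cases (a) and (c) are checked before the apposition case, so a marker inside an np or between subject and verb never turns into a comma that would cut off long-distance syntactic rules.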


Evaluation
● random transcription file (1895 tokens)
● eval_cg tool: raw analysis file compared with revised version
● challenge: alignment in the face of punctuation ambiguity
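Comparing a raw analysis with a revised one amounts to counting matching tags per token once the two files are aligned. The sketch below shows a generic precision, recall and F-score computation in Python. It assumes alignment has already been solved and does not claim to reproduce how eval_cg scores its input; the function name and the toy data are invented for illustration.

```python
def precision_recall_f(system, gold):
    """Compute precision, recall and F-score over aligned per-token tag sets.

    system, gold: lists of sets of tags, one set per token, same length.
    """
    if len(system) != len(gold):
        raise ValueError("token sequences must be aligned and equally long")
    correct = sum(len(s & g) for s, g in zip(system, gold))
    n_system = sum(len(s) for s in system)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_system if n_system else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data (invented): the second token keeps an extra, wrong reading,
# and the third token has the wrong tag.
raw = [{"N"}, {"V", "N"}, {"ADJ"}]
revised = [{"N"}, {"V"}, {"ADV"}]
p, r, f = precision_recall_f(raw, revised)
print(round(p, 3), round(r, 3), round(f, 3))
```

Working with tag sets rather than single tags matters for Constraint Grammar output, where a token may keep more than one reading if the rules could not fully disambiguate it.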


Effectiveness of using prosodic break markers as punctuation
● standard run: pause/break disambiguation
● no-break: /-marks ignored
● no-sentence: both /- and //-marks ignored
● all-break: all /-marks as commas, no disambiguation
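The four runs differ only in how the two prosodic markers are turned into punctuation before parsing. A minimal Python sketch of such a switch is given below. The mode names follow the slide, but the function, its signature and the `disambiguate` callback are assumptions for illustration only.

```python
MODES = {"standard", "no-break", "no-sentence", "all-break"}


def preprocess(tokens, mode, disambiguate=None):
    """Convert prosodic markers to punctuation according to the run mode.

    tokens: list of strings, with "/" and "//" as prosodic markers.
    disambiguate: optional callback (tokens, index) -> "break" or "pause",
                  used only in standard mode (hypothetical interface).
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    out = []
    for i, tok in enumerate(tokens):
        if tok == "//":
            if mode != "no-sentence":
                out.append(";")
        elif tok == "/":
            if mode in ("no-break", "no-sentence"):
                continue
            if mode == "all-break":
                out.append(",")
            else:
                kind = disambiguate(tokens, i) if disambiguate else "break"
                if kind == "break":
                    out.append(",")
                # a "pause" is dropped from the punctuation stream here
        else:
            out.append(tok)
    return out


def toy_rule(tokens, i):
    """Invented rule: no comma directly after the article "o"."""
    return "pause" if i > 0 and tokens[i - 1] == "o" else "break"


turn = ["o", "/", "campeonato", "é", "/", "bom", "//"]
for mode in ["standard", "no-break", "no-sentence", "all-break"]:
    print(mode, preprocess(turn, mode, toy_rule))
```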


using prosodic breaks for syntax: results
● prosodic break markers do help the parser
● more so for syntax than PoS/morphology (wider contextual scope with corresponding segmentation needs)
● pause/break disambiguation more relevant for syntax than PoS


graphical search interface


concordances


statistics


contrastive searches: N dela - N dele


comparative search across corpora


cqp speak: [pos="ADJ" & func="N


Conclusions
● A standard written-language parser (PALAVRAS) can be used to assign morphosyntactic tags to transcribed speech data
● For optimal performance, certain adaptations are necessary:
  orthographical normalization
  lexicon extensions
  syntactic segmentation (disambiguation of prosodic breaks)
● Under optimal conditions, almost-normal F-scores can be achieved (98.6% PoS, 95% syntax, 99% lemmatization)


Outlook
● Annotation scheme does preserve the original prosodic-transcriptional information (speech flow, retractions, overlaps etc.) encoded as metatagging alongside morphosyntactic tags, but how to make this accessible to GUI searches?
  searchable now, together with grammar: prosodic breaks
  not searchable now: speech flow, retractions, overlaps
● Wish list: higher level annotation:
  Dependency (done)
  semantic classes and roles
  anaphoric relations ....


References 1
Bick, Eckhard. 2000. The Parsing System Palavras - Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus: Aarhus University Press.
Bick, Eckhard. 1998. Tagging Speech Data - Constraint Grammar Analysis of Spoken Portuguese. In: Proceedings of the 17th Scandinavian Conference of Linguistics (Odense 1998).
Bick, Eckhard & Marcelo Módolo. 2005. Letters and Editorials: A grammatically annotated corpus of 19th century Brazilian Portuguese. In: Claus Pusch & Johannes Kabatek & Wolfgang Raible (eds.) Romance Corpus Linguistics II: Corpora and Historical Linguistics (Proceedings of the 2nd Freiburg Workshop on Romance Corpus Linguistics, Sept. 2003). pp. 271-280. Tübingen: Gunter Narr Verlag.
Brill, Eric. 1992. A simple rule-based part of speech tagger. In: Proceedings of the workshop on Speech and Natural Language. HLT '91, Morristown, NJ, USA: Association for Computational Linguistics, pp. 112–116.
Castilho, Ataliba de (ed.). 1993. Gramática do Português Falado, vol. 3. Campinas: Editora da Unicamp.
Johannessen, Janne Bondi, Joel Priestley, Kristin Hagen, Tor Anders Åfarli, and Øystein Alexander Vangsnes. 2009. The Nordic Dialect Corpus - an Advanced Research Tool. In: Jokinen, Kristiina and Eckhard Bick (eds.): Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. NEALT Proceedings Series Volume 4.
Karlsson, Fred & Voutilainen, Atro & Heikkilä, Juha & Anttila, Arto. 1995. Constraint Grammar, A Language-Independent System for Parsing Unrestricted Text. Berlin: Mouton de Gruyter.


References 2
Maamouri, Mohamed et al. 2010. From Speech to Trees: Applying Treebank Annotation to Arabic Broadcast News. In: Proceedings of LREC 2010, Valletta, Malta, May 2010.
Moreno, A. & J.M. Guirão. 2003. Tagging a spontaneous speech corpus of Spanish. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing. Borovets, Bulgaria, 2003. pp. 292-296.
Müürisep, Kaili and Uibo, Heli. 2006. Shallow Parsing of Spoken Estonian Using Constraint Grammar. In: P.J. Henriksen & P.R. Skadhauge, Proceedings of NODALIDA-2005 special session on treebanking. Copenhagen Studies in Language #33/2006.
Moneglia, M., A. Panunzi, E. Picchi. 2004. Using PiTagger for Lemmatization and PoS Tagging of a Spontaneous Speech Corpus: C-Oral-Rom Italian. In: M.T. Lino et al. (eds.), Proceedings of the 4th LREC Conference, vol. 2, ELRA, Paris, pp. 563-566.
Raso, Tommaso & Heliana Mello. 2010. The C-ORAL BRASIL corpus. In: Massimo Moneglia & Alessandro Panunzi (eds): Bootstrapping Information from Corpora in a Cross-Linguistic Perspective. Università degli studi di Firenze, Biblioteca Digitale.
Schmid, Helmut. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In: Proceedings of the International Conference on New Methods in Language Processing 1994. pp. 44-49.


visl.sdu.dk (parsers)
corp.hum.sdu.dk (corpora)
