Morphosyntactic annotation of the C-ORAL Brasil Corpus - VISL

Morphosyntactic annotation 

of the C-ORAL Brasil Corpus 

Eckard Bick 

University of Southern Denmark 

eckhard.bick@mail.dk

Introduction 

● Spoken language corpora are becoming more available, growing in 

scope and size 

● To fully exploit such corpora linguistically, grammatical annotation is 

needed in addition to phonetic-prosodic annotation 

● Problems 

Speech data are notationally ”colourful” and use different 

strategies for retaining non-text information in transcription 

Spoken language has many non-standard features, both in 

grammar and lexicon

Some comparable projects 

● C-ORAL sister projects, using statistical taggers, no syntax 

Italian: PiTagger (Moneglia et al. 2004) with lexicon (107.000 standard 

lemmas and 2000 non-standard forms) and training corpus (50.000 

words ~ 25% of C-ORAL size) 

European Portuguese: Brill tagger (Brill 1993) with written Portuguese 

training corpus (250.000 words) 

● Arabic treebank, i.e. syntax (Maamouri et al. 2010) 

manual selection of analyzer suggestion, followed by automatic parsing 

not directly comparable: broadcast news, not spontaneous speech 

● Our approach: Constraint Grammar annotation 

context-driven rules rather than statistics

Why Constraint Grammar 

● robust annotation methodology (Karlsson et al. 1994) 

focus on disambiguation - no breakdowns due to generative rules 

token-based annotation, one tag per feature --> easy to filter 

modular methodology --> allows the combination of grammars targeting 

different linguistic levels 

● existing parser for Portuguese: PALAVRAS (Bick 2000) 

has been succesfully used on a large variety of genres (Linguateca 

project, CorpusEye) 

has proven adaptability in the face of non-standard data, e.g. historical 

texts (Bick & Módolo 2005) 

keeps annotation levels seperate, while allowing them to interact 

– e.g. prosodic versus syntactic information

CG in comparable projects 

● Other projects with CG annotation of speech corpora 

Estonian speech corpus (Müürisep & Uibo 2006) 

Nordic Dialect Corpus (Bondi et al. 2009), hybrid method 

– (a) written-language CG used on Oslo dialaect 

– (b) manual correction 

– (c) train Decision Tree Tagger (Schmid 1994) for other dialects 

Spanish section of European C-ORAL 

– CG-inspired rules for PoS disambiguation of morphological output 

from the GRAMPAL system (Moreno 2003) 

Early experiments with PALAVRAS (Bick 1998) 

– NURC corpus (”Norma Lingüística Urbana Culta”, Castilho 1993)

The point of departure: PALAVRAS 

● Modular Constraint Grammar (CG) parser with a hierarchically 

structured sets of contextual rules 

morphosyntactic tagging, dependency trees 

6000 rules, full lexical support, semantics 

O DET M S @>N #1->3 

último ADJ M S @>N #2->3 

diagnóstico N M S @SUBJ> #3->9 

elaborado V PCP2 M S @ICL-N< #4->3 

por PRP @4 

a DET F S @>N #6->7 

Comissão=Nacional PROP F S @P< #7->5 

não ADV @ADVL> #8->9 

deixa V PR 3S @FMV #9->0 

dúvidas N F P @9 

$. #11->0

TEXT 

Morphology 

Analyzer 

Cohorts 

“” 

“crear” V PR 3S IND 

“crear” V IMP 2S 

“creer” V PR 1/3S SUBJ 

sem. roles 

... 

Lexica 

polysemy 

Disambiguation 

Mapping 



Mapping 

CG flowchart 

Syntax 

Mapping 

PSG 

Dep. 

External 

e.g.DTT 

Substitution 

external 

modules 

tagger

Text flow normalisation

Text flow: problems 

● nesting and overlapping markers (the latter also 

problematic in xml) 

● focus marker é_que (2% of turns) transcribed as que -> 

need for disambiguation 

● syntax needs (separate) prepositions 

built-in ordinary contractions: do, nele, pelo ... 

corpus-specific: pa (pra), pro, pum, naquea ... 

difficult ambiguity: pra (para vs. para_a) 

● post-tokenization (coral-inter) with support from 

normalization lexicon for the most difficult contractions, 

e.g. né = não é

Lexical and orthographical 

normalization 

● Parser's treatment of unknow wordforms: 

Affix derivations 

Variants: br vs. pt, accents, orthogr. reform 

● Special needs for C-ORAL speech corpus 

phonetically transcribed word forms 

grammatical variants (-amos -> -amo) 

● Solutions: two-level annotation and specialized standardisation modules 

meninim OALT menininho [menino] N M S 

coral.inter: second preprocessor with systematic and item-based changes 

postlex_pt: postprocessor with morphological analyzer using separate lexicon (2000 

entries) and overriding PALAVRAS' heuristics

(a1) emedebê MDB (phonetic abbreviations) 

(b3) inda ainda ( (word-initial changes) 

(b4) roz arroz 

(d2) fazido feito (overregularization) 

● Multi-word strings: effect also tokenization, but help disambiguate their 

parts, e.g. 

n' = não (not em) in: n'=era, n'=ocê 

● Non-systematic new words and names: 

(a1) fazeção N F S 

(a2) zenes N M P # termo de jogo 

(a3) caça-talentos N M S 

(a5) superbem-arrumada ADJ F S 

(b) mil-oitocentos-e=vovó=gostosa NUM M/F P 

(c1) remote N M S # estrangeirismo 

(c2) completed ADJ M/F S/P # estrangeirismo 

(c5) anche ADV # estrangeirismo 

(d1) tu=tu X # onomatopéia 

(e2) TIM PROP F S # company 

(e3) Timoftol PROP M S

unknown / nonstandard 

input 

coral.pre 

tokenization 

PALAVRAS preprocessor 

e.g. complex prp, names 

coral.inter 

normalization 

 

PALAVRAS analyzer 

heuristics 

postlex_pt 

analysis of 

”unknown” words 

CG disambiguation 

● Add-on lexicon used to add 

readings rather than override 

the parsers, allowing for 

contextual disambiguation, even 

of unintended ambiguity: 

pô --> pôs ---> verb vs. 

interjection plural (allowed in C- 

ORAL) 

CG syntax 

pt_forms 

_coral 

newlex_pt 

dependency

Syntax 

● Problem: syntactic noise: ah, eeh, uh 

Solution: two-level annotation 

● Problem:Syntactic annotation needs long-distance contexts, so how can existing rules 

be made to work on a speech corpus 

> 80% unbounded/global CG rules in syntax 

but the corpus lacks sentence segmentation and punctuation to delimit these rules 

Solution: Exploit prosodic information by not moving it to a meta-level, but rather 

change it into punctuation 

● // (major prosodic break) --> semicolon 

● / (soft prosodic break) -->

prosodic ”break markers”: 

rule-based disambiguation 

● --> comma 

● --> meta-level 

(a) between a noun or a nominative pronoun or a conjunction to the 

left, and a finite verb to the right, a prosodic /-marker is treated 

as (subject - verb case) 

(b) prosodic /-markers between a noun and another np are treated 

as (appositions) 

(c) / between a prenominal and its head is treated as (np 

cohesion), e.g. 388 cases of article + 

... tipo Zé=Mourinho falando 

assim não 

o [o] DET M S @>N 

 

[campeonato] N M S @SUBJ> 

d' OALT de [de] PRP @N< 

ocês OALT vocês [você] PERS M/F 3P NOM/PIV @P< 

é [ser] V PR 3S IND VFIN @FMV

Evaluation 

● random transcription file (1895 tokens) 

● eval_cg tool raw analysis file and revised version 

● challenge: alignment in the face of punctuation ambiguity

Effectiveness of using prosodic break 

markers as punctuation 

● standard run: pause/break disambiguation 

● no-break: /-marks ignored 

● no-sentence: both /- and //-marks ignored 

● all-break: all /-marks as commas, no disambiguation

using prosodic breaks for syntax: 

results 

● prosodic break markers do help the parser 

● more so for syntax than PoS/morphology (wider contextual 

scope with corresponding segmentation needs) 

● pause/break disambiguation more relevant for syntax than PoS

graphical search interface

concordances

statistics

contrastive searches: N dela - N dele

comparative search across corpora

cqp speak: [pos="ADJ" & func="N

Conclusions 

● A standard written-language parser (PALAVRAS) can be used to 

assign morphosyntactic tags to transcribed speech data 

● For optimal performans, certain adaptations are necessary: 

orthographical normalization 

lexicon extensions 

syntactic segmentation (disambiguation of prosodic breaks) 

● Under optimal conditions, almost-normal F-scores can be 

achieved (98.6% PoS, 95% syntax, 99% lemmatization)

Outlook 

● Annotation scheme does preserve the original prosodictranscriptional 

information (speech flow, retractions, overlaps 

etc.) encoded as metatagging alongside morphosyntactic tags, 

but how to make this accessible to GUI searches 

searchable now on with grammar: prosodic breaks 

not searchable now: speech flow, retractions, overlaps 

● Wish lish: higher level annotation: 

Dependency (done) 

semantic classes and roles 

anaphoric relations ....

References 1 

Bick, Eckhard. 2000. The Parsing System Palavras - Automatic Grammatical Analysis of 

Portuguese in a Constraint Grammar Framework, Aarhus: Aarhus University Press 

Bick, Eckhard. 1998. Tagging Speech Data - Constraint Grammar Analysis of Spoken 

Portuguese, in: Proceedings of the 17th Scandinavian Conference of Linguistics (Odense 

1998) 

Bick, Eckhard & Marcelo Módolo. 2005. Letters and Editorials: A grammatically 

annotated corpus of 19th century Brazilian Portuguese. In: Claus Pusch & Johannes 

Kabatek & Wolfgang Raible (eds.) Romance Corpus Linguistics II: Corpora and Historical 

Linguistics (Proceedings of the 2nd Freiburg Workshop on Romance Corpus stics, Sept. 

2003). pp. 271-280. Tübingen: Gunther Narr Verlag. 

Brill, Eric. 1992. A simple rule-based part of speech tagger. In: Proceedings of the 

workshop on Speech and Natural Language. HLT '91, Morristown, NJ, USA: Association 

for Computational Linguistics, pp.112–116 

Castilho, Ataliba de (ed.), 1993. Gramática do Português Falado, vol.3, Campinas: 

Editora da Unicamp. 

Johannessen, Janne Bondi, Joel Priestley, Kristin Hagen, Tor Anders Åfarli, and Øystein 

Alexander Vangsnes. 2009. The Nordic Dialect Corpus - an Advanced Research Tool. In 

Jokinen, Kristiina and Eckhard Bick (eds.): Proceedings of the 17th Nordic Conference of 

Computational Linguistics NODALIDA 2009. NEALT Proceedings Series Volume 4 

Karlsson, Fred & Voutilainen, Atro & Heikkilä, Juka & Anttila, Arto. 1995. Constraint 

Grammar, A Language-Independent System for Parsing Unrestricted Text. Berlin: Mouton 

de Gruyter.

References 2 

Maamouri, Mohamed et al. 2010. From Speech to Trees: Applying Treebank Annotation 

to Arabic Broadcast News. In: Proceedings of LREC 2010, Valletta, Malta, May 2010. 

Moreno, A. & J.M. Guirão. 2003. "Tagging a spontaneous speech corpus of Spanish". In: 

Proceedings of the International Conference on Recent Advances in Natural Language 

Processing.). Borovets, Bulgaria, 2003. p. 292-296. 

Müürisep, Kaili and Uibo, Heli (2006). "Shallow Parsing of Spoken Estonian Using 

Constraint Grammar". In: P.J.Henriksen & P.R.Skadhauge, Proceedings of NODALIDA- 

2005 special session on treebanking. Copenhagen Studies in Language #33/2006 

Moneglia, M., A. Panunzi, E. Picchi, 2004, Using PiTagger for Lemmatization and PoS 

Tagging of a Spontaneous Speech Corpus : C-Oral-Rom Italian. In M.T. Lino et al. (eds.), 

Proceedings of the 4th LREC Conference, vol. 2, ELRA, Paris, pp. 563-566. 

Raso, Tommaso & Heliana Mello. 2010. The C-ORAL BRASIL corpus. In: Massimo 

Moneglia & Alessandro Panunzi (eds): Bootstrapping Infromation from Corpora in a 

Cross-Linguistic Perspective. Universitá degli studi di Firenze, Biblioteca Digitale. 

Schmid, Helmut. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. 

Proceedings of the International Conference on New Methods in Language Processing 

1994. pp. 44-49.

visl.sdu.dk (parsers) 

corp.hum.sdu.dk (corpora)

Morphosyntactic annotation of the C-ORAL Brasil Corpus - VISL

Create successful ePaper yourself

Delete template?

Save as template?