19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Figure 4: Different syntactic analyses for self-repairs II

node, as is the interregnum. This reflects the idea that all should belong to the same unit of analysis. This particular representation, however, makes it hard to recover the well-formed sentence from the utterance and also gives misleading search results for linguistic queries looking for, e.g., the number of direct objects governed by a sentence node.

Figure 5: Different syntactic analyses for self-repairs III

Figure 5 solves this problem by attaching the repair directly to the sentence node (S). The interregnum is annotated as part of the reparandum, which (as in Figure 3) is annotated as a fragment (FRAG). Filled pauses, however, can also occur without a self-repair. We therefore decided to treat all fillers the same and not to include them in the FRAG node containing the reparandum. This results in our preferred representation, shown in Figure 6. Our solution is similar to the treatment of self-repair in the Switchboard corpus (Figure 7).

Figure 6: Different syntactic analyses for self-repairs IV

Figure 7: Syntactic representation of a self-repair in the Switchboard corpus (RM: start of disfluency; RS: end of disfluency; IP: interruption point)

In the Switchboard corpus, square brackets are used to mark the beginning (RM) and end (RS) of a self-repair sequence (Figure 7), and the interruption point (IP), which occurs directly before the interregnum, is assigned the plus (+) symbol. The reparandum is attached to a node with the label EDITED (comparable to our FRAG node). Removing this EDITED node, which includes all material from the start of the reparandum up to the interruption point, recovers a well-formed version of the utterance.
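The recovery step described for Switchboard can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bracketed tree below is an invented toy example, not taken from the corpus, the helper names `parse` and `words` are hypothetical, and dropping the RS marker together with the EDITED node is our assumption about how the closing bracket would be handled.

```python
import re

def parse(bracketed):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # terminal (word or marker)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()

def words(tree, drop=frozenset()):
    """Return the terminals left to right, skipping subtrees whose label is in `drop`."""
    label, children = tree
    if label in drop:
        return []
    out = []
    for child in children:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(words(child, drop))
    return out

# Invented toy tree: the reparandum "I" sits under EDITED, the repair "I" follows.
tree = parse(
    "(S (EDITED (RM [) (NP (PRP I)) (IP +)) "
    "(NP (PRP I)) (RS ]) (VP (VBP like) (NP (NN jazz))))"
)

print(words(tree))                          # ['[', 'I', '+', 'I', ']', 'like', 'jazz']
print(words(tree, drop={"EDITED", "RS"}))   # ['I', 'like', 'jazz']
```

The same call with `drop={"FRAG"}` would apply to a tree that uses a FRAG label for the reparandum, provided the tree is available in bracketed form.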

Our final representation (Figure 6) allows us to consider reparandum and repair as a functional unit and to conduct corpus searches for filled pauses inside specific constituents, e.g. comparing the occurrences of filled pauses inside NPs to those at the sentence level. The annotation in Switchboard also explicitly encodes the end of the repair, which our annotation does not. We agree that it would be desirable to have this information but, considering the additional effort for manual annotation, refrain from encoding it. Unlike Switchboard, we also annotate unfilled pauses in the corpus, differentiating between short pauses (< 1 sec), medium pauses (1–3 sec) and long pauses (> 3 sec). Our annotation enables the user to identify and discard the fragments and recover only the well-formed part of the utterance, if need be.
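The three-way distinction between unfilled pauses can be written down as a small function. This is a minimal sketch, not the annotation tool used for the corpus: the function name `classify_pause` is hypothetical, and treating durations of exactly 1 and exactly 3 seconds as medium is our reading of the stated ranges.

```python
def classify_pause(duration_sec: float) -> str:
    """Map an unfilled pause duration (in seconds) to short, medium or long."""
    if duration_sec < 0:
        raise ValueError("pause duration must be non-negative")
    if duration_sec < 1.0:
        return "short"    # < 1 sec
    if duration_sec <= 3.0:
        return "medium"   # 1-3 sec; both boundaries assumed inclusive
    return "long"         # > 3 sec

for d in (0.4, 1.0, 2.5, 3.0, 4.2):
    print(d, classify_pause(d))
```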

Repetitions of (sequences of) words without a self-repair also occur frequently in spoken language data. The repeated material can occur at any position in the utterance and, like the self-repair, inserts extra material that causes problems for syntactic analysis.
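To make the notion of a repetition concrete, the following sketch flags word sequences that are immediately repeated in a tokenised utterance. It is an illustration only and not part of the annotation procedure described here: the function name `find_repetitions`, the `max_len` limit and the example utterance are all assumptions, and a purely string-based check like this would also flag intended repetitions that an annotator would not treat as disfluencies.

```python
def find_repetitions(tokens, max_len=3):
    """Return (start, length) for each span that is immediately followed by a copy of itself.

    Longer spans are tried first, so "I want I want" is reported once as a
    two-word repetition rather than as two unrelated matches.
    """
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            if i + 2 * n <= len(tokens) and tokens[i:i + n] == tokens[i + n:i + 2 * n]:
                hits.append((i, n))
                i += n  # continue at the second copy
                break
        else:
            i += 1
    return hits

utterance = "I want I want to go to to the store".split()
print(find_repetitions(utterance))  # [(0, 2), (6, 1)]
```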

The treatment of repetitions in the Switchboard corpus is illustrated in Figure 8 and looks similar to the treatment of self-corrections. The repeated material is attached to the EDITED node, as is the interruption point. The square brackets mark the beginning and end of the repetition, and deleting the EDITED node transforms the utterance into a well-formed sentence.

Figure 8: Representation of repetitions in the Switchboard corpus
