
Using A Global Corpus Data Model for Linguistic and Phonetic Research

Christoph Draxler

BAS Bavarian Archive for Speech Signals
Institute of Phonetics and Speech Processing
Ludwig-Maximilian University Munich, Germany
draxler@phonetik.uni-muenchen.de

Abstract

This paper presents and discusses the global corpus data model in WikiSpeech for linguistic and phonetic data used at the BAS. The data model is implemented using a relational database system. Two case studies illustrate how the database is used. In the first case study, audio recordings performed via the web in Scotland are accessed to carry out formant analyses of Scottish English vowels. In the second case study, the database is used for online perception experiments on regional variation of speech sounds in German. In both cases, the global corpus data model has proved to be an effective means of providing data in the required formats for the tools used in the workflow, and of allowing the same database to be used for very different types of applications.

1. Introduction

The Bavarian Archive for Speech Signals (BAS) has collected a number of small and large speech databases using standalone and web-based tools, e.g. Ph@ttSessionz (Draxler, 2006), VOYS (Dickie et al., 2009), ALC (Schiel et al., 2010), and others. Although these corpora were mainly established to satisfy the needs of speech technology development, they are now increasingly used in phonetic and linguistic basic research as well as in education. This is facilitated by the fact that these speech databases are demographically controlled, well-documented and quite inexpensive for academic research.

The workflow for the creation of speech corpora consists of the steps specification, collection or recording, signal processing, annotation, postprocessing and distribution or exploitation. Each step involves using dedicated tools, e.g. SpeechRecorder (Draxler and Jänsch, 2004) to record audio, Praat (Boersma, 2001), ELAN (Sloetjes et al., 2007), EXMARaLDA (Schmidt and Wörner, 2005), or WebTranscribe (Draxler, 2005) to annotate it, libassp (libassp.sourceforge.net), sox (sox.sourceforge.net) or Praat to perform signal analysis tasks, and Excel, R and others to carry out statistical computations (Figure 2).

All tools use their own data formats. Some tools do provide import and export of other formats, but in general there is some loss of information in going from one tool to the other (Schmidt et al., 2009). A manual conversion of formats is time-consuming and error-prone, and very often the researchers working with the data do not have the programming expertise to convert the data. Finally, if each tool provides its own import and export converters, the number of such converters increases dramatically with every new tool: in the worst case, n tools require on the order of n(n-1) pairwise converters.

We have thus chosen a radical approach: we store all data, independent of any application program, in a database system using a global corpus data model in our WikiSpeech system (Draxler and Jänsch, 2008).

[Figure 1 diagram: entities Project, Session, Recording-Script, Section, Recording, Signal, Organization, Person, Speaker, Annotation-Project, Annotation-Session, Annotation, Tier and Segment; roles Administrator, Technician and Annotator; tools SpeechRecorder and Praat.]

Figure 1: Simplified view of the global corpus data model in WikiSpeech. The dashed lines show which parts of the data model are relevant to given tools. For example, Praat handles signal files and knows about annotation tiers and (time-aligned) segments on these tiers. SpeechRecorder knows recording projects that contain recording sessions which bring together speakers and recording scripts; a script contains sections which in turn are made up of recording items which result in signal files.

This global data model goes beyond the general annotation graph framework for annotations proposed by Liberman and Bird, which provides a formal description of the data structures underlying both time-aligned and non-time-aligned annotations (Bird and Liberman, 2001). In our global corpus data model, annotations consist of tiers which in turn contain segments; segments may be time-aligned, with or without duration, or symbolic, i.e. without time data. Within a tier, segments are ordered sequentially, and between tiers there exist 1:1, 1:n and n:m relationships.
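To make the relational implementation concrete, the following minimal sketch sets up the annotation part of such a model in SQLite; all table and column names are illustrative assumptions made for this example and do not reproduce the actual WikiSpeech schema.

import sqlite3

# Illustrative sketch only: table and column names are assumptions for this
# example, not the actual WikiSpeech/BAS schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    annotation_id INTEGER PRIMARY KEY,
    recording_id  INTEGER NOT NULL            -- links the annotation to a signal file
);

CREATE TABLE tier (
    tier_id       INTEGER PRIMARY KEY,
    annotation_id INTEGER NOT NULL REFERENCES annotation(annotation_id),
    name          TEXT NOT NULL                -- e.g. an orthographic or segmental tier
);

-- Segments are ordered within their tier; start and duration are NULL for
-- purely symbolic segments, and duration may be 0 for point events.
CREATE TABLE segment (
    segment_id    INTEGER PRIMARY KEY,
    tier_id       INTEGER NOT NULL REFERENCES tier(tier_id),
    position      INTEGER NOT NULL,
    start_ms      INTEGER,
    duration_ms   INTEGER,
    label         TEXT
);

-- n:m links between segments of different tiers; 1:1 and 1:n are special cases.
CREATE TABLE segment_link (
    parent_segment_id INTEGER NOT NULL REFERENCES segment(segment_id),
    child_segment_id  INTEGER NOT NULL REFERENCES segment(segment_id),
    PRIMARY KEY (parent_segment_id, child_segment_id)
);
""")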

Besides annotations, our global corpus data model also covers audio and video recordings, technical information on the recordings, time-based and symbolic annotations on arbitrarily many annotation tiers, metadata on the speakers and the database contents, and administrative data on the annotators (Figure 1).
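As an example of generating tool-specific input from one central store, the sketch below dumps the time-aligned segments of one tier as a CSV table that can be read directly into R with read.csv(); it runs against the illustrative schema sketched above, and the function name and output layout are assumptions for this example rather than the actual WikiSpeech export code.

import csv
import sqlite3

def export_tier_csv(conn: sqlite3.Connection, tier_name: str, path: str) -> None:
    # Write the time-aligned segments of one annotation tier to a CSV file,
    # using the illustrative schema sketched above.
    rows = conn.execute(
        """
        SELECT a.recording_id, s.position, s.start_ms, s.duration_ms, s.label
        FROM segment s
        JOIN tier t       ON s.tier_id = t.tier_id
        JOIN annotation a ON t.annotation_id = a.annotation_id
        WHERE t.name = ? AND s.start_ms IS NOT NULL
        ORDER BY a.recording_id, s.position
        """,
        (tier_name,),
    )
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["recording", "position", "start_ms", "duration_ms", "label"])
        writer.writerows(rows)

# Example call; the resulting table can be loaded in R with read.csv("segments.csv"):
# export_tier_csv(conn, "segmental", "segments.csv")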
