19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...


A linguistics-based speech corpus

Janne Bondi Johannessen 1, Øystein Alexander Vangsnes 2, Joel Priestley 1, Kristin Hagen 1

Department of Linguistics and Nordic Studies, University of Oslo 1

CASTL, University of Tromsø 2

P.O.Box 1102 Blindern, UiO, 0317 Oslo, Norway 1

Faculty of Humanities, UiT, N-9037 Tromsø, Norway 2

E-mail: jannebj@iln.uio.no, oystein.vangsnes@uit.no, joeljp@gmail.com, kristiha@iln.uio.no

Abstract<br />

In this paper we focus on the linguistic basis for the choices made in the Nordic Dialect Corpus. We cover transcriptions, annotations, selection of informants, the recording situation, the user-friendly search interface, links to audio and video, and the various ways of viewing results, including maps.

Keywords: linguistic basis, speech corpus, dialectology, Nordic languages, user-friendliness, transcription, tagging, maps.

1. Introduction<br />

In this paper we will discuss and show the Nordic Dialect Corpus (Johannessen et al. 2009), which was launched in November 2011. The corpus design grew out of years of discussions among linguists about such issues as transcriptions and desirable search features. The corpus was subsequently developed in close cooperation between technicians and linguists. It is designed to facilitate studies of variation; this is more costly, but also more rewarding. We have focused on user-friendliness throughout, and also on recoverability of the data (links to audio and video, two-level transcription).

The paper focuses on the choices made by the linguists and dialectologists.

2. About the Nordic Dialect Corpus<br />

The Nordic Dialect Corpus (NDC) was planned by linguists from universities in six countries (Denmark, the Faroe Islands, Finland, Iceland, Norway, and Sweden) within the research network Scandinavian Dialect Syntax (ScanDiaSyn). The aim was to collect large amounts of speech data and make them available in a corpus for easy access across the Nordic countries. There were two reasons why this was a good idea. First, the Nordic (North Germanic) languages are very similar to each other, and their dialects can be argued to form a large dialect continuum. Second, dialect syntax had for decades been a neglected field of research in Nordic dialectology, and the hope was that large amounts of new material would make new studies possible.

The work started in 2006 and the corpus was launched in 2011. It covers five languages (Danish, Faroese, Icelandic, Norwegian, and Swedish). Most of the recordings were made after 2000, but some of the Norwegian ones date from 1950–80. There are altogether 223 recording places and 794 informants.

The overall number of transcribed words is 2.74 million. The corpus has been very costly to build because of the manpower needed. As an example, transcribing the Norwegian part alone required 14 people, and more than 35 people have been involved in the recording work in Norway alone, which included a lot of travel and organising. The Swedish recordings were given to us by an earlier phonetics/phonology project, Swedia 2000, which was a great asset. Along the way, several national research councils, Nordic funds, and individual universities have contributed. The Text Laboratory at the University of Oslo has been responsible for the technical development.

We know of no other speech corpus that combines, as the NDC does, two-level transcriptions, an easy search interface, direct links to audio and video, maps, result handling, and availability on the web. We refer to Johannessen et al. (2009) for a comparison with other corpora, and likewise to the proceedings of the GSCP 2012 International Conference on Speech and Corpora, Belo Horizonte, Brazil (see URL in reference list).

3. Recordings and informants

Since the goal of the project was to facilitate dialectological research, we put much effort into finding suitable informants. The places were selected carefully to represent all dialect types. Given the differences between the countries, the number of recording places needed also differs: in Norway, for example, there are 110 (for the modern recordings, i.e. from 2006–2011), in Denmark only 15.

The informants were to represent the
