Proceedings Fonetik 2009 - Institutionen för lingvistik


Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

acoustic cues (e.g. phonotactics, syllabic rhythm) are most salient in identifying languages (Ramus & Mehler, 1999). In these types of applications, however, stimuli have been created once and there has been no need for real-time processing.

The computer-assisted language learning system VILLE (Wik, 2004) includes an exercise that involves modified re-synthesis. Here, the segments in the speech produced by the user are manipulated in terms of duration, i.e. stretched or shortened, immediately after recording. On the surface, this application shares several traits with the application suggested in this paper. However, more extensive manipulation is required to turn one phoneme into another, which is the goal for the system described here.

Purpose

The purpose of this study was to find out whether it is at all possible to remove the initial voiceless plosive from a recorded syllable and replace it with an “artificial” segment so that the result sounds natural. The “artificial” segment is artificial in the sense that it was never produced by the speaker, but constructed or retrieved from somewhere else. As voiceless plosives generated by formant synthesizers are known to lack naturalness (Carlson & Granström, 2005), retrieving the target segment from a speech database was considered a better option.

Method

Material

The Swedish version of the Speecon corpus (Iskra et al., 2002) was used as a speech database, from which target phonemes were selected. This corpus contains data from 550 adult speakers of both genders and of various ages. The speech in this corpus was recorded simultaneously by four different microphones, in different environments, at a 16 kHz sampling frequency with 16-bit resolution. For this study, only the recordings made by a close headset microphone (Sennheiser ME104) were used. No restrictions were placed on gender, age or recording environment. From this data, only utterances starting with an initial voiceless plosive (/p/, /t/ or /k/) followed by a vowel were selected. This resulted in a speech database consisting of 12 857 utterances (see Table 1 for details). Henceforth, this speech database will be referred to as “the target corpus”.

For the remaining part of the re-synthesis, a small corpus of 12 utterances spoken by a female speaker was recorded with a Sennheiser m@b 40 microphone at a 16 kHz sampling frequency with 16-bit resolution. The recordings were made in a relatively quiet office environment. Three utterances (/tat/, /kak/ and /pap/) were recorded four times each. This corpus will be referred to as “the remainder corpus”.

Table 1. Number of utterances in the target corpus.

                          Nbr of utterances
Utterance-initial /pV/    2 680
Utterance-initial /tV/    4 562
Utterance-initial /kV/    5 614
Total                     12 857

Re-synthesis

Each step in the re-synthesis process is described in the following paragraphs.

Alignment

For aligning the corpora (the target corpus and the remainder corpus), the NALIGN aligner (Sjölander, 2003) was used.

Feature extraction

For the segments in the target corpus, features were extracted at the last frame before the middle of the vowel following the initial plosive. For the segments in the remainder corpus, features were extracted at the first frame after the middle of the vowel following the initial plosive. The extracted features were the same as those described by Hunt & Black (1996), i.e. MFCCs, log power and F0. The Snack tool SPEATURES (Sjölander, 2009) was used to extract 13 MFCCs. F0 and log power were extracted using the Snack tools PITCH and POWER, respectively.

Calculation of join cost

Join costs between all possible speech segment combinations (i.e. all combinations of a target segment from the target corpus and a remainder segment from the remainder corpus) were calculated as the sum of

1. the Euclidean distance (Taylor, 2008) in F0
2. the Euclidean distance in log power
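As a rough illustration of the join cost calculation, the following Python sketch sums per-feature Euclidean distances between one target segment and one remainder segment. It covers only the two terms listed in this excerpt (F0 and log power); the function names, the dictionary layout and the numeric values are hypothetical and are not taken from the system described here. Further features can be added through the `features` argument.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def join_cost(target, remainder, features=("f0", "log_power")):
    """Sum of per-feature Euclidean distances at the join point.

    `target` and `remainder` map feature names to vectors (hypothetical
    layout); scalar features are stored as one-element lists.
    """
    return sum(euclidean(target[name], remainder[name]) for name in features)


def rank_targets(targets, remainder, features=("f0", "log_power")):
    """Return (cost, target_id) pairs sorted from lowest to highest cost."""
    return sorted(
        (join_cost(feats, remainder, features), target_id)
        for target_id, feats in targets.items()
    )


# Made-up feature values, for illustration only.
remainder_segment = {"f0": [200.0], "log_power": [10.0]}
target_segments = {
    "t1": {"f0": [200.0], "log_power": [10.0]},
    "t2": {"f0": [203.0], "log_power": [14.0]},
}

ranking = rank_targets(target_segments, remainder_segment)
```

With these values, `ranking[0]` holds the target segment with the lowest join cost against the given remainder segment. For a one-dimensional feature such as F0, the Euclidean distance reduces to the absolute difference.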
