13.07.2015 Views

Proceedings Fonetik 2009 - Institutionen för lingvistik

Proceedings Fonetik 2009 - Institutionen för lingvistik

Proceedings Fonetik 2009 - Institutionen för lingvistik

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>Proceedings</strong>, FONETIK <strong>2009</strong>, Dept. of Linguistics, Stockholm UniversityThe bandwidth cost is equal to the bandwidthin Hertz. The frequency deviation cost is definedas the square of the distance to a givenreference frequency, which is formant, speaker,and phoneme dependent. This requires the labellingof the input before the formant trackingis carried out. Finally, the frequency changecost penalizes rapid changes in formant frequenciesto make sure that the extracted trajectoriesare smooth.Although only the first four formants areused in the unit library, five formants are extracted.The fifth formant is then discarded. Thejustification for this is to ensure reasonable valuesfor the fourth formant. The algorithm alsointroduces eight times over-sampling beforeaveraging, giving a reduction of the variance ofthe estimated formant frequencies. After theextraction, the data is down-sampled to 100 Hz.Synthesis SystemsThree parametric synthesis systems were exploredin the experiments described below. Thefirst was our rule-based traditional system,which has been used for many years in ourgroup as a default parametric synthesis system.It includes rules for both prosodic and contextdependent segment realizations. Several methodsto create formant trajectories have been exploredduring the development of this system.Currently simple linear trajectories in a logarithmicdomain are used to describe the formants.Slopes and target positions are controlledby the transformation rules.The second rule system, the adapted system,was based on the traditional system andadapted to a reference speaker. This speakerwas also used to develop the data-driven unitlibrary. Default formant values for each vowelwere estimated based on the unit library, andthe default rules in the traditional system werechanged accordingly. It is important to emphasizethat it is the vowel space that was datadriven and adapted to the reference speaker andnot the rules for contextual variation.Finally, the third synthesis system, the gesturesystem, was based on the adapted system,but includes concatenated formant gesturesfrom the data-driven unit library. Thus, both theadapted system and the gesture system are datadrivensystems with varying degree of mix betweenrules and data. The next section will discussin more detail the concatenation processthat we employed in our experiments.100 %0 %x % x %phonemephonemeruleunitFigure 2. Mixing proportions between a unit and arule generated parameter track. X=100% equals thephoneme duration.Parameter concatenationThe concatenation process in the gesture systemis a simple linear interpolation between therule generated formant data and the possiblejoining units from the library. At the phonemeborder the data is taken directly from the unit.The impact of the unit data is gradually reducedinside the phoneme. At a position X the influenceof the unit has been reduced to zero (Figure2). The X value is calculated relative to thesegment duration and measured in % of thesegment duration. The parameters in the middleof a segment are thus dependent on both rulesand two units.Technical evaluationA test corpus of 313 utterances was selected tocompare predicted and estimated formant dataand analyse how the X position influences thedifference. The utterances were collected in theIST project SpeeCon (Großkopf et al., 2002)and the speaker was the same as the referencespeaker behind the unit library. As a result, theadapted system also has the same referencespeaker. In total 4853 phonemes (60743 10 msframes) including 1602 vowels (17508 frames)were used in the comparison.A number of versions of each utterancewere synthesized, using the traditional system,the adapted system and the unit system withvarying values of X percent. The label filesfrom the SpeeCon project were used to makethe duration of each segment equal to the recordings.An X value of zero in the unit systemwill have the same formant tracks as theadapted system. Figure 3 shows the results ofcalculating the city-block distance between thesynthesized and measured first three formantsin the vowel frames.88

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!