Proceedings Fonetik 2009 - Institutionen för lingvistik

Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

The bandwidth cost is equal to the bandwidth in Hertz. The frequency deviation cost is defined as the square of the distance to a given reference frequency, which is formant, speaker, and phoneme dependent. This requires the labelling of the input before the formant tracking is carried out. Finally, the frequency change cost penalizes rapid changes in formant frequencies to make sure that the extracted trajectories are smooth.

Although only the first four formants are used in the unit library, five formants are extracted. The fifth formant is then discarded. The justification for this is to ensure reasonable values for the fourth formant. The algorithm also introduces eight times over-sampling before averaging, giving a reduction of the variance of the estimated formant frequencies. After the extraction, the data is down-sampled to 100 Hz.

Synthesis Systems

Three parametric synthesis systems were explored in the experiments described below. The first was our rule-based traditional system, which has been used for many years in our group as a default parametric synthesis system. It includes rules for both prosodic and context-dependent segment realizations. Several methods to create formant trajectories have been explored during the development of this system. Currently, simple linear trajectories in a logarithmic domain are used to describe the formants. Slopes and target positions are controlled by the transformation rules.

The second rule system, the adapted system, was based on the traditional system and adapted to a reference speaker. This speaker was also used to develop the data-driven unit library. Default formant values for each vowel were estimated based on the unit library, and the default rules in the traditional system were changed accordingly.
It is important to emphasize that it is the vowel space that was data driven and adapted to the reference speaker, and not the rules for contextual variation.

Finally, the third synthesis system, the gesture system, was based on the adapted system, but includes concatenated formant gestures from the data-driven unit library. Thus, both the adapted system and the gesture system are data-driven systems with a varying degree of mix between rules and data. The next section will discuss in more detail the concatenation process that we employed in our experiments.

[Figure 2: diagram of the weighting between unit and rule data across a phoneme, from 100% to 0%, with the position x% marked from each phoneme border.]
Figure 2. Mixing proportions between a unit and a rule-generated parameter track. X=100% equals the phoneme duration.

Parameter concatenation

The concatenation process in the gesture system is a simple linear interpolation between the rule-generated formant data and the possible joining units from the library. At the phoneme border the data is taken directly from the unit. The impact of the unit data is gradually reduced inside the phoneme. At a position X the influence of the unit has been reduced to zero (Figure 2). The X value is calculated relative to the segment duration and measured in % of the segment duration. The parameters in the middle of a segment are thus dependent on both rules and two units.

Technical evaluation

A test corpus of 313 utterances was selected to compare predicted and estimated formant data and analyse how the X position influences the difference. The utterances were collected in the IST project SpeeCon (Großkopf et al., 2002) and the speaker was the same as the reference speaker behind the unit library. As a result, the adapted system also has the same reference speaker. In total 4853 phonemes (60743 10-ms frames) including 1602 vowels (17508 frames) were used in the comparison.

A number of versions of each utterance were synthesized, using the traditional system, the adapted system and the unit system with varying values of X percent.
The label files from the SpeeCon project were used to make the duration of each segment equal to the recordings. An X value of zero in the unit system will have the same formant tracks as the adapted system. Figure 3 shows the results of calculating the city-block distance between the synthesized and measured first three formants in the vowel frames.
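
The mixing of rule and unit data and the distance measure used in the comparison can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function names (`unit_weight`, `mix_tracks`, `city_block_distance`) are hypothetical, the interpolation is assumed to be done directly in Hz, the rule weight is assumed to be whatever remains after the two unit weights, and the distance is assumed to be the absolute difference averaged over the formants and then over the frames.

```python
def unit_weight(distance, x):
    """Weight of a unit at a relative distance (0..1) from its phoneme border.

    x is the position, as a fraction of the segment duration, at which the
    unit's influence has dropped to zero. x = 0 is assumed to mean rules only.
    """
    if x <= 0.0:
        return 0.0
    return max(0.0, 1.0 - distance / x)


def mix_tracks(rule, left_unit, right_unit, x_percent):
    """Linearly mix a rule-generated track with the two joining unit tracks.

    All three tracks are lists of values (e.g. one formant in Hz) with one
    entry per frame of the same phoneme. Assumption: the rule data receives
    the weight left over by the two units.
    """
    if not 0.0 <= x_percent <= 100.0:
        raise ValueError("x_percent must lie between 0 and 100")
    x = x_percent / 100.0
    n = len(rule)
    mixed = []
    for i in range(n):
        t = i / (n - 1) if n > 1 else 0.0   # relative position in the phoneme
        w_left = unit_weight(t, x)          # unit joined at the left border
        w_right = unit_weight(1.0 - t, x)   # unit joined at the right border
        w_rule = 1.0 - w_left - w_right     # non-negative for x <= 1
        mixed.append(w_rule * rule[i]
                     + w_left * left_unit[i]
                     + w_right * right_unit[i])
    return mixed


def city_block_distance(synth_frames, measured_frames):
    """Mean city-block distance between two sequences of formant frames.

    Each frame is a list of formant frequencies in Hz (e.g. F1 to F3).
    The absolute differences are averaged over the formants in a frame,
    then over all frames.
    """
    total = 0.0
    for synth, measured in zip(synth_frames, measured_frames):
        total += sum(abs(s - m) for s, m in zip(synth, measured)) / len(synth)
    return total / len(synth_frames)
```

Under these assumptions, `mix_tracks(rule, left_unit, right_unit, 0)` returns the rule track unchanged, which matches the statement that an X value of zero gives the same formant tracks as the adapted system.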
Figure 4 presents a detailed analysis of the data for the unit system with X=70%. The first formant has an average distance of 68 Hz with a standard deviation of 43 Hz. Corresponding data for F2 is (107 Hz, 81 Hz), F3 (111 Hz, 68 Hz) and F4 (136 Hz, 67 Hz).

Clearly the adapted speaker has a quite different vowel space compared to the traditional system. Figure 5 presents the distance calculated on a phoneme-by-phoneme basis. The corresponding standard deviations are 66 Hz, 58 Hz and 46 Hz for the three systems.

As expected, the difference between the traditional system and the adapted system is quite large. The gesture system results in about a 25% error reduction in the formant frequencies due to the inclusion of the stored gestures. However, whether this reduction corresponds to a difference in perceived quality cannot be predicted on the basis of these data. The difference between the adapted and the gesture system is quite interesting and of the same magnitude as the adaptation data. The results clearly indicate how the gesture system is able to mimic the reference speaker in more detail than the rule-based system. The high standard deviation indicates that a more detailed analysis should be performed to find the problematic cases. Since the test data, as usual, is hampered by errors in the formant tracking procedures, we will inherently introduce an error in the comparison. In a few cases, despite our efforts, we have a problem with pole and formant number assignments.

Perceptual evaluation

A pilot test was carried out to evaluate the naturalness in the three synthesis systems: traditional, adapted and gesture. Nine subjects working in the department were asked to rank the three systems according to perceived naturalness using a graphic interface. The subjects had been exposed to parametric speech synthesis before. Three versions of twelve utterances including single words, numbers and sentences were ranked.
The traditional rule-based prosodic model was used for all stimuli. In total 324 = 3*12*9 judgements were collected. The result of the ranking is presented in Figure 6.

[Figure 3: bar chart of the frame-by-frame city-block distance (mean of three formants) in Hz for the Traditional, Adapted and Gesture systems, the latter at X = 20%, 40%, 60%, 70%, 80% and 100%. Value labels in the extracted plot, in order of appearance: 170, 124, 115, 103, 93, 90, 89, 87.]
Figure 3. Comparison between synthesized and measured data (frame by frame).

[Figure 4: bar chart of the phoneme-by-phoneme distance in Hz, per formant (f1 to f4), for the Traditional, Adapted and Gesture 70% systems.]
Figure 4. Comparisons between synthesized and measured data for each formant (phoneme by phoneme).

[Figure 5: bar chart of the phoneme-by-phoneme city-block distance (mean of three formants) in Hz for the Traditional, Adapted and Gesture 70% systems. Value labels in the extracted plot, in order of appearance: 175, 131, 95.]
Figure 5. Comparison between synthesized and measured data (phoneme by phoneme).