Proceedings Fonetik 2009 - Department of Linguistics


Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

speech using a very small footprint. Perhaps formant synthesis will again be an important research subject because of its flexibility, and also because the formant synthesis approach can be compressed into a limited application environment.

A combined approach for acoustic speech synthesis

The efforts to combine data-driven and rule-based methods in the KTH text-to-speech system have been pursued in several projects. In a study by Högberg (1997), formant parameters were extracted from a database and structured with the help of classification and regression trees. The synthesis rules were adjusted according to predictions from the trees. In an evaluation experiment the synthesis was tested and judged to be more natural than the original rule-based synthesis.

Sjölander (2001) expanded the method into replacing complete formant trajectories with manually extracted values, and also included consonants. According to a feasibility study, this synthesis was perceived as more natural-sounding than the rule-only synthesis (Carlson et al., 2002). Sigvardson (2002) developed a generic and complete system for unit selection using regression trees, and applied it to the data-driven formant synthesis. In Öhlin & Carlson (2004) the rule system and the unit library are more clearly separated, compared to our earlier attempts. However, by keeping the rule-based model we also keep the flexibility to make modifications and the possibility to include both linguistic and extra-linguistic knowledge sources.

Figure 1 illustrates the approach in the KTH text-to-speech system. A database is used to create a unit library, and the library information is mixed with the rule-driven parameters. Each unit is described by a selection of extracted synthesis parameters together with linguistic information about the unit's original context and linguistic features such as stress level.
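A library unit as described above bundles extracted synthesis parameters with linguistic information about its original context. A minimal sketch of such a record, with all field names and the formant-tuple layout assumed for illustration (the actual KTH library format is not specified here):

```python
from dataclasses import dataclass, field

# Hypothetical record for one diphone unit in the library. Field names
# and types are illustrative assumptions, not the actual KTH format.
@dataclass
class DiphoneUnit:
    label: str                    # diphone identity, e.g. "a-t"
    left_context: str             # phone preceding the unit in the recording
    right_context: str            # phone following the unit
    stress_level: int             # linguistic feature, e.g. 0 = unstressed
    formants: list = field(default_factory=list)  # per-frame (F1..F4) tuples
    manually_edited: bool = False # extracted values may be hand-corrected

unit = DiphoneUnit(label="a-t", left_context="k", right_context="e",
                   stress_level=1,
                   formants=[(650.0, 1200.0, 2500.0, 3400.0)])
```

The linguistic fields (context, stress level) are what the unit selection module matches against the features supplied by the text-to-parameter module.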
The parameters can be extracted automatically and/or edited manually.

In our traditional text-to-speech system the synthesizer is controlled by rule-generated parameters from the text-to-parameter module (Carlson et al., 1982). The parameters are represented by time and value pairs, including labels and prosodic features such as duration and intonation. In the current approach some of the rule-generated parameter values are replaced by values from the unit library. The process is controlled by the unit selection module, which takes into account not only parameter information but also linguistic features supplied by the text-to-parameter module. The parameters are normalized and concatenated before being sent to the GLOVE synthesizer (Carlson et al., 1991).

[Figure 1. Rule-based synthesis system using a data-driven unit library. The diagram shows a database analyzed into extracted parameters for the unit library; unit selection combines library features with the rule-generated parameters produced by the text-to-parameter module from the input text, and the concatenated, unit-controlled parameters are sent to the synthesizer for speech output.]

Creation of a unit library

In the current experiments a male speaker recorded a set of 2055 diphones in a nonsense-word context. A unit library was then created based on these recordings.

When creating a unit library of formant frequencies, automatic methods of formant extraction are of course preferred, due to the amount of data that has to be processed. However, available methods do not always perform adequately. With this in mind, an improved formant extraction algorithm, using segmentation information to lower the error rate, was developed (Öhlin & Carlson, 2004). It is akin to the algorithms described in Lee et al. (1999), Talkin (1989) and Acero (1999).

Segmentation and alignment of the waveform were first performed automatically with nAlign (Sjölander, 2003). Manual correction was required, especially on vowel–vowel transitions. The waveform is divided into (overlapping) time frames of 10 ms.
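The replace-normalize-concatenate step described above can be sketched as follows. The function names, the (time, value) track representation, and the linear time rescaling used as "normalization" are illustrative assumptions, not the actual KTH modules:

```python
# Sketch: mix rule-generated parameter tracks with unit-library values.
# Tracks are lists of (time_s, value) pairs; the linear time rescaling
# below is an assumed stand-in for the normalization step.

def normalize_unit(unit_track, target_start, target_end):
    """Rescale a unit's time axis to span [target_start, target_end]."""
    t0, t1 = unit_track[0][0], unit_track[-1][0]
    scale = (target_end - target_start) / (t1 - t0)
    return [(target_start + (t - t0) * scale, v) for t, v in unit_track]

def mix_tracks(rule_track, unit_track, span):
    """Replace rule-generated values inside span = (start, end) with the
    unit's values; keep the rule-generated values elsewhere."""
    start, end = span
    kept = [(t, v) for t, v in rule_track if t < start or t > end]
    replaced = normalize_unit(unit_track, start, end)
    return sorted(kept + replaced)

# Rule-generated F1 track, and a library unit with its own time base.
rule_f1 = [(0.00, 500.0), (0.05, 520.0), (0.10, 540.0), (0.15, 530.0)]
unit_f1 = [(0.30, 490.0), (0.35, 515.0)]
mixed = mix_tracks(rule_f1, unit_f1, (0.05, 0.10))
```

After mixing, `mixed` keeps the rule values outside the span and carries the unit's values inside it, ready for concatenation with neighbouring units.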
At each frame, an LPC model of order 30 is created; the poles are then searched through with the Viterbi algorithm in order to find the path (i.e. the formant trajectory) with the lowest cost. The cost is defined as the weighted sum of a number of partial costs: the bandwidth cost, the frequency deviation cost, and the frequency change cost.
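The lowest-cost path search can be sketched as a small dynamic program over per-frame pole candidates. The weights, the nominal frequency, and the candidate representation below are assumptions for illustration; a real tracker would derive the (frequency, bandwidth) candidates from the angles and radii of the order-30 LPC poles:

```python
import numpy as np

# Sketch of Viterbi formant tracking. Each frame holds pole candidates as
# (frequency_hz, bandwidth_hz) pairs. Weights and f_nominal are assumed
# values; the paper does not give the actual cost weights.

def track_formant(candidates, f_nominal=500.0,
                  w_bw=1.0, w_dev=0.5, w_change=1.0):
    """Return the frequency trajectory minimizing the weighted sum of
    bandwidth, frequency deviation, and frequency change costs."""
    # Local cost per candidate: bandwidth cost + frequency deviation cost.
    local = [[w_bw * bw + w_dev * abs(f - f_nominal) for f, bw in frame]
             for frame in candidates]
    best = [local[0][:]]   # best[t][j]: min cost ending at candidate j
    back = []              # backpointers for path recovery
    for t in range(1, len(candidates)):
        row, ptr = [], []
        for j, (f, _) in enumerate(candidates[t]):
            # Transition cost penalizes frame-to-frame frequency change.
            costs = [best[t - 1][i] + w_change * abs(f - candidates[t - 1][i][0])
                     for i in range(len(candidates[t - 1]))]
            i_min = int(np.argmin(costs))
            row.append(costs[i_min] + local[t][j])
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)
    # Backtrack the lowest-cost path.
    j = int(np.argmin(best[-1]))
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j][0] for t, j in enumerate(path)]

frames = [[(480.0, 60.0), (1500.0, 300.0)],
          [(510.0, 70.0), (1480.0, 320.0)],
          [(530.0, 65.0), (900.0, 400.0)]]
print(track_formant(frames))  # → [480.0, 510.0, 530.0]
```

The low-bandwidth, slowly varying candidates win, which is exactly the behaviour the three partial costs are designed to reward.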
