Proceedings Fonetik 2009 - Institutionen för lingvistik


Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

Exploring data driven parametric synthesis

Rolf Carlson 1, Kjell Gustafson 1,2
1 KTH, CSC, Department of Speech, Music and Hearing, Stockholm, Sweden
2 Acapela Group Sweden AB, Solna, Sweden

Abstract

This paper describes our work on building a formant synthesis system based on both rule-generated and database-driven methods. Three parametric synthesis systems are discussed: our traditional rule-based system, a speaker-adapted system, and finally a gesture system. The gesture system is a further development of the adapted system in that it includes concatenated formant gestures from a data-driven unit library. The systems are evaluated technically by comparing the formant tracks with an analysed test corpus. The gesture system yields a 25% error reduction in the formant frequencies due to the inclusion of the stored gestures. Finally, a perceptual evaluation shows a clear advantage in naturalness for the gesture system compared to both the traditional system and the speaker-adapted system.

Introduction

Current speech synthesis efforts, both in research and in applications, are dominated by methods based on concatenation of spoken units. Research on speech synthesis is to a large extent focused on how to model efficient unit selection and unit concatenation, and on how optimal databases should be created. The traditional research efforts on formant synthesis and articulatory synthesis have shrunk to a very small discipline due to the success of waveform-based methods. Despite the well-motivated current research path resulting in high-quality output, some efforts on parametric modelling are carried out at our department. The main reasons are flexibility in speech generation and a genuine interest in the speech code.
We try to combine corpus-based methods with knowledge-based models and to explore the best features of each of the two approaches. This report describes our progress in this synthesis work.

Parametric synthesis

Underlying articulatory gestures are not easily transformed to the acoustic domain described by a formant model, since the articulatory constraints are not directly included in a formant-based model. Traditionally, parametric speech synthesis has been based on very labour-intensive optimization work. The notion of analysis by synthesis has not been explored except through manual comparisons between hand-tuned spectral slices and a reference spectrum. When raising our ambitions to multi-lingual, multi-speaker and multi-style synthesis, it is obvious that we want to find at least semi-automatic methods to collect the necessary information, using speech and language databases. The work by Holmes and Pearce (1990) is a good example of how to speed up this process. With the help of a synthesis model, the spectra are automatically matched against analysed speech. Automatic techniques such as this will probably also play an important role in making speaker-dependent adjustments. One advantage of these methods is that the optimization is done in the same framework as that to be used in production. The synthesizer constraints are thus already imposed in the initial state.

If we want to keep the flexibility of the formant model but reduce the need for detailed formant synthesis rules, we need to extract formant synthesis parameters directly from a labelled corpus. Already more than ten years ago, at Interspeech in Australia, Mannell (1998) reported a promising effort to create a diphone library for formant synthesis. The procedure included a speaker-specific extraction of formant frequencies from a labelled database.
In a sequence of papers from Utsunomiya University, Japan, automatic formant tracking has been used to generate speech synthesis of high quality using formant synthesis and an elaborate voice source (e.g. Mori et al., 2002). Hertz (2002) and Carlson and Granström (2005) report recent research efforts to combine data-driven and rule-based methods. The approaches take advantage of the fact that a unit library can better model detailed gestures than the general rules.

In a few cases we have seen a commercial interest in speech synthesis using the formant model. One motivation is the need to generate
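The technical evaluation summarized in the abstract compares synthesized formant tracks against an analysed test corpus and reports a 25% error reduction. The paper does not give the exact metric; the sketch below shows one plausible way such numbers could be computed, assuming a simple mean absolute formant-frequency error over time-aligned frames. The function names and the relative-reduction formula are our assumptions, not the authors' published procedure.

```python
import numpy as np

def formant_track_error(ref, hyp):
    """Mean absolute formant-frequency error in Hz.
    ref, hyp: arrays of shape (n_frames, n_formants), time-aligned."""
    ref = np.asarray(ref, dtype=float)
    hyp = np.asarray(hyp, dtype=float)
    return float(np.mean(np.abs(ref - hyp)))

def error_reduction(err_baseline, err_system):
    """Relative error reduction; 0.25 corresponds to a 25% reduction."""
    return 1.0 - err_system / err_baseline

# Tiny illustration with made-up F1/F2 tracks (Hz), two frames each.
ref = [[500.0, 1500.0], [520.0, 1480.0]]
hyp = [[510.0, 1490.0], [500.0, 1500.0]]
err = formant_track_error(ref, hyp)        # 15.0 Hz
reduction = error_reduction(200.0, 150.0)  # 0.25
```

In practice the comparison would be restricted to voiced, reliably tracked frames; how frames are selected and aligned strongly affects such error figures.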
