Proceedings Fonetik 2009 - Department of Linguistics


Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

speech using a very small footprint. Perhaps formant synthesis will again be an important research subject because of its flexibility, and also because the formant synthesis approach can be compressed into a limited application environment.

A combined approach for acoustic speech synthesis

The efforts to combine data-driven and rule-based methods in the KTH text-to-speech system have been pursued in several projects. In a study by Högberg (1997), formant parameters were extracted from a database and structured with the help of classification and regression trees. The synthesis rules were adjusted according to predictions from the trees. In an evaluation experiment the synthesis was tested and judged to be more natural than the original rule-based synthesis.

Sjölander (2001) expanded the method into replacing complete formant trajectories with manually extracted values, and also included consonants. According to a feasibility study, this synthesis was perceived as more natural-sounding than the rule-only synthesis (Carlson et al., 2002). Sigvardson (2002) developed a generic and complete system for unit selection using regression trees, and applied it to the data-driven formant synthesis. In Öhlin & Carlson (2004) the rule system and the unit library are more clearly separated, compared to our earlier attempts. However, by keeping the rule-based model we also keep the flexibility to make modifications and the possibility to include both linguistic and extra-linguistic knowledge sources.

Figure 1 illustrates the approach in the KTH text-to-speech system. A database is used to create a unit library, and the library information is mixed with the rule-driven parameters. Each unit is described by a selection of extracted synthesis parameters together with linguistic information about the unit's original context and linguistic features such as stress level.
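A library unit as described above bundles extracted synthesis parameters with linguistic information about its original context. A minimal sketch of such a record, with all field names and the formant-tuple layout assumed for illustration (the actual KTH library format is not specified here):

```python
from dataclasses import dataclass, field

# Hypothetical record for one diphone unit in the library. Field names
# and types are illustrative assumptions, not the actual KTH format.
@dataclass
class DiphoneUnit:
    label: str                    # diphone identity, e.g. "a-t"
    left_context: str             # phone preceding the unit in the recording
    right_context: str            # phone following the unit
    stress_level: int             # linguistic feature, e.g. 0 = unstressed
    formants: list = field(default_factory=list)  # per-frame (F1..F4) tuples
    manually_edited: bool = False # extracted values may be hand-corrected

unit = DiphoneUnit(label="a-t", left_context="k", right_context="e",
                   stress_level=1,
                   formants=[(650.0, 1200.0, 2500.0, 3400.0)])
```

The linguistic fields (context, stress level) are what the unit selection module matches against the features supplied by the text-to-parameter module.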
The parameters can be extracted automatically and/or edited manually.

In our traditional text-to-speech system the synthesizer is controlled by rule-generated parameters from the text-to-parameter module (Carlson et al., 1982). The parameters are represented by time and value pairs, including labels and prosodic features such as duration and intonation. In the current approach some of the rule-generated parameter values are replaced by values from the unit library. The process is controlled by the unit selection module, which takes into account not only parameter information but also linguistic features supplied by the text-to-parameter module. The parameters are normalized and concatenated before being sent to the GLOVE synthesizer (Carlson et al., 1991).

[Figure 1. Rule-based synthesis system using a data-driven unit library. The diagram shows a database analyzed into extracted parameters for the unit library; unit selection combines library features with the rule-generated parameters produced by the text-to-parameter module from the input text, and the concatenated, unit-controlled parameters are sent to the synthesizer for speech output.]

Creation of a unit library

In the current experiments a male speaker recorded a set of 2055 diphones in a nonsense-word context. A unit library was then created based on these recordings.

When creating a unit library of formant frequencies, automatic methods of formant extraction are of course preferred, due to the amount of data that has to be processed. However, available methods do not always perform adequately. With this in mind, an improved formant extraction algorithm, using segmentation information to lower the error rate, was developed (Öhlin & Carlson, 2004). It is akin to the algorithms described in Lee et al. (1999), Talkin (1989) and Acero (1999).

Segmentation and alignment of the waveform were first performed automatically with nAlign (Sjölander, 2003). Manual correction was required, especially on vowel–vowel transitions. The waveform is divided into (overlapping) time frames of 10 ms.
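The replace-normalize-concatenate step described above can be sketched as follows. The function names, the (time, value) track representation, and the linear time rescaling used as "normalization" are illustrative assumptions, not the actual KTH modules:

```python
# Sketch: mix rule-generated parameter tracks with unit-library values.
# Tracks are lists of (time_s, value) pairs; the linear time rescaling
# below is an assumed stand-in for the normalization step.

def normalize_unit(unit_track, target_start, target_end):
    """Rescale a unit's time axis to span [target_start, target_end]."""
    t0, t1 = unit_track[0][0], unit_track[-1][0]
    scale = (target_end - target_start) / (t1 - t0)
    return [(target_start + (t - t0) * scale, v) for t, v in unit_track]

def mix_tracks(rule_track, unit_track, span):
    """Replace rule-generated values inside span = (start, end) with the
    unit's values; keep the rule-generated values elsewhere."""
    start, end = span
    kept = [(t, v) for t, v in rule_track if t < start or t > end]
    replaced = normalize_unit(unit_track, start, end)
    return sorted(kept + replaced)

# Rule-generated F1 track, and a library unit with its own time base.
rule_f1 = [(0.00, 500.0), (0.05, 520.0), (0.10, 540.0), (0.15, 530.0)]
unit_f1 = [(0.30, 490.0), (0.35, 515.0)]
mixed = mix_tracks(rule_f1, unit_f1, (0.05, 0.10))
```

After mixing, `mixed` keeps the rule values outside the span and carries the unit's values inside it, ready for concatenation with neighbouring units.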
At each frame, an LPC model of order 30 is created; the poles are then searched through with the Viterbi algorithm in order to find the path (i.e. the formant trajectory) with the lowest cost. The cost is defined as the weighted sum of a number of partial costs: the bandwidth cost, the frequency deviation cost, and the frequency change cost.
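The lowest-cost path search can be sketched as a small dynamic program over per-frame pole candidates. The weights, the nominal frequency, and the candidate representation below are assumptions for illustration; a real tracker would derive the (frequency, bandwidth) candidates from the angles and radii of the order-30 LPC poles:

```python
import numpy as np

# Sketch of Viterbi formant tracking. Each frame holds pole candidates as
# (frequency_hz, bandwidth_hz) pairs. Weights and f_nominal are assumed
# values; the paper does not give the actual cost weights.

def track_formant(candidates, f_nominal=500.0,
                  w_bw=1.0, w_dev=0.5, w_change=1.0):
    """Return the frequency trajectory minimizing the weighted sum of
    bandwidth, frequency deviation, and frequency change costs."""
    # Local cost per candidate: bandwidth cost + frequency deviation cost.
    local = [[w_bw * bw + w_dev * abs(f - f_nominal) for f, bw in frame]
             for frame in candidates]
    best = [local[0][:]]   # best[t][j]: min cost ending at candidate j
    back = []              # backpointers for path recovery
    for t in range(1, len(candidates)):
        row, ptr = [], []
        for j, (f, _) in enumerate(candidates[t]):
            # Transition cost penalizes frame-to-frame frequency change.
            costs = [best[t - 1][i] + w_change * abs(f - candidates[t - 1][i][0])
                     for i in range(len(candidates[t - 1]))]
            i_min = int(np.argmin(costs))
            row.append(costs[i_min] + local[t][j])
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)
    # Backtrack the lowest-cost path.
    j = int(np.argmin(best[-1]))
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j][0] for t, j in enumerate(path)]

frames = [[(480.0, 60.0), (1500.0, 300.0)],
          [(510.0, 70.0), (1480.0, 320.0)],
          [(530.0, 65.0), (900.0, 400.0)]]
print(track_formant(frames))  # → [480.0, 510.0, 530.0]
```

The low-bandwidth, slowly varying candidates win, which is exactly the behaviour the three partial costs are designed to reward.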
