The Lessac Technologies Time Domain Diphone Parametric Synthesis System for Microcontrollers for Blizzard Challenge 2012

Mike Baumgartner, Reiner Wilhelms-Tricarico, John Reichenbach
{mike.baumgartner, reiner.wilhelms, john.reichenbach}@lessactech.com

Abstract

Advances in the capabilities of microcomputer systems have opened the door to new approaches to real-time speech synthesis. In the past, diphone synthesis was a popular synthesis method. More recently, unit selection speech synthesis has afforded higher quality synthesis, mainly by eliminating the need for significant signal processing and thus preventing the signal processing artifacts that are the consequence of speech segment modifications. Instead, unit selection synthesis consists substantially of real segments of unaltered speech. It was hoped that voice databases large enough to provide sufficient recorded sections of speech would give adequate coverage for any utterance required. Even as unit selection databases have become considerably larger, the goal of constructing natural speech entirely from segments of unaltered speech has still fallen short of expectations. The Blizzard Challenge provides a measure to quantify how large a difference in quality the newer unit selection approaches have achieved compared to the older diphone synthesis methods. This diphone synthesis system is also an example of working towards high quality synthesis that still runs on very limited hardware resources.

Index Terms: speech synthesis, Blizzard Challenge, diphone

1. Introduction

Initially, our intent was to explore whether advances in front-end labeling systems, such as the Lesseme system used by Lessac Technologies in its unit selection system, could also be used to drive a very small, compact parametric synthesizer suitable for running on ten-dollar microcontrollers. We explored multiple parametric synthesis approaches, but mainly for reasons of time and available resources we eventually defaulted to building a demonstration system based on many years of diphone work.

Why enter an older technology in the Blizzard Challenge? Because this time-domain diphone synthesis was designed for low-cost microcontroller applications and was expected to provide usable quality. As we learned from the listening test results, this diphone system is less natural than current unit selection systems. However, while among the lower-ranking systems, this diphone synthesizer ranked only slightly lower than most of the conventional unit selection systems when evaluated on an MOS basis for sentences. For the longer, paragraph-length sections of synthesis, this time-domain diphone synthesis does not measure up as well.

This is a what-if scenario: if used in a low-cost system with limited resources, what difference in quality can be expected compared to the best unit selection systems currently available? Blizzard Challenge 2012 is not an ideal structure for evaluating time-domain diphone parametric synthesis. The John Greenman Librivox voice is not a phonetically balanced corpus, and the voice corpus was recorded in relatively poor conditions with a fair amount of background noise.
The mp3 lossy compression of the original source for the voice corpus introduced difficulties in pitch-marking and signal discrimination. Despite these hurdles, we were successful in building a moderate-quality voice with comparatively few, but still highly noticeable, artifacts. The results of the Blizzard Challenge were helpful in gauging what could be expected.

2. Voice Building

There were some technical issues with selecting very small segments of speech from the several thousand prompts that were suitable for building a unit selection voice. A diphone database generally consists of one copy of each small segment of speech. Instead of outputting whole phones and joining between phones, the diphone system joins, in theory, at the middle of the phone where the spectra are stable (Olive et al. 1998). Therefore all combinations of phone-to-phone segments of speech that can occur need to be included in the database.

This system started with the Lessemes used in the Lessac unit selection voice and converted them into 46 phones. If all possible combinations occurred in the English language, there would be 46 x 46, or 2,116, diphones needed. In practice, the figure is closer to 1,500. The challenge is to find the example of each diphone that is most representative of that diphone, that will match in duration, pitch and spectra without any signal processing, and that is most likely to join smoothly with other diphones in numerous linguistic contexts.
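As a rough, hypothetical sketch of this inventory bookkeeping (the toy phone labels, label-file layout, and function names are illustrative assumptions, not the actual Lessac tooling), a diphone map of the kind described below can be assembled by scanning per-utterance phone label sequences:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Diphone = Tuple[str, str]

def build_diphone_map(utterances: List[List[str]]) -> Dict[Diphone, List[Tuple[int, int]]]:
    """Map each diphone (adjacent phone pair) to every (utterance, position) where it occurs."""
    occurrences: Dict[Diphone, List[Tuple[int, int]]] = defaultdict(list)
    for utt_index, phones in enumerate(utterances):
        for i in range(len(phones) - 1):
            occurrences[(phones[i], phones[i + 1])].append((utt_index, i))
    return occurrences

# Toy usage with a made-up label set; the real inventory has 46 phones,
# giving a theoretical 46 * 46 = 2,116 diphones, of which roughly 1,500 occur.
utterances = [["pau", "hh", "ax", "l", "ow", "pau"],
              ["pau", "l", "ow", "n", "pau"]]
diphone_map = build_diphone_map(utterances)
phone_set = {p for utt in utterances for p in utt}
print(f"{len(diphone_map)} of {len(phone_set) ** 2} possible diphones observed")

# The first occurrence of each diphone can serve as the default candidate;
# the more frequent diphones are then replaced by hand-picked candidates.
default_choice = {diphone: occs[0] for diphone, occs in diphone_map.items()}
print(default_choice[("l", "ow")])   # (0, 3): utterance 0, phones 3-4
```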


Using available unit selection toolsets, one automatic way to achieve this would be to build a unit selection voice with all the data, then limit the selection of speech segments to diphones only, and then run the unit selection synthesis over text corpora large enough to cover all diphone combinations in a statistically meaningful sample. A diphone use log could be amassed, and a diphone selection could then be made from the statistics of use. Because of the unit selection process, this might lead to multiple diphone paths that do not converge to a single best choice per diphone.

The voice building approach for this Blizzard Challenge synthesis was done in a very simple fashion. A set of programs was written to step through the phone label files and build up a diphone map. We loaded the map and either selected the diphones by hand or defaulted to the first diphone on the map. The selection program allowed testing the sound of a diphone when fitted to others. The more frequently occurring diphones were selected by hand, and absent particularly bad results, the less frequently occurring diphones defaulted to the first one that occurred in the map. Because of time constraints, the first one in the map was used fairly often. One might assume that with such a large database the number of available examples of each diphone would be distributed more or less evenly. This was not the case.

After the diphone selections were made, the diphone database was assembled and tested by synthesis. Areas that did not have suitable-sounding phone targets were noted and better diphones were substituted. Unlike with unit selection, since the same small set of diphone segments is used over and over again, poorly chosen diphones cause a very noticeable degradation in synthesis quality.

The goal was also to limit selections that involved substantial phonetic co-articulation effects. Diphones from such locations in the acoustic data sound good in some linguistic contexts, but quite out of place and mismatched in others.

The favorable thing about a diphone voice is that changes can be incorporated almost instantly, using a current Pentium-class system to build the voice database.

3. Text to Phone, Prosodic and Duration Section

One of the goals of this entry was to limit differences in quality in the parametric parameters driving this diphone synthesis. We hoped to use an unchanged Lessac front-end to provide the parameters that drive the diphone back-end synthesizer. Then only the synthesis section would differ when compared to other systems.

One of the potential target applications for this system is low-cost audio book synthesis. Instead of the thousands of bytes per second required by speech compression systems, tens of bytes per second suffice for phones and pitch information. Then, instead of one book per device, a library of books could be held. In other words, most target applications would not require a full text-to-speech system; only the synthesis section would be included.

It was hoped that the parameters that drive the large Lessac system could be used just the same. It was discovered, however, that the pitch targets that drive the larger Lessac unit selection system were not sufficient to provide a complete, high-quality set of parameters for diphone synthesis. Unit selection does not change the localized phone pitch in speech segments. Since unit selection uses larger sections of real speech, the natural localized phone pitch variations are automatically included in the final synthesis. This is one of the good qualities of unit selection. It is not the case with diphone synthesis: without the minor localized pitch variations inherent in human speech, our initial diphone synthesizer sounded quite flat and robotic.

To include localized pitch variations in this parametric synthesis, a localized pitch variation map was constructed using the Nancy voice from last year's Blizzard Challenge. The entire database phone map was constructed with the pitch variations. Then, using a dynamic-programming-like technique, the largest sections of phone sequences matching the phone sequence of the required utterance were built up, and these localized variations were added to the pitch targets. In essence, the Nancy voice was driving the subtle prosodic elements of the parametric synthesis, and some of the personality of the Nancy voice could be recognized in the synthesis. The downside to this technique was that sections with the same phone sequences in close proximity tended to have unnatural pitch repetitions. This could have been avoided by adding code that ensured different areas of the localized phone variation map were used for each utterance chunk.
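One minimal way to organize that lookup is sketched below, using a greedy longest-match search as a stand-in for the dynamic-programming-like technique described above; the donor data layout (phone labels paired with per-phone pitch deviations) and all names are illustrative assumptions rather than the actual implementation.

```python
from typing import List, Tuple

def longest_match(donor_phones: List[str], target: List[str], start: int) -> Tuple[int, int]:
    """Return (donor position, length) of the longest donor run matching target[start:]."""
    best_pos, best_len = -1, 0
    for pos in range(len(donor_phones)):
        length = 0
        while (start + length < len(target)
               and pos + length < len(donor_phones)
               and donor_phones[pos + length] == target[start + length]):
            length += 1
        if length > best_len:
            best_pos, best_len = pos, length
    return best_pos, best_len

def localized_pitch_deviations(donor: List[Tuple[str, float]], target: List[str]) -> List[float]:
    """Copy per-phone pitch deviations from a donor voice onto a target phone sequence,
    covering the target with the largest matching donor phone runs."""
    donor_phones = [phone for phone, _ in donor]
    deviations: List[float] = []
    i = 0
    while i < len(target):
        pos, length = longest_match(donor_phones, target, i)
        if length == 0:
            deviations.append(0.0)   # no donor context: leave the pitch target unchanged
            i += 1
        else:
            deviations.extend(dev for _, dev in donor[pos:pos + length])
            i += length
    return deviations

# Toy usage: the returned deviations would be added to the diphone synthesizer's pitch targets.
donor = [("hh", 2.0), ("ax", -1.5), ("l", 0.5), ("ow", 3.0)]
print(localized_pitch_deviations(donor, ["ax", "l", "ow", "zh"]))  # [-1.5, 0.5, 3.0, 0.0]
```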
4. Synthesis Section

Like the Lessac unit selection synthesis, the diphones are concatenated in a process that works entirely in the time domain. The concatenation of voiced sounds is done pitch-synchronously, and some mutual adjustments of the two sounds being concatenated are made to increase coherence and to reduce clicks and warbles. The overall pitch and duration changes are then applied to match the given parameters.

There are limits to the quality of a diphone system. The better you can discern mismatches by ear, the better synthesis you will have. With the approach we used for building a parametric diphone synthesizer, diphone selection was based entirely on listening and accepting or rejecting individual diphones. Therefore, as with the Lessac unit selection, it is the choice of the units or diphones used, and not the synthesis and concatenation technique itself, that makes for better synthesis. Unlike the larger unit selection systems, where the units to be concatenated are chosen in real time based on join and target costs, in this diphone parametric system the unit choices are made ahead of time when building the voice, and at most one unit exists in the system for each diphone that actually appears. Had we had the time to select the diphones for the voice more carefully and optimally, one could expect a modest improvement in quality.
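As a toy sketch of the idea of a pitch-synchronous, time-domain join (not the system's actual algorithm, whose mutual adjustments and pitch and duration modifications are more involved; the function, signature, and windowing choice are assumptions for illustration only), two voiced segments can be overlap-added over one pitch period around their boundary pitch marks:

```python
import numpy as np

def join_pitch_synchronously(a: np.ndarray, b: np.ndarray,
                             a_last_mark: int, b_first_mark: int,
                             period: int) -> np.ndarray:
    """Join two voiced segments by overlap-adding one pitch period around their
    boundary pitch marks, keeping the join pitch-synchronous and reducing clicks."""
    n = period
    fade_out = 0.5 * (1.0 + np.cos(np.pi * np.arange(n) / n))   # 1 -> 0 raised-cosine ramp
    fade_in = 1.0 - fade_out                                    # 0 -> 1
    head = a[:a_last_mark]                       # segment A up to its final pitch mark
    a_tail = a[a_last_mark:a_last_mark + n]
    b_head = b[b_first_mark:b_first_mark + n]
    a_tail = np.pad(a_tail, (0, n - len(a_tail)))   # zero-pad if a window runs past the end
    b_head = np.pad(b_head, (0, n - len(b_head)))
    overlap = a_tail * fade_out + b_head * fade_in
    tail = b[b_first_mark + n:]                  # remainder of segment B after the join period
    return np.concatenate([head, overlap, tail])

# Toy usage with synthetic segments at a fixed 100-sample pitch period.
period = 100
t = np.arange(1000)
seg_a = np.sin(2 * np.pi * t / period)
seg_b = 0.8 * np.sin(2 * np.pi * t / period)
joined = join_pitch_synchronously(seg_a, seg_b, a_last_mark=900, b_first_mark=100, period=period)
```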


6. Conclusion

In the past, companies would often choose voice talents based on specific voice characteristics that were thought to synthesize better, or with fewer artifacts. For example, many TTS voices have a sharp vocal quality that helps mask artifacts that would otherwise occur in the synthesis. By choosing voice talents that already have these attributes in their natural voice, it is less anomalous when the synthesizer produces more of these attributes as a result of the synthesis process.

For the Blizzard Challenge 2012, each participant built a John Greenman voice. Given that this voice had been reconstructed from mp3 lossy digital compression, and was not selected for voice characteristics that were likely to synthesize well, it may not have been an ideal voice for building a diphone synthesizer. From the listening test plot results, it is obvious that our diphone system was outclassed in this competition. However, despite the poor showing, when compared to synthesizers of just a few years ago this compact, low-overhead diphone synthesizer might compare quite favorably.

Despite the poor ranking, many goals were achieved. We demonstrated that the prosody characteristics of another voice can be used to drive and enhance parametric synthesis. While doing this enhanced the perceived naturalness of the synthesized voice, it probably hurt the similarity scores relative to the original voice talent. We demonstrated that a diphone voice can be extracted from a large voice corpus in a brief period of time. We also demonstrated that many of the tools and techniques developed for unit selection approaches to voice synthesis can be used in building diphone voices. And we clearly demonstrated that a diphone voice can be built without the buzziness of LPC synthesis.

We hope to demonstrate that by choosing a voice model that better meets the needs of diphone synthesis, the perceived quality will improve. We hope that further development of the diphone approach, or of other parametric approaches to synthesis, will allow us to approach the quality of unit selection systems while retaining the very small size and footprint of this diphone parametric system.

The perceived synthesis quality, while lower than that of unit selection systems, may show some promise for certain target applications. Though not a mainstream product, it would be nice to have a library of audio books that could be heard on a ten-dollar device. The buttons on such a device could have voice prompts, so someone who is sight-impaired or driving a car could make selections without looking. If the synthesis footprint is compact, and based only on phones and pitch information, toys such as those given away at fast food restaurants could include a small audio book as part of the package in the not-too-distant future. This is becoming more possible as memory densities grow in low-cost devices. It appears that diphone synthesis might still be a usable speech synthesis technique.
