Text-to-speech man-machine interface in embedded systems

Speech outline
• Introduction
• Speech generation in humans
• Artificial speech synthesis – the basics
• Speech synthesis for embedded systems
• Our results so far


Introduction
• Various man-machine interfaces:
  • Levers / pedals / pulleys
  • Buttons
  • Keyboard and mouse
  • Touch screen
  • 3D gloves / movement tracking
• Which one is the best?
• None


Introduction
• The ideal man-machine interface is SPEECH
• The best way to make a machine do what you want is to tell it what you want
• The ideal talking machine: HAL from 2001: A Space Odyssey
• It’s 2010 now and there is still a long way to go
• A lot has already been done – this is what my talk will be about
• Two technologies stand behind this:
  • ASR – Automatic Speech Recognition
  • TTS – Text-to-Speech Synthesis


Human speech
• Speech consists of a well-controlled sequence of sounds
• These sounds are generated by the speech apparatus
• Two groups of organs:
  • voice-producing (phonatory) organs
  • articulatory organs


Human speech
• The phonatory organs drive speech production:
  • Lungs
  • Larynx – vocal folds
• The articulatory organs (vocal tract) shape the spectrum of the sound and give it phonetic character:
  • pharynx, velum, palate, nasal cavity, tongue, teeth and lips


Human speech
• The vocal folds vibrate to produce the human voice
• The vocal tract shapes the spectrum
• [Figure: waveforms of the phone “Z” (mixed voicing), the phone “S” (unvoiced) and the phone “I” (voiced)]


Speech synthesis
• The easiest way to make robots speak is by recording human speech and playing it back when needed:
  • Trivial
  • High quality
  • Limited use
• True speech synthesis lets robots generate unbounded speech output (i.e. speak their mind):
  • Text-to-Speech synthesis
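The playback approach above amounts to a lookup table. A minimal sketch (the phrases and filenames are invented for illustration): anything not recorded in advance simply cannot be spoken, which is what motivates true synthesis.

```python
# Toy illustration of the playback approach: a fixed table of
# prerecorded prompts (phrases and filenames are invented).
PROMPTS = {
    "doors closing": "doors_closing.wav",
    "next stop: hamburg": "next_stop_hamburg.wav",
}

def play(phrase):
    """Return the recording for a known phrase, or None if it was never recorded."""
    return PROMPTS.get(phrase.lower())
```

`play("Doors closing")` finds a recording, but any unseen sentence returns `None` — unbounded output needs synthesis.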


Text to Speech (TTS) Synthesis
• The artificial synthesis of speech based on unrestricted text input
• Three basic paradigms:
  • Articulatory
  • Formant
  • Concatenative


Articulatory TTS Synthesis
• Articulatory synthesis is based on a complete 3D model of the human speech apparatus
• It uses the acoustic parameters extracted from the model to synthesize speech


Articulatory TTS Synthesis
• Peter Birkholz’s VocalTractLab – www.vocaltractlab.de


Articulatory TTS Synthesis
• “Der Zug hat eine Stunde Verspätung” (“The train is one hour late”)
• “Nächster Halt: Hamburg” (“Next stop: Hamburg”)


Articulatory TTS Synthesis
• Articulatory synthesis is the most powerful paradigm, but also the most complex
• Building the model calls for thorough analysis of MRI (Magnetic Resonance Imaging) scans of speech production
• High on intelligibility, but low on naturalness


Formant TTS Synthesis
• Formant synthesis takes a black-box modeling approach to speech production
• It analyzes the end transfer function of the vocal tract, rather than the way it is built
• Formants are the vocal tract’s resonant frequencies; they give the phonetic character to speech sounds
• [Figure: base harmonic and first two formants of the vowels U, O, A, E, I; spectrum of the voiced phone “I”]


Formant TTS Synthesis
• The driving force behind formant TTS synthesis is the source-filter model
• [Diagram: a pulse train (vocal folds) or a noise source (vocal tract constrictions), selected by a voiced/unvoiced switch and scaled by a gain (lung pressure), drives a time-varying digital filter (vocal tract parameters) to produce speech]
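The source-filter model can be sketched in a few lines: a glottal pulse train is passed through a cascade of second-order resonators standing in for the vocal tract. This is an illustrative toy, not DECtalk’s or any production synthesizer’s algorithm; the formant frequencies and bandwidths are assumed textbook values for an /a/-like vowel.

```python
# Source-filter sketch: a glottal pulse train filtered by second-order
# formant resonators. Formant values are assumed textbook figures for
# an /a/-like vowel, not taken from any real synthesizer.
import numpy as np

FS = 16000  # sampling rate, Hz

def pulse_train(f0, duration):
    """Glottal source: one impulse per pitch period."""
    src = np.zeros(int(FS * duration))
    src[::int(FS / f0)] = 1.0
    return src

def resonator(x, freq, bandwidth):
    """Second-order IIR resonator: y[n] = g*x[n] - a1*y[n-1] - a2*y[n-2]."""
    r = np.exp(-np.pi * bandwidth / FS)
    a1, a2 = -2 * r * np.cos(2 * np.pi * freq / FS), r * r
    g = 1 + a1 + a2  # normalize for unity gain at DC
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = g * x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

def synthesize_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170)),
                     duration=0.3):
    """Cascade the formant resonators over the pulse train and normalize."""
    x = pulse_train(f0, duration)
    for freq, bw in formants:
        x = resonator(x, freq, bw)
    return x / np.abs(x).max()

vowel = synthesize_vowel()  # 0.3 s of a crude /a/-like sound
```

Switching the pulse train for white noise (the voiced/unvoiced switch in the diagram) would produce fricative-like sounds through the same filters.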


Formant TTS Synthesis
• The voice of Stephen Hawking (DECtalk)


Formant TTS Synthesis
• Formant-synthesized speech:
  • Scores high on intelligibility
  • Scores low on naturalness
• Building the model remains a burdensome task


Concatenative TTS synthesis
• Concatenative synthesis doesn’t seek to model speech production at all
• It uses a database of prerecorded segments of natural speech, which it concatenates one after another
• This approach gives the synthetic speech a very natural sound
• [Figure: the word “makedonski” assembled from concatenated segments]


Concatenative TTS synthesis
• Synthesizing the word “masa”


Concatenative TTS synthesis
• Various segment lengths are used
• The longer the segment, the more natural the output speech, but the greater the database – a compromise is needed
• Most systems rely on diphones, which are joined halves of neighboring phones
• [Figure: the phones /k/ and /ε/, and the diphone formed from their adjacent halves]


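Joining prerecorded units can be sketched with a short crossfade at each boundary. The arrays below are random stand-ins for real diphone recordings; real systems also match pitch and energy at the joins.

```python
# Toy concatenation with a 5 ms linear crossfade at each join. The
# arrays are random stand-ins for real diphone waveforms.
import numpy as np

FADE = 80  # crossfade length in samples (5 ms at 16 kHz)

def crossfade_concat(units):
    """Join unit waveforms, blending FADE samples at every boundary."""
    out = units[0].astype(float).copy()
    ramp = np.linspace(0.0, 1.0, FADE)
    for u in units[1:]:
        u = u.astype(float)
        out[-FADE:] = out[-FADE:] * (1 - ramp) + u[:FADE] * ramp  # blend join
        out = np.concatenate([out, u[FADE:]])
    return out

# stand-ins for the diphones of "masa" (unit boundaries are invented)
units = [np.random.randn(1200) for _ in range(5)]
word = crossfade_concat(units)  # 5 units of 1200 samples, 4 blended joins
```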


Concatenative TTS synthesis
• Architecture of a concatenative TTS system
• Two main modules:
  • Text Analysis module (front-end): normalization, mapping and prosody turn the input text into a phonetic representation
  • Waveform Synthesis module (back-end): unit selection from the unit database, concatenation and prosody produce the output speech
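The front-end’s normalization step can be illustrated with a toy digit expander. The rules below are my own simplification, covering only integers 0–99; real front-ends also handle abbreviations, dates, currency and context (“1984” as a year vs. a number).

```python
# Toy text normalization: spell out standalone integers before
# letter-to-sound mapping. Covers 0..99 only (illustrative).
ONES = ["zero", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out 0..99 in English words."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def normalize(text):
    """Replace every standalone integer token with its spoken form."""
    return " ".join(number_to_words(int(tok)) if tok.isdigit() else tok
                    for tok in text.split())

# e.g. normalize("gate 42 is open") -> "gate forty two is open"
```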


Concatenative TTS synthesis
• Two main types of inventories:
• Fixed inventory – only one sample of each unit
  • The database is small (typically 1500+ units)
  • They rely extensively on unit modification
  • Examples: MBROLA (diphone), Festival (diphone)
• Unit-selection systems – many samples of the same unit
  • Usually contain hours of recorded speech material
  • They require very little or no unit modification
  • Examples: Festival, AT&T’s Next-Gen, ATR’s CHATR (all unit selection)
• The most elaborate of these is Japanese ATR’s XIMERA, which uses a 170-hour, 25.5 GB database of recorded speech!
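Unit selection is typically framed as finding the candidate sequence that minimizes the sum of target costs and concatenation (join) costs, solved with dynamic programming. A minimal sketch, with random numbers standing in for real spectral and prosodic distances:

```python
# Minimal unit-selection sketch: Viterbi-style dynamic programming over
# candidate units. Costs are random stand-ins for real distances.
import numpy as np

def select_units(target_cost, join_cost):
    """
    target_cost: (T, K) cost of candidate k at position t.
    join_cost:   (T-1, K, K) cost of joining candidate j at t with k at t+1.
    Returns the minimum-total-cost candidate index for each position.
    """
    T, K = target_cost.shape
    best = target_cost[0].copy()          # best path cost ending in each candidate
    back = np.zeros((T, K), dtype=int)    # best predecessor per candidate
    for t in range(1, T):
        total = best[:, None] + join_cost[t - 1] + target_cost[t][None, :]
        back[t] = np.argmin(total, axis=0)
        best = np.min(total, axis=0)
    path = [int(np.argmin(best))]         # backtrack from the cheapest end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
tc = rng.random((6, 4))        # 6 target positions, 4 candidate units each
jc = rng.random((5, 4, 4))     # join costs between consecutive candidates
path = select_units(tc, jc)
```

Fixed-inventory systems skip this search (one candidate per unit) at the price of heavier signal modification.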


Concatenative TTS synthesis
• The lack of a model of the speech production process makes concatenative TTS systems easy to develop
• Concatenative TTS synthesis is the paradigm of choice for many TTS systems worldwide


TTS for embedded systems
• Implementing TTS synthesis systems in embedded devices requires:
  • A smaller processing load
  • A smaller memory footprint for the voice segment database
• We will take a look at TTS in embedded systems from several angles:
  • TTS integrated circuits
  • TTS modules
  • TTS across embedded operating systems
  • TTS software applications for embedded devices


TTS for embedded systems
• Dedicated integrated circuits for TTS applications:
  • Votrax SC-01A (analog formant), SC-02 / SSI-263 / “Arctic 263”
  • General Instruments SP0256-AL2 (CTS256A-AL2, MEA8000)
  • Magnevation SpeakJet / TTS256 (DECtalk)
  • Savage Innovations SoundGin
  • National Semiconductor DT1050 Digitalker (Mozer)
  • Silicon Systems SSI 263 (analog formant)
  • Texas Instruments LPC speech chips: TMS5110A, TMS5200
  • Oki Semiconductor ML22825 (ADPCM), ML22573 (HQ-ADPCM)
  • Toshiba T6721A
  • Philips PCF8200


TTS for embedded systems - ICs
• Epson’s S1V30120 TTS IC
  • Brought to market in 2007
• Key features of the S1V30120 text-to-speech IC:
  • Fonix DECtalk fully parametric speech synthesis
  • Multiple languages (US English, Castilian and Latin American Spanish)
  • Nine TTS voices for each language (four male, four female, one child)
  • TTS sampling rate: 11.025 kHz
  • Audio playback with ADPCM decoding (in Epson’s format)
    • Sampling rates: 16 kHz, 8 kHz
    • Bit rates: 80, 64, 48, 32 and 24 kbps
• Target applications: portable media devices, mobile phones, home appliances, office and industrial equipment
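To see why ADPCM-style coding matters for these chips’ voice storage, here is a crude DPCM sketch: 16-bit samples are stored as 4-bit quantized differences, a fourfold reduction. This illustrates the principle only; it is not Epson’s ADPCM format, and real ADPCM also adapts the step size.

```python
# Crude DPCM sketch (NOT Epson's ADPCM format): 16-bit samples stored
# as 4-bit quantized differences, cutting storage fourfold.
import numpy as np

STEP = 512  # fixed quantizer step; real ADPCM adapts this per sample

def dpcm_encode(samples):
    """Quantize successive differences to 4-bit codes in -8..7."""
    codes, pred = [], 0
    for s in samples:
        code = max(-8, min(7, int(round((int(s) - pred) / STEP))))
        pred = max(-32768, min(32767, pred + code * STEP))  # mirror decoder state
        codes.append(code)
    return codes

def dpcm_decode(codes):
    """Rebuild samples by accumulating the quantized differences."""
    out, pred = [], 0
    for c in codes:
        pred = max(-32768, min(32767, pred + c * STEP))
        out.append(pred)
    return out

# demo: 25 ms of a 200 Hz tone sampled at 16 kHz
tone = (8000 * np.sin(2 * np.pi * 200 * np.arange(400) / 16000)).astype(int)
recon = dpcm_decode(dpcm_encode(tone))
```

Because the encoder tracks the decoder’s reconstruction, quantization error stays bounded instead of accumulating.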


TTS for embedded systems - ICs
• Magnevation SpeakJet
• Features:
  • Natural phonetic speech synthesis
  • 72 speech elements (allophones), 43 sound effects, and 12 DTMF touch tones
  • Control of pitch, rate, bend and volume
  • Programmable announcements
  • Simple interface to microcontrollers
  • Both CPU-controlled and stand-alone operation
  • Extremely low power consumption
  • Low pin count
  • Only power and a speaker are needed to hear speech
• TTS256 text-to-code IC for the SpeakJet
  • Built-in 600-rule database converts English text to phoneme codes


TTS for embedded systems - Modules
• Digital Acoustics’ TextSpeak embedded TTS modules
• Complete stand-alone (embedded) solutions that convert keyboard entry and RS-232 data directly to voice:
  • TTS-03 – low-cost U.S. English and Spanish TTS module for robotics and radio
  • TTS-04 – keyboard-to-speech solution (battery powered)
  • TTS-EM – high-quality voice, multi-language, RS-232-to-speech module


TTS for embedded systems - Modules
• TextSpeak TTS-03 module
• Converts ASCII text to a natural, clear voice with unlimited vocabulary
• The small-footprint, plug-in solution accepts a wide range of input data to generate real-time speech:
  • 8 kHz, ROM-based
  • RS-232 and PS/2 keyboard input
  • 2" square footprint


TTS for embedded systems - Modules
• TextSpeak TTS-EM module
• High-quality, multi-language embedded TTS
• Complete speech synthesis in a 2"x3" modular solution:
  • 16 kHz flash-based voices rival PC quality
  • 2"x3" PCB platform
  • Integral OS provides superior flexibility and performance
  • Supports over 20 embedded TTS languages
  • Standard voices include English, Spanish, German and French
  • Optional voices and versions available with third-party mobile and embedded voice databases
  • RS-232, Ethernet and keyboard input


TTS for embedded systems - mOS
• As the processing power and memory capacity of embedded devices increase, the line between TTS systems developed for PC platforms and those for embedded devices grows ever more vague
• Mobile operating systems with TTS synthesis:
  • Apple’s iOS
  • Microsoft’s Windows Mobile
  • Google’s Android
  • Nokia’s Symbian, etc.


TTS for embedded systems - mOS
• Apple’s iOS
• The first TTS integrated into an operating system that shipped in quantity was Apple’s MacInTalk in 1984
• It evolved into VoiceOver:
  • First featured in Mac OS X Tiger (10.4)
  • Starting with 10.6 (Snow Leopard), it offers a list of multiple voices
• VoiceOver is also included in:
  • the iPod Shuffle and iPod Nano,
  • the iPod Touch,
  • the iPhone 3GS and iPhone 4


TTS for embedded systems - mOS
• Microsoft Windows Mobile
• Windows systems use SAPI4-based (Windows 95/98) and SAPI5-based (Speech Application Programming Interface) speech systems that include a speech recognition engine (SRE)
  • Microsoft Sam is the default Windows XP voice
  • Microsoft Anna is the new voice shipped with Vista and Windows 7
• Microsoft’s Voice Command
  • Used on the Windows Mobile platform for Pocket PC and Smartphone devices
  • Voice Command is a retail product sold separately from Windows Mobile


TTS for embedded systems - mOS
• Android 1.6 (Donut) added a TTS engine


TTS for embedded systems - app
• Flite
  • A small run-time TTS synthesis engine developed at Carnegie Mellon University
  • Derived from the Festival Speech Synthesis System from the University of Edinburgh and the FestVox project from Carnegie Mellon University
• FreeTTS
  • A speech synthesis system written entirely in the Java programming language, based on Flite


TTS for embedded systems - app
• Loquendo Embedded TTS
• Provides expressive, natural-sounding synthetic speech in a wide range of languages and voices, with a range of small memory footprints for embedded devices
• Used in:
  • over 10 million navigation devices worldwide (Loquendo Automotive Solution),
  • the new generation of mobile phones (Loquendo Embedded TTS for Mobile),
  • on-board announcements for trains and buses, talking bus stops,
  • screen readers for mobile and PC, e-book readers,
  • talking ATMs, domestic appliances and EPGs for TV
• Memory requirements (8 kHz voice):
  • from 2.5 MB RAM
  • from 3.5 MB storage per voice
• Type of technology:
  • Unit selection, concatenative
• Sampling rates:
  • 8/11/16/22/32/44 kHz
• CPU requirements:
  • XScale, ARM9, ARM11, x86, SH4, Motorola, PowerPC, TI OMAP 3621
• Platforms:
  • Android, iPhone, Symbian OS S60, Windows Mobile 5 & 6, Windows CE 5 & 6, Windows XP Embedded and Tablet PC ed., VxWorks, Linux and QNX


Macedonian TTS
• Maybe some day robots will speak Macedonian?
• Our results so far:
  • natural recording
  • old system
  • new system:
    • q-diphone units
    • with prosodic modeling
• “jas sum TTS sistemot za makedonski jazik, razvien na Institutot za elektronika” (“I am the TTS system for the Macedonian language, developed at the Institute of Electronics”)
• “jas sum TTS sintetizatorot od Makedonija i zboruvam makedonski” (“I am the TTS synthesizer from Macedonia, and I speak Macedonian”)


Conclusion
• We have seen how conversion of unrestricted text to speech works
• Speech synthesis will become ever more present in the next generation of robot systems
• With the availability of TTS tools, it’s possible to make robots speak your language
