Text-to-speech man-machine interface in embedded systems

Speech outline
• Introduction
• Speech generation in humans
• Artificial speech synthesis – the basics
• Speech synthesis for embedded systems
• Our results so far


Introduction
• Various man-machine interfaces:
  • Levers / pedals / pulleys
  • Buttons
  • Keyboard and mouse
  • Touch screen
  • 3D gloves / movement tracking
• Which one is the best?
• None


Introduction
• The ideal man-machine interface is SPEECH
• The best way to make a machine do what you want is to tell it what you want
• The ideal talking machine: HAL from 2001: A Space Odyssey
• It’s 2010 now and there is still a long way to go
• A lot has already been done – this is what my talk will be about
• Two technologies stand behind this:
  • ASR – Automatic Speech Recognition
  • TTS – Text-to-Speech Synthesis


Human speech
• Speech consists of a well-controlled sequence of sounds
• These sounds are generated by the speech apparatus
• Two groups of organs:
  • voice-producing (phonatory) organs
  • articulatory organs


Human speech
• The phonatory organs drive speech production:
  • Lungs
  • Larynx – vocal folds
• The articulatory organs (vocal tract) shape the spectrum of the sound and give it phonetic character:
  • pharynx, velum, palate, nasal cavity, tongue, teeth and lips


Human speech
• The vocal folds vibrate to produce the human voice
• The vocal tract shapes the spectrum
• [Figure: waveforms of the phone “Z” (mixed voicing), the phone “S” (unvoiced) and the phone “I” (voiced)]


Speech synthesis
• The easiest way to make robots speak is by recording human speech and playing it back when needed:
  • Trivial
  • High quality
  • Limited use
• True speech synthesis lets robots generate unbounded speech output (i.e. speak their mind):
  • Text-to-Speech synthesis
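The playback approach above amounts to a lookup table. A minimal sketch (the phrases and filenames are invented for illustration): anything not recorded in advance simply cannot be spoken, which is what motivates true synthesis.

```python
# Toy illustration of the playback approach: a fixed table of
# prerecorded prompts (phrases and filenames are invented).
PROMPTS = {
    "doors closing": "doors_closing.wav",
    "next stop: hamburg": "next_stop_hamburg.wav",
}

def play(phrase):
    """Return the recording for a known phrase, or None if it was never recorded."""
    return PROMPTS.get(phrase.lower())
```

`play("Doors closing")` finds a recording, but any unseen sentence returns `None` — unbounded output needs synthesis.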


Text to Speech (TTS) Synthesis
• The artificial synthesis of speech based on unrestricted text input
• Three basic paradigms:
  • Articulatory
  • Formant
  • Concatenative


Articulatory TTS Synthesis
• Articulatory synthesis is based on a complete 3D model of the human speech apparatus
• It uses the acoustic parameters extracted from the model to synthesize speech


Articulatory TTS Synthesis
• Peter Birkholz’s VocalTractLab – www.vocaltractlab.de


Articulatory TTS Synthesis
• “Der Zug hat eine Stunde Verspätung” (“The train is one hour late”)
• “Nächster Halt: Hamburg” (“Next stop: Hamburg”)


Articulatory TTS Synthesis
• Articulatory synthesis is the most powerful paradigm, but also the most complex
• Building the model calls for thorough analysis of MRI (Magnetic Resonance Imaging) scans of speech production
• High on intelligibility, but low on naturalness


Formant TTS Synthesis
• Formant synthesis takes a black-box modeling approach to speech production
• It analyzes the end transfer function of the vocal tract, rather than the way it is built
• Formants are the vocal tract’s resonant frequencies; they give the phonetic character to speech sounds
• [Figure: base harmonic and first two formants of the vowels U, O, A, E, I; spectrum of the voiced phone “I”]


Formant TTS Synthesis
• The driving force behind formant TTS synthesis is the source-filter model
• [Diagram: a pulse train (vocal folds) or a noise source (vocal tract constrictions), selected by a voiced/unvoiced switch and scaled by a gain (lung pressure), drives a time-varying digital filter (vocal tract parameters) to produce speech]
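The source-filter model can be sketched in a few lines: a glottal pulse train is passed through a cascade of second-order resonators standing in for the vocal tract. This is an illustrative toy, not DECtalk’s or any production synthesizer’s algorithm; the formant frequencies and bandwidths are assumed textbook values for an /a/-like vowel.

```python
# Source-filter sketch: a glottal pulse train filtered by second-order
# formant resonators. Formant values are assumed textbook figures for
# an /a/-like vowel, not taken from any real synthesizer.
import numpy as np

FS = 16000  # sampling rate, Hz

def pulse_train(f0, duration):
    """Glottal source: one impulse per pitch period."""
    src = np.zeros(int(FS * duration))
    src[::int(FS / f0)] = 1.0
    return src

def resonator(x, freq, bandwidth):
    """Second-order IIR resonator: y[n] = g*x[n] - a1*y[n-1] - a2*y[n-2]."""
    r = np.exp(-np.pi * bandwidth / FS)
    a1, a2 = -2 * r * np.cos(2 * np.pi * freq / FS), r * r
    g = 1 + a1 + a2  # normalize for unity gain at DC
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = g * x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

def synthesize_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170)),
                     duration=0.3):
    """Cascade the formant resonators over the pulse train and normalize."""
    x = pulse_train(f0, duration)
    for freq, bw in formants:
        x = resonator(x, freq, bw)
    return x / np.abs(x).max()

vowel = synthesize_vowel()  # 0.3 s of a crude /a/-like sound
```

Switching the pulse train for white noise (the voiced/unvoiced switch in the diagram) would produce fricative-like sounds through the same filters.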


Formant TTS Synthesis
• The voice of Stephen Hawking (DECtalk)


Formant TTS Synthesis
• Formant-synthesized speech:
  • Scores high on intelligibility
  • Scores low on naturalness
• Building the model remains a burdensome task


Concatenative TTS synthesis
• Concatenative synthesis doesn’t seek to model speech production at all
• It uses a database of prerecorded segments of natural speech, which it concatenates one after another
• This approach gives the synthetic speech a very natural sound
• [Figure: the word “makedonski” assembled from concatenated segments]


Concatenative TTS synthesis
• Synthesizing the word “masa”


Concatenative TTS synthesis
• Various segment lengths are used
• The longer the segment, the more natural the output speech, but the greater the database – a compromise is needed
• Most systems rely on diphones, which are joined halves of neighboring phones
• [Figure: the phones /k/ and /ε/, and the diphone formed from their adjacent halves]


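Joining prerecorded units can be sketched with a short crossfade at each boundary. The arrays below are random stand-ins for real diphone recordings; real systems also match pitch and energy at the joins.

```python
# Toy concatenation with a 5 ms linear crossfade at each join. The
# arrays are random stand-ins for real diphone waveforms.
import numpy as np

FADE = 80  # crossfade length in samples (5 ms at 16 kHz)

def crossfade_concat(units):
    """Join unit waveforms, blending FADE samples at every boundary."""
    out = units[0].astype(float).copy()
    ramp = np.linspace(0.0, 1.0, FADE)
    for u in units[1:]:
        u = u.astype(float)
        out[-FADE:] = out[-FADE:] * (1 - ramp) + u[:FADE] * ramp  # blend join
        out = np.concatenate([out, u[FADE:]])
    return out

# stand-ins for the diphones of "masa" (unit boundaries are invented)
units = [np.random.randn(1200) for _ in range(5)]
word = crossfade_concat(units)  # 5 units of 1200 samples, 4 blended joins
```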


Concatenative TTS synthesis
• Architecture of a concatenative TTS system
• Two main modules:
  • Text Analysis module (front-end): normalization, mapping and prosody turn the input text into a phonetic representation
  • Waveform Synthesis module (back-end): unit selection from the unit database, concatenation and prosody produce the output speech
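The front-end’s normalization step can be illustrated with a toy digit expander. The rules below are my own simplification, covering only integers 0–99; real front-ends also handle abbreviations, dates, currency and context (“1984” as a year vs. a number).

```python
# Toy text normalization: spell out standalone integers before
# letter-to-sound mapping. Covers 0..99 only (illustrative).
ONES = ["zero", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out 0..99 in English words."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def normalize(text):
    """Replace every standalone integer token with its spoken form."""
    return " ".join(number_to_words(int(tok)) if tok.isdigit() else tok
                    for tok in text.split())

# e.g. normalize("gate 42 is open") -> "gate forty two is open"
```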


Concatenative TTS synthesis
• Two main types of inventories:
• Fixed inventory – only one sample of each unit
  • The database is small (typically 1500+ units)
  • They rely extensively on unit modification
  • Examples: MBROLA (diphone), Festival (diphone)
• Unit-selection systems – many samples of the same unit
  • Usually contain hours of recorded speech material
  • They require very little or no unit modification
  • Examples: Festival, AT&T’s Next-Gen, ATR’s CHATR (all unit selection)
• The most elaborate of these is Japanese ATR’s XIMERA, which uses a 170-hour, 25.5 GB database of recorded speech!
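Unit selection is typically framed as finding the candidate sequence that minimizes the sum of target costs and concatenation (join) costs, solved with dynamic programming. A minimal sketch, with random numbers standing in for real spectral and prosodic distances:

```python
# Minimal unit-selection sketch: Viterbi-style dynamic programming over
# candidate units. Costs are random stand-ins for real distances.
import numpy as np

def select_units(target_cost, join_cost):
    """
    target_cost: (T, K) cost of candidate k at position t.
    join_cost:   (T-1, K, K) cost of joining candidate j at t with k at t+1.
    Returns the minimum-total-cost candidate index for each position.
    """
    T, K = target_cost.shape
    best = target_cost[0].copy()          # best path cost ending in each candidate
    back = np.zeros((T, K), dtype=int)    # best predecessor per candidate
    for t in range(1, T):
        total = best[:, None] + join_cost[t - 1] + target_cost[t][None, :]
        back[t] = np.argmin(total, axis=0)
        best = np.min(total, axis=0)
    path = [int(np.argmin(best))]         # backtrack from the cheapest end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
tc = rng.random((6, 4))        # 6 target positions, 4 candidate units each
jc = rng.random((5, 4, 4))     # join costs between consecutive candidates
path = select_units(tc, jc)
```

Fixed-inventory systems skip this search (one candidate per unit) at the price of heavier signal modification.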


Concatenative TTS synthesis
• The lack of a model of the speech production process makes concatenative TTS systems easy to develop
• Concatenative TTS synthesis is the paradigm of choice for many TTS systems worldwide


TTS for embedded systems
• Implementing TTS synthesis systems in embedded devices requires:
  • A smaller processing load
  • A smaller memory footprint for the voice segment database
• We will take a look at TTS in embedded systems from several angles:
  • TTS integrated circuits
  • TTS modules
  • TTS across embedded operating systems
  • TTS software applications for embedded devices


TTS for embedded systems
• Dedicated integrated circuits for TTS applications:
  • Votrax SC-01A (analog formant), SC-02 / SSI-263 / “Arctic 263”
  • General Instruments SP0256-AL2 (CTS256A-AL2, MEA8000)
  • Magnevation SpeakJet / TTS256 (DECtalk)
  • Savage Innovations SoundGin
  • National Semiconductor DT1050 Digitalker (Mozer)
  • Silicon Systems SSI 263 (analog formant)
  • Texas Instruments LPC speech chips: TMS5110A, TMS5200
  • Oki Semiconductor ML22825 (ADPCM), ML22573 (HQ-ADPCM)
  • Toshiba T6721A
  • Philips PCF8200


TTS for embedded systems - ICs
• Epson’s S1V30120 TTS IC
  • Brought to market in 2007
• Key features of the S1V30120 text-to-speech IC:
  • Fonix DECtalk fully parametric speech synthesis
  • Multiple languages (US English, Castilian and Latin American Spanish)
  • Nine TTS voices for each language (four male, four female, one child)
  • TTS sampling rate: 11.025 kHz
  • Audio playback with ADPCM decoding (in Epson’s format)
    • Sampling rates: 16 kHz, 8 kHz
    • Bit rates: 80, 64, 48, 32 and 24 kbps
• Target applications: portable media devices, mobile phones, home appliances, office and industrial equipment
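To see why ADPCM-style coding matters for these chips’ voice storage, here is a crude DPCM sketch: 16-bit samples are stored as 4-bit quantized differences, a fourfold reduction. This illustrates the principle only; it is not Epson’s ADPCM format, and real ADPCM also adapts the step size.

```python
# Crude DPCM sketch (NOT Epson's ADPCM format): 16-bit samples stored
# as 4-bit quantized differences, cutting storage fourfold.
import numpy as np

STEP = 512  # fixed quantizer step; real ADPCM adapts this per sample

def dpcm_encode(samples):
    """Quantize successive differences to 4-bit codes in -8..7."""
    codes, pred = [], 0
    for s in samples:
        code = max(-8, min(7, int(round((int(s) - pred) / STEP))))
        pred = max(-32768, min(32767, pred + code * STEP))  # mirror decoder state
        codes.append(code)
    return codes

def dpcm_decode(codes):
    """Rebuild samples by accumulating the quantized differences."""
    out, pred = [], 0
    for c in codes:
        pred = max(-32768, min(32767, pred + c * STEP))
        out.append(pred)
    return out

# demo: 25 ms of a 200 Hz tone sampled at 16 kHz
tone = (8000 * np.sin(2 * np.pi * 200 * np.arange(400) / 16000)).astype(int)
recon = dpcm_decode(dpcm_encode(tone))
```

Because the encoder tracks the decoder’s reconstruction, quantization error stays bounded instead of accumulating.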


TTS for embedded systems - ICs
• Magnevation SpeakJet
• Features:
  • Natural phonetic speech synthesis
  • 72 speech elements (allophones), 43 sound effects, and 12 DTMF touch tones
  • Control of pitch, rate, bend and volume
  • Programmable announcements
  • Simple interface to microcontrollers
  • Both CPU-controlled and stand-alone operation
  • Extremely low power consumption
  • Low pin count
  • Only power and a speaker are needed to hear speech
• TTS256 text-to-code IC for the SpeakJet
  • Built-in 600-rule database converts English text to phoneme codes


TTS for embedded systems - Modules
• Digital Acoustics’ TextSpeak embedded TTS modules
• Complete stand-alone (embedded) solutions that convert keyboard entry and RS-232 data directly to voice:
  • TTS-03 – low-cost U.S. English and Spanish TTS module for robotics and radio
  • TTS-04 – keyboard-to-speech solution (battery powered)
  • TTS-EM – high-quality voice, multi-language, RS-232-to-speech module


TTS for embedded systems - Modules
• TextSpeak TTS-03 module
• Converts ASCII text to a natural, clear voice with unlimited vocabulary
• The small-footprint, plug-in solution accepts a wide range of input data to generate real-time speech:
  • 8 kHz, ROM-based
  • RS-232 and PS/2 keyboard input
  • 2" square footprint


TTS for embedded systems - Modules
• TextSpeak TTS-EM module
• High-quality, multi-language embedded TTS
• Complete speech synthesis in a 2"x3" modular solution:
  • 16 kHz flash-based voices rival PC quality
  • 2"x3" PCB platform
  • Integral OS provides superior flexibility and performance
  • Supports over 20 embedded TTS languages
  • Standard voices include English, Spanish, German and French
  • Optional voices and versions available with third-party mobile and embedded voice databases
  • RS-232, Ethernet and keyboard input


TTS for embedded systems - mOS
• As the processing power and memory capacity of embedded devices increase, the line between TTS systems developed for PC platforms and those for embedded devices grows ever more vague
• Mobile operating systems with TTS synthesis:
  • Apple’s iOS
  • Microsoft’s Windows Mobile
  • Google’s Android
  • Nokia’s Symbian, etc.


TTS for embedded systems - mOS
• Apple’s iOS
• The first TTS integrated into an operating system that shipped in quantity was Apple’s MacInTalk in 1984
• It evolved into VoiceOver:
  • First featured in Mac OS X Tiger (10.4)
  • Starting with 10.6 (Snow Leopard), it offers a list of multiple voices
• VoiceOver is also included in:
  • the iPod Shuffle and iPod Nano,
  • the iPod Touch,
  • the iPhone 3GS and iPhone 4


TTS for embedded systems - mOS
• Microsoft Windows Mobile
• Windows systems use SAPI4-based (Windows 95/98) and SAPI5-based (Speech Application Programming Interface) speech systems that include a speech recognition engine (SRE)
  • Microsoft Sam is the default Windows XP voice
  • Microsoft Anna is the new voice shipped with Vista and Windows 7
• Microsoft’s Voice Command
  • Used on the Windows Mobile platform for Pocket PC and Smartphone devices
  • Voice Command is a retail product sold separately from Windows Mobile


TTS for embedded systems - mOS
• Android 1.6 (Donut) added a TTS engine


TTS for embedded systems - app
• Flite
  • A small run-time TTS synthesis engine developed at Carnegie Mellon University
  • Derived from the Festival Speech Synthesis System from the University of Edinburgh and the FestVox project from Carnegie Mellon University
• FreeTTS
  • A speech synthesis system written entirely in the Java programming language, based on Flite


TTS for embedded systems - app
• Loquendo Embedded TTS
• Provides expressive, natural-sounding synthetic speech in a wide range of languages and voices, with a range of small memory footprints for embedded devices
• Used in:
  • over 10 million navigation devices worldwide (Loquendo Automotive Solution),
  • the new generation of mobile phones (Loquendo Embedded TTS for Mobile),
  • on-board announcements for trains and buses, talking bus stops,
  • screen readers for mobile and PC, e-book readers,
  • talking ATMs, domestic appliances and EPGs for TV
• Memory requirements (8 kHz voice):
  • from 2.5 MB RAM
  • from 3.5 MB storage per voice
• Type of technology:
  • Unit selection, concatenative
• Sampling rates:
  • 8/11/16/22/32/44 kHz
• CPU requirements:
  • XScale, ARM9, ARM11, x86, SH4, Motorola, PowerPC, TI OMAP 3621
• Platforms:
  • Android, iPhone, Symbian OS S60, Windows Mobile 5 & 6, Windows CE 5 & 6, Windows XP Embedded and Tablet PC ed., VxWorks, Linux and QNX


Macedonian TTS
• Maybe some day robots will speak Macedonian?
• Our results so far:
  • natural recording
  • old system
  • new system:
    • q-diphone units
    • with prosodic modeling
• “jas sum TTS sistemot za makedonski jazik, razvien na Institutot za elektronika” (“I am the TTS system for the Macedonian language, developed at the Institute of Electronics”)
• “jas sum TTS sintetizatorot od Makedonija i zboruvam makedonski” (“I am the TTS synthesizer from Macedonia, and I speak Macedonian”)


Conclusion
• We have seen how conversion of unrestricted text to speech works
• Speech synthesis will become ever more present in the next generation of robot systems
• With the availability of TTS tools, it’s possible to make robots speak your language
