ETRI INNOVATION: MPEG-4 Text-to-Speech Interface - ETRI Journal
ETRI INNOVATION: MPEG-4 Text-to-Speech Interface - ETRI Journal
ETRI INNOVATION: MPEG-4 Text-to-Speech Interface - ETRI Journal
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
<strong>ETRI</strong> <strong>INNOVATION</strong>: <strong>MPEG</strong>-4 <strong>Text</strong>-<strong>to</strong>-<strong>Speech</strong> <strong>Interface</strong><br />
<strong>ETRI</strong> has developed <strong>MPEG</strong>-4 Audio <strong>Text</strong>-<strong>to</strong>-<strong>Speech</strong> <strong>Interface</strong> (TTSI) technology which was accepted in the<br />
ISO/IEC International Standards.<br />
I. Introduction<br />
<strong>Text</strong> is widely used <strong>to</strong> transfer speech. However, it was not possible <strong>to</strong> use text <strong>to</strong> present speech in multimedia<br />
bitstream. The <strong>MPEG</strong>-4 Audio TTSI is defined so that the text bitstream can be used <strong>to</strong> transfer speech. It also enables<br />
many kinds of text-<strong>to</strong>-speech (TTS) synthesizers <strong>to</strong> use the same bitstream for this purpose. The TTS generates speech<br />
when a text is accessed as its input. The TTS changes the text in<strong>to</strong> a string of phonetic symbols and retrieves the<br />
corresponding basic synthetic units from the synthesis unit database. Then, the TTS concatenates the synthetic units <strong>to</strong><br />
generate the output speech with the rule-generated prosody.<br />
II. The Functionality of the <strong>MPEG</strong>-4 Audio TTSI<br />
The <strong>MPEG</strong>-4 Audio TTSI has a capability <strong>to</strong> select language, gender and age of the speaker, speech rate, and<br />
prosody of the speech. It can send a silence sentence that only has silence duration. One can send a bitstream that<br />
contains only the text and its length. In this case, the minimum bandwidth required is 200 bits per second. The<br />
synthesizer will add predefined or rule-generated prosody <strong>to</strong> the synthesized speech. The synthesized speech in this<br />
case will deliver the content <strong>to</strong> the listener. On the other hand, one can send a bitstream that contains text as well as<br />
the prosody of the original speech, that is, phoneme sequence, duration of each phoneme, base frequency (<strong>to</strong>ne) of<br />
each phoneme, and energy of each phoneme. The synthesized speech in this case will be very similar <strong>to</strong> the original<br />
speech since it has the original prosody. Thus, one can send speeches with subtle nuance without any loss of intention<br />
using <strong>MPEG</strong>-4 Audio TTSI.<br />
One of the important feature of the <strong>MPEG</strong>-4 Audio TTSI is synthesized speech synchronized with the lip<br />
movement of the face. In this case, the TTS synthesizer generates phoneme sequence and its duration <strong>to</strong> the facial<br />
animation visual object decoder so that it can control the lip movement. With this feature, one can not only listen<br />
the synthesized speech but also see the avatar that has synchronized lip movement.<br />
Another important feature of the <strong>MPEG</strong>-4 Audio TTSI is the moving picture dubbing with TTS. In this case<br />
<strong>MPEG</strong>-4 Audio TTSI should synchronize synthetic speech <strong>to</strong> moving picture and accommodate the functionality of<br />
trick mode. The <strong>MPEG</strong>-4 Audio TTSI decoder uses system clock <strong>to</strong> select adequate speech location in a sentence.<br />
40<br />
<strong>ETRI</strong> <strong>Journal</strong>, Vol. 21, No. 2, June 1999
The TTS synthesizer assigns appropriate duration for each phoneme. Utilizing these data, one can generates<br />
synthesized speech that is synchronized with the lip shape of the moving picture.<br />
In <strong>MPEG</strong>-4 Audio TTSI, one can use trick mode operation <strong>to</strong> start, s<strong>to</strong>p, rewind, and fast forward the synthesized<br />
speech. In this situation, users can also control the speech rate, pitch range, gender, and age of the synthesized<br />
speech.<br />
III. Applications of the <strong>MPEG</strong>-4 Audio TTSI<br />
1. <strong>MPEG</strong>-4 S<strong>to</strong>ry Teller on Demand (STOD)<br />
In the <strong>MPEG</strong>-4 S<strong>to</strong>ry Teller on Demand (STOD) application, users can select a s<strong>to</strong>ry from a huge database of s<strong>to</strong>ry<br />
libraries that are s<strong>to</strong>red in hard disks or compact disks. The STOD system reads aloud the s<strong>to</strong>ry via the decoded<br />
<strong>MPEG</strong>-4 Audio TTSI bitstream with the <strong>MPEG</strong>-4 facial animation <strong>to</strong>ol or with appropriately selected images. The user<br />
can s<strong>to</strong>p and resume speaking at any moment he wants through user interfaces of the local machine (for example,<br />
mouse or keyboard). The user can also select the gender, age, and the speech rate of the electronic s<strong>to</strong>ry-teller.<br />
2. <strong>MPEG</strong>-4 Audio TTSI with Moving Picture<br />
In this application, synchronized playback of the decoded speech and encoded moving picture is the most important<br />
issue. The architecture of the <strong>MPEG</strong>-4 Audio TTSI can provide several granularities of synchronization. Aligning the<br />
composition time of each sentence, coarse granularity of synchronization and trick mode functionality can be easily<br />
achieved. To get finer granularity of synchronization, the information about the lip shape would be utilized. The finest<br />
granularity of synchronization can be achieved by using the prosody information and the video-related information such<br />
as sentence duration and offset time in the sentence. With this synchronization capability, the <strong>MPEG</strong>-4 Audio TTSI can<br />
be used for moving picture dubbing by utilizing the lip shape and the corresponding time in the sentence.<br />
3. Other Applications<br />
Other applications of the <strong>MPEG</strong>-4 Audio TTSI include speech synthesizer for avatars in virtual reality (VR)<br />
applications, voice newspaper, dubbing <strong>to</strong>ols for animated pictures, and voice internet.<br />
<strong>ETRI</strong> <strong>Journal</strong>, Vol. 21, No. 2, June 1999 41