ETRI INNOVATION: MPEG-4 Text-to-Speech Interface - ETRI Journal

ETRI INNOVATION: MPEG-4 Text-to-Speech Interface 

ETRI has developed MPEG-4 Audio Text-to-Speech Interface (TTSI) technology, which was accepted into the 

ISO/IEC International Standards. 

I. Introduction 

Text is widely used to transfer speech. However, it has not been possible to use text to represent speech in a multimedia 

bitstream. The MPEG-4 Audio TTSI is defined so that a text bitstream can be used to transfer speech. It also enables 

many kinds of text-to-speech (TTS) synthesizers to use the same bitstream for this purpose. A TTS synthesizer generates speech 

from the text given as its input. It converts the text into a string of phonetic symbols and retrieves the 

corresponding basic synthetic units from the synthesis unit database. Then, the TTS concatenates the synthetic units to 

generate the output speech with the rule-generated prosody. 
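
The three steps described above (text to phonetic symbols, unit lookup, and concatenation with rule-generated prosody) can be sketched as follows. This is only a toy illustration: the lexicon, the unit names, the function names, and the single prosody rule are all hypothetical and are not taken from any actual synthesizer or from the standard.

```python
# Toy illustration only: the lexicon, unit names, and prosody rule below are
# hypothetical and not taken from any real synthesizer.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Stand-in for a synthesis unit database: phoneme -> stored unit identifier.
UNIT_DB = {p: f"unit_{p}" for word in LEXICON.values() for p in word}


def text_to_phonemes(text):
    """Convert text into a flat list of phonetic symbols."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def rule_prosody(phonemes, base_ms=80, final_ms=120):
    """Assign a duration to each phoneme with one trivial rule:
    lengthen the final phoneme of the utterance."""
    durations = [base_ms] * len(phonemes)
    if durations:
        durations[-1] = final_ms
    return durations


def synthesize(text):
    """Return the concatenated (unit, duration_ms) sequence for the text."""
    phonemes = text_to_phonemes(text)
    units = [UNIT_DB[p] for p in phonemes]
    return list(zip(units, rule_prosody(phonemes)))
```

A real synthesizer would produce waveform samples rather than unit labels; the sketch only shows the order of the processing stages.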

II. The Functionality of the MPEG-4 Audio TTSI 

The MPEG-4 Audio TTSI can select the language, the gender and age of the speaker, the speech rate, and the 

prosody of the speech. It can send a silent sentence that carries only a silence duration. One can send a bitstream that 

contains only the text and its length. In this case, the minimum bandwidth required is 200 bits per second. The 

synthesizer will add predefined or rule-generated prosody to the synthesized speech. The synthesized speech in this 

case will deliver the content to the listener. On the other hand, one can send a bitstream that contains text as well as 

the prosody of the original speech, that is, phoneme sequence, duration of each phoneme, base frequency (tone) of 

each phoneme, and energy of each phoneme. The synthesized speech in this case will be very similar to the original 

speech since it has the original prosody. Thus, one can send speech with subtle nuances without any loss of intention 

using MPEG-4 Audio TTSI. 
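
The two ways of using the bitstream described above can be pictured with a small data structure. The class and field names below are hypothetical and do not reproduce the actual MPEG-4 bitstream syntax. The bandwidth helper is likewise only an illustration: the pairing of 25 characters per second with 8 bits per character is an assumption chosen because it happens to give 200 bits per second, not the derivation used in the standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhonemeProsody:
    """Prosody of one phoneme of the original speech (hypothetical layout)."""
    symbol: str
    duration_ms: int
    base_frequency_hz: float
    energy: float


@dataclass
class TtsSentence:
    """One sentence: text only, or text plus the original prosody."""
    text: str
    prosody: Optional[List[PhonemeProsody]] = None

    def has_original_prosody(self):
        # Without prosody, the synthesizer falls back to its own rules.
        return self.prosody is not None


def text_only_bits_per_second(chars_per_second, bits_per_char=8):
    """Rough bit rate of a text-only stream (assumed character coding)."""
    return chars_per_second * bits_per_char


plain = TtsSentence("Good morning.")
rich = TtsSentence("Hi.", [PhonemeProsody("HH", 70, 120.0, 0.6),
                           PhonemeProsody("AY", 180, 135.0, 0.8)])
```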

One of the important features of the MPEG-4 Audio TTSI is synthesized speech that is synchronized with the lip 

movement of a face. In this case, the TTS synthesizer passes the phoneme sequence and the duration of each phoneme to the facial 

animation visual object decoder so that it can control the lip movement. With this feature, one can not only listen 

to the synthesized speech but also see an avatar with synchronized lip movement. 
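
As a minimal sketch of the data involved, the phoneme sequence and durations can be turned into a timeline of start and end times that a lip animation could follow. The function name and the millisecond units are assumptions made for illustration; the actual interface to the facial animation decoder is defined by the standard and is not reproduced here.

```python
from itertools import accumulate


def phoneme_timeline(phonemes, durations_ms):
    """Return (phoneme, start_ms, end_ms) triples for one sentence."""
    if len(phonemes) != len(durations_ms):
        raise ValueError("one duration is required per phoneme")
    ends = list(accumulate(durations_ms))   # running total of durations
    starts = [0] + ends[:-1]                # each phoneme starts where the last ended
    return list(zip(phonemes, starts, ends))


timeline = phoneme_timeline(["HH", "AH", "L", "OW"], [80, 100, 60, 120])
```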

Another important feature of the MPEG-4 Audio TTSI is moving picture dubbing with TTS. In this case, the 

MPEG-4 Audio TTSI should synchronize the synthetic speech to the moving picture and accommodate the functionality of 

trick mode. The MPEG-4 Audio TTSI decoder uses the system clock to select the appropriate speech location in a sentence. 


ETRI Journal, Vol. 21, No. 2, June 1999

The TTS synthesizer assigns an appropriate duration to each phoneme. Using these data, one can generate 

synthesized speech that is synchronized with the lip shape of the moving picture. 
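
One way to picture how a clock value selects a location in a sentence is a lookup over the cumulative phoneme durations. This is a sketch under stated assumptions: the function name is hypothetical, and the offset is taken to be the time elapsed since the start of the sentence, in milliseconds.

```python
from bisect import bisect_right
from itertools import accumulate


def phoneme_at(durations_ms, offset_ms):
    """Return the index of the phoneme being spoken at offset_ms,
    or None if the offset lies outside the sentence."""
    ends = list(accumulate(durations_ms))   # end time of each phoneme
    if offset_ms < 0 or not ends or offset_ms >= ends[-1]:
        return None
    # The first phoneme whose end time is later than the offset is playing.
    return bisect_right(ends, offset_ms)


durations = [80, 100, 60]   # phonemes end at 80, 180, and 240 ms
```

With these durations, an offset of 80 ms falls exactly at the start of the second phoneme, so the lookup returns index 1.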

In MPEG-4 Audio TTSI, one can use trick mode operation to start, stop, rewind, and fast forward the synthesized 

speech. In this situation, users can also control the speech rate, pitch range, gender, and age of the synthesized 

speech. 

III. Applications of the MPEG-4 Audio TTSI 

1. MPEG-4 Story Teller on Demand (STOD) 

In the MPEG-4 Story Teller on Demand (STOD) application, users can select a story from a huge database of story 

libraries that are stored on hard disks or compact disks. The STOD system reads the story aloud via the decoded 

MPEG-4 Audio TTSI bitstream with the MPEG-4 facial animation tool or with appropriately selected images. The user 

can stop and resume the speech at any moment through the user interfaces of the local machine (for example, 

mouse or keyboard). The user can also select the gender, age, and the speech rate of the electronic story-teller. 

2. MPEG-4 Audio TTSI with Moving Picture 

In this application, synchronized playback of the decoded speech and the encoded moving picture is the most important 

issue. The architecture of the MPEG-4 Audio TTSI can provide several granularities of synchronization. By aligning the 

composition time of each sentence, a coarse granularity of synchronization and trick mode functionality can be easily 

achieved. To get a finer granularity of synchronization, the information about the lip shape can be used. The finest 

granularity of synchronization can be achieved by using the prosody information and the video-related information such 

as sentence duration and offset time in the sentence. With this synchronization capability, the MPEG-4 Audio TTSI can 

be used for moving picture dubbing by utilizing the lip shape and the corresponding time in the sentence. 
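
The coarse, sentence-level case can be sketched as a search over sentence composition times. The function name, the sorted list of composition times, and the millisecond units are assumptions for illustration only. The returned offset corresponds to the offset time in the sentence mentioned above, which a finer-grained stage could then use.

```python
from bisect import bisect_right


def locate_sentence(composition_times_ms, clock_ms):
    """Return (sentence_index, offset_ms) for the sentence active at
    clock_ms, or None if the clock is before the first sentence.
    composition_times_ms must be sorted in ascending order."""
    i = bisect_right(composition_times_ms, clock_ms) - 1
    if i < 0:
        return None
    return i, clock_ms - composition_times_ms[i]


# Three sentences composed at 0 s, 2 s, and 5 s of the moving picture.
times = [0, 2000, 5000]
```

After a fast forward or rewind to an arbitrary clock value, the same lookup gives the sentence to resume from.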

3. Other Applications 

Other applications of the MPEG-4 Audio TTSI include speech synthesizers for avatars in virtual reality (VR) 

applications, voice newspaper, dubbing tools for animated pictures, and voice internet. 

