

ETRI INNOVATION: MPEG-4 Text-to-Speech Interface

ETRI has developed the MPEG-4 Audio Text-to-Speech Interface (TTSI) technology, which was accepted into the ISO/IEC International Standards.

I. Introduction

Text is widely used to convey speech. However, it has not been possible to use text to represent speech in a multimedia bitstream. The MPEG-4 Audio TTSI is defined so that a text bitstream can be used to transfer speech. It also enables many kinds of text-to-speech (TTS) synthesizers to use the same bitstream for this purpose. A TTS synthesizer generates speech from the text it receives as input. It converts the text into a string of phonetic symbols and retrieves the corresponding basic synthesis units from the synthesis unit database. It then concatenates these units to generate the output speech with rule-generated prosody.
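
The synthesis steps described above (text to phonetic symbols, unit lookup, concatenation) can be illustrated with a minimal sketch. The lexicon, the unit database, and the function names below are hypothetical toy stand-ins, not part of the MPEG-4 Audio TTSI or of any particular synthesizer, and the rule-generated prosody step is omitted.

```python
from typing import Dict, List

# Hypothetical pronunciation lexicon: word -> phonetic symbols.
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Hypothetical synthesis unit database: phonetic symbol -> waveform samples.
# Each toy unit is just four constant samples.
UNIT_DB: Dict[str, List[float]] = {
    symbol: [0.1 * i] * 4
    for i, symbol in enumerate(["HH", "AH", "L", "OW", "W", "ER", "D"])
}


def text_to_phonemes(text: str) -> List[str]:
    """Convert text into a flat string of phonetic symbols via the lexicon."""
    phonemes: List[str] = []
    for word in text.lower().split():
        # A word missing from the toy lexicon raises KeyError.
        phonemes.extend(LEXICON[word])
    return phonemes


def synthesize(text: str) -> List[float]:
    """Retrieve the unit for each symbol and concatenate the units."""
    samples: List[float] = []
    for symbol in text_to_phonemes(text):
        samples.extend(UNIT_DB[symbol])
    return samples
```

A real synthesizer would additionally modify the duration, pitch, and energy of the concatenated units according to its prosody rules.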

II. The Functionality of the MPEG-4 Audio TTSI

The MPEG-4 Audio TTSI can select the language, the gender and age of the speaker, the speech rate, and the prosody of the speech. It can also send a silence sentence, which carries only a silence duration. One can send a bitstream that contains only the text and its length. In this case, the minimum bandwidth required is 200 bits per second, and the synthesizer adds predefined or rule-generated prosody to the synthesized speech. The synthesized speech in this case delivers the content to the listener. Alternatively, one can send a bitstream that contains the text together with the prosody of the original speech, that is, the phoneme sequence and the duration, base frequency (tone), and energy of each phoneme. The synthesized speech in this case is very similar to the original speech because it carries the original prosody. Thus, one can transmit speech with subtle nuances, without losing the speaker's intention, using the MPEG-4 Audio TTSI.
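
The two transmission modes described above, together with the silence sentence, can be pictured as one record with optional fields. The following is only an illustrative sketch: the class and field names are assumptions made for this example and do not reproduce the bitstream syntax of the standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhonemeProsody:
    """Original prosody of one phoneme (hypothetical field names)."""
    symbol: str
    duration_ms: int
    base_freq_hz: float
    energy: float


@dataclass
class TtsSentence:
    """One sentence as it might be carried to the decoder (illustrative only)."""
    text: str
    prosody: Optional[List[PhonemeProsody]] = None  # absent in text-only mode
    silence_ms: Optional[int] = None  # set only for a silence sentence

    @property
    def text_length(self) -> int:
        """Length of the text, which accompanies the text in every mode."""
        return len(self.text)

    def prosody_source(self) -> str:
        """Tell the synthesizer where the prosody should come from."""
        if self.silence_ms is not None:
            return "silence"
        return "original" if self.prosody else "rule-generated"
```

With text only, the synthesizer falls back to its own prosody rules; with per-phoneme prosody attached, it can reproduce the original speaker's timing and intonation.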

One important feature of the MPEG-4 Audio TTSI is synthesized speech that is synchronized with the lip movement of a face. In this case, the TTS synthesizer generates the phoneme sequence and the duration of each phoneme and passes them to the facial animation visual object decoder so that the decoder can control the lip movement. With this feature, one can not only listen to the synthesized speech but also see an avatar whose lip movement is synchronized with it.
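
As a rough illustration of the data handed to the facial animation decoder, the sketch below turns a phoneme sequence with durations into start and end times that a lip animation could follow. The function name and the tuple layout are assumptions made for this example, not the interface defined by the standard.

```python
from typing import List, Tuple


def phoneme_timeline(
    phonemes: List[Tuple[str, int]],
) -> List[Tuple[str, int, int]]:
    """Map (symbol, duration_ms) pairs to (symbol, start_ms, end_ms) events."""
    timeline: List[Tuple[str, int, int]] = []
    start = 0
    for symbol, duration_ms in phonemes:
        end = start + duration_ms
        timeline.append((symbol, start, end))
        start = end  # the next phoneme begins where this one ends
    return timeline
```

For example, `phoneme_timeline([("HH", 60), ("AH", 90), ("L", 70)])` yields events covering 0 to 60 ms, 60 to 150 ms, and 150 to 220 ms.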

Another important feature of the MPEG-4 Audio TTSI is moving picture dubbing with TTS. In this case, the MPEG-4 Audio TTSI must synchronize the synthetic speech with the moving picture and support trick mode functionality. The MPEG-4 Audio TTSI decoder uses the system clock to select the appropriate speech location in a sentence.


ETRI Journal, Vol. 21, No. 2, June 1999


The TTS synthesizer assigns an appropriate duration to each phoneme. Using these data, one can generate synthesized speech that is synchronized with the lip shape in the moving picture.
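
One possible way to select a speech location from a clock value is to search the cumulative phoneme durations. The sketch below assumes that the decoder already knows the offset of the clock from the start of the sentence. The function name and its arguments are hypothetical, and the standard does not prescribe this particular search.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional


def locate_phoneme(durations_ms: List[int], offset_ms: int) -> Optional[int]:
    """Return the index of the phoneme spoken at offset_ms into the sentence.

    Returns None when the offset falls outside the sentence.
    """
    if offset_ms < 0:
        return None
    # ends[i] is the time at which phoneme i finishes.
    ends = list(accumulate(durations_ms))
    # The first phoneme whose end time lies strictly after the offset.
    index = bisect_right(ends, offset_ms)
    return index if index < len(durations_ms) else None
```

With durations of 60, 90, and 70 ms, an offset of 100 ms falls in the second phoneme (index 1), and an offset of 220 ms is already past the end of the sentence.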

In the MPEG-4 Audio TTSI, one can use trick mode operations to start, stop, rewind, and fast forward the synthesized speech. In trick mode, users can also control the speech rate, pitch range, gender, and age of the synthesized speech.
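
The trick mode operations can be thought of as a small playback controller that tracks a position in the speech. The class below is a hypothetical sketch of that idea, with a speech rate control added. It is not the decoder interface of the standard, and pitch range, gender, and age controls are left out for brevity.

```python
class TrickModePlayer:
    """Toy playback controller for synthesized speech (illustrative only)."""

    def __init__(self, total_ms: int) -> None:
        self.total_ms = total_ms   # length of the speech in milliseconds
        self.position_ms = 0       # current playback position
        self.playing = False
        self.speech_rate = 1.0     # 1.0 is the normal speaking rate

    def start(self) -> None:
        self.playing = True

    def stop(self) -> None:
        self.playing = False

    def _seek(self, delta_ms: int) -> None:
        # Clamp the new position to the range [0, total_ms].
        self.position_ms = max(0, min(self.total_ms, self.position_ms + delta_ms))

    def rewind(self, ms: int) -> None:
        self._seek(-ms)

    def fast_forward(self, ms: int) -> None:
        self._seek(ms)

    def tick(self, elapsed_ms: int) -> None:
        """Advance playback by elapsed clock time, scaled by the speech rate."""
        if self.playing:
            self._seek(int(elapsed_ms * self.speech_rate))
```

Stopping the player freezes the position, so a later `start()` resumes from the same place in the speech.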

III. Applications of the MPEG-4 Audio TTSI

1. MPEG-4 Story Teller on Demand (STOD)

In the MPEG-4 Story Teller on Demand (STOD) application, users can select a story from a huge database of story libraries stored on hard disks or compact disks. The STOD system reads the story aloud via the decoded MPEG-4 Audio TTSI bitstream, accompanied by the MPEG-4 facial animation tool or by appropriately selected images. The user can stop and resume the speech at any moment through the user interface of the local machine (for example, a mouse or keyboard). The user can also select the gender, age, and speech rate of the electronic storyteller.

2. MPEG-4 Audio TTSI with Moving Picture

In this application, synchronized playback of the decoded speech and the encoded moving picture is the most important issue. The architecture of the MPEG-4 Audio TTSI provides several granularities of synchronization. By aligning the composition time of each sentence, coarse-grained synchronization and trick mode functionality can be achieved easily. For finer-grained synchronization, information about the lip shape is used. The finest granularity can be achieved by using the prosody information together with video-related information such as the sentence duration and the offset time within the sentence. With this synchronization capability, the MPEG-4 Audio TTSI can be used for moving picture dubbing by utilizing the lip shape and the corresponding time in the sentence.
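
The coarse, sentence-level alignment can be sketched as a lookup from the video clock to a sentence and an offset time within it. The sketch assumes that every sentence has a composition time and a duration in milliseconds and that composition times are sorted in ascending order. The function and argument names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def locate_sentence(
    composition_times_ms: List[int],
    durations_ms: List[int],
    clock_ms: int,
) -> Optional[Tuple[int, int]]:
    """Return (sentence index, offset within the sentence) for a clock value.

    Returns None when the clock falls before the first sentence or in a
    gap between sentences.
    """
    # Last sentence whose composition time is at or before the clock.
    index = bisect_right(composition_times_ms, clock_ms) - 1
    if index < 0:
        return None
    offset_ms = clock_ms - composition_times_ms[index]
    if offset_ms >= durations_ms[index]:
        return None  # that sentence has already finished
    return index, offset_ms
```

The returned offset is the quantity that the finer granularities would refine further, for example by mapping it onto individual phonemes and lip shapes.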

3. Other Applications

Other applications of the MPEG-4 Audio TTSI include speech synthesizers for avatars in virtual reality (VR) applications, voice newspapers, dubbing tools for animated pictures, and voice Internet services.

