PhD thesis - School of Informatics - University of Edinburgh


Chapter 6. Other Potential Applications

automatic data-driven methods, for example decision tree models such as Classification and Regression Tree (CART) models (see Breiman et al. (1984)). Once the abstract labels are assigned to the utterance, they have to be transformed into numerical F0 targets and converted into an F0 contour. There are many different approaches to this task, including the computation of an average pitch contour as well as rule-based or statistical methods.
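The conversion from abstract labels to an F0 contour can be illustrated with a small sketch in the spirit of the rule-based approaches just mentioned. The label set, target values and timings below are invented for illustration; real systems derive such targets from hand-crafted rules or trained models.

```python
# A minimal sketch (hypothetical rules and values) of turning abstract
# prosodic labels into numerical F0 targets and interpolating them into
# a contour, as described above for rule-based target approaches.

# Hypothetical rule table: each abstract label maps to an F0 target in Hz.
TARGET_RULES = {
    "H*": 180.0,    # high pitch accent
    "L*": 110.0,    # low pitch accent
    "L-L%": 90.0,   # low boundary tone (declarative fall)
    "H-H%": 200.0,  # high boundary tone (question-like rise)
}

def labels_to_targets(labelled_points):
    """Map (time_s, label) pairs to (time_s, f0_hz) targets via the rules."""
    return [(t, TARGET_RULES[lab]) for t, lab in labelled_points]

def interpolate_contour(targets, step=0.01):
    """Linearly interpolate between successive F0 targets, sampling the
    resulting contour every `step` seconds."""
    contour = []
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        n = round((t1 - t0) / step)
        for i in range(n):
            frac = i / n
            contour.append((round(t0 + i * step, 3), f0 + frac * (f1 - f0)))
    contour.append(targets[-1])
    return contour

# An utterance with a high accent falling to a low boundary tone.
targets = labels_to_targets([(0.10, "H*"), (0.50, "L*"), (0.80, "L-L%")])
contour = interpolate_contour(targets, step=0.1)
```

Statistical methods would replace the fixed rule table with values predicted from context, but the target-then-interpolate pipeline is the same.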

Each segment to be synthesised must also have a particular duration such that the synthesised speech mimics the temporal structure of typical human utterances. The duration of a speech sound may vary depending on a series of factors, including speech rate, stress patterns, its position in the word and in the phrase, as well as phone-intrinsic characteristics. Traditionally, a suitable duration for each phone is estimated on the basis of rules. With the availability of labelled speech corpora, data-driven methods are used to derive duration models automatically.
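A minimal sketch of such a data-driven duration model, standing in for the CART-style methods mentioned above: mean durations are estimated per context from a labelled corpus, with a per-phone back-off for unseen contexts. The tiny "corpus" and its durations are invented for illustration.

```python
# Derive a simple duration model from labelled data: mean duration per
# (phone, stressed, phrase_final) context, backing off to the phone mean.
from collections import defaultdict

# Invented labelled corpus: (phone, stressed?, phrase_final?, duration_s).
corpus = [
    ("a", True,  False, 0.12),
    ("a", True,  False, 0.10),
    ("a", False, False, 0.07),
    ("a", True,  True,  0.18),  # phrase-final lengthening
    ("t", False, False, 0.05),
    ("t", False, True,  0.09),
]

def train(corpus):
    """Collect durations per full context and per phone, then average."""
    by_context, by_phone = defaultdict(list), defaultdict(list)
    for phone, stress, final, dur in corpus:
        by_context[(phone, stress, final)].append(dur)
        by_phone[phone].append(dur)
    mean = lambda xs: sum(xs) / len(xs)
    return ({k: mean(v) for k, v in by_context.items()},
            {k: mean(v) for k, v in by_phone.items()})

def predict(model, phone, stress, final):
    by_context, by_phone = model
    # Back off to the phone mean when the exact context was never seen.
    return by_context.get((phone, stress, final), by_phone[phone])

model = train(corpus)
```

A CART model generalises this idea: instead of a flat context table, the tree learns which questions about the context (stress, position, neighbouring phones) best split the duration data.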

6.1.2.3 Waveform Generation

Finally, the speech waveform is generated. The various approaches to waveform generation include rule-based, concatenative and HMM-based synthesis. Rule-based TTS synthesis (e.g. MITalk, Allen et al. (1987)) relies on a simplified mathematical model of human speech production. Concatenative synthesis systems (e.g. Festival, Black and Taylor (1997)), on the other hand, exploit recorded speech data. Most current commercial TTS systems employ concatenative synthesis, either by diphone or unit selection. For diphone synthesis, all possible diphones (sound-to-sound transitions) in a particular language need to be recorded and labelled. The speech database contains only one example of each diphone, spoken by the same speaker. During synthesis, the necessary diphones are concatenated and the target prosody is superimposed by means of digital signal processing techniques like Linear Predictive Coding (LPC, Markel and Gray (1976)), Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA, Moulines and Charpentier (1990)) or Multi-Band Resynthesis OverLap-Add (MBROLA, Dutoit and Leich (1993)). Conversely, unit selection synthesis involves

less digital signal processing. However, it requires several hours of recorded speech data. Each recorded utterance is segmented into units of various sizes, including phones, syllables, morphemes, words, phrases and sentences. This segmentation is typically performed by means of a speech recogniser and hand correction. Each unit is
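The selection step in unit selection synthesis is usually framed as a search: for each position in the target specification, pick one unit from a set of candidates so that a combined target cost (fit to the specification) and join cost (smoothness between adjacent units) is minimised, typically with dynamic programming. The database, cost functions and numbers below are invented for illustration; real systems use far richer features.

```python
# A minimal sketch of the unit selection search as Viterbi-style dynamic
# programming over candidate units. Each position keeps, per candidate,
# the cheapest cumulative cost and a backpointer to the previous unit.

def select_units(candidates, target_cost, join_cost):
    """Return the candidate sequence minimising summed target + join cost."""
    # best[i][c] = (cumulative cost, previous candidate) for unit c at position i
    best = [{c: (target_cost(0, c), None) for c in candidates[0]}]
    for i in range(1, len(candidates)):
        col = {}
        for c in candidates[i]:
            # Cheapest predecessor, accounting for the join into c.
            prev, (cost, _) = min(
                ((p, best[-1][p]) for p in candidates[i - 1]),
                key=lambda kv: kv[1][0] + join_cost(kv[0], c),
            )
            col[c] = (cost + join_cost(prev, c) + target_cost(i, c), prev)
        best.append(col)
    # Trace back the cheapest path from the final position.
    last = min(best[-1], key=lambda c: best[-1][c][0])
    path = [last]
    for i in range(len(best) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return list(reversed(path))

# Invented database: unit id -> F0 (Hz); two candidate units per position.
pitch = {"a1": 100, "a2": 140, "b1": 110, "b2": 180, "c1": 120, "c2": 115}
candidates = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
desired = [105, 112, 118]                        # target F0 per position

tc = lambda i, u: abs(pitch[u] - desired[i])     # fit to the specification
jc = lambda u, v: 0.5 * abs(pitch[u] - pitch[v]) # smoothness at the join

print(select_units(candidates, tc, jc))  # -> ['a1', 'b1', 'c2']
```

Note that the search trades the two costs off: `c2` fits the last target slightly worse than `c1`, but its cheaper join to `b1` makes it the better overall choice.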
