PhD thesis - School of Informatics - University of Edinburgh
Chapter 6. Other Potential Applications
automatic data-driven methods, such as decision tree models, for example Classification and Regression Tree (CART) models (Breiman et al., 1984). Once the abstract labels have been assigned to the utterance, they must be transformed into numerical F0 targets and converted into an F0 contour. There are many different approaches to this task, including the computation of an average pitch contour as well as rule-based or statistical methods.
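The two steps above can be sketched in a few lines. This is a toy illustration, not the method used in any particular system: the feature names, the hand-written tree (standing in for a CART grown from data) and the F0 values are all hypothetical, and the target-to-contour conversion here is simple linear interpolation.

```python
# Toy sketch: CART-style prediction of numerical F0 targets from abstract
# prosodic labels, followed by conversion of the targets into a contour.
# Feature names and all numeric values are hypothetical.

def predict_f0_target(accent: str, position: str) -> float:
    """A hand-written stand-in for a CART: each question splits on a
    linguistic feature. A real tree would be grown from labelled data."""
    if accent == "accented":
        return 220.0 if position == "phrase-initial" else 180.0
    return 140.0 if position == "phrase-final" else 160.0

def targets_to_contour(targets, points_per_target=5):
    """Convert per-syllable F0 targets into a contour by linear
    interpolation between successive targets."""
    contour = []
    for a, b in zip(targets, targets[1:]):
        for i in range(points_per_target):
            contour.append(a + (b - a) * i / points_per_target)
    contour.append(targets[-1])
    return contour

syllables = [("accented", "phrase-initial"),
             ("unaccented", "phrase-medial"),
             ("unaccented", "phrase-final")]
targets = [predict_f0_target(a, p) for a, p in syllables]
contour = targets_to_contour(targets)
```

In a data-driven system, the if/else questions would instead be learned automatically by recursively splitting a labelled corpus, and a smoothing model would replace the piecewise-linear interpolation.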
Each segment to be synthesised must also have a particular duration so that the synthesised speech mimics the temporal structure of typical human utterances. The duration of a speech sound varies depending on a number of factors, including speech rate, stress patterns, its position in the word and in the phrase, and phone-intrinsic characteristics. Traditionally, a suitable duration for each phone is estimated on the basis of rules. With the availability of labelled speech corpora, data-driven methods can be used to derive duration models automatically.
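The rule-based approach can be sketched as follows, loosely in the spirit of Klatt-style duration rules: each phone has an inherent and a minimum duration, and multiplicative contextual factors scale the stretchable part between them. All phone labels, durations and factor values here are illustrative assumptions, not measured figures.

```python
# Rule-based duration sketch, loosely in the spirit of Klatt-style rules.
# Inherent/minimum durations (ms) and scaling factors are illustrative.

INHERENT_MS = {"a": 120.0, "t": 70.0, "s": 100.0}  # hypothetical values
MINIMUM_MS  = {"a": 50.0,  "t": 30.0, "s": 40.0}

def phone_duration(phone, stressed=False, phrase_final=False, rate=1.0):
    """Combine multiplicative contextual factors, then scale only the
    compressible range between minimum and inherent duration."""
    factor = 1.0
    if stressed:
        factor *= 1.4        # stressed syllables lengthen
    if phrase_final:
        factor *= 1.3        # pre-pausal lengthening
    factor /= rate           # faster speech rate -> shorter phones
    d_min, d_inh = MINIMUM_MS[phone], INHERENT_MS[phone]
    return d_min + (d_inh - d_min) * factor
```

A data-driven alternative would replace the hand-set factors by fitting, for instance, a regression tree to phone durations observed in a labelled corpus, using the same contextual features as predictors.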
6.1.2.3 Waveform Generation
Finally, the speech waveform is generated. Approaches to waveform generation include rule-based, concatenative and HMM-based synthesis. Rule-based TTS synthesis (e.g. MITalk, Allen et al. (1987)) relies on a simplified mathematical model of human speech production. Concatenative synthesis systems (e.g. Festival, Black and Taylor (1997)), on the other hand, exploit recorded speech data. Most current commercial TTS systems employ concatenative synthesis, either by diphone or by unit selection. For diphone synthesis, all possible diphones (sound-to-sound transitions) in a particular language need to be recorded and labelled. The speech database contains only one example of each diphone, spoken by the same speaker. During synthesis, the necessary diphones are concatenated and the target prosody is superimposed by means of digital signal processing techniques such as Linear Predictive Coding (LPC, Markel and Gray (1976)), Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA, Moulines and Charpentier (1990)) or Multi-Band Resynthesis Overlap-Add (MBROLA, Dutoit and Leich (1993)). Unit selection synthesis, conversely, involves less digital signal processing, but it requires several hours of recorded speech data. Each recorded utterance is segmented into units of various sizes, including phones, syllables, morphemes, words, phrases and sentences. This segmentation is typically performed by means of a speech recogniser followed by hand correction. Each unit is