PhD thesis - School of Informatics - University of Edinburgh
Chapter 6. Other Potential Applications
automatic data-driven methods, such as decision tree models, for example Classification and Regression Tree (CART) models (Breiman et al., 1984). Once the abstract labels have been assigned to the utterance, they must be transformed into numerical F0 targets and converted into an F0 contour. There are many different approaches to this task, including the computation of an average pitch contour as well as rule-based or statistical methods.
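The two steps above can be sketched in a few lines. This is a toy illustration, not the method used in any particular system: the feature names, the hand-written tree (standing in for a CART grown from data) and the F0 values are all hypothetical, and the target-to-contour conversion here is simple linear interpolation.

```python
# Toy sketch: CART-style prediction of numerical F0 targets from abstract
# prosodic labels, followed by conversion of the targets into a contour.
# Feature names and all numeric values are hypothetical.

def predict_f0_target(accent: str, position: str) -> float:
    """A hand-written stand-in for a CART: each question splits on a
    linguistic feature. A real tree would be grown from labelled data."""
    if accent == "accented":
        return 220.0 if position == "phrase-initial" else 180.0
    return 140.0 if position == "phrase-final" else 160.0

def targets_to_contour(targets, points_per_target=5):
    """Convert per-syllable F0 targets into a contour by linear
    interpolation between successive targets."""
    contour = []
    for a, b in zip(targets, targets[1:]):
        for i in range(points_per_target):
            contour.append(a + (b - a) * i / points_per_target)
    contour.append(targets[-1])
    return contour

syllables = [("accented", "phrase-initial"),
             ("unaccented", "phrase-medial"),
             ("unaccented", "phrase-final")]
targets = [predict_f0_target(a, p) for a, p in syllables]
contour = targets_to_contour(targets)
```

In a data-driven system, the if/else questions would instead be learned automatically by recursively splitting a labelled corpus, and a smoothing model would replace the piecewise-linear interpolation.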
Each segment to be synthesised must also have a particular duration so that the synthesised speech mimics the temporal structure of typical human utterances. The duration of a speech sound varies depending on a number of factors, including speech rate, stress patterns, its position in the word and in the phrase, and phone-intrinsic characteristics. Traditionally, a suitable duration for each phone is estimated on the basis of rules. With the availability of labelled speech corpora, data-driven methods can be used to derive duration models automatically.
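The rule-based approach can be sketched as follows, loosely in the spirit of Klatt-style duration rules: each phone has an inherent and a minimum duration, and multiplicative contextual factors scale the stretchable part between them. All phone labels, durations and factor values here are illustrative assumptions, not measured figures.

```python
# Rule-based duration sketch, loosely in the spirit of Klatt-style rules.
# Inherent/minimum durations (ms) and scaling factors are illustrative.

INHERENT_MS = {"a": 120.0, "t": 70.0, "s": 100.0}  # hypothetical values
MINIMUM_MS  = {"a": 50.0,  "t": 30.0, "s": 40.0}

def phone_duration(phone, stressed=False, phrase_final=False, rate=1.0):
    """Combine multiplicative contextual factors, then scale only the
    compressible range between minimum and inherent duration."""
    factor = 1.0
    if stressed:
        factor *= 1.4        # stressed syllables lengthen
    if phrase_final:
        factor *= 1.3        # pre-pausal lengthening
    factor /= rate           # faster speech rate -> shorter phones
    d_min, d_inh = MINIMUM_MS[phone], INHERENT_MS[phone]
    return d_min + (d_inh - d_min) * factor
```

A data-driven alternative would replace the hand-set factors by fitting, for instance, a regression tree to phone durations observed in a labelled corpus, using the same contextual features as predictors.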
6.1.2.3 Waveform Generation
Finally, the speech waveform is generated. Approaches to waveform generation include rule-based, concatenative and HMM-based synthesis. Rule-based TTS synthesis (e.g. MITalk, Allen et al. (1987)) relies on a simplified mathematical model of human speech production. Concatenative synthesis systems (e.g. Festival, Black and Taylor (1997)), on the other hand, exploit recorded speech data. Most current commercial TTS systems employ concatenative synthesis, either by diphone or by unit selection. For diphone synthesis, all possible diphones (sound-to-sound transitions) in a particular language need to be recorded and labelled. The speech database contains only one example of each diphone, spoken by the same speaker. During synthesis, the necessary diphones are concatenated and the target prosody is superimposed by means of digital signal processing techniques such as Linear Predictive Coding (LPC, Markel and Gray (1976)), Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA, Moulines and Charpentier (1990)) or Multi-Band Resynthesis Overlap-Add (MBROLA, Dutoit and Leich (1993)). Unit selection synthesis, conversely, involves less digital signal processing, but it requires several hours of recorded speech data. Each recorded utterance is segmented into units of various sizes, including phones, syllables, morphemes, words, phrases and sentences. This segmentation is typically performed by means of a speech recogniser followed by hand correction. Each unit is