Concatenative synthesis - University of Birmingham

THE UNIVERSITY OF BIRMINGHAM
SCHOOL OF ELECTRONIC & ELECTRICAL ENGINEERING

Speech Synthesis 

Part I: Concatenative synthesis 

Version 4 February 2002 

Digital Systems & Vision Processing

Inaugural Lecture 

22-Nov-00 

SLIDE 1


Speech Synthesis

• The part of speech technology which is concerned with automatically generating speech from a computer.
• Typically, input is text and the desired output is an acoustic speech signal, hence: text-to-speech synthesis
• First stage is text normalisation




Text normalisation

• Consider:
“This morning’s 6am BBC news, read by Sarah Mukherjee, announced that the OPEC countries will restrict oil exports to the UK to 22,000 barrels per day”
• Problems
– 6am
– BBC
– Mukherjee
– OPEC
– UK
– 22,000
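
A minimal sketch of how a normaliser might expand some of the problem tokens above. The lookup tables, the respelling of "OPEC" and the function names are illustrative assumptions, not part of the lecture; a name such as "Mukherjee" is left untouched here because it is a pronunciation problem rather than an expansion problem.

```python
import re

# Hypothetical hand-written tables; a real system would need far larger ones.
SPELLED_ACRONYMS = {"BBC": "B B C", "UK": "U K"}  # read letter by letter
WORD_ACRONYMS = {"OPEC": "oh peck"}               # read as one word (assumed respelling)

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999,999 (British style, with 'and')."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " and " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    words = number_to_words(thousands) + " thousand"
    if rest:
        words += (" and " if rest < 100 else " ") + number_to_words(rest)
    return words


def normalise(text: str) -> str:
    """Expand times, numbers and known acronyms into speakable words."""
    # Times such as "6am" -> "six a m".
    text = re.sub(
        r"\b(\d{1,2})(am|pm)\b",
        lambda m: number_to_words(int(m.group(1))) + " " + " ".join(m.group(2)),
        text,
    )
    # Numbers, with or without thousands separators, e.g. "22,000".
    text = re.sub(
        r"\b\d{1,3}(?:,\d{3})+\b|\b\d+\b",
        lambda m: number_to_words(int(m.group(0).replace(",", ""))),
        text,
    )
    # Acronyms found in either table.
    for table in (SPELLED_ACRONYMS, WORD_ACRONYMS):
        for acronym, spoken in table.items():
            text = re.sub(rf"\b{acronym}\b", spoken, text)
    return text
```

Called on the example sentence, `normalise` would turn "6am BBC news" into "six a m B B C news" and "22,000 barrels" into "twenty-two thousand barrels".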


Text-to-Phoneme Conversion

• The next stage is to convert the normalised text into a sequence of phonetic elements - a symbolic description of its pronunciation
• This is text-to-phone conversion
• e.g. “this slide is too long”
/T I s # s l aI d # I z # t u # l Q G # /




Text-to-Phoneme Conversion

• The sequence of phonetic segments is typically obtained by:
– looking up individual words in a pronunciation dictionary (often referred to as the exceptions dictionary for historical reasons) or
– applying letter-to-sound rules
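
The dictionary-then-rules lookup described above can be sketched as follows. The dictionary entries reuse the symbols printed in the "this slide is too long" example; the three letter-to-sound rules and all function names are toy assumptions chosen only so that the word "long" can be handled by rule.

```python
# Hypothetical exceptions dictionary, using the phone symbols from the slide example.
EXCEPTIONS = {
    "this": ["T", "I", "s"],
    "slide": ["s", "l", "aI", "d"],
    "is": ["I", "z"],
    "too": ["t", "u"],
}

# Toy letter-to-sound rules, longest grapheme first (assumed, far from complete).
RULES = [("ng", "G"), ("l", "l"), ("o", "Q")]


def letters_to_phones(word):
    """Apply the rules greedily from left to right."""
    phones = []
    i = 0
    while i < len(word):
        for grapheme, phone in RULES:
            if word.startswith(grapheme, i):
                phones.append(phone)
                i += len(grapheme)
                break
        else:
            raise ValueError(f"no letter-to-sound rule for {word[i]!r} in {word!r}")
    return phones


def text_to_phones(text):
    """Look each word up in the dictionary, falling back to the rules."""
    parts = []
    for word in text.lower().split():
        phones = EXCEPTIONS.get(word) or letters_to_phones(word)
        parts.append(" ".join(phones) + " #")  # '#' marks a word boundary
    return " ".join(parts)
```

With these tables, `text_to_phones("this slide is too long")` gives the same symbol string as the example, with "long" produced by the rules rather than by lookup.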


• Finally, a method is needed to convert the sequence of phonetic segments into an acoustic signal.




The Role of Prosody

• The meaning of an utterance will affect its acoustic realisation.
• In fact, the most commonly cited shortcoming of speech synthesis is its prosodic structure.




Prosody

• Prosody is the term used to describe:
– Durational structure of a speech signal (the relative lengths of its different parts, and the presence and lengths of any pauses)
– Amplitude structure (the relative amplitudes of its different parts)
– Intonational structure (in other words, the pattern of fundamental frequency)
• Prosody includes stress and rhythm



The Role of Prosody 

• Consider:
– Joe: “Hey, did you hear? Sam took Mary out and bought her a pizza!”
– Mike: “You’re wrong - Sam didn’t buy Mary a pizza”
• “Sam didn’t buy Mary a PIZZA”
• “Sam didn’t buy MARY a pizza”
• “Sam didn’t BUY Mary a pizza”
• “Sam DIDN’T buy Mary a pizza”
• “SAM didn’t buy Mary a pizza”
(from Altmann, “The ascent of Babel”, reference in notes)


Approaches to synthesis

• Phone-sequence → acoustic signal
• Many different approaches, but for the purposes of this course they will be divided into two classes:
– Waveform concatenation
– Model-based speech synthesis




Waveform Concatenation

• Join together, or concatenate, stored sections of real speech
• Sections may correspond to whole words or to sub-word units
• Early systems were based on whole words
– e.g. Speaking clock, UK telephone system, 1936
• Storage and access are major issues
• Speech quality requires data rates of 16,000 to 32,000 bits per second (bps)
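
To see why storage is an issue at these data rates, a small worked calculation may help. The vocabulary size and average word duration below are purely illustrative assumptions; only the two bit rates come from the slide.

```python
def storage_bytes(duration_s, bit_rate_bps):
    """Bytes needed to store speech of the given duration at the given bit rate."""
    return duration_s * bit_rate_bps / 8


# Assumed figures for illustration: 1,000 recorded words of 0.5 s each.
VOCABULARY_SIZE = 1000
AVERAGE_WORD_DURATION_S = 0.5

total_duration_s = VOCABULARY_SIZE * AVERAGE_WORD_DURATION_S
low = storage_bytes(total_duration_s, 16_000)   # 1,000,000 bytes
high = storage_bytes(total_duration_s, 32_000)  # 2,000,000 bytes
```

Under these assumptions the recordings alone occupy between one and two million bytes, before any alternative versions of a word are added.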


1936 “Speaking Clock”

From John Holmes, “Speech synthesis and recognition”, courtesy of British Telecommunications plc





Whole-Word Concatenation 

• Whole-word concatenation can give good quality speech (as in the speaking clock) but has many disadvantages:
– Pronunciation of a word is influenced by neighbouring words (co-articulation)
– Prosodic effects like intonation, rate of speaking and amplitude are also influenced by context.
– Interpretation of a sentence will be strongly influenced by details of the individual words used. Remember “Sam didn’t buy Mary a pizza”!



• Disadvantages (continued):
– Words must be extracted from the right sort of sentence.
– Most suitable for applications where the structure of the sentence is constrained, e.g. announcements, lists…
– May need to record more than one example of each word, e.g. raised pitch at the end of a list, pre-pause lengthening…




• Disadvantages (continued):
– To add new words the original speaker must be found, or all words must be re-recorded.
– Even with specialist facilities, selection and extraction of suitable words is labour intensive and time consuming




Sub-Word Concatenation

• Limitations of word-based methods suggest concatenative speech synthesis based on sub-word units.
• Need a well-annotated, phonetically-balanced corpus of speech recordings
• Extract fragments from waveforms in the corpus which represent ‘basic units’ of speech
• These are concatenated and used for speech synthesis






• Difficulties include:
– careful annotation of large amounts of data needed
– derivation of a good method for concatenation
– identification of a set of suitable units.





• Sub-word concatenation overcomes difficulties with adding new words to the application vocabulary
• BUT, other problems are exacerbated.
• In particular, coarticulation and pitch continuity problems now occur within, as well as between, words
• Necessary to use several examples of each phone (corresponding roughly to different allophones).



• It is natural to select fragments which characterise the phone target values, BUT modelling transitions between these targets is a significant problem
[Figure: Target 1 → ? → Target 2]


Transitional Units

• Central regions of many speech sounds are approximately stationary and less susceptible to coarticulation effects.
• Hence select fragments which characterise transitions between phones, rather than phone targets.
• e.g. diphone - transition between two phones.
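
As a rough illustration of the diphone idea, the sketch below turns a phone sequence into the list of diphone units that would have to be fetched from the inventory. The "a-b" naming scheme and the use of "#" for silence at the utterance edges are assumptions made for this example.

```python
def phones_to_diphones(phones, silence="#"):
    """Return one diphone name per transition, including silence at both ends."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# The word "slide" from the earlier transcription example.
units = phones_to_diphones(["s", "l", "aI", "d"])
# units == ['#-s', 's-l', 'l-aI', 'aI-d', 'd-#']
```

A sequence of n phones therefore needs n + 1 diphones, each cut from the middle of one phone to the middle of the next.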




Transitional Units (Continued)

• There are contextually-induced differences between instantiations of the central region of a phone, which cause discontinuities if they are not attended to.
• Possible solutions are:
– Use several different examples of each diphone
– Store short transition regions, and interpolate between end values.
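
The second solution can be sketched as simple linear interpolation. This assumes each stored transition region is represented as frames of smoothly varying parameters (for example, some description of the spectral envelope); that representation, and the function name, are assumptions for illustration rather than something the slides specify.

```python
def interpolate_frames(end_frame, start_frame, n_steps):
    """Return n_steps frames lying strictly between two parameter frames.

    end_frame   -- last frame of the preceding transition region
    start_frame -- first frame of the following transition region
    """
    frames = []
    for k in range(1, n_steps + 1):
        t = k / (n_steps + 1)  # fraction of the way from end_frame to start_frame
        frames.append([(1 - t) * a + t * b for a, b in zip(end_frame, start_frame)])
    return frames


bridge = interpolate_frames([0.0, 10.0], [4.0, 2.0], 3)
# bridge == [[1.0, 8.0], [2.0, 6.0], [3.0, 4.0]]
```

The interpolated frames fill the approximately stationary centre of the phone, so only the short transition regions need to be stored.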


Transitional Units

[Figure: two panels, (a) and (b)]

• Coping with coarticulation effects by modelling transitions and
– (a) using multiple examples to cope with variation in the instantiation of the phone centres, and
– (b) by interpolation between short transition regions

Prosody and Concatenative Synthesis

• Discontinuity in the fundamental frequency is exacerbated for sub-word methods.
• Can use a source-filter model to separate the excitation signal from the vocal tract shape
• Vocal tract shape descriptions can then be concatenated and an appropriately smooth fundamental frequency pattern can be added separately.
• (Will become clearer after the section on model-based synthesis)


PSOLA


• PSOLA - Pitch Synchronous OverLap and Add (Charpentier, 1986)
• Most successful current approach to concatenative synthesis
• In PSOLA the end regions of windowed waveform samples corresponding to a single excitation period are overlapped pitch-synchronously and added
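
A minimal sketch of the overlap-and-add step, under simplifying assumptions that are not taken from the slides: the pitch period is constant, the pitch marks are already known, and each segment is a two-period-long Hann-windowed slice centred on a pitch mark. Real speech has a varying period, so this shows only the mechanics.

```python
import math


def hann(length):
    """Periodic Hann window: copies shifted by length/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def extract_grains(signal, pitch_marks, period):
    """Cut one windowed, two-period-long segment centred on each pitch mark."""
    window = hann(2 * period)
    grains = []
    for mark in pitch_marks:
        grain = []
        for n in range(2 * period):
            idx = mark - period + n
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            grain.append(sample * window[n])
        grains.append(grain)
    return grains


def overlap_add(grains, synthesis_marks, period, length):
    """Place each grain at its synthesis mark and sum the overlapping ends."""
    out = [0.0] * length
    for grain, mark in zip(grains, synthesis_marks):
        for n, value in enumerate(grain):
            idx = mark - period + n
            if 0 <= idx < length:
                out[idx] += value
    return out


# Toy signal: a sinusoid with a 20-sample period and a pitch mark every period.
period = 20
signal = [math.sin(2 * math.pi * n / period) for n in range(200)]
marks = list(range(period, 200 - period + 1, period))
grains = extract_grains(signal, marks, period)
rebuilt = overlap_add(grains, marks, period, len(signal))
```

When the grains are put back at their original marks, the overlapping window ends sum to one and the interior of the signal is recovered unchanged; placing the same grains at different spacings is what alters the pitch or duration.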




PSOLA

From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001

Speech modification using PSOLA

• In addition to speech synthesis from segments, there are two other common applications of PSOLA:
– Pitch modification
– Duration modification
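
Both modifications come down to deciding where the output pitch marks go and which original pitch period is placed at each one. The sketch below plans that mapping only and does not touch audio samples; the function name, the nearest-mark selection rule and the two scaling factors are assumptions for illustration.

```python
def plan_synthesis(analysis_marks, pitch_factor=1.0, duration_factor=1.0):
    """Return (synthesis_mark, analysis_index) pairs.

    pitch_factor > 1 puts output marks closer together (higher pitch);
    duration_factor > 1 stretches the output (periods get reused).
    Requires at least two analysis marks.
    """
    start, end = analysis_marks[0], analysis_marks[-1]
    out_end = start + (end - start) * duration_factor
    plan = []
    t = float(start)
    while t <= out_end:
        # Point in the original signal that corresponds to output time t.
        source_time = start + (t - start) / duration_factor
        # The nearest original pitch mark supplies the segment.
        idx = min(range(len(analysis_marks)),
                  key=lambda i: abs(analysis_marks[i] - source_time))
        # Local pitch period around that mark.
        j = min(idx, len(analysis_marks) - 2)
        local_period = analysis_marks[j + 1] - analysis_marks[j]
        plan.append((round(t), idx))
        t += local_period / pitch_factor
    return plan


marks = [0, 100, 200, 300, 400]
higher = plan_synthesis(marks, pitch_factor=2.0)    # periods reused, spacing halved
lower = plan_synthesis(marks, pitch_factor=0.5)     # periods skipped, spacing doubled
longer = plan_synthesis(marks, duration_factor=2.0) # same spacing, each period reused
```

In this toy case raising the pitch by a factor of two uses most periods twice at half the spacing, lowering it keeps only every other period, and doubling the duration repeats periods while leaving the spacing, and hence the pitch, unchanged.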




Increasing pitch using PSOLA



Decreasing pitch using PSOLA




The ‘Laureate’ System

• The BT “Laureate” system is a modern, PSOLA-based synthesiser
• See Edington et al. (1996a), also look at the web site
• Demonstration



PSOLA strengths and weaknesses

• Strengths
– Produces good quality speech
• Weaknesses
– Large, annotated corpus needed for each ‘voice’
– Requires accurate pitch peak detection
– Inflexible: new voices can only be produced by recording and labelling significant speech corpora from new speakers
• Automatic annotation of corpora using techniques from speech recognition


Summary

• Introduction to speech synthesis
• Concatenative synthesis
– Word concatenation
– Sub-word concatenation
– PSOLA


