Concatenative synthesis - University of Birmingham
THE UNIVERSITY OF BIRMINGHAM
SCHOOL OF ELECTRONIC & ELECTRICAL ENGINEERING
Speech Synthesis
Part I: Concatenative synthesis
Version 4 February 2002
Digital Systems & Vision Processing
Inaugural Lecture
22-Nov-00
SLIDE 1
Speech Synthesis
• The part of speech technology which is concerned with automatically generating speech from a computer.
• Typically, input is text and the desired output is an acoustic speech signal, hence: text-to-speech synthesis
• First stage is text normalisation
SLIDE 2
Text normalisation
• Consider:
“This morning’s 6am BBC news, read by Sarah Mukherjee, announced that the OPEC countries will restrict oil exports to the UK to 22,000 barrels per day”
• Problems
– 6am
– BBC
– Mukherjee
– OPEC
– UK
– 22,000
SLIDE 3
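The problem tokens listed above can be handled by small expansion rules. The following is a minimal sketch, not a description of any particular system: the tables, the function names `number_to_words`, `normalise_token` and `normalise`, and the choice to spell acronyms letter by letter are all assumptions made for illustration. A proper name such as "Mukherjee" is passed through unchanged, because its difficulty lies in pronunciation rather than expansion.

```python
import re

# Hypothetical table for illustration only; real systems use large lexicons.
SPOKEN_AS_WORD = {"OPEC": "opec"}  # acronyms read as a single word

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999,999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " and " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    words = number_to_words(thousands) + " thousand"
    if rest:
        words += (" and " if rest < 100 else " ") + number_to_words(rest)
    return words


def normalise_token(token):
    """Expand one whitespace-delimited token into speakable words."""
    time_match = re.fullmatch(r"(\d{1,2})(am|pm)", token)
    if time_match:  # e.g. "6am" -> "six a m"
        hour = number_to_words(int(time_match.group(1)))
        return hour + " " + " ".join(time_match.group(2))
    if re.fullmatch(r"\d{1,3}(?:,\d{3})?", token):  # e.g. "22,000"
        return number_to_words(int(token.replace(",", "")))
    if token in SPOKEN_AS_WORD:
        return SPOKEN_AS_WORD[token]
    if token.isalpha() and token.isupper() and len(token) > 1:
        return " ".join(token)  # spell letter by letter, e.g. "B B C"
    return token  # ordinary words and names pass through unchanged


def normalise(text):
    return " ".join(normalise_token(token) for token in text.split())


print(normalise("the 6am BBC news"))  # the six a m B B C news
print(normalise_token("22,000"))      # twenty-two thousand
```

Tokens with attached punctuation (for example "news,") are deliberately left alone here; a fuller normaliser would separate punctuation first.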
Text-to-Phoneme Conversion
• The next stage is to convert the normalised text into a sequence of phonetic elements - a symbolic description of its pronunciation
• This is text-to-phone conversion
• e.g.: “this slide is too long”
/T I s # s l aI d # I z # t u # l Q G # /
SLIDE 4
Text-to-Phoneme Conversion
• Sequence of phonetic segments typically obtained by:
– looking up individual words in a pronunciation dictionary (often referred to as the exceptions dictionary for historical reasons) or
– applying letter-to-sound rules
• Finally, need a method to convert the sequence of phonetic segments into an acoustic signal.
SLIDE 5
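The two routes to a phone sequence (dictionary lookup, with letter-to-sound rules as the fallback) can be sketched as follows. This is an illustrative assumption rather than a real lexicon or rule set: the dictionary entries reuse the symbols from the example transcription on the earlier slide, while the rule table and the names `letter_to_sound` and `text_to_phones` are hypothetical.

```python
# Hypothetical pronunciation dictionary; symbols copied from the slide's
# transcription of "this slide is too long".
PRONUNCIATIONS = {
    "this": ["T", "I", "s"],
    "slide": ["s", "l", "aI", "d"],
    "is": ["I", "z"],
    "too": ["t", "u"],
    "long": ["l", "Q", "G"],
}

# Hypothetical letter-to-sound rules, multi-letter patterns listed first
# so that they are tried before single letters.
LETTER_RULES = [
    ("th", "T"), ("ng", "G"), ("oo", "u"),
    ("s", "s"), ("l", "l"), ("d", "d"), ("t", "t"),
    ("i", "I"), ("o", "Q"), ("n", "n"), ("z", "z"),
]


def letter_to_sound(word):
    """Convert a word to phones by scanning left to right with LETTER_RULES."""
    phones = []
    i = 0
    while i < len(word):
        for letters, phone in LETTER_RULES:
            if word.startswith(letters, i):
                phones.append(phone)
                i += len(letters)
                break
        else:
            i += 1  # no rule for this letter: skip it (a real system would not)
    return phones


def text_to_phones(text, boundary="#"):
    """Dictionary lookup first, rules as fallback; words separated by '#'."""
    phones = []
    for word in text.lower().split():
        phones.extend(PRONUNCIATIONS.get(word) or letter_to_sound(word))
        phones.append(boundary)
    return phones


print(" ".join(text_to_phones("this slide is too long")))
# T I s # s l aI d # I z # t u # l Q G #
print(letter_to_sound("tooth"))  # ['t', 'u', 'T'] (not in the dictionary)
```

Checking the dictionary before the rules is what lets irregular words override the general rules, which fits the "exceptions dictionary" name mentioned above.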
The Role of Prosody
• The meaning of an utterance will affect its acoustic realisation.
• In fact, the most commonly cited shortcoming of speech synthesis is its prosodic structure:
SLIDE 6
Prosody
• Prosody is the term used to describe:
– Durational structure of a speech signal (the relative lengths of its different parts, and the presence and lengths of any pauses)
– Amplitude structure (the relative amplitudes of its different parts)
– Intonational structure, in other words the fundamental frequency contour
• Prosody includes stress and rhythm
SLIDE 7
The Role of Prosody
• Consider:
– Joe: “Hey - did you hear? Sam took Mary out and bought her a pizza!”
– Mike: “You’re wrong - Sam didn’t buy Mary a pizza”
• “Sam didn’t buy Mary a PIZZA”
• “Sam didn’t buy MARY a pizza”
• “Sam didn’t BUY Mary a pizza”
• “Sam DIDN’T buy Mary a pizza”
• “SAM didn’t buy Mary a pizza”
(from Altmann, “The ascent of Babel”, reference in notes)
SLIDE 8
Approaches to synthesis
• Phone-sequence → acoustic signal
• Many different approaches, but for the purposes of this course they will be divided into two classes:
– Waveform concatenation
– Model-based speech synthesis
SLIDE 9
Waveform Concatenation
• Join together, or concatenate, stored sections of real speech
• Sections may correspond to whole words, or sub-word units
• Early systems based on whole words
– e.g.: Speaking clock - UK telephone system, 1936
• Storage and access are major issues
• Speech quality requires data rates of 16,000 to 32,000 bits per second (bps)
SLIDE 10
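To see why storage was a major issue, the quoted data rates can be turned into bytes. The arithmetic below is a small worked example; the one-minute duration and the function name `storage_bytes` are assumptions chosen for illustration.

```python
def storage_bytes(duration_seconds, bits_per_second):
    """Bytes needed to store speech of a given duration at a given data rate."""
    return duration_seconds * bits_per_second // 8


# One minute of stored speech at the two data rates quoted on the slide
print(storage_bytes(60, 16000))  # 120000
print(storage_bytes(60, 32000))  # 240000
```

So each minute of recorded material costs roughly 120 to 240 kilobytes at these rates, before any indexing needed for access.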
1936 “Speaking Clock”
From John Holmes, “Speech synthesis and recognition”, courtesy of British Telecommunications plc
SLIDE 11
Whole-Word Concatenation
• Whole-word concatenation can give good quality speech (as in the speaking clock) but has many disadvantages:
– Pronunciation of a word is influenced by neighbouring words (co-articulation)
– Prosodic effects like intonation, rate-of-speaking and amplitude are also influenced by context.
– Interpretation of a sentence will be strongly influenced by details of the individual words used - remember “Sam didn’t buy Mary a pizza”!
SLIDE 12
Whole-Word Concatenation
• Disadvantages (continued):
– Words must be extracted from the right sort of sentence.
– Most suitable for applications where the structure of the sentence is constrained, e.g. announcements, lists…
– May need to record more than one example of each word, e.g. raised pitch at end of a list, pre-pause lengthening…
SLIDE 13
Whole-Word Concatenation
• Disadvantages (continued):
– To add new words the original speaker must be found, or all words must be re-recorded.
– Even with specialist facilities, selection and extraction of suitable words is labour intensive and time consuming
SLIDE 14
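For a constrained application such as an announcement system, whole-word concatenation amounts to looking up stored recordings and joining them, with a separate phrase-final recording where one exists. The sketch below is illustrative only: the `RECORDINGS` store, its short integer lists standing in for waveforms, the "medial"/"final" labels and the function `concatenate_words` are all hypothetical.

```python
# Hypothetical store mapping (word, position) to recorded samples.
# Short integer lists stand in for real waveforms.
RECORDINGS = {
    ("platform", "medial"): [1, 2, 3],
    ("four", "medial"): [4, 5],
    ("four", "final"): [4, 5, 5, 6],  # pre-pause lengthened version
}


def concatenate_words(words, pause_samples=2):
    """Join stored word recordings, inserting a short silence between words."""
    output = []
    for index, word in enumerate(words):
        position = "final" if index == len(words) - 1 else "medial"
        # Fall back to the medial recording if no phrase-final one was recorded
        samples = RECORDINGS.get((word, position)) or RECORDINGS[(word, "medial")]
        output.extend(samples)
        if position == "medial":
            output.extend([0] * pause_samples)
    return output


print(concatenate_words(["platform", "four"]))
# [1, 2, 3, 0, 0, 4, 5, 5, 6]
```

A word with no recording at all raises `KeyError`, which mirrors the limitation listed above: extending the vocabulary means going back to the original speaker.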
Sub-Word Concatenation
• Limitations of word-based methods suggest concatenative speech synthesis based on sub-word units.
• Need a well-annotated, phonetically-balanced corpus of recordings of speech
• Extract fragments from waveforms in the corpus which represent ‘basic units’ of speech
• These are concatenated and used for speech synthesis
SLIDE 15
Sub-Word Concatenation
• Difficulties include:
– careful annotation of large amounts of data needed
– derivation of a good method for concatenation
– identification of a set of suitable units.
SLIDE 16
Sub-Word Concatenation
• Sub-word concatenation overcomes difficulties with adding new words to the application vocabulary
• BUT, other problems are exacerbated.
• In particular, coarticulation and pitch continuity problems now occur within, as well as between, words
• Necessary to use several examples of each phone (corresponding roughly to different allophones).
SLIDE 17
Sub-Word Concatenation
• It is natural to select fragments which characterise the phone target values, BUT modelling transitions between these targets is a significant problem
[Figure: Target 1 → ? → Target 2]
SLIDE 18
Transitional Units
• Central regions of many speech sounds are approximately stationary and less susceptible to coarticulation effects.
• Hence select fragments which characterise transitions between phones, rather than phone targets.
• e.g.: diphone - transition between two phones.
SLIDE 19
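Choosing diphones as the unit changes what the synthesiser looks up: each adjacent pair of phones becomes one unit. A minimal sketch follows, assuming silence is marked with "#" at both ends of the utterance; the function names and the 45-symbol inventory size are illustrative assumptions.

```python
def phones_to_diphones(phones, boundary="#"):
    """Convert a phone sequence into the diphone units that span it."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def max_diphone_inventory(n_symbols):
    """Upper bound on distinct diphones for a set of n symbols (every ordered pair)."""
    return n_symbols * n_symbols


print(phones_to_diphones(["s", "l", "aI", "d"]))
# ['#-s', 's-l', 'l-aI', 'aI-d', 'd-#']
print(max_diphone_inventory(45))  # 2025, for an assumed 45-symbol set
```

The upper bound is the square of the symbol count, which is why a diphone inventory is far larger than a phone inventory even though not every pair occurs in practice.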
Transitional Units (Continued)
• There are contextually-induced differences between instantiations of the central region of a phone, which cause discontinuities if they are not attended to.
• Possible solutions are:
– Use several different examples of each diphone
– Store short transition regions, and interpolate between end values.
SLIDE 20
Transitional Units
[Figure: panels (a) and (b)]
• Coping with coarticulation effects by modelling transitions and
– (a) using multiple examples to cope with variation in the instantiation of the phone centres, and
– (b) by interpolation between short transition regions
SLIDE 21
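Option (b) can be illustrated with simple linear interpolation across the phone centre. The sketch assumes that speech is described by frames of numeric parameters (for example formant frequencies) rather than raw waveform samples; that representation, the function name `interpolate_frames` and the example values are assumptions, not details from the slides.

```python
def interpolate_frames(end_frame, start_frame, n_steps):
    """Return n_steps frames linearly interpolated strictly between two frames.

    end_frame:   last parameter frame of the preceding transition region
    start_frame: first parameter frame of the following transition region
    """
    frames = []
    for step in range(1, n_steps + 1):
        weight = step / (n_steps + 1)
        frames.append([(1 - weight) * a + weight * b
                       for a, b in zip(end_frame, start_frame)])
    return frames


# Two-parameter frames, three interpolated frames filling the phone centre
print(interpolate_frames([0.0, 100.0], [4.0, 200.0], 3))
# [[1.0, 125.0], [2.0, 150.0], [3.0, 175.0]]
```

Because only the short transition regions are stored, the interpolated centre removes the mismatch between neighbouring units at the cost of a less natural steady state.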
Prosody and Concatenative Synthesis
• Discontinuity in the fundamental frequency is exacerbated for sub-word methods.
• Can use a source-filter model to separate the excitation signal from the vocal tract shape
• Vocal tract shape descriptions can then be concatenated and an appropriately smooth fundamental frequency pattern can be added separately.
• (Will become clearer after the section on model-based synthesis)
SLIDE 22
PSOLA
• PSOLA - Pitch Synchronous OverLap and Add (Charpentier, 1986)
• Most successful current approach to concatenative synthesis
• In PSOLA the end regions of windowed waveform samples corresponding to a single excitation period are overlapped pitch-synchronously and added
SLIDE 23
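The overlap-and-add step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Charpentier's formulation: the pitch period is taken to be constant and already known, pitch marks are evenly spaced, every segment is two periods long, and a periodic Hann window is used so that overlapping windows sum to one. The names `hann_window` and `overlap_add` are hypothetical.

```python
import math


def hann_window(length):
    """Periodic Hann window: copies shifted by length/2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def overlap_add(signal, analysis_marks, synthesis_marks, period):
    """Window a two-period segment around each analysis pitch mark and add it
    into the output centred on the corresponding synthesis pitch mark."""
    window = hann_window(2 * period)
    out_len = max(synthesis_marks) + period
    output = [0.0] * out_len
    for a_mark, s_mark in zip(analysis_marks, synthesis_marks):
        for n in range(2 * period):
            src = a_mark - period + n   # sample position in the input
            dst = s_mark - period + n   # sample position in the output
            if 0 <= src < len(signal) and 0 <= dst < out_len:
                output[dst] += window[n] * signal[src]
    return output


# Sanity check: with unchanged pitch marks the signal is reconstructed
period = 8
signal = [math.sin(2 * math.pi * n / period) for n in range(64)]
marks = list(range(0, 57, period))
rebuilt = overlap_add(signal, marks, marks, period)
print(max(abs(rebuilt[n] - signal[n]) for n in range(56)) < 1e-9)  # True
```

Placing the synthesis marks closer together or further apart than the analysis marks changes the spacing of the copied periods, which is the basis of the pitch and duration modifications described on the following slides. A practical system would also need pitch mark detection and a window length that follows the local period.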
PSOLA
From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001
SLIDE 24
Speech modification using PSOLA
• In addition to speech synthesis from segments, there are two other common applications of PSOLA:
– Pitch modification
– Duration modification
SLIDE 25
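Both modifications come down to deciding where the output pitch marks go and which input period is copied to each of them. The planning step below is a sketch under simplifying assumptions: the input has a constant pitch period (taken from the first two marks), and each output mark reuses the nearest input mark. The function name `plan_synthesis_marks` and its parameters are hypothetical.

```python
def plan_synthesis_marks(analysis_marks, pitch_factor=1.0, duration_factor=1.0):
    """Return (synthesis_mark, analysis_index) pairs.

    pitch_factor > 1 raises the pitch (output marks closer together);
    duration_factor > 1 lengthens the speech (more output marks overall).
    Assumes at least two analysis marks and a constant input period.
    """
    period = analysis_marks[1] - analysis_marks[0]
    new_period = round(period / pitch_factor)
    start, end = analysis_marks[0], analysis_marks[-1]
    new_end = start + (end - start) * duration_factor
    plan = []
    t = start
    while t <= new_end:
        # Time in the input that corresponds to output time t
        source_time = start + (t - start) / duration_factor
        nearest = min(range(len(analysis_marks)),
                      key=lambda i: abs(analysis_marks[i] - source_time))
        plan.append((t, nearest))
        t += new_period
    return plan


marks = [0, 10, 20, 30, 40]
# Doubling the pitch at the same duration: every input period is used twice
print([index for _, index in plan_synthesis_marks(marks, pitch_factor=2)])
# [0, 0, 1, 1, 2, 2, 3, 3, 4]
# Halving the pitch at the same duration: alternate input periods are dropped
print([index for _, index in plan_synthesis_marks(marks, pitch_factor=0.5)])
# [0, 2, 4]
```

Raising the pitch without shortening the utterance therefore requires repeating periods, and lowering it requires discarding some; duration modification repeats or discards periods while leaving their spacing unchanged.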
Increasing pitch using PSOLA
From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001
SLIDE 26
Decreasing pitch using PSOLA
From: John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001
SLIDE 27
The ‘Laureate’ System
• The BT “Laureate” system is a modern, PSOLA-based synthesiser
• See Edington et al. (1996a), also look at the web site
• Demonstration
SLIDE 28
PSOLA strengths and weaknesses
• Strengths
– Produces good quality speech
• Weaknesses
– Large, annotated corpus needed for each ‘voice’
– Requires accurate pitch peak detection
– Inflexible: new voices can only be produced by recording and labelling significant speech corpora from new speakers
• Automatic annotation of corpora using techniques from speech recognition
SLIDE 29
Summary
• Introduction to speech synthesis
• Concatenative synthesis
– Word concatenation
– Sub-word concatenation
– PSOLA
SLIDE 30