Concatenative synthesis - University of Birmingham

THE UNIVERSITY OF BIRMINGHAM
SCHOOL OF ELECTRONIC & ELECTRICAL ENGINEERING

Speech Synthesis
Part I: Concatenative synthesis
Version 4 February 2002

Digital Systems & Vision Processing
Inaugural Lecture
22-Nov-00

SLIDE 1


Speech Synthesis

• The part of speech technology which is concerned with automatically generating speech from a computer.
• Typically, input is text and the desired output is an acoustic speech signal, hence: text-to-speech synthesis
• First stage is text normalisation

SLIDE 2


Text normalisation

• Consider:
  “This morning’s 6am BBC news, read by Sarah Mukherjee, announced that the OPEC countries will restrict oil exports to the UK to 22,000 barrels per day”
• Problems
  – 6am
  – BBC
  – Mukherjee
  – OPEC
  – UK
  – 22,000

SLIDE 3
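The problem tokens listed above can be handled by simple token-level rewriting. The following is a minimal sketch, not a description of any particular system: the lookup tables, the function names and the spoken forms chosen (for example "six a m") are all assumptions made for illustration. It covers times, comma-grouped numbers below one million, and two kinds of abbreviation; a proper name such as "Mukherjee" passes through unchanged, since it is a pronunciation problem rather than a normalisation one.

```python
import re

# Hypothetical lookup tables, for illustration only.
LETTER_NAMES = {"B": "bee", "C": "see", "K": "kay", "U": "you"}
SPELL_OUT = {"BBC", "UK"}          # abbreviations read letter by letter
READ_AS_WORD = {"OPEC": "opec"}    # acronyms read as an ordinary word

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999,999 (British 'and' style)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " and " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    words = number_to_words(thousands) + " thousand"
    if rest:
        words += (" and " if rest < 100 else " ") + number_to_words(rest)
    return words


def normalise_token(token):
    """Rewrite one token into words; unknown tokens are returned unchanged."""
    time = re.fullmatch(r"(\d{1,2})(am|pm)", token)
    if time:
        return number_to_words(int(time.group(1))) + " " + " ".join(time.group(2))
    if re.fullmatch(r"\d{1,3}(,\d{3})?", token):
        return number_to_words(int(token.replace(",", "")))
    if token in SPELL_OUT:
        return " ".join(LETTER_NAMES[ch] for ch in token)
    if token in READ_AS_WORD:
        return READ_AS_WORD[token]
    return token


def normalise(text):
    """Normalise whitespace-separated tokens (punctuation is not handled)."""
    return " ".join(normalise_token(t) for t in text.split())


print(normalise("the 6am BBC news said OPEC will restrict exports to the UK to 22,000 barrels"))
```

The table-driven design reflects the fact that each problem class needs its own rule: whether an abbreviation is spelled out or read as a word cannot be decided from its spelling alone.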


Text-to-Phoneme Conversion

• The next stage is to convert the normalised text into a sequence of phonetic elements: a symbolic description of its pronunciation
• This is text-to-phone conversion
• e.g. “this slide is too long”
  /T I s # s l aI d # I z # t u # l Q G # /

SLIDE 4


Text-to-Phoneme Conversion

• Sequence of phonetic segments typically obtained by:
  – looking up individual words in a pronunciation dictionary (often referred to as the exceptions dictionary for historical reasons) or
  – applying letter-to-sound rules
• Finally, need a method to convert the sequence of phonetic segments into an acoustic signal.

SLIDE 5
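The two routes to a phone sequence can be combined as "dictionary first, rules as fallback". The sketch below assumes a toy exceptions dictionary and a handful of toy letter-to-sound rules; both tables and all function names are hypothetical. The phone symbols, and the use of `#` as a word boundary, are copied from the transcription of “this slide is too long” given earlier.

```python
# Hypothetical exceptions dictionary, using the phone symbols from the slides.
EXCEPTIONS = {
    "this": ["T", "I", "s"],
    "slide": ["s", "l", "aI", "d"],
    "is": ["I", "z"],
    "too": ["t", "u"],
    "long": ["l", "Q", "G"],
}

# Toy letter-to-sound rules (assumed). Longer letter strings come first so
# that "sh" is tried before "s".
LETTER_RULES = [
    ("sh", ["S"]),
    ("s", ["s"]),
    ("i", ["I"]),
    ("t", ["t"]),
    ("l", ["l"]),
    ("d", ["d"]),
    ("z", ["z"]),
]


def letter_to_sound(word):
    """Apply the first matching rule at each position; skip unknown letters."""
    phones = []
    i = 0
    while i < len(word):
        for letters, sounds in LETTER_RULES:
            if word.startswith(letters, i):
                phones.extend(sounds)
                i += len(letters)
                break
        else:
            i += 1  # no rule covers this letter
    return phones


def text_to_phones(text):
    """Dictionary lookup per word, falling back to letter-to-sound rules."""
    phones = []
    for word in text.lower().split():
        phones.extend(EXCEPTIONS.get(word) or letter_to_sound(word))
        phones.append("#")  # word boundary
    return phones


print(" ".join(text_to_phones("this slide is too long")))
```

Words found in the dictionary never reach the rules, which is why irregular spellings such as "slide" can be handled without complicating the rule set.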


The Role of Prosody

• The meaning of an utterance will affect its acoustic realisation.
• In fact, the most commonly cited shortcoming of speech synthesis is its prosodic structure:

SLIDE 6


Prosody

• Prosody is the term used to describe:
  – Durational structure of a speech signal (the relative lengths of its different parts, and the presence and lengths of any pauses)
  – Amplitude structure (the relative amplitudes of its different parts)
  – Intonational structure (in other words, the fundamental frequency pattern)
• Prosody includes stress and rhythm

SLIDE 7


The Role of Prosody

• Consider:
  – Joe: “Hey - did you hear? Sam took Mary out and bought her a pizza!”
  – Mike: “You’re wrong - Sam didn’t buy Mary a pizza”
• “Sam didn’t buy Mary a PIZZA”
• “Sam didn’t buy MARY a pizza”
• “Sam didn’t BUY Mary a pizza”
• “Sam DIDN’T buy Mary a pizza”
• “SAM didn’t buy Mary a pizza”

(from Altmann, “The ascent of Babel”, reference in notes)

SLIDE 8


Approaches to synthesis

• Phone-sequence → acoustic signal
• Many different approaches, but for the purposes of this course they will be divided into two classes:
  – Waveform concatenation
  – Model-based speech synthesis

SLIDE 9


Waveform Concatenation

• Join together, or concatenate, stored sections of real speech
• Sections may correspond to whole words or to sub-word units
• Early systems based on whole words
  – e.g. Speaking clock - UK telephone system, 1936
• Storage and access are major issues
• Speech quality requires data rates of 16,000 to 32,000 bits per second (bps)

SLIDE 10
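To see why storage is an issue, the quoted data rates can be turned into storage figures. This is a back-of-envelope sketch only: the vocabulary size (1,000 words) and the average word duration (0.5 s) are assumed values chosen for illustration, not figures from the lecture.

```python
def storage_bytes(duration_s, bits_per_second):
    """Bytes needed to store speech of the given duration at the given rate."""
    return duration_s * bits_per_second / 8


# Assumed example: 1,000 recorded words averaging 0.5 s each.
total_duration_s = 1000 * 0.5

for rate in (16_000, 32_000):
    megabytes = storage_bytes(total_duration_s, rate) / 1_000_000
    print(f"{rate} bps: {megabytes:.1f} MB")
```

Under these assumptions the vocabulary needs 1 MB at 16,000 bps and 2 MB at 32,000 bps, and the figure grows linearly with every extra word or extra recorded variant of a word.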


1936 “Speaking Clock”

[Figure: from John Holmes, “Speech synthesis and recognition”, courtesy of British Telecommunications plc]

SLIDE 11


Whole-Word Concatenation

• Whole word concatenation can give good quality speech (as in the speaking clock) but has many disadvantages:
  – Pronunciation of a word is influenced by neighbouring words (coarticulation)
  – Prosodic effects like intonation, rate-of-speaking and amplitude are also influenced by context.
  – Interpretation of a sentence will be strongly influenced by details of individual words used - Remember “Sam didn’t buy Mary a pizza”!

SLIDE 12


Whole-Word Concatenation

• Disadvantages (continued):
  – Words must be extracted from the right sort of sentence.
  – Most suitable for applications where structure of the sentence is constrained, e.g. announcements, lists…
  – May need to record more than one example of each word, e.g. raised pitch at end of a list, pre-pause lengthening…

SLIDE 13


Whole-Word Concatenation

• Disadvantages (continued):
  – To add new words the original speaker must be found, or all words must be re-recorded.
  – Even with specialist facilities, selection and extraction of suitable words is labour intensive and time consuming

SLIDE 14


Sub-Word Concatenation

• Limitations of word-based methods suggest concatenative speech synthesis based on sub-word units.
• Need well-annotated, phonetically-balanced corpus of recordings of speech
• Extract fragments from waveforms in the corpus which represent ‘basic units’ of speech
• These are concatenated and used for speech synthesis

SLIDE 15


Sub-Word Concatenation

• Difficulties include:
  – careful annotation of large amounts of data needed
  – derivation of a good method for concatenation
  – identification of a set of suitable units.

SLIDE 16


Sub-Word Concatenation

• Sub-word concatenation overcomes difficulties with adding new words to the application vocabulary
• BUT, other problems are exacerbated.
• In particular, coarticulation and pitch continuity problems now occur within, as well as between, words
• Necessary to use several examples of each phone (corresponding roughly to different allophones).

SLIDE 17


Sub-Word Concatenation

• It is natural to select fragments which characterise the phone target values, BUT modelling transitions between these targets is a significant problem

[Figure: Target 1 → ? → Target 2]

SLIDE 18


Transitional Units

• Central regions of many speech sounds are approximately stationary and less susceptible to coarticulation effects.
• Hence select fragments which characterise transitions between phones, rather than phone targets.
• e.g. diphone: transition between two phones.

SLIDE 19
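Moving from phone units to diphone units changes how a phone sequence is looked up in the unit inventory: each adjacent pair of phones names one stored transition. A minimal sketch follows; the function name and the `a-b` naming of units are assumptions, while the phone symbols and the `#` boundary marker follow the earlier transcription example.

```python
def phones_to_diphones(phones):
    """Return one diphone name per adjacent pair of phones."""
    return [f"{left}-{right}" for left, right in zip(phones, phones[1:])]


# "slide" with a boundary marker on each side, as in the earlier example.
print(phones_to_diphones(["#", "s", "l", "aI", "d", "#"]))
```

A sequence of N phones yields N - 1 diphones, and including the boundary marker means the word-initial and word-final transitions get their own units.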


Transitional Units (Continued)

• There are contextually-induced differences between instantiations of the central region of a phone, which cause discontinuities if they are not attended to.
• Possible solutions are:
  – Use several different examples of each diphone
  – Store short transition regions, and interpolate between end values.

SLIDE 20


Transitional Units

[Figure: panels (a) and (b)]

• Coping with coarticulation effects by modelling transitions and
  – (a) using multiple examples to cope with variation in the instantiation of the phone centres, and
  – (b) by interpolation between short transition regions

SLIDE 21
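Option (b) can be illustrated with straight-line interpolation. The sketch assumes that each stored transition is a track of parameter values, one per frame (for instance a formant frequency in Hz); the slides do not say what representation is interpolated, so this representation, the function names and the numbers are all assumptions.

```python
def interpolate(start, end, n_frames):
    """Return n_frames values spaced evenly strictly between start and end."""
    step = (end - start) / (n_frames + 1)
    return [start + step * k for k in range(1, n_frames + 1)]


def join_transitions(first, second, n_frames):
    """Bridge the end of one stored transition to the start of the next."""
    return first + interpolate(first[-1], second[0], n_frames) + second


# Assumed example tracks: the first ends at 700 Hz, the second starts at 900 Hz.
print(join_transitions([500.0, 600.0, 700.0], [900.0, 850.0], 3))
```

The number of interpolated frames sets the duration of the phone centre, so the same pair of stored transitions can serve for short and long realisations.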


Prosody and Concatenative Synthesis

• Discontinuity in the fundamental frequency is exacerbated for sub-word methods.
• Can use source-filter model to separate excitation signal from vocal tract shape
• Vocal tract shape descriptions can then be concatenated and an appropriately smooth fundamental frequency pattern can be added separately.
• (Will become clearer after the section on model-based synthesis)

SLIDE 22


PSOLA

• PSOLA - Pitch Synchronous OverLap and Add (Charpentier, 1986)
• Most successful current approach to concatenative synthesis
• In PSOLA the end regions of windowed waveform samples corresponding to a single excitation period are overlapped pitch-synchronously and added

SLIDE 23
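The overlap-and-add step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published algorithm in full: the pitch period is taken to be constant, the pitch marks are assumed to be known already, and a two-period Hann window is used. All function names are hypothetical.

```python
import math


def hann(length):
    """Periodic Hann window; copies shifted by length / 2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def extract_segments(signal, marks, period):
    """Cut a two-period, Hann-windowed segment centred on each pitch mark."""
    window = hann(2 * period)
    return [[signal[m - period + n] * window[n] for n in range(2 * period)]
            for m in marks]


def overlap_add(segments, marks, period, length):
    """Place each segment at its mark and sum the overlapping end regions."""
    out = [0.0] * length
    for segment, m in zip(segments, marks):
        for n, value in enumerate(segment):
            out[m - period + n] += value
    return out


# Assumed test signal: a sawtooth with a 40-sample period, marks every period.
period = 40
signal = [((n % period) - period / 2) / (period / 2) for n in range(400)]
marks = list(range(period, 400 - period + 1, period))

segments = extract_segments(signal, marks, period)
rebuilt = overlap_add(segments, marks, period, len(signal))
```

When the segments are put back at their original marks, the overlapping window halves sum to one and the signal is recovered wherever two segments overlap. Concatenative synthesis and prosody modification both come from placing the same segments at different marks.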


PSOLA

[Figure: from John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001]

SLIDE 24


Speech modification using PSOLA

• In addition to speech synthesis from segments, there are two other common applications of PSOLA:
  – Pitch modification
  – Duration modification

SLIDE 25
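Both modifications come down to where the windowed pitch-period segments are placed in the output. The sketch below is a simplified time-domain illustration under several assumptions: a constant input pitch period, pitch marks supplied by the caller, a two-period Hann window, and no amplitude compensation when segments overlap more or less than in the input. The function name and parameters are hypothetical.

```python
import math


def psola_modify(signal, marks, period, pitch_factor=1.0, duration_factor=1.0):
    """Re-space windowed pitch-period segments to change pitch and/or duration.

    Returns the output samples and the list of synthesis mark positions.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (2 * period))
              for n in range(2 * period)]
    # Closer marks give a higher pitch; wider marks give a lower pitch.
    new_period = max(1, round(period / pitch_factor))
    out = [0.0] * round(len(signal) * duration_factor)
    synthesis_marks = []
    t = new_period
    while t < len(out):
        # Reuse the input segment whose mark is nearest in (scaled) time, so
        # segments are repeated when lengthening and skipped when shortening.
        source_time = t / duration_factor
        m = min(marks, key=lambda mark: abs(mark - source_time))
        for n in range(2 * period):
            idx = t - period + n
            if 0 <= idx < len(out):
                out[idx] += signal[m - period + n] * window[n]
        synthesis_marks.append(t)
        t += new_period
    return out, synthesis_marks


# Assumed test signal: a sawtooth with a 40-sample period, marks every period.
signal = [((n % 40) - 20) / 20.0 for n in range(400)]
marks = list(range(40, 361, 40))

higher, higher_marks = psola_modify(signal, marks, 40, pitch_factor=2.0)
longer, longer_marks = psola_modify(signal, marks, 40, duration_factor=2.0)
print(higher_marks[:3], len(longer))
```

With `pitch_factor=2.0` the synthesis marks fall every 20 samples while the output length stays the same; with `duration_factor=2.0` the spacing stays at 40 samples and the output is twice as long.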


Increasing pitch using PSOLA

[Figure: from John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001]

SLIDE 26


Decreasing pitch using PSOLA

[Figure: from John Holmes and Wendy Holmes, “Speech synthesis and recognition”, Taylor & Francis 2001]

SLIDE 27


The ‘Laureate’ System

• The BT “Laureate” system is a modern, PSOLA-based synthesiser
• See Edington et al. (1996a), also look at the web site
• Demonstration

SLIDE 28


PSOLA strengths and weaknesses

• Strengths
  – Produces good quality speech
• Weaknesses
  – Large, annotated corpus needed for each ‘voice’
  – Requires accurate pitch peak detection
  – Inflexible: new voices can only be produced by recording and labelling significant speech corpora from new speakers
• Automatic annotation of corpora using techniques from speech recognition

SLIDE 29


Summary

• Introduction to speech synthesis
• Concatenative synthesis
  – Word concatenation
  – Sub-word concatenation
  – PSOLA

SLIDE 30
