12.12.2012 Views

Festival Speech Synthesis System: - Speech Resource Pages


Close the audio server down but wait until it is cleared. This is useful in scripts etc. when you wish to exit only when all audio is complete.

(audio_mode 'shutup)

Close the audio down now, stopping the file currently being played and any in the queue. Note that this may take some time to take effect, depending on which audio method you use. Sometimes there can be hundreds of milliseconds of audio in the device itself which cannot be stopped.

(audio_mode 'query)

Lists the size of each waveform currently in the queue.
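The audio modes above might be combined in a script roughly as follows. This is only a sketch: it assumes the `'async` and `'close` modes of `audio_mode` and the `SayText` function, none of which are introduced in this section, and the sentences are placeholders.

```scheme
;; Sketch: queue speech in the background, then exit only when it is done.
;; Assumes audio_mode's 'async and 'close modes and SayText are available.
(audio_mode 'async)           ; play waveforms in the background
(SayText "First sentence.")   ; returns once the waveform is queued
(SayText "Second sentence.")
(audio_mode 'query)           ; list the sizes of the waveforms still queued
(audio_mode 'close)           ; wait for the queue to clear, then close down
```

Replacing the final call with `(audio_mode 'shutup)` would instead cut the audio off, subject to the device delay noted above.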


24. Voices

This chapter gives some general suggestions about adding new voices to Festival. Festival attempts to offer an environment where new voices and languages can easily be slotted in to the system.

24.1 Current voices: Currently available voices
24.2 Building a new voice
24.3 Defining a new voice


24.1 Current voices

Currently there are a number of voices available in Festival and we expect that number to increase. Each is selected via a function named `voice_*' which sets up the waveform synthesizer, phone set, lexicon, duration and intonation models (and anything else necessary) for that speaker. These voice setup functions are defined in `lib/voices.scm'.
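As a minimal sketch of selecting a voice, calling one of the `voice_*' functions with no arguments should switch the current speaker. The example assumes the named voices are installed and uses `SayText`, which is not introduced in this section; the sentences are placeholders.

```scheme
;; Sketch: switch between two of the voices described in this section.
;; Assumes both voices are installed and that SayText is available.
(voice_rab_diphone)                        ; British English speaker
(SayText "This is the British voice.")

(voice_kal_diphone)                        ; American English speaker
(SayText "This is the American voice.")
```

Each call replaces the synthesizer, phone set, lexicon and prosody models as a unit, so there is no need to reset them individually between voices.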

The current voice functions are:

voice_rab_diphone

A British English male RP speaker, Roger. This uses the UniSyn residual excited LPC diphone synthesizer. The lexicon is the computer users' version of the Oxford Advanced Learner's Dictionary, with letter to sound rules trained from that lexicon. Intonation is provided by a ToBI-like system using a decision tree to predict accent and end tone position. The F0 itself is predicted as three points on each syllable, using linear regression trained from the Boston University FM database (f2b) and mapped to Roger's pitch range. Duration is predicted by decision tree, predicting zscore durations for segments, trained from the 460 TIMIT sentences spoken by another British male speaker.

voice_ked_diphone

An American English male speaker, Kurt. Again this uses the UniSyn residual excited LPC diphone synthesizer. This uses the CMU lexicon, and letter to sound rules trained from it. Intonation, as with Roger, is trained from the Boston University FM Radio corpus. Duration for this voice also comes from that database.

voice_kal_diphone

An American English male speaker. Again this uses the UniSyn residual excited LPC diphone synthesizer. Like ked, it uses the CMU lexicon, and letter to sound rules trained from it. Intonation, as with Roger, is trained from the Boston University FM Radio corpus. Duration for this voice also comes from that database. This voice was built in two days' work and is at least as good as ked because we understood the process better. The diphone labels were autoaligned with hand correction.

voice_don_diphone

Steve Isard's LPC based diphone synthesizer, Donovan diphones. The other parts of this voice (lexicon, intonation, and duration) are the same as for voice_rab_diphone described above. The quality of the diphones is not as good as the other voices because it uses spike excited LPC. Although the quality is not as good, it is much faster and the database is much smaller than the others.
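The zscore durations mentioned for voice_rab_diphone can be illustrated with a small worked example. A z-score expresses a duration as a number of standard deviations from a phone's mean, so it converts back to seconds as mean + zscore * stddev. The sketch below is not Festival's own implementation: the table `example_ph_stats`, the function `zscore_to_duration`, and all the numbers are hypothetical, chosen only to show the arithmetic.

```scheme
;; Hypothetical per-phone statistics: (phone mean-seconds stddev-seconds).
;; These values are invented for illustration only.
(define example_ph_stats
  '((aa 0.080 0.020)
    (k  0.060 0.015)))

;; Convert a predicted z-score for PHONE into a duration in seconds.
;; Assumes PHONE has an entry in example_ph_stats.
(define (zscore_to_duration phone zscore)
  (let ((stats (assoc phone example_ph_stats)))
    (+ (car (cdr stats))                       ; mean duration
       (* zscore (car (cdr (cdr stats)))))))   ; zscore * stddev

;; A z-score of 1.5 for aa works out as 0.080 + 1.5 * 0.020 seconds.
(zscore_to_duration 'aa 1.5)
```

Predicting z-scores rather than raw durations lets one tree be shared across phones with very different typical lengths, since each phone's own mean and deviation are applied afterwards.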
