Festival Speech Synthesis System: - Speech Resource Pages
The process involves the following steps<br />
● Pre-processing the lexicon into a suitable training set<br />
● Defining the set of allowable pairings of letters to phones. (We intend to do this fully automatically in future<br />
versions).<br />
● Constructing the probabilities of each letter/phone pair.<br />
● Aligning letters to an equal-length sequence of phones/_epsilons_.<br />
● Extracting the data by letter suitable for training.<br />
● Building CART models for predicting phones from letters (and context).<br />
● Building an additional lexical stress assignment model (if necessary).<br />
All except the first two stages are fully automatic.<br />
Before building a model it is wise to think a little about what you want it to do. Ideally the model is an auxiliary to the<br />
lexicon, so only words not found in the lexicon will require the letter-to-sound rules; thus only unusual forms<br />
are likely to require them. More precisely, the most common words, which often have the most non-standard<br />
pronunciations, should probably always be listed explicitly. It is possible to reduce the size of the lexicon (sometimes<br />
drastically) by removing all entries that the trained LTS model correctly predicts.<br />
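As an illustration of that pruning idea, here is a minimal Python sketch (Festival itself works in Scheme; `predict_phones` is a hypothetical stand-in for the trained LTS model, and entries are assumed to be simple word/phone-list pairs):

```python
# Sketch: shrink a lexicon by dropping every entry the trained LTS model
# already predicts correctly.  `predict_phones` is a hypothetical stand-in
# for the trained model; entries are (word, phone-list) pairs.
def prune_lexicon(lexicon, predict_phones):
    return [(word, phones) for word, phones in lexicon
            if predict_phones(word) != phones]

# Toy "model" that only knows "table"; anything else it gets wrong.
toy_model = lambda w: ['t', 'ei', 'b', 'l'] if w == 'table' else []

lex = [('table', ['t', 'ei', 'b', 'l']),
       ('suspicious', ['s', '@', 's', 'p', 'i', 'sh', '@', 's'])]
# Only "suspicious" survives pruning, since the model predicts "table".
```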
It is also worth removing some entries from the lexicon before training. I typically remove<br />
words under 4 letters, and if part-of-speech information is available I remove all function words, ideally training<br />
only on nouns, verbs and adjectives, as these are the most likely forms to be unknown in text. It is useful to have<br />
morphologically inflected and derived forms in the training set, as it is often such variant forms that are not found in the<br />
lexicon even though their root morpheme is. Note that in many forms of text, proper names are the most common<br />
kind of unknown word, and even the technique presented here may not adequately cater for them<br />
(especially if the unknown words are non-native names). All this is to say that this technique may or may not be<br />
appropriate for your task, but the rules generated by this learning process have, in the examples we have tried, been much<br />
better than what we could produce by hand-writing rules of the form described in the previous section.<br />
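To make that filtering concrete, here is a rough Python sketch (the entry format and the part-of-speech tag set are assumptions for illustration, not Festival's actual representation):

```python
# Sketch: reduce a lexicon to a training set, keeping only words of 4 or
# more letters that are tagged as a content category.  Entries are
# assumed to be (word, pos, phone-list) triples; the tag set is made up.
CONTENT_POS = {'n', 'v', 'j'}  # nouns, verbs, adjectives

def select_training_entries(lexicon):
    return [(w, pos, ph) for w, pos, ph in lexicon
            if len(w) >= 4 and pos in CONTENT_POS]

source_lex = [('the', 'det', ['dh', '@']),
              ('table', 'n', ['t', 'ei', 'b', 'l']),
              ('of', 'prep', ['@', 'v'])]
# Only "table" passes both the length and the part-of-speech filter.
```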
First preprocess the lexicon into a file of lexical entries to be used for training, removing function words and<br />
changing the head words to all lower case (this may be language dependent). The entries should be of the form used as<br />
input for <strong>Festival</strong>'s lexicon compilation. Specifically, the pronunciations should be simple lists of phones (no<br />
syllabification). Depending on the language, you may wish to remove the stress marks; for the examples here we have,<br />
though later tests suggest that we should keep them in even for English. Thus the training set should look something like<br />
("table" nil (t ei b l))<br />
("suspicious" nil (s @ s p i sh @ s))<br />
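If your source lexicon is in some other format, producing entries of this shape is straightforward; a small Python sketch of just the formatting step (assuming plain word/phone-list pairs as input):

```python
# Sketch: render (word, phone-list) pairs in the s-expression entry form
# shown above, e.g. ("table" nil (t ei b l)).
def format_entry(word, phones):
    return '("%s" nil (%s))' % (word.lower(), ' '.join(phones))
```

For example, `format_entry('table', ['t', 'ei', 'b', 'l'])` returns `("table" nil (t ei b l))`, lower-casing the head word as described above.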
It is best to split the data into a training set and a test set if you wish to know how well your training has worked. In<br />
our tests we remove every tenth entry and put it in a test set. Note this will mean our test results are probably better<br />
than if we removed, say, the last ten in every hundred, since sampling every tenth entry leaves morphologically related<br />
neighbours of each test word in the training set.<br />
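The every-tenth split can be sketched as follows (which tenth entry you hold out is arbitrary; here it is indices 0, 10, 20, ...):

```python
# Sketch: hold out every tenth entry as a test set, training on the rest.
def split_train_test(entries, n=10):
    test = entries[::n]  # every n-th entry, starting from the first
    train = [e for i, e in enumerate(entries) if i % n != 0]
    return train, test
```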
The second stage is to define the set of allowable letter-to-phone mappings irrespective of context. This can<br />
sometimes initially be done by hand and then checked against the training set. Initially construct a file of the form<br />
(require 'lts_build)<br />
(set! allowables<br />
'((a _epsilon_)<br />
(b _epsilon_)<br />
(c _epsilon_)<br />
...<br />
(y _epsilon_)<br />
(z _epsilon_)<br />
(# #)))<br />
All letters that appear in the alphabet should (at least) map to _epsilon_, including any accented characters that<br />
appear in that language. Note the last two hashes. These are used to denote the beginning and end of a word and are<br />
automatically added during training; they must appear in the list and should only map to themselves.
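One simple way to check a hand-written table against the training set is to verify that every letter occurring in the head words actually has an entry. A rough Python sketch, representing the table as a dict from letter to its set of allowed phones (this mirrors, but is not, Festival's Scheme structure):

```python
# Sketch: report head-word letters that are missing from a hand-written
# allowables table (letter -> set of allowed phones, plus the # marker).
def missing_letters(entries, allowables):
    seen = set()
    for word, phones in entries:
        seen.update(word)
    return sorted(seen - set(allowables))

# A minimal table: every plain letter maps (at least) to _epsilon_.
allowables = {l: {'_epsilon_'} for l in 'abcdefghijklmnopqrstuvwxyz'}
allowables['#'] = {'#'}  # word-boundary marker: maps only to itself
```

Running this over a training set with accented head words would flag, for instance, that `ï` in "naïve" needs its own entry before training can proceed.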