Festival Speech Synthesis System: - Speech Resource Pages
26. Building models from databases<br />
Because our research interests tend towards creating statistical models trained from real speech data, <strong>Festival</strong> offers<br />
various support for extracting information from speech databases, in a way suitable for building models.<br />
Models for accent prediction, F0 generation, duration, vowel reduction, homograph disambiguation, phrase break<br />
assignment and unit selection have been built using <strong>Festival</strong> to extract and process various databases.<br />
26.1 Labelling databases Phones, syllables, words etc.<br />
26.2 Extracting features Extraction of model parameters.<br />
26.3 Building models Building stochastic models from features<br />
26.1 Labelling databases<br />
In order for <strong>Festival</strong> to use a database it is most useful to build utterance structures for each utterance in the database.<br />
As discussed earlier, utterance structures contain relations of items. Given such a structure for each utterance in a<br />
database, we can easily read in the utterance representation and access it, dumping information in a normalised way<br />
that allows easy building and testing of models.<br />
Of course, the level of labelling that exists for a particular database, or that you are willing to add by hand or with<br />
some automatic tool, will vary. For many purposes you will at least need phonetic labelling. Hand-labelled data is still<br />
better than auto-labelled data, though that could change. The size and consistency of the data are important too.<br />
For this discussion we will assume labels for: segments, syllables, words, phrases, intonation events and pitch targets.<br />
Some of these can be derived; some need to be labelled. The process would not fail with less labelling, but of course<br />
you wouldn't be able to extract as much information from the result.<br />
In our databases these labels are in Entropic's Xlabel format, though it is fairly easy to convert from any reasonable format.<br />
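For illustration, an Xlabel file is a short text header terminated by a line containing only `#`, followed by one label per line giving an end time, a colour code and the label name. The header fields and all times and labels below are invented for the example:

```
separator ;
nfields 1
#
    0.2980  26  pau
    0.3630  26  hh
    0.4310  26  ax
    0.5010  26  l
    0.5750  26  ow
    0.8120  26  pau
```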
Segment<br />
These give phoneme labels for files. Note that these labels must be members of the phoneset that you will be<br />
using for this database. Often phone label files contain extra labels (e.g. beginning and end silence)<br />
which are not really part of the phoneset. You should remove (or re-label) these phones accordingly.<br />
Word<br />
Again these will need to be provided. The end of the word should come at the last phone in the word (or just<br />
after). Pauses/silences should not be part of the word.<br />
Syllable<br />
These can often be generated automatically from the Word and Segment files given a lexicon. Ideally<br />
they should include lexical stress.<br />
IntEvent<br />
These should ideally mark the accent or boundary tone type for each syllable, but this almost certainly requires<br />
hand-labelling. Also, given that hand-labelling of accent type is harder and less consistent, it is arguable that<br />
nothing finer than an accented vs. non-accented distinction can be used reliably.<br />
Phrase<br />
This could simply mark the last non-silence phone in each utterance, or the last non-silence phone before any<br />
silence in the utterance.<br />
Target<br />
These can be derived automatically from an F0 file and the Segment files. Marking the mean F0 in each<br />
voiced phone seems to give adequate results.<br />
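The Target derivation just described can be sketched as follows. This is an illustrative Python script, not part of <strong>Festival</strong>; the simplified input representation, the voiced-phone set and the function name are all assumptions, and real Xlabel and F0 files would need parsing first:

```python
# Sketch: derive target points by taking the mean F0 over each voiced phone.
# Segments are (end_time, label) pairs in order; the F0 contour is a list of
# (time, f0) samples where f0 <= 0 marks an unvoiced frame.

VOICED = {"aa", "ae", "ax", "iy", "ow", "m", "n", "l", "r"}  # assumed subset

def mean_f0_targets(segments, f0_track):
    """Return one (mid_time, mean_f0) target per voiced phone.

    The start of the first segment is taken to be 0.0; each segment's
    start is the previous segment's end time.
    """
    targets = []
    start = 0.0
    for end, label in segments:
        if label in VOICED:
            # Collect voiced F0 frames that fall inside this segment.
            frames = [f0 for t, f0 in f0_track
                      if start <= t < end and f0 > 0]
            if frames:
                mid = (start + end) / 2.0
                targets.append((mid, sum(frames) / len(frames)))
        start = end
    return targets
```

Placing the target at the phone midpoint with the segment's mean F0 matches the rough scheme described above; a real script would also handle octave errors and F0 tracker dropouts.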
Once these files are created, an utterance file can be built automatically from the above data. Note that it is pretty easy<br />
to get the streams right, but getting the relations between the streams is much harder. Firstly, labelling is rarely<br />
perfectly accurate, and small windows of error must be allowed to ensure things line up properly. The second problem is that