Evaluating the pronunciation component of text-to-speech systems ...

R. I. Damper et al.

[Figure 1: three panels, (a) to (c), each plotting % words correct (0 to 100) against size of word set (0 to 18 thousand). Legend entries: Same size, Total dict., Seen, Unseen.]

Figure 1. Results of training and testing on different-sized disjoint subsets of the dictionary: (a) PbA; (b) NETspeak; (c) IB1-IG. See text for specification of the testing strategies in each case.

entropies were not recomputed as each new test word was individually removed from the lexicon, because of the enormous computational cost of doing so. The assumption is that any single word will have little effect on the entropy calculation across 16 280 words. The result is depicted in Figure 1(c) (top curve). The asymptotic performance is about 65% words correct, which is better than NETtalk but poorer than PbA. These results are poorer than those of van den Bosch (1997, Fig. 3.3, p. 70), who obtained a word accuracy of 77·9% with IB1-IG on CELEX (77 565 words). We attribute this apparently superior performance to the use of a much smaller phoneme inventory of just 42 phonemes (including null) rather than to any problems with our implementation.
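As background to the IB1-IG results, the following is a minimal sketch of the general technique: nearest-neighbour lookup with an overlap distance in which each feature is weighted by its information gain. It is not the implementation used in this work. The three-letter windows, the phoneme labels and all function names are hypothetical and chosen only for illustration. The weights are computed once over the whole training set, which mirrors the shortcut described above of not recomputing entropies for each held-out word.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(instances, labels):
    """Information gain of each feature position, computed once over all data."""
    base = entropy(labels)
    gains = []
    for i in range(len(instances[0])):
        # Group the class labels by the value this feature takes.
        by_value = defaultdict(list)
        for feats, lab in zip(instances, labels):
            by_value[feats[i]].append(lab)
        remainder = sum(len(subset) / len(labels) * entropy(subset)
                        for subset in by_value.values())
        gains.append(base - remainder)
    return gains


def classify(query, instances, labels, weights):
    """Label of the stored instance with the smallest weighted overlap distance."""
    def distance(a, b):
        # Each mismatching feature contributes its information-gain weight.
        return sum(w for w, x, y in zip(weights, a, b) if x != y)
    best = min(range(len(instances)), key=lambda j: distance(query, instances[j]))
    return labels[best]


# Hypothetical toy data: (left letter, focus letter, right letter) -> phoneme.
toy_instances = [("_", "c", "a"), ("_", "c", "o"), ("_", "c", "e"),
                 ("a", "c", "e"), ("i", "c", "e"), ("_", "c", "u")]
toy_labels = ["k", "k", "s", "s", "s", "k"]

toy_weights = information_gain(toy_instances, toy_labels)
print(toy_weights)
print(classify(("o", "c", "e"), toy_instances, toy_labels, toy_weights))
```

In this toy set the focus letter is always the same and so receives zero weight, while the right-hand context fully determines the phoneme and receives the largest weight. The query is therefore matched on its right-hand letter even though its left-hand letter was never seen.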
[Figure 2: coverage (%) plotted against size of lexicon (words), from 0 to 200 000.]

Figure 2. Theoretical coverage of dictionary plus back-up pronunciation strategy for unlisted words according to Zipf’s law with R = 200 000 words (see Appendix A for explanation). The upper curve assumes the word accuracy of the back-up strategy is 65%, the middle curve assumes it is 30% and for the lower curve it is 0% (i.e. the dictionary is used alone). This illustrates the importance of a good back-up strategy to performance of a TTS system.

6.5. Speed of conversion

Speed of conversion is an important aspect of the pronunciation component of a TTS system. It could be argued that PbA, for instance, is fundamentally slow and that an advantage of rules is their computational efficiency. (For instance, Damper, Burnett, Gray, Straus & Symes (1987) implemented the Elovitz et al. rules in real time on a hand-held device within the constraints of mid-1980s microprocessor technology, viz. a 1 MHz processor with just 8 Kbytes of total memory.) In the present work, the total time to process the 16 280 words of TWB was 2·06 s on a Sun Ultra Enterprise with dual 170 MHz processors. (This machine is reserved for fast-response, light applications.) This corresponds to about 0·127 ms/word or a conversion speed of 7903 words/s, well in excess of normal speech rates and leaving plenty of processor time for other necessary tasks in a real-time TTS system. Nor should it be forgotten that the concern here is back-up strategies, which are not continuously invoked.

NETtalk required 19·54 s to process the complete TWB on a 200 MHz Pentium, corresponding to 8·8 ms/word or a conversion speed of 112 words/s. This is slower than the rules but more than adequate for real-time TTS synthesis. Back-propagation training times were, however, very long: it took something like 9 days to produce the results of Figure 1(b).
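The per-word figures quoted for the rules follow directly from the total run time and the word count. A minimal sketch of that arithmetic is given here; the function name per_word_stats is ours and purely illustrative.

```python
def per_word_stats(total_seconds, n_words):
    """Return (seconds per word, words per second) for a batch conversion run."""
    return total_seconds / n_words, n_words / total_seconds


N_WORDS = 16_280  # number of words in TWB, as stated in the text

# Rule-based conversion: 2.06 s for the whole dictionary.
rules_s_per_word, rules_words_per_s = per_word_stats(2.06, N_WORDS)
print(f"rules: {rules_s_per_word * 1e3:.3f} ms/word, "
      f"{rules_words_per_s:.0f} words/s")
```

The same function can be applied to any of the other timings in this section to compare methods on a common footing, bearing in mind that the runs were made on different machines.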
With PbA, it took 2490·8 s (41·5 min) to process the complete dictionary on a 75 MHz HP 712 workstation. This corresponds to 0·153 s/word or 6·54 words/s. With IB1-IG, it took 6626·0 s (110·4 min) to process the complete dictionary on the same HP workstation. This corresponds to 0·407 s/word or 2·46 words/s. However, the translation programs for PbA and IB1-IG were written for research purposes: they are implemented in the Python rapid-prototyping language (rather than C), which is interpreted and relatively inefficient. The programs also perform many data-logging operations which would not be necessary in production versions of the code. Most of these features were disabled for the purposes of this testing, but not all. PbA in particular could be made very much faster by, for instance, precompiling lexical knowledge (implicit analogy), using a fast string-searching algorithm like Boyer–Moore, etc.
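To illustrate the kind of fast string search meant here, the following is a minimal sketch of the Horspool simplification of Boyer–Moore, which keeps only the bad-character shift table. It is not code from the programs described above, and how such a search would be integrated into PbA (for example, when matching substrings of an input word against lexical entries) is an assumption on our part. The function name horspool_search is hypothetical.

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: for every character of the pattern except the last,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift according to the text character aligned with the pattern's
        # last position; characters absent from the table allow a full jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("pronunciation", "nun"))
print(horspool_search("lexicon", "xyz"))
```

The gain over a naive scan comes from the shift table, which lets the search skip several alignments at once whenever the character under the end of the pattern rules them out.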
Computer Speech and Language (1999)