Procesamiento del Lenguaje Natural, núm. 35 (2005), pp. 29-34. Received 02-05-2005; accepted 01-06-2005.

Main issues in grapheme-to-phoneme conversion for TTS

Tatyana Polyakova
Universitat Politècnica de Catalunya
Jordi Girona 1-3, Barcelona, Spain
tatyana@gps.tsc.upc.es

Antonio Bonafonte
Universitat Politècnica de Catalunya
Jordi Girona 1-3, Barcelona, Spain
antonio.bonafonte@upc.es

Resumen: Grapheme-to-phoneme conversion for English is being developed for future integration into a speech synthesis system within the TC-STAR project. This work describes experiments carried out with two different machine-learning techniques. Pronunciation prediction was considered both with and without stress. The influence of different parameters on the grapheme-to-phoneme error rate is analyzed, and the distribution of the error rate as a function of word length is also studied.

Palabras clave: grapheme-to-phoneme conversion, alignment, x-grams, finite-state transducers, CART.

Abstract: A grapheme-to-phoneme conversion system for English is being developed for further integration into a speech synthesis system within the TC-STAR project. In this work we describe experiments performed using two different machine-learning techniques. Pronunciation was predicted for both a stressed and an unstressed lexicon and the results were compared. We analyze the different parameters that may influence the error rate in grapheme-to-phoneme conversion, and study the error rate as a function of word length.

Keywords: grapheme-to-phoneme, alignment, x-grams, finite-state transducers, CART.

1 Introduction

Text-to-speech synthesis and continuous speech recognition are two rapidly developing technologies with applications in many different areas of human life. It is impossible to build a well-performing text-to-speech system without a tool that generates correct pronunciations from the orthographic form.

Because of writing-system irregularities (there is no one-to-one correspondence between the full grapheme and phoneme sets; one letter may correspond to two phonemes, or to no sound at all), grapheme-to-phoneme conversion is not an easy task, especially for languages such as English or French, where the relation between letters and sounds is far from transparent.

To train a grapheme-to-phoneme conversion system, most algorithms require an alignment between the grapheme and phoneme strings.
Many different approaches have been developed to perform grapheme-to-phoneme alignment automatically for translation systems; in grapheme-to-phoneme conversion, however, a dictionary aligned by hand-seeded rules is usually used to train a classifier.

Different methods for building automatic alignments have been proposed, among them one-to-one and many-to-many alignments. In a one-to-one alignment each letter corresponds to exactly one phoneme and vice versa; to match the grapheme and phoneme strings, an "empty" symbol is introduced into either string. In a many-to-many alignment a letter can correspond to more than one phoneme and a phoneme can correspond to more than one letter.

An example of a one-to-one alignment model is the epsilon scattering model (ESM), where an "epsilon" or "empty" phoneme is introduced into both the grapheme and phoneme strings, or only into the phoneme string (Pagel, Lenzo and Black, 1998). The EM algorithm is then applied to optimize the epsilon positions.

An example of many-to-many alignment was proposed by Bisani and Ney (2002). The main idea of their model is that both the orthographic form and the pronunciation of each word are determined by a common sequence of graphones, where a graphone is a pair q = (g, φ) of a letter sequence and a phoneme sequence (q ∈ Q ⊆ G* × Φ*, G being the letter alphabet and Φ the inventory of phonemic symbols). For a given sequence of letters g ∈ G*, the probability of the most likely phoneme sequence φ ∈ Φ* is maximized.
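To make the graphone definition concrete, the short sketch below (ours, not the authors'; the segmentation of the example word is purely illustrative) shows how a word and its pronunciation can be covered by one common graphone sequence:

    # A graphone is a pair (letter sequence, phoneme sequence); a word
    # and its pronunciation are covered by a common graphone sequence.
    # Illustrative segmentation of "fine" -> /f aI n/ (SAMPA); other
    # coverings of the same pair are possible.
    graphones = [
        ("f",  ["f"]),   # one letter  -> one phoneme
        ("i",  ["aI"]),  # one letter  -> one phoneme
        ("ne", ["n"]),   # two letters -> one phoneme (covers silent e)
    ]

    word = "".join(g for g, _ in graphones)          # "fine"
    pron = [p for _, ph in graphones for p in ph]    # ["f", "aI", "n"]
    print(word, pron)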


This paper reports on a data-driven approach to pronunciation modeling. To align graphemes and phonemes we chose automatic epsilon scattering. The experiments were performed on stressed and unstressed English lexicons, and the grapheme-to-phoneme results obtained with CART decision trees and with x-gram-based finite-state transducers were compared for both cases.

2 Grapheme-to-phoneme methods

2.1 Epsilon-scattering based alignment

As the alignment method we use the automatic epsilon-scattering method to produce a one-to-one alignment (Black, Lenzo and Pagel, 1998). When the number of letters is greater than the number of phonemes, we introduce the "empty" phoneme symbol into the phonetic representation to match the lengths of the grapheme and phoneme strings. First we add the necessary number of "empty" symbols, in all possible positions, to the phonetic representation of each word in the training corpus. Table 1 gives two of the possible alignment candidates for the word "meadows".

    Let.   M  E  A  D  O  W  S
    Phon.  m  E  _  d  o  _  z
    Phon.  _  _  m  E  d  o  z

Table 1: Some possible alignment candidates.

This probabilistic initialization gives us the set of all possible imperfect alignments. Our goal is to maximize the probability that letter g matches phoneme φ, and thereby choose the best alignment from the candidates. This is done by applying the expectation-maximization (EM) algorithm (Dempster, Laird and Rubin, 1977) to the joint grapheme/phoneme probabilities. Under certain conditions, EM guarantees an increase of the likelihood function at each iteration until convergence to a local maximum.

2.1.1 Stress assignment

In this work we use the SAMPA phonetic alphabet (http://www.phon.ucl.ac.uk/home/sampa/american.htm). The total number of phoneme symbols in SAMPA for American English is N_p = 41. Since we use different phonemes for stressed and unstressed vowels, to train and test on the stressed dictionary we added N_a = 17 stressed vowel symbols and an "empty" phoneme, "_". As an example, Table 2 shows the word "abilities" together with its two phonetic representations, with and without stress. In the stressed version the digit "1" is appended to identify the stressed vowel. In this example one "empty" symbol was introduced to match the strings.

    Let.            A  B  I   L  I  T  I  E  S
    Phon. (str.)    @  b  I1  l  @  t  i  _  z
    Phon. (unstr.)  @  b  I   l  @  t  i  _  z

Table 2: Alignment of graphemes and phonemes including the "empty" symbol. The digit 1 is the stress marker.
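The candidate-generation step of this initialization can be sketched as follows; this is a minimal illustration of the idea (scattering epsilons into all possible positions of the phoneme string), and it deliberately omits the EM re-estimation that actually scores and selects the best candidate:

    from itertools import combinations

    def alignment_candidates(letters, phones, eps="_"):
        """Enumerate one-to-one alignment candidates by scattering the
        "empty" phoneme into all possible positions of the phoneme
        string so that its length matches the letter string."""
        n_eps = len(letters) - len(phones)
        if n_eps < 0:
            return  # more phonemes than letters: not handled here
        for slots in combinations(range(len(letters)), n_eps):
            candidate, it = [], iter(phones)
            for i in range(len(letters)):
                candidate.append(eps if i in slots else next(it))
            yield candidate

    # e.g. the 21 candidates for "meadows" -> /m E d o z/, including
    # the two shown in Table 1:
    for cand in alignment_candidates("MEADOWS", ["m", "E", "d", "o", "z"]):
        print(list("MEADOWS"), cand)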
2.2 CART

Deriving the pronunciation automatically with decision trees, such as Classification and Regression Trees (CART), is a commonly used technique in grapheme-to-phoneme conversion. Pagel, Lenzo and Black (1998) introduced CART decision trees into pronunciation prediction. As input vectors they used a graphemic sliding window containing three letters to the left and three letters to the right of each center letter. The advantage of the CART method is that it produces compact models; the model size is defined by the total number of questions and leaf nodes in the generated tree.
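For illustration, the sliding-window setup can be sketched with an off-the-shelf decision tree. This is our own sketch: scikit-learn's DecisionTreeClassifier stands in for the CART implementation used in the paper (min_impurity_decrease with the entropy criterion plays the role of the minimum entropy gain), the two-word aligned toy data and the padding symbol are assumptions, and the ordinal letter encoding is a simplification of CART's categorical questions.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.preprocessing import OrdinalEncoder

    PAD = "#"  # padding for window positions outside the word (assumed)

    def windows(letters, width=3):
        """7-letter sliding windows: 3 letters of context per side."""
        padded = [PAD] * width + list(letters) + [PAD] * width
        return [padded[i:i + 2 * width + 1] for i in range(len(letters))]

    # Toy aligned data: (word, one phoneme or "_" per letter).
    aligned = [("meadows", ["m", "E", "_", "d", "o", "_", "z"]),
               ("aligned", ["a", "l", "aI", "_", "n", "_", "d"])]

    X = [w for word, _ in aligned for w in windows(word)]
    y = [p for _, phones in aligned for p in phones]

    enc = OrdinalEncoder()  # letters -> integer codes, per window slot
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=7,
                                  min_impurity_decrease=0.001)
    tree.fit(enc.fit_transform(X), y)
    # Toy model: real training uses the full aligned dictionary, and a
    # test word's letters must appear in training in the same slot.
    print(tree.predict(enc.transform(windows("meadow"))))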


2.3 Finite-state transducers

In this approach we choose the pronunciation that maximizes

    \hat{\varphi} = \arg\max_{\varphi} \{ p(\varphi \mid g) \},    (1)

where φ is the sequence of phonemes, including the "empty" phoneme, and g is the sequence of letters. To solve the maximization problem we use a finite-state transducer, similar to (Galescu and Allen, 2001). Equation (1) can be rewritten as

    \arg\max_{\varphi} \{ p(\varphi, g) / p(g) \} = \arg\max_{\varphi} \{ p(\varphi, g) \},    (2)

which can be estimated using standard n-gram methods.

In training, the alignment is performed first. Then we define the graphones γ_i as the pairs consisting of one letter and one phoneme (including ε):

    p(g, \varphi) = \prod_{i=1}^{N} p(g_i, \varphi_i \mid g_1^{i-1}, \varphi_1^{i-1}) = \prod_{i=1}^{N} p(\gamma_i \mid \gamma_1^{i-1}),    (3)

where N is the number of letters in the word. This can be estimated using n-grams, but the symbols are not words: they are the graphones, i.e., the (letter, phoneme) pairs found in the aligned dictionary.

N-grams can be represented using a finite-state automaton: for each history h we define a state, and for each new graphone γ_l = (g_l, φ_l) we create an edge labeled (g_l, φ_l) and weighted with the probability p(γ_l | h). For example, for the word "aligned" we create the following graphone sequence: (A,a) (L,l) (I,aI) (G,_) (N,n) (E,_) (D,d). Figure 1 shows some states and edges of the bigram-based FSA (in practice all of the training data is used to estimate the probabilities).

[Figure 1: A finite-state automaton. From the state (L,l), edges labeled (I,aI) and (I,i) carry the probabilities p(I,aI | L,l) and p(I,i | L,l).]

The FSA allows us to compute the probability p(γ), but this requires an alignment. In order to solve equation (2) we derive a finite-state transducer in a straightforward way: the labels attached to the FSA are split, so that the letter becomes an input and the phoneme becomes an output. Figure 2 shows the result of applying this to Figure 1. Note that the FST is not deterministic: for instance, from the state labeled (L,l) there are two possible edges for the same input letter "I".

[Figure 2: A finite-state transducer. The edges of Figure 1 become I/aI and I/i, weighted with p(I,aI | L,l) and p(I,i | L,l).]

To find the pronunciation we have to maximize equation (3). This is equivalent to finding the path of the FST with maximum probability. The input letters limit the number of edges that can be followed; the best path is found using a dynamic programming algorithm.
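A minimal sketch of this training-and-decoding scheme is given below, assuming bigram graphone probabilities with add-one smoothing and a toy aligned dictionary (our assumptions, not the paper's implementation); the real system estimates probabilities from the full training data and uses the flexible-length x-gram model described next.

    from collections import defaultdict
    from math import log

    BOS = ("<s>", "<s>")  # start-of-word graphone (assumed convention)

    def train_bigrams(aligned):
        """Count graphone bigrams from a one-to-one aligned lexicon."""
        big, uni = defaultdict(int), defaultdict(int)
        for word, phones in aligned:
            prev = BOS
            for g in zip(word, phones):  # graphone = (letter, phoneme)
                big[(prev, g)] += 1
                uni[prev] += 1
                prev = g
        return big, uni

    def decode(word, big, uni, phone_inv):
        """Best phoneme string for `word` by dynamic programming over
        graphones (Viterbi over the bigram FST), add-one smoothing."""
        V = len(phone_inv)
        states = {BOS: (0.0, [])}
        for letter in word:
            nxt = {}
            for prev, (lp, path) in states.items():
                for ph in phone_inv:
                    g = (letter, ph)
                    p = (big[(prev, g)] + 1) / (uni[prev] + V)
                    cand = (lp + log(p), path + [ph])
                    if g not in nxt or cand[0] > nxt[g][0]:
                        nxt[g] = cand
            states = nxt
        return max(states.values())[1]

    aligned = [("meadows", ["m", "E", "_", "d", "o", "_", "z"]),
               ("aligned", ["a", "l", "aI", "_", "n", "_", "d"])]
    big, uni = train_bigrams(aligned)
    phones = sorted({p for _, ps in aligned for p in ps})
    # Drop the "empty" symbols to obtain the output pronunciation:
    print([p for p in decode("meadow", big, uni, phones) if p != "_"])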
In order to derive the correct pronunciation we use the flexible-length x-gram model (Bonafonte and Mariño, 1996). The x-gram model assumes that the number of conditioning grapheme-phoneme pairs depends on each particular case. Its main idea is to apply a state-merging algorithm that reduces the number of states. Two merging criteria are used. First, a history is merged if the number of times it appears in the training data is less than a threshold k_min (where q_i is a grapheme-phoneme pair); the probability of such grapheme-phoneme pairs is derived by smoothing. Second, two states are merged if their distributions p = p(q | q_{i-m}, ..., q_i) and m = p(q | q_{i-m+1}, ..., q_i) differ little in information. The measure of the difference is the divergence D, defined over the lexicon of size J as

    D(p \| m) = \sum_{j=1}^{J} p(j) \log \frac{p(j)}{m(j)}.    (4)

By choosing proper values of the thresholds k_min and D, one can significantly reduce the number of states without degrading the model.
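The merging test of equation (4) reduces to a few lines; this is a sketch under our own conventions (the zero guard, the dict representation and the threshold values are illustrative assumptions, not the paper's settings):

    from math import log

    def divergence(p, m, eps=1e-12):
        """Kullback-Leibler divergence D(p || m) of equation (4);
        `p` and `m` map graphone outcomes to probabilities."""
        return sum(pj * log(pj / max(m.get(j, 0.0), eps))
                   for j, pj in p.items() if pj > 0.0)

    def should_merge(p, m, count, k_min=3, d_max=0.1):
        """Merge a state into its shorter-history state if the history
        is rare or the two distributions are close enough."""
        return count < k_min or divergence(p, m) <= d_max

    # A long history whose distribution barely differs from the
    # shorter one gets merged:
    p = {"s/z": 0.62, "s/s": 0.38}
    m = {"s/z": 0.60, "s/s": 0.40}
    print(divergence(p, m), should_merge(p, m, count=25))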


3 Experimental results

LC-STAR (www.lc-star.com) has created dictionaries in 13 languages. In this work we used the LC-STAR dictionary of American English, kindly provided by NSC (Natural Speech Communications) for the development of the speech-to-speech translation demonstrator within the LC-STAR project.

We performed experiments using the epsilon-scattering method to align the lexicon, combined with CART and FST to predict the correct pronunciation.

First, the system was trained and tested using CART tree-based phoneme prediction. The LC-STAR dictionary contains 50 thousand words; it was randomly split into ten equal parts. Ninety per cent of the dictionary was used to train the system (ten per cent of which was used for evaluation), and the remaining ten per cent was used to test the system.

3.1 Conversion using CART and the influence of the decision-tree parameters on the error rate

The phoneme and word error rates for different parameters of the CART decision tree are plotted in Figures 3a and 3b. Figure 3a shows the error rates as a function of the minimum entropy gain necessary to justify further tree expansion. Figure 3b shows the error rates for different values of the maximum tree depth; if this parameter is set very small, the word error rate tends to be very high.

[Figure 3a: Phoneme and word error rates (%) as a function of the minimum entropy gain needed for further expansion of the tree.]

[Figure 3b: Phoneme and word error rates (%) as a function of the maximum tree depth.]

The parameters that gave the best results were found to be 0.001 for the entropy gain and 7 for the tree depth. These values coincide with those obtained by Black, Lenzo and Pagel (1998).

Table 3 presents the results obtained for the stressed and unstressed dictionaries. The data in the first row were obtained for the phonetic transcription including stress marks; a misplaced stress mark was counted as an error. The second row shows the percentage of correct phonemes and words compared with the correct phonetic transcription for the unstressed lexicon.

                  Phon.   Words
    with stress   90.13   51.04
    w/o stress    93.02   65.48

Table 3: Percentage of correct phonemes and words for the stressed and unstressed dictionary using CART.

The results can be improved by removing ambiguous abbreviations and non-standard words; see Table 4.
                  Phon.   Words
    with stress   91.29   57.80
    w/o stress    93.93   68.32

Table 4: Percentage of correct phonemes and words for the stressed and unstressed dictionary using CART, after the removal of long non-standard words and abbreviations from the corpus.

As Tables 3 and 4 show, the system performs significantly better on the unstressed lexicon, especially if we take into consideration the percentage of words correct.

3.2 Grapheme-to-phoneme conversion using FST

It is important to analyze how the parameters of the x-gram models affect the error rates. To clarify this, we performed experiments for different values of the parameters n and D in the x-gram model, where n is the maximum possible length of the conditional-probability history.


Figure 4a shows the error rate as a function of n.

[Figure 4a: Phoneme and word error rates (%) as a function of n.]

Both error rates decrease monotonically as n increases and remain practically unchanged from n = 5 onwards, which implies that n = 5 is the optimal parameter of the n-gram model for the given corpus. In Figure 4b we plot the error rate as a function of the divergence threshold in the x-gram model.

[Figure 4b: Phoneme and word error rates (%) as a function of the divergence threshold D.]

The results presented in Figure 4 imply that, while allowing a significant decrease in the number of states, the x-gram model gives good results even for fairly high values of the divergence threshold. The results in Figures 4a and 4b give the total percentage of errors in the grapheme-to-phoneme conversion. The error rates practically do not change for n ≥ 5 and for a wide range of divergence thresholds D ≥ 0.
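For reference, the phoneme and word error rates reported throughout this section can be computed as sketched below; the conventions (phoneme errors via edit distance against the reference transcription, a word counted as an error on any mismatch, including a stress mark) follow the description above, but the code itself is ours, not the paper's scoring script.

    def edit_distance(a, b):
        """Levenshtein distance between two phoneme sequences."""
        d = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, y in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                       prev + (x != y))
        return d[len(b)]

    def error_rates(pairs):
        """Phoneme and word error rates (%) over (reference,
        hypothesis) phoneme-sequence pairs."""
        phone_err = sum(edit_distance(r, h) for r, h in pairs)
        n_phones = sum(len(r) for r, _ in pairs)
        word_err = sum(r != h for r, h in pairs)
        return 100 * phone_err / n_phones, 100 * word_err / len(pairs)

    pairs = [(["m", "E", "d", "o", "z"], ["m", "E", "d", "o", "z"]),
             (["@", "b", "I1", "l", "@", "t", "i", "z"],
              ["@", "b", "I", "l", "@", "t", "i", "z"])]  # stress error
    print(error_rates(pairs))  # -> (7.69..., 50.0)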


[Figure 5: Probability distribution function of errors versus the number of graphemes per word (open squares). The line with filled circles is the distribution for running words; the line with filled triangles is the distribution for words in an English lexicon.]

5 Conclusions

The experiments performed show that grapheme-to-phoneme results depend on many parameters, as well as on the choice of the machine-learning technique used to perform the conversion. Results were obtained for both a stressed and an unstressed lexicon, using the epsilon-scattering method to perform the alignment and combining it with CART and FST classifiers.

The FST performed better on both the stressed and the unstressed dictionary. We analyzed different factors that may influence the error rate in grapheme-to-phoneme conversion. The obtained error distribution (see Figure 5) indicates that 9-letter words contribute the most to the total error rate when the optimal model parameters are chosen for training the system. An optimal grapheme-to-phoneme alignment method is as important as the choice of classification method, and it could give a significant performance improvement in speech synthesis and speech recognition tasks.

The problem with the currently used one-to-one alignment and the insertion of "empty" symbols is that some of the alignments produced are completely artificial and do not specify the pronunciation; moreover, a reasonable alignment may not be unique (Damper and Marchand, 2005). Black, Lenzo and Pagel (1998) concluded that the choice of the decision-tree learning technique does not influence the results as much as the alignment algorithm used to align the lexicon. In the future we plan to compare different alignment methods.

6 Acknowledgments

This work has been funded by the European Union under the integrated project TC-STAR - Technology and Corpora for Speech to Speech Translation (IST-2002-FP6-506738, http://www.tc-star.org).

7 References

Bisani, M. and H. Ney. 2002. Investigations on joint-multigram models for grapheme-to-phoneme conversion. In Proc. of the 7th Int. Conf. on Spoken Language Processing, Denver, CO, vol. 1, 105-108.

Black, A., K. Lenzo and V. Pagel. 1998. Issues in building general letter to sound rules. In Proc. of the 3rd ESCA Workshop on Speech Synthesis, 77-80, Jenolan Caves, Australia.
Bonafonte, A. and J. B. Mariño. 1996. Language modeling using x-grams. In Proc. of ICSLP-96, vol. 1, 394-397, Philadelphia, October 1996.

Damper, R. I., Y. Marchand, J.-D. Marsterns and A. Bazin. 2004. Aligning letters and phonemes for speech synthesis. In Proc. of the 5th ISCA Speech Synthesis Workshop, Pittsburgh, 209-214.

Damper, R. I. and Y. Marchand. 2005. Information fusion approaches to the automatic pronunciation of print by analogy. Information Fusion. To appear.

Dempster, A. P., N. M. Laird and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39, 1-38.

Galescu, L. and J. Allen. 2001. Bi-directional conversion between graphemes and phonemes using a joint n-gram model. In Proc. of the 4th ISCA Tutorial and Research Workshop on Speech Synthesis, Perthshire, Scotland.

Pagel, V., K. Lenzo and A. Black. 1998. Letter-to-sound rules for accented lexicon compression. In Proc. of ICSLP98, vol. 5, 2015-2020, Sydney, Australia.

Sigurd, B., M. Eeg-Olofsson and J. van de Weijer. 2004. Word length, sentence length and frequency - Zipf revisited. Studia Linguistica 58(1), 37-52.
