
A general-purpose IsiZulu speech synthesizer - CiteSeerX



2 S.Afr.J.Afr.Lang., 2005, 2

from the temporal centre of one phone to the centre of the adjacent phone) are used. This, together with the fact that all diphones are searched for possible candidates, potentially occurring as strings of adjacent diphones in the target data, produces superior synthesis quality when compared to other concatenative synthesis methods.

The database consists of multiple instances of each diphone type, whereas a diphone synthesizer consists of a database with just one instance of each diphone type. Thus, no prosodic modifications, which introduce deterioration in speech quality, are performed on the selected units. It is assumed that the selected units will contain the correct prosody, based on their context.

The total cost of selecting a candidate diphone is the sum of the target and concatenation costs. Target costs are linguistic measures of how well the candidate diphone fits the target diphone, while the concatenation cost is the acoustic cost of concatenating a candidate diphone with other possible candidates in the target string.

One advantage of the Multisyn synthesizer over the more conventional Cluster unit selection method is that the target costs can be chosen to optimize the quality of the synthesized speech. These costs generally depend on the characteristics of the language being synthesized, and are difficult to train from data, especially if the database is relatively small (fewer than 500 utterances).
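The cost structure described above (total cost as the sum of target and concatenation costs) can be illustrated with a minimal sketch. This is not the Multisyn implementation: the function name `select_units`, the tuple layout of the candidates, and the two toy cost functions are assumptions made purely for illustration. The sketch uses a standard dynamic-programming (Viterbi-style) search over all candidates for each target diphone.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Pick one candidate per target so that the summed target and
    concatenation costs are minimal (hypothetical helper, not Multisyn code).

    targets:    list of target diphone specifications
    candidates: candidates[i] is a non-empty list of database units for targets[i]
    """
    if not targets:
        return [], 0.0

    # best[i][j] = (cost of the cheapest path ending in candidates[i][j],
    #               index of the predecessor in candidates[i - 1])
    best = [[(target_cost(targets[0], c), -1) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            # Cheapest way to reach c from any unit chosen for the previous target.
            prev_cost, prev_idx = min(
                (best[i - 1][k][0] + concat_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], c), prev_idx))
        best.append(row)

    # Backtrace from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy data (assumed layout): target = (diphone, stressed),
# candidate = (diphone, stressed, utterance_id, position_in_utterance).
def toy_target_cost(target, cand):
    # Linguistic mismatch: penalise a stress mismatch.
    return 0.0 if target[1] == cand[1] else 1.0


def toy_concat_cost(prev, cand):
    # Units that were adjacent in the same recording join without penalty.
    return 0.0 if prev[2] == cand[2] and prev[3] + 1 == cand[3] else 1.0


targets = [("s-a", True), ("a-n", False)]
candidates = [
    [("s-a", True, 1, 0), ("s-a", False, 2, 3)],
    [("a-n", False, 1, 1), ("a-n", False, 2, 4)],
]
path, total = select_units(targets, candidates, toy_target_cost, toy_concat_cost)
```

In this toy example the search prefers the two units taken from the same recording, because they match the target stress and were originally adjacent. The toy concatenation cost only checks adjacency, whereas the text describes an acoustic measure; a real system would substitute its own cost functions.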
Another advantage is the fact that all units are candidates for selection, whereas in the Cluster unit selection the search space for possible candidates is broken up into clusters of acoustically similar units having the same target costs; thus, the selected units may not be optimal in that case.

The disadvantages of the Multisyn synthesizer are related to its strengths: a large search space must be searched, and target costs are calculated during synthesis (whereas with the Cluster unit selection method the target costs are inherently the clusters themselves, and the clustering process significantly reduces the extent of the search space). Multisyn synthesis may therefore be slower.

Task domain and text corpus collection

A ‘weather-related’ task domain was chosen for our initial development – this is attractive since the dynamic nature of weather-based information demonstrates the ability of TTS (over the use of voice recordings). The system developed is nevertheless a general-purpose text-to-speech system, with performance optimized for the specific task domain. That is, the system can pronounce any isiZulu text, but more than half of the recorded utterances were drawn from the weather domain, thus producing somewhat better synthesis in this domain.

A previously prepared isiZulu text corpus was not available for this development. In order to collect such a corpus, information was drawn from different sources. A general domain text corpus of 30,000 words was collected from the Internet. This corpus consists mainly of government-oriented documents, relating to domains such as health, tourism and governance. The documents were processed, and official approval was obtained where copyright restrictions were a concern. A first-language isiZulu speaker, who ensured that only valid isiZulu sentences were included, validated the corpus.
The corpus was further extended with a number of customized weather-specific sources (such as transcriptions of televised weather reports, and manually developed texts).

An open-source text selection tool, Optimal Text Selection (OTS), developed at HP Labs in Hyderabad, India (see Taludkar, 2004), was used to choose phonetically balanced sentences. Specifically, the smallest set of sentences that provides coverage of all diphones in the corpus, without regard for their context, was selected. A subset of 153 sentences was thus selected for recording. After the completion of an early version of the system, an additional 27 sentences were added to compensate for missing diphones and frequently occurring English loan words.

Phone set definition and grapheme-to-phoneme rules

An initial phone set was defined, originally based on the phone set defined in standard texts on isiZulu. For example, Poulos and Msimang (1998) define 74 phones to describe the sound system of isiZulu, but several of these are clusters (for example [mv] in imvula ‘rain’), which can be viewed as a concatenation of constituent sounds for the purposes of concatenative synthesis. In addition, the 18 click sounds listed in Poulos and Msimang (1998) can
