A general-purpose IsiZulu speech synthesizer - CiteSeerX

More documents

Recommendations

Info

6 S.Afr.J.Afr.Lang.,2005, 21. Word syllable positions:Two syllable positions were found to have a particular influence on <strong>speech</strong> quality: the final syllable in a word(see above) and the penultimate syllable. Thus, a candidate unit’s syllable position is compared to the target,counting from the end of the word. If a mismatch occurs, a penalty weight of 8 is added.2. Number of syllables in word:It was found that if a candidate unit is extracted from a word where the number of syllables differs significantlyfrom that of the target word then a perceptual mismatch occurs, since the speed of production correlates withthe number of syllables in a word. A weight of 3 is added if the number of syllables in the target word andcandidate word differ by more than a factor of 1.5.3. Left and right contexts:The neighbours to the left and to the right of the candidate and target units are compared. For example, in theword ‘test’, preceding the diphone ‘e-s’ with a diphone ‘t-e’ produces no penalty, while the preceding diphones‘w-e’ or ‘b-e’ would trigger this penalty. If a mismatch occurs a weight factor of 3 is added to the unit’s targetcost.Pilot evaluationA pilot evaluation was designed in order to determine the usability of the current version of the isiZulu TTS system.In particular, we wanted to determine whether the TTS would be understandable to users with limited literacy andlimited exposure to such technology.One part of the pilot evaluation was designed to assess the utility of our TTS system for over-the-telephone use.Note that this is a stringent test, since the low bandwidth in itself compromises intelligibility. Information onthree subjects (a weather forecast, the disease malaria, and unemployment insurance) was scripted in English,translated into isiZulu, and synthesized with our TTS system. This information was embedded in an InteractiveVoice Response (IVR) application, which provided initial instructions, and then prompted the evaluators to listento each of the information clips. All prompts are in isiZulu; whereas the synthesized voice is male, a female voicewas recorded for the rest of the IVR application.A questionnaire was prepared to (a) obtain biographic information of the evaluators (e.g. their age, home language,level of education); (b) test the evaluators’ understanding of the matter presented in this way; and (c) query theevaluators’ subjective experience of the synthesized <strong>speech</strong>. Evaluators were asked to provide their biographicinformation before interacting with the IVR application; understanding was tested after each information clip hadbeen presented, and the questions on their subjective experiences were completed after termination of the IVRapplication. All responses were recorded on paper – where the evaluators were not sufficiently literate to performthis task, an experimenter asked them the questions verbally, and filled out the questionnaire on their behalf.Finally, a Web-based evaluation was also developed in order to gain access to a wider group of evaluators (althoughthis part of the evaluation would then be limited to literate, technologically sophisticated users). The same basicprotocol was followed as for the telephone-based evaluation, but users now recorded their responses directly in theWeb application, and their responses were logged on the Web server. (For practical reasons, this system was madeavailable on an intranet, rather than the public Internet.)Evaluators were canvassed in various ways, ranging from personal contacts to a company-wide e-mail solicitation.Evaluators completed the process without assistance where possible, but in cases where their level of literacy didnot allow them to read the instructions and complete the questionnaire, an experimenter assisted them by readingfrom the questionnaire and entering their answers on the paper form. Results from the telephone-based and Webbasedtrials were entered into a spreadsheet, and processed to assess the performance of the TTS system.The current system is domain-independent, but with a bias towards the weather domain; the evaluation allowed usto assess how strongly this bias influences the quality of the synthesized utterances.
S.Afr.J.Afr.Lang.,2005, 27ResultsA test user group consisting of twenty-three people evaluated the telephony system. 40% of the test group (E1)had a low literacy level. These evaluators either indicated that they could not read or write, or had significantdifficulty reading the evaluation form. These users were typically employed as labourers, gardeners or laundryworkers.The remaining 60% of the test group (E0) were literate, and in certain cases well educated. Occupationsfor this second group ranged from domestic worker and security guard to software developer and human resourcespractitioner. Sixteen of the telephony evaluators indicated that they spoke isiZulu as a ‘home language’ – theremainder of the test group consisted of fluent isiZulu speakers with home languages that included Setswana,isiNdebele, SiSwati, Xitsonga and isiXhosa. (Fluency was assessed by the subjects themselves, and not subjectedto further verification.)E0 EvaluationThe following was noted during the E0 evaluation (literate evaluators):• 57% of evaluators indicated that they found the synthesized voice ‘fairly easy’ to ‘very easy’ to understand.While the remainder of the test group found it ‘fairly hard’ to ‘very hard’, no one indicated that the system was‘impossible’ to understand.• Surprisingly, 50% of the evaluators found the synthesized voice ‘as natural’ as or only ‘slightly less natural’than the recorded prompts used to provide the instructions. The remainder felt that the voice was either ‘a fairamount’ or ‘a lot’ more unnatural with only one evaluator indicating that the synthesized voice does not ‘soundhuman at all’.• A strong correlation was observed between people’s ability to answer the test questions correctly, and the‘intelligibility score’ they gave the system, as shown in Table 1.• Five of the 13 evaluators (36%) answered all three questions correctly. The average number of correct answersover all evaluators was 2.1 (of 3).• Home language isiZulu-speakers tended to answer the evaluation questions more accurately than other isiZuluspeakers(with an average of 2.3 correct answers compared to 1.8).Table 1: Perception of intelligibility according to number of questions answered correctly for E0evaluatorsNumber of correct answers Own intelligibility score(of 3) (of 5)3 4.42 3.31 2.7E1 EvaluationThe E1 evaluation (illiterate evaluators) produced results that differed from the above in interesting ways. Duringobservation and through after-evaluation discussions, it became clear that the test procedure was not sufficient intwo ways:• When responding to evaluation questions, evaluators found it difficult to differentiate between informationobtained from the system, and information they knew beforehand. For example, the weather questionprovided a fictional weather report and on account of the information provided, asked what the weather
Page 1 and 2: S.Afr.J.Afr.Lang.,2005, 21A general
Page 5: S.Afr.J.Afr.Lang.,2005, 25Intonatio

A general-purpose IsiZulu speech synthesizer - CiteSeerX

Create successful ePaper yourself

Delete template?

Save as template?