
Proceedings Fonetik 2009 - Institutionen för lingvistik


Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

3. the Mahalanobis distance (Taylor, 2008) for the MFCCs

The F0 distance was weighted by 0.5. A penalty of 10 was given to those segments from the target corpus where the vowel following the initial plosive was not /a/, i.e. a different vowel than the one in the remainder corpus. The F0 weighting factor and the vowel-penalty value were arrived at after iterative tuning. The distances were calculated using a combination of Perl and Microsoft Excel.

Concatenation

For each possible segment combination ((/p|t|k/) + (/ap|at|ak/), i.e. 9 possible combinations in total), the join costs were ranked. The five combinations with the lowest costs within each of these nine categories were then concatenated using the Snack tool CONCAT. Concatenation points were placed at zero-crossings within a range of 15 samples after the middle of the vowel following the initial plosive. (If no zero-crossing was found within that range, the concatenation point was set to the middle of the vowel.)

Evaluation

Seven adult subjects were recruited to perform a listening test. All subjects were native Swedes with no known hearing problems, and naïve in the sense that they had not been involved in any work related to speech synthesis development.

A listening test was constructed in Tcl/Tk to present the 45 stimuli (i.e. the five concatenations with the lowest costs for each of the nine different syllables) and 9 original recordings of the different syllables. The 54 stimuli were each presented twice (resulting in a total of 108 items), in random order. The task for the subjects was to decide what syllable they heard (by selecting one of the nine possible syllables) and to judge the naturalness of the utterance on a scale from 0 to 100. The subjects could play the stimuli as many times as they wanted.
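The join cost terms and the concatenation-point rule described under Concatenation can be illustrated with a minimal Python sketch. It is not the authors' implementation (they report using Perl, Microsoft Excel and Snack). Several things here are assumptions: the cost terms are combined by plain addition, the Mahalanobis distance uses a diagonal covariance, a zero-crossing is taken to be a sign change between consecutive samples, and all field names (`mfcc`, `f0`, `vowel`, `plosive`, `rime`) are hypothetical. Only the cost terms visible in this excerpt are included.

```python
import math
from collections import defaultdict

F0_WEIGHT = 0.5      # F0 weighting factor reported in the text
VOWEL_PENALTY = 10   # penalty reported for target vowels other than /a/
SEARCH_RANGE = 15    # samples searched after the vowel midpoint


def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance under an assumed diagonal covariance matrix."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


def join_cost(target, remainder, variances):
    """Sum the cost terms; additive combination is an assumption."""
    cost = mahalanobis_diag(target["mfcc"], remainder["mfcc"], variances)
    cost += F0_WEIGHT * abs(target["f0"] - remainder["f0"])
    if target["vowel"] != "a":
        cost += VOWEL_PENALTY
    return cost


def best_joins(pairs, variances, n_best=5):
    """Keep the n_best cheapest (target, remainder) pairs per syllable category."""
    by_category = defaultdict(list)
    for target, remainder in pairs:
        category = (target["plosive"], remainder["rime"])
        cost = join_cost(target, remainder, variances)
        by_category[category].append((cost, target, remainder))
    return {
        category: sorted(items, key=lambda item: item[0])[:n_best]
        for category, items in by_category.items()
    }


def concatenation_point(samples, vowel_mid, search_range=SEARCH_RANGE):
    """First sign change within search_range samples after vowel_mid.

    Falls back to vowel_mid when no zero-crossing is found in that range.
    """
    last = min(vowel_mid + search_range, len(samples) - 1)
    for i in range(vowel_mid, last):
        if (samples[i] < 0) != (samples[i + 1] < 0):
            return i + 1  # index of the first sample after the sign change
    return vowel_mid
```

With three plosives and three rimes, `best_joins` would return up to nine categories of five pairs each, matching the 45 concatenated stimuli used in the listening test.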
Before starting the actual test, 6 training items were presented, after which the subjects had the possibility of asking questions regarding the test procedure.

Statistical analysis

Inter-rater agreement was assessed via the intraclass correlation (ICC) coefficient (2, 7) for syllable identification accuracy and naturalness rating separately.

Pearson correlations were used to assess intra-rater agreement for each listener separately.

Results

The results of the evaluation are presented in Table 1.

Table 1. Evaluation results for the concatenated and original speech samples. The first column displays the percentage of correctly identified syllables, and the second column displays the average naturalness judgments (max = 100).

               % correct syll   Naturalness
Concatenated   94%              49 (SD: 20)
Original       100%             89 (SD: 10)

The listeners demonstrated high inter-rater agreement on naturalness rating (ICC = 0.93), but lower agreement on syllable identification accuracy (ICC = 0.79).

Average intra-rater agreement for all listeners was 0.71 on naturalness rating, and 0.72 on syllable identification accuracy.

Discussion

Considering that the purpose of this study was to study the possibilities of generating understandable and close to natural sounding concatenations of segments from different speakers, the results are actually quite promising.

The listeners’ syllable identification accuracy of 94% indicates that comprehensibility is not a big problem. Although the total naturalness judgement average of 49 (of 100) is not at all impressive, an inspection of the individual samples reveals that there are actually some concatenated samples that receive higher naturalness ratings than original samples.
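The agreement measures named under Statistical analysis can be sketched in a few lines of Python. This is an illustration only, not the authors' analysis. It assumes that "(2, 7)" denotes the two-way random-effects, average-of-k-raters form of the ICC in the Shrout and Fleiss notation, with k = 7 listeners, and that intra-rater agreement correlates each listener's first and second responses to the same stimuli. The function names are hypothetical.

```python
import math


def icc2k(ratings):
    """ICC(2,k): two-way random effects, average of k raters.

    ratings: list of rows, one row per stimulus, one column per listener.
    """
    n = len(ratings)       # number of stimuli (targets)
    k = len(ratings[0])    # number of listeners (raters)
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


def pearson(x, y):
    """Pearson correlation, e.g. first vs. second rating of each stimulus."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

For the naturalness ratings, `ratings` would hold one row per stimulus and seven columns, one per listener. `pearson` would be called once per listener with that listener's two passes over the stimuli.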
The results thus confirm that it is indeed possible to generate close to natural sounding samples by concatenating speech from different speakers.

However, when considering that the long-term goal is a working system that can be implemented and used to assist phonological therapy with children, the system is far from complete. As of now, the amount of manual intervention required to run the re-synthesis process is large. Different tools were used to complete different steps (various Snack tools, Microsoft Excel), and Perl scripts were used as interfaces between these steps. Thus, there is still a long way to go before real-time processing is possible. Moreover, the system is still limited to voiceless plosives in sentence-initial positions, and ideally, the system should
