12.07.2015 Views

Automatic Sentence Selection from Speech Corpora Including ...

Automatic Sentence Selection from Speech Corpora Including ...

Automatic Sentence Selection from Speech Corpora Including ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

TTS corpus and ask yourself whether the sentence wouldhave a negative effect on such a corpus, i.e. sentencesincluding obvious problems such as differences between textand audio, audio files with parts cut off, no leading and/ortrailing silences, and files including distortions.This manual selection resulted in a subset of 3354sentences representing 48.3% of the original corpus (6.4hours, 387 minutes). This subset was used to build an HMM-TTS voice henceforth called NEUTRAL_hand.2.3. Experimental resultsTwo HMM-TTS voices were built, one based on the fullset of 6949 sentences (henceforth called voice FULL) and theother based on the manually selected neutral subset of 3354sentences (NEUTRAL_hand). Both voices were built fullyautomatically and separately, e.g. automatic phone alignmentswere created for each corpus individually by using HMMmodels built <strong>from</strong> a flat-start [6].70 sentences were synthesized using each voice.<strong>Sentence</strong>s were selected <strong>from</strong> various domains, e.g. news,email, navigation and included several sentence types likeyes/no-questions, wh-question, imperatives and statements.A preference test was conducted via crowdsourcingincluding 24 subjects. Each subject listened to a randomlyselected subset of 30 sentences and had to listen to the samesentence in two synthesized versions: one <strong>from</strong> voiceNEUTRAL_hand the other <strong>from</strong> voice FULL. Listeners hadthe option to choose “no preference”. Sample order as wellas the order of sentences were both randomized for subjects.Table 2.1 shows the mean preference scores. VoiceNEUTRAL_hand was preferred to voice FULL (53.9% vs.32.5%). This difference is statistically significant (p f0 mean_max * 1.40f0 max too low f0 max < f0 mean_mean * 1.35f0 mean too high f0 mean > f0 mean_mean * 1.50f0 mean too low f0 mean < f0 mean_mean / 1.38Min % voicedratio frames voiced/unvoicedRMS max too high RMS max > RMS mean_max * 2.0RMS max too low RMS max < RMS mean_max * 1.1RMS mean too high RMS mean > RMS mean_mean * 1.9RMS mean too low RMS mean < RMS mean_mean / 2.8Table 3.1. Acoustic features used in the automaticneutral sentence selection approach.3.2. Text based featuresThe following text based features are used in addition toacoustic features. The features are split into (a) featuresidentifying non-neutral sentence types, i.e. sentences likely tobe spoken in a non-neutral style, such as the following• Quotation marks → likely to be direct speech• Interjections, e.g. “Oh, Ah, Hm” → expressive speech• Starting with lowercase → sentence fragment• Sequence of 3 full stops: “…” → fragment• Ending in comma, colon or semi-colon → fragmentand (b) features identifying sentences likely to include textnormalization (TN) errors:• Ampersand “&” → likely to cause problems for TN• Digit in square brackets, e.g. “[1]” → TN problems• Year digits, e.g. “1880” → TN problemsThe presence of quotation marks is likely to indicatedirect speech - often realized in a non-neutral style. Thereforeall sentences including quotation marks are discarded.However, since the current approach evaluates sentencesindividually and does not detect opening and/or closing1822

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!