Automatic Sentence Selection from Speech Corpora Including ...

INTERSPEECH 2011

Automatic sentence selection from speech corpora including diverse speech for improved HMM-TTS synthesis quality

Norbert Braunschweiler, Sabine Buchholz
Toshiba Research Europe Ltd., Cambridge Research Laboratory, Cambridge, United Kingdom
{norbert.braunschweiler,sabine.buchholz}@crl.toshiba.co.uk

Abstract

Using publicly available audiobooks for HMM-TTS poses new challenges. This paper addresses the issue of diverse speech in audiobooks. The aim is to identify diverse speech likely to have a negative effect on HMM-TTS quality. Manual removal of diverse speech was found to yield better synthesis quality despite halving the training corpus. To handle large amounts of data, an automatic approach is proposed which uses a small set of acoustic and text based features. A series of listening tests showed that the manual selection is most preferred, while the automatic selection was significantly preferred over the full training set.

Index Terms: speech synthesis, HMM-TTS, corpus creation, diverse speech, speaking styles, audiobooks

1. Introduction

Traditionally, synthetic voices have been built on specifically designed corpora carefully recorded by a professional voice talent in a single speaking style, often a neutral style with moderate voice modulation. This ensured the highest synthesis quality and a homogeneous speaking style at synthesis time, albeit often too uniform for longer texts. In an effort to realize more expressive synthetic speech, especially for long coherent texts, the use of other speech corpora such as audiobooks has become a focus of research for synthetic voice building [1]. However, publicly available audiobooks are often read by non-professional readers under non-optimal recording conditions.
Many of these audiobooks include a wide variety of speaking styles with one or more speakers.

Work on using publicly available audiobooks for research and for building synthetic voices has so far focused on preparing these large corpora to make them accessible for voice building [1] [2] [3]. In [3] an approach was developed to automatically segment and select sentences from audiobooks based on the correspondence between the book text and the speech recognition result. However, the presence of diverse speech as a challenge for synthetic voice building has not been widely addressed.

Diverse speech in publicly available audiobooks can include character voices, highly emotional speech, and speech with audibly different loudness, rate or quality. Here the term "audibly different" refers to some baseline considered to be 'neutral' speech with homogeneous characteristics in terms of style, loudness, rate and quality. Although it has been mentioned in the literature that it is not possible to speak without expression [4], the concept of 'neutral' speech is still helpful as a baseline. Typically, 'neutral' speech can be associated with traditional TTS corpora, while many audiobooks include a wide range of 'non-neutral' speech.

When using publicly available audiobooks for HMM-TTS voice building it was noticed that synthesis quality was degraded compared to standard TTS corpora. To test whether the presence of diverse speech caused this degradation, a pilot experiment (described in detail in the next section) was run which separated 'neutral' speech from 'non-neutral' or potentially detrimental (for TTS voice building) speech. It was found that the removal of 'non-neutral' speech improved synthesis quality despite halving the training corpus. In order to handle large amounts of audiobooks, an automatic approach was created to detect diverse speech likely to have a negative effect on synthesis quality. This approach is introduced in section 3.
The approach is then evaluated in a series of listening tests in section 4, followed by conclusions sketching some further directions.

2. Manual selection of 'neutral' speech

A pilot experiment was conducted to test whether an HMM-TTS voice based on a manually selected subset of sentences considered suitable for synthetic voice building is preferred to a voice based on all sentences extracted from an audiobook.

2.1. Audiobook corpus

The audiobook used here is based on the American English book "A Tramp Abroad" written by Mark Twain and publicly available from librivox.org [5]. The recording condition is acceptable, albeit there are some differences across recordings. The book describes the experiences of an American tourist while travelling through Europe around 1880. Its total running time is 15 hours and 46 minutes. The narrator (John Greenman) uses a wide range of speaking styles to express funny, humorous, bizarre, etc. passages.

The audiobook was split into sentences by the lightly supervised sentence segmentation and selection approach of [3]. This approach compares the speech recognition result for a sentence with its original orthographic version in the book. When the comparison results in a word accuracy lower than a specified threshold, the sentence is discarded. A 75% confidence threshold was used, which resulted in 6949 sentences, or 87.5%, being selected (891 minutes, i.e. almost 15 hours).

2.2. Manual selection

To select neutral sentences from the audiobook, a human labeler was instructed to listen to each sentence and judge whether it sounds neutral or whether it should be discarded. The labeler used the following guidelines: select sentences spoken in the speaker's neutral style, which is not strictly defined as the absence of all expression and may include a neutral-narrative style, but excludes audibly expressive styles like shouting, whispering and speech sounding angry, happy, sad, etc.
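As a concrete illustration of the word-accuracy criterion used in the segmentation step of section 2.1, here is a minimal sketch. The function names and the pairing of book text with ASR hypotheses are assumptions for illustration, not the authors' implementation:

```python
# Hypothetical sketch of the lightly supervised selection step: a sentence is
# kept only if the word accuracy between the ASR hypothesis and the book text
# reaches a threshold (75% in the paper).

def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - (word-level edit distance / reference length)."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard Levenshtein distance over words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(ref), 1)

def select_sentences(pairs, threshold=0.75):
    """Keep (book_text, asr_text) pairs whose word accuracy >= threshold."""
    return [book for book, asr in pairs if word_accuracy(book, asr) >= threshold]
```

With the 75% threshold, a sentence with one substitution in four words (accuracy 0.75) is just kept, while noisier recognition results are discarded.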
The labeler was also instructed to discard sentences potentially harmful for building a TTS system: if in doubt, imagine a traditional unit selection TTS corpus and ask whether the sentence would have a negative effect on such a corpus, i.e. sentences including obvious problems such as differences between text and audio, audio files with parts cut off, no leading and/or trailing silences, and files including distortions.

Copyright © 2011 ISCA. 28-31 August 2011, Florence, Italy.

This manual selection resulted in a subset of 3354 sentences representing 48.3% of the original corpus (387 minutes, i.e. about 6.4 hours). This subset was used to build an HMM-TTS voice, henceforth called NEUTRAL_hand.

2.3. Experimental results

Two HMM-TTS voices were built, one based on the full set of 6949 sentences (henceforth called voice FULL) and the other based on the manually selected neutral subset of 3354 sentences (NEUTRAL_hand). Both voices were built fully automatically and separately, e.g. automatic phone alignments were created for each corpus individually using HMM models built from a flat start [6].

70 sentences were synthesized with each voice. Sentences were selected from various domains, e.g. news, email and navigation, and included several sentence types such as yes/no-questions, wh-questions, imperatives and statements.

A preference test was conducted via crowdsourcing with 24 subjects. Each subject listened to a randomly selected subset of 30 sentences, each in two synthesized versions: one from voice NEUTRAL_hand, the other from voice FULL. Listeners had the option to choose "no preference". Sample order as well as the order of sentences were randomized for each subject. Table 2.1 shows the mean preference scores. Voice NEUTRAL_hand was preferred to voice FULL (53.9% vs. 32.5%).
This difference is statistically significant.

3. Automatic selection of 'neutral' speech

3.1. Acoustic features

The automatic approach uses a small set of acoustic and text based features to detect sentences likely to have a negative effect on synthesis quality. The acoustic features compare per-sentence f0 and RMS statistics against corpus-level averages of those values; Table 3.1 lists the features and their thresholds.

f0 max too high     f0 max > f0 mean_max * 1.40
f0 max too low      f0 max < f0 mean_mean * 1.35
f0 mean too high    f0 mean > f0 mean_mean * 1.50
f0 mean too low     f0 mean < f0 mean_mean / 1.38
Min % voiced        ratio frames voiced/unvoiced
RMS max too high    RMS max > RMS mean_max * 2.0
RMS max too low     RMS max < RMS mean_max * 1.1
RMS mean too high   RMS mean > RMS mean_mean * 1.9
RMS mean too low    RMS mean < RMS mean_mean / 2.8

Table 3.1. Acoustic features used in the automatic neutral sentence selection approach.

3.2. Text based features

The following text based features are used in addition to the acoustic features. They are split into (a) features identifying non-neutral sentence types, i.e. sentences likely to be spoken in a non-neutral style:

• Quotation marks → likely to be direct speech
• Interjections, e.g. "Oh, Ah, Hm" → expressive speech
• Starting with lowercase → sentence fragment
• Sequence of 3 full stops: "..." → fragment
• Ending in comma, colon or semi-colon → fragment

and (b) features identifying sentences likely to include text normalization (TN) errors:

• Ampersand "&" → likely to cause problems for TN
• Digit in square brackets, e.g. "[1]" → TN problems
• Year digits, e.g. "1880" → TN problems

The presence of quotation marks is likely to indicate direct speech, often realized in a non-neutral style. Therefore all sentences including quotation marks are discarded. However, since the current approach evaluates sentences individually and does not detect opening and/or closing
quotation marks across single sentences, some direct speech sentences will not be detected.

TN errors could result in incorrect phone alignments, which have a negative influence on synthesis quality.

The percentage of sentences containing these kinds of text features can vary considerably per book and depends on the text genre, the writing style of the author, formatting, etc.

3.3. Additional features to exclude problematic files

Additional features were added because of known problems in HMM-TTS with certain training material:

• Total duration too long: duration > 15 sec or duration > Dur mean * 5
• Total duration too short: duration < Dur mean / 6
• Missing leading and/or trailing silences

Audio files longer than 15 seconds could include multiple sentences and might be a result of mislabeled sentence boundaries; such long audio files have a higher chance of having incorrect phone alignments. The limit of 15 seconds was determined by checking the longest sentences in typical TTS corpora. In addition, a sentence is discarded when its duration is a very long outlier (duration > Dur mean * 5) or a very short outlier (duration < Dur mean / 6). The next section presents the percentage of sentences detected by each of the features when applied to an audiobook.

3.4. Effect of features

Table 3.2 shows an overview of the acoustic features and the percentage of sentences discarded for the corpus "A Tramp Abroad" (cf. section 2.1). Some features result in a relatively large number of sentences being discarded, e.g. total duration too long and f0 maximum too low/high, whereas others discard only a small fraction of sentences, e.g. RMS mean too high, RMS max too low, and percentage voiced too low. Duration based features and features checking the f0 discard more sentences than the RMS based features.

Table 3.3 shows the effect of the text based features. The features with the highest number of detected sentences are double quotes, ends in comma or (semi-)colon, and lowercase sentence start.
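To make the feature set concrete, here is a minimal sketch of how the checks from sections 3.1-3.3 could be combined. The field names, the corpus-statistics dictionary and the regular expressions are assumptions for illustration (the paper does not publish an implementation), and the voicing check is omitted:

```python
import re

# Text patterns from section 3.2 (the regexes are illustrative approximations).
TEXT_PATTERNS = [
    re.compile(r'["\u201c\u201d]'),   # quotation marks -> likely direct speech
    re.compile(r'\b(?:Oh|Ah|Hm)\b'),  # interjections -> expressive speech
    re.compile(r'^[a-z]'),            # lowercase start -> sentence fragment
    re.compile(r'\.\.\.'),            # sequence of 3 full stops -> fragment
    re.compile(r'[,;:]\s*$'),         # ends in comma/colon/semi-colon -> fragment
    re.compile(r'&'),                 # ampersand -> text normalization problems
    re.compile(r'\[\d+\]'),           # digit in square brackets -> TN problems
    re.compile(r'\b\d{4}\b'),         # year digits, e.g. "1880" -> TN problems
]

def discard_text(sentence):
    """True if any text based feature from section 3.2 fires."""
    return any(p.search(sentence) for p in TEXT_PATTERNS)

def discard_acoustic(s, c):
    """Thresholds from Table 3.1: s holds per-sentence measurements,
    c holds corpus-level means of those measurements."""
    return (s["f0_max"] > c["f0_mean_max"] * 1.40
            or s["f0_max"] < c["f0_mean_mean"] * 1.35
            or s["f0_mean"] > c["f0_mean_mean"] * 1.50
            or s["f0_mean"] < c["f0_mean_mean"] / 1.38
            or s["rms_max"] > c["rms_mean_max"] * 2.0
            or s["rms_max"] < c["rms_mean_max"] * 1.1
            or s["rms_mean"] > c["rms_mean_mean"] * 1.9
            or s["rms_mean"] < c["rms_mean_mean"] / 2.8)

def discard_duration(dur, dur_mean):
    """Duration checks from section 3.3 (durations in seconds)."""
    return dur > 15.0 or dur > dur_mean * 5 or dur < dur_mean / 6
```

A sentence would be kept for voice building only if none of the three checks fires; the checks are independent, so they can be applied in any order and the cheap text check can run first.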
The other features contribute relatively little to the overall detection rate of the text based features.

Combining audio and text features and removing duplicates results in 42.1% of all sentences being discarded. Acoustic features used to identify non-neutral speaking styles (f0, RMS, voicing) categorize 15.7% of sentences, while the others (duration, leading/trailing silences) discard 11.4%. The latter is mainly a result of discarding sentences longer than 15 seconds. Among the acoustic features designed to identify non-neutral speaking styles, the f0 features account for the majority of hits (>99%). Text based features designed to identify non-neutral sentences are responsible for the majority of sentences detected by text based features (23.4%), while the remaining three features (ampersand, digit in brackets, year digits) contribute very little.


Compared to the manual selection, the automatic approach left some sentences incorrectly classified as 'neutral'. Recall was 70.6%, showing that the number of false negatives was not extremely high. The automatic approach discarded about 10% fewer files overall, yet discarded almost 30% of the NEUTRAL_hand sentences; more than 38% additional sentences were selected by the automatic approach. Since the human labeler used additional criteria such as voice quality, speaking rate, distortions, etc., some of the discrepancy may be due to this.

A listening test was conducted comparing voices NEUTRAL_hand and NEUTRAL_auto. The preference test was done by 20 subjects via crowdsourcing. The result shows a significant preference for voice NEUTRAL_hand.
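The agreement figures above amount to a standard precision/recall comparison of the two selections, with the manual selection treated as ground truth. A minimal sketch, using invented sentence IDs rather than the paper's data:

```python
# Hypothetical sketch: evaluating an automatic 'neutral' selection against the
# manual reference selection. Only the metric definitions follow the text;
# the sentence IDs below are illustrative.

def precision_recall(auto_selected, manually_selected):
    """Precision/recall of the automatic selection vs. the manual reference."""
    auto, manual = set(auto_selected), set(manually_selected)
    true_pos = len(auto & manual)                      # kept by both
    precision = true_pos / len(auto) if auto else 0.0  # auto picks that are correct
    recall = true_pos / len(manual) if manual else 0.0 # manual picks recovered
    return precision, recall

# Toy example: the automatic pass keeps 4 sentence IDs, the labeler kept 5.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

Here a false negative is a manually selected sentence that the automatic approach discarded, which is the quantity the reported 70.6% recall measures.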
