
Development of an automatic speech recognition system for Russian language with RLAT

Alexey Mokin
Mykola Volovyk
Advisor: Tim Schlippe
09-02-2011


Content

1. Goal of the practical course
2. Automatic Speech Recognition (ASR)
   1. Advantages
   2. Disadvantages
   3. Basics
3. Rapid Language Adaptation Toolkit (RLAT)
4. Russian language
5. Previous results
6. ASR building steps
   1. Pronunciation dictionary
   2. Phonemes
   3. Speech data
   4. Acoustic model
   5. Language model
   6. Text normalization
   7. Crawling
7. Results


1. Goal

• Build a speech recognizer for the Russian language with the Rapid Language Adaptation Toolkit (RLAT)
  – Able to recognize read speech
  – Newspaper domain
  – Initial recognizer might be extended to broadcast transcription


2.1 Automatic speech recognition (ASR): advantages

• Natural way of communication for human beings
  – No practicing necessary for users, i.e. speech does not require any teaching, as opposed to reading/writing
  – High bandwidth (speaking is faster than typing)
• Hands and eyes are free for other tasks
  – Works in the car / on the run / in the dark
• Mobility (microphones are smaller than keyboards)
• Some communication channels (e.g. phone) are designed for speech

[Schultz et al, 2010]


2.2 Automatic speech recognition (ASR): disadvantages

• Unusable where silence/confidentiality is required (meetings, library, spoken access codes)
• Still unsatisfactory recognition rate when:
  – Environment is very noisy (party, restaurant, train)
  – Unknown or unlimited domains
  – Uncooperative speakers (whisper, mumble, …)
  – Problems with accents, dialects, code-switching
• Cultural factors (e.g. collectivism, uncertainty avoidance)
• Speech input is still more expensive than keyboard input

[Schultz et al, 2010]


2.3 Automatic speech recognition (ASR): Basics

[Diagram: overview of ASR components. Inputs: speech data (with transcription), pronunciation rules, and text data]

[Schultz et al, 2010] [Metze, 2010]


2.4 Automatic speech recognition (ASR): Basics

The main formula of speech recognition:
• Given acoustic data A = a_1, a_2, ..., a_T
• Find the word sequence W' = w'_1, w'_2, ..., w'_n with the highest likelihood
• Such that P(W | A) is maximized

The search problem can be formulated as:
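W' = argmax_W P(W | A) = argmax_W P(A | W) · P(W) / P(A) = argmax_W P(A | W) · P(W)

This is the standard reconstruction via Bayes' rule: P(A) does not depend on W and can be dropped during the search, leaving an acoustic model term P(A | W) and a language model term P(W). A toy sketch of this maximization over a fixed hypothesis list; the hypotheses and scores below are invented purely for illustration:

```python
# Sketch: pick the hypothesis maximizing P(A|W) * P(W), i.e. the sum of
# acoustic and language model log-scores. The hypotheses and scores are
# made up; a real decoder searches a huge hypothesis space instead.
hypotheses = [
    # (word sequence, log P(A|W), log P(W))
    ("двенадцать часов", -120.3, -8.1),
    ("двенадцать слонов", -119.8, -14.6),
]

best = max(hypotheses, key=lambda h: h[1] + h[2])
print(best[0])  # "двенадцать часов": better combined score
```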



2.4 Automatic speech recognition (ASR): Basics

How does ASR work?

Two-stage process for statistical ASR:
1. Train a statistical model (often: maximum likelihood approach)
2. Test on unknown data

The output is the most likely hypothesis according to the internal model.


3. Rapid Language Adaptation Toolkit (RLAT)

• RLAT stands for Rapid Language Adaptation Toolkit and is a continuation of the SPICE system, which was developed between 2004 and 2007.
• RLAT builds on the existing GlobalPhone and FestVox projects. Knowledge and data, such as phoneme sets, pronunciation dictionaries, acoustic models, and text resources, are shared between recognition and synthesis.
• Goals:
  – Bridge the gap between technology experts and language experts
  – Develop web-based intelligent systems
    • Interactive learning with user in the loop
    • Rapid adaptation of universal models to unseen languages
• RLAT web page: http://csl.ira.uka.de/rlat-dev

[Schultz et al, 2010]


3. Rapid Language Adaptation Toolkit (RLAT)

RLAT collects:
— Text & audio data

RLAT defines:
— Phoneme set
— Rich prompt set
— Lexical pronunciations

RLAT produces:
— Pronunciation model
— ASR acoustic model
— ASR language model
— TTS voice

RLAT maintains:
— Projects and user logins
— Data and models

[Schultz et al, 2010]


4. Russian language

• Russian is a Slavic language in the Indo-European family.
• From the point of view of the spoken language, its closest relatives are Ukrainian and Belarusian.

[Wikipedia]


4. Russian language: difficulties

Rich morphology → need MORE data
— A set of prefixes, prepositional and adverbial in nature; diminutive, augmentative, and frequentative suffixes and infixes
— Six cases in two numbers (singular and plural)
— Up to ten additional cases are identified in linguistics textbooks
— Strictly observed grammatical gender (masculine, feminine and neuter)

Derived language varieties occur online (sometimes in articles and always in comments) → need more DIFFERENT text data
— More sources improve training quality due to more rules and a larger dictionary

Cyrillic alphabet → encoding and transliteration issues during LM training


5. Previous results

Ngoc Thang Vu, Tim Schlippe, Franziska Kraus, Tanja Schultz: Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit (2010)


5. Previous results

WER [%] for five Eastern European Languages

Language / LM             BG     HR     CZ     PL     RU
+ add. websites (dev)     20.4   30.5   26.5   27.2   41.0
+ training utts (dev)     20.0   28.9   25.3   24.3   40.3
+ training utts (eval)    16.9   32.8   24.8   22.3   36.6
+ 500K dict (eval)        17.6   33.5   23.5   20.4   36.2

«The best systems give WERs of 16.9% for BG, 32.8% for HR, 23.5% for CZ, 20.4% for PL and 36.2% for RU on the evaluation set» [Vu et al, 2010]
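Word error rate (WER), the metric quoted throughout, is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch (a real scorer would also report substitutions, deletions, and insertions separately):

```python
# Sketch: WER = (substitutions + deletions + insertions) / reference length,
# computed as Levenshtein distance over words.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("а б в г", "а х в"))  # 2 errors / 4 reference words = 0.5
```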


6. ASR building steps

1. Pronunciation dictionary
2. Phonemes
3. Speech data
4. Acoustic model
5. Language model
6. Text normalization
7. Crawling


6.1 Pronunciation dictionary

• By the term "pronunciation dictionary" we mean the dictionary that specifies how the words of our language should be read.
• The dictionary can have Janus or Festival format
• Here: Janus dictionary
  ‒ Each entry consists of a word followed by its phoneme sequence (a simplified loading sketch follows below)
• A Janus dictionary for Russian was already built in the GlobalPhone project
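A minimal sketch of loading such a dictionary, assuming a simplified one-entry-per-line format ("word phone phone ..."); the actual Janus/GlobalPhone format is richer, so this is illustrative only:

```python
# Sketch: load a simplified pronunciation dictionary.
# Assumption: one entry per line, "word phone1 phone2 ...".
# The real Janus format differs; this is an illustrative simplification.

def load_dictionary(path):
    """Map each word to its list of phonemes."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                word, phones = parts[0], parts[1:]
                lexicon[word] = phones
    return lexicon

# Hypothetical example: lexicon.get("да") might yield ["M_d", "M_a"]
# in MM7-style notation (symbols assumed, not taken from the real files).
```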


6.2 Phonemes

• The International Phonetic Alphabet (IPA) is an alphabetic system of phonetic notation
• MM7 is an acoustic model inventory that was trained earlier on seven GlobalPhone languages
• To build a system in a new language, we need an initial state alignment
• This alignment is produced by selecting the closest matches between the GlobalPhone inventory MM7 and the IPA-based phones.


6.2 Phonemes

• The file phonemeSetRussian.txt was extracted from the dictionary and lists the phonemes in MM7 format.
• The association between the MM7 format and the IPA format is provided by IPA-MM7.txt
  – but it does not cover all cases
• So the mapping could be done according to the following scheme:


6.2 Phonemes

[Diagram: mapping scheme between the dictionary with MM7 phoneme format, the IPA-MM7 table, and the RLAT template for the phoneme set]
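A minimal sketch of this mapping step, with a table-lookup plus manual fallback for uncovered cases; the table entries below are illustrative assumptions, not the actual contents of IPA-MM7.txt:

```python
# Sketch: map MM7 phonemes to IPA using a lookup table, with a manual
# fallback for symbols the table does not cover (e.g. palatalized phones).
# The mappings below are illustrative assumptions, not the real IPA-MM7.txt.

IPA_MM7 = {
    "M_a": "a",
    "M_h": "h",   # M_h can also correspond to "χ" (see the next slide)
}

# Manual fallbacks for uncovered cases, mapped to their closest neighbors,
# e.g. soft consonants without a palatalized entry in the template.
FALLBACK = {
    "M_Sj": "ç",    # soft Sh
    "M_Zj": "dʒ",   # soft Zh
}

def mm7_to_ipa(phone):
    if phone in IPA_MM7:
        return IPA_MM7[phone]
    if phone in FALLBACK:
        return FALLBACK[phone]
    raise KeyError(f"no IPA mapping for {phone}")
```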


6.2 Phonemes

• In the RLAT phoneme template there is an appropriate palatalized realization for almost every phoneme.
• If there was NO palatalized realization, palatalized phones were mapped to their closest neighbors.
  – Example: "M_Sj [soft Sh] to ç" or "M_Zj [soft Zh] to dʒ"
• MM7 phonemes may be assigned to more than one IPA symbol
  – Example: M_h can be mapped to 2 different phones (sounds): 'h' and 'χ'


6.3 Speech data

The Russian speech data were provided by the GlobalPhone project:
1. Audio files (sound)
2. Transcriptions of these audio files (text)
3. Information about the speakers


6.3 Speech data

The speech data were generated in the following way:
— Text with news content was taken
— Short phrases (prompts) were made from this text
— Each phrase was read by a speaker and recorded; the recorded sound file is called an utterance

At the beginning of our work the prompts were in the Roman alphabet.
— Example: "Gorbachov"
With the help of regular expressions all prompts were transliterated to Cyrillic.
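A minimal sketch of such regex-based transliteration; the rule set below is an illustrative fragment, not the rules actually used, and case handling is simplified (output is lowercase):

```python
import re

# Sketch: Roman-to-Cyrillic transliteration with regular expressions.
# Illustrative, incomplete rule set; multi-letter sequences must be
# replaced before single letters (e.g. "ch" before "c" or "h").
RULES = [
    ("shch", "щ"), ("ch", "ч"), ("sh", "ш"), ("zh", "ж"),
    ("kh", "х"), ("yu", "ю"), ("ya", "я"),
    ("a", "а"), ("b", "б"), ("v", "в"), ("g", "г"),
    ("o", "о"), ("r", "р"), ("h", "х"),
]

def transliterate(text):
    for latin, cyrillic in RULES:
        text = re.sub(latin, cyrillic, text, flags=re.IGNORECASE)
    return text

print(transliterate("gorbachov"))  # -> "горбачов"
```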


6.4 Acoustic model

All speech data were partitioned into 3 groups:
— Training set
— Development set (for tuning our experiments)
— Test set

We used the old classification that was documented in the file RussianSpeakersConfig.0, altogether 123 speakers.
In this file all speakers are listed with flags:
— flag = 0: training set speaker
— flag = 1: test set speaker
— flag = 3: development set speaker


6.4 Acoustic model

Next we created the file speaker-config.txt, which consists of only these lines:

005 033 042 065 078 097 103 106 110 122
002 027 036 063 069 092 102 104 109 112

(that is 20 of the 123 speakers from RussianSpeakersConfig.0)

The numbers indicate speakers and correspond to the speaker configurations:
— 1st line: test speakers (their prompts and utterances will be used for testing)
— 2nd line: development speakers

This file declares how the acoustic model should be trained. It was uploaded in the 1st step of acoustic model building in RLAT.
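A minimal sketch of deriving such a file from the speaker flags, assuming RussianSpeakersConfig.0 stores one "speaker-ID flag" pair per line (the exact file format is an assumption):

```python
# Sketch: build speaker-config.txt from RussianSpeakersConfig.0.
# Assumed format: each line is "<speaker_id> <flag>",
# with flag 0 = training, 1 = test, 3 = development.

def build_speaker_config(src="RussianSpeakersConfig.0",
                         dst="speaker-config.txt"):
    test, dev = [], []
    with open(src) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            speaker_id, flag = parts
            if flag == "1":
                test.append(speaker_id)
            elif flag == "3":
                dev.append(speaker_id)
    with open(dst, "w") as f:
        f.write(" ".join(test) + "\n")  # 1st line: test speakers
        f.write(" ".join(dev) + "\n")   # 2nd line: development speakers
```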


6.5 Language model

What is expected from language models in ASR?
• The possibility to add another, independent information source
• Word disambiguation
• Word reordering
• Search space reduction
  – When the vocabulary has n words, do not consider all n^k possible k-word sequences


6.5 Language model: n-grams

The probability of a word sequence can be decomposed as:

P(W) = P(w_1 w_2 ... w_n) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1 w_2) · ... · P(w_n | w_1 w_2 ... w_n-1)

Computing P(w | history) directly causes big problems, because with a vocabulary of 64,000 words there is a huge number of possible histories.

Solution: replace the whole history by the 1, 2, 3, ... last words:
– unigram: P'(w_k | w_1 w_2 ... w_k-1) = P(w_k)
– bigram:  P'(w_k | w_1 w_2 ... w_k-1) = P(w_k | w_k-1)
– trigram: P'(w_k | w_1 w_2 ... w_k-1) = P(w_k | w_k-2 w_k-1)
– n-gram:  P'(w_k | w_1 w_2 ... w_k-1) = P(w_k | w_k-n+1 ... w_k-1)
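A minimal sketch of maximum-likelihood bigram estimation from counts; real LMs additionally need smoothing, since unseen bigrams here get probability 0:

```python
from collections import Counter

# Sketch: maximum-likelihood bigram estimation,
# P(w_k | w_k-1) = count(w_k-1, w_k) / count(w_k-1). No smoothing.

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return lambda w, prev: (bigrams[(prev, w)] / unigrams[prev]
                            if unigrams[prev] else 0.0)

p = train_bigram([["двенадцать", "часов"], ["двенадцать", "минут"]])
print(p("часов", "двенадцать"))  # 0.5
```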


6.5 Language model: interpolation

Several LMs can be interpolated into one new LM: the interpolated model is a weighted sum of the component models, with weights that sum to 1 (see the sketch below).
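A minimal sketch of linear interpolation; the example weights are the ones used for the final system in section 7:

```python
# Sketch: linear interpolation of language models.
# P_interp(w | h) = sum_i lambda_i * P_i(w | h), with sum_i lambda_i = 1.

def interpolate(models, weights):
    assert abs(sum(weights) - 1.0) < 1e-9
    def p(word, history):
        return sum(lam * m(word, history) for m, lam in zip(models, weights))
    return p

# Example with the weights used for the final system in this work:
# p_final = interpolate([p_base, p_lm1, p_lm2, p_lm5],
#                       [0.37, 0.32, 0.04, 0.27])
```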


6.5 Language model

The first LM was made from a prepared text file (news from between 1990 and 2000).
It was actually created from transliterations of the training set:
— the text of the first LM was also in romanized form.
With the help of regular expressions and our advisor it was transliterated to Cyrillic.

The Word Error Rate (WER) with this LM was 60.00%. It was defined as the baseline.


6.6 Text normalization

Text normalization is a process by which text is transformed in some way to make it consistent:
— Numbers
— Proper nouns
— Abbreviations

It is applied to the text to reduce perplexity with regard to the language model.
— For example:
  12 часов → двенадцать часов ("12 o'clock" → "twelve o'clock")
  BBC → бибиси

Number normalization in Russian: the same number can have different forms depending on case (a small illustration follows below).
— For example: 3 → три, трёх, трём, тремя, третий, третья, третьего, третьей, третьему, третьем, третьих... etc.
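A tiny illustration of why this is hard: the expansion of a digit depends on its grammatical case, so a rule-based normalizer must pick the right form from a paradigm. The table below covers only a few cardinal forms of 3:

```python
# Sketch: Russian number normalization is case-dependent.
# This toy table lists only some cardinal forms of "3";
# the full paradigm (ordinals, gender, number) is much larger.

FORMS_3 = {
    "nominative":   "три",
    "genitive":     "трёх",
    "dative":       "трём",
    "instrumental": "тремя",
}

def normalize_3(case="nominative"):
    return FORMS_3[case]

print("3 часа ->", normalize_3("nominative"), "часа")
```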


6.6 Text normalization: SMT normalization

• Using Statistical Machine Translation (SMT) to "translate" a language into the same language.
• The user normalizes text displayed in a web interface, thereby providing a parallel corpus of normalized and non-normalized text. [Schlippe et al, 2010]
• SMT models are generated to translate non-normalized into normalized text. [Schlippe et al, 2010]
• It can be combined with other kinds of normalization, for example with standard rule-based text normalization during crawling.
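In the standard SMT (noisy-channel) formulation, instantiated here with raw and normalized text of the same language, the decoder searches for

n' = argmax_n P_TM(r | n) · P_LM(n)

where r is the raw (non-normalized) sentence, the translation model P_TM scores the correspondence between raw and normalized text, and the language model P_LM scores the fluency of the normalized output (cf. the next two slides).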


6.6 Text normalization: SMT normalization

[Diagram: SMT normalization pipeline with a language model (LM) and a translation model (TM)]

SMT normalization in RLAT builds a TM from a language into the same language.


6.6 Text normalization: SMT normalization

Translation model
— Word-to-word and phrase-to-phrase mappings
— Word order differs across languages (distortion model)

Language model
— How likely given word sequences are
— Typically n-gram models

Decoder
— Searches for matching word and phrase translations
— Searches for the best sequence of these partial translations


6.7 Crawling

Need MORE data
— "There's no data like more data." If you have more data, you have more word combinations; results will be more solid and more independent.
— "The former experiments indicate that massive text data crawling decreases the OOV rate significantly." [Vu et al, 2010] (The OOV-rate computation is sketched below.)

Need more DIFFERENT data
— LMs that were built from different websites can improve the system's performance [Vu et al, 2010]

Need more CURRENT data
— Current news differs from the news in the text data of the basic language model
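A minimal sketch of the out-of-vocabulary (OOV) rate: the share of test tokens missing from the LM vocabulary:

```python
# Sketch: OOV rate of a test text against an LM vocabulary.
# OOV rate = share of test tokens that are not in the vocabulary.

def oov_rate(test_tokens, vocabulary):
    unknown = sum(1 for tok in test_tokens if tok not in vocabulary)
    return unknown / len(test_tokens)

vocab = {"двенадцать", "часов"}
print(f"{oov_rate(['двенадцать', 'часов', 'минут'], vocab):.1%}")  # 33.3%
```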


6.7 Crawling

Experiment (LS = language-specific rules, SMT = Statistical Machine Translation, OOV = out of vocabulary, PPL = perplexity):

Language Model      Website     Size   LS   SMT   OOV Rate   PPL
Base                Baseline    1.9M              10.35%     4654.21
LM1                 rian.ru     4.2M   x    x     25.59%     3447.27
LM2                 inosmi.ru   792K   x    x     13.39%     5341.18
LM3                 inosmi.ru   920K   x    x     46.84%     2487.77
                    rian.ru     53M               14.23%     4866.73
                    rian.ru     57M    x          13.38%     4644.32
LM5                 rian.ru     58M    x    x     13.51%     4809.67
LM4                 vesti.ru    13M    x    x     23.53%     4639.72
Last interpolated                                 7.9%       3325


7. Results

Baseline Eval (Test): 58.17%
----------------------------------------
Results on the dev set:
Baseline: 60.00%
Base + LM1: 52.05%
Base + LM1 + LM2: 49.79%
Base + LM1 + LM2 + LM3: 49.79%
Base + LM1 + LM2 + LM4: 49.82%
Base*0.37 + LM1*0.32 + LM2*0.04 + LM5*0.27: 44.78%
-----------------------------------------
LM1: inosmi.ru, 792K
LM2: rian.ru, 4.2M
LM3: inosmi.ru, 920K
LM4: vesti.ru, 13M
LM5: rian.ru, 58M


7. Results

LM                            WER, %
Baseline Eval (Test)          58.17
Baseline Dev                  60.00
Base + LM1 Dev                52.05
Base + LM1 + LM2 Dev          49.79
Base + LM1 + LM2 + LM3 Dev    49.79
Base + LM1 + LM2 + LM4 Dev    49.82
Base + LM1 + LM2 + LM5 Dev    44.78

[Bar chart: the resulting WERs per set, as listed above]

Weights for the last interpolation: Base*0.37 + LM1*0.32 + LM2*0.04 + LM5*0.27


6.7 Crawling: Experiment

The same website, 3 crawls:

Website   Size   LS   SMT   OOV Rate   PPL       WER, % (Base + LM)
rian.ru   53M               14.23%     4866.73   84.92
rian.ru   57M    x          13.38%     4644.32   47.71
rian.ru   58M    x    x     13.51%     4809.67   47.69

Interpolation weights: 0.51*Base + 0.49*LM

[Bar chart: WER for the three crawls: clear, rule-based, rule-based + SMT]


Fixed bugs

● Some normalization scripts for Russian were not included in the main normalization module (file crawl.sh). This was fixed and tested.
● There were errors in the number normalization for Russian (numbnorm_Russain.pl). They were fixed and used, but not tested separately.
  — For the numbers 15, 16, 17, 18, 19 and 50, 60, 70, 80, rules were partially reversed, and in these words the letter "ь" was wrong or missing.
  — In some cases the endings for the numbers x00., x000., etc. were added incorrectly, e.g. "тысячного ого" instead of "тысячного".


Suggestions

● Better Russian text normalization scripts, including better number normalization
● Most experiments were made with rather small data sets, but it is noticeable that the more text data we crawl, the better the results
  — Crawling more text data can improve the recognition quality for Russian
  — Use more different sources for crawling


Thanks for your interest!


References

[Vu et al, 2010] Ngoc Thang Vu, Tim Schlippe, Franziska Kraus, and Tanja Schultz: Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit (2010)
[Schlippe et al, 2010] Tim Schlippe, Chenfei Zhu, Jan Gebhardt, and Tanja Schultz: Text Normalization based on Statistical Machine Translation and Internet User Support (2010)
[Schultz et al, 2010] Tanja Schultz, Tim Schlippe: Language Modeling (MMMK lecture, SS 2010)
[Metze, 2010] Florian Metze: Multilingual Speech Processing (lecture, 2010)
[Vogel, 2011] Stephan Vogel: Statistical Machine Translation: A Gentle Introduction (lecture, 2011)
