Slide 1

Development of an automatic speech recognition system for Russian language with RLAT

Alexey Mokin 

Mykola Volovyk 

Advisor: Tim Schlippe 

09-02-2011 

1

Content 

1. Goal of the practical course
2. Automatic Speech Recognition (ASR)
    1. Advantages
    2. Disadvantages
    3. Basics
3. Rapid Language Adaptation Toolkit (RLAT)
4. Russian language
5. Previous results
6. ASR building steps
    1. Pronunciation dictionary
    2. Phonemes
    3. Speech data
    4. Acoustic model
    5. Language model
    6. Text normalization
    7. Crawling
7. Results

2

1. Goal 

• Build a speech recognizer for the Russian language with the Rapid Language Adaptation Toolkit (RLAT)

– Able to recognize read speech

– Newspaper domain

– Initial recognition might be extended to broadcast transcription

3

2.1 Automatic speech recognition (ASR): advantages 

• Natural way of communication for human beings 

– No practice necessary for users, i.e. speech does not require any teaching, as opposed to reading/writing

– High bandwidth (speaking is faster than typing) 

• Hands and eyes are free for other tasks 

– Works in the car / on the run / in the dark 

• Mobility (microphones are smaller than keyboards) 

• Some communication channels (e.g. phone) are designed for speech

[Schultz et al, 2010] 

4

2.2 Automatic speech recognition (ASR): disadvantages 

• Unusable where silence/confidentiality is required (meetings, library, spoken access codes)

• Still unsatisfactory recognition rate when: 

– Environment is very noisy (party, restaurant, train) 

– Unknown or unlimited domains 

– Uncooperative speakers (whisper, mumble, …) 

– Problems with accents, dialects, code-switching 

• Cultural factors (e.g. collectivism, uncertainty avoidance) 

• Speech input is still more expensive than keyboard 


5

2.3 Automatic speech recognition (ASR): Basics 


[Diagram with the inputs: speech data (with transcription), pronunciation rules, text data] [Metze, 2010]

6


The main formula of speech recognition: 

• Given acoustic data $A = a_1, a_2, \ldots, a_T$

• Find the word sequence $W' = w'_1, w'_2, \ldots, w'_n$ with the highest likelihood

• Such that $P(W \mid A)$ is maximized

Search problem can be formulated as: 
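A standard way to write this search problem, assuming the usual Bayes decomposition into an acoustic model $P(A \mid W)$ and a language model $P(W)$ (the symbols follow the bullets above):

```latex
\begin{align}
W' &= \arg\max_{W} P(W \mid A) \\
   &= \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)} \\
   &= \arg\max_{W} P(A \mid W)\, P(W)
\end{align}
```

The last step drops $P(A)$ because it does not depend on $W$.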

7


How does ASR work? 

Two-stage process for statistical-based ASR: 

1. Train statistical model (often: Maximum Likelihood approach) 

2. Test on unknown data 

Output is most likely hypothesis according to internal model 

8


• RLAT stands for Rapid Language Adaptation Toolkit and is a continuation of the SPICE system, which was developed between 2004 and 2007.

• RLAT builds on the existing GlobalPhone and FestVox projects. Knowledge and data, such as phoneme sets, pronunciation dictionaries, acoustic models, and text resources, are shared between recognition and synthesis.

• Goals:

• Bridge the gap between technology experts and language experts

• Develop web-based intelligent systems

– Interactive learning with user in the loop

– Rapid adaptation of universal models to unseen languages

• RLAT web page: http://csl.ira.uka.de/rlat-dev 

9 

[Schultz et al, 2010]


RLAT collects: 

— Text & Audio data 

RLAT defines: 

— Phoneme set 

— Rich prompt set 

— Lexical pronunciations 

RLAT produces: 

— Pronunciation model 

— ASR acoustic model 

— ASR language model 

— TTS voice 

RLAT maintains: 

— Projects and users login 

— Data and models 

10 

[Schultz et al, 2010]

4. Russian language 

• Russian is a Slavic language in the Indo-European family.

• From the point of view of the spoken language, its closest relatives are Ukrainian and Belarusian.

11 

[Wikipedia]

4. Russian language: difficulties 

Rich morphology → need MORE data

— A set of prefixes, prepositional and adverbial in nature; diminutive, augmentative, and frequentative suffixes and infixes

— Six cases in two numbers (singular and plural)

— Up to ten additional cases are identified in linguistics textbooks

— Strictly enforced grammatical gender (masculine, feminine and neuter)

Some derived language varieties (sometimes in articles and always in comments) → need more DIFFERENT text data

— More sources improve training quality, because they cover more rules and a larger dictionary

Cyrillic alphabet → encoding and transliteration during LM training

12


Ngoc Thang Vu, Tim Schlippe, Franziska Kraus, Tanja Schultz: Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit (2010)

13


WER [%] for five Eastern European Languages

Language / LM            BG     HR     CZ     PL     RU
+ add. websites (dev)    20.4   30.5   26.5   27.2   41.0
+ training utts (dev)    20.0   28.9   25.3   24.3   40.3
+ training utts (eval)   16.9   32.8   24.8   22.3   36.6
+ 500K dict (eval)       17.6   33.5   23.5   20.4   36.2

«The best systems give WERs of 16.9% for BG, 32.8% for HR, 23.5% for CZ, 20.4% for PL and 36.2% for RU on the evaluation set» [Schlippe et al, 2010]

14

6. ASR building steps 

1. Pronunciation dictionary 

2. Phonemes 

3. Speech data 

4. Acoustic model 

5. Language model 

6. Text normalization 

7. Crawling 

15

6.1 Pronunciation dictionary 

• By “pronunciation dictionary” we mean the dictionary that specifies how the words of our language should be pronounced.

• The dictionary can be in Janus or Festival format

• Here: Janus dictionary

‒ Each entry has the following form:

 

A Janus dictionary for Russian had already been built in the GlobalPhone project

16

6.2 Phonemes 

• The International Phonetic Alphabet (IPA) is an alphabetic system of phonetic notation

• MM7 is an acoustic model inventory that was trained earlier on seven GlobalPhone languages

• To build a system in a new language, we need an initial state alignment

• This alignment is produced by selecting the closest match between the GlobalPhone inventory MM7 and the IPA-based phones.

17

6.2 Phonemes 

• The file phonemeSetRussian.txt was extracted from the dictionary and lists the phonemes in MM7 format.

• The association between the MM7 format and the IPA format is provided by IPA-MM7.txt

– but it does not cover all cases

• So the mapping can be done according to the following scheme:

18

6.2 Phonemes 

[Diagram: mapping scheme linking the dictionary with MM7 phoneme format, IPA-MM7, and the RLAT template for the phoneme set]

19

6.2 Phonemes 

• The RLAT phoneme template offers an appropriate palatalized realization for almost every phoneme.

• If there was NO palatalized realization, palatalized phones were mapped to their neighbors.

– Example: "M_Sj [soft Sh] to ç" or "M_Zj [soft Zh] to dʒ"

• An MM7 phoneme may be assigned to more than one IPA symbol

– Example: M_h can be mapped to 2 different phones (sounds): 'h' and 'χ'
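The mapping rules above can be sketched in a few lines of Python. This is only an illustration: the table contents are a hypothetical excerpt (the `M_a` entry is invented for the example; only `M_h`, `M_Sj` and `M_Zj` come from the slides), and the real IPA-MM7.txt format is not shown here.

```python
# Hypothetical excerpt of an MM7 -> IPA table.
# Only M_h is taken from the slides; M_a is an invented placeholder entry.
IPA_MM7 = {
    "M_a": ["a"],
    "M_h": ["h", "χ"],  # one MM7 phoneme, two candidate IPA phones
}

# Palatalized phones without a palatalized entry in the template
# are mapped to a neighboring phone (examples from the slides).
NEIGHBOR_FALLBACK = {
    "M_Sj": "ç",   # soft Sh
    "M_Zj": "dʒ",  # soft Zh
}


def map_phoneme(mm7: str) -> list[str]:
    """Return the candidate IPA symbols for an MM7 phoneme, or [] if unmapped."""
    if mm7 in IPA_MM7:
        return IPA_MM7[mm7]
    if mm7 in NEIGHBOR_FALLBACK:
        return [NEIGHBOR_FALLBACK[mm7]]
    return []


def unmapped(phoneme_set: list[str]) -> list[str]:
    """List the phonemes that still need a manual decision."""
    return [p for p in phoneme_set if not map_phoneme(p)]
```

Running `unmapped` over the phonemes from phonemeSetRussian.txt would show which cases the table does not cover.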

20

6.3 Speech data 

The Russian speech data were provided by the GlobalPhone project:

1. Audio files (Sound) 

2. Transcriptions to these audio files (Text) 

3. Information about speakers 

21

6.3 Speech data 

Speech data were generated in the following way: 

Text with news content was taken

Short phrases (prompts) were made from this text

Each phrase was read by a user and recorded. We call the recorded sound file an utterance

At the beginning of the work the prompts were in the Roman alphabet.

— Example: „Gorbachov“

With the help of regular expressions all prompts were transliterated to Cyrillic
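A minimal sketch of regex-based transliteration, assuming a simple letter table. The table below is hypothetical and incomplete; it is not the romanization scheme actually used for the GlobalPhone prompts.

```python
import re

# Hypothetical, incomplete romanization table (multi-letter units first).
RULES = [
    ("shch", "щ"), ("zh", "ж"), ("ch", "ч"), ("sh", "ш"),
    ("a", "а"), ("b", "б"), ("v", "в"), ("g", "г"), ("d", "д"),
    ("e", "е"), ("z", "з"), ("i", "и"), ("k", "к"), ("l", "л"),
    ("m", "м"), ("n", "н"), ("o", "о"), ("p", "п"), ("r", "р"),
    ("s", "с"), ("t", "т"), ("u", "у"), ("f", "ф"), ("h", "х"),
]
TABLE = dict(RULES)

# Longest alternatives come first so that "ch" is tried before "h".
PATTERN = re.compile(
    "|".join(re.escape(k) for k, _ in sorted(RULES, key=lambda r: -len(r[0]))),
    re.IGNORECASE,
)


def _replace(match: re.Match) -> str:
    roman = match.group(0)
    cyrillic = TABLE[roman.lower()]
    # Keep the capitalization of the first Roman letter.
    return cyrillic.capitalize() if roman[0].isupper() else cyrillic


def transliterate(text: str) -> str:
    """Replace every known Roman unit; unknown characters stay unchanged."""
    return PATTERN.sub(_replace, text)
```

With this table, `transliterate("Gorbachov")` gives "Горбачов".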

22

6.4 Acoustic model 

All speech data were partitioned into 3 groups:

— Training set

— Development set (tuning for our experiments)

— Test set

We used the existing partition that was documented in the file RussianSpeakersConfig.0, 123 speakers altogether

This file lists all speakers with flags

— flag = 0 – training set speaker

— flag = 1 – test set speaker

— flag = 3 – development set speaker

23

6.4 Acoustic model 

Next we created the file speaker-config.txt, which consists of only two lines:

005 033 042 065 078 097 103 106 110 122

002 027 036 063 069 092 102 104 109 112

(that is 20 of the 123 speakers from RussianSpeakersConfig.0)

The numbers identify speakers and correspond to the speaker configurations

— 1st line – test speakers (their prompts and utterances will be used for testing)

— 2nd line – development speakers

This file declares how the acoustic model should be trained

This file was uploaded in the 1st step of acoustic model building in RLAT
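A sketch of how the two lines of speaker-config.txt could be derived from the flags. It assumes that each line of the flag file has the form `<speaker id> <flag>`; the actual layout of RussianSpeakersConfig.0 is not shown in these slides, so the input format and the function name are hypothetical.

```python
def build_speaker_config(lines):
    """Return two strings: the test speakers, then the development speakers.

    Assumed input: one "<speaker id> <flag>" pair per line (hypothetical format).
    """
    test, dev = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        speaker, flag = parts
        if flag == "1":
            test.append(speaker)
        elif flag == "3":
            dev.append(speaker)
        # flag 0 (training) speakers are not written to the file
    return [" ".join(test), " ".join(dev)]
```

Usage: `build_speaker_config(open("RussianSpeakersConfig.0"))` would return the two lines to write to speaker-config.txt, under the format assumption above.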

24

6.5 Language model 

What is expected from Language Models in ASR? 

• Add another independent information source

• Word disambiguation

• Word reordering

• Search space reduction

– When the vocabulary has n words, do not consider all $n^k$ possible k-word sequences

25

6.5 Language model: n-grams 

The probability of a word sequence can be decomposed as:

$P(W) = P(w_1 w_2 \ldots w_n) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1})$

Computing $P(w \mid \text{history})$ is a big problem, because with a vocabulary of 64,000 words there is a huge number of possible histories.

Solution: replace the “whole history” by the last 1, 2, 3, … words.

– unigram: $P'(w_k \mid w_1 w_2 \ldots w_{k-1}) = P(w_k)$

– bigram: $P'(w_k \mid w_1 w_2 \ldots w_{k-1}) = P(w_k \mid w_{k-1})$

– trigram: $P'(w_k \mid w_1 w_2 \ldots w_{k-1}) = P(w_k \mid w_{k-2} w_{k-1})$

– n-gram: $P'(w_k \mid w_1 w_2 \ldots w_{k-1}) = P(w_k \mid w_{k-(n-1)} w_{k-(n-2)} \ldots w_{k-1})$
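As an illustration of the bigram case, here is a minimal maximum-likelihood estimate from counts. It is a sketch only: it uses no smoothing, and the sentence markers `<s>` and `</s>` are a common convention assumed here, not something stated in the slides.

```python
from collections import Counter


def train_bigram(sentences):
    """Return a function prob(word, prev) = count(prev word) / count(prev)."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(words[:-1])           # every word that has a successor
        bigram_counts.update(zip(words, words[1:]))  # adjacent word pairs

    def prob(word, prev):
        if history_counts[prev] == 0:
            return 0.0  # unseen history; a real LM would back off or smooth
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob
```

A real language model would add smoothing or backoff so that unseen word pairs do not get probability zero.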

26

6.5 Language model: interpolation 

Several LMs can be interpolated into one new LM
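A sketch of interpolation, assuming the linear scheme in which the component probabilities are combined with weights that sum to 1. The function names are hypothetical; each model is taken to be a callable `model(word, history)` that returns a probability.

```python
def interpolate(models, weights):
    """Combine LMs linearly: P(w | h) = sum_i weight_i * P_i(w | h)."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")

    def prob(word, history):
        return sum(w * m(word, history) for m, w in zip(models, weights))

    return prob
```

Weights such as 0.37, 0.32, 0.04 and 0.27 for four component models satisfy the sum-to-one check.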

27

6.5 Language model 

The first LM was built from a prepared LM file (text from between 1990 and 2000)

It was actually created from transliterations of the training set

— the text of the first LM was also in romanized form.

With the help of regular expressions and our advisor it was transliterated to Cyrillic

The Word Error Rate (WER) for this LM was 60.00%. It was defined as the baseline.
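For reference, WER is commonly computed as the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below assumes whitespace tokenization and equal costs for substitutions, deletions and insertions; it is not the scoring tool used in RLAT.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, based on Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion: reference word missing
                cur[j - 1] + 1,          # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100%.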

28

6.6 Text normalization 

Text normalization is a process by which text is transformed in some way to make it consistent

— Numbers

— Proper nouns

— Abbreviations

It is applied to the text to reduce the perplexity of the language model

— For example:

12 часов → двенадцать часов (“12 hours” → “twelve hours”)

BBC → бибиси

Number normalization in Russian: the same number can have different forms depending on case

— For example: 3 → три, трёх, трём, тремя, третий, третья, третьего, третьей, третьему, третьей, третьем, третьей, третьих, etc.
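A toy sketch of rule-based normalization by table lookup. The tables are hypothetical excerpts built from the examples above, and the sketch always emits the nominative form; choosing the correct case form from context is exactly the hard part described above and is not attempted here.

```python
import re

# Hypothetical excerpts; nominative cardinals only.
CARDINALS = {"3": "три", "12": "двенадцать"}
ABBREVIATIONS = {"BBC": "бибиси"}


def normalize(text: str) -> str:
    """Replace known numbers and abbreviations; leave other tokens as they are."""
    def repl(match: re.Match) -> str:
        token = match.group(0)
        return CARDINALS.get(token) or ABBREVIATIONS.get(token) or token

    # \w+ matches Latin, Cyrillic and digit tokens in Python 3.
    return re.sub(r"\w+", repl, text)
```

For example, `normalize("12 часов")` yields "двенадцать часов", while numbers missing from the table are left untouched.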

29

6.6 Text normalization: SMT normalization 

• Uses Statistical Machine Translation (SMT) to translate from a language into the same language.

• The user normalizes text displayed in a web interface, thereby providing a parallel corpus of normalized and non-normalized text. [Schlippe et al, 2010]

• SMT models are generated to translate non-normalized into normalized text. [Schlippe et al, 2010]

• It can be combined with other kinds of normalization, for example with standard rule-based text normalization during crawling

30


[Diagram: SMT system with a language model (LM) and a translation model (TM)]

SMT normalization in RLAT builds a TM from a language into the same language.

31


Translation model 

— Word-to-word and phrase-to-phrase mapping 

— Word order different across languages (distortion model) 

Language model 

— How likely are given word sequences 

— Typically n-gram models 

Decoder 

— Search matching word and phrase translations 

— Search for best sequence of these partial translations 

32

6.7 Crawling 

need MORE data

— “There's no data like more data”. If you have more data, you have more word combinations. Results will be more solid, more independent

— “The former experiments indicate that massive text data crawling decreases the OOV rate significantly.” [Vu et al, 2010]

need more DIFFERENT data

— LMs that were built from different websites can improve the system’s performance [Vu et al, 2010]

need more CURRENT data

— Current news differs from the news in the text data of the baseline language model
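The OOV rate mentioned above can be computed as the share of test tokens that are missing from the vocabulary. This sketch assumes the token-based definition (every occurrence counts); the slides do not say which variant RLAT reports.

```python
def oov_rate(vocabulary, test_words):
    """Percentage of test tokens that are not in the vocabulary."""
    if not test_words:
        raise ValueError("test_words must not be empty")
    vocab = set(vocabulary)
    misses = sum(1 for word in test_words if word not in vocab)
    return 100.0 * misses / len(test_words)
```

Adding crawled text to the vocabulary can only keep this number the same or lower it for a fixed test set.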

33

6.7 Crawling 

Experiment

LM                 Website    Size  LS  SMT  OOV rate  PPL
Base               Baseline   1.9M           10.35%    4654.21
LM1                rian.ru    4.2M  x   x    25.59%    3447.27
LM2                inosmi.ru  792K  x   x    13.39%    5341.18
LM3                inosmi.ru  920K  x   x    46.84%    2487.77
                   rian.ru    53M            14.23%    4866.73
                   rian.ru    57M   x        13.38%    4644.32
LM5                rian.ru    58M   x   x    13.51%    4809.67
LM4                vesti.ru   13M   x   x    23.53%    4639.72
Last interpolated                            7.9%      3325

LS: language-specific rules; SMT: Statistical Machine Translation; OOV: out of vocabulary; PPL: perplexity

34

7. Results 

Baseline Eval(Test): 58.17% 

---------------------------------------- 

On Dev Set:

Baseline: 60.00%

Base + LM1: 52.05%

Base + LM1 + LM2: 49.79%

Base + LM1 + LM2 + LM3: 49.79%

Base + LM1 + LM2 + LM4: 49.82%

Base*0.37 + LM1*0.32 + LM2*0.04 + LM5*0.27: 44.78%

----------------------------------------- 

LM1: inosmi.ru 792K 

LM2: rian.ru: 4.2M 

LM3: inosmi.ru: 920K 

LM4: vesti.ru 13M 

LM5: rian.ru 58M 

35

7. Results 

LM                             WER, %
Baseline Eval(Test)            58.17
Baseline Dev                   60.00
Base + LM1 Dev                 52.05
Base + LM1 + LM2 Dev           49.79
Base + LM1 + LM2 + LM3 Dev     49.79
Base + LM1 + LM2 + LM4 Dev     49.82
Base + LM1 + LM2 + LM5 Dev     44.78

[Bar chart “Result WERs”: WER, % (0 to 70) per set: Baseline Eval(Test), Baseline Dev, Base + LM1, Base + LM1 + LM2, + LM3, + LM4, + LM5]

36

Weights for the last interpolation: Base*0.37 + LM1*0.32 + LM2*0.04 + LM5*0.27

6.7 Crawling: Experiment 

The same website, 3 crawls

Website  Size  LS  SMT  OOV rate  PPL      WER, % (Base + LM)
rian.ru  53M            14.23%    4866.73  84.92
rian.ru  57M   x        13.38%    4644.32  47.71
rian.ru  58M   x   x    13.51%    4809.67  47.69

Interpolation weights: 0.51*Base + 0.49*LM

[Bar chart “Word Error Rate”: WER, % (0 to 90) for Clear, Rule-based, Rule-based + SMT]

37

Fixed bugs 

● Some normalization scripts for Russian were not included in the main normalization module (file crawl.sh). This was fixed and tested.

● There were errors in the number normalization for Russian. They were fixed and the fix was used, but it was not tested separately.

— For the numbers 15, 16, 17, 18, 19 and 50, 60, 70, 80 the rules were partially reversed, so the letter "ь" was wrongly inserted (or missing) in these words.

— In some cases the endings for the numbers x00., x000. etc. were added incorrectly, e.g. "тысячного ого" instead of "тысячного".

numbnorm_Russain.pl 

38

Suggestions 

● Better Russian text normalization scripts, including better number normalization

● Most of the experiments were made with data sets that were not very large, but it is noticeable that the more text data we crawl, the better the results

— Crawling more text data can improve the recognition quality for the Russian language

— More different sources for crawling

39

Thanks for your interest! 

40

References 

[Vu et al, 2010] Ngoc Thang Vu, Tim Schlippe, Franziska Kraus, and Tanja Schultz, Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit (2010)

[Schlippe et al, 2010] Tim Schlippe, Chenfei Zhu, Jan Gebhardt, and Tanja Schultz, Text Normalization based on Statistical Machine Translation and Internet User Support (2010)

[Schultz et al, 2010] Tanja Schultz, Tim Schlippe, Language Modeling (MMMK-lecture SS2010)

[Metze, 2010] Florian Metze, Multilingual Speech Processing (lecture, 2010)

[Vogel, 2011] Stephan Vogel, Statistical Machine Translation: A Gentle Introduction (lecture, 2011)

41
