Development of<br />
an automatic speech<br />
recognition system for Russian<br />
language with RLAT<br />
Alexey Mokin<br />
Mykola Volovyk<br />
Advisor: Tim Schlippe<br />
09-02-2011<br />
1
Content<br />
1. Goal of the practical course<br />
2. Automatic Speech Recognition (ASR)<br />
   1. Advantages<br />
   2. Disadvantages<br />
   3. Basics<br />
3. Rapid Language Adaptation Toolkit (RLAT)<br />
4. Russian language<br />
5. Previous results<br />
6. ASR building steps<br />
   1. Pronunciation dictionary<br />
   2. Phonemes<br />
   3. Speech data<br />
   4. Acoustic model<br />
   5. Language model<br />
   6. Text normalization<br />
   7. Crawling<br />
7. Results<br />
2
1. Goal<br />
• Build a speech recognizer for the Russian language with the Rapid Language<br />
Adaptation Toolkit (RLAT)<br />
– Able to recognize read speech<br />
– Newspaper domain<br />
– The initial recognizer might later be extended to broadcast transcription<br />
3
2.1 Automatic speech recognition (ASR): advantages<br />
• Natural way of communication for human beings<br />
– No training necessary for users: speaking, unlike reading and<br />
writing, does not have to be taught<br />
– High bandwidth (speaking is faster than typing)<br />
• Hands and eyes are free for other tasks<br />
– Works in the car / on the run / in the dark<br />
• Mobility (microphones are smaller than keyboards)<br />
• Some communication channels (e.g. phone) are designed for speech<br />
[Schultz et al, 2010]<br />
4
2.2 Automatic speech recognition (ASR): disadvantages<br />
• Unusable where silence/confidentiality is required<br />
(meetings, library, spoken access codes)<br />
• Still unsatisfactory recognition rate when:<br />
– Environment is very noisy (party, restaurant, train)<br />
– Unknown or unlimited domains<br />
– Uncooperative speakers (whisper, mumble, …)<br />
– Problems with accents, dialects, code-switching<br />
• Cultural factors (e.g. collectivism, uncertainty avoidance)<br />
• Speech input is still more expensive than keyboard<br />
[Schultz et al, 2010]<br />
5
2.3 Automatic speech recognition (ASR): Basics<br />
[Figure: components of an ASR system. Labelled inputs: speech data (with transcription), pronunciation rules, text data]<br />
[Schultz et al, 2010]<br />
[Metze, 2010]<br />
6
2.4 Automatic speech recognition (ASR): Basics<br />
The main formula of speech recognition:<br />
• Given acoustic data $A = a_1, a_2, \dots, a_T$<br />
• Find the word sequence $W' = w'_1, w'_2, \dots, w'_n$ with the highest likelihood,<br />
i.e. the one for which $P(W \mid A)$ is maximized<br />
The search problem can be formulated as:<br />
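In its textbook form, which follows from Bayes' rule, the search problem can be written as below. This is the standard formulation in the notation of this slide, not a copy of the original formula; $P(A \mid W)$ is the acoustic model and $P(W)$ the language model.

```latex
W' = \operatorname*{arg\,max}_{W} P(W \mid A)
   = \operatorname*{arg\,max}_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
   = \operatorname*{arg\,max}_{W} P(A \mid W)\, P(W)
```

The denominator $P(A)$ can be dropped because it does not depend on the word sequence $W$.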
7
2.4 Automatic speech recognition (ASR): Basics<br />
How does ASR work?<br />
Two-stage process for statistical ASR:<br />
1. Train a statistical model (often with a maximum likelihood approach)<br />
2. Test on unseen data<br />
The output is the most likely hypothesis according to the internal model<br />
8
3. Rapid Language Adaptation Toolkit (RLAT)<br />
• RLAT stands for Rapid Language Adaptation Toolkit and is a<br />
continuation of the SPICE system, which was developed<br />
between 2004 and 2007.<br />
• RLAT builds on the existing GlobalPhone and FestVox projects.<br />
Knowledge and data such as phoneme sets, pronunciation dictionaries,<br />
acoustic models, and text resources are shared between recognition<br />
and synthesis.<br />
• Goals:<br />
• Bridge the gap between technology experts and language<br />
experts<br />
• Develop web-based intelligent systems<br />
– Interactive learning with user in the loop<br />
– Rapid Adaptation of universal models to unseen<br />
languages<br />
• RLAT web page: http://csl.ira.uka.de/rlat-dev<br />
9<br />
[Schultz et al, 2010]
3. Rapid Language Adaptation Toolkit (RLAT)<br />
RLAT collects:<br />
— Text & Audio data<br />
RLAT defines:<br />
— Phoneme set<br />
— Rich prompt set<br />
— Lexical pronunciations<br />
RLAT produces:<br />
— Pronunciation model<br />
— ASR acoustic model<br />
— ASR language model<br />
— TTS voice<br />
RLAT maintains:<br />
— Projects and user logins<br />
— Data and models<br />
10<br />
[Schultz et al, 2010]
4. Russian language<br />
• Russian is a Slavic language in the Indo-European<br />
family.<br />
• From the point of view of the spoken language, its closest<br />
relatives are Ukrainian and Belarusian.<br />
11<br />
[Wikipedia]
4. Russian language: difficulties<br />
Rich morphology → need MORE data<br />
— Set of prefixes, prepositional and adverbial in nature; diminutive,<br />
augmentative, and frequentative suffixes and infixes<br />
— Six cases in two numbers (singular and plural)<br />
— Up to ten additional cases are identified in linguistics textbooks<br />
— Strictly observed grammatical gender (masculine, feminine and<br />
neuter)<br />
Some derived language varieties (sometimes in articles and always in<br />
comments) → need more DIFFERENT text data<br />
— More sources improve training quality, because they cover more rules and a larger<br />
dictionary<br />
Cyrillic alphabet → encoding and transliteration needed during LM<br />
training<br />
12
5. Previous results<br />
Ngoc Thang Vu, Tim Schlippe, Franziska Kraus, Tanja Schultz<br />
Rapid Bootstrapping of five Eastern European Languages using the<br />
Rapid Language Adaptation Toolkit (2010)<br />
13
5. Previous results<br />
WER [%] for five Eastern European languages<br />
Language / LM            BG    HR    CZ    PL    RU<br />
+ add. websites (dev)    20.4  30.5  26.5  27.2  41.0<br />
+ training utts (dev)    20.0  28.9  25.3  24.3  40.3<br />
+ training utts (eval)   16.9  32.8  24.8  22.3  36.6<br />
+ 500K dict (eval)       17.6  33.5  23.5  20.4  36.2<br />
«The best systems give WERs of 16.9% for BG, 32.8%<br />
for HR, 23.5% for CZ, 20.4% for PL and 36.2% for RU<br />
on the evaluation set» [Vu et al, 2010]<br />
14
6. ASR building steps<br />
1. Pronunciation dictionary<br />
2. Phonemes<br />
3. Speech data<br />
4. Acoustic model<br />
5. Language model<br />
6. Text normalization<br />
7. Crawling<br />
15
6.1 Pronunciation dictionary<br />
• By "pronunciation dictionary" we mean the<br />
dictionary that specifies how the words of our language<br />
should be pronounced.<br />
• The dictionary can have Janus or Festival format<br />
• Here: Janus dictionary<br />
‒ Each entry has the following form:<br />
<br />
A Janus dictionary for Russian had already been built in the<br />
GlobalPhone project<br />
16
6.2 Phonemes<br />
• The International Phonetic Alphabet (IPA) is an alphabetic<br />
system of phonetic notation<br />
• MM7 is an acoustic model inventory that was trained earlier on<br />
seven GlobalPhone languages<br />
• To build a system for a new language, we need an initial state<br />
alignment<br />
• This alignment is produced by selecting the closest match<br />
between the GlobalPhone inventory MM7 and the IPA-based phones.<br />
17
6.2 Phonemes<br />
• The file phonemeSetRussian.txt was extracted from the<br />
dictionary and lists the phonemes in MM7 format.<br />
• The association between the MM7 format and the IPA format<br />
is given by IPA-MM7.txt<br />
– but it does not cover all cases<br />
• So the mapping can be done according to the following scheme:<br />
18
6.2 Phonemes<br />
[Figure: mapping scheme with the elements: dictionary with MM7 phoneme format, IPA-MM7, RLAT template for phoneme set]<br />
19
6.2 Phonemes<br />
• The RLAT phoneme template provides an appropriate palatalized<br />
realization for almost every phoneme.<br />
• Where there was NO palatalized realization, palatalized<br />
phones were mapped to their neighbors.<br />
– Example: "M_Sj [soft Sh] to ç" or "M_Zj [soft Zh] to dʒ"<br />
• MM7 phonemes may be assigned to more than one IPA<br />
symbol<br />
– Example: M_h can be mapped to 2 different phones (sounds): 'h'<br />
and 'χ'<br />
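The mapping rules above can be sketched as a small lookup with a fallback. This is only an illustration: the table contents other than M_h, M_Sj and M_Zj are assumed, the function names are hypothetical, and the real IPA-MM7.txt format is not shown on the slides.

```python
# Hypothetical excerpt of an MM7 -> IPA table. Only M_h, M_Sj and M_Zj come
# from the slide; the M_a entry is an assumed illustration.
MM7_TO_IPA = {
    "M_a": ["a"],        # assumed entry, for illustration only
    "M_h": ["h", "χ"],   # one MM7 phoneme, two candidate IPA symbols
}

# Fallback used when the RLAT template has no palatalized realization:
# the palatalized phone is mapped to a neighboring phone.
NEIGHBOR_FALLBACK = {
    "M_Sj": "ç",    # soft Sh
    "M_Zj": "dʒ",   # soft Zh
}


def map_phoneme(mm7: str) -> list[str]:
    """Return the candidate IPA symbols for one MM7 phoneme."""
    if mm7 in MM7_TO_IPA:
        return MM7_TO_IPA[mm7]
    if mm7 in NEIGHBOR_FALLBACK:
        return [NEIGHBOR_FALLBACK[mm7]]
    raise KeyError(f"no IPA mapping for {mm7}")


def unmapped(phoneme_set: list[str]) -> list[str]:
    """List the phonemes of a phoneme set that still need a manual mapping."""
    return [
        p for p in phoneme_set
        if p not in MM7_TO_IPA and p not in NEIGHBOR_FALLBACK
    ]
```

Running `unmapped` over the extracted phoneme set shows which cases the automatic association does not cover and therefore have to be mapped by hand.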
20
6.3 Speech data<br />
The Russian speech data were provided by the GlobalPhone<br />
project:<br />
1. Audio files (Sound)<br />
2. Transcriptions of these audio files (Text)<br />
3. Information about speakers<br />
21
6.3 Speech data<br />
Speech data were generated in the following way:<br />
Text with news content was selected<br />
Short phrases (prompts) were extracted from this text<br />
Each phrase was read by a speaker and recorded. We call the recorded sound<br />
file an utterance<br />
At the beginning of the work the prompts were in the Roman<br />
alphabet.<br />
— Example: "Gorbachov"<br />
With the help of regular expressions all prompts were<br />
transliterated to Cyrillic<br />
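A regex-based transliteration of this kind might look like the sketch below. The romanization table is hypothetical and incomplete, since the scheme used for the GlobalPhone prompts is not given here; the point is only that multi-letter sequences must be tried before single letters.

```python
import re

# Hypothetical romanization table (assumption: the real GlobalPhone scheme
# may differ). Multi-letter sequences are included so that they can take
# priority over single letters.
ROMAN_TO_CYRILLIC = {
    "shch": "щ", "zh": "ж", "ch": "ч", "sh": "ш", "ya": "я", "yu": "ю",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е", "z": "з",
    "i": "и", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о", "p": "п",
    "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф",
}

# Longest alternatives first, so that "ch" is tried before any single letter.
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, ROMAN_TO_CYRILLIC), key=len, reverse=True)),
    flags=re.IGNORECASE,
)


def _replace(match: re.Match) -> str:
    roman = match.group(0)
    cyrillic = ROMAN_TO_CYRILLIC[roman.lower()]
    # Keep the capitalization of the first letter.
    return cyrillic.capitalize() if roman[0].isupper() else cyrillic


def transliterate(text: str) -> str:
    """Replace every known Roman sequence by its Cyrillic counterpart."""
    return _PATTERN.sub(_replace, text)


print(transliterate("Gorbachov"))  # Горбачов
```

Characters that are not in the table are left untouched, which makes gaps in the rules easy to spot in the output.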
22
6.4 Acoustic model<br />
All speech data were partitioned into 3 groups:<br />
— Training set<br />
— Development set (tuning for our experiments)<br />
— Test set<br />
We used the existing partition documented in the file<br />
RussianSpeakersConfig.0, 123 speakers altogether<br />
This file lists all speakers with flags<br />
— flag = 0 – training set speaker<br />
— flag = 1 – test set speaker<br />
— flag = 3 – development set speaker<br />
23
6.4 Acoustic model<br />
Next we created the file speaker-config.txt, which consists of only two<br />
lines:<br />
005 033 042 065 078 097 103 106 110 122<br />
002 027 036 063 069 092 102 104 109 112<br />
(that is 16 of the 123 speakers from RussianSpeakersConfig.0)<br />
The numbers identify speakers and correspond to the speaker<br />
configurations<br />
— 1st line – test speakers (their prompts and utterances will be used for<br />
testing)<br />
— 2nd line – development speakers<br />
This file declares how the acoustic model should be trained<br />
This file was uploaded in the 1st step of acoustic model building in<br />
RLAT<br />
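As an illustration of how such a two-line file partitions the speakers, here is a minimal sketch. The function names are hypothetical, and it assumes that every speaker not listed in the file is used for training; RLAT's own parsing is not shown on the slides.

```python
def parse_speaker_config(text: str) -> dict[str, set[str]]:
    """Parse a two-line speaker config: line 1 = test, line 2 = development."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError("expected exactly two non-empty lines")
    return {"test": set(lines[0]), "dev": set(lines[1])}


def assign_set(speaker_id: str, config: dict[str, set[str]]) -> str:
    """Return the set a speaker belongs to (assumption: unlisted = training)."""
    if speaker_id in config["test"]:
        return "test"
    if speaker_id in config["dev"]:
        return "dev"
    return "train"


CONFIG_TEXT = """\
005 033 042 065 078 097 103 106 110 122
002 027 036 063 069 092 102 104 109 112
"""

config = parse_speaker_config(CONFIG_TEXT)
print(assign_set("005", config), assign_set("002", config), assign_set("001", config))
```

Keeping the speaker IDs as strings preserves the leading zeros used in the file.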
24
6.5 Language model<br />
What is expected from Language Models in ASR?<br />
• Provide an additional, independent information<br />
source<br />
• Word disambiguation<br />
• Word reordering<br />
• Search space reduction<br />
– With a vocabulary of n words, do not consider all $n^k$ possible<br />
k-word sequences<br />
25
6.5 Language model: n-grams<br />
The probability of a word sequence can be decomposed as:<br />
$P(W) = P(w_1 w_2 \dots w_n) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \dots w_{n-1})$<br />
Computing $P(w \mid \text{history})$ is a big problem, because with a vocabulary of<br />
64,000 words there is a huge number of possible histories.<br />
Solution: replace the "whole history" by the last 1, 2, 3, … words.<br />
– unigram: $P'(w_k \mid w_1 w_2 \dots w_{k-1}) = P(w_k)$<br />
– bigram: $P'(w_k \mid w_1 w_2 \dots w_{k-1}) = P(w_k \mid w_{k-1})$<br />
– trigram: $P'(w_k \mid w_1 w_2 \dots w_{k-1}) = P(w_k \mid w_{k-2} w_{k-1})$<br />
– n-gram: $P'(w_k \mid w_1 w_2 \dots w_{k-1}) = P(w_k \mid w_{k-(n-1)} w_{k-(n-2)} \dots w_{k-1})$<br />
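The bigram case can be made concrete with a minimal maximum likelihood sketch. It uses a toy corpus and no smoothing, so unseen bigrams get probability zero; the sentence markers `<s>` and `</s>` and all function names are assumptions of this example, not part of RLAT.

```python
from collections import Counter


def train_bigram(sentences: list[str]) -> tuple[Counter, Counter]:
    """Count histories and bigrams over whitespace-tokenized sentences."""
    history_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(tokens[:-1])              # every token that has a successor
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts


def bigram_prob(word: str, prev: str, history_counts: Counter, bigram_counts: Counter) -> float:
    """Maximum likelihood estimate of P(word | prev); 0.0 for unseen histories."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]


def sentence_prob(sentence: str, history_counts: Counter, bigram_counts: Counter) -> float:
    """Product of the bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(word, prev, history_counts, bigram_counts)
    return prob


histories, bigrams = train_bigram(["a b", "a c"])
print(bigram_prob("b", "a", histories, bigrams))   # 0.5
print(sentence_prob("a b", histories, bigrams))    # 0.5
```

A real system would add smoothing or backoff so that word sequences missing from the training text do not receive zero probability.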
26
6.5 Language model: interpolation<br />
Several LMs can be interpolated into one new LM<br />
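A common way to do this is linear interpolation, where the probability of a word is a weighted sum of the probabilities given by the component LMs and the weights sum to one. The sketch below assumes this linear scheme; the component probabilities in the usage line are made up, while the weights are the ones reported for the last interpolation on the Results slide.

```python
def interpolate(probabilities: list[float], weights: list[float]) -> float:
    """Linearly interpolate the probabilities that several LMs give one word."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per language model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probabilities))


# Weights for Base, LM1, LM2 and LM5 as reported in the results;
# the four probabilities are invented for the example.
weights = [0.37, 0.32, 0.04, 0.27]
print(interpolate([0.1, 0.2, 0.3, 0.4], weights))
```

The weights are typically tuned on the development set, which fits the use of the Dev set in the experiments.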
27
6.5 Language model<br />
The first LM was built from a prepared LM file (between 1990<br />
and 2000)<br />
It had actually been created from the transliterations of the training set<br />
— the text of the first LM was also in romanized form.<br />
With the help of regular expressions and our advisor it was transliterated<br />
to Cyrillic<br />
The Word Error Rate (WER) for this LM was 60.00%. It was defined<br />
as the baseline.<br />
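For reference, WER is usually computed as the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The following sketch implements that standard definition; it is not the scoring tool used in these experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("a b c d e", "a x c e"))  # 40.0
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.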
28
6.6 Text normalization<br />
Text normalization is a process by which text is transformed in some<br />
way to make it consistent<br />
— Numbers<br />
— Proper nouns<br />
— Abbreviations<br />
It is applied to the text to reduce perplexity with respect to the language<br />
model<br />
— For example:<br />
12 часов → двенадцать часов<br />
BBC → бибиси<br />
Number normalization in Russian: the same number can have<br />
different forms depending on case<br />
— For example: 3 → три, трёх, трём, тремя, третий, третья, третьего,<br />
третьей, третьему, третьей, третьем, третьей, третьих, etc.<br />
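A rule-based normalizer for the two examples above could be sketched as follows. The lookup tables are tiny hypothetical excerpts, and the sketch always emits the nominative cardinal, so it deliberately ignores the case agreement problem described on this slide.

```python
import re

# Hypothetical excerpts; a real normalizer needs full number grammar and
# must pick the form that agrees with the grammatical case.
CARDINALS_NOMINATIVE = {"3": "три", "12": "двенадцать"}
ABBREVIATIONS = {"BBC": "бибиси"}


def normalize(text: str) -> str:
    """Spell out known numbers and abbreviations, leave everything else as is."""
    def replace_number(match: re.Match) -> str:
        return CARDINALS_NOMINATIVE.get(match.group(0), match.group(0))

    def replace_abbreviation(match: re.Match) -> str:
        return ABBREVIATIONS.get(match.group(0), match.group(0))

    text = re.sub(r"\d+", replace_number, text)
    text = re.sub(r"\b[A-Z]{2,}\b", replace_abbreviation, text)
    return text


print(normalize("12 часов"))  # двенадцать часов
print(normalize("BBC"))       # бибиси
```

Unknown numbers and abbreviations pass through unchanged, which keeps the gaps in the rules visible in the normalized text.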
29
6.6 Text normalization: SMT normalization<br />
• Statistical Machine Translation (SMT) is used to<br />
translate a language into the same language.<br />
• A user normalizes text displayed in a web interface, thereby<br />
providing a parallel corpus of normalized and<br />
non-normalized text. [Schlippe et al, 2010]<br />
• SMT models are generated to translate non-normalized<br />
into normalized text. [Schlippe et al, 2010]<br />
• It can be combined with other kinds of normalization, for<br />
example with standard rule-based text normalization<br />
during crawling<br />
30
6.6 Text normalization: SMT normalization<br />
[Figure: SMT system built from a translation model (TM) and a language model (LM)]<br />
SMT normalization in RLAT builds a TM from a language into the same one.<br />
31
6.6 Text normalization: SMT normalization<br />
Translation model<br />
— Word-to-word and phrase-to-phrase mapping<br />
— Word order differs across languages (distortion model)<br />
Language model<br />
— How likely is a given word sequence<br />
— Typically n-gram models<br />
Decoder<br />
— Search for matching word and phrase translations<br />
— Search for the best sequence of these partial translations<br />
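How the three components interact can be written down in the classic noisy-channel view of SMT. This is the textbook formulation and an assumption here, since the slides do not state which model RLAT uses internally; the symbols $r$ (raw, non-normalized text) and $n$ (normalized text) are introduced only for this example.

```latex
\hat{n} = \operatorname*{arg\,max}_{n} P(n \mid r)
        = \operatorname*{arg\,max}_{n}
          \underbrace{P(r \mid n)}_{\text{translation model}}
          \cdot
          \underbrace{P(n)}_{\text{language model}}
```

The decoder performs the $\arg\max$ search over candidate normalized sentences.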
32
6.7 Crawling<br />
need MORE data<br />
— "There's no data like more data". More data contains more<br />
word combinations, so the results become more robust and more independent<br />
— "The former experiments indicate that massive text data crawling<br />
decreases the OOV rate significantly." [Vu et al, 2010]<br />
need more DIFFERENT data<br />
— LMs built from different websites can improve the system's<br />
performance [Vu et al, 2010]<br />
need more CURRENT data<br />
— Current news differs from the news in the text data of the baseline language<br />
model<br />
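The two measures used to compare the crawled LMs, OOV rate and perplexity (PPL), can be computed as in the sketch below. It assumes a token-based OOV rate and takes the per-word probabilities as given; how the experiments' tools count tokens is not stated on the slides.

```python
import math


def oov_rate(tokens: list[str], vocabulary: set[str]) -> float:
    """Percentage of test tokens that are not in the vocabulary."""
    if not tokens:
        raise ValueError("need at least one token")
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * unknown / len(tokens)


def perplexity(word_probabilities: list[float]) -> float:
    """exp of the negative average log-probability the LM assigns per word."""
    if not word_probabilities:
        raise ValueError("need at least one probability")
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


print(oov_rate(["a", "b", "c", "d"], {"a", "b", "c"}))  # 25.0
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that assigns every word probability 1/4 has perplexity 4 (up to rounding), i.e. it is as uncertain as a uniform choice among four words.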
33
6.7 Crawling<br />
Experiment<br />
Language Model     Website    Size  LS  SMT  OOV Rate  PPL<br />
Base               Baseline   1.9M  -   -    10.35%    4654.21<br />
LM1                rian.ru    4.2M  x   x    25.59%    3447.27<br />
LM2                inosmi.ru  792K  x   x    13.39%    5341.18<br />
LM3                inosmi.ru  920K  x   x    46.84%    2487.77<br />
-                  rian.ru    53M   -   -    14.23%    4866.73<br />
-                  rian.ru    57M   x   -    13.38%    4644.32<br />
LM5                rian.ru    58M   x   x    13.51%    4809.67<br />
LM4                vesti.ru   13M   x   x    23.53%    4639.72<br />
Last interpolated  -          -     -   -    7.9%      3325<br />
LS = language-specific rules, SMT = Statistical Machine Translation, OOV = out of vocabulary, PPL = perplexity<br />
34
7. Results<br />
Baseline Eval(Test): 58.17%<br />
----------------------------------------<br />
WER on the Dev set:<br />
Baseline: 60.00%<br />
Base + LM1: 52.05%<br />
Base + LM1 + LM2: 49.79%<br />
Base + LM1 + LM2 + LM3: 49.79%<br />
Base + LM1 + LM2 + LM4: 49.82%<br />
Base*0.37 + LM1*0.32 + LM2*0.04 + LM5*0.27: 44.78%<br />
-----------------------------------------<br />
LM1: inosmi.ru 792K<br />
LM2: rian.ru: 4.2M<br />
LM3: inosmi.ru: 920K<br />
LM4: vesti.ru 13M<br />
LM5: rian.ru 58M<br />
35
7. Results<br />
LM                      Set          WER, %<br />
Baseline                Eval (Test)  58.17<br />
Baseline                Dev          60.00<br />
Base + LM1              Dev          52.05<br />
Base + LM1 + LM2        Dev          49.79<br />
Base + LM1 + LM2 + LM3  Dev          49.79<br />
Base + LM1 + LM2 + LM4  Dev          49.82<br />
Base + LM1 + LM2 + LM5  Dev          44.78<br />
[Figure: bar chart "Result WERs", WER in % for each of the LM configurations above]<br />
Weights for the last interpolation: Base*0.37 + LM1*0.32 + LM2*0.04 + LM5*0.27<br />
36
6.7 Crawling: Experiment<br />
The same website, 3 crawls<br />
Website  Size  LS  SMT  OOV Rate  PPL      WER, % (Base + LM)<br />
rian.ru  53M   -   -    14.23%    4866.73  84.92<br />
rian.ru  57M   x   -    13.38%    4644.32  47.71<br />
rian.ru  58M   x   x    13.51%    4809.67  47.69<br />
Interpolation weights: 0.51*Base + 0.49*LM<br />
[Figure: bar chart "Word Error Rate", WER in % for the three crawls: Clear, Rule-based, Rule-based + SMT]<br />
37
Fixed bugs<br />
● Some normalization scripts for Russian were not included in the<br />
main normalization module (file crawl.sh). This was fixed and<br />
tested.<br />
● There were errors in the number normalization for Russian. They were<br />
fixed and the fix was used, but not tested separately.<br />
— For the numbers 15, 16, 17, 18, 19 and 50, 60, 70, 80, the rules were partially<br />
reversed, so the letter "ь" was wrongly placed in these words (or missing).<br />
— In some cases the endings for the numbers x00., x000. etc. were added<br />
incorrectly. E.g. "тысячного ого" instead of "тысячного".<br />
numbnorm_Russain.pl<br />
38
Suggestions<br />
● Better Russian text normalization scripts, including better<br />
number normalization<br />
● Most of the experiments were run on fairly small data<br />
sets, but it is noticeable that the more text data we crawl, the<br />
better the results<br />
— Crawling more text data can improve the recognition quality for<br />
the Russian language<br />
— Use more different sources for crawling<br />
39
Thanks for your interest!<br />
40
References<br />
[Vu et al, 2010] Ngoc Thang Vu, Tim Schlippe, Franziska Kraus,<br />
and Tanja Schultz, Rapid Bootstrapping of five Eastern European<br />
Languages using the Rapid Language Adaptation Toolkit (2010)<br />
[Schlippe et al, 2010] Tim Schlippe, Chenfei Zhu, Jan Gebhardt,<br />
and Tanja Schultz, Text Normalization based on Statistical<br />
Machine Translation and Internet User Support (2010)<br />
[Schultz et al, 2010] Tanja Schultz, Tim Schlippe, Language<br />
Modeling (MMMK-lecture SS2010)<br />
[Metze, 2010] Florian Metze, Multilingual Speech Processing<br />
(lecture, 2010)<br />
[Vogel, 2011] Stephan Vogel, Statistical Machine Translation: A<br />
Gentle Introduction (lecture, 2011)<br />
41