A Keyword Based Interactive Speech Recognition System for Embedded Applications

Master's Thesis

by
Iván Francisco Castro Cerón
Andrea Graciela García Badillo

June, 2011

School of Innovation, Design and Engineering
Mälardalen University
Västerås, Sweden

Supervisor: Patrik Björkman
Examiner: Lars Asplund


Abstract

Speech recognition has been an important area of research during the past decades. The usage of automatic speech recognition systems is rapidly increasing among different areas, such as mobile telephony, automotive, healthcare, robotics and more. However, despite the existence of many speech recognition systems, most of them use platform specific and non-publicly available software. Nevertheless, it is possible to develop speech recognition systems using already existing open source technology.

The aim of this master's thesis is to develop an interactive and speaker independent speech recognition system. The system shall be able to identify predetermined keywords from incoming live speech and, in response, play audio files with related information. Moreover, the system shall be able to provide a response even if no keyword was identified. For this project, the system was implemented using PocketSphinx, a speech recognition library that is part of the open source Sphinx technology by Carnegie Mellon University.

During the implementation of this project, the automation of different steps of the process was a key factor for a successful completion. This automation consisted of the development of different tools for the creation of the language model and the dictionary, two important components of the system. Similarly, the generation of the audio files to be played after identifying a keyword, as well as the evaluation of the system's performance, were fully automated.

The tests run show encouraging results and demonstrate that the system is a feasible solution that could be implemented and tested in a real embedded application. Despite the good results, possible improvements can be implemented, such as the creation of a different phonetic dictionary to support different languages.

Keywords: Automatic Speech Recognition, PocketSphinx, Embedded Systems


Acknowledgements

This master's thesis represents the completion of a journey we decided to start two years ago. Studying in Sweden has been quite an experience and, for that, we would like to thank MDH for giving us the opportunity to study here. We would also like to express our gratitude to our Professor Lars Asplund for his guidance throughout the development of this project. Moreover, we would like to thank the Consejo Nacional de Ciencia y Tecnología (CONACYT) in México for the financial support provided through the scholarships we were both granted.

We would also like to thank our families for their immense support and encouragement during all this time we have been away. Finally, we are grateful to life for having had the great opportunity of being able to accomplish one more goal together.


Contents

List of Figures
List of Tables

1 Introduction
  1.1 Outline
2 Background
  2.1 History
  2.2 Main Challenges and Current Research Areas
3 Automatic Speech Recognition
  3.1 Speech Variability
    3.1.1 Gender and Age
    3.1.2 Speaker's Origin
    3.1.3 Speaker's Style
    3.1.4 Rate of Speech
    3.1.5 Environment
  3.2 ASR Components
    3.2.1 Front End
    3.2.2 Linguistic Models
    3.2.3 Decoder
  3.3 Training for Hidden Markov Models
    3.3.1 Expectation-Maximization Algorithm
    3.3.2 Forward-Backward Algorithm
  3.4 Performance
    3.4.1 Word Error Rate
4 Design and Implementation
  4.1 The CMU Sphinx Technology
    4.1.1 Sphinx
    4.1.2 PocketSphinx
  4.2 System Description
  4.3 Keyword List
  4.4 Audio Files
    4.4.1 Creating the audio files
    4.4.2 Playing the audio files
  4.5 Custom Language Model and Dictionary
    4.5.1 Keyword-Based Language Model Generation
    4.5.2 Custom Dictionary Generation
    4.5.3 Support for other languages
  4.6 Decoding System
    4.6.1 Inputs
    4.6.2 Outputs
    4.6.3 Algorithm
    4.6.4 System Components
5 Tests and Results
  5.1 Main ASR application
    5.1.1 Performance Evaluation and KDA
    5.1.2 Test environment and setup
  5.2 Auxiliary Java Tools
    5.2.1 TextExtractor Tool
    5.2.2 DictGenerator Tool
    5.2.3 RespGenerator Tool
    5.2.4 rEvaluator Tool
6 Summary
  6.1 Conclusions
  6.2 Future Work
Bibliography
A Guidelines
  A.1 Installation of the speech recognizer project
  A.2 Running the speech recognition application
  A.3 Generating a new language model
  A.4 Generating a new dictionary
  A.5 Generating a new set of predefined responses
  A.6 Running the speech recognizer using the newly created files
  A.7 Evaluating the performance and measuring the KDA
B Source Code
  B.1 Recognizer
  B.2 TextExtractor
  B.3 DictGenerator
  B.4 respGenerator
  B.5 rEvaluator


List of Figures

2.1 DARPA Speech Recognition Benchmark Tests
2.2 Milestones and Evolution of the Technology
3.1 Signal processing to obtain the MFCC vectors
3.2 A typical triphone structure
3.3 Three state HMM
3.4 Example of phone and word HMMs representation
3.5 A detailed representation of a Multi-layer Perceptron
3.6 Tree Search Network using the A* algorithm
3.7 Sphinx training procedure
4.1 Keyword File Example
4.2 Sentences for Audio Files Example
4.3 Conversion of Sentences into Audio Files
4.4 Statistical Language Model Toolkit
4.5 TextExtractor and CMUCLMTK
4.6 Dictionary Generator Tool
4.7 SphinxTrain Tool
4.8 Decoding Algorithm
4.9 Component Diagram
5.1 Typical Test Setup
5.2 KDA Test Setup
5.3 Automated Test Process
5.4 Test Sentences
5.5 rEvaluator Tool
5.6 TextExtractor Operation Mode 1
5.7 TextExtractor Operation Mode 2
5.8 DictGenerator Operation Modes 1 and 2
5.9 Generated Dictionary Files
5.10 RespGenerator: Execution Messages
5.11 rEvaluator: Execution Messages
A.1 ASR Project Folder Structure
A.2 Message
A.3 Decoding Example
A.4 Decoding Example
A.5 Keywords and URLs
A.6 Example of a Set of Predefined Responses
A.7 Responses Structure
A.8 OGG Files
A.9 Verification File

List of Tables

3.1 Example of the first 3 words in an ASR dictionary
4.1 Dictionary Words Not Found: Causes and Solutions
4.2 Phonetical Dictionary
5.1 Keyword List
5.2 KDA Report Format
5.3 Overall Test Results
5.4 Computer Specifications
A.1 KDA Report Format


Chapter 1

Introduction

Automatic speech recognition (ASR) systems have caught the attention of researchers since the middle of the 20th century. From the initial attempts to identify isolated words using a very limited vocabulary, to the latest advancements in processing continuous speech composed of thousands of words, ASR technology has grown progressively.

The rise of robust speech recognition systems during the last two decades has triggered a number of potential applications for the technology. Existing human-machine interfaces, such as keyboards, can be enhanced or even replaced by speech recognizers that interpret voice commands to complete a task. This kind of application is particularly important because speech is the main form of communication among humans; for this reason, it is much simpler and faster to complete a task by providing vocal instructions to a machine rather than typing commands using a keyboard [21].

The convergence of several research disciplines such as digital signal processing, machine learning and language modeling has allowed ASR technology to mature and currently be used in commercial applications. Current speech recognition systems are capable of identifying and processing commands with many words. However, they are still unable to fully handle and understand a typical human conversation [9].

The performance and accuracy of existing systems allow them to be used in simple tasks like telephone-based applications (call centers, user authentication, etc.). Nevertheless, as the accuracy and the vocabulary size are increased, the computational resources needed to implement a typical ASR system grow as well. The amount of computational power required to execute a fully functional speech recognizer can be easily supplied by a general-purpose personal computer, but it might be difficult to execute the same system on a portable device [21].

Mobile applications represent a very important area where ASR systems can be used, for example in GPS navigation systems for cars, where the user can provide navigation commands by voice, or in a speech-based song selector for a music player. However, a large number of existing portable devices lack the processing power needed to execute


a high-end speech recognizer. For this reason, there is a large interest in designing and developing flexible ASR systems that can be run on both powerful and resource-constrained devices [8].

Typically, the speech recognition systems designed for embedded devices use restrictive grammars and do not support large vocabularies. Additionally, the accuracy achieved by those systems tends to be lower when compared to a high-end ASR. This compromise between processing power and accuracy is acceptable for simple applications; however, as ASR technology becomes more popular, the expectations regarding accuracy and vocabulary increase [9].

One particular area of interest in the research community is to optimize the performance of existing speech recognizers by implementing faster, lighter and smarter algorithms. For example, PocketSphinx is an open source, continuous speech recognition system for handheld devices developed by Carnegie Mellon University (CMU). As its name suggests, this system is a heavily simplified and optimized version of the existing tool Sphinx, developed by the same university. There are other speech recognizers developed for embedded devices, but most of these systems are not free and their source code is not publicly available. For this reason, it is difficult to use them for experimentation and research purposes [8].

1.1 Outline

This master's thesis report is organized as follows:

Chapter 2 reviews part of the history of speech recognition, the main challenges faced during the development of ASR systems, as well as some of the current areas of research.

Chapter 3 presents some of the sources of variability within human speech and their effects on speech recognition. Also described are the main components of an ASR system and how they are commonly trained and evaluated.

Chapter 4 describes in detail the design and implementation of the Keyword Based Interactive Speech Recognition System. This chapter also discusses the reasons for selecting PocketSphinx as the main decoder system for this project.

Chapter 5 presents the tests and results obtained after evaluating the implemented Keyword Based Interactive Speech Recognition System.

Chapter 6 presents the summary and conclusions of the paper and discusses possibilities for future work.


Chapter 2

Background

This chapter provides an overview of the history of speech recognition and ASR systems, as well as some of the main challenges regarding ASR systems and the current areas of research.

2.1 History

The idea of building machines capable of communicating with humans using speech emerged during the last couple of decades of the 18th century. However, these initial trials were more focused on building machines able to speak, rather than listening to and understanding human commands. For example, using the existing knowledge about the human vocal tract, Wolfgang von Kempelen constructed an "Acoustic-Mechanical Speech Machine" with the intention of replicating speech-like sounds [10].

During the first half of the 20th century, one of the main research fields related to language recognition was the spectral analysis of speech and its perception by a human listener. Some of the most influential documents were published by Bell Laboratories, and in 1952 they built a system able to identify isolated digits for a single speaker. The system had 10 speaker-dependent patterns, one associated with each digit, which represented the first two vowel formants for each digit. Although this scheme was rather rudimentary, the accuracy levels achieved were quite remarkable, reaching around 95 to 97% [11].

The final part of the 1960s saw the introduction of feature extraction algorithms, such as the Fast Fourier Transform (FFT) developed by Cooley and Tukey, the Linear Predictive Coding (LPC) developed by Atal and Hanauer, and the cepstral processing of speech introduced by Oppenheim in 1968. Warping, also known as non-uniform time scaling, was another technique presented to handle differences in the speaking rate and segment length of the input signals. This was accomplished by shrinking and stretching the input


signals in order to match stored patterns.

The introduction of Hidden Markov Models (HMMs) brought significant improvements to the existing technology. Based on the work published by Baum and others, James Baker applied the HMM framework to speech recognition in 1975 as part of his graduate work at CMU. In the same year, Jelinek, Bahl and Mercer applied HMMs to speech recognition while they were working for IBM. The main difference was the type of decoding used by the two systems; Baker's system used Viterbi decoding while IBM's system used stack decoding (also known as A* decoding) [10].

The use of statistical methods and HMMs heavily influenced the design of the next generation of speech recognizers. In fact, the use of HMMs started to spread during the 1980s and has become by far the most common framework used by speech recognizers. During the late 1980s, a technology involving artificial neural networks (ANNs) was introduced into speech recognition research; the main idea behind this technology was to identify phonemes or even complete words using multilayer perceptrons (MLPs). More recent research has tried to apply MLPs as a complementary tool for HMMs; in other words, some modern ASR systems use hybrid MLP/HMM models in order to improve their accuracy.

The Defense Advanced Research Projects Agency (DARPA) in the US played a major role in the funding, creation and improvement of speech recognition systems during the last two decades of the 20th century. DARPA funded and evaluated many systems by measuring their accuracy and, most importantly, their word error rate (WER), as illustrated by Figure 2.1.

Figure 2.1: DARPA Speech Recognition Benchmark Tests.


Many different tasks were created with different difficulty levels. For example, some tasks involved continuous speech recognition using a structured grammar, such as in military commands; some other tasks involved recognition of conversational speech using a very large vocabulary with more than 20 thousand words. The DARPA program also helped to create a number of speech databases used to train and evaluate ASR systems. Some of these databases are the Wall Street Journal, the Switchboard and the CALLHOME databases; all of them are publicly available.

Driven by the DARPA initiatives, ASR technology evolved and several research programs were created in different parts of the world. The speech recognition systems eventually became very sophisticated and capable of supporting large vocabularies with thousands of words and phonemes. However, the WER for conversational speech can still be considered high, with a value of nearly 40% [9].

During the first decade of the 21st century, the research community has focused on the use of machine learning in order to not only recognize words, but to interpret and understand human speech. In this regard, text-to-speech (TTS) synthesis systems have become popular in order to develop machines able to speak to humans. Nonetheless, designing and building machines able to mimic a person seems to be a challenge for future work [10]. Figure 2.2 depicts important milestones of speech recognition.

Figure 2.2: Milestones and Evolution of the Technology.


2.2 Main Challenges and Current Research Areas

During the past five decades, ASR technology has faced many challenges. Some of these challenges have been solved; some others are still present even in the most sophisticated systems. One of the most important roadblocks of the technology is the high WER for large-vocabulary continuous speech recognition systems. Without knowing the word boundaries in advance, the whole process of recognizing words and sentences becomes much harder. Moreover, failing to correctly identify word boundaries will certainly produce incorrect word hypotheses and incorrect sentences. This means that sophisticated language models are needed in order to discard incorrect word hypotheses [21].

Conversational and natural speech also contains co-articulatory effects; in other words, every sound is heavily influenced by the preceding and following sounds. In order to discriminate and correctly determine the phones associated with each sound, the ASR requires complex and detailed acoustic models. However, the use of large language and acoustic models typically increases the processing power needed to run the system. During the 1990s, even some common workstations did not have enough processing power to run a large vocabulary ASR system [21].

It is well known that increasing the accuracy of a speech recognition system increases its processing power and memory requirements. Similarly, these requirements can be relaxed by sacrificing some accuracy in certain applications. Nevertheless, it is much more difficult to improve accuracy while decreasing computational requirements.

Speech variability in all of its possible categories represents one of the most difficult challenges for speech recognition technology. Speech variability can be traced to many different sources, such as the environment, the speaker or the input equipment. For this reason, it is critical to create robust systems that are able to overcome speech variability regardless of its source [3].

Another important challenge for the technology is the improvement of the training process. For example, it is important to use training data that is similar to the type of speech used in a real application. It is also valuable to use training data that helps the system to discard incorrect patterns found during the process of recognizing speech. Finally, it is desirable to use algorithms and training sets that can adapt in order to discriminate incorrect patterns.

As can be seen, there are many opportunity areas and ideas to improve the existing ASR technology. Nevertheless, we can organize these areas into more specific groups, for example:

• It is important to improve the ease of use of the existing systems; in other words, the user needs to be able to use the technology in order to find more applications.


• ASR systems should be able to adapt and learn automatically. For instance, they should be able to learn new words and sounds.

• The systems should be able to minimize and tolerate errors. This can be achieved by designing robust systems that can be used in real life applications.

• Recognition of emotions could play a very important role in improving speech recognition, as emotion is one of the main causes of variability [3].


Chapter 3

Automatic Speech Recognition

This chapter discusses the main sources of speech variability and their effects on the accuracy of speech recognition. Additionally, it describes the major components of a typical ASR system, presents some of the algorithms used during the training phase and their common method of evaluation.

3.1 Speech Variability

One major challenge for speech recognition is speech variability. Due to human nature, a person is capable of emitting and producing a vast variety of sounds. Since each person has a different vocal tract configuration, shape and length (articulatory variations), it is impossible for two persons to speak alike; not even the same person can reproduce the same waveform after repeating the same word. However, it is not only the vocal tract configuration, but several different factors that can create different effects on the resulting speech signal.

3.1.1 Gender and Age

The speaker's gender is one of the main sources of speech variability. It makes a difference in the produced fundamental frequencies, as men and women have different vocal tract sizes. Similarly, age contributes to speech variability, as ASR for children becomes particularly difficult since their vocal tracts and folds are smaller compared to those of adults.

This has a direct impact on the fundamental frequency, as it becomes higher than adult frequencies. Also, according to [1], it has been shown that children under ten years old increase the duration of the vowels, resulting in variations of the formant locations and fundamental frequencies. In addition, children might also lack a correct pronunciation and vocabulary, or even produce spontaneous speech that is grammatically incorrect.


3.1.2 Speaker's Origin

Variations exist when recognizing native and non-native speech. Speech recognition among native speakers does not present significant changes in the acoustics and therefore does not have a big impact on the ASR system's performance. However, this might not be the case when recognizing a foreign speaker. Factors such as the non-native speaker's level of knowledge of the language, vocabulary, accent and pronunciation represent variations of the speech signal that could impact the system's performance. Moreover, if the speech models used only consider native speech data, the system's behavior might not be correct.

3.1.3 Speaker's Style

Apart from the origin of the speaker, speech also varies depending on the speaking style. A speaker might reduce the pronunciation of some phonemes or syllables during casual speech. On the other hand, portions of speech containing complex syntax and semantics tend to be articulated more carefully by speakers. Moreover, a phrase can be emphasized and the pronunciation can vary due to the speaker's mood. Thus, the context also determines how a speech signal is produced. Additionally, the speaker can introduce word repetitions or expressions that denote hesitation or uncertainty.

3.1.4 Rate of Speech

Another important source of variability is the rate of speech (ROS), since it increases the complexity, within the recognition process, of mapping the acoustic signal to the phonetic categories. Therefore, timing plays an important role, as an ASR system can have a higher error rate due to a higher speaking rate if the ROS is not properly taken into account. In contrast, a lower speaking rate may or may not affect the ASR system's performance, depending on factors such as whether the speaker over-articulates or introduces pauses within syllables.

3.1.5 Environment

Other sources of speech variability reside in the transmission channel; for example, distortions can be introduced into the speech signal due to the microphone arrangement. Also, the background environment can produce noise that is introduced into the speech signal, or even the room acoustics can modify the speech signal received by the ASR system.

From physiological to environmental factors, different variations exist when speaking and between speakers, such as gender or age. Therefore, speech recognition deals with


properly overcoming all these types of variations and their effects during the recognition process.

3.2 ASR Components

An ASR system comprises several components and each of them should be carefully designed in order to have a robust and well implemented system. This section presents the theory behind each of these components in order to better understand the entire process of the design and development of an ASR system.

3.2.1 Front End

The front end is the first component of almost every ASR system, as it is the one that sees the speech signal as it comes into the system. This component is in charge of performing the signal processing on the received speech signal. By the moment the input signal arrives at the front end, it has already passed through an acoustic environment, where the signal might have suffered from diverse effects, such as additive noise or room reverberation [15]. Thus, proper signal processing needs to be performed in order to enhance the signal, suppressing possible sources of variation, and to extract its features to be used by the decoder.

Feature Extraction

The speech signal is usually translated into a spectral representation comprising acoustic features. This is done by compressing the signal into a smaller set of N spectral features, where the size of N most likely depends on the duration of the signal. Therefore, in order to make a proper feature extraction, the signal needs to be properly sampled and treated prior to the extraction.

The feature extraction process should be able to extract the features that are critical to the recognition of the message within the speech signal. The speech waveform comprises several features, but the most important feature dimension is the so-called spectral envelope. This envelope contains the main features of the articulatory apparatus and is considered the core of speech analysis for speech recognition [9].

The features within the spectral envelope are obtained from a Fourier Transform, a Linear Predictive Coding (LPC) analysis or from a bank of bandpass filters. The most commonly used ASR features are the Mel-Frequency Cepstral Coefficients (MFCCs); however, there also exist LPC coefficients and Line-Spectral Frequencies (LSFs), among many others. Despite the extensive number of feature sets, they all intend to capture enough spectral information to recognize the spoken phonemes [16].


MFCC vectors

MFCCs can be seen as a way to translate an analog signal into digital feature vectors of typically 39 numbers. However, this process requires the execution of several steps in order to obtain these vectors. Figure 3.1 depicts the series of steps required to get the feature vectors from the input speech signal.

Figure 3.1: Signal processing to obtain the MFCC vectors

Due to the nature of human speech, the speech signal is attenuated as the frequencies increase. Also, the speech signal is subject to a falloff of -6 dB when passing through the vocal tract [16]. Therefore, the signal needs to be preemphasized, which means that a preemphasis filter, which is a high-pass filter, is applied to the signal. This increases the amplitude of the signal for the high frequencies while decreasing the components of lower frequencies. Then, having the original speech signal x, a new preemphasized sample x'_n at time n is given by the difference

x'_n = x_n - a \, x_{n-1}    (3.1)

where a is a factor that is typically set to a value near 0.9 [16].

Once the signal has been preemphasized, it is partitioned into smaller frames of about 20 to 30 milliseconds, sampled every 10 milliseconds. The higher the sample rate, the better it is to model fast speech changes [16]. The frames are overlapped in order to avoid missing information that could lie between the limits of each frame. This process is called windowing and it is used in order to minimize the effects of partitioning the signal into small windows. The most used window function in ASR is the Hamming window [13]. This function is described by

w_n = \alpha - (1 - \alpha) \cos\left(\frac{2 n \pi}{N - 1}\right)    (3.2)

where w is a window of size N and α has a value of 0.54 [18].

The next step is to compute the power spectrum by means of the Discrete Fourier Transform (DFT), using the Fast Fourier Transform (FFT) algorithm to minimize the required computation. Then, using a mel-filter bank, the power spectrum is mapped onto the


mel-scale in order to obtain a mel-weighted spectrum. The reason for using this scale is that it is a non-linear scale that approximates the non-uniform human auditory system [9]. A mel-filter bank consists of a number of overlapped triangular bandpass filters whose center frequencies are equidistant on the mel-scale. This scale is linear up to 1000 hertz and logarithmic thereafter [18].

Next, a logarithmic compression is applied to the data obtained from the mel-filter bank, resulting in log-energy coefficients. These coefficients are then orthogonalized by a Discrete Cosine Transform (DCT) in order to compress the spectral data into a set of low order coefficients, also known as the mel-cepstrum or cepstral vector [18]. This is described by the following

C_i = \sum_{k=1}^{M} X_k \cos\left[ i \left(k - \frac{1}{2}\right) \frac{\pi}{M} \right], \quad i = 1, 2, ..., M    (3.3)

where C_i is the ith MFCC, M is the number of cepstrum coefficients and X_k represents the log-energy coefficient of the kth mel-filter [2]. This vector is later normalized in order to account for the distortions that may have occurred due to the transmission channel.

Additionally, it is common to compute the first and second order differentials of the cepstrum sequence to obtain the delta-cepstrum and the delta-delta cepstrum. Furthermore, the delta-energy and delta-delta-energy parameters, which are the first and second order differentials of the power spectrum, are also added to the feature vector [9]. This is how an ASR feature vector typically consists of 13 cepstral coefficients, 13 delta values and 13 delta-delta values, which gives a total of 39 features per vector [16].
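To illustrate the first two front-end stages described above, the following minimal Java sketch applies the preemphasis filter of Equation 3.1 and the Hamming window of Equation 3.2 to one synthetic frame. The constants used (a = 0.97, α = 0.54, a 25 ms frame at 16 kHz) are common defaults and are not taken from the thesis implementation; the FFT, mel filtering, DCT and delta stages are omitted.

```java
/** Minimal sketch of the first two MFCC front-end stages: preemphasis and Hamming windowing. */
public class FrontEndSketch {

    /** Preemphasis filter, Equation 3.1: x'[n] = x[n] - a * x[n-1]. */
    static double[] preEmphasize(double[] x, double a) {
        double[] y = new double[x.length];
        y[0] = x[0];
        for (int n = 1; n < x.length; n++) {
            y[n] = x[n] - a * x[n - 1];
        }
        return y;
    }

    /** Hamming window, Equation 3.2: w[n] = alpha - (1 - alpha) * cos(2*pi*n / (N - 1)). */
    static double[] hammingWindow(double[] frame, double alpha) {
        int N = frame.length;
        double[] windowed = new double[N];
        for (int n = 0; n < N; n++) {
            double w = alpha - (1 - alpha) * Math.cos(2 * Math.PI * n / (N - 1));
            windowed[n] = frame[n] * w;
        }
        return windowed;
    }

    public static void main(String[] args) {
        // A 25 ms frame at 16 kHz contains 400 samples; here a synthetic 440 Hz tone is used.
        double[] frame = new double[400];
        for (int n = 0; n < frame.length; n++) {
            frame[n] = Math.sin(2 * Math.PI * 440 * n / 16000.0);
        }
        double[] emphasized = preEmphasize(frame, 0.97); // a is typically close to 0.9
        double[] windowed = hammingWindow(emphasized, 0.54);
        System.out.println("First windowed sample: " + windowed[0]);
    }
}
```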


3.2.2 Linguistic Models

Language Models and N-grams

After processing the input sound waves in the front end, the ASR generates a series of symbols representing the possible phonemes in a piece of speech. In order to form words using those phonemes, the speech recognition system uses language modeling. This modeling is characterized by a set of rules regarding how each word is related to other words. For example, a group of words cannot be put together arbitrarily; they have to follow the grammatical and syntactical rules of a language. This part of speech processing is necessary in order to determine the meaning of a spoken message during the further stages of speech understanding [9].

Most ASR systems use a probabilistic framework in order to find out how words are related to each other. In a very general sense, the system tries to determine what the latest word received is, based on a group of words previously received. For instance, what would be the next word in the following sentence?

I would like to make a collect. . .

Some of the possible words would be "call", "phone-call" or "international", among others. In order to determine the most probable word in the sentence, the ASR needs to use probability density functions, P(W), where W represents a sequence of words w_1, w_2, ..., w_n. The density function assigns a probability value to a word sequence depending on how likely it is to appear in a speech corpus [11]. Using the same example as before, the probability of the word "call" occurring after "I would like to make a collect" would be given by:

P(W) = P("I", "would", "like", "to", "make", "a", "collect", "call")    (3.4)

If we substitute the words by mathematical expressions we have:

P(W) = P(w_1, w_2, w_3, ..., w_{n-1}, w_n)    (3.5)

P(W) = P(w_1) P(w_2 | w_1) P(w_3 | w_2, w_1) ... P(w_n | w_{n-1}, ..., w_1)    (3.6)

As can be seen, the probability function requires using all of the words in a given sentence. This might represent a big problem when using large vocabularies, especially when dealing with very long sentences. A large speech corpus would be required in order to compute the probability of each word occurring after any combination of other words. In order to minimize the need for a large corpus, the N-gram model is used. This is achieved by approximating the probability function of a given word using a predefined number of previous words (N). For example, using a bigram model, the probability function is approximated using just the previous word. Similarly, the trigram model uses the previous two words, and so on [11].

Using a bigram model it is easier to approximate the probability of occurrence for the word "call", given the previous words:

P(W) = P("call" | "collect") P("collect" | "a") P("a" | "make") ... P("would" | "I")    (3.7)

The trigram model looks slightly more complicated as it uses the previous two words to compute each conditional probability:

P(W) = P("call" | "collect", "a") P("collect" | "a", "make") ... P("like" | "would", "I")    (3.8)

The N-gram model can be generalized using the following form [9]:

P(W) = \prod_{k=1}^{n} P(w_k | w_{k-1}, w_{k-2}, ..., w_{k-N+1})    (3.9)

In general, as N increases, the N-gram approximation becomes more effective; however, most ASR systems use a bigram or trigram model.
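The following Java sketch shows how a maximum-likelihood bigram model of the kind used in Equation 3.7 can be estimated from counts. It is an illustrative toy, not the CMU language model toolkit used later in this thesis: the two-sentence "corpus" is invented and no smoothing for unseen word pairs is applied.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a maximum-likelihood bigram model: P(w_k | w_{k-1}) = count(w_{k-1} w_k) / count(w_{k-1}). */
public class BigramSketch {

    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    /** Accumulate counts from one training sentence. */
    void addSentence(String sentence) {
        String[] words = sentence.toLowerCase().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            unigramCounts.merge(words[i], 1, Integer::sum);
            if (i > 0) {
                bigramCounts.merge(words[i - 1] + " " + words[i], 1, Integer::sum);
            }
        }
    }

    /** P(word | previous), without smoothing. */
    double probability(String previous, String word) {
        int context = unigramCounts.getOrDefault(previous, 0);
        if (context == 0) return 0.0;
        return bigramCounts.getOrDefault(previous + " " + word, 0) / (double) context;
    }

    /** Bigram probability of a whole sentence, as in Equation 3.7. */
    double sentenceProbability(String sentence) {
        String[] words = sentence.toLowerCase().split("\\s+");
        double p = 1.0;
        for (int i = 1; i < words.length; i++) {
            p *= probability(words[i - 1], words[i]);
        }
        return p;
    }

    public static void main(String[] args) {
        BigramSketch lm = new BigramSketch();
        lm.addSentence("i would like to make a collect call");
        lm.addSentence("i would like to make a phone call");
        System.out.println(lm.probability("collect", "call")); // 1.0 in this toy corpus
    }
}
```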


Acoustic Model

The main purpose of an acoustic model is to map sounds into phonemes and words. This can be achieved by using a large speech corpus and generating statistical representations of every phoneme in a language. The English language is composed of around 40 different phonemes [6]. However, due to co-articulation effects, each phoneme is affected by the preceding and succeeding phonemes depending on the context. The most common approach to overcome this problem is to use triphones; in other words, each phone is modeled along with its preceding and succeeding phones [9]. For example, the phoneme representation for the word HOUSEBOAT is:

[ hh aw s b ow t ]

In this context, in order to generate a statistical representation for the phoneme [aw], the phonemes [hh] and [s] need to be included as well. This triphone strategy can be implemented using Hidden Markov Models, where the triphone is represented using three main states (one for each phoneme), plus one initial and one end state. An example of a triphone can be seen in Figure 3.2.

Figure 3.2: A typical triphone structure.

The purpose of the Hidden Markov Model is to estimate the probability that an acoustic feature vector corresponds to a certain triphone. This probability can be obtained by solving what is called the evaluation problem of HMMs. There is more than one way to solve the evaluation problem; however, the most used algorithms are the forward-backward algorithm, the Viterbi algorithm and the stack decoding algorithm [16].

Dictionary

The ASR dictionary complements the language model and the acoustic model by mapping written words into phoneme sequences. It is important to notice that the size of the dictionary should correspond to the size of the vocabulary used by the ASR system. In other words, it should contain every single word used by the system and its phonetic representation, as can be seen in Table 3.1.

Word        Phoneme representation
aancor      AA N K AO R
aardema     AA R D EH M AH
aardvark    AA R D V AA R K

Table 3.1: Example of the first 3 words in an ASR dictionary
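A pronunciation dictionary of this kind is essentially a map from a word to its phoneme sequence. The following small Java sketch illustrates that structure with the entries of Table 3.1 plus the HOUSEBOAT example from above; it is only illustrative and does not reflect how PocketSphinx stores its dictionary internally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Sketch of a pronunciation dictionary: each word maps to its phoneme sequence, as in Table 3.1. */
public class DictionarySketch {
    public static void main(String[] args) {
        Map<String, List<String>> dictionary = Map.of(
            "aancor",    Arrays.asList("AA", "N", "K", "AO", "R"),
            "aardema",   Arrays.asList("AA", "R", "D", "EH", "M", "AH"),
            "aardvark",  Arrays.asList("AA", "R", "D", "V", "AA", "R", "K"),
            "houseboat", Arrays.asList("HH", "AW", "S", "B", "OW", "T")
        );
        // The decoder expands each word into the phone (or triphone) models listed here.
        System.out.println("houseboat -> " + dictionary.get("houseboat"));
    }
}
```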


The representation of these phonemes must correspond to the representation used by the acoustic model in order to identify words and sentences correctly. Similarly, the words included in the dictionary must be the same as the words used by the language model. Furthermore, the dictionary should be as detailed as possible in order to improve the accuracy of the speech recognition system. For example, the dictionary used by the CMU Sphinx system is composed of more than 100,000 words.

3.2.3 Decoder

Hidden Markov Models

ASR systems typically make use of finite-state machines (FSMs) to overcome the variations found within speech by means of stochastic modeling. Hidden Markov Models (HMMs) are one of the most common FSMs used within ASR and were first used for speech processing in the 1970s by Baker at CMU and Jelinek at IBM [16]. This type of model comprises a set of observations and hidden states with self or forward transitions and probabilistic transitions between them. Most HMMs have a left-to-right topology, which in the case of ASR allows modeling the sequential nature of speech [15].

In a Markov model, the probability or likelihood of being in a given state depends only on the immediately prior state, leaving earlier states out of consideration. Unlike a Markov model, where each state corresponds to an observable event, in an HMM the state is hidden; only the output is visible.

An HMM can then be described as having states, observations and probabilities [20]:

• States. N hidden interconnected states and state q_t at time t,

S = S_1, S_2, ..., S_N    (3.10)

• Observations.

Symbols. M observation symbols per state,

V = V_1, V_2, ..., V_M    (3.11)


Sequences. T observations in the sequence,

O = O_1, O_2, ..., O_T    (3.12)

• Probabilities.

State transition probability distribution A = {a_ij}, where

a_{ij} = P[q_{t+1} = S_j | q_t = S_i], \quad 1 \le i, j \le N    (3.13)

Observation symbol probability distribution B = {b_j(k)} at state S_j, where

b_j(k) = P[v_k \text{ at } t | q_t = S_j], \quad 1 \le j \le N, \; 1 \le k \le M    (3.14)

Initial state distribution π = {π_i}, where

\pi_i = P[q_1 = S_i], \quad 1 \le i \le N    (3.15)

Therefore, an HMM model can be denoted as:

\lambda = (A, B, \pi)    (3.16)

An example of an HMM is depicted in Figure 3.3. The model has three states with self and forward transitions between them. State q_1 has two observation symbols, state q_2 has three observation symbols and state q_3 has four observation symbols. Each symbol has its own observation probability distribution per state.

Figure 3.3: Three state HMM

An HMM shall address three basic problems in order to be useful in real-world applications such as an ASR system [20]. The first problem is selecting a proper method to determine the probability P(O|λ) that the observation sequence O = O_1, O_2, ..., O_T is


produced by the model λ = (A, B, π). The typical approach used to solve this problem is the forward-backward procedure, with the forward step being the most important at this stage. This algorithm is presented in Section 3.3.

The second problem is finding the best state sequence Q = q_1, q_2, ..., q_T that produced the observation sequence O = O_1, O_2, ..., O_T. For this, it is necessary to use an optimality criterion and learn about the model's structure. One possibility is to select the states q_t that are individually most likely for each t. However, this approach is not efficient as it only provides individual results, which is why the common approach to finding the best state sequence is to use the Viterbi algorithm. This algorithm is presented later in this section.

Finally, the third problem deals with the adjustment of the model parameters λ = (A, B, π) in order to maximize the probability P(O|λ) and better determine the origin of the observation O = O_1, O_2, ..., O_T. This is considered to be an optimization problem where the output turns out to be the decision criterion. In addition, this is the point where training takes an important role in ASR, as it allows adapting the model parameters to observed training data. Therefore, a common approach to solve this is by means of the maximum likelihood optimization procedure. This procedure uses separate training observation sequences O_v to obtain model parameters for each model λ_v

P_v^* = \max_{\lambda_v} P(O_v | \lambda_v)    (3.17)

Typical usage of Hidden Markov Models

In ASR, various models can be built based on HMMs, the most common being models for phonemes and words. An example of these types of models is depicted in Figure 3.4, where several phone HMM models are used to construct the phonetic representation of the word one, and the concatenation of these models is used to construct word HMM models.

Phoneme models are constrained by pronunciations from a dictionary, while word models are constrained by a grammar. Phone or phoneme models are usually made up of one or more HMM states. On the other hand, word models are made up of the concatenation of phoneme models, which at the same time help in the construction of sentence models [15].

Figure 3.4: Example of phone and word HMMs representation
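As a concrete illustration of the parameter set λ = (A, B, π) defined in Equations 3.10 to 3.16, the following Java sketch holds the three distributions of a small discrete-observation HMM. The toy left-to-right topology and the probability values are invented for illustration; they do not correspond to any model used in this thesis.

```java
/** Sketch of the parameter set lambda = (A, B, pi) of a discrete-observation HMM (Equations 3.10-3.16). */
public class HmmSketch {
    final double[][] a;   // a[i][j]: transition probability from state i to state j (Equation 3.13)
    final double[][] b;   // b[j][k]: probability of emitting symbol k in state j (Equation 3.14)
    final double[] pi;    // pi[i]: probability of starting in state i (Equation 3.15)

    HmmSketch(double[][] a, double[][] b, double[] pi) {
        this.a = a;
        this.b = b;
        this.pi = pi;
    }

    public static void main(String[] args) {
        // Toy left-to-right model with three emitting states and two observation symbols.
        double[][] a = {
            {0.6, 0.4, 0.0},
            {0.0, 0.7, 0.3},
            {0.0, 0.0, 1.0}
        };
        double[][] b = {
            {0.9, 0.1},
            {0.5, 0.5},
            {0.2, 0.8}
        };
        double[] pi = {1.0, 0.0, 0.0};
        HmmSketch hmm = new HmmSketch(a, b, pi);
        System.out.println("P(symbol 1 | state 2) = " + hmm.b[2][1]);
    }
}
```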


Acoustic Probabilities and Neural Networks / MLPs

The front end phase of any speech recognition system is in charge of converting the input sound waves into feature vectors. Nonetheless, these vectors need to be converted into observation probabilities in order to decode the most probable sequence of phonemes and words in the input speech. Two of the most common methods to find the observation probabilities are the Gaussian probability-density functions (PDFs) and, more recently, the use of neural networks (also called multi-layer perceptrons, MLPs).

The Gaussian observation-probability method converts an observation feature vector o_t into a probability function b_j(o_t) using a Gaussian curve with a mean value µ and a covariance matrix Σ. In the simpler version of this method, each state in the hidden Markov framework has one PDF. However, most ASR systems use multiple PDFs per state; for this reason, the overall probability function b_j(o_t) is computed using Gaussian mixtures. The forward-backward algorithm is commonly used to compute these probability functions and is also used to train the whole hidden Markov model framework [11].

Having the mean value and the covariance matrix, the probability function can be computed using the following equation:

b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_j|}} \exp\left[ -\frac{1}{2} (o_t - \mu_j)' \Sigma_j^{-1} (o_t - \mu_j) \right]    (3.18)

Artificial neural networks are one of the main alternatives to the Gaussian estimator in order to compute observation probabilities. The key characteristic of neural networks is that they can arbitrarily map inputs to outputs as long as they have enough hidden layers. For example, having the feature vectors as inputs (o_t) and the corresponding


phone labels as outputs, a neural network can be trained to obtain the observation probability functions b_j(o_t) [10].

Typically, the ASR systems that use hidden Markov models and MLPs are classified as hybrid (HMM-MLP) speech recognition systems. The inputs to the neural network are various frames containing spectral features, and the network has one output for each phone in the language. In this case, the most common method to train the neural network is the back-propagation algorithm [15].

Another advantage of using MLPs is that, having a large set of phone labels and their corresponding set of observations, the back-propagation algorithm iteratively adjusts the weights in the MLP until the errors are minimized. Figure 3.5 presents an example of a representation of an MLP with three layers.

Figure 3.5: A detailed representation of a Multi-layer Perceptron.

The two methods to calculate the acoustic probabilities (Gaussians and MLPs) have roughly the same performance. However, MLPs take a longer time to train and they use fewer parameters. For this reason, the neural network method seems to be more suited to ASR applications where the amount of processing power and memory are major concerns.
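To make the Gaussian observation probability of Equation 3.18 more concrete, the following Java sketch evaluates a single Gaussian with a diagonal covariance matrix, which is a common simplification; it works in the log domain, as practical decoders usually do for numerical stability. Real systems use mixtures of such Gaussians per state and 39-dimensional feature vectors; the 3-dimensional vector and parameter values below are invented for illustration.

```java
/** Sketch of a diagonal-covariance Gaussian observation probability b_j(o_t), a simplified form of Equation 3.18. */
public class GaussianSketch {

    /** Log-likelihood of a feature vector under one Gaussian with diagonal covariance. */
    static double logLikelihood(double[] o, double[] mean, double[] variance) {
        double logProb = 0.0;
        for (int d = 0; d < o.length; d++) {
            double diff = o[d] - mean[d];
            logProb += -0.5 * (Math.log(2 * Math.PI * variance[d]) + diff * diff / variance[d]);
        }
        return logProb;
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "feature vector"; real MFCC vectors have 39 dimensions.
        double[] o = {0.2, -1.1, 0.5};
        double[] mean = {0.0, -1.0, 0.4};
        double[] variance = {1.0, 0.5, 0.25};
        System.out.println("log b_j(o_t) = " + logLikelihood(o, mean, variance));
    }
}
```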


Viterbi Algorithm and Beam Search

One of the most complex tasks faced by speech recognition systems is the identification of word boundaries. Due to co-articulation and fast speech rates, it is often difficult to identify the place where one word ends and the next one starts. This is called the segmentation problem of speech and it is typically solved using N-gram models and the Viterbi algorithm. Given a set of observed phones o = (o_1, o_2, ..., o_t), the purpose of the decoding algorithm is to find the most probable sequence of states q* = (q_1, q_2, ..., q_n), and subsequently, the most probable sequence of words in a speech [11].

The Viterbi search is implemented using a matrix where each cell contains the best path after the first t observations. Additionally, each cell contains a pointer to the last state i in the path. For example:

viterbi[t, i] = \max_{q_1, q_2, ..., q_{t-1}} P(q_1 q_2 ... q_{t-1}, q_t = i, o_1, o_2 ... o_t)    (3.19)

In order to compute each cell in the matrix, the Viterbi algorithm uses what is called the dynamic programming invariant [9]. In other words, the algorithm assumes that the overall best path for the entire observation goes through state i, but sometimes that assumption might lead to incorrect results. For example, if the best path looks bad initially, the algorithm would discard it and select a different path with a better probability up to state i. However, the dynamic programming invariant is often used in order to simplify the decoding process using a recurrence rule in which each cell in the Viterbi matrix can be computed using information related to the previous cell [11].

viterbi[t, j] = \max_i \left( viterbi[t - 1, i] \, a_{ij} \right) b_j(o_t)    (3.20)

As can be seen, the cell [t, j] is computed using the previous cell [t - 1, i], one emission probability b_j and one transition probability a_{ij}. It is important to emphasize that this is a simplified model of the Viterbi algorithm; for example, a real speech recognizer that uses HMMs would receive acoustic feature vectors instead of phones. Furthermore, the likelihood probabilities b_j(o_t) would be calculated using Gaussian probability functions or multi-layer perceptrons (MLPs). Additionally, the hidden Markov models are typically divided into triphones rather than single phones. This characteristic of HMMs provides a direct segmentation of the speech utterance [16].

The large number of possible triphones and the use of a large vocabulary make speech decoding a computationally expensive task. For this reason, it is necessary to implement a method to discard low probability paths and focus on the best ones. This process is called pruning and is usually implemented using a beam search algorithm. The main purpose of this algorithm is to speed up the execution of the search algorithm and to use a lower amount of computational resources. However, the main drawback of the algorithm is the degradation of the decoding performance [11].
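The following Java sketch implements the Viterbi recurrence of Equation 3.20 for a small discrete HMM, together with a crude beam-pruning step that zeroes out cells falling below a fraction of the best score at each time step. It is only a didactic example: the toy model is invented, and a real decoder would work with log probabilities, keep backpointers to recover the best state sequence, and evaluate b_j(o_t) from acoustic feature vectors.

```java
/** Sketch of the Viterbi recursion of Equation 3.20 with a simple beam-pruning step. */
public class ViterbiSketch {

    /** Returns viterbi[t][j] for a discrete-observation HMM; entries below (best * beam) are pruned to 0. */
    static double[][] viterbi(double[] pi, double[][] a, double[][] b, int[] obs, double beam) {
        int T = obs.length, N = pi.length;
        double[][] v = new double[T][N];
        for (int j = 0; j < N; j++) {
            v[0][j] = pi[j] * b[j][obs[0]];                       // initialization
        }
        for (int t = 1; t < T; t++) {
            double best = 0.0;
            for (int j = 0; j < N; j++) {
                double max = 0.0;
                for (int i = 0; i < N; i++) {
                    max = Math.max(max, v[t - 1][i] * a[i][j]);   // Equation 3.20
                }
                v[t][j] = max * b[j][obs[t]];
                best = Math.max(best, v[t][j]);
            }
            for (int j = 0; j < N; j++) {
                if (v[t][j] < best * beam) v[t][j] = 0.0;         // beam pruning: drop low-probability paths
            }
        }
        return v;
    }

    public static void main(String[] args) {
        double[] pi = {1.0, 0.0};
        double[][] a = {{0.7, 0.3}, {0.0, 1.0}};
        double[][] b = {{0.8, 0.2}, {0.3, 0.7}};
        int[] obs = {0, 0, 1};
        double[][] v = viterbi(pi, a, b, obs, 1e-3);
        System.out.println("Best final score = " + Math.max(v[2][0], v[2][1]));
    }
}
```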


A* decoding algorithm

The A* decoding algorithm (also known as stack decoding) can be used to overcome the limitations of the Viterbi algorithm. The most important limitation is related to the use of the dynamic programming invariant, because of which the Viterbi algorithm cannot be used with some language models, such as trigrams [11]. The stack decoding algorithm solves the same problem as the Viterbi algorithm, which is finding the most likely word sequence W given a sequence of observations O [17]. For this reason it is often used as the best alternative to substitute Viterbi's method.

In this case, the speech recognition problem can be seen as a tree network search problem in which the branches leaving each junction represent words. As the tree network is processed, more words are appended to the current string of words in order to form the most likely path [17]. An example of this tree is illustrated by Figure 3.6.

Figure 3.6: Tree Search Network using the A* algorithm.


The stack decoder computes the path with the highest probability of occurrence from the start to the last leaf in the sequence. This is achieved by storing a priority stack (or queue) with a list of partial sentences, each with a score based on its probability of occurrence. The basic operation of the A* algorithm can be simplified using the following steps [17]:

1. Initialize the stack.

2. Pop the best (highest scoring) candidate off the stack.

3. If the end of sentence is reached, output the sentence and terminate.

4. Perform acoustic and language model fast matches to obtain new candidates.

5. For each word on the candidate list:

   (a) Perform acoustic and language-model detailed matches to compute the new theory output likelihood.

      i. If the end of sentence is not reached, insert the candidate into the stack.

      ii. If the end of sentence is reached, insert it into the stack with an end-of-sentence flag.

6. Go to step 2.

The stack decoding algorithm is based on a criterion that computes the estimated score f*(t) of a sequence of words up to time t. The score is computed by using the known probabilities of previous words g_i(t) and a heuristic function to predict the remaining words in the sentence. For example:

f^*(t) = g_i(t) + h^*(t)    (3.21)

Alternatively, these functions can be expressed in terms of a partial path p instead of time t. In this case, f*(p) is the score of the best complete path which starts at path p. Similarly, g_i(p) represents the score from the beginning of the utterance towards the partial path p. Lastly, h*(p) estimates the best extension from the partial path p towards the end of the sentence [11].

Finding an efficient and accurate heuristic function might represent a complex task. Fast matches are heuristic functions that are computationally cheap and are used to reduce the number of possible next word candidates. Nonetheless, these fast match functions must be checked by more accurate detailed match functions [17].
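The loop above can be sketched with a priority queue of partial hypotheses ordered by the score f = g + h of Equation 3.21, as in the following Java example. Everything in it is hypothetical: the candidate words and their log-probability scores stand in for the acoustic and language model fast/detailed matches, and the heuristic h is a fixed per-remaining-word estimate rather than a real fast match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of the stack (A*) decoding loop: partial sentences are kept in a priority queue ordered by g + h. */
public class StackDecoderSketch {

    static class Hypothesis {
        final List<String> words;
        final double g;        // score of the partial path so far (log probability)
        final double h;        // heuristic estimate of the best possible extension
        Hypothesis(List<String> words, double g, double h) { this.words = words; this.g = g; this.h = h; }
        double f() { return g + h; }                              // Equation 3.21: f* = g + h*
    }

    public static void main(String[] args) {
        // Toy "fast match": each expansion proposes two hypothetical next words with log-probability scores.
        String[][] nextWords = {{"make", "take"}, {"a", "the"}, {"call", "card"}};
        double[][] nextScores = {{-0.4, -1.2}, {-0.3, -1.5}, {-0.5, -1.0}};
        int sentenceLength = nextWords.length;

        PriorityQueue<Hypothesis> stack =
            new PriorityQueue<>((x, y) -> Double.compare(y.f(), x.f()));   // best (highest) score first
        stack.add(new Hypothesis(new ArrayList<>(), 0.0, -0.3 * sentenceLength));

        while (!stack.isEmpty()) {
            Hypothesis best = stack.poll();                       // step 2: pop the best candidate
            int depth = best.words.size();
            if (depth == sentenceLength) {                        // step 3: end of sentence reached
                System.out.println("Decoded: " + best.words + "  score = " + best.g);
                break;
            }
            for (int k = 0; k < nextWords[depth].length; k++) {   // steps 4-5: extend with new candidates
                List<String> extended = new ArrayList<>(best.words);
                extended.add(nextWords[depth][k]);
                double g = best.g + nextScores[depth][k];
                double h = -0.3 * (sentenceLength - depth - 1);   // cheap heuristic for the remaining words
                stack.add(new Hypothesis(extended, g, h));
            }
        }
    }
}
```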


3.3 Training for Hidden Markov Models

One of the most challenging phases of developing an automatic speech recognition system is finding a suitable method to train and evaluate its hidden Markov models. This section provides an overview of the Expectation-Maximization algorithm and presents the Forward-Backward algorithm that is used for the training of an ASR system.

3.3.1 Expectation-Maximization Algorithm

Typically, the methods used to complete the training task of an ASR are variations of the Expectation-Maximization (EM) algorithm, presented by Dempster in 1977. The goal of this algorithm is to approximate the transition (a_{ij}) and emission (b_i(o_t)) probabilities of the HMM using large sets of observation sequences O [11].

The initial values of the transition and emission probabilities can be estimated; as the training of the HMM progresses, those probabilities are re-estimated until their values converge into a good model. The expectation-maximization algorithm uses the steepest gradient method, also known as hill-climbing. This means that the convergence of the algorithm is determined by local optimality [16].

3.3.2 Forward-Backward Algorithm

Due to the large number of parameters involved in the training of HMMs, it is preferable to use simple methods. Some of these methods can feature adaptation in order to add robustness to the system by using noisy or contaminated training data. In fact, this has been a major research area in the field of speech recognition during the last couple of decades [14].

Two of the most popular algorithms used to train HMMs are the Viterbi algorithm and the Forward-Backward (FB) algorithm, also known as the Baum-Welch algorithm. This last algorithm computes two main parameters in order to train HMMs, the maximum likelihood estimates and the posterior modes (transition and emission probabilities), using emission observations as training data. The approximation of the HMM probabilities can be achieved by combining two parameters named the forward probability (alpha) and the backward probability (beta) [11].

The forward probability is defined as the probability of being in state i after the first t observations (o_1, o_2, o_3, ..., o_t).

\alpha_i(t) = P(o_1, o_2, ..., o_t, q_t = i)    (3.22)

Similarly, the backward probability is defined as the probability of observing the observations from time t + 1 to the end (T) when the state is j at time t.


    β_t(j) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = j)    (3.23)

In both cases, the probabilities are calculated using an initial estimate, and the remaining values are then approximated using an induction step. For example:

    α_1(j) = a_{1j} b_j(o_1)    (3.24)

The other values are then calculated recursively:

    α_t(j) = [ Σ_{i=2}^{N-1} α_{t-1}(i) a_{ij} ] b_j(o_t)    (3.25)

As can be seen, the forward probability in any given state can be computed using the product of the observation likelihood b_j and the forward probabilities from time t − 1. This characteristic allows the algorithm to work efficiently without drastically increasing the number of computations as N grows [16].

A similar approach is used to calculate the backward probabilities using an iterative formula:

    β_t(i) = Σ_{j=2}^{N-1} a_{ij} b_j(o_{t+1}) β_{t+1}(j)    (3.26)

Once the forward and backward probabilities have been calculated, the alpha and beta factors are normalized and combined in order to approximate the new values of the emission and transition probabilities.

Although the Baum-Welch algorithm has proven to be an efficient way to train HMMs, the training data needs to be sufficient to avoid ending up with parameters whose probabilities are equal to zero. Furthermore, the forward-backward algorithm helps to train the parameters of an existing HMM, but the structure of the HMM needs to be generated manually. This can be a major concern, as finding a good method to generate the structure of an HMM can be a difficult task [16].

An example of a training procedure in which the forward-backward algorithm is used by Sphinx is depicted in Figure 3.7.
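Before turning to the Sphinx training procedure in Figure 3.7, the sketch below makes the α and β recursions concrete for a small HMM. It is illustrative only: the toy model values, the array sizes and the assumption of non-emitting start and end states (states 1 and N) are ours, and real trainers such as SphinxTrain work in log space or with scaling to avoid numerical underflow.

#include <stdio.h>

#define N 5   /* number of states; 1 = start, N = end (non-emitting) */
#define T 4   /* number of observations                              */

/* a[i][j]: transition probability from state i to state j
 * b[j][t]: probability of emitting observation o_t in state j       */
double a[N + 1][N + 1];
double b[N + 1][T + 1];

double alpha[T + 1][N + 1];
double beta [T + 1][N + 1];

void forward(void) {
    for (int j = 2; j <= N - 1; j++)             /* initialization (3.24) */
        alpha[1][j] = a[1][j] * b[j][1];
    for (int t = 2; t <= T; t++)                 /* induction (3.25)      */
        for (int j = 2; j <= N - 1; j++) {
            double sum = 0.0;
            for (int i = 2; i <= N - 1; i++)
                sum += alpha[t - 1][i] * a[i][j];
            alpha[t][j] = sum * b[j][t];
        }
}

void backward(void) {
    for (int i = 2; i <= N - 1; i++)             /* initialization, assuming a
                                                    dedicated final state N    */
        beta[T][i] = a[i][N];
    for (int t = T - 1; t >= 1; t--)             /* induction (3.26)           */
        for (int i = 2; i <= N - 1; i++) {
            double sum = 0.0;
            for (int j = 2; j <= N - 1; j++)
                sum += a[i][j] * b[j][t + 1] * beta[t + 1][j];
            beta[t][i] = sum;
        }
}

int main(void) {
    /* toy model: uniform transitions and emissions, just to exercise the code */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++) a[i][j] = 1.0 / N;
    for (int j = 1; j <= N; j++)
        for (int t = 1; t <= T; t++) b[j][t] = 0.5;

    forward();
    backward();
    printf("alpha_T(2) = %g, beta_1(2) = %g\n", alpha[T][2], beta[1][2]);
    return 0;
}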


Figure 3.7: Sphinx training procedure

3.4 Performance

The performance of an ASR system can be measured in terms of its recognition error probability, which is why specific metrics such as the word error rate are used to evaluate this type of system. This section describes the word error rate, which is the most common measurement used to evaluate the performance of an ASR system.

3.4.1 Word Error Rate

The word error rate (WER) has become the standard measurement scheme for evaluating the performance of speech recognition systems [11]. This metric quantifies the total number of incorrect words in a recognition task. Similar approaches use syllables or phonemes to calculate error rates; however, the most commonly used measurement units are words. The WER is calculated by counting the number of inserted, deleted or substituted words in the hypothesized speech string with respect to the correct transcript [1]:

    WER = (Insertions + Substitutions + Deletions) / TotalWords × 100    (3.27)


As can be seen, the number of word insertions is included in the expression; for this reason the WER can take values above 100 percent. Typically, a WER lower than 10 percent is considered acceptable for most ASR systems [10]. Furthermore, the WER is the most common metric used to benchmark and evaluate improvements to existing automatic speech recognition systems, for example when introducing improved or new algorithms [21].

Although the WER is the most widely used metric to evaluate the performance of ASR systems, it does not provide further insight into the factors that generate the recognition errors. Other methods have been proposed to measure and classify the most common speech recognition errors, for example the analysis of variance (ANOVA) method, which allows the quantification of multiple sources of error acting on the variability of speech signals [1].

During the last decade, researchers have tried to predict speech recognition errors instead of just measuring them in order to evaluate the performance of a system. Furthermore, the predicted error rates can be used to carefully select speech data in order to train ASR systems more effectively.
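To make Equation 3.27 concrete, the sketch below computes the WER of a hypothesis against a reference transcript using a standard word-level edit-distance alignment, whose minimum cost equals the total number of insertions, substitutions and deletions. It is a minimal illustration, not one of the evaluation tools developed in this thesis, and the example sentences are made up.

#include <stdio.h>
#include <string.h>

#define MAX_WORDS 64

/* Split a sentence into whitespace-separated words (modifies the buffer). */
static int tokenize(char *s, char *words[]) {
    int n = 0;
    for (char *w = strtok(s, " \t\n"); w && n < MAX_WORDS; w = strtok(NULL, " \t\n"))
        words[n++] = w;
    return n;
}

/* Word-level edit distance: the minimum number of insertions,
 * substitutions and deletions needed to turn hyp into ref.               */
static int edit_distance(char *ref[], int nr, char *hyp[], int nh) {
    int d[MAX_WORDS + 1][MAX_WORDS + 1];
    for (int i = 0; i <= nr; i++) d[i][0] = i;
    for (int j = 0; j <= nh; j++) d[0][j] = j;
    for (int i = 1; i <= nr; i++)
        for (int j = 1; j <= nh; j++) {
            int sub = d[i-1][j-1] + (strcmp(ref[i-1], hyp[j-1]) != 0);
            int del = d[i-1][j] + 1;   /* word missing from the hypothesis */
            int ins = d[i][j-1] + 1;   /* extra word in the hypothesis     */
            d[i][j] = sub < del ? (sub < ins ? sub : ins)
                                : (del < ins ? del : ins);
        }
    return d[nr][nh];
}

int main(void) {
    char ref_s[] = "what is the steam engine exhibition about";
    char hyp_s[] = "what is steam engine exhibitions about";
    char *ref[MAX_WORDS], *hyp[MAX_WORDS];
    int nr = tokenize(ref_s, ref);
    int nh = tokenize(hyp_s, hyp);
    double wer = 100.0 * edit_distance(ref, nr, hyp, nh) / nr;   /* Equation 3.27 */
    printf("WER = %.1f%%\n", wer);
    return 0;
}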


Chapter 4

Design and Implementation

This chapter describes in detail the design and implementation of a Keyword Based Interactive Speech Recognition System using PocketSphinx. A brief description of the CMU Sphinx technology and the reasons for selecting PocketSphinx are provided in this chapter.

Also introduced is the creation process of both the language model and the dictionary. Furthermore, several tools were created in order to ease the system's development, and they are introduced in this chapter. More information on these tools and their usage can be found in Appendix A.

4.1 The CMU Sphinx Technology

This section describes the Sphinx speech recognition system that is part of the CMU Sphinx technology. Also described is the PocketSphinx system, which is the ASR decoder library selected for this project.

4.1.1 Sphinx

Sphinx is a continuous-speech and speaker-independent recognition system developed in 1988 at Carnegie Mellon University (CMU) in order to overcome some of the greatest challenges within speech recognition: speaker independence, continuous speech and large vocabularies [7, 13]. This system is part of the open source CMU Sphinx technology, which provides a set of speech recognizers and tools that allow the development of speech recognition systems. This technology has been used to ease the development of speech recognition systems, as it saves developers from having to start from scratch.


The first Sphinx system has been improved over the years and three other versions have been developed: Sphinx2, Sphinx3 and Sphinx4. Sphinx2 is a semi-continuous HMM based system, while Sphinx3 is a continuous HMM based speech recognition system; both are written in the C programming language [19]. On the other hand, Sphinx4 provides a more flexible and modular framework written in the Java programming language [23]. Currently, only Sphinx3 and Sphinx4 are still under development.

Overall, the Sphinx systems comprise typical ASR components, as in Sphinx4, which has a Front End, a Decoder and a Linguist, as well as a set of algorithms that can be used and configured depending on the needs of the project. They provide part of the needed technology that, given an acoustic signal and a properly created acoustic model, language model and dictionary, decodes the spoken word sequence.

4.1.2 PocketSphinx

PocketSphinx is a large vocabulary, semi-continuous speech recognition library based on CMU's Sphinx2. PocketSphinx was implemented with the objective of creating a speech recognition system for resource-constrained devices, such as hand-held computers [8]. The entire system is written in C with the aim of producing fast-response and lightweight applications. For this reason, PocketSphinx can be used in live applications, such as dictation.

One of the main advantages of PocketSphinx over other ASR systems is that it has been ported and executed successfully on different types of processors, most notably the x86 family and several ARM processors [8]. Similarly, this ASR has been used on different operating systems such as Microsoft's Windows CE, Apple's iOS and Google's Android [24]. Additionally, the source code of PocketSphinx has been published by Carnegie Mellon University under a BSD-style license. The latest code can be retrieved from SourceForge¹.

In order to execute PocketSphinx, typically three input files need to be specified: the language model, the acoustic model and the dictionary. For more information about these three components please refer to Section 3.2.2. By default, the PocketSphinx toolkit includes at least one language model (wsj0vp.5000), one acoustic model (hub4wsj_sc_8k) and one dictionary (cmu07a.dic). Nonetheless, these files are intended to support very large vocabularies; for example, the dictionary² includes around 125,000 words.

Although there are not many scientific publications describing how the PocketSphinx library works, the official web page³ contains documentation that describes how to download and compile the source code, as well as how to create example applications.

¹ CMU Sphinx repository: http://sourceforge.net/projects/cmusphinx
² CMU dictionary: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
³ PocketSphinx: http://www.pocketsphinx.org


Similarly, there is a discussion forum where new developers are encouraged to ask questions. Probably the main drawback of the web page is that some sections do not seem to be updated regularly and the organization is not intuitive. In this regard, it can be cumbersome to navigate through the web page, as some documentation applies to both PocketSphinx and Sphinx4, while other parts apply only to Sphinx4.

In the end, we chose to use PocketSphinx because it provides a development framework that can be adapted to fit our requirements. For example, PocketSphinx has been tested on several embedded devices running different operating systems. For this reason, we consider it more feasible to adapt it than to create an ASR from scratch.

In this case, we can focus on creating a keyword-based language model and a dictionary that maximize the performance of our system. Similarly, we chose to create an application in which we can measure the performance of the ASR and interact with a computer using speech; the following section describes this application in more detail.

4.2 System Description

The development of this project was done considering that the system could be used by a robot located at a museum that can interact with people. The robot shall be able to listen and talk to a person based on keywords identified in the speaker's utterances. However, the robot shall never remain quiet. This means that if no keyword is identified, it shall still be able to play additional information about the museum. For this project, the robot was assumed to be located at the Swedish Tekniska Museet⁴.

Therefore, the system needs to be able to identify previously defined keywords in a spoken sentence, and it should be speaker independent. Moreover, the system needs to be an interactive speech recognition system, in the sense that it shall be able to react based on an identified keyword. In this case, an audio file with information related to the keyword is played in response. Otherwise, if the system does not identify any keyword in the decoded speech, it shall play a default audio file.

Furthermore, the system needs to be portable, as its main goal is to be used in an embedded application. Therefore, as explained in Section 4.1.2, PocketSphinx was selected from the CMU Sphinx family of decoders, as it provides the required technology to develop this type of system.

⁴ Tekniska Museet: http://www.tekniskamuseet.se


4.3 Keyword List

The keyword list comprises the words of relevance that are selected to be identified by the system. This list shall most likely contain the words that a speaker is most likely to pronounce. Therefore, they shall be selected according to the area of interest. For this project, the keywords were selected according to the current exhibitions at the museum.

For the definition of the keywords, a keyword file was created. This file contains the list of all the words that shall be identified in the incoming speech signal. Each line of the file shall contain the keyword, followed by the number of audio files available for playing whenever that keyword is identified by the system.

An extra line is also added at the end of this file stating the number of available audio files to be played whenever the system is not able to identify a keyword. The word DEFAULT, preceded by a hash symbol, is used for this purpose. An example of the format of this file can be seen in Figure 4.1. The example has Keywords 1 to n, each with five available audio files, as well as ten default audio files.

Figure 4.1: Keyword File Example

4.4 Audio Files

4.4.1 Creating the audio files

The information that is played whenever a keyword is identified by the system comes from selected sentences converted from text to speech. Thus, to convert the sentences into speech, a Text to Speech (TTS) system is required. There are different TTS systems available, such as eSpeak [4] or FreeTTS [5]. However, for this project we selected Google's TTS, which is accessed via Google's Translate [22] service, as it is easy to use and, more importantly, due to its audio quality.


Therefore, in order to ease the creation of the audio files containing the responses to be given by the system, the RespGenerator tool was developed and written in Java. Prior to using this tool, the desired sentences per keyword, as well as the default ones, shall be written in a text file. The file should first contain a line with the keyword preceded by a hash symbol, followed by the lines with its associated sentences. An example of the format of this file, containing Keywords 1 to n and their sentences, can be seen in Figure 4.2.

Figure 4.2: Sentences for Audio Files Example

Once the file containing the sentences is ready, the tool reads them from the file, connects to Google's TTS and retrieves the generated audio files, all this by using the GNU wget⁵ free software package. The generated audio files are placed under a folder with the name of the keyword they belong to, and they are numerically named from 1 to n, where n is the number of available audio files per keyword. Figure 4.3 illustrates the process of converting the sentences into audio files.

⁵ GNU wget: http://www.gnu.org/software/wget/


Figure 4.3: Conversion of Sentences into Audio Files

The RespGenerator tool has only one mode of operation, in which the user must specify the path and name of the file containing the sentences to be used as answers. Additionally, the user must specify the desired path for the output audio files.

Regarding the wget tool, it needs to be placed in the same folder as the RespGenerator tool in order to generate the audio files correctly. On Linux operating systems, wget is installed by default; on Windows operating systems, the user needs to download the executable and place it in the appropriate folder.

The following command provides an example of how to execute the RespGenerator tool:

java -jar RespGenerator.jar ..\SentencesFile.txt ..\OutputPath\

As can be seen, the first parameter corresponds to the path and name of the file containing the sentences, and the second parameter is the path for the output audio files. In case the user specifies fewer than two parameters, the tool will display a message indicating that there was an error. For example:

Error: less than 2 arguments were specified

On the other hand, when the two inputs are correctly specified, the tool will generate the output audio files and display a message indicating that it was executed correctly:

kwords.txt file created succesfully!
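Internally, the fetch step can be pictured roughly as follows. This is a hypothetical sketch in C (the actual RespGenerator is a Java tool); the translate_tts endpoint pattern shown here is an assumption based on the service available at the time of writing, the fixed output naming is ours, and the query text would need proper URL encoding in a real implementation.

#include <stdio.h>
#include <stdlib.h>

/* Fetch one synthesized sentence with wget and store it under
 * <keyword>/<index>, mirroring the folder layout described above.
 * The endpoint pattern is an assumption; the real tool also handles
 * URL encoding and error checking.                                   */
static int fetch_tts(const char *keyword, int index, const char *encoded_sentence)
{
    char cmd[1024];
    snprintf(cmd, sizeof(cmd),
             "wget -q -U Mozilla -O %s/%d.mp3 "
             "\"http://translate.google.com/translate_tts?tl=en&q=%s\"",
             keyword, index, encoded_sentence);
    return system(cmd);    /* non-zero typically indicates a failed download */
}

int main(void)
{
    /* hypothetical example sentence for the keyword TECHNOLOGY */
    return fetch_tts("TECHNOLOGY", 1, "Technology%20is%20the%20usage%20of%20tools");
}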


Additionally, the RespGenerator tool creates a text file containing the list of keywords and the corresponding number of audio files created per keyword. This file is used by the ASR application in order to know how many possible responses are available per keyword, as well as how many default responses can be used.

4.4.2 Playing the audio files

Initially, the ASR system was designed to play MP3 files, but it was later changed to play OGG files, as it is a completely open and free format⁶. In order to play the audio files, it is necessary to use an external library. For that, the BASS audio library⁷ was used.

The BASS audio library allows the streaming of different audio file types, including OGG. Moreover, the library is free for non-commercial use and it is available for different programming languages, including C. Furthermore, one of the main advantages of this library is that everything is contained within a small DLL file of only 100 KB.

⁶ OGG: http://www.vorbis.com/
⁷ BASS Audio Library: http://www.un4seen.com/bass.html

4.5 Custom Language Model and Dictionary

Although PocketSphinx can be used in applications supporting vocabularies composed of several thousands of words, the best performance in terms of accuracy and execution time is obtained using small vocabularies [12]. After downloading and compiling the source code for PocketSphinx, we proceeded to run a test program using the default language model, acoustic model and dictionary. Nonetheless, the word accuracy of the application tended to be very low. In other words, most of the speech sentences used as inputs were not recognized correctly.

After researching the documentation for PocketSphinx, we found that it is recommended to use a reduced language model and a custom dictionary with support for a small vocabulary. The smaller the vocabulary, the faster PocketSphinx decodes the input sentences, as the search space of the algorithms used by PocketSphinx gets smaller. Similarly, the accuracy of the ASR becomes higher when the vocabulary is small. For example, there is an example application included with PocketSphinx that uses a vocabulary containing only the numbers from 0 to 9. In this case, the overall accuracy is in the range of 90% to 98%.

On the other hand, it is recommended to use one of the default acoustic models included with PocketSphinx. The main reason is that the default acoustic models have been created using huge amounts of acoustic data containing speech from several persons. In other words, the default acoustic models have been carefully tuned to be speaker independent.


If for some reason the user creates a new acoustic model, or adapts an existing one, the acoustic training data needs to be carefully selected in order to avoid speaker dependence.

The CMUSphinx toolkit provides methods to adapt an existing acoustic model or even create a new one from scratch. However, in order to create a new acoustic model, it is required to have large quantities of speech data from different speakers and the corresponding transcript for each training sentence. For dictation applications it is recommended to have at least 50 hours of recordings from 200 speakers⁸.

⁸ CMUCLMTK toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html

4.5.1 Keyword-Based Language Model Generation

The CMUSphinx toolkit provides a set of applications aimed at helping the developer create new language models; this group of applications is called the Statistical Language Model Toolkit (CMUCLMTK). The purpose of this toolkit is to take a text corpus as input and generate a corresponding language model. The text corpus is a group of delimited sentences that is used to determine the structure of the language and how each word is related to the others. For more information related to language models please refer to Section 3.2.2. Figure 4.4 illustrates the basic usage of the CMUCLMTK.

Figure 4.4: Statistical Language Model Toolkit

Alternatively, there is an online version of the CMUCLMTK toolkit, called lmtool. In order to execute this tool, the user uploads a text corpus to a server using a web browser and the tool generates a language model. However, this tool is intended to be used with small text corpora containing at most a couple of hundred sentences. This limitation became a problem for us, as most of the time our web browser timed out before the tool was able to generate a language model. For this reason we decided to use the CMUCLMTK toolkit instead.

Besides generating a language model from a text corpus, the CMUCLMTK toolkit generates several other useful files, such as the .vocab file, which lists all the words found in the vocabulary. Similarly, the .wfreq file lists all the words in the vocabulary and the number of times they appear in the text corpus. Finally, the toolkit generates the language model in two possible formats: the ARPA format (.arpa), which is a text file, or the CMU binary format (.DMP).


Both formats can be used by PocketSphinx; however, the binary format is preferred since the files are much more compact.

There are many ways to generate the input text corpus for the CMUCLMTK toolkit. Probably the simplest is to write a number of sentences by hand. Nevertheless, this method can easily become unfeasible, especially when the language model generation requires a fairly large number of input sentences. For this reason, it is preferable to use a different method to generate the text corpus automatically.

In order to optimize the generation of the language models, we created a tool called TextExtractor. This tool creates a text corpus based on texts from the internet. More specifically, it downloads text from web pages based on a set of input Uniform Resource Locators (URLs). Additionally, the TextExtractor filters the raw text, expands abbreviations and acronyms, converts numbers to their word representation and removes special characters, among other things.

Furthermore, as illustrated in Figure 4.5, the TextExtractor can receive a list of keywords in order to insert into the text corpus only the sentences containing keywords. This is particularly helpful to keep the language model compact even when the number of sentences in the selected web pages is large.

Figure 4.5: TextExtractor and CMUCLMTK

In general, the TextExtractor tool has two modes of operation. The simplest way to use the tool is by providing two arguments, the list of URLs to be used to generate the text corpus and the name of the output file, for example:

java -jar TextExtractor.jar ..\URL_List.txt ..\TCorpus.txt

In this case, the result of executing the tool will be a text corpus called TCorpus.txt. This file will include a list of sentences downloaded from the list of URLs. Additionally, the sentences will not contain punctuation symbols, such as commas and colons, and all of the numbers will be replaced by their word representation.


In the second mode of operation, the tool receives three arguments: the two arguments previously described and a third argument, which is a list of keywords to be used to filter the output text corpus. As explained previously, the text corpus will then only include sentences that contain one or more keywords. For example:

java -jar TextExtractor.jar ..\URL_List.txt ..\TCorpus.txt ..\Keyword_List.txt

In both operation modes, the tool will display an error message if it finds a problem during its execution. Otherwise, a success message will be displayed when the tool has completed its execution successfully. For example:

Output Text file(s) created

Using both the TextExtractor and the CMUCLMTK tools allows generating language models in a quick and easy way. This gave us the opportunity to evaluate the performance of several language models in order to choose the most appropriate one given the requirements of our application. Please refer to Chapter 5 to find out more about the evaluation of the different language models.

The TextExtractor tool was entirely written in Java in order to be usable on different platforms, while the CMUCLMTK toolkit is written in C. The CMUCLMTK toolkit can be compiled for different platforms, most notably Windows, SunOS and Linux. For this reason, it is important to choose the correct version of the tool before downloading it from the repository. Currently, there is an online document⁹ describing the main characteristics of the toolkit. However, it is not very detailed and the procedure to use the tools is inconsistently described in several parts.

The TextExtractor tool uses a public library called boilerpipe in order to download text from web pages. This library is publicly available and can be downloaded from the repository for Google code projects¹⁰.

⁹ CMUCLMTK toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
¹⁰ Boilerpipe: http://code.google.com/p/boilerpipe

4.5.2 Custom Dictionary Generation

Although the default dictionary file (cmu07a.dic) integrated with PocketSphinx includes more than 120,000 words, it is not feasible to use it in a small vocabulary application. For this reason we generated a custom dictionary with only the subset of words used by our language model. The size of the custom dictionary is around 1/40th of the size of the original dictionary. This allows the ASR application to handle the dictionary file much more easily and perform the speech decoding faster.
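The subsetting idea itself is straightforward: keep only those lines of the master dictionary whose head word appears in the vocabulary. The fragment below sketches this in C under simplifying assumptions; it is not the actual DictGenerator described next (which is a Java tool that also reports words that were not found), it assumes one word per line in the .vocab file, and it ignores alternate pronunciation entries such as WORD(2). The file names simply reuse the examples from this chapter.

#include <stdio.h>
#include <string.h>

#define MAX_VOCAB 5000
#define MAX_LINE  256

static char vocab[MAX_VOCAB][64];
static int  n_vocab;

static int in_vocab(const char *word) {
    for (int i = 0; i < n_vocab; i++)
        if (strcmp(vocab[i], word) == 0) return 1;
    return 0;
}

int main(void) {
    char line[MAX_LINE], head[64];

    /* 1. Load the vocabulary produced by the CMUCLMTK (.vocab file). */
    FILE *v = fopen("TCorpus.vocab", "r");
    while (v && n_vocab < MAX_VOCAB && fscanf(v, "%63s", vocab[n_vocab]) == 1)
        n_vocab++;
    if (v) fclose(v);

    /* 2. Copy only the matching entries of the master dictionary.    */
    FILE *in  = fopen("cmu07a.dic", "r");
    FILE *out = fopen("outputDict.dic", "w");
    while (in && out && fgets(line, sizeof(line), in)) {
        if (sscanf(line, "%63s", head) == 1 && in_vocab(head))
            fputs(line, out);          /* keep the word and its phonemes */
    }
    if (in)  fclose(in);
    if (out) fclose(out);
    return 0;
}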


The dictionary generator tool (DictGenerator) takes a custom language model and the cmu07a dictionary from PocketSphinx as inputs. The tool then generates a custom dictionary containing just the subset of words supported by the language model. This process can be seen in Figure 4.6.

Figure 4.6: Dictionary Generator Tool

Besides the custom dictionary file, the DictGenerator tool generates a text file containing a list of words not found in the cmu07a dictionary. When a word is not found in the dictionary, there are two main possible reasons, as specified in Table 4.1.

Cause: The word is written incorrectly or is written in another language.
Solution: The text corpus used to generate the language model needs to be edited to remove typos and words in other languages.

Cause: The word is written correctly, but is not included in the dictionary.
Solution: The user needs to edit the custom dictionary to include the missing word manually.

Table 4.1: Dictionary Words Not Found: Causes and Solutions

Most of the time, the words in the language model that are not found in the cmu07a dictionary contain typos. For this reason, there is rarely a need to edit the custom dictionary.

The DictGenerator has two main operation modes. In the first operation mode, it receives at least two arguments: one indicating the name and path of the input dictionary file (cmu07a.dic) and the second indicating the name and path of a file containing all the words in the custom vocabulary. For example:

java -jar dictGenerator.jar ..\cmu07a.dic ..\TCorpus.vocab


In this operation mode, the tool will generate a custom dictionary using a default name (outputDict.dic) and a file containing the words not found in the input dictionary (notfound.txt). By default, the two files will be created in the same folder as the DictGenerator tool.

In the second operation mode, the tool receives a third parameter. This allows the user to specify a name and path for the custom dictionary:

java -jar dictGenerator.jar ..\cmu07a.dic ..\TCorpus.vocab ..\TCorpus.dic

The main difference with respect to operation mode one is that the output file will be called TCorpus.dic and it will be stored in the path specified by the user. The other output file will have the same name, but it will be stored in the same location as the custom output dictionary. In both operation modes, the tool will display a message when it has finished its execution. For example:

Custom Dictionary was created succesfully!

The DictGenerator tool was written in Java in order to allow the user to execute it on different platforms. This is consistent with the other tools developed during this project.

4.5.3 Support for other languages

English is the default language used during the development of this project. However, there are language models, acoustic models and dictionaries available for other popular languages, such as German, French and Mandarin. All these models and dictionaries are publicly available and can be downloaded from the PocketSphinx repository¹¹.

¹¹ http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/

In order to support another language, a new language model, a new acoustic model and a new dictionary need to be created. The language model can be created in the same way as described previously, using the TextExtractor and CMUCLMTK tools. The only difference is that the web pages used as input need to be written in the desired language.

The generation of a new phonetic dictionary can be much more complex, as every word supported by the language model needs to be included in the dictionary. It is important to mention that every word needs to be translated into its corresponding ARPAbet representation in order to be used by PocketSphinx. For example, Table 4.2 illustrates the first five words of a phonetic dictionary for Spanish.


Word       ARPAbet representation
ÁBREGO     A B R E G O
ÁFRICA     A F R I K A
ÁLVAREZ    A L V A R E S
ÁLVARO     A L V A R O
ÁNFORAS    A N F O R A S

Table 4.2: Phonetic Dictionary

Building a detailed phonetic dictionary for a language can be a very time-consuming and challenging task, as the developer needs to have a very good understanding of the language and its pronunciation. In fact, it is recommended to get help from a group of linguists in order to aid the developers during the generation of the dictionary.

In contrast, a new acoustic model can be trained automatically, provided there are enough recordings (an acoustic database) and the corresponding transcript for every single sentence is available. PocketSphinx includes a tool called SphinxTrain that is intended to be used to train new acoustic models. The tool receives four inputs, which can be seen in Figure 4.7.

Figure 4.7: SphinxTrain Tool


The inputs include the following files:

1. A set of recordings in .wav format, one file per sentence per speaker.
2. A transcript file containing each sentence in the acoustic database; the same file can be used for all the speakers.
3. A phonetic dictionary for the desired language.
4. A filler dictionary, which maps non-speech sounds in the recordings to corresponding speech-like sound units.

Besides downloading and compiling the SphinxTrain tool, it is required to have Perl and Python installed on the same computer in order to execute SphinxTrain. For more information regarding the installation and configuration of the SphinxTrain tool, please refer to the tutorial¹² published by Carnegie Mellon University.

¹² SphinxTrain tool: http://cmusphinx.sourceforge.net/wiki/tutorialam

4.6 Decoding System

The ASR system was written entirely in C in order to be able to use the PocketSphinx Application Programming Interface (API). Microsoft Visual Studio 2008 was used as the Integrated Development Environment (IDE). This section describes the inputs and outputs, as well as the main algorithm used by the ASR system to perform the decoding of the speech signal. In addition, the interaction between the ASR decoder system and the tools that were developed during its implementation is explained at the end of this section.

4.6.1 Inputs

The acoustic model, the language model and the dictionary are the three main inputs the ASR system requires in order to perform the decoding of the incoming speech signal. For this project, the default acoustic model included with PocketSphinx is used by the system. On the other hand, a new language model as well as a new dictionary were created from scratch following the steps specified in Sections 4.5.1 and 4.5.2.

4.6.2 Outputs

The ASR system outputs the decoded incoming speech signal by providing a hypothesis that best describes the speaker's utterance. However, since it is a keyword-based system, it also outputs the identified keyword. In addition, a file containing the results obtained during the ASR system's execution is created. The file contains the hypothesis and keywords obtained for every speech signal that was decoded.
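As a rough sketch of how these three inputs are passed to the decoder, the fragment below initializes PocketSphinx with an acoustic model, a language model and a dictionary and decodes one pre-recorded utterance. It follows the general shape of the C API available at the time of writing (the 0.x releases); exact function signatures vary between PocketSphinx versions, and the file paths are placeholders that here mirror the files used in this project.

#include <pocketsphinx.h>
#include <stdio.h>

int main(void)
{
    /* Paths are placeholders; in this project they point to the default
     * hub4wsj_sc_8k acoustic model and the custom keyword-based
     * language model and dictionary.                                    */
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm",  "model/hmm/en_US/hub4wsj_sc_8k",
        "-lm",   "model/lm/en/tekniska/TextCorpusFiltered.lm.DMP",
        "-dict", "model/lm/en/tekniska/TextCorpusFiltered.dic",
        NULL);
    ps_decoder_t *ps = ps_init(config);
    if (ps == NULL) {
        fprintf(stderr, "Failed to initialize the decoder\n");
        return 1;
    }

    /* Decode one raw, single-channel, 16 kHz, 16-bit audio file. */
    FILE *raw = fopen("utterance.raw", "rb");
    if (raw != NULL) {
        ps_decode_raw(ps, raw, "utt01", -1);
        int32 score;
        char const *uttid;
        char const *hyp = ps_get_hyp(ps, &score, &uttid);
        printf("Hypothesis: %s\n", hyp ? hyp : "(none)");
        fclose(raw);
    }

    ps_free(ps);
    return 0;
}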


4.6.3 Algorithm

The following describes the decoding process of the ASR system, which comprises four main states: Ready, Listening, Decoding and Scanning. The states represent the steps needed to transform the live speech signal into a word sequence and properly identify a keyword.

A state shall only start after the completion of the previous state. After the final state has been reached, the system returns to the first state without the need to restart the system. This process is described below and is illustrated in Figure 4.8.

1. Ready. At the beginning, the system initializes the audio input device in order to start listening to the speaker. The system also loads the acoustic model, the language model and the dictionary in order to later perform the decoding of the signal. In this state the available keywords and their number of available audio files are also loaded into the system.

   Once the startup has finished, the system plays the message "I'm listening" and indicates that it is ready to start listening to the speaker. This message is only played during the system's startup; on subsequent iterations only the word Ready is printed on screen.

2. Listening. As soon as the speaker starts talking, the system listens to the incoming speech and performs the feature extraction until the speaker has finished. For this, the system waits for a long silence to occur. This silence is detected by verifying whether one second has elapsed since the last spoken utterance. Once a long silence has been detected, the system stops listening and the decoding process takes place. Otherwise, the system continues listening until the end of the speaker's utterance is determined. The system indicates when it stops listening by printing Stopped listening on screen.

3. Decoding. In the decoding state, the received speech signal is decoded using the information from the feature vectors and the linguist in order to produce a result hypothesis. For this, the linguist uses the data from the acoustic model, the language model and the dictionary.

   • The acoustic model provides statistical information that is used to map the incoming sounds within the speech signal to phonemes and words.
   • The language model contains statistical information on how likely word sequences are. For this project, a 3-gram language model was used.
   • The dictionary provides the phonetic representation (pronunciation) of the words within the language model and is used to map phoneme sequences to written words.

   Please refer to Section 3.2.2 for more information on the linguist.


   Once the signal has been decoded, a hypothesis of what was spoken is given as the outcome. The hypothesis consists of the word sequence determined to be the best representation of the speaker's utterance.

4. Scanning. The provided hypothesis is then scanned in order to verify whether one of the words is a keyword from the predefined list. If a keyword is found, one of its associated audio files is randomly selected and played. Otherwise, one of the default audio files is randomly selected and played. After the audio file has been played, the ASR system is ready to listen to the speaker again. There is no need to restart the program to continue its execution.

In order to terminate the program's execution, the speaker can either say the word OUT or manually exit the program. The system plays the message "Bye" and the results file, including the hypotheses and keywords identified, is created.

Figure 4.8: Decoding Algorithm
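A minimal sketch of the Scanning state is shown below: the decoded hypothesis is searched for each keyword and a response file is chosen at random. It is an illustrative fragment only; the keyword table contents are placeholders standing in for the entries loaded from the keyword file in the Ready state, and the playback call is a stub rather than the BASS-based playback actually used by the system.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One entry of the keyword table loaded during the Ready state:
 * the keyword itself and how many response audio files it has.   */
typedef struct {
    const char *keyword;
    int n_audio_files;
} KeywordEntry;

static void play_ogg(const char *path) {
    /* stub: the real system streams the file with the BASS library */
    printf("playing %s\n", path);
}

/* Scan the hypothesis for a keyword and play a random response.
 * If no keyword is found, a default response is played instead.  */
static void scan_and_respond(const char *hyp,
                             const KeywordEntry *table, int n_keywords,
                             int n_default_files)
{
    char path[256];
    for (int i = 0; i < n_keywords; i++) {
        if (strstr(hyp, table[i].keyword) != NULL) {
            snprintf(path, sizeof(path), "%s/%d.ogg",
                     table[i].keyword, rand() % table[i].n_audio_files + 1);
            play_ogg(path);
            return;
        }
    }
    snprintf(path, sizeof(path), "DEFAULT/%d.ogg", rand() % n_default_files + 1);
    play_ogg(path);
}

int main(void) {
    const KeywordEntry table[] = { { "TECHNOLOGY", 5 }, { "STEAM ENGINE", 5 } };
    scan_and_respond("WHAT IS TECHNOLOGY", table, 2, 10);
    return 0;
}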


4.6.4 System Components

In order to better illustrate the interaction between the main system and all the developed tools, a component diagram is depicted in Figure 4.9. As can be seen in this diagram, the entire system comprises several files, four tools and the ASR decoder in order to operate properly. Some of the files are generated by the user and others are the result of executing some of the tools.

Figure 4.9: Component Diagram

The ASR decoder, which is the main system, uses the language model, the custom dictionary and the acoustic responses to operate. These are the result of executing three tools: the TextExtractor, the DictGenerator and the RespGenerator. At the same time, these tools require the user to properly generate the following files: the keyword file, the URL list, the master dictionary and the file with the response sentences.

On the other hand, the rEvaluator tool uses the output of the main system, the log file, as well as the verification file in order to execute properly. Finally, this tool generates the performance report that is used to evaluate the system's performance.


Chapter 5

Tests and Results

This chapter provides a detailed description of the testing procedure used to evaluate the performance and accuracy of the Keyword Based Interactive Speech Recognition System. Furthermore, it includes a set of results obtained using different language models and phonetic dictionaries. Additionally, this chapter includes a list of tests performed to verify the correct operation of the set of Java tools developed during this project.

5.1 Main ASR application

5.1.1 Performance Evaluation and KDA

After developing a keyword-based ASR application using PocketSphinx, the system was evaluated in order to quantify its accuracy and error rate. As the main purpose of the system is to identify keywords in the input utterances, it is not practical to measure the word error rate (WER) as in other ASR applications. Instead, we defined a parameter called Keyword Decoding Accuracy (KDA), which represents the percentage of keywords correctly identified by the ASR, given a set of utterances containing keywords:

    KDA = (Number of keywords decoded correctly / Total number of keywords) × 100    (5.1)

Given the KDA parameter, we focused on optimizing the language model and the phonetic dictionary in order to maximize the accuracy of the ASR application.

Many ASR tools developed using PocketSphinx evaluate their accuracy by reading audio streams (files). In other words, the audio files are directly loaded into memory and decoded by the ASR tool. This approach, depicted in Figure 5.1, can be somewhat unrealistic, as the audio decoded by the ASR is free of ambient noise or distortions caused by the environment.


Figure 5.1: Typical Test Setup

To evaluate the accuracy of our ASR application, the KDA was measured using live audio and the default input device (microphone) of our development workstation. This allowed testing the application in a realistic environment, although the accuracy might be slightly lower. Figure 5.2 illustrates this configuration.

Figure 5.2: KDA Test Setup

In order to ensure repeatability in the process of measuring the KDA, the set of test sentences articulated by the user was recorded and played back using the computer's sound card. This also allowed automating the measurement of the KDA for several language models without the need to have a person speaking into the microphone each time. Another advantage of this approach is that the ASR application does not need to be modified to alternate between the microphone and the line-in device. Figure 5.3 illustrates the automation of the evaluation process.

(a) 1st Step: Recording of Test Sentences
(b) 2nd Step: Evaluating the Performance

Figure 5.3: Automated Test Process


5.1.2 Test environment and setup

The first step in the testing process was selecting a list of keywords to generate the language model and the dictionary to be used by the ASR application. This was achieved by selecting 20 keywords from the web page of the Swedish Tekniska Museet (Museum of Science and Technology). It is important to mention that some keywords are actually composed of two words in order to test the application using more complex phrases. Table 5.1 comprises the list of keywords.

MECHANICAL WORKSHOP      POWER         NASA
STEAM ENGINE             ADVENTURE     MINE
MECHANICAL ALPHABET      ENERGY        RAILWAY
INDUSTRIAL REVOLUTION    SPORTS        WOMEN
RADIO STATION            EXHIBITION    TELEPHONE
INTERACTIVE EXPERIMENTS  TECHNOLOGY    INSPIRATION
TRAIN SET

Table 5.1: Keyword List

The second step in the testing process was defining a set of test sentences to measure the KDA of the application. In this case we recorded 5 sentences per keyword, which resulted in a total of 100 test sentences. The same set of sentences was recorded by two persons in order to evaluate the application's speaker independence. Each of the sentences was later used as input to the ASR application in order to evaluate its accuracy. Figure 5.4 shows some of the test sentences used to measure the KDA.

Figure 5.4: Test Sentences


After generating the set of test sentences, the next phase consisted of generating a language model and a dictionary using a list of web pages (URLs). During the first set of tests we only used web pages from the Tekniska Museet in order to build a very simple language model. After evaluating the accuracy of the first set of tests, we proceeded to add more URLs to the list in order to generate more elaborate language models. In this case, several articles from Wikipedia were used, as most articles contain large quantities of well-written sentences.

For the sake of automating the measurement of the KDA, we developed a tool called rEvaluator. This tool is written in Java and it receives a log file from the ASR application and the list of keywords included in the test sentences. The tool generates a comma-separated report file in which the KDA metric can be identified for each keyword and for the overall set of test sentences. The comma-separated report can easily be imported into Excel in order to visualize and compare the KDA metrics. Figure 5.5 depicts the operation of this tool.

Figure 5.5: rEvaluator Tool

Table 5.2 depicts an example of the output report using 5 keywords. As can be seen, it lists the number of correctly decoded keywords, the number of sentences per keyword and the KDA.

Keyword           Matches   Sentences   KDA (%)
TECHNOLOGY        5         5           100
TELEPHONE         4         5           80
STEAM ENGINE      3         5           60
RADIO STATION     5         5           100
ENERGY            5         5           100
OVERALL METRICS   22        25          88

Table 5.2: KDA Report Format
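For illustration, the fragment below computes per-keyword and overall KDA figures, following Equation 5.1, from match and sentence counts such as those in Table 5.2. It is a simplified sketch in C; the actual rEvaluator is a Java tool that derives these counts by comparing the ASR log file against the verification file before writing the comma-separated report.

#include <stdio.h>

typedef struct {
    const char *keyword;
    int matches;     /* keywords decoded correctly              */
    int sentences;   /* test sentences containing the keyword   */
} KdaRow;

int main(void) {
    /* counts taken from the example report in Table 5.2 */
    KdaRow rows[] = {
        { "TECHNOLOGY",    5, 5 }, { "TELEPHONE",     4, 5 },
        { "STEAM ENGINE",  3, 5 }, { "RADIO STATION", 5, 5 },
        { "ENERGY",        5, 5 },
    };
    int n = sizeof(rows) / sizeof(rows[0]);
    int total_matches = 0, total_sentences = 0;

    printf("Keyword,Matches,Sentences,KDA(%%)\n");
    for (int i = 0; i < n; i++) {
        double kda = 100.0 * rows[i].matches / rows[i].sentences;   /* Equation 5.1 */
        printf("%s,%d,%d,%.0f\n", rows[i].keyword,
               rows[i].matches, rows[i].sentences, kda);
        total_matches   += rows[i].matches;
        total_sentences += rows[i].sentences;
    }
    printf("OVERALL METRICS,%d,%d,%.0f\n", total_matches, total_sentences,
           100.0 * total_matches / total_sentences);
    return 0;
}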


Using the rEvaluator tool, we gathered results using different language models and a different number of URLs. Table 5.3 shows the performance of the ASR application.

Test   URLs   Vocab Size (Words)   KDA Speaker 1   KDA Speaker 2   Overall KDA (%)
1      15     1174                 56.5            59              57.75
2      19*    2679                 59              64              61.5
3      19     1650                 68              70              69
4      23     2596                 78              82              80
5      35     3562                 83              87              85

Table 5.3: Overall Test Results

*Note. For test number 2, the input text corpus was not filtered. In other words, it included some sentences that did not contain keywords. This is why the vocabulary size in test number 3 is smaller even though the number of URLs remained the same.

As can be seen from the results table, the overall KDA increased as the number of URLs (and the vocabulary size) increased. However, as the vocabulary size increased, the execution time of the ASR application increased as well. For this reason, we decided to keep the vocabulary size at around 3,500 words in order to keep the execution time low.

During the final round of tests, the ASR application achieved a keyword decoding accuracy of 85%. In other words, around 85 of the 100 keywords in the test sentences were decoded correctly, while maintaining an average execution time of around 0.5 seconds per sentence. Regarding the maximum KDA achieved, it must be clarified that none of the persons participating in the tests speaks English as their first language. As explained previously, this could adversely affect the overall performance of the application.

Finally, the system was executed and tested using two workstations; Table 5.4 lists the main characteristics of the systems and their architecture.

System            Architecture   OS          Processor                RAM
Dell Studio 1555  x86            Vista SP2   Intel Core 2 Duo T6500   3.0 GB
Dell XPS 1640     x86 64         Vista SP2   Intel Core 2 Duo P8600   4.0 GB

Table 5.4: Computer Specifications

Alternatively, during the preliminary stages of development, the application was also compiled and run on Linux using OpenSUSE 11.4 and the same Dell XPS 1640 workstation with an external hard drive. However, during the later stages we continued the development of the application using Windows Vista and Visual Studio 2008.


5.2 Auxiliary Java Tools

In order to verify the correct operation of tools such as the TextExtractor, RespGenerator and DictGenerator, a list of tests was designed. This verification phase involved testing all of the different operation modes of each tool, as well as verifying their output files. The following sections describe the different tests performed for each tool.

5.2.1 TextExtractor Tool

For the TextExtractor tool, two operation modes were tested. For the first operation mode, it was verified that the tool was able to generate a text corpus from a list of URLs. Furthermore, it was verified that the list of sentences included in the corpus was not filtered based on a list of keywords. Figure 5.6 shows the list of messages displayed in the command line after executing the tool:

Figure 5.6: TextExtractor Operation Mode 1

Similarly, the second operation mode was tested. In this case, it was verified that the list of sentences included in the text corpus was filtered based on a list of given keywords. Figure 5.7 shows the list of messages displayed in the command line after executing the tool:


Figure 5.7: TextExtractor Operation Mode 2

5.2.2 DictGenerator Tool

The testing of the DictGenerator tool was more straightforward, as there is just one difference between the operation modes. In the first mode, the user does not define a name for the generated dictionary, while in the second mode the user specifies a path and a name for the custom dictionary. In both cases the set of messages shown in the command line is the same. This can be seen in Figure 5.8. Similarly, Figure 5.9 shows the custom dictionary generated using each operation mode. As can be seen, operation mode 1 generates a dictionary with a default name (outputDict.dic).

Figure 5.8: DictGenerator Operation Modes 1 and 2


Figure 5.9: Generated Dictionary Files

5.2.3 RespGenerator Tool

A similar test was designed for the RespGenerator tool. In this case the Java application has only one operation mode. The user needs to provide the name of a file containing the sentences to be used as acoustic responses from the ASR system. Additionally, the user must specify the path where the audio files will be stored after executing the tool. This can be seen in Figure 5.10.

Figure 5.10: RespGenerator: Execution Messages

5.2.4 rEvaluator Tool

Finally, the rEvaluator tool was verified by testing that the KDA report was generated correctly given three input parameters. The first parameter corresponds to a log file generated by the ASR application. The second parameter is a text file listing the keywords corresponding to each decoded utterance. Finally, the third parameter is the name of the generated output report. Figure 5.11 shows the set of messages displayed in the command window when the tool was executed.

Figure 5.11: rEvaluator: Execution Messages


Chapter 6

Summary

This chapter recapitulates the main lessons learned and conclusions gathered during the development of the Keyword Based Interactive Automatic Speech Recognition System. Additionally, it highlights the main areas of improvement and possible topics for future work.

6.1 Conclusions

Automatic Speech Recognition technology has been in constant development during the last fifty years. The possibility of recognizing and understanding human speech has driven numerous research groups to develop powerful and complex systems able to use and handle extensive vocabularies. Furthermore, a number of commercial tools are now available for use in automotive and telephone-based applications, among others. However, the day when a machine is able to understand continuous and spontaneous speech still seems to be far away in the future.

The accuracy and performance of modern ASR systems is strictly related to the size of the vocabulary, the complexity of the algorithms used to decode the input speech and, finally, the size of the speech corpus utilized to train the system. For these reasons, the most accurate ASR systems are also the ones that require more computational power in order to work correctly. One important research area is related to the development of fast and efficient algorithms that decrease the processing power needed to execute large-vocabulary ASRs, for example the implementation of hybrid ASR systems that use neural networks and hidden Markov models (MLP-HMM). Another example is the utilization of A* (stack decoding) algorithms instead of the well-known Viterbi and beam-search algorithms to decode speech.

Although the task of recognizing and understanding speech is linked to several different research areas, such as language modeling and digital signal processing, the main building blocks of an ASR system are now well defined.


Thus, a research group can focus on improving one or more particular areas, such as the front end or the speech decoder. There is a small number of open-source ASR systems used by the research community to implement new solutions in the ASR domain and compare them against established benchmarks. Using an available tool such as PocketSphinx proved to be a good strategy, as it simplified the process of creating an interactive ASR application. In this regard, even though PocketSphinx is a free open-source library, its documentation is still somewhat incomplete compared to other Sphinx decoders, most notably Sphinx4. Nonetheless, one of the main advantages of PocketSphinx is that it can be compiled for different platforms, such as Windows, Linux, iOS and Android.

Furthermore, PocketSphinx was designed to handle live audio applications such as dictation. For this reason, it is the fastest decoder of the CMU Sphinx family. Nevertheless, it is also one of the least flexible compared to Sphinx4, which is written in Java. Consequently, trying to change or improve the algorithms used by PocketSphinx can be a troublesome and complex task. In fact, in most applications the library is used without any changes made by the developers.

Although PocketSphinx can be used in large-vocabulary applications, the overall best performance in terms of execution time is reached using small vocabularies. In general, as the vocabulary gets smaller, the execution time gets lower as well. During the development of our keyword-based ASR application, we tested different language models and phonetic dictionaries in order to maximize the overall accuracy. In the final round of tests, the ASR application achieved a keyword decoding accuracy of 85% while maintaining an average execution time of around 0.5 seconds per sentence.

Since the early stages of development of the ASR system, we aimed to develop tools to automate the process of generating the configuration files needed to run the recognizer. In the end we successfully created tools to generate the language model, the dictionary and the set of predefined responses to be played when a keyword is identified. In a similar fashion, the process of evaluating the accuracy (KDA) was automated. As a result of this automation, it is possible to change the set of keywords and generate the corresponding configuration files in no more than a couple of hours. Furthermore, it is possible to run the application and use the new files without recompiling the project.

6.2 Future Work

Although we managed to successfully complete the development of the keyword based ASR system, there are some areas that can be improved in order to enhance its operation. For example, the application can be updated to support more languages. In this regard, the main focus would be the creation of a phonetic dictionary for the desired language. Unfortunately, this can become an extensive task, as it requires a deep understanding of the given language.


Regarding the ASR decoder, it would be desirable to experiment with other decoding strategies, such as hybrid hidden Markov models that incorporate multilayer perceptrons (MLP-HMM). Similarly, it would be desirable to test more decoding algorithms, such as stack (A*) decoding. However, the lack of flexibility of PocketSphinx might represent a problem when trying to change or improve the current algorithms.

Even though the overall accuracy of the speech recognizer application is fairly high, it can still be improved. For this reason, more testing is needed in order to identify areas of improvement. For example, it would be beneficial to quantify the effect of speech variability, such as the speaker's accent, on the overall accuracy of the system.

Finally, it would be recommended to cross-compile the ASR project in order to run it on an embedded system. During the development of the project we were able to execute the speech recognition application on Windows and Linux; however, it would be interesting to evaluate the application's performance using other operating systems and different hardware.


Bibliography

[1] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, L. Fissore, P. Laface, A. Mertins, C. Ris, et al. Automatic speech recognition and speech variability: A review. Speech Communication, 49(10-11):763–786, 2007.

[2] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Acoustics, Speech and Signal Processing, IEEE Transactions on, 28(4):357–366, 1980.

[3] H.L. Doe. Evaluating the effects of automatic speech recognition word accuracy. PhD thesis, Virginia Polytechnic Institute and State University, 1998.

[4] eSpeak: Speech Synthesizer. http://espeak.sourceforge.net, May 2011.

[5] FreeTTS. http://freetts.sourceforge.net, May 2011.

[6] J.P. Haton. Speech analysis for automatic speech recognition: A review. In Speech Technology and Human-Computer Dialogue, 2009. SpeD'09. Proceedings of the 5th Conference on, pages 1–5. IEEE.

[7] X. Huang, F. Alleva, M.Y. Hwang, and R. Rosenfeld. An overview of the SPHINX-II speech recognition system. In Proceedings of the Workshop on Human Language Technology, pages 81–86. Association for Computational Linguistics, 1993.

[8] D. Huggins-Daines, M. Kumar, A. Chan, A.W. Black, M. Ravishankar, and A.I. Rudnicky. PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 1, pages I–I. IEEE, 2006.

[9] B.H. Juang and S. Furui. Automatic recognition and understanding of spoken language – a first step toward natural human-machine communication. Proceedings of the IEEE, 88(8):1142–1165, 2000.

[10] B.H. Juang and L.R. Rabiner. Automatic speech recognition – A brief history of the technology development. Encyclopedia of Language and Linguistics, Elsevier, 2005.


[11] D. Jurafsky, J.H. Martin, A. Kehler, K. Vander Linden, and N. Ward. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, volume 163. MIT Press, 2000.

[12] A. Kumar, A. Tewari, S. Horrigan, M. Kam, F. Metze, and J. Canny. Rethinking speech recognition on mobile devices.

[13] K.F. Lee, H.W. Hon, and R. Reddy. An overview of the SPHINX speech recognition system. Acoustics, Speech and Signal Processing, IEEE Transactions on, 38(1):35–45, 1990.

[14] M. Matassoni, M. Omologo, D. Giuliani, and P. Svaizer. Hidden Markov model training with contaminated speech material for distant-talking speech recognition. Computer Speech & Language, 16(2):205–223, 2002.

[15] N. Morgan and H. Bourlard. Continuous speech recognition. Signal Processing Magazine, IEEE, 12(3):24–42, 1995.

[16] D. O'Shaughnessy. Interacting with computers by voice: automatic speech recognition and synthesis. Proceedings of the IEEE, 91(9):1272–1305, 2003.

[17] D.B. Paul. An efficient A* stack decoder algorithm for continuous speech recognition with a stochastic language model. In ICASSP, pages 25–28. IEEE, 1992.

[18] S. Phadke, R. Limaye, S. Verma, and K. Subramanian. On design and implementation of an embedded automatic speech recognition system. In VLSI Design, 2004. Proceedings. 17th International Conference on, pages 127–132. IEEE, 2004.

[19] P. Placeway, S. Chen, M. Eskenazi, U. Jain, V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K. Seymore, M. Siegler, et al. The 1996 Hub-4 Sphinx-3 system. In Proc. DARPA Speech Recognition Workshop, pages 85–89. Citeseer, 1997.

[20] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[21] M.K. Ravishankar. Efficient Algorithms for Speech Recognition. PhD thesis, Citeseer, 2005.

[22] Google Translate. http://translate.google.com, May 2011.

[23] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel. Sphinx-4: A flexible open source framework for speech recognition. Sun Microsystems, Inc., Mountain View, CA, USA, page 18, 2004.

[24] CMUSphinx Wiki. http://cmusphinx.sourceforge.net/wiki, May 2011.


Appendix A

Guidelines

The following sections describe how to use the set of tools developed during the speech recognition project. Similarly, they describe how to install the PocketSphinx project and how to generate the files needed to run the main application. It is strongly recommended to follow the instructions section by section in order to take advantage of the given examples and the main program.

A.1 Installation of the speech recognizer project

In order to install PocketSphinx and the ASR recognizer application, the user must perform the following steps:

1. Extract the contents of the compressed file containing the complete ASR project into a new folder, for instance C:\ASRProject\. Inside the ASRProject folder, seven new folders will be created, as seen in Figure A.1.

Figure A.1: ASR Project Folder Structure


2. Go to the sphinxbase folder inside the ASRProject folder and open the sphinxbase.sln file using Visual Studio.

3. Once the sphinxbase project is open, go to the Build menu and select Build Solution. After building the project, Visual Studio should show the message "build succeeded".

4. Go to the folder \sphinxbase\bin\debug and copy the file sphinxbase.dll into the folder \pocketsphinx\bin\debug.

5. Go to the pocketsphinx folder inside the ASRProject folder and open the pocketsphinx.sln file using Visual Studio.

6. Once the pocketsphinx project is open, go to the Build menu and select Build Solution. After building the project, Visual Studio should show the message "build succeeded".

7. The recognizer application is now ready to be run using a language model and a dictionary.

A.2 Running the speech recognition application

Once Pocketsphinx has been installed correctly, the user can execute the recognizer by specifying an acoustic model, a language model and a dictionary. The following steps must be performed to execute the application:

1. Open a command window, go to \ASRProject\pocketsphinx\bin\debug and type the following command:

recognizer -hmm ../../model/hmm/en_US/hub4wsj_sc_8k/
           -lm ../../model/lm/en/tekniska/TextCorpusFiltered.lm.DMP
           -dict ../../model/lm/en/tekniska/TextCorpusFiltered.dic

Where:
• hub4wsj_sc_8k is the folder containing the acoustic model to be used by the application.
• TextCorpusFiltered.lm.DMP is the language model.
• TextCorpusFiltered.dic is the phonetic dictionary.

2. The command window will display a set of diagnostic messages and the computer will play the sound message "I'm listening". Additionally, the command window will display "READY..." as shown in Figure A.2.


Figure A.2: Message

3. The user should now speak a phrase containing any of the predefined keywords supported by the program, for example: "What is TECHNOLOGY?"

4. The computer should decode the input utterance and identify the specified keyword TECHNOLOGY. It will then play one of the predefined responses for that keyword, for example: "Technology is the usage and knowledge of tools and systems in order to solve a problem". At the same time, some diagnostic messages will be displayed, as shown in Figure A.3.

Figure A.3: Decoding Example

5. After the computer finishes playing the response message, it will show "READY..." in the command window again and the user can say another phrase. The whole process is repeated until the user commands the computer to close the application by saying the command "out".

6. Once the command is identified, the computer will play an exit message ("bye") and the application will close. This is illustrated in Figure A.4.

Figure A.4: Decoding Example

A.3 Generating a new language model

To generate a new language model, the user should create two text files:

• One file should contain the list of keywords to be used by the recognizer application.
• One file should contain a list of URLs where the keywords are used extensively.

For example, the word NASA can be found many times in the URL http://en.wikipedia.org/wiki/NASA.
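As a minimal illustration only, assuming one entry per line (the authoritative examples are the Keyword_List.txt and URL_List.txt files shipped in the project root and shown in Figure A.5), the two files could look like this:

Keyword_List.txt:
NASA

URL_List.txt:
http://en.wikipedia.org/wiki/NASA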


Currently there is no limit on the number of URLs supported, so the user can specify more than one URL per keyword. An example of these files is shown in Figure A.5, where each keyword has an associated URL.

Figure A.5: Keywords and URLs

For simplification purposes, the files Keyword_List.txt and URL_List.txt can be edited in order to specify keywords and URLs. These two files are located in the root folder of the project. The next step uses the TextExtractor tool and the CMUCLMTK tool to generate a new language model. This can be achieved using the following procedure:

1. Open a command window and go to the \ASRProject\Tools folder.

2. Generate a text corpus using the TextExtractor tool by executing the following command:

java -jar TextExtractor.jar ..\URL_List.txt
     ..\GeneratedFiles\LanguageModels\TCorpus.txt
     ..\Keyword_List.txt

Where:
• The first and third arguments specify the input files.
• The second argument specifies the name of the text corpus.

3. After executing the tool, a file called TCorpusFiltered.txt will be created in the LanguageModels folder.

4. Now generate the language model with the CMUCLMTK tool by executing the following command:

GenerateLM.bat ..\GeneratedFiles\LanguageModels\TCorpusFiltered.txt
               ..\GeneratedFiles\LanguageModels\TCorpus.ccs

Where:
• The first argument corresponds to the text corpus just created.


• The second argument is the filler dictionary (the .ccs file), which specifies the meaning of the <s> and </s> markers in the text corpus.

Note: the GenerateLM.bat file needs to be edited to specify the paths of the CMUCLMTK and sphinxbase folders.

5. After executing the CMUCLMTK tool, the TCorpusFiltered.lm.DMP file will be created in the LanguageModels folder. This file contains a new language model that uses the set of defined keywords.

A.4 Generating a new dictionary

The process of generating a new dictionary is straightforward, as only one command needs to be executed. Assuming the user has a command window open, the following command has to be executed in the \ASRProject\Tools folder:

java -jar dictGenerator.jar ..\pocketsphinx\model\lm\en_US\cmu07a.dic
     ..\GeneratedFiles\LanguageModels\TCorpusFiltered.vocab
     ..\GeneratedFiles\LanguageModels\TCorpusFiltered.dic

Where:
• The first argument is the name of the CMU phonetic dictionary.
• The second argument is the name of the file containing the vocabulary used in the text corpus.
• The third argument is the name of the output dictionary.

It is important to clarify that cmu07a.dic is a file included in the Pocketsphinx package, while TCorpusFiltered.vocab is a file created when the CMUCLMTK tool generates a language model.

A.5 Generating a new set of predefined responses

A new set of predefined responses (audio files) can be created using the RespGenerator tool whenever the set of keywords is changed. The tool receives a text file containing a variable number of predefined responses for each keyword. An example of this file is presented in Figure A.6.

As can be seen there, each keyword is preceded by a hash (#) character and followed by a number of sentences. A list of default sentences is also included, although DEFAULT is not a keyword; these sentences are used as responses when the speech recognizer application is not able to decode any keyword from the given input utterance.


Figure A.6: Example of a Set of Predefined Responses

For simplification purposes, the file Keyword_Responses.txt can be edited in order to specify the set of possible responses for every keyword. This file is located in the root folder of the project. The RespGenerator tool can then be used to create the new sound files using the following procedure:

1. Open a command window and go to the \ASRProject\Tools folder.

2. Execute the RespGenerator tool using the following command:

java -jar RespGenerator.jar ..\Keyword_Responses.txt
     ..\GeneratedFiles\AcousticResponses\

Where:
• The first argument is the file containing the possible responses.
• The second argument is the destination folder for the output sound files.

3. When the tool has finished generating the output files, the AcousticResponses folder should contain a list of folders whose names correspond to the keywords, as in Figure A.7. Inside each folder, there should be a number of mp3 files containing the responses.

Figure A.7: Responses Structure


4. Using a file converter tool such as Audacity, convert all the generated mp3 files into ogg format, making sure the new files have the same names as the mp3 files (a scripted alternative to the manual conversion is sketched at the end of Section A.6).

Figure A.8: OGG Files

A.6 Running the speech recognizer using the newly created files

After generating a new language model and a new dictionary, the files created in the LanguageModels folder need to be copied into the Pocketsphinx folder. Similarly, the acoustic responses need to be incorporated into Pocketsphinx in order to be used by the recognizer application. This can be done from the command line. Assuming that the current folder is \ASRProject\Tools, the user should execute the following commands:

copy ..\GeneratedFiles\LanguageModels\TCorpusFiltered.dic ..\pocketsphinx\model\lm\en\tekniska\
copy ..\GeneratedFiles\LanguageModels\TCorpusFiltered.lm.DMP ..\pocketsphinx\model\lm\en\tekniska\
copy ..\GeneratedFiles\AcousticResponses\kwords.txt ..\pocketsphinx\bin\debug\
xcopy ..\GeneratedFiles\AcousticResponses\* ..\pocketsphinx\bin\debug\Responses /s /i

After the files have been placed in the corresponding Pocketsphinx folders, the recognizer application can be run using the newly created files. The user should go to the \ASRProject\pocketsphinx\bin\debug folder and then execute the recognizer:

recognizer -hmm ../../model/hmm/en_US/hub4wsj_sc_8k/
           -lm ../../model/lm/en/tekniska/TCorpusFiltered.lm.DMP
           -dict ../../model/lm/en/tekniska/TCorpusFiltered.dic
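The conversion in step 4 of Section A.5 can also be scripted instead of being done by hand in Audacity. The following is a minimal sketch under two assumptions: ffmpeg is installed and available on the PATH (it is not part of the thesis toolchain), and the OggConverter class below is only an illustrative helper, not one of the project tools. It walks the AcousticResponses tree and writes an .ogg file next to every generated .mp3:

import java.io.File;
import java.io.IOException;

// Illustrative helper (not one of the project tools): converts every .mp3 under the
// AcousticResponses tree into .ogg by calling ffmpeg, assumed to be on the PATH.
public class OggConverter {

    public static void main(String[] args) throws IOException, InterruptedException {
        convertTree(new File("..\\GeneratedFiles\\AcousticResponses\\"));
    }

    static void convertTree(File dir) throws IOException, InterruptedException {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return;
        }
        for (File f : entries) {
            if (f.isDirectory()) {
                convertTree(f); // one sub-folder per keyword
            } else if (f.getName().toLowerCase().endsWith(".mp3")) {
                // Keep the same base name; only the extension changes
                String target = f.getAbsolutePath().replaceAll("(?i)\\.mp3$", ".ogg");
                // ffmpeg selects the Vorbis encoder from the .ogg extension
                Process p = new ProcessBuilder("ffmpeg", "-y", "-i",
                        f.getAbsolutePath(), target).inheritIO().start();
                p.waitFor();
            }
        }
    }
}

Because the conversion keeps the mp3 base names, the xcopy command above picks up the new .ogg files without further changes.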


A.7 Evaluating the performance and measuring the KDA

Each time the speech recognizer application is run and closed, a log file is created. This file contains each decoded sentence as well as the keyword identified for each utterance. Using the rEvaluator tool, this file can be used to measure the accuracy (KDA) for the complete set of utterances.

First, the user needs to create a verification file containing the keyword used in each utterance that the recognizer tool had to decode. For example, suppose that just 4 utterances were used the last time the recognizer tool was run; the verification file should then look like Figure A.9.

Figure A.9: Verification File

Note: this file should be placed in the ResultsEvaluation folder. Similarly, assume that the speech recognizer tool generated a log file called HypLogFile_May25_004516.txt the last time it was run. To generate the results file, the user needs to follow this sequence of steps:

1. Copy the log file from the pocketsphinx folder to the ResultsEvaluation folder:

copy Logfiles\HypLogFile_May25_004516.txt ..\..\..\GeneratedFiles\ResultsEvaluation\

2. Execute the rEvaluator tool using the following command:

java -jar rEvaluator.jar
     ..\GeneratedFiles\ResultsEvaluation\HypLogFile_May25_004516.txt
     ..\GeneratedFiles\ResultsEvaluation\key_utterances.txt
     ..\GeneratedFiles\ResultsEvaluation\Results.csv

Where:
• The first argument is the log file created the last time the recognizer was run.
• The second argument is the verification file containing the keyword used for each input utterance.
• The third argument is the name of the comma-separated report file.
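To make the second argument concrete: the verification file simply lists the keyword expected for each utterance, one keyword per line. For the four utterances evaluated in Table A.1 below, an illustrative key_utterances.txt (an assumed example, not the exact contents of Figure A.9) could contain:

TECHNOLOGY
TELEPHONE
WOMEN
MECHANICAL ALPHABET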


After running the rEvaluator tool, the newly generated report file should contain measurements regarding the accuracy of the speech recognizer.

Keyword              Matches  Sentences  KDA (%)
TECHNOLOGY           1        1          100
TELEPHONE            1        1          100
WOMEN                1        1          100
MECHANICAL ALPHABET  0        1          0
OVERALL METRICS      3        4          75

Table A.1: KDA Report Format
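The KDA column is simply the percentage of utterances for which the expected keyword was detected, KDA = 100 x matches / sentences, so the overall figure in Table A.1 is 100 x 3 / 4 = 75%. The following is a minimal sketch of this computation, mirroring the integer arithmetic and comma-separated layout written by rEval.java in Section B.5; the class and variable names are illustrative, not part of the project tools:

// Minimal illustration of how the KDA figures in Table A.1 are computed.
// The actual implementation is rEval.java (Section B.5); names here are illustrative.
public class KdaExample {

    // Keyword detection accuracy in percent; integer division, as in the generated report
    static int kda(int matches, int sentences) {
        return (100 * matches) / sentences;
    }

    public static void main(String[] args) {
        String[] keywords = {"TECHNOLOGY", "TELEPHONE", "WOMEN", "MECHANICAL ALPHABET"};
        int[] matches   = {1, 1, 1, 0};   // utterances where the keyword was detected
        int[] sentences = {1, 1, 1, 1};   // utterances containing the keyword
        int totalMatches = 0, totalSentences = 0;

        for (int i = 0; i < keywords.length; i++) {
            System.out.printf("%s,%d,%d,%d%n",
                    keywords[i], matches[i], sentences[i], kda(matches[i], sentences[i]));
            totalMatches += matches[i];
            totalSentences += sentences[i];
        }
        // Overall KDA: 3 correctly detected keywords out of 4 utterances -> 75%
        System.out.printf("OVERALL METRICS,%d,%d,%d%n",
                totalMatches, totalSentences, kda(totalMatches, totalSentences));
    }
}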


Appendix B

Source Code

The following pages of this appendix include the source code for the main program and all the tools created during the development of this project:

• Recognizer - Main Application
• TextExtractor - Tool
• DictGenerator - Tool
• RespGenerator - Tool
• rEvaluator - Tool


B.1 Recognizer/* File: recognizer.cDescription: Adapted from pocketsphinx_continuous/continuous.c in order to developan interactive ASR system. The ASR system reacts based on identified"keywords" incoming speech signal. An *.OGG audio file with informationrelated to the keyword identified. Additional information is playedwhenever a keyword is not identified.Author(s): Ivan Castro, Andrea GarciaEmail(s): ivan.castro.c@gmail.com, andyggb@gmail.com*/#include #include #include #include "pocketsphinx.h"#include "err.h"#include "ad.h"#include "cont_ad.h"#include "bass.h"#if !defined(_WIN32_WCE)#include #include #endif#if defined(WIN32) && !defined(GNUWINCE)#include #else#include #include #endif#define NUM_KWORDS 20 //Number of available keywordsstatic const arg_t cont_args_def[] = {POCKETSPHINX_OPTIONS,/* Argument file. */{ "-argfile",ARG_STRING,NULL,"Argument file giving extra arguments." },{ "-adcdev", ARG_STRING, NULL, "Name of audio device to use for input." },CMDLN_EMPTY_OPTION};static ad_rec_t *ad;static ps_decoder_t *ps;//<strong>Keyword</strong> structuretypedef struct keyword{char *kWordName; //keywordint numSounds; //available audio files per keyword}<strong>Keyword</strong>;char *kWordList[NUM_KWORDS]; //List of existing keywordsint soundList[NUM_KWORDS]; //Number of sound recordings per keywordchar *unknownkWord; //Name of the dir with recordings to play in case of an unknown keywordint numUnknownSound; //Number of recordings to play in case of an unknown keywordconst char *keyFile = "kwords.txt"; //Name of the file containing the keywordsconst char *soundDir = "Responses/"; //Directory containing the recordings67


const char *startSoundFile = "Responses/EXTRA/EXTRA1.ogg"; //Recording to be played at startupconst char *endSoundFile = "Responses/EXTRA/EXTRA2.ogg"; //Recording to be played at the endFILE *hypLog; //File used to log information about the obtained hypothesisconst char *logFile; //String containing the log filename// Methods<strong>Keyword</strong> get<strong>Keyword</strong>(char const *hyp);static void load<strong>Keyword</strong>s();void playOGG(char const *oggFile);char const *selectOGG(char const *keyword, int numOGG);void terminate(int sig);static void sighandler(int signo);static void sleep_msec(int32 ms);static void utterance_loop();int main(int argc, char *argv[]);/* Play an OGG audio file */void playOGG(char const *oggFile){DWORD chan,act,time;QWORD pos;// check the correct BASS was loadedif (HIWORD(BASS_GetVersion())!=BASSVERSION) {printf("An incorrect version of BASS was loaded\n");return;}// setup output - default deviceif (!BASS_Init(-1,44100,0,0,NULL)){printf("Error(%d): Can’t initialize device\n",BASS_ErrorGetCode());BASS_Free();exit(0);}if(!(chan=BASS_StreamCreateFile(FALSE, oggFile, 0, 0, BASS_SAMPLE_FLOAT))){printf("Error(%d): Can’t play the file %s\n",BASS_ErrorGetCode(),oggFile);BASS_Free();return;}// play the audio fileBASS_ChannelPlay(chan,FALSE);while (act=BASS_ChannelIsActive(chan)){pos=BASS_ChannelGetPosition(chan,BASS_POS_BYTE);time=BASS_ChannelBytes2Seconds(chan,pos);sleep_msec(50);}}BASS_ChannelSlideAttribute(chan,BASS_ATTRIB_FREQ,1000,500);sleep_msec(300);BASS_ChannelSlideAttribute(chan,BASS_ATTRIB_VOL,-1,200);while (BASS_ChannelIsSliding(chan,0)) sleep_msec(1);BASS_Free();/* Select an audio file based on the specified keyword */char const *selectOGG(char const *keyword, int numOGG){int selFile, len, rNum;char fileNum[3];char *oggFile;68


static int first_time = 1;if( first_time ){first_time = 0;srand( (unsigned int)time( NULL ) );}rNum = rand() % (numOGG + 1);selFile = max(rNum,1);itoa(selFile,fileNum,10);len = strlen(soundDir)+strlen(keyword)*2+strlen(fileNum)+6;oggFile = (char *)malloc(len * sizeof(char));strcpy(oggFile,soundDir);strcat(oggFile,keyword);strcat(oggFile,"/");strcat(oggFile,keyword);strcat(oggFile,fileNum);strcat(oggFile,".ogg");oggFile[len-1]=’\0’;}return oggFile;/* Load predefined keywords into an array */static void load<strong>Keyword</strong>s(){FILE *key;char line[50];char number[2];int len, wordLen, i, j, k;int buffer = 50;int num<strong>Keyword</strong>s = 0;key = fopen (keyFile, "r");if (key == NULL) {fprintf(stderr, "Error when opening the keyword file!\n");exit(1);}while( fgets(line, buffer, key) != NULL){//Remove ’\n’ from the read linelen = strlen(line);if( line[len-1] == ’\n’ ){line[len-1] = ’\0’;}if(line[0] != ’#’){wordLen = strcspn(line,",");//Add the word into the keyword listkWordList[num<strong>Keyword</strong>s] = (char *)malloc((wordLen+1) * sizeof(char));strncpy(kWordList[num<strong>Keyword</strong>s],line,wordLen);kWordList[num<strong>Keyword</strong>s][wordLen]=’\0’;//Number of recordings of the current keywordi = wordLen+1;j = 0;while (line[i] != NULL){number[j] = line[i];i++;j++;69


}soundList[num<strong>Keyword</strong>s] = atoi(number);num<strong>Keyword</strong>s++;}else{}wordLen = strcspn(line,",")-1; //Don’t count the #//Add the word into the keyword listunknownkWord = (char *)malloc((wordLen+1) * sizeof(char));i = 1;j = 0;k = 0;while (line[i] != NULL){if (i


* Properly exit the program if a termination signal is detected (CTRL-C) */void terminate(int sig){printf("\nExiting...\n");playOGG(endSoundFile);fclose (hypLog);printf("Results were written to: %s\n", logFile);}ps_free(ps);ad_close(ad);exit(sig);/* Sleep for specified msec */static void sleep_msec(int32 ms){#if (defined(WIN32) && !defined(GNUWINCE)) || defined(_WIN32_WCE)Sleep(ms);#else/* ------------------- Unix ------------------ */struct timeval tmo;tmo.tv_sec = 0;tmo.tv_usec = ms * 1000;select(0, NULL, NULL, NULL, &tmo);#endif}/** Main utterance processing loop:* for (;;) {* wait for start of next utterance;* decode utterance until silence of at least 1 sec observed;* scan the decoded utterance to see if there is a keyword* play an OGG file with information related to the keyword* otherwise play additional default information* write the results into the logfile* }*/static void utterance_loop(){int16 adbuf[4096];int32 k, ts, rem, score;char const *hyp;char const *uttid;cont_ad_t *cont;char word[256];<strong>Keyword</strong> kword;char const *oggFile;clock_t clock_start;double elapsed;time_t rawtime;struct tm * timeinfo;char hypLogFile[100];static int first_time = 1;printf("Loading keywords...\n");load<strong>Keyword</strong>s();playOGG(startSoundFile);fflush(stdout);71


* Initialize continuous listening module */if ((cont = cont_ad_init(ad, ad_read)) == NULL)E_FATAL("cont_ad_init failed\n");if (ad_start_rec(ad) < 0)E_FATAL("ad_start_rec failed\n");if (cont_ad_calib(cont) < 0)E_FATAL("cont_ad_calib failed\n");// Open the file to write the obtained hypothesistime (&rawtime);timeinfo = localtime ( &rawtime );strftime (hypLogFile,100,"Logfiles/HypLogFile_%b%d_%H%M%S.txt",timeinfo);logFile = hypLogFile;hypLog = fopen (hypLogFile, "w");if (hypLog == NULL) {fprintf(stderr, "Error when creating the log file!\n");exit(1);}(void) signal(SIGINT,terminate);for (;;) {/* Indicate listening for next utterance */printf("\nREADY....\n");fflush(stdout);fflush(stderr);/* Await data for next utterance */while ((k = cont_ad_read(cont, adbuf, 4096)) == 0)sleep_msec(100);if (k < 0)E_FATAL("cont_ad_read failed\n");/** Non-zero amount of data received; start recognition of new utterance.* NULL argument to uttproc_begin_utt => automatic generation of utterance-id.*/if (ps_start_utt(ps, NULL) < 0)E_FATAL("ps_start_utt() failed\n");ps_process_raw(ps, adbuf, k, FALSE, FALSE);printf("Listening...\n");fflush(stdout);/* Note timestamp for this first block of data */ts = cont->read_ts;/* Decode utterance until end (marked by a "long" silence, >1sec) */for (;;) {/* Read non-silence audio data, if any, from continuous listening module */if ((k = cont_ad_read(cont, adbuf, 4096)) < 0)E_FATAL("cont_ad_read failed\n");if (k == 0) {/** No speech data available; check current timestamp with most recent* speech to see if more than 1 sec elapsed. If so, end of utterance.*/if ((cont->read_ts - ts) > DEFAULT_SAMPLES_PER_SEC)break;}else {72


}/* New speech data received; note current timestamp */ts = cont->read_ts;/** Decode whatever data was read above.*/rem = ps_process_raw(ps, adbuf, k, FALSE, FALSE);}/* If no work to be done, sleep a bit */if ((rem == 0) && (k == 0))sleep_msec(20);/** Utterance ended; flush any accumulated, unprocessed A/D data and stop* listening until current utterance completely decoded*/ad_stop_rec(ad);while (ad_read(ad, adbuf, 4096) >= 0);cont_ad_reset(cont);printf("Stopped listening, please wait...\n");fflush(stdout);/* Finish decoding, obtain and print result */ps_end_utt(ps);clock_start=clock();hyp = ps_get_hyp(ps, &score, &uttid);elapsed=(double)(clock()- clock_start)/CLOCKS_PER_SEC;printf("\n************************************\n");printf("Elapsed time: %f seconds\n",elapsed);printf("%s: %s (%d)\n", uttid, hyp, score);}/* Exit if the first word spoken was OUT */if (hyp){sscanf(hyp, "%s", word);if (strcmp(word, "OUT") == 0){printf("************************************\n");playOGG(endSoundFile);break;}/***************************************************************************/// Obtain the keyword from the current hypothesiskword = get<strong>Keyword</strong>(hyp);if (strcmp(kword.kWordName,"UNKNOWN") != 0){printf("<strong>Keyword</strong>: %s\n", kword.kWordName);}else{//Select and play information related to the keywordoggFile = selectOGG(kword.kWordName,kword.numSounds);printf("Selected file: %s\n",oggFile);printf("************************************\n");playOGG(oggFile);//Select and play additional informationprintf("No keyword identified!\n");oggFile = selectOGG(unknownkWord,numUnknownSound);printf("Playing additional information...\n");printf("Selected file: %s\n",oggFile);73


}printf("************************************\n");playOGG(oggFile);// Print the result data into the logfilefprintf(hypLog,"%s: %s (%d) %f %s\n", uttid, hyp, score, elapsed, kword);/***************************************************************************/fflush(stdout);}/* Resume A/D recording for next utterance */if (ad_start_rec(ad) < 0)E_FATAL("ad_start_rec failed\n");}fclose (hypLog);cont_ad_close(cont);printf("\nExiting...\n");printf("Results were written to: %s\n", hypLogFile);static jmp_buf jbuf;static void sighandler(int signo){longjmp(jbuf, 1);}int main(int argc, char *argv[]){cmd_ln_t *config;char const *cfg;/* Make sure we exit cleanly (needed for profiling among other things) *//* Signals seem to be broken in arm-wince-pe. */#if !defined(GNUWINCE) && !defined(_WIN32_WCE)signal(SIGINT, &sighandler);#endifif (argc == 2) {config = cmd_ln_parse_file_r(NULL, cont_args_def, argv[1], TRUE);}else {config = cmd_ln_parse_r(NULL, cont_args_def, argc, argv, FALSE);}/* Handle argument file as -argfile. */if (config && (cfg = cmd_ln_str_r(config, "-argfile")) != NULL) {config = cmd_ln_parse_file_r(config, cont_args_def, cfg, FALSE);}if (config == NULL)return 1;ps = ps_init(config);if (ps == NULL)return 1;if ((ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"),(int)cmd_ln_float32_r(config, "-samprate"))) == NULL)E_FATAL("ad_open_dev failed\n");if (setjmp(jbuf) == 0) {utterance_loop();}74


}ps_free(ps);ad_close(ad);return 0;75


B.2 TextExtractor/* File: tExtract.javaDescription: This program generates a filtered text corpus based on a set of URLs.The output format is compatible with the CMUCLMTK tool.Author(s): Ivan Castro, Andrea GarciaEmail(s): ivan.castro.c@gmail.com, andyggb@gmail.com*/import java.io.*;import java.net.URL;import de.l3s.boilerpipe.extractors.ArticleSentencesExtractor;import java.util.ArrayList;import java.util.regex.Pattern;import java.util.regex.Matcher;public class tExtract {public static void main(String[] args) throws Exception {String rawText = "";String tmpText = "";String inputFile = "";String outputFile = "";String keywordsFile = "";// Process command line argumentsif (args.length == 0){// Display an error, at least one argument should be input<strong>System</strong>.err.println ("Error: No input arguments were specified");<strong>System</strong>.exit(1);}else if(args.length == 1){// The user provides the path and name of the input fileinputFile = args[0];inputFile = inputFile.replace("\\", "/");// Create the output file in the same folder as the input fileoutputFile = inputFile.substring(0, inputFile.lastIndexOf("/")+1)+ "outputCorpus.txt";}else if(args.length == 2){// The user provides the path and name of the input and output files,// but no keywords fileinputFile = args[0];inputFile = inputFile.replace("\\", "/");outputFile = args[1];outputFile = outputFile.replace("\\", "/");}else if(args.length > 2){// The user provides the path and name of the input, output and keywords// filesinputFile = args[0];inputFile = inputFile.replace("\\", "/");outputFile = args[1];outputFile = outputFile.replace("\\", "/");keywordsFile = args[2];keywordsFile = keywordsFile.replace("\\", "/");}try{76


tmpText = tmpText.replace("U.K.", " United Kingdom ");tmpText = tmpText.replace("A.M.", " before noon ");tmpText = tmpText.replace(" AM ", " before noon ");tmpText = tmpText.replace("a.m.", " before noon ");tmpText = tmpText.replace("P.M.", " after noon ");tmpText = tmpText.replace(" PM ", " after noon ");tmpText = tmpText.replace("p.m.", " after noon ");// Remove square brackets and their contentstmpText = tmpText.replaceAll("\\[.*?\\]", "");// Replace numbers with it’s text representationString numregex = "(\\b\\d+)|(\\.\\d+)";String match = "";String numStr = "";Pattern p = Pattern.compile(numregex);Matcher m = p.matcher(tmpText);int numCnt = 0;while (m.find()){numCnt++;match = tmpText.substring(m.start(),m.end());if (match.contains(".")){numStr = "point ";match = match.replace(".", "");}numStr = numStr + NumTranslate.translate(Integer.parseInt(match)) ;tmpText = tmpText.replaceFirst(match, numStr);numStr = "";m = p.matcher(tmpText);}// Replace ...tmpText = tmpText.replace("...", "");// Replace some decimal points with spacestmpText = tmpText.replaceAll("\\b\\.\\b", " ");// Remove some decimal points with end of line characterstmpText = tmpText.replaceAll("[\\s]*\\.[\\s]*", "\n");// Remove dashestmpText = tmpText.replace(Character.toString((char)45), " "); // hypentmpText = tmpText.replace(Character.toString((char)150), " "); // En dashtmpText = tmpText.replace(Character.toString((char)151), " "); // Em dashtmpText = tmpText.replace(Character.toString((char)8211), " "); // En dashtmpText = tmpText.replace(Character.toString((char)8212), " "); // Em dash// remove weird spacestmpText = tmpText.replace(Character.toString((char)160), " ");// space// Replace Roman NumberstmpText = tmpText.replace(" II ", "second");tmpText = tmpText.replace(" III ", "third");tmpText = tmpText.replace(" IV ", "fourth");tmpText = tmpText.replace(" V ", "fifth");tmpText = tmpText.replace(" VI ", "sixth");tmpText = tmpText.replace(" VII ", "seventh");tmpText = tmpText.replace(" VIII ", "eigth");tmpText = tmpText.replace(" IX ", "ninth");tmpText = tmpText.replace(" X ", "tenth");78


Finally, remove anything that is not a wordtmpText = tmpText.replaceAll("[^a-zA-Z\\s\\n]", " ");// Remove extra spaces between wordstmpText = tmpText.replaceAll("\\b\\s{2,}\\b", " ");// Convert all the words to upper casetmpText = tmpText.toUpperCase();// Optional. Add the and characters to each sentencetmpText = " " + tmpText; // First LinetmpText = tmpText.replaceAll("\\n", " \n ");tmpText = tmpText.substring(0, tmpText.length() -5);try{// Optional file containing raw text extracted from the web pages<strong>System</strong>.out.println("Generating Raw Text File");String rawOutputFile = outputFile.substring(0, outputFile.lastIndexOf("/")+1)+ "rawOutputCorpus.txt";Writer out1 = new OutputStreamWriter(new FileOutputStream(rawOutputFile));out1.write(rawText);out1.close();}}}// Not-Filtered Corpus file<strong>System</strong>.out.println("Generating Corpus Text File");Writer out = new OutputStreamWriter(new FileOutputStream(outputFile));out.write(tmpText);out.close();// Filtered Corpus Fileif (keywordsFile.length() > 0){<strong>System</strong>.out.println("Generating Filtered Text File");String filteredOutputFile = outputFile.substring(0,outputFile.lastIndexOf(".")) + "Filtered.txt";filtCorpus.filter<strong>Keyword</strong>s(outputFile, keywordsFile, filteredOutputFile);}// Context file (.ccs) used by the language model generator<strong>System</strong>.out.println("Generating Context Text File");String OutputContextFile = outputFile.substring(0,outputFile.lastIndexOf(".")+1) + "ccs";Writer contextFile = new OutputStreamWriter(new FileOutputStream(OutputContextFile));contextFile.write("\n");contextFile.close();<strong>System</strong>.out.println("Output Text file(s) created");}catch(Exception e){<strong>System</strong>.err.println ("Error while writing the output file");<strong>System</strong>.exit(1);}}catch(Exception e){<strong>System</strong>.err.println ("Error while opening the input file");<strong>System</strong>.exit(1);79


B.3 DictGenerator/* File: dGenerator.javaDescription: This program generates a custom dictionary based on the cmu07.dicdictionary and a given vocabulary. The purpose of this tool is to generate adictionary including just the words contained in a given language model.Author(s): Ivan Castro, Andrea GarciaEmail(s): ivan.castro.c@gmail.com, andyggb@gmail.com*/import java.io.*;import java.util.regex.Pattern;import java.util.regex.Matcher;import java.util.ArrayList;import java.util.Collections;public class dGenerator {public static void main(String[] args) throws Exception {String masterDictFile = "";String inputVocabFile = "";String outputDictFile = "";String notFoundFile = "";// Process command line argumentsif (args.length < 2){// Display an error, at least one argument should be input<strong>System</strong>.err.println ("Error: Less than two arguments were specified");<strong>System</strong>.exit(1);}else if(args.length == 2){// The user provides the path and name of the dictionary file and the// input vocabularymasterDictFile = args[0];masterDictFile = masterDictFile.replace("\\", "/");inputVocabFile = args[1];inputVocabFile = inputVocabFile.replace("\\", "/");// Create the output file in the same folder as the input vocabularyoutputDictFile = inputVocabFile.substring(0,inputVocabFile.lastIndexOf("/")+1)+ "outputDict.dic";notFoundFile = inputVocabFile.substring(0,inputVocabFile.lastIndexOf("/")+1)+ "notfound.txt";}else if(args.length >= 3){// The user provides the path and name of the input and output filesmasterDictFile = args[0];masterDictFile = masterDictFile.replace("\\", "/");inputVocabFile = args[1];inputVocabFile = inputVocabFile.replace("\\", "/");// Create the output file in the same folder as the input vocabularyoutputDictFile = args[2];outputDictFile = outputDictFile.replace("\\", "/");notFoundFile = outputDictFile.substring(0,80


}try{inputVocabFile.lastIndexOf("/")+1)+ "notfound.txt";File inputDict = new File(masterDictFile);File vocab = new File(inputVocabFile);String line = null;int wordCounter = 0;int matchCounter = 0;BufferedReader readerVocab = new BufferedReader(new FileReader(vocab));BufferedReader readerDict = new BufferedReader(new FileReader(inputDict));StringBuffer contentsDict = new StringBuffer();ArrayList rowList = new ArrayList();ArrayList notFoundList = new ArrayList();ArrayList filteredList = new ArrayList();int endOfLine = 0;Pattern p;Matcher m; // get a matcher object// Read all of the lines in the dictionary<strong>System</strong>.out.println("Reading Master Dictionary");while ((line = readerDict.readLine()) != null) {contentsDict.append(line).append("\n");}// Read all of the lines in the vocabulary<strong>System</strong>.out.println("Building Custom Dictionary");while ((line = readerVocab.readLine()) != null) {wordCounter++;matchCounter = 0;if (wordCounter % 100 == 0){<strong>System</strong>.out.println(Integer.toString(wordCounter)+ " words have been processed");}// For each word in the vocabulary, look for its pronunciation in the// master dictionaryline = "\n" + line.toLowerCase(); // add a new line characterp = Pattern.compile(line + "(\\((.*?)\\))*" + "\t");m = p.matcher(contentsDict); // get a matcher objectwhile(m.find()) {matchCounter++;endOfLine = contentsDict.indexOf("\n", m.start() +1);rowList.add(contentsDict.substring(m.start()+1,endOfLine+1).toUpperCase());}if (matchCounter == 0){// Current word was not found in the dictionarynotFoundList.add(line);}}<strong>System</strong>.out.println(Integer.toString(wordCounter)+ " words found in the vocabulary file");// Eliminate repetitions:for(int i=0;i


}filteredList.add(rowList.get(i));}}// Sort the arrayCollections.sort(filteredList);// Write the output dictionarytry{Writer out = new OutputStreamWriter(newFileOutputStream(outputDictFile));for(int i=0;i


}else{<strong>System</strong>.out.println(e.getMessage());}sentenceNum = 0;sentenceNum++;cmdLineString = "wget -q -U Mozilla -O " + "\"" + outputFolder+ current<strong>Keyword</strong> + "/" + current<strong>Keyword</strong> + sentenceNum + ".mp3"+ "\" + " + "\"http://translate.google.com/translate_tts?tl=en&q="+ line.replace(" ", "+") + "\"";<strong>System</strong>.out.println(cmdLineString);// Execute wget toolpr = Runtime.getRuntime().exec(cmdLineString);}}// write the kwords.txt file used by the recognizer applicationWriter out = new OutputStreamWriter(new FileOutputStream(outputFolder + "kwords.txt"));Collections.sort(app<strong>Keyword</strong>sFile);for(int i=0;i


B.5 rEvaluator/* File: rEval.javaDescription: This program generates a comma-separated report containingmeasurements of the accuracy (KDA) of the ASR recognizer tool.Author(s): Ivan Castro, Andrea GarciaEmail(s): ivan.castro.c@gmail.com, andyggb@gmail.com*/import java.io.*;import java.util.ArrayList;public class rEval {public static void main(String[] args) throws Exception {String inputResultsFile = "";String keywordsFile = "";String outputMetricsFile = "";// Process command line argumentsif (args.length < 2){// Display an error, at least one argument should be input<strong>System</strong>.err.println ("Error: Less than two arguments were specified");<strong>System</strong>.exit(1);}else if(args.length == 2){// The user provides the path and name of the results file and the// keywords fileinputResultsFile = args[0];inputResultsFile = inputResultsFile.replace("\\", "/");keywordsFile = args[1];keywordsFile = keywordsFile.replace("\\", "/");// Create the output file in the same folder as the input vocabularyoutputMetricsFile = inputResultsFile.substring(0,inputResultsFile.lastIndexOf("/")+1) + "outputMetrics.csv";}else if(args.length >= 3){// The user provides the path and name of the input and output filesinputResultsFile = args[0];inputResultsFile = inputResultsFile.replace("\\", "/");keywordsFile = args[1];keywordsFile = keywordsFile.replace("\\", "/");}try{// Create the output file in the same folder as the input vocabularyoutputMetricsFile = args[2];outputMetricsFile = outputMetricsFile.replace("\\", "/");File inputRes = new File(inputResultsFile);File inputKeyw = new File(keywordsFile);BufferedReader readerRes = new BufferedReader(new FileReader(inputRes));BufferedReader readerKeyw = new BufferedReader(new FileReader(inputKeyw));ArrayList contentsRes = new ArrayList();ArrayList keywords = new ArrayList();ArrayList filtered<strong>Keyword</strong>s = new ArrayList();String temp1 = "";String temp2 = "";85


String line = null;// Read all the decoded utterances from the results file<strong>System</strong>.out.println("Reading Results File");while ((line = readerRes.readLine()) != null) {contentsRes.add(line);}// Read all the keywords from the keywords file<strong>System</strong>.out.println("Reading <strong>Keyword</strong>s File");while ((line = readerKeyw.readLine()) != null) {keywords.add(line);}// Check that the number of keywords in the file match the number of// decoded utterancesif (keywords.size() > contentsRes.size()){<strong>System</strong>.err.println ("More <strong>Keyword</strong>s than utterances were found!");<strong>System</strong>.exit(1);}else if (keywords.size() < contentsRes.size()){<strong>System</strong>.err.println ("Less <strong>Keyword</strong>s than utterances were found!");<strong>System</strong>.exit(1);}// Count the number of non-repeated keywordsfor(int i=0;i


}}// Now build the results filetry{int overallMatches = 0;int overallRepetitions = 0;Writer out = new OutputStreamWriter(newFileOutputStream(outputMetricsFile));out.write("KEYWORD,MATCHES,SENTENCES,ACCURACY (%),\r\n");for (int i = 0; i < filtered<strong>Keyword</strong>s.size(); i++) {out.write(keyData[i].get<strong>Keyword</strong>Name() + ","+ keyData[i].get<strong>Keyword</strong>Matches() + ","+ keyData[i].get<strong>Keyword</strong>Repetitions() + "," +(100 * keyData[i].get<strong>Keyword</strong>Matches())/keyData[i].get<strong>Keyword</strong>Repetitions() + ",\r\n");overallMatches += keyData[i].get<strong>Keyword</strong>Matches();overallRepetitions += keyData[i].get<strong>Keyword</strong>Repetitions();}out.write("OVERALL METRICS" + "," + overallMatches + ","+ overallRepetitions + "," +(100 * overallMatches)/overallRepetitions + ",\r\n");out.close();<strong>System</strong>.out.println("Output Report File was created succesfully!");}catch(Exception e){<strong>System</strong>.err.println ("Error while writing the output file");<strong>System</strong>.exit(1);}}catch(Exception e){<strong>System</strong>.err.println ("Error while processing the input files");<strong>System</strong>.exit(1);}87
