Table of Contents<br />
About <strong>LumenVox</strong><br />
Why Choose <strong>LumenVox</strong><br />
Discover <strong>LumenVox</strong><br />
Speech Recognition Engine<br />
Speech-enables any application with its flexible API, powering every solution that<br />
<strong>LumenVox</strong> provides.<br />
Speech Platform<br />
Allows you to develop and deploy your speech application: with just a few steps,<br />
your application can go from conception to reality.<br />
Speech Driven Assistant<br />
Integrates seamlessly with Vertical Communication's TeleVantage IP-PBX, permitting<br />
you to speech-enable the name directory, contact list, voicemail, IVR, and email.<br />
Speech Tuner<br />
Maintains and tests existing applications, ensuring that any speech recognition<br />
application, including those driven by Nuance and ScanSoft, continues to work well.<br />
<strong>LumenVox</strong> Training<br />
Describes classes and instructors available to help you learn about the speech<br />
industry, application design, and tuning with <strong>LumenVox</strong>’s products.<br />
Application Development Overview<br />
Provides insight to help you develop high-quality and effective applications for your<br />
customer base.<br />
Tuning Guide<br />
Gives a basic overview of the steps required when tuning and improving your speech<br />
applications.<br />
About <strong>LumenVox</strong><br />
<strong>LumenVox</strong> is a speech recognition company with over a decade of<br />
telephony experience. We take a business and technology approach<br />
that gives businesses, corporations, resellers, platform providers, and<br />
service providers access to the complex speech recognition industry.<br />
Our revolutionary speech recognition software products have gained<br />
industry recognition, winning over 17 awards for innovation, technical<br />
excellence, and users' choice.<br />
Whether your organization wants to quickly and easily speech-enable current applications, or<br />
maintain existing ones, <strong>LumenVox</strong> provides all the necessary tools: our state-of-the-art Speech<br />
Recognition Engine, Speech Platform, Speech Driven Assistant, and Speech Tuner.<br />
Tired of extra costs for adaptations, application updates, and new deployments? At <strong>LumenVox</strong>, we<br />
believe that you know your company's needs best: our tools allow you to develop speech<br />
applications on your own terms, at your company's pace, without service fees for each new update.<br />
With <strong>LumenVox</strong>'s software suite, you remain in control.<br />
<strong>LumenVox</strong> Team<br />
Expert. Dedicated. Innovative.<br />
<strong>LumenVox</strong>'s development team (speech scientists, speech user interface designers, and experienced<br />
sales and marketing personnel) has over 30 years of practical experience in the development and<br />
integration of speech recognition systems, telephony, database design, and hardware integration.<br />
At <strong>LumenVox</strong>, we know the challenges and opportunities that come with integrating speech into<br />
your application. We offer guidelines, tips, and best practices, and our design team can assist in<br />
any phase of your development cycle. We understand speech.<br />
<strong>LumenVox</strong> is committed to providing the most powerful, flexible, and useful speech products and<br />
services with excellent customer service to clients of any size.<br />
Recent Awards<br />
<strong>LumenVox</strong> offers an impressive suite of<br />
versatile, world-class speech recognition<br />
technologies.<br />
Dr. Danny Lange<br />
CEO of Vocomo Software<br />
Why Choose <strong>LumenVox</strong>?<br />
Rethinking the Business of Speech<br />
Traditionally, speech recognition technology providers keep the<br />
development and maintenance of speech applications under wraps.<br />
Instead, <strong>LumenVox</strong> believes in empowering the users and developers of<br />
our speech recognition technology.<br />
Other technology providers create, deploy, and maintain their own proprietary applications for their<br />
clients. The final application can carry a price tag in the millions, driven by extensive,<br />
ongoing professional service fees. Traditional companies also tier the pricing and<br />
functionality of their core speech recognition technology, so costs vary widely with<br />
what you need. This limits the development and expansion of speech<br />
recognition, preventing smaller businesses from utilizing these products.<br />
<strong>LumenVox</strong> Provides:<br />
Exceptional customer and technical support<br />
Hardware independent Automatic Speech Recognizer<br />
(ASR) with a distributed client-server architecture<br />
that runs on both Windows and Linux<br />
Extensive logging of audio, grammars, results, and<br />
scores, which allows you to recreate every call<br />
State-of-the-art testing and post-deployment tools to<br />
constantly improve your <strong>LumenVox</strong>, Nuance, or<br />
ScanSoft speech application<br />
Support and development of current and emerging<br />
industry standards<br />
We provide the tools and education needed when creating applications. We aspire to make speech<br />
recognition widely available, so our business model is structured on a single per-port charge, as<br />
opposed to tiered pricing models.<br />
At <strong>LumenVox</strong>, we believe in the power of speech recognition to help revolutionize many industries,<br />
including businesses that are currently excluded from using speech by tiered pricing and costly<br />
professional service fees. Contact us to learn more about how speech recognition can work for your<br />
business.<br />
I think your product is the tops, and<br />
I’ve been around the block. The MAJOR<br />
KEY TO ME is that I can use it effectively<br />
right out of the box BUT ALSO create<br />
custom speech-activated (true-IVR)<br />
applications right down to the<br />
TeleVantage API level!<br />
Evan Klayman President of Brainstem<br />
Discover <strong>LumenVox</strong><br />
<strong>LumenVox</strong>’s Products<br />
The speech industry contains many different technologies, platforms, and types of applications. This overview presents the components of a final speech application, as well as the products <strong>LumenVox</strong> offers.<br />
Tuning<br />
The process of changing an application to improve performance.<br />
Speech Tuner<br />
Complete maintenance tool to tune and test any speech application using various ASRs, including <strong>LumenVox</strong>, Nuance, and ScanSoft.<br />
Professional Services<br />
Experts in the speech industry who provide a variety of services, including speech application design, development, and tuning.<br />
Applications<br />
Speech applications allow callers to access any database to get account information, perform order entry, take customer surveys, or check order status.<br />
Speech Assistant<br />
Complete solution, currently for TeleVantage owners, to speech-enable their name directories, contact lists, voicemails, IVRs, and emails.<br />
Pre-Packaged<br />
Pre-packaged applications can offer features such as email and calendar management.<br />
Custom Built<br />
Custom speech applications can address any vertical or horizontal market need where touch-tone applications cannot.<br />
Platform<br />
Defines what type of applications can be run, and how those applications are allowed to operate.<br />
Speech Platform<br />
Complete platform with a toolkit to design and deploy speech applications. The Call Handler supports any phone system via analog, digital, ISDN, or PRI.<br />
VoiceXML<br />
Standards-based speech application programming language supported by many platforms.<br />
Others<br />
Many platforms are available that use languages other than VoiceXML for creating and running speech applications.<br />
Core Speech Technology<br />
These technologies form the basis for building any speech application.<br />
Speech Engine<br />
Automatic Speech Recognizer (ASR) that supports SRGS, MRCP, and SISR. Integrated in various VoiceXML and proprietary platforms.<br />
ASR<br />
Technology used for interpreting audio data from phone, web, microphone, or 2-way radio.<br />
Other Core Technology<br />
Text-to-Speech is used to produce audio from text. Voice Verification is used to identify individual speakers for security purposes.<br />
Speech Engine<br />
<strong>LumenVox</strong>'s Speech Recognition Engine is a flexible<br />
API that performs speech recognition on audio data<br />
from any audio source.<br />
The Speech Engine is speaker and hardware<br />
independent: it supports SRGS and SISR on both<br />
Windows and Linux platforms.<br />
How the Speech Engine works<br />
The Speech Engine provides speech application developers with an<br />
efficient development and runtime platform, allowing for dynamic<br />
language, grammar, audio format, and logging capabilities to customize<br />
every step of their applications. Grammars are entered as a simple list of<br />
words or pronunciations, or in the industry standard Speech Recognition<br />
Grammar Specification (SRGS), as defined by the W3C.<br />
Grammars<br />
Just 13 lines of code (8 calls to the<br />
Speech Engine) will implement a<br />
simple "yes-no" speech recognition<br />
system. The system must provide the<br />
audio and the audio data length for<br />
SoundData and SoundDataLength.<br />
<strong>LumenVox</strong>’s technology provides us<br />
with some of the very best in speech<br />
technologies today, at the right price for<br />
our customers.<br />
Mark Kelley President of Parallax<br />
Sample Code<br />
void RecognizeSpeech (void* SoundData, int SoundDataLength)<br />
{<br />
const char* GrammarString =<br />
"#ABNF 1.0;\n"<br />
"language en-US;\n"<br />
"mode voice;\n"<br />
"tag-format ;\n" // tag-format value elided in the source<br />
"$yes = (yes | yeah | okay):'true';\n"<br />
"$no = (nope | no):'false';\n";<br />
LVSpeechPort Port;<br />
Port.OpenPort ();<br />
Port.LoadGrammarFromBuffer (0, GrammarString);<br />
Port.LoadVoiceChannel (0, SoundData, SoundDataLength, ULAW_8KHZ);<br />
Port.Decode (0, 0, LV_DECODE_SEMANTIC_INTERPRETATION | LV_DECODE_BLOCK);<br />
int NumInterpretations = Port.GetNumberOfInterpretations (0);<br />
for (int i = 0; i < NumInterpretations; ++i)<br />
cout << Port.GetInterpretation (0, i) << endl; // print each result; accessor name illustrative<br />
}<br />
Supporting Standards<br />
<strong>LumenVox</strong> supports the W3C's Speech Recognition Grammar<br />
Specification (SRGS), part of the VoiceXML 2.0 and SALT<br />
specifications. Companies that track these specifications are<br />
dedicated to the future of speech, and to integrating with<br />
other companies committed to promoting speech recognition.<br />
The <strong>LumenVox</strong> SRGS implementation is backward compatible<br />
with the existing <strong>LumenVox</strong> BNF grammar format; current<br />
deployments will leverage the power of the SRGS system<br />
immediately and transparently.<br />
Both companies (<strong>LumenVox</strong> and<br />
SandCherry) have focused on simplifying<br />
the development, integration, and<br />
deployment of speech services while<br />
maintaining affordability.<br />
Charles Corfield<br />
President and CEO of<br />
SandCherry<br />
<strong>LumenVox</strong> recognizes that the speech community will need to<br />
work together to develop solutions for businesses, and as<br />
such, <strong>LumenVox</strong> applications complement the following<br />
technologies:<br />
VXML<br />
VoiceXML (VXML) is a mark-up language designed to code speech<br />
applications with many of the same architectural components as HTML.<br />
VoiceXML platforms connect to a combination of speech recognition engines,<br />
text-to-speech synthesis, telephony interfaces and VoiceXML interpreter<br />
software to process the call. In order to interface VXML with any speech<br />
engine, the engine must understand SRGS and SISR.<br />
<strong>LumenVox</strong>'s Speech Engine is compliant with what VXML expects, and our<br />
engine powers the speech recognition portion of several VXML platforms.<br />
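To sketch what a VoiceXML platform hands to an engine like this, a minimal VXML 2.0 form might look like the following; the grammar filename and prompt wording are invented for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <field name="answer">
      <prompt>Would you like to continue?</prompt>
      <!-- The platform passes this SRGS grammar to the speech engine -->
      <grammar src="yes_no.grxml" type="application/srgs+xml"/>
      <filled>
        <prompt>You said <value expr="answer"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The platform's VoiceXML interpreter handles the dialog; the engine only needs to understand the SRGS grammar and return a (possibly SISR-annotated) result.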
SALT<br />
Speech Application Language Tags (SALT) is similar to VoiceXML but also<br />
adds support for multi-modal systems. SALT extends existing mark-up<br />
languages such as XHTML, XML, and HTML. Similar to our work with VXML,<br />
the <strong>LumenVox</strong> Speech Recognition Engine conforms to SALT specifications.<br />
Semantic Interpretation<br />
<strong>LumenVox</strong> has implemented the W3C's Semantic Interpretation for Speech<br />
Recognition (SISR) working draft, also part of the VoiceXML 2.0 specification.<br />
SISR allows grammar authors to embed snippets of JavaScript code into<br />
their SRGS grammars, to automatically transform what a speaker says into a<br />
format understandable to an application. With <strong>LumenVox</strong>'s Semantic Tags,<br />
callers can say, "September thirteenth two thousand four," and your<br />
application will understand "2004-09-13."<br />
<strong>LumenVox</strong> is committed to supporting the W3C's working draft. As the draft<br />
evolves, we will support both new and old drafts, so application developers<br />
can be confident that their grammars and tags will perform to specification.<br />
Engine Features and Functionality:<br />
Streaming audio<br />
Supports English, Latin American Spanish, and Canadian French<br />
Flexible API easily integrates into current OA&M, billing, provisioning, and debugging systems<br />
Client/Server architecture distributes speech-processing load<br />
Run-time defined grammars entered as simple text, BNF, raw phonetic spelling, or SRGS<br />
Advanced dynamic barge-in adapts to each call in real-time<br />
SDK includes documentation and a demonstration C/C++ application<br />
Flexible error recovery through the use of confidence scores and n-best results<br />
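To illustrate the idea, a small SRGS ABNF grammar using SISR string-literal tags might look like the following; this sketch is our own, not taken from LumenVox documentation:

```
#ABNF 1.0;
language en-US;
mode voice;
tag-format <semantics/1.0-literals>;
root $answer;

// Each alternative returns a string literal as its semantic result,
// so the application receives "true" or "false" instead of raw words.
$answer = $yes | $no;
$yes = (yes | yeah | okay) {true};
$no  = (no | nope) {false};
```

A date grammar works the same way, with tags assembling the spoken pieces into an ISO-style string such as "2004-09-13".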
Advanced Features<br />
Noise Reduction Module<br />
When noise is present, it will degrade the performance of any speech recognition system.<br />
Quality noise reduction improves the accuracy of Voice Activity Detection and Core<br />
Recognition, both essential parts of a speech recognition system.<br />
To improve application robustness in noisy environments, <strong>LumenVox</strong> implemented a Noise<br />
Reduction Module (NRM) into our Speech Recognition Engine. The NRM automatically<br />
adapts to the acoustic environment, and dynamically updates its estimate of noise levels.<br />
The adaptive algorithm enables the NRM to reduce the effects of noise.<br />
The waveforms below demonstrate the power of <strong>LumenVox</strong>'s Noise Reduction Module. In<br />
the original audio [Fig. 1], a truck driver speaks on a cell phone while driving. In addition<br />
to noise from the truck engine and blowing wind, another vehicle engine starts in the middle<br />
of the recording. Although traditional noise reduction implementations often fail to adapt to<br />
such dramatic changes, <strong>LumenVox</strong>'s NRM adjusts to the new noise characteristics rapidly<br />
and automatically. [Fig.2]<br />
Fig. 1 Original audio: truck engine noise, with another vehicle starting mid-recording.<br />
Fig. 2 Audio after noise reduction: the truck noise is reduced, and the NRM adapts to the new engine noise.<br />
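The adaptive behavior described above can be caricatured in a few lines. The following toy noise-floor estimator is purely illustrative; it is not LumenVox's NRM algorithm:

```cpp
// Toy running noise-floor estimator. The floor rises slowly when frame
// energy exceeds it (so speech bursts cannot drag the estimate up) and
// falls quickly when energy drops (so quiet frames reset it fast),
// letting the estimate track a changing background noise level.
double UpdateNoiseFloor(double noiseFloor, double frameEnergy) {
    const double attack  = 0.01;  // slow upward adaptation
    const double release = 0.5;   // fast downward adaptation
    double rate = (frameEnergy > noiseFloor) ? attack : release;
    return noiseFloor + rate * (frameEnergy - noiseFloor);
}
```

Calling this once per audio frame yields an estimate that follows the background noise rather than the speech riding on top of it; subtracting or masking against that estimate is the essence of adaptive noise reduction.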
Voice Activity Detection<br />
Voice Activity Detection (VAD), also referred to as barge-in and/or End-Of-Speech (EOS) detection,<br />
identifies when a person begins speaking, finishes speaking, or pauses while speaking.<br />
<strong>LumenVox</strong>'s VAD implementation delivers high performance despite challenging conditions: hisses,<br />
pops, abrupt changes in background noise, telephone line echo, and squawks from two-way radio<br />
communication.<br />
The Voice Activity Detection module is highly configurable and can be adapted to work equally well<br />
within telephone, VoIP, or microphone-based applications.<br />
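As a toy illustration of the detection idea (not LumenVox's implementation), a frame-energy threshold separates speech-like frames from silence:

```cpp
#include <vector>

// Toy frame-level voice activity check: a frame counts as speech when its
// mean squared amplitude exceeds a threshold. Purely illustrative; a real
// VAD also adapts to background noise, line echo, and channel artifacts.
bool IsSpeechFrame(const std::vector<double>& frame, double threshold) {
    if (frame.empty()) return false;
    double energy = 0.0;
    for (double sample : frame)
        energy += sample * sample;
    return (energy / frame.size()) > threshold;
}
```

Begin-of-speech and end-of-speech events then fall out of runs of speech and non-speech frames, with hangover timers to bridge natural pauses.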
We are delighted that <strong>LumenVox</strong> is<br />
extending the capabilities of our<br />
platform.<br />
Media Resource Control Protocol (MRCP)<br />
Speech synthesizers…Audio recorders…DTMF recognizers…Speech<br />
recognizers…Speech verifiers…a fully functioning, media-rich application<br />
needs lots of components to work together. Until now, all of these<br />
components had to be provided by a single vendor, or required extensive<br />
custom programming to integrate them. MRCP changes all this. The<br />
Media Resource Control Protocol allows you to seamlessly manage<br />
diverse media resources. MRCP provides a common language to speak<br />
to all of these devices.<br />
With MRCP, vendors can compete on the basis of their strengths, rather<br />
than attempting to create an all-inclusive, yet mediocre package.<br />
Customers can take the best product from each vendor, creating a<br />
speech application package that is tailored to their particular needs.<br />
For detailed information visit:<br />
http://www.ietf.org/internet-drafts/draft-ietf-speechsc-mrcpv2-06.txt<br />
n-best Results<br />
Instead of returning only the top scoring result, you can<br />
instruct the engine to return several of the highest scoring,<br />
most likely answers, often called n-best results. Returning<br />
n-best results is particularly effective when callers need to<br />
spell names, street addresses, or e-mail addresses.<br />
Without n-best results, if a caller spells a name beginning<br />
with "N," but the engine returns a low confidence score, the<br />
caller would be asked to repeat the letter; and given how<br />
similar "N" is to "M," it's likely that the second answer<br />
would have a similarly low confidence score. With n-best<br />
results, the system can prompt the caller using several of<br />
the likely results, such as "Did you mean 'M,' as in 'Mary'?"<br />
When the caller responds, "No," the system goes to its next<br />
option, "Perhaps you meant 'N' as in 'Nancy'?"<br />
Returning n-best results improves the caller's experience:<br />
instead of asking the caller to simply repeat an answer that<br />
received a low confidence score, the system can confirm the<br />
caller's intention using several likely choices.<br />
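The confirmation strategy described above can be sketched as follows; the Candidate struct and the confirmation callback are hypothetical, not part of the LumenVox API:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical n-best candidate: an answer plus its confidence score.
struct Candidate {
    std::string answer;
    double confidence;
};

// Walk the ranked candidates, "asking" the caller about each in turn
// ("Did you mean 'M' as in 'Mary'?"); return the first confirmed answer,
// or an empty string when every candidate is rejected.
std::string ConfirmFromNBest(
    const std::vector<Candidate>& nbest,
    const std::function<bool(const std::string&)>& callerSaysYes) {
    for (const Candidate& c : nbest)
        if (callerSaysYes(c.answer))
            return c.answer;
    return "";  // fall back to re-prompting the caller
}
```

The design point is that a rejection costs one short yes/no turn instead of a full re-recognition of an acoustically confusable utterance.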
Server-Side Grammar<br />
<strong>LumenVox</strong> offers even more efficient support for<br />
large grammars, by allowing clients to pre-load<br />
grammars onto the server, allowing users to send<br />
the grammar prior to the decode requests.<br />
Typically, the grammar itself accompanies each<br />
decode request, but in the case of large grammars,<br />
sending the grammar to the server prior to<br />
decoding is more efficient⎯reducing network traffic.<br />
John Hibel Vocalocity’s Vice President<br />
of Marketing and Business Development<br />
Speech Platform<br />
The Speech Platform is an intuitive GUI-based toolkit to quickly design,<br />
develop, and deploy any speech application or IVR. By connecting to<br />
almost any phone system and database, the Platform can easily power a<br />
Speech Driven Technical Support line, Call Router, Customer Service<br />
desk, Dealer Locator, Auto-Attendant, or any other speech<br />
application.<br />
Platform Features:<br />
English, Latin American Spanish, and Canadian French Support<br />
Client/Server functionality<br />
Database Connectivity through Custom Action DLLs<br />
Call Bridging and Outbound Dialing through Custom Action DLLs<br />
Support for Intel Dialogic Dx1E, Dx2, JCT, and DMV Series cards<br />
Enterprise level distribution<br />
User-created sophisticated grammars<br />
Loop start / Analog or T1 / PRI ISDN / Digital Switch<br />
Barge-In capability<br />
Live updates without rebooting system<br />
DTMF and speech input<br />
Detailed Call Flow logging<br />
Live runtime GUI status monitoring<br />
Complete Call Flow handling<br />
Flash-hook transfer capability<br />
Flexible Call Job definitions<br />
Carrier-grade application-ready<br />
Full SRGS, SISR support<br />
On-the-fly project switching<br />
File or SQL/MSDE Database-based projects<br />
Flexible line setting<br />
Assign each phone line to a different database, file project, or CAPI only mode<br />
Noise Reduction Module<br />
Speech Tuner included<br />
Speech Recognition Engine included<br />
<strong>LumenVox</strong>’s Speech Platform is a<br />
clear leader in the speech recognition<br />
sector.<br />
Nadii Tehrani Chairman of TMC<br />
Everything You Need<br />
<strong>LumenVox</strong>'s Speech Platform includes all of the components you need to<br />
produce, adjust, and maintain your speech applications. Designed from<br />
the outset to work together, the Platform's components operate<br />
seamlessly.<br />
Platform Component Descriptions:<br />
The Platform Designer allows you to construct the framework for your speech<br />
application in a GUI environment.<br />
Platform Extensions are used to handle any situation that the Platform Designer<br />
cannot support internally.<br />
The Speech Engine recognizes what the caller says, returning the results to the Call<br />
Handler.<br />
The Call Handler works with the Platform’s Designer, Extensions, and Speech<br />
Recognition Engine, executing the logic of your application, and directing calls<br />
appropriately.<br />
The Call Flow View allows the designer of the speech<br />
application to see all of the modules that are associated with<br />
the application and how they relate to each other. This is<br />
where the application dialog flow is created and can be<br />
tracked to see how callers will flow through the speech<br />
application or IVR.<br />
The Objects Panel allows application designers to drag and<br />
drop Modules that perform a specific function, or to add<br />
notes with Annotations to provide quick information or<br />
reminders on the Call Flow itself. Actions Lists, Actions,<br />
and Grammars are drag and drop icons to add within each<br />
individual Module's properties.<br />
The Properties Window has six tabs: Project, Modules, Audio<br />
Library, Quick Audio, To Do, and Notes. Project shows the<br />
global and project-specific properties. Modules lists the<br />
modules you have created so far for the application. The<br />
Audio Library and Quick Audio display all audio prompts<br />
contained within the project. To Do reminds you of objects<br />
that have not been finished within the whole project.<br />
Notes provides a place to enter comments or thoughts<br />
concerning the application.<br />
The Speech Tuner provides a comprehensive window into your application: it allows<br />
you to quickly note where your application performs well, and determine which<br />
areas need improvement. The Tuner also lets you simulate changes to your<br />
application, using audio from past calls, to determine the effectiveness of each<br />
change.<br />
<strong>LumenVox</strong> allows us to have control<br />
over the development of call flow,<br />
grammars and tuning, while keeping the<br />
costs at a respectable level.<br />
Brian Lauzon President of TelASK<br />
Platform Extensions<br />
The Speech Platform can be extended, allowing you to create a<br />
number of speech applications, by accessing pre-built libraries and<br />
controlling the call flow. These Platform Extensions can be written<br />
as either a Visual Basic ActiveX exe or C/C++ DLL.<br />
Examples of Platform Extensions:<br />
Connect to a live, or regularly updated, website or RSS<br />
feed<br />
Connect to a database via an ADO (ActiveX Data<br />
Object)<br />
Use SRGS for grammars and enable semantic<br />
interpretation<br />
Direct callers to different modules within the<br />
Application Designer project based on external<br />
decision trees<br />
Example Applications:<br />
PIN code and account number capture<br />
Survey systems<br />
Automated billing<br />
Collecting demographic information<br />
Auto-attendant<br />
Transaction completion<br />
Appointment scheduling<br />
Email reader<br />
Voicemail access<br />
By utilizing <strong>LumenVox</strong>’s Speech<br />
Platform our callers actually enjoy the<br />
experience of the phone call, helping us<br />
to build a good relationship with our<br />
customers.<br />
Derek Henry CEO of 1-800-US-LOTTO<br />
Corporation<br />
VB ActiveX Platform Extension Example<br />
C/C++ Platform Extension Example<br />
Call Handler<br />
The Speech Platform's Call Handler runs the speech application. The Platform's Settings, and<br />
more specifically Line Settings, inform the Call Handler as to which Project to use.<br />
The Call Handler supports a variety of Intel Dialogic telephony<br />
cards, and is designed to make hardware setup as easy as possible;<br />
this allows more time for you to develop and test your speech<br />
application.<br />
When developing your speech application with the Speech Platform's Designer, run tests by<br />
clicking on the Test button in the Toolbar. The Call Handler allows you to use speakers and a<br />
microphone, to completely test your application in-house⎯before releasing it to customers.<br />
Design Tips<br />
Build the application for the expected case, anticipating the caller's situation and the<br />
application's goals. Focus on these goals as you design the call flow and dialogues.<br />
Keep prompts and grammars concise in your initial design, then expand based on<br />
callers' interactions, learned from the tuning process.<br />
Give thought and time to determine the persona and brand image. Verify that the<br />
voice talent creates the proper tone and perception for your company.<br />
Make the system match the user. Listen to the way a user responds and interacts<br />
with the system.<br />
Apply the strengths of speech recognition to your application: remember that many<br />
of the hierarchical structures used in DTMF or touch tone call flows are not<br />
appropriate in speech applications!<br />
Never disguise a list question as a yes/no by adding unnecessary pauses, as in,<br />
"Would you like Red (pause), Blue, or Black?" This leads to caller confusion.<br />
Adjust the "No Input" timeout to match the complexity of the question.<br />
Make explicit decisions with yes/no.<br />
Make it easy for the caller to leave the speech application and reach a live person.<br />
The reason we chose <strong>LumenVox</strong>’s<br />
speech technology was simply because of<br />
the flexibility it provides us, as developers.<br />
Brian Lauzon<br />
President of TelASK<br />
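Several of these tips correspond directly to VoiceXML settings. The following hypothetical fragment (grammar file and prompt wording invented) lengthens the no-input timeout for a complex question and keeps a live-person escape available:

```xml
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Give callers a longer "No Input" window for a complex question -->
  <property name="timeout" value="7s"/>
  <form>
    <field name="color">
      <prompt>Would you like red, blue, or black?</prompt>
      <grammar src="colors.grxml" type="application/srgs+xml"/>
      <noinput>
        <prompt>Sorry, I did not hear you.</prompt>
        <reprompt/>
      </noinput>
      <help>Say operator at any time to reach a live person.</help>
    </field>
  </form>
</vxml>
```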
Speech Driven Assistant<br />
<strong>LumenVox</strong>'s Speech Driven Assistant for TeleVantage is a complete<br />
turn-key program to speech-enable name directories, contact lists,<br />
voicemail, IVRs, and emails, providing all callers hands-free phone<br />
interactions. The system also allows for alternate names, nicknames,<br />
and various spellings and pronunciations to be recognized.<br />
<strong>LumenVox</strong>’s Speech Driven Assistant<br />
for TeleVantage has become the most<br />
valued component in our operation of a<br />
speech-activated hotline, 1-800-US-LOTTO.<br />
Derek Henry<br />
CEO of 1-800-US-LOTTO<br />
Corporation<br />
Remote Access to:<br />
Speech enabled outbound dialing<br />
Users can say a name from their personal contact lists when they want to place a call<br />
Speech enabled name directory<br />
Callers can speak the name of the person or department they want to reach<br />
Support for workgroups and multiple company configurations<br />
Transfer fax calls to specified extension<br />
Add alternate names, spellings, and pronunciations<br />
Speech enabled voicemail access<br />
Access, control, and manage voicemail with speech<br />
Reply directly to caller's phone number or extension<br />
Forward messages to other users<br />
Speech enabled IVR (<strong>LumenVox</strong>'s Speech Platform)<br />
GUI-based development tool to create a variety of IVR applications<br />
Supports both Speech and DTMF input<br />
External database access<br />
Speech enabled access to email<br />
Access POP3, Exchange Server, and IMAP email<br />
Reply to email with an attached recorded audio message or with user-predefined text messages<br />
NeoSpeech Text-to-Speech (TTS)<br />
Play prompts for unrecorded proper names, dynamic IVR applications, and emails<br />
Assistant Features:<br />
Speech activated dialing to contact list<br />
Transfer to another extension from voicemail<br />
Navigate, forward, save, and delete voicemail<br />
IMAP, POP3 and Exchange server email access<br />
DTMF and Speech input<br />
Configure Speech Driven Assistant database remotely<br />
Barge-in capability with CSP compatible Intel Dialogic cards<br />
Live run-time GUI status monitoring<br />
Speech Platform included<br />
Speech Tuner included<br />
NeoSpeech Text-to-Speech included<br />
Configuration Tool<br />
<strong>LumenVox</strong>’s Configuration tool is where all modifications to the Speech<br />
Driven Assistant are made.<br />
Users’ email account information and pre-defined<br />
text message replies are easily managed.<br />
The Configuration program allows<br />
administrators to modify user<br />
information, maintain email<br />
servers, configure speech<br />
applications, and check the<br />
server’s telephony hardware<br />
compatibility.<br />
Administrators can easily select and<br />
change any user’s information, which is<br />
continuously synchronized with<br />
TeleVantage’s database.<br />
The Configuration tool can also discover and display all installed Intel Dialogic<br />
cards. All features associated with the telephony cards are shown as a quick<br />
reference of their capabilities.<br />
Alternate names or departments<br />
associated with a particular user can<br />
be added quickly to the live system.<br />
For commonly mispronounced names, alternate pronunciations can be added.<br />
The Phonetic Speller provides alternate pronunciations as well as the chance<br />
to hear how each one sounds, using Text-to-Speech.<br />
<strong>LumenVox</strong>’s Speech Driven Assistant<br />
has provided the best integration with<br />
TeleVantage of any speech recognition<br />
product we tested.<br />
John Gagliardi<br />
President of GTI Solutions<br />
Speech Tuner<br />
<strong>LumenVox</strong>'s Speech Tuner is a complete<br />
maintenance tool for end-users, value-added<br />
resellers, and platform providers.<br />
It’s designed to perform tuning and<br />
transcription, as well as parameter,<br />
grammar, and version upgrade testing of<br />
any speech application.<br />
With this GUI-based tool, companies<br />
developing speech applications on<br />
various ASR platforms (including<br />
Nuance and ScanSoft) can bring speech<br />
application tuning in-house and avoid<br />
professional service fees.<br />
<strong>LumenVox</strong> is on the cutting edge of<br />
speech technologies and customer<br />
satisfaction by supporting not only their<br />
own speech engine, but other leaders in the<br />
industry as well.<br />
Bruce Balentine<br />
Executive Vice President<br />
and Chief Scientist at EIG<br />
Why Do Companies Need the<br />
<strong>LumenVox</strong> Speech Tuner?<br />
No untuned speech application<br />
survives contact with actual<br />
customers. Tuning is an absolute<br />
requirement for every speech application<br />
deployment. Our Tuner allows you to<br />
quickly assess changes and upgrades.<br />
Tuner Capabilities:<br />
Evaluate and improve the speech recognition<br />
application<br />
Analyze each stage in the call process<br />
Transcribe audio data, make pinpoint adjustments,<br />
and immediately measure the effects on performance<br />
Test changes against actual calls immediately<br />
Analyze data collected using different ASR engines<br />
Test design and development decisions of new<br />
applications, using data from deployed applications<br />
Tuner Overview<br />
The Speech Tuner comprises several key functions and windows.<br />
The SRE Custom Tags<br />
window displays log<br />
information supplied<br />
by the application’s<br />
developers.<br />
The Call Log displays the list of calls and controls the display<br />
of information in the rest of the Tuner. Each interaction<br />
(or turn) in a call is marked with an event type, such as:<br />
Beginning of the call<br />
Speech event<br />
Touch tone event<br />
'No input' event; the application<br />
did not detect any speech or touch tones<br />
'Unknown' event; these<br />
events are typically ASR-specific<br />
End of the call<br />
The Transcription tool allows a<br />
transcriber to type the text of the<br />
caller's speech. Transcriptions<br />
are automatically evaluated and<br />
stored in the database for use in<br />
performance evaluations.<br />
The Statistics window<br />
displays performance<br />
statistics directly related<br />
to the calls in the Call Log<br />
window.<br />
The Answer window shows<br />
recognition results. Full<br />
n-best support is available,<br />
as well as semantic<br />
interpretations and the actual<br />
words recognized.<br />
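The n-best results shown in the Answer window can be pictured as a ranked list of hypotheses, each carrying the recognized words, a semantic interpretation, and a confidence score. The sketch below is illustrative only; the field names and confidence scale are assumptions, not the Tuner's actual data model.<br />

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: str            # the actual words recognized
    interpretation: str   # the semantic interpretation
    confidence: float     # engine confidence; the scale varies by ASR

def best_hypothesis(n_best):
    """Return the highest-confidence entry from an n-best list."""
    return max(n_best, key=lambda h: h.confidence)

# An illustrative n-best list for a single caller utterance.
n_best = [
    Hypothesis("checking please", "ACCOUNT=checking", 0.82),
    Hypothesis("checking fees", "TOPIC=fees", 0.61),
    Hypothesis("check in", "ACTION=checkin", 0.40),
]
print(best_hypothesis(n_best).interpretation)  # ACCOUNT=checking
```

Inspecting the lower-ranked entries often reveals near-misses worth adding to the grammar.<br />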
Listen to the application prompt, and the caller's pre- and<br />
post-processed speech. The recognized words are displayed where the<br />
ASR found them within the caller's speech. Vertical bars, which are<br />
useful for detecting problems, indicate the beginning and end of each<br />
word.<br />
Tuning Processes<br />
<strong>LumenVox</strong>'s Speech Tuner provides full support for <strong>LumenVox</strong>'s Speech<br />
Recognition Engine, Nuance 8.5, ScanSoft OSR 2, and other ASRs. The<br />
Speech Tuner allows you to work with any supported ASR via a single<br />
interface.<br />
<strong>LumenVox</strong> is an active supporter of the Tools committee in the VXML<br />
Forum, and is working to help define standard logging information<br />
that eases the tuning process.<br />
The tuning process involves three easy steps:<br />
1 Import Data.<br />
The basic process is simple. Users import call log data into<br />
the Speech Tuner database. All information stored by the call<br />
log is available in the Speech Tuner. In most cases, log fields<br />
between ASR engines are very similar; when the information<br />
differs, every effort is made to preserve the original data.<br />
Each special case is fully documented.<br />
2 Transcribe Speech.<br />
Transcribers can type the text of the caller's speech directly<br />
into the Speech Tuner. Once the audio is transcribed, the<br />
Tuner compares audio transcripts with the speech engine<br />
results to determine accuracy, greatly reducing errors<br />
associated with hand evaluations. If semantic interpretations<br />
are available, the transcriber can also mark whether the<br />
semantic interpretation was correct or incorrect. The<br />
transcripts are evaluated using the actual decode grammar,<br />
producing measurements such as word-error-rate, in- and<br />
out-of-grammar rates, and semantic error rates.<br />
3 Test Immediately.<br />
Selecting an interaction in the Call Log automatically loads<br />
the associated audio and grammar into the Tester. The<br />
grammar can be edited, speech engine parameters set, and<br />
individual recognition tests generated. The Speech Tuner<br />
natively supports industry standard SRGS grammars. Once<br />
a set of possible changes is identified, users can batch test<br />
audio to evaluate performance, using those changes.<br />
The Speech Tuner assumes that the user possesses a licensed<br />
version of the relevant ASR, that the ASR platform is up and<br />
running, and that the platform is able to accept connections.<br />
<strong>LumenVox</strong> Speech Tuner Database<br />
The Speech Tuner communicates with SQLite (www.sqlite.org), an open-source, public-domain<br />
database. The Speech Tuner manages call log importing, searching, and exporting, so<br />
users can focus on the task of tuning, not log management. The database is contained in a single<br />
file, is easy to back up and transport, and can be queried using SQL-92 (see the SQLite website for<br />
full details) from a variety of external tools. Other speech engine vendors are free to convert their<br />
native logs to a format the Speech Tuner understands. The format, content, and semantics of the <strong>LumenVox</strong><br />
Speech Tuner database are published.<br />
The database maintains all the information contained in the original call log. The Speech Tuner<br />
includes not only the decode grammar and ASR results, but also the decode platform, parameter<br />
settings, alternative results, prompt audio, and pre- and post-processed audio.<br />
Depending on the platform logging capabilities, the database can provide more advanced<br />
information, such as ASR result alignments within the audio; the list of phonemes used in the<br />
decode; and word, utterance, and semantic interpretation confidence measurements.<br />
In addition, the Tuner stores all transcripts and evaluations within the call log. As transcripts are<br />
entered into the Speech Tuner, they are automatically evaluated against the decode grammar.<br />
These transcripts, and any notes or additional information, are stored directly into the database.<br />
Individual scores, such as word error rate, semantic error rate, and in- and out-of-grammar<br />
measurements, are stored along with their alignments, as well as information about how the scores<br />
were reached.<br />
Users can generate a variety of reports from these results, including error rate by grammar or<br />
dialog, confusion matrices, transcription progress, and confidence thresholds for confirmation or<br />
rejection settings.<br />
In the future, <strong>LumenVox</strong>'s Speech Tuner will also support back-end database replacement, for use in<br />
enterprise level systems, where multiple users will be analyzing the same data simultaneously.<br />
Companies who use an ODBC-capable database can replace, with certain SQL changes, the disk-<br />
based SQLite system with an enterprise system such as MS SQL Server 2000, MySQL, PostgreSQL,<br />
or Oracle.<br />
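Because the database is a single SQLite file, it can be queried from outside the Tuner with ordinary SQL. The snippet below sketches the idea against an in-memory database; the table and column names are illustrative assumptions, not the published Speech Tuner schema.<br />

```python
import sqlite3

# Illustrative schema only -- the real Speech Tuner schema is published
# separately; table and column names here are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE interactions (
    call_id INTEGER, grammar TEXT, transcript TEXT,
    result TEXT, correct INTEGER)""")
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?, ?, ?)",
    [(1, "yes_no", "yes", "yes", 1),
     (1, "main_menu", "operator", "operator", 1),
     (2, "main_menu", "billing", "building", 0)])

# Per-grammar accuracy, expressible in plain SQL-92 from any external tool.
for grammar, total, right in conn.execute(
        """SELECT grammar, COUNT(*), SUM(correct)
           FROM interactions GROUP BY grammar ORDER BY grammar"""):
    print(f"{grammar}: {right}/{total} correct")
```

Because everything lives in one file, backing up the tuning data is a simple file copy.<br />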
<strong>LumenVox</strong> has created speech<br />
recognition products that are easy to<br />
code with, and GUI-based tools, such as<br />
the new Speech Tuner, that greatly<br />
simplify post-deployment<br />
maintenance.<br />
Vern Baker<br />
President of enGenic<br />
Corporation<br />
Taking Out the Guesswork<br />
Make changes to grammars, parameters, or ASR engines, secure in the<br />
knowledge that those changes will make the application better, faster,<br />
and more accurate. The Speech Tuner uses historical information to<br />
validate your changes, ensuring your success.<br />
Grammar Tester<br />
Most 'tuning' tools are passive log viewers, requiring that changes be<br />
made in the live speech application and retested over a period of time<br />
with live callers. With <strong>LumenVox</strong>'s Tuner, we send the changes to the<br />
Speech Engine, simulating the recognition process and evaluating<br />
changes instantly. Instead of slow, non-interactive, static tuning, the<br />
Speech Tuner enables on-the-fly, highly interactive, dynamic tuning.<br />
Make a change, do the test, get the results!<br />
The Grammar Tester is a dynamic<br />
testing component. You can switch<br />
ASR engines, grammars, and engine<br />
search parameters on-the-fly, and test<br />
changes in single or batch tests.<br />
Grammar Evaluation<br />
Evaluate speech and grammar sets against the speech engine, as they<br />
took place during the actual call. Adjust grammars and instantly<br />
re-test and re-score to evaluate improvements in performance. With<br />
<strong>LumenVox</strong>'s Speech Tuner, you can instantly determine whether adding<br />
a new phrase to the grammar will improve your accuracy.<br />
Parameter Evaluation<br />
Setting parameters optimizes the speech engine performance, further<br />
improving the caller's experience. Traditionally, changing ASR<br />
parameters is a difficult and time-consuming task, often requiring long<br />
delays between changing a parameter, and evaluating its effects on<br />
performance. Our Speech Tuner can dramatically shorten the process.<br />
The dynamic test capability of the <strong>LumenVox</strong> Speech Tuner allows the<br />
user to quickly make and test parameter changes: now, ASR engine<br />
parameters such as search optimizations, speech end-pointing, and<br />
n-best result processing can be easily adjusted, and immediately<br />
re-tested and re-scored from within the Speech Tuner.<br />
Performance Measurements<br />
The Speech Tuner rates performance against commonly accepted<br />
measures like WER (Word Error Rate), Grammar Coverage, and<br />
Semantic Interpretation matching. This helps give an accurate picture<br />
of details such as average confidence scores, correct versus incorrect<br />
responses, and In-Grammar versus Out-of-Grammar performance.<br />
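Word Error Rate, the most common of these measures, is the number of substitutions, deletions, and insertions between the transcript and the recognized words, divided by the transcript length. A minimal reference implementation, using the standard edit-distance dynamic program:<br />

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("transfer me to billing", "transfer to building"))  # 0.5
```

Here, "transfer me to billing" decoded as "transfer to building" scores one deletion plus one substitution against four reference words, for a WER of 0.5.<br />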
Assessing Upgrades<br />
Installing new versions of platforms and ASR engines entails a certain<br />
amount of risk. On occasion, new default settings, search<br />
routines, changes to acoustic models, and so on will actually worsen<br />
the caller's experience, until the application is re-tuned. But using the<br />
<strong>LumenVox</strong> Speech Tuner, you can perform baseline testing with the old<br />
version to establish the minimum acceptable performance. Then,<br />
using the upgraded version of the ASR engine, you can easily re-test<br />
all existing data and compare the results to the baseline. The new<br />
performance, judged against the baseline, gives you the information<br />
you need to make a decision, and deploy an upgrade with confidence.<br />
Tuner Reports<br />
The <strong>LumenVox</strong> Speech Tuner defines several pre-built queries for the most<br />
common reports. The reports are generated using SQL queries against<br />
the Tuner database, with results produced in a pre-defined XML format.<br />
The format, content, and semantics of the <strong>LumenVox</strong> Speech Tuner<br />
database are published: if you need to extract data from the logs that is not<br />
provided in the Tuner interface by default, you can easily produce<br />
custom reports by writing SQL commands.<br />
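A custom report of this kind might look like the following sketch: a SQL query over an illustrative miniature of the Tuner database, rendered as XML. The element names and schema here are assumptions for illustration, not LumenVox's published report format.<br />

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical miniature of the Tuner database; the real schema is published.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interactions (grammar TEXT, correct INTEGER)")
conn.executemany("INSERT INTO interactions VALUES (?, ?)",
                 [("yes_no", 1), ("yes_no", 1),
                  ("main_menu", 0), ("main_menu", 1)])

# Error rate by grammar, emitted as an XML report.
report = ET.Element("report", name="error_rate_by_grammar")
for grammar, total, errors in conn.execute(
        """SELECT grammar, COUNT(*), COUNT(*) - SUM(correct)
           FROM interactions GROUP BY grammar ORDER BY grammar"""):
    ET.SubElement(report, "grammar", name=grammar,
                  interactions=str(total),
                  error_rate=f"{errors / total:.2f}")
print(ET.tostring(report, encoding="unicode"))
```

The same pattern extends to confusion matrices or transcription-progress reports: one SQL query, one XML rendering step.<br />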
Tuner Tips<br />
When a problem occurs with a transaction,<br />
determine if the fault lies primarily with<br />
prompts or grammars.<br />
Never make a change in your call flow or<br />
grammar for just one failed call.<br />
Train acoustic models with environmental<br />
noise present.<br />
Train acoustic models to allow for caller<br />
dialects and regional pronunciations.<br />
Tune grammars, prompts, and all system<br />
parameters.<br />
Remember that accurate transcriptions<br />
need to account for noise.<br />
Speech Understood<br />
The Speech Tuner is an excellent fully<br />
integrated tool for improving speech<br />
applications.<br />
Bruce Balentine<br />
Executive Vice President and Chief Scientist at EIG<br />
<strong>LumenVox</strong> Training Courses<br />
The <strong>LumenVox</strong> Team, a group of knowledgeable<br />
professionals with extensive development and<br />
technical support experience, provides our<br />
courses. We will help you get the most out of<br />
your <strong>LumenVox</strong> products.<br />
Courses include Speech Application Design, API Development,<br />
Speech Application Tuning, and many others. These courses give<br />
developers and business personnel opportunities to learn about<br />
speech development and sales opportunities on our premises, with<br />
the assistance and advice of the <strong>LumenVox</strong> Team.<br />
Who Should Attend<br />
People responsible for developing, maintaining,<br />
marketing, and/or selling <strong>LumenVox</strong> speech<br />
applications will benefit from <strong>LumenVox</strong> training.<br />
Classes also benefit anyone with an interest in<br />
designing, developing, tuning, testing, or maintaining<br />
any speech telephony system.<br />
What to Expect<br />
<strong>LumenVox</strong> training is key to accelerating your learning curve. Through a combination of<br />
presentations and hands-on exercises, our courses provide the details of creating and maintaining<br />
applications. In these courses, you will learn solutions to real problems encountered during actual<br />
application design, development, deployment, tuning, marketing, and selling.<br />
<strong>LumenVox</strong> training will give you the guidance you need to successfully design, develop, deploy, and<br />
refine your applications. We tailor our trainings to meet your particular needs.<br />
About the Instructors<br />
Our team of expert instructors is committed to your success. Every <strong>LumenVox</strong> instructor has a<br />
background in computer telephony, application development, and speech recognition. We are<br />
familiar with the development challenges you will encounter on a daily basis; we will offer solutions<br />
to routine problems, as well as creative approaches to not-so-routine problems.<br />
<strong>LumenVox</strong> Support Services: Ensuring Your Success<br />
At <strong>LumenVox</strong>, we recognize that high quality, cost effective technical support is a crucial component<br />
of successful application development. With proper support, subscribers gain a deeper product<br />
understanding, resulting in enhanced productivity, and ultimately in greater customer satisfaction.<br />
With this in mind, <strong>LumenVox</strong> offers a simplified technical support system designed to meet varying<br />
customer needs. <strong>LumenVox</strong> technical support plans are available for VARs, Distributors, and End<br />
Users with ongoing projects/support needs. The key component of <strong>LumenVox</strong> technical support is<br />
the Customer Hotline. Two additional avenues are also available: Fax Support and Email Support.<br />
Whichever method you choose, know that <strong>LumenVox</strong> will work efficiently to answer your questions<br />
and resolve the problem.<br />
Our <strong>LumenVox</strong> technical support team is made up of knowledgeable<br />
professionals with extensive <strong>LumenVox</strong> development and support experience.<br />
We are well versed in computer telephony technology and are available to<br />
assist you with:<br />
General <strong>LumenVox</strong> Technical Assistance<br />
Timely Problem Resolution<br />
Product Installation Assistance<br />
<strong>LumenVox</strong> Database/Host Connection Assistance<br />
Intel Dialogic Hardware Optimization<br />
I really appreciate the cooperation<br />
and assistance that we received from the<br />
<strong>LumenVox</strong> engineers. They are an easy<br />
group to work with.<br />
Chris Riggenbach<br />
CXM Product Development Manager<br />
Our Partners Include...<br />
Keeping People Connected<br />
<strong>LumenVox</strong> and SandCherry Speech Enable 2-way Radio Networks<br />
SandCherry's Voice4 Radio Message System (RMS) dramatically improves workgroup efficiency for<br />
teams with 2-way radio, phone, and Web users by providing easy-to-use voice and data messaging.<br />
The ability to offer messaging to 2-way radios (one of the most widely used tools for mobile<br />
workforces) adds an entirely new dimension to workgroup communications.<br />
Leaving messages for radio users when they are unavailable, and providing access to the same<br />
system for phone and Web users to leave and retrieve messages, provides the critical bridge<br />
between what had been independent communications networks. The system also offers phone<br />
users a patch capability to connect to the radio network for real-time communication with radio<br />
users. Voice4 RMS equips a workgroup to improve its efficiency and performance using tools and<br />
processes already in place, without relying on a dispatcher.<br />
Using <strong>LumenVox</strong>'s Speech Engine, SandCherry's Voice4 RMS provides an unparalleled level of<br />
voice-driven functionality for mobile workgroups. Voice4 RMS is easy to install and use, and<br />
provides a cost-effective, robust communications solution.<br />
Making Travel Easier<br />
Los Angeles World Airports (LAWA) uses the <strong>LumenVox</strong> Engine and enGenic technology to<br />
speech-enable travel information<br />
Los Angeles World Airports, including Los Angeles International Airport (LAX), has implemented the<br />
first fully speech enabled voice response system for a major airport network. In compliance with<br />
Homeland Security regulations, LAWA helps callers access the most up-to-date flight information<br />
directly from the airport hotline, rather than calling individual airline carriers.<br />
Now callers can check the status of their flights, get information about parking, ground<br />
transportation, and services for people with disabilities, inquire about lost and found items, contact<br />
administrative offices, and get directions to the airport, all through an automated speech system.<br />
The hotline was created using enGenic's development and engineering tools, and <strong>LumenVox</strong>'s<br />
Speech Engine. Jim Coulter, CEO of enGenic, states, "<strong>LumenVox</strong> provides us with fast and accurate<br />
recognition of speech grammars, in an ever-changing world environment. Callers around the world,<br />
with different accents, can use simple English commands to obtain information from all four<br />
airports, over 185 airlines, 40 different taxi and limousine services, and 20 different airport<br />
departments."<br />
The new system will enable us to<br />
continue our rapid growth and at the<br />
same time improve the efficiency of our<br />
ordering system without adding<br />
additional staff.<br />
Mike Gilson<br />
Vice President of ATA Retail<br />
Services<br />
Helping Doctors Treat Patients<br />
<strong>LumenVox</strong> Speech Engine Used in Innovative Patient Follow-up System<br />
The University of Ottawa Heart Institute in Ottawa, Canada has implemented an innovative<br />
automated patient follow-up system developed by <strong>LumenVox</strong> partner TelASK Technologies Inc.<br />
(www.TelASK.com). The TelASK System incorporates the <strong>LumenVox</strong> Speech Engine, allowing the<br />
Ottawa Heart Institute to closely monitor the progress and recovery of surgical patients by placing<br />
an automated call to their homes on day three, and again on day ten, after the patients have been<br />
discharged. The speech enabled outbound dialing system automatically phones the patient and<br />
delivers a pre-set list of questions. Each question is a strong indicator of the patient's progress.<br />
Using a specially designed algorithm, patients' answers are grouped as either requiring an<br />
immediate call back, contact within a day, or progressing normally. In the event of a response that<br />
may indicate a problem, the system will hold callers on the line and connect them to an Advance<br />
Practice Nurse at the Heart Institute for immediate attention. The University of Ottawa Heart<br />
Institute will also use this system to monitor patients with Acute Coronary Syndrome, and patients<br />
participating in their Smoking Cessation programs.<br />
Types of Speech Recognition<br />
Speech recognition is used in a wide range of applications, from<br />
automated commercial phone systems to enhancing personal<br />
productivity. This technology appeals to anyone who needs or wants a<br />
hands-free approach to computing tasks.<br />
There are two main types of speaker models: speaker independent and speaker dependent.<br />
Speaker independent models recognize the speech patterns of a large group of people. Speaker<br />
dependent models recognize speech patterns from only one person. Both models use mathematical<br />
and statistical formulas to yield the best word match for speech. A third variation of speaker<br />
models is now emerging, called speaker adaptive. Speaker adaptive systems usually begin with a<br />
speaker independent model and adjust these models more closely to each individual during a brief<br />
training period.<br />
Leveraging the Power of Speech<br />
Although many companies' first instincts are to simply speech enable their existing DTMF<br />
applications, doing so does not leverage the power and strengths of speech: speech enabling a<br />
DTMF application will not make the system smoother, faster or easier-to-use.<br />
The combination speech/DTMF system lengthens already complex menus by adding the "press or<br />
say" routine: "For Checking, press or say 'one,'" or even worse, "For Checking, press one or say<br />
'checking'." The typical combination speech/DTMF system requires the caller to remember too<br />
much, putting undue burdens on the caller.<br />
Migrating a DTMF application and its prompt design does not fully utilize the conversational aspect<br />
of speech. For speech applications to perform well, the call flow and dialog design is crucial.<br />
Designers must study the user patterns of the existing system, so they can redesign prompts,<br />
menus, and change the steps of the call flow to make the experience faster and more pleasant for<br />
callers.<br />
Well-designed speech applications offer many advantages over the combination speech/DTMF<br />
systems.<br />
By using a speech enabled system, our<br />
merchandisers realize significant time<br />
savings through 24-hour-a-day telephone<br />
access to information, from delivery status<br />
to new delivery days and product<br />
opportunities.<br />
Mike Gilson<br />
Vice President of ATA Retail Services<br />
Speech is:<br />
More Human<br />
With speech, prompts are phrased as easy<br />
questions, and callers can answer simply and<br />
naturally, with their voices. Speech systems<br />
provide a more natural interface than touch tone<br />
menus.<br />
Smooth and Fast<br />
Good speech call flow designs permit callers to<br />
get what they need faster, without having to wade<br />
through cumbersome filter menus.<br />
Easy-to-Use<br />
Navigation is much simpler, and callers can use<br />
the application with the interface mechanisms<br />
they are most familiar with: their voices.<br />
More Personal<br />
Speech applications give the impression of the<br />
ideal employee: attentive, empathic, alert, and<br />
consistently agreeable, rather than an impersonal<br />
string of numbers and tones.<br />
When to Use DTMF<br />
Sometimes, DTMF is appropriate: as an<br />
error-handling backup, or in special, security-sensitive<br />
interactions, such as pin code or credit card entry.<br />
In terms of customer satisfaction for most calls,<br />
speech applications outperform DTMF. Rather<br />
than speech-enabling an existing DTMF system,<br />
design your application with a conversation in<br />
mind, and learn to leverage the power of speech.<br />
State of the Industry<br />
To get an accurate understanding of the current state of the speech<br />
industry, we must first look at the history of Interactive Voice Response<br />
(IVR) systems.<br />
Companies have long handled customer interactions with touch tone IVR systems or live<br />
agents. Yet most customers become frustrated with ineffective DTMF interfaces, or hang up while<br />
holding for a live agent. To support customer interactions more quickly and efficiently,<br />
companies began to request speech recognition interfaces.<br />
This movement towards speech provided the speech recognition industry with tremendous growth<br />
potential; however, many companies consider speech to be in the early adopter stage of the market.<br />
Why is that?<br />
While people have been hearing about speech for decades, only in the past decade or so have<br />
advances in the technology and supporting hardware allowed speech to finally become a viable<br />
option, with most systems performing tasks at over 90 percent accuracy. During this period, many Fortune<br />
500 companies implemented speech recognition, and helped educate consumers on how to interact<br />
with speech applications. These applications have become so advanced and mainstream that<br />
businesses, both large and small, now turn to speech solutions for everything from basic<br />
auto-attendants to more complex order-taking systems.<br />
Vendor Selection Tips<br />
Select a partner with the technological and business<br />
expertise that best suits your company and future<br />
projects. Ensure that the partner you choose will<br />
provide all of the services and products you'll<br />
need to be successful.<br />
Search for tools that allow for every change and<br />
adjustment to be automatically and rigorously<br />
tested, with actual historical call data.<br />
Include a tool in your development process that<br />
verifies any dead ends or unfinished work.<br />
Choose technologies that best fit with your application<br />
requirements.<br />
Select partners with deployment experience.<br />
Verify the level of technical support provided, and ensure that<br />
you will receive the support you need.<br />
Speech recognition offers a great solution for large and small businesses: it simplifies customer<br />
interactions, increases efficiency, and reduces operating costs.<br />
Analysts at Cahners In-Stat and Giga report that calls handled by live agents can have an<br />
average cost-per-call of $2 to over $15. With speech recognition, the average cost-per-call can<br />
be cut to $0.20 or less.<br />
<strong>LumenVox</strong>’s corporate and product<br />
strategy is right in sync with us.<br />
Vern Baker<br />
President of enGenic Corporation<br />
Since this technology appeals to anyone who needs or wants a hands-free approach to computing<br />
tasks, it is becoming a standard software option. At <strong>LumenVox</strong>, we focus on developing tools that<br />
will empower users to build, customize, maintain, and refine their own applications.<br />
Speech recognition system development is still something that requires time to prepare and<br />
monitor. With tools like <strong>LumenVox</strong>'s Speech Platform and Speech Tuner, we are continually working<br />
to simplify the aspects of speech application development⎯to help your business get the most out<br />
of speech recognition.<br />
Effective Design = Customer Satisfaction<br />
Speech recognition applications come in many varieties, from simple<br />
routers to complex ordering systems. What designers must remember is<br />
ease-of-use. Even with a complex system, callers must be able to<br />
navigate through the system easily.<br />
Speech applications allow customers to accomplish their goals quickly and easily. Much of the<br />
internal work is in the design phase: building the call flow, creating grammars, recording prompts,<br />
and conducting usability testing. Speech application designers will modify each aspect throughout<br />
design and internal testing phases.<br />
But with all the speech applications on the market today, and most prominent speech companies<br />
boasting recognition accuracy in the high 90-percent range, why do so many people feel that "speech<br />
doesn't work"?<br />
Usually, it's because callers don't know what to say.<br />
You can avoid common problems by carrying<br />
out the following steps during the design phase:<br />
1 Research Needs and Create Initial Design.<br />
First, speak to the people who currently answer phone calls, and get their<br />
input. What current questions and interactions could potentially be automated?<br />
If the company already uses a DTMF application, how well does DTMF handle<br />
these interactions and questions? Not all interactions match well with speech's<br />
capabilities, so initial research is critical.<br />
Next, sketch out a potential progression of the call flow, and share this with<br />
others to make sure the progression makes sense and that callers can quickly<br />
and easily navigate the system.<br />
2 Develop Prompts and Grammars.<br />
Designers should decide how much of a "natural language" system callers need<br />
or desire. "How may I help you?" only works if callers know precisely what<br />
they want, and the designer can accurately predict their responses. Generally,<br />
callers will need some guidance and cues as to what to say. A "How may I<br />
help you?" question involves more extensive grammar development and testing<br />
than a question like: "We offer three choices, A, B, or C. Which would you<br />
prefer?"<br />
The customer service that we have<br />
received from the <strong>LumenVox</strong> team has<br />
always been top of the line and we are<br />
looking forward to continuing our<br />
partnership for many years in the<br />
future.<br />
The application developer must keep the system as conversational as possible,<br />
but prevent callers from treating the machine as a human. When callers think<br />
the system "actually" understands, or think that it possesses a greater<br />
vocabulary than it really does, they get lost, or make requests that are outside<br />
the system's capabilities. These problems rapidly compound, resulting in caller<br />
frustration and dissatisfaction. Effective, intentional, clearly designed prompts<br />
and grammars help keep customers satisfied while managing cost.<br />
3 Test with Real Customers.<br />
The ultimate measure of an application is the first live deployment of the<br />
system. The first live deployment must be a test version with actual users, not<br />
the programmers who are intimately familiar with the application's design.<br />
This will be the first time that assumptions about user behavior will be<br />
seriously tested; the resulting data will allow designers to modify the<br />
application to meet the caller's needs.<br />
4 Tune with Real Data.<br />
Test deployments permit designers to fine-tune the system, often resulting in<br />
significant changes, if they review actual caller experiences. By refining<br />
prompts, grammars, and call flow design, the application will become more<br />
robust, error-free, clear, and effective: in short, an application that customers<br />
will want to use.<br />
To tune the application effectively, all of the components of initial<br />
design (prompts, grammars, call flow, and the persona of the call<br />
system) must be tested. Since these elements are often built separately,<br />
developers must ensure that all of the parts combine effectively in the testing<br />
phase to achieve the desired effects. Properly tuning the speech application<br />
involves a thorough assessment of initial design components and real caller<br />
interactions.<br />
Error Handling<br />
Make your speech system more efficient and usable by optimizing error<br />
handling.<br />
Focus on the basics: Is the system accurate? Does the caller achieve call completion? Does the<br />
caller like using the system, and want to continue doing business with your company?<br />
An optimal application addresses these issues by combining technology with art. Mixing technical<br />
aspects, like programming and testing, with aesthetic elements, like writing, casting, and coaching,<br />
reduces errors and increases customer satisfaction. Give sufficient consideration to each part of<br />
the development process.<br />
Understand the speech recognizer that you are using⎯the confidence scores it returns allow you to<br />
make good decisions about the call flow. Track confidence scores at the project, grammar, and<br />
single call levels, to set both static and dynamic thresholds. This will permit the system to make a<br />
good decision on whether or not to confirm.<br />
Remember that it is better to confirm than to make a mistake. Although confirming can be<br />
unpleasant for callers, it is preferable to the frustration of being lost. Figure out when you need to<br />
confirm using the confidence scores, and try to make the confirmation prompts less complex than<br />
the original prompt. If you choose your confirmations wisely, even though it takes a little longer,<br />
users will not become irritated or impatient, and will get to where they need to be.<br />
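The threshold-driven confirmation decision described above can be sketched in a few lines. The threshold values and the function name here are illustrative assumptions for the purpose of explanation, not part of any LumenVox API:

```python
# Illustrative confirmation logic driven by ASR confidence scores.
# The threshold values are hypothetical; real values come from tuning
# against your own call data, per grammar and per prompt.

ACCEPT_THRESHOLD = 0.80   # above this, accept the result outright
CONFIRM_THRESHOLD = 0.45  # between the thresholds, ask the caller to confirm

def next_action(confidence: float) -> str:
    """Decide what the dialog should do with an ASR result."""
    if confidence >= ACCEPT_THRESHOLD:
        return "accept"      # high confidence: proceed without confirming
    if confidence >= CONFIRM_THRESHOLD:
        return "confirm"     # ambiguous: better to confirm than to err
    return "reprompt"        # low confidence: treat as a no-match and re-ask

print(next_action(0.91))  # accept
print(next_action(0.60))  # confirm
print(next_action(0.20))  # reprompt
```

In practice these static thresholds would be supplemented with dynamic adjustments as the application collects results at the project, grammar, and single-call levels.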
Help Callers Avoid Frustration…<br />
Errors occur because callers get lost⎯or are unsure of what to say. In either case, this<br />
is usually a prompt issue. Effective prompt writing guides callers to say what is in the<br />
grammar. Another option is to make the grammar robust enough to handle reasonable<br />
requests, even when the caller's phrasing is clumsy.<br />
No matter the style, the prompt should always focus on the task at hand: moving the<br />
call forward.<br />
Sequential prompts should be connected with transitions, e.g., first, next, finally.<br />
Prompts can also use related questions, as in "Please tell me your account<br />
number…And your pin code."<br />
Although long prompts should generally be avoided, sometimes a few extra words pay<br />
off. Phrases like "Just to confirm," "Almost finished," and "So I can help you better"<br />
create a forward mental model…in these cases, reassuring the caller is more beneficial<br />
than the few seconds you gain by clipping the prompt.<br />
Whatever your company's style or needs, <strong>LumenVox</strong> can help.<br />
Solving Problems<br />
When an error happens, fix it! And don't let it happen<br />
again.<br />
Sounds simple enough, but how? And what will<br />
customers do in the meantime?<br />
Figuring out why an error happens is the key to fixing<br />
it. The feedback callers provide is often vague;<br />
instead, go straight for empirical data. Examine the<br />
context of the call to help pinpoint what caused the<br />
error: too much background noise; not enough<br />
volume; mispronunciation. Invariably, errors will<br />
occur. To provide great customer service through<br />
speech applications, we need to minimize errors, and<br />
make the caller as comfortable as possible when errors<br />
do occur.<br />
Error-handling interfaces often increase caller<br />
frustration. Recognize any of these?<br />
Adversarial error responses:<br />
"I need you to be more specific."<br />
Generic responses:<br />
"I'm sorry, I didn't understand."<br />
Annoyingly enthusiastic responses:<br />
"Let's try it again!"<br />
When developing speech applications, most companies strive to<br />
achieve a balance between saving money on live agents and<br />
providing better service than traditional DTMF systems.<br />
Unfortunately, callers are sometimes uncomfortable with speech<br />
applications⎯they try to talk as they would to a human, or worse,<br />
speak in a stilted way, because they think computers will better<br />
process their requests. Successful experiences with your call<br />
system will help increase caller confidence⎯and when errors<br />
occur, callers will be more patient if they have had positive<br />
experiences in the past.<br />
Improving the caller experience is about good service.<br />
Thinking carefully about the error interface, designing<br />
effective prompts, testing the call system often, and<br />
examining the context of errors can help improve<br />
the caller's satisfaction with the speech<br />
application…and with your business.<br />
We are delighted that an industry<br />
leader like <strong>LumenVox</strong> has met the market<br />
demand with a product specialized for our<br />
TeleVantage platform.<br />
Rob Black, Product Marketing Manager<br />
of Vertical (formerly Artisoft)<br />
Voice Matters<br />
The voice of your speech application is your company's<br />
first representative. Choose the voice wisely. You're not<br />
just looking for a type of voice; you're looking for an<br />
emissary. The voice needs to be able to explain, inspire,<br />
soothe, excite, and above all else, sound sincere.<br />
Assuming the Voice User Interface (VUI) design is good, the prompts must be<br />
recorded to best represent that design. The four essential pieces for great<br />
prompt recording include:<br />
Great Casting<br />
Great Directing<br />
Great Concatenated Recording<br />
Great "Voice"<br />
Think about your desired call experience⎯and consider using professional<br />
voice talent rather than your receptionist. Professionals will provide<br />
better sounding prompts: they will know appropriate variations in<br />
pitch, rhythm, speed, duration of pauses, and elongation of words.<br />
In addition, professionals will avoid novice errors like wobble,<br />
nervous tics, sloppy diction, and colloquial pronunciations. Most<br />
importantly, talent is directable; professionals can respond to<br />
instructions regarding persona ideas and desired inflections, to<br />
create the proper tone for the application.<br />
Voice Talents versus Voice Actors<br />
Voice Talents are people who speak well, with good resonance<br />
and intonation. Voice Actors are people who are Voice Talents,<br />
but through the use of their voices alone, can also convey<br />
character, humor, sincerity, and meaning. Voice actors can take<br />
your application to the next level, providing a better experience<br />
for your callers.<br />
Voice Actors do not necessarily over-articulate everything. They<br />
stress the most important words and concepts, which often<br />
corresponds to what the system is trying to recognize. Voice Actors will<br />
also know how to record concatenated prompts with consistency. If<br />
feasible, using a Voice Actor will give your speech application polish⎯and<br />
this polish, combined with a well-designed prompt and grammar, will lead to<br />
more satisfied customers.<br />
Prompt Tips<br />
Never disguise the system as being a real person.<br />
Ensure that prompts elicit a predictable response.<br />
Offer the needed information at the right time,<br />
and try not to frontload the application with too<br />
much information.<br />
Never use "Press or Say 1" style prompts.<br />
List options from specific to general⎯so that<br />
users do not choose a general category when a<br />
more specific one will be offered later.<br />
Allow more time for difficult tasks: the pace of<br />
the recorded prompts dictates the pace at which<br />
the callers will respond.<br />
Keep the caller in the transaction by saying<br />
"first…next…finally" or similar transition words, in<br />
the appropriate places.<br />
Insert pauses in your prompts to allow<br />
experienced callers a turn-taking option to<br />
interrupt and move forward in the application.<br />
Provide audio rewards at the completion of<br />
difficult transactions, such as "great, wonderful,<br />
excellent..."<br />
<strong>LumenVox</strong>'s software proved to be<br />
very effective in precisely interpreting<br />
callers’ speech patterns.<br />
Chris Riggenbach, CXM Product<br />
Development Manager<br />
Alternate Pronunciations<br />
One of the most useful speech applications today is the front-end call<br />
router; however, it's also one of the most challenging applications<br />
because the system must recognize names.<br />
Sometimes names are derived from languages other than English, and the<br />
pronunciation reflects rules from the other language. Often these names<br />
contain sounds that are not apparent from the spelling, or the caller<br />
stresses a syllable that differs from the common pronunciation.<br />
Imagine a person looking at the name "Elicia." Is the initial sound a<br />
soft 'AX' as in "about", an 'EH' as in "bed", or is it stressed heavily<br />
with a long 'IY' as in "equal"? Is the third syllable pronounced with<br />
an 'S' sound or a 'SH'? The speech application needs help to<br />
determine this.<br />
If a word or name is not in the dictionary, the Speech Recognition<br />
Engine will try to figure out how that word is pronounced using a set<br />
of phonetic rules, similar to how a person might try sounding out the<br />
new word. Unfortunately, the Speech Engine is not always correct. A<br />
good rule of thumb is that if a person has trouble figuring out how to<br />
pronounce a name, the speech engine will, too.<br />
Steps for Developing a Smooth Call Router:<br />
1<br />
Figure out who the incoming callers are. Are they strangers, people who know the employees, or<br />
a combination of both? In other words, will the callers know how to pronounce the name correctly,<br />
or will other likely pronunciations need to be added? Do the callers refer to employees by first or<br />
last name only, and are they familiar enough to know people's nicknames?<br />
2<br />
Find out what the Speech Engine thinks is the correct pronunciation, or whether other<br />
pronunciations are needed. You can use the Phonetic Speller located in the Speech Platform to<br />
see how the Speech Engine determines the pronunciations of the names, without having to run the<br />
Speech Engine itself.<br />
3<br />
Add the new pronunciations for names into the grammar. Here's an illustration of this process:<br />
The name "Paty" (pronounced "Patty") is not a common spelling and is not in the Speech Engine's<br />
dictionary. When typing it into the Phonetic Speller, the system returns 'P EY DX IY' which sounds<br />
something like "Paydee", instead of the correct 'P AE DX IY'. To add a new pronunciation, the<br />
phonemes must be entered within a set of curly braces. Adding a colon followed by the true<br />
spelling of the name helps readability, so it's a good idea to include it. The final entry of {P AE DX<br />
IY: Paty} as an alternative pronunciation should help the system's performance, since now callers<br />
who say "Patty" will be more likely to be recognized, instead of the incorrect {P EY DX IY}.<br />
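Based on the curly-brace notation described above, a directory grammar entry might look like the following sketch. Only the {P AE DX IY: Paty} form is taken from the text; the surrounding ABNF-style wrapper, rule names, and employee names are illustrative assumptions, not exact LumenVox syntax:

```abnf
#ABNF 1.0;
language en-US;
root $employee;

// Hypothetical name list: "Paty" gets an explicit phonetic entry so that
// callers who say "Patty" are recognized, instead of relying on the
// engine's automatic (and incorrect) "P EY DX IY" guess.
$employee = jack smith
          | {P AE DX IY: Paty} jones;
```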
Phonemes<br />
The unit of sound the recognition engine actually recognizes is the phoneme. All phrase formats<br />
are ultimately translated into phonetic spellings for decoding. These phonetic spellings can be<br />
directly entered if surrounded by curly braces.<br />
The phonetic alphabet used by the American English language model is below.<br />
Phoneme   Example 1   Phonetic Spelling 1   Example 2   Phonetic Spelling 2<br />
Vowels<br />
AA        barn        B AA R N              top         T AA P<br />
AE        bat         B AE T                crab        K R AE B<br />
AH        what        W AH T                cut         K AH T<br />
AO        more        M AO R                auto        AO T OW<br />
AW        cow         K AW                  house       HH AW S<br />
AX        about       AX B AW T             dial        D AY AX L<br />
AXR       butter      B AH DX AXR           career      K AXR IH R<br />
AY        type        T AY P                life        L AY F<br />
EH        check       CH EH K               mess        M EH S<br />
ER        church      CH ER CH              bird        B ER D<br />
EY        take        T EY K                hail        HH EY L<br />
IH        little      L IH DX AX L          rib         R IH B<br />
IX        action      AE K SH IX N          women       W IH M IX N<br />
IY        team        T IY M                keep        K IY P<br />
OW        loan        L OW N                robe        R OW B<br />
OY        hoist       HH OY S T             joy         JH OY<br />
UH        book        B UH K                look        L UH K<br />
UW        flew        F L UW                who         HH UW<br />
Consonants<br />
B         web         W EH B                bear        B EH R<br />
CH        chair       CH EY R               statue      S T AE CH UW<br />
D         reed        R IY D                dark        D AA R K<br />
DH        with        W IH DH               other       AH DH ER<br />
DX        forty       F AO R DX IY          butter      B AH DX AXR<br />
F         four        F AO R                graph       G R AE F<br />
G         peg         P EH G                exam        IH G Z AE M<br />
HH        halt        HH AO L T             Jose        HH OW Z EY<br />
JH        cage        K EY JH               Jack        JH AE K<br />
K         coin        K OY N                back        B AE K<br />
L         late        L EY T                really      R IH L IY<br />
M         lemon       L EH M AH N           mail        M EY L<br />
N         night       N AY T                any         EH N IY<br />
NG        ring        R IH NG               ankle       AE NG K AH L<br />
P         pay         P EY                  beep        B IY P<br />
R         rest        R EH S T              prior       P R AY ER<br />
S         sit         S IH T                bass        B AE S<br />
SH        blush       B L AH SH             sure        SH UH R<br />
T         raft        R AE F T              taped       T EY P T<br />
TH        three       TH R IY               youth       Y UW TH<br />
V         van         V AE N                river       R IH V AXR<br />
W         swap        S W AA P              wing        W IH NG<br />
Y         yes         Y EH S                year        Y IY R<br />
Z         arms        AA R M Z              blaze       B L EY Z<br />
ZH        Asian       EY ZH AH N            genre       ZH AA N R AH<br />
Considering alternate pronunciations and spellings at the outset will help avoid errors and frustration later!<br />
Practical Guide To Tuning<br />
Untuned speech applications do not survive contact with customers:<br />
whether your company has live speech applications in deployment<br />
today, plans to implement one within the next three to six months, or is<br />
only beginning to consider adding speech applications, you should<br />
consider the importance of tuning. Tuning uses prompts, grammars,<br />
call flow, and caller data to improve the speech application as a whole.<br />
There are three ideas to keep in mind when approaching<br />
the tuning task:<br />
1<br />
Tuning Takes Time.<br />
Even the best of "best-practices" build on assumptions that might not hold<br />
true after deployment⎯once you have callers, you must often readjust or<br />
remove these assumptions to provide the quality experience callers expect.<br />
To give an idea of how much time tuning can take, the speech industry<br />
estimates that 40-50% of total development and deployment time should<br />
be spent on the tuning process. Putting emphasis on tuning will help your<br />
application run more smoothly, keeping callers happy.<br />
2<br />
Adapt the System to the Caller.<br />
In general, you will not be able to make users do anything in any particular way.<br />
You can, and should, give as much guidance for callers as possible, but<br />
ultimately the caller dictates the conversation. The trick is to provide good cues<br />
and guidelines, so callers choose the pathway you designed for the application.<br />
Remember that if the system fails to meet the caller's needs, it's not the caller<br />
who has failed; it's the speech application.<br />
3<br />
Start with Small Changes.<br />
It is all too easy to get caught up in the moment, expending hours of effort on a<br />
seemingly enormous problem that really only affects a few out of several hundred<br />
callers. Identify the issues that are the easiest to resolve and provide the biggest<br />
benefit. Making small changes to improve the experience for most callers is<br />
preferable to costly changes that only benefit a few.<br />
Instead, try this process when tuning an application:<br />
1<br />
Familiarize Yourself with the Caller's Experiences.<br />
Do this by listening to the calls, from start to finish. Compare<br />
the ASR results with respect to the audio prompts and the<br />
caller's speech. Transcribe the audio, so you can analyze<br />
the accuracy and performance.<br />
Use your ASR platform's reporting and analytical tools<br />
to maximize your information. You can even use<br />
<strong>LumenVox</strong> Speech Tuner on Nuance's 8.5 or<br />
ScanSoft's OSR.<br />
Above all, identify the key issues and prioritize them.<br />
Solve the easiest dilemmas first, like typical grammar<br />
problems. Then, move to prompt and dialogue<br />
changes, and finally proceed to acoustic model<br />
training and adaptations.<br />
2<br />
Test Changes Rigorously.<br />
When you make a change, you must test it. You did<br />
the transcripts, and so you have the grammar and<br />
audio data: as much as possible, test under 'real'<br />
conditions. Give yourself the assurance that any<br />
change will help, and then test to find what solution<br />
works best.<br />
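The comparison of ASR results against transcriptions in step 1 can be sketched as a word-level accuracy check. This is an illustrative computation only; tools such as the Speech Tuner report these figures for you:

```python
# Illustrative word-accuracy check: compare an ASR result against a
# human transcription using word-level edit distance.

def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum substitutions + insertions + deletions between word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)]

def word_error_rate(reference: str, hypothesis: str) -> float:
    return word_errors(reference, hypothesis) / max(len(reference.split()), 1)

# One substitution ("savings" heard as "checking") out of four reference words:
print(word_error_rate("transfer to savings account",
                      "transfer to checking account"))  # 0.25
```

Aggregating this rate across transcribed calls shows where accuracy is low, which is where grammar and prompt tuning should start.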
What you shouldn't do when tuning a speech application:<br />
Don’t Make Changes<br />
Based on One Instance.<br />
This should be fairly obvious, but it still<br />
happens. Making changes based on a single<br />
instance usually results in fixing a problem that<br />
doesn't really exist. There are numerous<br />
'one-off' errors in speech recognition, many of<br />
which are associated with noise, or transient<br />
effects that won't be generally reproducible.<br />
Real issues will arise multiple times, in multiple<br />
places, with plenty of evidence to help you<br />
decide how to solve them.<br />
Don’t Make Changes Based on<br />
Unanalyzed Reports.<br />
Treat the report with respect: analyze the call,<br />
compare it with other calls, see what really<br />
happened⎯often, the system worked as<br />
designed, but the design was flawed. Research<br />
the problem carefully so that you avoid<br />
unnecessary (and costly) changes.<br />
The <strong>LumenVox</strong> support team is<br />
always available and always willing to go<br />
the extra mile to provide us with<br />
excellent support.<br />
Derek Henry<br />
CEO of 1-800-US-LOTTO<br />
Corporation<br />
Tuning Grammars<br />
There are many places to make effective changes, but generally, we have<br />
found grammars to be the easiest and most effective place to start. In<br />
this segment, we will look at how to detect errors and modify grammars<br />
to optimize performance.<br />
Grammar Terms<br />
In-Grammar (IG) and Out-of-Grammar (OOG) are labels that indicate whether the ASR can match<br />
a path in the grammar with what the caller actually said. If it can, the spoken words are<br />
In-Grammar; if not, the spoken words are considered Out-of-Grammar.<br />
Confidence scores indicate the ASR's certainty about the answer it returns.<br />
Confirmations are dialog techniques to help the speech application avoid making a mistake, in<br />
cases where the results are ambiguous.<br />
Substitutions are a particular kind of error the ASR makes; this occurs when the result from the<br />
ASR does not match the words the caller said.<br />
<strong>LumenVox</strong>’s development and support<br />
staff has been very responsive to our<br />
requirements and issues.<br />
Doug Behl, President of Malibu Software<br />
Out-Of-Grammar Indicators<br />
There are a number of ways to determine whether or not the error is due to an Out-of-Grammar<br />
issue. The easiest, most efficient way to discover this is to use the tools provided by the platform<br />
or ASR provider. Pre-configured reports will usually highlight these issues up front.<br />
If Out-of-Grammar is a big problem, you will likely receive many customer complaints and low<br />
completion and usage rates. Call logs will also have many "No Matches" or empty results. Finally,<br />
when you do obtain results, the confidence score will be significantly lower than the rest of the<br />
application.<br />
As the call logs are transcribed, look for low accuracies. ASRs will usually recognize anything that is<br />
In-Grammar, so if accuracy is still low, look for a large amount of Out-of-Grammar speech.<br />
The other good indicator is a high Out-of-Grammar rate. Generally, this is a direct indication that<br />
callers are not saying things your grammar understands.<br />
There are a few reasons for Out-of-Grammar problems, most of which are easy to resolve.<br />
One common reason is that the grammar designer simply forgot to add items, but callers are asking<br />
for them. Leaving "next" out of a navigation grammar, or forgetting to add a product name is not<br />
uncommon.<br />
Another common error is forgetting common synonyms, for example, 'copier', but not 'Xerox', or<br />
different dialectal versions such as 'soda' in the West, and 'pop' in the South. Usually, you can just<br />
add the missing items.<br />
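The indicators above can be checked mechanically once calls are transcribed. The record format below is a hypothetical illustration for explanation, not a LumenVox log format:

```python
# Illustrative analysis of transcribed call-log records: estimate the
# Out-of-Grammar (no-match) rate and the average confidence of results.

calls = [
    {"transcript": "checking", "result": "checking", "confidence": 0.92},
    {"transcript": "savings", "result": "savings", "confidence": 0.88},
    {"transcript": "um the next one please", "result": None, "confidence": None},  # no match
    {"transcript": "xerox", "result": None, "confidence": None},  # missing synonym
]

# Empty results suggest the caller's speech was Out-of-Grammar.
no_match = [c for c in calls if c["result"] is None]
oog_rate = len(no_match) / len(calls)

scores = [c["confidence"] for c in calls if c["confidence"] is not None]
avg_confidence = sum(scores) / len(scores)

print(f"OOG/no-match rate: {oog_rate:.0%}")         # 50%
print(f"Average confidence: {avg_confidence:.2f}")  # 0.90

# A high no-match rate, combined with complaints and low completion,
# usually means the grammar is missing items or synonyms callers say.
```

Reviewing the transcripts behind the no-match records shows exactly which missing phrases or synonyms to add to the grammar.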
In-Grammar Indicators<br />
Typical In-Grammar issues will often be oriented towards improving the confidence scores of<br />
recognition, but you will also confront misrecognitions.<br />
What kinds of issues arise frequently?<br />
Regularly confused phrases are fairly common, and often result because two or more phrases sound<br />
quite similar. Another common issue is the result of bad pronunciations in the grammar. ASRs<br />
provide a methodology for arriving at pronunciations for words that aren't in the dictionary, but the<br />
automatic pronunciations are not always the best.<br />
So, how can we handle this?<br />
For regularly confused phrases, differentiate them by choosing alternate ways to describe the<br />
words. For bad pronunciations, you must add the pronunciations to the dictionary or grammar.<br />
There are tools for helping with this task, although they are nearly always ASR-specific.<br />
Failures and Fixes: Common<br />
Prompt Tuning Issues<br />
Effective prompt design takes time and practice⎯some errors will not<br />
present themselves until the prompts are tested. There are, however,<br />
some common issues that arise when tuning prompts, which should<br />
help streamline your prompt design and tuning process.<br />
Here's what you should do when callers…<br />
…Give Long, Perplexing Answers<br />
Callers consistently give full-sentence answers instead of short, to-the-point answers. Typically,<br />
this occurs because the prompt asks a very open-ended question, such as "How may I help<br />
you?" Avoid these open-ended prompts; callers usually do not know what responses are<br />
appropriate at particular points in the call. The only real solution when this error occurs is to<br />
redesign the prompt to be more specific, or redesign the interaction to focus caller responses<br />
on specific tasks.<br />
…Answer with Out-of-Grammar (OOG) Phrases<br />
Callers regularly use a particular phrase that is not in the grammar. Prompts are designed to<br />
elicit particular pieces of information from the caller. Because of this, the prompts usually try to<br />
lead the caller to using the correct words or phrases to minimize recognition errors and caller<br />
confusion. When callers regularly use Out-of-Grammar phrases, it's usually because the prompt<br />
leads them to the wrong phrases. Two choices are available: include the Out-of-Grammar<br />
phrases, or revise the prompt to more obviously reflect available options.<br />
…Answer Randomly, or 'Hunt' for the Right Phrase<br />
Unclear, incomplete prompts force the caller to search unnecessarily for the correct response.<br />
Adding clarifying information will generally fix this problem.<br />
Transcription and Training<br />
Humans are exceptionally good at speech processing.<br />
We handle a variety of accents, speaking styles, pitch<br />
differences, noise, and more, with a high degree of<br />
accuracy. No two speakers say the same word exactly the<br />
same way. Needless to say, this represents a considerable<br />
challenge for automated speech recognizers. Accurate<br />
transcription and tuning is essential.<br />
Every speech recognizer uses a statistical model of speech in<br />
order to perform good recognition. These models are built<br />
during training, where speech audio and text transcriptions are<br />
combined with algorithms that 'learn' how speech sounds. The<br />
models attempt to determine what 'average' speakers sound<br />
like when they speak particular words, and apply that<br />
knowledge to new incoming speech to determine what words<br />
were spoken.<br />
Words and speaking styles are different for every<br />
application domain (i.e., the vocabulary for a travel<br />
system is quite different from that of a financial<br />
application). Speech applications benefit from<br />
acoustic models specially trained with data<br />
from their specific domains.<br />
Transcribing audio data must be exact, word<br />
for word, and include noise tags so the<br />
system can learn the differences between<br />
noise and speech. To do this, the data must<br />
include as many speakers (both male and<br />
female) as possible, so that the new acoustic<br />
models accurately reflect the average speaker, and<br />
not just one or two particular speakers.<br />
With a larger volume of transcribed audio data,<br />
new models will perform better. New acoustic<br />
models will likely require a new round of<br />
tuning, particularly with respect to<br />
confirmation thresholds.<br />
…Answer 'Yes' or 'No,' Instead of Expected Content<br />
If callers respond 'yes' or 'no' when the prompt requests a content word, a poorly designed<br />
prompt is responsible. For example, the prompt might ask a question like "Do you want..."<br />
or "Would you like..." and then pause after the first choice. The pause can be long enough to<br />
make the caller believe the desired answer is yes or no, rather than a list of choices.<br />
Similarly, multi-item lists may pause too long at later points, as in, "Would you like<br />
pizza, soft drinks, or side items?" Generally, we recommend that<br />
you reserve "Do" and "Would" lead-offs for questions that require yes or no answers.<br />
<strong>LumenVox</strong> support is very fast and<br />
accessible.<br />
Kelly Lumpkin, CEO of Alternate Access<br />
Standards and Systems Supported<br />
Industry Standards<br />
VXML/VoiceXML (Voice eXtensible Markup<br />
Language)<br />
SALT (Speech Application Language Tags)<br />
MRCP (Media Resource Control Protocol)<br />
SRGS (Speech Recognition Grammar<br />
Specification)<br />
SAPI 5 (Microsoft SAPI TTS)<br />
Operating Systems Supported<br />
Linux<br />
Windows NT, 2000, XP, 2003<br />
Telephony Standards<br />
PSTN (Public Switched Telephone<br />
Network)<br />
SIP (Session Initiation Protocol) -<br />
Signaling protocol for Internet<br />
conferencing, telephony, and instant<br />
messaging<br />
VoIP (Voice Over Internet Protocol) -<br />
Protocol to send audio and data<br />
information in digital form<br />
DM3 - Intel® series of boards<br />
HMP (Host Media Processing) -<br />
Software that performs media<br />
processing tasks based on Intel®<br />
architecture<br />
Global Call - Protocol for handling the<br />
call control interface for Intel® cards<br />
<strong>LumenVox</strong> allows us to have control<br />
over the development of call flow,<br />
grammars and tuning, while keeping the<br />
costs at a respectable level.<br />
Brian Lauzon<br />
President of TelASK<br />
University Research<br />
Carnegie Mellon University<br />
Performs research for all aspects of speech<br />
recognition, including signal processing,<br />
acoustic model training, language model<br />
training, decoding, spoken language parsing<br />
and interface building.<br />
University of Colorado, Boulder<br />
Focused on research and education in areas<br />
of human communication technology.<br />
Oregon University<br />
Center for Spoken Language Synthesis,<br />
Recognition and Enhancement.<br />
Stanford University<br />
The Center for the Study of Language and<br />
Information (CSLI) is an Independent<br />
Research Center founded in 1983 by<br />
researchers from Stanford University,<br />
SRI International, and Xerox PARC.<br />
UC Berkeley<br />
The major application area researched in<br />
the Speech Group at ICSI is speech<br />
recognition, although some of this work has<br />
led to basic research in auditory processing.<br />
M.I.T.<br />
Computer Science and Artificial Intelligence<br />
Laboratory that focuses on Spoken<br />
Language Systems.<br />
Other Speech Groups<br />
Carnegie Mellon Speech Group<br />
Develops user interfaces that improve<br />
human-computer and human-human<br />
communication.<br />
IEICE<br />
The Institute of Electronics, Information and<br />
Communication Engineers aims at the<br />
investigation and exchange of knowledge on<br />
the science and technology of electronics,<br />
information, and communications.<br />
Speech Publications<br />
Speech Technology Magazine<br />
Leading source of information promoting the<br />
speech technology solutions that are<br />
changing communications, and reports the<br />
technology needs of organizations<br />
worldwide.<br />
Telephone Strategy News<br />
Newsletter that includes full coverage of the<br />
impact of the Voice User Interface on<br />
telephony.<br />
ASRNews<br />
Monthly newsletter, which tracks the latest<br />
development in the speech recognition and<br />
text-to-speech marketplace.<br />
Call Center And Telephony<br />
Technology Marketing<br />
Corporation<br />
TMC publishes industry-leading print<br />
magazines including Internet Telephony,<br />
Customer Inter@ction Solutions, and<br />
Communications Solutions.<br />
CMP Media<br />
Leading multimedia company that prints<br />
numerous magazines, including Call<br />
Center and Communications Convergence.<br />
Business Communications<br />
Review Magazine<br />
Magazine for the enterprise network<br />
manager and other communications<br />
professionals.
Standards and Systems Supported<br />
Industry Standards<br />
VXML/VoiceXML (Voice eXtensible Markup<br />
Language)<br />
SALT (Speech Application Language Tags)<br />
MRCP (Media Resource Control Protocol)<br />
SRGS (Speech Recognition Grammar<br />
Specification)<br />
SAPI 5 (Microsoft SAPI TTS)<br />
Operating Systems Supported<br />
Linux<br />
Windows NT, 2000, XP, 2003<br />
Telephony Standards<br />
PSTN (Public Switched Telephone<br />
Network)<br />
SIP (Session Initiation Protocol) -<br />
Signaling protocol for Internet<br />
conferencing, telephony, and instant<br />
messaging<br />
VoIP (Voice Over Internet Protocol) -<br />
Protocol to send audio and data<br />
information in digital<br />
DM3 - Intel® series of boards<br />
HMP (Host Media Processing) -<br />
Software that performs media<br />
processing tasks based on Intel®<br />
architecture<br />
Global Call - Protocol for handling the<br />
call control interface for Intel® cards<br />
“<strong>LumenVox</strong> allows us to have control<br />
over the development of call flow,<br />
grammars and tuning, while keeping the<br />
costs at a respectable level.”<br />
- Brian Lauzon<br />
President of TelASK<br />
University Research<br />
Carnegie Mellon University<br />
Performs research for all aspects of speech<br />
recognition, including signal processing,<br />
acoustic model training, language model<br />
training, decoding, spoken language parsing<br />
and interface building.<br />
University of Colorado, Boulder<br />
Focused on research and education in areas<br />
of human communication technology.<br />
Oregon University<br />
Center for Spoken Language: synthesis,<br />
recognition, and enhancement.<br />
Stanford University<br />
The Center for the Study of Language and<br />
Information (CSLI) is an independent<br />
research center founded in 1983 by<br />
researchers from Stanford University,<br />
SRI International, and Xerox PARC.<br />
UC Berkeley<br />
The major application area researched in<br />
the Speech Group at ICSI is speech<br />
recognition, although some of this work has<br />
led to basic research in auditory processing.<br />
M.I.T.<br />
Computer Science and Artificial Intelligence<br />
Laboratory that focuses on Spoken<br />
Language Systems.<br />
Other Speech Groups<br />
Carnegie Mellon Speech Group<br />
Develops user interfaces that improve<br />
human-computer and human-human<br />
communication.<br />
IEICE<br />
The Institute of Electronics, Information and<br />
Communication Engineers aims at the<br />
investigation and exchange of knowledge on<br />
the science and technology of electronics,<br />
information, and communications.<br />
Speech Publications<br />
Speech Technology Magazine<br />
Leading source of information promoting the<br />
speech technology solutions that are<br />
changing communications and reporting the<br />
technology needs of organizations<br />
worldwide.<br />
Telephone Strategy News<br />
Newsletter that includes full coverage of the<br />
impact of the Voice User Interface on<br />
telephony.<br />
ASRNews<br />
Monthly newsletter that tracks the latest<br />
developments in the speech recognition and<br />
text-to-speech marketplace.<br />
Call Center and Telephony<br />
Technology Marketing<br />
Corporation<br />
TMC publishes industry-leading print<br />
magazines including Internet Telephony,<br />
Customer Inter@ction Solutions, and<br />
Communications Solutions.<br />
CMP Media<br />
Leading multimedia company that prints<br />
numerous magazines, including Call<br />
Center and Communications Convergence.<br />
Business Communications<br />
Review Magazine<br />
Magazine for the enterprise network<br />
manager and other communications<br />
professionals.