
Real-World Speech Recognition
NSF SBE "Grand Challenge" White Paper
by Philip Rubin, Haskins Laboratories, Sep. 24, 2010

Abstract

Speech recognition would seem, to many, to be a scientific/technical problem that has been solved. Inexpensive recognition systems are commonly available for personal computers and mobile devices. Why, then, is the use of such a potentially enabling technology not as ubiquitous as past predictions would have led us to believe? Note that I have typed this into my computer, not spoken to it. One rarely sees people talking to their computers, unless they are Skyping, although the recognition ("talking typewriter") technology has supposedly been mastered. Unfortunately, recognition performance is severely limited by real-world constraints. Ambient noise, variability in the clarity of a speaker's voice due to age, speaker style, infirmity, and a host of other conditions limit the practical and reliable use of speech interfaces. Speech is more informal and capricious than algorithmic approaches are designed to handle. In addition, we help disambiguate such ephemeral information by using as many contextual, communicative cues as are available to us, including facial information, gesture, indications of emotion, and situational indicators. The challenge is to mount a sustained, focused effort to develop recognition systems (speech, gesture, facial information, emotion, semantic, etc.) that work reliably in real-world conditions, from the workplace to the battlefield.

Philip Rubin, Ph.D.
Haskins Laboratories
300 George St., Suite 900
New Haven, CT 06511
www.haskins.yale.edu
rubin@haskins.yale.edu


Real-World Speech Recognition

Speech is a convenient, ubiquitous, and easily learned form of communication for most. Reading is more difficult for many; its mastery can be challenging, and such difficulties are reflected in the high-profile struggles of our educational system. The centrality of speech to the human enterprise is considerable. It consumes our attention. Even in this era of texting and tweeting, the urge to gab is boundless; we swim in a sea of mobile, vocal communication. We are drawn to the human voice and seek it out in increasingly varied forms: movies, television, theatre, YouTube videos, live poetry readings, website multimedia, conferences, etc. Spoken language creates social, cultural, and geographic bonds, and demarcates our differences, from generational to political, even as it allows us to communicate. We can speak around corners and, through technological advances, around the world, often at minimal or no expense. The potential uses of voice, gesture, and facial information in assistive and educational environments are numerous and excite our imaginations (Bickmore and Cassell, 2005). There are considerable opportunities for enhancing technologies related to these areas, including improving human-machine interfaces. I could have spoken into my mobile phone as it automatically transcribed the intended text (often referred to as "the talking typewriter" in the olden days) instead of typing as I have always done.

Speech recognition would seem, to many, to be a scientific/technical problem that has been solved. Inexpensive recognition systems are commonly available for personal computers and mobile devices. Why, then, is the use of such a potentially enabling technology not as ubiquitous as predictions would lead us to believe? Unfortunately, recognition performance is severely limited by real-world constraints. Ambient and background noise, variability in the clarity of a speaker's voice due to age, dependence on high-quality unidirectional microphones, speaker style and accent, dialectal and regional differences, multilingualism, infirmity, and a host of other factors limit the practical and reliable use of speech interfaces. Speech is more informal and capricious than algorithmic approaches are designed to handle. In addition, we help disambiguate such ephemeral information by using as many contextual, communicative cues as are potentially available to us, including facial information, gesture, indications of emotion, and situational indicators.

State-of-the-art signal processing, techniques such as Markovian modeling, other statistical and predictive methods, and a variety of computational and engineering techniques have led to considerable advances in recognition performance in the past decade. Such developments show promise for the future. However, they have reached the limits of their performance. The ability to extract linguistic categories, rather than statistical approximations to them, must be informed by the realities and complexities of the real world if we are to improve our success rate. This encourages consideration of embodied and situated aspects of systems. Our physiological form and the nature of the world that we inhabit (natural, physical, social, and cultural) shape our behavior. Linguistic and cognitive contributions to current recognition systems are not as significant as, for example, statistical methodologies. Yet, if the knowledge needed to make improvements is linguistic and/or cognitive, those areas must be critical to the development of future systems. In addition, the approaches that we use when studying, modeling, and building such systems need to give greater attention to what can be very difficult issues: temporality, dynamics, and complexity. Our scientific approaches must take such issues seriously.
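
As a purely illustrative aside (a toy sketch, not a description of any deployed system or of the author's work), the Python fragment below shows the kind of Markovian modeling referred to above: a three-state hidden Markov word model with Gaussian emissions scores an acoustic-feature sequence with the forward algorithm. The left-to-right topology, the single one-dimensional feature, and all numeric values are hypothetical simplifications.

# Toy illustration of Markovian (HMM-style) scoring of an acoustic sequence.
# All parameters are invented; real systems use context-dependent states,
# large lexicons, and far richer acoustic features.
import numpy as np

states = ["s0", "s1", "s2"]            # hypothetical three-state word model
trans = np.array([[0.7, 0.3, 0.0],     # left-to-right transition probabilities
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
means = np.array([0.0, 2.0, 4.0])      # Gaussian emission means per state
stdev = np.array([1.0, 1.0, 1.0])      # Gaussian emission standard deviations

def emission(obs):
    """Gaussian likelihood of a 1-D observation under each state."""
    return np.exp(-0.5 * ((obs - means) / stdev) ** 2) / (stdev * np.sqrt(2 * np.pi))

def forward(observations):
    """Forward algorithm: total likelihood of the observation sequence."""
    alpha = np.array([1.0, 0.0, 0.0]) * emission(observations[0])  # start in s0
    for obs in observations[1:]:
        alpha = (alpha @ trans) * emission(obs)   # propagate, then weight by emission
    return alpha.sum()

# A noisy observation sequence that roughly sweeps through the three states.
utterance = [0.1, -0.2, 1.8, 2.3, 3.9, 4.2]
print(f"sequence likelihood: {forward(utterance):.3e}")

The point of the sketch is only that such models assign likelihoods to observation sequences; the linguistic categories themselves remain statistical approximations, which is precisely the limitation discussed above.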

The scientific/technological difficulties in this enterprise require both increased attention to scientific fundamentals and a multidisciplinary approach that brings together biologists, computer scientists, educators, engineers, linguists, neuroscientists, psychologists, physicists, and social scientists. The scientific questions are numerous and difficult. Critical areas of importance include:

• the physiology of speech production
• neural representations of speech and language, and the control of motor behavior
• computational modeling of language, speech, and the mental lexicon
• development and use of realistic embodied conversational agents
• physical/physiological modeling of sound production
• a deeper understanding of the physics of sound/gesture production
• rich techniques for auditory/visual scene analysis and parsing
• cognitive, emotional, cultural, and social aspects of language understanding and use

Attacking such difficult issues will also require investments in infrastructure and substantial advances in tool development, such as:

• computational models of speech production that are open source and modular, that support comparison and addition of existing and future articulatory and aerodynamic models, and that include rich temporal controls to explore the evolution of physiological/linguistic events over time
• multimodal, large-scale databases that provide for the display, analysis, and archiving of physiological, video, audio, neuroimaging, and other data, and that support automatic and expert markup and annotation (a minimal sketch of such a record follows this list)
• ontologies that represent linguistic, cognitive, and other knowledge
• statistical/computational techniques for characterizing and analyzing temporal aspects of signals and significant events
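
As a minimal sketch of the second item above, assuming a purely hypothetical schema (no existing database or toolkit is implied), one record in such a multimodal corpus might pair synchronized media streams with layered, time-aligned annotations from both automatic and expert sources:

# Hypothetical record schema for a multimodal, annotated corpus.
# File names, tier names, and field choices are placeholders for illustration.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    tier: str          # e.g. "phone", "gesture", "emotion"
    start_s: float     # onset in seconds, relative to the recording
    end_s: float       # offset in seconds
    label: str
    source: str        # "automatic" or "expert"

@dataclass
class MultimodalRecord:
    session_id: str
    audio_path: str                                        # e.g. a WAV file
    video_path: str                                        # e.g. a camera capture
    physiology_paths: dict = field(default_factory=dict)   # e.g. {"EMA": ...}
    annotations: list = field(default_factory=list)        # layered, time-aligned tiers

# Example usage with placeholder file names.
record = MultimodalRecord(
    session_id="spkr01_sess03",
    audio_path="spkr01_sess03.wav",
    video_path="spkr01_sess03.mp4",
    physiology_paths={"EMA": "spkr01_sess03_ema.h5"},
)
record.annotations.append(Annotation("phone", 0.42, 0.51, "AE", source="automatic"))
record.annotations.append(Annotation("gesture", 0.30, 0.80, "head_nod", source="expert"))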

In a way, speech is more of a dance than it is a sterile enumeration of numbers, rules, or lists of features. When we talk, we usually engage in an active dialogue that spans time. How we act, how we move, and what we say are determined, in part, by the interactive, dynamically changing relationship with our partner. As with dance, our speech also conveys our personality, our heritage, our spontaneity, and our emotions. To deal with this degree of richness, computational recognition systems should embody approaches that are ecologically valid, taking into account the actions, goals, characteristics, and other differences of the individuals engaged in conversation. Moreover, the vagaries and realities of language use, the changing environments in which language is used, and the diverse and momentarily changing aspects of those who use it must be central concerns. Where possible, the relationship between speech and the physical system that produces it, particularly as it evolves over time, should be understood and used to constrain and simplify the recognition process (Hogden et al., 2007; Iskarous et al., 2010).
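
One hedged illustration of that last point, in the spirit of (but not reproducing) the cited studies: if a mapping from acoustic features to a low-dimensional articulatory space were available, the physical continuity of articulator movement could be used to smooth frame-by-frame estimates before any categorization. Everything below, including the random stand-in for a "pre-trained" linear map and the feature dimensions, is an assumed placeholder.

# Illustrative sketch (not the cited authors' method) of exploiting the
# physical continuity of articulation to constrain frame-wise estimates
# recovered from acoustics.
import numpy as np

rng = np.random.default_rng(0)

# Placeholder linear map from acoustic features (dim 12) to a low-dimensional
# articulatory space (dim 2); in practice such a mapping would be learned from
# paired acoustic/articulatory data.
A = rng.normal(size=(2, 12)) * 0.1

def acoustics_to_articulation(acoustic_frames):
    """Frame-by-frame articulatory estimates (noisy, mutually independent)."""
    return acoustic_frames @ A.T

def smooth_path(raw_path, window=5):
    """Enforce temporal continuity with a moving average: articulators move
    slowly and smoothly, so abrupt frame-to-frame jumps are treated as noise."""
    kernel = np.ones(window) / window
    return np.column_stack([np.convolve(raw_path[:, d], kernel, mode="same")
                            for d in range(raw_path.shape[1])])

# Toy acoustic input: 100 frames of 12-dimensional features.
acoustics = rng.normal(size=(100, 12))
raw = acoustics_to_articulation(acoustics)
smoothed = smooth_path(raw)
print("mean frame-to-frame movement, raw vs. smoothed:",
      np.abs(np.diff(raw, axis=0)).mean(), np.abs(np.diff(smoothed, axis=0)).mean())

The smoothing here is a plain moving average; the cited work uses far more principled techniques, but the constraint being exploited, that articulators move slowly and continuously over time, is the same.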

In summary, then, the challenge is to mount a sustained, focused effort to develop recognition systems (speech, gesture, facial information, emotion, semantic, etc.) that work reliably in real-world conditions, from the workplace to the battlefield.


References

Bickmore, T., & Cassell, J. (2005). Social Dialogue with Embodied Conversational Agents. In J. van Kuppevelt, L. Dybkjaer, & N. Bernsen (Eds.), Advances in Natural, Multimodal Dialogue Systems. New York: Kluwer Academic.

Hogden, J., Rubin, P., McDermott, E., Katagiri, S., & Goldstein, L. (2007). Inverting mappings from smooth paths through R^n to paths through R^m: A technique applied to recovering articulation from acoustics. Speech Communication, 49(5), 361-383.

Iskarous, K., Nam, H., & Whalen, D. H. (2010). Perception of articulatory dynamics from acoustic signatures. Journal of the Acoustical Society of America, 127(6), 3717-3727.

Real-World Speech Recognition by Philip Rubin is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
