04.02.2013 Views

MAS.632 Conversational Computer Systems - MIT OpenCourseWare



PROSODY<br />


measures. Under this approach, a grammar is augmented with word-sequence<br />

probabilities based on analysis of a corpus of utterances spoken by naive subjects<br />

attempting to access the chosen task [Hirschman et al. 1991].<br />
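As a rough illustration of corpus-derived word-sequence probabilities, the sketch below estimates bigram probabilities by relative frequency. The mini-corpus, the function name `bigram_probabilities`, and the `<s>`/`</s>` boundary markers are assumptions made for this example; they are not taken from the cited work.

```python
from collections import Counter, defaultdict


def bigram_probabilities(utterances):
    """Estimate P(next_word | word) by relative frequency over a corpus."""
    pair_counts = defaultdict(Counter)
    for utterance in utterances:
        # Hypothetical boundary markers so sentence starts and ends are counted.
        words = ["<s>"] + utterance.lower().split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
        for prev, nxts in pair_counts.items()
    }


# Hypothetical utterances collected from subjects attempting a task.
corpus = [
    "show flights to boston",
    "show flights to denver",
    "show fares to boston",
]
probs = bigram_probabilities(corpus)
```

A real system would also need smoothing for word pairs that never occur in the corpus; that is omitted here.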

But in natural speech these probabilities are dynamic and depend on the current<br />

state of the conversation as well as the listener's expectations of what the<br />

talker may be up to. In a fully integrated conversational application, pragmatic<br />

information such as identification of a partially completed plan or detection of<br />

elements of a script could tune word probabilities based on expectations of what<br />

is likely to follow. Similarly, a focus model could suggest heightened probabilities<br />

of words relevant to attributes of entities recently discussed in addition to resolving<br />

anaphoric references. Domain knowledge plays a role as well, from simplistic<br />

awareness of the number of digits in telephone numbers to knowledge of the<br />

acceleration and turning capabilities of various types of aircraft in an air traffic<br />

control scenario.<br />
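One way to picture such dynamic tuning is to reweight a static word distribution whenever the discourse model reports which entities are in focus. The following is only a minimal sketch: the function `boost_for_focus`, the multiplicative `boost` factor, and the sample vocabulary are all assumptions chosen for illustration, not a method described in the text.

```python
def boost_for_focus(word_probs, focus_words, boost=3.0):
    """Multiply the probability of in-focus words by `boost`, then renormalize."""
    weighted = {
        word: p * (boost if word in focus_words else 1.0)
        for word, p in word_probs.items()
    }
    total = sum(weighted.values())
    if total == 0:
        # Nothing to renormalize; return the distribution unchanged.
        return dict(word_probs)
    return {word: p / total for word, p in weighted.items()}


# Hypothetical static probabilities for the word following "to".
static = {"boston": 0.5, "denver": 0.3, "dallas": 0.2}

# Suppose the focus model reports that Denver was just discussed.
dynamic = boost_for_focus(static, focus_words={"denver"})
```

The same hook could accept expectations from a plan recognizer or from domain knowledge; the point is only that the probabilities are recomputed as the conversation proceeds instead of being fixed in advance.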

Part of the difficulty with flexible parsing is the excessive degree of isolation<br />

between the application, its discourse system, and the speech recognition component.<br />

For many current systems the recognizer is given a language model in<br />

whatever form it requires; it then listens to speech and returns a string of text to<br />

the application, which must parse it again to know how to interpret the words<br />

meaningfully. Not only has syntactic information been needlessly lost when<br />

reporting the recognized speech as a string of words, but it also may be detrimental<br />

to strip the representation of any remaining acoustic evidence such as the<br />

recognizer's degree of certainty or possible alternate choices of recognition<br />

results. How can partial recognition results be reported to the parser? Perhaps<br />

several noun phrases were identified but the verb was not, confounding classification<br />

of the nouns into the possible roles that might be expressed in a frame-based<br />

representation.<br />
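To make the interface question concrete, the sketch below shows one possible shape for a recognition result that keeps alternative hypotheses, confidence scores, and unrecognized constituents instead of a flat string. All class and field names (`Hypothesis`, `Constituent`, `RecognitionResult`) and the 0-to-1 confidence scale are hypothetical; no particular recognizer is assumed to report results this way.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    words: List[str]
    confidence: float  # assumed to lie between 0 and 1


@dataclass
class Constituent:
    category: str                    # e.g. "NP" or "VP"
    alternatives: List[Hypothesis]   # empty if nothing was recognized

    @property
    def best(self) -> Optional[Hypothesis]:
        """Highest-confidence alternative, or None if there is none."""
        return max(self.alternatives, key=lambda h: h.confidence, default=None)


@dataclass
class RecognitionResult:
    constituents: List[Constituent] = field(default_factory=list)

    def unrecognized(self) -> List[str]:
        """Categories of constituents for which no hypothesis survived."""
        return [c.category for c in self.constituents if c.best is None]

    def as_text(self) -> str:
        """Flatten to a string, marking gaps; this is the lossy view."""
        return " ".join(
            " ".join(c.best.words) if c.best else "<?>"
            for c in self.constituents
        )


# Two noun phrases were identified but the verb was not.
result = RecognitionResult([
    Constituent("NP", [Hypothesis(["the", "flight"], 0.9),
                       Hypothesis(["the", "light"], 0.4)]),
    Constituent("VP", []),
    Constituent("NP", [Hypothesis(["boston"], 0.8)]),
])
```

With a structure of this kind the parser can see that the verb is missing and can consult the lower-ranked alternatives, information that a plain word string discards.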

These observations are intended to suggest that despite this book's portrayal of<br />

the various layers of language understanding and generation as distinct entities,<br />

they still must be tightly woven into a coherent whole. Isolation of the word identification<br />

portion of discourse understanding into a well bounded "speech recognizer"<br />

component cannot in the long run support sophisticated conversational<br />

systems. Knowledge must be communicated easily across components, and analysis<br />

must be flexible and based on dynamic conversation constraints.<br />

Prosody refers to the spoken style of discourse independent of lexical content, and<br />

it includes several aspects of how we speak. Intonation is the tune of an utterance:<br />

how we modulate F0 to change the pitch of our speech. Intonation operates<br />

at a sentence or phrase level; the rising tune of a yes-or-no question immediately<br />

differentiates it from the falling tune of a declarative statement. Intonation also<br />

helps to convey the stress of words and syllables within a sentence as stressed<br />

syllables are spoken with a pitch higher or lower than normal, and specific words<br />

are emphasized (an important aspect of communicating intent) by stressing<br />
