
Unsegmented Speech Recognition:

(%noise% si1 mira toda la man5ana estoy disponible
%noise% %noise% y tambie1n el fin de semana si podri1a
hacer mejor un di1a fin de semana porque justo el once
no puedo me es imposible va a poder fin de semana)

Pre-broken Speech Recognition:

(si1)
(mira toda la man5ana estoy disponible %noise% %noise%
y tambie1n el fin de semana)
(si podri1a hacer mejor un di1a fin de semana)
(porque justo el once no puedo me es imposible va a
poder fin de semana)

Parser SDU Segmentation (of Pre-broken Input):

(((si1))
((mira) (toda la man5ana estoy disponible) (y tambie1n)
(el fin de semana))
((si podri1a hacer mejor un di1a fin de semana))
((porque el once no puedo) (me es imposible)
(va a poder fin de semana)))

Translation:

"yes --- Look all morning is good for me -- and also
-- the weekend --- If a day weekend is better ---
because on the eleventh I can't meet --
That is bad for me can meet on weekend"

Figure 3: Segmentation of a Spanish Full Utterance

augmented with a statistical component. The plan-based approach handles knowledge-intensive tasks, exploiting various knowledge sources. The finite state approach provides a fast and efficient alternative to the more time-consuming plan-based approach. Currently, the two discourse processors are used separately. We intend to combine these two approaches in a layered architecture, similar to the one proposed for Verbmobil [6], in which the finite state machine would constitute a lower layer providing an efficient way of recognizing speech acts, while the plan-based discourse processor, at a higher layer, would handle more knowledge-intensive processes, such as recognizing doubt or clarification sub-dialogues and robust ellipsis resolution. The performance of each approach in assigning speech acts is presented in Section 4.2.

The plan-based discourse processor [7] takes as its input the best parse returned by the parser. The discourse context is represented as a plan tree, and the main task of the discourse processor is to relate the input to that tree. Plan inference starts from the surface form of a sentence, from which speech acts are then inferred. Multiple speech acts can be inferred for one ILT, and a separate inference chain is created for each potential speech act performed by the associated ILT. Preferences for picking one inference chain over another are determined by a set of focusing heuristics, which provide ordered expectations of discourse actions given the existing plan tree. The speech act is recognized in the course of determining how the inference chain attaches to the plan tree.
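To make the chain-selection step concrete, here is a minimal Python sketch; the paper gives no implementation, so InferenceChain, PlanTree, and the list-of-expectations encoding of the focusing heuristics are all invented for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class InferenceChain:
        speech_act: str   # candidate speech act inferred for this ILT
        attachment: str   # plan tree node the chain would attach to

    @dataclass
    class PlanTree:
        # Focusing heuristics reduced to an ordered list of expected
        # discourse actions; earlier entries are more strongly expected.
        expectations: list = field(default_factory=list)

        def rank(self, chain: InferenceChain) -> int:
            # Lower rank = better match with the current expectations.
            if chain.speech_act in self.expectations:
                return self.expectations.index(chain.speech_act)
            return len(self.expectations)   # unexpected act ranks worst

    def pick_chain(chains, tree):
        """Prefer the chain whose speech act best matches expectations."""
        return min(chains, key=tree.rank)

    tree = PlanTree(expectations=["accept", "suggest", "reject"])
    chains = [InferenceChain("suggest", "negotiate-meeting"),
              InferenceChain("accept", "confirm-meeting")]
    print(pick_chain(chains, tree).speech_act)   # -> accept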

The finite state machine (FSM) discourse processor [5] describes representative sequences of speech acts in the scheduling domain. It is used to record the standard dialogue flow and to check whether the predicted speech act follows idealized dialogue act sequences. The states in the FSM represent speech acts in the domain, and the transitions between states record turn-taking information. Given the current state, multiple following speech acts are possible; the statistical component (consisting of speech act n-grams) is used to provide ranked predictions for the following speech acts.
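The following Python sketch illustrates the flavor of such an FSM with a bigram-based ranking component. The states, transitions, and counts are invented, since the paper does not publish them.

    from collections import defaultdict

    class SpeechActFSM:
        def __init__(self, transitions, bigram_counts):
            # transitions: state -> set of legal next speech acts
            self.transitions = transitions
            # bigram_counts: (prev_act, next_act) -> corpus count,
            # standing in for the speech act n-gram statistics
            self.bigrams = bigram_counts
            self.state = "opening"

        def ranked_predictions(self):
            """Rank the legal next speech acts by bigram probability."""
            legal = self.transitions.get(self.state, set())
            total = sum(self.bigrams[(self.state, a)] for a in legal) or 1
            return sorted(((self.bigrams[(self.state, a)] / total, a)
                           for a in legal), reverse=True)

        def advance(self, act):
            self.state = act

    transitions = {"opening": {"suggest", "request-info"},
                   "suggest": {"accept", "reject", "suggest"}}
    bigrams = defaultdict(int, {("opening", "suggest"): 40,
                                ("opening", "request-info"): 10,
                                ("suggest", "accept"): 25,
                                ("suggest", "reject"): 15,
                                ("suggest", "suggest"): 10})

    fsm = SpeechActFSM(transitions, bigrams)
    print(fsm.ranked_predictions())  # [(0.8, 'suggest'), (0.2, 'request-info')]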

One novel feature of the finite state approach is that we incorporate a solution to the cumulative error problem. Cumulative error is introduced when an incorrect hypothesis is chosen and incorporated into the context, providing an inaccurate context from which subsequent context-based predictions are made. It is especially a problem in spontaneous speech systems, where unexpected input, out-of-domain utterances, and missing information are hard to fit into the standard structure of the contextual model. To reduce cumulative error, we focus on instances of conflict between the predictions of the FSM and the grammar. Our experiments show that in the case of such a conflict, trusting the non-context-based grammar predictions rather than blindly trusting the predictions from the dialogue context gives better performance in assigning speech acts. This corresponds to a jump from one state to another in the finite state machine. Section 4.2 reports the performance of the FSM with jumps determined by the non-context-based predictions of the grammar.
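Continuing the invented SpeechActFSM sketch above, the jump rule could look like the following; this is purely illustrative of the conflict-resolution idea, not the system's actual code.

    def assign_speech_act(fsm, grammar_act):
        """Assign a speech act, trusting the grammar when it conflicts
        with the context-based FSM predictions."""
        ranked = fsm.ranked_predictions()
        context_act = ranked[0][1] if ranked else None
        if grammar_act != context_act:
            # Conflict: trust the non-context-based grammar prediction and
            # jump the FSM to that state, so an earlier wrong hypothesis in
            # the context cannot snowball (cumulative error reduction).
            chosen = grammar_act
        else:
            chosen = context_act
        fsm.advance(chosen)  # a "jump" if chosen was not a legal transition
        return chosen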

4.1. Late Stage Disambiguation<br />

The robust parsing components discussed in Section 2 employ a large, flexible grammar to handle such features of spoken language as speech disfluencies, speech recognition errors, and the lack of clearly marked sentence boundaries. This is necessary to ensure the robustness and flexibility of the parser; however, as a side effect, the number of ambiguities increases. An important feature of our approach to reducing parse ambiguity is to allow multiple hypotheses to be processed through the system and to use context to disambiguate between alternatives in the final stages of processing, where knowledge can be exploited to the fullest. Local utterance-level predictions are generated by the parser. The larger discourse context is processed and maintained by the discourse processing component, which has been extended to produce context-based predictions for resolving ambiguity. The predictions from the context-based discourse processing approach and those from the non-context-based parser approach are combined in the final stage of processing.
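As a schematic of this final combination stage, the sketch below assumes a simple linear form with a placeholder weight alpha; the actual combining functions are learned, as described next.

    def disambiguate(hypotheses, context_score, parser_score, alpha=0.5):
        """Choose the parse whose combined score is highest.

        hypotheses: candidate parses for one utterance.
        context_score(h): context-based prediction score from the
        discourse processor; parser_score(h): the parser's
        non-context-based statistical score. alpha is a placeholder;
        the paper learns the combining function instead.
        """
        def combined(h):
            return alpha * context_score(h) + (1 - alpha) * parser_score(h)
        return max(hypotheses, key=combined)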

We experimented with two methods of automatically learning functions for combining the context-based and non-context-based scores for disambiguation: a genetic programming approach and a neural net approach. In the absence of cumulative error, both combination techniques improved over the parser's non-context-based statistical disambiguation technique; in the face of cumulative error, however, performance decreased significantly. We are in the process of incorporating our cumulative error reduction technique into the task of disambiguation.
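For a flavor of what learning such a combining function involves, here is a toy single-logistic-unit learner over the two scores. This is a deliberately simplified stand-in for the genetic programming and neural net methods actually used, with an invented training data format.

    import math
    import random

    def train_combiner(examples, epochs=200, lr=0.1):
        """Fit sigmoid(w0 + w1*context + w2*parser) by gradient descent.

        examples: (context_score, parser_score, is_correct) triples,
        where is_correct is 1 if the hypothesis was the right parse.
        """
        w = [random.uniform(-0.1, 0.1) for _ in range(3)]
        for _ in range(epochs):
            for c, p, y in examples:
                pred = 1 / (1 + math.exp(-(w[0] + w[1] * c + w[2] * p)))
                err = y - pred           # gradient of the log-loss
                w[0] += lr * err
                w[1] += lr * err * c
                w[2] += lr * err * p
        return w

    def combined_score(w, c, p):
        return 1 / (1 + math.exp(-(w[0] + w[1] * c + w[2] * p)))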

