Dialogue Processing in a Conversational Speech ... - CiteSeerX
Unsegmented Speech Recognition:
(%noise% si1 mira toda la man5ana estoy disponible
%noise% %noise% y tambie1n el fin de semana si podri1a
hacer mejor un di1a fin de semana porque justo el once
no puedo me es imposible va a poder fin de semana)

Pre-broken Speech Recognition:
(si1)
(mira toda la man5ana estoy disponible %noise% %noise%
y tambie1n el fin de semana)
(si podri1a hacer mejor un di1a fin de semana)
(porque justo el once no puedo me es imposible va a
poder fin de semana)

Parser SDU Segmentation (of Pre-broken Input):
(((si1))
((mira) (toda la man5ana estoy disponible) (y tambie1n)
(el fin de semana))
((si podri1a hacer mejor un di1a fin de semana))
((porque el once no puedo) (me es imposible)
(va a poder fin de semana)))

Translation:
"yes --- Look all morning is good for me -- and also
-- the weekend --- If a day weekend is better ---
because on the eleventh I can't meet --
That is bad for me can meet on weekend"

Figure 3: Segmentation of a Spanish Full Utterance
augmented with a statistical component. The plan-based approach handles knowledge-intensive tasks, exploiting various knowledge sources. The finite state approach provides a fast and efficient alternative to the more time-consuming plan-based approach. Currently, the two discourse processors are used separately. We intend to combine these two approaches in a layered architecture, similar to the one proposed for Verbmobil [6], in which the finite state machine would constitute a lower layer providing an efficient way of recognizing speech acts, while the plan-based discourse processor, at a higher layer, would handle more knowledge-intensive processes, such as recognizing doubt or clarification sub-dialogues and robust ellipsis resolution. The performance of each approach in assigning speech acts is presented in Section 4.2.
The plan-based discourse processor [7] takes as its input the best parse returned by the parser. The discourse context is represented as a plan tree. The main task of the discourse processor is to relate the input to the context, i.e., the plan tree. In general, plan inference starts from the surface forms of sentences, from which speech acts are then inferred. Multiple speech acts can be inferred for one ILT, and a separate inference chain is created for each potential speech act performed by the associated ILT. Preferences for picking one inference chain over another are determined by a set of focusing heuristics, which provide ordered expectations of discourse actions given the existing plan tree. The speech act is recognized in the course of determining how the inference chain attaches to the plan tree.
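The chain-selection step above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the speech-act labels, chain contents, and the shape of the focusing heuristics' output (a simple ordered list of expected acts) are all invented for the example.

```python
def choose_inference_chain(chains, expectations):
    """Pick one inference chain per the focusing heuristics.

    chains: {speech_act: inference_chain}, one chain per candidate
        speech act inferred for the ILT (contents invented here).
    expectations: speech acts ordered from most to least expected
        given the current plan tree (the focusing heuristics' output).
    Returns the (act, chain) pair whose act is most expected.
    """
    rank = {act: i for i, act in enumerate(expectations)}
    # Unexpected acts rank below every listed expectation.
    act = min(chains, key=lambda a: rank.get(a, len(expectations)))
    return act, chains[act]

# Two candidate chains for one ILT; "reject" is the more expected act.
act, chain = choose_inference_chain(
    {"reject": ["surface-neg", "reject"],
     "suggest": ["surface-decl", "suggest"]},
    ["reject", "accept", "suggest"])
print(act)  # reject
```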
The finite state machine (FSM) discourse processor [5] describes representative sequences of speech acts in the scheduling domain. It is used to record the standard dialogue flow and to check whether the predicted speech act follows idealized dialogue act sequences.
The states in the FSM represent speech acts in the domain. The transitions between states record turn-taking information. Given the current state, multiple following speech acts are possible. The statistical component (consisting of speech act n-grams) is used to provide ranked predictions for the following speech acts.
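As a rough sketch of how bigram statistics can rank the FSM's possible follow-up acts: the speech-act labels, transition sets, and counts below are invented for illustration, not taken from the paper's scheduling-domain model.

```python
# Hypothetical speech-act bigram counts: (current act, next act) -> count.
BIGRAM_COUNTS = {
    ("suggest", "accept"): 30,
    ("suggest", "reject"): 12,
    ("suggest", "suggest"): 8,
    ("accept", "confirm"): 20,
    ("accept", "suggest"): 5,
}

# FSM transitions: states are speech acts; values are reachable next acts.
FSM_TRANSITIONS = {
    "suggest": {"accept", "reject", "suggest"},
    "accept": {"confirm", "suggest"},
}

def ranked_predictions(current_act):
    """Rank the FSM's possible next speech acts by bigram probability."""
    followers = FSM_TRANSITIONS.get(current_act, set())
    total = sum(BIGRAM_COUNTS.get((current_act, a), 0) for a in followers)
    if not total:
        return sorted(followers)  # no statistics: arbitrary stable order
    scored = [(BIGRAM_COUNTS.get((current_act, a), 0) / total, a)
              for a in followers]
    return [a for _, a in sorted(scored, reverse=True)]

print(ranked_predictions("suggest"))  # ['accept', 'reject', 'suggest']
```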
One novel feature of the finite state approach is that it incorporates a solution to the cumulative error problem. Cumulative error is introduced when an incorrect hypothesis is chosen and incorporated into the context, providing an inaccurate context from which subsequent context-based predictions are made. This is especially a problem in spontaneous speech systems, where unexpected input, out-of-domain utterances, and missing information are hard to fit into the standard structure of the contextual model. To reduce cumulative error, we focus on instances of conflict between the predictions of the FSM and the grammar. Our experiments show that in the case of a prediction conflict between the grammar and the FSM, trusting the non-context-based grammar predictions, rather than blindly trusting the predictions from the dialogue context, gives better performance in assigning speech acts. This corresponds to a jump from one state to another in the finite state machine. Section 4.2 reports the performance of the FSM with jumps determined by the non-context-based predictions of the grammar.
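The conflict-handling rule can be summarized in a few lines. This is a minimal sketch under the assumption that both knowledge sources deliver ranked speech-act lists; the function name and the example acts are invented for illustration.

```python
def next_speech_act(fsm_predictions, grammar_hypotheses):
    """Resolve the next speech act from context and grammar evidence.

    fsm_predictions: ranked follow-up acts from the dialogue-context FSM.
    grammar_hypotheses: ranked acts from the non-context-based grammar.
    Returns (act, jumped): when the grammar's top hypothesis conflicts
    with every context-based prediction, we trust the grammar and
    'jump' the FSM to that act's state, limiting cumulative error.
    """
    top = grammar_hypotheses[0]
    if top in fsm_predictions:
        return top, False   # context and grammar agree: normal transition
    return top, True        # conflict: jump the FSM to the grammar's act

# Example (acts invented): the grammar sees a clarification the
# context did not predict, so the FSM jumps rather than misassign.
act, jumped = next_speech_act(["accept", "reject"], ["clarify", "accept"])
print(act, jumped)  # clarify True
```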
4.1. Late Stage Disambiguation

The robust parsing components discussed in Section 2 employ a large flexible grammar to handle such features of spoken language as speech disfluencies, speech recognition errors, and the lack of clearly marked sentence boundaries. This is necessary to ensure the robustness and flexibility of the parser. However, as a side-effect, the number of ambiguities increases. An important feature of our approach to reducing parse ambiguity is to allow multiple hypotheses to be processed through the system, and to use context to disambiguate between alternatives in the final stages of processing, where knowledge can be exploited to the fullest. Local utterance-level predictions are generated by the parser. The larger discourse context is processed and maintained by the discourse processing component, which has been extended to produce context-based predictions for resolving ambiguity. The predictions from the context-based discourse processing approach and those from the non-context-based parser approach are combined in the final stage of processing.
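A minimal sketch of this final combination stage, assuming each parse hypothesis carries one score from the parser and one from the discourse context. The fixed mixing weight and all scores here are invented placeholders (the actual combination function is learned, as discussed below).

```python
def combined_score(parser_score, discourse_score, w=0.5):
    """Late-stage score: a weighted mix of the non-context-based
    parser score and the context-based discourse score.  The weight
    w = 0.5 is an illustrative placeholder, not a tuned value."""
    return w * parser_score + (1.0 - w) * discourse_score

def disambiguate(hypotheses):
    """Pick the parse whose combined score is highest.

    hypotheses: list of (name, parser_score, discourse_score) triples;
    all names and scores invented for illustration."""
    best = max(hypotheses, key=lambda h: combined_score(h[1], h[2]))
    return best[0]

# Context favors parse-B strongly enough to override the parser.
print(disambiguate([("parse-A", 0.9, 0.2), ("parse-B", 0.6, 0.8)]))
```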
We experimented with two methods of automatically learning functions for combining the context-based and non-context-based scores for disambiguation: a genetic programming approach and a neural net approach. In the absence of cumulative error, both combination techniques improved over the parser's non-context-based statistical disambiguation technique; in the face of cumulative error, however, performance decreased significantly. We are in the process of incorporating our cumulative error reduction technique into the task of disambiguation.
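To illustrate only the idea of fitting a combination function to data, here is a deliberately simpler stand-in for the paper's genetic-programming and neural-net combiners: plain gradient descent on a pairwise hinge loss over a single mixing weight. All scores are invented, and this is not the method the paper uses.

```python
def train_weight(examples, epochs=200, lr=0.1):
    """Fit w in score = w*parser + (1-w)*discourse so that correct
    parses outscore incorrect ones on the training pairs.

    examples: list of ((p_good, d_good), (p_bad, d_bad)) score pairs
    for a correct and an incorrect parse of the same input."""
    w = 0.5
    for _ in range(epochs):
        for (pg, dg), (pb, db) in examples:
            margin = (w * pg + (1 - w) * dg) - (w * pb + (1 - w) * db)
            if margin < 1.0:  # violated pair: move w along the gradient
                w += lr * ((pg - dg) - (pb - db))
                w = min(1.0, max(0.0, w))  # keep the mix a convex one
    return w

# Here only the discourse (context) score separates good from bad,
# so the learned weight shifts toward the discourse side (w -> 0).
print(train_weight([((0.2, 0.9), (0.8, 0.1))]))
```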