Model-based Bayesian Reinforcement Learning for Dialogue ...
Model-based Bayesian Reinforcement Learning for Dialogue ...
Model-based Bayesian Reinforcement Learning for Dialogue ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>Model</strong>-<strong>based</strong> <strong>Bayesian</strong><br />
<strong>Rein<strong>for</strong>cement</strong> <strong>Learning</strong> <strong>for</strong><br />
<strong>Dialogue</strong> Management<br />
Pierre Lison<br />
Language Technology Group,<br />
University of Oslo<br />
August 26, 2013<br />
Interspeech
Motivation<br />
• Hand-crafting dialogue policies is hard!<br />
• Noise & uncertainty (e.g. speech recognition errors)<br />
• Large number of possible trajectories<br />
• Alternative: automatically optimise dialogue<br />
policies from (real or simulated) experience<br />
• Two types of approaches:<br />
• <strong>Model</strong>-free rein<strong>for</strong>cement learning<br />
• <strong>Model</strong>-<strong>based</strong> rein<strong>for</strong>cement learning<br />
2
Motivation<br />
• Hand-crafting dialogue policies is hard!<br />
• Noise & uncertainty (e.g. speech recognition errors)<br />
• Large number of possible trajectories<br />
• Alternative: automatically optimise dialogue<br />
policies from (real or simulated) experience<br />
• Two types of approaches:<br />
• <strong>Model</strong>-free rein<strong>for</strong>cement learning<br />
• <strong>Model</strong>-<strong>based</strong> rein<strong>for</strong>cement learning<br />
Focus of<br />
this talk<br />
2
Motivation<br />
•<br />
<strong>Model</strong>-<strong>based</strong> rein<strong>for</strong>cement learning:<br />
• Collect interactions and use them to estimate explicit<br />
models of the domain<br />
• Use the resulting models to plan the best action<br />
•<br />
•<br />
Key advantage: can exploit prior knowledge<br />
to structure the domain models<br />
We present an experiment showing the<br />
benefits of this approach<br />
3
POMDPs<br />
St<br />
Ot<br />
4
POMDPs<br />
St<br />
Rt<br />
Ot<br />
At<br />
4
POMDPs<br />
St<br />
St+1<br />
Rt<br />
Ot<br />
At<br />
Ot+1<br />
4
POMDPs<br />
St<br />
St+1<br />
Rt<br />
Rt+1<br />
Ot<br />
At<br />
Ot+1<br />
At+1<br />
4
POMDPs with model uncertainty<br />
θT<br />
θR<br />
St<br />
St+1<br />
Rt<br />
Rt+1<br />
Ot<br />
At<br />
Ot+1<br />
At+1<br />
θO<br />
5
Approach<br />
• After each observation, the parameters<br />
are updated via <strong>Bayesian</strong> inference<br />
• Parameter distributions gradually narrowed down to<br />
the values that best fit the observed data<br />
• Forward planning is used to select the<br />
next action to execute at runtime<br />
• Three source of uncertainty: state uncertainty,<br />
stochastic action effects, and model uncertainty<br />
6
Abstraction<br />
• <strong>Dialogue</strong> domains often have large, complex<br />
state and action spaces<br />
• Need generalisation/abstraction techniques to<br />
avoid the «curse of dimensionality»<br />
• The framework of probabilistic rules<br />
offers such abstraction language<br />
• Capture domain structure through (parametrised) rules<br />
mapping conditions to probabilistic effects<br />
• Drastic reduction in the number of parameters<br />
7
Probabilistic rules<br />
• Structured if...then...else cases associating<br />
conditions to distributions over effects:<br />
if (condition 1 holds) then<br />
P(effect 1 )= θ 1 , P(effect 2 )= θ 2 , ...<br />
else if (condition 2 holds) then<br />
P(effect 3 ) = θ 3 , ...<br />
...<br />
• Probabilistic rules serve as high-level<br />
templates <strong>for</strong> a <strong>Bayesian</strong> network<br />
[P. Lison, «Probabilistic <strong>Dialogue</strong> <strong>Model</strong>s with Prior Domain Knowledge», SIGDIAL 2012]<br />
8
Probabilistic rules: example<br />
r 1 : ∀ X:<br />
if (a m = AskConfirm(X) ∧ i u ≠ X) then<br />
[P(a u ’ = Disconfirm) = θ 1 ]<br />
9
Probabilistic rules: example<br />
r 1 : ∀ X:<br />
if (a m = AskConfirm(X) ∧ i u ≠ X) then<br />
[P(a u ’ = Disconfirm) = θ 1 ]<br />
iu<br />
θ 1<br />
r1<br />
au’<br />
am<br />
9
Evaluation<br />
• Evaluation of the learning<br />
approach in a simulated<br />
environment:<br />
• Human-robot interaction<br />
domain (with Nao robot)<br />
• Simulator constructed from<br />
Wizard-of-Oz data<br />
• Goal: estimate the transition<br />
model of the domain (reward<br />
model is given)<br />
10
Simulator<br />
• Simulation models:<br />
• User modelling: how the user is expected to react to<br />
the system actions<br />
• Context modelling: how the system actions change the<br />
state of the environment<br />
• Error modelling: how understanding errors can occur<br />
• Collected and annotated Wizard-of-Oz<br />
data to empirically estimate these models<br />
11
Experimental setup<br />
• Two alternative <strong>for</strong>malisations of the<br />
transition model:<br />
Baseline:<br />
Our approach:<br />
θr1<br />
St-1<br />
St<br />
St-1<br />
r1<br />
St<br />
θrn<br />
...<br />
At-1<br />
θ<br />
At-1<br />
rn<br />
Classical (factored)<br />
categorical distributions<br />
<strong>Model</strong> structured with<br />
probabilistic rules<br />
12
Results: average return<br />
6,0<br />
Average return<br />
4,5<br />
3,0<br />
1,5<br />
0<br />
Categorical distributions<br />
Probabilistic rules<br />
2 14 26 38 50 62 74 86 98 110 122 134 146<br />
Number of simulated interactions<br />
13
Conclusion<br />
• Hybrid approach to dialogue policy optimisation:<br />
• Domain models structured with probabilistic rules<br />
• Rule parameters estimated via model-<strong>based</strong> <strong>Bayesian</strong> RL<br />
• Experiment shows that the rule-structured<br />
model outper<strong>for</strong>ms a classical factored model<br />
•<br />
Future work:<br />
• Evaluate the approach with real interactions<br />
• Combine offline and online planning<br />
14