
Model-based Bayesian Reinforcement Learning
for Dialogue Management

Pierre Lison
Language Technology Group, University of Oslo

Interspeech, August 26, 2013


Motivation

• Hand-crafting dialogue policies is hard!
  • Noise & uncertainty (e.g. speech recognition errors)
  • Large number of possible trajectories
• Alternative: automatically optimise dialogue policies from (real or simulated) experience
• Two types of approaches:
  • Model-free reinforcement learning
  • Model-based reinforcement learning (the focus of this talk)


Motivation

• Model-based reinforcement learning:
  • Collect interactions and use them to estimate explicit models of the domain
  • Use the resulting models to plan the best action
• Key advantage: can exploit prior knowledge to structure the domain models
• We present an experiment showing the benefits of this approach


POMDPs

[Figure: a POMDP represented as a dynamic Bayesian network unrolled over two time slices, with hidden states St and St+1, observations Ot and Ot+1, actions At and At+1, and rewards Rt and Rt+1]
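The graphical model above implies the standard POMDP belief update: after executing action a and observing o, the belief b over hidden states becomes b'(s') ∝ P(o | s') Σs P(s' | s, a) b(s). A minimal sketch, assuming tabular models stored as NumPy arrays (my illustration, not code from the talk):

    import numpy as np

    def belief_update(b, a, o, T, O):
        """POMDP belief update: b'(s') ∝ O(o | s') * sum_s T(s' | s, a) * b(s).

        b: belief vector over states, shape (S,)
        T: transition model, T[a, s, s_next] = P(s_next | s, a)
        O: observation model, O[s, o] = P(o | s)
        """
        predicted = T[a].T @ b         # predict: marginalise over current states s
        new_b = O[:, o] * predicted    # correct: weight by observation likelihood
        return new_b / new_b.sum()     # normalise to a probability distribution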


POMDPs with model uncertainty

[Figure: the same dynamic Bayesian network, extended with parameter nodes θT, θR and θO representing the uncertain transition, reward and observation models]


Approach

• After each observation, the parameters are updated via Bayesian inference (see the sketch below)
  • Parameter distributions are gradually narrowed down to the values that best fit the observed data
• Forward planning is used to select the next action to execute at runtime
  • Three sources of uncertainty: state uncertainty, stochastic action effects, and model uncertainty
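For discrete domains, this Bayesian update has a simple conjugate form: with a Dirichlet prior over each categorical distribution, every observed transition increments a pseudo-count, and the posterior narrows as evidence accumulates. A minimal sketch for the transition parameters θT, assuming a tabular state space (my illustration, not the talk's implementation):

    import numpy as np

    class DirichletTransitionModel:
        """Bayesian estimate of P(s' | s, a) with one Dirichlet per (s, a) pair."""

        def __init__(self, n_states, n_actions, prior=1.0):
            # alpha[s, a, s_next] holds the Dirichlet pseudo-counts
            self.alpha = np.full((n_states, n_actions, n_states), prior)

        def update(self, s, a, s_next):
            # Conjugate update: one observed transition adds one pseudo-count
            self.alpha[s, a, s_next] += 1.0

        def mean(self, s, a):
            # Posterior mean of the transition distribution for (s, a)
            return self.alpha[s, a] / self.alpha[s, a].sum()

        def sample(self, s, a):
            # Draw a plausible transition distribution, e.g. for sampling-based planning
            return np.random.dirichlet(self.alpha[s, a])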


Abstraction

• Dialogue domains often have large, complex state and action spaces
• Need generalisation/abstraction techniques to avoid the «curse of dimensionality»
• The framework of probabilistic rules offers such an abstraction language:
  • Captures domain structure through (parametrised) rules mapping conditions to probabilistic effects
  • Drastic reduction in the number of parameters


Probabilistic rules

• Structured if...then...else cases associating conditions to distributions over effects:

    if (condition1 holds) then
        P(effect1) = θ1, P(effect2) = θ2, ...
    else if (condition2 holds) then
        P(effect3) = θ3, ...
    ...

• Probabilistic rules serve as high-level templates for a Bayesian network (see the sketch below)

[P. Lison, «Probabilistic Dialogue Models with Prior Domain Knowledge», SIGDIAL 2012]
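To make the template structure concrete, here is a minimal sketch of a probabilistic rule as a data structure: an ordered list of (condition, effects) cases whose effect probabilities are tied to shared θ parameters. The names and interfaces are my own illustration, not the API from the paper:

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    # A condition tests the current dialogue state (a variable assignment)
    Condition = Callable[[dict], bool]

    @dataclass
    class Case:
        condition: Condition
        effects: Dict[str, int]   # effect -> index of its shared parameter θ_i

    @dataclass
    class ProbabilisticRule:
        cases: List[Case]

        def effect_distribution(self, state: dict, theta: List[float]) -> Dict[str, float]:
            """Return P(effect) under the first case whose condition holds."""
            for case in self.cases:
                if case.condition(state):
                    probs = {eff: theta[i] for eff, i in case.effects.items()}
                    probs["void"] = 1.0 - sum(probs.values())  # leftover mass: no effect
                    return probs
            return {"void": 1.0}   # no condition applies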


Probabilistic rules: example

    r1: ∀ X:
        if (am = AskConfirm(X) ∧ iu ≠ X) then
            [P(au' = Disconfirm) = θ1]

[Figure: Bayesian network instantiated from r1, with a rule node r1 connecting the input variables am and iu to the output variable au', and the parameter node θ1 attached to r1]
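Using the sketch above, rule r1 can be encoded and queried as follows (a hypothetical encoding of the state variables am and iu as strings, mirroring the slide):

    def r1_condition(state: dict) -> bool:
        # am = AskConfirm(X) and iu ≠ X, where X is the value being confirmed
        a_m, i_u = state.get("am", ""), state.get("iu", "")
        return a_m.startswith("AskConfirm(") and a_m != f"AskConfirm({i_u})"

    r1 = ProbabilisticRule(cases=[
        Case(condition=r1_condition, effects={"au' = Disconfirm": 0})])

    state = {"am": "AskConfirm(MoveForward)", "iu": "MoveLeft"}
    theta = [0.9]   # current estimate of θ1
    print(r1.effect_distribution(state, theta))
    # ≈ {"au' = Disconfirm": 0.9, "void": 0.1}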


Evaluation

• Evaluation of the learning approach in a simulated environment:
  • Human-robot interaction domain (with a Nao robot)
  • Simulator constructed from Wizard-of-Oz data
• Goal: estimate the transition model of the domain (the reward model is given)


Simulator

• Simulation models (chained together in the sketch below):
  • User modelling: how the user is expected to react to the system actions
  • Context modelling: how the system actions change the state of the environment
  • Error modelling: how understanding errors can occur
• Collected and annotated Wizard-of-Oz data to empirically estimate these models
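A simulated episode chains the three models in a loop. A minimal sketch, with hypothetical interfaces for the policy, the simulation models and the learner (only meant to show how the pieces interact):

    def run_simulated_dialogue(policy, user_model, context_model, error_model,
                               learner, max_turns=20):
        """One simulated episode: act, let the simulated user react,
        corrupt the user action with understanding errors, update the learner."""
        state = context_model.initial_state()
        for _ in range(max_turns):
            a_m = policy.select_action(state)          # forward planning
            state = context_model.apply(state, a_m)    # context model
            a_u = user_model.react(state, a_m)         # user model
            o = error_model.corrupt(a_u)               # error model (noisy observation)
            learner.update(state, a_m, o)              # Bayesian parameter update
            if context_model.is_final(state):
                break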


Experimental setup

• Two alternative formalisations of the transition model:

  • Baseline: classical (factored) categorical distributions
    [Figure: Bayesian network where St-1 and At-1 feed directly into St, governed by a single parameter node θ]

  • Our approach: model structured with probabilistic rules
    [Figure: Bayesian network where St-1 and At-1 feed into rule nodes r1 ... rn, each with its own parameter node θr1 ... θrn, which jointly determine St]
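The practical difference between the two formalisations is the size of the parameter space. A back-of-the-envelope comparison with illustrative numbers (not the actual sizes of the experimental domain):

    # Baseline: one categorical distribution per (state, action) pair
    n_states, n_actions = 1000, 20
    params_baseline = n_states * n_actions * (n_states - 1)   # ~2 * 10^7 free parameters

    # Rule-structured model: a few shared θ parameters per rule
    n_rules, params_per_rule = 15, 3
    params_rules = n_rules * params_per_rule                  # 45 free parameters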


Results: average return

[Figure: learning curves plotting the average return (0 to 6.0) against the number of simulated interactions (2 to 146), comparing the baseline with categorical distributions against the model with probabilistic rules]


Conclusion

• Hybrid approach to dialogue policy optimisation:
  • Domain models structured with probabilistic rules
  • Rule parameters estimated via model-based Bayesian RL
• Experiment shows that the rule-structured model outperforms a classical factored model
• Future work:
  • Evaluate the approach with real interactions
  • Combine offline and online planning
