Model-based Bayesian Reinforcement Learning for Dialogue ...

Model-based Bayesian 

Reinforcement Learning for 

Dialogue Management 

Pierre Lison 

Language Technology Group, 

University of Oslo 

August 26, 2013 

Interspeech

Motivation 

• Hand-crafting dialogue policies is hard! 

• Noise & uncertainty (e.g. speech recognition errors) 

• Large number of possible trajectories 

• Alternative: automatically optimise dialogue 

policies from (real or simulated) experience 

• Two types of approaches: 

• Model-free reinforcement learning 

• Model-based reinforcement learning 

2

Motivation 

• Hand-crafting dialogue policies is hard! 

• Noise & uncertainty (e.g. speech recognition errors) 

• Large number of possible trajectories 

• Alternative: automatically optimise dialogue 

policies from (real or simulated) experience 

• Two types of approaches: 

• Model-free reinforcement learning 

• Model-based reinforcement learning 

Focus of 

this talk 

2

Motivation 

• 

Model-based reinforcement learning: 

• Collect interactions and use them to estimate explicit 

models of the domain 

• Use the resulting models to plan the best action 

• 

• 

Key advantage: can exploit prior knowledge 

to structure the domain models 

We present an experiment showing the 

benefits of this approach 

3

POMDPs 

St 

Ot 

4

POMDPs 

St 

Rt 

Ot 

At 

4

POMDPs 

St 

St+1 

Rt 

Ot 

At 

Ot+1 

4

POMDPs 

St 

St+1 

Rt 

Rt+1 

Ot 

At 

Ot+1 

At+1 

4

POMDPs with model uncertainty 

θT 

θR 

St 

St+1 

Rt 

Rt+1 

Ot 

At 

Ot+1 

At+1 

θO 

5

Approach 

• After each observation, the parameters 

are updated via Bayesian inference 

• Parameter distributions gradually narrowed down to 

the values that best fit the observed data 

• Forward planning is used to select the 

next action to execute at runtime 

• Three source of uncertainty: state uncertainty, 

stochastic action effects, and model uncertainty 

6

Abstraction 

• Dialogue domains often have large, complex 

state and action spaces 

• Need generalisation/abstraction techniques to 

avoid the «curse of dimensionality» 

• The framework of probabilistic rules 

offers such abstraction language 

• Capture domain structure through (parametrised) rules 

mapping conditions to probabilistic effects 

• Drastic reduction in the number of parameters 

7

Probabilistic rules 

• Structured if...then...else cases associating 

conditions to distributions over effects: 

if (condition 1 holds) then 

P(effect 1 )= θ 1 , P(effect 2 )= θ 2 , ... 

else if (condition 2 holds) then 

P(effect 3 ) = θ 3 , ... 

... 

• Probabilistic rules serve as high-level 

templates for a Bayesian network 

[P. Lison, «Probabilistic Dialogue Models with Prior Domain Knowledge», SIGDIAL 2012] 

8

Probabilistic rules: example 

r 1 : ∀ X: 

if (a m = AskConfirm(X) ∧ i u ≠ X) then 

[P(a u ’ = Disconfirm) = θ 1 ] 

9

Probabilistic rules: example 

r 1 : ∀ X: 

if (a m = AskConfirm(X) ∧ i u ≠ X) then 

[P(a u ’ = Disconfirm) = θ 1 ] 

iu 

θ 1 

r1 

au’ 

am 

9

Evaluation 

• Evaluation of the learning 

approach in a simulated 

environment: 

• Human-robot interaction 

domain (with Nao robot) 

• Simulator constructed from 

Wizard-of-Oz data 

• Goal: estimate the transition 

model of the domain (reward 

model is given) 

10

Simulator 

• Simulation models: 

• User modelling: how the user is expected to react to 

the system actions 

• Context modelling: how the system actions change the 

state of the environment 

• Error modelling: how understanding errors can occur 

• Collected and annotated Wizard-of-Oz 

data to empirically estimate these models 

11

Experimental setup 

• Two alternative formalisations of the 

transition model: 

Baseline: 

Our approach: 

θr1 

St-1 

St 

St-1 

r1 

St 

θrn 

... 

At-1 

θ 

At-1 

rn 

Classical (factored) 

categorical distributions 

Model structured with 

probabilistic rules 

12

Results: average return 

6,0 

Average return 

4,5 

3,0 

1,5 

0 

Categorical distributions 

Probabilistic rules 

2 14 26 38 50 62 74 86 98 110 122 134 146 

Number of simulated interactions 

13

Conclusion 

• Hybrid approach to dialogue policy optimisation: 

• Domain models structured with probabilistic rules 

• Rule parameters estimated via model-based Bayesian RL 

• Experiment shows that the rule-structured 

model outperforms a classical factored model 

• 

Future work: 

• Evaluate the approach with real interactions 

• Combine offline and online planning 

14

Model-based Bayesian Reinforcement Learning for Dialogue ...

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?