
2007], such a derivation results in a lower bound on the expected return using Jensen's inequality and the concavity of the logarithm. Thus, we obtain

$$\log J(\theta') = \log \int p_{\theta'}(\tau)\, R(\tau)\, d\tau = \log \int \frac{p_{\theta}(\tau)}{p_{\theta}(\tau)}\, p_{\theta'}(\tau)\, R(\tau)\, d\tau \;\geq\; \int p_{\theta}(\tau)\, R(\tau) \log \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)}\, d\tau + \text{const},$$

which is proportional to

$$-D\!\left(p_{\theta}(\tau)\, R(\tau)\, \big\|\, p_{\theta'}(\tau)\right) = L_{\theta}(\theta'),$$

where

$$D\!\left(p(\tau)\, \|\, q(\tau)\right) = \int p(\tau) \log \frac{p(\tau)}{q(\tau)}\, d\tau$$

denotes the Kullback-Leibler divergence, and the constant is needed for tightness of the bound. Note that $p_{\theta}(\tau)\, R(\tau)$ is an improper probability distribution as pointed out by Dayan and Hinton [1997]. The policy improvement step is equivalent to maximizing the lower bound on the expected return $L_{\theta}(\theta')$, and we will now show how it relates to previous policy learning methods.
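To make the bound concrete, the following Python snippet checks numerically that the Jensen construction lower-bounds $\log J(\theta')$ and is tight at $\theta' = \theta$ once the improper distribution $p_{\theta}(\tau)R(\tau)$ is normalized. This is an illustrative sketch, not the thesis implementation; the toy trajectory set, the softmax parameterization, and all function names are assumptions.

```python
# Minimal numerical sketch (illustrative assumptions throughout): verify the
# Jensen lower bound on log J(theta') for a finite set of toy "trajectories".
import numpy as np

rng = np.random.default_rng(0)

# Five discrete trajectories, each with a fixed nonnegative return R(tau).
returns = np.array([0.2, 1.0, 0.5, 2.0, 0.1])

def path_distribution(theta):
    """Softmax path distribution p_theta(tau) over the toy trajectories."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def expected_return(theta):
    """J(theta) = sum_tau p_theta(tau) R(tau)."""
    return path_distribution(theta) @ returns

def lower_bound(theta, theta_new):
    """Jensen lower bound on log J(theta_new), tight at theta_new = theta:
    log J(theta) + E_q[log p_theta'(tau) - log p_theta(tau)],
    with q(tau) = p_theta(tau) R(tau) / J(theta) the normalized improper distribution."""
    p, p_new = path_distribution(theta), path_distribution(theta_new)
    q = p * returns / expected_return(theta)
    return np.log(expected_return(theta)) + q @ (np.log(p_new) - np.log(p))

theta = rng.normal(size=5)
theta_new = theta + 0.3 * rng.normal(size=5)

print("log J(theta') =", np.log(expected_return(theta_new)))
print("lower bound   =", lower_bound(theta, theta_new))     # never exceeds log J(theta')
print("tight at theta:", np.isclose(lower_bound(theta, theta),
                                    np.log(expected_return(theta))))
```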

Resulting Policy Updates

In this section, we will discuss three different policy updates, which are directly derived from the results of Section 4.2.2. First, we show that policy gradients [Williams, 1992, Sutton et al., 1999, Lawrence et al., 2003, Tedrake et al., 2004, Peters and Schaal, 2006] can be derived from the lower bound $L_{\theta}(\theta')$, which is straightforward from a supervised learning perspective [Binder et al., 1997]. Subsequently, we show that natural policy gradients [Bagnell and Schneider, 2003, Peters and Schaal, 2006] can be seen as an additional constraint regularizing the change in the path distribution resulting from a policy update when improving the policy incrementally. Finally, we will show how expectation-maximization (EM) algorithms for policy learning can be generated.

Policy Gradients.

When differentiating the function $L_{\theta}(\theta')$ that defines the lower bound on the expected return, we directly obtain

$$\partial_{\theta'} L_{\theta}(\theta') = \int p_{\theta}(\tau)\, R(\tau)\, \partial_{\theta'} \log p_{\theta'}(\tau)\, d\tau = E\left\{ \left( \sum\nolimits_{t=1}^{T} \partial_{\theta'} \log \pi(a_t|s_t, t) \right) R(\tau) \right\}, \tag{4.2}$$

where

$$\partial_{\theta'} \log p_{\theta'}(\tau) = \sum\nolimits_{t=1}^{T} \partial_{\theta'} \log \pi\!\left(a_t|s_t, t\right)$$

denotes the log-derivative of the path distribution. As this log-derivative depends only on the policy, we can estimate a gradient from rollouts, without having a model, by simply replacing the expectation by a sum. When $\theta'$ is close to $\theta$, we have the policy gradient estimator which is widely known as episodic REINFORCE [Williams, 1992]:

$$\lim_{\theta' \to \theta} \partial_{\theta'} L_{\theta}(\theta') = \partial_{\theta} J(\theta).$$

See Algorithm 4.1 for an example implementation of this algorithm and Appendix 4.A.1 for the detailed steps of the derivation. A MATLAB implementation of this algorithm is available at http://www.robotlearning.de/Member/JensKober.
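As a concrete illustration of how the expectation in Eq. (4.2) is replaced by a sum over sampled rollouts, the following Python sketch estimates the episodic REINFORCE gradient for a linear-Gaussian policy on a toy linear system. It is not the MATLAB implementation referenced above; the dynamics, reward, and all names are assumptions made purely for illustration.

```python
# Illustrative sketch of the episodic REINFORCE estimator in Eq. (4.2),
# assuming a one-dimensional linear-Gaussian policy pi(a|s) = N(theta*s, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
T, sigma = 20, 0.5                                    # horizon and exploration noise

def rollout(theta):
    """Sample one trajectory; return the summed per-step log-policy gradients
    (the log-derivative of the path distribution) and the return R(tau)."""
    s, grad_log, ret = 1.0, 0.0, 0.0
    for t in range(T):
        a = theta * s + sigma * rng.normal()
        grad_log += (a - theta * s) * s / sigma**2    # d/dtheta log pi(a_t|s_t)
        ret += np.exp(-(s**2 + 0.1 * a**2))           # assumed nonnegative reward
        s = 0.9 * s + a                               # toy linear dynamics
    return grad_log, ret

def reinforce_gradient(theta, num_rollouts=500):
    """Model-free estimate of Eq. (4.2): the expectation is replaced by an
    average over sampled rollouts."""
    samples = [rollout(theta) for _ in range(num_rollouts)]
    return np.mean([g * r for g, r in samples])

theta = 0.0
for _ in range(100):                                  # plain gradient ascent on J(theta)
    theta += 0.01 * reinforce_gradient(theta)
print("learned gain:", theta)
```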

