Learning Motor Skills - Intelligent Autonomous Systems
2007], such a derivation results in a lower bound on the expected return using Jensen's inequality and the concavity of the logarithm. Thus, we obtain

$$\log J(\theta') = \log \int p_{\theta'}(\tau) R(\tau)\, \mathrm{d}\tau = \log \int \frac{p_{\theta}(\tau)}{p_{\theta}(\tau)}\, p_{\theta'}(\tau) R(\tau)\, \mathrm{d}\tau \geq \int p_{\theta}(\tau) R(\tau) \log \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)}\, \mathrm{d}\tau + \mathrm{const},$$
which is proportional to

$$-D\!\left(p_{\theta}(\tau) R(\tau) \,\middle\|\, p_{\theta'}(\tau)\right) = L_{\theta}(\theta'),$$

where

$$D\!\left(p(\tau) \,\middle\|\, q(\tau)\right) = \int p(\tau) \log \frac{p(\tau)}{q(\tau)}\, \mathrm{d}\tau$$
denotes the Kullback-Leibler divergence, and the constant is needed for tightness of the bound. Note that $p_{\theta}(\tau) R(\tau)$ is an improper probability distribution, as pointed out by Dayan and Hinton [1997]. The policy improvement step is equivalent to maximizing the lower bound on the expected return $L_{\theta}(\theta')$, and we will now show how it relates to previous policy learning methods.
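As a numerical sanity check of this bound, the following sketch evaluates both sides on a small discrete trajectory space. The five-element trajectory space, the two path distributions, and the returns are all made-up illustrative values, not quantities from the text; the constant comes out as $\log J(\theta)$ once the improper distribution $p_{\theta}(\tau)R(\tau)$ is normalized, which is why the bound is only stated up to proportionality.

```python
import numpy as np

# Illustrative check of the Jensen lower bound on log J(theta') over a
# hypothetical discrete trajectory space (all values made up for demonstration).
rng = np.random.default_rng(0)

n = 5                                        # number of discrete "trajectories"
p_old = rng.random(n); p_old /= p_old.sum()  # p_theta(tau)
p_new = rng.random(n); p_new /= p_new.sum()  # p_theta'(tau)
R = rng.random(n) + 0.1                      # strictly positive returns R(tau)

J_old = np.sum(p_old * R)                    # J(theta)
J_new = np.sum(p_new * R)                    # J(theta')

# Normalize the improper distribution p_theta(tau) R(tau).
p_bar = p_old * R / J_old

# Jensen's inequality (concavity of log):
#   log J(theta') >= sum_tau p_bar(tau) log(p_new/p_old) + log J(theta)
lower_bound = np.sum(p_bar * np.log(p_new / p_old)) + np.log(J_old)
assert np.log(J_new) >= lower_bound

# The bound is tight at theta' = theta (the log-ratio term vanishes).
tight = np.sum(p_bar * np.log(p_old / p_old)) + np.log(J_old)
assert np.isclose(tight, np.log(J_old))
```

Maximizing the first term of the bound over $p_{\theta'}$ is exactly minimizing the KL divergence from the reward-weighted path distribution, matching the expression for $L_{\theta}(\theta')$ above.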
Resulting Policy Updates

In this section, we will discuss three different policy updates, which are directly derived from the results of Section 4.2.2. First, we show that policy gradients [Williams, 1992, Sutton et al., 1999, Lawrence et al., 2003, Tedrake et al., 2004, Peters and Schaal, 2006] can be derived from the lower bound $L_{\theta}(\theta')$, which is straightforward from a supervised learning perspective [Binder et al., 1997]. Subsequently, we show that natural policy gradients [Bagnell and Schneider, 2003, Peters and Schaal, 2006] can be seen as an additional constraint regularizing the change in the path distribution resulting from a policy update when improving the policy incrementally. Finally, we will show how expectation-maximization (EM) algorithms for policy learning can be generated.

Policy Gradients.
When differentiating the function $L_{\theta}(\theta')$ that defines the lower bound on the expected return, we directly obtain

$$\partial_{\theta'} L_{\theta}(\theta') = \int p_{\theta}(\tau) R(\tau)\, \partial_{\theta'} \log p_{\theta'}(\tau)\, \mathrm{d}\tau = E\left\{ \left( \textstyle\sum_{t=1}^{T} \partial_{\theta'} \log \pi(a_t \mid s_t, t) \right) R(\tau) \right\}, \quad (4.2)$$
where

$$\partial_{\theta'} \log p_{\theta'}(\tau) = \sum\nolimits_{t=1}^{T} \partial_{\theta'} \log \pi\!\left(a_t \mid s_t, t\right)$$
denotes the log-derivative of the path distribution. As this log-derivative depends only on the policy, we can estimate a gradient from rollouts, without having a model, by simply replacing the expectation with a sample average over sampled trajectories. When $\theta'$ is close to $\theta$, we have the policy gradient estimator which is widely known as episodic REINFORCE [Williams, 1992]:

$$\lim_{\theta' \to \theta} \partial_{\theta'} L_{\theta}(\theta') = \partial_{\theta} J(\theta).$$
See Algorithm 4.1 for an example implementation of this algorithm and Appendix 4.A.1 for the detailed steps of the derivation. A MATLAB implementation of this algorithm is available at http://www.robotlearning.de/Member/JensKober.
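As an informal illustration of the estimator in Equation (4.2), the following Python sketch (not the referenced MATLAB implementation) estimates the gradient from rollouts for a hypothetical one-step Gaussian policy with a quadratic reward, both of which are assumptions chosen here only because they admit an analytic gradient to compare against.

```python
import numpy as np

# Minimal episodic REINFORCE sketch for Eq. (4.2), assuming a hypothetical
# one-step Gaussian policy pi(a) = N(theta, sigma^2) and quadratic reward
# R(a) = -(a - a_star)^2 (illustrative choices, not from the text).
rng = np.random.default_rng(0)

theta, sigma, a_star = 0.5, 1.0, 0.0
N = 100_000                                # number of rollouts

a = rng.normal(theta, sigma, size=N)       # sample actions from the policy
R = -(a - a_star) ** 2                     # episodic return of each rollout

# Log-derivative of a Gaussian policy: d/dtheta log pi(a|theta).
dlogpi = (a - theta) / sigma ** 2

# REINFORCE: replace the expectation in Eq. (4.2) by a sample average.
grad_est = np.mean(dlogpi * R)

# Analytic check: J(theta) = -((theta - a_star)^2 + sigma^2),
# hence dJ/dtheta = -2 (theta - a_star).
grad_true = -2.0 * (theta - a_star)
assert abs(grad_est - grad_true) < 0.1
```

The estimator is unbiased but typically high-variance; in practice a baseline subtracted from $R(\tau)$ reduces the variance without biasing the gradient, which is one motivation for the refinements discussed in the following sections.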
4 Policy Search for Motor Primitives in Robotics