Learning Motor Skills - Intelligent Autonomous Systems
2007], such a derivation results in a lower bound on the expected return using Jensen's inequality and the concavity of the logarithm. Thus, we obtain

$$\log J(\theta') = \log \int p_{\theta'}(\tau) R(\tau)\, \mathrm{d}\tau = \log \int \frac{p_{\theta}(\tau)}{p_{\theta}(\tau)}\, p_{\theta'}(\tau) R(\tau)\, \mathrm{d}\tau \geq \int p_{\theta}(\tau) R(\tau) \log \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)}\, \mathrm{d}\tau + \mathrm{const},$$
which is proportional to

$$-D\!\left(p_{\theta}(\tau) R(\tau) \,\middle\|\, p_{\theta'}(\tau)\right) = L_{\theta}(\theta'),$$

where

$$D\!\left(p(\tau) \,\middle\|\, q(\tau)\right) = \int p(\tau) \log \frac{p(\tau)}{q(\tau)}\, \mathrm{d}\tau$$
denotes the Kullback-Leibler divergence, and the constant is needed for tightness of the bound. Note that $p_{\theta}(\tau) R(\tau)$ is an improper probability distribution, as pointed out by Dayan and Hinton [1997]. The policy improvement step is equivalent to maximizing the lower bound on the expected return $L_{\theta}(\theta')$, and we will now show how it relates to previous policy learning methods.
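As a numerical sanity check of this bound, the following sketch evaluates both sides on a small discrete trajectory space. The five-element trajectory space, the two path distributions, and the returns are all made-up illustrative values, not quantities from the text; the constant comes out as $\log J(\theta)$ once the improper distribution $p_{\theta}(\tau)R(\tau)$ is normalized, which is why the bound is only stated up to proportionality.

```python
import numpy as np

# Illustrative check of the Jensen lower bound on log J(theta') over a
# hypothetical discrete trajectory space (all values made up for demonstration).
rng = np.random.default_rng(0)

n = 5                                        # number of discrete "trajectories"
p_old = rng.random(n); p_old /= p_old.sum()  # p_theta(tau)
p_new = rng.random(n); p_new /= p_new.sum()  # p_theta'(tau)
R = rng.random(n) + 0.1                      # strictly positive returns R(tau)

J_old = np.sum(p_old * R)                    # J(theta)
J_new = np.sum(p_new * R)                    # J(theta')

# Normalize the improper distribution p_theta(tau) R(tau).
p_bar = p_old * R / J_old

# Jensen's inequality (concavity of log):
#   log J(theta') >= sum_tau p_bar(tau) log(p_new/p_old) + log J(theta)
lower_bound = np.sum(p_bar * np.log(p_new / p_old)) + np.log(J_old)
assert np.log(J_new) >= lower_bound

# The bound is tight at theta' = theta (the log-ratio term vanishes).
tight = np.sum(p_bar * np.log(p_old / p_old)) + np.log(J_old)
assert np.isclose(tight, np.log(J_old))
```

Maximizing the first term of the bound over $p_{\theta'}$ is exactly minimizing the KL divergence from the reward-weighted path distribution, matching the expression for $L_{\theta}(\theta')$ above.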
Resulting Policy Updates

In this section, we will discuss three different policy updates, which are directly derived from the results of Section 4.2.2. First, we show that policy gradients [Williams, 1992, Sutton et al., 1999, Lawrence et al., 2003, Tedrake et al., 2004, Peters and Schaal, 2006] can be derived from the lower bound $L_{\theta}(\theta')$, which is straightforward from a supervised learning perspective [Binder et al., 1997]. Subsequently, we show that natural policy gradients [Bagnell and Schneider, 2003, Peters and Schaal, 2006] can be seen as an additional constraint regularizing the change in the path distribution resulting from a policy update when improving the policy incrementally. Finally, we will show how expectation-maximization (EM) algorithms for policy learning can be generated.

Policy Gradients.
When differentiating the function $L_{\theta}(\theta')$ that defines the lower bound on the expected return, we directly obtain

$$\partial_{\theta'} L_{\theta}(\theta') = \int p_{\theta}(\tau) R(\tau)\, \partial_{\theta'} \log p_{\theta'}(\tau)\, \mathrm{d}\tau = E\left\{ \left( \textstyle\sum_{t=1}^{T} \partial_{\theta'} \log \pi(a_t \mid s_t, t) \right) R(\tau) \right\}, \quad (4.2)$$
where

$$\partial_{\theta'} \log p_{\theta'}(\tau) = \sum\nolimits_{t=1}^{T} \partial_{\theta'} \log \pi\!\left(a_t \mid s_t, t\right)$$
denotes the log-derivative of the path distribution. As this log-derivative depends only on the policy, we can estimate a gradient from rollouts, without having a model, by simply replacing the expectation with a sample average over sampled trajectories. When $\theta'$ is close to $\theta$, we have the policy gradient estimator which is widely known as episodic REINFORCE [Williams, 1992]:

$$\lim_{\theta' \to \theta} \partial_{\theta'} L_{\theta}(\theta') = \partial_{\theta} J(\theta).$$
See Algorithm 4.1 for an example implementation of this algorithm and Appendix 4.A.1 for the detailed steps of the derivation. A MATLAB implementation of this algorithm is available at http://www.robotlearning.de/Member/JensKober.
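As an informal illustration of the estimator in Equation (4.2), the following Python sketch (not the referenced MATLAB implementation) estimates the gradient from rollouts for a hypothetical one-step Gaussian policy with a quadratic reward, both of which are assumptions chosen here only because they admit an analytic gradient to compare against.

```python
import numpy as np

# Minimal episodic REINFORCE sketch for Eq. (4.2), assuming a hypothetical
# one-step Gaussian policy pi(a) = N(theta, sigma^2) and quadratic reward
# R(a) = -(a - a_star)^2 (illustrative choices, not from the text).
rng = np.random.default_rng(0)

theta, sigma, a_star = 0.5, 1.0, 0.0
N = 100_000                                # number of rollouts

a = rng.normal(theta, sigma, size=N)       # sample actions from the policy
R = -(a - a_star) ** 2                     # episodic return of each rollout

# Log-derivative of a Gaussian policy: d/dtheta log pi(a|theta).
dlogpi = (a - theta) / sigma ** 2

# REINFORCE: replace the expectation in Eq. (4.2) by a sample average.
grad_est = np.mean(dlogpi * R)

# Analytic check: J(theta) = -((theta - a_star)^2 + sigma^2),
# hence dJ/dtheta = -2 (theta - a_star).
grad_true = -2.0 * (theta - a_star)
assert abs(grad_est - grad_true) < 0.1
```

The estimator is unbiased but typically high-variance; in practice a baseline subtracted from $R(\tau)$ reduces the variance without biasing the gradient, which is one motivation for the refinements discussed in the following sections.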
4 Policy Search for Motor Primitives in Robotics