Learning Motor Skills - Intelligent Autonomous Systems
4 Policy Search for Motor Primitives in Robotics
Many motor skills in humanoid robotics can be learned using parametrized motor primitives. While successful applications to date have been achieved with imitation learning, most of the interesting motor learning problems are high-dimensional reinforcement learning problems. These problems are often beyond the reach of current reinforcement learning methods. In this chapter, we study parametrized policy search methods and apply them to benchmark problems of motor primitive learning in robotics. We show that many well-known parametrized policy search methods can be derived from a general, common framework. This framework yields both policy gradient methods and expectation-maximization (EM) inspired algorithms. We introduce a novel EM-inspired algorithm for policy learning that is particularly well-suited for dynamical system motor primitives. We compare this algorithm, both in simulation and on a real robot, to several well-known parametrized policy search methods such as episodic REINFORCE, ‘Vanilla’ Policy Gradients with optimal baselines, the episodic Natural Actor Critic, and episodic Reward-Weighted Regression. We show that the proposed method outperforms them on an empirical benchmark of learning dynamical system motor primitives both in simulation and on a real robot. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task on a real Barrett WAM robot arm.
4.1 Introduction
To date, most robots are still taught by a skilled human operator, either via direct programming or a teach-in. Learning approaches for automatic task acquisition and refinement would be a key step in making robots progress towards autonomous behavior. Although imitation learning can make this task more straightforward, it will always be limited by the observed demonstrations. For many motor learning tasks, skill transfer by imitation learning is prohibitively hard, given that the human teacher is not capable of conveying sufficient task knowledge in the demonstration. In such cases, reinforcement learning is often an alternative to a teacher’s presentation, or a means of improving upon it. In the high-dimensional domain of anthropomorphic robotics, with its continuous states and actions, reinforcement learning suffers particularly from the curse of dimensionality. However, by using a task-appropriate policy representation and encoding prior knowledge into the system by imitation learning, local reinforcement learning approaches are capable of dealing with the problems of this domain. Policy search (also known as policy learning) is particularly well-suited in this context, as it allows the use of domain-appropriate pre-structured policies [Toussaint and Goerick, 2007], the straightforward integration of a teacher’s presentation [Guenter et al., 2007, Peters and Schaal, 2006], as well as fast online learning [Bagnell et al., 2004, Ng and Jordan, 2000, Hoffman et al., 2007]. Recently, policy search has become an accepted alternative to value-function-based reinforcement learning [Bagnell et al., 2004, Strens and Moore, 2001, Kwee et al., 2001, Peshkin, 2001, El-Fakdi et al., 2006, Taylor et al., 2007] due to many of these advantages.
In this chapter, we will introduce a policy search framework for episodic reinforcement learning and show how it relates to policy gradient methods [Williams, 1992, Sutton et al., 1999, Lawrence et al., 2003, Tedrake et al., 2004, Peters and Schaal, 2006] as well as expectation-maximization (EM) inspired algorithms [Dayan and Hinton, 1997, Peters and Schaal, 2007]. This framework allows us to re-derive or to generalize well-known approaches such as episodic REINFORCE [Williams, 1992], the policy gradient theorem [Sutton et al., 1999, Peters and Schaal, 2006], the episodic Natural Actor Critic [Peters et al., 2003, 2005], and an episodic generalization of the Reward-Weighted Regression [Peters and Schaal, 2007]. We derive a new algorithm called Policy Learning by Weighting Exploration with the Returns (PoWER), which is particularly well-suited for the learning of trial-based tasks in motor control.
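The policy gradient methods named above all build on the likelihood-ratio (score-function) gradient estimator that underlies episodic REINFORCE. As an illustrative sketch only, the following minimal example estimates that gradient, with a variance-reducing average-return baseline, for a one-dimensional Gaussian policy on an invented toy problem; the reward function, parameter names, and step sizes here are assumptions made for this example and do not come from the chapter.

```python
import random

def episodic_reinforce(theta=0.0, sigma=0.5, target=2.0,
                       episodes=200, iterations=300, alpha=0.05, seed=0):
    """Toy episodic REINFORCE: a Gaussian policy a ~ N(theta, sigma^2)
    picks a single action per episode; the (invented) reward is
    r = -(a - target)^2, so the optimal mean is theta = target.

    The score function of the Gaussian mean is (a - theta) / sigma^2,
    and subtracting the average return as a baseline leaves the
    gradient estimate unbiased while reducing its variance."""
    rng = random.Random(seed)
    for _ in range(iterations):
        scores, rewards = [], []
        for _ in range(episodes):
            a = rng.gauss(theta, sigma)           # sample one rollout
            rewards.append(-(a - target) ** 2)    # episodic return
            scores.append((a - theta) / sigma ** 2)
        baseline = sum(rewards) / episodes        # optimal-in-spirit baseline
        grad = sum((r - baseline) * s
                   for r, s in zip(rewards, scores)) / episodes
        theta += alpha * grad                     # vanilla gradient ascent
    return theta
```

On this toy problem the mean converges close to the target; in the chapter's setting the same estimator acts on the parameters of dynamical system motor primitives rather than a scalar mean.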