...different state representations [11]. The model replicated many of the experimental findings, including the differential effects of blockade experiments.

The hypothesis that different coordinate frames are used in different stages of sequence learning was tested in a human sequence learning experiment [33]. Once the subjects learned a button press sequence, their generalization performance was tested in two conditions: one in which the targets in the same spatial location were pressed with different finger movements and another in which different spatial targets were pressed with the same finger movements. Analysis of response times showed significantly better performance in the latter case (i.e., sequences in the same motor coordinates) after extended training. This supports our hypothesis that sequence representation in motor coordinates takes time to be developed, but, once developed, allows quick execution.

Learning to Stand with Hierarchical Representations

The application of reinforcement learning to a high-dimensional dynamical system is quite difficult because of the complexity of learning the value function in a high-dimensional space, known as the curse of dimensionality. Thus, we developed a hierarchical reinforcement learning architecture [34]. In the upper level, coarsely discretized states in a reduced-dimensional space are used to make global exploration feasible. In the lower level, local dynamics in a continuous, high-dimensional state space are considered for smooth control performance.

We applied this hierarchical reinforcement learning architecture to the task of learning to vertically balance a three-link robot (Fig. 9). The goal was to find the dynamic movement sequence for standing. The reward was given by the height of the head, and a punishment was given when the robot tumbled. Within several hundred trials, a successful pattern of standing up was achieved by the hierarchical reinforcement learning system. The learning was several times quicker than with simple reinforcement learning [34].

The reason for this quick learning was the selection of the upper-level state representation, which included kinematic, task-oriented variables such as the relative center of mass position measured from the foot. Although the hierarchical architecture was developed simply to let the robot learn the task quickly (before the hardware broke down), it is interesting that the multimodal representation resembles the architecture of the multiple cortico-basal ganglia loops [11], [27].
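The division of labor between the two levels can be sketched in a few lines of Python. This is a schematic rendering of the idea, not the algorithm of [34]: the class and function names (UpperLevel, discretize_com, lower_level_torque), the grid resolution, and the PD gains are all illustrative assumptions. The upper level performs tabular Q-learning over a coarse discretization of a reduced, task-oriented variable (here, an assumed center-of-mass offset from the foot) and selects among a few subgoal postures; the lower level is a continuous controller, here a simple PD servo standing in for the learned continuous-state controller, that drives the full joint state toward the current subgoal.

```python
import numpy as np

class UpperLevel:
    """Tabular Q-learning over coarsely discretized, reduced-dimensional
    states; actions are indices of candidate subgoal postures."""
    def __init__(self, n_states, n_subgoals, alpha=0.1, gamma=0.95, eps=0.1):
        self.Q = np.zeros((n_states, n_subgoals))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def choose_subgoal(self, s):
        # epsilon-greedy: the coarse state space keeps global exploration tractable
        if np.random.rand() < self.eps:
            return np.random.randint(self.Q.shape[1])
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next):
        # one-step Q-learning on the coarse, reduced-dimensional state
        td_error = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        self.Q[s, a] += self.alpha * td_error

def discretize_com(com_x, n_bins=8, x_min=-0.2, x_max=0.2):
    # map a continuous center-of-mass offset from the foot (illustrative
    # range, in meters) to a coarse upper-level state index
    return int(np.clip((com_x - x_min) / (x_max - x_min) * n_bins, 0, n_bins - 1))

def lower_level_torque(q, q_dot, q_subgoal, kp=30.0, kd=5.0):
    # continuous lower level: a PD servo toward the subgoal posture,
    # a stand-in for the learned continuous controller of [34]
    return kp * (q_subgoal - q) - kd * q_dot
```

Because the upper level only ever sees a handful of discrete states, its value table can be explored exhaustively, while smoothness of the actual movement is delegated entirely to the continuous lower level.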
Emergence of Modular Organization

A remarkable feature of the biological motor control system is its realization of both flexibility and robustness. For example, in a motor adaptation experiment of reaching to a target while wearing prism glasses, if a subject is well adapted to an altered condition and then returns to a normal condition, there is an after-effect (i.e., an error in the opposite direction). However, de-adaptation to a normal condition is usually much faster than adaptation to a novel condition. Further, if a subject is trained alternately in normal and altered conditions, she will adapt to either condition very quickly. Such results suggest that a subject does not simply modify the parameters of a single controller, but retains multiple controllers for different conditions and can switch between them easily. Evidence from arm reaching experiments suggests that the outputs of controllers for similar conditions can be smoothly interpolated for a novel, intermediate condition [15].

The idea of switching among multiple controllers is quite common; however, a difficult problem in designing an adaptive modular control system is how to select an appropriate module for a given situation. To evaluate a set of controllers, we basically have to test the performance of each controller one by one, which takes a lot of time. On the other hand, a set of predictors can be evaluated simultaneously by running them in parallel and comparing their prediction outputs with the output of the real system. This simple fact motivated a modular control architecture in which each controller is paired with a predictor [35].

[Figure 10. The MOSAIC architecture: n paired predictors and controllers acting on a common environment, with a soft-max over the prediction errors producing the responsibility signals λ_i(t).]

Fig. 10 shows the module selection and identification control (MOSAIC) architecture, based on the prediction error of each module

E_i(t) = x̂_i(t) − x(t),

where x̂_i(t) is the output of the ith predictor. The responsibility signal is given by the soft-max function

λ_i(t) = exp(−(E_i(t)/σ)²) / Σ_j exp(−(E_j(t)/σ)²).

This is used for weighting the outputs of multiple controllers, i.e.,

u(t) = Σ_i λ_i(t) u_i(t),

where u_i(t) is the output of the ith controller. The responsibility signal is also used to weight the learning rates of the predictors and the controllers of the modules, which causes modules to be specialized for different situations. The parameter σ, which controls the sharpness of module selection, is initially set large to avoid suboptimal specialization of modules.

Fig. 11 shows an example of using this scheme in a simple nonlinear control task of swinging up a pendulum [36]. Each module learns a locally linear dynamic model and a locally quadratic reward model. Based on these models, the value function and the corresponding control policy for each module are derived by solving a Riccati equation, which makes learning much faster than by iterative estimation of the value function. After about 100 trials, each module successfully approximated the sinusoidal nonlinearity in the dynamics in either the bottom or top half of the state space. Accordingly, a controller that destabilizes the stable equilibrium at the bottom and another controller that stabilizes the unstable equilibrium at the top were derived. They were successfully switched based on the responsibility signal.

[Figure 11. A result of learning to swing up an underpowered pendulum using a reinforcement learning version of the MOSAIC architecture [36]. (a) The sinusoidal nonlinearity of the gravity term in the angular acceleration was approximated by two linear models, shown in different colors. In each module, a local quadratic reward model was also learned, and a linear feedback policy was derived by solving a Riccati equation. (b) The resulting policy for the first module made the downward position unstable. As the pendulum moved upward, based on the relative prediction errors of the two prediction models, the second module was selected and stabilized the upward position.]

The scheme has also been shown to be applicable to a nonstationary control task. The results suggest the usefulness of this biologically motivated modular control architecture in decomposing nonlinear and/or nonstationary control tasks in space and time based on the predictability of the system dynamics. A careful theoretical study assessing the conditions in which modular learning and control methods work reliably is still required.
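A minimal Python sketch of this selection mechanism, applied to the pendulum example, is given below. It is a sketch under stated assumptions, not the implementation of [36]: the two local linear models are fixed by hand rather than learned, and a standard LQR cost (the matrices Q and R below) is substituted for the learned local quadratic reward models, so the bottom module here regulates the hanging position rather than destabilizing it as in Fig. 11. The function names responsibility and blended_control, the parameter values, and the local models are all illustrative; scipy.linalg.solve_continuous_are supplies the Riccati solution.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Pendulum state x = [theta, theta_dot]; theta = 0 is the upright position.
# True dynamics: theta_ddot = (g/l)*sin(theta) + u/(m*l**2).
g, l, m = 9.8, 1.0, 1.0
B = np.array([[0.0], [1.0 / (m * l**2)]])

# Two hand-fixed local linear models x_dot = A_i x + B u + c_i that
# approximate sin(theta); in [36] these are learned from experience.
modules = [
    {"A": np.array([[0.0, 1.0], [-g / l, 0.0]]),   # bottom: sin(th) ~ pi - th
     "c": np.array([0.0, g * np.pi / l]),
     "x_eq": np.array([np.pi, 0.0])},
    {"A": np.array([[0.0, 1.0], [g / l, 0.0]]),    # top: sin(th) ~ th
     "c": np.array([0.0, 0.0]),
     "x_eq": np.array([0.0, 0.0])},
]

# One linear feedback gain per module from the continuous-time Riccati
# equation (plain LQR cost; the paper instead derives each policy from a
# learned local quadratic reward model, giving a destabilizing bottom policy).
Q, R = np.diag([10.0, 1.0]), np.array([[1.0]])
for mod in modules:
    P = solve_continuous_are(mod["A"], B, Q, R)
    mod["K"] = np.linalg.solve(R, B.T @ P)         # u_i = -K_i (x - x_eq_i)

def responsibility(x, u, x_dot_obs, sigma=1.0):
    # lambda_i = exp(-(E_i/sigma)^2) / sum_j exp(-(E_j/sigma)^2), where
    # E_i is the prediction error of module i's forward model
    E = np.array([np.linalg.norm(mod["A"] @ x + B.ravel() * u + mod["c"] - x_dot_obs)
                  for mod in modules])
    w = np.exp(-(E / sigma) ** 2)
    return w / w.sum()

def blended_control(x, lam):
    # u = sum_i lambda_i u_i: responsibility-weighted controller outputs
    u_i = [float(-(mod["K"] @ (x - mod["x_eq"]))[0]) for mod in modules]
    return float(np.dot(lam, u_i))
```

In a closed loop, x_dot_obs comes from the plant, so the responsibilities shift from the bottom model to the top model as the pendulum swings up, and σ can be annealed from a large initial value exactly as the text prescribes to avoid premature specialization.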
Imamizu and colleagues performed a series of visuomotor adaptation experiments using a computer mouse with rotated pointing direction (e.g., the cursor moves to the right when the mouse is moved upward). Initially in learning, a large part of the cerebellum was activated. However, as the subject became proficient in the rotated mouse movement, small spots of activation were found in the lateral cerebellum, which can be interpreted as the neural correlate of the internal model of the new tool [37]. Furthermore, when the subject was asked to use two different kinds of unusual computer mouse, two different sets of activation spots were found in the lateral cerebellum (Fig. 12) [38].

[Figure 12. The activity in the cerebellum for two different kinds of computer mouse: a rotating mouse (red), in which the direction of the cursor movement is rotated, and an integrating mouse (yellow), in which the mouse position specifies the velocity of the cursor movement [38]. The subjects were asked to track a complex cursor movement trajectory on the screen, alternately using the two different mouse settings. Large areas in the cerebellum were activated initially. After several hours of training, activities were seen in limited spots in the lateral cerebellum, which were different for different types of mouse.]

Experiments on multiple sequential movements in monkeys have shown that neurons in the supplementary motor area (SMA) are selectively activated during movements in particular sequences. Furthermore, in an adjacent area called pre-SMA, some neurons were activated when the monkey was instructed to change the movement sequence [39]. These results suggest the possibility that modular organization of internal models like MOSAIC is realized in a circuit including the cerebellum and the cerebral cortex.

Conclusion

We reviewed three topics in motor control and learning: prediction of future rewards, the use of internal models, and modular decomposition of a complex task using multiple models. We have seen that these three aspects of motor control are related to the functions of the basal ganglia, the cerebellum, and the cerebral cortex, which are specialized for reinforcement, supervised, and unsupervised learning, respectively [1], [2].

Our brain undoubtedly implements the most efficient and robust control system available to date. However, how it really works cannot be understood just by watching its activity or by breaking it down piece by piece. It was not until the development of reinforcement learning theory that a clear light was shed on the function of the basal ganglia. Theories of adaptive control and studies of artificial neural networks were essential in understanding the function of the cerebellum. Such understanding provided new insights for the design of efficient learning and control systems. The theory of adaptive systems and the understanding of brain function are highly complementary developments.

Acknowledgments

We thank Raju Bapi, Hiroaki Gomi, Okihide Hikosaka, Hiroshi Imamizu, Jun Morimoto, Hiroyuki Nakahara, and Kazuyuki Samejima for their collaboration on this article. Studies reported here were supported by the ERATO and CREST programs of Japan Science and Technology (JST) Corp.

References

[1] K. Doya, “What are the computations of the cerebellum, the basal ganglia, and the cerebral cortex?” Neural Net., vol. 12, pp. 961-974, 1999.
[2] K. Doya, “Complementary roles of basal ganglia and cerebellum in learning and motor control,” Current Opinion in Neurobiology, vol. 10, pp. 732-739, 2000.
[3] R.S. Sutton and A.G. Barto, Reinforcement Learning. Cambridge, MA: MIT Press, 1998.
[4] G. Tesauro, “TD-Gammon, a self-teaching backgammon program, achieves master-level play,” Neural Comput., vol. 6, pp. 215-219, 1994.
[5] D.P. Bertsekas and J.N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[6] K. Doya, “Reinforcement learning in continuous time and space,” Neural Comput., vol. 12, pp. 219-245, 2000.
[7] W. Schultz, P. Dayan, and P.R. Montague, “A neural substrate of prediction and reward,” Science, vol. 275, pp. 1593-1599, 1997.
[8] J.C. Houk, J.L. Adams, and A.G. Barto, “A model of how the basal ganglia generate and use neural signals that predict reinforcement,” in Models of Information Processing in the Basal Ganglia, J.C. Houk, J.L. Davis, and D.G. Beiser, Eds. Cambridge, MA: MIT Press, 1995, pp. 249-270.
[9] P.R. Montague, P. Dayan, and T.J. Sejnowski, “A framework for mesencephalic dopamine systems based on predictive Hebbian learning,” J. Neurosci., vol. 16, pp. 1936-1947, 1996.
[10] J.N. Kerr and J.R. Wickens, “Dopamine D-1/D-5 receptor activation is required for long-term potentiation in the rat neostriatum in vitro,” J. Neurophysiol., vol. 85, pp. 117-124, 2001.
[11] H. Nakahara, K. Doya, and O. Hikosaka, “Parallel cortico-basal ganglia mechanisms for acquisition and execution of visuo-motor sequences: A computational approach,” J. Cognitive Neurosci., vol. 13, no. 5, 2001.
[12] R.E. Suri and W. Schultz, “Temporal difference model reproduces anticipatory neural activity,” Neural Comput., vol. 13, pp. 841-862, 2001.
[13] J. Brown, D. Bullock, and S. Grossberg, “How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues,” J. Neurosci., vol. 19, pp. 10502-10511, 1999.
[14] H. Gomi and M. Kawato, “Equilibrium-point control hypothesis examined by measured arm stiffness during multijoint movement,” Science, vol. 272, pp. 117-120, 1996.
[15] D.M. Wolpert, R.C. Miall, and M. Kawato, “Internal models in the cerebellum,” Trends in Cognitive Sciences, vol. 2, pp. 338-347, 1998.
[16] M. Kawato, “Internal models for motor control and trajectory planning,” Current Opinion in Neurobiology, vol. 9, pp. 718-727, 1999.
[17] D. Marr, “A theory of cerebellar cortex,” J. Physiol., vol. 202, pp. 437-470, 1969.
[18] J.S. Albus, “A theory of cerebellar function,” Math. Biosci., vol. 10, pp. 25-61, 1971.
[19] M. Ito, M. Sakurai, and P. Tongroach, “Climbing fibre induced depression of both mossy fibre responsiveness and glutamate sensitivity of cerebellar Purkinje cells,” J. Physiol., vol. 324, pp. 113-134, 1982.
[20] Y. Kobayashi, K. Kawano, A. Takemura, Y. Inoue, T. Kitama, H. Gomi, and M. Kawato, “Temporal firing patterns of Purkinje cells in the cerebellar ventral paraflocculus during ocular following responses in monkeys. II. Complex spikes,” J. Neurophysiol., vol. 80, pp. 832-848, 1998.
[21] S. Kitazawa, T. Kimura, and P.-B. Yin, “Cerebellar complex spikes encode both destinations and errors in arm movements,” Nature, vol. 392, pp. 494-497, 1998.
[22] M. Kawato, K. Furukawa, and R. Suzuki, “A hierarchical neural network model for control and learning of voluntary movement,” Biol. Cybern., vol. 57, pp. 169-185, 1987.
[23] M. Kawato, “The feedback-error-learning neural network for supervised motor learning,” in Neural Network for Sensory and Motor Systems, R. Eckmiller, Ed. Amsterdam: Elsevier, 1990, pp. 365-372.
[24] K. Doya, H. Kimura, and A. Miyamura, “Motor control: Neural models and system theory,” Appl. Math. Comput. Sci., vol. 11, pp. 101-128, 2001.
[25] A. Miyamura and H. Kimura, “Stability of feedback error learning scheme,” submitted for publication.
[26] M. Lotze, P. Montoya, M. Erb, E. Hulsmann, H. Flor, U. Klose, N. Birbaumer, and W. Grodd, “Activation of cortical and cerebellar motor areas during executed and imagined hand movements: An fMRI study,” J. Cognitive Neurosci., vol. 11, pp. 491-501, 1999.
[27] O. Hikosaka, H. Nakahara, M.K. Rand, K. Sakai, X. Lu, K. Nakamura, S. Miyachi, and K. Doya, “Parallel neural networks for learning sequential procedures,” Trends Neurosci., vol. 22, pp. 464-471, 1999.
[28] J.H. Gao, L.M. Parsons, J.M. Bower, J. Xiong, J. Li, and P.T. Fox, “Cerebellum implicated in sensory acquisition and discrimination rather than motor control,” Science, vol. 272, pp. 545-547, 1996.
[29] S.J. Blakemore, D.M. Wolpert, and C.D. Frith, “Central cancellation of self-produced tickle sensation,” Nature Neurosci., vol. 1, pp. 635-640, 1998.
[30] M. Ito, “Movement and thought: Identical control mechanisms by the cerebellum,” Trends Neurosci., vol. 16, pp. 448-450, 1993.
[31] Y.P. Shimansky, “Spinal motor control system incorporates an internal model of limb dynamics,” Biol. Cybern., vol. 83, pp. 379-389, 2000.
[32] E. Nakano, H. Imamizu, R. Osu, Y. Uno, H. Gomi, T. Yoshioka, and M. Kawato, “Quantitative examinations of internal representations for arm trajectory planning: Minimum commanded torque change model,” J. Neurophysiol., vol. 81, pp. 2140-2155, 1999.
[33] R.S. Bapi, K. Doya, and A.M. Harner, “Evidence for effector independent and dependent representations and their differential time course of acquisition during motor sequence learning,” Experimental Brain Res., vol. 132, pp. 149-162, 2000.
[34] J. Morimoto and K. Doya, “Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning,” in Proc. 17th Int. Conf. Machine Learning, 2000, pp. 623-630.
[35] D.M. Wolpert and M. Kawato, “Multiple paired forward and inverse models for motor control,” Neural Net., vol. 11, pp. 1317-1329, 1998.
[36] K. Doya, K. Samejima, K. Katagiri, and M. Kawato, “Multiple model-based reinforcement learning,” Japan Sci. and Technol. Corp., Kawato Dynamic Brain Project Tech. Rep. KDB-TR-08, 2000.
[37] H. Imamizu, S. Miyauchi, T. Tamada, Y. Sasaki, R. Takino, B. Pütz, T. Yoshioka, and M. Kawato, “Human cerebellar activity reflecting an acquired internal model of a new tool,” Nature, vol. 403, pp. 192-195, 2000.
[38] H. Imamizu, S. Miyauchi, Y. Sasaki, R. Takino, B. Pütz, and M. Kawato, “Separated modules for visuomotor control and learning in the cerebellum: A functional MRI study,” in NeuroImage: Third International Conference on Functional Mapping of the Human Brain, vol. 5, A.W. Toga, R.S.J. Frackowiak, and J.C. Mazziotta, Eds. Copenhagen, Denmark, 1997, p. S598.
[39] K. Shima, H. Mushiake, N. Saito, and J. Tanji, “Role for cells in the presupplementary motor area in updating motor plans,” Proc. Nat. Academy of Sciences, vol. 93, pp. 8694-8698, 1996.

Kenji Doya received the Ph.D. in engineering from the University of Tokyo in 1991. He was a Research Associate at the University of Tokyo in 1986, at the University of California, San Diego, in 1991, and at the Salk Institute in 1993. He has been a Senior Researcher at ATR International since 1994 and the Director of Metalearning, Neuromodulation, and Emotion Research, CREST, at JST, since 1999. He serves as an action editor of Neural Networks and Neural Computation and as a board member of the Japanese Neural Network Society. His research interests include reinforcement learning, the functions of the basal ganglia and the cerebellum, and the roles of neuromodulators in metalearning.

Hidenori Kimura received the Ph.D. in engineering from the University of Tokyo in 1970. He was appointed a faculty member at Osaka University in 1970, a Professor with the Department of Mechanical Engineering for Computer-Controlled Machinery, Osaka University, in 1987, and a Professor in the Department of Mathematical Engineering and Information Physics, University of Tokyo, in 1995. He has been working on the theory and application of robust control and system identification. He received the IEEE-CSS outstanding paper award in 1985 and the distinguished member award of the IEEE Control Systems Society in 1996. He is an IEEE Fellow.

Mitsuo Kawato received the Ph.D. in engineering from Osaka University in 1981. He became a faculty member at Osaka University in 1981, a Senior Researcher at ATR Auditory and Visual Processing Research Laboratories in 1988, a department head at ATR Human Information Processing Research Laboratories in 1992, and the leader of the Computational Neuroscience Project at Information Sciences Division, ATR International, in 2001. He has been the project leader of the Kawato Dynamic Brain Project, ERATO, JST, since 1996. He received an outstanding research award from the International Neural Network Society in 1992 and an award from the Ministry of Science and Technology in 1993. He serves as a co-editor-in-chief of Neural Networks and a board member of the Japanese Neural Network Society. His research interests include the functions of the cerebellum and the roles of internal models in motor control and cognitive functions.
