Kenji Doya 2001

More documents

Recommendations

Info

ferent state representations [11]. The model replicated manyof the experimental findings, including the differential effectsof blockade experiments.The hypothesis that different coordinate frames are usedin different stages of sequence learning was tested in a humansequence learning experiment [33]. Once the subjectslearned a button press sequence, their generalization performancewas tested in two conditions: one in which the targetsin the same spatial location were pressed with differentfinger movements and another in which different spatial targetswere pressed with the same finger movements. Analysisof response times showed significantly betterperformance in the latter case (i.e., sequences in the samemotor coordinates) after extended training. This supportsour hypothesis that sequence representation in motor coordinatestakes time to be developed, but, once developed, allowsquick execution.Learning to Stand withHierarchical RepresentationsThe application of reinforcement learning to a high-dimensionaldynamical system is quite difficult because of thecomplexity of learning the value function in a high-dimensionalspace, known as the curse ofdimensionality. Thus, we developed a hierarchicalreinforcement learning architecture [34]. Inthe upper level, coarsely discretized states in areduced-dimensional space are used to makeglobal exploration feasible. In the lower level, localdynamics in a continuous, high-dimensionalstate space are considered for smooth controlperformance.We applied this hierarchical reinforcementlearning architecture to the task of learning tovertically balance a three-link robot (Fig. 9). Thegoal was to find the dynamic movement sequencefor standing. The reward was given bythe height of the head and a punishment wasgiven when the robot tumbled. Within severalhundred trials, a successful pattern of standingup was achieved by the hierarchical reinforcementlearning system. The learning was severaltimes quicker than with simple reinforcementlearning [34].The reason for this quick learning was the selectionof the upper-level state representation,which included kinematic, task-oriented variablessuch as the relative center of mass positionmeasured from the foot. Although the hierarchicalarchitecture was developed simply to let therobot learn the task quickly (before the hardwarebroke down), it is interesting that themultimodal representation resembles the architectureof the multiple cortico-basal ganglialoops [11], [27].ut ( )xt ( )ut ( )Emergence of Modular OrganizationA remarkable feature of the biological motor control systemis its realization of both flexibility and robustness. For example,in a motor adaptation experiment of reaching to a targetwhile wearing prism glasses, if a subject is well adapted toan altered condition and then returns to a normal condition,there is an after-effect (i.e., an error in the opposite direction).However, de-adaptation to a normal condition is usuallymuch faster than adaptation to a novel condition.Further, if a subject is trained alternately in normal and alteredconditions, she will adapt to either condition veryquickly. Such results suggest that a subject does not simplymodify the parameters of a single controller, but retainsmultiple controllers for different conditions and can switchbetween them easily. Evidence from arm reaching experimentssuggests that the outputs of controllers for similarconditions can be smoothly interpolated for a novel, intermediatecondition [15].The idea of switching among multiple controllers is quitecommon; however, a difficult problem in designing an adaptivemodular control system is how to select an appropriatemodule for a given situation. To evaluate a set of controllers,we basically have to test the performance of each controllerPredictor 1Controller 1Predictor iController iPredictor nController nEnvironmentFigure 10. The MOSAIC architecture.^.xt i( )ut i( )λ i ( t).xt ( )xt ( )++Softmax^λi ( ) tAugust <strong>2001</strong> IEEE Control Systems Magazine 51
one by one, which takes a lot of time. On the other hand, aset of predictors can be evaluated simultaneously by runningthem in parallel and comparing their prediction outputswith the output of the real system. This simple factmotivated a modular control architecture in which eachcontroller is paired with a predictor [35].Fig. 10 shows the module selection and identification control(MOSAIC) architecture, based on the prediction error ofeach moduleE ( t) = x$ ( t) −x( t)iiwhere x$ i( t)is the output of the ith predictor. The responsibilitysignal is given by the soft-max function..θ [rad/s ]2..θ [rad/s ]220100−10−20620100−10−2060441Before Learning2 0 2θ [rad](a)After 200 Trials20θ [rad](b)23TimeFigure 11. A result of learning to swing up an underpoweredpendulum using a reinforcement learning version of the MOSAICarchitecture [36]. (a) The sinusoidal nonlinearity of the gravity termin the angular acceleration was approximated by two linear models,shown in different colors. In each module, a local quadratic rewardmodel was also learned and a linear feedback policy was derived bysolving a Riccati equation. (b) The resulting policy for the firstmodule made the downward position unstable. As the pendulummoved upward, based on the relative prediction errors of the twoprediction models, the second module was selected and stabilizedthe upward position.22,4446652( Ei( t)/σ )2( Ejt σ )expλi( t ) =∑ exp ( ) /jThis is used for weighting the outputs of multiple controllers,i.e.,u( t) =Σλ ( t) u ( t),i i iwhere ui ( t)is the output of the ith controller. The responsibilitysignal is also used to weight the learning rates of thepredictors and the controllers of the modules, which causesmodules to be specialized for different situations. The parameterσ, which controls the sharpness of module selection,is initially set large to avoid suboptimal specializationof modules.Fig. 11 shows an example of using this scheme in a simplenonlinear control task of swinging up a pendulum [36]. Eachmodule learns a locally linear dynamic model and a locallyquadratic reward model. Based on these models, the valuefunction and the corresponding control policy for eachmodule are derived by solving a Riccati equation, whichmakes learning much faster than by iterative estimation ofthe value function. After about 100 trials, each module successfullyapproximated the sinusoidal nonlinearity in thedynamics in either the bottom or top half of the state space.Accordingly, a controller that destabilizes the stable equilibriumat the bottom and another controller that stabilizesthe unstable equilibrium at the top were derived. They weresuccessfully switched based on the responsibility signal.The scheme has also been shown to be applicable to anonstationary control task. The results suggest the usefulnessof this biologically motivated modular control architecturein decomposing nonlinear and/or nonstationarycontrol tasks in space and time based on the predictabilityof the system dynamics. A careful theoretical study assessingthe conditions in which modular learning and controlmethods work reliably is still required.Imamizu and colleagues performed a series of visuomotoradaptation experiments using a computer mousewith rotated pointing direction (e.g., the cursor moves tothe right when the mouse is moved upward). Initially inlearning, a large part of the cerebellum was activated. However,as the subject became proficient in the rotated mousemovement, small spots of activation were found in the lateralcerebellum, which can be interpreted as the neural correlateof the internal model of the new tool [37].Furthermore, when the subject was asked to use two differentkinds of an unusual computer mouse, two different setsof activation spots were found in the lateral cerebellum (Fig.12) [38].Experiments of multiple sequential movement in monkeyshave shown that neurons in the supplementary motor area(SMA) are selectively activated during movements in particularsequences. Furthermore, in an adjacent area calledpre-SMA, some neurons were activated when the monkey.52 IEEE Control Systems Magazine August <strong>2001</strong>
Page 12 and 13: was instructed to change the moveme

Kenji Doya 2001

Create successful ePaper yourself

Delete template?

Save as template?