DOCTORAL THESIS - Robotics Lab - Universidad Carlos III de Madrid
5.4. Featuring Maggie's DMS

At each iteration, the dominant motivation is computed as the maximum motivation whose value (internal needs plus external stimulus) exceeds its activation level. This parameter has been fixed to 10 for every motivation. Considering the dominant motivation, the current states related to objects, and the Q values associated with each feasible action in this state, the next action is chosen. These Q values represent how good a particular action is in a particular state.

At the beginning of the robot's life, it does not have any knowledge, so learning is essential. In order to help learning, the robot explores all possibilities many times. But, in order to live better, the robot has to exploit the acquired knowledge to make the best decisions. This is the dilemma of exploration vs. exploitation, referred to many times in the field of reinforcement learning [163]. The level of exploration represents the probability of executing actions other than those with the highest values. Exploitation means selecting the action with the highest value in each situation. Therefore, during the robot's life there are two clearly differentiated phases: the learning or exploration phase, and the exploiting phase.

Then, according to a specific level of exploration/exploitation, the probability of selecting an action differs. Using the Boltzmann distribution, the probability of selecting an action a in a given state s is determined by Equation (5.1).

    P_s(a) = \frac{e^{Q(s,a)/T}}{\sum_{b \in A} e^{Q(s,b)/T}}    (5.1)

Q(s, a) is the value for action a in state s, and A represents the set of all possible actions; T is the temperature, and it weights exploration against exploitation. A high value of T gives the same likelihood of selection to all possible actions, so the next action is almost randomly selected; a low T favors actions with high values: the higher the value, the higher the probability of being executed. This approach has been previously used by Gadanho [197, 198]. As presented in [49], the value of T is set according to Equation (5.2).

    T = \delta \cdot \bar{Q}    (5.2)

where \bar{Q} is the mean of all possible Q values. According to Equation (5.2), a high δ implies a high temperature and, therefore, exploration dominates: all actions have the same probability of being selected. A low δ produces a low temperature and, consequently, exploitation prevails: actions with high values are likely to be chosen.

Therefore, when learning is essential, δ is set to a very high value so that actions are chosen randomly, independently of their values, and all actions are explored. However, when it is desired to select the most appropriate actions, δ is minimized; then the action with the highest value is always chosen.

During the experiments, δ is varied depending on the phase of the robot's life: during learning, a high level of exploration is required (δ = 100), so the action selection is totally random.
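The following short Python sketch illustrates how Boltzmann (softmax) action selection with the temperature of Equation (5.2) trades exploration off against exploitation. It is a minimal illustration assuming a Q-table stored as a dictionary; the action names, Q values, and the boltzmann_selection helper are hypothetical and do not correspond to Maggie's actual implementation.

```python
import math
import random

def boltzmann_selection(q_values, delta):
    """Pick an action using Boltzmann (softmax) selection.

    q_values : dict mapping action name -> Q(s, a) for the current state s
    delta    : exploration factor; the temperature is T = delta * mean(Q),
               following Equation (5.2).
    """
    q_mean = sum(q_values.values()) / len(q_values)
    # Temperature proportional to the mean Q value (Eq. 5.2); the abs() and the
    # lower bound are safeguards for this sketch, not part of the thesis formula.
    temperature = max(delta * abs(q_mean), 1e-6)

    # Boltzmann probabilities (Eq. 5.1): P_s(a) = exp(Q(s,a)/T) / sum_b exp(Q(s,b)/T)
    exps = {a: math.exp(q / temperature) for a, q in q_values.items()}
    total = sum(exps.values())
    probabilities = {a: e / total for a, e in exps.items()}

    # Sample one action according to these probabilities.
    return random.choices(list(probabilities), weights=list(probabilities.values()))[0]

# Hypothetical Q values for one state of the robot.
q_state = {"play": 2.0, "recharge": 1.0, "idle": 0.2}

print(boltzmann_selection(q_state, delta=100.0))  # high delta: near-uniform choice (exploration)
print(boltzmann_selection(q_state, delta=0.05))   # low delta: almost always "play" (exploitation)
```

With δ = 100 the temperature is large, so the three actions are drawn with nearly equal probability, while a small δ makes the highest-valued action dominate, mirroring the exploration and exploitation phases described above.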
