TESIS DOCTORAL - Robotics Lab - Universidad Carlos III de Madrid

Although the amplified one (Figure 8.4(a)) has higher values, the two policies seem to be equal. However, focusing on the go to the player action, this is not the case. This action is required in order to satisfy the need for entertainment. In Figure 8.4(a), the Q value associated with this action is the fourth highest positive value. In contrast, in Figure 8.4(b), this Q value is negative, and actions unrelated to the fun motivation rank above it.

Using the Amplified Reward, the learned values are higher; therefore, the back-propagation along all successive needed actions is stronger and reaches more distant actions faster. Longer experiments would probably end with a positive value for the go to the player action, but by means of the Amplified Reward this is achieved in a shorter period of time (an illustrative sketch of this effect is given after the summary).

Well-balanced exploration

As explained in Section 6.3.1, an exhaustive exploration of all situations is needed in order to correctly learn the proper behaviors. Next, a situation where exploration is poorly achieved is shown. Figure 8.5 presents a learning session of four hundred iterations where the Well-balanced Exploration has not been considered. It corresponds to the dominant motivation relax, whose associated drive is the slowest one (as explained in Section 5.4.1).

The most remarkable issue in Figure 8.5 is the long periods during which none of the values are updated. Roughly, these periods correspond to the iteration ranges from 0 to 160 and from 250 to 390, which amount to about an hour and a half. Such long-lasting periods of stable values during a learning session mean that this motivation is not explored in those periods; in other words, relax does not frequently become the dominant motivation. These circumstances lead to a set of state-action pairs that are not sufficiently explored and therefore will not be properly learned in an acceptable amount of time.

The effects of the Well-balanced Exploration when relax is the dominant motivation can be observed in Figure 8.3(b). During the whole learning session, the state-action pairs related to the relax motivation are updated frequently, and the long periods of undesired stability in a particular motivation no longer appear (a sketch of this balancing idea is also given after the summary).

8.5 Summary

At the beginning, this chapter introduced the structure of the experiments, with two different phases: the exploring phase, where learning is achieved, and the exploiting phase, where the learned policy is employed. Moreover, the available active objects were introduced, among them the users: two people share the robot's environment during the experiments, Perico (who always interacts positively) and Alvaro (who sporadically harms the robot).

This chapter has demonstrated the correct working of the DMS. Initially, how the intensities of the motivations are formed through their interconnections with internal and external stimuli was clarified and examined in a fragment of a real experiment.
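As referenced above, the following is a minimal, hypothetical sketch of the amplification effect: tabular Q-learning on a three-step chain of required actions in which only the final step is rewarded. The chain length, learning rate, discount factor, reward magnitudes, and the fixed positive margin are all illustrative assumptions, not the parameters used in the thesis; the sketch only shows why a larger terminal reward back-propagates to the earliest action of the chain, such as go to the player, in fewer iterations.

```python
# Illustrative sketch (assumed parameters, not the thesis's setup):
# a 3-step chain of required actions where only the last step pays off.
# We count how many episodes it takes the Q value of the FIRST action
# in the chain to exceed a fixed positive margin, standing in for
# "go to the player reaching a positive value among competing actions".

ALPHA, GAMMA = 0.3, 0.8   # learning rate and discount factor (assumed)
CHAIN_LEN = 3             # e.g. go to player -> approach -> play

def episodes_until_margin(reward, margin=0.5):
    """Episodes of tabular Q-learning until Q of the first action > margin."""
    q = [0.0] * CHAIN_LEN
    for episode in range(1, 1001):
        for s in range(CHAIN_LEN):
            r = reward if s == CHAIN_LEN - 1 else 0.0        # only last step rewarded
            next_q = q[s + 1] if s + 1 < CHAIN_LEN else 0.0  # terminal after last step
            q[s] += ALPHA * (r + GAMMA * next_q - q[s])      # standard Q-learning update
        if q[0] > margin:
            return episode
    return None

print("plain reward:    ", episodes_until_margin(reward=1.0))  # more episodes needed
print("amplified reward:", episodes_until_margin(reward=5.0))  # crosses the margin sooner
```

Because every update is linear in the reward, amplifying it scales all Q values up, so the value of the most distant action crosses any fixed positive margin in fewer iterations; this is the stronger, faster back-propagation observed in Figure 8.4(a).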
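The actual Well-balanced Exploration mechanism is the one defined in Section 6.3.1; the sketch below only illustrates the general idea with a hypothetical minimum-share rule: if a motivation's share of the learning iterations falls below a floor, it is forced to become dominant, so a slow drive such as relax never goes unexplored for long stretches. All names, weights, and the 15% floor are assumptions for illustration.

```python
import random

# Hypothetical balancing rule (an assumption, not the mechanism of
# Section 6.3.1): whenever a motivation's share of the learning
# iterations drops below a floor, it is forced to become dominant,
# so its state-action pairs keep being updated.

MOTIVATIONS = ["fun", "relax", "energy", "social"]
update_counts = {m: 0 for m in MOTIVATIONS}

def drive_intensity(m):
    # Stand-in for the real drives; relax grows far more slowly (Section 5.4.1).
    return random.random() * (0.1 if m == "relax" else 1.0)

def dominant_motivation(min_share=0.15):
    total = sum(update_counts.values()) + 1
    starved = [m for m in MOTIVATIONS if update_counts[m] / total < min_share]
    if starved:
        # Force the least-explored starved motivation to become dominant.
        return min(starved, key=update_counts.get)
    # Otherwise, the strongest drive wins as usual.
    return max(MOTIVATIONS, key=drive_intensity)

for iteration in range(400):   # one simulated 400-iteration learning session
    m = dominant_motivation()
    update_counts[m] += 1      # the Q values tied to m are updated here

print(update_counts)  # relax keeps a minimum share instead of long idle stretches
```

In the unbalanced session of Figure 8.5, relax behaves like a motivation whose count stays flat for iterations 0 to 160 and 250 to 390; a floor of this kind prevents exactly those idle stretches, which is the qualitative effect visible in Figure 8.3(b).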
