Exercise Set 3

The first three exercises are meant simply to be thought provoking and do not have specific answers.

Exercise 3.2  Is the reinforcement learning framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

Exercise 3.3  Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

Exercise 3.4  Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for −1 upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?

Exercise 3.5  Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.1). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?
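Exercise 3.5 above and Exercises 3.10 and 3.17 below refer to the return definitions (3.1) and (3.2). As a reference sketch, assuming the first-edition notation in which the return at time t is written R_t, the final time step of an episode is T, and γ is the discount rate, those equations read:

```latex
% Total-reward (episodic) return, equation (3.1):
R_t = r_{t+1} + r_{t+2} + \cdots + r_T
% Discounted return, equation (3.2):
R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}
```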


Exercise 3.6 (Modified) Broken Vision System  Imagine that you are a vision system. When you are first turned on, an image floods into your camera. You can see lots of things, but not all things. You can't see objects that are occluded, and of course you can't see objects that are behind you. While seeing that first scene, do you have access to a Markov state? Suppose your camera was broken and you received no images at all, all day. Would you have access to a Markov state then?

Exercise 3.7  Assuming a finite MDP with a finite number of reward values, write an equation for the transition probabilities and the expected rewards in terms of the joint conditional distribution in (3.5).

Exercise 3.8  What is the Bellman equation for action values, for Q^π? It must give the action value Q^π(s, a) in terms of the action values, Q^π(s', a'), of possible successors to the state–action pair (s, a). As a hint, the backup diagram corresponding to this equation is given in Figure 3.4b. Show the sequence of equations analogous to (3.10), but for action values.

Exercise 3.9  The Bellman equation (3.10) must hold for each state for the value function V^π shown in Figure 3.5b. As an example, show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, −0.4, and +0.7. (These numbers are accurate only to one decimal place.)

Exercise 3.10  In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.2), that adding a constant C to all the rewards adds a constant, K, to the values of all states, and thus does not affect the relative values of any states under any policies. What is K in terms of C and γ?

Exercise 3.11  Now consider adding a constant C to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.

Exercise 3.12  The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

[Backup diagram: a root state node s, with value V^π(s), branching to action nodes a_1, a_2, a_3; each action a is taken with probability π(s, a) and has value Q^π(s, a).]

Give the equation corresponding to this intuition and diagram for the value at the root node, V^π(s), in terms of the value at the expected leaf node, Q^π(s, a), given s_t = s. This expectation depends on the policy, π. Then give a second equation in which the expected value is written out explicitly in terms of π(s, a) such that no expected value notation appears in the equation.
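Exercises 3.8, 3.9, and 3.13 also lean on the transition probabilities (3.6), the expected rewards (3.7), and the Bellman equation (3.10). A sketch of those formulas, assuming the first-edition notation used elsewhere in this set:

```latex
% Transition probabilities, equation (3.6):
P^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}
% Expected rewards, equation (3.7):
R^{a}_{ss'} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}
% Bellman equation for the state-value function, equation (3.10):
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \bigl[ R^{a}_{ss'} + \gamma V^{\pi}(s') \bigr]
```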

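Returning to Exercise 3.9, a minimal numerical sketch of the check it asks for, assuming the setup of the surrounding gridworld example rather than anything stated in the exercise itself: an equiprobable random policy, γ = 0.9, zero reward for interior moves, and deterministic transitions to the four neighbors.

```python
# Sketch: check the Bellman equation (3.10) at the gridworld's center state.
# Assumed setup (from the surrounding example, not the exercise statement):
# equiprobable random policy, gamma = 0.9, interior moves earn reward 0,
# and each of the four actions leads deterministically to one neighbor.
gamma = 0.9
neighbor_values = [2.3, 0.4, -0.4, 0.7]  # V^pi of the four neighboring states
reward = 0.0                             # reward for each interior move

v_center = sum(0.25 * (reward + gamma * v) for v in neighbor_values)
print(v_center)  # ~0.675, which matches the +0.7 shown, to one decimal place
```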

Exercise 3.13  The value of an action, Q^π(s, a), can be divided into two parts, the expected next reward, which does not depend on the policy π, and the expected sum of the remaining rewards, which depends on the next state and the policy. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

[Backup diagram: a root state–action node (s, a), with value Q^π(s, a), branching on rewards r_1, r_2, r_3 to next-state nodes s'_1, s'_2, s'_3, each with value V^π(s').]

Give the equation corresponding to this intuition and diagram for the action value, Q^π(s, a), in terms of the expected next reward, r_{t+1}, and the expected next state value, V^π(s_{t+1}), given that s_t = s and a_t = a. Then give a second equation, writing out the expected value explicitly in terms of P^a_{ss'} and R^a_{ss'}, defined respectively by (3.6) and (3.7), such that no expected value notation appears in the equation.

Exercise 3.14  Draw or describe the optimal state-value function for the golf example.

Exercise 3.15  Draw or describe the contours of the optimal action-value function for putting, Q*(s, putter), for the golf example.

Exercise 3.16  Give the Bellman equation for Q* for the recycling robot.

Exercise 3.17  Figure 3.8 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.2) to express this value symbolically, and then to compute it to three decimal places.
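For Exercise 3.17, a hedged sketch of the geometric-series arithmetic the symbolic expression leads to, assuming γ = 0.9 and assuming (the reader should verify this against the optimal policy) that the best state yields a reward of +10 and is revisited every five steps under that policy:

```python
# Sketch for Exercise 3.17, under assumptions the reader should confirm:
# gamma = 0.9, the best state earns reward +10, and the optimal policy
# returns to it every 5 steps, so V*(best) = 10 + gamma**5 * V*(best).
gamma = 0.9
v_best = 10 / (1 - gamma ** 5)
print(round(v_best, 3))  # 24.419, consistent with the 24.4 quoted from Figure 3.8
```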
