Exercise Set 3

The first three exercises are meant simply to be thought provoking and do not have specific answers.

Exercise 3.2  Is the reinforcement learning framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

Exercise 3.3  Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

Exercise 3.4  Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for −1 upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?

Exercise 3.5  Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.1). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?
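Exercise 3.5 above and Exercises 3.10 and 3.17 below refer to the return definitions (3.1) and (3.2). As a reference sketch, assuming the first-edition notation in which the return at time t is written R_t, the final time step of an episode is T, and γ is the discount rate, those equations read:

```latex
% Total-reward (episodic) return, equation (3.1):
R_t = r_{t+1} + r_{t+2} + \cdots + r_T
% Discounted return, equation (3.2):
R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}
```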


Exercise 3.6 (Modified) Broken Vision System  Imagine that you are a vision system. When you are first turned on, an image floods into your camera. You can see lots of things, but not all things. You can't see objects that are occluded, and of course you can't see objects that are behind you. While seeing that first scene, do you have access to a Markov state? Suppose your camera was broken and you received no images at all, all day. Would you have access to a Markov state then?

Exercise 3.7  Assuming a finite MDP with a finite number of reward values, write an equation for the transition probabilities and the expected rewards in terms of the joint conditional distribution in (3.5).

Exercise 3.8  What is the Bellman equation for action values, for Q^π? It must give the action value Q^π(s, a) in terms of the action values, Q^π(s', a'), of possible successors to the state–action pair (s, a). As a hint, the backup diagram corresponding to this equation is given in Figure 3.4b. Show the sequence of equations analogous to (3.10), but for action values.

Exercise 3.9  The Bellman equation (3.10) must hold for each state for the value function V^π shown in Figure 3.5b. As an example, show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, −0.4, and +0.7. (These numbers are accurate only to one decimal place.)

Exercise 3.10  In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.2), that adding a constant C to all the rewards adds a constant, K, to the values of all states, and thus does not affect the relative values of any states under any policies. What is K in terms of C and γ?

Exercise 3.11  Now consider adding a constant C to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.

Exercise 3.12  The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

[Backup diagram: a root state node s, with value V^π(s), branching to action nodes a_1, a_2, a_3; each action a is taken with probability π(s, a) and has value Q^π(s, a).]

Give the equation corresponding to this intuition and diagram for the value at the root node, V^π(s), in terms of the value at the expected leaf node, Q^π(s, a), given s_t = s. This expectation depends on the policy, π. Then give a second equation in which the expected value is written out explicitly in terms of π(s, a) such that no expected value notation appears in the equation.
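Exercises 3.8, 3.9, and 3.13 also lean on the transition probabilities (3.6), the expected rewards (3.7), and the Bellman equation (3.10). A sketch of those formulas, assuming the first-edition notation used elsewhere in this set:

```latex
% Transition probabilities, equation (3.6):
P^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}
% Expected rewards, equation (3.7):
R^{a}_{ss'} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}
% Bellman equation for the state-value function, equation (3.10):
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \bigl[ R^{a}_{ss'} + \gamma V^{\pi}(s') \bigr]
```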

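Returning to Exercise 3.9, a minimal numerical sketch of the check it asks for, assuming the setup of the surrounding gridworld example rather than anything stated in the exercise itself: an equiprobable random policy, γ = 0.9, zero reward for interior moves, and deterministic transitions to the four neighbors.

```python
# Sketch: check the Bellman equation (3.10) at the gridworld's center state.
# Assumed setup (from the surrounding example, not the exercise statement):
# equiprobable random policy, gamma = 0.9, interior moves earn reward 0,
# and each of the four actions leads deterministically to one neighbor.
gamma = 0.9
neighbor_values = [2.3, 0.4, -0.4, 0.7]  # V^pi of the four neighboring states
reward = 0.0                             # reward for each interior move

v_center = sum(0.25 * (reward + gamma * v) for v in neighbor_values)
print(v_center)  # ~0.675, which matches the +0.7 shown, to one decimal place
```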

Exercise 3.13  The value of an action, Q^π(s, a), can be divided into two parts, the expected next reward, which does not depend on the policy π, and the expected sum of the remaining rewards, which depends on the next state and the policy. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

[Backup diagram: a root state–action node (s, a), with value Q^π(s, a), branching on rewards r_1, r_2, r_3 to next-state nodes s'_1, s'_2, s'_3, each with value V^π(s').]

Give the equation corresponding to this intuition and diagram for the action value, Q^π(s, a), in terms of the expected next reward, r_{t+1}, and the expected next state value, V^π(s_{t+1}), given that s_t = s and a_t = a. Then give a second equation, writing out the expected value explicitly in terms of P^a_{ss'} and R^a_{ss'}, defined respectively by (3.6) and (3.7), such that no expected value notation appears in the equation.

Exercise 3.14  Draw or describe the optimal state-value function for the golf example.

Exercise 3.15  Draw or describe the contours of the optimal action-value function for putting, Q*(s, putter), for the golf example.

Exercise 3.16  Give the Bellman equation for Q* for the recycling robot.

Exercise 3.17  Figure 3.8 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.2) to express this value symbolically, and then to compute it to three decimal places.
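For Exercise 3.17, a hedged sketch of the geometric-series arithmetic the symbolic expression leads to, assuming γ = 0.9 and assuming (the reader should verify this against the optimal policy) that the best state yields a reward of +10 and is revisited every five steps under that policy:

```python
# Sketch for Exercise 3.17, under assumptions the reader should confirm:
# gamma = 0.9, the best state earns reward +10, and the optimal policy
# returns to it every 5 steps, so V*(best) = 10 + gamma**5 * V*(best).
gamma = 0.9
v_best = 10 / (1 - gamma ** 5)
print(round(v_best, 3))  # 24.419, consistent with the 24.4 quoted from Figure 3.8
```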
