chooses an action a ∈ A_s that causes the environment to transition to state s′ in the next time step t+1 and to return a reward with the expected value r ∈ R(s, a). Here S is the set of all possible states, A_s is the set of all possible actions in a given state s, and R(s, a) is the function that determines the reward for taking an action a in state s. The probability that the process advances to state s′ is given by the state transition function P(s′|s, a). The agent's behaviour is described by a policy π that determines how the agent chooses an action in any given state. The optimal value in this setup can be obtained using dynamic programming and is given by Bellman's equation (Equation 1) [19], which relates the value function V*(s) in one time step to the value function V*(s′) in the next time step. Here, γ ∈ [0, 1) is the discount factor that determines the importance of future rewards.

V^*(s) = \max_{a \in A_s} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \Big].   (1)

In the reinforcement learning setting both P(s′|s, a) and R(s, a) are unknown, so the agent has little choice but to physically act in the environment to observe the immediate reward, and to use the samples gathered over time to build estimates of the expected return in each state, in the hope of obtaining a good approximation of the optimal policy. Typically, the agent tries to maximise some cumulative function of the immediate rewards, such as the expected discounted return R^π(s) (Equation 2) at each time step t. R^π(s) captures the infinite-horizon discounted (by γ) sum of the rewards that the agent may expect (denoted by E) to receive starting in state s and following the policy π.

R^\pi(s) = E\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \}.   (2)

One way to maximise this function is to evaluate all policies by simply following each one, sampling the rewards obtained, and then choosing the policy that gave the best return. The obvious problem with such a brute-force method is that the number of possible policies is often too large to be practical. Furthermore, if rewards are stochastic, even more samples will be required in order to estimate the expected return. A practical solution, based on Bellman's work on value iteration, is Watkins' Q-learning algorithm [20], given by the action-value function (Equation 3). The Q-function gives the expected discounted return for taking action a in state s and following the policy π thereafter. Here α is the learning rate that determines to what extent the existing Q-value (i.e., Q^π(s, a)) will be corrected by the new update (i.e., R(s, a) + γ max_{a′} Q(s′, a′)), and max_{a′} Q(s′, a′) is the maximum possible reward in the following state, i.e., it is the reward for taking the optimal action thereafter.

Q^\pi(s, a) \leftarrow Q^\pi(s, a) + \alpha \Big[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q^\pi(s, a) \Big].   (3)

In order to learn the Q-values, the agent must try out the available actions in each state and learn from these experiences over time. Given that acting and learning are interleaved and ongoing performance is important, a key challenge when
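To make the update in Equation 3 concrete, the sketch below shows tabular Q-learning in Python. It is a minimal illustration, not the implementation used here: the environment interface (reset(), actions(s), step(a)), the ε-greedy exploration rule, and the default values of α, γ and ε are assumptions introduced only for the example. The one step taken directly from the text is the update line, which moves Q(s, a) a fraction α towards the sampled target r + γ max_{a′} Q(s′, a′).

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    # Q-values stored in a table; unseen (state, action) pairs default to 0.
    Q = defaultdict(float)

    for _ in range(episodes):
        s = env.reset()                      # assumed interface: returns the initial state
        done = False
        while not done:
            actions = env.actions(s)         # assumed interface: the action set A_s
            # epsilon-greedy exploration: usually exploit the current Q-values,
            # occasionally try a random action so that every action gets sampled.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)    # assumed interface: one interaction step

            # max_{a'} Q(s', a'): estimated value of acting optimally from s';
            # taken to be 0 at terminal states, where no future reward follows.
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions(s_next))

            # Equation 3: nudge Q(s, a) a fraction alpha towards the sampled
            # target r + gamma * max_{a'} Q(s', a').
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

            s = s_next
    return Q

Once the estimates have stabilised, a greedy policy that picks argmax_a Q(s, a) in each state serves as the approximation of the optimal policy discussed above.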
