
How can we compute the utility value of being in a state? If we reach the goal state, the utility value should be high, say 1. But what would be the utility value in other states? One simple way to compute static utility values in a system with known starting and goal states is given here. Suppose the agent reaches the goal S7 from S1 (fig. 13.13) through a state, say S2. Now we repeat the experiment and find how many times S2 has been visited. If we assume that out of 100 experiments, S2 is visited 5 times, then we assign the utility of state S2 as 5/100 = 0.05. Further, we may assume that the agent can move from one state to a neighboring state (diagonal movements not allowed) with an unbiased probability. For example, the agent can move from S1 to S2 or S6 (but not to S5) with a probability of 0.5. If it is in S5, it could move to S2, S4, S8 or S6 with a probability of 0.25.
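Since fig. 13.13 is not reproduced here, the following sketch assumes a 3×3 grid whose adjacency is consistent with the text (S1 borders S2 and S6; S5 borders S2, S4, S6 and S8), with S1 as the start and S7 as the goal; the grid layout itself is an assumption. It estimates the static utilities by repeating the random-walk experiment and counting, for each state, the fraction of experiments in which that state is visited.

```python
import random

# Assumed adjacency for the 3x3 grid of fig. 13.13 (the figure is not
# reproduced); it is consistent with the text: S1 borders S2 and S6,
# and S5 borders S2, S4, S6 and S8.  Diagonal moves are not allowed.
NEIGHBOURS = {
    'S1': ['S2', 'S6'], 'S2': ['S1', 'S3', 'S5'], 'S3': ['S2', 'S4'],
    'S4': ['S3', 'S5', 'S9'], 'S5': ['S2', 'S4', 'S6', 'S8'],
    'S6': ['S1', 'S5', 'S7'], 'S7': ['S6', 'S8'],
    'S8': ['S5', 'S7', 'S9'], 'S9': ['S4', 'S8'],
}
START, GOAL = 'S1', 'S7'

def random_walk(start=START, goal=GOAL):
    """One experiment: an unbiased walk from the start to the goal."""
    state, visited = start, {start}
    while state != goal:
        state = random.choice(NEIGHBOURS[state])   # each neighbour equally likely
        visited.add(state)
    return visited

def static_utility(n_experiments=100):
    """Fraction of experiments in which each state was visited."""
    counts = {s: 0 for s in NEIGHBOURS}
    for _ in range(n_experiments):
        for s in random_walk():
            counts[s] += 1
    return {s: counts[s] / n_experiments for s in NEIGHBOURS}

print(static_utility())
```

The goal always receives a value of 1 under this scheme, while the value of an intermediate state reflects how often the random walks pass through it; the specific numbers in the text (5 visits out of 100) are only illustrative.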

We here make an important assumption on utility: "The utility of a sequence is the sum of the rewards accumulated in the states of the sequence" [13]. Static utility values are difficult to extract, as their estimation requires a large number of experiments. The key to reinforcement learning is to update the utility values, given the training sequences [13].

In adaptive dynamic programming, we compute the utility U(i) of state i by the following expression:

U(i) = R(i) + ∑∀j Mij U(j)    (13.5)

where R(i) is the reward of being in state i and Mij is the probability of transition from state i to state j.

In adaptive dynamic programming, we presume the agent to be passive. So, we do not want to maximize the ∑∀j Mij U(j) term. For a small stochastic system, we can evaluate U(i), ∀i, by solving the set of utility equations of the form (13.5) for all states. But when the state space is large, this approach becomes intractable.
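For a small state space, the equations (13.5) form a linear system that can be solved directly. The sketch below uses the same assumed 3×3 grid as before, treats the goal S7 as terminal (no outgoing transitions, which makes the system non-singular), and assumes an illustrative reward vector of +1 at the goal and a small step cost of −0.04 elsewhere; the text does not specify these reward values.

```python
import numpy as np

# Same assumed 3x3 grid as before (fig. 13.13 is not reproduced),
# with S7 the goal state.
NEIGHBOURS = {
    'S1': ['S2', 'S6'], 'S2': ['S1', 'S3', 'S5'], 'S3': ['S2', 'S4'],
    'S4': ['S3', 'S5', 'S9'], 'S5': ['S2', 'S4', 'S6', 'S8'],
    'S6': ['S1', 'S5', 'S7'], 'S7': ['S6', 'S8'],
    'S8': ['S5', 'S7', 'S9'], 'S9': ['S4', 'S8'],
}
STATES = sorted(NEIGHBOURS)                     # S1 .. S9
INDEX = {s: i for i, s in enumerate(STATES)}
GOAL = 'S7'
n = len(STATES)

# Transition matrix M for the passive, unbiased agent.  The goal is
# treated as terminal (no outgoing transitions), so (I - M) is
# non-singular and (13.5) has a unique solution.
M = np.zeros((n, n))
for s, nbrs in NEIGHBOURS.items():
    if s == GOAL:
        continue
    for t in nbrs:
        M[INDEX[s], INDEX[t]] = 1.0 / len(nbrs)

# Assumed rewards: +1 at the goal, -0.04 per non-goal state (illustrative).
R = np.full(n, -0.04)
R[INDEX[GOAL]] = 1.0

# Solve U = R + M U, i.e. (I - M) U = R, for all states at once.
U = np.linalg.solve(np.eye(n) - M, R)
for s in STATES:
    print(s, round(U[INDEX[s]], 3))
```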

13.4.3 Temporal Difference Learning<br />

To avoid solving the constraint equations like (13.5), we make an alternative formulation to compute U(i) by the following expression:

U(i) ← U(i) + α [R(i) + U(j) − U(i)]    (13.6)

where α is the learning rate, normally set in [0, 1], and j is the state observed to follow state i in the training sequence.
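A sketch of this temporal difference update on the same assumed grid is given below: each random training walk applies (13.6) along every observed transition from a state i to its successor j. The learning rate, the number of training walks, and the rewards (+1 at the goal, −0.04 elsewhere, as in the previous sketch) are all illustrative choices, not values prescribed by the text.

```python
import random

# Same assumed 3x3 grid as before; S1 is the start and S7 the goal.
NEIGHBOURS = {
    'S1': ['S2', 'S6'], 'S2': ['S1', 'S3', 'S5'], 'S3': ['S2', 'S4'],
    'S4': ['S3', 'S5', 'S9'], 'S5': ['S2', 'S4', 'S6', 'S8'],
    'S6': ['S1', 'S5', 'S7'], 'S7': ['S6', 'S8'],
    'S8': ['S5', 'S7', 'S9'], 'S9': ['S4', 'S8'],
}
START, GOAL = 'S1', 'S7'

# Assumed rewards, as in the previous sketch.
R = {s: (1.0 if s == GOAL else -0.04) for s in NEIGHBOURS}

def td_learn(n_walks=5000, alpha=0.1):
    """Estimate U by applying update (13.6) along random training walks."""
    U = {s: 0.0 for s in NEIGHBOURS}
    for _ in range(n_walks):
        i = START
        while i != GOAL:
            j = random.choice(NEIGHBOURS[i])        # observed successor of i
            U[i] += alpha * (R[i] + U[j] - U[i])    # equation (13.6)
            i = j
        U[GOAL] += alpha * (R[GOAL] - U[GOAL])      # terminal state: no successor
    return U

print(td_learn())
```

Unlike adaptive dynamic programming, no transition model Mij is built or solved here; the utilities are adjusted incrementally from the training sequences alone.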
