
In the last expression, we updated U(i) by taking into account that we should favor a transition from state i to state j when U(j) >> U(i). Since we consider the temporal difference of utilities, we call this kind of learning temporal difference (TD) learning.

It may seem that when a rare transition occurs from state i to state j, U(j) − U(i) will be too large, making U(i) large by (13.6). However, it should be kept in mind that the average value of U(i) will not change much, even though its instantaneous value may occasionally appear large.
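
To make the update concrete, here is a minimal Python sketch of passive TD learning. Since (13.6) is not reproduced above, it is assumed here to have the standard temporal-difference form $U(i) \leftarrow U(i) + \alpha\,[R(i) + U(j) - U(i)]$; the toy state sequence, rewards and learning rate are illustrative assumptions, not taken from the text.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)

def td_update(U, i, reward_i, j):
    """Apply U(i) <- U(i) + alpha * (R(i) + U(j) - U(i)) after observing
    a transition from state i to state j (the assumed form of (13.6))."""
    U[i] += ALPHA * (reward_i + U[j] - U[i])

# Hypothetical training sequence of (state, R(state)) pairs.
U = defaultdict(float)
sequence = [("s1", 0.0), ("s2", 0.0), ("s3", 1.0)]
for (i, r_i), (j, _) in zip(sequence, sequence[1:]):
    td_update(U, i, r_i, j)

print(dict(U))  # utilities after one observed sequence
```

A rare transition to a high-utility state j raises U(i) sharply in a single call, but repeated runs over many training sequences keep the average value of U(i) close to its correct value.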

13.4.4 Active Learning

For a passive learner, we considered M to be a constant matrix. For an active learner, however, it must be a variable matrix. So, we redefine the utility equation (13.5) as follows:

$U(i) = R(i) + \max_a \sum_{\forall j} M^a_{ij}\, U(j)$    (13.7)

where $M^a_{ij}$ denotes the probability of reaching state j through an action 'a' performed at state i. The agent will now choose at state i the action a for which the expected utility $\sum_{\forall j} M^a_{ij}\, U(j)$ is maximum. Consequently, U(i) will be maximum.
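
A small sketch of how an active learner might solve (13.7) by repeated sweeps over the states is given below. The two-state-plus-goal model, the rewards and the number of sweeps are illustrative assumptions; the terminal "goal" state is held fixed at its own reward so that the utilities remain bounded.

```python
R = {"s1": -0.04, "s2": -0.04, "goal": 1.0}            # R(i), assumed values
TERMINAL = {"goal"}
M = {                                                  # M[a][i][j] = P(j | i, a)
    "right": {"s1": {"s2": 0.8, "s1": 0.2}, "s2": {"goal": 0.8, "s2": 0.2}},
    "stay":  {"s1": {"s1": 1.0},            "s2": {"s2": 1.0}},
}
U = {s: 0.0 for s in R}

for _ in range(100):                                   # sweep until U settles
    new_U = {}
    for i in R:
        if i in TERMINAL:
            new_U[i] = R[i]
        else:
            # U(i) = R(i) + max_a sum_j M[a][i][j] * U(j), as in (13.7)
            new_U[i] = R[i] + max(
                sum(p * U[j] for j, p in M[a][i].items()) for a in M
            )
    U = new_U

# The agent at state s1 picks the action with the largest expected utility.
best_a = max(M, key=lambda a: sum(p * U[j] for j, p in M[a]["s1"].items()))
print(U, best_a)
```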

13.4.5 Q-Learning

In Q-learning, instead of utility values, we use Q-values. We employ Q(a, i) to denote the Q-value of doing an action a at state i. The utility values and Q-values are related by the following expression:

$U(i) = \max_a Q(a, i)$.    (13.8)

Like utilities, we can construct a constraint equation that holds at equilibrium, when the Q-values are correct [13]:

$Q(a, i) = R(i) + \sum_{\forall j} M^a_{ij} \max_{a'} Q(a', j)$.    (13.9)

The corresponding temporal-difference updating equation is given by

$Q(a, i) \leftarrow Q(a, i) + \alpha\,[\, R(i) + \max_{a'} Q(a', j) - Q(a, i)\,]$    (13.9(a))

which is to be evaluated after every transition from state i to state j. Q-learning continues to follow expression (13.9(a)) until the Q-values at each state i in the state space reach a steady value.
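
The following Python sketch applies the update (13.9(a)) along a hypothetical corridor of four states; the epsilon-greedy action choice, the step costs, the learning rate and the convention of pinning the terminal state's Q-values at its reward are all illustrative assumptions, not part of the text.

```python
import random
from collections import defaultdict

ALPHA, EPSILON = 0.1, 0.1                   # assumed learning/exploration rates
ACTIONS = ["left", "right"]
R = {0: -0.04, 1: -0.04, 2: -0.04, 3: 1.0}  # R(i); state 3 is the goal
GOAL = 3

Q = defaultdict(float)                      # Q[(a, i)]
for a in ACTIONS:                           # pin terminal Q-values at R(goal)
    Q[(a, GOAL)] = R[GOAL]

def choose_action(i):
    """Epsilon-greedy: usually exploit argmax_a Q(a, i), sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(a, i)])

def step(i, a):
    """Deterministic corridor 0-1-2-3."""
    return min(GOAL, i + 1) if a == "right" else max(0, i - 1)

def q_update(i, a, j):
    """One application of (13.9(a)) after a transition from i to j."""
    best_next = max(Q[(a2, j)] for a2 in ACTIONS)
    Q[(a, i)] += ALPHA * (R[i] + best_next - Q[(a, i)])

for episode in range(500):
    i = 0
    for _ in range(100):                    # cap episode length
        a = choose_action(i)
        j = step(i, a)
        q_update(i, a, j)
        i = j
        if i == GOAL:
            break

print({(a, i): round(Q[(a, i)], 2) for i in R for a in ACTIONS})
```

Once the Q-values stop changing appreciably, the greedy action at each state, $\arg\max_a Q(a, i)$, gives the learned policy, and $U(i) = \max_a Q(a, i)$ recovers the utilities of (13.8).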
