Let us now assume that the game is being played between a computer and a person, and that the computer keeps a record of the moves it has chosen in all its turns in a game. This record is kept in a matrix, where the (i, j)-th element stands for the probability of success if the computer, in its turn, makes a change from the j-th to the i-th state. It is to be noted that the sum of all elements under each column of the above matrix is one. This follows intuitively, since the next state must be one of the possible states listed under that column. The structure of the matrix is presented in fig. 13.12.

It should be added that the system learns from a reward-penalty mechanism. This is realized in the following manner. After completion of a game, the computer adjusts the elements of the matrix. If it wins the game, then the elements corresponding to all its moves are increased by δ and the rest of the elements under each column are decreased equally, so that the column-sum remains one. On the other hand, if the computer loses the game, then the elements corresponding to its moves are decreased by δ and the remaining elements under each column are increased equally, so that the column-sum again remains one.

After a large number of such trials, the matrix becomes invariant, and the computer in its turn selects the state with the highest probability under the given column.
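A minimal sketch of the reward-penalty adjustment just described, assuming the matrix M of the previous snippet, a small fixed δ, and a list of the (next-state, previous-state) pairs the computer chose during the game; the clipping and renormalization at the end are an added safeguard against negative entries, not something the text prescribes.

```python
import numpy as np

def update_move_matrix(M, moves, won, delta=0.05):
    """Reward-penalty update of the column-stochastic move matrix M.

    moves: list of (i, j) pairs, one per computer turn, meaning the game
           was changed from state j to state i.
    won:   True if the computer won the game, False if it lost.
    """
    n = M.shape[0]
    sign = 1.0 if won else -1.0
    for i, j in moves:
        M[i, j] += sign * delta                 # raise (or lower) the chosen move by delta
        others = [k for k in range(n) if k != i]
        M[others, j] -= sign * delta / (n - 1)  # spread the opposite change over column j
    np.clip(M, 0.0, None, out=M)                # safeguard (assumption, not in the text)
    M /= M.sum(axis=0, keepdims=True)           # keep every column summing to one
    return M

def choose_next_state(M, current_state):
    """Once the matrix has become (nearly) invariant, pick the most
    promising next state for the current column, as described above."""
    return int(np.argmax(M[:, current_state]))
```

Over many games the columns concentrate probability on the moves that led to wins, after which choose_next_state simply reads off the most probable transition for the current state.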

13.4.2 Adaptive Dynamic Programming

Reinforcement learning presumes that the agent receives a response from the environment but can determine its status (rewarding or punishable) only at the end of its activity, called the terminal state. We also assume that initially the agent is at a state S0 and, after performing an action on the environment, it moves to a new state S1. If the action is denoted by a0, we write

    S0 --a0--> S1,

i.e., because of action a0, the agent changes its state from S0 to S1. Further, the reward of an agent can be represented by a utility function. For example, the points scored by a ping-pong agent could be its utility.
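To make the S0 --a0--> S1 notation and the idea of a terminal-state reward concrete, here is a small illustrative sketch; the state names, the toy transition table, and the reward value are assumptions, not taken from the text.

```python
# Assumed toy environment: (state, action) -> next state.
transitions = {
    ("S0", "a0"): "S1",
    ("S1", "a1"): "S2",
    ("S2", "a2"): "terminal",
}
terminal_reward = {"terminal": 1.0}   # e.g. the points scored by a ping-pong agent

def run_episode(policy, start="S0"):
    """Follow the policy from the start state to the terminal state and
    return the visited states together with the reward observed there."""
    state, visited = start, [start]
    while state not in terminal_reward:
        state = transitions[(state, policy[state])]
        visited.append(state)
    return visited, terminal_reward[state]
```

With policy = {"S0": "a0", "S1": "a1", "S2": "a2"}, the episode traces S0 → S1 → S2 → terminal, and the reward observed at the terminal state plays the role of the agent's utility.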

The agent in reinforcement learning could be either passive or active. A passive learner attempts to learn the utility simply through its presence in different states. An active learner, on the other hand, can infer the utility at unknown states from the knowledge gained through learning.
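As an illustration of how a passive learner might learn the utility through its presence in different states, the sketch below keeps a running average of the terminal rewards of the episodes that passed through each state; this sample-average scheme is an assumed, simple mechanism used only for illustration, not the specific algorithm developed in the text.

```python
from collections import defaultdict

class PassiveUtilityEstimator:
    """Running-average utility estimate per visited state (assumed scheme)."""

    def __init__(self):
        self.total = defaultdict(float)   # sum of terminal rewards seen from each state
        self.count = defaultdict(int)     # number of episodes that visited each state

    def record_episode(self, visited_states, terminal_reward):
        for state in set(visited_states):          # credit every visited state once
            self.total[state] += terminal_reward
            self.count[state] += 1

    def utility(self, state):
        if self.count[state] == 0:
            return 0.0                             # unvisited state: no estimate yet
        return self.total[state] / self.count[state]
```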
