
Roy
Netanel

Learning Agents Playing Whist
Reinforcement Learning


About the game

Whist is a four-player card game that involves many random elements.
A full description of the game can be found here:
http://www.pagat.com/exact/israeli_whist.html

Our simplified version
Each player (1-4) is dealt 6 cards.
Each player then makes a bid: how many rounds he/she will win.
In each round, the player who plays first determines the suit, and every other player must play a card of the same suit.
If no card of the determined suit is available, any card can be played.
The winner of the round is the player who played the highest card of the determined suit.
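To make the trick rule concrete, here is a minimal Python sketch of trick resolution. The (rank, suit) card representation and the name trick_winner are our own assumptions for illustration, not the project's actual encoding.

```python
def trick_winner(plays):
    """plays: list of (player, (rank, suit)) tuples in play order.
    The first card played sets the suit; the highest card of that
    suit wins the trick, regardless of other suits played."""
    lead_suit = plays[0][1][1]
    same_suit = (p for p in plays if p[1][1] == lead_suit)
    return max(same_suit, key=lambda p: p[1][0])[0]

# Player 2's off-suit ace is ignored; player 3's 12 of hearts wins.
print(trick_winner([(1, (9, "H")), (2, (14, "S")), (3, (12, "H")), (4, (2, "H"))]))  # 3
```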

Rewards (players' score)
The reward of each player is as follows:
If the player's bid was incorrect, the reward is 0.
If the bid was correct (the player won exactly as many rounds as the bid), the reward is bid*bid + 5.
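The reward rule is small enough to state directly in code; this sketch follows the formula above, with the function name our own:

```python
def reward(bid: int, rounds_won: int) -> int:
    """0 for a missed bid; bid*bid + 5 for an exact bid."""
    return bid * bid + 5 if rounds_won == bid else 0

assert reward(3, 3) == 14  # correct bid of 3 pays 3*3 + 5
assert reward(3, 2) == 0   # missing the bid pays nothing
```

Note that the quadratic term favors ambitious bids: a correct bid of 0 pays only 5, while a correct bid of 3 pays 14.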


The idea behind the project
In this project we wanted to see whether, and how, learning can be used in an environment with more than 2 agents (4 in our case) and with incomplete information (the random effects of dealing the cards plus the unknown strategies of the other players).
In order to test our learning player, we created several types of players with fixed strategies and ran the game in several scenarios.


Modeling the problem
We modeled the game as a state-based learning problem.
Each situation is represented in each player's learning table according to that player's knowledge.

Problem with modeling the game
The state space is too large for effective learning with a reasonable amount of training: there are 134,596 initial states (C(24,6), the number of possible 6-card hands from the 24-card deck) before a single card is played, and from there the space grows exponentially.

Solution: Heuristics
In order to decrease the state-space size we used 2 heuristics (sketched in code below):
First, when considering the bid, the learning player only considers the sum of the values of the cards in the initial hand, which greatly decreases the number of initial states.
Second, when considering which card to play in each round, the player does not consider the full history of the game or which cards are still present in the current hand; it only considers how many rounds it has won already, how many it still needs to win, and whether it is possible to win the current round (by considering the current high card of the determined suit).
These heuristics hurt the learning process (by ignoring some of the available information) but, as we later found by running the various scenarios, still allow effective learning with a reasonable amount of training.
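Here is a minimal Python sketch of the two state abstractions. The card representation (six ranks in four suits, matching the 24-card deck implied by the 134,596 initial states) and all names are our assumptions; the slides do not give the actual encoding.

```python
def bid_state(hand):
    """Heuristic 1: the bidding state is just the sum of the card
    values in the initial hand, not the hand itself."""
    return sum(rank for rank, suit in hand)

def play_state(rounds_won, bid, can_win_trick):
    """Heuristic 2: during play, keep only how many rounds were won,
    how many are still needed, and whether the current round is
    winnable (judged against the current high card of the led suit)."""
    return (rounds_won, bid - rounds_won, can_win_trick)

# Many different hands collapse to the same bidding state:
print(bid_state([(9, "H"), (10, "H"), (11, "S"), (12, "C"), (13, "D"), (14, "D")]))  # 69
```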


Algorithm
We used Q-Learning as a basic reference, which we then modified to suit our problem.
Since the game's structure is known in advance and every game ends after a known, finite number of steps, there was no need for discounted rewards; we simply updated the tables when the game ended, once the rewards were known.
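A minimal sketch of such an end-of-episode table update, assuming a dict-based table and a fixed learning rate ALPHA (both our assumptions; the slides only say the tables are updated, without discounting, once the final reward is known):

```python
from collections import defaultdict

ALPHA = 0.1                    # learning rate (assumed)
q_table = defaultdict(float)   # (state, action) -> estimated reward

def update_episode(trajectory, final_reward):
    """trajectory: list of (state, action) pairs visited in one game.
    With a known finite horizon and rewards only at the end, every
    visited pair is nudged toward the game's final reward."""
    for state, action in trajectory:
        key = (state, action)
        q_table[key] += ALPHA * (final_reward - q_table[key])
```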


Results & Analysis
Scenario: p1: learning, p2: random, p3: aggressive, p4: defensive

Average reward by amount of training:

training:            5000   10000   20000   50000
player1 learning     1.96   2.17    2.2     2.38
player2 random       1.31   1.3     1.34    1.33
player3 aggressive   1.76   1.69    1.72    1.68
player4 defensive    1.58   1.54    1.58    1.56

[Chart "learning-random-aggressive-defensive": average reward (y) vs. amount of training (x), one line per player]

As expected, the learning player improves with the amount of training.
Also as expected, the random player has the worst results.
Interesting point: although the learning player's results improve with the amount of training, the results of the other (fixed-strategy) players remain more or less constant, which means the learner increases the total reward (the social gain) by improving its own reward.


Results & Analysis
Scenario: p1: learning, p2: aggressive, p3: aggressive, p4: aggressive

Average reward by amount of training:

training:            5000   10000   20000   50000
player1 learning     2.03   2.3     2.39    2.45
player2 aggressive   1.49   1.45    1.41    1.43
player3 aggressive   1.44   1.45    1.48    1.42
player4 aggressive   1.46   1.42    1.42    1.42

[Chart "learning-3aggressive": average reward (y) vs. amount of training (x), one line per player]

As expected, the learning player improves with the amount of training.
Interesting point 1: when playing against a single fixed strategy (all the other players behave the same), the learner gets better results; the learning is more effective.
Interesting point 2: when playing against a single fixed strategy, the rewards converge earlier than against a mix of strategies.
Interesting point 3: the aggressive players block each other, resulting in lower rewards for them compared to the previous scenario.


Results & Analysis
Scenario: p1: learning, p2: defensive, p3: defensive, p4: defensive

Average reward by amount of training:

training:            5000   10000   20000   50000
player1 learning     2.03   2.22    2.31    2.42
player2 defensive    1.7    1.68    1.7     1.68
player3 defensive    1.7    1.69    1.66    1.69
player4 defensive    1.69   1.67    1.68    1.68

[Chart "learning-3defensive": average reward (y) vs. amount of training (x), one line per player]

As expected, the learning player improves with the amount of training.
Interesting point: when playing against a different fixed strategy (defensive), the learner gets more or less the same results; the particular fixed strategy does not matter much.


Results & Analysis
Scenario: p1: learning, p2: random, p3: random, p4: random

Average reward by amount of training:

training:            5000   10000   20000   50000
player1 learning     2.14   2.2     2.26    2.32
player2 random       1.28   1.28    1.28    1.31
player3 random       1.34   1.29    1.3     1.31
player4 random       1.32   1.28    1.3     1.29

[Chart "learning-3random": average reward (y) vs. amount of training (x), one line per player]

As expected, the learning player improves with the amount of training.
Interesting point 1: when playing against random players, the learning process converges faster.
Interesting point 2: when playing against random players, the learning is much less effective; that is, the learner can exploit only a portion of the game's aspects.
