Learning Agents Playing Whist
Roy<br />
Netanel<br />
<strong>Learning</strong> <strong>Agents</strong><br />
<strong>Playing</strong> <strong>Whist</strong><br />
Reinforcement <strong>Learning</strong>
About the game<br />
<strong>Whist</strong> is a 4-player card game that involves many random<br />
elements.<br />
• A full description of the game can be found here:<br />
• http://www.pagat.com/exact/israeli_whist.html<br />
Our simplified version<br />
Each of the 4 players is dealt 6 cards.<br />
Each player then makes a bid: how many rounds he/she will win.<br />
In each round, the player who plays first determines the suit.<br />
Every other player must play a card of the same suit.<br />
If a player has no card of the determined suit, any card can be<br />
played.<br />
The winner of the round is the player who played the highest<br />
card of the determined suit.<br />
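The rule for winning a round can be sketched as follows. This is only an illustration: the `(suit, rank)` card representation and the function name `round_winner` are assumptions, not part of the original project.

```python
from typing import List, Tuple

Card = Tuple[str, int]  # (suit, rank); this representation is an assumption


def round_winner(plays: List[Tuple[int, Card]]) -> int:
    """Return the player who wins a round.

    `plays` lists (player, card) in playing order. The first card played
    determines the suit, and only cards of that suit can win.
    """
    lead_suit = plays[0][1][0]
    in_suit = [(card[1], player) for player, card in plays if card[0] == lead_suit]
    # The highest rank of the determined suit wins (ranks in a suit are unique).
    return max(in_suit)[1]


plays = [(1, ("hearts", 10)), (2, ("hearts", 13)),
         (3, ("spades", 14)), (4, ("hearts", 9))]
print(round_winner(plays))  # 2
```

Player 3 holds the highest rank overall but played off-suit, so the round goes to player 2.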
Rewards (players' scores)<br />
The reward of each player is as follows:<br />
If the player's bid was incorrect, the reward is 0.<br />
If the bid was correct (the player won exactly as many rounds as the<br />
bid), the reward is bid*bid+5.
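As a small worked example, the reward rule can be written as a function. This sketch reads bid*bid+5 with the usual operator precedence, that is (bid*bid)+5; the function name `reward` is hypothetical.

```python
def reward(bid: int, rounds_won: int) -> int:
    """Score for one game under the simplified rules described above."""
    if rounds_won != bid:
        return 0  # incorrect bid
    return bid * bid + 5  # read as (bid * bid) + 5


# Reward for every possible correct bid with a 6-card hand.
for bid in range(7):
    print(bid, reward(bid, bid))
```

Under this reading, a correct bid of 0 is worth 5 and a correct bid of 6 is worth 41.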
The idea behind the project<br />
In this project we wanted to see whether, and how, learning can be used<br />
in an environment with more than 2 agents (4 in our<br />
case) and with incomplete information (the random effects of dealing<br />
the cards plus the unknown strategies of the other players).<br />
In order to test our learning player, we created several types of<br />
players with fixed strategies and ran the game in several<br />
scenarios.
Modeling the problem<br />
We modeled the game as a state-based learning problem.<br />
Each situation is represented in each player's learning table<br />
according to the knowledge of that player.<br />
Problem with modeling the game<br />
The state space is too large for effective learning in a<br />
reasonable amount of training: there are 134596 initial states<br />
before a single card is played, and from there the number<br />
grows exponentially.<br />
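The figure of 134596 initial states matches the number of distinct 6-card hands that can be drawn from a 24-card deck. The slides do not state the deck size, so the 24 cards (4 players times 6 cards, with every card dealt) are an assumption in this quick check.

```python
import math

PLAYERS = 4
CARDS_PER_PLAYER = 6
DECK_SIZE = PLAYERS * CARDS_PER_PLAYER  # assumption: every card is dealt

# Number of distinct initial hands one player can receive: C(24, 6).
initial_hands = math.comb(DECK_SIZE, CARDS_PER_PLAYER)
print(initial_hands)  # 134596
```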
Solution – Heuristics<br />
In order to decrease the size of the state space we used 2 heuristics:<br />
First, when considering the bid, the learning player only<br />
considers the sum of the values of the cards in the initial hand,<br />
which greatly decreases the number of initial states.<br />
Second, when considering which card to play in each round, the<br />
player does not consider the full history of the game or which<br />
cards are still present in the current hand; it only<br />
considers how many rounds it has already won, how many it still needs<br />
to win, and whether it is possible to win the current round (by<br />
considering the current high card of the determined suit).<br />
These heuristics hurt the learning process (by ignoring some of<br />
the available information) but still allow effective learning in a<br />
reasonable amount of training, as we later found by running various<br />
scenarios.
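The two heuristics can be sketched as functions that map the full game situation to a much smaller state. This is an illustration under stated assumptions: the `(suit, rank)` card encoding with ranks 9 to 14, the function names, and the treatment of the first player in a round are all hypothetical and not taken from the original project.

```python
from typing import List, Optional, Tuple

Card = Tuple[str, int]  # (suit, rank); ranks 9..14 are an assumption


def bid_state(hand: List[Card]) -> int:
    """First heuristic: summarise the initial hand by the sum of its card values."""
    return sum(rank for _, rank in hand)


def can_win_round(hand: List[Card], lead_suit: Optional[str], high_rank: int) -> bool:
    """Whether some card in hand beats the current high card of the determined suit."""
    if lead_suit is None:  # assumption: the first player can always still win
        return True
    return any(suit == lead_suit and rank > high_rank for suit, rank in hand)


def play_state(rounds_won: int, bid: int, hand: List[Card],
               lead_suit: Optional[str], high_rank: int) -> Tuple[int, int, bool]:
    """Second heuristic: (rounds won, rounds still needed, can win this round)."""
    still_needed = bid - rounds_won  # negative if the bid is already exceeded
    return (rounds_won, still_needed, can_win_round(hand, lead_suit, high_rank))


hand = [("hearts", 14), ("hearts", 9), ("spades", 12),
        ("clubs", 10), ("clubs", 11), ("diamonds", 13)]
print(bid_state(hand))                        # 69
print(play_state(1, 2, hand, "hearts", 13))   # (1, 1, True)
print(play_state(1, 2, hand, "diamonds", 14)) # (1, 1, False)
```

With this encoding, the bidding table is indexed by a single integer and the playing table by a small triple, instead of by the full hand and history.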
Algorithm<br />
We used Q-<strong>Learning</strong> as a basic reference,<br />
which we then modified to suit our problem.<br />
Since the structure of the game is known in advance and every game ends after a<br />
known, finite number of steps, there was no need for<br />
decreasing (discounted) rewards; we simply updated the tables when the<br />
game ended (when the rewards are known).
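One possible reading of this modification is a table update applied once per game, with no discount factor, moving every visited (state, action) entry toward the final reward. The slides do not give the exact update rule, so the rule below (closer to a Monte Carlo update than to standard Q-learning), the learning rate `alpha`, and the state and action encodings are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def update_after_game(q: QTable, visited: List[Tuple[Hashable, Hashable]],
                      final_reward: float, alpha: float = 0.1) -> None:
    """Move every visited (state, action) entry toward the game's final reward.

    No discount factor is applied and nothing is updated during the game.
    """
    for key in visited:
        q[key] += alpha * (final_reward - q[key])


q: QTable = defaultdict(float)
# Hypothetical episode: one bidding decision followed by two card plays.
episode = [(("bid", 69), 2), ((0, 2, True), "play_high"), ((1, 1, False), "play_low")]
update_after_game(q, episode, final_reward=9.0, alpha=0.5)
print(q[(("bid", 69), 2)])  # 4.5
```

Because the reward is only known at the end of the game, all decisions made during that game receive the same target value.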
Results & Analysis<br />
Scenario: p1: learning, p2: random, p3: aggressive, p4: defensive<br />
Analysis<br />
Average reward by amount of training:
player   strategy    5000  10000  20000  50000
player1  learning    1.96  2.17   2.20   2.38
player2  random      1.31  1.30   1.34   1.33
player3  aggressive  1.76  1.69   1.72   1.68
player4  defensive   1.58  1.54   1.58   1.56
[Figure "learning-random-aggressive-defensive": average reward versus amount of training for the learning, random, aggressive and defensive players]
As expected, the learning player improves with the amount of training.<br />
Also as expected, the random player has the worst results.
Interesting point:
although the learning player's results improve with
the amount of training, the results of the other (fixed-strategy) players
remain more or less constant, which means that the learner increases
the total reward (social gain) by improving its own reward.
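The social-gain observation can be checked directly from the figures reported for this scenario. The short script below only sums the reported averages; the variable names are illustrative.

```python
training = [5000, 10000, 20000, 50000]
rewards = {
    "learning":   [1.96, 2.17, 2.20, 2.38],
    "random":     [1.31, 1.30, 1.34, 1.33],
    "aggressive": [1.76, 1.69, 1.72, 1.68],
    "defensive":  [1.58, 1.54, 1.58, 1.56],
}

# Total average reward of all four players at each amount of training.
totals = [round(sum(col), 2) for col in zip(*rewards.values())]
# Combined reward of the three fixed-strategy players.
others = [round(t - l, 2) for t, l in zip(totals, rewards["learning"])]

print(dict(zip(training, totals)))  # {5000: 6.61, 10000: 6.7, 20000: 6.84, 50000: 6.95}
print(dict(zip(training, others)))  # {5000: 4.65, 10000: 4.53, 20000: 4.64, 50000: 4.57}
```

The total rises from 6.61 to 6.95 while the fixed-strategy players' combined reward stays between 4.53 and 4.65.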
Results & Analysis<br />
Scenario: p1: learning, p2: aggressive, p3: aggressive, p4: aggressive<br />
Analysis<br />
Average reward by amount of training:
player   strategy    5000  10000  20000  50000
player1  learning    2.03  2.30   2.39   2.45
player2  aggressive  1.49  1.45   1.41   1.43
player3  aggressive  1.44  1.45   1.48   1.42
player4  aggressive  1.46  1.42   1.42   1.42
[Figure "learning-3aggressive": average reward versus amount of training for the learning player and the three aggressive players]
As expected, the learning player improves with the amount of training.<br />
Interesting point 1:
when playing against a single fixed strategy (all the other
players do the same thing), the learner gets better results: the
learning is more effective.
Interesting point 2:
when playing against a single fixed strategy, the rewards
converge earlier than against different strategies.
Interesting point 3:
the aggressive players block each other, resulting
in lower rewards for them compared to the previous scenario.
Results & Analysis<br />
Scenario: p1: learning, p2: defensive, p3: defensive, p4: defensive<br />
Analysis<br />
Average reward by amount of training:
player   strategy   5000  10000  20000  50000
player1  learning   2.03  2.22   2.31   2.42
player2  defensive  1.70  1.68   1.70   1.68
player3  defensive  1.70  1.69   1.66   1.69
player4  defensive  1.69  1.67   1.68   1.68
[Figure "learning-3defensive": average reward versus amount of training for the learning player and the three defensive players]
As expected, the learning player improves with the amount of training.<br />
Interesting point:
when playing against a different fixed
strategy (defensive), the learner gets more or less the same results:
the kind of fixed strategy doesn't matter.
Results & Analysis<br />
Scenario: p1: learning, p2: random, p3: random, p4: random<br />
Analysis<br />
Average reward by amount of training:
player   strategy  5000  10000  20000  50000
player1  learning  2.14  2.20   2.26   2.32
player2  random    1.28  1.28   1.28   1.31
player3  random    1.34  1.29   1.30   1.31
player4  random    1.32  1.28   1.30   1.29
[Figure "learning-3random": average reward versus amount of training for the learning player and the three random players]
As expected, the learning player improves with the amount of training.<br />
Interesting point 1:
when playing against random players, the learning
process converges faster.
Interesting point 2:
when playing against random players, the learning is
much less effective (that is, the learner can take advantage of
only a portion of the aspects of the game).
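To compare the four scenarios side by side, the learner's reported averages can be collected in one place. The script below uses only the figures from the result tables; the scenario labels and variable names are illustrative.

```python
# Learning player's average reward at 5000, 10000, 20000 and 50000 training.
learner = {
    "random/aggressive/defensive": [1.96, 2.17, 2.20, 2.38],
    "3 aggressive":                [2.03, 2.30, 2.39, 2.45],
    "3 defensive":                 [2.03, 2.22, 2.31, 2.42],
    "3 random":                    [2.14, 2.20, 2.26, 2.32],
}

# Gain in the learner's average reward between 5000 and 50000 training.
gains = {name: round(vals[-1] - vals[0], 2) for name, vals in learner.items()}
for name, gain in gains.items():
    print(f"{name}: final {learner[name][-1]:.2f}, gain {gain:.2f}")
```

Against three random players the learner has both the lowest final reward (2.32) and the smallest gain (0.18), which is consistent with the two points above.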