Learning Agents Playing Whist

Roy 

Netanel 

Reinforcement Learning

About the game

Whist is a four-player card game that involves many random elements.

A full description of the game can be found here:

http://www.pagat.com/exact/israeli_whist.html

Our simplified version 

Each player (1-4) is dealt 6 cards.

Each player then makes a bid: the number of rounds he/she expects to win.

In each round, the player who plays first determines the suit.

Each other player must play a card of the same suit.

If a player has no card of the determined suit, any card can be played.

The winner of the round is the player who played the highest card of the determined suit.
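The round rules above can be sketched in a few lines of Python. This is only an illustration: the `(suit, rank)` card encoding and the function names `legal_cards` and `round_winner` are assumptions, not taken from the project.

```python
from typing import List, Tuple

# Assumed encoding: a card is (suit, rank), with a higher rank beating a lower one.
Card = Tuple[str, int]

def legal_cards(hand: List[Card], led_suit: str) -> List[Card]:
    """Cards a following player may play: follow suit if possible, else any card."""
    same_suit = [card for card in hand if card[0] == led_suit]
    return same_suit if same_suit else list(hand)

def round_winner(plays: List[Tuple[int, Card]]) -> int:
    """plays holds (player, card) pairs in play order; the first card sets the suit."""
    led_suit = plays[0][1][0]
    # Only cards of the determined suit can win the round.
    in_suit = [(card[1], player) for player, card in plays if card[0] == led_suit]
    return max(in_suit)[1]
```

For example, if player 1 leads a heart and player 3 plays a higher-ranked spade, the spade cannot win because it is not of the determined suit.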

Rewards (players' score)

The reward of each player is as follows:

If the player's bid was incorrect, the reward is 0.

If the bid was correct (the player won the same number of rounds as the bid), the reward is bid*bid+5.
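As a small worked example, the reward rule can be written as a function. The sketch reads bid*bid+5 with the usual operator precedence, that is (bid*bid)+5; the function name `reward` is an assumption.

```python
def reward(bid: int, rounds_won: int) -> int:
    """Reward for one game: bid*bid + 5 for an exact bid, 0 otherwise."""
    if rounds_won == bid:
        return bid * bid + 5
    return 0
```

Under this reading, a correct bid of 0 earns 5, a correct bid of 2 earns 9, and any incorrect bid earns 0.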

The idea behind the project 

In this project we wanted to see whether, and how, learning can be used in an environment with more than 2 agents (4 in our case) and incomplete information (the random effects of dealing the cards plus the unknown strategies of the other players).

In order to test our learning player, we created several types of players with fixed strategies and ran the game in several scenarios.

Modeling the problem 

We modeled the game as a state-based learning problem.

Each situation is represented in each player's learning table according to the knowledge available to that player.

Problem with modeling the game 

The state space is too large for effective learning in a reasonable amount of training: there are 134596 initial states before a single card has been played, and from there it grows exponentially.
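The figure 134596 equals the number of distinct 6-card hands that can be drawn from a 24-card deck. The deck size of 24 (4 players x 6 cards) is inferred here and is not stated on the slides. A quick check:

```python
import math

DECK_SIZE = 24   # assumption: 4 players x 6 cards each
HAND_SIZE = 6

# Number of distinct initial hands one player can be dealt.
initial_hands = math.comb(DECK_SIZE, HAND_SIZE)
print(initial_hands)
```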

Solution – Heuristics 

In order to decrease the state space size we used 2 heuristics:

First, when considering the bid, the learning player only considers the sum of the values of the cards in the initial hand, which greatly decreases the number of initial states.

Second, when considering which card to play in each round, the player does not consider the full history of the game or which cards are still present in the current hand. It only considers how many rounds it has already won, how many it still needs to win, and whether it is possible to win the current round (by considering the current high card of the determined suit).

These heuristics hurt the learning process (by ignoring some of the available information) but still allow effective learning in a reasonable amount of training, as we later found out by running various scenarios.
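The two heuristics amount to mapping the full game situation onto a much smaller state key. The sketch below shows one possible encoding. The `(suit, rank)` card representation, the function names, and the handling of the leading player are assumptions made for illustration, not the project's actual code.

```python
from typing import List, Optional, Tuple

Card = Tuple[str, int]  # assumed encoding: (suit, rank)

def bid_state(hand: List[Card]) -> int:
    """Heuristic 1: the bidding state is just the sum of the card values."""
    return sum(rank for _, rank in hand)

def play_state(rounds_won: int, bid: int, hand: List[Card],
               led_suit: Optional[str], current_high: int) -> Tuple[int, int, bool]:
    """Heuristic 2: (rounds won, rounds still needed, can the current round be won)."""
    still_needed = bid - rounds_won  # negative means the bid was already exceeded
    if led_suit is None:
        # Assumption: the player who leads the round can always still win it.
        can_win = True
    else:
        can_win = any(suit == led_suit and rank > current_high
                      for suit, rank in hand)
    return (rounds_won, still_needed, can_win)
```

With such keys, many different hands and game histories fall into the same table entry, which is what makes training feasible at the cost of discarding some information.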

Algorithm 

We used Q-Learning as a basic reference, which we then modified to suit our problem.

Since the structure of the game is known in advance and each game ends after a known, finite number of steps, there was no need for decreasing (discounted) rewards. We simply updated the tables when the game ended, at which point the rewards are known.
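The slides do not give the exact update rule, so the following is only one plausible reading of "update the tables when the game ends": every (state, action) pair visited during the game is moved toward the final, undiscounted reward. The learning rate `ALPHA`, the table layout, and the function name are assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, List, Tuple

ALPHA = 0.1  # hypothetical learning rate, not taken from the slides

StateAction = Tuple[Hashable, Hashable]

def update_after_game(q: DefaultDict[StateAction, float],
                      episode: List[StateAction],
                      final_reward: float,
                      alpha: float = ALPHA) -> None:
    """Move each visited (state, action) value toward the end-of-game reward."""
    for key in episode:
        # No discount factor: the game length is known and finite.
        q[key] += alpha * (final_reward - q[key])

# Usage: one table per learning player, filled in as games are played.
q_table: DefaultDict[StateAction, float] = defaultdict(float)
update_after_game(q_table, [("s1", "a"), ("s2", "b")], final_reward=9.0, alpha=0.5)
```

Unlike standard Q-Learning, this sketch does not bootstrap from the value of the next state; it relies only on the reward observed at the end of the game.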

Results & Analysis 

Scenario: p1: learning, p2: random, p3: aggressive, p4: defensive

Analysis 

training:             5000    10000   20000   50000
player1 learning      1.96    2.17    2.2     2.38
player2 random        1.31    1.3     1.34    1.33
player3 aggressive    1.76    1.69    1.72    1.68
player4 defensive     1.58    1.54    1.58    1.56

[Figure: learning-random-aggressive-defensive. Average reward vs. amount of training for the learning, random, aggressive and defensive players.]

As expected, the learning player improves with the amount of training.

Also as expected, the random player has the worst results.

Interesting point: although the learning player's results improve with the amount of training, the results of the other (fixed-strategy) players remain more or less constant. This means the learner increases the total reward (social gain) by improving its own reward rather than at the other players' expense.
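The social-gain observation can be checked directly from the table for this scenario by summing the four average rewards at each training level. The variable names below are illustrative; the numbers are copied from the table.

```python
training = [5000, 10000, 20000, 50000]

# Average rewards from the table for this scenario, one list per player.
avg_reward = {
    "learning":   [1.96, 2.17, 2.2, 2.38],
    "random":     [1.31, 1.3, 1.34, 1.33],
    "aggressive": [1.76, 1.69, 1.72, 1.68],
    "defensive":  [1.58, 1.54, 1.58, 1.56],
}

# Total reward over all four players at each training level.
totals = [round(sum(column), 2) for column in zip(*avg_reward.values())]

for games, total in zip(training, totals):
    print(games, total)
```

The total rises from 6.61 at 5000 training games to 6.95 at 50000, while the sum for the three fixed-strategy players stays close to 4.6.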


Scenario: p1: learning, p2: aggressive, p3: aggressive, p4: aggressive

Analysis 

training: 5000 10000 20000 50000 






[Figure: learning-3aggressive. Legend: learning, aggressive, aggressive, aggressive.]


Interesting point 1: when playing against a single fixed strategy (all the other players behave the same way), the learner gets better results; the learning is more effective.

Interesting point 2: when playing against a single fixed strategy, the rewards converge earlier than when playing against different strategies.

Interesting point 3: the aggressive players block each other, resulting in lower rewards for them compared to the previous scenario.


Scenario: p1: learning, p2: defensive, p3: defensive, p4: defensive

Analysis 

training: 5000 10000 20000 50000 






[Figure: learning-3defensive. Legend: learning, defensive, defensive, defensive.]


Interesting point: when playing against a different fixed strategy (defensive), the learner gets more or less the same results; the kind of fixed strategy doesn't matter.


Scenario: p1: learning, p2: random, p3: random, p4: random

Analysis 

training:          5000    10000   20000   50000
player2 random     1.28    1.28    1.28    1.31
player3 random     1.34    1.29    1.3     1.31
player4 random     1.32    1.28    1.3     1.29


[Figure: learning-3random. Legend: learning, random, random, random.]



Interesting point 1: when playing against random players, the learning process converges faster.

Interesting point 2: when playing against random players, the learning is much less effective (that is, the learner can take advantage of only a portion of the game's aspects).
