12.07.2015 Views

Generalized Sampling and Variance in Counterfactual Regret ...


Bluff($D_1$, $D_2$) is a dice-bidding game played over a number of rounds. Each player $i$ starts with $D_i$ six-sided dice. In each round, players roll their dice and look at the result without showing their opponent. Then, players alternate by bidding a quantity of a face value, q-f, of all dice in play until one player claims that the other is bluffing (i.e., claims that the bid does not hold). To place a new bid, a player must increase q or f of the current bid. A face of 6 is considered “wild” and counts as any other face value. The player calling bluff wins the round if the opponent’s last bid is incorrect, and loses otherwise. The losing player removes one of their dice from the game and a new round begins. Once a player has no more dice left, that player loses the game and receives a utility of −1, while the winning player earns +1 utility.

Finally, we consider heads-up (i.e., two-player) limit Texas hold’em poker that is played over four betting rounds. To begin, each player is dealt two private cards. In later rounds, public community cards are revealed with a fifth and final card appearing in the last round. During each betting round, players can either fold (forfeit the game), call (match the previous bet), or raise (increase the previous bet), with a maximum of four raises per round. If neither player folds, then the player with the highest ranked poker hand wins all of the bets.
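Returning to the Bluff rules above, the following minimal sketch illustrates how a call of "bluff" might be resolved. It assumes that a bid q-f holds when at least q of all dice in play show face f or the wild face 6, and that a bid on face 6 itself counts only sixes; the text does not spell out either detail. The function names `count_matching` and `bid_holds` are hypothetical.

```python
from typing import Sequence

WILD_FACE = 6  # per the rules above, a 6 counts as any other face value


def count_matching(dice: Sequence[int], face: int) -> int:
    """Count dice that support a bid on `face`, treating sixes as wild.

    Assumption: a bid on face 6 itself is supported only by sixes.
    """
    if face == WILD_FACE:
        return sum(1 for d in dice if d == WILD_FACE)
    return sum(1 for d in dice if d == face or d == WILD_FACE)


def bid_holds(quantity: int, face: int, all_dice: Sequence[int]) -> bool:
    """Return True if the bid quantity-face holds over all dice in play.

    Assumption: "holds" means at least `quantity` supporting dice.
    """
    return count_matching(all_dice, face) >= quantity


# Example: both players' dice pooled together for a Bluff(2,2) round.
dice_in_play = [2, 6, 5, 2]
caller_wins = not bid_holds(4, 2, dice_in_play)  # the caller wins if the bid fails
```

Under these assumptions, the player calling bluff wins the round exactly when `bid_holds` returns False for the opponent's last bid.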
Hold’em contains approximately $3 \times 10^{14}$ information sets, making the game intractable for any equilibrium computation technique. A common approach in poker is to apply a card abstraction that merges similar card dealings together into a single chance “bucket” (Gilpin and Sandholm 2006). We apply a ten-bucket abstraction that reduces the branching factor at each chance node down to ten, where dealings are grouped according to expected hand strength squared as described by Zinkevich et al. (2008). This abstract game contains roughly 57 million information sets.

We use domain knowledge and our intuition to select the sampling schemes $Q$. By our earlier assumption, we always sample a single action on-policy when $P(h) \neq i$, as is done in Algorithm 1. For the traversing player $i$, we focus on sampling actions leading to more “important” parts of the tree, while sampling other actions less frequently. Doing so updates the regret at the important information sets more frequently to quickly improve play at those locations. In Goofspiel, we always sample the lowest and highest bids, while sampling each of the remaining bids independently with probability 0.5.
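The Goofspiel scheme just described can be sketched as follows. This is only an illustration of the sampling rule for the traversing player, not the authors' implementation; the function name `sample_goofspiel_bids` and its parameters `keep_prob` and `rng` are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sample_goofspiel_bids(
    bids: Sequence[int],
    keep_prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick the subset of available bids to traverse at one decision point.

    The lowest and highest bids are always kept; every other bid is kept
    independently with probability `keep_prob` (0.5 in the text above).
    """
    if not bids:
        return []
    rng = rng or random.Random()
    lowest, highest = min(bids), max(bids)
    sampled = []
    for bid in sorted(bids):
        # Extremes are sampled with probability 1, the rest with keep_prob.
        if bid == lowest or bid == highest or rng.random() < keep_prob:
            sampled.append(bid)
    return sampled


# Example: a seven-card hand, as in Goofspiel(7).
chosen = sample_goofspiel_bids(range(1, 8), rng=random.Random(0))
```

Each call returns between two bids and the full hand, so the extreme bids are explored on every traversal while the middle bids are each explored about half of the time.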
Strong play can be achieved by only ever playing the highest bid (giving the best chance at winning the bid) or the lowest bid (sacrificing the current bid, leaving higher cards for winning future bids), suggesting that these actions will often be taken in equilibrium. In Bluff(2,2), we always sample “bluff” and the bids 1-5, 2-5, 1-6, 2-6, and for each face x that we roll, n-x for all 1 ≤ n ≤ 4. Bidding on the highest face is generally the best bluff since the opponent’s next bid must increase the quantity, and bidding on one’s own dice roll is more likely to be correct. Finally, in hold’em, we always sample fold and raise actions, while sampling call with probability 0.5. Folds are cheap to sample (since the game ends) and raise actions increase the number of bets and consequently the magnitude of the utilities.

[Figure 2: Empirical $\mathrm{Var}[\tilde{v}_i(\sigma^t, I)]$ and $\mathrm{Var}[\hat{v}_i(\sigma^t, I)]$ over iterations at the root of Goofspiel(6) in a test run of CFR. The plot shows variance against iterations, with one curve each for MCCFR and Algorithm 1.]

Firstly, we performed a test run of CFR in Goofspiel(6) that measured the empirical variance of the samples $\tilde{v}_i(\sigma, I)$ provided by MCCFR and of $\hat{v}_i(\sigma, I)$ provided by Algorithm 1. During each iteration $t$ of the test run, we performed 2000 traversals with no regret or strategy updates, where the first 1000 traversals computed $\tilde{v}_i(\sigma^t, I)$ and the second 1000 computed $\hat{v}_i(\sigma^t, I)$ at the root $I$ of the game.
Both $\tilde{v}_i(\sigma^t, I)$ and $\hat{v}_i(\sigma^t, I)$ were computed under the same sampling scheme $Q$ described above for Goofspiel. Once the empirical variance of each estimator was recorded from the samples at time $t$, a full vanilla CFR traversal was then performed to update the regrets and acquire the next strategy $\sigma^{t+1}$. The first 150 empirical variances are reported in Figure 2. Since the estimators are unbiased, the variance here is also equal to the mean squared error of the estimates. Over 1000 test iterations, the average variances were 0.295 for MCCFR and 0.133 for Algorithm 1. This agrees with our earlier intuition that probing reduces variance and provides some validation for our choice of estimator.

Next, we performed five runs for each of MCCFR and Algorithm 1, each under the same sampling schemes $Q$ described above. Similar to Algorithm 1, our MCCFR implementation also performs stochastic averaging during the opponent’s tree traversal. For each domain, the average of the results is provided in Figure 3. Our new algorithm converges faster than MCCFR in all three domains. In particular, at our final data points, Algorithm 1 shows a 31%, 10%, and 18% improvement over MCCFR in Goofspiel(7), Bluff(2,2), and Texas hold’em respectively. For both Goofspiel(7) and hold’em, the improvement was statistically significant.
In Goofspiel(7), for example, the level of exploitability reached by MCCFR’s last averaged data point is reached by Algorithm 1 in nearly half the time.

Conclusion

We have provided a new theoretical framework that generalizes MCCFR and provides new insights into how the estimated values affect the rate of convergence to an approximate Nash equilibrium. As opposed to the sampled counterfactual values $\tilde{v}_i(\sigma, I)$ explicitly defined by MCCFR, we considered any estimate $\hat{v}_i(\sigma, I)$ of the true counterfac-
