Generalized Sampling and Variance in Counterfactual Regret ...


Bluff($D_1$, $D_2$) is a dice-bidding game played over a number of rounds. Each player $i$ starts with $D_i$ six-sided dice. In each round, players roll their dice and look at the result without showing their opponent. Then, players alternate by bidding a quantity of a face value, q-f, of all dice in play until one player claims that the other is bluffing (i.e., claims that the bid does not hold). To place a new bid, a player must increase q or f of the current bid. A face of 6 is considered “wild” and counts as any other face value. The player calling bluff wins the round if the opponent’s last bid is incorrect, and loses otherwise. The losing player removes one of their dice from the game and a new round begins. Once a player has no more dice left, that player loses the game and receives a utility of −1, while the winning player earns +1 utility.

Finally, we consider heads-up (i.e., two-player) limit Texas hold’em poker that is played over four betting rounds. To begin, each player is dealt two private cards. In later rounds, public community cards are revealed with a fifth and final card appearing in the last round. During each betting round, players can either fold (forfeit the game), call (match the previous bet), or raise (increase the previous bet), with a maximum of four raises per round. If neither player folds, then the player with the highest ranked poker hand wins all of the bets. Hold’em contains approximately $3 \times 10^{14}$ information sets, making the game intractable for any equilibrium computation technique. A common approach in poker is to apply a card abstraction that merges similar card dealings together into a single chance “bucket” (Gilpin and Sandholm 2006). We apply a ten-bucket abstraction that reduces the branching factor at each chance node down to ten, where dealings are grouped according to expected hand strength squared as described by Zinkevich et al. (2008).
This abstract game contains roughly 57 million information sets.

We use domain knowledge and our intuition to select the sampling schemes $Q$. By our earlier assumption, we always sample a single action on-policy when $P(h) \neq i$, as is done in Algorithm 1. For the traversing player $i$, we focus on sampling actions leading to more “important” parts of the tree, while sampling other actions less frequently. Doing so updates the regret at the important information sets more frequently to quickly improve play at those locations. In Goofspiel, we always sample the lowest and highest bids, while sampling each of the remaining bids independently with probability 0.5. Strong play can be achieved by only ever playing the highest bid (giving the best chance at winning the bid) or the lowest bid (sacrificing the current bid, leaving higher cards for winning future bids), suggesting that these actions will often be taken in equilibrium. In Bluff(2,2), we always sample “bluff” and the bids 1-5, 2-5, 1-6, 2-6, and for each face $x$ that we roll, n-x for all $1 \le n \le 4$. Bidding on the highest face is generally the best bluff since the opponent’s next bid must increase the quantity, and bidding on one’s own dice roll is more likely to be correct. Finally, in hold’em, we always sample fold and raise actions, while sampling call with probability 0.5. Folds are cheap to sample (since the game ends) and raise actions increase the number of bets and consequently the magnitude of the utilities.

Firstly, we performed a test run of CFR in Goofspiel(6) that measured the empirical variance of the samples $\tilde{v}_i(\sigma, I)$ provided by MCCFR and of $\hat{v}_i(\sigma, I)$ provided by Algorithm 1.

Figure 2: Empirical $\mathrm{Var}[\tilde{v}_i(\sigma^t, I)]$ and $\mathrm{Var}[\hat{v}_i(\sigma^t, I)]$ over iterations at the root of Goofspiel(6) in a test run of CFR. (Plot of Variance, 0 to 0.9, against Iterations, 0 to 140, with one curve each for MCCFR and Algorithm 1.)
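The Goofspiel sampling scheme described above (always sample the lowest and highest bids, and each remaining bid independently with probability 0.5) can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names, the `p_middle` parameter, and the importance-weighted sum are assumptions introduced here. The weighted sum only shows the generic idea that dividing by an action's inclusion probability corrects for non-uniform sampling; it does not include the probing step of Algorithm 1.

```python
import random


def sample_goofspiel_bids(bids, rng, p_middle=0.5):
    """Return the bids to traverse: lowest and highest always, others w.p. p_middle."""
    ordered = sorted(bids)
    if len(ordered) <= 2:
        # With one or two bids, every bid is a lowest or highest bid.
        return ordered
    middle = [b for b in ordered[1:-1] if rng.random() < p_middle]
    return [ordered[0]] + middle + [ordered[-1]]


def sampling_probability(bid, bids, p_middle=0.5):
    """Probability that `bid` is included by sample_goofspiel_bids."""
    return 1.0 if bid in (min(bids), max(bids)) else p_middle


def weighted_total(values, rng, p_middle=0.5):
    """Importance-weighted estimate of sum(values.values()) from the sampled bids.

    `values` maps each bid to a hypothetical per-action value (an assumption
    made for illustration). Each sampled term is divided by its inclusion
    probability, so the estimate is unbiased for the full sum.
    """
    bids = list(values)
    sampled = sample_goofspiel_bids(bids, rng, p_middle)
    return sum(values[b] / sampling_probability(b, bids, p_middle) for b in sampled)
```

Averaging `weighted_total` over many calls approaches the full sum over all bids, even though the middle bids are visited only about half of the time.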
During each iteration $t$ of the test run, we performed 2000 traversals with no regret or strategy updates, where the first 1000 traversals computed $\tilde{v}_i(\sigma^t, I)$ and the second 1000 computed $\hat{v}_i(\sigma^t, I)$ at the root $I$ of the game. Both $\tilde{v}_i(\sigma^t, I)$ and $\hat{v}_i(\sigma^t, I)$ were computed under the same sampling scheme $Q$ described above for Goofspiel. Once the empirical variance of each estimator was recorded from the samples at time $t$, a full vanilla CFR traversal was then performed to update the regrets and acquire the next strategy $\sigma^{t+1}$. The first 150 empirical variances are reported in Figure 2. Since the estimators are unbiased, the variance here is also equal to the mean squared error of the estimates. Over 1000 test iterations, the average variances were 0.295 for MCCFR and 0.133 for Algorithm 1. This agrees with our earlier intuition that probing reduces variance and provides some validation for our choice of estimator.

Next, we performed five runs for each of MCCFR and Algorithm 1, each under the same sampling schemes $Q$ described above. Similar to Algorithm 1, our MCCFR implementation also performs stochastic averaging during the opponent’s tree traversal. For each domain, the average of the results is provided in Figure 3. Our new algorithm converges faster than MCCFR in all three domains. In particular, at our final data points, Algorithm 1 shows a 31%, 10%, and 18% improvement over MCCFR in Goofspiel(7), Bluff(2,2), and Texas hold’em respectively. For both Goofspiel(7) and hold’em, the improvement was statistically significant. In Goofspiel(7), for example, the level of exploitability reached by MCCFR’s last averaged data point is reached by Algorithm 1 in nearly half the time.

Conclusion

We have provided a new theoretical framework that generalizes MCCFR and provides new insights into how the estimated values affect the rate of convergence to an approximate Nash equilibrium.
As opposed to the sampled counterfactual values $\tilde{v}_i(\sigma, I)$ explicitly defined by MCCFR, we considered any estimate $\hat{v}_i(\sigma, I)$ of the true counterfactual values. We showed that the average regret is minimized (probabilistically) when the estimates are bounded and unbiased. In addition, we derived an upper bound on the average regret in terms of the variance of the estimates, suggesting that estimators with lower variance will converge to an $\epsilon$-Nash equilibrium in fewer iterations. Finally, we provided an example of a non-MCCFR algorithm that reduces variance with little computational overhead by probing non-sampled actions. Our new algorithm approached equilibrium faster than its MCCFR counterpart in all of the reported experiments. We suspect that there are other efficiently-computable definitions of $\hat{v}_i(\sigma, I)$ that are bounded, unbiased, and exhibit lower variance than our probing example. Future work will attempt to further improve convergence rates through such alternative definitions.

Acknowledgements

We would like to thank the members of the Computer Poker Research Group at the University of Alberta for their helpful suggestions throughout this project. This work was supported by NSERC, Alberta Innovates – Technology Futures, and the use of computing resources provided by WestGrid and Compute Canada.

References

Gibson, R.; Lanctot, M.; Burch, N.; Szafron, D.; and Bowling, M. 2012. Generalized sampling and variance in counterfactual regret minimization. Technical Report TR12-02, University of Alberta.

Gilpin, A., and Sandholm, T. 2006. A competitive Texas Hold’em poker player via automated abstraction and real-time equilibrium computation. In Twenty-First Conference on Artificial Intelligence (AAAI), 1007–1013.

Hart, S., and Mas-Colell, A. 2000. A simple adaptive procedure leading to correlated equilibrium. Econometrica 68:1127–1150.

Hoda, S.; Gilpin, A.; Peña, J.; and Sandholm, T. 2010. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research 35(2):494–512.

Koller, D.; Megiddo, N.; and von Stengel, B. 1994. Fast algorithms for finding randomized strategies in game trees. In Annual ACM Symposium on Theory of Computing (STOC’94), 750–759.

Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009a. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22 (NIPS), 1078–1086.

Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009b. Monte Carlo sampling for regret minimization in extensive games. Technical Report TR09-15, University of Alberta.

Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2008. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20 (NIPS), 905–912.

Figure 3: Exploitability over time of strategies computed by MCCFR and by Algorithm 1 using identical sampling schemes $Q$, averaged over five runs. Error bars indicate 95% confidence intervals at each of the five averaged data points. In hold’em, exploitability is measured in terms of milli-big-blinds per game (mbb/g). (Three panels, each plotting exploitability against time in hours for MCCFR and Algorithm 1: (a) Goofspiel(7), (b) Bluff(2,2), (c) Texas hold’em, where the vertical axis is abstract game exploitability in mbb/g.)
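As a supplementary illustration of the Bluff rules described earlier (bids of the form q-f, sixes counting as wild, and the outcome of calling bluff), the following Python sketch checks whether a bid holds and who loses a round. It rests on two assumptions not spelled out in the text: a bid q-f is taken to hold when at least q dice in play show face f or a 6, and a new bid is taken to be legal under the literal reading that it increases q or f. All function names are hypothetical.

```python
WILD = 6  # a face of 6 counts as any other face value


def count_face(dice, face):
    """Number of dice counting toward `face`, treating sixes as wild."""
    return sum(1 for d in dice if d == face or d == WILD)


def bid_holds(quantity, face, dice_p1, dice_p2):
    """Assumed reading: the bid holds if at least `quantity` dice in play match."""
    all_dice = list(dice_p1) + list(dice_p2)
    return count_face(all_dice, face) >= quantity


def is_valid_raise(prev_bid, new_bid):
    """Literal reading of the rule: the new bid must increase q or f."""
    (prev_q, prev_f), (new_q, new_f) = prev_bid, new_bid
    return new_q > prev_q or new_f > prev_f


def round_loser(caller, bidder, last_bid, dice_p1, dice_p2):
    """The caller loses if the last bid holds; otherwise the bidder loses."""
    quantity, face = last_bid
    return caller if bid_holds(quantity, face, dice_p1, dice_p2) else bidder
```

For example, with dice (5, 1) and (6, 2) in play, the bid 2-5 holds because the 6 counts as a 5, while the bid 3-5 does not, so a player who calls bluff on 3-5 wins the round.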