Generalized Sampling and Variance in Counterfactual Regret Minimization

Richard Gibson and Marc Lanctot and Neil Burch and Duane Szafron and Michael Bowling
Department of Computing Science, University of Alberta
Edmonton, Alberta, T6G 2E8, Canada
{rggibson | lanctot | nburch | dszafron | mbowling}@ualberta.ca

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In large extensive form games with imperfect information, Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing approximate Nash equilibria. While the base algorithm performs a full tree traversal on each iteration, Monte Carlo CFR (MCCFR) reduces the per iteration time cost by traversing just a sampled portion of the tree. On the other hand, MCCFR's sampled values introduce variance, and the effects of this variance were previously unknown. In this paper, we generalize MCCFR by considering any generic estimator of the sought values. We show that any choice of an estimator can be used to probabilistically minimize regret, provided the estimator is bounded and unbiased. In addition, we relate the variance of the estimator to the convergence rate of an algorithm that calculates regret directly from the estimator. We demonstrate the application of our analysis by defining a new bounded, unbiased estimator with empirically lower variance than MCCFR estimates. Finally, we use this estimator in a new sampling algorithm to compute approximate equilibria in Goofspiel, Bluff, and Texas hold'em poker. Under each of our selected sampling schemes, our new algorithm converges faster than MCCFR.

Introduction

An extensive form game is a common formalism used to model sequential decision making problems. Extensive games provide a versatile framework capable of representing multiple agents, imperfect information, and stochastic events. Counterfactual Regret Minimization (CFR) (Zinkevich et al. 2008) is an algorithm capable of finding effective strategies in a variety of games. In 2-player zero-sum games with perfect recall, CFR converges to an approximate Nash equilibrium profile. Other techniques for computing Nash equilibria include linear programming (Koller, Megiddo, and von Stengel 1994) and the Excessive Gap Technique (Hoda et al. 2010).

CFR is an iterative algorithm that updates every player's strategy through a full game tree traversal on each iteration. Theoretical results indicate that for a fixed solution quality, the procedure takes a number of iterations at most quadratic in the size of the game (Zinkevich et al. 2008, Theorem 4).
As we consider larger games, however, traversals become more time consuming, and thus more time is required to converge to an equilibrium.

Monte Carlo CFR (MCCFR) (Lanctot et al. 2009a) can be used to reduce the traversal time per iteration by considering only a sampled portion of the game tree at each step. Compared to CFR, MCCFR can update strategies faster and lead to less overall computation time. However, the strategy updates are noisy because any action that is not sampled is assumed to provide zero counterfactual value to the strategy. When a non-sampled action provides large value, MCCFR introduces a lot of variance. Previous work does not discuss how this variance affects the convergence rate.

Our main contributions in this paper result from a more general analysis of the effects of using sampled values in CFR updates. We show that any bounded, unbiased estimates of the true counterfactual values can be used to minimize regret, whether the estimates are derived from MCCFR or not. Furthermore, we prove a new upper bound on the average regret in terms of the variance of the estimates, suggesting that estimates with lower variance are preferred. In addition to these main results, we introduce a new CFR sampling algorithm that lives outside of the MCCFR family of algorithms. By "probing" the value of non-sampled actions, our new algorithm demonstrates one way of reducing the variance in the updates to provide faster convergence to equilibrium. This is shown in three domains: Goofspiel, Bluff, and Texas hold'em poker.

Background

A finite extensive game contains a game tree with nodes corresponding to histories of actions h ∈ H and edges corresponding to actions a ∈ A(h) available to player P(h) ∈ N ∪ {c} (where N is the set of players and c denotes chance). When P(h) = c, σ_c(h, a) is the (fixed) probability of chance generating action a at h. We call h a prefix of history h′, written h ⊑ h′, if h′ begins with the sequence h. Each terminal history z ∈ Z has associated utilities u_i(z) for each player i. In imperfect information games, non-terminal histories are partitioned into information sets I ∈ I_i representing the different game states that player i cannot distinguish between. For example, in poker, player i does not see the private cards dealt to the opponents, and thus all histories differing only in the private cards of the opponents are in the same information set for player i.
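To make the information-set partition concrete, the following minimal Python sketch (ours, not from the paper) groups poker-like histories by what player i actually observes; the history encoding and the "hidden" marker are assumptions made purely for illustration.

    def information_set_key(history, i):
        """Map a history to player i's information set by hiding opponents' private cards.

        `history` is a tuple of (actor, action) pairs, where private deals are encoded as
        ("chance", ("deal", player, card)).  This encoding is an assumption of this sketch.
        """
        key = []
        for actor, action in history:
            if actor == "chance" and action[0] == "deal" and action[1] != i:
                key.append(("deal", action[1], "hidden"))   # opponent's private card is not observed
            else:
                key.append((actor, action))
        return tuple(key)

    # Two histories differing only in the opponent's private card map to the same key:
    h1 = (("chance", ("deal", 1, "As")), ("chance", ("deal", 2, "Kd")), (1, "raise"))
    h2 = (("chance", ("deal", 1, "As")), ("chance", ("deal", 2, "7c")), (1, "raise"))
    assert information_set_key(h1, 1) == information_set_key(h2, 1)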


The action sets A(h) must be identical for all h ∈ I, and we denote this set by A(I). We assume perfect recall, which guarantees that players always remember information that was revealed to them and the order in which it was revealed.

A strategy for player i, σ_i ∈ Σ_i, is a function that maps each information set I ∈ I_i to a probability distribution over A(I). A strategy profile is a vector of strategies σ = (σ_1, ..., σ_|N|) ∈ Σ, one for each player. Define u_i(σ) to be the expected utility for player i, given that all players play according to σ. We let σ_{-i} refer to the strategies in σ excluding σ_i.

Let π^σ(h) be the probability of history h occurring if all players choose actions according to σ. We can decompose

    \pi^{\sigma}(h) = \prod_{i \in N \cup \{c\}} \pi_i^{\sigma}(h)

into each player's and chance's contribution to this probability. Here, π_i^σ(h) is the contribution to this probability from player i when playing according to σ_i. Let π_{-i}^σ(h) be the product of all players' contributions (including chance) except that of player i. Furthermore, let π^σ(h, h′) be the probability of history h′ occurring after h, given h has occurred. Let π_i^σ(h, h′) and π_{-i}^σ(h, h′) be defined similarly.

A best response to σ_{-i} is a strategy that maximizes player i's expected payoff against σ_{-i}. The best response value for player i is the value of that strategy, b_i(σ_{-i}) = max_{σ_i′ ∈ Σ_i} u_i(σ_i′, σ_{-i}). A strategy profile σ is an ε-Nash equilibrium if no player can unilaterally deviate from σ and gain more than ε; i.e., u_i(σ) + ε ≥ b_i(σ_{-i}) for all i ∈ N. In this paper, we will focus on two-player zero-sum games: N = {1, 2} and u_1(z) = −u_2(z) for all z ∈ Z. In this case, the exploitability of σ, e(σ) = (b_1(σ_2) + b_2(σ_1))/2, measures how much σ loses to a worst case opponent when players alternate positions. A 0-Nash equilibrium (or simply a Nash equilibrium) has zero exploitability.

Counterfactual Regret Minimization (CFR) is an iterative procedure that, for two-player zero-sum games, obtains an ε-Nash equilibrium in O(|H||I_i|/ε²) time (Zinkevich et al. 2008, Theorem 4). On each iteration t, CFR (or "vanilla CFR") recursively traverses the entire game tree, calculating the expected utility for player i at each information set I ∈ I_i under the current profile σ^t, assuming player i plays to reach I. This expectation is the counterfactual value for player i,

    v_i(\sigma, I) = \sum_{z \in Z_I} u_i(z)\, \pi_{-i}^{\sigma}(z[I])\, \pi^{\sigma}(z[I], z),

where Z_I is the set of terminal histories passing through I and z[I] is the prefix of z contained in I. For each action a ∈ A(I), these values determine the counterfactual regret at iteration t, r_i^t(I, a) = v_i(σ^t_{(I→a)}, I) − v_i(σ^t, I), where σ_{(I→a)} is the profile σ except that at I, action a is always taken. This process is shown visually in Figure 1a. The regret r_i^t(I, a) measures how much player i would rather play action a at I than play σ^t.
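As a concrete illustration of these two definitions, here is a minimal Python sketch (our own, not from the paper) that computes v_i(σ, I) from an enumeration of Z_I and the counterfactual regrets from the resulting action values; the flat input format is an assumption made purely for illustration.

    def counterfactual_value(terminals):
        """v_i(sigma, I): sum over z in Z_I of u_i(z) * pi_-i(z[I]) * pi(z[I], z).

        `terminals` is a list of (u_i, reach_minus_i, reach_tail) triples, one per z in Z_I.
        """
        return sum(u * reach_mi * reach_tail for u, reach_mi, reach_tail in terminals)

    def counterfactual_regrets(action_values, sigma_I):
        """r_i^t(I, a) = v_i(sigma_(I->a), I) - v_i(sigma, I) for each a in A(I).

        `action_values[a]` is v_i(sigma_(I->a), I); `sigma_I[a]` is sigma^t(I, a).
        """
        v_I = sum(sigma_I[a] * action_values[a] for a in action_values)   # v_i(sigma, I)
        return {a: action_values[a] - v_I for a in action_values}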
Figure 1: (a) The computed values at information set I during vanilla CFR. First, for each action, the counterfactual values are recursively computed. The counterfactual regrets are then computed before returning the counterfactual value at I to the parent. (b) The computed values at I during outcome sampling. Here, only action a_1 is sampled and its sampled counterfactual value is recursively computed. The remaining two actions are effectively assigned zero sampled counterfactual value. The sampled counterfactual regrets are then computed before returning the sampled counterfactual value at I to the parent. (c) An example of computed values at I during our new sampling algorithm. In this example, again only a_1 is sampled and its estimated counterfactual value is recursively computed. The remaining two actions are "probed" to improve both the estimated counterfactual regrets and the returned estimated counterfactual value at I.

The counterfactual regrets

    R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)

are accumulated and σ^t is updated by applying regret matching (Hart and Mas-Colell 2000; Zinkevich et al. 2008) to the accumulated regrets,

    \sigma^{T+1}(I, a) = \frac{R_i^{T,+}(I, a)}{\sum_{b \in A(I)} R_i^{T,+}(I, b)},        (1)

where x^+ = max{x, 0} and actions are chosen uniformly at random when the denominator is zero. This procedure minimizes the average of the counterfactual regrets, which in turn minimizes the average (external) regret R_i^T / T (Zinkevich et al. 2008, Theorem 3), where

    R_i^T = \max_{\sigma' \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma', \sigma_{-i}^t) - u_i(\sigma_i^t, \sigma_{-i}^t) \right).

It is well known that in a two-player zero-sum game, if R_i^T / T < ε for i ∈ {1, 2}, then the average profile σ̄^T is a 2ε-Nash equilibrium.
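For concreteness, the update in equation (1) and the average profile can be written in a few lines of Python; this is a minimal sketch of the standard regret-matching rule, not the authors' implementation.

    def regret_matching(cum_regret):
        """Equation (1): play in proportion to positive cumulative counterfactual regret."""
        positive = {a: max(r, 0.0) for a, r in cum_regret.items()}
        norm = sum(positive.values())
        if norm > 0.0:
            return {a: r / norm for a, r in positive.items()}
        return {a: 1.0 / len(cum_regret) for a in cum_regret}   # uniform when all regrets <= 0

    def average_strategy(cum_profile):
        """The average profile at I, sigma_bar(I, a) = s_I[a] / sum_b s_I[b] (defined later for Algorithm 1)."""
        norm = sum(cum_profile.values())
        return {a: s / norm for a, s in cum_profile.items()}

    # Example: sigma^{T+1}(I, .) from accumulated regrets at one information set
    print(regret_matching({"fold": -1.2, "call": 0.9, "raise": 2.7}))   # {'fold': 0.0, 'call': 0.25, 'raise': 0.75}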


For large games, CFR's full game tree traversal can be very expensive. Alternatively, one can still obtain an approximate equilibrium by traversing a smaller, sampled portion of the tree on each iteration using Monte Carlo CFR (MCCFR) (Lanctot et al. 2009a). Let Q be a set of subsets, or blocks, of the terminal histories Z such that the union of Q spans Z. On each iteration, a block Q ∈ Q is sampled according to a probability distribution over Q. Outcome sampling is an example of MCCFR that uses blocks containing a single terminal history (Q = {z}). On each iteration of outcome sampling, the block is chosen during traversal by sampling a single action at the current decision point until a terminal history is reached. The sampled counterfactual value for player i,

    \tilde{v}_i(\sigma, I) = \sum_{z \in Z_I \cap Q} u_i(z)\, \pi_{-i}^{\sigma}(z[I])\, \pi^{\sigma}(z[I], z) / q(z),

where q(z) is the probability that z was sampled, defines the sampled counterfactual regret on iteration t for action a at I, r̃_i^t(I, a) = ṽ_i(σ^t_{(I→a)}, I) − ṽ_i(σ^t, I). The sampled counterfactual values are unbiased estimates of the true counterfactual values (Lanctot et al. 2009a, Lemma 1). In outcome sampling, for example, only the regrets along the sampled terminal history are computed (all others are zero by definition). Outcome sampling converges to equilibrium faster than vanilla CFR in a number of different games (Lanctot et al. 2009a, Figure 1).

As we sample fewer actions at a given node, the sampled counterfactual value is potentially less accurate. Figure 1b illustrates this point in the case of outcome sampling. Here, an "informative" sampled counterfactual value for just a single action is obtained at each information set along the sampled block (history). All other actions are assigned a sampled counterfactual value of zero. While E_Q[ṽ_i(σ, I)] = v_i(σ, I), variance is introduced, affecting both the regret updates and the value recursed back to the parent. As we will see in the next section, this variance plays an important role in the number of iterations required to converge.
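The following short Python sketch (ours, with an assumed flat input format) mirrors the MCCFR estimator above: sampled terminals are importance-corrected by q(z), and any action whose subtree was not sampled simply contributes an empty list, i.e. a value of zero.

    def sampled_counterfactual_value(sampled_terminals):
        """v~_i(sigma, I): importance-corrected sum over the sampled terminals in Z_I ∩ Q.

        `sampled_terminals` is a list of (u_i, reach_minus_i, reach_tail, q_z) tuples;
        a non-sampled action corresponds to an empty list and therefore a value of 0.
        """
        return sum(u * reach_mi * reach_tail / q_z
                   for u, reach_mi, reach_tail, q_z in sampled_terminals)

    # Outcome sampling at the information set of Figure 1b: only a_1 was sampled.
    v_a1 = sampled_counterfactual_value([(1.0, 0.2, 0.5, 0.01)])   # one sampled terminal
    v_a2 = sampled_counterfactual_value([])                        # not sampled -> 0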
Generalized Sampling

Our main contributions in this paper are new theoretical findings that generalize those of MCCFR. We begin by presenting a previously established bound on the average regret achieved through MCCFR. Let |A_i| = max_{I ∈ I_i} |A(I)| and suppose δ > 0 satisfies the following: ∀z ∈ Z, either π_{-i}^σ(z) = 0 or q(z) ≥ δ > 0 at every iteration. We can then bound the difference between any two samples, ṽ_i(σ_{(I→a)}, I) − ṽ_i(σ_{(I→b)}, I) ≤ Δ̃_i = Δ_i/δ, where Δ_i = max_{z∈Z} u_i(z) − min_{z∈Z} u_i(z). The average regret can then be bounded as follows:

Theorem 1 (Lanctot et al. (2009a), Theorem 5) Let p ∈ (0, 1]. When using outcome-sampling MCCFR, with probability 1 − p, average regret is bounded by

    \frac{R_i^T}{T} \le \left( \tilde{\Delta}_i + \frac{\sqrt{2}\, \tilde{\Delta}_i}{\sqrt{p}} \right) \frac{|I_i| \sqrt{|A_i|}}{\sqrt{T}}.        (2)

A related bound holds for all MCCFR instances (Lanctot et al. 2009b, Theorem 7). We note here that Lanctot et al. present a slightly tighter bound than equation (2) where |I_i| is replaced with a game-dependent constant M_i that is independent of the sampling scheme and satisfies √|I_i| ≤ M_i ≤ |I_i|. This constant is somewhat complicated to define, and thus we omit these details here. Recall that minimizing the average regret yields an approximate Nash equilibrium. Theorem 1 suggests that the rate at which regret is minimized depends on the bound Δ̃_i on the difference between two sampled counterfactual values.

We now present a new, generalized bound on the average regret. While MCCFR provides an explicit form for the sampled counterfactual values ṽ_i(σ, I), we let v̂_i(σ, I) denote any estimator of the true counterfactual value v_i(σ, I). We can then define the estimated counterfactual regret on iteration t for action a at I to be r̂_i^t(I, a) = v̂_i(σ^t_{(I→a)}, I) − v̂_i(σ^t, I). This generalization creates many possibilities not considered in MCCFR. For instance, instead of sampling a block Q of terminal histories, one can consider a sampled set of information sets and only update regrets at those sampled locations. Another example is provided later in the paper.

The following lemma probabilistically bounds the average regret in terms of the variance, covariance, and bias between the estimated and true counterfactual regrets:

Lemma 1 Let p ∈ (0, 1] and suppose that there exists a bound Δ̂_i on the difference between any two estimates, v̂_i(σ_{(I→a)}, I) − v̂_i(σ_{(I→b)}, I) ≤ Δ̂_i. If strategies are selected according to regret matching on the estimated counterfactual regrets, then with probability at least 1 − p, the average regret is bounded by

    \frac{R_i^T}{T} \le |I_i| \sqrt{|A_i|} \left( \frac{\hat{\Delta}_i}{\sqrt{T}} + \sqrt{\frac{\mathrm{Var}}{pT} + \frac{\mathrm{Cov}}{p} + \frac{\mathrm{E}^2}{p}} \right),

where

    \mathrm{Var} = \max_{t \in \{1,\dots,T\},\, I \in I_i,\, a \in A(I)} \mathrm{Var}\!\left[ r_i^t(I, a) - \hat{r}_i^t(I, a) \right],

with Cov and E similarly defined.


The proof is similar to that of Theorem 7 by Lanctot et al. (2009b) and can be found, along with all other proofs, in the technical report (Gibson et al. 2012). Lemma 1 implies that unbiased estimators of v_i(σ, I) give a probabilistic guarantee of minimizing R_i^T / T:

Theorem 2 If, in addition to the conditions of Lemma 1, v̂_i(σ, I) is an unbiased estimator of v_i(σ, I), then with probability at least 1 − p,

    \frac{R_i^T}{T} \le \left( \hat{\Delta}_i + \sqrt{\frac{\mathrm{Var}}{p}} \right) \frac{|I_i| \sqrt{|A_i|}}{\sqrt{T}}.        (3)

Theorem 2 shows that the bound on the difference between two estimates, Δ̂_i, plays a role in bounding the average overall regret. This is not surprising, as Δ̃_i plays a similar role in Theorem 1. However, Theorem 2 provides new insight into the role played by the variance of the estimator. Given two unbiased estimators v̂_i(σ, I) and v̂_i′(σ, I) with a common bound Δ̂_i but differing variance, using the estimator with lower variance will yield a smaller bound on the average regret after T iterations. For a fixed ε > 0, this suggests that estimators with lower variance will require fewer iterations to converge to an ε-Nash equilibrium. In addition, since in MCCFR we can bound √Var ≤ √2 max{Δ̃_i, Δ_i}, equation (3) is more informative than equation (2). Furthermore, if some structure on the estimates v̂_i(σ, I) holds, we can produce a tighter bound than equation (3) by incorporating the game-dependent constant M_i introduced by Lanctot et al. (2009a). Details of this improvement are included in the technical report (Gibson et al. 2012).

While unbiased estimators with lower variance may reduce the number of iterations required, we must define these estimators carefully. If the estimator is expensive to compute, the time per iteration will be costly and overall computation time may even increase. For example, the true counterfactual value v_i(σ, I) has zero variance, but computing the value with vanilla CFR is too time consuming in large games. In the next section, we present a new bounded, unbiased estimator that exhibits lower variance than ṽ_i(σ, I) and can be computed nearly as fast.

A New CFR Sampling Algorithm

We now provide an example of how our new theoretical findings can be leveraged to reduce the computation time required to obtain an ε-Nash equilibrium. Our example is an extension of MCCFR that attempts to reduce variance by replacing "zeroed-out" counterfactual values of player i's non-sampled actions with closer estimates of the true counterfactual values. Figure 1c illustrates this idea. The simplest instance of our new algorithm "probes" each non-sampled action a at I for its counterfactual value. A probe is a single Monte Carlo roll-out, starting with action a at I and selecting subsequent actions according to the current strategy σ^t until a terminal history z is reached. By rolling out actions according to the current strategy, a probe is guaranteed to provide an unbiased estimate of the counterfactual value for a at I.
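A probe is straightforward to implement; the following minimal Python sketch (ours, over a toy nested-dict encoding rather than any real game interface) rolls out one trajectory on-policy and returns the terminal utility.

    import random

    def probe(node):
        """Single on-policy roll-out: follow the current profile (and chance) to a terminal.

        `node` is either a terminal utility for player i, or a dict mapping each action to a
        (probability, child) pair under sigma^t / chance -- a toy encoding assumed only here.
        """
        while isinstance(node, dict):
            actions = list(node)
            weights = [node[a][0] for a in actions]
            a = random.choices(actions, weights=weights)[0]   # sample according to sigma^t or chance
            node = node[a][1]
        return node

    # Example: probing a hypothetical non-sampled action from Figure 1c's information set.
    subtree_after_action = {"check": (0.6, 1.0), "bet": (0.4, {"call": (1.0, -2.0)})}
    estimate = probe(subtree_after_action)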
In general, one can perform multiple probes per non-sampled action, probe only a subset of the non-sampled actions, probe off-policy, or factor in multiple terminal histories per probe. While the technical report touches on this generalization (Gibson et al. 2012), our presentation here sticks to the simple, inexpensive case of one on-policy, single-trajectory probe for each non-sampled action.

We now formally define the estimated counterfactual value v̂_i(σ, I) obtained via probing, followed by a description of a new CFR sampling algorithm that updates regrets according to these estimates. Similar to MCCFR, let Q be a set of blocks spanning Z from which we sample a block Q ∈ Q for player i on every iteration. To further simplify our discussion, we will assume from here on that each Q samples a single action at every history h not belonging to player i, sampled according to the known chance probabilities σ_c or the opponent's current strategy σ_{-i}. Additionally, we assume that the set of actions sampled at I ∈ I_i, denoted Q(I), is nonempty and independent of every other set of actions sampled. While probing can be generalized to work for any choice of Q (Gibson et al. 2012), this simplification reduces the number of probabilities to compute in our algorithm and worked well in preliminary experiments. Once Q has been sampled, we form an additional set of terminal histories, or probes, B ⊆ Z \ Q, generated as follows. For each non-terminal history h with P(h) = i reached and each action a ∈ A(h) that Q does not sample (i.e., ∃z ∈ Q such that h ⊑ z, but ∀z ∈ Q, ha ⋢ z), we generate exactly one terminal history z = z_{ha} ∈ B, where z ∈ Z \ Q is selected on-policy (i.e., with probability π^σ(ha, z)). In other words, each non-sampled action is probed according to the current strategy profile σ and the known chance probabilities. Recall that Z_I is the set of terminal histories that have a prefix in the information set I. Given both Q and B, when Z_I ∩ Q ≠ ∅, our estimated counterfactual value is defined to be

    \hat{v}_i(\sigma, I) = \frac{1}{q_i(I)} \left[ \sum_{z \in Z_I \cap Q} \pi_i^{\sigma}(z[I], z)\, u_i(z) + \sum_{z_{ha} \in Z_I \cap B} \pi_i^{\sigma}(z_{ha}[I], ha)\, u_i(z_{ha}) \right],

where

    q_i(I) = \prod_{(I', a') \in X_i(I)} P[a' \in Q(I')]

is the probability that Z_I ∩ Q ≠ ∅ contributed from sampling player i's actions. Here, X_i(I) is the sequence of information set, action pairs for player i that lead to information set I, and this sequence is unique due to perfect recall.
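To make the estimator concrete, here is a small Python sketch (our own flattened representation, not the authors' code) that evaluates v̂_i(σ, I) from the sampled terminals in Z_I ∩ Q, the probed terminals in Z_I ∩ B, and the per-information-set sampling probabilities that make up q_i(I).

    def q_i(sample_probs_on_path):
        """q_i(I): product over (I', a') in X_i(I) of P[a' in Q(I')]."""
        q = 1.0
        for p in sample_probs_on_path:
            q *= p
        return q

    def probing_estimate(sampled, probed, q_I):
        """hat{v}_i(sigma, I) for the case Z_I ∩ Q ≠ ∅.

        `sampled` holds (pi_i(z[I], z), u_i(z)) pairs for z in Z_I ∩ Q;
        `probed`  holds (pi_i(z_ha[I], ha), u_i(z_ha)) pairs for z_ha in Z_I ∩ B.
        """
        total = sum(p * u for p, u in sampled) + sum(p * u for p, u in probed)
        return total / q_I

    # Example: one sampled terminal, two probed actions, and Q(I') choices made w.p. 1 and 0.5.
    value = probing_estimate(sampled=[(1.0, 2.0)], probed=[(0.5, -1.0), (0.5, 3.0)],
                             q_I=q_i([1.0, 0.5]))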
When Z_I ∩ Q = ∅, v̂_i(σ, I) is defined to be zero.

Proposition 1 If q_i(I) > 0 for all I ∈ I_i, then v̂_i(σ, I) is a bounded, unbiased estimate of v_i(σ, I).

Proposition 1 and Theorem 2 provide a probabilistic guarantee that updating regrets according to our estimated counterfactual values will minimize the average regret, and thus produce an ε-Nash equilibrium. Note that the differences ṽ_i(σ_{(I→a)}, I) − ṽ_i(σ_{(I→b)}, I) and v̂_i(σ_{(I→a)}, I) − v̂_i(σ_{(I→b)}, I) are both bounded above by Δ̂_i = Δ_i/δ, where δ = min_{I ∈ I_i} q_i(I).


Thus, Theorem 2 suggests that variance reduction should lead to less regret after a fixed number of iterations. Probing specifically aims to achieve variance reduction through v̂_i(σ, I) when only a strict subset of player i's actions are sampled. Note that if we always sample all of player i's actions, we have B = ∅ and v̂_i(σ, I) = ṽ_i(σ, I).

Algorithm 1 provides pseudocode for our new algorithm that updates regrets according to our estimated counterfactual values.

Algorithm 1 CFR sampling with probing
 1: Require: ∀I, action set sampling distribution Q(I)
 2: Initialize regret: ∀I, ∀a ∈ A(I): r_I[a] ← 0
 3: Initialize cumulative profile: ∀I, ∀a ∈ A(I): s_I[a] ← 0
 4:
 5: function Probe(history h, player i):
 6:   if h ∈ Z then
 7:     return u_i(h)
 8:   else if h ∈ P(c) then
 9:     Sample action a ∼ σ_c(h, ·)
10:   else
11:     I ← Information set containing h
12:     σ ← RegretMatching(r_I)
13:     Sample action a ∼ σ
14:   end if
15:   return Probe(ha, i)
16:
17: function WalkTree(history h, player i, sample prob q):
18:   if h ∈ Z then
19:     return u_i(h)
20:   else if h ∈ P(c) then
21:     Sample action a ∼ σ_c(h, ·)
22:     return WalkTree(ha, i, q)
23:   else if h ∉ P(i) then
24:     I ← Information set containing h
25:     σ ← RegretMatching(r_I)
26:     for a ∈ A(I) do
27:       s_I[a] ← s_I[a] + (σ[a]/q)
28:     end for
29:     Sample action a ∼ σ
30:     return WalkTree(ha, i, q)
31:   end if
32:   I ← Information set containing h
33:   σ ← RegretMatching(r_I)
34:   Sample action set Q(I) ∼ Q(I)
35:   for a ∈ A(I) do
36:     if a ∈ Q(I) then
37:       q′ ← q · P_{Q(I)}[a ∈ Q(I)]
38:       v[a] ← WalkTree(ha, i, q′)
39:     else
40:       v[a] ← Probe(ha, i)
41:     end if
42:   end for
43:   for a ∈ A(I) do
44:     r_I[a] ← r_I[a] + (1/q)(v[a] − Σ_{b∈A(I)} σ[b]v[b])
45:   end for
46:   return Σ_{a∈A(I)} σ[a]v[a]
47:
48: function Solve(iterations T):
49:   for t ∈ {1, 2, ..., T} do
50:     WalkTree(∅, 1, 1)
51:     WalkTree(∅, 2, 1)
52:   end for

The Probe function recurses down the tree from history h, following a single trajectory according to the known chance probabilities (line 9) and the current strategy obtained through regret matching (line 13; see equation (1)) until the utility at a single terminal history is returned (line 7). The WalkTree function is the main part of the algorithm and contains three major cases. First, if the current history h is a terminal history, we simply return the utility (line 19). Second, if player i does not act at h (lines 20 to 31), then our assumption on Q dictates that a single action is traversed on-policy (lines 21 and 29). The third case covers player i acting at h (lines 32 to 46). After sampling a set of actions (line 34), the value of each action a, v[a], is obtained. For each sampled action, we obtain its value by recursing down that action (line 38) after updating the sample probability for future histories (line 37). While MCCFR assigns zero value to each non-sampled action, Algorithm 1 instead obtains these action values through the Probe function (line 40). Note that q = q_i(I) is the probability of reaching I contributed from sampling player i's actions, and that the estimated counterfactual value v̂_i(σ_{(I→a)}, I) = v[a]/q. After obtaining all values, the regret of each action is updated (line 44).

Running the Solve function for a large enough number of iterations T will produce an approximate Nash equilibrium σ̄, where σ̄(I, a) = s_I[a] / Σ_{b∈A(I)} s_I[b]. Note that Algorithm 1 updates the cumulative profile (line 27) in a slightly different manner than the MCCFR algorithm presented by Lanctot et al. (2009b, Algorithm 1). First, we divide the update by the probability q_i(I) of reaching I under Q (which Lanctot et al. call updating "stochastically"), as opposed to the "optimistic averaging" that the authors mention is not technically correct. Second, we update player i's part of the profile during the opponent's traversal instead of during player i's traversal. Doing so ensures that any information set that will contribute positive probability to the cumulative profile can be reached, which along with stochastic updating produces an unbiased update. These changes are not specific to our new algorithm and can also be applied to MCCFR.

Experimental Results

In this section, we compare MCCFR to our new sampling algorithm in three domains, which we now describe.

Goofspiel(n) is a card-bidding game consisting of n rounds. Each player begins with a hand of bidding cards numbered 1 to n. In our version, on round k, players secretly and simultaneously play one bid from their remaining cards, and the player with the highest bid receives n − k + 1 points; in the case of a tie, no points are awarded. The player with the highest score after n rounds receives a utility of +1 and the other player earns −1, and both receive 0 utility in a tie. Our version of Goofspiel is less informative than conventional Goofspiel, as players know which of the previous bids were won or lost, but not which cards the opponent played.
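As a small illustration of the scoring rule just described (a sketch of the stated rules, not of the authors' implementation), one round of Goofspiel(n) can be resolved as follows.

    def resolve_round(bid1, bid2, k, n):
        """Round k of Goofspiel(n): the higher bid wins n - k + 1 points; a tie awards nothing."""
        prize = n - k + 1
        if bid1 > bid2:
            return prize, 0
        if bid2 > bid1:
            return 0, prize
        return 0, 0

    # Round 2 of Goofspiel(6): player 1 bids 5, player 2 bids 3, so player 1 scores 5 points.
    assert resolve_round(5, 3, k=2, n=6) == (5, 0)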


Bluff(D_1, D_2) is a dice-bidding game played over a number of rounds. Each player i starts with D_i six-sided dice. In each round, players roll their dice and look at the result without showing their opponent. Then, players alternate by bidding a quantity of a face value, q-f, of all dice in play until one player claims that the other is bluffing (i.e., claims that the bid does not hold). To place a new bid, a player must increase q or f of the current bid. A face of 6 is considered "wild" and counts as any other face value. The player calling bluff wins the round if the opponent's last bid is incorrect, and loses otherwise. The losing player removes one of their dice from the game and a new round begins. Once a player has no more dice left, that player loses the game and receives a utility of −1, while the winning player earns +1 utility.

Finally, we consider heads-up (i.e., two-player) limit Texas hold'em poker that is played over four betting rounds. To begin, each player is dealt two private cards. In later rounds, public community cards are revealed, with a fifth and final card appearing in the last round. During each betting round, players can either fold (forfeit the game), call (match the previous bet), or raise (increase the previous bet), with a maximum of four raises per round. If neither player folds, then the player with the highest ranked poker hand wins all of the bets. Hold'em contains approximately 3 × 10^14 information sets, making the game intractable for any equilibrium computation technique. A common approach in poker is to apply a card abstraction that merges similar card dealings together into a single chance "bucket" (Gilpin and Sandholm 2006). We apply a ten-bucket abstraction that reduces the branching factor at each chance node down to ten, where dealings are grouped according to expected hand strength squared, as described by Zinkevich et al. (2008). This abstract game contains roughly 57 million information sets.

We use domain knowledge and our intuition to select the sampling schemes Q. By our earlier assumption, we always sample a single action on-policy when P(h) ≠ i, as is done in Algorithm 1. For the traversing player i, we focus on sampling actions leading to more "important" parts of the tree, while sampling other actions less frequently. Doing so updates the regret at the important information sets more frequently to quickly improve play at those locations. In Goofspiel, we always sample the lowest and highest bids, while sampling each of the remaining bids independently with probability 0.5. Strong play can be achieved by only ever playing the highest bid (giving the best chance at winning the bid) or the lowest bid (sacrificing the current bid, leaving higher cards for winning future bids), suggesting that these actions will often be taken in equilibrium. In Bluff(2,2), we always sample "bluff" and the bids 1-5, 2-5, 1-6, 2-6, and, for each face x that we roll, n-x for all 1 ≤ n ≤ 4. Bidding on the highest face is generally the best bluff since the opponent's next bid must increase the quantity, and bidding on one's own dice roll is more likely to be correct. Finally, in hold'em, we always sample fold and raise actions, while sampling call with probability 0.5. Folds are cheap to sample (since the game ends) and raise actions increase the number of bets and consequently the magnitude of the utilities.
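As one concrete instance, the Goofspiel scheme described above could be implemented as the following Python sketch (ours); the returned per-action probabilities correspond to the P[a ∈ Q(I)] factors used in the q′ update of Algorithm 1 (line 37).

    import random

    def sample_bid_set(remaining_bids):
        """Goofspiel action-set sampling: always keep the lowest and highest remaining bids,
        and include each other bid independently with probability 0.5."""
        lo, hi = min(remaining_bids), max(remaining_bids)
        sampled = {lo, hi}
        for b in remaining_bids:
            if b not in (lo, hi) and random.random() < 0.5:
                sampled.add(b)
        # P[a in Q(I)] for each action, needed for the q' update on line 37 of Algorithm 1
        probs = {b: (1.0 if b in (lo, hi) else 0.5) for b in remaining_bids}
        return sampled, probs

    Q_I, P_in_Q = sample_bid_set([1, 2, 4, 6])   # 1 and 6 always sampled; 2 and 4 each w.p. 0.5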
First, we performed a test run of CFR in Goofspiel(6) that measured the empirical variance of the samples ṽ_i(σ, I) provided by MCCFR and of v̂_i(σ, I) provided by Algorithm 1. During each iteration t of the test run, we performed 2000 traversals with no regret or strategy updates, where the first 1000 traversals computed ṽ_i(σ^t, I) and the second 1000 computed v̂_i(σ^t, I) at the root I of the game. Both ṽ_i(σ^t, I) and v̂_i(σ^t, I) were computed under the same sampling scheme Q described above for Goofspiel. Once the empirical variance of each estimator was recorded from the samples at time t, a full vanilla CFR traversal was then performed to update the regrets and acquire the next strategy σ^{t+1}. The first 150 empirical variances are reported in Figure 2. Since the estimators are unbiased, the variance here is also equal to the mean squared error of the estimates. Over 1000 test iterations, the average variances were 0.295 for MCCFR and 0.133 for Algorithm 1. This agrees with our earlier intuition that probing reduces variance and provides some validation for our choice of estimator.

Figure 2: Empirical Var[ṽ_i(σ^t, I)] and Var[v̂_i(σ^t, I)] over iterations at the root of Goofspiel(6) in a test run of CFR.

Next, we performed five runs for each of MCCFR and Algorithm 1, each under the same sampling schemes Q described above. Similar to Algorithm 1, our MCCFR implementation also performs stochastic averaging during the opponent's tree traversal. For each domain, the average of the results is provided in Figure 3. Our new algorithm converges faster than MCCFR in all three domains.

Figure 3: Exploitability over time of strategies computed by MCCFR and by Algorithm 1 using identical sampling schemes Q, averaged over five runs, in (a) Goofspiel(7), (b) Bluff(2,2), and (c) Texas hold'em. Error bars indicate 95% confidence intervals at each of the five averaged data points. In hold'em, exploitability is measured in terms of milli-big-blinds per game (mbb/g).
In particular, at our final data points, Algorithm 1 shows a 31%, 10%, and 18% improvement over MCCFR in Goofspiel(7), Bluff(2,2), and Texas hold'em respectively. For both Goofspiel(7) and hold'em, the improvement was statistically significant. In Goofspiel(7), for example, the level of exploitability reached by MCCFR's last averaged data point is reached by Algorithm 1 in nearly half the time.

Conclusion

We have provided a new theoretical framework that generalizes MCCFR and provides new insights into how the estimated values affect the rate of convergence to an approximate Nash equilibrium. As opposed to the sampled counterfactual values ṽ_i(σ, I) explicitly defined by MCCFR, we considered any estimate v̂_i(σ, I) of the true counterfactual values.


We showed that the average regret is minimized (probabilistically) when the estimates are bounded and unbiased. In addition, we derived an upper bound on the average regret in terms of the variance of the estimates, suggesting that estimators with lower variance will converge to an ε-Nash equilibrium in fewer iterations. Finally, we provided an example of a non-MCCFR algorithm that reduces variance with little computational overhead by probing non-sampled actions. Our new algorithm approached equilibrium faster than its MCCFR counterpart in all of the reported experiments. We suspect that there are other efficiently-computable definitions of v̂_i(σ, I) that are bounded, unbiased, and exhibit lower variance than our probing example. Future work will attempt to further improve convergence rates through such alternative definitions.

Acknowledgements

We would like to thank the members of the Computer Poker Research Group at the University of Alberta for their helpful suggestions throughout this project. This work was supported by NSERC, Alberta Innovates – Technology Futures, and the use of computing resources provided by WestGrid and Compute Canada.

References

Gibson, R.; Lanctot, M.; Burch, N.; Szafron, D.; and Bowling, M. 2012. Generalized sampling and variance in counterfactual regret minimization. Technical Report TR12-02, University of Alberta.
Gilpin, A., and Sandholm, T. 2006. A competitive Texas Hold'em poker player via automated abstraction and real-time equilibrium computation. In Twenty-First Conference on Artificial Intelligence (AAAI), 1007–1013.
Hart, S., and Mas-Colell, A. 2000. A simple adaptive procedure leading to correlated equilibrium. Econometrica 68:1127–1150.
Hoda, S.; Gilpin, A.; Peña, J.; and Sandholm, T. 2010. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research 35(2):494–512.
Koller, D.; Megiddo, N.; and von Stengel, B. 1994. Fast algorithms for finding randomized strategies in game trees. In Annual ACM Symposium on Theory of Computing (STOC'94), 750–759.
Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009a. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22 (NIPS), 1078–1086.
Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009b. Monte Carlo sampling for regret minimization in extensive games. Technical Report TR09-15, University of Alberta.
Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2008. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20 (NIPS), 905–912.
