Generalized Sampling and Variance in Counterfactual Regret Minimization

Richard Gibson and Marc Lanctot and Neil Burch and Duane Szafron and Michael Bowling
Department of Computing Science, University of Alberta
Edmonton, Alberta, T6G 2E8, Canada
{rggibson | lanctot | nburch | dszafron | mbowling}@ualberta.ca

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In large extensive form games with imperfect information, Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing approximate Nash equilibria. While the base algorithm performs a full tree traversal on each iteration, Monte Carlo CFR (MCCFR) reduces the per iteration time cost by traversing just a sampled portion of the tree. On the other hand, MCCFR's sampled values introduce variance, and the effects of this variance were previously unknown. In this paper, we generalize MCCFR by considering any generic estimator of the sought values. We show that any choice of an estimator can be used to probabilistically minimize regret, provided the estimator is bounded and unbiased. In addition, we relate the variance of the estimator to the convergence rate of an algorithm that calculates regret directly from the estimator. We demonstrate the application of our analysis by defining a new bounded, unbiased estimator with empirically lower variance than MCCFR estimates. Finally, we use this estimator in a new sampling algorithm to compute approximate equilibria in Goofspiel, Bluff, and Texas hold'em poker. Under each of our selected sampling schemes, our new algorithm converges faster than MCCFR.

Introduction

An extensive form game is a common formalism used to model sequential decision making problems. Extensive games provide a versatile framework capable of representing multiple agents, imperfect information, and stochastic events. Counterfactual Regret Minimization (CFR) (Zinkevich et al. 2008) is an algorithm capable of finding effective strategies in a variety of games. In 2-player zero-sum games with perfect recall, CFR converges to an approximate Nash equilibrium profile. Other techniques for computing Nash equilibria include linear programming (Koller, Megiddo, and von Stengel 1994) and the Excessive Gap Technique (Hoda et al. 2010).

CFR is an iterative algorithm that updates every player's strategy through a full game tree traversal on each iteration. Theoretical results indicate that for a fixed solution quality, the procedure takes a number of iterations at most quadratic in the size of the game (Zinkevich et al. 2008, Theorem 4).
As we consider larger games, however, traversals become more time consuming, and thus more time is required to converge to an equilibrium.

Monte Carlo CFR (MCCFR) (Lanctot et al. 2009a) can be used to reduce the traversal time per iteration by considering only a sampled portion of the game tree at each step. Compared to CFR, MCCFR can update strategies faster and lead to less overall computation time. However, the strategy updates are noisy because any action that is not sampled is assumed to provide zero counterfactual value to the strategy. When a non-sampled action provides large value, MCCFR introduces a lot of variance. Previous work does not discuss how this variance affects the convergence rate.

Our main contributions in this paper result from a more general analysis of the effects of using sampled values in CFR updates. We show that any bounded, unbiased estimates of the true counterfactual values can be used to minimize regret, whether the estimates are derived from MCCFR or not. Furthermore, we prove a new upper bound on the average regret in terms of the variance of the estimates, suggesting that estimates with lower variance are preferred. In addition to these main results, we introduce a new CFR sampling algorithm that lives outside of the MCCFR family of algorithms. By "probing" the value of non-sampled actions, our new algorithm demonstrates one way of reducing the variance in the updates to provide faster convergence to equilibrium. This is shown in three domains: Goofspiel, Bluff, and Texas hold'em poker.

Background

A finite extensive game contains a game tree with nodes corresponding to histories of actions h ∈ H and edges corresponding to actions a ∈ A(h) available to player P(h) ∈ N ∪ {c} (where N is the set of players and c denotes chance). When P(h) = c, σ_c(h, a) is the (fixed) probability of chance generating action a at h. We call h a prefix of history h′, written h ⊑ h′, if h′ begins with the sequence h. Each terminal history z ∈ Z has associated utilities u_i(z) for each player i. In imperfect information games, non-terminal histories are partitioned into information sets I ∈ I_i representing the different game states that player i cannot distinguish between. For example, in poker, player i does not see the private cards dealt to the opponents, and thus all histories differing only in the private cards of the opponents are in the same information set for player i.
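To make the information-set partition concrete, the following minimal Python sketch (ours, not from the paper) groups poker-like histories by what player i actually observes; the history encoding and the "hidden" marker are assumptions made purely for illustration.

    def information_set_key(history, i):
        """Map a history to player i's information set by hiding opponents' private cards.

        `history` is a tuple of (actor, action) pairs, where private deals are encoded as
        ("chance", ("deal", player, card)).  This encoding is an assumption of this sketch.
        """
        key = []
        for actor, action in history:
            if actor == "chance" and action[0] == "deal" and action[1] != i:
                key.append(("deal", action[1], "hidden"))   # opponent's private card is not observed
            else:
                key.append((actor, action))
        return tuple(key)

    # Two histories differing only in the opponent's private card map to the same key:
    h1 = (("chance", ("deal", 1, "As")), ("chance", ("deal", 2, "Kd")), (1, "raise"))
    h2 = (("chance", ("deal", 1, "As")), ("chance", ("deal", 2, "7c")), (1, "raise"))
    assert information_set_key(h1, 1) == information_set_key(h2, 1)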


The action sets A(h) must be identical for all h ∈ I, and we denote this set by A(I). We assume perfect recall, which guarantees that players always remember information that was revealed to them and the order in which it was revealed.

A strategy for player i, σ_i ∈ Σ_i, is a function that maps each information set I ∈ I_i to a probability distribution over A(I). A strategy profile is a vector of strategies σ = (σ_1, ..., σ_|N|) ∈ Σ, one for each player. Define u_i(σ) to be the expected utility for player i, given that all players play according to σ. We let σ_{-i} refer to the strategies in σ excluding σ_i.

Let π^σ(h) be the probability of history h occurring if all players choose actions according to σ. We can decompose

    \pi^{\sigma}(h) = \prod_{i \in N \cup \{c\}} \pi_i^{\sigma}(h)

into each player's and chance's contribution to this probability. Here, π_i^σ(h) is the contribution to this probability from player i when playing according to σ_i. Let π_{-i}^σ(h) be the product of all players' contributions (including chance) except that of player i. Furthermore, let π^σ(h, h′) be the probability of history h′ occurring after h, given h has occurred. Let π_i^σ(h, h′) and π_{-i}^σ(h, h′) be defined similarly.

A best response to σ_{-i} is a strategy that maximizes player i's expected payoff against σ_{-i}. The best response value for player i is the value of that strategy, b_i(σ_{-i}) = max_{σ_i′ ∈ Σ_i} u_i(σ_i′, σ_{-i}). A strategy profile σ is an ε-Nash equilibrium if no player can unilaterally deviate from σ and gain more than ε; i.e., u_i(σ) + ε ≥ b_i(σ_{-i}) for all i ∈ N. In this paper, we will focus on two-player zero-sum games: N = {1, 2} and u_1(z) = −u_2(z) for all z ∈ Z. In this case, the exploitability of σ, e(σ) = (b_1(σ_2) + b_2(σ_1))/2, measures how much σ loses to a worst case opponent when players alternate positions. A 0-Nash equilibrium (or simply a Nash equilibrium) has zero exploitability.

Counterfactual Regret Minimization (CFR) is an iterative procedure that, for two-player zero-sum games, obtains an ε-Nash equilibrium in O(|H||I_i|/ε²) time (Zinkevich et al. 2008, Theorem 4). On each iteration t, CFR (or "vanilla CFR") recursively traverses the entire game tree, calculating the expected utility for player i at each information set I ∈ I_i under the current profile σ^t, assuming player i plays to reach I. This expectation is the counterfactual value for player i,

    v_i(\sigma, I) = \sum_{z \in Z_I} u_i(z)\, \pi_{-i}^{\sigma}(z[I])\, \pi^{\sigma}(z[I], z),

where Z_I is the set of terminal histories passing through I and z[I] is the prefix of z contained in I. For each action a ∈ A(I), these values determine the counterfactual regret at iteration t, r_i^t(I, a) = v_i(σ^t_{(I→a)}, I) − v_i(σ^t, I), where σ_{(I→a)} is the profile σ except that at I, action a is always taken. This process is shown visually in Figure 1a. The regret r_i^t(I, a) measures how much player i would rather play action a at I than play σ^t.
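As a concrete illustration of these two definitions, here is a minimal Python sketch (our own, not from the paper) that computes v_i(σ, I) from an enumeration of Z_I and the counterfactual regrets from the resulting action values; the flat input format is an assumption made purely for illustration.

    def counterfactual_value(terminals):
        """v_i(sigma, I): sum over z in Z_I of u_i(z) * pi_-i(z[I]) * pi(z[I], z).

        `terminals` is a list of (u_i, reach_minus_i, reach_tail) triples, one per z in Z_I.
        """
        return sum(u * reach_mi * reach_tail for u, reach_mi, reach_tail in terminals)

    def counterfactual_regrets(action_values, sigma_I):
        """r_i^t(I, a) = v_i(sigma_(I->a), I) - v_i(sigma, I) for each a in A(I).

        `action_values[a]` is v_i(sigma_(I->a), I); `sigma_I[a]` is sigma^t(I, a).
        """
        v_I = sum(sigma_I[a] * action_values[a] for a in action_values)   # v_i(sigma, I)
        return {a: action_values[a] - v_I for a in action_values}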
Figure 1: (a) The computed values at information set I during vanilla CFR. First, for each action, the counterfactual values are recursively computed. The counterfactual regrets are then computed before returning the counterfactual value at I to the parent. (b) The computed values at I during outcome sampling. Here, only action a_1 is sampled and its sampled counterfactual value is recursively computed. The remaining two actions are effectively assigned zero sampled counterfactual value. The sampled counterfactual regrets are then computed before returning the sampled counterfactual value at I to the parent. (c) An example of computed values at I during our new sampling algorithm. In this example, again only a_1 is sampled and its estimated counterfactual value is recursively computed. The remaining two actions are "probed" to improve both the estimated counterfactual regrets and the returned estimated counterfactual value at I.

The counterfactual regrets

    R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)

are accumulated and σ^t is updated by applying regret matching (Hart and Mas-Colell 2000; Zinkevich et al. 2008) to the accumulated regrets,

    \sigma^{T+1}(I, a) = \frac{R_i^{T,+}(I, a)}{\sum_{b \in A(I)} R_i^{T,+}(I, b)},        (1)

where x^+ = max{x, 0} and actions are chosen uniformly at random when the denominator is zero. This procedure minimizes the average of the counterfactual regrets, which in turn minimizes the average (external) regret R_i^T / T (Zinkevich et al. 2008, Theorem 3), where

    R_i^T = \max_{\sigma' \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma', \sigma_{-i}^t) - u_i(\sigma_i^t, \sigma_{-i}^t) \right).

It is well known that in a two-player zero-sum game, if R_i^T / T < ε for i ∈ {1, 2}, then the average profile σ̄^T is a 2ε-Nash equilibrium.
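For concreteness, the update in equation (1) and the average profile can be written in a few lines of Python; this is a minimal sketch of the standard regret-matching rule, not the authors' implementation.

    def regret_matching(cum_regret):
        """Equation (1): play in proportion to positive cumulative counterfactual regret."""
        positive = {a: max(r, 0.0) for a, r in cum_regret.items()}
        norm = sum(positive.values())
        if norm > 0.0:
            return {a: r / norm for a, r in positive.items()}
        return {a: 1.0 / len(cum_regret) for a in cum_regret}   # uniform when all regrets <= 0

    def average_strategy(cum_profile):
        """The average profile at I, sigma_bar(I, a) = s_I[a] / sum_b s_I[b] (defined later for Algorithm 1)."""
        norm = sum(cum_profile.values())
        return {a: s / norm for a, s in cum_profile.items()}

    # Example: sigma^{T+1}(I, .) from accumulated regrets at one information set
    print(regret_matching({"fold": -1.2, "call": 0.9, "raise": 2.7}))   # {'fold': 0.0, 'call': 0.25, 'raise': 0.75}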


For large games, CFR's full game tree traversal can be very expensive. Alternatively, one can still obtain an approximate equilibrium by traversing a smaller, sampled portion of the tree on each iteration using Monte Carlo CFR (MCCFR) (Lanctot et al. 2009a). Let Q be a set of subsets, or blocks, of the terminal histories Z such that the union of Q spans Z. On each iteration, a block Q ∈ Q is sampled according to a probability distribution over Q. Outcome sampling is an example of MCCFR that uses blocks containing a single terminal history (Q = {z}). On each iteration of outcome sampling, the block is chosen during traversal by sampling a single action at the current decision point until a terminal history is reached. The sampled counterfactual value for player i,

    \tilde{v}_i(\sigma, I) = \sum_{z \in Z_I \cap Q} u_i(z)\, \pi_{-i}^{\sigma}(z[I])\, \pi^{\sigma}(z[I], z) / q(z),

where q(z) is the probability that z was sampled, defines the sampled counterfactual regret on iteration t for action a at I, r̃_i^t(I, a) = ṽ_i(σ^t_{(I→a)}, I) − ṽ_i(σ^t, I). The sampled counterfactual values are unbiased estimates of the true counterfactual values (Lanctot et al. 2009a, Lemma 1). In outcome sampling, for example, only the regrets along the sampled terminal history are computed (all others are zero by definition). Outcome sampling converges to equilibrium faster than vanilla CFR in a number of different games (Lanctot et al. 2009a, Figure 1).

As we sample fewer actions at a given node, the sampled counterfactual value is potentially less accurate. Figure 1b illustrates this point in the case of outcome sampling. Here, an "informative" sampled counterfactual value for just a single action is obtained at each information set along the sampled block (history). All other actions are assigned a sampled counterfactual value of zero. While E_Q[ṽ_i(σ, I)] = v_i(σ, I), variance is introduced, affecting both the regret updates and the value recursed back to the parent. As we will see in the next section, this variance plays an important role in the number of iterations required to converge.
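The following short Python sketch (ours, with an assumed flat input format) mirrors the MCCFR estimator above: sampled terminals are importance-corrected by q(z), and any action whose subtree was not sampled simply contributes an empty list, i.e. a value of zero.

    def sampled_counterfactual_value(sampled_terminals):
        """v~_i(sigma, I): importance-corrected sum over the sampled terminals in Z_I ∩ Q.

        `sampled_terminals` is a list of (u_i, reach_minus_i, reach_tail, q_z) tuples;
        a non-sampled action corresponds to an empty list and therefore a value of 0.
        """
        return sum(u * reach_mi * reach_tail / q_z
                   for u, reach_mi, reach_tail, q_z in sampled_terminals)

    # Outcome sampling at the information set of Figure 1b: only a_1 was sampled.
    v_a1 = sampled_counterfactual_value([(1.0, 0.2, 0.5, 0.01)])   # one sampled terminal
    v_a2 = sampled_counterfactual_value([])                        # not sampled -> 0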
Generalized Sampling

Our main contributions in this paper are new theoretical findings that generalize those of MCCFR. We begin by presenting a previously established bound on the average regret achieved through MCCFR. Let |A_i| = max_{I ∈ I_i} |A(I)| and suppose δ > 0 satisfies the following: ∀z ∈ Z, either π_{-i}^σ(z) = 0 or q(z) ≥ δ > 0 at every iteration. We can then bound the difference between any two samples, ṽ_i(σ_{(I→a)}, I) − ṽ_i(σ_{(I→b)}, I) ≤ Δ̃_i = Δ_i/δ, where Δ_i = max_{z∈Z} u_i(z) − min_{z∈Z} u_i(z). The average regret can then be bounded as follows:

Theorem 1 (Lanctot et al. (2009a), Theorem 5) Let p ∈ (0, 1]. When using outcome-sampling MCCFR, with probability 1 − p, average regret is bounded by

    \frac{R_i^T}{T} \le \left( \tilde{\Delta}_i + \frac{\sqrt{2}\, \tilde{\Delta}_i}{\sqrt{p}} \right) \frac{|I_i| \sqrt{|A_i|}}{\sqrt{T}}.        (2)

A related bound holds for all MCCFR instances (Lanctot et al. 2009b, Theorem 7). We note here that Lanctot et al. present a slightly tighter bound than equation (2) where |I_i| is replaced with a game-dependent constant M_i that is independent of the sampling scheme and satisfies √|I_i| ≤ M_i ≤ |I_i|. This constant is somewhat complicated to define, and thus we omit these details here. Recall that minimizing the average regret yields an approximate Nash equilibrium. Theorem 1 suggests that the rate at which regret is minimized depends on the bound Δ̃_i on the difference between two sampled counterfactual values.

We now present a new, generalized bound on the average regret. While MCCFR provides an explicit form for the sampled counterfactual values ṽ_i(σ, I), we let v̂_i(σ, I) denote any estimator of the true counterfactual value v_i(σ, I). We can then define the estimated counterfactual regret on iteration t for action a at I to be r̂_i^t(I, a) = v̂_i(σ^t_{(I→a)}, I) − v̂_i(σ^t, I). This generalization creates many possibilities not considered in MCCFR. For instance, instead of sampling a block Q of terminal histories, one can consider a sampled set of information sets and only update regrets at those sampled locations. Another example is provided later in the paper.

The following lemma probabilistically bounds the average regret in terms of the variance, covariance, and bias between the estimated and true counterfactual regrets:

Lemma 1 Let p ∈ (0, 1] and suppose that there exists a bound Δ̂_i on the difference between any two estimates, v̂_i(σ_{(I→a)}, I) − v̂_i(σ_{(I→b)}, I) ≤ Δ̂_i. If strategies are selected according to regret matching on the estimated counterfactual regrets, then with probability at least 1 − p, the average regret is bounded by

    \frac{R_i^T}{T} \le |I_i| \sqrt{|A_i|} \left( \frac{\hat{\Delta}_i}{\sqrt{T}} + \sqrt{\frac{\mathrm{Var}}{pT} + \frac{\mathrm{Cov}}{p} + \frac{\mathrm{E}^2}{p}} \right),

where

    \mathrm{Var} = \max_{t \in \{1,\dots,T\},\, I \in I_i,\, a \in A(I)} \mathrm{Var}\!\left[ r_i^t(I, a) - \hat{r}_i^t(I, a) \right],

with Cov and E similarly defined.


The proof is similar to that of Theorem 7 by Lanctot et al. (2009b) and can be found, along with all other proofs, in the technical report (Gibson et al. 2012). Lemma 1 implies that unbiased estimators of v_i(σ, I) give a probabilistic guarantee of minimizing R_i^T / T:

Theorem 2 If, in addition to the conditions of Lemma 1, v̂_i(σ, I) is an unbiased estimator of v_i(σ, I), then with probability at least 1 − p,

    \frac{R_i^T}{T} \le \left( \hat{\Delta}_i + \sqrt{\frac{\mathrm{Var}}{p}} \right) \frac{|I_i| \sqrt{|A_i|}}{\sqrt{T}}.        (3)

Theorem 2 shows that the bound on the difference between two estimates, Δ̂_i, plays a role in bounding the average overall regret. This is not surprising, as Δ̃_i plays a similar role in Theorem 1. However, Theorem 2 provides new insight into the role played by the variance of the estimator. Given two unbiased estimators v̂_i(σ, I) and v̂_i′(σ, I) with a common bound Δ̂_i but differing variance, using the estimator with lower variance will yield a smaller bound on the average regret after T iterations. For a fixed ε > 0, this suggests that estimators with lower variance will require fewer iterations to converge to an ε-Nash equilibrium. In addition, since in MCCFR we can bound √Var ≤ √2 max{Δ̃_i, Δ_i}, equation (3) is more informative than equation (2). Furthermore, if some structure on the estimates v̂_i(σ, I) holds, we can produce a tighter bound than equation (3) by incorporating the game-dependent constant M_i introduced by Lanctot et al. (2009a). Details of this improvement are included in the technical report (Gibson et al. 2012).

While unbiased estimators with lower variance may reduce the number of iterations required, we must define these estimators carefully. If the estimator is expensive to compute, the time per iteration will be costly and overall computation time may even increase. For example, the true counterfactual value v_i(σ, I) has zero variance, but computing the value with vanilla CFR is too time consuming in large games. In the next section, we present a new bounded, unbiased estimator that exhibits lower variance than ṽ_i(σ, I) and can be computed nearly as fast.

A New CFR Sampling Algorithm

We now provide an example of how our new theoretical findings can be leveraged to reduce the computation time required to obtain an ε-Nash equilibrium. Our example is an extension of MCCFR that attempts to reduce variance by replacing "zeroed-out" counterfactual values of player i's non-sampled actions with closer estimates of the true counterfactual values. Figure 1c illustrates this idea. The simplest instance of our new algorithm "probes" each non-sampled action a at I for its counterfactual value. A probe is a single Monte Carlo roll-out, starting with action a at I and selecting subsequent actions according to the current strategy σ^t until a terminal history z is reached. By rolling out actions according to the current strategy, a probe is guaranteed to provide an unbiased estimate of the counterfactual value for a at I.
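A probe is straightforward to implement; the following minimal Python sketch (ours, over a toy nested-dict encoding rather than any real game interface) rolls out one trajectory on-policy and returns the terminal utility.

    import random

    def probe(node):
        """Single on-policy roll-out: follow the current profile (and chance) to a terminal.

        `node` is either a terminal utility for player i, or a dict mapping each action to a
        (probability, child) pair under sigma^t / chance -- a toy encoding assumed only here.
        """
        while isinstance(node, dict):
            actions = list(node)
            weights = [node[a][0] for a in actions]
            a = random.choices(actions, weights=weights)[0]   # sample according to sigma^t or chance
            node = node[a][1]
        return node

    # Example: probing a hypothetical non-sampled action from Figure 1c's information set.
    subtree_after_action = {"check": (0.6, 1.0), "bet": (0.4, {"call": (1.0, -2.0)})}
    estimate = probe(subtree_after_action)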
In general, one can perform multiple probes per non-sampled action, probe only a subset of the non-sampled actions, probe off-policy, or factor in multiple terminal histories per probe. While the technical report touches on this generalization (Gibson et al. 2012), our presentation here sticks to the simple, inexpensive case of one on-policy, single-trajectory probe for each non-sampled action.

We now formally define the estimated counterfactual value v̂_i(σ, I) obtained via probing, followed by a description of a new CFR sampling algorithm that updates regrets according to these estimates. Similar to MCCFR, let Q be a set of blocks spanning Z from which we sample a block Q ∈ Q for player i on every iteration. To further simplify our discussion, we will assume from here on that each Q samples a single action at every history h not belonging to player i, sampled according to the known chance probabilities σ_c or the opponent's current strategy σ_{-i}. Additionally, we assume that the set of actions sampled at I ∈ I_i, denoted Q(I), is nonempty and independent of every other set of actions sampled. While probing can be generalized to work for any choice of Q (Gibson et al. 2012), this simplification reduces the number of probabilities to compute in our algorithm and worked well in preliminary experiments. Once Q has been sampled, we form an additional set of terminal histories, or probes, B ⊆ Z \ Q, generated as follows. For each non-terminal history h with P(h) = i reached and each action a ∈ A(h) that Q does not sample (i.e., ∃z ∈ Q such that h ⊑ z, but ∀z ∈ Q, ha ⋢ z), we generate exactly one terminal history z = z_{ha} ∈ B, where z ∈ Z \ Q is selected on-policy (i.e., with probability π^σ(ha, z)). In other words, each non-sampled action is probed according to the current strategy profile σ and the known chance probabilities. Recall that Z_I is the set of terminal histories that have a prefix in the information set I. Given both Q and B, when Z_I ∩ Q ≠ ∅, our estimated counterfactual value is defined to be

    \hat{v}_i(\sigma, I) = \frac{1}{q_i(I)} \left[ \sum_{z \in Z_I \cap Q} \pi_i^{\sigma}(z[I], z)\, u_i(z) + \sum_{z_{ha} \in Z_I \cap B} \pi_i^{\sigma}(z_{ha}[I], ha)\, u_i(z_{ha}) \right],

where

    q_i(I) = \prod_{(I', a') \in X_i(I)} P[a' \in Q(I')]

is the probability that Z_I ∩ Q ≠ ∅ contributed from sampling player i's actions. Here, X_i(I) is the sequence of information set, action pairs for player i that lead to information set I, and this sequence is unique due to perfect recall.
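To make the estimator concrete, here is a small Python sketch (our own flattened representation, not the authors' code) that evaluates v̂_i(σ, I) from the sampled terminals in Z_I ∩ Q, the probed terminals in Z_I ∩ B, and the per-information-set sampling probabilities that make up q_i(I).

    def q_i(sample_probs_on_path):
        """q_i(I): product over (I', a') in X_i(I) of P[a' in Q(I')]."""
        q = 1.0
        for p in sample_probs_on_path:
            q *= p
        return q

    def probing_estimate(sampled, probed, q_I):
        """hat{v}_i(sigma, I) for the case Z_I ∩ Q ≠ ∅.

        `sampled` holds (pi_i(z[I], z), u_i(z)) pairs for z in Z_I ∩ Q;
        `probed`  holds (pi_i(z_ha[I], ha), u_i(z_ha)) pairs for z_ha in Z_I ∩ B.
        """
        total = sum(p * u for p, u in sampled) + sum(p * u for p, u in probed)
        return total / q_I

    # Example: one sampled terminal, two probed actions, and Q(I') choices made w.p. 1 and 0.5.
    value = probing_estimate(sampled=[(1.0, 2.0)], probed=[(0.5, -1.0), (0.5, 3.0)],
                             q_I=q_i([1.0, 0.5]))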
When Z_I ∩ Q = ∅, v̂_i(σ, I) is defined to be zero.

Proposition 1 If q_i(I) > 0 for all I ∈ I_i, then v̂_i(σ, I) is a bounded, unbiased estimate of v_i(σ, I).

Proposition 1 and Theorem 2 provide a probabilistic guarantee that updating regrets according to our estimated counterfactual values will minimize the average regret, and thus produce an ε-Nash equilibrium. Note that the differences ṽ_i(σ_{(I→a)}, I) − ṽ_i(σ_{(I→b)}, I) and v̂_i(σ_{(I→a)}, I) − v̂_i(σ_{(I→b)}, I) are both bounded above by Δ̂_i = Δ_i/δ, where δ = min_{I ∈ I_i} q_i(I).


Thus, Theorem 2 suggests that variance reduction should lead to less regret after a fixed number of iterations. Probing specifically aims to achieve variance reduction through v̂_i(σ, I) when only a strict subset of player i's actions are sampled. Note that if we always sample all of player i's actions, we have B = ∅ and v̂_i(σ, I) = ṽ_i(σ, I).

Algorithm 1 provides pseudocode for our new algorithm that updates regrets according to our estimated counterfactual values.

Algorithm 1 CFR sampling with probing
 1: Require: ∀I, action set sampling distribution Q(I)
 2: Initialize regret: ∀I, ∀a ∈ A(I): r_I[a] ← 0
 3: Initialize cumulative profile: ∀I, ∀a ∈ A(I): s_I[a] ← 0
 4:
 5: function Probe(history h, player i):
 6:   if h ∈ Z then
 7:     return u_i(h)
 8:   else if h ∈ P(c) then
 9:     Sample action a ∼ σ_c(h, ·)
10:   else
11:     I ← Information set containing h
12:     σ ← RegretMatching(r_I)
13:     Sample action a ∼ σ
14:   end if
15:   return Probe(ha, i)
16:
17: function WalkTree(history h, player i, sample prob q):
18:   if h ∈ Z then
19:     return u_i(h)
20:   else if h ∈ P(c) then
21:     Sample action a ∼ σ_c(h, ·)
22:     return WalkTree(ha, i, q)
23:   else if h ∉ P(i) then
24:     I ← Information set containing h
25:     σ ← RegretMatching(r_I)
26:     for a ∈ A(I) do
27:       s_I[a] ← s_I[a] + (σ[a]/q)
28:     end for
29:     Sample action a ∼ σ
30:     return WalkTree(ha, i, q)
31:   end if
32:   I ← Information set containing h
33:   σ ← RegretMatching(r_I)
34:   Sample action set Q(I) ∼ Q(I)
35:   for a ∈ A(I) do
36:     if a ∈ Q(I) then
37:       q′ ← q · P_{Q(I)}[a ∈ Q(I)]
38:       v[a] ← WalkTree(ha, i, q′)
39:     else
40:       v[a] ← Probe(ha, i)
41:     end if
42:   end for
43:   for a ∈ A(I) do
44:     r_I[a] ← r_I[a] + (1/q)(v[a] − Σ_{b∈A(I)} σ[b]v[b])
45:   end for
46:   return Σ_{a∈A(I)} σ[a]v[a]
47:
48: function Solve(iterations T):
49:   for t ∈ {1, 2, ..., T} do
50:     WalkTree(∅, 1, 1)
51:     WalkTree(∅, 2, 1)
52:   end for

The Probe function recurses down the tree from history h, following a single trajectory according to the known chance probabilities (line 9) and the current strategy obtained through regret matching (line 13; see equation (1)) until the utility at a single terminal history is returned (line 7). The WalkTree function is the main part of the algorithm and contains three major cases. First, if the current history h is a terminal history, we simply return the utility (line 19). Second, if player i does not act at h (lines 20 to 31), then our assumption on Q dictates that a single action is traversed on-policy (lines 21 and 29). The third case covers player i acting at h (lines 32 to 46). After sampling a set of actions (line 34), the value of each action a, v[a], is obtained. For each sampled action, we obtain its value by recursing down that action (line 38) after updating the sample probability for future histories (line 37). While MCCFR assigns zero value to each non-sampled action, Algorithm 1 instead obtains these action values through the Probe function (line 40). Note that q = q_i(I) is the probability of reaching I contributed from sampling player i's actions, and that the estimated counterfactual value v̂_i(σ_{(I→a)}, I) = v[a]/q. After obtaining all values, the regret of each action is updated (line 44).

Running the Solve function for a large enough number of iterations T will produce an approximate Nash equilibrium σ̄, where σ̄(I, a) = s_I[a] / Σ_{b∈A(I)} s_I[b]. Note that Algorithm 1 updates the cumulative profile (line 27) in a slightly different manner than the MCCFR algorithm presented by Lanctot et al. (2009b, Algorithm 1). First, we divide the update by the probability q_i(I) of reaching I under Q (which Lanctot et al. call updating "stochastically"), as opposed to the "optimistic averaging" that the authors mention is not technically correct. Second, we update player i's part of the profile during the opponent's traversal instead of during player i's traversal. Doing so ensures that any information set that will contribute positive probability to the cumulative profile can be reached, which along with stochastic updating produces an unbiased update. These changes are not specific to our new algorithm and can also be applied to MCCFR.

Experimental Results

In this section, we compare MCCFR to our new sampling algorithm in three domains, which we now describe.

Goofspiel(n) is a card-bidding game consisting of n rounds. Each player begins with a hand of bidding cards numbered 1 to n. In our version, on round k, players secretly and simultaneously play one bid from their remaining cards, and the player with the highest bid receives n − k + 1 points; in the case of a tie, no points are awarded. The player with the highest score after n rounds receives a utility of +1 and the other player earns −1, and both receive 0 utility in a tie. Our version of Goofspiel is less informative than conventional Goofspiel, as players know which of the previous bids were won or lost, but not which cards the opponent played.
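As a small illustration of the scoring rule just described (a sketch of the stated rules, not of the authors' implementation), one round of Goofspiel(n) can be resolved as follows.

    def resolve_round(bid1, bid2, k, n):
        """Round k of Goofspiel(n): the higher bid wins n - k + 1 points; a tie awards nothing."""
        prize = n - k + 1
        if bid1 > bid2:
            return prize, 0
        if bid2 > bid1:
            return 0, prize
        return 0, 0

    # Round 2 of Goofspiel(6): player 1 bids 5, player 2 bids 3, so player 1 scores 5 points.
    assert resolve_round(5, 3, k=2, n=6) == (5, 0)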


Bluff(D_1, D_2) is a dice-bidding game played over a number of rounds. Each player i starts with D_i six-sided dice. In each round, players roll their dice and look at the result without showing their opponent. Then, players alternate by bidding a quantity of a face value, q-f, of all dice in play until one player claims that the other is bluffing (i.e., claims that the bid does not hold). To place a new bid, a player must increase q or f of the current bid. A face of 6 is considered "wild" and counts as any other face value. The player calling bluff wins the round if the opponent's last bid is incorrect, and loses otherwise. The losing player removes one of their dice from the game and a new round begins. Once a player has no more dice left, that player loses the game and receives a utility of −1, while the winning player earns +1 utility.

Finally, we consider heads-up (i.e., two-player) limit Texas hold'em poker that is played over four betting rounds. To begin, each player is dealt two private cards. In later rounds, public community cards are revealed, with a fifth and final card appearing in the last round. During each betting round, players can either fold (forfeit the game), call (match the previous bet), or raise (increase the previous bet), with a maximum of four raises per round. If neither player folds, then the player with the highest ranked poker hand wins all of the bets. Hold'em contains approximately 3 × 10^14 information sets, making the game intractable for any equilibrium computation technique. A common approach in poker is to apply a card abstraction that merges similar card dealings together into a single chance "bucket" (Gilpin and Sandholm 2006). We apply a ten-bucket abstraction that reduces the branching factor at each chance node down to ten, where dealings are grouped according to expected hand strength squared, as described by Zinkevich et al. (2008). This abstract game contains roughly 57 million information sets.

We use domain knowledge and our intuition to select the sampling schemes Q. By our earlier assumption, we always sample a single action on-policy when P(h) ≠ i, as is done in Algorithm 1. For the traversing player i, we focus on sampling actions leading to more "important" parts of the tree, while sampling other actions less frequently. Doing so updates the regret at the important information sets more frequently to quickly improve play at those locations. In Goofspiel, we always sample the lowest and highest bids, while sampling each of the remaining bids independently with probability 0.5. Strong play can be achieved by only ever playing the highest bid (giving the best chance at winning the bid) or the lowest bid (sacrificing the current bid, leaving higher cards for winning future bids), suggesting that these actions will often be taken in equilibrium. In Bluff(2,2), we always sample "bluff" and the bids 1-5, 2-5, 1-6, 2-6, and, for each face x that we roll, n-x for all 1 ≤ n ≤ 4. Bidding on the highest face is generally the best bluff since the opponent's next bid must increase the quantity, and bidding on one's own dice roll is more likely to be correct. Finally, in hold'em, we always sample fold and raise actions, while sampling call with probability 0.5. Folds are cheap to sample (since the game ends) and raise actions increase the number of bets and consequently the magnitude of the utilities.
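As one concrete instance, the Goofspiel scheme described above could be implemented as the following Python sketch (ours); the returned per-action probabilities correspond to the P[a ∈ Q(I)] factors used in the q′ update of Algorithm 1 (line 37).

    import random

    def sample_bid_set(remaining_bids):
        """Goofspiel action-set sampling: always keep the lowest and highest remaining bids,
        and include each other bid independently with probability 0.5."""
        lo, hi = min(remaining_bids), max(remaining_bids)
        sampled = {lo, hi}
        for b in remaining_bids:
            if b not in (lo, hi) and random.random() < 0.5:
                sampled.add(b)
        # P[a in Q(I)] for each action, needed for the q' update on line 37 of Algorithm 1
        probs = {b: (1.0 if b in (lo, hi) else 0.5) for b in remaining_bids}
        return sampled, probs

    Q_I, P_in_Q = sample_bid_set([1, 2, 4, 6])   # 1 and 6 always sampled; 2 and 4 each w.p. 0.5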
First, we performed a test run of CFR in Goofspiel(6) that measured the empirical variance of the samples ṽ_i(σ, I) provided by MCCFR and of v̂_i(σ, I) provided by Algorithm 1. During each iteration t of the test run, we performed 2000 traversals with no regret or strategy updates, where the first 1000 traversals computed ṽ_i(σ^t, I) and the second 1000 computed v̂_i(σ^t, I) at the root I of the game. Both ṽ_i(σ^t, I) and v̂_i(σ^t, I) were computed under the same sampling scheme Q described above for Goofspiel. Once the empirical variance of each estimator was recorded from the samples at time t, a full vanilla CFR traversal was then performed to update the regrets and acquire the next strategy σ^{t+1}. The first 150 empirical variances are reported in Figure 2. Since the estimators are unbiased, the variance here is also equal to the mean squared error of the estimates. Over 1000 test iterations, the average variances were 0.295 for MCCFR and 0.133 for Algorithm 1. This agrees with our earlier intuition that probing reduces variance and provides some validation for our choice of estimator.

Figure 2: Empirical Var[ṽ_i(σ^t, I)] and Var[v̂_i(σ^t, I)] over iterations at the root of Goofspiel(6) in a test run of CFR.

Next, we performed five runs for each of MCCFR and Algorithm 1, each under the same sampling schemes Q described above. Similar to Algorithm 1, our MCCFR implementation also performs stochastic averaging during the opponent's tree traversal. For each domain, the average of the results is provided in Figure 3. Our new algorithm converges faster than MCCFR in all three domains.

Figure 3: Exploitability over time of strategies computed by MCCFR and by Algorithm 1 using identical sampling schemes Q, averaged over five runs, in (a) Goofspiel(7), (b) Bluff(2,2), and (c) Texas hold'em. Error bars indicate 95% confidence intervals at each of the five averaged data points. In hold'em, exploitability is measured in terms of milli-big-blinds per game (mbb/g).
In particular, at our final data points, Algorithm 1 shows a 31%, 10%, and 18% improvement over MCCFR in Goofspiel(7), Bluff(2,2), and Texas hold'em respectively. For both Goofspiel(7) and hold'em, the improvement was statistically significant. In Goofspiel(7), for example, the level of exploitability reached by MCCFR's last averaged data point is reached by Algorithm 1 in nearly half the time.

Conclusion

We have provided a new theoretical framework that generalizes MCCFR and provides new insights into how the estimated values affect the rate of convergence to an approximate Nash equilibrium. As opposed to the sampled counterfactual values ṽ_i(σ, I) explicitly defined by MCCFR, we considered any estimate v̂_i(σ, I) of the true counterfactual values.


We showed that the average regret is minimized (probabilistically) when the estimates are bounded and unbiased. In addition, we derived an upper bound on the average regret in terms of the variance of the estimates, suggesting that estimators with lower variance will converge to an ε-Nash equilibrium in fewer iterations. Finally, we provided an example of a non-MCCFR algorithm that reduces variance with little computational overhead by probing non-sampled actions. Our new algorithm approached equilibrium faster than its MCCFR counterpart in all of the reported experiments. We suspect that there are other efficiently-computable definitions of v̂_i(σ, I) that are bounded, unbiased, and exhibit lower variance than our probing example. Future work will attempt to further improve convergence rates through such alternative definitions.

Acknowledgements

We would like to thank the members of the Computer Poker Research Group at the University of Alberta for their helpful suggestions throughout this project. This work was supported by NSERC, Alberta Innovates – Technology Futures, and the use of computing resources provided by WestGrid and Compute Canada.

References

Gibson, R.; Lanctot, M.; Burch, N.; Szafron, D.; and Bowling, M. 2012. Generalized sampling and variance in counterfactual regret minimization. Technical Report TR12-02, University of Alberta.
Gilpin, A., and Sandholm, T. 2006. A competitive Texas Hold'em poker player via automated abstraction and real-time equilibrium computation. In Twenty-First Conference on Artificial Intelligence (AAAI), 1007–1013.
Hart, S., and Mas-Colell, A. 2000. A simple adaptive procedure leading to correlated equilibrium. Econometrica 68:1127–1150.
Hoda, S.; Gilpin, A.; Peña, J.; and Sandholm, T. 2010. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research 35(2):494–512.
Koller, D.; Megiddo, N.; and von Stengel, B. 1994. Fast algorithms for finding randomized strategies in game trees. In Annual ACM Symposium on Theory of Computing (STOC'94), 750–759.
Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009a. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22 (NIPS), 1078–1086.
Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009b. Monte Carlo sampling for regret minimization in extensive games. Technical Report TR09-15, University of Alberta.
Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2008. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20 (NIPS), 905–912.
