Generalized Sampling and Variance in Counterfactual Regret ...


... action sets $A(h)$ must be identical for all $h \in I$, and we denote this set by $A(I)$. We assume perfect recall that guarantees players always remember information that was revealed to them and the order in which it was revealed.

A strategy for player $i$, $\sigma_i \in \Sigma_i$, is a function that maps each information set $I \in \mathcal{I}_i$ to a probability distribution over $A(I)$. A strategy profile is a vector of strategies $\sigma = (\sigma_1, \ldots, \sigma_{|N|}) \in \Sigma$, one for each player. Define $u_i(\sigma)$ to be the expected utility for player $i$, given that all players play according to $\sigma$. We let $\sigma_{-i}$ refer to the strategies in $\sigma$ excluding $\sigma_i$.

Let $\pi^\sigma(h)$ be the probability of history $h$ occurring if all players choose actions according to $\sigma$. We can decompose

$$\pi^\sigma(h) = \prod_{i \in N \cup \{c\}} \pi_i^\sigma(h)$$

into each player's and chance's contribution to this probability. Here, $\pi_i^\sigma(h)$ is the contribution to this probability from player $i$ when playing according to $\sigma_i$. Let $\pi_{-i}^\sigma(h)$ be the product of all players' contribution (including chance) except that of player $i$. Furthermore, let $\pi^\sigma(h, h')$ be the probability of history $h'$ occurring after $h$, given $h$ has occurred. Let $\pi_i^\sigma(h, h')$ and $\pi_{-i}^\sigma(h, h')$ be defined similarly.

A best response to $\sigma_{-i}$ is a strategy that maximizes player $i$'s expected payoff against $\sigma_{-i}$. The best response value for player $i$ is the value of that strategy, $b_i(\sigma_{-i}) = \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i})$. A strategy profile $\sigma$ is an $\epsilon$-Nash equilibrium if no player can unilaterally deviate from $\sigma$ and gain more than $\epsilon$; i.e., $u_i(\sigma) + \epsilon \ge b_i(\sigma_{-i})$ for all $i \in N$.

In this paper, we will focus on two-player zero-sum games: $N = \{1, 2\}$ and $u_1(z) = -u_2(z)$ for all $z \in Z$. In this case, the exploitability of $\sigma$, $e(\sigma) = (b_1(\sigma_2) + b_2(\sigma_1))/2$, measures how much $\sigma$ loses to a worst case opponent when players alternate positions.
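As a small illustration of best response values and exploitability, the sketch below evaluates $e(\sigma)$ in a two-player zero-sum matrix game. This is a simplifying assumption, not the extensive-form setting of the paper: each player has a single information set, so a strategy is just a probability vector. The names `best_response_values`, `exploitability`, and the rock-paper-scissors payoff table `RPS` are hypothetical and chosen only for this example. Because expected utility is linear in a player's own strategy, it is enough to maximize over pure actions.

```python
from typing import Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def best_response_values(
    u1: Matrix, sigma1: Sequence[float], sigma2: Sequence[float]
) -> Tuple[float, float]:
    """Return (b_1(sigma_2), b_2(sigma_1)) for a zero-sum matrix game.

    u1[a][b] is player 1's payoff when player 1 plays row a and
    player 2 plays column b; player 2 receives -u1[a][b].
    """
    n_rows, n_cols = len(u1), len(u1[0])
    # Player 1 picks the row with the highest expected payoff against sigma2.
    b1 = max(
        sum(sigma2[b] * u1[a][b] for b in range(n_cols)) for a in range(n_rows)
    )
    # Player 2 picks the column with the highest expected payoff against sigma1.
    b2 = max(
        sum(sigma1[a] * -u1[a][b] for a in range(n_rows)) for b in range(n_cols)
    )
    return b1, b2


def exploitability(
    u1: Matrix, sigma1: Sequence[float], sigma2: Sequence[float]
) -> float:
    """e(sigma) = (b_1(sigma_2) + b_2(sigma_1)) / 2."""
    b1, b2 = best_response_values(u1, sigma1, sigma2)
    return (b1 + b2) / 2.0


# Rock-paper-scissors payoffs for player 1 (rows and columns: rock, paper, scissors).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

print(exploitability(RPS, uniform, uniform))
print(exploitability(RPS, always_rock, uniform))
```

In this toy game the uniform profile cannot be exploited, while a profile in which player 1 always plays rock is exploited by an opponent who always plays paper.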
A 0-Nash equilibrium (or simply a Nash equilibrium) has zero exploitability.

Counterfactual Regret Minimization (CFR) is an iterative procedure that, for two-player zero-sum games, obtains an $\epsilon$-Nash equilibrium in $O(|H||\mathcal{I}_i|/\epsilon^2)$ time (Zinkevich et al. 2008, Theorem 4). On each iteration $t$, CFR (or “vanilla CFR”) recursively traverses the entire game tree, calculating the expected utility for player $i$ at each information set $I \in \mathcal{I}_i$ under the current profile $\sigma^t$, assuming player $i$ plays to reach $I$. This expectation is the counterfactual value for player $i$,

$$v_i(\sigma, I) = \sum_{z \in Z_I} u_i(z)\, \pi_{-i}^\sigma(z[I])\, \pi^\sigma(z[I], z),$$

where $Z_I$ is the set of terminal histories passing through $I$ and $z[I]$ is the prefix of $z$ contained in $I$. For each action $a \in A(I)$, these values determine the counterfactual regret at iteration $t$, $r_i^t(I, a) = v_i(\sigma^t_{(I \to a)}, I) - v_i(\sigma^t, I)$, where $\sigma_{(I \to a)}$ is the profile $\sigma$ except at $I$, action $a$ is always taken. This process is shown visually in Figure 1a. The regret $r_i^t(I, a)$ measures how much player $i$ would rather play action $a$ at $I$ than play $\sigma^t$. The counterfactual regrets

$$R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)$$

are accumulated and $\sigma^t$ is updated by applying regret matching (Hart and Mas-Colell 2000; Zinkevich et al. 2008) to the accumulated regrets,

$$\sigma^{T+1}(I, a) = \frac{R_i^{T,+}(I, a)}{\sum_{b \in A(I)} R_i^{T,+}(I, b)}$$

where $x^+ = \max\{x, 0\}$ and actions are chosen uniformly at random when the denominator is zero. This procedure minimizes the average of the counterfactual regrets, which in turn minimizes the average (external) regret $R_i^T/T$ (Zinkevich et al. 2008, Theorem 3), where

$$R_i^T = \max_{\sigma' \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma', \sigma_{-i}^t) - u_i(\sigma_i^t, \sigma_{-i}^t) \right).$$

It is well known that in a two-player zero-sum game, if $R_i^T/T < \epsilon$ for $i \in \{1, 2\}$, then the average profile $\bar{\sigma}^T$ is a $2\epsilon$-Nash equilibrium.

Figure 1: (a) The computed values at information set $I$ during vanilla CFR. First, for each action, the counterfactual values are recursively computed. The counterfactual regrets are then computed before returning the counterfactual value at $I$ to the parent. (b) The computed values at $I$ during outcome sampling. Here, only action $a_1$ is sampled and its sampled counterfactual value is recursively computed. The remaining two actions are effectively assigned zero sampled counterfactual value. The sampled counterfactual regrets are then computed before returning the sampled counterfactual value at $I$ to the parent. (c) An example of computed values at $I$ during our new sampling algorithm. In this example, again only $a_1$ is sampled and its estimated counterfactual value is recursively computed. The remaining two actions are “probed” to improve both the estimated counterfactual regrets and the returned estimated counterfactual value at $I$.

For large games, CFR's full game tree traversal can be very expensive. Alternatively, one can still obtain an approximate equilibrium by traversing a smaller, sampled portion of the tree on each iteration using Monte Carlo CFR (MCCFR) (Lanctot et al. 2009a). Let $\mathcal{Q}$ be a set of subsets, or blocks, of the terminal histories $Z$ such that the union of $\mathcal{Q}$ spans $Z$. On each iteration, a block $Q \in \mathcal{Q}$ is sampled according to a probability distribution over $\mathcal{Q}$. Outcome sampling is an example of MCCFR that uses blocks containing a single terminal history ($Q = \{z\}$). On each iteration of outcome sampling, the block is chosen during traversal by sampling a single action at the current decision point until a terminal history is reached. The sampled counterfactual value for player $i$,

$$\tilde{v}_i(\sigma, I) = \sum_{z \in Z_I \cap Q} u_i(z)\, \pi_{-i}^\sigma(z[I])\, \pi^\sigma(z[I], z) / q(z) \tag{1}$$

where $q(z)$ is the probability that $z$ was sampled, defines the sampled counterfactual regret on iteration $t$ for action $a$ at $I$, $\tilde{r}_i^t(I, a) = \tilde{v}_i(\sigma^t_{(I \to a)}, I) - \tilde{v}_i(\sigma^t, I)$. The sampled counterfactual values are unbiased estimates of the true counterfactual values (Lanctot et al. 2009a, Lemma 1). In outcome sampling, for example, only the regrets along the sampled terminal history are computed (all others are zero by definition). Outcome sampling converges to equilibrium faster than vanilla CFR in a number of different games (Lanctot et al. 2009a, Figure 1).

As we sample fewer actions at a given node, the sampled counterfactual value is potentially less accurate. Figure 1b illustrates this point in the case of outcome sampling. Here, an “informative” sampled counterfactual value for just a single action is obtained at each information set along the sampled block (history). All other actions are assigned a sampled counterfactual value of zero. While $E_Q[\tilde{v}_i(\sigma, I)] = v_i(\sigma, I)$, variance is introduced, affecting both the regret updates and the value recursed back to the parent. As we will see in the next section, this variance plays an important role in the number of iterations required to converge.

Generalized Sampling

Our main contributions in this paper are new theoretical findings that generalize those of MCCFR. We begin by presenting a previously established bound on the average regret achieved through MCCFR. Let $|A_i| = \max_{I \in \mathcal{I}_i} |A(I)|$ and suppose $\delta > 0$ satisfies the following: $\forall z \in Z$ either $\pi_{-i}^\sigma(z) = 0$ or $q(z) \ge \delta > 0$ at every iteration. We can then bound the difference between any two samples $\tilde{v}_i(\sigma_{(I \to a)}, I) - \tilde{v}_i(\sigma_{(I \to b)}, I) \le \tilde{\Delta}_i = \Delta_i / \delta$, where $\Delta_i = \max_{z \in Z} u_i(z) - \min_{z \in Z} u_i(z)$. The average regret can then be bounded as follows:

Theorem 1 (Lanctot et al. (2009a), Theorem 5) Let $p \in (0, 1]$. When using outcome-sampling MCCFR, with probability $1 - p$, average regret is bounded by

$$\frac{R_i^T}{T} \le \left( \tilde{\Delta}_i + \frac{\sqrt{2}\, \tilde{\Delta}_i}{\sqrt{p}} \right) \frac{|\mathcal{I}_i| \sqrt{|A_i|}}{\sqrt{T}}. \tag{2}$$

A related bound holds for all MCCFR instances (Lanctot et al. 2009b, Theorem 7). We note here that Lanctot et al. present a slightly tighter bound than equation (2) where $|\mathcal{I}_i|$ is replaced with a game-dependent constant $M_i$ that is independent of the sampling scheme and satisfies $\sqrt{|\mathcal{I}_i|} \le M_i \le |\mathcal{I}_i|$. This constant is somewhat complicated to define, and thus we omit these details here.
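To make the quantities behind Theorem 1 concrete, the following sketch works through regret matching, outcome-sampled regrets, and the right-hand side of equation (2) on a toy problem. It assumes a single information set whose actions each lead directly to a terminal history, so the counterfactual value of action `a` is simply `values[a]` and the sampling probability of that history is `q[a]`. All function names and numbers are hypothetical and serve only as an illustration; this is not the authors' implementation.

```python
import math
from typing import List, Sequence


def regret_matching(cum_regret: Sequence[float]) -> List[float]:
    """Next strategy at one information set from accumulated regrets."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total <= 0.0:
        # Zero denominator: play uniformly at random.
        return [1.0 / len(cum_regret)] * len(cum_regret)
    return [r / total for r in positive]


def sampled_regrets(values, sigma, q, sampled) -> List[float]:
    """Sampled counterfactual regrets when only action `sampled` is explored.

    Toy assumption: action a is sampled with probability q[a] > 0 and
    ends the game with counterfactual value values[a].
    """
    # Importance-weighted value for the sampled action, zero for the rest.
    v_tilde = [
        values[a] / q[a] if a == sampled else 0.0 for a in range(len(values))
    ]
    node_value = sum(s * v for s, v in zip(sigma, v_tilde))
    return [v - node_value for v in v_tilde]


def expected_sampled_regrets(values, sigma, q) -> List[float]:
    """Exact expectation of the sampled regrets over the sampled action."""
    expected = [0.0] * len(values)
    for sampled, prob in enumerate(q):
        for a, r in enumerate(sampled_regrets(values, sigma, q, sampled)):
            expected[a] += prob * r
    return expected


def true_regrets(values, sigma) -> List[float]:
    """Counterfactual regrets computed from the exact action values."""
    node_value = sum(s * v for s, v in zip(sigma, values))
    return [v - node_value for v in values]


def theorem1_bound(delta_u, delta, n_infosets, max_actions, T, p) -> float:
    """Right-hand side of equation (2), with Delta-tilde = delta_u / delta."""
    delta_tilde = delta_u / delta
    scale = n_infosets * math.sqrt(max_actions) / math.sqrt(T)
    return (delta_tilde + math.sqrt(2.0) * delta_tilde / math.sqrt(p)) * scale


values = [1.0, -1.0, 0.5]    # hypothetical counterfactual values per action
sigma = [0.5, 0.25, 0.25]    # current strategy at the information set
q = [0.5, 0.25, 0.25]        # sampling probabilities (here equal to sigma)

print(true_regrets(values, sigma))
print(expected_sampled_regrets(values, sigma, q))
print(regret_matching([2.0, -1.0, 2.0]))
print(theorem1_bound(2.0, min(q), n_infosets=1, max_actions=3, T=10_000, p=0.05))
```

The first two printed lists agree up to floating-point rounding, which is the unbiasedness property in this toy setting. Individual samples can still be far from the true regrets, since each one divides by `q[a]`, and that same division is what makes $\tilde{\Delta}_i = \Delta_i / \delta$ large when $\delta$ is small.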
Recall that minimizing the average regret yields an approximate Nash equilibrium. Theorem 1 suggests that the rate at which regret is minimized depends on the bound $\tilde{\Delta}_i$ on the difference between two sampled counterfactual values.

We now present a new, generalized bound on the average regret. While MCCFR provides an explicit form for the sampled counterfactual values $\tilde{v}_i(\sigma, I)$, we let $\hat{v}_i(\sigma, I)$ denote any estimator of the true counterfactual value $v_i(\sigma, I)$. We can then define the estimated counterfactual regret on iteration $t$ for action $a$ at $I$ to be $\hat{r}_i^t(I, a) = \hat{v}_i(\sigma^t_{(I \to a)}, I) - \hat{v}_i(\sigma^t, I)$. This generalization creates many possibilities not considered in MCCFR. For instance, instead of sampling a block $Q$ of terminal histories, one can consider a sampled set of information sets and only update regrets at those sampled locations. Another example is provided later in the paper. The following lemma probabilistically bounds the average regret in terms of the variance, covariance, and bias between the estimated and true counterfactual regrets:

Lemma 1 Let $p \in (0, 1]$ and suppose that there exists a bound $\hat{\Delta}_i$ on the difference between any two estimates, $\hat{v}_i(\sigma_{(I \to a)}, I) - \hat{v}_i(\sigma_{(I \to b)}, I) \le \hat{\Delta}_i$. If strategies are selected according to regret matching on the estimated counterfactual regrets, then with probability at least $1 - p$, the average regret is bounded by

$$\frac{R_i^T}{T} \le |\mathcal{I}_i| \sqrt{|A_i|} \left( \frac{\hat{\Delta}_i}{\sqrt{T}} + \sqrt{\frac{\mathrm{Var}}{pT} + \frac{\mathrm{Cov}}{p} + \frac{\mathrm{E}^2}{p}} \right)$$

where

$$\mathrm{Var} = \max_{t \in \{1, \ldots, T\},\; I \in \mathcal{I}_i,\; a \in A(I)} \mathrm{Var}\left[ r_i^t(I, a) - \hat{r}_i^t(I, a) \right],$$

with Cov and E similarly defined.
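A short numerical sketch can show how the terms of the Lemma 1 bound trade off. The function below simply evaluates the right-hand side for given inputs; the name `lemma1_bound`, its parameter names, and every number passed to it are assumptions made for illustration and are not taken from the paper's experiments. The arguments `var`, `cov`, and `bias` stand in for the lemma's Var, Cov, and E terms.

```python
import math


def lemma1_bound(delta_hat, n_infosets, max_actions, T, p,
                 var, cov=0.0, bias=0.0):
    """Right-hand side of the Lemma 1 bound on the average regret R_i^T / T."""
    radicand = var / (p * T) + cov / p + bias ** 2 / p
    if radicand < 0.0:
        # A sufficiently negative covariance term would leave no real root.
        raise ValueError("Var/(pT) + Cov/p + E^2/p must be non-negative")
    scale = n_infosets * math.sqrt(max_actions)
    return scale * (delta_hat / math.sqrt(T) + math.sqrt(radicand))


# Hypothetical problem size and confidence level shared by all calls.
common = dict(delta_hat=4.0, n_infosets=10, max_actions=4, T=10_000, p=0.5)

high_variance = lemma1_bound(var=8.0, **common)
low_variance = lemma1_bound(var=2.0, **common)
biased = lemma1_bound(var=2.0, bias=0.1, **common)

print(high_variance, low_variance, biased)
```

With these made-up inputs, lowering `var` lowers the bound, while a nonzero `bias` adds a term that does not shrink as `T` grows, since only the Var term is divided by $T$.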