tion sets A(h) must be identical for all h ∈ I, and we denote this set by A(I). We assume perfect recall, which guarantees that players always remember information that was revealed to them and the order in which it was revealed.

A strategy for player i, σ_i ∈ Σ_i, is a function that maps each information set I ∈ I_i to a probability distribution over A(I). A strategy profile is a vector of strategies σ = (σ_1, ..., σ_{|N|}) ∈ Σ, one for each player. Define u_i(σ) to be the expected utility for player i, given that all players play according to σ. We let σ_{-i} refer to the strategies in σ excluding σ_i.

Let π^σ(h) be the probability of history h occurring if all players choose actions according to σ. We can decompose
\[
\pi^\sigma(h) = \prod_{i \in N \cup \{c\}} \pi_i^\sigma(h)
\]
into each player's and chance's contribution to this probability. Here, π_i^σ(h) is the contribution to this probability from player i when playing according to σ_i. Let π_{-i}^σ(h) be the product of all players' contributions (including chance's) except that of player i. Furthermore, let π^σ(h, h′) be the probability of history h′ occurring after h, given that h has occurred. Let π_i^σ(h, h′) and π_{-i}^σ(h, h′) be defined similarly.

A best response to σ_{-i} is a strategy that maximizes player i's expected payoff against σ_{-i}. The best response value for player i is the value of that strategy, b_i(σ_{-i}) = max_{σ′_i ∈ Σ_i} u_i(σ′_i, σ_{-i}). A strategy profile σ is an ε-Nash equilibrium if no player can unilaterally deviate from σ and gain more than ε; i.e., u_i(σ) + ε ≥ b_i(σ_{-i}) for all i ∈ N.

In this paper, we will focus on two-player zero-sum games: N = {1, 2} and u_1(z) = −u_2(z) for all z ∈ Z. In this case, the exploitability of σ, e(σ) = (b_1(σ_2) + b_2(σ_1))/2, measures how much σ loses to a worst-case opponent when players alternate positions. A 0-Nash equilibrium (or simply a Nash equilibrium) has zero exploitability.

Counterfactual Regret Minimization (CFR) is an iterative procedure that, for two-player zero-sum games, obtains an ε-Nash equilibrium in O(|H||I_i|/ε²) time (Zinkevich et al. 2008, Theorem 4). On each iteration t, CFR (or “vanilla CFR”) recursively traverses the entire game tree, calculating the expected utility for player i at each information set I ∈ I_i under the current profile σ^t, assuming player i plays to reach I. This expectation is the counterfactual value for player i,
\[
v_i(\sigma, I) = \sum_{z \in Z_I} u_i(z)\, \pi_{-i}^\sigma(z[I])\, \pi^\sigma(z[I], z),
\]
where Z_I is the set of terminal histories passing through I and z[I] is the prefix of z contained in I. For each action a ∈ A(I), these values determine the counterfactual regret at iteration t, r_i^t(I, a) = v_i(σ^t_{(I→a)}, I) − v_i(σ^t, I), where σ_{(I→a)} is the profile σ except that at I, action a is always taken. This process is shown visually in Figure 1a. The regret r_i^t(I, a) measures how much player i would rather play action a at I than play σ^t.
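As a concrete illustration of the per-information-set computation just described, the following is a minimal Python sketch of one vanilla CFR update at a single information set. This is not the authors' implementation; the infoset fields, the sigma mapping, and the counterfactual_value callback are assumed names introduced only for illustration.

```python
# Minimal sketch (assumed names, not the paper's implementation) of the vanilla CFR
# computation at one information set I, following Figure 1a:
#   v_i(sigma_(I->a), I) for each action, v_i(sigma, I), and r_i^t(I, a).

def cfr_update_at_infoset(infoset, sigma, counterfactual_value):
    """infoset.actions is A(I); infoset.regret_sum accumulates R_i^T(I, a).
    sigma maps each action a in A(I) to sigma(I, a).
    counterfactual_value(infoset, a) is an assumed callback returning
    v_i(sigma_(I->a), I), computed by recursing below action a."""
    # Counterfactual value of always taking action a at I.
    action_values = {a: counterfactual_value(infoset, a) for a in infoset.actions}

    # v_i(sigma, I) = sum_a sigma(I, a) * v_i(sigma_(I->a), I), as in Figure 1a.
    value = sum(sigma[a] * action_values[a] for a in infoset.actions)

    # r_i^t(I, a) = v_i(sigma_(I->a), I) - v_i(sigma, I), accumulated over iterations.
    for a in infoset.actions:
        infoset.regret_sum[a] += action_values[a] - value

    # The counterfactual value at I is returned to the parent.
    return value
```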
Figure 1: (a) The computed values at information set I during vanilla CFR. First, for each action, the counterfactual values are recursively computed. The counterfactual regrets are then computed before returning the counterfactual value at I to the parent. (b) The computed values at I during outcome sampling. Here, only action a_1 is sampled and its sampled counterfactual value is recursively computed. The remaining two actions are effectively assigned zero sampled counterfactual value. The sampled counterfactual regrets are then computed before returning the sampled counterfactual value at I to the parent. (c) An example of computed values at I during our new sampling algorithm. In this example, again only a_1 is sampled and its estimated counterfactual value is recursively computed. The remaining two actions are “probed” to improve both the estimated counterfactual regrets and the returned estimated counterfactual value at I.

The counterfactual regrets
\[
R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)
\]
are accumulated and σ^t is updated by applying regret matching (Hart and Mas-Colell 2000; Zinkevich et al. 2008) to the accumulated regrets.
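Regret matching itself is straightforward to state: the next strategy at I plays each action with probability proportional to its positive accumulated regret, and uniformly at random when no action has positive regret. Below is a small sketch under the same assumed data layout as the previous example.

```python
def regret_matching(infoset):
    """Compute sigma^{t+1}(I, .) from the accumulated regrets R_i^T(I, a):
    probabilities proportional to positive accumulated regret, uniform when
    all accumulated regrets are non-positive (Hart and Mas-Colell 2000)."""
    positive = {a: max(infoset.regret_sum[a], 0.0) for a in infoset.actions}
    total = sum(positive.values())
    if total > 0:
        return {a: positive[a] / total for a in infoset.actions}
    # No positive regret: fall back to the uniform distribution over A(I).
    return {a: 1.0 / len(infoset.actions) for a in infoset.actions}
```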
