
Coordination in Multiagent Reinforcement Learning: A Bayesian Approach


the repeated game itself.⁵ Standard Bayesian updating methods are used to maintain these beliefs over time, and best responses are played by the agent w.r.t. the expectation over strategy profiles.

⁵ A strategy in the repeated game is any mapping from the observed history of play to a (stochastic) action choice. This admits the possibility of modeling other agents' learning processes.

Repeated games are a special case of stochastic games [15, 5], which can be viewed as a multiagent extension of MDPs. Formally, a stochastic game G = ⟨α, {A_i}_{i∈α}, S, Pr, {R_i}_{i∈α}⟩ consists of five components. The agents α and action sets A_i are as in a typical game, and the components S and Pr are as in an MDP, except that Pr now refers to joint actions a ∈ A = × A_i. R_i is the reward function for agent i, defined over states s ∈ S (or state-action pairs ⟨s, a⟩ ∈ S × A). The aim of each agent is, as with an MDP, to act so as to maximize the expected sum of discounted rewards. However, the presence of other agents requires treating this problem in a game-theoretic fashion. For particular classes of games, such as zero-sum stochastic games [15, 5], algorithms like value iteration can be used to compute Markovian equilibrium strategies.

The existence of multiple "stage game" equilibria is again a problem that plagues the construction of optimal strategies for stochastic games. Consider another simple identical interest example, the stochastic game shown in Figure 3(a). In this game, there are two optimal strategy profiles that maximize reward. In both, the first agent chooses to "opt in" at state s_1 by choosing action a, which takes the agents (with high probability) to state s_2; then at s_2 both agents either choose a or both choose b, and either joint strategy gives an optimal equilibrium.

Intuitively, the existence of these two equilibria gives rise to a coordination problem at state s_2. If the agents choose their part of the equilibrium randomly, there is a 0.5 chance that they miscoordinate at s_2, thereby obtaining an expected immediate reward of 0. On this basis, one might be tempted to propose methods whereby the agents decide to "opt out" at s_1 (the first agent takes action b) and obtain the safe payoff of 6. However, if we allow some means of coordination (for example, simple learning rules like fictitious play or randomization), the sequential nature of this problem means that the short-term risk of miscoordination at s_2 can be more than compensated for by the eventual stream of high payoffs should the agents coordinate. Boutilier [1] argues that the solution of games like this, assuming some (generally, history-dependent) mechanism for resolving these stage game coordination problems, requires explicit reasoning about the odds and benefits of coordination, the expected cost of attempting to coordinate, and the alternative courses of action.

Repeated games can be viewed as stochastic games with a single state. If our concern is not with stage-game equilibrium, but with the reward accrued over the sequence of interactions, techniques for solving stochastic games can be applied to solving such repeated games.
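For the identical interest games discussed here, one such technique is plain value iteration over joint actions (the zero-sum case of [15, 5] would replace the max with a minimax value). The following is our own sketch of that computation, not code from the paper; it assumes the common reward function and transition model are supplied as NumPy arrays.

```python
import numpy as np

def joint_action_value_iteration(R, P, gamma=0.95, tol=1e-6):
    """Value iteration over joint actions for an identical interest stochastic game.

    R: array (S, A) of common rewards R(s, a), one column per joint action a.
    P: array (S, A, S) of transition probabilities Pr(t | s, a).
    Returns Q(s, a); any greedy joint action argmax_a Q(s, a) is an optimal
    Markovian joint policy, though ties among maximizers still leave the
    equilibrium-selection problem discussed in the text.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)                 # V(s) = max over joint actions
        Q_new = R + gamma * P.dot(V)      # Bellman backup on joint actions
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```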
Furthermore, the points made above regarding the risks associated with using specific learning rules for coordination apply with equal force to repeated games.

2.3 Multiagent RL

In this section, we describe some existing approaches to MARL, and point out recent efforts to augment standard RL schemes to encourage RL agents in multiagent systems to converge to optimal equilibria. Intuitively, MARL can be viewed as the direct or indirect application of RL techniques to stochastic games in which the underlying model (i.e., transitions and rewards) is unknown. In some cases, it is assumed that the learner is even unaware of (or chooses to ignore) the existence of other agents.

Formally, we suppose we have some underlying stochastic game G = ⟨α, {A_i}_{i∈α}, S, Pr, {R_i}_{i∈α}⟩. We consider the case where each agent knows the "structure" of the game; that is, it knows the set of agents, the actions available to each agent, and the set of states, but knows neither the dynamics Pr nor the reward functions R_i. The agents learn how to act in the world through experience. At each point in time, the agents are at a known state s, and each agent i executes one of its actions; the resulting joint action a induces a transition to state t and a reward r_i for each agent. We assume that each agent can observe the actions chosen by the other agents, the resulting state t, and the rewards r_i.

Littman [12] devised an extension of Q-learning for zero-sum Markov games called minimax-Q. At each state, the agents have estimated Q-values over joint actions, which can be used to compute an (estimated) equilibrium value at that state. Minimax-Q converges to the equilibrium value of the stochastic game [13]. Hu and Wellman [7] apply similar ideas (using an equilibrium computation on estimated Q-values to estimate state values) to general-sum games, with somewhat weaker convergence guarantees. Algorithms have also been devised for agents that do not observe the behavior of their counterparts [2, 8].

Identical interest games have drawn much attention, providing suitable models for task distribution among teams of agents. Claus and Boutilier [3] proposed several MARL methods for repeated games in this context. A simple joint-action learner (JAL) protocol learned the (myopic, or one-stage) Q-values of joint actions. The novelty of this approach lies in its exploration strategy: (a) a fictitious play protocol estimates the strategies of other agents; and (b) exploration is biased by the expected Q-value of actions. Specifically, the estimated value of an action is given by its expected Q-value, where the expectation is taken w.r.t. the fictitious play beliefs over the other agents' strategies. When semi-greedy exploration is used, this method will converge to an equilibrium in the underlying stage game.
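To make the JAL exploration bias concrete, here is a small sketch of how expected Q-values are computed from fictitious play beliefs. This is our own illustration; the payoff values below are made up for the example and are not the paper's exact penalty game matrix.

```python
import numpy as np

def jal_expected_values(Q, opponent_counts):
    """Expected Q-value of each of our actions under fictitious play beliefs.

    Q: array (n_self_actions, n_opponent_actions) of learned joint-action Q-values.
    opponent_counts: observed frequency of each opponent action.
    Returns EV(a_i) = sum_j Pr(a_j) * Q(a_i, a_j), which is then used to bias
    (e.g., Boltzmann) exploration toward promising actions.
    """
    beliefs = opponent_counts / opponent_counts.sum()   # empirical mixed strategy
    return Q @ beliefs

# Illustrative penalty-game-style payoffs (values invented for this sketch).
Q = np.array([[ 10.0,  0.0, -20.0],
              [  0.0,  2.0,   0.0],
              [-20.0,  0.0,  10.0]])
counts = np.array([6.0, 3.0, 1.0])                      # observed opponent actions
print(jal_expected_values(Q, counts))                   # expected value of each own action
```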
One drawback of the JAL method is the fact that the equilibrium it converges to depends on the specific path of play, which is stochastic. Certain equilibria can exhibit serious resistance; for example, the odds of converging to an optimal equilibrium in the penalty game above are quite small (and decrease dramatically with the magnitude of the penalty). Claus and Boutilier propose several heuristic methods that bias exploration toward optimal equilibria: for instance, action selection can be biased toward actions that form part of an optimal equilibrium. In the penalty game, despite the fact that agent B may be predicted to play a strategy that makes action a0 look unpromising, repeated play of a0 by A can be justified by optimistically assuming that B will play its part of this optimal equilibrium. This is further motivated by the fact that repeated play of a0 would eventually draw B toward this equilibrium.

This issue of learning optimal equilibria in identical interest games has recently been addressed in much greater detail. Lauer and Riedmiller [11] describe a Q-learning method for identical interest stochastic games that explicitly embodies this optimistic assumption in its Q-value estimates. Kapetanakis and Kudenko [10] propose a method called FMQ for repeated games that uses the optimistic assumption to bias exploration, much like [3], but in the context of individual learners. Wang and Sandholm [16] similarly use the optimistic assumption in repeated games to guarantee convergence to an optimal equilibrium. We critique these methods below.
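The optimistic assumption shared by these methods can be illustrated with a short sketch in the spirit of Lauer and Riedmiller's update [11]. This is our paraphrase under simplifying assumptions (a single learner's individual Q-table and an assumed action set), not their exact algorithm.

```python
ACTIONS = ["a", "b"]   # assumed action labels for this sketch

def optimistic_q_update(q, s, a_i, r, t, gamma=0.95):
    """Optimistic Q-update for an individual learner in an identical interest game.

    q maps (state, own_action) -> value estimate. The estimate is only ever
    revised upward, which amounts to assuming the other agents will eventually
    play their part of the best joint action; low returns caused by
    miscoordination are simply ignored rather than averaged in.
    """
    best_next = max(q.get((t, b), 0.0) for b in ACTIONS)
    q[(s, a_i)] = max(q.get((s, a_i), 0.0), r + gamma * best_next)
    return q

q = optimistic_q_update({}, "s1", "a", 0.0, "s2")   # usage example
```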


3. A BAYESIAN VIEW OF MARL

The spate of activity described above on MARL in identical interest games has focused exclusively on devising methods that ensure eventual convergence to optimal equilibria. In cooperative games, this pursuit is well-defined, and in some circumstances may be justified. However, these methods do not account for the fact that, by forcing agents to undertake actions that have potentially drastic effects in order to reach an optimal equilibrium, they can have a dramatic impact on accumulated reward. The penalty game was devised to show that these highly penalized states can bias (supposedly rational) agents away from certain equilibria; yet optimistic exploration methods ignore this and blindly pursue these equilibria at all costs. Under certain performance metrics (e.g., average reward over an infinite horizon) one might justify these techniques.⁶ However, under the discounted reward criterion (for which all of these methods are designed), the tradeoff between long-term benefit and short-term cost should be addressed.

⁶ Even then, more refined measures such as bias optimality might cast these techniques in a less favorable light.

This tradeoff was discussed above in the context of known-model stochastic games. In this section, we attempt to address the same tradeoff in the RL context. To do so, we formulate a Bayesian approach to model-based MARL. By maintaining probabilistic beliefs over the space of models and the space of opponent strategies, our learning agents can explicitly account for the effects their actions can have on (a) their knowledge of the underlying model; (b) their knowledge of the other agents' strategies; (c) expected immediate reward; and (d) the expected future behavior of other agents. Components (a) and (c) are classical parts of the single-agent Bayesian RL model [4]. Components (b) and (d) are key to the multiagent extension, allowing an agent to explicitly reason about the potential costs and benefits of coordination.

3.1 Theoretical Underpinnings

We assume a stochastic game G in which each agent knows the game structure, but not the reward or transition models. A learning agent i is able to observe the actions taken by all agents, the resulting game state, and the rewards received by other agents. Thus an agent's experience at each point in time is simply ⟨s, a, r_i, t⟩, where s is the state in which joint action a was taken, r_i is the reward received, and t is the resulting state.

A Bayesian MARL agent has some prior distribution over the space of possible models as well as the space of possible strategies being employed by other agents. These beliefs are updated as the agent acts and observes the results of its actions and the action choices of other agents. The strategies of other agents may be history-dependent, and we allow our Bayesian agent (BA) to assign positive support to such strategies. As such, in order to make accurate predictions about the actions others will take, the BA must monitor appropriate observable history. In general, the history (or summary thereof) required will be a function of the strategies to which the BA assigns positive support. We assume that the BA keeps track of sufficient history to make such predictions.⁷
⁷ For example, should the BA believe that its opponent's strategy lies in the space of finite state controllers that depend on the last two joint actions played, the BA will need to keep track of these last two actions. If it uses fictitious play beliefs (which can be viewed as Dirichlet priors) over strategies, no history need be maintained.

The belief state of the BA has the form b = ⟨P_M, P_S, s, h⟩, where: P_M is some density over the space of possible models (i.e., games); P_S is a joint density over the possible strategies played by other agents; s is the current state of the system; and h is a summary of the relevant aspects of game history, sufficient to predict the action of any agent given any strategy consistent with P_S. Given experience ⟨s, a, r_i, t⟩, the BA updates its belief state using standard Bayesian methods. The updated belief state is:

b′ = b(⟨s, a, r_i, t⟩) = ⟨P′_M, P′_S, t, h′⟩    (2)

Updates are given by Bayes rule: P′_M(m) = z Pr(t, r_i | a, m) P_M(m) and P′_S(σ_{-i}) = z Pr(a_{-i} | s, h, σ_{-i}) P_S(σ_{-i}), where z is a normalizing constant; h′ is a suitable update of the observed history (as described above). This model combines aspects of Bayesian reinforcement learning [4] and Bayesian strategy modeling [9].

To make belief state maintenance tractable (and admit computationally viable methods for action selection below), we assume a specific form for these beliefs [4]. First, our prior over models will be factored into independent local models for both rewards and transitions. We assume independent priors P_R^s over reward distributions at each state s, and P_D^{s,a} over system dynamics for each state and joint action pair. These local densities are Dirichlet, which are conjugate for the multinomial distributions to be learned. This means that each density can be represented using a small number of hyperparameters, expected transition probabilities can be computed readily, and the density can be updated easily. For example, our BA's prior beliefs about the transition probabilities for joint action a at state s will be represented by a vector n^{s,a} with one parameter per successor state t. Expected transition probabilities and updates of these beliefs are as described in Section 2.1. The independence of these local densities is assured after each update.

Second, we assume that the beliefs about opponent strategies can be factored and represented in some convenient form. For example, it would be natural to assume that the strategies of other agents are independent. Simple fictitious play models could be used to model the BA's beliefs about opponent strategies (corresponding to Dirichlet priors over mixed strategies), allowing ready update and computation of expectations, and obviating the need to store history in the belief state. Similarly, distributions over specific classes of finite state controllers could also be used. We will not pursue further development of such models in this paper, since we use only simple opponent models in our experiments below. But the development of tractable classes of (realistic) opponent models remains an interesting problem.
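Under these assumptions, belief maintenance reduces to simple count updates. The following sketch is our own illustration of such a factored belief state; the data layout and method names are assumptions, not something specified in the paper.

```python
from collections import defaultdict

class BeliefState:
    """Dirichlet counts implementing the factored beliefs of Section 3.1 (a sketch)."""

    def __init__(self, prior=1.0):
        # (s, joint_action) -> {successor_state: count}; counts start at the prior value
        self.trans = defaultdict(lambda: defaultdict(lambda: prior))
        # (s, joint_action) -> {reward_value: count}
        self.reward = defaultdict(lambda: defaultdict(lambda: prior))
        # s -> {opponent_action: count}: fictitious play beliefs as Dirichlet counts
        self.opp = defaultdict(lambda: defaultdict(lambda: prior))

    def update(self, s, a_self, a_other, r, t):
        """Bayesian update for experience <s, a, r_i, t>: increment the relevant counts."""
        a = (a_self, a_other)
        self.trans[(s, a)][t] += 1.0
        self.reward[(s, a)][r] += 1.0
        self.opp[s][a_other] += 1.0

    def expected_transition(self, s, a, successors, t):
        """Expected Pr(t | s, a) under the Dirichlet posterior over the given successor set."""
        counts = [self.trans[(s, a)][x] for x in successors]
        return self.trans[(s, a)][t] / sum(counts)

    def opponent_probs(self, s, opponent_actions):
        """Empirical (fictitious play) distribution over the other agent's actions at s."""
        counts = [self.opp[s][b] for b in opponent_actions]
        total = sum(counts)
        return {b: c / total for b, c in zip(opponent_actions, counts)}
```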
We provide a different perspective on Bayesian exploration than that described in [4]. The value of performing an action a_i at a belief state b can be viewed as involving two main components: an expected value with respect to the current belief state; and its impact on the current belief state. The first component is typical in RL, while the second captures the expected value of information (EVOI) of an action. Since each action gives rise to some "response" by the environment that changes the agent's beliefs, and these changes in belief can influence subsequent action choice and expected reward, we wish to quantify the value of that information by determining its impact on subsequent decisions.

EVOI need not be computed directly, but can be combined with "object-level" expected value through the following Bellman equations over the belief state MDP:

Q(a_i, b) = Σ_{a_{-i}} Pr(a_{-i} | b) Σ_t Σ_{r_i} Pr(t | a_i ∘ a_{-i}, b) Pr(r_i | a_i ∘ a_{-i}, b) [r_i + γ V(b(⟨s, a, r_i, t⟩))]    (3)

V(b) = max_{a_i} Q(a_i, b)    (4)

These equations describe the solution to the POMDP that represents the exploration-exploitation problem, by conversion to a belief state MDP. They can (in principle) be solved using any method for solving high-dimensional continuous MDPs; of course, in practice, a number of computational shortcuts and approximations will be required (as we detail below).


[Figure 1: Penalty Game Results. Total discounted reward vs. number of iterations for BVPI, BOL, WoLF, Combined Boltzmann Exploration, Optimistic Boltzmann Exploration, and KK: (a) k = −20, γ = 0.95, uninformative priors; (b) k = −20, γ = 0.75, uninformative priors; (c) k = −100, γ = 0.95, informative priors.]

We complete the specification with the straightforward definition of the following terms:

Pr(a_{-i} | b) = ∫_{σ_{-i}} Pr(a_{-i} | σ_{-i}) P_S(σ_{-i})    (5)

Pr(t | a, b) = ∫_m Pr(t | s, a, m) P_M(m)    (6)

Pr(r_i | b) = ∫_m Pr(r_i | s, m) P_M(m)    (7)

We note that the evaluation of Eqs. 6 and 7 is trivial using the decomposed Dirichlet priors mentioned above.

This formulation determines the optimal policy as a function of the BA's belief state. This policy incorporates the tradeoffs between exploration and exploitation, both with respect to the underlying (dynamics and reward) model, and with respect to the behavior of other agents. As with Bayesian RL and Bayesian learning in games, no explicit exploration actions are required. Of course, it is important to realize that this model may not converge to an optimal policy for the true underlying stochastic game. Priors that fail to reflect the true model, or unfortunate samples early on, can easily mislead an agent. But it is precisely this behavior that allows an agent to learn how to behave well without drastic penalty.

3.2 Computational Approximations

Solving the belief state MDP above will generally be computationally infeasible. In specific MARL problems, the generality of such a solution (defining as it does a value for every possible belief state) is not needed anyway. Most belief states are not reachable given a specific initial belief state. A more directed search-based method can be used to solve this MDP for the agent's current belief state b. We consider a form of myopic EVOI in which only immediate successor belief states are considered, and their values are estimated without using VOI or lookahead.

Formally, myopic action selection is defined as follows. Given belief state b, the myopic Q-function for each a_i ∈ A_i is:

Q_m(a_i, b) = Σ_{a_{-i}} Pr(a_{-i} | b) Σ_t Σ_{r_i} Pr(t | a_i ∘ a_{-i}, b) Pr(r_i | a_i ∘ a_{-i}, b) [r_i + γ V_m(b(⟨s, a, r_i, t⟩))]    (8)

V_m(b) = max_{a_i} ∫_m ∫_{σ_{-i}} Q(a_i, s | m, σ_{-i}) P_M(m) P_S(σ_{-i})    (9)

The action performed is that with the maximum myopic Q-value. Eq. 8 differs from Eq. 3 in its use of the myopic value function V_m, which is defined as the expected value of the optimal action at the current state, assuming a fixed distribution over models and strategies.⁸ Intuitively, this myopic approximation performs one-step lookahead in belief space, then evaluates these successor states by determining the expected value to the BA w.r.t. a fixed distribution over models, and a fixed distribution over successor states.

⁸ We note that V_m(b) only provides a crude measure of the value of belief state b under this fixed uncertainty. Other measures could include the expected value of V(s | m, σ_{-i}).
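As a concrete reading of Eq. 8, the following sketch enumerates opponent actions, successor states, and rewards under the current beliefs and backs up a supplied myopic value estimate. This is our illustration only; the belief-state interface named below is assumed, not defined in the paper.

```python
def myopic_q(a_i, b, V_m, gamma=0.95):
    """One-step lookahead in belief space (a sketch of Eq. 8).

    Assumed interface on the belief state b:
      b.opponent_probs()     -> {a_other: Pr(a_other | b)}   (fictitious play expectation)
      b.transition_probs(a)  -> {t: Pr(t | a, b)}            (expected Dirichlet transitions)
      b.reward_probs(a)      -> {r: Pr(r | a, b)}            (expected Dirichlet rewards)
      b.updated(a, r, t)     -> successor belief state b(<s, a, r, t>)
    V_m(b') estimates the myopic value of a successor belief state (Eq. 9).
    """
    q = 0.0
    for a_other, p_o in b.opponent_probs().items():
        a = (a_i, a_other)                           # joint action a_i composed with a_-i
        for t, p_t in b.transition_probs(a).items():
            for r, p_r in b.reward_probs(a).items():
                q += p_o * p_t * p_r * (r + gamma * V_m(b.updated(a, r, t)))
    return q
```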
This computation involves the evaluation of a finite number of successor belief states: A · R · S such states, where A is the number of joint actions, R is the number of rewards, and S is the size of the state space (unless b restricts the number of reachable states, plausible strategies, etc.). Greater accuracy can be realized with multistage lookahead, with the requisite increase in computational cost. Conversely, the myopic action can be approximated by sampling successor beliefs (using the induced distributions defined in Eqs. 5, 6, 7) if the branching factor A · R · S is problematic.

The final bottleneck involves the evaluation of the myopic value function V_m(b) over successor belief states. The Q(a_i, s | m, σ_{-i}) terms are Q-values for standard MDPs, and can be evaluated using standard methods, but direct evaluation of the integral over all models is generally impossible. However, sampling techniques can be used [4]. Specifically, some number of models can be sampled, the corresponding MDPs solved, and the expected Q-values estimated by averaging over the sampled results. Various techniques can be used to make this process more efficient as well, including importance sampling (allowing results from one MDP to be used multiple times by reweighting) and "repair" of the solution for one MDP when solving a related MDP [4].

For certain classes of problems, this evaluation can be performed directly. For instance, suppose a repeated game is being learned, and the BA's strategy model consists of fictitious play beliefs. The immediate expected reward of any action a_i taken by the BA (w.r.t. successor b′) is given by its expectation w.r.t. its estimated reward distribution and fictitious play beliefs. The maximizing action a*_i with the highest immediate reward will be the best action at all subsequent stages of the repeated game, and it has a fixed expected reward r(a*_i) under the myopic (value) assumption that beliefs are fixed by b′. Thus the long-term value at b′ is r(a*_i)/(1 − γ).
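For this repeated-game special case, the closed form r(a*_i)/(1 − γ) is easy to spell out. The sketch below is our illustration; the expected-reward helper and belief dictionary it takes are assumptions about how the beliefs are represented.

```python
def myopic_value_repeated_game(expected_reward, own_actions, opponent_probs, gamma=0.95):
    """Closed-form V_m for a repeated game with fictitious play beliefs (sketch).

    expected_reward(a_i, a_other): expected reward of a joint action under the
        agent's (Dirichlet) reward beliefs -- an assumed helper.
    opponent_probs: {a_other: probability}, the fictitious play beliefs.
    Returns r(a_i*) / (1 - gamma), the value of repeating the myopically best action.
    """
    def r(a_i):
        return sum(p * expected_reward(a_i, a_o) for a_o, p in opponent_probs.items())

    best = max(own_actions, key=r)
    return r(best) / (1.0 - gamma)
```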


[Figure 2: Chain World Results. (a) The Chain World domain: joint action ⟨a, a⟩ advances the agents from s_1 toward s_5 with reward 0 and yields reward 10 at s_5; joint action ⟨b, b⟩ yields reward 2 and resets the process. (b) Total discounted reward vs. number of iterations for BVPI and WoLF, γ = 0.75. (c) The same, γ = 0.99.]

The approaches above are motivated by approximating the direct myopic solution to the "exploration POMDP". A different approach to this approximation is proposed in [4], which estimates the (myopic) value of obtaining perfect information about Q(a, s). Suppose that, given an agent's current belief state, the expected value of action a is given by Q(a, s). Let a_1 be the action with the highest expected Q-value at state s and a_2 be the second-highest. We define the gain associated with learning that the true value of Q(a, s) (for any a) is in fact q as follows:

gain_{s,a}(q) =
    Q(a_2, s) − q,   if a = a_1 and q < Q(a_2, s)
    q − Q(a_1, s),   if a ≠ a_1 and q > Q(a_1, s)
    0,               otherwise

Intuitively, the gain reflects the effect on decision quality of learning the true Q-value of a specific action at state s. In the first two cases, what is learned causes us to change our decision (in the first case, the estimated optimal action is learned to be worse than predicted, and in the second, some other action is learned to be optimal). In the third case, no change in decision at s is induced, so the information has no impact on decision quality.

Adapted to our setting, a computational approximation to this naive sampling approach involves the following steps (sketched in code at the end of this subsection):

(a) a finite set of k models is sampled from the density P_M;
(b) each sampled MDP j is solved (w.r.t. the density P_S over strategies), giving optimal Q-values Q^j(a_i, s) for each a_i in that MDP, and average Q-value Q̄(a_i, s) (over all k MDPs);
(c) for each a_i, compute gain_{s,a_i}(Q^j(a_i, s)) for each of the k MDPs; let EVPI(a_i, s) be the average of these k values;
(d) define the value of a_i to be Q̄(a_i, s) + EVPI(a_i, s) and execute the action with the highest value.

This approach can benefit from the use of importance sampling and repair (with minor modifications). Naive sampling can be more computationally effective than one-step lookahead (which requires sampling and solving MDPs from multiple belief states). The price paid is the approximation inherent in the perfect information assumption: the execution of joint action a does not come close to providing perfect information about Q(a, s).
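Steps (a)-(d) can be put together as in the following sketch. It is our own illustration; the model sampler and MDP solver are left as assumed callables, and the gain function follows the definition reconstructed above.

```python
import numpy as np

def gain(q_bar, a, q, a1, a2):
    """Gain from learning that the true value of action a at state s is q."""
    if a == a1 and q < q_bar[a2]:
        return q_bar[a2] - q
    if a != a1 and q > q_bar[a1]:
        return q - q_bar[a1]
    return 0.0

def vpi_action(s, actions, sample_models, solve_mdp, k=5):
    """Naive-sampling VPI action selection, steps (a)-(d); needs at least two actions.

    sample_models(k): draws k models from P_M (assumed helper).
    solve_mdp(model, s): returns {a_i: Q^j(a_i, s)} for sampled model j, solved
        w.r.t. the strategy beliefs P_S (assumed helper).
    """
    models = sample_models(k)                                    # (a)
    Qs = [solve_mdp(m, s) for m in models]                       # (b)
    q_bar = {a: np.mean([Qj[a] for Qj in Qs]) for a in actions}
    ranked = sorted(actions, key=lambda a: q_bar[a], reverse=True)
    a1, a2 = ranked[0], ranked[1]
    evpi = {a: np.mean([gain(q_bar, a, Qj[a], a1, a2) for Qj in Qs])
            for a in actions}                                    # (c)
    return max(actions, key=lambda a: q_bar[a] + evpi[a])        # (d)
```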
4. EXPERIMENTAL RESULTS

We have conducted a number of experiments with both repeated and stochastic games to evaluate the Bayesian approach. We focus on two-player identical interest games largely to compare to existing methods for "encouraging" convergence to optimal equilibria. The Bayesian methods examined are one-step lookahead (BOL) and naive sampling for estimating VPI (BVPI), described in Section 3. In all cases, the Bayesian agents use a simple fictitious play model to represent their uncertainty over the other agent's strategy. Thus at each iteration, a BA believes its opponent plays each action with the empirical probability observed in the past (for multistate games, these beliefs are independent at each state).

BOL is used only for repeated games, since these allow the immediate computation of expected values over the infinite horizon at the successor belief states: expected reward for each joint action can readily be combined with the BA's fictitious play beliefs to compute the expected value of an action (over the infinite horizon), since the only model uncertainty is in the reward. BVPI is used for both repeated and multi-state games. In all cases, five models are sampled to estimate the VPI of the agent's actions. The strategy priors for the BAs are given by Dirichlet parameters ("prior counts") of 0.1 or 1 for each opponent action. The model priors are similarly uninformative, with each state-action pair given the same prior distribution over reward and transition distributions (except for one experiment as noted).

We first compare our method to several different algorithms on a stochastic version of the penalty game described above. The game is altered so that joint actions provide a stochastic reward, whose mean is the value shown in the game matrix.⁹ We compare the Bayesian approach to the following algorithms. KK is an algorithm [10] that biases exploration to encourage convergence to optimal equilibria in just these types of games. Two related heuristic algorithms that also bias exploration optimistically, the Optimistic Boltzmann (OB) and Combined OB (CB) [3], are also evaluated. Unlike KK, these algorithms observe and make predictions about the other agent's play. Finally, we test the more general algorithm WoLF-PHC [2], which works with arbitrary, general-sum stochastic games (and has no special heuristics for equilibrium selection). In each case the game is played by two learning agents of the same type. The parameters of all algorithms were empirically tuned to give good performance.

The first set of experiments tested these agents on the stochastic penalty game with k set to −20 and discount factors of 0.95 and 0.75. Both BOL and BVPI agents use uninformative priors over the set of reward values.¹⁰ Results appear in Figures 1(a) and (b), showing the total discounted reward accumulated by the learning agents, averaged over 30 trials. Discounted accumulated reward provides a suitable way to measure both the cost being paid in the attempt to coordinate and the benefits of coordination (or lack thereof).

⁹ Each joint action gives rise to X distinct rewards.
¹⁰ Specifically, any possible reward (for any joint action) is given equal (expected) probability in the agent's Dirichlet priors.


[Figure 3: "Opt In or Out" Results: Low Noise. (a) The Opt In or Out domain: the first agent opts in or out at s_1; opting out leads to the safe payoff of +6, while opting in leads to the coordination state s_2, where matched choices (⟨a, a⟩ or ⟨b, b⟩) lead to a reward of +10 and mismatched choices to −10. (b) Total discounted reward vs. number of iterations for BVPI and WoLF, γ = 0.75. (c) The same, γ = 0.99.]

The results show that both Bayesian methods perform significantly better than the methods designed to force convergence to an optimal equilibrium. Indeed, OB, CB and KK converge to an optimal equilibrium in virtually all of their 30 runs, but clearly pay a high price.¹¹ By comparison, when γ = 0.95, the ratio of convergence to an optimal equilibrium, the nonoptimal equilibrium, or a nonequilibrium is 25/1/4 for BVPI and 14/16/0 for BOL. Surprisingly, WoLF-PHC does better than OB, CB and KK in both tests. In fact, this method outperforms the BVPI agent in the case of γ = 0.75 (largely because BVPI converges to a nonequilibrium 5 times and the suboptimal equilibrium 4 times). WoLF-PHC converges in each instance to an optimal equilibrium.

¹¹ All penalty game experiments were run to 500 games, though only the interesting initial segments of the graphs are shown.

We repeated this comparison on a more "difficult" problem with a mean penalty of −100 (and increased variance of the reward for each action). To test one of the benefits of the Bayesian perspective, we included versions of BOL and BVPI with informative priors (giving them strong information about rewards by restricting the prior to assign (uniform) nonzero expected probability to the small range of truly feasible rewards for each action). Results are shown in Figure 1(c) (averaged over 30 runs). WoLF-PHC and CB converge to the optimal equilibrium each time, while the increased reward stochasticity made it impossible for KK and OB to converge at all. All four methods perform poorly w.r.t. discounted reward. The Bayesian methods perform much better. Not surprisingly, the BOL and BVPI agents with informative priors do better than their "uninformed" counterparts; however, because of the high penalty (despite the high discount factor), they converge to the suboptimal equilibrium most of the time (22 and 23 times, respectively).

We also applied the Bayesian approach to two identical-interest, multi-state, stochastic games. The first is a version of Chain World [4], modified for multiagent coordination and illustrated in Figure 2(a). The optimal joint policy is for the agents to take action a at each state, though these actions have no payoff until state s_5 is reached. Coordinating on b leads to an immediate, but smaller, payoff, and resets the process.¹² Unmatched actions ⟨a, b⟩ and ⟨b, a⟩ result in zero-reward self-transitions (omitted from the diagram for clarity). Transitions are noisy, with a 10% chance that an agent's action has the "effect" of the opposite action.

¹² As above, rewards are stochastic with the means shown in the figure.
The original Chain World is difficult for standard RL algorithms, and it is made especially difficult here by the requirement of coordination.

We compared BVPI to WoLF-PHC on this domain using two different discount factors, plotting the total discounted reward (averaged over 30 runs) in Figure 2(b) and (c). Only initial segments are shown, though the results project smoothly to 50000 iterations. BVPI compares favorably to WoLF-PHC in terms of online performance. BVPI converged to the optimal policy in 7 (of 30) runs with γ = 0.99 and in 3 runs with γ = 0.75, intuitively reflecting the increased risk aversion due to increased discounting. WoLF-PHC rarely even managed to reach state s_5, though in 2 (of 30) runs with γ = 0.75 it stumbled across s_5 early enough to converge to the optimal policy. The Bayesian approach manages to encourage intelligent exploration of the action space in a way that trades off risks and predicted rewards, and we see increased exploration with the higher discount factor, as expected.

The second multi-state game is "Opt In or Out", shown in Figure 3(a). The transitions are stochastic, with the action selected by an agent having the "effect" of the opposite action with some probability. Two versions of the problem were tested, one with low "noise" (probability 0.05 of an action effect being reversed), and one with a "medium" noise level (probability roughly 0.11). With low noise, the optimal policy is as if the domain were deterministic (the first agent opts in at s_1 and both play a coordinated choice at s_2), while with medium noise, the "opt in" policy and the "opt out" policy (where the safe move to s_6 is adopted) have roughly equal value. BVPI is compared to WoLF-PHC under two different discount rates, with low noise results shown in Figure 3(b) and (c), and medium noise results in Figure 4(a) and (b). Again BVPI compares favorably to WoLF-PHC in terms of discounted reward averaged over 30 runs. In the low noise problem, BVPI converged to the optimal policy in 18 (of 30) runs with γ = 0.99 and 15 runs with γ = 0.75. The WoLF agents converged to the optimal policy only once with γ = 0.99, but 17 times with γ = 0.75. With medium noise, BVPI chose the "opt in" policy in 10 (γ = 0.99) and 13 (γ = 0.75) runs, but learned to coordinate at s_2 even in the "opt out" cases.
Interestingly, WoLF-PHC always converged on the "opt out" policy (recall that both policies are optimal with medium noise).

Finally, we remark that the Bayesian methods do incur greater computational cost per experience than the simpler model-free RL methods we compared to. BOL in particular can be intensive, taking up to 25 times as long to select actions in the repeated game experiments as BVPI. Still, BOL computation time per step is only half a millisecond. BVPI is comparable to all other methods on repeated games. In the stochastic games, BVPI takes roughly 0.15 ms to compute an action selection, about 8 times as long as WoLF in Chain World, and 24 times as long in Opt In or Out.


[Figure 4: "Opt In or Out" Results: Medium Noise. Total discounted reward vs. number of iterations for BVPI and WoLF: (a) γ = 0.75; (b) γ = 0.99.]

5. CONCLUDING REMARKS

We have described a Bayesian approach to modeling MARL problems that allows agents to explicitly reason about their uncertainty regarding the underlying domain and the strategies of their counterparts. We've provided a formulation of optimal exploration under this model and developed several computational approximations for Bayesian exploration in MARL.

The experimental results presented here demonstrate quite effectively that Bayesian exploration enables agents to make the tradeoffs described. Our results show that this can enhance the online performance (reward accumulated while learning) of MARL agents in coordination problems, when compared to heuristic exploration techniques that explicitly try to induce convergence to optimal equilibria. This implies that BAs run the risk of converging on a suboptimal policy; but this risk is taken "willingly", through due consideration of the learning process given the agent's current beliefs about the domain. Still, we see that BAs often find optimal strategies in any case. Key to this is a BA's willingness to exploit what it knows before it is very confident in this knowledge; it simply needs to be confident enough to be willing to sacrifice certain alternatives.

While the framework is general, our experiments were confined to identical interest games and fictitious play beliefs as (admittedly simple) opponent models. These results are encouraging but need to be extended in several ways. Empirically, the application of this framework to more general problems is important to verify its utility. More work on computational approximations to estimating VPI or solving the belief state MDP is also needed. Finally, the development of computationally tractable means of representing and reasoning with distributions over strategy models is required.

Acknowledgements

Thanks to Bob Price for very helpful discussion and the use of MARL software. This research was supported by the Natural Sciences and Engineering Research Council.

6. REFERENCES

[1] C. Boutilier. Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 478–485, Stockholm, 1999.
[2] M. Bowling and M. Veloso. Rational and convergent learning in stochastic games. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages 1021–1026, Seattle, 2001.
[3] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 746–752, Madison, WI, 1998.
[4] R. Dearden, N. Friedman, and D. Andre. Model-based Bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 150–159, Stockholm, 1999.
[5] J. A. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer Verlag, New York, 1996.
[6] D. Fudenberg and D. K. Levine. The Theory of Learning in Games. MIT Press, Cambridge, MA, 1998.
[7] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 242–250, Madison, WI, 1998.
[8] P. Jehiel and D. Samet. Learning to play games in extensive form by valuation. NAJ Economics, Peer Reviews of Economics Publications, 3, 2001.
[9] E. Kalai and E. Lehrer. Rational learning leads to Nash equilibrium. Econometrica, 61(5):1019–1045, 1993.
[10] S. Kapetanakis and D. Kudenko. Reinforcement learning of coordination in cooperative multi-agent systems. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 326–331, Edmonton, 2002.
[11] M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 535–542, Stanford, CA, 2000.
[12] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163, New Brunswick, NJ, 1994.
[13] M. L. Littman and C. Szepesvári. A generalized reinforcement-learning model: Convergence and applications. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 310–318, Bari, Italy, 1996.
[14] J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54:296–301, 1951.
[15] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39:327–332, 1953.
[16] X. Wang and T. Sandholm. Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems 15 (NIPS-2002), Vancouver, 2002. To appear.
