tributed fashion, since P is a product distribution. Finally, the techniques for modifying agents' utilities in RL-based systems often require rerunning the MAS from different initial conditions, or equivalently exploiting knowledge of the functional form of G to evaluate how particular changes to the initial conditions would affect G. PD theory provides us with techniques for improving convergence even when we have no such knowledge, and when we cannot rerun the system.

In the next section we review the game-theory motivation of PD theory. We then present details of our Lagrangian-minimization algorithm. We end with computer experiments involving both controlling a spin glass (a system of coupled binary variables) and controlling the agents in a variant of the bar problem [13]. These experiments validate both some of the predictions of PD theory of how best to modify the agents' utilities from G, and of how to improve convergence when one cannot rerun the system.

2. Bounded Rational Game Theory

In this section we motivate PD theory as the information-theoretic formulation of bounded rational game theory.

2.1. Review of noncooperative game theory

In noncooperative game theory one has a set of N players. Each player i has its own set of allowed pure strategies. A mixed strategy is a distribution q_i(x_i) over player i's possible pure strategies.
Each player i also has a private utility function g_i that maps the pure strategies adopted by all N of the players into the real numbers. So given the mixed strategies of all the players, the expected utility of player i is

E(g_i) = ∫dx ∏_j q_j(x_j) g_i(x).¹

In a Nash equilibrium every player adopts the mixed strategy that maximizes its expected utility, given the mixed strategies of the other players. More formally, for all i,

q_i = argmax_{q'_i} ∫dx q'_i(x_i) ∏_{j≠i} q_j(x_j) g_i(x).

Perhaps the major objection that has been raised to the Nash equilibrium concept is its assumption of full rationality [5, 6]. This is the assumption that every player i can both calculate what the strategies q_{j≠i} will be and then calculate its associated optimal distribution. In other words, it is the assumption that every player will calculate the entire joint distribution q(x) = ∏_j q_j(x_j). If for no other reason than the computational limitations of real humans, this assumption is essentially untenable.

2.2. Review of the maximum entropy principle

Shannon was the first person to realize that, based on any of several separate sets of very simple desiderata, there is a unique real-valued quantification of the amount of syntactic information in a distribution P(y). He showed that this amount of information is (the negative of) the Shannon entropy of that distribution,

S(P) = −∫dy P(y) ln[P(y)/µ(y)].

So, for example, the distribution with minimal information is the one that doesn't distinguish at all between the various y, i.e., the uniform distribution.
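As a quick numerical illustration of this claim, the following sketch (assuming natural logarithms and a uniform reference measure µ) compares the entropy of a uniform distribution with that of a fully concentrated one:

```python
import math

def entropy(p):
    """Shannon entropy S(p) = -sum_y p(y) ln p(y), with uniform reference measure."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # distinguishes nothing: maximal entropy
peaked  = [1.0, 0.0, 0.0, 0.0]      # specifies a single y: zero entropy

print(entropy(uniform))  # ln(4) ≈ 1.386
print(entropy(peaked))   # 0.0
```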
Conversely, the most informative distribution is the one that specifies a single possible y. Note that for a product distribution, entropy is additive, i.e., S(∏_i q_i(y_i)) = Σ_i S(q_i).

Say we are given some incomplete prior knowledge about a distribution P(y). How should one estimate P(y) based on that prior knowledge? Shannon's result tells us how to do that in the most conservative way: have your estimate of P(y) contain the minimal amount of extra information beyond that already contained in the prior knowledge about P(y). Intuitively, this can be viewed as a version of Occam's razor. This approach is called the maximum entropy (maxent) principle. It has proven useful in domains ranging from signal processing to supervised learning [7].

2.3. Maxent Lagrangians

Much of the work on equilibrium concepts in game theory adopts the perspective of an external observer of a game. We are told something concerning the game, e.g., its utility functions, information sets, etc., and from that wish to predict what joint strategy will be followed by real-world players of the game. Say that in addition to such information, we are told the expected utilities of the players. What is our best estimate of the distribution q that generated those expected utility values?
By the maxent principle, it is the distribution with maximal entropy, subject to those expectation values.

To formalize this, for simplicity assume a finite number of players and of possible strategies for each player. To agree with the convention in other fields, from now on we implicitly flip the sign of each g_i so that the associated player i wants to minimize that function rather than maximize it. Intuitively, this flipped g_i(x) is the "cost" to player i when the joint strategy is x, though we will still use the term "utility".

¹ Throughout this paper, the integral sign is implicitly interpreted as appropriate, e.g., as Lebesgue integrals, point-sums, etc.

Then for prior knowledge that the expected utilities of the players are given by the set of values {ε_i}, the maxent estimate of the associated q is given by the minimizer of the Lagrangian

L(q) ≡ Σ_i β_i [E_q(g_i) − ε_i] − S(q)    (1)
     = Σ_i β_i [∫dx ∏_j q_j(x_j) g_i(x) − ε_i] − S(q)    (2)

where the subscript on the expectation value indicates that it is evaluated under distribution q, and the {β_i} are "inverse temperatures" implicitly set by the constraints on the expected utilities.

Solving, we find that the mixed strategies minimizing the Lagrangian are related to each other via

q_i(x_i) ∝ e^{−E_{q(i)}(G | x_i)}    (3)

where the overall proportionality constant for each i is set by normalization, and G ≡ Σ_i β_i g_i.² In Eq. 3 the probability of player i choosing pure strategy x_i depends on the effect of that choice on the utilities of the other players. This reflects the fact that our prior knowledge concerns all the players equally.

If we wish to focus only on the behavior of player i, it is appropriate to modify our prior knowledge. To see how to do this, first consider the case of maximal prior knowledge, in which we know the actual joint strategy of the players, and therefore all of their expected costs. For this case, trivially, the maxent principle says we should "estimate" q as that joint strategy (it being the q with maximal entropy that is consistent with our prior knowledge). The same conclusion holds if our prior knowledge also includes the expected cost of player i.

Modify this maximal set of prior knowledge by removing from it the specification of player i's strategy. So our prior knowledge is the mixed strategies of all players other than i, together with player i's expected cost. We can incorporate prior knowledge of the other players' mixed strategies directly, without introducing Lagrange parameters.
The resultant maxent Lagrangian is

L_i(q_i) ≡ β_i [ε_i − E(g_i)] − S_i(q_i)
         = β_i [ε_i − ∫dx ∏_j q_j(x_j) g_i(x)] − S_i(q_i)

solved by a set of coupled Boltzmann distributions:

q_i(x_i) ∝ e^{−β_i E_{q(i)}(g_i | x_i)}.    (4)

² The subscript q(i) on the expectation value indicates that it is evaluated according to the distribution ∏_{j≠i} q_j.

Following Nash, we can use Brouwer's fixed point theorem to establish that for any non-negative values {β}, there must exist at least one product distribution given by the product of these Boltzmann distributions (one term in the product for each i).

The first term in L_i is minimized by a perfectly rational player. The second term is minimized by a perfectly irrational player, i.e., by a perfectly uniform mixed strategy q_i. So β_i in the maxent Lagrangian explicitly specifies the balance between the rational and irrational behavior of the player. In particular, for β → ∞, by minimizing the Lagrangians we recover the Nash equilibria of the game. More formally, in that limit the set of q that simultaneously minimize the Lagrangians is the same as the set of delta functions about the Nash equilibria of the game. The same is true for Eq. 3.

Eq. 3 is just a special case of Eq. 4, where all players share the same private utility, G. (Such games are known as team games.) This relationship reflects the fact that for this case, the difference between the maxent Lagrangian and the one in Eq. 2 is independent of q_i.
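The coupled Boltzmann distributions of Eq. 4 can be found numerically by simple fixed-point iteration. The following is a minimal sketch for a hypothetical two-player, two-move team game; the cost matrix G and the inverse temperatures β_i are illustrative assumptions, not values from the text:

```python
import math

# Illustrative shared cost G(x0, x1) for a 2x2 team game (an assumption):
# joint move (0, 0) is cheapest.
G = [[0.0, 1.0],
     [1.0, 0.2]]
beta = [2.0, 2.0]             # inverse temperatures for the two players
q = [[0.5, 0.5], [0.5, 0.5]]  # start from uniform mixed strategies

for _ in range(200):
    for i in range(2):
        j = 1 - i
        # Conditional expected cost E_{q(i)}(G | x_i): average G over the
        # other player's current mixed strategy, with x_i held fixed.
        exp_cost = [sum(q[j][xj] * (G[xi][xj] if i == 0 else G[xj][xi])
                        for xj in range(2)) for xi in range(2)]
        w = [math.exp(-beta[i] * c) for c in exp_cost]  # Boltzmann weights
        z = sum(w)
        q[i] = [wi / z for wi in w]  # q_i(x_i) ∝ e^{-beta_i E(G | x_i)}

print(q)  # both players concentrate probability on the lower-cost move 0
```

Raising β sharpens the distributions toward the delta functions of the Nash equilibrium, in line with the β → ∞ limit discussed above.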
Due to this relationship, our guarantee of the existence of a solution to the set of maxent Lagrangians implies the existence of a solution of the form Eq. 3. Typically players will be closer to minimizing their expected cost than maximizing it. For prior knowledge consistent with such a case, the β_i are all non-negative.

For each player i define

f_i(x, q_i(x_i)) ≡ β_i g_i(x) + ln[q_i(x_i)].    (5)

Then the maxent Lagrangian for player i is

L_i(q) = ∫dx q(x) f_i(x, q_i(x_i)).    (6)

Now in a bounded rational game every player sets its strategy to minimize its Lagrangian, given the strategies of the other players. In light of Eq. 6, this means that we interpret each player in a bounded rational game as being perfectly rational for a utility that incorporates its computational cost. To do so we simply need to expand the domain of "cost functions" to include probability values as well as joint moves.

Often our prior knowledge will not consist of exact specification of the expected costs of the players, even if that knowledge arises from watching the players make their moves. Such alternative kinds of prior knowledge are addressed in [10, 11]. Those references also demonstrate the extension of the formulation to allow multiple utility functions of the players, and even variable numbers of players. Also discussed there are semi-coordinate transformations, under which, intuitively, the moves of the agents are modified to be set in binding contracts.

3. Optimizing the Lagrangian

First we introduce the shorthand

[G | x_i] ≡ E(G | x_i)    (7)
         = ∫dx′ δ(x′_i − x_i) G(x′) ∏_{j≠i} q_j(x′_j),    (8)
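The shorthand [G | x_i] is simply the expected value of G over the other players' moves with player i's move held fixed. A minimal sketch for discrete move spaces (the cost function and mixed strategies below are illustrative assumptions):

```python
from itertools import product

def cond_expect_G(G, q, i, xi):
    """E(G | x_i): average G over all joint moves with player i's move
    clamped to xi, weighted by the other players' mixed strategies."""
    n = len(q)
    total = 0.0
    moves = [range(len(q[j])) for j in range(n)]
    for x in product(*moves):
        if x[i] != xi:
            continue  # the delta function clamps player i's move to xi
        weight = 1.0
        for j in range(n):
            if j != i:
                weight *= q[j][x[j]]  # product over the other players
        total += weight * G(x)
    return total

# Illustrative 3-player example: G counts the players choosing move 1.
G = lambda x: float(sum(x))
q = [[0.5, 0.5], [0.3, 0.7], [0.9, 0.1]]
print(cond_expect_G(G, q, 0, 1))  # expected value: 1 + 0.7 + 0.1 = 1.8
```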