
Product distribution theory for control of multi-agent systems

Chiu Fan Lee
Physics Department, Oxford University
Parks Road, Oxford OX1 3PU, U.K.
c.lee1@physics.ox.ac.uk

David H. Wolpert
MS 269-1, NASA Ames Research Center
Moffett Field, CA 94035, USA
dhw@ptolemy.arc.nasa.gov

Abstract

Product Distribution (PD) theory is a new framework for controlling Multi-Agent Systems (MAS's). First we review one motivation of PD theory, as the information-theoretic extension of conventional full-rationality game theory to the case of bounded rational agents. In this extension the equilibrium of the game is the optimizer of a Lagrangian of the (probability distribution of the) joint state of the agents. Accordingly we can consider a team game in which the shared utility is a performance measure of the behavior of the MAS. For such a scenario the game is at equilibrium, i.e., the Lagrangian is optimized, when the joint distribution of the agents optimizes the system's expected performance. One common way to find that equilibrium is to have each agent run a reinforcement learning algorithm. Here we investigate the alternative of exploiting PD theory to run gradient descent on the Lagrangian. We present computer experiments validating some of the predictions of PD theory for how best to do that gradient descent. We also demonstrate how PD theory can improve performance even when we are not allowed to rerun the MAS from different initial conditions, a requirement implicit in some previous work.

1. Introduction

Product Distribution (PD) theory is a recently introduced broad framework for analyzing, controlling, and optimizing distributed systems [9, 10, 11]. Among its potential applications are adaptive, distributed control of a Multi-Agent System (MAS), (constrained) optimization, sampling of high-dimensional probability densities (i.e., improvements to Metropolis sampling), density estimation, numerical integration, reinforcement learning, information-theoretic bounded rational game theory, population biology, and management theory. Some of these are investigated in [2, 1, 8, 3].

Here we investigate PD theory's use for adaptive, distributed control of a MAS. Typically such control is done by having each agent run its own reinforcement learning algorithm [4, 13, 14, 12].
In this approach the utility function of each agent is based on the world utility G(x) mapping the joint move of the agents, x ∈ X, to the performance of the overall system. However in practice the agents in a MAS are bounded rational. Moreover the equilibrium they reach will typically involve mixed strategies rather than pure strategies, i.e., they don't settle on a single point x optimizing G(x). This suggests formulating an approach that explicitly accounts for the bounded rational, mixed-strategy character of the agents.

Now in any game, bounded rational or otherwise, the agents are independent, with each agent i choosing its move x_i at any instant by sampling its probability distribution (mixed strategy) at that instant, q_i(x_i). Accordingly, the distribution of the joint moves is a product distribution, P(x) = ∏_i q_i(x_i). In this representation of a MAS, all coupling between the agents occurs indirectly; it is the separate distributions of the agents {q_i} that are statistically coupled, while the actual moves of the agents are independent.
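The product-distribution representation is easy to make concrete. The sketch below is our own illustration, not code from the original work (the names sample_joint and expected_G are ours): it draws joint moves x by sampling each agent's mixed strategy independently and estimates the expected world utility E_P(G) by Monte Carlo, which is the kind of quantity the Lagrangian introduced below is built from.

```python
import numpy as np

def sample_joint(q, rng):
    """Draw one joint move x = (x_1, ..., x_N) from the product
    distribution P(x) = prod_i q_i(x_i); q[i] is agent i's mixed strategy
    over its (finitely many) moves."""
    return np.array([rng.choice(len(q_i), p=q_i) for q_i in q])

def expected_G(G, q, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_P(G) under the product distribution q."""
    rng = np.random.default_rng(seed)
    return float(np.mean([G(sample_joint(q, rng)) for _ in range(n_samples)]))

# Toy usage: three agents with two moves each, uniform mixed strategies,
# and a toy world utility (a cost to be minimized).
q = [np.array([0.5, 0.5]) for _ in range(3)]
G = lambda x: -float(np.sum(x))
print(expected_G(G, q))
```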


PD theory adopts this perspective to show that the equilibrium of a MAS is the minimizer of a Lagrangian L(P), derived using information theory, that quantifies the expected value of G for the joint distribution P(x). From this perspective, the update rules used by the agents in RL-based systems for controlling MAS's are just particular (inefficient) ways of finding that minimizing distribution.

Various techniques for modifying the agents' utilities from G to improve convergence have been used with such RL-based systems [13]. As illustrated below, by viewing the problem as one of Lagrangian minimization, PD theory can be used to derive corrections to those techniques. In addition, PD theory suggests novel ways to find the equilibrium, e.g., applying any of the powerful search techniques for continuous variables, like gradient descent, to find the P optimizing L. By casting the problem this way, in terms of finding an optimal P rather than an optimal x, we can exploit the power of search techniques for continuous variables even when X is a discrete, finite space. Moreover, typically such search techniques can be run in a highly distributed fashion, since P is a product distribution. Finally, the techniques for modifying agents' utilities in RL-based systems often require rerunning the MAS from different initial conditions, or equivalently exploiting knowledge of the functional form of G to evaluate how particular changes to the initial conditions would affect G. PD theory provides us with techniques for improving convergence even when we have no such knowledge, and when we cannot rerun the system.

In the next section we review the game-theory motivation of PD theory. We then present details of our Lagrangian-minimization algorithm. We end with computer experiments involving both controlling a spin glass (a system of coupled binary variables) and controlling the agents in a variant of the bar problem [13]. These experiments validate some of the predictions of PD theory both for how best to modify the agents' utilities from G and for how to improve convergence when one cannot rerun the system.

2. Bounded Rational Game Theory

In this section we motivate PD theory as the information-theoretic formulation of bounded rational game theory.

2.1. Review of noncooperative game theory

In noncooperative game theory one has a set of N players. Each player i has its own set of allowed pure strategies. A mixed strategy is a distribution q_i(x_i) over player i's possible pure strategies. Each player i also has a private utility function g_i that maps the pure strategies adopted by all N of the players into the real numbers. So given mixed strategies of all the players, the expected utility of player i is E(g_i) = ∫ dx ∏_j q_j(x_j) g_i(x). (Throughout this paper, the integral sign is implicitly interpreted as appropriate, e.g., as Lebesgue integrals, point-sums, etc.)

In a Nash equilibrium every player adopts the mixed strategy that maximizes its expected utility, given the mixed strategies of the other players. More formally, ∀i, q_i = argmax_{q'_i} ∫ dx q'_i(x_i) ∏_{j≠i} q_j(x_j) g_i(x). Perhaps the major objection that has been raised to the Nash equilibrium concept is its assumption of full rationality [5, 6]. This is the assumption that every player i can both calculate what the strategies q_{j≠i} will be and then calculate its associated optimal distribution. In other words, it is the assumption that every player will calculate the entire joint distribution q(x) = ∏_j q_j(x_j). If for no other reasons than computational limitations of real humans, this assumption is essentially untenable.

2.2. Review of the maximum entropy principle

Shannon was the first person to realize that based on any of several separate sets of very simple desiderata, there is a unique real-valued quantification of the amount of syntactic information in a distribution P(y).
He showed that this amount of information is (the negative of) the Shannon entropy of that distribution,

S(P) = −∫ dy P(y) ln[P(y)/μ(y)].

So for example, the distribution with minimal information is the one that doesn't distinguish at all between the various y, i.e., the uniform distribution. Conversely, the most informative distribution is the one that specifies a single possible y. Note that for a product distribution, entropy is additive, i.e., S(∏_i q_i(y_i)) = ∑_i S(q_i).

Say we are given some incomplete prior knowledge about a distribution P(y). How should one estimate P(y) based on that prior knowledge? Shannon's result tells us how to do that in the most conservative way: have your estimate of P(y) contain the minimal amount of extra information beyond that already contained in the prior knowledge about P(y). Intuitively, this can be viewed as a version of Occam's razor. This approach is called the maximum entropy (maxent) principle. It has proven useful in domains ranging from signal processing to supervised learning [7].

2.3. Maxent Lagrangians

Much of the work on equilibrium concepts in game theory adopts the perspective of an external observer of a game. We are told something concerning the game, e.g., its utility functions, information sets, etc., and from that wish to predict what joint strategy will be followed by real-world players of the game. Say that in addition to such information, we are told the expected utilities of the players. What is our best estimate of the distribution q that generated those expected utility values? By the maxent principle, it is the distribution with maximal entropy, subject to those expectation values.

To formalize this, for simplicity assume a finite number of players and of possible strategies for each player. To agree with the convention in other fields, from now on we implicitly flip the sign of each g_i so that the associated player i wants to minimize that function rather than maximize it. Intuitively, this flipped g_i(x) is the "cost" to player i when the joint strategy is x, though we will still use the term "utility".

Then for prior knowledge that the expected utilities of the players are given by the set of values {ε_i}, the maxent estimate of the associated q is given by the minimizer of the Lagrangian

L(q) ≡ ∑_i β_i [E_q(g_i) − ε_i] − S(q)    (1)
     = ∑_i β_i [∫ dx ∏_j q_j(x_j) g_i(x) − ε_i] − S(q),    (2)

where the subscript on the expectation value indicates that it is evaluated under distribution q, and the {β_i} are "inverse temperatures" implicitly set by the constraints on the expected utilities.

Solving, we find that the mixed strategies minimizing the Lagrangian are related to each other via

q_i(x_i) ∝ exp(−E_{q_(i)}(G | x_i)),    (3)

where the overall proportionality constant for each i is set by normalization, G ≡ ∑_i β_i g_i, and the subscript q_(i) on the expectation value indicates that it is evaluated according to the distribution ∏_{j≠i} q_j. In Eq. 3 the probability of player i choosing pure strategy x_i depends on the effect of that choice on the utilities of the other players. This reflects the fact that our prior knowledge concerns all the players equally.
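For intuition, Eq. 3 can be iterated numerically on a small game. The sketch below is our own illustration (not the authors' code, and only practical for tiny strategy spaces): it repeatedly recomputes each agent's conditional expectation E_{q_(i)}(G | x_i) by brute-force enumeration and resets q_i to the corresponding Boltzmann distribution.

```python
import itertools
import numpy as np

def boltzmann_iteration(G, n_agents, n_moves, iters=200):
    """Iterate the coupled conditions q_i(x_i) ∝ exp(-E_{q_(i)}(G|x_i)) of Eq. 3
    by brute force; here G is taken to already include the inverse temperatures,
    as in G = sum_i beta_i * g_i."""
    q = [np.full(n_moves, 1.0 / n_moves) for _ in range(n_agents)]
    joint_moves = list(itertools.product(range(n_moves), repeat=n_agents))
    for _ in range(iters):
        for i in range(n_agents):
            cond = np.zeros(n_moves)   # cond[x_i] accumulates E_{q_(i)}(G | x_i)
            for x in joint_moves:
                p_others = np.prod([q[j][x[j]] for j in range(n_agents) if j != i])
                cond[x[i]] += p_others * G(x)
            w = np.exp(-cond)
            q[i] = w / w.sum()
    return q

# Toy team game: the shared cost counts mismatched moves, with a slight
# preference for move 0, so every q_i tilts toward move 0.
G = lambda x: float(len(set(x)) - 1) + 0.1 * sum(x)
for q_i in boltzmann_iteration(G, n_agents=3, n_moves=2):
    print(np.round(q_i, 3))
```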
If we wish to focus only on the behavior of player i, it is appropriate to modify our prior knowledge. To see how to do this, first consider the case of maximal prior knowledge, in which we know the actual joint strategy of the players, and therefore all of their expected costs. For this case, trivially, the maxent principle says we should "estimate" q as that joint strategy (it being the q with maximal entropy that is consistent with our prior knowledge). The same conclusion holds if our prior knowledge also includes the expected cost of player i.

Modify this maximal set of prior knowledge by removing from it the specification of player i's strategy. So our prior knowledge is the mixed strategies of all players other than i, together with player i's expected cost. We can incorporate prior knowledge of the other players' mixed strategies directly, without introducing Lagrange parameters. The resultant maxent Lagrangian is

L_i(q_i) ≡ β_i [ε_i − E(g_i)] − S_i(q_i)
         = β_i [ε_i − ∫ dx ∏_j q_j(x_j) g_i(x)] − S_i(q_i),

solved by a set of coupled Boltzmann distributions:

q_i(x_i) ∝ exp(−β_i E_{q_(i)}(g_i | x_i)).    (4)

Following Nash, we can use Brouwer's fixed point theorem to establish that for any non-negative values {β_i}, there must exist at least one product distribution given by the product of these Boltzmann distributions (one term in the product for each i).

The first term in L_i is minimized by a perfectly rational player. The second term is minimized by a perfectly irrational player, i.e., by a perfectly uniform mixed strategy q_i. So β_i in the maxent Lagrangian explicitly specifies the balance between the rational and irrational behavior of the player. In particular, for β → ∞, by minimizing the Lagrangians we recover the Nash equilibria of the game. More formally, in that limit the set of q that simultaneously minimize the Lagrangians is the same as the set of delta functions about the Nash equilibria of the game. The same is true for Eq. 3.

Eq. 3 is just a special case of Eq. 4, where all players share the same private utility, G. (Such games are known as team games.) This relationship reflects the fact that for this case, the difference between the maxent Lagrangian and the one in Eq. 2 is independent of q_i. Due to this relationship, our guarantee of the existence of a solution to the set of maxent Lagrangians implies the existence of a solution of the form of Eq. 3. Typically players will be closer to minimizing their expected cost than maximizing it. For prior knowledge consistent with such a case, the β_i are all non-negative.

For each player i define

f_i(x, q_i(x_i)) ≡ β_i g_i(x) + ln[q_i(x_i)].    (5)

Then the maxent Lagrangian for player i is

L_i(q) = ∫ dx q(x) f_i(x, q_i(x_i)).    (6)

Now in a bounded rational game every player sets its strategy to minimize its Lagrangian, given the strategies of the other players. In light of Eq. 6, this means that we interpret each player in a bounded rational game as being perfectly rational for a utility that incorporates its computational cost. To do so we simply need to expand the domain of "cost functions" to include probability values as well as joint moves.

Often our prior knowledge will not consist of an exact specification of the expected costs of the players, even if that knowledge arises from watching the players make their moves. Such alternative kinds of prior knowledge are addressed in [10, 11]. Those references also demonstrate the extension of the formulation to allow multiple utility functions of the players, and even variable numbers of players. Also discussed there are semi-coordinate transformations, under which, intuitively, the moves of the agents are modified to set binding contracts.

3. Optimizing the Lagrangian

First we introduce the shorthand

[G | x_i] ≡ E(G | x_i)    (7)
          = ∫ dx' δ(x'_i − x_i) G(x') ∏_{j≠i} q_j(x'_j),    (8)

where the delta function forces x'_i = x_i in the usual way.


Now given any initial q, one may use gradient descent to search for the q optimizing L(q). Taking the appropriate partial derivatives, the descent direction is given by

Δq_i(x_i) = δL/δq_i(x_i) = [G|x_i] + β^{−1} log q_i(x_i) + C,    (9)

where C is a constant set to preserve the norm of the probability distribution after the update, i.e., set to ensure that

∫ dx_i q_i(x_i) = ∫ dx_i (q_i(x_i) + Δq_i(x_i)) = 1.    (10)

Evaluating, we find that

C = −(1/∫ dx_i 1) ∫ dx_i {[G|x_i] + β^{−1} log q_i(x_i)}.    (11)

(Note that for finite X, those integrals are just sums.)

To follow this gradient, we need an efficient scheme for estimating the conditional expected G for different x_i. Here we do this via Monte Carlo sampling, i.e., by repeatedly IID sampling q and recording the resultant private utility values. After using those samples to form an estimate of the gradient for each agent, we update the agents' distributions accordingly. We then start another block of IID sampling to generate estimates of the next gradients.
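The sketch below illustrates that inner loop for agents with finite move sets. It is our own code, with hypothetical names such as estimate_cond_G and descent_step: a block of samples gives the team-game estimate of [G|x_i], and the update of Eqs. 9-11 subtracts the mean of the gradient, which is exactly the constant C that keeps q_i normalized.

```python
import numpy as np

def estimate_cond_G(block_moves, block_G, i, n_moves):
    """Team-game estimate of [G | x_i] from one Monte Carlo block: the average
    recorded G over the samples in which agent i played x_i. (The paper instead
    forces every x_i to appear at least once in the block; here we simply fall
    back to the block mean when a move never occurred.)"""
    block_moves = np.asarray(block_moves)
    block_G = np.asarray(block_G, dtype=float)
    cond = np.empty(n_moves)
    for m in range(n_moves):
        hits = block_G[block_moves[:, i] == m]
        cond[m] = hits.mean() if hits.size else block_G.mean()
    return cond

def descent_step(q_i, cond_G, beta, step=0.2):
    """One gradient-descent step on agent i's mixed strategy (Eqs. 9-11)."""
    grad = cond_G + (1.0 / beta) * np.log(q_i)
    grad -= grad.mean()          # adding C of Eq. 11: the update preserves sum(q_i) = 1
    while np.any(q_i - step * grad <= 0.0):
        step *= 0.5              # halve the step if any probability would go non-positive
    return q_i - step * grad
```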
In large systems, the sampling within any given block can be slow to converge. One way to speed up the convergence is to replace each [G | x_i] with [g_i | x_i], where g_i is set to minimize bias plus variance [9, 11]:

g_i(x) := G(x) − ∫ dx_i (L_{x_i}^{−1} / ∫ dx'_i L_{x'_i}^{−1}) G(x_{−i}, x_i),    (12)

where L_{x_i} is the number of times the particular value x_i arose in the most recent block of L samples. This is called the Aristocrat Utility (AU). It is a correction to the utility of the same name investigated in [13] and references therein.

Note that evaluation of AU for agent i requires knowing G for all possible x_i, with x_(i) held fixed. Accordingly, we consider an approximation, called the Wonderful Life Utility (WLU), in which we replace the weights L_{x_i}^{−1} / ∫ dx'_i L_{x'_i}^{−1} defining agent i's AU with a delta function about the least likely (according to q_i) of that agent's moves. (This is a version of the utility of the same name investigated in [13] and references therein.)

Below we present computer experiments validating the theoretical predictions that AU converges faster than the team game, and that the WLU defined here converges faster than its reverse, in which the delta function is centered on the most likely of the agent's moves.

Both WLU and AU require recording not just G(x) for the Monte Carlo sample x, but also G at a set of points related in a particular way to x. When the functional form of G is known, often there is cancellation of terms that obviates this need. Indeed, often what one does need to record in these cases is easier to evaluate than G itself. However when the functional form of G is not known, using such private utilities would require rerunning the system, i.e., evaluating G at many points besides x.

PD theory provides us an alternative way to improve the convergence of the sampling. This alternative exploits the fact that the joint distribution of all the agents is a product distribution. Accordingly, we can have the agents all announce their separate distributions {q_i} at the end of each block. By itself, this is of no help. However, say that x as well as G(x) is recorded for all the samples taken so far (not just those in the preceding block). We can use this information as a training set for a supervised learning algorithm that estimates G. Again, this piece of information is of no use by itself. But if we combine it with the announced {q_i}, we can form an estimate of each [G | x_i]. This estimate is in addition to the estimate based on the Monte Carlo samples; here the Monte Carlo samples from all blocks are used to approximate G(x), rather than to directly estimate the various [G | x_i]. Accordingly we can combine these two estimates. Below we present computer experiments validating this technique.

4. Experiments

4.1. Known world utilities

We first consider the case where the functional form of the world utility is known. Specifically, the problem we consider is the equilibration of a spin glass in an external field, where each spin has a total angular momentum 3/2. The problem consists of 50 spins in a circular formation, where every spin interacts with the three spins on its right, the three spins on its left, as well as with itself. There are also external fields which interact with each individual spin differently. The world utility is thus of the following form:

G(x) = ∑_i h_i x_i + ∑_{(ij)} J_ij x_i x_j,    (13)

where ∑_{(ij)} means summing over all the interacting pairs once. In our problem, the elements of the sets {h_i} and {J_ij} are generated uniformly at random from −0.5 to 0.5.
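A minimal construction of this world utility might look as follows (our sketch; the paper does not list its code, so the helper make_spin_glass and its argument names are ours):

```python
import numpy as np

def make_spin_glass(n=50, k=3, lo=-0.5, hi=0.5, seed=0):
    """World utility of Eq. 13: n spins on a ring, each interacting with the
    k neighbours on either side and with itself, with fields h_i and couplings
    J_ij drawn uniformly from [lo, hi); each interacting pair is counted once."""
    rng = np.random.default_rng(seed)
    h = rng.uniform(lo, hi, size=n)
    pairs = sorted({tuple(sorted((i, (i + d) % n))) for i in range(n) for d in range(k + 1)})
    J = {pair: rng.uniform(lo, hi) for pair in pairs}
    def G(x):                     # x is a length-n vector of spins in {-1, +1}
        return float(h @ x + sum(J_ij * x[i] * x[j] for (i, j), J_ij in J.items()))
    return G

G = make_spin_glass()
x = np.random.default_rng(1).choice([-1, 1], size=50)
print(G(x))
```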


The algorithm for the Lagrangian estimation goes as follows (a code sketch of this loop is given at the end of this subsection):

1. Each spin is treated as an agent which possesses a probability distribution over its set of actions, {q_i(x_i) | x_i ∈ σ_i ≡ {−1, 1}}, which is initially set to be uniform.

2. Each agent picks its state according to its probability distribution L times sequentially, where L is the Monte Carlo block size. We denote the number of times agent i picked state x_i by L_{x_i}. We require L_{x_i} to be nonzero for all x_i ∈ σ_i, i.e., if some L_{x_i} = 0, we randomly pick a sample x' and set x_i = x'_i so that L_{x_i} = 1. This process ensures that we can get conditional expected values [G|x_i] for all x_i ∈ σ_i. It should be noted though that it violates the assumptions of IID sampling underpinning the derivation of the private utilities minimizing bias plus variance.

3. The gradient for each individual component is calculated based on the L samples taken in the previous step (cf. Eq. 9), and gradient descents are performed for all i simultaneously. Since all probabilities must be positive, for each component i the magnitude of the descent is halved if q_i(x_i) is no longer positive for some x_i.

4. Repeat steps 2 and 3.

In Figure 1 we show a comparison of three different ways of estimating the descent direction in step 3 above. "Team game" means that we use [G|x_i] to get the descent directions, "weighted Aristocrat Utility" corresponds to using the formula in Eq. 12 to get the descent directions, and "uniform Aristocrat Utility" corresponds to simplifying the functions {g_i} to

ĝ_i(x) := G(x) − (1/|σ_i|) ∑_{x'_i ∈ σ_i} G(x_{−i}, x'_i).    (14)

In Figure 1 we see that weighted AU outperforms uniform AU except at β^{−1} = 0.2. This unexpected result at β^{−1} = 0.2 may be due to the limitation on the size of L. (Recall that we have required that L_{x_i} ≠ 0, and if it ever vanishes, we randomly pick a sample x' and set x_i = x'_i so that L_{x_i} = 1.) Hence, as shown in Figure 3, the number of L_{x_i} = 1 is greater when β^{−1} = 0.2 than when β^{−1} = 0.6. This demonstrates that at β^{−1} = 0.2 quite a few redistributions of the samples are happening, and hence the size of L has to be enlarged to get decent statistics. This speculation is further strengthened by comparing correct WLU (where the weights L_{x_i}^{−1} / ∫ dx'_i L_{x'_i}^{−1} defining agent i's AU are replaced with a delta function about the least likely, according to q_i, of that agent's moves) and incorrect WLU (where the same quantities are replaced with a delta function about the most likely, according to q_i, of that agent's moves) with different sample sizes L. As shown in Figures 4 and 5, the increase in sample size does amend the problem caused by resampling.
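Putting the pieces together, the following self-contained sketch shows one way to implement the loop of steps 1-4 for the team-game utility. It is our illustration, not the authors' code; in particular, when a move never occurs in a block we fall back to the block mean rather than resampling as in step 2.

```python
import numpy as np

def lagrangian_descent(G, n_agents=50, moves=(-1, 1), beta_inv=0.6,
                       L=100, n_blocks=20, step=0.2, seed=0):
    """Distributed gradient descent on the Lagrangian for the team game:
    sample a block of L joint moves from the product distribution (step 2),
    estimate [G|x_i] for every agent and move, and take one normalised
    gradient step per agent (step 3, cf. Eqs. 9-11)."""
    rng = np.random.default_rng(seed)
    n_moves = len(moves)
    q = np.full((n_agents, n_moves), 1.0 / n_moves)          # step 1: uniform strategies
    for _ in range(n_blocks):                                # step 4: repeat
        idx = np.array([[rng.choice(n_moves, p=q[i]) for i in range(n_agents)]
                        for _ in range(L)])                  # step 2: block of joint samples
        utils = np.array([G(np.array([moves[m] for m in row])) for row in idx])
        for i in range(n_agents):                            # step 3: simultaneous descents
            cond = np.array([utils[idx[:, i] == m].mean() if np.any(idx[:, i] == m)
                             else utils.mean() for m in range(n_moves)])
            grad = cond + beta_inv * np.log(q[i])
            grad -= grad.mean()                              # the constant C of Eq. 11
            s = step
            while np.any(q[i] - s * grad <= 0.0):            # keep all probabilities positive
                s *= 0.5
            q[i] = q[i] - s * grad
    return q
```

With G built as in the spin-glass sketch above, lagrangian_descent(G) returns the 50 per-spin distributions after 20 descent blocks.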
5. Unknown world utilities

We now consider the case where the explicit formula for the world utility is not known, and hence the calculations for WLU, uniform AU and weighted AU are not possible. Recall that for this case we require that each player not only submit her choices of actions during each Monte Carlo block, but her probability distribution as well. Although this brings a constant overhead to the transmission, it becomes negligible when L is large.

The problem we consider here is a 100-agent 4-night bar problem [14]. In this problem, each agent's strategy set consists of four elements: {1, 2, 3, 4}. The world utility is of the form

G(x) = −50 × ∑_{k=1}^{4} e^{−f_k(x)/6},    (15)

where f_k(x) = ∑_i δ(x_i − k), i.e., f_k(x) is the number of agents attending the bar on night k. The precise algorithm is as follows (a sketch of the estimator used in step 3 is given after this list):

1. Each agent possesses a probability distribution over her set of actions, {q_i(x_i) | x_i ∈ ω_i}, which is initially set to be uniform.

2. Each agent picks her state according to her probability distribution L times sequentially, where L is the Monte Carlo block size, and also announces her probability distribution {q_i(x_i) | x_i ∈ ω_i}. Again, we require L_{x_i} to be nonzero for all x_i ∈ ω_i, i.e., if some L_{x_i} = 0, we randomly pick a sample x' and set x_i = x'_i so that L_{x_i} = 1.

3. Denote the set of samples in the L Monte Carlo steps by S. Each agent generates a set of artificial data points according to the agents' probability distributions; denote those by A_i. Then we define the following quantity:

Ḡ_{x_i} := (1 − α)/|S| ∑_{x'∈S} δ(x_i − x'_i) G(x')    (16)
         + α/|A_i| ∑_{x'∈A_i} δ(x_i − x'_i) Ĝ_S(x'),    (17)

where α is a weighting parameter between 0 and 1 and Ĝ_S is defined by

Ĝ_S(x) := [∑_{x'∈S} d(x, x') G(x')] / [∑_{x'∈S} d(x, x')],    (18)

where d(·, ·) is some appropriate metric. In the present 100-agent 4-night bar problem, d(x, x') := e^{−2 ∑_{k=1}^{4} |f_k(x) − f_k(x')|}, where the functions {f_k(·)} are as defined in Eq. 15.

4. Each agent updates her probability distribution according to the gradients calculated as in Eq. 9 but with [G|x_i] replaced by Ḡ_{x_i}. Again, for each agent i, the magnitude of the descent is halved if q_i(x_i) is no longer positive for some x_i.

5. Repeat steps 2 to 4.
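The following sketch shows one way to code the estimator of Eqs. 15-18. It is our illustration; names like bar_G and G_bar are ours, and we follow the normalisation by |S| and |A_i| exactly as written in Eqs. 16-17.

```python
import numpy as np

def bar_G(x, nights=4):
    """World utility of Eq. 15; x[i] in {1, ..., nights} is agent i's chosen night."""
    f = np.array([np.sum(x == k) for k in range(1, nights + 1)])
    return -50.0 * float(np.sum(np.exp(-f / 6.0)))

def bar_metric(x, xp, nights=4):
    """Similarity weight d(x, x') used in Eq. 18 for the bar problem."""
    f = np.array([np.sum(x == k) for k in range(1, nights + 1)])
    fp = np.array([np.sum(xp == k) for k in range(1, nights + 1)])
    return float(np.exp(-2.0 * np.sum(np.abs(f - fp))))

def G_hat(x, S, G_vals, metric=bar_metric):
    """Similarity-weighted estimate of G(x) from the true samples S (Eq. 18)."""
    w = np.array([metric(x, xp) for xp in S])
    return float(w @ np.asarray(G_vals)) / float(w.sum())

def G_bar(i, x_i, S, G_vals, A_i, alpha, metric=bar_metric):
    """Data-augmented conditional estimate of Eqs. 16-17: true samples with agent i
    playing x_i contribute their recorded G, artificial samples A_i contribute the
    estimated G_hat, weighted by (1 - alpha) and alpha respectively."""
    true_part = sum(g for xp, g in zip(S, G_vals) if xp[i] == x_i) / len(S)
    art_part = sum(G_hat(xp, S, G_vals, metric) for xp in A_i if xp[i] == x_i) / len(A_i)
    return (1.0 - alpha) * true_part + alpha * art_part
```

In step 4, G_bar(i, x_i, ...) replaces [G|x_i] in the gradient of Eq. 9.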


Figure 1. Plots of β^{−1} vs. the Lagrangian for different utilities. The curves are generated by plotting the Lagrangian at the 20th timestep, i.e., after 20 descents. The initial step sizes are set to be 0.2 times the gradients. Also, L = 100 and a total of 40 simulations are performed. (Red dotted line: team game; green dashed line: uniform AU; blue solid line: weighted AU.)

Figure 2. The time series of the Lagrangian along the 20 time steps (curves generated at β^{−1} = 0.6). (Red dotted line: team game; green dashed line: uniform AU; blue solid line: weighted AU.)

Given that the utility function is reasonably smooth, it is natural to expect that the estimates aided by artificial data points will provide an improvement. This is indeed shown in Figure 6 for varying α. The same experiments are also performed for the 50-spin model introduced in Section 4.1 but with a new metric d(x, x') = ∑_{i=1}^{50} δ(x_i − x'_i). The results are shown in Figure 7.

An immediate improvement to the scheme above is to realize that there is no need to restrict ourselves to data available in that particular time step when calculating Ḡ_{x_i} in Eq. 16. Namely, unlike step 3 above, we can accumulate "true" data from previous steps in calculating Ḡ_{x_i}, which will certainly improve the accuracy of the estimation. To illustrate this idea, at time step t we define S' to be the set of true samples drawn at time step t together with the set of true samples drawn at time step t − 1, and we calculate Ḡ_{x_i} as

Ḡ_{x_i} := (1 − α)/|S| ∑_{x'∈S} δ(x_i − x'_i) G(x')    (19)
         + α/|A_i| ∑_{x'∈A_i} δ(x_i − x'_i) Ĝ_{S'}(x').    (20)

The results for the bar problem are shown in Figure 8.

Figure 3. Plots of time step versus (1/50) ∑_i ∑_{x_i∈σ_i} δ(L_{x_i} − 1) at different temperatures. (Red dotted line: β^{−1} = 0.2; blue solid line: β^{−1} = 0.6.)


Figure 4. The time series of the Lagrangian along the 20 time steps with sample size L = 100 (curves generated at β^{−1} = 0.2). (Red dotted line: correct WLU; blue solid line: incorrect WLU.)

Figure 5. The time series of the Lagrangian along the 20 time steps with sample size L = 200 (curves generated at β^{−1} = 0.2). (Red dotted line: correct WLU; blue solid line: incorrect WLU.)

Figure 6. Plots of β^{−1} vs. the Lagrangian with various weighting parameters α for the bar problem. The curves are generated by plotting the Lagrangian at the 20th timestep. (Red dotted line: no data augmentation; green dashed line: data augmentation with α = 0.3; blue solid line: data augmentation with α = 0.5; black dash-dotted line: data augmentation with α = 0.7.)

Figure 7. Plots of β^{−1} vs. the Lagrangian for the 50-spin model. The curves are generated by plotting the Lagrangian at the 20th timestep. (Red dotted line: no data augmentation; blue solid line: data augmentation with α = 0.5.)


Figure 8. Plots of β^{−1} vs. the Lagrangian with data accumulation (blue line) and without data accumulation (red dotted line) for the bar problem. The curves are generated by plotting the free energies at the 10th timestep. The initial step sizes are set to be 0.2 times the gradients. Also, the number of true data |S| is 20, the number of data stored is 20, and the number of artificial data |A_i| is 40. A total of 80 simulations are performed.

6. Conclusion

Product Distribution (PD) theory is a recently introduced broad framework for analyzing, controlling, and optimizing distributed systems [9, 10, 11]. Here we investigate PD theory's use for adaptive, distributed control of a MAS. Typically such control is done by having each agent run its own reinforcement learning algorithm [4, 13, 14, 12]. In this approach the utility function of each agent is based on the world utility G(x) mapping the joint move of the agents, x ∈ X, to the performance of the overall system. However in practice the agents in a MAS are bounded rational. Moreover the equilibrium they reach will typically involve mixed strategies rather than pure strategies, i.e., they don't settle on a single point x optimizing G(x). This suggests formulating an approach that explicitly accounts for the bounded rational, mixed-strategy character of the agents.

PD theory directly addresses these issues by casting the control problem as one of minimizing a Lagrangian of the joint probability distribution of the agents. This allows the equilibrium to be found using gradient descent techniques. In PD theory, such gradient descent can be done in a distributed manner.

We present experiments validating PD theory's predictions for how to speed the convergence of that gradient descent. We then present other experiments validating the use of PD theory to improve convergence even if one is not allowed to rerun the system (an approach common in RL-based schemes). These results demonstrate the power of PD theory for providing a principled way to control a MAS in a distributed manner.

References

[1] S. Airiau and D. H. Wolpert. Product distribution theory and semi-coordinate transformations. 2004. Submitted to AAMAS 04.
[2] N. Antoine, S. Bieniawski, I. Kroo, and D. H. Wolpert. Fleet assignment using collective intelligence. In Proceedings of the 42nd Aerospace Sciences Meeting, 2004. AIAA-2004-0622.
[3] S. Bieniawski and D. H. Wolpert. Adaptive, distributed control of constrained multi-agent systems. 2004. Submitted to AAMAS 04.
[4] R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 1017–1023. MIT Press, 1996.
[5] D. Fudenberg and D. K. Levine. The Theory of Learning in Games. MIT Press, Cambridge, MA, 1998.
[6] D. Fudenberg and J. Tirole. Game Theory. MIT Press, Cambridge, MA, 1991.
[7] D. Mackay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[8] S. Bieniawski, W. Macready, and D. H. Wolpert. Adaptive multi-agent systems for constrained optimization. 2004. Submitted to AAAI 04.
[9] D. H. Wolpert. Factoring a canonical ensemble. 2003. cond-mat/0307630.
[10] D. H. Wolpert. Bounded rational games, information theory, and statistical physics. In D. Braha and Y. Bar-Yam, editors, Complex Engineering Systems, 2004.
[11] D. H. Wolpert. Generalizing mean field theory for distributed optimization and control. 2004. Submitted.
[12] D. H. Wolpert and K. Tumer. Optimal payoff functions for members of collectives. Advances in Complex Systems, 4(2/3):265–279, 2001.
[13] D. H. Wolpert and K. Tumer. Collective intelligence, data routing and Braess' paradox. Journal of Artificial Intelligence Research, 2002.
[14] D. H. Wolpert, K. Wheeler, and K. Tumer. Collective intelligence for control of distributed dynamical systems. Europhysics Letters, 49(6), March 2000.

Acknowledgements

CFL thanks NASA and NSERC (Canada) for financial support.
