decision making with multiple imperfect decision makers - Institute of ...

More documents

Recommendations

Info

1S AS BS AU BU AS CS B2S AU AU AS BU BU B3S CS AS BS CU BU AS CFigure 1: An example semi Bayes net.Figure 2:Bayes net.An example iterated semicontrast to learning-in-games models whereby players do not use their opponents’ reward informationto predict their opponents’ decisions and to choose their own actions. Second, this approachis computationally feasible even on real-world problems. This is in contrast to equilibrium modelssuch as subgame perfect equilibrium and quantal response equilibrium [3]. This is also in contrast toPOMDP models (e.g. I-POMDP) in which players are required to maintain a belief state over spacesthat quickly explode. Third, with this general formulation of the policy mapping, it is straightforwardto introduce experimentally motivated behavioral features such as noisy, sampled or boundedmemory. Another source of realism is that, with the level-K model instead of an equilibrium model,we avoid the awkward assumption that players’ predictions about each other are always correct.We investigate all this for modeling a cyber-battle over a smart power grid. We discuss the relationshipbetween the behavior predicted by our model and what one might expect of real humandefenders and attackers.2 Game Representation and Solution ConceptIn this paper, the players will be interacting in an iterated semi net-form game. To explain an iteratedsemi net-form game, we will begin by describing a semi Bayes net. A semi Bayes net is a Bayesnet with the conditional distributions of some nodes left unspecified. A pictoral example of a semiBayes net is given in Figure 1. Like a standard Bayes net, a semi Bayes net consist of a set of nodesand directed edges. The ovular nodes labeled “S” have specified conditional distributions with thedirected edges showing the dependencies among the nodes. Unlike a standard Bayes net, there arealso rectangular nodes labeled “U” that have unspecified conditional dependencies. In this paper,the unspecified distributions will be set by the interacting human players. A semi net-form game,as described in [4], consists of a semi Bayes net plus a reward function mapping the outcome of thesemi Bayes net to rewards for the players.An iterated semi Bayes net is a Bayes net which has been time extended. It comprises of a semiBayes net (such as the one in Figure 1), which is replicated T times. Figure 2 shows the semi Bayesnet replicated three times. A set of directed edges L sets the dependencies between two successiveiterations of the semi Bayes net. Each edge in L connects a node in stage t − 1 with a node in staget as is shown by the dashed edges in Figure 2. This set of L nodes is the same between every twosuccessive stages. An iterated semi net-form game comprises of two parts: an iterated semi Bayesnet and a set of reward functions which map the results of each step of the semi Bayes net into anincremental reward for each player. In Figure 2, the unspecified nodes have been labeled “U A ” and“U B ” to specify which player sets which nodes.Having described above our model of the strategic scenario in the language of iterated semi net-formgames, we now describe our solution concept. Our solution concept is a combination of the level-Kmodel, described below, and reinforcement learning (RL). The level-K model is a game theoretic2
solution concept used to predict the outcome of human-human interactions. A number of studies[1, 2] have shown promising results predicting experimental data in games using this method. Thesolution to the level-K model is defined recursively as follows. A level K player plays as though allother players are playing at level K − 1, who, in turn, play as though all other players are playingat level K − 2, etc. The process continues until level 0 is reached, where the level 0 player playsaccording to a prespecified prior distribution. Notice that running this process for a player at K ≥ 2results in ricocheting between players. For example, if player A is a level 2 player, he plays asthough player B is a level 1 player, who in turn plays as though player A is a level 0 player playingaccording to the prior distribution. Note that player B in this example may not actually be a level 1player in reality – only that player A assumes him to be during his reasoning process.This work extends the standard level-K model to time-extended strategic scenarios, such as iteratedsemi net-form games. In particular, each Undetermined node associated with player i in the iteratedsemi net-form game represents an action choice by player i at some time t. We model player i’saction choices using the policy function, ρ i , which takes an element of the Cartesian product ofthe spaces given by the parent nodes of i’s Undetermined node to an action for player i. Note thatthis definition requires a special type of iterated semi-Bayes net in which the spaces of the parentsof each of i’s action nodes must be identical. This requirement ensures that the policy function isalways well-defined and acts on the same domain at every step in the iterated semi net-form game.We calculate policies using reinforcement learning (RL) algorithms. That is, we first define a level0 policy for each player, ρ 0 i . We then use RL to find player i’s level 1 policy, ρ1 i , given the level 0policies of the other players, ρ 0 −1, and the iterated semi net-form game. We do this for each player iand each level K. 13 Application: Cybersecurity of a Smart Power NetworkIn order to test our iterated semi net-form game modeling concept, we adopt a model for analyzingthe behavior of intruders into cyber-physical systems. In particular, we consider SupervisoryControl and Data Acquisition (SCADA) systems [5], which are used to monitor and control manytypes of critical infrastructure. A SCADA system consists of cyber-communication infrastructurethat transmits data from and sends control commands to physical devices, e.g. circuit breakers inthe electrical grid. SCADA systems are partly automated and partly human-operated. Increasingconnection to other cyber systems creating vulnerabilities to SCADA cyber attackers [6].Figure 3 shows a single, radial distribution circuit [7] from the transformer at a substation (node1) serving two load nodes. Node 2 is an aggregate of small consumer loads distributed along thecircuit, and node 3 is a relatively large distributed generator located near the end of the circuit. Inthis figure V i , p i , and q i are the voltage, real power, and reactive power at node i. P i , Q i , r i , andx i are the real power, reactive power, resistance and reactance of circuit segment i. Together, thesevalues represent the following physical system [7], where all terms are normalized by the nominalsystem voltage.P 2 = −p 3 , Q 2 = −q 3 , P 1 = P 2 + p 2 , Q 1 = Q 2 + q 2 (1)V 2 = V 1 − (r 1 P 1 + x 1 Q 1 ), V 3 = V 2 − (r 2 P 2 + x 2 Q 2 ) (2)In this model, r, x, and p 3 are static parameters, q 2 and p 2 are drawn from a random distributionat each step of the game, V 1 is the decision variable of the defender, q 3 is the decision variable ofthe attacker, and V 2 and V 3 are determined by the equations above. The injection of real powerp 3 and reactive power q 3 can modify the P i and Q i causing the voltage V 2 to deviate from 1.0.Excessive deviation of V 2 or V 3 can damage customer equipment or even initiate a cascading failurebeyond the circuit in question. In this example, the SCADA operator’s (defender’s) control over q 3is compromised by an attacker who seeks to create deviations of V 2 causing damage to the system.In this model, the defender has direct control over V 1 via a variable-tap transformer. The hardwareof the transformer limits the defenders actions at time t to the following domainD D (t) = 〈min(v max , V 1,t−1 + v), V 1,t−1 , max(v min , V 1,t−1 − v)〉1 Although this work uses level-K and RL exclusively, we are by no means wedded to this solution concept.Previous work on semi net-form games used a method known as Level-K Best-of-M/M’ instead of RL todetermine actions. This was not used in this paper because the possible action space is so large.3
Page 1 and 2: The 2nd International Workshop onDE
Page 3 and 4: The 2nd International Workshop onDE
Page 5 and 6: Time Title Authors7:30—7:50 Openi
Page 7: Modeling Humans as Reinforcement Le
Page 11 and 12: an ɛ-greedy policy parameterizatio
Page 13 and 14: Automated Explanations for MDP Poli
Page 15 and 16: V E = ∑ i∈Eλ π∗s 0(sc i )
Page 17 and 18: Figure 1: User Perception of MDP-Ba
Page 19 and 20: [9] Warren B. Powell. Approximate D
Page 21 and 22: information (technological knowledg
Page 23 and 24: The complete fulfilment of preferen
Page 25 and 26: References[1] J.O. Berger. Statisti
Page 27 and 28: (happiness, anger, fear, disgust, s
Page 30 and 31: David H Wolpert, editors, Decision
Page 32 and 33: on some of their concepts, we devel
Page 34 and 35: We assume that they are evaluated t
Page 36 and 37: AcknowledgmentsResearch supported b
Page 38 and 39: Bayesian Combination of Multiple, I
Page 40 and 41: is not in closed form [5], requirin
Page 42 and 43: where α (k)j is updated by adding
Page 44 and 45: Figure 3: Prototypical confusion ma
Page 46 and 47: Artificial Intelligence Designfor R
Page 48 and 49: the game. Each has its own attribut
Page 50 and 51: 4.3 Experimental Results and Limita
Page 52 and 53: Distributed Decision Making byCateg
Page 54 and 55: q 1a 1 (1) b 1(1)a 2(1)b 2(1)a K(1)
Page 56 and 57: Bayes risk0.250.20.150.10.05Collabo
Page 58 and 59:
Decision making and working memory
Page 60 and 61:
in WM as well as inhibitory tasks [
Page 62 and 63:
Difficulty level will be automatica
Page 64 and 65:
Overall, the current literature lea
Page 66 and 67:
[31] W K Bickel, R Yi, R D Landes,
Page 68 and 69:
Each time instant, the agents first
Page 70 and 71:
Figure 2: Two basic decentralized a
Page 72 and 73:
[11] M. Kárný and T.V. Guy. Shari
Page 74 and 75:
these results yields a design metho
Page 76 and 77:
Minimum of (16) is well known from
Page 78 and 79:
Algorithm 2 VB-DP variant of the di
Page 80 and 81:
Further simplications can be achiev
Page 82 and 83:
The basic terms we use are as follo
Page 84 and 85:
3 Connection to the Bayesian soluti
Page 86 and 87:
4 ConclusionThis paper brings an im
Page 88 and 89:
2 Dynamic programming and revisions
Page 90 and 91:
The possible way how to recognize t
Page 92 and 93:
The data used for experiment are da
Page 95:
T.V. Guy, Institute of Information
show all

decision making with multiple imperfect decision makers - Institute of ...

Create successful ePaper yourself

Delete template?

Save as template?