Figure 3: Schematic drawing of the three-node distribution circuit.

where $v$ is the voltage step size for the transformer, and $v_{\min}$ and $v_{\max}$ represent the absolute minimum and maximum voltage the transformer can produce. Similarly, the attacker has taken control of $q_3$, and its actions are limited by its capacity to produce real power, $p_{3,\max}$, as represented by the following domain:

$$\mathcal{D}_A(t) = \langle -p_{3,\max}, \ldots, 0, \ldots, p_{3,\max} \rangle$$

Via the SCADA system and the attacker's control of node 3, the observation spaces of the two players include

$$\Omega_D = \{V_1, V_2, V_3, P_1, Q_1, M_D\}, \qquad \Omega_A = \{V_2, V_3, p_3, q_3, M_A\}$$

where $M_D$ and $M_A$ each denote two real numbers that represent the respective player's memory of past events in the game. Both the defender and the attacker manipulate their controls to increase their own rewards. The defender desires to maintain a high quality of service by keeping the voltages $V_2$ and $V_3$ near the desired normalized voltage of one, while the attacker wishes to damage equipment at node 2 by forcing $V_2$ beyond operating limits, i.e.

$$R_D = -\left(\frac{V_2 - 1}{\epsilon}\right)^2 - \left(\frac{V_3 - 1}{\epsilon}\right)^2, \qquad R_A = \Theta[V_2 - (1 + \epsilon)] + \Theta[(1 - \epsilon) - V_2]$$

Here, $\epsilon \approx 0.05$ for most distribution systems under consideration, and $\Theta$ is the Heaviside step function.

Level 0 defender policy. The level 0 defender is modeled myopically and seeks to maximize his reward by following a policy that adjusts $V_1$ to move the average of $V_2$ and $V_3$ closer to one, i.e.

$$\pi_D(V_2, V_3) = \arg\min_{V_1 \in \mathcal{D}_D(t)} \left| \frac{V_2 + V_3}{2} - 1 \right|$$

Level 0 attacker policy. The level 0 attacker adopts a drift-and-strike policy based on intimate knowledge of the system. If $V_2 < 1$, we propose that the attacker would decrease $q_3$ by lowering it one step. This would cause $Q_1$ to increase and $V_2$ to fall even further. The policy achieves success if the defender raises $V_1$ in order to keep $V_2$ and $V_3$ in the acceptable range. The attacker continues this strategy, pushing the defender towards $v_{\max}$, until he can quickly raise $q_3$ to push $V_2$ above $1 + \epsilon$. If the defender has neared $v_{\max}$, then a number of time steps will be required for the defender to bring $V_2$ back in range. More formally, this policy can be expressed as

LEVEL0ATTACKER()
  V* = max_{q ∈ D_A(t)} |V_2 − 1|
  if V* > θ_A
    then return argmax_{q ∈ D_A(t)} |V_2 − 1|
  if V_2 < 1
    then return q_{3,t−1} − 1
  return q_{3,t−1} + 1

where $\theta_A$ is a threshold parameter. (A runnable sketch of both level 0 policies is given at the end of this section.)

3.1 Reinforcement Learning Implementation

Using the defined level 0 policies as the starting point, we now bootstrap up to higher levels by training each level $K$ policy against an opponent playing the level $K-1$ policy. To find policies that maximize reward, we can apply any algorithm from the reinforcement learning literature. In this paper, we use
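As a concrete illustration of the two level 0 policies, the following is a minimal Python sketch. The circuit physics are replaced by assumed linear voltage sensitivities, and every constant here (V_STEP, V_MIN, V_MAX, P3_MAX, theta_A, dv_dq) is an illustrative stand-in, not a value from the paper.

import numpy as np

# Constants below are illustrative assumptions, not values from the paper.
V_STEP = 0.01                  # transformer voltage step size v
V_MIN, V_MAX = 0.90, 1.10      # transformer limits v_min, v_max
P3_MAX = 5                     # attacker capacity p_{3,max}, in unit steps
D_D = np.arange(V_MIN, V_MAX + V_STEP / 2, V_STEP)  # defender domain D_D(t)
D_A = np.arange(-P3_MAX, P3_MAX + 1)                # attacker domain D_A(t)

def level0_defender(V1, V2, V3):
    """Myopic level 0 defender: pick the tap setting in D_D(t) whose
    predicted effect moves the average of V2 and V3 closest to one.
    Assumes unit sensitivity of downstream voltages to the tap change
    (an illustrative simplification of the circuit physics)."""
    predicted_avg = (V2 + V3) / 2.0 + (D_D - V1)
    return D_D[np.argmin(np.abs(predicted_avg - 1.0))]

def level0_attacker(V2, q3_prev, theta_A=0.08, dv_dq=0.01):
    """Drift-and-strike level 0 attacker (the LEVEL0ATTACKER pseudocode).
    dv_dq is an assumed linear sensitivity of V2 to the attacker's
    injection q3 -- hypothetical, for illustration only."""
    predicted_V2 = V2 + dv_dq * (D_A - q3_prev)   # V2 under each action
    deviation = np.abs(predicted_V2 - 1.0)
    if deviation.max() > theta_A:                 # strike: worst-case V2
        return int(D_A[np.argmax(deviation)])
    if V2 < 1:                                    # drift: widen the gap
        return max(q3_prev - 1, -P3_MAX)          # one step down
    return min(q3_prev + 1, P3_MAX)               # one step up

# Example: defender nudges the tap up when voltages sag; attacker drifts.
print(level0_defender(V1=1.00, V2=0.97, V3=0.98))  # ~ 1.02 (raise the tap)
print(level0_attacker(V2=0.97, q3_prev=-2))        # -> -3 (no strike yet)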

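The excerpt above breaks off before naming the algorithm the authors use, so the following sketch of the Section 3.1 bootstrap fills the training step with plain tabular Q-learning as an assumed stand-in. The env interface (reset, actions, step) and the names defender_env, level0_defender_policy, and K_MAX are hypothetical; observations are assumed to be discretized into hashable tuples.

import random
from collections import defaultdict

def train_level_K(env, opponent_policy, episodes=5000,
                  alpha=0.1, gamma=0.95, eps=0.1):
    """Train a level K best response against a frozen level K-1 opponent.
    Tabular Q-learning is an assumed stand-in for the paper's (unnamed)
    algorithm; `env` exposing reset()/actions(obs)/step(a, opponent) is
    a hypothetical interface, not from the paper."""
    Q = defaultdict(float)
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            acts = env.actions(obs)
            # epsilon-greedy action selection over the current Q-table
            a = (random.choice(acts) if random.random() < eps
                 else max(acts, key=lambda x: Q[(obs, x)]))
            next_obs, reward, done = env.step(a, opponent_policy)
            target = reward if done else reward + gamma * max(
                Q[(next_obs, x)] for x in env.actions(next_obs))
            Q[(obs, a)] += alpha * (target - Q[(obs, a)])
            obs = next_obs
    return lambda obs: max(env.actions(obs), key=lambda x: Q[(obs, x)])

# Bootstrapping: level 0 seeds the hierarchy; each side's level K policy
# is trained against the other side's frozen level K-1 policy.
# defender = {0: level0_defender_policy}
# attacker = {0: level0_attacker_policy}
# for K in range(1, K_MAX + 1):
#     defender[K] = train_level_K(defender_env, attacker[K - 1])
#     attacker[K] = train_level_K(attacker_env, defender[K - 1])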