
where $\delta(s', s_0)$ is a Kronecker delta that assigns 1 when $s' = s_0$ and 0 otherwise. The occupancy frequency for a scenario (or a set of scenarios), $\lambda^{\pi}_{s_0}(sc)$, is the expected number of times we reach a scenario $sc$, from starting state $s_0$, by executing policy $\pi$, i.e., $\lambda^{\pi}_{s_0}(sc) = \sum_{s \in sc} \lambda^{\pi}_{s_0}(s)$. Let $sc_r$ be a set of scenarios with reward value $r$. The dot product of occupancy frequencies and rewards gives the value of a policy, as shown in Eq. 2.

$$V^{\pi}(s_0) = \sum_r \lambda^{\pi}_{s_0}(sc_r) \cdot r \qquad (2)$$

An optimal policy $\pi^*$ earns the highest value for all states (i.e., $V^{\pi^*}(s) \geq V^{\pi}(s) \;\; \forall \pi, s$).
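To make Eq. 2 concrete, the short sketch below computes a policy's value as the dot product of occupancy frequencies and scenario rewards. It is only an illustration of the formula: the dictionary layout, reward values, and frequencies are hypothetical, and the occupancy frequencies are assumed to have been computed beforehand from the MDP.

```python
# Minimal sketch of Eq. 2: V^pi(s0) = sum_r lambda^pi_s0(sc_r) * r.
# The occupancy frequencies lambda^pi_s0(sc_r) are assumed to be precomputed;
# the reward values and frequencies below are purely illustrative.

occupancy_by_reward = {
    10.0: 3.2,   # scenarios with reward 10 are reached about 3.2 times in expectation
    0.0:  5.1,   # neutral scenarios
    -5.0: 0.4,   # scenarios with reward -5
}

def policy_value(occupancy_by_reward):
    """Value of the policy from the start state s0: the dot product of
    occupancy frequencies and rewards (Eq. 2)."""
    return sum(freq * r for r, freq in occupancy_by_reward.items())

print(policy_value(occupancy_by_reward))  # 3.2*10 + 5.1*0 + 0.4*(-5) = 30.0
```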
3 Explanations for MDPs

3.1 Templates for Explanations

Our explanation answers the question, "Why has this action been recommended?" by populating generic templates at run-time with domain-specific information from the MDP, i.e., the occupancy frequency of a scenario. The reward function implicitly partitions the state space into regions with equal reward value. These regions can be defined as partial variable assignments corresponding to scenarios or sets of scenarios. An explanation could then be that the frequency of reaching a scenario is the highest (or lowest). This is especially useful when this scenario also has a relatively high (or low) reward. Below we describe templates in which the underlined phrases (scenarios and their frequencies) are populated at run-time.

• Template 1: "ActionName is the only action that is likely to take you to $Var_1 = Val_1, Var_2 = Val_2, \ldots$ about $\lambda$ times, which is higher (or lower) than any other action"

• Template 2: "ActionName is likely to take you to $Var_1 = Val_1, Var_2 = Val_2, \ldots$ about $\lambda$ times, which is as high (or low) as any other action"

• Template 3: "ActionName is likely to take you to $Var_1 = Val_1, Var_2 = Val_2, \ldots$ about $\lambda$ times"

While these templates provide a method to present explanations, multiple templates can be populated even for non-optimal actions; a non-optimal action can have the highest frequency of reaching a scenario without having the maximum expected utility. Thus, we need to identify a set of templates that justifies the optimal action.

3.2 Minimal Sufficient Explanations

We define an explanation as sufficient if it can prove that the recommendation is optimal, i.e., the selected templates show that the action is optimal without needing additional templates. A sufficient explanation cannot be generated for a non-optimal action, since an explanation for another action (e.g., the optimal action) will have a higher utility. A sufficient explanation is also minimal if it includes the minimum number of templates needed to ensure it is sufficient. The minimality constraint is useful for users and the sufficiency constraint is useful for designers.

Let $s_0$ be the state where we need to explain why $\pi^*(s_0)$ is an optimal action. We can compute the value of the optimal policy $V^{\pi^*}(s_0)$ or the Q-function¹ $Q^{\pi^*}(s_0, a)$ using Eq. 2. Since a template is populated by a frequency and a scenario, the utility of this pair in a template is $\lambda^{\pi^*}_{s_0}(sc_r) \cdot r$. Let $E$ be the set of frequency-scenario pairs that appear in an explanation. If we exclude a pair from the explanation, its utility is $\lambda^{\pi^*}_{s_0}(sc_i) \cdot r_{min}$, where $r_{min}$ is the minimum value of the reward variable. This definition indicates that the worst is assumed for the scenario in this pair. The utility of an explanation, $V_E$, is

$$V_E(s_0) = \sum_{sc_r \in E} \lambda^{\pi^*}_{s_0}(sc_r) \cdot r + \sum_{sc_i \notin E} \lambda^{\pi^*}_{s_0}(sc_i) \cdot r_{min}$$

¹ In reinforcement learning, the Q-function $Q^{\pi}(s, a)$ denotes the value of executing action $a$ in state $s$ and then following policy $\pi$.
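Taken together, these definitions suggest a direct computational reading: credit the frequency-reward pairs included in $E$ with their true reward, credit the excluded pairs with $r_{min}$, and keep adding the most valuable pairs until the explanation's utility exceeds the Q-value of every competing action. The sketch below is one plausible greedy construction under that reading; the data structures, the greedy ordering by $\lambda \cdot (r - r_{min})$, and the second-best Q-value threshold are illustrative assumptions rather than the paper's exact algorithm.

```python
# Sketch of the explanation utility V_E and a greedy search for a small
# sufficient explanation. Each pair is (occupancy frequency, reward) for one
# reward-defined scenario set of the recommended action; the threshold is the
# best Q-value among the other actions. All concrete numbers are illustrative.

def explanation_utility(pairs, included, r_min):
    """Included pairs contribute lambda * r; excluded pairs are assumed to
    yield the minimum reward and contribute lambda * r_min."""
    return sum(freq * (r if i in included else r_min)
               for i, (freq, r) in enumerate(pairs))

def sufficient_explanation(pairs, r_min, second_best_q):
    """Greedily add the pairs with the largest gain over the pessimistic bound,
    lambda * (r - r_min), until the utility exceeds every other action's Q-value."""
    order = sorted(range(len(pairs)),
                   key=lambda i: pairs[i][0] * (pairs[i][1] - r_min),
                   reverse=True)
    included = set()
    for i in order:
        if explanation_utility(pairs, included, r_min) > second_best_q:
            break  # the explanation is already sufficient
        included.add(i)
    return included

# Illustrative use: three scenario sets reachable under the recommended action.
pairs = [(3.2, 10.0), (5.1, 0.0), (0.4, -5.0)]
print(sufficient_explanation(pairs, r_min=-5.0, second_best_q=12.0))
```

Greedy selection keeps the explanation short, but a strict minimality guarantee would require checking smaller subsets as well; the sketch only aims to reach sufficiency with few templates.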
