Possible Futures: Complexity in Sequential Decision-Making Dan Lizotte Assistant Professor, David R. Cheriton School of Computer Science University of Waterloo
Waterloo Institute for Complexity and Innovation 22 Jan 2013
Agents, Rationality, and Sequential Decision-Making • Sequential decision-making from the AI tradition • Supporting sequential clinical decision-making • The assumption of rationality: • mitigates complexity ☺ • reduces expressiveness ☹ • Alternatives to “simple rationality” • Please interrupt at any time
“An agent is just something that perceives and acts.” -Russell & Norvig
[Diagram: the agent–environment loop. The agent observes state s_t and reward r_t and takes action a_t; the environment returns s_{t+1} and r_{t+1}. Graphic: Leslie Pack Kaelbling, MIT]
The “Reward Hypothesis” “That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).” — Rich Sutton A.K.A. “Rationality.” (“reward” ≡ “utility”)
[Diagram: the same loop in clinical terms. The state s_t is the patient's status, the reward r_t is the outcome, the action a_t is the treatment; the environment returns s_{t+1} and r_{t+1}.]
Autonomous Agent Paradigm
[Diagram: the agent–environment loop again, with state s_t, reward r_t, action a_t, next state s_{t+1}.]
Good
• Goal is to maximize long-term reward
• Makes context-dependent, individualized decisions
• Handles uncertain environments naturally
Bad
• Need to prepare for every eventuality
• Relies on the reward hypothesis
• Relies on the assumption of rational* behaviour
*Rational in the non-judgemental, mathematical sense.
Decision Support Agent
Place a human in between.
[Diagram: the agent–environment loop with a human decision-maker placed between the agent and the environment.]
Recommend rather than perform actions, while still leveraging the autonomous agent idea.
Rational Decision Making Should I Move or Stay?
Assume we have:
• An observation of the state of the world
• Rewards associated with each state
• A model of how the world changes
[Diagram: from the current state s, two actions with their outcome probabilities and rewards.]
Move: P = 0.1, R = 0; P = 0.1, R = 100; P = 0.7, R = 0; P = 0.1, R = −50 → Q(s, Move) = 5
Stay: P = 0.1, R = −25; P = 0.1, R = 10; P = 0.7, R = 0; P = 0.1, R = 0 → Q(s, Stay) = −1.5
How do we decide what to do?
• Often, “we” assume we want to maximize expected reward
• For each potential action, compute the expected reward over all possible future states
• Choose the action with the highest expected reward
• “Q-function”: Q(s, a) = E[R | s, a]
[Diagram: the same Move/Stay example, now annotated with the resulting decision.]
Q(s, Move) = 5, Q(s, Stay) = −1.5
π(s) = Move, V(s) = 5
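A minimal sketch of the one-step calculation above, in Python. The probabilities and rewards are the toy numbers from the slide; the dictionary layout and function names are just for illustration.

```python
# One-step Q computation: Q(s, a) = E[R | s, a] = sum over outcomes of P * R.

# (probability, reward) pairs for each action in the current state
outcomes = {
    "Move": [(0.1, 0), (0.1, 100), (0.7, 0), (0.1, -50)],
    "Stay": [(0.1, -25), (0.1, 10), (0.7, 0), (0.1, 0)],
}

def q_value(action):
    """Expected immediate reward of taking `action` in the current state."""
    return sum(p * r for p, r in outcomes[action])

q = {a: q_value(a) for a in outcomes}   # {'Move': 5.0, 'Stay': -1.5}
policy = max(q, key=q.get)              # pi(s) = 'Move'
value = q[policy]                       # V(s) = 5.0
print(q, policy, value)
```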
Sequential Decision-Making
• For each potential action, compute the expected reward over all possible future states
• Feasible if the reward is known for each state
• What if the reward is influenced by subsequent decision-making?
[Diagram: a later state s′, reached through the Stay branch's P = 0.1, R = 10 outcome, offers two actions.]
Mosey: P = 1.0, R = 10 → Q(s′, Mosey) = 10
Scramble: P = 0.9, R = 100; P = 0.1, R = 0 → Q(s′, Scramble) = 90
π(s′) = Scramble, V(s′) = 90
Back in the current state, that successor is now worth V = 90 rather than R = 10:
Q(s, Move) = 5 (unchanged), Q(s, Stay) = −1.5 → 6.5
π(s) = Move → Stay, V(s) = 5 → 6.5
Sequential Decision-Making • Optimal decisions are influenced by future plans
• Not a problem, if we know what we will do when we get to any future state.
• Replace R (short-term) with V (long-term) by the law of total expectation
• Expectation is now over sequences of states
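A sketch of the backup in the Mosey/Scramble example: replacing the successor state's short-term reward R = 10 with its long-term value V = 90 flips the recommendation from Move to Stay. The numbers come from the toy example; only the successor highlighted on the slide is given a long-term value here.

```python
# Replace the short-term reward R of a successor state with its long-term
# value V (law of total expectation), then re-evaluate the current state.

# Next-stage state s': Mosey pays 10 for sure; Scramble pays 100 w.p. 0.9.
q_next = {
    "Mosey":    1.0 * 10,             # 10
    "Scramble": 0.9 * 100 + 0.1 * 0,  # 90
}
v_next = max(q_next.values())         # V(s') = 90, pi(s') = Scramble

# Current state: the Stay outcome that used to pay R = 10 leads to s',
# so its contribution becomes 0.1 * v_next instead of 0.1 * 10.
q_move = 0.1 * 0 + 0.1 * 100 + 0.7 * 0 + 0.1 * (-50)     # 5.0 (unchanged)
q_stay = 0.1 * (-25) + 0.1 * v_next + 0.7 * 0 + 0.1 * 0  # 6.5 (was -1.5)

policy = "Move" if q_move > q_stay else "Stay"            # now 'Stay'
value = max(q_move, q_stay)                               # 6.5
print(q_move, q_stay, policy, value)
```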
Policy Evaluation
Time to evaluate: O(S²·T)
Number of policies: O(A^(S·T))
Exhaustive search: O(A^(S·T) · S²·T)
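For concreteness, a sketch of finite-horizon evaluation of one fixed policy in a tabular model, assuming state-based rewards. The O(S²·T) cost is the loop over state pairs inside the loop over time; repeating it for every one of the O(A^(S·T)) deterministic policies gives the exhaustive-search cost. All names and the reward convention are illustrative.

```python
import numpy as np

def evaluate_policy(P, R, policy, T):
    """Finite-horizon evaluation of a fixed (time-dependent) policy.

    P[a] is an S x S transition matrix, R is a length-S state-reward vector,
    policy[t][s] is the action taken in state s at time t.
    Cost: O(S^2) per stage, so O(S^2 * T) overall for one policy.
    """
    S = len(R)
    V = np.array(R, dtype=float)       # value at the final stage
    for t in reversed(range(T)):
        V_new = np.empty(S)
        for s in range(S):             # S x S work per stage
            a = policy[t][s]
            V_new[s] = R[s] + P[a][s] @ V
        V = V_new
    return V
```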
Policy Learning
Time to evaluate: O(A·S²·T)
Exhaustive search: O(A^(S·T) · S²·T)
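A sketch of the dynamic-programming alternative: sweep backwards in time and take a max over actions at each state, which costs O(A·S²·T) instead of enumerating policies. Same illustrative tabular setup and reward convention as above.

```python
import numpy as np

def backward_induction(P, R, T):
    """Finite-horizon policy learning by dynamic programming.

    Instead of enumerating all O(A^(S*T)) policies, sweep backwards in time and
    take a max over actions at each (t, s): O(A*S^2) per stage, O(A*S^2*T) total.
    P[a] is an S x S transition matrix, R is a length-S state-reward vector.
    """
    S, A = len(R), len(P)
    V = np.array(R, dtype=float)                          # value at the final stage
    policy = []
    for t in reversed(range(T)):
        Q = np.stack([R + P[a] @ V for a in range(A)])    # A x S array of Q-values
        policy.append(Q.argmax(axis=0))                   # greedy action per state
        V = Q.max(axis=0)
    policy.reverse()                                      # policy[t][s] for t = 0..T-1
    return V, policy
```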
Policy Learning
• Enormous complexity savings: from O(A^(S·T) · S²·T) to O(A·S²·T)
• What assumptions are required?
  • Markov property
  • Rationality (linearity of expectation)
• What if we want to avoid the rationality assumption?
Beyond Simple Rationality • maximization of the expected value of the cumulative sum of a received scalar signal
• What is the “right” scalar signal? • Symptom relief? • Side-effects? • Cost? • Quality of life? • Different for every decision-maker, every patient.
Multiple Reward Signals
• How can we reason with multiple reward signals?
[Diagram: the agent–environment loop with state s_t, reward r_t, action a_t.]
• Make use of the human in the loop!
  • Preference Revealing (DL, Bowling, Murphy)
  • Multi-outcome Screening (DL, Ferguson, Laber)
  • Each adds computational complexity
Preference Elicitation
Suppose D different rewards are important for decision-making: r[1], r[2], ..., r[D].
Assume each person has a function f that maps these to a utility: that person's happiness with any configuration of the r[i], expressed as a scalar value.
We could use this as our new reward!
Preference Elicitation attempts to figure out an individual's f.
Preference Elicitation
1. Determine the preferences of the decision-maker
2. Construct a reward function from “basis rewards” (the different outcomes)
3. Compute the recommended treatment by policy learning
• Cost: O(A·S²·T) per elicitation
Preference Elicitation
One way: assume f has a nice form:
f(r[1], r[2], ..., r[D]) = δ[1]r[1] + δ[2]r[2] + ... + δ[D]r[D]
Then Preference Elicitation figures out the weights δ[i] an individual attaches to the different rewards.
How? The values δ[i] and δ[j] define an exchange rate between r[i] and r[j].
Preference Elicitation
f(r[1], r[2], ..., r[D]) = δ[1]r[1] + δ[2]r[2] + ... + δ[D]r[D]
“If I lost δ[j] units of r[i], but I gained δ[i] units of r[j], I would be equally happy.”
Preference elicitation asks questions like: “If I took away 5 units of r[i], how many units of r[j] would you want?”
Once f is known, standard single-outcome methods can be applied.
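A minimal sketch of the linear-utility assumption: given elicited weights δ, the D basis rewards collapse to one scalar reward, after which the single-outcome machinery applies. The weights and outcome values below are made up for illustration.

```python
# Collapse D basis rewards into one scalar reward using elicited weights.
# With f linear, delta[i]/delta[j] acts as an exchange rate between r[i] and r[j].

def scalarize(basis_rewards, delta):
    """f(r[1],...,r[D]) = delta[1]*r[1] + ... + delta[D]*r[D]."""
    return sum(d * r for d, r in zip(delta, basis_rewards))

# Hypothetical elicited weights over (symptom relief, weight gain, cost):
delta = (0.6, 0.3, 0.1)
r = (4.0, -2.0, -1.0)        # one patient's basis-reward outcomes
print(scalarize(r, delta))   # 0.6*4 + 0.3*(-2) + 0.1*(-1) = 1.7
```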
Preference Elicitation • Are the questions based in reality? • Even if they are, can the decision-maker answer them?
• How will the decision-maker respond to “I know what you want”?
Preference Revealing (DL, S Murphy, M Bowling)
1. Determine the preferences of the decision-maker
2. Compute the recommended treatment for all possible preferences (δ)
3. Show, for each action, what preferences are consistent with that action being recommended
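A sketch of steps 2 and 3 under the linear-utility assumption: compute the recommended action for every preference δ on a grid over the simplex, then report which preferences are consistent with each action. The per-action Q-vectors and the grid resolution are hypothetical.

```python
import numpy as np

# Hypothetical per-action Q-vectors over D = 3 basis rewards
# (e.g. PANSS, BMI, HQLS), for one fixed patient state:
Q = {"olanzapine":  np.array([3.0, -2.0, 1.0]),
     "ziprasidone": np.array([1.5,  1.0, 0.5])}

def recommended(delta):
    """Action maximizing the scalarized value delta . Q(s, a)."""
    return max(Q, key=lambda a: float(delta @ Q[a]))

# Sweep a grid over the preference simplex {delta >= 0, sum(delta) = 1}.
regions = {a: [] for a in Q}
step = 0.05
for d1 in np.arange(0, 1 + 1e-9, step):
    for d2 in np.arange(0, 1 - d1 + 1e-9, step):
        delta = np.array([d1, d2, 1.0 - d1 - d2])
        regions[recommended(delta)].append(delta)

for a, deltas in regions.items():
    print(a, "recommended for", len(deltas), "of the sampled preferences")
```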
Positive And Negative Syndrome Scale vs. Body Mass Index vs. Heinrichs Quality of Life Scale
[Figure: Preference Revealing on Phase 1 of the Clinical Antipsychotic Trials of Intervention Effectiveness (NIMH). Basis rewards: PANSS, δ = (1, 0, 0); BMI, δ = (0, 1, 0); HQLS, δ = (0, 0, 1). The plot shows which preferences lead to Ziprasidone vs. Olanzapine being recommended.]
Preference Revealing
Benefits
• No reliance on preference elicitation
• Facilitates deliberation rather than imposing a single recommended treatment
• Information still individualized through patient state
• Treatments that are not suggested for any preference are implicitly screened
Costs
• O(A^T · S^T) worst case; cf. O(A·S²·T) for a single reward, and 2^ℵ₀ for “enumeration”
• Feasible for short horizons
• Value function/policy is now a function of state and preference
• Value functions are not convex in preference, so related methods for POMDPs do not apply
Multi-outcome Screening
1. Elicit a “clinically meaningful difference” for each outcome
2. Screen out treatments that are “definitely bad”
3. Recommend the set of remaining treatments
Multi-outcome Screening (DL, B Ferguson, E Laber)
Suppose two* different rewards are important for decision-making: r[1], r[2].
Screen out a treatment if it is much worse than another treatment for one reward and not any better for the other.
Do not screen if (1) the treatments are not much different, or (2) one treatment is much worse for one reward but much better for the other.
Output: a set containing one or both treatments, possibly with a reason if both are included.
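A sketch of this screening rule for a single state, using a strong-domination-style criterion: drop an action only if some other action is better by at least the clinically meaningful difference on one reward and no worse on the other. The thresholds and Q-values are hypothetical, and this is a simplified version of the criterion, not the exact one from the paper.

```python
# Screen actions using (a version of) practical domination over two rewards.

DELTA = (2.5, 2.5)   # hypothetical clinically meaningful differences for r[1], r[2]

def dominates(qb, qa, delta=DELTA):
    """True if b is much better than a on some reward and no worse on the rest."""
    much_better = any(qb[i] - qa[i] >= delta[i] for i in range(len(qa)))
    no_worse = all(qb[i] >= qa[i] for i in range(len(qa)))
    return much_better and no_worse

def screen(q_vectors):
    """Return the actions that are not dominated by any other action."""
    return {a: qa for a, qa in q_vectors.items()
            if not any(dominates(qb, qa) for b, qb in q_vectors.items() if b != a)}

# Hypothetical Q-vectors (reward 1, reward 2) for one patient state:
q = {"a1": (6.0, 1.0), "a2": (5.5, 5.0), "a3": (2.0, 6.0), "a4": (1.0, 1.0)}
print(screen(q))   # a4 is screened out; a1, a2, a3 remain (different trade-offs)
```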
Multi-outcome Screening
Eliminate using Strong Practical Domination
[Figure: actions a1–a5 plotted by Q_T^[1](s_T, a) vs. Q_T^[2](s_T, a); actions that are strongly practically dominated are eliminated.]
Multi-outcome Screening
Benefits
• No model of preference required
• Suggests a set rather than imposing a single recommended treatment
• Information still individualized through patient state
• Treatments with bad evidence are explicitly screened
• Screening criterion is intuitive
Costs
• Must consider all policies the user might follow in the future
[Figure (Lizotte & Laber, Fig. 2): plot of a vector-valued Q-function (Q_{T−1}^[1](s_{T−1}, a_{T−1}; p_T), Q_{T−1}^[2](s_{T−1}, a_{T−1}; p_T)) for a fixed state s_{T−1}, where p_T ranges over a collection P of 20 possible future policies. Axes: Positive And Negative Syndrome Scale and Body Mass Index.]
Multi-outcome Screening
Costs (continued)
• Enumeration: O(A^(S·T) · S²·T)
• Restriction to policies that (1) follow recommendations and (2) are “not too complex” makes computation feasible
• CATIE: 10^124 possible futures reduced to 1213
Multi-Outcome Screening
[Figure (from “Non-deterministic multiple-reward fitted-Q for decision support”, Fig. 7): CATIE NDP for Phase 1 made using P_9; “warning” actions that would have been eliminated by Practical Domination but not by Strong Practical Domination have been removed.]
Figure 7 shows the NDP learned for Phase 1 using our algorithm with Strong Practical Domination (D1 = D2 = 2.5) and P_9; actions that receive a “warning” according to Practical Domination have been removed. Here, choice is further increased by requiring an action to be practically better than another action in order to dominate it. Furthermore, although we have removed actions that were warned to have a bad tradeoff—those that were slightly better for one reward but practically worse for another—we still achieved increased choice over using the Pareto frontier alone. In this NDP, the mean number of choices per …
Summary
• The Autonomous Agent model is for sequential decision-making, but we want decision support
• The Reward Hypothesis is useful computationally, but too restrictive for us
• Preference Revealing: compute for all possible scalar rewards
• Multi-outcome Screening: compute for all “reasonable” future policies
• In both cases, the increase in complexity is tolerable
Thank you!
• Dan Lizotte, dlizotte@uwaterloo.ca
Acknowledgements: My colleagues Susan Murphy, Michael Bowling, and Eric Laber; the Natural Sciences and Engineering Research Council of Canada (NSERC). Data were obtained from the limited-access datasets distributed from the NIH-supported “Clinical Antipsychotic Trials of Intervention Effectiveness in Schizophrenia” (CATIE-Sz). The study was supported by NIMH Contract N01MH90001 to the University of North Carolina at Chapel Hill. The ClinicalTrials.gov identifier is NCT00014001. This talk reflects the view of the author and may not reflect the opinions or views of the CATIE-Sz Study Investigators or the NIH.