Possible Futures: Complexity in Sequential Decision-Making Dan Lizotte Assistant Professor, David R. Cheriton School of Computer Science University of Waterloo
Waterloo Institute for Complexity and Innovation 22 Jan 2013
Agents, Rationality, and Sequential Decision-Making • Sequential decision-making from the AI tradition • Supporting sequential clinical decision-making • The assumption of rationality: • mitigates complexity ☺ • reduces expressiveness ☹ • Alternatives to “simple rationality” • Please interrupt at any time
“An agent is just something that perceives and acts.” -Russell & Norvig
[Diagram: the agent–environment loop. The agent observes state s_t and reward r_t and takes action a_t; the environment returns s_{t+1} and r_{t+1}. Graphic: Leslie Pack Kaelbling, MIT]
The “Reward Hypothesis” “That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).” — Rich Sutton A.K.A. “Rationality.” (“reward” ≡ “utility”)
[Diagram: the same loop in clinical terms. The state s_t is the patient's status, the reward r_t is the outcome, the action a_t is the treatment; the environment returns s_{t+1} and r_{t+1}.]
Autonomous Agent Paradigm
[Diagram: the agent–environment loop again, with state s_t, reward r_t, action a_t, next state s_{t+1}.]
Good
• Goal is to maximize long-term reward
• Makes context-dependent, individualized decisions
• Handles uncertain environments naturally
Bad
• Need to prepare for every eventuality
• Relies on the reward hypothesis
• Relies on the assumption of rational* behaviour
*Rational in the non-judgemental, mathematical sense.
Decision Support Agent
Place a human in between.
[Diagram: the agent–environment loop with a human decision-maker placed between the agent and the environment.]
Recommend rather than perform actions, while still leveraging the autonomous agent idea.
Rational Decision Making Should I Move or Stay?
Assume we have:
• An observation of the state of the world
• Rewards associated with each state
• A model of how the world changes
[Diagram: from the current state s, two actions with their outcome probabilities and rewards.]
Move: P = 0.1, R = 0; P = 0.1, R = 100; P = 0.7, R = 0; P = 0.1, R = −50 → Q(s, Move) = 5
Stay: P = 0.1, R = −25; P = 0.1, R = 10; P = 0.7, R = 0; P = 0.1, R = 0 → Q(s, Stay) = −1.5
How do we decide what to do?
• Often, “we” assume we want to maximize expected reward
• For each potential action, compute the expected reward over all possible future states
• Choose the action with the highest expected reward
• “Q-function”: Q(s, a) = E[R | s, a]
[Diagram: the same Move/Stay example, now annotated with the resulting decision.]
Q(s, Move) = 5, Q(s, Stay) = −1.5
π(s) = Move, V(s) = 5
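A minimal sketch of the one-step calculation above, in Python. The probabilities and rewards are the toy numbers from the slide; the dictionary layout and function names are just for illustration.

```python
# One-step Q computation: Q(s, a) = E[R | s, a] = sum over outcomes of P * R.

# (probability, reward) pairs for each action in the current state
outcomes = {
    "Move": [(0.1, 0), (0.1, 100), (0.7, 0), (0.1, -50)],
    "Stay": [(0.1, -25), (0.1, 10), (0.7, 0), (0.1, 0)],
}

def q_value(action):
    """Expected immediate reward of taking `action` in the current state."""
    return sum(p * r for p, r in outcomes[action])

q = {a: q_value(a) for a in outcomes}   # {'Move': 5.0, 'Stay': -1.5}
policy = max(q, key=q.get)              # pi(s) = 'Move'
value = q[policy]                       # V(s) = 5.0
print(q, policy, value)
```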
Sequential Decision-Making
• For each potential action, compute the expected reward over all possible future states
• Feasible if the reward is known for each state
• What if the reward is influenced by subsequent decision-making?
[Diagram: a later state s′, reached through the Stay branch's P = 0.1, R = 10 outcome, offers two actions.]
Mosey: P = 1.0, R = 10 → Q(s′, Mosey) = 10
Scramble: P = 0.9, R = 100; P = 0.1, R = 0 → Q(s′, Scramble) = 90
π(s′) = Scramble, V(s′) = 90
Back in the current state, that successor is now worth V = 90 rather than R = 10:
Q(s, Move) = 5 (unchanged), Q(s, Stay) = −1.5 → 6.5
π(s) = Move → Stay, V(s) = 5 → 6.5
Sequential Decision-Making • Optimal decisions are influenced by future plans
• Not a problem, if we know what we will do when we get to any future state.
• Replace R (short-term) with V (long-term) by the law of total expectation
• Expectation is now over sequences of states
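A sketch of the backup in the Mosey/Scramble example: replacing the successor state's short-term reward R = 10 with its long-term value V = 90 flips the recommendation from Move to Stay. The numbers come from the toy example; only the successor highlighted on the slide is given a long-term value here.

```python
# Replace the short-term reward R of a successor state with its long-term
# value V (law of total expectation), then re-evaluate the current state.

# Next-stage state s': Mosey pays 10 for sure; Scramble pays 100 w.p. 0.9.
q_next = {
    "Mosey":    1.0 * 10,             # 10
    "Scramble": 0.9 * 100 + 0.1 * 0,  # 90
}
v_next = max(q_next.values())         # V(s') = 90, pi(s') = Scramble

# Current state: the Stay outcome that used to pay R = 10 leads to s',
# so its contribution becomes 0.1 * v_next instead of 0.1 * 10.
q_move = 0.1 * 0 + 0.1 * 100 + 0.7 * 0 + 0.1 * (-50)     # 5.0 (unchanged)
q_stay = 0.1 * (-25) + 0.1 * v_next + 0.7 * 0 + 0.1 * 0  # 6.5 (was -1.5)

policy = "Move" if q_move > q_stay else "Stay"            # now 'Stay'
value = max(q_move, q_stay)                               # 6.5
print(q_move, q_stay, policy, value)
```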
Policy Evaluation
Time to evaluate: O(S²·T)
Number of policies: O(A^(S·T))
Exhaustive search: O(A^(S·T) · S²·T)
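For concreteness, a sketch of finite-horizon evaluation of one fixed policy in a tabular model, assuming state-based rewards. The O(S²·T) cost is the loop over state pairs inside the loop over time; repeating it for every one of the O(A^(S·T)) deterministic policies gives the exhaustive-search cost. All names and the reward convention are illustrative.

```python
import numpy as np

def evaluate_policy(P, R, policy, T):
    """Finite-horizon evaluation of a fixed (time-dependent) policy.

    P[a] is an S x S transition matrix, R is a length-S state-reward vector,
    policy[t][s] is the action taken in state s at time t.
    Cost: O(S^2) per stage, so O(S^2 * T) overall for one policy.
    """
    S = len(R)
    V = np.array(R, dtype=float)       # value at the final stage
    for t in reversed(range(T)):
        V_new = np.empty(S)
        for s in range(S):             # S x S work per stage
            a = policy[t][s]
            V_new[s] = R[s] + P[a][s] @ V
        V = V_new
    return V
```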
Policy Learning
Time to evaluate: O(A·S²·T)
Exhaustive search: O(A^(S·T) · S²·T)
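A sketch of the dynamic-programming alternative: sweep backwards in time and take a max over actions at each state, which costs O(A·S²·T) instead of enumerating policies. Same illustrative tabular setup and reward convention as above.

```python
import numpy as np

def backward_induction(P, R, T):
    """Finite-horizon policy learning by dynamic programming.

    Instead of enumerating all O(A^(S*T)) policies, sweep backwards in time and
    take a max over actions at each (t, s): O(A*S^2) per stage, O(A*S^2*T) total.
    P[a] is an S x S transition matrix, R is a length-S state-reward vector.
    """
    S, A = len(R), len(P)
    V = np.array(R, dtype=float)                          # value at the final stage
    policy = []
    for t in reversed(range(T)):
        Q = np.stack([R + P[a] @ V for a in range(A)])    # A x S array of Q-values
        policy.append(Q.argmax(axis=0))                   # greedy action per state
        V = Q.max(axis=0)
    policy.reverse()                                      # policy[t][s] for t = 0..T-1
    return V, policy
```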
Policy Learning
• Enormous complexity savings: from O(A^(S·T) · S²·T) to O(A·S²·T)
• What assumptions are required?
  • Markov property
  • Rationality (linearity of expectation)
• What if we want to avoid the rationality assumption?
Beyond Simple Rationality • maximization of the expected value of the cumulative sum of a received scalar signal
• What is the “right” scalar signal? • Symptom relief? • Side-effects? • Cost? • Quality of life? • Different for every decision-maker, every patient.
Multiple Reward Signals
• How can we reason with multiple reward signals?
[Diagram: the agent–environment loop with state s_t, reward r_t, action a_t.]
• Make use of the human in the loop!
  • Preference Revealing (DL, Bowling, Murphy)
  • Multi-outcome Screening (DL, Ferguson, Laber)
  • Each adds computational complexity
Preference Elicitation
Suppose D different rewards are important for decision-making: r[1], r[2], ..., r[D].
Assume each person has a function f that maps these to a utility: that person's happiness with any configuration of the r[i], expressed as a scalar value.
We could use this as our new reward!
Preference Elicitation attempts to figure out an individual's f.
Preference Elicitation
1. Determine the preferences of the decision-maker
2. Construct a reward function from “basis rewards” (the different outcomes)
3. Compute the recommended treatment by policy learning
• Cost: O(A·S²·T) per elicitation
Preference Elicitation
One way: assume f has a nice form:
f(r[1], r[2], ..., r[D]) = δ[1]r[1] + δ[2]r[2] + ... + δ[D]r[D]
Then Preference Elicitation figures out the weights δ[i] an individual attaches to the different rewards.
How? The values δ[i] and δ[j] define an exchange rate between r[i] and r[j].
Preference Elicitation
f(r[1], r[2], ..., r[D]) = δ[1]r[1] + δ[2]r[2] + ... + δ[D]r[D]
“If I lost δ[j] units of r[i], but I gained δ[i] units of r[j], I would be equally happy.”
Preference elicitation asks questions like: “If I took away 5 units of r[i], how many units of r[j] would you want?”
Once f is known, standard single-outcome methods can be applied.
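A minimal sketch of the linear-utility assumption: given elicited weights δ, the D basis rewards collapse to one scalar reward, after which the single-outcome machinery applies. The weights and outcome values below are made up for illustration.

```python
# Collapse D basis rewards into one scalar reward using elicited weights.
# With f linear, delta[i]/delta[j] acts as an exchange rate between r[i] and r[j].

def scalarize(basis_rewards, delta):
    """f(r[1],...,r[D]) = delta[1]*r[1] + ... + delta[D]*r[D]."""
    return sum(d * r for d, r in zip(delta, basis_rewards))

# Hypothetical elicited weights over (symptom relief, weight gain, cost):
delta = (0.6, 0.3, 0.1)
r = (4.0, -2.0, -1.0)        # one patient's basis-reward outcomes
print(scalarize(r, delta))   # 0.6*4 + 0.3*(-2) + 0.1*(-1) = 1.7
```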
Preference Elicitation • Are the questions based in reality? • Even if they are, can the decision-maker answer them?
• How will the decision-maker respond to “I know what you want”?
Preference Revealing (DL, S Murphy, M Bowling)
1. Determine the preferences of the decision-maker
2. Compute the recommended treatment for all possible preferences (δ)
3. Show, for each action, what preferences are consistent with that action being recommended
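A sketch of steps 2 and 3 under the linear-utility assumption: compute the recommended action for every preference δ on a grid over the simplex, then report which preferences are consistent with each action. The per-action Q-vectors and the grid resolution are hypothetical.

```python
import numpy as np

# Hypothetical per-action Q-vectors over D = 3 basis rewards
# (e.g. PANSS, BMI, HQLS), for one fixed patient state:
Q = {"olanzapine":  np.array([3.0, -2.0, 1.0]),
     "ziprasidone": np.array([1.5,  1.0, 0.5])}

def recommended(delta):
    """Action maximizing the scalarized value delta . Q(s, a)."""
    return max(Q, key=lambda a: float(delta @ Q[a]))

# Sweep a grid over the preference simplex {delta >= 0, sum(delta) = 1}.
regions = {a: [] for a in Q}
step = 0.05
for d1 in np.arange(0, 1 + 1e-9, step):
    for d2 in np.arange(0, 1 - d1 + 1e-9, step):
        delta = np.array([d1, d2, 1.0 - d1 - d2])
        regions[recommended(delta)].append(delta)

for a, deltas in regions.items():
    print(a, "recommended for", len(deltas), "of the sampled preferences")
```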
Positive And Negative Syndrome Scale vs. Body Mass Index vs. Heinrichs Quality of Life Scale
[Figure: Preference Revealing on Phase 1 of the Clinical Antipsychotic Trials of Intervention Effectiveness (NIMH). Basis rewards: PANSS, δ = (1, 0, 0); BMI, δ = (0, 1, 0); HQLS, δ = (0, 0, 1). The plot shows which preferences lead to Ziprasidone vs. Olanzapine being recommended.]
Preference Revealing
Benefits
• No reliance on preference elicitation
• Facilitates deliberation rather than imposing a single recommended treatment
• Information still individualized through patient state
• Treatments that are not suggested for any preference are implicitly screened
Costs
• O(A^T · S^T) worst case; cf. O(A·S²·T) for a single reward, and 2^ℵ₀ for “enumeration”
• Feasible for short horizons
• Value function/policy is now a function of state and preference
• Value functions are not convex in preference, so related methods for POMDPs do not apply
Multi-outcome Screening
1. Elicit a “clinically meaningful difference” for each outcome
2. Screen out treatments that are “definitely bad”
3. Recommend the set of remaining treatments
Multi-outcome Screening (DL, B Ferguson, E Laber)
Suppose two* different rewards are important for decision-making: r[1], r[2].
Screen out a treatment if it is much worse than another treatment for one reward and not any better for the other.
Do not screen if (1) the treatments are not much different, or (2) one treatment is much worse for one reward but much better for the other.
Output: a set containing one or both treatments, possibly with a reason if both are included.
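A sketch of this screening rule for a single state, using a strong-domination-style criterion: drop an action only if some other action is better by at least the clinically meaningful difference on one reward and no worse on the other. The thresholds and Q-values are hypothetical, and this is a simplified version of the criterion, not the exact one from the paper.

```python
# Screen actions using (a version of) practical domination over two rewards.

DELTA = (2.5, 2.5)   # hypothetical clinically meaningful differences for r[1], r[2]

def dominates(qb, qa, delta=DELTA):
    """True if b is much better than a on some reward and no worse on the rest."""
    much_better = any(qb[i] - qa[i] >= delta[i] for i in range(len(qa)))
    no_worse = all(qb[i] >= qa[i] for i in range(len(qa)))
    return much_better and no_worse

def screen(q_vectors):
    """Return the actions that are not dominated by any other action."""
    return {a: qa for a, qa in q_vectors.items()
            if not any(dominates(qb, qa) for b, qb in q_vectors.items() if b != a)}

# Hypothetical Q-vectors (reward 1, reward 2) for one patient state:
q = {"a1": (6.0, 1.0), "a2": (5.5, 5.0), "a3": (2.0, 6.0), "a4": (1.0, 1.0)}
print(screen(q))   # a4 is screened out; a1, a2, a3 remain (different trade-offs)
```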
Multi-outcome Screening
Eliminate using Strong Practical Domination
[Figure: actions a1–a5 plotted by Q_T^[1](s_T, a) vs. Q_T^[2](s_T, a); actions that are strongly practically dominated are eliminated.]
Multi-outcome Screening
Benefits
• No model of preference required
• Suggests a set rather than imposing a single recommended treatment
• Information still individualized through patient state
• Treatments with bad evidence are explicitly screened
• Screening criterion is intuitive
Costs
• Must consider all policies the user might follow in the future
[Figure (Lizotte & Laber, Fig. 2): plot of a vector-valued Q-function (Q_{T−1}^[1](s_{T−1}, a_{T−1}; p_T), Q_{T−1}^[2](s_{T−1}, a_{T−1}; p_T)) for a fixed state s_{T−1}, where p_T ranges over a collection P of 20 possible future policies. Axes: Positive And Negative Syndrome Scale and Body Mass Index.]
Multi-outcome Screening
Costs (continued)
• Enumeration: O(A^(S·T) · S²·T)
• Restriction to policies that (1) follow recommendations and (2) are “not too complex” makes computation feasible
• CATIE: 10^124 possible futures reduced to 1213
Multi-Outcome Screening
[Figure (from “Non-deterministic multiple-reward fitted-Q for decision support”, Fig. 7): CATIE NDP for Phase 1 made using P_9; “warning” actions that would have been eliminated by Practical Domination but not by Strong Practical Domination have been removed.]
Figure 7 shows the NDP learned for Phase 1 using our algorithm with Strong Practical Domination (D1 = D2 = 2.5) and P_9; actions that receive a “warning” according to Practical Domination have been removed. Here, choice is further increased by requiring an action to be practically better than another action in order to dominate it. Furthermore, although we have removed actions that were warned to have a bad tradeoff—those that were slightly better for one reward but practically worse for another—we still achieved increased choice over using the Pareto frontier alone. In this NDP, the mean number of choices per …
Summary
• The Autonomous Agent model is for sequential decision-making, but we want decision support
• The Reward Hypothesis is useful computationally, but too restrictive for us
• Preference Revealing: compute for all possible scalar rewards
• Multi-outcome Screening: compute for all “reasonable” future policies
• In both cases, the increase in complexity is tolerable
Thank you!
• Dan Lizotte, dlizotte@uwaterloo.ca
Acknowledgements: My colleagues Susan Murphy, Michael Bowling, and Eric Laber; the Natural Sciences and Engineering Research Council of Canada (NSERC). Data were obtained from the limited-access datasets distributed from the NIH-supported “Clinical Antipsychotic Trials of Intervention Effectiveness in Schizophrenia” (CATIE-Sz). The study was supported by NIMH Contract N01MH90001 to the University of North Carolina at Chapel Hill. The ClinicalTrials.gov identifier is NCT00014001. This talk reflects the view of the author and may not reflect the opinions or views of the CATIE-Sz Study Investigators or the NIH.