prof._john_guttag by Epoch Foundation

Using Analytics to Improve Outcomes John Guttag MIT

My Research Group
•  Technical areas
    •  Machine learning and data mining
    •  Algorithm design and signal processing
    •  Computer vision
    •  Software systems
•  Application areas
    •  Medical analytics
    •  Financial analytics
    •  Sports analytics
•  Principal funding
    •  Quanta Computer
    •  NSF, NIH, NSERC

July 18, 2013

©John Guttag

Medical Analytics
•  Cardiovascular risk stratification
    •  Billions of heart beats
•  Healthcare associated infections
    •  Millions of hospital admissions
•  Risk-adjusted quality assessment
    •  Millions of surgeries
•  Video-based monitoring and diagnoses


Financial Analytics
•  Major projects
    •  Reducing tail risk in the equities market
    •  Understanding the role of “news” in the equities market
        •  Reuters and Twitter
•  With Andy Lo (Sloan and CSAIL)
•  Millions of transactions


Sports Analytics
•  Major projects
    •  Predicting local outcomes in baseball
        •  Millions of pitches
    •  Understanding impact of strategic decisions in basketball
        •  Millions of frames of video


Rest of Talk
•  Convey a sense of the possibilities, not technical details
    •  Financial analytics
    •  Medical analytics
    •  Sports analytics (time permitting)
•  Primarily the work of Gartheeban Ganeshapillai and Jenna Wiens


Finance: Beating the Market
•  You can have a higher expected return than the market
    •  Each day pick a few volatile equities
    •  In the long run, you should outperform the market
•  “But this long run is a misleading guide to current affairs. In the long run we are all dead.” – J. M. Keynes


The Theory
•  Mitigate risk by diversification
    •  Choose equities with low correlation of returns
•  Minimum variance portfolio (MVP)
    •  Minimizes expected variance around expected return
    •  Based on correlation of historical returns
•  Works pretty well, most of the time
•  Does not account for “tail risk”
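As a rough illustration of the minimum variance portfolio idea, the sketch below computes the textbook unconstrained solution, weights proportional to C⁻¹·1 and normalized to sum to one. The talk does not say which constraints (for example, no short selling) were actually used, so the function names, the absence of constraints, and the two-equity covariance matrix are all assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are left untouched.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def min_variance_weights(cov: Matrix) -> List[float]:
    """Unconstrained minimum variance weights: C^-1 1 / (1' C^-1 1).

    Weights may come out negative (short positions) because no
    constraint is imposed here; that is an assumption of this sketch.
    """
    raw = solve(cov, [1.0] * len(cov))
    total = sum(raw)
    return [v / total for v in raw]


def portfolio_variance(cov: Matrix, w: List[float]) -> float:
    """Portfolio variance w' C w."""
    n = len(w)
    return sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))


# Hypothetical covariance matrix for two uncorrelated equities.
example_cov = [[0.04, 0.0], [0.0, 0.01]]
example_weights = min_variance_weights(example_cov)
example_variance = portfolio_variance(example_cov, example_weights)
```

With these made-up numbers, most of the capital goes to the less volatile equity, and the resulting portfolio variance is lower than that of either equity held alone.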


Tail Risk
•  MVP typically uses a covariance matrix (C) that weights
    •  Positive and negative returns equally
    •  Small and large returns equally
•  We supplement that by learning likelihood of two stocks crashing at the same time
    •  Connection matrix (G)
    •  Learned using a recursive regularized regression

[Paper excerpt]

1991; Pekasiewicz, 2007). Chen & Chihying (2007) provides a method to model the temporal sequence associations for rare events. Arnold et al. (2007) examines a host of algorithms that, loosely speaking, fall under the category of graphical Granger methods to quantify the connectedness in time series.

3. Method

If the closing prices of equity j on days T and T − 1 are $p_{T,j}$ and $p_{T-1,j}$, the return for equity j on day T is given by $r_{T,j} = (p_{T,j} - p_{T-1,j})/p_{T-1,j}$. On day T + 1, we are given historical daily returns for m equities in a T × m matrix $R = \{r_{t,j}\}$; $1 \le t \le T$, $1 \le j \le m$. We use indexing t for days, and j, k for equities. When equity j had an $r_{t,j} < -0.1$ (10% loss), we say that equity j had an event on day t. We assume that daily returns (rows of R) are independent. While daily returns are generally believed to be heteroskedastic (White, 1980), since the large returns we focus on are rare, we can safely assume that the modeling errors are uncorrelated. We use regularization to tackle overfitting. The regularization parameter λ is determined by cross validation. Factor model representation of returns is common in finance and econometrics (Longin & Solnik, 1999; Khandani & Lo, 2007).

Learning the Model, Part 1

We model the return of equity k on day t by

$$\hat{r}_{t,k} = \underbrace{a_k + b_k\, r_{t,\Lambda}}_{d_{t,k}} + \sum_{j=1:m;\ j \neq k} w_{j,k}\,(r_{t,j} - d_{t,j}) \qquad (1)$$

•  $\hat{r}_{t,k}$: daily return of stock k on day t
•  $a_k$: active return
•  $b_k\, r_{t,\Lambda}$: market dependency
•  $\sum_{j \neq k} w_{j,k}\,(r_{t,j} - d_{t,j})$: connectedness with other equities

In this model, we explicitly learn the factors for equity
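The definitions in the excerpt (daily return, the 10% loss event, and the model in equation (1)) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the tiny two-equity example, and all parameter values are invented here, and the parameters are simply supplied rather than learned.

```python
from typing import List


def daily_returns(prices: List[float]) -> List[float]:
    """r_T = (p_T - p_{T-1}) / p_{T-1} for consecutive closing prices."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]


def is_event(r: float, threshold: float = -0.1) -> bool:
    """An event is a daily return below -0.1, i.e. a loss of more than 10%."""
    return r < threshold


def predict_return(
    k: int,
    a: List[float],
    b: List[float],
    w: List[List[float]],
    market_return: float,
    returns_today: List[float],
) -> float:
    """Evaluate equation (1) for equity k on one day.

    d[j] = a[j] + b[j] * market_return is the part of equity j's return
    explained by its active return and market dependency; w[j][k] weights
    the residual of equity j in the prediction for equity k.
    """
    m = len(a)
    d = [a[j] + b[j] * market_return for j in range(m)]
    connected = sum(
        w[j][k] * (returns_today[j] - d[j]) for j in range(m) if j != k
    )
    return d[k] + connected


# Hypothetical closing prices: up 10%, then down 20%.
example_returns = daily_returns([100.0, 110.0, 88.0])
example_events = [is_event(r) for r in example_returns]

# Hypothetical two-equity model: equity 0 drops 12% on a day the market
# gains 1%, and half of its residual carries over to equity 1.
example_prediction = predict_return(
    k=1,
    a=[0.0, 0.0],
    b=[1.0, 1.0],
    w=[[0.0, 0.5], [0.5, 0.0]],
    market_return=0.01,
    returns_today=[-0.12, 0.0],
)
```

In the example, equity 1 is predicted to fall even though the market rose, because the large negative residual of equity 0 is propagated through the weight `w[0][1]`.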


[Paper excerpt truncated in extraction. The surviving fragments indicate that the parameters $(a_k, b_k, w_{j,k})$ are estimated by a regularized least-squares minimization over $a^*, b^*, w^*$, that the squared error is reduced by iterative updates of $a_k$, $b_k$, and $w_{j,k}$, and that η is a parameter of those updates. The objective and the update rules themselves are not recoverable.]
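The fragments above mention a regularized least-squares problem and parameter updates involving η, but the actual objective and update rules are cut off. The sketch below is therefore not the authors' procedure; it is a generic L2-regularized least-squares fit by plain gradient descent, with `lam` standing in for the regularization parameter λ and `eta` for a fixed step size. The function name, the fixed step size, and the toy data are all assumptions.

```python
from typing import List, Tuple


def fit_ridge_gd(
    xs: List[List[float]],
    ys: List[float],
    lam: float = 0.01,
    eta: float = 0.05,
    steps: int = 5000,
) -> Tuple[float, List[float]]:
    """Minimize mean squared error + lam * sum(w_i^2) by gradient descent.

    The intercept is left unpenalized. Generic sketch only; the source's
    own objective and update rules are not recoverable.
    """
    n, p = len(xs), len(xs[0])
    intercept, weights = 0.0, [0.0] * p
    for _ in range(steps):
        errors = [
            intercept + sum(w * x for w, x in zip(weights, row)) - y
            for row, y in zip(xs, ys)
        ]
        grad_intercept = 2.0 * sum(errors) / n
        grad_weights = [
            2.0 * sum(e * row[i] for e, row in zip(errors, xs)) / n
            + 2.0 * lam * weights[i]
            for i in range(p)
        ]
        intercept -= eta * grad_intercept
        weights = [w - eta * g for w, g in zip(weights, grad_weights)]
    return intercept, weights


# Toy data generated from y = 1 + 2x (hypothetical).
toy_x = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [1.0, 3.0, 5.0, 7.0]

unregularized = fit_ridge_gd(toy_x, toy_y, lam=0.0)
regularized = fit_ridge_gd(toy_x, toy_y, lam=1.0)
```

Without the penalty the fit recovers the generating line; with a large penalty the slope is shrunk toward zero, which is the overfitting control the excerpt refers to. In practice λ would be chosen by cross validation, as the excerpt states.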

Experiment (Retrospective)
•  We use daily return data
•  Consider 369 companies in the S&P500 from 2000 - 2011
•  Distribute the capital equally across sectors
•  Build MVP each day for each sector
    •  Using covariance matrix (COV)
    •  Using connectedness matrix (FAC)
•  Compare results


Cumulative Returns – Energy Sector


Combining All Sectors, 2000 - 2011

Measures                                           FAC1     COV      S&P500 Index
Worst Day                                          -0.04    -0.11    -0.09
Total Return                                       6.52     3.68     -0.09
Annual Sharpe Ratio (risk adjusted total return)   2.26     1.45     N/A
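For readers unfamiliar with the three measures, the sketch below shows one common way to compute them from a series of daily returns. The slides do not state the exact conventions behind the table, so the compounding of total return, the zero risk-free rate, the 252 trading days per year, and the three sample returns are all assumptions.

```python
import math
import statistics
from typing import List


def worst_day(returns: List[float]) -> float:
    """Most negative single-day return."""
    return min(returns)


def total_return(returns: List[float]) -> float:
    """Compounded return over the whole period."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def annual_sharpe(
    returns: List[float],
    risk_free_daily: float = 0.0,
    periods_per_year: int = 252,
) -> float:
    """Mean daily excess return over its sample standard deviation,
    scaled by the square root of the number of periods per year."""
    excess = [r - risk_free_daily for r in returns]
    return (
        statistics.mean(excess)
        / statistics.stdev(excess)
        * math.sqrt(periods_per_year)
    )


# Hypothetical daily returns, not data from the talk.
example = [0.01, -0.02, 0.03]
example_worst = worst_day(example)
example_total = total_return(example)
example_sharpe = annual_sharpe(example)
```

The Sharpe ratio divides return by volatility, which is why a strategy with a smaller worst day and a higher total return, as in the FAC1 column, also scores higher on it.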

Caveats
•  2008 – 2011 not your typical market
•  What works on yesterday’s markets may not work on tomorrow’s


On to Medicine
•  Capacity to gather medically significant data growing quickly
    •  Better instrumentation (e.g., MRI machines, ambulatory monitors, cameras) generates more information/patient
    •  More storage capacity allows information to be saved
    •  Economic and social forces creating more aggregation of data


The Big Data Challenge in Medicine
•  Too Big: images, videos, signals
    •  Tens of millions of patients
    •  Billions of bits/patient
•  Too Hard
    •  Multiple modalities
        •  Signals, lab results, images, genomic, natural language …
    •  Always incomplete, often incorrect
        •  Financial and sports data much better
    •  Existing analytical methods not up to the challenge
•  One saving grace: human physiology and medical practice change a lot more slowly than financial markets
    •  What we learn likely to be of long-lasting value


An Example: Healthcare-Associated Infections (HAIs)
•  Contribute to ~4% of all deaths in the US
•  Despite much effort, they remain stubbornly prevalent
    •  “Medicare Shift Fails to Cut Hospital Infections” (Oct. 10, 2012)
•  Our focus: Clostridium difficile (C. diff)
    •  High variation in prevalence across institutions
    •  Dynamics of risk factors poorly understood


Problem 1: Risk Factors Vary Over Time
•  Collected at the time of admission
    •  e.g., admission complaint, previous admissions, home meds
•  Changes during the hospitalization
    •  e.g., current meds, current procedures, current location, hospital conditions

PATIENT RISK

Representing and reasoning about temporal processes promises to enhance the accuracy of inferences about risk.


Variables Used in Our Study
•  Data available at the time of admission
    •  Age
    •  Sex
    •  Admission Complaint
    •  Admission Procedure
    •  Data from Hospital Admissions/Visits in last 90 days
    •  Medical History
•  Data collected during the hospital stay
    •  Medications
    •  Locations of the Patient
    •  Procedures
    •  Lab Results
    •  Devices
    •  Physicians
•  EMR Data, not billing data


Handling Time
•  Typical Approaches in Clinical Literature
    •  Calculate risk using only a snapshot of patient’s state
        •  At time of admission
        •  n days before index event
•  Our Approach
    •  Calculate risk using the entire evolving risk profile

[Figure: estimated risk versus time (days), with the index event marked]


The Task and Approach
•  Predict who is at risk of acquiring an HAI after the patient spends t days in the hospital

[Figure: timeline from time of admission (0 days) to time of prediction (t days), with collected features available over that interval, followed by positive test results]

•  Using
    •  Topic modeling to reduce dimensionality
    •  Soft margin SVMs to produce “risk processes” that capture evolution of risk over time
    •  Multiple machine learning techniques to combine time series data with other data to generate risk

May 2013


Results
•  Potentially actionable information
•  Prospective clinical study planned for fall 2013


Problem 2: Getting Enough Useful Data
•  Most models built from large national or international data sets
    •  Predictions can be pretty good ON AVERAGE
    •  Models learned on global datasets often perform poorly when applied to specific hospitals since they do not account for institutional differences
•  Hospital-specific models built using data from a single hospital can account for institutional differences, but often suffer from a scarcity of training data


Potential Solution
•  Solution: Augment hospital-specific data with data from other hospitals
•  Transfer learning
    •  Violates assumption that training data and future data are in the same feature space and have the same distribution
    •  Data from different hospitals are in different feature spaces
    •  Data from different hospitals have different distributions


Two Approaches
•  Target-Only: Classic single-task approach, uses only data from the target task and seeks a solution in the target space
•  Source + Target (joint): Uses data from both the source and target tasks, assuming a solution of the form [equation not preserved]


The Data
•  Three hospitals within the same network:
    •  Hospital A: ~180 beds and ~10,000 admissions per year
    •  Hospital B: ~250 beds and 15,000 admissions per year
    •  Hospital C: >900 beds and >40,000 admissions per year


Preliminary Results

[Figure: AUROC (axis range roughly 0.6 to 0.85) of the Target-Only, Source-Only, and Source+Target models in three panels: (a) Target Task A, (b) Target Task B, (c) Target Task C]
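The results are reported as AUROC, the area under the receiver operating characteristic curve. One standard way to read it is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it directly from that definition; the function name and the four sample patients are made up for illustration and are not data from the study.

```python
from typing import List


def auroc(labels: List[int], scores: List[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Quadratic in the number of cases, which is fine for a small example.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores for four patients; label 1 marks an infection.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auroc = auroc(example_labels, example_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a sense of scale for the axis range in the figure.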

Sports Analytics Work
•  Basketball
    •  Rebounding strategy
    •  Early defensive and offensive positioning
    •  Other things I can’t talk about in public
•  Baseball
    •  Predicting the next pitch
        •  Surprisingly (to me) accurate
        •  But nothing to compare it to
    •  Predicting the next inning (for starting pitchers)
        •  Importance of different variables interesting
        •  Pitch count over-rated
        •  More accurate predictor (even in late innings of close games) than MLB manager’s decisions


Two Models
•  Manager model uses only pitch count and opposing team score
    •  This two-parameter model is 95% accurate in predicting manager’s decision!
•  Our model
    •  Current game statistics (5)
    •  Previous inning statistics (16)
    •  Priors on pitchers and batters (47)


Most Important Features
•  Pitcher-Inning (SLG)
•  Pitcher-Batting team (SLG)
•  Pitcher-1st batter (Runs)
•  Pitcher-3rd batter (Runs)
•  Pitcher-Inning (number of times)
•  Notice what’s not there:
    •  Pitch count
    •  Opposing team score


Our Model vs. Actual Decisions (after 4th inning)

•  5012 innings in which manager and model pull starter
•  6201 innings in which manager and model leave starter in
    •  Runs scored in 17.7% of those innings
•  9288 innings in which manager left starter in, but model would not have
    •  Runs scored in 31.5% of those innings
•  1037 innings in which manager pulled starter, but model would have left him in
    •  No way to know what would have happened
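The counts on this slide support a little arithmetic: how often the manager and the model agreed, and how the scoring rates compare between the innings where both left the starter in and the innings where only the manager did. The sketch below works that out from the slide's numbers. The variable names are invented, and the ratio is a descriptive comparison only, since the slide itself notes that the counterfactual cannot be observed.

```python
# Counts and scoring rates as given on the slide.
both_pull = 5012
both_leave = 6201
manager_left_model_pull = 9288
manager_pulled_model_leave = 1037

run_rate_both_leave = 0.177
run_rate_manager_only_left = 0.315

total_innings = (
    both_pull
    + both_leave
    + manager_left_model_pull
    + manager_pulled_model_leave
)

# Share of innings in which manager and model made the same call.
agreement_rate = (both_pull + both_leave) / total_innings

# Approximate number of innings with runs scored in each group.
scoring_innings_both_leave = both_leave * run_rate_both_leave
scoring_innings_manager_only = (
    manager_left_model_pull * run_rate_manager_only_left
)

# How much more often runs scored when the manager left the starter in
# against the model's recommendation.
rate_ratio = run_rate_manager_only_left / run_rate_both_leave

print(f"total innings: {total_innings}")
print(f"agreement rate: {agreement_rate:.3f}")
print(f"scoring-rate ratio: {rate_ratio:.2f}")
```

On these figures the two agree in roughly half of the innings, and runs scored a little under 1.8 times as often when the starter stayed in against the model's recommendation.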


Wrapping Up
•  We’ve entered a data-centric era
    •  Most important decisions should be data-driven
•  But we must always remember that data in the hands of knaves or fools is more likely to mislead than inform
    •  Billions of heart beats, millions of hospital admissions, financial transactions, and pitches are lots of data
    •  But data ≠ information ≠ knowledge ≠ wisdom
    •  Data doesn’t lie, but people use data to lie
•  Decision makers must learn to distinguish valid inferences from invalid ones and to understand that even valid inferences should usually be viewed in a probabilistic context


Principal Collaborators and Funding
•  Faculty/Physicians
    •  Zeeshan Syed (Univ. of Michigan)
    •  Eric Horvitz, MD, PhD (MSR)
    •  Ben Scirica, MD (BWH)
    •  Collin Stultz, MD, PhD (MIT)
    •  Daryush Mehta (MGH)
    •  Andrew Lo (MIT)
    •  Fredo Durand (MIT)
    •  Bill Freeman (MIT)
    •  Bob Hillman (MGH)
    •  Matias Zañartu (USM)
•  Current students
    •  Anima Singh
    •  Jen Gong
    •  Marzyeh Ghassemi
    •  Jenna Wiens
    •  Joel Brooks
    •  Guha Balakrishnan
    •  Yun Liu
    •  Gartheeban Ganeshapillai
    •  Amy Zhao
•  Funding
    •  Quanta Computer, General Electric, NSF, NIH, NSERC
