Using Analytics to Improve Outcomes John Guttag MIT
My Research Group Technical areas Machine learning and data mining Algorithm design and signal processing Computer vision Software systems Application areas Medical analytics Financial analytics Sports analytics Principal funding Quanta Computer NSF, NIH, NSERC
July 18, 2013
©John Guttag
Medical Analytics Cardiovascular risk stratification Billions of heart beats Healthcare associated infections Millions of hospital admissions Risk-adjusted quality assessment Millions of surgeries Video-based monitoring and diagnoses
Financial Analytics Major projects Reducing tail risk in the equities market Understanding the role of “news” in the equities market Reuters and Twitter With Andy Lo (Sloan and CSAIL) Millions of transactions
Sports Analytics Major projects Predicting local outcomes in baseball Millions of pitches Understanding impact of strategic decisions in basketball Millions of frames of video
Rest of Talk Convey a sense of the possibilities, not technical details Financial analytics Medical analytics Sports analytics (time permitting) Primarily the work of Gartheeban Ganeshapillai and Jenna Wiens
Finance: Beating the Market You can have a higher expected return than the market Each day pick a few volatile equities In the long run, you should outperform the market “But this long run is a misleading guide to current affairs. In the long run we are all dead.” – J. M. Keynes
The Theory Mitigate risk by diversification Choose equities with low correlation of returns Minimum variance portfolio (MVP) Minimizes expected variance around expected return Based on correlation of historical returns Works pretty well, most of the time Does not account for “tail risk”
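As a rough illustration of the minimum variance portfolio mentioned above, the sketch below computes the textbook weights w = C^{-1}1 / (1^T C^{-1} 1) from a covariance matrix C. It assumes a fully invested portfolio with no short-sale or sector constraints; the function names and the sample matrix are made up for the example, and this is not the speaker's implementation.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def min_variance_weights(cov):
    """Weights proportional to C^{-1} 1, normalized to sum to 1."""
    z = solve(cov, [1.0] * len(cov))
    total = sum(z)
    return [v / total for v in z]


def portfolio_variance(cov, w):
    """Variance w^T C w of a portfolio with weights w."""
    n = len(w)
    return sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))


# Hypothetical covariance of daily returns for two equities.
cov = [[0.04, 0.01],
       [0.01, 0.09]]
weights = min_variance_weights(cov)
variance = portfolio_variance(cov, weights)
```

In this toy case the combined variance is lower than that of either equity alone, which is the diversification effect the slide refers to.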
Tail Risk
MVP typically uses a covariance matrix (C) that weights
Positive and negative returns equally
Small and large returns equally
We supplement that by learning the likelihood of two stocks crashing at the same time
Connection matrix (G)
Learned using a recursive regularized regression

(Paper excerpt shown on the slide)
Chen & Chihying (2007) provides a method to model the temporal sequence associations for rare events. Arnold et al. (2007) examines a host of algorithms that, loosely speaking, fall under the category of graphical Granger methods to quantify the connectedness in time series.
3. Method
If the closing prices of equity j on days T and T − 1 are $p_{T,j}$ and $p_{T-1,j}$, the return for equity j on day T is given by $r_{T,j} = (p_{T,j} - p_{T-1,j}) / p_{T-1,j}$. On day T + 1, we are given historical daily returns for m equities in a T × m matrix $R = \{r_{t,j}\}$; 1 ≤ t ≤ T, 1 ≤ j ≤ m. We use indexing t for days, and j, k for equities. When equity j has $r_{t,j} < -0.1$ (a 10% loss), we say that an event occurred on day t. We assume that daily returns (rows of R) are independent. While daily returns are generally believed to be heteroskedastic (White, 1980), since we focus on large returns, which are rare, we can safely assume that the modeling errors are uncorrelated. We use regularization to tackle overfitting. The regularization parameter λ is determined by cross validation. Factor model representation of returns is common in finance and econometrics (Longin & Solnik, 1999; Khandani & Lo, 2007).

Learning the Model, Part 1
We model the return of equity k on day t by
\[ \hat{r}_{t,k} = \underbrace{a_k + b_k r_{t,\Lambda}}_{d_{t,k}} + \sum_{j=1,\, j \neq k}^{m} w_{j,k} \, (r_{t,j} - d_{t,j}) \qquad (1) \]
$\hat{r}_{t,k}$: daily return of stock k on day t
$a_k$: active return
$b_k$: market dependency
$w_{j,k}$: connectedness with other equities
In this model, we explicitly learn the factors for equity
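To make the definitions on this slide concrete, here is a minimal sketch of the daily-return formula, the 10% loss event, and an evaluation of Equation (1) for given parameters. It assumes that d for equity j is a_j + b_j times the market return, and that the r term multiplied by b is the market return on that day; both readings are inferred from the slide labels. All function and variable names are illustrative.

```python
def daily_returns(prices):
    """prices[t][j] is the closing price of equity j on day t.

    Returns one row per day t >= 1 with
    r[t][j] = (p[t][j] - p[t-1][j]) / p[t-1][j].
    """
    return [
        [(today[j] - prev[j]) / prev[j] for j in range(len(today))]
        for prev, today in zip(prices, prices[1:])
    ]


def loss_events(returns, threshold=-0.1):
    """Flag (day, equity) pairs whose return is below the threshold (a 10% loss)."""
    return [[r < threshold for r in row] for row in returns]


def predict_return(k, day_returns, market_return, a, b, w):
    """Evaluate Equation (1) for equity k on one day.

    a[j], b[j]: active return and market dependency of equity j.
    w[j][k]: connectedness weight from equity j to equity k.
    Assumption: d[j] = a[j] + b[j] * market_return for every equity j.
    """
    m = len(a)
    d = [a[j] + b[j] * market_return for j in range(m)]
    connected = sum(w[j][k] * (day_returns[j] - d[j]) for j in range(m) if j != k)
    return d[k] + connected


# Hypothetical two-equity example: equity 0 drops 15%, equity 1 gains 2%.
prices = [[100.0, 50.0], [85.0, 51.0]]
rets = daily_returns(prices)
events = loss_events(rets)
pred = predict_return(1, rets[0], 0.01, [0.0, 0.0], [1.0, 1.0],
                      [[0.0, 0.5], [0.5, 0.0]])
```

With a positive weight from equity 0 to equity 1, the large loss in equity 0 pulls the modeled return of equity 1 below what the market term alone would give.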
[Paper excerpt cut off at the slide edge. The recoverable fragments indicate that the parameters $(a_k, b_k, w_{j,k})$ are estimated by minimizing a regularized least-squares objective over $a^*, b^*, w^*$, using iterative updates of $a_k$, $b_k$ and $w_{j,k}$ that involve a parameter η. The objective and the update equations themselves are not legible.]
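The excerpt refers to a regularized regression whose update equations are cut off. As a generic illustration only, the sketch below fits an L2-regularized least-squares model by plain gradient descent with a fixed step size `eta`. The penalty form, the fixed step size, the unpenalized intercept, and all names are assumptions; this is not the recursive procedure from the talk.

```python
def fit_ridge_gd(xs, ys, lam=0.1, eta=0.05, steps=2000):
    """Minimize mean squared error + lam * ||w||^2 by gradient descent.

    xs: list of feature rows, ys: list of targets.
    The intercept (bias) is not penalized.
    """
    n, m = len(xs), len(xs[0])
    bias, w = 0.0, [0.0] * m
    for _ in range(steps):
        # Residuals of the current model on every training row.
        errs = [bias + sum(wj * xj for wj, xj in zip(w, row)) - y
                for row, y in zip(xs, ys)]
        grad_bias = 2.0 * sum(errs) / n
        grad_w = [2.0 * sum(e * row[j] for e, row in zip(errs, xs)) / n
                  + 2.0 * lam * w[j]
                  for j in range(m)]
        bias -= eta * grad_bias
        w = [wj - eta * g for wj, g in zip(w, grad_w)]
    return bias, w


# Toy data generated from y = 1 + 2x.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
bias_plain, w_plain = fit_ridge_gd(xs, ys, lam=0.0)
bias_reg, w_reg = fit_ridge_gd(xs, ys, lam=1.0)
```

Without the penalty the fit recovers the generating line; with it, the weight is shrunk toward zero, which is how regularization limits overfitting. The slide states that the regularization parameter λ is chosen by cross validation, which this sketch does not do.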
Experiment (Retrospective) We use daily return data Consider 369 companies in the S&P500 from 2000 - 2011 Distribute the capital equally across sectors Build MVP each day for each sector Using covariance matrix (COV) Using connectedness matrix (FAC) Compare results
Cumulative Returns – Energy Sector
Combining All Sectors, 2000 - 2011
Measures                                           FAC1    COV     S&P500 Index
Worst Day                                          -0.04   -0.11   -0.09
Total Return                                       6.52    3.68    -0.09
Annual Sharpe Ratio (risk adjusted total return)   2.26    1.45    N/A
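The three measures in the table can be computed from a series of daily portfolio returns roughly as follows. The talk does not say how its figures were calculated, so the conventions here (compounded total return, zero risk-free rate, 252 trading days per year, sample standard deviation) are assumptions, and the sample data are invented.

```python
import math
import statistics


def worst_day(returns):
    """Most negative single-day return."""
    return min(returns)


def total_return(returns):
    """Compounded return over the whole period."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def annual_sharpe(returns, risk_free_daily=0.0, trading_days=252):
    """Mean daily excess return over its standard deviation, annualized."""
    excess = [r - risk_free_daily for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(trading_days)


# Hypothetical daily returns.
daily = [0.01, -0.02, 0.015, 0.005, -0.01]
summary = {
    "worst_day": worst_day(daily),
    "total_return": total_return(daily),
    "annual_sharpe": annual_sharpe(daily),
}
```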
Caveats 2008 – 2011 not your typical market What works on yesterday’s markets may not work on tomorrow’s
Onto Medicine Capacity to gather medically significant data growing quickly Better instrumentation (e.g., MRI machines, ambulatory monitors, cameras) generates more information/patient
More storage capacity allows information to be saved Economic and social forces creating more aggregation of data
The Big Data Challenge in Medicine Too Big: images, videos, signals 10’s of millions of patients Billions of bits/patient Too Hard Multiple modalities Signals, lab results, images, genomic, natural language … Always incomplete, often incorrect Financial and sports data much better Existing analytical methods not up to the challenge One saving grace: human physiology and medical practice changes a lot more slowly than financial markets What we learn likely to be of long-lasting value
An Example: Healthcare-Associated Infections (HAIs) Contribute to ~4% of all deaths in the US Despite much effort, they remain stubbornly prevalent “Medicare Shift Fails to Cut Hospital Infections” -- Oct. 10th 2012
Our focus: Clostridium difficile (C. diff) High variation in prevalence across institutions Dynamics of risk factors poorly understood
Problem 1: Risk Factors Vary Over Time
• Collected at the time of admission
• e.g., admission complaint, previous admissions, home meds
• Changes during the hospitalization
• e.g., current meds, current procedures, current location, hospital conditions
PATIENT RISK Representing and reasoning about temporal processes promises to enhance the accuracy of inferences about risk.
Variables Used in Our Study Data available at the time of admission Age Sex Admission Complaint Admission Procedure Data from Hospital Admissions/Visits in last 90 days Medical History Data collected during the hospital stay Medications Locations of the Patient Procedures Lab Results Devices Physicians EMR Data, not billing data
Handling Time
Typical Approaches in Clinical Literature Calculate risk using only a snapshot of patient’s state At time of admission n days before index event
Our Approach Calculate risk using the entire evolving risk profile
[Figure: estimated risk plotted against time (days), with the index event marked]
The Task and Approach Predict who is at risk of acquiring an HAI after the patient spends t days in the hospital
[Figure: timeline from time of admission (0 days) to time of prediction (t days), labeled “Collected features available here”, followed by the positive test result]
Using Topic modeling to reduce dimensionality Soft margin SVMs to produce “risk processes” that capture evolution of risk over time Multiple machine learning techniques to combine time series data with other data to generate risk
Results
Potentially actionable information Prospective clinical study planned for fall 2013
Problem 2: Getting Enough Useful Data Most models built from large national or international data sets Predictions can be pretty good ON AVERAGE Models learned on global datasets often perform poorly when applied to specific hospitals since they do not account for institutional differences Hospital-specific models built using data from a single hospital can account for institutional differences, but often suffer from a scarcity of training data
Potential Solution Solution: Augment hospital-specific data with data from other hospitals. Transfer learning. Violates the assumption that training data and future data are in the same feature space and have the same distribution Data from different hospitals are in different feature spaces Data from different hospitals have different distributions
Two Approaches Target-Only: Classic single-task approach, uses only data from the target task and seeks a solution in the target space Source + Target (joint): Uses data from both the source and target tasks, assuming a solution of the form
The Data Three hospitals within the same network: Hospital A ~180 beds and ~10,000 admissions per year Hospital B ~250 beds and 15,000 admissions per year Hospital C >900 beds and >40,000 admissions per year
Preliminary Results
[Figure: AUROC of the Target-Only, Source-Only, and Source+Target models, one panel per target hospital: (a) Target Task A, (b) Target Task B, (c) Target Task C. Axis ticks run from 0.6 to 0.85.]
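For readers unfamiliar with the metric in this figure, AUROC is the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case. The sketch below computes it by direct pair counting; the function name and the sample labels and scores are invented for illustration and are not data from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve by pairwise comparison.

    labels: 1 for positive cases, 0 for negative cases.
    scores: predicted risk, higher meaning more likely positive.
    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and risk scores for four patients.
example = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic, which is fine for a sketch but slow for millions of admissions.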
Sports Analytics Work Basketball Rebounding strategy Early defensive and offensive positioning Other things I can’t talk about in public Baseball Predicting the next pitch Surprisingly (to me) accurate But nothing to compare it to Predicting the next inning (for starting pitchers) Importance of different variables interesting Pitch count over-rated More accurate predictor (even in late innings of close games) than MLB manager’s decisions
Two Models Manager model uses only pitch count and opposing team score This 2 parameter model is 95% accurate in predicting manager’s decision! Our model Current game statistics (5) Previous inning statistics (16) Priors on pitchers and batters (47)
Most Important Features Pitcher-Inning (SLG) Pitcher-Batting team (SLG) Pitcher-1st batter (Runs) Pitcher-3rd batter (Runs) Pitcher-Inning (number of times) Notice what’s not there: Pitch count Opposing team score
Our Model vs. Actual Decisions (after 4th inning)
OR
5012 innings in which manager and model pull starter
6201 innings in which manager and model leave starter in
Runs scored in 17.7% of those innings
9288 innings in which manager left starter in, but model would not have
Runs scored in 31.5% of those innings
1037 innings in which manager pulled starter, but model would have left him in
No way to know what would have happened
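The counts on this slide can be combined into two summary figures: how often the model and the manager agreed, and how much more often runs were scored when the manager left the starter in against the model's recommendation. The short calculation below uses only the numbers on the slide; the variable names are illustrative.

```python
# Inning counts from the slide.
both_pull = 5012
both_leave = 6201
manager_leaves_model_pulls = 9288
manager_pulls_model_leaves = 1037

total = (both_pull + both_leave
         + manager_leaves_model_pulls + manager_pulls_model_leaves)

# Share of innings in which the model matched the manager's decision.
agreement = (both_pull + both_leave) / total

# Run-scoring rates reported on the slide for the two "starter left in" groups.
rate_when_both_leave = 0.177
rate_when_model_disagrees = 0.315
relative_rate = rate_when_model_disagrees / rate_when_both_leave
```

The model and manager agree in roughly 52% of the 21,538 innings, and runs were scored about 1.8 times as often in the innings where the manager kept a starter the model would have pulled.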
Wrapping Up We’ve entered a data-centric era Most important decisions should be data-driven But we must always remember that data in the hands of knaves or fools is more likely to mislead than inform Billions of heart beats, millions of hospital admissions, financial transactions, and pitches is lots of data But data ≠ information ≠ knowledge ≠ wisdom Data doesn’t lie, but people use data to lie Decision makers must learn to distinguish valid inferences from invalid ones and to understand that even valid inferences should usually be viewed in a probabilistic context
Principal Collaborators and Funding
Faculty/Physicians: Zeeshan Syed (Univ. of Michigan); Eric Horvitz, MD, PhD (MSR); Ben Scirica, MD (BWH); Collin Stultz, MD, PhD (MIT); Daryush Mehta (MGH); Andrew Lo (MIT); Fredo Durand (MIT); Bill Freeman (MIT); Bob Hillman (MGH); Matias Zañartu (USM)
Current students: Anima Singh; Jen Gong; Marzyeh Ghassemi; Jenna Wiens; Joel Brooks; Guha Balakrishnan; Yun Liu; Gartheeban Ganeshapillai; Amy Zhao
Funding: Quanta Computer, General Electric, NSF, NIH, NSERC