Washington University in St. Louis
OLIN BUSINESS SCHOOL
Eli Snir, Ph.D.
Senior Lecturer in Management

May 2015

Olin is steeped in tradition, from Convocation, through the Olin barbecue on the first day of the semester, to graduation. Among the many traditions we have here is the QBA 121 – Managerial Statistics II poster session. While the poster session is a relatively new activity, it has become synonymous with the end of the semester and the beginning of finals week.

The poster session is the culmination of students’ semester-long projects applying statistical methods to problems of their choice. By choosing their own topics to analyze, students realize how statistics applies to varied business decisions. For the past few semesters the poster session has been held in the main hallway of the new Knight and Bauer Halls. This semester visitors were intrigued by the multitude of projects on display.

While the course focuses on the technical analysis of data and the ensuing managerial implications, students’ projects are what make the poster session a success. Projects vary widely, reflecting students’ varying interests; still, they all have a common theme: apply statistical tools to generate managerial insight. Some of the topics include: What determines a WashU student’s happiness; Factors that influence salaries in professional sports; Is food at WashU expensive?; Determinants of literacy rates; and How to get more “likes” on Instagram.

The hallway of Knight and Bauer Halls provides an opportunity for many and varied visitors to be involved with the students’ projects. Some come every semester to see what’s new in the course. These include academic advisors, Olin faculty, and deans. They question students on what they learned and how to apply their knowledge to real-world problems. Other visitors are fleeting. Among these are high-school students contemplating whether to choose WashU.
Other visitors include classmates, students in other programs, and outside faculty. They all leave with a better understanding of how our students transform data into knowledge and a better appreciation of what goes on inside an Olin classroom. This book provides a sample of the projects presented in the poster session at the end of the Spring 2015 semester. Hopefully, it will encourage you to consider how to use data and statistics to answer questions that interest you. Eli Snir Washington University in St. Louis, Olin Business School • Campus Box 1156 One Brookings Drive, St. Louis, Missouri 63130-4899 Tel (314) 935-6090 • Fax: (314) 935-6359 • snir@wustl.edu
By: Paul Huang, Justin Morrell
Predicting NBA Win Shares
Background and Motivation The NBA is an incredibly competitive landscape both in terms of finance and basketball. We are interested in looking at the salary, minutes/game, years of experience, and draft pick of NBA players to see what effects the variables have on win shares. Win shares are the contribution a player has towards the success and profitability of an NBA team.
Key Results • For every 1 minute increase in minutes/game, holding all other variables constant, a player will contribute .097 more win shares • For every extra year of experience a player has, holding all other variables constant, he will contribute .161 more win shares
[Figure: 2014 Minutes/Game vs. Win Shares. X-axis: 2014 minutes/game; Y-axis: 2014 win shares. Trend line: y = 0.1086x + 5.4549, R² = 0.0562]

Model & Analysis
Using a random sample of 100 NBA players from the 2014 regular season and methods of backwards elimination, we were able to create a multiple linear regression model that includes the variables minutes/game and experience.
[Figure: Years Experience vs. 2014 Win Shares. X-axis: years of experience; Y-axis: 2014 win shares. Trend line: y = 0.1834x + 6.9052, R² = 0.0519]

Final Model: win shares = 4.74 + .097 (minutes/game) + .161 (years of experience)

Observations: • Low R² • Both variables significant at α = .05 • No outliers or influence points

Descriptive Statistics: R² = 0.095737, Sε/ȳ = 0.4212, F-significance = 0.00759 • F-significance is below α = .05. • Low R² value of 0.095737. • Sε/ȳ is higher than benchmark of 0.2.
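As a rough illustration of how the two slopes can be read, the sketch below compares two hypothetical players. It uses only the slopes reported in the Key Results (.097 per minute/game and .161 per year of experience); the intercept is left out because it cancels in a comparison. The function name and the example players are assumptions made for illustration, not part of the poster.

```python
# Slopes reported on the poster; the intercept cancels when two players
# are compared, so it is not needed here.
MINUTES_SLOPE = 0.097      # win shares per extra minute per game
EXPERIENCE_SLOPE = 0.161   # win shares per extra year of experience


def win_share_gap(minutes_a, years_a, minutes_b, years_b):
    """Predicted win shares of player A minus player B (hypothetical helper)."""
    return (MINUTES_SLOPE * (minutes_a - minutes_b)
            + EXPERIENCE_SLOPE * (years_a - years_b))


if __name__ == "__main__":
    # Hypothetical players: 30 min/game with 5 years vs. 20 min/game with 2 years.
    gap = win_share_gap(minutes_a=30, years_a=5, minutes_b=20, years_b=2)
    print(f"Predicted gap: {gap:.3f} win shares")
```

Given the low R², a gap computed this way should be read as an average tendency rather than a forecast for any individual player.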
Conclusions • On average, high-usage players are better than low-usage players • On average, experienced players are better than less experienced players
Why is this important? In the NBA, winning is the most important thing for a team. Winning increases revenue from all sources. Oftentimes, an NBA GM only has a couple of years at most to turn a franchise around before being on the hot seat. Making the right trades and free-agent acquisitions is key to managing a successful NBA franchise.
Data Sources: • Basketball Reference • ESPN
What determines a Country’s Health Expenditure? By Scarlett Ho & Zirui Su
Intro & Motivation • Health expenditure has long been an important factor in people’s lives, especially as living standards rise worldwide • People in different countries differ in the amount they spend on their health • We aim to find the relationship between Health Expenditure and its deciding factors

Takeaways • Unemployment Rate, Education Rate, Life Expectancy and Death Rate are determinants of a country’s Health Expenditure • A country’s expenditure on health has little to do with GDP
Key Data Source • Central Intelligence Agency (CIA) • Credible and reliable • Access to worldwide information

Acknowledgement • We would like to thank Professor Snir for leading this research, as well as our teaching assistant Marc Breinstein for providing insights and advice

Model
• Dependent variable: Health Expenditure
• Independent variables: GDP Per Capita, GDP Growth Rate, Unemployment Rate, Education Expenditure, Obesity Rate, Life Expectancy, Death Rate
• We used Multiple Linear Regression with Backward Elimination to determine our model:
Health Expenditure = 0.027*Unemployment Rate + 0.56*Education Rate + 0.17*Life Expectancy + 0.33*Death Rate – 11.70

Key Results
• Mean of Health Expenditure: 6.88%
• R Square = 0.54
• Standard Error = 1.7
• Positive Correlation between Independent and Dependent Variables
• Slight impact on Health Expenditure from GDP Per Capita, GDP Growth Rate and Obesity Rate
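To show how the fitted equation might be applied, here is a minimal sketch that plugs one hypothetical country into it. The coefficients come from the poster; the units (rates in percent, life expectancy in years, death rate per 1,000 people, health expenditure as a percentage of GDP) and the input values are assumptions, since the poster does not state them.

```python
def predict_health_expenditure(unemployment_rate, education_rate,
                               life_expectancy, death_rate):
    """Apply the poster's fitted equation (units are assumed, see lead-in)."""
    return (0.027 * unemployment_rate
            + 0.56 * education_rate
            + 0.17 * life_expectancy
            + 0.33 * death_rate
            - 11.70)


if __name__ == "__main__":
    # Hypothetical country, values chosen only for illustration.
    estimate = predict_health_expenditure(
        unemployment_rate=6.0,   # assumed: percent
        education_rate=5.0,      # assumed: percent of GDP
        life_expectancy=75.0,    # assumed: years
        death_rate=8.0,          # assumed: deaths per 1,000 people
    )
    print(f"Predicted health expenditure: {estimate:.2f}")
```

With a standard error of 1.7 around a mean of 6.88%, a single prediction like this carries a wide margin.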
happiness... and the variables that predict it
Latrionna Moore, Lillie Ross, Lucas Rasmussen

motivation... Happiness is often a key motivator for people throughout their life. People tend to make decisions that will bring them the greatest sense of happiness. Although happiness is a subjective term, we were interested in the factors that either increase or decrease happiness for an individual with regard to business. Unhappy employees do not usually work as efficiently as satisfied employees, which reduces productivity. These employees may also be more likely to leave the company, which increases the job turnover rate. We hypothesize the factors that increase happiness the most are excellent health and being married, while the factor that

data... Data pertaining to general happiness and welfare is quite abundant. However, because happiness is inherently a subjective feeling, data tends to be varied and contradictory. Despite this, there are articles pertaining to employee happiness, such as Bloomberg Business Week’s report on employee happiness and productivity. Our research stems from a unified source: General Social Survey 2014- Norc.org.

model... Because happiness is a binary variable, we ran a logistic regression on the data with 770 observations. The original model predicted a 90.3% likelihood of happiness. After running a backwards stepwise Wald method, the sixth model predicted a 90.5% likelihood of happiness with eight significant variables. The model also had an R2 of 0.168 and a standard error of 0.122. Our ln(odds) for the model is...
ln(odds) = 3.392 – 0.021(Age) + 0.713(HG) – 0.054(MHD) – 1.489(Widowed) – 1.336(Separated) – 0.968(NM) + 0.008(Income/1000) – 1.025(Divorced)

key findings... The independent variable that had the greatest positive influence in predicting the likelihood of being happy was income. Although the coefficient is quite small, this is due to dividing the incomes by 1000. As income increases, the probability of happiness rises, on average. The independent variable that had the greatest negative influence in predicting the likelihood of being happy was marital status. With coefficients ranging from -1.489 to -0.968, someone who is widowed, separated, never married, or divorced has a lower predicted likelihood of being happy.
conclusion... After running the logistic regression, the statistically significant variables that influenced the likelihood of happiness are age, mental health days, income, health, and marital status.
*HG- Health Good, MHD- Mental Health Days, NM- Never Married
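The fitted equation gives the natural log of the odds, so turning it into a probability takes one more step: p = 1 / (1 + e^(-ln odds)). The sketch below shows that conversion for a hypothetical respondent. The coefficients are the poster's; the dictionary keys, the helper names, and the example respondent are assumptions made for illustration.

```python
import math

INTERCEPT = 3.392
# Coefficients from the poster; key names are illustrative.
COEFFICIENTS = {
    "age": -0.021,
    "health_good": 0.713,          # HG, 1 if health is good
    "mental_health_days": -0.054,  # MHD
    "widowed": -1.489,
    "separated": -1.336,
    "never_married": -0.968,       # NM
    "income_thousands": 0.008,     # income divided by 1000
    "divorced": -1.025,
}


def log_odds(person):
    """ln(odds) of being happy; missing keys are treated as 0."""
    return INTERCEPT + sum(coef * person.get(name, 0)
                           for name, coef in COEFFICIENTS.items())


def probability_happy(person):
    """Convert ln(odds) to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(person)))


if __name__ == "__main__":
    # Hypothetical respondent: married, age 40, good health,
    # 2 mental health days, income of $50,000.
    respondent = {"age": 40, "health_good": 1,
                  "mental_health_days": 2, "income_thousands": 50}
    print(f"ln(odds) = {log_odds(respondent):.3f}")
    print(f"P(happy) = {probability_happy(respondent):.3f}")
```

Setting one of the marital-status indicators to 1 for the same respondent lowers the predicted probability, which matches the sign of those coefficients.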
[Figure: variables considered: age, education, gender, health, income, relaxation, mental health days, marital status]
Group 32: Jack Goodman, Jack Hoots, & Vishruth Reddy
Motivation Every year, NBA teams make multi-million dollar decisions to sign certain players based, ultimately, on how many points per game they can contribute to the team. But what factors have a positive or negative impact on NBA players’ points per game?
Data Set We took every season in which a player scored above 23 points per game, including the 2014-2015 year. Average PPG: 26.2 points per game Standard Deviation: 2.56 Observations: 352
Hypothesis We expected scoring to decrease as age increases. We also thought that shooting (field goal and free throw percentage) and shot attempts would have a positive correlation with scoring. We expected Shooting Guard to be the highest scoring position.
Model & Results
Y = -16.97 – 0.09x1 + 30.18x2 + 0.91x3 + 7.41x4 + 0.19x5 – 0.34i1 – 0.33i2
x1: age (years)
x2: field goal percentage (%)
x3: field goals attempted
x4: free throw percentage (%)
x5: minutes played per game
i1: player holds small forward position (insignificant)
i2: player holds power forward position (insignificant)
Base case: player holds point guard, shooting guard, or center position
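The model above can be evaluated for a single player-season as a quick sanity check. In this sketch the coefficients are the poster's, but two things are assumptions: the shooting percentages are entered as proportions (0.48 rather than 48), which is what the size of the coefficients suggests, and the example player is invented.

```python
def predict_ppg(age, fg_pct, fg_attempts, ft_pct, minutes,
                small_forward=0, power_forward=0):
    """Points per game from the poster's model.

    fg_pct and ft_pct are assumed to be proportions (e.g. 0.48), not 48.
    small_forward / power_forward are 0/1 indicators; guards and centers
    are the base case.
    """
    return (-16.97
            - 0.09 * age
            + 30.18 * fg_pct
            + 0.91 * fg_attempts
            + 7.41 * ft_pct
            + 0.19 * minutes
            - 0.34 * small_forward
            - 0.33 * power_forward)


if __name__ == "__main__":
    # Hypothetical 27-year-old guard: 48% FG on 20 attempts, 85% FT, 37 min/game.
    ppg = predict_ppg(age=27, fg_pct=0.48, fg_attempts=20,
                      ft_pct=0.85, minutes=37)
    print(f"Predicted points per game: {ppg:.1f}")
```

Because the data set only contains seasons above 23 points per game, predictions for lower-scoring players would fall outside the range the model was fitted on.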
Graphics
Statistical Notes We chose to use a multiple linear regression model for this study. Model has an R2 of 0.70. Model’s F-significance is 1.15e-86. Model’s SE/ȳ is 0.03. Model significance level is α = 0.05. We found only 5 suspect points, including ‘14 Kevin Durant. We kept these points in the model due to the impact of these players on the league. Removing them would make the model less representative of the sample. We originally started the model with 9 independent variables, and reduced it to 7 via backwards elimination. We removed the Shooting Guard and Center positions as variables, since they were the least significant. The Point Guard position was the base at this time.
Conclusion
While age and minutes played do affect scoring, key variables that have a far greater impact on how much a player scores are shooting (both field goals and free throws) and shots attempted.
Data Sources: • Real GM • Basketball Reference
SAVING THE CHILDREN: Which factors contribute to youth mortality rates worldwide?

Motivation
We are hoping to study factors that predict the under-5 mortality rate worldwide. Identifying risk factors highly associated with youth mortality will allow policy makers to enact appropriate measures to reduce the risk factors, and therefore save more children’s lives.

Methodology
After our initial multiple linear regression, we noticed that youth mortality rate and adolescent fertility rate were heavily right-skewed. To correct this, we re-ran the model using the logarithmically transformed variables ln(mortality) and ln(adolescent fertility). We performed backwards regression until we got to our final model.

Hypothesis
We believe that higher fertility and adolescent fertility will predict higher youth mortality rates, and that higher immunization rates, female labor force participation, urban population and health expenditure will predict lower youth mortality.
Data Sources
All data came from 2012 World Bank dataset.
Dependent variable: Mortality rate under age 5 for the 171 countries with highest GDP per capita.
Independent variables: • Fertility rate • Adolescent fertility rate • Female labor force participation rate • DPT immunization rate* • Measles immunization rate • Health expenditure as a percentage of GDP • Urban population percentage
*Variable not in final model
Samantha McGanney & Katherine Plaster
Final Model
y = 2.3408 + 0.3190x1 – 0.0055x2 – 0.0063x3 – 0.0369x4 – 0.0106x5 + 0.4157x6
Where y: ln(youth mortality), x1: Fertility rate, x2: Female labor force participation, x3: Measles immunization, x4: Health expenditure, x5: urban population, x6: ln(adolescent fertility).
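Because the dependent variable is ln(youth mortality), a prediction has to be exponentiated to get back to deaths per 1,000 births. The sketch below walks through that for one hypothetical country. The coefficients are from the final model, with the urban population term taken as negative (an assumption based on the poster listing urban population among the factors that decrease mortality); the input values and function names are illustrative only.

```python
import math


def predict_ln_mortality(fertility, female_labor, measles, health_exp,
                         urban_pct, adolescent_fertility):
    """Predicted ln(under-5 mortality) from the poster's final model.

    The sign of the urban population term is assumed to be negative.
    """
    return (2.3408
            + 0.3190 * fertility
            - 0.0055 * female_labor
            - 0.0063 * measles
            - 0.0369 * health_exp
            - 0.0106 * urban_pct                       # assumed sign
            + 0.4157 * math.log(adolescent_fertility))


def predict_mortality(**inputs):
    """Back-transform the ln prediction to deaths per 1,000 births."""
    return math.exp(predict_ln_mortality(**inputs))


if __name__ == "__main__":
    # Hypothetical country, values chosen only for illustration.
    country = dict(fertility=3.0, female_labor=50.0, measles=90.0,
                   health_exp=6.0, urban_pct=50.0, adolescent_fertility=60.0)
    print(f"ln(mortality) = {predict_ln_mortality(**country):.3f}")
    print(f"mortality = {predict_mortality(**country):.1f} per 1,000 births")
```

Exponentiating a log-scale prediction tends to understate the average on the original scale, so the back-transformed figure is best treated as a typical value rather than an expected mean.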
Descriptive Statistics
Average of mortality rate = 36.16 deaths/1000 births
Average of ln(mortality) = 3.00

Strength of Fit
Measure: Value (Reliable?)
R2: 0.83968 (>0.5, yes)
F-test: 0 (<0.05, yes)
# Observations: 171 (>100, yes)

Key Results
Factors that increase youth mortality • Fertility Rate • ln(Adolescent Fertility)
Factors that decrease youth mortality • Female labor force participation • Measles immunization (not significant) • Health expenditure • Urban population %

Immunization rates are not significant predictors of youth mortality. Youth mortality was predicted by factors related to wealth and development. Instead of focusing on quick fixes like immunization programs, policy makers should focus on reducing poverty.

[Figure: Urban Population vs. ln(Mortality Rate). Trend line: y = -0.0329x + 4.8598, R² = 0.43769]
There is a negative correlation between ln(mortality) and urban population.

[Figure: Fertility Rate vs. Mortality Rate. Trend line: y = 22.204x - 28.051, R² = 0.73608]
There is a positive correlation between ln(mortality) and fertility rates.
HOW TO PREDICT LIFE EXPECTANCY?
Grace Bridwell, Amanda Zhaoyi Lin, Ashley Hanqiu Zhou

Motivation
The mystery of longevity has been an unsolved puzzle for many years. Curious to investigate this issue from a rather innovative angle, our study aims to analyze how a country’s productivity (as measured by GDP) and resources invested in education correlate with life expectancy.

Hypothesis
We predict that a country’s productivity will have the highest impact on its life expectancy. However, other factors such as government expenditure on education and rate of out-of-school children will also act as contributors.

Methodology
We used Backwards Elimination to develop a Multiple Linear Regression model and identified the model with the lowest standard error, with the following variables:
X1: Government expenditure on education as % of GDP (%)
X2: ln (GDP per capita (current US$))
X3: Rate of out-of-school children of primary school age (%)
X4: Pupil-teacher ratio in primary education

Ŷlife expectancy = 76.938 – 3.091x1 + 12.101x2 – 0.231x3 – 0.322x4

Eliminated Variables
- Expenditure on education as % of total government expenditure (%)
- Labor force participation rate, female (% of female population ages 15+)
- Unemployment, total (% of total labor force)
- Pupil-teacher ratio in secondary education

Data Sources
• UNESCO • World Bank • Trading Economics • CIA World Factbook • NOW Grenada
We collected the bulk of our data from UNESCO and World Bank, but because some of these sources were missing small pieces of data, Trading Economics, CIA World Factbook, and NOW Grenada were used to supplement this data.

Key Results
R^2 = 0.591
Se/y(bar) = 0.0848
Significance F = 3.068E-19
N = 108

Variable: Impact
Government expenditure on education as % of GDP (%): 2.444981
ln (GDP per capita (current US$)): 44.88261
Rate of out-of-school children of primary school age, both sexes (%): 0.015015
Pupil-teacher ratio in primary education: 0.015456

In our final model, the variable with the highest impact is: ln (GDP per capita (current US$))

[Figure: Life Expectancy vs. ln (GDP per capita (current US$))]

Conclusion
This analysis identifies which measures of education and productivity most significantly impact life expectancy. We come to the conclusion that improving GDP per capita will improve life expectancy. The above results can be used by public health organizations with a stake in improving the overall well-being, education, productivity, and health of a country.
Unlocking Instagram
Jackson Smith, Ethan Rinchik, Sam Shapiro-Kline
Managerial Statistics II

Motivation:
As users of the app, we have all noticed that certain photos end up with much higher numbers of likes than others do for no immediately discernible reason. Additionally, many companies use social media, including Instagram, for marketing purposes, and so knowing how to most effectively generate interest in a post is very helpful. We believe that the conclusions we draw from this study of Instagram will also be applicable to other forms of social media, which together constitute a large and growing portion of the advertising market.

Hypothesis:
The Number of Followers will have the largest impact on the number of likes, while other variables will have smaller effects.

Data Sources:
Instagram accounts of college-aged individuals

Methodology:
We applied the Backwards Elimination Method to a Multiple Linear Regression and eliminated three of our original variables in order to increase the accuracy of the model.

Key Results:
The Number of Followers is the independent variable with the greatest effect on the number of likes. There is a strong positive linear relationship, as hypothesized. We also found that the use of Emojis in captions and the presence of People in pictures are statistically significant variables, and they increase the expected number of likes substantially.

Descriptive Statistics:
Variable: Mean
Likes: 70.962
Followers: 427.943
Following: 434.048
Followers/Following: 1.036
Hashtag: 0.390
Emoji: 0.419
Animal: 0.076
Scenery: 0.371
People: 0.686
Joke: 0.476
Gender (1=F, 0=M): 0.467
Food or Beverage: 0.086

Regression Coefficients:
Variable: Coefficient
Intercept: 42.024
Followers: 0.408
Following: -0.172
Followers/Following: -83.751
Hashtags: -5.732
Emojis: 12.183
Scenery: 8.602
People: 15.747
Animals: -15.089
Joke, Gender, Food or Beverage: Not in the model

Key Metrics:
Metric: Value (Reliable?)
p-value of F-stat: 1.974E-24 (Yes, much less than 0.05)
R2: 0.734 (Fairly good)
Sε/ȳ: 0.342 (No, above 0.2)
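One way to sanity-check the reported coefficients is to evaluate the regression at the sample means: a least-squares fit with an intercept passes through the point of means, so the result should land close to the mean number of likes (70.962). The sketch below does that with the values printed on the poster. It assumes that "Hashtag", "Emoji" and "Animal" in the descriptive statistics are the same variables as "Hashtags", "Emojis" and "Animals" in the coefficient list; the dictionary keys are illustrative.

```python
INTERCEPT = 42.024

# Coefficients and sample means as printed on the poster.
COEFFICIENTS = {
    "followers": 0.408,
    "following": -0.172,
    "followers_per_following": -83.751,
    "hashtag": -5.732,
    "emoji": 12.183,
    "scenery": 8.602,
    "people": 15.747,
    "animal": -15.089,
}
MEANS = {
    "followers": 427.943,
    "following": 434.048,
    "followers_per_following": 1.036,
    "hashtag": 0.390,
    "emoji": 0.419,
    "scenery": 0.371,
    "people": 0.686,
    "animal": 0.076,
}


def predict_likes(post):
    """Predicted likes for a post described by the model's variables."""
    return INTERCEPT + sum(COEFFICIENTS[name] * post[name]
                           for name in COEFFICIENTS)


if __name__ == "__main__":
    at_means = predict_likes(MEANS)
    print(f"Prediction at the sample means: {at_means:.2f} (mean likes: 70.962)")
```

The small remaining difference is consistent with the coefficients and means having been rounded to three decimals.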
“BECAUSE IT’S THE CUP.™”
Can the regular season predict Stanley Cup Playoff success?

The National Hockey League is known for its parity across teams, especially among teams that advance to the Stanley Cup Playoffs. Because “anything is possible” in the playoffs, we decided to see if the regular season can predict postseason success, or if it truly is “a new season.”

With the rise of hockey analytics, or “fancy stats,” we have attempted to quantify the regular season performance of teams who made the playoffs dating back to the 2005-2006 season: • Point Percentage • Goal Differential • Shot Attempts • Face-offs • Power Play • Penalty Kill • Offensive Zone Starts • PDO

Definitions
• Point Percentage: percentage of all possible points
• Goal Differential = Goals For – Goals Allowed
• Shot Attempts: Shots on Net, Missed Shots, Blocked Shots
• Face-offs: percentage of face-offs won
• Power Play/Penalty Kill: efficiency
• Offensive Zone Starts: face-offs in the offensive zone
• PDO = Shooting Percentage + On-Ice Save Percentage

Regression
The highlighted variables are those that were not eliminated from our Backwards Wald stepwise linear regression. Here is the resulting regression equation:
y = 132.543 + 0.120x_GD + 0.265x_PP + 0.481x_PK − 0.245x_OZS − 1.604x_PDO
This equation predicts the number of games a team is likely to win in the postseason based on how they performed during the regular season.

The scatterplots of the 3 significant variables, Goal Differential, Penalty Kill, and PDO, are shown below.
[Figures: scatterplots of Goal Differential, Penalty Kill, and PDO]

Results
The Cox and Snell R2 is 0.147, meaning that our model had a relatively poor fit. This indicates that it is very difficult to predict playoff success based solely on quantitative measures of regular season performance.

Conclusion
While these results appear to support the idea that “anything can happen” in the playoffs, there are also intangibles, such as “grit,” “resilience,” and “veteran leadership,” that contribute to playoff success but cannot be quantified. There may also be other measures of performance that are not relevant during the regular season, but are vital during the playoffs.

All data was collected from reputable hockey statistics and analytics websites, including the NHL, ESPN, WAR On Ice, and Hockey Reference.
Melissa Guo
MONEY PUCK JEREMY ABEND VARUN PATEL
MOTIVATION
Currently in the NHL, there is a lot of salary diversity among players. Some players are making over $7 million a year while others make only $600,000 a year. This study aims to find out what hockey statistics are correlated to higher salaries. If a strong correlation is found between salary and some key statistics, NHL general managers may be able to acquire useful players that possess skills that do not yield a high salary. This would help their team and still leave more money for other roster players.

HYPOTHESIS We posit that a young age and a high number of goals and assists will be the 3 factors that have the greatest impact on a player’s salary.

DATA SOURCES All of the players’ statistics were taken from Hockey-Reference (url: http://www.hockeyreference.com/leagues/NHL_2014_skaters.html)

METHODOLOGY: WHAT WE DID
We performed a linear regression and then used backwards elimination to find the best model. We found the following variables to be statistically significant: Age, Games Played, Goals, Assists, plus/minus, Penalties in Minutes, and Game Winning Goals. We first removed the Time on Ice variable, because we found that it was highly correlated with all the other variables (high VIF value). We then found the variable Shooting Percentage to be statistically insignificant. After we removed it from our model, the Standard Error decreased, and the R-Squared remained the same.
THE MODEL Y = -1769.6581 + 104.6614(B1) – 13.6199(B2) + 41.1538(B3) + 87.0686(B4) + 20.8651(B5) + 7.2256(B6) + 160.4074(B7) B1 = Age, B2 = Games Played, B3 = Goals, B4 = Assists, B5 = plus/minus, B6 = Penalties in Minutes, B7 = Game Winning Goals R2 = 0.58899 N = 198
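For a sense of scale, the model can be evaluated for one hypothetical skater. The coefficients below are the poster's; the unit of Y is not stated, and the sketch assumes thousands of dollars because that matches the $600,000 to $7 million salary range mentioned in the motivation. The example player and the function name are invented for illustration.

```python
def predict_salary(age, games_played, goals, assists, plus_minus,
                   penalty_minutes, game_winning_goals):
    """Salary from the poster's model (assumed unit: thousands of dollars)."""
    return (-1769.6581
            + 104.6614 * age
            - 13.6199 * games_played
            + 41.1538 * goals
            + 87.0686 * assists
            + 20.8651 * plus_minus
            + 7.2256 * penalty_minutes
            + 160.4074 * game_winning_goals)


if __name__ == "__main__":
    # Hypothetical 27-year-old: 80 games, 25 goals, 35 assists,
    # +10, 40 penalty minutes, 4 game-winning goals.
    salary = predict_salary(age=27, games_played=80, goals=25, assists=35,
                            plus_minus=10, penalty_minutes=40,
                            game_winning_goals=4)
    print(f"Predicted salary: {salary:,.0f} (assumed thousands of dollars)")
```

Comparing the per-unit terms shows why the poster singles out assists and game-winning goals: each assist adds about twice as much to the prediction as each goal.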
This graph shows the relationship between a player’s goals scored in a season and his contract. We see a positive relationship between the two variables. Many times, players who score many goals are seen as the most exciting and most valuable players on a team. However, our analysis shows that other factors have greater impacts on a player’s salary.
KEY RESULTS CONCLUSION
Currently, NHL players are paid a higher salary based on their age, assists, and game winning goals. With this knowledge, an NHL manager could target players with other useful skills and traits that would command a lower salary. For example, a manager could try to sign a younger player with more goals, games played, and penalties in minutes for a lower salary than an older player who specializes in assists and crucial goals. On the other hand, a player attempting to earn a higher salary should try to record many assists and game winning goals.