Washington University in St. Louis
OLIN BUSINESS SCHOOL
Eli Snir, Ph.D.
Senior Lecturer in Management
June 2016

Managerial Statistics II Poster Session – tools for analyzing Big Data

Statistics is becoming a competitive necessity for companies large and small. No longer the domain of a few back-office analysts, statistical knowledge is now required as part of a well-rounded analytical background. That’s where QBA121 – Managerial Statistics II comes in. In the course we develop the foundational skills for statistical analysis, primarily regression modeling. In a world surrounded by Big Data opportunities, students need to be comfortable with data and have the tools to analyze data to develop managerial recommendations.

The QBA121 project does just that. It allows students to explore and analyze data that interests them. The poster session gives them an opportunity to display their insights to a broad audience. Their projects reflect both their interests and the competencies they mastered during the course. Most of the students applied regression analysis to gain insight on a topic they were knowledgeable about. Statistics allowed them to expand their knowledge while grounding insights in data, rather than conjecture. In many projects initial hypotheses were only partially supported, and the data showed that multiple factors impact the dependent variable.

During the course we learned why statistics is a foundation in so many disciplines in academia. It can be applied to diverse topics that students will explore in their academic journey. Similarly, the topics chosen for the course project reflect the diversity of students’ interests.

Movies were of interest this time around, with students identifying which movie awards are the best predictors of Academy Award nominations; how a book’s genre and the author’s productivity impact a movie’s success; and which factors are effective predictors of Best-Actor Oscar nominations.

Crime statistics make for an effective, if not particularly uplifting, project. Here we learned how education, poverty, population density, and law enforcement impact local crime. We also found out that the key factors determining a city’s crime index are assaults, rapes, and larceny.

Finally, sports frequently generate a lot of interest in the class. Every Monday-morning coach wants to know why his team isn’t living up to expectations. For those who want an in-depth look at college basketball, we have some advice on which teams will succeed in “March Madness”.

The posters in this book represent a sample of the projects that were undertaken in the course in the Spring 2016 semester. Please read through them to appreciate students’ accomplishments, to learn about the diverse applications of statistics, and to learn about a few topics.

Eli Snir
Washington University in St. Louis, Olin Business School • Campus Box 1156, One Brookings Drive, St. Louis, Missouri 63130-4899
Tel: (314) 935-6090 • Fax: (314) 935-6359 • snir@wustl.edu
A Method to the Madness: Predicting the NCAA March Madness Tournament
Simona Brooks, Clayton Keating, Travis Parr

Motivation
Every spring, millions of fans compete to create the perfect bracket for the NCAA March Madness tournament, a feat that has never been accomplished. Having been avid fans of college basketball for many years, we wanted to create the best bracket possible for the 2016 tournament. By performing statistical analysis of a team’s performance in the regular season, we believed that we would be better poised to do what has never been done before: predict the unpredictable.

Model and Methodology
We collected data on 20 independent variables from the 192 teams* that have competed in the March Madness tournament over the past three years.
*A school’s team was treated independently from one season to the next (e.g., Duke 2013 and Duke 2014 were evaluated as separate teams).
# of wins = -2.811 + 0.057x1 + 2.065x2 + 0.056x3 - 16.403x4 + 11.479x5 + 8.381x6 + 3.595x7 - 0.298x8 - 0.235x9
To create the bracket, we compared the “success scores” (or predicted number of wins) of the teams in each matchup; the team with the higher score moved on to the next round.
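The matchup rule described above can be sketched in a few lines of Python. The coefficients are taken from the poster's equation; the dictionary keys, the helper names, and the two example teams (with made-up inputs and assumed units) are illustrative assumptions, not the authors' data or code.

```python
# Coefficients from the poster's regression; the key names are illustrative.
INTERCEPT = -2.811
COEFFICIENTS = {
    "seed": 0.057,
    "win_pct": 2.065,
    "avg_scoring_margin": 0.056,
    "pythag": -16.403,
    "pythag_sq": 11.479,
    "strength_of_schedule": 8.381,
    "momentum": 3.595,
    "big_east": -0.298,
    "big_12": -0.235,
}


def success_score(team):
    """Predicted number of tournament wins for one team."""
    features = dict(team)
    # The squared Pythagorean term is derived from the rating itself.
    features["pythag_sq"] = team["pythag"] ** 2
    return INTERCEPT + sum(c * features[name] for name, c in COEFFICIENTS.items())


def pick_winner(team_a, team_b):
    """Advance the team with the higher success score."""
    return team_a if success_score(team_a) >= success_score(team_b) else team_b


# Hypothetical inputs; the scale of each variable is assumed, not documented.
team_a = {"name": "Team A", "seed": 1, "win_pct": 0.90, "avg_scoring_margin": 15.0,
          "pythag": 0.95, "strength_of_schedule": 0.60, "momentum": 0.8,
          "big_east": 0, "big_12": 0}
team_b = {"name": "Team B", "seed": 16, "win_pct": 0.60, "avg_scoring_margin": 3.0,
          "pythag": 0.50, "strength_of_schedule": 0.45, "momentum": 0.5,
          "big_east": 0, "big_12": 0}

print(pick_winner(team_a, team_b)["name"])
```

Filling a full bracket would repeat `pick_winner` round by round, feeding each round's winners into the next.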
Variables
x1: Seed
x2: Winning percentage
x3: Average scoring margin
x4: Pythagorean rating
x5: (Pythagorean rating)^2
x6: Strength of schedule
x7: Momentum
x8: Big East
x9: Big 12

Other variables considered but not in the final model: (Seed)^2, average opponent’s PPG, defensive efficiency, offensive efficiency, team effective possession ratio (TEPR), (TEPR)^2, ACC, Atlantic 10, Big 10, SEC, PAC 12

Hypothesis
We expect that momentum, strength of schedule, and seed will be the most important factors in predicting a team’s success in the tournament.

Key Results
n: 192
R^2: 0.722
F-sig.: 3.08E-7
Std Error: 0.726
ȳ: 0.984 (because of the nature of single-elimination tournaments, ȳ will always remain constant)
SE/ȳ: 0.738
49/63 games predicted correctly.
Momentum and seed had far less impact than we expected, whereas Ken Pomeroy’s Pythagorean rating proved to be an influential variable.

Data Sources
Ken Pomeroy
ESPN
Team Rankings
What Crimes Contribute Most to the Crime Index of U.S. Cities?
Shaista Dhanesar, Malvika Ragavendran, and Lizzie Wellington

Project Motivation
With the presidential election upon us, gun violence and reducing the crime rate have been hot topics of discussion. We therefore thought it important to analyze which crimes have the biggest impact on the crime index in U.S. cities, both to determine whether the presidential candidates are focusing on the right issues and to determine where more resources should be allocated.

Key Results
We used a first-order model and noticed that the data were skewed, so we took the natural log of the data. We then ran a linear regression and obtained the following results:

Data Sources
- FBI’s Uniform Crime Reports: “Offenses Known to Law Enforcement”
- Uniform Crime Reporting Program
Descriptive Statistics
The coefficients highlighted in red indicate the statistically significant coefficients. The coefficient highlighted in green indicates that the Ln(murder) coefficient has the highest p-value. Thus, in order to improve the accuracy of the model we used backwards elimination and removed Ln(murder) since it was the variable with the largest p-value. We obtained the following model:
Analysis
Forcible rape, assault, and larceny-theft have statistically significant coefficients. The F-test significance of 2.800E-51 indicates that our model is statistically significant, since it is less than 0.05. Our R-square value of .9373 is high, and our Sε/ȳ of 0.0233 is only slightly above the .02 benchmark. After we performed backwards elimination, the standard error of the second model decreased from .23644 to .23554. LN(rape), LN(assault), and LN(larceny) are the statistically significant coefficients for this model. The F-test significance of 2.231E-52 indicates that the model is statistically significant. Again, our R-square value of .937 is quite high, and our Sε/ȳ of .02317 is only slightly above the .02 benchmark.
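The backwards elimination step described here (fit the model, drop the predictor with the largest p-value, refit) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' software or data: the data are synthetic, the column names are hypothetical, and a fixed |t| cutoff of 2.0 stands in for the p-value rule (within one fitted model, the smallest |t| corresponds to the largest p-value).

```python
import math
import random


def ols(X, y):
    """Ordinary least squares via the normal equations.

    Each row of X must already include a leading 1.0 for the intercept.
    Returns (coefficients, t_statistics, residual_standard_error).
    """
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    # Invert X'X with Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [row + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(xtx)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    inv = [row[k:] for row in aug]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    t_stats = [b / math.sqrt(s2 * inv[i][i]) for i, b in enumerate(beta)]
    return beta, t_stats, math.sqrt(s2)


def backward_eliminate(columns, y, t_cutoff=2.0):
    """Drop the weakest predictor until every remaining |t| >= t_cutoff."""
    kept = list(columns)
    while kept:
        X = [[1.0] + [columns[name][i] for name in kept] for i in range(len(y))]
        _, t_stats, _ = ols(X, y)
        # Index 0 is the intercept, so predictor j sits at position j + 1.
        weakest = min(range(len(kept)), key=lambda j: abs(t_stats[j + 1]))
        if abs(t_stats[weakest + 1]) >= t_cutoff:
            break
        kept.pop(weakest)
    return kept


# Synthetic, already log-transformed data: the index depends on assaults only.
rng = random.Random(0)
n = 60
ln_assault = [rng.uniform(4.0, 9.0) for _ in range(n)]
ln_murder = [rng.uniform(0.0, 5.0) for _ in range(n)]
ln_index = [1.0 + 0.8 * a + rng.gauss(0.0, 0.2) for a in ln_assault]

kept = backward_eliminate({"ln_assault": ln_assault, "ln_murder": ln_murder}, ln_index)
print(kept)
```

A statistics package would report exact p-values; the fixed cutoff here is only a rough large-sample approximation of the 0.05 level.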
Conclusions
We hypothesized that the murder rate would be the most significant variable in a city’s crime index; however, once we ran the regression, we found that it was not a significant variable. Analyses of crime types across cities benefit law enforcement organizations because they provide statistically significant information on where to allocate resources. The results are also useful to advocacy organizations that raise awareness about certain crimes.
And the Academy Award goes to...
An Analysis of the Factors that Determine Oscar Nominations
Sam Hahn, Alexander Rothbard, Kevin Schneider

Introduction
The 2016 Oscars nomination committee stirred great controversy when it failed to nominate a single actor or actress of color for any of the major Oscar award categories. The lack of diversity in the selection pool has led to widespread outrage and a celebrity-led boycott of the event. This controversy has sent shockwaves throughout the nation as we continue to struggle with questions regarding racial equality. While the nominations committee will surely continue to reevaluate its selection process, the timeliness of this debate raises an interesting question: what factors affect the probability of an actor or actress being nominated in the category of best actor/actress in a leading role? Our research set out to create a prediction model that determines the likelihood of nomination itself.

Variables Analyzed
Dependent Variable: Whether or not an actor receives a nomination (coded as 1 = nominated, 0 = not nominated for data collection; a probability for the model)
Actor-Specific Independent Variables:
• Age
• Height
• Race/Ethnicity*
• Number of Previous Oscar Nominations
Movie-Specific Independent Variables:
• Season of release
• General budget value (USD)
• Opening weekend sales (USD)
• Domestic gross movie profit (USD)
• Total number of Oscar Nominations for the movie
• Movie rating
• Rotten Tomatoes critic rating
• Rotten Tomatoes audience rating
Yellow variables were statistically significant following backwards elimination.
*For Race/Ethnicity, only the coded Black variable was significant, so for the final model the race/ethnicity variable was coded as 1 = Black and 0 = Not Black (White, Latino, Indian, or Asian).

Summary Statistics (sample averages)
Whether the actor was nominated or not (y): 0.5
Black: 0.0798
Number of Previous Oscar Nominations: 1.9874
Total Number of Oscar Nominations for the Movie: 2.0966
Rotten Tomatoes Critic Rating: 0.7166

Analysis
After manually collecting data for our variables, we ran a regression analysis in SPSS to find a logistic model that best fit our data. Using backwards elimination, we created a more significant model that predicted accurately 86.9% of the time using four variables: Black, Number of Oscar Nominations at the Time of the Movie Release (Actor), Total Number of Oscar Nominations for the Movie (Movie), and Rotten Tomatoes Critic Rating (Tomato Critic).
Using the log-odds, we then translated these results into the logistic model, where y is the probability of nomination:
ln(y/(1 - y)) = -6.733 - 1.491(X_Black) + 0.7(X_Actor) + 1.445(X_Movie) + 6.23(X_Tomato Critic)
Data can be entered for each of the variables to predict the likelihood that an actor would be nominated for an Oscar.

Significance of Model
The Cox & Snell R squared value is .597. The significance value reported for the model is .000, meaning the model remains significant. Moreover, the graphs illustrate how well our model fits the data. As a check on fit, we used our model to predict the nomination probability for the actors whose data we collected to build the model. As can be observed, our model accurately gives higher percentages for actors who were nominated, and lower for those who were not.

Conclusion and Key Findings
Our full 16-variable model correctly predicted whether an actor received a nomination ~51% of the time. After running backward elimination in SPSS, the model accurately predicts whether a candidate is nominated or not ~87% of the time, a very strong result. The four significant variables in our investigation were Black (Race subset), # of Actor Nominations at the Time of the Movie Release, Total # of Oscar Nominations for the Movie, and Rotten Tomatoes Critic Rating. At the onset of our investigation, we set out to determine whether there was any legitimacy to the claim that the Oscars are biased against racial minorities, and we have found compelling evidence supporting this claim. The coefficient of our Black variable is negative [-1.491] and significant, indicating that if an actor/actress is black, their likelihood of nomination is diminished. When we restricted our model to only the variables Black (or not Black) and Rotten Tomatoes Critic Rating, it showed that with a Critic Rating of 95%, the probability of a black actor/actress being nominated was 0.297, while the probability for a non-black actor was 0.652. The model has many applications; for example, if a talent acquisition agency was looking to fill a role for an upcoming film, it could use the four metrics [or a projection of such metrics] to forecast the probability that a nomination would occur.

Data Sources
IMDB
Official Academy Awards Database
Rotten Tomatoes

Predicting Oscar Nominations
One of the applications of our model is in predicting future Oscar nominations. To demonstrate this, we calculated the nomination likelihood for five actors from popular movies in 2015. By “predicting” in 2015, we could use the 2015 nominations to test the accuracy of our model. As can be seen, our model was accurate given that Leonardo DiCaprio, who was nominated in 2015, has the highest probability of being nominated. It is important to note that these predictions are limited: they were calculated without information on how many awards the actor’s movie would be nominated for, since the probabilities would be predicted prior to the announcement of nominations.

Actor | Movie | Black | Number of Oscar Nominations at the Time of the Movie Release | Total Number of Oscar Nominations for Movie | Rotten Tomatoes Critic Score | Predicted probability
Bruce Willis | Precious Cargo | 0 | 0 | 0 | 0.25 | 0.56%
Eugene Cernan | The Last Man on the Moon | 0 | 0 | 0 | 0.93 | 28.11%
Géza Röhrig | Son of Saul | 0 | 0 | 0 | 0.96 | 32.03%
Leonardo DiCaprio | The Revenant | 0 | 5 | 0 | 0.82 | 86.71%
Laurence Fishburne | Standoff | 1 | 1 | 0 | 0.5 | 1.20%
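The five predicted probabilities above can be reproduced by treating the left-hand side of the poster's logistic equation as the log-odds of nomination and applying the inverse-logit transform. The sketch below assumes that reading; the function and variable names are illustrative, and the inputs are the values listed for each actor.

```python
import math

# Coefficients from the poster's logistic model.
INTERCEPT = -6.733
B_BLACK, B_ACTOR_NOMS, B_MOVIE_NOMS, B_CRITIC = -1.491, 0.7, 1.445, 6.23


def nomination_probability(black, actor_noms, movie_noms, critic_score):
    """Convert the model's log-odds into a probability between 0 and 1."""
    log_odds = (INTERCEPT + B_BLACK * black + B_ACTOR_NOMS * actor_noms
                + B_MOVIE_NOMS * movie_noms + B_CRITIC * critic_score)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Inputs as listed on the poster: (black, actor_noms, movie_noms, critic_score)
actors = {
    "Bruce Willis": (0, 0, 0, 0.25),
    "Eugene Cernan": (0, 0, 0, 0.93),
    "Géza Röhrig": (0, 0, 0, 0.96),
    "Leonardo DiCaprio": (0, 5, 0, 0.82),
    "Laurence Fishburne": (1, 1, 0, 0.5),
}

for name, inputs in actors.items():
    print(f"{name}: {100 * nomination_probability(*inputs):.2f}%")
```

Because the movie's own nomination count is set to 0 for every actor, these figures understate the probability for any film that goes on to receive nominations, which is the limitation the poster itself points out.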
This research aims to identify what factors make a book-to-movie adaptation successful. Many recent blockbusters, as well as some indie films, have been based on popular novels, and identifying unique factors of the books as well as their film adaptations to predict the lifetime gross of movies could lead to valuable insight for film companies.
Our data included 104 randomly selected popular and independent films from the years 2013 to 2016, with information compiled from Rotten Tomatoes, a movie rating site, Goodreads, a book database, and Box Office Mojo, a film database.
Variables Not Included (insufficient data for inclusion, or removed through backwards elimination)
- Total books sold
- Weeks on NYT Bestseller List
- Film production budget
- Awards for film/book
- Pages in book
- Goodreads rating
- Series status
- Genre
Key Results

Variable | Mean | Median | Standard Error | Standard Deviation
Lifetime Gross | 75.331 | 35.322 | 10.306 | 105.101
Number of Theaters Lifetime | 1896.346 | 2311 | 134.192 | 1368.499
Critic Rating | 51.144 | 47.5 | 2.751 | 28.055
Audience Rating | 58.798 | 56.5 | 1.647 | 16.795
LN Number of Books Written | 2.205 | 2.197 | 0.107 | 1.096
LN Year Published | 7.593 | 7.603 | 0.003 | 0.035

Variable | Coefficient | P-value
Intercept | 1862.511 | 0.235
Number of Theaters Lifetime | 0.047 | 0.000
Critic Rating | 0.628 | 0.108
Audience Rating | 0.988 | 0.135
LN Number of Books Written | -14.527 | 0.035
LN Year Published | -255.447 | 0.215
Action | 77.027 | 0.009
The number of theatres that the film is shown in during its lifetime, whether or not a film/book is action genre, and the number of books the author has written are statistically significant variables. Rotten Tomatoes critic rating, Rotten Tomatoes audience rating, and year published are not statistically significant (p-value > 0.1). However, higher critic ratings and older, more established novels are linked to higher lifetime gross.
Impact on Lifetime Gross Revenue
Lifetime Gross ($M) = 1862.51 + 0.047002(Number of Theaters Lifetime) + 0.62839(Critic Rating) + 0.98815(Audience Rating) - 14.5267[ln(# of books by author)] - 255.44701[ln(year published)] + 77.0269(Action)
R Square: 0.6
F-significance: 2.32E-17
Standard Error: 68.499
SE/Y Average: 0.909
Observations: 104
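The SE/Y Average figure is the regression's standard error divided by the mean of the dependent variable. The short sketch below reproduces it from the reported standard error (68.499) and mean lifetime gross (75.331); the function name is illustrative, and the 0.20 benchmark is an assumption borrowed from other posters in this booklet, not something this poster states.

```python
def relative_standard_error(standard_error, mean_response):
    """Standard error of the regression as a fraction of the mean response."""
    return standard_error / mean_response


# Values reported on the poster (lifetime gross in $M).
ratio = relative_standard_error(68.499, 75.331)
print(round(ratio, 3))

# Assumed benchmark: other posters in this booklet treat 0.20 as the threshold.
BENCHMARK = 0.20
print("within benchmark" if ratio <= BENCHMARK else "above benchmark")
```

A ratio this far above the benchmark means a typical prediction error is nearly as large as the average film's gross, so the model is better suited to identifying significant factors than to forecasting individual films.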
Motivation
Low fertility rates have recently drawn global attention, posing an ironic problem that runs counter to the praise given to today’s progress. Despite national efforts, the steady decline in fertility rates suggests that its causes are intertwined with deeper social issues and values. In this study, we hope to identify the causes of decreasing fertility rates and to propose possible solutions for the countries concerned.

Methodology
We developed a regression model using the backwards elimination method. The dependent variable is fertility rate, and the independent variables are the following. They are all significant.
Urban population (% of total)
Labor force participation rate, female (%)
GDP per capita ($K)
Health expenditure (%)

Data Source
Our data source is the World Bank Group’s World Development Indicators. We sampled 168 countries.

Descriptive Statistics
Variable | Mean | Standard Deviation | Min | Max | Count
Fertility Rate | 2.80 | 1.38 | 1.6 | 7.6 | 168
Urban population (%) | 56.72 | 22.82 | 9 | 100 | 168
GDP per capita ($K) | 14.35 | 20.67 | 0.26 | 116.6 | 168
Health expenditure (%) | 6.75 | 2.59 | 1.5 | 17.1 | 168
Female labor force participation (%) | 51.11 | 15.72 | 15 | 88 | 168

The Model
Y = 1.6729 - Urban population (% of total) * 0.0211 + Labor force participation rate, female (%) * 0.0125 - GDP per capita ($K) * 0.01221 - Health expenditure (%) * 0.07261

n: 168
R2: 0.345
Se/ȳ: 0.406
F-test: 3.2569E-14
P-values as printed (row labels not recoverable): 0, 0.031, 0.0478, 0.00004, 0.037

[Figure: Health expenditure (%) vs Fertility rate, with trendline Y = 3.8398 - Health expenditure (%) * 0.1536]
[Figure: Urban Population (%) vs Fertility rate, with trendline Y = 0.4029 - Urban population (% of total) * 0.0217]
Trendline R^2 values printed on the two plots: 0.0572 and 0.082

This model is statistically significant, with F-significance below alpha (0.05). However, R-squared is low and Se/ȳ is higher than our threshold of 0.2. It is also likely that there is a non-linear relationship between GDP and fertility rate. We found 5 observations with studentized residuals larger than 2 in absolute value. However, we decided to keep these observations since they are not influential points, with Cook’s D lower than 0.5.

QBA 121 Term Paper | Professor Snir | Created by Xue Zhang and Silk Kim
Key Results
We found that our multiple linear regression model and all the independent variables are statistically significant. However, the R-squared value was low, indicating that there are other variables that account for fertility rate. Moreover, the standard error was too high to guarantee accurate predictions of fertility rate for other countries or for future estimates. We believe that this was unavoidable, since countries have economic and social values that can differ significantly from each other, so it is hard to establish a trend.
What’s been causing an all-time low in U.S. smoking rates?

Motivation for Analysis
As teenagers who have always been heavily discouraged from smoking by our schools and the media, we were interested to hear that smoking rates in the U.S. are currently at an all-time low. This prompted us to investigate what factors influence the smoking rate and to develop a model that would accurately predict said rates.

Hypothesis
We propose that an increase in GDP per capita, an increase in health care expenditure, an increase in people attaining higher levels of education, and a decrease in the birth rate and the poverty rate may help to explain the decreasing smoking trends across the United States.

Model Summary
R2: 0.739970355
Sε: 2.490358289
Forecast Accuracy: 0.123463098
F-statistic (significance): 1.39489E-26

Statistical Tables
Variable | Coefficient (Standard error)
Intercept | 27.3684754 (2.491313422)
Ln GDP in Current Dollars (%) | -0.458662668 (0.251560301)
Healthcare Expenditure per Capita | 3.173475614 (0.968825215)
% Bachelor Degree or Higher (%) | -0.482235074 (0.062119515)
Poverty (%) | -0.595297274 (0.075755431)
Birth Rates for Teens aged 15–19 (%) | 0.17017464 (0.018799545)
Data Sources
We tracked smoking rates in all 50 states over two periods, 1995-1996 and 2010-2011.
Healthdata.gov, Bea.gov, Kff.org, Census.gov, Cdc.gov
Carly Abramowitz, Harrison Silverstein & Zachary Stein
Descriptive Statistics
Variable | Mean | Standard Deviation | Minimum | Maximum | Count | Sample Variance
Current Smokers | 20.17087157 | 4.761297652 | 10.3 | 29.6854 | 102 | 22.66995533
Ln GDP | 0.663096242 | 0.140941112 | 0.337817403 | 0.337817403 | 102 | 0.019864397
Health Care | 1.960784311 | 0.296951481 | 1.4018845 | 3.4221907 | 102 | 0.088180182
Bachelor Degree | 26.56372549 | 5.372791542 | 15.3 | 50.1 | 102 | 28.86688895
Poverty | 13.33823529 | 0.347168831 | 5.3 | 22.4 | 102 | 12.2936721
Birth Rates | 43.50980392 | 15.50982767 | 15.7 | 85.2 | 102 | 240.5547544
Methodology
After transforming GDP (with COV over 1) into a natural logarithm, we performed a residual analysis with the following results:
• Leverage: all leverage values were below the critical leverage value (.176), with the exception of two observations
• Standardized Residuals: five observations > ± 2
• Studentized Residuals: five observations > ± 2
• Cook’s D: the Cook’s D value for each of the suspect points found from the studentized residuals was far less than 0.5, indicating that none of the suspect points pose legitimate concern
• Conclusion: eliminating suspect points did not improve the model significantly, so the points were not removed

Backwards Elimination
After residual analysis, we performed a backwards elimination to isolate variables that had the potential to negatively impact the model. Our results from SPSS came back with no suggested variables to remove, leaving the original model intact.
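The critical leverage value of .176 quoted in the residual analysis is consistent with the common rule of thumb 3(k + 1)/n for k = 5 predictors and n = 102 observations. The sketch below assumes that rule (the poster does not state its formula) and pairs it with the screening logic described: flag observations whose studentized residual exceeds 2 in absolute value, then check whether Cook's D exceeds 0.5. The residual and Cook's D values are made up for illustration.

```python
def leverage_cutoff(num_predictors, num_observations):
    """Rule-of-thumb critical leverage: 3 * (k + 1) / n (assumed formula)."""
    return 3 * (num_predictors + 1) / num_observations


def suspect_points(studentized_residuals, cooks_d, resid_limit=2.0, cooks_limit=0.5):
    """Return (index, is_influential) for observations with large residuals."""
    flagged = []
    for i, (r, d) in enumerate(zip(studentized_residuals, cooks_d)):
        if abs(r) > resid_limit:
            flagged.append((i, d > cooks_limit))
    return flagged


# Five predictors and 102 observations, as in the smoking model.
cutoff = leverage_cutoff(5, 102)
print(round(cutoff, 3))

# Hypothetical diagnostics for four observations.
residuals = [0.4, -2.3, 1.1, 2.6]
cooks = [0.01, 0.08, 0.02, 0.12]
print(suspect_points(residuals, cooks))
```

Points that are flagged but not influential, as in this toy example, match the poster's decision to keep its suspect observations in the model.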
Key Results
The most statistically significant model, denoted by the lowest standard error, is as follows:
Current Smokers = 27.368 - 0.459(Ln GDP) + 3.173(Health Care) - 0.482(Education) - 0.595(Poverty) + 0.170(Birth Rates)
• Health Care, Education, Poverty, and Birth Rates are all statistically significant, as each variable has a p-value less than our alpha, 0.05.
• The p-value for the natural log of GDP is 0.071, which is above the .05 threshold. However, backwards elimination removed none of the variables, indicating that the current model has the lowest standard error and is therefore the most accurate.
• In the residual analysis, certain observations were troublesome; however, we did not remove any of the outliers because their Cook’s D values did not pose concern.

Variables Not in the Model
No variables were removed from the model.

Managerial Conclusions
1. States should focus on policies restricting health care expenditure. A decrease in health care expenditure per state decreases the cost of personal health care plans. This decrease in cost will lead to an increase in the number of people purchasing health care plans. An increase in enrollment will cause more people to be conscientious about their health in general, thereby reducing the overall smoking rate.
2. States should budget more funds towards promoting the attainment of higher levels of education, because an increase in the percentage of people achieving a bachelor’s degree or higher lowers the smoking rate.
3. States should increase funding for sex education programs, because the birth rate for teenagers aged 15-19 has a positive correlation with smoking rates. Fewer teenagers giving birth at age 15-19 is associated with a decreased smoking rate in their state.
Cracking Daily Fantasy Sports
Player | Residual | Model error
Quincy Acy | -4.26 | Underestimate
Steven Adams | -21.56 | Underestimate
Arron Afflalo | 3.60 | Overestimate
Cole Aldrich | -0.02 | Underestimate
LaMarcus Aldridge | 17.61 | Overestimate
Lavoy Allen | 1.98 | Overestimate
Al-Farouq Aminu | 6.91 | Overestimate
Giannis Antetokounmpo | 3.89 | Overestimate
Trevor Ariza | -5.76 | Underestimate
Leandro Barbosa | -4.84 | Underestimate
J.J. Barea | -1.58 | Underestimate
Harrison Barnes | -1.16 | Underestimate
Matt Barnes | 16.85 | Overestimate
Will Barton | 12.91 | Overestimate
Nicolas Batum | 14.06 | Overestimate
Kent Bazemore | -16.88 | Underestimate
Patrick Beverley | -7.41 | Underestimate
Devin Booker | -10.67 | Underestimate
Avery Bradley | 5.15 | Overestimate
Corey Brewer | 0.90 | Overestimate
Kobe Bryant | 15.22 | Overestimate
Jimmy Butler | -5.73 | Underestimate
Kentavious Caldwell-Pope | 1.27 | Overestimate
DeMarre Carroll | 2.98 | Overestimate
Jae Crowder | 21.27 | Overestimate
Stephen Curry | -21.70 | Underestimate
Anthony Davis | -2.78 | Underestimate
Matthew Dellavedova | 1.85 | Overestimate
Luol Deng | -8.17 | Underestimate
DeMar DeRozan | 12.46 | Overestimate
Boris Diaw | 0.01 | Overestimate
Gorgui Dieng | 3.70 | Overestimate
Goran Dragic | 0.62 | Overestimate
Andre Drummond | 4.57 | Overestimate
Tim Duncan | -4.91 | Underestimate
Kevin Durant | -4.06 | Underestimate
Monta Ellis | 12.21 | Overestimate
Festus Ezeli | 5.56 | Overestimate
Kenneth Faried | 0.69 | Overestimate
Derrick Favors | 15.73 | Overestimate
Evan Fournier | -0.38 | Underestimate
Channing Frye | 3.99 | Overestimate
Rudy Gay | -0.22 | Underestimate
Paul George | 5.24 | Overestimate
Rudy Gobert | 17.15 | Overestimate
Aaron Gordon | -13.29 | Underestimate
Marcin Gortat | 9.97 | Overestimate
Danny Green | 0.99 | Overestimate
Draymond Green | 1.19 | Overestimate
Blake Griffin | 10.00 | Overestimate
James Harden | 5.41 | Overestimate
Maurice Harkless | 1.61 | Overestimate
Devin Harris | 10.41 | Overestimate
Gary Harris | 1.32 | Overestimate
Tobias Harris | -1.03 | Underestimate
Spencer Hawes | -1.86 | Underestimate
Gordon Hayward | -4.14 | Underestimate
Roy Hibbert | 7.99 | Overestimate
Nene Hilario | -24.55 | Underestimate
George Hill | -6.04 | Underestimate
Jordan Hill | -1.59 | Underestimate
Rodney Hood | 8.96 | Overestimate
Al Horford | -11.53 | Underestimate
Dwight Howard | -0.79 | Underestimate
Serge Ibaka | 11.73 | Overestimate
Andre Iguodala | -0.46 | Underestimate
Ersan Ilyasova | -18.76 | Underestimate
Kyrie Irving | -16.59 | Underestimate
LeBron James | -8.86 | Underestimate
Al Jefferson | -13.44 | Underestimate
Amir Johnson | -0.96 | Underestimate
Stanley Johnson | 4.03 | Overestimate
Wesley Johnson | 0.95 | Overestimate
Daily Fantasy Sports (DFS) is rising in popularity among sports fans due to its fast-paced environment and immediate reward system. Unlike traditional fantasy sports, where competition lasts for a season, in DFS it is important for users to predict a player’s performance on a specific day. In this paper, I examine the relationships between a player’s ability variables and the player’s fantasy points performance.
Per-game player data for the 2016 season were taken from basketball-reference.com. I purposefully held out the last 3 days of the season to test the prediction accuracy of the model. In total, I collected per-game data from 170 players, for a total of 10,853 observations.
Fantasy Points = -12.282 + 0.351(Minutes) + 1.284(FGA) + 0.901(FTA) + 10.584(eFG%) + 0.482(TRB%) + 0.217(AST%) + 0.805(STL%) + 0.552(BLK%) - 0.147(Usage Rate)

Variable | Coefficient
Intercept | -12.282 (0.222)
Minutes | 0.351 (0.008)
FGA | 1.284 (0.017)
FTA | 0.901 (0.015)
eFG% | 10.584 (0.149)
TRB% | 0.482 (0.005)
AST% | 0.217 (0.003)
STL% | 0.805 (0.018)
BLK% | 0.552 (0.012)
Usage Rate | -0.147 (0.009)

[Figure: Prediction vs. Actual Fantasy Points]
- After dropping four variables, the final OLS model has 9 significant variables.
- Significance F is 0.000, well below α = 0.05, so the OLS model is significant.
- The regression had a shockingly high R-square of 0.9242, meaning the independent variables explain variation within the dependent variable very well.
- Coefficient of variation is 0.1625, below the benchmark of 0.20.
- IMPLICATION: It is possible to accurately predict an NBA player’s performance, given information about the player’s ability.
- I purposefully held out the last few days of the season to perform model testing.
- I used the mean of the independent variables for each player to predict performance.
- The mean of the residuals was -2.0 and the standard deviation was 11.20.
- The model testing suggests that a player’s performance varies between games.
- The standard deviation of the model testing residuals is very high.
- While the model can accurately predict players’ collective performance, predicting an individual performance is not reliable.
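The hold-out check described above can be summarized with a few lines of Python. The sketch assumes the sign convention implied by the poster's labels (a positive residual is an overestimate, a negative one an underestimate) and uses only the first seven residuals from the poster's player list as sample input; the function name is illustrative.

```python
from statistics import mean, stdev


def label(residual):
    """Classify a test residual using the poster's sign convention."""
    return "Overestimate" if residual > 0 else "Underestimate"


# First seven residuals from the poster's model-testing list.
sample = [-4.26, -21.56, 3.60, -0.02, 17.61, 1.98, 6.91]

for r in sample:
    print(f"{r:7.2f}  {label(r)}")

# Summary statistics of the kind reported for the full test set.
print(f"mean = {mean(sample):.2f}, standard deviation = {stdev(sample):.2f}")
```

A mean near zero with a large standard deviation is the pattern the poster reports: errors roughly cancel across players, but any single prediction can miss by a wide margin.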
AND THE OSCAR GOES TO…
A study of the variables that are the strongest predictors of a film’s Best Picture success

Motivation
The Oscars is one of the most anticipated and most viewed awards ceremonies on the planet. With a viewership of 34.4 million, it is no wonder that speculation and models about who will win the Oscars flood the internet every awards season. These models look at a wide variety of variables, ranging from ticket sales to the number of awards a movie is nominated for, and most of them are rather accurate. In my analysis, I took a mixture of variables that have been shown to affect a movie’s chances of winning Best Picture and added three variables (Number of Nominations, Month of Release, and Run-Time), all of which I expected to influence a movie’s chances at Oscar gold. I hypothesize that movies that win any of the awards offered by film guilds will have a disproportionately higher chance of winning Best Picture. Below are the 14 variables I tracked over the past 15 years:
• Number of Nominations
• Month of Release
• Run-Time
• Best Picture from…
  • Directors Guild of America (DGA)
  • British Academy of Film and Television Arts (BAFTA)
  • Golden Globe (Drama)
  • Golden Globe (Musical or Comedy)
  • American Cinema Editors (ACE, Comedy or Musical)
  • American Cinema Editors (ACE, Dramatic)
  • Writer’s Guild of America (WGA, Adapted Screenplay)
  • Writer’s Guild of America (WGA, Original Screenplay)
  • Producers Guild of America (PGA)
  • Critic’s Choice
  • Screen Actors Guild (SAG)
Data Sources
Internet Movie Data Base (IMDB): Number of Nominations; Month of Release; Run-Time; Directors Guild of America (DGA); American Cinema Editors (ACE, Comedy or Musical); American Cinema Editors (ACE, Dramatic); Writer’s Guild of America (WGA, Adapted Screenplay); Writer’s Guild of America (WGA, Original Screenplay); Producers Guild of America (PGA); Critic’s Choice; Screen Actors Guild (SAG)
International Film Awards Database: British Academy of Film and Television Arts (BAFTA); Golden Globe (Drama); Golden Globe (Musical or Comedy)
Methodology
I used Backward Elimination to run a Multiple Linear Regression to find which variables were most correlated with winning Best Picture. The variables that were not eliminated from the regression were:
• Directors Guild of America (DGA)
• American Cinema Editors (ACE, Dramatic)
• Screen Actors Guild (SAG)
All other variables (Number of Nominations, Month of Release, Run-Time, British Academy of Film and Television Arts (BAFTA), etc.) were found not to have a statistically significant correlation with winning the Best Picture award, which is why they were removed from the regression. This was determined by analyzing the standard error of each independent variable and removing those above .3, or until the regression’s R-Squared value failed to decrease upon the removal of a variable.
Key Results
I found that winning the best picture award from the DGA, the ACE, or the SAG increases a movie’s chances of winning the Oscar for Best Picture if it is nominated. This directly supports my hypothesis. In fact, the most influential guild, which holds the most seats in the Academy, the Directors Guild, seems to have the most influence, with a coefficient of .516, further supporting my hypothesis.
Goodness of Fit
Given that this model is a relatively simple Backward Elimination linear regression, it would seem that my model had a respectable “goodness of fit”. With an R2 value of .635, an F-Sig of 5.98E-18, and an F-Test that rejects the null hypothesis that there is no linear relationship between variables, this seems to be a good model. However, there are still issues, such as terrible accuracy, represented by the Se/ȳ value of 1.211.
The Model
ln(y) = -0.00651 + .516x1 + .313x2 + .236x3
x1 = DGA, x2 = ACE, x3 = SAG
R² = .635
Se/y = 1.211
F-Sig = 5.98E-18
n = 86
Given the above information, we have an equation that gives the relative likelihood of a nominated film winning the Best Picture award based on the other major film industry awards it has won. From this, we can make predictions about which film should have won Best Picture in past years based on this model. Below are the results for 2010 and 2011, where the gold bar denotes the award winner of that year.
Conclusion
Overall, this model shows that predicting Best Picture success is a complex problem. While this purely qualitative approach leaves us with a model that is statistically significant and has a linear relationship to the dependent variable, it also shows that such a model is highly inaccurate. Through this model, 11 of the 14 starting variables were eliminated, and for those that remained, the linear relationships for the SAG and ACE were dubious at best. However, across time, this model does prove to be a relatively accurate predictor of Oscar success. Thus, those who might benefit from this model include movie critics and possibly even movie studios trying to tailor-make a movie that will win an Oscar. Both could benefit greatly, given that accurate Oscar prediction can be a lucrative venture.
Sources
"Just How Predictable Are the Oscars?" ResearchGate. N.p., n.d. Web. 04 May 2016.
Pardoe, Iain. "Just How Predictable Are the Oscars?" Chance 18.4 (2005): 32-39. Web.
Hickey, Walt. "FiveThirtyEight's Guide To Predicting The Oscars." FiveThirtyEight. N.p., 22 Jan. 2016. Web. 04 May 2016.
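The fitted Best Picture equation can be evaluated for a slate of nominees to produce a ranking of the kind shown in the yearly charts. This is a minimal sketch that takes the equation at face value and assumes each x is a 0/1 indicator for having won that guild's award, which the poster does not state explicitly. The nominee names and award records are hypothetical.

```python
import math

# Coefficients copied from the fitted equation; keys are the retained variables.
COEF = {"intercept": -0.00651, "DGA": 0.516, "ACE": 0.313, "SAG": 0.236}

def linear_score(dga, ace, sag):
    """Right-hand side of the fitted equation for 0/1 award indicators (assumed)."""
    return (COEF["intercept"]
            + COEF["DGA"] * dga
            + COEF["ACE"] * ace
            + COEF["SAG"] * sag)

# Hypothetical nominees: (won DGA, won ACE Dramatic, won SAG).
nominees = {
    "Film A": (1, 1, 0),
    "Film B": (0, 0, 1),
    "Film C": (0, 0, 0),
}

scores = {name: linear_score(*wins) for name, wins in nominees.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    # exp() undoes the ln() on the left-hand side of the equation.
    print(name, round(scores[name], 5), round(math.exp(scores[name]), 3))
```

Only the ordering within a given year is meaningful in this sketch; the scores are relative likelihoods, not probabilities that sum to one.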
Data Sources
http://www.imdb.com/search/title?count=100&groups=oscar_best_picture_winners&sort=year,desc&ref_=nv_ch_osc_2
https://www.netogram.com/filmawards.htm
[Figures: model results for the 2010 and 2011 Best Picture nominees]
Ross Meaden
____________________________________________________________________________
We used the backward elimination method to fit a multiple linear regression and found the following variables to be insignificant: average annual low temperature, average annual temperature (general), average annual precipitation, percentage of population foreign born, and median age. All of these variables except average annual low temperature were removed to arrive at the final model, which had the lowest standard error.
Sales tax, law enforcement, and income inequality are correlated with a higher violent crime rate. Average annual high temperature, percentage of population with a college degree, and population density are correlated with a lower violent crime rate. Some of our most interesting results are shown below:
y = 506.21 – 20.52x1 + 10.64x2 + 36.33x3 + 19.54x4 + 14.61x5 – 10.58x6 – 0.02x7
[Figure: IV 4: Law Enforcement. Violent crime rate (per 100,000) plotted against law enforcement.]
ȳ = 697.135 (average violent crime rate)
Despite the efforts of policymakers and law enforcement, violent crime is still a prevalent issue in cities across the United States. The emotional and physical toll of violent crime decreases the well-being of our society, and often prevents the victims of such crime from returning to normal lives. Thus, both the economic and social costs of violent crime are high. We aimed to identify the underlying factors that moderate the prevalence of violent crime in America’s cities in order to inform and foster more appropriate strategies to ameliorate the problem.
β1 = Average annual high temperature, p = 0.0026
β2 = Average annual low temperature, p = 0.1185
β3 = Total sales tax, p = 0.0299
β4 = Law enforcement (per 10,000 residents), p = 1.34E-10
β5 = Income inequality (Gini coefficient), p = 0.0430
β6 = Education (% of population with college degree), p = 4.89E-06
β7 = Population density (people/square mile), p = 0.0127
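To see how the fitted violent crime equation combines these variables, it can be evaluated for a single city. This is a minimal sketch: the coefficients are copied from the equation, but the city and all of its input values are hypothetical. The units are also assumptions, since the poster does not spell them out (for example, sales tax and education entered as percentages and the Gini coefficient on a 0 to 100 scale).

```python
# Intercept and coefficients copied from the fitted equation (x1..x7 in order).
INTERCEPT = 506.21
COEFFICIENTS = {
    "high_temp_f": -20.52,      # x1: average annual high temperature
    "low_temp_f": 10.64,        # x2: average annual low temperature
    "sales_tax": 36.33,         # x3: total sales tax
    "law_enforcement": 19.54,   # x4: law enforcement per capita
    "gini": 14.61,              # x5: income inequality (Gini coefficient)
    "college_pct": -10.58,      # x6: % of population with a college degree
    "pop_density": -0.02,       # x7: people per square mile
}

def predict_violent_crime_rate(city):
    """Predicted violent crime rate (per 100,000) for a dict of inputs."""
    return INTERCEPT + sum(COEFFICIENTS[k] * city[k] for k in COEFFICIENTS)

# Hypothetical city; every value below is made up for illustration.
hypothetical_city = {
    "high_temp_f": 70.0,
    "low_temp_f": 50.0,
    "sales_tax": 8.0,
    "law_enforcement": 20.0,
    "gini": 45.0,
    "college_pct": 30.0,
    "pop_density": 3000.0,
}

predicted = predict_violent_crime_rate(hypothetical_city)
print(f"{predicted:.2f}")

# Holding everything else fixed, one more degree of average high temperature
# changes the prediction by the x1 coefficient.
warmer = dict(hypothetical_city, high_temp_f=71.0)
print(f"{predict_violent_crime_rate(warmer) - predicted:.2f}")
```

The second print shows the usual reading of a regression coefficient: a one-unit change in one input, with the others held fixed, shifts the prediction by that coefficient.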
[Figure x-axis: Law Enforcement (per 100,000)]
There is a strong positive relationship between law enforcement and violent crime rate.
[Figure: IV 1: Average Annual High Temperature. Violent crime rate (per 100,000) plotted against temperature.]
[Figure x-axis: Temperature (Fahrenheit)]
• U.S. Department of Justice and FBI
• U.S. Climate Data
• The Tax Foundation
• U.S. Department of Labor Bureau of Labor Statistics
• Governing Media
• Civic Dashboards by OpenGov
• United States Census Bureau
• City-Data.com
R square: 0.689
Standard Error: 214.573
F-statistic: 27.265
P-value: 2.39E-19
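One way to put the standard error in context is to divide it by the average violent crime rate, giving the same kind of Se/ȳ ratio used elsewhere in this booklet. This is a small sketch, not part of the original analysis: the ratio is derived from the two reported figures, and the function name is illustrative.

```python
def relative_standard_error(standard_error, y_mean):
    """Regression standard error expressed as a fraction of the mean response."""
    return standard_error / y_mean

# Reported values: standard error 214.573, average violent crime rate 697.135.
crime_ratio = relative_standard_error(214.573, 697.135)
print(f"{crime_ratio:.3f}")
```

Under this reading, a typical prediction error is a bit under a third of the average crime rate.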
One of our most unexpected findings was the strong negative relationship between average annual high temperature and violent crime rate.
A recent study suggested that Chicago could save $59 billion per year with a 25% decrease in violent crime (Shapiro & Hassett, 2012). This research aims to inform government officials and policymakers about the factors influencing a city's violent crime rate.
Shapiro, R. J., & Hassett, K. A. (2012). The economic benefits of reducing violent crime: A case study of 8 American cities.