Washington University in St. Louis
OLIN BUSINESS SCHOOL
Eli Snir, Ph.D.
Senior Lecturer in Management
December 2017

Informing Decisions at the QBA 121 Poster Session – Fall 2017

Informing decisions is one of the key skills attained in a business school education. The ability to ask a question, collect relevant data, distill the data into manageable form, undertake an analysis, and present results to both technical and broad audiences is essential to the success of Olin Business School students. These are exactly the attributes that make a successful term paper in QBA 121, Managerial Statistics II. The term paper project requires groups of students to develop hypotheses on a topic of their choice, and then collect and analyze data to test those hypotheses. Projects are presented on posters to a broad audience of academics, judges, and fellow students. The dual modes of the assignment, a technical report and a poster, epitomize the skills of Olin graduates: informing managerial decisions based on rigorous analysis. This poster book is a collection of some of the projects students investigated in the course.

Projects in the course encompass all aspects of business, as well as topics of broader political or social interest. One important project this semester evaluates factors that drive college retention rates. On average in the US, only 82% of freshmen continue college after their first year, but there is substantial variation across universities. Some universities have only about half their students returning for their sophomore year, while others achieve close to a 100% retention rate. Understanding the factors that differentiate top performers is important for both administrators and policy makers. Of the statistically significant factors, a few stand out. A lower student-faculty ratio is associated with higher retention: there are clear benefits to having more faculty teaching students. In addition, higher tuition is associated with higher retention. The additional services available at institutions with larger budgets have an immediate effect on retention rates. This should inform the public discussion on budgets and resources for higher education.

A similar project, this time looking at starting salaries for college graduates, finds that several factors are important. One is that average academic spending is positively correlated with graduates’ salaries, again emphasizing the need to focus resources on students. Surprisingly, in this project, student-faculty ratio is not a significant predictor of salary.

Sports projects are quite popular in the course. With the rise of sabermetrics in baseball, and its expansion into other sports, fans want to understand what drives success and now have the data to identify the attributes of top performers. Some questions evaluated this time are the importance of both offense and defense in explaining wins of MLB teams, and the factors that determine player PER in basketball.

Looking through the posters collected in this book, you’ll learn about these projects and others. Hopefully, these will pique your interest in applying statistical methods to questions that interest you.
Eli Snir Washington University in St. Louis, Olin Business School • Campus Box 1156 One Brookings Drive, St. Louis, Missouri 63130-4899 Tel (314) 935-6090 • Fax: (314) 935-6359 • snir@wustl.edu
Save the Date:
When Will You Get Married? RSVP: Tiffany Chiang, Amrata Mehta, Maddy To
Motivation
Marriage is an important part of our personal lives that many people consider when planning their professional lives, especially in terms of work-life balance. Current research suggests that the mean age at first marriage has been increasing over the past decades. We aim to explore which factors have the strongest effect on a person’s likely age at first marriage, using the following factors:
Sex (SEX)
Ethnicity (CODE ETH)
Religion (CODE RELIG16)
Political Views (POLVIEWS)
Education Level (DEGREE)
Residence during Adolescence (RES16)
Region during Adolescence (REG16)
Location of Parents’ Birth (PARBORN)
Data Source We collected data from the 1994 General Social Survey. Our final model uses 1021 data points, for which the average age at first marriage was 22 years.
The Method We ran a Multiple Linear Regression with backwards elimination, coding ethnicity and religion as categorical (indicator) variables. We eliminated the indicator variables Code ETH Asia, Code RELIG16 Catholic, and Code ETH Other because they increased our model’s standard error. This left us with eight of the eleven initial independent variables. We then removed outliers with significant influence or leverage and reran the model.
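The backwards elimination procedure used throughout this book can be sketched in a few lines. The following is only an illustration, not the students' actual workflow (which is not shown on the posters): the function names `ols`, `fit`, and `backward_elimination` and the toy data are hypothetical. It assumes the rule "drop the least significant predictor, stop when the standard error no longer decreases". Dropping the predictor with the smallest |t| is the same choice as dropping the one with the largest p-value, because all coefficients in a model share the same residual degrees of freedom.

```python
import math


def ols(X, y):
    """Least-squares fit via the normal equations; returns (beta, inverse of X'X)."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    # Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [xtx[i] + [float(i == j) for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    return beta, inv


def fit(columns, y, names):
    """Regress y on the named columns plus an intercept; return (std_error, t_stats)."""
    X = [[1.0] + [columns[nm][i] for nm in names] for i in range(len(y))]
    beta, inv = ols(X, y)
    resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    se = math.sqrt(sum(e * e for e in resid) / (len(y) - len(beta)))
    t = {nm: beta[j + 1] / (se * math.sqrt(inv[j + 1][j + 1]))
         for j, nm in enumerate(names)}
    return se, t


def backward_elimination(columns, y):
    """Drop the weakest predictor while doing so lowers the standard error."""
    names = list(columns)
    best_se, t = fit(columns, y, names)
    while len(names) > 1:
        weakest = min(names, key=lambda nm: abs(t[nm]))
        trial = [nm for nm in names if nm != weakest]
        se, trial_t = fit(columns, y, trial)
        if se >= best_se:
            break  # removing the variable no longer improves the model
        names, best_se, t = trial, se, trial_t
    return names, best_se


# Toy data (hypothetical): y depends on "degree" only; "noise" is irrelevant.
columns = {"degree": [1, 2, 3, 4, 5, 6], "noise": [1, 0, 1, 0, 1, 0]}
y = [5.1, 8.1, 10.8, 13.8, 17.1, 20.1]
kept, std_error = backward_elimination(columns, y)
```

In the toy data the irrelevant predictor is removed because dropping it lowers the standard error from about 0.20 to about 0.17.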
The Model Age at first marriage is predicted by sex, European and African ethnicity compared to North American ethnicity, highest degree obtained, hometown size and location at age 16, location of parents’ birth, and political views.
Ŷ = 22.9359 − 1.9577 × SEX + 0.6413 × I_EUR + 0.5962 × I_AFR + 0.9317 × DEGREE + 0.1648 × RES16 − 0.1655 × REG16 + 0.2552 × PARBORN + 0.0867 × POLVIEWS
Quality of the Model
R²: 0.22975 (Fairly Low)
Standard Error / Y Average: 0.15254 (< 0.2)
F-Statistic (significance): 1.17024E-52 (< α = 0.05)
Key Results Our MLR analysis indicates that the strongest factors influencing age at first marriage are sex and highest degree obtained. Based on our results, women marry about 2 years younger than men, on average, holding all other variables constant. Meanwhile, each additional degree increases the age at first marriage by about 1 year.
College Retention Rates Shivani Jindal – Taylor Chen – Katie Pearson
Motivation
Retention rate is the percentage of a school’s first-time, first-year undergraduate students who continue at that school the next year. Many students drop out or switch universities after their first year. Low retention rates can hurt a university’s ranking. Higher education also plays a key role in social mobility and economic growth. We are interested in which institutional factors have an impact on 4-year undergraduate retention rates.

Hypothesis
We expect to find that student-faculty ratio and tuition are the best predictors of retention rate, and are curious whether the other factors listed affect retention as well.

Methodology
We developed a multiple linear regression model using backwards elimination. Our dependent variable was Full-Time College Undergraduate Retention Rate and our independent variables were university statistics: Region (South, West, Northeast, Midwest), Public or Private, Number of Students Admitted, Students Enrolled Full-Time, Student-Faculty Ratio, and In-State Tuition.

Data Source
Our data came from the Integrated Postsecondary Education Data System (IPEDS) through the National Center for Education Statistics (NCES). We took a random sample of 226 four-year undergraduate universities.

Descriptive Statistics
Variable | Mean | Std Dev | Min | Max
Full-Time Retention Rate | 82.058 | 10.115 | 50 | 98
South | 0.307 | 0.462 | 0 | 1
West | 0.098 | 0.298 | 0 | 1
Public | 0.084 | 0.279 | 0 | 1
Admissions Total | 1749.369 | 1044.253 | 17 | 5135
Enrolled Full Time Total | 402.422 | 189.933 | 11 | 979
Student-Faculty Ratio | 11.227 | 2.433 | 5 | 21
In-State Tuition | 34372.231 | 11548.550 | 6009 | 50780

Model
Variable | Coefficient | P-Value
Intercept | 69.686 | 0.000
Midwest | 0.330 | 0.772
South | -2.361 | 0.025
West | 5.549 | 0.000
Public | 6.672 | 0.002
Religiously Affiliated | 0.392 | 0.710
Admissions Total | -0.002 | 0.000
Enrolled Full Time Total | 0.025 | 0.000
Student-Faculty Ratio | -0.836 | 0.001
In-State Tuition | 0.00045 | 0.000

Removed Variables
Religiously Affiliated, Midwest, NCAA Affiliated, Historically Black University, Financial Aid

Model Metrics
R-Squared: 0.646
SE / Ȳ: 0.075
# of Observations: 225
F-Test: 0.000

Ŷ(Retention Rate) = 69.686 − 2.361 × South + 5.549 × West − 0.002 × Admissions Total + 0.025 × Enrolled Full Time + 6.672 × Public − 0.836 × Student-Faculty Ratio + 0.00045 × In-State Tuition

Economic Significance
In-State Tuition 28%, Enrolled Full Time Total 25%, Admissions Total 11%, Student-Faculty Ratio 11%, Public 10%, West 9%, South 6%

Figure: Full-Time Retention Rate vs. Published In-State Tuition & Fees; trendline ŷ = 0.0006x + 62.003, R² = 0.44379

Key Results
This model shows that a university’s retention rate can be predicted based on location, size, exclusivity, and cost. Surprisingly, Financial Aid has no significant impact on retention rates. In-State Tuition has the lowest p-value and is the most economically significant variable.

Conclusion
These results can be useful to 4-year undergraduate universities across America that are looking to increase their retention rates, which can often directly affect a college’s national ranking and appeal.
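As a quick sanity check on the fitted equation, one can plug the sample means from the descriptive statistics into it. The sketch below is illustrative only: the dictionary keys and the function name `predict_retention` are hypothetical, the pairing of means with variables is reconstructed from the poster's flattened table, and the tuition coefficient is taken from the coefficient table (0.00045).

```python
# Coefficients of the final model, as printed in the poster's coefficient table.
COEFFICIENTS = {
    "south": -2.361,
    "west": 5.549,
    "public": 6.672,
    "admissions_total": -0.002,
    "enrolled_full_time": 0.025,
    "student_faculty_ratio": -0.836,
    "in_state_tuition": 0.00045,
}
INTERCEPT = 69.686

# Sample means from the descriptive statistics (pairing with variables is an assumption).
SAMPLE_MEANS = {
    "south": 0.307,
    "west": 0.098,
    "public": 0.084,
    "admissions_total": 1749.369,
    "enrolled_full_time": 402.422,
    "student_faculty_ratio": 11.227,
    "in_state_tuition": 34372.231,
}


def predict_retention(school):
    """Predicted full-time retention rate (%) for a dict of school characteristics."""
    return INTERCEPT + sum(COEFFICIENTS[k] * school[k] for k in COEFFICIENTS)


prediction_at_means = predict_retention(SAMPLE_MEANS)
```

The prediction at the means comes out near 82.7, close to the sample mean retention rate of 82.058; the small gap is consistent with rounded coefficients.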
what drives happiness?
a managerial statistics II project
motivation
Every year, the World Happiness Report ranks 155 countries by their happiness levels to provide an eye-opening survey of the state of global happiness. These results led us to wonder… What factors account for the large discrepancies in happiness between all of the countries? Since happiness is considered to be the proper measure of social progress as well as the goal of public policy, we felt that it was important to understand what truly accounts for the better well-being of a country and thus its people.
independent variables INCLUDED ln(GDP Per Capita) Government Effectiveness Unemployment Rate
NOT INCLUDED ln(Gross National Income) Adult Literacy Rate Life Expectancy
model
y = 1.114049 + 0.491135 ln(x1) + 0.006269 x2 – 0.021192 x3
x1: GDP Per Capita
x2: Government Effectiveness
x3: Unemployment Rate
model fit
R2: 0.841746 (High)
Sε / ȳ: 0.10151 (Below a benchmark of 0.20)
F-Stat: 1.3289 x 10^-25 (Well below α = 0.05)
n: 100 (Greater than 50% of countries)
scatter plots
saki inaba elijah lacey amanda law
world bank group world happiness report
key results
REGRESSION COEFFICIENTS
Using the backwards elimination method, which entails running the regression with all predictors in the model, removing the least significant variable (the one with the largest p-value), and repeating the process until the standard error stops decreasing, we found our model to be: y = 1.114049 + 0.491135 ln(x1) + 0.006269x2 – 0.021192x3. This makes sense because an increase in a country’s GDP per capita or in its government’s effectiveness would presumably lead to happier people, and an increase in a country’s unemployment rate would lead to less happy people (although this variable is not statistically significant).
x1: GDP Per Capita For every one-unit increase in a country’s ln(GDP per capita), which corresponds to multiplying GDP per capita by about 2.72, there is a 0.491135 increase in the country’s happiness rating, holding government effectiveness and unemployment rate constant.
x2: Government Effectiveness For every 1 point increase in a country’s Worldwide Governance score, there is a 0.006269 increase in the country’s happiness rating, holding GDP per capita and unemployment rate constant.
Scatter plot summary statistics: x̄ = 9.111887, x med = 9.157819, Sε = 0.143001, sx = 1.430074
x3: Unemployment Rate For every 1 percentage point increase in a country’s unemployment rate, there is a 0.021192 decrease in the country’s happiness rating, holding GDP per capita and government effectiveness constant.
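Because GDP per capita enters the model through its natural logarithm, the coefficient 0.491135 is easiest to read in terms of proportional changes in GDP. A minimal sketch of that arithmetic follows; the function name `happiness_change` is hypothetical, and the only input taken from the poster is the coefficient itself.

```python
import math

B_LN_GDP = 0.491135  # coefficient on ln(GDP per capita), from the poster's model


def happiness_change(gdp_ratio):
    """Predicted change in happiness rating when GDP per capita is multiplied
    by gdp_ratio, holding the other predictors constant."""
    return B_LN_GDP * math.log(gdp_ratio)


one_percent_growth = happiness_change(1.01)  # GDP per capita rises by 1%
doubling = happiness_change(2.0)             # GDP per capita doubles
```

Under this reading, a 1% rise in GDP per capita corresponds to roughly a 0.005 increase in the happiness rating, and a doubling to roughly 0.34.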
Group 21: Camille Benson and Madison Siguenza
What variables affect the number of weeks a song is on the Billboard Hot 100?
Motivation
With the switch from physical distribution to online streaming services, it has become difficult to predict how popular songs actually are, as there are now multiple media through which to access new music. Since Billboard calculates its ranking from radio airplay, sales data, and streaming activity, we want to know what other variables may also have an effect even though they are not part of the formula.
Methodology
We used backward elimination once and found the model with the lowest standard error and highest R^2. The dependent variable is the number of weeks on the Billboard Hot 100. The independent variables used in the model can be found in the table below.
Descriptive Statistics
The Model y=7.44-0.11(x1)+4.56E-8(x2)+7.74(x3)+3.25(x4)-2.07E-9(x5)
Key Results As you can see in the graph below, there is a correlation between the number of weeks a song will be on the charts and the top ranking it will reach.
Data Sources Indicator variables for artist gender (using male as the base category) were removed from the model since they did not have a significant relationship with the number of weeks on the Hot 100. R^2 = 0.667555693, Se/ȳ = 0.487257463, F-stat = 7.91189E-21, n = 99 (removed 1 observation)
Key Metrics The model is statistically significant, since its F-stat significance is much smaller than .05, and it has a fairly high R^2. However, the standard error over the average y is greater than the benchmark of .2, so the model's predictions are not always accurate.
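The three checks that recur across these posters (R², the ratio of standard error to the average of the dependent variable against a 0.2 benchmark, and the F-test significance against α = 0.05) can be collected in one small helper. This is a hedged sketch: the function name `assess_model` and the returned keys are hypothetical, and the numbers passed in are the ones reported on this poster.

```python
def assess_model(r_squared, se_over_mean, f_significance,
                 alpha=0.05, se_benchmark=0.2):
    """Apply the benchmarks used in this poster book to a fitted model."""
    return {
        "r_squared": r_squared,
        # The model as a whole is significant if the F-test p-value is below alpha.
        "significant": f_significance < alpha,
        # Predictions are considered precise if Se / y-bar is under the benchmark.
        "precise": se_over_mean < se_benchmark,
    }


billboard = assess_model(r_squared=0.667555693,
                         se_over_mean=0.487257463,
                         f_significance=7.91189e-21)
```

For the Billboard model this returns a significant but imprecise fit, matching the discussion above.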
Outliers We removed one observation from the data set: Ed Sheeran's song Shape of You, which was a highly influential observation.
Conclusion Overall, our research indicates that other variables affect how long a song will stay on the charts. The genre of a song can make it stay on the charts longer, and the closer a song's peak position is to 1, the longer it tends to stay.
Stayin’ Alive
Using Health Conditions to Predict Death Rates
Michelle Eisenberg, Abby Forsythe, Haley Myers

Motivation
Healthcare inequality is one of the most prevalent issues facing our modern-day societies. In order to begin to address this daunting situation, we decided to look at how data pertaining to certain health conditions can predict death rates around the world. Knowing which factors can increase mortality rates the most will help countries determine which areas of their healthcare need to be improved. While previous analyses have been done to determine the leading cause of death across countries, most have not compared both individual health and hospital accommodations and facilities to determine which factor has the greatest impact on mortality rates. We hypothesized that Prevalence of Raised Blood Pressure would have the most drastic impact on death rates, given its positive correlation with death rates.

Figure: Death Rate (deaths/1000 people) vs. Prevalence (%) of Raised Blood Pressure; trendline y = 0.3715x - 1.8317, R² = 0.30567

Model & Analysis
We used Multiple Linear Regression with Backward Elimination to analyze our model.
y = 4.656 + 0.278x₁ - 0.362x₂ + 0.027x₃ + 0.549x₄ - 0.033x₅

Variable | P-Value
Intercept | 0.00015908
Prevalence (%) of Raised Blood Pressure | 2.64883E-11
Prevalence (%) of Raised Blood Glucose | 5.42899E-9
Density of Hospital Beds (per 10,000 population) | 0.008020319
Density of Physicians (total number per 1000 population) | 0.000782371
Proportion (%) of Population using Improved Sanitation Facilities | 0.00017245

Metric | With Outliers | Without Outliers
R² | 0.6773 | 0.701536
Sɛ/y avg. | 0.2339 | 0.1986
F-Stat | 1.26952E-29 | 7.31426E-37

We have 7 Outliers out of 160 Data Points.

Our ratio of standard error to the average of the dependent variable is slightly greater than the 0.2 benchmark at 0.2339; however, when we re-run the Multiple Linear Regression model without the outliers we found, this value drops to 0.1986, below the benchmark. The R² value is 0.607729, which is moderate in strength and indicates that a variety of factors affect death rates, only some of which can be controlled. This increases if we ignore the seven outliers we found in our data. Further, our F-statistic significance is 1.2695E-29, and all the p-values indicate a significant model.

Impact of Significant Variables
Figure: Death Rate (deaths per thousand people) vs. Density of Physicians (total number per 1000 population); trendline y = 0.3789x + 7.3735, R² = 0.03813

Key Results
Our hypothesis was not supported. It turns out that Density of Physicians has a larger per-unit impact on Death Rates than Prevalence of Raised Blood Pressure does. The fitted relationship is positive, both in the model and in the scatter plot, so in our data countries with more physicians per 1000 people tend to have relatively higher death rates, holding the other variables constant.

In the future, it may be beneficial to further test the effects of illnesses in comparison to quality of hospital accommodations and facilities when looking at the death rates of different countries, to determine which combinations of illnesses and hospital accommodations and facilities would create the lowest death rate. Additionally, because the question on every politician’s mind is how to solve or improve healthcare inequality, it would be helpful to test a factor measuring accessibility of medical care, as this is a key component in death rates that varies greatly across different countries.

Data Sources
World Health Organization
CIA Government Library
FIFA Soccer Rankings: What Factors Make a Team Good?
Key Scatterplots Included in the Model
Motivation The Fédération Internationale de Football Association ranks international soccer teams through a basic model: FIFA Ranking Formula: “Points = M (points for match results)* I (importance of match) * T (strength of opponent) * C (strength of confederation)”
Model We used the Backwards Elimination method to develop a Multiple Linear Regression model. After running our model three times, we found the model with the lowest standard error. We excluded all independent variables except for: Log(Population), Number of Players in the Top Five Leagues, the South America Indicator Variable, Previous Points From the Past Four Years, and Previous Points Squared.
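The final predictors include two transformed inputs, a logarithm and a squared term. The sketch below shows one way such a feature row could be built before running the regression. Everything here is illustrative: the function name `build_features`, the dictionary keys, and the example values are hypothetical, and the base-10 logarithm is an assumption, since the poster does not state which base was used.

```python
import math


def build_features(population, top5_league_players, in_south_america, previous_points):
    """Turn raw team data into the five predictors named in the final model."""
    points = float(previous_points)
    return {
        "log_population": math.log10(population),         # assumed base 10
        "top5_league_players": float(top5_league_players),
        "south_america": 1.0 if in_south_america else 0.0,  # indicator variable
        "previous_points": points,
        "previous_points_squared": points ** 2,           # quadratic term
    }


# Hypothetical team, not taken from the poster's data.
row = build_features(population=45_000_000, top5_league_players=30,
                     in_south_america=True, previous_points=1200)
```

Including both previous points and its square lets the fitted relationship with the ranking curve rather than stay strictly linear.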
On December 1st, 2017, The Daily Post released an article discussing alleged corruption in FIFA’s ranking. This suggests that the factors in the basic model FIFA claims to use are not the only ones that determine ranking position. Our project explores new potential variables that affect a country’s rank.
Not Included in the Model
Data Sources Key Results
To gather data sources, we utilized numerous databases. These data sources include: the official FIFA rankings, the United Nations Population Division population estimates, Economist Intelligence Unit's Democracy Index, The Olympic Games database, StatisticsTimes GDP database, and the World Cup Archives.
Variables We Examined:
Methodology
In our final model:
1. Standard error = 2.13
2. F-significance of 2.9E-113, below our alpha of 0.05
3. Three significant variables: Previous Points Scored, Previous Points Scored Squared, and South America
4. South America had a p-value of 0.052, but we concluded that this was close enough to our threshold of 0.05 to treat it as having a significant impact on our model.
Applications We are very pleased with our final model. In addition to being significant, it has a high R squared and a low standard error. Furthermore, our model forecasts accurately: its ratio of standard error to the average of the dependent variable is .0394, which is far below our benchmark of 0.2. Highlighted independent variables are used in the final model. Non-highlighted ones were excluded.
Garrett Cohen, Neal Mintz, Albert Wu
FIFA is the organization that would benefit most from our model, by basing its ranking on statistical evidence instead of the alleged bribery and other non-statistical means of ranking teams. Our model could also be useful for individuals who want to bet on the final FIFA rankings.
● Previous points is measured over the last four years, and the model uses historical data to forecast future rankings
● Due to the high accuracy that we found for previous points scored in relation to FIFA ranking, this portion of our model is extremely valuable to individuals looking to make these bets
Predicting NCAA Division I Men’s Basketball Coach’s Salary
By Kristina Schmelter and Rachael Sondag
Quality of Regression
Motivation
In May of 2017, Washington University in St. Louis hired a new women’s basketball coach, Randi Henderson, to lead the program. As members of the varsity women’s basketball team, we were curious to see what factors played a part in determining the salary of a collegiate basketball coach.
Scatter Plots
• R2 = 0.8256
• F-Sig = 2.8070E-33
• 3 of our coefficients have p-values < .05
• Standard Error for the model is 0.4462
• Standard Error/average of DV = 0.0326
Ideally we would like our R2 to be higher, but the very low F-sig value shows that our model is significant. The standard error/average of DV is well below the benchmark of 0.2, which suggests the model gives an accurate prediction of a coach’s salary.
Coefficients
Key Data Sources
Key Results
https://www.wrn.com/2016/03/ncaa-division-iii-tournament-tips-off-tonight/
The results in these scatter plots exemplify a positive linear relationship that emerged in the final model between these independent variables and the natural log of Coaching Salary.
Descriptive Statistics
• Coaching Salary Mean: 1480552.41. The average NCAA Division I Men’s basketball coach has a salary of $1,480,552.41
• Coaching Salary Standard Deviation: 1346304.6
Our model suggests that there are factors that influence an NCAA Division I men’s coach’s salary. The three variables that were statistically significant were tournament appearances, attendance of games, and university enrollment. The more successful coaches who have more tournament appearances tend to have a higher salary. Coaches at bigger universities also tend to have higher salaries. Finally, coaches who have more people attending the games have higher salaries.
ln(Coaching Salary) = y = 6.1084 − 1.1564(X_Win%) + 0.0612(X_Tournament Appearances) + 0.8970 ln(X_Attendance) + 9.6468 × 10^[exponent illegible](X_Enrollment)
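Since the dependent variable is the natural log of salary, the quality metrics above are in log units, and it can help to translate them back into dollars. The following sketch does that arithmetic under one assumption: that the reported ratio 0.0326 is the model's standard error (0.4462) divided by the average of ln(salary). The variable names are hypothetical.

```python
import math

SE_LOG = 0.4462        # standard error of the model, in ln(salary) units
SE_OVER_MEAN = 0.0326  # reported standard error / average of the dependent variable

# Implied average of ln(salary), assuming the ratio was computed on the log scale.
mean_log_salary = SE_LOG / SE_OVER_MEAN

# exp(mean of logs) is the geometric mean salary, in dollars.
typical_salary = math.exp(mean_log_salary)

# An error of one standard error in log units is a multiplicative error in dollars.
error_factor = math.exp(SE_LOG)
```

Under that assumption the implied average of ln(salary) is about 13.69, or roughly $880,000 as a geometric mean, and a one-standard-error miss corresponds to a salary prediction that is off by a factor of about 1.56. The small ratio of 0.0326 therefore overstates how precise the dollar predictions are.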
Starting Salaries After Graduation: What Actually Matters? Motivation We focused our project on the national dialogue surrounding the viability of the university system and the cost-benefit analysis of a college degree. Our findings have the potential to help students and their families make educated decisions about whether or not to attend a college or university. We expected to find that socioeconomic status has a high impact on the starting salaries of graduates.
Margot Dupuis Brittany Hendrix John Yucesoy
The Process
Our MLR:
Y = 22.538 − .051X1 − .416X2 + .567X3 + .148X4 + .289X5 + .425X6
X1 – Median Household Income ($K)*
X2 – Students From Bottom 20% (%)*
X3 – Chance Of Poor Student Becoming a Rich Adult (%)*
X4 – Average Net Price of School ($K)
X5 – Student to Faculty Ratio (x:1)
X6 – Average Academic Spending per Student ($K)*
R2 = .798115
Standard Error = 5.1576
*These are the only variables that are statistically significant; however, the model is best with all variables included.

Coefficients
Intercept | 22.5380413
Median Household Income | -.05084591
Students From Bottom 20% | -.41587085
Chance of Poor Student Becoming a Rich Adult | .566754264
Average Net Price of School | .148264233
Student to Faculty Ratio | .289046704
Average Academic Spending per Student | .425144561
Results Outliers Harvard University – due to the importance of name branding and alumni network University of the Sciences – due to the specialized nature of the education
Data Sources The two outliers on this graph are Harvard University and University of the Sciences. It should be noted that this variable has high multicollinearity.
Motivation Major League Baseball teams are spending hundreds of millions of dollars on bolstering their batting lineup or pitching staff. Which players are worth more money? Should ownership spend money on hitting or pitching? We can explore this by looking at what statistics lead to more wins.
Descriptive Statistics
1980 2014
Statistic | Average | Std Dev | Min | Max
Wins | 81 | 10.5 | 59 | 104
Home Runs | 168 | 42.2 | 61 | 253
Hits | 1429 | 82.7 | 1199 | 1664
BAA | 0.258 | 0.014 | 0.212 | 0.294
OPS | 0.735 | 0.041 | 0.634 | 0.837
RBI | 690 | 84.3 | 500 | 926
ERA | 4.15 | 0.565 | 2.94 | 5.52
Strikeouts | 1168 | 214.8 | 575 | 1614
WHIP | 1.34 | 0.10 | 1.11 | 1.64

Multicollinearity
Hits/RBIs
Hits/OPS
Home Runs/OPS
RBIs/OPS
BAA/WHIP

Regression Coefficients
RBI
ERA

Key Results
R Squared = 0.8464 (Strong)
Standard Error = 4.236 (Fairly Low)
SE/(Average y) = 0.0523 (Below .2)
Significance F = 8.267E-62 (Below .05)
RBI P-Value = .0007 (Below .05)
ERA P-Value = 1.886E-17 (Below .05)

Variables Not in Model
Home Runs
Batting Average Against
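The multicollinearity pairs listed on this poster (for example Hits/RBIs) are the kind of problem a pairwise correlation check is meant to catch before variables are dropped. A minimal sketch follows. The function names, the 0.7 cutoff, and the five team-season values are all hypothetical and are not taken from the poster's data.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.7):
    """Return predictor pairs whose absolute correlation exceeds the threshold."""
    return [(a, b) for a, b in combinations(columns, 2)
            if abs(pearson(columns[a], columns[b])) > threshold]


# Hypothetical team-season values for three candidate predictors.
columns = {
    "hits": [1350, 1400, 1450, 1500, 1550],
    "rbi": [620, 660, 690, 740, 770],
    "era": [4.5, 3.8, 4.2, 4.0, 4.3],
}
flagged = correlated_pairs(columns)
```

In this toy data only the hits and RBI columns are flagged, which mirrors the idea of keeping one variable from each highly correlated pair.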
XC: Place in the Race
How Outside Factors Affect A Runner’s Race Placement
Team 6: Yaseen Ali, Claire Chen, Marco Quaroni
Equation
ŷ = −9922.428 + 1.307 x_time + 3.781 x_year + 0.001 x_elevation − 34.4553 I_LaverneGibson − 5.229 I_SR − 4.712 I_JR − 2.990 I_SO − 4.584 I_West − 11.329 I_Mountain − 4.462 I_Midwest − 3.016 I_GreatLakes − 6.636 I_Southeast
Motivation
One interesting thing about cross country races is that the fastest runner, based on personal records or past performance, might not always place highest during the race. This could be the result of a ‘bad day’, but we wanted to delve deeper into what other factors could affect how well runners placed in a competitive race, specifically at a collegiate level. To explore this, we chose to examine what factors led to the best placing in the Division 1 Men’s National Cross Country Championship Race between the years of 2012-2017.
Variables
For this model, we had a total of 13 independent variables: 3 numerical variables and 10 indicator variables in 4 categories. The dependent variable was place.
Numerical: Time (s), Year (YYYY), Elevation (ft.)
Qualification: Team Qualification, Individual Qualification
Course: E.P. “Tom” Sawyer Park, Lavern Gibson Park
Year in School: FR: Freshman, SO: Sophomore, JR: Junior, SR: Senior
Region: West, Mountain, Midwest, Great Lakes, Northeast, Southeast, South
*Base cases are highlighted in yellow on the poster. **Removed variables are highlighted in pink on the poster.
Data Source
Our main data source for the project was The Track and Field Results Reporting System (TFRRS), a site that collects and stores information on runners from all NCAA Division I, Division II, and Division III races. We took a sample of 1497 times from competitors in the 10km Division 1 National Championship Race between 2012-2017.
Key Results
These results showcase several interesting features. First, we can directly tell how many runners finish per second in a race and estimate overall place: for every additional second, approximately 1.3 more runners have finished. Next, as the years go on, the races get more competitive, as shown by the year coefficient (3.781): the same time places about 4 spots lower with each passing year. Additionally, the course on which the race is run affects placement drastically. Runners at the Laverne Gibson course place 34 places better, on average, than runners with the same time at E.P. “Tom” Sawyer Park. Furthermore, seniors and juniors place, on average, 5 spots better than freshman runners. This could be a result of older runners having more experience or physical maturity. Lastly, athletes from the Mountain region place an average of 11 places better than athletes from the South. This could be a result of the higher average elevation of the Mountain region: runners who train at higher elevations are acclimated to having less oxygen and therefore might be more adept at running long distances.
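The interpretations above follow directly from plugging values into the poster's equation. The sketch below evaluates it for a hypothetical runner; the function name `predict_place`, the dictionary layout, and the example inputs (an 1800-second run in 2015 at 500 ft) are illustrative assumptions, while the coefficients are the rounded ones from the equation. Categories without their own coefficient are treated like the base case.

```python
INTERCEPT = -9922.428
B_TIME, B_YEAR, B_ELEVATION = 1.307, 3.781, 0.001
B_LAVERNE_GIBSON = -34.4553  # indicator: race run at the Laverne Gibson course
CLASS_EFFECT = {"FR": 0.0, "SO": -2.990, "JR": -4.712, "SR": -5.229}
REGION_EFFECT = {"South": 0.0, "West": -4.584, "Mountain": -11.329,
                 "Midwest": -4.462, "Great Lakes": -3.016, "Southeast": -6.636}


def predict_place(time_s, year, elevation_ft, laverne_gibson, class_year, region):
    """Predicted finishing place from the poster's rounded regression equation."""
    place = INTERCEPT + B_TIME * time_s + B_YEAR * year + B_ELEVATION * elevation_ft
    if laverne_gibson:
        place += B_LAVERNE_GIBSON
    # Categories with no coefficient in the equation behave like the base case.
    place += CLASS_EFFECT.get(class_year, 0.0)
    place += REGION_EFFECT.get(region, 0.0)
    return place


base = predict_place(1800, 2015, 500, False, "FR", "South")
ten_seconds_slower = predict_place(1810, 2015, 500, False, "FR", "South") - base
senior_mountain_gain = predict_place(1800, 2015, 500, False, "SR", "Mountain") - base
```

For these inputs the base-case runner is predicted to place about 49th, ten extra seconds cost about 13 places, and a senior from the Mountain region with the same time is predicted about 16.6 places better.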
Model
We started with 1497 observations. After running a residual analysis, we removed 22 outliers. Using backwards elimination, we determined that School Endowment, Northeast Region, Mid-Atlantic Region, and Team Qualification were insignificant variables in the model, and removing them improved our results.
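A residual analysis of the kind mentioned above is often done by standardizing each residual by the model's standard error and flagging the large ones. The sketch below shows that idea only; the poster does not state which rule the team used, so the function name `flag_outliers`, the cutoff of 3 standard errors, and the example placements are all assumptions. The standard error is the one reported in the regression statistics.

```python
def flag_outliers(actual, predicted, std_error, threshold=3.0):
    """Return the indices of observations whose standardized residual
    exceeds the threshold in absolute value."""
    return [
        i for i, (a, p) in enumerate(zip(actual, predicted))
        if abs(a - p) / std_error > threshold
    ]


STD_ERROR = 24.2339576  # standard error reported for the final model

# Hypothetical actual and predicted places for four runners.
actual_places = [10, 50, 200, 120]
predicted_places = [12, 45, 110, 118]
outliers = flag_outliers(actual_places, predicted_places, STD_ERROR)
```

Here only the third runner, who finished 90 places away from the prediction, is flagged.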
Variable | Coefficient | P-Value
Intercept | -9922.428087 | 5.50362E-37
Time (in Seconds) | 1.307450984 | 0
Year (YYYY) | 3.780677625 | 4.51875E-23
Elevation (ft.)* | 0.001037074 | 0.154565599
Laverne Gibson | -34.4553276 | 3.6386E-123
SR | -5.229255596 | 0.010184888
JR | -4.711633561 | 0.022229152
SO* | -2.990094709 | 0.159768556
West | -4.584058287 | 0.018908299
Mountain | -11.32936515 | 0.005119452
Midwest | -4.46244128 | 0.049360934
Great Lakes* | -3.016364818 | 0.180011456
Southeast | -6.635796162 | 0.001389268

Regression Statistics
R-Squared: 0.884967837
Standard Error: 24.2339576
Observations: 1477
SE/y-bar: 0.196032397
F-Significance: 0
Y: 124

*Non-significant variables (highlighted in red on the poster). ** F-Significance rounded to 0 through Excel
Conclusions
In conclusion, we can accurately predict with 95% certainty given an athlete’s time, year raced, course type, year in school, and athletic region, how they will place at the Nationals race. This impacts team championships, athletic programs, and even the athletic apparel industry for marketing and development of professional teams.
Why aren’t there more Women in the Workforce? Motivation
Model and analysis
The percentage of women in the overall workforce differs sharply from the percentage of men in the workforce, with the percentages fluctuating greatly by country. Figure 1 shows the percentage of women in the workforce by country. While there are aspects of the female identity that affect the presence of women in the workforce across the globe, specific characteristics of each country’s social, cultural, and legal structure contribute to the disparities by country. Analyzing the broader factors affecting women in the workforce pushes the controversy of this issue past the discussion of gender discrimination and into the realm of historical and cultural influences by country. We wanted to examine the economic, maternal, and educational challenges that women face across the globe to understand their participation in the workforce. Thus, we developed a hypothesis:
Figure 1: Percentage of Women in the Workforce by Country
Our final model (below) does not include birth rates as an independent variable but relates PPP, literacy rates, and net migration rates^2 to the percentage of women in the adult labor force. Figure 2 shows scatter plots of each of our variables in isolation. We conducted an F-test and found that our model is statistically significant at 95% confidence. It also has a suitable forecasting error of .055, well below the accepted benchmark of .2, and all coefficients are sensible and statistically significant at 95% confidence. There are some issues with the model: it explains a very low proportion of the variance in women’s workforce participation, meaning there are many other factors affecting it in addition to the variables we used. We incorrectly predicted the sign of the relationship between women’s labor force participation and GDP purchasing power parity, and we predicted a relationship with birth rates that the data did not support. However, we correctly predicted in our hypothesis that women’s labor force participation has a positive relationship with literacy rates and a negative relationship with net migration rates.
There is a positive linear relationship between the literacy rates, GDP purchasing power parity, and the percentage of women in the workforce per country while there is an inverse linear relationship between fertility rates, net-migration rates, and the percentage of women in the workforce per country.
Figure 2: Isolation of Significant Variables Scatterplots
y = -0.000115(PPP) + 0.137(Literacy Rate) - 0.00751(Net Migration Rate²)
Descriptive Statistics
Variable | Mean | Standard Error
Percentage of Women in the Labor Force | 40.62423851 | 0.733749203
PPP | 19905.6962 | 1669.16036
Literacy Rate | 85.1208861 | 1.44082253
Net Migration Rate | -0.4379747 | 0.54609327
Methodology We collected data from the United Nations Statistics Division from 2006 on economic activity by country, including the percentage of women in the workforce. We took data on PPP, net migration rate, birth rate, and literacy rate by country from the Central Intelligence Agency World Factbook from 2014. Because our model combined data sets from two different organizations, we had to delete several countries that appear in the United Nations data but not in the CIA data. Therefore, our model is not representative of every country. Our analysis differs from the PWC and Bullough findings in two major ways: variability in observations and targeted use of independent variables to test the significance of their impact. Our data measuring the percentage of women participating in the labor force looks at every country in our combined data set, whereas PWC’s data only looks at developed countries. In developing our model, we examined each attempted model’s statistical significance, proportion of variance explained by the model (R²), forecasting error (Se/ȳ), statistical significance and sensibility of each coefficient, suspect points, and homoscedasticity. We used backwards elimination and high-level analysis to develop the best model. We chose a quadratic modeling technique for our regression because we noticed a parabolic error distribution in the residual plot of net migration rates. This parabolic error distribution is likely due to the fact that many of the net migration rate data points were negative. Net migration rate was the only independent variable in our model with negative data points; therefore, we measured our other independent variables at their face value in our regression.
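The effect of squaring net migration rate can be sketched in a few lines of Python. The coefficient is the one reported in the final model; the sample rates are hypothetical and chosen only for illustration. Because the rate is squared, a net outflow and an equally large net inflow lower the predicted participation by the same amount.

```python
# Coefficient on the squared net migration rate, taken from the poster's final model.
NET_MIGRATION_SQ_COEF = -0.00751


def migration_term(net_migration_rate: float) -> float:
    """Contribution of net migration to the predicted percentage of women in the labor force."""
    return NET_MIGRATION_SQ_COEF * net_migration_rate ** 2


# Hypothetical rates, not drawn from the data set.
for rate in (-5.0, 0.0, 5.0):
    print(rate, round(migration_term(rate), 5))
```

A linear term would instead push predictions in opposite directions for negative and positive rates, which is one way to read the parabolic residual pattern described above.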
Managerial Statistics II TEAM 68: Lexi Jackson, Tiffany Powell, and Jacob Halladay-Glynn
Key Results Our findings conclude that a nation’s PPP and literacy rates are significantly related to the nation’s female labor force participation rate, and that net migration rate is significantly related to female workforce participation when the variable enters the model as a quadratic term. In addition, our model’s low r-squared indicates that there are many other factors that affect women’s participation in the workforce in a given country, such as the nation’s predominant religion. This high level of unexplained variation in the data caused some numerical outliers in our model, especially among countries with extremely low participation rates due to religious or political law. However, these outliers are not results of error, and removing them from our model harmed the model’s significance. Therefore, our model cannot explain female labor force participation rates completely or for all nations, but it does imply that PPP and literacy rates relate to female participation, that birth rates do not, and that net migration rates relate through a quadratic term. Overall, female workforce participation appears to be more closely related to socioeconomic variables and, perhaps, to political and religious factors that were unexplored by our model.
Life Expectancy in Texas Counties
Motivation and Data
Everyone lives, and everyone is in control of some of the factors that influence how long they will live. While previous studies of life expectancy have focused on either very narrow regions such as small neighborhoods of large cities or very large regions such as states or even countries, our study draws a middle ground by focusing on counties in Texas.
Overall, our model is statistically significant. Additionally, each of our variables is also statistically significant by a wide margin. The R² indicates the portion of variation explained by the model.
R²: 0.388109
F-Significance: 1.35 x 10^-10
Number of observations: 110
Coefficients
Life Expectancy
By analyzing counties, we can draw more direct conclusions about factors that vary widely in these communities, such as racial makeup, income, and population density. Our data is highly credible since all of our independent variables are drawn from the U.S. Census. Our dependent variable, average life expectancy per county, is drawn from World Life Expectancy, a source with over a million users including government and educational institutions.
Bar chart of Mean, Median, Minimum, and Maximum life expectancy (labeled values: 81.80, 77.05, 76.65, 73.95)
Economic Significance
Life Expectancy Vs. Per Capita Income ($K) (scatterplot with trend line y = 0.0933x + 74.798, R² = 0.10081)
Bar chart (labeled values: 0.686, 0.804, 0.596, 0.494)
Although these coefficients may seem small, life expectancy is measured in years.
The largest coefficient, the natural log of the population per square mile, correlates with residents living over 107 days longer on average, holding constant the other variables. That’s more than 3.5 months!
As the smallest coefficient, a 1% higher Hispanic population rate still correlates with residents living over ten days longer on average, holding constant the other variables.
Variable | Coefficients | P-Values
% Black | -0.07426 | 2.37 x 10^-3
% Hispanic | 0.02758 | 7.78 x 10^-5
Ln(Population) | 0.29335 | 1.27 x 10^-4
Income Per Capita | 0.13855 | 2.17 x 10^-6
Variables Removed:
• Percent White
• High School Graduation Rate
• Percent Without Health Insurance
• Employment Rate
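The conversion from coefficients to days can be checked with a short Python sketch. The coefficients are the ones reported on the poster, in years of life expectancy per one-unit increase; the 365.25-day year is an assumption made here for the conversion.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years

# Coefficients reported on the poster (years of life expectancy per one-unit increase).
coefficients = {
    "% Black": -0.07426,
    "% Hispanic": 0.02758,
    "Ln(Population)": 0.29335,
    "Income Per Capita ($K)": 0.13855,
}


def years_to_days(years: float) -> float:
    """Convert a coefficient measured in years into days."""
    return years * DAYS_PER_YEAR


days = {name: years_to_days(coef) for name, coef in coefficients.items()}
for name, value in days.items():
    print(f"{name}: {value:+.1f} days per one-unit increase")
```

Under this assumption the Ln(Population) coefficient comes to just over 107 days and the % Hispanic coefficient to just over 10 days, in line with the figures quoted in the text.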
Quality Metrics
: 0.0176
Chart categories: % Black, % Hispanic, LN(Population), Per Capita Income ($K)
Key Takeaways
Actuaries, financial planners, and insurance agents can help their clients and their businesses by using not only demographic and racial data but also population density in their analyses. Individuals concerned with longevity should work to increase their income, likely through education, and move to more urbanized areas.
Retirement homes should move into the counties with the highest life expectancies, such as Collin, Hidalgo, and Williamson. Of course, correlation does not imply causation, so moving to a county with higher life expectancy in no way guarantees that an individual will live longer. However, members of that county do indeed live longer on average.
Group 81: Caroline Stocking and Ruth Kingsbury
What Makes an NBA Player Efficient: Variable Effect on PER
By: Harkirat Anand, Nate Engel, Cameron White
Motivation In our society people love to rank performers. This is especially true in sports. Many individuals make a living comparing professional athletes to one another. Most ranking systems rely on the subjective opinions of these “experts,” but oftentimes have little to no hard evidence to justify the rankings. This is especially true in the NBA, where nearly every fan has their own two cents on the “true” rankings of the players in the league. However, with the current myriad of data available on player performance, such subjective rankings are becoming less and less reliable. Also, when statistics are used for NBA rankings, certain metrics such as points, rebounds, and assists are often weighted too heavily and therefore do not provide a complete overview of the true success of a player. To combat this, the player efficiency rating (PER), a comprehensive statistical measure developed by John Hollinger, aggregates offensive and defensive statistics into one specific number. This ultimately gauges the effectiveness of a player relative to their playing time each season. We decided to analyze the average PER of active American NBA players over the course of the 2016-17 season.
Methodology We used a Backwards Elimination Method and found that the model with the lowest Standard Error included all of our pre-determined variables.
Key Results We found that the following variables have a positive effect on a player’s PER: currently being in the peak of their career (5-9 years of experience), experiencing a significant injury (over one month out), and number of games played. By contrast, our findings show that the following variables have a negative effect on a player’s PER: number of all-stars on the player’s team, joining a new team prior to the season, age started in the NBA, and draft position. Below is a scatterplot showing the variable with the highest significance (other than binary variables): Number of Games Played.
ŶPER = 17.249 – 0.337X1 – 1.678X2 + 1.648X3 – 0.342X4 + 1.68X5 + 0.066X6 – 0.0213X7
X1 = Number of All Stars
X2 = New team vs Return (Binary: New Team = 1)
X3 = Suffered a major injury (Binary: Yes = 1)
X4 = Age started in NBA
X5 = If the player is in the peak of his career (Binary: Yes = 1)
X7 = Draft Position
Although the R2 is very small, the F stat is highly significant and most of the individual p-values are significant.
R2 = 0.193 F Stat = 1.86963E-11 Standard Error: 3.975 n=306
Key Data Sources
ESPN: Leader in sports reporting in the United States
• Player Efficiency Rating (Hollinger Report)
• Number of all-stars on team (excluding themselves)
Official NBA statistics website
• Joined a new team prior to the season (binary)
• Suffered major injury keeping out for over a month (binary)
• Number of games played throughout season
BasketballReference.com: Foremost independent basketball database
• Age started in the NBA
• Currently in peak of career (5-9 years of experience)
• Draft Position
There were no players that had significant leverage on the model.
Conclusion Coaches and management can use these results to identify the variable with the greatest effect on a player’s efficiency rating, outside of purely in-game statistics. They can use these predictions to determine how a player is performing relative to how well they should be performing, which gives a clearer indication of whether a player is over- or underperforming.
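The comparison of actual and expected performance can be sketched in Python as a residual: observed PER minus the PER fitted by the regression equation. The coefficients are the ones reported on the poster; the player profile and the observed PER below are hypothetical.

```python
def predicted_per(all_stars, new_team, major_injury, age_started,
                  in_peak, games_played, draft_position):
    """Fitted PER from the poster's regression equation (X1 through X7)."""
    return (17.249
            - 0.337 * all_stars
            - 1.678 * new_team
            + 1.648 * major_injury
            - 0.342 * age_started
            + 1.68 * in_peak
            + 0.066 * games_played
            - 0.0213 * draft_position)


# Hypothetical player, not taken from the data set.
expected = predicted_per(all_stars=1, new_team=0, major_injury=0, age_started=20,
                         in_peak=1, games_played=70, draft_position=10)
actual_per = 18.0  # hypothetical observed PER
residual = actual_per - expected
print(f"predicted PER {expected:.3f}, residual {residual:+.3f}")
```

A positive residual would suggest the player is outperforming the model's expectation, and a negative one the reverse. With an R2 of 0.193, such residuals should be read loosely.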
Variable Statistics
Variable | P-Value | Mean | Median | Standard Deviation | Minimum | Maximum
PER (dependent variable) | | 14.1548 | 13.405 | 4.37416 | 7.18 | 30.7
# of all stars on team | 0.199145 | 0.74836 | 1 | 0.87520 | 0 | 4
Returner/New team (New = 1) | 0.000429 | 0.44117 | 0 | 0.49734 | 0 | 1
Major Injury (Yes = 1) | 0.039450 | 0.18627 | 0 | 0.38996 | 0 | 1
Age Started | 0.038663 | 20.3398 | 20 | 1.76361 | 17 | 28
Peak (Yes = 1) | 0.000420 | 0.39542 | 0 | 0.48974 | 0 | 1
# of games played | 0.004108 | 66.4313 | 70 | 13.5908 | 18 | 82
Draft Position | 0.121302 | 25.5359 | 19 | 21.2118 | 1 | 70
NBA.com
We found 15 players, all with the highest PERs, to have large studentized residuals (beyond ±2). This means that they are considered outliers. We did not exclude any of these points because we expected about 5% of the data to be outliers (which in this case is 15.3 players). Our findings indicate that 41 players had a Cook’s D value over 1, making them influence points. After confirming that we had entered the data correctly, we found nothing in particular that explained why these points were influential; therefore, we believe they were too important to remove from the model.
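The expected number of outliers quoted above is simple arithmetic, sketched here in Python. The 5% share is the poster's rule of thumb for studentized residuals beyond ±2, and the sample size and observed count are the reported values.

```python
n_players = 306        # sample size reported on the poster
outlier_share = 0.05   # poster's rule of thumb for residuals beyond +/-2

expected_outliers = n_players * outlier_share
observed_outliers = 15  # count reported on the poster

print(f"expected about {expected_outliers:.1f}, observed {observed_outliers}")
```

Since the observed count does not exceed the expected count, the number of large residuals alone gives no reason to drop points.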
X6 = Number of games played
Hypothesis We predict that number of games played in the season will have the strongest effect on Player Efficiency. On the other hand, we believe that age started will have the least effect. Overall, we believe that all the variables tested will be significant.
Outliers
Scatterplot of # of Games Played (vertical axis) against PER (horizontal axis), with trend line y = 0.412x + 60.6, R² = 0.01758
What Makes Childbirth so Expensive?
Mary-Brent Brown
Motivation
Healthcare costs have been steadily rising in recent years and show no signs of stopping without some form of governance. Childbirth, one of the most common procedures in the healthcare field and comparable across healthcare providers, is a reliable procedure to assess healthcare costs. Births are the most common reason for hospitalizations in the United States with 4.2 million inpatient stays and 18.9 billion in hospital costs in 2008.
Data Source
My data set is a representative sample of 100 observations from the 2011 Healthcare Cost and Utilization Project (HCUP). HCUP is run by the Agency for Healthcare Research and Quality, a federal-state partnership that allows public access to non-identifying medical information.
Dependent Variable: Birth Cost
Less than $1500 | 9%
$1500-$2499 | 43%
$2500-$3499 | 23%
$3500-$4499 | 15%
$4500-$5499 | 6%
$5500-$6499 | 2%
$6500 and above | 2%
Independent Variable: Birth Type
Vaginal Birth | 72%
C-section | 28%
Independent Variable: Income Quartile
$1-$38,999 | 30%
$39,000-$47,999 | 25%
$48,000-$62,999 | 24%
$63,000 or more | 21%
Methodology
I started with the multiple linear regression as my model, but only one out of five independent variables was significant according to the partial F tests. This was my main challenge when developing my model, and proved to be an issue throughout every model I used. Although the model itself was statistically significant, it was a poor model because the standard error divided by average y is 0.35 and the R2 is low at 0.45. I removed the most egregious outlier variables and ran the regression in SPSS. The final model had the lowest standard error and the highest R2, but still four insignificant variables and seven outliers. I moved to backwards elimination. The model with the lowest standard error removed race, income, and payer and included age and csection with a yhat=1212.46+34*age+1834.32*csection. The standard error divided by average y was 0.347 with an R2 of a measly 0.438. I finally tested the Cobb-Douglas model. I took the log of cost only. This model was significant and yielded a low standard error divided by average y, but the R2 was low at 0.429 and only csection was significant.
Results
My findings are inconclusive because each model failed to yield a significant model with more than two statistically significant variables. The models, because of their low quality, could not answer my hypotheses. This implies that more data is needed to form a conclusive model about birth costs. For now we must use the average y value of cost, $2772.84.
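For illustration, the backwards-elimination model reported in the methodology, yhat = 1212.46 + 34*age + 1834.32*csection, can be written as a small Python function. The coefficients are the poster's; the 30-year-old patient is hypothetical, and the author notes that the model is too weak to rely on.

```python
def predicted_cost(age: float, csection: int) -> float:
    """Fitted birth cost in dollars; csection is 1 for a C-section, 0 for a vaginal birth."""
    return 1212.46 + 34 * age + 1834.32 * csection


# Hypothetical 30-year-old patient.
vaginal = predicted_cost(age=30, csection=0)
cesarean = predicted_cost(age=30, csection=1)
print(f"vaginal: ${vaginal:,.2f}, C-section: ${cesarean:,.2f}")
```

The gap between the two predictions is the csection coefficient, $1834.32, regardless of the age entered.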
Group 21: Camille Benson and Madison Siguenza
What variables affect the number of weeks a song is on the Billboard Hot 100?
Motivation With the switch from physical distribution to online streaming services, it has become difficult to predict how popular songs actually are, as there are now multiple mediums for accessing new music. Since Billboard calculates its ranking from radio airplay, sales data, and streaming activity, we want to know what other variables may also have an effect even if they aren't part of the formula.
Methodology We used backward elimination once and found the model with the lowest standard error and highest R^2. The dependent variable is number of weeks on the Billboard Hot 100. The independent variables used in the model can be found in the table below.
Descriptive Statistics
The Model y=7.44-0.11(x1)+4.56E-8(x2)+7.74(x3)+3.25(x4)-2.07E-9(x5)
Key Results As you can see in the graph below, there is a correlation between the number of weeks a song will be on the charts and the top ranking it will reach.
Data Sources Indicator variables for artist gender (using male as a base category) were removed from the model since they were found not to have a significant relationship with number of weeks on the Hot 100.
R^2 = 0.667555693 Se/ȳ = 0.487257463 F-stat = 7.91189E-21 n = 99 (removed 1 observation)
Key Metrics The model is statistically significant, with a high R^2 and an F-stat much smaller than .05. However, the standard error over the average y is greater than the benchmark of .2, so the model's predictions are not always accurate.
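The two benchmarks used in this assessment can be expressed as a small Python check. The thresholds (.05 for the F-stat and .2 for standard error over average y) are the ones named on the poster; the function name and the result keys are hypothetical.

```python
F_ALPHA = 0.05          # significance threshold used on the poster
SE_OVER_YBAR_MAX = 0.2  # forecasting-error benchmark used on the poster


def check_model(f_significance: float, se_over_ybar: float) -> dict:
    """Compare a model's reported metrics against the two benchmarks."""
    return {
        "statistically_significant": f_significance < F_ALPHA,
        "forecast_error_ok": se_over_ybar < SE_OVER_YBAR_MAX,
    }


billboard = check_model(f_significance=7.91189e-21, se_over_ybar=0.487257463)
print(billboard)
```

For the reported values, the model passes the significance check and misses the forecasting-error benchmark.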
Outliers We removed one observation from the data set. Ed Sheeran's song Shape of You was a strong influencer.
Conclusion Overall, our research suggests that there are other variables that affect how long a song will stay on the charts. The genre of the song can make it stay longer on the charts, and how close the song's peak position is to 1 also has an effect.
PREDICTING
FREE THROW PERCENTAGE
BY JASON BLANKFEIN, MATT DOUGHERTY, AND ADAM KAUFMAN
Motivation “What factors influence an NBA player’s free throw shooting percentage?” is a question that, to the average NBA fan, is not obvious. Other NBA statistics, such as rebounds and shooting percentage, have significant impacts on the observations fans make about the player’s performance (position, age, or if they are a “good shooter” or not). By recognizing the subconscious biases that fans make about whether or not a player is going to make a free throw, we became motivated to discover what player characteristics make a good free throw shooter. The practical uses of this study could be but are not limited to:
• NBA front office personnel for trading and free agent signing purposes
• Fantasy basketball team managers for drafting purposes
• The average NBA fan
Model & Analysis Multiple linear regression with backwards elimination yielded this 5-variable model.
Regression: FT% = 71.37 + .22(B1) + .47(B2) - .29(B3) + 8.79(B4) + 6.69(B5)
B1 – Age
B2 – Min./Game
B3 – eFG%
B4 – PG (Categorical Variable)
B5 – SG (Categorical Variable)
The following variables were excluded from the final model:
• Free Throw Attempts/Game
• Small Forward
• Power Forward
Key Results
• MPG, eFG%, PG, and SG are all statistically significant with p-values < 0.05
• Even though the “Age” variable has a p-value of 0.24, the backwards elimination methodology showed that its removal would raise the standard error of the model
• Several outliers were removed from the data set due to high studentized residual or leverage values
Data Sources
Descriptive Statistics
Table detailing descriptive statistics for the model
Table detailing regression success metrics
Conclusion We found that the best free throw shooters are older (relative to the rest of the NBA) guards that play lots of minutes. As for the managerial implications of the model, we can make a recommendation to NBA GMs to acquire players that fit this description to increase the team’s FT%. Additionally, fantasy basketball managers can use this info to draft players that are efficient at the free throw line.
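The role of the position indicators in this description can be sketched in Python using the poster's regression equation. The player profile is hypothetical, and entering eFG% in percentage points is an assumption, since the poster does not state the units.

```python
def predicted_ft_pct(age, minutes_per_game, efg_pct, is_pg=0, is_sg=0):
    """Fitted FT% from the poster's equation; is_pg and is_sg are 0/1 position indicators."""
    return (71.37 + 0.22 * age + 0.47 * minutes_per_game
            - 0.29 * efg_pct + 8.79 * is_pg + 6.69 * is_sg)


# Hypothetical profile; eFG% is assumed to be entered in percentage points.
profile = dict(age=28, minutes_per_game=30, efg_pct=50)
other = predicted_ft_pct(**profile)
point_guard = predicted_ft_pct(**profile, is_pg=1)
shooting_guard = predicted_ft_pct(**profile, is_sg=1)
print(round(other, 2), round(point_guard, 2), round(shooting_guard, 2))
```

Holding the other inputs fixed, the point guard and shooting guard predictions sit 8.79 and 6.69 points above a player at any other position.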
What variables are the strongest predictors of whether a country’s athletes will earn medals at the Olympic games? Opening Ceremony
Key Results
Once every two years, the majority of the world’s countries gather to compete in the Olympics. In our data analysis, we aim to understand the factors that contribute to a country’s athletic prowess. We used data from all countries with a National Olympic Committee, 206 in total in 2016. We decided to model medal counts using the following 4 variables:
● GDP (billions)
● Literacy rate
● Population
● Size of Olympic team
Hypothesis: Countries with higher populations, higher GDP (in billions), and larger Olympic teams will produce superior results in the Olympic games on average, in a model that also controls for literacy rate by country.
Model Quality
A key outcome of our model was the significant relationship between number of athletes and the total medal count. At a significance level of .05, there is significant statistical evidence of a strong positive linear relationship between the number of athletes on an Olympic team and the number of medals won by a country in the 2016 Rio Olympics (p-value of 3.0955E-34), holding constant the other independent variables in the model, GDP and population.
We used a Backwards Elimination Method to perform a multiple linear regression on the data that we collected to find the model with the lowest standard error. We removed Literacy rate from the model because it was shown to be statistically insignificant. There were two key outliers, USA and China, that had significant influence and leverage on the regression model, but we decided to include these countries because they are too important to remove.
R2 = .874 F-Stat = 1.269E-90 Se/Ȳ = .990 N = 206
The model is statistically significant with a high R^2 value and an F-stat lower than ɑ = .05. While our high standard error is a concern and greater than the threshold of .20, it is most likely related to outliers such as a few countries winning the majority of all Olympic medals.
Descriptive Statistics
Variable | Mean | Median | Standard Dev. | Standard Error
GDP (billions) | 637.63 | 63.878 | 2837.992 | 197.732
Population | 35669360.524 | 7081497 | 134922064.184 | 9400465.647
Total Athlete # | 54.737 | 12 | 95.324 | 6.641
Closing Ceremony On average, countries with a high GDP (billions), a large population, and large Olympic teams are more likely to receive medals at the Olympics than smaller and less affluent countries. Logically, this makes sense because a country with a higher GDP has more money to invest in its athletics and a large population yields a larger talent pool. We found that literacy rate is statistically insignificant, which suggests that a country whose primary goal is succeeding in the Olympics should invest more in athletics than academics.
Data Sources The key sources we used were the CIA Factbook and the Olympics official statistics page. Thus, our data was collected from multiple credible sources.
Model ŷmedals = -0.6403 + 0.0036(GDP) + 0.0782(athlete count) - 1.715*10^-8(population)
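As a rough illustration, the fitted equation can be evaluated in Python to see what a larger team adds to the predicted medal count. The coefficients are the poster's; the country below is hypothetical.

```python
def predicted_medals(gdp_billions, athletes, population):
    """Fitted medal count from the poster's model (GDP in billions of dollars)."""
    return (-0.6403 + 0.0036 * gdp_billions
            + 0.0782 * athletes - 1.715e-8 * population)


# Hypothetical country, not taken from the data set.
base = predicted_medals(gdp_billions=1000, athletes=100, population=50_000_000)
bigger_team = predicted_medals(gdp_billions=1000, athletes=110, population=50_000_000)
print(round(base, 3), round(bigger_team - base, 3))
```

With GDP and population held fixed, ten additional athletes raise the prediction by ten times the athlete coefficient, about 0.78 medals.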
Paul Lachman, Caroline Meyer, and Noah Truwit
TELEVISION CONSUMPTION
What variables are the strongest predictors of TV watching?
Motivation With the rise of platforms such as Netflix and Hulu, TV watching has never been more popular. As college students who are also heavy consumers of television, we wanted to determine how other variables affect how many hours of TV someone watches per day. In our analysis, we examined how different independent variables affect overall television consumption.
Prediction: The number of hours someone watches TV per day can be predicted as a function of their age, education level, sex, income, and marital status.
Key Results & Methodology After removing outliers and using backwards elimination, we removed the indicator independent variable “Divorced” and developed a Multiple Linear Regression Model:
y = 4.190 + 0.0223x1 – 0.1280x2 – 0.262x3 – 0.080x4 + 0.494I5 + 0.734I6 + 0.652I7
x1 = Age
x2 = Education
x3 = Sex*
x4 = Income
x5 = Widowed*
x6 = Separated*
x7 = Never Married
*indicates not statistically significant in final model
Mean(y): 2.336 Median(y): 2
F-Stat = 1.04E-9
N = 289
Conclusion The implications from this model can be helpful for entertainment businesses, specifically TV production companies. Our data indicates that those with higher TV watching trends tend to be older, with less education and a lower income. These companies can conclude that they may be more financially successful producing shows that are more simplistic, possibly with lowbrow humor, as opposed to TV shows that are more intellectual (i.e. historical, academic).
KEY DATA SOURCE Data from 2014
Aneesha Bandarpalle, Isha Khanna, Tyler James
A quiet life in a small town next to the Pacific Ocean in Oregon, or a fast-paced modern city life in the Big Apple? Whichever home you choose, the house price of that area is an important factor to consider. Through our model, we intend to provide insight into what to consider when deciding where to buy a house and live.
By Augus Gu Cheng Luo
We select the top 104 most populated cities in the United States and collect data for independent variables in our model.
Uniform Crime Reporting of FBI www.ucr.fbi.gov
Census Bureau www.census.gov
Real Estate Database Company http://www.zillow.com
Key Metrics
R² | 0.921837652 | Extremely close to 1
F-Test P-Value | 5.09741E-48 | Extremely small
Se/ȳ | 0.216215796 | Slightly above 0.2
Standard Error | 34.07 | Quite small
# of Observations | 104 | Larger than 100
Variables Removed | Bachelor Degree Rate, Inventory Measure
MLR Model
Variable | P-value Initial | Coefficients FINAL | P-value FINAL
Intercept | 0.001025558 | -161.5707967 | 8.01357E-05
Violent Crime Rate | 0.045633778 | -0.027906207 | 0.026516332
Population | 3.5212E-05 | -3.46725E-05 | 1.9214E-11
Annual Income | 0.551807448 | 0.001874797 | 0.044533818
Poverty Rate | 0.049005209 | 247.2407915 | 0.020205989
Midwest | 0.003148293 | 48.77295391 | 0.002295322
South | 0.000519678 | 54.21137718 | 0.000427302
West | 1.01267E-07 | 87.33500185 | 9.25996E-08
Inventory Measure | 0.562861427 | REMOVED | REMOVED
Median Rental Price | 1.91496E-23 | 149.0828286 | 2.4806E-24
Median Price Cut | 2.11926E-06 | 0.003862759 | 1.18539E-06
Bachelor Degree Rate | 0.692974232 | REMOVED | REMOVED
Descriptive Statistics
Variable | Mean | Minimum | Maximum
Median Listing Price (y) | 157.5838754 | 36.97964847 | 794.5336203
Violent Crime Rate | 685.0792308 | 49.39 | 1988.63
Annual Income | 26925.88462 | 14759 | 49986
Population | 597429.7404 | 145674 | 8550405
Poverty Rate | 0.201548077 | 0.063 | 0.398
Midwest | 0.173076923 | 0 | 1
South | 0.365384615 | 0 | 1
West | 0.384615385 | 0 | 1
Median Rental Price | 1.103068016 | 0.46441132 | 4.185733513
Median Price Cut | 8384.461538 | 2500 | 70000
Bachelor Degree Rate | 0.311894231 | 0.117 | 0.656
Inventory Measure | 1782.625 | 108 | 16882
Key Scatterplots Income level is the main driver of a city’s house price level. Cities with higher annual income per capita have higher house prices. Poverty rate affects a city’s house price positively. Violent crime lowers a city’s house price. Big cities don’t necessarily mean an outrageously high house price. Houses in the northeast region are relatively cheaper than in other areas as of Dec, 2014.
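A quick consistency check on the final model can be sketched in Python: a least-squares model with an intercept, evaluated at the sample means of its predictors, should return the mean of the dependent variable. The coefficients and means below are the values listed on the poster; the units of the listing price are not stated there, so none are assumed.

```python
# Final-model coefficients paired with sample means, as listed on the poster.
intercept = -161.5707967
coef_and_mean = {
    "Violent Crime Rate": (-0.027906207, 685.0792308),
    "Population": (-3.46725e-05, 597429.7404),
    "Annual Income": (0.001874797, 26925.88462),
    "Poverty Rate": (247.2407915, 0.201548077),
    "Midwest": (48.77295391, 0.173076923),
    "South": (54.21137718, 0.365384615),
    "West": (87.33500185, 0.384615385),
    "Median Rental Price": (149.0828286, 1.103068016),
    "Median Price Cut": (0.003862759, 8384.461538),
}

# Sum of coefficient times mean over all retained predictors, plus the intercept.
prediction_at_means = intercept + sum(c * m for c, m in coef_and_mean.values())
print(round(prediction_at_means, 2))
```

The result lands close to the reported mean median listing price of 157.58, which is what one would expect if the coefficients and means are paired with the right variables.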
What factors affect NBA players’ points per game? MOTIVATION
RESULTS
NBA players are employed based on their ability to do one thing – score points. This may seem like a straightforward task; however, it sparked our curiosity to determine which factors influence a player’s points per game, either positively or negatively. The factors we tested were field goals attempted, free throws attempted, minutes played, total rebounds, turnovers, and age.
HYPOTHESIS: All factors tested will be positively correlated to points per game, with field goals attempted being the most positively correlated.
THE MODEL: Y = -2.4902 + 0.9863(B1) + 0.7197(B2) + 0.00079(B3) - 0.14863(B4) + 0.0653(B5)
The number of points per game is predicted by Field Goals Attempted (B1), Free Throws Attempted (B2), Minutes Played (B3), Turnovers (B4), and Age (B5). The variable that dominated the model was Field Goals Attempted, which proved to be highly positively correlated with points per game, at 0.818.
We found that an NBA player’s points per game is most influenced by field goals and free throws attempted, as shown in the scatterplots below.
REGRESSION COEFFICIENTS
Variable | Coefficients | P-value
Intercept | -2.49 | 1.32E-6
Field Goals Attempted | 0.986 | 4.8E-176
Free Throws Attempted | 0.719 | 1.28E-53
Minutes Played | 0.00079 | 4.36E-20
Age | 0.0653 | 2.25E-05
Turnovers | -0.149 | 0.0725
DESCRIPTIVE STATISTICS
Variable | Mean | Median | Standard Error | Standard Deviation
Field Goals Attempted | 11.694 | 11.5 | 0.138 | 3.425
Free Throws Attempted | 3.113 | 2.8 | .0728 | 1.723
Minutes Played | 1135.346 | 1016 | 35.425 | 838.307
Age | 26.888 | 26 | 0.183 | 3.42
Turnovers | 1.963 | 1.9 | 0.036 | 0.844
TAKEAWAY
DATA SOURCE: All player statistics were taken from http://www.basketball-reference.com/
OUTLIERS:
METHOD: We used backwards elimination to generate the multiple linear regression model with the lowest standard error. Total rebounds was found to be statistically insignificant, and was eliminated through backwards elimination.
The R2 of our model is quite high, at 0.8757. Additionally, standard error is comfortably low, at 1.5443. Standard error divided by average y is below the accepted threshold of 0.2, and F-stat shows that the model is statistically significant, as it is much below the threshold of alpha = 0.05.
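The ratio of standard error to average y can be approximated in Python from figures on the poster. The sketch evaluates the fitted equation at the sample means reported in the descriptive statistics, which approximates average points per game, with the turnover coefficient entered as negative, as in the coefficient table. All inputs are rounded poster values, so the result is approximate.

```python
def predicted_ppg(fga, fta, minutes, turnovers, age):
    """Fitted points per game from the poster's model."""
    return (-2.4902 + 0.9863 * fga + 0.7197 * fta
            + 0.00079 * minutes - 0.14863 * turnovers + 0.0653 * age)


# Sample means from the descriptive statistics approximate average y.
mean_ppg = predicted_ppg(fga=11.694, fta=3.113, minutes=1135.346,
                         turnovers=1.963, age=26.888)
standard_error = 1.5443  # reported on the poster
se_over_ybar = standard_error / mean_ppg
print(round(mean_ppg, 3), round(se_over_ybar, 4))
```

The ratio comes out near 0.113, close to the reported 0.11312 and under the 0.2 threshold.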
R Square: 0.8757
Standard Error: 1.5443
Se/y Average: 0.11312
F Statistic: 4.337 x 10^-248
Observations: 560
After analyzing our original 572 data points, we eliminated data points that were impossible or highly improbable, as well as those influence points pinpointed through analysis, to arrive at 560 points.
CONCLUSION: While there are several factors affecting an NBA player’s points per game, the variable with the most statistical significance supporting its effect on points per game is field goals attempted. Our initial model was not perfect, as it included two statistically insignificant variables, but through careful analysis and backwards elimination, we were able to come up with the optimal model predicting an NBA player’s points per game.
TEAM 30: Lolly Buenaventura, Charles Coccia, & Isaiah Elder
Which Design Factors Influence Pricing of Women’s Career Heels? Introduction and Motivation
Key Findings
Graph
ℇ
: 0.6022
Model and Methodology Hypothesis α
Conclusion α
Data Sources Team 36 Vitoria Gaboardi, Jamie Shen, Jessica Qian
What’s in a Crime?
A Golden (State) Regression for California’s Theft Rate
Key Results
Motivations
Methodology
Theft is a timeless crime. At every point throughout history, and in any country across the globe, theft has been a problem. While there have already been numerous published works on predicting the leading causes of crime in a region, our team wanted to focus solely on theft. Our motivation is to find the independent variables that have the most statistically significant correlation with the dependent variable, theft rate. By alerting law enforcement officials of the related independent variables, our model allows them to most effectively cut theft rates in these cities. In addition, home and car security companies could benefit from our data by targeting customers in cities that are likely to have high theft rates, and insurance companies can more accurately model risk.
Our team developed a first order model, using the Backwards Elimination Method in SPSS to run our Multiple Linear Regression. Our dependent variable was theft rate, with the 4 independent variables being:
• Poverty Rate
• Aggravated Assault Rate
• Percent with high school education
• Murder and non-negligent manslaughter rate
Data Sources
Conclusion
Model Y= -726.62 + 5462.81x1 + 2.289x2 + 2605.95x3 + 47.32x4 Regression Output:
We used a combination of the US Census Bureau and the FBI’s Uniform Crime Reporting Database to gather all of the data on the top 100 most populous cities (as of 2010) in the state of California.
We found that poverty rate, aggravated assault rate, and murder/non-negligent manslaughter rate all had a positive correlation with total theft rate, as was predicted. While we predicted that there would be a negative correlation between high school education rate and total theft rate, our model showed otherwise. We suspect this is due to the collinearity between high school education rate and poverty rate (R=-0.657).
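One rough way to gauge the collinearity mentioned here is a variance inflation factor (VIF). The Python sketch below uses only the reported pairwise correlation, which is a simplifying assumption: a full VIF would regress high school education rate on all of the other predictors, not on poverty rate alone.

```python
def pairwise_vif(r: float) -> float:
    """VIF implied by a single pairwise correlation r between two predictors."""
    return 1.0 / (1.0 - r ** 2)


r_education_poverty = -0.657  # correlation reported on the poster
vif = pairwise_vif(r_education_poverty)
print(round(vif, 3))
```

The sign of r does not matter here, since only its square enters the formula.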
Observations: 99
Significance F: 5.45862E-17
R2: 0.58014213
Standard Error/ Y-average: 0.206161601
This model is statistically significant, with a Significance F of 5.45862E-17, which is far below our alpha of 0.05. It also has a standard error divided by average y of 0.206 and an R2 of 0.580, which is not ideal but still good.
Sabrina Alexandre
Our data suggests that lawmakers should focus their efforts on:
• Implementing programs to lower violent crime: murder and assault are positively correlated with theft and occur less often, so reducing these violent crimes may be a more effective strategy for Californian police departments than focusing on theft itself
• Working within the cities they serve to tackle poverty, since it is one of the key factors associated with a higher theft rate
Our data suggests that home security and auto insurance companies should focus their efforts on selling to cities with higher poverty rates, aggravated assault rates, and murder and non-negligent manslaughter rates in order to target a larger customer pool, and insurance companies should raise rates for these cities.
Jeffrey Bail
Geoffrey Mendoza