Managerial Stats II Poster Session


Washington University in St. Louis OLIN BUSINESS SCHOOL Eli Snir Lecturer in Management December 2013

Business at the intersection of art and science is epitomized in the Managerial Statistics II course. As we progress through the course, we realize that statistics is not only a set of mathematical concepts; it is also subject to questions of judgment and interpretation that are commonplace in the social sciences. No wonder that it is a ubiquitous tool among university faculty.

The capstone of the Managerial Statistics II course is developing a term paper. Students choose a topic to study and demonstrate how regression analysis applies. The choice of topic is unconstrained. One of the goals of the course is to demonstrate the broad applicability of statistics, and specifically of regression analysis, to any area of interest. And the term papers reflect that. They are as broad and as encompassing as the university, drawing on nearly every discipline taught here.

As the course is at the intersection of art and science, so are the term papers. Students have two deliverables in the project. One is a rigorous analysis of the topic, demonstrating their knowledge of the tools studied in the course. The second is a poster and presentation of their analysis to a broad audience. This book is a product of the latter. It is a collection of posters from the poster session in the Fall 2013 QBA121 course.

The audience for the poster session is diverse, including undergraduate students, MBA students, advisors, faculty, and deans. Students are expected to explain their analysis to various constituents: some want a qualitative understanding, while others challenge students on statistical methodology. Invariably, students in the course address all questions comprehensively and confidently.

On the following pages of this book we have collected a sample of the posters from the course. Hopefully, they convey the breadth of student interest and the depth of student learning. As you will see, statistics does apply to every discipline and every one of us.

Thanks for your interest,

Eli Snir Washington University in St. Louis, Olin Business School • Campus Box 1133 One Brookings Drive, St. Louis, Missouri 63130-4899 Tel (314) 935-6090 • Fax: (314) 935-6359 • snir@wustl.edu


Managerial Statistics II

What Factors Influence Runs Scored in MLB?

Note: Do not copy these examples. See

Background

Carrie Ross, Mason Meiners, & Kathy Chang

Major League Baseball (MLB) has a significant influence on North American culture. Thirty teams across the country are separated into two leagues with three divisions per league, and each team plays 162 regular-season games from April to October in hopes of advancing to the World Series.

Descriptive Statistics α=0.05

Methodology

Using a random sample of 20 games per team from the 2012 regular season for six randomly selected teams, one from each league and each division, this analysis will determine the significance of the impact of fan attendance, home field advantage, temperature, elevation and salary differences on number of runs.

Predicted average number of runs per game: Number of runs = -0.21643 + 0.06331*Temperature + 0.00053*Elevation
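As a rough illustration of how the fitted equation is applied, the sketch below evaluates it for a single game. The function name and the example inputs are hypothetical, and the units are assumptions: the poster's chart labels elevation in feet, but the temperature unit is not stated (degrees Fahrenheit is assumed here).

```python
def predicted_runs(temperature: float, elevation: float) -> float:
    """Evaluate the poster's final model for number of runs per game.

    Assumed units: temperature in degrees Fahrenheit, elevation in feet.
    """
    return -0.21643 + 0.06331 * temperature + 0.00053 * elevation


# Hypothetical game: 75 degrees at 500 feet of elevation.
example = predicted_runs(75, 500)
print(round(example, 3))  # 4.797
```

Given the low R-squared reported for this model, such a prediction is only a loose estimate for any single game.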

ȳ = 4.975    S_y = 3.451

Regression Statistics
R                              0.30866
R Square                       0.09527
Adjusted R Square              0.0798
S                              3.31009
Total number of observations   120

Sε = 3.31    R² = 0.09527    Sε/ȳ = 0.665

Motivation

[Figure: Elevation (feet) vs. Number of runs, scatter plot with trend line y = 0.0005x + 4.569, R² = 0.04503]

F-significance = 0.00286
P-value (temperature) = 0.0121
P-value (elevation) = 0.01169

The final model consists of only two statistically significant variables, temperature and elevation, thus excluding the other three independent variables. Although the regression has an F-significance that is below α=0.05, this model is not completely trustworthy because the Standard Error divided by the average of y is 0.665, greater than the benchmark of 0.2. Additionally, there is a very low R-squared of 0.095.
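The benchmark check described above is simple arithmetic on two numbers from the regression output. A minimal sketch, where the function name and the `BENCHMARK` constant are illustrative labels rather than anything named on the poster:

```python
BENCHMARK = 0.2  # rule-of-thumb cutoff quoted on the poster


def relative_standard_error(standard_error: float, mean_y: float) -> float:
    """Standard error of the regression divided by the average of y."""
    return standard_error / mean_y


# Values reported on the poster: S = 3.31009, average runs per game = 4.975.
ratio = relative_standard_error(3.31009, 4.975)
print(round(ratio, 3), ratio <= BENCHMARK)  # 0.665 False
```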

Regression statistics for the final model.

The number of runs scored tends to increase fan excitement about their team and the game in general. The goal of this analysis is to investigate a few internal and external factors that influence the average number of runs per game.

Key Results

Regression Analysis


The original regression results show elevation to be the only significant independent variable with a p-value of .027 and an R² of .112.

Additional Information

We can further expand this research by including additional independent variables, such as total foul area, distance to outfield walls, day versus night games, and impact of precipitation.

Implications

Using this model, it is difficult to predict the number of runs per game in MLB using the factors previously mentioned. More information is needed to improve the accuracy of the model.

Key Data Sources

Statistics: espn.go.com/mlb/statistics; mlb.mlb.com/stats/; cbssports.com/mlb
Related Articles: www.iwu.edu/economics/PPE13/houser.pdf; web.bus.ucf.edu/faculty/rhofler/


William Pudvah Evan Reuben Allison Rosengard

What’s driving car prices?

Motivation

Methodology

Results

Over the past 8 years, the average MSRP (Manufacturer’s Suggested Retail Price) for a new car has increased dramatically, from just above $26,000 to $31,252. A $5000 increase isn’t one that can be explained away by inflation. So, why are car prices rising so fast? What factors are affecting MSRP? Knowing the answer to this question can be crucial for car manufacturers who want to sell their products at higher prices in an efficient fashion, by increasing the variables that matter and ignoring the ones that don’t.

We ran a Multiple Linear Regression looking at the relationship of seven different factors on MSRP, of which the six below were significant (p<.05). MPG (hwy) was non-significant.

[Diagram: six significant factors pointing to MSRP: Cargo room, 1-year insurance, 3-year service cost, Horsepower, MPG (city), Resale value]

We created indicator variables for each make and used Kia as the base brand, because they have the lowest MSRP of all makes. After excluding incomplete entries, there were 775 observations total.

We found that at least the six significant factors in our model correlate independently with MSRP. The histogram below shows that our standardized residuals follow a normal distribution (frequency over regression standard residual). Given these results and the aforementioned significance of the model, we can be comfortable using it to predict future values.

Regression equation

Data

Kiplinger 2013 New Car Rankings provides a wealth of data for every make and model of car on the market, including MSRP and the variables used in our model.

MSRP = -37600.3503 + 1.5406 * Service + 30.2895 * Insurance - 222.1683 * Resale + 77.9865 * HP + 427.8388 * Cargo Room + 298.0094 * MPG (City) + 78.4195 * MPG (Hwy) + 1557.4110

This model is statistically significant with an R² value of 0.926, SE/ȳ of 0.18 (below our threshold of 0.2), and 6 of 7 main variables significant at a 95% confidence level.

Conclusion We discovered six variables that have an effect on MSRP. Car manufacturers can indeed benefit from this model. Because they can reliably predict MSRP, they can efficiently decide how to manipulate variables like cargo room and horsepower to maximize profits.


Analyzing the Relationship Between MLB Salaries and Performance Josh Cogan, Spencer Neal, Ian Kelso

Motivation:

Player salaries in Major League Baseball are, on average, the highest in professional sports. We were curious to see whether the players being paid astronomical salaries are actually earning them with their performance, or if there were other, off-the-field factors that determine their pay grade.

Data: For our data, we randomly selected 175 players who played in the MLB in the year 2013. We did not include pitchers, and selected players had to have appeared in at least 80 games. Our data sources were baseballreference.com and MLB.com. Both are widely regarded as valid sources and are used frequently. We selected 8 variables, 7 relating to performance, and 1 categorical variable for our regression model. The variables are listed in the descriptive statistics below.

Our Final Model:

$\hat{S} = B_0 + \sum_{i=1}^{5} B_i X_i$

Coefficient      Type          Value      P-Value
Intercept        n/a           11.14293   3.01E-30
AVG              number        3.993243   0.033551
Years in MLB     number        0.066013   0.003768
Fielding Pct     number        0.74976    0.301535
Post-Arbitrage   categorical   1.600388   1.78E-17
RBI              number        0.009914   0.000471

Variables removed: Range factor, HR, OPS
Variables considered but not added: position, team, slugging

F-Test: 1.07E-42
Note: Ŝ is the natural log of the player’s estimated salary

Descriptive Statistics

Variable         Mean        Standard Error   Standard Deviation   Sample Variance   Kurtosis    Minimum    Maximum
ln(salary)       14.64436    0.097047         1.283815             1.648181          -1.46561    13.08154   16.951
AVG              0.257       0.002            0.033                0.001             -0.13       0.179      0.348
Years in MLB     4.433066    0.280079         3.705093             13.72772          0.249425    0          15.059
Post-Arbitrage   0.5828571   0.0373808        0.4945018            0.244532          -1.907015   0          1
RBI              51.81       1.672            22.12                489.5             0.358       11         137
Fielding Pct     0.97935     0.0057           0.07543              0.00569           166.014     0          1
Range Factor     3.7766      0.171            2.2627               5.1198            -0.399      0          9.31
HR               12.8        0.62             8.21                 67.4              0.48        0          44
OPS              0.728       0.007            0.098                0.01              0.833       0.479      1.078

Key Observations

Analysis and Key Results: Through backwards elimination, we arrived at a final model with five independent variables with relatively small p-values. While our model was statistically significant, it was not as accurate as we had hoped. The standard error of the model, adjusted for the transformation of the dependent variable, remained well over $3 million, with an R Squared of only around .50. According to the ln(Salary) histogram, the data were skewed to the right, while the Studentized Residuals histogram shows an approximately normal distribution.
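Because the dependent variable is ln(salary), a prediction has to be exponentiated to get back to dollars. The sketch below uses the coefficients and sample means reported on the poster; the function names and the idea of an "average player" built from the sample means are illustrative assumptions, not part of the original analysis.

```python
import math

# Coefficients from the poster's final model (dependent variable: ln(salary)).
INTERCEPT = 11.14293
COEFFICIENTS = {
    "AVG": 3.993243,
    "Years in MLB": 0.066013,
    "Fielding Pct": 0.74976,
    "Post-Arbitrage": 1.600388,  # indicator variable: 1 or 0
    "RBI": 0.009914,
}


def predict_ln_salary(player: dict) -> float:
    """Linear prediction on the log scale."""
    return INTERCEPT + sum(COEFFICIENTS[name] * player[name] for name in COEFFICIENTS)


def predict_salary(player: dict) -> float:
    """Undo the log transformation to get a dollar figure."""
    return math.exp(predict_ln_salary(player))


# Hypothetical "average player" built from the poster's sample means.
average_player = {
    "AVG": 0.257,
    "Years in MLB": 4.433066,
    "Fielding Pct": 0.97935,
    "Post-Arbitrage": 0.5828571,
    "RBI": 51.81,
}

print(round(predict_ln_salary(average_player), 2))  # close to the sample mean of ln(salary)
print(round(math.exp(COEFFICIENTS["Post-Arbitrage"]), 2))  # multiplicative effect of the indicator
```

On the log scale a coefficient acts multiplicatively: the Post-Arbitrage coefficient of about 1.6 corresponds to a predicted salary roughly five times higher, all else equal.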

Conclusion: Our results showed that the overall regression was statistically significant; however, we believe the model does not show a strong relationship between individual salary and on-field performance. This implies that on-field performance is not directly correlated with a player’s salary. Because of this, MLB teams may be paying players based upon factors such as popularity and merchandise promotion. Although an individual’s performance isn’t solely based upon factors such as these, we must take into consideration past years’ performances. MLB managers and owners could benefit from this analysis by reassessing the importance of statistical on-field performance. This data also suggests that lower salary teams are not put at an automatic disadvantage, as seen by past teams such as the 2002 Oakland Athletics, whose collective payroll was one of the lowest in the league.


3LAU’s Post Impressions

3LAU
Musician/Band
www.3LAU.com

Analysts: Madison Blau, Jonathan Cohen & Paulina Gordon
Professor Snir
Managerial Statistics II
December 4, 2013

Motivation

Nowadays, social media marketing can be critical to a business’s success. Facebook documents every post and every click, valuable data that provides businesses with information about their consumers and how well the business reaches and interacts with them. More specifically, post impressions are a key way to evaluate not only how many people a business reaches, but also how frequently users see the posts. Certain variables might contribute to determining impressions. If such variables can be identified and are significant for 3LAU, an up-and-coming music producer, he can effectively increase the number of impressions his posts receive.

Data Sources

“Facebook Page Insights provide measurements on your page's performance. Find anonymized demographic data about your audience, and see how people are discovering and responding to your posts.” All data was exported from 3LAU’s Facebook Page Insights Admin Panel, the most accurate and reliable source there is for evaluating 3LAU’s Facebook Page Performance.

Key Results

The vast variability in the number of impressions is attributed to 3LAU’s nonlinear growth rate over time. Thus, taking the natural log of a post’s impressions was needed to create a more accurate model. Shares are significantly more influential in increasing the number of impressions a post receives than likes and comments.

Excluded Variables

The time of day a post is published, which could potentially affect the number of impressions a post receives, was not included in the model. The actual content of the posts, other than the type of post, would also probably impact the number of impressions the posts receive. This was too difficult to measure to include in the model.

Descriptive Statistics

Dependent:
Average Post Impressions = 80,642
Average Post Ln(Impressions) = 10.65

Independent:
Video Posts: 18%
Photo Posts: 34%
Links/Music Posts: 14.5%
Status Update Posts: 33.5%
Average Post Comments: 48
Average Post Likes: 350
Average Post Shares: 24

These statistics are representative of the beginning of 3LAU’s career through October 2013. As 3LAU grows over time, the engagement with his posts increases accordingly. Thus, in recent times, the average numbers of comments, likes, and shares are probably much greater than the averages computed over his whole career to date. As a result, impressions would also be much greater in recent times.

Regression Coefficients

Ln(Impressions) = 10.102 + 0.00106*Comments + 0.00107*Likes + 0.00158*Shares + 0.357*Video + 0.253*Photo - 0.439*Link/Music

Video posts and shares generate the most impressions!

Quality

R^2 = 0.66
Standard Error = 0.92
Standard Error/Y-bar = 0.087
F-significance = 7.65E-97
Comment P-value = 1.27E-05
Like P-value = 2.62E-48
Share P-value = 0.004
Video P-value = 0.00025
Photo P-value = 0.0023
Link/Music P-value = 2.15E-05
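The regression above is fitted on Ln(Impressions), so the post-type coefficients act as multipliers once the log is undone. A minimal sketch under two assumptions: status updates are treated as the base post type (they are the only type without a coefficient in the equation), and the function and dictionary names are hypothetical.

```python
import math

# Post-type coefficients from the poster; "status" is assumed to be the base category.
POST_TYPE_EFFECT = {"status": 0.0, "video": 0.357, "photo": 0.253, "link_music": -0.439}


def ln_impressions(comments: float, likes: float, shares: float, post_type: str = "status") -> float:
    """Predicted natural log of impressions for one post."""
    return (10.102 + 0.00106 * comments + 0.00107 * likes
            + 0.00158 * shares + POST_TYPE_EFFECT[post_type])


def impressions(comments: float, likes: float, shares: float, post_type: str = "status") -> float:
    """Predicted impressions, obtained by exponentiating the log prediction."""
    return math.exp(ln_impressions(comments, likes, shares, post_type))


# A post with the poster's average engagement: 48 comments, 350 likes, 24 shares.
base = impressions(48, 350, 24, "status")
video = impressions(48, 350, 24, "video")
print(round(video / base, 2))  # 1.43, i.e. exp(0.357)
```

Read this way, a video post is predicted to receive about 43% more impressions than a status update with the same engagement.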


Cardinals’ Attendance

What variables are the strongest predictors of home game attendance?

Project Motivation

By Chandler Weir and Ellen Kaushansky

We want to find out what factors influence attendance for Cardinals’ home games. Specifically, are the factors:
o Performance based
o Non-performance based (out of control factors such as weather)
o Non-performance based (controlled factors such as game-day promotions)

Key Data Sources

Descriptive Statistics
o Sample average of attendance: ȳ = 39,235
o Division Rivalry mean = .4074
   o This means 41% of home games are against division opponents
o Item Giveaway mean = .37
   o This means 37% of home games have item giveaways
o Thursday mean = .12
   o This means 12% of games were played on Thursday
o Item Giveaway Thursday shows that around 1% of all games played are on Thursday where an item is given away

Key Results

We thought that factors related to the Cardinals’ performance would be the most influential to game attendance, but the opposite turned out to be true. Factors like Day of the Week, Item Giveaway, and Rain have much more of an impact on game attendance for the Cardinals. Many of the influential factors, such as weather, are out of the Cardinals’ control, but the most important factor in the Cardinals’ control is when they do item giveaways.

Variables

Indicator:
Time of Game*
Division Rivalry = x1
Item Giveaway = x2
Rain = x3
Other Sports Events*

Continuous Variables:
Temperature = x4
Opponent’s Record = x5
Cardinals’ Record*
GB Division Leader = x6
Streak*

Categorical: Day of the Week:
Sunday - Base
Monday*
Tuesday = x7
Wednesday = x8
Thursday = x9
Friday*
Saturday*

Categorical cont.: Starting Pitcher:
Lohse - Base
Carpenter*
Garcia*
McClellan = x10
Lynn = x11
Jackson*
Wainwright = x12
Kelly = x13
Westbrook*
Other = x14

Item-“Day of Week”:
Item Sunday - Base
Item Monday*
Item Tuesday*
Item Wednesday*
Item Thursday = x15
Item Friday*
Item Saturday = x16

* Indicates variables removed from the final model

Recommendations

The net effect of the variables Thursday, ItemThurs, and Item Giveaway is positive, so it is possible to overcome the negative effect of having a game on Thursday by using Item Giveaways. The Cardinals should advertise on their website to highlight Thursday Giveaways as an incentive for fans to come to the game. For games that occur Tuesday to Thursday, the Cardinals could have special discounts on food or apparel within the stadium in order to incentivize fans to come to these weekday games.

Equation:

ŷ = 38633.436 + 1037.839(x1) + 1529.657(x2) + 1029.184(x3) + 32.626(x4) + 2603.494(x5) - 164.024(x6) - 2100.575(x7) - 3180.92(x8) - 2318.284(x9) - 1368.806(x10) + 1617.163(x11) + 939.340(x12) + 1028.098(x13) + 4323.809(x14) + 4289.073(x15) + 1128.467(x16)
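The claim that giveaways can offset the Thursday penalty follows from adding three coefficients from the equation. A short sketch of that arithmetic (the constant names are illustrative; the values are the poster's coefficients for x9, x2, and x15):

```python
THURSDAY = -2318.284       # x9: game played on a Thursday
ITEM_GIVEAWAY = 1529.657   # x2: any item giveaway
ITEM_THURSDAY = 4289.073   # x15: item giveaway that falls on a Thursday


def thursday_giveaway_net_effect() -> float:
    """Predicted attendance change for a Thursday giveaway game,
    relative to a Sunday game with no giveaway (the base categories)."""
    return THURSDAY + ITEM_GIVEAWAY + ITEM_THURSDAY


net_effect = thursday_giveaway_net_effect()
print(round(net_effect, 3))  # 3500.446
```

So, holding the other variables fixed, the model predicts roughly 3,500 more fans for a Thursday giveaway game than for the base case.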

Fit of Model
o P-value of 0 < Alpha of .05
o R² = 0.445
o Sε/ȳ = 0.07 < cutoff value of .2
With these metrics satisfied we are comfortable using this model.


What variables impact average SAT scores per state?

Julia Brosseau, Samantha Blum, Ryan Meltzer

We are not the first group to tackle the subject of average SAT scores. One study done at University of Pennsylvania called “An Analysis of Factors Influencing SAT Scores” written by Alok Sheth and Samuel Lehr found very low correlations between students GPA in high schools, club involvement, sports, etc. on their SAT scores. Even tutors found no correlation between better scores and tutoring. However, they wished they had done more research on the environments people lived in. Our group wanted to test environmental factors to see if there are any correlations with SAT scores.

We decided to track the average SAT scores per state from two different years, 2008 and 2012 (since they are election years), over the following seven variables:

Percent Taking SAT
Ratio of Student to Teacher
Divorce Rate
Gross Domestic Product
Obesity Rates
Percent Voted for Obama
Average Teacher Salary

The variables highlighted in green were statistically significant and those highlighted in blue were not significant, but left in the model.

Since our data appeared to be linear, we used a linear regression model to portray the data. By the use of backwards elimination, we eliminated 2 of our 7 variables: percent voted for Obama and average teacher salary (highlighted in gray). The resulting regression equation is as follows:

X1 = Gross Domestic Product
X2 = Obesity Rate
X3 = Divorce Rate
X4 = Ratio of Student to Teacher
X5 = Percent of Students Taking the SAT

B1 = .001
B2 = 3.07
B3 = 19.55
B4 = 6.89
B5 = 3.47

Our significance F is 1.48 × 10^-26, which is well below alpha of 0.05, making the model statistically significant. Our R² is also relatively high at 0.747, but there are other factors that affect the average SAT score for a state. Finally, our SE/ȳ is 0.039, well below our threshold of 0.2.

Takeaways
- In order to boost average SAT scores, focus on reducing Divorce Rate and Ratio of Students to Teacher
- Although GDP and Obesity Rate are not statistically significant, the coefficients are logical
- As more people in the state take the SAT, it is expected that the average will fall since there is more variation
- There could be other variables that affect SAT scores that
- We had originally not considered the percent of students taking SAT, which completely altered the results of the regression model

While we had to use many different sources to find our data, we used the most trustworthy sites we could find. These consisted mostly of government sites and other reliable sites such as the US Census or Collegeboard for SAT information.


Managerial Statistics II

Quality of Regression Analysis

Motivation for Analysis

Our goal was to use readily available statistics to predict the quality and success of an NFL quarterback. Many sources were focused on intangible or immeasurable qualities, such as ability to make “wow” throws. Through the use of quantitative variables, like career completion percentage, and indicator variables, such as college conference, we built a model that used career winning percentage as the dependent variable. Our final model used data from 96 NFL quarterbacks, each having started at least 16 games, and included seven independent variables.

[Figure: Winning % as function of Average Touchdown Passes per Game; trend line y = 0.1844x + 0.2991, R² = 0.3618]

[Figure: Winning % as function of Average Interceptions per Game; trend line y = -0.0881x + 0.5749, R² = 0.0237]

Winning Percentage
Mean                 .5000
Standard Deviation   .1098
Minimum              .1904 (Ryan Leaf)
Maximum              .7771 (Tom Brady)

Model and Analysis

After removing outliers and completing backwards elimination, we had a model with three significant and three insignificant variables.

Coefficients
Intercept                            .17360029
Games Started                        .00044208
Average Touchdown Passes per Game    .188442314
Average Interceptions per Game       -.14488813
Career Completion Percentage         .51383649
Average Passing Yards per Game       -.00052512
Big East                             -.05233098

Eliminated Variables

Through backwards elimination, we removed in order: ACC, SEC, LN(Draft Pick), Height, Big 12, OTHER (Conferences).

The R² is 0.54, which is acceptable. The model is statistically significant (Significance F = 1.9 × 10^-12). Standard Error divided by ȳ is 0.15, less than 0.2. Our model can be used to predict winning percentage of an NFL quarterback.
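To show what using the model for prediction might look like, the sketch below plugs a made-up quarterback into the coefficients listed on the poster. Everything about the example player is hypothetical, and it is an assumption that career completion percentage enters the model as a fraction (0.60 rather than 60).

```python
def predicted_winning_pct(games_started: float, td_per_game: float, int_per_game: float,
                          completion_pct: float, pass_yards_per_game: float,
                          big_east: int = 0) -> float:
    """Evaluate the poster's final model for career winning percentage.

    Assumption: completion_pct is a fraction (e.g. 0.60), big_east is 0 or 1.
    """
    return (0.17360029
            + 0.00044208 * games_started
            + 0.188442314 * td_per_game
            - 0.14488813 * int_per_game
            + 0.51383649 * completion_pct
            - 0.00052512 * pass_yards_per_game
            - 0.05233098 * big_east)


# Hypothetical quarterback: 100 starts, 1.5 TD and 1.0 INT per game,
# 60% completions, 220 passing yards per game, not from the Big East.
example = predicted_winning_pct(100, 1.5, 1.0, 0.60, 220)
print(round(example, 3))  # 0.548
```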

Key Results

We found that average touchdown passes per game, average interceptions per game, and games started are the primary determinants of winning percentage. Our greatest dissatisfaction was in our model’s inability to sufficiently account for the quality of the entire team, which works together to win. However, many different (and intangible) factors contribute to the success of a quarterback as measured by winning percentage. These results can be used by general managers and coaches to:
• Scout players for the NFL Draft
• Target free agents
• Conduct contract negotiations
• Make in-game, situational decisions

Key Data Source

www.pro-football-reference.com

Alex Goldberg & Hannah Towle


What causes higher homicide rate in a nation? by Seung Han Bae / Bomye Weon

• Motivation
- “State IQ estimates to be significantly and negatively associated with violent crime”
- “Deleterious labor market conditions within a locale are closely associated with its high crime rates”

• Data Sources
- UNODC
- IMF Data and Statistics
- CIA The World Factbook

• Variables
- $y$ – Homicide Rate
- $x_G$ – GDP per Capita
- $x_P$ – Population Density
- $x_U$ – Urban Population Ratio
- $x_B$ – Birth Rate
- $x_L$ – Literacy Rate
- $I_A$ – Arid Climate
- $I_M$ – Moderate Climate
- $I_C$ – Continental Climate

• Descriptive Statistics
- 1st: $\bar{y} = 10.526$
- Final: $\bar{y} = 6.974$

• 1st Regression Model
- $y = -8.156 - 7.68 \times 10^{-5} x_G - 0.002 x_P + 0.068 x_U + 0.508 x_B + 0.133 x_L - 10.337 I_A - 9.957 I_M - 9.437 I_C$
- F-test: $2.974 \times 10^{-10}$
- $R^2 = 0.327$
- $S_\varepsilon / \bar{y} = 1.039$
- Some insignificant coefficients → Outlier identification & backward elimination

• Final Regression Model
- $y = -16.155 + 0.661 x_B + 0.135 x_L - 5.602 I_A - 2.112 I_M$
- F-test: $1.046 \times 10^{-24}$
- $R^2 = 0.582$
- $S_\varepsilon / \bar{y} = 0.683$
- All coefficients are significant

• Key Results
- Birth Rate: [Figure: Homicide Rate (per 100,000 Population) vs. Birth Rate (per 1,000 Population)]
- Literacy Rate: [Figure: Homicide Rate (per 100,000 Population) vs. Literacy Rate (% of Population)]
- Climate: Arid and moderate climates have lower homicide rate

• Possible Improvements
- Use of second order models, interaction models, and logarithmic transformations
- Other possible variables: Unemployment rate, Political situation, Continent, Religion, Criminal law (capital punishment), etc.
- Further research of previous analyses
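The final regression model for this poster can be evaluated directly from its four coefficients. The sketch below is illustrative only: the function name and both example countries are hypothetical, and the climate indicators are assumed to be 0/1 values with all other climates as the base category.

```python
def predicted_homicide_rate(birth_rate: float, literacy_rate: float,
                            arid: int = 0, moderate: int = 0) -> float:
    """Evaluate the poster's final model (homicides per 100,000 population).

    birth_rate is per 1,000 population, literacy_rate is a percentage,
    arid and moderate are 0/1 climate indicators (assumed encoding).
    """
    return (-16.155 + 0.661 * birth_rate + 0.135 * literacy_rate
            - 5.602 * arid - 2.112 * moderate)


# Hypothetical country: birth rate 20, literacy 90%, moderate climate.
print(round(predicted_homicide_rate(20, 90, moderate=1), 3))  # 7.103

# Hypothetical country with a low birth rate and low literacy: the linear
# model returns a negative rate, which is not a meaningful homicide rate.
print(predicted_homicide_rate(10, 50) < 0)  # True
```

The second case shows one limit of a purely linear fit, which is consistent with the poster's suggestion to try logarithmic transformations.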


Population Growth Across Countries By: Helen Head, Leah Kraft, and Alaina Rolfes


Problem

In many countries around the world, rapid population growth is a serious concern. Population growth can lead to problems such as food shortages, lack of adequate housing, poverty and environmental destruction. In order to implement the best solution to stabilize growth rates, one must understand the fundamental causes of the problem. We collected data from 100 countries in 2010 on the following independent variables:
- Health Expenditure
- Foreign Direct Investments
- GDP per Capita
- Net National Income
- Female Employment Rates
- Urban Population
- Life Expectancy at Birth
- Fertility Rate
- Death Rate
- Tertiary School Enrollment

Variables in green text are those found in the final regression model.

Data Source

We used a linear model and implemented backwards elimination to determine our final model:

Outliers Oman and Lithuania both had abnormally high and low growth rates, respectively.

X1= GDP per Capita X2= Life Expectancy at Birth X3= Fertility Rate X4= Death Rate

Key Measurements
R²: 0.8871
SE/ȳ: 0.3011
F-test significance: 3.705E-43

The high R squared and the very low F-test significance demonstrate the strength of our model. While SE/ȳ is slightly above the threshold of .2, the model still remains strong. In addition to these measurements, all the variables in the final regression model are statistically significant.


Conclusion

Our model indicates that the main drivers of population growth are GDP per capita, life expectancy at birth, fertility rate and death rate. While GDP per Capita and Fertility Rate increase population growth, Death Rate and Life Expectancy at Birth decrease it. Therefore, a country aiming to curb population growth should focus on these four factors to help it smoothly achieve its goals.

Image source: www.nationsonline.org

[Figure: Country Population Growth Rate, bar chart of population growth for each country in the sample]

These outliers had high Cook’s D values and caused the model to have large residuals and a high standard error. Removing these countries from the model led to a smaller standard error, and thus a better model.


Watch Instantly: The Rise of Online Streaming TV

Y = .24 - .016(B1) + .023(B2) + .020(B3) + .004(B4) + .018(B5) + .016(B6)

The percentage of TV watched online is predicted by Image Quality (B1), Cost (B2), Online Accessibility (B3), Offline Accessibility (B4), Commercial Presence (B5), Time Convenience (B6). No one variable dominates the model. Each extra ranking point that a survey respondent puts into a category alters the predicted TV consumption rate. An interesting relationship to note is the negative relationship between Image Quality and the dependent variable. This implies that the Image Quality of TV online is of a lower quality.
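As an illustration of how this equation turns survey scores into a predicted share of TV watched online, consider the sketch below. The function name and the example respondent are hypothetical, and it is an assumption that each score is a rating on roughly a 0 to 10 scale.

```python
def predicted_online_share(image_quality: float, cost: float, online_access: float,
                           offline_access: float, commercials: float,
                           time_convenience: float) -> float:
    """Predicted fraction of TV watched online (0.5 means 50%)."""
    return (0.24 - 0.016 * image_quality + 0.023 * cost + 0.020 * online_access
            + 0.004 * offline_access + 0.018 * commercials + 0.016 * time_convenience)


# Hypothetical respondent who rates every factor 5.
print(round(predicted_online_share(5, 5, 5, 5, 5, 5), 3))  # 0.565

# One extra point on the Cost score raises the prediction by 0.023,
# i.e. 2.3 percentage points.
delta = predicted_online_share(5, 6, 5, 5, 5, 5) - predicted_online_share(5, 5, 5, 5, 5, 5)
print(round(delta, 3))  # 0.023
```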

Hypothesis:

The model fits the data very well. The R² of the model is .72. Furthermore, the Significance F is 3.24E-24, which is significantly lower than alpha = .05.

Television has become an integral part of our daily lives. We are now able to get news and entertainment from one source. But over time, people are shifting away from broadcast television to on-demand viewing. For this reason we conducted a survey on the following factors, to see what is most desired in television watching:

By Anna Eisenberg & Aaron Pang

[Figure: Percentage of TV Watched Online vs. Cost Score]

We hypothesize that the ability to skip commercials is what drives the shift over to online streaming.

We conducted a 15-question survey of the Wash U student body. We collected 100 results. Respondents were 52% male and 48% female, and the average age was 20 years old.

[Figure: Respondent demographics: 19 – 21 years old; Female 48%, Male 52%]

We used Backwards Elimination to run a Multiple Linear Regression on the survey data that we collected. We eliminated the Gender variable because it was shown to be insignificant. Of the 7 variables we introduced in the survey, we kept 6 of them. There were no outliers that had significant influence or leverage on the regression model.

[Figure: Average Rating of Importance of Factors, bar chart for Image Quality Score, Cost Score, Online Accessibility Score, Cable Accessibility Score, Commercial Score, and Time Convenience Score; bar values as extracted: 8.21, 6.65, 5.61, 5.59, 6.55, 5.98]

Through our MLR model analysis, the strongest indicators of a shift towards online TV consumption are costs and accessibility. These results contradict our hypothesis but still reveal important insights about the trend. This model emphasizes the price sensitivity of young college students. Due to the complicated and expensive nature of buying a cable plan, college students default to consuming their TV online. Furthermore, this model shows that there is a wide variety of factors that contribute to increased online consumption, and that no one feature of online consumption drives the shift towards online TV.


HOW DOES MY COLLEGE PAY?

Research on relationships between multiple factors of U.S. four-year colleges and salaries of college graduates

MOTIVATION

Every year, as the eagerly anticipated U.S. rankings come out, prospective students and their parents are eager to find the ideal college that can bring its graduates social and economic well-being, for which a decent salary is an important criterion.

DATA SOURCE

Dependent variable: Payscale.com
Independent variables: National Center for Education, official college websites, U.S. News Ranking website, and Forbes College List website.

MODELING

We used multiple linear regression with backward elimination, starting with fourteen independent variables. The final model retains ten:

Variable                          Coefficient   P-value
Intercept                         -26669.8      0.202928
U.S. News Index                   215.7342      0.027835
Intl student (percent)            174.649       0.103186
Acceptance Rate                   11946.64      0.015512
Average freshman retention rate   44237.13      0.136368
6-year graduation rate            -23482.5      0.12099
Midwest                           -2156.1       0.054749
South Atlantic                    -1515.01      0.095729
Student/faculty ratio             277.2803      0.087621
Private                           2056.919      0.146787
SAT average                       23.60307      0.01586

Observations Removed (Outliers and Influence): University of Chicago, MIT, CalTech, Colorado School of Mines, Stevens Institute of Technology

Variables Removed: Economic Diversity, Tuition and Fees, Percentage Receiving Financial Aid, West

KEY RESULT Three independent variables stand out: SAT average and U.S. News index have a positive relationship with salaries, while acceptance rate has a negative relationship.

Key Metric                      Value          Criterion
R Square                        0.517          Not high
p-value of F-Test               6.63*10^-10    Less than α = 0.05
Standard Error                  3565.574       Lowest in backward elimination process
Average of y                    50483
Standard Error / Average of y   0.070629       Lower than 0.20
Number of Observations          95             100 before outliers removed

CONCLUSION Some of the coefficients for independent variables are counter-intuitive, but they are not statistically significant. Our R square is not very high, which indicates that it is hard to predict starting salaries based merely on the characteristics of schools, since salaries vary by individual in real life. However, our analysis offers insight into the factors that affect starting salary for students, parents, and higher education administrators.

[Figure: Salary - SAT Average scatterplot. Salary (30000 to 65000) against SAT average (1000 to 1600); fitted line y = 29.562x + 11624, R² = 0.4136]
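The poster judges model quality partly by the ratio of the regression standard error to the average of y, against a benchmark of 0.20. A minimal sketch of that check, using only the values reported on the poster (the function and variable names are illustrative assumptions, not the authors' code):

```python
# Values reported on the poster for the final model.
STANDARD_ERROR = 3565.574
MEAN_SALARY = 50483.0
BENCHMARK = 0.20  # threshold quoted in the poster's criterion column


def relative_standard_error(standard_error, mean_y):
    """Return the regression standard error as a fraction of the mean of y."""
    return standard_error / mean_y


ratio = relative_standard_error(STANDARD_ERROR, MEAN_SALARY)
meets_benchmark = ratio < BENCHMARK
print(f"Se / mean(y) = {ratio:.6f}; below benchmark: {meets_benchmark}")
```

The computed ratio agrees with the 0.070629 shown on the poster, so the typical prediction error is roughly 7% of the average salary.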

Kuan Chen, Xiaoxue Zhao, Yi Sun


Analysis Motivation:

As fans of the great sport of baseball living in the city of St. Louis, we are concerned with the methodology of the BBWAA voters in their ballots for the Cy Young Award. As statistics students, we want to investigate what drives voters toward a specific MLB pitcher.

The Best Pitchers

1. Randy Johnson: 2002 NL Winner. Predicted Vote Share: 125.47%. In one of the greatest seasons in recent history, Johnson posted an astronomical 10.91 WAR with 24 wins and 334 Ks.
2. Justin Verlander: 2011 AL Winner & MVP. Predicted Vote Share: 111.66%. In a deserved MVP season, Verlander posted an 8.44 WAR with 24 wins.
3. Clayton Kershaw: 2011 NL Winner. Predicted Vote Share: 92.12%. Powered his way to the Cy Young with 248 Ks and only 59 ERs.
4. Curt Schilling: 2002 NL Runner Up. Predicted Vote Share: 90.16%. Lost to teammate Randy Johnson while still posting a WAR of 8.73 with 23 wins and 316 Ks.

Analyzing the Impact of Pitching Statistics on the Cy Young Award

ROBBED OF A CY YOUNG

Final OLS Model Formula:

Predicted Vote Share % = -130.52% + 4.67%(W) - 2.45%(L) + 1.4%(CG) - 2.69%(lnSHO) + 1.48%(IP) - 1.01%(ER) + 0.12%(K) - 0.2%(BF)
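The formula above can be applied to a single season line as sketched below. The function name, the argument names, and the example season are hypothetical and invented for illustration. The sketch also assumes at least one shutout, because the formula uses ln(SHO) and the poster does not say how pitchers with zero shutouts were handled.

```python
import math


def predicted_vote_share(wins, losses, complete_games, shutouts,
                         innings_pitched, earned_runs, strikeouts,
                         batters_faced):
    """Return the predicted vote share in percentage points (poster's OLS formula)."""
    if shutouts <= 0:
        # Assumption: ln(SHO) is undefined at zero; the poster gives no rule for it.
        raise ValueError("shutouts must be at least 1 for the ln(SHO) term")
    return (-130.52
            + 4.67 * wins
            - 2.45 * losses
            + 1.4 * complete_games
            - 2.69 * math.log(shutouts)
            + 1.48 * innings_pitched
            - 1.01 * earned_runs
            + 0.12 * strikeouts
            - 0.2 * batters_faced)


# Hypothetical season line, not taken from the poster's data set.
example_share = predicted_vote_share(
    wins=20, losses=8, complete_games=3, shutouts=1,
    innings_pitched=220, earned_runs=70, strikeouts=200, batters_faced=880,
)
print(f"Predicted vote share: {example_share:.2f}%")
```

Because the model is linear, nothing constrains the output to the 0% to 100% range, which is why the poster reports predictions such as 125.47% and -14.91%.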

KEY RESULTS:

• Our OLS model was superior to our ln model.
• Wins is very influential, as demonstrated by its coefficient and p-value.
• ln of SHO and CG, while in the model, are statistically insignificant.

[Figure: plot of W with a linear trend line, Linear (W); axes run from 0% to 120% and from 0 to 25]

VARIABLES NOT IN THE MODEL: WAR, G, H, BB, WHIP, FIP, SIERA, Playoffs

DESCRIPTIVE STATISTICS: ȳ = 37.82%; Std. Dev. of y = 26.1%; Mean of Wins = 17.83; Std. Dev. of Wins = 2.85; Mean of ER = 70.95; Std. Dev. of ER = 12.46

REGRESSION STATISTICS: Sε = 0.1889; R² = 0.6803; Significance F = 5.74 x 10^-20; n = 103

The Worst Pitchers

Brandon Webb: 2006 NL Winner In a year full of ineptitude, Brandon Webb managed to win the Cy Young despite a predicted vote share of 24.6% as no other NL pitcher broke even 20%.

[Figure: histogram of Predicted Vote Share; x-axis from -20% to 120% in steps of 10%, counts from 0 to 16]

Key Data Sources:

Carlos Zambrano: 2007 NL Candidate. Somehow managed to get 2% of the vote with 13 losses, 95 ER, 101 BB, and an ERA of 3.95. We feel our model estimates his season properly... at -14.91%. Of course, there is always the Chicago Cubs factor.

2009 Adam Wainwright. Predicted Vote Share: 50.77% (lost to 45.96% and 45.03%). Tim Lincecum was the beneficiary of a vote split between the strong performances of Cardinals pitchers Wainwright and Carpenter.

2005 Johan Santana. Predicted Vote Share: 42.31% (lost to 33.12%). Lost to Bartolo Colon, who broke the 20-win benchmark with a sky-high ERA of 3.48. Santana “only” had 16 wins, 12 fewer ER, and 81 more Ks.

SIGNS OF IMPROVEMENT?


Joe Warren, Hamilton Cook, Patrick Hart

[Figure: Histogram for Regression Standardized Residual; x-axis from -2 to 2 in steps of 0.25, counts from 0 to 16]

2009 Zack Greinke, AL. Predicted Vote Share: 58.95% def. 65.24%. Greinke’s dominating year was not overlooked. In a year where he posted a WAR of 10.37 but only 16 wins, the BBWAA voted him in over a traditionally better candidate, Felix Hernandez.

2010 Felix Hernandez, AL. Predicted Vote Share: 45.07% def. 52.61%. Hernandez had an incredible year by most metrics except the one that required run support from an anemic Mariners offense.

Picture Citations:
Adam Wainwright: http://mlb.mlb.com/news/article.jsp?ymd=20130225&content_id=41970096&vkey=news_stl&c_id=stl
Randy Johnson: http://sportsthenandnow.blogspot.com/2009/05/randy-johnson-last-of-his-kind.html
Carlos Zambrano: http://studiousmetsimus.blogspot.com/2012/01/miami-unsound-machine-carlos-zambrano.html
Clayton Kershaw: http://www.fancloud.com/articles/clayton-kershaw-and-his-nl-leading-era-will-takethe
Cy Young Award: http://mlbfrance.com/?p=1521
Busch Stadium: http://en.wikipedia.org/wiki/St._Louis_Cardinals


League of Regressions

Alexander Jia, Kyle Kong

Motivation League of Legends is the most popular online game in the world. In the game, players control characters called champions, choosing from a pool of 114, each with their own unique traits. If champion preference criteria are identified, Riot can increase champion popularity, thereby encouraging more in-game purchases and addressing managerial goals.

Hypothesis The most influential factor in a player’s champion selection will be the champion’s win rate, but the age (time since release), ban frequency (above 10%), KDA ratio (kills and assists per death), number of skins (purchasable alternative in-game appearances), cost (in IP), and damage dealt will also contribute.

Data Sources Lolking.net October 2013: Champion Win Rates, Ban Frequency

Elophant.com October 2013: Champion KDA Ratios, Damage Dealt Both sources are accurate and reliable, and all other data was collected through in-game information.

Methodology We used the Backwards Elimination Method to run a Multiple Linear Regression and found the following variables to be most influential in determining champion popularity: Age, Ban Rate, Skins, Cost (IP), and Damage.

The variables of win rate and KDA ratio were statistically insignificant, and their removal from our model drastically improved a number of metrics for model quality, such as R Square and standard error.

*Images found on LeagueofLegends.com and Clipart

The Model Y = 13118.907 - 339.304 (B1) + 6221.830 (B2) + 2957.981 (B3) - 1.489 (B4) + 0.041 (B5), where B1 through B5 stand for the five variables retained in the model.

Our final model is statistically significant, with a very low F-test p-value (reported as F-stat), and though the R² and SE are not ideal, they are the best outcomes we obtained through backwards elimination.

The Graph
[Figure: Matches Played (0 to 450,000) by # of skins (1 to 6)]
This shows the positive relationship between # of skins and matches played, with the decreasing upper ranges being a result of fewer champs with above 4 skins.

Key Results We found that champion popularity is influenced more by accessibility and aesthetics, not in-game performance factors as we predicted in our hypothesis.


R² = 0.358; Se/Ȳ = 0.456; F-stat = 6.48E-10; N = 110

Outliers We removed four outliers with studentized residuals > 3. We believe that they were outliers because of their popularity in professional play at the time. However, none of our points had high Cook’s distance or leverage values.
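The screening rule described above can be sketched as a simple filter. The residual values and champion labels below are hypothetical and invented for illustration. Comparing the absolute value of the residual to the threshold is also an assumption, since the poster only states "> 3". Computing the studentized residuals themselves is left to the regression software.

```python
def split_outliers(studentized_residuals, threshold=3.0):
    """Partition observations by |studentized residual| relative to the threshold."""
    kept, flagged = {}, {}
    for name, residual in studentized_residuals.items():
        # Assumption: large negative residuals count as outliers too.
        target = flagged if abs(residual) > threshold else kept
        target[name] = residual
    return kept, flagged


# Hypothetical residuals, not the poster's data.
residuals = {
    "champion_a": 0.4,
    "champion_b": -1.2,
    "champion_c": 3.6,
    "champion_d": 2.9,
}
kept, flagged = split_outliers(residuals)
print("flagged:", sorted(flagged))
print("kept:", sorted(kept))
```

After removing flagged observations, the regression would be refit on the remaining rows, which is consistent with the poster reporting N = 110 for the final model.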

Conclusion Unfortunately, the obvious solution of changing pricing and skins could impact scarcity. Therefore, Riot’s current model (gradual release and rework) is effective in maximizing revenue and improving player experience.


Do game performance and league experience determine salary?

MOTIVATION We were interested in the relationship between an NBA player’s salary and their performance that season. We looked at a random sample of 100 NBA players in the 2012-2013 season using the following variables:
• points per game
• average number of minutes played per game
• number of games played
• free throw percentage
• field goal percentage
• number of years in the league

MODEL & ANALYSIS We used the Backwards Elimination method to develop a Multiple Linear Regression model. After running the model four times, we found that the third model had the lowest Se and that Points per Game and Years in the League are the variables with the highest correlation with salary. We removed the variables Field Goal Percentage and Minutes Played.

Regression Equation:

Salary = 518306.3782 - 20980.8238(x1) - 2722933.2421(x2) + 673993.9930(x3) + 387366.4944(x4)

x1 = Free Throw %, x2 = Games Played, x3 = Points per Game, x4 = Years in the League

Se = 2,927,398; ȳ = 4,708,276; Se/ȳ = 0.62176; n = 100; R² = 0.69027

The model is statistically significant, with a p-value well below 0.05 and a relatively high R² value. However, the ratio of standard error to average y is much higher than the benchmark of 0.2, likely due to outliers.

DATA SOURCES
ESPN Official Website: 2012-2013 Season Salary Data
NBA Official Website: 2012-2013 Season Statistics and Player List

CORRELATION: Salary vs. Points per Game
[Figure: scatterplot of Salary ($0 to $25,000,000) against Points per Game (0 to 35); trend line y = 688250x - 522425, R² = 0.58499]

This scatterplot shows a relatively strong correlation between Salary and Points per game. The correlation may have been weakened by the significant outliers.
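The single-variable trend line from the scatterplot (y = 688250x - 522425) can be read as a rough rule of thumb, sketched below. The function name and the 10 points per game input are illustrative assumptions, and this is the one-variable trend line rather than the poster's four-variable model.

```python
SLOPE = 688250.0       # dollars of salary per additional point per game
INTERCEPT = -522425.0  # trend-line value at zero points per game


def trend_salary(points_per_game):
    """Return the salary implied by the scatterplot's fitted trend line."""
    return SLOPE * points_per_game + INTERCEPT


# Hypothetical player averaging 10 points per game.
salary_at_10 = trend_salary(10)
print(f"Trend-line salary at 10 PPG: ${salary_at_10:,.0f}")
```

With R² = 0.58499, this line explains a little under 60% of the variation in salary, so individual players can sit far from the value it returns.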

KEY RESULTS Using our final regression model, we found that Free Throw Percentage and Games Played both have negative coefficients: as either increases, with the other variables held constant, predicted Salary decreases. This surprised us, since we expected positive relationships. However, neither variable is statistically significant at an alpha of 0.05, so we cannot conclude that the true relationship is negative. Points per Game and Years in the League have positive coefficients: as either increases, so does predicted Salary. Both variables are statistically significant at an alpha of 0.05.

CONCLUSION Despite the relatively high R² value, the high standard error indicates that Salary cannot be predicted precisely from a small number of variables. Field Goal Percentage and Minutes Played per game were removed from the model, and Free Throw Percentage and Games Played were not statistically significant in our final model. This leaves Points per Game and Years in the League as the only two statistically significant predictors of Salary in our model. We would recommend that players or managers look to increase Points per Game and Years in the League in order to increase Salary.

JULIA BURNS and IAN RUDOLPH Group 64


Going for Gold: The Statistics of Olympic Medal Wins

By: Ross Hochwert & Alex Ranney

Motivation

Every two years, the world comes together and puts aside its differences to compete in the Olympics. Athletes are pitted against one another in the fiercest competition in their individual sport. However, it seems that a handful of countries tend to dominate the Olympic scene year after year. We would like to examine what factors a country can and cannot control that impact the total number of medals won by a country in the Summer Olympics. The problem is interesting because the more a country's medal count depends on factors it cannot control, such as population, the less equal and more partial the Olympic games appear.

Key Findings We believe that Paralympic Total Summer Medals is the variable with the strongest implications from our research. Paralympic total summer medals had a very low significance value and a high impact on total medal count based on its coefficient and mean. This highly relevant variable implies that countries that perform well in the Paralympic games do equally well in the Olympic games, showing that experience in other international competition helps countries win in the Olympics. Additionally, we concluded that certain countries value international competition as a whole across all competitions, which leads them to invest more time and resources into their athletes to win.

Data Sources

Descriptive Statistics

Variable                          Mean     Median   Standard Error   Standard Deviation
Total Medals                      6.70     4.00     0.75             8.65
Births/Woman                      2.07     1.86     0.08             0.91
Death Rate (per 1000)             8.44     2.86     0.25             2.86
GDP Per Capita (in thousands)     18.62    10.25    1.88             21.60
Net Migration Rate (per 1000)     0.38     0.00     0.45             5.21
Urban Percentage                  63.97    20.17    1.76             20.19
Infant Mortality Rate             17.22    11.99    1.42             16.33
Land Area (in 1000 Sq KM)         449.15   147.66   57.50            660.65
Paralympic Total Summer Medals    134.50   21.50    22.91            263.22

Quality of Model We used a Backwards Elimination Method to perform a multiple linear regression and find the model with the lowest standard error. The model is statistically significant, with a high R² value and an F-test significance (reported as F-stat) lower than α = .05. Although our standard error over the mean of Medal Wins is higher than the threshold of .2, we believe this is due to the inherent nature of the Olympics, in that the distribution of medal wins is heavily skewed.

R² = .795; F-stat = 1.04E-38; Se/ȳ = .6; N = 132

[Figure: Paralympic Total Summer Medals Line Fit Plot. Total Medals (0 to 50) against Paralympic Total Summer Medals (0 to 2000); linear fit y = 0.0281x + 2.9572, R² = 0.73391]

The Model Number of Medals Won = 5.595 - 0.064 * GDP Per Capita + 0.041 * Infant Mortality Rate + 0.361 * Death Rate - 0.048 * Urban Percentage - 1.503 * Births/Woman + 0.168 * Net immigration rate + 0.001 * Land Area in 1000 square kilometers + 0.031 * Paralympic Total Summer Medals
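As a sanity check on the model above, an ordinary least squares fit with an intercept should return roughly the average of y when every predictor is set to its sample mean. The sketch below plugs in the means from the poster's descriptive statistics. The dictionary keys and function name are illustrative assumptions, and a small gap is expected because the published coefficients are rounded.

```python
# Coefficients from the poster's final model.
INTERCEPT = 5.595
COEFFICIENTS = {
    "gdp_per_capita_thousands": -0.064,
    "infant_mortality_rate": 0.041,
    "death_rate_per_1000": 0.361,
    "urban_percentage": -0.048,
    "births_per_woman": -1.503,
    "net_migration_rate_per_1000": 0.168,
    "land_area_1000_sq_km": 0.001,
    "paralympic_summer_medals": 0.031,
}

# Sample means from the poster's descriptive statistics.
SAMPLE_MEANS = {
    "gdp_per_capita_thousands": 18.62,
    "infant_mortality_rate": 17.22,
    "death_rate_per_1000": 8.44,
    "urban_percentage": 63.97,
    "births_per_woman": 2.07,
    "net_migration_rate_per_1000": 0.38,
    "land_area_1000_sq_km": 449.15,
    "paralympic_summer_medals": 134.50,
}


def predict_medals(country):
    """Apply the fitted linear model to one country's indicators."""
    return INTERCEPT + sum(
        coefficient * country[name]
        for name, coefficient in COEFFICIENTS.items()
    )


prediction_at_means = predict_medals(SAMPLE_MEANS)
print(f"Predicted medals at the sample means: {prediction_at_means:.2f}")
```

The result is about 6.66, close to the reported mean of 6.70 total medals.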

Excluded Countries:

Excluded Variables:
• Gini Coefficient
• Population (in Millions)
• Previous medals (in 2006 Summer Olympics)
• Energy consumption (kg oil equivalent per capita)
• Renewable water sources
• Literacy rate
• Education index

Summary

Countries that receive medals in previous Olympic games will tend to do so in future Olympic games. Yet, when looking at the intrinsic nature of global competition, countries that have citizens with favorable views of that country (as measured in net migration rate) and competitive spirit (as measured in Paralympic Medal Wins) tend to do better than competitors, holding all other factors constant. Governments should look to improve their net migration rate and success in global competitions (like the Paralympic games) in order to rise above competitors. Given that other variables are harder to control, and may negatively impact medal wins, we recommend that governments not focus on those variables.

