Solution Manual For Modern Business Analytics, 1st Edition Matt Taddy and Leslie Hendrix and Matthew Harding Chapter 1-9

Chapter 1 Regression

Problem 1.1
For this problem set, we will use 13,103 observations of hourly counts from 2011 to 2012 for bike rides (rentals) from the Capital Bikeshare system in Washington DC. The data are recorded for hours after 6am every day. (We omit earlier hours for convenience since they often include zero ride counts.) This dataset is adapted from data originally compiled by Fanaee and Gama in 'Event labeling combining ensemble detectors and background knowledge' (2013). This data can be used for modeling system usage (ride counts). Such usage modeling is a key input for operational planning. bikeshare.csv contains:

dteday: date
mnth: month (1 to 12)
holiday: whether day is holiday or not
weekday: day of the week, counting from 0:sunday
workingday: 0 if the day is a weekend or holiday, 1 otherwise
weathersit: broad overall weather summary (clear, cloudy, wet)
temp: temperature, measured in Celsius
hum: humidity %
windspeed: wind speed, measured in km per hour
cnt: count of total bike rentals that day

<bikeshare.csv> <bikeshareReadme.txt>

Read the bikeshare.csv data into R. Plot the marginal distribution for the count of bike rentals and the conditional count distribution given the broad weather situation (weathersit).

a-1. Use a histogram to plot the marginal distribution for the count of bike rentals. What is the shape of the distribution?
a. skewed left
b. fairly symmetric
c. skewed right

Explanation/Solution
The following code draws a histogram plotting the marginal distribution for the count of bike rentals:

biketab <- read.csv("bikeshare.csv", strings=T)
hist(biketab$cnt, xlab="daily ride count", freq=FALSE, main="")

a-2. If you haven't already, read the bikeshare.csv data into R and make sure to use the strings=T argument in the read.csv function. Create side-by-side boxplots to show the conditional count distribution, given the broad weather situation (weathersit). What does the side-by-side boxplot look like? (The top image is correct.)

a.

b.

c.

Explanation/Solution
The following code draws side-by-side boxplots showing the conditional count distribution, given the broad weather situation:

biketab <- read.csv("bikeshare.csv", strings=T)
boxplot(cnt ~ weathersit, data = biketab, xlab="weather situation", ylab="daily ride count")

Problem 1.2 Read the bikeshare.csv data into R and make sure to use the strings=T argument in the read.csv function. Fit a regression for ride count as a function of the weather situation variable to answer the following questions. a. On wet days, is the expected ride count higher or lower compared to clear days?



On wet days, the expected ride count is lower by 3,073.5 +/-0.1

b. What is the SSE? 2467890819 (+/-1)

c. What is the R2? R2=0.1 (+/-0.01)

d. What is the estimate of the standard deviation of the residual errors? 1841 (+/-1)

Explanation/Solution
biketab <- read.csv("bikeshare.csv", strings=T) # read data in and make categorical variables factors using strings=T argument
class(biketab$weathersit) # check to be sure weathersit is a factor
[1] "factor"

If the result of the above says "character" instead of "factor" then use the code below to make weathersit a factor:

biketab$weathersit <- factor(biketab$weathersit)
levels(biketab$weathersit)
[1] "clear"  "cloudy" "wet"

Next, use GLM to regress the counts onto the weather situation variable.

wsfit <- glm(cnt ~ weathersit, data=biketab)
summary(wsfit)

Call:
glm(formula = cnt ~ weathersit, data = biketab)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -4445.8  -1254.8    -14.8   1400.7   4326.1

Coefficients:
                 Estimate Standard Error t value Pr(>|t|)
(Intercept)       4876.79          85.57  56.994  < 2e-16 ***
weathersitcloudy  -840.92         145.07  -5.797 1.01e-08 ***
weathersitwet    -3073.50         410.79  -7.482 2.12e-13 ***
---
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 3389960)

    Null deviance: 2739535392  on 730  degrees of freedom
Residual deviance: 2467890819  on 728  degrees of freedom
AIC: 13071

Number of Fisher Scoring iterations: 2

a. We see that weather situation "clear" is the reference level, with an expected ride count of 4877 rides per day. Wet days have an expected 3073.5 fewer rides per day, so the expected ride count is lower.
b. From the summary.glm output, the SSE is the residual deviance, 2467890819.
c. 1 - wsfit$deviance/wsfit$null.deviance
d. The residual error variance is 3389960 (the "dispersion parameter"), so the residual standard deviation is around 1841: sqrt(wsfit$deviance/wsfit$df.residual)
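As a quick check on parts b-d (a minimal sketch, assuming the fit object is named wsfit as above), the same numbers can be computed directly:

sum(wsfit$residuals^2)                  # SSE; equals the residual deviance for a gaussian glm
1 - wsfit$deviance/wsfit$null.deviance  # R-squared, about 0.10
sqrt(wsfit$deviance/wsfit$df.residual)  # residual standard deviation, about 1841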

Problem 1.3
If you haven't already, read the bikeshare.csv data into R and make sure to use the strings=T argument in the read.csv function. Run a linear regression using ride counts as the response and model the response from the weather variables weathersit, temp, hum, and windspeed.

For each 10-degree increase in the temperature, we expect ride count to increase/decrease by about 1560 (+/-1)

Explanation/Solution
biketab <- read.csv("bikeshare.csv", strings=T) # read data in and make categorical variables factors using strings=T argument

ridefit <- glm(cnt ~ weathersit + temp + hum + windspeed, data=biketab)
coef(ridefit)
     (Intercept) weathersitcloudy    weathersitwet             temp              hum        windspeed
      3434.91953       -287.24322      -1824.47164        155.97949        -19.05618        -58.97232

The coefficient on temp is 155.97949, and so a 10-degree increase in temperature corresponds to around 1560 expected extra rides per day.
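A quick way to read this off the fitted model (a small sketch, assuming the fit is stored as ridefit, as in the code above):

10*coef(ridefit)["temp"]  # about 1560 additional expected rides per day for a 10-degree increase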

Problem 1-4
If you haven't already, read the bikeshare.csv data into R and make sure to use the strings=T argument in the read.csv function. Run 2 separate regressions, one linear using ride counts as the response and another using log ride counts as the response. For both regressions, model the response from the weather variables weathersit, temp, hum, and windspeed, but add interactions between the continuous weather variables and the weathersit factor.

a. Using the results from your linear regression, for each weather situation, determine the change in expected ride count per 10-degree increase in temperature. Use the following table to record your answers. Note: Round your answers to 3 decimal places.

Weather Situation   Expected Change in Ride Count per 10-Degree Increase in Temperature
Clear               1481.851 (+/-0.1)
Cloudy              1780.467 (+/-0.1)
Wet                 991.2182 (+/-0.1)

b. Now use the results of your log-linear regression to find the change in expected log ride count per 10-degree increase in temperature. Use the following table to record your answers. Note: Round your answers to 3 decimal places.

Weather Situation   Expected Change in Log Ride Count per 10-Degree Increase in Temperature
Clear               1.502818 (+/-0.01)
Cloudy              1.715866 (+/-0.01)
Wet                 1.424282 (+/-0.01)

c. What is an advantage of the log model over the linear model?
a. The log model will never predict a value lower than 0.
b. Log models always make more accurate predictions.
c. The linear model's predictions are interpreted as multiplicative effects.



d. The estimates from a log model are not biased.

d. Plot the residuals from the linear and log models. Note that in the fitted object, the residuals are accessed as residuals and the fitted values as fitted.values. In this way, the fitted values from ridefit2 are ridefit2$fitted.values. Which of the following shows the plot from the log model? (The top image is correct.)

a.

b.



e. Using the residual plots for the linear and log ride count models, which model is overestimating ride counts on days when ride count is high?
a. Both
b. Neither
c. Log model
d. Linear model

f. Find the predicted ride count for a clear, 25-degree day with 50% humidity and 5kmh winds using the linear and log-linear models. Use the following table to record your answers for each model. Note: Round your answers to 3 decimal places.

Model    Predicted Ride Count
Log      5624.027 (+/-1)
Linear   5839.927 (+/-1)

Explanation/Solution

biketab <- read.csv("bikeshare.csv", strings=T) #read data in and make categorical variables factors using strings=T argument

a.
ridefit2 <- glm(cnt ~ weathersit*(temp + hum + windspeed), data=biketab)
coef(ridefit2)
               (Intercept)           weathersitcloudy
               2800.249154                1626.013771
             weathersitwet                       temp
              -1931.128533                 148.185109
                       hum                  windspeed
                 -9.361672                 -39.373256
     weathersitcloudy:temp         weathersitwet:temp
                 29.861582                 -49.063285
      weathersitcloudy:hum          weathersitwet:hum
                -28.504855                  13.886794
weathersitcloudy:windspeed    weathersitwet:windspeed
                -46.717599                 -34.933447



The impact of a 10 degree increase for each weather situation is:

Clear
## clear
10*coef(ridefit2)["temp"]
    temp
1481.851

Cloudy
## cloudy
10*(coef(ridefit2)["temp"] + coef(ridefit2)["weathersitcloudy:temp"])
    temp
1780.467

Wet
## wet
10*(coef(ridefit2)["temp"] + coef(ridefit2)["weathersitwet:temp"])
    temp
991.2182

b.
lridefit <- glm(log(cnt) ~ weathersit*(temp + hum + windspeed), data=biketab)
coef(lridefit)
               (Intercept)           weathersitcloudy
               7.732306025                0.479423898
             weathersitwet                       temp
               0.098311462                0.040734184
                       hum                  windspeed
              -0.001428943               -0.008882052
     weathersitcloudy:temp         weathersitwet:temp
               0.013257601               -0.005367406
      weathersitcloudy:hum          weathersitwet:hum
              -0.009508570                0.004139707
weathersitcloudy:windspeed    weathersitwet:windspeed
              -0.013712213               -0.086879680



Comparing the models, we can look at the new implied impacts for a 10 degree increase in temperature. These are the multiplicative effects on expected ride count per 10 degree increase.

Clear
## clear
exp(10*coef(lridefit)["temp"])
    temp
1.502818

Cloudy
## cloudy
exp(10*(coef(lridefit)["temp"] + coef(lridefit)["weathersitcloudy:temp"]))
    temp
1.715866

Wet
## wet
exp(10*(coef(lridefit)["temp"] + coef(lridefit)["weathersitwet:temp"]))
    temp
1.424282

Ridership now increases by 50% when clear, 72% when cloudy, and 42% when wet for a 10 degree temperature increase. This compares to linear ridership increases of 1480 when clear, 1780 when cloudy, and 990 when wet under our regression for raw counts.

c. The log model has the advantage that it will never predict ride counts less than zero (which can happen in the linear model).

d. The following code can be used to plot residuals from each model:

par(mfrow=c(1,2))
plot(ridefit2$residuals ~ ridefit2$fitted.values, xlab="fitted", ylab="residual", main="linear regression")
plot(lridefit$residuals ~ lridefit$fitted.values, xlab="fitted", ylab="residual", main="log linear regression")

e. On high ride count days, the residuals are negative for both models, so both models overestimate ride counts on days when ride count is high. f.



Linear model
newdata <- data.frame(weathersit="clear", temp=25, hum=50, windspeed=5)
predict(ridefit2, newdata)
       1
5839.927

Log-linear model
exp(predict(lridefit, newdata))
       1
5624.027

The linear model prediction is 5839 rides, the log-linear model prediction is 5624 rides. Note that the log-linear prediction, obtained by exponentiating the predicted log ride count, is a biased estimate of the expected ride count.

Problem 1-5
If you haven't already, read the bikeshare.csv data into R and make sure to use the strings=T argument in the read.csv function. Suppose that you know the bike-share system starts to strain and require extra help (e.g., extra re-distribution of bikes throughout the day, helpers at the docking stations) when it gets really busy. From experience you know that this happens when you have greater than 7000 rides per day. Build a logistic regression model for the probability of having greater than 7000 rides per day using the weather variables weathersit, temp, hum, and windspeed, with interactions between the continuous weather variables and the weathersit factor. Hint: Use the logical expression cnt > 7000 as the response in the glm statement. What is the multiplicative effect on the odds of passing this threshold when the temperature is raised by 10 degrees? Note: Round your answers to 2 decimal places.

Weather Situation   Odds of Exceeding 7000 Riders per Day if Temperature Increases 10 Degrees
Clear               2.67 (+/-1)
Cloudy              6.60 (+/-1)
Wet                 0 (+/-1)

Explanation/Solution
biketab <- read.csv("bikeshare.csv", strings=T) # read data in and make categorical variables factors using strings=T argument



You can run logistic regression on the logical statement cnt>7000.

ride7k <- glm(cnt>7000 ~ weathersit*(temp + hum + windspeed), data=biketab, family="binomial")
coef(ride7k)
               (Intercept)           weathersitcloudy
               -1.83930235                 0.80575825
             weathersitwet                       temp
              -14.72676607                 0.09837767
                       hum                  windspeed
               -0.02066004                -0.07114327
     weathersitcloudy:temp         weathersitwet:temp
                0.09034426                -0.09837767
      weathersitcloudy:hum          weathersitwet:hum
               -0.03156450                 0.02066004
weathersitcloudy:windspeed    weathersitwet:windspeed
               -0.11771010                 0.07114327

Clear
## clear
exp(10*coef(ride7k)["temp"])
    temp
2.674538

Cloudy
## cloudy
exp(10*(coef(ride7k)["temp"] + coef(ride7k)["weathersitcloudy:temp"]))
    temp
6.600988

Wet
## wet
exp(10*(coef(ride7k)["temp"] + coef(ride7k)["weathersitwet:temp"]))
temp
   1



A 10-degree temperature increase raises the odds of >7k rides by 2.7 times (an increase of 170%) on a clear day and by 6.6 times (an increase of 560%) on a cloudy day, and the odds do not change on wet days.

Problem 1.6
If you haven't already, read the bikeshare.csv data into R and make sure to use the strings=T argument in the read.csv function. Fit the model for regressing bike counts onto the weather variables (weathersit, temp, hum, windspeed) where weathersit interacts with the continuous weather variables, and then also include date, month, and weekday variables where these time and date variables do not interact with any other variables. Hint: You will first need to change the dteday variable into a date in R and make sure that weekday and month are factors.

a. The plot below depicts the time series for bike counts in gray with the predicted counts from the model described above in red.

Based on this plot, does there appear to be autocorrelation? Yes/No

b. Plot the ACF function. What is the lag-1 correlation?
a. Between 0 and 0.2
b. Between 0.2 and 0.4
c. Between 0.4 and 0.6
d. Between 0.6 and 0.8

c. Add an AR(1) term to the model by creating a lagged cnt variable and adding it to the regression. What is the resulting lag coefficient? Note: Round your answer to 2 decimal places.



0.35 (+/- 0.01)

d. Plot the ACF function for the model with the lagged term. What is the lag-1 correlation?
a. Between 0 and 0.2
b. Between 0.2 and 0.4
c. Between 0.4 and 0.6
d. Between 0.6 and 0.8

e. Does adding the lagged term seem to account for the autocorrelation? a. Yes b. No

Explanation/Solution
biketab <- read.csv("bikeshare.csv", strings=T) # read data in and make categorical variables factors using strings=T argument

a. The complete code to build this model is as follows. First, we need to set the month and day indicators as factor variables, and create a time trend term.

biketab$date <- as.Date(biketab$dteday, format="%m/%d/%Y")
biketab$mnth <- factor(biketab$mnth)
biketab$weekday <- factor(biketab$weekday)
timefit <- glm(cnt ~ weathersit*(temp + hum + windspeed) + date + mnth + weekday, data=biketab)

And plot the results:

plot(cnt ~ date, type="l", col=8, data=biketab)
lines(timefit$fitted ~ biketab$date, col="red")

We could also look at the residuals plot to see the autocorrelation.

plot(timefit$residuals ~ biketab$date, type="l")



b. Plotting the series or the residuals from our fitted model shows potential correlation. We can use the ACF to make it more precise. acf(timefit$residuals)

The lag-1 correlation is around 0.5, so we have significant dependence in the residuals.

c. Add the lagged cnt variable to the regression using the following code.

biketab$lag <- c(NA, head(biketab$cnt,-1)) # create lagged variable
biketab[1:5,c("cnt","lag")] # confirm it worked as planned
arfit <- glm(cnt ~ lag + weathersit*(temp + hum + windspeed) + date + mnth + weekday, data=biketab)



coef(arfit)["lag"]
     lag
0.350217

The resulting lag coefficient is 0.350217. The coefficient is between zero and one, so this is a stationary mean-reverting process of residual errors. d. acf(arfit$residuals)

e. Plotting the ACF for this new model confirms that this AR(1) term accounts for most of the autocorrelation from our earlier fit.



Chapter 2 Uncertainty Quantification

The data in ames2009.csv consist of information that the local government in Ames, Iowa, uses to assess home values. These data were compiled from 2006 to 2010 by De Cock (2011) and contain 2930 observations on 79 variables describing properties in Ames and their observed sale price. For this problem, we will use a subset of the Ames data contained in ames2009.csv. <Ames2009.csv>

Problem 2.1
a. What is the average sales price? 178368 (+/-0.1)

b. What is the standard error of the mean? (Note: If you use intermediate calculations, keep at least two decimal places for each and report your answer to three decimal places.) 2506.944 (+/-0.2)

c. Use the mean and standard error to calculate a 95% confidence interval for the mean unconditional sales price. Use 1.96 for the critical value. What is the lower bound and upper bound? (Note: If you use intermediate calculations, keep at least two decimal places for each and report your answer to one decimal place.)
Lower bound: 173454.4 (+/-0.1)
Upper bound: 183281.6 (+/-0.1)

Explanation/Solution
The following code will call in the data if it is in your working directory. Note the strings=T argument to treat character data as a factor in R.

ames <- read.csv("Ames2009.csv", strings=T) # call in data

a. The following code can be used to calculate the average sales price: (xbar <- mean(ames$SalePrice)) #mean

b. The following code can be used to calculate the standard error of the mean: (muSE <- sd(ames$SalePrice)/sqrt(nrow(ames))) #SE of the mean

c. The following code can be used to calculate the 95% confidence interval for the mean unconditional sales price: xbar + c(-1,1)*1.96*muSE #95% CI



Problem 2.2
Regress the log(SalePrice) onto all variables except for Neighborhood. The following code will run the regression, assuming you called the data "ames".

amesFit <- glm(log(SalePrice) ~ .-Neighborhood, data=ames)

Which of the following regression coefficients are significant when you control for a 5% false discovery rate? Choose all that apply. (Note: You can use the code below from the text to find the cutoff. You may select more than one answer.)

pvals <- summary(amesFit)$coef[-1,"Pr(>|t|)"]
fdr_cut <- function(pvals, q){
  pvals <- pvals[!is.na(pvals)]
  N <- length(pvals)
  k <- rank(pvals, ties.method="min")
  max(pvals[ pvals <= (q*k/N) ])
}
cutoff5 <- fdr_cut(pvals, q=.05)

a. log.Lot.Area
b. Lot.Config
c. Bldg.Type
d. Overall.Qual
e. Overall.Cond
f. Year.Built
g. Central.Air
h. Electrical
i. Gr.Liv.Area
j. Full.Bath
k. Half.Bath
l. Bedroom.AbvGr
m. Kitchen.AbvGr
n. TotRms.AbvGrd

Explanation/Solution
The following code prints which regression coefficients are significant when you control for a 5% false discovery rate:

print(cutoff5) # FDR cut-off
which(pvals <= cutoff5) # find predictors with p-values below the cutoff

Problem 2.3 Regress the log(SalePrice) onto all variables except for Neighborhood. a. What is the lower and upper bound of the 95% confidence interval for the effect of having central air on the expected log sale price? Note: Use the values output from glm and 1.96 for the critical value. Report your answer to four decimal places and carry as many decimal places as possible in intermediate calculations.



Lower bound: 0.0847 (+/-0.001)
Upper bound: 0.1795 (+/-0.001)

b. What is the lower and upper bound of the 95% confidence interval for the effect of having central air on the expected log sale price using a bootstrap? Note that the coefficient for the effect of having central air is Central.AirY. Note: Use the boot function from the boot library with 2,000 bootstrap samples. Before running the bootstrap, set the seed to 1. Report your answer to four decimal places and carry as many decimal places as possible in intermediate calculations. The following function will extract the coefficients to feed to the boot function.

getBeta <- function(data, obs, var){
  fit <- glm(log(SalePrice) ~ .-Neighborhood, data=data[obs,])
  return(fit$coef[var])
}

Lower bound: 0.0585 (+/-0.001) Upper bound: 0.2162 (+/-0.001)

c. Which confidence interval is wider?
a. glm
b. bootstrap

Explanation/Solution
The following code will run the regression, assuming you called the data "ames".

amesFit <- glm(log(SalePrice) ~ .-Neighborhood, data=ames)

a. The following code can be used to calculate the 95% confidence interval for the effect of having central air on the expected log sale price:

( bstats <- summary(amesFit)$coef["Central.AirY",] ) # coefficient for Central Air
bstats["Estimate"] + c(-1,1)*1.96*bstats["Std. Error"] # 95% CI

b. The following code can be used to use a bootstrap to calculate the 95% confidence interval:

library(parallel)
library(boot)
set.seed(1)
( betaBoot <- boot(ames, getBeta, 2000, var="Central.AirY", parallel="snow", ncpus=detectCores()) )
quantile(betaBoot$t, c(.025, .975))

c. The bootstrap interval is wider (the SE is 0.02 for the standard method and 0.04 for the bootstrap).



Problem 2.4
Regress the log(SalePrice) onto all variables except for Neighborhood.

a. Run the code below and feed these objects to the boot function. When using a block bootstrap, what is the lower and upper bound of the 95% confidence interval for the coefficient on central air while allowing for dependence in sales prices within neighborhoods? Note: Use 2,000 bootstrap samples. Before running the bootstrap, set the seed to 1. Report your answer to four decimal places.

byNBHD <- split(ames, ames$Neighborhood)
getBetaBlock <- function(data, ids, var){
  data <- do.call("rbind", data[ids])
  fit <- glm(log(SalePrice) ~ .-Neighborhood, data=data)
  return(fit$coef[var])
}

Lower bound: 0.0220 (+/-0.01)
Upper bound: 0.2031 (+/-0.01)

b. Use the sandwich package to obtain a 95% confidence interval for the coefficient on central air while allowing for dependence in sales prices within neighborhoods. Use 1.96 for the critical value. What is the lower and upper bound of this CI? Report your answer to four decimal places.
Lower bound: 0.0507 (+/-0.01)
Upper bound: 0.2135 (+/-0.01)

c. Use the results from the block bootstrap to calculate a bias-corrected 95% confidence interval for the multiplicative effect of central air on the expected sale price. What is the lower and upper bound of this CI? Report your answer to four decimal places.
Lower bound: 1.0574 (+/-0.01)
Upper bound: 1.2603 (+/-0.01)

Explanation/Solution
The following code will run the regression, assuming you called the data "ames".

amesFit <- glm(log(SalePrice) ~ .-Neighborhood, data=ames)

a. The following code can be used to determine the confidence interval using a block bootstrap:

library(parallel)
library(boot)



set.seed(1)
( betaBootB <- boot(byNBHD, getBetaBlock, 2000, var="Central.AirY", parallel="snow", ncpus=detectCores()) )
quantile(betaBootB$t, c(.025, .975))

Note that, even with the same seed, you may get different results for different operating systems and versions of R. You should at least be able to replicate your own results by running the same code twice on your own machine.

b. The following code uses the sandwich package to obtain a 95% confidence interval for the coefficient on central air while allowing for dependence in sales prices within neighborhoods:

library(sandwich)
library(lmtest)
Vblock <- vcovCL(amesFit, cluster=ames$Neighborhood)
clstats <- coeftest(amesFit, vcov = Vblock)["Central.AirY",]
round(clstats, 5)
clstats["Estimate"] + c(-1,1)*1.96*clstats["Std. Error"]

c. This multiplicative effect is the exponentiated coefficient. The exponentiation is a

nonlinear transformation, and the distribution of this transformation will not be equal to the transformation of the raw coefficient distribution (i.e., exp(beta.hat) is a biased estimate for true exp(beta)). So, we should use the bias corrected bootstrap to obtain the 95% CI: quantile(2*exp(betaBootB$t0) - exp(betaBootB$t), c(.025, .975))

If you got different results for the bootstrap in part a, your answer might be slightly different than the answer key shows.

Chapter 3 Regularization and Selection

For this problem set, we will look at the web browser history for 10,000 users for 1000 heavily trafficked websites. The data was obtained in the early 2000s. Each browser in the sample spent at least $1 online in the same year. This is a simple version of the data that are used to build targeted advertising products that predict which customers are most likely to make purchases. The data are stored in three files.
1. `browser-domains.csv` contains the counts for visits from each user ID to each website ID.
2. `browser-sites.txt` contains the full names of the domains for each website ID.
3. `browser-totalspend.csv` contains the total amount spent online that year for each user ID.



Using the code below, you can read the data and convert it into a simple triplet matrix that contains a column for every website, a row for every user, and entries that are 1 if the user visited that website and 0 otherwise:

library(Matrix)
web <- read.csv("browser-domains.csv") # Browsing History
sitenames <- scan("browser-sites.txt", what="character")
web$site <- factor(web$site, levels=1:length(sitenames), labels=sitenames) # relabel site factor
web$id <- factor(web$id, levels=1:length(unique(web$id))) # also factor machine id
# use this info in a sparse matrix
xweb <- sparseMatrix(
  i=as.numeric(web$id), j=as.numeric(web$site),
  x=rep(1,nrow(web)), # use x=web$visits for a matrix of counts instead of binary 0/1
  dims=c(nlevels(web$id),nlevels(web$site)),
  dimnames=list(id=levels(web$id), site=levels(web$site)))
yspend <- read.csv("browser-totalspend.csv", row.names=1) # use 1st column as row names
yspend <- as.matrix(yspend) ## good practice to move from dataframe to matrix

We now have `yspend` as the user's total spending and `xweb` as their browser history.
<browser-domains.csv> <browser-sites.txt> <browser-totalspend.csv>

Problem 3.1
How many sites did household 1 visit? (Household 1 is the first row of xweb.) 302

Explanation/Solution The following code can be used to determine the sites that Household 1 visited: head(xweb[1, xweb[1,]!=0])
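The head() call above only previews the first few visited sites. To get the actual number of sites visited (a small sketch, assuming xweb was built with the setup code above), count the nonzero entries in row 1:

sum(xweb[1,] != 0)  # number of distinct sites visited by household 1; should return 302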

Problem 3.2
Fit a log-linear regression model for `yspend` (regress log(yspend) onto xweb) using a sequence of lasso penalties and produce a path plot. When answering the following questions, use the default AICc selection unless specified otherwise.

a. What is the first website to enter the model with a nonzero coefficient?
a. amazon.com
b. bizrate.com
c. delta.com
d. weatherbug.com



b. Use the AIC to select the optimal lasso penalty. What is the optimal lambda value? Round your answers to four decimal places. 0.0198 (+/-0.001)

c. How many regression coefficients have nonzero estimates at this value? 229

d. Which 5 websites have the largest negative effects on expected spending? Select all that apply.
a. 88.80.5.21
b. 032439.com
c. advertising.com
d. alltel.net
e. aol.com
f. facebook.com
g. netoffers.net
h. srch-results.com
i. pointroll.com
j. premiumproductsonline.com

e. Which 5 websites have the largest positive effects on expected spending? Select all that apply.
a. 180solutions.com
b. advertising.com
c. bizrate.com-o01
d. commerceonlinebanking.com
e. ebay.com
f. fedex.com
g. overture.com
h. travelhook.net
i. ups.com
j. whenu.com

f. What is the predicted annual spending for someone whose browser history includes only google.com and fedex.com? 190.866 (+/-0.01)

g. How many regression coefficients have nonzero estimates for the AIC chosen model? 229

h. How many regression coefficients have nonzero estimates for the BIC chosen model? 53

Explanation/Solution
library(gamlr)
spender <- gamlr(xweb, log(yspend))



plot(spender) ## path plot

a. Bizrate was an online shopping deal comparison website popular in the early 2000s. The code to reveal this is as follows: which(spender$beta[,2]!=0)

b. The code to determine the optimal lambda value is as follows: spender$lambda[which.min(AICc(spender))]

c. Since AICc selection is the default, we can just call coef to get the coefficients. We do that here and remove the intercept:

beta <- coef(spender)[-1,] # get rid of intercept
sum(beta!=0) # number nonzero

d. The following code sorts the websites by the negative effects on expected spending: sort(beta)[1:5]

e. The following code sorts the websites by the positive effects on expected spending: sort(beta, decreasing=TRUE)[1:5]

f. You could use the predict function or pull out the coefficients and calculate that way:

beta[c("fedex.com","google.com")]
 fedex.com google.com
 0.1702358  0.0000000
( logspend <- coef(spender)[1] + sum(beta[c("fedex.com","google.com")]) )
exp(logspend)

g. The code to determine the number of regression coefficients with nonzero estimates for the AIC chosen model is as follows:

AIC <- coef(spender, select=which.min(AIC(spender)))[-1,] ## and AIC instead
sum(AIC!=0)

h. The code to determine the number of regression coefficients with nonzero estimates for the BIC chosen model is as follows:

bBIC <- coef(spender, select=which.min(BIC(spender)))[-1,] ## and BIC instead
sum(bBIC!=0)
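For part f, an alternative to adding up coefficients by hand is to build a one-row prediction matrix and call predict directly. This is a sketch, not the only approach; it assumes xweb and spender are defined as above, and the new row has 1s only for google.com and fedex.com:

newx <- xweb[1, , drop=FALSE]*0              # zero browsing history with the right columns and names
newx[, c("google.com","fedex.com")] <- 1     # visits to only these two sites
exp(drop(predict(spender, newx)))            # predicted spending in dollars, about 190.87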

Problem 3.3
Run a CV experiment for the log-linear regression model for `yspend`. Set the seed at 10.

a. Using the CV-min rule, what is the optimal lambda? Round your answer to four decimal places. 0.0217 (+/-0.0001)

b. What is the out-of-sample R2 using the CV-min rule to select the model? Round your answer to four decimal places. 0.2344 (+/-0.0001)

c. What is the out-of-sample R2 for the AICc selected model? Round your answer to four decimal places. 0.2342 (+/-0.0001)



Explanation/Solution
set.seed(10)
cv.spender <- cv.gamlr(xweb, log(yspend), verb=TRUE) # cv lasso
plot(cv.spender)

a. cv.spender$lambda.min
b. 1 - cv.spender$cvm[cv.spender$seg.min]/cv.spender$cvm[1]
c. 1 - cv.spender$cvm[which.min(AICc(spender))]/cv.spender$cvm[1]

Problem 3-4
Use a logistic regression Lasso to build a model for the probability that a user spends more than $1000 in a single year. Use the AICc selection.

a. Which site has the largest coefficient?
a. amazon.com
b. bizrate.com
c. fedex.com
d. weather.com

b. What is the increase in the odds of spending more than $1000 due to a user having fedex.com in their browser history? Report as a percent. 28% (+/-1)

Explanation:
bigspend <- gamlr(xweb, yspend>1000, family="binomial")
plot(bigspend)

a.
blogit <- coef(bigspend)[-1,] # get coefficients minus intercept
which.max(blogit) # get maximum

b.
exp(blogit["fedex.com"]) # exponentiate coefficient for fedex.com

1.276706 means there is roughly a 28% increase in the odds of spending $1,000 or more if fedex.com is in the browsing history.

Chapter 4 Classification

This problem set uses the data from the spam prediction example from Chapter 1, where the content of emails is used to predict the probability that a message is spam. Using the code below, first break the data into a training set of 4000 emails and a test set of 601 emails.

spammy <- read.csv("Spam.csv")
set.seed(1)



testsamp <- sample.int(nrow(spammy), 601)
xtrain <- spammy[-testsamp, -ncol(spammy)]
xtest <- spammy[testsamp, -ncol(spammy)]
ytrain <- spammy[-testsamp, "spam"]
ytest <- spammy[testsamp, "spam"]

Problem 4-1
Explore your newly created objects.

a. Find the dimensions for the xtrain object. Rows: 4000 Columns: 57
b. Find the dimensions for the xtest object. Rows: 601 Columns: 57
c. Find the length of the ytrain object. Rows: 4000
d. Find the length of the ytest object. Rows: 601

Feedback/Solution
dim(xtrain)
dim(xtest)
length(ytrain)
length(ytest)

Problem 4-2
Use K-NN to classify the test emails and produce the out-of-sample confusion matrices.

a. Set the seed to 10, and then run K-NN with K=5 using scaled x matrices. Find the values for the confusion matrix.

                  Actual Class
Predicted Class       0      1
              0     343     25
              1      13    220



b. For K-NN with K = 5, calculate and report the precision to 4 decimal places. 0.9442 tolerance is +- 0.0001

c. For K-NN with K=5, calculate and report the recall to 4 decimal places. 0.8980 tolerance +- 0.0001

d. Set the seed to 11, and then run K-NN with K=20 using scaled x matrices. Find the values for the confusion matrix.

                  Actual Class
Predicted Class       0      1
              0     334     30
              1      12    215

e. For K-NN with K = 20, calculate and report the precision to 4 decimal places. 0.9471 tolerance is +- 0.0001

f. For K-NN with K=20, calculate and report the recall to 4 decimal places. 0.8776 tolerance +- 0.0001

Feedback/Solution library(class)

## k=20
set.seed(11)
k20 <- knn(train=xtrain, test=xtest, cl=ytrain, k=20)
(k20cm <- table(pred=k20, actual=ytest))
k20cm[2,2]/sum(k20cm[2,]) # precision
k20cm[2,2]/sum(k20cm[,2]) # recall



Feedback/Solution library(class)

## k=5
set.seed(10)
k5 <- knn(train=xtrain, test=xtest, cl=ytrain, k=5)
(k5cm <- table(pred=k5, actual=ytest))
k5cm[2,2]/sum(k5cm[2,]) # precision
k5cm[2,2]/sum(k5cm[,2]) # recall

Problem 4-3
Using the training data, fit a lasso logistic regression model to predict whether an email is spam. First, load the gamlr library and fit the lasso logistic regression with gamlr. Note that you need to set lmr=1e-4 or smaller within the gamlr function. Using the AICc selected model, produce both the in-sample and out-of-sample confusion matrices for a classification cutoff of 1/2.

a. Find the values for the in-sample confusion matrix.

                  Actual Class
Predicted Class       0      1
              0    2331    137
              1     101   1431

b. Calculate and report the in-sample precision. 0.9341 tolerance is +- 0.0001

c. Calculate and report the in-sample recall. 0.9126 tolerance +- 0.0001

d. Find the values for the out-of-sample confusion matrix.

                  Actual Class
Predicted Class       0      1
              0     339     30
              1      17    215

e. Calculate and report the out-of-sample precision to 4 decimal places. 0.9267 tolerance +- 0.0001

f. Calculate and report the out-of-sample recall to 4 decimal places. 0.8776 tolerance +- 0.0001

Feedback/Solution
library(gamlr)
spamfit <- gamlr(xtrain, ytrain, family="binomial", lmr=1e-4)

ptrain <- drop( predict(spamfit, xtrain, type="response") ) # preds train
( ctrain <- table(pred=ptrain>0.5, ytrain) ) # confusion matrix
ctrain[2,2]/sum(ctrain[2,]) # precision
ctrain[2,2]/sum(ctrain[,2]) # recall

ptest <- drop( predict(spamfit, xtest, type="response") ) # preds test
( ctest <- table(pred=ptest>0.5, ytest) ) # confusion matrix
ctest[2,2]/sum(ctest[2,]) # precision
ctest[2,2]/sum(ctest[,2]) # recall

Problem 4-4
Using the training data, fit a lasso logistic regression model to predict whether an email is spam. First, load the gamlr library and fit the lasso logistic regression with gamlr. Note that you need to use a smaller-than-default lambda min ratio, lmr=1e-4. Use the AICc selected model to make predictions. It is much worse to miss an important email than it is to get the occasional spam in your inbox. Say that you view the cost of missing an important email as 10 times the cost of seeing a spam message in your inbox, such that you want to classify an email as spam only if (1-p)x10 is less than p, where p is the probability that it is spam.

a. What should the classification rule be for your spam filter?
p < 10/11
p > 10/11



p < 9/10
p > 9/10

b. Find the values for the out-of-sample confusion matrix.

                  Actual Class
Predicted Class       0      1
              0     355     90
              1       1    155
c. How many important emails would you miss out of the 601? 1

d. Calculate and report the in-sample precision to 4 decimal places. 0.9936 tolerance is +- 0.0001

e. Calculate and report the in-sample recall to 4 decimal places. 0.6327 tolerance +- 0.0001

Feedback/Solution

a. (1-p)*10 < p  →  10 - 10p < p  →  10 < 11p, and then p > 10/11

b.
library(gamlr)
spamfit <- gamlr(xtrain, ytrain, family="binomial", lmr=1e-4)
ptest <- drop( predict(spamfit, xtest, type="response") )
( cfilter <- table(pred=ptest>10/11, ytest) ) # confusion matrix

c. From the confusion matrix in b, there is 1 email that has actual class 0 (not spam) but is predicted to be spam.

d. cfilter[2,2]/sum(cfilter[2,]) # precision

e. cfilter[2,2]/sum(cfilter[,2]) # recall
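For part c, the number of missed important emails can also be read off the confusion matrix programmatically (assuming cfilter is the table built in part b, with predicted class in rows and actual class in columns):

cfilter[2,1]  # emails classified as spam that are actually not spam; returns 1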

Chapter 5 Causal Inference with Experiments



Malaria Medication Adherence
In malaria.csv we have the results from an experiment that provided text alerts to encourage adherence to antimalarial medication (taking the meds) in Ghana. The treated variable is whether or not they get the text alerts, the adhere variable is whether they (mostly) adhered to the medication protocol, and the age and male variables tell us the patient's age and whether they are male. The data are taken from the paper at https://doi.org/10.7910/DVN/M4LY6C (Raifman, Julia RG, et al. "The impact of text message reminders on adherence to antimalarial treatment in northern Ghana: a randomized trial." PloS one 9.10 (2014)).

Problem 5-1
Calculate a 95% interval for the average treatment effect of text reminders on adherence probability.

a. Use glm to find the average treatment effect and standard error to 4 decimal places.
Average Treatment Effect: 0.0782 tolerance +- 0.0001
Standard Error: 0.0421 tolerance +- 0.0001

b. Compute the 95% interval for the average treatment effect of text reminders on adherence probability. Round to three decimal places.
Lower Bound: -0.004 tolerance +- 0.001
Upper Bound: 0.161 tolerance +- 0.001

c. Is the average treatment effect significant? Yes No

Feedback/Solution
d <- read.csv("malaria.csv")

a.
fit1 <- glm( adhere ~ treated, data=d)
( stats1 <- summary(fit1)$coef["treated",] ) # ATE and SE

b. stats1[1] + c(-1,1)*1.96*stats1[2] # confidence interval

c. The interval contains 0, so there is not a significant effect.

Problem 5-2
Use boot to run a bootstrap of the average treatment effect with 1,000 bootstrap samples. Set your seed to be 1 first using set.seed(1).

a. Using the percentiles, find the 95% interval and report to 3 decimal places.
Lower Bound: -0.005 tolerance +- 0.001
Upper Bound: 0.161 tolerance +- 0.001

b. Using the percentiles, find the bias adjusted 95% interval and report to 3 decimal places.
Lower Bound: -0.005 tolerance +- 0.001
Upper Bound: 0.161 tolerance +- 0.001


Feedback/Solution
Note: this code assumes the malaria dataset object is called "d".

library(parallel)
library(boot)
getATE <- function(data, ind){
  coef(glm(adhere ~ treated, data=data[ind,]))["treated"]
}
set.seed(1)
(boot1 <- boot(d, getATE, 1000, parallel="snow", ncpus=detectCores()))

a. quantile(boot1$t, c(.025,.975)) ## percentile bootstrap

b. quantile(2*boot1$t0 - boot1$t, c(0.025,.975)) ## bias adjusted

Problem 5-3
Do you have any evidence that the treatment selection is driven by the patient's age or sex? Run a regression to find if the age and male covariates are predictive of treatment status. Make sure to model the interaction in your model. Yes No

Feedback/Solution summary(glm(treated ~ age*male, data=d))

From the output, we see the p-values are large.

Problem 5-4
Break the age into ten year chunks (1-10, 11-20, etc.) and estimate a heterogeneous treatment effects model where the treatment effect depends upon age category and sex. First, use cut to break the age into chunks.

max(d$age) # look at max to find out how many intervals you need
d$agegrp <- cut(d$age, 10*(0:9))
levels(d$agegrp)



a. In the next question, you will run a bootstrap for the 95% confidence interval for the CATE for a 25 year old female. Fill in the blanks for the correct getCATE function to feed into your bootstrap.

getCATE <- function(data, ind){
  fit <- glm(adhere ~ treated*agegrp*male, data=data[ind,])
  predict(fit, data.frame(treated=1, male=0, agegrp="(20,30]")) -
    predict(fit, data.frame(treated=0, male=0, agegrp="(20,30]"))
}

b. Use the getCATE function from (a) in boot to run a bootstrap with 1,000 bootstrap samples to estimate the 95% CI for the CATE for a 25 year old female. Set the seed using set.seed(1) and report your answer to 3 decimal places. Note: R will return a warning because we include factor levels that are not identified. You can ignore it.
Lower Bound: -0.104 tolerance +- 0.001
Upper Bound: 0.245 tolerance +- 0.001

c. Use the same output to estimate the bias adjusted 95% CI for the CATE for a 25 year old female. Report your answer to 3 decimal places.
Lower Bound: -0.119 tolerance +- 0.001
Upper Bound: 0.229 tolerance +- 0.001

Feedback/Solution
fit2 <- glm(adhere ~ treated*agegrp*male, data=d)

a.
getCATE <- function(data, ind){
  fit <- glm(adhere ~ treated*agegrp*male, data=data[ind,])
  predict(fit, data.frame(treated=1, male=0, agegrp="(20,30]")) -
    predict(fit, data.frame(treated=0, male=0, agegrp="(20,30]"))
}

b.
set.seed(1)
(boot2 <- boot(d, getCATE, 1000, parallel="snow", ncpus=detectCores()))
quantile(boot2$t, c(.025,.975))

c. quantile(2*boot2$t0 - boot2$t, c(0.025,.975))

Problem 5-5
Use the code below to simulate data from a regression discontinuity design and recover an estimate of your treatment effect. Our simulator has two linear models. The treatment effect at r=0 is 3-1=2 (difference in intercepts).

fcntrl <- function(r){ 1 + 10*r + rnorm(length(r), sd=5) } # control
ftreat <- function(r){ 3 + 5*r + rnorm(length(r), sd=5) }  # treatment

Simulate 100 observations centered on r=0, and fit a linear model on either side.



set.seed(8)
r <- runif(100, -3, 3)
treat <- as.numeric(r>0)
y <- fcntrl(r)*(1-treat) + ftreat(r)*treat
plot(r, y, col=1+(r>0))
rdfit <- glm( y ~ r*treat )
points(r, rdfit$fitted, pch=20, col=1+treat)

Find the estimated CATE at r=0. Round to 2 decimal places. 2.93 tolerance +- 0.01

Feedback/Solution
summary(rdfit)
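Because the CATE at r=0 is the jump in intercepts between the control and treated lines, it can also be pulled directly out of the fit as the coefficient on treat (a small sketch, assuming rdfit from the code above):

coef(rdfit)["treat"]  # estimated treatment effect at r=0, about 2.93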

Chapter 6 Causal Inference with Controls

Gender effect on wages
In the hdm package there is the cps2012 data on wages. We will fit models here to understand the effect of gender on wage after controlling for the other census variables (which include educational status, geography, and transformations of years of potential experience). Use the following code to create your outcome variable (lnw, the log hourly wage), treatment (female), and matrix of control variables (xbig).

library(hdm)
data(cps2012)

library(gamlr)

xbig <- sparse.model.matrix(~(.-weight)^2-1, data=cps2012[,-(1:3)])
female <- cps2012$female
lnw <- cps2012$lnw
dim(xbig)

Problem 6-1 a. Use a regression to calculate the effect of female on lnw without conditioning on any controls. Report the estimate of the additive effect. Round your answer to 4 decimal places. -0.2609 +- 0.0001



b. Recall that lnw is log hourly wage. Exponentiate the coefficient for female to get an estimate of the multiplicative effect. Round your answer to 4 decimal places. 0.7704 +- 0.0001

c. Use the following code to run a bootstrap for the multiplicative effect of female. Notice that we exponentiate the coefficient since the regression response was log hourly wage, and also notice we set the seed to 1 so that we all get the same answer.

library(parallel)
library(boot)
getATE <- function(data, ind){
  exp(coef(glm(lnw ~ female, data=data[ind,]))["female"])
}
set.seed(1)
(boot1 <- boot(data.frame(lnw=lnw, female=female), getATE, 1000, parallel="snow", ncpus=detectCores()))

Use the results to calculate a 95% interval for the multiplicative effect of female. Round your answer to 4 decimal places.
Lower Bound 0.7595 +- 0.0001
Upper Bound 0.7821 +- 0.0001

d. Use the bootstrap results again, but this time to calculate the bias corrected 95% interval for the multiplicative effect of female. The bias corrected interval is better in this case since exp(beta.hat) is biased. Round your answer to 4 decimal places.
Lower Bound 0.7586 +- 0.0001
Upper Bound 0.7813 +- 0.0001

Explanation/Solution
a. The effect of female on lnw, without conditioning on any controls, is indicated by the coefficient of female in the output of the glm model, estimated by: summary(glm(lnw~female))
b. exp(-0.260863)
c. quantile(boot1$t, c(.025,.975))
d. quantile(2*boot1$t0 - boot1$t, c(0.025,.975))

Problem 6-2
a. Use the doubleML function in the gamlr package for double ML to recover the additive effect of gender on log hourly wage after conditioning on the variables in xbig. Set the seed to 100 and use 5 folds. Report the estimate of the additive effect. Note: Round your answer to 4 decimal places. -0.2793 +- 0.0001

b. Now find a 95% CI for the multiplicative effect of gender on hourly wage. One easy way to find this interval is to simulate from the sampling distribution implied by double ML. (You



could also run double ML inside a bootstrap loop, which will take much longer.) Run the code below to sample from the implied sampling distribution (and note the seed is set to 1 so we all get the same answer).

stats <- summary(dml)$coef[1,]
set.seed(1)
tesamp <- exp(rnorm(1000, mean=stats[1], sd=stats[2]))

Use this sample to find the 95% confidence interval for the multiplicative effect of gender on hourly wage. Note: Round your answer to 4 decimal places.
Lower Bound 0.7452 +- 0.0001
Upper Bound 0.7669 +- 0.0001

Explanation/Solution
a. The estimated additive effect is indicated by the coefficient for d in the output.

set.seed(100)
dml <- doubleML(xbig, d=female, y=lnw, nfold=5)
summary(dml)

b. quantile(tesamp, c(0.025,.975))

Chapter 7 Trees and Forests

Telemarketing
We will consider the telemarketing data from chapter 3, where the goal is to predict the probability that a call results in a subscription (subscribe=1) for a term deposit product from the bank.
<telemarketing.csv> <telemarketing_description.txt>

Problem 7-1
Fit a CART tree to predict whether or not a call results in a subscription. Plot the resulting tree as a dendrogram. What looks like the most important variable?
a. contact:c
b. durmin
c. pweek
d. poutcome

Explanation/Solution



tlmrk <- read.csv("telemarketing.csv", strings=T)
library(tree)
treeTD <- tree( subscribe==1 ~ ., data=tlmrk)
plot(treeTD)
text(treeTD)

The earliest splits, and most of the splits, are on durmin, the length of the call.

Problem 7-2 a. Use the following code to split the data into a test and train set. Again, note that we are setting the seed to 1 so that we all get the same answer. If you have not done so yet, read the data in using: tlmrk <- read.csv("telemarketing.csv", strings=T)

set.seed(1)
testi <- sample.int(nrow(tlmrk), 1000)
test <- tlmrk[testi,]
train <- tlmrk[-testi,]

Now, fit a random forest with 200 trees and importance = "impurity" to predict the subscription probability, using only the training sample. What is the most important variable?
a. age
b. balance
c. durmin
d. pweek
e. poutcome

b. Now fit the lasso from Chapter 3 using the code below.

library(gamlr)
tlmrkX <- naref(tlmrk[,-15])
xTD <- sparse.model.matrix(~.^2 + I(durmin^2), data=tlmrkX)
fitTD <- gamlr(xTD[-testi,], tlmrk$subscribe[-testi], family="binomial")
plot(fitTD)

Then get the predicted values for the test set for the random forest and lasso using the code below. Note that this code assumes you called your random forest rfTD.

plin <- predict(fitTD, xTD[testi,], type="response")
# random forest predictions are the second column (it produces a matrix)
prf <- predict(rfTD, test)$predictions[,2]

Now calculate misclassification rates for the lasso and random forest using 0.5 as a cutoff. Round your answers to 3 decimal places.
Lasso 0.094 +- 0.001
Random Forest 0.115 +- 0.001

Explanation/Solution
a.
library(ranger)
rfTD <- ranger(subscribe ~ ., data=train, prob=TRUE, num.tree=200, importance="impurity")
sort(rfTD$variable.importance, decreasing=TRUE)

durmin is the first in the sorted list (most to least important).

b. Create confusion matrices for the lasso and random forest, then calculate misclassification rates.

confusLasso <- table(as.numeric(plin>=0.5), test$subscribe)
confusRF <- table(as.numeric(prf>=0.5), test$subscribe)
sum(confusLasso[1,2] + confusLasso[2,1]) / sum(confusLasso)
sum(confusRF[1,2] + confusRF[2,1]) / sum(confusRF)

-- OR -- compute directly:

mean( (plin<=0.5)*test$subscribe + (plin>0.5)*(1-test$subscribe) )
mean( (prf<=0.5)*test$subscribe + (prf>0.5)*(1-test$subscribe) )

You can also calculate the deviance and evaluate that way.

-2*mean( log(plin)*test$subscribe + log(1-plin)*(1-test$subscribe) )
-2*mean( log(prf)*test$subscribe + log(1-prf)*(1-test$subscribe) )



Chapter 8 Factor Models

Web Browsing Data
For this problem set, we will (again) look at the web browser history for 10k users for 1000 heavily trafficked websites. The data was obtained in the early 2000s. Each browser in the sample spent at least $1 online in the same year. This is a simple version of the data that are used to build targeted advertising products that predict which customers are most likely to make purchases. The data are stored in three files.
* browser-domains.csv contains the counts for visits from each user ID to each website ID.
* browser-sites.txt contains the full names of the domains for each website ID.
* browser-totalspend.csv contains the total amount spent online that year for each user ID.

Using the code below, you can read the data and convert it into a simple triplet matrix that contains a column for every website, a row for every user, and entries that are 1 if the user visited that website and 0 otherwise.

library(Matrix)
web <- read.csv("browser-domains.csv")
sitenames <- scan("browser-sites.txt", what="character")
web$site <- factor(web$site, levels=1:length(sitenames), labels=sitenames)
web$id <- factor(web$id, levels=1:length(unique(web$id)))
xweb <- sparseMatrix(
  i=as.numeric(web$id), j=as.numeric(web$site),
  # replace this with x=web$visits to have a matrix of counts instead of binary 0/1
  x=rep(1,nrow(web)),
  dims=c(nlevels(web$id),nlevels(web$site)),
  dimnames=list(id=levels(web$id), site=levels(web$site)))
yspend <- read.csv("browser-totalspend.csv", row.names=1) # use 1st column as row names
yspend <- as.matrix(yspend) ## good practice to move from dataframe to matrix

We now have `yspend` as the user's total spending and `xweb` as their browser history. Here, we will convert this into a dense matrix as required for prcomp.

x <- as.matrix(xweb)

Problem 8-1
Choose all that apply. Which of the following are sites that household 1 visited?
amazon.com
atdmt.com
ebay.com
facebook.com
google.com
weather.com
weatherbug.com
yahoo.com

Explanation/Solution
head(xweb[1, xweb[1,]!=0])
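head() shows only the first six visited sites. To check the full list of answer options against household 1's complete browsing history (a sketch, assuming xweb from the setup code; the choices vector simply restates the options listed above):

visited <- names(which(xweb[1,] != 0))   # all sites visited by household 1
choices <- c("amazon.com","atdmt.com","ebay.com","facebook.com",
             "google.com","weather.com","weatherbug.com","yahoo.com")
choices[choices %in% visited]            # the options that appear in the browsing history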

Problem 8-2
Fit PCA on xweb and plot the screeplot.

a. What is the variance explained by the first principal component?
0 - 25
>25 - 50
>50 - 75
>75

b. What is the proportion of variance explained by the first principal component? Round your answer to 3 decimal places. 0.086 +- 0.01

Explanation/Solution
a.
pca <- prcomp(x, scale=TRUE)
plot(pca)

Look at the screeplot and see that the first bar is taller than 75.

b. summary(pca) # you may need to increase the max.print range to see proportion of variance data. Use the code: options(max.print=10000)

The "Proportion of Variance" for PC1 is listed as 0.08583, so PC1 explains only a very small share of the total variance!
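The same proportion can be computed directly from the PCA standard deviations instead of scanning the long summary() printout (assuming pca from part a):

pca$sdev[1]^2 / sum(pca$sdev^2)  # proportion of variance explained by PC1, about 0.086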

Problem 8-3
Use the code below to sample a random test sample of 1000 observations to use as test data (the remaining data is the training data). Notice the seed is set so that we all get the same "random" sample for our test data.



set.seed(1)
testi <- sample.int(nrow(x), 1000)

Compare out-of-sample (OOS) predicted log spending using PC regression for the first 10 components to that using a Lasso.

a. Fit a principal components regression using the first 10 components to predict log spending using the training data. Calculate the OOS MSE and report your answer to 3 decimal places. 2.400 +- 0.001

b. Run a Lasso regression to predict log spend from browsing history on the training data. Calculate OOS MSE and round your answer to 3 decimal places. 2.148 +- 0.001

Explanation/Solution
a.
V <- as.data.frame(predict(pca)[,1:10])
fitPC <- glm(log(yspend)[-testi] ~ ., data=V[-testi,])
summary(fitPC)
yhatPC <- predict(fitPC, V[testi,]) # predictions for test data
mean( (log(yspend)[testi]-yhatPC)^2 ) # MSE PC regression

b.
library(gamlr)
spender <- gamlr(xweb[-testi,], log(yspend)[-testi]) # fit on train data
yhatLasso <- predict(spender, xweb[testi,]) # predict on test data
mean( (log(yspend)[testi]-yhatLasso)^2 ) # MSE lasso

Chapter 9 Text as Data



The data for this homework first appear in Gentzkow and Shapiro (2010) and summarize the first year of the 109th Congressional Record (2005) containing all speech in that year for members of the US house and senate. It is stored in the textir package in R in two separate datasets. Use the following 2 lines of code to load the data.

library(textir)
data(congress109)

The text is already tokenized into bigrams (two-word phrases) after stop-word removal and stemming (using a Porter stemmer). The matrix congress109Counts contains the number of times each phrase in a list of 1000 common bigrams and trigrams was used in the 109th Congress by each of the 529 members of the House and Senate.

> congress109Counts[c("Barack Obama", "John Boehner"), 995:998]
2 x 4 sparse Matrix of class "dgCMatrix"
             stem.cel natural.ga hurricane.katrina trade.agreement
Barack Obama        .          1                20               7
John Boehner        .          .                14               .

The data also contain information about each member of Congress.

> congress109Ideology[1:4,]
                           name party state chamber  repshare   cs1    cs2
Chris Cannon       Chris Cannon     R    UT       H 0.7900621 0.534 -0.124
Michael Conaway Michael Conaway     R    TX       H 0.7836028 0.484  0.051
Spencer Bachus   Spencer Bachus     R    AL       H 0.7812933 0.369 -0.013
Mac Thornberry   Mac Thornberry     R    TX       H 0.7776520 0.493  0.002

We will be considering the variable repshare, which is the proportion of the two party vote by the members' constituents (districts for members of the house, state for senators) that was captured by the republican candidate (GW Bush) in the 2004 presidential election.

Problem 9-1
Fit a lasso regression that predicts repshare from whether or not each member used each text token. Hint: Use the following line of code to create a matrix with an indicator for whether each bigram was used or not.

x <- congress109Counts > 0

a. Find the top 3 democrat tokens using the AICc coefficients. Hint: These should be negative coefficients.
action.lawsuit
congressional.black.caucu
family.value
human.embryo
illegal.immigration
issue.facing.american
look.forward
million.illegal.alien
minority.owned.business
voter.registration
war.terror

b. Find the top 3 republican tokens using the AICc coefficients. Hint: These should be positive coefficients.
action.lawsuit
congressional.black.caucu
family.value
human.embryo
illegal.immigration
issue.facing.american
look.forward
million.illegal.alien
minority.owned.business
voter.registration
war.terror

Explanation/Solution
First fit the lasso.

library(gamlr)
x <- congress109Counts > 0
y <- congress109Ideology$repshare
fitlin <- gamlr(x, y)
plot(fitlin)

Then, extract the AICc coefficients.

b <- coef(fitlin)[-1,]

a. sort(b)[1:3] # most negative coefficients
b. sort(b, decreasing=TRUE)[1:3] # most positive coefficients

Problem 9-2

Use the following line of code to create a matrix with an indicator for whether each bigram was used or not. x <- congress109Counts>0

Then, set the seed to 20, load the text2vec library, and fit a 10 topic LDA model.



a. Find the top 15 tokens in each topic according to topic token probabilities. Which of the following is a plausible name for the first topic?
a. education
b. energy
c. spending
d. war

b. Find the top token in each topic according to topic token lift. What is the top issue for the first topic?
a. fund.tax.cut
b. human.right.body
c. safe.drinking.water
d. social.security.crisi

c. Regress repshare onto the 10 topic LDA. Looking at the output from the summary function, what does this tell you about the partisanship of these topics?
a. they are largely democrat
b. they are largely republican
c. they are split roughly equally between democrat and republican

Explanation/Solution
library(text2vec)
set.seed(20)
tpc <- LDA$new(n_topics=10)
W <- tpc$fit_transform(x)

a. tpc$get_top_words(n=15, lambda=1)

b. tpc$get_top_words(n=5, lambda=0)

c.
fitLDA <- glm(y ~ W)
summary(fitLDA)

The coefficients are largely significantly positive.


