Solutions Manual for Foundations of Statistics for Data Scientists With R and Python, 1st Edition by

Page 1


Solutions Manual for Foundations of Statistics for Data Scientists With R and Python, 1e by Alan Agresti, Maria Kater (All Chapters)

Chapter 1 1.1 (a) (i) an individual voter, (ii) the 1882 voters in the exit poll, (iii) the 11.1 million people who voted (b) Statistic: Sample percentage of 52.5% who voted for Feinstein Parameter: Population percentage of 54.2% who voted for Feinstein 1.2 (a) Use a command such as in R, > Students <- read.table(" +

header=TRUE)

(b) (i) What proportion of the students in this sample responded yes for whether abortion should be legal in the first three months; (ii) Same question but for some population, such as all social science graduate students at the University of Florida 1.3 (a) Quantitative; (b) categorical; (c) categorical; (d) quantitative 1.4 (a) Religious affiffiliation (possible categories Christianity, Islam, Jewish, Hinduism, Buddhism, other, none) (b) Body/mass index (BMI = (weight in kg)/(height in meters)2 (c) Number of children in family (d) Height of a person 1.5 Ordinal, because categories have natural ordering 1.6 (a) College board score (e.g., SAT between 200 and 800) (b) Time spent in college (measure by integer number of years) 1.7 In R, for students numbered 00001 to 52000, > sample(1:52000, 10) [1] 1687 18236 26783 35366 14244 11429 20973 31436 48476

1.8 (a) observational, (b) experiment (c) observational, (d) experiment 1.9 Median = 4, mode = 2, expect mean larger than median because distribution is skewed right 1.10 (a) 3925

1


2

Solutions Manual: Foundations of Statistical Science for Data Scientists > Carbon <- read.table("http://stat4ds.rwth-aachen.de/data/Carbon_West.dat", + header=TRUE) > breaks <- seq(2.0, 18.0, by=2.0) > freq <- table(cut(Carbon$CO2, breaks, right=FALSE)) > cbind(freq, freq/nrow(Carbon)) freq [2,4) 4 0.11428571 [4,6) 15 0.42857143 [6,8) 7 0.20000000 [8,10) 6 0.17142857 [10,12) 0 0.00000000 [12,14) 0 0.00000000 [14,16) 2 0.05714286 [16,18) 1 0.02857143 > hist(Carbon$CO2)

(b) Mean = 6.72, median = 5.90, standard deviation = 3.36 mean(Carbon$CO2); median(Carbon$CO2); sd(Carbon$CO2)

1.11 Skewed to the right, because the mean is much larger than the median. 1.12 Number of times you went to a gym in the last week; median = 0 if more than half of persons in the sample never went. 1.13 (a) 63,000 to 75,000; (b) 57,000 to 81,000; (c) 51,000 to 87,000. 100,000 would be unusual because it is more than 5 standard deviations above the mean. 1.14 A quarter of the states had less that 6% without insurance, and a quarter had more than 9.5% without insurance. Half the states had between 6% and 9.5% without insurance, encompassing an interquartile range of 3.5%. 1.15 Skewed to the right, because distances of median from LQ and minimum are less than from UQ and maximum. 1.16 (a) The percentages in 2018 (with the default composite weight) for (0, 1, 2, 3, 4, 5, 6, ≥ 7) are (9.4, 24.8, 24.9, 14.8, 10.7, 5.3, 3.5, 6.7), somewhat skewed to the right. (b) Mode = 2, median = 2 (c) Mean = 2.8, standard deviation = 2.6. The lowest possible observation is only slightly more than a standard deviation below the mean, whereas in bell-shaped distributions, observations can occur two or three standard deviations from the mean in each direction. 1.17

> Murder <- read.table("http://stat4ds.rwth-aachen.de/data/Murder.dat", header=TRUE) > Murder1 <- Murder[Murder$state!="DC",] # data frame without D.C.

(a) Mean = 4.87, standard deviation = 2.59 > mean(Murder1$murder); sd(Murder1$murder)

(b) Minimum = 1.0, LQ = 2.6, median = 4.85, UQ = 6.2, maximum = 12.4, somewhat skewed right > summary(Murder1$murder); boxplot(Murder1$murder)

(c) Repeat the analysis above for Murder1$murder. The DC is a large outlier, causing the mean to increase (from 4.87 to 5.25) and the range to increase dramatically (from 11.4 to 23.2). 1.18 (a) Histogram is skewed right.


Solutions Manual: Foundations of Statistical Science for Data Scientists

3

> Income <- read.table("http://stat4ds.rwth-aachen.de/data/Income.dat", + header=TRUE); attach(Income) > hist(income)

(b) Five number summary is min. = 16, lower quartile = 22, median = 30, upper quartile = 465, max. = 120; also mean = 37.52 and standard deviation = 20.67. > summary(income); sd(income)

(c) Density approximation with default bandwidth = 6.85 is skewed right. Increasing the bandwidth (such as to 12) makes the curve smoother and bell-shaped, but still skewed. Decreasing it (such as to 3) makes it much bumpier and probably a poorer portrayal of a corresponding population distribution. > plot(density(income)) # default bandwidth = 6.85 > plot(density(income, bw=12))

(d)

> boxplot(income ~ race, xlab="Income", horizontal=TRUE) > tapply(income, race, summary) $B Min. 1st Qu. Median Mean 3rd Qu. Max. 16.00 19.50 24.00 27.75 31.00 66.00 $H Min. 1st Qu. Median Mean 3rd Qu. Max. 16.0 20.5 30.0 31.0 32.0 58.0 $W Min. 1st Qu. Median Mean 3rd Qu. Max. 18.00 24.00 37.00 42.48 50.00 120.00 > install.packages("tidyverse") > library(tidyverse) > Income %>% group_by(race) %>% summarize(n=n(),mean=mean(income),sd=sd(income)) race n mean sd 1 B 16 27.8 13.3 2 H 14 31 12.8 3 W 50 42.5 22.9

1.19 (a) Highly skewed right > Houses <- read.table("http://stat4ds.rwth-aachen.de/data/Houses.dat", + header=TRUE); attach(Houses) > PriceH <- hist(price); hist(price) # save histogram to use its breaks > breaks <- PriceH$breaks # breaks used in histogram > freq <- table(cut(Houses$price,breaks, right=FALSE)) > cbind(freq,freq/nrow(Houses)) # frequency table (not shown)

(b) y = 233.0, s = 151.9; 85%, not close to 68% because not bell-shaped but highly skewed > length(case[mean(price)-sd(price)<price & price<mean(price+sd(price)]) / + nrow(Houses)

(c) The boxplot shows many large observations that are outliers. > boxplot(price)

(d)

> tapply(Houses$price, Houses$new, summary) $`0` Min. 1st Qu. Median Mean 3rd Qu. Max. 31.5 135.0 190.8 207.9 240.0 880.5 $`1` Min. 1st Qu. Median Mean 3rd Qu. Max. 158.8 256.9 427.5 436.4 519.7 866.2

New homes tend to have higher selling prices. 1.20 (a) Clear trend that price tends to increase as size increases.


4

Solutions Manual: Foundations of Statistical Science for Data Scientists > plot(size, price)

(b) 0.834, strong positive association > cor(size, price)

(c) Predicted price = −76.39 + 0.19(size), which is 113.5 thousand dollars at 1000 square feet and 683.2 thousand dollars at 4000 square feet. > summary(lm(price ~ size)) # linear model: read the coefficients estimates > pred <- function(x){-76.3894+0.1899*x}; pred(1000); pred(4000)

1.21 Correlation = 0.278 (positive but weak), predicted college GPA is 2.75 + 0.22(high school GPA), which is 3.6 for high school GPA of 4.0. 1.22

> Happy <- read.table("http://stat4ds.rwth-aachen.de/data/Happy.dat", header=TRUE) > Happiness <- factor(Happy$happiness); Marital <- factor(Happy$marital) > levels(Happiness) <- c("Very happy", "Pretty happy", "Not too happy") > levels(Marital) <- c("Married", "Divorced/Separated", "Never married") > table(Marital, Happiness) # forms contingency table Happiness Marital Very happy Pretty happy Not too happy Married 432 504 61 Divorced/Separated 92 282 103 Never married 124 409 135 > prop.table(table(Marital,Happiness), 1) Happiness Marital Very happy Pretty happy Not too happy Married 0.43329990 0.50551655 0.06118355 Divorced/Separated 0.19287212 0.59119497 0.21593291 Never married 0.18562874 0.61227545 0.20209581

Married subjects are more likely to be very happy and less likely to be not too happy than the other subjects. 1.23

> attach(Students) > table(relig, abor) abor relig 0 1 0 1 14 1 4 25 2 1 6 3 7 2

The very religious (attending every week) are less likely to support legal abortion (only 2 of the 9 observations in support). 1.24 (a) Values are skewed right, with mean 153.9 and median 119.8 and a very high outlier of 716 for the U.S. (b) 0.90 between GDP and HDI. (c) correlation = 0.674, predicted CO2 = 1.926 + 0.178(GDP), which increases dramatically between 2.71 at the minimum GDP = 4.4 and 13.11 at the maximum.GDP = 62.9. 1.25

> Races <- read.table("http://stat4ds.rwth-aachen.de/data/ScotsRaces.dat", header=TRUE) > attach(Races) > par(mfrow=c(2,2)) # a matrix of 2x2 plots in one graph > boxplot(timeM); boxplot(timeW) > hist(timeM); hist(timeW) > summary(timeM) Min. 1st Qu. Median Mean 3rd Qu. Max. 15.10 47.63 67.17 84.88 113.91 439.15


Solutions Manual: Foundations of Statistical Science for Data Scientists

5

> summary(timeW) Min. 1st Qu. Median Mean 3rd Qu. Max. 18.75 55.82 79.72 100.72 140.69 490.05 > dev.off() # reset the graphical parameter mfrow > plot(timeM, timeW) > cor(timeM, timeW) [1] 0.9958732 > summary(lm(timeW ~ timeM)) Estimate (Intercept) 4.05731 timeM 1.13879

Both distributions are skewed to the right with an extreme outlier for a race with a record time much higher than the others. The men’s and women’s record times are very strongly correlated, so one could predict one of the them well based on the other, such as with the linear equation 4.057 + 1.139(timeM) for predicting the women’s record time from the men’s. 1.26

> cor(timeM, distance); cor(timeM, climb) [1] 0.9629676 [1] 0.672009 > plot(distance, timeM) > summary(lm(timeM ~ distance)) Estimate (Intercept) -1.1430 distance 5.1718 --> plot(climb, timeM) > summary(lm(timeM ~ climb)) Estimate (Intercept) 14.49 climb 79.34

See also analyses shown in Section 6.1.4. 1.27

> Sheep <- read.table("http://stat4ds.rwth-aachen.de/data/Sheep.dat",header=TRUE) > attach(Sheep) > tapply(weight, survival, summary) $`0` Min. 1st Qu. Median Mean 3rd Qu. Max. 3.1 12.0 14.8 16.0 20.0 32.8 $`1` Min. 1st Qu. Median Mean 3rd Qu. Max. 6.10 17.00 21.60 20.65 24.20 34.20 > tapply(weight, survival, sd) 0 1 5.326672 4.899645 > boxplot(weight ~ survival, xlab="weight", horizontal=TRUE)

1.28 Could treat opinion about legalization of same-sex marriage as response variable, with other variables listed as explanatory variables. Could treat political party affiliaton as response variable, with other variables listed after it as explanatory variables. Some variables that are naturally fixed for each subject, such as race and gender, would only be explanatory variables. Opinion and affiliation could plausibly be causally dependent on the explanatory variables such as education and income, so are natural response variables. 1.29 Every possible sample is not equally likely; e.g. two people listed next to each other on the list cannot both be in sample. 1.30 (b)


6

Solutions Manual: Foundations of Statistical Science for Data Scientists

1.31 (b) 1.32 1.33 Observational. No, correlation does not imply causation. 1.34 Family income may be a lurking variable. Families with higher income may be more likely to buy bottled water and may tend to have fewer dental problems because of having sufficient money to visit a dentist regularly. 1.35 When n is only 20, the shape is highly irregular, varying a lot from sample to sample. So, with a small sample, we should be cautious in concluding that the population distribution looks like the sample data distribution for the sample we are analyzing. 1.36 Any symmetric data set, such as {0, 1, 1, 2, 2, 2, 3, 3, 4} 1.37 Estimate (ii) because those having more friends are more likely to be sampled. 1.38 (a) For highly skewed distributions, mean can be quite different from a typical observation, as in the Leonardo’s Pizza Joint example in Section 1.4.3. (b) For highly discrete distributions, many percentiles are exactly the same. If the values (0, 1, 2, 3, 4, 5, 6, 7) for number of times playing a sport in the past week have percentages (60, 20, 10, 5, 3, 2), the median of 0 is not very informative. 1.39 (a) Range summarizes only two most extreme observations whereas s uses all the data. (b) IQR is not affected by outliers, whereas the range is greatly affected by a single outlier. 1.40 The median and the IQR are not affected by the magnitude of the largest observation. The mean increases somewhat, and the range may increase a lot. 1.41 The standard deviation of the y values generated should be about 2.9 when n = 30 for each sample and 0.5 when n = 1000 for each sample. For small samples, different studies can get quite different results, even if both use randomization to obtain the data. 1.42 (a) Highly skewed to right with a large percentage of observations at 0. (b) Plausible to have strong positive correlation because a person’s exercise one week tends to be similar to that person’s exercise another week. 1.43 [∑i (yi + c)]/n = (∑i yi )/n + nc/n = y + c. ∑i [(yi + c) − (y + c)]2 = ∑i (yi − y)2 , so s2 and s do not change. [∑i (cyi )]/n = c(∑i yi )/n = cy. ∑i (cyi − cy)2 /(n − 1) = ∑i [c(yi − y)]2 /(n − 1) = c2 ∑i (yi − y)2 /(n − 1), so s2 multiplies by c2 and thus s multiplies by ∣c∣. 1.44 (a) No, mean of logs is not log of mean. (b) Yes, the log transform preserves the order. (c) exp(x) = exp[(∑i log(yi ))/n] = exp[(1/n) log(∏ yi )] = exp[log(∏ yi )1/n ] = (∏i yi )1/n . 1.45 (a) 1, (b) 1/4, (c) 1/9, much larger than the values 32%, 5%, close to 0% for a bell-shape; Chebyshev applies to any distribution, not only bell-shaped ones. 1.46


Solutions Manual: Foundations of Statistical Science for Data Scientists

7

1.47 Taking the derivative of the function f with respect to c and setting it equal to zero gives −2[∑i (yi − c)] = 0 = −2[∑i yi − nc], from which c = (∑i yi )/n = y (critical point). Since f is a convex function of c, the critical point y corresponds to a global minimum. 1.48 Notice that the median is not unique unless n is an odd number. Imagine you are at the median, say for n odd, so it is unique, with at most (n − 1)/2 observations below and at most (n − 1)/2 above. Then when you move in a particular direction by any fixed amount, say d, you are moving closer to at most (n − 1)/2 points (decreasing the total sum of absolute distances by at most d(n−1)/2) but moving farther away from at least (n − 1)/2 + 1 = (n + 1)/2 points (thus increasing the total distance by at least d(n + 1)/2). You are increasing more than decreasing, so you were at the minimum sum of absolute distances when at the median. If n is even, start anywhere between the two points and as you move the total distance does not change until you pass one of those points, and then the same argument works with at most n/2 − 1 in the direction of the move and at least n/2 + 1 in the other direction. 1.49 (a)

Chapter 2 2.1

> sample(0:9, 1) [1] 6 # 0, 1, 2 are rain, so no rain on next day > sample(0:9, 10, replace=TRUE) [1] 0 3 5 6 2 2 4 6 9 0 # rain on days 1, 5, 6, 10 > rbinom(1, 100, 0.30) [1] 34 # proportion: 34/100 = 0.34 > rbinom(1, 10000, 0.30) [1] 2931 # proportion: 2931/10000 = 0.2931 > rbinom(1, 1000000, 0.30) [1] 299657 # proportion: 0.299657 (close to 0.300000)

2.2 (a) (i) (0.95)2 = 0.9025. (ii) (0.05)2 = 0.0025 y P (y) 0 0.0025 (b) 1 0.0950 2 0.9025 √ (c) 0.95 = 0.97468 2.3 (a) X given Y (b) Need also P (Y = w), where w denotes the event ”white”. Using 0.60 for it and b for ”black”, P (Y = w ∣ X = w) =

P (X = w ∣ Y = w)P (Y = w) P (X = w ∣ Y = w)P (Y = w) + P (X = w ∣ Y = b)P (Y = b)

=

(0.93)(0.60) = 0.903. (0.93)(0.60) + (0.15)(0.40)

2.4 (a) Denote the five wines by 1, 2, 3, 4, 5. Let the sample point (W1 , W2 , W3 , W4 , W5 ) denote guessing W1 for wine 1, . . ., W5 for wine 5, such as (1, 3, 2, 5, 4) to mean


8

Solutions Manual: Foundations of Statistical Science for Data Scientists guessing wine 1 is wine 1, guessing wine 2 is wine 3, ... , guessing wine 5 is wine 4 (and hence getting only 1 correct). There are 5! sample points, for the possible permutations of the first five integers. (b) Of the 5! = 120 permuations, only one has all five wines correct, so the probability is 1/120.

2.5 (a) Using numbered days, there are (365)3 possibilities (sample points) for the three birthdays. Of those, (365)(364)(363) are all different. (b) Probability that no two of 23 people have the same birthday is (365 × 364 × 363 × . . . 343)/(365)23 = 0.493, so the probability that at least two have the same birthday is 1 − 0.493 = 0.507. (c) > ind <- 0; iter <- 100000 > for(i in 1:iter){ + s <- sample(1:365,50,replace=TRUE) + d <- unique(s) + ind <- ind + as.numeric(length(d)!=length(s)) } > ind/iter [1] 0.97043

2.6 E(Y ) = ∑y yf (y) = 0(0.48)+1(0.24)+2(0.15)+3(0.07)+4(0.03)+5(0.02)+6(0.01) = 1.03. E(Y 2 ) = ∑y y 2 f (y) = 02 (0.48) + 12 (0.24) + 22 (0.15) + 32 (0.07) + 42 (0.03) + 52 (0.02) + 62 (0.01) = 2.81, so variance = 2.81 − (1.03)2 = 1.7491 and the standard deviation is √ 1.7491 = 1.32. 2.7

> rbinom(1, 30000000, 0.50)/10000000 [1] 1.499795 # theoretical expected value = 1.5000

2.8 (a) Let Ai be the event of catching the virus on trip i, i = 1, . . . , 100. The 100 events are not disjoint, so the probability of the union of the 100 events is not the sum of the probabilities. (b) 1 − P (not catching a virus) = 1 − (0.99)100 = 0.634 2.9 (a) Y is Binomial distrbuted with n = 10 and π = 0.20. (i) P (Y = 10) = 1.024e − 07 = 0.000, (ii) P (Y = 0) = 0.107. (e.g., these are found in R with dbinom(10, 10, 0.20) and dbinom(0, 10, 0.20)). √ √ (b) µ = nπ = 10(0.20) = 2 and σ = nπ(1 − π) = 10(0.20)(0.80) = 1.265. 2.10 In n inferences, P (all correct) = (0.95)n . Setting (0.95)n < 0.50 yields n log(0.95) < log(0.50), or n > log(0.50)/ log(0.95) = 13.5, so 14 or more inferences. (The inequality reverses direction here because of dividing by log(0.95), which is negative.) √ 2.11 Solving for n, having ∣1 − 2π∣/ nπ(1 − π) ≤ c is equivalent to n ≥ (1 − 2π)2 /c2 π(1 − π). For π = 0.20 and c = 0.3, this is 25. 2.12 (a) Assuming independence of the 200 free throws, each with probability of success of 0.80, the probability √ is 0.95 of making within two standard deviations of the mean, 200(0.80) ± 2 200(0.80)(0.20), which is (149, 171). Dividing by 200, the corresponding proportions are (0.74, 0.86). (b) (i) increase, because the standard deviation for the proportion gets smaller as n increases; (ii) decrease, because the standard deviation for the number of successes gets larger as n increases. 2.13 (a) (5/6)6 = 0.335


Solutions Manual: Foundations of Statistical Science for Data Scientists

9

(b) Geometric distribution f (y) = (5/6)y−1 (1/6), y = 1, 2, 3, . . . 2.14 (a) (39, 999, 000/40, 000, 000)10 = 0.99975 (b) For S surveys, solve (39, 999, 000/40, 000, 000)S = 0.50, so S = [log(0.50)]/[log(39, 999, 000/40, 000, 000)] = 27, 726. (c) Geometric distribution f (y) = (39, 999, 000/40, 000, 000)y−1 (1000/40, 000, 000) 2.15 If the company insures a very large number of homes and the event that a home burns down on any particular week is independent from home to home and has the same tiny probability for all insured homes. Under these conditions, Y has a binomial distribution with n = number of homes and π = probability of burning down in a week, and with very large n and very small π, the binomial distribution is approximated by the Poisson with µ = nπ. 2.16 (a) Data have mean 23.4 and variance 424.0. This suggests that the true variance is much larger than the mean (overdispersion), whereas they are the same for the Poisson distribution. (b) It would probably be better for the number of weekly admissions for a rare disease. If we consider the Poisson as an approximation to the binomial, in the first case the probability of admission π probably varies by day of the week (being higher on Friday and Saturday nights) in conflict with the constant value needed for the Poisson approximation. In the second case π might be approximately constant. 2.17 The standard normal probability between z = (900 − 830)/50 = 1.40 and z = (800 − 830)/50 = −0.60 is 0.645. 2.18 (a) (i) 2.3263; (ii) 1.9600; (iii) 1.6449 (b) (i) 1.6449; (ii) 1.9600 (iii) 2.5758 (c) (i) 0.6745; (ii) 1.6449; (iii) 2.3263 (d) qnorm(0.75) = 0.6745 and (µ + 0.6745σ) − (µ − 0.6745σ) = 1.349σ. 2.19 (a) A: 0.159, B: 0.309

(b) 0.660

(c) 0.746

2.20 Perhaps the gamma distribution, because the income values are essentially continuous but are skewed to the right. > Income <- read.table("http://stat4ds.rwth-aachen.de/data/Income.dat",header=T) > income <- Income$income > m <- mean(income); s <- sd(income) # mean and standard deviation of income values > k <- (m/s)^2 # from expression for gamma mean and standard deviation, > lambda <- k/m # shape para. = (mean/standard dev.)^2, rate para. = shape/mean > y <- seq(0, 1000, 0.001) > plot(density(income)) # smooth curve approximation for histogram of income values > lines(y, dgamma(y, shape=k, rate=lambda)) # gamma pdf with same mean, standard dev.

The approximation seems good, although some income values seem further out in right tail than gamma distribution allows. 2.21 The mean and standard deviation both increase, as their values for the gamma distribution are directly proportional to the scale parameter 1/λ. 2.22 (a) P (D and +) = P (+ ∣ D)P (D) = 0.86(0.01) = 0.0086 P (D and −) = P (− ∣ D)P (D) = 0.14(0.01) = 0.0014 P (Dc and +) = P (+ ∣ Dc )P (Dc ) = 0.12(0.99) = 0.1188


10

Solutions Manual: Foundations of Statistical Science for Data Scientists P (Dc and −) = P (− ∣ Dc )P (Dc ) = 0.88(0.99) = 0.8712 0.0086/(0.1188 + 0.0086) = 0.068 is proportion of positive diagnoses that result from actual disease. (b) If a person is HIV-positive, the probability the diagnostic test detects the positive status is 0.999. If a person is not HIV-positive, the probability the diagnostic test is negative is 0.9999. The positive predictive value is the probability that someone who tests positive truly is positive, which is 0.999(0.0001)/[0.999(0.0001) + 0.0001(0.9999)] = 0.4998.

2.23

> Afterlife <- read.table("http://stat4ds.rwth-aachen.de/data/Afterlife.dat", + header=TRUE); attach(Afterlife) > proportions(table(religion, postlife)) 1 2 # sample joint distribution 1 0.61558274 0.07340631 2 0.24726336 0.04185448 3 0.01094656 0.01094656 > proportions(table(postlife)); proportions(table(religion)) 1 2 # sample marginal distribution of postlife 0.8737927 0.1262073 1 2 3 # sample marginal distribution of religion 0.68898905 0.28911784 0.02189311 > proportions(table(religion, postlife), 1) 1 2 # sample conditional distributions of postlife, given religion 1 0.8934579 0.1065421 2 0.8552339 0.1447661 3 0.5000000 0.5000000

12! ) (0.25)3 (0.25)3 (0.50)6 = 0.07. 2.24 (a) ( 3!3!6!

> dmultinom(c(6,3,3), size = 12, prob=c(0.5,0.25,0.25)) [1] 0.07049561

(b) The possible samples for (y1 , y2 , y3 ) were (0, 0, 12), (0, 1, 11), (1, 0, 11), (0, 2, 10), (2, 0, 10), (1, 1, 10). The total probability is 12! 12! ) (0.25)0 (0.25)0 (0.50)12 + ( ) (0.25)0 (0.25)1 (0.50)11 0!0!12! 0!1!11! 12! 12! +( ) (0.25)1 (0.25)0 (0.50)11 + ( ) (0.25)0 (0.25)2 (0.50)10 1!0!11! 0!2!10! 12! 12! +( ) (0.25)2 (0.25)0 (0.50)10 + ( ) (0.25)1 (0.25)1 (0.50)10 = 0.01929. 2!0!10! 1!1!10! Or use the fact that the marginal distribution of Y = Y3 is the binomial distribution with n = 12 and π = π3 = 0.50, and find P (Y ≥ 10) = 1 − P (Y ≤ 9). In R, this is 1pbinom(9, 12, 0.5). The outcome would be very unlikely. (

2.25 E(Y ) = E[E(Y ∣ X)] = E[70 + 0.60(X − 70)] = 70 + 0.60[E(X) − 70] = 70 + 0.60[70 − 70] = 70. 2.26 (a) (i) 0.191, a weak tendency for happiness to increase as family income increases; (ii) 0.190 (you may use weightedCorr of the wCorr library). Relative Happiness Family Income Not too happy Pretty happy Very happy Total Below average 0.04998 0.198849 0.108171 0.357 (b) Average 0.0616 0.24508 0.13332 0.440 Above average 0.02842 0.113071 0.061509 0.203 Total 0.140 0.557 0.303 1.000 For instance, the first cell probability is (0.357)(0.140) = 0.04998.


Solutions Manual: Foundations of Statistical Science for Data Scientists 2.27

> x <- rnorm(1000, 162, 7); > plot(x, y) > cor(x, y) [1] 0.3645389 > mean(x); sd(x) [1] 161.9824 [1] 7.071358 > mean(y); sd(y) [1] 67.65524 [1] 8.460963

11

y <- rnorm(1000, 3.0 + 0.40*x, 8)

The correlation indicates a weak positive association, whereby taller women tend to weigh more. 2.28 The event A partitions into disjoint events AB1 , AB2 , . . . , ABc , so P (A) = P (AB1 ) + ⋯ + P (ABc ) = P (A ∣ B1 )P (B1 ) + ⋯ + P (A ∣ Bc )P (Bc ). 2.29 De Morgan’s law extends to (A1 ∪ A2 ∪ ⋯ ∪ Ap )c = Ac1 Ac2 ⋯Acp . 2.30 P (X = x ∣ Y = y) =

P (X = x and Y = y) P (Y = y ∣ X = x)P (X = x) = P (Y = y) ∑a P (Y = y and X = a)

=

P (Y = y ∣ X = x)P (X = x) . ∑a P (Y = y ∣ X = a)P (X = a)

P (A∣Bj )P (Bj )

Likewise, P (Bj ∣ A) = ∑c

i=1 P (A∣Bi )P (Bi )

2.31 f (x ∣ y) =

f (y∣x)f1 (x) ∫a f (y∣a)f1 (a)da

2.32 Using Bayes’ Theorem, π/[π + (1/k)(1 − π)] = kπ/[1 + (k − 1)π]; with k = 5 this is (i) 0.978 when π = 0.90 and (ii) 0.357 when π = 0.10. and Y =y) = P (Y =y∣X=x)P (X=x) = P (Y =y)P (X=x) = P (X = x) 2.33 P (X = x ∣ Y = y) = P (X=x P (Y =y) P (Y =y) P (Y =y) 2.34 P (A) = P (AB) + P (AB c ) = P (A)P (B) + P (B c ∣ A)P (A), so 1 = P (B) + P (B c ∣ A), so P (B c ∣ A) = P (B c ) and A and B c are independent. 2.35 The probability of an event for a continuous random variable is the area under the pdf for that event. For the event consisting of a single point, this area is 0. For any particular value, such as 29.058392764..., the proportion of times it occurs in the long run is 0. In practice, values are rounded, such as to 29.0, and the proportion might be slightly greater than 0. 2.36 (a) f (x) = 1/(U − L) for L ≤ x ≤ U and 0 elsewhere. E(X) = [

U 1 1 U 2 L2 1 (U + L)(U − L) (U + L) ] ∫ xdx = [ ]( − ) = [ ][ ]= . U −L L U −L 2 2 U −L 2 2

(b) Y = (X − L)/(U − L) linearly transforms X = L to Y = 0 and X = U to Y = 1. Then, X = (U − L)Y + L and µ = E(X) = (U − L)E(Y ) + L = (U − L)/2 √ + L = (U + L)/2. Also, var(X) = (U − L)2 var(Y ) = (U − L)2 /12 so σ = (U − L)/ 12. √ (c) For L = 200 and U = 800, µ = 500 and σ = 600/ 12 = 173.2.


12

Solutions Manual: Foundations of Statistical Science for Data Scientists > y <- runif(1000000, 200, 800); mean(y); sd(y) [1] 499.8442 [1] 173.1643

2.37 f (y) = (1 − π)y π,

y = 0, 1, 2, . . . .

2.38 (a) F (y) = P (Y ≤ y) = π + (1 − π)π + (1 − π)2 π + ⋯ + (1 − π)y−1 π = π[(1 − π)0 + (1 − π) + y (1 − π)2 + ⋯ + (1 − π)y−1 ] = π 1−(1−π) = 1 − (1 − π)y . 1−(1−π) d d y y−1 y (b) E(Y ) = ∑∞ = −π dp [∑∞ π = −π ∑∞ y=1 y(1 − π) y=1 (1 − π) ] = y=1 dp (1 − π) d 1 d y 2 [∑∞ −π dp y=1 (1 − π) ] = −π dp [ 1−(1−π) − 1] = −π(−1/π ) = 1/π.

2.39

y(y − 1)e−µ µy y! y=0

E[Y (Y − 1)] = ∑ y(y − 1)f (y; µ) = ∑ y=0

∞ (y − 1)e−µ µy−1 ye−µ µy =µ∑ = µ2 . (y − 1)! y! y=1 y=0 ∞

=µ∑

Thus, E(Y 2 ) − E(Y ) = µ2 and E(Y 2 ) = E(Y ) + µ2 = µ + µ2 , so that var(Y ) = E(Y 2 ) − [E(Y )]2 = µ. 2.40 (a) The event N > n is equivalent to your waiting time being larger than those of n people. Out of n + 1 people (including you), under the condition of independent, identically distributed waiting times, the probability is 1/(n + 1) that your waiting time is longest, that is, Y > max(Y1 , . . . , Yn ). So, P (N > n) = P (Y > max(Y1 , . . . , Yn ) = 1/(n + 1). (b) With P (N > n) = 1/(n + 1), P (N = n) = P (N > n − 1) − P (N > n) = 1/n − 1/(n + 1) = ∞ ∞ 1/n(n + 1). Then, E(N ) = ∑∞ n=1 n[P (N = n)] = ∑n=1 n[1/n(n + 1)] = ∑n=1 [1/(n + 1)] = ∞. 2.41 For any real number x, the pdf at value y = µ − x is the same as at value y = µ + x, 1 )e−x /2σ . namely ( √2πσ 2

2

2.42

y

−y

=

n−y

n! n! µ µ π y (1 − π)n−y = ( ) (1 − ) y!(n − y)! y!(n − y)! n n

f (y; n, π) = n! µ (1 − ) (n − y)!ny n

−y

n

µy µ n(n − 1)⋯(n − y + 1) µ (1 − ) = (1 − ) y! n ny n

n

µy µ (1 − ) . y! n

y

As n → ∞, this converges toward 1 × 1 × µy! e−µ , which is the Poisson pmf. 2.43 (a) The median is the value of y that satisfies F (y; λ) = 1 − e−λy = 0.50. Solving for y yields median = − log(0.5)/λ = log(2)/λ = 0.693/λ. (b) From Section 2.5.6, the pth quantile is − log(1 − p)/λ. For p = 0.25 and p = 0.75, these are 0.288/λ and 1.386/λ. (c) E(Y ) = ∫

∞ 0

yλe−λy dy =

∞ λ2 1 1 e−λy y 2−1 dy = ∫ λ 0 Γ(2) λ

because the argument of the integral is a gamma pdf with parameters λ and k = 2, which integrates to 1. This is greater than the median because the distribution is skewed right.


Solutions Manual: Foundations of Statistical Science for Data Scientists

13

(d) E(Y 2 ) = ∫

∞ 0

y 2 λe−λy dy =

Γ(3) ∞ λ3 −λy 3−1 2 e y dy = 2 , ∫ 2 λ Γ(3) λ 0

so σ = E(Y ) − µ = λ22 − λ12 = λ12 , and σ = λ1 . 2

2

2

2.44 E(Y 2 ) = ∫

∞ 0

λk 2 −λy k−1 Γ(k + 2) ∞ λk+2 −λy (k+2)−1 = k(k + 1)/λ2 y e y dy == 2 e y ∫ Γ(k) λ Γ(k) 0 Γ(k + 2)

so var(Y ) = k(k + 1)/λ2 − (k/λ)2 = k/λ2 . 2.45 Substituting λ = k/µ √ the alternative pdf formula. Then σ = √ in (2.10) yields (2.11) becomes σ = k/(k/µ) = µ/ k.

k/λ in

2.46 The cdf is F (t) = 1 − e−λt ; so that P (T > u + t ∣ T > u) = P (T > u + t)/P (T > u) = e−λ(t+u) /e−λu = e−λt = P (T > t). 2.47 Every y value with (y − µ) = c and thus (y − µ)3 = c3 has a corresponding y value with (y − µ) = −c and thus (y − µ)3 = −c3 having the same value of f (y), so integrating or summing over all y values gives E(Y − µ)3 = 0. 2.48 (a) From (2.11), E(Y /θ) = E(Y )/[(1/λ)] = (k/λ)/[(1/λ)] = k and var(Y /θ) = (1/θ)2 var(Y ) = λ2 var(Y ) = λ2 (k/λ2 ) = k, neither dependent on λ and hence θ. (b) If Y ∼ N (µ, σ 2 ), then (Y − µ) ∼ N (0, σ 2 ) not dependent on µ. √ 2.49 The standard deviation π(1 − π)/n converges toward 0 as n → ∞. Also, the binomial disribution is increasingly bell-shaped around its mean nπ as n increases. Therefore, as n grows, the distribution of π̂ is bell-shaped around π with standard deviation converging toward 0. Since the probability within 3 standard deviations of the mean is close to 1 for a bell-shaped distribution, π̂ tends to be closer to π as n increases. 2.50 (a) E(Y ) = ∑y yf (y) ≥ ∑y≥t yf (y) since the other terms in the sum for the mean are nonnegative. Since y takes value t and above in the second sum, ∑y≥t yf (y) ≥ ∑y≥t tf (y) = t ∑y≥t f (y) = tP (Y ≥ t), so E(Y ) ≥ tP (Y ≥ t) and P (Y ≥ t) ≤ E(Y )/t. (b) P (∣X − µ∣ ≥ kσ) = P [(X − µ)2 ≥ k 2 σ 2 ], and by the Markov inequality with Y = (X − µ)2 and t = k 2 σ 2 , this is ≤ E[(X − µ)2 ]/k 2 σ 2 = 1/k 2 . (c) Let P (X = 0) = P (X = 1) = 1/2. Then µ = 1/2 and σ = 1/2 and P (∣X − µ∣ = σ) = 1. 2.51 If g is concave, then −g is convex, so E[−g(Y )] ≥ −g[E(Y )], so E[g(Y )] ≤ g[E(Y )]. The function g(y) = log(y) defined for y > 0 is concave, because g ′′ (y) = −1/y 2 < 0. So, for a positive-valued random variable, E[log(Y )] ≤ log[E(Y )]. By contrast, g(y) = 1/y is convex for y > 0, because g ′′ (y) = 2/y 3 > 0, so E(1/Y ) ≥ 1/E(Y ). ]. Since the standardization 2.52 (a) If Y ∼ N (µ, σ 2 ), then F (y) = P (Y ≤ y) = P [ Y σ−µ ≤ y−µ σ (Y − µ)/σ ∼ N (0, 1), F (y) = Φ[(y − µ)/σ]. (b) Taking the derivative and using the standard normal pdf ϕ(y) =√Φ′ (y), we have f (y) = F ′ (y) = (1/σ)ϕ[(y − µ)/σ]. Then substituting in ϕ(z) = (1/ 2π exp[−z 2 /2], we have equation (2.8) for f (y). 2.53 For 0 ≤ x ≤ 1, F (x) = P (X ≤ x) = P (Φ(Y ) ≤ x) = P (Y ≤ Φ−1 (x)] = Φ[Φ−1 (x)] = x, so X has a uniform distribution. See Section 2.5.7 for this result (the probability integral transformation) in a more general context.


14

Solutions Manual: Foundations of Statistical Science for Data Scientists > y <- rnorm(1000000); hist(y) # > x <- pnorm(y); hist(x) #

bell-shaped around 0 uniform between 0 and 1

2.54 For 0 ≤ x ≤ 1, the cdf of X is G(x) = P (X ≤ x) = P [1 − F (Y ) ≤ x] = P [F (Y ) ≥ 1 − x] = 1 − P [F (Y ) ≤ 1 − x] = 1 − P [Y ≤ F −1 (1 − x)] = 1 − F [F −1 (1 − x) = 1 − (1 − x) = x, so it has the uniform distribution over [0, 1]. 2.55 E(Y ) = E[E(Y ∣ λ)] = E(λ) = µ. 2.56 (a) Y1 is the number of successes in n independent trials, where “success” is outcome in category 1 and “failure” is outcome in category 2 or 3. Since the marginal distribution is binomial, using the expressions for the mean and variance of a binomial random variable, E(Y1 ) = nπ1 and var(Y1 ) = nπ1 (1 − π1 ). (b) If we know y1 , then y2 can take values only between 0 and n − y1 . Since its range depends on y1 , they are not independent. In the special case c = 2 of the binomial, y1 is the number of successes, which determines n − y1 , the number of failures, and one random variable is completely dependent on the other. 2.57 (a) This is the product of the separate probability mass functions, by independence. (b) They are not independent. For instance, if we know Y1 = c, then we know Y2 cannot be larger than n − c. It cannot have a Poisson distribution, because the Poisson has positive probabilities for every nonnegative integer value. (c) The conditional probability is P [(Y1 = y1 , Y2 = y2 , . . . , Yc = yc ) ∣ ∑ Yj = n] j

= =

P (Y1 = y1 , Y2 = y2 , . . . , Yc = yc ) P ( ∑j Yj = n) Πi [exp(−µi )µyi i /yi !] n

exp ( − ∑j µj )( ∑j µj ) /n!

=

n! y ∏ π i, Πi y i ! i i

where {πi = µi /( ∑j µj )}. This is the multinomial (n, {πi }) distribution, characterized by the sample size n and the probabilities {πi }. 2.58 (a) 1 1 = P (Z = 1), P (Z = 0 ∣ X = 1) = = P (Z = 0), 2 2 1 1 P (Z = 1 ∣ X = 2) = = P (Z = 1), P (Z = 0 ∣ X = 2) = = P (Z = 0). 2 2 The conditional distribution of Z given X is the same as the marginal distribution of Z. So, X and Z are independent. Equivalently, P (Z = 1 ∣ X = 1) =

1 = P (X = 1)P (Z = 1), 4 1 P (X = 1, Z = 0) = P (head, tail) = = P (X = 1)P (Z = 0), 4 1 P (X = 2, Z = 1) = P (tail, tail) = = P (X = 2)P (Z = 1), 4 1 P (X = 2, Z = 0) = P (tail, head) = = P (X = 2)P (Z = 0), 4 so the joint pmf factors as the product of the two marginal pmf’s. Likewise, Y and Z are independent. P (X = 1, Z = 1) = P (head, head) =


Solutions Manual: Foundations of Statistical Science for Data Scientists

15

(b) We see that P (X = 1, Y = 1, Z = 1) = P (head, head) =

1 1 ≠ P (X = 1)P (Y = 1)P (Z = 1) = . 4 8

2.59 Where you move to on the next move depends only on where you are now, not where you were in the past. 2.60 (a)

n <- 100 y <- NULL; p <- NULL # creates empty vectors to add new components y[1] <- 2*rbinom(1,1,0.5)-1 # y=1 for binom=1 and y=-1 for binom=0 p[1] <- y[1]>0 # equals 1 if y>0 and 0 else for(t in 2:n){ y[t] <- y[t-1] + 2*rbinom(1,1,0.5)-1 p[t] <- y[t]>0 } prop=mean(p); prop # proportion of times for which Y_t>0 t <- rep(1:n) plot(t, y, pch=20, cex=1.2, col="dodgerblue4") abline(h=0, col="red4", lty=2)

(b) Construct a function (winn) doing the simulation in (a) and returning just the proportion pn : winn <- function(n){ y <- matrix(0,n,1) # creates a column vector of 0's of length n p <- matrix(0,n,1) y[1] <- 2*rbinom(1,1,0.5)-1 # y=1 for binom=1 and y=-1 for binom=0 p[1] <- y[1]>0 # equals 1 if y>0 and 0 else for(t in 2:n){ y[t] <- y[t-1] + 2*rbinom(1,1,0.5)-1 p[t] <- y[t]>0 } return(mean(p))}

Run the simulation using the function winn: iter=100000 n <- 100 p_n <- NULL for(i in 1:iter){ p_n[i]=winn(n) }

# number of iterations # sample size for each iteration # performs the simulation in function winn() # iter (=100000) times

hist(p_n, main=" ", xlab=expression(p[n]), ylab=expression(Number~of~~Y[t] > 0), col = "lightsteelblue", border="dodgerblue4")

2.61 See previous exercise. 2.62 The joint probability distribution is V U −1 0 1 Total 0 0.00 0.25 0.00 0.25 1 0.25 0.00 0.25 0.50 2 0.00 0.25 0.00 0.25 Total 0.25 0.50 0.25 1.0 e.g., P (X = 1, Y = 1) = 0.25 = P (U = 2, V = 0). For these probabilities, E(U ) = 1, E(V ) = 0, E(U V ) = 0, so cov(U, V ) = corr(U, V ) = 0. The joint probabilities are not the product of the marginal probabilities, so U and V are not independent.


16

Solutions Manual: Foundations of Statistical Science for Data Scientists

2.63 (a) E(X + Y ) = E(X) + E(Y ) = µx + µy , so var(X + Y ) = E[(X + Y ) − (µx + µy )]2 = E[(X − µx ) + (Y − µy )]2 = E(X − µx )2 + E(Y − µy )2 + 2E[(X − µx )(Y − µy )] = var(X) + var(Y ) + 2cov(X, Y ). (b) var(X − Y ) = E[(X − Y ) − (µx − µy )]2 = E[(X − µx ) − (Y − µy )]2 = E(X − µx )2 + E(Y − µy )2 − 2E[(X − µx )(Y − µy )] = var(X) + var(Y ) − 2cov(X, Y ). 2.64 (a) var(Zx + Zy ) = var(Zx ) + var(Zy ) + 2cov(Zx , Zy ) = 1 + 1 + 2cov(Z √ x , Zy ) ≥ 0 implies that cov(Zx , Zy ) ≥ −1 and thus corr(Zx , Zy ) = cov(Zx , Zy )/ var(Zx )var(Zy ) = cov(Zx , Zy ) ≥ −1. The correlation is not changed by linear transformations, so corr(X, Y ) ≥ −1. Likewise, var(Zx − Zy ) = var(Zx ) + var(Zy ) − 2cov(Zx , Zy ) = 1 + 1 − 2cov(Z √x , Zy ) ≥ 0 implies that cov(Zx , Zy ) ≤ 1 and thus corr(Zx , Zy ) = cov(Zx , Zy )/ var(Zx )var(Zy ) = cov(Zx , Zy ) ≤ 1. The correlation is not changed by linear transformations, so corr(X, Y ) ≤ 1. (b) µx = µy , σx = σy , so cov(X, Y ) = E[(X − µx )(Y − µy )] = E[(X − µx )(X − µx )] = σx2 = σy2 = σx σy , so corr(X, Y ) = 1. 2.65 (a) For simplicity of notation, we take all means = 0, since they don’t affect covariation. Then cov(X, Y ) = cov(U + V, U + W ) = E[(U + V )(U + W )] = E(U 2 ) + E(U W ) + E(V U ) + E(V W ) = E(U 2 ) + E(U )E(W ) + E(V )E(U ) + E(V )E(W ) = E(U 2 ) = var(U ) (since U , V , W are uncorrelated). Thus, var(U ) cov(X,Y ) =√ . corr(X, Y ) = √ [var(U +V )][var(U +W )] [var(U )+var(V )][var(U )+var(W )] (b) As var(U ) increases, for fixed var(V ) and var(W ), corr(X, Y ) increases. With large variability in intelligence, we expect a strong positive correlation between the math and verbal test scores. d tY e ) = E(Y ety ) and hence m′ (0) = E(Y ). Likewise, 2.66 (a) m′ (t) = (d/dt)E(etY ) = E( dt m′′ (t) = E(Y 2 ety ) and m′′ (0) = E(Y 2 ). More generally, m(k) (t) = E(Y k etY ). 2

2

3

3

2

3

(b) m(t) = E(etY ) = E[1 + tY + t 2!Y + t 3!Y + ⋯ = 1 + tE(Y ) + t2! E(Y 2 ) + t3! E(Y 3 ) + ⋯. ty −µ y −µ ∞ (c) m(t) = E(etY ) = ∑∞ ∑y=0 (µet )y /y! = e−µ eµe = eµ(e −1) . So, y=0 e [e µ /y!] = e t t t m′ (t) = µet eµ(e −1) and m′ (0) = µ. Also, m′′ (t) = (µet )2 eµ(e −1) + eµ(e −1) µet and m′′ (0) = µ2 +µ, so var(Y ) = E(Y 2 )−[E(Y )]2 = m′′ (0)−[m′ (0)]2 = (µ2 +µ)−µ2 = µ. t

t

(d) m′ (t) = eµt+σ t /2 (µ + σ 2 t), so m′ (0) = µ. m′′ (t) = eµt+σ t /2 (µ + σ 2 t)2 + eµt+σ t /2 σ 2 , so m′′ (0) = µ2 + σ 2 and var(Y ) = E(Y 2 ) − [E(Y )]2 = m′′ (0) − [m′ (0)]2 = (µ2 + σ 2 ) − µ2 = σ 2 . 2 2

2.67 (a)

2 2

2 2

> Y <- rnorm(10) # similarly for n=100 and n=1000 > qqnorm(Y, col='blue', main='Y ~ N(0,1)'); abline(0,1)

(b)

> Y <- rnorm(1000, 0, 16) > qqnorm(Y) # slope of points is about 16

(c)

> Y1 <- rexp(1000); Y2 <- runif(1000) > qqnorm(Y1, col='blue', main='Y1 ~ exp(1)') > qqnorm(Y2, col='blue', main='Y2 ~ uniform(0,1)')

For interpretation, see the discussion in the R appendix. (d) For the uniform(0, 1), qi = i/(n + 1), since the proportion of the distribution below that point is i/(n + 1). ). The number of distinct 2.68 The number of distinct possible samples of size n is (F +M n F M possible samples with y females and n − Y males is ( y )(n−y). The smallest possible y is


Solutions Manual: Foundations of Statistical Science for Data Scientists

17

n − M , which could happen if every male available is sampled, and 0 if M > n in which case every person sampled could be male. The largest possible y is F , the total number of females available to be sampled, and n if F > n in which case every person sampled could be female. 2.69 Y = y if y successes occur in the first y +k −1 trials and then a failure occurs on trial y +k. ) is the number of ways y successes can occur in y + k − 1 The binomial coefficient (y+k−1 y trials. The probability of each sequence with k failures and y successes is π y (1 − π)k , so )π y (1 − π)k . the total P (Y = y) = (y+k−1 y Γ(2) 2.70 (a) For 0 ≤ y ≤ 1, f (y; 1, 1) = Γ(1)Γ(1) y 1−1 (1 − y)1−1 = 1 since Γ(2) = 1! = 1, Γ(1) = 0! = 1.

(b) E(Y ) = ∫

1 0

yf (y; α, β) =

1 Γ(α + β) α−1 β−1 ∫ y ⋅ y (1 − y) . Γ(α)Γ(β) 0

The integral in this term is 1

0

y (α+1)−1 (1 − y)β−1 = Γ(α + 1)Γ(β)/Γ[(α + 1) + β)],

from which we have E(Y ) = (c) E(Y 2 ) =

α Γ(α + β) Γ[(α + 1) + β] = . Γ(α)Γ(β) Γ(α + 1)Γ(β) α + β

Γ(α + β) Γ[(α + 2) + β] (α + 1)α = . Γ(α)Γ(β) Γ(α + 2)Γ(β) (α + β + 1)(α + β)

Subtracting µ2 with µ = α/(α+β) from E(Y 2 ) and combining terms yields var(Y ) = µ(1 − µ)/(α + β + 1), which, for fixed α + β, decreases toward 0 as µ approaches 0 or 1. (d) (i) symmetric with U-shape for α = β = 0.5, uniform for α = β = 1.0, bell shape for α = β > 1.0 with variability decreasing for larger values. (ii) Skewed left when α > β (mean > 1/2), skewed right when α < β (mean < 1/2), with spread decreasing as (α + β) increases for fixed µ = α/(α + β). Can use code such as: > y = seq(0, 1, length=100) > plot(y, dbeta(y, 10, 10), ylab="pdf", type ="l", col="blue") > lines(y, dbeta(y, 0.5, 0.5), col="red")

# l = line

2.71 (a) G(y) = P (Y ≤ y) = P√ [log(Y ) ≤ log(y)] = P [X ≤ log(y)] = F [log(y)], so g(y) = (1/y)f [log(y)] = [1/y( 2πσ)] exp{−[log(y) − µ]2 /2σ 2 }. (b) Since X = log(Y ), Y = eX , and E(Y ) = E(eX ). The mgf of X when X ∼ N (µ, σ 2 ) 2 2 2 is m(t) = E(etX ) = eµt+σ t /2 , so letting t = 1, E(Y ) = E(eX ) = m(1) = eµ+σ /2 . 2

E(Y 2 ) = E[(eX )2 ] = E(e2X ) = m(2) = e2µ+2σ , and var(Y ) = E(Y 2 ) − [E(Y )]2 = 2 2 2 2 2 2 e2µ+2σ − (eµ+σ /2 )2 = e2µ+2σ − e2µ+σ = eσ [E(Y )]2 − [E(Y )]2 = [E(Y )]2 (eσ − 1) (c) P (Y ≤ eµ ) = P [log(Y ) ≤ µ) = 0.50, since µ is the mean of the normal distribution for log(Y ). (d) exp(x) = exp[(∑i log(yi ))/n] = exp[(1/n) log(∏ yi )] = exp[log(∏ yi )1/n ] = (∏i yi )1/n . 2.72 (a) f (y; λ, k) = F ′ (y; λ, k) = λk ( λy )

k−1 −(y/λ)k

e

for y > 0.


18

Solutions Manual: Foundations of Statistical Science for Data Scientists (b) The median is y such that F (y; λ, k) = 1 − e−(y/λ) = 1/2; taking logs of both sides and solving for y yields median = λ[log(2)]1/k . (c) > y <- seq(0, 5, 0.001) k

> plot(y, dweibull(y, shape=4, scale=1), type="l",col="red") > lines(y, dweibull(y, shape=2, scale=1), col="blue") > lines(y, dweibull(y, shape=1, scale=1))

The distribution is less skewed and more bell-shaped as k increases. If k ≤ 1, the pdf is monotone decreasing in y from y = 0. (d) The exponential (but we parameterized that distribution with λ replaced by 1/λ). y=∞

2.73 (a) The pdf is nonnegative with ∫ f (y)dy = ∫1 α/y α+1 dy = −(1/y)α ∣

= 1.

y=1 ∞

y=∞

(b) E(Y ) = ∫1 yα/y α+1 dy = ∫1 α/y α dy = −[α/(α − 1)]y −α+1 ∣

= α/(α − 1).

y=1

(c) With α = 0, ∑∞ y=1 (c/y) = ∞ for all c. 2.74 The plot is derived analogous to Exercise 2.72(c). θ1 is a location parameter because when you increase θ1 by a constant c, all values in the distribution increase by c (without a change in dispersion), and θ2 is a scale parameter because when you multiply θ2 by a constant c, all values in the distribution multiply by c. 3.1

> results <- rbinom(1000000, 2123, 0.50)/2123 > summary(results); hist(results) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.4451 0.4927 0.4998 0.5000 0.5073 0.5535

If π = 0.50, we expect sample proportions between 0.44 and 0.56; 0.61 is unusually large if actual π = 0.50, so we can predict that Klobuchar won. (She actually got 60.3% of all 2.6 million votes.) √ 3.2 (a) Standard error = (0.50 ⋅ 0.50)/1648 = 0.0123 (b) (i) > results <- rbinom(1000000, 1648, 0.50)/1648; hist(results) > summary(results) Min. 1st Qu. Median Mean 3rd Qu. Max. # 0.515 is not unusual 0.4405 0.4915 0.5000 0.5000 0.5085 0.5613 # cannot predict winner

(ii) The sampling distribution of the sample proportion would almost all be within 3 standard errors, that is, within about 0.04 from 0.50, which encompasses the sample result of π̂ = 0.515. 3.3

> results <- rbinom(1000000, 49, 0.50)/49; hist(results) > summary(results); 29/49 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1837 0.4490 0.5102 0.5000 0.5510 0.8367 [1] 0.5918367 # 29/49 = 0.59 not highly unusual if 50% of population prefer Coke

Notice that the observed π̂ = 0.59 is within 3 standard errors from 0.5: > 0.5 + c(-1,1)*3*sqrt(0.5*0.5/49) [1] 0.2857143 0.7142857


Solutions Manual: Foundations of Statistical Science for Data Scientists 3.4

19

> summary(rbinom(1000000, 262, 0.422)/262); 207/262 Min. 1st Qu. Median Mean 3rd Qu. Max. 0.2824 0.4008 0.4237 0.4220 0.4427 0.5725 [1] 0.7900763 # 79% stopped would be extremely unusual, under random variation

√ 3.5 (a) With n = 25, the theoretical standard errors of σ/ n are 1.00 when σ = 5 and 1.60 when σ = 8. The simulations had standard deviations of the 100,000 sample means of 1.004 and 1.603, very close to the theoretical values. (b) For the gamma distribution with (µ, σ) = (20, 5), (shape, scale) = (16, 1.25). We use the cdf to find the probability between 15 and 25: > pgamma(25, shape=16, scale=1.25) - pgamma(15, ,shape=16, scale=1.25) [1] 0.6879025

This is close to 2/3 and similar to the value for normal distributions. 3.6 0.36(909) = 327 voted for Trump. If the population proportion of Trump √ voters responding yes is close to 0.28, the standard error when n = 327 is close to 0.28(0.72)/327 = 0.0248. The sample proportion is probably no further than 3(0.0248) = 0.07 from the population proportion. Most likely between 21% and 35% of all Trump voters would have responded yes. 3.7 √ (i) because as n increases, the standard deviation of the number of heads, n(0.50)(0.50), increases, so the probability within 10 of the expected value of n/2 de√ creases. By contrast, the proportion has standard error (0.50)(0.50)/n that decreases as n increases, so the sample proportion of heads gets closer to 1/2. 3.8 These are the binomial distributions for n = 1, 2, 3, 4, with possible values divided by n. For n = 4 it is > y <- 0:4 > cbind(y/4, dbinom(y, 4, 0.50)) [,1] [,2] [1,] 0.00 0.0625 [2,] 0.25 0.2500 [3,] 0.50 0.3750 [4,] 0.75 0.2500 [5,] 1.00 0.0625

It gets more bell-shaped (and narrower) as n increases, by the CLT. 3.9 (a) (1,1), (1,2), …, (1,6) (2,1), (2,2), …, (2,6) … (6,1), (6,2), …, (6,6) y 1 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0

Probability 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36


20

Solutions Manual: Foundations of Statistical Science for Data Scientists (b) More sample points have mean near the middle than at 1 and 6, which each have only one sample point. With additional rolls, by the Central Limit Theorem, it converges to a normal distribution. (c) > y <- NULL > for(i in 1:10000){y[i,] <- mean(sample(1:6, 10, replace=TRUE))} > hist(y) # the empirical sampling distribution is bell-shaped

3.10 (a) Using the 36 possible results of two rolls, y Probability 1 1/36 2 3/36 3 5/36 4 7/36 5 9/36 6 11/36 (b) Probability at 6 approaches 1 and other probabilities decrease toward 0. 3.11

> mean(rpois(10,5));mean(rpois(1000,5));mean(rpois(100000,5));mean(rpois(10000000,5)) [1] 5.2 #n=10 [1] 4.933 #n=1000 [1] 5.00817 #n=100000 [1] 4.999095 #n=10000000

3.12 Random sampling from a standard normal with n = 10, 1000, 100000, 10000000: > mean(rnorm(10));mean(rnorm(1000));mean(rnorm(100000));mean(rnorm(10000000)) [1] 0.1471995 #n=10 [1] -0.02465573 #n=1000 [1] 0.006233995 #n=100000 [1] -0.0002316873 #n=10000000

The sample mean converges to 0.000 as n increases. 3.13 The following code simulates the sampling distribution of y for n = 2 in random sampling from a uniform distribution over [0, 1]. This sampling distribution has a triangular shape. Run this with successively larger values of n to see the CLT convergence to a normal distribution. > n <- 2; y <- NULL > for(i in 1:100000){y[i] <- mean(runif(n, 0, 1))} # sample mean for case i > hist(y) # empirical sampling distribution of sample mean for samples of size n

3.14 Each observation has expected value 0 and standard deviation 1.0. The total after 100 bets has expected value 0, variance 100(1) and standard deviation 10. Since the sample mean winnings has approximately a normal distribution (by the CLT), so does the sum of the winnings. 3.15 (a) No, probably skewed right because lowest possible value is a bit more than one standard deviation below the mean. √ (b) Normal by CLT, mean = 5.5, standard error = 3.9/ 1000 = 0.123. (c) Almost certainly the population mean is within 3 standard errors of the sample mean, or between about 5.1 and 5.9. 3.16 (a) Mean = 2.6, standard deviation = 1.5 (b) Mean = 2.4, standard deviation = 1.4


Solutions Manual: Foundations of Statistical Science for Data Scientists √ (c) Mean = 2.6, standard error = 1.5/ 225 = 0.10.

21

3.17 (a) For y = 1 for females and y = 0 for males, P (1) = 0.60 and P (0) = 0.40. (b) y = 1 has proportion 18/50 = 0.36 and y = 0 has proportion 0.64. √ (c) Bell-shaped (by the CLT) with mean 0.60 and standard error 0.60(0.40)/50 = 0.069. A value of 0.36 is more than three standard errors from the expected value of 0.60, which would be very unusual for a simple random sample. 3.18 (a) (i) mean 72, standard deviation 12; (ii) mean 70, standard deviation 11, probably skewed left, resembling population distribution. √ (b) Bell-shaped (by CLT) with mean 72, standard error 12/ 100 = 1.2, describes probabilities for where Y is likely to fall. (c) A value of 60 is only a standard deviation from the expected value of 72, not a surprising value, but y = 60 is many standard errors below the expected value, a highly unusual result. (d) (i) Same as population distribution, with mean 72, standard deviation 12; (ii) if sample everyone, Y = µ = 72 with probability 1. 3.19

> y <- rpois(1000000, 9); mean(sqrt(y)); var(sqrt(y)) [1] 2.956059 [1] 0.262729 > y <- rpois(1000000, 100); mean(sqrt(y)); var(sqrt(y)) [1] 9.986786 [1] 0.2508395 > y <- rpois(1000000, 100000); mean(sqrt(y)); var(sqrt(y)) [1] 316.2278 [1] 0.2498776 # by delta method, variance close to 1/4 when Poisson mean is large

3.20 Since y almost certainly √ falls within three standard errors of µ, you can approximate √ the standard error by s/ n and then predict that µ takes value no more than 3s/ n from y. 3.21

> y <- rnorm(25, 3.0, 0.4) # random sample of n=25 from population > summary(y); hist(y) # describes sample data distribution Min. 1st Qu. Median Mean 3rd Qu. Max. 2.513 2.777 2.948 3.037 3.291 3.750 > ybar <- rnorm(100000, 3.0, 0.4/sqrt(25)); hist(ybar) # 100000 sample means, each having normal dist. with mean 3.0, std. error 0.4/sqrt(25)

3.22 (a) The sample proportions should vary around 0.50, few if any below 0.40 or above 0.60. (b) The plot√should resemble a normal distribution with mean 0.50 and standard deviation 0.50(0.50)/100 = 0.05. 3.23 (a) Unlike the population distribution, the empirical sampling distribution is bellshaped, by the Central Limit Theorem. (b) The Central Limit Theorem holds only when n is large, not n = 2. 3.24 (a) The sample data distribution should resemble the population distribution, having similar mean and standard deviation. (b) The theoretical values are the √ population mean and the population value of the standard deviation divided by 1000. This distribution describes the probability distribution of where y is likely to fall in a sample survey of this size that takes a simple random sample.


22

Solutions Manual: Foundations of Statistical Science for Data Scientists √ 3.25 The theoretical values are 0.50 for the mean and 0.50(0.50)/100 = 0.05 for the standard deviation. > y <-NULL > for(i in 1:10000){y[i] <- mean(rbinom(100, 1, 0.5)) } > mean(y); sd(y) [1] 0.499954 [1] 0.05036796

3.26 The smaller n is, the more the sample proportion or sample mean can vary from sample to sample. Even if every state has the same parameter value, the states with smaller sample sizes will tend to have sample statistic value farther from the expected value. √ 3.27 (a) Standard error = π(1 − π)/n = 0. If π = 1, necessarily every trial is a success and π̂ = 1.0. There cannot be any variability. (b) Let g(π) = π(1 − π). Setting g ′ (π) = 1 − 2π = 0 and solving yields π = 0.50 to maximize it (since g ′′ (π) = −2 < 0) and the corresponding standard error formula. It is easier to make a precise inference about π when it is near 0 or 1 than when it is near 0.50. √ 3.28 Solving 0.50(0.50)/n = 0.04 for n yields n = (0.50/0.04)2 = 156.25, i.e. n = 157. 3.29

> expo_median <- function(n, lambda, B=10000){ + par(mfrow=c(1,2), pin=c(2.2, 2.2)) + for(i in 1:2){ + Y <- numeric(length=n[i]*B) + Y <- matrix(rexp(n[i]*B, lambda), ncol=n[i]) + Ymed <- apply(Y, 1, median) + hist(Ymed) + }} > n <- c(10, 100) > expo_median(n, 1, 100000) # call the function for n given above and lambda=1

The sampling distribution of the sample median is skewed right for n = 10 but is approximately normal for n = 100. 3.30 From equation (2.11), the gamma distribution has shape parameter k = (µ/σ)2 and scale parameter 1/λ = µ/k = σ 2 /µ, which are here k = 1.665 and 1/λ = 6.006. > z <- 0:50 > plot(z, dgamma(z, shape=1.665, scale=6.006), type="l") # population distribution > y <- NULL > for(i in 1:10000){y[i] <- mean(rgamma(200, shape=1.665, scale=6.006))} > hist(y) # empirical sampling distribution

The sampling distribution is approximately normal, by the CLT, even though the population distribution is skewed right. 3.31 The population distribution. An example is the exit poll example in the chapter, where the sample data distribution and population distribution place all their probability at 0 and 1 but the sampling distribution describes sample proportion values that are close to the population proportion. 3.32 (a), (c), (d) 3.33 (c), because the distribution of the number of heads is approximately normal but with a √ standard deviation of 1000000(0.5)(0.5) = 500, so the probability at any single value is very close to 0. Using the command dbinom(500000, 1000000, 0.50) in R, we find that this probability equals 0.0008.


Solutions Manual: Foundations of Statistical Science for Data Scientists

23

3.34 (c) 3.35 (a) is incorrect because the sample data tends to look like the population distribution, which need not be bell-shaped. (b) is incorrect because population distributions can have any shape regardless of how large the population is. (d) is incorrect because the sampling distribution gets narrower and bell-shaped as n increases and is very different from the population distribution, which can have any shape and is much more spread out. 3.36 In adding many observations, the deviations from the mean also add. Even though each deviation may be small and they average to a value close to 0, the sum of a large number of deviations is potentially quite large. 3.37 The finite population correction equals: (a) σY = 0.995, so the standard error is only slightly less than the ordinary formula provides. (b) σY = 0 when N = n, so the standard error is 0. (c) σY = 1 when n = 1, and the standard error simplifies to σ. 3.38 (a) E(Y − µ)3 = (1/n3 )E[∑i (Yi − µ)]3 = (1/n3 ) ∑i E(Yi − µ)3 because all other terms in the expansion of the cube contain a product of E(Yi − µ) which is 0. Therefore, √ E(Y − µ)3 = (1/n2 )E(Y√− µ)3 . Therefore, skewness(Y )√= E(Y − µ)3 /[σ/ n]3 √ = (1/n2 )E(Y − µ)3 /[σ/ n]3 = E(Y − µ)3 /[ nσ 3 ] = S/ n. The skewness decreases as n increases, because the CLT tells us the sampling distribution converges to normal as n increases. √ (b) S/ n ≤ M implies n ≥ (S/M )2 . (c) n ≥ (S/M )2 = (S/0.32)2 = 10S 2 . For the exponential distribution, S = 2, so we need n ≥ 40 for this highly skewed distribution. 3.39 Convergence in probability says that the sequence of random variables converges to a single random number, i.e., to a random variable. Convergence in distribution says that the sequence of random variables converges to a random variable having a certain probability distribution. Convergence in probability is stronger and implies convergence in distribution. The opposite is true only if that distribution has all its probability at a single point. 3.40 For g(π̂) = log[π̂/(1 − π̂)], evaluated at π, its derivative equals 1/π(1 − π). By the delta method, the asymptotic variance of the sample logit is π(1 − π)/n (which is the variance of π̂) multiplied by the square of [1/π(1 − π)]. That is, √

n[ log (

π 1 π̂ d ) − log ( )] → N (0, ). 1 − π̂ 1−π π(1 − π)

√ √ √ 3.41 For g(y) = arcsin(y), g ′ (y) = 1/ 1 − y 2 , so for g(π) = arcsin( π), g ′ (π) = 1/2 π(1 − π). Multiplying the square of this by the variance of π̂, which is π(1 − π)/n, yields 1/4n for the approximate variance. 3.42 g(y) = log(y) has g ′ (y) = 1/y, so if Y has standard deviation = cµ and thus variance = (cµ)2 , log(Y ) has approximate variance (1/µ)2 (cµ)2 = c2 and standard deviation approximately equal to c, constant instead of increasing with the mean. The log-normal and gamma are examples of distributions for positively-valued Y with standard deviation proportional to the mean.


24

Solutions Manual: Foundations of Statistical Science for Data Scientists

3.43 For g(t) = t2 , g ′ (t) = 2t and g ′′ (t) = 2. At µ = 0, g(t) = 0. So, [g(T ) − g(0)] ≈ (T − 0)g ′ (0) = 0. Using also the second term in the expansion, [g(T ) − g(0)] ≈ [(T − 0)g ′ (0) + 1 (T − 0)2 g ′′ (0)] = T 2 , being exact since g is quadratic. Section 4.4.5 defines a chi2 squared distribution with 1 degree of freedom as the square of a standard normal random variable. 3.44 (a)

> prop1 <- rbinom(1000000, 50, 0.20)/50 > prop2 <- rbinom(1000000, 50, 0.10)/50 > hist(prop1/prop2) > hist(log(prop1/prop2)) # taking log yields DIFFERENCE of logged proportions, # closer to normality than RATIO of proportions

(b) For very large nᵢ with very small πᵢ, the binomial is approximated by the Poisson with µᵢ = nᵢπᵢ. Hence the asymptotic variance of the log ratio,

(1 − π₁)/(n₁π₁) + (1 − π₂)/(n₂π₂) ≈ 1/(n₁π₁) + 1/(n₂π₂) = 1/µ₁ + 1/µ₂.

3.45 (a) E(e^(tT)) = E(∏ᵢ e^(tYᵢ)) = ∏ᵢ E(e^(tYᵢ)) = ∏ᵢ mᵢ(t) by the independence of the random variables.
(b) ∏ᵢ mᵢ(t) = [e^(µt + σ²t²/2)]ⁿ = e^(nµt + nσ²t²/2), the mgf of a N(nµ, nσ²) distribution. Since Ȳ = T/n, Ȳ has a N(µ, σ²/n) distribution.

(c) Poisson with mean ∑ᵢµᵢ.
(d) E(e^(tY)) = πe^t + (1 − π)e⁰. Taking the product of n of these gives the mgf for the binom(n, π) distribution.
(e) If π₁ = ⋯ = πₙ, then S has mgf m(t) = [(1 − π) + πe^t]^(∑ᵢnᵢ), and so S ∼ binom(∑ᵢnᵢ, π).
(f) The exponential distribution is the gamma with k = 1. Taking the product of n exponential mgf's gives a gamma mgf with shape parameter k = n and rate parameter λ. As n increases, this is approximately normal, by the CLT.

3.46 From Exercise 3.45, the mgf of ∑ᵢZᵢ is [m(t)]ⁿ.
(a) √n(Ȳ − µ)/σ = (∑ᵢZᵢ)/√n. If m(t) is the mgf of Z, then the mgf of cZ is E(e^(t(cZ))) = E(e^((tc)Z)) = m(tc), so the mgf of √n(Ȳ − µ)/σ is [m(t/√n)]ⁿ.
(b) The mgf of a random variable Z is m(t) = 1 + tE(Z) + (t²/2!)E(Z²) + (t³/3!)E(Z³) + ⋯. Apply this here by replacing t with t/√n and raising the result to the power n.
(c) As n → ∞, the limit of aₙ = [t²/2 + t³E(Z³)/(3!√n) + ⋯] is t²/2, so apply the result with this aₙ.

Chapter 4 4.1

> y <- NULL > for(i in 1:100000){ + x <- rnorm(100, 0, 1) + y[i] <- (quantile(x, 0.75) + quantile(x, 0.25))/2 } > sd(y) [1] 0.1107919

The standard error of the sample mean is σ/√n = 1/√100 = 0.10, so the sample mean seems to be a slightly better estimator.
4.2 ℓ(π) = (1 − π)²π, for 0 ≤ π ≤ 1.


> pi <- seq(0, 1, length=100);  plot(pi, ((1-pi)^2)*pi, type="l")

4.3

> pi <- seq(0, 1, length=100) > plot(pi, log(dbinom(6, 10, pi)), type="l")

π̂ = 0.60. Taking the log function, which is monotone increasing, does not affect the value at which the function is maximized. 4.4

> Students <- read.table("http://stat4ds.rwth-aachen.de/data/Students.dat", + header=TRUE) > sum(Students$life==1); length(Students$life) # life=1 is belief in life after death [1] 31 [1] 60 > library(proportion) > ciAllx(31, 60, 0.05) method x LowerLimit UpperLimit 1 Wald 31 0.3902218 0.6431115 4 Score 31 0.3930781 0.6382495

With π̂ = 31/60 = 0.517, the Wald CI is 0.517 ± 1.96√[(0.517)(0.483)/60], which is (0.390, 0.643). If this were a random sample of social science graduate students, we could be 95% confident that the proportion of the corresponding population that believes in life after death is between 0.390 and 0.643.
4.5 (a) The sample percentage favoring legalization increased from 19.6% in 1973 to 66.4% in 2018.
(b) Using the GSS data with "no weights," the counts are 938 in favor and 509 opposed.
> library(proportion)
> ciAllx(938, 1447, 0.05)
  method   x LowerLimit UpperLimit
1   Wald 938  0.6236337  0.6728417
4  Score 938  0.6232707  0.6724198

In 2018, we can infer that between 62% and 67% (i.e., a majority) favored legalization. 4.6

> library(proportion) > ciAllx(0, 25, 0.01) method x LowerLimit UpperLimit 1 Wald 0 0.000000e+00 0.00000000 4 Score 0 0.000000e+00 0.20973347

Wald CI is (0, 0) because the estimated standard error is 0.0 when π̂ = 0.0. The score CI is more reliable, suggesting that π falls between 0 and 0.21, not between 0 and 0.
4.7 Setting 0.05 = 1.96√[0.5(0.5)/n] and solving for n yields n = (1.96)²(0.5)(0.5)/(0.05)² = 384.
4.8 Setting 0.04 = 1.645√[0.3(0.7)/n] and solving for n yields n = (1.645)²(0.3)(0.7)/(0.04)² = 355.
4.9 (a) Histogram shows skew to the right. Sample point estimates are ȳ = 20.33, s = 3.68.
(b) (i) 20.33 ± 2.045(3.68)/√30, which is (18.96, 21.71). We can be 95% confident that the population mean annual income for heads of households in public housing in Chicago is between 18.96 and 21.71 thousand dollars. (ii)


> Chicago <- read.table("http://stat4ds.rwth-aachen.de/data/Chicago.dat", header=TRUE)
> hist(Chicago$income)
> mean(Chicago$income); sd(Chicago$income)
[1] 20.33333
[1] 3.681111
> qt(0.025, 29)
[1] -2.04523
> t.test(Chicago$income)$conf.int
[1] 18.95878 21.70788

4.10 For the population of men, we can be 95% confident that the population mean number of female sex partners is between 8.0 and 13.1. The standard deviation is much larger than the mean, and that together with the smaller mode and median suggests that the sample data distribution is very highly skewed to the right. The median is probably a more relevant measure than the mean for summarizing the sample data distribution or the population distribution.
4.11 (a) The CI ȳ ± t(α/2, n−1)(s/√n) is 1.70 ± 2.262(1.3375)/√10, which is (0.74, 2.66). We can be 95% confident that the population mean number of hours of daily TV watching is between 0.74 and 2.66.
(b) Now ȳ = 3.7, s = 7.2 and the CI is (−1.46, 8.86), dramatically affected by one outlying observation. Just as outliers can affect the mean (especially with small n), the same is true of CIs.
4.12 (a) Side-by-side box plots show incomes tend to be higher for whites, but are quite skewed to the right.
(b) ȳ and s are 27.75 and 13.28 for blacks, and 42.48 and 22.87 for whites, with n = 16 and 50 (so df = 64, with t quantile 1.669 for a 90% CI). The pooled estimate of σ is √{[15(13.28)² + 49(22.87)²]/(16 + 50 − 2)} = 21.02. The 90% confidence interval for µ_W − µ_B is (42.48 − 27.75) ± 1.669(21.02)√[(1/16) + (1/50)], which is (4.65, 24.81). We can be 90% confident that the population mean income is between 4.65 and 24.81 thousand dollars higher for whites than blacks.
> Income <- read.table("http://stat4ds.rwth-aachen.de/data/Income.dat",
+    header=TRUE)
> boxplot(income ~ race, horizontal=TRUE, data=Income)
> tapply(Income$income, Income$race, summary)
$B      # results for $H not shown
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  16.00   19.50   24.00   27.75   31.00   66.00
$W
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.00   24.00   37.00   42.48   50.00  120.00
> tapply(Income$income, Income$race, sd)
       B        H        W
13.28408 12.81225 22.86985
> incW <- Income$income[Income$race=="W"]
> incB <- Income$income[Income$race=="B"]
> t.test(incW, incB, conf.level=0.90, var.equal=TRUE)
90 percent confidence interval:
  4.653686 24.806314

(c)

> t.test(incW, incB, conf.level=0.90)
90 percent confidence interval:    # differs somewhat, because the sample standard
  6.943389 22.516611               # deviations s1=13.3 and s2=22.9 are quite different
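As a quick numerical check, the pooled calculation in (b) can be reproduced from the summary statistics alone; this is only a sketch using the rounded values shown above:

nB <- 16; nW <- 50
mB <- 27.75; mW <- 42.48
sB <- 13.28408; sW <- 22.86985
sp <- sqrt(((nB - 1)*sB^2 + (nW - 1)*sW^2)/(nB + nW - 2))      # pooled SD, about 21.0
(mW - mB) + c(-1, 1)*qt(0.95, nB + nW - 2)*sp*sqrt(1/nB + 1/nW)
# approximately (4.65, 24.81), matching the t.test() interval above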

4.13 (a) ȳ = 7.29, s = 7.18 (b) se = 1.74, df = 16, t quantile 2.120, CI is (3.6, 11.0).
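The interval in (b) can also be reproduced directly from the summary statistics; a one-line sketch:

7.29 + c(-1, 1)*qt(0.975, 16)*1.74      # approximately (3.6, 11.0)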



(c) (2.88, 12.55), so we can be 95% confident that the population mean weight change for the family therapy is between 2.88 and 12.55 pounds higher than for control.
> Anor <- read.table("http://stat4ds.rwth-aachen.de/data/Anorexia.dat", header=TRUE)
> control <- Anor$after[Anor$therapy=="c"] - Anor$before[Anor$therapy=="c"]
> family <- Anor$after[Anor$therapy=="f"] - Anor$before[Anor$therapy=="f"]
> t.test(family, control, var.equal=TRUE, conf.level=0.95)
95 percent confidence interval:
  2.880164 12.549248

4.14 (a) (5.53, 9.00), so we are 95% confident that the population mean number of weekly hours of TV watching for the population of social science graduate students is between 5.53 and 9.00. The CI assumes a random sample from a normal population distribution. The normal assumption is not very important, because of the robustness of the method. The sampling assumption is important. This interval is only useful to the extent that this sample is like a random sample from a population of social science graduate students. (b) The confidence interval for the difference between the population means for females and for males is (−2.00, 4.97). Since the CI includes 0, we cannot conclude whether females or males watch more TV. > Students <- read.table("http://stat4ds.rwth-aachen.de/data/Students.dat", + header=TRUE) > t.test(Students$tv) 95 percent confidence interval: 5.531395 9.001938 > t.test(Students$tv[Students$gender==1], Students$tv[Students$gender==0], + var.equal=TRUE) 95 percent confidence interval: -1.997897 4.965639 mean of x mean of y 7.983871 6.500000

4.15 The sample proportions are 0.863 for females and 0.744 for males. The 95% Wald confidence intervals are (0.844, 0.883) for females, (0.716, 0.772) for males, and (0.085, 0.153) for the difference between the population proportions for females and males. In the U.S., we can conclude that more females than males believe in life after death. > library(proportion) > ciAllx(1017, 1178, 0.05) method x LowerLimit UpperLimit 1 Wald 1017 0.8437120 0.8829434 4 Score 1017 0.8425274 0.8817661 > ciAllx(703, 945, 0.05) method x LowerLimit UpperLimit 1 Wald 703 0.7160871 0.7717436 4 Score 703 0.7151384 0.7707172 > prop.test(c(1017, 703), c(1178, 945), conf.level=0.95, correct=FALSE) 95 percent confidence interval: 0.08536551 0.15345915 sample estimates: prop 1 prop 2 0.8633277 0.7439153

4.16 (a) Of those who used alcohol, 955 used marijuana and 994 did not. Of those who did not use alcohol, 5 used marijuana and 322 did not. Thus, π̂₁ = 955/(955 + 994) = 0.490 and π̂₂ = 5/(5 + 322) = 0.015. The Wald 95% CI is (0.490 − 0.015) ± 1.96√{[(0.490)(0.510)/1949] + [(0.015)(0.985)/327]}, which is (0.449, 0.501). If we can treat this as a random sample from a population of high school students of interest, we can be 95% confident that the population proportion using marijuana is between 0.45 and 0.50 higher for those who used alcohol than for those who did not.


FIGURE C.1: Bootstrap distribution of correlation between GDP and CO2 for UN data.

(b)

> prop.test(c(955,5), c(955+994, 5+322), conf.level=0.95, correct=FALSE)
95 percent confidence interval:
 0.4488310 0.5005777          # 95% Wald CI for difference
sample estimates:
    prop 1     prop 2
0.48999487 0.01529052

4.17

> y <- rt(10000, 3) > qqnorm(y, col='blue', main='y ~ t(3)'); abline(0,1)

The plot shows some very large positive and large negative values, suggesting thicker or longer tails than the standard normal has. 4.18

> 1 - pnorm(3) [1] 0.001349898 # P(Y > 3) for standard normal > 1 - pt(3, 1) [1] 0.1024164 # P(Y > 3) for Cauchy, which is t with df=1

The Cauchy distribution permits more extreme values. 4.19 (a)

> Books <- read.table("http://stat4ds.rwth-aachen.de/data/Library.dat", + header=TRUE) > boxplot(Books$C, xlab="Years since checked out", horizontal=TRUE) > library(boot) > boot.results <- boot(Books$C, function(x,i){median(x[i])}, 10000) > boot.results original bias std. error t1* 4 0.4465 0.8890539 > boot.ci(boot.results) Level Percentile BCa 95% ( 2, 7 ) ( 2, 4 )

We can be 95% confident that the population median falls between 2 and 7 years. (b)

> boot.results2 <- boot(Books$C, function(x,i){sd(x[i])}, 100000) > boot.results2 original bias std. error t1* 15.36272 -0.4745266 3.197195 > boot.ci(boot.results2) Level Percentile BCa 95% ( 8.03, 20.69 ) ( 9.92, 22.36 )

The bias-corrected CI says that we can be 95% confident that the population standard deviation falls between 9.92 and 22.36 years. (The sample standard deviation is 15.36.) 4.20

> UN <- read.table("http://stat4ds.rwth-aachen.de/data/UN.dat", header=TRUE) > library(boot) > b_corr = boot(cbind(UN[,2],UN[,6]), function(x,i){cor(x[i,1],x[i,2])}, 100000) > b_corr ORDINARY NONPARAMETRIC BOOTSTRAP Bootstrap Statistics : original std. error t1* 0.67447 0.08368 > boot.ci(b_corr, conf=0.90) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Level Percentile BCa 90% ( 0.5257, 0.7988 ) ( 0.4948, 0.7833 ) > hist(b_corr$t, xlab="Correlation", breaks="Scott")

Figure C.1 shows the bootstrap distribution of the correlation. It is skewed left rather than symmetric, resulting in the lower endpoint of the confidence interval being farther than the upper endpoint from the sample value of 0.67447. 4.21 (a)

> y <- rt(1000, 1)    # 1000 observations from Cauchy (t distribution with df=1)
> boot.results <- boot(y, function(x,i){mean(x[i], 0.05)}, 10000)
> boot.ci(boot.results)
Level     Percentile            BCa
95%   (-0.2607,  0.0952 )   (-0.2661,  0.0921 )
> t.test(y)           # ordinary t CI assuming normal population distribution
95 percent confidence interval:
 -1.3870866  0.5482248
mean of x
-0.4194309

A bootstrap CI for the trimmed mean is much narrower than the ordinary t CI for the mean, more precisely estimating the true center of 0. (b)

> boot.results2 <- boot(y, function(x,i){median(x[i])}, 10000) > boot.ci(boot.results2) Level Percentile BCa 95% (-0.1264, 0.0756 ) (-0.1214, 0.0772 )

The sample median estimator seems more precise than the sample mean or trimmed mean in sampling from such a heavy-tailed distribution. 4.22 (a)

(b)

> qbeta(c(0.025, 0.975), 0.84*300 + 1, 0.16*300 + 1) [1] 0.7941880 0.8770743 # posterior interval for Democrats > library(proportion) > ciBAx(0.84*300, 300, 0.05, 1.0, 1.0) x LBAQx UBAQx LBAHx UBAHx 1 252 0.794188 0.8770743 0.7957793 0.8784585 # another way to get interval > qbeta(c(0.025, 0.975), 0.24*300 + 1, 0.76*300 + 1) [1] 0.1951899 0.2914933 # posterior interval for Republicans > ciBAx(0.24*300, 300, 0.05, 1.0, 1.0) x LBAQx UBAQx LBAHx UBAHx y1 72 0.1951899 0.2914933 0.1941018 0.290301 > ciAllx(0.84*300, 300, 0.05) # classical CI's method x LowerLimit UpperLimit 1 Wald 252 0.7985154 0.8814846 4 Score 252 0.7942563 0.8771465

The Bayesian posterior interval and classical score CI are essentially the same. From the Bayesian interval, the probability is 0.95 that the population proportion of liberal Democrats who favor allowing gays and lesbians to legally marry is between 0.794 and 0.877. From the classical (score CI) approach, we are 95% confident that that population proportion is between 0.794 and 0.877. In repeated sampling, in the long run 95% of such CI’s would contain the actual population proportion. Before the samples are obtained, the random interval has probability 0.95 of containing the parameter. (c)

> library(PropCIs) > diffci.bayes(0.84*300, 300, 0.24*300, 300, 1.0, 1.0, 1.0, 1.0, 0.95, + nsim = 1000000) [1] 0.5306256 0.6577760

4.23 (a) ML estimate is sample proportion y/n = 10/10 = 1.0 (b) Bayesian estimate is posterior mean (y + 1)/(n + 2) = 11/12 = 0.917. 4.24 (a)

> qbeta(c(0.025, 0.975), 0 + 0.5, 25 + 0.5)    # (i)
[1] 1.944577e-05 9.468276e-02
> qbeta(c(0.025, 0.975), 0 + 1, 25 + 1)        # (ii)
[1] 0.0009732879 0.1322746045
> qbeta(c(0.025, 0.975), 0 + 10, 25 + 10)      # (iii)
[1] 0.1147335 0.3530451

As α = β goes from 0.5 to 10, the influence of the prior increases and the posterior interval goes from (0.000, 0.095) to (0.115, 0.353), so the endpoints shrink toward 0.50, which is the π value that is compatible with the prior α = β.


(b) With that uniform prior, the posterior beta distribution has parameters α = 1.0 and β = 26, which serves as the prior distribution for the new survey results.

4.25 (a)

> library(proportion) > ciBAx(0, 25, 0.05, 1.0, 1.0) x LBAQx UBAQx LBAHx UBAHx 1 0 0.0009732879 0.1322746 2.440083e-10 0.1088304

The lower bound is 0.00097 for the percentile interval and 0.000 for the HPD interval. The posterior distribution is proportional to (1−π)25 , monotone decreasing in π, so the HPD interval starts at 0. The equal-tail percentile interval starts at the 0.025 quantile of the beta(1, 26) distribution, which is 0.00097. (b) The equal-tail percentile interval cannot include exactly 0, because it starts at the 0.025 quantile of the beta posterior distribution, which is a continuous distribution between 0 and 1. Since the posterior density is monotone decreasing from π = 0, it is not sensible to exclude the values closest to 0 from a posterior interval. 4.26 With Jeffreys priors, the posterior distributions are beta(11.5, 0.5) for π1 and beta(0.5, 1.5) for π2 . > pi1 <- rbeta(10000000, 11.5, 0.5); pi2 <- rbeta(10000000, 0.5, 1.5) > hist(pi1 - pi2) > plot(density(pi1 - pi2)) > library(HDInterval) > hdi(pi1 - pi2, credMass=0.95) lower upper 0.182547 1.000000

The posterior density of π1 − π2 seems to be monotone increasing from −1.0 to 1.0, so the HPD interval has left-tail probability = 0.05 and right-tail probability = 0.00. 4.27 (a)

> change <- Anor$after - Anor$before
> library(MCMCpack)
> fit <- MCMCregress(change[Anor$therapy=="f"] ~ 1, mcmc=5000000,
+    b0=0, B0=0, c0=10^{-15}, d0=10^{-15})
> summary(fit)
1. Empirical mean and standard deviation for each variable
             Mean    SD
(Intercept) 7.266 1.856
2. Quantiles for each variable:
             2.5%   25%   50%   75% 97.5%
(Intercept) 3.591 6.067 7.265 8.464 10.95

The posterior mean estimate is 7.27 and the 95% equal-tail percentile posterior interval is (3.59, 10.95). (b)

> t.test(change[Anor$therapy=="f"]) 95 percent confidence interval: 3.58470 10.94471

Similar results but different interpretations. With the Bayesian approach, the posterior probability is 0.95 that µ falls between 3.59 and 10.95. With the classical approach, we replace "posterior probability" by "confidence" with the understanding that in repeated sampling, in the long run 95% of such CIs would contain µ.
4.28 For the data file read in the previous exercise, we have:
> y1 <- Anor$after[Anor$therapy=="f"] - Anor$before[Anor$therapy=="f"]
> n1 <- length(y1)
> S1 <- sum((y1 - mean(y1))^2)
> rsigma1 <- S1/rchisq(1000000, n1-1)
> mu1 <- rnorm(1000000, mean=mean(y1), sd=sqrt(rsigma1)/sqrt(n1))
> cbind(n1, mean(mu1), sd(mu1))
     n1
[1,] 17 7.261637 1.854139
> quantile(mu1, c(0.025, 0.975))
     2.5%     97.5%
 3.583582 10.939045
> y2 <- Anor$after[Anor$therapy=="c"] - Anor$before[Anor$therapy=="c"]
> n2 <- length(y2)
> S <- sum((y1 - mean(y1))^2) + sum((y2 - mean(y2))^2)
> rsigma2 <- S/rchisq(1000000, n1 + n2 - 2)      # under common sigma:
> mu1 <- rnorm(1000000, mean=mean(y1), sd=sqrt(rsigma2)/sqrt(n1))   # posterior of mu1
> mu2 <- rnorm(1000000, mean=mean(y2), sd=sqrt(rsigma2)/sqrt(n2))   # posterior of mu2
> cbind(n1, n2, mean(mu1-mu2), sd(mu1-mu2))
     n1 n2
[1,] 17 26 7.712349 2.454464
> quantile(mu1 - mu2, c(0.025, 0.975))
    2.5%    97.5%
 2.86776 12.53840         # posterior interval
> t.test(y1, y2, var.equal=TRUE)
95 percent confidence interval:
 2.880164 12.549248       # classical CI

With the Bayesian approach, the posterior probability is 0.95 that µ1 − µ2 falls between 2.87 and 12.54. With the classical approach, we replace “posterior probability” by “confidence” for the CI (2.88, 12.55), with the understanding that in repeated sampling, in the long run 95% of such CIs would contain µ1 − µ2 . 4.29 Let π1 = P (HG = 1 ∣ N V = 1) and π2 = P (HG = 1 ∣ N V = 0). > Endo <- read.table("http://stat4ds.rwth-aachen.de/data/Endometrial.dat",header=TRUE) > addmargins(table(Endo$NV, Endo$HG), margin=2) # table with row margins 0 1 Sum 0 49 17 66 1 0 13 13 > prop.test(c(13, 17), c(13, 66), conf.level=0.95, correct=FALSE) 95 percent confidence interval: 0.6369237 0.8479248 # Wald CI for pi_1 - pi_2 sample estimates: prop 1 prop 2 1.0000000 0.2575758 > library(PropCIs) > diffscoreci(13, 13, 17, 66, conf.level=0.95) 95 percent confidence interval: # score CI 0.5001011 0.8329848 # score CI for pi_1 - pi_2 > diffci.bayes(13, 13, 17, 66, 1.0, 1.0, 1.0, 1.0, 0.95, nsim = 1000000) [1] 0.4785122 0.8008196 # posterior interval (equal tails)

The Wald CI is unreliable, because π̂1 = 13/13 = 1.000, so the second sample makes no contribution to the standard error. The score CI of (0.500, 0.833) says we can be 95% confident that the probability of HG being high is between 0.500 and 0.833 higher when NV is present than when it is absent. The Bayesian posterior interval of (0.479, 0.801) states that the posterior probability is 95% that π1 − π2 falls between 0.479 and 0.801. 4.30 95% Wald confidence interval for difference of population proportions believing there is solid evidence of global warming is between 0.463 and 0.617 higher for liberal Democrats than for conservative Republicans. The method assumes independent random samples and leads to the conclusion that many more liberal Democrats than conservative Republicans believe there is solid evidence of global warming.


> prop.test(c(0.92*200, 0.38*200), c(200, 200), conf.level=0.95, correct=FALSE)
95 percent confidence interval:
 0.4629358 0.6170642

4.31

> Houses <- read.table("http://stat4ds.rwth-aachen.de/data/Houses.dat", header=TRUE) > t.test(Houses$price[Houses$new==1], Houses$price[Houses$new==0], var.equal=TRUE) 95 percent confidence interval: # assumes equal population variances 143.2974 313.8913 sample estimates: mean of x mean of y 436.4455 207.8511 > t.test(Houses$price[Houses$new==1], Houses$price[Houses$new==0]) 95 percent confidence interval: # not assuming equal population variances 79.59807 377.59059 # quite different results > sd(Houses$price[Houses$new=="1"]) [1] 219.8328 > sd(Houses$price[Houses$new=="0"]) [1] 121.0391

The sample standard deviations are quite different, so it is safer to use the confidence interval that does not assume σ1 = σ2 . We can be 95% confident that the corresponding population mean selling price is between 79.6 and 377.6 thousand dollars higher for new homes than for older homes. 4.32

> Afterlife <- read.table("http://stat4ds.rwth-aachen.de/data/Afterlife.dat", + header=TRUE) > table(Afterlife$religion, Afterlife$postlife) 1 2 1 956 114 2 384 65 3 17 17 > prop.table(table(Afterlife$religion, Afterlife$postlife), 1) 1 2 1 0.8934579 0.1065421 2 0.8552339 0.1447661 3 0.5000000 0.5000000 > prop.test(c(956,384), c(956+114, 384+65), conf.level=0.95, correct=FALSE) 95 percent confidence interval: # Wald CI for difference between Protestant and 0.0007940597 0.0756541221 # Catholic in proportion believing in afterlife > prop.test(c(956,17), c(956+114, 17+17), conf.level=0.95, correct=FALSE) 95 percent confidence interval: 0.2243788 0.5625371 # difference between Protestant and Jewish > prop.test(c(384,17), c(384+65, 17+17), conf.level=0.95, correct=FALSE) 95 percent confidence interval: 0.1840460 0.5264217 # difference between Catholic and Jewish

The results can be interpreted analogously to Exercise 4.29. However, the comparisons to Jewish should be expressed with reserve since the sample sizes are highly unbalanced. 4.33

> n <- 100 > y <- z <- NULL > for(i in 1:100000){ + x <- rnorm(n, 0, 1) + y[i] <- mean(x) + z[i] <- median(x) + } par(mfrow=c(1,2), pin=c(2.3, 2.5)) # control the plots layout (two in a row) and size plot(density(y)); plot(density(z)) > sum(y^2)/100000 # MSE for sample mean [1] 0.01000186 > sum(z^2)/100000 # MSE for sample median [1] 0.01552494



The mean has quite a bit smaller MSE and seems to be a better estimator (i.e., tends to be closer to true value of 0). 4.34

> n <- 100
> y <- z <- NULL
> for(i in 1:100000){
+   x <- runif(n, 0, 1)
+   y[i] <- mean(x)
+   z[i] <- median(x)
+ }
> sum((y - 0.5)^2)/100000;  sum((z - 0.5)^2)/100000
[1] 0.0008323104     # MSE of sample mean around true uniform mean of 0.5
[1] 0.00241696       # MSE of sample median around true uniform median of 0.5

The mean has quite a bit smaller MSE and seems to be a better estimator. (From Exercise 4.37, it is.)
4.35 False for nonlinear functions. For instance, when σ > 0, even though E(Y) = µ, with the function g(y) = y² we have E(Y²) > [E(Y)]² = µ².
4.36 (a) E(σ̃²) = E{[(n − 1)/(n + 1)]S²} = [(n − 1)/(n + 1)]E(S²) = [(n − 1)/(n + 1)]σ². Since (n − 1)/(n + 1) → 1 as n → ∞, E(σ̃²) → σ² and it is asymptotically unbiased.
(b) Using the decomposition ∑ᵢ(Yᵢ − µ)² = ∑ᵢ(Yᵢ − Ȳ)² + n(Ȳ − µ)² in Section 4.4.6, σ̂² = [∑ᵢ(Yᵢ − Ȳ)²]/n = [∑ᵢ(Yᵢ − µ)²]/n − (Ȳ − µ)². The first term is a sample mean of squared deviations about µ, which by the law of large numbers converges in probability to its expectation, σ². The second term converges in probability to 0, because by the law of large numbers Ȳ converges in probability to µ, so the squared difference between them converges in probability to 0. Hence σ̂² converges in probability to σ².
4.37 (a) For the mean and median M = µ of a normal distribution, f(µ) = 1/(√(2π)σ), so the variance of the sample median is approximately 1/{4[f(M)]²n} = πσ²/2n and the standard error is √(π/2)(σ/√n).
(b) var(Ȳ)/var(M̂) = (σ²/n)/[πσ²/2n] = 2/π = 0.637.
(c) From Section 2.3.3, the variance of the uniform over [0, θ] is θ²/12. The uniform has f(M) = 1/θ, so M̂ has approximate variance θ²/4n, and the sample mean has a variance only a third that of the sample median and is a much better estimator.
4.38 (a) ℓ(π) = ∏ᵢ[(1 − π)^(yᵢ−1)π] = [π/(1 − π)]ⁿ(1 − π)^(∑ᵢyᵢ), so a sufficient statistic is ∑ᵢyᵢ.
(b) Taking L(π) = log[ℓ(π)], differentiating with respect to π and equating to 0 yields π̂ = n/(∑ᵢyᵢ) = 1/ȳ, provided that π ∈ (0, 1), which holds. Verify that this is a maximum. Thus, for a random sample Y₁, . . . , Yₙ, the ML estimator is π̂ = 1/Ȳ.
4.39 (a) The likelihood function is ℓ(λ) = ∏ᵢ[λe^(−λyᵢ)] = λⁿexp(−λ∑ᵢyᵢ) for λ > 0. The log-likelihood function is L(λ) = n log(λ) − λ∑ᵢyᵢ for λ > 0.
(b) Setting (∂/∂λ)L(λ) = n/λ − ∑ᵢyᵢ = 0 yields λ̂ = n/∑ᵢyᵢ = 1/ȳ. Verify that this is a maximum. Thus, this is the ML estimate based on these observations, while for a random sample Y₁, . . . , Yₙ, the ML estimator is λ̂ = 1/Ȳ. By the invariance property of ML estimators, the ML estimator of E(Y) is Ȳ.
(c) As n increases, L(λ) becomes narrower and more parabolic in shape, and a narrower range of λ values is plausible:
L <- function(lambda, n, ymean){n*log(lambda) - lambda*n*ymean}   # log-likelihood
hatlambda <- 1/10      # ymean = 10
lambda <- seq(0.01, 4, 0.01)
plot(lambda, L(lambda,1,10)-L(hatlambda,1,10), type="l", col="blue")
lines(lambda, L(lambda,5,10)-L(hatlambda,5,10), col="red")
lines(lambda, L(lambda,10,10)-L(hatlambda,10,10), col="darkgreen")

(d) For y ≥ 0, log f(y; λ) = log(λ) − λy, so ∂ log f(y; λ)/∂λ = 1/λ − y. Then E(1/λ − Y)² = var(Y) = 1/λ², so using equation (4.3) the information is I(λ) = nE[∂ log f(Y; λ)/∂λ]² = nE(1/λ − Y)² = n/λ². Using equation (4.4), since ∂² log f(y; λ)/∂λ² = −1/λ², we have I(λ) = −n(−1/λ²) = n/λ². The asymptotic distribution of λ̂ is normal around mean λ with variance 1/I(λ) = λ²/n.
4.40 ℓ(α) = αⁿ/(∏ᵢyᵢ)^(α+1). Setting ∂[log ℓ(α)]/∂α = n/α − log(∏ᵢyᵢ) = 0 yields α̂ = n/log(∏ᵢyᵢ), which is 1/log[(∏ᵢyᵢ)^(1/n)], the reciprocal of the log of the geometric mean. Since ∂²[log ℓ(α)]/∂α² = −n/α², by equation (4.4) I(α) = n/α², the large-sample variance is α²/n, and the estimated standard error is α̂/√n.
4.41 (a) ℓ(λ) = f(y₁; k, λ)⋯f(yₙ; k, λ) = [λᵏ/Γ(k)]ⁿ e^(−λ∑ᵢyᵢ) ∏ᵢ(yᵢ^(k−1)), so L(λ) = kn log λ − λ∑ᵢyᵢ + c, where the term c is constant in terms of λ. So ∂L(λ)/∂λ = kn/λ − ∑ᵢyᵢ = 0 yields λ̂ = k/ȳ (verify that this is a maximum) and thus the ML estimator is λ̂ = k/Ȳ.
(b) By the invariance property of ML estimators, its ML estimator is k/λ̂ = Ȳ.
(c) Using (2.11) and (4.3), I(λ) = nE(k/λ − Y)² = n var(Y) = nk/λ², so the large-sample variance is 1/I(λ) = λ²/kn.
4.42 (a) Using the normal pdf (2.8), ℓ(µ) = (1/2πσ²)^(n/2) exp[−∑ᵢ(yᵢ − µ)²/2σ²], so as a function of µ, for a constant c, L(µ) = c − ∑ᵢ(yᵢ − µ)²/2σ² = c − [∑ᵢ(yᵢ − ȳ)² + n(ȳ − µ)²]/2σ², which is concave and parabolic in µ. Its maximum, c − ∑ᵢ(yᵢ − ȳ)²/2σ², is achieved at µ = ȳ, and thus the ML estimator is µ̂ = Ȳ.
(b) (i) I(µ) = nE[∂ log f(Y; µ)/∂µ]², where for the normal pdf ∂ log f(Y; µ)/∂µ = (Y − µ)/σ², so its expected square is 1/σ² and I(µ) = n/σ². (ii) I(µ) = −nE[∂² log f(Y; µ)/∂µ²] = n/σ². Using either, the large-sample variance of µ̂ is 1/I(µ), which is σ²/n.
(c) µ̂ = ȳ is the minimum variance unbiased estimator because it has variance σ²/n, the minimum possible for an unbiased estimator, by the footnote about the Cramér–Rao lower bound in the paragraph that contains formula (4.3).
(d) L(σ) = −n log(√(2π)σ) − ∑ᵢ(yᵢ − µ)²/2σ², so ∂L(σ)/∂σ = −n/σ + ∑ᵢ(yᵢ − µ)²/σ³. Equating to 0, evaluating at µ̂ = ȳ, and solving yields, for a random sample Y₁, . . . , Yₙ, the ML estimator σ̂ = √{[∑ᵢ(Yᵢ − Ȳ)²]/n}. Then ∂²L(σ)/∂σ² = n/σ² − 3[∑ᵢ(yᵢ − µ)²]/σ⁴, and by equation (4.4), I(σ) = −n/σ² + 3[nσ²]/σ⁴ = 2n/σ², so the large-sample variance of σ̂ is 1/I(σ) = σ²/2n.
4.43 (a) ∂ log L(µ, σ)/∂µ = ∑ᵢ[log(yᵢ) − µ]/σ² = 0 yields µ̂ = ∑ᵢ[log(yᵢ)]/n, which is also the log of the geometric mean [∏ᵢyᵢ]^(1/n). Also, ∂ log L(µ, σ)/∂σ = −n/σ + ∑ᵢ[log(yᵢ) − µ]²/σ³ = 0 yields σ̂² = (1/n)∑ᵢ[log(yᵢ) − µ̂]².
(b) ∂² log f(y; µ, σ)/∂µ² = −1/σ², so from (4.4) the large-sample variance of µ̂ = ∑ᵢ[log(Yᵢ)]/n is σ²/n and the estimated standard error is σ̂/√n.
(c) Substituting the values from (a), µ̂_Y = e^(µ̂ + σ̂²/2) and σ̂²_Y = [e^(σ̂²) − 1][µ̂_Y]².


4.44 (a)

> Houses <- read.table("http://stat4ds.rwth-aachen.de/data/Houses.dat", header=TRUE)
> n <- nrow(Houses)      # sample size, used in the (n-1)/n adjustment below
> NormMuHat <- mean(Houses$price)
> NormSigHat <- sd(Houses$price)*sqrt((n-1)/n)
> NormMuHat; NormSigHat
[1] 232.9965    # ML estimate of normal mu parameter
[1] 151.1319    # ML estimate of normal sigma parameter

(b)

> LogNormMuHat <- mean(log(Houses$price)) > LogNormSigHat <- sd(log(Houses$price))*sqrt((n-1)/n) > LogNormMuHat; LogNormSigHat [1] 5.291323 # ML estimate of log-normal mu parameter [1] 0.5591038 # ML estimate of log-normal sigma parameter

(c)

> EYhat <- exp(LogNormMuHat + (LogNormSigHat^{2})/2) > SigYhat <- sqrt(((EYhat)^2)*(exp(LogNormSigHat^2) - 1)) > EYhat; SigYhat [1] 232.2052 # ML estimate of mean selling price for log-normal distribution [1] 140.6655 # ML estimate of standard deviation of price for log-normal dist.

For the log-normal distribution, the estimates of the µ and σ parameters are 5.29 and 0.56, and the corresponding estimates of the mean and standard deviation of selling price are 232.2 and 140.7 thousand dollars. These are similar to but somewhat different from the ML estimates found in part (a) of the mean and standard deviation of selling price of 233.0 and 151.1 thousand dollars for the normal distribution. (d)

> hist(Houses$price, prob=TRUE) > curve(dnorm(x, 232.9965, 151.1319), add=TRUE, yaxt="n") > curve(dlnorm(x, 5.2913, 0.5591), add=TRUE, yaxt="n")

The histogram is skewed right, and the log-normal distribution also has that shape. The normal distribution is symmetric and has part of the distribution over negative values, so it does not fit well.
4.45 The log-likelihood function is L(π) = y₁ log(π²) + y₂ log[2π(1 − π)] + y₃ log[(1 − π)²] + c = 2y₁ log(π) + y₂ log(2) + y₂ log(π) + y₂ log(1 − π) + 2y₃ log(1 − π) + c, where c is a constant in terms of π.
(a) ∂L(π)/∂π = (2y₁ + y₂)/π − (y₂ + 2y₃)/(1 − π), and setting ∂L(π)/∂π = 0 and solving for π yields π̂ = (2y₁ + y₂)/2n, since y₁ + y₂ + y₃ = n.
(b) ∂²L(π)/∂π² = −(2y₁ + y₂)/π² − (y₂ + 2y₃)/(1 − π)², and substituting E(Y₁) = nπ², E(Y₂) = 2nπ(1 − π), and E(Y₃) = n(1 − π)² yields I(π) = −E[∂²L(π)/∂π²] = 2n/[π(1 − π)] and large-sample standard error √[π(1 − π)/2n].
(c) We estimate the standard error and use π̂ ± 1.96√[π̂(1 − π̂)/2n]. (A small numerical sketch appears after Exercise 4.48 below.)
4.46 A pivotal quantity is a function of the data and the parameter that has a distribution not depending on the parameter. One can bound the pivotal quantity between quantiles of that distribution, which do not depend on the parameter value, and these bounds can be inverted to get bounds for the parameter itself.
4.47 (a) The normal or t quantile used in the margin of error decreases as the confidence level decreases.
(b) Standard errors have √n in the denominator, so they diminish (as do margins of error) as n increases.
4.48 The standard error is greatest at π = 0.50, where it equals 0.50/√n. The margin of error in a 95% CI is then 1.96(0.50)/√n ≈ 1/√n. Setting 1/√n = M yields n = 1/M².
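As a small numerical sketch of the Exercise 4.45 formulas, using hypothetical category counts purely for illustration:

y1 <- 30; y2 <- 50; y3 <- 20; n <- y1 + y2 + y3     # hypothetical counts
pihat <- (2*y1 + y2)/(2*n)                          # ML estimate, here 0.55
se <- sqrt(pihat*(1 - pihat)/(2*n))                 # estimated standard error
pihat + c(-1, 1)*1.96*se                            # approximate 95% CI for pi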


4.49 The standard error for estimating a mean is σ/√n, so to obtain a certain margin of error M, you need a larger n when σ is larger. In the U.S., medical doctors would have much greater variability in their incomes than entry-level McDonald's employees.
4.50 (a) You would expect about 95%, which is 19 of 20, but usually fewer will do so because the actual coverage probability of the Wald method is less than the nominal value when n is small and π is near 0 or 1. Of 1000 intervals, probably about 70%, rather than 95%, contain π = 0.06. As shown in the R Appendix section A.4.1, one can find the actual coverage probability for the Wald (asymptotic) method:
> library(binom)
> binom.coverage(0.06, 20, conf.level=0.95, "asymptotic")
      method    p  n  coverage
1 asymptotic 0.06 20 0.7042596

(b) Sampling distribution is highly skewed to the right. Also, the Wald method must replace π by its estimate π̂ to get the margin of error, which is not done with the score CI.
(c) The standard error estimate is √[π̂(1 − π̂)/n], which is 0 when π̂ = 0, giving (0, 0) for the Wald CI for π. Using the binomial distribution, P(π̂ = 0) = (1 − π)ⁿ, which approaches 1 as π approaches 0. So, when π is tiny and approaching 0, the probability approaches 1 that the CI is (0, 0) and fails to contain the actual value of π.
(d) The score method does well, containing π about 97% of the time even though n is not large. A confidence interval with coverage higher than the confidence level is conservative. The coverage should be as close as possible to the confidence level.
> binom.coverage(0.06, 20, conf.level=0.95, "wilson")   # wilson is score method
  method    p  n  coverage
1 wilson 0.06 20 0.9710343
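A brief simulation sketch corroborates the Wald coverage value above; it uses only the settings π = 0.06 and n = 20 already considered:

set.seed(1)
pi <- 0.06; n <- 20; z <- qnorm(0.975)
phat <- rbinom(100000, n, pi)/n
se <- sqrt(phat*(1 - phat)/n)
mean(phat - z*se <= pi & pi <= phat + z*se)     # about 0.70, far below the nominal 0.95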

4.51 (a)
4.52 (b)
4.53 (a) is incorrect because we are making an inference about µ, not ȳ, which we know. (b) is correct. (c) is incorrect because the CI is not summarizing the sample data distribution. (d) is correct. (e) is incorrect because the percentage of times ȳ values would fall between 6.8 and 8.0 could be anything, depending on where the ȳ in the original sample falls relative to µ.
4.54 (a) ℓ(λ) = λⁿexp[−λ∑ᵢyᵢ] for λ > 0, a function of the data and the parameter that depends on the data only through ∑ᵢyᵢ. Thus the sufficient statistic is ∑ᵢYᵢ.
(b) L(λ) = n log(λ) − λ∑ᵢyᵢ, so ∂L(λ)/∂λ = n/λ − ∑ᵢyᵢ = 0 yields λ̂ = 1/ȳ and the ML estimator is λ̂ = 1/Ȳ.
(c) It is a pivotal quantity because its distribution does not depend on λ. Let χ²(q, d) denote quantile q of a chi-squared distribution with d degrees of freedom. Since P[χ²(0.025, 2n) < 2λ(∑ᵢYᵢ) < χ²(0.975, 2n)] = 0.95, the CI for λ is [χ²(0.025, 2n)/(2∑ᵢyᵢ), χ²(0.975, 2n)/(2∑ᵢyᵢ)].
4.55 This follows because we can express X₁² + X₂² as a sum of d₁ + d₂ squared independent standard normal random variables.
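The pivotal interval in Exercise 4.54(c) is simple to compute; the sketch below uses simulated exponential data (the sample size and true rate are arbitrary illustrations):

set.seed(1)
y <- rexp(30, rate = 2)      # n = 30 observations with true lambda = 2
n <- length(y)
c(qchisq(0.025, 2*n), qchisq(0.975, 2*n))/(2*sum(y))    # 95% CI for lambda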



4.56 Since S² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²]/(n₁ + n₂ − 2), X² = (n₁ + n₂ − 2)S²/σ² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²]/σ². From the single-sample results, (n₁ − 1)S₁²/σ² has a chi-squared distribution with df = n₁ − 1 and (n₂ − 1)S₂²/σ² has a chi-squared distribution with df = n₂ − 1. The samples are independent, so these chi-squared random variables are independent, and their sum is also chi-squared with df values summing to give n₁ + n₂ − 2.
4.57 (a) The Cauchy is the t distribution with df = 1, so from the definition of a t random variable, Y = Z/√(X²/1), where X² is a squared standard normal. So Y is a ratio of standard normal random variables.
(b)

> y <- rcauchy(1); mean(y) [1] 3.533669 > y <- rcauchy(100); c(mean(y), median(y)) [1] 1.52809 0.1332651 > y <- rcauchy(10000); c(mean(y), median(y)) [1] -0.3069901 -0.01267257 > y <- rcauchy(100000); c(mean(y), median(y)) [1] 0.01355825 -0.003072703 > y <- rcauchy(1000000); c(mean(y), median(y)) [1] 9.347567 0.0007656779 > y <- rcauchy(10000000); mean(y); median(y); quantile(y,0.25) [1] 5.087522 # sample mean of ten million Cauchy random variables [1] -0.0001024686 # sample median very close to Cauchy median = 0.0 25% -0.9989757 # sample lower quartile very close to Cauchy LQ = -1.0 > summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. # extreme outliers -2159764 -1 0 5 1 33344648 # affect mean > boxplot(y)

Sample mean is not converging, because the mean does not exist for the Cauchy distribution. Quantiles such as the median and quartiles are converging to the true values for the distribution.
4.58 (a) No, the inference refers to where µ falls, not ȳ, and the CIs would change from sample to sample. (b) No, the CI does not summarize the sample data distribution. (c) No, the inference refers to where µ falls, not ȳ. We know exactly what ȳ equals. (d) No, 95% of the CIs would contain µ, but that CI would not be (23.5, 25.0) for every sample.
4.59 From the Bayesian interval, the probability is 0.95 that µ is between 23.5 and 25.0. From the classical interval, we are 95% sure that µ lies in the CI (which differs from sample to sample) and with repeated samples, in the long run the CI would contain µ 95% of the time.
4.60 (a) The posterior distribution is beta with mean E(π ∣ y) = [n/(n + α + β)](y/n) + [(α + β)/(n + α + β)][α/(α + β)]. As n grows, n/(n + α + β) → 1 and (α + β)/(n + α + β) → 0. Thus the posterior mean is approximately y/n = π̂. The posterior distribution looks much like the likelihood function and has estimate of π close to the sample proportion.

(b) We get the ML estimate y/n in the limit as α = β converges down toward 0, not a legitimate value for the hyperparameters. 4.61 The posterior distribution is gamma with shape parameter k ∗ = k + ∑i yi and rate parameter λ∗ = λ + n. The mean is k ∗ /λ∗ which is approximately y for large n.



4.62 The empirically generated sampling distribution typically has similar variability as the true sampling distribution but tends to be centered around θ̂ rather than around θ.
4.63 (a) From Section 4.2.5,

U(y; µ) = ∑ᵢ₌₁ⁿ ∂[log f(yᵢ; µ)]/∂µ = ∂L(µ)/∂µ = (∑ᵢ₌₁ⁿ yᵢ − nµ)/µ = n(ȳ − µ)/µ,   and   I(µ) = n/µ.

(b) U(y; µ)/√I(µ) = [n(ȳ − µ)/µ]/√(n/µ) = √n(ȳ − µ)/√µ. The score CI endpoints are the solutions of the quadratic equation resulting from setting

√n(ȳ − µ)/√µ = ±z(α/2).
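A short sketch of solving that quadratic numerically; the values ȳ = 4 and n = 25 are hypothetical, chosen only to illustrate the computation:

score.ci <- function(ybar, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf)/2)
  # squaring sqrt(n)*(ybar - mu)/sqrt(mu) = z gives mu^2 - (2*ybar + z^2/n)*mu + ybar^2 = 0
  b <- 2*ybar + z^2/n
  sort(Re(polyroot(c(ybar^2, -b, 1))))
}
score.ci(4, 25)      # score CI for a Poisson mean with ybar = 4 and n = 25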

4.64 (a) The ML estimator π̂ is unbiased with variance π(1 − π)/n. The Bayesian estimator is the weighted average [n/(n + 2)]π̂ + [2/(n + 2)](1/2), which has expected value [n/(n+2)]π+[2/(n+2)](1/2) = (nπ+1)/(n+2). The variance is [n/(n+2)]2 var(π̂) = nπ(1 − π)/(n + 2)2 . For large n, these are similar. (b) MSE = variance + (bias)2 . The bias is (nπ +1)/(n+2)−π = (1−2π)/(n+2), so MSE = [nπ(1 − π) + (1 − 2π)2 ]/(n + 2)2 . The MSE for the Bayes estimator is smaller for a region around π = 0.50, with the region getting narrower as n increases. Observe the difference between the two plots for n = 10 and n = 1000: > mseB <- function(n,p){(n*p*(1-p)+(1-2*p)^2)/(n+2)^2} > mse <- function(n,p){p*(1-p)/n} > p <- seq(0,1,0.005) > plot(p, mse(10,p), type="l", col="blue") > lines(p, mseB(10,p), col="red") > plot(p, mse(1000,p), type="l", col="blue") > lines(p, mseB(1000,p), col="red")

We expect it to be better near 0.50 because it shrinks the sample proportion toward 0.50. MSE is smaller for a biased estimator than an unbiased estimator if its squared bias is less than its decrease in variance compared to the unbiased estimator.
4.65 (a) π̃ = (Y + c)/(n + 2c) = [n/(n + 2c)](Y/n) + [2c/(n + 2c)](1/2), which gives weight n/(n + 2c) to π̂ that decreases as c increases and weight 2c/(n + 2c) to 1/2 that increases as c increases.

(b) E(π̃) = [E(Y ) + c]/(n + 2c) = (nπ + c)/(n + 2c), so |bias| = ∣E(π̃) − π∣ = c∣1 − 2π∣/(n + 2c) = ∣1 − 2π∣/(2 + n/c) increases as c increases. var(π̃) = [1/(n + 2c)2 ]var(Y ) = nπ(1 − π)/(n + 2c)2 , which is decreasing as c increases. (c) From Section 4.7.2, the estimate π̃ = (y + c)/(n + 2c) results when we use a beta prior distribution for π with α = β = c, such as the uniform prior distribution which has c = 1. As n increases, the weight given (1/2) decreases and π̃ shrinks less and is more similar to π̂. 4.66 (a) Bias = E(cY −µ) = µ(c−1) has absolute value increasing from 0 to ∣µ∣ as c decreases from 1 to 0. var(µ̃) = var(cY ) = c2 σ 2 /n decreases from σ 2 /n to 0 as c decreases from 1 to 0. (b) MSE = var(µ̃) + (bias)2 = c2 σ 2 /n + [µ(c − 1)]2 . With µ = σ = n = 1, squared bias = (c − 1)2 , var(µ̃) = c2 , and MSE = c2 + (c − 1)2 which has minimum at c = 0.50. For arbitrary n with µ = σ = 1, MSE = c2 /n + (c − 1)2 is minimized at c = n/(n + 1), which increases toward 1 and µ̃ is more similar to y as n increases. For the plot, adjust the code of Exercise 4.64. (c) From Section 4.8, assuming independent observations from a normal distribution and using a N (ν, τ 2 ) prior distribution for µ, with the prior mean ν = 0, the posterior mean estimator of µ is µ̃ = cy with c = τ 2 /[τ 2 + (σ 2 /n)]. As n increases,



c increases toward 1 and µ̃ shrinks less and is more similar to ȳ. (In practice, the Bayesian approach requires a prior distribution also for σ; with a very disperse or improper prior, c ≈ τ²/[τ² + (s²/n)].)
4.67 (a) With y successes in n binary trials, π̂ = y/n and the midpoint of the 95% CI is [π̂ + (1.96)²/2n]/[1 + (1.96)²/n]. Replacing 1.96 by 2.0, this is approximately [y/n + 2/n]/(1 + 4/n) = (y + 2)/(n + 4), which is the sample proportion after adding two successes and two failures to the data set.
(b) Again, this follows when we replace 1.96 by 2.
4.68 (a) By Bayes' Theorem,
P(D ∣ +) = P(+ ∣ D)P(D)/P(+) = P(+ ∣ D)P(D)/[P(+ ∣ D)P(D) + P(+ ∣ Dᶜ)P(Dᶜ)] = π₁ρ/[π₁ρ + π₂(1 − ρ)].
(b) With uniform priors, the posterior distribution is beta(96, 6) for π₁ and beta(6, 96) for π₂.
> pi1 <- rbeta(10000, 96, 6);  pi2 <- rbeta(10000, 6, 96)
> PPV <- 0.005*pi1/(0.005*pi1 + 0.995*pi2)
> quantile(PPV, c(0.025, 0.975))
      2.5%      97.5%
0.04052689 0.17471935    # Posterior probability = 0.95 that PPV between 0.04 and 0.17

(c)

> PPV05 <- 0.05*pi1/(0.05*pi1 + 0.95*pi2) > quantile(PPV05, c(0.025, 0.975)) 2.5% 97.5% 0.3067085 0.6891875 > PPV50 <- 0.5*pi1/(0.5*pi1 + 0.5*pi2) > quantile(PPV50, c(0.025, 0.975)) 2.5% 97.5% 0.8936792 0.9768143

The PPV increases considerably as ρ increases.
4.69 (a) P(H, H) = P(H)P(H) = (0.5)(0.5) = 0.25 and P(H, T) = P(H)P(T) = 0.25 for two independent flips of a balanced coin. P(T, H) = P(T)π = π/2 and P(T, T) = P(T)(1 − π) = (1 − π)/2.
(b) The probability of a head on the second flip is 0.25 + π/2. Equating this to π̃ yields π̂ = 2(π̃ − 0.25) for the estimate of π.
(c) (i) 0, (ii) 0.50
4.70 (a) Here, E(Y) = θ/2, so setting ȳ = θ/2 yields the estimate θ̃ = 2ȳ.
(b) The joint pdf is (1/θ)ⁿ for all yᵢ between 0 and θ. After observing the data, we know θ must be larger than all observations, so ℓ(θ) = 1/θⁿ for θ ≥ y(n). This is monotone decreasing in θ, so it is maximized at θ̂ = y(n).
(c) We have θ̃ = 2ȳ = 8, θ̂ = 9. From the data, we know θ must be at least 9, so the method of moments estimate is not sensible.
(d) For 0 < y < θ, the cdf is F(y) = P(Y(n) ≤ y) = P(Y₁ ≤ y, . . . , Yₙ ≤ y) = P(Y₁ ≤ y)⋯P(Yₙ ≤ y) = (y/θ)ⁿ and its pdf is f(y) = F′(y) = ny^(n−1)/θⁿ.
(e) The method of moments estimator has E(θ̃) = E(2Ȳ) = 2E(Ȳ) = 2(θ/2) = θ, so it is unbiased. The ML estimator cannot be greater than θ, so it has expected value less than θ and is biased. Using its pdf from (d), E(θ̂) = ∫₀^θ y f(y) dy = (n/θⁿ)∫₀^θ yⁿ dy = [n/((n + 1)θⁿ)]θ^(n+1) = [n/(n + 1)]θ.


(f) The bootstrap distribution would be highly discrete, with a high probability exactly at y(n). That probability equals 1 − P(all n generated observations differ from y(n)), which is 1 − [1 − (1/n)]ⁿ. This is approximately 1 − e⁻¹ = 0.632 for large n. All the remaining probability is on values below y(n), and since P(Y(n) < θ) = 1, the probability that a percentile-based confidence interval, or any confidence interval of the form (Y(a), Y(b)) with 1 ≤ a < b ≤ n, will contain θ is 0.
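A simulation sketch of the 1 − e⁻¹ claim in (f); the uniform parameters θ = 10 and n = 50 below are arbitrary illustrations:

set.seed(1)
y <- runif(50, 0, 10)
boot.max <- replicate(10000, max(sample(y, replace = TRUE)))
mean(boot.max == max(y))     # close to 1 - (1 - 1/50)^50 = 0.636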

4.71 (a) By the Markov inequality, when P(Y ≥ 0) = 1 we have P(Y ≥ t) ≤ E(Y)/t. So, P(∣θ̂ − θ∣ ≥ ϵ) = P[(θ̂ − θ)² ≥ ϵ²] ≤ E[(θ̂ − θ)²]/ϵ². Since E(θ̂ − θ)² → 0, the right-hand side goes to 0 as n → ∞, so θ̂ → θ in probability.
(b) Since MSE(θ̂) = var(θ̂) + (bias)², if E(θ̂) → θ then bias → 0, and together with var(θ̂) → 0 we have MSE(θ̂) → 0; then part (a) implies that θ̂ → θ in probability.
4.72 (a) By the law of large numbers, π̂ → π in probability, so π̂(1 − π̂) → π(1 − π) in probability and √[π(1 − π)/π̂(1 − π̂)] → 1 in probability.
(b) Here Xₙ = √[π(1 − π)/π̂(1 − π̂)] → 1 in probability and Zₙ = (π̂ − π)/√[π(1 − π)/n] → N(0, 1) in distribution by the Central Limit Theorem, so that

(π̂ − π)/√[π̂(1 − π̂)/n] = XₙZₙ → N(0, 1)   in distribution.

4.73 (a) E(Zᵢ²) = var(Zᵢ) = 1, so E(χ²_d) = E(Z₁² + ⋯ + Z_d²) = d.
(b) X²/d = (Z₁² + ⋯ + Z_d²)/d, which by the law of large numbers converges to E(Z²) = 1, so X²/d → 1 in probability. Then T = Z/√(X²/d) converges in distribution to the distribution of Z, which is standard normal.
4.74 (a) F(y) = P(Y ≤ y) = P(Z² ≤ y) = P(−√y ≤ Z ≤ √y) = Φ(√y) − Φ(−√y).
(b) Since ϕ is symmetric around 0, ϕ(√y) = ϕ(−√y) and f(y) = F′(y) = (1/√y)ϕ(√y) = (1/√(2πy))e^(−y/2).
4.75 (a) From equation (2.6), f(y; π, n) = (n choose y) π^y (1 − π)^(n−y) = (1 − π)ⁿ (n choose y) [π/(1 − π)]^y = (1 − π)ⁿ (n choose y) exp{y log[π/(1 − π)]}.
(b) ℓ(θ) = ∏ᵢ[B(θ)h(yᵢ)exp(θyᵢ)] = [B(θ)]ⁿ[∏ᵢh(yᵢ)]exp(θ∑ᵢyᵢ). The term involving both the data and the parameter uses the data only through ∑ᵢyᵢ. Thus the sufficient statistic is ∑ᵢyᵢ.
4.76 For t < 1/2, m_U(t) = (1 − 2t)^(−d₁/2) and m_(U+V)(t) = (1 − 2t)^(−(d₁+d₂)/2). Since U and V are independent, m_(U+V)(t) = m_U(t)m_V(t), and so m_V(t) = m_(U+V)(t)/m_U(t) = (1 − 2t)^(−d₂/2), so that V ∼ χ² with d₂ degrees of freedom.
4.77 Each observation has probability 0.50 of falling below the median, so the number below the median has the binom(n, 0.50) distribution, which has mean n/2 and standard deviation √[n(0.5)(0.5)]. P(Y(a) < M < Y(b)) is the probability that the binomial random variable takes a value between a and b. For this to be 0.95, using the normal approximation for the binomial, take a ≈ n/2 − 1.96√(n/4) and b ≈ n/2 + 1.96√(n/4). For the library example, n = 54, so take a = 20 and b = 34, giving values 11 and 19.
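The order-statistic indices in Exercise 4.77 can be computed directly; a short sketch for the library example (n = 54):

n <- 54
round(c(n/2 - 1.96*sqrt(n/4), n/2 + 1.96*sqrt(n/4)))   # approximately 20 and 34,
# so the interval is (y(20), y(34))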



4.78 (a) Setting {αⱼ = 1} yields p(π₁, . . . , π_c; α₁, . . . , α_c) ∝ 1 over the simplex region.
(b) The posterior pdf g(π₁, . . . , π_c ∣ y₁, . . . , y_c; α₁, . . . , α_c) is proportional to p(π₁, . . . , π_c; α₁, . . . , α_c)f(y₁, . . . , y_c; n, π₁, . . . , π_c) ∝

π₁^(α₁−1)⋯π_c^(α_c−1) · π₁^(y₁)⋯π_c^(y_c) = π₁^(y₁+α₁−1)⋯π_c^(y_c+α_c−1),

which is the kernel of the Dirichlet distribution with hyperparameters {αⱼ* = yⱼ + αⱼ}.
(c) The posterior mean of πⱼ is αⱼ*/∑ₖαₖ* = (yⱼ + αⱼ)/[∑ₖ(yₖ + αₖ)] = (yⱼ + αⱼ)/(n + ∑ₖαₖ). With the uniform prior over the probability simplex, this is (yⱼ + 1)/(n + c), the sample proportion after adding one observation to each category.
4.79 Note that [T(Y) − θ]/σ_T ≤ 1.645 is equivalent to θ ≥ T(Y) − 1.645σ_T. The inequality satisfies the definition of a confidence interval in Section 4.3 with T_L(Y) = T(Y) − 1.645σ_T and T_U(Y) equal to the upper limit of the possible parameter values. For a proportion, using the estimated standard error, the lower bound is π̂ − 1.645√[π̂(1 − π̂)/n] = 0.628, so we are 95% confident that the population proportion in favor of legalization is at least 0.628.
4.80 (a) The sample mean changes to ∑ᵢ(yᵢ + c)/n = ȳ + c. The sample variance changes to [∑ᵢ(cyᵢ − cȳ)²]/(n − 1) = c²s², so the standard deviation changes to cs.
(b) No. The posterior mean is a weighted average of the sample mean and the prior mean, and if a constant is added to each observation, the sample mean changes by that constant but the prior mean does not change, so the estimator changes by less than that constant.
4.81 The goal is to find the statistic that minimizes the posterior expected loss, which in this case is r_p*(θ̂ ∣ y) = E[(θ − θ̂)² ∣ Y = y]. Now, E[(θ − θ̂)² ∣ Y = y] = E[(θ − E(θ ∣ y) + E(θ ∣ y) − θ̂)² ∣ Y = y] = E[(θ − E(θ ∣ y))² ∣ Y = y] + [E(θ ∣ y) − θ̂]², which is minimized when [E(θ ∣ y) − θ̂]² = 0, that is, when θ̂ = E(θ ∣ y).
4.82 We need to minimize the following posterior expected loss:

r_p*(θ̂ ∣ y) = E[∣θ − θ̂∣ ∣ Y = y] = ∫ ∣θ − θ̂∣ g(θ ∣ y) dθ = −∫ from −∞ to θ̂ of (θ − θ̂)g(θ ∣ y) dθ + ∫ from θ̂ to ∞ of (θ − θ̂)g(θ ∣ y) dθ.

Taking the derivative with respect to θ̂ and equating it to 0, we get ∫ from −∞ to θ̂ of g(θ ∣ y) dθ = ∫ from θ̂ to ∞ of g(θ ∣ y) dθ, which implies 2∫ from −∞ to θ̂ of g(θ ∣ y) dθ = ∫ from −∞ to ∞ of g(θ ∣ y) dθ = 1, and finally ∫ from −∞ to θ̂ of g(θ ∣ y) dθ = 1/2. Thus, θ̂ is the median of the posterior distribution g(θ ∣ y), since the second derivative satisfies ∂²r_p*(θ̂ ∣ y)/∂θ̂² = 2g(θ̂ ∣ y) > 0.
42

Solutions Manual: Foundations of Statistical Science for Data Scientists

Chapter 5 5.1 (a) H₀: π = 0.50; (b) Hₐ: Population correlation > 0; (c) H₀: µ₁ = µ₂
5.2 z = (0.52 − 0.50)/√[(0.50)(0.50)/1200] = 1.386, P-value = 0.166. The evidence against H₀ is not strong, but we cannot "accept H₀" because there are many plausible values for π other than 0.50, as highlighted by a confidence interval.
5.3 H₀: π = 0.50, Hₐ: π ≠ 0.50, test statistic z = (0.73 − 0.50)/√[(0.50)(0.50)/3402] = 26.8, P-value = 0.0000 to many decimal places (which one can report as P < 0.0001) gives extremely strong evidence against H₀. We conclude that legalization is supported by a majority of Canadians (since π̂ = 0.73 > 0.50).
5.4 For π = population proportion of those leaning Republican who believe that climate change is a major threat, H₀: π = 0.50, Hₐ: π ≠ 0.50. The P-value gives extremely strong evidence against H₀ and more specifically that π < 0.50 (since π̂ = 0.31 < 0.50). That is, a minority believe this. From the confidence interval, we learn not only that π < 0.50 but that it likely falls between 0.268 and 0.355, quite far below 0.50.
5.5 z = [(40/116) − (1/3)]/√{[(1/3)(2/3)]/116} = −0.26 has P-value = 0.60. We cannot reject H₀. It is plausible that π = 1/3 and the astrologers have no predictive power.
5.6 (a) P-values for one-sided tests with Hₐ: π > 0.50 are 0.036 in state A and 0.057 in state B. State A has greater evidence that π > 0.50 and of a Republican victory (but it does not mean that π for state A is greater than that of B).
(b) For state A, using the posterior beta distribution with hyperparameters 59 + 50 = 109 and 41 + 50 = 91, the posterior P(π < 0.50) = 0.101 (e.g., found in R using pbeta(0.50, 109, 91)). For state B, using the posterior beta with hyperparameters 525 + 50 = 575 and 475 + 50 = 525 gives posterior P(π < 0.50) = 0.066. So, state B has greater evidence of a Republican victory. The prior belief has as much weight as the data for state A but relatively little weight for state B.
5.7 (a) H₀: µ = 2.0, Hₐ: µ ≠ 2.0 (b) t = (2.49 − 2.00)/0.0236 = 20.8. The P-value gives very strong evidence against H₀, and we can conclude that µ > 2.0.
5.8 (a)

> Students<-read.table("http://stat4ds.rwth-aachen.de/data/Students.dat",header=T) > t.test(Students$ideol, mu=4.0, alt="two.sided") t = -4.5766, df = 59, p-value = 2.484e-05 95 percent confidence interval: 2.610683 3.455984

P -value = 0.00002 gives strong evidence that µ < 4.0 for the corresponding population (i.e., mean in the liberal direction from the moderate value of 4.0). Reject H0 using α = 0.05. (b) The confidence interval (2.61, 3.46) contains only values below 4.0. 5.9 The P -value reported is only approximate, because its calculation assumes a normal population distribution to get the t sampling distribution, and in practice the population distribution is not exactly normal. 5.10 See also results for Exercise 1.27.



> Sheep <- read.table("http://stat4ds.rwth-aachen.de/data/Sheep.dat", header=TRUE) > attach(Sheep) > t.test(weight[survival == 1], weight[survival == 0], var.equal=TRUE) t = 14.5, df = 1357, p-value < 2.2e-16 95 percent confidence interval: 4.019011 5.276597 sample estimates: mean of x mean of y 20.64592 15.99811

The test gives extremely strong evidence that the population means differ, the confidence interval showing a higher mean for those who survived. If Y1 is the weight for the survived sheep and Y2 for those not survived, then we assume that Y1 and Y2 are independent with Y1 ∼ N (µ1 , σ 2 ) and Y2 ∼ N (µ2 , σ 2 ), i.e. have equal variances. 5.11

> Polid <- read.table("http://stat4ds.rwth-aachen.de/data/Polid.dat", header=TRUE) > attach(Polid); tapply(ideology, race, mean) black hispanic white 3.816712 4.089431 4.170572 > tapply(ideology, race, sd) black hispanic white 1.298165 1.339258 1.458553 > t.test(ideology[Polid$race=="black"], ideology[Polid$race=="white"], var.equal=TRUE) t = -4.3383, df = 2204, p-value = 1.5e-05 > t.test(ideology[Polid$race=="hispanic"], ideology[Polid$race=="white"], var.equal=T) t = -0.98813, df = 2202, p-value = 0.3232 > t.test(ideology[Polid$race=="hispanic"], ideology[Polid$race=="black"], var.equal=T) t = 2.8127, df = 738, p-value = 0.005044

There is insufficient evidence to conclude that the population mean ideology differs for Hispanics and whites. There is strong evidence that the population mean ideology is lower (i.e., more liberal) for blacks than for whites and for Hispanics. 5.12 Assuming that the variances are equal, the t test has test statistic 2.31 and two-sided P -value 0.023. The 95% confidence interval for µB − µA is (0.51, 6.71). The significance test tells us there is strong evidence that menu B has higher sales, but the confidence interval is more informative, showing that the difference in means could be very small or quite large. Such analyses can be done using the following R function as shown below: > t.test.2sample <- function(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=TRUE, alpha=0.05) {if( equal.variance==TRUE ) { # pooled sd: se <- sqrt((1/n1+1/n2)*((n1-1)*s1^2+(n2-1)*s2^2)/(n1+n2-2)) df <- n1+n2-2 } else { se <- sqrt((s1^2/n1)+(s2^2/n2)) # Welch df (unequal variances): df <- ((s1^2/n1+s2^2/n2)^2)/((s1^2/n1)^2/(n1-1)+(s2^2/n2)^2/(n2-1)) } t <- (m2-m1-m0)/se CI <- (m2-m1) + c(-1,1)*qt(alpha/2,df,lower.tail = FALSE)*se output <- c(m2-m1, se, t, 2*pt(abs(t),df,lower.tail = FALSE), CI) names(output) <-c("Difference of means","Std Error","t","p-value","CI.L","CI.U") return(output) } > t.test.2sample(22.3,25.91,6.88,8.01,43,50) Difference of means Std Error t p-value 3.61000000 1.56185318 2.31135683 0.02307139 CI.L CI.U 0.50757053 6.71242947 > t.test.2sample(22.3,25.91,6.88,8.01,43,50,equal.variance=FALSE) Difference of means Std Error t p-value 3.61000000 1.54402137 2.33805054 0.02157623


                CI.L                CI.U 
          0.54299121          6.67700879 

(The analyses with summary statistics can easily be also done with the Compare Two Means app at artofstat.com.) 5.13 The significance tests have P -values essentially 0 each year, giving strong evidence of a difference but not telling us how much of a difference. The 95% confidence intervals comparing strong Democrats to strong Republicans are (0.71, 1.33) for 1974 and (2.67, 3.07) for 2018. These can be easily found with the R function provided in the solution of Exercise 5.12 or the applet at artofstat.com. On the average, the Democrats moved a category in the liberal direction while the Republicans moved a category in the conservative direction, so the difference between the two means increased by about two categories. 5.14 (a) Assume the data are random samples from the corresponding population distributions, which are normal with equal variances. The random sample situation is most important, equal variance less important unless sample variances are quite different, normality not so important because of robustness for that assumption. > Income <- read.table("http://stat4ds.rwth-aachen.de/data/Income.dat",header=TRUE) > incB <- Income$income[Income$race=="B"] > incH <- Income$income[Income$race=="H"] > t.test(incH, incB, var.equal=TRUE) t = 0.67962, df = 28, p-value = 0.5023 alternative hypothesis: true difference in means is not equal to 0 mean of x mean of y 31.00 27.75 > sd(incH); sd(incB) # similar values, so can use this t test [1] 12.81225 [1] 13.28408

Since the P -value of 0.50 is not small, it is plausible that the population mean incomes are identical. (b)

> library(MCMCpack) > Income$black <- Income$race=="B" > fit.bayes <- MCMCregress(income ~ black, mcmc=10000000, + b0=0, B0=10^{-15}, c0=10^{-15}, d0=10^{-15}, data=Income[-(31:80),]) > summary(fit.bayes) 1. Empirical mean and standard deviation for each variable, Mean SD (Intercept) 31.00 3.623 blackTRUE -3.25 4.961 # difference in sample means is -3.25 > mean(fit.bayes[,2] > 0) # posterior probability the 2nd parameter > 0 [1] 0.251036

The posterior probability of 0.251 of a higher population mean for blacks is similar to the one-sided P-value of 0.5023/2 = 0.251 for testing H0 : µB = µH (implicitly, H0 : µB ≥ µH ) against Ha : µB < µH . 5.15 Effect size for height is (175.4 − 161.7)/7 = 1.96 and effect size for marathon time is (248 − 221)/40 = 0.675. The effect is greater for height. 5.16 (a) The same subjects are in each sample. (b) µd is the population mean of the difference scores. We test H0 : µd = 0. Since d̄ = (∑i di)/n = [∑i (yi1 − yi2)]/n = (∑i yi1)/n − (∑i yi2)/n = ȳ1 − ȳ2, likewise µd = µ1 − µ2.



(c) Assume the students in the experiment are a random sample of the population of interest and that the difference scores have an approximate normal distribution. The test statistic is t = (d̄ − 0)/(sd/√n), where sd is the sample standard deviation of the difference scores. The number of difference scores {di} is n, so df = n − 1. > CP <- read.table("http://stat4ds.rwth-aachen.de/data/CellPhone.dat", header=T) > d = CP$phone - CP$control > mean(d); sd(d) [1] 79.375 [1] 45.40591 > t.test(d, mu=0, alt="two.sided") t = 4.9444, df = 7, p-value = 0.001667 95 percent confidence interval: 41.41471 117.33529

The P -value of 0.002 gives strong evidence that the population mean is higher in the cell phone condition than in the control condition. Alternatively, the paired sample option of the t.test() function can be used: t.test(CP$phone, CP$control, paired = TRUE, alternative = "two.sided")

5.17

> Anor <- read.table("http://stat4ds.rwth-aachen.de/data/Anorexia.dat", header=TRUE) > cogbehav <- Anor$after[Anor$therapy=="cb"] - Anor$before[Anor$therapy=="cb"] > control <- Anor$after[Anor$therapy=="c"] - Anor$before[Anor$therapy=="c"] > cogbehav[15] <- 2.9 > mean(control); mean(cogbehav) [1] -0.45 [1] 2.386207 > sd(control); sd(cogbehav) [1] 7.988705 [1] 6.448351 > t.test(cogbehav, control, var.equal=TRUE) t = 1.4553, df = 53, p-value = 0.1515 95 percent confidence interval: -1.072826 6.745239

The mean for the cognitive behavioral group changes from 3.01 to 2.39, the test statistic changes from 1.68 to 1.46, and the P -value changes from 0.10 to 0.15 (compare to results in Section 5.3.2). A single outlier can have a substantial effect when the sample size is not very large. 5.18 (a)

> family <- Anor$after[Anor$therapy=="f"] - Anor$before[Anor$therapy=="f"] > t.test(cogbehav, family, var.equal=TRUE) t = -1.9216, df = 44, p-value = 0.06115 95 percent confidence interval: -8.7234423 0.2078237 mean of x mean of y 3.006897 7.264706

The P-value of 0.06 gives some, but not strong, evidence that the population mean weight change is higher for family therapy than for cognitive behavioral therapy. Since the P-value = 0.06 > 0.05, we cannot reject H0, so if an error has occurred it is a Type II error. (b) The 95% confidence interval of (−8.7, 0.2) contains 0.0, indicating that it is plausible that the difference in population mean weight changes is 0, but it also indicates that the mean weight change could be much greater for the family therapy group. (c) (i) Type I error: deciding that family therapy has a greater population mean weight change, when actually its population mean is the same as for cognitive behavioral therapy. (ii) Type II error: not rejecting H0 of equal population means even though the population mean weight change is greater for family therapy than for cognitive behavioral therapy.



5.19 Estimates π̂1 = 0.863, π̂2 = 0.744 give pooled estimate π̂ = (1017 + 703)/(1178 + 945) = 0.810. The test statistic is z = (0.863 − 0.744)/√{(0.810)(0.190)[(1/1178) + (1/945)]} = 6.97. For Ha : π1 ≠ π2 , the P-value is less than 0.0001, giving extremely strong evidence that the population proportion believing in life after death is higher for women. 5.20 The posterior beta distributions are beta(1018, 162) for females and beta(704, 243) for males. > pi1 = rbeta(10000000, 1018, 162); pi2 = rbeta(10000000, 704, 243) > quantile(pi1 - pi2, c(0.025, 0.975)) 2.5% 97.5% 0.08544253 0.15346138 > mean(pi1 > pi2) [1] 1

The posterior probability is 1.0. We can conclude that females are more likely than males to believe in life after death. 5.21

> prop.test(c(133, 429), c(429, 487), correct=FALSE) X-squared = 313.5, df = 1, p-value < 2.2e-16

There is extremely strong evidence of an association between political party and opinion about climate change. 5.22 The two-way table cross-classifying marijuana and alcohol can be derived as follows: > Substance <-read.table("http://stat4ds.rwth-aachen.de/data/Substance.dat",header=T) > MA.table <- apply(array(Substance$count, c(2,2,2)),c(1,3),sum) > dimnames(MA.table) <-list(marijuana=c("Yes","No"),alcohol=c("Yes","No")) > addmargins(MA.table) # contingency table with marginal sums alcohol marijuana Yes No Sum Yes 955 5 960 No 994 322 1316 Sum 1949 327 2276 > addmargins(prop.table(MA.table,2)) # cond. probabilities within col. alcohol marijuana Yes No Sum Yes 0.4899949 0.01529052 0.5052854 No 0.5100051 0.98470948 1.4947146 Sum 1.0000000 1.00000000 2.0000000

(a) Based on the table above, we get π̂1 = 955/(955 + 994) = 0.490, π̂2 = 5/(5 + 322) = 0.015, pooled estimate π̂ = (955 + 5)/(1949 + 327) = 0.422; z = (0.490 − 0.015)/√{(0.422)(0.578)[(1/1949) + (1/327)]} = 16.1. The P-value < 0.0001 gives strong evidence that the population proportion using marijuana is higher for those who used alcohol than for those who did not use alcohol. We assume that we can treat this as a random sample from a conceptual population of all high school students of interest. (b) > prop.test(c(955,5), c(1949,327), correct=FALSE) X-squared = 258.73, df = 1, p-value < 2.2e-16

The z test statistic of 16.1 is the square root of the X-squared statistic, 258.73. 5.23

> Happy <- read.table("http://stat4ds.rwth-aachen.de/data/Happy.dat", header=TRUE) > Gender <- Happy$gender; Happiness <- factor(Happy$happiness) > levels(Happiness) <- c("Very", "Pretty", "Not too") > GH.tab <- table(Gender, Happiness) > chisq.test(GH.tab) # (a) X-squared = 0.91653, df = 2, p-value = 0.6324



> chisq.test(GH.tab)$expected # (b) Happiness Gender Very Pretty Not too female 347.2941 640.4575 160.2484 male 300.7059 554.5425 138.7516 > stdres <- chisq.test(GH.tab)$stdres; stdres Happiness Gender Very Pretty Not too female 0.5381769 0.1345633 -0.9061642 male -0.5381769 -0.1345633 0.9061642 > library(vcd) > mosaic(GH.tab, gp=shading_Friendly, residuals=stdres)

The chi-squared test has test statistic X 2 = 0.92 and a P -value of 0.63, so it is plausible that happiness and gender are independent. This is in agreement with expected frequencies that are close to the observed counts and standardized residuals that are small. 5.24 The number who report being not too happy is 9.8 standard errors below the number expected for the married subjects, 5.5 standard errors above the number expected for the divorced/separated subjects, and 5.6 standard errors above the number expected for the never married subjects. 5.25 Continuing the code of Exercise 5.23, we have: > Marital <- factor(Happy$marital); levels(Marital) <-c("Married","Div/Sep","Never") > MH.tab <- table(Marital, Happiness); MH.tab[2:3,] Happiness Marital Very Pretty Not too Div/Sep 92 282 103 Never 124 409 135 > chisq.test(MH.tab[2:3,]) X-squared = 0.53865, df = 2, p-value = 0.7639 > rbind(MH.tab[1,], MH.tab[2,]+MH.tab[3,]) Very Pretty Not too [1,] 432 504 61 [2,] 216 691 238 > chisq.test(rbind(MH.tab[1,], MH.tab[2,]+MH.tab[3,])) X-squared = 196.76, df = 2, p-value < 2.2e-16

It is plausible that the divorced/separated and never married subjects have the same conditional distribution on happiness, but those two combined differ from the married subjects in their conditional distribution. 5.26

> GSS <- read.table("http://stat4ds.rwth-aachen.de/data/GSS2018.dat", header=TRUE) > gender <- factor(GSS$SEX, levels=c(1,2), labels=c("Male","Female")) > vote <- factor(GSS$PRES16,levels=1:4,labels=c("Clinton","Trump","Other","Never")) > table(gender, vote) vote gender Clinton Trump Other Never Male 306 309 49 11 Female 458 268 38 9 > prop.table(table(gender, vote), 1) vote gender Clinton Trump Other Never Male 0.45333333 0.45777778 0.07259259 0.01629630 Female 0.59249677 0.34670116 0.04915912 0.01164295 > chisq.test(gender, vote) X-squared = 28.242, df = 3, p-value = 3.231e-06 > chisq.test(gender, vote)$expected vote gender Clinton Trump Other Never



Male 356.1464 268.9744 40.55594 9.323204 Female 407.8536 308.0256 46.44406 10.676796 > stdres <- chisq.test(gender, vote)$stdres; stdres vote gender Clinton Trump Other Never Male -5.2914697 4.3067688 1.8718622 0.7568542 Female 5.2914697 -4.3067688 -1.8718622 -0.7568542 > library(vcd) > mosaic(table(gender, vote), gp=shading_Friendly, residuals=stdres, + residuals_type="Standardized\nresiduals", labeling=labeling_residuals)

Females had relatively more Clinton votes and males had relatively more Trump votes. 5.27

> PID <- read.table("http://stat4ds.rwth-aachen.de/data/PartyID.dat", header=TRUE) > PID.tab <- table(PID$race, PID$id); PID.tab Democrat Independent Republican black 281 65 30 other 124 77 52 white 633 272 704 > chisq.test(PID.tab) X-squared = 233.04, df = 4, p-value < 2.2e-16 > chisq.test(PID.tab)$stdres PID$id PID$race Democrat Independent Republican black 12.0867410 -0.6632516 -12.0876031 other 0.8911036 5.1918344 -5.1541153 white -10.6805791 -3.1056344 13.6842848 > prop.test(c(PID.tab[1,1], PID.tab[3,1]), c(sum(PID.tab[1,]), sum(PID.tab[3,])), + correct=F) X-squared = 153.67, df = 1, p-value < 2.2e-16 95 percent confidence interval: 0.3039396 0.4039172 sample estimates: prop 1 prop 2 0.7473404 0.3934121

Relative to what's expected under independence of party ID and race, there are more black Democrats, more white Republicans, and more other-category Independents. One can also do pairwise comparisons of groups for a particular response category, as shown for comparing the proportions of blacks and whites who identify as Democrats. 5.28

Afterlife <- read.table("http://stat4ds.rwth-aachen.de/data/Afterlife.dat", header=TRUE) > table(Afterlife$gender, Afterlife$postlife) 1 2 1 549 110 2 808 86 > prop.table(table(Afterlife$gender, Afterlife$postlife), 1) 1 2 1 0.83308042 0.16691958 2 0.90380313 0.09619687 > chisq.test(table(Afterlife$gender, Afterlife$postlife), correct=FALSE) X-squared = 17.206, df = 1, p-value = 3.354e-05 > chisq.test(table(Afterlife$gender, Afterlife$postlife))$stdres 1 2 1 -4.147994 4.147994 2 4.147994 -4.147994 > table(Afterlife$religion, Afterlife$postlife) 1 2 1 956 114 2 384 65 3 17 17 > prop.table(table(Afterlife$religion, Afterlife$postlife), 1) 1 2



1 0.8934579 0.1065421 2 0.8552339 0.1447661 3 0.5000000 0.5000000 > chisq.test(table(Afterlife$religion, Afterlife$postlife)) X-squared = 48.232, df = 2, p-value = 3.362e-11 > chisq.test(table(Afterlife$religion, Afterlife$postlife))$stdres 1 2 1 3.473424 -3.473424 2 -1.404520 1.404520 3 -6.636370 6.636370

Relatively more women than men believe in the afterlife, and relatively more Protestants and fewer Jewish subjects believe in it compared with Catholics. Mosaic plots can also be included and commented on in the report. 5.29 For π = probability of greater relief with the new analgesic than with the standard, testing H0 : π = 0.50 has z = (0.60 − 0.50)/√[(0.50)(0.50)/100] = 2.0 and P-value = 0.046, giving considerable but not overwhelming evidence in favor of the new analgesic. 5.30 (a) The table cross-classifying belief in heaven (rows) by belief in hell (columns) is:
              Hell
Heaven      Yes    No
   Yes      804   113
   No        10   209
π̂1 = (804 + 113)/1136 = 0.807 and π̂2 = (804 + 10)/1136 = 0.717. The samples are dependent because the same 1136 subjects are used for each estimate. (b) π1 = π11 + π12 and π2 = π11 + π21, so π1 = π2 is equivalent to π12 = π21. Using the counts 113 and 10, z = (113 − 10)/√(113 + 10) = 9.3 and the P-value < 0.0001, indicating very strong evidence of a greater population proportion believing in heaven. 5.31 The populations of interest are the nests of the threatened Sooty Falcon species on Fahal Island (population 1) and on the Daymaniyat Islands in the Sea of Oman (population 2), which are assumed to be independent. We can compare the means of their clutch sizes by testing H0 : µ1 = µ2 versus Ha : µ1 ≠ µ2. Using the R function provided in the solution of Exercise 5.12 we get:
> t.test.2sample(2.66,2.92,0.618,0.787,100,53)
Difference of means Std Error t p-value
0.26000000 0.11569727 2.24724401 0.02607381
CI.L CI.U
0.03140546 0.48859454
> t.test.2sample(2.66,2.92,0.618,0.787,100,53,equal.variance=FALSE)
Difference of means Std Error t p-value
0.26000000 0.12452087 2.08800337 0.03973195
CI.L CI.U
0.01248840 0.50751160

H0 is rejected at significance level α = 5%, and thus we conclude that the mean clutch size is slightly larger for the second population. The 95% CI for µ2 − µ1 (considering unequal variances) is (0.012, 0.501). For comparing the population proportions of failed nests in the two locations, the hypothesis testing problem is H0 : π1 = π2 versus Ha : π1 ≠ π2 . The sample proportions are π̂1 = 1/108 = 0.009 and π̂2 = 29/118 = 0.246. The Z-test is not suitable, since y1 = 1 is too low for asymptotic inference. Using uniform prior distributions, we find that the 95% equal-tailed posterior interval for π2 − π1 is (0.154, 0.316), while the posterior probability that the population proportion of failed nests is higher at the Daymaniyat Islands than at Fahal Island is essentially 1:



> pi1 = rbeta(10000000, 2, 108); pi2 = rbeta(10000000, 30, 90) > quantile(pi2 - pi1, c(0.025, 0.975)) 2.5% 97.5% 0.1536552 0.3157329 > mean(pi2 > pi1) [1] 1

5.32 (a) (i) Deciding astrologers can predict better than random guessing when they cannot, (ii) Deciding it is plausible that astrologer’s predictions are like random guessing when actually they can predict better than that. (b) The probability of Type II error at value π = π1 (under Ha ) when P (Type I error) = α can be found as follows (illustration for π0 = 1/3, π1 = 0.5 and α = 0.01): > p0=1/3; p1=0.5; n=116; alpha=0.01 > se <- function(p,n){sqrt(p*(1-p)/n)} > se0 <- se(p0,n); se1 <- se(p1,n); > pnorm((p0+qnorm(alpha,lower.tail=F)*se0 - p1)/se1) [1] 0.08123565
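The same calculation can be applied over a grid of alternative values π1 (a brief sketch reusing p0, alpha, n, se, and se0 from the code above):
> p1.grid <- c(0.40, 0.45, 0.50, 0.60)
> sapply(p1.grid, function(p1) pnorm((p0 + qnorm(alpha,lower.tail=F)*se0 - p1)/se(p1,n)))
# P(Type II error) shrinks as pi1 moves farther above pi0 = 1/3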

5.33 (a) Jones has z = (0.55 − 0.50)/√[(0.50)(0.50)/400] = 2.0, which has two-sided P-value = 0.046. Smith has z = (0.5475 − 0.50)/√[(0.50)(0.50)/400] = 1.90, which has two-sided P-value = 0.057. The result is statistically significant only for Jones, even though in practical terms the results are very similar. (b) The 95% confidence interval for π is (0.501, 0.599) for Jones and (0.499, 0.596) for Smith, almost identical, but the one for Jones does not contain 0.50, in agreement with its statistical significance. 5.34 (a) Jones gets t = (519.5 − 500)/10.0 = 1.95 and P-value = 0.051; Smith gets t = (519.7 − 500)/10.0 = 1.97 and P-value = 0.049. For α = 0.05, Jones does not reject H0 but Smith does. Only Smith's study is significant at the 0.05 level. (b) These two studies give such similar results that they should not yield different conclusions. Reporting the actual P-value shows that each study has moderate evidence against H0. The 95% confidence intervals for µ are (499.9, 539.1) for Jones and (500.1, 539.3) for Smith, nearly identical, but the one for Smith does not contain 500, in agreement with its statistical significance. 5.35 Since P(fail to reject H0 | effect) = 0.50 and P(effect) = 0.10,
P(No effect | rej. H0) = P(rej. H0 | No effect)P(No effect) / [P(rej. H0 | No effect)P(No effect) + P(rej. H0 | effect)P(effect)] = (0.05)(0.90)/[(0.05)(0.90) + (0.50)(0.10)] = 0.47
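This calculation can be checked with one line of R:
> (0.05*0.90)/(0.05*0.90 + 0.50*0.10)
[1] 0.4736842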

5.36 The 95% confidence interval for θ is the set of θ0 for which 2[L(θ̂) − L(θ0 )] ≤ 3.84, or log[ℓ(θ̂)] − log[ℓ(θ0 )] = log[ℓ(θ̂)/ℓ(θ0 )] ≤ 3.84/2. Exponentiating both sides, ℓ(θ̂)/ℓ(θ0 ) ≤ exp(3.84/2) = 6.8. The calculation of the 95% LR-based CI for the proportion in the example of Section 5.7.2 can be found by trial and error or by using code such as follows: > con <- function(a=0.05){exp(qchisq(a, df=1, lower.tail = F)/2)}; con(a=0.05) [1] 6.825936 > y <- 524; n<- 1008; pihat <- y/n > loglik <- function(pi,n,y){pi^y*(1-pi)^(n-y)} > logLR <- function(pi0,pi,n,y,a=0.05){log(loglik(pi,n,y)/loglik(pi0,n,y))} > pi0 <- seq(0.45,0.6,0.001) > plot(pi0,logLR(pi0,pihat,n,y,a=0.05),type="l")



> abline(h=log(con(a=0.05)), col="blue")  # i.e., qchisq(0.95, df=1)/2 = 1.92 on the log likelihood-ratio scale
> library(rootSolve)
> LR.CI <- function(pi0){logLR(pi0,pihat,n,y,a=0.05) - log(con(a=0.05))}
> L <- uniroot(LR.CI, c(0.45, 0.5))$root; U <- uniroot(LR.CI, c(0.5, 0.6))$root
> c(L,U)  # 95% LR-based CI, approximately (0.489, 0.551)
> abline(v=L, col="red", lty=3); abline(v=U, col="red", lty=3)
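For comparison, the 95% Wald interval for this proportion, using pihat and n from the code above, is:
> pihat + c(-1,1)*qnorm(0.975)*sqrt(pihat*(1-pihat)/n)  # roughly (0.489, 0.551), close to the LR-based interval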

5.37

> Anor <- read.table("http://stat4ds.rwth-aachen.de/data/Anorexia.dat", header=TRUE) > cogbehav <- Anor$after[Anor$therapy=="cb"] - Anor$before[Anor$therapy=="cb"] > control <- Anor$after[Anor$therapy=="c"] - Anor$before[Anor$therapy=="c"] > t.test(cogbehav, control, var.equal=TRUE) t = 1.676, df = 53, p-value = 0.09963 > library(EnvStats) > test <- twoSamplePermutationTestLocation(cogbehav, control, fcn="mean", + alternative="two.sided", exact=FALSE, n.permutations=1000000) > test$p.value [1] 0.09996 > wilcox.test(cogbehav, control, alternative ="two.sided") W = 472, p-value = 0.1111

The P-values are about 0.10 for all three methods. 5.38 (a) The P-value is smallest when the observations in one group are completely separated from those in the other group. The number of possible partitions is (14 choose 7) = 3432, and the one-sided P-value is then 1/3432 = 0.00029. > petting <- c(203, 217, 254, 256, 284, 294, 296) > praise <- c(4, 7, 24, 25, 48, 71, 114) > library(EnvStats) > test <- twoSamplePermutationTestLocation(petting, praise, fcn="mean", + alternative="greater", exact=TRUE) > test$p.value [1] 0.0002913753 > choose(14,7); 1/choose(14,7) [1] 3432 [1] 0.0002913753

(b) The sample medians are 256 and 25 > library(simpleboot) > b <- two.boot(petting, praise, median, R=100000) > library(boot) > boot.ci(b) Level Percentile BCa 95% (169, 277 ) (146, 271 )

From the BCa interval, we can be 95% confident that, in the conceptual population, the median time is between 146 and 271 seconds higher for petting than for praise. This tells us not just that an effect exists but also estimates its size. 5.39 (a) The standard deviation of the selling prices is 219.8 thousand dollars for new homes and 121.0 thousand dollars for older homes, quite different, so we use the method that does not assume equal population variances. > Houses <- read.table("http://stat4ds.rwth-aachen.de/data/Houses.dat",header=T) > PriceNew <- Houses$price[Houses$new=="1"] > PriceOld <- Houses$price[Houses$new=="0"] > sd(PriceNew); sd(PriceOld) [1] 219.8328 [1] 121.0391 > t.test(PriceNew, PriceOld) Welch Two Sample t-test t = 3.386, df = 10.762, p-value = 0.006263



alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 79.59807 377.59059 mean of x mean of y 436.4455 207.8511

The P -value of 0.006 gives strong evidence that the corresponding population has a higher mean selling price for new homes than for older homes. The 95% confidence interval contains only values above 0 for µ1 − µ2 , in agreement with rejecting H0 : µ1 = µ2 in favor of Ha : µ1 ≠ µ2 at the α = 0.05 level. (b)

> median(PriceNew); median(PriceOld) [1] 427.5 [1] 190.8 > library(EnvStats) > test <- twoSamplePermutationTestLocation(PriceNew, PriceOld, fcn="median", + alternative="two.sided", exact=FALSE, n.permutations=100000) > test$p.value [1] 1e-05 # very strong evidence of a higher median selling price for new homes
5.40 (a)
> Survival<-read.table("http://stat4ds.rwth-aachen.de/data/Survival.dat",header=T) > Survival$time[6] <- 11; Survival$status[6] <- 1 > library(survival) > survdiff(Surv(time, status) ~ group, data=Survival) N Observed Expected (O-E)^2/E (O-E)^2/V group=0 20 16 10.1 3.47 5.91 group=1 20 14 19.9 1.76 5.91 Chisq= 5.9 on 1 degrees of freedom, p= 0.02

The test gives strong evidence that the survival distributions differ between the drug and control groups. (b) Suppose we observe further that subject 20 in the drug group, who was censored at 36 months, was actually still alive after 40 months. > Survival$time[20] <- 40 > survdiff(Surv(time, status) ~ group, data=Survival) N Observed Expected (O-E)^2/E (O-E)^2/V group=0 20 16 10.1 3.47 5.91 group=1 20 14 19.9 1.76 5.91 Chisq= 5.9 on 1 degrees of freedom, p= 0.02 The log-rank test results are unchanged.

5.41

> Survival2 <- read.table("http://stat4ds.rwth-aachen.de/data/Survival_Cox_Oakes.dat", + header=TRUE) > survdiff(Surv(time, status) ~ group, data=Survival2) N Observed Expected (O-E)^2/E (O-E)^2/V group=0 21 21 10.7 9.77 16.8 group=1 21 9 19.3 5.46 16.8 Chisq= 16.8 on 1 degrees of freedom, p= 4e-05 > fit <- survfit(Surv(time, status) ~ group, data=Survival2) > fit n events median 0.95LCL 0.95UCL group=0 21 21 8 4 12 group=1 21 9 23 16 NA > plot(fit, xlab="Time", ylab="Estimated P(survival)")

The P -value of 0.00004 gives extremely strong evidence of a difference between the survival distributions. The estimated medians are 23 for the drug group and 8 for the control group. 5.42 Regardless of whether H0 is true and regardless of the value of the test statistic, the probability at a single value is 0 if the test statistic has a continuous distribution (such as the normal or t). It is also very close to 0 if the test statistic has a discrete distribution



(such as the binomial) but n is large. In practice, we often use a continuous distribution to approximate a sampling distribution that is discrete (such as in using the normal distribution for a test statistic based on the sample proportion), but the probability of any possible value is small. 5.43 There are many plausible values for the parameter, not just the H0 value, as a confidence interval for the parameter shows. On the other hand, when the P-value is very small, all the plausible values are in the range of values contained in Ha, so we can accept Ha. 5.44 (a) Observations (4, 5, 5, 6) for men and (9, 10, 10, 11) for women have t = −8.66 and P-value < 0.0001. (b) Observations (0, 0, 10, 10) for men and (5, 5, 15, 15) for women have t = −1.225 and P-value = 0.27. For fixed means, the smaller the within-group variability (as in case a), the more significant the results. 5.45 The joint prior density and joint posterior density of (µ1, µ2) are continuous. The probability of a region is the volume over that region under the pdf. The volume over the diagonal line µ1 = µ2 is 0. 5.46 Under H0 there are the same (c − 1) parameters for each distribution (one fewer than c since they sum to 1). Under Ha each of the r rows has its own probabilities, so there are r(c − 1) parameters. Therefore, for the chi-squared test, df = r(c − 1) − (c − 1) = (r − 1)(c − 1). 5.47 Rejecting H0 at the α = 0.01 level is equivalent to the 99% confidence interval not containing the H0 value of the parameter. 5.48 (a) 5.49 (b), (d) 5.50 Correct interpretation: If H0 : µ = 100 were true, the probability would be 0.057 of getting ȳ ≥ 120 or ȳ ≤ 80, so the data do not strongly contradict H0. (a) In the classical approach, the parameters (and thus the hypotheses about them) are unknown but fixed, so we do not assign probabilities to them. (b) The P-value refers to tail regions of values at least as contradictory to H0 as the observed value (in the direction of Ha), not the probability of just the observed value. (c) The probability of Type I error is the α-level for the test, not the P-value. (d) We never accept H0 because its value is only one of many plausible values, as a confidence interval illustrates. 5.51 (a) If α is any value below 0.057, such as α = 0.05, we do not reject H0. If α is any value ≥ 0.057, such as α = 0.10, we reject H0. Of the set of values ≥ 0.057, 0.057 is the minimum and thus the smallest α-level at which H0 can be rejected. (b) For any α ≤ 0.057, we would not reject H0 and 100 would fall in the corresponding confidence interval with confidence coefficient (1 − α), and these range between 94.3% and 99.9999...%. The narrowest of all these intervals is the one with the smallest confidence coefficient, which is the 94.3% interval. 5.52 Note that var(Ȳ1 − Ȳ2) = var(Ȳ1) + var(Ȳ2) − 2cov(Ȳ1, Ȳ2). Now var(Ȳi) = σ²/n, i = 1, 2, and cov(Ȳ1, Ȳ2) = (1/n²)E{[∑i(Yi1 − µ1)][∑i(Yi2 − µ2)]} = (1/n²)nE[(Yi1 − µ1)(Yi2 − µ2)] = (1/n)cov(Yi1, Yi2) = (1/n)ρσ². Therefore var(d̄) = 2σ²/n − (2/n)ρσ² = 2σ²(1 − ρ)/n. As ρ increases from 0, var(d̄) decreases from the independent-samples value for var(Ȳ1 − Ȳ2) of 2σ²/n, and d̄ is a more precise estimate of µ1 − µ2.
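A small simulation gives an informal check of this variance formula (a sketch; the values of n, σ, and ρ are arbitrary illustrations):
> library(MASS)   # for mvrnorm
> set.seed(1)
> n <- 30; sigma <- 2; rho <- 0.7
> Sigma <- sigma^2*matrix(c(1, rho, rho, 1), 2, 2)
> d.bar <- replicate(20000, {y <- mvrnorm(n, c(0, 0), Sigma); mean(y[,1]) - mean(y[,2])})
> var(d.bar)   # should be close to 2*sigma^2*(1 - rho)/n = 0.08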



5.53 It is so difficult to convict anyone that many guilty people are found to be not guilty, thus making many Type II errors at the cost of a tiny chance of Type I error. 5.54 Type I error is having a positive diagnostic test when the disease is absent (false positive). Type II error is having a negative test when the disease is present (false negative). Having stricter standards to declare a test to be positive means that more cases will occur in which a person has the disease but the test does not predict it. 5.55 (a) P(Type II error) is (i) 0.74, (ii) 0.36, (iii) 0.08, (iv) 0.01; P(Type II error) decreases as H0 is more badly false. (b) (i) 0.85, (ii) 0.89, (iii) 0.93; as π gets closer to the H0 value of 0.50, P(Type II error) converges toward 1 − α = 0.95, because at π = 0.50, the probability of rejecting H0 is 0.05 (i.e., P(Type I error)), so the probability of not rejecting H0 is 0.95. (c) (i) 0.59, (ii) 0.36, (iii) 0.11; as n increases, P(Type II error) decreases. 5.56 As µ decreases toward the H0 value of 0, P(Type II error) converges toward 1 − α = 0.95, because at µ = 0, the probability of rejecting H0 is 0.05 (i.e., P(Type I error)), so the probability of not rejecting H0 is 0.95. 5.57 With Ha : µ1 > µ2 , the critical value for the rejection region will be smaller than with Ha : µ1 ≠ µ2 (e.g., having 0.05 in the right tail of the t distribution instead of 0.025). So, it is easier to reject H0 and we are less likely to make a Type II error. Since power = 1 − P(Type II error), if truly µ1 > µ2 , the power is greater than if we use Ha : µ1 ≠ µ2 . 5.58 Binomial with n = 100 and π = 0.05, which has expected value 100(0.05) = 5. 5.59 Even if H0 is true in all 40 cases, we expect to reject H0 about 40(0.05) = 2 times. Those two results could well be Type I errors. 5.60 If all the schools had the same distribution for the scores of their students, and if results were independent from year to year, then the probability that a particular school performed above the median in all five years is (0.50)⁵ = 1/32. Just by chance we'd expect one of the 32 schools to perform above the median each year. 5.61 See Section 5.6.4. 5.62 See the fourth bulleted comment in Section 5.6.4. 5.63 (a) From Section 4.2.5 with µ̂ = ȳ, 2[L(µ̂) − L(µ0)] = 2n{ȳ[log(ȳ) − log(µ0)] − (ȳ − µ0)}. (b) > LRT <- function(n, mu0, mu.hat){ # Poisson LR test statistic + 2*n*((mu0 - mu.hat) - mu.hat*log(mu0/mu.hat))} > # Function returning vector of B values of LR test statistic > # for the B simulated Poisson(mu0) samples of size n: > simstat <- function(B, n, mu0){ y <- rep(-1,B) # simulating Poisson + for (i in 1:B){x <- rpois(n, mu0) # samples and applying + ML <- mean(x) # LRT function to each + y[i] <- LRT(n, mu0, ML)} + return(y) } > n <- 25; mu0 <- 3; B <- 100000 # B = number of Monte Carlo samples > stat <- simstat(B, n, mu0) > hist(stat, prob=TRUE, border="blue", breaks="Scott") > fchi2 <- function(x) {dchisq(x, 1)} > curve(fchi2, from=0, to=max(stat), add=TRUE)

The histogram verifies the good approximation by the limiting chi-squared distribution. (c) Adjusting the code of Exercise 5.36, we get the 95% LR-based CI (3.27, 4.84) as follows. (The 95% Wald interval is ȳ ± 1.96√(4/25), which is (3.22, 4.78).)
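The Wald interval can be verified with a one-line computation:
> 4 + c(-1,1)*1.96*sqrt(4/25)
[1] 3.216 4.784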



mu.hat=4.0; n=25; a=0.05 > logLR <- function(n, mu0, mu.hat){LRT(n, mu0, mu.hat)} > mu0 <- seq(3,5,0.01) > plot(mu0,logLR(n, mu0, mu.hat),type="l") > abline(h=qchisq(a, df=1, lower.tail = F), col="blue") > library(rootSolve) > LR.CI <- function(mu0){logLR(n, mu0, mu.hat) - qchisq(a, df=1, lower.tail = F)} > L <- uniroot(LR.CI, c(3.0, 3.5))$root; U <- uniroot(LR.CI, c(4.5, 5.0))$root > c(L,U) # 95% LR based CI [1] 3.266376 4.836026 > abline(v=L, col="red", lty=3); abline(v=U, col="red", lty=3)

(d) Under H0, the ML estimate of µ = µ1 = µ2 is the pooled estimate µ̂ = ȳ = (n1ȳ1 + n2ȳ2)/(n1 + n2) and the maximized log-likelihood is L(µ̂) = (n1 + n2)[ȳ log(ȳ) − ȳ]. Under Ha, the maximized log-likelihood is L(µ̂1, µ̂2) = n1[ȳ1 log(ȳ1) − ȳ1] + n2[ȳ2 log(ȳ2) − ȳ2] and the likelihood-ratio test statistic is 2[L(µ̂1, µ̂2) − L(µ̂)]. 5.64 For a binomial sample, the log-likelihood is y log(π) + (n − y) log(1 − π). Under H0 : π1 = π2 the ML estimate of the common value is the pooled estimate π̂ = (y1 + y2)/(n1 + n2). The maximized log-likelihood functions are L(π̂) = (y1 + y2) log(π̂) + [(n1 + n2) − (y1 + y2)] log(1 − π̂) under H0 and L(π̂1, π̂2) = [y1 log(π̂1) + (n1 − y1) log(1 − π̂1)] + [y2 log(π̂2) + (n2 − y2) log(1 − π̂2)] under Ha. The likelihood-ratio test statistic is then 2[L(π̂1, π̂2) − L(π̂)]. 5.65 (a) The multinomial log-likelihood is a function of the parameters through ∑j yj log(πj). Using π̂j = yj/n, 2[L(π̂1, ⋯, π̂c) − L(π10, ⋯, πc0)] = 2[∑j yj log(yj/n) − ∑j yj log(πj0)] = 2 ∑j yj log(yj/nπj0). Under Ha there are c − 1 parameters (since the probabilities sum to 1, c − 1 of them determine the last one). Under H0 there are no unknown parameters. Thus, df = c − 1. (b) The test statistic for H0 : π1 = ⋯ = π6 = 1/6 is 2 ∑j yj log(6yj/100), which equals 3.17. From the χ² distribution with df = 5, the P-value is 0.67. H0 is plausible. (c) > y <- rmultinom(1000000, 100, prob=rep(1/6,6)) > T = colSums(2*(y*log(6*y/100))) > 1 - ecdf(T)(3.17) # proportion of simulated statistics exceeding 3.17 [1] 0.683381
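The chi-squared approximation to the P-value in part (b) can also be computed directly, for comparison with the simulated value in part (c):
> pchisq(3.17, df=5, lower.tail=FALSE)  # about 0.67, close to the simulated 0.683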

5.66 The multinomial log-likelihood is a function of the parameters through ∑i ∑j yij log(πij). Thus, 2[L(π̂11, ⋯, π̂rc) − L(π11,0, ⋯, πrc,0)] = 2 ∑i ∑j yij log(yij/n) − 2 ∑i ∑j yij log(π̂ij,0) = 2 ∑i ∑j yij log(yij/nπ̂ij,0). It has a large-sample chi-squared distribution. The number of unknown parameters is (r − 1) + (c − 1) under H0 , since each marginal distribution has one fewer parameter than categories. Under Ha , there are rc − 1 parameters, since the probabilities sum to 1 over the rc cells. Therefore, the chi-squared test has df = (rc − 1) − [(r − 1) + (c − 1)] = (r − 1)(c − 1). 5.67 (a) U(y; µ) = ∑i ∂[log f(yi; µ)]/∂µ = (1/µ)∑i yi − n. From Section 4.2.5, I(µ) = n/µ. The value of the score test statistic at y is z = U(y; µ0)/√I(µ0) = [(1/µ0)(∑i yi) − n]/√(n/µ0) = (ȳ − µ0)/√(µ0/n).

(b) Since I(µ) = n/µ, the value of the Wald statistic at y is z = (ȳ − µ0)/√(ȳ/n), where √(ȳ/n) is the estimated standard error. Unlike the score statistic, the Wald statistic requires estimating the standard error rather than using its value under H0. (c) From Section 4.2.5 with µ̂ = ȳ, 2[L(µ̂) − L(µ0)] = 2n{ȳ[log(ȳ) − log(µ0)] − (ȳ − µ0)}.



(d) For a particular test, the 100(1 − α)% confidence interval is the set of µ0 values for which the P-value ≥ α. The Wald confidence interval is ȳ ± zα/2√(ȳ/n). The score confidence interval is the set of µ0 for which |z| = |ȳ − µ0|/√(µ0/n) ≤ zα/2, which is obtained by solving a quadratic equation. The score method does not require estimating the standard error in the z pivotal quantity.
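As a numerical illustration with the Poisson values used in Exercise 5.63 (ȳ = 4.0, n = 25, µ0 = 3):
> ybar <- 4.0; n <- 25; mu0 <- 3
> (ybar - mu0)/sqrt(mu0/n)   # score statistic, using the null standard error: 2.89
> (ybar - mu0)/sqrt(ybar/n)  # Wald statistic, using the estimated standard error: 2.5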

5.68 When the Wald test statistic (θ̂ − θ0)/se has an approximate standard normal distribution, the corresponding CI is θ̂ ± zα/2 se, which has θ̂ at its center and so is symmetric. Such symmetry is not ideal when the parameter is near a boundary. For instance, when a binomial probability π is near 0, unless n is large one could get π̂ = 0.0 and a Wald confidence interval of (0, 0), and even if π̂ > 0, requiring π̂ to be the midpoint may result in too narrow a CI or a CI (L, U) with L < 0. 5.69 Under H0 : identical population distributions, for given observations, every possible partitioning of them into n1 in the first sample and n2 in the second sample is equally likely. The P-value is the proportion of the permutations that are at least as extreme as observed, such as having a difference between the sample medians at least as large as that observed. The t test assumes that the identical population distributions in H0 are normal distributions. 5.70 (a) When the two samples are completely separated, as in the first example, the observed sample is the one of the (8 choose 4) = 70 partitionings that is the most extreme, and the two-sided P-value is 2/70 = 0.0286. > library(EnvStats) > y1 <- c(4, 5, 5, 6); y2 <- c(9, 10, 10, 11) > test <- twoSamplePermutationTestLocation(y1, y2, fcn="mean","two.sided",exact=T) > test$p.value [1] 0.02857143 > y1 <- c(0, 0, 10, 10); y2 <- c(5, 5, 15, 15) > test <- twoSamplePermutationTestLocation(y1, y2, fcn="mean","two.sided",exact=T) > test$p.value [1] 0.4

(b) Have very little variability within each sample, such as: > y1 <- c(4.9, 5, 5, 5.1); y2 <- c(9.9, 10, 10, 10.1) > test <- twoSamplePermutationTestLocation(y1, y2, fcn="mean","two.sided",exact=T) > test$p.value [1] 0.02857143 > t.test(y1, y2, var.equal=TRUE) t = -86.603, df = 6, p-value = 1.597e-10

5.71 An equal-tailed 95% CI for µ1 − µ2 based on inverting a permutation test is obtained by locating the lower and upper values of µ1 − µ2 at which the two-sided test has P-value = 0.05. Trying various values of µ1 − µ2 (see below), we get the CI (67, 248). > test <- twoSamplePermutationTestLocation(petting, praise, fcn="mean", + alternative="two.sided", mu1.minus.mu2=66, n.permutations=100000) > test$p.value [1] 0.04837 > test <- twoSamplePermutationTestLocation(petting, praise, fcn="mean", + alternative="two.sided", mu1.minus.mu2=67, n.permutations=100000) > test$p.value [1] 0.05114 > test <- twoSamplePermutationTestLocation(petting, praise, fcn="mean", + alternative="two.sided", mu1.minus.mu2=248, n.permutations=100000) > test$p.value [1] 0.05087 > test <- twoSamplePermutationTestLocation(petting, praise, fcn="mean",



+ alternative="two.sided", mu1.minus.mu2=249, n.permutations=100000) > test$p.value # The 95% confidence interval for mu1 - mu2 is (67, 248) [1] 0.04778 > library(simpleboot) > b <- two.boot(petting, praise, mean, R=10000) > library(boot) > boot.ci(b) Level Percentile BCa 95% ( 71.9, 234.6 ) ( 47.7, 224.9 ) > sd(petting); sd(praise) [1] 61.7117 [1] 102.5392 > t.test(petting, praise) Welch Two Sample t-test t = 3.6351, df = 9.8424, p-value = 0.004694 95 percent confidence interval: 63.42237 265.43477 mean of x mean of y 232.00000 67.57143

5.72 Data are censored when we don't know the actual value but only an interval in which it falls, such as all possible values above the observed one for right-censoring with survival data. Another example is comparing two brands of a product in terms of how long they work properly without needing an adjustment or repair. Left-censoring: we observe, for school children, the age at which they learn how to tell time, but some already knew how when they started school, so we know only an upper bound for the actual value for those students. 5.73 (a) F(T) and 1 − F(T) are left-tail and right-tail probabilities, which are one-sided P-values. Since F(T) has a uniform distribution, for 0 ≤ u ≤ 1, the cdf of U = 1 − F(T) is G(u) = P(U ≤ u) = P[1 − F(T) ≤ u] = P[F(T) ≥ 1 − u] = 1 − P[F(T) ≤ 1 − u] = 1 − (1 − u) = u, also uniform. (b) > y <- rbinom(1000000, 1500, 0.50) # 1000000 random binomials with n=1500 > z <- (y/1500 - 0.50)/sqrt(0.5*0.5/1500) # test statistic for each binomial sample > p <- 1 - pnorm(z) # right-tail P-value for each binomial > hist(p) # approximately uniform

5.74 (a) E(mid P-value) = ∑j πj[(πj/2) + πj+1 + ⋯ + πc]. In this sum, each π²j appears as π²j/2 and each product πjπk with j < k appears exactly once, so the sum equals (∑j πj)²/2, which is 1/2. (b)

> 0.5*dbinom(9,10,0.4)+dbinom(10,10,0.4) [1] 0.0008912896 > z <- (9/10 - 0.40)/sqrt(0.4*0.6/10); p <- 1 - pnorm(z); p [1] 0.0006244155

5.75 (a) ℓ(θ) = [B(θ)]^n [∏i h(yi)] exp[Q(θ) ∑i R(yi)], so ℓ(θ)/ℓ(θ′) = [B(θ)/B(θ′)]^n exp{[Q(θ) − Q(θ′)] ∑i R(yi)}. Since Q is monotone increasing, this is an increasing function of ∑i R(yi). (b) f(y; n, π) = (n choose y) π^y (1 − π)^(n−y) = (1 − π)^n (n choose y) [π/(1 − π)]^y = (1 − π)^n (n choose y) exp[y log(π/(1 − π))], so the binomial is in the exponential family with Q(π) = log[π/(1 − π)], which is monotone increasing in π, and R(y) = y, so T = ∑i R(yi) = ∑i yi.
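A quick numerical check of this factorization, with arbitrary illustrative values of y, n, and π:
> y <- 3; n <- 10; p <- 0.4
> dbinom(y, n, p)
[1] 0.2149908
> choose(n, y)*(1 - p)^n*exp(y*log(p/(1 - p)))
[1] 0.2149908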



Chapter 6 6.1 (a)

(b)

> Races <-read.table("http://stat4ds.rwth-aachen.de/data/ScotsRaces.dat",header=T) > plot(timeM ~ timeW, data=Races) > fit <- lm(timeM ~ timeW, data=Races) > summary(fit) Coefficients: Estimate Std. Error (Intercept) -2.834324 1.209815 timeW 0.870889 0.009769 ---> fitted(fit)[which(Races$timeW == 490.05)] # -2.834324+0.870889(490.05)=423.9 41 423.9448 > cor(Races$timeM, Races$timeW) [1] 0.9958732

Very strong positive correlation, suggesting one can predict men's record time very well using the linear equation with women's record time as the explanatory variable. > summary(lm(timeM ~ -1 + timeW, data=Races)) # model without intercept Coefficients: Estimate Std. Error timeW 0.852274 0.005871

The fit is quite similar but goes through the origin. For an increase in women’s record time of t, men’s record time is predicted to increase by 0.85t. 6.2 (a)

> Guns <-read.table("http://stat4ds.rwth-aachen.de/data/Firearms.dat",header=T) > plcolor <- (Guns$deaths >= 10) + 1 > plot(Guns$firearms, Guns$deaths, xlim=c(0,95), pch=16, col= plcolor) > text(Guns$firearms, Guns$deaths, Guns$Nation, cex=0.7, pos=4)

The U.S. observation is well removed from the others, being much higher on both variables. (b)

> cor(Guns$firearms, Guns$deaths) [1] 0.6343518 > x <- which(Guns$Nation == "US"); cor(Guns$firearms[-x], Guns$deaths[-x]) [1] -0.1750878

Without the U.S., the correlation is very weak and even has a different sign. (c)

> fit <- lm(deaths ~ firearms, data=Guns) > summary(fit) Estimate Std. Error (Intercept) -0.27336 1.60474 firearms 0.19176 0.05225 ---> fit2 <- lm(deaths[-22] ~ firearms[-22], data=Guns) > summary(fit2) Estimate Std. Error (Intercept) 3.91015 1.03741 firearms[-22] -0.03254 0.04197

The slope changes direction, because the correlation does also. 6.3

> Guns <- read.table("http://stat4ds.rwth-aachen.de/data/Firearms2.dat", header=TRUE) > plot(Guns$Ownership, Guns$Rate) > fit <- lm(Guns$Rate ~ Guns$Ownership, data=Guns) > tail(sort(cooks.distance(fit)), 3) 21 8 11 0.07323167 0.16697293 0.28479001



> Guns$State[c(8, 11)] [1] "Delaware" "Hawaii" > cor(Guns$Rate, Guns$Ownership) [1] 0.6976103 > cor(Guns$Rate[-11], Guns$Ownership[-11]) [1] 0.7822902

Hawaii (observation 11) is an outlier, Delaware (observation 8) less so; their Cook’s distance values of 0.28 and 0.17 are much larger than all the others. Removing Hawaii, the correlation changes from 0.698 to 0.782. 6.4 (a)

> Election <- read.table("http://stat4ds.rwth-aachen.de/data/BushGore.dat", + header=TRUE) > plot(Election$Perot,Election$Buchanan) > text(Election$Perot, Election$Buchanan, as.numeric(rownames(Election)), + cex=0.8, pos=1) > Election$County[c(50, 52, 6)] [1] "PalmBeach" "Pinellas" "Broward"

The Buchanan vote for Palm Beach county is a clear outlier, much higher than one would expect. (b)

> fit <- lm(Buchanan ~ Perot, data=Election) > summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) 1.081433 49.899462 0.022 0.983 Perot 0.035735 0.004352 8.211 1.23e-11 --> fit2 <- lm(Buchanan ~ Perot, data=Election[-50,]) > summary(fit2) Estimate Std. Error t value Pr(>|t|) (Intercept) 45.689948 13.981299 3.268 0.00174 Perot 0.024143 0.001281 18.847 < 2e-16 > pred.PB <- predict.lm(fit2,Election[50,]); pred.PB 50 787.8075 # or equivalently: sum(fit2$coefficients*c(1,Election$Perot[50])) > Election$Buchanan[50] - pred.PB 50 2619.193

The prediction equation fitted to all but the observation for Palm Beach county is ŷ = 45.69 + 0.0241x, very different from the equation 1.08 + 0.0357x using all the data. Based on fit (ii), we predict 788 votes for Buchanan. Since he received 3407 votes, the residual is 2619. (c)

> Cd <- cooks.distance(fit); tail(sort(Cd), 3) # 3 highest Cook's distances 52 6 50 0.1319675 0.5866819 4.0555563 > library(MASS); stdres(fit)[c(50, 6, 52)] # stand. residuals for obs. 50,6,52 50 6 52 7.748652 -2.136112 -1.090258 > library(olsrr); ols_leverage(fit)[c(50, 6, 52)] # leverages for obs. 50,6,52 [1] 0.1190138 0.2045493 0.1816987

Cook’s distance is 4.06 for Palm Beach county but well below 1 for all other observations. The rightmost points (observations 52 and 6) have large leverage but not large residuals and therefore do not have great influence. 6.5 (a)

(b)

> Covid <-read.table("http://stat4ds.rwth-aachen.de/data/Covid19.dat",header=T) > plot(Covid$day, Covid$cases) > plot(Covid$day, log(Covid$cases)) > cor(Covid$cases, Covid$day) [1] 0.7937368



> cor(log(Covid$cases), Covid$day) [1] 0.9968212

There is a very strong linear relation when the response variable is the log of the number of cases. (c)

> fit <- lm(log(cases) ~ day, data=Covid) > summary(fit) Estimate Std. Error (Intercept) 2.843852 0.084012 day 0.308807 0.004583

The prediction for log(Y) is 2.844 + 0.309x for x = day, so the prediction for Y is exp(2.844 + 0.309x) = 17.18(1.362)^x. 6.6 (a)

> Mental <- read.table("http://stat4ds.rwth-aachen.de/data/Mental.dat", header=T) > summary(lm(impair ~ life, data=Mental)) Estimate Std. Error t value Pr(>|t|) (Intercept) 23.30949 1.80675 12.901 1.85e-15 life 0.08983 0.03633 2.472 0.018 --> summary(lm(impair ~ life + ses, data=Mental)) Estimate Std. Error t value Pr(>|t|) (Intercept) 28.22981 2.17422 12.984 2.38e-15 life 0.10326 0.03250 3.177 0.00300 ses -0.09748 0.02908 -3.351 0.00186 --> cor(Mental$life, Mental$impair) [1] 0.3722206

The life events effect does not change much after adjusting for SES, because SES and life events are not strongly correlated. Perhaps life events and SES have somewhat separate effects on mental impairment. (b) For testing H0 : β2 = 0 of no SES effect, the test statistic is t = −0.097/0.029 = −3.35 and the P-value is 0.002, giving strong evidence of a negative effect of SES on mental impairment, adjusting for life events. The 95% confidence interval for β2 is −0.097 ± 2.026(0.029), which is (−0.156, −0.039). Adjusting for life events, we are 95% confident that the change in mean mental impairment per one-unit increase in SES falls between −0.156 and −0.039, or equivalently, between −15.6 and −3.9 over the range of SES values from 0 to 100. 6.7 Observation 41 is highly influential, with a Cook's distance of 7.88, and no other observation has a Cook's distance above 1 (apply R code as in Exercise 6.3). When we delete observation 41, the estimate of the interaction parameter changes from the substantively important effect of 0.658 (P-value = 0.100) to the unimportant effect of −0.033 (highly non-significant with P-value = 0.933). 6.8 (a) The estimated education effect is (i) 1.486 marginally, (ii) −0.583 after adjusting for urbanization. The effect of education on crime rate is positive overall, but negative when we adjust for urbanization. Simpson's paradox holds, because the association changes direction. Perhaps more urbanized counties tend to have both higher education levels and higher crime rates. (b) At the individual level, there could be a lot more variability in income at each level of education than in the county summaries. (c) There is no way to know if the linear trend extends to other values above or below the observed ones. The prediction could even be absurd. For instance, the least squares line for predicting a county's median income from education has an intercept of −4.6, predicting a negative value.



6.9 We obtain a 72.4% reduction in error when predicting y using x compared to predicting it using ȳ. A scatterplot shows a positive trend, so the correlation is r = +√0.724 = 0.85. 6.10 (a)

> fit <- lm(cogpa ~ hsgpa + sport + tv, data=Students); summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) 2.815427 0.367788 7.655 2.86e-10 hsgpa 0.208804 0.101290 2.061 0.0439 sport -0.014066 0.011599 -1.213 0.2303 tv 0.003336 0.006868 0.486 0.6291 --Residual standard error: 0.3414 on 56 degrees of freedom Multiple R-squared: 0.1045, Adjusted R-squared: 0.05655 F-statistic: 2.179 on 3 and 56 DF, p-value: 0.1007

Since sport and tv are statistically non-significant, we should propose a simpler model (see the next exercise). (b) For an increase of 1 in high school GPA, the mean college GPA is estimated to increase by 0.21, adjusting for hours watching TV and participating in sports. Note, however, that interpretations are normally given for the final selected model, which this one is not. (c) R2 = 0.1045, adjusted R2 = 0.0565, so the predictive power is weak. There is only about a 5.7% reduction in the sum of squared prediction errors from using hsgpa, sport, and tv instead of the mean cogpa to predict cogpa. The multiple correlation of R = √0.1045 = 0.323 is the correlation between the observed cogpa values and the values predicted by the fit of the model. 6.11 (a) Test statistic F = 2.18 has P-value = 0.10, only weak evidence that at least one of the explanatory variables has a true effect in the corresponding population. (b) Test statistic t = 0.209/0.101 = 2.06, df = 56, has a P-value = 0.044 for H0 : β1 = 0. Reject H0 and conclude hsgpa has a positive effect on cogpa, adjusting for tv and sport. (c) We need P-value ≤ 0.05/3 = 0.0167 to reject H0 when we use the Bonferroni approach, so the hsgpa effect is not significant with this correction. (d) Testing the significance of tv, adjusting for hsgpa and sport, and then the significance of sport, adjusting for hsgpa, we conclude that, among these variables, only hsgpa has a significant effect on the mean college GPA: > fit1 <- lm(cogpa ~ hsgpa, data=Students); summary(fit1) Estimate Std. Error t value Pr(>|t|) (Intercept) 2.74916 0.32207 8.536 7.79e-12 hsgpa 0.21285 0.09644 2.207 0.0313 --Residual standard error: 0.3405 on 58 degrees of freedom Multiple R-squared: 0.07748, Adjusted R-squared: 0.06157 F-statistic: 4.871 on 1 and 58 DF, p-value: 0.03128
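The Bonferroni comparison in part (c) can also be carried out by adjusting the three P-values from the full model directly (a brief sketch using the P-values reported above):
> p.adjust(c(0.0439, 0.2303, 0.6291), method="bonferroni")
[1] 0.1317 0.6909 1.0000
None of the adjusted P-values is below 0.05, consistent with the conclusion that no effect, including hsgpa, is significant after this correction.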

6.12

> UN <- read.table("http://stat4ds.rwth-aachen.de/data/UN.dat", header=TRUE) > attach(UN) > cor(cbind(GDP, HDI, GII, Fertility, CO2, Homicide, Prison, Internet)) GDP HDI GII Fertility CO2 Homicide GDP 1.000000000 0.90297348 -0.8506693 -0.48615886 0.6744699 -0.4073651 HDI 0.902973484 1.00000000 -0.8822150 -0.66930355 0.6809100 -0.4281391 GII -0.850669343 -0.88221498 1.0000000 0.59787355 -0.5545239 0.5114319 Fertility -0.486158857 -0.66930355 0.5978735 1.00000000 -0.4477451 0.3056060 CO2 0.674469917 0.68090996 -0.5545239 -0.44774507 1.0000000 -0.1651759 Homicide -0.407365069 -0.42813912 0.5114319 0.30560597 -0.1651759 1.0000000 Prison -0.002952338 0.03746578 0.2159562 -0.08558244 0.3685526 0.3306924 Internet 0.877198714 0.86808986 -0.8657863 -0.47968591 0.6409425 -0.3374744



Prison Internet GDP -0.002952338 0.87719871 HDI 0.037465785 0.86808986 GII 0.215956167 -0.86578625 Fertility -0.085582437 -0.47968591 CO2 0.368552643 0.64094250 Homicide 0.330692423 -0.33747439 Prison 1.000000000 -0.01425214 Internet -0.014252137 1.00000000 > fit <- lm(Internet ~ GDP + HDI + GII + Fertility + CO2 + Homicide + Prison) > summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) 11.158310 38.773097 0.288 0.77526 GDP 0.440903 0.290680 1.517 0.13856 HDI 55.851013 46.652218 1.197 0.23952 GII -72.428931 25.323061 -2.860 0.00719 Fertility 4.092148 3.065379 1.335 0.19076 CO2 0.310113 0.654899 0.474 0.63886 Homicide 0.377324 0.299751 1.259 0.21668 Prison 0.009091 0.018347 0.495 0.62344 --Multiple R-squared: 0.8477, Adjusted R-squared: 0.8164 > fit2 <- lm(Internet ~ GDP) > summary(fit2) Estimate Std. Error t value Pr(>|t|) (Intercept) 26.1341 3.7490 6.971 2.06e-08 *** GDP 1.4060 0.1217 11.555 2.55e-14 *** --Multiple R-squared: 0.7695, Adjusted R-squared: 0.7637
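The multicollinearity discussed below can be quantified with variance inflation factors (a sketch, assuming the car package is installed; fit is the seven-predictor model above):
> library(car)
> vif(fit)  # large values (well above 10) for GDP, HDI, and GII would indicate severe multicollinearity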

Adjusted R2 is 0.816 with all the predictors and 0.764 using only GDP. The other variables that are highly correlated with Internet (HDI, GII, and to some extent CO2) are also highly correlated among themselves. The data exhibit multicollinearity (strong correlations among the explanatory variables), so we do nearly as well with one predictor as with all seven predictors. The estimated GDP effect is 1.41 in the bivariate model and 0.44 in the multiple regression model. In the bivariate model, as GDP increases by 1, the variables that are highly correlated with GDP also tend to increase, boosting the estimated effect of GDP compared to its effect in the multiple regression model, in which we adjust for those other variables. 6.13

> Polid <- read.table("http://stat4ds.rwth-aachen.de/data/Polid.dat", header=TRUE) > fit <- lm(ideology ~ race, data=Polid) > summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) 3.81671 0.07372 51.776 < 2e-16 racehispanic 0.27272 0.10439 2.612 0.00904 racewhite 0.35386 0.08082 4.378 1.24e-05 --Residual standard error: 1.42 on 2572 degrees of freedom Multiple R-squared: 0.007426, Adjusted R-squared: 0.006654 F-statistic: 9.621 on 2 and 2572 DF, p-value: 6.876e-05 > fit <- aov(ideology ~ race, data=Polid) > TukeyHSD(fit, conf.level=0.95) Tukey multiple comparisons of means $race diff lwr upr p adj hispanic-black 0.27271930 0.02791618 0.5175224 0.0245244 white-black 0.35386062 0.16432144 0.5433998 0.0000370 white-hispanic 0.08114131 -0.10882466 0.2711073 0.5758640

The test of equal population means has test statistic F = 9.62 and a P -value < 0.0001,



giving strong evidence of at least one difference. The Tukey multiple comparison method indicates that hispanics and whites tend to be more conservative than blacks, on the average, but there is no significant difference between whites and hispanics. Its 0.95 confidence level applies to the family of three confidence intervals for the pairwise differences in the population means of political ideology among the race groups. 6.14 (a) µ̂ = 11.366 + 0.391(CW ) + 0.809c2 + 1.149c3 , where c2 and c3 are indicator variables for colors 2 and 3. The estimate 0.391 is the estimated effect on the mean of the log ejaculate of a 1 cm increase in CW, adjusting for color. The estimate 0.809 is the estimated difference in the mean of the log ejaculate between color 2 and color 1 and 1.149 is the estimated difference in the mean of the log ejaculate between color 3 and color 1, adjusting for CW. (b) (i) Test H0 : β1 = β2 = β3 = 0 in main effects model, and P -value < 0.0001 gives strong evidence that at least one variable has an effect. (ii) Test H0 : β2 = β3 = 0 in main effects model, and P -value < 0.0001 gives strong evidence of a color effect, adjusting for CW. (iii) Test H0 : β4 = β5 = 0 in model that adds two interaction terms, and P -value = 0.22 indicates the interaction terms do not yield a significantly better fit. 6.15 (a)

(b)

> Houses <- read.table("http://stat4ds.rwth-aachen.de/data/Houses.dat",header=T) > attach(Houses) > pch.list <- NULL; col.list <- NULL > pch.list[new=="0"] <- 20; pch.list[new=="1"] <- 4 # pick symbols > col.list[new=="0"] <- "blue"; col.list[new=="1"] <- "red" # pick colors > plot(size, price, pch=pch.list, col=col.list) # control symbols/colors > legend("topleft", title="age of house", inset=0.02, legend= c("old","new"), pch=c(20,4), col=c("dodgerblue4","red4"), cex=0.9,box.lty=0) > fit <- lm(price ~ size + new, data=Houses) > summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) -60.34630 22.04421 -2.738 0.00737 size 0.17420 0.01319 13.204 < 2e-16 new 86.60442 27.97956 3.095 0.00257 --Residual standard error: 80.82 on 97 degrees of freedom Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169 F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16

We estimate that the mean price increases by $17,420 for a 100 square foot increase in size and is $86,604 higher for new homes than for older homes, adjusting for the other explanatory variable. (c) R2 = 0.72. There is a 72% reduction in sum of squared errors in predicting selling price using size and whether new, compared to using the sample mean selling price as the predictor. (d) Test statistic t = 3.095 (df = 97) is highly significant (P = 0.003), indicating that the population mean selling price is higher for new homes. (e) > tapply(size, new, mean) 0 1 1539.618 2354.727 > predict(fit,data.frame(size=2354.73, new=1), interval="confidence") fit lwr upr 1 290.964 258.7207 323.2072 > predict(fit,data.frame(size=2354.73, new=1), interval="prediction") fit lwr upr 1 290.964 179.2701 402.6579



If the model truly holds, we have 95% confidence that the conceptual population mean selling price falls between $258,721 and $323,207, and we predict that a selling price for another new house of that size will fall between $179,270 and $402,658. (f)

> plot(cooks.distance(fit)) > Cd <- cooks.distance(fit); tail(sort(Cd),5) 76 6 22 9 64 0.1492467 0.2102763 0.2644963 0.7685104 1.2841872 > fit2 <- lm(price~size+new, data=Houses[-64,]) # or, use > summary(fit2) # update(fit, subset=-64) Estimate Std. Error t value Pr(>|t|) (Intercept) -94.73173 21.37790 -4.431 0.0000249 size 0.19927 0.01316 15.138 < 2e-16 new 61.95927 25.99036 2.384 0.0191 --Residual standard error: 73.48 on 96 degrees of freedom Multiple R-squared: 0.772, Adjusted R-squared: 0.7672 F-statistic: 162.5 on 2 and 96 DF, p-value: < 2.2e-16

Observation 64 has the maximum Cook’s distance of 1.28. Without this observation, R2 increases from 0.723 to 0.772. The effect of a home being new changes dramatically, decreasing from 86.60 to 61.96. 6.16

> summary(lm(price ~ taxes, data=Houses)) Estimate Std. Error t value Pr(>|t|) (Intercept) 35.504052 15.206753 2.335 0.0216 taxes 0.103486 0.006698 15.450 <2e-16 > summary(lm(price ~ taxes + size, data=Houses)) Estimate Std. Error t value Pr(>|t|) (Intercept) -42.91312 20.27864 -2.116 0.0369 taxes 0.05940 0.01038 5.725 1.16e-07 size 0.09977 0.01923 5.189 1.16e-06

After adjusting for size, the effect of taxes is only a bit more than half as large. So, size has some influence on the effect of taxes on selling price, but is not completely responsible for the association. 6.17

> Anor <- read.table("http://stat4ds.rwth-aachen.de/data/Anorexia.dat",header=TRUE) > y <- Anor$after - Anor$before > fit <- aov(y ~ therapy, data=Anor) > summary(fit) Df Sum Sq Mean Sq F value Pr(>F) therapy 2 615 307.32 5.422 0.0065 Residuals 69 3911 56.68 --> TukeyHSD(fit, conf.level=0.95) Tukey multiple comparisons of means 95% family-wise confidence level $therapy diff lwr upr p adj cb-c 3.456897 -1.413483 8.327276 0.2124428 f-c 7.714706 2.090124 13.339288 0.0045127 f-cb 4.257809 -1.250554 9.766173 0.1607461

(a) The test of H0 : µ1 = µ2 = µ3 for the three conceptual population means has test statistic F = 5.42 and P -value = 0.0065, giving strong evidence that at least two of the population means differ. (b) The Tukey multiple comparisons suggest that the family therapy has greater population mean weight change than the control group, but it is plausible that the population means are the same for the other two pairs because the confidence intervals contain 0.


(c)


> fit <- lm(after ~ therapy + before, data=Anor) > summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) 45.6740 13.2167 3.456 0.000950 therapycb 4.0971 1.8935 2.164 0.033999 therapyf 8.6601 2.1931 3.949 0.000189 before 0.4345 0.1612 2.695 0.008850 --> Anova(fit) Response: after Sum Sq Df F value Pr(>F) therapy 766.3 2 7.8681 0.0008438 before 353.8 1 7.2655 0.0088500 Residuals 3311.3 68

The mean weight after was estimated to be 4.1 pounds higher for the cb group than for the c group and 8.66 pounds higher for the f group than for the c group, adjusting for the weight before. The test of H0 : β1 = β2 = 0 for the therapy effect has test statistic F = 7.87 and P-value = 0.0008, giving strong evidence that at least two of the therapy groups have different population means, adjusting for the weight before. 6.18

> FEV <- read.table("http://stat4ds.rwth-aachen.de/data/FEV.dat", header=TRUE) > fit <- lm(fev ~ base + drug, data=FEV) > summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.71365 0.12150 5.874 0.00000000723 base 0.90285 0.04307 20.961 < 2e-16 drugC 0.22589 0.05572 4.054 0.00005734009 drugP -0.28149 0.05573 -5.051 0.00000059192 --> anova(lm(fev ~ base + drug, data=FEV)) Analysis of Variance Table Response: fev Df Sum Sq Mean Sq F value Pr(>F) base 1 131.887 131.887 442.623 < 2.2e-16 drug 2 24.812 12.406 41.635 < 2.2e-16 Residuals 572 170.437 0.298

The fitted equations are 0.71 + 0.90(base) for drug A, (0.71 + 0.23) + 0.90(base) = 0.94 + 0.90(base) for drug C, and (0.71 − 0.28) + 0.90(base) = 0.43 + 0.90(base) for placebo. The test of the hypothesis of equal population means for the three drugs, adjusting for the baseline measurement, has test statistic F = 41.6 and P-value < 0.0001. The output shows that drug C is significantly better than drug A (which does not have its own indicator), and drug A is significantly better than placebo, adjusting for the baseline measurement. To compare drug C with placebo, we fit the model again with placebo as the reference category:
> drug2 <- factor(drug, levels=c("P", "A", "C"))
> fit2 <- lm(fev ~ base + drug2, data=FEV)
> summary(fit2)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.43216    0.12017   3.596    0.000351
base         0.90285    0.04307  20.961     < 2e-16
drug2A       0.28149    0.05573   5.051 0.000000592
drug2C       0.50738    0.05571   9.107     < 2e-16

We see that drug C is also significantly better than placebo. 6.19 For very flat prior distributions, we obtain results similar to least squares: > fit.bayes <- MCMCregress(timeW ~ distance + climb, mcmc=5000000, b0=0, B0=10^{-10},


+ c0=10^(-10), d0=10^(-10), data = Races[-41,]) > summary(fit.bayes) 1. Empirical mean and standard deviation for each variable Mean SD Naive SE Time-series SE (Intercept) -8.934 3.3328 0.0014905 0.0014905 distance 4.172 0.2439 0.0001091 0.0001091 climb 43.851 3.7744 0.0016880 0.0016865 > fit <- lm(timeW ~ distance + climb, data = Races[-41,]) > summary(fit) # classical least squares approach Estimate Std. Error t value Pr(>|t|) (Intercept) -8.931 3.281 -2.723 0.00834 distance 4.172 0.240 17.383 < 2e-16 climb 43.852 3.715 11.806 < 2e-16

6.20 (a) For the main effects model, the outlying observation 41 is highly influential, having Cook’s distance = 16.9. Without that observation, the maximum Cook’s distance equals 0.44. Here are the fits with and without it: > fit <- lm(timeM ~ distance + climb, data=Races) > summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) -15.2608 2.7853 -5.479 7.44e-07 distance 4.5082 0.1352 33.347 < 2e-16 climb 28.3525 2.9718 9.541 5.59e-14 --Multiple R-squared: 0.9697, Adjusted R-squared: 0.9688 > fit2 <- lm(timeM ~ distance + climb, data = Races[-41, ]) > summary(fit2) Estimate Std. Error t value Pr(>|t|) (Intercept) -9.0429 2.2335 -4.049 0.000141 distance 3.5603 0.1634 21.789 < 2e-16 climb 37.4477 2.5289 14.808 < 2e-16 -Multiple R-squared: 0.9689, Adjusted R-squared: 0.9679

The effect estimates change substantially. For the model allowing interaction, observation 41 is also highly influential, with Cook’s distance = 14.04. Without that observation, the maximum Cook’s distance equals 0.54. Here are the fits: > fit3 <- lm(timeM ~ distance + climb + distance:climb, data=Races) > summary(fit3) Estimate Std. Error t value Pr(>|t|) (Intercept) -1.0620 5.0788 -0.209 0.83503 distance 3.5186 0.3292 10.687 7.2e-16 climb 11.1418 5.9716 1.866 0.06666 distance:climb 0.9752 0.2996 3.255 0.00182 --Multiple R-squared: 0.974, Adjusted R-squared: 0.9728 > fit4 <- lm(timeM ~ distance + climb + distance:climb, data = Races[-41,]) > summary(fit4) Estimate Std. Error t value Pr(>|t|) (Intercept) -5.4698 4.0973 -1.335 0.187 distance 3.3459 0.2631 12.719 < 2e-16 climb 31.9988 5.8173 5.501 0.000000735 distance:climb 0.2741 0.2636 1.040 0.302 --Multiple R-squared: 0.9694, Adjusted R-squared: 0.968

The effect estimates change substantially, and the interaction effect is much weaker. It seems adequate to use the main effects model, deleting observation 41, for which R2 = 0.97 indicates excellent predictive power. (b) With highly disperse priors in a Bayesian approach, we get results similar to those with least squares:



> fit.bayes <- MCMCregress(timeM ~ distance + climb, mcmc=5000000, b0=0, + B0=10^{-10}, c0=10^(-10), d0=10^(-10), data = Races[-41,]) > summary(fit.bayes) 1. Empirical mean and standard deviation for each variable Mean SD Naive SE Time-series SE (Intercept) -9.045 2.269 0.00101473 0.00101473 distance 3.560 0.166 0.00007426 0.00007426 climb 37.447 2.570 0.00114918 0.00114822 sigma2 71.507 13.058 0.00583980 0.00610863 2. Quantiles for each variable: 2.5% 25% 50% 75% 97.5% (Intercept) -13.504 -10.56 -9.045 -7.529 -4.583 distance 3.234 3.45 3.560 3.671 3.887 climb 32.390 35.73 37.447 39.162 42.499 sigma2 50.381 62.24 70.001 79.101 101.273

6.21 We create a new data file by listing the women’s times as observations following the men’s times, adding an indicator variable for gender (1 = men, 0 = women). (This is the ScotsRacesMW at the book’s website.) For the main effects model, the Highland Fling observation for women has Cook’s distance = 2.78. The next largest is 0.32, for the Highland Fling observation for men. The fit of the main effects model without that race is: > RacesMW <- read.table("http://stat4ds.rwth-aachen.de/data/ScotsRacesMW.dat",header=T) > head(RacesMW,2); tail(RacesMW, 2) race distance climb time gender 1 AnTeallach 10.6 1.062 74.68 1 2 ArrocharAlps 25.0 2.400 187.32 1 race distance climb time gender 135 TwoInns 24.0 1.77 169.47 0 136 Yetholm 12.8 0.76 71.55 0 > fit <- lm(time ~ distance + climb + gender, data=RacesMW) > summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) -7.0109 2.6071 -2.689 0.00809 distance 4.7722 0.1137 41.968 < 2e-16 climb 31.9567 2.4996 12.785 < 2e-16 gender -15.8387 2.2879 -6.923 1.74e-10 > Cd <- cooks.distance(fit); tail(sort(Cd), 3) 70 41 109 0.1435206 0.3204229 2.7814732 > RacesMW[c(41,109),] race distance climb time gender 41 HighlandFling 85 1.2 439.15 1 109 HighlandFling 85 1.2 490.05 0 # without Highland Fling: > fit2 <- lm(time ~ distance + climb + gender, data=RacesMW[-c(41,109),]) > summary(fit2) Estimate Std. Error t value Pr(>|t|) (Intercept) -1.3295 2.3152 -0.574 0.567 distance 3.8662 0.1542 25.080 < 2e-16 climb 40.6499 2.3858 17.038 < 2e-16 gender -15.3154 1.9186 -7.983 6.59e-13 --Multiple R-squared: 0.9541, Adjusted R-squared: 0.9531

For a particular race, we predict that the record time is 15.3 minutes lower for men than for women. We get a slight improvement in fit by permitting interactions between pairs of variables, for instance allowing the gender effect to depend on the distance. 6.22 (a) With τ = 10 the priors for β1 and β2 (specified by B0 = 1/10² = 0.01) are highly disperse and we have results similar to least squares:


> library(MCMCpack) > A = diag(c(10^{-10}, 0.01, 0.01)) > fit.bayes <- MCMCregress(impair ~ life + ses, mcmc=1000000, + b0=0, B0=A, c0=10^(-10), d0=10^(-10), data=Mental) > summary(fit.bayes) 1. Empirical mean and standard deviation for each variable Mean SD Naive SE Time-series SE (Intercept) 28.22826 2.23463 2.235e-03 2.239e-03 life 0.10328 0.03344 3.344e-05 3.344e-05 ses -0.09745 0.02990 2.990e-05 2.990e-05 sigma2 21.95130 5.40361 5.404e-03 5.841e-03 --> mean(fit.bayes[,2] <= 0) [1] 0.001531

With τ = 0.1 the priors for β1 and β2 are narrow and the Bayes estimates shrink the least squares estimates somewhat toward 0: > A = diag(c(10^{-10}, 100, 100)) > fit.bayes2 <- MCMCregress(impair ~ life + ses, mcmc=1000000, + b0=0, B0=A, c0=10^(-10), d0=10^(-10), data=Mental) > summary(fit.bayes2) 1. Empirical mean and standard deviation for each variable Mean SD Naive SE Time-series SE (Intercept) 28.22061 2.14976 2.150e-03 2.154e-03 life 0.09199 0.03171 3.171e-05 3.192e-05 ses -0.08846 0.02862 2.862e-05 2.876e-05 sigma2 21.95097 5.40192 5.402e-03 5.834e-03

(b) Using τ = 0.1 for the intercept forces the Bayes estimate of β0 to be near 0, even though the least squares estimate is 28.2, and then also greatly affects the other estimates. The estimated SES effect even changes sign. > fit.bayes3 <- MCMCregress(impair ~ life + ses, mcmc=1000000, + b0=0, B0=100, c0=10^(-10), d0=10^(-10), data=Mental) > summary(fit.bayes3) 1. Empirical mean and standard deviation for each variable Mean SD Naive SE Time-series SE (Intercept) 0.01479 0.09993 9.993e-05 1.002e-04 life 0.28054 0.05185 5.185e-05 5.391e-05 ses 0.18546 0.04178 4.178e-05 4.198e-05 sigma2 121.28716 29.95581 2.996e-02 3.258e-02

(c) The classical one-sided P -value of 0.0015 says that if truly β1 = 0, the probability would be 0.0015 of getting a t statistic of 3.177 (the observed value) or larger, which makes it seem implausible that H0 : β1 = 0 is true or that β1 < 0. We do not make a probability statement about β1 itself. When τ = 10, the Bayesian posterior P (β1 ≤ 0) = 0.0015 is the probability, once we’ve observed the data, that the life events effect is truly nonpositive. 6.23

> Hares <- read.table("http://stat4ds.rwth-aachen.de/data/Hares.dat", header=TRUE)
> plot(Hares$weight ~ Hares$foot, pch=as.character(Hares$sex), cex=0.50)
> fit <- lm(weight ~ foot + sex, data=Hares)
> summary(fit)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1643.83     122.04 -13.470  < 2e-16
foot           22.25       0.90  24.719  < 2e-16
sexm          -49.78      12.76  -3.902 0.000107
---
Residual standard error: 148.8 on 547 degrees of freedom
Multiple R-squared: 0.5441, Adjusted R-squared: 0.5424
F-statistic: 326.4 on 2 and 547 DF, p-value: < 2.2e-16
> fit2 <- lm(weight ~ foot + sex + foot:sex, data=Hares)



> summary(fit2) Estimate Std. Error t value Pr(>|t|) (Intercept) -1955.267 173.194 -11.289 <2e-16 foot 24.550 1.279 19.194 <2e-16 sexm 558.052 241.337 2.312 0.0211 foot:sexm -4.518 1.792 -2.522 0.0119 --Residual standard error: 148.1 on 546 degrees of freedom Multiple R-squared: 0.5493, Adjusted R-squared: 0.5469 F-statistic: 221.8 on 3 and 546 DF, p-value: < 2.2e-16 > abline(lm(Hares$weight ~ Hares$foot, subset=Hares$sex=="f"), lty=4, col="red") > abline(lm(Hares$weight ~ Hares$foot, subset=Hares$sex=="m"), lty=5, col="blue")

For the model permitting interaction, the prediction equation between y = weight and x = foot length is µ̂ = −1955.27 + 24.55x for female hares and µ̂ = −1397.21 + 20.03x for male hares. The lines cross at x = 123.5, and female hares have a higher predicted weight for foot lengths above that value. The model has R2 = 0.549, and the simpler model without interaction has R2 = 0.544, so the fit is as good for most practical purposes even though the test of H0: no interaction has a P-value of 0.012. No values of Cook's distance are exceptionally large for either model.
6.24 For example, use points close to (x, y) = (0, 30), (1, 31), (2, 32), (3, 33), (4, 34), which have correlation near 1.0, and then add a point at (10, 0), which gives an overall strong negative correlation. That point is influential because it is far from the mean of x (i.e., it has large leverage) and has a large residual. (A small R check appears after solution 6.27 below.)
6.25
6.26 A model is merely a simple approximation for the true relationship among the variables in the population. It is wrong because, for instance, E(Y) is never exactly linearly related to x, with exactly a normal distribution for Y at each value of x, with exactly the same variance σ² for the conditional distribution of Y at each value of x. A model is not useful if it badly describes reality, such as assuming a linear effect for x on E(Y) when actually it is parabolic.
6.27 To minimize f(β0) = ∑i (yi − β0)², we take the derivative of f(β0) with respect to β0 and set it equal to zero at β̂0, giving −2∑i (yi − β̂0) = −2[∑i yi − nβ̂0] = 0, from which β̂0 = (∑i yi)/n. The second derivative, 2n, is positive, so we have found the minimum.
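A minimal R check of the construction described in solution 6.24, using the point values suggested there:
> x <- c(0, 1, 2, 3, 4); y <- c(30, 31, 32, 33, 34)
> cor(x, y)                       # essentially 1
> x2 <- c(x, 10); y2 <- c(y, 0)   # add one high-leverage point with a large residual
> cor(x2, y2)                     # strongly negative (about -0.87)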

6.28 Using the definition (6.2) of the sample correlation, r = ∑i (yi − y)(yi − y) / √{[∑i (yi − y)²][∑i (yi − y)²]} = 1.

6.29 (a) For the centered model, β0 is E(Y) when x = µx, so it is much more relevant than the ordinary intercept when x does not take values near 0. (b) Then β0 is E(Y) when x = min({xi}).
6.30 For z = cy, sz = √{∑i (zi − z)²/(n − 1)} = √{∑i (cyi − cy)²/(n − 1)} = c√{∑i (yi − y)²/(n − 1)} = c·sy. Also, β̂1 = ∑i (xi − x)(cyi − cy)/∑i (xi − x)² = c·∑i (xi − x)(yi − y)/∑i (xi − x)² is also multiplied by c, so r = β̂1(sx/sy) is multiplied by c in the numerator and denominator and does not change.

6.31 When x and y are standardized, x = y = 0 so β̂0 = y − β̂1 x = 0. Also, sx = sy so that r = β̂1 (sx /sy ) = β̂1 and the prediction equation is µ̂i = rxi . 6.32 (a) (µ̂i − y) = (β̂0 + β̂1 xi ) − y = (y − β̂1 x + β̂1 xi ) − y = β̂1 (xi − x).


(b) Since r = β̂1(sx/sy), in this case β̂1 = r = 0.70 and (µ̂i − y) = β̂1(xi − x) = 0.70(xi − x), so the distance from the mean on the final exam is predicted to be only 70% of what it was for the midterm exam.
6.33 Suppose that the systolic blood pressure readings vary around a mean of 120 and the correlation between readings for a person at two times separated by a month is 0.67. Then, for people with systolic blood pressure 150 originally, on average we would expect the reading a month later to be about 2/3 as far from 120, that is, a mean of about 120 + 0.67(150 − 120) = 140.
6.34 [E(Y | X = x) − µY]/σY = [µY + ρ(σY/σX)(x − µX) − µY]/σY = ρ[(x − µX)/σX]. So, if (x − µX)/σX = z, then E[(Y | X = x) − µY]/σY = ρz, only a fraction of the relative amount.
6.35 (a) E(X | Y = y) = µX + ρ(σX/σY)(y − µY). (b) β1/(1/β1∗) = β1β1∗ = [ρ(σY/σX)][ρ(σX/σY)] = ρ².
6.36 From equation (6.1), β̂1 = sxy/sx² and similarly β̂1∗ = sxy/sy², so √(β̂1β̂1∗) = sxy/(sx sy) = r.
6.37

> x <- sample(10:17, 100, replace=T) > y <- rnorm(100, -25.0 + 6.0*x, 15) > plot(x,y); abline(lm(y ~ x)) > summary(lm(y ~ x)) Estimate Std. Error (Intercept) -28.3430 7.4568 x 6.2102 0.5516

The prediction equation is µ̂ = −28.34 + 6.21x. As n grows, it would get closer to −25.0 + 6.0x. 6.38

> L <- 1000000; x <- rnorm(L, 162, 7) > y <- rnorm(L, 3.0 + 0.40*x, 8) > cor(x,y) [1] 0.3302167 > summary(lm(x ~ y)) Estimate Std. Error t value Pr(>|t|) (Intercept) 1.435e+02 5.324e-02 2695.6 <2e-16 y 2.726e-01 7.793e-04 349.8 <2e-16

The simulation approximates the regression equation for predicting height using weight by µ̂x = 143.5 + 0.273y. 6.39

> Races <- read.table("http://stat4ds.rwth-aachen.de/data/ScotsRaces.dat", header=T) > cor(Races$timeW, Races$climb) [1] 0.6852924 > cor(Races[-41,]$timeW, Races[-41,]$climb) [1] 0.851599 > cor(rank(Races$timeW), rank(Races$climb)) [1] 0.8501804 > cor(rank(Races[-41,]$timeW), rank(Races[-41,]$climb)) [1] 0.851184

The Spearman correlation changes only from 0.850 to 0.851 when we remove the outlying observation. It is more robust than the ordinary correlation to outliers. 6.40 (a) To minimize ∑i (yi − βxi)², taking the derivative with respect to β and setting it equal to zero yields ∑i xiyi = β∑i xi², and thus β̂ = (∑i xiyi)/(∑i xi²).


(b)

> fit.d0 <- lm(timeW ~ -1 + distance, data=Races) > summary(fit.d0) Estimate Std. Error t value Pr(>|t|) distance 5.993 0.128 46.84 <2e-16 *** -Residual standard error: 21.48 on 67 degrees of freedom Multiple R-squared: 0.9704, Adjusted R-squared: 0.9699

R-squared is larger because it now compares SSE to TSS measured as variability around 0.
6.41 (a) Since the estimator β̂1 is a linear function of {Yi} and {Yi} are assumed to be independent normal random variables, β̂1 has a normal sampling distribution. Since E(Y) = ∑i E(Yi)/n = ∑i (β0 + β1xi)/n = β0 + β1x, we have E(Yi − Y) = β1(xi − x), and then
E(β̂1) = E[∑i (xi − x)(Yi − Y) / ∑i (xi − x)²] = ∑i (xi − x)E(Yi − Y) / ∑i (xi − x)² = β1∑i (xi − x)² / ∑i (xi − x)² = β1.
For the calculation of var(β̂1), consider that β̂1 equals
β̂1 = ∑i (xi − x)(Yi − Y) / ∑i (xi − x)² = [∑i (xi − x)Yi − Y∑i (xi − x)] / ∑i (xi − x)² = ∑i (xi − x)Yi / ∑i (xi − x)²,
since ∑i (xi − x) = 0. Thus,
var(β̂1) = var[∑i (xi − x)Yi / ∑i (xi − x)²] = ∑i (xi − x)² var(Yi) / [∑i (xi − x)²]² = σ² / ∑i (xi − x)².

(b) (i, iii) With more variation in x or with more observations, [∑i (xi − x)2 ] increases and var(β̂1 ) decreases. (ii) With less variation in y given x, σ 2 decreases and var(β̂1 ) decreases. 6.42 (a)

> x <- runif(100, 0, 10); x2 <- x^2 > y <- rnorm(100, 40 - 5*x + 0.50*x2, 1) > plot(x, y) > fit <- lm(y ~ x + x2) > summary(fit) Estimate Std. Error (Intercept) 39.95192 0.25831 x -5.03556 0.12027 x2 0.50684 0.01158 --Multiple R-squared: 0.9519, Adjusted R-squared: 0.9509 > cor(x, y) [1] 0.0300587

The fitted equation is µ̂ = 39.95 − 5.04x + 0.51x2 , not far from the true regression equation of E(Y ) = 40.0 − 5.0x + 0.5x2 .


(b) The correlation is close to 0 because the relationship is highly nonlinear and poorly described by a straight line. The correlation is a measure of the strength of linear association.

6.43 Healthier people are more likely to play at least a round of golf a week and more likely to be alive a decade later.
6.44 (a) Age is a confounding variable if it is associated both with exercising regularly (perhaps older people being less likely to do so) and with the annual number of serious illnesses (perhaps older people tending to have a greater number of serious illnesses). If as age increases, the frequency of exercising decreases and the annual number of serious illnesses increases, then the impact is for less exercising to be associated with a greater number of serious illnesses. (b) For each subject, the potential outcomes are their numbers of serious illnesses per year (i) if they exercised regularly, (ii) if they did not exercise regularly. The counterfactual outcomes are the ones we did not observe, that is, the number of serious illnesses per year if those who exercised regularly instead did not do so and if those who did not exercise regularly instead did so.
6.45 A possible lurking variable is family income. If families with greater income tend to have children more likely to eat breakfast and more likely to do well on school exams, then X and Y could have a positive correlation because of family income being a partial cause. You could check for this by observing the association separately for different groups of people stratified on family income.
6.46 Plot statewide values of x = median income against y = percent voting Republican that show a negative trend of points. For two states, one at a low value for x and one at a high value for x, plot several points centered around the state's point but with a positive trend.
6.47 (a) When we condition on U = u, corr(X, Y) = corr(u + V, u + W) = corr(V, W) = 0. (b) If U is a common partial cause of X and Y, then as U increases, both X and Y tend to increase, so we'll observe a positive correlation between X and Y even though they are uncorrelated when we keep U fixed.
6.48 Show points varying around a horizontal line with small values of both x1 and y (for the first level of x2), other points varying around a horizontal line with medium values of both x1 and y (for the second level of x2), and other points varying around a horizontal line with large values of both x1 and y (for the third level of x2).
6.49 r² = (TSS − SSE)/TSS is estimating the population ratio [var(Y) − var(Y | X)]/var(Y). When the range of x-values sampled is larger, we expect var(Y) to be larger, whereas the standard model treats var(Y | X) as being the same no matter which x values are sampled, so r² will tend to be larger.
6.50 (a) It represents the change in E(Y), in y standard deviations, for an increase of sxj in xj, adjusting for the other explanatory variables. (b) The ordinary effects relate to a 1-unit change, which is very different in substantive terms for a variable with a relatively large standard deviation, such as weight, and a variable with a relatively small standard deviation, such as number of years of education. Standardized coefficients are more comparable, because they relate to the same change in the explanatory variable as measured by a standard deviation.



6.51 ∑i (yi − y)² = ∑i [(yi − µ̂i) + (µ̂i − y)]² = ∑i (yi − µ̂i)² + ∑i (µ̂i − y)² + 2∑i (yi − µ̂i)(µ̂i − y). Now ∑i (yi − µ̂i)y = y∑i ei = 0 by the first likelihood equation in Section 6.2.6. Also, ∑i (yi − µ̂i)µ̂i = ∑i (yi − µ̂i)(∑j β̂j xij) = 0, because ∑i (yi − µ̂i)xij = 0 by the second likelihood equation. So, ∑i (yi − y)² simplifies to ∑i (µ̂i − y)² + ∑i (yi − µ̂i)².
6.52 (a) Since R² = (TSS − SSE)/TSS and 1 − R² = SSE/TSS,
F = [(TSS − SSE)/p] / {SSE/[n − (p + 1)]} = {(TSS − SSE)/[(TSS)p]} / {[SSE/TSS]/[n − (p + 1)]} = [R²/p] / {(1 − R²)/[n − (p + 1)]}.
Larger values of R² yield larger values of R²/(1 − R²) and hence larger values of the F test statistic.
(b) F = [(SSE0 − SSE1)/(p1 − p0)] / {SSE1/[n − (p1 + 1)]} = {[(TSS − SSE1) − (TSS − SSE0)]/[TSS(p1 − p0)]} / {[SSE1/TSS]/[n − (p1 + 1)]} = [(R1² − R0²)/(p1 − p0)] / {(1 − R1²)/[n − (p1 + 1)]}.
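A quick numerical illustration of the identity in 6.52(a), using a small simulated data set (not part of the original solution):
> set.seed(1)
> n <- 50; x1 <- rnorm(n); x2 <- rnorm(n)
> y <- 2 + x1 + 0.5*x2 + rnorm(n)
> fit <- lm(y ~ x1 + x2); p <- 2
> R2 <- summary(fit)$r.squared
> (R2/p) / ((1 - R2)/(n - (p + 1)))   # F computed from R-squared
> summary(fit)$fstatistic[1]          # F statistic reported by lm(); the two agree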

6.53 From Section 4.4.5, a t random variable has form T = Z/√(X²/d) for a standard normal random variable Z and an independent chi-squared random variable X² with d degrees of freedom. Therefore, T² = (Z²/1)/(X²/d), and since the square of a standard normal random variable has a chi-squared distribution with 1 degree of freedom, from the definition of an F random variable in Section 6.4.1, this has the form of an F random variable with df1 = 1 and df2 equal to the degrees of freedom for the t distribution.
6.54 From Section 4.4.6, for n independent observations from a N(µ, σ²) distribution, (n − 1)S²/σ² has a χ²n−1 distribution. For two independent samples from normal distributions each having variance σ², (n1 − 1)S1²/σ² and (n2 − 1)S2²/σ² are independent chi-squared random variables with df1 = n1 − 1 and df2 = n2 − 1. From the definition in Section 6.4.1 of an F random variable, the ratio
F = {[(n1 − 1)S1²/σ²]/(n1 − 1)} / {[(n2 − 1)S2²/σ²]/(n2 − 1)} = S1²/S2²
has the F distribution with df1 = n1 − 1 and df2 = n2 − 1.
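A one-line numerical check of the T² relationship derived in 6.53 (the df value below is arbitrary):
> d <- 10                       # an arbitrary df value for the t distribution
> qt(0.975, d)^2                # squared t quantile: 4.96
> qf(0.95, df1 = 1, df2 = d)    # the same value from the F(1, d) distribution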

6.55 (a) At x0 = x, β̂0 + β̂1x0 = β̂0 + β̂1x = y by the least squares formula for β̂0, so the interval is y ± tα/2,n−(p+1)(s/√n), the same except for using tα/2,n−(p+1) instead of tα/2,n−1, which is used for the marginal distribution for which p = 0. (b) When {xi} are more spread out, ∑i (xi − x)² increases and the standard error used in the margin of error decreases.
6.56 (a) It is easier to make a precise inference about a mean than it is about a particular future observation. The standard error for a sample mean or a regression prediction decreases toward 0 as n increases, whereas the standard error for a future Y value has σ as a lower bound even as n increases indefinitely. (b) (i) The width of the prediction interval is the same at x0 = x − c as at x + c and does not reflect that y tends to be more variable at higher values of x. (ii) The conditional distribution of Y is far from normal, and it may be impossible to get a total probability close to 0.95 within any particular interval.
6.57 (a) Highly skewed to the left, since r cannot take a value much larger than ρ. (b) A 100(1 − α)% confidence interval for T(ρ) is T(r) ± zα/2√(1/(n − 3)). Once we get the endpoints of the interval for T(ρ), we substitute each endpoint for T in the inverse transformation ρ = (e^(2T) − 1)/(e^(2T) + 1) to get the endpoints of the confidence interval for ρ. (Unless r = 0, the confidence interval for ρ is not symmetric about r, because of the nonsymmetry of the sampling distribution of r.)
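A short R sketch of the interval described in 6.57(b); the values r = 0.70 and n = 30 are hypothetical:
> r <- 0.70; n <- 30; alpha <- 0.05    # hypothetical sample correlation and sample size
> T.r <- atanh(r)                      # T(r) = (1/2)log[(1 + r)/(1 - r)]
> ci.T <- T.r + c(-1, 1)*qnorm(1 - alpha/2)*sqrt(1/(n - 3))
> tanh(ci.T)                           # back-transform the endpoints to the rho scale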



6.58 Scenario 1: Samples (9, 10, 10, 11), (11, 12, 12, 13), (13, 14, 14, 15) yield F = 24 and P-value = 0.0002. Scenario 2: Samples (0, 10, 10, 20), (2, 12, 12, 22), (4, 14, 14, 24) yield F = 0.24 and P-value = 0.79. Smaller within-groups variability, relative to a particular between-groups variability, leads to larger F test statistics and smaller P-values.
6.59 Note that ∪j Bj = ∪j Ej and Bj ⊆ Ej, but the {Bj} are disjoint, and so P(∪j Ej) = P(∪j Bj) = ∑j P(Bj) ≤ ∑j P(Ej), since P(Bj) ≤ P(Ej) for all j.
6.60 (a) The most significant test compares P(1) to α/t, which is what each P-value is compared to with the Bonferroni method, but then the other tests with the FDR method have less conservative requirements. (b) With the FDR method the ordered P-values are compared with jα/t = j(0.05)/15 = (0.0033)j, starting with j = 15. The maximum j for which P(j) ≤ (0.0033)j is j = 4, for which P(4) = 0.0095 < (0.0033)4. So the hypotheses with the four smallest P-values are rejected. By contrast, the Bonferroni approach with family-wise error rate 0.05 compares each P-value to 0.05/15 = 0.0033 and rejects only three of these hypotheses.
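The comparison in 6.60(b) can be reproduced with p.adjust(); the 15 P-values below are hypothetical stand-ins chosen to mimic the exercise (only P(4) = 0.0095 is taken from the solution):
> pvals <- c(0.0001, 0.002, 0.003, 0.0095, 0.04, 0.05, 0.08, 0.10,
+            0.15, 0.20, 0.30, 0.40, 0.60, 0.80, 0.90)   # hypothetical values
> sum(p.adjust(pvals, method = "BH") <= 0.05)            # 4 rejections controlling FDR at 0.05
> sum(p.adjust(pvals, method = "bonferroni") <= 0.05)    # 3 rejections with Bonferroni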

6.61 For the normal linear model, Section 6.4.6 showed that the maximized likelihood function simplifies to
ℓ(β̂0, β̂1, . . . , β̂p, σ̂) = [1/(σ̂√(2π))]ⁿ e^(−n/2).
Denote the ML variance estimates by σ̂0² = SSE0/n for the simpler model, and by σ̂1² = SSE1/n for the more complex model. The likelihood ratio is
{[1/(σ̂1√(2π))]ⁿ e^(−n/2)} / {[1/(σ̂0√(2π))]ⁿ e^(−n/2)} = (σ̂0²/σ̂1²)^(n/2) = (SSE0/SSE1)^(n/2) = [1 + (SSE0 − SSE1)/SSE1]^(n/2) = [1 + (p1 − p0)F/(n − (p1 + 1))]^(n/2)

for the F test statistic (6.9). A large value of the likelihood ratio, and thus strong evidence against H0, corresponds to a large value of the F statistic. The likelihood-ratio test leads to the same analysis as the F test.
6.62 (a) Letting xij be an indicator variable for subject i for category j of A and zij be an indicator variable for subject i for category j of B, E(Yij) = β0 + β1xi1 + ⋯ + βr−1xi,r−1 + γ1zi1 + ⋯ + γc−1zi,c−1. Test H0: β1 = β2 = ⋯ = βr−1 = 0 by using the F test comparing nested models, where the simpler model does not have the indicator variables for A. (A small R sketch of such a comparison appears after solution 6.63 below.) (b) E(Yij) = β0 + β1xi1 + γ1zi1 + γ2zi2 + δ1xi1zi1 + δ2xi1zi2. H0: No interaction is H0: δ1 = δ2 = 0, which is tested using the F test comparing nested models, where the simpler model does not have the interaction cross-products of indicator variable terms.
6.63 The effect β2 is the difference between E(Y) for the group with z = 1 and the group with z = 0 at x = µx. Without centering, β2 is the difference between E(Y) for the group with z = 1 and the group with z = 0 at x = 0, which may not be a relevant place to make a comparison.
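A minimal R sketch of the nested-model F test described in 6.62(a), using simulated two-factor data (the factor levels and effect sizes are hypothetical):
> set.seed(2)
> A <- factor(rep(c("a1", "a2", "a3"), each = 8))   # factor A with r = 3 categories
> B <- factor(rep(c("b1", "b2"), times = 12))       # factor B with c = 2 categories
> y <- rnorm(24, mean = 10 + 2*(A == "a3"))         # hypothetical response
> fit0 <- lm(y ~ B)        # simpler model without the indicator variables for A
> fit1 <- lm(y ~ A + B)    # main effects model
> anova(fit0, fit1)        # F test of H0: no A effect, adjusting for B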



6.64 The model matrix is

1  x11    1  1  0
1  x21    1  1  0
1  x31    1  0  1
1  x41    1  0  1
1  x51    1  0  0
1  x61    1  0  0
1  x71    0  1  0
1  x81    0  1  0
1  x91    0  0  1
1  x10,1  0  0  1
1  x11,1  0  0  0
1  x12,1  0  0  0

6.65 As n increases, the model matrix X has more rows, each term on the main diagonal of XᵀX has a greater number of terms in its sum of squares and is therefore larger, and the main diagonal elements of (XᵀX)⁻¹ tend to be smaller.
6.66

Chapter 7 7.1 (a)

> Houses <- read.table("http://stat4ds.rwth-aachen.de/data/Houses.dat", header=T) > plot(Houses$taxes, Houses$price)

It appears that prices may vary more at higher values for taxes, suggesting that a GLM assuming a gamma distribution may be more appropriate. (b)

> fit <- glm(price ~ taxes + new, family=gaussian(link=identity), data=Houses) > summary(fit) # (i) Estimate Std. Error t value Pr(>|t|) (Intercept) 41.96514 14.69505 2.86 0.0053 taxes 0.09513 0.00693 13.72 <2e-16 new 86.20008 27.24481 3.16 0.0021 --Null deviance: 2284086 on 99 degrees of freedom Residual deviance: 602636 on 97 degrees of freedom AIC: 1162 > fit2 <- glm(price ~ taxes + new, family=Gamma(link=identity), data=Houses) > summary(fit2) # (ii) Estimate Std. Error t value Pr(>|t|) (Intercept) 63.22196 7.83642 8.07 1.9e-12 taxes 0.08359 0.00621 13.46 < 2e-16 new 80.54539 35.97100 2.24 0.027 --(Dispersion parameter for Gamma family taken to be 0.087319) AIC: 1107

Adjusting for taxes, the estimated mean selling price for new homes is 86.2 thousand dollars higher than for older homes with the normal model. It is 80.5 thousand dollars higher with the gamma model. (c) For the normal model, the estimated variability is σ̂ = √(602636/97) = 78.82 at each value of x1 and x2 and thus at each value for the estimated mean. For the gamma model, σ̂ = µ̂/√k̂ = √(0.0873)µ̂ = 0.2955µ̂, which varies between 29.55 thousand


dollars when µ̂ = 100 thousand dollars to 147.75 thousand dollars when µ̂ = 500 thousand dollars. This is probably more realistic. (d) The gamma model has considerably smaller AIC and is preferred to the normal model by that criterion.

7.2 (a)

> Races <- read.table("http://stat4ds.rwth-aachen.de/data/ScotsRaces.dat", + header=T) > fit.dc2 <- glm(timeW ~ distance + climb, family=Gamma(link=identity), + data=Races) > summary(fit.dc2) Estimate Std. Error t value Pr(>|t|) (Intercept) -4.6081 1.4551 -3.167 0.00235 distance 4.1485 0.2418 17.160 < 2e-16 climb 38.7379 4.0058 9.670 3.32e-14 (Dispersion parameter for Gamma family taken to be 0.01532269) AIC: 512.41

The estimated standard deviation σ̂ of the conditional distribution of Y relates to the estimated mean µ̂ by σ̂ = µ̂/√k̂ = √(0.01532)µ̂ = 0.1238µ̂. > c(min(fitted(fit.dc2)), max(fitted(fit.dc2))) [1] 16.41471 394.50054 > c(sqrt(0.01532)*min(fitted(fit.dc2)), sqrt(0.01532)*max(fitted(fit.dc2))) [1] 2.031715 48.828905

The estimated standard deviation increases from 2.0 to 48.8, and is far from constant as normal GLMs assume. (b)

> fit.dc <- glm(timeW ~ distance + climb, family=gaussian(link=identity), + data=Races) > summary(fit.dc) Estimate Std. Error t value Pr(>|t|) (Intercept) -14.5997 3.4680 -4.21 8.02e-05 distance 5.0362 0.1683 29.92 < 2e-16 climb 35.5610 3.7002 9.61 4.22e-14 --(Dispersion parameter for gaussian family taken to be 195.0032) Null deviance: 353013 on 67 degrees of freedom Residual deviance: 12675 on 65 degrees of freedom AIC: 556.47

Since AIC is smaller for the gamma model (512.4 versus 556.5), AIC suggests that the gamma model gives a better fit. (c)

> fit.dc3 <- glm(timeW ~ distance + climb + distance:climb, + family=Gamma(link=identity),data=Races) > summary(fit.dc3) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.3273 2.4401 0.134 0.8937 distance 3.6530 0.3026 12.071 < 2e-16 climb 26.5266 6.0753 4.366 4.71e-05 distance:climb 0.8702 0.3387 2.569 0.0125 (Dispersion parameter for Gamma family taken to be 0.01403262) AIC: 507.43

The effect on record time of a 1 km increase in distance changes from 3.653 + 0.870(0.185) = 3.81 minutes (for climb value 0.185 km) to 3.653 + 0.870(2.40) = 5.74 minutes (for climb value 2.40 km). 7.3

> x <- runif(6, 0, 10); mu <- 40 + 4*x - 0.1*x^2; y <- rnorm(6, mu, 5) > Data <- data.frame(cbind(x,y)); Data x y 1 4.792859 61.17038 2 6.087078 63.70319 3 3.449201 53.73074



4 6.782841 62.78965 5 1.101420 45.55565 6 9.285150 79.52515 > fit0 <- lm(y ~ 1, data=Data); fit1 <- lm(y ~ x, data=Data) > fit5 <- lm(y ~ x+I(x^2)+I(x^3)+I(x^4)+I(x^5), data=Data) > plot(y ~ x, Data, pch = 16, col="dodgerblue4") > curve(40 + 4*x - 0.1*x^2, 0, 10, col="dodgerblue4", lwd=3, add=T) > curve(predict(fit5,newdata=data.frame(x)), add=T, lwd=2, col="olivedrab4") > curve(predict(fit0,newdata=data.frame(x)),add=T, lwd=2, col="darkorange1") > curve(predict(fit1,newdata=data.frame(x)), col="red4", lwd=2, add=T) > colors <- c("dodgerblue4", "darkorange1","red4", "olivedrab4") > labels <- c("true relationship", "null model" ,"straight-line model", + "fifth-degree polynomial") > legend("bottomright", labels, lwd=2, col=colors, cex=0.9,box.lty=0) > anova(fit0) # (i) Df Sum Sq Mean Sq F value Pr(>F) Residuals 5 645.05 129.01 > anova(fit1) # (ii) Df Sum Sq Mean Sq F value Pr(>F) x 1 615.06 615.06 82.034 0.0008235 Residuals 4 29.99 7.50 > anova(fit5) # (iii) Df Sum Sq Mean Sq F value Pr(>F) x 1 615.06 615.06 I(x^2) 1 6.36 6.36 I(x^3) 1 10.27 10.27 I(x^4) 1 13.19 13.19 I(x^5) 1 0.17 0.17 Residuals 0 0.00 > SS <- function(fit,mu){sum((fitted(fit) - mu)^2)} > c(SS(fit0,mu), SS(fit1,mu), SS(fit5,mu)) [1] 429.1716 113.0308 151.8976

The more complex model gives a perfect fit, having as many parameters as observations. The SSE values for the specific data set are 645.05, 29.99 and 0 for models (i), (ii) and (iii) while the corresponding actual sum of squares around the true mean values are 429.17, 113.03 and 151.90, showing that the sum of squares of the model fitted values around the means from the true relationship is lower for the simple straight-line model. Particular values in such simulations will vary because of the randomness in constructing the sample. 7.4 (a)

> x <- sort(runif(30, -4, 4)); mu <- 3.5*x^3-20*x^2+0.5*x+20 > y <- rnorm(length(x),mu,30) > x [1] -3.8272055 -3.5776861 -3.4631754 -2.8894969 -2.5612320 -2.5305821 [7] -2.2543454 -2.1223775 -1.6739663 -1.3422954 -0.8463387 -0.6233095 [13] -0.3867731 0.1269114 0.6174547 1.1180703 1.1531301 1.2079706 [19] 1.9163838 2.2633300 2.4247252 2.4395922 2.4437635 2.7548597 [25] 2.7746287 2.8231808 3.3551416 3.3799582 3.4474097 3.6819943 > y [1] -488.1556479 -374.7428640 -343.9618261 -224.5604698 -136.8970853 [6] -173.9774963 -152.0411357 -79.1332840 -73.7314454 7.2620082 [11] 77.5107067 49.6370131 20.0326630 0.2549266 4.6642432 [16] 0.2074055 -1.6558158 3.5770667 -7.9335280 -65.5657816 [21] -41.7614756 -44.2999342 -0.6930596 -37.2092112 -45.5186380 [26] -54.8708126 -26.2475665 -50.6547644 -69.9836776 -79.1121202 > fit1 <- lm(y ~ x), fit2 <- lm(y ~ x+I(x^2)); fit3 <- lm(y ~ x+I(x^2)+I(x^3)) > fit5 <- lm(y ~ x+I(x^2)+I(x^3)+I(x^4)+I(x^5)) > fit7 <- lm(y ~ x+I(x^2)+I(x^3)+I(x^4)+I(x^5)+I(x^6)+I(x^7)) > plot(x,y); abline(lm(y ~ x), lwd=2, col="dodgerblue4") > lines(x,mu,lwd=2) # the true relationship (in black) > lines(x, fitted.values(fit2), col="red4",lwd=2)


> lines(x, fitted.values(fit3), col="olivedrab4",lwd=2) > lines(x, fitted.values(fit7), col="darkorange1",lwd=2) > SS <- function(fit,mu){sum((fitted(fit) - mu)^2)} > c(SS(fit1,mu),SS(fit2,mu),SS(fit3,mu),SS(fit5,mu),SS(fit7,mu)) [1] 276747.636 33054.508 3708.529 4621.374 7041.126 > anova(fit1); anova(fit2); anova(fit3); anova(fit5); anova(fit7)

(b)

> x <- sort(runif(30, -4, 4)); mu <- 0.5*x^3-20*x^2+0.5*x+20 > y <- rnorm(length(x),mu,30) > x [1] -3.63230165 -3.59945067 -3.56849796 -3.51074935 -3.46025713 -2.26578297 [7] -1.84165229 -1.62468778 -1.56411481 -1.47087958 -1.37341308 -1.19423067 [13] -0.94942150 -0.60947502 -0.55319765 -0.34691411 -0.12838775 0.05019634 [19] 0.11928448 0.12450132 0.58993398 0.68122118 1.49969209 1.77812556 [25] 2.28015643 2.31777778 2.77283500 2.96543995 3.26932004 3.47691067 > y [1] -255.747991 -240.911992 -259.277097 -208.432139 -261.842604 -104.936907 [7] -107.041414 -24.706167 -73.014158 -43.133820 36.407084 12.710253 [13] 1.492540 3.944474 -31.243192 33.717999 -0.213923 17.264402 [19] 6.311622 24.826813 -35.788593 16.468968 17.610538 -39.492714 [25] -5.006374 -109.696817 -168.122795 -166.003156 -205.206344 -218.547232 > fit1 <- lm(y ~ x), fit2 <- lm(y ~ x+I(x^2)); fit3 <- lm(y ~ x+I(x^2)+I(x^3)) > fit5 <- lm(y ~ x+I(x^2)+I(x^3)+I(x^4)+I(x^5)) > fit7 <- lm(y ~ x+I(x^2)+I(x^3)+I(x^4)+I(x^5)+I(x^6)+I(x^7)) > c(SS(fit1,mu),SS(fit2,mu),SS(fit3,mu),SS(fit5,mu),SS(fit7,mu)) [1] 273917.643 1992.414 3874.320 4074.662 8206.447 > anova(fit1); anova(fit2); anova(fit3); anova(fit5); anova(fit7)

The SSE (Residuals Sum Sq in the anova output above) for the models fit1 to fit7 are 296554, 52682, 15734, 14821 and 12401 for case (a) and 292947, 25936, 25195, 24994 and 20862 for case (b) above. They are always decreasing in the degree of the polynomial model fitted, i.e., decreasing in the number of parameters in the model. On the other hand, the sum of squares of the model fitted values around the means from the true relationship (see SS in the output) is smaller for simpler models. In case (a) the smallest value is achieved for the third-degree polynomial model (the true relationship), while in case (b), where the coefficient of x³ is small, the simpler quadratic model can provide a better fit to a specific data set, as for the data above (verify this by repeating the data simulation several times under both setups).
7.5 (a) Since survival is binary, we fit a logistic regression model.
> Sheep <- read.table("http://stat4ds.rwth-aachen.de/data/Sheep.dat",header=T)
> mean(Sheep$survival)
[1] 0.7660044    # 76.6% of the 1359 sheep survived
> summary(Sheep$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.10   15.00   20.50   19.56   23.70   34.20
> fit <- glm(survival ~ weight, family=binomial, data=Sheep)
> summary(fit)
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.02983    0.25051  -8.103 5.37e-16
weight       0.17545    0.01394  12.589  < 2e-16
Null deviance: 1478.8 on 1358 degrees of freedom
Residual deviance: 1290.4 on 1357 degrees of freedom

Since β̂1 > 0 and the P -value < 0.0001 for testing H0 : β1 = 0, the model fit suggests that the heavier sheep were more likely to survive. (b) Since π̂ = 0.50 at x = −β̂0 /β̂1 = 2.029/0.175 = 11.57, π̂ > 0.50 when weight x > 11.57 kg. 7.6

> x <- 0:2; x1 <- rep(x, each=4)    # (i) ungrouped data: x1, y1



> y1 <- c(1,0,0,0,1,1,0,0,1,1,1,1) > fit1 <- glm(y1 ~ x1, family=binomial(link=logit)) > summary(fit1) Estimate Std. Error z value Pr(>|z|) (Intercept) -1.503 1.181 -1.272 0.2033 x1 2.060 1.130 1.823 0.0682 . --Null deviance: 16.301 on 11 degrees of freedom Residual deviance: 11.028 on 10 degrees of freedom > n <- rep(4,3); y <- c(1,2,4) # (ii) grouped data: x, y, n > fit <- glm(y/n ~ x, weights=n, family=binomial(link=logit)) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -1.503 1.181 -1.272 0.2034 x 2.060 1.130 1.823 0.0683 . --Null deviance: 6.2568 on 2 degrees of freedom Residual deviance: 0.9844 on 1 degrees of freedom

(a) Estimates and se values are the same. (b) For ungrouped data, deviances = 16.30 and 11.03. For grouped data, deviances = 6.26 and 0.98. The log-likelihood value for the saturated model depends on the form of data entry, because the number of parameters differs in the two cases. (c) The difference between the deviances, which is the likelihood-ratio test statistic for testing H0 : β1 = 0 for the effect of x1 , is 5.3 in each case. This difference does not depend on the log-likelihood value for the saturated model, because it cancels out in the difference of deviances. 7.7

> Afterlife <-read.table("http://stat4ds.rwth-aachen.de/data/Afterlife.dat",header=T) > y <- ifelse(Afterlife$postlife == 1, 1, 0) > fit <- glm(y ~ factor(religion)+factor(gender), family=binomial, data=Afterlife) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) 1.8013 0.1228 14.671 < 2e-16 factor(religion)2 -0.3476 0.1677 -2.073 0.0382 factor(religion)3 -2.1142 0.3613 -5.852 4.84e-09 factor(gender)2 0.6259 0.1566 3.996 6.44e-05 ---Null deviance: 1177.5 on 1552 degrees of freedom Residual deviance: 1128.3 on 1549 degrees of freedom

The probability of believing in the afterlife seems to be somewhat greater for females than males, adjusting for religion, and quite a bit lower for the Jewish religious category than the Protestant and Catholic categories, adjusting for gender. 7.8 (a) We assume independent observations, both within and between replicates (which may be unrealistic, but more complex modeling that accounts for possible correlation is beyond the scope of this book) and a binomial distribution for the number of successes out of each set of n independent, identical trials. > Soybeans <- read.table("http://stat4ds.rwth-aachen.de/data/Soybeans.dat", + header=TRUE) > fit <- glm(y/n ~ treatment + lab, family=quasi(link=identity, + variance="mu(1-mu)"), weights=n, data=Soybeans) > summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.86492 0.01232 70.228 <2e-16 treatmentB -0.26627 0.01642 -16.212 <2e-16 labprivate 0.02333 0.01523 1.532 0.127


The estimated probability of success is 0.266 higher for treatment A than for treatment B. (b)

> fit2 <- glm(y/n ~ treatment * lab, family=quasi(link=identity, + variance="mu(1-mu)"), weights=n, data=Soybeans) > summary(fit2) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.86042 0.01371 62.74 <2e-16 treatmentB -0.25250 0.02369 -10.66 <2e-16 labprivate 0.03167 0.01841 1.72 0.087 treatmentB:labprivate -0.02667 0.03292 -0.81 0.419 --> fit3 <- glm(y/n ~ treatment * lab, family=binomial, weights=n, data=Soybeans) > summary(fit3) Estimate Std. Error z value Pr(>|z|) (Intercept) 1.81875 0.05890 30.878 < 2e-16 treatmentB -1.38019 0.07223 -19.108 < 2e-16 labprivate 0.29345 0.08830 3.323 0.00089 treatmentB:labprivate -0.27242 0.10631 -2.563 0.01039

The model with logit link is also not as simple to interpret as the model with identity link, since the effects refer to odds ratios rather than to differences of probabilities. 7.9 (a)

> SoreThroat <- read.table("http://stat4ds.rwth-aachen.de/data/SoreThroat.dat", + header=TRUE) > fit <- glm(Y ~ D + T, family=binomial(link=logit), data = SoreThroat) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -1.41734 1.09457 -1.295 0.19536 D 0.06868 0.02641 2.600 0.00931 ** T -1.65895 0.92285 -1.798 0.07224 .

The model fit is logit(π̂) = −1.417 + 0.069D − 1.659T . The estimated odds of a sore throat for those using a tracheal tube are e−1.659 = 0.19 times the estimated odds for those using a laryngeal mask airway, for a given duration of surgery. The estimated odds of a sore throat multiply by e0.069 = 1.07 for each additional minute in duration of the surgery (i.e., the estimated odds increase by 7%), for a given type of device. (b)

> fit1 <- glm(Y ~ D * T, family=binomial(link=logit), data = SoreThroat) > summary(fit1) Estimate Std. Error z value Pr(>|z|) (Intercept) 0.04979 1.46940 0.034 0.9730 D 0.02848 0.03429 0.831 0.4062 T -4.47224 2.46707 -1.813 0.0699 . D:T 0.07460 0.05777 1.291 0.1966

The model fit is logit(π̂) = 0.0498 + 0.0285D − 4.4722T + 0.0746(D × T ). This gives prediction equation logit(π̂) = −4.42 + 0.103D when T = 1 and logit(π̂) = 0.0498 + 0.0285D when T = 0. Duration has much more of a sample effect when the tracheal tube is used (T = 1) than when a laryngeal mask airway is used (T = 0), but it is not statistically significant. 7.10 (a)

> ctab <- array(c(53,414,0,16,11,37,4,139), dim=c(2,2,2)) > dimnames(ctab) <- list(DeathP=c("yes","no"), victim_race=c("white","black"), + defend_race=c("white","black")); ctab , , defend_race = white victim_race DeathP white black yes 53 0 no 414 16 , , defend_race = black



victim_race DeathP white black yes 11 4 no 37 139 > DPdr_v.w <- ctab[,1,]; DPdr_v.w # conditional table: victim=white defend_race DeathP white black yes 53 11 no 414 37 > DPdr_v.b <- ctab[,2,]; DPdr_v.b # conditional table: victim=black defend_race DeathP white black yes 0 4 no 16 139 > DPdr_v.marg <- marginSums(ctab, c(1,3)); DPdr_v.marg # marginal table defend_race DeathP white black yes 53 15 no 430 176 > DPdr_v.w[1,]/colSums(DPdr_v.w) # death penalty, condit. victim=white white black 0.1134904 0.2291667 > DPdr_v.b[1,]/colSums(DPdr_v.b) # death penalty, condit. victim=black white black 0.00000000 0.02797203 > DPdr_v.marg[1,]/colSums(DPdr_v.marg) # death penalty, indep. of victim's race white black 0.10973085 0.07853403

Marginally, the percentages getting the death penalty are 7.9% for black defendants and 11% for white defendants. With white victims, the percentages were 22.9 and 11.3, and with black victims, the percentages were 2.8 and 0.0%. In each case, black defendants were more likely to receive the death penalty. This reflects the strong association between defendant’s race and victim’s race (odds ratio = 87). Blacks are more likely to kill blacks, whites are more likely to kill whites, and killing a white person is more likely to lead to the death penalty. (b)

> y <- c(3,0,9,11); n <- c(25,3,30,132) > d <- c(0,0,1,1); v <- c(0,1,0,1) > fit <- glm(y/n ~ d + v, family=binomial, weights=n) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -2.0232 0.6137 -3.297 0.000978 *** d 1.1886 0.7236 1.643 0.100461 v -1.5713 0.5028 -3.125 0.001778 ** Null deviance: 9.50367 on 3 degrees of freedom Residual deviance: 0.16676 on 1 degrees of freedom

For black defendants, the estimated odds of receiving the death penalty were e1.1886 = 3.28 times the estimated odds for white defendants, adjusting for victim’s race. 7.11

y <- c(3,0,9,11); n <- c(25,3,30,132) d <- c(0,0,1,1); v <- c(0,1,0,1) fit <- glm(y/n ~ d + v, family=binomial, weights=n) summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -2.0232 0.6137 -3.297 0.000978 d 1.1886 0.7236 1.643 0.100461 v -1.5713 0.5028 -3.125 0.001778 Null deviance: 9.50367 on 3 degrees of freedom Residual deviance: 0.16676 on 1 degrees of freedom


Adjusting for defendant's race, the death penalty is more likely when the victim is white. Adjusting for victim's race, there is some indication that the death penalty is more likely for black defendants, but the P -value for testing this effect is just 0.10 (assuming these cases can be treated as a random sample from some conceptual population). The residual deviance is small, so the model seems to fit well.

7.12

> Employ <- read.table("http://stat4ds.rwth-aachen.de/data/Employment.dat",header=T) > fit2 <- glm(employed ~ female + italian + pension + italian:pension, + family=quasi(link=identity, variance="mu(1-mu)"), data=Employ) > summary(fit2) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.573069 0.005387 106.382 < 2e-16 female -0.139706 0.003450 -40.495 < 2e-16 italian 0.168581 0.005432 31.036 < 2e-16 pension -0.221692 0.031448 -7.050 1.81e-12 italian:pension -0.203007 0.031896 -6.365 1.97e-10 --(Dispersion parameter for quasi family taken to be 1.000073) Null deviance: 96912 on 72199 degrees of freedom Residual deviance: 89896 on 72195 degrees of freedom > fit3 <- glm(employed ~ female + italian + pension + italian:pension, + family=binomial(link=logit), data=Employ) > summary(fit3) Estimate Std. Error z value Pr(>|z|) (Intercept) 0.33899 0.02246 15.096 < 2e-16 female -0.64471 0.01614 -39.953 < 2e-16 italian 0.71552 0.02262 31.637 < 2e-16 pension -0.98404 0.16046 -6.132 8.65e-10 italian:pension -0.91266 0.16307 -5.597 2.19e-08 --(Dispersion parameter for binomial family taken to be 1) Null deviance: 96912 on 72199 degrees of freedom Residual deviance: 89881 on 72195 degrees of freedom

Gender is a main effect and not in an interaction term, so it is simple to interpret. From the linear probability model (identity link function), the estimated population proportion employed is 0.14 lower for females than males, adjusting for the pension and Italian citizenship variables. From the logistic model (logit link function), the estimated odds of a female being employed are e−0.645 = 0.52 times the estimated odds of a male being employed, adjusting for the pension and Italian citizenship variables. The estimates are nearly identical to those in the model without the interaction between pension and Italian citizenship.

> Employ <- read.table("http://stat4ds.rwth-aachen.de/data/Employment2.dat",header=T) > fit <- glm(empl ~ female + italian + pension + female:italian, family=binomial, + data=Employ) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -2.30076 0.22497 -10.227 < 2e-16 female -0.27256 0.30592 -0.891 0.372961 italian 0.01163 0.22604 0.051 0.958964 pension 0.21656 0.09497 2.280 0.022583 female:italian -1.08296 0.31075 -3.485 0.000492

The gender effect is −0.273 for non-citizens and −0.273 − 1.083 = −1.356 for citizens. The estimated odds of employment for women are exp(−0.273) = 0.76 times those for men for non-citizens and exp(−1.356) = 0.26 times those for men for citizens. 7.14 For defendant's race d (1 = white, 0 = nonwhite), victim's race v (1 = white, 0 = nonwhite), gender g (1 = male, 0 = female), and response y (1 = conviction, 0 = no



conviction), the logistic model logit[P(Y = 1)] = β0 + β1d + β2v + β3g has β̂1 = −1.9, β̂2 = 1.8, β̂3 = −2.7. For example, the estimated odds of conviction for a white defendant were e−1.9 = 0.15 times the estimated odds of conviction for a black defendant, adjusting for victim's race and gender, which is a very strong effect.
7.15 (a) The odds ratio was (0.00006291/(1 − 0.00006291))/(0.000000168/(1 − 0.000000168)) = 374.49; the odds of death due to suicide by firearm were 374.49 times as high in the U.S. as in the UK. The risk ratio was 0.00006291/0.000000168 = 374.46. When the proportions are close to 0, the complement probabilities used in the denominator for each odds are close to 1, so they do not affect the ratio much. (b) Risk ratio = 0.78/0.21 = 3.71; odds ratio = (0.78/0.22)/(0.21/0.79) = 13.34, which is much larger and closer to the square of the risk ratio, 3.71² = 13.80. (c) logit[P(Y = 1)] = β0 + β1x, where y = whether climate change should be a top priority (1 = yes, 0 = no) and x = political party (1 = Democrat, 0 = Republican), for which β̂1 = log(13.34) = 2.59.
7.16

> x <- c(1, 0); y <- c(10, 5); n <- c(10, 10) > fit <- glm(y/n ~ x, weights=n, family=binomial(link=logit)) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) 6.895e-17 6.325e-01 0 1 x 2.512e+01 5.461e+04 0 1 ---Null deviance: 8.6305e+00 on 1 degrees of freedom Residual deviance: 2.4675e-10 on 0 degrees of freedom

(a) Here, R reports a log odds ratio of β̂1 = 25.12 (which corresponds to an estimated odds ratio of more than 81 billion!). The true β̂1 = ∞. (b) Using the results reported by R, with se = 54610 the Wald statistic is z = 25.12/54610 = 0.0, with P -value = 1.0. The Wald test is not appropriate when the actual ML estimate is infinite and software does not provide reliable results. The likelihood-ratio statistic is the null deviance of 8.63, which has chi-squared P -value (with df = 1) of 0.003 and is more appropriate than the Wald test. (c) Following the analysis of the example in Section 7.3.2 and assuming a N (0, σ 2 ) prior distribution with σ 2 = 100 for both, β0 and β1 , we get: > x1 <- c(rep(1,10), rep(0,10)) > y1 <- c(rep(1,10), rep(1,5), rep(0,5)) > library(MCMCpack) > fitBayes <- MCMClogit(y1 ~ x1, mcmc=1000000, b0=0, B0=0.01) # prior mean=b0, > summary(fitBayes) # prior var.=1/B0 1. Empirical mean and standard deviation for each variable, plus standard error of the mean: Mean SD Naive SE Time-series SE (Intercept) 0.04044 0.6608 0.0006608 0.002503 x1 9.74143 5.6514 0.0056514 0.015559 2. Quantiles for each variable: 2.5% 25% 50% 75% 97.5% (Intercept) -1.257 -0.3964 0.03768 0.4746 1.349 x1 2.073 5.4090 8.59879 12.9985 23.304 > mean(fitBayes[,2] <= 0) [1] 0.000443

The posterior probability that β1 ≤ 0 is 0.0004. For prior variance σ 2 = 5, the posterior mean for β1 is close to the estimate derived by Firth penalized likelihood method (see next exercise for the code).



7.17 (a)

> x <- c(10, 20, 30, 40, 60, 70, 80, 90) > y <- c(0, 0, 0, 0, 1, 1, 1, 1) > fit <- glm(y ~ x, family=binomial) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -118.158 296046.187 0 1 x 2.363 5805.939 0 1 --Null deviance: 1.1090e+01 on 7 degrees of freedom Residual deviance: 2.1827e-10 on 6 degrees of freedom > 1 - pchisq(11.09, 1) [1] 0.0008679448

The true ML estimate is β̂1 = ∞; the reported β̂1 = 2.363 has se = 5805.9, which is very large because the log-likelihood is very flat at the value β̂1 = 2.363 at which the iterative fitting routine judged convergence adequate, and the se is greater when the log-likelihood function has less curvature (and lower information). (b)

> library(logistf) > fit.pen <- logistf(y ~ x, family=binomial) > summary(fit.pen) coef se(coef) lower 0.95 upper 0.95 Chisq p (Intercept) -4.44817185 2.94129528 -17.14643889 -0.4340569 5.072331 0.02431068 x 0.08896344 0.05460256 0.01455854 0.3238650 6.338126 0.01181696

The estimate of β1 is 0.089 instead of ∞, and the 95% CI is (0.015, 0.324) instead of over the entire real line given by the ordinary Wald CI. 7.18 (a) Classical: If truly β1 = 0, the probability would be only 0.0011 of getting a T test statistic farther out in the right-tail of the t distribution. This makes us very skeptical that β1 = 0, whereas it is plausible that β1 > 0 (i.e., Ha is true). Bayesian: Based on our prior beliefs and the data, we conclude that the probability is only 0.0002 that β1 ≤ 0. Since the posterior P (β1 > 0) = 0.9998, we would conclude that β1 > 0. (b) Continuing with the data read and standardized in Section 7.3.2, we have: > fitBayes <- MCMClogit(HG ~ NV2 + PI2 + EH2, mcmc=10000000, b0=0, B0=0.0001, data=Endo) > summary(fitBayes) Mean SD (Intercept) 7.8137 4.5504 NV2 18.3292 9.0766 PI2 -0.4919 0.4565 EH2 -2.1193 0.5928 2. Quantiles for each variable: 2.5% 25% 50% 75% 97.5% (Intercept) 0.2447 3.929 7.7658 11.691 15.4887 NV2 3.2181 10.573 18.2560 26.093 33.5195 PI2 -1.4401 -0.787 -0.4727 -0.177 0.3469 EH2 -3.3802 -2.495 -2.0831 -1.704 -1.0675 > mean(fitBayes[,2] <= 0) [1] 6.36e-05

With the informative prior reflecting the subjective belief that effects are all weak, we conclude that β1 is positive but small, with 95% posterior interval (0.3, 3.0). With the uninformative prior we conclude that the effect is very strong, with 95% posterior interval (3.2, 33.5), not overlapping at all with the interval based on an informative prior. 7.19

> Tennis <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Tennis.dat", header=T) > Tennis Djokovic Federer Murray Nadal Wawrinka won lost



1 1 -1 0 0 0 9 6 # Djokovic won 9, lost 6 vs Federer 2 1 0 -1 0 0 14 3 3 1 0 0 -1 0 9 2 4 1 0 0 0 -1 4 3 5 0 1 -1 0 0 5 0 # Feder. beats Mur. in all 5 matches 6 0 1 0 -1 0 5 1 7 0 1 0 0 -1 7 2 8 0 0 1 -1 0 2 4 9 0 0 1 0 -1 2 2 10 0 0 0 1 -1 4 3 > fit <- glm(won/(won+lost) ~ -1 + Djokovic + Federer + Murray + Nadal + + Wawrinka, family=binomial, weights=won+lost, data=Tennis) > summary(fit) Estimate Std. Error z value Pr(>|z|) Djokovic 1.1761 0.4995 2.354 0.0185 Federer 1.1358 0.5109 2.223 0.0262 Murray -0.5685 0.5683 -1.000 0.3172 Nadal -0.0618 0.5149 -0.120 0.9044 Wawrinka NA NA NA NA --Null deviance: 25.8960 on 10 degrees of freedom Residual deviance: 4.3958 on 6 degrees of freedom

Federer beats Murray all 5 times that they met, for a sample proportion of 1.00. Fitting the model provides smoothing of the sample proportions and gives estimated odds exp(1.1358 − (−0.5685)) = exp(1.704) = 5.50 and estimated probability of a Federer win of exp(1.704)/[1 + exp(1.704)] = 0.85. 7.20 (a)

> Crabs <- read.table("http://stat4ds.rwth-aachen.de/data/Crabs.dat", header=T) > fit <- glm(y ~ weight + factor(color), family=binomial, data=Crabs) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -3.2572 1.1985 -2.718 0.00657 weight 1.6928 0.3888 4.354 1.34e-05 factor(color)2 0.1448 0.7365 0.197 0.84410 factor(color)3 -0.1861 0.7750 -0.240 0.81019 factor(color)4 -1.2694 0.8488 -1.495 0.13479 --Null deviance: 225.76 on 172 degrees of freedom Residual deviance: 188.54 on 168 degrees of freedom AIC: 198.54 > library(car) > Anova(fit) Analysis of Deviance Table (Type II tests) LR Chisq Df Pr(>Chisq) weight 23.5186 1 1.237e-06 factor(color) 7.1949 3 0.06594 . > confint(fit) 2.5 % 97.5 % weight 0.9668016 2.4985572

Adjusting for weight, the evidence of a color effect is weak, with P -value of 0.066. The 95% profile likelihood CI for the weight effect indicated that adjusting for color, for a 1 kg increase in weight, the odds of at least one satellite are estimated to multiply by between e0.9668 = 2.63 and e2.4986 = 12.16. (b)

> fit2 <- glm(y ~ weight * factor(color), family=binomial, data=Crabs) > summary(fit2) Estimate Std. Error z value Pr(>|z|) (Intercept) -1.6203 4.8909 -0.331 0.740 weight 1.0483 1.8929 0.554 0.580 factor(color)2 -0.8320 5.0311 -0.165 0.869


factor(color)3 -6.2964 5.5165 -1.141 0.254 factor(color)4 0.4335 5.4046 0.080 0.936 weight:factor(color)2 0.3613 1.9559 0.185 0.853 weight:factor(color)3 2.7065 2.2284 1.215 0.225 weight:factor(color)4 -0.8536 2.1551 -0.396 0.692 Null deviance: 225.76 on 172 degrees of freedom Residual deviance: 181.66 on 165 degrees of freedom AIC: 197.66 > Anova(fit2) LR Chisq Df Pr(>Chisq) weight 23.5186 1 1.237e-06 factor(color) 7.1949 3 0.06594 . weight:factor(color) 6.8860 3 0.07562 .

The estimated effect of weight on logit[P (Y = 1)] is 1.048 for color 1, 1.048 + 0.361 = 1.410 for color 2, 1.048 + 2.706 = 3.755 for color 3, and 1.048 − 0.854 = 0.1947 for color 4. The effects are all positive. The evidence is weak (P -value = 0.076) that the interaction model gives a better fit. (c) AIC is 197.7 for the interaction model and 198.5 for the main effects model. > AIC(glm(y ~ factor(color), family=binomial, data=Crabs)) [1] 220.0608 > AIC(glm(y ~ weight, family=binomial, data=Crabs)) [1] 199.7371 > AIC(glm(y ~ 1, family=binomial, data=Crabs)) [1] 227.7585

AIC is similar (within a value of 2) for the ordinary main effects model with color as a factor, the model permitting interaction, and the model having only weight as an explanatory variable. They all seem to be much better than the model using only color or the null model. 7.21 (a) After constructing the scatterplot, we can add the case numbers as point labels to identify the unusual observations: > Crabs <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Crabs.dat", header=T) > plot(Crabs$weight,Crabs$width) > text(Crabs$weight, Crabs$width, as.numeric(rownames(Crabs)), cex=0.70, + pos=1,col="red") > Crabs[141,] crab sat y weight width color spine 141 141 7 1 5.2 33.5 2 1 > fit1 <- glm(sat ~ weight + factor(color), family=poisson(link=log), + data=Crabs[-141,]) > summary(fit1) Estimate Std. Error z value Pr(>|z|) (Intercept) -0.36635 0.27389 -1.338 0.1810 weight 0.66286 0.08587 7.719 1.17e-14 *** factor(color)2 -0.18138 0.15388 -1.179 0.2385 factor(color)3 -0.42275 0.17600 -2.402 0.0163 * factor(color)4 -0.40364 0.20946 -1.927 0.0540 .

An exceptionally heavy crab (observation 141 in the data file) weighing 5.2 kg that had 7 satellites is an outlying observation. Without it, the weight effect changes from 0.546 to 0.663. Although the weight is very high, the number of satellites is not exceptionally high, so it has some influence but not a huge amount. (e.g., for the fit of an ordinary linear model to the satellite counts as a function of weight and color factor, the Cook’s distance value for that observation is 0.007.) (b)

> fit2 <- glm(sat ~ weight + color, family=poisson(link=log), data=Crabs) > summary(fit2) Estimate Std. Error z value Pr(>|z|) (Intercept) 0.08855 0.25443 0.348 0.72783



weight 0.54588 0.06749 8.088 6.05e-16 color -0.17282 0.06155 -2.808 0.00499 --Null deviance: 632.79 on 172 degrees of freedom Residual deviance: 552.79 on 170 degrees of freedom AIC: 914.09

With color treated as a factor, the category estimates of (0, −0.205, −0.450, −0.452) suggest that the expected number of satellites decreases as the color gets darker, adjusting for weight. With color treated in a quantitative manner, we estimate that the expected number of satellites multiplies by e−0.173 = 0.84 for each increase of a category in color darkness. AIC is less for this model (914.1 compared with 917.1), suggesting that this model may better reflect the true relationship. (c)

> fit3 <- glm(sat ~ weight * color, family=poisson(link=log), data=Crabs) > summary(fit3) Estimate Std. Error z value Pr(>|z|) (Intercept) 1.9978 0.7404 2.698 0.00697 weight -0.2082 0.2814 -0.740 0.45941 color -1.0244 0.3176 -3.226 0.00126 weight:color 0.3408 0.1231 2.769 0.00563 --Null deviance: 632.79 on 172 degrees of freedom Residual deviance: 544.98 on 169 degrees of freedom AIC: 908.28

This model is better according to both statistical significance (P -value = 0.006 for the test of no interaction) and AIC (lower value). (d)

> library(MASS) > fit2a <- glm.nb(sat ~ weight + color, data=Crabs) > summary(fit2a) Estimate Std. Error z value Pr(>|z|) (Intercept) -0.3220 0.5540 -0.581 0.561 weight 0.7072 0.1612 4.387 1.15e-05 color -0.1734 0.1199 -1.445 0.148 --(Dispersion parameter for Negative Binomial(0.9555) family taken to be 1) Null deviance: 219.50 on 172 degrees of freedom Residual deviance: 196.64 on 170 degrees of freedom AIC: 754.45 > fit3a <- glm.nb(sat ~ weight + color + weight:color, data=Crabs) > summary(fit3a) Estimate Std. Error z value Pr(>|z|) (Intercept) 0.9047 1.4706 0.615 0.538 weight 0.2008 0.5822 0.345 0.730 color -0.6751 0.5889 -1.146 0.252 weight:color 0.2103 0.2404 0.875 0.382 (Dispersion parameter for Negative Binomial(0.968) family taken to be 1) Null deviance: 221.04 on 172 degrees of freedom Residual deviance: 197.17 on 169 degrees of freedom AIC: 755.66

With the negative binomial assumption, the interaction is not significant. The simple main effects model is adequate and is much better according to AIC than the Poisson models. 7.22 Starting with a model having all four explanatory variables we can proceed to stepwise model selection, based on AIC, as shown next for one of these models. > fit4 <- glm(sat ~ weight + width + factor(color) + factor(spine), + family=poisson(link=log), data=Crabs) > step(fit4, direction = "backward")


The negative binomial model with weight and color seems to be good, having AIC = 754.45; the negative binomial model with weight as the single explanatory variable could also be considered, since it has AIC = 754.6. The latter model is selected when the stepwise procedure starts from the model that includes color as a factor.
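A sketch of the corresponding stepwise search for the negative binomial model (assuming the MASS library and the Crabs data frame loaded above); stepAIC() plays the role of step() for glm.nb objects:

> library(MASS)
> fit4.nb <- glm.nb(sat ~ weight + width + factor(color) + factor(spine), data=Crabs)
> stepAIC(fit4.nb, direction="backward")   # backward elimination by AIC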

7.23 No, because the sample mean and variance should not be dramatically different. (They are equal for the Poisson distribution.) Also for the Poisson, the mode is the integer part of the mean, so with a mean of 5.9 for males, the modal response would be 5, not 0. 7.24 (a)

> Cancer <-read.table("http://stat4ds.rwth-aachen.de/data/Cancer2.dat",header=T) > Cancer time histology stage count risktime 1 1 1 1 9 157 # 63 contingency table cells ... 63 7 3 3 3 10 > logrisktime = log(Cancer$risktime) > fit <- glm(count ~ factor(histology) + factor(stage) + factor(time), + family=poisson, offset=logrisktime, data=Cancer) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -3.0093 0.1665 -18.073 <2e-16 factor(histology)2 0.1624 0.1219 1.332 0.1828 factor(histology)3 0.1075 0.1474 0.729 0.4658 factor(stage)2 0.4700 0.1744 2.694 0.0070 factor(stage)3 1.3243 0.1520 8.709 <2e-16 factor(time)2 -0.1274 0.1491 -0.855 0.3926 # showing 2 of 6 ... # time effects factor(time)7 -0.1752 0.2498 -0.701 0.4832 --Null deviance: 175.718 on 62 degrees of freedom Residual deviance: 43.923 on 52 degrees of freedom > library(car) > Anova(fit) # likelihood-ratio tests of effects, adjusting for the others LR Chisq Df Pr(>Chisq) factor(histology) 1.876 2 0.39132 factor(stage) 99.155 2 < 2e-16 factor(time) 11.383 6 0.07724

The estimates β1S = 0, β2S = 0.470, and β3S = 1.324 show the progressively worsening death rate as the stage of disease advances. The estimated death rate at stage 3 is exp(1.324) = 3.76 times that at stage 1, adjusting for follow-up time and histology. The likelihood-ratio test of H0 : no stage of disease effect, adjusting for histology and time, has test statistic 99.15 (df = 2) and P -value < 0.0001. (b) Only through how they contribute to the time at risk, on which the death rate is based, not through how many subjects the study actually has. 7.25 Factors that cause heterogeneity, such as changing weather from day to day, whether the day is on a holiday or weekend, time of year. 7.26

> y <- c(33, 29, 29, 12, 17, 21, 31, 28, 19, 14, 11, 26, 23) > c(mean(y), var(y)) [1] 22.53846 55.76923 > fit <- glm(y ~ 1, family=poisson) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) 3.11522 0.05842 53.32 <2e-16 --Null deviance: 31.392 on 12 degrees of freedom Residual deviance: 31.392 on 12 degrees of freedom



AIC: 97.129 > fit.nb <- glm.nb(y ~ 1, link=log) # requires library(MASS) > summary(fit.nb) Estimate Std. Error z value Pr(>|z|) (Intercept) 3.11522 0.09153 34.03 <2e-16 --(Dispersion parameter for Negative Binomial(15.4944) family taken to be 1) Null deviance: 13.363 on 12 degrees of freedom Residual deviance: 13.363 on 12 degrees of freedom AIC: 92.608

The sample variance being much larger than the mean suggests the Poisson model may be inappropriate. The estimate 3.115 of the log of the mean is the log of the sample mean, and has se = 0.058 for the Poisson model and se = 0.092 for the negative binomial model, which is larger but probably more reliable because of the overdispersion. AIC also selects the negative binomial models as better reflecting reality. 7.27

> Endo <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Endometrial.dat", header=T) > fit <- glm(HG ~ NV + PI + EH, family=binomial, data=Endo) > summary(fit) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 4.30452 1.63730 2.629 0.008563 NV 18.18556 1715.75089 0.011 0.991543 PI -0.04218 0.04433 -0.952 0.341333 EH -2.90261 0.84555 -3.433 0.000597 --> attach(Endo) > x <- cbind(NV, PI, EH) > library(glmnet) > fit.lasso <- glmnet(x, HG, alpha=1, family="binomial") > plot(fit.lasso, "lambda") > set.seed(1) > cv <- cv.glmnet(x, HG, alpha=1, family="binomial", type.measure="class") cv$lambda.min [1] 0.02381343 cv$lambda.1se [1] 0.06626227 > coef(glmnet(x,HG,alpha=1,family="binomial",lambda=0.06626227)) s0 (Intercept) 1.757685 NV 1.567396 PI . EH -1.585875 > coef(glmnet(x,HG,alpha=1,family="binomial",lambda=0.02381343)) s0 (Intercept) 2.724064453 NV 2.514027238 PI -0.003434003 EH -2.240841166

The NV effect is estimated to be ∞ with the classical approach, 18.33 with the uninformative Bayesian approach using prior standard deviations σ = 100, 2.93 with the penalized likelihood approach, 2.51 with the lasso using lowest sample mean prediction error, and 1.57 using the one-standard-error rule. 7.28 (a)

> Students <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Students.dat", + header=TRUE) > fit <- glm(veg ~ gender + age + hsgpa + cogpa + dhome + dres + tv + sport + + news + aids + ideol + relig + affirm + abor, + family=binomial, data=Students) > summary(fit)


Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 5.541e+00 2.902e+03 0.002 0.9985 gender -1.844e+00 1.641e+00 -1.124 0.2612 age 4.772e-02 7.677e-02 0.622 0.5342 hsgpa -3.839e+00 2.885e+00 -1.331 0.1833 cogpa -2.937e+00 2.269e+00 -1.294 0.1956 dhome -1.123e-03 7.130e-04 -1.575 0.1153 dres 4.552e-01 2.562e-01 1.776 0.0757 . tv -2.897e-02 1.289e-01 -0.225 0.8221 sport -5.885e-01 3.945e-01 -1.492 0.1358 news 1.344e-01 2.928e-01 0.459 0.6462 aids 2.069e-01 2.541e-01 0.814 0.4155 ideol -2.458e+00 1.469e+00 -1.673 0.0942 . relig 1.636e+00 1.071e+00 1.528 0.1266 affirm 2.355e+01 2.902e+03 0.008 0.9935 abor -3.573e+00 3.248e+00 -1.100 0.2713 --Null deviance: 50.725 on 59 degrees of freedom Residual deviance: 26.645 on 45 degrees of freedom > 1 - pchisq(fit$null.deviance-fit$deviance, fit$df.null-fit$df.residual) [1] 0.04481806

(b)

> attach(Students) > x <- cbind(gender, age, hsgpa, cogpa, dhome, dres, tv, sport, news, aids, + abor, ideol, relig, affirm) > library(glmnet) > fit.lasso <- glmnet(x, veg, alpha=1, family="binomial") > plot(fit.lasso, "lambda") > set.seed(1) > cv.glmnet(x, veg, alpha=1, family="binomial", type.measure="class") Lambda Measure SE Nonzero min 0.09432 0.15 0.0631 0 1se 0.09432 0.15 0.0631 0 > coef(glmnet(x, veg, alpha=1, family="binomial", lambda=0.09432)) s0 (Intercept) -1.734601 # lasso fit has only an intercept (the null model)! gender 0.000000 age . hsgpa . cogpa . dhome . dres . tv . sport . news . aids . abor . ideol . relig . affirm .

(c) By trial and error (see also the plot), λ ≥ 0.094 (i.e., log(λ) ≥ −2.361) results in the null model. 7.29

> Happy <- read.table("http://stat4ds.rwth-aachen.de/data/Happy.dat", header=TRUE) > y <- I(Happy$happiness == 1) > fit <- glm(y~factor(marital)+factor(gender), family=binomial(link=logit), data=Happy) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -0.23770 0.07866 -3.022 0.00251 factor(marital)2 -1.16768 0.13270 -8.800 < 2e-16 factor(marital)3 -1.20853 0.11830 -10.216 < 2e-16 factor(gender)male -0.06555 0.09812 -0.668 0.50410


--Null deviance: 2626.0 on 2141 degrees of freedom Residual deviance: 2472.7 on 2138 degrees of freedom

In summary, there is no significant difference between females and males, but both genders tend to be happier when married than in the other two marital-status categories. 7.30

> race <- rep(c("white","black"), each=2); azt <- rep(c("yes","no"), 2) > yes <- c(14, 32, 11, 12); no <- c(93, 81, 52, 43) > AIDS <- data.frame(race, azt, yes, no); AIDS race azt yes no # yes and no are categories of y = AIDS symptoms 1 white yes 14 93 2 white no 32 81 3 black yes 11 52 4 black no 12 43 > fit <- glm(yes/(yes+no) ~ azt + race, weights=yes+no, family=binomial, data=AIDS) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -1.07357 0.26294 -4.083 4.45e-05 aztyes -0.71946 0.27898 -2.579 0.00991 racewhite 0.05548 0.28861 0.192 0.84755 --Null deviance: 8.3499 on 3 degrees of freedom Residual deviance: 1.3835 on 1 degrees of freedom

The estimated odds of developing symptoms for those receiving AZT were e−0.71946 = 0.49 times the odds for those not receiving AZT. The Wald test of H0 : no effect of AZT has a test statistic of −2.58 and P -value of 0.0099 for the two-sided alternative. AZT seems to have a true negative effect on developing AIDS symptoms. 7.31 E[g(Y )] ≠ g[E(Y )] when g is nonlinear; for instance, when g is concave, such as with g(y) = log(y), E[g(Y )] ≤ g[E(Y )] by Jensen’s inequality. With the GLM, one can recover information about E(Y ) by applying g −1 to the linear predictor, such as in exponentiating β̂j to get a multiplicative effect on E(Y ) in loglinear models. 7.32 The parameter space for M0 is contained in that for M1 , in the sense that some parameters in M1 are forced to take value 0 in M0 . Maximizing over a larger parameter space can yield a larger maximum value. At the worst, over the larger space, the maximum would be the same as over the smaller space. Thus, the log-likelihood functions satisfy L(µ̂0 ; y) ≤ L(µ̂1 ; y). 7.33 A difference of probabilities equal to d between the treatments for each lab does not imply that the difference of logits between the treatments is the same for each lab. For instance, consider the probabilities of success 0.9 for treatment A and 0.8 for treatment B in one lab and probabilities 0.6 and 0.5 in the other lab. The difference between the probabilities is d = 0.1 in each lab. The difference between the logits is logit(0.9) − logit(0.8) = 0.81, and the difference logit(0.6) − logit(0.5) = 0.41. Likewise if the logit main effects model holds, the identity-link main effects model does not. 7.34 (a) Let ρ = P(Y=1). By Bayes Theorem, P (Y = 1 ∣ x)

= ρ exp[−(x − µ1)²/2σ²] / {ρ exp[−(x − µ1)²/2σ²] + (1 − ρ) exp[−(x − µ0)²/2σ²]}

= 1 / (1 + [(1 − ρ)/ρ] exp{−[µ0² − µ1² + 2x(µ1 − µ0)]/2σ²})

= 1 / {1 + exp[−(β0 + β1 x)]} = exp(β0 + β1 x) / [1 + exp(β0 + β1 x)],

where β1 = (µ1 − µ0)/σ² and β0 = − log[(1 − ρ)/ρ] + (µ0² − µ1²)/2σ². (b)

> x <- c(rnorm(100000, 161.7, 7), rnorm(100000, 175.4, 7)) > y <- c(rep(0, 100000), rep(1, 100000)) > fit <- glm(y ~ x, family=binomial) > summary(fit) Estimate Std. Error (Intercept) -47.501495 0.206217 x 0.281824 0.001223 --> beta1 <- (175.4 - 161.7)/7^2; beta1 [1] 0.2795918

The theoretical value is 0.2796, and the simulation gives the very similar value of 0.2818. 7.35 ∂ π̂i /∂xij

= {β̂j exp(β̂0 + β̂1 xi1 + ⋯ + β̂p xip)[1 + exp(β̂0 + β̂1 xi1 + ⋯ + β̂p xip)] − β̂j [exp(β̂0 + β̂1 xi1 + ⋯ + β̂p xip)]²} / [1 + exp(β̂0 + β̂1 xi1 + ⋯ + β̂p xip)]²

= β̂j exp(β̂0 + β̂1 xi1 + ⋯ + β̂p xip) / [1 + exp(β̂0 + β̂1 xi1 + ⋯ + β̂p xip)]² = β̂j [π̂i(1 − π̂i)],

so that

(1/n) ∑ni=1 (∂π̂i/∂xij) = β̂j (1/n) ∑ni=1 [π̂i(1 − π̂i)].

Since π̂i(1 − π̂i) ≤ 0.25 for every i, β̂j/4 is an upper bound for the size of the effect. 7.36 The likelihood function is ℓ = ∏ki=1 (ni choose yi) πi^yi (1 − πi)^(ni−yi), with πi in equation (7.3). The saturated model has fitted probability π̃i = yi/ni (the sample proportion). Comparing to the fitted probabilities {π̂i} for the chosen model, the deviance D(y; π̂) is

2[L(π̃; y) − L(π̂; y)] = 2{log[∏ki=1 π̃i^yi (1 − π̃i)^(ni−yi)] − log[∏ki=1 π̂i^yi (1 − π̂i)^(ni−yi)]}

= 2 ∑ki=1 yi log[yi/(ni π̂i)] + 2 ∑ki=1 (ni − yi) log[(ni − yi)/(ni − ni π̂i)].
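To make the formula concrete, here is a minimal sketch of evaluating this deviance directly and comparing it with the value that glm() reports; it reuses the grouped AIDS-symptoms data of Exercise 7.30 below, and the object names are only illustrative.

> yes <- c(14, 32, 11, 12); no <- c(93, 81, 52, 43)   # grouped binomial counts (Exercise 7.30)
> azt <- rep(c("yes","no"), 2); race <- rep(c("white","black"), each=2); n <- yes + no
> fit <- glm(yes/n ~ azt + race, weights=n, family=binomial)
> pihat <- fitted(fit)   # model-fitted probabilities for the k = 4 binomial observations
> D <- 2*sum(yes*log(yes/(n*pihat))) + 2*sum(no*log(no/(n - n*pihat)))
> c(D, deviance(fit))    # the two values should agree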

7.37 (a) When n is extremely large, an effect can easily be statistically significant without being practically significant (Section 5.6.2). It is helpful to summarize the closeness of a model fit to the sample data in a way that, unlike a test statistic or a measure like the deviance, is not affected by n. The value of D helps indicate whether the lack of fit is important in a practical sense. A very small D value suggests that the sample data follow the model pattern closely, even though the model is not perfect. If D for a complex model is only slightly larger than for a simpler model, the simpler model may be adequate for describing most or all of the data. (b) > Employ <- read.table("http://stat4ds.rwth-aachen.de/data/Employment.dat", + header=TRUE) # 3way partial table of observed frequencies for Employ=1: > obs1 <- xtabs(employed ~ female + italian + pension, data=Employ) # 3way partial table of observed frequencies for Employ=0: > obs2 <- xtabs( ~ female + italian + pension, data=Employ) - obs1



> obs1+obs2 # obs. freq. table merged over "employed" (left part of Table 7.1) > fit <- glm(employed ~ female + italian + pension, family = binomial, + data = Employ) > fitted1 <- xtabs(fit$fitted.values ~ female + italian + pension, data=Employ) # 3way table of fitted frequencies for Employ=1 > fitted2 <- xtabs( ~ female + italian + pension, data=Employ) - fitted1 # 3way table of fitted frequencies for Employ=0 > D <- (sum(abs(obs1-fitted1))+sum(abs(obs2-fitted2))) / + (2*(sum(obs1)+sum(obs2))); D [1] 0.00598033 > fit.int <- glm(employed ~ female + italian + pension + italian:pension, + family = binomial, data = Employ) > fitted.int1 <- xtabs(fit.int$fitted.values ~ female + italian + pension, + data=Employ) > fitted.int2 <- xtabs( ~ female+italian+pension, data=Employ) - fitted.int1 > D.int <- (sum(abs(obs1-fitted.int1))+sum(abs(obs2-fitted.int2))) / + (2*(sum(obs1)+sum(obs2))); D.int [1] 0.00558761 > library(car) > Anova(fit.int) LR Chisq Df Pr(>Chisq) female 1625.5 1 < 2.2e-16 italian 966.4 1 < 2.2e-16 pension 5010.0 1 < 2.2e-16 italian:pension 27.4 1 1.649e-07 > summary(fit.int) Estimate Std. Error z value Pr(>|z|) (Intercept) 0.33899 0.02246 15.096 < 2e-16 female -0.64471 0.01614 -39.953 < 2e-16 italian 0.71552 0.02262 31.637 < 2e-16 pension -0.98404 0.16046 -6.132 8.65e-10 italian:pension -0.91266 0.16307 -5.597 2.19e-08

The interaction term is highly statistically significant (the difference of deviances between the main effects model and interaction model is 27.4 with df = 1), but n is huge for these data (n = 72, 200). The dissimilarity values of 0.0060 and 0.0056 show that for either model less than 1% of the data would need to be moved to achieve a perfect fit. This suggests that both models fit most of the data very well (the exception being the poor fit of the main effects model for the non-Italians with a pension), and the more complex interaction model does not fit much better in a practical sense. √ 7.38 var(Yi ) = E(Yi2 ) − [E(Yi )]2 = π − π 2 = π(1 − π), and cov(Yi , Yj ) = ρ var(Yi )var(Yj ) = ρπ(1 − π), and n

var(∑ni=1 Yi) = [∑ni=1 var(Yi) + 2 ∑∑i<j cov(Yi, Yj)]

= [nπ(1 − π) + n(n − 1)ρπ(1 − π)] = n[1 + ρ(n − 1)]π(1 − π). Overdispersion occurs when ρ > 0. With n = 1, var( ∑ni=1 Yi ) = nπ(1 − π), the ordinary binomial variance. 7.39 Consider sampling married couples and observing Y = number with a college education, with possible values 0, 1, and 2. Perhaps π follows a beta distribution with α < 1 and β < 1, which has modes at 0 and 1, and then we observe mainly Y = 0 or Y = 2, since there are relatively fewer cases for which just one of the partners has college education. This is not possible with an ordinary binomial distribution, which is unimodal. 7.40 E(Y ) = ∫ ∫ yf (x, y)dxdy = ∫ [∫ yf (y ∣ x)dy]f1 (x)dx = ∫ [E(Y ∣ X = x)]f1 (x)dx = E[E(Y ∣ X)].
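The mixture representation behind the next two solutions can be checked by simulation. The sketch below (with arbitrary illustrative values of µ and k) draws λ from a gamma distribution with mean µ and then Y from a Poisson distribution with mean λ, so that E(Y) = E[E(Y ∣ λ)] = µ and, as derived in Exercise 7.41, var(Y) = µ + µ²/k.

> set.seed(1)
> mu <- 4; k <- 2                                # illustrative values
> lambda <- rgamma(10^6, shape=k, rate=k/mu)     # E(lambda) = mu, var(lambda) = mu^2/k
> y <- rpois(10^6, lambda)                       # Y | lambda ~ Poisson(lambda)
> c(mean(y), mu)                                 # law of iterated expectation
> c(var(y), mu + mu^2/k)                         # negative binomial variance formula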



7.41 When λ has the gamma distribution (7.6), E(λ) = µ and var(λ) = µ²/k. Therefore, E(Y) = E[E(Y ∣ λ)] = E(λ) = µ and var(Y) = E[var(Y ∣ λ)] + var[E(Y ∣ λ)] = E(λ) + var(λ) = µ + µ²/k. These are the expressions in Section 7.5.2 for the mean and variance of a negative binomial distribution. The variance exceeds the mean, unlike the Poisson distribution for which they are identical. 7.42 No, the Poisson distribution does not allow for the overdispersion that exists, so the se values are badly biased in being too low. The negative binomial model is more realistic, permitting overdispersion, so we give more credence to se values for that model. 7.43 The model has a parameter π attached to the distribution with P(Y = 0) = 1 and parameter (1 − π) attached to a discrete distribution on the nonnegative integers such as the Poisson or negative binomial. Examples: The number of times that you donated money to a religious institution in the past year; the number of times that you ate out at a restaurant in the past week; the number of times you went to a theatre to see a movie in the past year; the number of times you had sex in the past month. 7.44 The probability distribution of Y = the number of successes before failure number k is

f(y; k, π) = (y + k − 1 choose y) π^y (1 − π)^k,  y = 0, 1, 2, . . . .

The negative binomial distribution (7.7) is

p(y; µ, k) = {Γ(y + k)/[Γ(k)Γ(y + 1)]} [µ/(µ + k)]^y [k/(µ + k)]^k,  y = 0, 1, 2, . . . .

Now, since (y + k − 1 choose y) = Γ(y + k)/[Γ(y + 1)Γ(k)], substituting π = µ/(µ + k) and (1 − π) = k/(µ + k) in the negative binomial formula, we see they are the same. 7.45 Because the variability is not constant, residuals would not tend to have similar magnitudes over all values of explanatory variables. We see this with the Houses data modeled in Section 7.1.3, where an observation that is an outlier for an ordinary linear model and has a large Cook’s distance is not unusual or influential for a gamma model. 7.46 (a) Multiplying the Poisson likelihood function (4.5), with the Poisson mean parameter µ replaced by λ, by the gamma pdf for the prior distribution (2.10), with y replaced by λ (i.e., f(λ; k, µ) = [(k/µ)^k/Γ(k)] e^(−kλ/µ) λ^(k−1)), we get

g(λ ∣ y) ∝ [e^(−nλ) λ^(∑ni=1 yi) / ∏ni=1 yi!] × [(k/µ)^k/Γ(k)] e^(−kλ/µ) λ^(k−1).

As a function of λ, after observing the data, this is proportional to λ^(∑ni=1 yi + k − 1) e^(−(n + k/µ)λ). This has the gamma form (2.10) when we take the shape parameter to be k′ = ∑ni=1 yi + k and the rate parameter (i.e., reciprocal of scale parameter) to be λ′ = n + k/µ. (b) From equation (2.11), the posterior gamma distribution has mean equal to the ratio of the shape and rate parameters,

k′/λ′ = (∑ni=1 yi + k)/(n + k/µ) = [n/(n + k/µ)]ȳ + [(k/µ)/(n + k/µ)]µ.



This is a weighted average of the sample mean y and the prior mean µ. As n increases, the weight given to the sample mean converges up toward 1. 7.47 Since the ML estimate β̂ satisfies L′ (β̂) = 0, dropping higher-order terms in the expansion yields 0 ≈ L′ (β (0) ) + (β̂ − β (0) )L′′ (β (0) ). Solving for β̂ yields the approximation β̂ (1) = β (0) − L′ (β (0) )/L′′ (β (0) ), which we use as β (1) , the next approximation for β̂. Using the same argument, replacing β (0) by β (t) , we get the general expression. 7.48 Here is a simple program to do 25 iterations, starting with initial estimates of 0 for the elements of β, without including any convergence criteria: > Beetles <- read.table("http://stat4ds.rwth-aachen.de/data/Beetles_ungrouped.dat", + header=TRUE) > y <- Beetles[,2] > X <- cbind(rep(1,481), Beetles[,1]) # construction of model matrix > beta <- matrix(0, 2, 1) > for(k in 1:25){ > pi <- exp(X%*%beta)/(1 + exp(X%*%beta)) > D <- diag(pi[1:481]*(1 - pi[1:481])) > beta <- beta + (solve(t(X)%*%D%*%X))%*%t(X)%*%(y - pi) > print(beta) > } [,1] # iteration 1 [1,] -37.85529 [2,] 21.33745 [,1] [1,] -53.85409 # iteration 2 [2,] 30.38601 [,1] [1,] -59.98190 # iteration 3 [2,] 33.85631 [,1] [1,] -60.73029 # iteration 4 [2,] 34.28035 [,1] [1,] -60.74013 # iteration 5 (convergence to 5 decimal places) [2,] 34.28593 ... [,1] [1,] -60.74013 # iteration 25 (same to 5 decimal places as every [2,] 34.28593 # iteration beginning with iteration 5)
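As a check on these iterations, the same ML estimates should be obtained from R's built-in fitting routine; a minimal sketch, assuming the Beetles data frame read above (column 1 the dose, column 2 the binary response):

> fit <- glm(Beetles[,2] ~ Beetles[,1], family=binomial)
> coef(fit)   # should agree with the converged values -60.74013 and 34.28593 above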

7.49 (a) The log-likelihood function involves the data together with the parameters only through these terms. (b) Section 7.6.2 showed that

∂L(β)/∂βj = ∑ni=1 yi xij − ∑ni=1 xij exp(∑pk=0 βk xik)/[1 + exp(∑pk=0 βk xik)].

The likelihood equations result from setting ∂L(β)/∂βj = 0 for j = 0, 1, . . . , p, so based on the expression (7.3) for πi, the likelihood equations have the form

∑ni=1 yi xij − ∑ni=1 πi xij = 0,  j = 0, 1, . . . , p.


For each j, E(∑ni=1 Yi xij) = ∑ni=1 πi xij, so the equations equate the sufficient statistics to their expected values. (c) The likelihood equation with xi0 = 1 for all i simplifies to ∑ni=1 yi = nπ, so π̂ = ȳ, the overall sample proportion.

7.50 When the explanatory variable takes only two values (x = 0 or 1), the model for P(Y = 1) simplifies to exp(β0 + β1)/[1 + exp(β0 + β1)] at x = 1 and to exp(β0)/[1 + exp(β0)] at x = 0. Note that the odds ratio is

[P(Y = 1 ∣ X = 1)/P(Y = 0 ∣ X = 1)] / [P(Y = 1 ∣ X = 0)/P(Y = 0 ∣ X = 0)] = exp(β1),

so β1 represents the log odds ratio. For y1 successes in n1 trials when x = 1 and y2 successes in n2 trials when x = 0, by directly taking the log of the likelihood (see L(β) in Section 7.6.2) we get L = (y1 + y2)β0 + y1 β1 − {n1 log[1 + exp(β0 + β1)] + n2 log[1 + exp(β0)]}. The likelihood equations are

∂L/∂β1 = y1 − n1 exp(β0 + β1)/[1 + exp(β0 + β1)] = 0,

∂L/∂β0 = (y1 + y2) − n1 exp(β0 + β1)/[1 + exp(β0 + β1)] − n2 exp(β0)/[1 + exp(β0)] = 0.

Subtracting gives y2 = n2 exp(β0)/[1 + exp(β0)], or β̂0 = log[y2/(n2 − y2)]. The first equation gives β̂0 + β̂1 = log[y1/(n1 − y1)], so that β̂1 = log{[y1/(n1 − y1)]/[y2/(n2 − y2)]}, which is the sample log odds ratio. Or, the likelihood equations are

y2 = n2 e^β0/(1 + e^β0),  y1 = n1 e^(β0+β1)/(1 + e^(β0+β1)),

equating sufficient statistics to their expected values. Solving the first gives β̂0 = logit(y2 /n2 ). Solving the second gives β̂0 + β̂1 = logit(y1 /n1 ), from which β̂1 is the log odds ratio. 7.51 (a)

L(µ̂; y) = log[∏ni=1 e^(−µ̂i) (µ̂i)^yi / yi!] = ∑ni=1 [yi log(µ̂i) − µ̂i − log(yi!)].

(b) Since L(y; y) substitutes yi for µ̂i, the deviance equals

D(y; µ̂) = 2[L(y; y) − L(µ̂; y)] = 2 ∑ni=1 [yi log(yi/µ̂i) − yi + µ̂i].

(c) When a model with log link contains an intercept term, the likelihood equation implied by that parameter is ∑i yi = ∑i µ̂i and the deviance simplifies to

D(y; µ̂) = 2 ∑ni=1 yi log(yi/µ̂i).
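A quick numerical sketch of this simplification, reusing the intercept-only Poisson fit to the counts of Exercise 7.26 above (the fit contains an intercept, so the simplified form applies):

> y <- c(33, 29, 29, 12, 17, 21, 31, 28, 19, 14, 11, 26, 23)
> fit <- glm(y ~ 1, family=poisson)
> c(2*sum(y*log(y/fitted(fit))), deviance(fit))   # both should equal 31.392, as reported in 7.26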

7.52 (a) The log-likelihood involves the data together with the model parameters only through the sufficient statistics ∑ni=1 yi xij , j = 0, 1, . . . , p.



(b) Setting equation (7.13) equal to 0, the likelihood equations are

∑ni=1 yi xij = ∑ni=1 xij exp(∑pk=0 βk xik). Since µi = exp(∑pk=0 βk xik), these equations thus have the form

∑ni=1 yi xij = ∑ni=1 µi xij,  j = 0, 1, . . . , p.

(c) For the null model, the likelihood equation simplifies to ∑ni=1 yi = nµ, so µ̂ = ȳ and β̂0 = log(ȳ). 7.53 ∂µ̂i/∂xij = β̂j exp(∑k β̂k xik) = β̂j µ̂i, which sum over i to β̂j (∑i µ̂i). The likelihood equation for j = 0 in terms of the ML estimates is ∑i yi = ∑i µ̂i, so (1/n) ∑i (∂µ̂i/∂xij) = (1/n) β̂j (∑i yi) = β̂j ȳ. 7.54 For the expression for a Poisson loglinear model that var(β̂) = (X^T Diag[µ̂] X)^(−1), the matrix X is a column vector with n elements of 1, so (X^T Diag[µ̂] X)^(−1) = (∑ni=1 µ̂i)^(−1) = 1/(nµ̂), and since µ̂ = ȳ, the standard error is √[1/(nȳ)].
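This standard error can be verified numerically; a short sketch reusing the counts from Exercise 7.26, for which the Poisson model's reported se of the intercept was 0.058:

> y <- c(33, 29, 29, 12, 17, 21, 31, 28, 19, 14, 11, 26, 23)
> fit <- glm(y ~ 1, family=poisson)
> c(sqrt(1/(length(y)*mean(y))), summary(fit)$coefficients[1, "Std. Error"])   # both about 0.058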

7.55 (a) As n increases, X has more rows and you sum more positive terms to get the diagonal elements of X^T D̂ X, so the values tend to be smaller in the inverse matrix that gives the variances (and through their square roots, the standard errors) of the estimates. (b) As {π̂i} tend to fall close to 0 or close to 1, the main diagonal elements of X^T D̂ X tend to be smaller, and thus tend to be larger in the inverse matrix that gives the variances of the estimates. 7.56 (a)

log[P(Yi = a)/P(Yi = b)] = log[P(Yi = a)/P(Yi = c)] − log[P(Yi = b)/P(Yi = c)]

= (βa0 + βa1 xi1 + ⋯ + βap xip) − (βb0 + βb1 xi1 + ⋯ + βbp xip)

= (βa0 − βb0) + (βa1 − βb1)xi1 + ⋯ + (βap − βbp)xip.

(b) (i) Model primary mode of transportation to work (automobile, bicycle, bus, subway, walk) using explanatory variables age, annual income, gender, race, attained education, whether one lives in city; (ii) Model favorite type of music (classical, country, folk, jazz, pop, rap/hip-hop, rock) using explanatory variables age, gender, annual income, attained education, race. 7.57 (a)

> x <- seq(-10,5,0.1); prob <- function(c){exp(c+x)/(1+exp(c+x))} > plot(x,prob(4), type="l", lwd=2, col="red", ylab="probability") > lines(x,prob(3), lwd=2, col="blue"); lines(x, lwd=2, prob(1)) > abline(v = -3, lty=2)

The three S-shaped curves have the same shape and do not cross anywhere. The curve for P (Yi ≤ 1) is below the curve for P (Yi ≤ 2), which is below the curve for P (Yi ≤ 3), reflecting that necessarily 0 ≤ P (Yi ≤ 1) ≤ P (Yi ≤ 2) ≤ P (Yi ≤ 3) ≤ 1. With different effects, the curves would cross and cumulative probabilities would be out of their proper order for some values of the explanatory variables.


(b) (i) Model happiness (not too happy, pretty happy, very happy) using explanatory variables age, annual income, gender, race, attained education, whether married, whether attend religious services regularly; (ii) Model quality of life (poor, fair, good, excellent) using explanatory variables age, gender, annual income, attained education, race.

7.58 (a) Under independence, µij = nP (X = i)P (Y = j) for all i and j, so it has the form log(µij ) = β0 + βiX + βjY with X and Y having effects for levels of each factor. (b)

> counts <- c(432, 504, 61, 92, 282, 103, 124, 409, 135) > marital <- c(1, 1, 1, 2, 2, 2, 3, 3, 3) > happiness <- c(1, 2, 3, 1, 2, 3, 1, 2, 3) > fit <- glm(counts ~ factor(marital) + factor(happiness), family=poisson) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) 5.70915 0.04560 125.201 < 2e-16 factor(marital)2 -0.73723 0.05567 -13.242 < 2e-16 factor(marital)3 -0.40046 0.05000 -8.009 1.15e-15 factor(happiness)2 0.61201 0.04879 12.545 < 2e-16 factor(happiness)3 -0.77345 0.06991 -11.063 < 2e-16 --Null deviance: 981.67 on 8 degrees of freedom Residual deviance: 205.05 on 4 degrees of freedom

The residual deviance of 205.05 with df = 4 is the likelihood-ratio statistic for testing the model, and hence testing H0 : independence of happiness and marital status. There is strong evidence of an association between happiness and marital status. 7.59 (a) Let A, C, and M be indicator variables taking values 1 = yes, 0 = no. For i, j, k = 0 and 1, let µijk = nP (A = i, C = j, M = k) be the expected frequencies for the cells. The model is µijk = β0 + β1 A + β2 C + β3 M + β4 AC + β5 AM + β6 CM. (b) The AC conditional log odds ratio at level k of M is log[(µ11k µ22k )(µ12k µ21k )] = log(µ11k ) + log(µ22k ) − log(µ12k ) − log(µ21k ), and substituting the loglinear model formula, this simplifies to β4 , the coefficient of the AC cross-product of indicator variables for A and C. (c) > Myes <- Drugs[Drugs$marijuana=="yes",]$count > Mno <- Drugs[Drugs$marijuana=="no",]$count > n <- Myes + Mno; A <- c(1, 1, 0, 0); C <- c(1, 0, 1, 0) > fit <- glm(Myes/n ~ A + C, family=binomial, weights=n) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -5.3090 0.4752 -11.172 < 2e-16 A 2.9860 0.4647 6.426 1.31e-10 C 2.8479 0.1638 17.382 < 2e-16 --Null deviance: 843.82664 on 3 degrees of freedom Residual deviance: 0.37399 on 1 degrees of freedom

The estimated conditional log odds ratios are 2.986 between M and A and 2.848 between M and C, identical to the estimates for the coefficient of the AM and CM terms in the loglinear model. 7.60 A generalized additive model (GAM) replaces the linear predictor in a generalized linear model (GLM) by additive unspecified smooth functions. Its basic version has the form link(µi ) = β0 + s1 (xi1 ) + s2 (xi2 ) + ⋯ + sp (xip ),



where each smooth function sj is typically based on cubic splines. The name additive derives from the additive structure of the predictor. GAMs have the advantage over GLMs of greater flexibility. The GLM is a special case, with sk (xk ) replaced by βk xk . In practice, it is often helpful to use both smooth and linear terms in a model. Using a graphical portrayal of a GAM fit, we may discover patterns that we would miss with ordinary GLMs, and we obtain potentially better estimates of mean responses. A disadvantage of GAMs and other smoothing methods, compared with GLMs, is that interpretability is more difficult. It can be more difficult to summarize an effect and judge when it has substantive importance.
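As an illustration of the discussion above (not part of an exercise), a minimal sketch of fitting a GAM in R with the mgcv package, using the horseshoe crab data analyzed throughout these solutions; the smooth term for weight is chosen here only for illustration, with the degree of smoothing left to the package default:

> Crabs <- read.table("http://stat4ds.rwth-aachen.de/data/Crabs.dat", header=TRUE)
> library(mgcv)
> fit.gam <- gam(y ~ s(weight) + factor(color), family=binomial, data=Crabs)
> summary(fit.gam)
> plot(fit.gam)   # graphical portrayal of the fitted smooth effect of weight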

Chapter 8 8.1 (a)

> Crabs <- read.table("http://stat4ds.rwth-aachen.de/data/Crabs.dat", header=T) > library(MASS) > fit.lda <- lda(y ~ weight + color, data=Crabs) > fit.lda Prior probabilities of groups: 0 1 0.3583815 0.6416185 Coefficients of linear discriminants: LD1 weight 1.5327300 color -0.5881721

The linear discriminant function is 1.533(weight) - 0.588(color). (b)

> fit.lda <- lda(y ~ weight + color, prior=c(0.5, 0.5), data=Crabs) > lda.predict <- predict(fit.lda)$posterior > head(cbind(lda.predict, Crabs$weight, Crabs$color, Crabs$y), 1) 0 1 1 0.2138457 0.7861543 3.05 2 1

A crab at those weight and color values has posterior probability 0.786 of having at least one satellite, and would be predicted to have satellites. (c)

> fit.lda <- lda(y ~ weight + color, prior=c(0.5, 0.5), CV=TRUE, data=Crabs) > xtabs(~Crabs$y + fit.lda$class) fit.lda$class Crabs$y 0 1 0 44 18 1 36 75

Sensitivity P̂ (ŷ = 1 ∣ y = 1) = 75/(36 + 75) = 0.676, specificity P̂ (ŷ = 0 ∣ y = 0) = 44/(44 + 18) = 0.710. The overall proportion of correct predictions is (44 + 75)/173 = 0.688, as good as obtained in Section 8.1.3 using all four explanatory variables. (d)

> fit.lda <- lda(y ~ weight + color, data=Crabs) > lda.predict <- predict(fit.lda)$posterior > library(pROC) > rocplot <- roc(y ~ lda.predict[,2], data=Crabs) > plot.roc(rocplot, legacy.axes=TRUE) > auc(rocplot) Area under the curve: 0.761 > fit.logistic <- glm(y ~ weight + color, family=binomial, data=Crabs) > rocplot2 <- roc(y ~ fitted(fit.logistic), data=Crabs) > plot.roc(rocplot2, legacy.axes=TRUE) > auc(rocplot2) Area under the curve: 0.7605


The areas under the curve of 0.761 for linear discriminant analysis and 0.7605 for logistic regression are essentially as good as the value 0.770 obtained for linear discriminant analysis using all four explanatory variables. (e)

> fit <- lm(y ~ weight + color, data=Crabs); summary(fit) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.21434 0.20244 1.059 0.291 weight 0.28464 0.05976 4.763 4.06e-06 color -0.10923 0.04300 -2.540 0.012 --> sum(I(fitted(fit) > 1)) [1] 6

The estimated probability of a satellite increases by 0.285 for a 1 kg increase in weight, adjusting for color, and decreases by 0.109 by an increase of 1 in color darkness, adjusting for weight. A disadvantage is that for sufficiently small values of color and large values of weight, P̂ (Y = 1) > 1. This happens for 6 crabs in this sample. 8.2 (a)

> x <- c(rnorm(100000, 161.7, 7), rnorm(100000, 175.4, 7)) > y <- c(rep(0, 100000), rep(1, 100000)) > library(MASS) > fit.lda <- lda(y ~ x) > fit.lda Prior probabilities of groups: 0 1 0.5 0.5 Group means: x 0 161.7027 1 175.3809 Coefficients of linear discriminants: LD1 x 0.1429862 > lda.predict <- predict(fit.lda)$posterior

Predict y = 0 (female) when posterior probability of that outcome exceeds 0.50, which occurs if x < (161.70 + 175.38)/2 = 168.54 cm, the mean height in the sample. (b)

> fit.lda <- lda(y ~ x, prior=c(0.5, 0.5), CV=TRUE) > xtabs(~y + fit.lda$class) fit.lda$class y 0 1 0 83597 16403 1 16359 83641

Proportion correctly classified is (83597+83641)/200000 = 0.836. (c)

8.3 (a)

> library(pROC) > rocplot <- roc(y ~ lda.predict[,2]) > plot.roc(rocplot, legacy.axes=TRUE) > auc(rocplot) Area under the curve: 0.9166 > Iris <- read.table("http://stat4ds.rwth-aachen.de/data/Iris.dat", header=TRUE) > cor(Iris$s_length, Iris$p_length) [1] 0.8284787 > y <- I(Iris$species =="I.versicolor") > fit.lda <- lda(y ~ s_length + p_length, data=Iris) > fit.lda Group means: s_length p_length FALSE 6.588 5.552 TRUE 5.936 4.260



Coefficients of linear discriminants: LD1 s_length 1.637937 p_length -3.152368

The linear discriminant function 1.638(sepal length) − 3.152(petal length) suggests that the posterior P (Y = 1) increases as sepal length increases and decreases as petal length increases. (b)

> fit.lda <- lda(y ~ s_length + p_length, prior=c(0.5, 0.5), CV=TRUE, data=Iris) > xtabs(~y + fit.lda$class) fit.lda$class y FALSE TRUE FALSE 47 3 TRUE 3 47

Sensitivity P̂ (ŷ = 1 ∣ y = 1) = 47/50 = 0.94, specificity P̂ (ŷ = 0 ∣ y = 0) = 47/50 = 0.94, overall proportion of correct predictions is 94/100 = 0.94. (c)

> fit.lda <- lda(y ~ s_length + p_length, data=Iris) > lda.predict <- predict(fit.lda)$posterior > library(pROC) > rocplot <- roc(y ~ lda.predict[,2], data=Iris) > plot.roc(rocplot, legacy.axes=TRUE) > auc(rocplot) Area under the curve: 0.985 > fit.logistic <- glm(y ~ p_length + s_length, family=binomial, data=Iris) > summary(fit.logistic) Estimate Std. Error z value Pr(>|z|) (Intercept) 39.839 13.089 3.044 0.002338 p_length -13.313 3.913 -3.402 0.000669 s_length 4.017 1.623 2.474 0.013348 --Null deviance: 138.63 on 99 degrees of freedom Residual deviance: 23.85 on 97 degrees of freedom AIC: 29.85 > rocplot2 <- roc(y ~ fitted(fit.logistic), data=Iris) > plot.roc(rocplot2, legacy.axes=TRUE) > auc(rocplot2) Area under the curve: 0.9914

The area under the ROC curve is close to 1 with both methods, indicating excellent performance. (d)

> Iris <- read.table("http://stat4ds.rwth-aachen.de/data/Iris.dat", header=TRUE) > y <- I(Iris$species =="I.versicolor") > library(rpart) > fit <- rpart(y ~ p_length + s_length, method="class", data=Iris) > plotcp(fit) > summary(fit) CP nsplit rel error xerror xstd 1 0.86 0 1.00 1.16 0.09871170 2 0.01 1 0.14 0.22 0.06257795 > p.fit <- prune(fit, cp=0.00) > library(rpart.plot) > rpart.plot(p.fit, extra=1, digits=4, box.palette=0)

No pruning is necessary, as using λ = 0 gives a simple tree that predicts y = 1 (species = versicolor) if petal length is at least 4.75 cm. This tree correctly predicts 49 of the 55 predicted to have y = 1 (i.e., sensitivity = 0.891) and 44 of the 45 predicted to have y = 0 (i.e., specificity = 0.978), an overall proportion correct of 93/100 = 0.93. 8.4

> library(rpart) > fit <- rpart(y ~ color + weight, method="class", data=Crabs)


> plotcp(fit) > p.fit <- prune(fit, cp=0.04) > library(rpart.plot) > rpart.plot(p.fit, extra=1, digits=4, box.palette=0)

With complexity parameter λ = 0.04, the horseshoe crabs predicted to have satellites were all crabs that have weight at least 1.925 kg. This classification tree correctly predicts (without cross-validation) the proportion (53 + 48)/111 = 0.910 of the crabs that actually had satellites and 22/62 = 0.355 of those that did not, an overall proportion correct of 123/173 = 0.711, nearly as good as the proportion 0.728 for the more complex tree. 8.5 (a) For the more highly-pruned tree, the only region predicted not to have satellites is the rectangular region with colors 3 and 4 and carapace width < 25.85 cm. (b) Linear discriminant analysis or logistic regression results in a single line passing through this space with ŷ = 1 on one side of the line and ŷ = 0 on the other. (c) Simple classification trees using solely weight or solely width or weight with color or width with color seem adequate. Results of such classification trees and of corresponding logistic regression model fits suggest that satellites are more likely for lighter-colored crabs with greater weight or greater carapace width. 8.6

> Kyphosis <- read.table("http://stat4ds.rwth-aachen.de/data/Kyphosis.dat",header=T) > library(rpart) > fit <- rpart(y ~ x, data=Kyphosis) > library(rpart.plot) > p.fit <- prune(fit, cp=0.00) > rpart.plot(p.fit, extra=1, digits=4, box.palette=0)

Kyphosis is predicted both for the very young (less than 40 months) and older children (at least 131 months), which are 20 of the 40 cases. This indicates that the relationship between x and E(Y ) may not be monotonic, which we might not realize using logistic regression modeling unless we added a quadratic term to the linear predictor. 8.7 (a) All males of age ≤ 9.5 years who had fewer then 3 siblings, and all females. (b) 0.73(36) + 0.89(2) + 0.98(2) + 0.83(61) = 80.65%, compared to 1 - [0.73(36) + 0.89(2)] = 71.94% correct if predicted that everyone died. 8.8 (a) Q1: Is the subject’s age > 70? (3157 yes, 1497 no) Q2: Is the subject’s age > 83? (931 yes, 2226 no) Q3: Does the subject have dementia? (65 yes, 2161 no) Q4: Does the subject have Parkinson’s disease? (37 yes, 2124 no) (b) For the node of 931 subjects of age > 83, of whom 112 disenrolled and 819 stayed, the misclassification cost is 819 if we predict that these 931 subjects disenroll and it is 13(112) = 1456 if we predict that these 931 subjects stay. The misclassification cost is lower if we predict that they all disenroll, so this is the prediction for this terminal node. 8.9

> Crabs <- read.table("http://stat4ds.rwth-aachen.de/data/Crabs.dat", header=TRUE) > Crabs.std <- scale(Crabs[, 4:7]) > y <- Crabs$y > Crabs2 <- data.frame(cbind(y, Crabs.std)) > train_index <- sample(nrow(Crabs), (3/4)*nrow(Crabs)) > Crabs_train <- Crabs2[train_index, ] > Crabs_test <- Crabs2[-train_index, ]



> target_cat <- Crabs2[train_index, 1] > test_cat <- Crabs2[-train_index, 1] > library(class) > pr1 <- knn(Crabs_train, Crabs_test, cl=target_cat, k=1) > table(pr1, test_cat) test_cat pr1 0 1 0 18 0 1 0 26 > pr3 <- knn(Crabs_train, Crabs_test, cl=target_cat, k=3) > table(pr3, test_cat) test_cat pr3 0 1 0 17 1 1 1 25 > pr5 <- knn(Crabs_train, Crabs_test, cl=target_cat, k=5) > table(pr5, test_cat) test_cat pr5 0 1 0 16 1 1 2 25

We get perfect prediction using only a single neighbor! Other solutions may differ slightly, because of the random selection for the training and test samples. 8.10

> Crabs <- read.table("http://stat4ds.rwth-aachen.de/data/Crabs.dat", header=TRUE) > train_index <- sample(nrow(Crabs), (3/4)*nrow(Crabs)) > Crabs_train <- Crabs[train_index, ] > Crabs_test <- Crabs[-train_index, ] > library(neuralnet) > nn <- neuralnet(y=="1" ~ weight + width + color + spine, Crabs_train, linear.output=F) > pred.nn <- predict(nn, Crabs_test) > table(Crabs_test$y == "1", pred.nn[, 1] > 0.50) FALSE TRUE FALSE 10 16 TRUE 2 16 > plot(nn, information = F) > table(Crabs_test$y == "1", pred.nn[, 1] > 0.64) FALSE TRUE FALSE 13 13 TRUE 2 16

Other solutions may differ slightly, because of the random selection for the training and test samples. 8.11

> Elections <- read.table("http://stat4ds.rwth-aachen.de/data/Elections2.dat",header=T) > x1 <- Elections[Elections$state=="Illinois",c(3:13)] > x2 <- Elections[Elections$state=="Wyoming",c(3:13)] > NJ <- data.frame("number"=15, "state"="New Jersey", x1) > PA <- data.frame("number"=16, "state"="Pennsylvania", x1) > AL <- data.frame("number"=17, "state"="Alabama", x2) > Elect2 <- rbind(Elections, NJ, PA, AL); Elect2 number state e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 e11 1 1 Arizona 0 0 0 0 1 0 0 0 0 0 1 2 2 California 0 0 0 1 1 1 1 1 1 1 1 3 3 Colorado 0 0 0 1 0 0 0 1 1 1 1 4 4 Florida 0 0 0 0 1 0 0 1 1 0 0 5 5 Illinois 0 0 0 1 1 1 1 1 1 1 1 6 6 Massachusetts 0 0 1 1 1 1 1 1 1 1 1 7 7 Minnesota 1 1 1 1 1 1 1 1 1 1 1 8 8 Missouro 0 0 0 1 1 0 0 0 0 0 0 9 9 NewMexico 0 0 0 1 1 1 0 1 1 1 1 10 10 NewYork 0 0 1 1 1 1 1 1 1 1 1


11 11 Ohio 0 0 0 1 1 0 0 1 1 0 0 12 12 Texas 0 0 0 0 0 0 0 0 0 0 0 13 13 Virginia 0 0 0 0 0 0 0 1 1 1 1 14 14 Wyoming 0 0 0 0 0 0 0 0 0 0 0 15 15 New Jersey 0 0 0 1 1 1 1 1 1 1 1 16 16 Pennsylvania 0 0 0 1 1 1 1 1 1 1 1 17 17 Alabama 0 0 0 0 0 0 0 0 0 0 0 > distances <- dist(Elect2[, 3:13], method = "manhattan") > democlust <- hclust(distances, "average") > plot(democlust, labels=Elect2$state)

New Jersey and Pennsylvania join the initial cluster with California and Illinois, and Alabama joins the initial cluster with Texas and Wyoming. In the two-cluster summary, New Jersey and Pennsylvania are Democratic-leaning and Alabama is Republicanleaning. 8.12

> Elect <- read.table("http://stat4ds.rwth-aachen.de/data/Elections.dat",header=TRUE) head(Elect, 2) state e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 e11 1 Alab 0 0 0 0 0 0 0 0 0 0 0 2 Alas 0 0 0 0 0 0 0 0 0 0 0 > distances <- dist(Elect[, 4:12], method = "manhattan") > democlust <- hclust(distances, "average") > plot(democlust, labels=Elect$state) > library(gplots) > heatmap.2(as.matrix(Elect[, 4:12]), labRow = Elect$state, dendrogram="row", Colv=FALSE)

Repeat the analysis above using columns 2:12 of the data file and compare the clusters. 8.13

> Gators <- read.table("http://stat4ds.rwth-aachen.de/data/Gators.dat", header=TRUE) > summary(Gators$length) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.240 1.575 1.850 2.130 2.450 3.890 > fit <- glm(fish ~ length, family=binomial, data=Gators) > summary(fit) Estimate Std. Error z value Pr(>|z|) (Intercept) -2.0429 0.9199 -2.221 0.0264 length 1.0260 0.4317 2.377 0.0175 --> 2.0429/1.026 # P(Y=1) = 0.5 at x = -beta_0/beta_1 [1] 1.991131

By logistic regression, alligators of length greater than 1.99 meters have greater than a 50% chance of having fish as primary food choice. 8.14 8.15 8.16 The ROC curve is the straight line with slope 1 going from the point (0, 0) to the point (1, 1). The area under the curve is 0.50, the same as what is expected with random guessing of the response variable. For instance, for the null model fitted to the horseshoe crab data: > fit <- glm(y ~ 1, family=binomial, data=Crabs) > rocplot <- roc(y ~ fitted(fit), data=Crabs) > plot.roc(rocplot, legacy.axes=TRUE) > auc(rocplot) Area under the curve: 0.5

8.17 See Section 8.1.5.



8.18 Like classification and regression trees, a disadvantage compared with logistic regression modeling is interpretability, because of not having model coefficients for summarizing effects of explanatory variables, and to help determine which of them are important. A disadvantate of logistic regression is its restriction to categorical response variables and its inability to describe highly complex relationships with very irregular decision boundaries. An advantage of K-nearest neighbors is simplicity of explanation of how predictions are made. An advantage of neural networks is its applicability to a much greater variety of problems, including image classification and language translation and its potential power for excellent predictions with very large n regardless of the nature of the relationship between the explanatory variables and the response variable. 8.19 We can interpret D as the probability that two randomly selected subjects fall in the same class, so larger D reflects less diversity. 8.20 If you want a simple portrayal that groups olive oils by rectangular regions of the explanatory variables, then a classification tree may be adequate. If you want to analyze effects of explanatory variables and determine which are most important for predicting the origin, logistic regression is appropriate. In this case, if the chemical properties are quantitative and can be considered to have approximately normal distributions, one could instead use linear discriminant analysis. 8.21 The regression tree method partitions the space of explanatory variable values into distinct regions. Each observation in a particular region has predicted response value given by the sample mean for the training observations in that region. Compared with regression models, regression trees have the advantage of being simple to explain and to portray values that lead to various predicted values. It can portray a complex non-linear relationship that we might not discover using regression methods. Disadvantages include not providing a framework for inference, such as to determine whether an explanatory variable has an effect on a response variable when we adjust for other explanatory variables, and it does not provide simple numerical summaries of effects, such as regression coefficients and correlations. 8.22 The squared Euclidean distance between two points (x1 , . . . , xn ) and (y1 , . . . yn ) is ∑i (xi − yi )2 . One could start with each observation as a cluster, and then join pairs together that have the minimum Euclidean distance. Continue, finding for each possible pair of clusters that could be merged their mean Euclidean distance between pairs of points, one from each cluster, and merging clusters with the minimum mean Euclidean distance. Without standardizing, changing units (e.g., from feet to meters for a distance variable can result in a different dendrogram. Without standardizing, a variable with a very large standard deviation can have more influence of the formation of clusters than one with a very small standard deviation. Unless one expects a particular number of clusters (e.g., two clusters for Democrat-leaning and Republican-leaning states), one can view the dendrogram as clusters merge until observing a natural choice of clusters that clearly distinguishes different groups of observations. 8.23 Interchange variables and cases in the data file used for the clustering, so each row is a different variable and each column is a different observation. 
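A minimal sketch of this idea for the election data of Exercises 8.11–8.12, clustering the eleven elections rather than the states by transposing the 0/1 data matrix (column indices follow the earlier call and are illustrative):

> Elect <- read.table("http://stat4ds.rwth-aachen.de/data/Elections.dat", header=TRUE)
> X <- t(as.matrix(Elect[, 4:12]))     # rows are now elections, columns are states
> plot(hclust(dist(X, method="manhattan"), "average"), labels=rownames(X))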
8.24 A statistical model for a set of variables is a simple approximation for the true and possibly complex relationship among those variables in the population, such as by giving a linear predictor formula for how the expected value of a response variable depends on values of a set of explanatory variables. A model uses a framework that incorporates assumptions about the random variability in those variables and imposes a structure


for describing and making inferences about relationships. Sampling distributions and statistical inferences are derived under the assumed model structure. By contrast, an algorithm does not incorporate a framework for random variability or result in a formula for describing how the response variable depends on values of explanatory variables, but merely provides a recipe for conducting an analysis, such as for providing a predicted value of a response variable based on values input for a set of explanatory variables. A statistical model is better than an algorithm for summarizing effects and for inferential statistics, but sometimes an algorithm can perform better for prediction.

