Q2: Follow up the data on Q1. Use the R pipeline to build the linear regression model. Compare the result from R and the result by your manual calculation.
Solution:
data = rbind(c(-0.15, -0.48, 0.46), c(-0.72, -0.54, -0.37), c(1.36, -0.91, -0.27), c(0.61, 1.59, 1.35), c(-1.11, 0.34, -0.11))
data = data.frame(data)
names(data)
## [1] "X1" "X2" "X3"
colnames(data) = c("X1","X2","Y")
lm.YX <- lm(Y ~ X1 + X2, data = data)
summary(lm.YX)
## ## Call: ## lm(formula = Y ~ X1 + X2, data = data) ## ## Residuals: ## 1 2 3 4 5 ## 0.5663 -0.1014 -0.2435 0.0566 -0.2780 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.2124 0.2170 0.979 0.431 ## X1 0.2222 0.2430 0.914 0.457 ## X2 0.5946 0.2430 2.447 0.134 ## ## Residual standard error: 0.4852 on 2 degrees of freedom ## Multiple R-squared: 0.7682, Adjusted R-squared: 0.5365 ## F-statistic: 3.315 on 2 and 2 DF, p-value: 0.2318
Q3: Please read the following output in R. (1) Write up the fitted regression model. (2) Identify the significant variables. (3) What is the R-squared of this model? Does the model fit the data well? (4) What would you recommend as the next step in data analysis? ## ## Call: ## lm(formula = y ~ ., data = data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.239169 -0.065621 0.005689 0.064270 0.310456 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.009124 0.010473 0.871 0.386 ## x1 1.008084 0.008696 115.926 <2e-16 *** ## x2 0.494473 0.009130 54.159 <2e-16 *** ## x3 0.012988 0.010055 1.292 0.200 ## x4 -0.002329 0.009422 -0.247 0.805 ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.1011 on 95 degrees of freedom ## Multiple R-squared: 0.9942, Adjusted R-squared: 0.994 ## F-statistic: 4079 on 4 and 95 DF, p-value: < 2.2e-16
Solution: (1) The fitted model is y = 0.009124 + 1.008084 x1 + 0.494473 x2 + 0.012988 x3 − 0.002329 x4. (2) The variables x1 and x2 are significant, i.e., their p-values from the t-test are smaller than 0.05. (3) The R-squared is 0.9942. The model fits the data very well. (4) The intercept and the variables x3 and x4 are not significant. The next step would be to remove these terms from the current model and refit the model.
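A minimal sketch of that next step in R (assuming the data frame used above is available as an object called data, with columns y and x1 through x4; the object names here are assumptions for illustration):
lm.full <- lm(y ~ ., data = data)              # the full model shown above
lm.reduced <- update(lm.full, . ~ . - x3 - x4) # drop the insignificant x3 and x4
summary(lm.reduced)                            # re-examine the reduced fit
anova(lm.reduced, lm.full)                     # F-test of whether dropping x3 and x4 is justified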
Q4: Consider the following dataset
ID   X1     X2     Y
1    0.22   0.38   No
2    0.58   0.32   Yes
3    0.57   0.28   Yes
4    0.41   0.43   Yes
5    0.6    0.29   No
6    0.12   0.32   Yes
7    0.25   0.32   Yes
8    0.32   0.38   No
Build a decision tree model by manual calculation. To simplify the process, let's only try three alternatives for the splits: x1 ≥ 0.59, x1 ≥ 0.37, and x2 ≥ 0.35.
Solution: To conduct the first split, we evaluate the IG for each of the three alternatives. For x1 ≥ 0.59, it would split the data as shown below.
We can compute
e0 = −(5/8) log(5/8) − (3/8) log(3/8) = 0.9544, e1 = −1 log 1 − 0 log 0 = 0,
e2 = −(5/7) log(5/7) − (2/7) log(2/7) = 0.8631.
The IG is
IG = e0 − w1 e1 − w2 e2 = 0.9544 − (1/8) × 0 − (7/8) × 0.8631 = 0.1992.
For x1 ≥ 0.37, it would split the data as shown below.
We can compute
e0 = −(5/8) log(5/8) − (3/8) log(3/8) = 0.9544, e1 = −(3/4) log(3/4) − (1/4) log(1/4) = 0.8113, e2 = −(2/4) log(2/4) − (2/4) log(2/4) = 1.
The IG is
IG = e0 − w1 e1 − w2 e2 = 0.9544 − (4/8) × 0.8113 − (4/8) × 1 = 0.0488.
For x2 ≥ 0.35, it would split the data as shown below.
We can compute
e0 = −(5/8) log(5/8) − (3/8) log(3/8) = 0.9544, e1 = −(1/3) log(1/3) − (2/3) log(2/3) = 0.9183, e2 = −(4/5) log(4/5) − (1/5) log(1/5) = 0.7219.
The IG is
IG = e0 − w1 e1 − w2 e2 = 0.9544 − (3/8) × 0.9183 − (5/8) × 0.7219 = 0.1589.
We choose the one with the maximum IG, x1 ≥ 0.59. The tree at this stage is shown below.
We continue to split the child node on the right side. There are two alternatives left. For x1 ≥ 0.37, it would split the data as shown below.
We can compute
e0 = −(5/7) log(5/7) − (2/7) log(2/7) = 0.8631, e1 = −1 log 1 − 0 log 0 = 0,
e2 = −(2/4) log(2/4) − (2/4) log(2/4) = 1.
The IG is
IG = e0 − w1 e1 − w2 e2 = 0.8631 − (3/7) × 0 − (4/7) × 1 = 0.2917.
For x2 ≥ 0.35, it would split the data as shown below.
We can compute
e0 = −(5/7) log(5/7) − (2/7) log(2/7) = 0.8631, e1 = −(1/3) log(1/3) − (2/3) log(2/3) = 0.9183, e2 = −1 log 1 − 0 log 0 = 0.
The IG is
IG = e0 − w1 e1 − w2 e2 = 0.8631 − (3/7) × 0.9183 − (4/7) × 0 = 0.4695.
We choose the one with the maximum IG, x2 ≥ 0.35. Now, only the node with data points ID #1, 4, 8 is not homogeneous. We can further split this node, and there is only one alternative left to choose from, so the final tree is shown below.
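As a quick check of the manual calculation, the entropies and information gains above can be reproduced with a short R sketch (the data are copied from the table in Q4; the helper functions are written here just for illustration):
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))   # base-2 entropy over the observed class proportions
}
info.gain <- function(split, labels) {
  entropy(labels) - mean(split) * entropy(labels[split]) - mean(!split) * entropy(labels[!split])
}
x1 <- c(0.22, 0.58, 0.57, 0.41, 0.6, 0.12, 0.25, 0.32)
x2 <- c(0.38, 0.32, 0.28, 0.43, 0.29, 0.32, 0.32, 0.38)
y  <- c("No","Yes","Yes","Yes","No","Yes","Yes","No")
info.gain(x1 >= 0.59, y)   # approximately 0.1992
info.gain(x1 >= 0.37, y)   # approximately 0.0488
info.gain(x2 >= 0.35, y)   # approximately 0.1589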
Q5: Follow up the dataset in Q4. Use the R pipeline for building a decision tree model. Compare the result from R and the result by your manual calculation.
Solution: library(rpart) library(rpart.plot) X<-rbind(c(0.22, 0.38), c(0.58, 0.32), c(0.57, 0.28), c(0.41, 0.43), c(0.6, 0.29), c(0.12, 0.32), c(0.25, 0.32), c(0.32, 0.38)) Z<-rbind(c("N"), c("Y"), c("Y"), c("Y"), c("N"), c("Y"), c("Y"), c("N")) Z<-as.factor(Z) data <- data.frame(X,Z) train.ix <- sample(nrow(data),floor( nrow(data)/2) ) data.train <- data[train.ix,] data.test <- data[-train.ix,] tree <- rpart( Z ~ ., data = data, minbucket = 1) prp(tree,nn.cex=1)
Q6: Use the mtcars dataset in R, select the variable mpg as the outcome variable and other variables as predictors, run the R pipeline for linear regression, and summarize your findings.
Solution:
# Step 1 -> Read data into R workstation
data <- mtcars[,c("mpg","disp","hp","wt")]
# Step 2 -> Data preprocessing # Create a training data (half the original data size) train.ix <- sample(nrow(data),floor( nrow(data)/2) ) data.train <- data[train.ix,] # Create a testing data (half the original data size) data.test <- data[-train.ix,] # Step 3 -> Use lm() function to build a full model with all predictors lm.model <- lm(mpg ~ ., data = data.train) summary(lm.model) ## ## Call: ## lm(formula = mpg ~ ., data = data.train) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.5076 -2.0593 -0.5060 0.7955 5.5689 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 37.865895 4.115142 9.202 8.73e-07 *** ## disp 0.006152 0.021213 0.290 0.777 ## hp -0.029966 0.017667 -1.696 0.116 ## wt -4.336897 2.434883 -1.781 0.100 ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.226 on 12 degrees of freedom ## Multiple R-squared: 0.8203, Adjusted R-squared: 0.7754 ## F-statistic: 18.26 on 3 and 12 DF, p-value: 9.082e-05 # Step 4 -> use step() to automatically delete all the insignificant variable s # Automatic model selection lm.reduced <- step(lm.model, direction="backward", test="F") ## Start: AIC=40.88 ## mpg ~ disp + hp + wt ## ## Df Sum of Sq RSS AIC F value Pr(>F) ## - disp 1 0.875 125.78 38.991 0.0841 0.7768 ## <none> 124.90 40.879 ## - hp 1 29.946 154.85 42.318 2.8770 0.1156 ## - wt 1 33.021 157.92 42.632 3.1725 0.1002 ## ## Step: AIC=38.99 ## mpg ~ hp + wt ## ## Df Sum of Sq RSS AIC F value Pr(>F) 9
## <none> 125.78 38.991 ## - hp 1 30.033 155.81 40.417 3.1041 0.101576 ## - wt 1 128.659 254.44 48.263 13.2978 0.002956 ** ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 anova(lm.reduced,lm.model) ## Analysis of Variance Table ## ## Model 1: mpg ~ hp + wt ## Model 2: mpg ~ disp + hp + wt ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 13 125.78 ## 2 12 124.90 1 0.87544 0.0841 0.7768 # Step 5 -> Predict using your linear regession model pred.lm <- predict(lm.reduced, data.test) cor(pred.lm, data.test$mpg) ## [1] 0.9620341 # Step 6 -> Conduct diagnostics of the model # install.packages("ggfortify") require("ggfortify") # ggfortify is the package to do model diagnosis ## Loading required package: ggfortify ## Loading required package: ggplot2 autoplot(lm.reduced, which = 1:6, ncol = 3, label.size = 3)
Q7: Use the mtcars dataset in R, select the variable mpg as the outcome variable and other variables as predictors, run the R pipeline for decision tree, and summarize your findings. Another dataset is to use the iris dataset, select the variable Species as the outcome variable (i.e., to build a classification tree). Solution: We can use the mtcars dataset and build a regression tree: library(rpart) library(rpart.plot) # Step 1: read data into R data <- mtcars[,c("mpg","disp","hp","wt")] # Step 2: data preprocessing X <- data[,2:4] Y <- data$mpg data <- data.frame(X,Y) names(data)[4] = c("mpg") # Step 2 -> Data preprocessing # Create a training data (half the original data size) train.ix <- sample(nrow(data),floor( nrow(data)/2) ) data.train <- data[train.ix,] # Create a testing data (half the original data size) data.test <- data[-train.ix,]
# Step 3: build the tree tree_reg <- rpart(mpg ~ ., data.train, method="anova", minbucket = 3) # for r egression problems, use method="anova" # Step 4: draw the tree prp(tree_reg, nn.cex=1)
# Step 5 -> prune the tree tree_reg <- prune(tree_reg,cp=0.03) prp(tree_reg,nn.cex=1)
# Step 6 -> Predict using your tree model pred.tree <- predict(tree_reg, data.test) cor(pred.tree, data.test$mpg) #For regression model, you can use correlation to measure how close your predictions with the true outcome values of the dat a points ## [1] 0.7762185
Another example. We use the iris dataset to build a decision tree for classification: library(rpart) library(rpart.plot) # Step 1 -> Read data into R workstation data("iris") data <- iris # Step 2 -> Data preprocessing # Create your X matrix (predictors) and Y vector (outcome variable) X <- data[,1:4] Y <- data[,5] # Then, we integrate everything into a data frame data <- data.frame(X,Y) names(data)[5] = c("Species") # Create a training data (half the original data size) train.ix <- sample(nrow(data),floor( nrow(data)/2) ) data.train <- data[train.ix,] # Create a testing data (half the original data size) data.test <- data[-train.ix,] # Step 3 -> use rpart to build the decision tree. tree <- rpart(Species ~ ., data = data.train) # Step 4 -> draw the tree prp(tree,nn.cex=1)
# Step 5 -> prune the tree tree <- prune(tree,cp=0.01) prp(tree,nn.cex=1)
# Step 6 -> Predict using your tree model pred.tree <- predict(tree, data.test, type="class") err.tree <- length(which(pred.tree != data.test$Species))/length(pred.tree) print(err.tree) ## [1] 0.05333333
Q8: Design a simulated experiment to evaluate the effectiveness of the lm() in R. For instance, you can simulate 100 samples from a linear regression model with 2 variables, 𝑦 = 𝑥! 𝛽! + 𝑥" 𝛽" + 𝜀, where 𝛽! = 1, 𝛽" = 1, and 𝜀~𝑁(0,1). You can simulate 𝑥! and 𝑥" using the standard normal distribution 𝑁(0,1). Run lm() on the simulated data, and see how close the fitted model is with the true model. Solution: x1 <- rnorm(mean = 0, sd = 1, n = 100) # simulate a predictor (x1) with 100 m easurements from a normal distribution, while mean = 0 and std = 1. rnorm() i s the function to simulate from normal distribution x2 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x2) beta1 <- 1 # the regression coefficient of the first predictor = 1 beta2 <- 1 # the regression coefficient of the second predictor = 0.5 mu <- beta1 * x1 + beta2 * x2 # with simulated values of x1 and x2, and the c oefficients, we can calculate the mean levels of the outcome variable y <- rnorm(100, mu, 1) # further, simulate the outcome variable. remember, y = f(x) + error. Here, the error term is N(0,1) lm.XY <- lm(y ~ ., data = data.frame(y,x1,x2)) # Now, let's fit the linear re gression model summary(lm.XY) ## ## Call: ## lm(formula = y ~ ., data = data.frame(y, x1, x2)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.42166 -0.67625 0.06936 0.63142 3.05738 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.07480 0.09820 0.762 0.448 ## x1 0.88604 0.09976 8.882 3.47e-14 *** ## x2 1.07347 0.10510 10.213 < 2e-16 *** ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.972 on 97 degrees of freedom ## Multiple R-squared: 0.6832, Adjusted R-squared: 0.6767 ## F-statistic: 104.6 on 2 and 97 DF, p-value: < 2.2e-16
From the result we can see that the fitted regression parameters are close to the true regression parameters. Also, the R-squared is 0.6832, which implies that about 2/3 of the variance in the dataset is explained by the model. Recall that we used a standard normal distribution to simulate x1, x2, and ε – this again makes sense, since the other 1/3 of the variance that could not be explained by the model comes from ε, the noise term.
Q9: Follow up the experiment in Q8. Let’s add two more variables 𝑥$ and 𝑥% into the dataset but still generate 100 samples from a linear regression model from the same underlying model 𝑦 = 𝑥! 𝛽! + 𝑥" 𝛽" + 𝜀, where 𝛽! = 1, 𝛽" = 1, and 𝜀~𝑁(0,1). In other words, 𝑥$ and 𝑥% are insignificant variables. You can simulate 𝑥! to 𝑥% using the standard normal distribution 𝑁(0,1). Run lm() on the simulated data, and see how close the fitted model is with the true model. Solution: x1 <- rnorm(mean = 0, sd = 1, n = 100) # simulate a predictor (x1) with 100 m easurements from a normal distribution, while mean = 0 and std = 1. rnorm() i s the function to simulate from normal distribution x2 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x2) x3 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x3) x4 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x4) beta1 <- 1 # the regression coefficient of the first predictor = 1 beta2 <- 1 # the regression coefficient of the second predictor = 0.5 mu <- beta1 * x1 + beta2 * x2 # with simulated values of x1 and x2, and the c oefficients, we can calculate the mean levels of the outcome variable y <- rnorm(100, mu, 1) # further, simulate the outcome variable. remember, y = f(x) + error. Here, the error term is N(0,1) lm.XY <- lm(y ~ ., data = data.frame(y,x1,x2,x3,x4)) # Now, let's fit the lin ear regression model summary(lm.XY) ## ## Call: ## lm(formula = y ~ ., data = data.frame(y, x1, x2, x3, x4)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.83015 -0.58746 -0.03979 0.47486 1.92029 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.01949 0.09335 -0.209 0.835 16
## x1 0.98395 0.08772 11.217 < 2e-16 *** ## x2 1.01686 0.10823 9.395 3.24e-15 *** ## x3 0.07732 0.08923 0.866 0.388 ## x4 -0.09552 0.10782 -0.886 0.378 ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.9262 on 95 degrees of freedom ## Multiple R-squared: 0.717, Adjusted R-squared: 0.7051 ## F-statistic: 60.18 on 4 and 95 DF, p-value: < 2.2e-16
It seems that x3 and x4 are insignificant, but their inclusion in the model increased the R-squared to 0.717. Adding more of those noise variables would further increase the R-squared.
Q10: Follow up the experiment in Q8. Run rpart() on the simulated data, and see how close the fitted model is with the true model. Solution: x1 <- rnorm(mean = 0, sd = 1, n = 100) # simulate a predictor (x1) with 100 m easurements from a normal distribution, while mean = 0 and std = 1. rnorm() i s the function to simulate from normal distribution x2 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x2) beta1 <- 1 # the regression coefficient of the first predictor = 1 beta2 <- 1 # the regression coefficient of the second predictor = 0.5 mu <- beta1 * x1 + beta2 * x2 # with simulated values of x1 and x2, and the c oefficients, we can calculate the mean levels of the outcome variable y <- rnorm(100, mu, 1) # further, simulate the outcome variable. remember, y = f(x) + error. Here, the error term is N(0,1) data <- data.frame(cbind(x1,x2,y)) library(rpart) library(rpart.plot) tree <- rpart(y ~ ., data = data) prp(tree,nn.cex=1)
pred.tree <- predict(tree, data)
R_squared <- 1 - var(pred.tree-y)/var(y)
R_squared
## [1] 0.6439515
It seems the decision tree could achieve a similar level of R-squared, but the tree structure is pretty complex. The complexity comes from the difficulty of using a tree-based format to express linearity. Another disadvantage of this complex tree structure is that it may be data-dependent, i.e., in another simulation the data points would be different, and so would the tree structure.
Q11: Design a simulated experiment to evaluate the effectiveness of the rpart() in R package rpart. For instance, you can simulate 100 samples from a tree model as shown in the Figure below, run rpart() on the simulated data, and see how close the fitted model is with the true model.
Solution:
library(rpart)
library(rpart.plot)
x1 <- rnorm(100, 0, 1) # simulate a predictor (x1) with 100 measurements from a normal distribution, with mean = 0 and std = 1. rnorm() is the function to simulate from the normal distribution
x2 <- rnorm(100, 0, 1) # simulate another predictor (x2)
x3 <- rnorm(100, 0, 1) # simulate another predictor (x3)
y <- rep(0,100)
y[which(x1<1 & x2>=-0.5 & x3 >= 0.2)] = 1
y[which(x1<1 & x2< 0.5)] = 1
y <- paste0("c", y)
y <- as.factor(y)
data <- data.frame(cbind(x1,x2,x3,y))
tree <- rpart(y ~ ., data = data, method = "class")
prp(tree,nn.cex=1)
We can observe that rpart() correctly identified the tree model, despite small differences in the cutoff values. Both the true model and the fitted model fit the data with 100% accuracy, indicating that the true model, while defined specifically, could be represented by other models that do an equally good job in terms of classification. In other words, the concept of a "true model" doesn't imply that there is only one best model; there is inherent ambiguity in the concept in this case. Also, note that in this case we didn't add any noise to the variable "y". A more realistic scenario would consider the case where some values of "y" are corrupted by noise.
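The accuracy claim can be checked directly with a small sketch (assuming the objects data and tree from the code above are still in the workspace; note that, because of the cbind() call, the outcome column in data is stored as the numeric codes 1/2 of the factor):
pred.tree <- predict(tree, data, type = "class")        # tree predictions on the training data
table(pred.tree, data$y)                                # confusion table against the stored outcome
mean(as.character(pred.tree) == as.character(data$y))   # expected to be (very close to) 1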
Chapter 3
Q1: Consider the case that, in building linear regression models, there is a concern that some data points may be more important (or more trustworthy). For these cases, it is not uncommon to assign a weight to each data point. Denote the weight for the ith data point as w_i. An example is shown in the data table below, as the last column, e.g., w1 = 1, w2 = 2, w5 = 3.
X1      X2      Y       w
-0.15   -0.48   0.46    1
-0.72   -0.54   -0.37   2
1.36    -0.91   -0.27   2
0.61    1.59    1.35    1
-1.11   0.34    -0.11   3
We still want to estimate the regression parameters in the least-squares framework. Follow the process of the derivation of the least-squares estimator as shown in Chapter 2, and propose your new estimator of the regression parameters.
Solution: The objective function for the weighted least-squares loss is
min_β (y − Xβ)^T W (y − Xβ).
To solve this optimization problem, we use the first derivative test: compute the gradient of the objective function with respect to β and set it equal to zero,
d[(y − Xβ)^T W (y − Xβ)] / dβ = 0.
Then we have X^T W (y − Xβ) = 0. Solving this equation for β, we have
β̂ = (X^T W X)^(-1) X^T W y.
Q2: Follow up the weighted least squares estimator derived in Q1. Please calculate the regression parameters (β0, β1, and β2) of the regression model (shown below) using the data shown in the table.
Y = β0 + β1 X1 + β2 X2 + ε.
Solution:
X =
[ 1   -0.15   -0.48 ]
[ 1   -0.72   -0.54 ]
[ 1    1.36   -0.91 ]
[ 1    0.61    1.59 ]
[ 1   -1.11    0.34 ]
y = (0.46, -0.37, -0.27, 1.35, -0.11)^T
XT =
[  1      1      1      1      1    ]
[ -0.15  -0.72   1.36   0.61  -1.11 ]
[ -0.48  -0.54  -0.91   1.59   0.34 ]
W = diag(1, 2, 2, 1, 3)
XT * W =
[  1      2      2      1      3    ]
[ -0.15  -1.44   2.72   0.61  -3.33 ]
[ -0.48  -1.08  -1.82   1.59   1.02 ]
XT * W * X =
[  9        -1.59     -0.77   ]
[ -1.59      8.8269   -1.7879 ]
[ -0.77     -1.7879    5.3447 ]
(XT * W * X)^(-1) =
[ 0.118004697   0.026495069   0.025863781 ]
[ 0.026495069   0.127473004   0.046459144 ]
[ 0.025863781   0.046459144   0.206368817 ]
XT * W * y = (0.2, 0.9192, 2.7045)^T
(XT * W * X)^(-1) * XT * W * y = (0.117903802, 0.248120955, 0.606002466)^T
B_0 = 0.117903802
B_1 = 0.248120955
B_2 = 0.606002466
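The same numbers can be obtained in R with a few lines of matrix algebra (a minimal sketch of the estimator derived in Q1, using the data from the table):
X <- cbind(1, c(-0.15, -0.72, 1.36, 0.61, -1.11), c(-0.48, -0.54, -0.91, 1.59, 0.34))
y <- c(0.46, -0.37, -0.27, 1.35, -0.11)
W <- diag(c(1, 2, 2, 1, 3))
beta.hat <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y
beta.hat   # approximately (0.1179, 0.2481, 0.6060)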
Q3: Follow up the dataset in Q1. Use the R pipeline for linear regression on this data (set up the weights in the lm() function). Compare the result from R and the result by your manual calculation. Solution: X1 <- c(-0.15, -0.72, 1.36, 0.61, -1.11) X2 <- c(-0.48, -0.54, -0.91, 1.59, 0.34) Y <- c(0.46, -0.37, -0.27, 1.35, -0.11) W <- c(1, 2, 2, 1, 3) lm.para <- lm(Y ~ X1 + X2, weights = W) summary(lm.para) ## ## Call: ## lm(formula = Y ~ X1 + X2, weights = W) ## ## Weighted Residuals: ## 1 2 3 4 5 ## 0.67020 0.02543 -0.24591 0.11720 -0.27458 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.1179 0.1881 0.627 0.595 ## X1 0.2481 0.1955 1.269 0.332 ## X2 0.6060 0.2487 2.437 0.135 ## ## Residual standard error: 0.5475 on 2 degrees of freedom ## Multiple R-squared: 0.7588, Adjusted R-squared: 0.5177 ## F-statistic: 3.147 on 2 and 2 DF, p-value: 0.2412
Q4: Consider the following dataset
ID   X1     X2     Y
1    0.22   0.38   No
2    0.58   0.32   Yes
3    0.57   0.28   Yes
4    0.41   0.43   Yes
5    0.6    0.29   No
6    0.12   0.32   Yes
7    0.25   0.32   Yes
8    0.32   0.38   No
Use the R pipeline for building logistic regression model on this data.
Solution: # Step 1 -> Read data into R workstation X1 <- c(0.22,0.58,0.57,0.41,0.6,0.12,0.25,0.32) X2 <- c(0.38,0.32,0.28,0.43,0.29,0.32,0.32,0.38) Y <- c("No","Yes","Yes","Yes","No","Yes","Yes","No") # Step 2-> Data processing Y <- as.factor(Y) data <- data.frame(X1,X2,Y) # Step 3 -> Use lm() function to build a logistic regression model logit <- glm(Y ~ X1 + X2, data = data, family = "binomial") summary(logit) ## ## Call: ## glm(formula = Y ~ X1 + X2, family = "binomial", data = data) ## ## Deviance Residuals: ## 1 2 3 4 5 6 7 ## -1.3137 0.9504 0.8367 1.2523 -1.5198 0.8517 0.8789 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 3.3217 6.5097 0.510 0.610 ## X1 -0.5799 4.6846 -0.124 0.901 ## X2 -7.5773 16.3186 -0.464 0.642 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 10.585 on 7 degrees of freedom ## Residual deviance: 10.364 on 5 degrees of freedom ## AIC: 16.364 ## ## Number of Fisher Scoring iterations: 4
8 -1.2882
Q5: Consider the model fitted in Q4. Suppose that now there are two new data points as shown in the following table. Please use the fitted model to predict on these two data points and fill in the table.
ID   X1     X2     Y
9    0.25   0.18
10   0.08   1.12
Solution: For #9, we have 3.3217 - 0.5799 * 0.25 - 7.5773 * 0.18 = 1.812811, which is larger than 0. So the prediction is Yes.
For #10, we have 3.3217 - 0.5799 * 0.08 - 7.5773 * 1.12 = -5.211268, which is smaller than 0. So the prediction is No.
ID   X1     X2     Y
9    0.25   0.18   Yes
10   0.08   1.12   No
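These hand calculations can be double-checked with predict(), assuming the fitted object logit from Q4 is still available:
newdata <- data.frame(X1 = c(0.25, 0.08), X2 = c(0.18, 1.12))  # the two new data points
predict(logit, newdata)                     # linear predictors: approximately 1.81 and -5.21
predict(logit, newdata, type = "response")  # predicted probabilities of "Yes" (above/below 0.5)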
Q6: Use the dataset PimaIndiansDiabetes2 in the R package mlbench, run the R pipeline for logistic regression on it, and summarize your findings. Solution: # Step 1 -> Read data into R workstation library(mlbench) data("PimaIndiansDiabetes2") data <- PimaIndiansDiabetes2 data <- na.omit(data) # Step 2 -> Data preprocessing # Create your X matrix (predictors) and Y vector (outcome variable) X <- data[,1:8] Y <- data[,9] # Then, we integrate everything into a data frame data <- data.frame(X,Y) names(data)[9] = c("diabetes") # Create a training data (half the original data size) train.ix <- sample(nrow(data),floor( nrow(data)/2) ) data.train <- data[train.ix,] # Create a testing data (half the original data size) data.test <- data[-train.ix,] # Step 3 -> Use glm() function to build a full model with all predictors logit.full <- glm(diabetes ~ ., data = data.train, family = "binomial") summary(logit.full) ## ## Call: ## glm(formula = diabetes ~ ., family = "binomial", data = data.train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9860 -0.6308 -0.3920 0.6552 2.4601 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -8.857017 1.695243 -5.225 1.75e-07 *** 25
## pregnant 0.015204 0.079677 0.191 0.8487 ## glucose 0.036814 0.008197 4.491 7.09e-06 *** ## pressure -0.004356 0.018892 -0.231 0.8176 ## triceps 0.026986 0.026537 1.017 0.3092 ## insulin -0.001764 0.001735 -1.017 0.3093 ## mass 0.030031 0.044088 0.681 0.4958 ## pedigree 0.066276 0.613316 0.108 0.9139 ## age 0.058679 0.027495 2.134 0.0328 * ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 232.66 on 195 degrees of freedom ## Residual deviance: 169.44 on 187 degrees of freedom ## AIC: 187.44 ## ## Number of Fisher Scoring iterations: 5 # Step 4 -> use step() to automatically delete all the insignificant variable s # Automatic model selection logit.reduced <- step(logit.full, direction="both", trace = 0) anova(logit.reduced,logit.full,test = "LRT") ## Analysis of Deviance Table ## ## Model 1: diabetes ~ glucose + triceps + age ## Model 2: diabetes ~ pregnant + glucose + pressure + triceps + insulin + ## mass + pedigree + age ## Resid. Df Resid. Dev Df Deviance Pr(>Chi) ## 1 192 170.79 ## 2 187 169.44 5 1.3502 0.9297 summary(logit.reduced) ## ## Call: ## glm(formula = diabetes ~ glucose + triceps + age, family = "binomial", ## data = data.train) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9571 -0.6271 -0.3975 0.6316 2.4323 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -8.156010 1.195165 -6.824 8.84e-12 *** ## glucose 0.032509 0.006835 4.757 1.97e-06 *** ## triceps 0.034631 0.020243 1.711 0.08713 . ## age 0.061821 0.020029 3.087 0.00202 ** 26
## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 232.66 on 195 degrees of freedom ## Residual deviance: 170.79 on 192 degrees of freedom ## AIC: 178.79 ## ## Number of Fisher Scoring iterations: 5 # Step 5 -> test the significance of the logistic model dev.p.val <- 1 - pchisq(logit.reduced$deviance, logit.reduced$df.residual) dev.p.val ## [1] 0.8622615 # Step 6 -> Predict using your logistic regression model y_hat <- predict(logit.reduced, data.test) # Step 7 -> Evaluate the prediction performance of your logistic regression m odel y_hat2 <- y_hat y_hat2[which(y_hat > 0)] = "c1" y_hat2[which(y_hat < 0)] = "c0"
library(pROC) ## Type 'citation("pROC")' for a citation. ## ## Attaching package: 'pROC' ## The following objects are masked from 'package:stats': ## ## cov, smooth, var plot(roc(data.test$diabetes, y_hat), col="green", main="ROC Curve") ## Setting levels: control = neg, case = pos ## Setting direction: controls < cases
Q7: Follow up on the simulation experiment in Q11 in Chapter 2. Apply glm() on the simulated data to build a logistic regression model, and comment on the result. Solution: x1 <- rnorm(100, 0, 1) # simulate a predictor (x1) with 100 measurements from a normal distribution, while mean = 0 and std = 1. rnorm() is the function t o simulate from normal distribution x2 <- rnorm(100, 0, 1) # simulate another predictor (x2) x3 <- rnorm(100, 0, 1) # simulate another predictor (x2) y <- rep(0,100) y[which(x1<1 & x2>=-0.5 & x1 >= 0.2)] = 1 y[which(x1<1 & x2< 0.5)] = 1
data <- data.frame(cbind(x1,x2,x3,y)) logit <- glm(y~., data = data, family = "binomial") summary(logit) ## ## Call: ## glm(formula = y ~ ., family = "binomial", data = data) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.8645 -0.8655 0.3661 0.7313 2.2100 28
## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.9317 0.2841 3.280 0.001040 ** ## x1 -1.0433 0.2852 -3.658 0.000254 *** ## x2 -1.1881 0.3241 -3.666 0.000246 *** ## x3 -0.4085 0.2672 -1.529 0.126280 ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 131.791 on 99 degrees of freedom ## Residual deviance: 97.411 on 96 degrees of freedom ## AIC: 105.41 ## ## Number of Fisher Scoring iterations: 5 y_hat <- predict(logit, data) # predict on the data points y_hat2 <- y_hat y_hat2[which(y_hat > 0)] = 1 y_hat2[which(y_hat < 0)] = 0 err_rate <- length(which(y_hat2 != y))/length(y) err_rate # this is the error rate ## [1] 0.26
It seems that, while the decision tree model could achieve a 0% error rate on this data (which is perfectly designed for a tree model), the logistic regression model has an error rate of 26%. It is generally difficult to use a linear model to approximate a rectangular classification boundary.
Chapter 4 Q1: Continue the example in the 4-step R pipeline R lab in this chapter that estimated the mean and standard derivation of the variable HippoNV of the AD dataset. Use the same R pipeline to evaluate the uncertainty of the estimation of the standard derivation of the variable HippoNV of the AD dataset. Report its 95% CI. Solution: # Step 1 -> Read data into R workstation library(RCurl) AD <- read.csv(text=getURL("https://raw.githubusercontent.com/shuailab/ind_49 8/master/resource/data/AD.csv")) # Step 2 -> Decide on the statistical operation that you want to "Bootstrap" with require(MASS) ## Loading required package: MASS fit <- fitdistr(AD$HippoNV, densfun="normal") fit$estimate[2] # the estimated standard derivation ## sd ## 0.07645579 # Step 3 -> draw R bootstrap replicates to conduct the selected statistical o peration R <- 1000 # Initialize the vector to store the bootstrapped estimates bs_sd <- rep(NA, R) # draw R bootstrap resamples and obtain the estimates for (i in 1:R) { resam1 <- sample(AD$HippoNV, length(AD$HippoNV), replace = TRUE) # (1) AD $HippoNV is the sample we'd like to bootstrap; (2) length(AD$HippoNV) is tell R that we'd like to generate the bootstrapped samples with the sample size o f the original data; (3) replace = TRUE means that a data point could be repe atedly selected by our bootstrap procedure fit <- fitdistr(resam1 , densfun="normal") # resam1 is a bootstrapped data set. bs_sd[i] <- fit$estimate[2] # store the bootstrapped estimates of the mean } # Step 4 -> Summerarize the results and derive the bootstrap confidence inter val (CI) of the parameter bs_sd.sorted <- sort(bs_sd) # sort the mean estimates to obtain quantiles nee ded to construct the CIs # 0.025th and 0.975th quantile gives equal-tail bootstrap CI 30
CI.bs <- c(bs_sd.sorted[round(0.025*R)], bs_sd.sorted[round(0.975*R+1)]) CI.bs ## [1] 0.07219672 0.08081619
Q2: Use the R pipeline for Bootstrap to evaluate the uncertainty of the estimation of the coefficient of a logistic regression model. Report the 95% CI of the estimated coefficients. Solution: # Step 1 -> Read data into R workstation library(RCurl) AD <- read.csv(text=getURL("https://raw.githubusercontent.com/shuailab/ind_49 8/master/resource/data/AD.csv")) # Step 2 -> Apply the above pipeline to another statistical operation - estim ation of linear regression coefficients tempData <- data.frame(AD$DX_bl,AD$AGE, AD$PTGENDER, AD$PTEDUCAT) names(tempData) <- c("DX_bl","AGE","PTGENDER","PTEDUCAT") N <- dim(tempData)[1] # number of samples (sample size) P <- dim(tempData)[2] - 1 # number of predictors; the reason to minus 1 is be cause in tempData, there is an outcome variable "MMSCORE". # build a linear regression model with three predictors glm.AD <- glm(DX_bl ~ AGE + PTGENDER + PTEDUCAT, data = tempData, family = "b inomial") sum.glm.AD <- summary(glm.AD) std.glm <- sum.glm.AD$coefficients[ , 2] # Age is not significant according to the p-value glm.AD$coefficients[2] - 1.96 * std.glm[2] ## AGE ## 0.02516012 glm.AD$coefficients[2] + 1.96 * std.glm[2] ## AGE ## 0.07708342 # Step 3 -> draw R bootstrap replicates to conduct the selected statistical o peration # draw R bootstrap replicates R <- 1000 # Initialize the vector to store the bootstrapped estimates bs_glm <- matrix(NA, nrow = R, ncol = P+1) # There are P+1 regression coeffic 31
ients (counting the intercept here) # draw R bootstrap resamples and obtain the estimates for (i in 1:R) { resam_ID <- sample(c(1:N), N, replace = TRUE) resam_Data <- tempData[resam_ID,] # The above two lines generate a Bootstra pped dataset with the same sample size as the original dataset, with replacem ent of data points in resampling bs.glm <- glm(DX_bl ~ AGE + PTGENDER + PTEDUCAT, data = resam_Data, family = "binomial") bs_glm[i,] <- bs.glm$coefficients } # Step 4 -> Summerarize the results and derive the bootstrap confidence inter val (CI) of the parameter # Here, let's look at the linear regression coefficient of the variable Age f irst; for other variables, it is the same process bs.AGE <- bs_glm[,2] # sort the mean estimates of AGE to obtain bootstrap CI bs.AGE.sorted <- sort(bs.AGE) # 0.025th and 0.975th quantile gives equal-tail bootstrap CI CI.bs <- c(bs.AGE.sorted[round(0.025*R)], bs.AGE.sorted[round(0.975*R+1)]) CI.bs # One run of this code shows that CI.bs of the regression coefficient o f AGE is [0.02522836 0.008011093], which does not contains. Thus, AGE is sig nificant here. ## [1] 0.02522836 0.08011093
Q3: Consider the following data. Assume that two trees were built on it. Calculate the variable importance of each variable in RF.
ID   X1   X2   X3   Class
1    1    0    1    C0
2    0    0    1    C1
3    1    1    1    C1
4    0    1    1    C1
(1) Calculate the Gini index of each node of both trees.
Solution: For tree 1:
Node         Gini index
1,2,3,4,4    0.2 × 0.8 + 0.8 × 0.2 = 0.32
2,4,4        0
1,3          0.5 × 0.5 + 0.5 × 0.5 = 0.5
1            0
3            0
For tree 2:
Node         Gini index
1,1,2,3,4    0.4 × 0.6 + 0.6 × 0.4 = 0.48
2,4          0
1,1,3        0.67 × 0.33 + 0.33 × 0.67 = 0.4444
1,1          0
3            0
(2) Estimate the importance scores of the three variables in this RF model.
Solution: We can calculate the Gini gain scores of the splits:
Split   Gini gain
1       0.32 − 0.6 × 0 − 0.4 × 0.5 = 0.12
2       0.5 − 0.5 × 0 − 0.5 × 0 = 0.5
3       0.48 − 0.4 × 0 − 0.6 × 0.4444 = 0.2133
4       0.4444 − (2/3) × 0 − (1/3) × 0 = 0.4444
Thus, we can create a "credit table" for the variables:
Variables   Split 1   Split 2   Split 3   Split 4   Final score (average over the number of trees)
X1          0.12                0.2133              (0.12 + 0.2133)/2 = 0.1667
X2                    0.5                 0.4444    (0.5 + 0.4444)/2 = 0.4722
X3                                                  0
Q4: Use the dataset PimaIndiansDiabetes2 in the R package mlbench, run the R pipeline for random forest on it, and summarize your findings.
Solution:
# Step 1 -> Read data into R workstation
library(mlbench)
data("PimaIndiansDiabetes2")
data <- PimaIndiansDiabetes2
data <- na.omit(data)
# Step 2 -> Data preprocessing
# Create your X matrix (predictors) and Y vector (outcome variable)
X <- data[,1:8]
Y <- data[,9]
# Then, we integrate everything into a data frame
data <- data.frame(X,Y)
names(data)[9] = c("diabetes")
# Create a training data (half the original data size)
train.ix <- sample(nrow(data),floor( nrow(data)/2) )
data.train <- data[train.ix,]
# Create a testing data (half the original data size)
data.test <- data[-train.ix,]
# Step 3 -> Use randomForest() function to build a RF model with all predictors
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
rf.diabetes <- randomForest( diabetes ~ ., data = data.train, ntree = 100, nodesize = 20, mtry = 5)
rf.diabetes
##
## Call:
##  randomForest(formula = diabetes ~ ., data = data.train, ntree = 100, nodesize = 20, mtry = 5)
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 5 ## ## OOB estimate of error rate: 23.47% ## Confusion matrix: ## neg pos class.error ## neg 110 18 0.1406250 ## pos 28 40 0.4117647 # Step 4 -> Predict using your RF model y_hat <- predict(rf.diabetes, data.test,type="class") # Step 5 -> Evaluate the prediction performance of your RF model
# ROC curve is another commonly reported metric for classification models library(pROC) # pROC has the roc() function that is very useful here ## Type 'citation("pROC")' for a citation. ## ## Attaching package: 'pROC' ## The following objects are masked from 'package:stats': ## ## cov, smooth, var y_hat <- predict(rf.diabetes, data.test,type="vote") plot(roc(data.test$diabetes, y_hat[,1]), col="green", main="ROC Curve") ## Setting levels: control = neg, case = pos ## Setting direction: controls > cases
Q5: Modify the R pipeline for Bootstrap and incorporate the rpart package to write your own version of Random Forest. Test it using the same data that has been used in the R lab for decision tree. Solution:
library(RCurl) library(rpart) library(rpart.plot) # Step 1 -> Read data into R workstation data <- read.csv(text=getURL("https://raw.githubusercontent.com/shuailab/ind_ 498/master/resource/data/AD.csv")) # Step 2 -> Data preprocessing # Create your X matrix (predictors) and Y vector (outcome variable) X <- data[,2:16] Y <- data$DX_bl Y <- paste0("c", Y) Y <- as.factor(Y) # Then, we integrate everything into a data frame data <- data.frame(X,Y) names(data)[16] = c("DX_bl") # Create a training data (half the original data size) train.ix <- sample(nrow(data),floor( nrow(data)/2) ) data.train <- data[train.ix,] 36
# Create a testing data (half the original data size) data.test <- data[-train.ix,] # Step 3 -> draw R bootstrap replicates, and use rpart to build the decision tree. # draw R bootstrap replicates N <- dim(data.train)[1] P <- dim(data.train)[2]-1 #the response variable is not counted as a feature K <- round(P/2) R <- 50 # Initialize the vector to store the bootstrapped estimates bs_dt <- vector("list", R) # There are R models # draw R bootstrap resamples and obtain the estimates for (i in 1:R) { resam_ID <- sample(c(1:N), N, replace = TRUE) resam_var <- sample(c(1:P), K, replace = FALSE) # also randomly select K fe atures to split resam_Data <- data.train[resam_ID,c(resam_var,P+1)] # P+1 -> the response v ariable tree <- rpart( DX_bl ~ ., data = resam_Data) bs_dt[[i]] <- tree } # Step 4 -> draw trees prp(bs_dt[[1]],nn.cex=1) ## Warning: Cannot retrieve the data used to build the model (model.frame: ob ject 'rs11136000' not found). ## To silence this warning: ## Call prp with roundint=FALSE, ## or rebuild the rpart model with model=TRUE.
# Step 5 -> Predict using your RF model pred.tree <- matrix(nrow=dim(data.test)[1], ncol=R) for (i in 1:R){ pred.tree[,i] <- predict(bs_dt[[i]], data.test, type="class") } pred.majorityvote <- rep(NA,dim(data.test)[1]) for(i in 1:dim(data.test)[1]){ temp <- table(pred.tree[i,]) pred.majorityvote[i] <- names(temp[which.max(temp)]) # get the majority vot e result for sample i } pred.majorityvote[which(pred.majorityvote==1)]="c0" pred.majorityvote[which(pred.majorityvote==2)]="c1" err.rf <- length(which(pred.majorityvote != data.test$DX_bl))/length(pred.maj orityvote) print(mean(err.rf)) ## [1] 0.1544402
Q6: Suppose that a random forest model with 3 trees is built.
Please use this model to predict on the following data points.
ID   X1    X2     X3    Class
1    2     -0.5   0.5
2    0.8   -1.1   0.1
3    1.2   -0.3   0.9
Solution: For data point #1, tree #1’s prediction is c0; tree #2’s prediction is c0; tree #3’s prediction is c0. According to the majority vote, the final prediction of data point #1 is c0. For data point #2, tree #1’s prediction is c0; tree #2’s prediction is c1; tree #3’s prediction is c1. According to the majority vote, the final prediction of data point #2 is c1. For data point #3, tree #1’s prediction is c1; tree #2’s prediction is c0; tree #3’s prediction is c0. According to the majority vote, the final prediction of data point #3 is c0.
ID   X1    X2     X3    Class
1    2     -0.5   0.5   c0
2    0.8   -1.1   0.1   c1
3    1.2   -0.3   0.9   c0
Q7: Follow up on the simulation experiment in Q9 in Chapter 2. Apply random forest with 100 trees (by setting ntree = 100) on the simulated data, and comment on the result.
Solution:
x1 <- rnorm(mean = 0, sd = 1, n = 100) # simulate a predictor (x1) with 100 measurements from a normal distribution, with mean = 0 and std = 1. rnorm() is the function to simulate from the normal distribution
x2 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x2)
x3 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x3)
x4 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x4) beta1 <- 1 # the regression coefficient of the first predictor = 1 beta2 <- 1 # the regression coefficient of the second predictor = 0.5 mu <- beta1 * x1 + beta2 * x2 # with simulated values of x1 and x2, and the c oefficients, we can calculate the mean levels of the outcome variable y <- rnorm(100, mu, 1) # further, simulate the outcome variable. remember, y = f(x) + error. Here, the error term is N(0,1) data = data.frame(y,x1,x2,x3,x4) library(randomForest) ## randomForest 4.6-14 ## Type rfNews() to see new features/changes/bug fixes. rf <- randomForest( y ~ ., data = data, ntree = 100) pred.rf <- predict(rf, data) R_squared <- 1 - var(pred.rf-y)/var(y) R_squared ## [1] 0.8747632
It seems that the R-squared by random forest is 0.87. We know that 1/3 of the variability in the true model is inherently noise, so 0.87 is an overestimate. This is called "overfitting", which will be discussed in detail in Chapter 5 and other chapters.
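One way to see the overfitting without a separate test set is to use the out-of-bag (OOB) predictions that randomForest already stores; a sketch, assuming the objects rf and y from the code above are still available:
pred.oob <- predict(rf)    # with no newdata, predict() returns the OOB predictions
R_squared_oob <- 1 - var(pred.oob - y)/var(y)
R_squared_oob              # typically noticeably lower than the 0.87 obtained on the training data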
Chapter 5
Q1: A random forest model is built on the training data with six data points. The details of the trees and their bootstrapped datasets are shown in the Table below:
Tree   Bootstrapped data
1      1,3,4,4,5,6
2      2,2,4,4,4,5
3      1,2,2,5,6,6
4      3,3,3,4,5,6
To calculate the out-of-bag (OOB) errors, which data points are legitimate to be used for each tree? You can mark the elements in the following table (no need to make predictions with the random forest model, only to mark the elements where OOB errors could be collected).
Tree   Bootstrapped data   1 (C1)   2 (C2)   3 (C2)   4 (C1)   5 (C2)   6 (C1)
1      1,3,4,4,5,6
2      2,2,4,4,4,5
3      1,2,2,5,6,6
4      3,3,3,4,5,6
Solution:
Tree   Bootstrapped data   1 (C1)   2 (C2)   3 (C2)   4 (C1)   5 (C2)   6 (C1)
1      1,3,4,4,5,6                  X
2      2,2,4,4,4,5         X                 X                          X
3      1,2,2,5,6,6                           X        X
4      3,3,3,4,5,6         X        X
Q2: The figure below shows the ROC curves of two classification models. Which model is better?
Solution: A model is better if its ROC curve has a larger area under the curve. In other words, its curve is closer to the upper left corner. Thus, Model 1 is better.
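If the predicted scores of the two models were available, the visual comparison could be backed up numerically by the area under the curve (AUC); a sketch using the pROC package already used in this document (truth, score1, and score2 are hypothetical objects holding the true labels and the two models' scores):
library(pROC)
auc(roc(truth, score1))   # AUC of Model 1
auc(roc(truth, score2))   # AUC of Model 2; the model with the larger AUC is preferred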
Q3: Follow up on the simulation experiment in Q9 in Chapter 2 and the random forest model in Q7 in Chapter 4. Split the data into a training set and a testing test, then use 10-fold cross validation to evaluate the performance of the random forest model with 100 trees. Solution: # Step 1 -> Read data into R workstation x1 <- rnorm(mean = 0, sd = 1, n = 100) # simulate a predictor (x1) with 100 m easurements from a normal distribution, while mean = 0 and std = 1. rnorm() i s the function to simulate from normal distribution x2 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x2) x3 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x3) x4 <- rnorm(mean = 0, sd = 1, n = 100) # simulate another predictor (x4) beta1 <- 1 # the regression coefficient of the first predictor = 1 beta2 <- 1 # the regression coefficient of the second predictor = 0.5 mu <- beta1 * x1 + beta2 * x2 # with simulated values of x1 and x2, and the c oefficients, we can calculate the mean levels of the outcome variable y <- rnorm(100, mu, 1) # further, simulate the outcome variable. remember, y = f(x) + error. Here, the error term is N(0,1) data = data.frame(y,x1,x2,x3,x4) # Step 2 -> Use 10-fold cross-validation to evaluate all the models library(randomForest) 42
## randomForest 4.6-14 ## Type rfNews() to see new features/changes/bug fixes. # First, let me use 10-fold cross-validation to evaluate the performance of m odel1 n_folds = 10 # number of fold (the parameter K as we say, K-fold cross valida tion) N <- dim(data)[1] # the sample size, N, of the dataset folds_i <- sample(rep(1:n_folds, length.out = N)) R_squared <- NULL # cv_mse aims to make records of the prediction error for e ach fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data data.train.cv <- data[-test_i, ] # Then, the remaining 9 folds' data form o ur training data data.test.cv <- data[test_i, ] # This is the testing data, from the ith f old rf <- randomForest(y ~ ., data = data.train.cv, ntree = 100) pred.rf <- predict(rf, data.test.cv[,2:5]) R_squared[k] <- 1 - var(pred.rf-data.test.cv$y)/var(data.test.cv$y) } mean(R_squared) ## [1] 0.4753708
It can be seen that the R-squared evaluated by the 10-fold cross-validation is more conservative than the one in Q7 in Chapter 4, which was estimated using the training data.
Q4: Follow up on Q3. Increase the sample size to be 1000, and comment on the result. Solution:
# Step 1 -> Read data into R workstation x1 <- rnorm(mean = 0, sd = 1, n = 1000) # simulate a predictor (x1) with 100 measurements from a normal distribution, while mean = 0 and std = 1. rnorm() is the function to simulate from normal distribution x2 <- rnorm(mean = 0, sd = 1, n = 1000) # simulate another predictor (x2) x3 <- rnorm(mean = 0, sd = 1, n = 1000) # simulate another predictor (x3) x4 <- rnorm(mean = 0, sd = 1, n = 1000) # simulate another predictor (x4) beta1 <- 1 # the regression coefficient of the first predictor = 1 43
beta2 <- 1 # the regression coefficient of the second predictor = 0.5 mu <- beta1 * x1 + beta2 * x2 # with simulated values of x1 and x2, and the c oefficients, we can calculate the mean levels of the outcome variable y <- rnorm(1000, mu, 1) # further, simulate the outcome variable. remember, y = f(x) + error. Here, the error term is N(0,1) data = data.frame(y,x1,x2,x3,x4) # Step 2 -> Use 10-fold cross-validation to evaluate all the models library(randomForest) ## randomForest 4.6-14 ## Type rfNews() to see new features/changes/bug fixes. # First, let me use 10-fold cross-validation to evaluate the performance of m odel1 n_folds = 10 # number of fold (the parameter K as we say, K-fold cross valida tion) N <- dim(data)[1] # the sample size, N, of the dataset folds_i <- sample(rep(1:n_folds, length.out = N)) R_squared <- NULL # cv_mse aims to make records of the prediction error for e ach fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data data.train.cv <- data[-test_i, ] # Then, the remaining 9 folds' data form o ur training data data.test.cv <- data[test_i, ] # This is the testing data, from the ith f old rf <- randomForest(y ~ ., data = data.train.cv, ntree = 100) pred.rf <- predict(rf, data.test.cv[,2:5]) R_squared[k] <- 1 - var(pred.rf-data.test.cv$y)/var(data.test.cv$y) } mean(R_squared) ## [1] 0.6098877
With the sample size increased to 1000, the R-squared evaluated by the 10-fold cross-validation is closer to 2/3. We can see that the sample size influences the magnitude of the estimated error; on the other hand, recall that another important value of cross-validation is to provide a fair comparison of models with different complexities. In other words, although cross-validation may not provide an accurate estimate for an individual model, it is still a better way to compare multiple models than simply using the error rate estimated on the training data.
Chapter 6
Q1: In what follows is a summary of the clustering result on a dataset by using the R package mclust. (1) How many samples are in total in this dataset? How many variables? (2) How many clusters are found? What are the sizes of the clusters? (3) What is the fitted GMM model? Please write up its mathematical form. ## ---------------------------------------------------## Gaussian finite mixture model fitted by EM algorithm ## ---------------------------------------------------## ## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model wit h 3 ## components: ## ## log-likelihood n df BIC ICL ## -2303.496 145 29 -4751.316 -4770.169 ## ## Clustering table: ## 1 2 3 ## 81 36 28 ## ## Mixing probabilities: ## 1 2 3 ## 0.5368974 0.2650129 0.1980897 ## ## Means: ## [,1] [,2] [,3] ## glucose 90.96239 104.5335 229.42136 ## insulin 357.79083 494.8259 1098.25990 ## sspg 163.74858 309.5583 81.60001 ## ## Variances: ## [,,1] ## glucose insulin sspg ## glucose 57.18044 75.83206 14.73199 ## insulin 75.83206 2101.76553 322.82294 ## sspg 14.73199 322.82294 2416.99074 ## [,,2] ## glucose insulin sspg ## glucose 185.0290 1282.340 -509.7313 ## insulin 1282.3398 14039.283 -2559.0251 ## sspg -509.7313 -2559.025 23835.7278 ## [,,3] ## glucose insulin sspg ## glucose 5529.250 20389.09 -2486.208 45
## insulin 20389.088 83132.48 -10393.004
## sspg -2486.208 -10393.00 2217.533
Solution: (1) There are in total 81 + 36 + 28 = 145 samples. There are 3 variables used to define the clusters. (2) Three clusters are found. Cluster 1 has 81 samples; Cluster 2 has 36 samples; Cluster 3 has 28 samples. (3) The mathematical form of the fitted model is
x ~ π1 N(μ1, Σ1) + π2 N(μ2, Σ2) + π3 N(μ3, Σ3),
where π1 = 0.5368974, π2 = 0.2650129, and π3 = 0.1980897.
μ1 = (90.96239, 357.79083, 163.74858)^T,
μ2 = (104.5335, 494.8259, 309.5583)^T,
μ3 = (229.42136, 1098.25990, 81.60001)^T.
Σ1 = [ 57.18044    75.83206     14.73199
       75.83206    2101.76553   322.82294
       14.73199    322.82294    2416.99074 ],
Σ2 = [ 185.0290    1282.3398    -509.7313
       1282.3398   14039.283    -2559.025
       -509.7313   -2559.025    23835.7278 ],
Σ3 = [ 5529.250    20389.088    -2486.208
       20389.088   83132.48     -10393.00
       -2486.208   -10393.00    2217.533 ].
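For reference, these quantities can also be read directly off the fitted object returned by Mclust(); a sketch, assuming the fit shown above is stored in an object called mod (the name is an assumption):
mod$parameters$pro             # mixing probabilities pi_1, pi_2, pi_3
mod$parameters$mean            # cluster means (one column per cluster)
mod$parameters$variance$sigma  # cluster covariance matrices (one slice per cluster)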
Q2: Consider the following dataset that has 9 data points. Let’s use it to estimate a GMM model with 3 clusters. The initial values are shown in the table below
ID Label
1.53 C1
0.57 C3
2.56 C1
1.22 C2
4.13 C2
6.03 C2
0.98 C1
5.21 C2
-0.37 C3
(1) Write up the Gaussian mixture model (GMM) that you want to estimate.
Solution: The GMM model with three clusters consists of the following parameters:
For the first cluster: π1 (the prior probability that a data point belongs to the first cluster), μ1 and σ1² (the mean and variance of the normal distribution of the first cluster).
For the second cluster: π2 (the prior probability that a data point belongs to the second cluster), μ2 and σ2² (the mean and variance of the normal distribution of the second cluster).
For the third cluster: π3 (the prior probability that a data point belongs to the third cluster), μ3 and σ3² (the mean and variance of the normal distribution of the third cluster).
And by definition, π1 + π2 + π3 = 1.
(2) Estimate the parameters of your GMM model.
Solution: This consists of the same estimation procedure for the three clusters:
For the first cluster, from the initial assignment we know that the data points 1.53, 2.56, 0.98 can be used to estimate μ1 and σ1². Using either manual calculation or the R functions mean() and var(), we get μ1 = 1.6900, σ1² = 0.6433. And π1 = 3/9 = 0.3333.
For the second cluster, from the initial assignment we know that the data points 1.22, 4.13, 6.03, 5.21 can be used to estimate μ2 and σ2². Using either manual calculation or the R functions mean() and var(), we get μ2 = 4.1475, σ2² = 4.4144. And π2 = 4/9 = 0.4444.
For the third cluster, from the initial assignment we know that the data points 0.57, -0.37 can be used to estimate μ3 and σ3². Using either manual calculation or the R functions mean() and var(), we get μ3 = 0.1000, σ3² = 0.4418. And π3 = 2/9 = 0.2222.
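A quick R check of these estimates, using the initial labels from the table:
x <- c(1.53, 0.57, 2.56, 1.22, 4.13, 6.03, 0.98, 5.21, -0.37)
label <- c("C1", "C3", "C1", "C2", "C2", "C2", "C1", "C2", "C3")
sapply(split(x, label), mean)    # 1.6900, 4.1475, 0.1000
sapply(split(x, label), var)     # 0.6433, 4.4144, 0.4418
table(label) / length(label)     # 0.3333, 0.4444, 0.2222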
(3) Update the labels with your estimated parameters.
ID      Label
1.53 C1
0.57 C3
2.56 C1
1.22 C1
4.13 C2
6.03 C2
0.98 C1
5.21 C2
-0.37 C3
Solution: Then, we can update the labels of the data points by computing the likelihoods:
p(x1|z11 = 1, Θ(0)) = 0.4875, p(x1|z12 = 1, Θ(0)) = 0.0873, p(x1|z13 = 1, Θ(0)) = 0.0593,
p(x2|z21 = 1, Θ(0)) = 0.2761, p(x2|z22 = 1, Θ(0)) = 0.1427, p(x2|z23 = 1, Θ(0)) = 0.0006,
p(x3|z31 = 1, Θ(0)) = 0.3362, p(x3|z32 = 1, Θ(0)) = 0.0609, p(x3|z33 = 1, Θ(0)) = 0.2498,
p(x4|z41 = 1, Θ(0)) = 0.0004, p(x4|z42 = 1, Θ(0)) = 0.072, p(x4|z43 = 1, Θ(0)) = 0.0000,
p(x5|z51 = 1, Θ(0)) = 0.0049, p(x5|z52 = 1, Θ(0)) = 0.1899, p(x5|z53 = 1, Θ(0)) = 0.1451,
p(x6|z61 = 1, Θ(0)) = 0.0000, p(x6|z62 = 1, Θ(0)) = 0.1271, p(x6|z63 = 1, Θ(0)) = 0.0000,
p(x7|z71 = 1, Θ(0)) = 0.0000, p(x7|z72 = 1, Θ(0)) = 0.1671, p(x7|z73 = 1, Θ(0)) = 0.0000,
p(x8|z81 = 1, Θ(0)) = 0.1876, p(x8|z82 = 1, Θ(0)) = 0.0446, p(x8|z83 = 1, Θ(0)) = 0.4674,
p(x9|z91 = 1, Θ(0)) = 0.0184, p(x9|z92 = 1, Θ(0)) = 0.0188, p(x9|z93 = 1, Θ(0)) = 0.4674.
Thus, we can see that
p(z11 = 1|X, Θ(0)) = 0.4875×0.3333 / (0.4875×0.3333 + 0.087×0.4444 + 0.0593×0.2222) = 0.7575;
p(z12 = 1|X, Θ(0)) = 0.087×0.4444 / (0.4875×0.3333 + 0.087×0.4444 + 0.0593×0.2222) = 0.1811;
p(z13 = 1|X, Θ(0)) = 0.0593×0.2222 / (0.4875×0.3333 + 0.087×0.4444 + 0.0593×0.2222) = 0.0614.
Thus, this data point is assigned to Cluster 1 (C1). Do the same for all the other 8 data points.
p(z21 = 1|X, Θ(0)) = 0.3358; p(z22 = 1|X, Θ(0)) = 0.1063; p(z23 = 1|X, Θ(0)) = 0.5578. Thus, this data point is assigned to Cluster 3 (C3).
p(z31 = 1|X, Θ(0)) = 0.5915; p(z32 = 1|X, Θ(0)) = 0.4076; p(z33 = 1|X, Θ(0)) = 0.0009. Thus, this data point is assigned to Cluster 1 (C1).
p(z41 = 1|X, Θ(0)) = 0.6850; p(z42 = 1|X, Θ(0)) = 0.1568; p(z43 = 1|X, Θ(0)) = 0.1582. Thus, this data point is assigned to Cluster 1 (C1).
p(z51 = 1|X, Θ(0)) = 0.0189; p(z52 = 1|X, Θ(0)) = 0.9811; p(z53 = 1|X, Θ(0)) = 0.0000. Thus, this data point is assigned to Cluster 2 (C2).
p(z61 = 1|X, Θ(0)) = 0.0000; p(z62 = 1|X, Θ(0)) = 1; p(z63 = 1|X, Θ(0)) = 0.0000. Thus, this data point is assigned to Cluster 2 (C2).
p(z71 = 1|X, Θ(0)) = 0.5756; p(z72 = 1|X, Θ(0)) = 0.1391; p(z73 = 1|X, Θ(0)) = 0.2852. Thus, this data point is assigned to Cluster 1 (C1).
p(z81 = 1|X, Θ(0)) = 0.0001; p(z82 = 1|X, Θ(0)) = 0.9999; p(z83 = 1|X, Θ(0)) = 0.0000. Thus, this data point is assigned to Cluster 2 (C2).
p(z91 = 1|X, Θ(0)) = 0.0518; p(z92 = 1|X, Θ(0)) = 0.0707; p(z93 = 1|X, Θ(0)) = 0.8775. Thus, this data point is assigned to Cluster 3 (C3).
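These posterior probabilities (and the resulting hard assignments) can be reproduced with dnorm() in R; a sketch using the parameter estimates from step (2):
x <- c(1.53, 0.57, 2.56, 1.22, 4.13, 6.03, 0.98, 5.21, -0.37)
w <- c(0.3333, 0.4444, 0.2222)   # pi_1, pi_2, pi_3
mu <- c(1.6900, 4.1475, 0.1000)
s2 <- c(0.6433, 4.4144, 0.4418)
lik <- sapply(1:3, function(k) dnorm(x, mu[k], sqrt(s2[k])))  # 9 x 3 matrix of likelihoods
post <- lik * rep(w, each = length(x))
post <- post / rowSums(post)     # normalize each row to get the posterior probabilities
apply(post, 1, which.max)        # hard assignments: 1 3 1 1 2 2 1 2 3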
(4) Estimate the parameters again.
Solution: This consists of the same estimation procedure for the three clusters:
For the first cluster, from the updated assignment we know that the data points 1.53, 2.56, 1.22, 0.98 can be used to estimate μ1 and σ1². Using either manual calculation or the R functions mean() and var(), we get μ1 = 1.5725, σ1² = 0.4841. And π1 = 4/9 = 0.4444.
For the second cluster, from the updated assignment we know that the data points 4.13, 6.03, 5.21 can be used to estimate μ2 and σ2². Using either manual calculation or the R functions mean() and var(), we get μ2 = 5.1233, σ2² = 0.9081. And π2 = 3/9 = 0.3333.
For the third cluster, from the updated assignment (the same as the initial assignment), we know that the data points 0.57, -0.37 can be used to estimate μ3 and σ3². Using either manual calculation or the R functions mean() and var(), we get μ3 = 0.1, σ3² = 0.4418. And π3 = 2/9 = 0.2222.
Note: this is the basic idea used in many iterative clustering algorithms, and it is very similar to the EM algorithm, but it is not exactly the same as the iterations in the EM algorithm. In the EM algorithm, the data points are not hard-assigned to clusters using the hard thresholding used here. Rather, the EM algorithm treats each data point with probabilities (thus, a soft thresholding). More details can be found in the textbook.
Q3: Follow up on the dataset in Q2. Use the R pipeline for clustering on this data. Compare the result from R and the result by your manual calculation. Solution: X <- rbind(1.53, 0.57, 2.56, 1.22, 4.13, 6.03, 0.98, 5.21, -0.37) require(mclust) ## Loading required package: mclust ## Package 'mclust' version 5.4.6 ## Type 'citation("mclust")' for citing this R package in publications. q4.Mclust <- Mclust(X, G=3) summary(q4.Mclust,parameters = TRUE) ## ---------------------------------------------------## Gaussian finite mixture model fitted by EM algorithm ## ---------------------------------------------------## ## Mclust E (univariate, equal variance) model with 3 components: ## ## log-likelihood n df BIC ICL ## -17.03782 9 6 -47.25899 -51.76736 ## ## Clustering table: ## 1 2 3 ## 3 3 3 ## ## Mixing probabilities: ## 1 2 3 ## 0.3422112 0.3243824 0.3334065 ## ## Means: ## 1 2 3 ## 0.6462368 1.5456761 5.1179254 ## 50
## Variances: ## 1 2 3 ## 0.6121298 0.6121298 0.6121298 q4.Mclust$classification ## [1] 2 1 2 2 3 3 1 3 1
This means that mclust put the 9 data points into three clusters. Three clusters: {ID=1,3,4}, {ID=2,7,9}, {ID=5,6,8}
While our result is: three clusters: {ID=1,3,4,7}, {ID=2,9}, {ID=5,6,8}. The results are very similar. Note that our result (1) only has one iteration (thus it is not final, but it progresses stably toward the final convergence); (2) starts with a different initial assignment of clusters than mclust (mclust uses a random assignment); and (3) mclust uses the complete EM algorithm, while our manual calculation uses a slightly simplified EM iteration (we use hard thresholding to assign data points to clusters, while the EM algorithm uses soft thresholding).
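The agreement between the two clusterings can also be quantified with mclust's adjustedRandIndex(); a small sketch, with our manual assignment entered by hand in the order of the data points:
manual <- c(1, 3, 1, 1, 2, 2, 1, 2, 3)   # our labels: C1, C3, C1, C1, C2, C2, C1, C2, C3
adjustedRandIndex(q4.Mclust$classification, manual)   # 1 would indicate identical partitions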
Q4: Consider the following dataset that has 10 data points. Let's use it to estimate a GMM model with 3 clusters. The initial values are shown in the table below.
Data    2.22   6.33   3.15   -0.89   3.21   1.10   1.58   0.03   8.05   0.26
Label   C1     C3     C1     C2      C2     C2     C1     C2     C3     C2
(1) Write up the Gaussian mixture model (GMM) that you want to estimate.
(2) Estimate the parameters of your GMM model.
(3) Update the labels with your estimated parameters. You could use rnorm() in R to help calculate the likelihoods.
Data    2.22   6.33   3.15   -0.89   3.21   1.10   1.58   0.03   8.05   0.26
Label
(4) Estimate the parameters again.
Solution:
(1) The GMM model is 𝑥~𝜋! 𝑁(𝜇! , 𝜎!" ) + 𝜋" 𝑁(𝜇" , 𝜎"" ) + 𝜋$ 𝑁(𝜇$ , 𝜎$" ). (2) # Estimate parameters c1 <- c(2.22, 3.15, 1.58) meanA <- mean(c1) varianceA <- var(c1) wa <- 3/10 c2 <- c(-0.89, 3.21, 1.10, 0.03, 0.26) meanB <- mean(c2) varianceB <- var(c2) wb <- 1/2 c3 <- c(6.33, 8.05) meanC <- mean(c3) varianceC <- var(c3) wc <- 2/10 param <- data.frame(mean = c(meanA, meanB, meanC), Var=c(varianceA, varianceB, varianceC), weight = c(wa, wb, wc)) rownames(param) <- c("Cluster 1", "Cluster 2", "Cluster 3") param ## mean Var weight ## Cluster 1 2.316667 0.6232333 0.3 ## Cluster 2 0.742000 2.4054700 0.5 ## Cluster 3 7.190000 1.4792000 0.2 (3) # Iterate clustering ID <- c(2.22,6.33,3.15,-0.89,3.21,1.10,1.58,0.03,8.05,0.26) results <- matrix(ncol=length(ID), nrow=1) probabilities <- matrix(data = NA, nrow = length(ID), ncol = 3) colnames(probabilities) <- c("c1", "c2", "c3") # Loop for estimating the probabilities for (i in 1:length(ID)) { x <- ID[i] # We manually calculate the probabilities Prob_X_A <- (1/sqrt(2*meanA*(varianceA^2)))*exp(-(((x-meanA)^2)/(2*(varianceA ^2)))) Prob_X_B <- (1/sqrt(2*meanB*(varianceB^2)))*exp(-(((x-meanB)^2)/(2*(varianceB ^2)))) Prob_X_C <- (1/sqrt(2*meanC*(varianceC^2)))*exp(-(((x-meanC)^2)/(2*(varianceC ^2)))) prob_A_X <- (Prob_X_A*wa)/((Prob_X_A*wa)+(Prob_X_B*wb)+(Prob_X_C*wc)) prob_B_X <- (Prob_X_B*wb)/((Prob_X_A*wa)+(Prob_X_B*wb)+(Prob_X_C*wc)) prob_C_X <- (Prob_X_C*wc)/((Prob_X_A*wa)+(Prob_X_B*wb)+(Prob_X_C*wc)) # save likelihoods probabilities[i,1] <- prob_A_X 52
probabilities[i,2] <- prob_B_X probabilities[i,3] <- prob_C_X # Choose dependent on max likelihood results[,i] <- which.max(probabilities[i,]) }# print results results ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 1 3 2 2 2 2 2 2 3 2 Data
      2.22  6.33  3.15  -0.89  3.21  1.10  1.58  0.03  8.05  0.26
Label  1     3     2     2      2     2     2     2     3     2
(4) # Estimate parameters c1 <- c(2.22) meanA <- mean(c1) varianceA <- var(c1) wa <- 1/10 c2 <- c(3.15, -0.89, 3.21, 1.10, 1.58, 0.03, 0.26) meanB <- mean(c2) varianceB <- var(c2) wb <- 7/10 c3 <- c(6.33, 8.05) meanC <- mean(c3) varianceC <- var(c3) wc <- 2/10 param <- data.frame(mean = c(meanA, meanB, meanC), Var=c(varianceA, varianceB, varianceC), weight = c(wa, wb, wc)) rownames(param) <- c("Cluster 1", "Cluster 2", "Cluster 3") param ## mean Var weight ## Cluster 1 2.220000 NA 0.1 ## Cluster 2 1.205714 2.436229 0.7 ## Cluster 3 7.190000 1.479200 0.2
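Note that the Gaussian likelihoods needed in step (3) can also be computed with dnorm(); below is a minimal sketch using the step-(2) estimates (restated explicitly), where the cluster probabilities are the normalized weighted densities. This is only an alternative way to obtain the likelihoods, not the calculation printed above.
means <- c(2.316667, 0.742, 7.19)      # step-(2) cluster means
vars  <- c(0.6232333, 2.40547, 1.4792) # step-(2) cluster variances
wts   <- c(0.3, 0.5, 0.2)              # step-(2) mixing weights
x <- 2.22                              # one data point
post <- wts * dnorm(x, mean = means, sd = sqrt(vars))
post / sum(post)                       # probabilities of belonging to clusters 1, 2, 3
which.max(post)                        # label with the highest probability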
Q5: Design a simulation experiment to test the effectiveness of the mclust R package. For instance, simulate a three-cluster structure in your dataset by this GMM model:
$$\boldsymbol{x} \sim \pi_1 N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \pi_2 N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2) + \pi_3 N(\boldsymbol{\mu}_3, \boldsymbol{\Sigma}_3),$$
where $\pi_1 = 0.5$, $\pi_2 = 0.25$, and $\pi_3 = 0.25$,
$$\boldsymbol{\mu}_1 = \begin{bmatrix} 5 \\ 3 \\ 3 \end{bmatrix}, \quad \boldsymbol{\mu}_2 = \begin{bmatrix} 10 \\ 5 \\ 1 \end{bmatrix}, \quad \boldsymbol{\mu}_3 = \begin{bmatrix} -5 \\ 10 \\ -2 \end{bmatrix},$$
$$\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
Then, use the mclust package on this dataset and see if the true clustering structure could be recovered.
Solution:
# simulate the data #cluster 1 x1.C1 <- rnorm(100,mean =5, sd = 1) x2.C1 <- rnorm(100,mean =3, sd = 1) x3.C1 <- rnorm(100,mean =3, sd = 1) #cluster 2 x1.C2 <- rnorm(50,mean =10, sd = 1) x2.C2 <- rnorm(50,mean =5, sd = 1) x3.C2 <- rnorm(50,mean =1, sd = 1) #cluster 3 x1.C3 <- rnorm(50,mean =-5, sd = 1) x2.C3 <- rnorm(50,mean =10, sd = 1) x3.C3 <- rnorm(50,mean =-2, sd = 1) #combine all the data in cluster 1 x1 = cbind(x1.C1,x2.C1,x3.C1) #combine all the data in cluster 2 x2 = cbind(x1.C2,x2.C2,x3.C2) #combine all the data in cluster 3 x3 = cbind(x1.C3,x2.C3,x3.C3) #combine all data into X X <- rbind(x1,x2,x3) # Use BIC to select the number of clusters require(mclust) ## Loading required package: mclust ## Package 'mclust' version 5.4.6 ## Type 'citation("mclust")' for citing this R package in publications. BIC <- mclustBIC(X) summary(BIC)
## Best BIC values: ## EII,3 VII,3 EEI,3 ## BIC -2155.645 -2165.5987 -2166.02566 ## BIC diff 0.000 -9.9534 -10.38041 # perform clustering on the final model md.final <- Mclust(X, x = BIC) summary(md.final) ## ---------------------------------------------------## Gaussian finite mixture model fitted by EM algorithm ## ---------------------------------------------------## ## Mclust EII (spherical, equal volume) model with 3 components: ## ## log-likelihood n df BIC ICL ## -1046.033 200 12 -2155.645 -2155.677 ## ## Clustering table: ## 1 2 3 ## 100 50 50 plot(md.final, what = c("classification")) # It seems that the three cluster s are effectively recovered.
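Beyond the visual check, the recovery can be quantified by comparing the estimated classification with the known simulated labels; a small sketch using mclust's adjustedRandIndex(), where true.labels simply encodes the order in which the three clusters were simulated (100, 50, and 50 points):
true.labels <- c(rep(1, 100), rep(2, 50), rep(3, 50))
table(true.labels, md.final$classification)              # cross-tabulation of true vs. estimated clusters
adjustedRandIndex(true.labels, md.final$classification)  # a value of 1 indicates perfect recovery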
Q6: Follow up on the simulation experiment in Q5, by increasing the noise level of the GMM model:
$$\boldsymbol{x} \sim \pi_1 N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \pi_2 N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2) + \pi_3 N(\boldsymbol{\mu}_3, \boldsymbol{\Sigma}_3),$$
where $\pi_1 = 0.5$, $\pi_2 = 0.25$, and $\pi_3 = 0.25$,
$$\boldsymbol{\mu}_1 = \begin{bmatrix} 5 \\ 3 \\ 3 \end{bmatrix}, \quad \boldsymbol{\mu}_2 = \begin{bmatrix} 10 \\ 5 \\ 1 \end{bmatrix}, \quad \boldsymbol{\mu}_3 = \begin{bmatrix} -5 \\ 10 \\ -2 \end{bmatrix},$$
$$\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 3 \end{bmatrix}, \quad \boldsymbol{\Sigma}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$
Then, use the mclust package on this dataset and see if the true clustering structure could be recovered.
Solution:
# simulate the data
#cluster 1
x1.C1 <- rnorm(100,mean =5, sd = 4)
x2.C1 <- rnorm(100,mean =3, sd = 4)
x3.C1 <- rnorm(100,mean =3, sd = 3)
#cluster 2
x1.C2 <- rnorm(50,mean =10, sd = 4)
x2.C2 <- rnorm(50,mean =5, sd = 4)
x3.C2 <- rnorm(50,mean =1, sd = 4)
#cluster 3
x1.C3 <- rnorm(50,mean =-5, sd = 1)
x2.C3 <- rnorm(50,mean =10, sd = 1)
x3.C3 <- rnorm(50,mean =-2, sd = 1)
#combine all the data in cluster 1
x1 = cbind(x1.C1,x2.C1,x3.C1)
#combine all the data in cluster 2
x2 = cbind(x1.C2,x2.C2,x3.C2)
#combine all the data in cluster 3
x3 = cbind(x1.C3,x2.C3,x3.C3)
#combine all data into X
X <- rbind(x1,x2,x3)
# Use BIC to select the number of clusters
require(mclust)
## Loading required package: mclust
## Package 'mclust' version 5.4.6
## Type 'citation("mclust")' for citing this R package in publications.
BIC <- mclustBIC(X)
summary(BIC)
## Best BIC values: ## VEI,2 VII,2 VVI,2 ## BIC -3272.655 -3274.563396 -3281.640237 ## BIC diff 0.000 -1.908408 -8.985249 # perform clustering on the final model md.final <- Mclust(X, x = BIC) summary(md.final) ## ---------------------------------------------------## Gaussian finite mixture model fitted by EM algorithm ## ---------------------------------------------------## ## Mclust VEI (diagonal, equal shape) model with 2 components: ## ## log-likelihood n df BIC ICL ## -1607.187 200 11 -3272.655 -3272.993 ## ## Clustering table: ## 1 2 ## 150 50 plot(md.final, what = c("classification"))# It seems that two clusters are m erged into one. Due to the high noise level, the boundary between the two mer ged clusters was not correctly recognized by the algorithm.
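As in Q5, cross-tabulating the estimated classification against the simulated labels makes the merging of the two clusters explicit; a small sketch, again assuming the simulation order 100, 50, 50:
true.labels <- c(rep(1, 100), rep(2, 50), rep(3, 50))
table(true.labels, md.final$classification)              # shows which true clusters were merged
adjustedRandIndex(true.labels, md.final$classification)  # noticeably below 1 when clusters are merged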
Q7: Design a simulation experiment to test the effectiveness of the diagnostic tools in the ggfortify R package. For instance, use the same simulation procedure that has been used in Q8 of Chapter 2 to design a linear regression model with two variables, simulate 100 samples from this model, fit the model, and draw the diagnostic figures. Solution: # Simulate a dataset by ourselves, to see what should be the residual plots l ook like when there is no violation of the basic assumptions such as linearit y, gausian error, independence of errors, etc. x1 <- rnorm(100, 0, 1) # simulate a predictor (x1) with 100 measurements from a normal distribution, while mean = 0 and std = 1. rnorm() is the function t o simulate from normal distribution x2 <- rnorm(100, 0, 1) # simulate another predictor (x2) beta1 <- 1 # the regression coefficient of the first predictor = 1 beta2 <- 1 # the regression coefficient of the second predictor = 1 mu <- beta1 * x1 + beta2 * x2 # with simulated values of x1 and x2, and the c oefficients, we can calculate the mean levels of the outcome variable y <- rnorm(100, mu, 1) # further, simulate the outcome variable. remember, y = f(x) + error. Here, the error term is N(0,1) lm.XY <- lm(y ~ ., data = data.frame(y,x1,x2)) # Now, let's fit the linear re gression model summary(lm.XY) ## ## Call: ## lm(formula = y ~ ., data = data.frame(y, x1, x2)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.53421 -0.55067 0.05602 0.66604 2.52490 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.19036 0.10103 1.884 0.0625 . ## x1 1.04049 0.08709 11.947 <2e-16 *** ## x2 1.04129 0.09493 10.969 <2e-16 *** ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.9999 on 97 degrees of freedom ## Multiple R-squared: 0.7274, Adjusted R-squared: 0.7217 ## F-statistic: 129.4 on 2 and 97 DF, p-value: < 2.2e-16 58
# Conduct diagnostics of the model library("ggfortify") ## Loading required package: ggplot2 autoplot(lm.XY, which = 1:6, ncol = 3, label.size = 3) #
compare this with the results from
a real-world data analysis
It looks like there are some data points outstanding, but not all figures point out the same observations, nor the outstanding points extraordinary abnormal. This experiment helps us to establish an expectation of what we may see when applying these diagnostic tools in practice. In other words, it is a baseline that our practice could be compared with. Q8: Follow up on the simulation experiment in Q7. Add a few outliers into your dataset and see if the diagnostic tools in the ggfortify R package can detect them. Solution: # x1 <- rnorm(100, 0, 1) # simulate a predictor (x1) with 100 measurements from a normal distribution, while mean = 0 and std = 1. rnorm() is the function t o simulate from normal distribution x2 <- rnorm(100, 0, 1) # simulate another predictor (x2) 59
beta1 <- 1 # the regression coefficient of the first predictor = 1 beta2 <- 1 # the regression coefficient of the second predictor = 1 mu <- beta1 * x1 + beta2 * x2 # with simulated values of x1 and x2, and the c oefficients, we can calculate the mean levels of the outcome variable y <- rnorm(100, mu, 1) # further, simulate the outcome variable. remember, y = f(x) + error. Here, the error term is N(0,1) # create two outliers in the dataset y[1] <- 10*y[1] y[10] <- 0.1*y[10] lm.XY <- lm(y ~ ., data = data.frame(y,x1,x2)) # Now, let's fit the linear re gression model summary(lm.XY) ## ## Call: ## lm(formula = y ~ ., data = data.frame(y, x1, x2)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.6578 -0.7336 -0.1007 0.5160 13.0745 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.004376 0.172791 -0.025 0.98 ## x1 1.063503 0.175404 6.063 2.56e-08 *** ## x2 1.290247 0.160670 8.030 2.31e-12 *** ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.64 on 97 degrees of freedom ## Multiple R-squared: 0.5011, Adjusted R-squared: 0.4908 ## F-statistic: 48.72 on 2 and 97 DF, p-value: 2.256e-15 # Conduct diagnostics of the model library("ggfortify") autoplot(lm.XY, which = 1:6, ncol = 3, label.size = 3) # compare this with th e results from a real-world data analysis
Different from the previous experiment, here, the data point #1 is found to be extraordinary by all figures. Also, note that another outlier, data point #10, is not detected. If we remove data point #1 and redo the analysis, we may detect #10 – don’t forget there is dependency between the results. The analysis process matters when sorting through our results.
Chapter 7 Q1: To build a linear SVM on the following data, how many support vectors are needed?
Solution: 3 support vectors are needed.
Q2: Here let's consider the following dataset.

ID    X1    X2   X3    Y
1     4      1    1    1
2     4     -1    0    1
3     8      2    1    1
4    -2.5    0    0   -1
5     0      1    1   -1
6    -0.3   -1    0   -1
7     2.5   -1    1   -1
8    -1      1    0   -1
(1) Identify the support vectors if you’d like to build a linear SVM classifier Solution: x = matrix(c(4,4,8,-2.5,0,-0.3,2.5,-1,1,-1,2,0,1,-1,-1,1,1,0,1,0,1,0,1,0), nr ow = 8, ncol = 3) y = c(1,1,1,-1,-1,-1,-1,-1) linear.train <- data.frame(x,y) require( 'kernlab' ) ## Loading required package: kernlab
linear.svm <- ksvm(y ~ ., data=linear.train, type='C-svc', kernel='vanilladot', C = 10, scale = c(), scaled = FALSE)
##
Setting default kernel parameters
alphaindex(linear.svm) ## [[1]] ## [1] 1 2 7
Another approach to solve this problem is to visualize the data, i.e., we can draw three scatterplots: X1 versus X2, X1 versus X3, and X2 versus X3:
You can see that the figure of X1 versus X2 and the figure of X1 versus X3 show a set of three data points that could form the support vectors and separate the two classes with the maximum margin. A further check of the coordinates of these three data points reveals that they are ID #1, #2, and #7. The figure of X2 versus X3, on the other hand, is misleading, as only 6 dots are shown on it. This is because ID #1 and #5 overlap and are shown as one dot, and ID #2 and #6 overlap and are shown as one dot. Thus, from this figure we can see (with a close reading) that the two classes cannot be separated in the space of X2 and X3 alone, which means we cannot read the locations of the support vectors from this figure.
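A small sketch of the R code that would draw these three scatterplots from the x matrix and y vector defined above (filled points for class 1, open points for class -1):
par(mfrow = c(1, 3))
plot(x[,1], x[,2], pch = ifelse(y == 1, 16, 1), xlab = "X1", ylab = "X2")
plot(x[,1], x[,3], pch = ifelse(y == 1, 16, 1), xlab = "X1", ylab = "X3")
plot(x[,2], x[,3], pch = ifelse(y == 1, 16, 1), xlab = "X2", ylab = "X3")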
(2) Derive the alpha values (i.e., the $\alpha_i$) for the support vectors and the offset parameter $b$.
Solution: For the three support vectors (ID #1, #2, and #7) we can write up four equations: $\sum_j \alpha_j y_j \boldsymbol{x}_j^T \boldsymbol{x}_i + b = y_i$ for each support vector $i$, together with the constraint $\sum_j \alpha_j y_j = 0$ (the sums run over the support vectors).
Plugging in the specific values, these four equations are:
$$18\alpha_1 + 15\alpha_2 - 10\alpha_7 + b = 1,$$
$$15\alpha_1 + 17\alpha_2 - 11\alpha_7 + b = 1,$$
$$10\alpha_1 + 11\alpha_2 - 8.25\alpha_7 + b = -1,$$
$$\alpha_1 + \alpha_2 - \alpha_7 = 0.$$
To solve these equations:
# Solve the equations in (2)
A <- matrix(c(18, 15, -10, 1, 15, 17, -11, 1, 10, 11, -8.25, 1, 1, 1, -1, 0), ncol = 4, nrow = 4, byrow = TRUE)
b <- c(1, 1, -1, 0)
solve(A,b)
Thus, we have $\alpha_1 = 0.1311$, $\alpha_2 = 0.5246$, $\alpha_7 = 0.6557$, and $b = -2.6721$.
(3) Derive the weight vector (i.e., the $\boldsymbol{w}$) of the SVM model.
Solution:
The weight vector is $\boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x}_i = 0.1311 \times (4, 1, 1)^T + 0.5246 \times (4, -1, 0)^T - 0.6557 \times (2.5, -1, 1)^T = (0.9836, 0.2623, -0.5246)^T$, i.e., $(60/61, 16/61, -32/61)^T$.
(4) Predict on the new dataset and fill in the table below on the column of $Y$.

ID    X1    X2   X3   Y
9     5.4   1.2   2
10    1.5  -2     3
11   -3.4   1    -2
12   -2.2  -1    -4
Solution: The decision function is
$$f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} + b = \frac{60}{61}x_1 + \frac{16}{61}x_2 - \frac{32}{61}x_3 - \frac{163}{61}.$$
Our prediction function is $\hat{y} = \mathrm{sign}(f(\boldsymbol{x}))$.
For data point ID#9, we have $f(\boldsymbol{x}) = \frac{60}{61}(5.4) + \frac{16}{61}(1.2) - \frac{32}{61}(2) - \frac{163}{61} = 1.906$; thus, $\hat{y}_9 = 1$.
For data point ID#10, we have $f(\boldsymbol{x}) = \frac{60}{61}(1.5) + \frac{16}{61}(-2) - \frac{32}{61}(3) - \frac{163}{61} = -3.295$; thus, $\hat{y}_{10} = -1$.
For data point ID#11, we have $f(\boldsymbol{x}) = \frac{60}{61}(-3.4) + \frac{16}{61}(1) - \frac{32}{61}(-2) - \frac{163}{61} = -4.71$; thus, $\hat{y}_{11} = -1$.
For data point ID#12, we have $f(\boldsymbol{x}) = \frac{60}{61}(-2.2) + \frac{16}{61}(-1) - \frac{32}{61}(-4) - \frac{163}{61} = -3$; thus, $\hat{y}_{12} = -1$.
The result is summarized in the following table:

ID    X1    X2   X3    Y
9     5.4   1.2   2    1
10    1.5  -2     3   -1
11   -3.4   1    -2   -1
12   -2.2  -1    -4   -1
Q3: Follow up on the dataset used in Q2. Use the R pipeline for SVM on this data. Compare the alpha values, the offset parameter 𝑏, and the weight vector from R and the result by your manual calculation in Q2. Solution: data_test = x = matrix(c(5.4,1.5, -3.4,-2.2,1.2,-2,1,-1,2,3,-2,-4), nrow = 4, ncol = 3) y_hat <- predict(linear.svm, data_test) y_hat ## [1]
1 -1 -1
1
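For the comparison asked for in this question, the alpha values, the offset, and the weight vector can also be extracted from the fitted kernlab model; a small sketch using kernlab's accessor functions (note that kernlab's internal sign convention for the offset returned by b() may differ from the manual derivation in Q2):
alpha(linear.svm)   # alpha values of the support vectors
b(linear.svm)       # offset term, in kernlab's sign convention
# weight vector for the linear kernel: sum of (alpha_i * y_i) * x_i over the support vectors
w <- colSums(coef(linear.svm)[[1]] * xmatrix(linear.svm)[[1]])
w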
Q4: Modify the R pipeline for Bootstrap and incorporate the glm package to write your own version of ensemble learning that ensembles a set of logistic regression models. Test it using the same data that has been used in the R lab for logistic regression model. Solution: library(RCurl) # Step 1 -> Read data into R workstation data <- read.csv(text=getURL("https://raw.githubusercontent.com/shuailab/ind_ 498/master/resource/data/AD.csv")) # Step 2 -> Data preprocessing # Create your X matrix (predictors) and Y vector (outcome variable) X <- data[,2:16] Y <- data$DX_bl Y <- paste0("c", Y) Y <- as.factor(Y) # Then, we integrate everything into a data frame data <- data.frame(X,Y) names(data)[16] = c("DX_bl") # Create a training data (half the original data size) train.ix <- sample(nrow(data),floor( nrow(data)/2) ) data.train <- data[train.ix,] # Create a testing data (half the original data size) data.test <- data[-train.ix,] 66
# Step 3 -> draw R bootstrap replicates, and use rpart to build the decision tree. # draw R bootstrap replicates N <- dim(data.train)[1] P <- dim(data.train)[2]-1 #the response variable is not counted as a feature K <- round(P/2) R <- 50 # Initialize the vector to store the bootstrapped estimates bs_logit <- vector("list", R) # There are R models # draw R bootstrap resamples and obtain the estimates for (i in 1:R) { resam_ID <- sample(c(1:N), N, replace = TRUE) resam_var <- sample(c(1:P), K, replace = FALSE) # also randomly select K fe atures to split resam_Data <- data.train[resam_ID,c(resam_var,P+1)] # P+1 -> the response v ariable logit.model <- glm(DX_bl ~ ., data = resam_Data, family = "binomial") bs_logit[[i]] <- logit.model } # Step 4 -> see one of the models summary(bs_logit[[1]] ) ## ## Call: ## glm(formula = DX_bl ~ ., family = "binomial", data = resam_Data) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.4268 -0.6031 -0.1734 0.5901 2.5014 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 13.77296 3.26057 4.224 2.40e-05 *** ## e2_1 -1.99701 0.98797 -2.021 0.04325 * ## AV45 1.76369 0.85974 2.051 0.04023 * ## AGE -0.03310 0.02665 -1.242 0.21423 ## rs3865444 -0.64393 0.36007 -1.788 0.07372 . ## rs3851179 0.18552 0.36857 0.503 0.61472 ## rs610932 -1.07704 0.39386 -2.735 0.00625 ** ## HippoNV -26.63848 3.74322 -7.116 1.11e-12 *** ## PTGENDER -0.06852 0.36294 -0.189 0.85026 ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 355.79 on 257 degrees of freedom ## Residual deviance: 206.20 on 249 degrees of freedom ## AIC: 224.2 67
## ## Number of Fisher Scoring iterations: 5 # Step 5 -> Predict using your RF model y_hat <- matrix(nrow=dim(data.test)[1], ncol=R) for (i in 1:R){ y_hat_temp <- predict(bs_logit[[i]], data.test) y_hat_temp[which(y_hat_temp > 0)] = "c1" y_hat_temp[which(y_hat_temp < 0)] = "c0" y_hat[,i] <- y_hat_temp } pred.majorityvote <- rep(NA,dim(data.test)[1]) for(i in 1:dim(data.test)[1]){ temp <- table(y_hat[i,]) pred.majorityvote[i] <- names(temp[which.max(temp)]) # get the majority vot e result for sample i } err.logit <- length(which(pred.majorityvote != data.test$DX_bl))/length(pred. majorityvote) print(mean(err.logit)) ## [1] 0.1737452
Q5: Use the dataset PimaIndiansDiabetes2 in the R package mlbench, run the R SVM pipeline on it, and summarize your findings. Solution: We use a dataset from the mlbench R package and run SVM model.
# Step 1 -> Read data into R workstation library(mlbench) data("PimaIndiansDiabetes2") data <- PimaIndiansDiabetes2 data <- na.omit(data) # Step 2 -> Data preprocessing # Create your X matrix (predictors) and Y vector (outcome variable) X <- data[,1:8] Y <- data[,9] # Then, we integrate everything into a data frame data <- data.frame(X,Y) names(data)[9] = c("diabetes") 68
# Create a training data (half the original data size) train.ix <- sample(nrow(data),floor( nrow(data)/2) ) data.train <- data[train.ix,] # Create a testing data (half the original data size) data.test <- data[-train.ix,] # Step 3 -> gather a list of candidate models # SVM: often to compare models with different kernels, different values of C, different set of variables # Use different kernels # kernel='rbfdot': Radial Basis kernel "Gaussian" # kernel='polydot': Polynomial kernel # kernel='vanilladot': Linear kernel # kernel='tanhdot': Hyperbolic tangent kernel # kernel='laplacedot': Laplacian kernel # kernel='besseldot': Bessel kernel # kernel='anovadot': ANOVA RBF kernel # kernel='splinedot': Spline kernel # kernel='stringdot': String kernel # Step 4 -> Use 10-fold cross-validation to evaluate all the models # First, let me use 10-fold cross-validation to evaluate the performance of m odel1 n_folds = 10 # number of fold (the parameter K as we say, K-fold cross valida tion) N <- dim(data.train)[1] # the sample size, N, of the dataset folds_i <- sample(rep(1:n_folds, length.out = N)) cv_err <- NULL # cv_mse aims to make records of the prediction error for each fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data data.train.cv <- data.train[-test_i, ] # Then, the remaining 9 folds' data form our training data data.test.cv <- data.train[test_i, ] # This is the testing data, from the ith fold require( 'kernlab' ) linear.svm <- ksvm(diabetes~., data=data.train.cv, type='C-svc', kernel='va nilladot', C=10) # Fit the linear SVM model with the training data y_hat <- predict(linear.svm, data.test.cv) # Predict on the testing data u sing the trained model true_y <- data.test.cv$diabetes # get the true y values for the testing data cv_err[k] <-length(which(y_hat != true_y))/length(y_hat) }
## Loading required package: kernlab ## ## ## ## ## ## ## ## ## ##
Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters
mean(cv_err) ## [1] 0.255 cv_err <- NULL # cv_mse aims to make records of the prediction error for each fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data data.train.cv <- data.train[-test_i, ] # Then, the remaining 9 folds' data form our training data data.test.cv <- data.train[test_i, ] # This is the testing data, from the ith fold require( 'kernlab' ) rbf.svm <- ksvm(diabetes~., data=data.train.cv, type='C-svc', kernel='rbfdo t', C=10) # Fit the SVM model with the gaussian kernel y_hat <- predict(rbf.svm, data.test.cv) # Predict on the testing data usin g the trained model true_y <- data.test.cv$diabetes # get the true y values for the testing data cv_err[k] <-length(which(y_hat != true_y))/length(y_hat) } mean(cv_err) ## [1] 0.3002632 cv_err <- NULL # cv_mse aims to make records of the prediction error for each fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data data.train.cv <- data.train[-test_i, ] # Then, the remaining 9 folds' data form our training data data.test.cv <- data.train[test_i, ] # This is the testing data, from the ith fold require( 'kernlab' ) 70
poly.svm <- ksvm(diabetes~., data=data.train.cv, type='C-svc', kernel='polyd ot', C=10) # Fit the SVM model with the polynomial kernel y_hat <- predict(poly.svm, data.test.cv) # Predict on the testing data usi ng the trained model true_y <- data.test.cv$diabetes # get the true y values for the testing data cv_err[k] <-length(which(y_hat != true_y))/length(y_hat) } ## ## ## ## ## ## ## ## ## ##
Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters
mean(cv_err) ## [1] 0.255 cv_err <- NULL # cv_mse aims to make records of the prediction error for each fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data data.train.cv <- data.train[-test_i, ] # Then, the remaining 9 folds' data form our training data data.test.cv <- data.train[test_i, ] # This is the testing data, from the ith fold require( 'kernlab' ) spline.svm <- ksvm(diabetes~., data=data.train.cv, type='C-svc', kernel='sp linedot', C=10) # Fit the SVM model with spline kernel y_hat <- predict(spline.svm, data.test.cv) # Predict on the testing data u sing the trained model true_y <- data.test.cv$diabetes # get the true y values for the testing data cv_err[k] <-length(which(y_hat != true_y))/length(y_hat) } ## ## ## ## ## ## ##
Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters 71
## ## ##
Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters
mean(cv_err) ## [1] 0.3371053 # Step 5 -> After model selection, use ksvm() function to build your final mo del rbf.svm <- ksvm(diabetes~., data=data.train, type='C-svc', kernel='rbfdot', C =10) # Step 6 -> Predict using your SVM model y_hat <- predict(rbf.svm, data.test, type = 'response') # Step 7 -> Evaluate the prediction performance of your SVM model # ROC curve is another commonly reported metric for classification models library(pROC) # pROC has the roc() function that is very useful here ## Type 'citation("pROC")' for a citation. ## ## Attaching package: 'pROC' ## The following objects are masked from 'package:stats': ## ## cov, smooth, var y_hat <- predict(rbf.svm, data.test, type = 'decision') plot(roc(data.test$diabetes, y_hat), col="green", main="ROC Curve") ## Setting levels: control = neg, case = pos ## Warning in roc.default(data.test$diabetes, y_hat): Deprecated use a matrix as ## predictor. Unexpected results may be produced, please pass a numeric vecto r. ## Setting direction: controls < cases
Q6: Use R to generate a dataset with two classes as shown in Figure below
Then, run an SVM model with a properly selected kernel function on this dataset. Solution: The following R code will generate a dataset as shown in the figure.
# Step 1 -> Generate a dataset with nonlinear boundary n = 100 p = 2 bottom.left <- matrix(rnorm( n*p, mean=0, sd=1 ),n, p) upper.right <- matrix(rnorm( n*p, mean=4, sd=1 ),n, p) tmp1 <- matrix(rnorm( n*p, mean=0, sd=1 ),n, p) tmp2 <- matrix(rnorm( n*p, mean=4, sd=1 ),n, p) upper.left <- cbind( tmp1[,1], tmp2[,2] ) bottom.right <- cbind( tmp2[,1], tmp1[,2] ) y <- c( rep( 1, 2 * n ), rep( -1, 2 * n ) ) data <- data.frame( x=rbind( bottom.left, upper.right, upper.left, bottom.rig ht ), y=y) # Visualize the distribution of data points of two classes require( 'ggplot2' ) ## Loading required package: ggplot2 p <- qplot( data=data, x.1, x.2, colour=factor(y) ) p <- p + labs(title = "Scatterplot of data points of two classes") print(p)
Then, run SVM model with a properly selected kernel function on this dataset.
# Step 2 -> Data preprocessing # Create your X matrix (predictors) and Y vector (outcome variable) X <- data[,1:2] Y <- data[,3] # Then, we integrate everything into a data frame data <- data.frame(X,Y) # Create a training data (half the original data size) train.ix <- sample(nrow(data),floor( nrow(data)/2) ) data.train <- data[train.ix,] # Create a testing data (half the original data size) data.test <- data[-train.ix,] # Step 3 -> gather a list of candidate models # SVM: often to compare models with different kernels, different values of C, different set of variables # Use different kernels # kernel='rbfdot': Radial Basis kernel "Gaussian" # kernel='polydot': Polynomial kernel # kernel='vanilladot': Linear kernel 74
# kernel='tanhdot': Hyperbolic tangent kernel # kernel='laplacedot': Laplacian kernel # kernel='besseldot': Bessel kernel # kernel='anovadot': ANOVA RBF kernel # kernel='splinedot': Spline kernel # kernel='stringdot': String kernel # Step 4 -> Use 10-fold cross-validation to evaluate all the models # First, let me use 10-fold cross-validation to evaluate the performance of m odel1 n_folds = 10 # number of fold (the parameter K as we say, K-fold cross valida tion) N <- dim(data.train)[1] # the sample size, N, of the dataset folds_i <- sample(rep(1:n_folds, length.out = N)) cv_err <- NULL # cv_mse aims to make records of the prediction error for each fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data data.train.cv <- data.train[-test_i, ] # Then, the remaining 9 folds' data form our training data data.test.cv <- data.train[test_i, ] # This is the testing data, from the ith fold require( 'kernlab' ) linear.svm <- ksvm(Y~., data=data.train.cv, type='C-svc', kernel='vanillado t', C=10) # Fit the linear SVM model with the training data y_hat <- predict(linear.svm, data.test.cv) # Predict on the testing data u sing the trained model true_y <- data.test.cv$Y # get the true y values for the testing data cv_err[k] <-length(which(y_hat != true_y))/length(y_hat) } ## Loading required package: kernlab ## ## ## ## ## ## ## ## ## ##
Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters Setting default kernel parameters
mean(cv_err) ## [1] 0.44 75
cv_err <- NULL # cv_mse aims to make records of the prediction error for each fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data data.train.cv <- data.train[-test_i, ] # Then, the remaining 9 folds' data form our training data data.test.cv <- data.train[test_i, ] # This is the testing data, from the ith fold require( 'kernlab' ) rbf.svm <- ksvm(Y~., data=data.train.cv, type='C-svc', kernel='rbfdot', C=1 0) # Fit the SVM model with the gaussian kernel y_hat <- predict(rbf.svm, data.test.cv) # Predict on the testing data usin g the trained model true_y <- data.test.cv$Y # get the true y values for the testing data cv_err[k] <-length(which(y_hat != true_y))/length(y_hat) } mean(cv_err) ## [1] 0.08 cv_err <- NULL # cv_mse aims to make records of the prediction error for each fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data data.train.cv <- data.train[-test_i, ] # Then, the remaining 9 folds' data form our training data data.test.cv <- data.train[test_i, ] # This is the testing data, from the ith fold require( 'kernlab' ) poly.svm <- ksvm(Y~., data=data.train.cv, type='C-svc', kernel='polydot', C= 10) # Fit the SVM model with the polynomial kernel y_hat <- predict(poly.svm, data.test.cv) # Predict on the testing data usi ng the trained model true_y <- data.test.cv$Y # get the true y values for the testing data cv_err[k] <-length(which(y_hat != true_y))/length(y_hat) } mean(cv_err) ## [1] 0.44 cv_err <- NULL # cv_mse aims to make records of the prediction error for each fold for (k in 1:n_folds) { test_i <- which(folds_i == k) # In each iteration of the 10 iterations, rem ember, we use one fold of data as the te sting data 76
data.train.cv <- data.train[-test_i, ] # Then, the remaining 9 folds' data form our training data data.test.cv <- data.train[test_i, ] # This is the testing data, from the ith fold require( 'kernlab' ) spline.svm <- ksvm(Y~., data=data.train.cv, type='C-svc', kernel='splinedot ', C=10) # Fit the SVM model with spline kernel y_hat <- predict(spline.svm, data.test.cv) # Predict on the testing data u sing the trained model true_y <- data.test.cv$Y # get the true y values for the testing data cv_err[k] <-length(which(y_hat != true_y))/length(y_hat) } mean(cv_err) ## [1] 0.21 # Step 5 -> After model selection, use ksvm() function to build your final mo del rbf.svm <- ksvm(Y~., data=data.train, type='C-svc', kernel='rbfdot', C=10) # (1) The argument, kernel='vanilladot', means that we are going to build a lin ear SVM model; (2) C=10 is the tolerance parameter (similarly as the penalty parameter in LASSO, C in SVM is to balance two objectives - one to maximize m argin, another one to reduce errors) # Step 6 -> Predict using your SVM model y_hat <- predict(rbf.svm, data.test, type = 'response') # Step 7 -> Evaluate the prediction performance of your SVM model # ROC curve is another commonly reported metric for classification models library(pROC) # pROC has the roc() function that is very useful here ## Type 'citation("pROC")' for a citation. ## ## Attaching package: 'pROC' ## The following objects are masked from 'package:stats': ## ## cov, smooth, var y_hat <- predict(rbf.svm, data.test, type = 'decision') plot(roc(data.test$Y, y_hat), col="green", main="ROC Curve") ## Setting levels: control = -1, case = 1 ## Warning in roc.default(data.test$Y, y_hat): Deprecated use a matrix as ## predictor. Unexpected results may be produced, please pass a numeric vecto r. ## Setting direction: controls < cases 77
Q7: Follow up on the dataset generated in Q6. Try visualizing the decision boundaries by different kernel functions such as linear, laplace, gaussian, and polynomial kernel functions. Below is one example using gaussian kernel with its parameter sigma = 0.2. The blackened points are support vectors, and the contour reflects the characteristics of the decision boundary. require( 'kernlab' ) rbf.svm <- ksvm(y ~ ., data=data, type='C-svc', kernel='rbfdot', kpar=list(sigma=0.2), C=100, scale=c()) plot(rbf.svm, data=data)
Please follow this example and visualize linear, laplace, gaussian, and polynomial kernel functions with different parameter values. 78
Solution: Visualize gaussian kernel with its parameter sigma taking different values. require( 'kernlab' ) # Train a nonlinear SVM rbf.svm <- ksvm(y ~ ., data=data, type='C-svc', kernel='rbfdot', kpar=list(sigma=0.2), C=100, scale=c()) plot(rbf.svm, data=data)
rbf.svm <- ksvm(y ~ ., data=data, type='C-svc', kernel='rbfdot', kpar=list(sigma=1), C=100, scale=c()) plot(rbf.svm, data=data)
rbf.svm <- ksvm(y ~ ., data=data, type='C-svc', kernel='rbfdot', kpar=list(sigma=3), C=100, scale=c()) plot(rbf.svm, data=data)
Visualize linear kernel. 80
# linear linear.svm <- ksvm(y ~ ., data=data, type='C-svc', kernel='vanilladot', C=100, scale=c()) ##
Setting default kernel parameters
plot(linear.svm, data=data)
Visualize laplace kernel. # laplace laplace.svm <- ksvm(y ~ ., data=data, type='C-svc', kernel='laplacedot', C=100, scale=c()) plot(laplace.svm, data=data)
Visualize polynomial kernel with its parameter taking different values. # poly poly.svm <- ksvm(y ~ ., data=data, type='C-svc', kernel='poly', kpar=list(degree = 1), C=100, scale=c()) plot(poly.svm, data=data)
poly.svm <- ksvm(y ~ ., data=data, type='C-svc', kernel='poly', kpar=list(degree = 3), C=100, scale=c()) plot(poly.svm, data=data)
poly.svm <- ksvm(y ~ ., data=data, type='C-svc', kernel='poly', kpar=list(degree = 5), C=100, scale=c()) plot(poly.svm, data=data)
Chapter 8 Q1: In the following path solution trajectory figure generated by applying glmnet() on a dataset with ten predictors, which two variables are the top two significant variables (note the index of the variables are shown in the right end of the figure)?
Solution: Variables 1 and 2.
Q2: Let's consider the following dataset.

X1      X2      Y
-0.15   -0.48    0.46
-0.72   -0.54   -0.37
 1.36   -0.91   -0.27
 0.61    1.59    1.35
-1.11    0.34   -0.11

Set an initial value for lambda = 1, beta1 = 0, and beta2 = 1. Implement the Shooting algorithm by your manual operation. Get updated values of beta1 and beta2. Do one iteration.
Solution: Suppose that we choose $\lambda = 1$. First, we initiate the regression parameters as $\hat{\beta}_1^{(0)} = 0$ and $\hat{\beta}_2^{(0)} = 1$.
In the first iteration, we aim to update $\hat{\beta}_1$. We can obtain that
$$\boldsymbol{y} - \mathbf{X}_{(:,2)}\hat{\beta}_2^{(0)} = \begin{bmatrix} 0.46 \\ -0.37 \\ -0.27 \\ 1.35 \\ -0.11 \end{bmatrix} - \begin{bmatrix} -0.48 \\ -0.54 \\ -0.91 \\ 1.59 \\ 0.34 \end{bmatrix}\hat{\beta}_2^{(0)} = \begin{bmatrix} 0.94 \\ 0.17 \\ 0.64 \\ -0.24 \\ -0.45 \end{bmatrix}.$$
Thus,
$$q_1 = \mathbf{X}_{(:,1)}^T\left(\boldsymbol{y} - \mathbf{X}_{(:,2)}\hat{\beta}_2^{(0)}\right) = 0.9601.$$
As $q_1 - \lambda/2 = 0.4601 > 0$, we know that $\hat{\beta}_1^{(1)} = q_1 - \lambda/2 = 0.4601$.
Similarly, we can update $\hat{\beta}_2$. We can obtain that
$$\boldsymbol{y} - \mathbf{X}_{(:,1)}\hat{\beta}_1^{(1)} = \begin{bmatrix} 0.46 \\ -0.37 \\ -0.27 \\ 1.35 \\ -0.11 \end{bmatrix} - \begin{bmatrix} -0.15 \\ -0.72 \\ 1.36 \\ 0.61 \\ -1.11 \end{bmatrix}\hat{\beta}_1^{(1)} = \begin{bmatrix} 0.5290 \\ -0.0387 \\ -0.8957 \\ 1.0693 \\ 0.4007 \end{bmatrix}.$$
Thus,
$$q_2 = \mathbf{X}_{(:,2)}^T\left(\boldsymbol{y} - \mathbf{X}_{(:,1)}\hat{\beta}_1^{(1)}\right) = 2.418596.$$
As $q_2 - \lambda/2 = 1.918596 > 0$, we know that $\hat{\beta}_2^{(1)} = q_2 - \lambda/2 = 1.918596$.
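As a quick numerical check, this one iteration of the shooting algorithm can be reproduced in R; a small sketch (the soft-thresholding step follows the update rule used above):
X <- cbind(c(-0.15,-0.72,1.36,0.61,-1.11), c(-0.48,-0.54,-0.91,1.59,0.34))
y <- c(0.46,-0.37,-0.27,1.35,-0.11)
lambda <- 1
beta <- c(0, 1)                                    # initial values
q1 <- sum(X[,1] * (y - X[,2] * beta[2]))           # 0.9601
beta[1] <- sign(q1) * max(abs(q1) - lambda/2, 0)   # 0.4601
q2 <- sum(X[,2] * (y - X[,1] * beta[1]))           # 2.4186
beta[2] <- sign(q2) * max(abs(q2) - lambda/2, 0)   # 1.9186
beta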
Q3: Follow up on the dataset in Q2. Use the R pipeline for LASSO on this data. Compare the result from R and the result by your manual calculation. Solution: x1 <-c(-0.15,-0.72,1.36,0.61,-1.11) x2 <-c(-0.48,-0.54,-0.91,1.59,0.34) y <-c(0.46,-0.37,-0.27,1.35,-0.11) x <-cbind(x1, x2) require(glmnet) ## Loading required package: glmnet ## Loading required package: Matrix ## Loaded glmnet 4.0-2 fit <-glmnet(x,y) fit$beta plot(fit,label =TRUE)
Interpretation: It could be observed that $\hat{\beta}_2$ is always larger than $\hat{\beta}_1$. Also, $\hat{\beta}_1$ will become zero as long as $\lambda$ is large enough, before $\hat{\beta}_2$ is penalized to become zero. This could be verified using the same process as we have done in Q2, e.g., with $\lambda = 2$, $\hat{\beta}_1^{(1)}$ will become zero but $\hat{\beta}_2^{(1)}$ remains nonzero.
Q4: Conduct a principal component analysis for the following dataset. Show details of the process.

X1   X2    X3     X4
1    1.8   2.08  -0.28
2    3.6  -0.78   0.79
1    2.2  -0.08  -0.52
2    4.3   0.38  -0.47
1    2.1   0.71   1.03
2    3.6   1.29   0.67
1    2.2   0.57   0.15
2    4.0   1.12   1.18

(1) Standardize the dataset (i.e., by making the means of the variables zero, and the standard deviations of the variables one).
(2) Calculate the sample covariance matrix (i.e., $\mathbf{S} = \frac{\mathbf{X}^T\mathbf{X}}{N-1}$).
(3) Conduct eigenvalue decomposition on the sample covariance matrix, and obtain the four eigenvectors and their eigenvalues.
(4) Report the percentages of variance that could be explained by the four PCs, respectively. Draw the scree plot. How many PCs are sufficient to represent the dataset (in other words, which PCs are significant)?
(5) Interpret the PCs you have selected, i.e., which variables define which PCs?
(6) Convert the original data (in the space spanned by the four X variables) into the space spanned by the four PCs, by filling in the following table:

PC1   PC2   PC3   PC4
Solution: (1) # set data vectors x1 <- c(1, 2, 1, 2, 1, 2, 1, 2) x2 <- c(1.8, 3.6, 2.2, 4.3, 2.1, 3.6, 2.2, 4.0) x3 <- c(2.08, -0.78, -0.08, 0.38, 0.71, 1.29, 0.57, 1.12) x4 <- c(-0.28, 0.79, -0.52, -0.47, 1.03, 0.67, 0.15, 1.18) # original dataset orig_data <- cbind(x1, x2, x3, x4) # write a simple standardize function to standardize each value # in a given vector, v standardize <- function(v) { mean <- mean(v) std <- sd(v) v <- (v-mean)/std return(v) } # write a standardizied data matrix std_data <- cbind(standardize(x1), standardize(x2), standardize(x3), standard ize(x4)) colnames(std_data) <- c("x1", "x2", "x3", "x4") std_data ## x1 x2 x3 x4 ## [1,] -0.9354143 -1.1804936 1.62511508 -0.8712918 ## [2,] 0.9354143 0.6279222 -1.65088783 0.6857558 ## [3,] -0.9354143 -0.7786235 -0.84906894 -1.2205362 ## [4,] 0.9354143 1.3311950 -0.32215938 -1.1477769 87
## [5,] -0.9354143 -0.8790910 0.05584096 1.0350001 ## [6,] 0.9354143 0.6279222 0.72020519 0.5111336 ## [7,] -0.9354143 -0.7786235 -0.10452282 -0.2455624 ## [8,] 0.9354143 1.0297923 0.52547774 1.2532778 (2) # calculate sample covariance matrix sample_cov <- (t(std_data) %*% std_data)/(nrow(std_data)-1) sample_cov ## x1 x2 x3 x4 ## x1 1.0000000 0.9666389 -0.194396282 0.348078453 ## x2 0.9666389 1.0000000 -0.242451234 0.253498336 ## x3 -0.1943963 -0.2424512 1.000000000 -0.004545141 ## x4 0.3480785 0.2534983 -0.004545141 1.000000000 (3) # Calculate eigenvalue decomposition w <- eigen(sample_cov) w ## eigen() decomposition ## $values ## [1] 2.19822404 1.00029497 0.77413243 0.02734856 ## ## $vectors ## [,1] [,2] [,3] [,4] ## [1,] -0.6515814 0.04751028 -0.2556093 0.71263477 ## [2,] -0.6425935 -0.05011346 -0.3143543 -0.69695306 ## [3,] 0.2369715 0.80523560 -0.5426222 -0.03164318 ## [4,] -0.3261284 0.58892023 0.7358032 -0.07353116 (4)
Results are shown below. As seen in the scree plot, only PCs 1-3 are significant, given that PC4 accounts for less than 1% of the variability in the dataset.
# eigen vectors
pc1 <- w$vectors[,1]
pc2 <- w$vectors[,2]
pc3 <- w$vectors[,3]
pc4 <- w$vectors[,4]
# eigen values
w1 <- w$values[1]
w2 <- w$values[2]
w3 <- w$values[3]
w4 <- w$values[4]
total <- sum(w1, w2, w3, w4)
per1 <- w1/total per2 <- w2/total per3 <- w3/total per4 <- w4/total contributions <- cbind(per1, per2, per3, per4) colnames(contributions) <- c("PC1", "PC2", "PC3", "PC4") # scree plot xx <- barplot(contributions, main="PCA Scree Plot", xlab="Principal Component s", ylab="Pecentage of Variability Explained", ylim = c(0,1)) text(x=xx, y=contributions, label = round(contributions, digits=2), pos = 3, cex=0.8, col="blue")
(5)
To answer this question, we look at the loadings of the significant PCs, which show us how much each variable contributes to that PC. For this we consider the magnitude of each individual contribution, ignoring sign. For PC1, we can see that X1 and X2 have significant contributions.
pc1
## [1] -0.6515814 -0.6425935  0.2369715 -0.3261284
PC2 is influenced by X3 and X4.
pc2
## [1]  0.04751028 -0.05011346  0.80523560  0.58892023
PC3 is influenced by X4 and X3.
pc3
## [1] -0.2556093 -0.3143543 -0.5426222  0.7358032
(6) # data transformation # Transform the x values into pca values. X.pca <- data.frame(std_data %*% w$vectors) names(X.pca) <- c("PC1","PC2","PC3, PC4") X.pca ## PC1 PC2 PC3, PC4 NA ## 1 2.0373351 0.8101960 -0.91272900 0.168783154 ## 2 -1.6278551 -0.9125238 0.96389909 0.230791451 ## 3 1.3066834 -1.4079213 0.04651346 -0.007330093 ## 4 -1.1669358 -0.9576323 -1.32729464 -0.166580091 ## 5 0.8500866 0.6541101 1.24670254 -0.131795366 ## 6 -1.0090244 0.8939262 -0.45119619 0.168602689 ## 7 1.1651530 -0.2342045 0.35989503 -0.102580856 ## 8 -1.5554428 1.1540494 0.07420970 -0.159890889
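As a cross-check of the manual eigen-decomposition above, the built-in prcomp() function should reproduce the same loadings and scores, up to possible sign flips of the eigenvectors; a small sketch using the orig_data matrix defined in part (1):
pca <- prcomp(orig_data, center = TRUE, scale. = TRUE)
pca$rotation  # loadings; each column may differ from w$vectors by a sign flip
pca$x         # PC scores, to compare with X.pca above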
Q5: Consider the dataset in Q2 from Chapter 7. (1) Conduct the PCA analysis on the three predictors to identify the three principal components and their contributions on explaining the variance in data. Solution: First, we standardize the data. To obtain the basic statistics of the variables: data <- as.data.frame(matrix(c(4,1,1,1,4,-1,0,1,8,2,1, 1,-2.5, 0,0,-1,0,1,1,-1, -0.3,-1,0,-1,2.5,-1,1,-1, -1,1,0,-1), nrow=8, ncol=4, byrow=TRUE)) colnames(data) <- c("x1", "x2", "x3", "y") summary(data[,1:3]) ## ## ## ## ## ## ##
x1 Min. :-2.500 1st Qu.:-0.475 Median : 1.250 Mean : 1.837 3rd Qu.: 4.000 Max. : 8.000
x2 Min. :-1.00 1st Qu.:-1.00 Median : 0.50 Mean : 0.25 3rd Qu.: 1.00 Max. : 2.00
x3 Min. :0.0 1st Qu.:0.0 Median :0.5 Mean :0.5 3rd Qu.:1.0 Max. :1.0
sd(data$x1) ## [1] 3.434671 90
sd(data$x2) ## [1] 1.164965 sd(data$x3) ## [1] 0.5345225 Then, we standardize the data: X <- data[,1:3] X$x1 <- (X$x1-1.837)/sd(X$x1) X$x2 <- (X$x2-0.25)/sd(X$x2) X$x3 <- (X$x3-0.5)/sd(X$x3) X ## x1 x2 x3 ## 1 0.6297547 0.6437963 0.9354143 ## 2 0.6297547 -1.0729938 -0.9354143 ## 3 1.7943495 1.5021914 0.9354143 ## 4 -1.2627119 -0.2145988 -0.9354143 ## 5 -0.5348402 0.6437963 0.9354143 ## 6 -0.6221848 -1.0729938 -0.9354143 ## 7 0.1930316 -1.0729938 0.9354143 ## 8 -0.8259889 0.6437963 -0.9354143 Then, calculate the covariance matrix S <- (t(as.matrix(X)) %*% as.matrix(X))/7 S <- cov(X) Eigen <- eigen(S) Eigen ## eigen() decomposition ## $values ## [1] 1.9005069 0.6839791 0.4155140 ## ## $vectors ## [,1] [,2] [,3] ## [1,] 0.5764906 0.5820827 -0.5734443 ## [2,] 0.5268149 -0.8012385 -0.2836950 ## [3,] 0.6245996 0.1385515 0.7685563 We can obtain the scores of the three PCs: PC <- X colnames(PC) <- c("PC1", "PC2", "PC3") X <- as.matrix(X) for (i in 1:8){ for (j in 1:3){ PC[i,j] <- X[i,] %*% Eigen$vectors[,j] 91
} } PC ## PC1 PC2 PC3 ## 1 1.2864686 -0.01966206 0.17514759 ## 2 -0.7864810 1.09669028 -0.77564478 ## 3 2.4100618 -0.02955074 -0.73620506 ## 4 -1.4252548 -0.69266105 0.06605694 ## 5 0.6150906 -0.69755259 0.84297782 ## 6 -1.5082123 0.36795795 -0.05772728 ## 7 0.1302712 1.10168741 0.91262875 ## 8 -0.7212728 -1.12623131 -0.42790181 To obtain their contribution in explaining the variance in data: exp_frame <- matrix(NA, nrow=3, ncol=1) for (i in 1:3){ exp_frame[i,] <- Eigen$values[i]/sum(Eigen$values) } exp_frame ## [,1] ## [1,] 0.6335023 ## [2,] 0.2279930 ## [3,] 0.1385047 (2) Use the R pipeline for PCA to do the PCA analysis and compare with your manual calculation. Solution: norm_X <- cbind(scale(data$x1), scale(data$x2), scale(data$x3)) model_pca <- eigen(cov(norm_X)) X_pc <- data.frame(norm_X %*% model_pca$vectors) colnames(X_pc) <- c("PC1", "PC2", "PC3") model_pca ## eigen() decomposition ## $values ## [1] 1.9005069 0.6839791 0.4155140 ## ## $vectors ## [,1] [,2] [,3] ## [1,] -0.5764906 0.5820827 0.5734443 ## [2,] -0.5268149 -0.8012385 0.2836950 ## [3,] -0.6245996 0.1385515 -0.7685563 X_pc ## PC1 PC2 PC3 ## 1 -1.2863846 -0.01974680 -0.17523106 ## 2 0.7865649 1.09660554 0.77556130 92
## 3 -2.4099779 -0.02963548 0.73612159 ## 4 1.4253387 -0.69274579 -0.06614042 ## 5 -0.6150067 -0.69763733 -0.84306130 ## 6 1.5082962 0.36787322 0.05764380 ## 7 -0.1301873 1.10160268 -0.91271223 ## 8 0.7213567 -1.12631605 0.42781833
Q6: Suppose that we have an outcome variable that could be augmented into the dataset in Q2, as shown below.

X1   X2    X3     X4     Y
1    1.8   2.08  -0.28   1.2
2    3.6  -0.78   0.79   2.1
1    2.2  -0.08  -0.52   0.8
2    4.3   0.38  -0.47   1.5
1    2.1   0.71   1.03   0.8
2    3.6   1.29   0.67   1.6
1    2.2   0.57   0.15   1.2
2    4.0   1.12   1.18   1.6

Please apply the shooting algorithm for LASSO on this dataset to identify important variables. Please use the following initial values for the parameters: $\lambda = 1$, $\beta_1 = 0$, $\beta_2 = 1$, $\beta_3 = 1$, $\beta_4 = 1$, and just do one iteration of the shooting algorithm. Show details of this process by manual calculation.
Solution: The first step is to compute $q_1$:
$$q_1 = \mathbf{X}_{(:,1)}^T\left(\boldsymbol{y} - \mathbf{X}_{(:,2)}\hat{\beta}_2^{(0)} - \mathbf{X}_{(:,3)}\hat{\beta}_3^{(0)} - \mathbf{X}_{(:,4)}\hat{\beta}_4^{(0)}\right) = -33.72.$$
Since $q_1 + \lambda/2 = -33.22 < 0$, we know that $\hat{\beta}_1^{(1)} = q_1 + \lambda/2 = -33.22$.
Then we compute $q_2$:
$$q_2 = \mathbf{X}_{(:,2)}^T\left(\boldsymbol{y} - \mathbf{X}_{(:,1)}\hat{\beta}_1^{(1)} - \mathbf{X}_{(:,3)}\hat{\beta}_3^{(0)} - \mathbf{X}_{(:,4)}\hat{\beta}_4^{(0)}\right) = 1316.89.$$
Since $q_2 - \lambda/2 = 1316.39 > 0$, we have $\hat{\beta}_2^{(1)} = q_2 - \lambda/2 = 1316.39$.
Then we compute $q_3$:
$$q_3 = \mathbf{X}_{(:,3)}^T\left(\boldsymbol{y} - \mathbf{X}_{(:,1)}\hat{\beta}_1^{(1)} - \mathbf{X}_{(:,2)}\hat{\beta}_2^{(1)} - \mathbf{X}_{(:,4)}\hat{\beta}_4^{(0)}\right) = -18528.40,$$
and $\hat{\beta}_3^{(1)} = q_3 + \lambda/2 = -18527.90$.
Then we compute $q_4$:
$$q_4 = \mathbf{X}_{(:,4)}^T\left(\boldsymbol{y} - \mathbf{X}_{(:,1)}\hat{\beta}_1^{(1)} - \mathbf{X}_{(:,2)}\hat{\beta}_2^{(1)} - \mathbf{X}_{(:,3)}\hat{\beta}_3^{(1)}\right) = 19464.57,$$
and $\hat{\beta}_4^{(1)} = q_4 - \lambda/2 = 19464.07$.
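A quick numerical check of this one-pass shooting iteration in R (a small sketch; X1 to X4 and Y are the columns of the table in this question, and the soft-thresholding step follows the update rule used above):
X <- cbind(c(1,2,1,2,1,2,1,2),
           c(1.8,3.6,2.2,4.3,2.1,3.6,2.2,4.0),
           c(2.08,-0.78,-0.08,0.38,0.71,1.29,0.57,1.12),
           c(-0.28,0.79,-0.52,-0.47,1.03,0.67,0.15,1.18))
y <- c(1.2,2.1,0.8,1.5,0.8,1.6,1.2,1.6)
lambda <- 1
beta <- c(0, 1, 1, 1)                               # initial values
for (j in 1:4) {
  q <- sum(X[, j] * (y - X[, -j] %*% beta[-j]))     # q_j given the current values of the other betas
  beta[j] <- sign(q) * max(abs(q) - lambda/2, 0)    # soft thresholding
}
beta   # approximately c(-33.22, 1316.39, -18527.90, 19464.07)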
Q7: After extraction of the four PCs from Q3, use lm() in R to build a linear regression model with the outcome variable (as shown in the table in Q3) and the four PCs as the predictors. (1) Report the summary of your linear regression model with the four PCs (2) Which PCs significantly affect the outcome variable by looking at the p-values of the t-test? Solution: (1) # use PCs to build linear regression model dataset <- data.frame(cbind(X.pca,y)) fit.pc <- lm(y ~ ., data=dataset) # display summary of regression model summary(fit.pc) ## ## Call: ## lm(formula = y ~ ., data = dataset) ## ## Residuals: ## 1 2 3 4 5 6 ## 0.11428 0.13368 -0.22304 0.01082 -0.15150 -0.20197 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|)
7 0.26026
8 0.05747
94
## (Intercept) 1.350000 0.094774 14.244 0.00075 *** ## PC1 -0.240691 0.068336 -3.522 0.03886 * ## PC2 0.001736 0.101303 0.017 0.98741 ## PC3..PC4 -0.036679 0.115153 -0.319 0.77098 ## NA. 1.132852 0.612657 1.849 0.16157 ## --## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2681 on 3 degrees of freedom ## Multiple R-squared: 0.8415, Adjusted R-squared: 0.6301 ## F-statistic: 3.982 on 4 and 3 DF, p-value: 0.1428
(2) The analysis shows that only the first PC is significant.
Q8: Revisit Q1 in Chapter 3. Derive the shooting algorithm for weighted least squares regression with the L1 norm penalty on its regression parameters.
Solution: For the weighted least squares regression model with the L1 norm penalty, we could denote the objective function as
$$L(\boldsymbol{\beta}) = (\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta})^T \mathbf{W} (\boldsymbol{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda\|\boldsymbol{\beta}\|_1.$$
To derive the shooting algorithm, let's first consider a simple case where there is only one predictor and no intercept. The objective function becomes
$$L(\beta) = (\boldsymbol{y} - \boldsymbol{x}\beta)^T \mathbf{W} (\boldsymbol{y} - \boldsymbol{x}\beta) + \lambda|\beta|.$$
To find the optimal solution, we can solve the equation $\frac{\partial L(\beta)}{\partial \beta} = 0$.
The complication is the L1-norm term, $|\beta|$, which has no gradient when $\beta = 0$. Thus, we can discuss different scenarios and identify the solutions.
• If $\beta > 0$, then $\frac{\partial L(\beta)}{\partial \beta} = 2\beta - 2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} + \lambda$. Thus, $\frac{\partial L(\beta)}{\partial \beta} = 0$ will lead to the solution $\beta = \frac{2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} - \lambda}{2}$. But if $2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} - \lambda < 0$, this will result in a contradiction, and thereby, $\beta = 0$.
• If $\beta < 0$, then $\frac{\partial L(\beta)}{\partial \beta} = 2\beta - 2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} - \lambda$. Similarly as above, we can conclude that $\beta = \frac{2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} + \lambda}{2}$. But if $2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} + \lambda > 0$, this will result in a contradiction, and thereby, $\beta = 0$.
• If $\beta = 0$, then we already have the solution and no longer need to calculate the gradient.
In summary, we can derive the solution of $\beta$ as
$$\hat{\beta} = \begin{cases} \frac{2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} - \lambda}{2}, & \text{if } 2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} - \lambda > 0 \\ \frac{2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} + \lambda}{2}, & \text{if } 2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y} + \lambda < 0 \\ 0, & \text{if } \lambda \ge \left|2\boldsymbol{x}^T\mathbf{W}\boldsymbol{y}\right| \end{cases}$$
Now we are ready to generalize this practice to the general case with more predictors. Suppose that we are now at the $t$th iteration and we are trying to optimize for $\beta_j$. We can rewrite the general optimization problem's objective function as a function of $\beta_j$:
$$L(\beta_j) = \left(\boldsymbol{y} - \sum_{k\ne j}\mathbf{X}_{(:,k)}\beta_k^{(t-1)} - \mathbf{X}_{(:,j)}\beta_j\right)^T \mathbf{W} \left(\boldsymbol{y} - \sum_{k\ne j}\mathbf{X}_{(:,k)}\beta_k^{(t-1)} - \mathbf{X}_{(:,j)}\beta_j\right) + \lambda\sum_{k\ne j}\left|\beta_k^{(t-1)}\right| + \lambda|\beta_j|.$$
Here, $\beta_k^{(t)}$ is the value of $\beta_k$ in the $t$th iteration. Thus, we can readily derive that
$$\hat{\beta}_j^{(t)} = \begin{cases} q_j - \lambda/2, & \text{if } q_j - \lambda/2 > 0 \\ q_j + \lambda/2, & \text{if } q_j + \lambda/2 < 0 \\ 0, & \text{if } \lambda \ge |2q_j| \end{cases}$$
where $q_j = \mathbf{X}_{(:,j)}^T \mathbf{W} \left(\boldsymbol{y} - \sum_{k\ne j}\mathbf{X}_{(:,k)}\beta_k^{(t-1)}\right)$.
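A minimal R sketch of the resulting algorithm, assuming a design matrix X, a response y, a weight matrix W, and a penalty lambda are given (the function name shooting_wls is only for illustration):
shooting_wls <- function(X, y, W, lambda, n_iter = 100) {
  p <- ncol(X)
  beta <- rep(0, p)
  for (t in 1:n_iter) {
    for (j in 1:p) {
      r <- y - X[, -j, drop = FALSE] %*% beta[-j]        # residual excluding predictor j
      q <- as.numeric(t(X[, j]) %*% W %*% r)              # q_j as derived above
      beta[j] <- sign(q) * max(abs(q) - lambda/2, 0)      # soft thresholding
    }
  }
  beta
}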
Q9: Design a simulated experiment to evaluate the effectiveness of glmnet() in the R package glmnet. (1) For instance, you can simulate 20 samples from a linear regression model with 10 variables, where only 2 out of the 10 variables are truly significant, e.g., the true model is $y = x_1\beta_1 + x_2\beta_2 + \varepsilon$, where $\beta_1 = 1$, $\beta_2 = 1$, and $\varepsilon \sim N(0,1)$. But there are also 8 other variables $x_3$ to $x_{10}$, where each is generated from $N(0,1)$. In data analysis, we will use all the 10 variables as predictors, since we won't know the true model. (2) Run lm() on the simulated data and comment on the results. (3) Run glmnet() on the simulated data, and check the path trajectory plot to see if the true significant variables could be detected. (4) Use the cross-validation process integrated into the glmnet package to see if the true significant variables could be detected. (5) Use rpart() to build a decision tree and extract the variable importance score, to see if the true significant variables could be detected.
(6) Use randomforest() to build a random forest model and extract the variable importance score, to see if the true significant variables could be detected. Solution: (1) # step 1 -> simulate the data n=20 p <- 10 mu <- rep(0,p) library(clusterGeneration) ## Loading required package: MASS Sigma <- rcorrmatrix(p) library(MASS) X <- mvrnorm(n, mu, Sigma, tol = 1e-6, empirical = FALSE) beta1 <- 1 # the regression coefficient of the first predictor = 1 beta2 <- 1 # the regression coefficient of the second predictor = 1 mu <- beta1 * X[,1] + beta2 * X[,2] # with simulated values of x1 and x2, and the coefficients, we can calculate the mean levels of the outcome variable y <- rnorm(n, mu, 1) # further, simulate the outcome variable. remember, y = f(x) + error. Here, the error term is N(0,1)
(2) # Try lm model lm.XY <- lm(y ~ ., data = data.frame(y,X)) # Now, let's fit the linear regres sion model summary(lm.XY) ## ## Call: ## lm(formula = y ~ ., data = data.frame(y, X)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.5357 -0.4015 0.1250 0.4286 1.1771 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.06576 0.29685 0.222 0.830 ## X1 0.82771 0.98264 0.842 0.421 ## X2 0.09119 0.62098 0.147 0.886 ## X3 0.22113 0.85071 0.260 0.801 97
## X4 0.40608 0.64304 0.632 0.543 ## X5 0.69563 1.00992 0.689 0.508 ## X6 0.46139 0.54578 0.845 0.420 ## X7 -0.12642 0.61247 -0.206 0.841 ## X8 -0.56563 0.49956 -1.132 0.287 ## X9 0.45069 0.51271 0.879 0.402 ## X10 -0.60617 0.43164 -1.404 0.194 ## ## Residual standard error: 1.047 on 9 degrees of freedom ## Multiple R-squared: 0.7797, Adjusted R-squared: 0.535 ## F-statistic: 3.186 on 10 and 9 DF, p-value: 0.04787 It seems that the lm() could not identify any significant variable.
# Try LASSO library(glmnet) ## Loading required package: Matrix ## Loaded glmnet 4.0-2 fit = glmnet(X,y, family=c("gaussian")) plot(fit,label = TRUE) It seems that LASSO could identify the true significant variables, x1 and x2.
# Use cross-validation (a procedure integrated into the glmnet package) to se lect the best set of variables cv.fit = cv.glmnet(X,y) ## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observation s per ## fold plot(cv.fit)
coef(cv.fit, s = "lambda.min") ## 11 x 1 sparse Matrix of class "dgCMatrix" ## 1 ## (Intercept) -0.03359771 ## x1 0.90366848 ## x2 0.41699494 ## x3 . ## x4 0.14932342 ## x5 0.09375941 ## x6 . ## x7 . ## x8 . ## x9 . ## x10 -0.38930260 It seems that the cross-validation procedure integrated in the glmnet package not only identified the true significant variables, but also some insignificant variables such as x4, x5, x10.
# Use decision tree to evaluate the importance of the variables library(rpart) data.train <- data.frame(y,X) 100
tree <- rpart( y ~ ., data = data.train)
tree$variable.importance
##        X1       X10        X5        X8        X2        X9
## 15.337385  6.134954  6.134954  6.134954  4.601215  4.601215
It seems that the decision tree identified x1, but not x2.
# Use Random forest to evaluate the importance of the variables library(randomForest) ## randomForest 4.6-14 ## Type rfNews() to see new features/changes/bug fixes. rf <- randomForest( y ~ ., data = data.train, ntree = 100, nodesize = 20, mtr y = 5) rf$importance ## IncNodePurity ## X1 6.4083490 ## X2 5.0438098 ## X3 0.7824175 ## X4 0.2131150 ## X5 6.7103528 ## X6 0.6079683 ## X7 0.7591356 ## X8 1.6966687 ## X9 0.6190051 ## X10 0.7638909 It seems that the random forest model identified x1 and x2, but also included x5 which is not a true significant variable.
Chapter 9
Q1: Build a kernel regression model with a Gaussian kernel with bandwidth parameter $\gamma = 1$ using the dataset shown below,

ID    X      Y
1    -0.32   0.66
2    -0.1    0.82
3     0.74  -0.37
4     1.21  -0.8
5     0.44   0.52
6    -0.68   0.97

and predict on the following data points. Please use manual calculation.

Testing data (3 data points)
ID    X    Y
7    -1
8     0
9     1

Solution: The Gaussian kernel with bandwidth parameter $\gamma = 1$ is
$$K(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2}.$$
The prediction function of the kernel regression model is
$$y^* = \frac{\sum_{n=1}^{N} y_n K(x_n, x^*)}{\sum_{n=1}^{N} K(x_n, x^*)}.$$
For the data point ID#7, we can calculate

ID    X      Y      K(x_n, x*)   y_n K(x_n, x*)
1    -0.32   0.66   0.62977      0.41565
2    -0.1    0.82   0.44485      0.36480
3     0.74  -0.37   0.04843     -0.01790
4     1.21  -0.8    0.00756     -0.00605
5     0.44   0.52   0.12573      0.06538
6    -0.68   0.97   0.90267      0.87560

Then we can compute that $\sum_{n=1}^{6} K(x_n, x^*) = 2.15903$ and $\sum_{n=1}^{6} y_n K(x_n, x^*) = 1.697$. Thus,
$$y^* = \frac{1.697}{2.15903} = 0.78600.$$
Similarly, we can calculate the predictions for data points #8 and #9. The result is summarized in the following table:

ID    X    Y
7    -1    0.7860
8     0    0.4982
9     1   -0.0994

Q2: Follow up on the dataset in Q1. Build a KNN regression model with K = 2. Predict on the 3 data points in the testing data. Please use manual calculation.
Solution: For x = -1, the two nearest training points are ID #6 (x = -0.68) and ID #1 (x = -0.32), so the prediction is (0.97 + 0.66)/2 = 0.815. For x = 0, the two nearest training points are ID #2 (x = -0.1) and ID #1 (x = -0.32), so the prediction is (0.82 + 0.66)/2 = 0.74. For x = 1, the two nearest training points are ID #4 (x = 1.21) and ID #3 (x = 0.74), so the prediction is (-0.8 + (-0.37))/2 = -0.585.
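Before moving to the R pipeline in Q3, the manual kernel regression predictions of Q1 can also be checked numerically; a small sketch, where gamma is the bandwidth parameter:
x <- c(-0.32, -0.1, 0.74, 1.21, 0.44, -0.68)
y <- c(0.66, 0.82, -0.37, -0.8, 0.52, 0.97)
x.test <- c(-1, 0, 1)
gamma <- 1
K <- exp(-gamma * outer(x.test, x, function(a, b) (a - b)^2))  # 3 x 6 matrix of kernel weights
as.numeric((K %*% y) / rowSums(K))                             # compare with the manual predictions above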
Q3: Follow up on the dataset in Q1. Use the R pipeline for KNN regression on this data. Compare the result from R and the result by your manual calculation. Solution: train <- data.frame(x,y) library(FNN) KNN_m <- knn.reg(train = train$x, test = data.frame(test), y = train$y, k = 2) KNN_m ## Prediction: ## [1] 0.815 0.740 -0.585
Q4: Consider the following dataset.
Training data (6 data points) ID 𝑋 𝑌 1 -0.32 0.66 2 -0.1 0.82 3 0.74 -0.37 4 1.21 -0.8 5 0.44 0.52 6 -0.68 0.97 Use the gausskernel() function from the R package “KRLS” to calculate the similarity between the data points (including the 6 training data points and the 3 testing data points in the Table below) Solution: x <- c(-0.32, -0.1, 0.74, 1.21, 0.44, -0.68) y <- c(0.66, 0.82, -0.37, -0.8, 0.52, 0.97) test <- c(-1,0,1) library(KRLS) ## ## KRLS Package for Kernel-based Regularized Least Squares. ## ## See Hainmueller and Hazlett (2014) for details. Kernel_m <- gausskernel(X = c(x, test), sigma=1) Kernel_m[,7:9] ## 7 8 9 ## 1 0.62977038 0.9026684 0.17509966 ## 2 0.44485807 0.9900498 0.29819728 ## 3 0.04843173 0.5783362 0.93463425 ## 4 0.00756593 0.2312861 0.95685827
Q5: Use the BostonHousing dataset from the R package mlbench, select the variable medv as the outcome, and use the other numeric variables as predictors. Run the R pipeline for KNN regression on it. Use cross-validation to select the best number of nearest neighbors, and summarize your findings.

Solution:
# Step 1 -> Read data into R workstation
library(mlbench)
data(BostonHousing)
data <- BostonHousing

# Step 2 -> Data preprocessing
# Create your X matrix (predictors) and Y vector (outcome variable)
x <- data[,c(1:3,5:13)]
y <- data[,14]
# Make sure the outcome variable is legitimate. If it is a continuous variable
# (regression problem), it should be defined as a "num" variable in R. If it is
# a binary or a more general categorical variable (classification problem), it
# should be defined as a "factor" variable in R.

# Create a training dataset (4/5 of the original data size)
train.ix <- sort(sample(nrow(data), floor(nrow(data) * 4/5)))
data.train.x <- x[train.ix,]
data.train.y <- y[train.ix]
# Create a testing dataset (the remaining 1/5)
data.test.x <- x[-train.ix,]
data.test.y <- y[-train.ix]

# Step 3 -> Gather a list of candidate models
# KNN regression: compare models with different numbers of nearest neighbors
# model1: knn.reg(train = x, y = y, k = 2)
# model2: knn.reg(train = x, y = y, k = 5)
# model3: knn.reg(train = x, y = y, k = 10)
# model4: knn.reg(train = x, y = y, k = 20)

# Step 4 -> Use 10-fold cross-validation to evaluate all the models
# First, use 10-fold cross-validation to evaluate the performance of model1
n_folds = 10 # number of folds (the parameter K in K-fold cross-validation)
N <- dim(data.train.x)[1] # the sample size, N, of the training dataset
folds_i <- sample(rep(1:n_folds, length.out = N)) # This randomly creates a
# labeling vector (1 x N) for the N samples. For example, if N = 16, this
# function may return 5 4 4 10 6 7 6 8 3 2 1 5 3 9 2 1, meaning that the first
# sample is allocated to the 5th fold, the 2nd and 3rd samples are allocated to
# the 4th fold, etc.
require(FNN)
## Loading required package: FNN
cv_mse <- NULL # cv_mse records the prediction error for each fold
for (k in 1:n_folds) {
  test_i <- which(folds_i == k)            # In each of the n_folds iterations,
                                           # one fold is used as the testing data
  data.train.x.cv <- data.train.x[-test_i, ]
  data.train.y.cv <- data.train.y[-test_i] # The remaining n_folds - 1 folds
                                           # form the training data
  data.test.x.cv <- data.train.x[test_i, ]
  data.test.y.cv <- data.train.y[test_i]   # The testing data, i.e., the kth fold
  model1 <- knn.reg(train = data.train.x.cv, test = data.test.x.cv,
                    y = data.train.y.cv, k = 2) # Fit the KNN regression model
  # with 2 nearest neighbors; note that knn.reg() predicts directly on the data
  # passed to its "test" argument, so no separate predict() call is needed
  y_hat <- model1$pred                     # Predictions on the testing fold
  true_y <- data.test.y.cv                 # True y values of the testing fold
  cv_mse[k] <- mean((true_y - y_hat)^2)
}
mean(cv_mse)
## [1] 48.58876

# Repeat the 10-fold cross-validation for model2 (k = 5)
cv_mse <- NULL
for (k in 1:n_folds) {
  test_i <- which(folds_i == k)
  data.train.x.cv <- data.train.x[-test_i, ]
  data.train.y.cv <- data.train.y[-test_i]
  data.test.x.cv <- data.train.x[test_i, ]
  data.test.y.cv <- data.train.y[test_i]
  model2 <- knn.reg(train = data.train.x.cv, test = data.test.x.cv,
                    y = data.train.y.cv, k = 5)
  y_hat <- model2$pred
  true_y <- data.test.y.cv
  cv_mse[k] <- mean((true_y - y_hat)^2)
}
mean(cv_mse)
## [1] 47.10889

# Repeat the 10-fold cross-validation for model3 (k = 10)
cv_mse <- NULL
for (k in 1:n_folds) {
  test_i <- which(folds_i == k)
  data.train.x.cv <- data.train.x[-test_i, ]
  data.train.y.cv <- data.train.y[-test_i]
  data.test.x.cv <- data.train.x[test_i, ]
  data.test.y.cv <- data.train.y[test_i]
  model3 <- knn.reg(train = data.train.x.cv, test = data.test.x.cv,
                    y = data.train.y.cv, k = 10)
  y_hat <- model3$pred
  true_y <- data.test.y.cv
  cv_mse[k] <- mean((true_y - y_hat)^2)
}
mean(cv_mse)
## [1] 49.25045

# Repeat the 10-fold cross-validation for model4 (k = 20)
cv_mse <- NULL
for (k in 1:n_folds) {
  test_i <- which(folds_i == k)
  data.train.x.cv <- data.train.x[-test_i, ]
  data.train.y.cv <- data.train.y[-test_i]
  data.test.x.cv <- data.train.x[test_i, ]
  data.test.y.cv <- data.train.y[test_i]
  model4 <- knn.reg(train = data.train.x.cv, test = data.test.x.cv,
                    y = data.train.y.cv, k = 20)
  y_hat <- model4$pred
  true_y <- data.test.y.cv
  cv_mse[k] <- mean((true_y - y_hat)^2)
}
mean(cv_mse)
## [1] 57.00013

# Step 5 -> After model selection, use knn.reg() to build the final model
knn.final <- knn.reg(train = data.train.x, test = data.test.x,
                     y = data.train.y, k = 5)

# Step 6 -> Evaluate the prediction performance of the final model
y_hat <- knn.final$pred         # Predictions on the testing data
true_y <- data.test.y           # True y values of the testing data
mse <- mean((true_y - y_hat)^2) # Mean squared error (MSE); the smaller this
                                # error, the better the model
print(mse)
## [1] 32.11422

Using 10-fold cross-validation, the smallest cross-validated MSE (about 47) is obtained at k = 5, so the best number of nearest neighbors is 5. The final model with k = 5 achieves an MSE of about 32 on the held-out testing data.
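Since the four cross-validation loops above differ only in the value of k, they can also be collapsed into a small helper function. This is just a refactoring sketch that reuses the data.train.x, data.train.y, folds_i, and n_folds objects defined above:

cv_mse_knn <- function(k_nn) {
  # 10-fold cross-validated MSE of KNN regression with k_nn nearest neighbors
  fold_mse <- sapply(1:n_folds, function(k) {
    test_i <- which(folds_i == k)
    fit <- knn.reg(train = data.train.x[-test_i, ], test = data.train.x[test_i, ],
                   y = data.train.y[-test_i], k = k_nn)
    mean((data.train.y[test_i] - fit$pred)^2)
  })
  mean(fold_mse)
}
sapply(c(2, 5, 10, 20), cv_mse_knn)  # one CV MSE per candidate k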
Q6: Use the BostonHousing dataset from the R package mlbench, select the variable lstat as the predictor and medv as the outcome, and run the R pipeline for kernel regression on it. Try the Gaussian kernel function with its bandwidth parameter taking the values 5, 10, 30, 100.

Solution:
# Step 1 -> Read data into R workstation
library(mlbench)
data(BostonHousing)
data <- BostonHousing

# Step 2 -> Data preprocessing
# Create your X matrix (predictor) and Y vector (outcome variable)
x <- data[,13]
y <- data[,14]
data <- data.frame(x,y)
# Make sure the outcome variable is legitimate. If it is a continuous variable
# (regression problem), it should be defined as a "num" variable in R. If it is
# a binary or a more general categorical variable (classification problem), it
# should be defined as a "factor" variable in R.

# Create a training dataset (4/5 of the original data size)
train.ix <- sample(nrow(data), floor(nrow(data) * 4/5))
data.train <- data[train.ix,]
# Create a testing dataset (the remaining 1/5)
data.test <- data[-train.ix,]

# Step 3 -> Use visual inspection to decide on the best kernel and bandwidth
plot(y ~ x, col = "gray", lwd = 2)
lines(ksmooth(x, y, "normal", bandwidth = 5),   lwd = 3, col = "darkorange")
lines(ksmooth(x, y, "normal", bandwidth = 10),  lwd = 3, col = "dodgerblue4")
lines(ksmooth(x, y, "normal", bandwidth = 30),  lwd = 3, col = "forestgreen")
lines(ksmooth(x, y, "normal", bandwidth = 100), lwd = 3, col = "black")
legend(x = "topright",
       legend = c("Kernel Reg (bw = 5)", "Kernel Reg (bw = 10)",
                  "Kernel Reg (bw = 30)", "Kernel Reg (bw = 100)"),
       lwd = rep(3, 4),
       col = c("darkorange", "dodgerblue4", "forestgreen", "black"),
       text.width = 32, cex = 0.85)
# Step 5 -> After model selection, use the ksmooth() function to build the
# final model. Note that ksmooth() returns its predictions at the sorted
# x.points, so sort the testing data by x first to keep y_hat and true_y aligned.
data.test <- data.test[order(data.test$x), ]
kr.final <- ksmooth(data.train$x, data.train$y, kernel = "normal",
                    bandwidth = 30, x.points = data.test$x)

# Step 6 -> Evaluate the prediction performance of the final model
y_hat <- kr.final$y             # Predictions on the testing data
true_y <- data.test$y           # True y values of the testing data
mse <- mean((true_y - y_hat)^2) # Mean squared error (MSE); the smaller this
                                # error, the better the model
print(mse)
## [1] 69.0908

Here the bandwidth of 30 is chosen based on the visual inspection in Step 3.
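The visual inspection in Step 3 can also be complemented by a numerical comparison of the four bandwidths on the held-out testing data. This is an optional sketch that reuses the data.train and data.test objects from above:

for (bw in c(5, 10, 30, 100)) {
  fit <- ksmooth(data.train$x, data.train$y, kernel = "normal",
                 bandwidth = bw, x.points = data.test$x)
  # data.test is already sorted by x (see Step 5), so fit$y lines up with data.test$y;
  # na.rm = TRUE guards against test points with no training data inside a small bandwidth
  cat("bandwidth =", bw, " test MSE =", mean((data.test$y - fit$y)^2, na.rm = TRUE), "\n")
}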
Q7: The figure below shows a nonlinear model (i.e., the curve) and its sampled points. Suppose that the curve is unknown to us, and our task is to build a KNN regression model with K = 2 based on the samples. Draw the fitted curve of this KNN regression model.
Solution: The fitted curve is piecewise constant: at each location, the KNN prediction is the average of the y-values of the two nearest sampled points, so the fit follows the samples as a step-like curve rather than reproducing the smooth underlying curve.
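Since the sampled points are only given in the figure, the sketch below uses hypothetical (x, y) samples purely to illustrate how such a fitted curve can be drawn in R; the step-function shape is what the hand-drawn answer should capture.

library(FNN)
x <- c(-2, -1, 0, 1, 2)             # hypothetical sample locations
y <- c(0.5, 1.2, 0.1, -0.8, -0.3)   # hypothetical sample values
grid <- data.frame(x = seq(-2, 2, by = 0.01))
fit <- knn.reg(train = data.frame(x = x), test = grid, y = y, k = 2)
plot(x, y, pch = 19)
lines(grid$x, fit$pred, col = "darkorange", lwd = 2)  # piecewise-constant KNN curve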
Q8: Suppose that the underlying model is a linear model.
To use a KNN model to approximate the underlying model, we need samples. Suppose that we could afford sampling 8 data points. At which locations would you like to acquire the samples in order to achieve the best approximation of the underlying model with your fitted KNN model?

Solution: The sampling scheme is shown below: we should sample the data points evenly over the range of x.

Note, however, that a KNN regression model is not the best choice for approximating a linear model: its prediction is a piecewise-constant (step) function, so even with evenly spaced samples it can only approximate the straight line in a staircase fashion.
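For intuition, the staircase behaviour can be reproduced with a small sketch (the straight line and its intercept/slope here are hypothetical, chosen only for illustration, and K = 2 is used as in Q7):

library(FNN)
x <- seq(0, 1, length.out = 8)     # 8 evenly spaced sample locations
y <- 1 + 2 * x                     # hypothetical underlying linear model y = 1 + 2x
grid <- data.frame(x = seq(0, 1, by = 0.005))
fit <- knn.reg(train = data.frame(x = x), test = grid, y = y, k = 2)
plot(x, y, pch = 19)
abline(a = 1, b = 2, col = "gray")                    # the true linear model
lines(grid$x, fit$pred, col = "darkorange", lwd = 2)  # the step-like KNN fit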
Chapter 10

Q1: Please complete the convolution operation as shown below.
Solution: The manual convolution result is the 2×2 matrix with first row 2, 1 and second row 1, 1; this is confirmed by the R calculation in Q2 below.
Q2: Use the convolution() function in the R package OpenImageR to run the data in Q1.

Solution:
require(OpenImageR)
## Loading required package: OpenImageR
myMatrix <- matrix(c(1,0,1,0,1,1,1,1,0), nrow = 3, ncol = 3)
kernel <- matrix(c(1,0,0,1), nrow = 2, ncol = 2)
# Make convolution
myOutput = convolution(myMatrix, kernel)
myOutput
##      [,1] [,2] [,3]
## [1,]    2    1    1
## [2,]    1    1    1
## [3,]    1    1    0

Note that the output from this function has a third row and a third column, because the function also evaluates the windows at the border of the input (the missing entries are apparently treated as zeros, which is consistent with the values in the third row and column); the reason for this extended result is illustrated below. Other than that, the output from R is consistent with our manual calculation.
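The interior part of the result can be checked by hand with a short sketch: each entry is the sum of the element-wise product of the kernel with the corresponding 2×2 window of the input (for this symmetric kernel, flipping the kernel makes no difference).

# Hand check of the top-left 2x2 block, i.e., the windows that fit entirely
# inside the 3x3 input; uses myMatrix and kernel as defined above
manual <- matrix(NA, 2, 2)
for (i in 1:2) {
  for (j in 1:2) {
    manual[i, j] <- sum(myMatrix[i:(i+1), j:(j+1)] * kernel)
  }
}
manual  # 2 1 / 1 1, matching myOutput[1:2, 1:2]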
Q3: Let’s try applying the convolution operation on a real image. E.g., use readImage(system.file("images", "sample-color.png", package="EBImage")) to get the image below
Use the convolution() function in the R package OpenImageR to filter this image. You can use the high-pass Laplacian filter, which would be defined in R as
## High-pass Laplacian filter
kernel = matrix(1, nc = 3, nr = 3)
kernel[2,2] = -8
Solution:
require(EBImage)
require(OpenImageR)
myMatrix = readImage(system.file("images", "sample-color.png", package = "EBImage"))
display(myMatrix, title = 'Sample')
## High-pass Laplacian filter
kernel = matrix(1, nc = 3, nr = 3)
kernel[2,2] = -8
# Make convolution
myOutput = convolution(myMatrix, kernel)
display(myOutput, title = 'Filtered image')
## Only the first frame of the image stack is displayed.
## To display all frames use 'all = TRUE'.
We can see that the convolution with the high-pass Laplacian filter captures the main geometric patterns of the image: because the kernel's entries sum to zero, flat regions are suppressed while edges are emphasized, so the outlines of the parrots stand out.
Q4: The figure below shows a NN model with its parameters.
Please use this NN model to predict on the following data points.

ID      x1      x2      y
1       0       1
2       -1      2
3       2       2
Solution: For data point #1, from the input layer to the first node in the hidden layer, we have 0 × 1 + 1 × (−1) + 1 = 0. The activation function at the first node in the hidden layer is 𝜙(𝑧) = max(0, 𝑧). Thus, 𝜙(0) = max(0, 0) = 0. From the input layer to the second node in the hidden layer, we have 0 × 0 + 1 × 1 + 1 = 2. The activation function at the second node in the hidden layer is 𝜙(𝑧) = max(0, 𝑧). Thus, 𝜙(2) = max(0, 2) = 2. From the hidden layer to the output, 𝑦 = 0 × 1 + 2 × (−1) = −2.
For data point #2, from the input layer to the first node in the hidden layer, we have −1 × 1 + 2 × (−1) + 1 = −2. The activation function at the first node in the hidden layer is 𝜙(𝑧) = max(0, 𝑧). Thus, 𝜙(−2) = max(0, −2) = 0. From the input layer to the second node in the hidden layer, we have −1 × 0 + 2 × 1 + 1 = 3. The activation function at the second node in the hidden layer is 𝜙(𝑧) = max(0, 𝑧). Thus, 𝜙(3) = max(0, 3) = 3. From the hidden layer to the output, 𝑦 = 0 × 1 + 3 × (−1) = −3.
For data point #3, from the input layer to the first node in the hidden layer, we have 2 × 1 + 2 × (−1) + 1 = 1. The activation function at the first node in the hidden layer is 𝜙(𝑧) = max(0, 𝑧). Thus, 𝜙(1) = max(0, 1) = 1. From the input layer to the second node in the hidden layer, we have 2 × 0 + 2 × 1 + 1 = 3. The activation function at the second node in the hidden layer is 𝜙(𝑧) = max(0, 𝑧). Thus, 𝜙(3) = max(0, 3) = 3. From the hidden layer to the output, 𝑦 = 1 × 1 + 3 × (−1) = −2.

In summary, the predictions are

ID      x1      x2      y
1       0       1       -2
2       -1      2       -3
3       2       2       -2
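The manual forward pass can be verified with a short R sketch. The weight matrices below are simply read off the calculations above (hidden node 1 weights (1, −1) with bias 1, hidden node 2 weights (0, 1) with bias 1, output weights (1, −1) with no bias, ReLU activation in the hidden layer):

relu <- function(z) pmax(z, 0)                   # ReLU, keeping matrix dimensions
X  <- matrix(c(0, 1,
              -1, 2,
               2, 2), ncol = 2, byrow = TRUE)    # the three input points (x1, x2)
W1 <- matrix(c(1, 0,
              -1, 1), ncol = 2, byrow = TRUE)    # column j = weights into hidden node j
b1 <- c(1, 1)                                    # hidden-layer biases
w2 <- c(1, -1)                                   # output-layer weights
H  <- relu(sweep(X %*% W1, 2, b1, "+"))          # hidden-layer outputs
H %*% w2                                         # predictions: -2, -3, -2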
Q5: Use the BostonHousing dataset from the R package mlbench, select the variable medv as the outcome and all other numeric variables as predictors. Run the R pipeline for NN on it. Please use 10-fold cross-validation to evaluate a NN model with 2 hidden layers, where each layer has a number of nodes of your choice. Comment on the result.

Solution: Let's try a NN model with 2 hidden layers, each with 3 nodes.

# Step 1 -> Read data into R workstation
library(mlbench)
data(BostonHousing)
data <- BostonHousing

# Step 2 -> Data preprocessing
# Create your X matrix (predictors) and Y vector (outcome variable)
x <- data[,c(1:3,5:13)]
y <- data[,14]
data <- data.frame(x,y)
# Make sure the outcome variable is legitimate. If it is a continuous variable
# (regression problem), it should be defined as a "num" variable in R. If it is
# a binary or a more general categorical variable (classification problem), it
# should be defined as a "factor" variable in R.

# Create a training dataset (4/5 of the original data size)
train.ix <- sample(nrow(data), floor(nrow(data) * 4/5))
data.train <- data[train.ix,]
# Create a testing dataset (the remaining 1/5)
data.test <- data[-train.ix,]

# Step 3 -> Use 10-fold cross-validation to evaluate the model
n_folds = 10 # number of folds (the parameter K in K-fold cross-validation)
N <- dim(data.train)[1] # the sample size, N, of the training dataset
folds_i <- sample(rep(1:n_folds, length.out = N)) # This randomly creates a
# labeling vector (1 x N) for the N samples. For example, if N = 16, this
# function may return 5 4 4 10 6 7 6 8 3 2 1 5 3 9 2 1, meaning that the first
# sample is allocated to the 5th fold, the 2nd and 3rd samples are allocated to
# the 4th fold, etc.
library(neuralnet)
cv_mse <- NULL # cv_mse records the prediction error for each fold
for (k in 1:n_folds) {
  test_i <- which(folds_i == k)          # In each of the n_folds iterations,
                                         # one fold is used as the testing data
  data.train.cv <- data.train[-test_i, ] # The remaining n_folds - 1 folds form
                                         # the training data
  data.test.cv <- data.train[test_i, ]   # The testing data, i.e., the kth fold
  model1 <- neuralnet(y ~ ., data = data.train.cv, hidden = c(3, 3)) # Fit the
                                         # neural network with 2 hidden layers
                                         # of 3 nodes each, trained on the CV
                                         # training folds only
  pred <- compute(model1, data.test.cv)  # Predict on the testing fold
  y_hat <- pred$net.result
  true_y <- data.test.cv$y               # True y values of the testing fold
  cv_mse[k] <- mean((true_y - y_hat)^2)  # Mean squared error (MSE); the
                                         # smaller this error, the better
}
mean(cv_mse)
## [1] 83.76633

The 10-fold cross-validated MSE of this NN model is about 84, noticeably larger than the cross-validated MSE of the KNN regression models on the same dataset (around 47-57). Note also that NN results can vary from run to run because of the random initialization of the weights, and performance would likely improve if the predictors were standardized before training (see the sketch below).
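As a possible refinement (a sketch, not part of the original pipeline): neural networks are sensitive to the scale of the inputs, so standardizing the predictors before training often helps neuralnet() converge and can reduce the cross-validated error. The same CV loop can then be rerun on the scaled data:

# Standardize the predictors (center to mean 0, scale to sd 1); the outcome y is kept as-is
x_scaled <- scale(x)
data_scaled <- data.frame(x_scaled, y = y)
data.train_scaled <- data_scaled[train.ix, ]
# ...then repeat the 10-fold CV loop above with data.train_scaled in place of data.train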