TEST BANK for Data Analysis and Statistics for Geography, Environmental Science, and Engineering 1st




To make sure you understand the workspace: save your workspace .Rdata file. Then close R and start a new R session, load the workspace, and make sure you still have the objects created before.
Solution: Just proceed as indicated.

Exercise 1-5
Use Notepad or Vim to create a simple text file myfile.txt. Type 10 numbers in a row separated by a blank space, trying to type numbers around a value of 10. Save in folder lab1. Now read the file using scan, calculate the sample mean, variance, and standard deviation, plot a stem-and-leaf diagram and a histogram, and discuss.
Solution: The solution details vary according to the numbers created by each student. One possible solution is

x <- scan("lab1/myfile.txt") length(x) stem(x) hist(x) round(mean(x),1) round(var(x),1) round(sd(x),1)

On the console obtain

> x <- scan("lab1/myfile.txt")
Read 10 items
> length(x)
[1] 10
> stem(x)

  The decimal point is at the |

   9 | 2
   9 | 5678
  10 | 11233

> hist(x)
> round(mean(x),1)
[1] 9.9
> round(var(x),1)
[1] 0.1
> round(sd(x),1)
[1] 0.4

On the graphics window obtain

[Figure: histogram of x, with Frequency on the vertical axis and x from about 9.2 to 10.4 on the horizontal axis.]

Exercise 1-6
Use file lab1/exercise.csv. Examine the file contents using Notepad or Vim. Read the file and list the numbers on the R console rounded to 2 decimals. Calculate the sample mean, variance, and standard deviation, plot a stem-and-leaf diagram and a histogram, and discuss.
Solution:



Then

> x.ex <- scan("lab1/exercise.csv",sep=",")
Read 100 items
> round(x.ex,2)
  [1] 0.58 2.30 1.44 0.35 0.04 1.58 0.18 1.08 1.38 2.33 1.41 0.35 0.30 0.11 2.84
 [16] 2.79 0.94 0.63 0.21 0.31 2.11 1.25 0.56 0.56 1.41 2.66 0.96 2.70 0.29 0.42
 [31] 1.26 0.11 0.45 0.80 0.42 1.04 0.39 0.08 2.52 0.74 2.75 1.45 0.52 1.11 0.51
 [46] 0.31 2.96 0.08 0.24 1.01 0.87 3.62 0.03 0.67 1.65 1.12 0.72 0.33 0.38 0.25
 [61] 1.30 0.51 0.78 2.56 0.02 0.31 0.14 0.15 0.35 1.72 0.67 0.19 0.84 1.33 3.28
 [76] 0.24 0.64 0.52 3.86 0.38 0.35 2.39 2.15 1.25 0.05 1.17 2.10 0.12 1.14 0.31
 [91] 1.30 1.31 0.14 0.35 0.56 0.47 0.26 0.88 0.13 1.10

Now

> round(mean(x.ex),2)
[1] 1
> round(var(x.ex),2)
[1] 0.82
> round(sd(x.ex),2)
[1] 0.91
> stem(x.ex)

  The decimal point is at the |

  0 | 0001111111111222222333333333344444444
  0 | 5555556666667777888999
  1 | 000111112233333344444
  1 | 667
  2 | 111334
  2 | 5677888
  3 | 03
  3 | 69

Finally hist(x.ex)

[Figure: histogram of x.ex, with Frequency on the vertical axis and x.ex from 0 to 4 on the horizontal axis.]

Exercise 1-7
Separate the first 20 and last 20 elements of the salinity x array into two objects. Plot a stem-and-leaf plot and a histogram for each.
Solution:

> x1 <- x[1:20]
> x2 <- x[21:40]
> hist(x1)
> stem(x1)

  The decimal point is at the |

  24 | 26467
  26 | 53
  28 | 792334677
  30 | 0789

[Figure: histogram of x1, Frequency vs. x1, with x1 from about 24 to 31.]

> stem(x2)

  The decimal point is at the |

  22 | 77777789
  23 | 0001112234
  23 | 57

> hist(x2)



[Figure: histogram of x2, Frequency vs. x2, with x2 from about 22.6 to 23.8.]


Chapter 2 Probability Theory

Exercise 2-1
Suppose we flip a fair coin to obtain heads or tails. Define the sample space and the possible outcomes. Define events and the probabilities of each.
Solution: Sample space U = {heads, tails}. Event A = {side facing up is heads}, then P[A] = 1/2, or 1 out of two outcomes. Event B = {side facing up is tails}, then P[B] = 1/2, or 1 out of two outcomes.

Exercise 2-2
Define event A = {rain today} with probability 0.2. Define the complement of event A. What is the probability of the complement?
Solution: The complement is B = {does not rain today}; P(B) = 1 - P(A) = 1 - 0.2 = 0.8.

Exercise 2-3
Define A = {rains less than 1 inch}, B = {rains more than 0.5 inches}. What is the intersection event C?
Solution: Event C = {rains less than 1 inch and more than 0.5 inch}, that is to say C = {rain between 0.5 and 1 inch}.

Exercise 2-4
A pixel of a remote sensing image can be classified as grassland, forest, or residential. Define A = {land cover is grassland}, B = {land cover is forest}. What is the union event C? What is D, the complement of C?
Solution: Event C = {land cover is grass or forest}, Event D = {land cover is residential}.

Exercise 2-5
Assume we flip a coin three times in sequence. The outcome of a toss is independent of the others. Calculate and enumerate the possible combinations and their probabilities.


Solution: Possible outcomes n = 2^3 = 8. Sample space U = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. Each outcome is equally likely with probability 1/8, obtained by (1/2)^3.

Exercise 2-6
Assume we take water samples from water wells to determine if the well is contaminated. Assume we sample four wells and that they are independent. Calculate the number and enumerate the possible events of contamination results. Calculate the number and enumerate those that would have exactly two contaminated wells in the four trials.
Solution: n = 2^4 = 16. Sample space U = {NNNN, CNNN, NCNN, etc.} where C = contaminated, N = not contaminated. Of these, the binomial coefficient C(4,2) = 6 include exactly two contaminated wells; these are {CCNN, CNCN, CNNC, NCCN, NCNC, NNCC}.

Exercise 2-7
Using the tree of Figure 2-8, what is the total probability that the test is in error? Hint: BD or AC. What is the probability that the test is correct?
Solution: P[BD] = 0.056, P[AC] = 0.006.
Test is in error: P[BD] + P[AC] = 0.056 + 0.006 = 0.062.
Test is correct: 1 - (P[BD] + P[AC]) = 1 - 0.062 = 0.938 (could also sum P[AD] + P[BC]).

Exercise 2-8
Using Figure 2-8 and Bayes' theorem: what is the probability that the water is contaminated given a positive test result? Hint: calculate P[A|D].
Solution:
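The enumerations in Exercises 2-5 and 2-6 can be checked in base R. The short sketch below is not part of the printed solution; it only assumes base R and codes contaminated/not-contaminated wells as "C" and "N".

# enumerate all 2^4 outcomes for four wells
wells <- expand.grid(rep(list(c("C","N")), 4))
nrow(wells)                       # 16 possible outcomes
# count outcomes with exactly two contaminated wells
sum(rowSums(wells == "C") == 2)   # 6
choose(4, 2)                      # the same binomial coefficient, 4 choose 2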

P[A|D] = P[AD]/P[D] = P[D|A] P[A] / (P[D|A] P[A] + P[D|B] P[B])

P(D) = 0.2(1 - 0.03) + 0.8(0.07) = 0.25


P(A|D) = 0.2(0.97)/0.25 = 0.776

Exercise 2-9
Assume 20% of an area is grassland. We have a remote sensing image of the area. An image classification method yields the correct grass class with probability 0.9 and the correct non-grass class with probability 0.9. What is the probability that the true vegetation of a pixel classified as grass is grass? Repeat assuming that grassland is 50% of the area. Which one is higher and why?
Solution: Apply Bayes' theorem as above. The probability of grass is P(G) = 0.2, so the probability of non-grass is P(NG) = 0.8. The probability of the grass class given that it is grass is P(g|G) = 0.9, so P(ng|G) = 0.1. The probability of the non-grass class given that it is non-grass is P(ng|NG) = 0.9, so P(g|NG) = 0.1. We want P(G|g):

P[G|g] = P[Gg]/P[g] = P[g|G] P[G] / (P[g|G] P[G] + P[g|NG] P[NG])
       = (0.9 × 0.2) / (0.9 × 0.2 + 0.1 × 0.8) = 0.18/0.26 = 0.69

Now if P(G) = 0.5:

P[G|g] = (0.9 × 0.5) / (0.9 × 0.5 + 0.1 × 0.5) = 0.45/0.50 = 0.9
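These hand computations (and the one in Exercise 2-8) can be reproduced with a few lines of base R; this is an added check, not part of the original solution:

# Exercise 2-8: P(A|D) with P(A)=0.2, P(D|A)=0.97, P(D|B)=0.07
pA <- 0.2; pD.A <- 0.97; pD.B <- 0.07
round(pD.A*pA / (pD.A*pA + pD.B*(1 - pA)), 3)   # 0.776
# Exercise 2-9: P(G|g) for P(G) = 0.2 and P(G) = 0.5
pG <- c(0.2, 0.5)
round(0.9*pG / (0.9*pG + 0.1*(1 - pG)), 2)      # 0.69 and 0.90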

The result is higher for P(G) = 0.5. This makes sense because a higher P(G) increases P(Gg).

Exercise 2-10
Plot a histogram in probability density scale for the DO variable of the x object from datasonde.csv. Save the graph as a jpeg file. Insert it into an application.
Solution: hist(DO, prob=T)
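A fuller sketch of the same exercise is given below. It is an assumption, not part of the printed solution, that the file sits in folder lab2 with a header and a column named DO, and the jpeg file name is arbitrary:

x <- read.table("lab2/datasonde.csv", header=TRUE, sep=",")
jpeg(file="lab2/DO-hist.jpg")    # open a jpeg graphics device
hist(x$DO, prob=TRUE, xlab="DO") # histogram in probability density scale
dev.off()                        # close the device so the file is written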



Exercise 2-11
Read file lab2/lake-lewisville.csv to a data frame. Use both Rcmdr and the R console.
Solution:

x <- read.table("lab2/lake-lewisville.csv",header=T,sep=",")
> x
        Date    Time Temp SpCond   TDS Salinity DOsat    DO Depth   pH Turbid IBatt
1   1/1/2010 0:00:00 7.59  328.3 213.4     0.16 109.2 13.06 0.826 8.59    6.8  10.7
2   1/1/2010 0:30:00 7.59  328.3 213.4     0.16 109.7 13.12 0.829 8.59    6.5  10.7
3   1/1/2010 1:00:00 7.57  328.2 213.3     0.16 109.3 13.07 0.830 8.59    6.4  10.8
4   1/1/2010 1:30:00 7.55  328.2 213.3     0.16 109.3 13.07 0.831 8.59    6.5  10.8
5   1/1/2010 2:00:00 7.55  328.2 213.3     0.16 109.0 13.04 0.828 8.60    6.7  10.8
6   1/1/2010 2:30:00 7.51  328.3 213.4     0.16 109.0 13.05 0.829 8.59    6.7  10.7
7   1/1/2010 3:00:00 7.53  328.0 213.2     0.16 109.0 13.05 0.831 8.59    6.7  10.8
8   1/1/2010 3:30:00 7.50  328.2 213.3     0.16 108.9 13.04 0.831 8.59    6.8  10.8
9   1/1/2010 4:00:00 7.50  328.2 213.3     0.16 108.7 13.02 0.824 8.59    6.3  10.8
10  1/1/2010 4:30:00 7.50  328.3 213.4     0.16 108.6 13.00 0.827 8.59    6.2  10.7
... etc.

Exercise 2-12


Plot the variables of the data frame created in Exercise 2-11.
Solution: Time should be converted to a sequence of real numbers from hour 0 to hour 23.5. It is convenient to write a loop and plot each variable.

attach(x)
time <- seq(0,23.5,0.5)
pdf(file="lab2/lakelewisville.pdf")
for(i in 3:12) plot(time, x[,i], type="l", col=1, ylab=names(x)[i])
dev.off()

The PDF contains one page per variable. For example

Exercise 2-13



Generate a linear function y = ax + b, using a = 0.1, b = 0.1. Plot y for values of x in 0 to 1. Limit the y-axis to go from 0 to the maximum of y.
Solution:


> a=0.1; b=0.1; x=seq(0,1,0.1)
> y <- a*x+b; plot(x,y,type="l",ylim=c(0,max(y)))

[Figure: line plot of y = 0.1x + 0.1 for x in [0, 1].]

Exercise 2-14
Generate a linear function y = ax + b, using b = 0.1 and two values of a, a = 0.1 and a = -0.1. Plot y for values of x in the interval [0,1]. Limit the y-axis to the interval [minimum of y, maximum of y]. Place a legend.
Solution:



a=c(0.1,-0.1); b=0.1; x=seq(0,1,0.1)
y <- matrix(nrow=length(x), ncol=length(a))
for(i in 1:2) y[,i] <- a[i]*x+b
matplot(x,y,type="l",ylim=c(min(y),max(y)), col=1)
legend(0.8,b,paste("a=",as.character(a)), lty=c(1:length(a)))

[Figure: the two lines y = 0.1x + 0.1 and y = -0.1x + 0.1 plotted for x in [0, 1], with a legend identifying a = 0.1 and a = -0.1.]

Exercise 2-15
This exercise refers to the Bayes' rule script. Change the probability of contamination P[A] to 0.3. Plot the probability of contamination given that a test is negative, P[A|C], vs. the false negative error, with the false positive error as a parameter. Hint: modify the script given above for Bayes' rule to reverse the roles of Fneg and Fpos.
Solution:

# pA = contamination p[A]



# Fneg = false negative p[C|A]
# Fpos = false positive p[D|B]
# fix pA and explore changes of p[A|C]
# as we vary Fpos and Fneg
# fix pA
pA=0.3
# sequence of values
Fneg <- seq(0,1,0.05); Fpos <- seq(0,1,0.2)
# array to store results
Cont.neg <- matrix(nrow=length(Fneg),ncol=length(Fpos))
# Bayes theorem
for(i in 1:length(Fpos))
  Cont.neg[,i] <- Fneg*pA/(Fneg*pA + (1-Fpos[i])*(1-pA))
# plot
matplot(Fneg,Cont.neg, type="l",lty=1:length(Fpos), col=1,
        xlab="False Negative Error", ylab="Prob(Contaminated | test negative)")
legend(0,1, paste("Fpos=",as.character(Fpos)), lty=1:length(Fpos), col=1)

Exercise 2-16
On the decision making script, change ΔI to 4 and plot again. Discuss the changes obtained for the values of p at which we would decide for alternative A1.
Solution:

# fix delta I
dI <- 4
# sequences for delta M and p
dM <- seq(0,10,2); nM <- length(dM)
p <- seq(0,1,0.01); np <- length(p)
# prepare a 2D array to store results
C <- matrix(nrow=np, ncol=nM)
# loop to calculate C for various dM
for(i in 1:nM) C[,i] <- dI - dM[i]*p
# plot the family of lines
matplot(p,C,type="l",lty=1:nM,col=1,ylim=c(-dI,dI))
# draw horizontal line at 0 to visualize crossover
abline(h=0)
# legend to identify the lines, use a keyword to position it
legend("bottomleft",leg=paste("dM=",dM),lty=1:nM,col=1)



[Figure: family of lines dC vs. p, with a horizontal line at 0 and a legend for dM = 0, 2, 4, 6, 8, 10.]

The values of p at which the lines cross zero, and at which we would decide for alternative A1, have increased by a factor of 2.



Chapter 3 Random Variables, Distributions, Moments, and Statistics

Exercise 3-1 Define a RV based on outcomes of classification of a pixel of a remote sensing image as grassland (prob=0.2), forest (prob=0.4) or residential (prob=?). Is this RV discrete or continuous? Plot the distributions (density or mass) and cumulative. Calculate the mean and variance. Solution: Define grass=1, forest=2, residential=3. Values xi={1,2,3} with pi={0.2,0.4,0.4}. This is a discrete RV.
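A short R sketch of the requested computations follows; it is an addition to the printed solution and uses only the stated values of xi and pi:

x <- c(1, 2, 3)             # 1 = grass, 2 = forest, 3 = residential
p <- c(0.2, 0.4, 0.4)
mu <- sum(x*p); mu          # mean of the discrete RV
sum((x - mu)^2 * p)         # variance
par(mfrow=c(1,2))
plot(x, p, type="h", ylab="P(X=x)", main="pmf")
plot(x, cumsum(p), type="s", ylab="F(x)", main="cmf")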



Exercise 3-2 Define a RV from the outcome of soil moisture measurements in the range of 20-40 % in volume. Give an example of an event. Is this RV discrete or continuous? Assuming that it can take values in [20,40] uniformly, plot appropriate distributions (density or mass) and cumulative. Calculate the mean and variance. Solution: Domain is [20,40] real number interval. This is a continuous RV. Example of an event= moisture in [34,35] interval.

Exercise 3-3

Consider a RV uniformly distributed between a = 5 and b = 10. Calculate the mean and variance using equations 3.23 and 3.24. Plot the pdf and cdf.
Solution:
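For the continuous uniform the usual results are mean (a+b)/2 and variance (b-a)^2/12, which is what equations 3.23 and 3.24 reduce to. The sketch below is an added illustration in base R:

a <- 5; b <- 10
(a + b)/2              # mean = 7.5
(b - a)^2/12           # variance = 2.083
par(mfrow=c(1,2))
curve(dunif(x, a, b), from=4, to=11, ylab="density", main="pdf")
curve(punif(x, a, b), from=4, to=11, ylab="F(x)", main="cdf")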



Exercise 3-4
Plot the pmf and cmf of a binomial with n = 3 trials for three values of p: 0.2, 0.5, 0.8. Discuss the differences.
Solution: For p = 0.2:

p(0) = C(3,0) × 0.2^0 × 0.8^3 = 1 × 1 × 0.512 = 0.512
p(1) = C(3,1) × 0.2^1 × 0.8^2 = 3 × 0.2 × 0.64 = 0.384
p(2) = C(3,2) × 0.2^2 × 0.8^1 = 3 × 0.04 × 0.8 = 0.096
p(3) = C(3,3) × 0.2^3 × 0.8^0 = 1 × 0.008 × 1 = 0.008

For p=0.5



p(0) = C(3,0) × 0.5^0 × 0.5^3 = 1 × 1 × 0.125 = 0.125
p(1) = C(3,1) × 0.5^1 × 0.5^2 = 3 × 0.5 × 0.25 = 0.375
p(2) = C(3,2) × 0.5^2 × 0.5^1 = 3 × 0.25 × 0.5 = 0.375
p(3) = C(3,3) × 0.5^3 × 0.5^0 = 1 × 0.125 × 1 = 0.125

For p = 0.8:

p(0) = C(3,0) × 0.8^0 × 0.2^3 = 1 × 1 × 0.008 = 0.008
p(1) = C(3,1) × 0.8^1 × 0.2^2 = 3 × 0.8 × 0.04 = 0.096
p(2) = C(3,2) × 0.8^2 × 0.2^1 = 3 × 0.64 × 0.2 = 0.384
p(3) = C(3,3) × 0.8^3 × 0.2^0 = 1 × 0.512 × 1 = 0.512
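The three hand-computed pmfs (and the corresponding cmfs) can be verified with dbinom and pbinom; this check is an addition to the printed solution:

n <- 3
for (p in c(0.2, 0.5, 0.8)) print(round(dbinom(0:n, size=n, prob=p), 3))
round(pbinom(0:n, size=n, prob=0.2), 3)   # cmf for p = 0.2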



For p less than 0.5 the mass is concentrated at the small counts, for p = 0.5 the distribution is symmetrical, and for p greater than 0.5 the mass is concentrated at the large counts.

Exercise 3-5
Calculate the sample mean, variance, and standard deviation of hypothetical data drawn from the RV of Exercise 3-1. The data obtained were 300 grass pixels, 500 forest, and 200 residential out of 1000 pixels.
Solution: Assign 1 to a grass pixel, 2 to a forest pixel, and 3 to a residential pixel. The sample mean is

X̄ = (1/n) Σ Xi = (1/1000)((1 + 1 + ... + 1) 300 times + (2 + 2 + ... + 2) 500 times + (3 + 3 + ... + 3) 200 times)
  = (1/1000)(1 × 300 + 2 × 500 + 3 × 200) = 1.9

Note that each one of the values appears 300, 500, and 200 times, and this is equivalent to multiplication. The sample variance is

s² = (1/(n − 1)) Σ (Xi − X̄)²
   = (1/999)((1 − 1.9)² × 300 + (2 − 1.9)² × 500 + (3 − 1.9)² × 200)
   = (1/999)((0.9)² × 300 + (0.1)² × 500 + (1.1)² × 200) = 490/999 = 0.49

The sample standard deviation is sx = √0.49 = 0.7
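The same sample statistics can be obtained in R by expanding the counts into the 1000 individual pixel values; this is an added check, not part of the printed solution:

x <- rep(c(1, 2, 3), times=c(300, 500, 200))
mean(x); var(x); sd(x)    # 1.9, 0.49, 0.7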

Exercise 3-6
At a site, monthly air temperature is normally distributed. It averages 20 °C with standard deviation 4 °C. What is the probability that a value of air temperature in a given month exceeds 24 °C? What is the probability that it is below 16 °C or above 24 °C?
Solution:


Note that 24 °C is the mean plus one sd, and 16 °C is the mean minus one sd. Thus, values above 24 °C or below 16 °C have probability 1 − 0.68 = 0.32. Values above 24 °C account for only half of that probability, so the probability is (1 − 0.68)/2 = 0.16.

Exercise 3-7
Assume 60% of a landfill is contaminated. Suppose that we randomly take three soil samples to test for contamination. We define event C = soil sample contaminated. We define X to be a RV where x = number of contaminated soil samples. Determine all possible values of X. What distribution do we get for X? Calculate the values of the pmf and cmf for all values of x. Graph the pmf and cmf. Calculate the mean and the variance.
Solution: Possible values are 0, 1, 2, 3. The distribution is binomial with p = 0.6 and n = 3. Values of the pmf are

p(0) = C(3,0) × 0.6^0 × 0.4^3 = 1 × 1 × 0.064 = 0.064
p(1) = C(3,1) × 0.6^1 × 0.4^2 = 3 × 0.6 × 0.16 = 0.288
p(2) = C(3,2) × 0.6^2 × 0.4^1 = 3 × 0.36 × 0.4 = 0.432
p(3) = C(3,3) × 0.6^3 × 0.4^0 = 1 × 0.216 × 1 = 0.216

Values of cmf are 0.064, then 0.064+0.288=0.352, then 0.352+0.432=0.784, then 0.784+0.216=1.000



Mean and variance are

µX = np = 3 × 0.6 = 1.8
σX² = np(1 − p) = 1.8 × 0.4 = 0.72

Exercise 3-8
Read the data in file lab3/hintense.txt, which contains the number of intense hurricanes by year in the period 1945-1999. Source: Atlantic basin tropical cyclone data, Dr. Christopher W. Landsea, "FAQ: Hurricanes, Typhoons, and Tropical Cyclones", Version 2.10, Part E: Tropical Cyclone Records, available online at: http://www.aoml.noaa.gov/hrd/tcfaq/tcfaqE.html. Table the data, plot a pie chart and a barplot. What is the most common number of intense hurricanes per season in this time period? Determine the mean, standard deviation, and coefficient of variation.
Solution:

> xi <- read.table("lab3/hintense.txt",skip=10)
> #xt <- read.table("lab3/hnumber.txt",skip=10)
> table(xi[,2])
 0  1  2  3  4  5  6  7
 5 14 17 11  2  4  2  1
> par(mfrow=c(1,2))
> pie(table(xi[,2]),col=gray(seq(0.4,1,length=6)))
> barplot(table(xi[,2]))




> mean(xi[,2])
[1] 2.285714
> mean(xi[,2])/sd(xi[,2])
[1] 1.425393

[Figure: pie chart and barplot of the number of intense hurricanes per season.]

The most common number is 2 intense hurricanes, occurring in 17 seasons.

Exercise 3-9
Generate a sample of 100 random numbers from a N(0,1). Obtain an estimate of the density function. Calculate the interquartile distance (iqd). Smooth the density using the iqd. Calculate the sample mean, variance, and standard deviation. Compare the sample statistics to the theoretical moments of N(0,1) and the density estimate to the theoretical density.
Solution:

> x <- rnorm(100,0,1)
> summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-1.97200 -0.64130 -0.06786 -0.01221  0.53150  2.23300



> iqd <- 0.5315-(-0.6413)
> iqd
[1] 1.1728
> mean(x);var(x);sd(x)
[1] -0.01221315
[1] 0.6778474
[1] 0.8233149

The sample mean is approximately equal to the theoretical 0, but slightly negative. The sample standard deviation 0.82 is not too close to the theoretical 1.

[Figure: density estimate of x; density.default(x = x), N = 100, Bandwidth = 0.2932.]

The density is slightly asymmetrical with a heavy tail to the right.

Exercise 3-10
Suppose that on average we get 1/3 intense hurricanes out of the total Atlantic hurricanes in a season. Suppose a binomial distribution for the number of intense hurricanes in a season. Calculate the probabilities of the possible values of the number of intense hurricanes in a season of 10 hurricanes. Calculate the standard deviation.


Solution:

> n = 10
> p <- dbinom(x=seq(0,n),size=n, prob=1/3)
> round(p,5)
 [1] 0.01734 0.08671 0.19509 0.26012 0.22761 0.13656 0.05690 0.01626 0.00305
[10] 0.00034 0.00002
> sd.x <- sqrt(n*(1/3)*(1-1/3))
> sd.x
[1] 1.490712

Exercise 3-11
Calculate the mean and variance of the sample means of 100 samples of size 10 from a normal distribution. Calculate the standard error.
Solution:

> x <- matrix(ncol=100,nrow=10)
> m.x <- array()
> for(i in 1:100){
+   x[,i] <- rnorm(10,0,1)
+   m.x[i] <- mean(x[,i])
+ }
> mean(m.x); var(m.x); sd(m.x)
[1] 0.0277511
[1] 0.09509917
[1] 0.3083815

Note that the standard error is close to the expected value, i.e., the square root of 1/10 = 0.31.

Exercise 3-12
Write a script to calculate the species evenness of a discrete uniform distribution from 1 to n, for n = 2, 3, ..., 10. Here n is species richness. Plot evenness vs. richness and discuss what happens to evenness as you change species richness.
Solution:

n <- seq(2,10); E <- array()
for(i in 1:length(n)){
  p <- rep(1/n[i],n[i])
  E[i] <- - sum(p*log(p))
}
plot(n,E)



[Figure: plot of evenness E vs. richness n for n = 2 to 10.]

Evenness increases with richness. The increase is non-linear; the slope decreases with richness.

Exercise 3-13
There are 130 residents in an area where you are researching their attitudes toward the environment and natural resources. Suppose you assign a code number in a sequence from 1 to 130. You want to conduct a survey that covers 10% of these residents. Select a random sample without replacement. List the code numbers of the residents to interview.
Solution: Apply sample to the code numbers from 1 to 130 and select 10%, that is 130/10 = 13.

> sample(seq(1,130,1), 13, replace=F)
 [1] 107  10 112  95  52  14  63  61  75 113  94  37  20



Chapter 4 Exploratory analysis and introduction to inferential statistics

Exercise 4-1
At a site, monthly air temperature is normally distributed. It averages 20 °C with standard deviation 2 °C. a) What is the probability that a value of air temperature in a given month exceeds 24 °C? b) What is the probability that it is below 16 °C or above 24 °C? c) What is the probability that it is below 18 °C or above 26 °C?
Solution:
a) 24 °C is two standard deviations above the mean, thus we can use the 95% value and count only the upper tail: P(T ≥ 24) = (1 − 0.95)/2 = 0.025.
b) 16 °C is 2σ below the mean, so we count both tails: P(T ≤ 16 or T ≥ 24) = 1 − 0.95 = 0.05.
c) 18 °C is 1σ below the mean and 26 °C is 3σ above, so we have half a lower tail of 1 − 0.68 and half an upper tail of 1 − 0.99, therefore P(T ≤ 18 or T ≥ 26) = (1 − 0.68)/2 + (1 − 0.99)/2 = 0.16 + 0.005 = 0.165.

Exercise 4-2
Is it true that the covariance of X and Y is equal to the variance of X when X = Y? Explain why. Demonstrate. Explain why the auto-covariance of X at any lag cannot exceed the value at lag 0.
Solution: Yes, because

cov(X, X) = E[(X − µX)(X − µX)] = E[(X − µX)²] = σX²

At lag 0 the covariance is the variance:

cov(0) = E[(X − µX)(X − µX)] = E[(X − µX)²] = σX²

At any lag the covariance is at most the variance; thus, the covariance at any lag is less than or equal to the covariance at zero lag.


Exercise 4-3 Suppose we have collected 50 values for a sample of ozone and the average is 2.00, with standard deviation of 0.5. What would be the value of t when testing that the mean is equal to 2.5? Solution:

t = (X̄ − µX)/se = (X̄ − µX)/(sX/√n) = (2 − 2.5)/(0.5/√50) = −7.07
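The same value can be checked with one line of R (an added check, not part of the printed solution):

(2 - 2.5) / (0.5/sqrt(50))   # t statistic, about -7.07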

Exercise 4-4
Monthly rainfall at a site is classified in two groups: one group for El Niño months and the other for La Niña months (defined according to sea surface temperature in the Pacific Ocean). We have 100 months for each group. The variance of each group is the same. Is it true that rainfall during El Niño is different from that during La Niña? What type of test would you run? What is H0? Suppose you get a p-value = 0.045. What is the conclusion of the study?
Solution: Run a t-test on the two samples. H0 is that the two samples come from distributions with the same mean. With this p-value we reject H0 at the 5% level and decide that monthly rainfall for El Niño months is different from that for La Niña months.

Exercise 4-5
Listed below are grades on a paper for a class of 14 students. The list is in the same sequence in which they were graded.

90 80 100 90 77 96 88 83 81 93 97 95 72 85

Calculate descriptive statistics; check these grades for normality, outliers, and serial correlation. Apply functions eda6, ts.plot, and acf. Note that serial correlation could exist if the grader gets either more generous or tougher during the grading process.
Solution: If the above grades are stored in a file lab4/grades.txt:

> grades <- scan("lab4/grades.txt")
> summary(grades)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  72.00   81.50   89.00   87.64   94.50  100.00



> eda6(grades, "Class Grades")

[Figure: eda6 output for Class Grades — index plot, boxplot, histogram, density approximation, QQ plot, and ECDF vs. standard normal.]

The summary() tells us that the mean is a high B. There are four grades above the third quartile. There is one grade at the maximum possible score of 100. We apply function eda6 and obtain the results in the figure above.


Then apply eda.ts and obtain the figure below.

> eda.ts(grades,ylabel="Grades")



[Figure: eda.ts output for Grades — time series (index) plot and autocorrelation function (ACF).]

Examine this figure for the index, histogram, boxplot, and density graphs; also for the empirical cdf, q-q, time series, and autocorrelation graphs. The resulting graphs suggest that the data are nearly normal; the density has a heavy tail and some positive skewness; the median is further to the right than it should be. These plots suggest that grading was not affected much through time. All spikes at lag 1 and longer are relatively small, except maybe for the one at lag 2, suggesting a minor relation every two papers graded. In conclusion, the grading process does not seem to be affected by the sequence followed during grading.

Exercise 4-6
Use the data in file lab4/example2.txt. Apply EDA methods to this dataset. Is this sample normally distributed? Are there outliers? Is there serial correlation? Use a Z test to check that the population mean and standard deviation are 30 and 1.


Solution:

x <- scan("lab4/example2.txt")
eda6(x, "example2")
eda.ts(x,ylabel="example2")

[Figure: eda6 output for example2 — index plot, boxplot, histogram, density approximation, QQ plot, and ECDF vs. standard normal.]


[Figure: eda.ts output for example2 — time series plot and ACF.]

This sample seems normally distributed. Only two outliers show in the boxplot. There is not much serial correlation. Using Z test to check that the population mean and standard deviation are 30 and 1

Z = (X̄ − µX)/σe = (X̄ − µX)/(σX/√n)

> Z <- (mean(x)-30)/(1/sqrt(length(x)))
> Z
[1] -0.386325
> pnorm(Z,0,1)
[1] 0.349628

and the p-value is 2 × 0.35 = 0.7, suggesting not to reject H0.


Exercise 4-7
a) Generate 100 random numbers from a normal distribution with µ = 300, σ = 30. Use EDA and discuss whether the sample does indeed look normal. Calculate the sample mean, variance, and standard deviation.

x <- rnorm(100,300,30)
eda6(x)

[Figure: eda6 output for x — index plot, boxplot, histogram, density approximation, QQ plot, and ECDF vs. standard normal.]

The sample looks normal.

> mean(x);var(x); sd(x)
[1] 300.9658
[1] 780.9887
[1] 27.94618

b) Generate another sample of the same size (n = 100) and standard deviation (σ = 30) but with a different mean µ = 350. Repeat the EDA, calculations, and discussion.

> y <- rnorm(100,350,30)



> eda6(y)
> eda.ts(y)
> mean(y);var(y); sd(y)
[1] 353.4749
[1] 871.8063
[1] 29.52637

c) Select an appropriate test to see whether there is indeed a difference in the sample means of the two samples. Justify your selection. Run the test, report your results, and discuss. Calculate the power and the number of observations required to increase power to 0.9 if power is low.

The two-sample t-test would be appropriate since we have equal variances.

> t.test(x,y)
        Welch Two Sample t-test
data:  x and y
t = -12.9159, df = 197.404, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -60.52646 -44.49185
sample estimates:
mean of x mean of y
 300.9658  353.4749

We can easily reject with this low p-value. Now calculate power:

> power.t.test(n=100, delta = 50, sd=30, type="two.sample",
+   alternative = "one.sided", sig.level=0.01)
     Two-sample t test power calculation
              n = 100
          delta = 50
             sd = 30
      sig.level = 0.01
          power = 1
    alternative = one.sided
NOTE: n is number in *each* group

Power is high enough; there is no need to increase the sample size.


Exercise 4-8 Use data on number of Atlantic Hurricanes per season. File lab4/hnumber.txt. Question: Is the mean number of hurricanes per season greater than 3? Apply most appropriate test (parametric or nonparametric) to determine this. Justify your reasoning and provide conclusive answer to the question. Calculate power and number of observations required to increase power to 0.9 if power is low. Hint: Import data from hnumber.txt into a data set. Use read.table from console or Rcmdr read as table from text file. Solution:

hn <- read.table("lab4/hnumber.txt", skip=11) shapiro.test(hn$V2) # okay normal > t.test(hn$V2, mu=3, alt="greater") One Sample t-test data: hn$V2 t = 9.0532, df = 54, p-value = 1.024e-12 alternative hypothesis: true mean is greater than 3 95 percent confidence interval: 5.312035 Inf sample estimates: mean of x 5.836364 >

With this very small p-value we reject H0 and conclude that the mean number of hurricanes per season is greater than 3.

mean(hn$V2)  # 5.83
sd(hn$V2)    # 2.32
power.t.test(n=length(hn$V2), delta = 2, sd=2.3, type="one.sample",
  alt = "one.sided", sig.level=0.01)

Results:

> power.t.test(n=length(hn$V2), delta = 2, sd=2.3, type="one.sample",
+   alt = "one.sided", sig.level=0.01)
     One-sample t test power calculation
              n = 55
          delta = 2
             sd = 2.3
      sig.level = 0.01
          power = 0.9999619
    alternative = one.sided

Power is high enough.

Exercise 4-9
Study the following question: are there fewer Atlantic hurricanes during El Niño years as compared to La Niña years? Use the file lab4/enso-yrs.txt provided to generate two sets of values: one for El Niño and another for La Niña. Determine the most appropriate test and provide an answer to the question. Explain the null and alternative hypotheses. Calculate the power and the number of observations required to increase power to 0.9 if power is low.
Solution: There are two possible approaches, one via datasets and the other via arrays. Here I will outline how to solve it using arrays and the R console. First you have to pull data from two files, enso-yrs.txt and hnumber.txt, into data arrays. We already saw how to get data from hnumber.txt in the previous exercise. Now, we can get Niño and Niña numbers using loops.

# import hurricane number
hn.yr <- matrix(scan("lab3/hnumber.txt", skip=11),ncol=2,byrow=T)
yr <- hn.yr[,1]; hn <- hn.yr[,2]
# import nino and nina years
nino.yr <- scan("lab3/enso-yrs.txt", skip=1,nline=1)
nina.yr <- scan("lab3/enso-yrs.txt", skip=2,nline=1)
# define arrays to store values and then loop to merge data
nino.hn <- array(); nina.hn <- array()
# loop thru all years
for(j in 1:length(yr)) {
  # check if el nino yr and store
  for(i in 1:length(nino.yr)) if(hn.yr[j,1]==nino.yr[i]) nino.hn[i] <- hn.yr[j,2]
  # now do la Nina
  for(i in 1:length(nina.yr)) if(hn.yr[j,1]==nina.yr[i]) nina.hn[i] <- hn.yr[j,2]
}

The results are:


> nino.hn; nina.hn
 [1]  6  7  6  7 12  4  5  3  3  5  4 11 10
 [1] 11  8  4  4  6  4  6  7  9

Now we need to look at the data. Apply eda6().

> eda6(nino.hn, "Nino")
> eda6(nina.hn, "Nina")

[Figure: eda6 output for Nino — index plot, boxplot, histogram, density approximation, QQ plot, and ECDF vs. standard normal.]


[Figure: eda6 output for Nina — index plot, boxplot, histogram, density approximation, QQ plot, and ECDF vs. standard normal.]

You conclude that a t-test is not appropriate because the data are not normal. A non-parametric test is more appropriate. The null hypothesis is that the mean of nino.hn is the same as that of nina.hn, so that if we reject it we will conclude that there are fewer hurricanes during El Niño years. Therefore use alternative="less" in the test. First note that

> mean(nino.hn)
[1] 6.384615
> mean(nina.hn)
[1] 6.555556

boxplot(nino.hn,nina.hn)



[Figure: boxplots of nino.hn and nina.hn.]

The sample means are close and we suspect that we cannot detect the difference.

# now perform wilcox test between nino.hn and nina.hn
> wilcox.test(nino.yr,nina.yr, alternative="less")
        Wilcoxon rank sum test
data:  nino.yr and nina.yr
W = 74, p-value = 0.854
alternative hypothesis: true location shift is less than 0

The p-value indicates that we cannot reject the null; therefore we have no evidence against the claim of equal means, and we cannot conclude that there are fewer hurricanes during El Niño years compared to La Niña years.

Exercise 4-10
Ground-level ozone is an important urban air pollution problem related to emissions and photochemistry. Use the data set airquality of package datasets. Is ground-level ozone in this sample correlated to solar radiation? Determine appropriate tests to run. Provide an answer to the question.
Solution:

data(airquality)
x <- airquality$Ozone
y <- airquality$Solar.R
cor.test(x,y)



        Pearson's product-moment correlation
data:  x and y
t = 3.8798, df = 109, p-value = 0.0001793
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.173194 0.502132
sample estimates:
      cor
0.3483417

There is a non-zero correlation between ozone and solar radiation, but the correlation is not high (~0.35).



Chapter 5 More on inferential statistics: Goodness of Fit, contingency analysis, and analysis of variance

Exercise 5-1
We want to check if 100 values come from a uniform distribution using a GOF test. We got the following counts in 5 categories (bins or intervals): 21, 19, 18, 22, 20. What test would you use? Calculate the statistic (it should be 0.5). How many degrees of freedom do we have? The p-value for this value of the statistic is 0.973. What is your conclusion? Repeat for counts 30, 25, 20, 15, 10. The statistic should now be 12.5 and the p-value 0.013. What is your conclusion now?
Solution: We could use the chi-square test. Using R you could do

> x <- c(21,19,18,22,20)
> X2 <- sum((x-20)^2)/20
> X2
[1] 0.5

We have df = 5 - 1 = 4. The p-value is

> 1- pchisq(0.5,4)
[1] 0.973501

Therefore we cannot reject. But when

> x <- c(30,25,20,15,10)
> X2 <- sum((x-20)^2)/20
> X2
[1] 12.5
> 1- pchisq(X2,length(x)-1)
[1] 0.01399579

now we can reject.

Exercise 5-2
Three forest cover types (palms, pines, and hardwoods) and three terrain types (valley, slope, and ridge). We want to see if forest cover is associated with terrain. You classify


all available polygons and organize them in a table. What test would you use? Suppose you run this test and get a value for the statistic of 10.64 and a p-value of 0.0309. What is H0? What is your conclusion?
Solution: This is contingency analysis and we can apply the chi-square test. H0 is that forest cover is not associated with terrain. Given the result we can reject H0 at the 5% level and conclude that it is associated.

Exercise 5-3
Three zones (A, B, and C) are distinguished in a lake. Ten values of nitrate concentration are taken at random in each zone. Is nitrate concentration different among zones? What analysis do you run to answer this question? Suppose you get statistic = 5.8, p-value = 0.007. What is the statistic? What is H0? What is your conclusion?
Solution: We can use one-way ANOVA. The statistic is F. H0 is that there is no difference in the means among zones. We can reject it, and conclude that there is a difference in nitrate concentration among the zones.

Exercise 5-4
We want to check whether samples (of size n = 100) come from a uniform distribution using a GOF test. We have three different variables.
a) What statistic do you use? Chi-square.
b) We use counts per quartile. How many counts do we expect in each quartile? How many degrees of freedom do we have? We expect 100/4 = 25 counts per quartile. We have df = 4 - 1 = 3.
c) First variable: we got the following counts in the four quartiles: 26, 24, 23, 27. Calculate the statistic.

> x <- c(26, 24, 23, 27)
> X2 <- sum((x-25)^2)/25
> X2
[1] 0.4

The p-value for this value of the statistic is 0.94. What is your conclusion? We cannot reject that it is uniform.


d) Second variable: we obtained counts 28, 22, 23, 27.
a. Calculate the statistic.

> x <- c(28,22,23,27)
> X2 <- sum((x-25)^2)/25
> X2
[1] 1.04

b. The p-value = 0.79. What is your conclusion now? We cannot reject that it is uniform.
e) Third variable: counts 35, 35, 15, 15.
a. Calculate the statistic.

> x <- c(35,35,15,15)
> X2 <- sum((x-25)^2)/25
> X2
[1] 16

b. Suppose the p-value = 0.001. What is your conclusion now? We can reject that it is uniform.

Exercise 5-5
Suppose we have two different community types, rural and urban-industrial, and three levels of health problems: low, medium, high. We want to see if community type is associated with health.
a) Organize in a contingency table.

          Rural   Urban-industrial
  Low
  Medium
  High

b) What test would you use? Then suppose you run the test and get a value for the statistic of 10.64 and p-value = 0.0309. We would use chi-square.
c) What is H0?


H0: health problems are not associated with community type.
d) What is your conclusion?
Reject H0 and conclude that health problems are associated with community type.

Exercise 5-6
Four archeological sites (A, B, C, and D) are being analyzed. There is an artifact type in common among all sites. Ten values of artifact length are taken at random in each site. Is artifact length different among sites?
a) What analysis do you run to answer this question? One-way ANOVA.
b) What graphs do you use to visually compare the magnitude of variability among sites to the variances within each site? Illustrate with an example. Boxplots.
c) What is the statistic? F.
d) What is H0? There is no difference in artifact length among sites.
e) Suppose you get statistic = 5.8, p-value = 0.007. What is your conclusion? We can reject H0 and conclude that there is a difference in artifact length among sites.

Exercise 5-7
Generate 20 random numbers from an exponential distribution with rate = 1. Run function cdf.plot.gof with proper arguments to visualize the potential fit of this sample to an exponential with rate = 1. Repeat the cdf.plot.gof with rate = 2 for the same sample. Are there sufficiently large differences in each case to suspect lack of fit? Use the K-S test to see if the sample fits an exponential with rate 1, then repeat for rate = 2. Compare and discuss.


xe <- rexp(20,rate=1)
cdf.plot.gof(xe,dist="exp")
mtext(side=3,line=2,"Sample Exp, Hyp Exp",cex=0.7)
cdf.plot.gof(xe,dist="exp", rate=2)



Exercise 5-8
Generate 100 random numbers from a normal distribution with µ = 300, σ = 30. Produce visual GOF exploratory graphs to see if this sample comes from a normal distribution. Apply the chi-square and the Shapiro-Wilk tests. Discuss.
Solution:

x <- rnorm(100,300,30)
cdf.plot.gof(x,dist="normal",mu=300,sd=30)
mtext(side=3,line=2,"Sample Normal, Hyp Normal",cex=0.7)



[Figure: "Sample Normal, Hyp Normal" — empirical (Data) and hypothesized (Hyp) CDFs F(x) vs. x, and the difference Empirical − Theoretical vs. x.]

> z <- (x-mean(x))/sd(x)
> x2z <- chisq.gof.norm(z,10,0)
> x2z
$X2
[1] 5.2
$df
[1] 9
$p.value
[1] 0.8165368
$observed
 [1] 10  9 11 13  6 12 12  7  8 12

> shapiro.test(x)



        Shapiro-Wilk normality test
data:  x
W = 0.9949, p-value = 0.973

From the exploratory graph, this sample seems to come from a normal distribution. The tests do not provide evidence that it does not come from a normal distribution.

Exercise 5-9
Use data set airquality. Perform a cross tabulation of ozone and wind. Perform a contingency analysis of ozone and wind using chi-square. Discuss whether there is an indication of a relation between ozone and wind. Perform a cross tabulation of ozone and solar radiation. Perform a contingency analysis of ozone and solar radiation using chi-square. Discuss whether there is an indication of a relation between ozone and solar radiation.
Solution: First attach the dataset and remind ourselves of the variable names.

> attach(airquality)
> names(airquality)
[1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"

For wind and ozone > table(cut(Wind, quantile(Wind)),cut(Ozone, quantile(Ozone, na.rm=T))) (1,18] (18,31.5] (31.5,63.2] (63.2,168] (1.7,7.4] 3 3 10 21 (7.4,9.7] 6 7 5 6 (9.7,11.5] 9 9 7 1 (11.5,20.7] 13 7 7 1 > x <- table(cut(Wind, quantile(Wind)),cut(Ozone, quantile(Ozone, na.rm=T))) > chisq.test(x) Pearson's Chi-squared test data: x X-squared = 39.8094, df = 9, p-value = 8.227e-06

We conclude that there is association between ozone and wind, highest ozone values for lower wind speed, and lowest ozone values for higher wind speed. > x <- table(cut(Solar.R, quantile(Solar.R,na.rm=T)),cut(Ozone, quantile(Ozone,

52


na.rm=T))) >x (1,18] (18,31.5] (31.5,63.2] (63.2,168] (7,116] 14 7 6 0 (116,205] 6 5 5 10 (205,259] 3 7 8 11 (259,334] 6 6 9 6 > chisq.test(x) Pearson's Chi-squared test data: x X-squared = 21.9094, df = 9, p-value = 0.009171 >

We conclude that there is association between ozone and solar radiation, lowest ozone values for lower solar radiation, and higher ozone values for higher solar radiation. This association is not as clear as the one for wind. Exercise 5-10 Use dataset immer of package MASS. Produce boxplots, interaction plot, and run appropriate two-way ANOVA of yield Y1 as a function of factors Loc and Var. Discuss results. Are differences in Loc significant? Are differences in Var significant? Is there factor interaction? Solution: > aov(Y1~Loc+Var,data=immer) Call: aov(formula = Y1 ~ Loc + Var, data = immer) Terms:

                      Loc       Var Residuals
Sum of Squares  17829.847  2756.625  3257.743
Deg. of Freedom         5         4        20
Residual standard error: 12.76273
Estimated effects may be unbalanced
> anova(aov(Y1~Loc+Var,data=immer))
Analysis of Variance Table
Response: Y1
          Df  Sum Sq Mean Sq F value    Pr(>F)
Loc        5 17829.8  3566.0 21.8923 1.751e-07 ***
Var        4  2756.6   689.2  4.2309   0.01214 *
Residuals 20  3257.7   162.9



---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Plots


boxplot(Y1~Loc,data=immer) boxplot(Y1~Var,data=immer) interaction.plot(immer$Loc,immer$Var,immer$Y1)

D

GR

M

UF

W

80

100

140

180

C

M

P

S

54

T

V


180 140

160

T V M P S

80

100

120

mean of immer$Y1

immer$Var

C

D

GR

M

UF

W

immer$Loc

Differences in Var and Loc are significant, more so for Loc than for Var. There is interaction between Var and Loc.

Exercise 5-11
Use function invent.mxn to generate a dataset of four groups, such that all group means are separated by 3 units and the standard deviation is 1 unit for all groups. Produce boxplots and a one-way ANOVA table. Now, increase the standard deviation to 3 units for all groups. Produce boxplots and a one-way ANOVA table. Compare results and discuss.
Solution: First, use sd = 1.

m=4; n=5
p <- matrix(c(30,1,33,1,36,1,40,1),byrow=T,ncol=2)
Xr <- invent.mxn(m=4,n=5,d=1,p,f2="random")
y <- c(Xr)
f <- factor(rep(LETTERS[1:m], rep(n,m)))
f.y <- data.frame(f, y)




> summary(aov(y~f, data=f.y))
            Df Sum Sq Mean Sq F value   Pr(>F)
f            3  289.4   96.46   106.5 8.76e-11 ***
Residuals   16   14.5    0.91
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> boxplot(y~f, data=f.y,ylab="y", xlab="f")

[Figure: boxplots of y by group f (A, B, C, D) for sd = 1.]

We see significant differences among groups, well above the variability within each group. Second, use sd = 3.

> p <- matrix(c(30,3,33,3,36,3,40,3),byrow=T,ncol=2)
> Xr <- invent.mxn(m=4,n=5,d=1,p,f2="random")
> y <- c(Xr)
> f <- factor(rep(LETTERS[1:m], rep(n,m)))
> f.y <- data.frame(f, y)
> summary(aov(y~f, data=f.y))
            Df Sum Sq Mean Sq F value   Pr(>F)
f            3  215.1    71.7   10.24 0.000526 ***
Residuals   16  112.0     7.0
---




Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> boxplot(y~f, data=f.y,ylab="y", xlab="f")

[Figure: boxplots of y by group f (A, B, C, D) for sd = 3.]

Now, while differences among groups are still significant, these differences are comparable to the within-group variation.

Exercise 5-12
Use function invent.mxn with f2="step" to generate a dataset of four groups, such that group pairs A-B, B-C, C-D are separated by 1 unit and the range of all groups is 3 units. Produce boxplots, an interaction plot, and a two-way ANOVA table. Now, decrease the separation of pairs A-B, B-C, C-D such that there is an overlap of 2 units between all three pairs. Produce boxplots and two-way ANOVA tables. Compare results and discuss.
Solution: First, no overlap and differences of one unit.



> p <- matrix(c(30,33,34,37,38,41,42,45),byrow=T,ncol=2)
> Xs <- invent.mxn(m=4,n=5,d=1,p,f2="step")
> y <- c(Xs)
> f1 <- factor(rep(LETTERS[1:m], rep(n,m)))
> f2 <- factor(rep(paste("V", 1:n,sep=""),m))
> f.y <- data.frame(f1, f2, y)
> summary(aov(y~f1+f2, data=f.y))
            Df Sum Sq Mean Sq   F value Pr(>F)
f1           3  400.0  133.33 2.791e+31 <2e-16 ***
f2           4   21.9    5.48 1.147e+30 <2e-16 ***
Residuals   12    0.0    0.00
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> interaction.plot(f.y$f1, f.y$f2, f.y$y,xlab="f1",ylab="y",
+   type="b", pch=1:n, trace.label="f2")
> panel2(size=5)
> boxplot(y~f1, data=f.y,ylab="y", xlab="f1")
> boxplot(y~f2, data=f.y,ylab="y", xlab="f2")

[Figure: interaction plot of y by f1 (A-D) with f2 (V1-V5) as the trace factor.]

[Figure: boxplots of y by f1 (A, B, C, D) and by f2 (V1-V5).]
There is no interaction and significant differences for both factors. Now, include overlap > p <- matrix(c(30,33,31,34,32,35,33,36),byrow=T,ncol=2) > Xs <- invent.mxn(m=4,n=5,d=1,p,f2="step") > y <- c(Xs) > f1 <- factor(rep(LETTERS[1:m], rep(n,m))) > f2 <- factor(rep(paste("V", 1:n,sep=""),m)) > f.y <- data.frame(f1, f2, y) > summary(aov(y~f1+f2, data=f.y)) Df Sum Sq Mean Sq F value Pr(>F) f1 3 25.00 8.333 5.116e+30 <2e-16 *** f2 4 21.92 5.480 3.364e+30 <2e-16 *** Residuals 12 0.00 0.000 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > interaction.plot(f.y$f1, f.y$f2, f.y$y,xlab="f1",ylab="y", + type="b", pch=1:n, trace.label="f2") > panel2(size=5) > boxplot(y~f1, data=f.y,ylab="y", xlab="f1") > boxplot(y~f2, data=f.y,ylab="y", xlab="f2")

59

V5


[Figure: interaction plot and boxplots of y by f1 and f2 for the overlapping groups.]

There is no interaction, and while the differences are still significant, the among-group differences are comparable to the within-group differences.



Chapter 6 Regression

Exercise 6-1
We have 5 observations of two variables y and x:

x    y
1.0  1.67
3.0  7.90
5.0  9.03
7.0  17.84
9.0  13.60

Using a calculator: determine sample means, sample variances, and sample standard deviations of X and of Y. Calculate the sample covariance of X and Y. Then, using these results, calculate the regression coefficients. Write the equation for the linear predictor. Calculate the predicted ŷ4 for x4 = 7. Calculate the residual error for x4 = 7. Sketch the scatter plot and predicted line.
Solution:

X̄, sX², sX are 5, 10, 3.16. Ȳ, sY², sY are 10, 37.29, 6.10. scov(X, Y) is 16.9.

The coefficients are

b1 = scov(X, Y)/sX² = 16.9/10 = 1.69
b0 = Ȳ − b1 X̄ = 10 − 1.69 × 5 = 1.55

Linear predictor:

Ŷ = 1.55 + 1.69 X

Predicted value for x4:

ŷ4 = 1.55 + 1.69 x4 = 1.55 + 1.69 × 7 = 13.38

The residual error for x4 is


e4 = y4 − ŷ4 = 17.84 − 13.38 = 4.46
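The hand calculation can be reproduced with lm(); this short sketch is an addition to the printed solution (the object names are arbitrary):

x <- c(1, 3, 5, 7, 9)
y <- c(1.67, 7.90, 9.03, 17.84, 13.60)
coef(lm(y ~ x))                 # intercept about 1.55, slope about 1.69
plot(x, y); abline(lm(y ~ x))   # scatter plot with the fitted line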

Sketch

[Figure: scatter plot of y vs. x with the predicted line, generated with R.]

Exercise 6-2



Using the results of the previous exercise, calculate the explained, residual, and total error for x4 = 7. Illustrate on the scatter plot. Calculate the explained, residual, and total error for all values of x and the overall total. Calculate R². Calculate the correlation coefficient r. How much variance does the regression model explain? Calculate the residual standard error.
Solution:

Recall that the sample mean of y is Ȳ = 10. The total error is

SST = (1.67 − 10)² + (7.9 − 10)² + (9.03 − 10)² + (17.84 − 10)² + (13.6 − 10)²
    = 69.52 + 4.44 + 0.96 + 61.34 + 12.90 = 149.17

Then calculate all the predicted values:

ŷ1 = 1.55 + 1.69 × 1 = 3.24
ŷ2 = 1.55 + 1.69 × 3 = 6.62
ŷ3 = 1.55 + 1.69 × 5 = 10.00
ŷ4 = 1.55 + 1.69 × 7 = 13.38
ŷ5 = 1.55 + 1.69 × 9 = 16.76

The total residual error is

SSE = (3.24 − 1.67)² + (6.62 − 7.9)² + (10 − 9.03)² + (13.38 − 17.84)² + (16.76 − 13.6)²
    = 2.46 + 1.64 + 0.94 + 19.89 + 9.98 = 34.92



The explained error is SSM = SST − SSE = 149.17 − 34.92 = 114.24. Now

R² = SSM/SST = 114.24/149.17 = 0.77

which should be equivalent to

R² = (scov(X, Y)/(sX sY))² = (16.9/(3.16 × 6.10))² = 0.77

The correlation coefficient is r = √0.77 = 0.88.
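The same decomposition can be computed directly in R (an added check using the five data pairs from Exercise 6-1):

x <- c(1, 3, 5, 7, 9)
y <- c(1.67, 7.90, 9.03, 17.84, 13.60)
yhat <- 1.55 + 1.69*x
SST <- sum((y - mean(y))^2); SST   # about 149.2
SSE <- sum((y - yhat)^2); SSE      # about 34.9
(SST - SSE)/SST                    # R2 about 0.77
cor(x, y)                          # r about 0.88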

Exercise 6-3
Perform linear regression on 10 observations of two variables y and x:

       x      y
 [1,]  1  1.674
 [2,]  2  8.997
 [3,]  3  7.904
 [4,]  4  9.877
 [5,]  5  9.034
 [6,]  6 17.037
 [7,]  7 17.836
 [8,]  8 12.462
 [9,]  9 13.599
[10,] 10 25.067

Produce scatter plots, the regression line, confidence interval lines, and diagnostics (e.g. residual error plots). Write the predictor. Discuss these results. Calculate the explained, residual, and total error. How much variance is explained by the model?
Solution:

> x.y <- read.table("lab6/exercise6-3.txt",header=T)
> lm(x.y$y ~x.y$x)
Call:
lm(formula = x.y$y ~ x.y$x)
Coefficients:
(Intercept)        x.y$x
      2.435        1.803
> plot(x.y$x,x.y$y)
abline(lm(x.y$y ~x.y$x))
conf.int.lm(lm(x.y$y ~x.y$x), 0.05)
par(mfrow=c(2,2)); plot(lm(x.y$y ~x.y$x))

The predictor is ŷ = 2.44 + 1.80x.


[Figure: scatter plot of x.y$y vs. x.y$x with the regression line and confidence intervals, and lm diagnostic plots — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance.]


The results indicate a good regression with normal residuals.

# total
> sst <- sum((x.y$y - mean(x.y$y))^2)
> sst
[1] 377.4569
> ye <- lm(x.y$y ~x.y$x)$coeff[1]+ lm(x.y$y ~x.y$x)$coeff[2]*x.y$x
# residual
> sse <- sum((ye-x.y$y)^2)
> sse
[1] 109.3985
# explained
> ssm <- sst - sse
> ssm
[1] 268.0583
> R2 <- ssm/sst
# variance explained
> R2
[1] 0.7101695

The regression explains 71% of the variance of y.

Exercise 6-4
Assume the following data for X (first column) and Y (second column).

0.97  3.31
0.25  1.15
0.07  1.21
2.60  5.48
0.77  0.87
0.28  0.91
1.96  4.47
1.32  4.13
1.78  5.01
0.06  1.14
0.81  2.57
1.87  6.43
0.31  2.37
2.85  6.88
1.01  4.08
1.39  3.60
2.41  5.73
1.73  4.07
2.83  7.90
0.43  1.67

Perform exploratory data analysis and descriptive statistics for each variable.


a) Run a correlation test.
b) Build a linear predictor of Y from X using linear regression.
c) Evaluate the regression. Discuss thoroughly.
Solution:

x.y <- read.table("lab6/exercise6-4.txt",header=T)
eda6(x.y$x)
eda6(x.y$y)

[Figure: eda6 output for x and for y — index plots, boxplots, histograms, density approximations, QQ plots, and ECDF vs. standard normal.]

> cor(x.y$x,x.y$y)
[1] 0.938653
> lm(x.y$y ~x.y$x)
plot(x.y$x,x.y$y)
abline(lm(x.y$y ~x.y$x))
conf.int.lm(lm(x.y$y ~x.y$x), 0.05)
par(mfrow=c(2,2)); plot(lm(x.y$y ~x.y$x))



[Figure: scatter plot of x.y$y vs. x.y$x with the regression line and confidence intervals, and lm diagnostic plots — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

We see that x and y are not exactly normal; their distributions are somewhat skewed. The correlation is high (~0.94). The regression is okay, with nearly normal residuals except for some outliers, particularly observations 12, 4, and 5.


Exercise 6-5
Use the upwelling part of the light measurements in the lab6/light-depth.csv file. Assume that upwelling light extinction follows an exponential law, the same as downwelling. Produce a scatter plot of upwelling light versus depth. Demonstrate graphically that a linear regression would not yield a good model. Obtain estimates of k and y0 from a linear regression applied to log-transformed data. Plot and compare. Use these estimates as an initial guess for a nonlinear regression optimization applied to an exponential model. Graph the exponential curve and compare to the log-transformed fit. Diagnose the residual error. Use polynomial regression and compare to nonlinear regression. Draw predicted vs. observed plots for all four models.
Solution: First, make a scatter plot of upwelling light versus depth and illustrate graphically that a linear regression would not yield a good model.

light.depth.all <- read.table(file="lab6/light-depth.csv",sep=",",header=T)
light.depth <- light.depth.all[-24,]
attach(light.depth)
# linear regression
Upwell.lm <- lm(Upwell ~ Depth)   # regression object
panel2(size=5)
plot(Depth, Upwell)               # get scatter plot
abline(Upwell.lm$coef)            # add regression line to scatter plot
# identify(Depth, Upwell)         # identify outliers
conf.int.lm(Upwell.lm, 0.05)



[Figure: scatter plots of Upwell vs. Depth with the fitted regression line and confidence interval lines.]

Now, obtain estimates of k and y0 from a linear regression applied to log-transformed data. Plot and compare.

> par(mfrow=c(2,2)); plot(Upwell.lm)   # diagnostic plots of residual error
> summary(Upwell.lm)
Call:
lm(formula = Upwell ~ Depth)
Residuals:
    Min      1Q  Median      3Q     Max
-0.4201 -0.3037 -0.1642  0.3061  0.6385
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.71045    0.15068   24.62  < 2e-16 ***
Depth       -1.27134    0.07697  -16.52 1.66e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3734 on 21 degrees of freedom
Multiple R-squared: 0.9285,  Adjusted R-squared: 0.9251
F-statistic: 272.8 on 1 and 21 DF,  p-value: 1.656e-13

> # log transform
> lnUpwell.lm <- lm(log(Upwell/Upwell[1])~ 0 + Depth)
> summary(lnUpwell.lm)



Call:
lm(formula = log(Upwell/Upwell[1]) ~ 0 + Depth)
Residuals:
     Min       1Q   Median       3Q      Max
-0.64922 -0.02205  0.22340  0.30917  0.38764
Coefficients:
      Estimate Std. Error t value Pr(>|t|)
Depth -0.95267    0.03184  -29.92   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.299 on 22 degrees of freedom
Multiple R-squared: 0.976,  Adjusted R-squared: 0.9749
F-statistic: 895 on 1 and 22 DF,  p-value: < 2.2e-16

> # fitted values by log transform
> yt.est <- Upwell[1]*exp(lnUpwell.lm$fitted)
> # fitted values of linear model
> y.est <- Upwell.lm$fitted
> # sum of square residual errors
> SSEl <- sum((Upwell - y.est)^2)
> SSEt <- sum((Upwell - yt.est)^2)
> SSEl; SSEt
[1] 2.9272
[1] 3.5859
> # Standard errors of residual
> Se.l <- sqrt(SSEl/Upwell.lm$df)
> Se.t <- sqrt(SSEt/lnUpwell.lm$df)
> Se.l; Se.t
[1] 0.3733504
[1] 0.403727
> # correlation coefficients
> r.l <- cor(Upwell, y.est)
> r.t <- cor(Upwell, yt.est)
> r.l ; r.t
[1] 0.963599
[1] 0.9848242



[Figure: diagnostic plots for the linear regression — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

Now, use these estimates as an initial guess for a nonlinear regression optimization applied to an exponential model. Graph the exponential curve and compare to the log-transformed fit. Diagnose the residual error.

> # nonlinear
> Upwell.nls <- nls(Upwell ~ y0*exp(-k*Depth), start=list(k=1.1, y0=66))
> Upwell.nls
Nonlinear regression model
  model: Upwell ~ y0 * exp(-k * Depth)
   data: parent.frame()
     k     y0
0.8271 4.6645
 residual sum-of-squares: 0.8351
Number of iterations to convergence: 7
Achieved convergence tolerance: 4.589e-06
> yn.est <- fitted(Upwell.nls)
> SSEn <- sum((Upwell - yn.est)^2)
> Se.n <- sqrt(SSEn/(length(yn.est)-2)); Se.n
[1] 0.1994176
> r.n <- cor(Upwell, yn.est); r.n
[1] 0.9909133

Now, use polynomial regression.


> Upwell.poly <- lm(Upwell ~ poly(Depth, 2), data=light.depth)
> summary(Upwell.poly)
Call:
lm(formula = Upwell ~ poly(Depth, 2), data = light.depth)
Residuals:
      Min        1Q    Median        3Q       Max
-0.191908 -0.026804  0.001662  0.026972  0.242673
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.57957    0.02196   71.93  < 2e-16 ***
poly(Depth, 2)1 -6.16651    0.10532  -58.55  < 2e-16 ***
poly(Depth, 2)2  1.64480    0.10532   15.62 1.14e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1053 on 20 degrees of freedom
Multiple R-squared: 0.9946,  Adjusted R-squared: 0.994
F-statistic:  1836 on 2 and 20 DF,  p-value: < 2.2e-16
> yp.est2 <- Upwell.poly$fitted
> Upwell.poly <- lm(Upwell ~ poly(Depth, 3), data=light.depth)
> summary(Upwell.poly)
Call:
lm(formula = Upwell ~ poly(Depth, 3), data = light.depth)
Residuals:
      Min        1Q    Median        3Q       Max
-0.198356 -0.023028  0.001675  0.025177  0.245633
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.57957    0.02252  70.142  < 2e-16 ***
poly(Depth, 3)1 -6.16651    0.10800 -57.097  < 2e-16 ***
poly(Depth, 3)2  1.64480    0.10800  15.230 4.21e-12 ***
poly(Depth, 3)3 -0.01525    0.10800  -0.141    0.889
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.108 on 19 degrees of freedom
Multiple R-squared: 0.9946,  Adjusted R-squared: 0.9937
F-statistic:  1164 on 3 and 19 DF,  p-value: < 2.2e-16
> yp.est3 <- Upwell.poly$fitted



We are now ready to compare with the nonlinear regression and to draw predicted vs. observed plots for all four models.

panel2(size=5)
plot(Depth,Upwell)
lines(Depth,y.est,lty=1)
lines(Depth,yt.est,lty=2)
legend("topright",lty=1:2,leg=c("Linear","Log"))
plot(Depth,Upwell)
lines(Depth, yn.est,lty=1)
lines(Depth, yp.est2,lty=2)
lines(Depth, yp.est3,lty=3)
legend("topright",lty=1:3,leg=c("Opt","Poly 2","Poly 3"))

[Figure: Upwell vs Depth with fitted curves; left panel: Linear and Log models; right panel: Opt, Poly 2, and Poly 3]

Examine residuals
panel2(size=5)
plot(yn.est, residuals(Upwell.nls))
abline(h=0)
qqnorm(residuals(Upwell.nls)); qqline(residuals(Upwell.nls))
cdf.plot.gof(residuals(Upwell.nls))



> shapiro.test(residuals(Upwell.nls)) Shapiro-Wilk normality test data: residuals(Upwell.nls) W = 0.957, p-value = 0.4054

[Figure: residual diagnostics for Upwell.nls; residuals vs fitted, normal Q-Q plot, and empirical vs hypothesized CDF with the empirical minus theoretical difference]

Now compare all models by drawing predicted vs observed.
panel4(size=7)
xlabel = "Upwell Observed"; ylabel = "Upwell Predicted"
plot(Upwell,y.est,ylim=c(0,4),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); mtext(side=3,line=-1,"Linear",cex=0.8)
plot(Upwell,yt.est,ylim=c(0,4),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); mtext(side=3,line=-1,"Log transformed",cex=0.8)
plot(Upwell,yn.est,ylim=c(0,4),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); mtext(side=3,line=-1,"Nonlinear Optimization",cex=0.8)
plot(Upwell,yp.est3,ylim=c(0,4),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); mtext(side=3,line=-1,"Polynomial order 3",cex=0.8)



[Figure: predicted vs. observed Upwell for the Linear, Log transformed, Nonlinear Optimization, and Polynomial order 3 models, each with a 1:1 line]

Exercise 6-6 Estuarine sediments are important receptors of water contaminants resulting from point sources and from freshwater runoff. Assume a sample of moisture content (g water/100 g dried solids) measured at different depths (m) in the sediments of an estuary (data from Davis, 2002, Chapter 4). The depth and moisture pairs are (0,124), (5,78), (10,54), (15,35), (20,30), (25,21), (30,22), (35,18). Demonstrate graphically that a linear regression would not yield a good model. Obtain estimates of k and y0 from a linear regression applied to log-transformed data. Plot and compare. Use these estimates as an initial guess for a nonlinear regression optimization applied to an exponential model. Graph the exponential curve and compare to the log-transformed fit. Diagnose the residual error. Use polynomial regression and compare to nonlinear regression. Draw predicted vs. observed plots for all four models. Solution: Put the data in file mois_dep.txt as follows. Here there is an additional column for a second data set in Davis.


0 124 137
5 78 84
10 54 50
15 35 32
20 30 28
25 21 24
30 22 23
35 18 20

Create a data frame and store object moist.depth > moist.depth <- read.table("lab6/mois_dep.txt") > names(moist.depth) <- c("Depth", "Moist1", "Moist2") > moist.depth Depth Moist1 Moist2 1 0 124 137 2 5 78 84 3 10 54 50 4 15 35 32 5 20 30 28 6 25 21 24 7 30 22 23 8 35 18 20 > attach(moist.depth) The following object(s) are masked from 'light.depth': Depth > Depth [1] 0 5 10 15 20 25 30 35 >

The following statements perform the regression of Moist1 vs. Depth, plot the scatter plot, identify outliers, add the regression line, and compute a confidence interval at alpha = 0.05
moist.lm <- lm(Moist1 ~ Depth) # regression object
par(mfrow=c(2,1))
plot(Depth, Moist1) # get scatter plot
abline(moist.lm$coef) # add regression line to scatter plot
identify(Depth, Moist1) # identify outliers
conf.int.lm(moist.lm, 0.05)

We get



[Figure: Moist1 vs Depth scatter plots with the fitted regression line and 0.05 confidence interval; observations 1, 4, and 8 labeled]

Note that observations 1, 4, and 8 lie away from the line and that observation 4 is outside the confidence interval. We can also draw the residual diagnostic plots
par(mfrow=c(2,2))
plot(moist.lm) # diagnostic plots of residual error



[Figure: residual diagnostic plots for moist.lm, Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance; observations 1, 4, and 8 flagged]

We can see that the residuals have pattern (heteroscedastic), and that observations numbered 1,4 and 8 are outliers. The residual vs leverage plot indicates that observations 1 and 8 are very influential since they have high values of residual, Cook’s distance, and almost high leverage. High leverage in this case would occur for 2 x 2/8 = 0.5. Now we can gain more information by using summary( ) > summary(moist.lm)# get more info on regression object Call: lm(formula = Moist1 ~ Depth, data = moist.depth) Residuals: Min 1Q Median 3Q Max -19.452 -11.750 -4.952 10.113 29.333 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 94.6667 11.6745 8.109 0.000189 *** Depth -2.6810 0.5581 -4.803 0.002991 ** --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1



Residual standard error: 18.09 on 6 degrees of freedom Multiple R-Squared: 0.7936, Adjusted R-squared: 0.7592 F-statistic: 23.07 on 1 and 6 DF, p-value: 0.002991

Examine the summary and plots and compare to analysis in Davis, 2002 (pp 191-200). R2 is ~ 0.79 and adjusted 0.76, therefore about ¾ of the variance is explained by regression. The large value of F = 23.07 or small p-value ~ 0.003 indicate that H0 can be rejected. Therefore there is a trend (slope different from zero). The equation is Moist1= 94.67 - 2.68×depth These coefficients are significantly different from zero (p-values very low in t tests). The Y vs X plot does not indicate a linear relation, but curvilinear. The residuals go down and then up with predicted mois1. They do not seem to be normally distributed (also indicated by qq plot). In conclusion, we need to try a non-linear regression. The message of this exercise is that of alert: your R2 and p-value looked fine, but the linear regression is a poor model for this data set. Try log transformed, non linear, and polynomial > # log transform > lnmoist.lm <- lm(log(Moist1/Moist1[1])~ Depth) > summary(lnmoist.lm) Call: lm(formula = log(Moist1/Moist1[1]) ~ Depth) Residuals: Min 1Q Median 3Q Max -0.22407 -0.12773 -0.01417 0.14457 0.22567 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.225673 0.120580 -1.872 0.11 Depth -0.054346 0.005765 -9.427 8.1e-05 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1868 on 6 degrees of freedom Multiple R-squared: 0.9368, Adjusted R-squared: 0.9262 F-statistic: 88.87 on 1 and 6 DF, p-value: 8.1e-05 > > # fitted values by log transform > yt.est <- Moist1[1]*exp(lnmoist.lm$fitted) > # fitted values of linear model > y.est <- moist.lm$fitted >



> # sum of square residual errors > SSEl <- sum((Moist1 - y.est)^2) > SSEt <- sum((Moist1 - yt.est)^2) > SSEl; SSEt [1] 1962.619 [1] 771.8412 > > # degrees of freedom > n <- length(Moist1); df <- n-2 > > # Standard errors of residual > Se.l <- sqrt(SSEl/df) > Se.t <- sqrt(SSEt/df) > Se.l; Se.t [1] 18.08599 [1] 11.34197 > > # correlation coefficients > r.l <- cor(Moist1, y.est) > r.t <- cor(Moist1, yt.est) > r.l ; r.t [1] 0.8908507 [1] 0.9759719 > > # nonlinear > moist.nls <- nls(Moist1 ~ y0*exp(-k*Depth), start=list(k=-0.005, y0=124)) > moist.nls Nonlinear regression model model: Moist1 ~ y0 * exp(-k * Depth) data: parent.frame() k y0 0.07175 118.87476 residual sum-of-squares: 239.4 Number of iterations to convergence: 9 Achieved convergence tolerance: 2.775e-06 > yn.est <- fitted(moist.nls) > SSEn <- sum((Moist1 - yn.est)^2) > Se.n <- sqrt(SSEn/df); Se.n [1] 6.316179 > r.n <- cor(Moist1, yn.est); r.n [1] 0.9891273 > > moist.poly <- lm(Moist1 ~ poly(Depth, 3), data=moist.depth) > yp.est <- moist.poly$fitted

This last one is a cubic polynomial to approximate the non-linear functional relationship. Recall polynomial regression is useful when you do not know what model to apply. However, there may not be a physical meaning attached to the coefficients.


> moist.poly Call: lm(formula = Moist1 ~ poly(Depth, 3)) Coefficients: (Intercept) poly(Depth, 3)1 poly(Depth, 3)2 poly(Depth, 3)3 47.75 -86.87 41.97 -13.29 > summary(moist.poly) Call: lm(formula = Moist1 ~ poly(Depth, 3)) Residuals: 1 2 3 4 5 6 7 8 0.9394 -2.4091 1.5844 -0.8074 2.6883 -2.6558 0.4329 0.2273 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 47.7500 0.8726 54.719 6.68e-07 *** poly(Depth, 3)1 -86.8728 2.4682 -35.197 3.89e-06 *** poly(Depth, 3)2 41.9705 2.4682 17.005 7.01e-05 *** poly(Depth, 3)3 -13.2939 2.4682 -5.386 0.00575 ** --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 2.468 on 4 degrees of freedom Multiple R-Squared: 0.9974, Adjusted R-squared: 0.9955 F-statistic: 519 on 3 and 4 DF, p-value: 1.230e-05 >

We can see that all coefficients are significant (low p-values in the t tests), as is the F statistic. The R2 improved substantially. Now let us draw the scatter plot and superimpose the fitted lines
panel2(size=5)
plot(Depth,Moist1,ylab="Moisture")
lines(Depth,y.est,lty=1)
lines(Depth,yt.est,lty=2)
legend("topright",lty=1:2,leg=c("Linear","Log"))
plot(Depth,Moist1,ylab="Moisture")
lines(Depth, yp.est,lty=1)
lines(Depth, yn.est,lty=2)
legend("topright",lty=1:2,leg=c("Poly","Opt"))



[Figure: Moisture vs Depth with fitted curves; left panel: Linear and Log; right panel: Poly and Opt]

As we see, we have achieved a better fit for this data set, but this by itself does not tell us that we have a mechanism to explain the change of moisture with depth. We may have done a better job chasing these specific data points with a curve, but we cannot claim an understanding of a generic response of moisture to depth. We can also plot predicted vs observed for these models.
par(mfrow=c(2,2))
xlabel = "Moist 1 Observed"; ylabel = "Moist 1 Predicted"
plot(Moist1,y.est,ylim=c(20,120),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); title("Linear")
plot(Moist1,yt.est,ylim=c(20,120),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); title("Log transformed")
plot(Moist1,yn.est,ylim=c(20,120),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); title("Non Linear")
plot(Moist1,yp.est,ylim=c(20,120),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); title("Polynomial order 3")



[Figure: predicted vs. observed Moist1 for the Linear, Log transformed, Non Linear, and Polynomial order 3 models, each with a 1:1 line]

Students may pursue this exercise using the Rcmdr and obtain similar results.



To perform polynomial regression in Rcmdr, go to Statistics | Fit Models | Linear Model, then enter the poly expression in the model formula to obtain the same results as with the console.



Exercise 6-7 Work the ozone vs. temp example from the airquality dataset using non-linear regression and polynomial regression. First, we need an equation that may describe the nonlinear nature of the data. We could start with an exponential

y = y0 exp(k (x − x0))

where y is ozone and x is temperature. Note that x0 ~ 50 from the scatter plot. Start the optimization algorithm from an initial guess for the coefficients obtained from a log-transformed regression. Select the best order of the polynomial by trial and error. Diagnose the residual error. Use polynomial regression and compare to nonlinear regression. Draw predicted vs. observed plots for all four models. Solution: Start with linear regression
Temp50 <- Temp - 50 # change variable to temp above 50
# assume ozone = 1 at 50 deg
ozone.lm <- lm(Ozone ~ 0+Temp50) # regression with 0 intercept
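Note that the listing assumes the airquality data frame is already loaded and attached so that Ozone and Temp are visible; this setup is an assumption, not shown explicitly in the original.
data(airquality)
attach(airquality)  # makes Ozone and Temp available; lm drops the 37 missing Ozone values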

Now use the log transform to obtain an initial guess; we take the log of both sides



ln(y) = ln(y0) + k (x − x0), or
ln(y) − ln(y0) = ln(y/y0) = k (x − x0)

So, perform a linear regression of ln(y/y0) ~ x-x0 to obtain the value of the coefficient k which we could use as an initial guess for the non-linear estimation of k > ozone.log.lm <- lm(log(Ozone) ~ 0+Temp50) # regression with 0 intercept > summary(ozone.log.lm)# get more info on regression object Call: lm(formula = log(Ozone) ~ 0 + Temp50) Residuals: Min 1Q Median 3Q Max -1.5634 -0.3382 0.0294 0.6073 2.1790 Coefficients: Estimate Std. Error t value Pr(>|t|) Temp50 0.116975 0.002412 48.49 <2e-16 *** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.7646 on 115 degrees of freedom (37 observations deleted due to missingness) Multiple R-squared: 0.9534, Adjusted R-squared: 0.953 F-statistic: 2351 on 1 and 115 DF, p-value: < 2.2e-16 > par(mfrow=c(2,2));plot(ozone.log.lm) # diagnostic plots residual error > plot(Temp50,log(Ozone), xlab="Temp above 50") # scatter plot > abline(a=0, b=ozone.log.lm$coef) # add regression line to scatter plot >

89


[Figure: residual diagnostic plots for ozone.log.lm and scatter plot of log(Ozone) vs Temp above 50 with the fitted line]

Now, we use k = 0.12 and y0= 1 as an initial guess for a non-linear regression. The initial guess is declared in a start= list(…) > ozone.nls <- nls(Ozone ~ exp(k*Temp50), start=list(k=0.12)) > ozone.nls Nonlinear regression model model: Ozone ~ exp(k * Temp50) data: parent.frame() k 0.1078 residual sum-of-squares: 83416 Number of iterations to convergence: 4 Achieved convergence tolerance: 2.955e-06 >

Recall: a way of referring to contents of the regression object is to use coefficients(object), residuals(object) or fitted(object) > k <- coefficients(ozone.nls)[[1]]



With this information, we can do graphs.


plot(Temp,Ozone)
Temp1 <- seq(min(Temp), max(Temp), (max(Temp)-min(Temp))/100)
lines(Temp1, exp(k*(Temp1-50)))

[Figure: Ozone vs Temp with the fitted exponential curve]

The graph shows a relatively good fit except for the outliers in the range of 80 to 90 degrees. It is a good idea to plot residuals vs predicted
> plot(fitted(ozone.nls), residuals(ozone.nls))
> abline(h=0)

to obtain



[Figure: residuals(ozone.nls) vs fitted(ozone.nls) with a horizontal zero line]

which reveals a pattern in the errors, suggesting non-normality of the residuals. Now, try polynomial regression by trial and error and diagnose the residual error. Try second order
ozone.poly <- lm(Ozone ~ poly(Temp50, 2), data=airquality)
eda6(ozone.poly$residuals)

[Figure: eda6 plots of the second-order polynomial residuals: index plot, boxplot, histogram, density approximation, QQ plot, and ECDF vs standard normal]

Try third order
ozone.poly <- lm(Ozone ~ poly(Temp50, 3), data=airquality)



eda6(ozone.poly$residuals)

[Figure: eda6 plots of the third-order polynomial residuals: index plot, boxplot, histogram, density approximation, QQ plot, and ECDF vs standard normal]

Draw predicted vs. observed plots for all four models
y.est <- ozone.lm$coef*(Temp50)
yt.est <- exp(ozone.log.lm$coef*(Temp50))
yn.est <- exp(coefficients(ozone.nls)*Temp50)
yp.est <- ozone.poly$fitted
par(mfrow=c(2,2))
xlabel = "Ozone Observed"; ylabel = "Ozone Predicted"
plot(Ozone,y.est,ylim=c(0,150),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); title("Linear")
plot(Ozone,yt.est,ylim=c(0,150),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); title("Log transformed")
plot(Ozone,yn.est,ylim=c(0,150),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); title("Non Linear")
plot(na.omit(Ozone),yp.est,ylim=c(0,150),xlab=xlabel,ylab=ylabel)
abline(a=0,b=1); title("Polynomial order 3")



[Figure: predicted vs. observed Ozone for the Linear, Log transformed, Non Linear, and Polynomial order 3 models, each with a 1:1 line]


Chapter 7 Stochastic or random processes and time series

Exercise 7-1 Time series of tidal height is important to many environmental models. Tidal heights are referenced to a datum and are specified at equally spaced intervals, e.g., 0.5 hr. Data are obtained from tidal stage recorders or from the U.S. Coast and Geodetic Survey Tide Tables. Consider the following model of tidal height

x(t) = A0 + Σ_{k=1..3} Ak sin(2π k f t) + Σ_{k=1..3} Bk cos(2π k f t)

where x(t) = tidal elevation, Ak and Bk are coefficients, f is frequency in hr-1, and t is time in hr. Assume all coefficients equal to 1. Sketch each one of the tidal time-series model components for one period. Then graphically add all the components. Sketch the periodogram for each component and for the total tidal elevation. Solution: Assume a period equal to one day or 24 hr. The basic frequency is f = 1/24 = 0.042 hr-1, or equivalently ω = 2π × f = 2π/24 = 0.262 radians hr-1. The students would sketch this by hand. Here is the solution using R so that it can be used as a reference to evaluate the hand sketches.
t <- seq(0,24,0.5)
par(mfrow=c(2,2))
x0 <- 1
x1 <- cbind(sin((2*pi/24)*t),cos((2*pi/24)*t))
matplot(t,x1,type="l",col=1,ylab="x1")
legend("bottomleft",col=1,leg=c("sin","cos"),lty=c(1,2))
x2 <- cbind(sin(2*(2*pi/24)*t),cos(2*(2*pi/24)*t))
matplot(t,x2,type="l",col=1,ylab="x2")
legend("bottomleft",col=1,leg=c("sin","cos"),lty=c(1,2))
x3 <- cbind(sin(3*(2*pi/24)*t),cos(3*(2*pi/24)*t))
matplot(t,x3,type="l",col=1,ylab="x3")
legend("bottomleft",col=1,leg=c("sin","cos"),lty=c(1,2))
x <- x0+x1[,1]+x1[,2]+x2[,1]+x2[,2]+x3[,1]+x3[,2]
plot(t,x,type="l",ylab="x = sum ")



[Figure: the three tidal components x1, x2, x3 (sin and cos) and their sum x, plotted over one 24-hr period]

Periodograms: the frequency components are f1 = 0.042, f2 = 2×0.042 = 0.084, and f3 = 3×0.042 = 0.125 hr-1, or in radians hr-1: 0.262, 0.524, and 0.786. A sketch would be similar to the following



For reference these are the results using R
[Figure: smoothed periodograms of x1, x2, x3, and the sum x, each with a peak at its component frequency]


Which is generated with
t <- seq(0,24*100,0.5)
par(mfrow=c(2,2))
x1 <- cbind(sin((2*pi/24)*t),cos((2*pi/24)*t))
x2 <- cbind(sin(2*(2*pi/24)*t),cos(2*(2*pi/24)*t))
x3 <- cbind(sin(3*(2*pi/24)*t),cos(3*(2*pi/24)*t))
x <- x0+x1[,1]+x1[,2]+x2[,1]+x2[,2]+x3[,1]+x3[,2]
x1.spec <- spec.pgram(x1, spans=c(3,3,3), demean=T, plot=T,xlim=c(0,0.2))
x2.spec <- spec.pgram(x2, spans=c(3,3,3), demean=T, plot=T,xlim=c(0,0.2))
x3.spec <- spec.pgram(x3, spans=c(3,3,3), demean=T, plot=T,xlim=c(0,0.2))
x.spec <- spec.pgram(x, spans=c(3,3,3), demean=T, plot=T,xlim=c(0,0.2))
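As a small check (a sketch, not in the original), the dominant frequency can be read off the spectral objects. spec.pgram reports frequency in cycles per observation, and the series is sampled every 0.5 hr, so dividing by 0.5 converts to cycles per hour.
x.spec$freq[which.max(x.spec$spec)]/0.5
# the three components should produce peaks near 1/24, 2/24, and 3/24 hr^-1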

Exercise 7-2 Consider a longitudinal transect of 100 m along a direction N-S, establish a starting point as distance d = 0 m, and record the distance from this datum to each one of the points where we detect the presence of a plant species. Assume that detections are independent and that the expected rate is 0.1 plants per m. Use a Poisson pdf to model this spatial process; determine the probability of observing k = 1, 2, 3, 4 plants in the transect. Sketch the pdf of distances to the detection of the 5th plant. Solution: Poisson pdf such that the rate for the entire length L is a = λL = 0.1 × 100 = 10

P(k, L) = (λL)^k exp(−λL) / k!

Then for k = 1, 2, 3, 4

P(1, L) = 10^1 exp(−10) / 1! = 0.00045
P(2, L) = 10^2 exp(−10) / 2! = 0.00225
P(3, L) = 10^3 exp(−10) / 3! = 0.0075
P(4, L) = 10^4 exp(−10) / 4! = 0.019

The inter-arrival pdf is Erlang, where x is distance

p(k, x) = λ^k x^(k−1) exp(−λx) / (k−1)! = 0.1^5 x^4 exp(−0.1x) / 4!


Use Figure 7-19, use the curve for k=5 and scale 10 times along the x axis.
[Figure: Erlang pdf p(x) for k = 5 and λ = 0.1, plotted for x from 0 to 100]

It could be generated with R
x <- seq(0,100,0.01); nx <- length(x)
y <- 0.1^5*x^4*exp(-0.1*x)/(2*3*4)
plot(x,y, type="l",col=1,ylab="Erlang p(x)")
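The same numbers can be checked with built-in distribution functions (a sketch, using the objects defined above):
dpois(1:4, lambda=10)                    # P(1,L), ..., P(4,L)
y.check <- dgamma(x, shape=5, rate=0.1)  # the Erlang(k=5, rate=0.1) pdf equals this gamma density
max(abs(y - y.check))                    # essentially zero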

Exercise 7-3 Sea surface temperatures of the Pacific Ocean in several regions are considered indicative of El Niño. File lab7/sstoi_pa.txt contains monthly data for years 1950-1997 and for regions Niño1+2, Niño3, Niño4, and Niño3.4 (Acevedo et al., 1999). Read the file, convert the columns to time series, plot and analyze the time series for regions 1+2 and 3.4 using autocorrelation and periodograms. Solution: Read the file as a data frame and adjust to have complete years
sstoi.pa <- read.table("lab7/sstoi_pa.txt", header=T)
mo.sstoi <- length(sstoi.pa$YR) - 11
yr.sstoi <- mo.sstoi/12

Convert to time series the anomalies of regions 1+2 and 3.4 and select the period 1951-1996
sstoi.pa12 <- ts(sstoi.pa[,4], start = c(1950,1), end = c(1996,12), frequency = 12)
sstoi.pa12 <- window(sstoi.pa12, start = c(1951,1), end = c(1996,12))
sstoi.pa34 <- ts(sstoi.pa[,10], start = c(1950,1), end = c(1996,12), frequency = 12)
sstoi.pa34 <- window(sstoi.pa34, start = c(1951,1), end = c(1996,12))



Plot the series, the ACF and periodograms

par(mfrow=c(2,1))
ts.plot(sstoi.pa12)
ts.plot(sstoi.pa34)
acf(sstoi.pa12)
acf(sstoi.pa34)
spec.pgram(sstoi.pa12, spans=c(3,3,3), demean=T, plot=T,xlim=c(0,2))
spec.pgram(sstoi.pa34, spans=c(3,3,3), demean=T, plot=T,xlim=c(0,2))

[Figure: time series plots of sstoi.pa12 and sstoi.pa34, 1951-1996]

We see repetitive peaks of positive anomalies for both regions. The period seems to be around 4-6 years for the larger spikes.



[Figure: ACF of sstoi.pa12 and sstoi.pa34 up to lag 6 years]

The ACF shows positive autocorrelation at about 4 years for region 1+2, and less significant, more spread out positive autocorrelation at about 4-5 years for region 3.4.



[Figure: smoothed periodograms of sstoi.pa12 and sstoi.pa34]

The maximum values occur at a frequency of about 0.3 for region 1+2 and 0.22 for region 3.4. These frequencies correspond to periods of about 3.3 years and 4 years. Exercise 7-4 Streamflow is an important variable of watersheds, measured as volume discharge per unit time. The US Geological Survey operates monitoring stations throughout the country. Consider for example flow of the Neches River in Texas at station USGS 08040600 near Town Bluff, TX. File lab7/TB-flow.csv contains daily flow data 1952-2010. Read the file, convert the flow to time series, plot and analyze the time series using autocorrelation and periodograms. Hint: when applying ts use freq=365, start=1952, end=2010. Solution: Script
TB.df <- read.table("lab7/TB-Flow.csv",sep=",",header=T)
TB.df <- TB.df[complete.cases(TB.df),]
names(TB.df) <- c("TBday","TBQ")
attach(TB.df)



qtb <- ts(TBQ, freq=365, start=1952, end=2010)
ts.plot(qtb)
acf(qtb,lag.max=365)
spec.pgram(qtb, spans=c(3,3,3), demean=T, plot=T, xlim=c(0,2),ylim=c(10^6,10^8))

[Figure: time series plot of daily flow qtb, its ACF up to lag 1 year, and its smoothed periodogram]

We see peaks in flow every year, maximum positive autocorrelation every year, and a spectrum with a peak at a frequency of one cycle per year. Exercise 7-5 Simulate daily rainfall for one month, modeled as a Poisson process with rate λ=0.5 and Weibull distributed marks X with shape c=0.9 and scale b=0.5.


#marked poisson simulation
ndays=30; rate=0.5; shape=0.9; scale=0.5
# define arrays
zp <- array()
nwet <- array()
# realization
rainy <- poisson.rain(rate,ndays,shape,scale,plot.out=F)
zp <- rainy$z[,2]; nwet <- length(rainy$y[,3])
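poisson.rain() is a function supplied with the book's script archive. For students without it, a rough stand-in sketch (hypothetical, only to convey the idea): a day is wet with probability 1 - exp(-rate), and wet days receive Weibull-distributed amounts.
wet <- rbinom(ndays, 1, 1 - exp(-rate))   # indicator of at least one event in a day (assumed approximation)
zp.alt <- wet * rweibull(ndays, shape=shape, scale=scale)
sum(wet)                                  # number of wet days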

par(mfrow=c(1,2))
plot(zp, type="s", ylab="x", xlab="Days")
mtext(side=3,line=-1,paste("Days with events=",nwet[1]),cex=0.7)
hist(zp,prob=T,main="",xlab="x")
mtext(side=3,line=-1,paste("Only non-zero values"),cex=0.7)

[Figure: simulated daily rainfall; step plot of amounts over 30 days (12 days with events) and histogram of the non-zero values]


Chapter 8 Spatial Point Patterns

Exercise 8-1 Use the spatial point pattern of Figure 9-24 and the chi square test to determine if the pattern is uniform (homogeneous). a) What is the number of cells, T, the total number of points, m, and the expected number of points for each cell if the distribution were uniform? b) What is the null hypothesis for a chi square test? How many degrees of freedom df? c) Calculate the chi square value. d) Use the 1- pchisq(x,df) expression in R to calculate the p-value. e) Can you reject the null hypothesis? What is your conclusion? Solution: a) By inspection of the figure T=6, m=8+6+7+6+7+6=40, e=m/T=40/6=6.67 b) H0 is that the number of points is the same in all cells. We have df=6-1=5

c) χ² = Σ_{i=1..T} (oi − ei)²/ei = 0.5
d) > 1-pchisq(0.5,5), and the result is 0.99
e) Cannot reject the null; the pattern may be uniform.
Exercise 8-2 Uniform point patterns can be either regular or random, but the χ² test does not specify which.

a) Does the point pattern of Exercise 8-1 appear more regular or random? b) What distribution and test could you use to determine the nature of a uniform pattern? Solution: a) Random b) Poisson, chi-square Exercise 8-3 Examine the spatial point pattern of Figure 8-25 for randomness.


a) What is the total area, A, the total number of points, m, the density, λ, the total number of cells, T, the area for each cell, a, and the rate, λa = m/T? b) If the points are random they will follow the Poisson distribution given by equation 8-4. Using the expected number of points per cell as the rate yields the following probabilities that a cell will have r number of points: P(0) = 0.646, P(1) = 0.282, P(2) = 0.062, P(3) = 0.009 c) Calculate the expected number of cells that will have 0, 1, 2 or 3 points. How do these values compare to the point pattern? d) Determine the number of points ri in each cell i. Calculate the sample variance using equation 8.6. e) Calculate the ratio of the mean over the sample variance, (m/T)/s², and compare the value to 1. Is it greater than 1, equal to 1, or less than 1? What does this imply about the point pattern being regular, random, or clustered? What test could you use to determine if the point pattern is random? f) The t-value (from equation 8-7) is approximately 0.288, which has a corresponding p-value of 0.389. Can you reject the null hypothesis? What is your conclusion? Solution: a) A = 2×2 = 4, m = 7, λ = m/A = 7/4 = 1.75, T = 4×4 = 16, a = 0.5×0.5 = 0.25, λa = m/T = 7/16 = 0.4375 b) Done already c) Use er = T P(r), then 0.646×16 = 10.34, 0.282×16 = 4.51, 0.062×16 = 0.99, 0.009×16 = 0.14; these compare well to 10, 5, 1, 0, which are the observed counts. d)

s² = Σ_{i=1..T} (ri − m/T)² / (T − 1) = [10(0 − 0.438)² + 5(1 − 0.438)² + (2 − 0.438)²] / 15 = 5.94/15 = 0.396

e) (m/T)/s² = 0.438/0.396 = 1.10. This ratio is larger than 1, therefore the pattern is uniform. Now apply the t-test. f) We cannot reject the null that the pattern is random. We conclude that the pattern may be random.
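These values can be verified numerically (a sketch of parts c through e):
counts <- c(rep(0,10), rep(1,5), 2)   # points per cell, T = 16 cells, m = 7 points
lam <- mean(counts)                   # 7/16 = 0.4375
round(16*dpois(0:3, lam), 2)          # expected cells with 0, 1, 2, 3 points: 10.33, 4.52, 0.99, 0.14
lam/var(counts)                       # mean-to-variance ratio, about 1.10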

Exercise 8-4 Consider the point pattern of Figure 8-26 on the x-y plane, A = (0, 1), B = (3,3), C = (2,0), and D = (2,1). a) For each point identify its nearest neighbor and calculate that distance to complete


Table 8-2. b) If the point pattern were random, then the points would follow a Poisson distribution. Calculate what the mean nearest neighbor distance would be if the point pattern were random, i.e. use density λ = m/A and μd = 1/(2√λ). How does this compare with the average NND? c) The ratio of the average NND to μd is called the nearest neighbor statistic, or the standardized mean, and ranges from 0 (perfectly clustered, i.e. all points are in the same location) through 1 (random) to 2.15 (perfectly dispersed, i.e. the average nearest neighbor distance is maximized). Calculate this ratio and use it to judge whether this point pattern is more clustered, random, or regular. Solution:

a) Table 8-2 NND table

Point   Nearest Neighbor   Nearest Neighbor Distance (NND)
A       D                  2
B       D                  sqrt(5) = 2.23
C       D                  1
D       C                  1
Average Nearest Neighbor Distance     1.55

b) λ = m/A = 4/9 = 0.44 and μd = 1/(2√λ) = 1/1.32 = 0.75. This value is lower than the average from the table.

c) average NND / μd = 1.55/0.75 = 2.07. The value is between 1 and 2.15, thus the pattern is random.

Exercise 8-5

a) Do you need values at each point to calculate chi-square and Ripley’s K and L? Explain. b) Do you need values at each point to calculate the semivariogram? Explain. Solution:


a) No, because we only use the location b) Yes, because we use the value in the calculation Exercise 8-6 Suppose a semivariogram has a range of 4, sill of 20, and a nugget of 2. a) Write an expression for the variogram using the spherical model given by equation 8.44. Hint: substitute the values given into the equation so that it is only a function of h. b) Sketch a graph of the semi-variogram c) Write an expression for the covariance as a function of h d) Draw a graph of the covariogram Solution: a)

γ(h) = 0 when h = 0
γ(h) = γ(0+) + [c(0) − γ(0+)] (3h/(2a) − h³/(2a³)) when 0 < h ≤ a
γ(h) = γ(0+) + [c(0) − γ(0+)] = c(0) when h > a

With the given values (nugget γ(0+) = 2, sill c(0) = 20, range a = 4):

γ(h) = 0 when h = 0
γ(h) = 2 + 18 (3h/8 − h³/128) when 0 < h ≤ 4
γ(h) = 20 when h > 4

b) [Figure: sketch of the semivariogram γ(h) vs lag, rising from the nugget 2 to the sill 20 at the range h = 4]

c) Covariance is variance minus semivariance, c(h) = c(0) − γ(h)



c(h) = c(0) when h = 0
c(h) = (c(0) − γ(0+)) [1 − (3h/(2a) − h³/(2a³))] when 0 < h ≤ a
c(h) = 0 when h > a

With the given values:

c(h) = 20 when h = 0
c(h) = 18 [1 − (3h/8 − h³/128)] when 0 < h ≤ 4
c(h) = 0 when h > 4

d) [Figure: sketch of the covariance c(h) vs lag, dropping from 20 at h = 0 (18 just above 0) to 0 at the range h = 4]
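For reference, the two sketches can be reproduced in R (a sketch using the parameter values of the exercise):
n0 <- 2; c0 <- 20; a <- 4
g.sph <- function(h) ifelse(h==0, 0, ifelse(h<=a, n0 + (c0-n0)*(3*h/(2*a) - h^3/(2*a^3)), c0))
c.sph <- function(h) ifelse(h==0, c0, ifelse(h<=a, (c0-n0)*(1 - (3*h/(2*a) - h^3/(2*a^3))), 0))
h <- seq(0, 10, 0.05)
par(mfrow=c(2,1))
plot(h, g.sph(h), type="l", xlab="Lag", ylab="Semivariance g(h)")
plot(h, c.sph(h), type="l", xlab="Lag", ylab="Covariance c(h)")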

Exercise 8-7 Generate a uniform pattern of 100 points in a [0,1] x [0,1] domain. Convert to a data frame. Run grid (quadrat) and nearest neighbor analysis. Check whether the pattern is uniform. Confirm using the Monte Carlo method. Hint: use runifpoint, data.frame, quad.chisq.ppp, nnGK.ppp, and nnGKenv.ppp. Solution: Quadrat analysis
X <- data.frame(runifpoint(100))
quad.chisq.ppp(X,5)

Note that the p-value is 0.28



[Figure: quadrat analysis of the uniform pattern; point pattern map, intensity map, observed proportion of counts, and ECDF of counts]

With this p-value the pattern may be uniform. It is also confirmed visually. Now apply NN analysis
nnGK.ppp(X)
X.ppp <- ppp(X$x,X$y)
G.u <- Gest(X.ppp,correction=c("none","km"))
K.u <- Kest(X.ppp,correction=c("none","iso"))
par(mfrow=c(2,1))
GKhat.env(n=100, s=20, G.u, stat="G", win=owin(c(0,1),c(0,1)))
GKhat.env(n=100, s=20, K.u, stat="K", win=owin(c(0,1),c(0,1)))



[Figure: G, K, and L functions for the uniform pattern (raw, theoretical Poisson, and corrected estimates) and Monte Carlo envelopes for Ghat and Khat]

All these results confirm that the pattern may be uniform, and also random, except for some deviations in the Khat at higher distances. Exercise 8-8


Use both grid (quadrat) and nearest neighbor analysis (plot G, K, and L and use envelopes from 20 simulation runs) to determine that indeed the xyz pattern in lab8/xyz-geoEAS.txt is uniform but not random. Hint: use quad.chisq.ppp, nnGK.ppp, and nnGKenv.ppp. Solution:
xyz <- scan.geoeas.ppp("lab8/xyz-geoEAS.txt")
quad.chisq.ppp(xyz,5)

Note that $p.value [1] 0.9998942

And then the pattern is likely uniform.
[Figure: quadrat analysis of the xyz pattern; point pattern map, intensity map, observed proportion of counts, and ECDF of counts]

But check by NN analysis nnGK.ppp(xyz) nnGKenv.ppp (xyz,nsim=20)



[Figure: G, K, and L functions for the xyz pattern with Monte Carlo envelopes for Ghat and Khat]

We can see how it departs from the theoretical Poisson, therefore it is not random. Exercise 8-9



Use hick.spp created in the computer session. Apply the tools we learned for quadrat and nearest neighbor analysis. Hint: apply data frame, quad.chisq.ppp, nnGK.ppp, and nnGKenv.ppp. Demonstrate that pattern is clustered. Note: The iso correction estimate cannot be computed for 500 or more points. Subset a portion of the data frame, say 450 points, by using [1:450,]. Solution: To apply the tools we learned for quadrat and nearest neighbor analysis. First, convert the ppp object into a data frame. > hick.df <- data.frame(x=hick.spp$x,y= hick.spp$y)

Apply the quadrat analysis function. > hick <- quad.chisq.ppp(hick.df,5) Note the result $p.value [1] 0

We can clearly reject H0 and conclude that the pattern is clustered.



One limitation is that the iso estimate cannot be computed for 500 or more points. We subset a portion of the data frame, say 450 points. For example > nnGK.ppp(hick.df[1:450,])

Where we can observe

that the empirical function departs from the Poisson and we confirm that the pattern is clustered. Then by Monte Carlo >nnGKenv.ppp (hick.df[1:450,],nsim=100)

The results are



Again we confirm that the pattern is clustered. Exercise 8 10 Use marked pattern given in file lab8/unif100marked-geoEAS.txt. This pattern is uniform because the location data are the same as unif100 already analyzed. Run omnidirectional variogram analysis on the marks and determine the semivariance spherical model. Determine the covariance model and plot it. Solution: > xyv <- scan.geoeas.ppp("lab8/unif100marked-geoEAS.txt") Read 1 item Read 3 items > xyv.v <- vario(xyv,num.lags=10,type='isotropic', maxdist=0.45) ................................................................................................... > xyv.v lags bins classic robust med n 1 1 0.0225 0.1191077 0.1290627 0.1475013 36 2 2 0.0675 0.2070985 0.2595917 0.2992549 91 3 3 0.1125 0.1685219 0.1873228 0.2011548 130 4 4 0.1575 0.1685909 0.1718824 0.1892962 181 5 5 0.2025 0.1631993 0.1804397 0.1945689 199

116


0.12

Variogram estimator: da

0.2

0.04

0.4

y

0.6

0.8

Dataset

0.08

Classical semi-vari

1.0

6 6 0.2475 0.1648980 0.1716067 0.2148563 261 7 7 0.2925 0.1755638 0.1759069 0.1852450 296 8 8 0.3375 0.1840737 0.1918045 0.2296381 320 9 9 0.3825 0.1698943 0.1794637 0.1910472 307 10 10 0.4275 0.1832562 0.1824388 0.1815632 348 > > var(xyv$v) [1] 0.08805452 > model.semivar.cov(var=xyv.v, nlags=10, n0=0.04, c0=0.088, a=0.1)

[Figure: vario() plots for the marked uniform pattern; dataset map, classical semivariogram, square-root-difference cloud by distance, and boxplot by bin]
[Figure: fitted spherical semivariance and covariance models vs lag distance, with nugget 0.04, sill 0.088, and range 0.1]

Exercise 8-11 Plot model semivariance and covariance for the zinc variable in the maas dataset using the directional NE variogram. Solution:
library(sgeostat)
data(maas)
maas45.v <- vario(maas,num.lags=10,type='anisotropic', theta=45, dtheta=7.5, maxdist=2000)



[Figure: vario() plots for the maas zinc data; dataset map, classical semivariogram, square-root-difference cloud, and boxplot by bin]

maas.vsph <- fit.variogram(model="spherical", maas45.v, nugget=45000, sill=135000, range=900)
# does not converge, increase the number of iterations
maas.vsph <- fit.variogram(model="spherical", maas45.v, nugget=45000, sill=150000, range=900, iterations=40)

it converges …. Iteration: 31 Gradient vector: 9180.072 -86818.75 98.43144 New parameter estimates: 54180.07 63181.25 998.4314 rse.dif = 4.768372e-07 (rse = 3249993818 ) ; parm.dist = 87302.8 Convergence achieved by sums of squares. Final parameter estimates: 54180.07 63181.25 998.4314 >

model.semivar.cov(var= maas45.v, nlags=10, n0=54180, c0=63181, a=998)



[Figure: fitted spherical semivariance and covariance vs lag for the NE directional variogram, using the converged estimates (nugget 54180, sill contribution 63181, range 998)]

Does not seem like a good fit. Let us try by visual estimation

model.semivar.cov(var= maas45.v, nlags=10, n0=54180, c0=100000, a=1000)

[Figure: fitted spherical semivariance and covariance vs lag using the visually estimated parameters (nugget 54180, sill contribution 100000, range 1000)]


Chapter 9 Matrices and linear algebra

Exercise 9-1 Identify the row and column number of the remaining elements in matrix A of equation 9.1 Solution: Element

Row

Column

D

2

1

E

2

2

F

2

3

G

3

1

H

3

2

I

3

3

Exercise 9-2 Write a 2×3 matrix as matrix B of equation 9.3 Solution:

B =
[ b11  b12  b13 ]
[ b21  b22  b23 ]

Exercise 9-3

What would be the dimensions of a row vector with n entries? What would be the dimensions of a column vector with n entries? a) 1 × n

b) n ×1 Exercise 9-4

Write a 2×2 matrix. Determine the elements above the diagonal and below the diagonal.

122


Solution:

 a11 a12    a12 above diagonal, a21 below diagonal  a21 a22  Exercise 9-5 Suppose C of equation 9.7 is a covariance matrix. What are the various variances and covariances represented in the entries of matrix C? Solution:

 2 1 4 C = 1 2 3 4 3 2 Variance of all variables is =2. Covariance of variables 1 and 2 is =1, covariance of variables 1 and 3 is =4, covariance of variables 2 and 3 is =3. Exercise 9-6 What would the dimensions of the resulting matrix be if a 3×4 matrix is post-multiplied with a 4×5 matrix? Solution 3×5 Exercise 9-7 Multiply the following matrices

1 2  1 4 −7     2 −5 8  × 3 4    5 6   

Solution:

1 × 1 + 4 × 3 − 7 × 5 1 × 2 + 4 × 4 − 7 × 6   −22 −24   2 × 1 − 5 × 3 + 8 × 5 2 × 2 − 5 × 4 + 8 × 6  =  27 32      123


Exercise 9-8 Find the determinant for the 2×2 identity matrix. Based upon this calculation what do you think the determinant of the 5×5 identity matrix would be? Solution:

 1 0   =1 − 0 =1  0 1 For a 5 x 5 it would also be 1, because all diagonals except the main are zero Exercise 9-9 Find the transpose of square symmetric matrix C in equation 9.7. Do you think the transpose of a square symmetric matrix is always the same matrix? Solution: 2 1 4 CT = 1 2 3   4 3 2 

yes

Exercise 9-10 Multiply the matrix and its transpose given in equation 9.18. What is the dimension of the product? Is it symmetric? Solution: 6 3 6 5 4    77 32  , a 2x2 matrix, yes 3 2 1  × 5 2 = 32 14      4 1  

Exercise 9-11 Consider the following matrices: 4 1  5 0   A = B  3 0  =  0 5   2 −1 124


a) Can you add these? If yes, find A + B, if not explain why. b) Can you find AB? BA? If yes, complete the operation. If not, why not? c) What is the transpose of B? d) Find the det (BTB). Solution: a) No, matrices with different dimensions cannot be added. b) Dim (A) = 2x2

Dim (B) = 3x2

The inner dimensions of the matrices to be multiplied must match, therefore A and B can only be multiplied in one direction which is B times A. 4 1   20 5  5 0     = 15 0  BA =  3 0  ×   0 5  2 −1  10 −5

c) Transpose of B is

4 3 2  BT =   1 0 −1 d) det (BTB) is 4 1  4 3 2    29 2 2 1 0 −1 ×  3 0  = 2 2 = 29 × 2 − 2 = 54    2 −1  

Exercise 9-12 Suppose we have six values of X, xi=2,1,0,0,1,2. Calculate Sx=xTx. Solution:

125


1 2  1 1    6 2 + 1 + 0 + 0 + 1 + 2  6 6  1 1 1 1 1 1  1 0   T = = S x x= x   =        2 1 0 0 1 2  1 0   2 + 1 + 0 + 0 + 1 + 2 4 + 1 + 0 + 0 + 1 + 4  6 10  1 1    1 2  Exercise 9-13 Suppose we also have yi= 3,2,1,1,2,5. Determine matrix Sy=xTy Use this and Sx from previous exercise to write a matrix equation where vector b of regression coefficients is the unknown. Solve this matrix equation and compare results to previous exercise. Solution:

3 2   1   3 + 2 + 1 + 1 + 2 + 5  14  1 1 1 1 1 1   T = = S y x= y  =        2 1 0 0 1 2  1  6 + 2 + 0 + 0 + 2 + 5 15 2   5 xT y = xT xb = b (= xT x) −1 (xT y ) S x S y -1

−1

6 6  14   0.42 −0.25 14   2.08 = b S= x Sy 6 10  =    =       15  −0.25 0.25  15  0.25 -1

Exercise 9-14 Calculate the major product matrix xTx for 10 values x drawn from a standard normal RV. Hint use random number generation x <- rnorm(10,0,1). Solution: Solution:

126


> x <- rnorm(10,0,1)

Note that x is a row vector, and should be xT, also t(x) would be a column vector and should be x. Therefore, to obtain xTx we use x%*%t(x) > round(x%*%t(x),2) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 1.00 1.15 0.62 -2.43 -1.32 -0.36 -1.55 0.87 -0.51 0.99 [2,] 1.15 1.33 0.71 -2.81 -1.53 -0.41 -1.79 1.00 -0.58 1.14 [3,] 0.62 0.71 0.38 -1.51 -0.82 -0.22 -0.96 0.54 -0.31 0.61 [4,] -2.43 -2.81 -1.51 5.94 3.23 0.87 3.79 -2.11 1.24 -2.41 [5,] -1.32 -1.53 -0.82 3.23 1.76 0.47 2.06 -1.15 0.67 -1.31 [6,] -0.36 -0.41 -0.22 0.87 0.47 0.13 0.55 -0.31 0.18 -0.35 [7,] -1.55 -1.79 -0.96 3.79 2.06 0.55 2.42 -1.35 0.79 -1.54 [8,] 0.87 1.00 0.54 -2.11 -1.15 -0.31 -1.35 0.75 -0.44 0.86 [9,] -0.51 -0.58 -0.31 1.24 0.67 0.18 0.79 -0.44 0.26 -0.50 [10,] 0.99 1.14 0.61 -2.41 -1.31 -0.35 -1.54 0.86 -0.50 0.98 >

Exercise 9-15 Calculate the determinant of the major product matrix of the previous exercise. Calculate the inverse. Solution: Generate x by binding a repeat of 10 values of 1 and the rnorm values and then det operation. Two statements suffice x <- cbind(rep(10,1),rnorm(10,0,1)) det(x%*%t(x)) [1] 7.287954e-145

Very low value, no inverse. Exercise 9-16

1 3  5 0  Use the following matrices A =  , B=   and the following vectors 2 4 0 3 2 x  c =   x =  1  Calculate AI, AB, BA, Bc, Ic where I is the identity matrix. Write the 0  x2  equation Bx =c. Solve for x. 127


Solution: > A <- matrix(c(1,3,2,4),ncol=2,byrow=T) > B <- matrix(c(5,0,0,3),ncol=2,byrow=T) > I <- diag(2) >A [,1] [,2] [1,] 1 3 [2,] 2 4 >B [,1] [,2] [1,] 5 0 [2,] 0 3 >I [,1] [,2] [1,] 1 0 [2,] 0 1 > A%*%I [,1] [,2] [1,] 1 3 [2,] 2 4 > A%*%B [,1] [,2] [1,] 5 9 [2,] 10 12 > B%*%A [,1] [,2] [1,] 5 15 [2,] 6 12 > c <- c(2,0) > B%*%c [,1] [1,] 10 [2,] 0 > I%*%c [,1] [1,] 2 [2,] 0 > solve(B,c) [1] 0.4 0.0 >

For verification and further illustration, these can be done by hand.

128


AI = A 1 3 5 0   5 9  AB = =      2 4  0 3 10 12  5 15 BA =   6 12  5 0   2  10  Bc = =     0 3  0   0  5 0   x1   5 x1  Bx = =      0 3   x2   3 x2  1 0   2   2  Ic = =     0 1   0   0   5x  2  x   2 / 5 = solve  1  = yields  1      3 x2   0   x2   0 

129


Chapter 10

Multivariate models

Exercise 10-1 Assume two uncorrelated variables X1 and X2 with the sample means 1.5 and 2.3 respectively, and sample variances 0.2, 0.3 respectively. Assume a dependent variable Y with sample mean of 4.0 and that covariance of Y with X1 and X2 are 0.25 and 0.12 respectively. Calculate the coefficients of linear multiple regression. Solution:     Y − b1 X 1 − b2 X 2  b0    scov( X1 ,Y )  b  =  =  1   sX2 1 b2    scov( X 2 ,Y )   2   sX 2  

   4 − b11.5 − b2 2.3   0.25     0.2   0.12   0.3  

Then b0     b1  b2 

 4 − 0.125 ×1.5 − 0.4 × 2.3   0.125 =    0.4

 2.89    0.125  0.4  Exercise 10-2

Repeat the previous exercise but assume that X1 and X2 are correlated with covariance of 0.5. Discuss the differences in results with respect to the results of the previous exercise. Solution:

130


b0  b   1 b2 

    Y − b1 X 1 − b2 X 2     2 ( s X 2 ) scov( X1 ,Y ) − scov( X 2 ,Y ) scov( X1 , X 2 )  =   ( s X1 ) 2 ( s X 2 ) 2 − ( scov( X1 , X 2 ) ) 2    ( s X ) 2 scov( X ,Y ) − scov( X ,Y ) scov( X , X )  2 1 1 2  1  2 2 2 ( s X1 ) ( s X 2 ) − ( scov( X1 , X 2 ) )  

   4 − b11.5 − b2 2.3     0.3 × 0.25 − 0.12 × 0.5   0.2 × 0.3 − 0.52   0.2 × 0.12 − 0.25 × 0.5     0.3 × 0.2 − 0.52 

b0   4 + 0.079 × 1.5 − 0.532 × 2.3  2.89  b  =  =   −0.079  1    −0.079  b2     0.532  0.532 We see that b0 does not change, whereas b1 and b2 change in opposite directions. Coefficient b1 decreases and coefficient b2 increases. Exercise 10-3 Suppose that a multivariate regression model has 2 dependent variables Y1 and Y2 and four independent variables X1, X2, X3, X4. Determine the dimensions for matrices x, y, b, xTx, and xTy. Assume n=10 observations. Solution: Matrix x is 10×5, the y matrix is 10×2, and the predictor model will have a coefficient matrix b of dimension 5×2. Then xTx, is 5×10×10×5=5×5, and xTy is 5×10×10×2=5×2. Exercise 10-4

3 1 1 3   = 2 2  and X 2 Suppose X1 =    3 3  2 1 

5 4  6  4 5  4

5 5  5  . Determine, m, n1, n2. Show that the group 5 4  5

means of X1 and X2 are X1 = [ 2.2 2.0] and X 2 = [ 4.66 4.83] . Calculate the vector D of differences in group means. Assume that the covariance matrices of X1 and X2 are  0.7 −0.25  0.67 −0.07  S1 = and S2    1   −0.25  −0.07 0.17  131


Show that Sp and its inverse are

1.56 0.43  0.68 −0.15 Sp =  and S p −1 =    0.43 1.98   −0.15 0.54   -5.07  Show that the vector of weights is A =   , and that the centroids are  -6.67  Z 1 = 15.71 and Z 2 = −15.71 . Show that the Mahalanobis distance is D2=31.4. Show that F =38 and that the p-value=0.000021. Given that the scores are calculated to be  −18.51  −13.44     −23.58   Draw a sketch of the observations on the Z line. Plot 13.44 −    −11.83    −13.44  the centroids on the same line and write conclusions.

18.33  15.12    = Z1 = 16.72  and Z 2    4.98   23.40 

Solution: There are two groups, thus m=2, there are 5 observations in group 1, n1=5, and six in group 2, n2=6. The group means of X1 are calculated by arithmetic averages of each column and therefore X1 = [ 2.2 2.0] Likewise for X2, X 2 = [ 4.66 4.83] . Then the

difference D = X1 - X 2 = − [ 2.46 2.83] [ 2.2 2.0] − [ 4.66 4.83] = . Now Sp =

(n1 − 1)S1 + (n2 − 1)S 2 n1 + n2 − 2 (5 − 1)  0.7 −0.25 (6 − 1)  0.67 −0.07  +   1  5 + 6 − 2  −0.07 0.17  5 + 6 − 2  −0.25

=

4  0.7 −0.25 5  0.67 −0.07  + 1  9  −0.07 0.17  9  −0.25  0.68 −0.15 =   −0.15 0.54 

And then taking inverse 132


1.56 0.43 S p −1 =   0.43 1.98  1.56 0.43  −2.46   -5.07  = The vector of weights is A = S p -1 DT =     , and the centroids 0.43 1.98   −2.83  -6.67   -5.07  Z 1 X= A 2.2 2.0 = are= [ ] 1  -6.67  15.71 and    -5.07  Z 2 = X 2 A = [ 4.66 4.83]   = −15.71  -6.67   -5.07  31.4 DA = − [ 2.46 2.83]  The Mahalanobis distance is D 2 = =  -6.67  Calculate

( n1 + n2 − m − 1)( n1n2 ) D 2 F= = m(n1 +n2 − 2) ( n1 +n2 ) =

5 + 6 − 2 − 1)( 5 × 6 ) (= 31.4 2(5 + 6 − 2) ( 5 + 6 )

8 × 30 = 31.4 38 2 × 9 ×11

Degrees of freedom 2, 8 the p-value is F distribution with df 2,8, which 0.000021. A sketch of the observations and centroids on the Z line. Centroids are represented by large symbols.

Exercise 10-5 Suppose we have m=3 groups, n=10 observations for all groups, and k=2 variables, and want to do MANOVA. Calculate degrees of freedom for among and within: dfa and dfw. Suppose we have calculated the sample means to be 133


= X1

= 1.58] X 2 [ = 2.48 2.35 ] X 3 [3.42 3.47 ] [1.46

(0.2)

Assume we have calculated that the covariance matrix of these sample means is 0.962 0.928  S2 X =   Show that the among covariance matrix Sa is 0.928 0.910  19.235 18.561  Sa =   . Assume that the covariance matrices of each group are 18.560 18.200 

0.099 0.019  2  0.103 -0.031  2  0.079 -0.033 = S21 = = S3   S 2  -0.031  0.093 0.019 0.064    -0.033 0.093

(0.3)

 2.531 -0.411 Show that the within covariance matrix S w =    -0.411 2.245 0.407 0.0746  We calculate the inverse to be S w −1 =   0.0746 0.459  9.218 9.954  Show that the matrix SaS w −1 is SaS w −1 =   . Show that the values of the 8.917 9.739  Holling-Lawley, Pillai and Wilks statistics are 18.96, 1.00, and 0.048 respectively Solution: Total N =m×n = 30, degrees of freedom for among and within: dfa =m-1=3-1=2 and dfw.= N-m=30-3=27. The among covariance matrix is 0.962 0.928  19.235 18.561  S a =df a S 2 X =2 ×10 ×   =  0.928 0.910  18.560 18.200 

(

)

( )

The within covariance matrix is S w = df w S 2p . First calculate S 2p m

= S 2 p

∑S

2

 2.531 -0.411  0.094 −0.015  =   therefore S w =  -0.411 2.245 m    −0.015 0.083 

i =1

i

134


19.235 18.561  0.407 0.0746  9.218 9.954  = Matrix SaS w −1 is S a S w −1 =     . Show 18.560 18.200  0.0746 0.459  8.917 9.739  that the values of the Holling-Lawley, Pillai and Wilks statistics are 18.96, 1.00, and 0.048 respectively H = trace(S a S w −1 ) = 9.218 + 9739 = 18.96 P = trace (Sa (Sw + Sa )−1 ) could be calculated with R WAinv <- solve((Sa+Sw),diag(2)) PA <- Sa%*%WAinv P <- sum(diag(PA))

And we get P=1

= Λ

Sa 5.59 = = 0.048 S a + S w 115.62 Exercise 10-6

Use trees data frame in package datasets. Attach this dataset. You can get more information about it by using help(trees). Assume that Girth (diameter at breast height) and tree height will be explanatory variables and that Volume will be a dependent variable). Then investigate by linear multiple regression the relationships of Volume to Girth and Height. What is the effect of the correlation between explanatory variables (girth and height) on the results of the multiple regression? Solution: Available upon request. Exercise 10-7 Use Mandel dataframe of package car that has values for two explanatory variables x1,x2 and one dependent variable y. Note: first load package car and then the dataset Mandel. Build a predictor. Investigate by linear multiple regression. Are there any collinearity problems? Explain. Solution: Available upon request.

135


Exercise 10-8 Use data from Davis, 2002 Figure 6-2 page 472, and variables are median grain size and sorting coefficient of beach sand samples from Texas. For easy access, SEEG includes files sand1.txt and sand2.txt with these data. Perform linear discriminant analysis to find a discriminant function of sand1 data from sand2. Solution: Work with data files sand1.txt and sand2.txt which have m=2 variables, n1=34, n2=47 observations . First look at files sand1.txt and sand2.txt using notepad, the scan the files and convert to matrices X1 <- matrix(scan("lab10/sand1.txt"),ncol=2,byrow=T) X2 <- matrix(scan("lab10/sand2.txt"),ncol=2,byrow=T)

Use file lda2.R in the downloaded archive. This contains a function to perform linear discriminant analysis. Source it to R. Then run lda2 with X1 and X2 as arguments lda2(X1,X2)

The results obtained are > lda2(X1,X2) $m.n1.n2 [1] 2 34 47 $G1 [1] 0.3297059 1.1673529 $G2 [1] 0.3398511 1.2100000 $S1

[,1] [,2] [1,] 2.803209e-05 -0.0001480749 [2,] -1.480749e-04 0.0022927807 $S2

[,1] [,2] [1,] 3.008603e-05 -0.0001834783 [2,] -1.834783e-04 0.0023260870 $Sp

[,1] [,2] [1,] 2.922805e-05 -0.0001686895 [2,] -1.686895e-04 0.0023121742

136


$A

[,1] [1,] -783.4418 [2,] -75.6022 $Z1c.Z2c [1] 5.586185 -5.586185 $D2.F.p.value [1] 11.17237 108.81145 0.00000 $Z1s [1] 9.60951074 4.12541812 4.93627970 8.09746668 13.66381862 11.28607344 [7] 8.96316781 7.39628420 4.23509722 5.07337857 -1.19415585 8.23456555 [13] 1.96703113 8.26198533 6.69510172 5.91165992 10.63973051 6.72252150 [19] 3.58875429 3.61617406 7.53338307 2.04929045 3.64359383 -1.05705698 [25] 5.23789721 6.83220059 0.56466617 7.64306217 2.15896955 3.78069271 [31] -0.08167676 3.86295203 10.77682938 5.15563789 $Z2s [1] 1.8847718 -3.5993208 -6.7330880 -9.0559937 -1.9501779 -6.6508287 [7] 0.4275673 -4.2730835 -2.6787801 -7.3794310 -9.7297564 -2.6513604 [13] -5.0016858 -8.1354530 -12.0526620 -4.1908242 -7.3245914 0.5372464 [19] -2.5965208 -4.1634044 -6.5137298 -4.9194264 -5.7028683 -6.4863101 [25] -2.5416813 -4.1085649 -6.4588903 -8.8092157 -1.7582395 -12.7264247 [31] -4.1085649 -8.7817959 -3.2702835 -7.9709343 -9.5378179 -1.6759801 [37] -5.5931892 -9.5103982 -4.7823276 -7.8886750 -9.4555586 -3.1606044 [43] -6.2943716 -5.4835101 -5.4286705 -8.5624377 -5.7028683

The centroid scores Z1c, Z2c and individual scores Z1s, Z2s have been centered on Z0, which is the midpoint or average of the centroids. From A, the weight or coefficient for the first variable is ten times the weight for the second variable. The F value is large enough that produces negligible p- value allowing us to reject the null indicating significant difference between the two groups. Using the individual scores and the centroid scores, the function also produces the plot

137


which displays the difference between the groups in two different manners. The one at the top panel uses markers and the two below use histograms. It indicates little overlap in observations along the score axis and substantial differences between the centroids. Exercise 10-9 Use the data from previous exercise. Perform MANOVA on sand1 and sand2. Solution: Use the base package function aov. There is a special summary function that applies to manova. Pillai is the default, but Wilks lambda can be requested as argument test=”Wilks”. First make the multivariate set, then the factor and then a data frame >X <- rbind(X1,X2) >grp <- factor(rep(c(1,2), c(34,47))) >X.g <- data.frame(grp, X)

The univariate ANOVA can be applied to each variable > summary(aov(X ~grp,data=X.g)) Response 1 : Df Sum Sq Mean Sq F value Pr(>F) grp 1 0.00203054 0.00203054 69.472 1.937e-12 *** Residuals 79 0.00230902 0.00002923 --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

138


Response 2 : Df Sum Sq Mean Sq F value Pr(>F) grp 1 0.035881 0.035881 15.519 0.0001752 *** Residuals 79 0.182662 0.002312 --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

We can see that there is significant difference between the two groups for each variable (or response). Now we run MANOVA as to include covariance. > manova(X ~grp,data=X.g) Call: manova(X ~ grp, data = X.g) Terms:

grp Residuals resp 1 0.00203054 0.00230902 resp 2 0.03588145 0.18266176 Deg. of Freedom 1 79 Residual standard error: 0.005406298 0.04808507 Estimated effects may be unbalanced >

We can use any test of the set "Pillai", "Wilks", "Hotelling-Lawley", and "Roy" by adding the parameter test=. Let us use Pillai’s > summary(manova(X ~grp,data=X.g), test= "Pillai") Df Pillai approx F num Df den Df Pr(>F) grp 1 0.736 108.811 2 78 < 2.2e-16 *** Residuals 79 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 >

The result is also significant because we have a relatively high Pillai statistic (0.74), a high F (108), and a very small p-value for this F. Therefore we conclude that there are significant differences between the groups for the combined responses. To contrast, let's use the Wilks lambda. > summary(manova(X ~grp,data=X.g), test="Wilks") Df Wilks approx F num Df den Df Pr(>F)



grp 1 0.264 108.811 2 78 < 2.2e-16 *** Residuals 79 --Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 >
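For completeness, all four test statistics can be requested in one short loop (a sketch using the same manova fit as above):

for (tst in c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")) {
  print(summary(manova(X ~ grp, data = X.g), test = tst))   # same fit, different test statistic
}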

The result is also significant because we have a low lambda (0.26), a high F (108), and a very small p-value for this F. Therefore, we conclude that there are significant differences between the groups for the combined responses. In this case Pillai and Wilks gave the same result; in fact, the results will be the same with Hotelling-Lawley and Roy. Exercise 10-10 Use the airquality dataframe. Form one group of data with Ozone and Temp (two variables) for all days of the month of May and another group with the same data for the month of August. Develop discriminant analysis and MANOVA for these two groups. Solution: Available upon request.



Chapter 11

Dependent stochastic processes and time series

Exercise 11-1 What is the stationary probability distribution when p21=0.6 and p12=0.4? Solution:

X*1 = p12/(p21 + p12) = 0.4/1.0 = 0.4

X*2 = p21/(p12 + p21) = 0.6/1.0 = 0.6

Exercise 11-2

.5 .4 0 0  0 0 .9 .5  . Demonstrate that it is equal to Calculate the steady state for the matrix P =  .5 .6 0 0    0 0 .1 .5 X*=[0.27, 0.33, 0.33, 0.07]T. Confirm by examining the steady-state in Figure 11-8. Solution: D = (1-p11 )(2 p 24 +1 - p 23 )+ p12 p 24 = 0.75 * -1 0.4 × 0.5 / 0.75 = 0.266 X 1 = p12 p 24 D = * * -1 0.5 × 0.5 / 0.75 = 0.333 X 2 = X 3 = (1 − p11) p 24 D = * -1 0.1× 0.5 / 0.75 = 0.066 X 4 = (1 − p 23)(1- p11 )D =

It fits well with the last values in Figure 11-8. Exercise 11-3 Evaluate the stationary states according to equation 11.23 and demonstrate that we get X**=[0.24, 0.49, 0.19, 0.08]^T; compare to the X* for the embedded process. This exercise should demonstrate


the importance of the holding time in determining the stationary distribution of a semi-Markov process. Solution: available upon request. Exercise 11-4 Write the matrix of YW equations for AR(3) in a similar manner as we did in equation (11.37). Solution

[ρ(1)]   [1        ρ(1)      ρ(3−1)] [a1]   [1     ρ(1)  ρ(2)] [a1]
[ρ(2)] = [ρ(1)     1         ρ(3−2)] [a2] = [ρ(1)  1     ρ(1)] [a2]
[ρ(3)]   [ρ(3−1)   ρ(3−2)    1     ] [a3]   [ρ(2)  ρ(1)  1   ] [a3]
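Once sample autocorrelations are available, this system can be solved numerically with solve; the rho values below are hypothetical and only illustrate the mechanics (they are not taken from the book):

rho <- c(0.8, 0.6, 0.4)                 # hypothetical rho(1), rho(2), rho(3)
R <- toeplitz(c(1, rho[1], rho[2]))     # the 3x3 matrix of the YW equations above
a <- solve(R, rho)                      # AR(3) coefficients a1, a2, a3
a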

Exercise 11-5 Study the rainfall values when dry days tend to be followed by dry days. The transition matrix is c(0.8,0.6,0.2,0.4). Provide plots and numeric results. Discuss the results. Compare to Fig 11-4. Solution: available upon request. Exercise 11-6 Add lines of code to the script listed earlier to generate nine realizations (Fig 11-5 and Fig 11-6) and calculate the sample mean of 1) the number of wet days and 2) the monthly averages. Use sample sizes of 10 and 100 (realizations). Compare and discuss. Hint: the sample mean of the number of wet days would be the average of the number of wet days over all runs. Likewise, the sample mean of the monthly means would be the average of the mean over all runs (realizations). Also, note that this can be done for the monthly averages of wet days only or of all days. Solution: available upon request. Exercise 11-7 Explore the effect of a 1st-order Erlang PDF in the semi-Markov example. To modify the parameter values, edit the Hk matrix to contain all 1s. Compare to the results given in the example (Fig 11-9 and Fig 11-10). Repeat for a 3rd-order Erlang. Solution: Hk <- matrix(c(1,1,0.0,0.0, 0.0,0.0,1,1, 1,1,0.0,0.0, 0.0,0.0,1,1), ncol=4, byrow=T) nruns=4; y <- list() for(i in 1:nruns){



y[[i]] <- semimarkov(P, Hk,Ha, tsim=1000, xinit=3) } panel4(size=7) for(i in 1:nruns) plot(y[[i]]$t,y[[i]]$x,type="s",xlab="Years",ylab="State (Role)",ylim=c(1,4))

[Figure: four panels of the simulated state trajectories, State (Role) versus Years, one per realization]

for(i in 1:nruns) hist(y[[i]]$tau,xlab="Years",main="Hist of Holding time",cex.main=0.7)

[Figure: four panels of holding-time histograms ("Hist of Holding time", Frequency versus Years), one per realization]

We see faster changes between states and nearly exponential holding times. Now for the 3rd-order Erlang: Hk <- matrix(c(3,3,0.0,0.0, 0.0,0.0,3,3, 3,3,0.0,0.0, 0.0,0.0,3,3), ncol=4, byrow=T) nruns=4; y <- list() for(i in 1:nruns){ y[[i]] <- semimarkov(P, Hk,Ha, tsim=1000, xinit=3) } panel4(size=7) for(i in 1:nruns) plot(y[[i]]$t,y[[i]]$x,type="s",xlab="Years",ylab="State (Role)",ylim=c(1,4)) for(i in 1:nruns) hist(y[[i]]$tau,xlab="Years",main="Hist of Holding time",cex.main=0.7)



[Figure: for the 3rd-order Erlang case, four panels of State (Role) versus Years and four holding-time histograms ("Hist of Holding time", Frequency versus Years)]

We see an increase in the time spent in each state, and a shift in the histograms to higher values of holding time. Exercise 11-8 Develop a semi-Markov model for five states. Four of these are the four roles already studied and the fifth state is a canopy gap or opening. Only those roles that start in gaps should have nonzero transition probabilities from the gap state. Only those roles that create gaps should have nonzero transition probabilities to the gap state. Draw a transition graph, write a matrix, assign parameter values, calculate steady-state values, and develop a simulation using function semimarkov. Solution: available upon request. Exercise 11-9 Predict sunspots 22 years ahead from year 2011 (2012-2033) using a simple AR(9) model. Compare to the results obtained with ARIMA(9,1,0). Predict sunspots 22 years ahead from year 1980 (1981-2003). Compare to observed values in that time period.


Solution: First scan the file, convert to ts yrspots <- matrix(scan("lab7/year-spot1700-2011.txt",skip=6),ncol=2,byrow=T) yrspots.rts <- ts(yrspots[,2], start=1700, deltat=1)

then use AR(9), predict, and plot. The AIC is 2596 ar.yrspots <- ar.yw(yrspots.rts) yrspots.pred <- predict(ar.yrspots, yrspots.rts, n.ahead=22) # plot up <- yrspots.pred$pred + 2*yrspots.pred$se low <- yrspots.pred$pred - 2*yrspots.pred$se minx<-min(yrspots.rts,low) maxx<-max(yrspots.rts,up) panel2(size=7) ts.plot(yrspots.rts,yrspots.pred$pred, col=1, lty=c(1,2),xlim=c(1700,2033),ylim=c(minx,maxx),ylab="X") lines(up, col=1, lty=3) lines(low, col=1, lty=3) legend("top",leg=c("Data","Pred","Upper & Lower"), lty=c(1,2,3)) ts.plot(yrspots.rts,yrspots.pred$pred, col=1, lty=c(1,2),xlim=c(1960,2033),ylim=c(minx,maxx),ylab="X") lines(up, col=1, lty=3) lines(low, col=1, lty=3) legend("top",leg=c("Data","Pred","Upper & Lower"), lty=c(1,2,3))



[Figure: two-panel plot of the sunspot series with the AR(9) predictions and upper and lower bounds, X versus Time; full record 1700-2033 (top) and zoom 1960-2033 (bottom); legend: Data, Pred, Upper & Lower]

Use ARIMA(9,1,0), the AIC is 2593, a slight improvement arima.yrspots <- arima(yrspots.rts, order=c(9,1,0)) yrspots.pred <- predict(arima.yrspots, n.ahead=22) up <- yrspots.pred$pred + 2*yrspots.pred$se low <- yrspots.pred$pred - 2*yrspots.pred$se minx<-min(yrspots.rts,low) maxx<-max(yrspots.rts,up) panel2(size=7) ts.plot(yrspots.rts,yrspots.pred$pred, col=1, lty=c(1,2),xlim=c(1700,2033),ylim=c(minx,maxx),ylab="X") lines(up, col=1, lty=3) lines(low, col=1, lty=3) legend("top",leg=c("Data","Pred","Upper & Lower"), lty=c(1,2,3)) ts.plot(yrspots.rts,yrspots.pred$pred, col=1, lty=c(1,2),xlim=c(1960,2033),ylim=c(minx,maxx),ylab="X") lines(up, col=1, lty=3) lines(low, col=1, lty=3) legend("top",leg=c("Data","Pred","Upper & Lower"), lty=c(1,2,3))



[Figure: two-panel plot of the sunspot series with the ARIMA(9,1,0) predictions and upper and lower bounds, X versus Time; full record (top) and zoom 1960-2033 (bottom); legend: Data, Pred, Upper & Lower]

The results are similar to those of AR(9). The solution for the remainder of the exercise is available upon request. Exercise 11-10 Consider the flow of the Neches River in Texas at station USGS 08040600 near Town Bluff, TX. File lab7/TB-flow.csv contains daily flow data 1952-2010. We used this file for one of the exercises in Chapter 7. Read the file, convert the flow to a time series, plot the time series and the autocorrelation, and develop an ARIMA model. Hint: when applying ts use freq=365, start=1952, end=2010. Solution: available upon request
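A starting sketch for Exercise 11-10, assuming the daily flows have already been read from lab7/TB-flow.csv into a numeric vector called flow (the actual column layout of the file is not shown here):

flow.rts <- ts(flow, start=1952, end=2010, freq=365)   # per the hint: freq=365, start=1952, end=2010
plot(flow.rts)                                         # time series plot
acf(flow.rts)                                          # autocorrelation
flow.arima <- arima(flow.rts, order=c(1,1,1))          # one candidate ARIMA specification to refine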



Chapter 12

Geostatistics: kriging Exercise 12 1

Calculate the estimate Z at a point x0 with coordinates (2,3) from measurements at three points x1 at (1,3), x2 at (5,5) and x3 at (5,1). Sketch a plot on the plane to show the location of these points. Assume a semivariance spherical model with nugget 0.2, sill or variance 1, range 5. Use the same Z values as in the example. Solution:

      x0    x1    x2    x3
x0   0.00
x1   1.00  0.00
x2   3.61  4.47  0.00
x3   3.61  4.47  4.00  0.00

Apply spherical model



γ(h) = 0.2                               when h = 0
γ(h) = 0.2 + 0.8 (3h/10 − h³/250)        when 0 < h ≤ 5
γ(h) = 1                                 when h > 5

Evaluate and enter the results in a table:

      x0    x1    x2    x3
x0   0.20
x1   0.65  0.20
x2   0.91  0.99  0.20
x3   0.91  0.99  0.96  0.20

Subtract from the variance to obtain the covariance and write the results:

      x0    x1    x2    x3
x0   0.80
x1   0.35  0.80
x2   0.09  0.01  0.80
x3   0.09  0.01  0.04  0.80

Now arrange the covariance values in matrix form according to the equation

| 0.80  0.01  0.01  1 | | λ1 |   | 0.35 |
| 0.01  0.80  0.04  1 | | λ2 | = | 0.09 |
| 0.01  0.04  0.80  1 | | λ3 |   | 0.09 |
| 1     1     1     0 | | −µ |   | 1.00 |

This equation is solved to find

λ1 =  0.558
λ2 =  0.221
λ3 =  0.221
µ  = −0.101


Now finally use the coefficients to make the estimate

Z(x0) = λ1 Z(x1) + λ2 Z(x2) + λ3 Z(x3) = 0.558 × 2 + 0.221 × 4 + 0.221 × 8 = 3.76
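The hand calculation can be verified in R by solving the same linear system with solve, using the covariance values tabulated above and the Z values of the example:

C <- rbind(c(0.80, 0.01, 0.01, 1),
           c(0.01, 0.80, 0.04, 1),
           c(0.01, 0.04, 0.80, 1),
           c(1,    1,    1,    0))   # covariances bordered by the unbiasedness constraint
b <- c(0.35, 0.09, 0.09, 1)          # covariances between x0 and x1, x2, x3, plus 1
lam <- solve(C, b)                   # weights lambda1, lambda2, lambda3 and the multiplier
sum(lam[1:3] * c(2, 4, 8))           # kriging estimate Z(x0)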

Exercise 12 2 Consider a waste site of size 1 km by 1 km and that we sampled for toxicant concentration at 50 points randomly distributed on the site. Determine the maximum distance to employ for the empirical semivariance calculation. Design a kriging grid that would produce predictions every 50 m in both spatial directions. Determine the number of grid columns and rows. Solution: The maximum distance is √2 = 1.41 km. The grid is 1000/50 = 20 points in each direction, or 20 columns and 20 rows.

Exercise 12 3 Consider the situation of the previous exercise and, in addition, the site has a slope toward the bottom row. Data suggest that concentrations increase as elevation decreases. What type of kriging would you employ? Describe the kriging process. Solution: Universal kriging (UK). We use the same functions as in ordinary kriging, but we first remove the trend. To accomplish this we proceed as follows: 1) find a polynomial trend surface, probably a first-order trend, and extract the residuals (the residuals are examined to find the model of the trend that provides the best fit); 2) generate models of the semivariance and covariance for the residuals obtained in (1); 3) apply ordinary kriging to these residuals, using the models developed in (2); 4) add the trend obtained in (1) to the kriged residuals obtained in (3). Exercise 12-1 Use the lab12/example-1x1.txt file. Use the spherical model built in the computer session example. Perform ordinary kriging on a 100×100 grid. Produce maps of the prediction and of the prediction error. Compare to Figs 12-6 and 12-9. Solution:



xy <- scan.geoeas.ppp("lab12/example-1x1.txt") xy.ppp <- ppp(xy$x, xy$y, marks=xy$z) xyz <- point(xy) xyz.vsph <- make.variogram(nugget=0, sill=160, range=0.1) xyz.ok <- Okriging(xyz, xyz.vsph, step=0.01, maxdist=0.25) plot.kriged(xyz, xyz.ok, outpdf="lab12/img/xyz-kriged-highres.pdf")



We can appreciate that these are the higher resolution images of Figs 12-6 and 12-9. Exercise 12-2 Use the lab12/example-trend-1x1.txt file. Use the spherical model built in the computer session example. Perform universal kriging on a 100×100 grid. Produce maps of the prediction. Compare to Fig 12-17. Solution: available upon request. Exercise 12-3 Work with the maas dataset of sgeostat package. Recall that we worked with this spatial pattern in Chapter 8. We calculated a model for the semivariogram, and we stored the model variogram in maas.vsph. Use this model to perform ordinary Kriging and produce maps of kriged and variance of kriging error. Use step=100, maxdist=1000. Solution: For easy reference, we repeat some of commands employed in Chapter 8. library(sgeostat) data(maas) maas.point <- point(maas) plot.point.bw(maas.point,v='zinc',xlab='easting',ylab='northing', legend.pos=2,pch=c(21:24),main="",cex=0.7) maas.v <- vario(maas,num.lags=10,type='isotropic', maxdist=2000) var(maas$zinc) m.maas.v <- model.semivar.cov(var=maas.v, nlags=10, n0=50000, c0=150000, a=1000) maas.vsph <- fit.variogram(model="spherical", maas.v, nugget=50000, sill=150000, range=1000,iterations=30)

Recall that these commands can also be executed using the Rcmdr. The optimal fit semivariance model parameter values are nugget=50000, sill=135000 and range=1000



Using this model we can proceed to perform ordinary kriging using the SEEG function Okriging, given at the end of the chapter. First, we have to select a step for the prediction grid. Use the minimum and maximum values on each axis to select a distance step. In this case we will use step=100. maas.ok <- Okriging(maas, maas.vsph, step=100, maxdist=1000)

We obtain a dataset of the kriged values of the variable (zinc concentration) over the prediction grid together with the variance of the kriging error. Examine maas.ok and note that it has the following contents > maas.ok x y zhat varhat 1 178605 329714 NA NA 2 178705 329714 NA NA 3 178805 329714 NA NA 4 178905 329714 NA NA … And so on until … 35 179205 329814 439.7621 74444.34 36 179305 329814 434.8722 72702.80



37 179405 329814 409.2206 71518.98 38 179505 329814 371.5133 79760.00

columns x and y contain the x and y coordinates of the predictions, zhat is the predicted value, and varhat is variance estimate. Also, note that the prediction was made on a grid yielding 1092 values. Now to obtain maps just apply the SEEG function plot.kriged given at the end of the chapter. Just do plot.kriged(maas, maas.ok)

and obtain two maps. The first has a raster image of the kriged values. For additional visualization, we superimpose a contour map, and a plot of the original point pattern (measured points)

The second map is the variance of the kriging error and provides a visual idea of how the error varies over the domain. We see how the variance is higher towards the east. The function also produces output that we can use for other purposes.


We can accomplish the same results using the SEEG add-on to the Rcmdr. First, select maas as the active dataset. Then go to Spatial|Ordinary Kriging. Enter the appropriate text and values in the dialog box; that is, semivariance model=maas.vsph, step=100, maxdist=1000, dataset to store results=maas.ok. Press Ok and obtain the same results as above for maas.ok. Now to do the plots: reselect maas as the active dataset, then go to Spatial and select Plot Kriged, and we get the same plots as above. Exercise 12-4 Work with the maas dataset of the sgeostat package as in the previous exercise. Perform universal kriging and produce maps of the trend, residuals and errors, and the prediction. Solution: To do universal kriging (UK) we use the same functions as in ordinary kriging but we first remove the trend. In order to accomplish this we proceed as follows: 1) find a polynomial trend surface and extract the residuals (the residuals are examined to find the model of the trend that provides the best fit);


2) generate models of the semivariance and covariance for the residuals obtained in (1); 3) apply ordinary kriging to these residuals, using the models developed in (2); 4) add the trend obtained in (1) to the kriged residuals obtained in (3). We will use the same maas sample data set which we used in previous exercises and apply these four steps. 1) First, find the possible trend assuming linear (polynomial order 1) maas.tr <- fit.trend(maas.point,'zinc',np=1,plot.it=T)

We get the coefficients in the beta component, then x and y coordinates and a contour line plot > maas.tr $beta x^0 y^0 x^1 y^0 x^0 y^1 -2.244418e+04 -4.129057e-01 2.932104e-01 $R

x^0 y^0 x^1 y^0 x^0 y^1 [1,] -12.4499 -2241039.20 -4128821.650 [2,] 0.0000 -9258.11 -11147.208 [3,] 0.0000 0.00 -6693.045 $np [1] 1 $x [1] 181072 181025 181165 181298 181307 181390 181165 181027 181060 181232 [11] 181191 181032 180874 180969 181011 180830 180763 180694 180625 180555 …

We obtain the equation of the trend m(x, y) = −2.24×10^4 − 0.412x + 0.293y, as shown in this figure



Now we can remove this trend by extracting residuals > maas.res <- data.frame(maas.tr$x,maas.tr$y, maas.tr$residuals) > colnames(maas.res) <- c("x","y","Res") > maas.res x y Res 1 181072 333611 413.616667 2 181025 333558 528.750253 3 181165 333537 91.714468 4 181298 333484 -220.828923 ….

2) Re-calculate the empirical variogram for the residuals and generate a new model variogram maas.res.v <- vario(maas.res,num.lags=10,type='isotropic', maxdist=2000) maas.res.vsph <- fit.variogram(model="spherical", maas.res.v, nugget=50000, sill=150000, range=1000, iterations=30)

the procedure goes through 20 iterations to yield ……. Iteration: 20 Gradient vector: -2865.902 -41348.67 -45.1847 New parameter estimates: 47134.1 108651.3 954.8153 rse.dif = 4.768372e-07 (rse = 1298420600 ) ; parm.dist = 41447.89



Convergence achieved by sums of squares. Final parameter estimates: 47134.1 108651.3 954.8153

The omnidirectional variogram can be modeled spherically with range=955, sill=108651, nugget=47134. 3) Now we perform kriging of the residuals with this semivariance model. maas.res.ok <- Okriging(maas.res, maas.res.vsph, step=100, maxdist=1000)

We map the predicted results and variance of error as before. plot.kriged(maas, maas.res.ok)

and obtain



We can appreciate that now we have positive and negative areas because the estimated values are those of the residuals (concentration values after removing the trend). The same thing can be done using the Rcmdr: select the active dataset maas.res, go to Spatial|Ordinary Kriging, enter the text and values in the dialog box, then reselect maas.res as the active dataset, go to Spatial|Plot Kriged and fill in the dialog box with maas.res.vsph, 100, 1000, maas.res.ok. 4) The final step is to add the trend to the kriged residuals. To do this, first find the values of the trend at each point in the grid maas.trend <- -22444.18 -0.41*maas.res.ok$x +0.29*maas.res.ok$y

and then create and update the zhat using this trend maas.uk <- maas.res.ok maas.uk$zhat=maas.trend+maas.res.ok$zhat

Then use function plot.kriged to obtain the superimposed trend and fitted residuals


plot.kriged(maas, maas.uk)

This last command can also be executed using the Rcmdr SEEG add-on via Spatial|Plot Kriged and filling in the dialog box. We can contain the kriging prediction within an area given by a border object. For example, using the demo maas.bank border > data(maas.bank) > maas.border.ok <- Okriging(maas, maas.vsph, step=100, maxdist=1000,border.sw=T,border.poly=maas.bank) plot.kriged(maas, maas.border.ok,outpdf="lab11/img/maas-borderkriged.pdf",border.sw=T,border.poly=maas.bank)

In this case, we get





Chapter 13

Spatial auto-correlation and auto-regression Exercise 13-1

Table 13-1 gives the matrix of distances (in km) between county seats for five counties. Define neighborhood based on distance between county seats less than 30 km. Form a binary neighborhood matrix.

Table 13-1 (distances in km)
      1    2    3    4    5
1     0
2    55    0
3    23   21    0
4    60   13   50    0
5    80   45   10   70    0

Solution: Look at the distances; where a distance is less than 30, write a 1 for that entry, otherwise write a 0.

    | 0  0  1  0  0 |
    | 0  0  1  1  0 |
W = | 1  1  0  0  1 |
    | 0  1  0  0  0 |
    | 0  0  1  0  0 |

Exercise 13-2

Consider the distances of Table 13-1 and the results of the previous exercise. For each pair of neighbors (distance < 30 km) calculate the inverse of the distance and use it to form a weighted neighborhood matrix. Finally, row-standardize this matrix, i.e., divide the values of inverse distance in each row by the number of connections in the row. Solution: Write the inverse of the distance in the entries where the matrix W of the exercise above has non-zero entries


    | 0      0      1/23   0      0    |   | 0     0     0.04  0     0    |
    | 0      0      1/21   1/13   0    |   | 0     0     0.05  0.08  0    |
W = | 1/23   1/21   0      0      1/10 | = | 0.04  0.05  0     0     0.10 |
    | 0      1/13   0      0      0    |   | 0     0.08  0     0     0    |
    | 0      0      1/10   0      0    |   | 0     0     0.10  0     0    |

Now standardize (divide each row by its number of connections):

    | 0        0        0.04     0        0      |   | 0     0     0.04   0     0    |
    | 0        0        0.05/2   0.08/2   0      |   | 0     0     0.025  0.04  0    |
W = | 0.04/3   0.05/3   0        0        0.10/3 | = | 0.01  0.02  0      0     0.03 |
    | 0        0.08     0        0        0      |   | 0     0.08  0      0     0    |
    | 0        0        0.10     0        0      |   | 0     0     0.10   0     0    |
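These matrices can be reproduced quickly in R from the distances of Table 13-1 (a sketch; entries rounded as above):

D <- matrix(c( 0, 55, 23, 60, 80,
              55,  0, 21, 13, 45,
              23, 21,  0, 50, 10,
              60, 13, 50,  0, 70,
              80, 45, 10, 70,  0), ncol=5, byrow=T)   # distances of Table 13-1
W <- ifelse(D > 0 & D < 30, 1, 0)        # binary neighbors: distance less than 30 km
Wd <- ifelse(W == 1, round(1/D, 2), 0)   # inverse-distance weights
Wrs <- Wd / rowSums(W)                   # row-standardize by the number of connections per row
Wrs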

Exercise 13-3 Table 13-2 provides the values of two variables X and Y for the above regions. First, center each variable with respect to the mean (this is to say, subtract the mean from each value). Use the centered values and the W resulting from the previous exercise to calculate Moran’s I for both X and Y.

Table 13-2
Region    X     Y
1        2.6   3.8
2        4.5   4.5
3        2.7   4.3
4        1.2   2.5
5        0.8   2.8

Solution: available upon request. Exercise 13-4 Write the SAR equation (13.14) to predict Y from X of the previous exercise, using the W matrix from the first exercise. Solution: available upon request.
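For Exercise 13-3, the mechanics of the hand calculation can be sketched in R with the usual formula I = (n/S0)(z'Wz)/(z'z), where z is the centered variable, W is the row-standardized matrix Wrs built above, and S0 is the sum of all weights (the book's worked numbers, available upon request, are not reproduced here):

x <- c(2.6, 4.5, 2.7, 1.2, 0.8)     # X from Table 13-2
y <- c(3.8, 4.5, 4.3, 2.5, 2.8)     # Y from Table 13-2
moranI <- function(z, W) {
  z <- z - mean(z)                  # center the variable
  n <- length(z); S0 <- sum(W)      # S0 is the sum of all weights
  (n/S0) * as.numeric(t(z) %*% W %*% z) / sum(z^2)
}
moranI(x, Wrs)                      # Moran's I for X
moranI(y, Wrs)                      # Moran's I for Y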

Exercise 13-5


Use the nc.sids dataset of the spdep package. This dataset is on SIDS (Sudden Infant Death Syndrome) incidence by county in North Carolina (NC). In R go to Help, then Html help, then packages, then look for spdep. Then look for nc.sids. All variables and details are given in the help. In addition, you can find a description and guidance in Kaluzny et al. 1996. You are required to do the following:

• Map the regions (polygons showing the borders) and produce maps according to levels of SIDS rates.
• Calculate the spatial neighborhood structure. Document and give rationale for your neighbor selection criteria. Provide plots.
• Calculate Moran’s I and Geary’s c for variables SID79 and NWBIR79. Determine if these variables are auto-correlated for this set of neighborhoods.
• Use SAR to build a predictor of SID79 rates from NWBIR79 birth rates. Evaluate and discuss the results.

Solution: > data(nc.sids) > nc.sids

CNTY.ID BIR74 SID74 NWBIR74 BIR79 SID79 NWBIR79 east north x y lon lat L.id M.id Alamance 1904 4672 13 1243 5767 11 1397 278 151 104.13 3997.85 -79.39348 36.04472 1 3 Alexander 1950 1333 0 128 1683 2 150 179 142 -59.43 3993.86 -81.19774 35.92893 2 2 Alleghany 1827 487 0 10 542 3 12 183 182 -50.06 4059.70 -81.14061 36.52443 1 2 Anson 2096 1570 15 952 1875 4 1161 240 75 37.19 3876.70 -80.06503 34.92720 3 2

From the spdep help:
CNTY.ID  county ID
BIR74    births, 1974-78
SID74    SID deaths, 1974-78
NWBIR74  non-white births, 1974-78
BIR79    births, 1979-84
SID79    SID deaths, 1979-84
NWBIR79  non-white births, 1979-84
east     eastings, county seat, miles, local projection
north    northings, county seat, miles, local projection
x        easting, county seats approximately reprojected to UTM zone 18
y        northings, county seats approximately reprojected to UTM zone 18
lon      easting, county seats in long-lat
lat      northings, county seats in long-lat
L.id     Cressie and Read (1985) L index
M.id     Cressie and Read (1985) M index

East, north, x,y, lon, lat define the location of county seat in various coordinate systems, SID74 and SID79 variables are discrete counts of SIDS deaths for each county for two four-year periods 1974-1978 and 1979-84. There are 100 counties. The number of births per county is stored in BIR74 and BIR79. The nonwhite birth rates are in NWBIR74 and NWBIR79. Of course, a first step is to look at a map of the regions we are dealing with. In this case, the regions are NC counties. We use functions from maptools to read files in format shape (shp) and convert to polygon object usable by maptools and spdep. Then command plot will plot the map, and we use text to write a label on each county. sidspolys <- readShapePoly(system.file("shapes/sids.shp", package="maptools")) plot(sidspolys,axes=T) text(coordinates(sidspolys), labels=row.names(sidspolys), cex=0.6)



Recall that the variables are BIR74: births, 1974-78 SID74: SID deaths, 1974-78 NWBIR74: non-white births, 1974-78 BIR79: births, 1979-84 SID79: SID deaths, 1979-84 NWBIR79: non-white births, 1979-84

We can plot a map of counties coded according to the value of a variable, say SID79. First, we examine the values of SID79 and cut the values of the variable into intervals > nc.sids$SID79 [1] 11 2 3 4 0 0 4 5 5 6 18 15 20 9 2 4 2 21 3 1 1 0 21 17 18 57 2 1 8 3 7 [32] 22 9 18 0 26 2 1 4 4 38 17 10 8 8 5 6 0 5 5 13 2 6 14 7 5 3 2 1 35 2 8 [63] 5 7 9 3 23 6 1 4 3 0 4 11 0 12 7 26 5 8 8 4 16 7 5 6 2 4 0 9 6 31 2 [94] 0 1 23 7 13 1 1



> Sid79 <- as.ordered(cut(nc.sids$SID79, breaks=c(0, 5, 10, 15, 20, 30, 40, 50, 60), include.lowest=TRUE)) > unclass(Sid79) [1] 3 1 1 1 1 1 1 1 1 2 4 3 4 2 1 1 1 5 1 1 1 1 5 4 4 7 1 1 2 1 2 5 2 4 1 5 1 1 1 1 6 4 2 2 2 1 [47] 2 1 1 1 3 1 2 3 2 1 1 1 1 6 1 2 1 2 2 1 5 2 1 1 1 1 1 3 1 3 2 5 1 2 2 1 4 2 1 2 1 1 1 2 2 6 [93] 1 1 1 5 2 3 1 1 attr(,"levels") [1] "[0,5]" "(5,10]" "(10,15]" "(15,20]" "(20,30]" "(30,40]" "(50,60]"

we see how each county is assigned a code from the numbers 1–7. Then we assign a gray tone to each interval and plot the map of polygons according to the intervals and color codes selected



cols <- grey(seq(0.3,1,0.1)) plot(sidspolys, col=cols[unclass(Sid79)], border = par("fg"),axes=T) legend(c(-84,-82 ), c(32, 34), legend=paste("Sids 79", levels(Sid79)), fill=cols, bty="n")

As explained in the help of spdep, we can also view the map as probabilities of observing these values if rates were to follow a Poisson distribution. First, compute the parameter for the Poisson sids.phat <- sum(nc.sids$SID79) / sum(nc.sids$BIR79)

apply the cumulative of the Poisson (ppois)


pm <- ppois(nc.sids$SID79, sids.phat*nc.sids$BIR79)

as before cut, assign colors and map pm.f <- as.ordered(cut(pm, breaks=c(0.0, 0.01, 0.05, 0.1, 0.9, 0.95, 0.99, 1), include.lowest=TRUE)) cols <- grey(seq(0.3,1,0.1)) plot(sidspolys, col=cols[unclass(pm.f)],border = par("fg"),axes=T) legend(c(-84,-82 ), c(32, 34), legend=paste("prob.", levels(pm.f)), fill=cols, bty="n")



For auto-correlation and auto-regression, we need to build a neighbor structure to be stored as object of class nb. Let us find the weighted neighbor matrix. Neighbors can be defined according to distance between county seats or amount of shared borders. The weights would represent intensity of neighbor relationship (e.g. extent of common boundary, closeness of centroids). Assigning weights is critical since spatial correlation and spatial regression models will eventually depend on the weights. First, let us use the distance. Bind the coordinates of county seats as a 100×2 matrix. This can be done using various coordinate systems: eastings-northings, UTM and long-lat > sids.coords <- cbind(nc.sids$east, nc.sids$north) > sids.utm <- cbind(nc.sids$x, nc.sids$y) > sids.lonlat <- cbind(nc.sids$lon, nc.sids$lat)

Let us use the eastings and northings and then use function dnearneigh to find neighbors (start with county seats < 30 miles of each other). This function will exclude the redundant cases when a region is neighbor with itself. > sids.nb <- dnearneigh(sids.coords, 0, 30, row.names = rownames(nc.sids)) > sids.nb Neighbour list object: Number of regions: 100 Number of nonzero links: 398 Percentage nonzero weights: 3.98 Average number of links: 3.98 2 regions with no links: Dare Hyde >

you can find the link for each county by addressing the id number, for example for counties 1 and 2 > sids.nb[1:2] [[1]] [1] 17 19 32 41 68 [[2]] [1] 14 18 49 97 >

This means, for example, that county 1 (Alamance) has neighbors 17 19 32 41 68. Note that self-neighbors are excluded. However, as you go through sids.nb you realize that counties 28 and 48 have no neighbors at this distance. This is more evident if you plot the links > plot(sids.nb, sids.coords)



where you see some isolated county seats. Increase the neighbor cutoff to 35 miles to see if we include these counties. > sids.nb <- dnearneigh(sids.coords, 0, 35, row.names = rownames(nc.sids)) > plot.nb(sids.nb, sids.coords) >

confirming that indeed we do



The links can be overlaid on the region polygons to visualize the geographical relationships

> plot.polylist(sidspolys, border = "grey") > plot(sids.nb, sids.lonlat, add=T)

Next, we need weights for the neighborhood structure. In this case, a simple calculation of weights is the inverse of the distance; i.e., the closer the regions, the larger the weight (the closer two regions are, the more intense the neighbor effect). We can get distances between each pair > sids.dists <- nbdists(sids.nb, sids.coords) > sids.dists [[1]] [1] 24.18677 27.29469 28.44293 20.02498 16.00000 [[2]] [1] 21.00000 17.11724 18.86796 15.13275 etc > summary(unlist(sids.dists)) Min. 1st Qu. Median Mean 3rd Qu. Max. 2.236 20.020 26.400 25.220 30.610 34.990 >

We can also get an intensity of neighborhood using the inverse of the distance applied to all pairs > inten <- lapply(sids.dists, function(x) 1/x) > inten



[[1]] [1] 0.04134491 0.03663717 0.03515813 0.04993762 0.06250000 [[2]] [1] 0.04761905 0.05842062 0.05299989 0.06608186 etc >

Next we produce a neighbor matrix with these weights. This is a 100x100 matrix. Style “W” is row standardized, i.e., divide by number of neighbors in row. For brevity in this guide, we only show the first 4 rows and 10 columns > sids.nbmat <- nb2mat(sids.nb, glist=inten, style="W") > sids.nbmat[1:4,1:10] [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0 0 0 0 0.0000000 0 0 0 0 0 [2,] 0 0 0 0 0.0000000 0 0 0 0 0 [3,] 0 0 0 0 0.3729811 0 0 0 0 0 [4,] 0 0 0 0 0.0000000 0 0 0 0 0 >

Same result can be put in a list because the function to compute Moran statistics will require a list. Shown here are only the first four rows > sids.nblsw <- nb2listw(sids.nb, glist=inten, style="W") > sids.nblsw Characteristics of weights list object: Neighbour list object: Number of regions: 100 Number of nonzero links: 570 Percentage nonzero weights: 5.7 Average number of links: 5.7 Weights style: W Weights constants summary: n nn S0 S1 S2 W 100 10000 100 44.05019 408.0159 > sids.nblsw$weights[1:4] [[1]] [1] 0.1832845 0.1624147 0.1558581 0.2213764 0.2770662 [[2]] [1] 0.2115261 0.2595072 0.2354280 0.2935387 [[3]] [1] 0.3729811 0.3307017 0.2963172 [[4]] [1] 0.2148541 0.3256509 0.2187491 0.2407459 >



We can build a neighbor object using the polygons instead of centroids. Neighboring relations are then established based on shared borders instead of distance between county seats. > sids.nb.pol <- poly2nb(sidspolys, row.names = rownames(nc.sids))

having similar structure as the sids.nb object we built above using distance between county seats. We can examine the differences > diffnb(sids.nb, sids.nb.pol, verbose=TRUE) Neighbour difference for region id: Alamance in relation to id: Durham Person Neighbour difference for region id: Alexander in relation to id: Burke Davie Lincoln Watauga Yadkin Neighbour difference for region id: Ashe in relation to id: Avery Caldwell Neighbour difference for region id: Avery in relation to id: Ashe Yancey Neighbour difference for region id: Beaufort in relation to id: Bertie Hyde … and so on …

Next, we need weights for neighborhood structure. The weights would represent intensity of neighbor relationship (in this case extent of common boundary). Results are put in a list because later the function to compute Moran’s I statistic will require a list. Shown here only the first four rows > sids.nblsw.pol <- nb2listw(sids.nb.pol, glist=NULL, style="W") > sids.nblsw.pol$weights[1:4] [[1]] [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 [[2]] [1] 0.25 0.25 0.25 0.25 [[3]] [1] 0.3333333 0.3333333 0.3333333 [[4]] [1] 0.25 0.25 0.25 0.25 >

Now, with these results we are ready to apply spatial auto-correlation (e.g., using the Moran’s I statistic) and auto-regression (SAR). Recall that the variables are BIR74: births, 1974-78 SID74: SID deaths, 1974-78 NWBIR74: non-white births, 1974-78 BIR79: births, 1979-84



SID79: SID deaths, 1979-84 NWBIR79: non-white births, 1979-84

Apply the Moran’s I spatial auto-correlation test to the SID79 variable of nc.sids data set with the neighbor structures built in the previous section. The arguments are the variable, the neighborhood structure, and a decision on whether we assume normality or randomization. Let us assume normality. > moran.test(nc.sids$SID79, sids.nblsw, randomisation=F) Moran's I test under normality data: nc.sids$SID79 weights: sids.nblsw Moran I statistic standard deviate = 3.1038, p-value = 0.0009553 alternative hypothesis: greater sample estimates: Moran I statistic Expectation Variance 0.190938256 -0.010101010 0.004195403 >
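The reported standard deviate can be recovered by hand from these estimates:

(0.190938256 - (-0.010101010)) / sqrt(0.004195403)   # = 3.1038, the standard deviate reported above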

Here although the value for the Moran statistic is low and would seem to indicate no spatial pattern, the low variance makes the Z value high and consequently the p-value is low enough to conclude that there is spatial pattern. We have assumed that the mean and variance are the same for all regions. As discussed in Kaluzny et al. (1996) the variance of sids rate increases for counties with low birth rates and therefore the data needs to be transformed using the Freeman-Tukey (FT) transform of the sids rate and multiplied by the square root of births, to achieve constant variance. The FT transform is >ft.SID79 <- sqrt(1000)*(sqrt(nc.sids$SID79/nc.sids$BIR79) + sqrt((nc.sids$SID79+1)/nc.sids$BIR79))

and then multiply by square root of births >tr.SID79 <- ft.SID79*sqrt(nc.sids$BIR79) >names(tr.SID79) <- rownames(nc.sids)

So, modify the Moran test as follows > moran.test(tr.SID79, sids.nblsw, randomisation=F) Moran's I test under normality data: tr.SID79 weights: sids.nblsw Moran I statistic standard deviate = 4.795, p-value = 8.134e-07 alternative hypothesis: greater sample estimates:



Moran I statistic 0.300479754

Expectation -0.010101010

Variance 0.004195403

This produces an increase in Moran’s I, and consequently an improvement in the p-value. Let us try randomization > moran.test(nc.sids$SID79, sids.nblsw, randomisation=T) Moran's I test under randomisation data: nc.sids$SID79 weights: sids.nblsw Moran I statistic standard deviate = 3.2235, p-value = 0.0006332 alternative hypothesis: greater sample estimates: Moran I statistic Expectation Variance 0.190938256 -0.010101010 0.003889618

This also represents an improvement in p-value compared to normality. So far, results indicate that there is spatial auto-correlation of the SIDS rate 1979-1984 in North Carolina counties. We can confirm with Monte Carlo simulations (calculating Moran’s I many times). > moran.mc(nc.sids$SID79, sids.nblsw, nsim=1000) Monte-Carlo simulation of Moran's I data: nc.sids$SID79 weights: sids.nblsw number of simulations + 1: 1001 statistic = 0.1909, observed rank = 995, p-value = 0.005994 alternative hypothesis: greater >

We also have a low p-value. You can repeat with the Geary statistic, by just changing “moran” for "geary". > geary.test(nc.sids$SID79, sids.nblsw, randomisation=T) Geary's C test under randomisation data: nc.sids$SID79 weights: sids.nblsw Geary C statistic standard deviate = -2.3854, p-value = 0.00853 alternative hypothesis: less sample estimates: Geary C statistic Expectation Variance 0.820749626 1.000000000 0.005646557

The results are consistent. We can do the Monte Carlo simulations


> geary.mc(nc.sids$SID79, sids.nblsw, nsim=1000) Monte-Carlo simulation of Geary's C data: nc.sids$SID79 weights: sids.nblsw number of simulations + 1: 1001 statistic = 0.8207, observed rank = 16, p-value = 0.01598 alternative hypothesis: less >

We can plot the spatial auto-correlation as a function of lag distance or spatial correlogram > plot(sp.correlogram(sids.nb, nc.sids$SID79, order = 10, method = "corr", style = "W")) >

In the top panel we can appreciate that values are positively correlated at short lags indicating spatial pattern. Another option for method is “I” and this will do Moran’s I at different lags (bottom panel). plot(sp.correlogram(sids.nb, nc.sids$SID79, order = 10, method = "I", style = "W"))>

Spatial auto-correlation depends on the weights; therefore it is good practice to perform several runs of Moran and Geary with different weights. The spatial auto-regression (SAR) linear model (SLM) object is sarlm. Package spdep provides functions lagsarlm for the spatial lag model and errorsarlm for the spatial error model, to



perform SAR auto-regression. We will use lagsarlm to estimate the coefficients of a predictor and the autocorrelation parameter ρ. First, assume that race is related to SIDS rates, and therefore SID is modeled as a function of nonwhite birth rates NWBIR. > sids.lag <- lagsarlm(SID79 ~ NWBIR79, data=nc.sids, sids.nblsw,tol.solve=1e-9) > sids.lag Call: lagsarlm(formula = SID79 ~ NWBIR79, data = nc.sids, listw = sids.nblsw, tol.solve = 1e-09) Type: lag Coefficients: (Intercept) NWBIR79 rho 1.709804790 0.003983913 0.147110257 Log likelihood: -297.5277

note that the results include estimates of the coefficients (intercept and slope) and the parameter rho for ρ. This model is then used to predict the sids rate by region (county). It was necessary to decrease the tolerance to 10-9. By default, it is 10-7. Using summary, we get more information > summary(sids.lag) Call:lagsarlm(formula = SID79 ~ NWBIR79, data = nc.sids, listw = sids.nblsw) Residuals: Min 1Q Median 3Q Max -15.5188 -2.9322 -1.0657 2.0848 14.1749 Type: lag Coefficients: (asymptotic standard errors) Estimate Std. Error z value Pr(>|z|) (Intercept) 1.70980479 0.84232023 2.0299 0.04237 NWBIR79 0.00398391 0.00024978 15.9497 < 2e-16 Rho: 0.14711 LR test value: 2.9092 p-value: 0.088077 Asymptotic standard error: 0.083514 z-value: 1.7615 p-value: 0.078152 Log likelihood: -297.5277 for lag model ML residual variance (sigma squared): 22.377, (sigma: 4.7304) Number of observations: 100 Number of parameters estimated: 4 AIC: 603.06, (AIC for lm: 603.96) LM test for residual auto-correlation test value: 0.76495 p-value: 0.38178



The coefficient estimates have low p-values. A special result is the Likelihood Ratio (LR) test to check whether the parameter rho is nonzero. In this case, the estimate is still relatively good with a p-value of 0.08; this tells us that ρ is significantly different from zero. The Lagrange Multiplier (LM) test for residual auto-correlation yields a high p-value, and therefore the null hypothesis of no serial correlation in the residuals cannot be rejected. Predicted values and residuals are available from the fitted model object through fitted and resid. This model can be diagnosed much in the same manner that we evaluated linear regression models: scatter plots, qq plots and residual-predicted plots. For example, lim<- max(nc.sids$SID79, fitted(sids.lag)) split.screen(c(2,1)) screen(1) par(mar=c(4,4,1,.5),xaxs="r", yaxs="r") plot(fitted(sids.lag), resid(sids.lag),xlab="Estimated",ylab="Residuals") abline(h=0) split.screen(c(1,2), screen = 2) screen(3) plot(nc.sids$SID79, fitted(sids.lag),xlab="Observed",ylab="Estimated",xlim=c(0,lim),ylim=c(0,lim)) abline(a=0,b=1) screen(4) qqnorm(resid(sids.lag)) qqline(resid(sids.lag))

To obtain



where we see that there are some outliers but it is generally fine. The residuals do not behave like a normal distribution. We know that the Freeman-Tukey (FT) transform helps to stabilize the variance of sids. Let us perform SAR on the FT transform of SID79 and NWBIR79. We already have tr.SID79. Let us calculate the one for NWBIR79 >ft.NWBIR79 <- sqrt(1000)*(sqrt(nc.sids$NWBIR79/nc.sids$BIR79) + sqrt((nc.sids$NWBIR79+1)/nc.sids$BIR79))

and then multiply by square root of births >tr.NWBIR79 <- ft.NWBIR79*sqrt(nc.sids$BIR79) >names(tr.NWBIR79) <- rownames(nc.sids)



> tr.sids <- data.frame(tr.SID79, tr.NWBIR79) > tr.sids tr.SID79 tr.NWBIR79 Alamance 214.42540 2364.3180 Alexander 99.49362 775.8855 Alleghany 118.01781 223.5621 Anson 133.95623 2155.4581 Ashe 31.62278 279.2618

Run SAR on the transformed data > sids.tr.lag <- lagsarlm(tr.SID79 ~ tr.NWBIR79, data=tr.sids, sids.nblsw, tol.solve=1e-12) > summary(sids.tr.lag) Call: lagsarlm(formula = tr.SID79 ~ tr.NWBIR79, data = tr.sids, listw = sids.nblsw, tol.solve = 1.00000000000000e-12) Residuals: Min 1Q Median 3Q Max -141.1435 -36.5722 -6.5955 36.9657 122.7853 Type: lag Coefficients: (asymptotic standard errors) Estimate Std. Error z value Pr(>|z|) (Intercept) 33.593934 13.533753 2.4822 0.01306 tr.NWBIR79 0.051332 0.004350 11.8004 < 2e-16 Rho: 0.20042 LR test value: 4.789 p-value: 0.028642 Asymptotic standard error: 0.086414 z-value: 2.3193 p-value: 0.020380 Wald statistic: 5.379 p-value: 0.020380 Log likelihood: -533.0884 for lag model ML residual variance (sigma squared): 2477.8, (sigma: 49.777) Number of observations: 100 Number of parameters estimated: 4 AIC: 1074.2, (AIC for lm: 1077) LM test for residual autocorrelation test value: 18.618 p-value: 1.5971e-05 >

Note that we have improved the estimation of rho using this transform, although the LM test of residuals now yields a low p-value, so the null hypothesis of no serial correlation in the residuals is rejected. > par(mfrow=c(2,2)) > plot(tr.sids$tr.SID79, fitted(sids.tr.lag)) > plot(fitted(sids.tr.lag), resid(sids.tr.lag)) > qqnorm(resid(sids.tr.lag))



As we can see, we have now achieved a better behavior of the residuals. Exercise 13-6 Use the columbus dataset of the spdep package. This dataset is for 49 neighborhoods in Columbus, Ohio. It has 49 rows and 22 columns. In R go to Help, then Html help, then packages, then look for spdep. Then look for columbus. All variables and details are given in the help. In addition, you can find a description and guidance for package spdep that uses columbus as an example at http://sal.agecon.uiuc.edu/csiss/pdf/spdepintro.pdf You are required to do the following:

• Map the regions (polygons showing the borders) and produce maps according to levels of crime rates.
• Calculate the spatial neighborhood structure. Document and give rationale for your neighbor selection criteria. Provide plots.
• Calculate Moran’s I and Geary’s c for variables Housing value and Income. Determine if these variables are auto-correlated for this set of neighborhoods.
• Use SAR to build a predictor of crime rates from Housing value and Income. Evaluate and discuss the results.

Solution: available upon request



Chapter 14

Multivariate analysis: reducing dimensionality

Exercise 14-1 1 Show that the vector   is an eigenvector associated with the eigenvalue λ=2 of matrix 1  1 1 A = −  Plot it together with its transformation by this matrix.  2 4 Solution:  1 1  1  1 + 1   2  1 = = 2   −2 4  =        1  −2 + 4   2  1

Exercise 14-2 0.6 0.2  Suppose we have a covariance matrix C =   the two eigenvalues, are 0.2 0.5 0.78  −0.61 are v1 = = v2  = λ1 0.76, = λ 2 0.34 The eigenvectors   Calculate total variance  0.61  0.78  186


and discuss its distribution among the eigenvalues. Would one component suffice or would you keep both? Draw a diagram showing the eigenvectors. Write the matrix of loadings. Solution: Total variance is trace of C = 0.6+0.5=1.1 or sum of eigenvalues 0.76+0.34=1.1. The distribution among eigenvalues is .76/1.1 =69% and .34/1.1= 31%. It is better to keep both components since one only represents 69% of the variance.

The matrix of loadings is

A = | 0.78  −0.61 |
    | 0.61   0.78 |

Exercise 14-3 Suppose we have a covariance matrix

0.6 0.2 0.1 C = 0.2 0.5 0.2   0.1 0.2 0.7  the eigenvalues, are

λ1 0.94 λ 2 0.55 λ 3 0.31 = = = The eigenvectors are



 −0.51  v = v1 =  −0.54  2  −0.67 

 0.72  =  v 0.15  3  −0.68

 0.47   −0.83   0.31 

Calculate the total variance and discuss its distribution among the eigenvalues. Would you keep all three components, would two suffice, or would one component suffice? Draw a diagram showing the first two eigenvectors. Write the matrix of loadings. Solution: The total variance is the trace of C = 0.6+0.5+0.7 = 1.8, or the sum of eigenvalues 0.94+0.55+0.31 = 1.8. The distribution among eigenvalues is 0.94/1.8 = 52% and 0.55/1.8 = 31%. With two components we have 83% of the variance and this would suffice. The matrix of loadings is

A = | −0.51   0.72   0.47 |
    | −0.54   0.15  −0.83 |
    | −0.67  −0.68   0.31 |

Exercise 14-4 Complete the calculations of Q-mode CA to arrive at the scores ZQ shown in equation 14.53. Solution: available upon request. Exercise 14-5 Use the eigen function to find the eigenvalues and eigenvectors of the following covariance matrix C

 2 1 4 C = 1 2 3 4 3 2 Show that the trace is equal to the sum of eigenvalues. Draw a sketch of the eigenvectors on a plane given by the first and second coordinates. Then draw a sketch of eigenvectors on a plane given by the second and third coordinates. Then draw a sketch of eigenvectors on a plane given by the first and third coordinates. Solution: available by request 188


Exercise 14-6 Use data in file lab14/watsheds.txt. It has data for 28 watersheds in Canada. Data from Griffith and Armhein (1991, p 455). Variables are ID watershed label, AREA (Km2), Flow or discharge (m3/s), Flow/Area (flow in 2 months/area), Latitude, Longitude, Snowfall, Precipitation, Temp (°C). Perform PCA on watsheds.txt. Use correlation matrix if necessary. Select required components to account for more than 90% of the variance. Provide biplots for pairwise combinations of those selected components. Interpret and discuss. Solution: # exercise Xw <- matrix(as.numeric(scan("lab14/watsheds.txt",skip=1, what=c("",rep(0,8)))),ncol=9,byrow=T) X <- Xw[,-1] princomp(X, cor=T) > summary(princomp(X, cor=T)) Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Standard deviation 2.0997229 1.3763713 0.88477721 0.64887882 0.49552938 Proportion of Variance 0.5511045 0.2367997 0.09785384 0.05263047 0.03069367 Cumulative Proportion 0.5511045 0.7879043 0.88575811 0.93838858 0.96908225 Comp.6 Comp.7 Comp.8 Standard deviation 0.37684156 0.242343982 0.215874651 Proportion of Variance 0.01775119 0.007341326 0.005825233 Cumulative Proportion 0.98683344 0.994174767 1.000000000 > # it needs four components

Then to plot biplot(princomp(X, cor=T),choices=c(1,2)) biplot(princomp(X, cor=T),choices=c(1,3)) biplot(princomp(X, cor=T),choices=c(1,4)) biplot(princomp(X, cor=T),choices=c(2,3)) biplot(princomp(X, cor=T),choices=c(2,4)) biplot(princomp(X, cor=T),choices=c(3,4))

Shown below are results for the first two sets, 1,2 and 1,3



[Figure: biplots of the principal components for the watershed data, Comp.1 vs Comp.2 (top) and Comp.1 vs Comp.3 (bottom), with observation numbers 1-28 and arrows for variables Var 1 through Var 8]

Exercise 14-7 Perform Factor Analysis on the dataset watsheds.txt of the previous exercise. Start with two factors and discuss uniquenesses, communalities and proportion of variance explained. Increase the number of factors to three and repeat. Increase the number of factors to four and repeat. Interpret and discuss. How many factors would you select? Solution:



factanal(X,factors=2) > factanal(X,factors=2) Call: factanal(x = X, factors = 2) Uniquenesses: [1] 0.539 0.494 0.506 0.076 0.467 0.146 0.170 0.042 Loadings: Factor1 Factor2 [1,] 0.679 [2,] 0.622 0.345 [3,] -0.273 0.647 [4,] 0.958 [5,] 0.704 -0.195 [6,] 0.344 0.858 [7,] -0.710 0.571 [8,] -0.978 Factor1 Factor2 SS loadings 3.914 1.646 Proportion Var 0.489 0.206 Cumulative Var 0.489 0.695 Test of the hypothesis that 2 factors are sufficient. The chi square statistic is 44.43 on 13 degrees of freedom. The p-value is 2.6e-05 >

We can reject the H0. Now increase to three factors > factanal(X,factors=3) Call: factanal(x = X, factors = 3) Uniquenesses: [1] 0.005 0.131 0.165 0.079 0.417 0.314 0.225 0.027 Loadings: Factor1 Factor2 Factor3 [1,] 0.388 0.898 -0.196 [2,] 0.300 0.847 0.250 [3,] -0.289 -0.215 0.840 [4,] 0.907 0.312 [5,] 0.634 0.346 -0.247 [6,] 0.125 0.396 0.717 [7,] -0.769 -0.107 0.416 [8,] -0.916 -0.343 -0.129



Factor1 Factor2 Factor3 SS loadings 2.995 2.072 1.571 Proportion Var 0.374 0.259 0.196 Cumulative Var 0.374 0.633 0.830 Test of the hypothesis that 3 factors are sufficient. The chi square statistic is 8.67 on 7 degrees of freedom. The p-value is 0.277 >

Now we cannot reject the H0 that 3 factors are sufficient. We can select three factors. Exercise 14-8 Perform CA on dataset varespec of package vegan. Select the required components. Provide biplots for pairwise combinations of those selected components. Interpret and discuss. Solution: There are functions to perform correspondence analysis in several packages. We will use function cca in package vegan. This package is composed of various functions and data sets applied to community ecology. Install it from the CRAN web page and load the package. The function cca can take several arguments, given as a dataframe. We will use the varespec demonstration dataset in vegan. Because of its application to ecology, cca assumes that the first argument corresponds to species variables at each site. A second argument allows for constraints or environmental variables (we will discuss this aspect in the next chapter). The sites would be observations and the species are variables. First load dataset varespec, which consists of 24 sites and cover for 44 species. data(varespec)

a segment is > varespec Cal.vul Emp.nig Led.pal Vac.myr Vac.vit Pin.syl Des.fle Bet.pub Vac.uli 18 0.55 11.13 0.00 0.00 17.80 0.07 0.00 0.00 1.60 15 0.67 0.17 0.00 0.35 12.13 0.12 0.00 0.00 0.00 24 0.10 1.55 0.00 0.00 13.47 0.25 0.00 0.00 0.00 27 0.00 15.13 2.42 5.92 15.97 0.00 3.70 0.00 1.12 23 0.00 12.68 0.00 0.00 23.73 0.03 0.00 0.00 0.00

Apply function cca and examine the results using summary. The results include lambda (eigenvalues or inertia), site scores (scores of rows or observations) and species scores (scores of columns or variables)


> spp.ca <- cca(varespec) > summary(spp.ca) Call: cca(X = varespec) Partitioning of mean squared contingency coefficient: Total 2.083 Unconstrained 2.083 Eigenvalues, and their contribution to the mean squared contingency coefficient CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8 CA9 lambda 0.5249 0.3568 0.2344 0.1955 0.1776 0.1216 0.1155 0.08894 0.07318 accounted 0.2520 0.4233 0.5358 0.6296 0.7149 0.7732 0.8287 0.87137 0.90650 CA10 CA11 CA12 CA13 CA14 CA15 CA16 CA17 lambda 0.05752 0.04434 0.02546 0.01710 0.01490 0.01016 0.00783 0.006032 accounted 0.93411 0.95539 0.96762 0.97583 0.98298 0.98786 0.99161 0.994510 CA18 CA19 CA20 CA21 CA22 CA23 lambda 0.004008 0.002865 0.001928 0.001807 0.0005864 0.0002434 accounted 0.996434 0.997809 0.998734 0.999602 0.9998832 1.0000000 Scaling 2 for species and site scores -- Species are scaled proportional to eigenvalues -- Sites are unscaled: weighted dispersion equal on all dimensions Species scores CA1 CA2 CA3 CA4 CA5 CA6 Cal.vul 0.0219651 -0.954204 0.055461 -1.2797244 0.0579945 0.798814 Emp.nig 0.0544138 0.226569 0.190301 0.0104668 0.3611286 -0.139798 Led.pal 0.8007640 0.895570 1.473304 0.0531002 1.3623062 -0.098847 Vac.myr 1.0588751 0.969421 1.318804 0.1024323 0.1940588 0.248430 Vac.vit 0.1063608 0.187223 0.071157 0.1076563 0.2894671 -0.051537 Pin.syl -0.3492265 0.351537 -0.174403 -0.0561897 0.1713089 0.134792 Des.fle 1.1120135 0.728024 0.908150 -0.0005922 -0.5536573 -0.024661 Bet.pub 0.4850314 1.165876 1.859362 0.6142811 3.2012930 -0.085239 Vac.uli -0.0601925 -0.973199 0.508703 0.3548606 -0.1289134 -0.566683 Dip.mon -0.3946082 -0.619767 0.253144 0.4157072 0.1551830 -0.377569 Dic.sp 1.3128632 0.215210 -2.382601 1.3654912 0.5844397 0.055023 Dic.fus 0.9204864 -0.336282 -0.192310 -1.2951865 0.1621852 -0.839809 Dic.pol 0.5251420 0.841839 0.389001 0.8468878 1.9249346 0.451664 Hyl.spl 1.4535654 1.041666 1.101762 0.4106708 -1.5866910 0.786165 Ple.sch 0.9492748 0.348263 -0.004863 0.0607013 -0.4727038 0.069878 Pol.pil -0.2756876 -0.743016 0.264334 0.6530831 -0.3066610 -0.135153 Pol.jun 0.7342157 0.059161 -1.087953 0.6678730 0.3257684 -1.067713 Pol.com 0.6135426 0.789529 0.549966 0.5043274 1.1310269 -0.210949 Poh.nut -0.0098863 0.351998 -0.172088 0.0598978 0.3948858 0.138502 Pti.cil 0.3060113 0.954875 1.663059 0.6189861 2.6639607 0.069360



Bar.lyc 0.3635903 1.265933 2.214206 0.7485792 3.4180314 0.225229 Cla.arb -0.1109771 -0.886362 0.096956 0.0856283 0.0309403 0.125141 Cla.ran -0.3986726 -0.647506 0.196349 0.3201157 -0.1415224 -0.117122 Cla.ste -1.0413659 0.643770 -0.214966 -0.1662016 -0.1020550 0.001469 Cla.unc 0.5906105 -0.601021 -0.884064 -0.6143250 0.6762879 1.281624 Cla.coc -0.1545561 -0.349095 -0.103782 -0.2510695 -0.0001597 -0.050661 Cla.cor 0.1906382 -0.106239 -0.215287 0.1204409 0.1682545 -0.106946 Cla.gra 0.1417851 -0.185868 -0.115694 0.1675663 0.2079021 0.013103 Cla.fim 0.0006675 -0.096419 0.089396 -0.1927202 0.2036121 -0.050120 Cla.cri 0.2443835 -0.280964 -0.024662 -0.3642363 0.3026955 0.221793 Cla.chl -0.4492051 0.721138 0.105982 0.1885373 0.8220828 0.042090 Cla.bot 0.4091550 0.625599 1.284697 0.4013205 1.8942808 0.418963 Cla.ama -0.4780500 -0.903682 0.403090 0.6975202 -0.0171626 -0.494790 Cla.sp -0.5947602 0.284425 -0.240895 -0.4413346 -0.1008616 0.136249 Cet.eri 0.1781014 -0.411694 -0.815501 -0.0582176 0.3135336 0.827893 Cet.isl -0.2510628 0.813972 0.415902 0.1754087 1.1601226 0.138273 Cet.niv -1.0427252 -0.497924 -0.062553 0.0031262 -0.6254890 1.030965 Nep.arc 1.2181611 0.119157 -2.098997 0.9858735 0.4029573 -1.908139 Ste.sp -0.3747802 -1.362441 0.483100 1.0509067 -0.3747848 -0.502492 Pel.aph 0.2924068 -0.025844 0.021972 0.3143596 0.0769010 -0.293298 Ich.eri 0.0274415 -1.445286 0.349268 0.1597361 -0.1574714 -0.729434 Cla.cer -0.6689408 -0.003126 -0.590988 0.1349723 -0.3431706 0.144390 Cla.def 0.3760557 -0.296651 -0.073940 -0.3076779 0.3810806 0.316994 Cla.phy -0.9300080 0.690432 -0.386922 -0.3276803 -0.0677692 0.171024 Site scores (weighted averages of species scores) CA1 CA2 CA3 CA4 CA5 CA6 18 -0.149232 -0.89910 0.474143 0.55218 0.333521 -0.40879 15 0.962177 -0.24177 -0.065652 -0.49180 -0.649787 -0.23363 24 1.363110 0.25182 -2.784969 1.82017 0.734410 1.34152 27 1.175623 0.83541 0.916089 0.27768 -1.142455 0.21767 23 0.496714 -0.09389 0.301149 0.34124 0.570712 -0.50968 19 0.004893 0.61971 0.057333 0.13950 -0.274987 -0.09789 22 1.188001 -0.19259 0.228980 -2.31092 0.183260 -1.73611 16 0.879113 -0.55664 0.002314 -1.80490 0.031028 -1.55021 28 1.765788 1.36537 1.383872 0.52333 -2.118872 1.19840 13 -0.269156 -1.34875 0.243060 -1.31835 -0.001711 1.52215 14 0.729491 -1.17893 -1.069360 -1.96412 1.355664 2.62928 20 0.528439 -0.31283 -0.009892 0.22972 0.108702 0.60480 25 1.367405 0.20034 -2.292582 1.01623 0.447480 -2.02763 7 -0.365350 -1.77480 0.691674 1.03984 -0.077502 -0.57207 5 -0.591300 -2.02176 0.773139 1.69572 -0.616039 -0.87384 6 -0.456210 -1.29742 0.270061 0.50646 0.150121 0.25985 3 -1.241648 0.20216 -0.156431 0.08690 -0.353443 -0.34506 4 -1.063176 -0.59183 -0.069763 -0.04061 -0.713689 1.27068 2 -1.369446 0.84240 -0.325298 -0.15990 -0.252740 -0.28474 9 -1.293505 1.31734 -0.533097 -0.43829 -0.081513 -0.03610 12 -1.002610 0.82306 -0.324741 -0.19673 -0.025098 -0.08034 10 -1.383095 1.19263 -0.488500 -0.42168 -0.054969 -0.14134



11 -0.445323 -0.06481 -0.002680 0.20422 -0.351513 0.19887 21 0.358204 1.35180 2.321948 0.81614 3.663041 0.20222
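As a quick check on the summary above, the cumulative "accounted" proportions can be reproduced directly from the eigenvalues. A minimal sketch, assuming (as in current vegan versions) that the unconstrained eigenvalues are stored in spp.ca$CA$eig:

ev <- spp.ca$CA$eig              # eigenvalues (lambda) for CA1, CA2, ...
round(ev/sum(ev), 4)             # proportion of total inertia per axis
round(cumsum(ev)/sum(ev), 4)     # cumulative proportion, the "accounted" row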

And we can plot >plot(spp.ca)

To obtain

[Figure: CA biplot of the varespec sites and species on axes CA1 and CA2]

To interpret this diagram, look for areas where specific sites and species are close together, indicating relationships. Also see how sites are sorted along gradients, one given by CA1 (the horizontal axis) and the other by CA2 (the vertical axis). In addition, we can also access the scores
> scores.cca(spp.ca)
$species



CA1 CA2 Cal.vul 0.0219651107 -0.95420379 Emp.nig 0.0544138026 0.22656884 Led.pal 0.8007640006 0.89556957 Vac.myr 1.0588751484 0.96942054 Vac.vit 0.1063608488 0.18722322 Pin.syl -0.3492265345 0.35153663 Des.fle 1.1120135175 0.72802390 Bet.pub 0.4850313781 1.16587596 Vac.uli -0.0601924678 -0.97319855 Dip.mon -0.3946082445 -0.61976716 Dic.sp 1.3128632358 0.21521039 Dic.fus 0.9204864455 -0.33628152 …. etc, etc … $sites

CA1 CA2 18 -0.149231732 -0.89909538 15 0.962176641 -0.24176673 24 1.363110128 0.25182197 27 1.175623286 0.83540787 23 0.496714476 -0.09389079 19 0.004893311 0.61971266 22 1.188000585 -0.19258682 16 0.879112521 -0.55664061 ……etc etc $<NA> NULL >

With these scores we can also use the biplot function to obtain an arrow for each species >biplot(scores(spp.ca)$sites, scores(spp.ca)$species)



A commonly used convention is not to mark species locations with arrows, and instead to reserve arrows for the constraint variables.

Exercise 14-9 Perform PCA, FA, and CA on the dataset in file lab14/blocks.txt. The data set is block geometry and dimensions of 25 blocks labeled a-y (Davis, 2002, Fig 6-16, pages 507-508). The variable definitions are:
x1 = long axis
x2 = intermediate axis
x3 = short axis
x4 = longest diagonal



x5 = ratio of radii = (radius of smallest circumscribed circle) over (radius of largest inscribed circle)
x6 = ratio of axes = (long + intermediate) over (short)
x7 = ratio of (surface area) over (volume)

Solution: First we read the file using the what = … argument of scan, because we have character and numeric fields mixed in each record
> blocks <- matrix(scan("lab14/blocks.txt", what=c("", 0,0,0,0,0,0,0)), ncol=8, byrow=T)
Read 200 items

Use the first column for character id for each block > id <- blocks[,1] > id [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" [20] "t" "u" "v" "w" "x" "y"

then the remainder is made numeric and put in matrix form > X <- matrix(as.numeric(blocks[,2:8]), ncol=7, byrow=F) >X [,1] [,2] [,3] [,4] [,5] [,6] [,7] [1,] 3.76 3.66 0.54 5.275 9.768 13.741 4.782 [2,] 8.59 4.99 1.34 10.022 7.500 10.162 2.130 [3,] 6.22 6.14 4.52 9.842 2.175 2.732 1.089 [4,] 7.57 7.28 7.07 12.662 1.791 2.101 0.822 [5,] 9.03 7.08 2.59 11.762 4.539 6.217 1.276 …. >

Next we assemble a dataframe with the matrix such that rows have names equal to the id of the block > Xd <- data.frame(X) > row.names(Xd) <- id > Xd X1 X2 X3 X4 X5 X6 X7 a 3.76 3.66 0.54 5.275 9.768 13.741 4.782 b 8.59 4.99 1.34 10.022 7.500 10.162 2.130 c 6.22 6.14 4.52 9.842 2.175 2.732 1.089 d 7.57 7.28 7.07 12.662 1.791 2.101 0.822 e 9.03 7.08 2.59 11.762 4.539 6.217 1.276 f 5.51 3.98 1.30 6.924 5.326 7.304 2.403 g 3.27 0.62 0.44 3.357 7.629 8.838 8.389



h 8.74 7.00 3.31 11.675 3.529 4.757 1.119 …
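As an aside, the same data frame can usually be built in one step with read.table; a minimal alternative sketch, assuming blocks.txt is whitespace-separated, has no header row, and carries the block id in its first column (Xd2 is just an illustrative name):

# hypothetical one-step read; column names X1..X7 match those used above
Xd2 <- read.table("lab14/blocks.txt", header=FALSE, row.names=1,
                  col.names=c("id", paste0("X", 1:7)))
head(Xd2)   # should agree with the Xd assembled above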

Now we are ready to do PCA > block.pca <- princomp(Xd)

Check the summary > summary(block.pca) Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Standard deviation 5.7542517 4.2706971 1.5611621 0.87960173 0.571943081 Proportion of Variance 0.6028718 0.3320816 0.0443755 0.01408703 0.005955976 Cumulative Proportion 0.6028718 0.9349534 0.9793289 0.99341598 0.999371952 Comp.6 Comp.7 Standard deviation 0.1790721294 4.926778e-02 Proportion of Variance 0.0005838527 4.419501e-05 Cumulative Proportion 0.9999558050 1.000000e+00

Compare these results to Davis, 2002, Table 6-14, page 518. We can see that the first two components explain about 93% of the variance. Now we do the plots
par(mfrow=c(2,1))
barplot(loadings(block.pca), beside=T)
plot(block.pca)
win.graph(); biplot(block.pca, choices=c(1,2))
win.graph(); biplot(block.pca, choices=c(1,3))
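As a side note, win.graph() opens a new graphics window on Windows only; a portable sketch of the same idea would use dev.new() instead:

dev.new(); biplot(block.pca, choices=c(1,2))   # same biplots, any platform
dev.new(); biplot(block.pca, choices=c(1,3))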



See explanations and discussion in Davis, 2002. Recall that when units are disparate, it is important to perform the PCA using the correlation matrix instead of the covariance matrix. Equivalently, we could have standardized the variables before applying PCA. Command princomp has argument cor=T. > block.pca <- princomp(Xd,cor=T)
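As a quick check of this equivalence (a sketch, not part of the original solution), PCA of the standardized variables reproduces the correlation-based result; the standard deviations differ only by a constant factor because princomp divides by n rather than n - 1 when it computes the covariance of the scaled data:

Z <- scale(Xd)                         # center and scale each variable
pca.z <- princomp(Z)
round(pca.z$sdev / block.pca$sdev, 3)  # all ratios equal (about 0.98 for n = 25)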

Let’s repeat summary and plot commands > summary(block.pca) Importance of components: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Standard deviation 1.8424555 1.6749560 0.66129479 0.52720701 0.28452161 Proportion of Variance 0.4849489 0.4007825 0.06247297 0.03970675 0.01156465 Cumulative Proportion 0.4849489 0.8857314 0.94820439 0.98791113 0.99947578 Comp.6 Comp.7



Standard deviation 0.058370035 1.620085e-02 Proportion of Variance 0.000486723 3.749534e-05 Cumulative Proportion 0.999962505 1.000000e+00

We can see that it now takes three components to accumulate about 95% of the variance. More than one biplot can be drawn to include different pairwise combinations of components, for example Comp.1/Comp.2 and Comp.1/Comp.3.
par(mfrow=c(2,1))
barplot(loadings(block.pca), beside=T)
plot(block.pca)
win.graph(); biplot(block.pca, choices=c(1,2))
win.graph(); biplot(block.pca, choices=c(1,3))

These commands would yield the following new plots.



Alternatively, we can use the Rcmdr. First make the dataset active: go to Data | Active Dataset | Select Active dataset and pick Xd from the list. Confirm that Xd shows in the Active dataset box and, to reconfirm, use the View dataset button.



Then go to Statistics|Dimensional Analysis|Principal Component Analysis

and select all variables in the dialog box



View the results in the output window: > .PC <- princomp(~X1+X2+X3+X4+X5+X6+X7, cor=TRUE, data=Xd) > unclass(loadings(.PC)) # component loadings Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 X1 -0.40529248 0.2928989 0.66735703 -0.088837508 0.2266600 0.40980909 X2 -0.43158146 0.2224438 -0.69798887 0.033778589 0.4365697 0.14430091 X3 -0.38544111 -0.3558811 -0.14769067 -0.627554724 -0.5121490 0.18752524 X4 -0.49388987 0.2322712 0.11864031 -0.210290471 0.1054288 -0.58780883 X5 0.12770922 0.5751049 -0.02944297 -0.110846806 -0.3889916 -0.42320543 X6 0.09680307 0.5800041 -0.17430323 0.006148027 -0.3549447 0.50034608 X7 0.48093962 0.1302972 -0.01759442 -0.735251711 0.4553242 0.03316309 Comp.7 X1 0.27815916 X2 0.25398999 X3 0.10809257 X4 -0.53590190 X5 0.55621352 X6 -0.49746666 X7 -0.04894027 > .PC$sd^2 # component variances Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 3.3946422458 2.8054776509 0.4373108001 0.2779472286 0.0809525463 0.0034070609 Comp.7 0.0002624674 > screeplot(.PC) > remove(.PC)



The screeplot(.PC) call at the end produces a screeplot.

Note that the object .PC is removed at the end. If we re-execute the commands shown in the script window except this remove, then we can use biplot(.PC) from the console to obtain a biplot. Now let’s perform factor analysis. Let’s try two factors as in Davis, 2002 page 532. We will use function factanal which is loaded in the base installation. This function uses the maximum likelihood method for factor analysis. > blocks.fa <- factanal(Xd, factors =2) > blocks.fa Call: factanal(x = Xd, factors = 2) Uniquenesses: X1 X2 X3 X4 X5 X6 X7 0.157 0.304 0.188 0.005 0.005 0.014 0.255 Loadings: Factor1 Factor2 X1 0.909 0.128 X2 0.833 X3 0.435 -0.789 X4 0.998



X5 0.165 0.984 X6 0.205 0.972 X7 -0.676 0.537 Factor1 Factor2 SS loadings 3.231 2.842 Proportion Var 0.462 0.406 Cumulative Var 0.462 0.868 Test of the hypothesis that 2 factors are sufficient. The chi square statistic is 120.56 on 8 degrees of freedom. The p-value is 2.54e-22 >
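As an aside, the "SS loadings" and "Proportion Var" rows printed above can be reproduced from the loadings matrix; a minimal sketch (7 variables here):

L <- loadings(blocks.fa)
ss <- colSums(L^2)        # sum of squared loadings for each factor
round(ss, 3)              # SS loadings: 3.231 2.842
round(ss/nrow(L), 3)      # Proportion Var: divide by number of variables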

We account for 86.8% of the variance with these two factors. We can calculate the communalities simply as 1- uniquenesses > 1- blocks.fa$uniquenesses X1 X2 X3 X4 X5 X6 X7 0.8431742 0.6957610 0.8121781 0.9950000 0.9950000 0.9863439 0.7451521 >

Their sum, 6.07, should equal the sum of the first and second eigenvalues, that is, the correlation explained. Therefore we confirm that 6.07/7 = 0.867 is the fraction explained.
> sum(1- blocks.fa$uniquenesses)
[1] 6.072609
> 6.07/7
[1] 0.8671429

We can check that the correlation matrix was used (Compare to Davis, 2002 p 532) > blocks.fa$corre X1 X2 X3 X4 X5 X6 X1 1.0000000 0.5802601 0.2011294 0.9112586 0.2833272 0.2865464 X2 0.5802601 1.0000000 0.3637928 0.8337454 0.1658262 0.2610654 X3 0.2011294 0.3637928 1.0000000 0.4385752 -0.7041843 -0.6805394 X4 0.9112586 0.8337454 0.4385752 1.0000000 0.1630426 0.2022868 X5 0.2833272 0.1658262 -0.7041843 0.1630426 1.0000000 0.9902088 X6 0.2865464 0.2610654 -0.6805394 0.2022868 0.9902088 1.0000000 X7 -0.5332023 -0.6087219 -0.6488426 -0.6755388 0.4272139 0.3571250 X7 X1 -0.5332023 X2 -0.6087219 X3 -0.6488426 X4 -0.6755388 X5 0.4272139



X6 0.3571250 X7 1.0000000 >

To produce scores, the estimation method should be specified with the scores argument, either "Bartlett" or "regression". Both are estimates of the factor scores.
> blocks.fa <- factanal(Xd, factors =2, scores="Bart")
> blocks.fa$scores
Factor1 Factor2
a -1.26915617 1.58963096
b 0.30177374 0.59412589
c 0.23257052 -1.06378045
d 1.16187207 -1.36566138
e 0.86788364 -0.41174066
and so on to object y
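As a quick follow-up sketch (not part of the original solution), the estimated scores can be plotted with each block labeled by its id:

plot(blocks.fa$scores, type="n", xlab="Factor 1", ylab="Factor 2")
text(blocks.fa$scores, labels=rownames(Xd))   # block ids a, b, ..., y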

To perform a rotation, use the argument rotation = "name of rotation function"; the default is rotation = "varimax". We can do the same using the Rcmdr. As in PCA, first make the dataset active. Let's work with the blocks data. As shown in that section, load R Commander, go to Data | Active Dataset | Select Active dataset and pick Xd from the list. Confirm that Xd shows in the Active dataset box and, to reconfirm, use the View dataset button.



Then go to Statistics|Dimensional Analysis|Factor Analysis to obtain the following dialog box

Here we selected all variables X1, ..., X7, the varimax rotation, and Bartlett's method.


Then, in the next pop-up window, move the slider to select number of factors = 2.

Look at the results in the Output window:
> .FA <- factanal(~X1+X2+X3+X4+X5+X6+X7, factors=2, rotation="varimax", scores="Bartlett", data=Xd)
> .FA
Call:
factanal(x = ~X1 + X2 + X3 + X4 + X5 + X6 + X7, factors = 2, data = Xd, scores = "Bartlett", rotation = "varimax")
Uniquenesses:
X1 X2 X3 X4 X5 X6 X7
0.157 0.304 0.188 0.005 0.005 0.014 0.255
Loadings:
Factor1 Factor2
X1 0.909 0.128
X2 0.833
X3 0.435 -0.789
X4 0.998
X5 0.165 0.984
X6 0.205 0.972
X7 -0.676 0.537

Factor1 Factor2
SS loadings 3.231 2.842
Proportion Var 0.462 0.406
Cumulative Var 0.462 0.868

Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 120.56 on 8 degrees of freedom.
The p-value is 2.54e-22


> Xd$F1 <- .FA$scores[,1]
> Xd$F2 <- .FA$scores[,2]
> remove(.FA)
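As a variation on the rotation argument mentioned earlier, an oblique rotation can be requested in the same way; a minimal sketch from the console, excluding the score columns F1 and F2 just appended to Xd (blocks.fa.pro is just an illustrative name):

blocks.fa.pro <- factanal(Xd[, 1:7], factors = 2, rotation = "promax")
blocks.fa.pro$loadings   # compare with the varimax loadings above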



Chapter 15 Multivariate analysis II: identifying and developing relationships among observations and variables

Exercise 15-1 Consider the example leading to equation 15.2. Calculate the singular values and the ratio of eigenvalues. Calculate the scores of the centroids from the group means using equation 15.3 applied to the group means instead of the individual xij. Draw a sketch showing the location of the centroids in discriminant space. Solution: available upon request.

Exercise 15-2 Consider the example of the previous exercise. Suppose a new object measures X1, X2, X3 = 60, 90, 20. Apply the discriminant functions to decide to what group it most likely belongs. Solution: available upon request.

Exercise 15-3 Consider 4 environmental variables X (abiotic), 2 response variables Y (biotic), and 10 observations of all variables. What are the dimensions of matrices A and B of CANCOR? Assume the largest eigenvalue is 0.7. Calculate chi-square and degrees of freedom to test the H0 that the largest eigenvalue is zero.
Solution: A is 4x1 and B is 2x1. The chi-square is
χ2 = -[n - 1 - (p + q + 1)/2] ln(1 - λ1) = -[10 - 1 - (4 + 2 + 1)/2] ln(1 - 0.7) = -5.5 ln(0.3) = 6.62

The degrees of freedom are df = p + q + 1 + 0.5[(p - 1)(q - 1)]^(2/3) = 7 + 0.5 × 3^(2/3) = 8.04, so rounding we get df = 8.
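As a quick check, the same arithmetic can be done in R; a minimal sketch using the formulas as written above:

n <- 10; p <- 4; q <- 2; lambda1 <- 0.7
chi2 <- -(n - 1 - (p + q + 1)/2) * log(1 - lambda1)
round(chi2, 2)    # 6.62
df <- p + q + 1 + 0.5 * ((p - 1) * (q - 1))^(2/3)
round(df, 2)      # 8.04, rounded to df = 8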


Exercise 15-4 Consider Fig 15-4: how many clusters are generated at a height of 30? Compare how members of these clusters relate to each other on the MDS diagram of Fig 15-5.
Solution: At 30 we intersect 5 clusters. The elements in these clusters are near each other in Fig 15-5.

Exercise 15-5 Data set of macroinvertebrate diversity in streams (McCuen, 1985). Data reproduced in Carr (1995). The data set is in file lab15/streams-macroinv.txt. How well do stream physical variables explain biotic variables? Perform canonical correlation analysis and CCA. Interpret and discuss.
Solution:
# macroinvertebrate data
sm <- matrix(scan("lab15/stream-macroinv.txt", skip=6, what=c("",rep(0,8))), ncol=9, byrow=T)
Xm <- matrix(as.numeric(sm[,2:9]), ncol=8, byrow=F)
# stream and bio
X <- Xm[,1:5]; Y <- Xm[,6:8]
id <- sm[,1]
Xd <- data.frame(id,X,Y)
names(Xd) <- c("id", "X1","X2","X3", "X4","X5","Y1","Y2","Y3")
Xmcc <- cancor(X,Y)
Ycan <- Y%*%Xmcc$ycoef
Xcan <- X%*%Xmcc$xcoef
par(mfrow=c(2,2))
plot(Xcan[,1], Ycan[,1])
plot(Xcan[,2], Ycan[,2])
plot(Xcan[,3], Ycan[,3])



[Figure: scatterplots of the canonical variate pairs: Ycan[,1] vs Xcan[,1], Ycan[,2] vs Xcan[,2], and Ycan[,3] vs Xcan[,3]]
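To help answer how well the stream physical variables explain the biotic variables, the canonical correlations themselves can be inspected; a minimal sketch using the cancor result obtained above:

Xmcc$cor               # canonical correlations, one per pair of canonical variates
round(Xmcc$cor^2, 3)   # squared correlations: shared variance along each axis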

Exercise 15-6 Apply cca of vegan to the demo datasets varespec and varechem to study the relationship of species abundance to the environmental variates Al, P, K, and baresoil. Study the capabilities of cca to construct a formula based on Al, P, K, and baresoil.
Solution: Load package vegan. We will use the cca function from package vegan. To make it constrained we write
> cca(X,Y)

where X is the matrix of species response and Y the environmental (constraint) matrix.


For X and Y we will use, respectively, the varespec and varechem demonstration datasets in vegan. The sites are the observations and the species are the variables. Load datasets varespec and varechem: varespec has cover values for 44 species at 24 sites, and varechem has 14 environmental variables giving the soil characteristics of the very same sites as in varespec. We already described varespec in Chapter 14. Now for varechem
data(varechem)
> varechem
N P K Ca Mg S Al Fe Mn Zn Mo Baresoil Humdepth pH
18 19.8 42.1 139.9 519.4 90.0 32.3 39.0 40.9 58.1 4.5 0.30 43.90 2.2 2.7
15 13.4 39.1 167.3 356.7 70.7 35.2 88.1 39.0 52.4 5.4 0.30 23.60 2.2 2.8
24 20.2 67.7 207.1 973.3 209.1 58.1 138.0 35.4 32.1 16.8 0.80 21.20 2.0 3.0
27 20.6 60.8 233.7 834.0 127.2 40.7 15.4 4.4 132.0 10.7 0.20 18.70 2.9 2.8
23 23.8 54.5 180.6 777.0 125.8 39.5 24.2 3.0 50.1 6.6 0.30 46.00 3.0 2.7
19 22.8 40.9 171.4 691.8 151.4 40.8 104.8 17.6 43.6 9.1 0.40 40.50 3.8 2.7
… etc

Apply cca function > sppenv.cca <- cca(varespec,varechem) > sppenv.cca Call: cca(X = varespec, Y = varechem) Inertia Rank Total 2.0832 Constrained 1.4415 14 Unconstrained 0.6417 9 Inertia is mean squared contingency coefficient Eigenvalues for constrained axes: CCA1 CCA2 CCA3 CCA4 CCA5 CCA6 CCA7 CCA8 CCA9 CCA10 CCA11 CCA12 CCA13 0.438870 0.291775 0.162847 0.142130 0.117952 0.089029 0.070295 0.058359 0.031141 0.013294 0.008364 0.006538 0.006156 CCA14 0.004733 Eigenvalues for unconstrained axes: CA1 CA2 CA3 CA4 CA5 CA6 CA7 CA8 CA9 0.197765 0.141926 0.101174 0.070787 0.053303 0.033299 0.018868 0.015104 0.009488

We can plot
> plot(sppenv.cca)



To obtain

[Figure: CCA triplot of the varespec sites (numbers), species, and the varechem environmental variables (arrows), plotted on axes CCA1 and CCA2]

In this type of biplot we use numbers as markers for the sites (observations), symbols as markers for the species (in red if you view this in color), and arrows ending at the values of the environmental variables (in blue if you view this in color). To interpret this diagram, look for areas where specific sites, species, and environmental variables are close together, indicating relationships. Also see how sites, species, and constraints are sorted along gradients, one given by CCA1 (the horizontal axis) and the other by CCA2 (the vertical axis). Rest of the solution available by request.
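Regarding the formula capability mentioned in the exercise, cca also accepts a model formula, so the constraints can be limited to Al, P, K, and Baresoil; a minimal sketch (spp.sub.cca is just an illustrative name; variable names as they appear in varechem):

spp.sub.cca <- cca(varespec ~ Al + P + K + Baresoil, data = varechem)
spp.sub.cca          # inertia partition with only these four constraints
plot(spp.sub.cca)    # triplot with arrows for Al, P, K, Baresoil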


Exercise 15-7 Data set in file lab15/census-tract3.txt. Data for 15 tracts and 3 variables: median age, % no family, and average income (in thousands of dollars). Perform MDS and cluster analysis to explore potential clusters or groups. Select these clusters. Confirm that there may be differences among these groups using MANOVA, and then find linear discriminant functions to separate these groups. Interpret and discuss. Solution: available by request.

Exercise 15-8 Perform canonical correlation, cluster analysis, and metric MDS on the dataset in file lab14/blocks.txt. The data set is block geometry and dimensions of 25 blocks labeled a-y (Davis, 2002, Fig 6-16, pages 507-508). See the computer exercises of the previous chapter for the variable definitions. Solution: available by request.


