Introduction to Econometrics, Second Edition, by M. W. Watson and J. H. Stock

Chapter 1 Economic Questions and Data 1.1 Multiple Choice 1) Analyzing the behavior of unemployment rates across U.S. states in March of 2006 is an example of using A) time series data. B) panel data. C) cross-sectional data. D) experimental data. Answer: C 2) Studying inflation in the United States from 1970 to 2006 is an example of using A) randomized controlled experiments. B) time series data. C) panel data. D) cross-sectional data. Answer: B 3) Analyzing the effect of minimum wage changes on teenage employment across the 48 contiguous U.S. states from 1980 to 2004 is an example of using A) time series data. B) panel data. C) having a treatment group vs. a control group, since only teenagers receive minimum wages. D) cross-sectional data. Answer: B 4) Panel data A) is also called longitudinal data. B) is the same as time series data. C) studies a group of people at a point in time. D) typically uses control and treatment groups. Answer: A 5) Econometrics can be defined as follows with the exception of A) the science of testing economic theory. B) fitting mathematical economic models to real-world data. C) a set of tools used for forecasting future values of economic variables. D) measuring the height of economists. Answer: D 6) To provide quantitative answers to policy questions A) it is typically sufficient to use common sense. B) you should interview the policy makers involved. C) you should examine empirical evidence. D) is typically impossible since policy questions are not quantifiable. Answer: C 7) An example of a randomized controlled experiment is when A) households receive a tax rebate in one year but not the other. B) one U.S. state increases minimum wages and an adjacent state does not, and employment differences are observed. C) random variables are controlled for by holding constant other factors. 
D) some 5th graders in a specific elementary school are allowed to use computers at school while others are not, and their end-of-year performance is compared holding constant other factors. Answer: D Stock/Watson 2e -- CVC2 8/23/06 -- Page 1

8) Ideal randomized controlled experiments in economics are A) often performed in practice. B) often used by the Federal Reserve to study the effects of monetary policy. C) useful because they give a definition of a causal effect. D) sometimes used by universities to determine who graduates in four years rather than five. Answer: C 9) Most economic data are obtained A) through randomized controlled experiments. B) by calibration methods. C) through textbook examples typically involving ten observation points. D) by observing real-world behavior. Answer: D 10) One of the primary advantages of using econometrics over typical results from economic theory, is that A) it potentially provides you with quantitative answers for a policy problem rather than simply suggesting the direction (positive/negative) of the response. B) teaching you how to use statistical packages C) learning how to invert a 4 by 4 matrix. D) all of the above.

Answer: A 11) In a randomized controlled experiment A) there is a control group and a treatment group. B) you control for the effect that random numbers are not truly randomly generated C) you control for random answers D) the control group receives treatment on even days only. Answer: A 12) The reason why economists do not use experimental data more frequently is for all of the following reasons except that real-world experiments A) cannot be executed in economics. B) with humans are difficult to administer. C) are often unethical. D) have flaws relative to ideal randomized controlled experiments. Answer: A 13) The most frequently used experimental or observational data in econometrics are of the following type: A) cross-sectional data. B) randomly generated data. C) time series data. D) panel data. Answer: A


14) In the graph below, the vertical axis represents average real GDP growth for 65 countries over the period 1960-1995, and the horizontal axis shows the average trade share within these countries.

This is an example of A) cross-sectional data. B) experimental data. C) a time series. D) longitudinal data. Answer: A


15) The accompanying graph

Is an example of A) cross-sectional data. B) experimental data. C) a time series. D) longitudinal data. Answer: A


16) The accompanying graph

is an example of A) experimental data. B) cross-sectional data. C) a time series. D) longitudinal data. Answer: C

1.2 Essays 1) Give at least three examples from economics where each of the following types of data can be used: cross-sectional data, time series data, and panel data. Answer: Answers will vary by student. At this level of economics, students most likely have heard of the following uses of cross-sectional data: earnings functions, growth equations, the effect of class size reduction on student performance (in this chapter), demand functions (in this chapter: cigarette consumption); time series: the Phillips curve (in this chapter), consumption functions, Okun's law; panel data: various U.S. state panel studies on road fatalities (in this book), unemployment rate and unemployment benefits variations, growth regressions (across states and countries), and crime and abortion (Freakonomics).


Chapter 2 Review of Probability 2.1 Multiple Choice 1) The probability of an outcome A) is the number of times that the outcome occurs in the long run. B) equals M × N, where M is the number of occurrences and N is the population size. C) is the proportion of times that the outcome occurs in the long run. D) equals the sample mean divided by the sample standard deviation. Answer: C 2) The probability of an event A or B (Pr(A or B)) to occur equals A) Pr(A) × Pr(B). B) Pr(A) + Pr(B) if A and B are mutually exclusive. C) $\Pr(A)/\Pr(B)$. D) Pr(A) + Pr(B) even if A and B are not mutually exclusive. Answer: B 3) The cumulative probability distribution shows the probability A) that a random variable is less than or equal to a particular value. B) of two or more events occurring at once. C) of all possible events occurring. D) that a random variable takes on a particular value given that another event has happened. Answer: A 4) The expected value of a discrete random variable A) is the outcome that is most likely to occur. B) can be found by determining the 50% value in the c.d.f. C) equals the population median. D) is computed as a weighted average of the possible outcomes of that random variable, where the weights are the probabilities of those outcomes. Answer: D 5) Let Y be a random variable. Then var(Y) equals A) $\sqrt{E[(Y - \mu_Y)^2]}$. B) $E[|Y - \mu_Y|]$. C) $E[(Y - \mu_Y)^2]$. D) $E[(Y - \mu_Y)]$. Answer: C


6) The skewness of the distribution of a random variable Y is defined as follows: A) $E(Y^3 - \mu_Y)/\sigma_Y^2$ B) $E[(Y - \mu_Y)^3]$ C) $E(Y^3 - \mu_Y^3)/\sigma_Y^3$ D) $E[(Y - \mu_Y)^3]/\sigma_Y^3$ Answer: D 7) The skewness is most likely positive for one of the following distributions: A) The grade distribution at your college or university. B) The U.S. income distribution. C) SAT scores in English. D) The height of 18 year old females in the U.S. Answer: B 8) The kurtosis of a distribution is defined as follows: A) $E[(Y - \mu_Y)^4]/\sigma_Y^4$ B) $E(Y^4 - \mu_Y^4)/\sigma_Y^2$ C) skewness/var(Y) D) $E[(Y - \mu_Y)^4]$ Answer: A 9) For a normal distribution, the skewness and kurtosis measures are as follows: A) 1.96 and 4 B) 0 and 0 C) 0 and 3 D) 1 and 2 Answer: C


10) The conditional distribution of Y given X = x, Pr(Y = y | X = x), is A) $\Pr(Y = y)/\Pr(X = x)$. B) $\sum_{i=1}^{l} \Pr(X = x_i, Y = y)$. C) $\Pr(X = x, Y = y)/\Pr(Y = y)$. D) $\Pr(X = x, Y = y)/\Pr(X = x)$. Answer: D 11) The conditional expectation of Y given X, E(Y | X = x), is calculated as follows: A) $\sum_{i=1}^{k} Y_i \Pr(X = x_i \mid Y = y)$ B) $E[E(Y \mid X)]$ C) $\sum_{i=1}^{k} y_i \Pr(Y = y_i \mid X = x)$ D) $\sum_{i=1}^{l} E(Y \mid X = x_i) \Pr(X = x_i)$ Answer: C 12) Two random variables X and Y are independently distributed if all of the following conditions hold, with the exception of A) Pr(Y = y | X = x) = Pr(Y = y). B) knowing the value of one of the variables provides no information about the other. C) if the conditional distribution of Y given X equals the marginal distribution of Y. D) E(Y) = E[E(Y | X)]. Answer: D 13) The correlation between X and Y A) cannot be negative since variances are always positive. B) is the covariance squared. C) can be calculated by dividing the covariance between X and Y by the product of the two standard deviations. D) is given by $\mathrm{corr}(X, Y) = \mathrm{cov}(X, Y)/(\mathrm{var}(X)\,\mathrm{var}(Y))$. Answer: C 14) Two variables are uncorrelated in all of the cases below, with the exception of A) being independent. B) having a zero covariance. C) $|\sigma_{XY}| \le \sqrt{\sigma_X^2 \sigma_Y^2}$. D) E(Y | X) = 0. Answer: C


15) var(aX + bY) = A) $a^2\sigma_X^2 + b^2\sigma_Y^2$. B) $a^2\sigma_X^2 + 2ab\sigma_{XY} + b^2\sigma_Y^2$. C) $\sigma_{XY} + \mu_X\mu_Y$. D) $a\sigma_X^2 + b\sigma_Y^2$. Answer: B 16) To standardize a variable you A) subtract its mean and divide by its standard deviation. B) integrate the area below two points under the normal distribution. C) add and subtract 1.96 times the standard deviation to the variable. D) divide it by its standard deviation, as long as its mean is 1. Answer: A 17) Assume that Y is normally distributed $N(\mu, \sigma^2)$. Moving from the mean (μ) 1.96 standard deviations to the left and 1.96 standard deviations to the right, then the area under the normal p.d.f. is A) 0.67 B) 0.05 C) 0.95 D) 0.33 Answer: C 18) Assume that Y is normally distributed $N(\mu, \sigma^2)$. To find $\Pr(c_1 \le Y \le c_2)$, where $c_1 < c_2$ and $d_i = (c_i - \mu)/\sigma$, you need to calculate $\Pr(d_1 \le Z \le d_2)$ = A) $\Phi(d_2) - \Phi(d_1)$ B) $\Phi(1.96) - \Phi(1.96)$ C) $\Phi(d_2) - (1 - \Phi(d_1))$ D) $1 - (\Phi(d_2) - \Phi(d_1))$ Answer: A 19) If variables with a multivariate normal distribution have covariances that equal zero, then A) the correlation will most often be zero, but does not have to be. B) the variables are independent. C) you should use the $\chi^2$ distribution to calculate probabilities. D) the marginal distribution of each of the variables is no longer normal. Answer: B 20) The Student t distribution is A) the distribution of the sum of m squared independent standard normal random variables. B) the distribution of a random variable with a chi-squared distribution with m degrees of freedom, divided by m. C) always well approximated by the standard normal distribution. D) the distribution of the ratio of a standard normal random variable, divided by the square root of an independently distributed chi-squared random variable with m degrees of freedom divided by m. Answer: D


21) When there are ∞ degrees of freedom, the $t_\infty$ distribution A) can no longer be calculated. B) equals the standard normal distribution. C) has a bell shape similar to that of the normal distribution, but with “fatter” tails. D) equals the $\chi_\infty^2$ distribution. Answer: B 22) The sample average is a random variable and A) is a single number and as a result cannot have a distribution. B) has a probability distribution called its sampling distribution. C) has a probability distribution called the standard normal distribution. D) has a probability distribution that is the same as for the $Y_1, \ldots, Y_n$ i.i.d. variables. Answer: B 23) To infer the political tendencies of the students at your college/university, you sample 150 of them. Only one of the following is a simple random sample: You A) make sure that the proportion of minorities are the same in your sample as in the entire student body. B) call every fiftieth person in the student directory at 9 a.m. If the person does not answer the phone, you pick the next name listed, and so on. C) go to the main dining hall on campus and interview students randomly there. D) have your statistical package generate 150 random numbers in the range from 1 to the total number of students in your academic institution, and then choose the corresponding names in the student telephone directory. Answer: D 24) The variance of $\bar{Y}$, $\sigma_{\bar{Y}}^2$, is given by the following formula: A) $\sigma_Y^2$. B) $\sigma_Y/\sqrt{n}$. C) $\sigma_Y^2/n$. D) $\sigma_Y^2/\sqrt{n}$. Answer: C


25) The mean of the sample average $\bar{Y}$, $E(\bar{Y})$, is A) $\frac{1}{n}\mu_Y$. B) $\mu_Y$. C) $\mu_Y/\sqrt{n}$. D) $\sigma_Y/\mu_Y$ for n > 30. Answer: B 26) In econometrics, we typically do not rely on exact or finite sample distributions because A) we have approximately an infinite number of observations (think of re-sampling). B) variables typically are normally distributed. C) the covariances of $Y_i$, $Y_j$ are typically not zero. D) asymptotic distributions can be counted on to provide good approximations to the exact sampling distribution (given the number of observations available in most cases). Answer: D 27) Consistency for the sample average $\bar{Y}$ can be defined as follows, with the exception of A) $\bar{Y}$ converges in probability to $\mu_Y$. B) $\bar{Y}$ has the smallest variance of all estimators. C) $\bar{Y} \xrightarrow{p} \mu_Y$. D) the probability of $\bar{Y}$ being in the range $\mu_Y \pm c$ becomes arbitrarily close to one as n increases for any constant c > 0. Answer: B 28) The central limit theorem states that A) the sampling distribution of $(\bar{Y} - \mu_Y)/\sigma_{\bar{Y}}$ is approximately normal. B) $\bar{Y} \xrightarrow{p} \mu_Y$. C) the probability that $\bar{Y}$ is in the range $\mu_Y \pm c$ becomes arbitrarily close to one as n increases for any constant c > 0. D) the t distribution converges to the F distribution for approximately n > 30. Answer: A 29) The central limit theorem A) states conditions under which a variable involving the sum of $Y_1, \ldots, Y_n$ i.i.d. variables becomes the standard normal distribution. B) postulates that the sample mean $\bar{Y}$ is a consistent estimator of the population mean $\mu_Y$. C) only holds in the presence of the law of large numbers. D) states conditions under which a variable involving the sum of $Y_1, \ldots, Y_n$ i.i.d. variables becomes the Student t distribution. Answer: A


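30) The covariance inequality states that A) $0 \le \sigma_{XY}^2 \le 1$. B) $\sigma_{XY}^2 \le \sigma_X^2 \sigma_Y^2$. C) $\sigma_{XY}^2 - \sigma_X^2 \le \sigma_Y^2$. D) $\sigma_{XY}^2 \le \sigma_X^2/\sigma_Y^2$. Answer: B

31) $\sum_{i=1}^{n} (a x_i + b y_i + c) =$ A) $a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i + n \times c$ B) $a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i + c$ C) $a\bar{x} + b\bar{y} + n \times c$ D) $a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i$ Answer: A

32) $\sum_{i=1}^{n} (a x_i + b) =$ A) $n \times a \times \bar{x} + n \times b$ B) $n(a + b)$ C) D) Answer: A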


33) Assume that you assign the following subjective probabilities for your final grade in your econometrics course (the standard GPA scale of 4 = A to 0 = F applies):

Grade         A      B      C      D      F
Probability   0.20   0.50   0.20   0.08   0.02

The expected value is: A) 3.0 B) 3.5 C) 2.78 D) 3.25 Answer: C 34) The mean and variance of a Bernoulli random variable are given as A) cannot be calculated B) np and np(1-p) C) p and p(1-p) D) p and (1-p) Answer: C 35) Consider the following linear transformation of a random variable: $y = (x - \mu_x)/\sigma_x$, where $\mu_x$ is the mean of x and $\sigma_x$ is the standard deviation. Then the expected value and the standard deviation of Y are given as A) 0 and 1 B) 1 and 1 C) Cannot be computed because Y is not a linear function of X D) $\mu/\sigma_x$ and $\sigma_x$ Answer: A
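The expected value in question 33 and the Bernoulli moments in question 34 are both probability-weighted averages, which can be checked with a minimal Python sketch. The helper names `expected_value` and `variance` and the value p = 0.3 are illustrative assumptions, not part of the test bank.

```python
def expected_value(values, probs):
    """Probability-weighted average of the possible outcomes."""
    return sum(v * p for v, p in zip(values, probs))


def variance(values, probs):
    """Expected squared deviation from the mean."""
    mu = expected_value(values, probs)
    return sum(p * (v - mu) ** 2 for v, p in zip(values, probs))


# Question 33: grade points A=4 ... F=0 with the stated probabilities.
grade_points = [4, 3, 2, 1, 0]
grade_probs = [0.20, 0.50, 0.20, 0.08, 0.02]
mean_grade = expected_value(grade_points, grade_probs)

# Question 34: a Bernoulli variable takes 1 with probability p, else 0.
p = 0.3  # hypothetical success probability, chosen only for illustration
bernoulli_mean = expected_value([1, 0], [p, 1 - p])
bernoulli_var = variance([1, 0], [p, 1 - p])

print(round(mean_grade, 2), round(bernoulli_mean, 2), round(bernoulli_var, 2))
```

For any p between 0 and 1 the same two calls return p and p(1 - p).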


2.2 Essays and Longer Questions 1) Think of the situation of rolling two dice and let M denote the sum of the number of dots on the two dice. (So M is a number between 2 and 12.) (a) In a table, list all of the possible outcomes for the random variable M together with its probability distribution and cumulative probability distribution. Sketch both distributions. (b) Calculate the expected value and the standard deviation for M. (c) Looking at the sketch of the probability distribution, you notice that it resembles a normal distribution. Should you be able to use the standard normal distribution to calculate probabilities of events? Why or why not? Answer: (a)

Outcome (sum of dots)                 2      3      4      5      6      7      8      9      10     11     12
Probability distribution              0.028  0.056  0.083  0.111  0.139  0.167  0.139  0.111  0.083  0.056  0.028
Cumulative probability distribution   0.028  0.083  0.167  0.278  0.417  0.583  0.722  0.833  0.917  0.972  1.000

(b) 7.0; 2.42. (c) You cannot use the normal distribution (without continuity correction) to calculate probabilities of events, since under a continuous distribution the probability of any single outcome, such as M = 7, equals zero.
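The distribution of M can be enumerated directly from the 36 equally likely rolls. The following is a minimal Python sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# Count how many of the 36 equally likely (die1, die2) pairs give each sum.
counts = {}
for d1, d2 in product(range(1, 7), repeat=2):
    counts[d1 + d2] = counts.get(d1 + d2, 0) + 1

# Probability distribution of M, kept as exact fractions.
pmf = {m: Fraction(c, 36) for m, c in sorted(counts.items())}

# Cumulative distribution: running sum of the probabilities.
cdf = {}
running = Fraction(0)
for m, prob in pmf.items():
    running += prob
    cdf[m] = running

mean = sum(m * prob for m, prob in pmf.items())
var = sum(prob * (m - mean) ** 2 for m, prob in pmf.items())
std = sqrt(var)

for m in pmf:
    print(m, round(float(pmf[m]), 3), round(float(cdf[m]), 3))
print(mean, var, round(std, 2))
```

Working with `Fraction` avoids rounding until the final print, so the table entries can be compared against 1/36, 2/36, and so on.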


2) What is the probability of the following outcomes? (a) Pr(M = 7) (b) Pr(M = 2 or M = 10) (c) Pr(M = 4 or M ≠ 4) (d) Pr(M = 6 and M = 9) (e) Pr(M < 8) (f) Pr(M = 6 or M > 10) Answer: (a) 0.167 or 6/36 = 1/6; (b) 0.111 or 4/36 = 1/9; (c) 1; (d) 0; (e) 0.583 or 21/36 = 7/12; (f) 0.222 or 8/36 = 2/9.

3) Probabilities and relative frequencies are related in that the probability of an outcome is the proportion of the time that the outcome occurs in the long run. Hence concepts of joint, marginal, and conditional probability distributions stem from related concepts of frequency distributions. You are interested in investigating the relationship between the age of heads of households and weekly earnings of households. The accompanying data gives the number of occurrences grouped by age and income. You collect data from 1,744 individuals and think of these individuals as a population that you want to describe, rather than a sample from which you want to infer behavior of a larger population. After sorting the data, you generate the accompanying table: Joint Absolute Frequencies of Age and Income, 1,744 Households

                          Age of head of household
Household Income          X1            X2            X3            X4            X5
                          16-under 20   20-under 25   25-under 45   45-under 65   65 and >
Y1 $0-under $200          80            76            130           86            24
Y2 $200-under $400        13            ...           346           140           ...
Y3 $400-under $600        ...           ...           251           101           ...
Y4 $600-under $800        ...           ...           110           ...           ...
Y5 $800 and >             ...           ...           108           ...           ...

The median of the income group of $800 and above is $1,050. (a) Calculate the joint relative frequencies and the marginal relative frequencies. Interpret one of each of these. Sketch the cumulative income distribution. (b) Calculate the conditional relative income frequencies for the two age categories 16 -under 20, and 45-under 65. Calculate the mean household income for both age categories. (c) If household income and age of head of household were independently distributed, what would you expect these two conditional relative income distributions to look like? Are they similar here? (d) Your textbook has given you a primary definition of independence that does not involve conditional relative frequency distributions. What is that definition? Do you think that age and income are independent here, using this definition?


Answer: (a) The joint relative frequencies and marginal relative frequencies are given in the accompanying table. 5.2 percent of the individuals are between the age of 20 and 24, and make between $200 and under $400. 21.6 percent of the individuals earn between $400 and under $600.

Joint Relative and Marginal Frequencies of Age and Income, 1,744 Households

                          Age of head of household
Household Income          X1            X2            X3            X4            X5
                          16-under 20   20-under 25   25-under 45   45-under 65   65 and >   Total
Y1 $0-under $200          0.046         0.044         0.075         0.049         0.014      0.227
Y2 $200-under $400        0.007         0.052         0.198         0.080         0.005      0.342
Y3 $400-under $600        0.000         0.011         0.144         0.058         0.003      0.216
Y4 $600-under $800        0.001         0.006         0.063         0.032         0.001      0.102
Y5 $800 and >             0.001         ...           0.062         0.048         0.001      0.112

(b) The mean household income for the 16-under 20 age category is roughly $144. It is approximately $489 for the 45-under 65 age category.

Conditional Relative Frequencies of Income and Age 16-under 20, and 45-under 65, 1,744 Households

                          Age of head of household
Household Income          X1            X4
                          16-under 20   45-under 65
Y1 $0-under $200          0.842         0.185
Y2 $200-under $400        0.137         0.300
Y3 $400-under $600        0.000         0.217
Y4 $600-under $800        0.011         0.118
Y5 $800 and >             0.011         0.180

(c) They would have to be identical, which they clearly are not. (d) Pr(Y = y, X = x) = Pr(Y = y) Pr(X = x). We can check this by multiplying two marginal probabilities to see if this results in the joint probability. For example, Pr(Y = Y3) = 0.216 and Pr(X = X3) = 0.542, resulting in a product of 0.117, which does not equal the joint probability of 0.144. Given that we are looking at the data as a population, not a sample, we do not have to test how “close” 0.117 is to 0.144. 4) Math and verbal SAT scores are each distributed normally with N(500, 10000). (a) What fraction of students scores above 750? Above 600? Between 420 and 530? Below 480? Above 530? (b) If the math and verbal scores were independently distributed, which is not the case, then what would be the distribution of the overall SAT score? Find its mean and variance. (c) Next, assume that the correlation coefficient between the math and verbal scores is 0.75. Find the mean and variance of the resulting distribution. (d) Finally, assume that you had chosen 25 students at random who had taken the SAT exam. Derive the distribution for their average math SAT score. What is the probability that this average is above 530? Why is this so much smaller than your answer in (a)? Answer: (a) Pr(Y > 750) = 0.0062; Pr(Y > 600) = 0.1587; Pr(420 < Y < 530) = 0.4061; Pr(Y < 480) = 0.4207; Pr(Y > 530) = 0.3821. (b) The distribution would be N(1000, 20000), using equations (2.29) and (2.31) in the textbook. Note that the standard deviation is now roughly 141 rather than 200. (c) Given the correlation coefficient, the distribution is now N(1000, 35000), which has a standard deviation of approximately 187. (d) The distribution for the average math SAT score is N(500, 400). Pr($\bar{Y}$ > 530) = 0.0668. This probability is smaller because the sample mean has a smaller standard deviation (20 rather than 100).
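The SAT probabilities in question 4 can be reproduced with Python's standard-library `statistics.NormalDist`. This is a minimal sketch; the variable names are illustrative, and the variance of the sum uses var(X + Y) = var(X) + var(Y) + 2·corr·sd(X)·sd(Y).

```python
from math import sqrt
from statistics import NormalDist

# Each SAT section is N(500, 10000), i.e. standard deviation 100.
score = NormalDist(mu=500, sigma=100)

p_above_750 = 1 - score.cdf(750)
p_above_600 = 1 - score.cdf(600)
p_between_420_530 = score.cdf(530) - score.cdf(420)
p_below_480 = score.cdf(480)
p_above_530 = 1 - score.cdf(530)

# Variance of the combined score under independence and under corr = 0.75.
sd = 100
rho = 0.75
var_independent = sd**2 + sd**2
var_correlated = sd**2 + sd**2 + 2 * rho * sd * sd

# Average math score of n = 25 randomly chosen students: N(500, 10000/25).
n = 25
sample_mean = NormalDist(mu=500, sigma=sd / sqrt(n))
p_mean_above_530 = 1 - sample_mean.cdf(530)

print(round(p_above_750, 4), round(p_above_600, 4), round(p_between_420_530, 4))
print(round(p_below_480, 4), round(p_above_530, 4), round(p_mean_above_530, 4))
print(var_independent, var_correlated, round(sqrt(var_independent)), round(sqrt(var_correlated)))
```

The last probability is small because the standard deviation of the sample mean shrinks with the square root of n.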
5) The following problem is frequently encountered in the case of a rare disease, say AIDS, when determining the probability of actually having the disease after testing positively for HIV. (This is often known as the accuracy of the test given that you have the disease.) Let us set up the problem as follows: Y = 0 if you tested negative using the ELISA test for HIV, Y = 1 if you tested positive; X = 1 if you have HIV, X = 0 if you do not have HIV. Assume that 0.1 percent of the population has HIV and that the accuracy of the test is 0.95 in both cases of (i) testing positive when you have HIV, and (ii) testing negative when you do not have HIV. (The ELISA test is actually 99.7 percent accurate when you have HIV, and 98.5 percent accurate when you do not have HIV.) (a) Assuming arbitrarily a population of 10,000,000 people, use the accompanying table to first enter the column totals.

                Test Positive (Y=1)   Test Negative (Y=0)   Total
HIV (X=1)
No HIV (X=0)
Total                                                       10,000,000

(b) Use the conditional probabilities to fill in the joint absolute frequencies. (c) Fill in the marginal absolute frequencies for testing positive and negative. Determine the conditional probability of having HIV when you have tested positive. Explain this surprising result. (d) The previous problem is an application of Bayes’ theorem, which converts Pr( Y = y X = x) into Pr(X = x Y = y). Can you think of other examples where Pr( Y = y X = x) ≠ Pr(X = x Y = y)?


Answer: (a)

                Test Positive (Y=1)   Test Negative (Y=0)   Total
HIV (X=1)                                                   10,000
No HIV (X=0)                                                9,990,000
Total                                                       10,000,000

(b)

                Test Positive (Y=1)   Test Negative (Y=0)   Total
HIV (X=1)       9,500                 500                   10,000
No HIV (X=0)    499,500               9,490,500             9,990,000
Total                                                       10,000,000

(c)

                Test Positive (Y=1)   Test Negative (Y=0)   Total
HIV (X=1)       9,500                 500                   10,000
No HIV (X=0)    499,500               9,490,500             9,990,000
Total           509,000               9,491,000             10,000,000

Pr(X = 1 | Y = 1) = 9,500/509,000 = 0.0187. Although the test is quite accurate, there are very few people who have HIV (10,000), and many who do not have HIV (9,990,000). A small percentage of that large number (499,500/9,990,000) is large when compared to the higher percentage of the smaller number (9,500/10,000). (d) Answers will vary by student. Perhaps a nice illustration is the probability to be a male given that you play on the college/university men’s varsity team, versus the probability to play on the college/university men’s varsity team given that you are a male student.
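The frequency-table reasoning in question 5 can be sketched in a few lines of Python. The names `prevalence`, `sensitivity`, and `specificity` are labels introduced here for the 0.1 percent and 0.95 figures stated in the question; they do not appear in the test bank.

```python
population = 10_000_000
prevalence = 0.001   # share of the population with HIV (X = 1)
sensitivity = 0.95   # Pr(test positive | HIV)
specificity = 0.95   # Pr(test negative | no HIV)

# Row totals of the table.
have_hiv = population * prevalence
no_hiv = population - have_hiv

# Joint absolute frequencies.
true_positives = have_hiv * sensitivity
false_negatives = have_hiv - true_positives
false_positives = no_hiv * (1 - specificity)
true_negatives = no_hiv - false_positives

# Column totals, then the conditional probability Pr(X = 1 | Y = 1).
tested_positive = true_positives + false_positives
p_hiv_given_positive = true_positives / tested_positive

print(round(true_positives), round(false_positives), round(tested_positive))
print(round(p_hiv_given_positive, 4))
```

Replacing the two 0.95 values with the 0.997 and 0.985 figures quoted in the question shows how sensitive the result is to the false-positive rate.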


6) You have read about the so-called catch-up theory by economic historians, whereby nations that are further behind in per capita income grow faster subsequently. If this is true systematically, then eventually laggards will reach the leader. To put the theory to the test, you collect data on relative (to the United States) per capita income for two years, 1960 and 1990, for 24 OECD countries. You think of these countries as a population you want to describe, rather than a sample from which you want to infer behavior of a larger population. The relevant data for this question is as follows:

Y        X1       X2       Y × X1   Y²        X1²     X2²
0.023    0.770    1.030    0.018    0.00053   0.593   1.0609
0.014    1.000    1.000    0.014    0.00020   1.000   1.0000
….       ….       ….       ….       ….        ….      ….
0.041    0.200    0.450    0.008    0.00168   0.040   0.2025
0.033    0.130    0.230    0.004    0.00109   0.017   0.0529
0.625    13.220   17.800   0.294    0.01877   8.529   13.9164

where X1 and X2 are per capita income relative to the United States in 1960 and 1990 respectively, and Y is the average annual growth rate in X over the 1960-1990 period. Numbers in the last row represent sums of the columns above. (a) Calculate the variance and standard deviation of X1 and X2. For a catch-up effect to be present, what relationship must the two standard deviations show? Is this the case here? (b) Calculate the correlation between Y and X1. What sign must the correlation coefficient have for there to be evidence of a catch-up effect? Explain. Answer: (a) The variances of X1 and X2 are 0.0520 and 0.0298 respectively, with standard deviations of 0.2279 and 0.1726. For the catch-up effect to be present, the standard deviation would have to shrink over time. This is the case here. (b) The correlation coefficient is –0.88. It has to be negative for there to be evidence of a catch-up effect. If countries that were relatively ahead in the initial period in terms of per capita income grow by relatively less over time, then eventually the laggards will catch up. 7) Following Alfred Nobel’s will, there are five Nobel Prizes awarded each year. These are for outstanding achievements in Chemistry, Physics, Physiology or Medicine, Literature, and Peace. In 1968, the Bank of Sweden added a prize in Economic Sciences in memory of Alfred Nobel. You think of the data as describing a population, rather than a sample from which you want to infer behavior of a larger population. The accompanying table lists the joint probability distribution between recipients in economics and the other five prizes, and the citizenship of the recipients, based on the 1969-2001 period. Joint Distribution of Nobel Prize Winners in Economics and Non-Economics Disciplines, and Citizenship, 1969-2001

                                                                          U.S. Citizen (Y = 0)   Non-U.S. Citizen (Y = 1)   Total
Economics Nobel Prize (X = 0)                                             0.118                  0.049                      0.167
Physics, Chemistry, Medicine, Literature, and Peace Nobel Prize (X = 1)   0.345                  0.488                      0.833
Total                                                                     0.463                  0.537                      1.00

(a) Compute E(Y) and interpret the resulting number. (b) Calculate and interpret E(Y | X = 1) and E(Y | X = 0).

(c) A randomly selected Nobel Prize winner reports that he is a non-U.S. citizen. What is the probability that this genius has won the Economics Nobel Prize? A Nobel Prize in the other five disciplines? (d) Show what the joint distribution would look like if the two categories were independent. Answer: (a) E(Y) = 0.537. 53.7 percent of Nobel Prize winners were non-U.S. citizens. (b) E(Y | X = 1) = 0.586. 58.6 percent of Nobel Prize winners in non-economics disciplines were non-U.S. citizens. E(Y | X = 0) = 0.293. 29.3 percent of the Economics Nobel Prize winners were non-U.S. citizens. (c) There is a 9.1 percent chance that he has won the Economics Nobel Prize, and a 90.9 percent chance that he has won a Nobel Prize in one of the other five disciplines. (d) Joint Distribution of Nobel Prize Winners in Economics and Non-Economics Disciplines, and Citizenship, 1969-2001, under assumption of independence

                                                                          U.S. Citizen (Y = 0)   Non-U.S. Citizen (Y = 1)   Total
Economics Nobel Prize (X = 0)                                             0.077                  0.090                      0.167
Physics, Chemistry, Medicine, Literature, and Peace Nobel Prize (X = 1)   0.386                  0.447                      0.833
Total                                                                     0.463                  0.537                      1.00
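Under independence, each joint probability is the product of the two marginal probabilities. A minimal Python sketch for the Nobel Prize table follows; the dictionary keys (`"economics"`, `"other"`, `"us"`, `"non_us"`) are illustrative labels, not notation from the test bank.

```python
# Marginal distributions taken from the "Total" row and column of the table.
marginal_x = {"economics": 0.167, "other": 0.833}   # prize category
marginal_y = {"us": 0.463, "non_us": 0.537}         # citizenship

# Observed joint distribution, for comparison.
observed = {
    ("economics", "us"): 0.118, ("economics", "non_us"): 0.049,
    ("other", "us"): 0.345, ("other", "non_us"): 0.488,
}

# Joint distribution implied by independence: Pr(X = x) * Pr(Y = y).
independent = {
    (x, y): px * py
    for x, px in marginal_x.items()
    for y, py in marginal_y.items()
}

for cell, prob in independent.items():
    print(cell, round(prob, 3), "observed:", observed[cell])
```

Comparing each product with the observed cell shows how far the data are from independence, for example 0.077 against 0.118 for U.S. economics laureates.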

8) A few years ago the news magazine The Economist listed some of the stranger explanations used in the past to predict presidential election outcomes. These included whether or not the hemlines of women’s skirts went up or down, stock market performances, baseball World Series wins by an American League team, etc. Thinking about this problem more seriously, you decide to analyze whether or not the presidential candidate for a certain party did better if his party controlled the house. Accordingly you collect data for the last 34 presidential elections. You think of this data as comprising a population which you want to describe, rather than a sample from which you want to infer behavior of a larger population. You generate the accompanying table: Joint Distribution of Presidential Party Affiliation and Party Control of House of Representatives, 1860 -1996

                                 Democratic Control     Republican Control
                                 of House (Y = 0)       of House (Y = 1)       Total
Democratic President (X = 0)     0.412                  0.030                  0.441
Republican President (X = 1)     0.176                  0.382                  0.559
Total                            0.588                  0.412                  1.00

(a) Interpret one of the joint probabilities and one of the marginal probabilities. (b) Compute E(X). How does this differ from E(X | Y = 0)? Explain. (c) If you picked one of the Republican presidents at random, what is the probability that during his term the Democrats had control of the House? (d) What would the joint distribution look like under independence? Check your results by calculating the two conditional distributions and compare these to the marginal distribution.


Answer: (a) 38.2 percent of the presidents were Republicans and were in the White House while Republicans controlled the House of Representatives. 44.1 percent of all presidents were Democrats. (b) E(X) = 0.559. 55.9 percent of the presidents were Republicans. E(X | Y = 0) = 0.176/0.588 = 0.299. 29.9 percent of those presidents who were in office while Democrats had control of the House of Representatives were Republicans. E(X) gives you the unconditional expected value, while E(X | Y = 0) is the conditional expected value: it conditions on those periods during which Democrats had control of the House of Representatives, and ignores the other periods. (c) Pr(Y = 0 | X = 1) = 0.176/0.559 = 0.315. 31.5 percent of the Republican presidents were in office while Democrats had control of the House of Representatives. (d) Joint Distribution of Presidential Party Affiliation and Party Control of House of Representatives, 1860-1996, under the Assumption of Independence

                                 Democratic Control     Republican Control
                                 of House (Y = 0)       of House (Y = 1)       Total
Democratic President (X = 0)     0.259                  0.182                  0.441
Republican President (X = 1)     0.329                  0.230                  0.559
Total                            0.588                  0.412                  1.00

Pr(X = 0 | Y = 0) = 0.259/0.588 = 0.440 (there is a small rounding error).

Pr(Y = 1 | X = 1) = 0.230/0.559 = 0.411 (there is a small rounding error).

9) The expectations augmented Phillips curve postulates Δp = π – f(u – ū), where Δp is the actual inflation rate, π is the expected inflation rate, and u is the unemployment rate, with the overbar indicating equilibrium (the NAIRU – Non-Accelerating Inflation Rate of Unemployment). Under the assumption of static expectations (π = Δp₋₁), i.e., that you expect this period’s inflation rate to hold for the next period ("the sun shines today, it will shine tomorrow"), the prediction is that inflation will accelerate if the unemployment rate is below its equilibrium level. The accompanying table below displays information on accelerating annual inflation and unemployment rate differences from the equilibrium rate (cyclical unemployment), where the latter is approximated by a five-year moving average. You think of this data as a population which you want to describe, rather than a sample from which you want to infer behavior of a larger population. The data is collected from United States quarterly data for the period 1964:1 to 1995:4. Joint Distribution of Accelerating Inflation and Cyclical Unemployment, 1964:1-1995:4

                            (u – ū) > 0 (Y = 0)   (u – ū) ≤ 0 (Y = 1)   Total
Δp – Δp₋₁ > 0 (X = 0)       0.156                 0.383                 0.539
Δp – Δp₋₁ ≤ 0 (X = 1)       0.297                 0.164                 0.461
Total                       0.453                 0.547                 1.00

(a) Compute E(Y) and E(X), and interpret both numbers.
(b) Calculate E(Y | X = 1) and E(Y | X = 0). If there was independence between cyclical unemployment and acceleration in the inflation rate, what would you expect the relationship between the two expected values to be? Given that the two means are different, is this sufficient to assume that the two variables are independent?
(c) What is the probability of inflation increasing if there is positive cyclical unemployment? Negative cyclical unemployment?
(d) You randomly select one of the 59 quarters when there was positive cyclical unemployment ((u – ū) > 0). What is the probability that there was decelerating inflation during that quarter?
Answer: (a) E(Y) = 0.547. 54.7 percent of the quarters saw zero or negative cyclical unemployment. E(X) = 0.461. 46.1 percent of the quarters saw decreasing inflation rates.
(b) E(Y | X = 1) = 0.356; E(Y | X = 0) = 0.711. You would expect the two conditional expectations to be the same. In general, independence in means does not imply statistical independence, although the reverse is true.
(c) There is a 34.4 percent probability of inflation increasing if there is positive cyclical unemployment. There is a 70 percent probability of inflation increasing if there is negative cyclical unemployment.
(d) There is a 65.6 percent probability of inflation decelerating when there is positive cyclical unemployment.


10) The accompanying table shows the joint distribution between the change of the unemployment rate in an election year and the share of the candidate of the incumbent party since 1928. You think of this data as a population which you want to describe, rather than a sample from which you want to infer behavior of a larger population.

Joint Distribution of Unemployment Rate Change and Incumbent Party's Vote Share in Total Vote Cast for the Two Major-Party Candidates, 1928-2000

                  (Incumbent – 50%) > 0   (Incumbent – 50%) ≤ 0   Total
                  (Y = 0)                 (Y = 1)
Δu > 0 (X = 0)    0.053                   0.211                   0.264
Δu ≤ 0 (X = 1)    0.579                   0.157                   0.736
Total             0.632                   0.368                   1.00

(a) Compute and interpret E(Y) and E(X).
(b) Calculate E(Y | X = 1) and E(Y | X = 0). Did you expect these to be very different?
(c) What is the probability that the unemployment rate decreases in an election year?
(d) Conditional on the unemployment rate decreasing, what is the probability that an incumbent will lose the election?
(e) What would the joint distribution look like under independence?
Answer: (a) E(Y) = 0.368; E(X) = 0.736. The probability that an incumbent receives less than 50% of the votes cast for the two major-party candidates is 0.368. The probability of observing falling unemployment rates during the election year is 73.6 percent.
(b) E(Y | X = 1) = 0.213; E(Y | X = 0) = 0.799. A student who believes that incumbents will attempt to manipulate the economy to win elections will answer affirmatively here.
(c) Pr(X = 1) = 0.736.
(d) Pr(Y = 1 | X = 1) = 0.213.
(e) Joint Distribution of Unemployment Rate Change and Incumbent Party's Vote Share in Total Vote Cast for the Two Major-Party Candidates, 1928-2000, under the Assumption of Statistical Independence

                  (Incumbent – 50%) > 0   (Incumbent – 50%) ≤ 0   Total
                  (Y = 0)                 (Y = 1)
Δu > 0 (X = 0)    0.167                   0.097                   0.264
Δu ≤ 0 (X = 1)    0.465                   0.271                   0.736
Total             0.632                   0.368                   1.00
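The independence table in part (e) follows from multiplying the marginal probabilities cell by cell. A short sketch, assuming the marginals from question 10 (the names `marg_x`, `marg_y`, and `indep` are illustrative):

```python
# Marginal distributions taken from the "Total" row and column of question 10.
marg_x = {0: 0.264, 1: 0.736}   # X = 0: unemployment rises, X = 1: it does not
marg_y = {0: 0.632, 1: 0.368}   # Y = 0: incumbent above 50%, Y = 1: at or below

# Under independence, Pr(X = x, Y = y) = Pr(X = x) * Pr(Y = y).
indep = {(x, y): round(px * py, 3)
         for x, px in marg_x.items()
         for y, py in marg_y.items()}

for x in marg_x:
    print(f"X = {x}:", indep[(x, 0)], indep[(x, 1)])
```

Comparing these cells with the actual joint distribution shows how far the data are from independence: the observed Pr(X = 0, Y = 0) of 0.053 is well below the 0.167 implied by independent marginals.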

11) The accompanying table lists the joint distribution of unemployment in the United States in 2001 by demographic characteristics (age and race).

Joint Distribution of Unemployment by Demographic Characteristics, United States, 2001

                           White (Y = 0)   Black and Other (Y = 1)   Total
Age 16-19 (X = 0)          0.13            0.05                      0.18
Age 20 and above (X = 1)   0.60            0.22                      0.82
Total                      0.73            0.27                      1.00

(a) What is the percentage of unemployed white teenagers?
(b) Calculate the conditional distribution for the categories "white" and "black and other."
(c) Given your answer in the previous question, how do you reconcile it with the fact that the probability of finding an unemployed adult white person is 60%, but only 22% for the category "black and other"?
Answer: (a) Pr(Y = 0, X = 0) = 0.13.
(b) Conditional Distribution of Unemployment by Demographic Characteristics, United States, 2001

                           White (Y = 0)   Black and Other (Y = 1)
Age 16-19 (X = 0)          0.18            0.19
Age 20 and above (X = 1)   0.82            0.81
Total                      1.00            1.00

(c) The original table showed the joint probability distribution, while the table in (b) presented the conditional probability distribution.

12) From the Stock and Watson website (http://www.pearsonhighered.com/stock_watson), download the chapter 8 CPS data set (ch8_cps.xls) into a spreadsheet program such as Excel. For the exercise, use the first 500 observations only. Using data for average hourly earnings only (ahe), describe the earnings distribution. Use summary statistics, such as the mean, median, variance, and skewness. Produce a frequency distribution ("histogram") using reasonable earnings class sizes.
Answer:

ahe
Mean                 19.79
Standard Error       0.51
Median               16.83
Mode                 19.23
Standard Deviation   11.49
Sample Variance      131.98
Kurtosis             0.23
Skewness             0.96
Range                58.44
Minimum              2.14
Maximum              60.58
Sum                  9897.45
Count                500

The mean is $19.79. The median ($16.83) is lower than the average, suggesting that the mean is being pulled up by individuals with fairly high average hourly earnings. This is confirmed by the skewness measure, which is positive, and therefore suggests a distribution with a long tail to the right. The variance is 131.98 (in dollars squared), while the standard deviation is $11.49. To generate the frequency distribution in Excel, you first have to settle on the number of class intervals. Once you have decided on these, the minimum and maximum in the data suggest the class width. In Excel, you then define "bins" (the upper limits of the class intervals). Sturges's formula, 1 + 3.31 log(n), can be used to suggest the number of class intervals, which would suggest about 9 intervals here. Instead I settled for 8 intervals with a class width of $8; minimum wages in California are currently $8 and approximately the same in other U.S. states. The table produces the absolute frequencies, and relative frequencies can be calculated in a straightforward way.

bins   Frequency   rel. freq.
8      50          0.1
16     187         0.374
24     115         0.23
32     68          0.136
40     38          0.076
48     33          0.066
56     8           0.016
66     1           0.002
More   0

Substitution of the relative frequencies into the histogram table then produces the following graph (after eliminating the gaps between the bars).


2.3 Mathematical and Graphical Problems

1) Think of an example involving five possible quantitative outcomes of a discrete random variable and attach a probability to each one of these outcomes. Display the outcomes, probability distribution, and cumulative probability distribution in a table. Sketch both the probability distribution and the cumulative probability distribution.
Answer: Answers will vary by student. The generated table should be similar to Table 2.1 in the text, and figures should resemble Figures 2.1 and 2.2 in the text.

2) The height of male students at your college/university is normally distributed with a mean of 70 inches and a standard deviation of 3.5 inches. If you had a list of telephone numbers for male students for the purpose of conducting a survey, what would be the probability of randomly calling one of these students whose height is
(a) taller than 6'0"?
(b) between 5'3" and 6'5"?
(c) shorter than 5'7", the mean height of female students?
(d) shorter than 5'0"?
(e) taller than Shaq O'Neal, the center of the Miami Heat, who is 7'1" tall? Compare this to the probability of a woman being pregnant for 10 months (300 days), where days of pregnancy is normally distributed with a mean of 266 days and a standard deviation of 16 days.
Answer: (a) Pr(Z > 0.5714) = 0.2839; (b) Pr(–2 < Z < 2) = 0.9545 or approximately 0.95; (c) Pr(Z < –0.8571) = 0.1957; (d) Pr(Z < –2.8571) = 0.0021; (e) Pr(Z > 4.2857) = 0.000009 (the text does not show values above 2.99 standard deviations; Pr(Z > 2.99) = 0.0014), and for the pregnancy comparison Pr(Z > 2.1250) = 0.0168.

3) Calculate the following probabilities using the standard normal distribution. Sketch the probability distribution in each case, shading in the area of the calculated probability.
(a) Pr(Z < 0.0)
(b) Pr(Z ≤ 1.0)
(c) Pr(Z > 1.96)
(d) Pr(Z < –2.0)
(e) Pr(Z > 1.645)
(f) Pr(Z > –1.645)
(g) Pr(–1.96 < Z < 1.96)
(h) Pr(Z < –2.576 or Z > 2.576)
(i) Pr(Z > z) = 0.10; find z.
(j) Pr(Z < –z or Z > z) = 0.05; find z.
Answer: (a) 0.5000; (b) 0.8413; (c) 0.0250; (d) 0.0228; (e) 0.0500; (f) 0.9500; (g) 0.9500; (h) 0.0100; (i) 1.2816; (j) 1.96.
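The normal probabilities in questions 2 and 3 can be reproduced without a printed table. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the dictionary names `q2` and `q3` are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()                          # standard normal: mean 0, sd 1
height = NormalDist(mu=70, sigma=3.5)     # male height in inches (question 2)
pregnancy = NormalDist(mu=266, sigma=16)  # days of pregnancy (question 2e)

# Question 2: heights converted to inches (6'0" = 72, 5'3" = 63, 6'5" = 77, ...).
q2 = {
    "a": 1 - height.cdf(72),
    "b": height.cdf(77) - height.cdf(63),
    "c": height.cdf(67),
    "d": height.cdf(60),
    "e": 1 - height.cdf(85),
    "pregnancy": 1 - pregnancy.cdf(300),
}

# Question 3: areas under the standard normal density and two critical values.
q3 = {
    "a": z.cdf(0.0),
    "b": z.cdf(1.0),
    "c": 1 - z.cdf(1.96),
    "d": z.cdf(-2.0),
    "e": 1 - z.cdf(1.645),
    "f": 1 - z.cdf(-1.645),
    "g": z.cdf(1.96) - z.cdf(-1.96),
    "h": z.cdf(-2.576) + (1 - z.cdf(2.576)),
    "i": z.inv_cdf(0.90),    # z with Pr(Z > z) = 0.10
    "j": z.inv_cdf(0.975),   # z with Pr(Z < -z or Z > z) = 0.05
}

for name, answers in (("Question 2", q2), ("Question 3", q3)):
    print(name)
    for label, value in answers.items():
        print(f"  ({label}) {value:.6f}")
```

Working with `NormalDist(mu, sigma)` directly avoids the manual standardization step, but the results are the same as looking up Z = (Y – μ)/σ in the standard normal table.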


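4) Using the fact that the standardized variable Z is a linear transformation of the normally distributed random variable Y, derive the expected value and variance of Z.
Answer:
$$Z = \frac{Y - \mu_Y}{\sigma_Y} = -\frac{\mu_Y}{\sigma_Y} + \frac{1}{\sigma_Y} Y = a + bY, \quad \text{with } a = -\frac{\mu_Y}{\sigma_Y} \text{ and } b = \frac{1}{\sigma_Y}.$$
Given (2.29) and (2.30) in the text,
$$E(Z) = -\frac{\mu_Y}{\sigma_Y} + \frac{1}{\sigma_Y}\mu_Y = 0, \quad \text{and} \quad \sigma_Z^2 = \frac{1}{\sigma_Y^2}\sigma_Y^2 = 1.$$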

5) Show in a scatterplot what the relationship between two variables X and Y would look like if there was (a) a strong negative correlation. (b) a strong positive correlation. (c) no correlation. Answer: (a)

(b)

6) What would the correlation coefficient be if all observations for the two variables were on a curve described by Y = X²?
Answer: The correlation coefficient would be zero in this case, since the relationship is non-linear.

7) Find the following probabilities:
(a) Y is distributed $\chi^2_4$. Find Pr(Y > 9.49).
(b) Y is distributed $t_\infty$. Find Pr(Y > –0.5).
(c) Y is distributed $F_{4,\infty}$. Find Pr(Y < 3.32).
(d) Y is distributed N(500, 10000). Find Pr(Y > 696 or Y < 304).
Answer: (a) 0.05. (b) 0.6915. (c) 0.99. (d) 0.05.


8) In considering the purchase of a certain stock, you attach the following probabilities to possible changes in the stock price over the next year.

Stock Price Change During Next Twelve Months (%)   Probability
+15                                                0.2
+5                                                 0.3
0                                                  0.4
–5                                                 0.05
–15                                                0.05

What is the expected value, the variance, and the standard deviation? Which is the most likely outcome? Sketch the cumulative distribution function.
Answer: E(Y) = 3.5; $\sigma_Y^2$ = 52.75; $\sigma_Y$ = 7.26; most likely: 0.
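The moments asked for in question 8 follow directly from the definitions E(Y) = Σ yᵢpᵢ and var(Y) = Σ (yᵢ – μ)²pᵢ. A short sketch applied to the table as printed (variable names are illustrative):

```python
import math

# Outcomes (percentage price changes) and probabilities as printed in question 8.
outcomes = [15, 5, 0, -5, -15]
probs = [0.2, 0.3, 0.4, 0.05, 0.05]

# Expected value: probability-weighted average of the outcomes.
mean = sum(y * p for y, p in zip(outcomes, probs))

# Variance: probability-weighted average squared deviation from the mean.
variance = sum((y - mean) ** 2 * p for y, p in zip(outcomes, probs))
std_dev = math.sqrt(variance)

# Most likely outcome: the one with the largest probability.
most_likely = outcomes[probs.index(max(probs))]

print(f"E(Y) = {mean:.2f}, var(Y) = {variance:.2f}, sd(Y) = {std_dev:.2f}")
print("most likely outcome:", most_likely)
```

The same variance can be obtained from the shortcut E(Y²) – μ², which for these numbers is 65 – 12.25.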

9) You consider visiting Montreal during the break between terms in January. You go to the relevant Web site of the official tourist office to figure out the type of clothes you should take on the trip. The site lists that the average high during January is –7° C, with a standard deviation of 4° C. Unfortunately you are more familiar with Fahrenheit than with Celsius, but find that the two are related by the following linear function:
$$C = \frac{5}{9}(F - 32).$$
Find the mean and standard deviation for the January temperature in Montreal in Fahrenheit.
Answer: Using equations (2.29) and (2.30) from the textbook, the result is 19.4 and 7.2.
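As a worked sketch of the answer to question 9: solving the conversion formula for F gives a linear function of C, so the rules for the mean and variance of a linear transformation apply directly.

$$F = 32 + \frac{9}{5}C, \qquad E(F) = 32 + \frac{9}{5}E(C) = 32 + 1.8 \times (-7) = 19.4,$$
$$\sigma_F = \frac{9}{5}\sigma_C = 1.8 \times 4 = 7.2.$$

The additive constant 32 shifts the mean but has no effect on the standard deviation.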


10) Two random variables are independently distributed if their joint distribution is the product of their marginal distributions. It is intuitively easier to understand that two random variables are independently distributed if all conditional distributions of Y given X are equal. Derive one of the two conditions from the other.
Answer: If all conditional distributions of Y given X are equal, then Pr(Y = y | X = 1) = Pr(Y = y | X = 2) = ... = Pr(Y = y | X = l). But if all conditional distributions are equal, then they must also equal the marginal distribution, i.e., Pr(Y = y | X = x) = Pr(Y = y). Given the definition of the conditional distribution of Y given X = x, you then get
$$\Pr(Y = y \mid X = x) = \frac{\Pr(Y = y, X = x)}{\Pr(X = x)} = \Pr(Y = y),$$
which gives you the condition Pr(Y = y, X = x) = Pr(Y = y) Pr(X = x).

11) There are frequently situations where you have information on the conditional distribution of Y given X, but are interested in the conditional distribution of X given Y. Recalling
$$\Pr(Y = y \mid X = x) = \frac{\Pr(X = x, Y = y)}{\Pr(X = x)},$$
derive a relationship between Pr(X = x | Y = y) and Pr(Y = y | X = x). This is called Bayes' theorem.
Answer: Given $\Pr(Y = y \mid X = x) = \frac{\Pr(X = x, Y = y)}{\Pr(X = x)}$, Pr(Y = y | X = x) × Pr(X = x) = Pr(X = x, Y = y); similarly $\Pr(X = x \mid Y = y) = \frac{\Pr(X = x, Y = y)}{\Pr(Y = y)}$ and Pr(X = x | Y = y) × Pr(Y = y) = Pr(X = x, Y = y). Equating the two and solving for Pr(X = x | Y = y) then results in
$$\Pr(X = x \mid Y = y) = \frac{\Pr(Y = y \mid X = x) \times \Pr(X = x)}{\Pr(Y = y)}.$$

12) You are at a college of roughly 1,000 students and obtain data from the entire freshman class (250 students) on height and weight during orientation. You consider this to be a population that you want to describe, rather than a sample from which you want to infer general relationships in a larger population. Weight (Y) is measured in pounds and height (X) is measured in inches. You calculate the following sums:
$$\sum_{i=1}^{n} y_i^2 = 94{,}228.8, \quad \sum_{i=1}^{n} x_i^2 = 1{,}248.9, \quad \sum_{i=1}^{n} x_i y_i = 7{,}625.9$$
(small letters refer to deviations from means as in $z_i = Z_i - \bar{Z}$).
(a) Given your general knowledge about human height and weight of a given age, what can you say about the shape of the two distributions?
(b) What is the correlation coefficient between height and weight here?
Answer: (a) Both distributions are bound to be normal. (b) 0.703.
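The correlation in question 12(b) can be computed from the three sums alone, because the population size cancels from numerator and denominator. A small sketch, assuming the first two sums are sums of squared deviations (the superscripts are inferred from the stated answer of 0.703):

```python
import math

# Sums of deviations from means, as given in question 12.
sum_y2 = 94228.8   # sum of squared weight deviations (assumed squared)
sum_x2 = 1248.9    # sum of squared height deviations (assumed squared)
sum_xy = 7625.9    # sum of cross products of deviations

# corr(X, Y) = cov(X, Y) / (sd(X) * sd(Y)); dividing each sum by n
# would cancel, so the sums can be used directly.
corr = sum_xy / math.sqrt(sum_x2 * sum_y2)

print(f"corr(X, Y) = {corr:.3f}")
```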


13) Use the definition for the conditional distribution of Y given X = x and the marginal distribution of X to derive the formula for Pr(X = x, Y = y). This is called the multiplication rule. Use it to derive the probability for drawing two aces randomly from a deck of cards (no joker), where you do not replace the card after the first draw. Next, generalizing the multiplication rule and assuming independence, find the probability of having four girls in a family with four children.
Answer: $\frac{4}{52} \times \frac{3}{51} = 0.0045$; $\left(\frac{1}{2}\right)^4 = \frac{1}{16}$, or 0.0625.

14) The systolic blood pressure of females in their 20s is normally distributed with a mean of 120 and a standard deviation of 9. What is the probability of finding a female with a blood pressure of less than 100? More than 135? Between 105 and 123? You visit the women's soccer team on campus, and find that the average blood pressure of the 25 members is 114. Is it likely that this group of women came from the same population?
Answer: Pr(Y < 100) = 0.0131; Pr(Y > 135) = 0.0478; Pr(105 < Y < 123) = 0.5828; Pr(Ȳ < 114) = Pr(Z < –3.33) = 0.0004. (The smallest z-value listed in the table in the textbook is –2.99, which generates a probability value of 0.0014.) It is unlikely that this group of women came from the same population.

15) Show that the correlation coefficient between Y and X is unaffected if you use a linear transformation in both variables. That is, show that corr(X, Y) = corr(X*, Y*), where X* = a + bX and Y* = c + dY, and where a, b, c, and d are arbitrary non-zero constants.
Answer:
$$\text{corr}(X^*, Y^*) = \frac{\text{cov}(X^*, Y^*)}{\sqrt{\text{var}(X^*)\,\text{var}(Y^*)}} = \frac{bd\,\text{cov}(X, Y)}{\sqrt{b^2\,\text{var}(X)\,d^2\,\text{var}(Y)}} = \text{corr}(X, Y),$$
where the last equality holds when b and d have the same sign (if their signs differ, the correlation changes sign but not magnitude).
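The numerical answers to questions 13 and 14 can be checked with the standard library. This is a rough sketch: exact fractions are used for the card and family probabilities, and `statistics.NormalDist` for the blood pressure questions; all variable names are illustrative.

```python
from fractions import Fraction
from statistics import NormalDist

# Question 13: multiplication rule without replacement, then with independence.
p_two_aces = Fraction(4, 52) * Fraction(3, 51)   # Pr(ace) * Pr(ace | first ace)
p_four_girls = Fraction(1, 2) ** 4               # four independent births

# Question 14: individual blood pressure is N(120, 9^2).
bp = NormalDist(mu=120, sigma=9)
p_below_100 = bp.cdf(100)
p_above_135 = 1 - bp.cdf(135)
p_105_to_123 = bp.cdf(123) - bp.cdf(105)

# The mean of n = 25 draws has standard deviation 9 / sqrt(25) = 1.8.
team_mean = NormalDist(mu=120, sigma=9 / 25 ** 0.5)
p_team_below_114 = team_mean.cdf(114)

print(float(p_two_aces), float(p_four_girls))
print(round(p_below_100, 4), round(p_above_135, 4), round(p_105_to_123, 4))
print(round(p_team_below_114, 4))
```

Note that the probability for an interval is the difference of the two cumulative probabilities, not their sum.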

16) The textbook formula for the variance of the discrete random variable Y is given as
$$\sigma_Y^2 = \sum_{i=1}^{k} (y_i - \mu_Y)^2 p_i.$$
Another commonly used formulation is
$$\sigma_Y^2 = \sum_{i=1}^{k} y_i^2 p_i - \mu_Y^2.$$
Prove that the two formulas are the same.
Answer:
$$\sigma_Y^2 = \sum_{i=1}^{k} (y_i - \mu_Y)^2 p_i = \sum_{i=1}^{k} (y_i^2 + \mu_Y^2 - 2\mu_Y y_i) p_i = \sum_{i=1}^{k} (y_i^2 p_i + \mu_Y^2 p_i - 2\mu_Y y_i p_i).$$
Moving the summation sign through results in
$$\sigma_Y^2 = \sum_{i=1}^{k} y_i^2 p_i + \mu_Y^2 \sum_{i=1}^{k} p_i - 2\mu_Y \sum_{i=1}^{k} y_i p_i.$$
But $\sum_{i=1}^{k} p_i = 1$ and $\mu_Y = \sum_{i=1}^{k} y_i p_i$, giving you the second expression after simplification.


17) The Economic Report of the President gives the following age distribution of the United States population for the year 2000:

United States Population By Age Group, 2000

Outcome (age category)   Percentage
Under 5                  0.06
5-15                     0.16
16-19                    0.06
20-24                    0.07
25-44                    0.30
45-64                    0.22
65 and over              0.13

Imagine that every person was assigned a unique number between 1 and 275,372,000 (the total population in 2000). If you generated a random number, what would be the probability that you had drawn someone older than 65 or under 16? Treating the percentages as probabilities, write down the cumulative probability distribution. What is the probability of drawing someone who is 24 years or younger?
Answer: Pr(Y < 16 or Y > 65) = 0.35;

Outcome (age category)   Cumulative probability distribution
Under 5                  0.06
5-15                     0.22
16-19                    0.28
20-24                    0.35
25-44                    0.65
45-64                    0.87
65 and over              1.00

Pr(Y ≤ 24) = 0.35.

18) The accompanying table gives the outcomes and probability distribution of the number of times a student checks her e-mail daily:

Probability of Checking E-Mail

Outcome (number of e-mail checks)   Probability distribution
0                                   0.05
1                                   0.15
2                                   0.30
3                                   0.25
4                                   0.15
5                                   0.08
6                                   0.02

Sketch the probability distribution. Next, calculate the c.d.f. for the above table. What is the probability of her checking her e-mail between 1 and 3 times a day? Of checking it more than 3 times a day?
Answer:

Outcome (number of e-mail checks)   Cumulative probability distribution
0                                   0.05
1                                   0.20
2                                   0.50
3                                   0.75
4                                   0.90
5                                   0.98
6                                   1.00

Pr(1 ≤ Y ≤ 3) = 0.70; Pr(Y > 3) = 0.25.
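A cumulative distribution is just the running sum of the probabilities, which `itertools.accumulate` computes directly. The sketch below uses the e-mail probabilities from question 18 and assumes the outcomes are 0 through 6 checks per day (the outcome labels are inferred from the stated answers, not printed in the table).

```python
from itertools import accumulate

outcomes = list(range(7))   # assumed outcomes: 0, 1, ..., 6 e-mail checks
probs = [0.05, 0.15, 0.30, 0.25, 0.15, 0.08, 0.02]

# c.d.f.: running sum of the probabilities, rounded to avoid float noise.
cdf = [round(c, 2) for c in accumulate(probs)]

# Pr(1 <= Y <= 3) and Pr(Y > 3), summing the relevant probabilities.
p_1_to_3 = round(sum(p for y, p in zip(outcomes, probs) if 1 <= y <= 3), 2)
p_more_than_3 = round(sum(p for y, p in zip(outcomes, probs) if y > 3), 2)

for y, c in zip(outcomes, cdf):
    print(y, c)
print("Pr(1 <= Y <= 3) =", p_1_to_3)
print("Pr(Y > 3) =", p_more_than_3)
```

Equivalently, Pr(Y > 3) is one minus the c.d.f. evaluated at 3, and Pr(1 ≤ Y ≤ 3) is the c.d.f. at 3 minus the c.d.f. at 0.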


19) The accompanying table lists the outcomes and the probability distribution for a student renting videos during the week while on campus.

Video Rentals per Week during Semester

Outcome (number of weekly video rentals)   Probability distribution
0                                          0.05
1                                          0.55
2                                          0.25
3                                          0.05
4                                          0.07
5                                          0.02
6                                          0.01

Sketch the probability distribution. Next, calculate the cumulative probability distribution for the above table. What is the probability of the student renting between 2 and 4 a week? Of less than 3 a week?
Answer: The cumulative probability distribution is given below. The probability of renting between two and four videos a week is 0.37. The probability of renting less than three a week is 0.85.

Outcome (number of weekly video rentals)   Cumulative probability distribution
0                                          0.05
1                                          0.60
2                                          0.85
3                                          0.90
4                                          0.97
5                                          0.99
6                                          1.00

20) The textbook mentioned that the mean of Y, E(Y), is called the first moment of Y, and that the expected value of the square of Y, E(Y²), is called the second moment of Y, and so on. These are also referred to as moments about the origin. A related concept is moments about the mean, which are defined as $E[(Y - \mu_Y)^r]$. What do you call the second moment about the mean? What do you think the third moment, referred to as "skewness," measures? Do you believe that it would be positive or negative for an earnings distribution? What measure of the third moment around the mean do you get for a normal distribution?
Answer: The second moment about the mean is the variance. Skewness measures the departure from symmetry. For the typical earnings distribution, it will be positive. For the normal distribution, it will be zero.

21) Explain why the two probabilities are identical for the standard normal distribution: Pr(–1.96 ≤ X ≤ 1.96) and Pr(–1.96 < X < 1.96).
Answer: For a continuous distribution, the probability of a point is zero.


22) SAT scores in Mathematics are normally distributed with a mean of 500 and a standard deviation of 100. The formula for the normal distribution is
$$f(Y) = \frac{1}{\sqrt{2\pi\sigma_Y^2}}\, e^{-\frac{1}{2}\left(\frac{Y - \mu_Y}{\sigma_Y}\right)^2}.$$
Use the scatter plot option in a standard spreadsheet program, such as Excel, to plot the Mathematics SAT distribution using this formula. Start by entering 300 as the first SAT score in the first column (the lowest score you can get in the mathematics section as long as you fill in your name correctly), and then increment the scores by 10 until you reach 800. In the second column, use the formula for the normal distribution and calculate f(Y). Then use the scatter plot option, where you eventually remove markers and substitute these with the solid line option.
Answer: (Figure: plot of the normal density f(Y) against Mathematics SAT scores from 300 to 800.)
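The two spreadsheet columns described in question 22 can also be generated in code. The sketch below builds the score column (300 to 800 in steps of 10) and the density column from the normal formula; the function name `normal_pdf` is illustrative, and the plotting step itself is left to whichever charting tool is at hand.

```python
import math

MU = 500      # mean of the Mathematics SAT score
SIGMA = 100   # standard deviation of the Mathematics SAT score

def normal_pdf(y, mu=MU, sigma=SIGMA):
    """Normal density f(y) = exp(-0.5 * ((y - mu) / sigma)^2) / sqrt(2 * pi * sigma^2)."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma ** 2)

# First column: scores from 300 to 800 inclusive, in increments of 10.
scores = list(range(300, 801, 10))

# Second column: the density evaluated at each score.
density = [normal_pdf(s) for s in scores]

for s, f in zip(scores, density):
    print(s, f"{f:.6f}")
```

The resulting curve is symmetric around 500 and peaks there, which is a quick visual check that the formula was entered correctly.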

23) Use a standard spreadsheet program, such as Excel, to find the following probabilities from various distributions analyzed in the current chapter:
a. If Y is distributed N(1,4), find Pr(Y ≤ 3)
b. If Y is distributed N(3,9), find Pr(Y > 0)
c. If Y is distributed N(50,25), find Pr(40 ≤ Y ≤ 52)
d. If Y is distributed N(5,2), find Pr(6 ≤ Y ≤ 8)
Answer: The answers here are given together with the relevant Excel commands.
a. =NORMDIST(3,1,2,TRUE) = 0.8413
b. =1-NORMDIST(0,3,3,TRUE) = 0.8413
c. =NORMDIST(52,50,5,TRUE)-NORMDIST(40,50,5,TRUE) = 0.6326
d. =NORMDIST(8,5,SQRT(2),TRUE)-NORMDIST(6,5,SQRT(2),TRUE) = 0.2229
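For readers without a spreadsheet at hand, the Excel commands in question 23 have close counterparts in Python's standard library. In this sketch, `NormalDist(mu, sigma).cdf(x)` plays the role of `NORMDIST(x, mu, sigma, TRUE)`; as in the Excel commands, the second argument is the standard deviation, so the variances 4, 9, 25, and 2 become 2, 3, 5, and √2.

```python
import math
from statistics import NormalDist

# a. Y ~ N(1, 4): Pr(Y <= 3)
p_a = NormalDist(1, 2).cdf(3)

# b. Y ~ N(3, 9): Pr(Y > 0)
p_b = 1 - NormalDist(3, 3).cdf(0)

# c. Y ~ N(50, 25): Pr(40 <= Y <= 52)
dist_c = NormalDist(50, 5)
p_c = dist_c.cdf(52) - dist_c.cdf(40)

# d. Y ~ N(5, 2): Pr(6 <= Y <= 8)
dist_d = NormalDist(5, math.sqrt(2))
p_d = dist_d.cdf(8) - dist_d.cdf(6)

for label, p in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(f"{label}. {p:.4f}")
```

Small differences in the fourth decimal place relative to table-based answers come from rounding in the printed tables.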


24) Looking at a large CPS data set with over 60,000 observations for the United States and the year 2004, you find that the average number of years of education is approximately 13.6. However, a surprisingly large number of individuals (approximately 800) have quite a low value for this variable, namely 6 years or less. You decide to drop these observations, since none of your relatives or friends have that few years of education. In addition, you are concerned that if these individuals cannot report the years of education correctly, then the observations on other variables, such as average hourly earnings, can also not be trusted. As a matter of fact you have found several of these to be below minimum wages in your state. Discuss if dropping the observations is reasonable.
Answer: While it is always a good idea to check the data carefully before conducting a quantitative analysis, you should never drop data before carefully thinking about the problem at hand. While it is not plausible to find many individuals in the U.S. who were raised here with that few years of education, there will be immigrants in the survey. Average years of education can be quite low in other countries. For example, Brazil's average years of schooling is less than 6 years. The point of the exercise is to think hard about whether observations are outliers generated by faulty data entry or whether there is a reason for observing values which may appear strange at first.

25) Use a standard spreadsheet program, such as Excel, to find the following probabilities from various distributions analyzed in the current chapter:
a. If Y is distributed $\chi^2_4$, find Pr(Y ≤ 7.78)
b. If Y is distributed $\chi^2_{10}$, find Pr(Y > 18.31)
c. If Y is distributed $F_{10,\infty}$, find Pr(Y > 1.83)
d. If Y is distributed $t_{15}$, find Pr(Y > 1.75)
e. If Y is distributed $t_{90}$, find Pr(–1.99 ≤ Y ≤ 1.99)
f. If Y is distributed N(0,1), find Pr(–1.99 ≤ Y ≤ 1.99)
g. If Y is distributed $F_{7,4}$, find Pr(Y > 4.12)
h. If Y is distributed $F_{7,120}$, find Pr(Y > 2.79)
Answer: The answers here are given together with the relevant Excel commands.
a. =1-CHIDIST(7.78,4) = 0.90
b. =CHIDIST(18.31,10) = 0.05
c. =FDIST(1.83,10,1000000) = 0.05
d. =TDIST(1.75,15,1) = 0.05
e. =1-TDIST(1.99,90,2) = 0.95
f. =NORMDIST(1.99,0,1,1)-NORMDIST(-1.99,0,1,1) = 0.953
g. =FDIST(4.12,7,4) = 0.10
h. =FDIST(2.79,7,120) = 0.01


Chapter 3 Review of Statistics
3.1 Multiple Choice

1) An estimator is A) an estimate. B) a formula that gives an efficient guess of the true population value. C) a random variable. D) a nonrandom number. Answer: C

2) An estimate is A) efficient if it has the smallest variance possible. B) a nonrandom number. C) unbiased if its expected value equals the population value. D) another word for estimator. Answer: B

3) An estimator $\hat{\mu}_Y$ of the population value $\mu_Y$ is unbiased if
A) $\hat{\mu}_Y = \mu_Y$.
B) $\bar{Y}$ has the smallest variance of all estimators.
C) $\bar{Y} \xrightarrow{p} \mu_Y$.
D) $E(\hat{\mu}_Y) = \mu_Y$.
Answer: D

4) An estimator $\hat{\mu}_Y$ of the population value $\mu_Y$ is consistent if
A) $\hat{\mu}_Y \xrightarrow{p} \mu_Y$.
B) its mean square error is the smallest possible.
C) $\bar{Y}$ is normally distributed.
D) $\bar{Y} \xrightarrow{p} 0$.
Answer: A

5) An estimator $\hat{\mu}_Y$ of the population value $\mu_Y$ is more efficient when compared to another estimator $\tilde{\mu}_Y$, if
A) $E(\hat{\mu}_Y) > E(\tilde{\mu}_Y)$.
B) it has a smaller variance.
C) its c.d.f. is flatter than that of the other estimator.
D) both estimators are unbiased, and $\text{var}(\hat{\mu}_Y) < \text{var}(\tilde{\mu}_Y)$.
Answer: D

6) With i.i.d. sampling each of the following is true except
A) $E(\bar{Y}) = \mu_Y$.
B) $\text{var}(\bar{Y}) = \sigma_Y^2 / n$.
C) $E(\bar{Y}) < E(Y)$.
D) $\bar{Y}$ is a random variable.
Answer: C


7) The standard error of $\bar{Y}$, $SE(\bar{Y}) = \hat{\sigma}_{\bar{Y}}$, is given by the following formula:
A) $\frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$.
B) $S_Y^2$.
C) $S_Y$.
D) $\frac{S_Y}{\sqrt{n}}$.

Answer: D 8) The critical value of a two-sided t-test computed from a large sample A) is 1.64 if the significance level of the test is 5%. B) cannot be calculated unless you know the degrees of freedom. C) is 1.96 if the significance level of the test is 5%. D) is the same as the p-value. Answer: C 9) A type I error is A) always the same as (1-type II) error. B) the error you make when rejecting the null hypothesis when it is true. C) the error you make when rejecting the alternative hypothesis when it is true. D) always 5%. Answer: B 10) A type II error A) is typically smaller than the type I error. B) is the error you make when choosing type II or type I. C) is the error you make when not rejecting the null hypothesis when it is false. D) cannot be calculated when the alternative hypothesis contains an ʺ=ʺ. Answer: C 11) The size of the test A) is the probability of committing a type I error. B) is the same as the sample size. C) is always equal to (1-the power of test). D) can be greater than 1 in extreme examples. Answer: A 12) The power of the test is A) dependent on whether you calculate a t or a t2 statistic. B) one minus the probability of committing a type I error. C) a subjective view taken by the econometrician dependent on the situation. D) one minus the probability of committing a type II error. Answer: D


13) When you are testing a hypothesis against a two-sided alternative, then the alternative is written as A) $E(Y) > \mu_{Y,0}$. B) $E(Y) = \mu_{Y,0}$. C) $\bar{Y} \neq \mu_{Y,0}$. D) $E(Y) \neq \mu_{Y,0}$. Answer: D

14) A scatterplot A) shows how Y and X are related when their relationship is scattered all over the place. B) relates the covariance of X and Y to the correlation coefficient. C) is a plot of n observations on $X_i$ and $Y_i$, where each observation is represented by the point $(X_i, Y_i)$. D) shows n observations of Y over time. Answer: C

15) The following types of statistical inference are used throughout econometrics, with the exception of A) confidence intervals. B) hypothesis testing. C) calibration. D) estimation. Answer: C

16) Among all unbiased estimators that are weighted averages of $Y_1, \ldots, Y_n$, $\bar{Y}$ is A) the only consistent estimator of $\mu_Y$. B) the most efficient estimator of $\mu_Y$. C) a number which, by definition, cannot have a variance. D) the most unbiased estimator of $\mu_Y$. Answer: B

17) To derive the least squares estimator $\hat{\mu}_Y$, you find the estimator m which minimizes
A) $\sum_{i=1}^{n} (Y_i - m)^2$.
B) $\sum_{i=1}^{n} |Y_i - m|$.
C) $\sum_{i=1}^{n} m Y_i^2$.
D) $\sum_{i=1}^{n} (Y_i - m)$.
Answer: A

18) If the null hypothesis states $H_0: E(Y) = \mu_{Y,0}$, then a two-sided alternative hypothesis is A) $H_1: E(Y) \neq \mu_{Y,0}$. B) $H_1: E(Y) \approx \mu_{Y,0}$. C) $H_1: \mu_Y < \mu_{Y,0}$. D) $H_1: E(Y) > \mu_{Y,0}$. Answer: A

19) The p-value is defined as follows:
A) p = 0.05.
B) $\Pr_{H_0}\left[\,|\bar{Y} - \mu_{Y,0}| > |\bar{Y}^{act} - \mu_{Y,0}|\,\right]$.
C) Pr(z > 1.96).
D) $\Pr_{H_0}\left[\,|\bar{Y} - \mu_{Y,0}| < |\bar{Y}^{act} - \mu_{Y,0}|\,\right]$.
Answer: B

20) A large p-value implies A) rejection of the null hypothesis. B) a large t-statistic. C) a large $\bar{Y}^{act}$. D) that the observed value $\bar{Y}^{act}$ is consistent with the null hypothesis. Answer: D

21) The formula for the sample variance is
A) $S_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})$.
B) $S_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$.
C) $S_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \mu_Y)^2$.
D) $S_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n-1} (Y_i - \bar{Y})^2$.
Answer: B

22) Degrees of freedom A) in the context of the sample variance formula means that estimating the mean uses up some of the information in the data. B) is something that certain undergraduate majors at your university/college other than economics seem to have an ∞ amount of. C) are (n-2) when replacing the population mean by the sample mean. D) ensure that $S_Y^2 = \sigma_Y^2$. Answer: A


23) The t-statistic is defined as follows: A) t =

Y – μY,0 2 σY

n B) t =

C) t =

Y – μY,0 SE(Y)

(Y – μY,0)2 SE(Y)

D) 1.96. Answer: A

24) The power of the test A) is the probability that the test actually incorrectly rejects the null hypothesis when the null is true. B) depends on whether you use $\bar{Y}$ or $\bar{Y}^2$ for the t-statistic. C) is one minus the size of the test. D) is the probability that the test correctly rejects the null when the alternative is true. Answer: D

25) The sample covariance can be calculated in any of the following ways, with the exception of:
A) $\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$.
B) $\frac{1}{n-1} \sum_{i=1}^{n} X_i Y_i - \frac{n}{n-1} \bar{X}\bar{Y}$.
C) $\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_X)(Y_i - \mu_Y)$.
D) $r_{XY} S_X S_Y$, where $r_{XY}$ is the correlation coefficient.
Answer: C

26) When the sample size n is large, the 90% confidence interval for $\mu_Y$ is A) $\bar{Y} \pm 1.96\,SE(\bar{Y})$. B) $\bar{Y} \pm 1.64\,SE(\bar{Y})$. C) $\bar{Y} \pm 1.64\,\sigma_Y$. D) $\bar{Y} \pm 1.96$. Answer: B


27) The standard error for the difference in means if two random variables M and W , when the two population variances are different, is 2 2 S M+ S W A)

SM SW . + nM n W 2 SM

2 SW

1 ( ). + 2 nM nW 2 SM

nM + n W

2 SW . nW

Answer: D

28) The t-statistic has the following distribution: A) standard normal distribution for n < 15. B) Student t distribution with n–1 degrees of freedom regardless of the distribution of the Y. C) Student t distribution with n–1 degrees of freedom if the Y is normally distributed. D) a standard normal distribution if the sample standard deviation goes to zero. Answer: C

29) The following statement about the sample correlation coefficient is true.
A) $-1 \leq r_{XY} \leq 1$.
B) $r_{XY}^2 \xrightarrow{p} \text{corr}(X_i, Y_i)$.
C) $|r_{XY}| < 1$.
D) $r_{XY} = \frac{s_{XY}^2}{s_X^2 s_Y^2}$.
Answer: A

30) The correlation coefficient A) lies between zero and one. B) is a measure of linear association. C) is close to one if X causes Y. D) takes on a high value if you have a strong nonlinear relationship. Answer: B


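31) When testing for differences of means, the t-statistic
$$t = \frac{\bar{Y}_m - \bar{Y}_w}{SE(\bar{Y}_m - \bar{Y}_w)}, \quad \text{where } SE(\bar{Y}_m - \bar{Y}_w) = \sqrt{\frac{s_m^2}{n_m} + \frac{s_w^2}{n_w}},$$
has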

A) a Student t distribution if the population distribution of Y is not normal. B) a Student t distribution if the population distribution of Y is normal. C) a normal distribution even in small samples. D) cannot be computed unless $n_w = n_m$. Answer: B

32) When testing for differences of means, you can base statistical inference on the A) Student t distribution in general. B) normal distribution regardless of sample size. C) Student t distribution if the underlying population distribution of Y is normal, the two groups have the same variances, and you use the pooled standard error formula. D) Chi-squared distribution with ($n_w + n_m - 2$) degrees of freedom. Answer: C

33) Assume that you have 125 observations on the height (H) and weight (W) of your peers in college. Let $s_{HW}$ = 68, $s_H$ = 3.5, $s_W$ = 29. The sample correlation coefficient is A) 1.22 B) 0.50 C) 0.67 D) Cannot be computed since males and females have not been separated out. Answer: C

34) You have collected data on the average weekly amount of studying time (T) and grades (G) from the peers at your college. Changing the measurement from minutes into hours has the following effect on the correlation coefficient: A) decreases the $r_{TG}$ by dividing the original correlation coefficient by 60. B) results in a higher $r_{TG}$. C) cannot be computed since some students study less than an hour per week. D) does not change the $r_{TG}$. Answer: D

35) A low correlation coefficient implies that A) the line always has a flat slope. B) in the scatterplot, the points fall quite far away from the line. C) the two variables are unrelated. D) you should use a tighter scale of the vertical and horizontal axis to bring the observations closer to the line. Answer: B
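As a worked sketch for questions 33 and 34: the sample correlation coefficient is the sample covariance divided by the product of the two sample standard deviations, so with the values given in question 33,

$$r_{HW} = \frac{s_{HW}}{s_H\, s_W} = \frac{68}{3.5 \times 29} = \frac{68}{101.5} \approx 0.67.$$

For question 34, measuring study time in hours rather than minutes replaces T by T/60. Both the covariance and the standard deviation of T are then divided by 60, and the factor cancels:

$$r_{(T/60),G} = \frac{\frac{1}{60}\, s_{TG}}{\frac{1}{60}\, s_T\, s_G} = r_{TG}.$$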

3.2 Essays and Longer Questions 1) Think of at least nine examples, three of each, that display a positive, negative, or no correlation between two economic variables. In each of the positive and negative examples, indicate whether or not you expect the correlation to be strong or weak. Answer: Answers will vary by student. Students frequently bring up the following correlations. Positive correlations: earnings and education (hopefully strong), consumption and personal disposable income (strong), per capita income and investment-output ratio or saving rate (strong); negative correlation: Okun’s Law (strong), income velocity and interest rates (strong), the Phillips curve (strong); no correlation: productivity growth and initial level of per capita income for all countries of the world (beta-convergence regressions), consumption and the (real) interest rate, employment and real wages.


2) Adult males are taller, on average, than adult females. Visiting two recent American Youth Soccer Organization (AYSO) under 12 year old (U12) soccer matches on a Saturday, you do not observe an obvious difference in the height of boys and girls of that age. You suggest to your little sister that she collect data on height and gender of children in 4th to 6th grade as part of her science project. The accompanying table shows her findings. Height of Young Boys and Girls, Grades 4-6, in inches

Group    Mean height (Ȳ)    Standard deviation (s)    Sample size (n)
Boys     57.8               3.9                       55
Girls    58.4               4.2                       57

(a) Let your null hypothesis be that there is no difference in the height of females and males at this age level. Specify the alternative hypothesis.
(b) Find the difference in height and the standard error of the difference.
(c) Generate a 95% confidence interval for the difference in height.
(d) Calculate the t-statistic for comparing the two means. Is the difference statistically significant at the 1% level? Which critical value did you use? Why would this number be smaller if you had assumed a one-sided alternative hypothesis? What is the intuition behind this?
Answer: (a) \( H_0: \mu_{Boys} - \mu_{Girls} = 0 \) vs. \( H_1: \mu_{Boys} - \mu_{Girls} \neq 0 \).
(b) \( \bar{Y}_{Boys} - \bar{Y}_{Girls} = -0.6 \), \( SE(\bar{Y}_{Boys} - \bar{Y}_{Girls}) = \sqrt{\frac{3.9^2}{55} + \frac{4.2^2}{57}} = 0.77 \).
(c) \( -0.6 \pm 1.96 \times 0.77 = (-2.11, 0.91) \).
(d) t = -0.78, so |t| < 2.58, which is the critical value at the 1% level. Hence you cannot reject the null hypothesis. The critical value for the one-sided hypothesis would have been 2.33. Assuming a one-sided hypothesis implies that you have some information about the problem at hand, and, as a result, can be more easily convinced than if you had no prior expectation.
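The calculation in (b) through (d) can be sketched in a few lines of Python. This is only an illustration: the function name `diff_in_means` is made up for this example, and the inputs are the boys' and girls' means and standard deviations together with the sample sizes of 55 and 57 used in the answer to (b). Because the code does not round the standard error to 0.77 before building the interval, its endpoints can differ from the answer above in the second decimal.

```python
import math

def diff_in_means(mean_a, s_a, n_a, mean_b, s_b, n_b, z=1.96):
    """Return (difference, standard error, t-statistic, confidence interval)."""
    diff = mean_a - mean_b
    # Unpooled standard error: sqrt(s_a^2 / n_a + s_b^2 / n_b)
    se = math.sqrt(s_a**2 / n_a + s_b**2 / n_b)
    t = diff / se
    ci = (diff - z * se, diff + z * se)
    return diff, se, t, ci

# Boys vs. girls, heights in inches
diff, se, t, ci = diff_in_means(57.8, 3.9, 55, 58.4, 4.2, 57)
print(f"difference = {diff:.2f}, SE = {se:.2f}, t = {t:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("reject at 1% (two-sided)?", abs(t) > 2.58)
```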


3) Math SAT scores (Y) are normally distributed with a mean of 500 and a standard deviation of 100. An evening school advertises that it can improve students’ scores by roughly a third of a standard deviation, or 30 points, if they attend a course which runs over several weeks. (A similar claim is made for attending a verbal SAT course.) The statistician for a consumer protection agency suspects that the courses are not effective. She views the situation as follows: H0 : μY = 500 vs. H1 : μY = 530. (a) Sketch the two distributions under the null hypothesis and the alternative hypothesis. (b) The consumer protection agency wants to evaluate this claim by sending 50 students to attend classes. One of the students becomes sick during the course and drops out. What is the distribution of the average score of the remaining 49 students under the null, and under the alternative hypothesis? (c) Assume that after graduating from the course, the 49 participants take the SAT test and score an average of 520. Is this convincing evidence that the school has fallen short of its claim? What is the p-value for such a score under the null hypothesis? (d) What would be the critical value under the null hypothesis if the size of your test were 5%? (e) Given this critical value, what is the power of the test? What options does the statistician have for increasing the power in this situation? Answer: (a)

(b) \( \bar{Y} \) of the 49 participants is normally distributed, with a mean of 500 and a standard deviation of 14.286 under the null hypothesis. Under the alternative hypothesis, it is normally distributed with a mean of 530 and a standard deviation of 14.286. (c) It is possible that the consumer protection agency had chosen a group of 49 students whose average score would have been 490 without attending the course. The crucial question is how likely it is that 49 students, chosen randomly from a population with a mean of 500 and a standard deviation of 100, will score an average of 520. The p-value for this score is 0.081, meaning that if the agency rejected the null hypothesis based on this evidence, it would make a mistake, on average, roughly 1 out of 12 times. Hence the average score of 520 would allow rejection of the null hypothesis that the school has had no effect on the SAT score of students at the 10% level. (d) The critical value would be 523. (e) \( \Pr(\bar{Y} < 523 \mid H_1 \text{ is true}) = 0.312 \). Hence the power of the test is 0.688. She could increase the power by increasing the size of the test, that is, by accepting a larger probability of rejecting a true null hypothesis. Alternatively, she could try to convince the agency to hire more test subjects, i.e., she could increase the sample size.
4) Your packaging company fills various types of flour into bags. Recently there have been complaints from one chain of stores: a customer returned one opened 5 pound bag which weighed significantly less than the label indicated. You view the weight of the bag as a random variable which is normally distributed with a mean of 5 pounds, and, after studying the machine specifications, a standard deviation of 0.05 pounds. (a) You take a sample of 20 bags and weigh them. Sketch below what the average pattern of individual weights might look like. Let the horizontal axis indicate the sampled bag number (1, 2, …, 20). On the vertical axis, mark the expected value of the weight under the null hypothesis, and two (≈ 1.96) standard deviations above and below the expected value.
Draw a line through the graph for E(Y) + 2σY, E(Y), and E(Y) - 2σY. How many of the bags in a sample of 20 will you expect to weigh either less than 4.9 pounds or more than 5.1 pounds? (b) You sample 25 bags of flour and calculate the average weight. What is the distribution of the average weight of these 25 bags? Repeating the same exercise 20 times, sketch what the distribution of the average weights would look like in a graph similar to the one you drew in (a), where you have adjusted the standard error of \( \bar{Y} \) accordingly. (c) For each of the twenty observations in (b) a 95% confidence interval is constructed. Draw these confidence intervals, using the same graph as in (b). How many of these 20 confidence intervals would you expect to contain 5 pounds under the null hypothesis? Answer: (a) On average, there should be one bag in every sample of 20 which weighs less than 4.9 pounds or more than 5.1 pounds.

(b) The average weight of 25 bags will be normally distributed, with a mean of 5 pounds and a standard deviation of 0.01 pounds. (Same graph as in (a), but with the following lower and upper bounds.)


5) Assume that two presidential candidates, call them Bush and Gore, receive 50% of the votes in the population. You can model this situation as a Bernoulli trial, where Y is a random variable with success probability Pr(Y = 1) = p, and where Y = 1 if a person votes for Bush and Y = 0 otherwise. Furthermore, let \( \hat{p} \) be the fraction of successes (1s) in a sample, which is distributed \( N\left(p, \frac{p(1-p)}{n}\right) \) in reasonably large samples, say for n ≥ 40.
(a) Given your knowledge about the population, find the probability that in a random sample of 40, Bush would receive a share of 40% or less.
(b) How would this situation change with a random sample of 100?
(c) Given your answers in (a) and (b), would you be comfortable to predict what the voting intentions for the entire population are if you did not know p but had polled 10,000 individuals at random and calculated \( \hat{p} \)? Explain.
(d) This result seems to hold whether you poll 10,000 people at random in the Netherlands or the United States, where the former has a population of less than 20 million people, while the United States is 15 times as populous. Why does the population size not come into play?

Answer: (a) \( \Pr(\hat{p} < 0.40) = \Pr\left(Z < \frac{0.40 - 0.50}{\sqrt{0.25/40}}\right) = \Pr(Z < -1.26) \approx 0.104 \). In roughly every 10th sample of this size, Bush would receive a vote of less than 40%, although in truth, his share is 50%.
(b) \( \Pr(\hat{p} < 0.40) = \Pr\left(Z < \frac{0.40 - 0.50}{\sqrt{0.25/100}}\right) = \Pr(Z < -2.00) \approx 0.023 \). With this sample size, you would expect this to happen only every 50th sample.
(c) The answers in (a) and (b) suggest that for even moderate increases in the sample size, the estimator does not vary too much from the population mean. Polling 10,000 individuals, the probability of finding a \( \hat{p} \) of 0.48 or less, for example, would be 0.00003. Unless the election was extremely close, which the 2000 election was, polls are quite accurate even for sample sizes of 2,500.
(d) The distribution of sample means shrinks very quickly depending on the sample size, not the population size. Although at first this does not seem intuitive, the standard error of an estimator is a value which indicates by how much the estimator varies around the population value. For large sample sizes, the sample mean typically is very close to the population mean.
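The three probabilities in this answer follow from the same normal approximation, so they can be checked with a short script. The sketch below assumes Python 3.8 or later (for `statistics.NormalDist`); the helper name `prob_share_below` is invented for this illustration.

```python
from statistics import NormalDist

def prob_share_below(threshold, p, n):
    """Pr(p_hat < threshold) under the approximation p_hat ~ N(p, p(1-p)/n)."""
    se = (p * (1 - p) / n) ** 0.5
    return NormalDist(mu=p, sigma=se).cdf(threshold)

# (sample size, share) pairs discussed in parts (a), (b), and (c)
for n, threshold in [(40, 0.40), (100, 0.40), (10_000, 0.48)]:
    prob = prob_share_below(threshold, 0.5, n)
    print(f"n = {n:>6}: Pr(p_hat < {threshold}) = {prob:.5f}")
```

The standard error depends only on p and n, which is why the population size never enters the calculation in part (d).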


6) You have collected weekly earnings and age data from a sub-sample of 1,744 individuals using the Current Population Survey in a given year. (a) Given the overall mean of $434.49 and a standard deviation of $294.67, construct a 99% confidence interval for average earnings in the entire population. State the meaning of this interval in words, rather than just in numbers. If you constructed a 90% confidence interval instead, would it be smaller or larger? What is the intuition? (b) When dividing your sample into people 45 years and older, and younger than 45, the information shown in the table is found.

Age Category    Average Earnings (Ȳ)    Standard Deviation (sY)    N
Age ≥ 45        $488.87                 $328.64                    507
Age < 45        $412.20                 $276.63                    1237

Test whether or not the difference in average earnings is statistically significant. Given your knowledge of age-earning profiles, does this result make sense?
Answer: (a) The confidence interval for mean weekly earnings is \( 434.49 \pm 2.58 \times \frac{294.67}{\sqrt{1744}} = 434.49 \pm 18.20 = (416.29, 452.69) \). Based on the sample at hand, the best guess for the population mean is $434.49. However, because of random sampling error, this guess is likely to be wrong. Instead, the interval estimate for the average earnings lies between $416.29 and $452.69. Committing to such an interval repeatedly implies that the resulting statement is incorrect 1 out of 100 times. For a 90% confidence interval, the only change in the calculation of the confidence interval is to replace 2.58 by 1.64. Hence the confidence interval is smaller. A smaller interval implies, given the same average earnings and the standard deviation, that the statement will be false more often. The larger the confidence interval, the more likely it is to contain the population value.
(b) Assuming unequal population variances, \( t = \frac{488.87 - 412.20}{\sqrt{\frac{328.64^2}{507} + \frac{276.63^2}{1237}}} = 4.62 \), which is statistically significant at conventional levels whether you use a two-sided or one-sided alternative. Hence the null hypothesis of equal average earnings in the two groups is rejected. Age-earning profiles typically take on an inverted U-shape. Maximum earnings occur in the 40s, depending on some other factors such as years of education, which are not considered here. Hence it is not clear if the alternative hypothesis should be one-sided or two-sided. In such a situation, it is best to assume a two-sided alternative hypothesis.
7) A manufacturer claims that a certain brand of VCR player has an average life expectancy of 5 years and 6 months with a standard deviation of 1 year and 6 months. Assume that the life expectancy is normally distributed. (a) Selecting one VCR player from this brand at random, calculate the probability of its life expectancy exceeding 7 years. (b) The Critical Consumer magazine decides to test fifty VCRs of this brand. The average life in this sample is 6 years and the sample standard deviation is 2 years. Calculate a 99% confidence interval for the average life. (c) How many more VCRs would the magazine have to test in order to halve the width of the confidence interval?
Answer: (a) Pr(Y > 7) = Pr(Z > 1) = 0.1587.
(b) \( 6 \pm 2.58 \times \frac{2}{\sqrt{50}} = 6 \pm 0.73 = (5.27, 6.73) \).
(c) \( \frac{1}{2} \times \left(2.58 \times \frac{2}{\sqrt{50}}\right) = 2.58 \times \frac{1}{2} \times \frac{2}{\sqrt{50}} = 2.58 \times \frac{2}{\sqrt{4 \times 50}} \), or n = 200, i.e., 150 more VCRs than the 50 already tested.
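Parts (b) and (c) of the VCR problem can be reproduced with a small sketch. The helper names `ci_for_mean` and `n_to_scale_width` are hypothetical, and the second function assumes, as the answer does, that the sample standard deviation stays at 2 years when the sample grows.

```python
import math

def ci_for_mean(mean, sd, n, z):
    """Large-sample confidence interval: mean +/- z * sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

def n_to_scale_width(n, factor):
    """Sample size that multiplies the interval width by `factor`, sd held fixed."""
    # Width is proportional to 1 / sqrt(n), so n must change by 1 / factor^2.
    return math.ceil(n / factor**2)

low, high = ci_for_mean(6, 2, 50, z=2.58)
n_half = n_to_scale_width(50, 0.5)
print(f"99% CI: ({low:.2f}, {high:.2f})")
print(f"n needed for half the width: {n_half} ({n_half - 50} additional VCRs)")
```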


8) U.S. News and World Report ranks colleges and universities annually. You randomly sample 100 of the national universities and liberal arts colleges from the year 2000 issue. The average cost, which includes tuition, fees, and room and board, is $23,571.49 with a standard deviation of $7,015.52. (a) Based on this sample, construct a 95% confidence interval of the average cost of attending a university/college in the United States. (b) Cost varies by quite a bit. One of the reasons may be that some universities/colleges have a better reputation than others. U.S. News and World Report tries to measure this factor by asking university presidents and chief academic officers about the reputation of institutions. The ranking is from 1 ("marginal") to 5 ("distinguished"). You decide to split the sample according to whether the academic institution has a reputation of greater than 3.5 or not. For comparison, in 2000, Caltech had a reputation ranking of 4.7, Smith College had 4.5, and Auburn University had 3.1. This gives you the statistics shown in the accompanying table.

Reputation Category    Average Cost (Ȳ)    Standard deviation of Cost (sY)    N
Ranking > 3.5          $29,311.31          $5,649.21                          29
Ranking ≤ 3.5          $21,227.06          $6,133.38                          71

Test the hypothesis that the average cost for all universities/colleges is the same independent of the reputation. What alternative hypothesis did you use? (c) What other factors should you consider before making a decision based on the data in (b)?
Answer: (a) \( 23{,}571.49 \pm 1.96 \times \frac{7{,}015.52}{\sqrt{100}} = 23{,}571.49 \pm 1{,}375.04 = (22{,}196.45, 24{,}946.53) \).
(b) Assuming unequal population variances, \( t = \frac{29{,}311.31 - 21{,}227.06}{\sqrt{\frac{5{,}649.21^2}{29} + \frac{6{,}133.38^2}{71}}} = 6.33 \), which is statistically significant whether you use a one-sided or two-sided hypothesis test. Your prior expectation is that academic institutions with a higher reputation will charge more for attending, and hence a one-sided alternative would have been appropriate here.
(c) There may be other variables which potentially have an effect on the cost of attending the academic institution. Some of these factors might be whether or not the college/university is private or public, its size, whether or not it has a religious affiliation, etc. It is only after controlling for these factors that the "pure" relationship between reputation and cost can be identified.


9) The development office and the registrar have provided you with anonymous matches of starting salaries and GPAs for 108 graduating economics majors. Your sample contains a variety of jobs, from church pastor to stockbroker. (a) The average starting salary for the 108 students was $38,644.86 with a standard deviation of $7,541.40. Construct a 95% confidence interval for the starting salary of all economics majors at your university/college. (b) A similar sample for psychology majors indicates a significantly lower starting salary. Given that these students had the same number of years of education, does this indicate discrimination in the job market against psychology majors? (c) You wonder if it pays (no pun intended) to get good grades by calculating the average salary for economics majors who graduated with a cumulative GPA of B+ or better, and those who had a B or worse. The data is as shown in the accompanying table.

Cumulative GPA    Average Earnings (Ȳ)    Standard deviation (sY)    n
B+ or better      $39,915.25              $8,330.21                  59
B or worse        $37,083.33              $6,174.86                  49

Conduct a t-test for the hypothesis that the two starting salaries are the same in the population. Given that this data was collected in 1999, do you think that your results will hold for other years, such as 2002?
Answer: (a) \( 38{,}644.86 \pm 1.96 \times \frac{7{,}541.40}{\sqrt{108}} = 38{,}644.86 \pm 1{,}422.32 = (37{,}222.54, 40{,}067.18) \).
(b) It suggests that the market values certain qualifications more highly than others. Comparing means and identifying that one is significantly lower than others does not indicate discrimination.
(c) Assuming unequal population variances, \( t = \frac{39{,}915.25 - 37{,}083.33}{\sqrt{\frac{8{,}330.21^2}{59} + \frac{6{,}174.86^2}{49}}} = 2.03 \). The critical value for a one-sided test is 1.64, for a two-sided test 1.96, both at the 5% level. Hence you can reject the null hypothesis that the two starting salaries are equal. Presumably you would have chosen as an alternative that better students receive better starting salaries, so that this becomes your new working hypothesis. 1999 was a boom year. If better students receive better starting offers during a boom year, when the labor market for graduates is tight, then it is very likely that they receive a better offer during a recession year, assuming that they receive an offer at all.


10) During the last few days before a presidential election, there is a frenzy of voting intention surveys. On a given day, quite often there are conflicting results from three major polls.
(a) Think of each of these polls as reporting the fraction of successes (1s) of a Bernoulli random variable Y, where the probability of success is Pr(Y = 1) = p. Let \( \hat{p} \) be the fraction of successes in the sample and assume that this estimator is normally distributed with a mean of p and a variance of \( \frac{p(1-p)}{n} \). Why are the results for all polls different, even though they are taken on the same day?
(b) Given the estimator of the variance of \( \hat{p} \), \( \frac{\hat{p}(1-\hat{p})}{n} \), construct a 95% confidence interval for p. For which value of \( \hat{p} \) is the standard deviation the largest? What value does the standard deviation take at that maximum?
(c) When the results from the polls are reported, you are told, typically in the small print, that the "margin of error" is plus or minus two percentage points. Using the approximation of 1.96 ≈ 2, and assuming, "conservatively," the maximum standard deviation derived in (b), what sample size is required to add and subtract ("margin of error") two percentage points from the point estimate?
(d) What sample size would you need to halve the margin of error?

Answer: (a) Since all polls are only samples, there is random sampling error. As a result, \( \hat{p} \) will differ from sample to sample, and most likely also from p.
(b) \( \hat{p} \pm 1.96 \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \). A bit of thought or calculus will show that the standard deviation will be largest for \( \hat{p} = 0.5 \), in which case it becomes \( \frac{0.5}{\sqrt{n}} \).

(c) n = 2,500. (d) n = 10,000.
11) At the Stock and Watson (http://www.pearsonhighered.com/stock_watson ) website go to Student Resources and select the option "Datasets for Replicating Empirical Results." Then select the "CPS Data Used in Chapter 8" (ch8_cps.xls) and open it in Excel. This is a rather large data set to work with, so just copy the first 500 observations into a new Worksheet (these are rows 1 to 501). In the newly created Worksheet, mark A1 to A501, then select the Data tab and click on "sort." A dialog box will open. First select "Add level" from one of the options on the left. Then select "sort by" and choose "Northeast" and "Largest to Smallest." Repeat the same for the "South" as a second option. Finally press "ok." This should give you 209 observations for average hourly earnings for the Northeast region, followed by 205 observations for the South.

a. For each of the 209 average hourly earnings observations for the Northeast region and separately for the South region, calculate the mean and sample standard deviation.

b. Use the appropriate test to determine whether or not average hourly earnings in the Northeast region are the same as in the South region.

c. Find the 99%, 95%, and 90% confidence intervals for the difference between the two population means. Is your conclusion consistent with the test in part (b)?

d. In all three cases of using the confidence interval in (c), the power of the test is quite low (5%). What can you do to increase the power of the test without reducing the size of the test?


Answer: a. \( \bar{Y}_{Northeast} \) = $21.12; \( \bar{Y}_{South} \) = $18.80; \( s_{Northeast} \) = $11.86; \( s_{South} \) = $11.18.

b. \( t = \frac{21.12 - 18.80}{\sqrt{\frac{11.86^2}{209} + \frac{11.18^2}{205}}} = 2.05 \). You cannot reject the null hypothesis of equal average earnings in the two regions at the 1% level, but you are able to reject it at the 10% and 5% significance levels.

c. For the 10% significance level, the confidence interval is ($0.46, $4.18). For the 5% significance level, the interval becomes larger and is ($0.10, $4.54). In either one of the cases you can reject the null hypothesis, since $0 is not contained in the confidence interval. It is only for the 1% significance level that the null hypothesis cannot be rejected. In that case, the confidence interval is (-$0.60, $5.24).

d. You would have to increase the sample size, since that would shrink the standard error (assuming that the sample mean and variance will not change).

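3.3 Mathematical and Graphical Problems 1) Your textbook defined the covariance between X and Y as follows:
\[ \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) \]
Prove that this is identical to the following alternative specification:
\[ \frac{1}{n-1}\sum_{i=1}^{n}X_iY_i - \frac{n}{n-1}\bar{X}\bar{Y} \]
Answer:
\[ \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \frac{1}{n-1}\sum_{i=1}^{n}(X_iY_i - \bar{X}Y_i - \bar{Y}X_i + \bar{Y}\bar{X}) \]
\[ = \frac{1}{n-1}\left(\sum_{i=1}^{n}X_iY_i - \bar{X}\sum_{i=1}^{n}Y_i - \bar{Y}\sum_{i=1}^{n}X_i + n\bar{Y}\bar{X}\right) = \frac{1}{n-1}\left(\sum_{i=1}^{n}X_iY_i - n\bar{X}\bar{Y} - n\bar{Y}\bar{X} + n\bar{Y}\bar{X}\right) \]
\[ = \frac{1}{n-1}\sum_{i=1}^{n}X_iY_i - \frac{n}{n-1}\bar{X}\bar{Y}. \]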
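A quick numerical check can make the identity concrete. The sketch below computes the sample covariance both ways on a small made-up data set (the numbers and function names are illustrative only, not taken from the textbook).

```python
def cov_deviation_form(x, y):
    """Sample covariance as the average product of deviations from the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def cov_product_form(x, y):
    """Sample covariance from the sum of products minus n/(n-1) * x_bar * y_bar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum(xi * yi for xi, yi in zip(x, y)) / (n - 1) - n / (n - 1) * x_bar * y_bar

x = [1.0, 2.0, 4.0, 7.0]  # made-up illustrative data
y = [3.0, 1.0, 6.0, 8.0]
print(cov_deviation_form(x, y), cov_product_form(x, y))
```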


2) For each of the accompanying scatterplots for several pairs of variables, indicate whether you expect a positive or negative correlation coefficient between the two variables, and the likely magnitude of it (you can use a small range). (a)

(b)

(c)


(d)

Answer: (a) Positive correlation. The actual correlation coefficient is 0.46. (b) No relationship. The actual correlation coefficient is 0.00007. (c) Negative relationship. The actual correlation coefficient is –0.70. (d) Nonlinear (inverted U) relationship. The actual correlation coefficient is 0.23.


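3) Your textbook defines the correlation coefficient as follows:
\[ r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}} \]
Another textbook gives an alternative formula:
\[ r = \frac{n\sum_{i=1}^{n}Y_iX_i - \left(\sum_{i=1}^{n}Y_i\right)\left(\sum_{i=1}^{n}X_i\right)}{\sqrt{n\sum_{i=1}^{n}Y_i^2 - \left(\sum_{i=1}^{n}Y_i\right)^2}\,\sqrt{n\sum_{i=1}^{n}X_i^2 - \left(\sum_{i=1}^{n}X_i\right)^2}} \]
Prove that the two are the same.

Answer: Expanding the products and squares, and cancelling the common factor \( \frac{1}{n-1} \),
\[ r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(Y_iX_i - \bar{Y}X_i - \bar{X}Y_i + \bar{Y}\bar{X})}{\sqrt{\frac{1}{n-1}}\sqrt{\sum_{i=1}^{n}(Y_i^2 - 2\bar{Y}Y_i + \bar{Y}^2)}\,\sqrt{\frac{1}{n-1}}\sqrt{\sum_{i=1}^{n}(X_i^2 - 2\bar{X}X_i + \bar{X}^2)}} = \frac{\sum_{i=1}^{n}Y_iX_i - n\bar{Y}\bar{X}}{\sqrt{\sum_{i=1}^{n}Y_i^2 - n\bar{Y}^2}\,\sqrt{\sum_{i=1}^{n}X_i^2 - n\bar{X}^2}}. \]
Multiplying numerator and denominator by n and using \( n\bar{Y} = \sum_{i=1}^{n}Y_i \) and \( n\bar{X} = \sum_{i=1}^{n}X_i \),
\[ r = \frac{n\sum_{i=1}^{n}Y_iX_i - n\bar{Y}\,n\bar{X}}{\sqrt{n\sum_{i=1}^{n}Y_i^2 - (n\bar{Y})^2}\,\sqrt{n\sum_{i=1}^{n}X_i^2 - (n\bar{X})^2}} = \frac{n\sum_{i=1}^{n}Y_iX_i - \left(\sum_{i=1}^{n}Y_i\right)\left(\sum_{i=1}^{n}X_i\right)}{\sqrt{n\sum_{i=1}^{n}Y_i^2 - \left(\sum_{i=1}^{n}Y_i\right)^2}\,\sqrt{n\sum_{i=1}^{n}X_i^2 - \left(\sum_{i=1}^{n}X_i\right)^2}}. \]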
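The "raw sums" version of the formula is convenient when only totals are reported rather than the individual observations. As a rough sketch, the two functions below implement both versions and compare them on a small made-up data set; the function names and the data are illustrative assumptions.

```python
import math

def corr_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Correlation coefficient computed from raw sums only."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_y2 - sum_y**2) * math.sqrt(n * sum_x2 - sum_x**2)
    return numerator / denominator

def corr_from_deviations(x, y):
    """Correlation coefficient as sample covariance over the product of standard deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)

x = [1.0, 2.0, 4.0, 7.0]  # made-up illustrative data
y = [3.0, 1.0, 6.0, 8.0]
r_sums = corr_from_sums(
    len(x), sum(x), sum(y),
    sum(a * b for a, b in zip(x, y)),
    sum(a * a for a in x), sum(b * b for b in y),
)
print(r_sums, corr_from_deviations(x, y))
```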

4) IQs of individuals are normally distributed with a mean of 100 and a standard deviation of 16. If you sampled students at your college and assumed, as the null hypothesis, that they had the same IQ as the population, then in a random sample of size (a) n = 25, find Pr(Y < 105). (b) n = 100, find Pr(Y > 97). (c) n = 144, find Pr(101 < Y < 103). Answer: (a) 0.94 (b) 0.97 (c) 0.21
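All three probabilities come from the sampling distribution of the sample mean, which is normal with mean 100 and standard deviation 16/√n. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist` (the helper name `sample_mean_dist` is made up for this example):

```python
from statistics import NormalDist

def sample_mean_dist(mu, sigma, n):
    """Sampling distribution of the mean of n i.i.d. N(mu, sigma^2) draws."""
    return NormalDist(mu, sigma / n**0.5)

pr_a = sample_mean_dist(100, 16, 25).cdf(105)       # Pr(Ybar < 105), n = 25
pr_b = 1 - sample_mean_dist(100, 16, 100).cdf(97)   # Pr(Ybar > 97), n = 100
dist_c = sample_mean_dist(100, 16, 144)
pr_c = dist_c.cdf(103) - dist_c.cdf(101)            # Pr(101 < Ybar < 103), n = 144

print(round(pr_a, 2), round(pr_b, 2), round(pr_c, 2))
```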


5) Consider the following alternative estimator for the population mean:
\[ \tilde{Y} = \frac{1}{n}\left(\frac{1}{4}Y_1 + \frac{7}{4}Y_2 + \frac{1}{4}Y_3 + \frac{7}{4}Y_4 + \dots + \frac{1}{4}Y_{n-1} + \frac{7}{4}Y_n\right) \]
Prove that \( \tilde{Y} \) is unbiased and consistent, but not efficient when compared to \( \bar{Y} \).

Answer:
\[ E(\tilde{Y}) = \frac{1}{n}\left(\frac{1}{4}E(Y_1) + \frac{7}{4}E(Y_2) + \frac{1}{4}E(Y_3) + \frac{7}{4}E(Y_4) + \dots + \frac{1}{4}E(Y_{n-1}) + \frac{7}{4}E(Y_n)\right) = \frac{1}{n}\mu_Y\,(2 + 2 + \dots + 2) = \frac{n}{n}\mu_Y = \mu_Y, \]
where each pair of weights \( \frac{1}{4} + \frac{7}{4} \) adds up to 2 and there are n/2 such pairs. Hence \( \tilde{Y} \) is unbiased.
\[ \operatorname{var}(\tilde{Y}) = E(\tilde{Y} - \mu_Y)^2 = E\left[\frac{1}{n}\left(\frac{1}{4}Y_1 + \frac{7}{4}Y_2 + \dots + \frac{1}{4}Y_{n-1} + \frac{7}{4}Y_n\right) - \mu_Y\right]^2 \]
\[ = \frac{1}{n^2}E\left[\frac{1}{4}(Y_1 - \mu_Y) + \frac{7}{4}(Y_2 - \mu_Y) + \dots + \frac{1}{4}(Y_{n-1} - \mu_Y) + \frac{7}{4}(Y_n - \mu_Y)\right]^2 \]
\[ = \frac{1}{n^2}\left[\frac{1}{16}E(Y_1 - \mu_Y)^2 + \frac{49}{16}E(Y_2 - \mu_Y)^2 + \dots + \frac{1}{16}E(Y_{n-1} - \mu_Y)^2 + \frac{49}{16}E(Y_n - \mu_Y)^2\right] \]
\[ = \frac{1}{n^2}\left[\frac{1}{16}\sigma_Y^2 + \frac{49}{16}\sigma_Y^2 + \dots + \frac{1}{16}\sigma_Y^2 + \frac{49}{16}\sigma_Y^2\right] = \frac{\sigma_Y^2}{n^2}\left[\frac{n}{2}\left(\frac{1}{16} + \frac{49}{16}\right)\right] = 1.5625\,\frac{\sigma_Y^2}{n}. \]
Since \( \operatorname{var}(\tilde{Y}) \to 0 \) as \( n \to \infty \), \( \tilde{Y} \) is consistent. \( \tilde{Y} \) has a larger variance than \( \bar{Y} \) and is therefore not as efficient.
6) Imagine that you had sampled 1,000,000 females and 1,000,000 males to test whether or not females have a higher IQ than males. IQs are normally distributed with a mean of 100 and a standard deviation of 16. You are excited to find that females have an average IQ of 101 in your sample, while males have an IQ of 99. Does this difference seem important? Do you really need to carry out a t-test for differences in means to determine whether or not this difference is statistically significant? What does this result tell you about testing hypotheses when sample sizes are very large? Answer: The difference seems very small, both in terms of absolute values and, more importantly, in terms of standard deviations. With a sample size as large as n = 1,000,000, the standard error becomes extremely small. This implies that the distribution of means, or differences in means, has almost turned into a spike. In essence, you are (very close to) observing the population. It is therefore unnecessary to test whether or not the difference is statistically significant. After all, if in the population, the male IQ were 99.99 and the female IQ were 100.01, they would be different. In general, when sample sizes become very large, it is very easy to reject null hypotheses about population means, which involve sample means as an estimator, even if hypothesized differences are very small. This is the result of the distribution of sample means collapsing fairly rapidly as sample sizes increase.


7) Let Y be a Bernoulli random variable with success probability Pr(Y = 1) = p, and let Y1, ..., Yn be i.i.d. draws from this distribution. Let \( \hat{p} \) be the fraction of successes (1s) in this sample. In large samples, the distribution of \( \hat{p} \) will be approximately normal, i.e., \( \hat{p} \) is approximately distributed \( N\left(p, \frac{p(1-p)}{n}\right) \). Now let X be the number of successes and n the sample size. In a sample of 10 voters (n = 10), if there are six who vote for candidate A, then X = 6. Relate X, the number of successes, to \( \hat{p} \), the success proportion, or fraction of successes. Next, using your knowledge of linear transformations, derive the distribution of X.
Answer: \( X = n \times \hat{p} \). Hence if \( \hat{p} \) is distributed \( N\left(p, \frac{p(1-p)}{n}\right) \), then, given that X is a linear transformation of \( \hat{p} \), X is distributed \( N(np, np(1-p)) \).
8) When you perform hypothesis tests, you are faced with four possible outcomes described in the accompanying table.

                            Truth (Population)
Decision based on sample    H0 is true    H1 is true
Reject H0                   I             ☺
Do not reject H0            ☺             II

"☺" indicates a correct decision, and I and II indicate that an error has been made. In probability terms, state the mistakes that have been made in situation I and II, and relate these to the Size of the test and the Power of the test (or transformations of these). Answer: I: Pr(reject H0 | H0 is correct) = Size of the test. II: Pr(reject H1 | H1 is correct) = (1 - Power of the test).
9) Assume that under the null hypothesis, Y has an expected value of 500 and a standard deviation of 20. Under the alternative hypothesis, the expected value is 550. Sketch the probability density function for the null and the alternative hypothesis in the same figure. Pick a critical value such that the p-value is approximately 5%. Mark the areas, which show the size and the power of the test. What happens to the power of the test if the alternative hypothesis moves closer to the null hypothesis, i.e., μY = 540, 530, 520, etc.? Answer: For a given size of the test, the power of the test is lower.
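The effect described in the last answer can be seen numerically. The sketch below fixes the size at 5%, finds the corresponding right-tail critical value under the null (mean 500, standard deviation 20), and evaluates the power for each alternative mean. It assumes Python 3.8 or later for `statistics.NormalDist`; the function name `one_sided_power` is invented for this illustration.

```python
from statistics import NormalDist

def one_sided_power(mu_null, mu_alt, sd, size=0.05):
    """Power of a right-tailed test of mu_null against a specific mu_alt."""
    # Critical value: the point with `size` probability to its right under the null.
    critical = NormalDist(mu_null, sd).inv_cdf(1 - size)
    # Power: probability of exceeding the critical value when mu_alt is true.
    return 1 - NormalDist(mu_alt, sd).cdf(critical)

for mu_alt in (550, 540, 530, 520):
    print(mu_alt, round(one_sided_power(500, mu_alt, 20), 3))
```

As the alternative approaches the null, the two densities overlap more and the power falls toward the size of the test.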


10) The net weight of a bag of flour is guaranteed to be 5 pounds with a standard deviation of 0.05 pounds. You are concerned that the actual weight is less. To test for this, you sample 25 bags. Carefully state the null and alternative hypothesis in this situation. Determine a critical value such that the size of the test does not exceed 5%. Finding the average weight of the 25 bags to be 4.7 pounds, can you reject the null hypothesis? What is the power of the test here? Why is it so low? Answer: Let Y be the net weight of the bag of flour. Then H0: E(Y) = 5 and H1: E(Y) < 5. Under the null hypothesis, \( \bar{Y} \) is distributed normally, with a mean of 5 pounds and a standard deviation of 0.01 pounds. The critical value is approximately 4.98 pounds. Since 4.7 pounds falls in the rejection region, the null hypothesis is rejected. The power of the test is low here, since there is no simple alternative. In the extreme case, where the alternative hypothesis would place the net weight marginally below five pounds, the power of the test would approximately equal its size, or 5% in this case.
11) Some policy advisors have argued that education should be subsidized in developing countries to reduce fertility rates. To investigate whether or not education and fertility are correlated, you collect data on population growth rates (Y) and education (X) for 86 countries. Given the sums below, compute the sample correlation:
\[ \sum_{i=1}^{n}Y_i = 1.594;\quad \sum_{i=1}^{n}X_i = 449.6;\quad \sum_{i=1}^{n}X_iY_i = 6.4697;\quad \sum_{i=1}^{n}Y_i^2 = 0.03982;\quad \sum_{i=1}^{n}X_i^2 = 3{,}022.76 \]

Answer: r = -0.716.
12) (Advanced) Unbiasedness and small variance are desirable properties of estimators. However, you can imagine situations where a trade-off exists between the two: one estimator may have a small bias but a much smaller variance than another, unbiased estimator. The concept of "mean square error" estimator combines the two concepts. Let \( \hat{\mu} \) be an estimator of μ. Then the mean square error (MSE) is defined as follows: \( MSE(\hat{\mu}) = E(\hat{\mu} - \mu)^2 \). Prove that \( MSE(\hat{\mu}) = \text{bias}^2 + \operatorname{var}(\hat{\mu}) \). (Hint: subtract and add in \( E(\hat{\mu}) \) in \( E(\hat{\mu} - \mu)^2 \).)
Answer:
\[ MSE(\hat{\mu}) = E(\hat{\mu} - E(\hat{\mu}) + E(\hat{\mu}) - \mu)^2 = E[(\hat{\mu} - E(\hat{\mu})) + (E(\hat{\mu}) - \mu)]^2 \]
\[ = E[(\hat{\mu} - E(\hat{\mu}))^2 + (E(\hat{\mu}) - \mu)^2 + 2(\hat{\mu} - E(\hat{\mu}))(E(\hat{\mu}) - \mu)]. \]
Next, moving through the expectation operator results in
\[ E[\hat{\mu} - E(\hat{\mu})]^2 + E[E(\hat{\mu}) - \mu]^2 + 2E[(\hat{\mu} - E(\hat{\mu}))(E(\hat{\mu}) - \mu)]. \]
The first term is the variance, and the second term is the squared bias, since \( E[E(\hat{\mu}) - \mu]^2 = [E(\hat{\mu}) - \mu]^2 \). This proves \( MSE(\hat{\mu}) = \text{bias}^2 + \operatorname{var}(\hat{\mu}) \) if the last term equals zero. But
\[ E[(\hat{\mu} - E(\hat{\mu}))(E(\hat{\mu}) - \mu)] = E[E(\hat{\mu})\hat{\mu} - \mu\hat{\mu} - (E(\hat{\mu}))^2 + \mu E(\hat{\mu})] = E(\hat{\mu})E(\hat{\mu}) - \mu E(\hat{\mu}) - (E(\hat{\mu}))^2 + \mu E(\hat{\mu}) = 0. \]
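The decomposition also holds exactly for any finite set of estimates when expectations are replaced by averages over that set, which gives a simple way to check it by hand or in code. The sketch below uses four made-up estimates of a true value of 5; the function name and the numbers are illustrative assumptions.

```python
from statistics import fmean, pvariance

def mse_decomposition(estimates, mu):
    """Return (MSE, squared bias, variance), treating `estimates` as the whole distribution."""
    mse = fmean([(e - mu) ** 2 for e in estimates])
    bias = fmean(estimates) - mu
    var = pvariance(estimates)  # population variance: divides by the number of estimates
    return mse, bias**2, var

estimates = [4.0, 5.5, 6.0, 7.5]  # made-up values of the estimator
mse, bias_sq, var = mse_decomposition(estimates, mu=5.0)
print(mse, bias_sq + var)
```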


13) Your textbook states that when you test for differences in means and you assume that the two population variances are equal, then an estimator of the population variance is the following "pooled" estimator:
\[ s_{pooled}^2 = \frac{1}{n_m + n_w - 2}\left[\sum_{i=1}^{n_m}(Y_i - \bar{Y}_m)^2 + \sum_{i=1}^{n_w}(Y_i - \bar{Y}_w)^2\right] \]
Explain why this pooled estimator can be looked at as the weighted average of the two variances.
Answer:
\[ s_{pooled}^2 = \frac{1}{n_m + n_w - 2}\left[\sum_{i=1}^{n_m}(Y_i - \bar{Y}_m)^2 + \sum_{i=1}^{n_w}(Y_i - \bar{Y}_w)^2\right] = \frac{1}{n_m + n_w - 2}\left[(n_m - 1)s_m^2 + (n_w - 1)s_w^2\right] \]
\[ = \frac{n_m - 1}{n_m + n_w - 2}\,s_m^2 + \frac{n_w - 1}{n_m + n_w - 2}\,s_w^2. \]
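The two weights in the last line add up to one, which is what makes the pooled estimator a weighted average. A minimal sketch comparing the weighted-average form with the direct sum-of-squares form on two small made-up samples (function names and data are assumptions for illustration):

```python
from statistics import fmean, variance

def pooled_variance_weighted(sample_m, sample_w):
    """Pooled variance as a weighted average of the two sample variances."""
    n_m, n_w = len(sample_m), len(sample_w)
    weight_m = (n_m - 1) / (n_m + n_w - 2)
    weight_w = (n_w - 1) / (n_m + n_w - 2)  # weight_m + weight_w == 1
    return weight_m * variance(sample_m) + weight_w * variance(sample_w)

def pooled_variance_direct(sample_m, sample_w):
    """Pooled variance from the combined sums of squared deviations."""
    mean_m, mean_w = fmean(sample_m), fmean(sample_w)
    ss_m = sum((y - mean_m) ** 2 for y in sample_m)
    ss_w = sum((y - mean_w) ** 2 for y in sample_w)
    return (ss_m + ss_w) / (len(sample_m) + len(sample_w) - 2)

men = [1.0, 2.0, 3.0, 6.0]   # made-up illustrative data
women = [2.0, 4.0, 6.0]
print(pooled_variance_weighted(men, women), pooled_variance_direct(men, women))
```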

14) Your textbook suggests using the first observation from a sample of n as an estimator of the population mean. It is shown that this estimator is unbiased but has a variance of \( \sigma_Y^2 \), which makes it less efficient than the sample mean. Explain why this estimator is not consistent. You develop another estimator, which is the simple average of the first and last observation in your sample. Show that this estimator is also unbiased and show that it is more efficient than the estimator which only uses the first observation. Is this estimator consistent? Answer: The estimator is not consistent because its variance does not vanish as n goes to infinity, i.e., \( \operatorname{var}(Y_1) \to 0 \) as \( n \to \infty \) does not hold.

\( \tilde{Y} = \frac{1}{2}(Y_1 + Y_n) \). \( E(\tilde{Y}) = \frac{1}{2}(E(Y_1) + E(Y_n)) = \frac{1}{2}(\mu_Y + \mu_Y) = \mu_Y \). Hence \( \tilde{Y} \) is unbiased.
\[ \operatorname{var}(\tilde{Y}) = E(\tilde{Y} - \mu_Y)^2 = E\left[\left(\frac{1}{2}Y_1 + \frac{1}{2}Y_n\right) - \mu_Y\right]^2 = E\left[\frac{1}{2}(Y_1 - \mu_Y) + \frac{1}{2}(Y_n - \mu_Y)\right]^2 \]
\[ = \frac{1}{4}\left[E(Y_1 - \mu_Y)^2 + E(Y_n - \mu_Y)^2\right] = \frac{1}{4}\left[\sigma_Y^2 + \sigma_Y^2\right] = \frac{\sigma_Y^2}{2}, \]
where the cross term vanishes because \( Y_1 \) and \( Y_n \) are independent.

Since \( \operatorname{var}(\tilde{Y}) \to 0 \) as \( n \to \infty \) does not hold, \( \tilde{Y} \) is not consistent. \( \operatorname{var}(\tilde{Y}) < \operatorname{var}(Y_1) \), and \( \tilde{Y} \) is therefore more efficient than the estimator which only uses the first observation.


15) Let p be the success probability of a Bernoulli random variable Y, i.e., p = Pr(Y = 1). It can be shown that \(\hat{p}\), the fraction of successes in a sample, is asymptotically distributed \(N\left(p, \frac{p(1 - p)}{n}\right)\). Using the estimator of the variance of \(\hat{p}\), \(\frac{\hat{p}(1 - \hat{p})}{n}\), construct a 95% confidence interval for p. Show that the margin for sampling error simplifies to \(1/\sqrt{n}\) if you used 2 instead of 1.96 assuming, conservatively, that the standard error is at its maximum. Construct a table indicating the sample size needed to generate a margin of sampling error of 1%, 2%, 5% and 10%. What do you notice about the increase in sample size needed to halve the margin of error? (The margin of sampling error is \(1.96 \times SE(\hat{p})\).)

Answer: The 95% confidence interval for p is \(\hat{p} \pm 1.96 \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\). \(\hat{p}(1 - \hat{p})\) is at a maximum for \(\hat{p} = 0.5\), in which case the confidence interval reduces to \(\hat{p} \pm 1.96 \times \sqrt{\frac{0.25}{n}} \approx \hat{p} \pm \frac{1}{\sqrt{n}}\), and the margin of sampling error is \(\frac{1}{\sqrt{n}}\).

Margin of sampling error \(1/\sqrt{n}\)    Sample size n
0.01                                     10,000
0.02                                     2,500
0.05                                     400
0.10                                     100

To halve the margin of error, the sample size has to increase fourfold.

16) Let Y be a Bernoulli random variable with success probability Pr(Y = 1) = p, and let \(Y_1, \ldots, Y_n\) be i.i.d. draws

from this distribution. Let \(\hat{p}\) be the fraction of successes (1s) in this sample. Given the following statement

\[ \Pr(-1.96 < z < 1.96) = 0.95 \]

and assuming that \(\hat{p}\) is approximately distributed \(N\left(p, \frac{p(1 - p)}{n}\right)\), derive the 95% confidence interval for p by solving the above inequalities.

Answer: \(\Pr\left(-1.96 < \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} < 1.96\right) = 0.95\). Multiplying through by the standard deviation results in \(\Pr\left(-1.96 \times \sqrt{\frac{p(1 - p)}{n}} < \hat{p} - p < 1.96 \times \sqrt{\frac{p(1 - p)}{n}}\right) = 0.95\). Subtraction of \(\hat{p}\) then yields, after multiplying both sides by (-1), \(\Pr\left(\hat{p} - 1.96 \times \sqrt{\frac{p(1 - p)}{n}} < p < \hat{p} + 1.96 \times \sqrt{\frac{p(1 - p)}{n}}\right) = 0.95\). The 95% confidence interval for p then is \(\hat{p} \pm 1.96 \times \sqrt{\frac{p(1 - p)}{n}}\).
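The sample sizes in the table for Question 15 follow from solving \(1/\sqrt{n} = m\) for n. A small sketch of that arithmetic is shown here; the function name is hypothetical, and the margin is passed in percent so that the computation stays exact for the four cases in the table.

```python
import math

def conservative_sample_size(margin_percent):
    """Smallest n with 1/sqrt(n) <= margin, where the margin is given in percent."""
    return math.ceil((100 / margin_percent) ** 2)

# Margins of 1%, 2%, 5% and 10%, as in the table.
sample_sizes = {m: conservative_sample_size(m) for m in (1, 2, 5, 10)}
print(sample_sizes)

# Halving the margin from 5% to 2.5% quadruples the required sample size.
print(conservative_sample_size(2.5) / conservative_sample_size(5))
```

Because n enters the margin through a square root, each halving of the margin multiplies the required sample size by four.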


17) Your textbook mentions that dividing the sample variance by n – 1 instead of n is called a degrees of freedom correction. The meaning of the term stems from the fact that one degree of freedom is used up when the mean is estimated. Hence degrees of freedom can be viewed as the number of independent observations remaining after estimating the sample mean. Consider an example where initially you have 20 independent observations on the height of students. After calculating the average height, your instructor claims that you can figure out the height of the 20th student if she provides you with the height of the other 19 students and the sample mean. Hence you have lost one degree of freedom, or there are only 19 independent bits of information. Explain how you can find the height of the 20th student.

Answer: Since \(\bar{Y} = \frac{1}{20}\sum_{i=1}^{20} Y_i\), \(20 \times \bar{Y} = \sum_{i=1}^{20} Y_i = Y_{20} + \sum_{i=1}^{19} Y_i\). Hence knowledge of the sample mean and the height of the other 19 students is sufficient for finding the height of the 20th student.

18) The accompanying table lists the height (STUDHGHT) in inches and weight (WEIGHT) in pounds of five college students. Calculate the correlation coefficient.

STUDHGHT   WEIGHT
74         165
73         165
72         145
68         155
66         140

Answer: r = 0.72.
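The correlation in Question 18 can be reproduced from the five observations. The sketch below assumes the heights and weights are paired row by row in the order listed; the function name is a hypothetical helper.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    # The (n - 1) divisors of the covariance and the variances cancel.
    return s_xy / math.sqrt(s_xx * s_yy)

studhght = [74, 73, 72, 68, 66]      # height in inches
weight = [165, 165, 145, 155, 140]   # weight in pounds
r = correlation(studhght, weight)
print(round(r, 2))  # 0.72
```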

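19) (Requires calculus.) Let Y be a Bernoulli random variable with success probability Pr(Y = 1) = p. It can be shown that the variance of the estimated success probability \(\hat{p}\) is \(\frac{p(1 - p)}{n}\). Use calculus to show that this variance is maximized for p = 0.5.

Answer: \(\frac{\partial}{\partial p}\left[\frac{p(1 - p)}{n}\right] = \frac{1 - p}{n} - \frac{p}{n} = 0\). Hence \(1 - 2p = 0\) or \(p = \frac{1}{2}\). The second derivative, \(-\frac{2}{n}\), is negative, so this is a maximum.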


20) Consider two estimators: one which is biased and has a smaller variance, the other which is unbiased and has a larger variance. Sketch the sampling distributions and the location of the population parameter for this situation. Discuss conditions under which you may prefer to use the first estimator over the second one. Answer: The bias indicates “how far away,” on average, the estimator is from the population value. Although this average is zero for an unbiased estimator, there may be quite some variation around the population mean. In a single draw, there is therefore a high probability of being some distance away from the population mean. On the other hand, if the variance is very small and the estimator is biased by a small amount, then the probability of being closer to the population value may be higher. (The biased estimator may have a smaller mean square error than the unbiased estimator.)


21) At the Stock and Watson (http://www.pearsonhighered.com/stock_watson) website go to Student Resources and select the option “Datasets for Replicating Empirical Results.” Then open the chapter 8 CPS data set (ch8_cps.xls) in a spreadsheet program such as Excel. For the exercise, use the first 500 observations only. Using data for average hourly earnings (ahe) and years of education (yrseduc) only, produce a scatterplot with earnings on the vertical axis and education level on the horizontal axis. What kind of relationship does the scatterplot suggest? Confirm your impression by adding a linear trendline. Find the correlation coefficient between the two and interpret it.

Answer: Without the trendline added, there does not seem to be much of a linear relationship between average hourly earnings and years of education. Perhaps a linear relationship is not plausible since it would imply that the returns to education would become smaller as further years of education are added. However, and regardless of the linearity issues, there is a positive relationship in the data between the two variables, which becomes visible when the trendline is added. The correlation coefficient is positive and has a value of 46.9%, which is reasonably high (the correlation between height and weight for college students is approximately 50% by comparison).

22) IQ scores are normally distributed with an average of 100 and a standard deviation of 16. Some research suggests that left-handed individuals have a higher IQ score than right-handed individuals. To test this hypothesis, a researcher randomly selects 132 left-handed individuals and finds that their average IQ is 103.2 with a sample standard deviation of 14.6. Using the results from the sample, can you reject the null hypothesis that left-handed people have an IQ of 100 vs. the alternative that they have a higher IQ? What critical value should you choose if the size of the test is 5%?

Answer: The hypothesis is \(H_0: \mu = 100\) versus the alternative \(H_1: \mu > 100\). The test statistic is

\[ t = \frac{103.2 - 100}{14.6 / \sqrt{132}} = 2.52. \]

Since the critical value for the one-sided alternative is 1.645 at the 5% significance level, the researcher should reject the null hypothesis that left-handed individuals have an IQ of 100.
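The test in Question 22 can be sketched in a few lines. The function name is hypothetical; the critical value is taken from the standard normal distribution, which is the large-sample approximation the answer relies on.

```python
import math
from statistics import NormalDist

def one_sample_t(sample_mean, null_mean, sample_sd, n):
    """t-statistic for H0: population mean = null_mean."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - null_mean) / standard_error

t_stat = one_sample_t(103.2, 100.0, 14.6, 132)
# One-sided test of size 5%: reject if t exceeds the 95th percentile of N(0, 1).
critical_value = NormalDist().inv_cdf(0.95)
reject_null = t_stat > critical_value
print(round(t_stat, 2), round(critical_value, 3), reject_null)
```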


23) At the Stock and Watson (http://www.pearsonhighered.com/stock_watson) website go to Student Resources and select the option “Datasets for Replicating Empirical Results.” Then select the “Test Score data set used in Chapters 4-9” (caschool.xls) and open the Excel data set. Next produce a scatterplot of the average reading score (horizontal axis) and the average mathematics score (vertical axis). What does the scatterplot suggest? Calculate the correlation coefficient between the two series and give an interpretation.

Answer: The scatterplot suggests that, on average, schools which perform highly on the reading score will also perform highly on the mathematics score. The sample correlation between the two series is 92.3%, suggesting a high positive correlation between the two variables.

24) In 2007, a study of close to 250,000 18-19 year-old Norwegian males found that first-borns have an IQ that is 2.3 points higher than those who are second-born. To see if you can find similar evidence at your university, you collect data from 250 students, of which 140 are first-borns. After subjecting each of these individuals to an IQ test, you find that the first-borns score 108.3 with a standard deviation of 13.2, while the second-borns achieve 107.1 with a standard deviation of 11.6. You hypothesize that first-borns and second-borns in a university population have identical IQs against the one-sided alternative hypothesis that first-borns have higher IQs. Using a size of the test of 5%, what is your conclusion?

Answer: Given that your null hypothesis states \(H_0: \mu_{first} = \mu_{second}\), your test statistic is

\[ t = \frac{108.3 - 107.1}{\sqrt{\frac{13.2^2}{140} + \frac{11.6^2}{110}}} = 0.76. \]

Since the critical value for the one-sided alternative test is 1.64, you cannot reject the null hypothesis.
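For Question 24, the difference-in-means statistic can be computed as follows. This is a sketch with a hypothetical function name; it uses the unpooled standard error that appears in the answer and a standard normal critical value.

```python
import math
from statistics import NormalDist

def difference_in_means_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t-statistic for H0: the two population means are equal."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error

# First-borns: 140 students; second-borns: 250 - 140 = 110 students.
t_stat = difference_in_means_t(108.3, 13.2, 140, 107.1, 11.6, 110)
critical_value = NormalDist().inv_cdf(0.95)  # one-sided test of size 5%
reject_null = t_stat > critical_value
print(round(t_stat, 2), reject_null)
```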


Chapter 4 Linear Regression with One Regressor 4.1 Multiple Choice

1) When the estimated slope coefficient in the simple regression model, \(\hat{\beta}_1\), is zero, then A) \(R^2 = \bar{Y}\). B) \(0 < R^2 < 1\). C) \(R^2 = 0\). D) \(R^2 > (SSR/TSS)\). Answer: C

2) The regression \(R^2\) is defined as follows:
A) \(\frac{ESS}{TSS}\)
B) \(\frac{RSS}{TSS}\)
C) \(\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2 \sum_{i=1}^{n}(X_i - \bar{X})^2}\)
D) \(\frac{SSR}{n - 2}\)
Answer: A

3) The standard error of the regression (SER) is defined as follows
A) \(\sqrt{\frac{1}{n - 2}\sum_{i=1}^{n}\hat{u}_i^2}\)
B) SSR
C) \(1 - R^2\)
D) \(\sqrt{\frac{1}{n - 1}\sum_{i=1}^{n}\hat{u}_i^2}\)
Answer: A

4) (Requires Appendix material) Which of the following statements is correct? A) TSS = ESS + SSR B) ESS = SSR + TSS C) ESS > TSS D) \(R^2\) = 1 - (ESS/TSS) Answer: A

5) Binary variables A) are generally used to control for outliers in your sample. B) can take on more than two values. C) exclude certain individuals from your sample. D) can take on only two values. Answer: D


6) The following are all least squares assumptions with the exception of: A) The conditional distribution of \(u_i\) given \(X_i\) has a mean of zero. B) The explanatory variable in the regression model is normally distributed. C) \((X_i, Y_i)\), i = 1,..., n are independently and identically distributed. D) Large outliers are unlikely. Answer: B

7) The reason why estimators have a sampling distribution is that A) economics is not a precise science. B) individuals respond differently to incentives. C) in real life you typically get to sample many times. D) the values of the explanatory variable and the error term differ across samples. Answer: D

8) In the simple linear regression model, the regression slope A) indicates by how many percent Y increases, given a one percent increase in X. B) when multiplied with the explanatory variable will give you the predicted Y. C) indicates by how many units Y increases, given a one unit increase in X. D) represents the elasticity of Y on X. Answer: C

9) The OLS estimator is derived by A) connecting the \(Y_i\) corresponding to the lowest \(X_i\) observation with the \(Y_i\) corresponding to the highest \(X_i\) observation. B) making sure that the standard error of the regression equals the standard error of the slope estimator. C) minimizing the sum of absolute residuals. D) minimizing the sum of squared residuals. Answer: D

10) Interpreting the intercept in a sample regression function is A) not reasonable because you never observe values of the explanatory variables around the origin. B) reasonable because under certain conditions the estimator is BLUE. C) reasonable if your sample contains values of \(X_i\) around the origin. D) not reasonable because economists are interested in the effect of a change in X on the change in Y. Answer: C

11) The variance of \(Y_i\) is given by A) \(\beta_0^2 + \beta_1^2 var(X_i) + var(u_i)\). B) the variance of \(u_i\). C) \(\beta_1^2 var(X_i) + var(u_i)\). D) the variance of the residuals. Answer: C

12) (Requires Appendix) The sample average of the OLS residuals is A) some positive number since OLS uses squares. B) zero. C) unobservable since the population regression function is unknown. D) dependent on whether the explanatory variable is mostly positive or negative. Answer: B


13) The OLS residuals, \(\hat{u}_i\), are defined as follows: A) \(\hat{Y}_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\) B) \(Y_i - \beta_0 - \beta_1 X_i\) C) \(Y_i - \hat{Y}_i\) D) \((Y_i - \bar{Y})^2\) Answer: C

14) The slope estimator, \(\hat{\beta}_1\), has a smaller standard error, other things equal, if A) there is more variation in the explanatory variable, X. B) there is a large variance of the error term, u. C) the sample size is smaller. D) the intercept, \(\beta_0\), is small. Answer: A

15) The regression \(R^2\) is a measure of A) whether or not X causes Y. B) the goodness of fit of your regression line. C) whether or not ESS > TSS. D) the square of the determinant of R. Answer: B

16) (Requires Appendix) The sample regression line estimated by OLS A) will always have a slope smaller than the intercept. B) is exactly the same as the population regression line. C) cannot have a slope of zero. D) will always run through the point \((\bar{X}, \bar{Y})\). Answer: D

17) The OLS residuals A) can be calculated using the errors from the regression function. B) can be calculated by subtracting the fitted values from the actual values. C) are unknown since we do not know the population regression function. D) should not be used in practice since they indicate that your regression does not run through all your observations. Answer: B

18) The normal approximation to the sampling distribution of \(\hat{\beta}_1\) is powerful because A) many explanatory variables in real life are normally distributed. B) it allows econometricians to develop methods for statistical inference. C) many other distributions are not symmetric. D) it implies that OLS is the BLUE estimator for \(\beta_1\). Answer: B


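19) If the three least squares assumptions hold, then the large sample normal distribution of \(\hat{\beta}_1\) is
A) \(N\left(0, \frac{1}{n}\frac{var[(X_i - \mu_X)u_i]}{[var(X_i)]^2}\right)\).
B) \(N\left(\beta_1, \frac{1}{n}\frac{var[(X_i - \mu_X)u_i]}{[var(X_i)]^2}\right)\).
C) \(N\left(\beta_1, \frac{\sigma_u^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)\).
D) \(N\left(\beta_1, \frac{1}{n}\frac{var(u_i)}{[var(X_i)]^2}\right)\).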

Answer: B

20) In the simple linear regression model \(Y_i = \beta_0 + \beta_1 X_i + u_i\), A) the intercept is typically small and unimportant. B) \(\beta_0 + \beta_1 X_i\) represents the population regression function. C) the absolute value of the slope is typically between 0 and 1. D) \(\beta_0 + \beta_1 X_i\) represents the sample regression function. Answer: B

21) To obtain the slope estimator using the least squares principle, you divide the A) sample variance of X by the sample variance of Y. B) sample covariance of X and Y by the sample variance of Y. C) sample covariance of X and Y by the sample variance of X. D) sample variance of X by the sample covariance of X and Y. Answer: C

22) To decide whether or not the slope coefficient is large or small, A) you should analyze the economic importance of a given increase in X. B) the slope coefficient must be larger than one. C) the slope coefficient must be statistically significant. D) you should change the scale of the X variable if the coefficient appears to be too small. Answer: A

23) \(E(u_i \mid X_i) = 0\) says that A) dividing the error by the explanatory variable results in a zero (on average). B) the sample regression function residuals are unrelated to the explanatory variable. C) the sample mean of the Xs is much larger than the sample mean of the errors. D) the conditional distribution of the error given the explanatory variable has a zero mean. Answer: D

24) In the linear regression model, \(Y_i = \beta_0 + \beta_1 X_i + u_i\), \(\beta_0 + \beta_1 X_i\) is referred to as A) the population regression function. B) the sample regression function. C) exogenous variation. D) the right-hand variable or regressor. Answer: A


25) Multiplying the dependent variable by 100 and the explanatory variable by 100,000 leaves the A) OLS estimate of the slope the same. B) OLS estimate of the intercept the same. C) regression \(R^2\) the same. D) variance of the OLS estimators the same. Answer: C

26) Assume that you have collected a sample of observations from over 100 households and their consumption and income patterns. Using these observations, you estimate the following regression \(C_i = \beta_0 + \beta_1 Y_i + u_i\) where C is consumption and Y is disposable income. The estimate of \(\beta_1\) will tell you
A) \(\frac{\Delta Income}{\Delta Consumption}\)
B) The amount you need to consume to survive
C) \(\frac{Income}{Consumption}\)
D) \(\frac{\Delta Consumption}{\Delta Income}\)
Answer: D

27) In which of the following relationships does the intercept have a real-world interpretation? A) the relationship between the change in the unemployment rate and the growth rate of real GDP (“Okun’s Law”) B) the demand for coffee and its price C) test scores and class-size D) weight and height of individuals Answer: A

28) The OLS residuals, \(\hat{u}_i\), are sample counterparts of the population A) regression function slope B) errors C) regression function’s predicted values D) regression function intercept Answer: B

29) Changing the units of measurement, e.g. measuring test scores in 100s, will do all of the following EXCEPT for changing the A) residuals B) numerical value of the slope estimate C) interpretation of the effect that a change in X has on the change in Y D) numerical value of the intercept Answer: C

30) To decide whether the slope coefficient indicates a “large” effect of X on Y, you look at the A) size of the slope coefficient B) regression C) economic importance implied by the slope coefficient D) value of the intercept Answer: C


4.2 Essays and Longer Questions

1) Sir Francis Galton, a cousin of Charles Darwin, examined the relationship between the height of children and their parents towards the end of the 19th century. It is from this study that the name “regression” originated. You decide to update his findings by collecting data from 110 college students, and estimate the following relationship:

\(\widehat{Studenth} = 19.6 + 0.73 \times Midparh\), \(R^2 = 0.45\), SER = 2.0

where Studenth is the height of students in inches, and Midparh is the average of the parental heights. (Following Galton’s methodology, both variables were adjusted so that the average female height was equal to the average male height.)
(a) Interpret the estimated coefficients.
(b) What is the meaning of the regression \(R^2\)?
(c) What is the prediction for the height of a child whose parents have an average height of 70.06 inches?
(d) What is the interpretation of the SER here?
(e) Given the positive intercept and the fact that the slope lies between zero and one, what can you say about the height of students who have quite tall parents? Those who have quite short parents?
(f) Galton was concerned about the height of the English aristocracy and referred to the above result as “regression towards mediocrity.” Can you figure out what his concern was? Why do you think that we refer to this result today as “Galton’s Fallacy”?

Answer:
(a) For every one inch increase in the average height of their parents, the student’s height increases, on average, by 0.73 of an inch. There is no reasonable interpretation for the intercept.
(b) The model explains 45 percent of the variation in the height of students.
(c) 19.6 + 0.73 × 70.06 = 70.74.
(d) The SER is a measure of the spread of the observations around the regression line. The magnitude of the typical deviation from the regression line or the typical regression error here is two inches.
(e) Tall parents will have, on average, tall students, but they will not be as tall as their parents. Short parents will have short students, although on average, they will be somewhat taller than their parents.
(f) This is an example of mean reversion. Since the aristocracy was, on average, taller, he was concerned that their children would be shorter and resemble more the rest of the population. If this conclusion were true, then eventually everyone would be of the same height. However, we have not observed a decrease in the variance in height over time.
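The prediction in part (c) and the pattern described in part (e) can be illustrated with the estimated line. In this sketch the coefficients come from the question, while the function name and the example parental heights of 76 and 64 inches are hypothetical.

```python
INTERCEPT = 19.6  # estimated intercept from the question
SLOPE = 0.73      # estimated slope from the question

def predicted_student_height(midparent_height):
    """Predicted student height (inches) given the average parental height."""
    return INTERCEPT + SLOPE * midparent_height

prediction = predicted_student_height(70.06)
print(round(prediction, 2))  # 70.74

# Parental height at which the predicted student height equals the parents' height.
crossover = INTERCEPT / (1 - SLOPE)

# Hypothetical tall and short parents: predictions move toward the crossover.
print(predicted_student_height(76.0), predicted_student_height(64.0))
```

With these estimates the crossover lies near 72.6 inches: parents above it are predicted to have shorter children than themselves, parents below it taller ones.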


2) (Requires Appendix material) At a recent county fair, you observed that at one stand people’s weight was forecasted, and you were surprised by the accuracy (within a range). Thinking about how the person could have predicted your weight fairly accurately (despite the fact that she did not know about your “heavy bones”), you think about how this could have been accomplished. You remember that medical charts for children contain 5%, 25%, 50%, 75% and 95% lines for a weight/height relationship and decide to conduct an experiment with 110 of your peers. You collect the data and calculate the following sums:

\[ \sum_{i=1}^{n} Y_i = 17{,}375, \quad \sum_{i=1}^{n} X_i = 7{,}665.5, \quad \sum_{i=1}^{n} y_i^2 = 94{,}228.8, \quad \sum_{i=1}^{n} x_i^2 = 1{,}248.9, \quad \sum_{i=1}^{n} x_i y_i = 7{,}625.9 \]

where the height is measured in inches and weight in pounds. (Small letters refer to deviations from means as in \(z_i = Z_i - \bar{Z}\).)
(a) Calculate the slope and intercept of the regression and interpret these.
(b) Find the regression \(R^2\) and explain its meaning. What other factors can you think of that might have an influence on the weight of an individual?

Answer:
(a) \(\hat{\beta}_1 = \frac{7{,}625.9}{1{,}248.9} = 6.11\), \(\hat{\beta}_0 = 157.95 - 6.11 \times 69.69 = -267.86\). For every additional inch in height, students weigh roughly 6 pounds more, on average.
(b) \(R^2 = \frac{ESS}{TSS} = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2} = \frac{46{,}624.1}{94{,}228.8} = 0.495\). Roughly half of the weight variation in the 110 students is explained by the single explanatory variable, height. Answers will vary by student for the other factors, but calorie intake and amount of exercise typically appear as part of the list.
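The calculation for Question 2 can be retraced from the five sums. The sketch below uses hypothetical variable names and carries unrounded intermediate values, so the intercept comes out near -267.6 instead of the -267.86 obtained above from the rounded slope and means; the slope and \(R^2\) agree with the answer to two decimals.

```python
n = 110
sum_y = 17_375.0         # sum of weights (pounds)
sum_x = 7_665.5          # sum of heights (inches)
sum_y_dev_sq = 94_228.8  # sum of squared weight deviations from the mean
sum_x_dev_sq = 1_248.9   # sum of squared height deviations from the mean
sum_xy_dev = 7_625.9     # sum of cross products of the deviations

mean_y = sum_y / n
mean_x = sum_x / n

# OLS slope and intercept from the deviation sums.
slope = sum_xy_dev / sum_x_dev_sq
intercept = mean_y - slope * mean_x

# R^2 = ESS / TSS, with ESS = slope^2 * sum of squared X deviations.
ess = slope ** 2 * sum_x_dev_sq
r_squared = ess / sum_y_dev_sq
print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```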


3) You have obtained a sub-sample of 1744 individuals from the Current Population Survey (CPS) and are interested in the relationship between weekly earnings and age. The regression, using heteroskedasticity-robust standard errors, yielded the following result:

\(\widehat{Earn} = 239.16 + 5.20 \times Age\), \(R^2 = 0.05\), SER = 287.21,

where Earn and Age are measured in dollars and years respectively.
(a) Interpret the results.
(b) Is the effect of age on earnings large?
(c) Why should age matter in the determination of earnings? Do the results suggest that there is a guarantee for earnings to rise for everyone as they become older? Do you think that the relationship between age and earnings is linear?
(d) The average age in this sample is 37.5 years. What is average annual income in the sample?
(e) Interpret the measures of fit.

Answer:
(a) A person who is one year older earns, on average, $5.20 more per week. There is no meaning attached to the intercept. The regression explains 5 percent of the variation in earnings.
(b) Assuming that people worked 52 weeks a year, the effect of being one year older translates into an additional $270.40 a year. This does not seem particularly large in 2002 dollars, but may have been earlier.
(c) In general, age-earnings profiles take on an inverted U-shape. Hence the relationship is not linear and the linear approximation may not be good at all. Age may be a proxy for “experience,” which in itself can approximate “on the job training.” Hence the positive relationship between age and earnings. The results do not suggest that there is a guarantee for earnings to rise for everyone as they become older since the regression \(R^2\) does not equal 1. Instead the result holds “on average.”
(d) Since \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\), it follows that \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\). Substituting the estimates for the slope and the intercept then results in average weekly earnings of $434.16 or annual average earnings of $22,576.32.
(e) The regression \(R^2\) indicates that five percent of the variation in earnings is explained by the model. The typical error is $287.21.

4) The baseball team nearest to your home town is, once again, not doing well. Given that your knowledge of what it takes to win in baseball is vastly superior to that of management, you want to find out what it takes to win in Major League Baseball (MLB). You therefore collect the winning percentage of all 30 baseball teams in MLB for 1999 and regress the winning percentage on what you consider the primary determinant for wins, which is quality pitching (team earned run average). You find the following information on team performance:

Summary of the Distribution of Winning Percentage and Team Earned Run Average for MLB in 1999

                                Standard   Percentile
                      Average   deviation  10%    25%    40%    50% (median)  60%    75%    90%
Team ERA              4.71      0.53       3.84   4.35   4.72   4.78          4.91   5.06   5.25
Winning Percentage    0.50      0.08       0.40   0.43   0.46   0.48          0.49   0.59   0.60

(a) What is your expected sign for the regression slope? Will it make sense to interpret the intercept? If not, should you omit it from your regression and force the regression line through the origin?
(b) OLS estimation of the relationship between the winning percentage and the team ERA yields the following:

\(\widehat{Winpct} = 0.9 - 0.10 \times teamera\), \(R^2 = 0.49\), SER = 0.06,

where Winpct is measured as wins divided by games played, so for example a team that won half of its games would have Winpct = 0.50. Interpret your regression results.
(c) It is typically sufficient to win 90 games to be in the playoffs and/or to win a division. Winning over 100 games a season is exceptional: the Atlanta Braves had the most wins in 1999 with 103. Teams play a total of 162 games a year. Given this information, do you consider the slope coefficient to be large or small?
(d) What would be the effect on the slope, the intercept, and the regression \(R^2\) if you measured Winpct in percentage points, i.e., as (Wins/Games) × 100?
(e) Are you impressed with the size of the regression \(R^2\)? Given that 51% of the variation in the winning percentage is unexplained, what might some of the omitted factors be?

Answer:
(a) You expect a negative relationship, since a higher team ERA implies a lower quality of the input. No team comes close to a zero team ERA, and therefore it does not make sense to interpret the intercept. Forcing the regression through the origin is a false implication from this insight. Instead the intercept fixes the level of the regression.
(b) For every one point increase in Team ERA, the winning percentage decreases by 10 percentage points, or 0.10. Roughly half of the variation in winning percentage is explained by the quality of team pitching.
(c) The coefficient is large, since increasing the winning percentage by 0.10 is the equivalent of winning 16 more games per year. Since it is typically sufficient to win 56 percent of the games to qualify for the playoffs, this difference of 0.10 in winning percentage can easily turn a losing team into a winning team.
(d) Clearly the regression \(R^2\) will not be affected by a change in scale, since a descriptive measure of the quality of the regression would depend on whim otherwise. The slope of the regression will compensate in such a way that the interpretation of the result is unaffected, i.e., it will become 10 in the above example. The intercept will also change to reflect the fact that if X were 0, then the dependent variable would now be measured in percentage, i.e., it will become 94.0 in the above example.
(e) It is impressive that a single variable can explain roughly half of the variation in winning percentage. Answers to the second question will vary by student, but will typically include the quality of hitting, fielding, and management. Salaries could be included, but should be reflected in the inputs.
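The scaling argument in part (d) can be checked on any data set. The sketch below uses six made-up observations, not the 1999 MLB data, and a hypothetical `ols` helper: multiplying the dependent variable by 100 multiplies the slope and the intercept by 100 and leaves \(R^2\) unchanged.

```python
def ols(x, y):
    """Return (intercept, slope, r_squared) of a one-regressor OLS fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return intercept, slope, r_squared

# Hypothetical data for illustration only.
team_era = [3.8, 4.3, 4.7, 4.9, 5.1, 5.3]
win_share = [0.60, 0.55, 0.49, 0.50, 0.44, 0.42]  # wins / games
win_percent = [100 * w for w in win_share]        # (wins / games) x 100

b0, b1, r2 = ols(team_era, win_share)
b0_pct, b1_pct, r2_pct = ols(team_era, win_percent)
print(b1, b1_pct)  # slope scales by 100
print(b0, b0_pct)  # intercept scales by 100
print(r2, r2_pct)  # R^2 is unchanged
```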


5) You have learned in one of your economics courses that one of the determinants of per capita income (the “Wealth of Nations”) is the population growth rate. Furthermore you also found out that the Penn World Tables contain income and population data for 104 countries of the world. To test this theory, you regress the GDP per worker (relative to the United States) in 1990 (RelPersInc) on the difference between the average population growth rate of that country (n) and the U.S. average population growth rate (\(n_{us}\)) for the years 1980 to 1990. This results in the following regression output:

\(\widehat{RelPersInc} = 0.518 - 18.831 \times (n - n_{us})\), \(R^2 = 0.522\), SER = 0.197

(a) Interpret the results carefully. Is this relationship economically important?
(b) What would happen to the slope, intercept, and regression \(R^2\) if you ran another regression where the above explanatory variable was replaced by n only, i.e., the average population growth rate of the country? (The population growth rate of the United States from 1980 to 1990 was 0.009.) Should this have any effect on the t-statistic of the slope?
(c) 31 of the 104 countries have a dependent variable of less than 0.10. Does it therefore make sense to interpret the intercept?

Answer:
(a) A relative increase in the population growth rate of one percentage point, from 0.01 to 0.02, say, lowers relative per-capita income by almost 20 percentage points (0.188). This is a quantitatively important and large effect. Nations which have the same population growth rate as the United States have, on average, roughly half as much per capita income.
(b) The interpretation of the partial derivative is unaffected, in that the slope still indicates the effect of a one percentage point increase in the population growth rate. The regression \(R^2\) will remain the same since only a constant was removed from the explanatory variable. The intercept will change as a result of the change in X. Since neither the slope nor its standard error changes, the t-statistic of the slope is unaffected.
(c) To interpret the intercept, you must observe values of X close to zero, not Y.
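For part (b), the new intercept follows from expanding the estimated line: \(0.518 - 18.831 \times (n - 0.009) = (0.518 + 18.831 \times 0.009) - 18.831 \times n\). The sketch below works through that arithmetic; the function names and the example growth rate of 0.02 are hypothetical.

```python
INTERCEPT = 0.518  # estimated intercept with regressor (n - n_us)
SLOPE = -18.831    # estimated slope
N_US = 0.009       # U.S. population growth rate, 1980 to 1990

# Replacing (n - n_us) by n leaves the slope unchanged and shifts the intercept.
new_slope = SLOPE
new_intercept = INTERCEPT - SLOPE * N_US

def predict_original(n):
    """Prediction using the regressor (n - n_us)."""
    return INTERCEPT + SLOPE * (n - N_US)

def predict_shifted(n):
    """Prediction using the regressor n and the shifted intercept."""
    return new_intercept + new_slope * n

print(round(new_intercept, 3))  # 0.687
print(predict_original(0.02), predict_shifted(0.02))  # same prediction
```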


6) The neoclassical growth model predicts that for identical savings rates and population growth rates, countries should converge to the same per capita income level. This is referred to as the convergence hypothesis. One way to test for the presence of convergence is to compare the growth rates over time to the initial starting level.
(a) If you regressed the average growth rate over a time period (1960-1990) on the initial level of per capita income, what would the sign of the slope have to be to indicate this type of convergence? Explain. Would this result confirm or reject the prediction of the neoclassical growth model?
(b) The results of the regression for 104 countries were as follows:

\(\widehat{g6090} = 0.019 - 0.0006 \times RelProd60\), \(R^2 = 0.00007\), SER = 0.016,

where g6090 is the average annual growth rate of GDP per worker for the 1960-1990 sample period, and RelProd60 is GDP per worker relative to the United States in 1960. Interpret the results. Is there any evidence of unconditional convergence between the countries of the world? Is this result surprising? What other concept could you think about to test for convergence between countries?
(c) You decide to restrict yourself to the 24 OECD countries in the sample. This changes your regression output as follows:

\(\widehat{g6090} = 0.048 - 0.0404 \times RelProd60\), \(R^2 = 0.82\), SER = 0.0046

How does this result affect your conclusions from above?

Answer:
(a) You would require a negative sign. Countries that are far ahead of others at the beginning of the period would have to grow relatively slower for the others to catch up. This represents unconditional convergence, whereas the neoclassical growth model predicts conditional convergence, i.e., there will only be convergence if countries have identical savings, population growth rates, and production technology.
(b) An increase of 10 percentage points in RelProd60 results in a decrease of 0.00006 in the growth rate from 1960 to 1990, i.e., countries that were further ahead in 1960 do grow by less. There are some countries in the sample that have a value of RelProd60 close to zero (China, Uganda, Togo, Guinea) and you would expect these countries to grow roughly by 2 percent per year over the sample period. The regression \(R^2\) indicates that the regression has virtually no explanatory power. The result is not surprising given that there are not many theories that predict unconditional convergence between the countries of the world.
(c) Judging by the size of the slope coefficient, there is strong evidence of unconditional convergence for the OECD countries. The regression \(R^2\) is quite high, given that there is only a single explanatory variable in the regression. However, since we do not know the sampling distribution of the estimator in this case, we cannot conduct inference.


7) In 2001, the Arizona Diamondbacks defeated the New York Yankees in the Baseball World Series in 7 games. Some players, such as Bautista and Finley for the Diamondbacks, had a substantially higher batting average during the World Series than during the regular season. Others, such as Brosius and Jeter for the Yankees, did substantially worse. You set out to investigate whether or not the regular season batting average is a good indicator for the World Series batting average. The results for 11 players who had the most at bats for the two teams are:
\[ \widehat{AZWsavg} = -0.347 + 2.290 \times AZSeasavg, \quad R^2 = 0.11, \; SER = 0.145, \]
\[ \widehat{NYWsavg} = 0.134 + 0.136 \times NYSeasavg, \quad R^2 = 0.001, \; SER = 0.092, \]
where Wsavg and Seasavg indicate the batting average during the World Series and the regular season respectively.
(a) Focusing on the coefficients first, what is your interpretation?
(b) What can you say about the explanatory power of your equation? What do you conclude from this?
Answer: (a) The two regressions are quite different. For the Diamondbacks, players who had a 10 point higher batting average during the regular season had roughly a 23 point higher batting average during the World Series. Hence top performers did relatively better. The opposite holds for the Yankees, where the slope of 0.136 is far below one.
(b) Both regressions have little explanatory power as seen from the regression \(R^2\). Hence performance during the season is a poor forecast of World Series performance.

8) For the simple regression model of Chapter 4, you have been given the following data:
\[ \sum_{i=1}^{420} Y_i = 274{,}745.75; \quad \sum_{i=1}^{420} X_i = 8{,}248.979; \quad \sum_{i=1}^{420} X_iY_i = 5{,}392{,}705; \quad \sum_{i=1}^{420} X_i^2 = 163{,}513.03; \quad \sum_{i=1}^{420} Y_i^2 = 179{,}878{,}841.13 \]
(a) Calculate the regression slope and the intercept.
(b) Calculate the regression \(R^2\).
Answer: (a)
\[ \hat{\beta}_1 = \frac{5{,}392{,}705 - 420 \times 19.64 \times 654.16}{163{,}513.03 - 420 \times 19.64^2} = -2.28; \qquad \hat{\beta}_0 = 654.2 - (-2.28) \times 19.6 = 698.9. \]
(This is the data set for Chapter 4.)
(b)
\[ R^2 = \frac{-2.28 \times (5{,}392{,}704.6 - 420 \times 19.6 \times 654.2)}{179{,}878{,}841.1 - 420 \times 654.2^2} = 0.051 \]
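The arithmetic in question 8 can be checked with a few lines of Python. This is a minimal sketch rather than part of the test bank: the function name `ols_from_sums` and the variable names are illustrative assumptions, while the five sums and n = 420 are the ones given in the question.

```python
# Sums reported in question 8 (n = 420 observations).
n = 420
sum_y = 274_745.75
sum_x = 8_248.979
sum_xy = 5_392_705.0
sum_x2 = 163_513.03
sum_y2 = 179_878_841.13


def ols_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return (slope, intercept, r_squared) using only the five raw sums."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    s_xy = sum_xy - n * x_bar * y_bar   # sum of cross-products of deviations
    s_xx = sum_x2 - n * x_bar ** 2      # variation in X
    s_yy = sum_y2 - n * y_bar ** 2      # variation in Y (TSS)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r_squared = slope * s_xy / s_yy     # ESS / TSS
    return slope, intercept, r_squared


slope, intercept, r_squared = ols_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(f"slope = {slope:.2f}, intercept = {intercept:.1f}, R^2 = {r_squared:.3f}")
```

Working from unrounded means avoids the small discrepancies that arise when 19.6 and 654.2 are plugged in by hand.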


9) Your textbook presented you with the following regression output:
\[ \widehat{TestScore} = 698.9 - 2.28 \times STR, \quad n = 420, \; R^2 = 0.051, \; SER = 18.6 \]
(a) How would the slope coefficient change, if you decided one day to measure testscores in 100s, i.e., a testscore of 650 became 6.5? Would this have an effect on your interpretation?
(b) Do you think the regression \(R^2\) will change? Why or why not?
(c) Although Chapter 4 in your textbook did not deal with hypothesis testing, it presented you with the large sample distribution for the slope and the intercept estimator. Given the change in the units of measurement in (a), do you think that the variance of the slope estimator will change numerically? Why or why not?
Answer: (a) The new regression line would be \(\widehat{NewTestScore} = 6.989 - 0.0228 \times STR\). Hence the decimal point would simply move two digits to the left. The interpretation remains the same, since an increase in the student-teacher ratio by 2, say, decreases the new testscore by 0.0456 points on the new testscore scale, which is 4.56 in the original testscores.
(b) The regression \(R^2\) should not change, since, if it did, an objective measure of fit would depend on whim (the units of measurement). The SER will change (from 18.6 to 0.186). This is to be expected, since the TSS obviously changes, and with the regression \(R^2\) unchanged, the SSR (and hence SER) have to adjust accordingly.
(c) Since statistical inference will depend on the ratio of the estimator and its standard error, the standard error must change in proportion to the estimator. If this were not true, then statistical inference again would depend on the whim of the investigator.

10) The news-magazine The Economist regularly publishes data on the so called Big Mac index and exchange rates between countries. The data for 30 countries from the April 29, 2000 issue is listed below:

Country          Currency   Price of Big Mac   Actual Exchange Rate per U.S. dollar
Indonesia        Rupiah     14,500             7,945
Italy            Lira       4,500              2,088
South Korea      Won        3,000              1,108
Chile            Peso       1,260              514
Spain            Peseta     375                179
Hungary          Forint     339                279
Japan            Yen        294                106
Taiwan           Dollar     70                 30.6
Thailand         Baht       55                 38.0
Czech Rep.       Crown      54.37              39.1
Russia           Ruble      39.50              28.5
Denmark          Crown      24.75              8.04
Sweden           Crown      24.0               8.84
Mexico           Peso       20.9               9.41
France           Franc      18.5               .07
Israel           Shekel     14.5               4.05
China            Yuan       9.90               8.28
South Africa     Rand       9.0                6.72
Switzerland      Franc      5.90               1.70
Poland           Zloty      5.50               4.30
Germany          Mark       4.99               2.11
Malaysia         Dollar     4.52               3.80
New Zealand      Dollar     3.40               2.01
Singapore        Dollar     3.20               1.70
Brazil           Real       2.95               1.79
Canada           Dollar     2.85               1.47
Australia        Dollar     2.59               1.68
Argentina        Peso       2.50               1.00
Britain          Pound      1.90               0.63
United States    Dollar     2.51

The concept of purchasing power parity or PPP (“the idea that similar foreign and domestic goods … should have the same price in terms of the same currency,” Abel, A. and B. Bernanke, Macroeconomics, 4th edition, Boston: Addison Wesley, 476) suggests that the ratio of the Big Mac priced in the local currency to the U.S. dollar price should equal the exchange rate between the two countries.
(a) Enter the data into your regression analysis program (EViews, Stata, Excel, SAS, etc.). Calculate the predicted exchange rate per U.S. dollar by dividing the price of a Big Mac in local currency by the U.S. price of a Big Mac ($2.51).
(b) Run a regression of the actual exchange rate on the predicted exchange rate. If purchasing power parity held, what would you expect the slope and the intercept of the regression to be? Is the value of the slope and the intercept “far” from the values you would expect to hold under PPP?
(c) Plot the actual exchange rate against the predicted exchange rate. Include the 45 degree line in your graph. Which observations might cause the intercept and the slope to differ from zero and one?
Answer: (a)

Country          Predicted Exchange Rate per U.S. dollar
Indonesia        5777
Italy            1793
South Korea      1195
Chile            502
Spain            149
Hungary          135
Japan            117
Taiwan           27.9
Thailand         21.9
Czech Rep.       21.7
Russia           15.7
Denmark          9.86
Sweden           9.56
Mexico           8.33
France           7.37
Israel           5.78
China            3.94
South Africa     3.59
Switzerland      2.35
Poland           2.19
Germany          1.99
Malaysia         1.80
New Zealand      1.35
Singapore        1.27
Brazil           1.18
Canada           1.14
Australia        1.03
Argentina        1.00
Britain          0.76

(b) The estimated regression is as follows:
\[ \widehat{ActualExRate} = -27.05 + 1.35 \times PredExRate, \quad R^2 = 0.994, \; n = 29, \; SER = 122.15 \]
For PPP to hold exactly, you would expect an intercept of zero and a slope of unity. Since we do not know the standard error of the slope and the intercept, and since Chapter 4 has not dealt with hypothesis testing, it is hard to judge how “far” -27.05 and 1.35 are away from zero and one respectively.
(c) The regression is represented by the solid line, while the dashed one is the 45 degree line. Most of the observations are bunched towards the origin, making it hard to judge from this graph which observations cause the regression line to differ from the 45 degree line. However, the Indonesian Rupiah is certainly a possible candidate.

11) At the Stock and Watson (http://www.pearsonhighered.com/stock_watson) website go to Student Resources and select the option “Datasets for Replicating Empirical Results.” Then select the “California Test Score Data Used in Chapters 4-9” (caschool.xls) and open it in a spreadsheet program such as Excel. In this exercise you will estimate various statistics of the Linear Regression Model with One Regressor through construction of various sums and ratios within a spreadsheet program. Throughout this exercise, let Y correspond to Test Scores (testscore) and X to the Student Teacher Ratio (str). To generate answers to all exercises here, you will have to create seven columns and the sums of five of these. They are
(i) \(Y_i\), (ii) \(X_i\), (iii) \((Y_i - \bar{Y})\), (iv) \((X_i - \bar{X})\), (v) \((Y_i - \bar{Y}) \times (X_i - \bar{X})\), (vi) \((X_i - \bar{X})^2\), (vii) \((Y_i - \bar{Y})^2\)
Although neither the sum of (iii) nor the sum of (iv) will be required for further calculations, you may want to generate these as a check (both have to sum to zero).
a. Use equation (4.7) and the sums of columns (v) and (vi) to generate the slope of the regression.
b. Use equation (4.8) to generate the intercept.
c. Display the regression line (4.9) and interpret the coefficients.
d. Use equation (4.16) and the sum of column (vii) to calculate the regression \(R^2\).
e. Use equation (4.19) to calculate the SER.
f. Use the “Regression” function in Excel to verify the results.

Answer:
Column (i): 654.156548 (mean of Y)
Column (ii): 19.64043 (mean of X)
Column (iii): 1.27329E-11 (sum)
Column (iv): 1.13E-12 (sum)
Column (v): -3418.76 (sum)
Column (vi): 1499.58 (sum)
Column (vii): 152109.6 (sum)

a. \(\hat{\beta}_1 = \dfrac{-3418.76}{1499.58} = -2.27981\)

b. \(\hat{\beta}_0 = \dfrac{274745.75}{420} - (-2.27981) \times \dfrac{8248.979}{420} = 698.933\)

c. \(\hat{Y}_i = 698.9 - 2.28 \times X_i\). A decrease in the student-teacher ratio of one results in an increase in test scores of 2.28. It is best not to interpret the intercept; it simply determines the height of the regression line.

d. To calculate the regression \(R^2\), you need the TSS given from the sum in column (vii) and either the ESS or SSR. In principle, you could use equation (4.10) to generate the residuals, square these and sum them up to get SSR. However, the textbook suggests a shortcut at the bottom of p. 142:
\[ \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 - \hat{\beta}_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
(the cross-product vanishes due to the orthogonality conditions (4.32) and (4.36)). The various terms on the RHS of the equation have been calculated and equation (4.35) implies that \(\hat{\beta}_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 = ESS = 7794.11\). Hence the regression \(R^2 = \dfrac{7794.11}{152109.6} = 0.051\).

e. The answer in (d) can be used to calculate the SSR, which is 144315.5. Hence the SER must be 18.6.

f. SUMMARY OUTPUT

Regression Statistics
Multiple R           0.226
R Square             0.051
Adjusted R Square    0.049
Standard Error       18.581
Observations         420

ANOVA
              df     SS
Regression    1      7794.11
Residual      418    144315.5
Total         419    152109.6

              Coefficients
Intercept     698.93
str           -2.28
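The spreadsheet steps in parts a, b, d, and e can also be sketched in Python. This is only an illustration under stated assumptions: the variable names are made up here, and the inputs are the column results reported in the answer above, not values recomputed from caschool.xls.

```python
from math import sqrt

# Column results reported in the answer to question 11.
n = 420
y_bar = 654.156548      # mean of test scores
x_bar = 19.64043        # mean of the student-teacher ratio
sum_dev_xy = -3418.76   # sum of column (v)
sum_dev_x2 = 1499.58    # sum of column (vi)
sum_dev_y2 = 152109.6   # sum of column (vii), the TSS

slope = sum_dev_xy / sum_dev_x2        # part a
intercept = y_bar - slope * x_bar      # part b
ess = slope ** 2 * sum_dev_x2          # shortcut for the explained sum of squares
ssr = sum_dev_y2 - ess                 # TSS = ESS + SSR
r_squared = ess / sum_dev_y2           # part d
ser = sqrt(ssr / (n - 2))              # part e, with n - 2 degrees of freedom

print(f"slope = {slope:.5f}, intercept = {intercept:.3f}")
print(f"R^2 = {r_squared:.3f}, SSR = {ssr:.1f}, SER = {ser:.3f}")
```

The SER divides the SSR by n - 2 = 418, which matches the residual degrees of freedom in the Excel ANOVA table.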

12) You have obtained a sample of 14,925 individuals from the Current Population Survey (CPS) and are interested in the relationship between average hourly earnings and years of education. The regression yields the following result:
\[ \widehat{ahe} = -4.58 + 1.71 \times educ, \quad R^2 = 0.182, \; SER = 9.30, \]
where ahe and educ are measured in dollars and years respectively.
a. Interpret the coefficients and the regression \(R^2\).
b. Is the effect of education on earnings large?
c. Why should education matter in the determination of earnings? Do the results suggest that there is a guarantee for average hourly earnings to rise for everyone as they receive an additional year of education? Do you think that the relationship between education and average hourly earnings is linear?
d. The average years of education in this sample is 13.5 years. What is the mean of average hourly earnings in the sample?
e. Interpret the measure SER. What is its unit of measurement?

Answer: a. A person with one more year of education has, on average, hourly earnings that are $1.71 higher. There is no meaning attached to the intercept; it just determines the height of the regression. The model explains about 18 percent of the variation in average hourly earnings.
b. The difference between a high school graduate and a college graduate is four years of education. Hence a college graduate will earn almost $7 more per hour, on average ($6.84 to be precise). If you assume that there are 2,000 working hours per year, then the average salary difference would be close to $14,000 (actually $13,680). Depending on how much you have spent for an additional year of education and how much income you have forgone, this does not seem particularly large.
c. In general, you would expect to find a positive relationship between years of education and average hourly earnings. Education is considered investment in human capital. If this were not the case, then it would be a puzzle as to why there are students in the econometrics course; surely they are not there to just “find themselves” (which would be quite expensive in most cases). However, if you consider education as an investment and you wanted to see a return on it, then the relationship will most likely not be linear. For example, a constant percent return would imply an exponential relationship whereby the additional year of education would bring a larger increase in average hourly earnings at higher levels of education. The results do not suggest that there is a guarantee for earnings to rise for everyone as they become more educated since the regression \(R^2\) does not equal 1. Instead the result holds “on average.”
d. Since \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \Rightarrow \bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\). Substituting the estimates for the slope and the intercept then results in a mean of average hourly earnings of roughly $18.50.
e. The typical prediction error is $9.30. Since the measure is related to the deviation of the actual and fitted values, the unit of measurement must be the same as that of the dependent variable, which is in dollars here.
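The numbers quoted in parts b and d follow directly from the estimated coefficients. The short Python sketch below reproduces them; the variable names are illustrative, and the 2,000 working hours per year is the assumption stated in the answer to part b.

```python
# Estimated coefficients from question 12.
intercept = -4.58   # dollars per hour
slope = 1.71        # dollars per hour per additional year of education

# Part d: the OLS line passes through the sample means.
mean_educ = 13.5
mean_ahe = intercept + slope * mean_educ

# Part b: high school versus college graduate (four extra years).
extra_years = 4
hourly_gap = slope * extra_years
hours_per_year = 2000               # assumption made in the answer
annual_gap = hourly_gap * hours_per_year

print(f"mean hourly earnings: {mean_ahe:.2f}")
print(f"hourly gap: {hourly_gap:.2f}, annual gap: {annual_gap:.0f}")
```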

4.3 Mathematical and Graphical Problems

1) Prove that the regression \(R^2\) is identical to the square of the correlation coefficient between two variables Y and X. Regression functions are written in a form that suggests causation running from X to Y. Given your proof, does a high regression \(R^2\) present supportive evidence of a causal relationship? Can you think of some regression examples where the direction of causality is not clear? Where it is without a doubt?
Answer: The regression \(R^2 = \dfrac{ESS}{TSS}\), where ESS is given by \(\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2\). But \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) and \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\). Hence \((\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 (X_i - \bar{X})^2\) and therefore \(ESS = \hat{\beta}_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2\). Using small letters to indicate deviations from mean, i.e., \(z_i = Z_i - \bar{Z}\), we get that the regression
\[ R^2 = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}. \]
The square of the correlation coefficient is
\[ r^2 = \frac{\left(\sum_{i=1}^{n} y_i x_i\right)^2}{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2} = \frac{\left(\sum_{i=1}^{n} y_i x_i\right)^2 \sum_{i=1}^{n} x_i^2}{\left(\sum_{i=1}^{n} x_i^2\right)^2 \sum_{i=1}^{n} y_i^2} = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}. \]
Hence the two are the same. Correlation does not imply causation. Income is a regressor in the consumption function, yet consumption enters on the right-hand side of the GDP identity. Regressing the weight of individuals on the height is a situation where causality is without doubt, since the author of this test bank should be seven feet tall otherwise. The authors of the textbook use weather data to forecast orange juice prices later in the text.

2) You have analyzed the relationship between the weight and height of individuals. Although you are quite confident about the accuracy of your measurements, you feel that some of the observations are extreme, say, two standard deviations above and below the mean. You therefore decide to disregard these individuals. What consequence will this have on the standard deviation of the OLS estimator of the slope?
Answer: Other things being equal, the standard error of the slope coefficient will decrease the larger the variation in X. Hence you prefer more variation rather than less. To the extent that dropping the extreme individuals reduces the variation in X, the standard deviation of the slope estimator will increase. This can be seen from formula (4.20) in the text. Intuitively it is easier for OLS to detect a response to a unit change in X if the data varies more.


3) In order to calculate the regression \(R^2\) you need the TSS and either the SSR or the ESS. The TSS is fairly straightforward to calculate, being just the variation of Y. However, if you had to calculate the SSR or ESS by hand (or in a spreadsheet), you would need all fitted values from the regression function and their deviations from the sample mean, or the residuals. Can you think of a quicker way to calculate the ESS simply using terms you have already used to calculate the slope coefficient?
Answer: The ESS is given by \(\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2\). But \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) and \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\). Hence \((\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 (X_i - \bar{X})^2\), and therefore \(ESS = \hat{\beta}_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2\). The right-hand side contains the estimated slope squared and the denominator of the slope, i.e., all values that have already been calculated.

4) (Requires Appendix material) In deriving the OLS estimator, you minimize the sum of squared residuals with respect to the two parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\). The resulting two equations imply two restrictions that OLS places on the data, namely that \(\sum_{i=1}^{n} \hat{u}_i = 0\) and \(\sum_{i=1}^{n} \hat{u}_i X_i = 0\). Show that you get the same formula for the regression slope and the intercept if you impose these two conditions on the sample regression function.
Answer: The sample regression function is \(Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i\). Summing both sides results in
\[ \sum_{i=1}^{n} Y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} \hat{u}_i. \]
Imposing the first restriction, namely that the sum of the residuals is zero, dividing both sides of the equation by n, and solving for \(\hat{\beta}_0\) gives the OLS formula for the intercept. For the second restriction, multiply both sides of the sample regression function by \(X_i\) and then sum both sides to get
\[ \sum_{i=1}^{n} Y_i X_i = \hat{\beta}_0 \sum_{i=1}^{n} X_i + \hat{\beta}_1 \sum_{i=1}^{n} X_i^2 + \sum_{i=1}^{n} \hat{u}_i X_i. \]
After imposing the restriction \(\sum_{i=1}^{n} \hat{u}_i X_i = 0\) and substituting the formula for the intercept, you get
\[ \sum_{i=1}^{n} Y_i X_i = (\bar{Y} - \hat{\beta}_1 \bar{X})\, n\bar{X} + \hat{\beta}_1 \sum_{i=1}^{n} X_i^2 \quad \text{or} \quad \sum_{i=1}^{n} Y_i X_i - n\bar{Y}\bar{X} = \hat{\beta}_1 \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right), \]
which, after isolating \(\hat{\beta}_1\) and dividing by the variation in X, results in the OLS estimator for the slope.


5) (Requires Appendix material) Show that the two alternative formulae for the slope given in your textbook are identical.
\[ \frac{\frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \bar{X}\bar{Y}}{\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \]
Answer: Start with the numerator of the right-hand side expression, which can be written as follows:
\[ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} (X_i Y_i - \bar{X} Y_i - \bar{Y} X_i + \bar{X}\bar{Y}) = \sum_{i=1}^{n} X_i Y_i - \bar{X} \sum_{i=1}^{n} Y_i - \bar{Y} \sum_{i=1}^{n} X_i + n\bar{X}\bar{Y} \]
\[ = \sum_{i=1}^{n} Y_i X_i - n\bar{X}\bar{Y} - n\bar{X}\bar{Y} + n\bar{X}\bar{Y} = \sum_{i=1}^{n} Y_i X_i - n\bar{X}\bar{Y}. \]
(Note that \(\sum_{i=1}^{n} X_i = n\bar{X}\).) Multiplying out the terms in the denominator and moving the summation sign into the expression in parentheses similarly yields \(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\). Dividing both of these expressions by n then results in the left-hand side fraction.

6) (Requires Calculus) Consider the following model: \(Y_i = \beta_0 + u_i\). Derive the OLS estimator for \(\beta_0\).
Answer: To derive the OLS estimator, minimize the sum of squared prediction mistakes \(\sum_{i=1}^{n} (Y_i - b_0)^2\). Taking the derivative with respect to \(b_0\) results in
\[ \frac{\partial}{\partial b_0} \sum_{i=1}^{n} (Y_i - b_0)^2 = \sum_{i=1}^{n} \frac{\partial}{\partial b_0} (Y_i - b_0)^2 = \sum_{i=1}^{n} 2 (Y_i - b_0)(-1) = (-2) \sum_{i=1}^{n} (Y_i - b_0) = (-2) \left( \sum_{i=1}^{n} Y_i - n b_0 \right). \]
Setting the derivative to zero then results in the OLS estimator:
\[ (-2) \left( \sum_{i=1}^{n} Y_i - n\hat{\beta}_0 \right) = 0 \Rightarrow \hat{\beta}_0 = \bar{Y}. \]


7) (Requires Calculus) Consider the following model: \(Y_i = \beta_1 X_i + u_i\). Derive the OLS estimator for \(\beta_1\).
Answer: To derive the OLS estimator, minimize the sum of squared prediction mistakes \(\sum_{i=1}^{n} (Y_i - b_1 X_i)^2\). Taking the derivative with respect to \(b_1\) results in
\[ \frac{\partial}{\partial b_1} \sum_{i=1}^{n} (Y_i - b_1 X_i)^2 = \sum_{i=1}^{n} \frac{\partial}{\partial b_1} (Y_i - b_1 X_i)^2 = \sum_{i=1}^{n} 2 (Y_i - b_1 X_i)(-X_i) = (-2) \sum_{i=1}^{n} (Y_i - b_1 X_i) X_i = (-2) \sum_{i=1}^{n} (Y_i X_i - b_1 X_i^2). \]
Setting the derivative to zero then results in the OLS estimator:
\[ (-2) \left( \sum_{i=1}^{n} Y_i X_i - \hat{\beta}_1 \sum_{i=1}^{n} X_i^2 \right) = 0 \Rightarrow \hat{\beta}_1 = \frac{\sum_{i=1}^{n} Y_i X_i}{\sum_{i=1}^{n} X_i^2}. \]
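The two estimators derived in questions 6 and 7 can be checked numerically. The sketch below is an illustration only: the five data points and all function names are hypothetical, chosen just to show that the closed-form solutions give a smaller sum of squared prediction mistakes than nearby values.

```python
def slope_through_origin(x, y):
    """OLS slope for Y_i = b1 * X_i + u_i: sum(X*Y) / sum(X^2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def ssr_through_origin(b1, x, y):
    """Sum of squared prediction mistakes for a line through the origin."""
    return sum((yi - b1 * xi) ** 2 for xi, yi in zip(x, y))


def ssr_intercept_only(b0, y):
    """Sum of squared prediction mistakes for Y_i = b0 + u_i."""
    return sum((yi - b0) ** 2 for yi in y)


# Hypothetical data, used only to illustrate the formulas.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1_hat = slope_through_origin(x, y)   # question 7
b0_hat = sum(y) / len(y)              # question 6: the sample mean of Y

# First-order condition from question 7: sum((Y - b1*X) * X) should be zero.
foc = sum((yi - b1_hat * xi) * xi for xi, yi in zip(x, y))

print(f"b1_hat = {b1_hat:.4f}, first-order condition = {foc:.2e}")
print(f"b0_hat = {b0_hat:.4f}")
```

Moving either estimate slightly in either direction raises the corresponding sum of squares, which is what the first-order conditions imply for a quadratic objective.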


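8) Show first that the regression \(R^2\) is the square of the sample correlation coefficient. Next, show that the slope of a simple regression of Y on X is only identical to the inverse of the regression slope of X on Y if the regression \(R^2\) equals one.
Answer: The regression \(R^2 = \dfrac{ESS}{TSS}\), where ESS is given by \(\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2\). But \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) and \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\). Hence \((\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 (X_i - \bar{X})^2\), and therefore \(ESS = \hat{\beta}_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2\). Using small letters to indicate deviations from mean, i.e., \(z_i = Z_i - \bar{Z}\), we get that the regression
\[ R^2 = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}. \]
The square of the correlation coefficient is
\[ r^2 = \frac{\left(\sum_{i=1}^{n} y_i x_i\right)^2}{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2} = \frac{\left(\sum_{i=1}^{n} y_i x_i\right)^2 \sum_{i=1}^{n} x_i^2}{\left(\sum_{i=1}^{n} x_i^2\right)^2 \sum_{i=1}^{n} y_i^2} = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}. \]
Hence the two are the same. Now
\[ 1 = r^2 = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}. \]
But \(\hat{\beta}_1^2 = \hat{\beta}_1 \times \hat{\beta}_1\) and therefore
\[ 1 = \hat{\beta}_1 \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} \cdot \frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2} = \hat{\beta}_1 \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2} \Rightarrow \hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i^2}{\sum_{i=1}^{n} x_i y_i}, \]
which is the inverse of the regression slope of X on Y.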
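Both parts of the result can be illustrated numerically. The following is a minimal sketch with hypothetical data and illustrative function names: it computes the slope of Y on X, the slope of X on Y, the regression R², and the sample correlation, first for a noisy relationship and then for an exact linear one.

```python
from math import sqrt


def deviations(values):
    """Return the values as deviations from their sample mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def regression_summary(x, y):
    """Return (slope of Y on X, slope of X on Y, R^2, correlation)."""
    dx, dy = deviations(x), deviations(y)
    s_xy = sum(a * b for a, b in zip(dx, dy))
    s_xx = sum(a * a for a in dx)
    s_yy = sum(b * b for b in dy)
    slope_y_on_x = s_xy / s_xx
    slope_x_on_y = s_xy / s_yy
    r_squared = slope_y_on_x ** 2 * s_xx / s_yy   # ESS / TSS
    corr = s_xy / sqrt(s_xx * s_yy)
    return slope_y_on_x, slope_x_on_y, r_squared, corr


# Hypothetical noisy data: R^2 is below one.
noisy = regression_summary([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 2.5, 4.5, 4.0, 6.5])

# Exact line Y = 3 + 2X: R^2 equals one and the slopes are reciprocals.
exact = regression_summary([1.0, 2.0, 3.0, 4.0], [5.0, 7.0, 9.0, 11.0])

print("noisy:", noisy)
print("exact:", exact)
```

In the noisy case the product of the two slopes equals R², so the slopes are reciprocals only when R² is one.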


9) Consider the sample regression function \(Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i\). First, take averages on both sides of the equation. Second, subtract the resulting equation from the above equation to write the sample regression function in deviations from means. (For simplicity, you may want to use small letters to indicate deviations from the mean, i.e., \(z_i = Z_i - \bar{Z}\).) Finally, illustrate in a two-dimensional diagram with SSR on the vertical axis and the regression slope on the horizontal axis how you could find the least squares estimator for the slope by varying its values through trial and error.
Answer: Taking averages results in the following equation: \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\). Subtracting this equation from the above one, we get \(y_i = \hat{\beta}_1 x_i + \hat{u}_i\).
\[ SSR = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_1 x_i)^2 \]
is a quadratic which takes on different values for different choices of \(\hat{\beta}_1\) (the y and x are given in this case, i.e., different from the usual calculus problems, they cannot vary here). You could choose a starting value of the slope and calculate SSR. Next you could choose a different value for the slope and calculate the new SSR. There are two choices for the new slope value for you to make: first, in which direction you want to move, and second, how large a distance you want to choose the new slope value from the old one. (In essence, this is what sophisticated search algorithms do.) You continue with this procedure until you find the smallest SSR. The slope coefficient which has generated this SSR is the OLS estimator.

10) Given the amount of money and effort that you have spent on your education, you wonder if it was (is) all worth it. You therefore collect data from the Current Population Survey (CPS) and estimate a linear relationship between earnings and the years of education of individuals. What would be the effect on your regression slope and intercept if you measured earnings in thousands of dollars rather than in dollars? Would the regression \(R^2\) be affected? Should statistical inference be dependent on the scale of variables? Discuss.
Answer: It should be clear that interpretation of estimated relationships and statistical inference should not depend on the units of measurement. Otherwise whim could dictate conclusions. Hence the regression \(R^2\) and statistical inference cannot be affected. It is easy but tedious to show this mathematically. Next, the intercept indicates the value of Y when X is zero, and the slope indicates the change in Y for a unit change in X. Because it is the dependent variable, earnings, that is rescaled here, both the intercept and the slope are rescaled by the same factor: in the above case, the decimal point will move 3 digits to the left in both. (Had the units of measurement of X changed instead, the intercept would not be affected, since the change in X is cancelled by the change in \(\hat{\beta}_1\), and only the slope coefficient would change to compensate.)


11) (Requires Appendix material) Consider the sample regression function
\[ Y_i^* = \hat{\gamma}_0 + \hat{\gamma}_1 X_i^* + \hat{u}_i, \]
where * indicates that the variable has been standardized. What are the units of measurement for the dependent and explanatory variable? Why would you want to transform both variables in this way? Show that the OLS estimator for the intercept equals zero. Next prove that the OLS estimator for the slope in this case is identical to the formula for the least squares estimator where the variables have not been standardized, times the ratio of the sample standard deviation of X and Y, i.e., \(\hat{\gamma}_1 = \hat{\beta}_1 \times \dfrac{S_X}{S_Y}\).
Answer: The units of measurement are in standard deviations. Standardizing the variables allows conversion into common units and allows comparison of the size of coefficients. The mean of standardized variables is zero, and hence the OLS intercept must also be zero. The slope coefficient is given by the formula
\[ \hat{\gamma}_1 = \frac{\sum_{i=1}^{n} x_i^* y_i^*}{\sum_{i=1}^{n} x_i^{*2}}, \]
where small letters indicate deviations from mean, i.e., \(z = Z - \bar{Z}\). Note that means of standardized variables are zero, and hence we get
\[ \hat{\gamma}_1 = \frac{\sum_{i=1}^{n} X_i^* Y_i^*}{\sum_{i=1}^{n} X_i^{*2}}. \]
Writing this expression in terms of originally observed variables results in
\[ \hat{\gamma}_1 = \frac{\frac{1}{S_X S_Y} \sum_{i=1}^{n} x_i y_i}{\frac{1}{S_X^2} \sum_{i=1}^{n} x_i^2}, \]
which is the same as the sought after expression after simplification.


12) The OLS slope estimator is not defined if there is no variation in the data for the explanatory variable. You are interested in estimating a regression relating earnings to years of schooling. Imagine that you had collected data on earnings for different individuals, but that all these individuals had completed a college education (16 years of education). Sketch what the data would look like and explain intuitively why the OLS coefficient does not exist in this situation. Answer: There is no variation in X in this case, and it is therefore unreasonable to ask by how much Y would change if X changed by one unit. Regression analysis cannot figure out the answer to this question, because a change in X never happens in the sample.

13) Indicate in a scatterplot what the data for your dependent variable and your explanatory variable would look like in a regression with an R2 equal to zero. How would this change if the regression R2 was equal to one? Answer: For the zero regression R2 , the data would look something like this:

In the case of the regression R2 being one, all observations would lie on a straight line.


14) Imagine that you had discovered a relationship that would generate a scatterplot very similar to the relationship \(Y_i = X_i^2\), and that you would try to fit a linear regression through your data points. What do you expect the slope coefficient to be? What do you think the value of your regression \(R^2\) is in this situation? What are the implications from your answers in terms of fitting a linear regression through a non-linear relationship?
Answer: You would expect the fitted regression line to be flat (slope = 0) and the regression \(R^2\) to be zero in this situation (assuming the X values are spread roughly symmetrically around zero). The implication is that although there may be a relationship between two variables, you may not detect it if you use the wrong functional form.

15) (Requires Appendix material) A necessary and sufficient condition to derive the OLS estimator is that the following two conditions hold: \(\sum_{i=1}^{n} \hat{u}_i = 0\) and \(\sum_{i=1}^{n} \hat{u}_i X_i = 0\). Show that these conditions imply that \(\sum_{i=1}^{n} \hat{u}_i \hat{Y}_i = 0\).
Answer:
\[ \sum_{i=1}^{n} \hat{u}_i \hat{Y}_i = \sum_{i=1}^{n} \hat{u}_i (\hat{\beta}_0 + \hat{\beta}_1 X_i) = \hat{\beta}_0 \sum_{i=1}^{n} \hat{u}_i + \hat{\beta}_1 \sum_{i=1}^{n} \hat{u}_i X_i = 0 \]
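The three orthogonality conditions in question 15 can be verified on any data set. The sketch below uses hypothetical numbers and an illustrative helper `ols`; it fits the line with the usual deviation-form formulas and then sums the residuals, the residuals times X, and the residuals times the fitted values.

```python
def ols(x, y):
    """Return (intercept, slope) of the OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

sum_u = sum(residuals)
sum_u_x = sum(u * xi for u, xi in zip(residuals, x))
sum_u_fitted = sum(u * fi for u, fi in zip(residuals, fitted))

# All three sums are zero up to floating-point rounding error.
print(sum_u, sum_u_x, sum_u_fitted)
```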

16) The help function for a commonly used spreadsheet program gives the following definition for the regression slope it estimates:
\[ \frac{n \sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} \]
Prove that this formula is the same as the one given in the textbook.
Answer:
\[ \frac{n \sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \frac{n \sum_{i=1}^{n} X_i Y_i - n\bar{X}\, n\bar{Y}}{n \sum_{i=1}^{n} X_i^2 - (n\bar{X})^2} = \frac{n \sum_{i=1}^{n} X_i Y_i - n^2 \bar{X}\bar{Y}}{n \sum_{i=1}^{n} X_i^2 - n^2 \bar{X}^2}. \]
Dividing both numerator and denominator by n then gives you the desired result.


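17) In order to calculate the slope, the intercept, and the regression \(R^2\) for a simple sample regression function, list the five sums of data that you need.
Answer: Depending on whether or not the data are in deviations from means (\(z_i = Z_i - \bar{Z}\) or \(Z_i\), say), you need the following sums:
\[ \sum_{i=1}^{n} Y_i, \; \sum_{i=1}^{n} X_i, \; \sum_{i=1}^{n} x_i y_i, \; \sum_{i=1}^{n} y_i^2, \; \sum_{i=1}^{n} x_i^2 \quad \text{(data in deviation form)} \]
or
\[ \sum_{i=1}^{n} Y_i, \; \sum_{i=1}^{n} X_i, \; \sum_{i=1}^{n} X_i Y_i, \; \sum_{i=1}^{n} Y_i^2, \; \sum_{i=1}^{n} X_i^2. \]
Using these five sums, you can calculate the slope \(\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}\), the intercept \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\), and the regression
\[ R^2 = \frac{\hat{\beta}_1 \sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2} = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}. \]
Alternatively, if the data is not given in deviation form, the formulae are as follows:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} Y_i X_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}, \]
and for the regression
\[ R^2 = \frac{\hat{\beta}_1 \left( \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} \right)}{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2} = \frac{\hat{\beta}_1^2 \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)}{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2}. \]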

18) A peer of yours, who is a major in another social science, says he is not interested in the regression slope and/or intercept. Instead he only cares about correlations. For example, in the testscore/student-teacher ratio regression, he claims to get all the information he needs from the negative correlation coefficient corr(X,Y) = -0.226. What response might you have for your peer?
Answer: First of all, the regression slope is related to the regression \(R^2\), and hence its square root, the correlation coefficient, since
\[ R^2 = \frac{\hat{\beta}_1 \left( \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} \right)}{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2} = \frac{\hat{\beta}_1^2 \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)}{\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2}. \]
However, while the correlation coefficient tells you something about the direction and strength of the relationship between two variables, it does not inform you about the effect of a one unit increase in the explanatory variable. Hence it cannot answer the question whether or not the relationship is important (although even with the knowledge of the slope coefficient, this requires further information). Your friend would not be able to answer the question which policy makers and researchers are typically interested in, such as, what would be the effect on test scores of a reduction in the student-teacher ratio by one?

19) Assume that there is a change in the units of measurement on both Y and X. The new variables are Y* = aY and X* = bX. What effect will this change have on the regression slope?
Answer: We now have the following sample regression function \(\hat{Y}^* = \hat{\beta}_0^* + \hat{\beta}_1^* X^*\). The formula for the slope will be
\[ \hat{\beta}_1^* = \frac{\sum_{i=1}^{n} x_i^* y_i^*}{\sum_{i=1}^{n} x_i^{*2}} = \frac{\sum_{i=1}^{n} (b x_i)(a y_i)}{\sum_{i=1}^{n} (b x_i)^2} = \frac{ab \sum_{i=1}^{n} x_i y_i}{b^2 \sum_{i=1}^{n} x_i^2} = \frac{a}{b} \hat{\beta}_1. \]

20) Assume that there is a change in the units of measurement on X. The new variable is X* = bX. Prove that this change in the units of measurement on the explanatory variable has no effect on the intercept in the resulting regression.
Answer: Consider the sample regression function \(\hat{Y} = \hat{\beta}_0^* + \hat{\beta}_1^* X^*\). The formula for the intercept will be \(\hat{\beta}_0^* = \bar{Y} - \hat{\beta}_1^* b\bar{X}\). But
\[ \hat{\beta}_1^* = \frac{\sum_{i=1}^{n} x_i^* y_i}{\sum_{i=1}^{n} x_i^{*2}} = \frac{\sum_{i=1}^{n} (b x_i) y_i}{\sum_{i=1}^{n} (b x_i)^2} = \frac{b \sum_{i=1}^{n} x_i y_i}{b^2 \sum_{i=1}^{n} x_i^2} = \frac{1}{b} \hat{\beta}_1. \]
Hence \(\hat{\beta}_0^* = \bar{Y} - \dfrac{1}{b} \hat{\beta}_1 b\bar{X} = \hat{\beta}_0\).
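The scaling results in questions 19 and 20 can be confirmed numerically. This is a minimal sketch under stated assumptions: the data, the scale factors a = 0.001 and b = 12, and the helper `ols` are all hypothetical. The intercept result for Y* = aY is not part of the questions above, but it follows from the same algebra because the sample mean of Y* is a times the sample mean of Y.

```python
from math import isclose


def ols(x, y):
    """Return (intercept, slope) of the OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: years of education and hourly earnings in dollars.
educ = [10.0, 12.0, 12.0, 14.0, 16.0, 18.0]
ahe = [9.5, 14.0, 16.5, 18.0, 24.0, 27.5]

a = 0.001   # Y* = a * Y, e.g. dollars to thousands of dollars
b = 12.0    # X* = b * X, e.g. years to months

b0, b1 = ols(educ, ahe)
b0_both, b1_both = ols([b * x for x in educ], [a * y for y in ahe])
b0_x_only, b1_x_only = ols([b * x for x in educ], ahe)

print(f"original:      intercept = {b0:.4f}, slope = {b1:.4f}")
print(f"Y and X scaled: intercept = {b0_both:.6f}, slope = {b1_both:.8f}")
print(f"only X scaled: intercept = {b0_x_only:.4f}, slope = {b1_x_only:.6f}")
```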


21) At the Stock and Watson (http://www.pearsonhighered.com/stock_watson) website, go to Student Resources and select the option “Datasets for Replicating Empirical Results.” Then select the “California Test Score Data Used in Chapters 4-9” and read the data either into Excel or STATA (or another statistical program). First run a regression where the dependent variable is test scores and the independent variable is the student-teacher ratio. Record the regression \(R^2\). Then run a regression where the dependent variable is the student-teacher ratio and the independent variable is test scores. Record the regression \(R^2\) from this regression. How do they compare?
Answer: The regression \(R^2\) is 0.051 in both regressions, confirming the idea that the regression \(R^2\) is only the square of the correlation coefficient between two variables. This can also be shown formally as follows: The regression \(R^2 = \dfrac{ESS}{TSS}\), where ESS is given by \(\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2\). But \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\) and \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\). Hence \((\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 (X_i - \bar{X})^2\) and therefore \(ESS = \hat{\beta}_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2\). Using small letters to indicate deviations from mean, i.e., \(z_i = Z_i - \bar{Z}\), we get that the regression
\[ R^2 = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}. \]
The square of the correlation coefficient is
\[ r^2 = \frac{\left(\sum_{i=1}^{n} y_i x_i\right)^2}{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2} = \frac{\left(\sum_{i=1}^{n} y_i x_i\right)^2 \sum_{i=1}^{n} x_i^2}{\left(\sum_{i=1}^{n} x_i^2\right)^2 \sum_{i=1}^{n} y_i^2} = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2}. \]
Hence the two are the same.

22) At the Stock and Watson (http://www.pearsonhighered.com/stock_watson) website, go to Student Resources and select the option “Datasets for Replicating Empirical Results.” Then select the “California Test Score Data Used in Chapters 4-9” and read the data either into Excel or STATA (or another statistical program). Run a regression of the average reading score (read_scr) on the average math score (math_scr). What values for the slope and the intercept would you expect? Interpret the coefficients in the resulting regression output and the regression \(R^2\).
Answer: On average, it would seem plausible, a priori, that schools which score high on the math score would also do well in the reading score. Perhaps an underlying variable, such as genes, parental interest, or the quality of teachers, is driving results in both. The relationship is close to the 45 degree line, where the intercept would be zero and the slope would be one. Interpreted literally, 85 percent of the variation in the reading score is explained by our model.


23) In a simple regression with an intercept and a single explanatory variable, the variation in Y ($TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$) can be decomposed into the explained sum of squares ($ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$) and the sum of squared residuals ($SSR = \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$) (see, for example, equation (4.35) in the textbook). Consider any regression line, positively or negatively sloped in {X,Y} space. Draw a horizontal line where, hypothetically, you consider the sample mean of Y ($\bar{Y}$) to be. Next add a single actual observation of Y.

In this graph, indicate where you find the following distances:
(i) the residual
(ii) actual minus the mean of Y
(iii) fitted value minus the mean of Y

Answer:


Chapter 5 Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals 5.1 Multiple Choice 1) Heteroskedasticity means that A) homogeneity cannot be assumed automatically for the model. B) the variance of the error term is not constant. C) the observed units have different preferences. D) agents are not all rational. Answer: B 2) With heteroskedastic errors, the weighted least squares estimator is BLUE. You should use OLS with heteroskedasticity-robust standard errors because A) this method is simpler. B) the exact form of the conditional variance is rarely known. C) the Gauss-Markov theorem holds. D) your spreadsheet program does not have a command for weighted least squares. Answer: B 3) When estimating a demand function for a good where quantity demanded is a linear function of the price, you should A) not include an intercept because the price of the good is never zero. B) use a one-sided alternative hypothesis to check the influence of price on quantity. C) use a two-sided alternative hypothesis to check the influence of price on quantity. D) reject the idea that price determines demand unless the coefficient is at least 1.96. Answer: B 4) The t-statistic is calculated by dividing A) the OLS estimator by its standard error. B) the slope by the standard deviation of the explanatory variable. C) the estimator minus its hypothesized value by the standard error of the estimator. D) the slope by 1.96. Answer: C 5) The confidence interval for the sample regression function slope A) can be used to conduct a test about a hypothesized population regression function slope. B) can be used to compare the value of the slope relative to that of the intercept. C) adds and subtracts 1.96 from the slope. D) allows you to make statements about the economic importance of your estimate. Answer: A 6) If the absolute value of your calculated t-statistic exceeds the critical value from the standard normal distribution, you can A) reject the null hypothesis. 
B) safely assume that your regression results are significant. C) reject the assumption that the error terms are homoskedastic. D) conclude that most of the actual values are very close to the regression line. Answer: A


7) Under the least squares assumptions (zero conditional mean for the error term, Xi and Yi being i.i.d., and Xi and ui having finite fourth moments), the OLS estimator for the slope and intercept A) has an exact normal distribution for n > 15. B) is BLUE. C) has a normal distribution even in small samples. D) is unbiased. Answer: D

8) In general, the t-statistic has the following form:
A) $\frac{\text{estimate} - \text{hypothesized value}}{\text{standard error of estimate}}$
B) $\frac{\text{estimator}}{\text{standard error of estimator}}$
C) $\frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of estimator}}$
D) $\frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of estimator}/\sqrt{n}}$
Answer: C

9) Consider the following regression line: TestScore = 698.9 – 2.28 × STR. You are told that the t-statistic on the slope coefficient is 4.38. What is the standard error of the slope coefficient? A) 0.52 B) 1.96 C) -1.96 D) 4.38 Answer: A

10) Imagine that you were told that the t-statistic for the slope coefficient of the regression line TestScore = 698.9 – 2.28 × STR was 4.38. What are the units of measurement for the t-statistic? A) points of the test score B) number of students per teacher C) $\frac{TestScore}{STR}$ D) standard deviations Answer: D

11) The construction of the t-statistic for a one- and a two-sided hypothesis A) depends on the critical value from the appropriate distribution. B) is the same. C) is different since the critical value must be 1.645 for the one-sided hypothesis, but 1.96 for the two-sided hypothesis (using a 5% probability for the Type I error). D) uses ±1.96 for the two-sided test, but only +1.96 for the one-sided test. Answer: B

12) The p-value for a one-sided left-tail test is given by A) $\Pr(Z > t^{act}) = \Phi(t^{act})$. B) $\Pr(Z < t^{act}) = \Phi(t^{act})$. C) $\Pr(Z < t^{act}) < 1.645$. D) cannot be calculated, since probabilities must always be positive. Answer: B
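The arithmetic behind questions 9 and 12 can be sketched in a few lines. This is an illustrative sketch only: `normal_cdf` is a hypothetical helper built on the standard library's `math.erf`, and the sign of the actual t-statistic is taken from the negative slope.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Question 9: recover the standard error from the slope and |t|.
slope = -2.28
t_abs = 4.38
se_slope = abs(slope) / t_abs   # t = slope / SE under H0: beta1 = 0

# Question 12: left-tail p-value, Pr(Z < t_act) = Phi(t_act).
t_act = slope / se_slope        # negative, because the slope is negative
p_left = normal_cdf(t_act)

print(round(se_slope, 2))       # standard error of the slope
print(p_left)                   # far below any conventional significance level
```

The same `normal_cdf` helper reproduces the one-sided 5% critical value used in question 11: `normal_cdf(-1.645)` is approximately 0.05.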

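13) The 95% confidence interval for $\beta_1$ is the interval
A) $(\beta_1 - 1.96\,SE(\beta_1),\ \beta_1 + 1.96\,SE(\beta_1))$.
B) $(\hat{\beta}_1 - 1.645\,SE(\hat{\beta}_1),\ \hat{\beta}_1 + 1.645\,SE(\hat{\beta}_1))$.
C) $(\hat{\beta}_1 - 1.96\,SE(\hat{\beta}_1),\ \hat{\beta}_1 + 1.96\,SE(\hat{\beta}_1))$.
D) $(\hat{\beta}_1 - 1.96,\ \hat{\beta}_1 + 1.96)$.
Answer: C

14) The 95% confidence interval for $\beta_0$ is the interval
A) $(\beta_0 - 1.96\,SE(\beta_0),\ \beta_0 + 1.96\,SE(\beta_0))$.
B) $(\hat{\beta}_0 - 1.645\,SE(\hat{\beta}_0),\ \hat{\beta}_0 + 1.645\,SE(\hat{\beta}_0))$.
C) $(\hat{\beta}_0 - 1.96\,SE(\hat{\beta}_0),\ \hat{\beta}_0 + 1.96\,SE(\hat{\beta}_0))$.
D) $(\hat{\beta}_0 - 1.96,\ \hat{\beta}_0 + 1.96)$.
Answer: C

15) The 95% confidence interval for the predicted effect of a general change in X is
A) $(\beta_1 \Delta x - 1.96\,SE(\beta_1) \times \Delta x,\ \beta_1 \Delta x + 1.96\,SE(\beta_1) \times \Delta x)$.
B) $(\hat{\beta}_1 \Delta x - 1.645\,SE(\hat{\beta}_1) \times \Delta x,\ \hat{\beta}_1 \Delta x + 1.645\,SE(\hat{\beta}_1) \times \Delta x)$.
C) $(\hat{\beta}_1 \Delta x - 1.96\,SE(\hat{\beta}_1) \times \Delta x,\ \hat{\beta}_1 \Delta x + 1.96\,SE(\hat{\beta}_1) \times \Delta x)$.
D) $(\hat{\beta}_1 \Delta x - 1.96,\ \hat{\beta}_1 \Delta x + 1.96)$.
Answer: C

16) The homoskedasticity-only estimator of the variance of $\hat{\beta}_1$ is
A) $\dfrac{s_{\hat{u}}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
B) $\dfrac{s_{\hat{u}}}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
C) $\dfrac{s_{\hat{u}}^2}{\sum_{i=1}^{n} (X_i^2 - \bar{X})}$
D) $\dfrac{1}{n} \times \dfrac{\frac{1}{n-2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \hat{u}_i^2}{\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2\right]^2}$.
Answer: A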


17) One of the following steps is not required as a step to test for the null hypothesis: A) compute the standard error of $\hat{\beta}_1$. B) test for the errors to be normally distributed. C) compute the t-statistic. D) compute the p-value. Answer: B 18) Finding a small value of the p-value (e.g. less than 5%) A) indicates evidence in favor of the null hypothesis. B) implies that the t-statistic is less than 1.96. C) indicates evidence against the null hypothesis. D) will only happen roughly one in twenty samples. Answer: C 19) The only difference between a one- and two-sided hypothesis test is A) the null hypothesis. B) dependent on the sample size n. C) the sign of the slope coefficient. D) how you interpret the t-statistic. Answer: D 20) A binary variable is often called a A) dummy variable. B) dependent variable. C) residual. D) power of a test. Answer: A 21) The error term is homoskedastic if A) var(ui | Xi = x) is constant for i = 1,…, n. B) var(ui | Xi = x) depends on x. C) Xi is normally distributed. D) there are no outliers. Answer: A 22) In the presence of heteroskedasticity, and assuming that the usual least squares assumptions hold, the OLS estimator is A) efficient. B) BLUE. C) unbiased and consistent. D) unbiased but not consistent. Answer: C 23) The proof that OLS is BLUE requires all of the following assumptions with the exception of: A) the errors are homoskedastic. B) the errors are normally distributed. C) E(ui | Xi) = 0. D) large outliers are unlikely. Answer: B


24) If the errors are heteroskedastic, then A) OLS is BLUE. B) WLS is BLUE if the conditional variance of the errors is known up to a constant factor of proportionality. C) LAD is BLUE if the conditional variance of the errors is known up to a constant factor of proportionality. D) OLS is efficient. Answer: B 25) The homoskedastic normal regression assumptions are all of the following with the exception of: A) the errors are homoskedastic. B) the errors are normally distributed. C) there are no outliers. D) there are at least 10 observations. Answer: D 26) Using the textbook example of 420 California school districts and the regression of testscores on the student-teacher ratio, you find that the standard error on the slope coefficient is 0.51 when using the heteroskedasticity robust formula, while it is 0.48 when employing the homoskedasticity only formula. When calculating the t-statistic, the recommended procedure is to A) use the homoskedasticity only formula because the t-statistic becomes larger B) first test for homoskedasticity of the errors and then make a decision C) use the heteroskedasticity robust formula D) make a decision depending on how much different the estimate of the slope is under the two procedures Answer: C 27) Consider the estimated equation from your textbook TestScore=698.9 - 2.28 STR, R2 = 0.051, SER = 18.6 (10.4) (0.52) The t-statistic for the slope is approximately A) 4.38 B) 67.20 C) 0.52 D) 1.76 Answer: A 28) You have collected data for the 50 U.S. states and estimated the following relationship between the change in the unemployment rate from the previous year (△ur) and the growth rate of the respective state real GDP (g y). The results are as follows △ur= 2.81 — 0.23 g y, R2 = 0.36, SER = 0.78 (0.12) (0.04) Assuming that the estimator has a normal distribution, the 95% confidence interval for the slope is approximately the interval A) [2.57, 3.05] B) [-0.31,0.15] C) [-0.31, -0.15] D) [-0.33, -0.13] Answer: C


29) Using 143 observations, assume that you had estimated a simple regression function and that your estimate for the slope was 0.04, with a standard error of 0.01. You want to test whether or not the estimate is statistically significant. Which of the following possible decisions is the only correct one: A) you decide that the coefficient is small and hence most likely is zero in the population B) the slope is statistically significant since it is four standard errors away from zero C) the response of Y given a change in X must be economically important since it is statistically significant D) since the slope is very small, so must be the regression R 2 . Answer: B 30) You extract approximately 5,000 observations from the Current Population Survey (CPS) and estimate the following regression function: ahe= 3.32 — 0.45 Age, R2 = 0.02, SER = 8.66 (1.00) (0.04) where ahe is average hourly earnings, and Age is the individual’s age. Given the specification, your 95% confidence interval for the effect of changing age by 5 years is approximately A) [$1.96, $2.54] B) [$2.32, $4.32] C) [$1.35, $5.30] D) cannot be determined given the information provided Answer: A

5.2 Essays and Longer Questions 1) (Continuation from Chapter 4) Sir Francis Galton, a cousin of Charles Darwin, examined the relationship between the height of children and their parents towards the end of the 19th century. It is from this study that the name “regression” originated. You decide to update his findings by collecting data from 110 college students, and estimate the following relationship: Studenth = 19.6 + 0.73 × Midparh, R2 = 0.45, SER = 2.0 (7.2) (0.10) where Studenth is the height of students in inches, and Midparh is the average of the parental heights. Values in parentheses are heteroskedasticity-robust standard errors. (Following Galton’s methodology, both variables were adjusted so that the average female height was equal to the average male height.) (a) Test for the statistical significance of the slope coefficient. (b) If children, on average, were expected to be of the same height as their parents, then this would imply two hypotheses, one for the slope and one for the intercept. (i) What should the null hypothesis be for the intercept? Calculate the relevant t-statistic and carry out the hypothesis test at the 1% level. (ii) What should the null hypothesis be for the slope? Calculate the relevant t-statistic and carry out the hypothesis test at the 5% level. (c) Can you reject the null hypothesis that the regression R2 is zero? (d) Construct a 95% confidence interval for a one inch increase in the average of parental height. Answer: (a) H0 : β1 = 0, t = 7.30. For H1 : β1 > 0, the critical value for a one-sided alternative is 1.645. Hence we reject the null hypothesis. (b) H0 : β0 = 0, t = 2.72. For H1 : β0 ≠ 0, the critical value for a two-sided alternative is 2.58. Hence we reject the null hypothesis in (i). For the slope we have H0 : β1 = 1, t = -2.70. For H1 : β1 ≠ 1, the critical value for a two-sided alternative is 1.96. Hence we reject the null hypothesis in (ii). (c) For the simple linear regression model, H0 : β1 = 0 implies that R2 = 0. Hence it is the same test as in (a). (d) (0.73 – 1.96 × 0.10, 0.73 + 1.96 × 0.10) = (0.53, 0.93).
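The t-statistics and the interval in this answer follow directly from the reported coefficients and standard errors. A minimal sketch of the arithmetic, where `t_stat` is a hypothetical helper and the critical values are the standard normal ones quoted in the answer:

```python
def t_stat(estimate, hypothesized, se):
    """t = (estimator - hypothesized value) / standard error of estimator."""
    return (estimate - hypothesized) / se


slope, se_slope = 0.73, 0.10
intercept, se_intercept = 19.6, 7.2

t_a = t_stat(slope, 0.0, se_slope)             # (a)    H0: beta1 = 0
t_b_i = t_stat(intercept, 0.0, se_intercept)   # (b)(i) H0: beta0 = 0
t_b_ii = t_stat(slope, 1.0, se_slope)          # (b)(ii) H0: beta1 = 1

# (d) 95% confidence interval for the effect of a one inch increase.
ci = (slope - 1.96 * se_slope, slope + 1.96 * se_slope)

print(round(t_a, 2), round(t_b_i, 2), round(t_b_ii, 2))
print(tuple(round(b, 2) for b in ci))
```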

2) (Requires Appendix) (Continuation from Chapter 4) At a recent county fair, you observed that at one stand people’s weight was forecasted, and were surprised by the accuracy (within a range). Thinking about how the person could have predicted your weight fairly accurately (despite the fact that she did not know about your “heavy bones”), you think about how this could have been accomplished. You remember that medical charts for children contain 5%, 25%, 50%, 75% and 95% lines for a weight/height relationship and decide to conduct an experiment with 110 of your peers. You collect the data and calculate the following sums:

$$\sum_{i=1}^{n} Y_i = 17{,}375, \quad \sum_{i=1}^{n} X_i = 7{,}665.5, \quad \sum_{i=1}^{n} y_i^2 = 94{,}228.8, \quad \sum_{i=1}^{n} x_i^2 = 1{,}248.9, \quad \sum_{i=1}^{n} x_i y_i = 7{,}625.9$$

where the height is measured in inches and weight in pounds. (Small letters refer to deviations from means as in $z_i = Z_i - \bar{Z}$.) (a) Calculate the homoskedasticity-only standard errors and, using the resulting t-statistic, perform a test on the null hypothesis that there is no relationship between height and weight in the population of college students. (b) What is the alternative hypothesis in the above test, and what level of significance did you choose? (c) Statistics and econometrics textbooks often ask you to calculate critical values based on some level of significance, say 1%, 5%, or 10%. What sort of criteria do you think should play a role in determining which level of significance to choose? (d) What do you think the relationship is between testing for the significance of the slope and whether or not the regression R2 is zero? Answer: (a) The formula for the homoskedasticity-only standard errors requires knowledge of the residual variance. But $s_{\hat{u}}^2 = \frac{1}{n-2} SSR$, and SSR = TSS - ESS. Given the result in (2b), SSR = 47,604.7, and hence $s_{\hat{u}}^2 = 440.78$. The SER is 21.00. Dividing by the square root of the variation in X then results in the homoskedasticity-only standard error of the slope, which is 0.594. The t-statistic is 10.29, which rejects the null hypothesis of no relationship. (b) The alternative hypothesis should be one-sided, since there is strong prior knowledge that taller people weigh more, on average. Given the size of the t-statistic, the null hypothesis can be rejected at any reasonable level of significance. (c) Clearly the levels should not be picked arbitrarily, but should depend on the cost involved with the size and the power of the test. Consider a person who was accused of murder. In that case, the null hypothesis is that he is innocent.
The size of the test would be the probability of letting an innocent person go to the electric chair, while (1-power of the test) gives the probability of letting a murderer go free. There are obviously vastly different costs attached to each error, and these will determine the levels chosen. (d) If the slope in a regression function is zero, then there is no relationship between the two variables involved. Hence testing for the significance of the regression slope is the same as testing whether or not the regression R2 is zero.
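The chain of calculations in part (a) can be traced step by step. The sketch below takes the SSR of 47,604.7 as given in the answer rather than recomputing it, and uses only the sums stated in the question; the variable names are illustrative.

```python
import math

n = 110
sum_x2 = 1248.9     # sum of squared deviations of height
sum_xy = 7625.9     # sum of cross products of deviations
ssr = 47604.7       # sum of squared residuals, as stated in the answer

slope = sum_xy / sum_x2               # OLS slope estimate
s2_u = ssr / (n - 2)                  # residual variance
ser = math.sqrt(s2_u)                 # standard error of the regression
se_slope = ser / math.sqrt(sum_x2)    # homoskedasticity-only SE of the slope
t = slope / se_slope                  # t-statistic for H0: beta1 = 0

print(round(s2_u, 2), round(ser, 2), round(se_slope, 3), round(t, 2))
```

The t-statistic computed this way agrees with the value in the answer up to rounding of the intermediate quantities.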


3) You have obtained measurements of height in inches of 29 female and 81 male students ( Studenth) at your university. A regression of the height on a constant and a binary variable ( BFemme), which takes a value of one for females and is zero otherwise, yields the following result: Studenth = 71.0 – 4.84×BFemme , R2 = 0.40, SER = 2.0 (0.3) (0.57) (a) What is the interpretation of the intercept? What is the interpretation of the slope? How tall are females, on average? (b) Test the hypothesis that females, on average, are shorter than males, at the 1% level. (c) Is it likely that the error term is homoskedastic here? Answer: (a) The intercept gives you the average height of males, which is 71 inches in this sample. The slope tells you by how much shorter females are, on average (almost 5 inches). The average height of females is therefore approximately 66 inches. (b) The t-statistic for the difference in means is -8.49. For a one-sided test, the critical value is –2.33. Hence the difference is statistically significant. (c) It is safer to assume that the variances for males and females are different. In the underlying sample the standard deviation for females was smaller. 4) (continuation from Chapter 4, number 3) You have obtained a sub -sample of 1744 individuals from the Current Population Survey (CPS) and are interested in the relationship between weekly earnings and age. The regression, using heteroskedasticity-robust standard errors, yielded the following result: Earn = 239.16 + 5.20×Age , R2 = 0.05, SER = 287.21., (20.24) (0.57) where Earn and Age are measured in dollars and years respectively. (a) Is the relationship between Age and Earn statistically significant? (b) The variance of the error term and the variance of the dependent variable are related. Given the distribution of earnings, do you think it is plausible that the distribution of errors is normal? (c) Construct a 95% confidence interval for both the slope and the intercept. 
Answer: (a) The t-statistic on the slope is 9.12, which is above the critical value from the standard normal distribution for any reasonable level of significance. (b) Since the earnings distribution is highly skewed, it is not reasonable to assume that the error distribution is normal. (c) The confidence interval for the slope is (4.08,6.32). The confidence interval for the intercept is (199.49,278.83).
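The numbers in the answers to questions 3 and 4 can be reproduced from the reported coefficients and standard errors. A short sketch, with `conf_int_95` as a hypothetical helper that uses the large-sample critical value 1.96:

```python
def conf_int_95(estimate, se):
    """Large-sample 95% confidence interval: estimate +/- 1.96 * SE."""
    margin = 1.96 * se
    return estimate - margin, estimate + margin


# Question 3: regression of height on a constant and a female dummy.
male_mean = 71.0                        # intercept = average height of males
female_gap = -4.84                      # coefficient on the binary variable
female_mean = male_mean + female_gap    # average height of females
t_gap = female_gap / 0.57               # compare with the one-sided 1% value -2.33

# Question 4: regression of weekly earnings on age.
t_age = 5.20 / 0.57
slope_ci = conf_int_95(5.20, 0.57)
intercept_ci = conf_int_95(239.16, 20.24)

print(round(female_mean, 2), round(t_gap, 2), round(t_age, 2))
print(tuple(round(b, 2) for b in slope_ci))
print(tuple(round(b, 2) for b in intercept_ci))
```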


5) (Continuation from Chapter 4, number 5) You have learned in one of your economics courses that one of the determinants of per capita income (the “Wealth of Nations”) is the population growth rate. Furthermore you also found out that the Penn World Tables contain income and population data for 104 countries of the world. To test this theory, you regress the GDP per worker (relative to the United States) in 1990 (RelPersInc) on the difference between the average population growth rate of that country (n) and the U.S. average population growth rate (nus) for the years 1980 to 1990. This results in the following regression output: RelPersInc = 0.518 – 18.831×(n – nus), R2 = 0.522, SER = 0.197 (0.056) (3.177) (a) Is there any reason to believe that the variance of the error terms is homoskedastic? (b) Is the relationship statistically significant? Answer: (a) There are vast differences in the size of these countries, both in terms of the population and GDP. Furthermore, the countries are at different stages of economic and institutional development. Other factors vary as well. It would therefore be odd to assume that the errors would be homoskedastic. (b) The t-statistic is 5.93 in absolute value, making the relationship statistically significant, i.e., we can reject the null hypothesis that the slope is zero. 6) You recall from one of your earlier lectures in macroeconomics that the per capita income depends on the savings rate of the country: those who save more end up with a higher standard of living. To test this theory, you collect data from the Penn World Tables on GDP per worker relative to the United States (RelProd) in 1990 and the average investment share of GDP from 1980-1990 (SK), remembering that investment equals saving. The regression results in the following output: RelProd = –0.08 + 2.44×SK, R2 = 0.46, SER = 0.21 (0.04) (0.38) (a) Interpret the regression results carefully.
(b) Calculate the t-statistics to determine whether the two coefficients are significantly different from zero. Justify the use of a one-sided or two-sided test. (c) You accidentally forget to use the heteroskedasticity-robust standard errors option in your regression package and estimate the equation using homoskedasticity -only standard errors. This changes the results as follows: RelProd = -0.08 + 2.44×SK , R2 =0.46, SER = 0.21 (0.04) (0.26) You are delighted to find that the coefficients have not changed at all and that your results have become even more significant. Why haven’t the coefficients changed? Are the results really more significant? Explain. (d) Upon reflection you think about the advantages of OLS with and without homoskedasticity -only standard errors. What are these advantages? Is it likely that the error terms would be heteroskedastic in this situation? Answer: (a) An increase in the saving rate of 0.1, or from 0.15 to 0.25, results in an increase in relative GDP per worker of 0.244, or from 0.5 to roughly 0.75. (Taiwan had a value of 0.5 for RelProd in 1990, while Sweden was at 0.77.) There is no interpretation for the intercept. The regression explains 46 percent of the variation in GDP per worker relative to the United States. (b) The t- statistics are 2.00 and 6.42 for the intercept and slope respectively. You should use a two -sided test for the intercept, since there are no prior expectations on whether it should be positive or negative. Hence the intercept is statistically significant at the 5 percent level, but not at the 1 percent level. Since we expect a positive sign on the slope, we should conduct a one-sided test. The critical values suggest significance at any reasonable probability level of the size of the test. (c) Whether you use homoskedasticity-only or heteroskedasticity-robust standard errors does not affect the estimator, only the formula for the standard errors. 
If the assumption of homoskedasticity was valid, then the results would be more significant. However, given the lengthy discussion on homoskedasticity
versus heteroskedasticity in the textbook, it is safer to conduct inference under the assumption of heteroskedasticity. (d) In the presence of homoskedasticity in addition to the least squares assumptions in the text, OLS is BLUE (Gauss-Markov theorem). If the errors are heteroskedastic, then the GLS estimator (weighted least squares) is BLUE if the form of heteroskedasticity is known, which rarely occurs in practice. Since economic theory does not suggest, in general, that errors are homoskedastic, it is safer to assume that they are not. This avoids invalid statistical inference. 7) Carefully discuss the advantages of using heteroskedasticity-robust standard errors over standard errors calculated under the assumption of homoskedasticity. Give at least five examples where it is very plausible to assume that the errors display heteroskedasticity. Answer: There are virtually no examples where economic theory suggests that the errors are homoskedastic. Hence the maintained hypothesis should be that they are heteroskedastic. Using homoskedasticity -only standard errors when in truth heteroskedasticity-robust standard errors should be used, results in false inference. What makes this worse is that homoskedasticity-only standard errors are typically smaller than heteroskedasticity-robust standard errors, resulting in t-statistics that are too large, and hence rejection of the null hypothesis too often. There is an alternative GLS estimator, weighted least squares, which is BLUE, but requires knowledge of how the error variance depends on X, e.g. X or X 2 . Answers will vary by student regarding the examples, but earnings functions, cross country beta -convergence regressions, consumption functions, sports regressions involving teams from markets with varying population size, weight-height relationships for children, etc., are all good candidates. 
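To see how the two standard error formulas can diverge, the following sketch simulates one heteroskedastic design and computes both. Everything here is an assumption made for illustration: the data-generating process (error standard deviation proportional to X), the sample size, the seed, and the helper name `ols_standard_errors`. It shows one case in which the homoskedasticity-only standard error is too small; it does not prove that this always happens.

```python
import math
import random


def ols_standard_errors(x, y):
    """Return (slope, homoskedasticity-only SE, heteroskedasticity-robust SE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Homoskedasticity-only: s_u^2 / sum (X_i - X_bar)^2
    s2_u = sum(u * u for u in resid) / (n - 2)
    se_homo = math.sqrt(s2_u / sxx)

    # Robust: (1/n) * [1/(n-2) * sum (X_i - X_bar)^2 u_i^2] / [(1/n) * sum (X_i - X_bar)^2]^2
    numerator = sum(d * d * u * u for d, u in zip(dx, resid)) / (n - 2)
    denominator = (sxx / n) ** 2
    se_robust = math.sqrt(numerator / denominator / n)
    return slope, se_homo, se_robust


# Simulated data (an assumption for illustration): var(u | X) grows with X.
rng = random.Random(0)
n = 20000
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y = [2.0 + 0.5 * xi + rng.gauss(0.0, 1.0) * xi for xi in x]

slope, se_homo, se_robust = ols_standard_errors(x, y)
print(slope, se_homo, se_robust)
```

In this design the robust standard error comes out larger, so a t-statistic built on the homoskedasticity-only formula would overstate significance.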
8) (Requires Appendix material from Chapters 4 and 5) Shortly before you are making a group presentation on the testscore/student-teacher ratio results, you realize that one of your peers forgot to type all the relevant information on one of your slides. Here is what you see:

TestScore = 698.9 – STR, R2 = 0.051, SER = 18.6
            (9.47)  (0.48)

In addition, your group member explains that he ran the regression in a standard spreadsheet program, and that, as a result, the standard errors in parentheses are homoskedasticity-only standard errors. (a) Find the value for the slope coefficient. (b) Calculate the t-statistic for the slope and the intercept. Test the hypothesis that the intercept and the slope are different from zero. (c) Should you be concerned that your group member only gave you the result for the homoskedasticity-only standard error formula, instead of using the heteroskedasticity-robust standard errors?
Answer: (a) The relationship between the slope coefficient and the regression R2 is

$$R^2 = \frac{ESS}{TSS} = \frac{\hat{\beta}_1^2 \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} y_i^2} \iff \hat{\beta}_1^2 = R^2 \times \frac{\sum_{i=1}^{n} y_i^2}{\sum_{i=1}^{n} x_i^2}.$$

Given the information above, you need to find the TSS ($= \sum_{i=1}^{n} y_i^2$) and $\sum_{i=1}^{n} x_i^2$. The TSS is relatively easy to find: the SER is 18.6, and hence the SSR is 144,315.5. (Recall that $SER = s_{\hat{u}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2} = \sqrt{\frac{SSR}{n-2}}$.) This allows you to calculate the TSS, which is 152,109.6. (Recall that $R^2 = 1 - \frac{SSR}{TSS} \iff TSS = \frac{SSR}{1 - R^2}$.)

To find $\sum_{i=1}^{n} x_i^2$, note that the homoskedasticity-only standard error for the slope is

$$s_{\hat{\beta}_1} = \frac{s_{\hat{u}}}{\sqrt{\sum_{i=1}^{n} x_i^2}} \iff \sum_{i=1}^{n} x_i^2 = \left(\frac{SER}{s_{\hat{\beta}_1}}\right)^2.$$

Hence, $\sum_{i=1}^{n} x_i^2 = 38.72^2 = 1{,}499.6$. Inserting these results into the above formula, you get

$$\hat{\beta}_1^2 = 0.051 \times \frac{152{,}109.6}{1{,}499.6} = 5.20 \iff \hat{\beta}_1 = -2.28$$

(luckily for you, your group member entered the negative sign in front of the slope). (b) The t-statistics are 73.82 and 4.75 respectively. Hence you can reject the two null hypotheses at any reasonable level of significance. (c) There is no theory that suggests homoskedasticity in the error terms in this case. Given the serious consequences of using homoskedasticity-only standard errors in the presence of heteroskedasticity, you should definitely use the heteroskedasticity-robust standard errors for inference.

9) (Continuation of the Purchasing Power Parity question from Chapter 4) The news-magazine The Economist regularly publishes data on the so-called Big Mac index and exchange rates between countries. The data for 30 countries from the April 29, 2000 issue is listed below:

Country         Currency   Price of Big Mac   Actual Exchange Rate per U.S. dollar
                Rupiah     14,500             7,945
                Lira       4,500              2,088
                Won        3,000              1,108
                Peso       1,260              514
                Peseta     375                179
                Forint     339                279
                Yen        294                106
                Dollar     70                 30.6
                Baht       55                 38.0
                Crown      54.37              39.1
                Ruble      39.50              28.5
                Crown      24.75              8.04
                Crown      24.0               8.84
                Peso       20.9               9.41
                Franc      18.5               7.07
                Shekel     14.5               4.05
                Yuan       9.90               8.28
                Rand       9.0                6.72
                Franc      5.90               1.70
                Zloty      5.50               4.30
                Mark       4.99               2.11
                Dollar     4.52               3.80
                Dollar     3.40               2.01
                Dollar     3.20               1.70
                Real       2.95               1.79
                Dollar     2.85               1.47
                Dollar     2.59               1.68
                Peso       2.50               1.00
                Pound      1.90               0.63
United States   Dollar     2.51

R2 = 0.994, n = 29, SER = 122.15

(a) Your spreadsheet program does not allow you to calculate heteroskedasticity-robust standard errors. Instead, the numbers in parentheses are homoskedasticity-only standard errors. State the two null hypotheses under which PPP holds. Should you use a one-tailed or two-tailed alternative hypothesis? (b) Calculate the two t-statistics. (c) Using a 5% significance level, what is your decision regarding the null hypotheses given the two t-statistics? What critical values did you use? Are you concerned with the fact that you are testing the two hypotheses sequentially when they are supposed to hold simultaneously? (d) What assumptions had to be made for you to use Student’s t-distribution? Answer: (a) Under PPP, H0 : β0 = 0 and H0 : β1 = 1. Economic theory does not tell you whether the intercept should be greater or less than zero if PPP does not hold. The same goes for the slope, i.e., you do not know whether it is less than or greater than unity. As a result, you should use a two-tailed alternative hypothesis. (b) The t-statistic for the intercept is $t = \frac{-27.05 - 0}{23.74} = -1.14$. For the slope, it is $t = \frac{1.35 - 1}{0.02} = 17.5$. (c) Using the Student t-distribution and 27 degrees of freedom, the critical value for a two-sided alternative is 2.05. Hence you can reject the null hypothesis for the slope but not for the intercept. Under PPP, both hypotheses are supposed to hold simultaneously and if either or both are rejected, then PPP is not supported by the data. As is discussed later in the textbook, testing hypotheses sequentially is not the same as testing them simultaneously, since p-values change. (At an intuitive level, and heroically assuming independence here, Pr(A and B) = Pr(A) × Pr(B), and hence the rejection probability needs to be adjusted.) (d) In addition to the standard three least squares assumptions, you had to assume that the regression errors are homoskedastic, and that the regression errors are normally distributed. That is, you had to assume that the homoskedastic normal regression assumptions hold. 10) (Continuation from Chapter 4, number 6) The neoclassical growth model predicts that for identical savings rates and population growth rates, countries should converge to the same per capita income level. This is referred to as the convergence hypothesis. One way to test for the presence of convergence is to compare the growth rates over time to the initial starting level. (a) The results of the regression for 104 countries were as follows: g6090 = 0.019 – 0.0006 × RelProd60, R2 = 0.00007, SER = 0.016 (0.004) (0.0073) where g6090 is the average annual growth rate of GDP per worker for the 1960-1990 sample period, and RelProd60 is GDP per worker relative to the United States in 1960. Numbers in parentheses are heteroskedasticity-robust standard errors.

Using the OLS estimator with homoskedasticity-only standard errors, the results changed as follows: g6090 = 0.019 – 0.0006×RelProd 60 , R2 = 0.00007, SER = 0.016 (0.002) (0.0068) Why didn’t the estimated coefficients change? Given that the standard error of the slope is now smaller, can you reject the null hypothesis of no beta convergence? Are the results in the second equation more reliable than the results in the first equation? Explain. (b) You decide to restrict yourself to the 24 OECD countries in the sample. This changes your regression output as follows (numbers in parenthesis are heteroskedasticity robust standard errors): g6090 = 0.048 – 0.0404 RelProd 60 , R2 = 0.82 , SER = 0.0046 (0.004) (0.0063) Test for evidence of convergence now. If your conclusion is different than in (a), speculate why this is the case. (c) The authors of your textbook have informed you that unless you have more than 100 observations, it may not be plausible to assume that the distribution of your OLS estimators is normal. What are the implications here for testing the significance of your theory? Answer: (a) Using homoskedasticity-only standard errors has no effect on the OLS estimator. The t- statistic remains small and is certainly below the critical value. The results are less reliable since there is no reason to believe that the error variance is homoskedastic. (b) The t-statistic for the slope is 6.41. At face value, there is strong evidence for convergence. Neoclassical growth theory does not predict unconditional convergence. Instead it only predicts convergence if the savings rates and population growth rates are identical. It stands to reason that these are much more similar between OECD countries than between the countries of the world. (c) Since there are less than 30 observations, the distribution of the t-statistic is unknown. You should therefore not conduct statistical inference. 11) You have collected 14,925 observations from the Current Population Survey. 
There are 6,285 females in the sample, and 8,640 males. The females report a mean of average hourly earnings of $16.50 with a standard deviation of $9.06. The males have an average of $20.09 and a standard deviation of $10.85. The overall mean average hourly earnings is $18.58.

a. Using the t-statistic for testing differences between two means (section 3.4 of your textbook), decide whether or not there is sufficient evidence to reject the null hypothesis that females and males have identical average hourly earnings.

b. You decide to run two regressions: first, you simply regress average hourly earnings on an intercept only. Next, you repeat this regression, but only for the 6,285 females in the sample. What will the regression coefficients be in each of the two regressions?

c. Finally you run a regression over the entire sample of average hourly earnings on an intercept and a binary variable DFemme, where this variable takes on a value of 1 if the individual is a female, and is 0 otherwise. What will be the value of the intercept? What will be the value of the coefficient of the binary variable?

d. What is the standard error on the slope coefficient? What is the t-statistic?

e. Had you used the homoskedasticity-only standard error in (d) and calculated the t-statistic, how would you have had to change the test-statistic in (a) to get the identical result?

Answer: a. H0: μF = μM; H1: μF ≠ μM

\[ t = \frac{20.09 - 16.50}{\sqrt{\dfrac{10.85^2}{8640} + \dfrac{9.06^2}{6285}}} = 21.98 \]

As a result, you can comfortably reject the null hypothesis at any reasonable confidence level.

b. \( \widehat{ahe} = \hat{\beta}_0 = 18.58 \); \( \widehat{ahe} = \hat{\beta}_0 = 16.50 \). Hence the intercept equals the overall mean of average hourly earnings in the first regression, and the mean of average hourly earnings for females in the second.

c. \( \widehat{ahe} = \hat{\beta}_0 + \hat{\beta}_1 \times DFemme = 20.09 - 3.59 \times DFemme \). The intercept is the mean of average hourly earnings for males, and the slope is the difference between the mean of average hourly earnings of females and males.

d. The standard error on the slope coefficient is 0.16, which is identical to the standard error in the denominator of the t-statistic in (a) above. Hence the t-statistic is (-21.98).

e. You would have had to use the “pooled” standard error formula (3.23) in your textbook.
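The arithmetic in parts (a) to (d) can be checked with a few lines of Python. This is only a sketch built from the summary statistics quoted in the problem; the function name `two_sample_t` is made up for illustration and is not from the textbook.

```python
from math import sqrt

def two_sample_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Difference-in-means t-statistic with the unpooled standard error."""
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / se, se

# Males first, females second, as in the answer to (a).
t_stat, se = two_sample_t(20.09, 10.85, 8640, 16.50, 9.06, 6285)

# Intercept-only regression on the full sample returns the overall mean,
# which is the sample-size-weighted average of the two group means.
overall_mean = (8640 * 20.09 + 6285 * 16.50) / (8640 + 6285)

# Regression on DFemme: intercept = male mean, slope = female mean - male mean.
intercept = 20.09
slope = 16.50 - 20.09

print(round(t_stat, 2), round(se, 2), round(overall_mean, 2), round(slope, 2))
```

The slope t-statistic in (d) is the same number with the sign flipped, because the binary variable is coded 1 for females.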

5.3 Mathematical and Graphical Problems

1) To decide whether the alternative hypothesis should be one-sided or two-sided, you need some guidance from economic theory. Choose at least three examples from economics or other fields where you have a clear idea what the null hypothesis and the alternative hypothesis for the slope coefficient should be. Write a brief justification for your answer. Answer: Answers will vary by student. The problem is to find examples where there is only a single explanatory variable. A student may argue that the price coefficient in a demand function is negative, but unless you control for other variables, this may not be so. The demand for L.A. Lakers tickets and their price comes to mind. CAPM is a nice example. Perhaps the marginal propensity to consume in a consumption function is another. Testing for speculative efficiency in exchange rate markets may also work.

2) For the following estimated slope coefficients and their heteroskedasticity-robust standard errors, find the t-statistics for the null hypothesis H0: β1 = 0. Assuming that your sample has more than 100 observations, indicate whether or not you are able to reject the null hypothesis at the 10%, 5%, and 1% level of a one-sided and two-sided hypothesis.

(a) \( \hat{\beta}_1 = 4.2 \), \( SE(\hat{\beta}_1) = 2.4 \)
(b) \( \hat{\beta}_1 = 0.5 \), \( SE(\hat{\beta}_1) = 0.37 \)
(d) \( \hat{\beta}_1 = 360 \), \( SE(\hat{\beta}_1) = 300 \)

Answer: a) t = 1.75; reject null at 10% level of two-sided test, and 5% of one-sided test.
b) t = 1.35; cannot reject null at 10% of two-sided test, reject null at 10% of one-sided test.
c) t = 1.50; cannot reject null at 10% of two-sided test, reject null at 10% of one-sided test.
d) t = 1.20; cannot reject null at 10% of both two-sided and one-sided test.
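The decisions above can be reproduced with large-sample normal critical values. The sketch below assumes the one-sided alternative is H1: β1 > 0 (all listed estimates are positive); the helper names are hypothetical.

```python
from statistics import NormalDist

def t_statistic(beta_hat, se, beta_null=0.0):
    """t-statistic for H0: beta = beta_null."""
    return (beta_hat - beta_null) / se

def rejects(t, level, two_sided):
    """Large-sample decision; the one-sided case assumes H1: beta > beta_null."""
    z = NormalDist()
    if two_sided:
        return abs(t) > z.inv_cdf(1 - level / 2)
    return t > z.inv_cdf(1 - level)

cases = {"a": (4.2, 2.4), "b": (0.5, 0.37), "d": (360.0, 300.0)}
for label, (beta_hat, se) in cases.items():
    t = t_statistic(beta_hat, se)
    for level in (0.10, 0.05, 0.01):
        print(label, round(t, 2), level,
              "two-sided:", rejects(t, level, True),
              "one-sided:", rejects(t, level, False))
```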

3) Explain carefully the relationship between a confidence interval, a one-sided hypothesis test, and a two-sided hypothesis test. What is the unit of measurement of the t-statistic?

Answer: In the case of a two-sided hypothesis test, the relationship between the t-statistic and the confidence interval is straightforward. The t-statistic calculates the distance between the estimate and the hypothesized value in standard deviations. If the distance is larger than 1.96 (size of the test: 5%), then the distance is large enough to reject the null hypothesis. The confidence interval adds and subtracts 1.96 standard deviations in this case, and asks whether or not the hypothesized value is contained within the confidence interval. Hence the two concepts resemble the two sides of a coin. They are simply different ways to look at the same problem. In the case of the one-sided test, the relationship is more complex. Since you are looking at a one-sided alternative, it does not really make sense to construct a confidence interval. However, the confidence interval results in the same conclusion as the t-test if the critical value from the standard normal distribution is appropriately adjusted, e.g. to 10% rather than 5%. The unit of measurement of the t-statistic is standard deviations.

4) The effect of decreasing the student-teacher ratio by one is estimated to result in an improvement of the districtwide score by 2.28 with a standard error of 0.52. Construct a 90% and 99% confidence interval for the size of the slope coefficient and the corresponding predicted effect of changing the student-teacher ratio by one. What is the intuition on why the 99% confidence interval is wider than the 90% confidence interval?

Answer: The 90% confidence interval for the slope is calculated as follows: (2.28 - 1.645 × 0.52, 2.28 + 1.645 × 0.52) = (1.42, 3.14). The corresponding predicted effect of a unit change in the student-teacher ratio is the same, since the change in X is 1. The 99% confidence interval for the slope coefficient and the unit change in the student-teacher ratio is: (2.28 - 2.58 × 0.52, 2.28 + 2.58 × 0.52) = (0.94, 3.62). The 99% confidence interval corresponds to a smaller size of the test. This means that you want to be “more certain” that the population parameter is contained in the interval, and that requires a larger interval.

5) Below you are asked to decide on whether or not to use a one-sided alternative or a two-sided alternative hypothesis for the slope coefficient. Briefly justify your decision.

(a) \( \hat{q}_i^{\,d} = \hat{\beta}_0 + \hat{\beta}_1 p_i \), where \( q^d \) is the quantity demanded for a good, and p is its price.
(b) \( \hat{p}_i^{\,actual} = \hat{\beta}_0 + \hat{\beta}_1 p_i^{assess} \), where \( p_i^{actual} \) is the actual house price, and \( p_i^{assess} \) is the assessed house price. You want to test whether or not the assessment is correct, on average.
(c) \( \hat{C}_i = \hat{\beta}_0 + \hat{\beta}_1 Y_i^d \), where C is household consumption, and \( Y^d \) is personal disposable income.

Answer: (a) You would use a one-sided alternative hypothesis since economic theory suggests that the quantity demanded and prices are negatively related. (b) The alternative hypothesis is H1: β1 ≠ 1 since assessments could be too large or too small, on average. You should also test for H1: β0 ≠ 0. (c) You should use a one-sided alternative hypothesis, since economic theory strongly suggests that the marginal propensity to consume is positive.
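The confidence intervals in problem 4 can be reproduced as follows. This is a minimal sketch using large-sample normal critical values; the function name `confidence_interval` is an illustrative choice, not something from the textbook.

```python
from statistics import NormalDist

def confidence_interval(estimate, se, level):
    """Two-sided large-sample confidence interval at the given coverage level."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.645 for level = 0.90
    return estimate - z * se, estimate + z * se

ci90 = confidence_interval(2.28, 0.52, 0.90)
ci99 = confidence_interval(2.28, 0.52, 0.99)
print([round(v, 2) for v in ci90], [round(v, 2) for v in ci99])
```

The higher coverage level uses a larger critical value, which is exactly why the 99% interval is wider.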


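6) (Requires Appendix material) Your textbook shows that OLS is a linear estimator \( \hat{\beta}_1 = \sum_{i=1}^{n} \hat{a}_i Y_i \), where \( \hat{a}_i = \dfrac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \). For OLS to be conditionally unbiased, the following two conditions must hold: \( \sum_{i=1}^{n} \hat{a}_i = 0 \) and \( \sum_{i=1}^{n} \hat{a}_i X_i = 1 \). Show that this is the case.

Answer:

\[ \sum_{i=1}^{n} \hat{a}_i = \sum_{i=1}^{n} \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})}{\sum_{j=1}^{n} (X_j - \bar{X})^2} = 0 \]

since deviations from the mean sum to zero.

\[ \sum_{i=1}^{n} \hat{a}_i X_i = \sum_{i=1}^{n} \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2} X_i = \frac{\sum_{i=1}^{n} (X_i - \bar{X}) X_i}{\sum_{j=1}^{n} (X_j - \bar{X})^2} = 1 \]

(Note that \( \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X}) = \sum_{i=1}^{n} (X_i - \bar{X}) X_i - \bar{X} \sum_{i=1}^{n} (X_i - \bar{X}) \), where the last term is zero again because of the definition of a mean.)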
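Both conditions can also be confirmed numerically. The sketch below uses made-up X and Y values (they are not from the textbook) and checks that the weights sum to zero, that the weighted sum of the X values is one, and that the weighted sum of the Y values reproduces the usual OLS slope.

```python
def ols_weights(xs):
    """OLS weights a_i = (X_i - X_bar) / sum_j (X_j - X_bar)^2."""
    x_bar = sum(xs) / len(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [(x - x_bar) / sxx for x in xs]

# Hypothetical data, chosen only for illustration.
xs = [2.0, 3.5, 5.0, 7.5, 11.0]
ys = [10.0, 9.0, 7.5, 6.0, 2.0]

a = ols_weights(xs)
sum_a = sum(a)                                   # should be 0
sum_ax = sum(ai * xi for ai, xi in zip(a, xs))   # should be 1
slope_from_weights = sum(ai * yi for ai, yi in zip(a, ys))

# Textbook slope formula, for comparison.
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
print(sum_a, sum_ax, slope_from_weights, slope)
```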

7) (Requires Appendix material and Calculus) Equation (5.36) in your textbook derives the conditional variance for any conditionally unbiased linear estimator \( \tilde{\beta}_1 \) to be \( var(\tilde{\beta}_1 \mid X_1, ..., X_n) = \sigma_u^2 \sum_{i=1}^{n} a_i^2 \), where the conditions for conditional unbiasedness are \( \sum_{i=1}^{n} a_i = 0 \) and \( \sum_{i=1}^{n} a_i X_i = 1 \). As an alternative to the BLUE proof presented in your textbook, you recall from one of your calculus courses that you could minimize the variance subject to the two constraints, thereby making the variance as small as possible while the constraints are holding. Show that in doing so you get the OLS weights \( \hat{a}_i \). (You may assume that X1,..., Xn are nonrandom (fixed over repeated samples).)

Answer: The Lagrangian is \( \sigma_u^2 \sum_{i=1}^{n} a_i^2 - \lambda_1 \sum_{i=1}^{n} a_i - \lambda_2 \left( \sum_{i=1}^{n} a_i X_i - 1 \right) \), where \( \lambda_1 \) and \( \lambda_2 \) are two Lagrangian multipliers. Minimizing the Lagrangian w.r.t. the n weights \( a_i \), i = 1,..., n, and the two Lagrangian multipliers results in (n+2) linear equations in (n+2) unknowns. Solving these for the weights, you get \( a_i = \dfrac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2} = \hat{a}_i \).
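The step from the first-order conditions to the weights is compressed in the answer above. One way to fill it in, using the same Lagrangian and multipliers, is sketched here:

```latex
\begin{align*}
\frac{\partial L}{\partial a_i} &= 2\sigma_u^2 a_i - \lambda_1 - \lambda_2 X_i = 0
  \quad\Rightarrow\quad a_i = \frac{\lambda_1 + \lambda_2 X_i}{2\sigma_u^2}, \\
\sum_{i=1}^{n} a_i = 0 &\quad\Rightarrow\quad n\lambda_1 + \lambda_2 n\bar{X} = 0
  \quad\Rightarrow\quad \lambda_1 = -\lambda_2 \bar{X}
  \quad\Rightarrow\quad a_i = \frac{\lambda_2 (X_i - \bar{X})}{2\sigma_u^2}, \\
\sum_{i=1}^{n} a_i X_i = 1 &\quad\Rightarrow\quad
  \frac{\lambda_2}{2\sigma_u^2} \sum_{i=1}^{n} (X_i - \bar{X}) X_i = 1
  \quad\Rightarrow\quad
  \frac{\lambda_2}{2\sigma_u^2} = \frac{1}{\sum_{j=1}^{n} (X_j - \bar{X})^2}, \\
a_i &= \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2} = \hat{a}_i .
\end{align*}
```

The third line uses the identity \( \sum_i (X_i - \bar{X}) X_i = \sum_i (X_i - \bar{X})^2 \), which holds because deviations from the mean sum to zero.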


8) Your textbook states that under certain restrictive conditions, the t-statistic has a Student t-distribution with n-2 degrees of freedom. The loss of two degrees of freedom is the result of OLS forcing two restrictions onto the data. What are these two conditions, and when did you impose them onto the data set in your derivation of the OLS estimator?

Answer: The two conditions are \( \sum_{i=1}^{n} \hat{u}_i = 0 \) and \( \sum_{i=1}^{n} \hat{u}_i X_i = 0 \). These were the result of minimizing the sum of the squared prediction errors, i.e., taking the derivatives of the sum of squared prediction mistakes with respect to b0 and b1 and setting them to zero.

9) Assume that your population regression function is \( Y_i = \beta_1 X_i + u_i \), i.e., a regression through the origin (no intercept). Under the homoskedastic normal regression assumptions, the t-statistic will have a Student t distribution with n-1 degrees of freedom, not n-2 degrees of freedom, as was the case in Chapter 5 of your textbook. Explain. Do you think that the residuals will still sum to zero for this case?

Answer: In deriving the OLS estimator \( \hat{\beta}_1 \), you minimize the sum of squared prediction mistakes w.r.t. b1 only, not b0 and b1. As a result, you are only placing one restriction on the data, \( \sum_{i=1}^{n} \hat{u}_i X_i = 0 \), not two. Hence there are n-1 independent observations. \( \sum_{i=1}^{n} \hat{u}_i = 0 \) will no longer hold.

10) In many of the cases discussed in your textbook, you test for the significance of the slope at the 5% level. What is the size of the test? What is the power of the test? Why is the probability of committing a Type II error so large here?

Answer: The size of the test is the same as the probability of committing a Type I error. It is therefore 5%. If the alternative hypothesis is vague, as is the case for H1: β1 ≠ 0 or H1: β1 < 0 (or H1: β1 > 0), then the distribution of the alternative hypothesis is located virtually on top of the distribution of the null hypothesis (it is just marginally moved to the left or the right). As a result, the probability of the Type II error must be 1 - probability of the Type I error. Hence the power of the test is only 5%, which is low.

11) Assume that the homoskedastic normal regression assumptions hold. Using the Student t-distribution, find the critical value for the following situations:
(a) n = 28, 5% significance level, one-sided test.
(b) n = 40, 1% significance level, two-sided test.
(c) n = 10, 10% significance level, one-sided test.
(d) n = ∞, 5% significance level, two-sided test.

Answer: (a) 1.71 (b) between 2.75 (30 degrees of freedom) and 2.66 (60 degrees of freedom) (c) 1.40 (d) 1.96
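The claim in problem 9 that a regression through the origin imposes only one restriction on the residuals can be illustrated numerically. The data below are invented for the sketch; the point is only that the residuals are orthogonal to X but do not sum to zero.

```python
def slope_through_origin(xs, ys):
    """OLS slope when the regression has no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 4.0, 7.0, 8.0]

b1 = slope_through_origin(xs, ys)
residuals = [y - b1 * x for x, y in zip(xs, ys)]

sum_ux = sum(u * x for u, x in zip(residuals, xs))  # restriction imposed by OLS
sum_u = sum(residuals)                              # not restricted without an intercept
print(round(b1, 4), round(sum_ux, 10), round(sum_u, 4))
```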


12) Consider the following two models involving binary variables as explanatory variables: Wage = β0 + β1 DFemme and Wage = φ1 DFemme + φ2 Male, where Wage is the hourly wage rate, DFemme is a binary variable that is equal to 1 if the person is a female, and 0 if the person is a male. Male = 1 – DFemme. Even though you have not learned about regression functions with two explanatory variables (or regressions without an intercept), assume that you had estimated both models, i.e., you obtained the estimates for the regression coefficients. What is the predicted wage for a male in the two models? What is the predicted wage for a female in the two models? What is the relationship between the βs and the φs? Why would you prefer one model over the other?

Answer: For DFemme = 1, the models read Wage = β0 + β1 and Wage = φ1; for DFemme = 0, the models read Wage = β0 and Wage = φ2. Hence both β0 and φ2 give you the average wage of males, so β0 = φ2. Since the wage for females is φ1 = β0 + β1, and the wage for males is β0, β1 must be the difference in the wage between females and males. Hence the first formulation allows you to test directly whether or not the difference in means (here wages) is statistically significant.

13) Consider the sample regression function \( \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \). The table below lists estimates for the slope (\( \hat{\beta}_1 \)) and the variance of the slope estimator (\( \hat{\sigma}^2_{\hat{\beta}_1} \)). In each case calculate the p-value for the null hypothesis of β1 = 0 and a two-tailed alternative hypothesis. Indicate in which case you would reject the null hypothesis at the 5% significance level.

Slope estimate          –1.76     0.0025      2.85     -0.00014
Variance of estimator    0.37     0.000003    117.5     0.0000013

Answer: The t-statistics are -2.89, 1.36, 0.26, and -0.123 respectively, with p-values of 0.004, 0.17, 0.79, and 0.90. Hence you only reject the null hypothesis for the first case.


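14) Your textbook discussed the regression model when X is a binary variable: \( Y_i = \beta_0 + \beta_1 D_i + u_i \), i = 1,..., n. Let Y represent wages, and let D be one for females, and 0 for males. Using the OLS formula for the slope coefficient, prove that \( \hat{\beta}_1 \) is the difference between the average wage for males and the average wage for females.

Answer: Using the OLS formula for the slope, we have

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{\sum_{i=1}^{n_f} wage_i - n_f\,\overline{wage}}{n_f - n\left(\frac{n_f}{n}\right)^2}, \]

where \( n_f \) is the number of females in the sample and \( \overline{wage} \) is the average wage. Dividing both the numerator and the denominator by \( n_f \), we get

\[ \hat{\beta}_1 = \frac{\frac{1}{n_f}\sum_{i=1}^{n_f} wage_i - \overline{wage}}{1 - \frac{n_f}{n}} = \frac{\overline{wage}_f - \overline{wage}}{\frac{n - n_f}{n}} = \frac{n}{n - n_f}\left(\overline{wage}_f - \overline{wage}\right), \]

where \( \overline{wage}_f \) is the average wage of females. But note that \( \overline{wage} = \frac{n_f}{n}\overline{wage}_f + \frac{n_m}{n}\overline{wage}_m \), where the m subscript indicates males. Substitution of this expression for average wages into the previous expression results in

\[ \hat{\beta}_1 = \frac{n}{n - n_f}\left(\overline{wage}_f - \overline{wage}\right) = \frac{n}{n - n_f}\left(\overline{wage}_f - \frac{n_f}{n}\overline{wage}_f - \frac{n_m}{n}\overline{wage}_m\right) = \frac{n_m}{n - n_f}\overline{wage}_f - \frac{n_m}{n - n_f}\overline{wage}_m . \]

Since \( n - n_f = n_m \), we have the desired result.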
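A small numerical check of this result, and of the intercept result in the next problem, is sketched below. The five wage observations are invented for illustration; `ols` is a hypothetical helper implementing the standard simple-regression formulas.

```python
def ols(xs, ys):
    """Return (intercept, slope) from a simple OLS regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical sample: D = 1 for females, 0 for males.
female = [1, 1, 0, 0, 0]
wage = [15.0, 17.0, 18.0, 20.0, 22.0]

b0, b1 = ols(female, wage)
mean_f = sum(w for d, w in zip(female, wage) if d == 1) / female.count(1)
mean_m = sum(w for d, w in zip(female, wage) if d == 0) / female.count(0)
print(b0, b1, mean_f, mean_m)
```

With D coded 1 for females, the slope equals the female mean minus the male mean, and the intercept equals the male mean.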


15) Your textbook discussed the regression model when X is a binary variable: \( Y_i = \beta_0 + \beta_1 D_i + u_i \), i = 1,..., n. Let Y represent wages, and let D be one for females, and 0 for males. Using the OLS formula for the intercept coefficient, prove that \( \hat{\beta}_0 \) is the average wage for males.

Answer: \( \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \). It is easy but tedious to show that the formula for the slope reduces to the difference between the average wage for females and the average wage for males:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{\sum_{i=1}^{n_f} wage_i - n_f\,\overline{wage}}{n_f - n\left(\frac{n_f}{n}\right)^2} = \overline{wage}_f - \overline{wage}_m . \]

But \( \bar{Y} = \overline{wage} \) and \( \bar{X} = \frac{n_f}{n} \), and hence \( \hat{\beta}_0 = \overline{wage} - \left(\overline{wage}_f - \overline{wage}_m\right)\frac{n_f}{n} \). Substituting the expression \( \overline{wage} = \frac{n_f}{n}\overline{wage}_f + \frac{n_m}{n}\overline{wage}_m \) then results in \( \hat{\beta}_0 = \frac{n_m}{n}\overline{wage}_m + \frac{n_f}{n}\overline{wage}_m \), which equals the male average wage.

16) Let \( u_i \) be distributed \( N(0, \sigma_u^2) \), i.e., the errors are distributed normally with a constant variance (homoskedasticity). This results in \( \hat{\beta}_1 \) being distributed \( N(\beta_1, \sigma^2_{\hat{\beta}_1}) \), where \( \sigma^2_{\hat{\beta}_1} = \dfrac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \). Statistical inference would be straightforward if \( \sigma_u^2 \) was known. One way to deal with this problem is to replace \( \sigma_u^2 \) with an estimator \( s^2_{\hat{u}} \). Clearly since this introduces more uncertainty, you cannot expect \( \hat{\beta}_1 \) to be still normally distributed. Indeed, the t-statistic now follows Student’s t distribution. Look at the table for the Student t-distribution and focus on the 5% two-sided significance level. List the critical values for 10 degrees of freedom, 30 degrees of freedom, 60 degrees of freedom, and finally ∞ degrees of freedom. Describe how the notion of uncertainty about \( \sigma_u^2 \) is reflected in the tails of the t-distribution as the degrees of freedom increase.

Answer: More uncertainty implies that the tails of the distribution should be stretched further to the left and right when compared to the normal distribution. Hence the critical values for the 5% significance level should be greater than 1.96 in absolute value. However, as the number of observations (degrees of freedom) increases, \( s^2_{\hat{u}} \) will converge towards \( \sigma_u^2 \), so that the shape of the t-distribution should resemble the normal distribution more and more. Finally, when there are infinite degrees of freedom, the sample formula \( s^2_{\hat{u}} \) becomes the population variance, and the t-distribution should converge to the normal distribution.


17) In a Monte Carlo study, econometricians generate multiple sample regression functions from a known population regression function. For example, the population regression function could be Yi = β0 + β1 Xi = 100 – 0.5 Xi. The Xs could be generated randomly or, for simplicity, be nonrandom (“fixed over repeated samples”). If we had ten of these Xs, say, and generated twenty Ys, we would obviously always have all observations on a straight line, and the least squares formulae would always return values of 100 and –0.5 numerically. However, if we added an error term, where the errors would be drawn randomly from a normal distribution, say, then the OLS formulae would give us estimates that differed from the population regression function values. Assume you did just that and recorded the values for the slope and the intercept. Then you did the same experiment again (each one of these is called a “replication”). And so forth. After 1,000 replications, you plot the 1,000 intercepts and slopes, and list their summary statistics.

Sample: 1 1000

                BETA0_HAT     BETA1_HAT
Mean            100.014       –0.500
Median          100.021       –0.500
Maximum         106.348       –0.468
Minimum         93.862        –0.538
Std. Dev.       1.994         0.011
Skewness        0.013         –0.042
Kurtosis        3.026         2.986

Jarque-Bera     0.055         0.305
Probability     0.973         0.858

Sum             100014.353    –499.857
Sum Sq. Dev.    3972.403      0.118

Observations    1000.000

Here are the corresponding graphs:


Using the means listed next to the graphs, you see that the averages are not exactly 100 and –0.5. However, they are “close.” Test whether the difference of these averages from the population values is statistically significant.

Answer: You can use a simple t-statistic to calculate whether or not (-0.499857) and 100.0144 are statistically different from (-0.5) and 100. In the denominator of that statistic you would simply put the standard deviations (0.0109 and 1.9941) divided by the square root of 1,000. As you can see,

\[ t = \frac{-0.499857 - (-0.50)}{0.0109 / \sqrt{1000}} = 0.41 \quad \text{and} \quad t = \frac{100.0144 - 100}{1.9941 / \sqrt{1000}} = 0.23 . \]

Neither one of the averages is more than 1.96 standard errors from the true value, and hence you cannot reject the null hypothesis that the estimators are unbiased.
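The experiment described in this problem can be sketched in a few lines. The design below is an assumption made for illustration only: ten fixed X values (1 through 10), standard normal errors, and a fixed random seed, none of which are specified in the problem. It will therefore not reproduce the table above, only the logic of the test.

```python
import random
from statistics import mean, stdev

def ols(xs, ys):
    """Return (intercept, slope) from a simple OLS regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def monte_carlo(replications=1000, seed=0):
    """Simulate Y = 100 - 0.5 X + u repeatedly and collect the OLS estimates."""
    rng = random.Random(seed)
    xs = [float(x) for x in range(1, 11)]  # assumed: X fixed over repeated samples
    b0s, b1s = [], []
    for _ in range(replications):
        ys = [100.0 - 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
        b0, b1 = ols(xs, ys)
        b0s.append(b0)
        b1s.append(b1)
    return b0s, b1s

def bias_t_stat(draws, true_value):
    """t-statistic for H0: the mean of the simulated estimates equals the true value."""
    return (mean(draws) - true_value) / (stdev(draws) / len(draws) ** 0.5)

b0s, b1s = monte_carlo()
print(mean(b0s), mean(b1s), bias_t_stat(b0s, 100.0), bias_t_stat(b1s, -0.5))
```

The last two printed numbers play the role of the two t-statistics in the answer: values inside ±1.96 are consistent with unbiasedness.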


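18) In the regression through the origin model \( Y_i = \beta_1 X_i + u_i \), the OLS estimator is \( \hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2} \). Prove that the estimator is a linear function of Y1,..., Yn and prove that it is conditionally unbiased.

Answer: Let \( w_i = \dfrac{X_i}{\sum_{j=1}^{n} X_j^2} \); then \( \hat{\beta}_1 = \sum_{i=1}^{n} w_i Y_i \). Hence the OLS estimator is a linear function of Y1,..., Yn. Next, since \( Y_i = \beta_1 X_i + u_i \), we get

\[ \hat{\beta}_1 = \sum_{i=1}^{n} w_i (\beta_1 X_i + u_i) = \beta_1 \sum_{i=1}^{n} w_i X_i + \sum_{i=1}^{n} w_i u_i . \]

Since \( w_i = \dfrac{X_i}{\sum_{j=1}^{n} X_j^2} \), \( \sum_{i=1}^{n} w_i X_i = \dfrac{\sum_{i=1}^{n} X_i^2}{\sum_{i=1}^{n} X_i^2} = 1 \), which implies \( \hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i \). Taking expectations on both sides, we find

\[ E(\hat{\beta}_1) = \beta_1 + E\left( \sum_{i=1}^{n} w_i u_i \right) = \beta_1 + E\left( \frac{\frac{1}{n}\sum_{i=1}^{n} X_i u_i}{\frac{1}{n}\sum_{i=1}^{n} X_i^2} \right) = \beta_1 + E\left( \frac{\frac{1}{n}\sum_{i=1}^{n} X_i E(u_i \mid X_1,..., X_n)}{\frac{1}{n}\sum_{i=1}^{n} X_i^2} \right) = \beta_1 . \]

The last equality follows by using the law of iterated expectations. By least squares assumptions, \( u_i \) is distributed independently of X for all observations other than i, so \( E(u_i \mid X_1,..., X_n) = E(u_i \mid X_i) = 0 \). Hence \( E(\hat{\beta}_1 \mid X_1,..., X_n) = \beta_1 \).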


19) The neoclassical growth model predicts that for identical savings rates and population growth rates, countries should converge to the per capita income level. This is referred to as the convergence hypothesis. One way to test for the presence of convergence is to compare the growth rates over time to the initial starting level, i.e., to run the regression g6090 = β0 + β1 × RelProd60, where g6090 is the average annual growth rate of GDP per worker for the 1960-1990 sample period, and RelProd60 is GDP per worker relative to the United States in 1960. Under the null hypothesis of no convergence, β1 = 0; H1: β1 < 0, implying (“beta”) convergence. Using a standard regression package, you get the following output:

Dependent Variable: G6090
Method: Least Squares
Date: 07/11/06 Time: 05:46
Sample: 1 104
Included observations: 104
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           0.018989       0.002392      7.939864       0.0000
YL60        –0.000566      0.005056      -0.111948      0.9111

R-squared             0.000068     Mean dependent var       0.018846
Adjusted R-squared    -0.009735    S.D. dependent var       0.015915
S.E. of regression    0.015992     Akaike info criterion    -5.414418
Sum squared resid     0.026086     Schwarz criterion        -5.363565
Log likelihood        283.5498     F-statistic              0.006986
Durbin-Watson stat    1.367534     Prob(F-statistic)        0.933550

You are delighted to see that this program has already calculated p-values for you. However, a peer of yours points out that the correct p-value should be 0.4562. Who is right?

Answer: Statistical packages typically do not know what the alternative hypothesis is. As a result, the packages calculate t-statistics and p-values for H1: β1 ≠ 0. You can tell your fellow student that she is right and you will still have to calculate p-values (and t-statistics) by hand for cases other than H1: β1 ≠ 0.

20) Changing the units of measurement obviously will have an effect on the slope of your regression function. For example, let Y* = aY and X* = bX. Then it is easy but tedious to show that \( \hat{\beta}_1^* = \dfrac{\sum_{i=1}^{n} x_i^* y_i^*}{\sum_{i=1}^{n} x_i^{*2}} = \dfrac{a}{b}\hat{\beta}_1 \). Given this result, how do you think the standard errors and the regression R² will change?

Answer: Statistical inference should not depend on whim, and hence changes in the units of measurement cannot have an effect on the regression R². Also, the t-statistics should not change, and hence \( SE(\hat{\beta}_1^*) \) must change accordingly (\( SE(\hat{\beta}_1^*) = \dfrac{a}{b} \times SE(\hat{\beta}_1) \)).
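The p-value comparison in problem 19 can be sketched with the standard normal distribution. This is an approximation: the package itself most likely uses a Student t reference distribution, so its two-sided value (0.9111) differs slightly from the normal one. The function name `p_values` is an illustrative choice.

```python
from statistics import NormalDist

def p_values(t):
    """Large-sample p-values for the three possible alternatives."""
    z = NormalDist()
    two_sided = 2 * z.cdf(-abs(t))   # H1: beta1 != 0 (what packages report)
    left = z.cdf(t)                  # H1: beta1 < 0 (beta convergence)
    right = 1 - z.cdf(t)             # H1: beta1 > 0
    return two_sided, left, right

two_sided, left, right = p_values(-0.111948)
print(round(two_sided, 4), round(left, 4), round(right, 4))

# With the t-statistic rounded to -0.11, the one-sided p-value is the peer's 0.4562.
_, left_rounded, _ = p_values(-0.11)
print(round(left_rounded, 4))
```

When the estimate has the sign predicted by the one-sided alternative, the one-sided p-value is half of the two-sided one.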


21) Using the California School data set from your textbook, you run the following regression:

TestScr = 698.9 - 2.28 STR
n = 420, SER = 18.6

where TestScore is the average test score in the district and STR is the student-teacher ratio. The sample standard deviation of test scores is 19.05, and the sample standard deviation of the student-teacher ratio is 1.89.

a. Find the regression R² and the correlation coefficient between test scores and the student-teacher ratio.

b. Find the homoskedasticity-only standard error of the slope.

Answer: a. \( R^2 = 1 - \dfrac{SSR}{TSS} = 1 - \dfrac{144611.3}{152490.6} = 0.051 \). The correlation coefficient is the (negative) square root of this, or (-0.23).
b. Using formula (5.29), you get \( \hat{\sigma}_{\hat{\beta}_1} = \dfrac{18.6}{38.8} = 0.48 \).

22) Using the California School data set from your textbook, you run the following regression:

TestScr = 698.9 - 2.28 STR
n = 420, R² = 0.051, SER = 18.6

where TestScore is the average test score in the district and STR is the student-teacher ratio. Using heteroskedasticity-robust standard errors, you find

while choosing the homoskedasticity-only option, the standard error is 0.48.

a. Calculate the t-statistic for both standard errors.

b. Which of the two t-statistics should you base your inference on?

Answer: a. The respective t-statistics are 4.39 (heteroskedasticity-robust standard error) and 4.75 (homoskedasticity-only standard error).

b. Given the similarity of the two statistics and the fact that both are greater than 4, it will not make much of a difference which one you will use. However, it is “cleaner” to use the heteroskedasticity-robust formula, since, in general, it will result in the correct inference procedure.


23) Using data from the Current Population Survey, you estimate the following relationship between average hourly earnings (ahe) and the number of years of education (educ): ahe = -4.58 + 1.71 educ. The heteroskedasticity-robust standard error on the slope is (0.03). Calculate the 95% confidence interval for the slope. Repeat the exercise using the 90% and then the 99% confidence interval. Can you reject the null hypothesis that the slope coefficient is zero in the population?

Answer: The 95% confidence interval for the slope is (1.65, 1.77). For the 90% confidence level, you get (1.66, 1.76), while the interval is (1.63, 1.79) for the 99% level. Since none of the confidence intervals contains zero, you can comfortably reject the null hypothesis in all three cases.


Chapter 6 Linear Regression with Multiple Regressors

6.1 Multiple Choice

1) In the multiple regression model, the adjusted R², \( \bar{R}^2 \),
A) cannot be negative.
B) will never be greater than the regression R².
C) equals the square of the correlation coefficient r.
D) cannot decrease when an additional explanatory variable is added.
Answer: B

2) Under imperfect multicollinearity
A) the OLS estimator cannot be computed.
B) two or more of the regressors are highly correlated.
C) the OLS estimator is biased even in samples of n > 100.
D) the error terms are highly, but not perfectly, correlated.
Answer: B

3) When there are omitted variables in the regression, which are determinants of the dependent variable, then
A) you cannot measure the effect of the omitted variable, but the estimator of your included variable(s) is (are) unaffected.
B) this has no effect on the estimator of your included variable because the other variable is not included.
C) this will always bias the OLS estimator of the included variable.
D) the OLS estimator is biased if the omitted variable is correlated with the included variable.
Answer: D

4) Imagine you regressed earnings of individuals on a constant, a binary variable (“Male”) which takes on the value 1 for males and is 0 otherwise, and another binary variable (“Female”) which takes on the value 1 for females and is 0 otherwise. Because females typically earn less than males, you would expect
A) the coefficient for Male to have a positive sign, and for Female a negative sign.
B) both coefficients to be the same distance from the constant, one above and the other below.
C) none of the OLS estimators to exist because there is perfect multicollinearity.
D) this to yield a difference in means statistic.
Answer: C

5) When you have an omitted variable problem, the assumption that E(ui | Xi) = 0 is violated. This implies that
A) the sum of the residuals is no longer zero.
B) there is another estimator called weighted least squares, which is BLUE.
C) the sum of the residuals times any of the explanatory variables is no longer zero.
D) the OLS estimator is no longer consistent.
Answer: D

6) If you had a two regressor regression model, then omitting one variable which is relevant
A) will have no effect on the coefficient of the included variable if the correlation between the excluded and the included variable is negative.
B) will always bias the coefficient of the included variable upwards.
C) can result in a negative value for the coefficient of the included variable, even though the coefficient will have a significant positive effect on Y if the omitted variable were included.
D) makes the sum of the product between the included variable and the residuals different from 0.
Answer: C


7) (Requires Calculus) In the multiple regression model you estimate the effect on Yi of a unit change in one of the Xi while holding all other regressors constant. This
A) makes little sense, because in the real world all other variables change.
B) corresponds to the economic principle of mutatis mutandis.
C) leaves the formula for the coefficient in the single explanatory variable case unaffected.
D) corresponds to taking a partial derivative in mathematics.
Answer: D

8) You have to worry about perfect multicollinearity in the multiple regression model because
A) many economic variables are perfectly correlated.
B) the OLS estimator is no longer BLUE.
C) the OLS estimator cannot be computed in this situation.
D) in real life, economic variables change together all the time.
Answer: C

9) In a two regressor regression model, if you exclude one of the relevant variables then
A) it is no longer reasonable to assume that the errors are homoskedastic.
B) OLS is no longer unbiased, but still consistent.
C) you are no longer controlling for the influence of the other variable.
D) the OLS estimator no longer exists.
Answer: C

10) The intercept in the multiple regression model
A) should be excluded if one explanatory variable has negative values.
B) determines the height of the regression line.
C) should be excluded because the population regression function does not go through the origin.
D) is statistically significant if it is larger than 1.96.
Answer: B

11) In the multiple regression model, the least squares estimator is derived by
A) minimizing the sum of squared prediction mistakes.
B) setting the sum of squared errors equal to zero.
C) minimizing the absolute difference of the residuals.
D) forcing the smallest distance between the actual and fitted values.
Answer: A

12) The sample regression line estimated by OLS
A) has an intercept that is equal to zero.
B) is the same as the population regression line.
C) cannot have negative and positive slopes.
D) is the line that minimizes the sum of squared prediction mistakes.
Answer: D

13) The OLS residuals in the multiple regression model
A) cannot be calculated because there is more than one explanatory variable.
B) can be calculated by subtracting the fitted values from the actual values.
C) are zero because the predicted values are another name for forecasted values.
D) are typically the same as the population regression function errors.
Answer: B


14) Under the least squares assumptions for the multiple regression problem (zero conditional mean for the error term, all Xi and Yi being i.i.d., all Xi and ui having finite fourth moments, no perfect multicollinearity), the OLS estimators for the slopes and intercept A) have an exact normal distribution for n > 25. B) are BLUE. C) have a normal distribution in small samples as long as the errors are homoskedastic. D) are unbiased and consistent. Answer: D 15) The main advantage of using multiple regression analysis over differences in means testing is that the regression technique A) allows you to calculate p-values for the significance of your results. B) provides you with a measure of your goodness of fit. C) gives you quantitative estimates of a unit change in X. D) assumes that the error terms are generated from a normal distribution. Answer: C 16) In a multiple regression framework, the slope coefficient on the regressor X2i A) takes into account the scale of the error term. B) is measured in the units of Yi divided by units of X2i. C) is usually positive. D) is larger than the coefficient on X1i. Answer: B 17) One of the least squares assumptions in the multiple regression model is that you have random variables which are “i.i.d.” This stands for A) initially indeterminate differences. B) irregularly integrated dichotomies. C) identically initiated deltas (as in changes). D) independently and identically distributed. Answer: D 18) Omitted variable bias A) will always be present as long as the regression R2 < 1. B) is always there but is negligible in almost all economic examples. C) exists if the omitted variable is correlated with the included regressor but is not a determinant of the dependent variable. D) exists if the omitted variable is correlated with the included regressor and is a determinant of the dependent variable. 
Answer: D 19) The following OLS assumption is most likely violated by omitted variable bias: A) E(ui | Xi) = 0 B) (Xi, Yi), i = 1,..., n, are i.i.d. draws from their joint distribution C) there are no outliers for Xi, ui D) there is heteroskedasticity Answer: A
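The two conditions in question 18 can be checked numerically. The sketch below uses made-up, noise-free toy data (the numbers and the "true" coefficients are assumptions chosen for illustration, not taken from the text) in which X2 both determines Y and is correlated with X1. The slope from the short regression of Y on X1 alone then equals β1 + β2 × δ, where δ is the slope from regressing X2 on X1.

```python
def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx

# Hypothetical toy data: X2 is correlated with X1 and is a determinant of Y.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
beta0, beta1, beta2 = 1.0, 2.0, 3.0   # assumed "true" coefficients

# Y is generated without an error term so the algebra is easy to see.
y = [beta0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]

short_slope = ols_slope(x1, y)        # regression that omits X2
delta = ols_slope(x1, x2)             # how X2 moves with X1

print("slope when X2 is omitted:", short_slope)
print("beta1 + beta2 * delta:   ", beta1 + beta2 * delta)
```

In this toy data set δ = 1, so the short regression reports a slope of about 5 instead of the assumed β1 = 2. If either β2 or δ were zero, the two numbers would coincide with β1 and there would be no bias.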


20) The population multiple regression model when there are two regressors, X1i and X2i, can be written as follows, with the exception of: A) Yi = β0 + β1 X1i + β2 X2i + ui, i = 1,..., n B) Yi = β0 X0i + β1 X1i + β2 X2i + ui, X0i = 1, i = 1,..., n C) $Y_i = \sum_{j=0}^{2} \beta_j X_{ji} + u_i$, i = 1,..., n D) Yi = β0 + β1 X1i + β2 X2i + ... + βkXki + ui, i = 1,..., n Answer: D

21) In the multiple regression model Yi = β0 + β1 X1i + β2 X2i + ... + βkXki + ui, i = 1,..., n, the OLS estimators are obtained by minimizing the sum of
A) squared mistakes in $\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \dots - b_k X_{ki})^2$
B) squared mistakes in $\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \dots - b_k X_{ki} - u_i)^2$
C) absolute mistakes in $\sum_{i=1}^{n} |Y_i - b_0 - b_1 X_{1i} - \dots - b_k X_{ki}|$
D) squared mistakes in $\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$
Answer: A

22) In the multiple regression model, the SER is given by
A) $\frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i$
B) $\frac{1}{n-k-2} \sum_{i=1}^{n} u_i$
C) $\frac{1}{n-k-2} \sum_{i=1}^{n} \hat{u}_i$
D) $\sqrt{\frac{1}{n-k-1} \sum_{i=1}^{n} \hat{u}_i^2}$
Answer: D

23) In multiple regression, the R2 increases whenever a regressor is A) added unless the coefficient on the added regressor is exactly zero. B) added. C) added unless there is heteroskedasticity. D) greater than 1.96 in absolute value. Answer: A
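As a small illustration of the SER definition in question 22, the following sketch computes the SER from a list of OLS residuals with the n - k - 1 degrees-of-freedom correction, along with R2 = 1 - SSR/TSS. The residuals and the TSS value are invented for the example; the function names are not from the text.

```python
import math

def ser(residuals, k):
    """Standard error of the regression for k regressors plus an intercept."""
    n = len(residuals)
    ssr = sum(u ** 2 for u in residuals)      # sum of squared residuals
    return math.sqrt(ssr / (n - k - 1))       # divide by n - k - 1, not n

def r_squared(ssr, tss):
    """Fraction of the sample variance of Y explained by the regressors."""
    return 1.0 - ssr / tss

# Hypothetical residuals from a regression with k = 2 regressors (n = 6).
u_hat = [1.0, -2.0, 0.5, 1.5, -1.0, 0.0]
ssr = sum(u ** 2 for u in u_hat)              # 8.5

print("SER:", ser(u_hat, k=2))
print("R2 :", r_squared(ssr, tss=34.0))       # TSS of 34.0 is assumed
```

Because adding a regressor can never raise the SSR, the R2 computed this way cannot fall when a regressor is added, which is the point of question 23.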


24) The adjusted R2, or $\bar{R}^2$, is given by
A) $1 - \frac{n-2}{n-k-1} \frac{SSR}{TSS}$
B) $1 - \frac{n-2}{n-k-1} \frac{ESS}{TSS}$
C) $1 - \frac{n-1}{n-k-1} \frac{SSR}{TSS}$
D) $\frac{ESS}{TSS}$

Answer: C 25) Consider the following multiple regression models A) to D) below. DFemme = 1 if the individual is a female, and is zero otherwise; DMale is a binary variable which takes on the value one if the individual is male, and is zero otherwise; DMarried is a binary variable which is unity for married individuals and is zero otherwise, and DSingle is (1-DMarried). Regressing weekly earnings (Earn) on a set of explanatory variables, you will experience perfect multicollinearity in the following cases unless: A) Earni = β0 + β1 DFemme + β2 DMale + β3 X3i B) Earni = β0 + β1 DMarried + β2 DSingle + β3 X3i C) Earni = β0 + β1 DFemme + β3 X3i D) Earni = β1 DFemme + β2 DMale + β3 DMarried + β4 DSingle + β5 X3i Answer: C 26) Consider the multiple regression model with two regressors X1 and X2, where both variables are determinants of the dependent variable. When omitting X2 from the regression, then there will be omitted variable bias for β1 A) if X1 and X2 are correlated B) always C) if X2 is measured in percentages D) if X2 is a dummy variable Answer: A 27) The dummy variable trap is an example of A) imperfect multicollinearity B) something that is of theoretical interest only C) perfect multicollinearity D) something that does not happen to university or college students Answer: C 28) Imperfect multicollinearity A) is not relevant to the field of economics and business administration B) only occurs in the study of finance C) means that the least squares estimator of the slope is biased D) means that two or more of the regressors are highly correlated Answer: D
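The dummy variable trap in questions 25 and 27 can be seen directly in the matrix X'X, which OLS must invert. The sketch below builds X'X for a tiny made-up sample of five individuals (the data and helper names are illustrative assumptions) and shows that its determinant is zero when an intercept, DFemme, and DMale are all included, but not when DMale is dropped.

```python
def gram(columns):
    """Return X'X for a design matrix supplied as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]

def det(m):
    """Determinant by cofactor expansion along the first row (tiny matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, pivot in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * pivot * det(minor)
    return total

# Hypothetical sample: 2 females, 3 males.
d_femme = [1, 0, 1, 0, 0]
d_male = [1 - d for d in d_femme]      # DMale = 1 - DFemme
const = [1] * len(d_femme)             # the intercept regressor

trap = det(gram([const, d_femme, d_male]))
fixed = det(gram([const, d_femme]))

print("intercept + DFemme + DMale:", trap)   # 0, so X'X cannot be inverted
print("intercept + DFemme:        ", fixed)  # 6, so OLS is well defined
```

The zero determinant reflects the exact linear relationship const = DFemme + DMale; dropping one of the three columns removes it.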


29) Consider the multiple regression model with two regressors X1 and X2 , where both variables are determinants of the dependent variable. You first regress Y on X1 only and find no relationship. However when regressing Y on X1 and X2 , the slope coefficient β1 changes by a large amount. This suggests that your first regression suffers from A) heteroskedasticity B) perfect multicollinearity C) omitted variable bias D) dummy variable trap Answer: C 30) Imperfect multicollinearity A) implies that it will be difficult to estimate precisely one or more of the partial effects using the data at hand B) violates one of the four Least Squares assumptions in the multiple regression model C) means that you cannot estimate the effect of at least one of the Xs on Y D) suggests that a standard spreadsheet program does not have enough power to estimate the multiple regression model Answer: A
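Question 30 says imperfect multicollinearity makes the partial effects hard to estimate precisely. A rough way to quantify this, assuming two regressors and homoskedastic errors (both assumptions are mine for this sketch), is that the large-sample variance of the estimator of β1 is proportional to 1/(1 - ρ²), where ρ is the correlation between X1 and X2. The function name below is illustrative.

```python
def variance_inflation(rho):
    """Factor by which the variance of the beta_1 estimator grows when X1 and X2
    have correlation rho (two regressors, homoskedastic errors assumed)."""
    if abs(rho) >= 1:
        raise ValueError("perfect multicollinearity: the factor is undefined")
    return 1.0 / (1.0 - rho ** 2)

for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"rho = {rho:4.2f}  ->  variance multiplied by {variance_inflation(rho):.2f}")
```

The estimator stays unbiased for any |ρ| < 1, which is why imperfect multicollinearity does not violate the least squares assumptions; only the precision deteriorates as ρ approaches one.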

6.2 Essays and Longer Questions

1) Females, on average, are shorter and weigh less than males. One of your friends, who is a pre-med student, tells you that in addition, females will weigh less for a given height. To test this hypothesis, you collect height and weight of 29 female and 81 male students at your university. A regression of the weight on a constant, height, and a binary variable, which takes a value of one for females and is zero otherwise, yields the following result: Studentw = -229.21 – 6.36 × Female + 5.58 × Height, R2 = 0.50, SER = 20.99 where Studentw is weight measured in pounds and Height is measured in inches. (a) Interpret the results. Does it make sense to have a negative intercept? (b) You decide that in order to give an interpretation to the intercept you should rescale the height variable. One possibility is to subtract 5 ft. or 60 inches from your Height, because the minimum height in your data set is 62 inches. The resulting new intercept is now 105.58. Can you interpret this number now? Do you think that the regression R2 has changed? What about the standard error of the regression? (c) You have learned that correlation does not imply causation. Although this is true mathematically, does this always apply? Answer: (a) For every additional inch in height, weight increases by roughly 5.5 pounds. Female students weigh approximately 6.5 pounds less than male students, controlling for height. The regression explains 50 percent of the weight variation among students. It does not make sense to interpret the intercept, since there are no observations close to the origin, or, put differently, there are no individuals who are zero inches tall. (b) There are now observations close to the origin and you can therefore interpret the intercept. A male student who is 5 ft. tall will weigh roughly 105.5 pounds, on average. The two slopes will be unaffected, as will be the regression R2.
Since the explanatory power of the regression is unaffected by rescaling, and the dependent variable and the total sum of squares have remained unchanged, the sum of squared residuals, and hence the SER, must also remain the same. (c) Although true in general, there are cases where Y cannot cause X, as is the case here. Gaining weight is not a good way to become taller, or put differently, weighing 250 pounds will not make students over 7 ft. tall.
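The rescaling argument in part (b) of question 1 can be verified with the reported coefficients. Subtracting 60 from Height moves 60 × 5.58 into the intercept and leaves every fitted value, and therefore every residual, unchanged. With the rounded coefficients printed above, the new intercept comes out at about 105.59 rather than the reported 105.58; the small gap is presumably rounding in the published estimates. The function names are illustrative.

```python
# Reported estimates from the weight regression (pounds, inches).
B0, B_FEMALE, B_HEIGHT = -229.21, -6.36, 5.58
SHIFT = 60  # inches subtracted from Height

def predict(height, female):
    """Fitted weight from the original regression."""
    return B0 + B_FEMALE * female + B_HEIGHT * height

# Intercept implied by regressing on (Height - 60) instead of Height.
new_intercept = B0 + B_HEIGHT * SHIFT

def predict_rescaled(height, female):
    """Fitted weight from the regression on the rescaled height variable."""
    return new_intercept + B_FEMALE * female + B_HEIGHT * (height - SHIFT)

print("new intercept:", round(new_intercept, 2))
for height, female in [(62, 1), (70, 0), (75, 0)]:
    print(height, female,
          round(predict(height, female), 2),
          round(predict_rescaled(height, female), 2))
```

Since both specifications give identical fitted values, the SSR, R2, and SER are the same in the two regressions.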


2) The cost of attending your college has once again gone up. Although you have been told that education is investment in human capital, which carries a return of roughly 10% a year, you (and your parents) are not pleased. One of the administrators at your university/college does not make the situation better by telling you that you pay more because the reputation of your institution is better than that of others. To investigate this hypothesis, you collect data randomly for 100 national universities and liberal arts colleges from the 2000-2001 U.S. News and World Report annual rankings. Next you perform the following regression Cost = 7,311.17 + 3,985.20 × Reputation – 0.20 × Size + 8,406.79 × Dpriv – 416.38 × Dlibart – 2,376.51 × Dreligion, R2 = 0.72, SER = 3,773.35 where Cost is Tuition, Fees, Room and Board in dollars, Reputation is the index used in U.S. News and World Report (based on a survey of university presidents and chief academic officers), which ranges from 1 (“marginal”) to 5 (“distinguished”), Size is the number of undergraduate students, and Dpriv, Dlibart, and Dreligion are binary variables indicating whether the institution is private, a liberal arts college, and has a religious affiliation. (a) Interpret the results. Do the coefficients have the expected sign? (b) What is the forecasted cost for a liberal arts college, which has no religious affiliation, a size of 1,500 students and a reputation level of 4.5? (All liberal arts colleges are private.) (c) To save money, you are willing to switch from a private university to a public university, which has a ranking of 0.5 less and 10,000 more students. What is the effect on your cost? Is it substantial? (d) Eliminating the Size and Dlibart variables from your regression, the estimated regression becomes Cost = 5,450.35 + 3,538.84 × Reputation + 10,935.70 × Dpriv – 2,783.31 × Dreligion; R2 = 0.72, SER = 3,792.68 Why do you think that the effect of attending a private institution has increased now?
(e) What can you say about causation in the above relationship? Is it possible that Cost affects Reputation rather than the other way around? Answer: (a) An increase in reputation by one category increases the cost by roughly $3,985. The larger the size of the college/university, the lower the cost. An increase of 10,000 students results in a $2,000 lower cost. Private schools charge roughly $8,407 more than public schools. A school with a religious affiliation is approximately $2,377 cheaper, presumably due to subsidies, and a liberal arts college also charges roughly $416 less. There are no observations close to the origin, so there is no direct interpretation of the intercept. Other than perhaps the coefficient on liberal arts colleges, all coefficients have the expected sign. (b) $32,935. (c) Roughly $12,400 less per year. Since over the four years of education this implies approximately $50,000, it is a substantial amount of money for the average household. (d) Private institutions are smaller, on average, and some of these are liberal arts colleges. Both of these variables had negative coefficients. (e) It is very possible that the university president and chief academic officer are influenced by the cost variable in answering the U.S. News and World Report survey. If this were the case, then the above equation suffers from simultaneous causality bias, a topic that will be covered in a later chapter. Such bias poses a serious threat to the internal validity of the study.
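The numbers in parts (b) and (c) of question 2 follow from plugging values into the estimated equation. The sketch below does that arithmetic; the function name is illustrative, and the baseline reputation and size used for part (c) are arbitrary placeholders, since only the differences between the two schools matter.

```python
def predicted_cost(reputation, size, private, libart, religion):
    """Fitted cost in dollars from the reported regression."""
    return (7311.17 + 3985.20 * reputation - 0.20 * size
            + 8406.79 * private - 416.38 * libart - 2376.51 * religion)

# (b) Private liberal arts college, no religious affiliation,
#     1,500 students, reputation 4.5.
part_b = predicted_cost(4.5, 1500, private=1, libart=1, religion=0)

# (c) Switch from a private to a public university with a reputation
#     0.5 lower and 10,000 more students. Baseline values are hypothetical.
base = predicted_cost(4.0, 5000, private=1, libart=0, religion=0)
switch = predicted_cost(3.5, 15000, private=0, libart=0, religion=0)
part_c = switch - base

print("(b) forecast cost:", round(part_b, 2))
print("(c) change in cost:", round(part_c, 2))
```

Part (c) decomposes into -8,406.79 for leaving the private sector, -1,992.60 for the lower reputation, and -2,000 for the larger size, which is where the figure of roughly $12,400 comes from.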


3) In the multiple regression model with two explanatory variables

Yi = β0 + β1 X1i + β2 X2i + ui

the OLS estimators for the three parameters are as follows (small letters refer to deviations from means, as in $z_i = Z_i - \bar{Z}$):

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}_1 - \hat{\beta}_2 \bar{X}_2$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left(\sum_{i=1}^{n} x_{1i} x_{2i}\right)^2}$$

$$\hat{\beta}_2 = \frac{\sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i}^2 - \sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left(\sum_{i=1}^{n} x_{1i} x_{2i}\right)^2}$$

You have collected data for 104 countries of the world from the Penn World Tables and want to estimate the effect of the population growth rate (X1i) and the saving rate (X2i) (average investment share of GDP from 1980 to 1990) on GDP per worker (relative to the U.S.) in 1990. The various sums needed to calculate the OLS estimates are given below:

$$\sum_{i=1}^{n} Y_i = 33.33; \quad \sum_{i=1}^{n} X_{1i} = 2.025; \quad \sum_{i=1}^{n} X_{2i} = 17.313$$

$$\sum_{i=1}^{n} y_i^2 = 8.3103; \quad \sum_{i=1}^{n} x_{1i}^2 = 0.0122; \quad \sum_{i=1}^{n} x_{2i}^2 = 0.6422$$

$$\sum_{i=1}^{n} y_i x_{1i} = -0.2304; \quad \sum_{i=1}^{n} y_i x_{2i} = 1.5676; \quad \sum_{i=1}^{n} x_{1i} x_{2i} = -0.0520$$

(a) What are your expected signs for the regression coefficients? Calculate the coefficients and see if their signs correspond to your intuition. (b) Find the regression R2, and interpret it. What other factors can you think of that might have an influence on productivity?

Answer: (a) You expect $\beta_1 < 0$ and $\beta_2 > 0$ with no prior expectation on the intercept. Substituting the above numbers into the equations for the regression coefficients results in $\hat{\beta}_1 = -12.95$, $\hat{\beta}_2 = 1.39$, and $\hat{\beta}_0 = 0.34$. (b) $R^2 = \frac{\hat{\beta}_1 \sum_{i=1}^{n} y_i x_{1i} + \hat{\beta}_2 \sum_{i=1}^{n} y_i x_{2i}}{\sum_{i=1}^{n} y_i^2} = 0.62$. 62 percent of the variation in relative productivity is explained by the regression. There is a vast literature on the subject and students’ answers will obviously vary. Some may focus on additional economic variables such as the initial level of productivity and the inflation rate during the sample period. Others may emphasize institutional variables such as whether or
not the country was democratic over the sample period, or had political stability, etc.

4) A subsample from the Current Population Survey is taken, on weekly earnings of individuals, their age, and their gender. You have read in the news that women make 70 cents to the $1 that men earn. To test this hypothesis, you first regress earnings on a constant and a binary variable, which takes on a value of 1 for females and is 0 otherwise. The results were: Earn = 570.70 – 170.72 × Female, R2 = 0.084, SER = 282.12. (a) There are 850 females in your sample and 894 males. What are the mean earnings of males and females in this sample? What is the percentage of average female income to male income? (b) You decide to control for age (in years) in your regression results because older people, up to a point, earn more on average than younger people. This regression output is as follows: Earn = 323.70 – 169.78 × Female + 5.15 × Age, R2 = 0.135, SER = 274.45. Interpret these results carefully. How much, on average, does a 40-year-old female make per week in your sample? What about a 20-year-old male? Does this represent stronger evidence of discrimination against females? Answer: (a) Males earn $570.70, females $399.98. Percentage of average female income to male income is 70.1% in the sample. (b) As individuals become one year older, they earn $5.15 more per week, on average. Females earn significantly less money on average and for a given age. 13.5 percent of the earnings variation is explained by the regression. A 40-year-old female earns $359.92, while a 20-year-old male makes $426.70. There is somewhat more evidence here, since age has been added as a regressor. However, many attributes, which could potentially explain this difference, are still omitted.

5) You have collected data from Major League Baseball (MLB) to find the determinants of winning. You have a general idea that both good pitching and strong hitting are needed to do well.
However, you do not know how much each of these contributes separately. To investigate this problem, you collect data for all MLB teams during the 1999 season. Your strategy is to first regress the winning percentage on pitching quality (“Team ERA”), second to regress the same variable on some measure of hitting (“OPS – On-base Plus Slugging percentage”), and third to regress the winning percentage on both.

Summary of the Distribution of Winning Percentage, On Base plus Slugging Percentage, and Team Earned Run Average for MLB in 1999

                                         Percentile
                    Average  Std. dev.   10%     25%     40%     50% (median)   60%     75%     90%
Team ERA            4.71     0.53        3.84    4.35    4.72    4.78           4.91    5.06    5.25
OPS                 0.778    0.034       0.720   0.754   0.769   0.780          0.790   0.798   0.820
Winning Percentage  0.50     0.08        0.40    0.43    0.46    0.48           0.49    0.59    0.60

The results are as follows:
Winpct = 0.94 – 0.100 × teamera, R2 = 0.49, SER = 0.06.
Winpct = -0.68 + 1.513 × ops, R2 = 0.45, SER = 0.06.
Winpct = -0.19 – 0.099 × teamera + 1.490 × ops, R2 = 0.92, SER = 0.02.

(a) Interpret the multiple regression. What is the effect of a one point increase in team ERA? Given that the Atlanta Braves had the most wins that year, winning 103 games out of 162, do you find this effect important? Next analyze the importance and statistical significance for the OPS coefficient. (The Minnesota Twins had the minimum OPS of 0.712, while the Texas Rangers had the maximum with 0.840.) Since the intercept is negative, and since winning percentages must lie between zero and one, should you rerun the regression through the origin? (b) What are some of the omitted variables in your analysis? Are they likely to affect the coefficient on Team ERA and OPS given the size of the R2 and their potential correlation with the included variables? Answer: (a) A single point increase in team ERA lowers the winning percentage by approximately 10 percentage points. A 0.1 increase in OPS results in an increase of roughly 15 percentage points. Given that there are no observations close to the origin, you should not interpret the intercept. The multiple regression explains 92 percent of the variation in winning percentage. The Atlanta Braves only won 63.6 percent of their games. Given that this represents the best record during that season, a 10 percentage point drop is important. Although the intercept cannot be interpreted, it anchors the regression at a certain level and should therefore not be omitted. (b) The quality of the management and coaching comes to mind, although both may be reflected in the performance statistics, as are salaries. There are other aspects of baseball performance that are missing, such as the fielding percentage of the team.

6) In the process of collecting weight and height data from 29 female and 81 male students at your university, you also asked the students for the number of siblings they have.
Although it was not quite clear to you initially what you would use that variable for, you construct a new theory that suggests that children who have more siblings come from poorer families and will have to share the food on the table. Although a friend tells you that this theory does not pass the “straight-face” test, you decide to hypothesize that peers with many siblings will weigh less, on average, for a given height. In addition, you believe that the muscle/fat tissue composition of male bodies suggests that females will weigh less, on average, for a given height. To test these theories, you perform the following regression: Studentw = -229.92 – 6.52 × Female + 0.51 × Sibs+ 5.58 × Height, R2 =0.50, SER = 21.08 where Studentw is in pounds, Height is in inches, Female takes a value of 1 for females and is 0 otherwise, Sibs is the number of siblings. Interpret the regression results. Answer: For every additional inch in height, students weigh, on average, roughly 5.5 pounds more. For a given height and number of siblings, female students weigh approximately 6.5 pounds less. For every additional sibling, the weight of students increases by half a pound. Since there are no observations close to the origin, you cannot interpret the intercept. The regression explains half of the variation in student weight.
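The coefficient and R2 values quoted in the answer to question 3 above can be reproduced from the sums listed in that question. The sketch below plugs them into the two-regressor OLS formulas; the variable names are my own shorthand (for example, s_y1 stands for the sum of yi × x1i in deviations from means).

```python
n = 104                                   # number of countries

# Sums of the raw variables (used only for the intercept).
sum_y, sum_x1, sum_x2 = 33.33, 2.025, 17.313

# Sums of squares and cross products in deviations from means.
s_yy, s_11, s_22 = 8.3103, 0.0122, 0.6422
s_y1, s_y2, s_12 = -0.2304, 1.5676, -0.0520

den = s_11 * s_22 - s_12 ** 2             # common denominator
b1 = (s_y1 * s_22 - s_y2 * s_12) / den    # population growth rate
b2 = (s_y2 * s_11 - s_y1 * s_12) / den    # saving rate
b0 = sum_y / n - b1 * sum_x1 / n - b2 * sum_x2 / n

# R2 = ESS / TSS, with ESS written in terms of the slope estimates.
r2 = (b1 * s_y1 + b2 * s_y2) / s_yy

print(round(b1, 2), round(b2, 2), round(b0, 2), round(r2, 2))
# prints: -12.95 1.39 0.34 0.62
```

These match the values in the answer key, and they are close to the regression reported in question 7 (0.339, -12.894, 1.397, R2 = 0.621), which uses the same variables.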


7) You have collected data for 104 countries to address the difficult questions of the determinants for differences in the standard of living among the countries of the world. You recall from your macroeconomics lectures that the neoclassical growth model suggests that output per worker (per capita income) levels are determined by, among others, the saving rate and population growth rate. To test the predictions of this growth model, you run the following regression: RelPersInc = 0.339 – 12.894 × n + 1.397 × SK, R2 = 0.621, SER = 0.177 where RelPersInc is GDP per worker relative to the United States, n is the average population growth rate, 1980-1990, and SK is the average investment share of GDP from 1960 to 1990 (remember investment equals saving). (a) Interpret the results. Do the signs correspond to what you expected them to be? Explain. (b) You remember that human capital in addition to physical capital also plays a role in determining the standard of living of a country. You therefore collect additional data on the average educational attainment in years for 1985, and add this variable (Educ) to the above regression. This results in the modified regression output: RelPersInc = 0.046 – 5.869 × n + 0.738 × SK + 0.055 × Educ, R2 = 0.775, SER = 0.1377 How has the inclusion of Educ affected your previous results? (c) Upon checking the regression output, you realize that there are only 86 observations, since data for Educ is not available for all 104 countries in your sample. Do you have to modify some of your statements in (b)? (d) Brazil has the following values in your sample: RelPersInc = 0.30, n = 0.021, SK = 0.169, Educ = 3.5. Does your equation overpredict or underpredict the relative GDP per worker? What would happen to this result if Brazil managed to double the average educational attainment? Answer: (a) The Solow growth model predicts higher productivity with higher saving rates and lower population growth.
The signs therefore correspond to prior expectations. A 10 percentage point increase in the saving rate results in a roughly 14 percentage point increase in per capita income relative to the United States. Lowering the population growth rate by 1 percentage point results in a 13 percentage point higher per capita income relative to the United States. It is best not to interpret the intercept. The regression explains approximately 62 percent of the variation in per capita income among the 104 countries of the world. (b) The coefficient on the population growth rate is roughly half of what it was originally, and the coefficient on the saving rate has also roughly halved (from 1.397 to 0.738). The regression R2 has increased significantly. (c) When comparing results, you should ensure that the sample is identical, since comparisons are not valid otherwise. (d) The predicted value for Brazil is 0.240. Hence the regression underpredicts Brazil’s per capita income. Increasing Educ to 7.0 would result in a predicted per capita income of 0.43, which is a substantial increase from both its current actual position and the previously predicted value.

8) Attendance at sports events depends on various factors. Teams typically do not change ticket prices from game to game to attract more spectators to less attractive games. However, there are other marketing tools used, such as fireworks, free hats, etc., for this purpose. You work as a consultant for a sports team, the Los Angeles Dodgers, to help them forecast attendance, so that they can potentially devise strategies for price discrimination. After collecting data over two years for every one of the 162 home games of the 2000 and 2001 seasons, you run the following regression: Attend = 15,005 + 201 × Temperat + 465 × DodgNetWin + 82 × OppNetWin + 9647 × DFSaSu + 1328 × Drain + 1609 × D150m + 271 × DDiv – 978 × D2001; R2 = 0.416, SER = 6983
where Attend is announced stadium attendance, Temperat is the average temperature on game day, DodgNetWin are the net wins of the Dodgers before the game (wins - losses), OppNetWin is the opposing team’s net wins at the end of the previous season, and DFSaSu, Drain, D150m, DDiv, and D2001 are binary variables, taking a value of 1 if the game was played on a weekend, it rained during that day, the opposing team was within a 150 mile radius, the opposing team plays in the same division as the Dodgers, and the game was played during 2001, respectively. (a) Interpret the regression results. Do the coefficients have the expected signs? (b) Excluding the last four binary variables results in the following regression result: Attend = 14,838 + 202 × Temperat + 435 × DodgNetWin + 90 × OppNetWin + 10,472 × DFSaSu, R2 = 0.410, SER = 6925 According to this regression, what is your forecast of the change in attendance if the temperature increases by 30 degrees? Is it likely that people attend more games if the temperature increases? Is it possible that Temperat picks up the effect of an omitted variable? (c) Assuming that ticket sales depend on prices, what would your policy advice be for the Dodgers to increase attendance? (d) Dodger Stadium is large and is not often sold out. The Boston Red Sox play in a much smaller stadium, Fenway Park, which often reaches capacity. If you did the same analysis for the Red Sox, what problems would you foresee in your analysis? Answer: (a) A 10 degree warmer temperature increases attendance by roughly 2,000. A 10 game net increase in wins results in approximately 4,600 more spectators. If the opponents’ net win is 10 games higher when compared to another team, then roughly 800 more people attend. Weekend games attract almost 10,000 more people on average. Rain during the day of the game brings out close to 1,300 more fans.
A team from closer by, such as the Angels or the Diamondbacks, attract a bit more than 1,600 more people, and a team from the same division results in close to 270 more fans in the stadium. On average, there were approximately 1,000 fewer spectators per game in 2001 than in 2000, holding all other factors constant. With the exception of the rain variable, the signs correspond to prior expectation. The regression explains 41.6 percent of the variation in Dodger attendance. (b) For an increase in 30 degrees, there will be roughly 6,000 more people in attendance. Although people prefer 75 degrees over 45 degrees, it is unlikely that they prefer 105 degrees over 75 degrees. Temperature rises during the baseball season in Los Angeles. There are typically fewer people in attendance during the earlier parts of the season than during the latter parts. Binary variables for the month of the year would pick up such an effect. (c) The only variable that management has limited control over is the performance of the team. The policy advice would therefore be to assure a superior team performance, which, in turn, increases attendance. (Stating the obvious is not going to keep the consultant on the payroll much longer.) (d) If there was a serious capacity constraint, then estimating the equation in the above way would not yield sensible results. Imagine that Fenway Park was basically sold out and the Red Sox would now improve their net wins. Since you would not observe an increase in the dependent variable, the coefficient for net wins would necessarily have to be zero. 9) The administration of your university/college is thinking about implementing a policy of coed floors only in dormitories. Currently there are only single gender floors. One reason behind such a policy might be to generate an atmosphere of better “understanding” between the sexes. 
The Dean of Students (DoS) has decided to investigate if such a behavior results in more “togetherness” by attempting to find the determinants of the gender composition at the dinner table in your main dining hall, and in that of a neighboring university, which only allows for coed floors in their dorms. The survey includes 176 students, 63 from your university/college, and 113 from a neighboring institution. (a) The Dean’s first problem is how to define gender composition. To begin with, the survey excludes single persons’ tables, since the study is to focus on group behavior. The Dean also eliminates sports teams from the analysis, since a large number of single-gender students will sit at the same table. Finally, the Dean decides to only analyze tables with three or more students, since she worries about “couples” distorting the results. The Dean finally settles for the following specification of the dependent variable: GenderComp = |50% - % of Male Students at Table|, where “|Z|” stands for the absolute value of Z. The variable can take on values from zero to fifty. Briefly analyze some of the possible values. What are the implications for gender composition as more female students join a given number of males at the table? Why would you choose the absolute value here? Discuss some other possible specifications for the dependent variable. (b) After considering various explanatory variables, the Dean settles for an initial list of eight, and estimates the following relationship: GenderComp = 30.90 – 3.78 × Size – 8.81 × DCoed + 2.28 × DFemme + 2.06 × DRoommate – 0.17 × DAthlete + 1.49 × DCons – 0.81 × SAT + 1.74 × SibOther, R2 = 0.24, SER = 15.50 where Size is the number of persons at the table minus 3, DCoed is a binary variable, which takes on the value of 1 if you live on a coed floor, DFemme is a binary variable, which is 1 for females and zero otherwise, DRoommate is a binary variable which equals 1 if the person at the table has a roommate and is zero otherwise, DAthlete is a binary variable which is 1 if the person at the table is a member of an athletic varsity team, DCons is a variable which measures the political tendency of the person at the table on a seven-point scale, ranging from 1 being “liberal” to 7 being “conservative,” SAT is the SAT score of the person at the table measured on a seven-point scale, ranging from 1 for the category “900-1000” to 7 for the category “1510 and above,” and increasing by one for 100 point increases, and SibOther is the number of siblings from the opposite gender in the family the person at the table grew up with. Interpret the above equation carefully, justifying the inclusion of the explanatory variables along the way. Does it make sense to interpret the constant in the above regression?
(c) Had the Dean used the number of people sitting at the table instead of Number-3, what effect would that have had on the above specification? (d) If you believe that going down the hallway and knocking on doors is one of the major determinants of who goes to eat with whom, then why would it not be a good idea to survey students at lunch tables? Answer: (a) 3 females, 0 males: 50; 0 females, 3 males: 50; 2 females, 2 males: 0; 1 female, 3 males: 25; 4 females, 3 males: 7.143. For a given number of males, say 3, the gender composition will first decrease as the number of females increases from 0 to 3. After that, the gender composition will increase again. You need to choose the absolute value because having many individuals from one gender relative to the other is equally bad for a balanced gender composition. Another possibility would be to use the squared difference. (b) The larger the size at the table, the more balanced the gender composition. Consider a table of 6, where you find two more males than females (4 males, 2 females, gender composition = 16.7) versus a table of 14, where you have two more males than females (gender composition = 7.1). Obviously, if males and females increased in the same proportion, then gender composition would not change. This has not happened here. Students from a coed floor are more likely to sit at a more balanced table in terms of gender composition. This is likely to happen if students knock on neighbors’ doors to see who is willing to join them for lunch. Females are less likely to sit at gender balanced tables, and there is no prior on the coefficient of this variable. Having a roommate increases the likelihood of gender imbalance. Roommates are from the same gender, and joining the roommate for a meal results in a more imbalanced gender composition. Being a member of a varsity team decreases the gender imbalance. Recall that sports teams sitting together are excluded from the sample.
Although there is no strong prior here, the result suggests that varsity team members have more friends, on average, from the other sex than does the general student body. Having a more conservative view, holding other factors constant, results in sitting at meals with more people from the same sex. More intelligent students, or at least those with a higher SAT score, sit more frequently with students from the other sex. Having had more siblings from the other gender at home results in a more imbalanced gender composition: the female student who had four brothers when she grew up has had enough of this sort of experience (although, given the
specification of the dependent variable, it is also possible that she continues to sit with four males). There are no observations close to the origin, so it is best not to interpret the intercept. 24 percent of the variation in gender composition is explained by the regression. (c) The only change would be in the intercept. (d) Many students attend lectures before lunch, and may ask some of the students attending the same lecture to join them for lunch.

10) The Solow growth model suggests that countries with identical saving rates and population growth rates should converge to the same per capita income level. This result has been extended to include investment in human capital (education) as well as investment in physical capital. This hypothesis is referred to as the “conditional convergence hypothesis,” since the convergence is dependent on countries obtaining the same values in the driving variables. To test the hypothesis, you collect data from the Penn World Tables on the average annual growth rate of GDP per worker (g6090) for the 1960-1990 sample period, and regress it on the (i) initial starting level of GDP per worker relative to the United States in 1960 (RelProd60), (ii) average population growth rate of the country (n), (iii) average investment share of GDP from 1960 to 1990 (SK; remember investment equals savings), and (iv) educational attainment in years for 1985 (Educ). The results for close to 100 countries are as follows: g6090 = 0.004 – 0.172 × n + 0.133 × SK + 0.002 × Educ – 0.044 × RelProd60, R2 = 0.537, SER = 0.011 (a) Interpret the results. Do the coefficients have the expected signs? Why does a negative coefficient on the initial level of per capita income indicate conditional convergence (“beta-convergence”)? (b) Equations of the above type have been labeled “determinants of growth” equations in the literature.
You recall from your intermediate macroeconomics course that growth in the Solow growth model is determined by technological progress. Yet the above equation does not contain technological progress. Is that inconsistent? Answer: (a) All slope coefficients have the expected sign given the economic theory behind the equation. The negative coefficient implies that countries which were further behind grew relatively faster, or, put differently, countries which had a higher relative per capita income in 1960 grew relatively slower. (b) The equation only determines growth relative to a given starting point, namely per capita income in 1960. Compare this to runners placed on a track where the starting blocks are at various points of the first 100 m. Let the race last for perhaps 10 seconds and let the runners stop at that point on the track. In essence, you measure where the runners ended up given their starting point, or you can also measure how far they ran given their starting point. In many ways, the above equation is therefore meant to predict the per capita income level in 1990 rather than the growth.


11) You have collected a sub-sample from the Current Population Survey for the western region of the United States. Running a regression of average hourly earnings (ahe) on an intercept only, you get the following result: $\widehat{ahe} = \hat\beta_0 = 18.58$

a. Interpret the result.

b. You decide to include a single explanatory variable without an intercept. The binary variable DFemme takes on a value of “1” for females but is “0” otherwise. The regression result changes as follows: $\widehat{ahe} = \hat\beta_1 \times DFemme = 16.50 \times DFemme$. What is the interpretation now?

c. You generate a new binary variable DMale by subtracting DFemme from 1, and run the new regression: $\widehat{ahe} = \hat\beta_2 \times DMale = 20.09 \times DMale$. What is the interpretation of the coefficient now?

d. After thinking about the above results, you recognize that you could have generated the last two results either by running a regression on both binary variables, or on an intercept and one of the binary variables. What would the results have been?

Answer: a. The mean average hourly earnings for the sample is $18.58. b. The mean average hourly earnings for females is $16.50 in this sample. c. The mean average hourly earnings for males is $20.09 in this sample. d. $\widehat{ahe} = \hat\beta_1 \times DFemme + \hat\beta_2 \times DMale = 16.50 \times DFemme + 20.09 \times DMale$, or $\widehat{ahe} = \hat\beta_0 + \hat\beta_1 \times DFemme = 20.09 - 3.59 \times DFemme$
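The answer to part d can be checked numerically. The following is a minimal sketch using NumPy and a tiny made-up sample (the earnings figures and the helper `ols` are hypothetical, not the CPS data): regressing on both binary variables returns the two group means, while regressing on an intercept and DFemme returns the male mean and the female-minus-male difference.

```python
import numpy as np

# Hypothetical hourly earnings; the first three observations are female.
ahe = np.array([15.0, 17.0, 17.5, 19.0, 21.0, 20.0])
d_femme = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
d_male = 1.0 - d_femme


def ols(y, *columns):
    """Return OLS coefficients of y on the given columns (no intercept is added)."""
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


ones = np.ones_like(ahe)
both_dummies = ols(ahe, d_femme, d_male)       # [female mean, male mean]
intercept_and_femme = ols(ahe, ones, d_femme)  # [male mean, female minus male]

print(both_dummies)
print(intercept_and_femme)
```

In this made-up sample the female mean is 16.5 and the male mean is 20.0, so the second regression reports 20.0 and -3.5, mirroring the pattern 20.09 and -3.59 in the answer.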


6.3 Mathematical and Graphical Problems 1) Your econometrics textbook stated that there will be omitted variable bias in the OLS estimator unless the included regressor, X, is uncorrelated with the omitted variable or the omitted variable is not a determinant of the dependent variable, Y. Give an intuitive explanation for these two conditions. Answer: The regression coefficient is the partial derivative of Y with respect to the corresponding X. The meaning of the partial derivative is the effect of a change in X on Y, holding all the other variables constant. This is identical to a controlled laboratory experiment where only one variable is changed at a time, while all the other variables are held constant. In real life, of course, you cannot change one variable and keep all others, including the omitted variables, constant. Now consider the case of X changing. If it is correlated with the omitted variable and if that variable is a determinant of Y, then Y will change further as a result of X changing. This will cause the “controlled experiment” measure to over or understate the effect that X has on Y, depending on the relationship between X and the omitted variable. If X is not correlated with the omitted variable, then changing X will not have this further indirect effect on Y, so that the pure relationship between X and Y can be measured because it is “as if” the omitted variable were held constant. This has important practical implications if data is hard to obtain for an omitted variable while it can be argued that the variable of interest is not much correlated with the omitted variable. Y will change when a relevant omitted variable will change, and hence the pure effect of X on Y cannot be observed. In the laboratory, Y would change for reasons unrelated to the change in X. However, if the omitted variable is not a determinant of Y, then a change in it will have no effect on the pure relationship between X and Y. 
Consider the accompanying graph of the determinants of Y, where X is the included variable and Z the omitted variable.

Then the effect of X on Y can be measured properly as long as the arrow from Z to Y does not exist, or as long as changes in X do not cause changes in Z, which in return influence Y. 2) You have obtained data on test scores and student-teacher ratios in region A and region B of your state. Region B, on average, has lower student-teacher ratios than region A. You decide to run the following regression Yi = β0 + β1 X1i + β2 X2i + β3 X3i + ui where X1 is the class size in region A, X2 is the difference in class size between region A and B, and X3 is the class size in region B. Your regression package shows a message indicating that it cannot estimate the above equation. What is the problem here and how can it be fixed? Answer: There is perfect multicollinearity present since one of the three explanatory variables can always be expressed linearly in terms of the other two. Hence there are not really three pieces of independent information contained in the three explanatory variables. Dropping one of the three will solve the problem.
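The multicollinearity in question 2 can be illustrated with a rank check. This is a sketch under assumed numbers (the class sizes are invented for illustration): because X2 = X1 - X3 by construction, the design matrix with all three regressors has fewer independent columns than it has columns, and dropping one regressor restores full column rank.

```python
import numpy as np

# Hypothetical class sizes for five paired observations.
x1 = np.array([22.0, 25.0, 19.0, 24.0, 21.0])  # class size, region A
x3 = np.array([18.0, 20.0, 17.0, 23.0, 16.0])  # class size, region B
x2 = x1 - x3                                   # difference, an exact linear combination

ones = np.ones_like(x1)
full_design = np.column_stack([ones, x1, x2, x3])  # 4 columns
reduced_design = np.column_stack([ones, x1, x3])   # X2 dropped, 3 columns

rank_full = np.linalg.matrix_rank(full_design)
rank_reduced = np.linalg.matrix_rank(reduced_design)

print(rank_full, full_design.shape[1])
print(rank_reduced, reduced_design.shape[1])
```

A rank below the number of columns means that X'X cannot be inverted, which is why the regression package refuses to estimate the equation.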


3) In the case of perfect multicollinearity, OLS is unable to calculate the coefficients for the explanatory variables, because it is impossible to change one variable while holding all other variables constant. To see why this is the case, consider the coefficient for the first explanatory variable in the case of a multiple regression model with two explanatory variables:

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2} \]

(small letters refer to deviations from means as in $z_i = Z_i - \bar{Z}$). Divide each of the four terms by $\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2$ to derive an expression in terms of regression coefficients from the simple (one explanatory variable) regression model. In case of perfect multicollinearity, what would be $R^2$ from the regression of $X_{1i}$ on $X_{2i}$? As a result, what would be the value of the denominator in the above expression for $\hat\beta_1$?

Answer:

\[ \hat\beta_1 = \frac{\dfrac{\sum_{i=1}^{n} y_i x_{1i}}{\sum_{i=1}^{n} x_{1i}^2} - \dfrac{\sum_{i=1}^{n} y_i x_{2i}}{\sum_{i=1}^{n} x_{2i}^2} \, \dfrac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2}}{1 - \dfrac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2} \, \dfrac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{2i}^2}} = \frac{\hat\beta_{y x_1} - \hat\beta_{y x_2} \hat\beta_{x_2 x_1}}{1 - \hat\beta_{x_2 x_1} \hat\beta_{x_1 x_2}}, \]

where $\hat\beta_{ab}$ denotes the slope from the simple regression of $a$ on $b$. For the simple regression case,

\[ R^2 = \frac{\hat\beta_1 \sum_{i=1}^{n} y_i x_i}{\sum_{i=1}^{n} y_i^2}, \]

so that the slope of a simple regression of Y on X is the inverse of the slope of a regression of X on Y if the regression $R^2 = 1$. But in the case of perfect multicollinearity, the regression $R^2 = 1$, so that in the expression we get

\[ \hat\beta_1 = \frac{\hat\beta_{y x_1} - \hat\beta_{y x_2} \hat\beta_{x_2 x_1}}{1 - \hat\beta_{x_2 x_1} \dfrac{1}{\hat\beta_{x_2 x_1}}} = \frac{\hat\beta_{y x_1} - \hat\beta_{y x_2} \hat\beta_{x_2 x_1}}{0}, \]

which is not defined. The denominator would be zero in the case of perfect multicollinearity.


4) You try to establish that there is a positive relationship between the use of a fertilizer and the growth of a certain plant. Set up the design of an experiment to establish the relationship, paying particular attention to relevant control variables. Discuss in this context the effect of omitted variable bias. Answer: The answer should follow the randomized controlled experiment described in section 1.2 of the textbook: there should be several plots where the plant is placed, each receiving identical treatment. In this context, the same amount of water and sunshine should be available to each plant, and the soil should have the identical quality. Then some of the plots, determined randomly, should receive varying amounts of the fertilizer. The average yield can then be regressed on the amount of fertilizer received. The experiment could also allow for different amounts of sunshine and water, as long as these are recorded meticulously. In this case, failing to record the amount of sunshine received and therefore not including this variable in the regression would result in omitted variable bias. The effect of the fertilizer on yield would then be estimated incorrectly, since plants which receive more fertilizer but are always in the shade would produce a lower yield.

5) In the multiple regression model with two regressors, the formula for the slope of the first explanatory variable is

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2} \]

(small letters refer to deviations from means as in $z_i = Z_i - \bar{Z}$). An alternative way to derive the OLS estimator is given through the following three step procedure. Step 1: regress Y on a constant and X2, and calculate the residual (Res1). Step 2: regress X1 on a constant and X2, and calculate the residual (Res2). Step 3: regress Res1 on a constant and Res2. Prove that the slope of the regression in Step 3 is identical to the above formula.

Answer: Step 1:

\[ y_i = \hat\gamma_{y x_2} x_{2i} + v_i; \quad \hat\gamma_{y x_2} = \frac{\sum_{i=1}^{n} y_i x_{2i}}{\sum_{i=1}^{n} x_{2i}^2}, \quad \text{and} \quad v_i = y_i - \frac{\sum_{i=1}^{n} y_i x_{2i}}{\sum_{i=1}^{n} x_{2i}^2} \, x_{2i}. \]

Step 2:

\[ x_{1i} = \hat\gamma_{x_1 x_2} x_{2i} + w_i; \quad \hat\gamma_{x_1 x_2} = \frac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{2i}^2}, \quad \text{and} \quad w_i = x_{1i} - \frac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{2i}^2} \, x_{2i}. \]

Step 3:

\[ v_i = \hat\alpha w_i; \quad \hat\alpha = \frac{\sum_{i=1}^{n} \left[ \left( y_i - \dfrac{\sum_{i=1}^{n} y_i x_{2i}}{\sum_{i=1}^{n} x_{2i}^2} \, x_{2i} \right) \left( x_{1i} - \dfrac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{2i}^2} \, x_{2i} \right) \right]}{\sum_{i=1}^{n} \left( x_{1i} - \dfrac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{2i}^2} \, x_{2i} \right)^2}. \]

Multiplying out the terms in the numerator and denominator and expanding by $\sum_{i=1}^{n} x_{2i}^2$ before moving through the summation sign, results in

\[ \hat\alpha = \frac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} x_{1i} x_{2i} \sum_{i=1}^{n} y_i x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2} = \hat\beta_1. \]

6) In the multiple regression problem with k explanatory variables, it would be quite tedious to derive the formulas for the slope coefficients without knowledge of linear algebra. The formulas certainly do not resemble the formula for the slope coefficient in the simple linear regression model with a single explanatory variable. However, it can be shown that the following three step procedure results in the same formula for the slope coefficient of the first explanatory variable, X1: Step 1: regress Y on a constant and all other explanatory variables other than X1, and calculate the residual (Res1). Step 2: regress X1 on a constant and all other explanatory variables, and calculate the residual (Res2). Step 3: regress Res1 on a constant and Res2. Can you give an intuitive explanation of this procedure? Answer: Step 1 eliminates the linear influence of all variables other than X1 from Y. Think of pouring a liquid through a filter: the remaining liquid now contains the “purified” Y, or that part of Y that could not be explained by the other X’s. The same happens in Step 2, where X1 is now purified from any correlation with the other X’s. Step 3 establishes the purified relationship between Y and X1. (This procedure is of interest to students if they want to plot the two-dimensional relationship between Y and X1.)
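The three step procedure in questions 5 and 6 can be verified numerically. Below is a minimal NumPy sketch on simulated data (the data-generating numbers, the seed, and the helper `fit` are illustrative assumptions): the Step 3 slope coincides with the coefficient on X1 from the full multiple regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)          # X1 is correlated with X2
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)


def fit(y, X):
    """OLS coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


const = np.ones(n)

# Full multiple regression of Y on a constant, X1 and X2.
beta_full = fit(y, np.column_stack([const, x1, x2]))

# Step 1: residual of Y after regressing on a constant and X2.
Z = np.column_stack([const, x2])
res1 = y - Z @ fit(y, Z)

# Step 2: residual of X1 after regressing on a constant and X2.
res2 = x1 - Z @ fit(x1, Z)

# Step 3: regress Res1 on a constant and Res2.
alpha = fit(res1, np.column_stack([const, res2]))

print(beta_full[1], alpha[1])
```

The intercept in Step 3 is zero up to rounding because both residual series have mean zero, which is why including the constant in that step is optional.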


7) Give at least three examples from macroeconomics and three from microeconomics that involve specified equations in a multiple regression analysis framework. Indicate in each case what the expected signs of the coefficients would be and if theory gives you an indication about the likely size of the coefficients. Answer: Answers will vary by student. In my experience, students most frequently will bring up demand functions (quantity demanded, price, and other variables such as income, price of substitutes, etc.), supply functions (quantity supplied, price, costs), production functions (output produced, capital, labor, and other inputs), consumption functions (consumption, income, and the real interest rate or wealth), money demand functions (real money supply, income, and interest rate), and the Phillips curve (inflation, unemployment rate, and inflationary expectations). 8) One of your peers wants to analyze whether or not participating in varsity sports lowers or increases the GPA of students. She decides to collect data from 110 male and female students on their GPA and the number of hours they spend participating in varsity sports. The coefficient in the simple regression function turns out to be significantly negative, using the t-statistic and carrying out the appropriate hypothesis test. Upon reflection, she is concerned that she did not ask the students in her sample whether or not they were female or male. You point out to her that you are more concerned about the effect of omitted variables in her regression, such as the incoming SAT score of the students, and whether or not they are in a major from a high/low grading department. Elaborate on your argument. 
Answer: Omitted variables will result in an inconsistent estimator for the coefficient on the included variable (number of hours spent in varsity sports) if both of the following two conditions hold: the omitted variable is relevant in affecting the GPA, and the omitted variable is correlated with the included variable. Incoming SAT scores are clearly relevant in predicting GPAs, at least in the earlier years. Departmental differences in the general level of grading will even more obviously have an effect on the GPA. If these variables are also correlated with the hours spent in varsity sports, which seems plausible, the relationship suffers from omitted variable bias.


9) (Requires Calculus) For the case of the multiple regression problem with two explanatory variables, show that minimizing the sum of squared residuals results in three conditions:

\[ \sum_{i=1}^{n} \hat{u}_i = 0; \quad \sum_{i=1}^{n} \hat{u}_i X_{1i} = 0; \quad \sum_{i=1}^{n} \hat{u}_i X_{2i} = 0 \]

Answer: To minimize the sum of squared prediction mistakes

\[ \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 \]

you need to take the following three derivatives with respect to $b_0$, $b_1$ and $b_2$. This results in

\[ \frac{\partial}{\partial b_0} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i}) \]
\[ \frac{\partial}{\partial b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i}) X_{1i} \]
\[ \frac{\partial}{\partial b_2} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i}) X_{2i} \]

The OLS estimators are those for which the derivatives are zero. Hence we get

\[ -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) = 0 = \sum_{i=1}^{n} \hat{u}_i \]
\[ -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) X_{1i} = 0 = \sum_{i=1}^{n} \hat{u}_i X_{1i} \]
\[ -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) X_{2i} = 0 = \sum_{i=1}^{n} \hat{u}_i X_{2i} \]
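The three first order conditions can be confirmed on any data set. The sketch below uses simulated data (all numbers and the seed are assumptions chosen for illustration) and computes the three sums in one step as X'û, where the columns of X are the constant, X1 and X2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X1 = rng.normal(10.0, 2.0, n)
X2 = rng.normal(5.0, 1.0, n)
Y = 3.0 + 0.5 * X1 - 1.2 * X2 + rng.normal(size=n)

# Design matrix: constant, X1, X2.
X = np.column_stack([np.ones(n), X1, X2])
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
u_hat = Y - X @ beta_hat

# Entries are sum(u_hat), sum(u_hat * X1), sum(u_hat * X2).
conditions = X.T @ u_hat
print(conditions)
```

All three entries are zero up to floating-point rounding, regardless of the values used to generate the data.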


10) The probability limit of the OLS estimator in the case of omitted variables is given in your text by the following formula:

\[ \hat\beta_1 \xrightarrow{p} \beta_1 + \rho_{Xu} \frac{\sigma_u}{\sigma_X} \]

Give an intuitive explanation for two conditions under which the bias will be small. Answer: The bias will be small if there is little correlation between the included variable and the error term. The error term contains the omitted variable. If the omitted variable is correlated with the included variable, then the error term is correlated with the included variable. Now consider the case where the correlation between the included and omitted variable is low, resulting in a low correlation between the error term and the included variable. In that case, changes in the omitted variable are rarely accompanied by changes in the included variable, so that changes in Y caused by the omitted variable are not wrongly attributed to the included variable. The second condition is the size of the ratio of the two standard deviations. The formula suggests that if the included variable varies substantially more than the error term, which contains the omitted variable, then the inconsistency will be small. In that case, the relationship between the included variable and the dependent variable does not get disturbed much by variations in the omitted variable.
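A sample analogue of the formula can be computed directly. The following sketch rests on assumptions: the data are simulated, and the "error term" of the short regression is built from the long-regression estimates as u = Y - b0 - b1·X1, so that it contains the omitted X2. With these sample quantities the relation short slope = long slope + r(X,u)·s(u)/s(X) holds exactly, which mirrors the structure of the probability limit above.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)           # omitted variable, correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)


def fit(y, X):
    """OLS coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


const = np.ones(n)
b_long = fit(y, np.column_stack([const, x1, x2]))   # X2 included
b_short = fit(y, np.column_stack([const, x1]))      # X2 omitted

# Error term of the short regression: everything not explained by X1.
u = y - b_long[0] - b_long[1] * x1

rho_xu = np.corrcoef(x1, u)[0, 1]
bias_term = rho_xu * np.std(u) / np.std(x1)

print(b_short[1], b_long[1] + bias_term)
```

Lowering the 0.7 loading of x2 on x1 shrinks `rho_xu`, and increasing the spread of x1 relative to u shrinks the ratio of standard deviations; these are the two conditions discussed in the answer.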


11) It is not hard, but tedious, to derive the OLS formulae for the slope coefficient in the multiple regression case with two explanatory variables. The formula for the first regression slope is

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2} \]

(small letters refer to deviations from means as in $z_i = Z_i - \bar{Z}$). Show that this formula reduces to the slope coefficient for the linear regression model with one regressor if the sample correlation between the two explanatory variables is zero. Given this result, what can you say about the effect of omitting the second explanatory variable from the regression?

Answer: Divide each of the four terms by $\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2$ to get

\[ \hat\beta_1 = \frac{\dfrac{\sum_{i=1}^{n} y_i x_{1i}}{\sum_{i=1}^{n} x_{1i}^2} - \dfrac{\sum_{i=1}^{n} y_i x_{2i}}{\sum_{i=1}^{n} x_{2i}^2} \, \dfrac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2}}{1 - \dfrac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2} \, \dfrac{\sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{2i}^2}} = \frac{\hat\beta_{y x_1} - \hat\beta_{y x_2} \hat\beta_{x_2 x_1}}{1 - \hat\beta_{x_2 x_1} \hat\beta_{x_1 x_2}}, \]

where $\hat\beta_{ab}$ denotes the slope from the simple regression of $a$ on $b$. A zero sample correlation means $\sum_{i=1}^{n} x_{1i} x_{2i} = 0$, so that $\hat\beta_{x_1 x_2} = \hat\beta_{x_2 x_1} = 0$, and then

\[ \hat\beta_1 = \frac{\hat\beta_{y x_1}}{1} = \hat\beta_{y x_1}. \]

Omitting the second explanatory variable from the regression will have no effect on the coefficient which indicates the effect of a change in the included variable on the dependent variable. However, you also do not observe the effect that a change in the omitted variable has on the dependent variable.

12) (Requires Statistics background beyond Chapters 2 and 3) One way to establish whether or not there is independence between two or more variables is to perform a $\chi^2$ test of independence between two variables. Explain why multiple regression analysis is a preferable tool to seek a relationship between variables. Answer: The $\chi^2$ test can only establish whether or not a relationship between variables exists, but it cannot tell the researcher anything about the effect of a unit change in X on Y. If the researcher is interested in the quantitative information, then she must use a multiple regression framework. The textbook example on student performance can be used here for an explanation.


13) In the multiple regression with two explanatory variables, show that the TSS can still be decomposed into the ESS and the SSR. Answer: The proof proceeds along the same line as in the case of a single explanatory variable. The sample regression function is given by

\[ Y_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i} + \hat{u}_i \]

The average is therefore

\[ \bar{Y} = \hat\beta_0 + \hat\beta_1 \bar{X}_1 + \hat\beta_2 \bar{X}_2 \]

since the first order condition has $\sum_{i=1}^{n} \hat{u}_i = 0$. Subtracting the second equation from the first and letting small letters indicate deviations from mean, results in

\[ y_i = \hat\beta_1 x_{1i} + \hat\beta_2 x_{2i} + \hat{u}_i \quad \text{or} \quad y_i = \hat{y}_i + \hat{u}_i. \]

Squaring both sides and summing gives you

\[ \sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} \hat{y}_i^2 + \sum_{i=1}^{n} \hat{u}_i^2 + 2 \sum_{i=1}^{n} \hat{y}_i \hat{u}_i. \]

The last term is zero since it involves terms of the type

\[ \sum_{i=1}^{n} \hat{u}_i x_i = \sum_{i=1}^{n} \hat{u}_i X_i - \bar{X} \sum_{i=1}^{n} \hat{u}_i, \]

all of which are zero given the first order conditions. We therefore arrive at

\[ \sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} \hat{y}_i^2 + \sum_{i=1}^{n} \hat{u}_i^2 \quad \text{or} \quad TSS = ESS + SSR. \]

This proof generalizes easily for k explanatory variables.
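The decomposition can be checked numerically. This is a minimal sketch on simulated data (the coefficients and seed are arbitrary assumptions); it also computes the cross term that the proof shows to be zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X1 = rng.normal(size=n)
X2 = 0.4 * X1 + rng.normal(size=n)
Y = 2.0 + 1.0 * X1 - 0.5 * X2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), X1, X2])
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_fit = X @ beta_hat
u_hat = Y - Y_fit

tss = np.sum((Y - Y.mean()) ** 2)        # total sum of squares
ess = np.sum((Y_fit - Y.mean()) ** 2)    # explained sum of squares
ssr = np.sum(u_hat ** 2)                 # sum of squared residuals
cross_term = np.sum((Y_fit - Y.mean()) * u_hat)

print(tss, ess + ssr, cross_term)
```

The equality relies on the regression containing an intercept, since that is what guarantees that the residuals sum to zero.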


14) The OLS formulas for the slope coefficients in the multiple regression model become increasingly complicated, using the “sums” expressions, as you add more regressors. For example, in the regression with a single explanatory variable, the formula is

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \]

whereas the formula for the slope of the first explanatory variable is

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2} \]

(small letters refer to deviations from means as in $z_i = Z_i - \bar{Z}$) in the case of two explanatory variables. Give an intuitive explanation as to why this is the case. Answer: The additional terms take into account that there is a relationship between the regressors. As a matter of fact, the more complicated formula reduces to the simpler formula if the correlation between the included variables is zero. In a controlled laboratory experiment, only one variable is changed at a time, holding all others constant. This is impossible to do with economic data, so the additional terms are added to control for the change in the other variables.

15) (Requires Calculus) For the case of the multiple regression problem with two explanatory variables, derive the OLS estimator for the intercept and the two slopes. Answer: To minimize the sum of squared prediction mistakes

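\[ \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 \]

you need to take the following three derivatives with respect to $b_0$, $b_1$ and $b_2$. This results in

\[ \frac{\partial}{\partial b_0} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i}) \]
\[ \frac{\partial}{\partial b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i}) X_{1i} \]
\[ \frac{\partial}{\partial b_2} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i}) X_{2i} \]

The OLS estimators are those for which the derivatives are zero. Hence we get

\[ -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) = 0; \quad \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}_1 - \hat\beta_2 \bar{X}_2 \]
\[ -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) X_{1i} = 0; \quad \sum_{i=1}^{n} Y_i X_{1i} = \hat\beta_0 n \bar{X}_1 + \hat\beta_1 \sum_{i=1}^{n} X_{1i}^2 + \hat\beta_2 \sum_{i=1}^{n} X_{2i} X_{1i} \]
\[ -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) X_{2i} = 0; \quad \sum_{i=1}^{n} Y_i X_{2i} = \hat\beta_0 n \bar{X}_2 + \hat\beta_2 \sum_{i=1}^{n} X_{2i}^2 + \hat\beta_1 \sum_{i=1}^{n} X_{2i} X_{1i} \]

After substituting the result for $\hat\beta_0$ into the last two equations, these have only two unknowns remaining, namely $\hat\beta_1$ and $\hat\beta_2$. Letting small letters indicate deviations from mean, you get

\[ \sum_{i=1}^{n} y_i x_{1i} = \hat\beta_1 \sum_{i=1}^{n} x_{1i}^2 + \hat\beta_2 \sum_{i=1}^{n} x_{2i} x_{1i} \]
\[ \sum_{i=1}^{n} y_i x_{2i} = \hat\beta_1 \sum_{i=1}^{n} x_{1i} x_{2i} + \hat\beta_2 \sum_{i=1}^{n} x_{2i}^2 \]

There are various methods to solve for $\hat\beta_1$ and $\hat\beta_2$. Here we isolate $\hat\beta_2$ in the second equation and substitute into the first equation.

\[ \sum_{i=1}^{n} y_i x_{1i} = \hat\beta_1 \sum_{i=1}^{n} x_{1i}^2 + \frac{\sum_{i=1}^{n} y_i x_{2i} - \hat\beta_1 \sum_{i=1}^{n} x_{2i} x_{1i}}{\sum_{i=1}^{n} x_{2i}^2} \sum_{i=1}^{n} x_{2i} x_{1i} \]

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2} \]

Similarly you can derive

\[ \hat\beta_2 = \frac{\sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i}^2 - \sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2} \]

16) (Requires Calculus) For the simple linear regression model of Chapter 4, $Y_i = \beta_0 + \beta_1 X_i + u_i$, the OLS estimator for the intercept was $\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$, and

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2}. \]

Intuitively, the OLS estimators for the regression model $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$ might be $\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}_1 - \hat\beta_2 \bar{X}_2$,

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} X_{1i} Y_i - n \bar{X}_1 \bar{Y}}{\sum_{i=1}^{n} X_{1i}^2 - n \bar{X}_1^2} \quad \text{and} \quad \hat\beta_2 = \frac{\sum_{i=1}^{n} X_{2i} Y_i - n \bar{X}_2 \bar{Y}}{\sum_{i=1}^{n} X_{2i}^2 - n \bar{X}_2^2}. \]

By minimizing the prediction mistakes of the regression model with two explanatory variables, show that this cannot be the case. Answer: To minimize the sum of squared prediction mistakes

\[ \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 \]

you need to take the following three derivatives with respect to $b_0$, $b_1$ and $b_2$. This results in

\[ \frac{\partial}{\partial b_0} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i}) \]
\[ \frac{\partial}{\partial b_1} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i}) X_{1i} \]
\[ \frac{\partial}{\partial b_2} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i})^2 = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i}) X_{2i} \]

The OLS estimators are those for which the derivatives are zero. Hence we get

\[ -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) = 0; \quad \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}_1 - \hat\beta_2 \bar{X}_2 \]
\[ -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) X_{1i} = 0; \quad \sum_{i=1}^{n} Y_i X_{1i} = \hat\beta_0 n \bar{X}_1 + \hat\beta_1 \sum_{i=1}^{n} X_{1i}^2 + \hat\beta_2 \sum_{i=1}^{n} X_{2i} X_{1i} \]
\[ -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_{1i} - \hat\beta_2 X_{2i}) X_{2i} = 0; \quad \sum_{i=1}^{n} Y_i X_{2i} = \hat\beta_0 n \bar{X}_2 + \hat\beta_2 \sum_{i=1}^{n} X_{2i}^2 + \hat\beta_1 \sum_{i=1}^{n} X_{2i} X_{1i} \]

It is clear that the first of these three expressions results in $\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}_1 - \hat\beta_2 \bar{X}_2$. However, the second (third) expression involves terms in $X_{2i}$ ($X_{1i}$), hence the formula cannot be simplified to

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} X_{1i} Y_i - n \bar{X}_1 \bar{Y}}{\sum_{i=1}^{n} X_{1i}^2 - n \bar{X}_1^2} \quad \left( \hat\beta_2 = \frac{\sum_{i=1}^{n} X_{2i} Y_i - n \bar{X}_2 \bar{Y}}{\sum_{i=1}^{n} X_{2i}^2 - n \bar{X}_2^2} \right) \]

unless special conditions hold (such as $\sum_{i=1}^{n} x_{1i} x_{2i} = 0$, that is, a zero sample covariance between the two regressors).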
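The point of question 16 can be illustrated with a small worked example. The sketch below uses four invented observations and two hypothetical helper functions: `naive_slopes` applies the simple-regression formula to each regressor separately, and `ols_slopes` solves the full least squares problem. The two agree only when the regressors have zero sample covariance.

```python
import numpy as np


def naive_slopes(y, x1, x2):
    """Simple-regression slope formula applied separately to each regressor."""
    yd, x1d, x2d = y - y.mean(), x1 - x1.mean(), x2 - x2.mean()
    return np.array([x1d @ yd / (x1d @ x1d), x2d @ yd / (x2d @ x2d)])


def ols_slopes(y, x1, x2):
    """Slopes from the multiple regression of y on a constant, x1 and x2."""
    X = np.column_stack([np.ones_like(y), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]


y = np.array([1.0, 4.0, 2.0, 7.0])
x1 = np.array([0.0, 1.0, 0.0, 1.0])
x2_orth = np.array([0.0, 0.0, 1.0, 1.0])  # zero sample covariance with x1
x2_corr = np.array([0.0, 1.0, 1.0, 2.0])  # positively correlated with x1

print(naive_slopes(y, x1, x2_orth), ols_slopes(y, x1, x2_orth))
print(naive_slopes(y, x1, x2_corr), ols_slopes(y, x1, x2_corr))
```

With the correlated regressor the naive formula gives slopes of 4 and 3, while the multiple regression gives 2 and 2, so the intuitive guess in the question fails as soon as the cross-product term is nonzero.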

17) Your textbook extends the simple regression analysis of Chapters 4 and 5 by adding an additional explanatory variable, the percent of English learners in school districts (PctEL). The results are as follows: TestScore = 698.9 - 2.28 × STR and TestScore = 698.0 - 1.10 × STR - 0.65 × PctEL. Explain why you think the coefficient on the student-teacher ratio has changed so dramatically (been more than halved). Answer: This is a good example of omitted variable bias. The previously excluded variable, the percent of English learners, not only seems to matter and to be economically important in the determination of test scores, but is also correlated with the student-teacher ratio (recall that the correlation coefficient between the student-teacher ratio and the percent of English learners was positive, at almost 20%). As a result, there will be omitted variable bias if you regress the test scores on the student-teacher ratios only.

18) (Requires some Calculus) Consider the sample regression function $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i}$. Take the total derivative. Next show that the partial derivative $\dfrac{\Delta \hat{Y}_i}{\Delta X_{1i}}$ is obtained by holding $X_{2i}$ constant, or controlling for $X_{2i}$. Answer: $\Delta$ is a linear operator. Hence

\[ \Delta \hat{Y}_i = \Delta(\hat\beta_0 + \hat\beta_1 X_{1i} + \hat\beta_2 X_{2i}) = \Delta \hat\beta_0 + \hat\beta_1 \Delta X_{1i} + \hat\beta_2 \Delta X_{2i} = \hat\beta_1 \Delta X_{1i} + \hat\beta_2 \Delta X_{2i}. \]

Dividing through by $\Delta X_{1i}$ then results in

\[ \frac{\Delta \hat{Y}_i}{\Delta X_{1i}} = \hat\beta_1 + \hat\beta_2 \frac{\Delta X_{2i}}{\Delta X_{1i}}, \]

which only equals $\hat\beta_1$ if $\dfrac{\Delta X_{2i}}{\Delta X_{1i}} = 0$, i.e., if $X_{2i}$ remains constant following a change in $X_{1i}$.


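19) (Requires Appendix material) Consider the following population regression function model with two explanatory variables: $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$. It is easy but tedious to show that the variance of $\hat\beta_1$, the square of SE($\hat\beta_1$), is given by the following formula:

\[ \sigma^2_{\hat\beta_1} = \frac{1}{n} \left( \frac{1}{1 - \rho^2_{X_1, X_2}} \right) \frac{\sigma_u^2}{\sigma^2_{X_1}}. \]

Sketch how SE($\hat\beta_1$) increases with the correlation between $X_{1i}$ and $X_{2i}$. Answer: The answer should look something like this:

[Figure: SE($\hat\beta_1$) plotted against the correlation between $X_{1i}$ and $X_{2i}$.]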
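The shape of the requested sketch can be tabulated from the formula. This is a minimal illustration in which the sample size and the two standard deviations are arbitrary assumed values, and `se_beta1` is a hypothetical helper that simply evaluates the square root of the variance formula.

```python
import numpy as np


def se_beta1(rho, n=100, sigma_u=1.0, sigma_x1=1.0):
    """Standard error implied by the variance formula, for a given correlation rho."""
    variance = sigma_u**2 / (n * (1.0 - rho**2) * sigma_x1**2)
    return np.sqrt(variance)


# Correlation values from 0 up to (but not including) 1.
rhos = np.array([0.0, 0.3, 0.6, 0.9, 0.99])
standard_errors = se_beta1(rhos)

for rho, se in zip(rhos, standard_errors):
    print(f"rho = {rho:4.2f}   SE = {se:.4f}")
```

The standard error is flat near zero correlation, rises slowly at first, and then increases without bound as the correlation approaches one in absolute value; the curve is symmetric in the sign of the correlation because only its square enters.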

20) For this question, use the California Testscore Data Set and your regression package (a spreadsheet program if necessary). First perform a multiple regression of testscores on a constant, the student-teacher ratio, and the percent of English learners. Record the coefficients. Next, do the following three step procedure instead: first, regress the testscore on a constant and the percent of English learners. Calculate the residuals and store them under the name resYX2. Second, regress the student-teacher ratio on a constant and the percent of English learners. Calculate the residuals from this regression and store these under the name resX1X2. Finally regress resYX2 on resX1X2 (and a constant, if you wish). Explain intuitively why the simple regression coefficient in the last regression is identical to the regression coefficient on the student-teacher ratio in the multiple regression. Answer: This three step procedure actually explains how OLS controls for the influence of other variables. In the first step, OLS removes the linear influence of the percent of English learners from the dependent variable. The residuals from that regression represent the “left-over” of the testscores that the percent of English learners could not explain (“purified testscores;” think of a filter removing some of the elements). The same explanation holds for the second regression: the student-teacher ratio is purified (if the percent of English learners actually has an influence on student-teacher ratios). In the final step, you regress the two “purified” variables on each other.


21) Assume that you have collected cross-sectional data for average hourly earnings (ahe), the number of years of education (educ) and gender of the individuals (you have coded individuals as “1” if they are female and “0” if they are male; the name of the resulting variable is DFemme). Having faced recent tuition hikes at your university, you are interested in the return to education, that is, how much more you will earn for an additional year of being at your institution. To investigate this question, you run the following regression: ahe = -4.58 + 1.71×educ, N = 14,925, R2 = 0.18, SER = 9.30

a. Interpret the regression output.

b. Being a female, you wonder how these results are affected if you entered a binary variable (DFemme), which takes on the value of “1” if the individual is a female, and is “0” for males. The result is as follows: ahe = -3.44 - 4.09×DFemme + 1.76×educ, N = 14,925, R2 = 0.22, SER = 9.08. Does it make sense that the standard error of the regression decreased while the regression R2 increased?

c. Do you think that the regression you estimated first suffered from omitted variable bias?

Answer: a. For every additional year of education, you receive $1.71 additional hourly earnings, on average. It is best not to interpret the intercept, since there are no (or extremely few) observations at the origin. b. The regression R2 cannot decrease if you add an explanatory variable. If the additional variable does not contribute anything to the fit, then this measure will remain the same. However, in practice, this does not happen. The standard error of the regression is based on the SSR, which will almost always decrease with the addition of an explanatory variable. As a result, the observed pattern in the two statistics is to be expected. c. There are two conditions for omitted variable bias to be present. First, DFemme must be a determinant of ahe; and second, it must be correlated with educ. Given that you have not learned how to test for statistical significance in the multiple regression model, the first question is hard to determine at this point. However, you might argue that the coefficient seems large and that you have read elsewhere that there is evidence of females earning less using this type of equation. With regard to the second question, you could argue that the coefficient on educ has changed somewhat, although the increase does not seem to be large ($0.05). For there to be a correlation between education and the binary female variable, you would have to argue that males and females receive different years of education. Either way, the omitted variable bias in the first equation does not appear to be large.


22) You have collected data on individuals and their attributes. Consequently you have generated several binary variables, which take on a value of “1” if the individual has that characteristic and are “0” otherwise. One example is the binary variable DMarr which is “1” for married individuals and “0” for non -married variables. If you run the following regression: ahei= β0 + β1 ×educi + β2 ×DMarri + ui a.

What is the interpretation for β2 ?

You are interested in directly observing the effect that being non -married (“single”) has on earnings, controlling for years of education. Instead of recoding all observations such that they are “1” for a not married individual and “0” for a married person, how can you generate such a variable (DSingle) through a simple command in your regression program?

Answer: a. The coefficient will tell you by how much, on average, a married person’s average hourly earnings differ from those of a non-married person, holding years of education constant. b. gen DSingle = 1 - DMarr (STATA); genr DSingle = 1 - DMarr (EViews)

23) Consider the following earnings function: $ahe_i = \beta_0 + \beta_1 \times DFemme_i + \beta_2 \times educ_i + ... + u_i$ versus the alternative specification $ahe_i = \gamma_0 \times DMale_i + \gamma_1 \times DFemme_i + \gamma_2 \times educ_i + ... + u_i$ where ahe is average hourly earnings, DFemme is a binary variable which takes on the value of “1” if the individual is a female and is “0” otherwise, educ measures the years of education, and DMale is a binary variable which takes on the value of “1” if the individual is a male and is “0” otherwise. There may be additional explanatory variables in the equation.

a. How do the βs and γs compare? Putting it differently, having estimated the coefficients in the first equation, can you derive the coefficients in the second equation without re-estimating the regression?

b. Will the goodness of fit measures, such as the regression R², differ between the two equations?

c. What is the reason why economists typically prefer the second specification over the first?

Answer: a. γ 0 = β0 ; γ 1 = β0 + β1 ; γ 2 = β2 b. The regression R2 will be identical, as will be the standard error of the regression. c. The second equation allows you to consider the difference between earnings of two sub-groups. Economists are often interested in testing for such differences, rather than to find the average level of earnings.
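The relationship between the βs and γs in answer 23(a) can be checked numerically. The following is a minimal sketch, assuming a small made-up sample; the data and the helper `ols` are hypothetical and are not taken from the text. It estimates both specifications by solving the normal equations and prints the differences, which should be zero up to rounding error.

```python
def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for c in range(k):
        # Pick the largest available pivot in column c
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


# Hypothetical sample, one tuple per person: (DFemme, educ, ahe)
sample = [(0, 12, 15.0), (0, 16, 24.0), (1, 12, 13.0), (1, 16, 20.0),
          (0, 14, 18.5), (1, 14, 17.0), (1, 18, 25.0), (0, 10, 12.0)]
ahe = [row[2] for row in sample]

# First specification: intercept, DFemme, educ
X_first = [[1, f, e] for f, e, _ in sample]
# Second specification: DMale, DFemme, educ (no intercept)
X_second = [[1 - f, f, e] for f, e, _ in sample]

beta = ols(X_first, ahe)
gamma = ols(X_second, ahe)

print("gamma0 - beta0           =", gamma[0] - beta[0])
print("gamma1 - (beta0 + beta1) =", gamma[1] - (beta[0] + beta[1]))
print("gamma2 - beta2           =", gamma[2] - beta[2])
```

Both regressor sets span the same column space (the intercept column equals DMale + DFemme), which is why the fit, and hence the regression R² and SER in answer 23(b), is the same.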


24) You would like to find the effect of gender and marital status on earnings. As a result, you consider running the following regression: $ahe_i = \beta_0 + \beta_1 \times DFemme_i + \beta_2 \times DMarr_i + \beta_3 \times DSingle_i + \beta_4 \times educ_i + ... + u_i$ where ahe is average hourly earnings, DFemme is a binary variable which takes on the value of “1” if the individual is a female and is “0” otherwise, DMarr is a binary variable which takes on the value of “1” if the individual is married and is “0” otherwise, and DSingle takes on the value of “1” if the individual is not married and is “0” otherwise. The regression program which you are using either returns a message that the equation cannot be estimated or drops one of the coefficients. Why do you think that is? Answer: There is perfect multicollinearity here (“dummy variable trap”), since DMarr + DSingle equals the constant regressor for every observation. You need to drop either DMarr or DSingle.


Chapter 7 Hypothesis Tests and Confidence Intervals in Multiple Regression

7.1 Multiple Choice

1) The confidence interval for a single coefficient in a multiple regression A) makes little sense because the population parameter is unknown. B) should not be computed because there are other coefficients present in the model. C) contains information from a large number of hypothesis tests. D) should only be calculated if the regression R² is identical to the adjusted R². Answer: C 2) The following linear hypothesis can be tested using the F-test with the exception of A) β2 = 1 and β3 = β4/β5. B) β2 = 0. C) β1 + β2 = 1 and β3 = -2β4. D) β0 = β1 and β1 = 0. Answer: A 3) The formula for the standard error of the regression coefficient, when moving from one explanatory variable to two explanatory variables, A) stays the same. B) changes, unless the second explanatory variable is a binary variable. C) changes. D) changes, unless you test for a null hypothesis that the additional regression coefficient is zero. Answer: C 4) All of the following are examples of joint hypotheses on multiple regression coefficients, with the exception of A) H0: β1 + β2 = 1 B) H0: β3/β2 = β1 and β4 = 0 C) H0: β2 = 0 and β3 = 0 D) H0: β1 = -β2 and β1 + β2 = 1 Answer: A 5) When testing a joint hypothesis, you should A) use t-statistics for each hypothesis and reject the null hypothesis if all of the restrictions fail. B) use the F-statistic and reject all the hypotheses if the statistic exceeds the critical value. C) use t-statistics for each hypothesis and reject the null hypothesis once the statistic exceeds the critical value for a single hypothesis. D) use the F-statistic and reject at least one of the hypotheses if the statistic exceeds the critical value. Answer: D 6) The overall regression F-statistic tests the null hypothesis that A) all slope coefficients are zero. B) all slope coefficients and the intercept are zero. C) the intercept in the regression and at least one, but not all, of the slope coefficients is zero. D) the slope coefficient of the variable of interest is zero, but that the other slope coefficients are not. Answer: A


7) For a single restriction (q = 1), the F-statistic A) is the square root of the t-statistic. B) has a critical value of 1.96. C) will be negative. D) is the square of the t-statistic. Answer: D

8) The homoskedasticity-only F-statistic is given by the following formula

A) $F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)}$

B) $F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{restricted}/(n - k_{unrestricted} - 1)}$

C) $F = \frac{(SSR_{unrestricted} - SSR_{restricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)}$

D) $F = \frac{(SSR_{restricted} - SSR_{unrestricted})/(q - 1)}{SSR_{unrestricted}/(n - k_{unrestricted})}$

Answer: A

9) All of the following are correct formulae for the homoskedasticity-only F-statistic, with the exception of

A) $F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)}$

B) $F = \frac{(SSR_{unrestricted} - SSR_{restricted})/q}{SSR_{restricted}/(n - k_{restricted} - 1)}$

C) $F = \frac{SSR_{restricted} - SSR_{unrestricted}}{SSR_{unrestricted}} \times \frac{n - k_{unrestricted} - 1}{q}$

D) $F = \left(\frac{SSR_{restricted}}{SSR_{unrestricted}} - 1\right) \times \frac{n - k_{unrestricted} - 1}{q}$

Answer: B 10) In the multiple regression model, the t-statistic for testing that the slope is significantly different from zero is calculated A) by dividing the estimate by its standard error. B) from the square root of the F-statistic. C) by multiplying the p-value by 1.96. D) using the adjusted R2 and the confidence interval. Answer: A 11) To test joint linear hypotheses in the multiple regression model, you need to A) compare the sums of squared residuals from the restricted and unrestricted model. B) use the heteroskedasticity-robust F-statistic. C) use several t-statistics and perform tests using the standard normal distribution. D) compare the adjusted R2 for the model which imposes the restrictions, and the unrestricted model. Answer: B


12) The homoskedasticity-only F-statistic is given by the following formula

A) $F = \frac{(R^2_{unrestricted} - R^2_{restricted})/q}{(1 - R^2_{unrestricted})/(n - k_{unrestricted} - 1)}$

B) $F = \frac{(1 - R^2_{unrestricted})/q}{R^2_{unrestricted}/(n - k_{unrestricted} - 1)}$

C) $F = \frac{(R^2_{unrestricted} - R^2_{restricted})/q}{(1 - R^2_{unrestricted})/(n - k_{restricted} - 1)}$

D) $F = \frac{(R^2_{unrestricted} - R^2_{unrestricted})/q}{(1 - R^2_{unrestricted})/(n - k_{restricted} - 1)}$

Answer: A 13) Let R2 unrestricted and R2 restricted be 0.4366 and 0.4149 respectively. The difference between the unrestricted and the restricted model is that you have imposed two restrictions. There are 420 observations. The F-statistic in this case is A) 4.61 B) 8.01 C) 10.34 D) 7.71 Answer: B 14) If you wanted to test, using a 5% significance level, whether or not a specific slope coefficient is equal to one, then you should A) subtract 1 from the estimated coefficient, divide the difference by the standard error, and check if the resulting ratio is larger than 1.96. B) add and subtract 1.96 from the slope and check if that interval includes 1. C) see if the slope coefficient is between 0.95 and 1.05. D) check if the adjusted R2 is close to 1. Answer: A 15) If the absolute value of your calculated t-statistic exceeds the critical value from the standard normal distribution you can A) safely assume that your regression results are significant. B) reject the null hypothesis. C) reject the assumption that the error terms are homoskedastic. D) conclude that most of the actual values are very close to the regression line. Answer: B 16) If you reject a joint null hypothesis using the F-test in a multiple hypothesis setting, then A) a series of t-tests may or may not give you the same conclusion. B) the regression is always significant. C) all of the hypotheses are always simultaneously rejected. D) the F-statistic must be negative. Answer: A
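The arithmetic behind question 13 can be sketched with the R²-based formula from question 12. The question does not state the number of regressors in the unrestricted model, so `k_unrestricted = 3` below is an assumption made only for illustration; with 420 observations, nearby values of k change the result only in the second decimal place. The function name is hypothetical.

```python
def f_stat_from_r2(r2_unrestricted, r2_restricted, q, n, k_unrestricted):
    """Homoskedasticity-only F-statistic computed from the two R-squared values."""
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1.0 - r2_unrestricted) / (n - k_unrestricted - 1)
    return numerator / denominator


# ASSUMPTION: k_unrestricted = 3 is not given in question 13.
F = f_stat_from_r2(0.4366, 0.4149, q=2, n=420, k_unrestricted=3)
print(round(F, 2))
```

Under that assumption the result is close to answer B.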


17) When your multiple regression function has a single omitted regressor, then A) use a two-sided alternative hypothesis to check the influence of all included variables. B) the estimator for your included regressors will be biased if at least one of the included variables is correlated with the omitted variable. C) the estimator for your included regressors will always be biased. D) lower the critical value to 1.645 from 1.96 in a two-sided alternative hypothesis to test the significance of the coefficients of the included variables. Answer: B 18) A 95% confidence set for two or more coefficients is a set that contains A) the sample values of these coefficients in 95% of randomly drawn samples. B) integer values only. C) the same values as the 95% confidence intervals constructed for the coefficients. D) the population values of these coefficients in 95% of randomly drawn samples. Answer: D 19) When there are two coefficients, the resulting confidence sets are A) rectangles. B) ellipses. C) squares. D) trapezoids. Answer: B 20) When testing the null hypothesis that two regression slopes are zero simultaneously, then you cannot reject the null hypothesis at the 5% level, if the ellipse contains the point A) (-1.96, 1.96). B) (0, 1.96). C) (0, 0). D) (1.96², 1.96²). Answer: C 21) The OLS estimators of the coefficients in multiple regression will have omitted variable bias A) only if an omitted determinant of Yi is a continuous variable. B) if an omitted variable is correlated with at least one of the regressors, even though it is not a determinant of the dependent variable. C) only if the omitted variable is not normally distributed. D) if an omitted determinant of Yi is correlated with at least one of the regressors. Answer: D 22) At a mathematical level, if the two conditions for omitted variable bias are satisfied, then A) E(ui | X1i, X2i,..., Xki) ≠ 0. B) there is perfect multicollinearity. C) large outliers are likely: X1i, X2i,..., Xki and Yi have infinite fourth moments. D) (X1i, X2i,..., Xki, Yi), i = 1,..., n are not i.i.d. draws from their joint distribution. Answer: A 23) All of the following are true, with the exception of one condition: A) a high R² or adjusted R² does not mean that the regressors are a true cause of the dependent variable. B) a high R² or adjusted R² does not mean that there is no omitted variable bias. C) a high R² or adjusted R² always means that an added variable is statistically significant. D) a high R² or adjusted R² does not necessarily mean that you have the most appropriate set of regressors. Answer: C

24) The general answer to the question of choosing the scale of the variables is A) dependent on your whim. B) to make the regression results easy to read and to interpret. C) to ensure that the regression coefficients always lie between -1 and 1. D) irrelevant because regardless of the scale of the variable, the regression coefficient is unaffected. Answer: B 25) If the estimates of the coefficients of interest change substantially across specifications, A) then this can be expected from sample variation. B) then you should change the scale of the variables to make the changes appear to be smaller. C) then this often provides evidence that the original specification had omitted variable bias. D) then choose the specification for which your coefficient of interest is most significant. Answer: C 26) You have estimated the relationship between test scores and the student-teacher ratio under the assumption of homoskedasticity of the error terms. The regression output is as follows: TestScore = 698.9 - 2.28×STR, and the standard error on the slope is 0.48. The homoskedasticity-only “overall” regression F-statistic for the hypothesis that the regression R² is zero is approximately A) 0.96 B) 1.96 C) 22.56 D) 4.75 Answer: C 27) Consider a regression with two variables, in which X1i is the variable of interest and X2i is the control variable. Conditional mean independence requires A) E(ui|X1i, X2i) = E(ui|X2i) B) E(ui|X1i, X2i) = E(ui|X1i) C) E(ui|X1i) = E(ui|X2i) D) E(ui) = E(ui|X2i) Answer: A 28) The homoskedasticity-only F-statistic and the heteroskedasticity-robust F-statistic typically are A) the same B) different C) related by a linear function D) a multiple of each other (the heteroskedasticity-robust F-statistic is 1.96 times the homoskedasticity-only F-statistic) Answer: B 29) Consider the following regression output where the dependent variable is test scores and the two explanatory variables are the student-teacher ratio and the percent of English learners: TestScore = 698.9 - 1.10×STR - 0.650×PctEL. You are told that the t-statistic on the student-teacher ratio coefficient is 2.56. The standard error therefore is approximately A) 0.25 B) 1.96 C) 0.650 D) 0.43 Answer: D
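Questions 26 and 29 both rest on the identity t = estimate / standard error, together with the fact from question 7 that the F-statistic for a single restriction is the square of the t-statistic. A short sketch of the two calculations, using only the numbers quoted in the questions (variable names are illustrative):

```python
# Question 26: one regressor, so the overall F-statistic tests a single
# restriction and equals the squared t-statistic on the slope.
slope_str, se_slope_str = -2.28, 0.48
t_slope = slope_str / se_slope_str
f_overall = t_slope ** 2

# Question 29: recover the standard error from the coefficient and the
# reported t-statistic (2.56 is quoted in absolute value).
coef_str, abs_t_str = -1.10, 2.56
se_str = abs(coef_str) / abs_t_str

print(f"t = {t_slope:.2f}, F = {f_overall:.2f}")
print(f"SE(STR) = {se_str:.2f}")
```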


30) The critical value of F4,∞ at the 5% significance level is A) 3.84 B) 2.37 C) 1.94 D) Cannot be calculated because in practice you will not have infinite number of observations Answer: B

7.2 Essays and Longer Questions

1) The F-statistic with q = 2 restrictions when testing for the restrictions β1 = 0 and β2 = 0 is given by the following formula:

$F = \frac{1}{2}\left(\frac{t_1^2 + t_2^2 - 2\hat{\rho}_{t_1,t_2} t_1 t_2}{1 - \hat{\rho}^2_{t_1,t_2}}\right)$

Discuss how this formula can be understood intuitively.

Answer: For the case when there is no correlation between the two explanatory variables, so that $\hat{\rho}_{t_1,t_2} = 0$, the formula reduces to a simple average of the squared t-statistics, i.e., $F = \frac{1}{2}(t_1^2 + t_2^2)$. The F2,∞ distribution is the distribution of a random variable with a chi-squared distribution with 2 degrees of freedom, divided by 2. Equivalently, the F2,∞ distribution is the distribution of the average of 2 squared independent standard normal random variables. Because the t-statistics are uncorrelated by assumption, they are independent standard normal random variables under the null hypothesis. If either β1 or β2 is nonzero (or both), then either $t_1^2$ or $t_2^2$ or both will be large. This leads to a large F-statistic, and hence a rejection of the null hypothesis.

2) The cost of attending your college has once again gone up. Although you have been told that education is investment in human capital, which carries a return of roughly 10% a year, you (and your parents) are not pleased. One of the administrators at your university/college does not make the situation better by telling you that you pay more because the reputation of your institution is better than that of others. To investigate this hypothesis, you collect data randomly for 100 national universities and liberal arts colleges from the 2000-2001 U.S. News and World Report annual rankings. Next you perform the following regression:

$\widehat{Cost} = \underset{(2,058.63)}{7,311.17} + \underset{(664.58)}{3,985.20} \times Reputation - \underset{(0.13)}{0.20} \times Size + \underset{(2,154.85)}{8,406.79} \times Dpriv - \underset{(1,121.92)}{416.38} \times Dlibart - \underset{(1,007.86)}{2,376.51} \times Dreligion$

R² = 0.72, SER = 3,773.35

where Cost is Tuition, Fees, Room and Board in dollars, Reputation is the index used in U.S. News and World Report (based on a survey of university presidents and chief academic officers), which ranges from 1 (“marginal”) to 5 (“distinguished”), Size is the number of undergraduate students, and Dpriv, Dlibart, and Dreligion are binary variables indicating whether the institution is private, a liberal arts college, and has a religious affiliation. The numbers in parentheses are heteroskedasticity-robust standard errors.

(a) Indicate whether or not the coefficients are significantly different from zero.

(b) What is the p-value for the null hypothesis that the coefficient on Size is equal to zero? Based on this, should you eliminate the variable from the regression? Why or why not?

(c) You want to test simultaneously the hypotheses that βSize = 0 and βDlibart = 0. Your regression package returns the F-statistic of 1.23. Can you reject the null hypothesis?

(d) Eliminating the Size and Dlibart variables from your regression, the estimated regression becomes

$\widehat{Cost} = \underset{(1,772.35)}{5,450.35} + \underset{(590.49)}{3,538.84} \times Reputation + \underset{(875.51)}{10,935.70} \times Dpriv - \underset{(1,180.57)}{2,783.31} \times Dreligion$

R² = 0.72, SER = 3,792.68

Why do you think that the effect of attending a private institution has increased now?

(e) You give a final attempt to bring the effect of Size back into the equation by forcing the assumption of homoskedasticity onto your estimation. The results are as follows:

$\widehat{Cost} = \underset{(1,985.17)}{7,311.17} + \underset{(593.65)}{3,985.20} \times Reputation - \underset{(0.07)}{0.20} \times Size + \underset{(1,423.59)}{8,406.79} \times Dpriv - \underset{(1,096.49)}{416.38} \times Dlibart - \underset{(989.23)}{2,376.51} \times Dreligion$

R² = 0.72, SER = 3,682.02

Calculate the t-statistic on the Size coefficient and perform the hypothesis test that its coefficient is zero. Is this test reliable? Explain.

Answer: (a) The coefficient on liberal arts colleges is not significantly different from zero. All other coefficients are statistically significant at conventional levels, with the exception of the Size coefficient, which carries a t-statistic of 1.54 in absolute value, and hence is not statistically significant at the 5% level (using a one-sided alternative hypothesis). (b) Using a one-sided alternative hypothesis, the p-value is 6.2 percent. Variables should not be eliminated simply on grounds of a statistical test. The sign of the coefficient is as expected, and its magnitude makes it important. It is best to leave the variable in the regression and let the reader decide whether or not this is convincing evidence that the size of the university matters. (c) The critical value for F2,∞ is 3.00 (5% level) and 4.61 (1% level). Hence you cannot reject the null hypothesis in this case. (d) Private institutions are smaller, on average, and some of these are liberal arts colleges. Both of these variables had negative coefficients. (e) Although the coefficient would be statistically significant in this case, the test is unreliable and should not be used for statistical inference. There is no theoretical suggestion here that the errors might be homoskedastic. Since the standard errors are quite different here, you should use the more reliable ones, i.e., the heteroskedasticity-robust ones.

3) In the multiple regression model with two explanatory variables $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$ the OLS estimators for the three parameters are as follows (small letters refer to deviations from means as in $z_i = Z_i - \bar{Z}$):

$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}_1 - \hat{\beta}_2 \bar{X}_2$

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left(\sum_{i=1}^{n} x_{1i} x_{2i}\right)^2}$

$\hat{\beta}_2 = \frac{\sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i}^2 - \sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left(\sum_{i=1}^{n} x_{1i} x_{2i}\right)^2}$

$\sum_{i=1}^{n} Y_i = 33.33$; $\sum_{i=1}^{n} X_{1i} = 2.025$; $\sum_{i=1}^{n} X_{2i} = 17.313$

$\sum_{i=1}^{n} y_i^2 = 8.3103$; $\sum_{i=1}^{n} x_{1i}^2 = 0.0122$; $\sum_{i=1}^{n} x_{2i}^2 = 0.6422$

$\sum_{i=1}^{n} y_i x_{1i} = -0.2304$; $\sum_{i=1}^{n} y_i x_{2i} = 1.5676$; $\sum_{i=1}^{n} x_{1i} x_{2i} = -0.0520$

The heteroskedasticity-robust standard errors of the two slope coefficients are 1.99 (for population growth) and 0.23 (for the saving rate). Calculate the 95% confidence interval for both coefficients. How many standard deviations are the coefficients away from zero? Answer: The 95% confidence interval for the population growth is (-16.85, -9.05), and the 95% confidence interval for the saving rate is (0.94, 1.84). The population growth coefficient has a t-statistic of -6.51, and the saving rate coefficient of 6.04. These represent standard deviations away from zero.
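The slope estimates and confidence intervals in the answer to question 3 follow directly from the sums given above. The sketch below plugs those sums into the two slope formulas and builds each interval as estimate ± 1.96 × standard error; the variable and function names are illustrative only.

```python
# Sums of squares and cross-products in deviations from means (from the question)
sum_y_x1 = -0.2304
sum_y_x2 = 1.5676
sum_x1_x2 = -0.0520
sum_x1_sq = 0.0122
sum_x2_sq = 0.6422

# Common denominator of both slope formulas
denom = sum_x1_sq * sum_x2_sq - sum_x1_x2 ** 2

beta1 = (sum_y_x1 * sum_x2_sq - sum_y_x2 * sum_x1_x2) / denom  # population growth
beta2 = (sum_y_x2 * sum_x1_sq - sum_y_x1 * sum_x1_x2) / denom  # saving rate


def conf_int_95(estimate, std_err):
    """Large-sample 95% confidence interval: estimate +/- 1.96 * SE."""
    half_width = 1.96 * std_err
    return estimate - half_width, estimate + half_width


ci_beta1 = conf_int_95(beta1, 1.99)
ci_beta2 = conf_int_95(beta2, 0.23)

print(f"beta1 = {beta1:.2f}, 95% CI = ({ci_beta1[0]:.2f}, {ci_beta1[1]:.2f})")
print(f"beta2 = {beta2:.2f}, 95% CI = ({ci_beta2[0]:.2f}, {ci_beta2[1]:.2f})")
print(f"t-statistics: {beta1 / 1.99:.2f}, {beta2 / 0.23:.2f}")
```

The t-statistic for the saving rate can differ in the second decimal from the value quoted in the answer, depending on whether the coefficient is rounded to 1.39 before dividing.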


4) A subsample from the Current Population Survey is taken, on weekly earnings of individuals, their age, and their gender. You have read in the news that women make 70 cents to the $1 that men earn. To test this hypothesis, you first regress earnings on a constant and a binary variable, which takes on a value of 1 for females and is 0 otherwise. The results were:

$\widehat{Earn} = \underset{(9.44)}{570.70} - \underset{(13.52)}{170.72} \times Female$, R² = 0.084, SER = 282.12.

(a) Perform a difference in means test and indicate whether or not the difference in the mean salaries is significantly different. Justify your choice of a one-sided or two-sided alternative test. Are these results evidence enough to argue that there is discrimination against females? Why or why not? Is it likely that the errors are normally distributed in this case? If not, does that present a problem to your test?

(b) Test for the significance of the age and gender coefficients. Why do you think that age plays a role in earnings determination?

Answer: (a) The t-statistic is -12.63, while the critical value is -1.64. The difference is therefore statistically significant. A one-sided alternative was chosen since the claim is that females make less than males. This represents little evidence of discrimination, since attributes of males and females have not been included. Given that earnings distributions are not normally distributed, the errors will also not be distributed normally, and assuming that they are results in problematic inference. (b) The t-statistics are 9.36 for the age coefficient, and -13.00 for the gender coefficient. Both of these values are greater than the (absolute) critical value from the standard normal distribution (1.64). Hence you can reject the null hypothesis that these coefficients are zero. Age proxies “on the job training.” A better proxy that has been used frequently in the past is the Mincer experience variable (Age-Education-6). Obviously this is a better proxy for some subsample of individuals than for others.
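For part (a) of question 4, the regression on a single binary regressor is a difference-in-means comparison: the intercept is the mean for males and the coefficient on Female is the female-male difference. A brief sketch using the reported estimates (variable names are illustrative) computes the t-statistic and, as a side check, the implied ratio of female to male mean earnings that the "70 cents to the $1" claim refers to.

```python
intercept = 570.70                        # estimated mean weekly earnings, males
female_coef, se_female = -170.72, 13.52   # female-male difference and its SE

# Difference-in-means t-statistic for H0: no difference
t_stat = female_coef / se_female

# One-sided test at the 5% level against "females earn less"
reject_h0 = t_stat < -1.64

# Implied group means and their ratio
mean_male = intercept
mean_female = intercept + female_coef
ratio = mean_female / mean_male

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
print(f"female/male mean earnings = {ratio:.2f}")
```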


5) You have collected data from Major League Baseball (MLB) to find the determinants of winning. You have a general idea that both good pitching and strong hitting are needed to do well. However, you do not know how much each of these contributes separately. To investigate this problem, you collect data for all MLB teams during the 1999 season. Your strategy is to first regress the winning percentage on pitching quality (“Team ERA”), second to regress the same variable on some measure of hitting (“OPS – On-base Plus Slugging percentage”), and third to regress the winning percentage on both.

Summary of the Distribution of Winning Percentage, On Base plus Slugging Percentage, and Team Earned Run Average for MLB in 1999

                                           Percentile
                    Average  Std. dev.  10%    25%    40%    50% (median)  60%    75%    90%
Team ERA            4.71     0.53       3.84   4.35   4.72   4.78          4.91   5.06   5.25
OPS                 0.778    0.034      0.720  0.754  0.769  0.780         0.790  0.798  0.820
Winning Percentage  0.50     0.08       0.40   0.43   0.46   0.48          0.49   0.59   0.60

The results are as follows:

$\widehat{Winpct} = \underset{(0.08)}{0.94} - \underset{(0.017)}{0.100} \times teamera$, R² = 0.49, SER = 0.06.

$\widehat{Winpct} = \underset{(0.17)}{-0.68} + \underset{(0.221)}{1.513} \times ops$, R² = 0.45, SER = 0.06.

$\widehat{Winpct} = \underset{(0.08)}{-0.19} - \underset{(0.008)}{0.099} \times teamera + \underset{(0.126)}{1.490} \times ops$, R² = 0.92, SER = 0.02.

(a) Use the t-statistic to test for the statistical significance of the coefficient.

(b) There are 30 teams in MLB. Does the small sample size worry you here when testing for significance?

Answer: (a) The t-statistics for team ERA and OPS are -12.38 and 11.83. Both of these are highly significant. (b) The t-statistic is only normally distributed in large samples. As a result, inference is problematic here.


6) In the process of collecting weight and height data from 29 female and 81 male students at your university, you also asked the students for the number of siblings they have. Although it was not quite clear to you initially what you would use that variable for, you construct a new theory that suggests that children who have more siblings come from poorer families and will have to share the food on the table. Although a friend tells you that this theory does not pass the “straight-face” test, you decide to hypothesize that peers with many siblings will weigh less, on average, for a given height. In addition, you believe that the muscle/fat tissue composition of male bodies suggests that females will weigh less, on average, for a given height. To test these theories, you perform the following regression:

$\widehat{Studentw} = \underset{(44.01)}{-229.92} - \underset{(5.52)}{6.52} \times Female + \underset{(2.25)}{0.51} \times Sibs + \underset{(0.62)}{5.58} \times Height$, R² = 0.50, SER = 21.08

where Studentw is in pounds, Height is in inches, Female takes a value of 1 for females and is 0 otherwise, Sibs is the number of siblings (heteroskedasticity-robust standard errors in parentheses).

(a) Carrying out hypothesis tests using the relevant t-statistics to test your two claims separately, is there strong evidence in favor of your hypotheses? Is it appropriate to use two separate tests in this situation?

(b) You also perform an F-test on the joint hypothesis that the two coefficients for females and siblings are zero. The calculated F-statistic is 0.84. Find the critical value from the F-table. Can you reject the null hypothesis? Is it possible that one of the two parameters is zero in the population, but not the other?

(c) You are now a bit worried that the entire regression does not make sense and therefore also test for the height coefficient to be zero. The resulting F-statistic is 57.25. Does that prove that there is a relationship between weight and height?

Answer: (a) The t-statistics for gender and number of siblings are -1.18 and 0.23 respectively. Neither coefficient is statistically significant at conventional levels. If you wanted to test the two hypotheses simultaneously, then you should use an F-test. (b) The critical value is 3.00 at the 5% level, and 4.61 at the 1% level. Hence you cannot reject the null hypothesis. The hypothesis is that both coefficients are zero, and this cannot be rejected. Had you rejected the null hypothesis, then the alternative hypothesis states that one or both of the restrictions do not hold. (c) Although you cannot prove anything in this context with certainty, there is a very high probability that there is a relationship between height and weight in the population, given the sample result. The critical value from the F-table is 3.78 at the 1% level.


$\widehat{RelPersInc} = \underset{(0.068)}{0.339} - \underset{(3.177)}{12.894} \times n + \underset{(0.229)}{1.397} \times SK$, R² = 0.621, SER = 0.177

where RelPersInc is GDP per worker relative to the United States, n is the average population growth rate, 1980-1990, and SK is the average investment share of GDP from 1960 to 1990 (remember investment equals saving). Numbers in parentheses are for heteroskedasticity-robust standard errors.

(a) Calculate the t-statistics and test whether or not each of the population parameters are significantly different from zero.

(b) The overall F-statistic for the regression is 79.11. What is the critical value at the 5% and 1% level? What is your decision on the null hypothesis?

(c) You remember that human capital in addition to physical capital also plays a role in determining the standard of living of a country. You therefore collect additional data on the average educational attainment in years for 1985, and add this variable (Educ) to the above regression. This results in the modified regression output:

$\widehat{RelPersInc} = \underset{(0.079)}{0.046} - \underset{(2.238)}{5.869} \times n + \underset{(0.294)}{0.738} \times SK + \underset{(0.010)}{0.055} \times Educ$, R² = 0.775, SER = 0.1377

How has the inclusion of Educ affected your previous results?

(d) Upon checking the regression output, you realize that there are only 86 observations, since data for Educ is not available for all 104 countries in your sample. Do you have to modify some of your statements in (c)?

Answer: (a) The t-statistics for population growth and the saving rate are -4.06 and 6.10, making both coefficients significantly different from zero at conventional levels of significance. (b) The critical value is 3.00 and 4.61 respectively, allowing you to reject the null hypothesis that all slope coefficients are zero. (c) The coefficient on the population growth rate is roughly half of what it was originally (-5.869 versus -12.894), and the coefficient on the saving rate has also fallen to roughly half of its original value (0.738 versus 1.397). The regression R² has increased significantly. (d) When comparing results, you should ensure that the sample is identical, since comparisons are not valid otherwise. In addition, there are now fewer than 100 observations, making inference based on the standard normal distribution problematic.
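The t-statistics in answer (a) above, and the corresponding large-sample two-sided p-values, can be sketched as follows. The p-value uses the standard normal approximation via `math.erfc` from the Python standard library; the helper name `t_and_p` is hypothetical.

```python
import math


def t_and_p(coef, std_err):
    """t-statistic for H0: coefficient = 0 and its two-sided p-value
    under the standard normal approximation: p = 2 * (1 - Phi(|t|))."""
    t = coef / std_err
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p


# Coefficients and heteroskedasticity-robust SEs from the first regression
t_n, p_n = t_and_p(-12.894, 3.177)    # population growth rate
t_sk, p_sk = t_and_p(1.397, 0.229)    # saving rate (investment share)

print(f"n : t = {t_n:.2f}, p = {p_n:.2g}")
print(f"SK: t = {t_sk:.2f}, p = {p_sk:.2g}")
```

Both p-values are far below 1%, which agrees with the conclusion that the coefficients are significant at conventional levels.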


8) Attendance at sports events depends on various factors. Teams typically do not change ticket prices from game to game to attract more spectators to less attractive games. However, there are other marketing tools used, such as fireworks, free hats, etc., for this purpose. You work as a consultant for a sports team, the Los Angeles Dodgers, to help them forecast attendance, so that they can potentially devise strategies for price discrimination. After collecting data over two years for every one of the 162 home games of the 2000 and 2001 seasons, you run the following regression:

$\widehat{Attend} = \underset{(8,770)}{15,005} + \underset{(121)}{201} \times Temperat + \underset{(169)}{465} \times DodgNetWin + \underset{(26)}{82} \times OppNetWin + \underset{(1505)}{9647} \times DFSaSu + \underset{(3355)}{1328} \times Drain + \underset{(1819)}{1609} \times D150m + \underset{(1,184)}{271} \times DDiv - \underset{(1,143)}{978} \times D2001$

R² = 0.416, SER = 6983

where Attend is announced stadium attendance, Temperat is the average temperature on game day, DodgNetWin are the net wins of the Dodgers before the game (wins-losses), OppNetWin is the opposing team’s net wins at the end of the previous season, and DFSaSu, Drain, D150m, DDiv, and D2001 are binary variables, taking a value of 1 if the game was played on a weekend, it rained during that day, the opposing team was within a 150 mile radius, the opposing team plays in the same division as the Dodgers, and the game was played during 2001, respectively. Numbers in parentheses are heteroskedasticity-robust standard errors.

(a) Are the slope coefficients statistically significant?

(b) To test whether the effect of the last four binary variables is significant, you have your regression program calculate the relevant F-statistic, which is 0.295. What is the critical value? What is your decision about excluding these variables?

Answer: (a) The coefficients on Temperat, DodgNetWin, OppNetWin, and DFSaSu are all statistically significant at the 5% level, using a one-sided test. The constant is insignificant using a two-sided test. All the other coefficients are not statistically significant at the 5% level. (b) The critical value at the 5% level is 2.37. Hence you cannot reject the null hypothesis that all four coefficients are simultaneously zero.

9) The administration of your university/college is thinking about implementing a policy of coed floors only in dormitories. Currently there are only single gender floors. One reason behind such a policy might be to generate an atmosphere of better “understanding” between the sexes. The Dean of Students (DoS) has decided to investigate if such a policy results in more “togetherness” by attempting to find the determinants of the gender composition at the dinner table in your main dining hall, and in that of a neighboring university, which only allows for coed floors in their dorms. The survey includes 176 students, 63 from your university/college, and 113 from a neighboring institution. The Dean’s first problem is how to define gender composition. To begin with, the survey excludes single persons’ tables, since the study is to focus on group behavior. The Dean also eliminates sports teams from the analysis, since a large number of single-gender students will sit at the same table. Finally, the Dean decides to only analyze tables with three or more students, since she worries about “couples” distorting the results. The Dean finally settles for the following specification of the dependent variable: GenderComp = |50% - % of Male Students at Table|, where “|Z|” stands for the absolute value of Z. The variable can take on values from zero to fifty.
After considering various explanatory variables, the Dean settles for an initial list of eight, and estimates the following relationship, using heteroskedasticity-robust standard errors (this Dean obviously has taken an econometrics course earlier in her career and/or has an able research assistant): GenderComp = 30.90 – 3.78 × Size – 8.81 × DCoed + 2.28 × DFemme +2.06 × DRoommate Stock/Watson 2e -- CVC2 8/23/06 -- Page 166

(7.73) (0.63)

(2.66)

(2.42)

(2.39)

- 0.17 × DAthlete + 1.49 × DCons – 0.81 SAT + 1.74 × SibOther, R2 =0.24, SER = 15.50 (3.23) (1.10) (1.20) (1.43) where Size is the number of persons at the table minus 3; DCoed is a binary variable, which takes on the value of 1 if you live on a coed floor; DFemme is a binary variable, which is 1 for females and zero otherwise; DRoommate is a binary variable which equals 1 if the person at the table has a roommate and is zero otherwise; DAthlete is a binary variable which is 1 if the person at the table is a member of an athletic varsity team; DCons is a variable which measures the political tendency of the person at the table on a seven -point scale, ranging from 1 being “liberal” to 7 being “conservative”; SAT is the SAT score of the person at the table measured on a seven-point scale, ranging from 1 for the category “900-1000” to 7 for the category “1510 and above”; and increasing by one for 100 point increases; and SibOther is the number of siblings from the opposite gender in the family the person at the table grew up with. (a) Indicate which of the coefficients are statistically significant. (b) Based on the above results, the Dean decides to specify a more parsimonious form by eliminating the least significant variables. Using the F-statistic for the null hypothesis that there is no relationship between the gender composition at the table and DFemme, DRoommate, DAthlete, and SAT, the regression package returns a value of 1.10. What are the degrees of freedom for the statistic? Look up the 1% and 5% critical values from the F- table and make a decision about the exclusion of these variables based on the critical values. (c) The Dean decides to estimate the following specification next: GenderComp = 29.07 – 3.80 × Size – 9.75 × DCoed + 1.50 × DCons + 1.97 × SibOther, (3.75) (0.62) (1.04) (1.04) (1.44) R2 =0.22 SER = 15.44 Calculate the t-statistics for the coefficients and discuss whether or not the Dean should attempt to simplify the specification further. 
Based on the results, what might some of the comments be that she will write up for the other senior administrators of your college? What are some of the potential flaws in her analysis? What other variables do you think she should have considered as explanatory factors? Answer: (a) Only the constant, Size, and DCoed are statistically significant at the 5% level. (b) The F4,∞ is 2.37 at the 5% level, and 3.32 at the 1% level. Hence you cannot reject the null hypothesis that all four coefficients are zero. (c) The t-statistics for the five coefficients are as follows: 7.75, -6.13, -9.38, 1.44 and 1.37. The Dean should leave the specification as is and allow readers to decide if they want to place much weight on the insignificant coefficients. The variable of interest is DCoed and she will most likely focus on that, concluding that having coed floors in dormitories will increase the gender balance at dining hall tables. She will most likely go further in her report and suggest that communication between the sexes will improve as a result of coed floors. One of the major flaws in the analysis is that students from one college do not have coed floors in dormitories while students from the other college do not have single gender floors. Ideally you would like to survey students from the same college where some of the students lived on single gender floors while others did not. Answers on omitted variables will obviously vary. Ideally some survey question should be included which would indicate the student’s attitude towards the other sex.


10) The Solow growth model suggests that countries with identical saving rates and population growth rates should converge to the same per capita income level. This result has been extended to include investment in human capital (education) as well as investment in physical capital. This hypothesis is referred to as the “conditional convergence hypothesis,” since the convergence is dependent on countries obtaining the same values in the driving variables. To test the hypothesis, you collect data from the Penn World Tables on the average annual growth rate of GDP per worker (g6090) for the 1960-1990 sample period, and regress it on the (i) initial starting level of GDP per worker relative to the United States in 1960 (RelProd60), (ii) average population growth rate of the country (n), (iii) average investment share of GDP from 1960 to 1990 (SK; remember that investment equals savings), and (iv) educational attainment in years for 1985 (Educ). The results for close to 100 countries are as follows (numbers in parentheses are heteroskedasticity-robust standard errors):

g6090 = 0.004 - 0.172 × n + 0.133 × SK + 0.002 × Educ – 0.044 × RelProd60, R2 = 0.537, SER = 0.011
       (0.007) (0.209)     (0.015)      (0.001)        (0.008)

(a) Is the coefficient on this variable significantly different from zero at the 5% level? At the 1% level? (b) Test for the significance of the other slope coefficients. Should you use a one-sided alternative hypothesis or a two-sided test? Will the decision for one or the other influence the decision about the significance of the parameters? Should you always eliminate variables which carry insignificant coefficients? Answer: (a) The coefficient has a t-statistic of 5.50 in absolute value and is therefore statistically significant at both the 5% and the 1% level. (b) The t-statistics are –0.82, 8.87, and 2.00. Hence the coefficient on population growth is not statistically significant. You should use a one-sided alternative hypothesis test since economic theory gives you information about the expected sign on these variables. In the above case, the decision will not be influenced by the choice of a one-sided or two-sided test, since the (absolute value of the) critical value is 1.64 or 1.96 at the 5% significance level. If there is a strong prior on the sign of the coefficient, then the variable should not be eliminated based on the significance test. Instead it should be left in the equation, but the high p-value should be flagged to the reader, and the reader should decide herself how convincing the evidence is in favor of the theory.
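The t-statistics quoted in this answer can be reproduced with a short calculation. The sketch below uses only the coefficients and robust standard errors reported in the regression; the dictionary layout and function name are illustrative choices, and the critical values are the usual large-sample (standard normal) two-sided ones.

```python
# Coefficients and heteroskedasticity-robust standard errors as reported
# in the growth regression above (intercept omitted).
estimates = {
    "n": (-0.172, 0.209),
    "SK": (0.133, 0.015),
    "Educ": (0.002, 0.001),
    "RelProd60": (-0.044, 0.008),
}


def t_statistic(coefficient, standard_error):
    """t-statistic for the null hypothesis that the coefficient is zero."""
    return coefficient / standard_error


t_stats = {name: round(t_statistic(b, se), 2) for name, (b, se) in estimates.items()}

# Large-sample (standard normal) two-sided critical values.
CRITICAL_5_PERCENT = 1.96
CRITICAL_1_PERCENT = 2.58

for name, t in t_stats.items():
    # Prints the t-statistic and whether it is significant at 5% and at 1%.
    print(name, t, abs(t) > CRITICAL_5_PERCENT, abs(t) > CRITICAL_1_PERCENT)
```

With the one-sided 5% critical value of 1.64 in place of 1.96, the same comparisons lead to the same decisions for these four coefficients.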


11) Using the 420 observations of the California School data set from your textbook, you estimate the following relationship: TestScore = 681.44 - 0.61 LchPct, n = 420, R2 = 0.75, SER = 9.45, where TestScore is the test score and LchPct is the percent of students eligible for subsidized lunch (average = 44.7, max = 100, min = 0).

a. Interpret the regression result.

b. In your interpretation of the slope coefficient in (a) above, does it matter if you start your explanation with “for every x percent increase” rather than “for every x percentage point increase”?

c. The “overall” regression F-statistic is 1149.57. What are the degrees of freedom for this statistic?

d. Find the critical value of the F-statistic at the 1% significance level. Test the null hypothesis that the regression R2 = 0.

e. The above equation was estimated using heteroskedasticity-robust standard errors. What is the standard error for the slope coefficient?

Answer: a. For every 10 percentage point increase in students eligible for subsidized lunch, average test scores go down by 6.1 points. If a school has no students eligible for subsidized lunch, then the average test score is approximately 681 points. 75% of the variation in test scores is explained by our model. b. Since your RHS variable is measured already in percent, it makes sense to increase that variable by 10 percentage points (say), rather than by 10 percent. If LchPct increases from 20 to 30, then this represents an increase of 10 percentage points, or an increase of 50 percent. c. With a single slope coefficient there is one restriction, so there is 1 degree of freedom in the numerator, and 418 (≈ ∞) degrees of freedom in the denominator. d. F1,∞ = 6.63. Hence you can comfortably reject the null hypothesis of no linear relationship between test scores and the percent of students eligible for subsidized lunch. e. With a single explanatory variable, the t-statistic is the square root of the F-statistic. Here it is 33.91 in absolute value. From this result, and given the size of the coefficient, the standard error is 0.018.
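Part e can be checked numerically. The following sketch backs out the standard error from the reported overall F-statistic and slope, relying on the fact stated in the answer that with one regressor the F-statistic is the square of the t-statistic; the variable names are illustrative.

```python
import math

# Values reported in the problem above.
slope = -0.61
overall_f = 1149.57

# With a single restriction, |t| is the square root of the F-statistic.
t_abs = math.sqrt(overall_f)

# t = slope / SE, so the standard error is |slope| / |t|.
standard_error = abs(slope) / t_abs

print(round(t_abs, 2), round(standard_error, 3))
```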


12) Consider the following regression using the California School data set from your textbook. TestScore = 681.44 - 0.61 LchPct, n = 420, R2 = 0.75, SER = 9.45, where TestScore is the test score and LchPct is the percent of students eligible for subsidized lunch (average = 44.7, max = 100, min = 0).

a. What is the effect of a 20 percentage point increase in the students eligible for subsidized lunch?

b. Your textbook started with the following regression in Chapter 4: TestScr = 698.9 - 2.28 STR, n = 420, R2 = 0.051, SER = 18.58, where STR is the student teacher ratio. Your textbook tells you that in the multiple regression framework considered, the percentage of students eligible for subsidized lunch is a control variable, while the student teacher ratio is the variable of interest. Given that the regression R2 is so much higher for the first equation than for the second equation, shouldn’t the role of the two variables be reversed? That is, shouldn’t the student teacher ratio be the control variable while the percent of students eligible for subsidized lunch be the variable of interest?

Answer: a. The effect would be a 12.2 point decrease in test scores (-0.61 × 20 = -12.2). b. The choice of variable of interest versus control variable has nothing to do with which variable has a higher explanatory power in the two models. Instead it depends on the question you are analyzing. In Chapter 4, the question was raised whether or not the test scores of students could be improved by hiring more teachers. Hence the variable of interest became class size or its proxy, the student teacher ratio. However, there are other variables which may have an effect on test scores, and not controlling for those will result in omitted variable bias on the coefficient of the variable of interest. Of course, the role of a control variable and the variable of interest can be switched if a different policy question is addressed. For example, a politician might be interested in figuring out the effect on student performance if she can raise income levels in certain school districts, or across the board.

7.3 Mathematical and Graphical Problems 1) Explain carefully why testing joint hypotheses simultaneously, using the F-statistic, does not necessarily yield the same conclusion as testing them sequentially (“one at a time” method), using a series of t-statistics. Answer: Testing a joint hypothesis sequentially does not result in the desired significance level. Even if this were not a problem, the shape of the confidence set in the textbook suggests another reason for this strategy to be problematic. Drawing a confidence interval for both parameters and extending the lines up and to the right results in a rectangle, indicating the area where the joint hypothesis would not be rejected using the t-statistics. Obviously the confidence set (an ellipse) does not coincide with the rectangle, and there are therefore various outcomes possible under which both strategies would come to the same conclusion or different conclusions. Since the proper testing strategy involves using the F-statistic, the t-statistics could result in improper inference under circumstances where the two areas do not coincide.
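To see why the one-at-a-time method misses the desired significance level, the sketch below works out its rejection probability under the null hypothesis. It assumes, purely for illustration, that there are two t-statistics, that they are independent, and that each is tested at the 5% level; none of these specifics are given in the problem.

```python
# Illustrative assumption: two independent t-statistics, each tested at 5%.
individual_level = 0.05
number_of_tests = 2

# The one-at-a-time procedure fails to reject only if neither test rejects.
prob_no_rejection = (1 - individual_level) ** number_of_tests
one_at_a_time_size = 1 - prob_no_rejection

print(round(one_at_a_time_size, 4))
```

Under these assumptions the procedure rejects a true null hypothesis 9.75% of the time rather than 5%. When the t-statistics are correlated the figure differs, but it generally does not equal the intended 5%.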


2) Set up the null hypothesis and alternative hypothesis carefully for the following cases: (a) k = 4, test for all coefficients other than the intercept to be zero (b) k = 3, test for the slope coefficient of X1 to be unity, and the coefficients on the other explanatory variables to be zero (c) k = 10, test for the slope coefficient of X1 to be zero, and for the slope coefficients of X2 and X3 to be the same but of opposite sign. (d) k = 4, test for the slope coefficients to add up to unity Answer: (a) H0 : β1 = 0, β2 = 0, β3 = 0, β4 = 0 (b) H0 : β1 = 1, β2 = 0, β3 = 0 (c) H0 : β1 = 0, β2 + β3 = 0 (d) H0 : β1 + β2 + β3 + β4 = 1

3) Consider a situation where economic theory suggests that you impose certain restrictions on your estimated multiple regression function. These may involve the equality of parameters, such as the returns to education and on the job training in earnings functions, or the sum of coefficients, such as constant returns to scale in a production function. To test the validity of your restrictions, you have your statistical package calculate the corresponding F-statistic. Find the critical value from the F-distribution at the 5% and 1% level, and comment whether or not you will reject the null hypothesis in each of the following cases. (a) number of observations: 152; number of restrictions: 3; F-statistic: 3.21 (b) number of observations: 1,732; number of restrictions: 7; F-statistic: 4.92 (c) number of observations: 63; number of restrictions: 1; F-statistic: 2.47 (d) number of observations: 4,000; number of restrictions: 5; F-statistic: 1.82 (e) Explain why you can use the Fq,∞ distribution to compute the critical values in (a)-(d). Answer: (a) F3,∞ = 2.60 (5% level), F3,∞ = 3.78 (1% level). Reject the null hypothesis at the 5% level, but not at the 1% level. (b) F7,∞ = 2.01 (5% level), F7,∞ = 2.64 (1% level). Reject the null hypothesis at the 5% level and at the 1% level. (c) F1,∞ = 3.84 (5% level), F1,∞ = 6.63 (1% level). Cannot reject the null hypothesis at the 5% level or at the 1% level. (d) F5,∞ = 2.21 (5% level), F5,∞ = 3.02 (1% level). Cannot reject the null hypothesis at the 5% level or at the 1% level. (e) The F-statistic is distributed Fq,∞ in large samples. Although strictly speaking this only holds for the limiting case of n = ∞, for practical purposes the approximation is close for n > 100. This is therefore problematic for (c) above, where n = 63.
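The decisions in problem 3 follow mechanically from comparing each F-statistic with the tabulated critical values. As a minimal sketch, the lookup table below simply hard-codes the Fq,∞ critical values quoted in the answer (it does not compute them), and the function name is an illustrative choice.

```python
# Large-sample critical values F(q, infinity) quoted in the answer above,
# keyed by the number of restrictions q: (5% level, 1% level).
CRITICAL_VALUES = {
    1: (3.84, 6.63),
    3: (2.60, 3.78),
    5: (2.21, 3.02),
    7: (2.01, 2.64),
}

# (case label, number of restrictions q, reported F-statistic)
cases = [("a", 3, 3.21), ("b", 7, 4.92), ("c", 1, 2.47), ("d", 5, 1.82)]


def decide(q, f_statistic):
    """Return (reject at 5%?, reject at 1%?) for a reported F-statistic."""
    crit_5, crit_1 = CRITICAL_VALUES[q]
    return f_statistic > crit_5, f_statistic > crit_1


decisions = {label: decide(q, f) for label, q, f in cases}
for label, (reject_5, reject_1) in decisions.items():
    print(label, "reject at 5%:", reject_5, "reject at 1%:", reject_1)
```

Note that the sample sizes play no role in the lookup, which is exactly the large-sample approximation that part (e) asks about.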


4) Females, on average, are shorter and weigh less than males. One of your friends, who is a pre-med student, tells you that in addition, females will weigh less for a given height. To test this hypothesis, you collect height and weight of 29 female and 81 male students at your university. A regression of the weight on a constant, height, and a binary variable, which takes a value of one for females and is zero otherwise, yields the following result:

Studentw = –229.21 – 6.36 × Female + 5.58 × Height, R2 = 0.50, SER = 20.99
           (43.39)   (5.74)          (0.62)

where Studentw is weight measured in pounds and Height is measured in inches (heteroskedasticity-robust standard errors in parentheses). Calculate t-statistics and carry out the hypothesis test that females weigh the same as males, on average, for a given height, using a 10% significance level. What is the alternative hypothesis? What is the p-value? What critical value did you use? Answer: The t-statistics for the intercept, the gender binary variable, and the height variable are -5.28, -1.11, and 9.00, respectively. For a one-sided alternative hypothesis, βFemale < 0, the critical value from the standard normal table is –1.28. Hence you cannot reject the null hypothesis at the 10% level. The p-value is 13.4%.

5) You are presented with the following output from a regression package, which reproduces the regression results of testscores on the student-teacher ratio from your textbook:

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/30/06 Time: 17:44
Sample: 1 420
Included observations: 420

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                  698.93         9.47         73.82    0.00
STR                 -2.28         0.48         -4.75    0.00

R-squared            0.05        Mean dependent var       654.16
Adjusted R-squared   0.05        S.D. dependent var        19.05
S.E. of regression   18.58       Akaike info criterion      8.69
Sum squared resid    144315.48   Schwarz criterion          8.71
Log likelihood       -1822.25    F-statistic               22.58
Durbin-Watson stat   0.13        Prob(F-statistic)          0.00

The standard errors are homoskedasticity-only standard errors. a) What is the relationship between the t-statistic on the student-teacher ratio coefficient and the F-statistic? b) Next, two explanatory variables, the percent of English learners (EL_PCT) and expenditures per student (EXPN_STU), are added. The output is listed below. What is the relationship between the three t-statistics for the slopes and the homoskedasticity-only F-statistic now?

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/30/06 Time: 17:55
Sample: 1 420
Included observations: 420

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                  649.58        15.21         42.72    0.00
STR                 -0.29         0.48         -0.60    0.55
EL_PCT              -0.66         0.04        -16.78    0.00
EXPN_STU             0.00         0.00          2.74    0.01

R-squared            0.44        Mean dependent var       654.16
Adjusted R-squared   0.43        S.D. dependent var        19.05
S.E. of regression   14.35       Akaike info criterion      8.18
Sum squared resid    85699.71    Schwarz criterion          8.21
Log likelihood       -1712.81    F-statistic              107.45
Durbin-Watson stat   0.74        Prob(F-statistic)          0.00

Answer: (a) The F-statistic tests the null hypothesis that all slope coefficients are zero. In the case of a single explanatory variable, this is the same as testing for the significance of the explanatory variable coefficient. With a single restriction (q = 1), the homoskedasticity-only F-statistic is the square of the t-statistic. (b) There is no simple relationship between the F-statistic and the three t-statistics now. The F-statistic tests the null hypothesis H0 : βSTR = βEL_PCT = βEXPN_STU = 0 simultaneously. The t-statistics test the significance of each slope coefficient separately.

6) Consider the following multiple regression model Yi = β0 + β1 X1i + β2 X2i + β3 X3i + ui. You want to consider certain hypotheses involving more than one parameter, and you know that the regression error is homoskedastic. You decide to test the joint hypotheses using the homoskedasticity-only F-statistic. For each of the cases below specify a restricted model and indicate how you would compute the F-statistic to test for the validity of the restrictions. (a) β1 = -β2 ; β3 = 0 (b) β1 + β2 + β3 = 1 (c) β1 = β2 ; β3 = 0

Answer: (a) The restricted model is Yi = β0 + β2 (X2i - X1i) + ui and the homoskedasticity-only F-statistic would be

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/2}{SSR_{unrestricted}/(n - 3 - 1)}.$$

(b) The restricted model is (Yi - X3i) = β0 + β1 (X1i - X3i) + β2 (X2i - X3i) + ui and the homoskedasticity-only F-statistic would be

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/1}{SSR_{unrestricted}/(n - 3 - 1)}.$$

(c) The restricted model is Yi = β0 + β1 (X1i + X2i) + ui and the homoskedasticity-only F-statistic would be

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/2}{SSR_{unrestricted}/(n - 3 - 1)}.$$
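The homoskedasticity-only F-statistic used in these answers can be sketched as a small function. The function name and argument names are illustrative, not taken from any statistical package; the example call plugs in the sums of squared residuals from the two regression outputs in problem 13 of this section, where the restricted model drops two of the five regressors.

```python
def homoskedasticity_only_f(ssr_restricted, ssr_unrestricted, q, n, k_unrestricted):
    """Homoskedasticity-only F-statistic from restricted and unrestricted SSRs."""
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / (n - k_unrestricted - 1)
    return numerator / denominator


# Sums of squared residuals from the outputs in problem 13 of this section
# (q = 2 restrictions, n = 420 observations, 5 regressors when unrestricted).
f_stat = homoskedasticity_only_f(
    ssr_restricted=43792.42,
    ssr_unrestricted=30888.64,
    q=2,
    n=420,
    k_unrestricted=5,
)

print(round(f_stat, 2))
```

The value agrees with the 86.47 reported in the answer to problem 13. The function only needs the two SSRs, so it applies to any of the restricted models in (a) through (c) once both regressions have been run.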


7) Give an intuitive explanation for

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)}.$$

Name conditions under which the F-statistic is large and hence rejects the null hypothesis. Answer: First rewrite

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)} = \frac{SSR_{restricted} - SSR_{unrestricted}}{SSR_{unrestricted}} \times \frac{n - k_{unrestricted} - 1}{q}.$$

The numerator of the first fraction is the difference between the sums of squared residuals of the restricted and the unrestricted model. Anytime you place restrictions on the model, the SSR will increase (or, strictly speaking, at least not decrease). Hence if the explanatory power of your regression decreases (the SSR increases) by much as a result of the restrictions you have placed on the model, then the numerator will be large. However, the SSR depends on units of measurement. To make the first fraction independent of the units of measurement, the difference is divided by the unrestricted sum of squared residuals. The first fraction now represents the percentage increase in the SSR that results from the imposition of the restrictions. The second fraction has the degrees of freedom of the denominator in its numerator, and the degrees of freedom of the numerator in its denominator. The degrees of freedom of the numerator is the difference of the degrees of freedom of the restricted and the unrestricted regression respectively, i.e., (n - krestricted - 1) - (n - kunrestricted - 1) = kunrestricted - krestricted = q. As the degrees of freedom (number of observations) increase, we are closer to observing the population rather than the sample. Since the null hypothesis is a statement about the population, even small differences in parameters should become statistically significant eventually.

8) Prove that

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)} = \frac{(R^2_{unrestricted} - R^2_{restricted})/q}{(1 - R^2_{unrestricted})/(n - k_{unrestricted} - 1)}.$$

Answer: Note that SSR = TSS - ESS. Hence we get

$$F = \frac{\left((TSS - ESS_{restricted}) - (TSS - ESS_{unrestricted})\right)/q}{(TSS - ESS_{unrestricted})/(n - k_{unrestricted} - 1)}.$$

Next, dividing numerator and denominator by TSS gives us

$$F = \frac{\left(\dfrac{ESS_{unrestricted}}{TSS} - \dfrac{ESS_{restricted}}{TSS}\right)/q}{\left(1 - \dfrac{ESS_{unrestricted}}{TSS}\right)/(n - k_{unrestricted} - 1)}.$$

Since $R^2 = \dfrac{ESS}{TSS}$, this gives us the expression we were looking for.
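The equivalence can also be checked numerically. The sketch below uses the single-regressor output from problem 5 of this section and tests the restriction that the slope is zero, so that the restricted SSR equals the TSS. One assumption is made explicit: the TSS is recovered as (n - 1) times the squared "S.D. dependent var", which is rounded to two decimals in the output, so the result only approximately reproduces the reported F-statistic of 22.58.

```python
# Figures from the single-regressor output in problem 5 of this section.
n = 420
k_unrestricted = 1
q = 1
ssr_unrestricted = 144315.48
sd_dependent_var = 19.05  # rounded in the output, so TSS is approximate

# Total sum of squares from the sample standard deviation of the dependent
# variable; with all slopes restricted to zero, SSR_restricted equals TSS.
tss = (n - 1) * sd_dependent_var ** 2
ssr_restricted = tss

r2_unrestricted = 1 - ssr_unrestricted / tss
r2_restricted = 1 - ssr_restricted / tss  # zero for the intercept-only model

df = n - k_unrestricted - 1
f_from_ssr = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df)
f_from_r2 = ((r2_unrestricted - r2_restricted) / q) / ((1 - r2_unrestricted) / df)

# The two forms agree up to floating-point rounding.
print(f_from_ssr, f_from_r2)
```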

9) To calculate the homoskedasticity-only overall regression F-statistic, you need to compare the SSRrestricted with the SSRunrestricted. Consider the following output from a regression package, which reproduces the regression results of testscores on the student-teacher ratio, the percent of English learners, and the expenditures per student from your textbook:

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/30/06 Time: 17:55
Sample: 1 420
Included observations: 420

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                  649.58        15.21         42.72    0.00
STR                 -0.29         0.48         -0.60    0.55
EL_PCT              -0.66         0.04        -16.78    0.00
EXPN_STU             0.00         0.00          2.74    0.01

R-squared            0.44        Mean dependent var       654.16
Adjusted R-squared   0.43        S.D. dependent var        19.05
S.E. of regression   14.35       Akaike info criterion      8.18
Sum squared resid    85699.71    Schwarz criterion          8.21
Log likelihood       -1712.81    F-statistic              107.45
Durbin-Watson stat   0.74        Prob(F-statistic)          0.00

Sum of squared resid corresponds to SSRunrestricted. How are you going to find SSRrestricted?

Answer: You could simply run a regression of Testscr on a constant. For the unrestricted model

$$Testscore_i = \beta_0 + \beta_{STR} \times STR_i + \beta_{EL\_PCT} \times EL\_PCT_i + \beta_{EXPN\_STU} \times EXPN\_STU_i + u_i,$$

the restricted model is $Testscore_i = \beta_0 + u_i$. The OLS estimate $\hat{\beta}_0$ in the restricted model is the sample average of the test scores, so for the restricted sum of squared residuals you get simply the variation in test scores,

$$SSR_{restricted} = \sum_{i=1}^{n} (Testscore_i - \overline{Testscore})^2.$$

10) Adding the Percent of English Learners (PctEL) to the Student Teacher Ratio (STR) in your textbook reduced the coefficient for STR (in absolute value) from 2.28 to 1.10 with a standard error of 0.43. Construct a 90% and 99% confidence interval to test the hypothesis that the coefficient of STR is 2.28. Answer: The 90% confidence interval is (1.10 ± 1.64 × 0.43) = (0.39, 1.81). The 99% confidence interval is (1.10 ± 2.58 × 0.43) = (-0.01, 2.21). Since 2.28 lies outside both intervals, you can reject the null hypothesis at both the 10% and the 1% significance level.
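The two confidence intervals in problem 10 can be reproduced as follows. This is a minimal sketch using the large-sample critical values from the answer (1.64 and 2.58); the function and variable names are illustrative.

```python
# Values from the problem above (coefficient stated in absolute value).
estimate = 1.10
standard_error = 0.43
hypothesized_value = 2.28

# Large-sample (standard normal) critical values used in the answer.
critical_values = {"90%": 1.64, "99%": 2.58}


def confidence_interval(estimate, standard_error, critical_value):
    """Symmetric confidence interval: estimate +/- critical value x SE."""
    half_width = critical_value * standard_error
    return estimate - half_width, estimate + half_width


for level, z in critical_values.items():
    lower, upper = confidence_interval(estimate, standard_error, z)
    inside = lower <= hypothesized_value <= upper
    print(level, round(lower, 2), round(upper, 2), "contains 2.28:", inside)
```

A hypothesized value outside the 99% interval is rejected at the 1% level, which implies rejection at the 10% level as well.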


11) The homoskedasticity-only F-statistic is given by the formula

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)}$$

where SSRrestricted is the sum of squared residuals from the restricted regression, SSRunrestricted is the sum of squared residuals from the unrestricted regression, q is the number of restrictions under the null hypothesis, and kunrestricted is the number of regressors in the unrestricted regression. Prove that this formula is the same as the following formula based on the explained sum of squares (ESS) of the restricted and unrestricted regression and the total sum of squares (TSS):

$$F = \frac{(ESS_{unrestricted} - ESS_{restricted})/q}{(TSS - ESS_{unrestricted})/(n - k_{unrestricted} - 1)}$$

Answer: Note that SSR = TSS - ESS. Hence we get

$$F = \frac{\left((TSS - ESS_{restricted}) - (TSS - ESS_{unrestricted})\right)/q}{(TSS - ESS_{unrestricted})/(n - k_{unrestricted} - 1)},$$

which gives the above expression once the TSS terms in the numerator are cancelled.

12) Trying to remember the formula for the homoskedasticity-only F-statistic, you forgot whether you subtract the restricted SSR from the unrestricted SSR or the other way around. Your professor has provided you with a table containing critical values for the F distribution. How can this be of help? Answer: All the values in the F table are positive. Hence the correct answer must produce a positive value in the numerator and denominator (or negative expressions in both). But

$$F = \frac{(SSR_{?} - SSR_{?})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)}$$

and hence the denominator is positive. Hence for the numerator to be also positive, you must have SSRrestricted - SSRunrestricted.

13) Consider the following regression output for an unrestricted and a restricted model.

Unrestricted model:

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/31/06 Time: 17:35
Sample: 1 420
Included observations: 420

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                  658.47         7.68         85.73    0.00
STR                 -0.76         0.23         -3.27    0.00
EL_PCT              -0.19         0.03         -5.62    0.00
LOG(AVGINC)         11.69         1.74          6.71    0.00
MEAL_PCT            -0.37         0.04         -9.53    0.00
CALW_PCT            -0.07         0.06         -1.21    0.23

R-squared            0.80        Mean dependent var       654.16
Adjusted R-squared   0.79        S.D. dependent var        19.05
S.E. of regression   8.64        Akaike info criterion      7.16
Sum squared resid    30888.64    Schwarz criterion          7.22
Log likelihood       -1498.51    F-statistic              324.94
Durbin-Watson stat   1.51        Prob(F-statistic)          0.00

Restricted model:

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/31/06 Time: 17:37
Sample: 1 420
Included observations: 420

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                  593.48         6.96         85.32    0.00
STR                 -0.39         0.27         -1.42    0.16
EL_PCT              -0.43         0.03        -14.34    0.00
LOG(AVGINC)         28.36         1.40         20.32    0.00

R-squared            0.71        Mean dependent var       654.16
Adjusted R-squared   0.71        S.D. dependent var        19.05
S.E. of regression   10.26       Akaike info criterion      7.50
Sum squared resid    43792.42    Schwarz criterion          7.54
Log likelihood       -1571.82    F-statistic              342.98
Durbin-Watson stat   1.30        Prob(F-statistic)          0.00

Calculate the homoskedasticity-only F-statistic and determine whether the null hypothesis can be rejected at the 5% significance level. Answer: There are two restrictions, namely H0 : βmeal_pct = 0, βcalw_pct = 0. The F-statistic is

$$F = \left(\frac{43792.42}{30888.64} - 1\right) \times \frac{420 - 5 - 1}{2} = 86.47.$$

The 5% critical value from the F2,∞ distribution is 3.00. Hence we easily reject the two restrictions at the 5% level of significance.

14) Consider the regression output from the following unrestricted model:

Unrestricted model:

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/31/06 Time: 17:35
Sample: 1 420
Included observations: 420

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                  658.47         7.68         85.73    0.00
STR                 -0.76         0.23         -3.27    0.00
EL_PCT              -0.19         0.03         -5.62    0.00
LOG(AVGINC)         11.69         1.74          6.71    0.00
MEAL_PCT            -0.37         0.04         -9.53    0.00
CALW_PCT            -0.07         0.06         -1.21    0.23

R-squared            0.80        Mean dependent var       654.16
Adjusted R-squared   0.79        S.D. dependent var        19.05
S.E. of regression   8.64        Akaike info criterion      7.16
Sum squared resid    30888.64    Schwarz criterion          7.22
Log likelihood       -1498.51    F-statistic              324.94
Durbin-Watson stat   1.51        Prob(F-statistic)          0.00


To test for the null hypothesis that neither coefficient on the percent eligible for subsidized lunch nor the coefficient on the percent on public income assistance is statistically significant, you have your statistical package plot the confidence set. Interpret the graph below and explain what it tells you about the null hypothesis.

Answer: The dot in the center of the ellipse is the point estimate for the two coefficients (-0.37, -0.07). Since the (0,0) point is not inside the ellipse, you reject the null hypothesis.

15) Consider the regression model Yi = β0 + β1 X1i + β2 X2i + β3 X3i + ui. Use “Approach #2” from Section 7.3 to transform the regression so that you can use a t-statistic to test:

$$\beta_1 = \frac{\beta_2}{\beta_3}$$

Answer: This is not a linear restriction. Hence you cannot use the F-test to test for its validity.


16) Consider the following Cobb-Douglas production function

$$Y_i = A K_i^{\beta_1} L_i^{\beta_2} e^{u_i}$$

(where Y is output, A is the level of technology, K is the capital stock, and L is the labor force), which has been linearized here (by using logarithms) to look as follows:

$$y_i = \beta_0^* + \beta_1 k_i + \beta_2 l_i + u_i$$

Assuming that the errors are heteroskedastic, you want to test for constant returns to scale. Using a t-statistic and “Approach #2,” how would you proceed? Answer: Under constant returns to scale, β1 + β2 = 1. Hence you need to transform the unrestricted model above by subtracting $l_i$ from both sides, and by adding and subtracting $\beta_1 l_i$. This results in

$$(y_i - l_i) = \beta_0^* + \beta_1 (k_i - l_i) + (\beta_1 + \beta_2 - 1)\, l_i + u_i.$$

The left hand side variable is now the (log of the) output-labor ratio, and the first explanatory variable on the right hand side is the (log of the) capital-labor ratio. If the null hypothesis of constant returns to scale holds, then the coefficient on l should be zero. This can be directly tested using a t-statistic.

17) Consider the following two models to explain testscores.

Model 1:

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/31/06 Time: 17:52
Sample: 1 420
Included observations: 420

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                  658.55         7.68         85.70    0.00
STR                 -0.73         0.23         -3.18    0.00
EL_PCT              -0.18         0.03         -5.52    0.00
LOG(AVGINC)         11.57         1.74          6.65    0.00
MEAL_PCT            -0.40         0.02        -13.09    0.00

R-squared            0.80        Mean dependent var       654.16
Adjusted R-squared   0.79        S.D. dependent var        19.05
S.E. of regression   8.64        Akaike info criterion      7.16
Sum squared resid    30998.01    Schwarz criterion          7.21
Log likelihood       -1499.25    F-statistic              405.36
Durbin-Watson stat   1.52        Prob(F-statistic)          0.00

Model 2:

Dependent Variable: TESTSCR
Method: Least Squares
Date: 07/31/06 Time: 17:56
Sample: 1 420
Included observations: 420

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                  620.92         7.27         85.41    0.00
STR                 -0.66         0.25         -2.58    0.01
EL_PCT              -0.39         0.03        -14.05    0.00
LOG(AVGINC)         21.87         1.52         14.41    0.00
CALW_PCT            -0.41         0.05         -8.22    0.00

R-squared            0.75        Mean dependent var       654.16
Adjusted R-squared   0.75        S.D. dependent var        19.05
S.E. of regression   9.53        Akaike info criterion      7.36
Sum squared resid    37659.29    Schwarz criterion          7.41
Log likelihood       -1540.13    F-statistic              315.31
Durbin-Watson stat   1.41        Prob(F-statistic)          0.00

Explain why you cannot use the F-test in this situation to discriminate between Model 1 and Model 2. Answer: Neither model is contained (“nested”) in the other, in the sense that you cannot place restrictions on Model 1 to obtain Model 2 (and vice versa). Hence there is no unrestricted and restricted model in this case.

18) Your textbook has emphasized that testing two hypotheses sequentially is not the same as testing them simultaneously. Consider the confidence set below, where you are testing the hypothesis H0 : β5 = 0, β6 = 0.

Your statistical package has also generated a dotted area, which corresponds to drawing two confidence intervals for the respective coefficients. For each case where the ellipse does not coincide in area with the corresponding rectangle, indicate what your decision would be if you relied on the two confidence intervals vs. the ellipse generated by the F-statistic. Answer: The following possible outcomes can be seen in the figure above: (i) both the F-statistic and the two confidence intervals generate the same result; (ii) you do not reject the null hypothesis using the F-statistic, but you do so by using the confidence intervals (these are the points in the area at the “tip” of the ellipse, which lie outside the rectangle); (iii) you reject the null hypothesis using the F-statistic but not using the confidence intervals (these are the points in the corners of the rectangle, which lie outside the ellipse).


19) You have estimated the following regression to explain hourly wages, using a sample of 250 individuals:

AHE = -2.44 - 1.57 × DFemme + 0.27 × DMarried + 0.59 × Educ + 0.04 × Exper - 0.60 × DNonwhite + 0.13 × NCentral - 0.11 × South
      (1.29)  (0.33)          (0.36)            (0.09)        (0.01)         (0.49)             (0.59)           (0.58)
R2 = 0.36, SER = 2.74, n = 250

Numbers in parentheses are heteroskedasticity-robust standard errors. Add “*” (5%) and “**” (1%) to indicate statistical significance of the coefficients. Answer:

AHE = -2.44 - 1.57** × DFemme + 0.27 × DMarried + 0.59** × Educ + 0.04** × Exper - 0.60 × DNonwhite + 0.13 × NCentral - 0.11 × South
      (1.29)  (0.33)            (0.36)            (0.09)          (0.01)           (0.49)             (0.59)           (0.58)

20) You have estimated the following regression to explain hourly wages, using a sample of 250 individuals:

AHE = -2.44 - 1.57 × DFemme + 0.27 × DMarried + 0.59 × Educ + 0.04 × Exper - 0.60 × DNonwhite + 0.13 × NCentral - 0.11 × South
      (1.29)  (0.33)          (0.36)            (0.09)        (0.01)         (0.49)             (0.59)           (0.58)
R2 = 0.36, SER = 2.74, n = 250

Test the null hypothesis that the coefficients on DMarried, DNonwhite, and the two regional variables, NCentral and South, are zero. The F-statistic for the null hypothesis βmarried = βnonwhite = βncentral = βsouth = 0 is 0.61. Do you reject the null hypothesis? Answer: The critical value for F4,∞ is 2.37 at the 5% significance level and 3.32 at the 1% significance level. Hence you cannot reject the null hypothesis at either level.

21) Using the California School data set from your textbook, you decide to run a regression of the average reading score (ScrRead) on the average mathematics score (ScrMaths). The result is as follows, where the numbers in parentheses are homoskedasticity-only standard errors:

ScrRead = 8.47 + 0.9895 × ScrMaths, n = 420, R2 = 0.85, SER = 7.8
         (13.20) (0.0202)

You believe that the average mathematics score is an unbiased predictor of the average reading score. Consider the above regression to be the unrestricted regression from which you would calculate SSRunrestricted. How would you find the SSRrestricted? How many restrictions would you have to impose? Answer: Since the restricted regression would read ScrRead = 0 + 1 × ScrMaths, you would need to calculate

$$SSR_{restricted} = \sum_{i=1}^{n} (ScrRead_i - ScrMaths_i)^2.$$

Using the F-test to simultaneously test for a zero intercept coefficient and a unit slope coefficient, you would have to impose two restrictions (q = 2).


22) Looking at formula (7.13) in your textbook for the homoskedasticity-only F-statistic,

$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k_{unrestricted} - 1)},$$

give three conditions under which, ceteris paribus, you would find a large value, and hence would be likely to reject the null hypothesis.
Answer: The F-statistic will be larger for (i) larger percentage changes in the SSR between the restricted and the unrestricted regression; (ii) a smaller number of restrictions (q); (iii) a larger sample size (a larger number of degrees of freedom).
23) Analyzing a regression using data from a sub-sample of the Current Population Survey with about 4,000 observations, you realize that the regression R² and the adjusted R² are almost identical. Why is that the case? In your textbook, you were told that the regression R² will almost always increase when you add an explanatory variable, but that the adjusted measure does not have to increase with such an addition. Can this still be true?
Answer: The difference between the two measures is the adjustment by the degrees of freedom. Once the number of observations becomes very large, it hardly matters how many explanatory variables you have in your regression, since (n-1) is roughly the same as (n-k-1) and their ratio is close to one. As a result, the adjusted measure will also almost always increase with the addition of another explanatory variable.
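The degrees-of-freedom argument in the answer to question 23 can be checked numerically. The sketch below assumes the textbook definition of the adjusted R², 1 - [(n-1)/(n-k-1)] × (1 - R²); the value R² = 0.36 and the choice of k = 7 regressors are illustrative assumptions, and `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a regression with n observations and k regressors."""
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)


r2 = 0.36  # illustrative value, not taken from the CPS sub-sample
k = 7      # illustrative number of regressors

# Gap between R-squared and adjusted R-squared for a small and a large sample.
gap_small = r2 - adjusted_r2(r2, n=250, k=k)
gap_large = r2 - adjusted_r2(r2, n=4000, k=k)

print(f"n = 250:  R2 - adjusted R2 = {gap_small:.4f}")
print(f"n = 4000: R2 - adjusted R2 = {gap_large:.4f}")
```

With about 4,000 observations the ratio (n-1)/(n-k-1) is so close to one that the two measures differ only in the third decimal place.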


Chapter 8 Nonlinear Regression Functions 8.1 Multiple Choice 1) In nonlinear models, the expected change in the dependent variable for a change in one of the explanatory variables is given by A) △Y = f(X1 + X1 , X2 ,... Xk). B) △Y = f(X1 + △X1 , X2 + △X2 ,..., Xk+ △Xk)- f(X1 , X2 ,...Xk). C) △Y = f(X1 + △X1 , X2 ,..., Xk)- f(X1 , X2 ,...Xk). D) △Y = f(X1 + X1 , X2 ,..., Xk)- f(X1 , X2 ,...Xk). Answer: C 2) The interpretation of the slope coefficient in the model Yi = β0 + β1 ln(Xi) + ui is as follows: A) a 1% change in X is associated with a β1 % change in Y. B) a 1% change in X is associated with a change in Y of 0.01 β1 . C) a change in X by one unit is associated with a β1 100% change in Y. D) a change in X by one unit is associated with a β1 change in Y. Answer: B 3) The interpretation of the slope coefficient in the model ln(Yi) = β0 + β1 Xi + ui is as follows: A) a 1% change in X is associated with a β1 % change in Y. B) a change in X by one unit is associated with a 100 β1 % change in Y. C) a 1% change in X is associated with a change in Y of 0.01 β1 . D) a change in X by one unit is associated with a β1 change in Y. Answer: B 4) The interpretation of the slope coefficient in the model ln(Yi) = β0 + β1 ln(Xi)+ ui is as follows: A) a 1% change in X is associated with a β1 % change in Y. B) a change in X by one unit is associated with a β1 change in Y. C) a change in X by one unit is associated with a 100 β1 % change in Y. D) a 1% change in X is associated with a change in Y of 0.01 β1 . Answer: A 5) In the case of regression with interactions, the coefficient of a binary variable should be interpreted as follows: A) there are really problems in interpreting these, since the ln(0) is not defined. B) for the case of interacted regressors, the binary variable coefficient represents the various intercepts for the case when the binary variable equals one. C) first set all explanatory variables to one, with the exception of the binary variables. 
Then allow for each of the binary variables to take on the value of one sequentially. The resulting predicted value indicates the effect of the binary variable. D) first compute the expected values of Y for each possible case described by the set of binary variables. Next compare these expected values. Each coefficient can then be expressed either as an expected value or as the difference between two or more expected values. Answer: D
6) The following interactions between binary and continuous variables are possible, with the exception of A) Yi = β0 + β1 Xi + β2 Di + β3 (Xi × Di) + ui. B) Yi = β0 + β1 Xi + β2 (Xi × Di) + ui. C) Yi = (β0 + Di) + β1 Xi + ui. D) Yi = β0 + β1 Xi + β2 Di + ui. Answer: C

7) An example of the interaction term between two independent, continuous variables is A) Yi = β0 + β1 Xi + β2 Di + β3 (Xi × Di) + ui. B) Yi = β0 + β1 X1i + β2 X2i + ui. C) Yi = β0 + β1 D1i + β2 D2i + β3 (D1i × D2i) + ui. D) Yi = β0 + β1 X1i + β2 X2i + β3 (X1i × X2i) + ui. Answer: D
8) Including an interaction term between two independent variables, X1 and X2, allows for the following except: A) the interaction term lets the effect on Y of a change in X1 depend on the value of X2. B) the interaction term coefficient is the effect of a unit increase in X1 and X2 above and beyond the sum of the individual effects of a unit increase in the two variables alone. C) the interaction term coefficient is the effect of a unit increase in (X1 × X2). D) the interaction term lets the effect on Y of a change in X2 depend on the value of X1. Answer: C
9) A nonlinear function A) makes little sense, because variables in the real world are related linearly. B) can be adequately described by a straight line between the dependent variable and one of the explanatory variables. C) is a concept that only applies to the case of a single or two explanatory variables since you cannot draw a line in four dimensions. D) is a function with a slope that is not constant. Answer: D
10) An example of a quadratic regression model is A) Yi = β0 + β1 X + β2 Y² + ui. B) Yi = β0 + β1 ln(X) + ui. C) Yi = β0 + β1 X + β2 X² + ui. D) Yi² = β0 + β1 X + ui. Answer: C
11) (Requires Calculus) In the equation TestScore = 607.3 + 3.85 Income – 0.0423 Income², the following income level results in the maximum test score A) 607.3. B) 91.02. C) 45.50. D) cannot be determined without a plot of the data. Answer: C
12) To decide whether Yi = β0 + β1 X + ui or ln(Yi) = β0 + β1 X + ui fits the data better, you cannot consult the regression R² because A) ln(Y) may be negative for 0<Y<1. B) the TSS are not measured in the same units between the two models. C) the slope no longer indicates the effect of a unit change of X on Y in the log-linear model. D) the regression R² can be greater than one in the second model. Answer: B
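The calculus behind question 11 can be sketched as follows: setting the derivative of β0 + β1 X + β2 X² to zero gives X = -β1/(2 β2), which is a maximum when β2 is negative. The helper name `quadratic_peak` is hypothetical; the coefficients are those of the estimated test score equation.

```python
def quadratic_peak(b1, b2):
    """Value of X at which b0 + b1*X + b2*X**2 has zero slope.

    The point is a maximum when b2 < 0 and a minimum when b2 > 0.
    """
    if b2 == 0:
        raise ValueError("b2 must be nonzero for a quadratic function")
    return -b1 / (2.0 * b2)


# Coefficients on Income and Income squared from question 11.
peak_income = quadratic_peak(3.85, -0.0423)
print(f"Test scores peak at Income = {peak_income:.2f} (thousands of dollars)")
```

The result agrees with answer C up to rounding.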


13) You have estimated the following equation: TestScore = 607.3 + 3.85 Income – 0.0423 Income², where TestScore is the average of the reading and math scores on the Stanford 9 standardized test administered to 5th grade students in 420 California school districts in 1998 and 1999. Income is the average annual per capita income in the school district, measured in thousands of 1998 dollars. The equation A) suggests a positive relationship between test scores and income for most of the sample. B) is positive until a value of Income of 610.81. C) does not make much sense since the square of income is entered. D) suggests a positive relationship between test scores and income for all of the sample. Answer: A
14) A polynomial regression model is specified as: A) Yi = β0 + β1 Xi + β2 Xi² + ··· + βr Xi^r + ui. B) Yi = β0 + β1 Xi + β1² Xi + ··· + β1^r Xi + ui. C) Yi = β0 + β1 Xi + β2 Yi² + ··· + βr Yi^r + ui. D) Yi = β0 + β1 X1i + β2 X2 + β3 (X1i × X2i) + ui. Answer: A
15) For the polynomial regression model, A) you need new estimation techniques since the OLS assumptions do not apply any longer. B) the techniques for estimation and inference developed for multiple regression can be applied. C) you can still use OLS estimation techniques, but the t-statistics do not have an asymptotic normal distribution. D) the critical values from the normal distribution have to be changed to 1.96², 1.96³, etc. Answer: B
16) To test whether or not the population regression function is linear rather than a polynomial of order r, A) check whether the regression R² for the polynomial regression is higher than that of the linear regression. B) compare the TSS from both regressions. C) look at the pattern of the coefficients: if they change from positive to negative to positive, etc., then the polynomial regression should be used. D) use the test of (r-1) restrictions using the F-statistic. Answer: D
17) The best way to interpret polynomial regressions is to A) take a derivative of Y with respect to the relevant X. B) plot the estimated regression function and to calculate the estimated effect on Y associated with a change in X for one or more values of X. C) look at the t-statistics for the relevant coefficients. D) analyze the standard error of estimated effect. Answer: B


18) The exponential function A) is the inverse of the natural logarithm function. B) does not play an important role in modeling nonlinear regression functions in econometrics. C) can be written as exp(e^x). D) is e^x, where e is 3.1415…. Answer: A
19) The following are properties of the logarithm function with the exception of A) ln(1/x) = -ln(x). B) ln(a + x) = ln(a) + ln(x). C) ln(ax) = ln(a) + ln(x). D) ln(x^a) = a ln(x). Answer: B
20) The binary variable interaction regression A) can only be applied when there are two binary variables, but not three or more. B) is the same as testing for differences in means. C) cannot be used with logarithmic regression functions because ln(0) is not defined. D) allows the effect of changing one of the binary independent variables to depend on the value of the other binary variable. Answer: D
21) In the regression model Yi = β0 + β1 Xi + β2 Di + β3 (Xi × Di) + ui, where X is a continuous variable and D is a binary variable, β3 A) indicates the slope of the regression when D=1. B) has a standard error that is not normally distributed even in large samples since D is not a normally distributed variable. C) indicates the difference in the slopes of the two regressions. D) has no meaning since (Xi × Di) = 0 when Di = 0. Answer: C
22) In the regression model Yi = β0 + β1 Xi + β2 Di + β3 (Xi × Di) + ui, where X is a continuous variable and D is a binary variable, β2 A) is the difference in means in Y between the two categories. B) indicates the difference in the intercepts of the two regressions. C) is usually positive. D) indicates the difference in the slopes of the two regressions. Answer: B
23) In the regression model Yi = β0 + β1 Xi + β2 Di + β3 (Xi × Di) + ui, where X is a continuous variable and D is a binary variable, to test that the two regressions are identical, you must use the A) t-statistic separately for β2 = 0, β3 = 0. B) F-statistic for the joint hypothesis that β0 = 0, β1 = 0. C) t-statistic separately for β3 = 0. D) F-statistic for the joint hypothesis that β2 = 0, β3 = 0. Answer: D


24) In the model Yi = β0 + β1 X1 + β2 X2 + β3 (X1 × X2) + ui, the expected effect △Y/△X1 is A) β1 + β3 X2. B) β1. C) β1 + β3. D) β1 + β3 X1. Answer: A
25) In the log-log model, the slope coefficient indicates A) the effect that a unit change in X has on Y. B) the elasticity of Y with respect to X. C) △Y / △X. D) (△Y/△X) × (Y/X). Answer: B
26) In the model ln(Yi) = β0 + β1 Xi + ui, the elasticity of E(Y|X) with respect to X is A) β1 X B) β1 C) β1 X / (β0 + β1 X)

D) cannot be calculated because the function is non-linear Answer: A
27) Assume that you had estimated the following quadratic regression model: TestScore = 607.3 + 3.85 Income - 0.0423 Income². If income increased from 10 to 11 ($10,000 to $11,000), then the predicted effect on test scores would be A) 3.85 B) 3.85-0.0423 C) Cannot be calculated because the function is non-linear D) 2.96 Answer: D

28) Consider the polynomial regression model of degree r, Yi = β0 + β1 Xi + β2 Xi² + ... + βr Xi^r + ui. The null hypothesis that the regression is linear and the alternative that it is a polynomial of degree r correspond to A) H0 : βr = 0 vs. βr ≠ 0 B) H0 : βr = 0 vs. β1 ≠ 0 C) H0 : β3 = 0, ..., βr = 0, vs. H1 : all βj ≠ 0, j = 3, ..., r D) H0 : β2 = 0, β3 = 0 ..., βr = 0, vs. H1 : at least one βj ≠ 0, j = 2, ..., r Answer: D
29) Consider the following least squares specification between test scores and income: TestScore = 557.8 + 36.42 ln(Income). According to this equation, a 1% increase in income is associated with an increase in test scores of A) 0.36 points B) 36.42 points C) 557.8 points D) cannot be determined from the information given here Answer: A
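The arithmetic behind questions 27 and 29 can be sketched in Python. In the quadratic model the predicted change is the difference between the fitted values at the two income levels, so the intercept drops out; in the linear-log model the exact change for a p percent increase in X is β1 × ln(1 + p/100), which is close to 0.01 × β1 for p = 1. The function names are hypothetical.

```python
import math


def quadratic_effect(b1, b2, x_from, x_to):
    """Predicted change in Y when X moves from x_from to x_to in b0 + b1*X + b2*X**2."""
    return b1 * (x_to - x_from) + b2 * (x_to**2 - x_from**2)


def linear_log_effect(b1, pct_change):
    """Predicted change in Y for a pct_change percent increase in X in b0 + b1*ln(X)."""
    return b1 * math.log(1.0 + pct_change / 100.0)


# Question 27: income rises from 10 to 11 (thousands of dollars).
effect_q27 = quadratic_effect(3.85, -0.0423, 10, 11)

# Question 29: a 1% increase in income in the linear-log specification.
effect_q29 = linear_log_effect(36.42, 1.0)

print(f"Question 27: predicted change in test scores = {effect_q27:.2f}")
print(f"Question 29: predicted change in test scores = {effect_q29:.2f}")
```

Both values round to the answers given in the key.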

30) Consider the population regression of log earnings [Yi, where Yi = ln(Earnings i)] against two binary variables: whether a worker is married (D1i, where D1i=1 if the ith person is married) and the worker’s gender ( D2i, where D2i=1 if the ith person is female), and the product of the two binary variables Yi = β0 + β1 D1i + β2 D2i + β3 (D1i×D2i) + ui. The interaction term A) allows the population effect on log earnings of being married to depend on gender B) does not make sense since it could be zero for married males C) indicates the effect of being married on log earnings D) cannot be estimated without the presence of a continuous variable Answer: A


8.2 Essays and Longer Questions 1) Females, it is said, make 70 cents to the dollar in the United States. To investigate this phenomenon, you collect data on weekly earnings from 1,744 individuals, 850 females and 894 males. Next, you calculate their average weekly earnings and find that the females in your sample earned $346.98, while the males made $517.70. (a) Calculate the female earnings in percent of the male earnings. How would you test whether or not this difference is statistically significant? Give two approaches. (b) A peer suggests that this is consistent with the idea that there is discrimination against females in the labor market. What is your response? (c) You recall from your textbook that additional years of experience are supposed to result in higher earnings. You reason that this is because experience is related to “on the job training.” One frequently used measure for (potential) experience is “Age-Education-6.” Explain the underlying rationale. Assuming, heroically, that education is constant across the 1,744 individuals, you consider regressing earnings on age and a binary variable for gender. You estimate two specifications initially: Earn = 323.70 + 5.15 × Age – 169.78 × Female, R2 =0.13, SER=274.75 (21.18) (0.55) (13.06) Ln(Earn) = 5.44 + 0.015 × Age – 0.421 × Female, R2 =0.17, SER=0.75 (0.08) (0.002) (0.036) where Earn are weekly earnings in dollars, Age is measured in years, and Female is a binary variable, which takes on the value of one if the individual is a female and is zero otherwise. Interpret each regression carefully. For a given age, how much less do females earn on average? Should you choose the second specification on grounds of the higher regression R2 ? (d) Your peer points out to you that age-earning profiles typically take on an inverted U-shape. To test this idea, you add the square of age to your log-linear regression.

Ln(Earn) = 3.04 + 0.147 × Age – 0.421 × Female – 0.0016 Age², R² = 0.28, SER = 0.68
(0.18) (0.009) (0.033) (0.0001)
Interpret the results again. Are there strong reasons to assume that this specification is superior to the previous one? Why is the increase of the Age coefficient so large relative to its value in (c)?
(e) What other factors may play a role in earnings determination?
Answer: (a) Female earnings are at 67 percent of male earnings. One way to test for statistical significance is the difference-in-means test described in section 3.4 of the text, using the t-statistic for the comparison of two means given in equation (3.20). The alternative is to run a regression of earnings on a constant and a binary variable, which takes on the value of one for females and is zero otherwise. Using a t-test on the slope of the binary variable amounts to the same test as the difference in means (section 4.7 in the text).
(b) Differences in attributes of the individuals, such as education, ability, and tenure with an employer, have not been taken into account. Hence, in itself, this is weak evidence, at best, for discrimination.
(c) The potential experience variable is a reasonable proxy for “on the job training” if the individual started to work after completing her or his education, and stayed employed thereafter. Hence this is a better proxy for some than for others. The linear specification suggests that for every additional year the individual receives $5.15 of additional weekly earnings on average. Females make $169.78 less than males at a given age. There is no data close to the origin, so the intercept should not be interpreted. The regression explains 13 percent of the variation in earnings. The log-linear specification says that earnings increase by 1.5 percent for every additional year in an individual’s life. Females earn approximately 42.1 percent less than males at a given age. Again, the intercept should not be interpreted. The regression explains 17 percent of the variation in the log of earnings. You should not prefer this specification over the linear one on grounds of the higher regression R², since the two cannot be compared as a result of the difference in the units of measurement of the dependent variable.
(d) The coefficient on the added variable is statistically significant and has resulted in a substantial increase in the regression R². The increase in the Age coefficient is due to the fact that earnings increase more initially than later in life or, mathematically speaking, it compensates for the negative coefficient on Age², which lowers earnings as individuals become older.
(e) Students’ answers will differ, but education, ability, regional differences, race, and professional choice are often mentioned.
2) An extension of the Solow growth model that includes human capital in addition to physical capital suggests that investment in human capital (education) will increase the wealth of a nation (per capita income). To test this hypothesis, you collect data for 104 countries and perform the following regression:
RelPersInc = 0.046 – 5.869 × gpop + 0.738 × SK + 0.055 × Educ, R² = 0.775, SER = 0.1377
(0.079) (2.238) (0.294) (0.010)

where RelPersInc is GDP per worker relative to the United States, gpop is the average population growth rate, 1980 to 1990, SK is the average investment share of GDP from 1960 to 1990, and Educ is the average educational attainment in years for 1985. Numbers in parentheses are heteroskedasticity-robust standard errors.
(a) Interpret the results and indicate whether or not the coefficients are significantly different from zero. Do the coefficients have the expected sign?
(b) To test for equality of the coefficients between the OECD and other countries, you introduce a binary variable (DOECD), which takes on the value of one for the OECD countries and is zero otherwise. To conduct the test for equality of the coefficients, you estimate the following regression:
RelPersInc = -0.068 – 0.063 × gpop + 0.719 × SK + 0.044 × Educ
(0.072) (2.271) (0.365) (0.012)
+ 0.381 × DOECD – 8.038 × (DOECD × gpop) – 0.430 × (DOECD × SK)
(0.184) (5.366) (0.768)
+ 0.003 × (DOECD × Educ), R² = 0.845, SER = 0.116
(0.018)
Write down the two regression functions, one for the OECD countries, the other for the non-OECD countries. The F-statistic that all coefficients involving DOECD are zero is 6.76. Find the corresponding critical value from the F table and decide whether or not the coefficients are equal across the two sets of countries.
(c) Given your answer in the previous question, you want to investigate further. You first force the same slopes across all countries, but allow the intercept to differ. That is, you reestimate the above regression but set βDOECD×gpop = βDOECD×SK = βDOECD×Educ = 0. The t-statistic for DOECD is 4.39. Is the coefficient, which was 0.241, statistically significant?
(d) Your final regression allows the slopes to differ in addition to the intercept. The F-statistic for βDOECD×gpop = βDOECD×SK = βDOECD×Educ = 0 is 1.05. What is your decision? Each one of the t-statistics is also smaller than the critical value from the standard normal table. Which test should you use?
(e) Looking at the tests in the two previous questions, what is your conclusion?


Answer: (a) A one percentage point decrease in the population growth rate increases GDP per worker relative to the United States by roughly 0.06. An increase in the investment share of 0.1 results in an increase of GDP per worker relative to the United States by approximately 0.07. For every additional year of average educational attainment, the increase is 0.055. The intercept should not be interpreted. The regression explains 77.5 percent of the variation in relative productivity. All slope coefficients are significantly different from zero at conventional levels. All coefficients carry the expected sign.
(b) The regression for the non-OECD countries is RelPersInc = -0.068 – 0.063 × gpop + 0.719 × SK + 0.044 × Educ. For the OECD countries we get RelPersInc = 0.313 – 8.101 × gpop + 0.289 × SK + 0.047 × Educ. The critical value is 3.32 at the 1% level and hence you can reject the null hypothesis that the coefficients are equal.
(c) Given the critical value, the coefficient is statistically significant, that is, you can reject βDOECD = 0.
(d) Given the critical value of 3.78 at the 1% level, you cannot reject the null hypothesis that the additional coefficients are all zero. The F-test is the proper procedure to use when testing for simultaneous restrictions.
(e) There is evidence that the slopes can be set equal. However, there seems to be a level difference between the two groups of countries.
3) You have been asked by your younger sister to help her with a science fair project. During the previous years she already studied why objects float and there also was the inevitable volcano project. Having learned regression techniques recently, you suggest that she investigate the weight-height relationship of 4th to 6th graders. Her presentation topic will be to explain how people at carnivals predict weight. You collect data for roughly 100 boys and girls between the ages of nine and twelve and estimate for her the following relationship:

Weight = 45.59 + 4.32 × Height4, R² = 0.55, SER = 15.69
(3.81) (0.46)
where Weight is in pounds, and Height4 is inches above 4 feet.
(a) Interpret the results.
(b) You remember from the medical literature that females in the adult population are, on average, shorter than males and weigh less. You also seem to have heard that females, controlling for height, are supposed to weigh less than males. To see if this relationship holds for children, you add a binary variable (DFY) that takes on the value one for girls and is zero otherwise. You estimate the following regression function:
Weight = 36.27 + 17.33 × DFY + 5.32 × Height4 – 1.83 × (DFY × Height4), R² = 0.58, SER = 15.41
(5.99) (7.36) (0.80) (0.90)
Are the signs on the new coefficients as expected? Are the new coefficients individually statistically significant? Write down and sketch the regression function for boys and girls separately.
(c) The medical literature provides you with the following information for median height and weight of nine to twelve-year-olds:

Median Height and Weight for Children, Age 9-12

              Boys' Weight   Boys' Height   Girls' Weight   Girls' Height
9-year-old    60             52             60              49
10-year-old   70             54             70              52
11-year-old   77             56             80              57
12-year-old   87             58.5           92              60

Insert two height/weight measures each for boys and girls and see how accurate your predictions are.
(d) The F-statistic for testing that the intercept and slope for boys and girls are identical is 2.92. Find the critical values at the 5% and 1% level, and make a decision. Allowing for a different intercept with an identical slope results in a t-statistic for DFY of (–0.35). Having identical intercepts but different slopes gives a t-statistic on (DFY × Height4) of (–0.35) also. Does this affect your previous conclusion?
(e) Assume that you also wanted to test if the relationship changes by age. Briefly outline how you would specify the regression including the gender binary variable and an age binary variable (Older) that takes on a value of one for eleven to twelve year olds and is zero otherwise. Indicate in a table of two rows and two columns how the estimated relationship would vary between younger girls, older girls, younger boys, and older boys.
Answer: (a) For every inch above 4 feet, children of that age group gain roughly 4 pounds. A student who is 4 feet tall weighs approximately 45.6 pounds. The regression explains 55 percent of the weight variation in children of that age group.
(b) Shorter girls weigh more than boys, and taller boys weigh more than girls on average. Given your prior expectations, this is somewhat unexpected. The coefficients involving the binary variable are statistically significant at conventional levels. The regression for boys is Weight = 36.27 + 5.32 × Height4. For girls it is Weight = 53.60 + 3.49 × Height4.

(c) The “XX” points mark a female, and the “XY” a male. The regression line predicts a 9-year-old boy to weigh 57.6 pounds, an 11-year-old boy to weigh 78.8 pounds, a 10-year-old girl to weigh 67.6 pounds, and a 12-year-old girl to weigh 95.5 pounds. Hence the predicted weights are quite close to the medians.

(d) The critical value is 3.00 at the 5% level, and 4.61 at the 1% level. Hence you cannot reject equality of the two coefficients. The previous conclusion is unaffected since the test was for both hypotheses to hold simultaneously. The t-statistics indicate that imposing the equality and testing for either the slope or the intercept to be significantly different between boys and girls, does not result in a different coefficient either. (e) Weight = β0 + β1 DFY + β2 Height4 + β3 (DFY × Height4) + β4 Older + β5 (Older × Height4) + u

            Boys                             Girls
Younger     β0 + β2 Height4                  (β0 + β1) + (β2 + β3) Height4
Older       (β0 + β4) + (β2 + β5) Height4    (β0 + β1 + β4) + (β2 + β3 + β5) Height4
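The predictions in part (c) of the answer to question 3 follow mechanically from the interaction regression in part (b). The sketch below recomputes them from the reported coefficients; the dictionary `COEF` and the function `predict_weight` are hypothetical names, and Height4 is taken to be height in inches minus 48, as defined in the question.

```python
# Estimated coefficients from the regression in part (b) of question 3.
COEF = {"const": 36.27, "dfy": 17.33, "height4": 5.32, "dfy_height4": -1.83}


def predict_weight(height_inches, is_girl):
    """Predicted weight in pounds from the boys/girls interaction regression."""
    height4 = height_inches - 48.0  # inches above 4 feet
    dfy = 1.0 if is_girl else 0.0   # binary variable: one for girls
    return (
        COEF["const"]
        + COEF["dfy"] * dfy
        + COEF["height4"] * height4
        + COEF["dfy_height4"] * dfy * height4
    )


# Median heights from the table in part (c): (label, height in inches, girl?).
cases = [
    ("9-year-old boy", 52.0, False),
    ("11-year-old boy", 56.0, False),
    ("10-year-old girl", 52.0, True),
    ("12-year-old girl", 60.0, True),
]
for label, height, is_girl in cases:
    print(f"{label}: predicted weight = {predict_weight(height, is_girl):.1f} pounds")
```

Setting `is_girl` to `True` shifts the intercept by the DFY coefficient and the slope by the interaction coefficient, which is the same logic that fills in the two-by-two table in part (e).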

4) You have learned that earnings functions are one of the most investigated relationships in economics. These typically relate the logarithm of earnings to a series of explanatory variables such as education, work experience, gender, race, etc. (a) Why do you think that researchers have preferred a log-linear specification over a linear specification? In addition to the interpretation of the slope coefficients, also think about the distribution of the error term. (b) To establish age-earnings profiles, you regress ln(Earn) on Age, where Earn is weekly earnings in dollars, and Age is in years. Plotting the residuals of the regression against age for 1,744 individuals looks as shown in the figure:


Do you sense a problem? (c) You decide, given your knowledge of age-earning profiles, to allow the regression line to differ for the below and above 40 years age category. Accordingly you create a binary variable, Dage, that takes the value one for age 39 and below, and is zero otherwise. Estimating the earnings equation results in the following output (using heteroskedasticity-robust standard errors): LnEarn = 6.92 – 3.13 × Dage – 0.019 × Age + 0.085 × (Dage × Age), R2 =0.20, SER =0.721. (38.33) (0.22) (0.004) (0.005) Sketch both regression lines: one for the age category 39 years and under, and one for 40 and above. Does it make sense to have a negative sign on the Age coefficient? Predict the ln( earnings) for a 30 year old and a 50 year old. What is the percentage difference between these two? (d) The F-statistic for the hypothesis that both slopes and intercepts are the same is 124.43. Can you reject the null hypothesis? (e) What other functional forms should you consider? Answer: (a) The error variance and the variance of the dependent variable are related. Given that the dependent variable (earnings) is not normally distributed, it is difficult to postulate that the error variance is normally distributed. Using logarithms results in a distribution that is closer to a normal. In addition, there seems to be a better fit for the log-linear specification, and the coefficients can be interpreted as percentage changes. (b) There seems to be a pattern in the residuals when sorted by age. This suggests a misspecified functional form. (c) According to the specification, earnings increase with age until the individual is 39 years old. It is only from age 40 onwards that the regression predicts a negative relationship between earnings and age. According to the estimates, a 30-year-old would have ln(earnings) of 5.77, while the predicted value for a 50-year-old would be 5.97. The difference between the two is approximately 20 percent.


(d) The critical value from the F-table is 4.61 at the 1% level. Hence the null hypothesis is rejected.
(e) Instead of the inverted V-shape for the above regression, an inverted U-shape would most likely produce a better fit. This can be generated through the use of a polynomial regression model of degree 2.
5) Sports economics typically looks at winning percentages of sports teams as one of various outputs, and estimates production functions by analyzing the relationship between the winning percentage and inputs. In Major League Baseball (MLB), the determinants of winning are quality pitching and batting. You have data for all 30 MLB teams for the 1999 season. Pitching quality is approximated by “Team Earned Run Average” (ERA), and hitting quality by “On Base Plus Slugging Percentage” (OPS).

Summary of the Distribution of Winning Percentage, On Base Plus Slugging Percentage, and Team Earned Run Average for MLB in 1999

                                                    Percentile
                     Average   Standard deviation   10%     25%     40%     50% (median)   60%     75%     90%
Team ERA             4.71      0.53                 3.84    4.35    4.72    4.78           4.91    5.06    5.25
OPS                  0.778     0.034                0.720   0.754   0.769   0.780          0.790   0.798   0.820
Winning Percentage   0.50      0.08                 0.40    0.43    0.46    0.48           0.49    0.59    0.60

Your regression output is:
Winpct = –0.19 – 0.099 × teamera + 1.490 × ops, R² = 0.92, SER = 0.02.
(0.08) (0.008) (0.126)
(a) Interpret the regression. Are the results statistically significant and important?
(b) There are two leagues in MLB, the American League (AL) and the National League (NL). One major difference is that the pitcher in the AL does not have to bat. Instead there is a “designated hitter” in the hitting line-up. You are concerned that, as a result, there is a different effect of pitching and hitting in the AL from the NL. To test this hypothesis, you allow the AL regression to have a different intercept and different slopes from the NL regression. You therefore create a binary variable for the American League (DAL) and estimate the following specification:

Winpct = –0.29 + 0.10 × DAL – 0.100 × teamera + 0.008 × (DAL × teamera)
(0.12) (0.24) (0.008) (0.018)
+ 1.622 × ops – 0.187 × (DAL × ops), R² = 0.92, SER = 0.02.
(0.163) (0.160)
What is the regression for winning percentage in the AL and NL? Next, calculate the t-statistics and say something about the statistical significance of the AL variables. Since you have allowed all slopes and the intercept to vary between the two leagues, what would the results imply if all coefficients involving DAL were statistically significant?
(c) You remember that sequentially testing the significance of slope coefficients is not the same as testing for their significance simultaneously. Hence you ask your regression package to calculate the F-statistic that all three coefficients involving the binary variable for the AL are zero. Your regression package gives a value of 0.35. Looking at the critical value from your F-table, can you reject the null hypothesis at the 1% level? Should you worry about the small sample size?
Answer: (a) Lowering the team ERA by one results in a winning percentage increase of roughly ten percentage points. Increasing the OPS by 0.1 generates a higher winning percentage of approximately 15 percentage points. The regression explains 92 percent of the variation in winning percentages. Both slope coefficients are statistically significant, and given the small differences in winning percentage, they are also important.
(b) NL: Winpct = –0.29 – 0.100 × teamera + 1.622 × ops. AL: Winpct = –0.19 – 0.092 × teamera + 1.435 × ops. The t-statistics for all variables involving DAL are, in order of appearance in the above regression, 0.42, 0.44, and –1.17. None of the coefficients is statistically significant individually. If these were statistically significant, then this would indicate that the coefficients vary between the two leagues. Hence it would suggest that the introduction of the designated hitter might have changed the relationship.
(c) The critical value of the F-statistic is 3.78 at the 1% level, and hence you cannot reject the null hypothesis that all three coefficients are zero. However, in a sample this small the F-statistic is not really distributed as $F_{3,\infty}$, and, as a result, inference is problematic here.

6) There has been much debate about the impact of minimum wages on employment and unemployment. While most of the focus has been on the employment-to-population ratio of teenagers, you decide to check if aggregate state unemployment rates have been affected. Your idea is to see if state unemployment rates for the 48 contiguous U.S. states in 1985 can predict the unemployment rate for the same states in 1995, and if this prediction can be improved upon by entering a binary variable for “high impact” minimum wage states. One labor economist labeled states as high impact if a large fraction of teenagers was affected by the 1990 and 1991 federal minimum wage increases. Your first regression results in the following output:

$$Ur_i^{95} = \underset{(0.56)}{3.19} + \underset{(0.07)}{0.27} \times Ur_i^{85}, \quad R^2 = 0.21, \; SER = 1.031$$

(a) Sketch the regression line and add a 45° line to the graph. Interpret the regression results. What would the interpretation be if the fitted line coincided with the 45° line?

(b) Adding the binary variable DhiImpact by allowing the slope and intercept to differ results in the following fitted line:

$$Ur_i^{95} = \underset{(0.66)}{4.02} + \underset{(0.09)}{0.16} \times Ur_i^{85} - \underset{(0.89)}{3.25} \times \text{DhiImpact} + \underset{(0.11)}{0.38} \times (\text{DhiImpact} \times Ur_i^{85}), \quad R^2 = 0.31, \; SER = 0.987$$

The F-statistic for the null hypothesis that both parameters involving the high impact minimum wage variable are zero is 42.16. Can you reject the null hypothesis that both coefficients are zero? Sketch the two regression lines together with the 45° line and interpret the results again.

(c) To check the robustness of these results, you repeat the exercise using a new binary variable for the so-called mining states (Dmining), i.e., the eleven states that have at least three percent of their total state earnings derived from oil, gas extraction, and coal mining in the 1980s. This results in the following output:

$$Ur_i^{95} = \underset{(0.65)}{4.04} + \underset{(0.09)}{0.15} \times Ur_i^{85} - \underset{(0.90)}{2.92} \times \text{Dmining} + \underset{(0.10)}{0.37} \times (\text{Dmining} \times Ur_i^{85}), \quad R^2 = 0.31, \; SER = 0.997$$

How confident are you that the previously found effect is due to minimum wages?

Answer:
(a) A one percentage point higher unemployment rate in 1985 is associated with a 0.27 percentage point higher unemployment rate in 1995. Put differently, if one state had a one percentage point higher unemployment rate in 1985 than another state, then this difference would shrink, on average, to 0.27 percentage points in 1995. 21 percent of the variation in 1995 state unemployment rates is explained by the regression. If the fitted line coincided with the 45° line, then the unemployment rates in 1995 would remain unchanged when compared to 1985. The estimated regression implies, unrealistically, mean reversion in the unemployment rates.

(b) The critical value for the F-statistic is 4.61 at the 1% level and hence the null hypothesis that both coefficients are zero in the population is rejected. (The sample size is small, however, so the distribution of the test statistic is not really known.) The intercept for the high-impact states is smaller and the slope is steeper. This suggests that for high-impact states there is less of a mean reversion effect present: if high-impact states had high 1985 unemployment rates, then they are expected to have higher unemployment rates in 1995 when compared to a low-impact state. High and low unemployment rates are thereby more persistent for high-impact states.


(c) The results here are similar to those in (b) in that the regression for the mining states is steeper than the one for the other states. Perhaps omitted variables play a role here, such as relative (oil) price shocks that affect some states more than others. Oil prices fell considerably over the time period and it is possible that the high-impact binary variable coefficient picks up the effect of omitted variables. Including more explanatory variables would be desirable.

7) Labor economists have extensively researched the determinants of earnings. Investment in human capital, measured in years of education, and on-the-job training are some of the most important explanatory variables in this research. You decide to apply earnings functions to the field of sports economics by finding the determinants for baseball pitcher salaries. You collect data on 455 pitchers for the 1998 baseball season and estimate the following equation using OLS and heteroskedasticity-robust standard errors:

$$\ln(\text{Earn}_i) = \underset{(0.08)}{12.45} + \underset{(0.026)}{0.052} \times \text{Years} + \underset{(0.00020)}{0.00089} \times \text{Innings} + \underset{(0.0018)}{0.0032} \times \text{Saves} - \underset{(0.0168)}{0.0085} \times \text{ERA}, \quad R^2 = 0.45, \; SER = 0.874$$

where Earn is annual salary in dollars, Years is number of years in the major leagues, Innings is number of innings pitched during the career before the 1998 season, Saves is number of saves during the career before the 1998 season, and ERA is the earned run average before the 1998 season.

(a) What happens to earnings when the pitcher stays in the league for one additional year? Compare the salaries of two relievers, one with 10 more saves than the other. What effect does pitching 100 more innings have on the salary of the pitcher? What is the effect of reducing his ERA by 1.5? Do the signs correspond to your expectations? Explain.

(b) Are the individual coefficients statistically significant? Indicate the level of significance you used and the type of alternative hypothesis you considered.

(c) Although you are quite impressed with the fit of the regression, someone suggests that you should include the square of years and innings as additional explanatory variables. Your results change as follows:

$$\ln(\text{Earn}_i) = \underset{(0.05)}{12.15} + \underset{(0.039)}{0.160} \times \text{Years} + \underset{(0.00030)}{0.00268} \times \text{Innings} + \underset{(0.0010)}{0.0063} \times \text{Saves} - \underset{(0.0165)}{0.0584} \times \text{ERA} - \underset{(0.0026)}{0.0165} \times \text{Years}^2 - \underset{(0.00000012)}{0.00000045} \times \text{Innings}^2, \quad R^2 = 0.69, \; SER = 0.666$$

What is her reasoning? Are the coefficients of the quadratic terms statistically significant? Are they meaningful?

(d) Calculate the effect of moving from two to three years, as opposed to from 12 to 13 years.

(e) You also decide to test the specification for stability across leagues (National League and American League) by including a dummy variable for the National League and allowing the intercept and all slopes to differ. The resulting F-statistic for restricting all coefficients that involve the National League dummy variable to zero is 0.40. Compare this to the relevant critical value from the table and decide whether or not these additional variables should be included.

Answer:
(a) For staying an additional year in the league, the pitcher receives a 5.2 percent increase in earnings. On average, the reliever with 10 more saves ends up with 3.2 percent higher earnings. Pitching 100 additional innings results in 8.9 percent higher earnings, and lowering the ERA by 1.5 increases earnings by 1.3 percent. ERA, innings pitched, and number of saves are all quality of input indicators and should therefore have the signs as in the regression above. Years in the major leagues stands as a proxy for on-the-job training and should therefore carry a positive sign.

(b) Given that there is a prior expectation on the sign of the coefficients, you should conduct a one-sided hypothesis test. All variables with the exception of ERA carry statistically significant coefficients at the 5% level.

(c) Allowing for the quadratic terms to enter results in an inverted U-shape for the relationship between the log of earnings and both years in the league and innings pitched. Both coefficients are highly significant and have resulted also in a significant ERA coefficient.
(d) Having played for two years and staying for one more year in the league results in an earnings increase of 7.8 percent, while staying for an additional year after 12 years in the majors results in a predicted decrease of 25.3 percent.

(e) $F_{7,\infty} = 2.01$ at the 5% level. Hence you cannot reject the null hypothesis of equality of coefficients across leagues.

8) After analyzing the age-earnings profile for 1,744 workers as shown in the figure, it becomes clear to you that the relationship cannot be approximately linear.

You estimate the following polynomial regression model, controlling for the effect of gender by using a binary variable that takes on the value of one for females and is zero otherwise:

$$\text{Earn} = \underset{(283.11)}{-795.90} + \underset{(29.29)}{82.93} \times \text{Age} - \underset{(1.06)}{1.69} \times \text{Age}^2 + \underset{(0.016)}{0.015} \times \text{Age}^3 - \underset{(0.0009)}{0.0005} \times \text{Age}^4 - \underset{(12.45)}{163.19} \times \text{Female}, \quad R^2 = 0.225, \; SER = 259.78$$

(a) Test for the significance of the $\text{Age}^4$ coefficient. Describe the general strategy to determine the appropriate degree of the polynomial.

(b) You run two further regressions. Present an argument as to which one you should use for further analysis.

$$\text{Earn} = \underset{(120.13)}{-683.21} + \underset{(9.27)}{65.83} \times \text{Age} - \underset{(0.22)}{1.05} \times \text{Age}^2 + \underset{(0.002)}{0.005} \times \text{Age}^3 - \underset{(12.45)}{163.23} \times \text{Female}, \quad R^2 = 0.225, \; SER = 259.73$$

$$\text{Earn} = \underset{(51.58)}{-344.88} + \underset{(2.64)}{41.48} \times \text{Age} - \underset{(0.03)}{0.45} \times \text{Age}^2 - \underset{(12.47)}{163.81} \times \text{Female}, \quad R^2 = 0.222, \; SER = 260.22$$

(c) Sketch the graph of fitted earnings of males against age of your preferred regression. Does this make sense? Are you concerned about the negative coefficient on the regression intercept? What is the implication for female earners in this sample?

(d) Explain how you would calculate the effect of changing age by one year on earnings, holding constant the gender variable. Finally, briefly describe how you would calculate the standard errors of the estimated effect.

Answer:
(a) The coefficient has a t-statistic of 0.56 and hence is not statistically significant at conventional levels. The strategy is described in section 6.2 of the textbook. Considering first a polynomial of degree r, the coefficient associated with the largest value of r is tested for significance. From there, a sequential hypothesis testing procedure should be followed.

(b) The coefficient of $\text{Age}^3$ is statistically significant at the 1% level using a one-sided hypothesis. The polynomial of degree three therefore seems to be the appropriate regression.

(c)

There is little difference between the two fits for values between the age of 25 and 60. The inverted U-shape is well known to exist for age-earnings profiles, and hence the plot makes sense. There is no interpretation for the intercept, since there is no data close to the origin. Females earn significantly less at every age level.

(d) Since this is a nonlinear relationship, the effect will depend on the age level. This is described in section 6.1 of the textbook. In essence, the predicted earnings value for one age level has to be computed first. Next, the same has to be done for the age level plus one. Finally the two values are differenced to find the change in earnings associated with the age level. For the polynomial of degree 3, the first task is to consider the estimated change in earnings associated with a change in age by one year, say from 30 to 31. This is given by

$$\Delta\hat{Y} = \hat\beta_1 \times (31 - 30) + \hat\beta_2 (31^2 - 30^2) + \hat\beta_3 (31^3 - 30^3) \quad \text{or} \quad \Delta\hat{Y} = \hat\beta_1 + 61\hat\beta_2 + 2791\hat\beta_3.$$

The standard error of the estimated effect is then given by

$$SE(\Delta\hat{Y}) = \frac{|\Delta\hat{Y}|}{\sqrt{F}}, \quad \text{where} \quad F = \left[\frac{\hat\beta_1 + 61\hat\beta_2 + 2791\hat\beta_3}{SE(\hat\beta_1 + 61\hat\beta_2 + 2791\hat\beta_3)}\right]^2.$$

A 95% confidence interval for the change in the expected value of earnings is $(\hat\beta_1 + 61\hat\beta_2 + 2791\hat\beta_3) \pm 1.96 \times SE(\hat\beta_1 + 61\hat\beta_2 + 2791\hat\beta_3)$. Obviously these expressions get quite complicated once you go beyond a quadratic.

9) Earnings functions attempt to find the determinants of earnings, using both continuous and binary variables. One of the central questions analyzed in this relationship is the returns to education.

(a) Collecting data from 253 individuals, you estimate the following relationship

$$\ln(\text{Earn}_i) = \underset{(0.14)}{0.54} + \underset{(0.011)}{0.083} \times \text{Educ}, \quad R^2 = 0.20, \; SER = 0.445$$

where Earn is average hourly earnings and Educ is years of education. What is the effect of an additional year of schooling? If you had a strong belief that years of high school education were different from college education, how would you modify the equation? What if your theory suggested that there was a “diploma effect”?

(b) You read in the literature that there should also be returns to on-the-job training. To approximate on-the-job training, researchers often use the so-called Mincer or potential experience variable, which is defined as Exper = Age - Educ - 6. Explain the reasoning behind this approximation. Is it likely to resemble years of employment for various sub-groups of the labor force?

(c) You incorporate the experience variable into your original regression

$$\ln(\text{Earn}_i) = \underset{(0.16)}{-0.01} + \underset{(0.012)}{0.101} \times \text{Educ} + \underset{(0.006)}{0.033} \times \text{Exper} - \underset{(0.0001)}{0.0005} \times \text{Exper}^2, \quad R^2 = 0.34, \; SER = 0.405$$

What is the effect of an additional year of experience for a person who is 40 years old and had 12 years of education? What about for a person who is 60 years old with the same education background?

(d) Test for the significance of each of the coefficients of the added variables. Why has the coefficient on education changed so little? Sketch the age-(log)earnings profile for workers with 8 years of education and 16 years of education.

(e) You want to find the effect of introducing two variables, gender and marital status.
Accordingly you specify a binary variable that takes on the value of one for females and is zero otherwise (Female), and another binary variable that is one if the worker is married but is zero otherwise (Married). Adding these variables to the regressors results in:

$$\ln(\text{Earn}_i) = \underset{(0.16)}{0.21} + \underset{(0.012)}{0.093} \times \text{Educ} + \underset{(0.006)}{0.032} \times \text{Exper} - \underset{(0.0001)}{0.0005} \times \text{Exper}^2 - \underset{(0.049)}{0.289} \times \text{Female} + \underset{(0.056)}{0.062} \times \text{Married}, \quad R^2 = 0.43, \; SER = 0.378$$

Are the coefficients of the two added binary variables individually statistically significant? Are they economically important? In percentage terms, how much less do females earn per hour, controlling for education and experience? How much more do married people make? What is the percentage difference in earnings between a single male and a married female? What is the marriage differential between males and females?

(f) In your final specification, you allow for the binary variables to interact. The results are as follows:

$$\ln(\text{Earn}_i) = \underset{(0.16)}{0.14} + \underset{(0.011)}{0.093} \times \text{Educ} + \underset{(0.006)}{0.032} \times \text{Exper} - \underset{(0.001)}{0.0005} \times \text{Exper}^2 - \underset{(0.075)}{0.158} \times \text{Female} + \underset{(0.080)}{0.173} \times \text{Married} - \underset{(0.097)}{0.218} \times (\text{Female} \times \text{Married}), \quad R^2 = 0.44, \; SER = 0.375$$

Repeat the exercise in (e) of calculating the various percentage differences between gender and marital status.

Answer:
(a) One additional year of education carries an 8.3 percent increase, or a return, on earnings. You would need additional data to see if this coefficient was different for high school versus college education. Including both variables in the regression would then allow you to test for equality of the coefficients. A “diploma effect” could be studied by creating a binary variable for a high school diploma, a junior college diploma, a B.A. or B.Sc. diploma, and so forth.

(b) The idea is that everybody works except in the first six years of life and during the time spent in school/university for education. This approximation will work better for people with a strong attachment to the labor force. It will not work well for females and those who are frequently unemployed or out of the workforce.

(c) For the first person, the Exper variable increases from 22 to 23, and results in a 1.1 percent earnings increase. For the 60 year old, there is an expected decrease of 1 percent.

(d) Both coefficients are highly significant using conventional levels of significance.
The fact that the coefficient on the education variable hardly changed suggests that education and experience are not highly correlated.

(e) The coefficient for the female binary variable is statistically significant even at the 1% level. The coefficient for the married binary variable only has a t-statistic of 1.11 and is not statistically significant at the 10% level. Both coefficients indicate economic importance, since females make approximately 29 percent less than males and married people earn roughly 6 percent more. A married female earns roughly 23 percent less than a single male. Married females earn 29 percent less than married males, the same percentage that single females earn less than single males.

(f) The default is the single male. Single females earn 15.8 percent less. Married males earn 17.3 percent more. Married females earn 20.3 percent less. Comparing married females with married males now results in a percentage differential of 37.6 percent in favor of the males.

10) Among the most frequently estimated equations in the macroeconomic growth literature are so-called convergence regressions. In essence the average per capita income growth rate is regressed on the beginning-of-period per capita income level to see if countries that were further behind initially grew faster. Some macroeconomic models make this prediction, once other variables are controlled for. To investigate this matter, you collect data from 104 countries for the sample period 1960-1990 and estimate the following relationship (numbers in parentheses are heteroskedasticity-robust standard errors):

$$\text{g6090} = \underset{(0.009)}{0.020} - \underset{(0.241)}{0.360} \times \text{gpop} + \underset{(0.001)}{0.004} \times \text{Educ} - \underset{(0.009)}{0.053} \times \text{RelProd60}, \quad R^2 = 0.332, \; SER = 0.013$$

where g6090 is the growth rate of GDP per worker for the 1960-1990 sample period, RelProd60 is the initial starting level of GDP per worker relative to the United States in 1960, gpop is the average population growth rate of the country, and Educ is educational attainment in years for 1985.

(a) What is the effect of an increase of 5 years in educational attainment? What would happen if a country could implement policies to cut population growth by one percent? Are all coefficients significant at the 5% level? If one of the coefficients is not significant, should you automatically eliminate its variable from the list of explanatory variables?

(b) The coefficient on the initial condition has to be significantly negative to suggest conditional convergence. Furthermore, the larger this coefficient, in absolute terms, the faster the convergence will take place. It has been suggested to you to interact education with the initial condition to test for additional effects of education on growth. To test for this possibility, you estimate the following regression:

$$\text{g6090} = \underset{(0.009)}{0.015} - \underset{(0.238)}{0.323} \times \text{gpop} + \underset{(0.001)}{0.005} \times \text{Educ} - \underset{(0.013)}{0.051} \times \text{RelProd60} - \underset{(0.0015)}{0.0028} \times (\text{Educ} \times \text{RelProd60}), \quad R^2 = 0.346, \; SER = 0.013$$

Write down the effect of an additional year of education on growth. West Germany has a value for RelProd60 of 0.57, while Brazil’s value is 0.23. What is the predicted growth rate effect of adding one year of education in both countries? Does this predicted growth rate make sense?

(c) What is the implication for the speed of convergence? Is the interaction effect statistically significant?

(d) Convergence regressions are basically of the type $\Delta \ln Y_t = \beta_0 - \beta_1 \ln Y_0$, where $\Delta$ might be the change over a longer time period, 30 years, say, and the average growth rate is used on the left-hand side. You note that, since $\Delta \ln Y_t = \ln Y_t - \ln Y_0$, the equation can be rewritten as $\ln Y_t = \beta_0 + (1 - \beta_1) \ln Y_0$. Over a century ago, Sir Francis Galton first coined the term “regression” by analyzing the relationship between the height of children and the height of their parents. Estimating a function of the type above, he found a positive intercept and a slope between zero and one. He therefore concluded that heights would revert to the mean. Since ultimately this would imply the height of the population being the same, his result has become known as “Galton’s Fallacy.” Your estimate of $\beta_1$ above is approximately 0.05. Do you see a parallel to Galton’s Fallacy?

Answer:
(a) Increasing educational attainment by 5 years results in an increase of productivity growth of 2 percent. Decreasing the population growth rate by one percent increases productivity growth by 0.4 percent. All coefficients are statistically significant at the 5% level with the exception of population growth. You should not eliminate a variable simply because it is not statistically significant. It is better to report the statistics and let the reader decide.

(b)

$\dfrac{\Delta \text{g6090}}{\Delta \text{Educ}} = 0.005 - 0.0028 \times \text{RelProd60}$. For West Germany, the effect is 0.3 percent, while for Brazil it is 0.4 percent. These are small gains, but they accumulate over time.

(c) $\dfrac{\Delta \text{g6090}}{\Delta \text{RelProd60}} = -0.051 - 0.0028 \times \text{Educ}$, which therefore depends on educational attainment. Countries with higher educational attainment will converge faster. The coefficient has a t-statistic of 1.87 and is therefore statistically significant at the 5% level using a one-sided hypothesis test.

(d) The above regressions generate a mean reversion outcome. Interpreted literally, the implication is that all countries end up with the same productivity or per capita income, just as all persons would be of the same height. It can be shown that Galton’s Fallacy is the result of errors-in-variables, which biases the slope coefficient downward. This topic is covered in Chapter 7. The solution is to use instrumental variable techniques, also discussed in Chapter 10. The literature in this area has done so, and the convergence result persists.

11) Pages 283-284 in your textbook contain an analysis of the “Return to Education and the Gender Gap.” Column (4) in Table 8.1 displays regression results using the 2009 Current Population Survey. The equation below shows the regression result for the same specification, but using the 2005 Current Population Survey. Interpret the major results.

$$\ln(\text{earnings}) = \underset{(0.018)}{1.215} + \underset{(0.0011)}{0.0899} \times \text{educ} - \underset{(0.022)}{0.521} \times \text{DFemme} + \underset{(0.0016)}{0.0180} \times (\text{DFemme} \times \text{educ}) + \underset{(0.0008)}{0.0232} \times \text{exper} - \underset{(0.000018)}{0.000368} \times \text{exper}^2 - \underset{(0.006)}{0.058} \times \text{Midwest} - \underset{(0.0078)}{0.0098} \times \text{South} - \underset{(0.0030)}{0.030} \times \text{West}$$

Answer: The return to education for males is approximately 9%, and its coefficient is highly statistically significant. For females, the return is slightly higher, approximately 11%; the coefficient on the interaction term has a t-statistic of 11.25. Since the binary variable for females is interacted with the number of years of education, the gender gap depends on the number of years of education. For the typical high school graduate (12 years of education), the gender gap is approximately 27%, while for the typical college graduate (16 years of education) the gender gap narrows to 19%. The potential experience variable enters in an inverted U-shape, which is to be expected given the shape of age-earnings profiles and the fact that potential experience depends on the age of the individual. There is a declining marginal value for each year of potential experience until it eventually becomes negative. Northeast is the omitted region, and all other regions have lower (log) earnings, ranging from 0.8% in the South to 5.8% in the Midwest. All coefficients are statistically significant.

8.3 Mathematical and Graphical Problems 1) Give at least three examples from economics where you expect some nonlinearity in the relationship between variables. Interpret the slope in each case. Answer: Answers will vary by student. Typical answers involve the Cobb-Douglas production function, the Phillips curve, earnings functions, and (given the textbook discussion) student performance and income.


2) Suggest a transformation in the variables that will linearize the deterministic part of the population regression functions below. Write the resulting regression function in a form that can be estimated by using OLS.

(a) $Y_i = \beta_0 X_{1i}^{\beta_1} X_{2i}^{\beta_2}$

(b) $Y_i = \dfrac{X_i}{\beta_0 + \beta_1 X_i}$

(c) $Y_i = \dfrac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}$

(d) $Y_i = \beta_0 X_{1i}^{\beta_1} e^{\beta_2 X_{2i}}$

Answer:
(a) $\ln(Y_i) = \ln(\beta_0) + \beta_1 \ln(X_{1i}) + \beta_2 \ln(X_{2i})$

(b) $\dfrac{1}{Y_i} = \beta_0 \dfrac{1}{X_i} + \beta_1$

(c) $\ln\left(\dfrac{Y_i}{1 - Y_i}\right) = \beta_0 + \beta_1 X_i$

(d) $\ln(Y_i) = \ln(\beta_0) + \beta_1 \ln(X_{1i}) + \beta_2 X_{2i}$

3) Indicate whether or not you can linearize the regression functions below so that OLS estimation methods can be applied:

(a) $Y_i = e^{\beta_0 + \beta_1 X_i + u_i}$

(b) $Y_i = \beta_1 X_{1i}^{\beta_1} X_{2i}^{\beta_2} + u_i$

Answer:
(a) The function can be linearized by taking logs on both sides.
(b) The function cannot be linearized due to the additive error term.

4) Choose at least three different nonlinear functional forms of a single independent variable and sketch the relationship between the dependent and independent variable.

Answer: Answers will vary by student. Most commonly used forms are the quadratic regression, the inverse (in X) regression, and the log-log model.
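The log transformation in problem 2(a) can be checked numerically. The sketch below assumes NumPy is available and uses made-up parameter values ($\beta_0 = 2$, $\beta_1 = 0.5$, $\beta_2 = 0.3$) on noise-free synthetic data; it regresses $\ln(Y)$ on $\ln(X_1)$ and $\ln(X_2)$ and recovers the parameters of the original multiplicative function. It is an illustration only, not part of the test bank.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration.
BETA0, BETA1, BETA2 = 2.0, 0.5, 0.3

rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 10.0, size=200)
x2 = rng.uniform(1.0, 10.0, size=200)

# Deterministic part of 2(a): Y = beta0 * X1^beta1 * X2^beta2 (no error term).
y = BETA0 * x1**BETA1 * x2**BETA2

# Linearized form: ln(Y) = ln(beta0) + beta1*ln(X1) + beta2*ln(X2).
design = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)

beta0_hat = np.exp(coef[0])  # undo the log on the intercept
beta1_hat, beta2_hat = coef[1], coef[2]
print(beta0_hat, beta1_hat, beta2_hat)
```

With an additive error term, as in problem 3(b), taking logs no longer produces an equation that is linear in the parameters, which is why that case cannot be handled this way.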


5) In the case of perfect multicollinearity, OLS is unable to estimate the slope coefficients of the variables involved. Assume that you have included both $X_1$ and $X_2$ as explanatory variables, and that $X_2 = X_1^2$, so that there is an exact relationship between two explanatory variables. Does this pose a problem for estimation?

Answer: There is no problem for estimation, since the second explanatory variable is not linearly related to the first. This is an example of a polynomial regression model of degree 2, which is frequently estimated in econometrics.
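The distinction between an exact nonlinear relationship and an exact linear one can be seen from the rank of the regressor matrix. The following sketch assumes NumPy and uses five made-up values of $X_1$; it compares a design matrix containing $X_1$ and $X_1^2$ with one containing $X_1$ and $2X_1$.

```python
import numpy as np

# Made-up regressor values, for illustration only.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Intercept, X1, and X2 = X1^2: the case described in the problem.
quadratic = np.column_stack([np.ones_like(x1), x1, x1**2])

# Contrast: X2 = 2*X1 is an exact LINEAR function of X1.
collinear = np.column_stack([np.ones_like(x1), x1, 2.0 * x1])

rank_quadratic = np.linalg.matrix_rank(quadratic)  # full column rank
rank_collinear = np.linalg.matrix_rank(collinear)  # rank deficient
print(rank_quadratic, rank_collinear)
```

Only the second matrix loses a rank, so only there would OLS be unable to separate the two slope coefficients.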


6) The figure shows a plot and a fitted linear regression line of the age-earnings profile of 1,744 individuals, taken from the Current Population Survey.

(a) Describe the problems in predicting earnings using the fitted line. What would the pattern of the residuals look like for the age category under 40?

(b) What alternative functional form might fit the data better?

(c) What other variables might you want to consider in specifying the determinants of earnings?

Answer:
(a) There would be many overpredictions for the age category under 40, and hence more negative residuals.
(b) It would be better to fit a quadratic here, i.e., a polynomial regression model, which would produce an inverted U-shape.
(c) Answers will vary by student, but education, gender, race, tenure with an employer, professional choice, and ability are typically present in answers.

7) (Requires Calculus) Show that for the log-log model the slope coefficient is the elasticity.

Answer: Consider the deterministic part $Y = AX^{\beta_1}$. Then $\ln(Y) = \beta_0 + \beta_1 \ln(X)$, where $\beta_0 = \ln(A)$. Now

$$\frac{\partial \ln(Y)}{\partial \ln(X)} = \beta_1 = \frac{\frac{1}{Y}\,\partial Y}{\frac{1}{X}\,\partial X} = \frac{\partial Y}{\partial X}\,\frac{X}{Y}.$$

Alternatively you can derive the same result by taking the derivative $\dfrac{\partial Y}{\partial X}$ from $Y = AX^{\beta_1}$.

8) Assume that you had data for a cross-section of 100 households with data on consumption and personal disposable income. If you fit a linear regression function regressing consumption on disposable income, what prior expectations do you have about the slope and the intercept? The slope of this regression function is called the “marginal propensity to consume.” If, instead, you fit a log-log model, then what is the interpretation of the slope? Do you have any prior expectation about its size?

Answer: For the log-log specification, the slope is the elasticity. Since there are many theories that predict a constant average propensity to consume, the elasticity should equal one.
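The claim that the log-log slope is the elasticity can be verified numerically. The sketch below assumes NumPy and hypothetical values $A = 3$ and $\beta_1 = 0.8$; it approximates $\partial Y / \partial X$ by a central difference and shows that $(\partial Y/\partial X)(X/Y)$ equals $\beta_1$ at every value of $X$ tried.

```python
import numpy as np

# Hypothetical values for illustration; any A > 0 and BETA1 would do.
A, BETA1 = 3.0, 0.8


def y(x):
    """Deterministic part of the log-log model: Y = A * X^beta1."""
    return A * x**BETA1


x = np.array([0.5, 1.0, 10.0, 100.0])
h = 1e-6 * x                                  # small step, relative to x
dy_dx = (y(x + h) - y(x - h)) / (2.0 * h)     # central-difference derivative
elasticity = dy_dx * x / y(x)                 # (dY/dX) * (X/Y)
print(elasticity)
```

The elasticity is the same at all four points, which is what distinguishes the log-log model from the linear model, where the elasticity varies with the level of $X$ and $Y$.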


9) The textbook shows that $\ln(x + \Delta x) - \ln(x) \cong \dfrac{\Delta x}{x}$. Show that this is equivalent to the following approximation: $\ln(1 + y) \cong y$ if $y$ is small. You use this idea to estimate a demand for money function, which is of the form $m = \beta_0 \times GDP^{\beta_1} \times (1 + R)^{\beta_2} \times e^u$, where m is the quantity of (real) money, GDP is the value of (real) Gross Domestic Product, and R is the nominal interest rate. You collect the quarterly data from the Federal Reserve Bank of St. Louis data bank (“FRED”), which lists the money supply and GDP in billions of dollars, prices as an index, and nominal interest rates in percentage points per year. You generate the variables in your regression program as follows: m = (money supply)/(price index), GDP = (Gross Domestic Product)/(price index), and R = nominal interest rate in percentage points per annum. Next you perform the log-transformations on the real money supply, real GDP, and on (1 + R). Can you foresee a problem in using this transformation?

Answer: $\ln(x + \Delta x) - \ln(x) = \ln\left(\dfrac{x + \Delta x}{x}\right) = \ln\left(1 + \dfrac{\Delta x}{x}\right) = \ln(1 + y)$, where $y = \dfrac{\Delta x}{x}$. Let $y = 0.05$, then $\ln(1 + y) = 0.049 \approx 0.05$. Note that this approximation does not hold well for larger fractions, such as 0.60. The interest rate is listed in percentage points. Entering R as 5, rather than 0.05, means that $\ln(1 + R)$ is no longer approximately equal to R, so $\beta_2$ cannot be interpreted as a semi-elasticity.

10) You have estimated an earnings function, where you regressed the log of earnings on a set of continuous explanatory variables (in levels) and two binary variables, one for gender and the other for marital status. One of the explanatory variables is education.

(a) Interpret the education coefficient.

(b) Next, specify the binary variables and an equation, where the default is a single male, without allowing for interaction between marital status and gender. Indicate the coefficients that measure the effect of a single male, single female, married male, and married female.

(c) Finally allow for an interaction between the gender and marital status binary variables. Repeat the exercise of writing down the various effects based on the female/male and single/married status. Why is the latter approach more general than the former?

Answer:
(a) The coefficient on education gives you the return to education, i.e., if education increased by one year, then by how many percent do earnings increase?

(b) Let DGender equal one if the individual is a female, and be zero otherwise. DMarried takes on a value of one if the individual is married and is zero otherwise. The regression is

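$$\widehat{\ln(\text{Earn})} = \hat\beta_0 + \hat\beta_1 \text{DGender} + \hat\beta_2 \text{DMarried} + \dots$$

Single male: $\hat\beta_0$; single female: $\hat\beta_0 + \hat\beta_1$; married male: $\hat\beta_0 + \hat\beta_2$; married female: $\hat\beta_0 + \hat\beta_1 + \hat\beta_2$.

(c) With the interaction term added, the regression is

$$\widehat{\ln(\text{Earn})} = \hat\beta_0 + \hat\beta_1 \text{DGender} + \hat\beta_2 \text{DMarried} + \hat\beta_3 (\text{DGender} \times \text{DMarried}) + \dots$$

Single male: $\hat\beta_0$; single female: $\hat\beta_0 + \hat\beta_1$; married male: $\hat\beta_0 + \hat\beta_2$; married female: $\hat\beta_0 + \hat\beta_1 + \hat\beta_2 + \hat\beta_3$. This approach is more general because it allows the effect of being married and female to be different from being married and male. In (b), both females and males were faced with identical effects from being married, $\hat\beta_2$. In (c), this effect differs due to the additional coefficient $\hat\beta_3$.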
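To make the four group effects concrete, the sketch below plugs in the binary-variable coefficients from the interacted earnings regression reported earlier in this chapter (Female: -0.158, Married: 0.173, Female × Married: -0.218). The function name `log_earnings_gap` is a hypothetical helper introduced only for this illustration.

```python
# Coefficients on the binary variables, taken from the interacted earnings
# regression reported earlier in the chapter (the 253-worker sample).
B_FEMALE, B_MARRIED, B_INTERACT = -0.158, 0.173, -0.218


def log_earnings_gap(female, married):
    """Difference in ln(Earn) relative to the default group, a single male."""
    return B_FEMALE * female + B_MARRIED * married + B_INTERACT * female * married


gaps = {
    "single female": log_earnings_gap(1, 0),
    "married male": log_earnings_gap(0, 1),
    "married female": log_earnings_gap(1, 1),
}

# Marriage effect by gender: beta2 for males, beta2 + beta3 for females.
marriage_effect_male = log_earnings_gap(0, 1) - log_earnings_gap(0, 0)
marriage_effect_female = log_earnings_gap(1, 1) - log_earnings_gap(1, 0)

for group, gap in gaps.items():
    print(f"{group}: {gap:+.3f}")
print(f"marriage effect, males:   {marriage_effect_male:+.3f}")
print(f"marriage effect, females: {marriage_effect_female:+.3f}")
```

Without the interaction term the two marriage effects would be forced to be equal; with it, they differ by exactly the interaction coefficient.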


11) You have been told that the money demand function in the United States has been unstable since the late 1970s. To investigate this problem, you collect data on the real money supply (m = M/P, where M is M1 and P is the GDP deflator), (real) gross domestic product (GDP), and the nominal interest rate (R). Next you consider estimating the demand for money using the following alternative functional forms:

(i) $m = \beta_0 + \beta_1 \times GDP + \beta_2 \times R + u$

(ii) $m = \beta_0 \times GDP^{\beta_1} \times R^{\beta_2} \times e^u$

(iii) $m = \beta_0 \times GDP^{\beta_1} \times (1 + R)^{\beta_2} \times e^u$

Give an interpretation for $\beta_1$ and $\beta_2$ in each case. How would you calculate the income elasticity in case (i)?

Answer: In (i), both coefficients show the effect of a unit increase of the respective variables on the demand for money. In (ii), the two coefficients are elasticities. In (iii), $\beta_1$ is an elasticity, whereas $\beta_2$ is often referred to as a “semi-elasticity.” The specification becomes $\ln(m) = \ln(\beta_0) + \beta_1 \ln(GDP) + \beta_2 R + u$, since $\ln(1 + R) \approx R$ for small R. Hence $\beta_2 = \dfrac{\Delta m / m}{\Delta R}$, that is, it indicates by how many percent the (real) money demand will change for a one percentage point change in the interest rate.

12) You have collected data for a cross-section of countries in two time periods, 1960 and 1997, say. Your task is to find the determinants for the Wealth of a Nation (per capita income) and you believe that there are three major determinants: investment in physical capital in both time periods ($X_{1,T}$ and $X_{1,0}$), investment in human capital or education ($X_{2,T}$ and $X_{2,0}$), and per capita income in the initial period ($Y_0$). You run the following regression:

$$\ln(Y_T) = \beta_0 + \beta_1 X_{1,T} + \beta_2 X_{1,0} + \beta_3 X_{2,T} + \beta_4 X_{2,0} + \beta_5 \ln(Y_0) + u_T$$

One of your peers suggests that instead, you should regress the growth rate in per capita income over the two periods on the change in physical and human capital. For those results to be a parsimonious presentation of your initial regression, what three restrictions would have to hold? How would you test for these? The same person also points out to you that the intercept vanishes in equations where the data is differenced. Is that true?

Answer: The regression using growth rates is as follows:

$$[\ln(Y_T) - \ln(Y_0)] = \beta_0 + \beta_1 (X_{1,T} - X_{1,0}) + \beta_3 (X_{2,T} - X_{2,0}) + (\beta_5 - 1) \ln(Y_0) + u_T$$

For this to be a parsimonious presentation of the initial regression, the following two restrictions must hold: $\beta_1 = -\beta_2$ and $\beta_3 = -\beta_4$. The use of an F-test is required here to test the restrictions simultaneously. The intercept is still present in the equation, and the assertion therefore cannot be true.
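The step from the levels regression to the growth-rate regression in problem 12 can be written out explicitly. The derivation below is a sketch that assumes the levels regression carries a coefficient $\beta_5$ on $\ln(Y_0)$ and a coefficient $\beta_4$ on $X_{2,0}$, as the $(\beta_5 - 1)$ term and the restriction $\beta_3 = -\beta_4$ in the answer imply.

```latex
\begin{aligned}
\ln(Y_T) &= \beta_0 + \beta_1 X_{1,T} + \beta_2 X_{1,0} + \beta_3 X_{2,T} + \beta_4 X_{2,0} + \beta_5 \ln(Y_0) + u_T \\
\ln(Y_T) - \ln(Y_0) &= \beta_0 + \beta_1 X_{1,T} + \beta_2 X_{1,0} + \beta_3 X_{2,T} + \beta_4 X_{2,0} + (\beta_5 - 1)\ln(Y_0) + u_T \\
&= \beta_0 + \beta_1 (X_{1,T} - X_{1,0}) + \beta_3 (X_{2,T} - X_{2,0}) + (\beta_5 - 1)\ln(Y_0) + u_T
\quad \text{if } \beta_2 = -\beta_1 \text{ and } \beta_4 = -\beta_3 .
\end{aligned}
```

Subtracting $\ln(Y_0)$ from both sides leaves $\beta_0$ untouched, which is why the intercept does not vanish.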


13) Earnings functions attempt to predict the log of earnings from a set of explanatory variables, both binary and continuous. You have allowed for an interaction between two continuous variables: education and tenure with the current employer. Your estimated equation is of the following type: $\widehat{\ln(Earn)} = \hat{\beta}_0 + \hat{\beta}_1 \times Femme + \hat{\beta}_2 \times Educ + \hat{\beta}_3 \times Tenure + \hat{\beta}_4 \times (Educ \times Tenure) + \cdots$ where Femme is a binary variable taking on the value of one for females and is zero otherwise, Educ is the number of years of education, and Tenure is the number of continuous years of work with the current employer. What is the effect of an additional year of education on earnings (“returns to education”) for men? For women? If you allowed for the returns to education to differ for males and females, how would you respecify the above equation? What is the effect of an additional year of tenure with a current employer on earnings?

Answer: For both males and females, the effect of an additional year of education is $\frac{\Delta \widehat{\ln(Earn)}}{\Delta Educ} = \hat{\beta}_2 + \hat{\beta}_4 \times Tenure$, and hence depends on continuous years of work with the current employer. To allow the effect to be different for males and females, an interaction variable between Femme and Educ would have to be introduced. The return to tenure with a current employer is $\frac{\Delta \widehat{\ln(Earn)}}{\Delta Tenure} = \hat{\beta}_3 + \hat{\beta}_4 \times Educ$.

14) Many countries that experience hyperinflation do not have market-determined interest rates. As a result, some authors have substituted future inflation rates into money demand equations of the following type as a proxy: $m = \beta_0 \times (1 + \Delta \ln P)^{\beta_1} \times e^{u}$ ($m$ is real money, and $P$ is the consumer price index). Income is typically omitted since movements in it are dwarfed by money growth and the inflation rate. Authors have then interpreted $\beta_1$ as the “semi-elasticity” of the inflation rate. Do you see any problems with this interpretation?

Answer: Taking logarithms of the above equation results in $\ln(m) = \ln(\beta_0) + \beta_1 \ln(1 + \Delta \ln P) + u$. This simplifies to $\ln(m) = \ln(\beta_0) + \beta_1 \Delta \ln P + u$ if $\Delta \ln P$, the inflation rate, is small. In that case, $\beta_1$ is a semi-elasticity: it represents the approximate percentage change in the demand for money for a one percentage point increase in the inflation rate. However, if the inflation rate is not small, as is the case in hyperinflations, then the approximation no longer holds.
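The point about the approximation in question 14 can be checked numerically. The following is a minimal Python sketch; the inflation rates are illustrative values chosen here, not data from the question, and the function name is hypothetical.

```python
import math

def log_approx_error(inflation_rate):
    """Error made when ln(1 + x) is replaced by x (x given as a decimal)."""
    return math.log1p(inflation_rate) - inflation_rate

# Illustrative (hypothetical) per-period inflation rates, as decimals.
rates = [0.02, 0.10, 0.50, 2.00, 10.00]

for x in rates:
    exact = math.log1p(x)
    print(f"rate={x:6.2f}  ln(1+rate)={exact:7.4f}  error={log_approx_error(x):8.4f}")
```

For low inflation the error is negligible, while for rates typical of hyperinflations it is of the same order as the rate itself, so reading the coefficient as a semi-elasticity with respect to the inflation rate becomes unreliable.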
15) To investigate whether or not there is discrimination against a sub-group of individuals, you regress the log of earnings on determining variables, such as education, work experience, etc., and a binary variable which takes on the value of one for individuals in that sub-group and is zero otherwise. You consider two possible specifications. First, you run two separate regressions, one for the observations in the sub-group and one for the others. Second, you run a single regression, but allow for a binary variable to appear in the regression. Your professor suggests that the second equation is better for the task at hand, as long as you allow for a shift in both the intercept and the slopes. Explain her reasoning.

Answer: By running the regression over the entire sample, you can test for equality of coefficients, or alternatively, for the significance of the coefficients on the binary variable and its interactions. Also, the combined sample has more observations, and hence smaller standard errors.


16) Being a competitive female swimmer, you wonder if women will ever be able to beat the time of the male gold medal winner. To investigate this question, you collect data for the Olympic Games since 1910. At first you consider including various distances, a binary variable for Mark Spitz, and another binary variable for the arrival and presence of East German female swimmers, but in the end decide on a simple linear regression. Your dependent variable is the ratio of the fastest women’s time to the fastest men’s time in the 100 m backstroke, and the explanatory variable is the year of the Olympics. The regression result is as follows, $\widehat{TFoverM} = 4.42 - 0.0017 \times Olympics$, where TFoverM is the relative time of the gold medal winner, and Olympics is the year of the Olympic Games. What is your prediction of when females will catch up to men in this discipline? Does this sound plausible? What other functional form might you want to consider?

Answer: Setting TFoverM equal to one gives $Olympics = (4.42 - 1)/0.0017 \approx 2011.76$, so according to the above regression, women will catch up in the year 2011.76 or 2012. (This happens to be an Olympics year.) This is not plausible for swimming, and a better functional form would be $TFoverM = \beta_0 + \beta_1 \times \frac{1}{Olympics}$.

17) Sketch for the log-log model what the relationship between Y and X looks like for various parameter values of the slope, i.e., $\beta_1 > 1$; $0 < \beta_1 < 1$; $\beta_1 = -1$.

Answer: [Sketch not reproduced. For $\beta_1 > 1$, Y increases with X at an increasing rate; for $0 < \beta_1 < 1$, Y increases with X at a decreasing rate; for $\beta_1 = -1$, Y falls with X along a rectangular hyperbola.]
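The shapes asked for in question 17 can be tabulated before sketching. The following is a minimal Python sketch, assuming an intercept of zero, no error term, and $\beta_1 = 2$ as one representative of the case $\beta_1 > 1$; all names and the grid of X values are illustrative.

```python
import math

def loglog_curve(beta1, xs, beta0=0.0):
    """Y implied by ln(Y) = beta0 + beta1 * ln(X), ignoring the error term."""
    return [math.exp(beta0 + beta1 * math.log(x)) for x in xs]

# Equally spaced, strictly positive X values (illustrative choice).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

# One curve per slope value: beta1 > 1, 0 < beta1 < 1, beta1 = -1.
curves = {beta1: loglog_curve(beta1, xs) for beta1 in (2.0, 0.5, -1.0)}

for beta1, ys in curves.items():
    print(f"beta1={beta1:4.1f}: " + "  ".join(f"{y:6.3f}" for y in ys))
```

Plotting each row against the X grid gives a convex increasing curve, a concave increasing curve, and a downward-sloping hyperbola, respectively.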


18) Show that for the following regression model $Y_t = e^{\beta_0 + \beta_1 \times t + u}$, where $t$ is a time trend, which takes on the values 1, 2, …, T, $\beta_1$ represents the instantaneous (“continuous compounding”) growth rate. Show how this rate is related to the proportionate rate of growth, which is calculated from the relationship $Y_t = Y_0 \times (1 + g)^t$ when time is measured in discrete intervals.

Answer: $\ln(Y_t) = \beta_0 + \beta_1 \times t + u$ and hence $\beta_1 = \frac{\partial \ln(Y_t)}{\partial t} = \frac{1}{Y_t} \frac{\partial Y_t}{\partial t}$. From $Y_t = Y_0 \times (1 + g)^t$, we get $\ln(Y_t) = \ln(Y_0) + \ln(1 + g) \times t = \beta_0 + \beta_1 t$, where $\beta_0 = \ln(Y_0)$ and $\beta_1 = \ln(1 + g) \approx g$ for small $g$. Hence regressing the log of a variable on time generates a slope coefficient which is approximately the proportionate rate of growth, provided the growth rate is small.
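The relationship $\beta_1 = \ln(1 + g)$ can be illustrated numerically. The following is a minimal Python sketch; the function names and the growth rates used are illustrative assumptions, not part of the question.

```python
import math

def continuous_rate(g):
    """Instantaneous growth rate beta1 implied by a discrete rate g."""
    return math.log1p(g)

def discrete_rate(beta1):
    """Proportionate growth rate g implied by an instantaneous rate beta1."""
    return math.expm1(beta1)

# Illustrative discrete growth rates, as decimals.
for g in (0.01, 0.02, 0.05, 0.20, 0.50):
    b1 = continuous_rate(g)
    print(f"g={g:5.2f}  beta1=ln(1+g)={b1:7.4f}  gap={g - b1:7.4f}")
```

The gap between the two rates is tiny for growth rates of a few percent and widens as $g$ grows, which is why the slope of a log-linear trend regression is read as a growth rate only for small $g$.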


19) Your task is to estimate the ice cream sales for a certain chain in New England. The company makes available to you quarterly ice cream sales (Y) and informs you that the price per gallon has approximately remained constant over the sample period. You gather information on average daily temperatures (X) during these quarters and regress Y on X, adding seasonal binary variables for spring, summer, and fall. These variables are constructed as follows: DSpring takes on a value of 1 during the spring and is zero otherwise, DSummer takes on a value of 1 during the summer, etc. Specify three regression functions where the following conditions hold: the relationship between Y and X is (i) forced to be the same for each quarter; (ii) allowed to have different intercepts each season; (iii) allowed to have varying slopes and intercepts each season. Sketch the difference between (i) and (ii). How would you test which model fits the data the best?

Answer: (i) $Y_i = \beta_0 + \beta_1 X_i + u_i$; (ii) $Y_i = \beta_0 + \beta_1 X_i + \beta_2 DSpring + \beta_3 DSummer + \beta_4 DFall + u_i$; (iii) $Y_i = \beta_0 + \beta_1 X_i + \beta_2 DSpring + \beta_3 DSummer + \beta_4 DFall + \beta_5 (DSpring \times X_i) + \beta_6 (DSummer \times X_i) + \beta_7 (DFall \times X_i) + u_i$. Model (iii) is the most general of the models, and the others are nested in it. Hence you can use the F-test to see if certain restrictions hold. For example, (i) is a parsimonious representation of (iii) if all coefficients involving the seasonal binary variables are simultaneously equal to zero.

20) In estimating the original relationship between money wage growth and the unemployment rate, Phillips used United Kingdom data from 1861 to 1913 to fit a curve of the following functional form: $\left(\frac{\dot{W}}{W} + \beta_0\right) = \beta_1 \times ur^{\beta_2} \times e^{u}$, where $\frac{\dot{W}}{W}$ is the percentage change in money wages and $ur$ is the unemployment rate. Sketch the function. What role does $\beta_0$ play? Can you find a linear transformation that allows you to estimate the above function using OLS? If, after taking logarithms on both sides of the equation, you tried to estimate $\beta_1$ and $\beta_2$ using OLS by choosing different values for $\beta_0$ by “trial and error procedure” (Phillips’s words), what sort of problem might you run into with the left-hand side variable for some of the observations?

Answer: Given the shape of the Phillips curve, $\beta_2$ will be negative and $\beta_1$ will be positive. Hence for large values of $ur$, $\beta_1 \times ur^{\beta_2}$ will be approximately zero, and $-\beta_0$ is the lower asymptote of $\frac{\dot{W}}{W}$. Taking logarithms on both sides results in $\ln\left(\frac{\dot{W}}{W} + \beta_0\right) = \ln(\beta_1) + \beta_2 \ln(ur) + u$, which cannot be estimated by OLS directly because the dependent variable contains the unknown parameter $\beta_0$. Choosing different values for $\beta_0$ can result in situations where $\left(\frac{\dot{W}}{W} + \beta_0\right)$ is negative for some observations, in which case its logarithm is not defined.


21) Using a spreadsheet program such as Excel, plot the following logistic regression function with a single X, $\hat{Y}_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_i)}}$, where $\beta_0 = -4.13$ and $\beta_1 = 5.37$. Enter values of X in the first column starting from 0 and then incrementing these by 0.1 until you reach 2.0. Then enter the logistic function formula in the next column. Finally produce a scatter plot, connecting the predicted values with a line.

Answer: [Plot not reproduced. The curve is S-shaped, rising from about 0.016 at X = 0 to about 0.999 at X = 2.0.]
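The spreadsheet columns described in question 21 can also be generated in a few lines of code. The following is a minimal Python sketch that builds the two columns; plotting is left out, and the names used are illustrative.

```python
import math

BETA0 = -4.13  # intercept given in the question
BETA1 = 5.37   # slope given in the question

def logistic(x, b0=BETA0, b1=BETA1):
    """Logistic regression function 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# First column: X from 0 to 2.0 in steps of 0.1 (21 values).
xs = [round(0.1 * i, 1) for i in range(21)]
# Second column: predicted value at each X.
table = [(x, logistic(x)) for x in xs]

for x, y in table:
    print(f"{x:4.1f}  {y:6.4f}")
```

The predicted value equals one half where $\beta_0 + \beta_1 X = 0$, that is, at X = 4.13/5.37, which is roughly 0.77.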

22) Table 8.1 on page 284 of your textbook displays the following estimated earnings function in column (4): $\widehat{\ln(earnings)} = 1.503 + 0.1032 \times educ - 0.451 \times DFemme + 0.0143 \times (DFemme \times educ) + 0.0232 \times exper - 0.000368 \times exper^2 - 0.058 \times Midwest - 0.0098 \times South - 0.030 \times West$, with standard errors (0.023), (0.0012), (0.024), (0.0017), (0.0012), (0.000023), (0.006), (0.006), and (0.007) in the order of the coefficients; n = 52,790, $R^2 = 0.267$. Given that the potential experience variable (exper) is defined as (Age - Education - 6), find the age at which individuals with a high school degree (12 years of education) and with a college degree (16 years of education) have maximum earnings, holding all other factors constant.

Answer: The answer can be found either by using calculus or graphical/spreadsheet techniques. Maximum earnings occur at potential experience of $0.0232 / (2 \times 0.000368) \approx 31.5$ years. Hence with 12 years of education, maximum earnings occur at age 49.5, while for a person with 16 years of education they occur at age 53.5. (Since taking logarithms is a monotonic transformation of the original data, the same results hold for the log of earnings as for earnings.)
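The calculus route for question 22 amounts to setting the derivative of the quadratic in exper to zero. The following is a minimal Python sketch using the two experience coefficients quoted in the question; the function and constant names are illustrative.

```python
B_EXPER = 0.0232       # coefficient on exper
B_EXPER_SQ = -0.000368  # coefficient on exper squared

def peak_experience(b1=B_EXPER, b2=B_EXPER_SQ):
    """Experience level that maximizes b1 * exper + b2 * exper**2."""
    return -b1 / (2.0 * b2)

def peak_age(years_of_education):
    """Age at the earnings peak, using exper = Age - Education - 6."""
    return peak_experience() + years_of_education + 6

for educ in (12, 16):
    print(f"educ={educ}: peak exper={peak_experience():.2f}, peak age={peak_age(educ):.1f}")
```

Because the education and regional terms do not involve exper, they shift the level of log earnings but not the experience level at which it peaks.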


23) Consider a typical beta convergence regression function from macroeconomics, where the growth of a country’s per capita income is regressed on the initial level of per capita income and various other economic and socio-economic variables. Assume that two of these variables are the average number of years of education in the specific country and a binary variable which indicates whether or not the country experienced a significant number of years of civil war/unrest. Explain why it would make sense to have these two variables enter separately and also why you should use an interaction term. What signs would you expect on the three coefficients?

Answer: Simple extensions of the standard neoclassical growth model suggest that the number of years of education has a positive effect on conditional growth in the wealth of a nation (per capita income). A civil war would have a negative effect on the investment/output ratio (savings rate), and you would therefore expect a negative sign on its coefficient. However, it is important to interact the variables because no matter how much education the average person has, there will be virtually no investment in a country during a civil war. Hence you would expect a negative sign on the interaction term, which would indicate the effect that a civil war has on the education effect.

24) Consider the following regression of test scores on an intercept, a binary variable that equals 1 if the student-teacher ratio is 20 or more (HiSTR), and another binary variable that equals 1 if the percentage of English learners is 10% or more (HiEL). $\widehat{TestScore} = 664.1 - 1.9 \times HiSTR - 18.2 \times HiEL - 3.5 \times (HiSTR \times HiEL)$ Using the two by two table below, fill in the expected test scores of a student with various combinations of the high/low student-teacher ratio and the high/low percent of English learners.

            STR < 20    STR ≥ 20
EL < 10%
EL ≥ 10%

Answer:

            STR < 20    STR ≥ 20
EL < 10%    664.1       662.2
EL ≥ 10%    645.9       640.5
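The four cells of the table in question 24 follow from plugging the 0/1 values of the two binary variables into the estimated regression. The following is a minimal Python sketch; the function name is illustrative.

```python
def predicted_score(hi_str, hi_el):
    """Predicted test score for binary indicators HiSTR and HiEL (0 or 1)."""
    return 664.1 - 1.9 * hi_str - 18.2 * hi_el - 3.5 * hi_str * hi_el

# Rows: EL < 10% (hi_el = 0), EL >= 10% (hi_el = 1).
# Columns: STR < 20 (hi_str = 0), STR >= 20 (hi_str = 1).
for hi_el in (0, 1):
    row = [predicted_score(hi_str, hi_el) for hi_str in (0, 1)]
    print("  ".join(f"{score:6.1f}" for score in row))
```

The interaction coefficient only enters the cell in which both indicators equal one, which is why that cell differs from the intercept by more than the sum of the two separate effects.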


Chapter 9 Assessing Studies Based on Multiple Regression

9.1 Multiple Choice

1) The analysis is externally valid if A) the statistical inferences about causal effects are valid for the population being studied. B) the study has passed a double blind refereeing process for a journal. C) its inferences and conclusions can be generalized from the population and setting studied to other populations and settings. D) some committee outside the author’s department has validated the findings. Answer: C 2) By including another variable in the regression, you will A) decrease the regression R2 if that variable is important. B) eliminate the possibility of omitted variable bias from excluding that variable. C) look at the t-statistic of the coefficient of that variable and include the variable only if the coefficient is statistically significant at the 1% level. D) decrease the variance of the estimator of the coefficients of interest. Answer: B 3) Errors-in-variables bias A) is present when the probability limit of the OLS estimator is given by $\hat{\beta}_1 \xrightarrow{p} \beta_1 + \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}$.

B) arises when an independent variable is measured imprecisely. C) arises when the dependent variable is measured imprecisely. D) always occurs in economics since economic data is never precisely measured. Answer: B 4) Sample selection bias A) occurs when a selection process influences the availability of data and that process is related to the dependent variable. B) is only important for finite sample results. C) results in the OLS estimator being biased, although it is still consistent. D) is more important for nonlinear least squares estimation than for OLS. Answer: A 5) Simultaneous causality bias A) is also called sample selection bias. B) happens in complicated systems of equations called block recursive systems. C) results in biased estimators if there is heteroskedasticity in the error term. D) arises in a regression of Y on X when, in addition to the causal link of interest from X to Y, there is a causal link from Y to X. Answer: D 6) The reliability of a study using multiple regression analysis depends on all of the following with the exception of A) omitted variable bias. B) errors-in-variables. C) presence of homoskedasticity in the error term. D) external validity. Answer: C


7) A statistical analysis is internally valid if A) its inferences and conclusions can be generalized from the population and setting studied to other populations and settings. B) statistical inference is conducted inside the sample period. C) the hypothesized parameter value is inside the confidence interval. D) the statistical inferences about causal effects are valid for the population being studied. Answer: D 8) The components of internal validity are A) a large sample, and BLUE property of the estimator. B) a regression R2 above 0.75 and serially uncorrelated errors. C) unbiasedness and consistency of the estimator, and desired significance level of hypothesis testing. D) nonstochastic explanatory variables, and prediction intervals close to the sample mean. Answer: C 9) A study based on OLS regressions is internally valid if A) the errors are homoskedastic, and there are no more than two binary variables present among the regressors. B) you use a two-sided alternative hypothesis, and standard errors are calculated using the heteroskedasticity-robust formula. C) weighted least squares produces similar results, and the t-statistic is normally distributed in large samples. D) the OLS estimator is unbiased and consistent, and the standard errors are computed in a way that makes confidence intervals have the desired confidence level. Answer: D 10) Panel data estimation can sometimes be used A) to avoid the problems associated with misspecified functional forms. B) in case the sum of residuals is not zero. C) in the case of omitted variable bias when data on the omitted variable is not available. D) to counter sample selection bias. Answer: C 11) Misspecification of functional form of the regression function A) is overcome by adding the squares of all explanatory variables. B) is more serious in the case of homoskedasticity-only standard error. C) results in a type of omitted variable bias. D) requires alternative estimation methods such as maximum likelihood. 
Answer: C 12) Errors-in-variables bias A) is only a problem in small samples. B) arises from error in the measurement of the independent variable. C) becomes larger as the variance in the explanatory variable increases relative to the error variance. D) is particularly severe when the source is an error in the measurement of the dependent variable. Answer: B 13) A survey of earnings contains an unusually high fraction of individuals who state their weekly earnings in 100s, such as 300, 400, 500, etc. This is an example of A) errors-in-variables bias. B) sample selection bias. C) simultaneous causality bias. D) companies that typically bargain with workers in 100s of dollars. Answer: A

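14) In the case of a simple regression, where the independent variable is measured with i.i.d. error, A) $\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} \beta_1$. B) $\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2}$. C) $\hat{\beta}_1 \xrightarrow{p} \beta_1$. D) $\hat{\beta}_1 \xrightarrow{p} \beta_1 + \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}$.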

Answer: A 15) In the case of errors-in-variables bias, A) maximum likelihood estimation must be used. B) the OLS estimator is consistent if the variance in the unobservable variable is relatively large compared to variance in the measurement error. C) the OLS estimator is consistent, but no longer unbiased in small samples. D) binary variables should not be used as independent variables. Answer: B 16) Sample selection bias occurs when A) the choice between two samples is made by the researcher. B) data are collected from a population by simple random sampling. C) samples are chosen to be small rather than large. D) the availability of the data is influenced by a selection process that is related to the value of the dependent variable. Answer: D 17) Simultaneous causality A) means you must run a second regression of X on Y. B) leads to correlation between the regressor and the error term. C) means that a third variable affects both Y and X. D) cannot be established since regression analysis only detects correlation between variables. Answer: B 18) Correlation of the regression error across observations A) results in incorrect OLS standard errors. B) makes the OLS estimator inconsistent, but not unbiased. C) results in correct OLS standard errors if heteroskedasticity-robust standard errors are used. D) is not a problem in cross-sections since the data can always be “reshuffled.” Answer: A


19) Applying the analysis from the California test scores to another U.S. state is an example of looking for A) simultaneous causality bias. B) external validity. C) sample selection bias. D) internal validity. Answer: B 20) Comparing the California test scores to test scores in Massachusetts is appropriate for external validity if A) Massachusetts also allowed beach walking to be an appropriate P.E. activity. B) the two income distributions were very similar. C) the student-to-teacher ratio did not differ by more than five on average. D) the institutional settings in California and Massachusetts, such as organization in classroom instruction and curriculum, were similar in the two states. Answer: D 21) The guidelines for whether or not to include an additional variable include all of the following, with the exception of A) providing “full disclosure” representative tabulations of the results. B) testing whether additional questionable variables have nonzero coefficients. C) determining whether it can be measured in the population of interest. D) being specific about the coefficient or coefficients of interest. Answer: C 22) Possible solutions to omitted variable bias, when the omitted variable is not observed, include the following with the exception of A) panel data estimation. B) nonlinear least squares estimation. C) use of instrumental variables regressions. D) use of randomized controlled experiments. Answer: B 23) A possible solution to errors-in-variables bias is to A) use log-log specifications. B) choose different functional forms. C) use the square root of that variable since the error becomes smaller. D) mitigate the problem through instrumental variables regression. Answer: D 24) You try to explain the number of IBM shares traded in the stock market per day in 2005. As an independent variable you choose the closing price of the share. This is an example of A) simultaneous causality. B) invalid inference due to a small sample size. 
C) sample selection bias since you should analyze more than one stock. D) a situation where homoskedasticity-only standard errors should be used since you only analyze one company. Answer: A 25) In the case of errors-in-variables bias, the precise size and direction of the bias depend on A) the sample size in general. B) the correlation between the measured variable and the measurement error. C) the size of the regression R2 . D) whether the good in question is price elastic. Answer: B

26) The question of reliability/unreliability of a multiple regression depends on A) internal but not external validity B) the quality of your statistical software package C) internal and external validity D) external but not internal validity Answer: C 27) A statistical analysis is internally valid if A) all t-statistics are greater than |1.96| B) the regression R2 > 0.05 C) the population is small, say less than 2,000, and can be observed D) the statistical inferences about causal effects are valid for the population studied Answer: D 28) A definition of internal validity is A) the estimator of the causal effect being unbiased and consistent B) the estimator of the causal effect being efficient C) inferences and conclusions being generalized from the population to other populations D) OLS estimation being available in your statistical package Answer: A 29) Threats to internal validity lead to A) perfect multicollinearity B) the inability to transfer data sets into your statistical package C) failures of one or more of the least squares assumptions D) a false generalization to the population of interest Answer: C 30) The true causal effect might not be the same in the population studied and the population of interest because A) of differences in characteristics of the population B) of geographical differences C) the study is out of date D) all of the above Answer: D


9.2 Essays and Longer Questions

1) Until about 10 years ago, most studies in labor economics found a small but significant negative relationship between minimum wages and employment for teenagers. Two labor economists challenged this perceived wisdom with a publication in 1992 by comparing employment changes of fast-food restaurants in Texas, before and after a federal minimum wage increase. (a) Explain how you would obtain external validity in this field of study. (b) List the various threats to external validity and suggest how to address them in this case.

Answer: (a) Obtaining external validity involves generalizing the results from the population and setting under study, in this case Texas, to other populations and settings. Students familiar with the Card and Krueger literature on minimum wages will point to the New Jersey/Pennsylvania study, or the high/low impact minimum wage paper by Card. In general, studies of the effect of minimum wages on employment using data from other states and/or countries will help to establish external validity. (b) The main threats to external validity are the differences between the population and setting studied versus the population and setting of interest. In particular, there may be geographic and/or time differences, in that the study may be out of date. Being out of date is not a major concern here, since the study was done relatively recently. Using data from Texas only could be of concern if you believed that the Texas fast-food restaurants are different from those elsewhere, say in terms of monopsony power, the type of teenager they attract, etc. (Students familiar with the literature may point out that no data was obtained from McDonald’s, but again, that does not pose a particular threat.) Generalizing from fast-food restaurants to other sectors, such as the garment industry, is an entirely different matter, as is generalizing from teenagers to older workers, especially females.
Some authors have established that increases in minimum wages lead to lower school enrollment rates by whites, who then replace black fast-food restaurant workers. These types of substitutions are not likely to occur with older workers. Comparisons with other countries, where cultural differences may be larger than within the United States, are potentially more problematic.

2) Your textbook used the California Standardized Testing and Reporting (STAR) data set on student test performance in Chapters 4-7. One justification for putting second to twelfth graders through such an exercise once a year is to make schools more accountable. The hope is that schools with low scores will improve the following year and in the future. To test for the presence of such an effect, you collect data from 1,000 L.A. County schools for grade 4 scores in 1998 and 1999, both for reading (Read) and mathematics (Maths). Both are on a scale from zero to one hundred. The regression results are as follows (homoskedasticity-only standard errors in parentheses): $\widehat{Maths99} = 6.967\,(0.542) + 0.919\,(0.013) \times Maths98$, $R^2 = 0.825$, SER = 7.818; $\widehat{Read99} = 4.131\,(0.409) + 0.943\,(0.011) \times Read98$, $R^2 = 0.887$, SER = 6.416. (a) Interpret the results and indicate whether or not the coefficients are significantly different from zero. Do the coefficients have the expected sign and magnitude? (b) Discuss various threats to internal and external validity, and try to assess whether or not these are likely to be present in your study. (c) Changing the estimation method to allow for heteroskedasticity-robust standard errors produces four new standard errors: (0.539), (0.015), (0.452), and (0.015) in the order of appearance in the two equations above. Given these numbers, do any of your statements in (b) change? Do you think that the coefficients themselves changed? (d) If reading and maths scores were the same in 1999 as in 1998, on average, what coefficients would you expect for the intercept and the slope?
How would you test for the restrictions? (e) The appropriate F-statistic in (d) is 138.27 for the maths scores, and 104.85 for the reading scores. Comparing these values to the critical values in the F table, can you reject the null hypothesis in each case? (f) Your professor tells you that the analysis reminds her of “Galton’s Fallacy.” Sir Francis Galton regressed the height of children on the average height of their parents. He found a positive intercept and a slope between zero and one. Being concerned about the height of the English aristocracy, he interpreted the results as “regression to mediocrity” (hence the name regression). Do you see the parallel?

Answer: (a) High (low) reading and maths scores in 1998 are associated with high (low) reading and maths scores in 1999. The slope coefficients suggest a high degree of persistence. However, both regression lines cross the 45 degree line, thereby implying, implausibly, mean reversion. All coefficients are statistically significant, and approximately 80 to 90 percent of the variation in the 1999 scores is explained by the 1998 scores. (b) The biggest threat to internal validity stems from the errors-in-variables problem. Assume that the test scores in maths in a given year are determined by a given set of factors, such as class size, socioeconomic variables of the school district, quality of teachers, etc. Let the maths score in the second year also be determined by the same factors, which are unlikely to change by much between the two years. Then subtracting the earlier year from the more current year results in a population regression function with a slope of one and an intercept of zero, and an error term which is correlated with the previous year’s score. Hence the OLS estimator of the slope will be biased downward from one and that of the intercept will be biased upward from zero, giving the above result. There are few threats to internal or external validity present through the other factors, although the L.A. school district may not be typical when compared to a less urban setting. (c) The coefficients are unaffected by the choice of standard error calculation. However, with homoskedasticity-only standard errors, hypothesis tests no longer have the desired significance levels unless the errors are homoskedastic. There is no suggestion from the institutional setting of the district that this should be the case here.
(Indeed, homoskedasticity is rejected for the above sample.) (d) In that case the intercept would be zero, and the slope one. This is a joint hypothesis, and hence the F-test is appropriate here. (e) The critical value of the $F_{2,\infty}$ distribution is 4.61 at the 1% level, so the null hypothesis is comfortably rejected in each case. (f) The situation is similar here. Instead of regressing the outcome in one period on determining factors, it is regressed on the outcome in a previous period. In each case the outcome in the previous period is an imperfect measure of the underlying determinants, that is, it contains measurement error. This results in problems with internal validity.

3) Keynes postulated that the marginal propensity to consume ($MPC = \frac{\Delta C}{\Delta Y_{pd}}$) is between zero and one. He also hypothesized that the average propensity to consume ($APC = \frac{C}{Y_{pd}}$) would fall as personal disposable income ($Y_{pd}$) increased. (a) Specify a linear consumption function. Show that the assumption of a falling APC implies the presence of a positive intercept. (b) Using annual per capita data, estimation of the consumption function for the United States results in the following output for the years 1929-1938 (standard errors in parentheses): $\hat{C}_t = 981.35\,(158.65) + 0.735\,(0.038) \times Y_{pd,t}$, $R^2 = 0.98$, SER = 50.65. Can you reject the null hypothesis that the slope is less than one? Greater than zero? Test the hypothesis that the intercept is zero. Should you be concerned about the sample size when conducting these tests? What other threats to internal validity may be present here? (c) Given the GDP identity for a closed economy, $Y_t \equiv C_t + I_t + G_t$, show why economists saw important policy implications in finding an APC that would decrease over time. (d) Simon Kuznets, who won the Nobel Prize in economics, collected data on consumption expenditures and national income from 1869 to 1938 and found, using overlapping period averages, that the APC was relatively constant over this period. To reconcile this finding with the regression results, Milton Friedman, who also won the Nobel Prize, formulated the “permanent income” hypothesis. In essence, Friedman hypothesized that both actual consumption and income are measured with error, $C_t = C_t^{p} + v_t$ and $Y_t = Y_t^{p} + w_t$, where $C_t^{p}$ and $Y_t^{p}$ were called “permanent” consumption and income, respectively, and $v_t$ and $w_t$, the two measurement errors, were labeled transitory consumption and income. Friedman hypothesized that the transitory components were purely random error terms, uncorrelated with the permanent parts. Let permanent consumption and income be related as follows: $C_t^{p} = k \times Y_{pd,t}^{p} + u_t$, so that the APC and MPC are the same and constant over time. Furthermore, let both transitory and permanent income be independent of the error term. Show that by regressing actual consumption on actual income, the MPC will be downward biased, and the intercept will be greater than zero, even in large samples (to simplify the analysis, assume that permanent income and all of the errors are i.i.d. and mutually independent).

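Answer: (a) $\hat{C}_i = \hat{\beta}_0 + \hat{\beta}_1 Y_{pd,i}$. Dividing both sides by personal disposable income results in

$\frac{\hat{C}_i}{Y_{pd,i}} = APC = \hat{\beta}_0 \frac{1}{Y_{pd,i}} + \hat{\beta}_1$.

Hence the APC will fall with increases in personal disposable income, provided that the intercept is positive.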

(b) Assuming that all assumptions required for proper inference are satisfied here, the t-statistic for an MPC of one is (0.735 – 1)/0.038 = –6.97, thereby rejecting the null hypothesis. You can also reject the null hypothesis that the slope is zero (t-statistic = 0.735/0.038 = 19.34) and the null hypothesis that the intercept is zero (t-statistic = 981.35/158.65 = 6.19). The sample is very small here and certainly less than the number of observations required to permit the use of the standard normal distribution. There may also be omitted variables here, such as wealth, the real interest rate, the inflation rate, etc. The functional form may be misspecified, and there may be errors in variables (permanent income). Perhaps most seriously, there is simultaneous causality present, given the GDP identity.

(c) Dividing both sides of the identity by GDP results in $1 \equiv \frac{C_t}{Y_t} + \frac{I_t}{Y_t} + \frac{G_t}{Y_t}$. With the APC falling over time as income increased, either the investment-output ratio or the government-output ratio would have to make up for this fall. The likely candidate was the government-expenditure share.

(d) This is the standard errors-in-variables problem discussed in the textbook. Following the derivation in footnote 2 in the textbook, it is straightforward to show

$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} \beta_1$,

where X is permanent income, and w is the measurement error in income. Hence the marginal propensity to consume will be downward biased, or $\hat{\beta}_1 < k$ in large samples. For the intercept we get

$\hat{\beta}_0 = \bar{\tilde{Y}} - \hat{\beta}_1 \bar{\tilde{X}} = \beta_0 + \beta_1 \bar{X} + \bar{u} + \bar{v} - \hat{\beta}_1 \bar{\tilde{X}}$,

and, using $\bar{X} = \bar{\tilde{X}} - \bar{w}$, collecting terms results in

$\hat{\beta}_0 = \beta_0 - (\hat{\beta}_1 - \beta_1) \bar{\tilde{X}} + \bar{u} + \bar{v} - \beta_1 \bar{w}$.

The sample means of the mean-zero errors converge in probability to zero. Therefore

$\hat{\beta}_0 \xrightarrow{p} \beta_0 + \beta_1 \frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2} \mu_X$,

since $\hat{\beta}_1 - \beta_1 \xrightarrow{p} -\beta_1 \frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2}$ and $\bar{\tilde{X}} \xrightarrow{p} \mu_X$. With $\beta_0 = 0$ and $\beta_1 = k$ here, the limit is positive because mean permanent income is positive. Hence the intercept in the consumption function will be upward biased.
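The two large-sample results in (d) can be checked numerically. The following is a minimal simulation sketch and not part of the original answer key: the values chosen for k, for the mean and standard deviation of permanent income, and for the error standard deviations are hypothetical, and `ols` and `simulate_consumption` are illustrative helper names.

```python
import random


def ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate_consumption(n=100_000, k=0.9, mu_x=100.0, sd_x=10.0,
                         sd_w=10.0, sd_v=5.0, sd_u=2.0, seed=0):
    """Regress measured consumption on measured income (all values assumed)."""
    rng = random.Random(seed)
    income_obs, cons_obs = [], []
    for _ in range(n):
        income_perm = rng.gauss(mu_x, sd_x)               # permanent income
        cons_perm = k * income_perm + rng.gauss(0, sd_u)  # permanent consumption
        income_obs.append(income_perm + rng.gauss(0, sd_w))  # add transitory income
        cons_obs.append(cons_perm + rng.gauss(0, sd_v))      # add transitory consumption
    return ols(income_obs, cons_obs)


intercept, slope = simulate_consumption()

# Large-sample values implied by the formulas in (d) for these assumed inputs.
shrink = 10.0 ** 2 / (10.0 ** 2 + 10.0 ** 2)       # sigma_X^2 / (sigma_X^2 + sigma_w^2)
slope_limit = 0.9 * shrink                          # k times the shrinkage factor
intercept_limit = 0.9 * (1.0 - shrink) * 100.0      # k * mu_X * sigma_w^2 / (sigma_X^2 + sigma_w^2)
print(slope, slope_limit, intercept, intercept_limit)
```

With equal variances for permanent income and its measurement error, the estimated MPC should settle near half of k, and the intercept should be clearly positive even though the true consumption function passes through the origin.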

4) The Phillips curve is a relationship in macroeconomics between the inflation rate (inf) and the unemployment rate (ur). Estimating the Phillips curve using quarterly data for the United States from 1962:I to 1995:IV, you find

$\widehat{Inf}_t = 4.08 + 0.118\, ur_t$, $R^2 = 0.003$, SER = 3.148
(1.11) (0.176)

(a) Explain why, at first glance, this is a surprising result. (b) Do you think that there is omitted variable bias in the regression? (c) What other threats to internal validity may be present? (d) If you could find a proper specification for the Phillips curve using United States data, what external validity criteria would you suggest?
Answer: (a) There is supposed to be a negative relationship between inflation and unemployment, but the estimated slope is positive. (b) The omitted variables are inflationary expectations and the natural rate of unemployment. (c) There is simultaneous causality in that inflation also causes employment and thereby unemployment in many models. The functional form is most likely incorrect, since the Phillips curve is typically not shown as a straight line. There may also be omitted variables in the form of supply side shocks. (d) The most obvious choice would be to estimate the Phillips curve for other countries. It is also possible to estimate the Phillips curve for a cross-section of countries. Using state data is more problematic since state unemployment rates vary, but inflation rates are very similar and only exist for certain cities (using the CPI).
5) You have decided to analyze the year-to-year variation in temperature data. Specifically you want to use this year’s temperature to predict next year’s temperature for certain cities. As a result, you collect the daily high temperature (Temp) for 100 randomly selected days in a given year for three United States cities: Boston, Chicago, and Los Angeles. You then repeat the exercise for the following year.
The regression results are as follows (heteroskedasticity-robust standard errors in parentheses):

$\widehat{Temp}^{BOS}_t = 18.19 + 0.75 \times Temp^{BOS}_{t-1}$; $R^2 = 0.62$, SER = 12.33
(6.46) (0.10)

$\widehat{Temp}^{CHI}_t = 2.47 + 0.95 \times Temp^{CHI}_{t-1}$; $R^2 = 0.93$, SER = 5.85
(3.98) (0.05)

$\widehat{Temp}^{LA}_t = 37.54 + 0.44 \times Temp^{LA}_{t-1}$; $R^2 = 0.18$, SER = 7.17
(15.33) (0.22)

(a) What is the prediction of the above regression for Los Angeles if the temperature in the previous year was 75 degrees? What would be the prediction for Boston? (b) Assume that the previous year’s temperature gives accurate predictions, on average, for this year’s temperature. What values would you expect in this case for the intercept and slope? Sketch how each of the above regressions behaves compared to this line. (c) After reflecting on the results a bit, you consider the following explanation for the above results. Daily high temperatures on any given date are measured with error in the following sense: for any given day in any of the three cities, say January 28, there is a true underlying seasonal temperature ($X$), but each year there are different temporary weather patterns ($v$, $w$) which result in a temperature $\tilde{X}$ different from $X$. For the two years in your data set, the situation can be described as follows:

$\tilde{X}_{t1} = X + v_t$ and $\tilde{X}_{t2} = X + w_t$

Subtracting $\tilde{X}_{t1}$ from $\tilde{X}_{t2}$, you get $\tilde{X}_{t2} = \tilde{X}_{t1} + w_t - v_t$. Hence the population parameters for the intercept and slope are zero and one, as expected. Show that the OLS estimator for the slope is inconsistent, where

$\hat{\beta}_1 \xrightarrow{p} 1 - \frac{\sigma_v^2}{\sigma_X^2 + \sigma_v^2}$

(d) Use the formula above to explain the differences in the results for the three cities. Is your mathematical explanation intuitively plausible? Answer: (a) The prediction for Los Angeles is 70.5 degrees, and for Boston 74.4 degrees. (b) In that case, the intercept would be zero, and the slope one.

(d) Rewriting $\hat{\beta}_1 \xrightarrow{p} 1 - \frac{\sigma_v^2}{\sigma_X^2 + \sigma_v^2}$ as $\hat{\beta}_1 \xrightarrow{p} 1 - \frac{1}{1 + \sigma_X^2 / \sigma_v^2}$ suggests that the slope in the temperature regression will be closer to one, the more variation there is in the underlying “true” temperature. Temperatures in Los Angeles vary the least throughout the year, and you would therefore expect the largest bias. The slope for Chicago suggests that temperatures there have the most variation. The standard deviation for the Boston temperature is 19.5 and for Chicago 21.0. However, these are actual temperature standard deviations. To calculate the variance of X in the above example, you could collect data over a 100-year period on the same dates and form daily averages. It is the standard deviation of these temperatures that would most resemble the standard deviation in X.
6) A study of United States and Canadian labor markets shows that aggregate unemployment rates between the two countries behaved very similarly from 1920 to 1982, when a two percentage point gap opened between the two countries, which has persisted over the last 20 years. To study the causes of this phenomenon, you specify a regression of Canadian unemployment rates on demographic variables, aggregate demand variables, and labor market characteristics. (a) Assume that your analysis is internally valid. What would make it externally valid? (b) If one of the determinants of Canadian unemployment is aggregate United States economic activity (or perhaps shocks to it), what variable would you suggest as its replacement if you did a similar study for the United States? (c) Certain Canadian geographical areas, such as the prairies and British Columbia, seem particularly sensitive to commodity price shocks (Edmonton’s NHL team is called the Edmonton Oilers). Having collected provincial data, you establish a relationship between provincial unemployment rates and commodity price changes (shocks). How would you address external validity now?
Answer: (a) Threats to external validity come from the difference between the population and settings studied versus the population and settings of interest. Finding, for example, that the variables which characterize the unemployment insurance system exert an influence on Canadian unemployment does not automatically imply that this holds universally. To obtain external validity, the exercise should be repeated for other geographic units, such as countries or states. If the coefficients are similar, or differences in coefficients can be explained, then the study is externally valid.
(b) Shocks to world aggregate demand, or to the major trading partners of the United States, would be a possibility. (c) The task is to find geographical units that are also sensitive to commodity price changes. Texas, Louisiana, and Oklahoma would be candidates for obtaining external validity.
7) Several authors have tried to measure the “persistence” in U.S. state unemployment rates by running the following regression:

$ur_{i,t} = \beta_0 + \beta_1 \times ur_{i,t-k} + z_{i,t}$

where ur is the state unemployment rate, i is the index for the i-th state, t indicates a time period, and typically k ≥ 10. (a) Explain why finding a slope estimate of one and an intercept of zero is typically interpreted as evidence of “persistence.” (b) You collect data on the 48 contiguous U.S. states’ unemployment rates and find the following estimates:

$\widehat{ur}_{i,1995} = 2.25 + 0.60 \times ur_{i,1970}$; $R^2 = 0.40$, SER = 0.90
(0.61) (0.13)

Interpret the regression results. (c) Analyze the accompanying figure and interpret the observations for Maryland and for Washington. Do you find evidence of persistence? How would you test for it?


(d) One of your peers points out that this result makes little sense, since it implies that eventually all states would have identical unemployment rates. Explain the argument. (e) Imagine that state unemployment rates were determined by their natural rates and some transitory shock. The natural rates themselves may be functions of the unemployment insurance benefits of the state, unionization rates of its labor force, demographics, sectoral composition, etc. The transitory components may include state-specific shocks to its terms of trade such as raw material movements and demand shocks from the other states. You specify the i-th state unemployment rate accordingly as follows for the two periods when you observe it,

$\tilde{X}_{i,t} = X_i + v_{i,t}$ and $\tilde{X}_{i,t-k} = X_i + w_{i,t-k}$,

so that actual unemployment rates are measured with error. You have also assumed that the natural rate is the same for both periods. Subtracting the second period from the first then results in the following population regression function:

$\tilde{X}_{i,t} = 0 + 1 \times \tilde{X}_{i,t-k} + (v_{i,t} - w_{i,t-k})$

It is not too hard to show that estimation of the observed unemployment rate in period t on the unemployment rate in period (t-k) by OLS results in an estimator for the slope coefficient that is biased towards zero. The formula is

$\hat{\beta}_1 \xrightarrow{p} 1 - \frac{\sigma_v^2}{\sigma_X^2 + \sigma_v^2}$

Using this insight, explain over which periods you would expect the slope to be closer to one, and over which period it should be closer to zero. (f) Estimating the same regression for a different time period results in

$\widehat{ur}_{i,1995} = 3.19 + 0.27 \times ur_{i,1985}$; $R^2 = 0.21$, SER = 1.03
(0.56) (0.07)

If your above analysis is correct, what are the implications for this time period?
Answer: (a) This result would imply that states with high (low) unemployment rates in the (t-k) period would have high (low) unemployment rates in period t. Hence high (low) unemployment rates would persist. (b) A state which had an unemployment rate of 3 percent in 1970 is predicted to have an unemployment rate of approximately 4 percent in 1995. If the state had a 7 percent unemployment rate in 1970, then the prediction becomes approximately 6.5 percent. There is no interpretation for the constant. The regression explains 40 percent of the variation in state unemployment rates in 1995. (c) Washington had the highest unemployment rate in 1970, namely above 9 percent. There are several states in 1995 that have higher unemployment rates. Washington seems to have reverted towards the mean unemployment rate of all states. Maryland had a relatively low unemployment rate in 1970 (about 3.5 percent), but has a relatively higher unemployment rate in 1995. It also has reverted towards the mean. (d) The positive intercept and the slope between zero and one imply that high (low) unemployment rate states will have high (low) unemployment rates in the future, but that they will not be as high (low) as in the base period. Hence there is mean reversion. The prediction would be that ultimately all states would end up with identical unemployment rates. However, unemployment rate differences should persist if there are differences in the natural rates of the state unemployment rates. These may be due to different sectoral compositions, unemployment insurance benefits, tax rates, etc. Unless states were identical with regard to these variables, then unemployment rates should differ.

(e) Noting that $\hat{\beta}_1 \xrightarrow{p} 1 - \frac{\sigma_v^2}{\sigma_X^2 + \sigma_v^2}$ can be rewritten as $\hat{\beta}_1 \xrightarrow{p} 1 - \frac{1}{\sigma_X^2 / \sigma_v^2 + 1}$, you would expect $\hat{\beta}_1$ to lie closer to one over time periods when natural rate variations dominate the transitory deviation of state unemployment rates from their natural rates. Therefore if you attempted to predict the unemployment rates in the mid 1980s from those in the mid 1970s, then the slope coefficient should be further away from one. (There are several studies that have found virtually no persistence in state unemployment rates over this period.) (f) Following the previous argument, the result suggests that there were more transitory deviations from the natural rate over this period. The large drop in oil prices, particularly in 1986, comes to mind.
8) Sir Francis Galton (1822-1911), an anthropologist and cousin of Charles Darwin, created the term regression. In his article “Regression towards Mediocrity in Hereditary Stature,” Galton compared the height of children to that of their parents, using a sample of 930 adult children and 205 couples. In essence he found that tall (short) parents will have tall (short) offspring, but that the children will not be quite as tall (short) as their parents, on average. Hence there is regression towards the mean, or as Galton referred to it, mediocrity. This result is obviously a fallacy if you attempted to infer behavior over time since, if true, the variance of height in humans would shrink over generations. This is not the case. (a) To research this result, you collect data from 110 college students and estimate the following relationship:

$\widehat{Studenth} = 19.6 + 0.73 \times Midparh$, $R^2 = 0.45$, SER = 2.0
(7.2) (0.10)

where Studenth is the height of students in inches and Midparh is the average of the parental heights. Values in parentheses are heteroskedasticity-robust standard errors. Sketching this regression line together with the 45 degree line, explain why the above results suggest “regression to the mean” or “mean reversion.” (b) Researching the medical literature, you find that height depends, to a large extent, on one gene (“phog”) and on environmental influences. Let us assume that parents and offspring have the same invariant (over time) gene and that actual height is therefore measured with error in the following sense,

$\tilde{X}_{i,o} = X_i + v_{i,o}$ and $\tilde{X}_{i,p} = X_i + w_{i,p}$,

where $\tilde{X}$ is measured height, $X$ is the height given through the gene, v and w are environmental influences, and the subscripts o and p stand for offspring and parents, respectively. Let the environmental influences be independent from each other and from the gene. Subtracting the measured height of parents from the measured height of offspring, what sort of population regression function do you expect? (c) How would you test for the two restrictions implicit in the population regression function in (b)? Can you tell from the results in (a) whether or not the restrictions hold? (d) Proceeding in a similar way to the proof in your textbook, you can show that

$\hat{\beta}_1 \xrightarrow{p} 1 - \frac{\sigma_v^2}{\sigma_X^2 + \sigma_v^2}$

for the situation in (b). Discuss under what conditions you will find a slope closer to one for the height comparison. Under what conditions will you find a slope closer to zero? (e) Can you think of other examples where Galton’s Fallacy might apply? Answer: (a) As can be seen in the accompanying graph, the regression line crosses the 45 degree line. Tall (short) parents will have tall (short) children, but on average, they will not be as tall (short) as their parents. Hence they will regress to the mean, or mean revert.


(b) $\tilde{X}_{i,o} = 0 + 1 \times \tilde{X}_{i,p} + (v_{i,o} - w_{i,p})$ (c) You would have to test simultaneously whether the intercept is zero and the slope is one. This requires an F-test. Analyzing the t-statistics above suggests rejection of both hypotheses. However, testing the hypotheses sequentially is not the same as testing them simultaneously. (d) The above expression can be rewritten as $\hat{\beta}_1 \xrightarrow{p} 1 - \frac{1}{\sigma_X^2 / \sigma_v^2 + 1}$. The slope will be close to unity if there is no measurement error, or if the variance in the gene is relatively large compared to the measurement error. It will be closer to zero when the environmental influences are large relative to the variance in the gene. (e) Answers will vary by student. There are many examples of Galton’s Fallacy, some of which have been used in the test bank (state unemployment rates in year t when compared to year t-k; temperatures in a given city this year compared to the previous year; grade received in the final examination relative to the midterm grade; mutual fund performance this year versus last year; convergence regressions; sports performance this year compared to the previous year, etc.).
9) Macroeconomists who study the determinants of per capita income (the “wealth of nations”) have been particularly interested in finding evidence on conditional convergence in the countries of the world. Finding such a result would imply that all countries would end up with the same per capita income once other variables such as saving and population growth rates, education, government policies, etc., took on the same value. Unconditional convergence, on the other hand, does not control for these additional variables. (a) The results of the regression for 104 countries were as follows,

$\widehat{g6090} = 0.019 - 0.0006 \times RelProd_{60}$, $R^2 = 0.00007$, SER = 0.016
(0.004) (0.0073)

where g6090 is the average annual growth rate of GDP per worker for the 1960-1990 sample period, and RelProd60 is GDP per worker relative to the United States in 1960.
For the 24 OECD countries in the sample, the output is

$\widehat{g6090} = 0.048 - 0.0404 \times RelProd_{60}$, $R^2 = 0.82$, SER = 0.0046
(0.004) (0.0063)

Interpret the results and point out the difference with regard to unconditional convergence. (b) The “beta-convergence” regressions in (a) are of the following type,

$\frac{\Delta_t \ln Y_{i,t}}{T} = \beta_0 + \beta_1 \ln Y_{i,0} + u_{i,t}$,

where $\Delta_t \ln Y_{i,t} = \ln Y_{i,t} - \ln Y_{i,0}$, t and 0 refer to two time periods, and i is the i-th country. Explain why a significantly negative slope implies convergence (hence the name). (c) The equation in (b) can be rewritten without any change in information as (ignoring the division by T)

$\ln Y_t = \beta_0 + \gamma_1 \ln Y_0 + u_t$

In this form, how would you test for unconditional convergence? What would be the implication for convergence if the slope coefficient were one? (d) Let’s write the equation in (c) as follows:

$\tilde{Y}_t = \beta_0 + \gamma_1 \tilde{Y}_0 + u_t$

and assume that the “~” variables contain measurement errors of the following type,

$\tilde{Y}_{i,t} = Y^*_t + v_{i,t}$ and $\tilde{Y}_{i,0} = Y^*_0 + w_{i,0}$,

where the “*” variables represent true, or permanent, per capita income components, while v and w are temporary or transitory components. Subtraction of the initial period from the current period then results in

$\tilde{Y}_{i,t} = (Y^*_t - Y^*_0) + \tilde{Y}_{i,0} + (v_{i,t} - w_{i,0})$

Ignoring, without loss of generality, the constant in the above equation, and making standard assumptions about the error term, one can show that by regressing current per capita income on a constant and the initial period per capita income, the slope behaves as follows:

$\hat{\gamma}_1 \xrightarrow{p} 1 - \frac{\sigma_v^2}{\sigma_{Y^*}^2 + \sigma_v^2}$

Discuss the implications for the convergence results above.
Answer: (a) There is evidence for unconditional convergence among the OECD countries, but not for the countries of the world as a whole. Only for the OECD countries is the slope coefficient significantly different from zero. (b) A significantly negative slope coefficient implies that countries which were further behind initially grow faster subsequently. Hence these countries will eventually converge. (c) Ignoring T above, $\gamma_1 = \beta_1 + 1$. Hence for convergence to occur, $\gamma_1$ has to be significantly less than unity. If it were unity, then there would be no convergence or mean reversion. (d) If Y is measured with error, perhaps due to a temporary difference resulting from a shock during the initial year of measurement, then the estimated slope will be biased downward, i.e., the regression will indicate convergence when there is none in truth.
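The point of (d) can be illustrated with a small simulation. This is a hedged sketch, not taken from the test bank: it assumes that each country's permanent (log) income is identical in both periods, so there is no true convergence, and the two standard deviations are hypothetical values. `slope_on_initial` is an illustrative helper name.

```python
import random
import statistics


def slope_on_initial(y0, growth):
    """OLS slope from regressing growth on the initial level y0."""
    m0 = statistics.fmean(y0)
    mg = statistics.fmean(growth)
    sxy = sum((a - m0) * (g - mg) for a, g in zip(y0, growth))
    sxx = sum((a - m0) ** 2 for a in y0)
    return sxy / sxx


rng = random.Random(1)
n = 50_000
sd_perm, sd_trans = 1.0, 0.5          # assumed permanent and transitory std. deviations

perm = [rng.gauss(0.0, sd_perm) for _ in range(n)]   # permanent component, unchanged over time
y0 = [p + rng.gauss(0.0, sd_trans) for p in perm]    # measured initial income
yt = [p + rng.gauss(0.0, sd_trans) for p in perm]    # measured current income
growth = [b - a for a, b in zip(y0, yt)]

beta_hat = slope_on_initial(y0, growth)
beta_theory = -sd_trans ** 2 / (sd_perm ** 2 + sd_trans ** 2)   # equals gamma limit minus one
var_ratio = statistics.pvariance(yt) / statistics.pvariance(y0)
print(beta_hat, beta_theory, var_ratio)
```

Under these assumptions the growth regression produces a clearly negative slope even though the cross-sectional variance of income is the same in both periods, which is the sense in which the estimated "convergence" is spurious.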


10) One of the most frequently used summary statistics for the performance of a baseball hitter is the so-called batting average. In essence, it calculates the percentage of hits in the number of opportunities to hit (appearances “at the plate”). The management of a professional team has hired you to predict next season’s performance of a certain hitter who is up for a contract renegotiation after a particularly great year. To analyze the situation, you search the literature and find a study which analyzed players who had at least 50 at bats in 1998 and 1997. There were 379 such players. (a) The reported regression line in the study is

$\widehat{Batavg}^{1998}_i = 0.138 + 0.467 \times Batavg^{1997}_i$; $R^2 = 0.17$

and the intercept and slope are both statistically significant. What does the regression imply about the relationship between past performance and present performance? What values would the slope and intercept have to take on for the future performance to be as good as the past performance, on average? (b) Being somewhat puzzled about the results, you call your econometrics professor and describe the results to her. She says that she is not surprised at all, since this is an example of “Galton’s Fallacy.” She explains that Sir Francis Galton regressed the height of offspring on the mid-height of their parents and found a positive intercept and a slope between zero and one. He referred to this result as “regression towards mediocrity.” Why do you think econometricians refer to this result as a fallacy? (c) Your professor continues by mentioning that this is an example of errors-in-variables bias. What does she mean by that in general? In this case, why would batting averages be measured with error? Are baseball statisticians sloppy? (d) The top three performers in terms of highest batting averages in 1997 were Tony Gwynn (.372), Larry Walker (.366), and Mike Piazza (.362). Given your answers for the previous questions, what would be your predictions for the 1998 season?
Answer: (a) The regression implies mean reversion: those players who had a high (low) average in 1997 will have a high (low) average in 1998, but it will not be as high (low) as before. If the performance was as good or bad as in the past, then the intercept would have to be zero and the slope one. (b) If the result were true, then eventually everyone would be of the same height. (c) Errors-in-variables bias refers to a situation where variables are not measured precisely, but contain a measurement error. In this situation, the player may have had an extraordinarily good or bad year, resulting, perhaps, from an injury, adjustments to a new league, a new city, etc. This results in a measurement error of his underlying ability. It has nothing to do with not measuring the batting average correctly. (d) The forecast would be for Tony Gwynn to bat (.312), Larry Walker (.309), and Mike Piazza (.307).
11) Your textbook compares the results of a regression of test scores on the student-teacher ratio using a sample of school districts from California and from Massachusetts. Before standardizing the test scores for California, you get the following regression result:

$\widehat{TestScr} = 698.9 - 2.28 \times STR$, n = 420, $R^2 = 0.051$, SER = 18.6

In addition, you are given the following information: the sample mean of the student-teacher ratio is 19.64 with a standard deviation of 1.89, and the standard deviation of the test scores is 19.05.

After standardizing the test scores variable and running the regression again, what is the value of the slope? What is the meaning of this new slope here (interpret the result)?

What will be the new intercept? Now that test scores have been standardized, should you interpret the intercept?

Does the regression R2 change between the two regressions? What about the t-statistic for the slope estimator?

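Answer: a. Standardization of a variable is a simple linear transformation,

$Y_i^* = \frac{Y_i - \bar{Y}}{s_Y} = -\frac{\bar{Y}}{s_Y} + \frac{1}{s_Y} Y_i = a + b\, Y_i$ (say).

Hence the new regression slope will be

$\hat{\beta}_1^* = \frac{\sum_{i=1}^n y_i^* x_i}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n Y_i^* x_i}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n (a + b Y_i) x_i}{\sum_{i=1}^n x_i^2} = b\, \frac{\sum_{i=1}^n Y_i x_i}{\sum_{i=1}^n x_i^2} = b \times \hat{\beta}_1$,

where lowercase letters denote deviations from sample means, so that $\sum_{i=1}^n x_i = 0$ and the term in a drops out.

The numerical value of the new slope is –2.28/19.05 = –0.12. The interpretation is as follows: if you decrease the student-teacher ratio by one, then test scores improve by 0.12 of a standard deviation of test scores, or 0.12 × 19.05 = 2.29 points (the difference from 2.28 is due to rounding).

b. Since the standardized test score has a sample mean of zero, the intercept will be

$\hat{\beta}_0^* = \bar{Y}^* - \hat{\beta}_1^* \bar{X} = 0 - (-2.28/19.05) \times 19.64$,

or, in this case, 2.35. Mathematically speaking, the intercept continues to represent the (standardized) test score when the student-teacher ratio is zero. This does not make sense and it is best not to interpret the intercept.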

c. Performing a linear transformation on the regressand (or the regressor for that matter) does not change the regression R2 . It is easy but tedious to show that it is unaffected. Intuitively this makes sense since otherwise you could affect the goodness of fit by whim (changing the scale of the data). Similarly, logic dictates that the t-statistic is unaffected.
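The invariance claims in (a) through (c) can be verified on artificial data. The sketch below is illustrative only: the data are randomly generated to resemble the summary statistics quoted in the question (they are not the California data set), `fit` is a hypothetical helper, and the t-statistic uses a homoskedasticity-only standard error for simplicity.

```python
import math
import random
import statistics


def fit(x, y):
    """Simple OLS of y on x: returns (intercept, slope, r2, t_slope).

    The slope t-statistic uses a homoskedasticity-only standard error.
    """
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    ssr = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    r2 = 1.0 - ssr / syy
    se_slope = math.sqrt(ssr / (n - 2) / sxx)
    return intercept, slope, r2, slope / se_slope


rng = random.Random(42)
# Made-up data loosely calibrated to the numbers in the question.
strs = [rng.gauss(19.64, 1.89) for _ in range(420)]
scores = [698.9 - 2.28 * s + rng.gauss(0.0, 18.6) for s in strs]

m_y = statistics.fmean(scores)
s_y = statistics.stdev(scores)
z_scores = [(v - m_y) / s_y for v in scores]      # standardized dependent variable
m_x = statistics.fmean(strs)

b0, b1, r2, t1 = fit(strs, scores)
b0_z, b1_z, r2_z, t1_z = fit(strs, z_scores)
print(b1, b1_z, b0_z, r2, r2_z, t1, t1_z)
```

In this sketch the standardized slope equals the original slope divided by the sample standard deviation of the scores, the standardized intercept equals minus the new slope times the mean of the regressor, and both the R2 and the t-statistic are unchanged up to floating-point error.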


12) Suppose that you have just read a review of the literature of the effect of beauty on earnings. You were initially surprised to find a mild effect of beauty even on teaching evaluations at colleges. Intrigued by this effect, you consider explanations as to why more attractive individuals receive higher salaries. One of the possibilities you consider is that beauty may be a marker of performance/productivity. As a result, you set out to test whether or not more attractive individuals receive higher grades (cumulative GPA) at college. You happen to have access to individuals at two highly selective liberal arts colleges nearby. One of these specializes in Economics and Government and incoming students have an average SAT of 2,100; the other is known for its engineering program and has an incoming SAT average of 2,200. Conducting a survey, where you offer students a small incentive to answer a few questions regarding their academic performance, and taking a picture of these individuals, you establish that there is no relationship between grades and beauty. Write a short essay using some of the concepts of internal and external validity to determine if these results are likely to apply to universities in general. Answer: Students will consider various points that pose a threat to internal and external validity. Obviously there is a difference in populations (external validity) between highly selective liberal arts colleges and universities in general. SAT scores at these colleges are much higher than for the average university. In addition, the gender composition may be quite different, especially for engineering school, where males dominate in terms of student numbers. Even in economics, the ratio of female to male students is typically 1:2. This is an example of sample selection bias (internal validity). Other potential problems with this study may include errors-in-variables from students not reporting the correct GPA. 
However, this may not be a severe problem since GPA is the dependent variable. There could be a problem if there are systematic problems in inflating the GPA for lower GPAs. It is also not clear from the setup how beauty was judged. If judges were chosen who are friends of the individuals, then their judgments may be biased, which is more severe since beauty is an explanatory variable. The setup also does not indicate what the control variables are. In the absence of controls, there will be omitted variable bias (internal validity) since intelligence will clearly be a determining factor of cumulative GPAs.


9.3 Mathematical and Graphical Problems 1) Your textbook gives the following example of simultaneous causality bias of a two equation system:

$Y_i = \beta_0 + \beta_1 X_i + u_i$
$X_i = \gamma_0 + \gamma_1 Y_i + v_i$

In microeconomics, you studied the demand and supply of goods in a single market. Let the demand ($Q_i^D$) and supply ($Q_i^S$) for the i-th good be determined as follows,

$Q_i^D = \beta_0 - \beta_1 P_i + u_i$,

$Q_i^S = \gamma_0 - \gamma_1 P_i + v_i$,

where P is the price of the good. In addition, you typically assume that the market clears. Explain how the simultaneous causality bias applies in this situation. The textbook explained a positive correlation between $X_i$ and $u_i$ for $\gamma_1 > 0$ through an argument that started from “imagine that $u_i$ is negative.” Repeat this exercise here.
Answer: Although quantities appear on the left-hand side of both equations, this is a system of two equations in two unknowns, where quantity and price are determined simultaneously by demand and supply. A negative $u_i$, call it a “demand shock,” decreases the quantity demanded. Since demand equals supply, this results in a lower quantity traded, and hence a lower price. (At the old price level, there would now be excess supply, and hence the price would fall.) The negative $u_i$ has therefore resulted in a lower price, and hence the error term in the demand equation is positively correlated with the price in the same equation.


2) The errors-in-variables model analyzed in the text results in

$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} \beta_1$

so that the OLS estimator is inconsistent. Give a condition involving the variances of X and w, under which the bias towards zero becomes small.

Answer: $\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} \beta_1 = \frac{1}{1 + \sigma_w^2 / \sigma_X^2} \beta_1$. Hence if the variance of X is large relative to the variance of w, so that variation in the variable measured with error is dominated by the unobserved component, then the bias becomes small. Also, if there is no measurement error, then $\sigma_w^2 = 0$, and the bias disappears.
3) You have been hired as a consultant by a building contractor who has been sued by the owners’ representatives of a large condominium project for shoddy construction work. In order to assess the damages for the various units, the owners’ association sent out a letter to owners and asked if people were willing to make their units available for destructive testing. Destructive testing was conducted in some of these units as a result of the responses. Based on the tests, the owners’ association inferred the damage over the entire condo complex. Do you think that the inference is valid in this case? Discuss how proper sampling should proceed in this situation.
Answer: This is clearly a case of sample selection bias which leads to bias in the OLS estimator in general. It should be clear that inference cannot be conducted properly, since owners who suspect that their unit is faulty are much more likely to agree to destructive testing of their unit than those who have not experienced any problems. The proportion of units assumed to be faulty in the population is bound to be too large when derived through sampling of this type. The proper sampling method would be to decide on the units to be tested through random sampling. A random number generator should be used to determine the sampled units. The owners’ association must guarantee that the randomly selected units are available for destructive testing.
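The contrast between volunteer-based and random sampling in question 3 can be sketched in a few lines. Everything numerical here is a hypothetical assumption (the number of units, the true share of faulty units, and the two volunteering rates); the code only illustrates the direction of the bias and the use of a random number generator to pick units.

```python
import random

rng = random.Random(7)
n_units = 2_000
p_faulty = 0.20                       # assumed true share of faulty units
faulty = [rng.random() < p_faulty for _ in range(n_units)]

# Self-selection: owners of faulty units volunteer far more often (assumed rates).
p_volunteer = {True: 0.60, False: 0.10}
volunteers = [i for i in range(n_units) if rng.random() < p_volunteer[faulty[i]]]

# Proper design: draw the same number of units at random from the full list.
random_sample = rng.sample(range(n_units), k=len(volunteers))


def share_faulty(units):
    """Fraction of the given unit indices that are faulty."""
    return sum(faulty[i] for i in units) / len(units)


true_share = sum(faulty) / n_units
volunteer_share = share_faulty(volunteers)
random_share = share_faulty(random_sample)
print(true_share, volunteer_share, random_share)
```

Under these assumed rates the volunteer sample overstates the share of faulty units by a wide margin, while the random sample stays close to the true share.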


4) Assume that a simple economy could be described by the following system of equations, Ct = β0 + β1 Yt + ui It = I , where C is consumption, Y is income, and I is investment. (This may be a primitive island society which does not trade with other islands. There is no government, and the only good consumed and invested (saved) is sunflower seeds.) Assume the presence of the GDP identity, Y = C + I. If you estimated the consumption function, what sort of problem involving internal validity may be present? Answer: There is simultaneous causality present in the system. Income causes consumption, which in return causes income (GDP). A negative consumption “shock,” ut, causes consumption, and hence aggregate demand, to fall. With lower aggregate demand, not all goods supplied are being sold in the market, and hence income (Yt) falls. There is therefore a positive correlation between ut and Yt, i.e., the error term and the regressor are correlated. 5) Your professor wants to measure the class’s knowledge of econometrics twice during the semester, once in a midterm and once in a final. Assume that your performance, and that of your peers, on the day of your midterm exam only measure knowledge imperfectly and with an error,

$$\tilde{X}_i^1 = X_i^1 + w_i^1,$$

where $\tilde{X}$ is your exam grade, $X$ is underlying econometrics knowledge, and $w$ is a random error with mean zero and variance $\sigma_w^2$. $w$ may depend on whether you have a headache that day, whether or not the questions you had prepared for appeared on the exam, your mood, etc. A similar situation holds for the final, which is exam two:

$$\tilde{X}_i^2 = X_i^2 + w_i^2.$$

What would happen if you ran a regression of grades received by students in the final on midterm grades?
Answer: This is a typical errors-in-variables problem, which results in a downward biased estimator of the slope. Subtracting the first equation from the second results in

$$\tilde{X}_i^2 = (X_i^2 - X_i^1) + \tilde{X}_i^1 + (w_i^2 - w_i^1).$$

If underlying econometrics knowledge at each exam did not change, then the regression should have a slope of one and a zero intercept. (Alternatively, you can allow for an intercept.) The main point here is that the performance during the first exam is only an imperfect measure of econometric ability, meaning that there is measurement error. This results in a correlation between the error term and the regressor, and the OLS estimator will be inconsistent:

$$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1 = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2} < 1,$$

where the equality uses $\beta_1 = 1$, and so the regression will display mean reversion: students with high (low) midterm scores will most likely have high (low) scores in the final, but they will not be quite as high (low) as in the midterm.
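The mean reversion result in question 5 can be checked with a small simulation. The sketch below is illustrative only: the grade scale, the standard deviations, the sample size, and the function names are assumptions, not values from the test bank. Underlying knowledge is held fixed between exams, so the true slope is one, and the OLS slope should land near $\sigma_X^2/(\sigma_X^2+\sigma_w^2)$.

```python
import random

def ols_slope(x, y):
    """OLS slope from a regression of y on x (intercept included)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_exam_slope(n=20000, sd_knowledge=10.0, sd_error=5.0, seed=0):
    """Regress noisy final grades on noisy midterm grades (hypothetical numbers)."""
    rng = random.Random(seed)
    knowledge = [rng.gauss(70.0, sd_knowledge) for _ in range(n)]
    midterm = [k + rng.gauss(0.0, sd_error) for k in knowledge]  # exam one
    final = [k + rng.gauss(0.0, sd_error) for k in knowledge]    # exam two
    return ols_slope(midterm, final)

slope = simulate_exam_slope()
# sigma_X^2 / (sigma_X^2 + sigma_w^2) with the assumed standard deviations
attenuation = 10.0 ** 2 / (10.0 ** 2 + 5.0 ** 2)
print(slope, attenuation)
```

With these assumed variances the attenuation factor is 0.8, so a student ten points above the mean on the midterm is predicted to be only about eight points above the mean on the final.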


6) Consider the one-variable regression model, $Y_i = \beta_0 + \beta_1 X_i + u_i$, where the usual assumptions from Chapter 4 are satisfied. However, suppose that both Y and X are measured with error, $\tilde{Y}_i = Y_i + z_i$ and $\tilde{X}_i = X_i + w_i$. Let both measurement errors be i.i.d. and independent of both Y and X respectively. If you estimated the regression model $\tilde{Y}_i = \beta_0 + \beta_1 \tilde{X}_i + v_i$ using OLS, then show that the slope estimator is not consistent.
Answer: The difference from the example used in section 7.2 of the text is that both the regressor and the dependent variable are measured with error here. Proceeding along the lines in section 7.2, you can write the population regression equation $Y_i = \beta_0 + \beta_1 X_i + u_i$ in terms of the imprecisely measured variables

$$\tilde{Y}_i = \beta_0 + \beta_1 \tilde{X}_i + [\beta_1 (X_i - \tilde{X}_i) + (\tilde{Y}_i - Y_i) + u_i] = \beta_0 + \beta_1 \tilde{X}_i + v_i,$$

where $v_i = z_i - \beta_1 w_i + u_i$. Hence the dependent variable being measured with error does not cause additional problems relative to the case discussed in the textbook, but the error term continues to be correlated with the regressor. As a matter of fact, it is easiest to combine this measurement error with the population regression error term, i.e., $u_i^* = z_i + u_i$, in which case the derivation shown in Chapter 7 footnote 2 of the textbook holds after making this small adjustment. Note that $\mathrm{cov}(\tilde{X}_i, u_i^*) = \mathrm{cov}(\tilde{X}_i, z_i) + \mathrm{cov}(\tilde{X}_i, u_i) = 0$, and hence $\mathrm{cov}(\tilde{X}_i, v_i) = -\beta_1 \sigma_w^2$ as before, and

$$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1.$$

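7) In the simple, one-explanatory variable, errors-in-variables model, the OLS estimator for the slope is inconsistent. The textbook derived the following result:

$$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1.$$

Show that the OLS estimator for the intercept behaves as follows in large samples:

$$\hat{\beta}_0 \xrightarrow{p} \beta_0 + \mu_{\tilde{X}}\,\frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1,$$

where $\bar{\tilde{X}} \xrightarrow{p} \mu_{\tilde{X}}$.

Answer: $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{\tilde{X}} = \beta_0 + \beta_1 \bar{\tilde{X}} + \bar{v} - \hat{\beta}_1 \bar{\tilde{X}}$, and, collecting terms, this results in $\hat{\beta}_0 = \beta_0 - (\hat{\beta}_1 - \beta_1)\bar{\tilde{X}} + \bar{v}$. Therefore

$$\hat{\beta}_0 \xrightarrow{p} \beta_0 + \mu_{\tilde{X}}\,\beta_1\,\frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2}, \quad \text{since} \quad \hat{\beta}_1 \xrightarrow{p} \beta_1 - \beta_1\,\frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2}.$$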


8) Assume that you had found correlation of the residuals across observations. This may happen because the regressor is ordered by size. Your regression model could therefore be specified as follows: $Y_i = \beta_0 + \beta_1 X_i + u_i$, $u_i = \rho u_{i-1} + v_i$; $|\rho| < 1$. Furthermore, assume that you had obtained consistent estimates for $\beta_0$, $\beta_1$, $\rho$. If asked to make a prediction for Y, given a value of X (= $X_j$) and $\hat{u}_{j-1}$, how would you proceed? Would you use the information on the lagged residual at all? Why or why not?
Answer: Given that the error term for j is related to the error term in j-1, it seems intuitive to use that information in prediction, i.e., if $Y_{j-1}$ is larger than $\hat{\beta}_0 + \hat{\beta}_1 X_{j-1}$, then $Y_j$ will also be larger than $\hat{\beta}_0 + \hat{\beta}_1 X_j$, but not by as much (given $\rho > 0$). Substitution of the second equation into the first equation results in $Y_i = \beta_0 + \beta_1 X_i + \rho u_{i-1} + v_i$. Hence the predicted value should be calculated as

$$\hat{Y}_j = \hat{\beta}_0 + \hat{\beta}_1 X_j + \hat{\rho}\,\hat{u}_{j-1}.$$
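The prediction rule in question 8 can be written as a short function. This is a minimal sketch: the function name and the numbers in the usage line are hypothetical, and the coefficient estimates are simply taken as given.

```python
def predict_with_ar1_error(b0, b1, rho, x_j, y_prev, x_prev):
    """Predict Y_j using the lagged residual when u_i = rho * u_{i-1} + v_i.

    b0, b1, rho are assumed to be consistent estimates obtained elsewhere.
    """
    u_prev = y_prev - (b0 + b1 * x_prev)  # lagged residual, u-hat_{j-1}
    return b0 + b1 * x_j + rho * u_prev

# Hypothetical values: the previous observation sat 1 unit above the line,
# so half of that gap (rho = 0.5) is carried into the prediction.
prediction = predict_with_ar1_error(1.0, 2.0, 0.5, x_j=3.0, y_prev=10.0, x_prev=4.0)
print(prediction)
```

Setting `rho` to zero recovers the usual prediction from the regression line, which shows what is lost by ignoring the lagged residual.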

9) Your textbook only analyzed the case of an errors-in-variables bias of the type $\tilde{X}_i = X_i + w_i$. What if the error were generated in the simple regression model by entering data that always contained the same typographical error, say $\tilde{X}_i = X_i + a$ or $\tilde{X}_i = bX_i$, where a and b are constants? What effect would this have on your regression model?
Answer: This would have an effect similar to changing the units of measurement. The measurement error is not random here, and the bias can be determined exactly.

For the case $\tilde{X}_i = X_i + a$, the slope will be unaffected and the usual properties for the OLS slope estimator will hold. However, since $\bar{\tilde{X}} = \bar{X} + a$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{\tilde{X}} = (\bar{Y} - \hat{\beta}_1 \bar{X}) - \hat{\beta}_1 a$, the intercept will be underestimated by the constant measurement error times the slope.

For the case $\tilde{X}_i = bX_i$, the intercept is unaffected, but the ratio of the estimated slope with measurement error to the slope without measurement error is 1/b, since $Y_i = \beta_0 + (\beta_1 / b)\tilde{X}_i + u_i$.

10) Explain why the OLS estimator for the slope in the simple regression model is still unbiased, even if there is correlation of the error term across observations.
Answer: The proof for unbiasedness is presented in Appendix 4.3 of the textbook. There

$$\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})u_i}{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2}, \quad \text{and} \quad E(\hat{\beta}_1) = \beta_1 + E\left[\frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})u_i}{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2}\right].$$

Given the law of iterated expectations, this becomes

$$E(\hat{\beta}_1) = \beta_1 + E\left[\frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})\,E(u_i \mid X_1, \ldots, X_n)}{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2}\right],$$

and the second term vanishes because $E(u_i \mid X_1, \ldots, X_n) = 0$ under the least squares assumptions. The assumption of correlation of the error term across observations has not entered into the proof. However, it will play a role in the derivation of standard errors.
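Returning to question 9, the effect of a constant typographical error can be verified on a toy data set. The sketch below uses made-up numbers and hypothetical names; it only illustrates the algebra that adding a constant a leaves the slope alone and moves the intercept by the slope times a, while multiplying by b divides the slope by b and leaves the intercept alone.

```python
def ols_fit(x, y):
    """Return (intercept, slope) from an OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical data and error constants
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = 3.0, 2.0

b0, b1 = ols_fit(x, y)                                  # correctly entered X
b0_shift, b1_shift = ols_fit([xi + a for xi in x], y)   # X entered as X + a
b0_scale, b1_scale = ols_fit([b * xi for xi in x], y)   # X entered as b * X

print(b1_shift - b1, b0_shift - (b0 - b1 * a))  # both approximately zero
print(b1_scale * b - b1, b0_scale - b0)         # both approximately zero
```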


11) To analyze the situation of simultaneous causality bias, consider the following system of equations: $Y_i = \beta_0 + \beta_1 X_i + u_i$ and $X_i = \gamma_0 + \gamma_1 Y_i + v_i$. Demonstrate the negative correlation between $X_i$ and $u_i$ for $\gamma_1 < 0$, either through mathematics or by presenting an argument which starts as follows: “Imagine that $u_i$ is negative.”
Answer: The mathematical derivation of the correlation is given in footnote 3 of Chapter 7 in the textbook. Setting $\gamma_1 < 0$ results in a negative correlation between $X_i$ and $u_i$. A negative shock to the first equation yields a lower Y. This in turn increases X in the second equation. Hence there is a negative correlation between $X_i$ and $u_i$.

12) Think of three different economic examples where cross-sectional data could be collected. Indicate in each of these cases how you would check if the analysis is externally valid.
Answer: Answers will differ by student. Using U.S. state data to analyze determinants of unemployment or the effect of minimum wages on employment-population ratios, and checking the results against a sample of Canadian provinces or other subnational geographical units, may be mentioned. Similarly, cross-country comparisons to test convergence in per capita income could be compared to results within countries. Given the textbook example, test scores in elementary schools within one state may be validated by using data from another state.

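13) The textbook derived the following result:

$$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1.$$

Show that this is the same as

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 - \frac{\sigma_w^2}{\sigma_w^2 + \sigma_X^2}\,\beta_1.$$

Answer:

$$\frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1 = \frac{\sigma_X^2 + \sigma_w^2 - \sigma_w^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1 = \left(1 - \frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2}\right)\beta_1 = \beta_1 - \frac{\sigma_w^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1.$$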

14) Your textbook has analyzed simultaneous equation systems in the case of two equations, $Y_i = \beta_0 + \beta_1 X_i + u_i$ and $X_i = \gamma_0 + \gamma_1 Y_i + v_i$, where the first equation might be the labor demand equation (with capital stock and technology being held constant), and the second the labor supply equation (X being the real wage, and the labor market clears). What if you had a production function as the third equation, $Z_i = \delta_0 + \delta_1 Y_i + w_i$, where Z is output? If the error terms, u, v, and w, were pairwise uncorrelated, explain why there would be no simultaneous causality bias when estimating the production function using OLS.
Answer: Although the above system represents three equations in three unknowns, it is “block-recursive,” meaning that X and Y (the real wage and employment) are completely determined by the first two equations and independently of the production function (Z). Given the solution for employment (Y), the third equation solely determines output (Z). Put differently, if there was a positive shock to the production function, which would result in higher output, then this would have no effect on employment (Y), and there would therefore be no feedback into the production function. Hence the error term in the third equation is not correlated with the regressor.

15) A professor in your microeconomics lectures derived a labor demand curve in the lecture. Given some reasonable assumptions, she showed that the demand for labor depends negatively on the real wage. You want to put this hypothesis to the test (“show me”) and collect data on employment and real wages for a certain industry. You try to estimate the labor demand curve but find no relationship between the two variables. Is economic theory wrong? Explain.
Answer: This is a case of simultaneous causality. Since there is a supply of labor as well, the real wage depends on employment, which, in a market-clearing model, is determined by the intersection of supply and demand. In a Keynesian world with wait unemployment, you would expect a negative relationship between real wages and employment, given the capital stock and productivity.

16) Your textbook uses the following example of simultaneous causality bias of a two equation system: $Y_i = \beta_0 + \beta_1 X_i + u_i$ and $X_i = \gamma_0 + \gamma_1 Y_i + v_i$. To be more specific, think of the first equation as a demand equation for a certain good, where Y is the quantity demanded and X is the price. The second equation then represents the supply equation, with a third equation establishing that demand equals supply. Sketch the market outcome over a few periods and explain why it is impossible to identify the demand and supply curves in such a situation. Next assume that an additional variable enters the demand equation: income. In a new graph, draw the initial position of the demand and supply curves and label them $D_0$ and $S_0$. Now allow for income to take on four different values and sketch what happens to the two curves. Is there a pattern that you see which suggests that you might be able to identify one of the two equations with real-life data?


Answer: See the accompanying graph.

You only observe market outcomes (the intersection of the demand and supply curve). Fitting a regression line through these points gives you neither the supply curve nor the demand curve, and hence neither is identified.

The market outcome now generates five observations at the intersection of the two curves. Fitting a line through the five points will give an estimate of the supply curve. Hence by shifting the demand curve in this fashion, you can identify the supply curve.


17) Give at least three examples where you could envision errors-in-variables problems. For the case where the measurement error occurs only for the explanatory variable in the simple regression case, derive

$$\hat{\beta}_1 \xrightarrow{p} \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1.$$

Answer: Answers will vary by student. Consumption functions are frequently mentioned, where permanent consumption is proportional to permanent income, both of which differ from actual measures of consumption and income through transitory components. There are several examples in this chapter of the test bank where the underlying measure of the regressor is proxied by previous outcomes (unemployment rates, weather, height, etc.). Students may feel that responses to surveys result in measurement error, e.g., when people respond to questions regarding their income, their SAT score, and so forth. The formula is derived in Chapter 7, footnote 2 of the textbook. 18) Your textbook states that correlation of the error term across observations “will not happen if the data are obtained by sampling at random from the population.” However, in one famous study of the electric utility industry, the observations were listed by the size of the output level, from smallest to largest. The pattern of the residuals was as shown in the figure.

What does this pattern suggest to you?
Answer: The pattern suggests that there is correlation in the error term across observations, and therefore possibly the presence of an omitted variable, or, most likely here, a misspecification of the functional form of the regression function.

19) Consider a situation where Y is related to X in the following manner: $Y_i = \beta_0 \times X_i^{\beta_1} \times e^{u_i}$. Draw the deterministic part of the above function. Next add, in the same graph, a hypothetical Y, X scatterplot of the actual observations. Assume that you have misspecified the functional form of the regression function and estimated the relationship between Y and X using a linear regression function. Add this linear regression function to your graph. Separately, show what the plot of the residuals against the X variable in your regression would look like.


Answer: See the accompanying graphs.


20) In macroeconomics, you studied the equilibrium in the goods and money market under the assumption of prices being fixed in the very short run. The goods market equilibrium was described by the so-called IS equation $R_i = \beta_0 - \beta_1 Y_i + u_i$, where R represented the nominal interest rate and Y was real GDP. $\beta_0$ contained variables determined outside the system, such as government expenditures, taxes, and inflationary expectations. The money market equilibrium was given by the so-called LM equation $R_i = \gamma_0 + \gamma_1 Y_i + v_i$, and $\gamma_0$ contained the real money supply and the intercept from the money demand equation. Show that there is simultaneous causality bias in this situation.
Answer: Consider the case of a positive shock to the LM curve. This will increase the interest rate, which, in turn, will result in lower output through the IS curve. Hence there is negative correlation between the error in the LM curve and the regressor, resulting in simultaneous causality bias.

21) Assume the following model of the labor market:

$$N^d = \beta_0 + \beta_1 \frac{W}{P} + u$$
$$N^s = \gamma_0 + \gamma_1 \frac{W}{P} + v$$
$$N^d = N^s = N$$

where N is employment, (W/P) is the real wage in the labor market, and u and v are determinants other than the real wage which affect labor demand and labor supply (respectively). Let

$$E(u) = E(v) = 0; \quad \mathrm{var}(u) = \sigma_u^2; \quad \mathrm{var}(v) = \sigma_v^2; \quad \mathrm{cov}(u, v) = 0.$$

Assume that you had collected data on employment and the real wage from a random sample of observations and estimated a regression of employment on the real wage (employment being the regressand and the real wage being the regressor). It is easy but tedious to show that

$$(\hat{\beta}_1 - \beta_1) \xrightarrow{p} (\gamma_1 - \beta_1)\,\frac{\sigma_u^2}{\sigma_u^2 + \sigma_v^2} > 0,$$

since the slope of the labor supply function is positive and the slope of the labor demand function is negative. Hence, in general, you will not find the correct answer even in large samples.
a. What is this bias referred to?
b. What would the relationship between the variance of the labor supply/demand shift variable have to be for the bias to disappear?
c. Give an intuitive answer why the bias would disappear in that situation. Draw a graph to illustrate your argument.

Answer:
a. Simultaneous equations bias.
b. The variance of v, the shift variable of the labor supply curve, would have to be substantially larger compared to the variance of the labor demand shift variable.
c. Take the extreme case where the labor demand curve hardly shifts at all, but there are large changes in the labor supply curve caused by the shift variable v. In that case, the labor supply curve would “trace out” the labor demand curve. Since in real life you only observe the intersection of the demand and supply relationship, it becomes clear now why the simultaneous equation bias has been removed.
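The large-sample bias formula in question 21 and the intuition in part c can be illustrated by simulation. The sketch below is a rough check under assumed parameter values (the intercepts, slopes, variances, sample size, and function name are all hypothetical): it solves the two equations for the market-clearing real wage, regresses employment on the real wage, and compares the resulting bias with $(\gamma_1 - \beta_1)\sigma_u^2/(\sigma_u^2 + \sigma_v^2)$.

```python
import random

def simulate_bias(beta1=-1.0, gamma1=1.0, sd_u=1.0, sd_v=1.0, n=20000, seed=1):
    """Return (simulated bias of the OLS slope, bias implied by the formula)."""
    rng = random.Random(seed)
    beta0, gamma0 = 10.0, 2.0  # hypothetical intercepts
    wage, emp = [], []
    for _ in range(n):
        u = rng.gauss(0.0, sd_u)  # labor demand shifter
        v = rng.gauss(0.0, sd_v)  # labor supply shifter
        # Market clearing: beta0 + beta1*w + u = gamma0 + gamma1*w + v
        w = (gamma0 - beta0 + v - u) / (beta1 - gamma1)
        wage.append(w)
        emp.append(beta0 + beta1 * w + u)
    mw, me = sum(wage) / n, sum(emp) / n
    slope = (sum((w - mw) * (e - me) for w, e in zip(wage, emp))
             / sum((w - mw) ** 2 for w in wage))
    formula = (gamma1 - beta1) * sd_u ** 2 / (sd_u ** 2 + sd_v ** 2)
    return slope - beta1, formula

equal_var = simulate_bias()                  # demand and supply shift equally
large_supply_var = simulate_bias(sd_v=10.0)  # supply shifts dominate
print(equal_var, large_supply_var)
```

When the supply shifter is much more variable than the demand shifter, the observed points lie close to the demand curve and the simulated bias shrinks toward zero, which matches the answer to part b.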

22) To compare the slope coefficient from the California School data set with that of the Massachusetts School data set, you run the following two regressions:

$$\widehat{TestScr}_{CA} = \underset{(0.54)}{2.35} - \underset{(0.027)}{0.123} \times STR_{CA}, \quad n = 420,\ R^2 = 0.051,\ SER = 0.98$$

$$\widehat{TestScr}_{MA} = \underset{(0.57)}{1.97} - \underset{(0.033)}{0.114} \times STR_{MA}, \quad n = 220,\ R^2 = 0.067,\ SER = 0.97$$

Numbers in parentheses are heteroskedasticity-robust standard errors, and the LHS variable has been standardized. Calculate a t-statistic to test whether or not the two coefficients are the same. State the alternative hypothesis. Which level of significance did you choose?
Answer: $H_0: \beta_{1,CA} = \beta_{1,MA}$; $H_1: \beta_{1,CA} \neq \beta_{1,MA}$;

$$t = \frac{0.123 - 0.114}{\sqrt{0.027^2 + 0.033^2}} = 0.21.$$

Hence you cannot reject the null hypothesis at any reasonable level of significance. The underlying assumption here is that the two samples are independent, which seems reasonable.
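The t-statistic in question 22 is easy to reproduce. The following sketch assumes, as the answer does, that the two samples are independent, so the variance of the difference is the sum of the squared standard errors; the function name is hypothetical.

```python
import math

def t_stat_difference(b_a, se_a, b_b, se_b):
    """t-statistic for H0: b_a = b_b, assuming independent samples."""
    return (b_a - b_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Slopes and robust standard errors from the two regressions above
t_str = t_stat_difference(-0.123, 0.027, -0.114, 0.033)
print(round(abs(t_str), 2))
```

The sign of the statistic depends on which coefficient is subtracted from which; only its absolute value matters for the two-sided test.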

23) You have read the analysis in chapter 9 and want to explore the relationship between poverty and test scores. You decide to start your analysis by running a regression of test scores on the percent of students who are eligible to receive a free/reduced price lunch both in California and in Massachusetts. The results are as follows:

$$\widehat{TestScr}_{CA} = \underset{(0.99)}{681.44} - \underset{(0.018)}{0.610} \times PctLch_{CA}, \quad n = 420,\ R^2 = 0.75,\ SER = 9.45$$

$$\widehat{TestScr}_{MA} = \underset{(0.95)}{731.89} - \underset{(0.045)}{0.788} \times PctLch_{MA}, \quad n = 220,\ R^2 = 0.61,\ SER = 9.41$$

Numbers in parentheses are heteroskedasticity-robust standard errors.
a. Calculate a t-statistic to test whether or not the two slope coefficients are the same.
b. Your textbook compares the slope coefficients for the student-teacher ratio instead of the percent eligible for a free lunch. The authors remark: “Because the two standardized tests are different, the coefficients themselves cannot be compared directly: One point on the Massachusetts test is not the same as one point on the California test.” What solution do they suggest?

Answer:
a. $H_0: \beta_{1,CA} = \beta_{1,MA}$; $H_1: \beta_{1,CA} \neq \beta_{1,MA}$;

$$t = \frac{0.788 - 0.610}{\sqrt{0.018^2 + 0.045^2}} = 3.67.$$

Hence you reject the null hypothesis.
b. The authors suggest standardizing the test score variable in both states by subtracting the mean and by dividing by the standard deviation.
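The standardization suggested in part b can be sketched in a few lines. The scores below are invented for illustration and are not taken from either data set; the function name is likewise hypothetical.

```python
import statistics

def standardize(scores):
    """Subtract the sample mean and divide by the sample standard deviation."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(s - mean) / sd for s in scores]

# Hypothetical district test scores on one state's scale
z = standardize([640.0, 650.0, 660.0])
print(z)
```

After this transformation a slope coefficient measures the effect in standard deviations of the test score, which puts the two states on a common scale.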


Chapter 10 Regression with Panel Data 10.1 Multiple Choice

1) The notation for panel data is (Xit, Yit), i = 1, ..., n and t = 1, ..., T because A) we take into account that the entities included in the panel change over time and are replaced by others. B) the X’s represent the observed effects and the Y the omitted fixed effects. C) there are n entities and T time periods. D) n has to be larger than T for the OLS estimator to exist. Answer: C

2) The difference between an unbalanced and a balanced panel is that A) you cannot have both fixed time effects and fixed entity effects regressions. B) an unbalanced panel contains missing observations for at least one time period or one entity. C) the impact of different regressors are roughly the same for balanced but not for unbalanced panels. D) in the former you may not include drivers who have been drinking in the fatality rate/beer tax study. Answer: B 3) Consider the special panel case where T = 2. If some of the omitted variables, which you hope to capture in the changes analysis, in fact change over time, then the estimator on the included change regressor A) will be unbiased only when allowing for heteroskedastic-robust standard errors. B) may still be unbiased. C) will only be unbiased in large samples. D) will always be unbiased. Answer: B 4) The Fixed Effects regression model A) has n different intercepts. B) the slope coefficients are allowed to differ across entities, but the intercept is “fixed” (remains unchanged). C) has “fixed” (repaired) the effect of heteroskedasticity. D) in a log-log model may include logs of the binary variables, which control for the fixed effects. Answer: A 5) In the Fixed Effects regression model, you should exclude one of the binary variables for the entities when an intercept is present in the equation A) because one of the entities is always excluded. B) because there are already too many coefficients to estimate. C) to allow for some changes between entities to take place. D) to avoid perfect multicollinearity. Answer: D 6) In the Fixed Effects regression model, using (n – 1) binary variables for the entities, the coefficient of the binary variable indicates A) the level of the fixed effect of the ith entity. B) will be either 0 or 1. C) the difference in fixed effects between the ith and the first entity. D) the response in the dependent variable to a percentage change in the binary variable. Answer: C


7) $\mathrm{cov}(u_{it}, u_{is} \mid X_{it}, X_{is}) = 0$ for $t \neq s$ means that A) there is no perfect multicollinearity in the errors. B) division of errors by regressors in different time periods is always zero. C) there is no correlation over time in the residuals. D) conditional on the regressors, the errors are uncorrelated over time. Answer: D 8) With Panel Data, regression software typically uses an “entity-demeaned” algorithm because A) the OLS formula for the slope in the linear regression model contains deviations from means already. B) there are typically too many time periods for the regression package to handle. C) the number of estimates to calculate can become extremely large when there are a large number of entities. D) deviations from means sum up to zero. Answer: C 9) The “before and after” specification, binary variable specification, and “entity-demeaned” specification produce identical OLS estimates A) as long as there are observations for more than two time periods. B) if you use the heteroskedasticity-robust option in your regression program. C) for the case of more than 100 observations. D) as long as T = 2 and the intercept is excluded from the “before and after” specification. Answer: D 10) In the Fixed Time Effects regression model, you should exclude one of the binary variables for the time periods when an intercept is present in the equation A) because the first time period must always be excluded from your data set. B) because there are already too many coefficients to estimate. C) to avoid perfect multicollinearity. D) to allow for some changes between time periods to take place. Answer: C 11) If you included both time and entity fixed effects in the regression model which includes a constant, then A) one of the explanatory variables needs to be excluded to avoid perfect multicollinearity. B) you can use the “before and after” specification even for T > 2.
C) you must exclude one of the entity binary variables and one of the time binary variables for the OLS estimator to exist. D) the OLS estimator no longer exists. Answer: C 12) Consider estimating the effect of the beer tax on the fatality rate, using time and state fixed effect for the Northeast Region of the United States (Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island) for the period 1991-2001. If Beer Tax was the only explanatory variable, how many coefficients would you need to estimate, excluding the constant? A) 18 B) 17 C) 7 D) 11 Answer: B
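The equivalence claimed in question 9 can be checked numerically for a two-period panel. The sketch below uses an invented panel of four entities and hypothetical names; it compares the slope from the “before and after” regression without an intercept to the slope from the “entity-demeaned” regression.

```python
def slope_no_intercept(x, y):
    """OLS slope of y on x with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Hypothetical panel with T = 2: (x_i1, x_i2, y_i1, y_i2) for each entity
panel = [(1.0, 2.0, 5.0, 4.0),
         (0.5, 1.5, 3.0, 2.5),
         (2.0, 2.5, 6.0, 5.0),
         (1.0, 3.0, 4.0, 1.0)]

# "Before and after": regress the change in y on the change in x
dx = [x2 - x1 for x1, x2, _, _ in panel]
dy = [y2 - y1 for _, _, y1, y2 in panel]
before_after = slope_no_intercept(dx, dy)

# "Entity-demeaned": subtract each entity's time average, then pool
x_dm, y_dm = [], []
for x1, x2, y1, y2 in panel:
    x_bar, y_bar = (x1 + x2) / 2, (y1 + y2) / 2
    x_dm += [x1 - x_bar, x2 - x_bar]
    y_dm += [y1 - y_bar, y2 - y_bar]
demeaned = slope_no_intercept(x_dm, y_dm)

print(before_after, demeaned)
```

With T = 2 each demeaned observation is plus or minus half the change, so the two slopes coincide exactly; the binary-variable specification is not computed here.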


13) Consider the regression example from your textbook, which estimates the effect of beer taxes on fatality rates across the 48 contiguous U.S. states. If beer taxes were set nationally by the federal government rather than by the states, then A) it would not make sense to use state fixed effect. B) you can test state fixed effects using homoskedastic-only standard errors. C) the OLS estimator will be biased. D) you should not use time fixed effects since beer taxes are the same at a point in time across states. Answer: D 14) In the panel regression analysis of beer taxes on traffic deaths, the estimation period is 1982 -1988 for the 48 contiguous U.S. states. To test for the significance of time fixed effects, you should calculate the F-statistic and compare it to the critical value from your Fq,∞ distribution, where q equals A) 6. B) 7. C) 48. D) 53. Answer: A 15) When you add state fixed effects to a simple regression model for U.S. states over a certain time period, and the regression R2 increases significantly, then it is safe to assume that A) the included explanatory variables, other than the state fixed effects, are unimportant. B) state fixed effects account for a large amount of the variation in the data. C) the coefficients on the other included explanatory variables will not change. D) time fixed effects are unimportant. Answer: B 16) Time Fixed Effects regression are useful in dealing with omitted variables A) even if you only have a cross-section of data available. B) if these omitted variables are constant across entities but vary over time. C) when there are more than 100 observations. D) if these omitted variables are constant across entities but not over time. Answer: B 17) Indicate for which of the following examples you cannot use Entity and Time Fixed Effects: a regression of A) OECD unemployment rates on unemployment insurance generosity for the period 1980 -2006 (annual data). 
B) the (log of) earnings on the number of years of education, using the Current Population Survey of 60,000 households for March 2006. C) the per capita income level in Canadian Provinces on provincial population growth rates, using decade averages for 1960, 1970, and 1980. D) the risk premium of 75 stocks on the market premium for the years 1998-2006. Answer: B 18) Panel data is also called A) longitudinal data. B) cross-sectional data. C) time series data. D) experimental data. Answer: A


19) (Requires Appendix material) When the fifth assumption in the Fixed Effects regression ($\mathrm{cov}(u_{it}, u_{is} \mid X_{it}, X_{is}) = 0$ for $t \neq s$) is violated, then A) using heteroskedastic-robust standard errors is not sufficient for correct statistical inference when using OLS. B) the OLS estimator does not exist. C) you can use the simple homoskedasticity-only standard errors calculated in your regression package. D) you cannot use fixed time effects in your estimation. Answer: A 20) In the panel regression analysis of beer taxes on traffic deaths, the estimation period is 1982-1988 for the 48 contiguous U.S. states. To test for the significance of entity fixed effects, you should calculate the F-statistic and compare it to the critical value from your $F_{q,\infty}$ distribution, where q equals A) 48. B) 54. C) 7. D) 47. Answer: D 21) The main advantage of using panel data over cross sectional data is that it A) gives you more observations. B) allows you to analyze behavior across time but not across entities. C) allows you to control for some types of omitted variables without actually observing them. D) allows you to look up critical values in the standard normal distribution. Answer: C 22) One of the following is a regression example for which Entity and Time Fixed Effects could be used: a study of the effect of A) minimum wages on teenage employment using annual data from the 48 contiguous states in 2006. B) various performance statistics on the (log of) salaries of baseball pitchers in the American League and the National League in 2005 and 2006. C) inflation and inflationary expectations on unemployment rates in the United States, using quarterly data from 1960-2006. D) drinking alcohol on the GPA of 150 students at your university, controlling for incoming SAT scores.
Answer: B 23) Consider a panel regression of unemployment rates for the G7 countries (United States, Canada, France, Germany, Italy, United Kingdom, Japan) on a set of explanatory variables for the time period 1980 -2000 (annual data). If you included entity and time fixed effects, you would need to specify the following number of binary variables: A) 21. B) 6. C) 28. D) 26. Answer: D 24) A pattern in the coefficients of the time fixed effects binary variables may reveal the following in a study of the determinants of state unemployment rates using panel data: A) macroeconomic effects, which affect all states equally in a given year. B) attitude differences towards unemployment between states. C) there is no economic information that can be retrieved from these coefficients. D) regional effects, which affect all states equally, as long as they are a member of that region. Answer: A
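The count in question 23 follows from dropping one entity dummy and one time dummy when a constant is included. A minimal sketch, with a hypothetical helper name:

```python
def fixed_effects_dummies(n_entities, n_periods):
    """Binary variables needed with entity and time effects plus a constant."""
    return (n_entities - 1) + (n_periods - 1)

# G7 countries observed annually from 1980 through 2000 (21 years)
g7 = fixed_effects_dummies(7, 2000 - 1980 + 1)
print(g7)
```

The same logic gives the number of restrictions in the F-tests of questions 14 and 20: 7 - 1 = 6 time effects and 48 - 1 = 47 entity effects for the 1982-1988 beer tax panel.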


25) In the panel regression analysis of beer taxes on traffic deaths, the estimation period is 1982 -1988 for the 48 contiguous U.S. states. To test for the significance of time fixed effects, you should calculate the F-statistic and compare it to the critical value from your Fq,∞ distribution, which equals (at the 5% level) A) 2.01. B) 2.10. C) 2.80. D) 2.64. Answer: B 26) Assume that for the T = 2 time periods case, you have estimated a simple regression in changes model and found a statistically significant positive intercept. This implies A) a negative mean change in the LHS variable in the absence of a change in the RHS variable since you subtract the earlier period from the later period B) that the panel estimation approach is flawed since differencing the data eliminates the constant (intercept) in a regression C) a positive mean change in the LHS variable in the absence of a change in the RHS variable D) that the RHS variable changed between the two subperiods Answer: C 27) HAC standard errors and clustered standard errors are related as follows: A) they are the same B) clustered standard errors are one type of HAC standard error C) they are the same if the data is differenced D) clustered standard errors are the square root of HAC standard errors Answer: B 28) In panel data, the regression error A) is likely to be correlated over time within an entity B) should be calculated taking into account heteroskedasticity but not autocorrelation C) only exists for the case of T > 2 D) fits all of the three descriptions above Answer: A 29) It is advisable to use clustered standard errors in panel regressions because A) without clustered standard errors, the OLS estimator is biased B) hypothesis testing can proceed in a standard way even if there are few entities ( n is small) C) they are easier to calculate than homoskedasticity-only standard errors D) the fixed effects estimator is asymptotically normally distributed when n is large Answer: D 30) If Xit is 
correlated with Xis for different values of s and t, then A) Xit is said to be autocorrelated B) the OLS estimator cannot be computed C) statistical inference cannot proceed in a standard way even if clustered standard errors are used D) this is not of practical importance since these correlations are typically weak in applications Answer: A


10.2 Essays and Longer Questions

1) A study, published in 1993, used U.S. state panel data to investigate the relationship between minimum wages and employment of teenagers. The sample period was 1977 to 1989 for all 50 states. The author estimated a model of the following type:

$\ln(E_{it}) = \beta_0 + \beta_1 \ln(M_{it}/W_{it}) + \gamma_2 D2_i + \dots + \gamma_n D50_i + \delta_2 B2_t + \dots + \delta_T B13_t + u_{it}$,

where E is the employment to population ratio of teenagers, M is the nominal minimum wage, and W is average hourly earnings in manufacturing. In addition, other explanatory variables, such as the adult unemployment rate, the teenage population share, and the teenage enrollment rate in school, were included.
(a) Name some of the factors that might be picked up by time and state fixed effects.
(b) The author decided to use eight regional dummy variables instead of the 49 state dummy variables. What is the implicit assumption made by the author? Could you test for its validity? How?
(c) The results, using time and region fixed effects only, were as follows:

$\widehat{\ln E_{it}} = \underset{(0.036)}{-0.182} \times \ln(M_{it}/W_{it}) + \dots; \quad R^2 = 0.727$

Interpret the result briefly.
(d) State minimum wages do not exceed federal minimum wages often. As a result, the author decided to choose the federal minimum wage in his specification above. How does this change your interpretation? How is the original equation

$\ln(E_{it}) = \beta_0 + \beta_1 \ln(M_{it}/W_{it}) + \gamma_2 D2_i + \dots + \gamma_n D8_i + \delta_2 B2_t + \dots + \delta_T B13_t + u_{it}$

affected by this?
Answer:
(a) Time effects will pick up the effect of omitted variables that are common to all 50 states at a given point in time. Federal fiscal and monetary variables, exchange rate and U.S. terms of trade movements, aggregate business cycle developments, etc., are candidates here. State fixed effects will include variables that are slowly changing over time within a specific state, such as attitudes toward employment or labor force participation, state specific labor market policies, industrial and labor force composition, etc.
(b) The implicit assumption by the author is that the coefficients on the state fixed effects are identical within a region but differ between regions. Since this assumption implies linear restrictions on the coefficients, it can be tested using the F-test.
(c) Consider a ten percent increase in minimum wages, say from $5 to $5.50, with constant average hourly earnings. This corresponds to a ten percent increase in relative minimum wages. The resulting decrease in the teenage employment to population ratio is 1.8 percent, or almost 2 percent. The regression explains roughly 73 percent of the variation in the (log) employment to population ratio of teenagers during the period of 1977 to 1989 for the 50 U.S. states.
(d) This choice in effect drops the i subscript from the minimum wage, since there is no variation by state. The original equation then reads

$\ln(E_{it}) = \beta_0 + \beta_1 \ln(M_t/W_{it}) + \gamma_2 D2_i + \dots + \gamma_n D8_i + \delta_2 B2_t + \dots + \delta_T B13_t + u_{it}$.

Furthermore, since the federal minimum wage is constant across the nine regions at a point in time, it is absorbed by the time effects. The coefficient on the relative minimum wage therefore reflects regional variations in average hourly earnings in manufacturing. The minimum wage enters only indirectly, through changes in the federal minimum wage relative to the different levels of average hourly earnings in each region.
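The arithmetic behind answer (c) can be checked with a short sketch. It assumes only the reported point estimate of -0.182. The function name and the comparison with the exact log-log change are illustrative additions, not part of the original answer.

```python
# Point estimate from part (c): elasticity of teenage employment with
# respect to the relative minimum wage in the log-log specification.
BETA_1 = -0.182

def pct_change_in_employment(pct_change_in_relative_wage):
    """Return (approximate, exact) percentage change in E for a given
    percentage change in M/W, holding all other regressors fixed."""
    # Elasticity rule of thumb: %dE is roughly beta_1 * %d(M/W).
    approx = BETA_1 * pct_change_in_relative_wage
    # Exact change implied by ln(E) = ... + beta_1 * ln(M/W).
    ratio = 1.0 + pct_change_in_relative_wage / 100.0
    exact = (ratio ** BETA_1 - 1.0) * 100.0
    return approx, exact

approx, exact = pct_change_in_employment(10.0)
print(f"approximate: {approx:.2f}%, exact: {exact:.2f}%")
```

The rule of thumb gives the 1.8 percent quoted in the answer. The exact figure is slightly smaller in absolute value because a ten percent change is not infinitesimal.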


2) You want to find the determinants of suicide rates in the United States. To investigate the issue, you collect state level data for ten years. Your first idea, suggested to you by one of your peers from Southern California, is that the annual amount of sunshine must be important. Stacking the data and using no fixed effects, you find no significant relationship between suicide rates and this variable. (This is good news for the people of Seattle.) However, sorting the suicide rate data from highest to lowest, you notice that those states with the lowest population density are dominating in the highest suicide rate category. You run another regression, without fixed effects, and find a highly significant relationship between the two variables. Even adding some economic variables, such as state per capita income or the state unemployment rate, does not lower the t-statistic for the population density by much. Adding fixed entity and time effects, however, results in an insignificant coefficient for population density.
(a) What do you think is the cause for this change in significance? Which fixed effect is primarily responsible? Does this result imply that population density does not matter?
(b) Speculate as to what happens to the coefficients of the economic variables when the fixed effects are included. Use this example to make clear what factors entity and time fixed effects pick up.
(c) What other factors might play a role?
Answer:
(a) Population density changes only slowly over time, hence state effects will pick up the influence of this variable. This does not imply that population density is of no relevance. However, there are other omitted variables in this regression, such as religious and cultural attitudes towards suicide, that are also captured by the state effects, and these may also be correlated with population density.
(b) Since there is sufficient variation of state unemployment rates and state per capita income both over time and across states, the coefficients on these variables are likely to remain statistically significant. However, there may be multicollinearity between the two variables, and the standard errors may therefore be large.
(c) Answers will vary by student. Cultural and institutional factors, such as attitudes towards suicide and religion, and social services, are frequently mentioned.

3) Two authors published a study in 1992 of the effect of minimum wages on teenage employment using a U.S. state panel. The paper used annual observations for the years 1977-1989 and included all 50 states plus the District of Columbia. The estimated equation is of the following type:

$E_{it} = \beta_0 + \beta_1 (M_{it}/W_{it}) + \gamma_2 D2_i + \dots + \gamma_n D51_i + \delta_2 B2_t + \dots + \delta_T B13_t + u_{it}$,

where E is the employment to population ratio of teenagers, M is the nominal minimum wage, and W is the average wage in the state. In addition, other explanatory variables, such as the prime-age male unemployment rate and the teenage population share, were included.
(a) Briefly discuss the advantage of using panel data in this situation rather than pure cross sections or time series.
(b) Estimating the model by OLS but including only time fixed effects results in the following output:

$\widehat{E}_{it} = \hat{\beta}_0 - \underset{(0.08)}{0.33} \times (M_{it}/W_{it}) + \underset{(0.28)}{0.35} \times SHY_{it} - \underset{(0.13)}{1.53} \times uram_{it}; \quad R^2 = 0.20$

where SHY is the proportion of teenagers in the population, and uram is the prime-age male unemployment rate. Coefficients for the time fixed effects are not reported. Numbers in parentheses are homoskedasticity-only standard errors. Comment on the above results. Are the coefficients statistically significant? Since these are level regressions, how would you calculate elasticities?
(c) Adding state fixed effects changed the above equation as follows:

$\widehat{E}_{it} = \hat{\beta}_0 + \underset{(0.10)}{0.07} \times (M_{it}/W_{it}) - \underset{(0.22)}{0.19} \times SHY_{it} - \underset{(0.11)}{0.54} \times uram_{it}; \quad R^2 = 0.69$

Compare the two results. Why would the inclusion of state fixed effects change the coefficients in this way?
(d) The significance of each coefficient decreased, yet R2 increased. How is that possible? What does this result tell you about testing the hypothesis that all of the state fixed effects can be restricted to have the same coefficient? How would you test for such a hypothesis?
Answer:
(a) There are likely to be omitted variables in the above regression. One way to deal with some of these is to introduce state and time effects. State effects will capture the influence of omitted variables that are state specific and do not vary over time, while time effects capture those of countrywide variables that are common to all states at a point in time. Furthermore, there are more observations when using panel data, resulting in more variation.
(b) There is a negative relationship between minimum wages and the employment to population ratio. Increases in the share of teenagers in the population result in a higher employment to population ratio, and increases in the prime-age male unemployment rate lower the employment to population ratio. The regression explains 20 percent of the variation in the teenage employment to population ratio. The relative minimum wage and the prime-age male unemployment rate are significant at the 1% significance level, while the proportion of teenagers in the population is not. Elasticities vary with levels here. One possibility is to report elasticities at sample means.
(c) The parameter of interest here is the coefficient on the relative minimum wage. While it was highly significant in the previous regression, it now has changed sign and is statistically insignificant. The explanatory power of the equation has increased substantially. The size of the other two coefficients has also decreased. The results suggest that omitted variables, which are now captured by state fixed effects, were correlated with the regressors and caused omitted variable bias.
(d) The influence of the state effects is large. These are bound to be statistically significant, and the hypothesis that restricts these coefficients to zero is bound to be rejected. Since these are linear hypotheses that are supposed to hold simultaneously, an F-test is appropriate here.

4) You learned in intermediate macroeconomics that certain macroeconomic growth models predict conditional convergence or a catch up effect in per capita GDP between the countries of the world. That is, countries which are further behind initially in per capita GDP will grow faster than the leader. You gather data from the Penn World Tables to test this theory.
(a) By limiting your sample to 24 OECD countries, you hope to have a more homogeneous set of countries in your sample, i.e., countries that are not too different with respect to their institutions. To simplify matters, you decide to only test for unconditional convergence. In that case, the laggards catch up even without taking into account differences in some of the driving variables. Your scatter plot and regression for the time period 1975-1989 are as follows:


$\widehat{g8975} = \underset{(0.06)}{0.024} - \underset{(0.008)}{0.005} \times PCGDP75\_US; \quad R^2 = 0.025, \; SER = 0.006$

where g8975 is the average annual growth rate of per capita GDP from 1975-1989, and PCGDP75_US is per capita GDP relative to the United States in 1975. Numbers in parentheses are heteroskedasticity-robust standard errors. Interpret the results. Is there indication of unconditional convergence? What critical value did you use?
(b) Although you are quite discouraged by the result, you think that it might be due to the specific time period used. During this period, there were two OPEC oil price shocks with varying degrees of exposure for the OECD countries. You therefore repeat the exercise for the period 1960-1974, with the following results:

$\widehat{g7460} = \underset{(0.004)}{0.061} - \underset{(0.007)}{0.043} \times PCGDP60\_US; \quad R^2 = 0.613, \; SER = 0.008$

where g7460 is the average annual growth rate of per capita GDP from 1960-1974, and PCGDP60_US is per capita GDP relative to the United States in 1960. Compare this regression to the previous one.
(c) You decide to run one more regression in differences. The dependent variable is now the change in the growth rate of per capita GDP from 1960-1974 to 1975-1989 (diffg) and the regressor the difference in the initial conditions (diffinit). This produces the following graph and regression:

$\widehat{diffg} = \underset{(0.03)}{-0.006} - \underset{(0.021)}{0.096} \times diffinit; \quad R^2 = 0.468; \; SER = 0.009$

Interpret these results. Explain what has happened to unobservable omitted variables that are constant over time. Suggest what some of these variables might be.
(d) Given that there are only two time periods, what other methods could you have employed to generate the identical results? Why do you think that the slope coefficient in this regression is significant given the results over the sub-periods?
Answer:
(a) Although the slope coefficient is negative, thereby indicating unconditional convergence, the t-statistic does not exceed the critical value. However, using the standard normal distribution is not really justified here since there are only 24 observations.
(b) The explanatory power of the regression is much higher and there is a larger t-statistic for the slope coefficient. If a standard normal distribution could be used here, then the absolute value of the t-statistic would easily exceed the critical value of 1.64. This suggests unconditional convergence over the sample period. However, the same comment regarding the sample size as in (a) applies here.
(c) The slope coefficient suggests that countries which are further behind initially with respect to the United States will grow relatively faster. Almost 50 percent of the relative growth difference variation is explained by the regression. A value of diffinit that is lower by 10 percentage points is associated with a value of diffg that is higher by roughly 1 percentage point (-0.096 × -0.10 ≈ 0.01). Omitted variables that remain constant over time are eliminated by focusing on changes in the variables. Some of these may be cultural and institutional variables such as the level of educational attainment, saving rates, population growth rates, independence of the central bank, etc.
(d) Using either a fixed effects regression or entity-demeaned OLS would have resulted in identical estimates.
In general, it is possible for included coefficients to be statistically insignificant as a result of omitted variable bias. This result depends, among other factors, on the relationship between the omitted variables and the included variables. Using the differencing method has eliminated at least some of the omitted variables.

5) A researcher investigating the determinants of crime in the United Kingdom has data for 42 police regions over 22 years. She estimates by OLS the following regression:

$\ln(cmrt)_{it} = \alpha_i + \phi_t + \beta_1 unrtm_{it} + \beta_2 proyth_{it} + \beta_3 \ln(pp)_{it} + u_{it}; \quad i = 1, \dots, 42, \; t = 1, \dots, 22$

where cmrt is the crime rate per head of population, unrtm is the unemployment rate of males, proyth is the proportion of youths, and pp is the probability of punishment measured as (number of convictions)/(number of crimes reported). α and φ are area and year fixed effects, where αi equals one for area i and is zero otherwise for all i, and φt is one in year t and zero for all other years for t = 2, …, 22. φ1 is not included.
(a) What is the purpose of excluding φ1? What are the terms α and φ likely to pick up? Discuss the advantages of using panel data for this type of investigation.
(b) Estimation by OLS using heteroskedasticity- and autocorrelation-consistent standard errors results in the following output, where the coefficients of the fixed effects are not reported:

$\widehat{\ln(cmrt)}_{it} = \underset{(0.109)}{0.063} \times unrtm_{it} + \underset{(0.179)}{3.739} \times proyth_{it} - \underset{(0.024)}{0.588} \times \ln(pp)_{it}; \quad R^2 = 0.904$

Comment on the results. In particular, what is the effect of a ten percent increase in the probability of punishment?
(c) To test for the relevance of the area fixed effects, you restrict the regression by dropping all entity fixed effects and adding a single constant. The relevant F-statistic is 135.28. What are the degrees of freedom? What is the critical value from your F table?
(d) Although the test rejects the hypothesis of eliminating the fixed effects from the regression, you want to analyze what happens to the coefficients and their standard errors when the equation is re-estimated without fixed effects. In the resulting regression, $\hat{\beta}_2$ and $\hat{\beta}_3$ do not change by much, although their standard errors roughly double. However, $\hat{\beta}_1$ is now 1.340 with a standard error of 0.234. Why do you think that is?
Answer:
(a) Since there is no constant in addition to the entity and time fixed effects, setting φt to one in year t and zero for all other years for t = 1, …, 22 would result in perfect multicollinearity. α picks up omitted variables that are specific to police regions and do not vary over time. φ picks up effects that are common to all police regions in a given year. Attitudes toward crime may vary between rural regions and metropolitan areas. These would be hard to capture through measurable variables. Common macroeconomic shocks that affect all regions equally will be captured by the time fixed effects. Although some of these variables could be explicitly introduced, the list of possible variables is long. By introducing time fixed effects, the effect is captured all in one variable.
(b) A higher male unemployment rate and a higher proportion of youths increase the crime rate, while a higher probability of punishment decreases the crime rate. The coefficients on the probability of punishment and the proportion of youths are statistically significant, while the coefficient on the male unemployment rate is not. The regression explains roughly 90 percent of the variation in crime rates in the sample. A ten percent increase in the number of convictions over the number of crimes reported decreases the crime rate by roughly six percent.
(c) The coefficients of the three regressors other than the entity coefficients would have been unaffected, had there been a constant in the regression and (n-1) police region specific entity variables. In this case, the entity coefficients on the police regions would have indicated deviations from the constant for the first police region. Hence there are 41 restrictions imposed by eliminating the entity fixed effects and adding a constant. Since there are over 100 observations (roughly 900 degrees of freedom), the critical value is $F_{41,\infty} \approx F_{30,\infty} = 1.70$ at the 1% level. Hence the restrictions are rejected.
(d) This result would make the male unemployment rate coefficient significant. It suggests that male unemployment rates change slowly over the years in a given police district and that this effect is picked up by the entity fixed effects. Of course, there are other slowly changing variables, such as attitudes towards crime, that are captured by these fixed effects.
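The counts and t-statistics used in answers (b) and (c) of the crime regression can be reproduced with a few lines of arithmetic. This sketch assumes the regression contains exactly the regressors listed in the question (42 area effects, 21 year effects, three slopes). Under that assumption the denominator degrees of freedom come out somewhat below the "roughly 900" quoted in the answer, which does not change the conclusion. All variable names are illustrative.

```python
N_REGIONS, N_YEARS, N_SLOPES = 42, 22, 3

# Unrestricted model: 42 area effects + 21 year effects (phi_1 omitted) + 3 slopes.
n_obs = N_REGIONS * N_YEARS
k_unrestricted = N_REGIONS + (N_YEARS - 1) + N_SLOPES
df_denominator = n_obs - k_unrestricted

# Restricted model: one constant replaces the 42 area effects,
# which imposes 42 - 1 equality restrictions.
q_restrictions = N_REGIONS - 1

# (coefficient, standard error) pairs reported in part (b).
estimates = {
    "unrtm": (0.063, 0.109),
    "proyth": (3.739, 0.179),
    "ln(pp)": (-0.588, 0.024),
}
t_stats = {name: b / se for name, (b, se) in estimates.items()}

# Effect of a 10% rise in the probability of punishment on the crime rate:
# elasticity rule of thumb versus the exact log-log change.
approx_effect = -0.588 * 10.0
exact_effect = (1.10 ** -0.588 - 1.0) * 100.0

print(n_obs, df_denominator, q_restrictions)
print({name: round(t, 2) for name, t in t_stats.items()})
print(round(approx_effect, 2), round(exact_effect, 2))
```

Only the unemployment coefficient has a t-statistic below 1.96 in absolute value, which matches the significance statements in answer (b).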


6) You want to investigate the relationship between cumulative GPA scores at graduation and incoming SAT scores of students. For this purpose, you have collected data from a balanced panel of 120 undergraduate colleges and universities in the United States over a ten year period. Discuss some of the entity fixed effects which you potentially capture by allowing for a binary variable for each of the colleges.
Answer: Students will come up with various possible entity fixed effects. These should include differences between educational institutions that have
• made it a policy not to fight grade inflation
• a different degree of selectivity (if you admit only students with an SAT score of 2400, then there will be no relationship)
• gender or religion specific requirements
• an admission process that is need-blind
• large varsity sports programs
and so forth.

7) You want to study the relationship between weight and height of young children (4th grade to 7th grade). You collect data for more than 400 students and track the progress of these students over the following four years, where you end up with a balanced panel of 400 students (you discard the observations for the students who moved away). Discuss some of the entity fixed effects which you potentially capture by allowing for a binary variable for each of the students. Do you expect significant time fixed effects if you allowed for them?
Answer: Students will come up with various possible entity fixed effects. These will reflect differences between students potentially depending on
• gender
• ethnicity
• degree of participation in exercises/athletic programs
• growth spurts during these years
• nutrition
• genes
and so forth. It is hard to think of time fixed effects. Potentially there could be an effect if all students went to a different school in 7th grade (e.g. middle school) and this school had a less/more healthy lunch diet.

8) You first encountered growth regressions in your intermediate macroeconomics course (“beta-convergence regressions”), that is, conditionally on some initial condition in per capita income, different authors tried to find the determinants of growth. Since growth is a long-run phenomenon, various studies collected data for a panel of numerous countries using 10-year averages, over a time period stretching from 1960 to 2005. For example, a balanced panel might consist of 50-odd countries for the time periods 1960-1970, 1971-1980, … , 2000-2005. Instead of using two-way fixed effects (entity fixed effects and time fixed effects), authors often only employed time fixed effects. Why do you think that is? What sort of information would be lost if these authors employed entity fixed effects as well?
Answer: Time fixed effects will eliminate common growth phenomena experienced by all countries during the same decade (say). These could include productivity slowdowns due to the oil crisis of the ‘70s, effects of the Great Moderation of the ‘90s, etc. However, most of these studies were interested in determining the effect of institutional differences between countries. These effects, such as the degree of democracy, law and order, openness of the economy, size of government, civil wars, geography, religion, etc., are typically slowly changing, and by including entity fixed effects, you would lose the effects you are interested in studying.


10.3 Mathematical and Graphical Problems

1) Your textbook suggests an “entity-demeaned” procedure to avoid having to specify a potentially large number of binary variables. While it is somewhat tedious to specify a binary variable for each entity, this can still be handled relatively easily in the case of the 48 contiguous states. Give a few examples where it might be close to impossible to specify such a large number of entity binary variables. The idea of the “entity-demeaned” procedure was introduced as a computationally convenient and simplifying procedure. Since there are also time fixed effects, why is there no discussion of using a “time-demeaned” procedure? Using the following equation

$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_3 S_t + u_{it}$,

show how β1 can be estimated by the OLS regression using “time-demeaned” variables.
Answer: Answers will vary by student with regard to the examples. A panel containing tens of thousands of individuals would make it impractical to specify entity fixed effects. The same would hold for a large number of firms. Regression software typically does not estimate panel regressions using “time-demeaned” variables, since there are typically not that many time periods. In the textbook example, there were seven years of data for the 48 contiguous U.S. states. There may be observations for 10,000 individuals over a few years. Still, in principle, you could use a “time-demeaning” procedure. Taking averages across entities on both sides of the above equation results in

$\bar{Y}_t = \beta_0 + \beta_1 \bar{X}_t + \beta_3 S_t + \bar{u}_t$,

where $\bar{Y}_t = \frac{1}{n}\sum_{i=1}^{n} Y_{it}$, $\bar{X}_t = \frac{1}{n}\sum_{i=1}^{n} X_{it}$, and $\bar{u}_t = \frac{1}{n}\sum_{i=1}^{n} u_{it}$.

Subtracting the averaged equation from the original one yields

$Y_{it} - \bar{Y}_t = \beta_1 (X_{it} - \bar{X}_t) + (u_{it} - \bar{u}_t)$ or $\tilde{Y}_{it} = \beta_1 \tilde{X}_{it} + \tilde{u}_{it}$,

where $\tilde{Y}_{it} = Y_{it} - \bar{Y}_t$, and $\tilde{X}_{it}$ and $\tilde{u}_{it}$ are defined similarly. The “time-demeaned” regression can then be estimated by OLS.
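The time-demeaning steps above can be illustrated on a small noise-free panel. This is a sketch under stated assumptions: the coefficient values, the S_t series, and the X values are invented for illustration, and the error term is set to zero so that the demeaned regression recovers β1 up to floating-point rounding.

```python
# Hypothetical parameter values and data; none come from the textbook.
BETA_0, BETA_1, BETA_3 = 2.0, 0.5, -1.5
S = [0.0, 1.0, 3.0]                  # S_t: common to all entities in period t
X = [[1.0, 2.0, 4.0],                # X[i][t], one row per entity
     [0.5, 3.0, 1.0],
     [2.5, 0.0, 6.0],
     [4.0, 1.5, 2.0]]
n, T = len(X), len(S)

# Y_it = beta_0 + beta_1 * X_it + beta_3 * S_t (no error term).
Y = [[BETA_0 + BETA_1 * X[i][t] + BETA_3 * S[t] for t in range(T)]
     for i in range(n)]

def time_demean(panel):
    """Subtract the period-t average across entities from every observation."""
    means = [sum(panel[i][t] for i in range(n)) / n for t in range(T)]
    return [[panel[i][t] - means[t] for t in range(T)] for i in range(n)]

X_tilde, Y_tilde = time_demean(X), time_demean(Y)

# OLS without an intercept on the time-demeaned data.
numerator = sum(X_tilde[i][t] * Y_tilde[i][t] for i in range(n) for t in range(T))
denominator = sum(X_tilde[i][t] ** 2 for i in range(n) for t in range(T))
beta_1_hat = numerator / denominator

print(beta_1_hat)
```

Both β0 and β3 S_t drop out because they are identical for all entities within a period, which is why neither has to be known to estimate β1.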


2) Consider the case of time fixed effects only, i.e.,

$Y_{it} = \beta_0 + \beta_1 X_{it} + \beta_3 S_t + u_{it}$.

First replace $\beta_0 + \beta_3 S_t$ with $\phi_t$. Next show the relationship between the $\phi_t$ and $\delta_t$ in the following equation

$Y_{it} = \beta_0 + \beta_1 X_{it} + \delta_2 B2_t + \dots + \delta_T BT_t + u_{it}$,

where each of the binary variables B2, …, BT indicates a different time period. Explain in words why the two equations are the same. Finally show why there is perfect multicollinearity if you add another binary variable B1. What is the intuition behind the fact that the OLS estimator does not exist in this case? Would that also be the case if you dropped the intercept?
Answer: $Y_{it} = \beta_1 X_{it} + \phi_t + u_{it}$. The relationship is $\phi_1 = \beta_0$, and $\phi_t = \beta_0 + \delta_t$ for $t \geq 2$. Consider time period t; then the population regression line for that period is $\phi_t + \beta_1 X_{it}$, with β1 being the same for all time periods, but the intercept varying from time period to time period. The variation of the intercept comes from factors which are common to all entities in a given time period, i.e., the $S_t$. The same role is played by $\delta_2, \dots, \delta_T$, since $B2_t, \dots, BT_t$ are only different from zero during one period. There is perfect multicollinearity if one of the regressors can be expressed as a linear combination of the other regressors. Define $B0_t$ as a variable that equals one for all periods. In that case, the previous regression can be rewritten as

$Y_{it} = \beta_0 B0_t + \beta_1 X_{it} + \delta_2 B2_t + \dots + \delta_T BT_t + u_{it}$.

Adding B1 with a coefficient here results in

$Y_{it} = \beta_0 B0_t + \beta_1 X_{it} + \delta_1 B1_t + \delta_2 B2_t + \dots + \delta_T BT_t + u_{it}$.

But $B0_t = B1_t + B2_t + \dots + BT_t$, and hence there is perfect multicollinearity. Intuitively, whenever any one of the binary variables equals one in a given period, so does the constant. Hence the coefficient of that variable cannot pick up a separate effect from the data. Dropping the intercept from the regression eliminates the problem.

3) Consider the following panel data regression with a single explanatory variable

$Y_{it} = \beta_0 + \beta_1 X_{it} + u_{it}$.

In each of the examples below, you will be adding entity and time fixed effects. Indicate the total number of coefficients that need to be estimated.
(a) The effect of beer taxes on the fatality rate, annual data, 1982-1988, nine U.S. regions (New England, Pacific, Mid-Atlantic, East North Central, etc.).
(b) The effect of the minimum wage on teenage employment, annual data, 1963-2000, five Canadian regions (Atlantic Provinces, Quebec, Ontario, Prairies, British Columbia).
(c) The effect of savings rates on per capita income, data for three decades (1960-1969, 1970-1979, 1980-1989; one observation per decade), 104 countries of the world.
(d) The effect of pitching quality in baseball (as measured by the Team ERA) on the winning percentage, annual data, 1998-1999 season, 1999-2000 season, 30 teams.
Answer:
(a) 16 coefficients (6 time fixed effects, 8 entity fixed effects, intercept, slope).
(b) 43 coefficients (37 time fixed effects, 4 entity fixed effects, intercept, slope).
(c) 107 coefficients (2 time fixed effects, 103 entity fixed effects, intercept, slope).
(d) 32 coefficients (1 time fixed effect, 29 entity fixed effects, intercept, slope).
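The counting rule behind these answers is (n - 1) entity binary variables plus (T - 1) time binary variables plus the intercept and the slope. A minimal sketch follows; the function name and the dictionary of cases are illustrative, not from the textbook.

```python
def count_coefficients(n_entities, n_periods, n_slopes=1):
    """Coefficients in a regression with an intercept, n_slopes regressors,
    and entity and time fixed effects entered as binary variables
    (one entity and one period omitted to avoid perfect multicollinearity)."""
    entity_effects = n_entities - 1
    time_effects = n_periods - 1
    return 1 + n_slopes + entity_effects + time_effects

# (entities, periods) for the four cases in the question.
cases = {
    "(a) 9 regions, 1982-1988": (9, 1988 - 1982 + 1),
    "(b) 5 regions, 1963-2000": (5, 2000 - 1963 + 1),
    "(c) 104 countries, 3 decades": (104, 3),
    "(d) 30 teams, 2 seasons": (30, 2),
}
totals = {label: count_coefficients(n, T) for label, (n, T) in cases.items()}

for label, total in totals.items():
    print(label, total)
```

Note that the year ranges are inclusive, so 1963-2000 covers 38 annual observations and therefore 37 time binary variables.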


4) Your textbook modifies the four assumptions for the multiple regression model by adding a new assumption. This represents an extension of the cross-sectional data case, where errors are uncorrelated across entities. The new assumption requires the errors to be uncorrelated across time, conditional on the regressors as well ($\mathrm{cov}(u_{it}, u_{is} \mid X_{it}, X_{is}) = 0$ for $t \neq s$).
(a) Discuss why there might be correlation over time in the errors when you use U.S. state panel data. Does this mean that you should not use OLS as an estimator?
(b) Now consider pairs of adjacent states such as Indiana and Michigan, Texas and Arkansas, New York and Connecticut, etc. Is it likely that the fifth assumption will hold here, even though the “contemporaneous” errors are correlated? If not, can you still use OLS for estimation?
Answer:
(a) The error term may contain omitted variables. If these change slowly from one period to the next, then the error term will be correlated over time. In that case $\mathrm{cov}(u_{it}, u_{is} \mid X_{it}, X_{is}) = 0$ for $t \neq s$ will be violated. The OLS estimator is still unbiased, but valid statistical inference cannot be conducted, even when using heteroskedasticity-robust standard errors. However, heteroskedasticity- and autocorrelation-consistent standard errors can be used in this situation.
(b) The fifth assumption deals with observations that do not occur during the same time period. It does not address the problem of errors of one entity being affected by errors in another entity during the same period. While potentially there are more efficient estimators available in such a situation, OLS can still be used for estimation.

5) In Sports Economics, production functions are often estimated by relating the winning percentage of teams (Y) to inputs indicating performance in certain aspects of the game. However, this omits the quality of management. Assume that you could measure the quality of pitching and hitting by a single index L, and that managerial ability is represented by M, which is assumed to be constant over time. The production function would then be specified as follows:

$Y_{it} = \beta_0 + \beta_1 L_{it} + \beta_2 M_i + u_{it}$

where i is an index for the baseball team, t indexes time, and all variables are in logs.
(a) Assume that managerial ability is unobservable but is positively related, in a linear way, to L. Explain why the OLS estimator $\hat{\beta}_1$ is inconsistent in the case of a single cross-section, i.e., if you attempt to estimate the above regression for a single year. Do you expect this coefficient to over- or under-estimate β1?
(b) If you had data for two years, indicate the transformation which allows you to obtain a consistent estimator for β1.
Answer:
(a) Regressing Y on L alone will result in omitted variable bias. A higher pitching and hitting index is associated with higher managerial ability, which in turn increases the winning percentage. Hence $\hat{\beta}_1$ will be expected to overestimate the effect of pitching and hitting on the winning percentage. Said differently, OLS will attribute more to pitching and hitting quality than it deserves.
(b) Since managerial ability is assumed to be constant over time, differencing the data over the two time periods will eliminate this effect for all teams. This can be shown as follows:

$Y_{i2} = \beta_0 + \beta_1 L_{i2} + \beta_2 M_i + u_{i2}$
$Y_{i1} = \beta_0 + \beta_1 L_{i1} + \beta_2 M_i + u_{i1}$

Subtracting the second equation from the first results in

$Y_{i2} - Y_{i1} = \beta_1 (L_{i2} - L_{i1}) + u_{i2} - u_{i1}$.

Alternatively, the binary variable specification or the “entity-demeaned” specification could have been used with identical estimation results.
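A small numerical sketch may help to see both the upward bias in the single cross-section and its removal by differencing. Everything here is hypothetical: the coefficient values, the five teams, and the assumption that managerial ability is exactly half of the first-year index. Errors are set to zero so the results are exact up to rounding.

```python
# Hypothetical values, chosen only for illustration.
BETA_0, BETA_1, BETA_2 = 0.1, 0.6, 0.4
L1 = [1.0, 2.0, 3.0, 4.0, 5.0]       # L_i1: index in year 1
L2 = [1.5, 2.2, 3.9, 4.1, 6.0]       # L_i2: index in year 2
M = [0.5 * l for l in L1]            # M_i: unobserved, positively related to L

Y1 = [BETA_0 + BETA_1 * l + BETA_2 * m for l, m in zip(L1, M)]
Y2 = [BETA_0 + BETA_1 * l + BETA_2 * m for l, m in zip(L2, M)]

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    den = sum((a - x_bar) ** 2 for a in x)
    return num / den

# Single cross-section: M is omitted, so the slope absorbs beta_2 * 0.5.
cross_section_slope = ols_slope(L1, Y1)

# First differences: beta_0 and beta_2 * M_i cancel for every team.
dL = [b - a for a, b in zip(L1, L2)]
dY = [b - a for a, b in zip(Y1, Y2)]
first_difference_slope = ols_slope(dL, dY)

print(cross_section_slope, first_difference_slope)
```

In this construction the cross-section slope equals β1 + 0.5 β2, which exceeds β1, while the differenced regression returns β1 itself.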


6) A study attempts to investigate the role of the various determinants of regional Canadian unemployment rates in order to get a better picture of Canadian aggregate unemployment rate behavior. The annual data (1967-1991) is for five regions (Atlantic region, Quebec, Ontario, Prairies, and British Columbia), and four age-gender groups (female and male, adult and young). Focusing on young females, the authors find significant effects for the following variables: the regional relative minimum wage rate (minimum wages divided by average hourly earnings), the regional share of youth in the labor force, the regional share of adult females in the labor force, United States activity shocks (deviations of United States GDP from trend), an indicator of the degree of monetary tightness in Canada, regional union density, and a regional index of unemployment insurance generosity. Explain why the authors only used region fixed effects. How would their specification have to change if they also employed time fixed effects?
Answer: Since the study used Canada-wide effects (United States activity shocks, and monetary tightness), these are identical for all regions at a point in time. Using time fixed effects in addition to these two variables would have generated perfect multicollinearity among the regressors, and hence the OLS estimator would not exist. An alternative specification would include time fixed effects, but eliminate the two variables which are constant across all regions at a given point in time.

7) (Requires Matrix Algebra) Consider the time and entity fixed effect model with a single explanatory variable

$Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \dots + \gamma_n Dn_i + \delta_2 B2_t + \dots + \delta_T BT_t + u_{it}$.

For the case of n = 4 and T = 3, write this model in the form Y = Xβ + U, where, in general,

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad U = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{pmatrix} = \begin{pmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{pmatrix}, \quad \text{and} \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}.$$

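How would the X matrix change if you added two binary variables, D1 and B1? Demonstrate that in this case the columns of the X matrix are not independent. Finally show that elimination of one of the two variables is not sufficient to get rid of the multicollinearity problem. In terms of the OLS estimator, $\hat{\beta} = (X'X)^{-1}X'Y$, why does perfect multicollinearity create a problem?
Answer: For the case of n = 4 and T = 3, the general model would look as follows:

$$\begin{pmatrix} Y_{11} \\ Y_{12} \\ Y_{13} \\ Y_{21} \\ Y_{22} \\ Y_{23} \\ Y_{31} \\ Y_{32} \\ Y_{33} \\ Y_{41} \\ Y_{42} \\ Y_{43} \end{pmatrix} = \begin{pmatrix} 1 & X_{11} & 0 & 0 & 0 & 0 & 0 \\ 1 & X_{12} & 0 & 0 & 0 & 1 & 0 \\ 1 & X_{13} & 0 & 0 & 0 & 0 & 1 \\ 1 & X_{21} & 1 & 0 & 0 & 0 & 0 \\ 1 & X_{22} & 1 & 0 & 0 & 1 & 0 \\ 1 & X_{23} & 1 & 0 & 0 & 0 & 1 \\ 1 & X_{31} & 0 & 1 & 0 & 0 & 0 \\ 1 & X_{32} & 0 & 1 & 0 & 1 & 0 \\ 1 & X_{33} & 0 & 1 & 0 & 0 & 1 \\ 1 & X_{41} & 0 & 0 & 1 & 0 & 0 \\ 1 & X_{42} & 0 & 0 & 1 & 1 & 0 \\ 1 & X_{43} & 0 & 0 & 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \gamma_2 \\ \gamma_3 \\ \gamma_4 \\ \delta_2 \\ \delta_3 \end{pmatrix} + \begin{pmatrix} u_{11} \\ u_{12} \\ u_{13} \\ u_{21} \\ u_{22} \\ u_{23} \\ u_{31} \\ u_{32} \\ u_{33} \\ u_{41} \\ u_{42} \\ u_{43} \end{pmatrix}$$

Adding the two binary variables would change the X matrix in this way (columns 8 and 9 are D1 and B1):

$$X = \begin{pmatrix} 1 & X_{11} & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ 1 & X_{12} & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\ 1 & X_{13} & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\ 1 & X_{21} & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\ 1 & X_{22} & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & X_{23} & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & X_{31} & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\ 1 & X_{32} & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & X_{33} & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & X_{41} & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\ 1 & X_{42} & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\ 1 & X_{43} & 0 & 0 & 1 & 0 & 1 & 0 & 0 \end{pmatrix}$$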

Adding columns 6, 7, and 9 results in column 1. Also adding columns 3, 4, 5, and 8 results in column 1. Hence the columns are not linearly independent and there is perfect multicollinearity among the columns of the matrix. Eliminating column 9, say, is not sufficient to get rid of this problem, since adding columns 3, 4, 5, and 8 still equals column 1. In case of perfect multicollinearity, the X matrix will not have full column rank, and hence $X'X$ will also not have full rank (it is singular). In this case, $X'X$ cannot be inverted, and hence the OLS estimator does not exist.

8) Consider the time and entity fixed effect model with a single explanatory variable

$Y_{it} = \beta_0 + \beta_1 X_{it} + \gamma_2 D2_i + \dots + \gamma_n Dn_i + \delta_2 B2_t + \dots + \delta_T BT_t + u_{it}$.

Assume that you had estimated the above equation by OLS. Typically the coefficients for the entity and time binary variables are not reported. Can you think of situations where the pattern of these coefficients might be of interest? What could you do, for example, if you had a strong theoretical justification for believing that a few macroeconomic variables had an effect on $Y_{it}$?
Answer: The coefficients pick up the effects of omitted variables that are common to all entities at a point in time (time fixed effects), or that are constant across time for entities (entity fixed effects). If data is available on slowly changing variables across time, say population density or average educational attainment by U.S. state, or on macroeconomic variables, then you could perform a regression of the binary variable coefficients on these variables to determine the degree of correlation. Obviously, the correlation will be less than perfect, and unless these variables bear coefficients of interest, there is little to be gained from these auxiliary regressions.
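The column dependencies claimed in the answer to problem 7 (n = 4, T = 3) can be verified mechanically. The sketch below builds the nine-column X matrix with arbitrary made-up values for X_it, checks the two column sums, and computes ranks with exact rational arithmetic. The helper names `column` and `rank` are illustrative, not from any library.

```python
from fractions import Fraction

n, T = 4, 3
# Arbitrary illustrative values for X_it, ordered (1,1), (1,2), ..., (4,3).
x_values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

rows = []
for i in range(1, n + 1):
    for t in range(1, T + 1):
        x = x_values[(i - 1) * T + (t - 1)]
        entity = [int(i == j) for j in (2, 3, 4)]   # columns 3-5: D2, D3, D4
        period = [int(t == s) for s in (2, 3)]      # columns 6-7: B2, B3
        extra = [int(i == 1), int(t == 1)]          # columns 8-9: D1, B1
        rows.append([1, x] + entity + period + extra)

def column(matrix, k):
    """Return column k (1-indexed, as in the text) of a list-of-rows matrix."""
    return [row[k - 1] for row in matrix]

def rank(matrix):
    """Rank via Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        pivot = next((k for k in range(r, len(m)) if m[k][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for k in range(len(m)):
            if k != r and m[k][c] != 0:
                factor = m[k][c] / m[r][c]
                m[k] = [a - factor * b for a, b in zip(m[k], m[r])]
        r += 1
        if r == len(m):
            break
    return r

time_sum = [a + b + c for a, b, c in
            zip(column(rows, 6), column(rows, 7), column(rows, 9))]
entity_sum = [a + b + c + d for a, b, c, d in
              zip(column(rows, 3), column(rows, 4), column(rows, 5), column(rows, 8))]

rank_all_nine = rank(rows)                          # with D1 and B1
rank_without_b1 = rank([row[:8] for row in rows])   # drop column 9 only
rank_original = rank([row[:7] for row in rows])     # drop D1 and B1

print(rank_all_nine, rank_without_b1, rank_original)
```

With these values, the rank stays below the number of columns until both D1 and B1 are removed, which is the point made in the answer.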


9) "Empirical studies of economic growth are flawed because many of the truly important underlying determinants, such as culture and institutions, are very hard to measure." Discuss this statement paying particular attention to simple cross-section data and panel data models. Use equations whenever possible to underscore your argument.
Answer: Although some cultural and institutional variables, such as corruption, black market activity, central bank independence, trust, etc., are hard to measure, authors have developed such series for the countries of the world. Still, these variables are either measured with error or fail to capture all cultural and institutional aspects. Hence you would expect omitted variable bias to be present in cross-sectional studies. However, if you could argue that these effects are constant across time or at least slowly changing, then introducing country fixed effects in panel studies goes some way to alleviate the omitted variable problem. Similarly, by using time fixed effects, common world business cycle effects can be largely eliminated. For an empirical study of economic growth using U.S. states, time fixed effects would eliminate common effects of monetary policy and inflation. The above argument can be made using equations along the theoretical arguments presented in sections 8.3 and 8.4 of the textbook.
10) Give at least three examples from macroeconomics and five from microeconomics that involve specified equations in a panel data analysis framework. Indicate in each case what the role of the entity and time fixed effects in terms of omitted variables might be.
Answer: Answers will vary by student. Given the textbook example, you can expect a study of fatality rates and beer taxes to appear. Other examples mentioned may be minimum wage studies using data from U.S. states or Canadian provinces, panel data in earnings studies, empirical studies of economic growth across the countries of the world or regions within a country, determinants of unemployment rates using data from geographical units (countries, regions, states), degree of democratization of the countries of the world, etc. Students should point out in the various examples how entity and time fixed effects pick up variables that are constant across entities at a point in time, or constant over time for specific entities. For geographical units, these typically involve cultural and institutional factors, and common macroeconomic effects.
11) Your textbook specifies a simple regression problem for two time periods, the years 1982 and 1988, as follows:
$\mathit{FatalityRate}_{i,1982} = \beta_0 + \beta_1 \mathit{BeerTax}_{i,1982} + u_{i,1982}$
$\mathit{FatalityRate}_{i,1988} = \beta_0 + \beta_1 \mathit{BeerTax}_{i,1988} + u_{i,1988}$
After subtracting the first equation from the second equation, the authors estimate the model and find a negative intercept.
a. Show how you would have to modify the two equations to allow for the presence of an intercept in the differenced model.
b. What would the relative magnitude of the intercepts in the modified model have to be for you to find a negative intercept?
Answer:
a. $\mathit{FatalityRate}_{i,1982} = \beta_0 + \beta_1 \mathit{BeerTax}_{i,1982} + u_{i,1982}$
$\mathit{FatalityRate}_{i,1988} = \alpha_0 + \beta_1 \mathit{BeerTax}_{i,1988} + u_{i,1988}$
b. $\alpha_0 < \beta_0$
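The reasoning behind answer (b) can be written out as a short derivation. It subtracts the 1982 equation from the 1988 equation in the modified model of answer (a), where α0 denotes the 1988 intercept and β0 the 1982 intercept:

```latex
\begin{align*}
\mathit{FatalityRate}_{i,1988} - \mathit{FatalityRate}_{i,1982}
  &= (\alpha_0 - \beta_0)
   + \beta_1 \left( \mathit{BeerTax}_{i,1988} - \mathit{BeerTax}_{i,1982} \right)
   + \left( u_{i,1988} - u_{i,1982} \right)
\end{align*}
```

The intercept of the differenced regression is α0 − β0, which is negative exactly when α0 < β0.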


12) Your textbook reports the following result from a two-way fixed effects (entity and time fixed effects) regression model:
FatalityRate = -0.64 BeerTax + StateFixedEffects + TimeFixedEffects
               (0.36)
where the number in parentheses is the heteroskedasticity- and autocorrelation-consistent (HAC) standard error.
a. Calculate the t-statistic. Can you reject the null hypothesis that the slope coefficient is zero in the population, using a two-sided test and a 5% significance level?
b. Given that economic theory suggests that the population slope is negative under the alternative hypothesis, is it possible to use a one-sided test here? In that case, does your conclusion change?
c. Using only heteroskedasticity-robust standard errors, but not HAC standard errors, the value in parentheses becomes 0.25. Repeat the calculations in (a) and report your decision based on a two-sided test.
d. Since the coefficient becomes more statistically significant in (c), should this influence your choice of standard errors? Why or why not?
Answer:
a. t = -0.64/0.36 = -1.78. Since |t| = 1.78 < 1.96, you cannot reject the null hypothesis that the coefficient is zero in the population.
b. The beer tax represents part of the cost (price) of alcohol consumption, and an increase in price should reduce the demand for alcohol. Hence economic theory suggests a negative price coefficient. It therefore seems reasonable to use a one-sided test. Since the critical value is -1.645 in that case and -1.78 < -1.645, you can reject the null hypothesis at the 5% significance level.
c. The t-statistic is now -0.64/0.25 = -2.56, and you can reject the null hypothesis at the 5% level, and almost at the 1% level.
d. It is better to use the HAC (clustered) standard errors, since these are valid whether or not there is heteroskedasticity, autocorrelation, or both. Heteroskedasticity-robust standard errors alone were derived under the assumption of no serial correlation in the error term, so using them will result in invalid statistical inference if the errors are serially correlated.
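The arithmetic in answers (a) to (c) can be reproduced in a few lines. This is a minimal sketch: it uses the coefficient −0.64 from the answer's own calculation, the two standard errors quoted in the question, and the usual large-sample normal critical values (1.96 two-sided, −1.645 one-sided at the 5% level). The function name `t_statistic` is made up for illustration.

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """t-statistic for H0: coefficient = null_value."""
    return (estimate - null_value) / std_error

coef = -0.64
t_hac = t_statistic(coef, 0.36)     # HAC (clustered) standard error
t_robust = t_statistic(coef, 0.25)  # heteroskedasticity-robust standard error only

# Large-sample 5% critical values from the standard normal distribution.
reject_two_sided_hac = abs(t_hac) > 1.96
reject_one_sided_hac = t_hac < -1.645
reject_two_sided_robust = abs(t_robust) > 1.96

print(round(t_hac, 2), reject_two_sided_hac, reject_one_sided_hac)
print(round(t_robust, 2), reject_two_sided_robust)
```

The two-sided decision flips between the two standard errors, which is why the choice of standard error has to rest on the error structure rather than on which one delivers significance.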


Chapter 11 Regression with a Binary Dependent Variable
11.1 Multiple Choice
1) The binary dependent variable model is an example of a A) regression model, which has as a regressor, among others, a binary variable. B) model that cannot be estimated by OLS. C) limited dependent variable model. D) model where the left-hand variable is measured in base 2. Answer: C
2) (Requires Appendix material) The following are examples of limited dependent variables, with the exception of A) binary dependent variable. B) log-log specification. C) truncated regression model. D) discrete choice model. Answer: B
3) In the binary dependent variable model, a predicted value of 0.6 means that A) the most likely value the dependent variable will take on is 60 percent. B) given the values for the explanatory variables, there is a 60 percent probability that the dependent variable will equal one. C) the model makes little sense, since the dependent variable can only be 0 or 1. D) given the values for the explanatory variables, there is a 40 percent probability that the dependent variable will equal one. Answer: B
4) E(Y | X1,..., Xk) = Pr(Y = 1 | X1,..., Xk) means that A) for a binary variable model, the predicted value from the population regression is the probability that Y = 1, given X. B) dividing Y by the X's is the same as the probability of Y being the inverse of the sum of the X's. C) the exponential of Y is the same as the probability of Y happening. D) you are pretty certain that Y takes on a value of 1 given the X's. Answer: A
5) The linear probability model is A) the application of the multiple regression model with a continuous left-hand side variable and a binary variable as at least one of the regressors. B) an example of probit estimation. C) another word for logit estimation. D) the application of the linear multiple regression model to a binary dependent variable. Answer: D
6) In the linear probability model, the interpretation of the slope coefficient is A) the change in odds associated with a unit change in X, holding other regressors constant. B) not all that meaningful since the dependent variable is either 0 or 1. C) the change in probability that Y = 1 associated with a unit change in X, holding other regressors constant. D) the response in the dependent variable to a percentage change in the regressor. Answer: C


7) The following tools from multiple regression analysis carry over in a meaningful manner to the linear probability model, with the exception of the A) F-statistic. B) significance test using the t-statistic. C) 95% confidence interval using ± 1.96 times the standard error. D) regression R². Answer: D
8) (Requires material from Section 11.3 – possibly skipped) For the measure of fit in your regression model with a binary dependent variable, you can meaningfully use the A) regression R². B) size of the regression coefficients. C) pseudo-R². D) standard error of the regression. Answer: C
9) The major flaw of the linear probability model is that A) the actuals can only be 0 and 1, but the predicted are almost always different from that. B) the regression R² cannot be used as a measure of fit. C) people do not always make clear-cut decisions. D) the predicted values can lie above 1 and below 0. Answer: D
10) The probit model A) is the same as the logit model. B) always gives the same fit for the predicted values as the linear probability model for values between 0.1 and 0.9. C) forces the predicted values to lie between 0 and 1. D) should not be used since it is too complicated. Answer: C
11) The logit model derives its name from A) the logarithmic model. B) the probit model. C) the logistic function. D) the tobit model. Answer: C
12) In the probit model Pr(Y = 1 | X) = Φ(β0 + β1 X), Φ A) is not defined for Φ(0). B) is the standard normal cumulative distribution function. C) is set to 1.96. D) can be computed from the standard normal density function. Answer: B
13) In the expression Pr(Y = 1 | X) = Φ(β0 + β1 X), A) (β0 + β1 X) plays the role of z in the cumulative standard normal distribution function. B) β1 cannot be negative since probabilities have to lie between 0 and 1. C) β0 cannot be negative since probabilities have to lie between 0 and 1. D) min (β0 + β1 X) > 0 since probabilities have to lie between 0 and 1. Answer: A


14) In the probit model Pr(Y = 1 | X1, X2,..., Xk) = Φ(β0 + β1 X1 + β2 X2 + ... + βk Xk), A) the β's do not have a simple interpretation. B) the slopes tell you the effect of a unit increase in X on the probability of Y. C) β0 cannot be negative since probabilities have to lie between 0 and 1. D) β0 is the probability of observing Y when all X's are 0. Answer: A
15) In the expression Pr(deny = 1 | P/I ratio, black) = Φ(–2.26 + 2.74 P/I ratio + 0.71 black), the effect of increasing the P/I ratio from 0.3 to 0.4 for a white person A) is 0.274 percentage points. B) is 4.7 percentage points. C) should not be interpreted without knowledge of the regression R². D) is 2.74 percentage points. Answer: B
16) The maximum likelihood estimation method produces, in general, all of the following desirable properties with the exception of A) efficiency. B) consistency. C) normally distributed estimators in large samples. D) unbiasedness in small samples. Answer: D
17) The logit model can be estimated and yields consistent estimates if you are using A) OLS estimation. B) maximum likelihood estimation. C) differences in means between those individuals with a dependent variable equal to one and those with a dependent variable equal to zero. D) the linear probability model. Answer: B
18) When having a choice of which estimator to use with a binary dependent variable, use A) probit or logit depending on which method is easiest to use in the software package at hand. B) probit for extreme values of X and the linear probability model for values in between. C) OLS (linear probability model) since it is easier to interpret. D) the estimation method which results in estimates closest to your prior expectations. Answer: A
19) Nonlinear least squares A) solves the minimization of the sum of squared predictive mistakes through sophisticated mathematical routines, essentially by trial and error methods. B) should always be used when you have nonlinear equations. C) gives you the same results as maximum likelihood estimation. D) is another name for sophisticated least squares. Answer: A


20) (Requires Advanced material) Only one of the following models can be estimated by OLS: A) $Y = AK^{\alpha}L^{\beta} + u$. B) $\Pr(Y = 1 \mid X) = \Phi(\beta_0 + \beta_1 X)$. C) $\Pr(Y = 1 \mid X) = F(\beta_0 + \beta_1 X) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$. D) $Y = AK^{\alpha}L^{\beta}u$. Answer: D
21) (Requires Advanced material) Nonlinear least squares estimators in general are not A) consistent. B) normally distributed in large samples. C) efficient. D) used in econometrics. Answer: C
22) (Requires Advanced material) Maximum likelihood estimation yields the values of the coefficients that A) minimize the sum of squared prediction errors. B) maximize the likelihood function. C) come from a probability distribution and hence have to be positive. D) are typically larger than those from OLS estimation. Answer: B
23) To measure the fit of the probit model, you should: A) use the regression R². B) plot the predicted values and see how closely they match the actuals. C) use the log of the likelihood function and compare it to the value of the likelihood function. D) use the fraction correctly predicted or the pseudo-R². Answer: D
24) When estimating probit and logit models, A) the t-statistic should still be used for testing a single restriction. B) you cannot have binary variables as explanatory variables as well. C) F-statistics should not be used, since the models are nonlinear. D) it is no longer true that $\bar{R}^2 < R^2$. Answer: A
25) The following problems could be analyzed using probit and logit estimation with the exception of whether or not A) a college student decides to study abroad for one semester. B) being a female has an effect on earnings. C) a college student will attend a certain college after being accepted. D) applicants will default on a loan. Answer: B
26) In the probit regression, the coefficient β1 indicates A) the change in the probability of Y = 1 given a unit change in X. B) the change in the probability of Y = 1 given a percent change in X. C) the change in the z-value associated with a unit change in X. D) none of the above. Answer: C


27) Your textbook plots the estimated regression function produced by the probit regression of deny on P/I ratio. The estimated probit regression function has a stretched “S” shape given that the coefficient on the P/I ratio is positive. Consider a probit regression function with a negative coefficient. The shape would A) resemble an inverted “S” shape (for low values of X, the predicted probability of Y would approach 1). B) not exist since probabilities cannot be negative. C) remain the “S” shape as with a positive slope coefficient. D) have to be estimated with a logit function. Answer: A
28) Probit coefficients are typically estimated using A) the OLS method. B) the method of maximum likelihood. C) non-linear least squares (NLLS). D) a transformation of the estimates from the linear probability model. Answer: B
29) F-statistics computed using maximum likelihood estimators A) cannot be used to test joint hypotheses. B) are not meaningful since the entire regression R² concept is hard to apply in this situation. C) do not follow the standard F distribution. D) can be used to test joint hypotheses. Answer: D
30) When testing joint hypotheses, you can use A) the F-statistic. B) the chi-squared statistic. C) either the F-statistic or the chi-squared statistic. D) none of the above. Answer: C


11.2 Essays and Longer Questions
1) Your task is to model students' choice for taking an additional economics course after the first principles course. Describe how to formulate a model based on data for a large sample of students. Outline several estimation methods and their relative advantage over other methods in tackling this problem. How would you go about interpreting the resulting output? What summary statistics should be included?
Answer: Answers will vary by student. This is an example of a binary dependent variable problem with multiple regressors. The variable of interest here is the grade received in the first principles course. Students may talk about grade-inflating departments luring students away by giving higher grades, or by watering down the value of the signal contained in a grade by compressing the grade distribution towards the upper end. Other control variables mentioned by students may be the Math SAT score, a binary variable for business majors, etc. Students should mention the linear probability model, the probit model, and the logit model, and discuss their relative advantages and disadvantages. Several points mentioned in the textbook just before section 11.3 should be brought up, such as the ease with which the linear probability model can be estimated and interpreted, although its functional form cannot capture the nature of the problem. However, in the case where there are few extreme values, the model may be used as an adequate approximation. There is little to choose between logit and probit, and the two fits are extremely close except in the tails. The answer to the interpretation question should focus on the idea that all three models try to predict a probability given the attributes of the subject, i.e., E(Y | X) = Pr(Y = 1 | X). Students are expected to mention that the regression R² is of no use here, given the nature of the dependent variable, and that the pseudo-R² and the fraction correctly predicted are available as alternatives.
2) The Report of the Presidential Commission on the Space Shuttle Challenger Accident in 1986 shows a plot of the calculated joint temperature in Fahrenheit and the number of O-rings that had some thermal distress. You collect the data for the seven flights for which thermal distress was identified before the fatal flight and produce the accompanying plot.

(a) Do you see any relationship between the temperature and the number of O-ring failures? If you fitted a linear regression line through these seven observations, do you think the slope would be positive or negative? Significantly different from zero? Do you see any problems other than the sample size in your procedure?
(b) You decide to look at all successful launches before Challenger, even those for which there were no incidents. Furthermore, you simplify the problem by specifying a binary variable, which takes on the value one if there was some O-ring failure and is zero otherwise. You then fit a linear probability model with the following result,
OFail = 2.858 – 0.037 × Temperature; R² = 0.325, SER = 0.390,
        (0.496)  (0.007)
where OFail is the binary variable which is one for launches where O-rings showed some thermal distress, and Temperature is measured in degrees Fahrenheit. The numbers in parentheses are heteroskedasticity-robust standard errors. Interpret the equation. Why do you think that heteroskedasticity-robust standard errors were used? What is your prediction for some O-ring thermal distress when the temperature is 31°, the temperature on January 28, 1986? Above which temperature do you predict values of less than zero? Below which temperature do you predict values of greater than one?
(c) To fix the problem encountered in (b), you re-estimate the relationship using a logit regression:
Pr(OFail = 1 | Temperature) = F(15.297 – 0.236 × Temperature); pseudo-R² = 0.297
                                (7.329)   (0.107)
What is the meaning of the slope coefficient? Calculate the effect of a decrease in temperature from 80° to 70°, and from 60° to 50°. Why is the change in probability not constant? How does this compare to the linear probability model?
(d) You want to see how sensitive the results are to using the logit, rather than the probit estimation method. The probit regression is as follows:
Pr(OFail = 1 | Temperature) = Φ(8.900 – 0.137 × Temperature); pseudo-R² = 0.296
                                (3.983)  (0.058)
Why is the slope coefficient in the probit so different from the logit coefficient? Calculate the effect of a decrease in temperature from 80° to 70°, and from 60° to 50°, and compare the resulting changes in probability to your results in (c). What is the meaning of the pseudo-R²? What other measures of fit might you want to consider?
(e) Calculate the predicted probability for 80° and 40°, using your probit and logit estimates. Based on the relationship between the probabilities, sketch what the general relationship between the logit and probit regressions is. Does there seem to be much of a difference for values other than these extreme values?
(f) You decide to run one more regression, where the dependent variable is the actual number of incidents (NoOFail). You allow for a different functional form by choosing the inverse of the temperature, and estimate the regression by OLS.
NoOFail = -3.8853 + 295.545 × (1/Temperature); R² = 0.386, SER = 0.622
          (1.516)   (106.541)
What is your prediction for O-ring failures for the 31° temperature which was forecasted for the launch on January 28, 1986? Sketch the fitted line of the regression above.
Answer:
(a) There does not appear to be a linear relationship underlying the few observations where O-ring failure occurred. If estimated by OLS, you would expect a slightly negative relationship (the slope turns out to be –0.025). It certainly would not be statistically significant using the t-statistic (although a standard normal distribution cannot be used given the small sample size). Using a linear function is also a problem since, even in the presence of a significant slope, the dependent variable cannot be less than zero.
(b) There is a negative relationship between the temperature and the occurrence of an O-ring failure. At high temperatures, say above 75 degrees, there is less than a 10 percent chance of O-ring failure. As was mentioned in the textbook, the errors of the linear probability model are always heteroskedastic. It is therefore necessary to use heteroskedasticity-robust standard errors for inference. The linear probability model predicts O-ring failure with certainty for temperatures below 50 degrees. The prediction for 31 degrees is therefore above one (1.7). The model predicts negative values for temperatures above 77 degrees Fahrenheit.
(c) The slope coefficient is negative. Hence increases in temperature result in a lowering of the probability of O-ring failures. Beyond that, neither the slope nor the intercept is easy to interpret. The decrease in temperature from 80° to 70° results in an increase in the probability of 20.0 percentage points, and from 60° to 50° in an increase of 21.3 percentage points. The change in probability is not constant since this is a nonlinear model. In the linear probability model the change in probability would remain constant, being 37 percentage points (10 × 0.037) in the above example.
(d) The slope coefficients should not be directly compared, since the functions are different. This does not imply that the calculated probabilities from the logit and probit models are dissimilar. For example, the decrease in temperature from 80° to 70° results in an increase in the probability of 22.5 percentage points, and from 60° to 50° in an increase of 22.8 percentage points. The pseudo-R² measures the improvement in the value of the log likelihood function from using temperature, compared to the case where no explanatory variables are used. An alternative measure of fit is the fraction correctly predicted.
(e) There is little difference between the logit and probit predictions, other than in the extremes. For 80°, the logit and probit predicted values are 2.7 and 2.0 percent respectively, and at 40°, they are 99.7 percent and 99.9 percent. Hence the logit is slightly higher at high temperatures and slightly lower at low temperatures. However, the difference is very small.
(f) The predicted number of failures from this regression is about 5.6 (–3.8853 + 295.545/31 ≈ 5.65).
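The predicted values in answers (b) to (f) follow directly from the reported coefficients. The sketch below is one way to reproduce them using only the Python standard library; the function names are made up for illustration, and the standard normal CDF is computed from `math.erf` rather than looked up in a table, so the last digit may differ slightly from table-based answers.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fitted models, using the coefficients reported in parts (b), (c), (d), (f).
def lpm(temp):
    return 2.858 - 0.037 * temp

def logit(temp):
    return logistic_cdf(15.297 - 0.236 * temp)

def probit(temp):
    return normal_cdf(8.900 - 0.137 * temp)

def inverse_temperature_ols(temp):
    return -3.8853 + 295.545 / temp

lpm_at_31 = lpm(31)
lpm_zero_crossing = 2.858 / 0.037         # prediction below 0 above this temperature
lpm_one_crossing = (2.858 - 1.0) / 0.037  # prediction above 1 below this temperature

# Change in predicted probability when temperature falls by 10 degrees.
changes = {}
for high, low in [(80, 70), (60, 50)]:
    changes[(high, low)] = {
        "lpm": lpm(low) - lpm(high),
        "logit": logit(low) - logit(high),
        "probit": probit(low) - probit(high),
    }

failures_at_31 = inverse_temperature_ols(31)

for (high, low), row in changes.items():
    print(high, low, {name: round(value, 3) for name, value in row.items()})
print(round(lpm_at_31, 2), round(failures_at_31, 2))
```

The linear probability model gives the same change for both temperature drops, while the logit and probit changes depend on where on the curve the drop starts.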


3) A study tried to find the determinants of the increase in the number of households headed by a female. Using 1940 and 1960 historical census data, a logit model was estimated to predict whether a woman is the head of a household (living on her own) or whether she is living within another's household. The limited dependent variable takes on a value of one if the female lives on her own and is zero if she shares housing. The results for 1960 using 6,051 observations on prime-age whites and 1,294 on nonwhites were as shown in the table (standard errors in parentheses):

Regression                     (1) White             (2) Nonwhite
Regression model               Logit                 Logit
Constant                       1.459 (0.685)         -2.874 (1.423)
Age                            -0.275 (0.037)        0.084 (0.068)
age squared                    0.00463 (0.00044)     0.00021 (0.00081)
education                      -0.171 (0.026)        -0.127 (0.038)
farm status                    -0.687 (0.173)        -0.498 (0.346)
South                          0.376 (0.098)         -0.520 (0.180)
expected family earnings       0.0018 (0.00019)      0.0011 (0.00024)
family composition             4.123 (0.294)         2.751 (0.345)
Pseudo-R²                      0.266                 0.189
Percent Correctly Predicted    82.0                  83.4

where age is measured in years, education is years of schooling of the family head, farm status is a binary variable taking the value of one if the family head lived on a farm, south is a binary variable for living in a certain region of the country, expected family earnings was generated from a separate OLS regression to predict earnings from a set of regressors, and family composition refers to the number of family members under the age of 18 divided by the total number in the family. The mean values for the variables were as shown in the table.

Variable                       (1) White mean        (2) Nonwhite mean
age                            46.1                  42.9
age squared                    2,263.5               1,965.6
education                      12.6                  10.4
farm status                    0.03                  0.02
south                          0.3                   0.5
expected family earnings       2,336.4               1,507.3
family composition             0.2                   0.3

(a) Interpret the results. Do the coefficients have the expected signs? Why do you think age was entered both in levels and in squares?
(b) Calculate the difference in the predicted probability between whites and nonwhites at the sample mean values of the explanatory variables. Why do you think the study did not combine the observations and allow for a nonwhite binary variable to enter?
(c) What would be the effect on the probability of a nonwhite woman living on her own, if education and family composition were changed from their current mean to the mean of whites, while all other variables were left unchanged at the nonwhite mean values?
Answer:
(a) Since these are logit estimates, the value of the coefficients cannot be interpreted easily. However, statements can be made about the direction of the relationship between the dependent variable and the regressors. The probability of a female living on her own decreases with years of education. Living on a farm also lowers the probability. These results hold both for whites and nonwhites. In addition, for whites the probability of living on her own decreases with age up to a point, but then increases, since the coefficient on age is negative and the coefficient on age squared is positive. This is the result of age entering as a level and the square of age. This relationship with regard to age is not statistically significant for nonwhites. In the south, white females are more likely to live on their own, but nonwhites are not. An increase in expected family earnings and in family composition increases the probability of females living on their own.
(b) For whites, the probability is 0.90, while for nonwhites, it is 0.88. In the above approach, all coefficients are allowed to vary, whereas in a combined sample, the coefficients on the variables other than the binary race variable would have to be identical.
(c) The probability would decrease from 0.88 to 0.81.
4) A study investigated the impact of house price appreciation on household mobility. The underlying idea was that if a house were viewed as one part of the household's portfolio, then changes in the value of the house, relative to other portfolio items, should result in investment decisions altering the current portfolio. Using 5,162 observations, the logit equation was estimated as shown in the table (standard errors in parentheses), where the limited dependent variable is one if the household moved in 1978 and is zero if the household did not move:

Regression model               Logit
constant                       -3.323 (0.180)
Male                           -0.567 (0.421)
Black                          -0.954 (0.515)
Married78                      0.054 (0.412)
marriage change                0.764 (0.416)
A7983                          -0.257 (0.921)
PNRN                           -4.545 (3.354)
Pseudo-R²                      0.016

where male, black, married78, and marriage change are binary variables. They indicate, respectively, if the entity was a male-headed household, a black household, was married, and whether a change in marital status occurred between 1977 and 1978. A7983 is the appreciation rate for each house from 1979 to 1983 minus the SMSA-wide rate of appreciation for the same time period, and PNRN is a predicted appreciation rate for the unit minus the national average rate.
(a) Interpret the results. Comment on the statistical significance of the coefficients. Do the slope coefficients lend themselves to easy interpretation?
(b) The mean values for the regressors are as shown in the accompanying table.

Variable                       Mean
male                           0.82
black                          0.09
married78                      0.78
marriage change                0.03
A7983                          0.003
PNRN                           0.007

Taking the coefficients at face value and using the sample means, calculate the probability of a household moving.
(c) Given this probability, what would be the effect of a decrease in the predicted appreciation rate of 20 percent, that is, PNRN = –0.20?
Answer:
(a) Since the logit model is nonlinear, the slope coefficients cannot be easily interpreted. However, the signs of the coefficients indicate the direction of the relationship between the regressors and the binary dependent variable. Accordingly, being married or having experienced a marriage change increases the probability of moving. A male-headed household or a black household is less likely to move. If the predicted appreciation rate relative to the national average increased, then the household is less likely to move. The same holds for the actual appreciation rate from 1979 to 1983. None of the slope coefficients are statistically significant with the exception of the black household and marriage change coefficients. The two t-statistics are –1.85 and 1.84 respectively. These would be statistically significant at the 5% level in a one-sided hypothesis test.
(b) The probability is 0.021.
(c) The resulting probability would be about 0.052, i.e., more than twice the value in the previous result.
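The predicted probabilities in questions 3 and 4 are all obtained the same way: evaluate the logit index at a set of regressor values and pass it through the logistic function. The sketch below illustrates that calculation with the tabulated coefficients and means. The dictionary keys and the helper `logit_probability` are made up for illustration, the A7983 coefficient is assumed to be −0.257 (the decimal point is missing in the source table), and the counterfactual in 4(c) is read as setting PNRN, the predicted appreciation rate, to −0.20.

```python
import math

def logit_probability(coefs, values):
    """Pr(Y = 1) = 1 / (1 + exp(-z)), with z = constant + sum(coef * value)."""
    z = coefs["constant"] + sum(coefs[name] * values[name] for name in values)
    return 1.0 / (1.0 + math.exp(-z))

# Question 3: female-headed households, 1960.
white_coefs = {"constant": 1.459, "age": -0.275, "age_sq": 0.00463,
               "education": -0.171, "farm": -0.687, "south": 0.376,
               "earnings": 0.0018, "family_comp": 4.123}
nonwhite_coefs = {"constant": -2.874, "age": 0.084, "age_sq": 0.00021,
                  "education": -0.127, "farm": -0.498, "south": -0.520,
                  "earnings": 0.0011, "family_comp": 2.751}
white_means = {"age": 46.1, "age_sq": 2263.5, "education": 12.6, "farm": 0.03,
               "south": 0.3, "earnings": 2336.4, "family_comp": 0.2}
nonwhite_means = {"age": 42.9, "age_sq": 1965.6, "education": 10.4, "farm": 0.02,
                  "south": 0.5, "earnings": 1507.3, "family_comp": 0.3}

p_white = logit_probability(white_coefs, white_means)
p_nonwhite = logit_probability(nonwhite_coefs, nonwhite_means)
# 3(c): nonwhite coefficients, with education and family composition at white means.
counterfactual = dict(nonwhite_means,
                      education=white_means["education"],
                      family_comp=white_means["family_comp"])
p_counterfactual = logit_probability(nonwhite_coefs, counterfactual)

# Question 4: household mobility.
move_coefs = {"constant": -3.323, "male": -0.567, "black": -0.954,
              "married78": 0.054, "marriage_change": 0.764,
              "A7983": -0.257,  # assumed value; decimal point missing in the source
              "PNRN": -4.545}
move_means = {"male": 0.82, "black": 0.09, "married78": 0.78,
              "marriage_change": 0.03, "A7983": 0.003, "PNRN": 0.007}

p_move = logit_probability(move_coefs, move_means)
p_move_low_pnrn = logit_probability(move_coefs, dict(move_means, PNRN=-0.20))

print(round(p_white, 3), round(p_nonwhite, 3), round(p_counterfactual, 3))
print(round(p_move, 3), round(p_move_low_pnrn, 3))
```

Because the logistic function is nonlinear, the effect of changing one regressor depends on the values at which the others are held, which is why the means are stated explicitly in each calculation.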


5) A study analyzed the probability of Major League Baseball (MLB) players to “survive” for another season, or, in other words, to play one more season. The researchers had a sample of 4,728 hitters and 3,803 pitchers for the years 1901-1999. All explanatory variables are standardized. The probit estimation yielded the results as shown in the table (standard errors in parentheses):

Regression                     (1) Hitters           (2) Pitchers
Regression model               probit                probit
constant                       2.010 (0.030)         1.625 (0.031)
number of seasons played       -0.058 (0.004)        -0.031 (0.005)
performance                    0.794 (0.025)         0.677 (0.026)
average performance            0.022 (0.033)         0.100 (0.036)

where the limited dependent variable takes on a value of one if the player had one more season (a minimum of 50 at bats or 25 innings pitched), number of seasons played is measured in years, performance is the batting average for hitters and the earned run average for pitchers, and average performance refers to performance over the career. (a) Interpret the two probit equations and calculate survival probabilities for hitters and pitchers at the sample mean. Why are these so high? (b) Calculate the change in the survival probability for a player who has a very bad year by performing two standard deviations below the average (assume also that this player has been in the majors for many years so that his average performance is hardly affected). How does this change the survival probability when compared to the answer in (a)? (c) Since the results seem similar, the researcher could consider combining the two samples. Explain in some detail how this could be done and how you could test the hypothesis that the coefficients are the same. Answer: (a) Note that all variables are standardized, so that the mean is zero. This results in a survival probability of 0.997 for hitters and 0.991 for pitchers. These results are so high because there is a high probability, in general, for a player to return the following season. (b) Since the variables are standardized, this implies a change of two for the performance variable. The result for hitters is a lowering of the survival probability to 0.65, and for pitchers to 0.633 (c) After combining the sample for hitters and pitchers, you would allow for a different intercept and slopes by introducing a binary variable for pitchers if hitters are the default. This binary variable would be introduced by itself and in combination with each of the above variables, thereby allowing all coefficients to differ. You could then conduct an F-test for the joint hypothesis that all coefficients involving the binary variables are zero. 
If the hypothesis cannot be rejected, then there is no difference between the coefficients for hitters and pitchers. 6) The logit regression (11.10) on page 393 of your textbook reads: Pr(deny=1|P/Iratio,black) = F(-4.13 + 5.37 P/Iratio + 1.27 black) a)

Using a spreadsheet program such as Excel, plot the following logistic regression function with a single X,

$$\hat{Y}_i = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i})}}$$

where $\hat{\beta}_0 = -4.13$, $\hat{\beta}_1 = 5.37$, $\hat{\beta}_2 = 1.27$. Enter values for X1 in the first column starting from 0 and then increment these by 0.1 until you reach 2.0. Let X2 be 0 at first. Then enter the logistic function formula in the next column. Next allow X2 to be 1 and calculate the new values for the logistic function in the third column. Finally produce the predicted probabilities for both blacks and whites, connecting the predicted values with a line. (b) Using the same spreadsheet calculations, list how the probability increases for blacks and for whites as the P/I ratio increases from 0.5 to 0.6. (c) What is the difference in the rejection probability between blacks and whites for a P/I ratio of 0.5 and for 0.9? Why is the difference smaller for the higher value here? (d) Table 11.2 on page 401 of your textbook lists logit regressions (column 2) with further explanatory variables. Given that you can only produce simple plots in two dimensions, how would you proceed in (a) above if there were more than a single explanatory variable? Answer: a.

b. The deny probability increases by 9.7 percentage points for whites, and by 13.3 percentage points for blacks. c. At a P/I value of 0.5, the difference is approximately 30 percentage points, while it is approximately 20 percentage points for the higher value. As the ratio increases, the probability of rejection approaches 1 for every applicant, regardless of race, so the gap between the two groups has to shrink. d. In that case you would have to hold the other explanatory variables constant. A simple solution would be to set all of these to zero. A more reasonable approach would be to set them to their sample average if they are continuous variables, and to set them either to 0 or 1 for binary variables. 7) Equation (11.3) in your textbook presents the regression results for the linear probability model. a.

Using a spreadsheet program such as Excel, plot the fitted values for whites and blacks in the same graph, for P/I ratios ranging from 0 to 1 (use 0.05 increments).

b. Explain some of the strengths and shortcomings of the linear probability model using this graph.

Answer: a.


Answer:

b. The strength is that the regression line is easy to interpret once you realize that the fitted values are probabilities of being denied a loan: increases in the P/I ratio of 10 percentage points increase the probability of being denied by roughly 6 percentage points. The role of the binary variable for blacks also becomes clear: blacks have a roughly 18 percentage point higher probability of being rejected for a loan when compared to whites, at any given level of a P/I ratio. As for shortcomings, it becomes clear that this model cannot be used to calculate the probability of rejection for whites with a P/I ratio less than approximately 20 percent. In that case, the predicted probability would be negative. Similarly, you would expect the probability increase for a given change in the P/I ratio to change as the P/I ratio becomes larger; this is not the case for the linear probability model. Furthermore, you will find values larger than 1 for the P/I ratio in the data set used for Chapter 11. As a result, the predicted probability of being rejected for a loan would be above 1 for some individuals, which does not make sense.

8) Equation (11.3) in your textbook presents the regression results for the linear probability model, and equation (11.10) the results for the logit model. a.

Using a spreadsheet program such as Excel, plot the predicted probabilities for being denied a loan for both the linear probability model and the logit model if you are black. (Use a range from 0 to 1 for the P/I Ratio and allow for it to increase by increments of 0.05.)

b. Given the shortcomings of the linear probability model, do you think that it is a reasonable approximation to the logit model?

c. Repeat the exercise using predicted probabilities for whites.

Answer: a.


Answer:

b. The predicted probabilities are actually quite close for P/I Ratio values between 0 and 0.5. Beyond that, the linear probability model predicts substantially lower rejection probabilities. c.

Here the shortcomings of the linear probability model become obvious for P/I Ratio values of less than approximately 0.2: the predicted probabilities become negative. However, for values of between 0.2 and 0.7, the predicted probabilities of both models are approximately the same, so that the linear probability model would work well as an approximation.


11.3 Mathematical and Graphical Problems 1) Sketch the regression line for the linear probability model with a single regressor. Indicate for which values of the slope and intercept the predictions will be above one and below zero. Can you rule out homoskedasticity in the error terms with certainty here? Answer: The errors in the linear probability model are always heteroskedastic.

2) Consider the following logit regression: Pr(Y = 1 | X) = F(15.3 – 0.24 × X). Calculate the change in probability for X increasing by 10 for X = 40 and X = 60. Why is there such a large difference in the change in probabilities? Answer: Pr(Y=1 | X=40) = 0.997; Pr(Y=1 | X=50) = 0.964; Pr(Y=1 | X=60) = 0.711; Pr(Y=1 | X=70) = 0.182. The change is large due to the nonlinear nature of the model and the values for which the change was calculated.
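The probabilities in this answer can be checked with a few lines of code. The following is a minimal sketch rather than part of the textbook solution; the function names `logistic`, `logit_prob`, and `change_for_step` are hypothetical, and only the coefficients 15.3 and -0.24 come from the question.

```python
import math


def logistic(z):
    """Cumulative standard logistic distribution function F(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit_prob(x, b0=15.3, b1=-0.24):
    """Pr(Y = 1 | X = x) for the logit regression quoted in the question."""
    return logistic(b0 + b1 * x)


def change_for_step(x, step=10):
    """Change in the predicted probability when X rises from x to x + step."""
    return logit_prob(x + step) - logit_prob(x)


for x in (40, 50, 60, 70):
    print(f"X = {x}: Pr(Y=1|X) = {logit_prob(x):.3f}")

print(f"Change from 40 to 50: {change_for_step(40):.3f}")
print(f"Change from 60 to 70: {change_for_step(60):.3f}")
```

The same ten-unit step in X moves the index by 2.4 in both cases, but the logistic curve is nearly flat near 1 and steep near 0.5, which is why the two changes differ so much.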


3) You have a limited dependent variable (Y) and a single explanatory variable (X). You estimate the relationship using the linear probability model, a probit regression, and a logit regression. The results are as follows (standard errors in parentheses below the slope):

$$\hat{Y} = 2.858 - \underset{(0.007)}{0.037} \times X$$
$$\Pr(Y = 1 \mid X) = F(15.297 - 0.236 \times X)$$
$$\Pr(Y = 1 \mid X) = \Phi(8.900 - \underset{(0.058)}{0.137} \times X)$$

(a) Although you cannot compare the coefficients directly, you are told that “it can be shown” that certain relationships between the coefficients of these models hold approximately. These are for the slope: $\hat{\beta}_{probit} \approx 0.625 \times \hat{\beta}_{Logit}$, $\hat{\beta}_{linear} \approx 0.25 \times \hat{\beta}_{Logit}$. Take the logit result above as a base and calculate the slope coefficients for the linear probability model and the probit regression. Are these values close? (b) For the intercept, the same conversion holds for the logit-to-probit transformation. However, for the linear probability model, there is a different conversion: $\hat{\beta}_{0,linear} \approx 0.25 \times \hat{\beta}_{0,Logit} + 0.5$. Using the logit regression as the base, calculate a few changes in X (temperature in degrees of Fahrenheit) to see how good the approximations are.

Answer: (a) In absolute value, $\hat{\beta}_{probit} \approx 0.625 \times 0.236 = 0.148$, which is quite close to the estimated slope, judging by its standard error. $\hat{\beta}_{linear} \approx 0.25 \times 0.236 = 0.059$ is close numerically, but not as close when you take into account the small standard error. (b) The approximation gives a probit intercept of 9.561 and a linear approximation of 4.324.

Temperature X   Linear probability model       Probit model
                Actual    Approximation        Actual    Approximation
30              1.7       2.6                  1         1
40              1.4       2.0                  1         1
50              1.0       1.4                  0.98      0.98
60              0.6       0.8                  0.75      0.75
70              0.3       0.2                  0.25      0.21
80              -0.1      -0.4                 0.02      0.01
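The comparison above can be recomputed directly from the three estimated equations and the two rules of thumb. This is a sketch under stated assumptions: the names `std_normal_cdf`, `comparison_rows`, and the constants are hypothetical, the normal CDF is built from `math.erf`, and the unrounded converted coefficients are used, so the last digit can differ slightly from hand calculations based on rounded values.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Estimates quoted in the question (intercept, slope).
LOGIT_B0, LOGIT_B1 = 15.297, -0.236
LPM_B0, LPM_B1 = 2.858, -0.037
PROBIT_B0, PROBIT_B1 = 8.900, -0.137

# Rule-of-thumb conversions from the logit coefficients.
APPROX_PROBIT_B0 = 0.625 * LOGIT_B0
APPROX_PROBIT_B1 = 0.625 * LOGIT_B1
APPROX_LPM_B0 = 0.25 * LOGIT_B0 + 0.5
APPROX_LPM_B1 = 0.25 * LOGIT_B1


def comparison_rows(temperatures=(30, 40, 50, 60, 70, 80)):
    """Return (X, LPM actual, LPM approx, probit actual, probit approx)."""
    rows = []
    for x in temperatures:
        rows.append((
            x,
            LPM_B0 + LPM_B1 * x,
            APPROX_LPM_B0 + APPROX_LPM_B1 * x,
            std_normal_cdf(PROBIT_B0 + PROBIT_B1 * x),
            std_normal_cdf(APPROX_PROBIT_B0 + APPROX_PROBIT_B1 * x),
        ))
    return rows


for x, lpm, lpm_approx, probit, probit_approx in comparison_rows():
    print(f"{x:3d}  {lpm:5.1f}  {lpm_approx:5.1f}  {probit:5.2f}  {probit_approx:5.2f}")
```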

In terms of calculated probabilities, the approximation is closer for the probit model than for the linear probability model. 4) The population logit model of the binary dependent variable Y with a single regressor is

$$\Pr(Y = 1 \mid X_1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1)}}$$

Logistic functions also play a role in econometrics when the dependent variable is not a binary variable. For example, the demand for televisions sets per household may be a function of income, but there is a saturation or satiation level per household, so that a linear specification may not be appropriate. Given the regression model


$$Y_i = \frac{\beta_0}{1 + \beta_1 e^{-\beta_2 X_i}} + u_i,$$

sketch the regression line. How would you go about estimating the coefficients? Answer: The equation cannot be estimated using linear methods or transformations that allow linearization. However, nonlinear least squares estimation is possible as described in section 11.3 of the textbook.

Some students may point out that β0 will give an estimate of the satiation level (perhaps 10 TVs per household), and that the point of inflection is at $X = \frac{\ln \beta_1}{\beta_2}$.
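For readers who want to see where the inflection point comes from, the following worked derivation is a sketch, not part of the original answer. It assumes β0, β1, β2 > 0 and introduces the shorthand $w$ purely for convenience.

```latex
% Shorthand (an assumption of this sketch): w = \beta_1 e^{-\beta_2 X}, so dw/dX = -\beta_2 w.
\begin{align*}
E(Y \mid X) &= \frac{\beta_0}{1 + w}, \qquad w = \beta_1 e^{-\beta_2 X} \\
\frac{d\,E(Y \mid X)}{dX} &= \frac{\beta_0 \beta_2\, w}{(1 + w)^2} \\
\frac{d^2 E(Y \mid X)}{dX^2} &= -\,\frac{\beta_0 \beta_2^2\, w\,(1 - w)}{(1 + w)^3} \\
\frac{d^2 E(Y \mid X)}{dX^2} = 0 &\iff w = 1 \iff X = \frac{\ln \beta_1}{\beta_2}
\end{align*}
```

At that value of X the regression function equals β0/2, that is, half of the satiation level, and the curve approaches β0 as X grows large.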


5) (Requires Appendix material) Briefly describe the difference between the following models: censored and truncated regression model, count data, ordered responses, and discrete choice data. Try to be specific in terms of describing the data involved. Answer: The answer should follow the discussion in Appendix 11.3. Briefly: censored regression models have a dependent variable that has been “censored” above or below a certain cutoff, such as in the case where some individuals actually spend different amounts of money on an item, but others do not spend any amount. An example is the tobit regression model. The difference from the truncated regression model is that data is available for both types of individuals, buyers and non-buyers, in the case of the censored model, but only for buyers in the case of the truncated regression model. An example of these types of models is expenditures by individuals. There are other examples in economics where sample selection bias occurs, such as in the case of earnings functions (labor economics), industrial organization, and finance. Count data involves a discrete dependent variable, such as the number of times an activity is performed. Just as OLS does not perform well in the discrete dependent variable case, the same holds here, and special methods (Poisson and negative binomial regression models) have been developed to deal with the special format. Ordered response data resembles the count data situation, in that there is a natural ordering. The difference is that there are no natural numerical values attached, unlike count data, where an activity by individuals happens a discrete number of times during a certain period. For example, the Federal Reserve may decide to lower the federal funds rate or not, and conditionally on lowering it, it may decide on a mild cut or a more severe cut. Ordered probit models have been developed for such situations. 
Finally, discrete choice data also allows for multiple responses, but these are not ordered, such as when the individual can decide on different modes of transportation. In addition to its use in transportation economics, multinomial probit and logit regression models have been developed and applied in labor economics and health economics.


6) (Requires Appendix material and Calculus) The logarithm of the likelihood function (L) for estimating the population mean and variance for an i.i.d. normal sample is as follows (note that taking the logarithm of the likelihood function simplifies maximization. It is a monotonic transformation of the likelihood function, meaning that this transformation does not affect the choice of maximum):

$$L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu_Y)^2$$

Derive the maximum likelihood estimator for the mean and the variance. How do they differ, if at all, from the OLS estimator? Given that the OLS estimators are unbiased, what can you say about the maximum likelihood estimators here? Is the estimator for the variance consistent? Answer: Taking the derivative with respect to the two parameters $\mu_Y$ and $\sigma^2$ results in

$$\frac{\partial L}{\partial \mu_Y} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(Y_i - \mu_Y)(-1) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(Y_i - \mu_Y)$$

$$\frac{\partial L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(Y_i - \mu_Y)^2.$$

The maximum likelihood estimator is then the value for $\mu_Y$ and $\sigma^2$ that maximizes the (log) likelihood function. Setting both equations to zero, and assuming that this results in a maximum rather than a minimum (second order conditions will not be discussed here), yields

$$\hat{\mu}_{Y,MLE} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y} \quad \text{and} \quad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{\mu}_{Y,MLE})^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$

The maximum likelihood estimator of the population mean is therefore the sample mean. Since the OLS estimator is identical, and it is unbiased, the MLE will also be unbiased. However, the MLE for the population variance differs from the OLS estimator, which divides by n - 1 rather than n, and since the OLS estimator is unbiased, the MLE must be biased. But the difference between the two estimators vanishes as n increases, and hence the MLE is consistent.
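A small numerical illustration of the two variance estimators may help. This is a sketch only: the function `mle_normal` and the six data points are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
import statistics


def mle_normal(sample):
    """Return the MLE of the mean and of the variance for an i.i.d. normal sample."""
    n = len(sample)
    mean_mle = sum(sample) / n
    # The MLE of the variance divides by n, not by n - 1.
    var_mle = sum((y - mean_mle) ** 2 for y in sample) / n
    return mean_mle, var_mle


# Hypothetical data, used only for illustration.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n = len(sample)

mean_mle, var_mle = mle_normal(sample)
var_unbiased = statistics.variance(sample)  # divides by n - 1

print(f"MLE of the mean:          {mean_mle:.3f}")
print(f"MLE of the variance:      {var_mle:.3f}")
print(f"Unbiased sample variance: {var_unbiased:.3f}")
print(f"Ratio MLE / unbiased:     {var_mle / var_unbiased:.3f}  ((n - 1) / n = {(n - 1) / n:.3f})")
```

The ratio of the two variance estimators is exactly (n - 1)/n, which tends to 1 as the sample grows; this is the sense in which the bias of the MLE vanishes in large samples.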


7) Besides maximum likelihood estimation of the logit and probit model, your textbook mentions that the model can also be estimated by nonlinear least squares. Construct the sum of squared prediction mistakes and suggest how computer algorithms go about finding the coefficient values that minimize the function. You may want to use an analogy where you place yourself into a mountain range at night with a flashlight shining at your feet. Your task is to find the lowest point in the valley. You have two choices to make: the direction you are walking in and the step length. Describe how you will proceed to find the bottom of the valley. Once you find the lowest point, is there any guarantee that this is the lowest point of all valleys? What should you do to assure this?

Answer: In general, $\sum_{i=1}^{n} [Y_i - f(b_0 + b_1 X_{1i} + \dots + b_k X_{ki})]^2$ is the sum of squared prediction mistakes, whether the function f() is linear or nonlinear. Nonlinear least squares then uses a sophisticated algorithm of trial and error to find the minimum of the sum of squared prediction mistakes by changing the values of the parameters. Some of the routines are called Newton-Raphson, Gauss-Newton, Method of Steepest Ascent, etc. What they have in common is the general principle that they evaluate the sum of squared prediction mistakes after changing the parameters in a certain direction and by a certain size. In the analogy, the student is lowered into a mountain range at night and her task is to find the lowest point of the valley. The rule may be that she will walk in one direction as long as at the end of the step she is at a lower point than at the beginning of the step. If not, then she should walk in a different direction. She is also allowed to choose the step length. There is, of course, no guarantee that another point in another valley is not lower than the one she found in the valley she is in, nor is she guaranteed to find the lowest point if she makes very large steps. To assure that this is the lowest point, she should ask to be dropped off in a different location (“starting point”) and see if she finds the same spot again. Finally, she should be warned that it is possible to walk along ridges for a long time without much progress visible. 8) Consider the following probit regression Pr(Y = 1 | X) = Φ(8.9 – 0.14 × X). Calculate the change in probability for X increasing by 10 for X = 40 and X = 60. Why is there such a large difference in the change in probabilities? Answer: Pr(Y=1 | X=40) = 0.999; Pr(Y=1 | X=50) = 0.971; Pr(Y=1 | X=60) = 0.691; Pr(Y=1 | X=70) = 0.184. The large differences happen as a result of the non-linearity of the function, and the points at which they are calculated. 
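The probit probabilities in the answer to question 8 can be reproduced without statistical tables. The sketch below assumes nothing beyond the coefficients 8.9 and -0.14 given in the question; the helper names `std_normal_cdf` and `probit_prob` are hypothetical, and the normal CDF is computed from `math.erf`.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_prob(x, b0=8.9, b1=-0.14):
    """Pr(Y = 1 | X = x) for the probit regression quoted in question 8."""
    return std_normal_cdf(b0 + b1 * x)


for x in (40, 50, 60, 70):
    print(f"X = {x}: Pr(Y=1|X) = {probit_prob(x):.4f}")

change_at_40 = probit_prob(50) - probit_prob(40)
change_at_60 = probit_prob(70) - probit_prob(60)
print(f"Change from 40 to 50: {change_at_40:.3f}")
print(f"Change from 60 to 70: {change_at_60:.3f}")
```

At X = 40 the index is 3.3, far out in the flat upper tail of the normal CDF, whereas at X = 60 the index is 0.5, where the CDF is steep, so the same increase in X produces a much larger change in probability.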
9) Earnings equations establish a relationship between an individual’s earnings and its determinants such as years of education, tenure with an employer, IQ of the individual, professional choice, region within the country the individual is living in, etc. In addition, binary variables are often added to test for “discrimination” against certain sub-groups of the labor force such as blacks, females, etc. Compare this approach to the study in the textbook, which also investigates evidence on discrimination. Explain the fundamental differences in both approaches using equations and mathematical specifications whenever possible. Answer: In the former case, the binary variable appears as a regressor. That is, the regression may be ln( Earni) = β0 + β1 × Educi + β2 × Exper + β3 × Binary + ... + ui, where earnings of an individual are explained by a set of attributes. Binary is a shift variable, which is one for females (or blacks, religion, union members, etc.). The coefficient on the shift variable then indicates whether or not the individual is treated differently, controlling for all other influences. However, the dependent variable is continuous. In the case of a limited dependent variable, it is the left-hand variable that is binary. Here behavior of a qualitative type is being explained, i.e., Binaryi = β0 + β1 × X1i + β2 × X2i + ... + βk × Xki + ui, although some of the regressors may also be binary variables.


10) (Requires Appendix material and Calculus) The log of the likelihood function (L) for the simple regression model with i.i.d. normal errors is as follows (note that taking the logarithm of the likelihood function simplifies maximization. It is a monotonic transformation of the likelihood function, meaning that this transformation does not affect the choice of maximum):

$$L = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2$$

Derive the maximum likelihood estimator for the slope and intercept. What general properties do these estimators have? Explain intuitively why the OLS estimator is identical to the maximum likelihood estimator here. Answer: Maximizing the likelihood function with respect to the regression coefficients is the same as making the third term as small as possible. However, this term will become the sum of squared residuals once the function is maximized. Hence maximizing the likelihood function is identical to minimizing the sum of squared residuals, and the two methods of choosing an estimator are therefore identical for the regression coefficients. Taking the derivative of the log of the likelihood with respect to the three parameters $\beta_0$, $\beta_1$ and $\sigma^2$ results in

$$\frac{\partial L}{\partial \beta_0} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(Y_i - \beta_0 - \beta_1 X_i)(-1)$$

$$\frac{\partial L}{\partial \beta_1} = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(Y_i - \beta_0 - \beta_1 X_i)(-X_i)$$

$$\frac{\partial L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2$$

Setting the equations to zero and solving for the three parameters then results in the maximum likelihood estimator (MLE).

$$\sum_{i=1}^{n}(Y_i - \hat{\beta}_{0,MLE} - \hat{\beta}_{1,MLE} X_i) = 0, \quad \text{or} \quad \hat{\beta}_{0,MLE} = \bar{Y} - \hat{\beta}_{1,MLE}\bar{X}.$$

$$\sum_{i=1}^{n}(Y_i - \hat{\beta}_{0,MLE} - \hat{\beta}_{1,MLE} X_i)(X_i) = 0,$$

or, after multiplying through by $X_i$ and substituting $\hat{\beta}_{0,MLE}$,

$$\hat{\beta}_{1,MLE} = \frac{\sum_{i=1}^{n} Y_i X_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}.$$

$$-\frac{n}{2\hat{\sigma}^2_{MLE}} + \frac{1}{2\hat{\sigma}^4_{MLE}}\sum_{i=1}^{n}(Y_i - \hat{\beta}_{0,MLE} - \hat{\beta}_{1,MLE} X_i)^2 = 0,$$

or

$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{\beta}_{0,MLE} - \hat{\beta}_{1,MLE} X_i)^2 = \frac{1}{n}\sum_{i=1}^{n}\hat{u}_i^2.$$

The estimator for the regression slope and intercept is therefore identical to the OLS estimator. However, the estimator for the error variance is different and biased. In general, MLEs are consistent. They are also normally distributed in large samples. 11) The estimated logit regression in your textbook is Pr(deny=1|P/Iratio,black) = F(-4.13 + 5.37 P/Iratio + 1.27 black) Using a spreadsheet program, such as Excel, generate a table with predicted probabilities for both whites and blacks using P/I Ratio values between 0 and 1 and increments of 0.05. Answer:

P/I Ratio   whites   blacks
0.00        0.02     0.05
0.05        0.02     0.07
0.10        0.03     0.09
0.15        0.03     0.11
0.20        0.04     0.14
0.25        0.06     0.18
0.30        0.07     0.22
0.35        0.10     0.27
0.40        0.12     0.33
0.45        0.15     0.39
0.50        0.19     0.46
0.55        0.24     0.52
0.60        0.29     0.59
0.65        0.35     0.65
0.70        0.41     0.71
0.75        0.47     0.76
0.80        0.54     0.81
0.85        0.61     0.85
0.90        0.67     0.88
0.95        0.73     0.90
1.00        0.78     0.92
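The question asks for a spreadsheet, but the same table can be generated programmatically. The following is a minimal sketch using only the logit coefficients quoted in the question; the function names `deny_probability` and `build_table` are hypothetical.

```python
import math


def deny_probability(pi_ratio, black, b0=-4.13, b1=5.37, b2=1.27):
    """Predicted Pr(deny = 1 | P/I ratio, black) from the quoted logit regression."""
    z = b0 + b1 * pi_ratio + b2 * black
    return 1.0 / (1.0 + math.exp(-z))


def build_table(step=0.05, n_steps=20):
    """Rows of (P/I ratio, probability for whites, probability for blacks)."""
    rows = []
    for k in range(n_steps + 1):
        pi_ratio = k * step
        rows.append((pi_ratio,
                     deny_probability(pi_ratio, black=0),
                     deny_probability(pi_ratio, black=1)))
    return rows


print("P/I Ratio  whites  blacks")
for pi_ratio, white, black in build_table():
    print(f"{pi_ratio:9.2f}  {white:6.2f}  {black:6.2f}")
```

Building each P/I value as `k * step` rather than by repeated addition keeps floating-point drift out of the first column.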

12) The estimated logit regression in your textbook is Pr(deny=1|P/Iratio,black) = F(-4.13 + 5.37 P/Iratio + 1.27 black) Is there a meaningful interpretation to the slope for the P/I Ratio? Calculate the increase of a rejection probability for both blacks and whites as the P/I Ratio increases from 0.1 to 0.2. Repeat the exercise for an increase from 0.65 to 0.75. Why is the increase in the probability higher for blacks at the smaller value of the P/I Ratio but higher for whites at the larger P/I Ratio? Answer: There is no meaningful interpretation of the regression slope: it certainly does not indicate by how much the rejection probability increases for a given change in the P/I Ratio. For whites, the changes in the rejection probability are 0.02 and 0.13. For blacks, the respective values are 0.05 and 0.11. The differences are due to the non-linearity of the logit function: the function for blacks is steeper at low values of the P/I Ratio and flattens out at higher values.
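The pattern in this answer can be made explicit by differentiating the logit probability with respect to the P/I ratio. This is a standard calculus step added as a sketch, not part of the original answer; the symbols $p$ and $z$ are shorthand introduced here.

```latex
% Shorthand (an assumption of this sketch): p is the predicted denial probability, z the logit index.
\begin{align*}
p &= F(z) = \frac{1}{1 + e^{-z}}, \qquad
z = -4.13 + 5.37 \times \text{P/I ratio} + 1.27 \times \text{black} \\
\frac{\partial p}{\partial(\text{P/I ratio})} &= 5.37 \times F(z)\,[1 - F(z)] = 5.37 \times p\,(1 - p)
\end{align*}
```

The effect of a change in the P/I ratio is therefore largest where p is close to 0.5 and shrinks as p approaches 0 or 1. At low P/I ratios the predicted probability for blacks is closer to 0.5 than that for whites, while at P/I ratios around 0.65 to 0.75 it is the probability for whites that is closer to 0.5.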


Chapter 12 Instrumental Variables Regression 12.1 Multiple Choice 1) Estimation of the IV regression model A) requires exact identification. B) allows only one endogenous regressor, which is typically correlated with the error term. C) requires exact identification or overidentification. D) is only possible if the number of instruments is the same as the number of regressors. Answer: C 2) Two Stage Least Squares is calculated as follows; in the first stage: A) Y is regressed on the exogenous variables only. The predicted value of Y is then regressed on the instrumental variables. B) the unknown coefficients in the reduced form equation are estimated by OLS, and the predicted values are calculated. In the second stage, Y is regressed on these predicted values and the other exogenous variables. C) the exogenous variables are regressed on the instruments. The predicted value of the exogenous variables is then used in the second stage, together with the instruments, to predict the dependent variable. D) the unknown coefficients in the reduced form equation are estimated by weighted least squares, and the predicted values are calculated. In the second stage, Y is regressed on these predicted values and the other exogenous variables. Answer: B 3) The conditions for valid instruments do not include the following: A) each instrument must be uncorrelated with the error term. B) each one of the instrumental variables must be normally distributed. C) at least one of the instruments must enter the population regression of X on the Zs and the Ws. D) perfect multicollinearity between the predicted endogenous variables and the exogenous variables must be ruled out. Answer: B 4) The IV regression assumptions include all of the following with the exception of A) the error terms must be normally distributed. B) E(ui | W1i,…, Wri) = 0. C) Large outliers are unlikely: the X’s, W’s, Z’s, and Y’s all have nonzero, finite fourth moments. 
D) (X1i,…, Xki, W1i,…,Wri, Z1i, … Zmi, Yi) are i.i.d. draws from their joint distribution. Answer: A 5) The rule-of-thumb for checking for weak instruments is as follows: for the case of a single endogenous regressor, A) a first stage F must be statistically significant to indicate a strong instrument. B) a first stage F > 1.96 indicates that the instruments are weak. C) the t-statistic on each of the instruments must exceed at least 1.64. D) a first stage F < 10 indicates that the instruments are weak. Answer: D


6) The J-statistic A) tells you if the instruments are exogenous. B) provides you with a test of the hypothesis that the instruments are exogenous for the case of exact identification. C) is distributed $\chi^2_{m-k}$ where m-k is the degree of overidentification. D) is distributed $\chi^2_{m-k}$ where m-k is the number of instruments minus the number of regressors. Answer: C 7) In the case of the simple regression model Yi = β0 + β1 Xi + ui, i = 1,…, n, when X and u are correlated, then A) the OLS estimator is biased in small samples only. B) OLS and TSLS produce the same estimate. C) X is exogenous. D) the OLS estimator is inconsistent. Answer: D 8) The following will not cause correlation between X and u in the simple regression model: A) simultaneous causality. B) omitted variables. C) irrelevance of the regressor. D) errors in variables. Answer: C 9) The distinction between endogenous and exogenous variables is A) that exogenous variables are determined inside the model and endogenous variables are determined outside the model. B) dependent on the sample size: for n > 100, endogenous variables become exogenous. C) depends on the distribution of the variables: when they are normally distributed, they are exogenous, otherwise they are endogenous. D) whether or not the variables are correlated with the error term. Answer: D 10) The two conditions for a valid instrument are A) corr(Zi, Xi) = 0 and corr(Zi, ui) ≠ 0. B) corr(Zi, Xi) = 0 and corr(Zi, ui) = 0. C) corr(Zi, Xi) ≠ 0 and corr(Zi, ui) = 0. D) corr(Zi, Xi) ≠ 0 and corr(Zi, ui) ≠ 0. Answer: C 11) Instrument relevance A) means that the instrument is one of the determinants of the dependent variable. B) is the same as instrument exogeneity. C) means that some of the variance in the regressor is related to variation in the instrument. D) is not possible since X and u are correlated and Z and u are not correlated. Answer: C


12) Consider a competitive market where the demand and the supply depend on the current price of the good. Then fitting a line through the quantity-price outcomes will A) give you an estimate of the demand curve. B) estimate neither a demand curve nor a supply curve. C) enable you to calculate the price elasticity of supply. D) give you the exogenous part of the demand in the first stage of TSLS. Answer: B 13) When there is a single instrument and single regressor, the TSLS estimator for the slope can be calculated as follows: A) $\hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}}$. B) $\hat{\beta}_1^{TSLS} = \frac{s_{XY}}{s_X^2}$. C) $\hat{\beta}_1^{TSLS} = \frac{s_{ZX}}{s_{ZY}}$. D) $\hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_Z^2}$.

Answer: A 14) The TSLS estimator is A) consistent and has a normal distribution in large samples. B) unbiased. C) efficient in small samples. D) F-distributed. Answer: A 15) The reduced form equation for X A) regresses the endogenous variable X on the smallest possible subset of regressors. B) relates the endogenous variable X to all the available exogenous variables, both those included in the regression of interest and the instruments. C) uses the predicted values of X from the first stage as a regressor in the original equation. D) uses smaller standard errors, such as homoskedasticity-only standard errors, for inference. Answer: B 16) When calculating the TSLS standard errors A) you do not have to worry about heteroskedasticity, since it was eliminated in the first stage B) you can use the standard errors reported by OLS estimation of the second stage regression. C) the critical values from the standard normal table should be adjusted for the proper degrees of freedom. D) you should use heteroskedasticity-robust standard errors. Answer: D 17) Having more relevant instruments A) is a problem because instead of being just identified, the regression now becomes overidentified. B) is like having a larger sample size in that the more information is available for use in the IV regressions. C) typically results in larger standard errors for the TSLS estimator. D) is not as important for inference as having the same number of endogenous variables as instruments. Answer: B
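The single-instrument formula from question 13 and the inconsistency of OLS from question 7 can be illustrated with a small simulation. Everything about the data-generating process below is an assumption made for illustration (true slope 2.0, standard normal shocks, seed 0), and the names `sample_cov` and `simulate` are hypothetical.

```python
import random


def sample_cov(a, b):
    """Sample covariance s_ab; sample_cov(a, a) is the sample variance."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def simulate(n=20000, beta0=1.0, beta1=2.0, seed=0):
    """Hypothetical design: X is endogenous, Z is relevant and exogenous."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)             # instrument
        vi = rng.gauss(0.0, 1.0)             # shock shared by X and u
        ui = vi + 0.5 * rng.gauss(0.0, 1.0)  # error term, correlated with X
        xi = zi + vi                         # X depends on Z and on the shock
        z.append(zi)
        x.append(xi)
        y.append(beta0 + beta1 * xi + ui)
    return z, x, y


z, x, y = simulate()
beta1_ols = sample_cov(x, y) / sample_cov(x, x)   # s_XY / s_X^2
beta1_tsls = sample_cov(z, y) / sample_cov(z, x)  # s_ZY / s_ZX

print(f"OLS slope:  {beta1_ols:.3f}")
print(f"TSLS slope: {beta1_tsls:.3f}  (true slope in this simulation: 2.0)")
```

In this design the OLS slope converges to 2.5 rather than 2.0 because cov(X, u) = 1 and var(X) = 2, while the ratio of sample covariances with the instrument stays close to the true slope.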

18) Weak instruments are a problem because A) the TSLS estimator may not be normally distributed, even in large samples. B) they result in the instruments not being exogenous. C) the TSLS estimator cannot be computed. D) you cannot predict the endogenous variables any longer in the first stage. Answer: A 19) (Requires Appendix material) The relationship between the TSLS slope and the corresponding population parameter is:

A) $\hat{\beta}_1^{TSLS} - \beta_1 = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})u_i}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(X_i - \bar{X})}$.

B) $\hat{\beta}_1^{TSLS} - \beta_1 = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(X_i - \bar{X})}$.

C) $\hat{\beta}_1^{TSLS} - \beta_1 = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})u_i}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})^2}$.

D) $\hat{\beta}_1^{TSLS} - \beta_1 = \dfrac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})u_i}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(X_i - \bar{X})}$.

Answer: A 20) If the instruments are not exogenous, A) you cannot perform the first stage of TSLS. B) then, in order to conduct proper inference, it is essential that you use heteroskedasticity-robust standard errors. C) your model becomes overidentified. D) then TSLS is inconsistent. Answer: D 21) In the case of exact identification A) you can use the J-statistic in a test of overidentifying restrictions. B) you cannot use TSLS for estimation purposes. C) you must rely on your personal knowledge of the empirical problem at hand to assess whether the instruments are exogenous. D) OLS and TSLS yield the same estimate. Answer: C


22) To calculate the J-statistic you regress the A) squared values of the TSLS residuals on all exogenous variables and the instruments. The statistic is then the number of observations times the regression $R^2$. B) TSLS residuals on all exogenous variables and the instruments. You then multiply the homoskedasticity-only F-statistic from that regression by the number of instruments. C) OLS residuals from the reduced form on the instruments. The F-statistic from this regression is the J-statistic. D) TSLS residuals on all exogenous variables and the instruments. You then multiply the heteroskedasticity-robust F-statistic from that regression by the number of instruments. Answer: B 23) (Requires Chapter 8) When using panel data and in the presence of endogenous regressors A) the TSLS does not exist. B) you do not have to worry about the validity of instruments, since there are so many fixed effects. C) the OLS estimator is consistent. D) application of the TSLS estimator is straightforward if you use two time periods and difference the data. Answer: D 24) In practice, the most difficult aspect of IV estimation is A) finding instruments that are both relevant and exogenous. B) that you have to use two stages in the estimation process. C) calculating the J-statistic. D) finding instruments that are exogenous. Relevant instruments are easy to find. Answer: A 25) Consider a model with one endogenous regressor and two instruments. Then the J-statistic will be large A) if the number of observations is very large. B) if the coefficients are very different when estimating the coefficients using one instrument at a time. C) if the TSLS estimates are very different from the OLS estimates. D) when you use homoskedasticity-only standard errors. Answer: B 26) Let W be the included exogenous variables in a regression function that also has endogenous regressors (X). 
The W variables can A) be control variables B) have the property E(ui|Wi) = 0 C) make an instrument uncorrelated with u D) all of the above Answer: D 27) The logic of control variables in IV regressions A) parallels the logic of control variables in OLS B) only applies in the case of homoskedastic errors in the first stage of two stage least squares estimation C) is different in a substantial way from the logic of control variables in OLS since there are two stages in estimation D) implies that the TSLS is efficient Answer: A


28) For W to be an effective control variable in IV estimation, the following condition must hold A) E(ui) = 0 B) E(ui|Zi, Wi) = E(ui|Wi) C) E(uiuj) ≠ 0 D) there must be an intercept in the regression Answer: B 29) The IV estimator can be used to potentially eliminate bias resulting from A) multicollinearity. B) serial correlation. C) errors in variables. D) heteroskedasticity. Answer: C 30) Instrumental Variables regression uses instruments to A) establish the Mozart Effect. B) increase the regression $R^2$. C) eliminate serial correlation. D) isolate movements in X that are uncorrelated with u. Answer: D 31) Endogenous variables A) are correlated with the error term. B) always appear on the LHS of regression functions. C) cannot be regressors. D) are uncorrelated with the error term. Answer: A 32) Consider the following two equations to describe labor markets in various sectors of the economy

$$N^d = \beta_0 + \beta_1 \frac{W}{P} + u$$
$$N^s = \gamma_0 + \gamma_1 \frac{W}{P} + v$$
$$N^d = N^s = N$$

A) W/P is exogenous, N is endogenous B) Both N and W/P are endogenous C) N is exogenous, W/P is endogenous D) the parameters cannot be estimated because it would require two equations to be estimated at the same time (simultaneously) Answer: B


12.2 Essays and Longer Questions

1) Write a short essay about the Overidentifying Restrictions Test. What is meant exactly by “overidentification?” State the null hypothesis. Describe how to calculate the J-statistic and what its distribution is. Use an example of two instruments and one endogenous variable to explain under what situation the test will be likely to reject the null hypothesis. What does this example tell you about the exactly identified case? If your variables pass the test, is this sufficient for these variables to be good instruments?

Answer: The regression coefficients in the regression model with endogenous regressors can be either underidentified, exactly identified, or overidentified. If the number of instruments (m) equals the number of endogenous regressors (k), then the coefficients are exactly identified. If there are more instruments than endogenous regressors, then the regression coefficients are overidentified. For the instrumental variable estimator to exist, there must be at least as many instruments as endogenous regressors (m ≥ k). In the case of overidentification, the exogeneity of the instruments can be tested. Under the null hypothesis, all instruments are exogenous. Under the alternative hypothesis, at least one of the instruments is endogenous. Technically, the overidentifying restrictions test uses the TSLS residuals to see if these are correlated with the instruments. The residuals are regressed on the instruments and the included exogenous regressors. Under the null hypothesis, the coefficients on all the instruments are zero. Since this is a case of joint hypothesis testing, the homoskedasticity-only F-statistic for this hypothesis is computed, and from it the J-statistic, where J = mF. In large samples the distribution of this statistic is $\chi^2_{m-k}$. Calculating the J-statistic amounts to comparing different IV estimates. In the case of two instruments and one endogenous regressor, where the degree of overidentification is one, two such estimates exist.
Due to sample variation, these estimates will differ, although they should be similar, or “close” to each other. If one or both of the instruments is not exogenous, then the estimates will not be similar, or the difference between the two will be sufficiently large so as not to be the result of pure sampling variation. In this situation the null hypothesis will be rejected. This procedure can only be executed when the coefficients are overidentified, since there is no comparison possible for the case of exactly identified coefficients. Passing the test is not sufficient for the instruments to be valid since, in addition to being exogenous, they must also be relevant, i.e., they must be correlated with the endogenous regressor.
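
The calculation described in this answer can be sketched in a few lines of code. The example below is a minimal illustration and not the textbook's procedure verbatim: it assumes one endogenous regressor, two instruments, no included exogenous regressors other than the constant, and simulated data whose coefficients are made up for the illustration. The function names `ols_ssr`, `j_statistic`, and `simulate` are hypothetical.

```python
import numpy as np

def ols_ssr(y, R):
    """Sum of squared residuals from regressing y on the columns of R."""
    coef, *_ = np.linalg.lstsq(R, y, rcond=None)
    resid = y - R @ coef
    return float(resid @ resid)

def j_statistic(y, x, Z):
    """J = m * F for one endogenous regressor x and an (n x m) instrument matrix Z."""
    n, m = Z.shape
    const = np.ones((n, 1))
    Zc = np.hstack([const, Z])
    # First stage: predicted values of x from the constant and the instruments.
    pi, *_ = np.linalg.lstsq(Zc, x, rcond=None)
    x_hat = Zc @ pi
    # Second stage: regress y on the constant and the predicted x.
    b, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x_hat]), y, rcond=None)
    # TSLS residuals are formed with the actual x, not the predicted x.
    u_hat = y - b[0] - b[1] * x
    # Homoskedasticity-only F-statistic for "all instrument coefficients are zero".
    ssr_unrestricted = ols_ssr(u_hat, Zc)
    ssr_restricted = ols_ssr(u_hat, const)
    F = ((ssr_restricted - ssr_unrestricted) / m) / (ssr_unrestricted / (n - m - 1))
    return m * F

def simulate(n, invalid, seed):
    """Simulated data (assumed design); if invalid, the second instrument enters u."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, 2))
    e = rng.normal(size=n)
    u = e + (0.5 * Z[:, 1] if invalid else 0.0)
    x = Z[:, 0] + Z[:, 1] + e + rng.normal(size=n)
    y = 1.0 + 2.0 * x + u
    return y, x, Z

j_valid = j_statistic(*simulate(5000, invalid=False, seed=0))
j_invalid = j_statistic(*simulate(5000, invalid=True, seed=0))
print(f"J with exogenous instruments:  {j_valid:.2f}")
print(f"J with one endogenous instrument: {j_invalid:.2f}")
```

With m = 2 and k = 1 the statistic is compared with the $\chi^2_1$ critical value (3.84 at the 5% level). In this simulated design the statistic for the invalid instrument set lies far above that value, while the statistic for the valid set behaves like a draw from $\chi^2_1$.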


2) Using some of the examples from your textbook, describe econometric studies which required instrumental variable techniques. In each case emphasize why the need for instrumental variables arises and how authors have approached the problem. Make sure to include a discussion of overidentification, the validity of instruments, and testing procedures in your essay.

Answer: The textbook mentions several studies which used instrumental variable estimation techniques, starting with Wright’s problem of estimating demand and supply elasticities for animal and vegetable oils and fats. This is a case of simultaneous causality bias since the price and quantity in the market are determined by both the supply and demand for the commodity. Wright used the weather, which shifted the supply curve only and thereby traced out the demand curve. Since there was only a single instrument, the coefficients are exactly identified, and the validity of the instrument cannot be tested. Another example mentioned is the effect of class size on test scores. The reason for a correlation between class size and the error term potentially stems from omitted variable bias here, such as the quality of the teaching staff and outside opportunities for some of the students. In the hypothetical example of an earthquake, some schools may receive more students than usual depending on their closeness to the epicenter, if the school was unaffected structurally. The increase in class size is related to the closeness to the epicenter, but this distance should be uncorrelated with the ability of the teaching staff and the outside opportunities. As in the previous study, there is only a single instrument and hence no possibility to use the overidentification test. The primary example of instrumental variable estimation in the chapter involves estimation of the demand elasticity for cigarettes.
Due to simultaneity bias for the demand equation, sales taxes are used as an instrument first in a cross section of states in a single year and later in a panel. Prices and quantities are determined simultaneously by supply and demand, and as a result, prices will be correlated with the error term in the demand equation. Sales taxes are fairly highly correlated with prices, explaining almost half of the variation in these. It is argued that sales taxes are exogenous because they differ across states as a result of political considerations in choices about public finance. Only one instrument is used in the cross section and hence there is no degree of overidentification. Later another instrument is introduced, cigarette-specific taxes. With two instruments and one endogenous regressor, the J-statistic can be computed for the overidentifying restrictions test. Further examples discussed in the textbook include the effect of an increase in the prison population on crime rates, further discussion of class size and test scores, and aggressive treatment of heart attacks and the potential for saving lives.

3) Describe the consequences of estimating an equation by OLS in the presence of an endogenous regressor. How can you overcome these obstacles? Present an alternative estimator and state its properties.

Answer: In the case of an endogenous regressor, there is correlation between the variable and the error term. In this case, the OLS estimator is inconsistent. To get a consistent estimator in this situation, instrumental variable techniques, such as TSLS, should be used. If one or more valid instruments can be found, meaning that the instrument must be relevant and exogenous, then a consistent estimator can be derived. The relevance of instruments can be checked using the rule of thumb (a first-stage F-statistic of more than 10). The exogeneity of the instruments can be tested using the J-statistic.
The test requires that there is at least one more instrument than endogenous regressors, i.e., that the equation is overidentified. In large samples the sampling distribution of the TSLS estimator is approximately normal, so that statistical inference can proceed as usual using the t-statistic, confidence intervals, or joint hypothesis tests involving the F-statistic. However, inference based on these statistics will be misleading in the case where instruments are not valid.
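
The contrast between OLS and TSLS drawn in this answer can be illustrated with a small simulation. The sketch below is an illustration under assumed values, not a result from the textbook: the true slope of 2, the sample size, and the data-generating process are all made up, and `cov` is a hypothetical helper. With a single instrument, the TSLS slope reduces to the ratio of sample covariances, and the first-stage F-statistic is the squared t-statistic on the instrument.

```python
import math
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

rng = random.Random(0)
n = 20000
true_slope = 2.0  # assumed value for the illustration

z = [rng.gauss(0, 1) for _ in range(n)]                    # instrument
u = [rng.gauss(0, 1) for _ in range(n)]                    # regression error
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]    # regressor depends on u
y = [1.0 + true_slope * xi + ui for xi, ui in zip(x, u)]

beta_ols = cov(x, y) / cov(x, x)   # inconsistent because corr(x, u) != 0
beta_iv = cov(z, y) / cov(z, x)    # TSLS with one instrument

# First-stage regression of x on z, with homoskedasticity-only standard error.
pi1 = cov(z, x) / cov(z, z)
mean_z, mean_x = sum(z) / n, sum(x) / n
resid = [xi - mean_x - pi1 * (zi - mean_z) for xi, zi in zip(x, z)]
s2 = sum(r * r for r in resid) / (n - 2)
se_pi1 = math.sqrt(s2 / sum((zi - mean_z) ** 2 for zi in z))
first_stage_F = (pi1 / se_pi1) ** 2

print(f"OLS slope:  {beta_ols:.3f}")
print(f"TSLS slope: {beta_iv:.3f}")
print(f"First-stage F: {first_stage_F:.0f}")
```

In this design the OLS slope settles near 2.33 rather than 2 no matter how large the sample, whereas the TSLS slope is close to the assumed true value and the first-stage F-statistic is well above 10.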


4) Write an essay about where valid instruments come from. Part of your explorations must deal with checking the validity of instruments and what the consequences of weak instruments are.

Answer: In order for instruments to be valid, they have to be relevant and exogenous. To find valid instruments, two approaches are typically used. First, economic theory can serve as a guide. In the case of simultaneous causality in a market, for example, theory predicts shifts in one curve but not the other as a result of changes in an instrumental variable. The second approach focuses on shifts in the endogenous regressor that are caused by an “exogenous source of variation” in the variable resulting from a random phenomenon. The textbook uses the example of an earthquake which changes student-teacher ratios as students in affected areas have to be redistributed. To check the validity of instruments, there is the rule of thumb to determine whether or not an instrument is weak. It states that the F-statistic in the first stage of the TSLS procedure should exceed 10. Instrument exogeneity can be tested only in the case of overidentification. If there are more instruments than endogenous regressors, then the J-statistic can be calculated. The null hypothesis of exogeneity will be rejected, in essence, if the TSLS residuals are correlated with the instruments. If instruments are weak, then the TSLS estimator is biased and statistical inference does not yield reliable confidence intervals even in large samples.

5) You have estimated a government reaction function, i.e., a multiple regression equation, where a government instrument, say the federal funds rate, depends on past government target variables, such as inflation and unemployment rates. In addition, you added the previous period’s popularity deficit of the government, e.g., the (approval rating of the president – 50%), as one of the regressors.
Your idea is that the Federal Reserve, although formally independent, will try to expand the economy if the president is unpopular. One of your peers, a political science student, points out that approval ratings depend on the state of the economy and thereby indirectly on government instruments. It is therefore endogenous and should be estimated along with the reaction function. Initially you want to reply by using a phrase that includes the words “money neutrality” but are worried about a lengthy debate. Instead you state that as an economist, you are not concerned about government approval ratings, and that government approval ratings are determined outside your (the economic) model. Does your whim make the regressor exogenous? Why or why not? Answer: In general, the question of whether or not a variable is endogenous or exogenous depends on its correlation with the error term, not on the size of the underlying model. The point to make is that just because a variable is endogenous does not imply that its determinants have to be modeled. If the purpose of the exercise is to eventually simulate the model for policy purposes, then the feedback envisioned by the political science student is potentially important. However, if the aim is simply to forecast the behavior of the government reaction function, then the issue of endogeneity or exogeneity is only relevant for questions regarding the type of estimator to be used. Of course, if a regressor is endogenous, then instrumental variable techniques must be used to ensure desirable properties of the estimator. 6) You have been hired as a consultant to estimate the demand for various brands of coffee in the market. You are provided with annual price data for two years by U.S. state and the quantities sold. You want to estimate a demand function for coffee using this data. What problems do you think you will encounter if you estimated the demand equation by OLS? Answer: Answers will differ by student. 
However, the following points should be mentioned: (i) there will be simultaneous equation bias because quantity and price are determined simultaneously in the market. (ii) If this is the case, then the OLS estimator will not be consistent. (iii) In that case, IV estimation should be used to get a consistent estimator of the demand elasticity or response to a price increase. (iv) This brings up the question of a valid instrument. It is not clear that students will come up with an easy answer, but their deliberations should be insightful. One possible instrument is the price (change) from a previous year, which most likely will be highly correlated with this year’s price (change) but not with the error term in the equation. (v) There should be some discussion on the other factors determining coffee demand, although some of these can be ignored if there is data for two periods and the data is differenced (fixed effects).

7) Studies of the effect of minimum wages on teenage employment typically regress the teenage employment to population ratio on the real minimum wage or the minimum wage relative to average hourly earnings using OLS. Assume that you have a cross section of U.S. states for two years. Do you think that there are problems with simultaneous equation bias?

Answer: For OLS not to be consistent, there would have to be omitted variable bias or simultaneous equation bias. The former can be dealt with by differencing the data, if you assume that most other factors are being held constant. If the minimum wage does not change between the two periods, i.e., it is constant, then this will bring further problems with the interpretation, since the variation in the RHS variable only comes from the denominator. In many ways, the question should come down to the correlation between minimum wages and the error term in the equation. Students may argue that minimum wages are set by the legislature or, more recently, by ballot, and are therefore exogenous. A more nuanced discussion may point out that neither the legislature nor the electorate will raise minimum wages in time periods of low employment (a recession; the 2008 and 2009 raises contradict this statement to some extent, but these were decided in 2006/2007 when the economy was booming). There may be further problems because of the denominator of the minimum wage variable, either the CPI or AHE, both of which are potentially correlated with teenage employment. The point here is for the student to think about the problem at hand and to point out various obstacles to getting a good estimate of the elasticity/response of employment from a minimum wage increase.


12.3 Mathematical and Graphical Problems

1) To analyze the year-to-year variation in temperature data for a given city, you regress the daily high temperature (Temp) for 100 randomly selected days in two consecutive years (1997 and 1998) for Phoenix. The results are (heteroskedasticity-robust standard errors in parentheses):

$$Temp^{PHX}_{1998} = 15.63 + \underset{(0.10)}{0.80} \times Temp^{PHX}_{1997}; \quad R^2 = 0.65,\ SER = 9.63$$

(a) Calculate the predicted temperature for the current year if the temperature in the previous year was 40°F, 78°F, and 100°F. How does this compare with your prior expectation? Sketch the regression line and compare it to the 45 degree line. What are the implications?

(b) You recall having studied errors-in-variables before. Although the web site you received your data from seems quite reliable in measuring data accurately, what if the temperature contained measurement error in the following sense: for any given day, say January 28, there is a true underlying seasonal temperature ($X$), but each year there are different temporary weather patterns ($v$, $w$) which result in an observed temperature ($X_{1997}$, $X_{1998}$) different from $X$. For the two years in your data set, the situation can be described as follows:

$$X_{1997} = X + v_{1997} \quad \text{and} \quad X_{1998} = X + w_{1998}$$

Subtracting $X_{1997}$ from $X_{1998}$, you get $X_{1998} = X_{1997} + w_{1998} - v_{1997}$. Hence the population parameters for the intercept and slope are zero and one, as expected. It is not difficult to show that the OLS estimator for the slope is inconsistent, where

$$\hat{\beta}_1 \xrightarrow{p} 1 - \frac{\sigma_v^2}{\sigma_X^2 + \sigma_v^2}$$

As a result you consider estimating the slope and intercept by TSLS. You think about an instrument and consider the temperature one month ahead of the observation in the previous year. Discuss instrument validity for this case.

(c) The TSLS estimation result is as follows:

$$Temp^{PHX}_{1998} = -6.24 + \underset{(0.06)}{1.07} \times Temp^{PHX}_{1997}$$

Perform a t-test on whether or not the slope is now significantly different from one.

Answer: (a) The three predicted temperatures will be 47.6, 78.0, and 95.6 respectively. The initial expectation should be that the temperature in 1998 is the same as in 1997 for a given date. The regression line and the 45 degree line are sketched in the accompanying figure. The implication is mean reversion: if the temperature was low (40 degrees), then it will also be low the following year, but not as low. Alternatively, if the temperature was high (100 degrees), then it will be high again, but not as high. If this prediction were extrapolated into the future, then eventually all temperatures should be the same for all days. This obviously does not make sense.


(b) For an instrument to be valid, two conditions have to hold. First, the instrument has to be relevant, and second, the instrument has to be exogenous. If the temperature one month ahead can predict the current temperature, as it certainly can in Phoenix, then the instrument is relevant or correlated with the current month’s temperature. If in addition, whatever caused the temperature in the current month to deviate from its long-term value is only a temporary phenomenon, such as a weather system created by a storm in the Pacific, then next month’s temperature should not be correlated with this event. Hence the instrument would be exogenous.

(c) The t-statistic is 1.17, and hence you cannot reject the null hypothesis that the slope equals one.

2) Consider the following population regression model relating the dependent variable $Y_i$ and regressor $X_i$,

$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \ldots, n.$$
$$X_i \equiv Y_i + Z_i$$

where Z is a valid instrument for X.
(a) Explain why you should not use OLS to estimate $\beta_1$.
(b) To generate a consistent estimator for $\beta_1$, what should you do?
(c) The two equations above make up a system of equations in two unknowns. Specify the two reduced form equations in terms of the original coefficients. (Hint: substitute the identity into the first equation and solve for Y. Similarly, substitute Y into the identity and solve for X.)
(d) Do the two reduced form equations satisfy the OLS assumptions? If so, can you find consistent estimators of the two slopes? What is the ratio of the two estimated slopes? This estimator is called “Indirect Least Squares.” How does it compare to the TSLS in this example?

Answer: (a) Substitution of the first equation into the identity shows that X is correlated with the error term. Hence estimation with OLS results in an inconsistent estimator.

(b) The instrumental variable estimator is consistent and in this case is $\hat{\beta}_1^{2SLS} = \dfrac{S_{ZY}}{S_{ZX}}$. Adventurous students will derive this estimator along the lines shown in Appendix 10.2.

(c)
$$Y_i = \beta_0 + \beta_1 (Y_i + Z_i) + u_i$$
$$X_i = (\beta_0 + \beta_1 X_i + u_i) + Z_i$$
or
$$(1 - \beta_1) Y_i = \beta_0 + \beta_1 Z_i + u_i$$
$$(1 - \beta_1) X_i = \beta_0 + Z_i + u_i$$
Hence
$$Y_i = \pi_0 + \pi_2 Z_i + v_{1i}$$
$$X_i = \pi_3 + \pi_4 Z_i + v_{2i}$$
where $\pi_0 = \pi_3 = \dfrac{\beta_0}{1 - \beta_1}$, $\pi_2 = \dfrac{\beta_1}{1 - \beta_1}$, $\pi_4 = \dfrac{1}{1 - \beta_1}$, and $v_{1i} = v_{2i} = \dfrac{u_i}{1 - \beta_1}$.

(d) Since Z is a valid instrument by assumption, it must be uncorrelated with the error term and hence using OLS results in a consistent estimator.

$$\frac{\hat{\pi}_2}{\hat{\pi}_4} = \frac{S_{YZ} / S_{ZZ}}{S_{XZ} / S_{ZZ}} = \frac{S_{YZ}}{S_{XZ}},$$

which is identical to the TSLS estimator.

3) Here are some examples of the instrumental variables regression model. In each case you are given the number of instruments and the J-statistic. Find the relevant value from the $\chi^2_{m-k}$ distribution, using a 1% and 5% significance level, and make a decision whether or not to reject the null hypothesis.
(a) $Y_i = \beta_0 + \beta_1 X_{1i} + u_i$, $i = 1, \ldots, n$; $Z_{1i}, Z_{2i}$ are valid instruments, J = 2.58.
(b) $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 W_{1i} + u_i$, $i = 1, \ldots, n$; $Z_{1i}, Z_{2i}, Z_{3i}, Z_{4i}$ are valid instruments, J = 9.63.
(c) $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 W_{1i} + \beta_3 W_{2i} + \beta_4 W_{3i} + u_i$, $i = 1, \ldots, n$; $Z_{1i}, Z_{2i}, Z_{3i}, Z_{4i}$ are valid instruments, J = 11.86.

Answer: (a) The test statistic is distributed $\chi^2_1$ and the critical values are 6.63 and 3.84 at the 1% and 5% significance level. Hence you cannot reject the null hypothesis that all the instruments are exogenous.
(b) The test statistic is distributed $\chi^2_2$ and the critical values are 9.21 and 5.99 at the 1% and 5% significance level. Hence you can reject the null hypothesis that all the instruments are exogenous.
(c) The test statistic is distributed $\chi^2_3$ and the critical values are 11.34 and 7.81 at the 1% and 5% significance level. Hence you can reject the null hypothesis that all the instruments are exogenous.
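
The three decisions in problem 3 follow one mechanical rule: the degrees of freedom equal m − k, and the null hypothesis is rejected when J exceeds the critical value. A minimal sketch of that rule follows; the critical values are the ones quoted in the answers above, while the names `CRITICAL`, `j_test`, and `cases` are hypothetical and chosen for this illustration.

```python
# Chi-squared critical values quoted in the answers to problem 3,
# keyed by degrees of freedom: (1% level, 5% level).
CRITICAL = {1: (6.63, 3.84), 2: (9.21, 5.99), 3: (11.34, 7.81)}

def j_test(j_stat, m, k):
    """Overidentifying restrictions test with m instruments and k endogenous regressors."""
    df = m - k
    if df <= 0:
        raise ValueError("the test requires overidentification (m > k)")
    crit_1pct, crit_5pct = CRITICAL[df]
    return {
        "df": df,
        "reject_1pct": j_stat > crit_1pct,
        "reject_5pct": j_stat > crit_5pct,
    }

# (J-statistic, number of instruments m, number of endogenous regressors k)
cases = {"a": (2.58, 2, 1), "b": (9.63, 4, 2), "c": (11.86, 4, 1)}
results = {name: j_test(j, m, k) for name, (j, m, k) in cases.items()}

for name, outcome in results.items():
    print(name, outcome)
```

Note that the included exogenous regressors (the W variables) do not enter the degrees of freedom; only the number of instruments and the number of endogenous regressors matter.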


4) To study the determinants of growth between the countries of the world, researchers have used panels of countries and observations spanning over long periods of time (e.g. 1965-1975, 1975-1985, 1985-1990). Some of these studies have focused on the effect that inflation has on growth and found that although the effect is small for a given time period, it accumulates over time and therefore has an important negative effect.
(a) Explain why the OLS estimator may be biased in this case.
(b) Explain how methods using panel data could potentially alleviate the problem.
(c) Some authors have suggested using an index of central bank independence as an instrument. Discuss whether or not such an index would be a valid instrument.

Answer: (a) The presence of simultaneous causality is highly likely since inflation may respond to growth. Depending on the list of regressors, omitted variables can also bias the estimator for the effect of the inflation rate.
(b) Country fixed effects or differencing the data can solve the problem if inflation stays relatively constant over time from one country to the other. Unfortunately if the effect of inflation on growth is the focus of the study, then much of the cross-sectional information is lost using this approach.
(c) For this index to be valid, central bank independence has to be relevant and exogenous. If inflation rates are correlated with the index, then central bank independence is a relevant instrument. Although there is a high correlation for developed countries, there is little to no correlation when data for all countries is considered. Whether or not the index is exogenous cannot be tested unless the coefficients of the equation are overidentified. Otherwise personal judgment is the only guide. An argument that central bank independence is exogenous would have to rely on it being based on institutional arrangements which are independent of inflation.
Although the independence of central banks in many countries was initially determined by concerns independent of inflation, there have been many situations where the institutional arrangements were altered as a result of high inflation.

5) (Requires Matrix Algebra) The population multiple regression model can be written in matrix form as

$$Y = X\beta + U$$

where

$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad
U = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & X_{11} & \cdots & X_{k1} & W_{11} & \cdots & W_{r1} \\
1 & X_{12} & \cdots & X_{k2} & W_{12} & \cdots & W_{r2} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{1n} & \cdots & X_{kn} & W_{1n} & \cdots & W_{rn}
\end{pmatrix}, \quad \text{and} \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{k+r} \end{pmatrix}.$$

Note that the X matrix contains both k endogenous regressors and (r + 1) included exogenous regressors (the constant is obviously exogenous). The instrumental variable estimator for the overidentified case is

$$\hat{\beta}^{IV} = [X'Z(Z'Z)^{-1}Z'X]^{-1} X'Z(Z'Z)^{-1}Z'Y,$$

where Z is a matrix which contains two types of variables: first the r included exogenous regressors plus the constant, and second, m instrumental variables.

$$Z = \begin{pmatrix}
1 & Z_{11} & \cdots & Z_{m1} & W_{11} & \cdots & W_{r1} \\
1 & Z_{12} & \cdots & Z_{m2} & W_{12} & \cdots & W_{r2} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
1 & Z_{1n} & \cdots & Z_{mn} & W_{1n} & \cdots & W_{rn}
\end{pmatrix}$$

It is of order n × (m + r + 1). For this estimator to exist, both $(Z'Z)$ and $[X'Z(Z'Z)^{-1}Z'X]$ must be invertible. State the conditions under which this will be the case and relate them to the degree of overidentification.

Answer: In order for a matrix to be invertible, it must have full rank. Since Z′Z is of order (m + r + 1) × (m + r + 1), then in order to invert Z′Z, it must have rank (m + r + 1). In the case of a product such as Z′Z, the rank is at most the rank of Z′ or Z, whichever is smaller. Z is of order n × (m + r + 1), and assuming that there is no perfect multicollinearity, will have either rank n or rank (m + r + 1), whichever is the smaller of the two. Hence if there are fewer observations than the number of instrumental variables plus exogenous variables, then the rank of Z will be n (< m + r + 1), and the rank of Z′Z is also n (< m + r + 1). Hence Z′Z does not have full rank, and therefore cannot be inverted. The IV estimator does not exist as a result. In the past, this was considered a strong possibility with large econometric models, where many predetermined variables entered. If there are more observations than instruments, then the rank of Z′Z is (m + r + 1). X′Z will be of order (k + r + 1) × (m + r + 1), which will have rank (k + r + 1) if m ≥ k, i.e., if the coefficients are exactly identified or overidentified. In that case $[X'Z(Z'Z)^{-1}Z'X]$, which is of order (k + r + 1) × (k + r + 1), will have full rank. If instead m < k, it cannot have full rank, since the rank of a product of matrices is at most the smallest rank among the matrices involved, here X′Z, Z′Z, and Z′X.

6) Consider the following model of demand and supply of coffee:

Demand: $Q_i^{Coffee} = \beta_1 P_i^{Coffee} + \beta_2 P_i^{Tea} + u_i$
Supply: $Q_i^{Coffee} = \beta_3 P_i^{Coffee} + \beta_4 P_i^{Tea} + \beta_5 Weather_i + v_i$

(variables are measured in deviations from means, so that the constant is omitted). What are the expected signs of the various coefficients in this model? Assume that the price of tea and Weather are exogenous variables. Are the coefficients in the supply equation identified? Are the coefficients in the demand equation identified? Are they overidentified? Is this result surprising given that there are more exogenous regressors in the second equation?
Answer: Changes in Weather will shift the supply equation and thereby trace out the demand equation. Hence the coefficients of the demand equation are exactly identified since the number of instruments equals the number of endogenous regressors. However the coefficients of the supply equation are underidentified since there is no instrumental variable available for estimation. The result is not surprising, since it is not the number of exogenous regressors in the equation that matters when determining whether or not the coefficients are identified. Instead what matters is the number of instruments available relative to the number of endogenous regressors. It is possible that the regression coefficients can be (over)identified even if there are no exogenous regressors present in the equation.
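
The identification argument in this answer can be tied to the matrix estimator of problem 5. The sketch below is an illustration only: it simulates the coffee market with made-up coefficient values, applies the matrix formula to the demand equation with Weather as the instrument for the coffee price, and shows that the same formula cannot be applied to the supply equation because Z has fewer columns than X. The function name `iv_estimator` and all numerical values are assumptions for the example.

```python
import numpy as np

def iv_estimator(Y, X, Z):
    """IV estimator [X'Z (Z'Z)^-1 Z'X]^-1 X'Z (Z'Z)^-1 Z'Y with basic existence checks."""
    n_instruments, n_regressors = Z.shape[1], X.shape[1]
    if n_instruments < n_regressors:
        raise ValueError("underidentified: Z has fewer columns than X")
    if np.linalg.matrix_rank(Z) < n_instruments:
        raise ValueError("Z'Z is singular: Z does not have full column rank")
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    XtZ = X.T @ Z
    A = XtZ @ ZtZ_inv @ XtZ.T        # X'Z (Z'Z)^-1 Z'X
    b = XtZ @ ZtZ_inv @ (Z.T @ Y)    # X'Z (Z'Z)^-1 Z'Y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
n = 20000
tea, weather, u, v = rng.normal(size=(4, n))

# Hypothetical coefficients: demand slopes b1, b2; supply slopes b3, b4, b5.
b1, b2, b3, b4, b5 = -1.0, 0.5, 1.0, -0.3, 1.0

# Market equilibrium: equate demand and supply and solve for the coffee price.
price = ((b4 - b2) * tea + b5 * weather + v - u) / (b1 - b3)
quantity = b1 * price + b2 * tea + u

# Demand equation: price is endogenous, tea is exogenous, Weather is the instrument.
X_demand = np.column_stack([price, tea])
Z_demand = np.column_stack([weather, tea])
demand_hat = iv_estimator(quantity, X_demand, Z_demand)

# Supply equation: no excluded exogenous variable is available as an instrument.
X_supply = np.column_stack([price, tea, weather])
Z_supply = np.column_stack([tea, weather])
try:
    iv_estimator(quantity, X_supply, Z_supply)
    supply_identified = True
except ValueError:
    supply_identified = False

print("demand estimates:", np.round(demand_hat, 3))
print("supply identified:", supply_identified)
```

The supply equation contains more exogenous regressors than the demand equation, yet it is the one that cannot be estimated, because what matters is the number of instruments excluded from the equation relative to the number of endogenous regressors.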


7) You started your econometrics course by studying the OLS estimator extensively, first for the simple regression case and then for extensions of it. You have now learned about the instrumental variable estimator. Under what situation would you prefer one to the other? Be specific in explaining under which situations one estimation method generates superior results.

Answer: Under the OLS assumptions, the OLS estimator is unbiased and consistent. The sampling distribution of the estimator is approximately normal in large samples. Hence statistical inference can proceed as usual using the t-statistic, confidence intervals, or joint hypothesis tests involving the F-statistic. One major concern throughout the text has been the development of new estimation techniques in the case where one of the OLS assumptions is violated, specifically that there is correlation between the error term and at least one of the regressors. This may be the result of omitted variables, errors-in-variables, or simultaneous causality bias. These make up three of the threats to internal validity. In each of these cases, OLS becomes biased and an alternative estimator should be used. Even if the OLS assumptions are violated and the OLS estimator is biased because of omitted variable bias, simultaneous causality, or errors-in-variables, using TSLS will not improve the situation if the instruments are not valid. In that case, TSLS will yield inconsistent estimators if the instruments are not exogenous. It will be biased and statistical inference will not be valid if the instruments are weak. Furthermore, the estimator will not even be normally distributed in large samples. If the instruments are valid and the other IV regression assumptions hold, then the TSLS estimator is consistent and therefore preferable over the OLS estimator. Although its distribution is complicated in small samples, the sampling distribution of the estimator is approximately normal in large samples.
Hence statistical inference can proceed as usual using the t-statistic, confidence intervals, or joint hypothesis tests involving the F-statistic.
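
Errors-in-variables is one of the cases named in this answer in which OLS is inconsistent while IV with a valid instrument is not. The simulation sketch below borrows the setup of problem 1 in this section, where two noisy readings share one underlying value. The variances, the sample size, and the instrument (a third independent noisy reading, which is not the instrument proposed in problem 1) are assumptions made for the illustration, and `cov` is a hypothetical helper.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

rng = random.Random(1)
n = 20000
sigma_x, sigma_v = 2.0, 1.0  # assumed standard deviations of signal and noise

true_x = [rng.gauss(0, sigma_x) for _ in range(n)]
x_1997 = [t + rng.gauss(0, sigma_v) for t in true_x]   # X + v
x_1998 = [t + rng.gauss(0, sigma_v) for t in true_x]   # X + w
z = [t + rng.gauss(0, sigma_v) for t in true_x]        # hypothetical third reading

# OLS of the 1998 reading on the 1997 reading: attenuated toward zero.
slope_ols = cov(x_1997, x_1998) / cov(x_1997, x_1997)
# IV using z as the instrument for the 1997 reading.
slope_iv = cov(z, x_1998) / cov(z, x_1997)
# Large-sample OLS limit: 1 - sigma_v^2 / (sigma_x^2 + sigma_v^2).
plim_ols = 1 - sigma_v**2 / (sigma_x**2 + sigma_v**2)

print(f"OLS slope: {slope_ols:.3f} (large-sample limit {plim_ols:.2f})")
print(f"IV slope:  {slope_iv:.3f} (population value 1)")
```

The instrument works here because its own noise is independent of the noise in both readings, so it is correlated with the regressor only through the underlying value.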


8) Your textbook gave an example of attempting to estimate the demand for a good in a market, but being unable to do so because the demand function was not identified. Is this the case for every market? Consider, for example, the demand for sports events. One of your peers estimated the following demand function after collecting data over two years for every one of the 162 home games of the 2000 and 2001 seasons for the Los Angeles Dodgers.

$$\begin{aligned}
Attend = {} & \underset{(8{,}770)}{15{,}005} + \underset{(121)}{201} \times Temperat + \underset{(169)}{465} \times DodgNetWin + \underset{(26)}{82} \times OppNetWin \\
& + \underset{(1505)}{9647} \times DFSaSu + \underset{(3355)}{1328} \times Drain + \underset{(1819)}{1609} \times D150m + \underset{(1{,}184)}{271} \times DDiv - \underset{(1{,}143)}{978} \times D2001; \\
& R^2 = 0.416,\ SER = 6983
\end{aligned}$$

where Attend is announced stadium attendance, Temperat is the average temperature on game day, DodgNetWin are the net wins of the Dodgers before the game (wins-losses), OppNetWin is the opposing team’s net wins at the end of the previous season, and DFSaSu, Drain, D150m, DDiv, and D2001 are binary variables, taking a value of 1 if the game was played on a weekend, it rained during that day, the opposing team was within a 150 mile radius, the opposing team plays in the same division as the Dodgers, and the game was played during 2001, respectively. Numbers in parentheses are heteroskedasticity-robust standard errors. Even if there is no identification problem, is it likely that all regressors are uncorrelated with the error term? If not, what are the consequences?

Answer: In the case of sports events, often price and quantity are not simultaneously determined by supply and demand. For baseball games, the supply of seats is fixed at the capacity level of the stadium. In addition, prices for games are also fixed in advance and do not vary with the attractiveness of the opponent. Therefore the supply curve is infinitely elastic up to the point where the game is sold out. This situation is complicated by ticket scalping and the fact that teams stage special events (fireworks, etc.).
Taking these considerations into account may result in simultaneous causality bias, or a threat to internal validity because of the identification problem. However, assuming that there is no identification problem, there may still be omitted variable bias or errors-in-variables bias. For example, attendance typically increases the tighter the race for a play-off spot towards the end of the season. Furthermore, it is not the opposing team’s net wins at the end of the previous season that accounts for the attractiveness of the opponent, but the performance during the current season. If the opposing team’s current performance is related to its performance in the previous season, then the OLS estimator is biased.


9) Earnings functions, whereby the log of earnings is regressed on years of education, years of on the job training, and individual characteristics, have been studied for a variety of reasons. Some studies have focused on the returns to education, others on discrimination, union/non-union differentials, etc. For all these studies, a major concern has been the fact that ability should enter as a determinant of earnings, but that it is close to impossible to measure and therefore represents an omitted variable. Assume that the coefficient on years of education is the parameter of interest. Given that education is positively correlated with ability, since, for example, more able students attract scholarships and hence receive more years of education, the OLS estimator for the returns to education could be upward biased. To overcome this problem, various authors have used instrumental variable estimation techniques. For each of the potential instruments listed below, briefly discuss instrument validity.
(a) The individual’s postal zip code.
(b) The individual’s IQ or test score on a work related exam.
(c) Years of education for the individual’s mother or father.
(d) Number of siblings the individual has.

Answer: (a) Instrument validity has two components, instrument relevance (corr(Zi, Xi) ≠ 0) and instrument exogeneity (corr(Zi, ui) = 0). The individual’s postal zip code will most likely be uncorrelated with the omitted variable, ability, even though some zip codes may attract more able individuals. However, this is an example of a weak instrument, since it is also uncorrelated with years of education.
(b) There is instrument relevance in this case, since, on average, individuals who do well in intelligence tests or other work related tests will have more years of education. Unfortunately there is bound to be a high correlation with the omitted variable ability, since this is what these tests are supposed to measure.
(c) A non-zero correlation between the mother’s or father’s years of education and the individual’s years of education can be expected. Hence this is a relevant instrument. However, it is not clear that the parent’s years of education are uncorrelated with the parent’s ability, which, in turn, can be a major determinant of the individual’s ability. If this is the case, then years of education of the mother or father is not a valid instrument. (d) There is some evidence that the larger the number of siblings of an individual, the fewer years of education the individual receives. Hence number of siblings is a relevant instrument. It has been argued that number of siblings is uncorrelated with an individual’s ability. In that case it also represents an exogenous instrument. However, there is the possibility that ability depends on the attention an individual receives from parents, and this attention is shared with other siblings. 10) The two conditions for instrument validity are corr(Zi, Xi) ≠ 0 and corr(Zi, ui) = 0. The reason for the inconsistency of OLS is that corr(Xi, ui) ≠ 0. But if X and Z are correlated, and X and u are also correlated, then how can Z and u not be correlated? Explain. Answer: The introduction to Chapter 10 on instrumental variables regression and section 10.1 went into a lengthy explanation of this problem. The major idea is that the variation in Xi has two parts: one that is uncorrelated with ui and a second that is correlated with ui. The trick is to isolate the uncorrelated part of X. For the instrument to be valid, corr(Zi, ui) = 0 and corr(Zi, Xi) ≠ 0 must hold. TSLS then generates predicted values of X in the first stage by using a linear combination of the instruments. As long as corr(Zi, Xi) ≠ 0 and corr(Zi, ui) = 0, the part of X which is uncorrelated with the error term is extracted through the prediction. In the second stage, this captured exogenous variation in X is then used to estimate the effect of X on Y.
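The two-stage logic in the answer to question 10 can be illustrated with a small simulation. This is only a sketch and not part of the answer key: the sample size, the true slope of 2.0, the 0.8 loading of X on u, and the function names are all assumptions chosen for illustration. X is built from one part driven by Z (uncorrelated with u) and one part driven by u, so OLS should overstate the slope while TSLS should recover it approximately.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def simulate_ols_vs_tsls(n=20000, beta1=2.0, seed=0):
    """Return (OLS slope, TSLS slope) for one simulated sample."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]  # instrument, independent of u
    u = [rng.gauss(0, 1) for _ in range(n)]  # regression error
    e = [rng.gauss(0, 1) for _ in range(n)]  # other exogenous variation in X
    # X has a part driven by Z (uncorrelated with u) and a part driven by u.
    x = [zi + 0.8 * ui + ei for zi, ui, ei in zip(z, u, e)]
    y = [1.0 + beta1 * xi + ui for xi, ui in zip(x, u)]

    # OLS slope: inconsistent here because corr(X, u) != 0.
    ols = cov(x, y) / cov(x, x)

    # First stage: regress X on Z and form the predicted values of X.
    pi1 = cov(z, x) / cov(z, z)
    pi0 = sum(x) / n - pi1 * sum(z) / n
    x_hat = [pi0 + pi1 * zi for zi in z]

    # Second stage: regress Y on the predicted values of X.
    tsls = cov(x_hat, y) / cov(x_hat, x_hat)
    return ols, tsls


ols_estimate, tsls_estimate = simulate_ols_vs_tsls()
print(f"OLS slope:  {ols_estimate:.3f}")
print(f"TSLS slope: {tsls_estimate:.3f}")
```

Under these assumed values the OLS slope converges to roughly 2 + 0.8/2.64, or about 2.3, while the TSLS slope should be close to the true value of 2.0.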

11) Consider a model of the U.S. labor market where the demand for labor depends on the real wage, while the supply of labor is vertical and does not depend on the real wage. You could argue that the supply of labor by households (think of hours supplied by two adults and two children) has not changed much over the last 60 years or so in the U.S., while real wages more than doubled over the same time span. At first that seems strange given the higher participation rate of females over that period, but that increase has been countered by a lower male participation rate (resulting from earlier retirement), an increase in legal holidays, and an increase in vacation days.

a. Write down two equations representing the labor supply and labor demand functions, allowing for an error term in each of the demand and supply equations. In addition, assume that the labor market clears.

b. How would you estimate the labor supply equation?

c. Assuming that the error terms are mutually independent i.i.d. random variables, both with mean zero, show that the real wage and the error term of the labor demand equation are correlated.

d. If you find a non-zero correlation, should you estimate the labor demand equation using OLS? If so, what are the consequences?

e. Estimating the labor demand equation by IV estimation, which instrument suggests itself immediately?

Answer: a. Students may use different symbols, but will end up with something like the following specification:

$N^d = \beta_0 + \beta_1 \frac{W}{P} + u$
$N^s = \gamma_0 + v$
$N^d = N^s = N$

b. The labor supply equation can be estimated by OLS: $\hat{\gamma}_0 = \bar{N}$.

c. Using the above symbols, it can be shown that $\mathrm{cov}(W/P,\, u) = -\sigma_u^2 / \beta_1$.

d. OLS will not be consistent. e. Hours worked per household is correlated with the real wage but not correlated with the error term (here u) in the labor demand equation. Hence it is a valid instrument.
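The covariance in part c and the instrument argument in part e can be worked out in a few lines. The derivation below is a sketch that assumes the specification from part a (demand N = β0 + β1(W/P) + u, supply N = γ0 + v, market clearing), β1 ≠ 0, and independent mean-zero errors u and v with variances σ_u² and σ_v²; the symbol σ_v² is introduced here for illustration.

```latex
\begin{align*}
\beta_0 + \beta_1 \frac{W}{P} + u &= \gamma_0 + v
  && \text{(market clearing, } N^d = N^s = N\text{)} \\
\frac{W}{P} &= \frac{\gamma_0 - \beta_0}{\beta_1} + \frac{v - u}{\beta_1} \\
\operatorname{cov}\!\left(\frac{W}{P},\, u\right)
  &= \frac{\operatorname{cov}(v, u) - \operatorname{var}(u)}{\beta_1}
   = -\frac{\sigma_u^2}{\beta_1} \neq 0 \\
\operatorname{cov}(N,\, u) &= \operatorname{cov}(\gamma_0 + v,\, u) = 0
  && \text{(instrument exogeneity)} \\
\operatorname{cov}\!\left(N,\, \frac{W}{P}\right)
  &= \frac{\operatorname{var}(v) - \operatorname{cov}(v, u)}{\beta_1}
   = \frac{\sigma_v^2}{\beta_1} \neq 0
  && \text{(instrument relevance)}
\end{align*}
```

The third line is the result quoted in part c; the last two lines restate why hours worked satisfies both conditions for instrument validity in part e.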

Chapter 13 Experiments and Quasi-Experiments 13.1 Multiple Choice 1) The following are reasons for studying randomized controlled experiments in an econometrics course, with the exception of A) at a conceptual level, the notion of an ideal randomized controlled experiment provides a benchmark against which to judge estimates of causal effects in practice. B) when experiments are actually conducted, their results can be very influential, so it is important to understand the limitations and threats to validity of actual experiments as well as their strengths. C) randomized controlled experiments in economics are common. D) external circumstances sometimes produce what appears to be randomization. Answer: C 2) Program evaluation A) is conducted for most departments in your university/college about every seven years. B) is the field of study that concerns estimating the effect of a program, policy, or some other intervention or “treatment.” C) tries to establish whether EViews, SAS or Stata work best for your econometrics course. D) establishes rating systems for television programs in a controlled experiment framework. Answer: B 3) In the context of a controlled experiment, consider the simple linear regression formulation Yi = β0 + β1 Xi + ui. Let the Yi be the outcome, Xi the treatment level, and ui contain all the additional determinants of the outcome. Then A) the OLS estimator of the slope will be inconsistent in the case of a randomly assigned Xi since there are omitted variables present. B) Xi and ui will be independently distributed if the Xi are randomly assigned. C) β0 represents the causal effect of X on Y when X is zero. D) E(Y | X = 0) is the expected value for the treatment group. Answer: B 4) In the context of a controlled experiment, consider the simple linear regression formulation Yi = β0 + β1 Xi + ui. Let the Yi be the outcome, Xi the treatment level when the treatment is binary, and ui contain all the additional determinants of the outcome. Then calling $\hat{\beta}_1$ a differences estimator A) makes sense since it is the difference between the sample average outcome of the treatment group and the sample average outcome of the control group. B) and $\hat{\beta}_0$ the level estimator is standard terminology in randomized controlled experiments. C) does not make sense, since neither Y nor X are in differences. D) is not quite accurate since it is actually the derivative of Y on X. Answer: A 5) The following does not represent a threat to internal validity of randomized controlled experiments: A) attrition. B) failure to follow the treatment protocol. C) experimental effects. D) a large sample size. Answer: D
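The claim behind the answer to question 4, that the OLS slope on a binary treatment equals the difference in group means, can be checked numerically. The data below are made up for illustration, and the function names are hypothetical; this is a minimal sketch rather than anything taken from the textbook.

```python
def ols_intercept_slope(x, y):
    """OLS intercept and slope for a regression of y on a single regressor x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def difference_in_means(x, y):
    """Average outcome of the treated (x == 1) minus that of the controls (x == 0)."""
    treated = [yi for xi, yi in zip(x, y) if xi == 1]
    control = [yi for xi, yi in zip(x, y) if xi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Made-up outcomes for eight subjects; x is the binary treatment indicator.
x = [1, 0, 1, 1, 0, 0, 1, 0]
y = [12.0, 9.5, 11.0, 13.5, 10.0, 8.5, 12.5, 9.0]

b0_hat, b1_hat = ols_intercept_slope(x, y)
print(f"OLS slope:           {b1_hat:.2f}")   # 3.00
print(f"difference in means: {difference_in_means(x, y):.2f}")   # 3.00
print(f"OLS intercept:       {b0_hat:.2f}")   # 9.25, the control group mean
```

The intercept equals the control group average, which is why the slope can be read as the treatment-minus-control difference.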

6) The Hawthorne effect refers to A) subjects dropping out of the study after being randomly assigned to the treatment or control group. B) the failure of individuals to follow completely the randomized treatment protocol. C) the phenomenon that subjects in an experiment can change their behavior merely by being included in the experiment. D) assigning individuals, in part, as a result of their characteristics or preferences. Answer: C 7) The following is not a threat to external validity: A) the experimental sample is not representative of the population of interest. B) the treatment being studied is not representative of the treatment that would be implemented more broadly. C) experimental participants are volunteers. D) partial compliance with the treatment protocol. Answer: D 8) Assume that data are available on other characteristics of the subjects that are relevant to determining the experimental outcome. Then including these determinants explicitly results in A) the limited dependent variable model. B) the differences in means test. C) the multiple regression model. D) large scale equilibrium effects. Answer: C 9) All of the following are reasons for using the differences estimator with additional regressors, with the exception of A) efficiency. B) providing a check for randomization. C) providing an adjustment for “conditional” randomization. D) making the difference estimator easier to calculate than in the case of the differences estimator without the additional regressors. Answer: D 10) Experimental data are often A) observational data. B) binary data, in that the subject either does or does not respond to the treatment. C) panel data. D) time series data. Answer: C 11) With panel data, the causal effect A) cannot be estimated since correlation does not imply causation. B) is typically estimated using the probit regression model. C) can be estimated using the “differences-in-differences” estimator. 
D) can be estimated by looking at the difference between the treatment and the control group after the treatment has taken place. Answer: C 12) Causal effects that depend on the value of an observable variable, say Wi, A) cannot be estimated. B) can be estimated by interacting the treatment variable with Wi. C) result in the OLS estimator being inefficient. D) require the use of homoskedasticity-only standard errors. Answer: B

13) To test for randomization when Xi is binary, A) you regress Xi on all W’s and compute the F-statistic for testing that all the coefficients on the W’s are zero. (The W’s measure characteristics of individuals, and these are not affected by the treatment.) B) is not possible, since binary variables can only be regressors. C) requires reordering the observations randomly and re-estimating the model. If the coefficients remain the same, then this is evidence of randomization. D) requires seeking external validity for your study. Answer: A 14) The following estimation method should not be used to test for randomization when Xi is binary: A) linear probability model (OLS) with homoskedasticity-only standard errors. B) probit. C) logit. D) linear probability model (OLS) with heteroskedasticity-robust standard errors. Answer: A 15) In a quasi-experiment A) quasi differences are used, i.e., instead of △Y you need to use (Yafter - λ × Ybefore), where 0 < λ < 1. B) randomness is introduced by variations in individual circumstances that make it appear as if the treatment is randomly assigned. C) the causal effect has to be estimated through quasi maximum likelihood estimation. D) the t-statistic is no longer normally distributed in large samples. Answer: B 16) Your textbook gives several examples of quasi-experiments that were conducted. The following is not an example of a quasi-experiment: A) labor market effects of immigration. B) effects on civilian earnings of military service. C) the effect of cardiac catheterization. D) the effect of unemployment on the inflation rate. Answer: D 17) A repeated cross-sectional data set A) is also referred to as panel data. B) is a collection of cross-sectional data sets, where each cross-sectional data set corresponds to a different time period. C) samples identical entities at least twice. D) is typically used for estimating the following regression model Yit = β0 + β1 Xit + β2 W1,it + ... + β1+r Wr,it + uit. Answer: B 18) For quasi-experiments, A) there is a particularly important potential threat to internal validity, namely whether the “as if” randomization in fact can be treated reliably as true randomization. B) there are the same threats to internal validity as for true randomized controlled experiments, without modifications. C) there is little threat to external validity, since the populations are typically already different. D) OLS estimation should not be used. Answer: A

19) Experimental effects, such as the Hawthorne effect, A) generally are not germane in quasi-experiments. B) typically require instrumental variable estimation in quasi-experiments. C) can be dealt with using binary variables in quasi-experiments. D) are the most important threat to internal validity in quasi-experiments. Answer: A 20) Heterogeneous population A) implies that heteroskedasticity-robust standard errors must be used. B) suggests that multiple characteristics must be used to describe the population. C) effects can be captured through interaction terms. D) refers to circumstances in which there is unobserved variation in the causal effect within the population. Answer: D 21) If the causal effect is different for different people, then the population regression equation for a binary treatment variable Xi can be written as A) Yi = β0 + β1 Xi + ui. B) Yi = β0 + β1iXi + ui. C) Yi = β0i + β1iXi + ui. D) Yi = β0 + β1 Gi + β2 Dt + ui. Answer: C 22) In the case of heterogeneous causal effects, the following is not true: A) in the circumstances in which OLS would normally be consistent (when E(ui | Xi) = 0), the OLS estimator continues to be consistent. B) OLS estimation using heteroskedasticity-robust standard errors is identical to TSLS. C) the OLS estimator is properly interpreted as a consistent estimator of the average causal effect in the population being studied. D) the TSLS estimator in general is not a consistent estimator of the average causal effect if an individual’s decision to receive treatment depends on the effectiveness of the treatment for that individual. Answer: B 23) One of the major lessons learned in the chapter on experiments and quasi-experiments A) is that there are almost no true experiments in economics and that quasi-experiments are a poor substitute. B) is that you should always use TSLS when estimating causal effects in quasi-experiments. C) is that populations are always homogeneous. D) is that the insights of experimental methods can be applied to quasi-experiments, in which special circumstances make it seem “as if” randomization has occurred. Answer: D 24) Quasi-experiments A) provide a bridge between the econometric analysis of observational data sets and the statistical ideal of a true randomized controlled experiment. B) are not the same as experiments, and lessons learned from the use of the latter can therefore not be applied to them. C) most often use difference-in-difference estimators, which are quite different from OLS and instrumental variables methods studied in earlier chapters of the book. D) use the same methods as studied in earlier chapters of the book, and hence the interpretation of these methods is the same. Answer: A
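Statement C in question 22, that under random assignment OLS consistently estimates the average causal effect even when effects differ across individuals, can be illustrated by simulation. The sketch below assumes individual effects drawn from a normal distribution with mean 2.0 and a coin-flip treatment assignment; these values and the function names are hypothetical and are not from the textbook.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate_heterogeneous_effects(n=20000, seed=1):
    """Return (OLS slope, sample average of the individual causal effects)."""
    rng = random.Random(seed)
    effects, x, y = [], [], []
    for _ in range(n):
        beta1_i = 2.0 + rng.gauss(0, 1)        # individual-specific causal effect
        x_i = 1 if rng.random() < 0.5 else 0   # random assignment, independent of beta1_i
        y_i = 1.0 + beta1_i * x_i + rng.gauss(0, 1)
        effects.append(beta1_i)
        x.append(x_i)
        y.append(y_i)
    # With a binary regressor, the OLS slope equals the difference in group means.
    treated = [yi for xi, yi in zip(x, y) if xi == 1]
    control = [yi for xi, yi in zip(x, y) if xi == 0]
    ols_slope = mean(treated) - mean(control)
    return ols_slope, mean(effects)


ols_slope, average_effect = simulate_heterogeneous_effects()
print(f"OLS slope:             {ols_slope:.3f}")
print(f"average causal effect: {average_effect:.3f}")
```

Both numbers should be close to 2.0 under these assumptions, even though no single individual need have an effect of exactly 2.0.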

25) The major distinction between the experiments and quasi-experiments chapter and earlier chapters is the A) frequent use of binary variables. B) type of data analyzed and the special opportunities and challenges posed when analyzing experiments and quasi-experiments. C) superiority of TSLS over OLS. D) use of heteroskedasticity-robust standard errors. Answer: B 26) A potential outcome A) is the outcome for an individual under a potential treatment. B) cannot be observed because most individuals do not achieve their potential. C) is the same as a causal effect. D) is none of the above. Answer: A 27) A causal effect for a single individual A) can be deduced from the average treatment effect. B) cannot be measured. C) depends on observable variables only. D) is observable since it is used as part of calculating the mean of individual causal effects. Answer: B 28) Randomization based on covariates is A) not of practical importance since individuals are hardly ever assigned in this fashion. B) dependent on the covariances of the error term (serial correlation). C) a randomization in which the probability of assignment to the treatment group depends on one or more observable variables W. D) eliminates the omitted variable bias when using the difference estimator based on Yi = β0 + β1 Xi + ui, where Y is the outcome variable and X is the treatment indicator. Answer: C 29) Testing for the random receipt of treatment A) is not possible, in general. B) entails testing the hypothesis that the coefficients on W1i, …, Wri are non-zero in a regression of Xi on W1i, …, Wri. C) is not meaningful since the LHS variable is binary. D) entails testing the hypothesis that the coefficients on W1i, …, Wri are zero in a regression of Xi on W1i, …, Wri. Answer: D 30) Failure to follow the treatment protocol means that A) the OLS estimator cannot be computed. B) instrumental variables estimation of the treatment effect should be used where the initial random assignment is the instrument for the treatment actually received. C) you should use the TSLS estimator and regress the outcome variable Y on the initial random assignment in the first stage to get predicted values of the outcome variable. D) the Hawthorne effect plays a crucial role. Answer: B
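The idea in the answer to question 30 can be sketched with a simulation of partial compliance. Everything numerical below is an assumption made for illustration: a constant treatment effect of 5.0, an unobserved "motivation" variable that drives both take-up and the outcome, and the compliance thresholds. With a single binary instrument, the IV estimator reduces to the ratio of the difference in mean outcomes to the difference in mean treatment receipt across assignment groups (the Wald form), which is what the code computes.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate_partial_compliance(n=40000, effect=5.0, seed=2):
    """Return (OLS estimate, IV estimate) of the treatment effect."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        z_i = 1 if rng.random() < 0.5 else 0   # initial random assignment
        m_i = rng.gauss(0, 1)                  # unobserved motivation
        # Highly motivated subjects obtain treatment regardless of assignment;
        # the least motivated decline it even when assigned to it.
        x_i = 1 if (m_i > 1.0) or (z_i == 1 and m_i > -0.5) else 0
        y_i = 10.0 + effect * x_i + 2.0 * m_i + rng.gauss(0, 1)
        z.append(z_i)
        x.append(x_i)
        y.append(y_i)

    # OLS of Y on treatment actually received (difference in means by X).
    ols = (mean([yi for xi, yi in zip(x, y) if xi == 1])
           - mean([yi for xi, yi in zip(x, y) if xi == 0]))

    # IV with the initial assignment Z as the instrument for X.
    reduced_form = (mean([yi for zi, yi in zip(z, y) if zi == 1])
                    - mean([yi for zi, yi in zip(z, y) if zi == 0]))
    first_stage = (mean([xi for zi, xi in zip(z, x) if zi == 1])
                   - mean([xi for zi, xi in zip(z, x) if zi == 0]))
    iv = reduced_form / first_stage
    return ols, iv


ols_estimate, iv_estimate = simulate_partial_compliance()
print(f"OLS on treatment received: {ols_estimate:.2f}")
print(f"IV using assignment:       {iv_estimate:.2f}")
```

Because take-up depends on motivation, OLS on treatment received should overstate the assumed effect of 5.0, while the IV estimate should be close to it.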

31) Small sample sizes in an experiment A) bias the estimators of the causal effect. B) may pose a problem because the assumption that errors are normally distributed is dubious for experimental data. C) do not raise threats to the validity of confidence intervals as long as heteroskedasticity-robust standard errors are used. D) may affect confidence intervals but not hypothesis tests. Answer: B 32) A repeated cross-sectional data set is A) a collection of cross-sectional data sets, where each cross-sectional data set corresponds to a different time period. B) the same as a balanced panel data set. C) what Card and Krueger used in their study of the effect of minimum wages on teenage employment. D) time series. Answer: A 33) In a sharp regression discontinuity design, A) crossing the threshold influences receipt of the treatment but is not the sole determinant. B) the population regression line must be linear above and below the threshold. C) Xi will in general be correlated with ui. D) receipt of treatment is entirely determined by whether W exceeds the threshold. Answer: D 34) Threats to internal validity of quasi-experiments include A) failure of randomization. B) failure to follow the treatment protocol. C) attrition. D) all of the above with some modifications from true randomized controlled experiments. Answer: D

13.2 Essays and Longer Questions 1) You want to study whether or not the use of computers in the classroom for elementary students has an effect on performance. Explain in some detail how you would ideally set up such an experiment and what threats to internal and external validity there might be. Answer: Answers will differ by student. Students have the choice to suggest a quasi-experiment or a controlled experiment. Although it is possible to focus on states where elementary schools have introduced the use of computers in the classroom, and compare the change in test scores with those in states which have not done so, it is more likely that students will concentrate on a controlled experiment design here. The answer should emphasize the initial random selection of pupils, classrooms, or schools from the population of a state, or the nation, and the random assignment to a treatment group. Furthermore, teachers must also be randomly assigned. In essence, the random selection and assignment is to ensure that E(ui | Xi) = 0 holds. X could be a binary variable, indicating whether computers were introduced in the classroom, or it could indicate the intensity with which computers were used. Answers should mention each of the threats to internal and external validity. Failure to randomize might occur because the treatment group could be assigned according to the performance level of students, computers already being used in classrooms, or the previous experience students had with computers. Teachers may be chosen depending on their knowledge of computers and software. Failure to follow treatment protocol is less of a risk here. Attrition is a problem if parents move to another school district or to private schools as a result of the assignment to the treatment and control groups. Experimental effects are hard to avoid in this situation since it does not make sense to have a double-blind experiment. Small samples should not be a problem in this set-up.
Threats to external validity include nonrepresentative samples, which are unlikely to occur here unless there are a large number of volunteers. Similarly, nonrepresentative programs or policy should not pose a problem. There may be general equilibrium effects if more technically oriented teachers have to be hired or others have to be reeducated. Finally, there may be treatment vs. eligibility effects if there is a choice to opt in or out of the treatment and control group.

2) Canada and the United States had approximately the same aggregate unemployment rates from the 1920s to 1981. In 1982, a two percentage point gap appears, which has roughly persisted until today, with the Canadian unemployment rate in the third quarter of 2002 being 7.6 percent while the American rate stood at 5.9 percent in the same period. Several authors have investigated this phenomenon. One study, published in 1990, contained the following statement: “It is a cliché that, as compared to analysis in the physical sciences, economic analysis is hampered by the lack of controlled experiments. In this regard, study of the Canadian economy can be much facilitated by comparison with the behaviour of the US …” Discuss what the authors may have had in mind. List some potential threats to internal and external validity when comparing aggregate unemployment rate behavior between countries. Answer: It should be clear that the authors were not really talking about a controlled experiment, but instead had in mind a quasi-experiment or natural experiment. In a randomized controlled experiment to study the effect of unemployment insurance benefits on unemployment, for example, unemployed workers would be “treated” with various degrees of unemployment insurance generosity, such as the amount by which their former wages are replaced by unemployment insurance benefits (“replacement rate”), the duration of benefits, the scrutiny of the agency monitoring the job search effort, etc. Instead the authors must have thought that the two economies were similar in many aspects, and that because of an external event, either in Canada or in the U.S., one was subjected to a treatment, while the other was not, which resulted in the aggregate unemployment rate difference. It is the difference in location (living in the U.S. vs. in Canada) that gives the resemblance to a randomly assigned treatment. The above study is of the first type of quasi-experiment discussed in the textbook, whereby the treatment received is viewed as if randomly determined. One threat to external validity is generalizing the results from a U.S.-Canada comparison to culturally different or less developed economies. As a threat to internal validity, consider unemployment insurance generosity as a treatment variable. (Canada liberalized unemployment benefits considerably in the early ’70s.) In that case E(ui | Xi) = 0 is unlikely to hold, and additional regressors and instrumental variable techniques should be used.

3) Earnings functions provide a measure, among other things, of the returns to education. It has been argued that these regressions contain a serious omitted variable bias due to differences in abilities. Furthermore, ability is hard to measure and bound to be highly correlated with years of schooling. Hence the standard estimate of about a 10 percent return to every year of schooling is upward biased. Suggest some ways to address this problem. One famous study looked at earnings of identical twins. Explain how this can be viewed as a quasi-experiment, and mention some of the threats to internal and external validity that such a study might encounter. Answer: Answers will vary by student. The omitted variable bias should play a central part in the discussion. E(ui | Xi, W1i, ..., Wri) = 0 will not hold if one of the W’s is years of education and u contains unobserved ability. If ability causes individuals to have higher earnings and more years of education, perhaps through obtaining university scholarships more easily, then the returns to education are biased upward. One way to circumvent this problem is, as some studies have done in the past, to approximate ability by IQ scores. If IQ scores measure ability with error, then instrumental variable techniques can be employed. These were discussed in Chapter 10 of the textbook. Another possibility is to model ability as an omitted variable that remains constant over time. In that case, panel estimation methods with fixed effects, presented in Chapter 8 of the textbook, can be used. Data can be differenced to eliminate the entity fixed effects or binary variables can be added to capture them. At any rate, this approach requires data being available for more than a single point in time. The use of data from identical twins is fascinating since these have identical genes and, typically, identical family backgrounds. The suggestion is therefore to assume that they have identical ability as well. If some twins have different years of schooling while others do not, then this can be treated as a quasi-experiment since the researcher can view this choice as if it had been randomly assigned. Obviously it cannot count as a randomized controlled experiment, since the difference in schooling was not determined by the flip of a coin, say. But it may also run into problems in providing an “as if” randomization. The text flagged some of the potential problems in section 11.1: “Initially, one might think that an ideal experiment would take two otherwise identical individuals, treat one of them, and compare the difference in their outcomes while holding constant all other influences. This is not, however, a practical experimental design, for it is impossible to find two identical individuals: even identical twins have different life experiences, so they are not identical in every way.” Finally, if identical twins are “different” from the general population, then there is also a threat to external validity in generalizing the results to the population of all individuals.

4) Describe the major differences between a randomized controlled experiment and a quasi-experiment. Answer: Answers will vary by student. Some of the following points should appear. A randomized controlled experiment relies on the random selection of entities from a population of interest, and the random assignment of these individuals into either a treatment or control group. To study the causal effects, a simple regression model with a single regressor can be specified. This regressor can either be a binary variable or a variable indicating treatment levels. Since E(ui | Xi) = 0 is guaranteed if the assignment and selection were random, the causal or treatment effect can be measured through E(Yi | Xi = x) - E(Yi | Xi = 0). The random selection and assignment assures that there is no omitted variable bias, and therefore the OLS estimator is unbiased. Adding additional regressors can result in increased efficiency. Alternatively, a differences-in-differences estimator with or without additional regressors is also available if the entities have been observed for two periods, one before and one after the treatment. In the case of more than two observations per entity, panel methods can be employed. There are various threats to internal and external validity. These include failure to randomize, failure to follow treatment protocol, attrition, experimental effects, and small samples (threats to internal validity), and nonrepresentative sample, nonrepresentative program or policy, general equilibrium effects, and treatment vs. eligibility effects (threats to external validity). A quasi-experiment is also called a “natural experiment” since the treatment of some entities resulted from an external event. The treatment is administered “as if” it was random. The reason for observing quasi-experiments more often in economics is that they are less expensive and raise less of an ethical concern. The “as if” randomly assigned treatment is the result of, as the textbook puts it, “vagaries in legal institutions, location, timing of policy or program implementation, natural randomness such as birth dates, rainfall, or other factors that are unrelated to the causal effect under study.” There are two types of quasi-experiments, one whereby treatment is viewed as if randomly determined, the other whereby the “as if” randomization provides an instrumental variable. Threats to internal and external validity are the same as for randomized controlled experiments once they are modified. For example, experimental effects are typically absent since individuals are not aware that they are part of an experiment. The small-sample threat is replaced by instrument validity in quasi-experiments.

5) Roughly ten percent of elementary schools in California have a system whereby 4th to 6th graders share a common classroom and a single teacher (multi-age, multi-grade classroom). Suggest an experimental design that would allow you to assess the effect of learning in this environment. Mention some of the threats to internal and external validity and how you would attempt to circumvent these. Answer: Students should be selected randomly within a school and should be randomly assigned to a treatment group (multi-age, multi-grade classroom) and a control group (traditional grade assignment; 4th, 5th, or 6th grade only per room). Alternatively, and depending on the size of the experiment, a subset of schools could be chosen and some pupils would randomly be assigned to traditional grade assignments while others would be moved into multi-age, multi-grade classrooms. Another alternative would be to simply choose some schools randomly which would have multi-age, multi-grade classrooms only. The causal effect could then be estimated in a simple regression model with a binary regressor. Random selection and random assignment would assure E(ui | Xi) = 0 and thereby eliminate one threat to internal validity through omitted variable bias. Another threat to internal validity would be if the worst or best performing schools were chosen instead of using a random selection, or if parents in the district were allowed to vote whether or not to have the school selected for the experiment. This would imply a failure to randomize. If students were allowed to refuse to participate by transferring to a neighboring school, then this would represent failure to follow treatment protocol. Double-blind experiments are obviously not feasible since both instructors and students know into which setting they are being placed (“experimental effects”). There are few threats to external validity except for the situation whereby students would be allowed to opt in or out of the experimental group (“treatment vs. eligibility effect”).

6) Assume for the moment that the student-teacher ratio effect on test scores was large enough that you would advocate reducing class sizes in elementary schools. In 1996, the State of California reduced class sizes from K-3 to no more than 20 students across all public elementary schools (Class Size Reduction Act) at a cost of approximately $2 billion. In a short essay, discuss why the general equilibrium effects might differ from the results obtained using experiments. Answer: The General Equilibrium effects are the result of the additional demand for teachers. Each elementary school needed additional teachers in order to reduce the class size to 20 or less — think of a school that had perhaps 3 Kindergarten classes of 25 students each. In that case, one additional classroom had to be created — typically some temporary structure. The question arises where the additional teacher came from. If your school district was a desirable district to teach in, perhaps because of having a reputation of well behaved children or classrooms that were well equipped, then teachers from other districts, perhaps less desirable ones, would apply to the better school district. Presumably the desirable school district would pick the best teacher(s) available, leaving the less desirable school district with a lower level of teacher quality. The same phenomenon would repeat itself at the lower level school district, and so forth, until you would get to the least desirable school district, which would have to hire new teachers from a cohort that could not find a job elsewhere. Given the size of the State of California, the General Equilibrium effect could be substantial, perhaps even drawing quality teachers from other states.


13.3 Mathematical and Graphical Problems 1) Your textbook mentions the use of a quasi-experiment to study the effects of minimum wages on employment using data from fast food restaurants. In 1992, there was an increase in the (state) minimum wage in one U.S. state (New Jersey) but not in a neighboring location (Eastern Pennsylvania). To calculate β̂1^diffs-in-diffs you need the change in the treatment group and the change in the control group. To do this, the study provides you with the following information

      FTE Employment before   FTE Employment after
PA    23.33                   21.17
NJ    20.44                   21.03

where FTE is “full time equivalent” and the numbers are average employment per restaurant. (a) Calculate the change in the treatment group, the change in the control group, and finally β̂1^diffs-in-diffs. Since minimum wages represent a price floor, did you expect β̂1^diffs-in-diffs to be positive or negative? (b) If you look at β̂1^diffs-in-diffs, is this number primarily due to a change in the treatment group or the control group? Is this what you expected? (c) The standard error for β̂1^diffs-in-diffs is 1.36. Test whether or not the coefficient is statistically significant, given that there are 410 observations. If you believed that the benefit from small minimum wage increases outweighed the cost in terms of employment loss, would finding that this coefficient was not statistically significant discourage you? Answer: (a) change in treatment group: +0.59, change in control group: -2.16, β̂1^diffs-in-diffs = 0.59 - (-2.16) = 2.75. Standard economic theory suggests a negative, not positive, change. (b) The overall change of 2.75 is primarily due to the change in Eastern Pennsylvania (-2.16), i.e., the control group. Following standard economic theory, if employment fell in Eastern Pennsylvania, then you would expect employment in New Jersey to fall by even more than in Eastern Pennsylvania. Not only did employment in New Jersey not fall by more, it actually increased. (c) The t-statistic is 2.75/1.36 = 2.02, thereby making the coefficient statistically significant at the 5% level (two-sided test, critical value 1.96). Even if the coefficient were not statistically significant, it is not negative. Hence finding an insignificant coefficient should not be discouraging, since it would suggest that there is no negative employment effect of a small increase in minimum wages.
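The arithmetic in problem 1 can be checked with a few lines of Python. This is only a sketch: the dictionary layout and the names `employment`, `change`, and `did` are illustrative choices, while the four averages and the standard error of 1.36 are the ones given in the problem.

```python
# Average FTE employment per restaurant, as given in the problem.
employment = {
    "NJ": {"before": 20.44, "after": 21.03},  # treatment group
    "PA": {"before": 23.33, "after": 21.17},  # control group
}


def change(group):
    """Change in average employment from 'before' to 'after'."""
    return employment[group]["after"] - employment[group]["before"]


delta_treatment = change("NJ")
delta_control = change("PA")

# Differences-in-differences estimate: treatment change minus control change.
did = delta_treatment - delta_control

se = 1.36              # standard error given in part (c)
t_stat = did / se
critical_5pct = 1.96   # two-sided large-sample critical value

print(f"change in treatment group: {delta_treatment:+.2f}")
print(f"change in control group:   {delta_control:+.2f}")
print(f"diffs-in-diffs estimate:   {did:.2f}")
print(f"t-statistic:               {t_stat:.2f}")
print("significant at 5% (two-sided):", abs(t_stat) > critical_5pct)
```

Most of the estimate comes from the fall in the control group, which is the point of part (b).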


2) Define β̂1^diffs-in-diffs in terms of observable differences in the treatment and control group, before and after the treatment. Explain why this presentation is the equivalent of calculating the coefficient in a regression framework. Answer: β̂1^diffs-in-diffs = (Ȳtreatment,after - Ȳtreatment,before) - (Ȳcontrol,after - Ȳcontrol,before) = △Ȳtreatment - △Ȳcontrol. Consider the following regression △Yi = β0 + β1 Xi + ui, where △Yi is the value of Y for the ith individual after the experiment is completed, minus the value of Y before it starts, and X is a randomly assigned binary treatment variable, which takes on the value of one if treatment was received and is zero otherwise. Then for an individual who did not receive treatment, Ȳcontrol,after - Ȳcontrol,before = β̂0. If the individual received treatment, then Ȳtreatment,after - Ȳtreatment,before = β̂0 + β̂1. Hence β̂1 = (Ȳtreatment,after - Ȳtreatment,before) - (Ȳcontrol,after - Ȳcontrol,before).

3) Your textbook gives a graphical example of β̂1^diffs-in-diffs, where outcome is plotted on the vertical axis, and time period appears on the horizontal axis. There are two time periods entered: “t = 1” and “t = 2.” The former corresponds to the “before” time period, while the latter represents the “after” period. The assumption is that the policy occurred sometime between the time periods (call this “t = p”). Keeping in mind the graphical example of β̂1^diffs-in-diffs, carefully read what a reviewer of the Card and Krueger (CK) study of the minimum wage effect on employment in the New Jersey-Pennsylvania study had to say: “Two assumptions are implicit throughout the evaluation of the ‘natural experiment:’ (1) [β̂1^diffs-in-diffs] would be zero if the treatment had not occurred, so a nonzero [β̂1^diffs-in-diffs] indicates the effect of the treatment (that is, nothing else could have caused the difference in the outcomes to change), and (2) … the intervention occurs after we measure the initial outcomes in the two groups. … Three conditions are particularly relevant in interpreting CK’s work: (1) [t = 1] must be sufficiently before [t = p] that [the treatment group] did not adjust to the treatment before [t = 1] – otherwise [Ȳtreatment,before – Ȳcontrol,before] will reflect the effect of the treatment; (2) [t = 2] must be sufficiently after [t = p] to allow the treatment’s effect to be fully felt; and (3) we must be sure that the same difference [Ȳtreatment,before – Ȳcontrol,before] would have been observed at [t = 2] if the treatment had not been imposed, that is, [the control group must be good enough] that there is no need to adjust the differences for factors other than the treatment that might have caused them to change.” Use a figure similar to the textbook to explain what this reviewer meant. Answer: See accompanying figures.


[Figures not reproduced. Recoverable panel captions: (1) β̂1^diffs-in-diffs would be zero if the treatment had not occurred; (2) the intervention occurs after we measure the initial outcomes in the two groups; remaining panels: rule out conditions (1), (2), and (3) in the case of no treatment.]


4) Consider the simple population regression model where the treatment is the same for the members of the treatment group, and hence X is a binary variable. Explain why the coefficient on X represents the difference between two means. How is the test for the statistical significance of the coefficient on X related to the test for differences in means between two populations, when their variances are different? Write down the null and alternative hypothesis in each case. Answer: The answer should proceed along the lines of “Regression When X Is a Binary Variable” (Section 4.7) of the textbook, where the binary variable now indicates whether or not an individual has received treatment. In terms of the regression model with a single regressor this is formulated as Yi = β0 + β1 Xi + ui, where Xi is 1 or 0 depending on whether or not the individual received treatment. Then in the case of no treatment received, Yi = β0 + ui and E(Yi | Xi = 0) = β0. Alternatively, when treatment was received, Yi = β0 + β1 + ui and E(Yi | Xi = 1) = β0 + β1. Hence β1 is the difference between the two means. To test whether or not there is a difference, the hypotheses are H0 : β1 = 0 vs. H1 : β1 ≠ 0. The null hypothesis can be tested using the usual t-statistic and allowing for heteroskedasticity-robust standard errors. This test corresponds to the test encountered in Section 3.4 of the textbook, where H0 : μtreatment - μcontrol = 0 vs. H1 : μtreatment - μcontrol ≠ 0, and the standard error of the difference in means is calculated under the assumption that the two population variances are unequal.

5) Present alternative estimators for causal effects using experimental data when data is available for a single period or for two periods. Discuss their advantages and disadvantages. Answer: There are essentially four estimators discussed in the textbook: two for a single period randomized controlled experiment, and two for panel data.
For each of these situations, a binary or treatment level regressor X is used, and additional characteristics can be added, thereby distinguishing the two possible estimators within the single period and the two period (panel) framework. The single period estimator of the causal or treatment effect is the OLS estimator in the regression model with a single regressor Yi = β0 + β1 Xi + ui. Random selection and assignment assure that E(ui | Xi) = 0. Thus even with omitted variables present, E(Yi | Xi) = β0 + β1 Xi, since X is distributed independently of the omitted variables. The OLS estimator β̂1, also called the differences estimator, is unbiased and consistent. A different estimator, called the differences estimator with additional regressors, is obtained by adding characteristics of the individual that are not affected by the treatment. This is done to deal with some of the threats to validity, but also for efficiency purposes. The multiple regression model in this case is Yi = β0 + β1 Xi + β2 W1i + ... + β1+r Wri + ui, i = 1,..., n and β̂1 is the differences estimator with additional regressors. Here β̂1 is consistent even if E(ui | Xi, W1i,..., Wri) = 0 does not hold, as long as there is conditional mean independence. The inclusion of the characteristics also allows for testing for random receipt of treatment and random assignment using the usual F-statistic in auxiliary regressions. The third estimator generalizes the two estimators above to the case of panel data. The idea here is that data is available for two periods, one before the treatment is administered and one after. The differences-in-differences estimator is then defined as β̂1^diffs-in-diffs = (Ȳtreatment,after - Ȳtreatment,before) - (Ȳcontrol,after - Ȳcontrol,before) = △Ȳtreatment - △Ȳcontrol. If the treatment is randomly assigned, the estimator is unbiased, consistent, and more efficient than the differences estimator. In addition, it eliminates pretreatment differences in Y. Alternatively, it can be viewed in a regression framework △Yi = β0 + β1 Xi + ui, where △Yi is the value of Y for the ith individual after the experiment is completed, minus the value of Y before it starts. Then for an individual who did not receive treatment, Ȳcontrol,after - Ȳcontrol,before = β̂0. If the individual received treatment, then Ȳtreatment,after - Ȳtreatment,before = β̂0 + β̂1. Hence β̂1 = (Ȳtreatment,after - Ȳtreatment,before) - (Ȳcontrol,after - Ȳcontrol,before). As in the case for a single time period, additional characteristics can be added. In that case △Yi = β0 + β1 Xi + β2 W1i + ... + β1+r Wri + ui, i = 1,..., n, where the interpretation of the W variable effect is different from before, since the dependent variable is differenced.
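The equivalence between the regression view and the comparison of group means can be illustrated numerically. The sketch below uses made-up data (seven hypothetical individuals) and a hand-written helper `ols_simple`; neither comes from the textbook. It regresses the change in Y on a binary treatment indicator and compares the slope with the difference in mean changes.

```python
from statistics import mean


def ols_simple(x, y):
    """OLS intercept and slope for y = b0 + b1 * x + u (single regressor)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: x = 1 if treated, outcome measured before and after.
x = [1, 1, 1, 0, 0, 0, 0]
y_before = [10.0, 12.0, 11.0, 14.0, 13.0, 15.0, 14.0]
y_after = [13.0, 14.0, 15.0, 14.5, 13.0, 16.0, 14.5]

# Dependent variable of the differences-in-differences regression.
delta_y = [after - before for after, before in zip(y_after, y_before)]

b0_hat, b1_hat = ols_simple(x, delta_y)

# The same quantities computed directly from group means.
mean_change_treated = mean(d for d, xi in zip(delta_y, x) if xi == 1)
mean_change_control = mean(d for d, xi in zip(delta_y, x) if xi == 0)
did_from_means = mean_change_treated - mean_change_control

print(f"OLS intercept: {b0_hat:.3f}  (control mean change: {mean_change_control:.3f})")
print(f"OLS slope:     {b1_hat:.3f}  (difference in mean changes: {did_from_means:.3f})")
```

With a binary regressor the OLS intercept is the control-group mean change and the slope is the gap between the two group means, so the two routes give the same number.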


6) To analyze the effect of a minimum wage increase, a famous study used a quasi-experiment for two adjacent states: New Jersey and (Eastern) Pennsylvania. A β̂1^diffs-in-diffs was calculated by comparing average employment changes per restaurant between the treatment group (New Jersey) and the control group (Pennsylvania). In addition, the authors provide data on the employment changes between “low wage” restaurants and “high wage” restaurants in New Jersey only. A restaurant was classified as “low wage” if the starting wage in the first wave of surveys was at the then prevailing minimum wage of $4.25. A “high wage” restaurant was a place with a starting wage close to or above the $5.25 minimum wage after the increase. (a) Explain why employment changes of the “high wage” and “low wage” restaurants might constitute a quasi-experiment. Which is the treatment group and which the control group? (b) The following information is provided

            FTE Employment before   FTE Employment after
Low wage    19.56                   20.88
High wage   22.25                   20.21

where FTE is “full time equivalent” and the numbers are average employment per restaurant. Calculate the change in the treatment group, the change in the control group, and finally β̂1^diffs-in-diffs. Since minimum wages represent a price floor, did you expect β̂1^diffs-in-diffs to be positive or negative? (c) The standard error for β̂1^diffs-in-diffs is 1.48. Test whether or not this is statistically significant, given that there are 174 observations. Answer: (a) In the above example, the increase in wages (“treatment”) occurs not because of changes in the demand or supply of labor, but because of an external event, namely the raising of the minimum wage in New Jersey. This is therefore a good example of a “natural experiment.” The treatment group is the “low wage” restaurants, since the wages there are actually changed. The “high wage” restaurants are the control group. (b) change in treatment group: +1.32, change in control group: -2.04, β̂1^diffs-in-diffs = 1.32 - (-2.04) = 3.36. The prior expectation would be negative. (c) The t-statistic is 3.36/1.48 = 2.27, making the coefficient statistically significant at the 5% level (two-sided test).


7) Specify the multiple regression model that contains the differences-in-differences estimator (with additional regressors). Explain the circumstances under which this model is preferable to the simple differences-in-differences estimator. Explain how the W’s can be used to test for randomization. How does the interpretation of the W variables change compared to the differences estimator with additional regressors? Answer: The differences-in-differences estimator with additional regressors is based on △Yi = β0 + β1 Xi + β2 W1i + ... + β1+r Wri + ui, i = 1,..., n. This is more general than the differences-in-differences regression △Yi = β0 + β1 Xi + ui, whose OLS estimator equals β̂1^diffs-in-diffs = (Ȳtreatment,after - Ȳtreatment,before) - (Ȳcontrol,after - Ȳcontrol,before) = △Ȳtreatment - △Ȳcontrol, and hence the name. In some applications the conditional mean zero assumption, E(ui | Xi, W1i,..., Wri) = 0, is not likely to hold. However, the differences-in-differences estimator with additional regressors remains consistent under the weaker assumption of conditional mean independence. Including the additional characteristics (W variables) also can improve efficiency. Furthermore, adding these variables allows the researcher to perform tests for randomization, since both Xi and the assignment should be uncorrelated with the W variables. Regressing Xi on W1i, …, Wri, and using an F-test for the hypothesis that all coefficients on the W’s are zero constitutes a test for the random receipt of treatment. Performing a similar regression of the assignment Zi on the W’s with an accompanying F-test is a test for random assignment. Obviously if treatment and assignment were randomly determined, then neither should be dependent on characteristics of the entities. The dependent variable in the case of the differences estimator is a level, while in the case of the differences-in-differences estimator it is a change. Hence W affects the change in the latter case, not the level itself.
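The randomization check described in the answer can be sketched for the simplest case of a single W variable. Everything here is an illustrative assumption: the data are invented, the helper `randomization_f_stat` is hypothetical, and the statistic is the homoskedasticity-only F-statistic, whereas applied work would normally use a heteroskedasticity-robust version and several W's.

```python
from statistics import mean


def randomization_f_stat(x, w):
    """Homoskedasticity-only F-statistic for H0: the coefficient on w is zero
    in the auxiliary regression x = g0 + g1 * w + e (one W variable only)."""
    n = len(x)
    x_bar, w_bar = mean(x), mean(w)
    sww = sum((wi - w_bar) ** 2 for wi in w)
    swx = sum((wi - w_bar) * (xi - x_bar) for wi, xi in zip(w, x))
    g1 = swx / sww
    g0 = x_bar - g1 * w_bar
    ssr_unrestricted = sum((xi - g0 - g1 * wi) ** 2 for xi, wi in zip(x, w))
    ssr_restricted = sum((xi - x_bar) ** 2 for xi in x)  # intercept only
    q = 1  # number of restrictions being tested
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - 2))


# Hypothetical characteristic W and two hypothetical treatment patterns.
w = [0, 0, 1, 1, 0, 0, 1, 1]
x_balanced = [0, 1, 0, 1, 0, 1, 0, 1]    # treatment unrelated to W
x_unbalanced = [0, 0, 1, 1, 0, 1, 1, 1]  # treatment concentrated where W = 1

f_balanced = randomization_f_stat(x_balanced, w)
f_unbalanced = randomization_f_stat(x_unbalanced, w)

print(f"F-statistic, balanced treatment:   {f_balanced:.2f}")
print(f"F-statistic, unbalanced treatment: {f_unbalanced:.2f}")
```

A large F-statistic signals that receipt of treatment depends on the characteristic, which is evidence against random receipt of treatment.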


8) Let the vertical axis of a figure indicate the average employment in fast food restaurants. There are two time periods, t = 1 and t = 2, where time period is measured on the horizontal axis. The following table presents average employment levels per restaurant for New Jersey (the treatment group) and Eastern Pennsylvania (the control group).

      FTE Employment before   FTE Employment after
PA    23.33                   21.17
NJ    20.44                   21.03

Enter the four points in the figure and label them Ȳtreatment,before, Ȳtreatment,after, Ȳcontrol,before, and Ȳcontrol,after. Connect the points. Finally calculate and indicate the value for β̂1^diffs-in-diffs. Answer: β̂1^diffs-in-diffs = △Ȳtreatment - △Ȳcontrol = (21.03 - 20.44) - (21.17 - 23.33) = 2.75.

See also accompanying figure.


9) (Requires Appendix material) Discuss how the differences-in-differences estimator can be extended to multiple time periods. In particular, assume that there are n individuals and T time periods. What do the individual and time effects control for? Answer: The extension of the differences-in-differences estimator to multiple time periods uses the differences estimator for a single period, and adds binary variables for entity and time fixed effects. As with the differences estimator and the differences-in-differences estimator, additional regressors W for characteristics can be added. Without these characteristics, the population regression model is as follows: Yit = β0 + β1 Xit + γ2 D2i + ... + γn Dni + δ2 B2t + ... + δT BTt + vit, with i = 1,…,n entities, and t = 1,…,T time periods. The entity effects control for unobserved variables that remain constant over time for the same entity, and the time effects control for unobserved variables that are the same for all individuals at a point in time. Examples of time fixed effects could be business cycle conditions or macroeconomic conditions in general. Examples of entity fixed effects might be gender, race, years of previous education, etc. The model simplifies to the differences-in-differences regression model for two periods (T = 2). If W variables are added, then these can also be interacted with the time effect binary variables. The major advantage over the differences-in-differences model is that effects can be traced out over time.
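The statement that the model reduces to the differences-in-differences regression when T = 2 can be checked by differencing. The derivation below is a sketch under two labeled assumptions that are not spelled out in the answer: no entity is treated in period 1 (so X_{i1} = 0), and \gamma_i is used as shorthand for entity i's fixed effect (the combined contribution of the D dummies).

```latex
\begin{align*}
Y_{i2} &= \beta_0 + \beta_1 X_{i2} + \gamma_i + \delta_2 + v_{i2}, \\
Y_{i1} &= \beta_0 + \beta_1 X_{i1} + \gamma_i + v_{i1}, \\
\Delta Y_i = Y_{i2} - Y_{i1}
       &= \delta_2 + \beta_1 \,(X_{i2} - X_{i1}) + (v_{i2} - v_{i1}) \\
       &= \delta_2 + \beta_1 X_{i2} + \Delta v_i
          \qquad \text{when } X_{i1} = 0 .
\end{align*}
```

The entity effect and the common intercept cancel, and the period-2 time effect \delta_2 plays the role of the intercept in the differenced regression.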


10) The New Jersey-Pennsylvania study on the effect of minimum wages on employment mentioned in your textbook used a “before” and “after” comparison of means. The difference-in-difference estimate turned out to be 2.76 with a standard error of 1.36. The authors also used a difference-in-differences estimator with additional regressors of the type △Yi = β0 + β1 Xi + β2 W1i + ... + β1+r Wri + ui where i = 1, …, 410. X is a binary variable taking on the value one for the 331 observations in New Jersey. Since the authors looked at Burger King, KFC, Wendy’s, and Roy Rogers fast food restaurants and the restaurant could be company owned, four W-variables were added. (a) Given that there are four chains and the possibility of company ownership, why did the authors not include five W-variables? (b) OLS estimation resulted in β̂1 of 2.30 with a standard error of 1.20. Test for statistical significance and specify the alternative hypothesis. (c) Why is this estimate different from the number calculated from △Ȳtreatment – △Ȳcontrol = 2.76? What is the advantage of employing this estimator over the simple difference-in-difference estimator? Answer: (a) Including a fifth W-variable would have resulted in perfect multicollinearity, since the four chain indicators would sum to one and duplicate the intercept. (b) The t-statistic is 2.30/1.20 = +1.92. If the alternative hypothesis was H1 : β1 < 0, then you cannot reject the null hypothesis. If the alternative hypothesis was H1 : β1 ≠ 0, then you cannot reject the null hypothesis at the 5% level, although you can at the 10% level. The choice of alternative hypothesis depends on prior expectations, and standard economic theory would suggest H1 : β1 < 0. (c) The difference is small in terms of the standard error and may be due to sample variation. Although the difference-in-difference estimator is consistent, the difference-in-difference estimator with additional regressors can be more efficient. It is different because it stems from using the multiple regression model △Yi = β0 + β1 Xi + β2 W1i + ... + β1+r Wri + ui, i = 1,..., n rather than the regression with a single regressor △Yi = β0 + β1 Xi + ui, i = 1,..., n, and E(ui | Xi, W1i, ..., Wri) = 0 may not hold. In that case, β̂1 is consistent as long as there is conditional mean independence. The inclusion of the characteristics also allows for testing for random receipt of treatment and random assignment using the usual F-statistic in auxiliary regressions.
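How the conclusion in part (b) depends on the alternative hypothesis can be made concrete with large-sample normal p-values. This is a minimal sketch: the helper `normal_cdf` is written here from `math.erf`, and using the standard normal distribution is the usual large-sample approximation, assumed to be adequate with 410 observations.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


beta1_hat = 2.30   # estimate from part (b)
se = 1.20          # its standard error
t_stat = beta1_hat / se

# Two-sided alternative, H1: beta1 != 0.
p_two_sided = 2.0 * (1.0 - normal_cdf(abs(t_stat)))

# One-sided alternative suggested by standard theory, H1: beta1 < 0.
# Only large negative t-statistics count as evidence against H0.
p_left_tail = normal_cdf(t_stat)

print(f"t-statistic:                 {t_stat:.2f}")
print(f"p-value, H1: beta1 != 0:     {p_two_sided:.3f}")
print(f"p-value, H1: beta1 < 0:      {p_left_tail:.3f}")
```

The two-sided p-value falls between 0.05 and 0.10, while the left-tail p-value is close to one because the estimate has the opposite sign from the one the alternative predicts.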


Chapter 14 Introduction to Time Series Regression and Forecasting 14.1 Multiple Choice 1) Pseudo out of sample forecasting can be used for the following reasons with the exception of A) giving the forecaster a sense of how well the model forecasts at the end of the sample. B) estimating the RMSFE. C) analyzing whether or not a time series contains a unit root. D) evaluating the relative forecasting performance of two or more forecasting models. Answer: C 2) Autoregressive distributed lag models include A) current and lagged values of the error term. B) lags of the dependent variable, and lagged values of additional predictor variables. C) current and lagged values of the residuals. D) lags and leads of the dependent variable. Answer: B 3) Time series variables fail to be stationary when A) the economy experiences severe fluctuations. B) the population regression has breaks. C) there is strong seasonal variation in the data. D) there are no trends. Answer: B 4) Departures from stationarity A) jeopardize forecasts and inference based on time series regression. B) occur often in cross-sectional data. C) can be made to have less severe consequences by using log-log specifications. D) cannot be fixed. Answer: A 5) In order to make reliable forecasts with time series data, all of the following conditions are needed with the exception of A) coefficients having been estimated precisely. B) the regression having high explanatory power. C) the regression being stable. D) the presence of omitted variable bias. Answer: D 6) The first difference of the logarithm of Yt equals A) the first difference of Y. B) the difference between the lead and the lag of Y. C) approximately the growth rate of Y when the growth rate is small. D) the growth rate of Y exactly. Answer: C 7) The time interval between observations can be all of the following with the exception of data collected A) daily. B) by decade. C) bi-weekly. D) across firms. Answer: D
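Question 6 rests on the approximation ln(Yt) - ln(Yt-1) ≈ (Yt - Yt-1)/Yt-1 for small changes. The short sketch below compares the two quantities for a few hypothetical growth rates; the function names and the starting value of 100 are arbitrary illustrations.

```python
import math


def growth_rate(y_prev, y_curr):
    """Exact proportional change from y_prev to y_curr."""
    return (y_curr - y_prev) / y_prev


def log_difference(y_prev, y_curr):
    """First difference of the natural logarithm."""
    return math.log(y_curr) - math.log(y_prev)


y_prev = 100.0
for pct in (0.01, 0.05, 0.50):
    y_curr = y_prev * (1.0 + pct)
    exact = growth_rate(y_prev, y_curr)
    approx = log_difference(y_prev, y_curr)
    print(f"growth {exact:6.2%}   log difference {approx:.4f}   gap {exact - approx:.4f}")
```

The gap is negligible at one percent growth and becomes substantial at fifty percent, which is why the statement holds only "when the growth rate is small."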


8) One reason for computing the logarithms (ln), or changes in logarithms, of economic time series is that A) numbers often get very large. B) economic variables are hardly ever negative. C) they often exhibit growth that is approximately exponential. D) natural logarithms are easier to work with than base 10 logarithms. Answer: C 9) The jth autocorrelation coefficient is defined as
A) cov(Yt, Yt-1) / √[var(Yt) var(Yt-1)]
B) cov(Yt, Yt-j-1) / √[var(Yt) var(Yt-j)]
C) cov(Yt, ut) / √[var(Yt) var(ut)]
D) cov(Yt, Yt-j) / √[var(Yt) var(Yt-j)]

Answer: D 10) Negative autocorrelation in the change of a variable implies that A) the variable contains only negative values. B) the series is not stable. C) an increase in the variable in one period is, on average, associated with a decrease in the next. D) the data is negatively trended. Answer: C 11) An autoregression is a regression A) of a dependent variable on lags of regressors. B) that allows for the errors to be correlated. C) model that relates a time series variable to its past values. D) to predict sales in a certain industry. Answer: C 12) The root mean squared forecast error (RMSFE) is defined as
A) E(YT - ŶT|T-1)
B) √E[(YT+1 - ŶT+1|T)²]
C) √[(YT - ŶT|T-1)²]
D) √E(YT - ŶT|T-1)
Answer: B 13) One of the sources of error in the RMSFE in the AR(1) model is A) the error in estimating the coefficients β0 and β1. B) due to measuring variables in logarithms. C) that the value of the explanatory variable is not known with certainty when making a forecast. D) the model only looks at the previous period’s value of Y when the entire history should be taken into account. Answer: A
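The idea behind questions 9 and 10 can be illustrated with a sample autocorrelation. The sketch below uses one common sample analogue (overall mean, division by T in both numerator and denominator), which is an assumption and may differ slightly from the textbook's formula; the alternating series is invented to show what negative first-order autocorrelation looks like.

```python
from statistics import mean


def sample_autocorrelation(y, j):
    """Sample j-th autocorrelation: autocovariance at lag j over the variance."""
    T = len(y)
    y_bar = mean(y)
    autocov = sum((y[t] - y_bar) * (y[t - j] - y_bar) for t in range(j, T)) / T
    variance = sum((v - y_bar) ** 2 for v in y) / T
    return autocov / variance


# Hypothetical changes that flip sign every period.
changes = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

r1 = sample_autocorrelation(changes, 1)
r2 = sample_autocorrelation(changes, 2)

print(f"first autocorrelation:  {r1:.3f}")
print(f"second autocorrelation: {r2:.3f}")
```

A negative first autocorrelation means an increase in one period tends to be followed by a decrease in the next, exactly the pattern in option C of question 10.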


14) The forecast is A) made for some date beyond the data set used to estimate the regression. B) another word for the OLS predicted value. C) equal to the residual plus the OLS predicted value. D) close to 1.96 times the standard deviation of Y during the sample. Answer: A 15) The AR(p) model A) is defined as Yt = β0 + βp Yt-p + ut. B) represents Yt as a linear function of p of its lagged values. C) can be represented as follows: Yt = β0 + β1 Xt + βp Yt-p + ut. D) can be written as Yt = β0 + β1 Yt-1 + ut-p. Answer: B 16) The ADL(p,q) model is represented by the following equation A) Yt = β0 + βp Yt-p + δq Xt-q + ut. B) Yt = β0 + β1 Yt-1 + β2 Yt-2 + ... + βp Yt-p + δq ut-q. C) Yt = β0 + β1 Yt-1 + β2 Yt-2 + ... + βp Yt-p + δ0 + δ1 Xt-1 + ut-q. D) Yt = β0 + β1 Yt-1 + β2 Yt-2 + ... + βp Yt-p + δ1 Xt-1 + δ2 Xt-2 + ... + δq Xt-q + ut. Answer: D 17) Stationarity means that the A) error terms are not correlated. B) probability distribution of the time series variable does not change over time. C) time series has a unit root. D) forecasts remain within 1.96 standard deviations outside the sample period. Answer: B 18) The Time Series Regression with Multiple Predictors A) is the same as the ADL(p,q) with additional predictors and their lags present. B) gives you more than one prediction. C) cannot be estimated by OLS due to the presence of multiple lags. D) requires that the k regressors and the dependent variable have nonzero, finite eighth moments. Answer: A 19) The Granger Causality Test A) uses the F-statistic to test the hypothesis that certain regressors have no predictive content for the dependent variable beyond that contained in the other regressors. B) establishes the direction of causality (as used in common parlance) between X and Y in addition to correlation. C) is a rather complicated test for statistical independence. D) is a special case of the Augmented Dickey-Fuller test.
Answer: A 20) To choose the number of lags in either an autoregression or in a time series regression model with multiple predictors, you can use any of the following test statistics with the exception of the A) F-statistic. B) Akaike Information Criterion. C) Bayes Information Criterion. D) Augmented Dickey-Fuller test. Answer: D


21) The random walk model is an example of a A) deterministic trend model. B) binomial model. C) stochastic trend model. D) stationary model. Answer: C 22) Problems caused by stochastic trends include all of the following with the exception of A) the estimator of an AR(1) is biased towards zero if its true value is one. B) the model can no longer be estimated by OLS. C) t-statistics on regression coefficients can have a nonnormal distribution, even in large samples. D) the presence of spurious regression. Answer: B 23) The Augmented Dickey Fuller (ADF) t-statistic A) has a normal distribution in large samples. B) has the identical distribution whether or not a trend is included. C) is a two-sided test. D) is an extension of the Dickey-Fuller test when the underlying model is AR(p) rather than AR(1). Answer: D 24) If a “break” occurs in the population regression function, then A) inference and forecasting are compromised when neglecting it. B) an Augmented Dickey Fuller test, rather than the Dickey Fuller test, should be used to test for stationarity. C) this suggests the presence of a deterministic trend in addition to a stochastic trend. D) forecasting, but not inference, is unaffected, if the break occurs during the first half of the sample period. Answer: A 25) You should use the QLR test for breaks in the regression coefficients, when A) the Chow F-test has a p value of between 0.05 and 0.10. B) the suspected break date is not known. C) there are breaks in only some, but not all, of the regression coefficients. D) the suspected break date is known. Answer: B 26) The Bayes-Schwarz Information Criterion (BIC) is given by the following formula
A) BIC(p) = ln[SSR(p)/T] + (p+1) ln(T)/T
B) BIC(p) = ln[SSR(p)/T] + (p+1) 2/T
C) BIC(p) = ln[SSR(p)/T] - (p+1) ln(T)/T
D) BIC(p) = ln[SSR(p)/T] × (p+1) ln(T)/T
Answer: A


27) The Akaike Information Criterion (AIC) is given by the following formula
A) AIC(p) = ln[SSR(p)/T] + (p+1) ln(T)/T
B) AIC(p) = ln[SSR(p)/T] + (p+1) 2/T
C) AIC(p) = ln[SSR(p)/T] + (p+2)/T
D) AIC(p) = ln[SSR(p)/T] × (p+1) 2/T
Answer: B 28) The BIC is a statistic A) commonly used to test for serial correlation B) only used in cross-sectional analysis C) developed by the Bank of England in its river of blood analysis D) used to help the researcher choose the number of lags in an autoregression Answer: D 29) The AIC is a statistic A) that is used as an alternative to the BIC when the sample size is small (T < 50) B) often used to test for heteroskedasticity C) used to help a researcher choose the number of lags in a time series with multiple predictors D) all of the above Answer: C 30) The formulae for the AIC and the BIC are different. The A) AIC is preferred because it is easier to calculate B) BIC is preferred because it is a consistent estimator of the lag length C) difference is irrelevant in practice since both information criteria lead to the same conclusion D) AIC will typically underestimate p with non-zero probability Answer: B
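The two formulas in questions 26 and 27 differ only in the penalty term, ln(T)/T versus 2/T per estimated coefficient. The sketch below implements both and applies them to made-up sums of squared residuals for T = 160 observations; the SSR values are hypothetical and chosen only to show that the heavier BIC penalty can select a shorter lag length than the AIC.

```python
import math


def bic(ssr, T, p):
    """Bayes information criterion for an AR(p) with sum of squared residuals ssr."""
    return math.log(ssr / T) + (p + 1) * math.log(T) / T


def aic(ssr, T, p):
    """Akaike information criterion for an AR(p) with sum of squared residuals ssr."""
    return math.log(ssr / T) + (p + 1) * 2.0 / T


T = 160
# Hypothetical SSR(p) for p = 0, ..., 4 (SSR can only fall as lags are added).
ssr_by_lag = {0: 29.0, 1: 17.6, 2: 17.5, 3: 17.4, 4: 16.8}

best_bic = min(ssr_by_lag, key=lambda p: bic(ssr_by_lag[p], T, p))
best_aic = min(ssr_by_lag, key=lambda p: aic(ssr_by_lag[p], T, p))

for p, ssr in ssr_by_lag.items():
    print(f"p = {p}:  BIC = {bic(ssr, T, p):.4f}   AIC = {aic(ssr, T, p):.4f}")
print("lag length chosen by BIC:", best_bic)
print("lag length chosen by AIC:", best_aic)
```

Because ln(T) exceeds 2 for any T above about 7, the BIC never chooses more lags than the AIC on the same set of SSR values.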


14.2 Essays and Longer Questions 1) You set out to forecast the unemployment rate in the United States (UrateUS), using quarterly data from 1960, first quarter, to 1999, fourth quarter. (a) The following table presents the first four autocorrelations for the United States aggregate unemployment rate and its change for the time period 1960 (first quarter) to 1999 (fourth quarter). Explain briefly what these two sets of autocorrelations measure.

First Four Autocorrelations of the U.S. Unemployment Rate and Its Change, 1960:I – 1999:IV

Lag   Unemployment Rate   Change of Unemployment Rate
1     0.97                0.62
2     0.92                0.32
3     0.83                0.12
4     0.75                -0.07

(b) The accompanying table gives changes in the United States aggregate unemployment rate for the period 1999:I-2000:I and levels of the current and lagged unemployment rates for 1999:I. Fill in the blanks for the missing unemployment rate levels.

Changes in Unemployment Rates in the United States, First Quarter 1999 to First Quarter 2000

Quarter    U.S. Unemployment Rate   First Lag   Change in Unemployment Rate
1999:I     4.3                      4.4         -0.1
1999:II    ____                     ____        0.0
1999:III   ____                     ____        -0.1
1999:IV    ____                     ____        -0.1
2000:I     ____                     ____        -0.1

(c) You decide to estimate an AR(1) in the change in the United States unemployment rate to forecast the aggregate unemployment rate. The result is as follows (standard errors in parentheses):

△UrateUSt = -0.003 + 0.621 △UrateUSt-1, R2 = 0.393, SER = 0.255
            (0.022)  (0.106)

The AR(1) coefficient for the change in the inflation rate was 0.211 and the regression R2 was 0.04. What does the difference in the results suggest here? (d) The textbook used the change in the log of the price level to approximate the inflation rate, and then predicted the change in the inflation rate. Why aren’t logarithms used here? (e) If much of the forecast error arises as a result of future error terms dominating the error resulting from estimating the unknown coefficients, then what is your best guess of the RMSFE here? (f) The actual unemployment rate during the fourth quarter of 1999 is 4.1 percent, and it decreased from the third quarter to the fourth quarter by 0.1 percentage points. What is your forecast for the unemployment rate level in the first quarter of 2000? (g) You want to see how sensitive your forecast is to changes in the specification. Given that you have estimated the regression with quarterly data, you consider an AR(4) model. This results in the following output (standard errors in parentheses):

△UrateUSt = -0.005 + 0.663 △UrateUSt-1 - 0.082 △UrateUSt-2 + 0.106 △UrateUSt-3 – 0.176 △UrateUSt-4, R2 = 0.416, SER = 0.253
            (0.022)  (0.125)             (0.139)             (0.117)             (0.091)

What is your forecast for the unemployment rate level in 2000:I? Compare the forecast error of the AR(4) model with the forecast error of the AR(1) model. (h) There does not seem to be much difference in the forecast of the unemployment rate level, whether you use the AR(1) or the AR(4). Given the various information criteria and the regression R2 below, which model should you use for forecasting?

p   BIC     AIC      R2
0   0.604   0.624    0.000
1   0.158   0.1181   0.393
2   0.185   0.125    0.397
3   0.217   0.138    0.400
4   0.218   0.1183   0.416
5   0.249   0.130    0.417
6   0.277   0.138    0.420
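The one-step-ahead forecasts asked for in parts (f) and (g) follow mechanically from the two estimated equations. The sketch below is illustrative only: the dictionary layout and function names are invented, the coefficients are those reported in the question, and it assumes that all four regressors in the AR(4) are lagged changes of the unemployment rate and that the quarterly changes for 1999 are the ones listed in part (b).

```python
# Estimated coefficients as reported in parts (c) and (g).
ar1 = {"const": -0.003, "lags": [0.621]}
ar4 = {"const": -0.005, "lags": [0.663, -0.082, 0.106, -0.176]}

# Quarterly changes in the unemployment rate, 1999:I through 1999:IV.
changes_1999 = [-0.1, 0.0, -0.1, -0.1]
level_1999q4 = 4.1


def forecast_change(model, past_changes):
    """One-step-ahead forecast of the change in the unemployment rate.

    The most recent change is paired with the first lag coefficient,
    the one before it with the second lag coefficient, and so on.
    """
    recent_first = past_changes[::-1]
    return model["const"] + sum(b * d for b, d in zip(model["lags"], recent_first))


def forecast_level(model, past_changes, last_level):
    """Forecast of the level: last observed level plus the forecast change."""
    return last_level + forecast_change(model, past_changes)


for name, model in (("AR(1)", ar1), ("AR(4)", ar4)):
    d_hat = forecast_change(model, changes_1999)
    y_hat = forecast_level(model, changes_1999, level_1999q4)
    print(f"{name}: forecast change {d_hat:+.4f}, forecast level for 2000:I {y_hat:.3f}")
```

Both models predict a small decline from the 1999:IV level of 4.1 percent, and the two forecasts differ by only about 0.02 percentage points.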

Answer: (a) There is a very strong positive autocorrelation for the unemployment rate level. The 1st to 4th autocorrelation coefficients are even higher than for the inflation rate. This suggests that a high (low) level of the unemployment rate will persist for quite a while. Although the autocorrelations decline, they are still high even at lag 4. This reflects the long-term trends in unemployment rates. If during a given quarter in the 1960s or the 1990s the unemployment rate was low, then it was also low in the following quarter. If the unemployment rate was high in a given quarter, as it was in the early 1980s, then it was also high in the following quarter. Different from the inflation rate results discussed in the text, the change in the unemployment rate also shows positive autocorrelations. Furthermore, these are quite large for the first lag. Eventually, after a year, they turn negative. Hence an increase (decrease) in the unemployment rate is typically followed by an increase (decrease) in the following quarters, before the process reverses itself.
(b) Changes in Unemployment Rates in the United States from the First Quarter 1999 to the First Quarter 2000

Quarter     U.S. Unemployment Rate   First Lag   Change in Unemployment Rate
1999:I      4.3                      4.4         -0.1
1999:II     4.3                      4.3          0.0
1999:III    4.2                      4.3         -0.1
1999:IV     4.1                      4.2         -0.1
2000:I      4.0                      4.1         -0.1

(c) There is a higher persistence in the change of unemployment rate than in the change of the inflation rate. The higher regression R2 means that almost 40 percent of the variation in the change of the unemployment rate can be explained by a single regressor, namely its lag. Students may recall Figure 12.1 from the textbook, which shows a much smoother behavior for the levels, and hence the differences, for the unemployment rate. (d) The change of the log of the price level was used to convert a level variable (prices) into a change of its growth rate. Unemployment is already measured as a rate in the above example. Hence differencing the variable results in a change in the rate.

(e) In this situation, the SER approximates the RMSFE. In the case of the change of the unemployment rate, it is 0.255 percentage points.
(f) UrateUS_{1999:IV} = 4.1 and the predicted change in the unemployment rate from 1999:IV to 2000:I is 0.06, or 0.1 rounded. The forecasted unemployment rate for 2000:I is UrateUS_{2000:I|1999:IV} = UrateUS_{1999:IV} + △UrateUS_{2000:I|1999:IV} = 4.1% + 0.1% = 4.2%. The model therefore forecasts a slight increase in the unemployment rate.
(g) △UrateUS_{2000:I|1999:IV} = -0.005 + 0.663 × (-0.1) - 0.082 × (-0.1) + 0.106 × 0.0 - 0.176 × (-0.1) ≅ -0.046. (Students may suggest a forecast of -0.1 or 0.0. The answer will proceed with 0.0.) The corresponding forecast for the unemployment rate in 2000:I is then 4.1% + 0.0% = 4.1%. The forecast error for the AR(4) model is 4.0% - 4.1% = -0.1%, which is slightly smaller in absolute value than the -0.2% forecast error of the AR(1) model.
(h) Close call, but both the BIC and the AIC favor the AR(1) over the AR(4). (The F-test statistic for restricting the AR(4) to an AR(1) is 1.49 with a p-value of 0.21.)
2) You have collected quarterly data on Canadian unemployment (UrateC) and inflation (InfC) from 1962 to 1999 with the aim to forecast Canadian inflation. (a) To get a better feel for the data, you first inspect the plots for the series.


Inspecting the Canadian inflation rate plot and having calculated the first autocorrelation to be 0.79 for the sample period, do you suspect that the Canadian inflation rate has a stochastic trend? What more formal methods do you have available to test for a unit root?
(b) You run the following regression, where the numbers in parentheses are homoskedasticity-only standard errors:
△InfC_t = 0.49 - 0.10 InfC_{t-1} - 0.39 △InfC_{t-1} - 0.33 △InfC_{t-2} - 0.21 △InfC_{t-3} + 0.05 △InfC_{t-4}
(0.28) (0.05) (0.09) (0.08)

Test for the presence of a stochastic trend. Should you have used heteroskedasticity-robust standard errors? Does the fact that you use quarterly data suggest including four lags in the above regression, or how should you determine the number of lags?
(c) To forecast the Canadian inflation rate for 2000:I, you estimate an AR(1), AR(4), and an ADL(4,1) model for the sample period 1962:I to 1999:IV. The results are as follows:
△InfC_t = 0.002 - 0.31 △InfC_{t-1}
(0.014) (0.10)
△InfC_t = 0.021 - 0.46 △InfC_{t-1} - 0.39 △InfC_{t-2} - 0.25 △InfC_{t-3} + 0.03 △InfC_{t-4}
(0.158) (0.10) (0.11) (0.08) (0.07)
△InfC_t = 1.279 - 0.51 △InfC_{t-1} - 0.44 △InfC_{t-2} - 0.30 △InfC_{t-3} - 0.02 △InfC_{t-4} - 0.16 UrateC_{t-1}
(0.57) (0.10) (0.11) (0.09) (0.08) (0.07)
In addition, you have the following information on inflation in Canada during the four quarters of 1999 and the first quarter of 2000:

Inflation and Unemployment in Canada, First Quarter 1999 to First Quarter 2000

Quarter     Unemployment Rate   Rate of Inflation at an   First Lag     Change in Inflation
            (UrateC_t)          Annual Rate (Inf_t)       (Inf_{t-1})   (△Inf_t)
1999:I      7.7                 0.8                       0.8            0.0
1999:II     7.9                 4.3                       0.8            3.5
1999:III    7.7                 2.9                       4.3           -1.4
1999:IV     7.0                 1.3                       2.9           -1.5
2000:I      6.8                 2.1                       1.3            0.8

For each of the three models, calculate the predicted inflation rate for the period 2000:I and the forecast error.
(d) Perform a test on whether or not Canadian unemployment rates Granger-cause the Canadian inflation rate.
Answer: (a) A small autocorrelation coefficient together with a time series plot which displays no apparent trend suggests the absence of a stochastic trend. Here the first autocorrelation coefficient is fairly high and the figure displays long-run swings similar to the U.S. figure discussed in the textbook. To test for a stochastic trend using more formal methods requires use of the Dickey-Fuller test, or better, the augmented Dickey-Fuller test.
(b) The t-statistic on the lagged inflation rate level is (-2.00). The critical value for the ADF statistic is (-2.57) at the 10% level. Hence you cannot reject the null hypothesis of a unit root. The ADF statistic requires computation using homoskedasticity-only standard errors. Hence heteroskedasticity-robust standard errors should not be used. The number of lags included should be determined using the AIC information criterion, rather than the BIC, since it results in a better performance of the ADF statistic in finite samples. (As with the U.S. data used in the textbook, this results in a chosen lag length of three. The ADF statistic in that case is (-1.91), which is still smaller in absolute value than the critical value at the 10% level.)
(c) △InfC_{2000:I|1999:IV} for the various models is: 0.002 - 0.31 × (-1.5) = 0.467 ≅ 0.5 (AR(1)); 0.021 - 0.46 × (-1.5) - 0.39 × (-1.4) - 0.25 × 3.5 + 0.03 × 0.0 = 0.382 ≅ 0.4 (AR(4)); 1.279 - 0.51 × (-1.5) - 0.44 × (-1.4) - 0.30 × 3.5 - 0.02 × 0.0 - 0.16 × 7.0 = 0.49 ≅ 0.5 (ADL(4,1)). InfC_{2000:I|1999:IV} then is: 1.3 + 0.5 = 1.8 (AR(1)); 1.3 + 0.4 = 1.7 (AR(4)); 1.3 + 0.5 = 1.8 (ADL(4,1)). The forecast error, 2.1 minus the forecast, is: 0.3 (AR(1)); 0.4 (AR(4)); 0.3 (ADL(4,1)).
(d) Since the ADL(4,1) only included the lagged unemployment rate, the t-statistic replaces the F-statistic typically used for this test. The t-statistic is (-2.256) and the F-statistic is (-2.256)^2 ≈ 5.09. Both are statistically significant at the 5% level with a p-value of 0.026. Hence the null hypothesis that the unemployment rate does not Granger-cause the inflation rate is rejected.
3) There is some evidence that the Phillips curve has been unstable during the 1962 to 1999 period for the United States, and in particular during the 1990s. You set out to investigate whether or not this instability also occurred in other places. Canada is a particularly interesting case, due to its proximity to the United States and the fact that many features of its economy are similar to those of the U.S. (a) Reading up on some of the comparative economic performance literature, you find that Canadian unemployment rates were roughly the same as U.S. unemployment rates from the 1920s to the early 1980s. The accompanying figure shows that a gap opened between the unemployment rates of the two countries in 1982, which has persisted to this date.


Inspection of the graph and data suggests that the break occurred during the second quarter of 1982. To investigate whether the Canadian Phillips curve shows a break at that point, you estimate an ADL(4,4) model for the sample period 1962:I-1999:IV and perform a Chow test. Specifically you postulate that the constant and coefficients of the unemployment rates changed at that point. The F-statistic is 1.96. Find the critical value from the F-table and test the null hypothesis that no break occurred at that time. Is there any reason why you should be skeptical about the result regarding the break and using the Chow test to detect it? (b) You consider alternative ways to test for a break in the relationship. The accompanying figure shows the F-statistics testing for a break in the ADL(4,4) equation at different dates.

The QLR-statistic with 15% trimming is 3.11. Comment on the figure and test for the hypothesis of a break in the ADL(4,4) regression.
(c) To test for the stability of the Canadian Phillips curve in the 1990s, you decide to perform a pseudo out-of-sample forecasting exercise. For the 24 quarters from 1994:I-1999:IV you use the ADL(4,4) model to calculate the forecasted change in the inflation rate, the resulting forecasted inflation rate, and the forecast error. The standard error of the ADL(4,4) for the estimation sample period 1962:1-1993:4 is 1.91 and the sample RMSFE is 1.70. The average forecast error for the 24 inflation rates is 0.003 and the sample standard deviation of the forecast errors is 0.82. Calculate the t-statistic and test the hypothesis that the mean out-of-sample forecast error is zero. Comment on the result and the accompanying figure of the actual and forecasted inflation rate.

Answer: (a) The critical value from the F_{5,∞} distribution is 1.85 at the 10% significance level, and 2.21 at the 5% significance level. (The p-value is actually 0.088.) Hence, at the 10% significance level, you can reject the null hypothesis that the constant and the four lagged unemployment rate coefficients remained constant over the entire sample period, which suggests that a break occurred in 1982:2. There is not sufficient evidence to reject the null hypothesis at the 5% significance level. However, the text emphasizes that “[preliminary] estimation of the break date means that the usual F critical values cannot be used for the Chow test for a break at that date.” This applies to the above example since the series was analyzed before testing.
(b) The critical value for the QLR(5) statistic with 15% trimming is 3.26 at the 10% level. Hence you cannot reject the null hypothesis of no break in the regression. Except for the peak at the end of 1982 and the beginning of 1983, the F-statistic does not really come close to the critical value.
(c) The average forecast error is very small. The t-statistic is
t = 0.003 / (0.82 / √24) ≈ 0.018
and therefore you cannot reject the hypothesis that the mean out-of-sample forecast error is zero. Indeed, you get the same impression from the graph, which shows that there are very few periods of systematically too large or too small inflation rate forecasts. The conclusion is that the Canadian Phillips curve has done well as a model for forecasting at the end of the sample. This result is quite different from the results in the textbook for the U.S. Phillips curve.
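The t-statistic in part (c) is the mean forecast error divided by its standard error, which is the sample standard deviation of the forecast errors over the square root of the number of forecasts. The arithmetic can be sketched in Python as follows; the function name `mean_forecast_error_tstat` is hypothetical, and the inputs are the values quoted in the question.

```python
import math

def mean_forecast_error_tstat(mean_error, sd_error, n_forecasts):
    """t-statistic for H0: the mean out-of-sample forecast error is zero."""
    standard_error = sd_error / math.sqrt(n_forecasts)
    return mean_error / standard_error

# Values quoted in the question: 24 quarterly forecasts, 1994:I-1999:IV.
t_stat = mean_forecast_error_tstat(0.003, 0.82, 24)

# Compare with the two-sided 5% normal critical value.
reject_at_5_percent = abs(t_stat) > 1.96
print(f"t = {t_stat:.3f}, reject at 5%: {reject_at_5_percent}")
```

Because the statistic is far inside the critical value, the conclusion does not depend on the choice of significance level.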


4) You collect monthly data on the money supply (M2) for the United States from 1962:1-2002:4 to forecast future money supply behavior.

where LM2 and DLM2 are the log level and growth rate of M2.
(a) Using quarterly data, when analyzing inflation and unemployment in the United States, the textbook converted log levels of variables into growth rates by differencing the log levels, and then multiplying these by 400. Given that you have monthly data, how would you proceed here?
(b) How would you go about testing for a stochastic trend in LM2 and DLM2? Be specific about how to decide the number of lags to be included and whether or not to include a deterministic trend in your test. The textbook found the (quarterly) inflation rate to have a unit root. Does this have any effect on your expectation about whether or not the (monthly) money growth rate should be stationary?
(c) You decide to conduct an ADF unit root test for LM2, DLM2, and the change in the growth rate △DLM2. This results in the following t-statistics on the parameter of interest.

Series: LM2, DLM2, △DLM2
ADF t-statistics, in the order listed: with trend -0.505; without trend -4.100; with trend -4.592; without trend -8.897

Find the critical value at the 1%, 5%, and 10% level and decide which of the coefficients is significant. What is the alternative hypothesis?
(d) In forecasting the money growth rate, you add lags of the monetary base growth rate (DLMB) to see if you can improve on the forecasting performance of a chosen AR(10) model in DLM2. You perform a Granger causality test on the 9 lags of DLMB and find an F-statistic of 2.31. Discuss the implications.
(e) Curious about the result in the previous question, you decide to estimate an ADL(10,10) for DLMB and calculate the F-statistic for the Granger causality test on the 9 lag coefficients of DLM2. This turns out to be 0.66. Discuss.
(f) Is there any a priori reason for you to be skeptical of the results? What other tests should you perform?
Answer: (a) To annualize monthly growth rates, you would need to multiply them by 1,200. The annualized growth rate of money would be 1200 × △ln(M2_t), that is, 1200 × △LM2_t.
(b) The ADF statistic should be calculated to test for the presence of a unit root in each of the series. The BIC information criterion can be used to determine the lag length, and homoskedasticity-only standard errors, rather than heteroskedasticity-robust standard errors, should be considered for the regression. Studies of the finite-sample properties of unit root tests have shown that it is better to use the AIC criterion although it overestimates the lag length on average. Given that money growth determines the inflation rate in the long run, your expectation would be to also find a unit root for money growth.
(c) LM2 contains a time trend, and hence the critical values for an intercept and a time trend are relevant. These are (-3.96), (-3.41), and (-3.12) for the three significance levels respectively. Hence you cannot reject the null hypothesis of a unit root for LM2. The growth rate of money does not have a time trend for the entire sample period, so the intercept-only critical values should be used.
These are (-3.43), (-2.86), and (-2.57) respectively. Hence you are able to reject the null hypothesis of a unit root for money growth at the 1% significance level. The alternative hypothesis is that there is no unit root. However, failure to reject the null hypothesis only means that there is “insufficient evidence to conclude that it is false.”
(d) The critical value for the null hypothesis that monetary base growth rates do not Granger cause money supply growth rates is F_{9,∞} = 1.88 at the 5% significance level, and 2.41 at the 1% significance level. Hence you can reject the null hypothesis at the 5% level, but not at the 1% level.
(e) In this situation, you cannot reject the null hypothesis that the money supply growth does not Granger cause monetary base growth. This makes sense if the Federal Reserve uses monetary base growth as an instrument and money supply growth is not a target.
(f) It is somewhat surprising to find money growth not to contain a unit root when the inflation rate does. It is also possible that the relationship has changed over time, as money markets have been liberalized during the sample period. Hence it would help to test for breaks using the QLR statistic and pseudo out-of-sample forecasts.
5) Having learned in macroeconomics that consumption depends on disposable income, you want to determine whether or not disposable income helps predict future consumption. You collect data for the sample period 1962:I to 1995:IV and plot the two variables.
(a) To determine whether or not past values of personal disposable income growth rates help to predict consumption growth rates, you estimate the following relationship.
△LnC_t = 1.695 + 0.126 △LnC_{t-1} + 0.153 △LnC_{t-2} + 0.294 △LnC_{t-3} - 0.008 △LnC_{t-4}
(0.484) (0.099) (0.103) (0.103) (0.102)
+ 0.088 △LnY_{t-1} - 0.031 △LnY_{t-2} - 0.050 △LnY_{t-3} - 0.091 △LnY_{t-4}
(0.076) (0.078) (0.074)

The Granger causality test statistic for the exclusion of all four lags of the income growth rate is 0.98. Find the critical value for the 1%, the 5%, and the 10% level from the relevant table and make a decision on whether or not these additional variables Granger cause the change in the growth rate of consumption.
(b) You are somewhat surprised about the result in the previous question and wonder how sensitive it is with regard to the lag length in the ADL(p,q) model. As a result, you calculate the BIC and AIC for p and q from 0 to 6. The results are displayed in the accompanying table:

p,q    BIC      AIC
0      5.061    5.039
1      5.052    4.988
2      5.095    4.989
3      5.110    4.960
4      5.165    4.972
5      5.206    4.973
6      5.270    4.992

Which values for p and q should you choose?
(c) Estimating an ADL(1,1) model gives you a t-statistic of 1.28 on the coefficient of lagged disposable income growth. What does the Granger causality test suggest about the inclusion of lagged income growth as a predictor of consumption growth?
Answer: (a) The critical value for F_{4,∞} is 3.32, 2.37, and 1.94 respectively. Since 0.98 is smaller than all three, the decision is not to reject the null hypothesis even at the 10% significance level.
(b) The BIC has its minimum at p = q = 1 (5.052). The AIC values as tabulated are lowest at p = q = 3 (4.960). Following the BIC, choose p = q = 1.
(c) For a single restriction, F = t^2, and the 5% critical value is therefore 1.96 for the t-statistic. Hence you cannot reject the null hypothesis that the coefficient on lagged disposable income growth is zero, or that disposable income growth does not Granger cause consumption growth.
6) (Requires Internet Access for the test question) The following question requires you to download data from the internet and to load it into a statistical package such as STATA or EViews.
a. Your textbook estimates an AR(1) model (equation 14.7) for the change in the inflation rate using a sample period 1962:I to 2004:IV. Go to the Stock and Watson companion website for the textbook and download the data “Macroeconomic Data Used in Chapters 14 and 16.” Enter the data for the consumer price index, calculate the inflation rate and the acceleration of the inflation rate, and replicate the result on page 526 of your textbook. Make sure to use the heteroskedasticity-robust standard error option for the estimation.
b. Next find a website with more recent data, such as the Federal Reserve Economic Data (FRED) site at the Federal Reserve Bank of St. Louis. Locate the data for the CPI, which will be monthly, and convert the data into quarterly averages. Then, using a sample from 1962:I to 2009:IV, re-estimate the above specification and comment on the changes that have occurred.
c. Based on the BIC, how many lags should be included in the forecasting equation for the change in the inflation rate? Use the new data set and sample period to answer the question.

Answer: a. The EViews output would look as follows:

Dependent Variable: D2LP
Method: Least Squares
Date: 12/30/10 Time: 20:29
Sample: 1962Q1 2004Q4
Included observations: 172
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C             0.017        0.127         0.135        0.893
D2LP(-1)     -0.238        0.097        -2.467        0.015

R-squared             0.056      Mean dependent var      0.017
Adjusted R-squared    0.051      S.D. dependent var      1.708
S.E. of regression    1.664      Akaike info criterion   3.868
Sum squared resid     470.691    Schwarz criterion       3.904
Log likelihood        -330.634   Hannan-Quinn criter.    3.883
F-statistic           10.157     Durbin-Watson stat      2.166
Prob(F-statistic)     0.002

b. Not much has changed. The intercept became smaller, but was statistically insignificant anyway. The slope coefficient increased somewhat in absolute value (as did the regression R2 with it) and its t-statistic also became stronger. Some of this is the result of data revisions (even for the old sample period the slope coefficient increased somewhat) while part of it has changed because of the longer sample period.

Dependent Variable: D2LP
Method: Least Squares
Date: 12/30/10 Time: 21:19
Sample: 1962Q1 2009Q4
Included observations: 192
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C             0.014        0.153         0.089        0.929
D2LP(-1)     -0.290        0.094        -3.070        0.002

R-squared             0.084      Mean dependent var      0.010
Adjusted R-squared    0.079      S.D. dependent var      2.203
S.E. of regression    2.114      Akaike info criterion   4.345
Sum squared resid     849.127    Schwarz criterion       4.379
Log likelihood        -415.161   Hannan-Quinn criter.    4.359
F-statistic           17.428     Durbin-Watson stat      2.203
Prob(F-statistic)     0.000

c. Using the BIC for p = 0, 1, 2, …, 6, the minimum continues to be at p = 2. Hence the BIC still favors an AR(2).
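The lag-length search in part c can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the textbook's definitions BIC(p) = ln(SSR(p)/T) + (p+1) ln(T)/T and AIC(p) = ln(SSR(p)/T) + (p+1) 2/T, fits every AR(p) on the same estimation sample, and runs on a simulated series rather than the FRED data. The function name `ar_information_criteria` is hypothetical.

```python
import numpy as np

def ar_information_criteria(y, max_p):
    """Return (p, BIC, AIC) for AR(0)..AR(max_p), all fitted on a common sample."""
    y = np.asarray(y, dtype=float)
    target = y[max_p:]                      # observations used for every p
    T = len(target)
    results = []
    for p in range(max_p + 1):
        # Constant plus lags 1..p, aligned with the common sample.
        columns = [np.ones(T)] + [y[max_p - j: len(y) - j] for j in range(1, p + 1)]
        X = np.column_stack(columns)
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        ssr = float(np.sum((target - X @ beta) ** 2))
        bic = np.log(ssr / T) + (p + 1) * np.log(T) / T
        aic = np.log(ssr / T) + (p + 1) * 2 / T
        results.append((p, bic, aic))
    return results

# Simulated AR(1) series standing in for the data (NOT the actual CPI series).
rng = np.random.default_rng(0)
shocks = rng.normal(size=2000)
series = np.zeros(2000)
for t in range(1, 2000):
    series[t] = 0.9 * series[t - 1] + shocks[t]

criteria = ar_information_criteria(series, max_p=4)
best_p_bic = min(criteria, key=lambda row: row[1])[0]
print("lag length chosen by the BIC:", best_p_bic)
```

Holding the estimation sample fixed across p keeps the SSR values comparable, which is why the first `max_p` observations are dropped for every model.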


7) Statistical inference was a concept that was not too difficult to understand when using cross-sectional data. For example, it is obvious that a population mean is not the same as a sample mean (take weight of students at your college/university as an example). With a bit of thought, it also became clear that the sample mean had a distribution. This meant that there was uncertainty regarding the population mean given the sample information, and that you had to consider confidence intervals when making statements about the population mean. The same concept carried over into the two-dimensional analysis of a simple regression: knowing the height-weight relationship for a sample of students, for example, allowed you to make statements about the population height-weight relationship. In other words, it was easy to understand the relationship between a sample and a population in cross-sections. But what about time series? Why should you be allowed to make statistical inference about some population, given a sample at hand (using quarterly data from 1962-2010, for example)? Write an essay explaining the relationship between a sample and a population when using time series.
Answer: Essays will differ by student. What is crucial here is the emphasis on stationarity, or the concept that the distribution remains constant over time. If the dependent variable and regressors are non-stationary, then conventional hypothesis tests, confidence intervals, and forecasts can be unreliable. However, if they are stationary, then it is plausible to argue that a sample will repeat itself again and again when getting additional data. It is in that sense that inference to a larger population can be made. There are two concepts crucial to stationarity which are discussed in the textbook: (i) trends, and (ii) breaks. Students should bring up methods for testing for stationarity and breaks, such as the DF and ADF statistics, and the QLR test.
8) (Requires Internet access for the test question) The following question requires you to download data from the internet and to load it into a statistical package such as STATA or EViews.
a. Your textbook suggests using two test statistics to test for stationarity: DF and ADF. Test the null hypothesis that inflation has a stochastic trend against the alternative that it is stationary by performing the DF and ADF test for a unit autoregressive root. That is, use the equation (14.34) in your textbook with four lags and without a lag of the change in the inflation rate as a regressor for sample period 1962:I to 2004:IV. Go to the Stock and Watson companion website for the textbook and download the data “Macroeconomic Data Used in Chapters 14 and 16.” Enter the data for the consumer price index, calculate the inflation rate and the acceleration of the inflation rate, and replicate the result on page 526 of your textbook. Make sure not to use the heteroskedasticity-robust standard error option for the estimation.

For the new sample period, find the DF statistic.

Finally, calculate the ADF statistic, allowing for the lag length of the inflation acceleration term to be determined by either the AIC or the BIC.

Answer: a. For the sample period 1962:I to 2004:IV, the result is as follows:

Dependent Variable: D2LP
Method: Least Squares
Date: 12/31/10 Time: 10:44
Sample: 1962Q1 2004Q4
Included observations: 172

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C             0.51         0.21          2.37         0.02
DLP(-1)      -0.11         0.04         -2.69         0.01
D2LP(-1)     -0.19         0.08         -2.32         0.02
D2LP(-2)     -0.26         0.08         -3.15         0.00
D2LP(-3)      0.20         0.08          2.51         0.01
D2LP(-4)      0.01         0.08          0.13         0.90

R-squared             0.24      Mean dependent var      0.02
Adjusted R-squared    0.21      S.D. dependent var      1.71
S.E. of regression    1.51      Akaike info criterion   3.70
Sum squared resid     380.61    Schwarz criterion       3.81
Log likelihood        -312.37   Hannan-Quinn criter.    3.75
F-statistic           10.31     Durbin-Watson stat      1.99
Prob(F-statistic)     0.00

Hence the ADF statistic is -2.69. You cannot reject the null hypothesis of non-stationarity at the 5% level (critical value -2.86), but you could at the 10% level (critical value -2.57).
b. Not much has changed. The coefficient on the lagged inflation level moved from -0.11 to -0.15 and the ADF statistic is now -2.75, so the null hypothesis of non-stationarity still cannot be rejected at the 5% level (critical value -2.86), but can be rejected at the 10% level (critical value -2.57). The regression R2 is unchanged at 0.24.

Dependent Variable: D2LP
Method: Least Squares
Date: 12/31/10 Time: 11:20
Sample: 1962Q1 2009Q4
Included observations: 192

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C             0.62         0.26          2.36         0.02
DLP(-1)      -0.15         0.05         -2.75         0.01
D2LP(-1)     -0.29         0.08         -3.54         0.00
D2LP(-2)     -0.30         0.09         -3.45         0.00
D2LP(-3)      0.03         0.08          0.31         0.76
D2LP(-4)     -0.05         0.08         -0.62         0.54

R-squared             0.24      Mean dependent var      0.01
Adjusted R-squared    0.22      S.D. dependent var      2.20
S.E. of regression    1.95      Akaike info criterion   4.20
Sum squared resid     707.46    Schwarz criterion       4.31
Log likelihood        -397.64   Hannan-Quinn criter.    4.25
F-statistic           11.54     Durbin-Watson stat      2.00
Prob(F-statistic)     0.00

c. The DF statistic is obtained by simply regressing the change in the inflation rate on the lagged level of the inflation rate. The t-statistic on the lagged inflation level is the DF statistic, which is -5.28, rejecting the null hypothesis of non-stationarity.
d. Both the AIC and the BIC have a minimum for two lags. For that case, the ADF statistic is -2.94 and the null hypothesis of non-stationarity can therefore be rejected at the 5% level (critical value -2.86), but not at the 1% level (critical value -3.43).
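The DF regression described in part c can be sketched in Python as follows. This is a minimal illustration, assuming an intercept-only specification and homoskedasticity-only standard errors; the function name `dickey_fuller_tstat` is hypothetical and the series is simulated, not the inflation data. The resulting statistic must be compared with Dickey-Fuller critical values (-2.86 at the 5% level for the intercept-only case), not with normal critical values.

```python
import numpy as np

def dickey_fuller_tstat(y):
    """t-statistic on y_{t-1} in the regression of (y_t - y_{t-1}) on a constant and y_{t-1}."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    dof = len(dy) - X.shape[1]
    sigma2 = resid @ resid / dof
    # Homoskedasticity-only covariance matrix, as the DF/ADF test requires.
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

# Simulated stationary AR(1) series (illustrative only).
rng = np.random.default_rng(1)
shocks = rng.normal(size=500)
stationary = np.zeros(500)
for t in range(1, 500):
    stationary[t] = 0.5 * stationary[t - 1] + shocks[t]

df_stat = dickey_fuller_tstat(stationary)
rejects_unit_root = df_stat < -2.86   # 5% critical value, intercept only
print(f"DF statistic: {df_stat:.2f}, reject unit root at 5%: {rejects_unit_root}")
```

Adding lagged differences as extra columns of `X` would turn the same regression into the ADF regression used in parts a and d.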

14.3 Mathematical and Graphical Problems

1) (Requires Appendix material) Define the difference operator △ = (1 - L) where L is the lag operator, such that L^j Y_t = Y_{t-j}. In general, △_j^i = (1 - L^j)^i, where i and j are typically omitted when they take the value of 1. Show the expressions in Y only when applying the difference operator to the following expressions, and give the resulting expression an economic interpretation, assuming that you are working with quarterly data:
(a) △_4 Y_t
(b) △^2 Y_t
(c) △_1 △_4 Y_t
(d) △_4^2 Y_t
Answer: (a) △_4 Y_t = (1 - L^4) Y_t = Y_t - Y_{t-4}. With quarterly data, this is the annual change. If Y is in logarithms, then this is the annual growth rate.
(b) △^2 Y_t = (1 - L)^2 Y_t = (1 - 2L + L^2) Y_t = Y_t - 2Y_{t-1} + Y_{t-2} = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = △Y_t - △Y_{t-1}. This represents the change of the change in a variable, or the “acceleration.” If Y is in logarithms, then this is the quarterly change in the growth rate. A good example would be the acceleration in the quarterly inflation rate.
(c) △_1 △_4 Y_t = (1 - L)(1 - L^4) Y_t = (1 - L - L^4 + L^5) Y_t = Y_t - Y_{t-1} - Y_{t-4} + Y_{t-5} = (Y_t - Y_{t-4}) - (Y_{t-1} - Y_{t-5}). This is the quarterly change in the annual change. If Y is in logarithms, then this is the quarterly acceleration or change in the annual growth rate.
(d) △_4^2 Y_t = (1 - L^4)^2 Y_t = (1 - 2L^4 + L^8) Y_t = Y_t - 2Y_{t-4} + Y_{t-8} = (Y_t - Y_{t-4}) - (Y_{t-4} - Y_{t-8}) = △_4 Y_t - △_4 Y_{t-4}. This represents the change in the annual change. If Y is in logarithms, then this is the change in the annual growth rate.
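The four expansions above can be checked numerically. The Python sketch below applies (1 - L^j) by array slicing and compares the composed operators with the expanded expressions in Y; the helper name `lag_difference` and the test series are illustrative assumptions.

```python
import numpy as np

def lag_difference(y, j=1):
    """Apply (1 - L^j): returns Y_t - Y_{t-j} for every t with j earlier observations."""
    y = np.asarray(y, dtype=float)
    return y[j:] - y[:-j]

# Arbitrary illustrative series: Y_t = t^2 for t = 0, ..., 11.
y = np.arange(12, dtype=float) ** 2

annual_change = lag_difference(y, 4)                                   # (a)
acceleration = lag_difference(lag_difference(y, 1), 1)                 # (b)
quarterly_change_in_annual = lag_difference(lag_difference(y, 4), 1)   # (c)
change_in_annual_change = lag_difference(lag_difference(y, 4), 4)      # (d)

# Expanded forms from the answer, written with aligned slices.
expanded_b = y[2:] - 2 * y[1:-1] + y[:-2]
expanded_c = y[5:] - y[4:-1] - y[1:-4] + y[:-5]
expanded_d = y[8:] - 2 * y[4:-4] + y[:-8]

print(np.allclose(acceleration, expanded_b),
      np.allclose(quarterly_change_in_annual, expanded_c),
      np.allclose(change_in_annual_change, expanded_d))
```

Each composed operator loses as many leading observations as its highest lag, which is why the expanded slices start at index 2, 5, and 8.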


2) The textbook displayed the accompanying four economic time series with “markedly different patterns.” For each indicate what you think the sample autocorrelations of the level (Y) and change ( △Y) will be and explain your reasoning. (a)

(b)

(c)


(d)

Answer: (a) There is strong positive autocorrelation in the federal funds rate, with sample autocorrelations declining for higher lags. There are obvious long-term trends in the series in that the federal funds rate was high during the first quarter of 1982, and high again in the second quarter of 1982. Similarly, it was low during the first quarter of 1962 and again low in the second quarter of that year. Since inflationary expectations and therefore the inflation rate itself play a large role in federal funds rate movements, it should not be surprising to find a similar pattern in the autocorrelations for the inflation rate and the federal funds rate. (The autocorrelations are 0.90, 0.83, 0.80 and 0.72 for lags one to four.) For the change in the federal funds rate you would also expect a similar pattern in the autocorrelations as for the inflation rate, i.e., a negative first autocorrelation. On average, an increase in the federal funds rate in one quarter is associated with a decrease in the following quarter. (The autocorrelations are -0.14, -0.19, 0.24, -0.12.)
(b) (Different from the textbook, the figure here only displays the exchange rate behavior after the collapse of the Bretton Woods system of fixed exchange rates.) As in the previous graph, there should be positive autocorrelations reflecting long-term trends in the exchange rates. Students might point out that due to purchasing power parity you could expect long-term exchange rate behavior to be similar to the behavior of inflation rates. However, the inflation rate of the U.K. would also have to be considered. (The actual autocorrelations are 0.93, 0.85, 0.79, and 0.72 for lags one to four.) Students may have difficulty detecting the positive nature of the sample autocorrelations in the change of the exchange rate: positive (negative) changes in the exchange rate tend to be followed by positive (negative) changes in the following period. Perhaps students are able to see that the behavior of the exchange rate is somewhat smoother than that of the federal funds rate. (The actual autocorrelations are 0.22, 0.14, 0.12, and 0.07 for lags one to four.)
(c) Students should be able to identify the high autocorrelations in the level: typically a high level of real GDP will be followed by a high level in the next period. In addition, there is, to a large extent, a trend increase. (The actual autocorrelations are 0.98, 0.96, 0.94, and 0.92 for lags one to four.) Since positive growth rates in real GDP are typically followed by positive growth rates during the next quarter, students should be able to see that the autocorrelations for the change in the logarithm of real GDP will also be positive. (The actual autocorrelations are 0.29, 0.39, 0.40, and 0.36 for lags one to four.)
(d) Students should be able to see that the returns are essentially unpredictable, and that the level autocorrelations should be very low. There are no long-term trends visible and a high return on a given day is as likely to be followed by a high return the next day as a low return. (The actual autocorrelations are 0.07, -0.01, -0.02, and 0.00 for lags one to four.) At the same time students should be able to see a relatively strong negative first autocorrelation in the change of returns, since there are no long-term trends in the level of returns. A strong positive day-to-day change must therefore be followed, on average, by a strong negative change. Due to the unpredictability, these autocorrelations should also fall off quite quickly. (The actual autocorrelations are -0.46, -0.04, -0.02, and 0.03 for lags one to four.)
3) You have decided to use the Dickey-Fuller (DF) test on the United States aggregate unemployment rate (sample period 1962:I to 1995:IV). As a result, you estimate the following AR(1) model
△UrateUS_t = 0.114 - 0.024 UrateUS_{t-1}, R2 = 0.0118, SER = 0.3417
(0.121) (0.019)
You recall that your textbook mentioned that this form of the AR(1) is convenient because it allows you to test for the presence of a unit root by using the t-statistic of the slope. Being adventurous, you decide to estimate the original form of the AR(1) instead, which results in the following output
UrateUS_t = 0.114 + 0.976 UrateUS_{t-1}, R2 = 0.9510, SER = 0.3417
(0.121) (0.019)
You are surprised to find the constant, the standard errors of the two coefficients, and the SER unchanged, while the regression R2 increased substantially. Explain this increase in the regression R2. Why should you have been able to predict the change in the slope coefficient and the constancy of the standard errors of the two coefficients and the SER?
Answer: There is no additional information in the second regression, hence the SSR, and therefore the SER, will not change. The only difference is that the lag of the dependent variable has been subtracted from both sides. This linear transformation changes the coefficient on the lagged dependent variable from (-0.024) to (-0.024) + 1 = 0.976. The regression R2 is defined as ESS/TSS or 1 - (SSR/TSS). The only change here has been in the TSS, which is now calculated from a level rather than a difference. Since TSS increases and SSR remains unchanged, SSR/TSS must decrease, and the regression R2 will increase. Finally, the heteroskedasticity-robust standard errors contain the residuals and other terms involving the regressor, both of which have not changed between the two specifications. Hence the standard errors should also remain unchanged.
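The argument in the answer can be illustrated with a short simulation. The Python sketch below regresses both the change and the level of a simulated persistent series on a constant and the lagged level; the helper `ols`, the simulated series, and its parameter values are illustrative assumptions, and for brevity the sketch computes homoskedasticity-only standard errors (the same invariance holds for robust ones, since residuals and regressors are identical in both regressions).

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, homoskedasticity-only standard errors, SER and R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    ssr = float(resid @ resid)
    ser = np.sqrt(ssr / (n - k))
    se = ser * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
    r2 = 1.0 - ssr / float(np.sum((y - y.mean()) ** 2))
    return beta, se, ser, r2

# Simulated persistent series standing in for the unemployment rate.
rng = np.random.default_rng(2)
shocks = rng.normal(scale=0.3, size=200)
rate = np.empty(200)
rate[0] = 6.0
for t in range(1, 200):
    rate[t] = 0.24 + 0.96 * rate[t - 1] + shocks[t]

X = np.column_stack([np.ones(199), rate[:-1]])              # constant and lagged level
b_diff, se_diff, ser_diff, r2_diff = ols(X, np.diff(rate))   # DF form
b_level, se_level, ser_level, r2_level = ols(X, rate[1:])    # original AR(1) form

print("slope difference:", b_level[1] - b_diff[1])
print("R2 (difference form, level form):", r2_diff, r2_level)
```

Only the TSS differs between the two fits, so the slope shifts by exactly one while the intercept, standard errors and SER coincide.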


4) Consider the standard AR(1) $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$, where the usual assumptions hold.
(a) Show that $y_t = \beta_1 y_{t-1} + u_t$, where $y_t$ is $Y_t$ with the mean removed, i.e., $y_t = Y_t - E(Y_t)$. Show that $E(y_t) = 0$.
(b) Show that the r-period ahead forecast is $E(y_{T+r|T}) = \beta_1^r y_T$. If $0 < \beta_1 < 1$, how does the r-period ahead forecast behave as r becomes large? What is the forecast of $Y_{T+r|T}$ for large r?
(c) The median lag is the number of periods it takes a time series with zero mean to halve its current value (in expectation), i.e., the solution r to $E(y_{T+r|T}) = 0.5\,y_T$. Show that in the present case this is given by $r = -\frac{\log(2)}{\log(\beta_1)}$.

Answer: (a) $E(Y_t) = \beta_0 + \beta_1 E(Y_{t-1})$, since $E(u_t) = 0$. Therefore $Y_t - E(Y_t) = \beta_1 [Y_{t-1} - E(Y_{t-1})] + u_t$, or $y_t = \beta_1 y_{t-1} + u_t$. Now
$y_t = \beta_1 y_{t-1} + u_t = \beta_1(\beta_1 y_{t-2} + u_{t-1}) + u_t = \beta_1^2 y_{t-2} + u_t + \beta_1 u_{t-1}$.
Repeated substitution then results in
$y_t = \beta_1^{n+1} y_{t-(n+1)} + \sum_{i=0}^{n} \beta_1^i u_{t-i}$, or, as $n \to \infty$, $y_t = \sum_{i=0}^{\infty} \beta_1^i u_{t-i}$.
Taking expectations on both sides of the equation results in $E(y_t) = 0$, since $E(u_t) = E(u_{t-1}) = \dots = E(u_{t-n}) = \dots = 0$.
(b) $E(y_{T+1|T}) = \beta_1 y_T$, since $E(u_{T+1|T}) = 0$. $E(y_{T+2|T}) = \beta_1 E(y_{T+1|T}) = \beta_1^2 y_T$, and so on until $E(y_{T+r|T}) = \beta_1^r y_T$. For large r, $E(y_{T+r|T}) = 0$. Performing similar repeated substitutions for $Y_t$ instead of $y_t$ results in
$E(Y_{T+r|T}) = \frac{\beta_0}{1 - \beta_1} + \beta_1^r y_T$, and hence $E(Y_{T+r|T}) = \frac{\beta_0}{1 - \beta_1}$ for large r.
(c) $E(y_{T+r|T}) = \beta_1^r y_T = \frac{1}{2} y_T$, or $\beta_1^r = \frac{1}{2}$. Taking logs and solving for r then results in $r \log(\beta_1) = -\log(2)$, or $r = -\frac{\log(2)}{\log(\beta_1)}$.
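The median-lag formula in part (c) is easy to evaluate. The sketch below uses two hypothetical helper functions, `median_lag` and `forecast_deviation`, and an arbitrary value $\beta_1 = 0.9$ chosen only for illustration; it confirms that the forecast of the demeaned series is halved after the median lag.

```python
import math

def median_lag(beta1: float) -> float:
    """Number of periods r solving beta1**r = 0.5, for 0 < beta1 < 1."""
    if not 0.0 < beta1 < 1.0:
        raise ValueError("beta1 must lie strictly between 0 and 1")
    return -math.log(2.0) / math.log(beta1)

def forecast_deviation(y_T: float, beta1: float, r: float) -> float:
    """r-period ahead forecast of the demeaned series: beta1**r * y_T."""
    return beta1 ** r * y_T

r_half = median_lag(0.9)                       # illustrative coefficient
print(r_half)                                  # between 6 and 7 periods
print(forecast_deviation(2.0, 0.9, r_half))    # half of the starting value 2.0
```

The closer $\beta_1$ is to one, the larger the median lag, which matches the slow decay of the forecast toward the mean in part (b).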

5) Consider the following model
$Y_t = \alpha_0 + \alpha_1 X_t^e + u_t$
where the superscript “e” indicates expected values. This may represent an example where consumption depends on expected, or “permanent,” income. Furthermore, let expected income be formed as follows:
$X_t^e = X_{t-1}^e + \lambda (X_{t-1} - X_{t-1}^e)$; $0 < \lambda < 1$
This particular type of expectation formation is called the “adaptive expectations hypothesis.”
(a) In the above expectation formation hypothesis, expectations are formed at the beginning of the period, say the 1st of January if you had annual data. Give an intuitive explanation for this process.
(b) Transform the adaptive expectation hypothesis in such a way that the right hand side of the equation only contains observable variables, i.e., no expectations.
(c) Show that by substituting the resulting equation from the previous question into the original equation, you get an ADL(0, ∞) type equation. How are the coefficients of the regressors related to each other?
(d) Can you think of a transformation of the ADL(0, ∞) equation into an ADL(1,1) type equation, if you allowed the error term to be $(u_t - (1-\lambda) u_{t-1})$?

Answer: (a) The term $(X_{t-1} - X_{t-1}^e)$ is the forecast error for the previous period. If no forecast error was made, then the forecast for the current period is the same as for the previous period. If there was a forecast error, then the forecast for the current period is adjusted by a fraction λ of that forecast error. Note also that the adaptive expectations hypothesis can be rewritten as $X_t^e = (1-\lambda) X_{t-1}^e + \lambda X_{t-1}$; $0 < \lambda < 1$, in which case the expected value can be seen as a linear combination of the previous period’s forecast and the previous period’s actual value.
(b) $X_t^e = (1-\lambda) X_{t-1}^e + \lambda X_{t-1} = (1-\lambda)[(1-\lambda) X_{t-2}^e + \lambda X_{t-2}] + \lambda X_{t-1} = (1-\lambda)^2 X_{t-2}^e + \lambda X_{t-1} + \lambda(1-\lambda) X_{t-2}$.
Repeated substitution results in
$X_t^e = (1-\lambda)^{n+1} X_{t-(n+1)}^e + \lambda \sum_{i=0}^{n} (1-\lambda)^i X_{t-i-1}$, or, as $n \to \infty$, $X_t^e = \lambda \sum_{i=0}^{\infty} (1-\lambda)^i X_{t-i-1}$.
(c) $Y_t = \alpha_0 + \alpha_1 X_t^e + u_t = \alpha_0 + \alpha_1 \left(\lambda \sum_{i=0}^{\infty} (1-\lambda)^i X_{t-i-1}\right) + u_t$, or
$Y_t = \beta_0 + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \dots + \beta_r X_{t-r} + \dots + u_t$.
Here $\beta_0 = \alpha_0$ and $\beta_i = \alpha_1 \lambda (1-\lambda)^{i-1}$, $i \geq 1$, so each coefficient is $(1-\lambda)$ times the previous one and the coefficients decline geometrically.
(d) Lagging both sides of $Y_t = \alpha_0 + \alpha_1 \left(\lambda \sum_{i=0}^{\infty} (1-\lambda)^i X_{t-i-1}\right) + u_t$ and multiplying both sides by $(1-\lambda)$ results in
$(1-\lambda) Y_{t-1} = \alpha_0 (1-\lambda) + \alpha_1 \left(\lambda \sum_{i=0}^{\infty} (1-\lambda)^{i+1} X_{t-i-2}\right) + (1-\lambda) u_{t-1}$.
Finally, subtraction of this equation from $Y_t = \alpha_0 + \alpha_1 \left(\lambda \sum_{i=0}^{\infty} (1-\lambda)^i X_{t-i-1}\right) + u_t$ gives you
$Y_t = \alpha_0 \lambda + \alpha_1 \lambda X_{t-1} + (1-\lambda) Y_{t-1} + (u_t - (1-\lambda) u_{t-1})$.
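The final equation in part (d) is an exact identity once expectations follow the adaptive rule, which can be confirmed on simulated data. This is only a sketch: the values of λ, α0, α1, the simulated series, and the starting expectation are all arbitrary assumptions.

```python
import numpy as np

# Hypothetical parameter values, chosen only to illustrate the algebra.
rng = np.random.default_rng(2)
T, lam, a0, a1 = 50, 0.3, 1.0, 0.8
x = rng.normal(loc=10.0, size=T)     # observed regressor X_t
u = rng.normal(scale=0.1, size=T)    # error term u_t

# Adaptive expectations: X^e_t = X^e_{t-1} + lam * (X_{t-1} - X^e_{t-1}).
xe = np.empty(T)
xe[0] = x[0]                         # arbitrary starting expectation
for t in range(1, T):
    xe[t] = xe[t - 1] + lam * (x[t - 1] - xe[t - 1])

y = a0 + a1 * xe + u                 # original model in expected values

# ADL(1,1) form from part (d), using observables and the composite error only.
lhs = y[1:]
rhs = (a0 * lam + a1 * lam * x[:-1] + (1 - lam) * y[:-1]
       + (u[1:] - (1 - lam) * u[:-1]))
print(np.max(np.abs(lhs - rhs)))     # zero up to rounding error
```

The composite error $u_t - (1-\lambda)u_{t-1}$ is serially correlated by construction, which is why the transformed equation is not a standard ADL with white-noise errors.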


6) The following two graphs give you a plot of the United States aggregate unemployment rate for the sample period 1962:I to 1999:IV, and the (log) level of real United States GDP for the sample period 1962:I to 1995:IV. You want to test for stationarity in both cases. Indicate whether or not you should include a time trend in your Augmented Dickey-Fuller test and why.

Answer: Looking over the entire sample period, there does not appear to be a deterministic trend for the unemployment rate. There is no need to include a time trend for the ADF test in this case. The log level of real GDP, on the other hand, is clearly upward trended and a time trend should therefore be included.


7) (Requires Appendix material): Show that the AR(1) process $Y_t = a_1 Y_{t-1} + e_t$; $|a_1| < 1$, can be converted to an MA(∞) process.
Answer: $Y_t = a_1 Y_{t-1} + e_t = a_1(a_1 Y_{t-2} + e_{t-1}) + e_t = a_1^2 Y_{t-2} + e_t + a_1 e_{t-1}$. Repeated substitution then results in
$Y_t = a_1^{n+1} Y_{t-(n+1)} + e_t + a_1 e_{t-1} + \dots + a_1^n e_{t-n}$, and for $n \to \infty$,
$Y_t = e_t + a_1 e_{t-1} + \dots + a_1^q e_{t-q} + \dots$.

8) (Requires Appendix material) The long-run, stationary state solution of an ADL(p,q) model, which can be written as $A(L) Y_t = \beta_0 + c(L) X_{t-1} + u_t$, where $a_0 = 1$, and $a_j = -\beta_j$, $c_j = \delta_j$, can be found by setting L = 1 in the two lag polynomials. Explain. Derive the long-run solution for the estimated ADL(4,4) of the change in the inflation rate on unemployment:
$\Delta Inf_t = 1.32 - 0.36 \Delta Inf_{t-1} - 0.34 \Delta Inf_{t-2} + 0.7 \Delta Inf_{t-3} - 0.3 \Delta Inf_{t-4} - 2.68\,Unemp_{t-1} + 3.43\,Unemp_{t-2} - 1.04\,Unemp_{t-3} + 0.07\,Unemp_{t-4}$
Assume that the inflation rate is constant in the long run and calculate the resulting unemployment rate. What does the solution represent? Is it reasonable to assume that this long-run solution is constant over the estimation period 1962-1999? If not, how could you detect the instability?
Answer: In a stationary state equilibrium, variables do not change from one period to the next. Hence $X_{t-1} = X_{t-2} = \dots = X_{t-q}$. This is achieved in the above formulation by setting L = 1. With a constant inflation rate, all $\Delta Inf$ terms are zero, so that $0 = 1.32 + (-2.68 + 3.43 - 1.04 + 0.07)\,Unemp = 1.32 - 0.22\,Unemp$. This solution represents the equilibrium rate of unemployment or NAIRU. In the above example it is 1.32/0.22 = 6%. The NAIRU does not remain constant but instead is a function of various determining variables such as the demographic composition of the labor force, the competitiveness of labor and product markets, the generosity of the unemployment benefits system, etc. One way to detect instability is to test for breaks, using a Chow test if the break date is known, or using the QLR statistic if the break date is unknown.

9) You want to determine whether or not the unemployment rate for the United States has a stochastic trend using the Augmented Dickey-Fuller (ADF) test. The BIC suggests using 3 lags, while the AIC suggests 4 lags.
(a) Which of the two will you use for your choice of the optimal lag length?
(b) After estimating the appropriate equation, the t-statistic on the lagged level of the unemployment rate is (–2.186) (using a constant, but not a trend). What is your decision regarding the stochastic trend of the unemployment rate series in the United States?
(c) Having worked in the previous exercise with the unemployment rate level, you repeat the exercise using the difference in United States unemployment rates. Write down the appropriate equation to conduct the Augmented Dickey-Fuller test here. The t-statistic on the relevant coefficient turns out to be (-4.791). What is your conclusion now?
Answer: (a) The BIC is a consistent estimator of the true lag length, while the AIC will overestimate the lag length. The textbook suggests that if the researcher is concerned about too few lags, then the AIC can be used as a reasonable alternative.
(b) The large-sample critical value of the ADF statistic is –2.57 at the 10% level. Since –2.186 is not below this value, you cannot reject the null hypothesis of a unit root.
(c) $\Delta^2 UrateUS_t = \beta_0 + \delta \Delta UrateUS_{t-1} + \gamma_1 \Delta^2 UrateUS_{t-1} + \gamma_2 \Delta^2 UrateUS_{t-2} + \gamma_3 \Delta^2 UrateUS_{t-3} + u_t$
The critical value at the 1% level is –3.43, so that you can reject the null hypothesis of a unit root in the change of the U.S. unemployment rate.
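The arithmetic behind the answers to questions 8 and 9 can be reproduced in a few lines. This sketch only uses numbers quoted in the questions and answers above; the variable names and the helper `rejects_unit_root` are illustrative assumptions.

```python
# Question 8: long-run solution of the ADL(4,4) with constant inflation.
# With Delta Inf = 0 in every period, only the intercept and the
# unemployment coefficients remain, evaluated at L = 1.
intercept = 1.32
unemp_coefs = [-2.68, 3.43, -1.04, 0.07]
c_of_one = sum(unemp_coefs)              # sum of the lag coefficients
nairu = -intercept / c_of_one            # solves 0 = intercept + c(1) * Unemp
print(round(nairu, 2))

# Question 9: one-sided ADF decision rule.
def rejects_unit_root(t_stat: float, critical_value: float) -> bool:
    """Reject the unit-root null only if the t-statistic is below the critical value."""
    return t_stat < critical_value

print(rejects_unit_root(-2.186, -2.57))  # level, 10% critical value
print(rejects_unit_root(-4.791, -3.43))  # first difference, 1% critical value
```

Because the inflation rate is held constant, the coefficients on the lagged changes in inflation play no role in the long-run unemployment rate.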


10) Consider the AR(1) model $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$, $|\beta_1| < 1$.
(a) Find the mean and variance of $Y_t$.
(b) Find the first two autocovariances of $Y_t$.
(c) Find the first two autocorrelations of $Y_t$.
Answer: (a) Rewrite the AR(1) model as follows:
$Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t = \beta_0 + \beta_1(\beta_0 + \beta_1 Y_{t-2} + u_{t-1}) + u_t = \beta_0(1 + \beta_1) + \beta_1^2 Y_{t-2} + u_t + \beta_1 u_{t-1}$.
Continuing the substitution indefinitely then results in
$Y_t = \beta_0(1 + \beta_1 + \beta_1^2 + \beta_1^3 + \dots) + \sum_{i=0}^{\infty} \beta_1^i u_{t-i}$.
Given the result for the sum of a geometric series, the final expression is
$Y_t = \frac{\beta_0}{1 - \beta_1} + \sum_{i=0}^{\infty} \beta_1^i u_{t-i}$.
To find the mean and the variance, first take expectations on both sides:
$E(Y_t) = \frac{\beta_0}{1 - \beta_1} + \sum_{i=0}^{\infty} \beta_1^i E(u_{t-i}) = \frac{\beta_0}{1 - \beta_1}$, since $E(u_t) = 0$ for all t.
To derive the variance, note that $Y_t - E(Y_t) = \sum_{i=0}^{\infty} \beta_1^i u_{t-i}$. Since the errors are serially uncorrelated, the variance is
$E[Y_t - E(Y_t)]^2 = \sum_{i=0}^{\infty} (\beta_1^i)^2 E(u_{t-i}^2) = \sigma_u^2 \sum_{i=0}^{\infty} (\beta_1^i)^2 = \frac{\sigma_u^2}{1 - \beta_1^2}$.
(b) The first two autocovariances are defined as $cov(Y_t, Y_{t-1})$ and $cov(Y_t, Y_{t-2})$. Using the fact that $Y_t = \frac{\beta_0}{1 - \beta_1} + \sum_{i=0}^{\infty} \beta_1^i u_{t-i}$ and that the expected value of both $Y_t$ and $Y_{t-j}$ is $\frac{\beta_0}{1 - \beta_1}$, you get
$E[(Y_t - E(Y_t))(Y_{t-1} - E(Y_{t-1}))] = E\left[\left(\sum_{i=0}^{\infty} \beta_1^i u_{t-i}\right)\left(\sum_{i=0}^{\infty} \beta_1^i u_{t-1-i}\right)\right] = var(u_t)(\beta_1 + \beta_1^3 + \beta_1^5 + \dots) = var(u_t)\,\beta_1 (1 + \beta_1^2 + \beta_1^4 + \dots) = \beta_1 \frac{\sigma_u^2}{1 - \beta_1^2}$.
Similarly $cov(Y_t, Y_{t-2}) = \beta_1^2 \frac{\sigma_u^2}{1 - \beta_1^2}$ (and, more generally, $cov(Y_t, Y_{t-j}) = \beta_1^j \frac{\sigma_u^2}{1 - \beta_1^2}$).
(c) Since $corr(Y_t, Y_{t-j}) = \frac{cov(Y_t, Y_{t-j})}{var(Y_t)}$, $corr(Y_t, Y_{t-1}) = \beta_1$ and $corr(Y_t, Y_{t-2}) = \beta_1^2$ (and, in general, $corr(Y_t, Y_{t-j}) = \beta_1^j$).
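As a cross-check on these closed-form results, the autocovariances can also be computed directly from a truncated version of the MA(∞) representation. The function names and the parameter values below are assumptions made purely for illustration.

```python
import numpy as np

def ar1_moments(beta0, beta1, sigma_u, max_lag=2):
    """Closed-form mean, variance, autocovariances, and autocorrelations of an AR(1)."""
    mean = beta0 / (1.0 - beta1)
    var = sigma_u**2 / (1.0 - beta1**2)
    autocov = [beta1**j * var for j in range(1, max_lag + 1)]
    autocorr = [c / var for c in autocov]
    return mean, var, autocov, autocorr

def autocov_from_ma(beta1, sigma_u, j, n_terms=500):
    """Autocovariance at lag j from the MA weights beta1**i, truncated at n_terms."""
    psi = beta1 ** np.arange(n_terms + j)
    return sigma_u**2 * np.sum(psi[:n_terms] * psi[j:j + n_terms])

# Hypothetical parameter values.
mean, var, autocov, autocorr = ar1_moments(beta0=1.0, beta1=0.8, sigma_u=2.0)
print(mean, var)
print(autocov, [autocov_from_ma(0.8, 2.0, j) for j in (1, 2)])
print(autocorr)   # beta1 and beta1 squared
```

The truncation at 500 terms is harmless here because the weights $\beta_1^i$ decay geometrically.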

11) Find data for real GDP ($Y_t$) for the United States for the time period 1959:I (first quarter) to 1995:IV. Next generate two growth rates: the (annualized) quarterly growth rate of real GDP $[(\ln Y_t - \ln Y_{t-1}) \times 400]$ and the annual growth rate of real GDP $[(\ln Y_t - \ln Y_{t-4}) \times 100]$. Which is more volatile? What is the reason for this? Explain.

Answer: The quarterly growth rate is more volatile because the annual growth rate is a moving average of the quarterly growth rates, and hence “wild swings” are smoothed out:
$(\ln Y_t - \ln Y_{t-4}) = (\ln Y_t - \ln Y_{t-1}) + (\ln Y_{t-1} - \ln Y_{t-2}) + (\ln Y_{t-2} - \ln Y_{t-3}) + (\ln Y_{t-3} - \ln Y_{t-4})$

12) You have collected data for real GDP (Y) and have estimated the following function:
$\widehat{\ln Y_t} = 7.866 + 0.00679 \times Zeit$
(0.007) (0.00008)
t = 1961:I to 2007:IV, $R^2$ = 0.98, SER = 0.036
where Zeit is a deterministic time trend, which takes on the value of 1 during the first quarter of 1961, and is increased by one for each following quarter.
a. Interpret the slope coefficient. Does it make sense?
b. Interpret the regression $R^2$. Are you impressed by its value?
c. Do you think that given the regression $R^2$, you should use the equation to forecast real GDP beyond the sample period?

Answer: a. The slope coefficient indicates the average growth rate per quarter. Since 1896, the U.S. economy has grown at a rate of approximately 3% per year. As a result, observing a quarterly growth rate of about 0.7% (roughly 2.7% at an annual rate) is plausible.
b. The regression $R^2$ tells you that 98 percent of the variation in the log of real GDP is explained by the model. Since the model only contains a deterministic time trend, this seems high at face value.
c. The logarithm of real GDP is bound to be non-stationary (using the ADF statistic, you would not be able to reject the null hypothesis that the log of real GDP has a unit root). Hence this equation should not be used for forecasting despite the very high regression $R^2$.
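The unit conversions used in questions 11 and 12 can be illustrated as follows. The slope is the estimate reported above; the five log GDP levels are made-up numbers, introduced only to show that the annual growth rate equals the average of the four annualized quarterly rates.

```python
import math

slope = 0.00679                       # estimated coefficient on the time trend
quarterly_pct = 100 * slope           # approximate growth per quarter, in percent
annualized_pct = 400 * slope          # simple annualization, as in question 11
compounded_pct = 100 * (math.exp(4 * slope) - 1)   # annual rate implied by the log trend

print(quarterly_pct, annualized_pct, compounded_pct)

# Hypothetical log GDP levels for five consecutive quarters.
log_y = [9.000, 9.010, 9.004, 9.020, 9.027]
quarterly_rates = [400 * (log_y[i] - log_y[i - 1]) for i in range(1, 5)]
annual_rate = 100 * (log_y[4] - log_y[0])

print(quarterly_rates)                # swings between quarters
print(annual_rate)                    # equals the mean of the four quarterly rates
```

Averaging four quarterly rates is what damps the swings, which is why the annual growth rate is the less volatile of the two series.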


Chapter 15 Estimation of Dynamic Causal Effects 15.1 Multiple Choice 1) A distributed lag regression A) is also called AR(p). B) can also be used with cross-sectional data. C) gives estimates of dynamic causal effects. D) is sometimes referred to as ADL. Answer: C 2) Heteroskedasticity- and autocorrelation-consistent standard errors A) result in the OLS estimator being BLUE. B) should be used when errors are autocorrelated. C) are calculated when using the Cochrane-Orcutt iterative procedure. D) have the same formula as the heteroskedasticity robust standard errors in cross-sections. Answer: B 3) Sensitivity analysis of the results may include the following with the exception of A) stability over time analysis of the estimated multipliers. B) using homoskedasticity only rather than HAC standard errors. C) investigation of omitted variable bias. D) looking at different computations of the HAC standard errors. Answer: B 4) A seasonal binary (or indicator or dummy) variable, in the case of monthly data, A) is a binary variable that take on the value of 1 for a given month and is 0 otherwise. B) is a variable that has values of 1 to 12 in a given year. C) is a variable that contains 1s during a given year and is 0 otherwise. D) does not exist, since a month is not a season. Answer: A 5) Ascertaining whether or not a regressor is strictly exogenous or exogenous ultimately requires all of the following with the exception of A) economic theory. B) institutional knowledge. C) expert judgment. D) use of HAC standard errors. Answer: D 6) In time series, the definition of causal effects A) says that one variable helps predict another variable. B) does not make much sense since there are not multiple subjects. C) assumes that the same subject is being given different treatments at different points in time. D) requires panel data. Answer: C 7) The distributed lag model is given by A) Yt = β0 + β1 Xt + β2 Yt-1 + ut. B) Yt = β0 + β1 Yt-1 + β2 Yt-2 + ... + βrYt-r + ut. 
C) Yt = β0 + β1 ut + β2 ut+1 + β3 ut+2 + ... + βr+1 ut+r + et. D) Yt = β0 + β1 Xt + β2 Xt-1 + β3 Xt-2 + ... + βr+1 Xt-r + ut. Answer: D


8) The concept of exogeneity is important because A) it clarifies whether or not the variable is determined inside or outside your model. B) maximum likelihood estimation is no longer valid. C) under strict exogeneity, OLS may not be efficient as an estimator of dynamic causal effects. D) endogenous variables are not stationary, but exogenous variables are. Answer: C 9) The impact effect is the A) zero period dynamic multiplier. B) h period dynamic multiplier, h>0. C) cumulative dynamic multiplier. D) long-run cumulative dynamic multiplier. Answer: A 10) Estimation of dynamic multipliers under strict exogeneity should be done by A) instrumental variable methods. B) OLS. C) feasible GLS. D) analyzing the stationarity of the multipliers. Answer: C 11) Autocorrelation of the error terms A) makes it impossible to calculate homoskedasticity only standard errors. B) causes OLS to be no longer consistent. C) causes the usual OLS standard errors to be inconsistent. D) results in OLS being biased. Answer: C 12) The long-run cumulative dynamic multiplier A) cannot be calculated since in the long-run, we are all dead. B) is the sum of all individual dynamic multipliers. C) is the coefficient on Xt-r in the standard formulation of the distributed lag model. D) is the difference between the coefficient on Xt-1 and Xt-r. Answer: B 13) The concepts of exogeneity, strict exogeneity, and predeterminedness A) are defined in such a way that strict exogeneity implies exogeneity. B) can be used interchangeably. C) are defined in such a way that exogeneity implies strict exogeneity. D) correspond to endogeneity, strict endogeneity, and lagged endogenous variables. Answer: A 14) GLS A) results in smaller variances of the estimator than OLS if the regressors are strictly exogenous. B) is the same as OLS using HAC standard errors. C) can be used even if the regressors are not strictly exogenous. D) can be used for time-series estimation, but not in cross-sectional data. Answer: A


15) Quasi differences in Yt are defined as A) Yt - Yt-1 . B) Yt - φ1 Yt-1 . C) △Yt - φ1 Yt-1 . D) φ1 (Yt - Yt-1 ). Answer: B 16) Infeasible GLS A) requires too much memory even for today’s PCs. B) uses complicated iterative techniques. C) cannot be calculated since it also uses quasi differences for Xt. D) assumes the parameters of the error autocorrelation process to be known. Answer: D 17) The 95% confidence interval for the dynamic multipliers should be computed by using the estimated coefficient ± A) 1.96 times the RMSFE. B) 1.96 times the HAC standard errors. C) 1.96, since the HAC errors are standardized. D) 1.64 times the HAC standard errors since the alternative hypothesis is one-sided. Answer: B 18) The Cochrane-Orcutt iterative method is A) a special case of GLS estimation. B) a method to compute HAC standard errors. C) a special case of maximum likelihood estimation. D) a grid search for the autoregressive parameters on the error process. Answer: A 19) To convey information about the dynamic multipliers more effectively, you should A) plot them. B) discuss these carefully one at a time. C) estimate them by maximum likelihood methods. D) first make sure that they are stationary. Answer: A 20) GLS involves A) writing the model in differences and estimating it by OLS, using HAC standard errors. B) truncating the sample at both ends of the period, then estimating by OLS using HAC standard errors. C) checking the AIC rather than the BIC in choosing the maximum lag length of the regressors. D) transforming the regression model so that the errors are homoskedastic and serially uncorrelated, and then estimating the transformed regression model by OLS. Answer: D 21) GLS is consistent and BLUE if A) X is predetermined. B) the error process is AR(1). C) X is strictly exogenous. D) all the roots are inside the unit circle. Answer: C


22) The distributed lag model assumptions include all of the following with the exception of: A) There is no perfect multicollinearity. B) Xt is strictly exogenous. C) E(ut | Xt, Xt-1 , Xt-2 , ...) = 0 D) The random variables Xt and Yt have a stationary distribution. Answer: B 23) In the distributed lag model, the coefficient on the contemporaneous value of the regressor is called the A) dynamic effect. B) cumulative multiplier. C) autoregressive error. D) impact effect. Answer: D 24) In the distributed lag model, the dynamic causal effect A) is the sequence of coefficients on the current and lagged values of X. B) is not the same as the dynamic multiplier. C) is generated by choosing different truncation points for the HAC standard errors. D) requires estimation of the model by Cochrane-Orcutt method. Answer: A 25) HAC standard errors should be used because A) they are convenient simplifications of the heteroskedasticity-robust standard errors. B) conventional standard errors may result in misleading inference. C) they are easier to calculate than the heteroskedasticity-robust standard errors and yet still allow you to perform inference correctly. D) when there is a structural break, then conventional standard errors result in misleading inference. Answer: B 26) The interpretation of the coefficients in a distributed lag regression as causal dynamic effects hinges on A) the assumption that X is exogenous B) not having more than four lags when using quarterly data C) using GLS rather than OLS D) the use of monthly rather than annual data Answer: A 27) Given the relationship between the two variables, the following is most likely to be exogenous: A) the inflation rate and the short term interest rate: short-term interest rate is exogenous B) U.S. rate of inflation and increases in oil prices: oil prices are exogenous C) Australian exports and U.S. aggregate income: U.S. 
aggregate income is exogenous D) change in inflation, lagged changes of inflation, and lags of unemployment: lags of unemployment are exogenous Answer: C 28) When Xt is strictly exogenous, the following estimator(s) of dynamic causal effects are available: A) estimating an ADL model and calculating the dynamic multipliers from the estimated ADL coefficients B) using GLS to estimate the coefficients of the distributed lag model C) neither (a) nor (b) D) (a) and (b) Answer: D


29) In time series data, it is useful to think of a randomized controlled experiment A) consisting of the same subject being given different treatments at different points in time B) consisting of different subjects being given the same treatment at the same point in time C) as being non-existent (this is a time series after all, and there are no real “parallel universes”) D) consisting of at least two subjects being given different treatments at the same point in time Answer: A 30) Consider the distributed lag model Yt = β0 + β1 Xt + β2 Xt-1 + β3 Xt-2 + … + βr+1 Xt-r + ut. The dynamic causal effect is A) β0 + β1 B) β1 + β2 + … + βr+1 C) β0 + β1 + … + βr+1 D) β1 Answer: B

15.2 Essays and Longer Questions 1) To estimate dynamic causal effects, your textbook presents the distributed lag regression model, the autoregressive distributed lag model, and a quasi-difference representation of the distributed lag model with autoregressive errors. Using a simple example, such as a distributed lag model with only the current and past value of X and an AR(1) model for the error term, discuss how these models are related. In each case suggest estimation methods and evaluate the relative merit in using one rather than the other. Answer: The student’s answer should follow the discussion in section 13.2-13.3 (distributed lag model) and 13.5 (autoregressive distributed lag model and quasi-difference representation of the distributed lag model with autoregressive errors). Major points should include the assumption of exogeneity in the case of the distributed lag model, which, together with the other distributed lag model assumptions, allows the dynamic multipliers and cumulative dynamic multipliers to be estimated by OLS. Given the AR(1) nature of the error term, the importance of using HAC standard errors should be stressed. For the ADL and quasi-difference representation, the importance of the strictly exogenous regressor assumption must be emphasized. The answer should include the derivation of the dynamic multipliers from the OLS estimated ADL coefficients and the difference between the infeasible and feasible GLS estimator. For the latter, the Cochrane-Orcutt procedure should be mentioned. If the regressors are strictly exogenous, then GLS is asymptotically BLUE. However, since the ADL specification requires estimation of fewer parameters, it may be preferred in practice. If there is no convincing argument for the regressor being strictly exogenous, but an argument for exogeneity can be made, then OLS estimation using HAC standard errors is the preferred method.


2) Your textbook presents as an example of a distributed lag regression the effect of the weather on the price of orange juice. The authors mention U.S. income and Australian exports, oil prices and inflation, monetary policy and inflation, and the Phillips curve as other candidates for distributed lag regression. Briefly discuss whether or not the exogeneity assumption is likely to hold in each of these cases. Explain why it is so hard to come up with good examples of distributed lag regressions in economics. Answer: Student’s answers should follow the discussion of section 13.7 in the textbook. Although there is some degree of simultaneity between Australian exports and U.S. income, the Australian economy is too small relative to the American economy to present much of a feedback from a fall in exports. It is therefore reasonable to assume that U.S. income is exogenous in a regression of Australian exports on U.S. income. The situation is different for oil prices and inflation since it is reasonable to assume that members of OPEC countries analyze world wide economic conditions, including inflation rates, when setting oil prices. If this is the case, then oil prices are not exogenous. Monetary policy and inflation are other examples where it cannot be assumed reasonably that the monetary base or the federal funds rate is exogenous. The Federal Reserve takes into account current and future inflation rates when setting their instrument, which is thereby endogenous. Finally, the Phillips curve is another example where it cannot be assumed that the (lagged) unemployment rate is exogenous, since past values of the unemployment rate were simultaneously determined with past inflation rates. 3) Money supply is linked to the monetary base by the money multiplier. Macroeconomic textbooks tell you that the central bank cannot control the money supply, but it can control the monetary base. 
As a result, you decide to specify a distributed lag equation of the growth in the money supply on the growth in the monetary base. One of your peers tells you that this is not a good idea for modeling the relationship between the two variables. What does she mean? Answer: Although the monetary base is one of the determinants of the money supply, there are other factors, such as interest rates, that have an effect on the money multiplier. Hence there is the problem of omitted variables. If interest rates are correlated with the monetary base, then the OLS estimator will be inconsistent. Furthermore, it is likely that due to financial innovations, dynamic causal effects have changed over time. Finally there is the concern of simultaneous causality bias. If the Federal Reserve changes the monetary base as a result of changes in the money supply, perhaps as a result of targeting, then the monetary base becomes endogenous.


4) In your intermediate macroeconomics course, government expenditures and the money supply were treated as exogenous, in the sense that the variables could be changed to conduct economic policy to influence target variables, but that these variables would not react to changes in the economy as a result of some fixed rule. The St. Louis Model, proposed by two researchers at the Federal Reserve in St. Louis, used this idea to test whether monetary policy or fiscal policy was more effective in influencing output behavior. Although there were various versions of this model, the basic specification was of the following type: △ln(Yt) = β0 + β1 △ln mt + ... + βp △ln mt-p-1 + βp+1 △ln Gt + ... + βp+q△ln Gt-q-1 + ut Assuming that money supply and government expenditures are exogenous, how would you estimate dynamic causal effects? Why do you think this type of model is no longer used by most to calculate fiscal and monetary multipliers? Answer: If the money supply and government expenditures were exogenous, then a distributed lag model could be used to estimate the dynamic multipliers and cumulative dynamic multipliers using OLS. The coefficients in the above equation are then the dynamic multipliers. To obtain the h-period cumulative dynamic multipliers, all coefficients over the h-periods have to be added up. There is an alternative form for the above equation which allows for statistical testing of the cumulative dynamic multipliers. This involves differencing the regressors with the exception of the last lag, p and q, in the above equation. The coefficient on the p and q lagged regressor then represents the long-run cumulative multiplier. The OLS estimator of the coefficients in the above equation is consistent. However, the errors are likely to be autocorrelated since omitted variables from the above equation are probably serially correlated themselves. 
In that case the OLS standard errors are inconsistent and statistical inference based on these standard errors will be misleading. To avoid this problem, heteroskedasticity- and autocorrelation-consistent standard errors can be calculated. The reason why this type of model is no longer used by most to calculate fiscal and monetary multipliers is that researchers are not willing to assume that the money supply and government expenditures are exogenous. Both monetary and fiscal policy takes into account current and future expected output growth in setting their policy instruments, which are therefore endogenous. 5) Your textbook mentions heteroskedasticity- and autocorrelation- consistent standard errors. Explain why you should use this option in your regression package when estimating the distributed lag regression model. What are the properties of the OLS estimator in the presence of heteroskedasticity and autocorrelation in the error terms? Explain why it is likely to find autocorrelation in time series data. If the errors are autocorrelated, then why not simply adjust for autocorrelation by using some non-linear estimation method such as Cochrane-Orcutt? Answer: In the presence of either heteroskedasticity and/or autocorrelation in the errors, OLS estimation of the regression coefficients is still consistent. However, the homoskedasticity-only or heteroskedasticity-robust standard errors are inconsistent and use of these in the presence of serial correlation results in misleading statistical inference. For example, confidence intervals do not contain the true value in the postulated number of times in repeated samples. The solution is to adjust the estimator for the standard errors by incorporating sample autocorrelation estimates. This results in the heteroskedasticity- and autocorrelation-consistent (HAC) estimator of the variance of the estimator. 
For this estimator to be consistent, a certain truncation parameter is introduced, so that not all T-1 sample autocorrelations are used. Incorporating this idea into the HAC formula results in the Newey -West variance estimator. Autocorrelation in the errors is likely if there are omitted variables which are slowly changing over time. Since the omitted variables are implicitly contained in the error term, this would result in autocorrelation of the error term. For generalized least squares to have desirable properties, the regressors have to be strictly (past, present, and future) exogenous, rather than just (past and present) exogenous. There are very few truly exogenous variables in economics. Furthermore, most of the relationships between economic time series contain simultaneous causality. As the example in the textbook on orange juice prices and cold weather showed, it is even more difficult to find strictly exogenous variables. Stock/Watson 2e -- CVC2 8/23/06 -- Page 366
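The HAC idea described in this answer can be sketched in code. The function below is an illustrative implementation of a Newey-West type variance estimator, not the routine of any particular regression package; the weights (m - j)/m, the absence of a degrees-of-freedom correction, the truncation parameter m = 4, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def newey_west_cov(X, resid, m):
    """HAC covariance matrix of the OLS coefficients.

    Uses weights (m - j) / m for lags j = 1, ..., m - 1; m = 1 reduces to the
    heteroskedasticity-robust formula without degrees-of-freedom correction.
    """
    Z = X * resid[:, None]              # row t holds x_t * u_t
    S = Z.T @ Z                         # lag-0 contribution
    for j in range(1, m):
        G = Z[j:].T @ Z[:-j]            # sum over t of z_t z_{t-j}'
        S = S + (1.0 - j / m) * (G + G.T)
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ S @ XtX_inv

# Simulated distributed lag regression with AR(1) errors (illustrative values only).
rng = np.random.default_rng(1)
T = 200
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + rng.normal()
X = np.column_stack([np.ones(T - 1), x[1:], x[:-1]])   # constant, X_t, X_{t-1}
y = X @ np.array([1.0, 2.0, 0.5]) + u[1:]

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
hac_se = np.sqrt(np.diag(newey_west_cov(X, resid, m=4)))
print(beta_hat)
print(hac_se)
```

The OLS coefficients are the same whichever standard errors are reported; only the estimated sampling variance changes with the choice of m.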

6) Your textbook presents as an example of a distributed lag regression the effect of the weather on the price of orange juice. The authors mention U.S. income and Australian exports, oil prices and inflation, monetary policy and inflation, and the Phillips curve as other potential candidates for distributed lag regression. You are considering estimating the effect of minimum wages on teenage employment (employment population ratio) using a time series of U.S. data. Write a short essay on whether a distributed lag model would be a suitable tool to figure out dynamic causal effects in this case. Answer: One of the first questions student must address is whether or not the X variable here is exogenous. In studies of the labor market, e.g. microeconomics, students learned that it is real wages that determine employment, not nominal wages. Some authors have used relative wages as an explanatory variable, where the denominator is average hourly earnings. Setting aside whether or not minimum wages are exogenous, the students should then focus on whether the price index used to adjust nominal minimum wages or average hourly earnings are exogenous. However, most students will focus only on the numerator (nominal minimum wages) and will argue that minimum wages are typically set by the legislature following some political process and may therefore be considered exogenous. Some will go further and argue that the process of setting minimum wages will depend on the state of the business cycle. For example, recent increases in minimum wages (2007, 2008, 2009) would most likely not have occurred if legislators would have anticipated teenage unemployment rates of over 25% for teenagers. If that is the case, then minimum wage legislation depends on the state of the business cycle and hence teenage employment. As a result, minimum wages should not be considered exogenous.


15.3 Mathematical and Graphical Problems 1) One of the central predictions of neo-classical macroeconomic growth theory is that an increase in the growth rate of the population causes at first a decline in the growth rate of real output per capita, but that subsequently the growth rate returns to its natural level, itself determined by the rate of technological innovation. The intuition is that, if the growth rate of the workforce increases, then more has to be saved to provide the new workers with physical capital. However, accumulating capital takes time, so that output per capita falls in the short run. Under the assumption that population growth is exogenous, a number of regressions of the growth rate of output per capita on current and lagged population growth were performed, as reported below. (A constant was included in the regressions but is not reported. HAC standard errors are in parentheses. BIC is listed in the last column of the table.) Regression of Growth Rate of Real Per-Capita GDP on Lags of Population Growth, United States, 1825-2000

Regression | Lag 0 | Lag 1 | Lag 2 | Lag 3 | Lag 4 | BIC
(1) Dynamic multipliers | -0.9 (1.3) | 3.5 (1.6) | -1.3 (1.7) | 0.2 (1.7) | -2.0 (1.5) | -234.4
(2) Dynamic multipliers | -1.1 (1.3) | 3.2 (1.6) | -3.0 (1.6) | 1.5 (1.2) | - | -236.1
(3) Dynamic multipliers | -1.3 (1.7) | 1.8 (1.6) | -2.2 (1.4) | - | - | -238.5
(4) Dynamic multipliers | -0.2 (1.7) | 0.8 (1.5) | - | - | - | -240.0
(5) Dynamic multipliers | -2.0 (1.5) | - | - | - | - | -241.8

(a) Which of these models is favored by the information criterion? (b) How consistent are these estimates with the theory? Is this a fair test of the theory? Why or why not? (c) Can you think of any improved data to test the theory? Answer: (a) BIC has a minimum for no lag, and this criterion therefore favors a static specification. (b) The estimates tell us that there are no dynamic multipliers other than the contemporaneous or impact effect. Even the impact effect is not statistically significant. It is unlikely that population growth is exogenous and therefore this does not represent a fair test of the theory. In addition, there is omitted variable bias with other relevant variables, such as the savings rate, education, etc. missing as regressors. (c) Per capita output or income is likely to be a determinant of fertility. As a result, population growth is not likely to be exogenous. Perhaps the working age population would be a better choice here, but data for early periods are almost impossible to obtain. 2) The Gallup Poll frequently surveys the electorate to quantify the public’s opinion of the president. In 1945, Gallup settled on the following wording of its presidential poll: “Do you approve or disapprove of the way (name) is handling his job as president?” Gallup has not changed its presidential question since then, and respondents can answer “approve,” “disapprove,” or “no opinion.” You want to see how this approval rating is related to the Michigan index of consumer sentiment (ICS). The monthly survey, conducted with a minimum sample of 500, asks people if they feel “better/worse off” with regard to current and future conditions. (a) To estimate dynamic causal effects, you collect quarterly data from 1962:I - 1998:II for the United States. You allow a binary variable for each presidency to capture the intrinsic popularity of the President. Furthermore, you eliminate observations that include a change in party for the presidency by using a binary variable, which takes on the value of one during the first quarter of the year after the election. Finally, a friendly political scientist provides you with (i) an “events” variable, (ii) a “Vietnam” binary variable, and (iii) a “honeymoon” variable, which measures the effect of a higher popularity of a president immediately following the election. (The coefficients of these variables will not be reported here.) Assuming that consumer sentiment is exogenous, you estimate the following two specifications (numbers in parentheses are heteroskedasticity- and autocorrelation-consistent standard errors):
$$\text{Approval}_t = \underset{(8.83)}{26.08} + \underset{(0.120)}{0.178}\,ICS_t + \underset{(0.135)}{0.232}\,ICS_{t-1}; \quad R^2 = 0.667,\ SER = 7.00$$
$$\text{Approval}_t = \underset{(8.17)}{26.08} + \underset{(0.120)}{0.178}\,\Delta ICS_t + \underset{(0.089)}{0.411}\,ICS_{t-1}; \quad R^2 = 0.667,\ SER = 7.00$$
What is the difference between the two specifications? What is the advantage of estimating the second equation, if any? (b) Assuming that the errors follow an AR(1) process, you also estimate the following alternative:
$$\text{Approval}_t = \underset{(5.84)}{-4.61} + \underset{(0.083)}{0.300}\,ICS_t - \underset{(0.099)}{0.070}\,ICS_{t-1} - \underset{(0.083)}{0.054}\,ICS_{t-2} + \underset{(0.057)}{0.776}\,\text{Approval}_{t-1}; \quad R^2 = 0.868,\ SER = 4.45$$
How is this specification related to the previous ones? What implicit assumptions did you have to make to allow for desirable properties of the OLS estimator? (c) You finally estimate the approval equation using the quasi-difference specification and the GLS estimator.

$$\widetilde{\text{Approval}}_t = \underset{(5.84)}{-4.61} + \underset{(0.083)}{0.300}\,\widetilde{ICS}_t - \underset{(0.099)}{0.070}\,\widetilde{ICS}_{t-1}; \quad R^2 = 0.868,\ SER = 4.45$$
where $\tilde{Z}_t = Z_t - \phi_1 Z_{t-1}$ and $\phi_1 = 0.896$ (0.040). How is this equation related to the ones in (a) and (b)? What are the properties of the GLS estimator here, under the assumption that ICS is strictly exogenous? (d) Is it likely that the ICS is exogenous here? Strictly exogenous? Answer: (a) If the regressor is exogenous, then the estimates in the first regression measure the impact effect and the one-period dynamic multiplier of a change in consumer sentiment on approval ratings. The coefficients in the second equation are cumulative dynamic multipliers, where the coefficient on $ICS_{t-1}$ represents the long-run cumulative multiplier. The advantage of the second equation is that it allows for testing cumulative dynamic multipliers. (b) This is the ADL representation of the distributed lag model with first order autocorrelation. The assumption is that ICS is a strictly exogenous regressor. If this is the case, then the dynamic multipliers can be calculated from these estimates. (c) This is the quasi-difference representation of the distributed lag model with autoregressive errors. Given the restrictions on the parameters of the ADL model, it simply reorganizes the regressors. If ICS were strictly exogenous, then GLS produces asymptotically efficient (BLUE) estimators. (d) If approval ratings depend on economic variables, such as the inflation rate, the unemployment rate, and income growth, then there is omitted variable bias, since these variables will be correlated with consumer sentiment. Furthermore, if lower approval ratings (“popularity deficit”) result in stimulating the economy, which in turn will have an effect on consumer sentiment, then there is simultaneous causality in addition. If a variable is not exogenous, then it is also not strictly exogenous.
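The relationship between the two specifications in 2(a) can be checked with a few lines of arithmetic. The sketch below is only an illustration (the helper name `cumulative_multipliers` is made up for this example): it accumulates the dynamic multipliers of the first equation into cumulative multipliers, which are what the second equation estimates directly.

```python
from itertools import accumulate

def cumulative_multipliers(dynamic):
    """Running sums of dynamic multipliers give the h-period cumulative multipliers."""
    return list(accumulate(dynamic))

# Dynamic multipliers (coefficients on ICS_t and ICS_{t-1}) from the first equation in (a).
dynamic = [0.178, 0.232]
cumulative = cumulative_multipliers(dynamic)
print([round(c, 3) for c in cumulative])  # [0.178, 0.41]
```

The last running sum, 0.410, agrees with the reported long-run coefficient of 0.411 up to rounding of the published estimates. Estimating the second form directly has the practical benefit that the standard error of the long-run multiplier is printed with the regression output.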

3) Consider the following distributed lag model $Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + u_t$, where $u_t = \phi_1 u_{t-1} + \tilde{u}_t$, $\tilde{u}_t$ is serially uncorrelated, and X is strictly exogenous. (a) How many parameters are there to be estimated between the two equations? (b) Using the two equations of the model above, derive the ADL form of the model. (c) There are five regressors in the ADL model, namely $Y_{t-1}$, $X_t$, $X_{t-1}$, $X_{t-2}$ and the constant. Estimating the ADL model linearly will give you five coefficients. Can you derive the parameters of the original two equation model from these five estimates? Why or why not? (d) What alternative method do you have to retrieve the parameters of the two equation model? Answer: (a) There are four parameters to be estimated, $\beta_0$, $\beta_1$, $\beta_2$ and $\phi_1$. (b) The ADL form of the model is derived by multiplying the first equation by $\phi_1$ and lagging it, then subtracting the resulting equation from the first equation, and using the AR(1) equation of the error term for simplification of the resulting specification.
$$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + u_t$$
$$-[\phi_1 Y_{t-1} = \phi_1\beta_0 + \phi_1\beta_1 X_{t-1} + \phi_1\beta_2 X_{t-2} + \phi_1 u_{t-1}]$$
which, after collecting terms, results in
$$Y_t = \beta_0(1-\phi_1) + \phi_1 Y_{t-1} + \beta_1 X_t + (\beta_2 - \phi_1\beta_1)X_{t-1} - \phi_1\beta_2 X_{t-2} + (u_t - \phi_1 u_{t-1})$$
or
$$Y_t = \alpha_0 + \phi_1 Y_{t-1} + \delta_0 X_t + \delta_1 X_{t-1} + \delta_2 X_{t-2} + \tilde{u}_t.$$
(c) The original four parameters cannot be derived without restrictions since in essence you have five equations in four unknowns. (d) The above model can be specified in quasi-differences, i.e.,
$$(Y_t - \phi_1 Y_{t-1}) = \beta_0(1-\phi_1) + \beta_1(X_t - \phi_1 X_{t-1}) + \beta_2(X_{t-1} - \phi_1 X_{t-2}) + \tilde{u}_t$$
or
$$\tilde{Y}_t = \alpha_0 + \beta_1\tilde{X}_t + \beta_2\tilde{X}_{t-1} + \tilde{u}_t.$$

The parameters now can be estimated using nonlinear least squares, or specifically, the Cochrane-Orcutt, or the iterated Cochrane-Orcutt estimator. 4) A model that attracted quite a bit of interest in macroeconomics in the 1970s was the St. Louis model. The underlying idea was to calculate fiscal and monetary impact and long run cumulative dynamic multipliers, by relating output (growth) to government expenditure (growth) and money supply (growth). The assumption was that both government expenditures and the money supply were exogenous. Estimation of a St. Louis type model using quarterly data from 1960:I-1995:IV results in the following output (HAC standard errors in parentheses):
$$\text{ygrowth}_t = \underset{(0.004)}{0.018} + \underset{(0.079)}{0.006}\,\text{dmgrowth}_t + \underset{(0.091)}{0.235}\,\text{dmgrowth}_{t-1} + \underset{(0.087)}{0.344}\,\text{dmgrowth}_{t-2} + \underset{(0.097)}{0.385}\,\text{dmgrowth}_{t-3} + \underset{(0.069)}{0.425}\,\text{mgrowth}_{t-4}$$
$$+ \underset{(0.049)}{0.170}\,\text{dggrowth}_t - \underset{(0.068)}{0.044}\,\text{dggrowth}_{t-1} - \underset{(0.040)}{0.003}\,\text{dggrowth}_{t-2} - \underset{(0.051)}{0.079}\,\text{dggrowth}_{t-3} + \underset{(0.027)}{0.018}\,\text{ggrowth}_{t-4};$$
$R^2 = 0.346$, SER = 0.03

where ygrowth is quarterly growth of real GDP, mgrowth is quarterly growth of real money supply (M2), and ggrowth is quarterly growth of real government expenditures. “d” in front of ggrowth and mgrowth indicates a change in the variable. (a) Assuming that money and government expenditures are exogenous, what do the coefficients represent? Calculate the h-period cumulative dynamic multipliers from these. How can you test for the statistical significance of the cumulative dynamic multipliers and the long-run cumulative dynamic multiplier? (b) Sketch the estimated dynamic and cumulative dynamic fiscal and monetary multipliers. (c) For these coefficients to represent dynamic multipliers, the money supply and government expenditures must be exogenous variables. Explain why this is unlikely to be the case. As a result, what importance should you attach to the above results? Answer: (a) In that case the coefficients represent dynamic multipliers.

Lag number | Monetary Dynamic Multiplier | Monetary Cumulative Multiplier | Fiscal Dynamic Multiplier | Fiscal Cumulative Multiplier
0 | 0.006 | 0.006 | 0.170 | 0.170
1 | 0.235 | 0.241 | -0.044 | 0.126
2 | 0.344 | 0.585 | -0.003 | 0.123
3 | 0.385 | 0.970 | -0.079 | 0.044
4 | 0.425 | 1.395 | 0.018 | 0.062

To test for the significance of the cumulative dynamic multipliers and the long-run cumulative dynamic multiplier, the equation must be reestimated with all regressors appearing in differences with the exception of the longest lag. The coefficients of these regressors then represent cumulative dynamic multipliers and t-statistics can be used to test for their statistical significance. (b) See the accompanying figures.


(c) There is little reason to believe that these government instruments are exogenous. Even if the monetary base and those components of government expenditures which do not respond to business cycle fluctuations had been chosen rather than the above regressors, these instruments would still respond to changes in the growth rate of GDP. As a matter of fact, government reaction functions were also estimated at the time to capture how government instruments respond to changes in target variables. As a result, the regressors will be correlated with the error term, OLS estimation is inconsistent, and inference is not dependable. It is hard to imagine how usable information can be retrieved from these numbers. 5) Your textbook used a distributed lag model with only the current value $X_t$ and one lagged value $X_{t-1}$, coupled with an AR(1) error model, to derive a quasi-difference model where the error term was uncorrelated. (a) Instead use a static model $Y_t = \beta_0 + \beta_1 X_t + u_t$ here, where the error term follows an AR(1). Derive the quasi-difference form. Explain why in the case of the infeasible GLS estimators you could easily estimate the βs by OLS. (b) Since $\phi_1$ (the autocorrelation parameter for $u_t$) is unknown, describe the Cochrane-Orcutt estimation procedure. (c) Explain how the iterated Cochrane-Orcutt estimator works in this situation. Iterations stop when there is “convergence” in the estimates. What do you think is meant by that? (d) Your textbook has pointed out that the iterated Cochrane-Orcutt GLS estimator is in fact the nonlinear least squares estimator of the model. Given that $-1 < \phi_1 < 1$, suggest a “grid search” or some strategy to “nail down” the value of $\hat{\phi}_1$ which minimizes the sum of squared residuals. This is the so-called Hildreth-Lu method. Answer: (a) The quasi-difference model is derived by multiplying the equation by $\phi_1$ and lagging it, then subtracting the resulting equation from the first equation, and using the AR(1) equation of the error term for simplification of the resulting specification.
$$Y_t = \beta_0 + \beta_1 X_t + u_t$$
$$-[\phi_1 Y_{t-1} = \phi_1\beta_0 + \phi_1\beta_1 X_{t-1} + \phi_1 u_{t-1}]$$
which results in

$$Y_t - \phi_1 Y_{t-1} = \beta_0(1-\phi_1) + \beta_1 X_t - \phi_1\beta_1 X_{t-1} + (u_t - \phi_1 u_{t-1}).$$
Using the quasi-difference notation then yields
$$\tilde{Y}_t = \alpha_0 + \beta_1\tilde{X}_t + \tilde{u}_t.$$
If $\phi_1$ were known, then it would be possible to generate the quasi-differenced variables in a statistical package and then estimate the coefficients by applying OLS to the transformed variables. (b) In this case, nonlinear least squares has to be used to estimate the three parameters. One possible feasible GLS estimator in this case is the Cochrane-Orcutt estimator. In the first step, $\phi_1$ is set to zero, in which case $\beta_0$ and $\beta_1$ can be estimated by OLS. The resulting residuals are then used to calculate the OLS estimator for $\phi_1$. This estimate, in turn, is used to generate the quasi-differenced variables, and OLS is then employed to get the estimates of $\beta_0$ and $\beta_1$. (c) The iterated Cochrane-Orcutt estimator continues the process described in (b). For example, in the next step, a new set of residuals is used to update the previous estimate of $\phi_1$, which will generate a new set of quasi-differenced variables and new estimates of $\beta_0$ and $\beta_1$. The iterations stop when the estimates from one round to the next differ by less than a very small number, which can be chosen by the econometrician. This is then called convergence. (d) Under the Hildreth-Lu method, the sum of squared residuals is computed for various values of $\phi_1$, using quasi-differenced variables. For example, initially a coarse grid is chosen of -0.9, -0.8, -0.7, …, 0.7, 0.8, 0.9. For the value of $\phi_1$ which yields the smallest SSR, say 0.7, a new finer grid is chosen, such as 0.65, 0.66, 0.67, …, 0.73, 0.74, 0.75, and again the SSR is calculated for each of these values. The value of $\phi_1$ which has the smallest SSR is retained and yet a finer grid around it is chosen, etc. 6) (Requires Appendix material) Your textbook states that in “the distributed lag regression model, the error term ut can be correlated with its lagged values. This autocorrelation arises, because, in time series data, the omitted factors that comprise ut can themselves be serially correlated.” (a) Give an example of what the authors have in mind. (b) Consider the ADL model, where the X’s are strictly exogenous, and there is no autocorrelation (and/or heteroskedasticity) in the error term.

$$Y_t = \beta_0^* + \beta_1 X_t + \beta_2 X_{t-1} + \beta_3 Y_{t-1} + \tilde{u}_t$$
How many coefficients are there to be estimated? Show that this model can be respecified using the lag operator notation:
$$\phi(L)Y_t = \beta_0^* + \beta_1\delta(L)X_t + \tilde{u}_t$$
where $\phi(L) = 1 - \beta_3 L$. What is $\delta(L)$ here? (c) Assume heroically that $\beta_3 = -\frac{\beta_2}{\beta_1}$, i.e., that there is a “common factor” in the lag polynomials $\phi(L)$ and $\delta(L)$. Show that in this case the model becomes
$$Y_t = \beta_0 + \beta_1 X_t + u_t$$
where $\beta_0 = \frac{\beta_0^*}{1-\beta_3}$ and $u_t = \frac{1}{1-\beta_3 L}\tilde{u}_t$.

(d) Explain why autocorrelation in this model can be seen as a “simplification,” not a “nuisance.” Can you use the F-test to test the above hypothesis? Why or why not? Answer: (a) Taking the textbook example of the percentage change in the real price of orange juice and the number of freezing degree days, the error term potentially contains other variables such as changes in the tastes of the population, the price of substitutes, income, etc. Some of these variables may be hard to measure, but all of these are bound to change slowly over time and are not likely to be correlated with the weather variable. (b) $(1-\beta_3 L)Y_t = \beta_0^* + \beta_1\left(1 + \frac{\beta_2}{\beta_1}L\right)X_t + \tilde{u}_t$, so $\delta(L) = 1 + \frac{\beta_2}{\beta_1}L$. (c) Dividing both sides by $1 - \beta_3 L$ results in the above equation after cancellation. (d) There is one parameter less to estimate. The restriction is nonlinear, so the F-test does not apply here. 7) It has been argued that Canada’s aggregate output growth and unemployment rates are very sensitive to United States economic fluctuations, while the opposite is not true. (a) A researcher uses a distributed lag model to estimate dynamic causal effects of U.S. economic activity on Canada. The results (HAC standard errors in parentheses) for the sample period 1961:I-1995:IV are:
$$\text{urcan}_t = \underset{(0.83)}{-1.42} + \underset{(0.457)}{0.717}\,\text{urus}_t + \underset{(0.557)}{0.262}\,\text{urus}_{t-1} + \underset{(0.398)}{0.023}\,\text{urus}_{t-2} - \underset{(0.405)}{0.083}\,\text{urus}_{t-3} - \underset{(0.504)}{0.726}\,\text{urus}_{t-4} + \underset{(0.385)}{1.267}\,\text{urus}_{t-5};$$
$R^2 = 0.672$, SER = 1.444

where urcan is the Canadian unemployment rate, and urus is the United States unemployment rate. Calculate the long-run cumulative dynamic multiplier. (b) What are some of the omitted variables that could cause autocorrelation in the error terms? Are these omitted variables likely to be uncorrelated with current and lagged values of the U.S. unemployment rate? Do you think that the U.S. unemployment rate is exogenous in this distributed lag regression? Answer: (a) The long-run cumulative dynamic multiplier is 1.460. (b) Autocorrelation in the error term is the result of omitted variables which are serially correlated. Canadian unemployment rates depend on Canadian labor market conditions and most likely on Canadian aggregate demand variables in the short run. Prime candidates for slowly changing omitted variables would be demographics, indicators of unemployment insurance generosity, changes in the terms of trade, monetary policy indicators such as the real interest rate, etc. Some of these variables are highly likely to be correlated with U.S. unemployment rates since demographics are similar between the two countries and Canadian monetary policy often follows moves made by the Federal Reserve. A case could be made that the U.S. unemployment rate is exogenous as a result of the relative size of the two economies. However, due to the size of the trade between the two countries, this is not as easy to support as if the dependent variable were the unemployment rate in Costa Rica, say.
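For reference, the long-run cumulative dynamic multiplier in 7(a) is simply the sum of the six coefficients on the current and lagged U.S. unemployment rate. Written out as a worked sum:

```latex
0.717 + 0.262 + 0.023 - 0.083 - 0.726 + 1.267 = 1.460
```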


8) Consider the following model $Y_t = \beta_0 + \beta_1 X^e_t + u_t$, where the superscript “e” indicates expected values. This may represent an example where consumption depends on expected, or “permanent,” income. Furthermore, let expected income be formed as follows: $X^e_t = X^e_{t-1} + \lambda(X_t - X^e_{t-1})$; $0 < \lambda < 1$. (a) In the above expectation formation hypothesis, expectations are formed at the end of the period, say the 31st of December, if you had annual data. Give an intuitive explanation for this process. (b) Rewrite the expectations equation in the following form: $X^e_t = (1-\lambda)X^e_{t-1} + \lambda X_t$. Next, following the method used in your textbook, lag both sides of the equation and replace $X^e_{t-1}$. Repeat this process by repeatedly substituting expressions for $X^e_{t-2}$, $X^e_{t-3}$, and so forth. Show that this results in the following equation:
$$X^e_t = \lambda X_t + \lambda(1-\lambda)X_{t-1} + \lambda(1-\lambda)^2 X_{t-2} + ... + \lambda(1-\lambda)^n X_{t-n} + (1-\lambda)^{n+1}X^e_{t-n-1}$$
Explain why it is reasonable to drop the last right hand side term as n becomes large. (c) Substitute the above expression into the original model that related Y to $X^e_t$. Although you now have right hand side variables that are all observable, what do you perceive as a potential problem here if you wanted to estimate this distributed lag model without further restrictions? (d) Lag both sides of the equation, multiply through by $(1-\lambda)$, and subtract this equation from the equation found in (c). This is called a “Koyck transformation.” What does the resulting equation look like? What is the error process? What is the impact effect (zero-period dynamic multiplier) of a unit change in X, and how does it differ from the long-run cumulative dynamic multiplier? Answer: (a) If the forecast error for the previous period, $(X_t - X^e_{t-1})$, was zero, then expectations are not changed for the next period. If there was a non-zero forecast error, then expectations are changed by a fraction of that forecast error. (b) Substitution of $X^e_{t-1} = (1-\lambda)X^e_{t-2} + \lambda X_{t-1}$ into $X^e_t = (1-\lambda)X^e_{t-1} + \lambda X_t$ results in $X^e_t = (1-\lambda)^2 X^e_{t-2} + \lambda X_t + (1-\lambda)\lambda X_{t-1}$. The process is then repeated for $X^e_{t-2}$, which gives $X^e_t = (1-\lambda)^3 X^e_{t-3} + \lambda X_t + (1-\lambda)\lambda X_{t-1} + (1-\lambda)^2\lambda X_{t-2}$, and so on. The last term involving the unobservable expectation can be dropped for large n since $0 < \lambda < 1$. (c) $Y_t = \beta_0 + \beta_1 X^e_t + u_t = \beta_0 + \beta_1\lambda X_t + \beta_1\lambda(1-\lambda)X_{t-1} + \beta_1\lambda(1-\lambda)^2 X_{t-2} + ... + \beta_1\lambda(1-\lambda)^n X_{t-n} + u_t$. For large n, this would require estimation of a large number of coefficients, potentially more than there are observations available on lags of X. (d) The Koyck transformation works as follows

$$Y_t = \beta_0 + \beta_1\lambda X_t + \beta_1\lambda(1-\lambda)X_{t-1} + \beta_1\lambda(1-\lambda)^2 X_{t-2} + ... + \beta_1\lambda(1-\lambda)^n X_{t-n} + u_t$$
$$-[(1-\lambda)Y_{t-1} = (1-\lambda)\beta_0 + \beta_1\lambda(1-\lambda)X_{t-1} + \beta_1\lambda(1-\lambda)^2 X_{t-2} + ... + \beta_1\lambda(1-\lambda)^n X_{t-n} + \beta_1\lambda(1-\lambda)^{n+1}X_{t-n-1} + (1-\lambda)u_{t-1}]$$
which, after canceling terms, results in
$$Y_t = \beta_0\lambda + \beta_1\lambda X_t + (1-\lambda)Y_{t-1} + u_t - (1-\lambda)u_{t-1}$$
where $\beta_1\lambda(1-\lambda)^{n+1}X_{t-n-1}$ has been dropped using the same argument as above. Note that the error process is now a moving average. The impact effect is $\beta_1\lambda$, which is smaller than the long-run cumulative dynamic multiplier $\beta_1$, since $0 < \lambda < 1$. 9) The distributed lag regression model requires estimation of (r+1) coefficients in the case of a single explanatory variable. In your textbook example of orange juice prices and cold weather, r = 18. With additional explanatory variables, this number becomes even larger. Consider the distributed lag regression model with a single regressor
$$Y_t = \beta_0 + \beta_1 X_t + \beta_2 X_{t-1} + \beta_3 X_{t-2} + ... + \beta_{r+1}X_{t-r} + u_t$$
(a) Early econometric analysis of distributed lag regression models was interested in reducing the number of parameters by approximating the coefficients by a polynomial of a suitable degree, i.e., $\beta_{i+1} \approx f(i)$ for i = 0, 1, …, r. Let f(i) be a third degree polynomial, with coefficients $\alpha_0, ..., \alpha_3$. Specify the equations for $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, and $\beta_{r+1}$. (b) Substitute these equations into the original distributed lag regression, and rearrange terms so that Y appears as a linear function of $\beta_0, \alpha_0, \alpha_1, \alpha_2, \alpha_3$ and a transformation of $X_t, X_{t-1}, X_{t-2}, ..., X_{t-r}$. (c) Assume that the third-degree polynomial approximation is quite accurate. Then what is the advantage of this polynomial lag technique? Answer: (a) For a third degree polynomial, $f(i) = \alpha_0 + \alpha_1 i + \alpha_2 i^2 + \alpha_3 i^3$. Then
$\beta_1 = f(0) = \alpha_0$
$\beta_2 = f(1) = \alpha_0 + \alpha_1 + \alpha_2 + \alpha_3$
$\beta_3 = f(2) = \alpha_0 + 2\alpha_1 + 4\alpha_2 + 8\alpha_3$
$\beta_4 = f(3) = \alpha_0 + 3\alpha_1 + 9\alpha_2 + 27\alpha_3$
...
$\beta_{r+1} = f(r) = \alpha_0 + r\alpha_1 + r^2\alpha_2 + r^3\alpha_3$
(b) Substitution into the original distributed lag regression yields
$$Y_t = \beta_0 + \alpha_0 X_t + (\alpha_0 + \alpha_1 + \alpha_2 + \alpha_3)X_{t-1} + (\alpha_0 + 2\alpha_1 + 4\alpha_2 + 8\alpha_3)X_{t-2} + ... + (\alpha_0 + r\alpha_1 + r^2\alpha_2 + r^3\alpha_3)X_{t-r} + u_t$$
and collecting terms in the coefficients results in
$$Y_t = \beta_0 + \alpha_0(X_t + X_{t-1} + X_{t-2} + ... + X_{t-r}) + \alpha_1(X_{t-1} + 2X_{t-2} + ... + rX_{t-r}) + \alpha_2(X_{t-1} + 4X_{t-2} + ... + r^2 X_{t-r}) + \alpha_3(X_{t-1} + 8X_{t-2} + ... + r^3 X_{t-r}) + u_t.$$
(c) By placing restrictions on the lag distribution and transforming the regressors, there are fewer parameters to estimate, in this case five.
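The regressor transformation in 9(b) can be illustrated numerically. The following is a minimal sketch, not taken from the textbook: the function name `polynomial_lag_regressors`, the simulated X series, and the values in `alpha` are all hypothetical. It builds the four transformed regressors and confirms that they reproduce the unrestricted distributed lag when the lag coefficients lie on the cubic.

```python
import numpy as np

def polynomial_lag_regressors(x, r, degree=3):
    """Return the lag matrix, the matrix of powers i**j, and Z_j = sum_i i**j * x[t-i]."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    # Column i of `lags` holds x[t-i] for t = r, ..., T-1.
    lags = np.column_stack([x[r - i:T - i] for i in range(r + 1)])
    # powers[i, j] = i**j for lag i and polynomial power j.
    powers = np.vander(np.arange(r + 1), degree + 1, increasing=True).astype(float)
    return lags, powers, lags @ powers

rng = np.random.default_rng(0)
x = rng.normal(size=200)                     # simulated regressor (illustration only)
r = 18                                       # lag length, as in the orange juice example
alpha = np.array([1.0, -0.2, 0.01, 0.0002])  # hypothetical polynomial coefficients

lags, powers, Z = polynomial_lag_regressors(x, r)
beta = powers @ alpha                        # implied lag coefficients beta_{i+1} = f(i)

unrestricted = lags @ beta                   # sum over i of beta_{i+1} * X_{t-i}
restricted = Z @ alpha                       # sum over j of alpha_j * Z_{j,t}
print(np.allclose(unrestricted, restricted))  # True
print(lags.shape[1] + 1, "coefficients unrestricted,", Z.shape[1] + 1, "with the polynomial")
```

Counting the intercept, the unrestricted regression has 20 coefficients for r = 18, while the polynomial version has five, which is the saving described in (c).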


10) The distributed lag model relating orange juice prices to the Orlando weather reported in the text was of the form $\%ChgP_t = \beta_0 + \beta_1 FDD_t + \beta_2 FDD_{t-1} + \beta_3 FDD_{t-2} + ... + \beta_{19}FDD_{t-18} + u_t$ (a) Suppose that an agricultural economist tells you that a freeze in December is more harmful than a freeze in the other months. How would you modify the regression to incorporate this effect? How would you test for this December effect? (b) The same economist tells you that the damage caused by freezes is not well captured by the FDD variable. She says that a single day with a temperature of 24° is more damaging than 8 days with a temperature of 31°. How would you modify the regression to incorporate this effect? Answer: (a) A binary variable can be added to the list of regressors, which takes on the value of one in December and is zero otherwise. A t-statistic can be computed for the coefficient of the December binary variable, using HAC standard errors. The t-statistic has a standard normal distribution. (b) An additional regressor (TempFreeze) can be introduced, either by itself or interacted with FDD. To capture the postulated effect, it might be specified as follows:
$$TempFreeze_t = DFreeze_t \times \sum_{i=1}^{FDD}(Temp_t - 32°)^2$$
where DFreeze is a binary variable that takes on the value of one for a month with freezing temperature, and Temp is the minimum temperature for any monthly freezing degree day.
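The economist's point in 10(b) can be made concrete with a small sketch. The two helper functions below are illustrative only; they assume that FDD sums the degrees below 32° over freezing days, and they read the TempFreeze formula as a sum of squared shortfalls over freezing days (the `if t < threshold` filter plays the role of DFreeze).

```python
def fdd(daily_min_temps, threshold=32.0):
    """Freezing degree days: total degrees below the threshold over freezing days."""
    return sum(threshold - t for t in daily_min_temps if t < threshold)

def temp_freeze(daily_min_temps, threshold=32.0):
    """Sum of squared shortfalls below freezing; zero when no day freezes."""
    return sum((t - threshold) ** 2 for t in daily_min_temps if t < threshold)

one_hard_freeze = [24.0]           # a single day at 24 degrees
eight_mild_freezes = [31.0] * 8    # eight days at 31 degrees

print(fdd(one_hard_freeze), fdd(eight_mild_freezes))                  # 8.0 8.0
print(temp_freeze(one_hard_freeze), temp_freeze(eight_mild_freezes))  # 64.0 8.0
```

Both months have the same FDD, so the original regressor cannot distinguish them, whereas the squared measure weights the single hard freeze eight times as heavily.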

11) (Requires some calculus) In the following, assume that Xt is strictly exogenous and that economic theory suggests that, in equilibrium, the following relationship holds between Y* and Xt, where the “*” indicates equilibrium. Y* = kXt

An error term could be added here by assuming that even in equilibrium, random variations from strict proportionality might occur. Next let there be adjustment costs when changing Y, e.g. costs associated with changes in employment for firms. As a result, an entity might be faced with two types of costs: being out of equilibrium and the adjustment cost. Assume that these costs can be modeled by the following quadratic loss function: $L = \lambda_1(Y_t - Y^*)^2 + \lambda_2(Y_t - Y_{t-1})^2$

Minimize the loss function w.r.t. the only variable that is under the entity’s control, Yt and solve for Yt.

Note that the two weights on Y* and Yt-1 add up to one. To simplify notation, let the first weight be θ and the second weight (1-θ). Substitute the original expression for Y* into this equation. In terms of the ADL(p,q) terminology, what are the values for p and q in this model?

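Answer: a. $Y_t = \frac{\lambda_1}{\lambda_1+\lambda_2}Y^* + \frac{\lambda_2}{\lambda_1+\lambda_2}Y_{t-1}$

b. $Y_t = \theta Y^* + (1-\theta)Y_{t-1} = \theta k X_t + (1-\theta)Y_{t-1} = \phi_1 Y_{t-1} + \delta_1 X_t$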
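The step from the loss function to answer a. is a one-line first-order condition. The derivation below is a sketch that writes the weight on the adjustment cost as $\lambda_2$ (an assumption consistent with the denominators in the answer):

```latex
\frac{\partial L}{\partial Y_t}
  = 2\lambda_1 (Y_t - Y^*) + 2\lambda_2 (Y_t - Y_{t-1}) = 0
\quad\Longrightarrow\quad
Y_t = \frac{\lambda_1}{\lambda_1+\lambda_2}\,Y^* + \frac{\lambda_2}{\lambda_1+\lambda_2}\,Y_{t-1}
```

The two weights sum to one, so setting $\theta = \lambda_1/(\lambda_1+\lambda_2)$ gives the partial adjustment form in answer b.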


12) Your textbook estimates the initial relationship between the percentage change of real frozen OJ and the freezing degree days as follows: $\%ChgP_t = \underset{(0.22)}{-0.40} + \underset{(0.13)}{0.47}\,FDD_t$, t = 1950:1 - 2000:12, $R^2 = 0.09$, SER = 4.8

Calculate the t-statistic for the slope coefficient. Can you reject the null hypothesis that the coefficient is zero in the population?

The above regression was estimated using HAC standard errors. When you re-estimate the regression using homoskedasticity-only standard errors, the standard error of the slope coefficient drops to 0.06. Calculate the t-statistic for the slope coefficient again. Which of the two standard errors should you use for statistical inference?

Answer: a. The t-statistic is 3.62. Hence you can reject the null hypothesis at any reasonable level of significance. b. The t-statistic has now increased to about 7.83 (0.47/0.06). In the presence of heteroskedasticity and/or autocorrelation in the errors, OLS estimation of the regression coefficients is still consistent. However, the homoskedasticity-only or heteroskedasticity-robust standard errors are inconsistent and use of these in the presence of serial correlation results in misleading statistical inference. For example, confidence intervals do not contain the true value in the postulated number of times in repeated samples. The solution is to adjust the estimator for the standard errors by incorporating sample autocorrelation estimates. This results in the heteroskedasticity- and autocorrelation-consistent (HAC) estimator of the variance of the estimator. For this estimator to be consistent, a certain truncation parameter is introduced, so that not all T-1 sample autocorrelations are used. Incorporating this idea into the HAC formula results in the Newey-West variance estimator.
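To make the role of the truncation parameter concrete, here is a minimal NumPy sketch of OLS with Newey-West standard errors. It is an illustration under stated assumptions, not the textbook's code: the function name `ols_newey_west` is made up, the weights are the Bartlett weights (m - j)/m for lags j = 1, ..., m - 1, no degrees-of-freedom correction is applied, and the data are simulated.

```python
import numpy as np

def ols_newey_west(y, X, m):
    """OLS coefficients with Newey-West (Bartlett kernel) standard errors.

    m is the truncation parameter: autocovariances at lags 1..m-1 get weight (m-j)/m.
    With m = 1 the result reduces to heteroskedasticity-robust (HC0) standard errors.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ beta
    Xu = X * u[:, None]                 # row t is x_t * u_t
    S = Xu.T @ Xu                       # lag-0 term
    for j in range(1, m):
        G = Xu[j:].T @ Xu[:-j]          # sum over t of (x_t u_t)(x_{t-j} u_{t-j})'
        S += (m - j) / m * (G + G.T)
    bread = np.linalg.inv(X.T @ X)
    V = bread @ S @ bread
    return beta, np.sqrt(np.diag(V))

# Simulated example: both the regressor and the error follow an AR(1) process.
rng = np.random.default_rng(1)
T = 600
x = np.zeros(T)
u = np.zeros(T)
ex, eu = rng.normal(size=T), rng.normal(size=T)
for t in range(1, T):
    x[t] = 0.8 * x[t - 1] + ex[t]
    u[t] = 0.8 * u[t - 1] + eu[t]
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(T), x])

beta, se_hac = ols_newey_west(y, X, m=8)
_, se_robust = ols_newey_west(y, X, m=1)
print("coefficients:", beta)
print("Newey-West SE (m=8):", se_hac)
print("heteroskedasticity-robust SE (m=1):", se_robust)
```

Comparing the two sets of standard errors on serially correlated data shows how much inference can change once the sample autocorrelations are taken into account.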


13) You are hired to forecast the unemployment rate in a geographical area that is peripheral to a large metropolitan area in the United States. The area in question is called the Inland Empire (San Bernardino County and Riverside County) and is situated east of Greater Los Angeles (Los Angeles County and Orange County). While the area has a large population (it is the 14th largest metropolitan statistical area in the United States), its economic activity relies heavily on that of the larger area it is attached to. For example, it is estimated that approximately 20% of its workforce commutes into the Greater Los Angeles area for work and few workers commute the other way. Furthermore, its logistics industry is heavily dependent on economic activity in the Greater Los Angeles Area. As a result, you view the unemployment rate of the Greater Los Angeles Area ($ur^{GLA}$) to be exogenous in determining the unemployment rate in the Inland Empire ($ur^{IE}$). You estimate the following distributed lag model, where numbers in parentheses are HAC standard errors:
$$\Delta ur^{IE}_t = 0.00002 + 0.74\,\Delta ur^{GLA}_t - 0.04\,\Delta ur^{GLA}_{t-1} - 0.01\,\Delta ur^{GLA}_{t-2} + 0.07\,\Delta ur^{GLA}_{t-3} + 0.05\,\Delta ur^{GLA}_{t-4} + 0.09\,\Delta ur^{GLA}_{t-5} + 0.10\,\Delta ur^{GLA}_{t-6}$$
(0.00010) (0.06) (0.06) (0.05) (0.06)
t = 1991:01-2009:12, $R^2 = 0.60$, SER = 0.001

What is the impact effect of a one percentage point increase (say from 0.06 to 0.07) of the unemployment rate in the Greater Los Angeles area?

What is the long-run cumulative dynamic multiplier?

Why do you think the variables above appear in changes rather than in levels?

Answer: a. The unemployment rate in the Inland Empire will increase by 0.0074, or roughly three-quarters of a percentage point. b. The unemployment rate in the Inland Empire will increase by roughly one percentage point in the long run. c. The implication must be that the unemployment rates are not stationary over the sample period.


14) There is some economic research which suggests that oil prices play a central role in causing recessions in developed countries. Some of this work suggests that it is only oil price increases that matter and even more specifically, that it is the percentage point difference between oil prices at date t and the maximum value over the previous year. Realizing that energy prices in general can fluctuate quite dramatically in both directions and that geographic areas also benefit substantially from oil price decreases, you decide to estimate the following distributed lag model using annual data (numbers in parentheses are HAC standard errors):
$$\hat{Y}_t = \underset{(0.27)}{3.39} - \underset{(0.010)}{0.009}\,(P_{oil}/CPI)_t - \underset{(0.011)}{0.028}\,(P_{oil}/CPI)_{t-1}$$
t = 1960-2008, $R^2 = 0.15$, SER = 1.88

What is the impact effect of a 25 percent increase in real oil prices?

What is the predicted cumulative change in GDP Growth over two years of this effect?

The HAC F-statistic is 4.07. Can you reject the null hypothesis that oil price changes have no effect on real GDP growth? What is the critical value you considered? Is there any reason why you should be cautious using an F-test in this case, given the sample period?

Answer: a. GDP growth would decrease by almost a quarter of a percentage point (-0.009 × 25 = -0.225). b. The predicted decline in growth would be almost one percentage point ((-0.009 - 0.028) × 25 = -0.925). c. The critical value of $F_{2,\infty}$ is 3.00 at the 5% significance level. Since 4.07 > 3.00, you can reject the null hypothesis that oil prices have no effect on real GDP growth. However, since the sample period involves only 50 or so observations, it is not clear that the test statistic actually follows the large-sample F-distribution.
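A short Python sketch of these calculations follows. It uses the reported coefficients and F-statistic; the helper `f_2_inf_critical_value` is a name introduced here, and it relies on the fact that an $F_{2,\infty}$ variable is a chi-squared variable with 2 degrees of freedom divided by 2, whose quantile has a closed form.

```python
import math

# Coefficients on the current and lagged oil price variable (from the equation).
beta_0 = -0.009
beta_1 = -0.028

increase = 25.0  # a 25 percent increase in real oil prices

impact_effect = beta_0 * increase                 # same-year effect
cumulative_effect = (beta_0 + beta_1) * increase  # effect over two years


def f_2_inf_critical_value(significance):
    """Critical value of F(2, infinity) at the given significance level.

    The chi-squared(2) CDF is 1 - exp(-x / 2), so its upper quantile is
    -2 * ln(significance); dividing by q = 2 gives the F critical value.
    """
    chi2_quantile = -2.0 * math.log(significance)
    return chi2_quantile / 2.0


hac_f_statistic = 4.07
critical_value = f_2_inf_critical_value(0.05)
reject = hac_f_statistic > critical_value

print(f"impact effect:     {impact_effect:.3f}")
print(f"cumulative effect: {cumulative_effect:.3f}")
print(f"5% critical value: {critical_value:.2f}, reject: {reject}")
```

The large-sample critical value is only an approximation here, which is the reason for the caution expressed in part c.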


Chapter 16 Additional Topics in Time Series Regression 16.1 Multiple Choice 1) A vector autoregression A) is the ADL model with an AR process in the error term. B) is the same as a univariate autoregression. C) is a set of k time series regressions, in which the regressors are lagged values of all k series. D) involves errors that are autocorrelated but can be written in vector format. Answer: C 2) A multiperiod regression forecast h periods into the future based on an AR(p) is computed A) the same way as the iterated AR forecast. B) by estimating the multiperiod regression $Y_t = \delta_0 + \delta_1 Y_{t-h} + \dots + \delta_p Y_{t-p-h+1} + u_t$, then using the estimated coefficients to compute the forecast h periods in advance. C) by estimating the multiperiod regression $Y_t = \delta_0 + \delta_1 Y_{t-h} + u_t$, then using the estimated coefficients to compute the forecast h periods in advance. D) by first computing the one-period ahead forecast, next using that to compute the two-period ahead forecast, and so forth. Answer: B 3) Multiperiod forecasting with multiple predictors A) is the same as the iterated AR forecast method. B) can use the iterated VAR forecast method. C) will yield superior results when using the multiperiod regression forecast h periods into the future based on p lags of each $Y_t$, rather than the iterated VAR forecast method. D) will always yield superior results using the iterated VAR since it takes all equations into account. Answer: B 4) If $Y_t$ is I(2), then A) $\Delta^2 Y_t$ is stationary. B) $Y_t$ has a unit autoregressive root. C) $\Delta Y_t$ is stationary. D) $Y_t$ is stationary. Answer: A 5) The following is not a consequence of $X_t$ and $Y_t$ being cointegrated: A) if $X_t$ and $Y_t$ are both I(1), then for some $\theta$, $Y_t - \theta X_t$ is I(0). B) $X_t$ and $Y_t$ have the same stochastic trend. C) in the expression $Y_t - \theta X_t$, $\theta$ is called the cointegrating coefficient. D) if $X_t$ and $Y_t$ are cointegrated then integrating one of the variables gives you the same result as integrating the other.
Answer: D 6) One advantage of forecasts based on a VAR rather than separately forecasting the variables involved is A) that VAR forecasts are easier to calculate. B) you typically have knowledge of future values of at least one of the variables involved. C) it can help to make the forecasts mutually consistent. D) that VAR involves panel data. Answer: C


7) The coefficients of the VAR are estimated by A) using a simultaneous estimation method such as TSLS. B) maximum likelihood. C) panel methods. D) estimating each of the equations by OLS. Answer: D 8) Under the VAR assumptions, the OLS estimators are A) consistent and have a joint normal distribution even in small samples. B) BLUE. C) consistent and have a joint normal distribution in large samples. D) unbiased. Answer: C 9) A VAR allows you to test joint hypotheses that involve restrictions across multiple equations by A) computing a z-statistic. B) computing the BIC but not the AIC. C) using a stability test. D) computing an F-statistic. Answer: D 10) A VAR with five variables, 4 lags and constant terms for each equation will have a total of A) 21 coefficients. B) 100 coefficients. C) 105 coefficients. D) 84 coefficients. Answer: C 11) You can determine the lag lengths in a VAR A) by using confidence intervals. B) by using critical values from the standard normal table. C) by using either F-tests or information criteria. D) with help from economic theory and institutional knowledge. Answer: C 12) The biggest conceptual difference between using VARs for forecasting and using them for structural modeling is that A) you need to use the Granger causality test for structural modeling. B) structural modeling requires very specific assumptions derived from economic theory and institutional knowledge of what is exogenous and what is not. C) you can no longer use the information criteria to decide on the lag length. D) structural modeling only allows a maximum of three equations in the VAR. Answer: B 13) The error term in a multiperiod regression A) is serially correlated. B) causes OLS to be inconsistent. C) is serially correlated, but less so the longer the forecast horizon. D) is serially uncorrelated. Answer: A


14) $\Delta^2 Y_t$ A) $= \Delta Y_t - \Delta Y_{t-1}$. B) $= Y_t^2 - Y_{t-1}^2$. C) $= \Delta Y_t - \Delta Y_{t-2}$. D) $= Y_t - Y_{t-2}$. Answer: A 15) The order of integration A) can never be zero. B) is the number of times that the series needs to be differenced for it to be stationary. C) is the value of $\phi_1$ in the quasi difference ($\Delta Y_t - \phi_1 Y_{t-1}$). D) depends on the number of lags in the VAR specification. Answer: B 16) To test the null hypothesis of a unit root, the ADF test A) has higher power than the so-called DF-GLS test. B) uses complicated iterative techniques. C) cannot be calculated if the variable is integrated of order two or higher. D) uses a t-statistic and a special critical value. Answer: D 17) Unit root tests A) use the standard normal distribution since they are based on the t-statistic. B) cannot use the standard normal distribution for statistical inference. As a result the ADF statistic has its own special table of critical values. C) can use the standard normal distribution only when testing that the level variable is stationary, but not the difference variable. D) can use the standard normal distribution but only if HAC standard errors were computed. Answer: B 18) In a VECM, A) past values of $Y_t - \theta X_t$ help to predict future values of $\Delta Y_t$ and/or $\Delta X_t$. B) errors are corrected for serial correlation using the Cochrane-Orcutt method. C) current values of $Y_t - \theta X_t$ help to predict future values of $\Delta Y_t$ and/or $\Delta X_t$. D) VAR techniques, such as information criteria, no longer apply. Answer: A 19) The following is not an appropriate way to tell whether two variables are cointegrated: A) see if the two variables are integrated of the same order. B) graph the series and see whether they appear to have a common stochastic trend. C) perform statistical tests for cointegration. D) use expert knowledge and economic theory. Answer: A 20) If $X_t$ and $Y_t$ are cointegrated, then the OLS estimator of the coefficient in the cointegrating regression is A) BLUE. B) unbiased when using HAC standard errors.
C) unbiased even in small samples. D) consistent. Answer: D


21) Assume that you have used the OLS estimator in the cointegrating regression and test the residual for a unit root using an ADF test. The resulting ADF test statistic has a A) normal distribution in large samples. B) non-normal distribution which requires ADF critical values for inference. C) non-normal distribution which requires EG-ADF critical values for inference. D) normal distribution when HAC standard errors are used. Answer: C 22) The DOLS estimator has the following property if $X_t$ and $Y_t$ are cointegrated: A) it is BLUE even in small samples. B) it is efficient in large samples. C) it has a standard normal distribution when homoskedasticity-only standard errors are used. D) it has a non-normal distribution in large samples when HAC standard errors are used. Answer: B 23) Volatility clustering A) is evident in most cross-sections. B) implies that a series is serially correlated. C) can mostly be found in studies of the labor market. D) is evident in many financial time series. Answer: D 24) Using the ADL(1,1) regression $Y_t = \beta_0 + \beta_1 Y_{t-1} + \gamma_1 X_{t-1} + u_t$, the ARCH model for the regression error assumes that $u_t$ is normally distributed with mean zero and variance $\sigma^2_t$, where A) $\sigma^2_t = \alpha_0 + \alpha_1 u^2_{t-1} + \alpha_2 u^2_{t-2} + \dots + \alpha_p u^2_{t-p}$. B) $\sigma^2_t = u^2_{t-1} + \dots + u^2_{t-p} + \phi_1 \sigma^2_{t-1} + \dots + \phi_q \sigma^2_{t-q}$. C) $\sigma^2_t = \phi_1 \sigma^2_{t-1} + \dots + \phi_q \sigma^2_{t-q}$. D) $\sigma^2_t = \alpha_0 + \alpha_1 u^2_{t-1} + \dots + \alpha_p u^2_{t-p} + \phi_1 \sigma^2_{t-1} + \dots + \phi_q \sigma^2_{t-q}$. Answer: A 25) ARCH and GARCH models are estimated using the A) OLS estimation method. B) method of maximum likelihood. C) DOLS estimation method. D) VAR specification.
Answer: B 26) A VAR with k time series variables consists of A) k equations, one for each of the variables, where the regressors in all equations are lagged values of all the variables B) a single equation, where the regressors are lagged values of all the variables C) k equations, one for each of the variables, where the regressors in all equations are never more than one lag of all the variables D) k equations, one for each of the variables, where the regressors in all equations are current values of all the variables Answer: A

27) The BIC for the VAR is A) $BIC(p) = \ln[\det(\hat{\Sigma}_u)] + k(kp+1)\frac{2}{T}$ B) $BIC(p) = \ln[\det(\hat{\Sigma}_u)] + k(p+1)\frac{\ln(T)}{T}$ C) $BIC(p) = \ln[\det(\hat{\Sigma}_u)] + k(kp+1)\frac{\ln(T)}{T}$ D) $BIC(p) = \ln[SSR(p)] + k(p+1)\frac{\ln(T)}{T}$ Answer: C 28) The lag length in a VAR using the BIC proceeds as follows: Among a set of candidate values of p, the estimated lag length $\hat{p}$ is the value of p A) For which the BIC exceeds the AIC B) That maximizes BIC(p) C) Cannot be determined here since a VAR is a system of equations, not a single one D) That minimizes BIC(p) Answer: D 29) The dynamic OLS (DOLS) estimator of the cointegrating coefficient, if $Y_t$ and $X_t$ are cointegrated, A) is efficient in large samples B) statistical inference about the cointegrating coefficient is valid C) the t-statistic constructed using the DOLS estimator with HAC standard errors has a standard normal distribution in large samples D) all of the above Answer: D 30) The EG-ADF test A) is similar to the DF-GLS test B) is a test for cointegration C) has as a limitation that it can only test if two variables, but not more than two, are cointegrated D) uses the ADF in the second step of its procedure Answer: B
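The coefficient count in question 10 and the BIC rule in questions 27 and 28 can be sketched in Python as follows. The function names and the values in `hypothetical_log_dets` are assumptions made for illustration; they are not estimates from any data set.

```python
import math


def var_coefficient_count(k, p):
    """Total coefficients in a VAR with k variables, p lags, and intercepts."""
    # Each of the k equations has k * p lag coefficients plus one intercept.
    return k * (k * p + 1)


def var_bic(log_det_sigma, k, p, T):
    """BIC(p) = ln[det(Sigma_u_hat)] + k(kp + 1) ln(T) / T."""
    return log_det_sigma + var_coefficient_count(k, p) * math.log(T) / T


def choose_lag_length(log_dets_by_p, k, T):
    """Return the candidate p that minimizes BIC(p)."""
    return min(log_dets_by_p, key=lambda p: var_bic(log_dets_by_p[p], k, p, T))


print(var_coefficient_count(k=5, p=4))  # 105, as in question 10

# Hypothetical values of ln[det(Sigma_u_hat)] for p = 1, ..., 4 (illustration only).
hypothetical_log_dets = {1: -1.20, 2: -1.45, 3: -1.48, 4: -1.50}
chosen_p = choose_lag_length(hypothetical_log_dets, k=2, T=200)
print(chosen_p)
```

In this made-up example the fit improves with every added lag, but beyond two lags the improvement is too small to offset the penalty term, so the BIC selects p = 2.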

16.2 Essays and Longer Questions 1) “Heteroskedasticity typically occurs in cross-sections, while serial correlation is typically observed in time-series data.” Discuss and critically evaluate this statement. Answer: Serial correlation in cross-sections can occur by chance if the data is ordered using one of the regressors. While it is easy to get rid of serial correlation in this case by simply “reshuffling” the data, the serial correlation contains some information, such as a possible misspecification of functional form. Serial correlation does occur typically in time-series data, but as the textbook emphasized, conditional heteroskedasticity “shows up in many economic time series.” The ARCH and GARCH models are often used when volatility clustering is present in financial time series, including the inflation rate. Hence this special type of heteroskedasticity is observed in time-series data.


2) Some macroeconomic theories suggest that there is a short-run relationship between the inflation rate and the unemployment rate. How would you go about forecasting these two variables? Suggest various alternatives and discuss their advantages and disadvantages. Answer: There are various methods available for forecasting the inflation rate and the unemployment rate. One basic distinction is whether the two variables are forecasted separately or jointly as a system of two equations. Another distinction involves one period ahead forecasts vs. multiperiod forecasts. Finally, if multiperiod forecasts are used, then there is a multiperiod forecasting regression method vs. an iterated forecast method.

Univariate Regression Methods: Here either the change in the inflation rate or the unemployment rate is modeled as an AR(p) and estimated by OLS. Observed values for the regressors are then substituted to produce a one period ahead forecast. (The one period ahead forecast for the inflation rate can then also be derived.) Statistical methods, such as the BIC or AIC, can be used for choosing the number of lags. There are two important properties of the forecasts: the best forecast of either the change of the inflation rate or the unemployment rate depends only on the most recent p past values, and the errors are serially uncorrelated. These follow from the OLS assumptions. The multiperiod regression method for making an h-period ahead forecast of the change in inflation or unemployment rate using the AR(p) involves regressing the variable on p of its lags, starting from (t-h), i.e., $Y_t = \delta_0 + \delta_1 Y_{t-h} + \dots + \delta_p Y_{t-p-h+1} + u_t$. Since the error term is serially correlated for the multiperiod regression, HAC standard errors must be used to have a reliable basis for inference. The iterated AR forecast method for the AR(p) is achieved by forecasting one period ahead initially, then using the forecasted value for the two period ahead forecast, and so on. More formally, the two-period ahead forecast is

$\hat{Y}_{t|t-2} = \hat{\beta}_0 + \hat{\beta}_1 \hat{Y}_{t-1|t-2} + \hat{\beta}_2 Y_{t-2} + \hat{\beta}_3 Y_{t-3} + \dots + \hat{\beta}_p Y_{t-p}$,

while the three-period ahead forecast is

$\hat{Y}_{t|t-3} = \hat{\beta}_0 + \hat{\beta}_1 \hat{Y}_{t-1|t-3} + \hat{\beta}_2 \hat{Y}_{t-2|t-3} + \hat{\beta}_3 Y_{t-3} + \dots + \hat{\beta}_p Y_{t-p}$, etc.

Multiple Predictors: If economic theory suggests that other variables could help forecast either the change in the inflation rate or the unemployment rate, then lags of these variables can be included. The Granger-causality test can be used to determine whether or not these additional variables belong in the regression. The same methods that were used for the AR(p) model can be applied for the ADL(p,q) model. For example, in multiperiod forecasting with multiple predictors, all regressors must be lagged h periods to produce the h-period ahead forecast. To forecast both the change in the inflation rate and the unemployment rate, regressions for each of the two dependent variables have to be estimated first, i.e., for both variables the following regression is estimated by OLS: $Y_t = \delta_0 + \delta_1 Y_{t-h} + \dots + \delta_p Y_{t-p-h+1} + \delta_{p+1} X_{t-h} + \dots + \delta_{2p} X_{t-p-h+1} + u_t$. Then the estimated coefficients are used to make the h-period ahead forecast. The iterated forecast method now involves making one-period ahead forecasts using the estimated VAR specification, and using these forecasted values for both variables in subsequent forecasts. The two period ahead forecast for one of the variables, for example, would be calculated as follows:

$\hat{Y}_{t|t-2} = \hat{\beta}_{10} + \hat{\beta}_{11} \hat{Y}_{t-1|t-2} + \hat{\beta}_{12} Y_{t-2} + \hat{\beta}_{13} Y_{t-3} + \dots + \hat{\beta}_{1p} Y_{t-p} + \hat{\gamma}_{11} \hat{X}_{t-1|t-2} + \hat{\gamma}_{12} X_{t-2} + \hat{\gamma}_{13} X_{t-3} + \dots + \hat{\gamma}_{1p} X_{t-p}$.

The decision on which method to use depends on the quality of the specification. If the AR(p) or the VAR is a good approximation to the underlying relationship, then the iterated forecast method is better. Note that if multiple predictors are involved, the ADL is not an alternative, since the additional predictors have to be forecasted themselves. However, if even one of the VAR equations is not a good representation of the underlying process, then the multiperiod regression forecasts are more accurate on average. Since the difference between the two methods is typically small, the textbook suggests using the one “which is most conveniently implemented in your software.”
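The contrast between the iterated and the multiperiod (direct) regression methods can be sketched in Python. As a simplifying assumption, the sketch uses an AR(1) instead of a general AR(p), simulated data instead of actual inflation or unemployment data, and helper names (`ols_simple`, `iterated_forecast`, `direct_forecast`) introduced only for this illustration.

```python
import random


def ols_simple(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def iterated_forecast(series, h):
    """Estimate an AR(1) by OLS and iterate the one-step forecast h times."""
    b0, b1 = ols_simple(series[:-1], series[1:])
    forecast = series[-1]
    for _ in range(h):
        forecast = b0 + b1 * forecast
    return forecast


def direct_forecast(series, h):
    """Regress Y_t on Y_{t-h} and forecast h periods ahead in one step."""
    d0, d1 = ols_simple(series[:-h], series[h:])
    return d0 + d1 * series[-1]


# Simulated stationary AR(1) data (illustration only, not real data).
random.seed(0)
y = [0.0]
for _ in range(499):
    y.append(1.0 + 0.6 * y[-1] + random.gauss(0.0, 1.0))

h = 2
print("iterated:", iterated_forecast(y, h))
print("direct:  ", direct_forecast(y, h))
```

For h = 1 the two functions return the same number, and because the simulated data really do follow an AR(1), the two-period forecasts should also be close, in line with the remark that the difference between the methods is typically small.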


3) Think of at least five examples from economics where theory suggests that the variables involved are cointegrated. For one of these cases, explain how you would test for cointegration between the variables involved and how you could use this information to improve forecasting. Answer: Answers will vary by student, but given the textbook example of the three-month and one-year interest rates, you can expect students to list it. Consumption and income, real money balances, income and the interest rate (or income velocity and the interest rate), purchasing power parity, and inflation rates across countries are prime candidates. I will use the example of real consumption and income to explain how to test for cointegration and how to potentially incorporate the information into forecasting. Both (the log of) consumption and income should be plotted over time to check whether they give the appearance of having a common stochastic trend. Furthermore, economic theory suggests that they are proportional to each other, although the factor of proportionality may depend on other variables. Under the null hypothesis, $C_t - \theta Y_t$ has a unit root, where C is the log of consumption and Y is the log of disposable income. If $\theta$ were known, then the DF or DF-GLS unit root tests could be employed here, but since it is not, the cointegrating coefficient has to be estimated first by OLS, which is consistent if consumption and disposable income are cointegrated. The resulting residuals from the regression $C_t = \alpha + \theta Y_t + z_t$ are then subjected to a DF t-test with an intercept and no time trend. The t-statistic is compared to the critical values for the EG-ADF statistic, and if it is more negative than the critical value, then the null hypothesis is rejected in favor of consumption and disposable income being cointegrated.

The lag of the estimated error correction term ($C_t - \hat{\theta} Y_t$) can then be used as an additional regressor in a VAR specification to predict both the growth rate of real consumption and the growth rate of real disposable income. This specification is known as the vector error correction model (VECM). 4) What role does the concept of cointegration and the order of integration play in modeling the relationship between variables? Explain how tests of cointegration work. Answer: Cointegration between two or more variables is a regression analysis concept to potentially reveal long-run relationships among time series variables. Variables are said to be cointegrated if they have the same stochastic trend in common. Most economic time series are I(1) variables, which means that they have a unit autoregressive root and that the first difference of the variable is stationary. Since these variables are often measured in logs, their first differences approximate growth rates. Cointegration requires a common stochastic trend. Therefore, variables which are tested for cointegration must have the same order of integration. The concept of cointegration is also an effort to bring back long-run relationships between variables into short-run forecasting techniques, such as VARs. Adding the error correction term from the cointegrating relationship to the VAR results in the vector error correction model. Here all variables are stationary, either because they have been differenced or because the common stochastic trend has been removed. VECMs therefore combine short-run and long-run information. One way to think about the role of the error correction term is that it provides an “anchor” which pulls the modeled relationships eventually back to their long-run behavior, even if they are disturbed by shocks in the short run. Cointegration also represents the return of the static regression model, i.e., regressions where no lags are used.
Testing for cointegration using the EG-ADF test requires first estimating a static regression between the potentially cointegrated variables by OLS, and then conducting an ADF test on the residuals from this regression. If the residuals do not have a unit root, then the variables are said to be cointegrated. Since this is a two-step procedure, critical values for the ADF t-statistic are adjusted and are referred to as the critical values for the EG-ADF statistic. Although the OLS estimator is consistent, it has a nonnormal distribution and hence inference should not be conducted based on the t-statistic, even if HAC standard errors are used. Alternative techniques to circumvent this problem, such as the DOLS estimator, which is consistent and efficient in large samples, have been developed. The DOLS estimator and another frequently used technique, called the Johansen method, can be easily extended to multiple cointegrating relationships.
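The two steps of the EG-ADF procedure can be sketched in pure Python. The sketch makes several assumptions that are not in the text: the data are simulated so that the two series are cointegrated by construction, the second step uses a Dickey-Fuller regression without lagged differences, the 5% critical value of -3.41 for one regressor is taken from the standard EG-ADF table, and all function and variable names are introduced here.

```python
import math
import random


def ols_simple(x, y):
    """OLS of y on a constant and x; returns (intercept, slope, slope_se)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(ssr / (n - 2) / sxx)  # homoskedasticity-only SE
    return intercept, slope, slope_se


def eg_adf_statistic(y, x):
    """Return (theta_hat, t-statistic) from the Engle-Granger two-step test."""
    # Step 1: static cointegrating regression y_t = alpha + theta * x_t + z_t.
    alpha_hat, theta_hat, _ = ols_simple(x, y)
    z = [yi - alpha_hat - theta_hat * xi for xi, yi in zip(x, y)]
    # Step 2: regress the change in z_t on z_{t-1} (with an intercept).
    dz = [z[t] - z[t - 1] for t in range(1, len(z))]
    _, delta_hat, delta_se = ols_simple(z[:-1], dz)
    return theta_hat, delta_hat / delta_se


# Simulated data: log income is a random walk, log consumption tracks it.
random.seed(1)
income = [0.0]
for _ in range(499):
    income.append(income[-1] + random.gauss(0.0, 1.0))
consumption = [0.5 + 0.9 * v + random.gauss(0.0, 1.0) for v in income]

theta_hat, t_stat = eg_adf_statistic(consumption, income)
critical_value_5pct = -3.41  # assumed EG-ADF critical value, one regressor
print("theta_hat:", theta_hat)
print("EG-ADF t-statistic:", t_stat)
print("reject no cointegration:", t_stat < critical_value_5pct)
```

The t-statistic from the second step is compared with the EG-ADF critical value, not with the ordinary ADF or standard normal values, because the residuals come from an estimated regression.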

5) Carefully explain the difference between forecasting variables separately versus forecasting a vector of time series variables. Mention how you choose optimal lag lengths in each case. Part of your essay should deal with multiperiod forecasts and different methods that can be used in that situation. Finally, address the difference between VARs and VECMs. Answer: When variables are forecasted separately, then single equations of the AR(p) type are typically involved. If economic theory and/or institutional knowledge suggest that additional predictors should be included, then forecasts can be potentially improved by estimating an ADL(p,q) model. For one period ahead forecasts, these are identical to forecasts based on systems of equations. Lag lengths will be chosen using the BIC or the AIC criterion. There are three important reasons why VARs may be preferable for forecasting. One results from the forecasting horizon. If forecasts are to be made two or more periods ahead, then if future values of the additional predictors are to be used, these have to be forecasted themselves. This can be avoided by choosing the multiperiod regression method. Here, in the case of an h period forecast, multiperiod regressions are estimated where all predictors are lagged h periods or more. Second, using VAR forecasting methods will make the forecasts for the variables involved mutually consistent. This is the result of using the iterated VAR forecasts whereby the forecasted values are subsequently used to forecast further ahead. Finally, VAR models allow for restrictions across equations to be tested. Multiperiod regression methods in general may be preferable over iterated forecasts if the AR(p), ADL(p,q) or VAR models are incorrectly specified. In practice, the difference in forecasts tends to be very small between the multiperiod regression and iterated forecast methods. VAR models can be enhanced by incorporating long-run information in the form of error correction terms.
If some of the variables in the VAR model have a common stochastic trend, then this can be used to improve the forecasts by including the error correction term, thereby turning the VAR model into a VECM. 6) You have collected quarterly data for the unemployment rate (Unemp) in the United States, using a sample period from 1962:I (first quarter) to 2009:IV (the data is collected at a monthly frequency, but you have taken quarterly averages).

a. Does economic theory suggest that the unemployment rate should be stationary?

b. Testing the unemployment rate for stationarity, you run the following regression (where the lag length was determined using the BIC; using the AIC instead does not change the outcome of the test, even though it chooses 9 lags of the LHS variable):

$\Delta Unemp_t = 0.217 - 0.035\,Unemp_{t-1} + 0.689\,\Delta Unemp_{t-1}$
(0.01) (0.0012) (0.054)

Use the ADF statistic with an intercept only to test for stationarity. What is your decision?

c. The standard errors reported above were homoskedasticity-only standard errors. Do you think you could potentially improve on inference by allowing for HAC standard errors?

d. An alternative test for a unit root, the DF-GLS, produces a test statistic of -2.75. Find the critical value and decide whether or not to reject the null hypothesis. If the decision is different from (b), is there any reason why you might prefer the DF-GLS test over the ADF test?


Answer: a. In macroeconomics or labor economics, you have learned about the natural rate of unemployment, or the Non-Accelerating Inflation Rate of Unemployment (NAIRU). The idea here is that unemployment rates may deviate from this equilibrium unemployment rate, but that, following a shock, the unemployment rate will revert towards this equilibrium. Hence you might expect the difference between the unemployment rate and the NAIRU, referred to by some as the cyclical unemployment rate, to be stationary. Unfortunately the equilibrium unemployment rate is not a constant over time and may be affected by demographics, the price of search (unemployment insurance benefits), and other variables. If the NAIRU is not a constant over time, then the unemployment rate itself may not be stationary. Furthermore, there is also the idea of hysteresis, which allows for the unemployment rate to move to a new equilibrium rate once a shock hits the economy. The bottom line is that while there is some guidance from economic theory, it is an empirical question whether or not the unemployment rate is stationary. b. The t-statistic for the ADF test is -2.84. The critical value at the 5% level is -2.86. Hence you can reject the null hypothesis of a unit root for the unemployment rate at the 10% level, but (just) fail to reject the null hypothesis at the 5% level. Most economists treat the unemployment rate as stationary. c. The ADF statistic is computed using non-robust standard errors. It turns out that under the null hypothesis of a unit root, the homoskedasticity-only standard errors generate a t-statistic that is robust to heteroskedasticity, so HAC standard errors are not needed. d. The critical value for the DF-GLS test is -2.58 at the 1% level. Hence you can reject the null hypothesis of a unit root using this test. The DF-GLS test has higher power than the ADF test, and hence should be preferred.
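The decisions in parts b and d amount to one-sided comparisons with tabulated critical values, which can be sketched in Python. The answer quotes only the 5% ADF value (-2.86) and the 1% DF-GLS value (-2.58); the remaining entries below are assumed from the standard large-sample tables for tests with an intercept only, and the function name is introduced here.

```python
# Large-sample critical values for tests with an intercept only. Only -2.86
# (ADF, 5%) and -2.58 (DF-GLS, 1%) appear in the answer; the other entries
# are assumed from the standard tables.
ADF_CRITICAL = {"10%": -2.57, "5%": -2.86, "1%": -3.43}
DF_GLS_CRITICAL = {"10%": -1.62, "5%": -1.95, "1%": -2.58}


def rejected_levels(statistic, critical_values):
    """Return the significance levels at which the unit root null is rejected."""
    # Unit root tests are one-sided: reject when the statistic is more negative
    # than the critical value.
    return [level for level, cv in critical_values.items() if statistic < cv]


adf_rejections = rejected_levels(-2.84, ADF_CRITICAL)
df_gls_rejections = rejected_levels(-2.75, DF_GLS_CRITICAL)

print("ADF rejects at:   ", adf_rejections)
print("DF-GLS rejects at:", df_gls_rejections)
```

With the reported statistics, the ADF test rejects only at the 10% level, while the DF-GLS test rejects at all three levels, which matches the answer above.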

16.3 Mathematical and Graphical Problems 1) Consider the GARCH(1,1) model $\sigma^2_t = \alpha_0 + \alpha_1 u^2_{t-1} + \phi_1 \sigma^2_{t-1}$. Show that this model can be rewritten as

$\sigma^2_t = \frac{\alpha_0}{1-\phi_1} + \alpha_1 (u^2_{t-1} + \phi_1 u^2_{t-2} + \phi_1^2 u^2_{t-3} + \phi_1^3 u^2_{t-4} + \dots)$.

(Hint: use the GARCH(1,1) model but specify it for $\sigma^2_{t-1}$; substitute this expression into the original specification, and so on.) Explain intuitively the meaning of the resulting formulation.

Answer:

$\sigma^2_t = \alpha_0 + \alpha_1 u^2_{t-1} + \phi_1 \sigma^2_{t-1} = \alpha_0 + \alpha_1 u^2_{t-1} + \phi_1 (\alpha_0 + \alpha_1 u^2_{t-2} + \phi_1 \sigma^2_{t-2})$
$= \alpha_0 (1 + \phi_1) + \alpha_1 (u^2_{t-1} + \phi_1 u^2_{t-2}) + \phi_1^2 \sigma^2_{t-2}$
$= \alpha_0 (1 + \phi_1 + \phi_1^2) + \alpha_1 (u^2_{t-1} + \phi_1 u^2_{t-2} + \phi_1^2 u^2_{t-3}) + \phi_1^3 \sigma^2_{t-3}$.

Continuing with the substitutions infinitely and noting that, for $|\phi_1| < 1$, the sum of the geometric series is $1 + \phi_1 + \phi_1^2 + \phi_1^3 + \dots = \frac{1}{1-\phi_1}$, you finally arrive at

$\sigma^2_t = \frac{\alpha_0}{1-\phi_1} + \alpha_1 (u^2_{t-1} + \phi_1 u^2_{t-2} + \phi_1^2 u^2_{t-3} + \phi_1^3 u^2_{t-4} + \dots)$.

This expression states that the variances depend on a weighted average of past squared residuals, where the distant past receives a smaller weight than more recently observed squared residuals.
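The equivalence of the recursive and the expanded forms can be checked numerically with a short Python sketch. The parameter values and the simulated squared errors are hypothetical and chosen only for illustration; the function names are introduced here.

```python
import random

# Hypothetical GARCH(1,1) parameters (illustration only).
alpha0, alpha1, phi1 = 0.5, 0.2, 0.6

# Simulated squared errors; the last element plays the role of u^2_{t-1}.
random.seed(0)
u_squared = [random.gauss(0.0, 1.0) ** 2 for _ in range(400)]


def sigma2_recursive(u2, start_value):
    """Apply sigma2_t = alpha0 + alpha1 * u2_{t-1} + phi1 * sigma2_{t-1} repeatedly."""
    sigma2 = start_value
    for past_u2 in u2:
        sigma2 = alpha0 + alpha1 * past_u2 + phi1 * sigma2
    return sigma2


def sigma2_expanded(u2):
    """alpha0 / (1 - phi1) + alpha1 * sum over j of phi1**j * u2_{t-1-j}."""
    # reversed(u2) starts with the most recent squared error, which gets weight 1.
    weighted_sum = sum(phi1 ** j * past_u2 for j, past_u2 in enumerate(reversed(u2)))
    return alpha0 / (1.0 - phi1) + alpha1 * weighted_sum


recursive_value = sigma2_recursive(u_squared, start_value=1.0)
expanded_value = sigma2_expanded(u_squared)
print(recursive_value, expanded_value)
```

With 400 observations the influence of the arbitrary starting value is multiplied by $\phi_1^{400}$ and is numerically negligible, so the two printed values should agree to many decimal places.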


2) You have collected quarterly data on inflation and unemployment rates for Canada from 1961:III to 1995:IV to estimate a VAR(4) model of the change in the rate of inflation and the unemployment rate. The results are

$\Delta Inf_t = 1.02 - .54\,\Delta Inf_{t-1} - .46\,\Delta Inf_{t-2} - .32\,\Delta Inf_{t-3} - .01\,\Delta Inf_{t-4}$
(.09) (.09) (.09) (.08) (.44)
$- .76\,Unemp_{t-1} + .20\,Unemp_{t-2} - .16\,Unemp_{t-3} + .59\,Unemp_{t-4}$
(.43) (.76) (.44)

$R^2$ = .26.

$Unemp_t = 0.18 - .003\,\Delta Inf_{t-1} - .016\,\Delta Inf_{t-2} - .018\,\Delta Inf_{t-3} - .010\,\Delta Inf_{t-4}$
(.10) (.016) (.018) (.017) (.016)
$+ 1.47\,Unemp_{t-1} - .46\,Unemp_{t-2} - .08\,Unemp_{t-3} + .05\,Unemp_{t-4}$
(.08) (.14) (.08)

$R^2$ = .980.

(a) Explain how you would use the above regressions to conduct one period ahead forecasts. (b) Should you test for cointegration between the change in the inflation rate and the unemployment rate and, in the case of finding cointegration here, respecify the above model as a VECM? (c) The Granger causality test yields the following F-statistics: 3.75 for the test that the coefficients on the lagged unemployment rate in the change of inflation equation are all zero; and 0.36 for the test that the coefficients on lagged changes in the inflation rate in the unemployment equation are all zero. Based on these results, does unemployment Granger-cause inflation? Does inflation Granger-cause unemployment? Answer: (a) One period ahead forecasts are the same as for the ADL(4,4) models of the inflation rate and unemployment rate. For example, forecasting the change in the inflation rate for 1996:I requires use of the actual values for unemployment and change in inflation rates through 1995:IV. The unemployment rate for 1996:I is forecasted in the same way using the second regression. (b) Most economic theories suggest that there is no long-run relationship between the inflation rate and the unemployment rate, or, stated differently, that the long-run Phillips curve is vertical. Hence economic theory does not suggest testing for cointegration or using the error correction term in a VECM. (c) The critical value for the $F_{4,\infty}$ statistic is 3.32 at the 1% significance level, and 1.94 at the 10% significance level. Based on the calculated F-statistics above you can reject the null hypothesis that lagged unemployment rates do not Granger-cause the inflation rate, but you cannot reject the null hypothesis that lagged inflation does not Granger-cause the unemployment rate.
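The $F_{4,\infty}$ critical values used in part (c) can be reproduced with a small Python sketch, since an $F_{4,\infty}$ variable is a chi-squared variable with 4 degrees of freedom divided by 4. The bisection routine and the function names are assumptions introduced for illustration; only the F-statistics 3.75 and 0.36 come from the question.

```python
import math


def chi2_4_cdf(x):
    """CDF of a chi-squared random variable with 4 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


def f_4_inf_critical_value(significance):
    """Critical value of F(4, infinity): chi-squared(4) quantile divided by 4."""
    low, high = 0.0, 100.0
    for _ in range(200):  # bisection on the monotone CDF
        mid = (low + high) / 2.0
        if chi2_4_cdf(mid) < 1.0 - significance:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0 / 4.0


cv_1pct = f_4_inf_critical_value(0.01)
cv_10pct = f_4_inf_critical_value(0.10)

# Granger causality F-statistics reported in the question.
f_unemp_in_inflation_eq = 3.75
f_inflation_in_unemp_eq = 0.36

print(f"1% critical value:  {cv_1pct:.2f}")
print(f"10% critical value: {cv_10pct:.2f}")
print("unemployment Granger-causes inflation (1% level):",
      f_unemp_in_inflation_eq > cv_1pct)
print("inflation Granger-causes unemployment (10% level):",
      f_inflation_in_unemp_eq > cv_10pct)
```

The first F-statistic exceeds even the 1% critical value, while the second falls short of the 10% value, which reproduces the conclusions in the answer.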


3) Purchasing power parity (PPP) postulates that the exchange rate between two countries equals the ratio of the respective price indexes, or $ExchRate = \frac{P^f}{P}$ (where ExchRate is the foreign exchange rate between the two countries, and P represents the price index, with f indicating the foreign country). The long-run version of PPP implies that the exchange rate and the price ratio share a common trend. (a) You collect monthly foreign exchange rate data from 1974:1 to 2002:4 for the U.S./U.K. exchange rate ($/£) and you collect data on the Consumer Price Index for both countries. Explain how you would use the Engle-Granger test statistic to investigate the long-run PPP hypothesis. (b) One of your peers explains that there may be an easier way to test for the validity of PPP. She suggests simply testing whether or not the “real” exchange rate, or competitiveness, is stationary. (The real exchange rate is given by $ExchRate \times \frac{P}{P^f}$.) Is she correct? Explain. How would you implement her suggestion? Which alternative test statistic is available? Answer: (a) Using the Engle-Granger two-step procedure, the (log of the) exchange rate will be regressed on the relative price ratio (log difference of the two prices). The residuals from this regression will then be subjected to a Dickey-Fuller t-test with an intercept but no time trend. This is the EG-ADF procedure. However, the OLS estimator of the coefficient in this regression is only consistent if the two variables are cointegrated. Furthermore, inference can be misleading since the OLS estimator does not have a normal distribution. If a test is performed on whether the coefficient of the price ratio is unity, then the DOLS estimator should be used with HAC standard errors. (b) If PPP holds, then the exchange rate and the relative price ratio will have a cointegrating coefficient of $\theta = 1$. First the real exchange rate should be plotted to inspect visually whether or not the two variables appear to be cointegrated.
To test this more formally, the real exchange rate should be tested for containing a unit root, using the ADF statistic. If the null hypothesis is rejected, then this would suggest that PPP holds in the long run. Since the ADF test is not the most powerful test, the DF-GLS test can be used as an alternative. 4) You have collected quarterly Canadian data on the unemployment and the inflation rate from 1962:I to 2001:IV. You want to re-estimate the ADL(3,1) formulation of the Phillips curve using a GARCH(1,1) specification. The results are as follows:

$\Delta Inf_t = 1.17 - .56\,\Delta Inf_{t-1} - .47\,\Delta Inf_{t-2} - .31\,\Delta Inf_{t-3} - .13\,Unemp_{t-1}$
(.48) (.08) (.10) (.09) (.06)

$\hat{\sigma}^2_t = .86 + .27\,u^2_{t-1} + .53\,\sigma^2_{t-1}$.
(.40) (.11) (.15)

(a) Test the two coefficients for $u^2_{t-1}$ and $\sigma^2_{t-1}$ in the GARCH model individually for statistical significance. (b) Estimating the same equation by OLS results in

$\Delta Inf_t = 1.19 - .51\,\Delta Inf_{t-1} - .47\,\Delta Inf_{t-2} - .28\,\Delta Inf_{t-3} - .16\,Unemp_{t-1}$
(.54) (.10) (.11) (.08) (.07)

Briefly compare the estimates. Which of the two methods do you prefer? (c) Given your results from the test in (a), what can you say about the variance of the error terms in the Phillips Curve for Canada? (d) The following figure plots the residuals along with bands of plus or minus one predicted standard deviation (that is, $\pm\hat{\sigma}_t$) based on the GARCH(1,1) model.


Describe what you see. Answer: (a) The two t-statistics are .27/.11 = 2.45 and .53/.15 = 3.53 respectively. Since they are normally distributed in large samples, you can use the standard normal distribution for significance testing and the construction of confidence intervals. The first coefficient is statistically significant at the 5% level, while the second is statistically significant at the 1% level. (b) These are two estimation methods, OLS and Maximum Likelihood. The GARCH(1,1) model produces very similar estimates for the coefficients on lagged inflation and unemployment. The difference stems from the fact that the two GARCH coefficients are (significantly) different from zero. Since they are statistically significant, GARCH is the preferred model because it does not constrain these coefficients to zero. (c) The tests in (a) suggest that the errors are not homoskedastic but conditionally heteroskedastic. (d) There is changing volatility in the residuals. The conditional standard deviation bands are relatively tight in the '60s but the uncertainty about inflation forecasts increases steadily. There are periods of widening bands in the early '80s and '90s, and again at the end of the sample period. These follow economic recessions.
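The significance tests in part (a) are simple ratios of estimate to standard error. The following is a minimal sketch of that arithmetic, assuming the large-sample two-sided normal critical values 1.645, 1.96 and 2.576; the helper names `t_statistic` and `strictest_rejection` are illustrative and not taken from the textbook.

```python
# Estimates and standard errors of the two GARCH(1,1) variance-equation terms.
garch_terms = {
    "u^2_{t-1}": (0.27, 0.11),
    "sigma^2_{t-1}": (0.53, 0.15),
}

# Two-sided standard normal critical values, strictest level first (assumed).
CRITICAL_VALUES = [("1%", 2.576), ("5%", 1.96), ("10%", 1.645)]


def t_statistic(estimate, std_error, hypothesized=0.0):
    """t-statistic for H0: coefficient equals the hypothesized value."""
    return (estimate - hypothesized) / std_error


def strictest_rejection(t_stat):
    """Smallest listed significance level at which H0 is rejected, else None."""
    for level, critical in CRITICAL_VALUES:
        if abs(t_stat) > critical:
            return level
    return None


results = {}
for name, (estimate, std_error) in garch_terms.items():
    t_stat = t_statistic(estimate, std_error)
    results[name] = (t_stat, strictest_rejection(t_stat))
    print(f"{name}: t = {t_stat:.2f}, rejected at {results[name][1]}")
```

Ordering the critical values from strictest to loosest lets the helper return the first level that is cleared, which mirrors how the answer reports significance.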


5) Consider the following model Yt = β0 + β1 Xt + β2 Xt-1 + β3 Yt-1 + ut, where Xt is strictly exogenous. Show that by imposing the restriction $\sum_{i=1}^{3} \beta_i = 1$, you can derive the following so-called Error Correction Mechanism (ECM) model △Yt = β0 + β1 △Xt – θ(Y – X)t-1 + ut where θ = β1 + β2. What is the short-run (impact) response of a unit increase in X? What is the long-run solution? Why do you think the term in parentheses in the above expression is called ECM? Answer: Starting with Yt = β0 + β1 Xt + β2 Xt-1 + β3 Yt-1 + ut, subtracting Yt-1 from both sides, and adding and subtracting β1 Xt-1 on the right hand side, results in △Yt = β0 + β1 △Xt + (β1 + β2)Xt-1 – (1 – β3)Yt-1 + ut. Note that $\sum_{i=1}^{3} \beta_i = 1$ implies β1 + β2 = 1 – β3. Since θ = β1 + β2, then △Yt = β0 + β1 △Xt – θ(Y – X)t-1 + ut. The impact response is △Yt/△Xt = β1. The steady-state solution is $Y = \frac{\beta_0 + \beta_1 g_X - g_Y}{\theta} + X$, where gY and gX are the steady-state growth rates of Y and X respectively (assuming that the model is in logs). (Y – X) represents the amount of disequilibrium in the previous period. The term is sometimes referred to as “Equilibrium Correction Mechanism” rather than “Error Correction Mechanism.” If the relationship is in equilibrium in the previous period, then there is no additional movement in Y other than from the short-run response. 6) Your textbook states that there “are three ways to decide if two variables can plausibly be modeled as cointegrated: use expert knowledge and economic theory, graph the series and see whether they appear to have a common stochastic trend, and perform statistical tests for cointegration. All three ways should be used in practice.” Accordingly you set out to check whether (the log of) consumption and (the log of) personal disposable income are cointegrated. You collect data for the sample period 1962:I to 1995:IV and plot the two variables.

(a) Using the first two methods to examine the series for cointegration, what do you think the likely answer is?

(b) You begin your numerical analysis by testing for a stochastic trend in the variables, using an Augmented Dickey-Fuller test. The t-statistic for the coefficient of interest is as follows:

Variable with lag of 1    LnYpd    △LnYpd    LnC      △LnC
t-statistic               -1.93    -5.24     -2.20    -4.31

where LnYpd is (the log of) personal disposable income, and LnC is (the log of) real consumption. The estimated equation included an intercept for the two growth rates, and, in addition, a deterministic trend for the level variables. For each case make a decision about the stationarity of the variables based on the critical value of the Augmented Dickey-Fuller test statistic. Why do you think a trend was included for level variables? (c) Using the first step of the EG–ADF procedure, you get the following result: lnCt = –0.24 + 1.017 lnYpdt. Should you interpret this equation? Would you be impressed if you were told that the regression R2 was 0.998 and that the t-statistic for the slope was 266.06? Why or why not? (d) The Dickey–Fuller test for the residuals from the cointegrating regression results in a t-statistic of (–3.64). State the null and alternative hypotheses and make a decision based on the result. (e) You want to investigate if the slope of the cointegrating vector is one. To do so, you use the DOLS estimator and HAC standard errors. The slope coefficient is 1.024 with a standard error of 0.009. Can you reject the null hypothesis that the slope equals one? Answer: (a) There are economic theories which postulate that real consumption and real personal disposable income are proportional to each other in equilibrium. The above figure also suggests that the (log) difference between the two series is stationary, or that they appear to have a common stochastic trend. (b) The graph suggests the presence of a time trend, which is why a trend was included for the level variables. For the level variables, the critical value is (-3.12) at the 10% significance level and (-3.96) at the 1% level. Hence you cannot reject the null hypothesis that the log levels of consumption and disposable income contain a unit root. You are able to reject the null hypothesis for the differences of both variables. Hence both series are I(1). (c) The equation is estimated using OLS, which is only consistent if consumption and disposable income are cointegrated.
But even if the null hypothesis of a unit root can be rejected, the t-statistic does not have a normal distribution, even when using HAC standard errors. As a result, inference can be misleading. The high regression R2 is not surprising, given that the two variables are I(1). This could be an example of a spurious regression. However, alternative estimators are available, such as DOLS, which is consistent and efficient in large samples, and statistical inference on the coefficient of disposable income is valid if HAC standard errors are used. Alternatively, the Johansen procedure can be used. (d) Under the null hypothesis, the residuals from the above regression have a unit root, so that consumption and disposable income are not cointegrated. The EG–ADF statistic of (-3.64) is below the 5% critical value of (-3.41) but not below the 1% critical value of (-3.96). Hence the null hypothesis is rejected at the 5% significance level, though not at the 1% level, in favor of the alternative hypothesis that consumption and disposable income are cointegrated over this period. (e) The t-statistic for the null hypothesis is (1.024 – 1)/0.009 = 2.67. Hence you can reject the null hypothesis at the 5% significance level.
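The decisions in parts (d) and (e) come down to comparing a statistic with a critical value. Below is a minimal sketch of both comparisons. It assumes the EG-ADF critical values for a single regressor are -3.12 (10%), -3.41 (5%) and -3.96 (1%), which are the values quoted in this chapter's answers, and a two-sided 5% normal critical value of 1.96 for the DOLS slope test. All variable names are illustrative.

```python
# EG-ADF critical values for a cointegrating regression with one regressor
# (values quoted in this chapter's answers; treat them as an assumption).
EG_ADF_CRITICAL = {"10%": -3.12, "5%": -3.41, "1%": -3.96}

# Part (d): the unit-root null is rejected when the statistic lies
# below (is more negative than) the critical value.
eg_adf_stat = -3.64
cointegration_rejections = {
    level: eg_adf_stat < critical for level, critical in EG_ADF_CRITICAL.items()
}

# Part (e): with DOLS and HAC standard errors the t-statistic is
# compared with standard normal critical values.
dols_slope, dols_se = 1.024, 0.009
slope_t = (dols_slope - 1.0) / dols_se
reject_unit_slope_5pct = abs(slope_t) > 1.96

print(cointegration_rejections)
print(f"t-statistic for slope = 1: {slope_t:.2f}, reject: {reject_unit_slope_5pct}")
```

Note that the EG-ADF test is one-sided, so the comparison uses `<` rather than an absolute value, while the slope test is two-sided.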


7) Your textbook so far considered variables for cointegration that are integrated of the same order. For example, the log of consumption and personal disposable income might both be I(1) variables, and the error correction term would be I(0), if consumption and personal disposable income were cointegrated. (a) Do you think that it makes sense to test for cointegration between two variables if they are integrated of different orders? Explain. (b) Would your answer change if you have three variables, two of which are I(1) while the third is I(0)? Can you think of an example in this case? Answer: (a) To test for cointegration requires that the two variables have the same stochastic trend. If one variable is I(1) while the other is I(0), then obviously they do not have the same stochastic trend and therefore cannot be cointegrated. (b) In this case there would possibly be cointegration between the two I(1) variables, but not between all three variables. This does not imply that the third variable could not enter into the relationship. Think, for example, about a money demand relationship between the (log of) real money balances, income, and the nominal interest rate. It may well be that in some samples the nominal interest rate is I(0), while real money balances and income are I(1). Finding real money balances and income to be cointegrated does not imply that the nominal interest rate does not enter the money demand function. There is simply no need for the interest rate to enter the cointegrating relation because it is I(0). The cointegrating relation only involves zero-frequency relationships between the first differences of real money balances and income, and the zero-frequency component of the first difference of the interest rate is non-existent. 8) For the United States, there is somewhat conflicting evidence whether or not the inflation rate has a unit autoregressive root. 
For example, for the sample period 1962:I to 1999:IV using the ADF statistic, you cannot reject at the 5% significance level that inflation contains a stochastic trend. However, the null hypothesis can be rejected at the 10% significance level. The DF-GLS test rejects the null hypothesis at the 5% level. This result turns out to be sensitive to the number of lags chosen and the sample period. (a) Somewhat intrigued by these findings, you decide to repeat the exercise using Canadian data. Letting the AIC choose the lag length of the ADF regression, which turns out to be three, the ADF statistic is (-1.91). What is your decision regarding the null hypothesis? (b) You also calculate the DF-GLS statistic, which turns out to be (-1.23). Can you reject the null hypothesis in this case? (c) Is it possible for the two test statistics to yield different answers and if so, why? Answer: (a) For the Canadian data, the null hypothesis cannot be rejected even at the 10% significance level. Hence for the chosen sample period and lag length, the Canadian inflation rate seems to have a stochastic trend. (b) The critical value for the DF-GLS statistic is (-1.62) at the 10% significance level. Since (-1.23) is not below this value, the null hypothesis cannot be rejected. Hence the DF-GLS test comes to the same conclusion as the test based on the ADF statistic: there is evidence of a stochastic trend. (c) The two test statistics can come to different conclusions, although this is not the case with the Canadian inflation rate. The reason is that the DF-GLS test has more power.


9) You have collected time series for various macroeconomic variables to test if there is a single cointegrating relationship among multiple variables. Formulate the null hypothesis and compare the EG–ADF statistic to its critical value. (a) Canadian unemployment rate, Canadian Inflation Rate, United States unemployment rate, United States inflation rate; t = (-3.374). (b) Approval of United States presidents (Gallup poll), cyclical unemployment rate, inflation rate, Michigan Index of Consumer Sentiment; t = (-3.837). (c) The log of real GDP, log of real government expenditures, log of real money supply (M2); t = (-2.23). (d) Briefly explain how you could potentially improve on VAR(p) forecasts by using a cointegrating vector. Answer: (a) The null hypothesis of a unit root in the error correction term cannot be rejected even at the 10% level. Hence there is little support of a single cointegrating relationship between these four variables. (b) The critical value is (-4.20) at the 10% significance level. Hence you cannot reject the null hypothesis of the error correction term having a unit root. (c) Since the critical value for three variables is (-3.84) at the 10% significance level, there does not seem to be a cointegrating relationship between the three variables. (d) Adding the error correction term from the cointegrating relationship between variables to the VAR(p) model results in a vector error correction model (VECM). The advantage of this model over a VAR model is that it incorporates both short-run and long-run information into the forecasting equation. 10) There has been much talk recently about the convergence of inflation rates between many of the OECD economies. You want to see if there is evidence of this closer to home by checking whether or not Canada’s inflation rate and the United States’ inflation rate are cointegrated. (a) You begin your numerical analysis by testing for a stochastic trend in the variables, using an Augmented Dickey-Fuller test. 
The t-statistic for the coefficient of interest is as follows:

Variable with lag of 1    InfCan    △InfCan    InfUS    △InfUS
t-statistic               -1.93     -6.38      -2.37    -5.63

where InfCan is the Canadian inflation rate, and InfUS is the United States inflation rate. The estimated equation included an intercept. For each case make a decision about the stationarity of the variables based on the critical value of the Augmented Dickey-Fuller test statistic. (b) Your test for cointegration results in an EG–ADF statistic of (–7.34). Can you reject the null hypothesis of a unit root for the residuals from the cointegrating regression? (c) Using a working hypothesis that the two inflation rates are cointegrated, you want to test whether or not the slope coefficient equals one. To do so you estimate the cointegrating equation using the DOLS estimator with HAC standard errors. The coefficient on the U.S. inflation rate has a value of 0.45 with a standard error of 0.13. Can you reject the null hypothesis that the slope equals unity? (d) Even if you could not reject the null hypothesis of a unit slope, would that have been sufficient evidence to establish convergence? Answer: (a) The critical value for the ADF statistic is (-2.57) at the 10% significance level. Therefore you cannot reject the null hypothesis of a unit root for either inflation rate. However, given the critical value for the ADF statistic of (-3.43), you can reject the null hypothesis for the difference, or the acceleration, in the inflation rates at the 1% significance level. The inflation rates therefore appear to be I(1), so that both price levels appear to be I(2) variables. (b) Given the critical value of (-3.96) for the EG-ADF statistic, you can reject the null hypothesis of a unit root in favor of the two inflation rates being cointegrated. (c) The DOLS estimator allows for statistical inference on the coefficient using the standard normal distribution. The t-statistic is (0.45 – 1)/0.13 = –4.23. Since 0.45 is more than two standard errors from unity, you can reject the null hypothesis of that regression coefficient being one.
(d) Finding a unit slope would not be sufficient for convergence, since it would allow for a constant difference between the two inflation rates. To have convergence you would need that difference to be zero.

11) You have re-estimated the two-variable VAR model of the change in the inflation rate and the unemployment rate presented in your textbook using the sample period 1982:I (first quarter) to 2009:IV. To see if the conclusions regarding Granger causality have changed, you conduct an F-test for this new sample period. The results are as follows: The F-statistic testing the null hypothesis that the coefficients on Unempt-1, Unempt-2, Unempt-3, and Unempt-4 are zero in the inflation equation (Equation 16.5 in your textbook) is 6.04. The F-statistic testing the hypothesis that the coefficients on the four lags of ΔInft are zero in the unemployment equation (Equation 16.6 in your textbook) is 0.80.

a. What is the critical value of the F-statistic in both cases?

b. Do you think that the unemployment rate Granger-causes changes in the inflation rate?

c. Do you think that the change in the inflation rate Granger-causes the unemployment rate?

Answer: a. The critical value at the 5% level is F4,∞ = 2.37. b. Given the value of the Granger causality statistic, which is greater than the critical value, you can reject the null hypothesis, meaning that the unemployment rate is a useful predictor for the change in the inflation rate. Hence the unemployment rate Granger-causes changes in inflation. c. In this case, the Granger causality statistic does not exceed the critical value, and hence the conclusion is that the change in the inflation rate does not Granger-cause the unemployment rate.
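As a small illustration of the decision rule used in parts b and c, the sketch below compares each reported F-statistic with the 5% critical value of 2.37 given in the answer. The function name `granger_causes` is hypothetical.

```python
# 5% critical value of the F(4, infinity) distribution quoted in the answer.
F_CRITICAL_5PCT = 2.37


def granger_causes(f_statistic, critical_value=F_CRITICAL_5PCT):
    """Reject the null of no Granger causality when F exceeds the critical value."""
    return f_statistic > critical_value


# F-statistics reported in the question.
unemp_causes_dinf = granger_causes(6.04)   # lags of Unemp in the inflation equation
dinf_causes_unemp = granger_causes(0.80)   # lags of change in Inf in the unemployment equation

print("Unemployment Granger-causes change in inflation:", unemp_causes_dinf)
print("Change in inflation Granger-causes unemployment:", dinf_causes_unemp)
```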


12) Consider the following estimated AR(1) model of the change in the inflation rate:

ΔInft = 0.05 – 0.31 ΔInft-1
(0.14) (0.07)
t = 1982:I to 2009:IV, R2 = 0.10, SER = 2.4

a. Calculate the one-quarter-ahead forecast of both ΔInf2010:I and Inf2010:I (the inflation rate in 2009:IV was 2.6 percent, and the change in the inflation rate for that quarter was -1.04).

b. Calculate the forecast for 2010:II using the iterated multiperiod AR forecast both for the change in the inflation rate and the inflation rate.

c. What alternative method could you have used to forecast two quarters ahead? Write down the equation for the two-period ahead forecast, using parameters instead of numerical coefficients, which you would have used.

Answer: a. ΔInf2010:I|2009:IV = 0.05 – 0.31 ΔInf2009:IV = 0.05 – 0.31 × (–1.04) = 0.4. The forecast is therefore that the inflation rate would increase by 0.4 percentage points, and the inflation rate for 2010:I would therefore be 3.0 percent. b. ΔInf2010:II|2009:IV = 0.05 – 0.31 ΔInf2010:I|2009:IV = 0.05 – 0.31 × 0.4 = –0.1. The forecast is for the inflation rate to decline by 0.1 percentage points. The forecasted level would therefore be 2.9 percent. c. The alternative would have been to use the “Direct Multiperiod Forecasts” method. The estimated equation would have been $\widehat{\Delta Inf}_{2010:II|2009:IV} = \hat{\beta}_0 + \hat{\beta}_1 \Delta Inf_{2009:IV}$.
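The iterated forecasts in parts a and b can be reproduced with a short loop. This is a sketch only: it feeds each forecasted change back into the AR(1) equation and accumulates the level, using the coefficients and 2009:IV values given in the question. The function name `iterated_ar1_forecasts` is illustrative. Unlike the answer, the code does not round the one-quarter-ahead forecast before iterating, but the results agree to one decimal place.

```python
def iterated_ar1_forecasts(intercept, slope, last_change, last_level, horizons):
    """Iterate an AR(1) in first differences forward and accumulate the level.

    Returns a list of (forecasted change, forecasted level) pairs,
    one pair per horizon.
    """
    forecasts = []
    change, level = last_change, last_level
    for _ in range(horizons):
        change = intercept + slope * change  # forecast of the change
        level = level + change               # implied forecast of the level
        forecasts.append((change, level))
    return forecasts


# Estimated AR(1): change in Inf(t) = 0.05 - 0.31 * change in Inf(t-1).
# 2009:IV values from the question: change = -1.04, level = 2.6 percent.
forecasts = iterated_ar1_forecasts(0.05, -0.31, -1.04, 2.6, horizons=2)

for quarter, (change, level) in zip(["2010:I", "2010:II"], forecasts):
    print(f"{quarter}: change = {change:.1f}, level = {level:.1f}")
```

A direct two-period forecast would instead require re-estimating the regression with the twice-lagged change as the regressor, so it cannot be computed from the coefficients shown here.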


13) You have collected quarterly data for real GDP (Y) for the United States for the period 1962:I (first quarter) to 2009:IV.

a. Testing the log of GDP for stationarity, you run the following regression (where the lag length was determined using the AIC):

△ln Yt = 0.03 – 0.0024 ln Yt-1 + 0.253 △ln Yt-1 + 0.167 △ln Yt-2
(0.03) (0.0014) (0.072)
t = 1962:I to 2009:IV, R2 = 0.16, SER = 0.008

Use the ADF statistic with an intercept only to test for stationarity. What is your decision?

b. You have decided to test the growth rate of real GDP for stationarity for the same sample period. The regression is as follows:

△2 ln Yt = 0.0041 – 0.543 △ln Yt-1 – 0.186 △2 ln Yt-1
(0.0009) (0.082) (0.071)
t = 1962:I to 2009:IV, R2 = 0.36, SER = 0.008

Use the ADF statistic with an intercept only to test for stationarity. What is your decision?

c. Using the orders of integration terminology, what order of integration is the log level of real GDP? The growth rate?

d. Given that the SER hardly changed in the second equation, why is the regression R2 larger?

Answer: a. The t-statistic for the ADF test is -1.77. The critical value at the 5% level is -2.86. Hence you cannot reject the null hypothesis of a unit root for the log level of real GDP. b. The t-statistic for the ADF test is -6.65. The critical value at the 5% level is -2.86. Hence you can reject the null hypothesis of a unit root for the (quarterly) growth rate of real GDP. c. The log of real GDP is I(1), the growth rate is I(0); the growth rate is stationary. d. Obviously the TSS must have increased since R2 = 1 – (SSR/TSS).
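The ADF decisions above follow from dividing the coefficient on the lagged level by its standard error and comparing the result with the one-sided critical value. The sketch below does this with the rounded coefficients printed in the question, so the ratios need not match the t-statistics quoted in the answer exactly, although the decisions are the same. The helper names are illustrative.

```python
# 5% ADF critical value for a regression with an intercept only (from the answer).
ADF_CRITICAL_5PCT = -2.86


def adf_test(coefficient, std_error, critical_value=ADF_CRITICAL_5PCT):
    """Return the ADF t-statistic and whether the unit-root null is rejected."""
    t_stat = coefficient / std_error
    return t_stat, t_stat < critical_value  # one-sided test


def order_of_integration(level_rejects, difference_rejects):
    """Classify a series as I(0) or I(1) from the two test outcomes."""
    if level_rejects:
        return 0
    if difference_rejects:
        return 1
    return None  # more differencing would be needed to decide


# Coefficient on ln Y(t-1) in the level regression, with its standard error.
level_t, level_rejects = adf_test(-0.0024, 0.0014)
# Coefficient on the lagged growth rate in the differenced regression.
growth_t, growth_rejects = adf_test(-0.543, 0.082)

print(f"log level: t = {level_t:.2f}, reject unit root: {level_rejects}")
print(f"growth rate: t = {growth_t:.2f}, reject unit root: {growth_rejects}")
print("order of integration of ln Y:", order_of_integration(level_rejects, growth_rejects))
```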


14) Economic theory suggests that the law of one price holds. Applying this concept to foreign and domestic goods implies that goods will sell for the same price across countries. The consumer price index is the price for a basket of goods, and is calculated for countries as a whole. Hence in the absence of barriers to trade, and large transportation costs (and the fact that not all goods are traded) you should observe Purchasing Power Parity (PPP) between two countries, or ExchRate × P = Pf, where ExchRate is the foreign exchange rate between the two countries, and P represents the price index, with f indicating the foreign country. Dividing both sides of the equation by the domestic price level then gives you the standard formulation for PPP: ExchRate = Pf/P. If PPP holds in the long run, then the exchange rate and the price ratio should share a common trend. Since it is a long-run concept, cointegration provides an interesting way to test for it.

a. Using monthly data for the U.S./U.K. exchange rate ($/£) and the respective price indexes, you estimate the following regression: ExchRatet = 0.44 + 0.69 (ln PUS – ln PUK). Collecting the residuals from this regression and using an ADF test for cointegration, you find a t-statistic of -2.71. Can you reject the null hypothesis of no cointegration? What is the critical value?

b. Was it good econometric practice to test for cointegration right away? What else should you have done before proceeding with the EG-ADF test?

Answer: a. The critical value is -3.41 and hence the EG-ADF test cannot reject the null hypothesis of no cointegration. b. For the regression to establish cointegration, you should test first whether or not the LHS and RHS variables are of the same order of integration. It is well known that exchange rates follow a random walk and are therefore I(1) variables, but price indexes are typically of the same order of integration for countries with similar inflation rates such as the U.K. and the U.S. Hence the RHS variable will likely be stationary or I(0). (The ADF statistic for the exchange rate is -2.18 while the log price difference has an ADF statistic of -4.67.)


Chapter 17 The Theory of Linear Regression with One Regressor 17.1 Multiple Choice 1) All of the following are good reasons for an applied econometrician to learn some econometric theory, with the exception of A) turning your statistical software from a “black box” into a flexible toolkit from which you are able to select the right tool for a given job. B) understanding econometric theory lets you appreciate why these tools work and what assumptions are required for each tool to work properly. C) learning how to invert a 4×4 matrix by hand. D) helping you recognize when a tool will not work well in an application and when it is time for you to look for a different econometric approach. Answer: C 2) Finite-sample distributions of the OLS estimator and t-statistics are complicated, unless A) the regressors are all normally distributed. B) the regression errors are homoskedastic and normally distributed, conditional on X1, ..., Xn. C) the Gauss-Markov Theorem applies. D) the regressor is also endogenous. Answer: B 3) If, in addition to the least squares assumptions made in the previous chapter on the simple regression model, the errors are homoskedastic, then the OLS estimator is A) identical to the TSLS estimator. B) BLUE. C) inconsistent. D) different from the OLS estimator in the presence of heteroskedasticity. Answer: B 4) When the errors are heteroskedastic, then A) WLS is efficient in large samples, if the functional form of the heteroskedasticity is known. B) OLS is biased. C) OLS is still efficient as long as there is no serial correlation in the error terms. D) weighted least squares is efficient. Answer: A 5) The following is not part of the extended least squares assumptions for regression with a single regressor: A) $\mathrm{var}(u_i \mid X_i) = \sigma_u^2$. B) $E(u_i \mid X_i) = 0$. C) the conditional distribution of ui given Xi is normal. D) $\mathrm{var}(u_i \mid X_i) = \sigma_{u,i}^2$. Answer: D 6) The extended least squares assumptions are of interest, because A) they will often hold in practice.
B) if they hold, then OLS is consistent. C) they allow you to study additional theoretical properties of OLS. D) if they hold, we can no longer calculate confidence intervals. Answer: C


7) Asymptotic distribution theory is A) not practically relevant, because we never have an infinite number of observations. B) only of theoretical interest. C) of interest because it tells you what the distribution approximately looks like in small samples. D) the distribution of statistics when the sample size is very large. Answer: D 8) Besides the Central Limit Theorem, the other cornerstone of asymptotic distribution theory is the A) normal distribution. B) OLS estimator. C) Law of Large Numbers. D) Slutsky’s theorem. Answer: C 9) The link between the variance of Y and the probability that Y is within ±δ of μY is provided by A) Slutsky’s theorem. B) the Central Limit Theorem. C) the Law of Large Numbers. D) Chebychev’s inequality. Answer: D 10) It is possible for an estimator of μY to be inconsistent while A) converging in probability to μY. B) $S_n \xrightarrow{p} \mu_Y$. C) unbiased. D) $\Pr[|S_n - \mu_Y| \geq \delta] \to 0$. Answer: C 11) Slutsky’s theorem combines the Law of Large Numbers A) with continuous functions. B) and the normal distribution. C) and the Central Limit Theorem. D) with conditions for the unbiasedness of an estimator. Answer: C 12) An implication of $\sqrt{n}(\hat{\beta}_1 - \beta_1) \xrightarrow{d} N\left(0, \frac{\mathrm{var}(v_i)}{[\mathrm{var}(X_i)]^2}\right)$ is that A) $\hat{\beta}_1$ is unbiased. B) $\hat{\beta}_1$ is consistent. C) OLS is BLUE. D) there is heteroskedasticity in the errors. Answer: B


13) Under the five extended least squares assumptions, the homoskedasticity-only t-statistic in this chapter A) has a Student t distribution with n-2 degrees of freedom. B) has a normal distribution. C) converges in distribution to a $\chi^2_{n-2}$ distribution. D) has a Student t distribution with n degrees of freedom. Answer: A 14) You need to adjust $S_{\hat{u}}^2$ by the degrees of freedom to ensure that $S_{\hat{u}}^2$ is A) an unbiased estimator of $\sigma_u^2$. B) a consistent estimator of $\sigma_u^2$. C) efficient in small samples. D) F-distributed. Answer: A 15) $E\left[\frac{1}{n-2}\sum_{i=1}^{n}\hat{u}_i^2\right]$ A) is the expected value of the homoskedasticity-only standard errors. B) $= \sigma_u^2$. C) exists only asymptotically. D) $= \sigma_u^2/(n-2)$. Answer: B 16) The Gauss-Markov Theorem proves that A) the OLS estimator is t distributed. B) the OLS estimator has the smallest mean square error. C) the OLS estimator is unbiased. D) with homoskedastic errors, the OLS estimator has the smallest variance in the class of linear and unbiased estimators, conditional on X1, …, Xn. Answer: D 17) The following is not one of the Gauss-Markov conditions: A) $\mathrm{var}(u_i \mid X_1, \ldots, X_n) = \sigma_u^2$, $0 < \sigma_u^2 < \infty$ for i = 1, …, n. B) the errors are normally distributed. C) $E(u_i u_j \mid X_1, \ldots, X_n) = 0$, i = 1, …, n, j = 1, …, n, i ≠ j. D) $E(u_i \mid X_1, \ldots, X_n) = 0$. Answer: B


18) The class of linear conditionally unbiased estimators consists of A) all estimators of β1 that are linear functions of Y1, …, Yn and that are unbiased, conditional on X1, …, Xn. B) OLS, WLS, and TSLS. C) those estimators that are asymptotically normally distributed. D) all estimators of β1 that are linear functions of X1, …, Xn and that are unbiased, conditional on X1, …, Xn. Answer: A 19) The OLS estimator is a linear estimator, $\hat{\beta}_1 = \sum_{i=1}^{n} \hat{a}_i Y_i$, where $\hat{a}_i$ = A) $\frac{X_i - \bar{X}}{\sum_{j=1}^{n}(X_j - \bar{X})^2}$. B) $\frac{1}{n}$. C) $\frac{X_i - \bar{X}}{\sum_{j=1}^{n}(X_j - \bar{X})}$. D) $\frac{X_i}{\sum_{j=1}^{n}(X_j - \bar{X})^2}$. Answer: A 20) If the errors are heteroskedastic, then A) the OLS estimator is still BLUE as long as the regressors are nonrandom. B) the usual formula cannot be used for the OLS estimator. C) your model becomes overidentified. D) the OLS estimator is not BLUE. Answer: D 21) Estimation by WLS A) although harder than OLS, will always produce a smaller variance. B) does not mean that you should use homoskedasticity-only standard errors on the transformed equation. C) requires quite a bit of knowledge about the conditional variance function. D) makes it very hard to interpret the coefficients, since the data is now weighted and not any longer in its original form. Answer: C 22) The WLS estimator is called infeasible WLS estimator when A) the memory required to compute it on your PC is insufficient. B) the conditional variance function is not known. C) the numbers used to compute the estimator get too large. D) calculating the weights requires you to take a square root of a negative number. Answer: B


23) Feasible WLS does not rely on the following condition: A) the conditional variance depends on a variable which does not have to appear in the regression function. B) estimating the conditional variance function. C) the key assumptions for OLS estimation have to apply when estimating the conditional variance function. D) the conditional variance depends on a variable which appears in the regression function. Answer: D 24) In practice, the most difficult aspect of feasible WLS estimation is A) knowing the functional form of the conditional variance. B) applying the WLS rather than the OLS formula. C) finding an econometric package that actually calculates WLS. D) applying WLS when you have a log-log functional form. Answer: A 25) The advantage of using heteroskedasticity-robust standard errors is that A) they are easier to compute than the homoskedasticity-only standard errors. B) they produce asymptotically valid inferences even if you do not know the form of the conditional variance function. C) it makes the OLS estimator BLUE, even in the presence of heteroskedasticity. D) they do not unnecessarily complicate matters, since in real-world applications, the functional form of the conditional variance can easily be found. Answer: B 26) Homoskedasticity means that A) $\mathrm{var}(u_i \mid X_i) = \sigma_{u_i}^2$. B) $\mathrm{var}(X_i) = \sigma_u^2$. C) $\mathrm{var}(u_i \mid X_i) = \sigma_u^2$. D) $\mathrm{var}(u_i \mid X_i) = \hat{\sigma}_{u_i}^2$. Answer: C 27) In order to use the t-statistic for hypothesis testing and constructing a 95% confidence interval as 1.96 standard errors, the following three assumptions have to hold: A) the conditional mean of ui, given Xi, is zero; (Xi, Yi), i = 1, 2, …, n are i.i.d. draws from their joint distribution; Xi and ui have four moments B) the conditional mean of ui, given Xi, is zero; (Xi, Yi), i = 1, 2, …, n are i.i.d. draws from their joint distribution; homoskedasticity C) the conditional mean of ui, given Xi, is zero; (Xi, Yi), i = 1, 2, …, n are i.i.d. draws from their joint distribution; the conditional distribution of ui given Xi is normal D) none of the above Answer: A


28) If the variance of u is quadratic in X, then it can be expressed as A) $\mathrm{var}(u_i \mid X_i) = \theta_0^2$ B) $\mathrm{var}(u_i \mid X_i) = \theta_0 + \theta_1 X_i^{1/2}$ C) $\mathrm{var}(u_i \mid X_i) = \theta_0 + \theta_1 X_i^2$ D) $\mathrm{var}(u_i \mid X_i) = \sigma_u^2$ Answer: C 29) In practice, you may want to use the OLS estimator instead of the WLS because A) heteroskedasticity is seldom a realistic problem B) OLS is easier to calculate C) heteroskedasticity robust standard errors can be calculated D) the functional form of the conditional variance function is rarely known Answer: D 30) If the functional form of the conditional variance function is incorrect, then A) the standard errors computed by WLS regression routines are invalid B) the OLS estimator is biased C) instrumental variable techniques have to be used D) the regression R2 can no longer be computed Answer: A 31) Suppose that the conditional variance is var(ui|Xi) = λh(Xi) where λ is a constant and h is a known function. The WLS estimator is A) the same as the OLS estimator since the function is known B) can only be calculated if you have at least 100 observations C) the estimator obtained by first dividing the dependent variable and regressor by the square root of h and then regressing this modified dependent variable on the modified regressor using OLS D) the estimator obtained by first dividing the dependent variable and regressor by h and then regressing this modified dependent variable on the modified regressor using OLS Answer: C 32) The large-sample distribution of $\hat{\beta}_1$ is A) $\sqrt{n}(\hat{\beta}_1 - \beta_1) \xrightarrow{d} N\left(0, \frac{\mathrm{var}(\nu_i)}{[\mathrm{var}(X_i)]^2}\right)$ where $\nu_i = (X_i - \mu_X)u_i$ B) $\sqrt{n}(\hat{\beta}_1 - \beta_1) \xrightarrow{d} N\left(0, \frac{\mathrm{var}(\nu_i)}{[\mathrm{var}(X_i)]^2}\right)$ where $\nu_i = u_i$ C) $\sqrt{n}(\hat{\beta}_1 - \beta_1) \xrightarrow{d} N\left(0, \frac{\mathrm{var}(\nu_i)}{[\mathrm{var}(X_i)]^2}\right)$ where $\nu_i = X_i u_i$ D) $\sqrt{n}(\hat{\beta}_1 - \beta_1) \xrightarrow{d} N\left(0, \frac{\sigma_u^2}{[\mathrm{var}(X_i)]^2}\right)$ Answer: A
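The WLS transformation described in question 31 (option C) can be sketched as follows. This is a minimal illustration, not a production routine: it assumes a single regressor plus an intercept, a known function h with h(Xi) > 0, and it divides the dependent variable, the regressor and the constant by the square root of h before running OLS without a further intercept. The function name `wls_known_h` and the example data are hypothetical.

```python
import math


def wls_known_h(x, y, h):
    """WLS for Y = b0 + b1*X + u with var(u|X) = lambda * h(X), h known.

    Each observation is divided by sqrt(h(X_i)); OLS is then applied to the
    transformed constant and regressor (no additional intercept).
    """
    weights = [1.0 / math.sqrt(h(xi)) for xi in x]
    y_t = [w * yi for w, yi in zip(weights, y)]
    c_t = list(weights)                      # transformed constant
    x_t = [w * xi for w, xi in zip(weights, x)]

    # Normal equations of the two-regressor OLS problem on transformed data.
    s_cc = sum(c * c for c in c_t)
    s_cx = sum(c * xv for c, xv in zip(c_t, x_t))
    s_xx = sum(xv * xv for xv in x_t)
    s_cy = sum(c * yv for c, yv in zip(c_t, y_t))
    s_xy = sum(xv * yv for xv, yv in zip(x_t, y_t))

    det = s_cc * s_xx - s_cx ** 2
    b0 = (s_xx * s_cy - s_cx * s_xy) / det
    b1 = (s_cc * s_xy - s_cx * s_cy) / det
    return b0, b1


# Hypothetical data lying exactly on Y = 2 + 3X, with h(X) = X^2.
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
y_data = [2.0 + 3.0 * xi for xi in x_data]
b0_hat, b1_hat = wls_known_h(x_data, y_data, h=lambda xi: xi ** 2)
print(b0_hat, b1_hat)
```

The constant λ does not need to be known, because scaling all weights by the same factor leaves the estimates unchanged.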


33) (Requires Appendix material) If X and Y are jointly normally distributed and are uncorrelated, A) then their product is chi-square distributed with n-2 degrees of freedom B) then they are independently distributed C) then their ratio is t-distributed D) none of the above is true Answer: B

34) Assume that $\mathrm{var}(u_i|X_i) = \theta_0 + \theta_1 X_i^2$. One way to estimate $\theta_0$ and $\theta_1$ consistently is to regress A) $\hat u_i$ on $X_i^2$ using OLS B) $\hat u_i^2$ on $X_i^2$ using OLS C) $\hat u_i^2$ on $\sqrt{X_i}$ using OLS D) $\hat u_i^2$ on $X_i^2$ using OLS but suppressing the constant ("restricted least squares") Answer: B

35) Assume that the variance depends on a third variable, $W_i$, which does not appear in the regression function, and that $\mathrm{var}(u_i|X_i, W_i) = \theta_0 + \theta_1 \frac{1}{W_i}$. One way to estimate $\theta_0$ and $\theta_1$ consistently is to regress A) $\hat u_i$ on $W_i^2$ using OLS B) $\hat u_i$ on $\frac{1}{W_i}$ using OLS C) $\hat u_i^2$ on $\frac{X_i}{W_i}$ using OLS D) $\hat u_i^2$ on $\frac{1}{W_i}$ using OLS
Answer: D
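The regression named in the answer to question 35 can be tried on simulated data. The sketch below is illustrative only: the data-generating process (a uniform $W_i$, a regression with intercept 1 and slope 0.5, and the values $\theta_0 = 1$, $\theta_1 = 2$) and the helper names `ols_simple` and `estimate_variance_function` are assumptions made for this example, not part of the test bank.

```python
import random


def ols_simple(x, y):
    """Return (intercept, slope) from an OLS regression of y on x with a constant."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def estimate_variance_function(n=50_000, theta0=1.0, theta1=2.0, seed=0):
    """Simulate var(u|X,W) = theta0 + theta1/W and estimate (theta0, theta1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w = [rng.uniform(1.0, 5.0) for _ in range(n)]  # hypothetical third variable
    u = [(theta0 + theta1 / wi) ** 0.5 * rng.gauss(0.0, 1.0) for wi in w]
    y = [1.0 + 0.5 * xi + ui for xi, ui in zip(x, u)]

    # Step 1: OLS of Y on X gives the residuals u-hat.
    b0, b1 = ols_simple(x, y)
    resid_sq = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]

    # Step 2: OLS of squared residuals on 1/W; intercept -> theta0, slope -> theta1.
    inv_w = [1.0 / wi for wi in w]
    return ols_simple(inv_w, resid_sq)


theta0_hat, theta1_hat = estimate_variance_function()
print(theta0_hat, theta1_hat)
```

With a large simulated sample the two estimates should land near the assumed values 1 and 2, which is what consistency of this auxiliary regression suggests.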


17.2 Essays and Longer Questions 1) Discuss the properties of the OLS estimator when the regression errors are homoskedastic and normally distributed. What can you say about the distribution of the OLS estimator when these features are absent? Answer: In the initial discussion of the OLS estimator, it was established that if the three least squares assumptions hold, then the OLS estimator is unbiased, consistent, and has an asymptotically normal distribution. Small sample properties are more difficult to establish, at least in the case when the regressors are random variables. If the assumption of homoskedasticity is added to the previous assumptions, then the OLS estimator is efficient in the class of linear and conditionally unbiased estimators. This result is known as the Gauss-Markov Theorem. Since the proof depends on the assumption of homoskedasticity, OLS is not efficient in its absence. In that case, an alternative estimator, WLS, is efficient in large samples. However, the result depends on knowing the functional form of the heteroskedasticity, so that the parameters can be estimated. If the functional form is unknown, which is the case in virtually all real-world applications, then using the computed standard errors results in invalid statistical inference. If the conditional distribution of the errors is normal, then a small sample distribution for the OLS estimator can be derived using the homoskedasticity-only standard errors. The resulting t-statistic now follows a Student t distribution. 2) What does the Gauss-Markov theorem prove? Without giving mathematical details, explain how the proof proceeds. What is its importance? Answer: The Gauss-Markov Theorem proves that in the class of linear and unbiased estimators the OLS estimator has the smallest variance or is BLUE. The proof first establishes the conditions under which a linear estimator is unbiased. It then derives the variance of the estimator. 
The smallest variance property is then established by showing that the conditional variance of any other linear and unbiased estimator exceeds that of the OLS estimator, unless they are the same. To show this it is assumed that the OLS weights and the weights of any other linear estimator differ by some amount. Substitution of this condition into the conditional variance formula for any linear and unbiased estimator then shows that the resulting variance exceeds that of the OLS estimator unless the difference in the weights is zero. Hence OLS is BLUE. The Gauss-Markov Theorem gave the major justification for the widespread use of the OLS estimator.

3) One of the earlier textbooks in econometrics, first published in 1971, compared “estimation of a parameter to shooting at a target with a rifle. The bull’s-eye can be taken to represent the true value of the parameter, the rifle the estimator, and each shot a particular estimate.” Use this analogy to discuss small and large sample properties of estimators. How do you think the author approached the n → ∞ condition? (Dependent on your view of the world, feel free to substitute guns with bow and arrow, or missile.)

Answer: Unbiasedness: the shots produce a scatter, but the center of the scatter is the bull’s-eye. If the rifle produces a scatter of shots that is centered on another point, then the gun is biased. Efficiency: Requires comparison with other unbiased guns. Looking at the scatters produced by the shots, the smallest scatter is the one from the efficient gun. BLUE: Remove all guns which are not linear and/or are biased. The gun among the remaining ones which produces the smallest scatter is the BLUE gun. Consistency: n → ∞ is the condition as you march towards the bull’s-eye, i.e., the distance becomes shorter as n → ∞. A shot fired from a consistent gun hits the bull’s-eye with increasing probability as you get closer to the bull’s-eye. Or, perhaps even better, you might want to substitute “being very close to the bull’s-eye” for “hitting the bull’s-eye.”


4) “I am an applied econometrician and therefore should not have to deal with econometric theory. There will be others who I leave that to. I am more interested in interpreting the estimation results.” Evaluate. Answer: Being presented with regression output and interpreting these uncritically does not allow the applied econometrician to understand the limitations of the tool. As a result, the interpretation may be false as might be the case in rejecting hypotheses when standard statistical inference does not apply in the situation at hand. In particular, having knowledge of econometric theory allows the econometrician to check whether or not the assumptions, which are necessary for statistical properties to hold, apply in a given situation. Knowing when to apply and when not to apply certain techniques is essential in conducting statistical inference, such as hypothesis testing and using confidence intervals. If the applied econometrician understands the limitations of certain estimation techniques, such as OLS, then she will be able to look for alternative approaches rather than blindly applying techniques by pushing “buttons” in econometric software. The above statement therefore seems short-sighted. 5) “One should never bother with WLS. Using OLS with robust standard errors gives correct inference, at least asymptotically.” True, false, or a bit of both? Explain carefully what the quote means and evaluate it critically. Answer: WLS is a special case of the GLS estimator. Furthermore, OLS is a special case of the WLS estimator. Both will produce different estimates of the intercept and the coefficients of the other regressors, and different estimates of their standard errors. WLS has the advantage over OLS, that it is (asymptotically) more efficient than OLS. However, the efficiency result depends on knowing the conditional variance function. When this is the case, the parameters can be estimated and the weights can be specified. 
Unfortunately in practice, as Stock and Watson put it, “the functional form of the conditional variance function is rarely known.” Using an incorrect functional form for the estimation of the parameters results in incorrect statistical inference. The bottom line is that WLS should be used in those rare instances where the functional form is known, but not otherwise. Estimation of the parameters using OLS with heteroskedasticity-robust standard errors, on the other hand, leads to asymptotically valid inferences even for the case where the functional form of the heteroskedasticity is not known. It therefore seems that for real world applications the above statement is true.
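The contrast between homoskedasticity-only and heteroskedasticity-robust standard errors discussed in the answer above can be illustrated numerically. The following is a minimal sketch under assumed conditions: the simulated design ($X_i \sim N(0,1)$, $u_i = X_i e_i$, true slope 1) and the function name `ols_with_standard_errors` are hypothetical choices for this example. The robust formula used is the usual single-regressor one, $\hat\sigma^2_{\hat\beta_1} = \frac{1}{n}\,\frac{\frac{1}{n-2}\sum (X_i-\bar X)^2 \hat u_i^2}{[\frac{1}{n}\sum (X_i-\bar X)^2]^2}$.

```python
import math
import random


def ols_with_standard_errors(x, y):
    """OLS slope with homoskedasticity-only and heteroskedasticity-robust SEs."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Homoskedasticity-only: s_u^2 / sum (X_i - X_bar)^2
    s2 = sum(e * e for e in resid) / (n - 2)
    se_homo = math.sqrt(s2 / sxx)

    # Robust: (1/n) * [(1/(n-2)) sum dx^2 e^2] / [(1/n) sum dx^2]^2
    num = sum((d * e) ** 2 for d, e in zip(dx, resid)) / (n - 2)
    se_robust = math.sqrt(num / (sxx / n) ** 2 / n)
    return slope, se_homo, se_robust


rng = random.Random(5)
n = 5_000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Assumed design: var(u|X) = X^2, a form the analyst is not supposed to know.
y = [1.0 * xi + xi * rng.gauss(0.0, 1.0) for xi in x]

slope, se_homo, se_robust = ols_with_standard_errors(x, y)
print(slope, se_homo, se_robust)
```

In this assumed design the homoskedasticity-only standard error understates the sampling uncertainty, while the robust one does not require the variance function to be specified, which is the practical point of the quote.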

17.3 Mathematical and Graphical Problems

1) Consider the model $Y_i = \beta_1 X_i + u_i$, where $u_i = c X_i^2 e_i$ and all of the X’s and e’s are i.i.d. and distributed N(0,1). (a) Which of the Extended Least Squares Assumptions are satisfied here? Prove your assertions. (b) Would an OLS estimator of $\beta_1$ be efficient here? (c) How would you estimate $\beta_1$ by WLS?

Answer: (a) The extended least squares assumptions are:
1. $E(u_i|X_i) = E(cX_i^2 e_i|X_i) = 0$ (conditional mean zero): this holds here since the X’s and e’s are independent and $E(e_i) = 0$, so that $E(cX_i^2 e_i|X_i) = cX_i^2 E(e_i) = 0$;
2. $(X_i, Y_i)$, i = 1,…, n are independent and identically distributed (i.i.d.) draws from their joint distribution: this applies here;
3. $(X_i, u_i)$ have nonzero finite fourth moments: this follows from the normal distribution, which has moments of all orders;
4. $\mathrm{var}(u_i|X_i) = \sigma_u^2$ (homoskedasticity): this fails since $\mathrm{var}(u_i|X_i) = c^2 X_i^4$; and
5. The conditional distribution of $u_i$ given $X_i$ is normal (normal errors): this holds since, conditional on $X_i$, $u_i = cX_i^2 e_i$ is a constant multiple of the normal variable $e_i$.
(b) No. Since the model is heteroskedastic, OLS is not efficient and WLS offers efficiency gains.
(c) You would weight each observation by $1/X_i^2$, i.e., regress $Y_i/X_i^2$ on $1/X_i$.
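The efficiency claim in parts (b) and (c) can be checked with a small Monte Carlo experiment. This is a sketch rather than a proof: the values $\beta_1 = 2$, $c = 1$, the sample size, the number of replications, and the function name `simulate_estimates` are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_estimates(n=100, reps=300, beta1=2.0, c=1.0, seed=1):
    """Return lists of OLS and WLS estimates of beta1 across simulated samples."""
    rng = random.Random(seed)
    ols, wls = [], []
    for _ in range(reps):
        # A draw of exactly 0.0 would break the division below; it is
        # practically impossible for a continuous distribution.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [beta1 * xi + c * xi ** 2 * ei for xi, ei in zip(x, e)]

        # OLS without intercept: sum(X*Y) / sum(X^2)
        ols.append(sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x))

        # WLS: regress Y/X^2 on 1/X (no intercept), so the error is c*e_i.
        y_t = [yi / xi ** 2 for xi, yi in zip(x, y)]
        x_t = [1.0 / xi for xi in x]
        wls.append(sum(a * b for a, b in zip(x_t, y_t)) / sum(a * a for a in x_t))
    return ols, wls


ols_estimates, wls_estimates = simulate_estimates()
var_ols = statistics.pvariance(ols_estimates)
var_wls = statistics.pvariance(wls_estimates)
print(var_ols, var_wls)
```

Both estimators are centered on the assumed $\beta_1$, but the spread of the WLS estimates across replications should be visibly smaller than that of the OLS estimates.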


2) (Requires Appendix material) This question requires you to work with Chebychev’s Inequality. (a) State Chebychev’s Inequality. (b) Chebychev’s Inequality is sometimes stated in the form “The probability that a random variable is further than k standard deviations from its mean is less than $1/k^2$.” Deduce this form. (Hint: choose δ artfully.) (c) If X is distributed N(0,1), what is the probability that X is more than two standard deviations from its mean? Three? What is the Chebychev bound for these values? (d) It is sometimes said that the Chebychev inequality is not “sharp.” What does that mean?

Answer: (a) $\Pr(|V - \mu_V| \ge \delta) \le \mathrm{var}(V)/\delta^2$, where V is a random variable and δ is a positive constant. (b) In the statement of the result, choose $\delta = k\sigma$, where $\sigma^2 = \mathrm{var}(V)$. (c) 0.046 and 0.0027 respectively. (The smallest/largest z-value in Table 1 of the textbook is -2.99/2.99. Using these values, the second number modifies to 0.0028.) Chebychev’s inequality gives 0.25 and 0.11, respectively. (d) This means that, for some distributions, the probability that a random variable is further than k standard deviations away from its mean is much less than $1/k^2$.

3) For this question you may assume that linear combinations of normal variates are themselves normally distributed. Let a, b, and c be non-zero constants. (a) X and Y are independently distributed as $N(a, \sigma^2)$. What is the distribution of (bX + cY)? (b) If $X_1, \ldots, X_n$ are distributed i.i.d. as $N(a, \sigma_X^2)$, what is the distribution of $\frac{1}{n}\sum_{i=1}^{n} X_i$? (c) Draw this distribution for different values of n. What is the asymptotic distribution of this statistic? (d) Comment on the relationship between your diagram and the concept of consistency. (e) Let $\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$. What is the distribution of $\sqrt{n}(\bar X - a)$? Does your answer depend on n?

Answer: (a) $E(bX + cY) = bE(X) + cE(Y) = a(b + c)$; $\mathrm{var}(bX + cY) = (b^2 + c^2)\sigma^2$. Hence (bX + cY) is distributed $N(a(b + c), \sigma^2(b^2 + c^2))$. (b) From (a) it follows that this is distributed as $N(a, \sigma_X^2/n)$. (c) The curves will be normal curves centered on a, but becoming spike-like as n grows. (d) The diagram shows that, as n grows, the probability distribution concentrates on a. The probability of observing a value of $\frac{1}{n}\sum_{i=1}^{n} X_i$ different from a becomes small as n grows. This is consistency. (e) $\sqrt{n}(\bar X - a)$ is distributed $N(0, \sigma_X^2)$. This does not depend on n, in contrast to the large-sample non-normal case where this distribution is only approached as n grows.

4) Consider the model $Y_i = \beta_1 X_i + u_i$, where the $X_i$ and $u_i$ are mutually independent i.i.d. random variables with finite fourth moment and $E(u_i) = 0$.
(a) Let $\hat\beta_1$ denote the OLS estimator of $\beta_1$. Show that
$$\sqrt{n}(\hat\beta_1 - \beta_1) = \frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i u_i}{\frac{1}{n}\sum_{i=1}^{n} X_i^2}.$$


(b) What is the mean and the variance of $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i u_i$? Assuming that the Central Limit Theorem holds, what is its limiting distribution?
(c) Deduce the limiting distribution of $\sqrt{n}(\hat\beta_1 - \beta_1)$. State what theorems are necessary for your deduction.

Answer: (a) The OLS estimator in this case is $\hat\beta_1 = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}$. Substituting for $Y_i$ into the estimator and re-arranging terms then gives the above expression.
(b) The mean is zero and the variance is obtained from
$$\mathrm{var}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i u_i\right) = \frac{1}{n}\, n\, \mathrm{var}(X_i u_i) = \sigma_u^2 E(X_i^2).$$
If the Central Limit Theorem holds, then this will be distributed $N(0, \sigma_u^2 E(X_i^2))$.
(c) Let
$$\sqrt{n}(\hat\beta_1 - \beta_1) = \frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n} X_i u_i}{\frac{1}{n}\sum_{i=1}^{n} X_i^2} = \frac{x_N}{b_N},$$
say. Then $x_N$ approaches $N(0, \sigma_u^2 E(X_i^2))$ in distribution, and $b_N$ approaches $E(X_i^2)$ in probability. It follows that $x_N/b_N$ approaches $x/b$ in distribution, which is distributed $N(0, \sigma_u^2/E(X_i^2))$ (Slutsky’s theorem).

5) (Requires Appendix material) If the Gauss-Markov conditions hold, then OLS is BLUE. In addition, assume here that X is nonrandom. Your textbook proves the Gauss-Markov theorem by using the simple regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$ and assuming a linear estimator $\tilde\beta_1 = \sum_{i=1}^{n} a_i Y_i$. Substitution of the simple regression model into this expression then results in two conditions for the unbiasedness of the estimator:
$$\sum_{i=1}^{n} a_i = 0 \quad \text{and} \quad \sum_{i=1}^{n} a_i X_i = 1.$$
The variance of the estimator is $\mathrm{var}(\tilde\beta_1 | X_1, \ldots, X_n) = \sigma_u^2 \sum_{i=1}^{n} a_i^2$. Different from your textbook, use the Lagrangian method to minimize the variance subject to the two constraints. Show that the resulting weights correspond to the OLS weights.


Answer: Define the Lagrangian as follows:
$$L = \sigma_u^2 \sum_{i=1}^{n} a_i^2 - \lambda_1 \sum_{i=1}^{n} a_i - \lambda_2 \left(\sum_{i=1}^{n} a_i X_i - 1\right).$$
To obtain the first order conditions, take the (n+2) derivatives with respect to the n weights and the two Lagrange multipliers and set these to zero.
$$\frac{\partial L}{\partial a_i} = 0 = 2a_i\sigma_u^2 - \lambda_1 - \lambda_2 X_i; \quad i = 1, \ldots, n$$
$$\frac{\partial L}{\partial \lambda_1} = 0 = \sum_{i=1}^{n} a_i$$
$$\frac{\partial L}{\partial \lambda_2} = 0 = \sum_{i=1}^{n} a_i X_i - 1$$
Using the summation operator on both sides of the first equation and bringing the first constraint into play then gives $\lambda_1 = -\lambda_2 \bar X$. Using this result in the first equation to eliminate the first Lagrange multiplier results in the following conditions for the n weights: $2a_i\sigma_u^2 = \lambda_2(X_i - \bar X)$. To bring the second constraint into play, multiply both sides by $X_i$ and use the summation operator on both sides again:
$$2\sigma_u^2 \sum_{i=1}^{n} a_i X_i = \lambda_2 \sum_{i=1}^{n} (X_i - \bar X) X_i \quad \text{or} \quad 2\sigma_u^2 = \lambda_2 \sum_{i=1}^{n} (X_i - \bar X)^2.$$
Substituting the result for the second Lagrange multiplier, $\lambda_2 = \frac{2\sigma_u^2}{\sum_{i=1}^{n} (X_i - \bar X)^2}$, into $2a_i\sigma_u^2 = \lambda_2(X_i - \bar X)$ then gives
$$2a_i\sigma_u^2 = \frac{2\sigma_u^2}{\sum_{i=1}^{n} (X_i - \bar X)^2}(X_i - \bar X)$$
and, after simplifying,
$$a_i = \frac{X_i - \bar X}{\sum_{i=1}^{n} (X_i - \bar X)^2}.$$
But these are the OLS weights, since the OLS slope estimator is defined as follows:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n} (X_i - \bar X)^2} = \frac{\sum_{i=1}^{n} (X_i - \bar X) Y_i}{\sum_{i=1}^{n} (X_i - \bar X)^2} = \sum_{i=1}^{n} w_i Y_i, \quad \text{where } w_i = \frac{X_i - \bar X}{\sum_{i=1}^{n} (X_i - \bar X)^2}.$$

6) Your textbook states that an implication of the Gauss-Markov theorem is that the sample average, $\bar Y$, is the most efficient linear estimator of $E(Y_i)$ when $Y_1, \ldots, Y_n$ are i.i.d. with $E(Y_i) = \mu_Y$ and $\mathrm{var}(Y_i) = \sigma_Y^2$. This follows from the regression model with no slope and the fact that the OLS estimator is BLUE. Provide a proof by assuming a linear estimator in the Y’s, $\tilde\mu = \sum_{i=1}^{n} a_i Y_i$. (a) State the condition under which this estimator is unbiased. (b) Derive the variance of this estimator.

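Answer: (a) $E(\tilde\mu) = E\left(\sum_{i=1}^{n} a_i Y_i\right) = \sum_{i=1}^{n} a_i E(Y_i) = \mu_Y \sum_{i=1}^{n} a_i$. Hence for this to be an unbiased estimator, the following condition must hold: $\sum_{i=1}^{n} a_i = 1$.

(b) Using $\sum_{i=1}^{n} a_i = 1$ and the independence of the $Y_i$ (so that all cross terms have expectation zero),
$$\mathrm{var}(\tilde\mu) = E[(\tilde\mu - E(\tilde\mu))^2] = E\left[\left(\sum_{i=1}^{n} a_i Y_i - \mu_Y\right)^2\right] = E\left[\left(\sum_{i=1}^{n} a_i (Y_i - \mu_Y)\right)^2\right] = \sum_{i=1}^{n} a_i^2 E[(Y_i - \mu_Y)^2] = \sigma_Y^2 \sum_{i=1}^{n} a_i^2.$$

(c) Define the Lagrangian $L = \sigma_Y^2 \sum_{i=1}^{n} a_i^2 - \lambda\left(\sum_{i=1}^{n} a_i - 1\right)$, where λ is the Lagrange multiplier. To obtain the first order conditions, minimize L with respect to the n weights and the Lagrange multiplier, and solve the resulting (n+1) equations in the (n+1) unknowns.
$$\frac{\partial L}{\partial a_i} = 0 = 2\sigma_Y^2 a_i - \lambda; \quad i = 1, \ldots, n$$
$$\frac{\partial L}{\partial \lambda} = 0 = \sum_{i=1}^{n} a_i - 1$$
Summing the first equation gives $2\sigma_Y^2 \sum_{i=1}^{n} a_i = n\lambda$, and bringing in the second equation subsequently results in $\lambda = \frac{2\sigma_Y^2}{n}$. Substituting this result into the first equation then gives $2\sigma_Y^2 a_i = \frac{2\sigma_Y^2}{n}$; i = 1,..., n, or $a_i = \frac{1}{n}$; i = 1,..., n. Since these are also the OLS weights, then OLS is BLUE.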
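The two Lagrangian results above (the weights $a_i = (X_i - \bar X)/\sum (X_i - \bar X)^2$ for the slope in problem 5 and $a_i = 1/n$ for the mean in problem 6) can be checked numerically. The sketch below is an illustration under assumptions: the X values are arbitrary simulated numbers, and the helper names `gauss_markov_weights`, `feasible_perturbation`, and `sum_sq` are hypothetical. It moves away from the candidate weights in a direction that keeps the unbiasedness constraints satisfied and compares $\sum a_i^2$, which is proportional to the variance.

```python
import random


def gauss_markov_weights(x):
    """Weights a_i = (X_i - X_bar) / sum (X_j - X_bar)^2 from the Lagrangian solution."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [(xi - x_bar) / sxx for xi in x]


def feasible_perturbation(x, d):
    """Project d so that sum(d') = 0 and sum(d' * X) = 0 (constraints stay satisfied)."""
    n = len(x)
    x_bar = sum(x) / n
    d_bar = sum(d) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    gamma = sum((di - d_bar) * (xi - x_bar) for di, xi in zip(d, x)) / sxx
    return [di - d_bar - gamma * (xi - x_bar) for di, xi in zip(d, x)]


def sum_sq(w):
    return sum(wi * wi for wi in w)


rng = random.Random(42)
x = [rng.uniform(0.0, 10.0) for _ in range(25)]
raw = [rng.gauss(0.0, 1.0) for _ in x]

# Problem 5: slope weights versus another linear unbiased estimator.
a = gauss_markov_weights(x)
d = feasible_perturbation(x, raw)
a_alt = [ai + 0.1 * di for ai, di in zip(a, d)]

# Problem 6: equal weights versus other weights that still sum to one.
n = len(x)
b = [1.0 / n] * n
raw_bar = sum(raw) / n
b_alt = [bi + 0.01 * (ri - raw_bar) for bi, ri in zip(b, raw)]

print(sum_sq(a), sum_sq(a_alt))
print(sum_sq(b), sum_sq(b_alt))
```

In both cases the alternative weights satisfy the same constraints but have a larger sum of squares, which matches the minimum-variance conclusion of the derivations.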

7) (Requires Appendix material) State and prove Chebychev’s Inequality.

Answer: The proof here reproduces the relevant section from Appendix 15.2 of the textbook. Chebychev’s inequality uses the variance of the random variable V to bound the probability that V is farther than ± δ from its mean, where δ is a positive constant:
$$\Pr(|V - \mu_V| \ge \delta) \le \mathrm{var}(V)/\delta^2 \quad \text{(Chebychev’s inequality)}.$$
Proof. Let $W = V - \mu_V$, let f be the p.d.f. of W, and let δ be any positive number. Now,
$$E(W^2) = \int_{-\infty}^{\infty} w^2 f(w)\,dw$$
$$= \int_{-\infty}^{-\delta} w^2 f(w)\,dw + \int_{-\delta}^{\delta} w^2 f(w)\,dw + \int_{\delta}^{\infty} w^2 f(w)\,dw$$
$$\ge \int_{-\infty}^{-\delta} w^2 f(w)\,dw + \int_{\delta}^{\infty} w^2 f(w)\,dw$$
$$\ge \delta^2 \left[\int_{-\infty}^{-\delta} f(w)\,dw + \int_{\delta}^{\infty} f(w)\,dw\right]$$
$$= \delta^2 \Pr(|W| \ge \delta),$$
where the first equality is the definition of $E(W^2)$, the second equality holds because the range of integration divides up the real line, the first inequality holds because the term that was dropped is nonnegative, the second inequality holds because $w^2 \ge \delta^2$ over the range of integration, and the final equality holds by the definition of $\Pr(|W| \ge \delta)$. Substituting $W = V - \mu_V$ into the final expression, noting that $E(W^2) = E[(V - \mu_V)^2] = \mathrm{var}(V)$, and rearranging yields the inequality.

8) Consider the simple regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$ where $X_i > 0$ for all i, and the conditional variance is $\mathrm{var}(u_i|X_i) = \theta X_i^2$ where θ is a known constant with θ > 0.
(a) Write the weighted regression as $\tilde Y_i = \beta_0 \tilde X_{0i} + \beta_1 \tilde X_{1i} + \tilde u_i$. How would you construct $\tilde Y_i$, $\tilde X_{0i}$ and $\tilde X_{1i}$? (b) Prove that the variance of $\tilde u_i$ is homoskedastic. (c) Which coefficient is the intercept in the modified regression model? Which is the slope? (d) When interpreting the regression results, which of the two equations should you use, the original or the modified model?

Answer: (a) $\tilde Y_i = \frac{Y_i}{X_i}$, $\tilde X_{0i} = \frac{1}{X_i}$, and $\tilde X_{1i} = \frac{X_i}{X_i} = 1$.

(b) $\mathrm{var}(\tilde u_i|X_i) = \mathrm{var}\left(\frac{u_i}{X_i}\,\Big|\,X_i\right) = \frac{\mathrm{var}(u_i|X_i)}{X_i^2} = \frac{\theta X_i^2}{X_i^2} = \theta$, which is constant.

(c) The coefficient on $\tilde X_{1i}$ is now the intercept, while the coefficient on $\tilde X_{0i}$ is the slope. (d) The modified model is simply used to obtain estimates of the original model. The modified model should therefore not be used for interpretation.
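The transformation in problem 8 can be sketched in code. The example below is illustrative only: the parameter values ($\beta_0 = 3$, $\beta_1 = 1.5$, $\theta = 1$), the uniform distribution for $X_i$, and the helper name `ols_simple` are assumptions. It divides the simulated data through by $X_i$ and runs OLS on the transformed variables, which shows how the roles of intercept and slope swap in the modified model.

```python
import random


def ols_simple(x, y):
    """Return (intercept, slope) from an OLS regression of y on x with a constant."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(7)
n = 20_000
beta0, beta1, theta = 3.0, 1.5, 1.0  # assumed values for the illustration

x = [rng.uniform(1.0, 5.0) for _ in range(n)]            # X_i > 0 for all i
u = [theta ** 0.5 * xi * rng.gauss(0.0, 1.0) for xi in x]  # var(u|X) = theta * X^2
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

# Transformed variables: Y~ = Y/X, X~0 = 1/X, X~1 = X/X = 1 (the constant).
y_tilde = [yi / xi for xi, yi in zip(x, y)]
x0_tilde = [1.0 / xi for xi in x]

# Intercept of the modified regression estimates beta1 (original slope);
# the coefficient on 1/X estimates beta0 (original intercept).
beta1_hat, beta0_hat = ols_simple(x0_tilde, y_tilde)
print(beta0_hat, beta1_hat)
```

The printed estimates should be close to the assumed $\beta_0$ and $\beta_1$; they are read back into the original model, as part (d) recommends.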


9) (Requires Appendix material) Your textbook considers various distributions such as the standard normal, t, $\chi^2$, and F distribution, and relationships between them. (a) Using statistical tables, give examples that the following relationship holds: $F_{n_1,\infty} = \frac{\chi^2_{n_1}}{n_1}$. (b) $t_\infty$ is distributed standard normal, and the square of the t-distribution with $n_2$ degrees of freedom equals the value of the F distribution with $(1, n_2)$ degrees of freedom. Why does this relationship between the t and F distribution hold?

Answer: (a) For example, the critical value at the 10% significance level for the $F_{30,\infty}$ distribution is 1.34. The critical value at the 10% significance level for the $\chi^2_{30}$ distribution is 40.26, and dividing by 30 results in 1.34. (b) The textbook states that if $W_1$ and $W_2$ are independent random variables with chi-squared distributions and respective degrees of freedom $n_1$ and $n_2$, then the random variable
$$F = \frac{W_1/n_1}{W_2/n_2}$$
has an F distribution with $(n_1, n_2)$ degrees of freedom. This distribution is denoted $F_{n_1,n_2}$. For the t-distribution, the following holds: Let Z have a standard normal distribution, let W have a $\chi^2_m$ distribution, and let Z and W be independently distributed. Then the random variable
$$t = \frac{Z}{\sqrt{W/m}}$$
has a Student t distribution with m degrees of freedom, denoted $t_m$. Squaring this term gives $t^2 = \frac{Z^2}{W/m}$. But if $Z_1, Z_2, \ldots, Z_n$ are n i.i.d. standard normal random variables, then the random variable
$$W = \sum_{i=1}^{n} Z_i^2$$
has a chi-squared distribution with n degrees of freedom. Hence $Z^2$, the square of a standard normal variable, has a chi-square distribution with one degree of freedom. This gives
$$t^2 = \frac{Z^2/1}{W/m} = F_{1,m}.$$
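Both relationships in problem 9 can be illustrated by simulation instead of tables. The sketch below is an assumption-laden illustration: the number of draws, the choice m = 10, and the helper `quantile` are hypothetical, and a chi-squared variable with m degrees of freedom is generated as a Gamma variable with shape m/2 and scale 2.

```python
import random


def quantile(sample, p):
    """Simple empirical quantile: the order statistic at position p*(N-1)."""
    s = sorted(sample)
    return s[int(p * (len(s) - 1))]


rng = random.Random(3)
draws = 100_000

# (a) 10% critical value of chi-squared(30), divided by 30, approximates F(30, infinity).
chi2_30 = [rng.gammavariate(15.0, 2.0) for _ in range(draws)]
f_30_inf_crit = quantile(chi2_30, 0.90) / 30

# (b) t^2 with m degrees of freedom, built as Z^2 / (W/m), i.e. an F(1, m) variable.
m = 10
t_squared = []
for _ in range(draws):
    z = rng.gauss(0.0, 1.0)
    w = rng.gammavariate(m / 2, 2.0)
    t_squared.append(z * z / (w / m))
t_sq_crit = quantile(t_squared, 0.95)

print(f_30_inf_crit)  # compare with the tabulated 1.34
print(t_sq_crit)      # compare with the tabulated 5% critical value of F(1,10), about 4.96
```

The second number can also be compared with the square of the two-sided 5% critical value of the $t_{10}$ distribution, 2.23, which is the same quantity.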

10) Consider estimating a consumption function from a large cross-section sample of households. Assume that households at lower income levels do not have as much discretion for consumption variation as households with high income levels. After all, if you live below the poverty line, then almost all of your income is spent on necessities, and there is little room to save. On the other hand, if your annual income were $1 million, you could save quite a bit if you were a frugal person, or spend it all, if you prefer. Sketch what the scatterplot between consumption and income would look like in such a situation. What functional form do you think could approximate the conditional variance var(ui | Income)? Answer: See the accompanying figure. var(ui | Income) could be a + b × Income or a + b × Income^2. Hence there would be heteroskedasticity.


Chapter 18 The Theory of Multiple Regression

18.1 Multiple Choice

1) The extended least squares assumptions in the multiple regression model include four assumptions from Chapter 6 ($u_i$ has conditional mean zero; $(X_i, Y_i)$, i = 1,…, n are i.i.d. draws from their joint distribution; $X_i$ and $u_i$ have nonzero finite fourth moments; there is no perfect multicollinearity). In addition, there are two further assumptions, one of which is A) heteroskedasticity of the error term. B) serial correlation of the error term. C) homoskedasticity of the error term. D) invertibility of the matrix of regressors. Answer: C

2) The difference between the central limit theorems for a scalar and vector-valued random variables is A) that n approaches infinity in the central limit theorem for scalars only. B) the conditions on the variances. C) that single random variables can have an expected value but vectors cannot. D) the homoskedasticity assumption in the former but not the latter. Answer: B

3) The Gauss-Markov theorem for multiple regression states that the OLS estimator A) has the smallest variance possible for any linear estimator. B) is BLUE if the Gauss-Markov conditions for multiple regression hold. C) is identical to the maximum likelihood estimator. D) is the most commonly used estimator. Answer: B

4) The GLS assumptions include all of the following, with the exception of A) the $X_i$ are fixed in repeated samples. B) $X_i$ and $u_i$ have nonzero finite fourth moments. C) $E(UU'|X) = \Omega(X)$, where $\Omega(X)$ is an n×n matrix-valued function that can depend on X. D) $E(U|X) = 0_n$. Answer: A

5) The multiple regression model can be written in matrix form as follows: A) Y = Xβ. B) Y = X + U. C) Y = βX + U. D) Y = Xβ + U. Answer: D

6) The linear multiple regression model can be represented in matrix notation as Y = Xβ + U, where X is of order n×(k+1). k represents the number of A) regressors. B) observations. C) regressors excluding the “constant” regressor for the intercept. D) unknown regression coefficients.
Answer: C


7) The multiple regression model in matrix form Y = Xβ + U can also be written as A) $Y_i = \beta_0 + X_i'\beta + u_i$, i = 1,…, n. B) $Y_i = X_i'\beta_i$, i = 1,…, n. C) $Y_i = \beta X_i' + u_i$, i = 1,…, n. D) $Y_i = X_i'\beta + u_i$, i = 1,…, n. Answer: D

8) The assumption that X has full column rank implies that A) the number of observations equals the number of regressors. B) binary variables are absent from the list of regressors. C) there is no perfect multicollinearity. D) none of the regressors appear in natural logarithm form. Answer: C

9) One implication of the extended least squares assumptions in the multiple regression model is that A) feasible GLS should be used for estimation. B) $E(U|X) = I_n$. C) $X'X$ is singular. D) the conditional distribution of U given X is $N(0_n, \sigma_u^2 I_n)$. Answer: D

10) One of the properties of the OLS estimator is A) $X\hat\beta = 0_{k+1}$. B) that the coefficient vector β has full rank. C) $X'(Y - X\hat\beta) = 0_{k+1}$. D) $(X'X)^{-1} = X'Y$. Answer: C

11) Minimization of $\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \ldots - b_k X_{ki})^2$ results in A) $X'Y = X\hat\beta$. B) $X\hat\beta = 0_{k+1}$. C) $X'(Y - X\hat\beta) = 0_{k+1}$. D) $R\beta = r$. Answer: C

12) The Gauss-Markov theorem for multiple regression proves that A) $M_X$ is an idempotent matrix. B) the OLS estimator is BLUE. C) the OLS residuals and predicted values are orthogonal. D) the variance-covariance matrix of the OLS estimator is $\sigma_u^2 (X'X)^{-1}$. Answer: B
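The normal equations $X'(Y - X\hat\beta) = 0_{k+1}$ that appear in questions 10 and 11 can be verified on a small simulated data set. The following is a minimal sketch using only plain Python lists; the helper functions `transpose`, `matmul`, and `solve`, as well as the simulated design with k = 2 regressors, are assumptions made for illustration and not a recommended way to do matrix algebra in practice.

```python
import random


def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a z = b for a square matrix a by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


rng = random.Random(11)
n, k = 200, 2
beta = [1.0, 2.0, -0.5]  # assumed true coefficients (intercept first)

# X is n x (k+1): a column of ones followed by k regressors.
X = [[1.0] + [rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
Y = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, 1.0) for row in X]

Xt = transpose(X)
XtX = matmul(Xt, X)
XtY = [sum(x * y for x, y in zip(col, Y)) for col in Xt]

# OLS: beta_hat solves (X'X) beta_hat = X'Y.
beta_hat = solve(XtX, XtY)
residuals = [y - sum(b * x for b, x in zip(beta_hat, row)) for row, y in zip(X, Y)]

# Each of the k+1 entries of X'(Y - X beta_hat) should be zero up to rounding.
normal_equations = [sum(x * e for x, e in zip(col, residuals)) for col in Xt]
print(beta_hat)
print(normal_equations)
```

The first entry of `normal_equations` is the sum of the residuals, so this check also shows that OLS residuals sum to zero when a constant is included.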


13) The GLS estimator is defined as A) $(X'\Omega^{-1}X)^{-1}(X'\Omega^{-1}Y)$. B) $(X'X)^{-1}X'Y$. C) $A'Y$. D) $(X'X)^{-1}X'U$. Answer: A

14) The OLS estimator A) has the multivariate normal asymptotic distribution in large samples. B) is t-distributed. C) has the multivariate normal distribution regardless of the sample size. D) is F-distributed. Answer: A

15) $\hat\beta - \beta$ A) cannot be calculated since the population parameter is unknown. B) $= (X'X)^{-1}X'U$. C) $= Y - \hat Y$. D) $= \beta + (X'X)^{-1}X'U$. Answer: B

16) The heteroskedasticity-robust estimator of $\Sigma_{\sqrt{n}(\hat\beta - \beta)}$ is obtained A) from $(X'X)^{-1}X'U$. B) by replacing the population moments in its definition by the identity matrix. C) from feasible GLS estimation. D) by replacing the population moments in its definition by sample moments. Answer: D

17) A joint hypothesis that is linear in the coefficients and imposes a number of restrictions can be written as A) $(X'X)^{-1}X'Y$. B) $R\beta = r$. C) $\hat\beta - \beta$. D) $R\beta = 0$. Answer: B

18) Let there be q joint hypotheses to be tested. Then the dimension of r in the expression $R\beta = r$ is A) q × 1. B) q × (k+1). C) (k+1) × 1. D) q. Answer: A

19) The formulation $R\beta = r$ to test a hypothesis A) allows for restrictions involving both multiple regression coefficients and single regression coefficients. B) is F-distributed in large samples. C) allows only for restrictions involving multiple regression coefficients. D) allows for testing linear as well as nonlinear hypotheses. Answer: A


20) Let $P_X = X(X'X)^{-1}X'$ and $M_X = I_n - P_X$. Then $M_X M_X$ = A) $X(X'X)^{-1}X' - P_X$. B) $M_X^2$. C) $I_n$. D) $M_X$. Answer: D

21) In the case when the errors are homoskedastic and normally distributed, conditional on X, then
A) $\hat\beta$ is distributed $N(\beta, \Sigma_{\hat\beta|X})$, where $\Sigma_{\hat\beta|X} = \sigma_u^2 I_{k+1}$.
B) $\hat\beta$ is distributed $N(\beta, \Sigma_{\hat\beta})$, where $\Sigma_{\hat\beta} = \Sigma_{\sqrt{n}(\hat\beta - \beta)}/n = Q_X^{-1}\Sigma_V Q_X^{-1}/n$.
C) $\hat\beta$ is distributed $N(\beta, \Sigma_{\hat\beta|X})$, where $\Sigma_{\hat\beta|X} = \sigma_u^2 (X'X)^{-1}$.
D) $\hat U = P_X Y$ where $P_X = X(X'X)^{-1}X'$.
Answer: C

22) An estimator of β is said to be linear if A) it can be estimated by least squares. B) it is a linear function of $Y_1, \ldots, Y_n$. C) there are homoskedasticity-only errors. D) it is a linear function of $X_1, \ldots, X_n$. Answer: B

23) The leading example of sampling schemes in econometrics that do not result in independent observations is A) cross-sectional data. B) experimental data. C) the Current Population Survey. D) when the data are sampled over time for the same entity. Answer: D

24) The presence of correlated error terms creates problems for inference based on OLS. These can be overcome by A) using HAC standard errors. B) using heteroskedasticity-robust standard errors. C) reordering the observations until the correlation disappears. D) using homoskedasticity-only standard errors. Answer: A

25) The GLS estimator A) is always the more efficient estimator when compared to OLS. B) is the OLS estimator of the coefficients in a transformed model, where the errors of the transformed model satisfy the Gauss-Markov conditions. C) cannot handle binary variables, since some of the transformations require division by one of the regressors. D) produces identical estimates for the coefficients, but different standard errors. Answer: B


26) The extended least squares assumptions in the multiple regression model include four assumptions from Chapter 6 ($u_i$ has conditional mean zero; $(X_i, Y_i)$, i = 1,…, n are i.i.d. draws from their joint distribution; $X_i$ and $u_i$ have nonzero finite fourth moments; there is no perfect multicollinearity). In addition, there are two further assumptions, one of which is A) heteroskedasticity of the error term. B) serial correlation of the error term. C) the conditional distribution of $u_i$ given $X_i$ is normal. D) invertibility of the matrix of regressors. Answer: C

27) The OLS estimator for the multiple regression model in matrix form is A) $(X'X)^{-1}X'Y$ B) $X(X'X)^{-1}X' - P_X$ C) $(X'X)^{-1}X'U$ D) $(X\Omega^{-1}X)^{-1}X\Omega^{-1}Y$ Answer: A

28) To prove that the OLS estimator is BLUE requires the following assumption A) $(X_i, Y_i)$, i = 1, …, n are i.i.d. draws from their joint distribution B) $X_i$ and $u_i$ have nonzero finite fourth moments C) the conditional distribution of $u_i$ given $X_i$ is normal D) none of the above Answer: D

29) The TSLS estimator is A) $(X'X)^{-1}X'Y$ B) $(X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$ C) $(X\Omega^{-1}X)^{-1}(X\Omega^{-1}Y)$ D) $(X'P_Z)^{-1}P_Z Y$ Answer: B

30) The homoskedasticity-only F-statistic is
A) $\dfrac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)/q}{s_{\hat u}^2}$
B) $\dfrac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{s_{\hat u}^2}$
C) $\dfrac{(R\hat\beta - r)'[R\hat\Sigma_{\hat\beta}R']^{-1}(R\hat\beta - r)}{q}$
D) $\dfrac{\hat U'P_Z\hat U}{\hat U'M_Z\hat U}$
Answer: A


18.2 Essays and Longer Questions

1) Write an essay on the difference between the OLS estimator and the GLS estimator.

Answer: Answers will vary by student, but some of the following points should be made. The multiple regression model is $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki} + u_i$, i = 1, …, n, which, in matrix form, can be written as Y = Xβ + U. The OLS estimator is derived by minimizing the squared prediction mistakes and results in the following formula: $\hat\beta = (X'X)^{-1}X'Y$. There are two GLS estimators. The infeasible GLS estimator is $\hat\beta^{GLS} = (X'\Omega^{-1}X)^{-1}(X'\Omega^{-1}Y)$. Since Ω is typically unknown, the estimator cannot be calculated, and hence its name. However, a feasible GLS estimator can be calculated if Ω is a known function of a number of parameters which can be estimated. Once these parameters have been estimated, they can then be used to calculate $\hat\Omega$, the estimator of Ω. The feasible GLS estimator is defined as $\hat\beta^{GLS} = (X'\hat\Omega^{-1}X)^{-1}(X'\hat\Omega^{-1}Y)$.

The extended least squares assumptions are:
1. $E(u_i|X_i) = 0$ ($u_i$ has conditional mean zero);
2. $(X_i, Y_i)$, i = 1, …, n are independently and identically distributed (i.i.d.) draws from their joint distribution;
3. $X_i$ and $u_i$ have nonzero finite fourth moments;
4. X has full column rank (there is no perfect multicollinearity);
5. $\mathrm{var}(u_i|X_i) = \sigma_u^2$ (homoskedasticity);
6. the conditional distribution of $u_i$ given $X_i$ is normal (normal errors).

These assumptions imply $E(U|X) = 0_n$ and $E(UU'|X) = \sigma_u^2 I_n$, the Gauss-Markov conditions for multiple regression. If these hold, then OLS is BLUE. If assumptions 5 and 6 do not hold, but assumptions 1 to 4 still hold, then OLS is consistent and asymptotically normally distributed. Small sample statistics can be derived for the case where the errors are i.i.d. and normally distributed, conditional on X.

The GLS assumptions are:
1. $E(U|X) = 0_n$;
2. $E(UU'|X) = \Omega(X)$, where $\Omega(X)$ is an n×n matrix-valued function that can depend on X;
3. $X_i$ and $u_i$ have nonzero finite fourth moments;
4. X has full column rank (there is no perfect multicollinearity).

The major differences between the two sets of assumptions relevant to the estimators themselves are that (i) GLS allows for homoskedastic errors to be serially correlated (dropping assumption 2 of the OLS list), and (ii) there is the possibility that the errors are heteroskedastic (adding assumption 2 to the GLS list). For the case of independent sampling, replacing $E(UU'|X) = \Omega(X)$ with $E(UU'|X) = \sigma_u^2 I_n$ turns the GLS estimator into the OLS estimator. In the case of the infeasible GLS estimator, the model can be transformed in such a way that the Gauss-Markov assumptions apply to the transformed model, if the four GLS assumptions hold. In that case, GLS is BLUE and therefore more efficient than the OLS estimator. This is of little practical value

since the estimator typically cannot be computed. The result also holds if an estimator of Ω exists. However, for the feasible GLS estimator to be consistent, the first GLS assumption must apply, which is much stronger than the first OLS assumption, particularly in time series applications. It is therefore possible for the OLS estimator to be consistent while the GLS estimator is not consistent. 2) Give several economic examples of how to test various joint linear hypotheses using matrix notation. Include specifications of Rβ = r where you test for (i) all coefficients other than the constant being zero, (ii) a subset of coefficients being zero, and (iii) equality of coefficients. Talk about the possible distributions involved in finding critical values for your hypotheses. Answer: Answers will vary by student. Many restrictions involve the equality of coefficients across different types of entities in cross-sections (“stability”). Using earnings functions, students may suggest testing for the presence of regional effects, as in the textbook example at the end of Chapter 5 (exercises). The textbook tested jointly for the presence of interaction effects in the student achievement example at the end of Chapter 6. Students may want to test for the equality of returns to education and on-the-job training. The panel chapter allowed for the presence of fixed effects, the presence of which can be tested for. Testing for constant returns to scale in production functions is also frequently mentioned. Consider the multiple regression model with k regressors plus the constant. Let R be of order q × (k+ 1), where q are the number of restrictions. Then to test (i) for all coefficients other than the constant to be zero, H0 : β1 = 0, β2 = 0,. . ., βk = 0 vs. H1 : βj ≠ 0, at least one j, j=1, ..., n, you have R = [0 k×1 Ik ] and r = 0 k×1 . In large samples, the test will produce the overall regression F-statistic, which has a Fk, ∞ distribution. 
In case (ii), reorder the variables so that the regressors with non-zero coefficients appear first, followed by the regressors with coefficients that are hypothesized to be zero. This leads to the following formulation:
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_{k-q} X_{k-q,i} + \beta_{k-q+1} X_{k-q+1,i} + \beta_{k-q+2} X_{k-q+2,i} + \cdots + \beta_k X_{ki} + u_i, \quad i = 1, \ldots, n, \]
with $R = [\,0_{q \times (k-q+1)} \;\; I_q\,]$ and $r = 0_{q \times 1}$. In large samples, the test will produce an F-statistic, which has an $F_{q,\infty}$ distribution. In (iii), assume that the task at hand is to test the equality of two coefficients, say $H_0: \beta_1 = \beta_2$ vs. $H_1: \beta_1 \neq \beta_2$, as in section 5.8 of the textbook. Then R = [0 1 -1 0 … 0], r = 0 and q = 1. This is a single restriction, and the F-statistic is the square of the corresponding t-statistic. Hence critical values can be found either from the $F_{1,\infty}$ distribution or, after taking the square root of the F-statistic, from the standard normal table.

3) Define the GLS estimator and discuss its properties when Ω is known. Why is this estimator sometimes called infeasible GLS? What happens when Ω is unknown? What would the Ω matrix look like for the case of independent sampling with heteroskedastic errors, where $\mathrm{var}(u_i|X_i) = c\,h(X_i) = \sigma^2 X_{1i}^2$? Since the inverse of the error variance-covariance matrix is needed to compute the GLS estimator, find $\Omega^{-1}$. The textbook shows that the original model Y = Xβ + U will be transformed into $\tilde{Y} = \tilde{X}\beta + \tilde{U}$, where $\tilde{Y} = FY$, $\tilde{X} = FX$, $\tilde{U} = FU$, and $F'F = \Omega^{-1}$. Find F in the above case, and describe what effect the transformation has on the original data.

Answer: $\hat{\beta}^{GLS} = (X'\Omega^{-1}X)^{-1}(X'\Omega^{-1}Y)$. The key point for the GLS estimator with Ω known is that Ω is used to create a transformed regression model such that the resulting error term satisfies the Gauss-Markov conditions. In that case, GLS is BLUE. However, since Ω is typically unknown, the estimator cannot be calculated, and is therefore sometimes referred to as infeasible GLS.
If Ω is unknown, then a feasible GLS estimator can be calculated if Ω is a known function of a number of parameters which can be estimated. Once the parameters have been estimated, they can be used to calculate $\hat{\Omega}$, which is the estimator of Ω. The feasible GLS estimator is then
\[ \hat{\beta}^{GLS} = (X'\hat{\Omega}^{-1}X)^{-1}(X'\hat{\Omega}^{-1}Y). \]
In the above example of heteroskedasticity,
\[ E(UU'|X) = \Omega(X) = \sigma^2 \begin{pmatrix} X_{11}^2 & 0 & \cdots & 0 \\ 0 & X_{12}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_{1n}^2 \end{pmatrix}, \quad
\Omega^{-1}(X) = \frac{1}{\sigma^2} \begin{pmatrix} \frac{1}{X_{11}^2} & 0 & \cdots & 0 \\ 0 & \frac{1}{X_{12}^2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{X_{1n}^2} \end{pmatrix}, \quad
F = \frac{1}{\sigma} \begin{pmatrix} \frac{1}{X_{11}} & 0 & \cdots & 0 \\ 0 & \frac{1}{X_{12}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{X_{1n}} \end{pmatrix}. \]
The transformation in effect divides all variables for observation i, including the constant, by $X_{1i}$ (the common factor 1/σ does not affect the estimator).

4) Consider the multiple regression model from Chapter 5, where k = 2 and the assumptions of the multiple regression model hold.
(a) Show what the X matrix and the β vector would look like in this case.
(b) Having collected data for 104 countries of the world from the Penn World Tables, you want to estimate the effect of the population growth rate ($X_{1i}$) and the saving rate ($X_{2i}$) (average investment share of GDP from 1980 to 1990) on GDP per worker (relative to the U.S.) in 1990. What are your expected signs for the regression coefficients? What is the order of (X′X) here?

(c) You are asked to find the OLS estimator for the intercept and slopes in this model using the formula $\hat{\beta} = (X'X)^{-1}X'Y$. Since you are more comfortable inverting a 2×2 matrix (the inverse of a 2×2 matrix is
\[ \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix} \text{)}, \]
you decide to write the multiple regression model in deviations-from-mean form. Show what the X matrix, the (X′X) matrix, and the X′Y matrix would look like now. (Hint: use small letters to indicate deviations from the mean, i.e., $z_i = Z_i - \bar{Z}$, and note that
\[ Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{u}_i, \qquad \bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}_1 + \hat{\beta}_2 \bar{X}_2. \]
Subtracting the second equation from the first, you get $y_i = \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \hat{u}_i$.)
(d) Show that the slope for the population growth rate is given by
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2}. \]
(e) The various sums needed to calculate the OLS estimates are given below:
\[ \sum_{i=1}^{n} y_i^2 = 8.3103; \quad \sum_{i=1}^{n} x_{1i}^2 = 0.0122; \quad \sum_{i=1}^{n} x_{2i}^2 = 0.6422; \]
\[ \sum_{i=1}^{n} y_i x_{1i} = -0.2304; \quad \sum_{i=1}^{n} y_i x_{2i} = 1.5676; \quad \sum_{i=1}^{n} x_{1i} x_{2i} = -0.0520. \]
Find the numerical values for the effect of population growth and the saving rate on per capita income and interpret these.
(f) Indicate how you would find the intercept in the above case. Is this coefficient of interest in the interpretation of the determinants of per capita income? If not, then why estimate it?

Answer: (a)
\[ X = \begin{pmatrix} 1 & X_{11} & X_{21} \\ 1 & X_{12} & X_{22} \\ \vdots & \vdots & \vdots \\ 1 & X_{1n} & X_{2n} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}. \]
(b) You would expect the population growth rate to have a negative coefficient, and the saving rate to have a positive coefficient. The order of X′X is 3×3.
(c)
\[ X = \begin{pmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{pmatrix}, \quad
X'X = \begin{pmatrix} \sum_{i=1}^{n} x_{1i}^2 & \sum_{i=1}^{n} x_{1i} x_{2i} \\ \sum_{i=1}^{n} x_{1i} x_{2i} & \sum_{i=1}^{n} x_{2i}^2 \end{pmatrix}, \quad
X'Y = \begin{pmatrix} \sum_{i=1}^{n} y_i x_{1i} \\ \sum_{i=1}^{n} y_i x_{2i} \end{pmatrix}. \]
(d)

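\[ (X'X)^{-1} = \begin{pmatrix} \sum_{i=1}^{n} x_{1i}^2 & \sum_{i=1}^{n} x_{1i} x_{2i} \\ \sum_{i=1}^{n} x_{1i} x_{2i} & \sum_{i=1}^{n} x_{2i}^2 \end{pmatrix}^{-1}
= \frac{1}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2}
\begin{pmatrix} \sum_{i=1}^{n} x_{2i}^2 & -\sum_{i=1}^{n} x_{1i} x_{2i} \\ -\sum_{i=1}^{n} x_{1i} x_{2i} & \sum_{i=1}^{n} x_{1i}^2 \end{pmatrix}. \]
Post-multiplying this expression by
\[ X'Y = \begin{pmatrix} \sum_{i=1}^{n} y_i x_{1i} \\ \sum_{i=1}^{n} y_i x_{2i} \end{pmatrix} \]
results in the two least squares estimators
\[ \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} =
\begin{pmatrix}
\dfrac{\sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{2i}^2 - \sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2} \\[2ex]
\dfrac{\sum_{i=1}^{n} y_i x_{2i} \sum_{i=1}^{n} x_{1i}^2 - \sum_{i=1}^{n} y_i x_{1i} \sum_{i=1}^{n} x_{1i} x_{2i}}{\sum_{i=1}^{n} x_{1i}^2 \sum_{i=1}^{n} x_{2i}^2 - \left( \sum_{i=1}^{n} x_{1i} x_{2i} \right)^2}
\end{pmatrix}, \]
and hence gives the formula for $\hat{\beta}_1$.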

(e)
\[ \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} =
\begin{pmatrix}
\dfrac{-0.2304 \times 0.6422 - 1.5676 \times (-0.0520)}{0.0122 \times 0.6422 - (-0.0520)^2} \\[2ex]
\dfrac{1.5676 \times 0.0122 - (-0.2304) \times (-0.0520)}{0.0122 \times 0.6422 - (-0.0520)^2}
\end{pmatrix}
= \begin{pmatrix} -12.953 \\ 1.393 \end{pmatrix}. \]
A reduction of the population growth rate by one percent increases the per capita income relative to the United States by roughly 0.13. An increase in the saving rate by ten percent increases per capita income relative to the United States by roughly 0.14.
(f) The first order condition for the OLS estimator with respect to the intercept in the case of k = 2 is
\[ \sum_{i=1}^{n} Y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} X_{1i} + \hat{\beta}_2 \sum_{i=1}^{n} X_{2i}, \]
which, after dividing by n, results in $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}_1 - \hat{\beta}_2 \bar{X}_2$. The intercept is only of interest if there are observations close to the origin, which is not the case here. If it is set to zero, then the regression is forced through the origin, instead of being allowed to choose a level.

5) In Chapter 10 of your textbook, panel data estimation was introduced. Panel data consist of observations on the same n entities at two or more time periods T. For two variables, you have $(X_{it}, Y_{it})$, i = 1,..., n and t = 1,..., T, where n could be the U.S. states. The example in Chapter 10 used annual data from 1982 to 1988 for the fatality rate and beer taxes. Estimation by OLS, in essence, involved “stacking” the data.
(a) What would the variance-covariance matrix of the errors look like in this case if you allowed for homoskedasticity-only standard errors? What is its order? Use an example of a linear regression with one regressor of 4 U.S. states and 3 time periods.
(b) Does it make sense that errors in New Hampshire, say, are uncorrelated with errors in Massachusetts during the same time period (“contemporaneously”)? Give examples why this correlation might not be zero.
(c) If this correlation was known, could you find an estimator which was more efficient than OLS?

Answer: (a) Under the extended least squares assumptions, $E(UU'|X) = \sigma_u^2 I$. In the above example of 4 U.S. states and 3 time periods, the identity matrix will be of order 12 × 12, or (nT) × (nT) in general. Specifically, $E(UU'|X) = \sigma_u^2 I_{12}$, a diagonal matrix with $\sigma_u^2$ in each of the 12 diagonal positions and zeros elsewhere.

(b) It is reasonable to assume that a shock to one state would have an effect on its neighboring state, particularly when the shock affects the larger of the two, as is the case with Massachusetts. Other examples may be Texas and Arkansas, Michigan and Indiana, California and Arizona, New York and New Jersey, etc. A negative oil price shock, which affects the demand for automobiles produced in Michigan, will have repercussions for suppliers located not only in Michigan, but also elsewhere. (c) In the case of a known variance-covariance matrix of the error terms, the GLS estimator $\hat{\beta}^{GLS} = (X'\Omega^{-1}X)^{-1}(X'\Omega^{-1}Y)$ could be used. The variance-covariance matrix would be of the form

(There is a subtle issue here for the case of a feasible GLS estimator, where the variances and covariances have to be estimated. It can be shown, in that case, that the GLS estimator does not exist unless n ≤ T, which is not the case for most panels. It is easier to see that the variance-covariance matrix is singular for n>T if the data is stacked by time period.)

18.3 Mathematical and Graphical Problems

1) Your textbook derives the OLS estimator as $\hat{\beta} = (X'X)^{-1}X'Y$.

Show that the estimator does not exist if there are fewer observations than the number of explanatory variables, including the constant. What is the rank of X′X in this case? Answer: In order for a matrix to be invertible, it must have full rank. Since X′X is of order (k + 1) × (k + 1), then in order to invert X′X , it must have rank (k+1). In the case of a product such as X′X, the rank is less than or equal to the rank of X′ or X, whichever is smaller. X is of order n × (k + 1), and assuming that there is no perfect multicollinearity, will have either rank n or rank (k+1), whichever is the smaller of the two. Hence if there are fewer observations than the number of explanatory variables (including the constant), then the rank of X will be n(< k+1), and the rank of X′X is also n( < k +1). Hence X′X does not have full rank, and therefore cannot be inverted. The OLS estimator does not exist as a result.
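The rank argument above can be checked numerically. The following is a minimal sketch in plain Python, assuming a made-up data set with n = 2 observations and k + 1 = 3 regressors; the helper names `transpose`, `matmul`, and `rank` are illustrative and not taken from the textbook.

```python
from fractions import Fraction

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rank(m):
    """Rank via Gauss-Jordan elimination with exact fractions."""
    rows = [[Fraction(v) for v in row] for row in m]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

# Hypothetical data: n = 2 observations, k + 1 = 3 regressors (constant, X1, X2).
X = [[1, 2, 5],
     [1, 3, 7]]
XtX = matmul(transpose(X), X)   # 3 x 3 matrix

rank_X = rank(X)
rank_XtX = rank(XtX)
print(rank_X, rank_XtX, len(XtX))
```

In this example X′X is 3×3 but inherits the rank of X, which cannot exceed n = 2, so X′X is singular and the OLS formula cannot be evaluated.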


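2) Assume that the data look as follows:
\[ Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad U = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}, \quad X = \begin{pmatrix} X_{11} \\ X_{12} \\ \vdots \\ X_{1n} \end{pmatrix}, \quad \text{and } \beta = (\beta_1). \]
Using the formula for the OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$, derive the formula for $\hat{\beta}_1$, the only slope in this “regression through the origin.”

Answer: In this case, $X'Y = \sum_{i=1}^{n} X_{1i} Y_i$ and $X'X = \sum_{i=1}^{n} X_{1i}^2$. Hence
\[ \hat{\beta} = (X'X)^{-1}X'Y = \frac{\sum_{i=1}^{n} X_{1i} Y_i}{\sum_{i=1}^{n} X_{1i}^2}. \]

3) Write the following three linear equations in matrix format Ax = b, where x is a 3×1 vector containing q, p, and y, A is a 3×3 matrix of coefficients, and b is a 3×1 vector of constants.
q = 5 + 3p - 2y
q = 10 - p + 10y
p = 6y

Answer:
\[ A = \begin{pmatrix} -3 & 1 & 2 \\ 1 & 1 & -10 \\ 1 & 0 & -6 \end{pmatrix}, \quad x = \begin{pmatrix} p \\ q \\ y \end{pmatrix}, \quad b = \begin{pmatrix} 5 \\ 10 \\ 0 \end{pmatrix}, \quad \text{or} \quad \begin{pmatrix} -3 & 1 & 2 \\ 1 & 1 & -10 \\ 1 & 0 & -6 \end{pmatrix} \begin{pmatrix} p \\ q \\ y \end{pmatrix} = \begin{pmatrix} 5 \\ 10 \\ 0 \end{pmatrix}. \]

4) Let
\[ Y = \begin{pmatrix} -2 \\ 3 \\ 10 \\ 2 \\ 2 \end{pmatrix} \quad \text{and} \quad X = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 3 \\ 1 & -1 \\ 1 & 2 \end{pmatrix}. \]
Find X′X, X′Y, $(X'X)^{-1}$ and finally $(X'X)^{-1}X'Y$.

Answer:
\[ X'X = \begin{pmatrix} 5 & 5 \\ 5 & 15 \end{pmatrix}, \quad X'Y = \begin{pmatrix} 15 \\ 35 \end{pmatrix}, \quad (X'X)^{-1} = \begin{pmatrix} 0.3 & -0.1 \\ -0.1 & 0.1 \end{pmatrix}, \quad \text{and} \quad (X'X)^{-1}X'Y = \begin{pmatrix} 1 \\ 2 \end{pmatrix}. \]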
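The arithmetic in problem 4 can be reproduced step by step. This sketch assumes the data are Y = (-2, 3, 10, 2, 2)′ and X with rows (1, 0), (1, 1), (1, 3), (1, -1), (1, 2), which is the reading of the problem statement that matches the stated answer; the helper `inv2` is an illustrative name for the 2×2 inverse formula.

```python
from fractions import Fraction

# Data as read from problem 4 (an assumption; see the lead-in).
Y = [-2, 3, 10, 2, 2]
X = [[1, 0], [1, 1], [1, 3], [1, -1], [1, 2]]

# X'X is 2x2 and X'Y is 2x1.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(2)]

def inv2(m):
    """Inverse of a 2x2 matrix: (1 / (ad - bc)) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

XtX_inv = inv2(XtX)
beta_hat = [sum(XtX_inv[i][j] * XtY[j] for j in range(2)) for i in range(2)]

print(XtX, XtY)
print([[float(v) for v in row] for row in XtX_inv])
print([float(b) for b in beta_hat])
```

Exact fractions are used so that the inverse can be compared with the answer key without floating-point noise.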


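5) Let
\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}, \quad \text{and} \quad C = \begin{pmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \end{pmatrix}. \]
Show that (A+B)′ = A′ + B′ and (AC)′ = C′A′.

Answer:
\[ (A+B) = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{pmatrix}, \quad (A+B)' = \begin{pmatrix} a_{11} + b_{11} & a_{21} + b_{21} \\ a_{12} + b_{12} & a_{22} + b_{22} \end{pmatrix}, \]
\[ A' = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}, \quad B' = \begin{pmatrix} b_{11} & b_{21} \\ b_{12} & b_{22} \end{pmatrix}, \quad A' + B' = \begin{pmatrix} a_{11} + b_{11} & a_{21} + b_{21} \\ a_{12} + b_{12} & a_{22} + b_{22} \end{pmatrix}. \]
\[ (AC) = \begin{pmatrix} a_{11}c_{11} + a_{12}c_{21} & a_{11}c_{12} + a_{12}c_{22} & a_{11}c_{13} + a_{12}c_{23} \\ a_{21}c_{11} + a_{22}c_{21} & a_{21}c_{12} + a_{22}c_{22} & a_{21}c_{13} + a_{22}c_{23} \end{pmatrix}, \]
\[ (AC)' = \begin{pmatrix} a_{11}c_{11} + a_{12}c_{21} & a_{21}c_{11} + a_{22}c_{21} \\ a_{11}c_{12} + a_{12}c_{22} & a_{21}c_{12} + a_{22}c_{22} \\ a_{11}c_{13} + a_{12}c_{23} & a_{21}c_{13} + a_{22}c_{23} \end{pmatrix}, \quad C' = \begin{pmatrix} c_{11} & c_{21} \\ c_{12} & c_{22} \\ c_{13} & c_{23} \end{pmatrix}, \quad A' = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}, \]
\[ C'A' = \begin{pmatrix} a_{11}c_{11} + a_{12}c_{21} & a_{21}c_{11} + a_{22}c_{21} \\ a_{11}c_{12} + a_{12}c_{22} & a_{21}c_{12} + a_{22}c_{22} \\ a_{11}c_{13} + a_{12}c_{23} & a_{21}c_{13} + a_{22}c_{23} \end{pmatrix}. \]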
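The two transpose rules can also be spot-checked with numbers. The sketch below uses arbitrary, made-up integer entries for A, B, and C with the same dimensions as in problem 5; a numerical check for one set of values illustrates the identities but does not replace the algebraic proof.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical entries: A and B are 2x2, C is 2x3.
A = [[1, 2], [3, 4]]
B = [[0, -1], [5, 2]]
C = [[1, 0, 2], [-1, 3, 1]]

lhs_sum = transpose(add(A, B))                   # (A + B)'
rhs_sum = add(transpose(A), transpose(B))        # A' + B'
lhs_prod = transpose(matmul(A, C))               # (AC)'
rhs_prod = matmul(transpose(C), transpose(A))    # C'A'

print(lhs_sum == rhs_sum, lhs_prod == rhs_prod)
```

Note that A′C′ is not even conformable here (2×2 times 3×2), which is why the order must be reversed in (AC)′ = C′A′.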

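6) Write the following four restrictions in the form Rβ = r, where the hypotheses are to be tested simultaneously:
\[ \beta_3 = 2\beta_5, \quad \beta_1 + \beta_2 = 1, \quad \beta_4 = 0, \quad \beta_2 = -\beta_6. \]
Can you write the restriction $\beta_2 = -\dfrac{\beta_3}{\beta_1}$ in the same format? Why not?

Answer:
\[ \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & -2 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \\ \beta_6 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \end{pmatrix}. \]
The restriction $\beta_2 = -\dfrac{\beta_3}{\beta_1}$ cannot be written in the same format because it is nonlinear.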
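To see how the rows of R encode the four restrictions, the following sketch multiplies R by a made-up coefficient vector that satisfies all of them, and by one that violates β4 = 0. The vectors `beta_ok` and `beta_bad` are hypothetical values chosen only for illustration.

```python
# R is 4 x 7: one row per restriction, one column per coefficient beta_0..beta_6.
R = [
    [0, 0, 0, 1, 0, -2, 0],   # beta_3 - 2*beta_5 = 0
    [0, 1, 1, 0, 0, 0, 0],    # beta_1 + beta_2   = 1
    [0, 0, 0, 0, 1, 0, 0],    # beta_4            = 0
    [0, 0, 1, 0, 0, 0, 1],    # beta_2 + beta_6   = 0
]
r = [0, 1, 0, 0]

def apply(R, beta):
    return [sum(c * b for c, b in zip(row, beta)) for row in R]

# Hypothetical coefficient vectors (beta_0, ..., beta_6).
beta_ok = [7, 3, -2, 4, 0, 2, 2]    # satisfies all four restrictions
beta_bad = [7, 3, -2, 4, 1, 2, 2]   # violates beta_4 = 0

R_beta_ok = apply(R, beta_ok)
R_beta_bad = apply(R, beta_bad)
print(R_beta_ok == r, R_beta_bad == r)
```

Each restriction is linear in the coefficients, which is what makes a fixed matrix R possible; a ratio such as β3/β1 has no such representation.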

7) Using the model Y = Xβ + U, and the extended least squares assumptions, derive the OLS estimator $\hat{\beta}$. Discuss the conditions under which X′X is invertible.

Answer: The derivation copies the relevant parts of section 16.1 of the textbook. The model is Y = Xβ + U, where
\[ Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad U = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{k1} \\ 1 & X_{12} & \cdots & X_{k2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{1n} & \cdots & X_{kn} \end{pmatrix}, \quad \text{and} \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}. \]
Y is the n×1 dimensional vector of n observations on the dependent variable, X is the n×(k+1) dimensional matrix of n observations on the k+1 regressors (including the “constant” regressor for the intercept), U is the n×1 dimensional vector of the n error terms, and β is the (k+1)×1 dimensional vector of the k+1 unknown regression coefficients.

The extended least squares assumptions are: (1) $E(u_i|X_i) = 0$ ($u_i$ has conditional mean zero); (2) $(X_i, Y_i)$, i = 1, ..., n are independently and identically distributed (i.i.d.) draws from their joint distribution; (3) $X_i$ and $u_i$ have nonzero finite fourth moments; (4) X has full column rank (there is no perfect multicollinearity); (5) $\mathrm{var}(u_i|X_i) = \sigma_u^2$ (homoskedasticity); (6) the conditional distribution of $u_i$ given $X_i$ is normal (normal errors).

The OLS estimator minimizes the sum of squared prediction mistakes,
\[ \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2. \]
The derivative of the sum of squared prediction mistakes with respect to the jth regression coefficient, $b_j$, is
\[ \frac{\partial}{\partial b_j} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2 = -2 \sum_{i=1}^{n} X_{ji} (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki}) \]
for j = 0, ..., k, where, for j = 0, $X_{0i} = 1$ for all i. The formula for the OLS estimator is obtained by taking the derivative of the sum of squared prediction mistakes with respect to each element of the coefficient vector, setting these derivatives to zero, and solving for the estimator $\hat{\beta}$. The derivative on the right-hand side of the above equation is the jth element of the (k+1)-dimensional vector -2X′(Y - Xb), where b is the (k+1)-dimensional vector consisting of $b_0, \ldots, b_k$. There are k+1 such derivatives, each corresponding to an element of b. Set to zero, they form the system of k+1 first order conditions that define the OLS estimator $\hat{\beta}$. That is, $\hat{\beta}$ solves the system of k+1 equations
\[ X'(Y - X\hat{\beta}) = 0_{k+1}, \]
or, equivalently, $X'Y = X'X\hat{\beta}$. Solving this system of equations yields the OLS estimator in matrix form:
\[ \hat{\beta} = (X'X)^{-1}X'Y, \]
where $(X'X)^{-1}$ is the inverse of the matrix X′X. X′X is invertible as long as it has full rank. This requires that there are at least as many observations as regressors (including the constant), and that there is no perfect multicollinearity among the regressors.
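The first order conditions X′(Y - Xβ̂) = 0 can be verified directly once β̂ has been computed. The sketch below assumes a made-up data set with a constant and a single regressor, and solves the two normal equations with Cramer's rule rather than a general matrix inverse; both choices are simplifications for illustration.

```python
from fractions import Fraction

# Hypothetical data: a constant and one regressor, four observations.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
Y = [1, 3, 2, 5]

# Build the normal equations X'X b = X'Y.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(2)]

# Solve the 2x2 system exactly with Cramer's rule.
det = Fraction(XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0])
beta_hat = [
    (XtY[0] * XtX[1][1] - XtX[0][1] * XtY[1]) / det,
    (XtX[0][0] * XtY[1] - XtX[1][0] * XtY[0]) / det,
]

# Residuals and the first order conditions X'(Y - X beta_hat).
residuals = [y - sum(b * x for b, x in zip(beta_hat, row)) for row, y in zip(X, Y)]
foc = [sum(row[j] * e for row, e in zip(X, residuals)) for j in range(2)]

print([float(b) for b in beta_hat], foc)
```

The first element of `foc` is the sum of the residuals, which is zero because the regression includes a constant.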


8) Prove that under the extended least squares assumptions the OLS estimator $\hat{\beta}$ is unbiased and that its variance-covariance matrix is $\sigma_u^2 (X'X)^{-1}$.

Answer: Start the proof by relating the OLS estimator to the errors:
\[ \hat{\beta} = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + U) = \beta + (X'X)^{-1}X'U. \]
To prove the unbiasedness of the OLS estimator, take the conditional expectation of both sides of the expression:
\[ E(\hat{\beta}|X) = \beta + E[(X'X)^{-1}X'U|X] = \beta + (X'X)^{-1}X'E(U|X). \]
Since $E(U|X) = 0$ (from extended least squares assumptions 1 and 2),
\[ E(\hat{\beta}|X) = \beta. \]
To find the variance-covariance matrix $\mathrm{var}(\hat{\beta}|X) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'|X]$, we have
\[ E[(X'X)^{-1}X'UU'X(X'X)^{-1}|X] = (X'X)^{-1}X'E(UU'|X)X(X'X)^{-1}, \]
and since $E(UU'|X) = \sigma_u^2 I_n$ follows from the extended least squares assumptions 1, 2, and 5,
\[ \mathrm{var}(\hat{\beta}|X) = \sigma_u^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma_u^2 (X'X)^{-1}. \]

9) For the OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$ to exist, X′X must be invertible. This is the case when X has full rank. What is the rank of a matrix? What is the rank of the product of two matrices? Is it possible that X could have rank n? What would be the rank of X′X in the case n<(k+1)? Explain intuitively why the OLS estimator does not exist in that situation.

Answer: The rank of a matrix is the maximum number of linearly independent rows or columns. In general, in the case of a rectangular matrix, the maximum number of linearly independent columns is also equal to the maximum number of linearly independent rows. In the case of X, it can be, at most, either n or (k+1), whichever is smaller. The rank of the product of two matrices will be, at most, the minimum of the ranks of the two matrices of the product. In the case of X′X, both matrices will have, at most, either rank n or (k+1), whichever is smaller. Since X′X is a square matrix of order (k+1)×(k+1), it must have full rank in order to be invertible. In the absence of perfect multicollinearity, the rank will be (k+1) as long as (k+1) ≤ n, which includes the special case where there are exactly as many observations as regressors (including the constant). If there are fewer observations than regressors (including the constant), then X has rank n, so the rank of X′X is at most n < (k+1). X′X then does not have full rank and cannot be inverted. Intuitively, you have to have as many independent equations as there are unknowns to find a unique solution. This is not the case when you have n<(k+1).


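10) In order for a matrix A to have an inverse, its determinant cannot be zero. Derive the determinant of the following matrices:
\[ A = \begin{pmatrix} 3 & 6 \\ -2 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & -1 & 2 \\ 1 & 0 & 3 \\ 4 & 0 & 2 \end{pmatrix}, \quad X'X \text{ where } X = (1 \;\; 10). \]
Answer: det(A) = 15, det(B) = -10, det(X′X) = 0.

11) Your textbook shows that the matrix $M_X = I_n - P_X$ is a symmetric idempotent matrix. Consider a different matrix A, which is defined as follows: $A = I - \frac{1}{n}\iota\iota'$ and $\iota = (1, 1, \ldots, 1)'$.
a. Show what the elements of A look like.
b. Show that A is a symmetric idempotent matrix.
c. Show that Aι = 0.
d. Show that $A\hat{U} = \hat{U}$, where $\hat{U}$ is the vector of OLS residuals from a multiple regression.

Answer: a.
\[ A = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} - \frac{1}{n} \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} 1 - 1/n & -1/n & \cdots & -1/n \\ -1/n & 1 - 1/n & \cdots & -1/n \\ \vdots & \vdots & \ddots & \vdots \\ -1/n & -1/n & \cdots & 1 - 1/n \end{pmatrix}. \]
b. Transposing A interchanges the off-diagonal elements, which are all equal to -1/n, and leaves the diagonal unchanged, so A′ = A. Furthermore,
\[ A \times A = \left(I - \frac{1}{n}\iota\iota'\right) \times \left(I - \frac{1}{n}\iota\iota'\right) = I - \frac{1}{n}\iota\iota' - \frac{1}{n}\iota\iota' + \frac{1}{n^2}\iota\iota'\iota\iota'. \]
But
\[ \iota\iota'\iota\iota' = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} \times \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} n & \cdots & n \\ \vdots & \ddots & \vdots \\ n & \cdots & n \end{pmatrix}, \quad \text{and} \quad \frac{1}{n^2} \begin{pmatrix} n & \cdots & n \\ \vdots & \ddots & \vdots \\ n & \cdots & n \end{pmatrix} = \frac{1}{n}\iota\iota'. \]
This means that the last two terms in the above equation cancel each other, and therefore A×A = A, that is, A is idempotent.
c. $A\iota = \left(I - \frac{1}{n}\iota\iota'\right)\iota = \iota - \frac{1}{n}\iota\iota'\iota = \iota - \iota = 0$, since $\iota'\iota = n$.
d. $A\hat{U} = \left(I - \frac{1}{n}\iota\iota'\right)\hat{U} = \hat{U} - \frac{1}{n}\iota\iota'\hat{U} = \hat{U}$, since $\iota'\hat{U} = 0$.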
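The properties of A in parts a to c can be confirmed for a specific n. This is a small sketch assuming n = 4 (an arbitrary choice) and using exact fractions so that the equality checks are not affected by rounding.

```python
from fractions import Fraction

n = 4  # hypothetical sample size
# A = I - (1/n) * iota * iota': 1 - 1/n on the diagonal, -1/n elsewhere.
A = [[(1 if i == j else 0) - Fraction(1, n) for j in range(n)] for i in range(n)]
iota = [1] * n

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))
idempotent = matmul(A, A) == A
A_iota = [sum(A[i][j] * iota[j] for j in range(n)) for i in range(n)]

print(symmetric, idempotent, A_iota)
```

Part d follows the same pattern: any vector whose elements sum to zero, such as OLS residuals from a regression with a constant, is left unchanged by A.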


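12) Write down, in general, the variance-covariance matrix for the multiple regression error term U. Using the assumptions $\mathrm{cov}(u_i, u_j|X_i, X_j) = 0$ for i ≠ j and $\mathrm{var}(u_i|X_i) = \sigma_u^2$, show that the variance-covariance matrix can be written as $\sigma_u^2 I_n$.

Answer:
\[ \mathrm{var}(U|X) = E\left[ \begin{pmatrix} u_1 - E(u_1) \\ u_2 - E(u_2) \\ \vdots \\ u_n - E(u_n) \end{pmatrix} \begin{pmatrix} u_1 - E(u_1) \\ u_2 - E(u_2) \\ \vdots \\ u_n - E(u_n) \end{pmatrix}' \,\Bigg|\, X \right]
= E\left[ \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix} \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix} \Bigg|\, X \right] \]
\[ = E\left[ \begin{pmatrix} u_1^2 & u_1 u_2 & \cdots & u_1 u_n \\ u_2 u_1 & u_2^2 & \cdots & u_2 u_n \\ \vdots & \vdots & \ddots & \vdots \\ u_n u_1 & u_n u_2 & \cdots & u_n^2 \end{pmatrix} \Bigg|\, X \right]
= \begin{pmatrix} \sigma_u^2 & 0 & \cdots & 0 \\ 0 & \sigma_u^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_u^2 \end{pmatrix} = \sigma_u^2 I_n. \]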

13) Consider the following symmetric and idempotent matrix A: $A = I - \frac{1}{n}\iota\iota'$ and $\iota = (1, 1, \ldots, 1)'$.
a. Show that by premultiplying the vector Y (the LHS variable of the OLS regression) by this matrix, you convert all observations of Y into deviations from the mean.
b. Derive the expression Y′AY. What is the order of this expression? Under what other name have you encountered this expression before?

Answer: a. Note that $\frac{1}{n}\iota'Y = \bar{Y}$. Given this result, if you premultiply Y by A, you get
\[ AY = \left(I - \frac{1}{n}\iota\iota'\right)Y = Y - \iota\frac{1}{n}\iota'Y = Y - \iota\bar{Y} = \begin{pmatrix} Y_1 - \bar{Y} \\ Y_2 - \bar{Y} \\ \vdots \\ Y_n - \bar{Y} \end{pmatrix}. \]
b. Note that $Y'A'AY = Y'AAY = Y'AY = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$. This is a scalar which is called the variation in Y or the total sum of squares (TSS).
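Both parts of the answer can be illustrated with a short numerical check. The sketch assumes a made-up vector Y with four observations; it builds A explicitly, applies it to Y, and compares Y′AY with the sum of squared deviations from the mean.

```python
from fractions import Fraction

# Hypothetical observations on the dependent variable.
Y = [2, 4, 9, 5]
n = len(Y)

# Demeaning matrix A = I - (1/n) * iota * iota'.
A = [[(1 if i == j else 0) - Fraction(1, n) for j in range(n)] for i in range(n)]

# AY: each element is Y_i minus the sample mean.
AY = [sum(A[i][j] * Y[j] for j in range(n)) for i in range(n)]

# Y'AY is a scalar (order 1 x 1).
YAY = sum(y * ay for y, ay in zip(Y, AY))

# Total sum of squares computed directly for comparison.
mean = Fraction(sum(Y), n)
tss = sum((y - mean) ** 2 for y in Y)

print(AY, YAY, tss)
```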


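14) Consider the following population regression function: Y = Xβ + U, where
\[ Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \quad U = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}. \]
Given the following information on population growth rates (Y) and education (X) for 86 countries,
\[ \sum_{i=1}^{n} Y_i = 1.594, \quad \sum_{i=1}^{n} X_i = 449.6, \quad \sum_{i=1}^{n} Y_i^2 = 0.03982, \quad \sum_{i=1}^{n} X_i^2 = 3{,}022.76, \quad \sum_{i=1}^{n} X_i Y_i = 6.4697, \]
a) find X′X, X′Y, $(X'X)^{-1}$ and finally $(X'X)^{-1}X'Y$.
b) Interpret the slope and, if necessary, the intercept.

Answer: a.
\[ X'X = \begin{pmatrix} n & \sum_{i=1}^{n} X_i \\ \sum_{i=1}^{n} X_i & \sum_{i=1}^{n} X_i^2 \end{pmatrix} = \begin{pmatrix} 86 & 449.6 \\ 449.6 & 3022.76 \end{pmatrix}, \quad
X'Y = \begin{pmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_i Y_i \end{pmatrix} = \begin{pmatrix} 1.594 \\ 6.4697 \end{pmatrix}, \]
\[ (X'X)^{-1} = \frac{1}{86 \times 3022.76 - 449.6^2} \begin{pmatrix} 3022.76 & -449.6 \\ -449.6 & 86 \end{pmatrix}, \quad
(X'X)^{-1}X'Y = \begin{pmatrix} 0.0331 \\ -0.0028 \end{pmatrix}. \]
b. According to these results, five more years of education will lower population growth rates by roughly one percent.

15) You have obtained data on test scores and student-teacher ratios in region A and region B of your state. Region B, on average, has lower student-teacher ratios than region A. You decide to run the following regression
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i, \]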

where $X_1$ is the class size in region A, $X_2$ is the difference in class size between regions A and B, and $X_3$ is the class size in region B. Your regression package shows a message indicating that it cannot estimate the above equation. What is the problem here and how can it be fixed? Explain the problem in terms of the rank of the X matrix.

Answer: There is perfect multicollinearity here, in that $X_2 = X_1 - X_3$; hence the X matrix (and the X′X matrix) does not have full rank (rank = 3 here, not 4). If X′X is singular, you cannot invert it, since its determinant is zero. Dropping one of the three explanatory variables allows you to estimate the above equation.
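The singularity described in the answer can be demonstrated with exact integer arithmetic. The sketch below uses invented class sizes for five observations, constructs X2 = X1 - X3, and compares the determinant of X′X with and without the redundant regressor; `det` and `gram` are illustrative helper names.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
        for j in range(len(m))
    )

def gram(X):
    """Return X'X for a matrix X stored as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

# Hypothetical class sizes in regions A (x1) and B (x3) for five observations.
x1 = [20, 22, 25, 19, 24]
x3 = [18, 17, 21, 16, 20]
x2 = [a - b for a, b in zip(x1, x3)]   # exact linear combination of x1 and x3

X_full = [[1, a, b, c] for a, b, c in zip(x1, x2, x3)]   # constant, X1, X2, X3
X_reduced = [[1, a, c] for a, c in zip(x1, x3)]          # X2 dropped

det_full = det(gram(X_full))
det_reduced = det(gram(X_reduced))
print(det_full, det_reduced)
```

With integer data the determinant of the full X′X is exactly zero; with floating-point data a regression package would typically see a value very close to zero and report a collinearity problem.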
