Study Guide: Experimental Design
Joshuah Touyz
© Draft date November 28, 2009
Contents

1 Analysis of Variance (ANOVA)
  1.1 Introduction
  1.2 ANOVA-balanced, no blocking
2 ANOVA-unbalanced, no blocking
  2.1 Introduction
  2.2 Unbalanced designs
3 ANOVA and Randomized Block Designs
  3.1 Introduction
  3.2 ANOVA-Blocking
  3.3 Testing the hypothesis
  3.4 Putting it all together: an oil example
  3.5 Some important notes
  3.6 Factorial Treatment and Interactions
  3.7 Introduction
  3.8 2^2 Factorial Design
  3.9 2^3 Factorial Design
  3.10 Notes
4 Sampling Introduction
  4.1 Introduction
  4.2 Terminology
    4.2.1 Some important notes on sampling
    4.2.2 Populations and their meaning
    4.2.3 Sampling protocols
  4.3 Error
  4.4 Some important notes
5 Probability sampling
  5.1 Introduction
  5.2 Sampling Protocols
  5.3 Simple Random Sampling-SRS
    5.3.1 Sample Size Determination
6 Ratio and Regression Estimation with SRS
  6.1 Introduction
  6.2 Estimating a Ratio
    6.2.1 Ratio estimation of the average
    6.2.2 Regression Estimation of the Average
Preface

The notes herewith are an attempt to provide some structure, in electronic form, to the University of Waterloo's actuarial science courses Stat 322/332 (Experimental Design). They aim to be a comprehensive study pack for the concepts presented in class and will hopefully bring an increased level of clarity to students who may otherwise find the concepts difficult.
Chapter 1 Analysis of Variance (ANOVA)

1.1 Introduction
In this section we consider the statistical technique analysis of variance (abbreviated ANOVA). It is introduced under two different experimental designs: the first considers ANOVA in the absence of blocks, whereas the second considers ANOVA in the presence of blocking. We further discuss cases in which an experiment is balanced or unbalanced (unbalanced means the number of observations varies between groups). ANOVA seeks to answer the question "are there any differences between sampled groups?" This differs from contrasts, which compare specific groups within a sample.
1.2 ANOVA-balanced, no blocking
For a balanced and unblocked design the ANOVA hypothesis may be written as $\tau_1 = \dots = \tau_t = 0$, where t is the number of treatments. The hypothesis posits that there are no differences amongst the treatments of the sample, or equivalently that there is no difference among the treatment means. The idea behind ANOVA is to estimate the residual variance in two ways (one using the treatments and the other using the residuals) and to compare the two estimates. If they are close to one another the ratio between them will be close to 1 and there will be no evidence against the hypothesis. The two estimates are developed as follows:
Estimate 1: Using the least squares residuals we have:
$$\hat{\sigma}^2 = \frac{\sum_{ij}(y_{ij} - \hat{\mu} - \hat{\tau}_i)^2}{rt - (1 + t) + 1}
= \frac{\sum_{ij}(y_{ij} - \bar{y}_{i+})^2}{t(r-1)} \quad \text{since } \hat{\tau}_i = \bar{y}_{i+} - \bar{y}_{++},\ \hat{\mu} = \bar{y}_{++}
\qquad = \frac{\sum_{ij}\hat{r}_{ij}^2}{t(r-1)} \quad \text{since } \hat{r}_{ij} = y_{ij} - \hat{\mu} - \hat{\tau}_i$$
This estimate for $\sigma^2$ is valid whether or not the hypothesis is true; however, it is only as correct as the overall model.

Estimate 2: Using least squares and assuming the hypothesis is true, we have $\hat{\tau}_i = \bar{y}_{i+} - \bar{y}_{++}$ where $\bar{y}_{++} = \hat{\mu} = \sum_i \bar{y}_{i+}/t$ and t is the number of treatments. Treating $\tau_i$ as random, $\tilde{\tau}_i = \bar{Y}_{i+} - \bar{Y}_{++}$, which implies that $\bar{Y}_{i+} \sim N(\mu, \sigma^2/r)$, where r is the number of observations in a treatment group. Note that here $\sigma^2$ does not have a hat; this differentiates the treatment-based and residual-based estimates of the variance. $\sigma^2$ may further be rewritten as:
$$\frac{\sigma^2}{r} = \frac{\sum_i \hat{\tau}_i^2}{t-1} = \frac{\sum_i (\bar{y}_{i+} - \bar{y}_{++})^2}{t-1}
\quad\text{multiplying across by } r \Rightarrow\quad
\sigma^2 = \frac{r\sum_i \hat{\tau}_i^2}{t-1} = \frac{\sum_i r(\bar{y}_{i+} - \bar{y}_{++})^2}{t-1}$$
Several important points come up when looking at the equation above:
• The $t-1$ under the treatment sum of squares is the [number of treatment averages − the number of parameters] (there are t treatment averages and 1 parameter, $\mu$).
• $\sum_i r(\bar{y}_{i+} - \bar{y}_{++})^2/(t-1)$ is an estimate of $\sigma^2$ when the hypothesis is true.
• The total sum of squares is the sum of ss(treatment) and ss(residual) (equivalently $(t-1)$ times the treatment mean square plus $t(r-1)\hat{\sigma}^2$); the proof is shown below.
Proof: Suppose we have a balanced unblocked experiment where the data are represented by $y_{ij}$, $i = 1,\dots,t$, $j = 1,\dots,r$. Then:
$$\sum_{ij}(y_{ij} - \bar{y}_{++})^2 = \sum_{ij}\big((y_{ij} - \bar{y}_{i+}) + (\bar{y}_{i+} - \bar{y}_{++})\big)^2$$
$$= \sum_{ij}\big((y_{ij} - \bar{y}_{i+})^2 + 2(y_{ij} - \bar{y}_{i+})(\bar{y}_{i+} - \bar{y}_{++}) + (\bar{y}_{i+} - \bar{y}_{++})^2\big)$$
$$= \sum_{ij}(y_{ij} - \bar{y}_{i+})^2 + 2\sum_{i}(\bar{y}_{i+} - \bar{y}_{++})\sum_{j}(y_{ij} - \bar{y}_{i+}) + \sum_{i} r(\bar{y}_{i+} - \bar{y}_{++})^2$$
$$= \sum_{ij}(y_{ij} - \bar{y}_{i+})^2 + 2\sum_{i}(\bar{y}_{i+} - \bar{y}_{++})(0) + \sum_{i} r(\bar{y}_{i+} - \bar{y}_{++})^2$$
$$= \sum_{ij}(y_{ij} - \bar{y}_{i+})^2 + \sum_{i} r(\bar{y}_{i+} - \bar{y}_{++})^2$$
Taking the expected value of the treatment mean square leads to the following:
$$E\!\left[\frac{\sum_{i=1}^{t} r(\bar{Y}_{i+} - \bar{Y}_{++})^2}{t-1}\right]
= E\!\left[\frac{\sum_{i=1}^{t}\big(r\bar{Y}_{i+}^2 - 2r\bar{Y}_{i+}\bar{Y}_{++}\big) + rt\bar{Y}_{++}^2}{t-1}\right]
= E\!\left[\frac{r\sum_{i=1}^{t}\bar{Y}_{i+}^2 - rt\bar{Y}_{++}^2}{t-1}\right]$$
$$= \frac{r\sum_{i=1}^{t}E[\bar{Y}_{i+}^2] - rt\,E[\bar{Y}_{++}^2]}{t-1}
\qquad \text{where } \bar{Y}_{i+} \sim N\!\left(\mu + \tau_i, \tfrac{\sigma^2}{r}\right),\ \bar{Y}_{++} \sim N\!\left(\mu, \tfrac{\sigma^2}{rt}\right)$$
$$= \frac{r\sum_{i=1}^{t}\big[(\mu + \tau_i)^2 + \tfrac{\sigma^2}{r}\big] - rt\big[\mu^2 + \tfrac{\sigma^2}{rt}\big]}{t-1}
\qquad \text{since } E[X^2] = \mathrm{Var}[X] + (E[X])^2$$
$$= \frac{rt\mu^2 + 2r\mu\sum_{i=1}^{t}\tau_i + r\sum_{i=1}^{t}\tau_i^2 + t\sigma^2 - rt\mu^2 - \sigma^2}{t-1}
= \sigma^2 + \frac{r\sum_i \tau_i^2}{t-1}$$
(the middle term vanishes because $\sum_i \tau_i = 0$).

The interpretation is that when $\sum_i \tau_i^2$ is small we expect the treatment mean square $E[\sum_i r(\bar{Y}_{i+} - \bar{Y}_{++})^2/(t-1)]$ to be close to the residual variance $\sigma^2$. The larger $\sum_i \tau_i^2$, the larger the expected discrepancy measure will be. This holds whether or not the hypothesis is true. We can summarize the data in a table.
Source      | Sum of Squares                            | DoF       | Mean Square (ms)                                  | Ratio to Residual ms
Treatments  | $\sum_i r(\bar{y}_{i+}-\bar{y}_{++})^2$   | $t-1$     | $\sum_i r(\bar{y}_{i+}-\bar{y}_{++})^2/(t-1)$     | $\dfrac{t(r-1)\sum_i r(\bar{y}_{i+}-\bar{y}_{++})^2}{(t-1)\sum_{ij}(y_{ij}-\bar{y}_{i+})^2}$
Residual    | $\sum_{ij}(y_{ij}-\bar{y}_{i+})^2$        | $t(r-1)$  | $\sum_{ij}\hat{r}_{ij}^2/\big(t(r-1)\big)$        |
Total       | $\sum_{ij}(y_{ij}-\bar{y}_{++})^2$        | $rt-1$    |                                                   |

Table 1.1: ANOVA table for a balanced unblocked completely randomized design

Note that if we assume the hypothesis is true, then the ratio of treatment ms to residual ms has an F distribution (since $\bar{Y}_{i+}$ and $\bar{Y}_{++}$ are normally distributed random variables); more specifically:
$$\frac{t(r-1)\sum_{i} r(\bar{Y}_{i+} - \bar{Y}_{++})^2}{(t-1)\sum_{ij}(Y_{ij} - \bar{Y}_{i+})^2}
\;\sim\; \frac{\chi^2_{t-1}/(t-1)}{\chi^2_{t(r-1)}/\big(t(r-1)\big)} = F_{t-1,\,t(r-1)}$$
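Since the entries of Table 1.1 are purely mechanical, a short computational sketch may help. The following Python function is illustrative only (the language and the scipy routine are my choice, not something the notes prescribe); it assumes the data arrive as a t × r array, one row per treatment.

```python
import numpy as np
from scipy import stats

def balanced_anova(y):
    """y[i, j]: response for treatment i, replicate j (t treatments, r replicates each)."""
    t, r = y.shape
    grand_mean = y.mean()
    treat_means = y.mean(axis=1)

    # Sums of squares from Table 1.1
    ss_treat = r * np.sum((treat_means - grand_mean) ** 2)
    ss_resid = np.sum((y - treat_means[:, None]) ** 2)

    ms_treat = ss_treat / (t - 1)
    ms_resid = ss_resid / (t * (r - 1))

    f_obs = ms_treat / ms_resid
    p_value = stats.f.sf(f_obs, t - 1, t * (r - 1))   # Pr(F_{t-1, t(r-1)} >= f_obs)
    return ss_treat, ss_resid, f_obs, p_value
```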
Recall that the square of a standard normal random variable is chi-squared; in our case the resulting F distribution has $\{t-1,\ t(r-1)\}$ degrees of freedom. Below, two remarks are given on sample sizes and on multiple comparisons.

Remark 1.2.1 Determining the sample size is based on three considerations:
1. The question(s) being asked
2. The required precision of the conclusions
3. Costs and ethical considerations

Two important things to note about sample sizes are:
• A sample size is usually based on the required width of a confidence interval. For a two-sided comparison of treatments i and j (the contrast $\theta = \tau_i - \tau_j$) the confidence interval's width will be $2 z_{\alpha/2}\,\hat{\sigma}\sqrt{2/r}$ (where r is the number of replications under treatments i and j).
• Depending on the required level of accuracy, the multiplier $z_{\alpha/2}$ (confidence level) and r (which sets the degrees of freedom associated with $\hat{\sigma}^2$) act in opposite directions: increasing $z_{\alpha/2}$ widens the confidence interval, whereas increasing r narrows it.

Two points of caution (see the sketch below):
1. Halving the interval width requires a four-fold increase in r.
2. Ascertaining the number of replications for an experiment depends on $\hat{\sigma}$; however, $\sigma$ is unknown before the experiment. In such cases it can be inferred from similar experiments, or a small pilot study can be conducted to obtain an estimate before the larger project is undertaken.
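As a sketch of the sample-size point above: solving $2 z_{\alpha/2}\hat{\sigma}\sqrt{2/r} \le W$ for r gives the smallest number of replications per treatment for a target interval width W. The helper below is hypothetical (the function name, arguments and the pilot value of $\hat{\sigma}$ are mine, not from the notes).

```python
import math
from scipy.stats import norm

def replications_for_width(sigma_hat, width, alpha=0.05):
    """Smallest r so that the CI for tau_i - tau_j has approximate full width <= `width`."""
    z = norm.ppf(1 - alpha / 2)
    # full width of the interval is 2 * z * sigma_hat * sqrt(2 / r)
    return math.ceil(2 * (2 * z * sigma_hat / width) ** 2)

# e.g. a pilot estimate sigma_hat = 3 and a target full width of 2 units
print(replications_for_width(3.0, 2.0))   # about 70 replicates per treatment
```

Doubling the target width in this call divides the required r by four, which is the first point of caution above.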
Remark 1.2.2 Multiple comparisons are possible between different treatments via various linear combinations of treatments (contrasts). When setting up many contrasts, caution must be taken not to distort conclusions drawn from the resulting data. For example, suppose 10 hypothesis tests are conducted, each at level 0.05, so that each individual conclusion is correct 95% of the time. The probability that all 10 tests reach the correct conclusion is $0.95^{10} \approx 0.599$, which means there is roughly a 40% chance of at least one wrong conclusion. Generally, if ANOVA shows no evidence of differences between treatments, then searching for pairwise differences between treatments is unnecessary.
Chapter 2 ANOVA-unbalanced, no blocking

2.1 Introduction
In this section we consider the situation where not all the treatments have the same number of replications, yet we wish to test whether there are differences amongst treatments. Often experiments start out balanced but, for one reason or another, units are lost. Consider a situation where 2 car battery types (battery A and battery B), each with 4 replications, are tested over a year. At some point during the year one of the type A batteries starts to leak and has to be removed from the experiment; the experiment now has an unbalanced design, since we are left with 3 replications of battery A and 4 of battery B. The techniques developed below address how to deal with unbalanced designs in unblocked experiments.
2.2 Unbalanced designs
The model used is similar to the model for balanced designs; this time, however, treatment i has $r_i$ replications. Let $Y_{ij}$ be the response variate, $\tau_i$ the treatment effect, $\mu$ the mean and $R_{ij}$ the random error; then:
$$Y_{ij} = \mu + \tau_i + R_{ij} \qquad \text{where } i \in \{1,\dots,t\},\ j \in \{1,\dots,r_i\} \tag{2.1}$$
As before, a constraint is required on the $\tau_i$'s:
$$\sum_i r_i \tau_i = 0$$
Under this model:
• The average response for treatment i is given by $E[Y_{ij}] = \mu + \tau_i$, not $E[Y_{ij}] = \mu + r_i\tau_i$.
• The difference in treatment responses is given by $E[Y_{ij} - Y_{kj}] = E[Y_{ij}] - E[Y_{kj}] = \tau_i - \tau_k$, not $r_i\tau_i - r_k\tau_k$.
• The weighted average of the treatment means is given by:
$$\mu = \frac{\sum_i r_i E[Y_{ij}]}{\sum_i r_i} \tag{2.2}$$

That is, the constraint $\sum_i r_i\tau_i = 0$ changes the definition of the parameter $\mu$: it weights the expected values of the different treatments by their replication counts. Interestingly, this weighted quantity is not what $\hat{\mu}$ estimated in the balanced case. Since in an ideal situation the experiment would have been balanced, the estimate for $\mu$ will nevertheless be $\bar{y}_{++}$.

To estimate the parameters for an unbalanced design we apply least squares subject to the constraint $\sum_i r_i\tau_i = 0$. The objective is:
$$\min\left\{\sum_{i=1}^{t}\sum_{j=1}^{r_i}(y_{ij} - \mu - \tau_i)^2\right\} \quad \text{subject to } \sum_i r_i\tau_i = 0 \tag{2.3}$$
Using Lagrange multipliers, the constrained problem becomes:
$$W(\mu,\tau_1,\dots,\tau_t,\lambda) = \sum_{i=1}^{t}\sum_{j=1}^{r_i}(y_{ij} - \mu - \tau_i)^2 + \lambda\sum_{i=1}^{t} r_i\tau_i \tag{2.4}$$
Minimizing with respect to $\mu$, $\tau_i$ and $\lambda$, $i = 1,\dots,t$:
$$\frac{\partial W}{\partial \mu} = -2\sum_{i=1}^{t}\sum_{j=1}^{r_i}(y_{ij} - \mu - \tau_i)
= -2\left(\sum_{i=1}^{t}\sum_{j=1}^{r_i} y_{ij} - \mu\sum_{i=1}^{t} r_i - \sum_{i=1}^{t} r_i\tau_i\right)$$
Setting this equal to 0 and using the constraint gives:
$$0 = \sum_{i=1}^{t}\sum_{j=1}^{r_i} y_{ij} - \hat{\mu}\sum_{i=1}^{t} r_i
\quad\Rightarrow\quad \hat{\mu} = \frac{\sum_{i=1}^{t}\sum_{j=1}^{r_i} y_{ij}}{\sum_{i=1}^{t} r_i} = \bar{y}_{++}$$
Notice that $\hat{\mu} \ne E[Y_{ij}]$ in general (since $E[Y_{ij}] = \mu + \tau_i$). Now to find $\hat{\tau}_i$ and $\hat{\lambda}$:
$$\frac{\partial W}{\partial \tau_i} = -2\sum_{j=1}^{r_i}(y_{ij} - \mu - \tau_i) + \lambda r_i
\qquad\qquad
\frac{\partial W}{\partial \lambda} = \sum_{i=1}^{t} r_i\tau_i$$
Setting both equations equal to 0 gives:
$$\hat{\tau}_i = \bar{y}_{i+} - \bar{y}_{++} - \frac{\hat{\lambda}}{2},
\qquad\text{but since}\qquad
0 = \sum_{i=1}^{t} r_i\hat{\tau}_i = \sum_{i=1}^{t} r_i\left(\bar{y}_{i+} - \bar{y}_{++} - \frac{\hat{\lambda}}{2}\right)
\;\Rightarrow\; \frac{\hat{\lambda}}{2} = 0$$
so $\hat{\tau}_i = \bar{y}_{i+} - \bar{y}_{++}$. Substituting the estimates into the formula for $\hat{\sigma}$ we get:
$$\hat{\sigma} = \sqrt{\frac{W(\hat{\mu},\hat{\tau}_i)}{\sum_{i=1}^{t} r_i - (t+1) + 1}}
= \sqrt{\frac{\sum_{i=1}^{t}\sum_{j=1}^{r_i}(y_{ij} - \hat{\mu} - \hat{\tau}_i)^2}{\sum_{i=1}^{t} r_i - t}}
= \sqrt{\frac{\sum_{i=1}^{t}\sum_{j=1}^{r_i}(y_{ij} - \bar{y}_{i+})^2}{\sum_{i=1}^{t}(r_i - 1)}}$$
To summarize, we get the following estimates:

Parameter | Estimate
$\mu$     | $\bar{y}_{++}$
$\tau_i$  | $\bar{y}_{i+} - \bar{y}_{++}$
$\sigma$  | $\sqrt{\sum_{i=1}^{t}\sum_{j=1}^{r_i}(y_{ij} - \bar{y}_{i+})^2 \big/ \sum_{i=1}^{t}(r_i - 1)}$

Table 2.1: Summary: ANOVA unblocked, unbalanced design parameter estimates

ANOVA in the unbalanced case is very similar to the balanced case. As before we use $\hat{\sigma}^2$ (which does not depend on the treatments) and compare it to the treatment mean square (which does depend on the treatments). The hypothesis remains $\tau_1 = \tau_2 = \dots = \tau_t = 0$.
As before we can split the total sum of squares into the sum of squares of the residuals and of the treatments:
$$\sum_{ij}(y_{ij} - \bar{y}_{++})^2 = \sum_{ij}(y_{ij} - \bar{y}_{i+})^2 + \sum_{i} r_i(\bar{y}_{i+} - \bar{y}_{++})^2$$
$$\text{ss(total)} = \text{ss(residual)} + \text{ss(treatment)}$$
Drawing up a table:

Source      | Sum of Squares                              | DoF               | Mean Square (ms)                                   | Ratio to Residual ms
Treatments  | $\sum_i r_i(\bar{y}_{i+}-\bar{y}_{++})^2$   | $t-1$             | $\sum_i r_i(\bar{y}_{i+}-\bar{y}_{++})^2/(t-1)$    | $\dfrac{\sum_i(r_i-1)\,\sum_i r_i(\bar{y}_{i+}-\bar{y}_{++})^2}{(t-1)\sum_{ij}(y_{ij}-\bar{y}_{i+})^2}$
Residual    | $\sum_{ij}(y_{ij}-\bar{y}_{i+})^2$          | $\sum_i(r_i-1)$   | $\sum_{ij}\hat{r}_{ij}^2\big/\sum_i(r_i-1)$        |
Total       | $\sum_{ij}(y_{ij}-\bar{y}_{++})^2$          | $\sum_i r_i - 1$  |                                                    |

Table 2.2: ANOVA table for an unbalanced unblocked completely randomized design
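The quantities in Table 2.2 can be assembled directly from the group data. The sketch below (Python, chosen purely for illustration; the function name is mine) accepts one array per treatment so the replication counts $r_i$ may differ.

```python
import numpy as np
from scipy import stats

def unbalanced_anova(groups):
    """groups: list of 1-D arrays, one per treatment (possibly different lengths r_i)."""
    y_all = np.concatenate(groups)
    grand_mean = y_all.mean()
    t = len(groups)
    n_total = y_all.size

    ss_treat = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_resid = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_treat, df_resid = t - 1, n_total - t           # t - 1 and sum_i (r_i - 1)

    f_obs = (ss_treat / df_treat) / (ss_resid / df_resid)
    p_value = stats.f.sf(f_obs, df_treat, df_resid)
    return ss_treat, ss_resid, f_obs, p_value
```

Applied to grouped data such as the SAT scores in the next example, it produces an ANOVA table of the same form as Table 2.5.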
Let's put our knowledge to use.

Example 2.2.1 Suppose we wish to compare SAT scores by state over several years. We are given the following state average SAT scores from 1999-2004 (not in chronological order):

New York  | 1007  1006  1000  1000  1001
Hawaii    | 1001  1002  1008  1001  1007  995
Maryland  | 1026  1024  1020  1018
Arizona   | 1047  1049  1043  1048
Kentucky  | 1116  1106  1102  1100  1098
Table 2.3: Data of average SAT scores

Given this information, determine whether there is a difference in average SAT scores amongst the different states.

Solution 2.2.2 We proceed by estimating the various parameters for this investigation, then drawing up an ANOVA table.
$$\hat{\mu} = \bar{y}_{++} = \frac{24825}{24} = 1034.375$$
State     | $\bar{y}_{i+}$ | $\hat{\tau}_i = \bar{y}_{i+}-\bar{y}_{++}$ | $(\bar{y}_{i+}-\bar{y}_{++})^2$ | $\sum_j (y_{ij}-\bar{y}_{i+})^2$ | $r_i$
New York  | 1002.8         | -31.575                                    | 996.98                          | 46.8                             | 5
Hawaii    | 1002.33        | -32.04                                     | 1026.66                         | 111.33                           | 6
Maryland  | 1022           | -12.375                                    | 153.14                          | 40                               | 4
Arizona   | 1047           | 12.625                                     | 159.39                          | 21                               | 4
Kentucky  | 1104.4         | 70.025                                     | 4903.5                          | 203.2                            | 5

Table 2.4: Summary of estimates for $\tau_i$, $i = \{1,\dots,5\}$
Now to draw up the ANOVA table:

Source     | Sum of Squares | DoF | Mean Square (ms) | Ratio to Residual ms
Treatments | 6645.67        | 4   | 1661.41          | 74.74
Residual   | 422.333        | 19  | 22.22            |
Total      | 7068.003       | 23  |                  |

Table 2.5: ANOVA table for SAT scores

So the observed F statistic is 74.74 with 4 numerator and 19 denominator degrees of freedom. Its associated p-value is $\Pr(F_{4,19} \ge 74.74) \approx 0$, so there is strong evidence that average scores differ between states.
A couple of things to note. Having concluded that there is a difference amongst the treatments, we can go on to look at contrasts and construct confidence intervals. The first thing to do is to check whether major differences exist in the variation amongst treatments: we can plot the residuals by state and, if the spreads all look relatively close, treat the groups as having approximately the same $\sigma$.

[Figure: residual plots by state, omitted]

The spreads look relatively similar, so we can proceed to contrast average SAT scores between states. Suppose we compare Hawaii and Kentucky; the contrast is $\theta = \tau_{Kentucky} - \tau_{Hawaii}$, which yields:
$$\hat{\theta} = \hat{\tau}_{Kentucky} - \hat{\tau}_{Hawaii} = \bar{y}_{Kentucky+} - \bar{y}_{Hawaii+} = 1104.4 - 1002.3\overline{3} = 102.0\overline{6}$$
So the corresponding estimator and confidence interval are:
$$\tilde{\theta} \sim N\!\left(\theta,\ \sigma\sqrt{\tfrac{1}{6} + \tfrac{1}{5}}\right)
\qquad
102.0\overline{6} \pm c\,\hat{\sigma}\sqrt{\tfrac{1}{6} + \tfrac{1}{5}}
\quad\text{where } \hat{\sigma}^2 = \frac{\text{ss(residual)}}{19} = 22.22$$
Since the confidence interval does not contain 0, there is evidence of a difference in responses between the two states.

Remark 2.2.3 Although $\bar{Y}_{Hawaii+}$ and $\bar{Y}_{Kentucky+}$ are both averages, they have different numbers of observations. The variance was obtained as follows:
$$\mathrm{Var}(\tilde{\theta}) = \mathrm{Var}(\bar{Y}_{Kentucky+} - \bar{Y}_{Hawaii+})
= \mathrm{Var}(\bar{Y}_{Kentucky+}) + \mathrm{Var}(\bar{Y}_{Hawaii+})
= \frac{\sigma^2}{r_{Kentucky}} + \frac{\sigma^2}{r_{Hawaii}}$$
Also be aware that writing $\sigma^2$ in the equation above is a slight abuse of notation: it should not be confused with the treatment mean square, and in practice it is replaced by $\hat{\sigma}^2$ when numerical values are needed.
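A contrast interval of this kind can be wrapped up in a small helper. The sketch below is illustrative Python (the function name and arguments are mine); it assumes only the two group means and replication counts, plus the residual mean square and its degrees of freedom from the ANOVA table.

```python
import numpy as np
from scipy import stats

def contrast_ci(mean_i, r_i, mean_j, r_j, ms_resid, df_resid, level=0.95):
    """Confidence interval for theta = tau_i - tau_j in an unbalanced one-way layout."""
    theta_hat = mean_i - mean_j
    se = np.sqrt(ms_resid * (1.0 / r_i + 1.0 / r_j))
    c = stats.t.ppf(0.5 + level / 2, df_resid)        # t multiplier on df_resid
    return theta_hat - c * se, theta_hat + c * se
```

For the Kentucky versus Hawaii comparison one would call it with the two state means, r = 5 and 6, ms_resid = 22.22 and 19 degrees of freedom.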
Chapter 3 ANOVA and Randomized Block Designs

3.1 Introduction

In this section we consider balanced designs under blocking. By holding explanatory variates fixed within blocks, the effect of confounding factors is reduced. The approach is similar to the non-blocking paradigm, except that the total sum of squares (ss(total)) is further decomposed to give rise to a sum of squares for blocking, known as ss(block).
3.2 ANOVA-Blocking

Under blocking two questions are generally asked; interestingly they concern not the blocks' performance but rather the treatments:
• Are there differences amongst the different treatments?
• Which treatment(s) should be accepted/rejected based on the experimental question?

The model used under blocking is:
$$Y_{ij} = \mu + \tau_i + \beta_j + R_{ij} \qquad \text{where } R_{ij} \sim N(0,\sigma^2),\ i \in \{1,\dots,t\},\ j \in \{1,\dots,b\} \tag{3.1}$$
Also, the two linear constraints $\sum_i \tau_i = 0$ and $\sum_j \beta_j = 0$ are imposed. Once again the method of least squares is used to estimate the parameters, giving:
$$\hat{\mu} = \bar{y}_{++} \qquad \hat{\tau}_i = \bar{y}_{i+} - \bar{y}_{++} \qquad \hat{\beta}_j = \bar{y}_{+j} - \bar{y}_{++}$$
$$\hat{\sigma} = \sqrt{\frac{\sum_{i,j}(y_{ij} - \hat{\mu} - \hat{\tau}_i - \hat{\beta}_j)^2}{bt - 1 - (t-1) - (b-1)}}
= \sqrt{\frac{\sum_{i,j}\big((y_{ij} - \bar{y}_{++}) - (\bar{y}_{i+} - \bar{y}_{++}) - (\bar{y}_{+j} - \bar{y}_{++})\big)^2}{(b-1)(t-1)}}
= \sqrt{\frac{\sum_{i,j}(y_{ij} - \bar{y}_{i+} - \bar{y}_{+j} + \bar{y}_{++})^2}{(b-1)(t-1)}}$$
The number of degrees of freedom for $\hat{\sigma}$ is $(b-1)(t-1)$, since we have $bt$ observations, $t$ treatment parameters, $b$ block parameters, 1 parameter $\mu$ and 2 linear constraints; an alternate way of writing this count is $bt - b - t - 1 + 2$, which lists the observations, parameters and constraints in the same order as they were presented.

The numerator under the square root, $\sum_{i,j}\big((y_{ij} - \bar{y}_{++}) - (\bar{y}_{i+} - \bar{y}_{++}) - (\bar{y}_{+j} - \bar{y}_{++})\big)^2$, is bulky, so it is better to break it down into its individual pieces. Expanding it yields three squared terms and three cross-product terms. Starting with the squared terms, the treatment sum of squares:
$$\sum_{i=1}^{t}\sum_{j=1}^{b}(\bar{y}_{i+} - \bar{y}_{++})^2 = b\sum_{i=1}^{t}(\bar{y}_{i+} - \bar{y}_{++})^2$$
The block sum of squares:
$$\sum_{i=1}^{t}\sum_{j=1}^{b}(\bar{y}_{+j} - \bar{y}_{++})^2 = t\sum_{j=1}^{b}(\bar{y}_{+j} - \bar{y}_{++})^2$$
The first cross-product term:
$$-2\sum_{i,j}(y_{ij} - \bar{y}_{++})(\bar{y}_{i+} - \bar{y}_{++})
= -2\sum_{i}(\bar{y}_{i+} - \bar{y}_{++})\sum_{j}(y_{ij} - \bar{y}_{++})
= -2b\sum_{i}(\bar{y}_{i+} - \bar{y}_{++})^2$$
since $\sum_j (y_{ij} - \bar{y}_{++}) = b(\bar{y}_{i+} - \bar{y}_{++})$. Similarly, the second cross-product term:
$$-2\sum_{i,j}(y_{ij} - \bar{y}_{++})(\bar{y}_{+j} - \bar{y}_{++}) = -2t\sum_{j}(\bar{y}_{+j} - \bar{y}_{++})^2$$
The third cross-product term equals 0:
$$2\sum_{i,j}(\bar{y}_{i+} - \bar{y}_{++})(\bar{y}_{+j} - \bar{y}_{++})
= 2\sum_{i}(\bar{y}_{i+} - \bar{y}_{++})\sum_{j}(\bar{y}_{+j} - \bar{y}_{++})
= 2\,bt(\bar{y}_{++} - \bar{y}_{++})(\bar{y}_{++} - \bar{y}_{++}) = 0$$
Putting everything back together, we see how to divide the total sum of squares between treatments, blocks and residuals:
$$\sum_{i,j}\big((y_{ij} - \bar{y}_{++}) - (\bar{y}_{i+} - \bar{y}_{++}) - (\bar{y}_{+j} - \bar{y}_{++})\big)^2$$
$$= \sum_{i,j}(y_{ij} - \bar{y}_{++})^2 + b\sum_{i=1}^{t}(\bar{y}_{i+} - \bar{y}_{++})^2 + t\sum_{j=1}^{b}(\bar{y}_{+j} - \bar{y}_{++})^2
- 2b\sum_{i}(\bar{y}_{i+} - \bar{y}_{++})^2 - 2t\sum_{j}(\bar{y}_{+j} - \bar{y}_{++})^2$$
$$= \sum_{i,j}(y_{ij} - \bar{y}_{++})^2 - b\sum_{i=1}^{t}(\bar{y}_{i+} - \bar{y}_{++})^2 - t\sum_{j=1}^{b}(\bar{y}_{+j} - \bar{y}_{++})^2$$
It is clear from the decomposition above that ss(residual) = ss(total) − ss(treatment) − ss(block). An ANOVA table can also be put together:
ss(treatment) = $b\sum_{i=1}^{t}(\bar{y}_{i+} - \bar{y}_{++})^2$
ss(block)     = $t\sum_{j=1}^{b}(\bar{y}_{+j} - \bar{y}_{++})^2$
ss(residual)  = ss(total) − ss(treatment) − ss(block)
ss(total)     = $\sum_{i,j}(y_{ij} - \bar{y}_{++})^2$

Table 3.1: Summary of sums of squares

Source     | Sum of Squares                                  | DoF           | Mean Square (ms)              | Ratio to Residual ms
Treatments | $b\sum_{i=1}^{t}(\bar{y}_{i+}-\bar{y}_{++})^2$  | $t-1$         | ss(treatment)/(t-1)           | [ss(treatment)/(t-1)] / [ss(residual)/((b-1)(t-1))]
Block      | $t\sum_{j=1}^{b}(\bar{y}_{+j}-\bar{y}_{++})^2$  | $b-1$         | ss(block)/(b-1)               |
Residual   | ss(total) − ss(treatment) − ss(block)           | $(b-1)(t-1)$  | ss(residual)/((b-1)(t-1))     |
Total      | $\sum_{i,j}(y_{ij}-\bar{y}_{++})^2$             | $tb-1$        |                               |

Table 3.2: ANOVA table for a balanced randomized block design

Remark 3.2.1 There are a couple of things to note:
1. The estimate for $\hat{\sigma}$ is $\sqrt{\text{ss(residual)}/\big((b-1)(t-1)\big)}$.
2. Under the hypothesis, $b\sum_{i=1}^{t}(\bar{Y}_{i+} - \bar{Y}_{++})^2 \big/ \sigma^2 \sim \chi^2_{t-1}$.

3.3 Testing the hypothesis
Now that all the data and parameters have been summarized, we can test the hypothesis that there are no differences amongst treatments, i.e. $\tau_1 = \dots = \tau_t = 0$. The test is very similar to the one conducted in non-blocked designs. As shown above, we take the ratio of the treatment mean square to the residual mean square, which under the hypothesis has an F distribution with $(t-1)$ and $(b-1)(t-1)$ degrees of freedom, formally $F_{(t-1),(b-1)(t-1)}$. Recall that the residual mean square is an estimate of $\sigma^2$ regardless of whether the hypothesis is true. It can also be shown that:
$$\bar{Y}_{i+} \sim N\!\left(\mu + \tau_i,\ \frac{\sigma^2}{b}\right) \tag{3.2}$$
The proof is provided below for clarity:
$$\bar{Y}_{i+} = \frac{Y_{i1} + \dots + Y_{ib}}{b}
= \frac{(\mu + \tau_i + \beta_1 + R_{i1}) + \dots + (\mu + \tau_i + \beta_b + R_{ib})}{b}
= \mu + \tau_i + \frac{\sum_{j=1}^{b} R_{ij}}{b}$$
(the $\beta_j$ terms vanish because $\sum_j \beta_j = 0$), so
$$E[\bar{Y}_{i+}] = \mu + \tau_i
\qquad
\mathrm{Var}[\bar{Y}_{i+}] = \mathrm{Var}\!\left[\frac{\sum_{j=1}^{b} R_{ij}}{b}\right] = \frac{\sum_{j=1}^{b}\mathrm{Var}(R_{ij})}{b^2} = \frac{b\sigma^2}{b^2} = \frac{\sigma^2}{b}$$
Remark 3.3.1 For a randomized block design the estimate of the contrast $\theta = \tau_1 - \tau_2$ is given by:
$$\hat{\tau}_1 - \hat{\tau}_2 = (\bar{y}_{1+} - \bar{y}_{++}) - (\bar{y}_{2+} - \bar{y}_{++}) = \bar{y}_{1+} - \bar{y}_{2+}$$
Accordingly its estimator is:
$$\tilde{\theta} = \bar{Y}_{1+} - \bar{Y}_{2+} \sim N\!\left(\theta,\ \sigma\sqrt{\tfrac{1}{b} + \tfrac{1}{b}}\right)$$
A generalization of the above is the contrast $\hat{\theta} = \sum_i a_i\hat{\tau}_i = \sum_i a_i\bar{y}_{i+}$ where $\sum_i a_i = 0$. Its estimator is:
$$\tilde{\theta} = \sum_i a_i\bar{Y}_{i+} \sim N\!\left(\theta,\ \sigma\sqrt{\frac{\sum_i a_i^2}{b}}\right) \qquad \text{where } \sum_i a_i = 0$$
3.4 Putting it all together: an oil example
Suppose we have data on oil prices over the last 11 years, broken down by month (Table 3.3). We wish to eliminate the effect the various years may have had on oil prices, so the blocks are the years and the months are the treatments. The resulting ANOVA table is Table 3.6. Since $\Pr(F > 2.8353) = 0.002667$ we conclude that there is a difference in oil prices throughout the year. From the table of monthly averages (Table 3.4) we can see that oil prices are lowest in January ($1.197) and highest in September ($1.318).
Year  | Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
1993  | 1.117  1.108  1.098  1.112  1.129  1.130  1.109  1.097  1.085  1.127  1.113  1.070
1994  | 1.043  1.051  1.045  1.064  1.080  1.106  1.136  1.182  1.177  1.152  1.163  1.143
1995  | 1.129  1.120  1.115  1.140  1.200  1.226  1.195  1.164  1.148  1.127  1.101  1.101
1996  | 1.129  1.124  1.162  1.251  1.323  1.299  1.272  1.240  1.234  1.227  1.250  1.260
1997  | 1.261  1.255  1.235  1.231  1.226  1.229  1.205  1.253  1.277  1.242  1.213  1.177
1998  | 1.131  1.082  1.041  1.052  1.092  1.094  1.079  1.052  1.033  1.042  1.028  0.986
1999  | 0.972  0.955  0.991  1.177  1.178  1.148  1.189  1.255  1.280  1.274  1.264  1.298
2000  | 1.301  1.369  1.541  1.506  1.498  1.617  1.593  1.510  1.582  1.559  1.555  1.489
2001  | 1.472  1.484  1.447  1.564  1.729  1.640  1.482  1.427  1.531  1.362  1.263  1.131
2002  | 1.139  1.130  1.241  1.407  1.421  1.404  1.412  1.423  1.422  1.449  1.448  1.394
2003  | 1.473  1.641  1.748  1.659  1.542  1.514  1.524  1.628  1.728  1.603  1.535  1.494

Table 3.3: Average oil prices over 11 years sorted by month

Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec    | Overall
1.197  1.211  1.242  1.288  1.311  1.310  1.291  1.294  1.318  1.288  1.267  1.231  | 1.271

Table 3.4: Average oil prices by month

This may seem counterintuitive, since it suggests oil is cheaper in the middle of winter than at the end of summer. But consider the economics of the situation: when lots of people use heating during the winter, supply is increased and the price goes down, while in the fall, as demand falls off, supply is reduced and the price rises. It is the basic premise of supply and demand.

Suppose we wanted to contrast September and January prices. We could set up the contrast as $\theta = \tau_{September} - \tau_{January}$, which gives:
$$E[\tilde{\theta}] = E[\bar{Y}_{September+} - \bar{Y}_{January+}] = 1.318 - 1.197 = 0.121$$
$$\mathrm{Var}[\tilde{\theta}] = \mathrm{Var}[\bar{Y}_{September+} - \bar{Y}_{January+}] = \frac{2(0.0064)}{11} = 0.0011636$$
so $\tilde{\theta}$ is approximately t-distributed with mean 0.121 and standard deviation 0.034112, where $\hat{\sigma}^2$ has 110 degrees of freedom. Constructing a 95% confidence interval, $0.121 \pm 1.645(0.034112) \rightarrow [0.0649,\ 0.177]$; it is clear that 0 does not fall in this range. Accordingly it may be concluded that there is a difference between September and January prices.
1993   1994   1995   1996   1997   1998   1999   2000   2001   2002   2003
1.108  1.112  1.147  1.231  1.234  1.059  1.165  1.510  1.461  1.358  1.591

Table 3.5: Average oil prices by year

          Df   Sum Sq   Mean Sq  F value  Pr(>F)
Month     11   0.2008   0.0183   2.8353   0.002667
Year      10   73.9507  0.3951   61.3547  < 2.2e-16
Residuals 110  0.7083   0.0064

Table 3.6: Analysis of Variance Table: Oil Prices
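A table of the same form as Table 3.6 can be computed from the blocks-by-treatments layout using the decomposition of Section 3.2. The function below is an illustrative Python sketch (the name and interface are mine, not part of the notes); it takes the data as a b × t array with blocks as rows and returns the treatment F test.

```python
import numpy as np
from scipy import stats

def block_anova(y):
    """y: b x t array, rows = blocks (e.g. years), columns = treatments (e.g. months)."""
    b, t = y.shape
    grand = y.mean()
    treat_means = y.mean(axis=0)          # column (treatment) means
    block_means = y.mean(axis=1)          # row (block) means

    ss_total = np.sum((y - grand) ** 2)
    ss_treat = b * np.sum((treat_means - grand) ** 2)
    ss_block = t * np.sum((block_means - grand) ** 2)
    ss_resid = ss_total - ss_treat - ss_block

    ms_treat = ss_treat / (t - 1)
    ms_resid = ss_resid / ((b - 1) * (t - 1))
    f_obs = ms_treat / ms_resid
    p_value = stats.f.sf(f_obs, t - 1, (b - 1) * (t - 1))
    return ss_treat, ss_block, ss_resid, f_obs, p_value
```

Applied to the 11 × 12 table of oil prices above, it yields the month (treatment) F statistic and p-value of the same kind reported in Table 3.6.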
3.5 Some important notes
In the blocked ANOVA design we assume that the treatment and block effects are additive, that is, there is no interaction between the treatments and the blocks. This may not always be the case. It is also possible to set up more than one blocking factor for each treatment. One such design is the Latin squares design, which blocks treatments by row and column; for example, a farmer comparing 3 solutions on his plants could block by both the row and the column of the field plot.
3.6 Factorial Treatment and Interactions

3.7 Introduction

Often experiments have multifactorial models: each treatment can be divided up into one or more factors, i.e. the treatments have a factorial structure. In multifactorial models a key concern is how the factors interact with one another to influence the response. Interaction is different from correlation: correlation describes an association between the factors themselves, whereas there is said to be an interaction between factors if the effect on the response of changing one factor depends on the level of another. Different models can be used for factorial designs; first we consider the 2^2 factorial design and then move on to the 2^3 factorial design.
3.8 2^2 Factorial Design
In a 2^2 factorial design we consider 2 factors, each at 2 levels, giving 4 treatments in total. The idea is best developed through an example.

Example: Suppose a manufacturer producing a new set of skis wants to compare the amount of bend under normal and extreme conditions, and further wants to check whether his new design is significantly better than his old design (the more a ski bends, the more likely it is to break). To do so he sets up an experiment with 16 observations, listed below:

Treatment | Design | Condition | Bend
1         | new    | extreme   | 1.35, 1.63, 1.43, 1.57
2         | new    | normal    | 1.27, 1.44, 1.53, 1.40
3         | old    | extreme   | 2.56, 2.51, 2.22, 2.35
4         | old    | normal    | 1.45, 1.67, 1.34, 1.47

Table 3.7: Bend: new and old skis under extreme and normal conditions

Two questions concern the manufacturer:
• Is there any evidence of a difference among treatments?
• Is there any evidence of interaction between design and condition? Alternatively put, is the effect of the condition the same for each design type?

Solution: The first question can be addressed by ANOVA, the second by a contrast. Before proceeding, let's draw up the appropriate tables for the experiment: the table of averages and the ANOVA table.
        | New     | Old      | Average
Extreme | 1.495   | 2.41     | 1.9525
Normal  | 1.41    | 1.4825   | 1.44625
Average | 1.4525  | 1.94625  | 1.699375

Table 3.8: Averages for the different designs and conditions

Source    | Df | Sum Sq  | Mean Sq  | F value | Pr(>F)
Treatment | 3  | 2.71012 | 0.903375 | 50.90   | ≈ 0
Residuals | 12 | 0.21297 | 0.01775  |         |
Total     | 15 | 2.92309 |          |         |

Table 3.9: ANOVA for bend under different designs and conditions

Question 1: Since the p-value for $F_{3,12} = 50.90$ is approximately 0, there is strong evidence of a difference amongst treatments.

Question 2: There appears to be a difference both in going from normal to extreme conditions and in changing designs from old to new. To test for an interaction between design and condition we test the hypothesis $H_0: \theta = 0$ (no interaction), where $\theta$ can be written as either of the following two equivalent contrasts:
$$\theta = (\tau_1 - \tau_3) - (\tau_2 - \tau_4) = (\bar{y}_{extreme,new} - \bar{y}_{extreme,old}) - (\bar{y}_{normal,new} - \bar{y}_{normal,old})$$
$$\phantom{\theta} = (\tau_1 - \tau_2) - (\tau_3 - \tau_4) = (\bar{y}_{extreme,new} - \bar{y}_{normal,new}) - (\bar{y}_{extreme,old} - \bar{y}_{normal,old})$$
The corresponding estimator is normal:
$$\tilde{\theta} \sim N\!\left(\theta,\ \sigma\sqrt{\tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{4}}\right)$$
Hence the discrepancy measure is
$$d = \frac{|\hat{\theta} - 0|}{\hat{\sigma}\sqrt{1}} = \frac{|-0.8425 - 0|}{0.1332} = 6.324$$
The associated p-value is $\Pr(|t_{12}| \ge 6.324) = 0.000038$, which indicates strong evidence of interaction between the factors. A useful tool when analysing factorial models is an interaction plot.

[Figure: interaction plot of mean bend by condition for each design, omitted]

In this case it shows that there isn't much change as we proceed from extreme to normal conditions with the new design, whereas with the old design the bending decreases as we proceed from extreme to normal conditions. The fact that the two lines are not parallel indicates there is an interaction between design and condition.
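The numbers in this example can be checked with a short script. The following sketch (Python, for illustration only) recomputes the treatment means, the ANOVA F statistic and the interaction contrast from the 16 observations in Table 3.7; up to rounding it should reproduce the values quoted above.

```python
import numpy as np
from scipy import stats

# Bend measurements from Table 3.7 (4 replicates per treatment)
bend = {
    ("new", "extreme"): [1.35, 1.63, 1.43, 1.57],
    ("new", "normal"):  [1.27, 1.44, 1.53, 1.40],
    ("old", "extreme"): [2.56, 2.51, 2.22, 2.35],
    ("old", "normal"):  [1.45, 1.67, 1.34, 1.47],
}

y = np.array(list(bend.values()))          # 4 treatments x 4 replicates
t, r = y.shape
treat_means = y.mean(axis=1)
grand_mean = y.mean()

ss_treat = r * np.sum((treat_means - grand_mean) ** 2)
ss_resid = np.sum((y - treat_means[:, None]) ** 2)
ms_resid = ss_resid / (t * (r - 1))
f_obs = (ss_treat / (t - 1)) / ms_resid
p_anova = stats.f.sf(f_obs, t - 1, t * (r - 1))

# Interaction contrast: (new,extreme - old,extreme) - (new,normal - old,normal)
theta_hat = (treat_means[0] - treat_means[2]) - (treat_means[1] - treat_means[3])
se_theta = np.sqrt(ms_resid * (1/r + 1/r + 1/r + 1/r))
d = abs(theta_hat) / se_theta
p_interaction = 2 * stats.t.sf(d, t * (r - 1))

print(f_obs, p_anova)              # F about 50.9, p-value near 0
print(theta_hat, d, p_interaction) # theta about -0.8425, d about 6.32
```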
3.9 2^3 Factorial Design

3.10 Notes
There is hidden replication in these experiments: because every observation contributes to the estimate of each factor's effect, a factorial design is both more efficient and more comprehensive than varying one factor at a time.
Chapter 4 Sampling Introduction

4.1 Introduction
In this chapter we deal with the planning, analysis and design of simple sample surveys. Surveys are used to collect data, usually via a questionnaire, to study characteristics and relationships of an underlying population. We consider a simple example of a local grocery store to illustrate the topics covered in this section. A local grocery store would like to improve its customer service. To accurately assess the needs of its customers, a survey of people who visit the store needs to be conducted. Several questions, however, need to be asked before collecting data:
1. How might the survey be conducted?
2. Would everyone who entered the store be sampled, or only a subset of these people?
3. What kind of people would be more likely to respond?
4. Do the conclusions drawn from a sample population hold for the whole population?
5. What kind of inferences can be made based on the data accumulated from the survey?
6. Are there possible sources of error from the collection, receipt or interpretation of the data?
As the example above demonstrates, there are many factors to consider when designing a sample survey. The rest of the introduction introduces the concepts of a survey and a census.
In the social sciences data is obtained by means of a social survey: a poll of either a subset (portion of the population) or the complete space (whole population) of individuals. When a survey is complete (fully enumerated) it is referred to as a census, whereas a partial enumeration of the population is called a survey. Most techniques and problems associated with conducting a census exist equally for surveys. Government bodies have the resources and judicial power to administer censuses, whereas such ventures in the private sector would be cost ineffective. In Canada the government conducts a census once every five years; should an individual not respond, they are subject to criminal charges and jail time. While it is rare that someone is sent to jail for not completing a census questionnaire, the existence of the threat is impetus enough to encourage its completion. Surveys are not exclusive to the social sciences and also exist in fields such as engineering, medicine and agriculture. It is important to understand the terminology associated with sampling and protocol design; accordingly, several definitions are presented below. Each should be studied carefully as they will reappear throughout this section.
4.2 Terminology
Definition 4.2.1 When not paired with a qualifying noun, a population under consideration constitutes the full sample space Ω of interest. It represents all individuals that qualify for examination.

Definition 4.2.2 A sample (ω) is a subset of a population of interest; mathematically this is equivalent to ω ⊂ Ω.

Definition 4.2.3 A census is an investigation of a population in which every unit is examined. Its goal is to collect information from the whole population of units.

Definition 4.2.4 A survey is an investigation of a subset of a population of interest. Its goal is to collect information about a subset of the population.

Definition 4.2.5 A descriptive survey seeks to identify properties of a population. Of interest are methods for identifying estimators of fairly simple population characteristics, such as the mean of a population.
Definition 4.2.6 An analytic survey examines relationships amongst variables in a population of interest.

Definition 4.2.7 A cluster is a group of units. Units are often selected in clusters to implement a sampling protocol.

Definition 4.2.8 Stratification is the process of subdividing a population of interest into sub-groups, called strata, before a survey is conducted.

Definition 4.2.9 The inclusion probability is the probability $p_i$ that unit i (i = {1, 2, ..., N}, defined over the study population) is included in the sample.
4.2.1 Some important notes on sampling

In this section basic concepts about sampling are introduced. We explore the types of populations, introduce the concept of sampling protocols and briefly mention different types of errors that may arise from sampling.
4.2.2 Populations and their meaning
The word "population" is frequently used in statistics in different contexts and consequently requires contextualization to understand the exact population to which a problem or exercise is referring. Depending on the word that precedes "population" it may be given different meanings. It is worth taking a moment to sort through some of these "populations", since they are commonly used throughout this text.
• A target population is the population under consideration.
• The study population is the population sampled. It is also known as a frame.
• A population of measurements is the set of all measurements derived from the whole population of units. These measurements need not be known.
• A sample of measurements is a subset of measurements derived from the population of measurements. Sample measurements are known.
• A super population or universe of populations is a theoretical concept used to describe the set of all possible populations when the values for a population are not known or are not finite. The universe of populations can be thought of as the superstructure that contains all populations. If we denote the universe of populations by Ξ, population i by Ω_i and sample j from population i by ω_ij, then using set notation we have ω_ij ⊂ Ω_i ⊂ Ξ, where i = {1, 2, ...} and j = {1, 2, ...}. When a super population is defined over an infinite space containing discrete sets it need not form a σ-algebra, since closure under countable unions can fail; consequently only super populations that are closed in this sense are probability spaces.

Note that the population of measurements, when finite, represents a probability space, and the sample of measurements is one realization from it. When the population of measurements can be defined in terms of a probability space, a random variable can be assigned, in conjunction with the sampling protocol, to develop probabilities, confidence intervals and hypothesis tests for the underlying samples.
4.2.3 Sampling protocols

Various sampling protocols exist for selecting units within a population. Each sampling protocol has its own unique set of advantages and disadvantages. The purpose of the sampling as well as the population of interest will determine the protocol. For example, the method by which cars are sampled for crash tests will be different from how a company selects people to participate in an advertising study: both the purpose of the study and the target population are different, so we would expect the sampling protocol to vary as well.

Sampling can be done in one of two ways: in a probabilistic manner or in a non-probabilistic fashion. Ideally the statistician seeks to apply a probability sampling protocol, since conclusions drawn from non-probabilistic sampling will tend to be non-representative of the target population. By structuring sets of sampled individuals in terms of frames, probabilities can be assigned to the likelihood of a set being selected, provided the underlying population is finite. Mathematically: suppose Ω represents a finite population and Σ = {ω_1, ω_2, ..., ω_n} represents the set of all possible frames defined by the sampling protocol. (Note that the sampling protocol defines the probability distribution. For example, suppose we have six people and the protocol requires selecting two of them; then each possible set of two people is selected with equal probability.) Then the probability of selecting frame ω_i, where i = {1, 2, ..., n}, is written as:
$$\Pr(\omega_i) = \frac{\mathrm{Card}(\omega_i)}{\mathrm{Card}(\Omega)} \qquad \text{where } i = \{1, 2, \dots, n\}$$
where Card(ω) is the cardinality of frame ω. While methodologies vary amongst sampling protocols, what remains similar across them is the deconstruction of large units into smaller ones.
Statisticians divide populations into strata that usually share a common characteristic. It is not uncommon to have stratified multi-stage designs, where stratified units are further grouped by common features: strata are usually decomposed into clusters, and clusters are further broken down into sampling units. By taking this top-down approach, errors are avoided at each of the different levels. Consider the case where 10000 people are randomly sampled across Canada: if the sampling is not homogeneous over provinces then sample bias is likely to increase. By stratifying, clustering and framing, both sample and study errors can be reduced.

Non-probability sampling arises under self-selection, convenience, quota and judgement sampling. In general, statistical models cannot be applied to non-probabilistic sampling measures. Surveys have 3 main advantages when compared to a census: they are cheaper, quicker and more practical, especially when the experimental units are destructible (for example a crash-test car). The main disadvantage associated with surveys is the implied uncertainty with respect to drawing general conclusions, since various forms of error exist, whereas in a census these would be reduced.
4.3 Error
Error can occur at different points in the sampling process. The main types are described below.
• Study error, also called frame error, is the difference in attributes between the target and study populations.
• Sample error is the difference in attributes of interest between the study and sample populations. Sample errors vary each time the plan is repeated, which introduces the ideas of:
  1. Sampling bias: the average sample error
  2. Sampling variability: the variance of the sample errors
For populations involving human surveys, a component of the sample error is the non-response error. A sampled population is made up of two distinct groups, the respondents and the non-respondents; hence we can think of the sample in terms of the intended sample (all selected units) and the actual sample (which excludes the non-respondents). The attributes of these two sub-populations may be different, which leads to non-response error, i.e. the difference between actual and intended measurements arising from one or more units refusing to provide data. Accordingly the sampled measurements (of the respondents) may not match those of the frame, since the sampled measurements constitute an incomplete set.
• Measurement error is the difference in the characteristics of interest arising from the difference between the true and measured values of the variates of the units in the sample. Measurement error may be caused by a number of things, e.g.:
  1. Systematic differences between interviewers.
  2. People may lie, modify or forget their answers.
  3. Interviewers may influence the responses of those surveyed by employing different questioning protocols.
  4. Measurement error may arise when the question posed does not match the question used to define the response variate of the target population.

Confidence intervals capture the effect of sampling and measurement variation, but not study error, non-response error, nor sampling and measurement bias. We can try to address study error, non-response error and sampling bias through good planning and execution of the survey. Questionnaires should have most if not all of the following attributes (adapted from the Stat 332 handbook):
1. A purpose
2. Clear and simple questions
3. Specific rather than general questions
4. Ask one concept per question
5. Use forced-choice answers rather than agree/disagree questions
6. Avoid leading questions/contexts
7. Keep the questionnaire short
8. Explain the purpose of the survey
9. Ensure confidentiality
10. Pay attention to question-order effects
11. Test the questions beforehand
12. Plan how the questions will be used

4.4 Some important notes

• In random sampling there is no sampling bias.
• Sampling bias is the average of the sampling errors.
• Sampling variability is the variation in the sample errors.
• Non-response error arises only in human surveys.
• Show the divisions of the strata and clusters.
• Self-selection causes sampling bias.
• Note: error can be divided into systematic and random error.
1. Random error affects variability over the entire sample population; it does not, however, introduce bias (assuming the error has mean zero). It is sometimes called noise.
2. Systematic error is caused by factors that systematically affect measurements across the entire sample. Systematic error is considered a bias since it will either raise or lower the average of the sample values; it does not, however, affect variability.

There are various ways to reduce measurement error, both random and systematic:
1. Conduct a pilot test of your instruments.
2. With human experiments, train the people beforehand on how to properly conduct the experiment.
3. Double-enter the data to make sure the same data are recorded across the board.
Chapter 5 Probability sampling

5.1 Introduction
In the previous chapter it was mentioned that two types of sampling protocols exist. The focus of this chapter is sampling under a probabilistic framework. When dealing with formal surveys, probability sampling is employed. Contrary to non-probabilistic designs, probabilistic sampling places an underlying probability model on subsets of the frame. Using this approach a statistical model of sample error can be developed; that is, we can construct confidence intervals and hypothesis tests for the parameters that represent attributes of interest in the study population. The execution of the sampling protocol is important: if the protocol is carried out as planned then the underlying probability model is deemed appropriate and inferences about the study population can be made; should the protocol not be carried out as planned, the model on which the probabilistic sampling protocol was based would have to be reassessed. In what follows, several sampling protocols are briefly introduced before taking a more detailed look at simple random sampling (SRS).
5.2 Sampling Protocols
Before describing some of the probability sampling protocols, some notation is necessary. Let:
• i denote unit i
• U be the set of all units in the study population, that is U = {1, 2, ..., N}
• U_l be stratum l; note that the strata together make up U
• c_k be cluster k; note that the clusters together make up U
• N be the number of elements in U, i.e. Card(U) = N
• s_j be the set of all units in sampled population j; note that s_j ⊂ U
• n be the number of elements in s, i.e. Card(s) = n
In the protocols that follow it is assumed that both the sample size (n) and the study population size (N) are fixed. We start with protocols that are relatively simple and gradually proceed to more complicated designs. For ease we assume three things: N = 1000, n = 100 and equal frame divisions. We then have the following:

1. Simple random sampling (SRS): In simple random sampling no effort is made to stratify the frame; rather, a random sample of n individuals is selected from a study population of N units. Consequently the probability of selecting any set of n subjects from the population of size N is the same. Mathematically there are $\binom{N}{n}$ possible subsets and, since each occurs with equal probability, the probability of selecting sample $s_j$ for n = 100 and N = 1000 is:
$$\Pr(s_j) = \binom{N}{n}^{-1} = \left(\frac{N!}{(N-n)!\,n!}\right)^{-1} = \frac{900!\,100!}{1000!}$$

2. Stratified random sampling: The frame is divided into l strata, from which n/l units are selected from each stratum. Strata may be divided according to a common characteristic, but the specific divisions will be determined by the nature of the experiment. In our example, suppose l = 50; then the frame is divided into 50 strata of 20 units each, that is $U_1 = \{1, 2, \dots, 20\}, \dots, U_{50} = \{981, 982, \dots, 1000\}$, and 2 units are selected from each stratum. The number of possible samples is
$$\binom{N/l}{n/l}^{l} = \binom{20}{2}^{50}$$
Since each sample occurs with equal probability, the probability of selecting sample $s_j$ is $\Pr(s_j) = \left(\frac{18!\,2!}{20!}\right)^{50}$.

3. Cluster sampling: Cluster sampling is similar to stratified sampling in the sense that the frame is divided into a series of sub-frames. It differs in how the sample is selected: unlike stratified sampling, where one or more units is selected from every stratum, here all units are taken from each selected cluster, and the selected clusters together must contain n elements. Suppose the clusters have 50 units each; then there are 20 clusters, i.e. $c_1 = \{1, 2, \dots, 50\}, \dots, c_{20} = \{951, 952, \dots, 1000\}$, and 2 of them are selected. The number of possible samples is
$$\binom{k}{n/(N/k)} = \binom{20}{100/50} = \binom{20}{2}$$
As before, the probability of selecting sample j is the inverse of the number of possible samples.

4. Systematic sampling: Systematic sampling, like cluster sampling, divides the frame into a series of clusters; every (N/n)-th unit is made part of the same cluster, and sampling then requires choosing just 1 cluster. Since only 1 cluster is chosen it must contain n units. Using n and N as defined above, the clusters are $c_1 = \{1, 11, 21, \dots, 991\}, \dots, c_{10} = \{10, 20, \dots, 1000\}$. The sample space consists of N/n possible outcomes, and the probability of choosing sample j is n/N.

5. Two-stage sampling: In two-stage sampling, samples are selected in two stages. Suppose we divide the frame into l = 10 strata; then:
• Stage 1: Select k strata from the l = 10 available strata. In our example let k = 4; the number of different strata samples is $\binom{10}{4}$.
• Stage 2: From each selected stratum an equal number of elements is drawn so that they total n. In our example we have k = 4 selected strata and n = 100, so 100/4 = 25 elements are selected from each. The total number of possible samples is
$$\binom{l}{k}\binom{N/l}{n/k}^{k} = \binom{10}{4}\binom{100}{25}^{4}$$

There are many different ways in which samples can be selected. It is important to understand the concepts behind these basic protocols, since such paradigms are often mixed together to generate sample spaces, as illustrated in the two-stage sampling example and the previously mentioned three-stage sampling protocol (labour force survey). Another important concept is the inclusion probability $p_i$, the probability that unit i is included in the sample. For example, suppose we wanted to
know $p_i$ in cluster sampling. If N = 1000, n = 100 and there are k = 20 clusters (of which 2 are selected), then $p_i$ is given by:
$$p_i = \frac{\binom{19}{1}}{\binom{20}{2}} = \frac{19}{190} = \frac{1}{10}$$
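The counting arguments above are easy to check numerically. The sketch below (illustrative Python, using the standard-library comb function) computes the number of possible samples for the SRS, stratified and cluster protocols of this section, and the cluster-sampling inclusion probability just derived.

```python
from math import comb

N, n = 1000, 100

# SRS: any of C(N, n) subsets is equally likely
srs_samples = comb(N, n)

# Stratified: 50 strata of 20 units, 2 drawn from each -> C(20, 2)^50 samples
strat_samples = comb(20, 2) ** 50

# Cluster: 20 clusters of 50 units, 2 clusters drawn -> C(20, 2) samples
cluster_samples = comb(20, 2)

# Inclusion probability of unit i under cluster sampling:
# its cluster must be among the 2 chosen out of 20
p_i = comb(19, 1) / comb(20, 2)

print(cluster_samples, p_i)   # 190 and 0.1, which equals n / N
```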
Simple random sampling was used in some form or another to describe the various sampling protocols. To understand the statistical properties associated with the various protocols the properties underlying SRS must first be understood, which leads us into the next section.
5.3 Simple Random Sampling-SRS
Simple random sampling should really be called simple random sampling without replacement, since each selected unit is not counted more than once in a sample. In simple random sampling the probability of inclusion is given by:
$$p_i = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N} \tag{5.1}$$
which intuitively makes sense, since n units are selected from a total of N. To identify the test statistics, let $y_i$ be the value of the response variate for unit i. The mean and variance over the sample (s) and study (U) populations are:
$$\hat{\mu} = \frac{\sum_{i\in s} y_i}{n} \qquad \hat{\sigma}^2 = \frac{\sum_{i\in s}(y_i - \hat{\mu})^2}{n-1} = \frac{\sum_{i\in s}\hat{r}_i^2}{n-1}$$
$$\mu = \frac{\sum_{i\in U} y_i}{N} \qquad \sigma^2 = \frac{\sum_{i\in U}(y_i - \mu)^2}{N-1} = \frac{\sum_{i\in U} r_i^2}{N-1}$$
Since samples vary every time one is drawn, we treat S as a random subset; S = s_j is one realization over all possibilities. The estimators based on S are:
$$\tilde{\mu} = \frac{\sum_{i\in S} y_i}{n} \qquad \tilde{\sigma}^2 = \frac{\sum_{i\in S}(y_i - \tilde{\mu})^2}{n-1} = \frac{\sum_{i\in S}\tilde{r}_i^2}{n-1}
\qquad \Pr(S = s_j) = \binom{N}{n}^{-1} \text{ for all } j$$
It is often more convenient to express the random subset S over the probability space U; that is, we wish to work with random variables rather than a random subset. To do so the indicator function is introduced. Let:
$$I_i = \begin{cases} 1 & \text{if unit } i \text{ is in the sample} \\ 0 & \text{otherwise} \end{cases} \qquad i = \{1, \dots, N\}$$
A couple of things to note about the random variable $I_i$:
$$E[I_i] = \Pr[i \notin S](0) + \Pr[i \in S](1) = \frac{n}{N}$$
$$\mathrm{Var}[I_i] = E[I_i^2] - (E[I_i])^2 = \Pr[i \notin S](0)^2 + \Pr[i \in S](1)^2 - \left(\frac{n}{N}\right)^2 = \frac{n}{N}\left(1 - \frac{n}{N}\right)$$
$$E[I_i I_j] = \Pr(\text{units } i \text{ and } j \text{ are both in the sample}) = \frac{\binom{N-2}{n-2}}{\binom{N}{n}} = \frac{n(n-1)}{N(N-1)}$$
$$\mathrm{Cov}(I_i, I_j) = E[I_i I_j] - E[I_i]E[I_j] = \frac{n(n-1)}{N(N-1)} - \frac{n}{N}\cdot\frac{n}{N} = -\frac{n}{N}\left(1 - \frac{n}{N}\right)\frac{1}{N-1}$$
We can now write $\tilde{\mu}$ as a proper random variable. While the exact distribution of $\tilde{\mu}$ is unknown, many of its properties can be derived; they are summarized below:
$$\tilde{\mu} = \frac{\sum_{i\in U} I_i y_i}{n} \qquad E[\tilde{\mu}] = \mu \qquad \mathrm{stddev}[\tilde{\mu}] = \sqrt{1 - \frac{n}{N}}\,\frac{\sigma}{\sqrt{n}}$$
Since $E[\tilde{\mu}] = \mu$, $\tilde{\mu}$ is an unbiased estimator. The proofs of the formulas above are provided below:
$$E[\tilde{\mu}] = E\!\left[\frac{\sum_{i\in U} I_i y_i}{n}\right] = \frac{\sum_{i\in U} E[I_i]\,y_i}{n} = \frac{\sum_{i\in U} (n/N)\,y_i}{n} = \sum_{i\in U}\frac{y_i}{N} = \mu$$
$$\mathrm{Var}(\tilde{\mu}) = \mathrm{Var}\!\left[\frac{\sum_{i\in U} I_i y_i}{n}\right]
= \frac{1}{n^2}\left[\sum_{i\in U} y_i^2\,\mathrm{Var}(I_i) + \sum_{i\ne j,\,j\in U} y_i y_j\,\mathrm{Cov}(I_i, I_j)\right]$$
$$= \frac{1}{n^2}\cdot\frac{n}{N}\left(1-\frac{n}{N}\right)\left[\sum_{i\in U} y_i^2 - \frac{\sum_{i\ne j,\,j\in U} y_i y_j}{N-1}\right]
= \left(1-\frac{n}{N}\right)\frac{\sigma^2}{n}$$
since
$$\sum_{i\in U} y_i^2 - \frac{\sum_{i\ne j,\,j\in U} y_i y_j}{N-1}
= \frac{1}{N-1}\left[(N-1)\sum_{i\in U} y_i^2 - \sum_{i\ne j,\,j\in U} y_i y_j\right]
= \frac{1}{N-1}\left[N\sum_{i\in U} y_i^2 - \Big(\sum_{i\in U} y_i\Big)^2\right]
= \frac{N}{N-1}\left[\sum_{i\in U} y_i^2 - N\mu^2\right]
= N\sigma^2$$

The value $1 - \frac{n}{N} = 1 - f$ is known as the finite population correction factor (fpc), where $f = n/N$ is the sampling fraction, the proportion of the study population included in the sample. The fpc shows how the precision of the estimator depends on how much of the population is sampled: as n increases there can be an appreciable reduction in $\mathrm{stddev}(\tilde{\mu})$, and if $n = N$ we have a census and $\mathrm{stddev}(\tilde{\mu})$ rightfully equals 0. The estimator $\tilde{\sigma}^2$ can similarly be shown to be unbiased (i.e. $E[\tilde{\sigma}^2] = \sigma^2$). To show
that it is unbiased, some preliminary work is needed. Note that $(n-1)\tilde{\sigma}^2 = \sum_{i\in U} y_i^2 I_i - n\tilde{\mu}^2$ and
$$E[\tilde{\mu}^2] = \mathrm{Var}(\tilde{\mu}) + (E[\tilde{\mu}])^2 = \left(1 - \frac{n}{N}\right)\frac{\sigma^2}{n} + \mu^2$$
so:
$$E[\tilde{\sigma}^2] = \frac{1}{n-1}\left[\sum_{i\in U} y_i^2\,E[I_i] - n\,E[\tilde{\mu}^2]\right]
= \frac{1}{n-1}\left[\frac{n}{N}\sum_{i\in U} y_i^2 - n\mu^2 - \left(1 - \frac{n}{N}\right)\sigma^2\right]
= \frac{1}{n-1}\left[\frac{n}{N}(N-1)\sigma^2 - \left(1 - \frac{n}{N}\right)\sigma^2\right] = \sigma^2$$
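The unbiasedness of $\tilde{\mu}$ and the variance formula $(1 - n/N)\sigma^2/n$ can also be checked by simulation. The sketch below is illustrative Python (the gamma-distributed study population is an arbitrary choice of mine): it repeatedly draws SRS samples from a fixed finite population and compares the empirical mean and standard deviation of $\tilde{\mu}$ to the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)

N, n = 1000, 100
y = rng.gamma(shape=2.0, scale=10.0, size=N)       # an arbitrary fixed study population
mu = y.mean()
sigma2 = np.sum((y - mu) ** 2) / (N - 1)           # population sigma^2 as defined above

# 20000 simple random samples without replacement
draws = np.array([rng.choice(y, size=n, replace=False).mean() for _ in range(20000)])

print(draws.mean(), mu)                                  # empirical mean of mu-tilde ~ mu
print(draws.std(), np.sqrt((1 - n / N) * sigma2 / n))    # ~ sqrt(1 - n/N) * sigma / sqrt(n)
```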
To use the results above, a form of the Central Limit Theorem for dependent random variables is applied. To avoid the technicalities involved in its proof, only the result is stated: if N, n and N − n are suitably large, then
$$\frac{\sqrt{n}(\tilde{\mu} - \mu)}{\sqrt{1-f}\,\tilde{\sigma}} \sim N(0, 1).$$
The result is used to construct confidence intervals and hypothesis tests for attributes of interest. The idea is illustrated through an example.

Example 5.3.1 Suppose that in a particular city 618 mortgages were taken out last year. A random sample of 60 of these mortgages had an average of $\hat{\mu} = \$187300$ and a standard deviation of $\hat{\sigma} = \$29200$.
1. Estimate the average mortgage value and give a 90% confidence interval for the average.
2. Estimate the total mortgage amount along with a 90% confidence interval for the total value of the mortgages.

Solution 5.3.2 Part (a): The problem gives us N = 618, n = 60, $\hat{\mu} = \$187300$ and $\hat{\sigma} = \$29200$, so:
$$E[\tilde{\mu}] = 187300
\qquad
\mathrm{stddev}(\tilde{\mu}) = \left(1 - \frac{n}{N}\right)^{0.5}\frac{\hat{\sigma}}{\sqrt{n}} = \left(1 - \frac{60}{618}\right)^{0.5}\frac{29200}{\sqrt{60}} = 3582.0 \quad \text{with 59 degrees of freedom}$$
The 90% confidence interval for $\mu$ is $187300 \pm (1.646)(3582.0) \Leftrightarrow [181404,\ 193196]$.
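Part (a) is a direct application of the formulas above; a small sketch of the computation (illustrative Python, reproducing the arithmetic up to the choice of a normal versus t multiplier) is:

```python
from scipy.stats import norm

N, n = 618, 60
mu_hat, sigma_hat = 187300.0, 29200.0

se = (1 - n / N) ** 0.5 * sigma_hat / n ** 0.5     # about 3582
z = norm.ppf(0.95)                                  # about 1.645 for a 90% interval
lo, hi = mu_hat - z * se, mu_hat + z * se

print(se, (lo, hi))                # interval for the average mortgage value
print((N * lo, N * hi))            # scaled up: interval for the total mortgage amount
```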
Notice that the sampling fraction here is fairly small and N, n and N − n are reasonably large, which allows us to apply the normal approximation.

Part (b): The total mortgage amount is $\sum_{i\in U} y_i = N\mu$. Using the normal approximation and the estimator $\tilde{\mu}$, the 90% confidence interval is:
$$618\big(187300 \pm (1.646)(3582.0)\big) \;\Rightarrow\; 115751400 \pm (1.646)(2213676) \;\Rightarrow\; [112107689,\ 119395111]$$
Notice that the standard deviation is also multiplied by 618.

The problem above demonstrated how to use the modified normal approximation when estimating values over a sample and a population. The problem is that the resulting interval is relatively wide. Suppose we wanted a narrower confidence interval: by applying similar methodology to the sample errors we can decrease the range (and hence increase the precision) of the confidence interval. The example below is taken from the Stat 332 handbook.

Example 5.3.3 Suppose an auditor is inspecting a general warehouse. The warehouse claims that it has 1256 items with total value $4311712. The auditor decides to sample 50 items. He finds that the average actual value of the sampled items is $2895.29 with standard deviation $1997.22. The sample average dollar error (actual value − stated value) is $2.33 with sample standard deviation of dollar errors $41.93. 11 of the 50 items had count errors. Using SRS the auditor wants to find:
• The true value of the inventory
• The average dollar error per item
• The proportion of items with counts in error
Find these values.

Solution 5.3.4 Total value of inventory: This question can be approached in one of two ways. The first method is similar to the way in which the mortgage example was solved; the second, called the difference estimation method, uses the average dollar error per item to find the total inventory value.

Method 1: We approach this question in the same way as the mortgage example. It requires us to construct a confidence interval for the total value of the items in the warehouse. The process can be broken down into three steps:
1. Identify the required level of precision. Since one is not provided we use the value associated with $T_{0.95/2,50} \approx 1.96$.
2. Construct a confidence interval for the average value of an individual item using the estimator $\tilde{\mu}$. In this case the confidence interval is:
\[
E[\tilde{\mu}] \pm 1.96\,\mathrm{stddev}(\tilde{\mu}) \Rightarrow \hat{\mu} \pm (1.96)\left(1-\frac{n}{N}\right)^{0.5}\frac{\hat{\sigma}}{\sqrt{n}}
\Rightarrow 2895.29 \pm 1.96\left(1-\frac{50}{1256}\right)^{0.5}\frac{1997.22}{\sqrt{50}}
= 2895.29 \pm (1.96)(276.77) = 2895.29 \pm 542.47 \Leftrightarrow [2352.82,\ 3437.76]
\]
3. Multiply across by the total population of items, that is:
\[
1256\big(2895.29 \pm 542.47\big) \Rightarrow 3636484 \pm 681342 \Leftrightarrow [2955141.92,\ 4317826.56]
\]
It should be noted that the range of possible values the inventory can take is large. The lack of precision makes it difficult to assess whether or not there are material errors in the inventory. One possibility would be to increase the number of sampled items, but this would increase costs, time and effort. An alternative method (the difference estimation method) for estimating the possible inventory values is presented below, along with the average dollar error per item.

Average dollar error per item: Let $\hat{\mu}_{error} = 2.33$ be the sample average error with corresponding standard deviation $\hat{\sigma}_{error} = 41.93$. Using the confidence level defined above ($T_{0.95/2,50} \approx 1.96$) a confidence interval for the average error per item can be constructed:
\[
E[\tilde{\mu}_{error}] \pm 1.96\,\mathrm{stddev}(\tilde{\mu}_{error}) \Rightarrow \hat{\mu}_{error} \pm (1.96)\left(1-\frac{n}{N}\right)^{0.5}\frac{\hat{\sigma}_{error}}{\sqrt{n}}
= 2.33 \pm (1.96)\left(1-\frac{50}{1256}\right)^{0.5}\frac{41.93}{\sqrt{50}}
= 2.33 \pm 11.39
\]
Before continuing it is important to note that if $y_i = (\text{actual value})_i - (\text{stated value})_i$, where $i \in \{1, \ldots, 1256\}$, then $\hat{\mu}_{error}$ and $\hat{\sigma}_{error}$ are defined as:
\[
\bar{y} = \frac{\sum_{i\in s} y_i}{n} = 2.33, \qquad
s_y = \hat{\sigma}_{error} = \sqrt{\frac{\sum_{i\in s}(y_i - \bar{y})^2}{n-1}} = 41.93 \tag{5.2}
\]
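A short R sketch of these calculations; the variable names are illustrative and the numbers are the summary statistics quoted in the example:

    N <- 1256; n <- 50
    mu.hat <- 2895.29; sigma.hat <- 1997.22            # actual value per sampled item
    se.item <- sqrt(1 - n/N) * sigma.hat / sqrt(n)     # about 276.77
    N * (mu.hat + c(-1, 1) * 1.96 * se.item)           # Method 1 interval for the total, about [2955142, 4317827]
    mu.err <- 2.33; sigma.err <- 41.93                 # dollar error per item
    se.err <- sqrt(1 - n/N) * sigma.err / sqrt(n)      # about 5.81
    mu.err + c(-1, 1) * 1.96 * se.err                  # 95% interval for the average error, 2.33 +/- 11.39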
We can now proceed to estimate the total error over $U$ by multiplying the confidence interval for the average error of an individual item by $N = 1256$. Mathematically we let $\hat{\tau}_{error} = N\hat{\mu}_{error}$:
\[
1256(2.33 \pm 11.39) \Rightarrow 2926.48 \pm 14306
\]
In general the average error is more precisely estimated than the average actual value. Using this result a better estimate of the total value of the inventory can be obtained.

Method 2: Initially the error was defined as error = actual value − stated value. This implies actual value = stated value + error, so the estimate of the total inventory under a 95% confidence interval is given by:
\[
4311712 + 2926 \pm 14306 \Rightarrow 4314638 \pm 14306
\]
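A minimal R sketch of the difference estimation method, reusing the per-item error standard error computed above (names illustrative):

    N <- 1256
    stated.total <- 4311712
    err.total <- N * (2.33 + c(-1, 1) * 1.96 * 5.81)   # interval for the total error
    stated.total + err.total                           # adjusted total, about 4314638 +/- 14306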
The proportion of items with counts in error is different from what has been dealt with before, so some theory needs to be developed.

Begin aside: binary variate. When dealing with proportions the attribute of interest is the average of a binary variate, call it $\pi$. A binary variate is a variable that takes the value 1 or 0 depending on whether an event occurs or not. Another name for a binary variate is a Bernoulli trial, which can be represented by the indicator function introduced earlier. If $E$ is the event of interest then the average of the binary variate, $\pi$, is written as:
\[
\pi = \frac{\sum_{i\in U} y_i}{\mathrm{Card}(U)}, \qquad
y_i = \begin{cases} 1 & \text{if } E \text{ occurs} \\ 0 & \text{if } E \text{ does not occur} \end{cases} \tag{5.3}
\]
Notice that $\pi$ is taken over all of $U$. When sampling, however, $\pi$ has to be estimated and the average is taken over the sample $s$ instead of $U$; consequently the sample average
and standard deviation over the random set $s$ are given by:
\[
\hat{\pi} = \frac{\sum_{i\in s} y_i}{\mathrm{Card}(s)} = \frac{\sum_{i=1}^{n} y_i}{n}
\]
\[
\hat{\sigma}^2_{\hat{\pi}} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}
= \frac{\sum_{i=1}^{n}(y_i^2 - 2\bar{y}y_i + \bar{y}^2)}{n-1}
= \frac{\sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} 2\bar{y}y_i + \sum_{i=1}^{n}\bar{y}^2}{n-1}
= \frac{\sum_{i=1}^{n} y_i - n\bar{y}^2}{n-1}
= \frac{n\hat{\pi} - n\hat{\pi}^2}{n-1}
= \frac{n\hat{\pi}(1-\hat{\pi})}{n-1}
\]
since $y_i^2 = y_i$ for a binary variate, $\sum_{i=1}^{n} y_i = n\hat{\pi}$ and $\bar{y} = \hat{\pi}$.
To define the estimator $\tilde{\pi}$, it has to be defined over all of $U$. As before the indicator function $I_i$ is used to transform the random subset $s$ of $U$ into a random variable over $U$:
\[
\tilde{\pi} = \frac{\sum_{i\in U} y_i I_i}{n}
\]
where $I_i$ indicates whether or not unit $i$ is in the sample $s$ and $y_i$ is the binary variate indicating whether event $E$ occurs. The estimator $\tilde{\pi}$ then has:
\[
E[\tilde{\pi}] = E\left[\frac{\sum_{i\in U} y_i I_i}{n}\right] = \pi, \qquad
\mathrm{Var}[\tilde{\pi}] = \mathrm{Var}\left[\frac{\sum_{i\in U} y_i I_i}{n}\right] = \left(1-\frac{n}{N}\right)\frac{\hat{\sigma}^2_{\hat{\pi}}}{n}
\]
End aside

So for the example we have:
\[
E[\tilde{\pi}] = \frac{11}{50}, \qquad
\mathrm{stddev}[\tilde{\pi}] = \sqrt{\left(1-\frac{50}{1256}\right)\frac{\frac{11}{50}\left(1-\frac{11}{50}\right)}{49}} = 0.057988
\]
\[
\hat{\pi} \pm 1.96(0.057988) \Rightarrow \frac{11}{50} \pm 0.113656
\]
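In R the proportion interval can be reproduced with a few lines (a sketch; variable names are illustrative):

    N <- 1256; n <- 50
    pi.hat <- 11/50
    se.pi <- sqrt((1 - n/N) * pi.hat * (1 - pi.hat) / (n - 1))   # about 0.058
    pi.hat + c(-1, 1) * 1.96 * se.pi                             # about 0.22 +/- 0.114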
Note: The sample average $\tilde{\mu}$ is different from its expected value $E[\tilde{\mu}]$, just as the sample standard deviation $\tilde{\sigma}$ is different from the standard deviation of $\tilde{\mu}$.

In cluster sampling the proportion of items displaying a certain attribute is a function of the sample mean, so there is a quick way to find such a proportion. Let $\mu$ be the average over $U$, $\hat{\mu}$ the sample average, $\hat{\sigma}$ the sample standard deviation, $N$ the population size, $n$ the sample size and $k$ the cluster size. Then $\pi = \mu/k$, estimated by $\hat{\pi} = \hat{\mu}/k$, and dividing the usual interval throughout by $k$ gives:
\[
\hat{\mu} \pm Z_{\alpha/2}\sqrt{1-\frac{n}{N}}\,\frac{\hat{\sigma}}{\sqrt{n}}
\;\Rightarrow\;
\hat{\pi} \pm Z_{\alpha/2}\sqrt{1-\frac{n}{N}}\,\frac{\hat{\sigma}/k}{\sqrt{n}}
= \hat{\pi} \pm Z_{\alpha/2}\sqrt{1-\frac{n}{N}}\,\frac{\hat{\sigma}_{\hat{\pi}}}{\sqrt{n}}
\]
Sometimes the estimates derived in this manner for proportions have a large spread; in such cases they serve more as a tool for indicating whether or not the proportion is significantly larger than some cutoff point. Example 5.3.5
5.3.1 Sample Size Determination
Sample size is often determined by the objective of the study. This requires the experimenter to know the target population, the attributes being measured and the precision required of the estimates. The length of a confidence interval is the most commonly used bound to determine the precision of an estimate, although there are alternative formulations, for example using relative error. For the moment let us consider only the length of a confidence interval.

Length of a confidence interval: Suppose we want a confidence interval that is $2l$ in length, i.e. a confidence interval of the form $\hat{\mu} \pm l$. Then using the standard deviation derived above for SRS (or a close facsimile), $l$ must equal:
\[
l = c\left(1-\frac{n}{N}\right)^{0.5}\frac{\hat{\sigma}}{\sqrt{n}}
\]
which implies the number of samples required to achieve the half-length $l$ is:
\[
n = \frac{1}{\frac{1}{N} + \frac{l^2}{c^2\hat{\sigma}^2}} \tag{5.4}
\]
A couple of points need to be made about the equation above (a short R sketch follows the list):
• Since the study has yet to be carried out the value of $\hat{\sigma}$ is unknown. Framing $n$ as a function of another unknown seems counterintuitive, but it is not: $\hat{\sigma}$ can be estimated via a small pilot study or a similar previous experiment, whereas $n$ cannot.
• There is some leeway in $n$ since we can adjust the confidence level through $c$; it is up to the experimenter to set this level.
• If $N$ is large then $n \approx (c\hat{\sigma})^2/l^2$ is a decent approximation for $n$.
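A minimal R sketch of equation (5.4); the numerical inputs here are illustrative guesses (e.g. a pilot-study estimate of $\sigma$), not values from the text:

    sample.size <- function(N, sigma, l, c = 1.96) {
      ceiling(1 / (1/N + l^2 / (c^2 * sigma^2)))     # equation (5.4), rounded up
    }
    sample.size(N = 1256, sigma = 2000, l = 250)     # required n for this guess
    ceiling((1.96 * 2000 / 250)^2)                   # large-N approximation (c*sigma/l)^2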
Data for Example 5.3.3: stated and actual counts and values of the sampled items.

item  stated number  item price  stated value  actual number  actual value
   1      1335          0.61        814.35         1335          814.35
  25      1192          1.64       1954.88         1192         1954.88
  39      1294          1.50       1941.00         1294         1941.00
  53      1269          2.04       2588.76         1269         2588.76
  56      1427          3.24       4623.48         1419         4597.56
 121      1529          2.81       4296.49         1529         4296.49
 207      1446          2.48       3586.08         1446         3586.08
 212      1106          1.13       1249.78         1106         1249.78
 223       847          4.95       4192.65          847         4192.65
 225      1016         10.27      10434.32         1016        10434.32
 240      1105          3.50       3867.50         1105         3867.50
 252      1297          2.25       2918.25         1297         2918.25
 257       780          0.76        592.80          807          613.32
 260      1361          1.74       2368.14         1361         2368.14
 331      1306          3.11       4061.66         1346         4186.06
 389      1083          4.27       4624.41         1080         4611.60
 408       974          1.51       1470.74          974         1470.74
 413       893          0.64        571.52          892          570.88
 443      1502          4.30       6458.60         1502         6458.60
 477      1248          1.14       1422.72         1248         1422.72
 480      1663          1.39       2311.57         1663         2311.57
 483       878          0.69        605.82          878          605.82
 485      1314          4.38       5755.32         1295         5672.10
 493       897          4.56       4090.32          897         4090.32
 507      1486          1.04       1545.44         1486         1545.44
 561      1558          2.89       4502.62         1558         4502.62
 569      1208          4.51       5448.08         1208         5448.08
 577      1248          1.32       1647.36         1229         1622.28
 585      1252          1.93       2416.36         1252         2416.36
 620       925          2.62       2423.50          975         2554.50
 628      1235          1.96       2420.60         1223         2397.08
 664      1233          0.55        678.15         1233          678.15
 707      1153          1.53       1764.09         1153         1764.09
 724      1041          3.17       3299.97         1041         3299.97
 767      1055          3.18       3354.90         1010         3211.80
 770      1118          4.07       4550.26         1156         4704.92
 827       915          0.35        320.25          915          320.25
 831      1025          0.83        850.75         1025          850.75
 853       827          1.93       1596.11          827         1596.11
 855       798          3.02       2409.96          798         2409.96
 877      1340          0.74        991.60         1340          991.60
 946      1100          5.34       5874.00         1100         5874.00
1002       785          1.53       1201.05          785         1201.05
1051       602          0.07         42.14          602           42.14
1053      1283          0.59        756.97         1283          756.97
1134      1272          2.32       2951.04         1272         2951.04
1215      1215          2.99       3632.85         1215         3632.85
1234      1828          3.12       5703.36         1828         5703.36
1236       868          5.15       4470.20          868         4470.20
Chapter 6
Ratio and Regression Estimation with SRS

6.1 Introduction
In this section ratio and regression estimation with SRS are considered.
6.2 Estimating a Ratio
Sometimes it is worthwhile for the statistician to measure a ratio of interest. For example, consider an unedited manuscript that has a random number of errors on each of its pages. Suppose we take a sample of 50 pages from this manuscript and record both the number of errors on each page and whether each page contains any errors. Then by taking the ratio of the average number of errors per page to the proportion of pages with errors in the sample, we can estimate the average number of errors on pages that contain errors. Previously, when we used proportions, only the numerator was random; with a ratio both the numerator and denominator are random. For the manuscript example we can write the ratio of the average to the proportion as:
\[
\theta = \frac{\sum_{i\in U} y_i z_i}{\sum_{i\in U} z_i}
= \frac{\sum_{i\in U} y_i z_i / N}{\sum_{i\in U} z_i / N}
= \frac{\mu}{\pi}, \qquad
z_i = \begin{cases} 1 & \text{if the } i\text{th page has at least one error} \\ 0 & \text{otherwise} \end{cases}
\]
where $y_i$ is the number of errors on the $i$th page. There are a couple of things to note about the equation above:
• $\sum_{i\in U} y_i z_i = \sum_{i\in U} y_i$, since $y_i = 0$ whenever $z_i = 0$.
• $\pi$ is the proportion of pages with errors and $\mu$ is the average number of errors per page, so $\theta = \mu/\pi$ is the average number of errors per page among pages that contain errors.
• Since our sample is over $s$, the estimate of $\theta$ is $\hat{\theta} = \hat{\mu}/\hat{\pi}$ and the corresponding estimator is $\tilde{\theta} = \tilde{\mu}/\tilde{\pi}$.
Notice that $\tilde{\theta}$ is a ratio of two random variables; finding its exact distribution is difficult, so we need an approximation. We can obtain one using Taylor's expansion for two variables. Recall that a function of two variables $(x, y)$ may be expanded around the point $(x_0, y_0)$ via:
\[
f(x,y) \approx f(x_0,y_0) + (x-x_0)\left.\frac{\partial f(x,y)}{\partial x}\right|_{(x_0,y_0)} + (y-y_0)\left.\frac{\partial f(x,y)}{\partial y}\right|_{(x_0,y_0)} + \ldots
\]
For our purposes the linear portion of the expansion suffices for a good approximation. If $f(x,y) = x/y$ then
\[
\left.\frac{\partial f(x,y)}{\partial x}\right|_{(x_0,y_0)} = \frac{1}{y_0}, \qquad
\left.\frac{\partial f(x,y)}{\partial y}\right|_{(x_0,y_0)} = -\frac{x_0}{y_0^2}
\]
which implies
\[
\frac{x}{y} \approx \frac{x_0}{y_0} + \frac{1}{y_0}(x-x_0) - \frac{x_0}{y_0^2}(y-y_0) \tag{6.1}
\]
and hence
\[
\tilde{\theta} = \frac{\tilde{\mu}}{\tilde{\pi}} \approx \frac{\mu}{\pi} + \frac{1}{\pi}(\tilde{\mu}-\mu) - \frac{\mu}{\pi^2}(\tilde{\pi}-\pi).
\]
When sampling from a large population the approximation above is reasonable since $(\tilde{\mu}, \tilde{\pi})$ will be close to $(\mu, \pi)$. The mean and variance of $\tilde{\theta}$ can now be found. They are given by:
\[
E[\tilde{\theta}] \approx \frac{\mu}{\pi} + \frac{1}{\pi}E[\tilde{\mu}-\mu] - \frac{\mu}{\pi^2}E[\tilde{\pi}-\pi] = \frac{\mu}{\pi} = \theta
\]
\[
\mathrm{Var}[\tilde{\theta}] \approx \mathrm{Var}\left[\frac{\mu}{\pi} + \frac{1}{\pi}(\tilde{\mu}-\mu) - \frac{\mu}{\pi^2}(\tilde{\pi}-\pi)\right]
= \mathrm{Var}\left[\frac{1}{\pi}\Big(\tilde{\mu} - \underbrace{\tfrac{\mu}{\pi}}_{\theta}\,\tilde{\pi}\Big)\right]
= \frac{1}{\pi^2}\mathrm{Var}[\tilde{\mu} - \theta\tilde{\pi}]
\]
Note: $\tilde{\theta}$ is an approximately unbiased estimator, since the approximations above give $E[\tilde{\theta}] \approx \theta$ and an expression for $\mathrm{Var}[\tilde{\theta}]$ that is close to the true variance. From a mechanical point of view, when we estimate $\tilde{\mu} - \theta\tilde{\pi}$ we have:
\[
\tilde{\mu} - \theta\tilde{\pi} \;\Longrightarrow\; \hat{\mu} - \hat{\theta}\hat{\pi}
= \underbrace{\frac{\sum_{i\in s}(y_i - \hat{\theta}z_i)}{n}}_{\text{sample average}}
= \frac{\sum_{i\in s} r_i}{n}
\]
As before the $r_i$ are defined on the random subset $s$ of $U$. Transforming to a random variable over $U$ using the indicator function $I_i$:
\[
\mathrm{Var}(\tilde{\mu} - \theta\tilde{\pi})
= \mathrm{Var}\left(\frac{\sum_{i\in U}(y_i - \theta z_i)I_i}{n}\right)
= \mathrm{Var}\left(\frac{\sum_{i\in U} r_i I_i}{n}\right)
= (1-f)\frac{\sigma^2_{ratio}}{n}
\]
which is estimated by
\[
\frac{(1-f)}{n}\frac{\sum_{i\in s}(r_i - \bar{r})^2}{n-1}
= \frac{(1-f)}{n}\frac{\sum_{i\in s}\big[y_i - \bar{y} - \hat{\theta}(z_i - \bar{z})\big]^2}{n-1}
= \frac{(1-f)}{n}\frac{\sum_{i\in s}(y_i - \hat{\theta}z_i)^2}{n-1}
\]
where the last equality uses $\hat{\theta} = \bar{y}/\bar{z}$, so that $\hat{\theta}\bar{z} = \bar{y}$. Now that we have an estimate of $\mathrm{Var}(\tilde{\mu} - \theta\tilde{\pi})$, the variance of $\tilde{\theta}$ is:
\[
\widehat{\mathrm{Var}}(\tilde{\theta}) = \frac{1}{\hat{\pi}^2}\,\frac{(1-f)}{n}\,\frac{\sum_{i\in s}(y_i - \hat{\theta}z_i)^2}{n-1}
\]
where $\sum_{i\in s}(y_i - \hat{\theta}z_i)^2/(n-1)$ is the sample variance of the estimated residuals.
Note: There are a couple of things to note about the equations above:
• When computing a ratio estimate in R or SPSS a bit of programming has to be done to find the residuals and the standard deviation of $\tilde{\theta}$, since neither package understands ratio estimates directly. The residuals can be computed as r <- y - thetahat*z, where y and z are the vectors of sampled values and thetahat is $\hat{\theta}$. Multiplying the sample standard deviation of the residuals, $\hat{\sigma}_{ratio}$, by the factor $(1-f)^{0.5}\frac{1}{\hat{\pi}\sqrt{n}}$ gives the estimated standard deviation of $\tilde{\theta}$ (a short R sketch is given after this list).
• The confidence interval determined in this fashion will be wider than that for $\tilde{\mu}$ because the errors of both $\tilde{\mu}$ and $\tilde{\pi}$ are incorporated into $\hat{\sigma}_{ratio}$.
• The expansion approach using Taylor's theorem can be applied to other ratio estimators.
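The following is a minimal sketch of that computation in R. The function name and arguments are illustrative; y and z are the sampled variates and N is the population size:

    ratio.se <- function(y, z, N) {
      n <- length(y)
      theta.hat <- mean(y) / mean(z)
      r <- y - theta.hat * z                      # estimated residuals
      sigma.ratio <- sd(r)                        # sample sd of the residuals
      se <- sqrt(1 - n/N) * sigma.ratio / (mean(z) * sqrt(n))
      list(theta.hat = theta.hat, se = se)
    }

For the manuscript example, y would be the number of errors on each sampled page and z the indicator of whether the page contains any errors.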
Example 6.2.1
6.2.1 Ratio Estimation of the Average
In this section we refine the methods for estimating the population average of a variate $y$. This requires the use of other explanatory variates. The idea is to adjust the sample average $\hat{\mu}(y)$ to account for differences between the sample and known population attributes of the explanatory variates; that is, we modify $\hat{\mu}(y)$ to reflect information we have on the other explanatory variates. (Notice that the notation $\hat{\mu}(y)$ is used for the average of $y$. This is done because we will be comparing $\hat{\mu}(y)$ with averages of explanatory variates, so for explicitness $\mu(\cdot)$ carries its argument.) The techniques developed here work well when the response is linearly related to the explanatory variates. For simplicity we consider only one explanatory variate. Proceeding by example:

Example Suppose an accounting firm wishes to determine the average amount of money its 12000 clients have invested in stocks. It is known that a linear relationship exists between wealth and the amount of money invested in stocks. Auditing all 12000 clients would take too long, so one of the senior managers suggests taking a simple random sample of 60 clients and constructing a 95% confidence interval for the average amount a client has invested in stocks. The firm provides the information in Tables 6.1 and 6.2 below. It is further known that the mean wealth of a client is $\mu(x) = 50000$. Given this information use the ratio estimate of the averages to determine a 95% confidence interval for the average amount a client has invested in stocks.

Solution The ratio estimate of $\mu(y)$ is given by:
\[
\hat{\mu}(y)_{ratio} = \frac{\hat{\mu}(y)}{\hat{\mu}(x)}\,\mu(x) = \hat{\theta}\mu(x)
\]
Using the results from the estimation of a ratio we have:
\[
E[\tilde{\mu}(y)_{ratio}] = E[\tilde{\theta}\mu(x)] = \mu(x)E[\tilde{\theta}] \approx \mu(x)\theta = \mu(y)
\]
\[
\mathrm{Var}[\tilde{\mu}(y)_{ratio}]
= \mathrm{Var}\left[\mu(x)\left(\frac{\mu(y)}{\mu(x)} + \frac{1}{\mu(x)}(\tilde{\mu}(y)-\mu(y)) - \frac{\mu(y)}{\mu(x)^2}(\tilde{\mu}(x)-\mu(x))\right)\right]
= \mu(x)^2\frac{1}{\mu(x)^2}\mathrm{Var}[\tilde{\mu}(y) - \theta\tilde{\mu}(x)]
= \frac{1}{n}(1-f)\sigma_r^2
\]
14450   8654  13785  10737   7432   7693   8242
 9726   5386  13501  14169   7991  12208   8066
10019  21762  10708  10132  11827  17113  14664
 8310  12755  11155  10898   8133  17138  13713
 9738  13329   7946  10606  13604  15286   8101
12313  10553   9141  13583  10168  13499  10628
11815  14034  14804   8189  13072  10095  12842
 8455   9585   7706   9594  14406  18809  18538
15651  11377  15337  11802

Table 6.1: Amount invested in stocks of 60 people

                             x        y
Sample average            49406    11750
Sample standard deviation  9009     3536

Table 6.2: Average and standard deviation for the sample variates
Since the values of $\theta$ and $\mu(y)$ are unknown they have to be estimated, so the equations above become:
\[
E[\tilde{\mu}(y)_{ratio}] \approx \hat{\mu}(y)_{ratio} = \frac{\hat{\mu}(y)}{\hat{\mu}(x)}\,\mu(x) = \frac{11750}{49406}\,50000 \approx 11891
\]
\[
\widehat{\mathrm{Var}}[\tilde{\mu}(y)_{ratio}] = \frac{1}{n}(1-f)\hat{\sigma}_r^2
= \frac{1}{60}\underbrace{\left(1-\frac{60}{12000}\right)}_{\approx 1}\frac{\sum_{i=1}^{60}(r_i - \bar{r})^2}{60-1}
\approx \frac{1}{60}\,\frac{\sum_{i=1}^{60}(y_i - \hat{\theta}x_i)^2}{59}
= \frac{1}{60}\,\frac{295977563}{59} = 83609.48 = 289.153^2
\]
Using a 95% confidence interval we find:
\[
11891 \pm 1.96(289.153) \Longrightarrow 11891 \pm 566.74
\]
Notice that the estimated standard deviation is rather small relative to the size of the estimate $\tilde{\mu}(y)_{ratio}$. There are a couple of things to note about the ratio estimate:
1. Comparing $\widehat{\mathrm{Var}}(\tilde{\mu}(y)_{ratio})$ to the variance of the ordinary sample average $\mathrm{Var}[\tilde{\mu}(y)]$, we find that:
\[
\sum_{i\in s}(y_i - \hat{\theta}x_i)^2 < \sum_{i\in s}(y_i - \bar{y})^2
\]
The term on the left is the residual sum of squares whereas the term on the right is the total sum of squares. "The ratio estimate is more precise if a line through the origin explains some of the variation."
2. To effectively apply the ratio estimate, the following components are required:
• The explanatory variate $x_i$ must be known for each sampled unit $i$.
• The population average $\mu(x)$ must be known.
• A linear relationship between $x$ and $y$ that passes through the origin must exist, that is, the underlying model must have the form $y = \beta x + \text{noise}$, where noise is the random component of the model. The smaller the noise the better the ratio estimate.
3. The ratio estimate can be rewritten to emphasize the difference between the population and sample means:
\[
\hat{\mu}(y)_{ratio} = \frac{\hat{\mu}(y)}{\hat{\mu}(x)}\,\mu(x) = \frac{\mu(x)}{\hat{\mu}(x)}\,\hat{\mu}(y)
\]
The ratio estimate is in fact an adjustment that better fits the sample mean to the population, given that we know the difference between the sample and true population averages of $x$.
4. In the example above we could have fitted the data to the model $Y_i = \beta x_i + R_i$, $R_i \sim N(0, \sigma^2)$, since both $Y_i$ and $x_i$ are known. Using least squares regression the estimate of $\beta$ would have been $\hat{\beta} = (\sum_{i\in s} x_i y_i)/(\sum_{i\in s} x_i^2)$. Naturally the question arises as to which model provides a higher level of precision. To answer this question we turn to a third model whose random error variance changes with $x_i$, that is, $Y_i = \beta x_i + R_i$, $R_i \sim N(0, \sigma^2 x_i)$. This model can be transformed so that its random error has constant variance by dividing throughout by $\sqrt{x_i}$:
\[
\frac{Y_i}{\sqrt{x_i}} = \beta\sqrt{x_i} + \frac{R_i}{\sqrt{x_i}} \Longrightarrow Y_i^* = \beta x_i^* + R_i^*, \qquad R_i^* \sim N(0, \sigma^2)
\]
For this model it can be shown (via least squares) that $\hat{\beta} = \hat{\theta}$. "Since we are exploiting the structure of the sample population the estimates will be superior."
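A hedged R sketch of the ratio estimate of the average, written as a small function (the name and arguments are illustrative). For the client example, y would be the sampled amounts invested in stocks, x the corresponding wealth values, mu.x = 50000 and N = 12000:

    ratio.mean <- function(y, x, mu.x, N) {
      n <- length(y)
      theta.hat <- mean(y) / mean(x)
      est <- theta.hat * mu.x                    # adjusted sample average
      r <- y - theta.hat * x                     # residuals about the line through the origin
      se <- sqrt((1 - n/N) * sum((r - mean(r))^2) / (n - 1) / n)
      c(estimate = est, lower = est - 1.96 * se, upper = est + 1.96 * se)
    }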
6.2.2 Regression Estimation of the Average
In the previous section the ratio estimate covered the situation where $y$ could be summarized by a linear model that passed through the origin. By using the ratio of the sample averages together with the known population average we were better able to estimate the true average via $\tilde{\mu}(y)_{ratio}$. In this section we consider what happens when the model does not pass through the origin, and ask whether there is a way to improve our estimate, $\tilde{\mu}(y)_{reg}$, using something similar to the ratio of averages. Consider the following simple linear model that does not pass through the
origin:
\[
Y_i = \beta_0 + \beta_1 x_i + R_i \Leftrightarrow Y_i = \alpha + \beta(x_i - \bar{x}) + R_i, \qquad R_i \sim N(0, \sigma^2) \tag{6.2}
\]
Using least squares we obtain the following estimates of $\beta_0$ and $\beta_1$:
\[
\hat{\beta}_1 = \frac{\sum_{i\in s} y_i(x_i - \bar{x})}{\sum_{i\in s}(x_i - \bar{x})^2}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}
\]
which implies the fitted values are:
\[
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i = \bar{y} + \hat{\beta}_1(x_i - \bar{x}) \tag{6.3}
\]
Using the regression model above we obtain the regression estimate:
\[
\hat{\mu}(y)_{reg} = \hat{\mu}(y) + \hat{\beta}_1\big(\mu(x) - \hat{\mu}(x)\big)
\]
$\hat{\mu}(y)_{reg}$ is an adjusted sample average: it reflects the fact that we do not have perfect information from the sample and adjusts the sample average accordingly. If $\hat{\beta}_1$ is positive then $\hat{\mu}(y)$ is adjusted upward when $\mu(x) > \hat{\mu}(x)$ and downward otherwise. The corresponding estimator of $\mu(y)$ is:
\[
\tilde{\mu}(y)_{reg} = \tilde{\mu}(y) + \tilde{\beta}_1\big(\mu(x) - \tilde{\mu}(x)\big)
\]
Notice that there are three estimators in this expression: $\tilde{\mu}(y)$, $\tilde{\mu}(x)$ and $\tilde{\beta}_1$. Working with all three directly is complicated, so we apply an approximation to derive the expected value and variance of $\tilde{\mu}(y)_{reg}$:
\[
\tilde{\mu}(y)_{reg} - \mu(y) = [\tilde{\mu}(y) - \mu(y)] + \beta_1[\mu(x) - \tilde{\mu}(x)] + \underbrace{[\tilde{\beta}_1 - \beta_1][\mu(x) - \tilde{\mu}(x)]}_{\approx 0}
\approx [\tilde{\mu}(y) - \mu(y)] + \beta_1[\mu(x) - \tilde{\mu}(x)]
\]
Taking the expected value of the expression above:
\[
E[\tilde{\mu}(y)_{reg} - \mu(y)] \approx \underbrace{E[\tilde{\mu}(y) - \mu(y)]}_{0} + \beta_1\underbrace{E[\mu(x) - \tilde{\mu}(x)]}_{0} = 0
\;\Rightarrow\; E[\tilde{\mu}(y)_{reg}] \approx \mu(y)
\]
so the regression estimator is approximately unbiased. $\mathrm{Var}(\tilde{\mu}(y)_{reg})$ can be estimated by noting that the sample version of the expression above is a sample average:
\[
\hat{\mu}(y)_{reg} = \frac{\sum_{i\in s}\big[y_i - \hat{\beta}_1(x_i - \mu(x))\big]}{n} = \frac{\sum_{i\in s} r_i}{n}, \qquad i \in \{1, 2, \ldots, n\}
\]
Notice that $x_i$ and $\mu(x)$ have switched places compared with the estimator; this makes the expression more convenient to interpret in terms of residuals. Also note that $\bar{r} = \bar{y} - \hat{\beta}_1(\bar{x} - \mu(x))$. As before, $\hat{\mu}(y)_{reg}$ is defined over the sample $s$; turning it into a random variable over $U$ using the indicator function, the variance is:
\[
\mathrm{Var}(\tilde{\mu}(y)_{reg}) = \mathrm{Var}\left(\frac{\sum_{i\in U} r_i I_i}{n}\right) \approx (1-f)\frac{1}{n}\frac{\sum_{i\in U}(r_i - \mu(r))^2}{N-1}
\]
\[
\widehat{\mathrm{Var}}(\tilde{\mu}(y)_{reg}) = (1-f)\frac{1}{n}\frac{\sum_{i\in s}(r_i - \bar{r})^2}{n-1}
= (1-f)\frac{1}{n}\frac{\sum_{i\in s}\big[y_i - \hat{\beta}_1(x_i - \mu(x)) - \big(\bar{y} - \hat{\beta}_1(\bar{x} - \mu(x))\big)\big]^2}{n-1}
= (1-f)\frac{1}{n}\frac{\sum_{i\in s}\big[y_i - \bar{y} - \hat{\beta}_1(x_i - \bar{x})\big]^2}{n-1}
\]

Example Scientists have been looking into the eating habits of competitive swimmers and their heights. They postulate that calorie consumption increases linearly with the height (inches) of the athlete. They sample 50 swimmers from a large population and obtain the data in Tables 6.3 and 6.4 below. Using regression estimation of the average, find a 95% confidence interval for the average number of calories consumed by competitive swimmers.

Solution Using the equations derived above:
\[
E[\tilde{\mu}(y)_{reg}] \approx \hat{\mu}(y) + \hat{\beta}_1\big(\mu(x) - \bar{x}\big) = 4227.643 + \hat{\beta}_1(72 - 71.647)
\]
\[
\widehat{\mathrm{Var}}[\tilde{\mu}(y)_{reg}] \approx \underbrace{(1-f)}_{\approx 1}\frac{1}{50}\frac{\sum_{i\in s}\big[y_i - \bar{y} - \hat{\beta}_1(x_i - \bar{x})\big]^2}{49}
\approx \frac{1}{50}\,\frac{1118783}{49} = 466.1596 = 21.5903^2
\]
So the 95% confidence interval is given by
\[
4227.643 \pm 1.96(21.5904) \Longleftrightarrow [4185.325,\ 4269.96]
\]
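A hedged R sketch of the regression estimate of the average (function and argument names are illustrative; for the swimmers, x would be the heights, y the calories, mu.x = 72, and N large so that the fpc is about 1):

    reg.mean <- function(y, x, mu.x, N = Inf) {
      n <- length(y)
      beta1.hat <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)   # least squares slope
      est <- mean(y) + beta1.hat * (mu.x - mean(x))                # adjusted sample average
      resid <- y - mean(y) - beta1.hat * (x - mean(x))
      se <- sqrt((1 - n/N) * sum(resid^2) / (n - 1) / n)
      c(estimate = est, lower = est - 1.96 * se, upper = est + 1.96 * se)
    }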
The regression estimate of the average is clearly more precise than the plain sample average $\hat{\mu}(y)$. This occurs because rather than the full sum of squares we are only considering the residual sum of squares. Before completing this chapter there are a couple of points the student should be aware of:
• The regression estimate requires that:
1. The response (y) and explanatory variate (x) are continuous
height    calories   height    calories   height    calories
68.42067  4308.056   69.51221  3918.281   70.33767  4772.298
77.39238  4849.402   69.47255  4478.169   69.02344  4024.440
72.92804  3663.817   72.82152  3498.008   72.04765  3476.555
72.26305  4701.904   70.13799  3640.918   70.29105  5321.765
72.70596  3751.206   72.47334  4100.255   72.27747  4063.645
72.45283  3623.854   69.15696  4472.260   73.82183  3651.791
73.14557  4872.792   72.38771  4378.805   71.48353  3434.279
72.43712  4332.626   72.46891  4447.927   70.08849  4398.249
69.63328  4151.425   72.39594  4518.384   68.94992  4119.804
71.50782  4092.316   71.38718  4556.639   73.23771  4092.453
70.13968  5153.336   72.75844  4776.500   72.42849  4078.680
71.61837  3561.187   66.98620  4099.064   72.38656  4134.123
73.50679  3550.199   72.10575  4882.639   66.74734  4475.718
69.04422  4917.732   72.57716  4410.203   75.39866  4318.440
71.89156  4561.661   70.74978  3359.681   73.42516  3762.639
70.36492  3944.718   72.19328  4804.530   75.38713  4453.943
72.72292  4517.894   71.64298  3906.956

Table 6.3: Height (inches) vs. calories consumed for competitive swimmers

  $\bar{y}$      $\bar{x}$     $\mu(x)$
 4227.643        71.647        72

Table 6.4: Summary of data
2. The study population average of the explanatory variate, $\mu(x)$, must be known.
3. If $x$ and $y$ have a linear relation then smaller residuals lead to a more precise estimate.
• An alternative form of regression estimate is to use a difference as the response variate, i.e. $d_i = y_i - x_i$, and then estimate the population average by:
\[
\hat{\mu}(y) = \hat{\mu}(d) + \mu(x)
\]
Such an estimate is more precise when the variance of the differences is less than the variance of the sample $y_1, \ldots, y_n$.
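As a final hedged sketch, the difference alternative can be coded in a few lines; the standard error below simply applies the usual SRS form to the differences (the same treatment the audit example gave to dollar errors), and the names are illustrative:

    diff.mean <- function(y, x, mu.x, N = Inf) {
      n <- length(y)
      d <- y - x                                  # differences
      est <- mean(d) + mu.x                       # difference estimate of mu(y)
      se <- sqrt((1 - n/N) / n) * sd(d)
      c(estimate = est, lower = est - 1.96 * se, upper = est + 1.96 * se)
    }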