Virtual University of Pakistan Lecture No. 39 of the course on Statistics and Probability by Miss Saleha Naghmi Habibullah
IN THE LAST LECTURE, YOU LEARNT
• Hypothesis-Testing regarding µ1 - µ2 (based on Z-statistic)
• Hypothesis Testing regarding p (based on Z-statistic)
TOPICS FOR TODAY
• Hypothesis Testing Regarding p1-p2 (based on Z-statistic)
• The Student’s t-distribution
• Confidence Interval for µ based on the t-distribution
In the last lecture, we discussed hypothesis-testing regarding p, the proportion of successes in a binomial population.
Next, we consider the case when we are interested in testing the equality of two population proportions. We illustrate this situation with the help of the following example:
EXAMPLE A leading perfume company in a western country recently developed a new perfume which they plan to market under the name 'Fragrance'.
A number of comparison tests indicate that 'Fragrance' has very good market potential. The Sales Departments of the company want to plan their strategy so as to reach and impress the largest possible segments of the buying public.
One of the questions is whether the perfume is preferred by younger or older women. These are two independent populations, a population consisting of the younger women and a population consisting of the older women.
A standard scent test will be used where each sampled woman is asked to sniff several perfumes, one of which is 'Fragrance', and indicate the one that she likes best.
A total of 100 young women were selected at random, and each was given the standard scent test. Twenty of the 100 young women chose 'Fragrance' as the perfume they liked best.
Two hundred older women were selected at random, and each was given the same standard scent test. Of the 200 older women, 100 preferred 'Fragrance'.
Test the hypothesis that there is no difference between the proportions of younger and older women who prefer ‘Fragrance’.
SOLUTION We designate p1 as the proportion of younger women who prefer 'Fragrance' and p2 as the proportion of older women who prefer 'Fragrance'.
Hypothesis-Testing Procedure: Step-1:
H0 : p1 = p2 (i.e. p1-p2 = 0)
(There is no difference between the proportions of young women and older women who prefer 'Fragrance'.)
H1 : p1 ≠ p2 (i.e. p1-p2 ≠ 0)
(The two proportions are not equal.)
Step-2: Level of Significance: α = 0.05.
Step-3: Test Statistic

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}_c \hat{q}_c \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where $\hat{p}_c$, the combined or pooled proportion, is given by:

$$\hat{p}_c = \frac{\text{Total number of successes in the two samples combined}}{\text{Total number of observations in the two samples combined}} = \frac{X_1 + X_2}{n_1 + n_2}$$

This can also be written as

$$\hat{p}_c = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$

which means that $\hat{p}_c$ is the weighted mean of $\hat{p}_1$ and $\hat{p}_2$, with n1 and n2 acting as the weights.
Important Note: In this example, as the hypothesized value of p1 - p2 is equal to zero, both $\hat{p}_1$ and $\hat{p}_2$ are estimating the common population proportion p. Hence, we use the pooled proportion of the two samples to estimate p.
(The rationale is that the pooled estimator $\hat{p}_c$ is a better estimator of the common population proportion p than $\hat{p}_1$ or $\hat{p}_2$ alone, as it is based on n1 + n2 observations, i.e. on a greater amount of information.)
Step-4: Calculations:
X1 is the number of younger women preferring 'Fragrance' = 20.
n1 is the number in the sample = 100.

$$\hat{p}_1 = \frac{X_1}{n_1} = \frac{20}{100} = 0.20$$

X2 is the number of older women preferring 'Fragrance' = 100.
n2 is the number in the sample = 200.

$$\hat{p}_2 = \frac{X_2}{n_2} = \frac{100}{200} = 0.50$$

Now, the pooled or weighted proportion is computed as follows:

$$\hat{p}_c = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{20 + 100}{100 + 200} = \frac{120}{300} = 0.40$$
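Computation:

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}_c \hat{q}_c \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.20 - 0.50}{\sqrt{(0.40)(0.60)\left(\frac{1}{100} + \frac{1}{200}\right)}} = \frac{-0.30}{0.06} = -5.00$$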
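The computation above can be checked with a short Python sketch. The function name `pooled_two_proportion_z` is our own choice for illustration and is not part of the lecture; the formula is the pooled Z statistic of Step-3.

```python
from math import sqrt


def pooled_two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 = p2, using the pooled proportion."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pc_hat = (x1 + x2) / (n1 + n2)  # pooled estimate of the common p
    qc_hat = 1 - pc_hat
    std_error = sqrt(pc_hat * qc_hat * (1 / n1 + 1 / n2))
    return ((p1_hat - p2_hat) - 0) / std_error


# 20 of 100 younger women and 100 of 200 older women preferred 'Fragrance'.
z = pooled_two_proportion_z(20, 100, 100, 200)
print(round(z, 2))  # prints -5.0
```

The hypothesized difference of zero is kept explicit in the numerator so that the code mirrors the formula term by term.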
Step-5: Critical Region
Since H1 does not state any direction (such as p1 < p2), the test is two-tailed. Thus, the critical values for the .05 level are -1.96 and +1.96.
[Figure: Two-Tailed Test, Areas of Rejection and Non-rejection, .05 Level of Significance. H0 is not rejected in the central area of .95 between Z = -1.96 and Z = 1.96; H0 is rejected in each tail of area .025.]
Step-6: Conclusion: The computed z of -5.00 is in the area of rejection, that is, to the left of -1.96. Therefore, the null hypothesis is rejected at the .05 level of significance.
In other words, we conclude that the proportion of young women in the population who prefer 'Fragrance' is not equal to the proportion of older women in the population who prefer 'Fragrance'.
(The difference between the two sample proportions, i.e. 0.30, is so large that it is highly unlikely that such a large difference could be due to chance, i.e. attributable to sampling fluctuations.)
In fact, the value z = -5.00 lies even further out in the left tail than -2.58, the critical value on the left tail of the sampling distribution when α = 0.01. As such, we can say that our statistic is highly significant.
(In such a situation, the statistic is said to be highly significant because of the fact that we are allowing as small a risk of committing Type-I error as 1%.)
Now, consider another situation: Suppose that the computed value of our test-statistic comes out to be such that it falls between -1.96 and -2.58. In such a situation, we will reject H0 at the 5% level of significance, but we cannot reject H0 at the 1% level.
This means that, if we are willing to allow as much as 5% risk of committing type I error, then we say that we are going to reject H0. But if we are willing to allow only 1% risk of committing type I error, then we conclude that the sample does not provide sufficient evidence to reject H0.
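The dependence of the decision on the level of significance can be illustrated with a small sketch using Python's standard library. The helper names and the value z = -2.20 are hypothetical and chosen only for illustration; -2.20 is simply one value lying between -1.96 and -2.58.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def two_tailed_critical_value(alpha):
    """Positive critical value z such that each tail has area alpha / 2."""
    return std_normal.inv_cdf(1 - alpha / 2)


def decision(z, alpha):
    """Two-tailed decision rule: reject H0 when |z| exceeds the critical value."""
    if abs(z) > two_tailed_critical_value(alpha):
        return "reject H0"
    return "do not reject H0"


# -5.00 is the perfume example; -2.20 is a hypothetical in-between value.
for z_value in (-5.00, -2.20):
    print(z_value, "| 5% level:", decision(z_value, 0.05),
          "| 1% level:", decision(z_value, 0.01))
```

For z = -5.00 the null hypothesis is rejected at both levels, whereas for the in-between value it is rejected at the 5% level only.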
Going back to the example of the perfume, the company would obviously be interested in determining which category of women prefers this perfume in greater numbers. The data clearly indicate that the proportion of women who prefer this particular perfume is higher in the population of older women. (This is the reason why the computed value of our test-statistic has come out to be negative.)
Let us consolidate the above ideas by considering another example:
EXAMPLE A candidate for mayor in a large city believes that he appeals to at least 10 per cent more of the educated voters than the uneducated voters. He hires the services of a poll-taking organization, and they find that 62 of 100 educated voters interviewed support the candidate, and 69 of 150 uneducated voters support him.
At the 0.05 significance level, is the hypothesis accepted or rejected?
Step-1: The null and alternative hypotheses are H0 : p1 - p2 ≥ 0.10, and H1 : p1 - p2 < 0.10, where p1 = proportion of educated voters who support the candidate, and p2 = proportion of uneducated voters who support the candidate.
Step-2: Level of Significance: α = 0.05.
Step-3: Test Statistic:

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0.10}{\sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}}$$

which, for large sample sizes, is approximately standard normal.
Important Note: In this example, as the hypothesized value of p1 - p2 is not equal to zero, $\hat{p}_1$ and $\hat{p}_2$ are not estimating the same quantity, and, as such, we do not use $\hat{p}_c$ in the formula of the test statistic.
Step-4: Computation:
$\hat{p}_1 = 0.62$ and $\hat{q}_1 = 0.38$
$\hat{p}_2 = 0.46$ and $\hat{q}_2 = 0.54$
Thus

$$Z = \frac{(0.62 - 0.46) - 0.10}{\sqrt{\frac{(0.62)(0.38)}{100} + \frac{(0.46)(0.54)}{150}}} = 0.95$$
Step-5: Critical Region: As this is a one-tailed test, the critical region is given by Z < -z0.05 = -1.645
Step-6: Conclusion: Since the calculated value z = 0.95 does not fall in the critical region, we accept the null hypothesis H0 : p1 - p2 ≥ 0.10. The data seem to support the candidate's view.
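The mayoral example can be reproduced with the following sketch. The function name `unpooled_two_proportion_z` is our own and not part of the lecture; note that, unlike the perfume example, the standard error here uses the two sample proportions separately rather than the pooled proportion.

```python
from math import sqrt
from statistics import NormalDist


def unpooled_two_proportion_z(x1, n1, x2, n2, hypothesized_diff):
    """Z statistic for a non-zero hypothesized value of p1 - p2 (no pooling)."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    std_error = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return ((p1_hat - p2_hat) - hypothesized_diff) / std_error


# 62 of 100 educated voters and 69 of 150 uneducated voters support the candidate.
z = unpooled_two_proportion_z(62, 100, 69, 150, 0.10)

# Left-tailed test at the 5% level: reject H0 when z < -z_0.05.
critical = NormalDist().inv_cdf(0.05)
reject_h0 = z < critical

print(round(z, 2), round(critical, 3), reject_h0)  # prints 0.95 -1.645 False
```

Since z does not fall below the critical value, the sketch agrees with the conclusion of Step-6.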
Until now, we have discussed in considerable detail interval estimation and hypothesis-testing based on the standard normal distribution and the Z-statistic. Next, we begin the discussion of interval estimation and hypothesis-testing based on the t-distribution:
We begin by presenting the formal definition of the t-distribution and stating some of its main properties:
The Student’s t-Distribution
The mathematical equation of the t-distribution is as follows:

$$f(x) = \frac{1}{\sqrt{\nu}\, \beta\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu + 1)/2}, \quad -\infty < x < \infty$$
This distribution has only one parameter ν, which is known as the degrees of freedom of the t-distribution.
Properties of Student’s t-distribution:
The t-distribution has the following properties:
i) The t-distribution is bell-shaped and symmetric about the value t = 0, ranging from -∞ to ∞.
ii) The number of degrees of freedom determines the shape of the t-distribution. Thus there is a different t-distribution for each number of degrees of freedom. As such, it is a whole family of distributions.
The t-distribution, for small values of ν, is flatter than the standard normal distribution which means that the t-distribution is more spread out in the tails than is the standard normal distribution.
[Figure: the standard normal distribution compared with the t-distribution having 3 degrees of freedom.]
As the degrees of freedom increase, the t-distribution becomes narrower and narrower, until, as ν tends to infinity, it tends to coincide with the standard normal distribution. (The t-distribution can never become narrower than the standard normal distribution.)
iii) The t-distribution has a mean of zero, when ν ≥ 2. (The mean does not exist when ν = 1.)
iv) The median of the t-distribution is also equal to zero.
v) The t-distribution is unimodal. The density of the distribution reaches its maximum at t = 0 and thus the mode of the t-distribution is t = 0.
(The students will recall that, for any hump-shaped symmetric distribution, the mean, median and mode are equal.)
vi) The variance of the t-distribution is given by $\sigma^2 = \frac{\nu}{\nu - 2}$ for ν > 2. It is always greater than 1, the variance of the standard normal distribution. (This indicates that the t-distribution is more spread out than the standard normal distribution.) For ν ≤ 2, the variance does not exist.
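Some of these properties can be explored numerically. The following is a minimal sketch using only Python's standard library; the function names `t_pdf`, `normal_pdf` and `t_variance` are our own. The density is written with gamma functions, using the identity β(1/2, ν/2) = Γ(1/2)Γ(ν/2)/Γ((ν+1)/2) with Γ(1/2) = √π.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(x, nu):
    """Density of the t-distribution with nu degrees of freedom."""
    # log of 1 / (sqrt(nu) * beta(1/2, nu/2)), expressed through log-gamma
    log_norm = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
    return exp(log_norm - (nu + 1) / 2 * log(1 + x * x / nu))


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return exp(-x * x / 2) / sqrt(2 * pi)


def t_variance(nu):
    """Variance nu / (nu - 2); it does not exist for nu <= 2."""
    if nu <= 2:
        raise ValueError("the variance does not exist for nu <= 2")
    return nu / (nu - 2)


# Compare the peak (x = 0) and the tail (x = 3) for increasing degrees of freedom.
for nu in (3, 10, 30):
    print(nu, round(t_pdf(0, nu), 4), round(t_pdf(3, nu), 4), round(t_variance(nu), 3))
print("normal", round(normal_pdf(0), 4), round(normal_pdf(3), 4))
```

For small ν the peak at zero is lower and the tail density is higher than for the standard normal curve, and the variance stays above 1 while approaching it as ν grows.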
Next, we discuss the application of the t-distribution in statistical inference, that is, those situations where we need to carry out interval estimation and hypothesis-testing on the basis of the t-distribution (situations where the t-distribution is the appropriate sampling distribution).
With reference to interval estimation and hypothesis-testing about µ, it has been mathematically proved that, if the population from which the sample has been drawn is normally distributed, the population variance is unknown, and the sample size is small (less than 30), then the statistic

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}, \quad \text{where} \quad s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}},$$

follows the t-distribution having n-1 degrees of freedom.
First, we discuss the construction of a Confidence Interval for µ based on the t-distribution with the help of an example:
EXAMPLE The masses, in grams, of thirteen ball bearings taken at random from a batch are 21.4, 23.1, 25.9, 24.7, 23.4, 24.5, 25.0, 22.5, 26.9, 26.4, 25.8, 23.2, 21.9
Calculate a 95% confidence interval for the mean mass of the population, supposed normal, from which these masses were drawn.
SOLUTION The 95% confidence interval for the mean mass of the population, µ, is given by

$$\bar{X} \pm t_{\alpha/2}(n-1) \frac{s}{\sqrt{n}}$$
(The derivation of the above confidence interval is very similar to that of the confidence interval for µ based on the Z-statistic.)
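Now, in this problem, the sample mean $\bar{X}$ and s come out to be:

$$\bar{X} = \frac{\sum X}{n} = \frac{314.7}{13} = 24.21,$$

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{1}{n - 1}\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]} = \sqrt{\frac{1}{12}\left[7655.59 - 7618.16\right]} = \sqrt{\frac{37.43}{12}} = \sqrt{3.12} = 1.77$$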
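These two quantities can be verified with a short sketch. It uses `statistics.mean` and `statistics.stdev` from Python's standard library (the latter divides by n - 1), and also evaluates the shortcut formula given above; the variable names are our own.

```python
from math import sqrt
from statistics import mean, stdev

masses = [21.4, 23.1, 25.9, 24.7, 23.4, 24.5, 25.0,
          22.5, 26.9, 26.4, 25.8, 23.2, 21.9]

n = len(masses)
x_bar = mean(masses)
s = stdev(masses)  # sample standard deviation with divisor n - 1

# Shortcut formula: s^2 = [sum(X^2) - (sum X)^2 / n] / (n - 1)
sum_x = sum(masses)
sum_x2 = sum(x * x for x in masses)
s_shortcut = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(n, round(x_bar, 2), round(s, 2))  # prints 13 24.21 1.77
```

Both routes to s agree up to floating-point rounding.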
The question is: ‘How do we find $t_{\alpha/2}(n-1)$?’
For this purpose, we will need to consult the table of areas under the t-distribution:
Table of areas under the t-distribution: Upper Percentage Points of the t-Distribution

 ν \ α    0.25     0.10     0.05    0.025     0.01    0.005    0.001
   1     1.000    3.078    6.314   12.706   31.821   63.657  318.310
   2     0.816    1.886    2.920    4.303    6.965    9.925   22.327
   3     0.765    1.638    2.353    3.182    4.541    5.841   10.214
   4     0.741    1.533    2.132    2.776    3.747    4.604    7.173
   5     0.727    1.476    2.015    2.571    3.365    4.032    5.893
   6     0.718    1.440    1.943    2.447    3.143    3.707    5.208
   7     0.711    1.415    1.895    2.365    2.998    3.499    4.785
   8     0.706    1.397    1.860    2.306    2.896    3.355    4.501
   9     0.703    1.383    1.833    2.262    2.821    3.250    4.297
  10     0.700    1.372    1.812    2.228    2.764    3.169    4.144
  11     0.697    1.363    1.796    2.201    2.718    3.106    4.025
  12     0.695    1.356    1.782    2.179    2.681    3.055    3.930
  13     0.694    1.350    1.771    2.160    2.650    3.012    3.852
  14     0.692    1.345    1.761    2.145    2.624    2.977    3.787
  15     0.691    1.341    1.753    2.131    2.602    2.947    3.733
  16     0.690    1.337    1.746    2.120    2.583    2.921    3.686
  17     0.689    1.333    1.740    2.110    2.567    2.898    3.646
  18     0.688    1.330    1.734    2.101    2.552    2.878    3.610
  19     0.688    1.328    1.729    2.093    2.539    2.861    3.579
  20     0.687    1.325    1.725    2.086    2.528    2.845    3.552
  21     0.686    1.323    1.721    2.080    2.518    2.831    3.527
  22     0.686    1.321    1.717    2.074    2.508    2.819    3.505
  23     0.685    1.319    1.714    2.069    2.500    2.807    3.485
  24     0.685    1.318    1.711    2.064    2.492    2.797    3.467
  25     0.684    1.316    1.708    2.060    2.485    2.787    3.450
  26     0.684    1.315    1.706    2.056    2.479    2.779    3.435
  27     0.684    1.314    1.703    2.052    2.473    2.771    3.421
  28     0.683    1.313    1.701    2.048    2.467    2.763    3.408
  29     0.683    1.311    1.699    2.045    2.462    2.756    3.396
  30     0.683    1.310    1.697    2.042    2.457    2.750    3.385
  40     0.681    1.303    1.684    2.021    2.423    2.704    3.307
  60     0.679    1.296    1.671    2.000    2.390    2.660    3.232
 120     0.677    1.289    1.658    1.980    2.358    2.617    3.160
   ∞     0.674    1.282    1.645    1.960    2.326    2.576    3.090
The above table is an abridged version of the table by Fisher and Yates, and the entries in this table are the values of $t_{\alpha}(\nu)$ for which the area to their right under the t-distribution with ν degrees of freedom is equal to α, as shown below:
[Figure: t-distribution curve centred at 0, with the area α shaded to the right of tα.]
Now, in this problem, since n - 1 = 12, and the desired level of confidence is 95%, the right-tail area is 2½%, and, hence, (using the t-table) we obtain t0.025(12) = 2.179
Substituting these values, we obtain the 95% confidence interval for µ as follows:

$$24.21 \pm 2.179 \left(\frac{1.77}{\sqrt{13}}\right)$$

or 24.21 ± 2.179 (0.49)
or 24.21 ± 1.07
or 23.14 to 25.28
Hence, the 95% confidence interval for the mean mass of the ball bearings calculated from the given sample is (23.1, 25.3) grams.
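The whole calculation can be put together in one sketch. The function name `t_confidence_interval` is our own; since Python's standard library has no t-distribution quantile function, the critical value 2.179 is entered by hand from the t-table, which is an assumption the reader must supply for other sample sizes or confidence levels.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(data, t_critical):
    """Confidence interval X-bar +/- t * s / sqrt(n) for the population mean.

    t_critical must be looked up in the t-table for n - 1 degrees of freedom.
    """
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # divisor n - 1
    margin = t_critical * s / sqrt(n)
    return x_bar - margin, x_bar + margin


masses = [21.4, 23.1, 25.9, 24.7, 23.4, 24.5, 25.0,
          22.5, 26.9, 26.4, 25.8, 23.2, 21.9]

# t_0.025(12) = 2.179, read from the t-table (n - 1 = 12 degrees of freedom).
lower, upper = t_confidence_interval(masses, 2.179)
print(round(lower, 2), round(upper, 2))  # prints 23.14 25.28
```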
IN TODAY’S LECTURE, YOU LEARNT
• Hypothesis Testing Regarding p1-p2 (based on Z-statistic)
• The Student’s t-distribution
• Confidence Interval for µ based on the t-distribution
IN THE NEXT LECTURE, YOU WILL LEARN
• Tests and Confidence Intervals based on the t-distribution
• The Chi-square Distribution
• Hypothesis Testing and Interval Estimation Regarding a Population Variance (based on Chi-square Distribution)