STA301_LEC12


Virtual University of Pakistan Lecture No. 12 Statistics and Probability by Miss Saleha Naghmi Habibullah


IN THE LAST LECTURE, YOU LEARNT

• Mean Deviation
• Standard Deviation and Variance
• Coefficient of Variation


TOPICS FOR TODAY

• Chebychev’s Inequality
• The Empirical Rule
• The Five-Number Summary


Question-1: ‘How many measurements are within 1 standard deviation of the mean?’
Question-2: ‘How many measurements are within 2 standard deviations?’
and so on.

For any specific data set, we can answer these questions by counting the number of measurements in each of the intervals. However, if we are interested in obtaining a general answer to these questions, the problem is a bit more difficult.


Two sets of answers are available. The first, which applies to any set of data, is derived from a theorem proved by the Russian mathematician P.L. Chebychev (1821-1894). The second, which applies to mound-shaped, symmetric distributions of data, is based upon empirical evidence that has accumulated over the years. This second set of answers remains valid and applicable even if our distribution is slightly skewed.


Let us begin with Chebychev’s theorem. Chebychev’s Rule applies to any data set, regardless of the shape of the frequency distribution of the data:

“For any number k greater than 1, at least $1 - 1/k^2$ of the data-values fall within k standard deviations of the mean, i.e., within the interval $(\bar{X} - kS, \bar{X} + kS)$.”

This means that:

a) At least $1 - 1/2^2 = 3/4$ of the data-values will fall within 2 standard deviations of the mean, i.e. within the interval $(\bar{X} - 2S, \bar{X} + 2S)$.
b) At least $1 - 1/3^2 = 8/9$ of the data-values will fall within 3 standard deviations of the mean, i.e. within the interval $(\bar{X} - 3S, \bar{X} + 3S)$.

Because Chebychev’s theorem requires k to be greater than 1, it provides no useful information on the fraction of measurements that fall within 1 standard deviation of the mean, i.e. within the interval $(\bar{X} - S, \bar{X} + S)$.
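The bound quoted above is easy to evaluate for any k. The following is a minimal sketch in Python; the function name `chebyshev_lower_bound` is only an illustrative choice and does not come from the lecture.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum fraction of data-values lying within k standard deviations
    of the mean, according to Chebychev's rule (1 - 1/k^2)."""
    if k <= 1:
        # For k <= 1 the expression 1 - 1/k^2 is zero or negative,
        # so the rule says nothing useful.
        raise ValueError("Chebychev's rule is informative only for k > 1")
    return 1 - 1 / k**2


# Print the guaranteed minimum fraction for a few values of k.
for k in (2, 3, 4, 5):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.4f} of the data")
```

For k = 2 and k = 3 this reproduces the fractions 3/4 and 8/9 stated in (a) and (b).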


Next, let us consider the Empirical Rule mentioned above. This is a rule of thumb that applies to data sets with frequency distributions that are mound-shaped and symmetric, as follows:

[Figure: mound-shaped, symmetric frequency curve. Horizontal axis: Measurements; vertical axis: Relative Frequency]


According to this empirical rule:

a) Approximately 68% of the measurements will fall within 1 standard deviation of the mean, i.e. within the interval $(\bar{X} - S, \bar{X} + S)$.
b) Approximately 95% of the measurements will fall within 2 standard deviations of the mean, i.e. within the interval $(\bar{X} - 2S, \bar{X} + 2S)$.
c) Approximately 100% (practically all) of the measurements will fall within 3 standard deviations of the mean, i.e. within the interval $(\bar{X} - 3S, \bar{X} + 3S)$.


EXAMPLE

The percentages of revenues spent on R&D (i.e. Research and Development) by 50 companies are:

13.5    7.2    9.7   11.3    8.0
 9.5    7.1    7.5    5.6    7.4
 8.2    9.0    7.2   10.1   10.5
 6.5    9.9    5.9    8.0    7.8
 8.4    8.2    6.6    8.5    7.9
 8.1   13.2   11.1   11.7    6.5
 6.9    9.2    8.8    7.1    6.9
 7.5    6.9    5.2    7.7    6.5
10.5    9.6   10.6    9.4    6.8
13.5    7.7    8.2    6.0    9.5

Calculate the proportions of these measurements that lie within the intervals $\bar{X} \pm S$, $\bar{X} \pm 2S$, and $\bar{X} \pm 3S$, and compare the results with the theoretical values. The mean and standard deviation of these data come out to be 8.49 and 1.98, respectively.


Hence (X – S,X + S) = (8.49 – 1.98, 8.49 + 1.98) = (6.51, 10.47) A check of the measurement reveals that 34 of the 50 measurements, or 68%, fall between 6.51 and 10.47. Similarly, the interval (X – 2S,X + 2S) = (8.49 – 3.96, 8.49 + 3.96) = (4.53, 12.45) contains 47 of the 50 measurements, i.e. 94% of the data-values.


Finally, the 3-standard deviation interval around X, i.e. (X – 3S,X + 3S) = (8.49 – 5.94, 8.49 + 5.94) = (2.55, 14.43) contains all, or 100%, of the measurements. In spite of the fact that the distribution of these data is skewed to the right, the percentages of data-values falling within 1, 2, and 3 standard deviations of the mean are remarkably close to the theoretical values (68%, 95%, and 100%) given by the Empirical Rule.
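The counting in this example can be reproduced with a short Python sketch. It assumes that the standard deviation quoted above is the sample standard deviation (divisor n – 1), which is what `statistics.stdev` computes. The variable and function names are illustrative only.

```python
from statistics import mean, stdev

# The 50 R&D percentages from the example, entered row by row.
rd_percentages = [
    13.5, 7.2, 9.7, 11.3, 8.0,
    9.5, 7.1, 7.5, 5.6, 7.4,
    8.2, 9.0, 7.2, 10.1, 10.5,
    6.5, 9.9, 5.9, 8.0, 7.8,
    8.4, 8.2, 6.6, 8.5, 7.9,
    8.1, 13.2, 11.1, 11.7, 6.5,
    6.9, 9.2, 8.8, 7.1, 6.9,
    7.5, 6.9, 5.2, 7.7, 6.5,
    10.5, 9.6, 10.6, 9.4, 6.8,
    13.5, 7.7, 8.2, 6.0, 9.5,
]

x_bar = mean(rd_percentages)
s = stdev(rd_percentages)  # assumption: sample SD with divisor n - 1


def count_within(data, centre, spread, k):
    """Count the values lying strictly inside (centre - k*spread, centre + k*spread)."""
    lower, upper = centre - k * spread, centre + k * spread
    return sum(lower < x < upper for x in data)


counts = {k: count_within(rd_percentages, x_bar, s, k) for k in (1, 2, 3)}

n = len(rd_percentages)
print(f"mean = {x_bar:.2f}, standard deviation = {s:.2f}")
for k, c in counts.items():
    print(f"within {k} SD: {c} of {n} ({100 * c / n:.0f}%)")
```

Under that assumption the sketch gives the same counts as the hand calculation: 34, 47 and 50 values within 1, 2 and 3 standard deviations of the mean.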


The fact of the matter is that, unless the distribution is extremely skewed, the mound-shaped approximations will be reasonably accurate. Of course, no matter what the shape of the distribution, Chebychev’s Rule assures that at least 75% and at least 89% (8/9) of the measurements will lie within 2 and 3 standard deviations of the mean, respectively. In this example, 94% of the values lie inside the interval $\bar{X} \pm 2S$, and this percentage IS greater than 75%. Similarly, 100% of the values lie inside the interval $\bar{X} \pm 3S$, and this percentage IS greater than 89%.


In the last lecture, we noted that when all the values in a set of data are located near their mean, they exhibit a small amount of variation or dispersion. And those sets of data in which some values are located far from their mean have a large amount of dispersion. Expressing these relationships in terms of the standard deviation, which measures dispersion, we can say that when the values of a set of data are concentrated near their mean, the standard deviation is small. And when the values of a set of data are scattered widely about the mean, the standard deviation is large.


In exactly the same way, if the standard deviation computed from a set of data is large, the values from which it is computed are dispersed widely about their mean. A useful rule that illustrates the relationship between dispersion and standard deviation is given by Chebychev’s theorem, named after the Russian mathematician P.L. Chebychev (1821-1894). This theorem enables us to calculate, for any set of data, the minimum proportion of values that can be expected to lie within a specified number of standard deviations of the mean.


The theorem tells us that at least 75% of the values in a set of data can be expected to fall within two standard deviations of the mean, at least 89% (8/9) within three standard deviations of the mean, and at least 94% (15/16) within four standard deviations of the mean.

CHEBYCHEV’S THEOREM

Given a set of n observations $x_1, x_2, x_3, \ldots, x_n$ on the variable X, the probability is at least $1 - 1/k^2$ that X will take on a value within k standard deviations of the mean of the set of observations (where k > 1).


Suppose that a set of data has a mean of 150 and a standard deviation of 25. Putting k = 2 in Chebychev’s theorem, at least $1 - 1/2^2 = 0.75$, i.e. 75%, of the data-values will lie within two standard deviations of the mean. Since the standard deviation is 25, we have 2(25) = 50, and at least 75% of the data-values will lie between 150 – 50 = 100 and 150 + 50 = 200. By similar calculations with k = 3, we find that we can expect at least 89% of the values to be between 75 and 225. Putting k = 5 in the formula $1 - 1/k^2$, we can expect at least 96% to be between 25 and 275.


Suppose that another set of data has the same mean as before, i.e. 150, but a standard deviation of 10. Applying Chebychev’s theorem, for this set of data we can expect at least 75% of the values to be between 130 and 170, at least 89% to be between 120 and 180, and at least 96% to be between 100 and 200. The above results are summarized in the following table:

PERCENTAGE OF DATA    FOR DATA-SET NO. 1        FOR DATA-SET NO. 2
At least 75%          Lies between 100 & 200    Lies between 130 & 170
At least 89%          Lies between 75 & 225     Lies between 120 & 180
At least 96%          Lies between 25 & 275     Lies between 100 & 200
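The intervals in this comparison follow mechanically from the mean, the standard deviation and k. A minimal Python sketch that regenerates them is given here; the helper name `chebyshev_interval` is hypothetical and chosen only for this illustration.

```python
def chebyshev_interval(mean, sd, k):
    """Return (lower, upper, bound): the interval mean +/- k*sd together with
    the minimum fraction of data that Chebychev's theorem places inside it."""
    if k <= 1:
        raise ValueError("Chebychev's theorem requires k > 1")
    half_width = k * sd
    return mean - half_width, mean + half_width, 1 - 1 / k**2


# Both data sets have mean 150; only the standard deviation differs.
for label, sd in (("Data-set no. 1", 25), ("Data-set no. 2", 10)):
    for k in (2, 3, 5):
        lower, upper, bound = chebyshev_interval(150, sd, k)
        print(f"{label}: at least {bound:.0%} lies between {lower} and {upper}")
```

The smaller standard deviation yields the narrower interval for every k, while the guaranteed percentage depends on k alone.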


Thus the intervals computed for the latter set of data are all narrower than those for the former. For two symmetric, hump-shaped distributions having the same mean, this point is depicted in the following diagram:

[Figure: THE SYMMETRIC CURVE. Two symmetric, hump-shaped frequency curves with the same mean; vertical axis f, horizontal axis X marked at 100, 130, 150, 170 and 200]


Therefore, we see that for a set of data with a small standard deviation, a larger proportion of the values will be concentrated near the mean than for a set of data with a large standard deviation. A limitation of Chebychev’s theorem is that it gives no information at all about the probability of observing a value within one standard deviation of the mean, since $1 - 1/k^2 = 0$ when k = 1. Also, it should be noted that Chebychev’s theorem provides only a lower bound, which is often conservative: for many random variables, the probability of observing a value within 2 standard deviations of the mean is far greater than $1 - 1/2^2 = 0.75$.
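To see how conservative the bound can be, one can compare it with a specific distribution. The sketch below uses the normal distribution purely as an illustration; that choice is an assumption of this example, since the lecture speaks only of "many random variables". It relies on `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Assumption for illustration: a standard normal variable (mean 0, SD 1).
standard_normal = NormalDist(mu=0, sigma=1)


def normal_within(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    chebyshev_bound = 1 - 1 / k**2  # equals 0 for k = 1, i.e. no information
    print(f"k = {k}: Chebychev bound = {chebyshev_bound:.4f}, "
          f"normal probability = {normal_within(k):.4f}")
```

For a normal variable the probabilities come out at roughly 0.68, 0.95 and 0.997, which are the figures behind the Empirical Rule and are well above the Chebychev bounds of 0, 0.75 and 0.89.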


In this way, Chebychev’s theorem and the Empirical Rule play an important role in understanding the nature and importance of the standard deviation as a measure of dispersion.


FIVE-NUMBER SUMMARY

A five-number summary consists of $X_0$, $Q_1$, Median, $Q_3$, and $X_m$, where $X_0$ is the smallest and $X_m$ the largest value in the data.

It gives us quite a good idea of the shape of the distribution.


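If the data were perfectly symmetrical, the following would be true:

1. The distance from $Q_1$ to the median would be equal to the distance from the median to $Q_3$.

[Figure: symmetric frequency curve with $Q_1$, the median $\tilde{X}$ and $Q_3$ marked on the X-axis]

2. The distance from $X_0$ to $Q_1$ would be equal to the distance from $Q_3$ to $X_m$.

[Figure: symmetric frequency curve with $X_0$, $Q_1$, $Q_3$ and $X_m$ marked on the X-axis]

3. The median, the mid-quartile range, and the midrange would all be equal. All these measures would also be equal to the arithmetic mean of the data:

$\tilde{X} = \bar{X} = \text{midrange} = \text{mid-quartile range}$

[Figure: symmetric frequency curve with the common value of these measures marked on the X-axis]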

On the other hand, for nonsymmetrical distributions, the following would be true:

1. In right-skewed distributions, the distance from $Q_3$ to $X_m$ greatly exceeds the distance from $X_0$ to $Q_1$.

[Figure: THE POSITIVELY SKEWED CURVE, with $X_0$, $Q_1$, $Q_3$ and $X_m$ marked on the X-axis]

2. In right-skewed distributions, median < mid-quartile range < midrange.

[Figure: positively skewed curve with the median $\tilde{X}$, the mid-quartile range and the midrange marked on the X-axis]

Similarly, in left-skewed distributions, the distance from $X_0$ to $Q_1$ greatly exceeds the distance from $Q_3$ to $X_m$. Also, in left-skewed distributions, midrange < mid-quartile range < median.


EXAMPLE: Suppose that a study is being conducted regarding the annual costs incurred by students attending public versus private colleges and universities in the United States of America. In particular, suppose that, for exploratory purposes, our sample consists of 10 universities whose athletic programs are members of the ‘Big Ten’ Conference. The annual costs incurred for tuition fees, room, and board at these 10 schools are given as follows:


Name of University               Annual Costs (in $000)
Indiana University               15.6
Michigan State University        17.0
Ohio State University            15.2
Pennsylvania State University    16.4
Purdue University                15.2
University of Illinois           15.4
University of Iowa               13.0
University of Michigan           23.1
University of Minnesota          14.3
University of Wisconsin          14.9


If we wish to state the five-number summary for these data, the first step will be to arrange our dataset in ascending order:

Ordered Array:

13.0  14.3  14.9  15.2  15.2  15.4  15.6  16.4  17.0  23.1

Here $X_0$ = 13.0 and $X_m$ = 23.1.


MEDIAN AND QUARTILES FOR THIS DATA-SET:

1) The median for this data comes out to be 15.30 thousand dollars.
2) The first quartile comes out to be 14.90 thousand dollars.
3) The third quartile comes out to be 16.40 thousand dollars.
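These values can be checked with a few lines of Python. The lecture does not state here which quartile convention it uses, so the sketch below assumes one common convention: the quartiles are taken as the medians of the lower and upper halves of the ordered data. That convention happens to reproduce the figures above, but other conventions can give slightly different quartiles. The function name `five_number_summary` is illustrative.

```python
from statistics import median

# Annual costs (in $000) for the 10 Big Ten schools in the example.
costs = [15.6, 17.0, 15.2, 16.4, 15.2, 15.4, 13.0, 23.1, 14.3, 14.9]


def five_number_summary(data):
    """Return (X0, Q1, median, Q3, Xm).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the ordered data (the middle value is excluded when n is odd).
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + n % 2:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])


x0, q1, med, q3, xm = five_number_summary(costs)
print(f"X0 = {x0}, Q1 = {q1}, median = {med:.2f}, Q3 = {q3}, Xm = {xm}")
```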


The Five-Number Summary:

$X_0$    $Q_1$    $\tilde{X}$    $Q_3$    $X_m$
13.0     14.9     15.3           16.4     23.1

If we apply the rules that I conveyed to you a short while ago, it is clear that the annual cost data for our sample are right-skewed.


We come to this conclusion for two reasons:

1. The distance from $Q_3$ to $X_m$ (i.e., 6.7) greatly exceeds the distance from $X_0$ to $Q_1$ (i.e., 1.9).
2. If we compare the median (which is 15.3), the mid-quartile range (which is 15.65), and the midrange (which is 18.05), we observe that median < mid-quartile range < midrange.

Both these points clearly indicate that our distribution is positively skewed.
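The two checks can also be written as a small Python sketch. It takes the mid-quartile range as (Q1 + Q3)/2 and the midrange as (X0 + Xm)/2, which agrees with the values 15.65 and 18.05 quoted above. Note that the lecture says "greatly exceeds", whereas this sketch simply tests "exceeds". That simplification, and the function name `shape_from_summary`, are assumptions of the example.

```python
def shape_from_summary(x0, q1, med, q3, xm):
    """Classify the shape of a distribution from its five-number summary.

    Simplification: the lecture's 'greatly exceeds' is tested as a plain
    'exceeds', so borderline cases should be judged by eye.
    """
    mid_quartile = (q1 + q3) / 2   # mid-quartile range
    midrange = (x0 + xm) / 2       # midrange
    lower_tail = q1 - x0           # distance from X0 to Q1
    upper_tail = xm - q3           # distance from Q3 to Xm

    if upper_tail > lower_tail and med < mid_quartile < midrange:
        return "right-skewed"
    if lower_tail > upper_tail and midrange < mid_quartile < med:
        return "left-skewed"
    return "roughly symmetric or inconclusive"


# Five-number summary of the Big Ten annual costs (in $000).
print(shape_from_summary(13.0, 14.9, 15.3, 16.4, 23.1))
```

When the two criteria disagree, the sketch reports the result as inconclusive rather than forcing a label.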


The gist of the above discussion is that the five-number summary is a simple yet effective way of determining the shape of our frequency distribution, without actually drawing the graph of the frequency distribution.


IN TODAY’S LECTURE, YOU LEARNT

• Chebychev’s Inequality
• The Empirical Rule
• The Five-Number Summary


IN THE NEXT LECTURE, YOU WILL LEARN

• Box and Whisker Plot
• Measures of Skewness and Kurtosis
• Moments

