Chapter 4: Descriptive Statistics (Part 1)
• Numerical Description
• Central Tendency
• Dispersion
McGraw-Hill/Irwin
Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.
Numerical Description
• Statistics are descriptive measures derived from a sample (n items).
• Parameters are descriptive measures derived from a population (N items).
Numerical Description
• Three key characteristics of numerical data:
Characteristic | Interpretation
Central Tendency | Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion | How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape | Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Numerical Description Example: Vehicle Quality
• Consider the data set of vehicle defect rates from J. D. Power and Associates.
• $\text{Defect rate} = \frac{\text{total no. defects}}{\text{no. inspected}} \times 100$
• Numerical statistics can be used to summarize this random sample of brands.
• Must allow for sampling error since the analysis is based on sampling.
Numerical Description • Number of defects per 100 vehicles, 2006 models.
Numerical Description To begin, sort the data in Excel.
Numerical Description • Sorted data provides insight into central tendency and dispersion.
Numerical Description Visual Displays • The dot plot offers a visual impression of the data.
Numerical Description Visual Displays
• Histograms with 5 bins (suggested by Sturges' Rule) and 10 bins are shown below.
• Both are roughly symmetric with no extreme values, and both show a modal class toward the low end.
Central Tendency
• Central tendency refers to the middle or typical values of a distribution.
• Central tendency can be assessed using a dot plot or histogram, or more precisely with numerical statistics.
Central Tendency Six Measures of Central Tendency
Statistic | Formula | Excel Formula | Pro | Con
Mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | =AVERAGE(Data) | Familiar and uses all the sample information. | Influenced by extreme values.
Median | Middle value in sorted array | =MEDIAN(Data) | Robust when extreme data values exist. | Ignores extremes and can be affected by gaps in data values.
Central Tendency Six Measures of Central Tendency
Statistic | Formula | Excel Formula | Pro | Con
Mode | Most frequently occurring data value | =MODE(Data) | Useful for attribute data or discrete data with a small range. | May not be unique, and is not helpful for continuous data.
Midrange | $\frac{x_{\min} + x_{\max}}{2}$ | =0.5*(MIN(Data)+MAX(Data)) | Easy to understand and calculate. | Influenced by extreme values and ignores most data values.
Central Tendency Six Measures of Central Tendency
Statistic | Formula | Excel Formula | Pro | Con
Geometric mean (G) | $\sqrt[n]{x_1 x_2 \cdots x_n}$ | =GEOMEAN(Data) | Useful for growth rates and mitigates high extremes. | Less familiar and requires positive data.
Trimmed mean | Same as the mean except omit highest and lowest k% of data values (e.g., 5%) | =TRIMMEAN(Data, Percent) | Mitigates effects of extreme values. | Excludes some data values that could be relevant.
Central Tendency Mean
• A familiar measure of central tendency.
Population formula: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• In Excel, use function =AVERAGE(Data) where Data is an array of data values.
Central Tendency Mean • For the sample of n = 37 car brands:
Central Tendency Characteristics of the Mean
• Arithmetic mean is the most familiar average.
• Affected by every sample item.
• The balancing point or fulcrum for the data.
Central Tendency Characteristics of the Mean
• Regardless of the shape of the distribution, the signed deviations of the data points from the mean always sum to zero:
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• Consider the following asymmetric distribution of quiz scores (42, 60, 70, 75, 78) whose mean = 65:
$\sum_{i=1}^{n} (x_i - \bar{x}) = (42 - 65) + (60 - 65) + (70 - 65) + (75 - 65) + (78 - 65) = (-23) + (-5) + (5) + (10) + (13) = -28 + 28 = 0$
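The balancing property can be checked numerically. The following is a minimal Python sketch using the five quiz scores from this slide; the variable names are illustrative only. It also shows that the absolute distances do not cancel, which is why the property applies to signed deviations.

```python
# Quiz scores from the slide; their mean is 65.
scores = [42, 60, 70, 75, 78]
mean = sum(scores) / len(scores)

# Signed deviations cancel exactly; absolute distances do not.
signed = [x - mean for x in scores]
absolute = [abs(d) for d in signed]

print(mean)           # 65.0
print(sum(signed))    # 0.0
print(sum(absolute))  # 56.0
```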
Central Tendency Median
• The median (M) is the 50th percentile or midpoint of the sorted sample data.
• M separates the upper and lower half of the sorted observations.
• If n is odd, the median is the middle observation in the data array.
• If n is even, the median is the average of the middle two observations in the data array.
Central Tendency Median
• Consider the following n = 6 data values: 11 12 15 17 21 32
• What is the median? For even n,
$\text{Median} = \frac{x_{n/2} + x_{(n/2+1)}}{2}$
n/2 = 6/2 = 3 and n/2+1 = 6/2 + 1 = 4
M = (x3+x4)/2 = (15+17)/2 = 16
Sorted values with the median marked: 11 12 15 [16] 17 21 32
Central Tendency Median
(Figure 4.6)
• For n = 8, the median is between the fourth and fifth observations in the data array.
Central Tendency Median
• For n = 9, the median is the fifth observation in the data array.
Central Tendency Median
• Consider the following n = 7 data values: 12 23 23 25 27 34 41
• What is the median? For odd n,
$\text{Median} = x_{(n+1)/2}$
(n+1)/2 = (7+1)/2 = 8/2 = 4
M = x4 = 25
Sorted values with the median marked: 12 23 23 [25] 27 34 41
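The even-n and odd-n position rules can be written as one small function. This is a sketch in Python rather than Excel, and the function name `median` is chosen here for illustration; it converts the 1-based positions on the slides to Python's 0-based indexing.

```python
def median(values):
    """Median by the position rules on the slides (1-based positions)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of empty data")
    if n % 2 == 1:
        # Odd n: the observation at position (n+1)/2.
        return data[(n + 1) // 2 - 1]
    # Even n: average of the observations at positions n/2 and n/2 + 1.
    return (data[n // 2 - 1] + data[n // 2]) / 2

print(median([11, 12, 15, 17, 21, 32]))      # 16.0
print(median([12, 23, 23, 25, 27, 34, 41]))  # 25
```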
Central Tendency Median
• Use Excel's function =MEDIAN(Data) where Data is an array of data values.
• For the 37 vehicle quality ratings (odd n) the position of the median is (n+1)/2 = (37+1)/2 = 19.
• So, the median is x19 = 121.
• When there are several duplicate data values, the median does not provide a clean "50-50" split in the data.
Central Tendency Characteristics of the Median
• The median is insensitive to extreme data values.
• For example, consider the following quiz scores for 3 students:
Tom's scores: 20, 40, 70, 75, 80 | Mean = 57, Median = 70, Total = 285
Jake's scores: 60, 65, 70, 90, 95 | Mean = 76, Median = 70, Total = 380
Mary's scores: 50, 65, 70, 75, 90 | Mean = 70, Median = 70, Total = 350
• What does the median for each student tell you?
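The three students' summaries can be reproduced with Python's standard `statistics` module. This is a sketch for checking the slide's figures; the dictionary layout is an illustrative choice. All three medians are 70 even though the means and totals differ.

```python
from statistics import mean, median

quiz = {
    "Tom":  [20, 40, 70, 75, 80],
    "Jake": [60, 65, 70, 90, 95],
    "Mary": [50, 65, 70, 75, 90],
}

# (mean, median, total) for each student.
summary = {name: (mean(s), median(s), sum(s)) for name, s in quiz.items()}
for name, (m, med, total) in summary.items():
    print(f"{name}: mean={m}, median={med}, total={total}")
```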
Central Tendency Mode
• The most frequently occurring data value.
• Similar to mean and median if data values occur often near the center of sorted data.
• May have multiple modes or no mode.
Central Tendency Mode
• For example, consider the following quiz scores for 4 students:
Lee's scores: 60, 70, 70, 70, 80 | Mean = 70, Median = 70, Mode = 70
Pat's scores: 45, 45, 70, 90, 100 | Mean = 70, Median = 70, Mode = 45
Sam's scores: 50, 60, 70, 80, 90 | Mean = 70, Median = 70, Mode = none
Xiao's scores: 50, 50, 70, 90, 90 | Mean = 70, Median = 70, Modes = 50, 90
• What does the mode for each student tell you?
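The "no mode" and "multiple modes" cases can be made explicit in code. The helper below is a hypothetical Python sketch (the name `modes` is not an Excel or library function); it returns every most-frequent value and an empty list when no value repeats.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treated as "no mode"
    return sorted(v for v, c in counts.items() if c == top)

print(modes([60, 70, 70, 70, 80]))   # [70]
print(modes([45, 45, 70, 90, 100]))  # [45]
print(modes([50, 60, 70, 80, 90]))   # []
print(modes([50, 50, 70, 90, 90]))   # [50, 90]
```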
Central Tendency Mode
• Easy to define, not easy to calculate in large samples.
• Use Excel's function =MODE(Array)
  - will return #N/A if there is no mode.
  - will return first mode found if multimodal.
• May be far from the middle of the distribution and not at all typical.
Central Tendency Mode
• Generally isn't useful for continuous data since data values rarely repeat.
• Best for attribute data or a discrete variable with a small range (e.g., Likert scale).
Central Tendency Example: Price/Earnings Ratios and Mode
• Consider the following P/E ratios for a random sample of 68 Standard & Poor's 500 stocks.
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• What is the mode?
Central Tendency Example: Price/Earnings Ratios and Mode
• Excel's descriptive statistics results are:
Mean | 22.7206
Median | 19
Mode | 13
Range | 84
Minimum | 7
Maximum | 91
Sum | 1545
Count | 68
• The mode 13 occurs 7 times, but what does the dot plot show?
Central Tendency Example: Price/Earnings Ratios and Mode
• The dot plot shows local modes (a peak with valleys on either side) at 10, 13, 15, 19, 23, 26, 29.
• These multiple modes suggest that the mode is not a stable measure of central tendency.
Central Tendency Example: Rose Bowl Winners' Points
• Points scored by the winning NCAA football team tend to have modes in multiples of 7 because a touchdown with its extra point yields 7 points.
• Consider the dot plot of the points scored by the winning team in the first 87 Rose Bowl games.
• What is the mode?
Central Tendency Mode • A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data. • Occurs when dissimilar populations are combined in one sample. For example,
Central Tendency Skewness
• Compare the mean and median, or look at the histogram, to determine the degree of skewness.
Central Tendency Symptoms of Skewness
Distribution's Shape | Histogram Appearance | Statistics
Skewed left (negative skewness) | Long tail of histogram points left (a few low values but most data on right) | Mean < Median
Symmetric | Tails of histogram are balanced (low/high values offset) | Mean ≈ Median
Skewed right (positive skewness) | Long tail of histogram points right (most data on left but a few high values) | Mean > Median
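The mean-versus-median comparison in the table can be expressed as a simple rule of thumb. The function below is a hypothetical Python sketch; the `tol` parameter is an assumption added here to decide when the two statistics count as approximately equal, and the rule is only a symptom check, not a formal skewness statistic.

```python
def skew_direction(mean, median, tol=0.0):
    """Classify skew from the mean/median comparison in the table.

    tol is a hypothetical tolerance (same units as the data) for
    treating the two statistics as approximately equal.
    """
    if abs(mean - median) <= tol:
        return "symmetric"
    return "skewed right" if mean > median else "skewed left"

print(skew_direction(22.72, 19.00))  # skewed right (P/E ratios)
print(skew_direction(57, 70))        # skewed left (Tom's quiz scores)
print(skew_direction(70, 70))        # symmetric (Mary's quiz scores)
```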
Central Tendency Skewness
• For the sample of spending per customer at 74 Noodles & Company restaurants, the mean ($7.04) exceeds the median ($7.00). What does this suggest?
Central Tendency Geometric Mean
• The geometric mean (G) is a multiplicative average.
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$
• For the J. D. Power quality data (n=37):
$G = \sqrt[37]{(87)(93)(98)\cdots(164)(173)} = \sqrt[37]{2.37667 \times 10^{77}} = 123.38$
• In Excel use =GEOMEAN(Array)
• The geometric mean tends to mitigate the effects of high outliers.
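As the product on this slide shows, multiplying many values quickly produces very large numbers. One common way to avoid that is to average the logarithms instead. The Python sketch below is one possible implementation, not the method Excel uses internally; the function name is illustrative. The last lines check only the final root step stated on the slide, since the 37 individual values are not listed here.

```python
import math

def geometric_mean(values):
    """n-th root of the product, computed via logs to avoid overflow."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive data")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Check the last step on the slide: the 37th root of the stated product.
g_slide = (2.37667e77) ** (1 / 37)
print(round(g_slide, 2))  # 123.38
```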
Central Tendency Growth Rates
• A variation on the geometric mean is used to find the average growth rate for a time series of n observations:
$G = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$
• For example, from 2002 to 2006, JetBlue Airlines revenues are:
Year | Revenue (mil)
2002 | 635
2003 | 998
2004 | 1265
2005 | 1701
2006 | 2363
Central Tendency Growth Rates
• The average growth rate is given by taking the geometric mean of the ratios of each year's revenue to the preceding year.
• Due to cancellations, only the first and last years are relevant:
$G = \sqrt[4]{\left(\frac{998}{635}\right)\left(\frac{1265}{998}\right)\left(\frac{1701}{1265}\right)\left(\frac{2363}{1701}\right)} - 1 = \sqrt[4]{\frac{2363}{635}} - 1 = 1.389 - 1 = .389$ or 38.9% per year
• In Excel use =(2363/635)^(1/4)-1
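The cancellation can be confirmed numerically. The Python sketch below uses the JetBlue revenues from the table and computes the growth rate both ways: from the four year-over-year ratios and from the first and last years alone. Variable names are illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math

# JetBlue revenues (millions) for 2002-2006.
revenue = [635, 998, 1265, 1701, 2363]

# Geometric mean of the year-over-year ratios, minus 1.
ratios = [b / a for a, b in zip(revenue, revenue[1:])]
g_ratios = math.prod(ratios) ** (1 / len(ratios)) - 1

# Shortcut: intermediate years cancel, leaving last/first.
g_ends = (revenue[-1] / revenue[0]) ** (1 / (len(revenue) - 1)) - 1

print(round(g_ends, 3))  # 0.389
```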
Central Tendency Midrange
• The midrange is the point halfway between the lowest and highest values of X.
• Easy to use but sensitive to extreme data values.
$\text{Midrange} = \frac{x_{\min} + x_{\max}}{2}$
• For the J. D. Power quality data (n=37):
$\text{Midrange} = \frac{x_{\min} + x_{\max}}{2} = \frac{91 + 204}{2} = 147.5$
• Here, the midrange (147.5) is higher than the mean (125.38) or median (121).
Central Tendency Trimmed Mean
• To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
• For example, for the n = 68 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05).
• To determine how many observations to trim, multiply k × n = 0.05 × 68 = 3.4, which is rounded down to 3 observations.
• So, we would remove the three smallest and three largest observations before averaging the remaining values.
Central Tendency Trimmed Mean
• Here is a summary of all the measures of central tendency for the n = 68 P/E values.
Mean: 22.72 | =AVERAGE(PERatio)
Median: 19.00 | =MEDIAN(PERatio)
Mode: 13.00 | =MODE(PERatio)
Geometric Mean: 19.85 | =GEOMEAN(PERatio)
Midrange: 49.00 | =(MIN(PERatio)+MAX(PERatio))/2
5% Trim Mean: 21.10 | =TRIMMEAN(PERatio,0.1)
• Excel's TRIMMEAN takes the total fraction to trim from both tails combined, so 0.1 removes 5% from each end.
• The trimmed mean mitigates the effects of very high values, but still exceeds the median.
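The summary values can be recomputed from the 68 P/E ratios listed earlier. The following Python sketch uses the standard `statistics` module (`geometric_mean` assumes Python 3.8 or later). The `trimmed_mean` helper is a hypothetical function written for this illustration; it drops int(k × n) observations from each tail, matching the rounding-down rule on the previous slide.

```python
from statistics import mean, median, mode, geometric_mean

# The 68 P/E ratios from the earlier slide.
pe = [
    7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
    14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
    19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
    26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91,
]

def trimmed_mean(values, k):
    """Mean after dropping int(k * n) observations from each tail."""
    data = sorted(values)
    cut = int(k * len(data))  # 0.05 * 68 = 3.4 -> 3
    return mean(data[cut:len(data) - cut])

midrange = (min(pe) + max(pe)) / 2

print("Mean:          ", round(mean(pe), 2))                # 22.72
print("Median:        ", median(pe))                        # 19.0
print("Mode:          ", mode(pe))                          # 13
print("Geometric mean:", round(geometric_mean(pe), 2))
print("Midrange:      ", midrange)                          # 49.0
print("5% trim mean:  ", round(trimmed_mean(pe, 0.05), 2))  # 21.1
```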
Central Tendency Trimmed Mean
• The Federal Reserve uses a 16% trimmed mean to mitigate the effects of extremes in its analysis of the Consumer Price Index.
Dispersion
• Variation is the "spread" of data points about the center of the distribution in a sample. Consider the following measures of dispersion:
Measures of Variation
Statistic | Formula | Excel | Pro | Con
Range | $x_{\max} - x_{\min}$ | =MAX(Data)-MIN(Data) | Easy to calculate. | Sensitive to extreme data values.
Variance ($s^2$) | $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$ | =VAR(Data) | Plays a key role in mathematical statistics. | Non-intuitive meaning.
Dispersion Measures of Variation
Statistic | Formula | Excel | Pro | Con
Standard deviation (s) | $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$ | =STDEV(Data) | Most common measure. Uses same units as the raw data ($, £, ¥, etc.). | Nonintuitive meaning.
Coefficient of variation (CV) | $100 \times \frac{s}{\bar{x}}$ | None | Measures relative variation in percent so can compare data sets. | Requires nonnegative data.
Dispersion Measures of Variation
Statistic | Formula | Excel | Pro | Con
Mean absolute deviation (MAD) | $\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$ | =AVEDEV(Data) | Easy to understand. | Lacks "nice" theoretical properties.
Dispersion Range
• The difference between the largest and smallest observation.
Range = xmax - xmin
• For example, for the n = 68 P/E ratios, Range = 91 - 7 = 84
Dispersion Variance
• The population variance ($\sigma^2$) is defined as the sum of squared deviations around the mean μ divided by the population size:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• For the sample variance ($s^2$), we divide by n - 1 instead of n, otherwise $s^2$ would tend to underestimate the unknown population variance $\sigma^2$:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Dispersion Standard Deviation
• The square root of the variance.
• Explains how individual values in a data set vary from the mean.
• Units of measure are the same as X.
Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Dispersion Standard Deviation
• Excel's built-in functions are:
Statistic | Excel population formula | Excel sample formula
Variance | =VARP(Array) | =VAR(Array)
Standard deviation | =STDEVP(Array) | =STDEV(Array)
Dispersion Calculating a Standard Deviation
• Consider the following five quiz scores for Stephanie. (Table 4.12)
Dispersion Calculating a Standard Deviation
• Now, calculate the sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{2380}{5-1}} = \sqrt{595} = 24.39$
• Somewhat easier, the two-sum formula can also be used:
$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}} = \sqrt{\frac{28300 - \frac{(360)^2}{5}}{5-1}} = \sqrt{\frac{28300 - 25920}{5-1}} = \sqrt{595} = 24.39$
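Both formulas can be checked side by side in code. Table 4.12 is not reproduced in these notes, so the five scores in the Python sketch below are an assumption: they were chosen only because they reproduce the totals used on the slide (sum 360, sum of squares 28300, squared deviations 2380) and may not be the table's actual values.

```python
import math
from statistics import stdev

# ASSUMED scores: chosen to match the slide's totals, not taken from Table 4.12.
scores = [40, 55, 75, 95, 95]
n = len(scores)
xbar = sum(scores) / n                     # 72.0

# Definitional formula: squared deviations around the mean.
ss = sum((x - xbar) ** 2 for x in scores)  # 2380.0
s_def = math.sqrt(ss / (n - 1))

# Two-sum formula: needs only the sum and the sum of squares.
sum_x = sum(scores)                        # 360
sum_x2 = sum(x * x for x in scores)        # 28300
s_two = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(round(s_def, 2), round(s_two, 2))    # 24.39 24.39
print(round(stdev(scores), 2))             # 24.39 (sample formula, like =STDEV)
```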
Dispersion Calculating a Standard Deviation
• The standard deviation is nonnegative because deviations around the mean are squared.
• When every observation is exactly equal to the mean, the standard deviation is zero.
• Standard deviations can be large or small, depending on the units of measure.
• Compare standard deviations only for data sets measured in the same units and only if the means do not differ substantially.
Dispersion Coefficient of Variation
• Useful for comparing variables measured in different units or with different means.
• A unit-free measure of dispersion.
• Expressed as a percent of the mean:
$CV = 100 \times \frac{s}{\bar{x}}$
• Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.
Dispersion Coefficient of Variation
• For example, using $CV = 100 \times \frac{s}{\bar{x}}$:
Defect rates (n = 37): s = 22.89, $\bar{x}$ = 125.38 gives CV = 100 × (22.89)/(125.38) = 18%
ATM deposits (n = 100): s = 280.80, $\bar{x}$ = 233.89 gives CV = 100 × (280.80)/(233.89) = 120%
P/E ratios (n = 68): s = 14.08, $\bar{x}$ = 22.72 gives CV = 100 × (14.08)/(22.72) = 62%
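The three coefficients of variation can be recomputed directly from the standard deviations and means used in the calculations above. This is a small Python sketch with an illustrative function name; it rejects a zero or negative mean, where the CV is undefined.

```python
def coefficient_of_variation(s, mean):
    """CV as a percent of the mean; requires a positive mean."""
    if mean <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * s / mean

# (standard deviation, mean) pairs from the slide's calculations.
examples = {
    "Defect rates": (22.89, 125.38),
    "ATM deposits": (280.80, 233.89),
    "P/E ratios": (14.08, 22.72),
}
cv = {name: coefficient_of_variation(s, m) for name, (s, m) in examples.items()}
for name, value in cv.items():
    print(f"{name}: CV = {value:.0f}%")
```

Because the CV is unit-free, the ATM deposits (120%) can be described as far more variable relative to their mean than the defect rates (18%), even though the two are measured in different units.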
Dispersion Mean Absolute Deviation
• The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution).
• Uses absolute values of the deviations around the mean.
$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
• Excel's function is =AVEDEV(Array)
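The MAD formula translates directly into code. The Python sketch below uses an illustrative function name and, as sample data, the five quiz scores from the earlier slide on the mean (42, 60, 70, 75, 78, mean 65).

```python
def mean_absolute_deviation(values):
    """Average absolute distance from each value to the mean."""
    n = len(values)
    xbar = sum(values) / n
    return sum(abs(x - xbar) for x in values) / n

# Absolute deviations are 23, 5, 5, 10, 13, which sum to 56.
scores = [42, 60, 70, 75, 78]
print(mean_absolute_deviation(scores))  # 11.2
```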
Dispersion Central Tendency vs. Dispersion: Manufacturing
• Consider the histograms of hole diameters drilled in a steel plate during manufacturing.
(Figure: histograms for Machine A and Machine B)
• The desired distribution is outlined in red.
Dispersion Central Tendency vs. Dispersion: Manufacturing
Machine A: Desired mean (5 mm) but too much variation.
Machine B: Acceptable variation but mean is less than 5 mm.
• Take frequent samples to monitor quality.
Dispersion Central Tendency vs. Dispersion: Job Performance
• Consider student ratings of four professors on eight teaching attributes (10-point scale).
Dispersion Central Tendency vs. Dispersion: Job Performance
• Jones and Wu have identical means but different standard deviations.
Dispersion Central Tendency vs. Dispersion: Job Performance
• Smith and Gopal have different means but identical standard deviations.
Dispersion Central Tendency vs. Dispersion: Job Performance
• A high mean (better rating) and low standard deviation (more consistency) is preferred. Which professor do you think is best?
Applied Statistics in Business & Economics
End of Chapter 4A