FromProf-Chap4-AppliedStats/Bus&Eco- Doane-2E by school stuff

Descriptive Statistics Numerical Description Central Tendency Dispersion Standardized Data Percentiles and Quartiles Box Plots Grouped Data Skewness and Kurtosis (optional)


Numerical Description • Three key characteristics of numerical data:

Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?

Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?

Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

Numerical Description Example: Vehicle Quality • Consider the data set of vehicle defect rates from J. D. Power and Associates. • Defect rate = (total no. of defects / no. inspected) × 100 • Numerical statistics can be used to summarize this random sample of brands. • Must allow for sampling error since the analysis is based on sampling.

Numerical Description • Number of defects per 100 vehicles, 2004 models.

Central Tendency Measures of Central Tendency

Mean • Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Excel formula: =AVERAGE(Data) • Pro: Familiar and uses all the sample information. • Con: Influenced by extreme values.

Median • Formula: Middle value in sorted array • Excel formula: =MEDIAN(Data) • Pro: Robust when extreme data values exist. • Con: Ignores extremes and can be affected by gaps in data values.

Central Tendency Measures of Central Tendency

Mode • Formula: Most frequently occurring data value • Excel formula: =MODE(Data) • Pro: Useful for attribute data or discrete data with a small range. • Con: May not be unique, and is not helpful for continuous data.

Midrange • Formula: $\frac{x_{min} + x_{max}}{2}$ • Excel formula: =0.5*(MIN(Data)+MAX(Data)) • Pro: Easy to understand and calculate. • Con: Influenced by extreme values and ignores most data values.

Central Tendency Measures of Central Tendency

Geometric mean (G) • Formula: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$ • Excel formula: =GEOMEAN(Data) • Pro: Useful for growth rates and mitigates high extremes. • Con: Less familiar and requires positive data.

Central Tendency Mean • A familiar measure of central tendency. • Population formula: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ • Sample formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • In Excel, use function =AVERAGE(Data) where Data is an array of data values.

Central Tendency Mean • For the sample of n = 37 car brands: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{87 + 93 + 98 + \cdots + 159 + 164 + 173}{37} = \frac{4639}{37} = 125.38$

Central Tendency Median • The median (M) is the 50th percentile or midpoint of the sorted sample data. • M separates the upper and lower half of the sorted observations. • If n is odd, the median is the middle observation in the data array. • If n is even, the median is the average of the middle two observations in the data array.

Central Tendency Median

• For n = 8, the median is between the fourth and fifth observations in the data array.

Central Tendency Median • Consider the following n = 6 data values: 11 12 15 17 21 32 • What is the median? • For even n, Median = $\frac{x_{n/2} + x_{n/2+1}}{2}$, where n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4 • M = (x3 + x4)/2 = (15 + 17)/2 = 16

Central Tendency Median • Consider the following n = 7 data values: 12 23 23 25 27 34 41 • What is the median? • For odd n, Median = $x_{(n+1)/2}$, where (n+1)/2 = (7+1)/2 = 8/2 = 4 • M = x4 = 25
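The two position rules above can be checked with a short Python sketch. The helper name `median_by_position` is an illustration, not something from the slides; it applies the odd/even rules to the two example data sets.

```python
def median_by_position(values):
    """Median via the position rules (1-based positions in the sorted array)."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        # odd n: the observation at position (n + 1) / 2
        return x[(n + 1) // 2 - 1]
    # even n: average the observations at positions n/2 and n/2 + 1
    return (x[n // 2 - 1] + x[n // 2]) / 2

even_data = [11, 12, 15, 17, 21, 32]
odd_data = [12, 23, 23, 25, 27, 34, 41]
print(median_by_position(even_data))  # 16.0
print(median_by_position(odd_data))   # 25
```

Python's built-in `statistics.median` gives the same results for both data sets.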

Central Tendency Median • Use Excel’s function =MEDIAN(Data) where Data is an array of data values. • For the 37 vehicle quality ratings (odd n) the position of the median is (n+1)/2 = (37+1)/2 = 19. • So, the median is x19 = 121. • When there are several duplicate data values, the median does not provide a clean “50-50” split in the data.

Central Tendency Characteristics of the Median • The median is insensitive to extreme data values. • For example, consider the following quiz scores for 3 students:

Tom’s scores: 20, 40, 70, 75, 80 (Mean = 57, Median = 70, Total = 285)
Jake’s scores: 60, 65, 70, 90, 95 (Mean = 76, Median = 70, Total = 380)
Mary’s scores: 50, 65, 70, 75, 90 (Mean = 70, Median = 70, Total = 350)

• What does the median for each student tell you?
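As a rough check of the quiz-score example, the following Python sketch recomputes each student's mean, median, and total. The dictionary layout is just a convenient assumption for illustration.

```python
from statistics import mean, median

scores = {
    "Tom": [20, 40, 70, 75, 80],
    "Jake": [60, 65, 70, 90, 95],
    "Mary": [50, 65, 70, 75, 90],
}

# (mean, median, total) for each student
summary = {name: (mean(s), median(s), sum(s)) for name, s in scores.items()}
for name, (m, med, total) in summary.items():
    print(f"{name}: mean={m}, median={med}, total={total}")
```

All three medians are 70 even though the means range from 57 to 76, which is the sense in which the median ignores the extremes.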

Central Tendency Mode • The most frequently occurring data value. • Similar to mean and median if data values occur often near the center of sorted data. • May have multiple modes or no mode.

Central Tendency Mode • For example, consider the following quiz scores for 4 students:

Lee’s scores: 60, 70, 70, 70, 80 (Mean = 70, Median = 70, Mode = 70)
Pat’s scores: 45, 45, 70, 90, 100 (Mean = 70, Median = 70, Mode = 45)
Sam’s scores: 50, 60, 70, 80, 90 (Mean = 70, Median = 70, Mode = none)
Xiao’s scores: 50, 50, 70, 90, 90 (Mean = 70, Median = 70, Modes = 50, 90)

• What does the mode for each student tell you?
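A minimal Python sketch of the mode for these four students follows. The helper `modes` is hypothetical; it returns every tied mode and, by assumption, an empty list when no value repeats (the "no mode" case).

```python
from collections import Counter

def modes(values):
    """Return all modes, or an empty list when no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treat as "no mode"
    return sorted(v for v, c in counts.items() if c == top)

print(modes([60, 70, 70, 70, 80]))    # [70]
print(modes([45, 45, 70, 90, 100]))   # [45]
print(modes([50, 60, 70, 80, 90]))    # []
print(modes([50, 50, 70, 90, 90]))    # [50, 90]
```

Pat's mode of 45 is the lowest score, which shows how the mode can sit far from the middle of the data.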

Central Tendency Mode • Use Excel’s function =MODE(Array) - will return #N/A if there is no mode. - will return first mode found if multimodal. • May be far from the middle of the distribution and not at all typical. • Best for categorical data or a discrete variable with a small range and generally not useful for continuous data.

Central Tendency Example: Price/Earnings Ratios and Mode • Consider the following P/E ratios for a random sample of 68 Standard & Poor’s 500 stocks.

7 10 10 10 10 12 13 13 13 13 13 13 13 14 14 14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19 19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26 26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

• What is the mode?

Central Tendency Example: Price/Earnings Ratios and Mode • Excel’s descriptive statistics results are:

Mean: 22.7206
Median: 19
Mode: 13
Range: 84
Minimum: 7
Maximum: 91
Sum: 1545
Count: 68

• The mode 13 occurs 7 times, but what does the dot plot show?

Central Tendency Mode • A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data. • Occurs when dissimilar populations are combined in one sample.

Central Tendency Skewness • Compare mean and median or look at histogram to determine degree of skewness.

Central Tendency Symptoms of Skewness

Skewed left (negative skewness) • Histogram appearance: Long tail of histogram points left (a few low values but most data on right) • Statistics: Mean < Median

Symmetric • Histogram appearance: Tails of histogram are balanced (low/high values offset) • Statistics: Mean ≈ Median

Skewed right (positive skewness) • Histogram appearance: Long tail of histogram points right (most data on left but a few high values) • Statistics: Mean > Median

Central Tendency Skewness

• For the sample of J.D. Power quality ratings, the mean (125.38) exceeds the median (121). What does this suggest?

Central Tendency Geometric Mean • The geometric mean (G) is a multiplicative average: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$

• For the J. D. Power quality data (n=37): $G = \sqrt[37]{(87)(93)(98)\cdots(164)(173)} = \sqrt[37]{2.37667 \times 10^{77}} = 123.38$

• In Excel use =GEOMEAN(Array) • The geometric mean tends to mitigate the effects of high outliers.

Central Tendency Growth Rates • A variation on the geometric mean used to find the average growth rate for a time series: $\sqrt[n]{\frac{X_n}{X_0}} - 1$

• For example, from 1998 to 2002, Spirit Airlines revenues are:

Year: Revenue (mil)
1998: 131
1999: 227
2000: 311
2001: 354
2002: 403

Central Tendency Growth Rates • The average growth rate is given by taking the geometric mean of the ratios of each year’s revenue to the preceding year. • Due to cancellations, only the first and last years are relevant:

$G = \sqrt[4]{\left(\frac{227}{131}\right)\left(\frac{311}{227}\right)\left(\frac{354}{311}\right)\left(\frac{403}{354}\right)} - 1 = \sqrt[4]{\frac{403}{131}} - 1 = 1.324 - 1 = .324$ or 32.4% per year

• In Excel use =(403/131)^(1/4)-1
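The cancellation argument can be verified numerically. This Python sketch (variable names are illustrative) computes the growth rate both ways: as the geometric mean of the year-over-year ratios, and from the first and last revenues only.

```python
from statistics import geometric_mean

revenue = [131, 227, 311, 354, 403]  # Spirit Airlines revenue (mil), 1998-2002

# Geometric mean of each year's revenue relative to the preceding year
ratios = [later / earlier for earlier, later in zip(revenue, revenue[1:])]
growth_from_ratios = geometric_mean(ratios) - 1

# Shortcut: only the endpoints matter because the intermediate terms cancel
periods = len(revenue) - 1
growth_from_endpoints = (revenue[-1] / revenue[0]) ** (1 / periods) - 1

print(round(growth_from_ratios, 3), round(growth_from_endpoints, 3))  # 0.324 0.324
```

Both routes agree up to floating-point error, so the endpoint shortcut is the simpler one to use in practice.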

Central Tendency Midrange

• The midrange is the point halfway between the lowest and highest values of X. • Easy to use but sensitive to extreme data values. • Midrange = $\frac{x_{min} + x_{max}}{2}$ • For the J. D. Power quality data (n=37): Midrange = $\frac{x_{min} + x_{max}}{2} = \frac{x_1 + x_{37}}{2} = \frac{87 + 173}{2} = 130$ • Here, the midrange (130) is higher than the mean (125.38) or median (121).

Dispersion • Variation is the “spread” of data points about the center of the distribution in a sample. Consider the following measures of dispersion:

Measures of Variation

Range • Formula: $x_{max} - x_{min}$ • Excel: =MAX(Data)-MIN(Data) • Pro: Easy to calculate. • Con: Sensitive to extreme data values.

Variance (s²) • Formula: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$ • Excel: =VAR(Data) • Pro: Plays a key role in mathematical statistics. • Con: Non-intuitive meaning.

Dispersion Measures of Variation

Standard deviation (s) • Formula: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$ • Excel: =STDEV(Data) • Pro: Most common measure. Uses same units as the raw data ($ , £, ¥, etc.). • Con: Non-intuitive meaning.

Coefficient of variation (CV) • Formula: $CV = 100 \times \frac{s}{\bar{x}}$ • Excel: None • Pro: Measures relative variation in percent so can compare data sets. • Con: Requires nonnegative data.

Dispersion Measures of Variation

Mean absolute deviation (MAD) • Formula: $MAD = \frac{\sum_{i=1}^{n}|x_i - \bar{x}|}{n}$ • Excel: =AVEDEV(Data) • Pro: Easy to understand. • Con: Lacks “nice” theoretical properties.

Dispersion Range • The difference between the largest and smallest observation. Range = xmax – xmin • For example, for the n = 68 P/E ratios, Range = 91 – 7 = 84

Dispersion Variance • The population variance (σ²) is defined as the sum of squared deviations around the mean μ divided by the population size: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$ • For the sample variance (s²), we divide by n – 1 instead of n, otherwise s² would tend to underestimate the unknown population variance σ²: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$

Dispersion Standard Deviation • The square root of the variance. • Explains how individual values in a data set vary from the mean. • Units of measure are the same as X.

Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$

Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$

Dispersion Standard Deviation • Excel’s built in functions are

Variance • Excel population formula: =VARP(Array) • Excel sample formula: =VAR(Array)

Standard deviation • Excel population formula: =STDEVP(Array) • Excel sample formula: =STDEV(Array)

Dispersion Calculating a Standard Deviation • Consider the following five quiz scores for Stephanie.

Dispersion Calculating a Standard Deviation • Now, calculate the sample standard deviation:

$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{2380}{5-1}} = \sqrt{595} = 24.39$

• Somewhat easier, the two-sum formula can also be used:

$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}} = \sqrt{\frac{28300 - \frac{(360)^2}{5}}{5-1}} = \sqrt{\frac{28300 - 25920}{5-1}} = \sqrt{595} = 24.39$
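The two-sum formula needs only n, the sum of the values, and the sum of their squares. A small Python sketch, using the three totals from the calculation above (the function name is an assumption for illustration):

```python
import math

def sample_sd_from_sums(n, sum_x, sum_x2):
    """Sample standard deviation from n, the sum of x, and the sum of x squared."""
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

s = sample_sd_from_sums(n=5, sum_x=360, sum_x2=28300)
print(round(s, 2))  # 24.39
```

The two-sum form is convenient by hand, but with large values it can lose precision to cancellation, so software generally works from deviations around the mean instead.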

Dispersion Calculating a Standard Deviation • The standard deviation is nonnegative because deviations around the mean are squared. • When every observation is exactly equal to the mean, the standard deviation is zero. • Standard deviations can be large or small, depending on the units of measure. • Compare standard deviations only for data sets measured in the same units and only if the means do not differ substantially.

Dispersion Coefficient of Variation • Useful for comparing variables measured in different units or with different means. • A unit-free measure of dispersion • Expressed as a percent of the mean: $CV = 100 \times \frac{s}{\bar{x}}$

• Only appropriate for nonnegative data. It is undefined if the mean is zero or negative.

Dispersion Coefficient of Variation • For example, using $CV = 100 \times \frac{s}{\bar{x}}$:

Defect rates (n = 37): s = 22.89, $\bar{x}$ = 125.38 gives CV = 100 × (22.89)/(125.38) = 18%

ATM deposits (n = 100): s = 280.80, $\bar{x}$ = 233.89 gives CV = 100 × (280.80)/(233.89) = 120%

P/E ratios (n = 68): s = 14.08, $\bar{x}$ = 22.72 gives CV = 100 × (14.08)/(22.72) = 62%

Example: Which stock price is more volatile?

Month: Stock A, Stock B
JAN: $1.00, $180
FEB: 1.50, 175
MAR: 1.90, 182
APR: .60, 186
MAY: 3.00, 188
JUN: .40, 190
JUL: 5.00, 200
AUG: .20, 210
Mean: $1.70, $188.88
Variance: 2.61, 128.41
Standard deviation: $1.62, $11.33

CV A = ($1.62 / $1.70) × 100% = 95.3%
CV B = ($11.33 / $188.88) × 100% = 6.0%

Even though stock A has a smaller standard deviation than does stock B, stock A is more volatile according to CV.
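The stock comparison can be reproduced with a short Python sketch. The function name is hypothetical. Because the code does not round s before dividing, stock A comes out near 95.1% rather than the 95.3% obtained from the rounded values $1.62 and $1.70.

```python
from statistics import mean, stdev

stock_a = [1.00, 1.50, 1.90, 0.60, 3.00, 0.40, 5.00, 0.20]
stock_b = [180, 175, 182, 186, 188, 190, 200, 210]

def coefficient_of_variation(values):
    """CV in percent: 100 * s / x-bar, using the sample standard deviation."""
    m = mean(values)
    if m <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * stdev(values) / m

cv_a = coefficient_of_variation(stock_a)
cv_b = coefficient_of_variation(stock_b)
print(f"CV A = {cv_a:.1f}%, CV B = {cv_b:.1f}%")
```

Stock B has the larger standard deviation in dollars, yet its CV is far smaller, which is the point of using a unit-free measure.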

Dispersion Mean Absolute Deviation • The Mean Absolute Deviation (MAD) reveals the average distance from an individual data point to the mean (center of the distribution). • Uses absolute values of the deviations around the mean: $MAD = \frac{\sum_{i=1}^{n}|x_i - \bar{x}|}{n}$

• Excel’s function is =AVEDEV(Array)

Dispersion Central Tendency vs. Dispersion: Manufacturing • Consider the histograms of hole diameters drilled in a steel plate during manufacturing.

Machine A

Machine B

• The desired distribution is outlined in red.

Dispersion Central Tendency vs. Dispersion: Manufacturing

Machine A: Acceptable variation but mean is less than 5 mm.

Machine B: Desired mean (5 mm) but too much variation.

• Take frequent samples to monitor quality.

Standardized Data Chebyshev’s Theorem • Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894). • For any population with mean μ and standard deviation σ, the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 – 1/k²].

Standardized Data Chebyshev’s Theorem • For k = 2 standard deviations, 100[1 – 1/2²] = 75% • So, at least 75.0% will lie within μ ± 2σ • For k = 3 standard deviations, 100[1 – 1/3²] = 88.9% • So, at least 88.9% will lie within μ ± 3σ • Although applicable to any data set, these limits tend to be too wide to be useful.
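Chebyshev's bound is a one-line calculation. The sketch below (function name assumed for illustration) evaluates 100[1 – 1/k²] for the two cases above.

```python
def chebyshev_lower_bound(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 100 * (1 - 1 / k ** 2)

print(chebyshev_lower_bound(2))            # 75.0
print(round(chebyshev_lower_bound(3), 1))  # 88.9
```

For k = 1 the formula gives 0%, which is why the bound only says something useful for k greater than 1.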

Standardized Data The Empirical Rule • The normal or Gaussian distribution was named for Karl Gauss (1777-1855). • The normal distribution is symmetric and is also known as the bell-shaped curve. • The Empirical Rule states that for data from a normal distribution, we expect that for k = 1 about 68.26% will lie within μ ± 1σ, for k = 2 about 95.44% will lie within μ ± 2σ, and for k = 3 about 99.73% will lie within μ ± 3σ.

Standardized Data The Empirical Rule • Distance from the mean is measured in terms of the number of standard deviations. Note: no upper bound is given. Data values outside μ ± 3σ are rare.

Standardized Data Example: Exam Scores • If 80 students take an exam, how many will score within 2 standard deviations of the mean? • Assuming exam scores follow a normal distribution, the empirical rule states about 95.44% will lie within μ ± 2σ, so 95.44% × 80 ≈ 76 students will score within 2σ of μ. • How many students will score more than 2 standard deviations from the mean?
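The empirical-rule percentages can be recovered from the normal distribution itself. The sketch below uses the standard identity P(|Z| ≤ k) = erf(k/√2); the function name is an assumption. The exact values (about 68.27%, 95.45%, 99.73%) differ in the last digit from the table-rounded 68.26% and 95.44% quoted above.

```python
import math

def normal_share_within(k):
    """P(|Z| <= k) for a standard normal Z."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{100 * normal_share_within(k):.2f}%")

students = 80
within_2sd = round(students * normal_share_within(2))
print(within_2sd, students - within_2sd)  # 76 4
```

Under the normality assumption, about 76 of the 80 students fall within 2 standard deviations, leaving about 4 outside.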

Standardized Data Unusual Observations • Unusual observations are those that lie beyond μ ± 2σ. • Outliers are observations that lie beyond μ ± 3σ.

Standardized Data Unusual Observations • For example, the P/E ratio data contains several large data values. Are they unusual or outliers?

[Table: the 68 P/E ratios, sorted from 7 to 91]

Standardized Data The Empirical Rule • If the sample came from a normal distribution, then the Empirical rule states

$\bar{x} \pm 1s$ = 22.72 ± 1(14.08) = (8.6, 36.8)

$\bar{x} \pm 2s$ = 22.72 ± 2(14.08) = (-5.4, 50.9)

$\bar{x} \pm 3s$ = 22.72 ± 3(14.08) = (-19.5, 65.0)

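Standardized Data The Empirical Rule • Are there any unusual values or outliers?

[Figure: number line for the sorted P/E ratios (7, 8, ..., 48, 55, 68, 91) marked at the mean 22.72 and at the limits -19.5, -5.4, 8.6, 36.8, 50.9, and 65.0. The value 55 is labeled Unusual; 68 and 91 are labeled Outliers.]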

Standardized Data Defining a Standardized Variable • A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.

Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$

Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$

Standardized Data Defining a Standardized Variable • zi tells how far away the observation is from the mean. • For example, for the P/E data, the first value x1 = 7. The associated z value is

$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$

• A negative z value means the observation is below the mean.
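A brief Python sketch of standardization, using the P/E sample mean (22.72) and standard deviation (14.08) from the slides. The helper names and the label "typical" for values within 2 standard deviations are assumptions made for illustration; the "unusual" and "outlier" cutoffs follow the 2 and 3 standard deviation rules stated earlier.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def classify(z):
    """Label a z value using the 2-sigma / 3-sigma rules."""
    if abs(z) > 3:
        return "outlier"
    if abs(z) > 2:
        return "unusual"
    return "typical"  # assumed label for everything else

pe_mean, pe_sd = 22.72, 14.08
for x in (7, 48, 55, 68, 91):
    z = z_score(x, pe_mean, pe_sd)
    print(x, round(z, 2), classify(z))
```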

Standardized Data Defining a Standardized Variable • Here are the standardized z values for the P/E data:

• What do you conclude for these four values?

Standardized Data Defining a Standardized Variable • In Excel, use =STANDARDIZE(XValue, Mean, StDev) to calculate a standardized z value for a single data value.

Standardized Data Outliers • What do we do with outliers in a data set? • If due to erroneous data, then discard. • An outrageous observation (one completely outside of an expected range) is certainly invalid. • Recognize unusual data points and outliers and their potential impact on your study. • Research books and articles on how to handle outliers.

Percentiles and Quartiles Percentiles

• Percentiles are data that have been divided into 100 groups. • For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you. • Deciles are data that have been divided into 10 groups. • Quintiles are data that have been divided into 5 groups. • Quartiles are data that have been divided into 4 groups.

Percentiles and Quartiles Percentiles • Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing and banking industries use 5, 25, 50, 75 and 90 percentiles). • Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios. • Percentiles are used in employee merit evaluation and salary benchmarking.

Percentiles and Quartiles Quartiles

• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.

← Lower 25% → Q1 ← Second 25% → Q2 ← Third 25% → Q3 ← Upper 25% →

• The three values that separate the four groups are called Q1, Q2, and Q3, respectively.

Percentiles and Quartiles Quartiles • The second quartile Q2 is the median, an important indicator of central tendency.

← Lower 50% → Q2 ← Upper 50% →

• Q1 and Q3 measure dispersion since the interquartile range Q3 – Q1 measures the degree of spread in the middle 50 percent of data values.

← Lower 25% → Q1 ← Middle 50% → Q3 ← Upper 25% →

Percentiles and Quartiles Quartiles • The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.

← Lower 25% → Q1 ← Second 25% → Q2 ← Third 25% → Q3 ← Upper 25% →

• For first half of data, 50% above, 50% below Q1. • For second half of data, 50% above, 50% below Q3.

Percentiles and Quartiles Quartiles • Depending on n, the quartiles Q1, Q2, and Q3 may be members of the data set or may lie between two of the sorted data values.

Percentiles and Quartiles Method of Medians • For small data sets, find quartiles using method of medians: Step 1. Sort the observations. Step 2. Find the median Q2. Step 3. Find the median of the data values that lie below Q2. Step 4. Find the median of the data values that lie above Q2.
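The four steps can be sketched in Python as follows. The function name is hypothetical, and one assumption is made explicit: "below" and "above" Q2 are taken by position in the sorted array, so for odd n the median observation itself is left out of both halves.

```python
from statistics import median

def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    x = sorted(values)               # Step 1: sort
    n = len(x)
    half = n // 2
    q2 = median(x)                   # Step 2: median of all data
    lower = x[:half]                 # positions below the median
    upper = x[half + n % 2:]         # positions above it (skips the median when n is odd)
    return median(lower), q2, median(upper)  # Steps 3 and 4

print(quartiles_by_medians([11, 12, 15, 17, 21, 32]))      # (12, 16.0, 21)
print(quartiles_by_medians([12, 23, 23, 25, 27, 34, 41]))  # (23, 25, 34)
```

The two data sets are the n = 6 and n = 7 examples used earlier for the median.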

Percentiles and Quartiles Excel Quartiles

• Use Excel function =QUARTILE(Array, k) to return the kth quartile. • Excel treats quartiles as a special case of percentiles. For example, to calculate Q3, use =QUARTILE(Array, 3) or =PERCENTILE(Array, 0.75) • Excel calculates the quartile positions as:

Position of Q1: 0.25n + 0.75
Position of Q2: 0.50n + 0.50
Position of Q3: 0.75n + 0.25
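A rough Python sketch of the position rule follows. It assumes that a fractional position is resolved by linear interpolation between the two neighboring sorted values; the function name is illustrative. Under that assumption the results match `statistics.quantiles` with `method="inclusive"`.

```python
import math
from statistics import quantiles

def quartile_by_position(values, k):
    """k-th quartile (k = 1, 2, 3) at 1-based position p*n + (1 - p), p = k/4."""
    x = sorted(values)
    n = len(x)
    p = 0.25 * k
    position = p * n + (1 - p)        # e.g. 0.25n + 0.75 for Q1
    lower = math.floor(position)
    fraction = position - lower
    if lower >= n:                    # position falls on the last observation
        return x[-1]
    # interpolate between the observations at positions lower and lower + 1
    return x[lower - 1] + fraction * (x[lower] - x[lower - 1])

data = [11, 12, 15, 17, 21, 32]
print([quartile_by_position(data, k) for k in (1, 2, 3)])  # [12.75, 16.0, 20.0]
print(quantiles(data, n=4, method="inclusive"))            # [12.75, 16.0, 20.0]
```

For the same six values the method of medians gives Q1 = 12 and Q3 = 21, so the two approaches can disagree on small data sets.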

Percentiles and Quartiles Central Tendency Using Quartiles • Some robust measures of central tendency and dispersion using quartiles are:

Midhinge • Formula: $\frac{Q_1 + Q_3}{2}$ • Excel: =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3)) • Pro: Robust to extreme values. • Con: Less familiar to most people.

Percentiles and Quartiles Dispersion Using Quartiles

Midspread (Interquartile range) • Formula: Q3 – Q1 • Excel: =QUARTILE(Data,3)-QUARTILE(Data,1) • Pro: Stable when extreme data values exist. • Con: Ignores magnitude of extreme data values.

Coefficient of quartile variation (CQV) • Formula: $100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$ • Excel: None • Pro: Relative variation in percent so we can compare data sets. • Con: Less familiar to nonstatisticians.

Five Number Summary and Shape of Distribution • Five Number Summary consists of: Lowest, Q1, Median, Q3, Highest • The IQR spans Q1 to Q3; the range spans Lowest to Highest. • In a right-skewed distribution: (max – Q2) > (Q2 – min) and (max – Q3) > (Q1 – min) • In a left-skewed distribution: (max – Q2) < (Q2 – min) and (max – Q3) < (Q1 – min)

Box Plots • A useful tool of exploratory data analysis (EDA). • Also called a box-and-whisker plot. • Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax • Consider the five-number summary for the 68 P/E ratios: Xmin = 7, Q1 = 14, Q2 = 19, Q3 = 26, Xmax = 91

Box Plots

[Figure: annotated box plot with labels for the whiskers, box, minimum, median (Q2), and maximum. The center of the box is the midhinge. The distribution shown is right-skewed.]

Box Plots Fences and Unusual Data Values • Use quartiles to detect unusual data points. • These points are called fences and can be found using the following formulas:

Inner fences • Lower fence: Q1 – 1.5 (Q3–Q1) • Upper fence: Q3 + 1.5 (Q3–Q1)

Outer fences • Lower fence: Q1 – 3.0 (Q3–Q1) • Upper fence: Q3 + 3.0 (Q3–Q1)

• Values outside the inner fences are unusual while those outside the outer fences are outliers.

Box Plots Fences and Unusual Data Values • For example, consider the P/E ratio data:

Inner fences • Lower fence: 14 – 1.5 (26–14) = −4 • Upper fence: 26 + 1.5 (26–14) = +44

Outer fences • Lower fence: 14 – 3.0 (26–14) = −22 • Upper fence: 26 + 3.0 (26–14) = +62

• Ignore the lower fence since it is negative and P/E ratios are only positive.

Box Plots Fences and Unusual Data Values • Truncate the whisker at the fences and display unusual values and outliers as dots.

[Figure: box plot of the P/E ratios with the inner and outer fences marked; the dots beyond them are labeled Unusual and Outliers.]

• Based on these fences, there are three unusual P/E values and two outliers.
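The fence calculation and the resulting counts can be checked with a short Python sketch. The helper names and the label "within fences" are assumptions; the five values tested are the largest P/E ratios in the sample.

```python
def fences(q1, q3):
    """Inner and outer fences as (lower, upper) pairs."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify(value, q1, q3):
    """Label a value by which fences it falls outside."""
    f = fences(q1, q3)
    if value < f["outer"][0] or value > f["outer"][1]:
        return "outlier"
    if value < f["inner"][0] or value > f["inner"][1]:
        return "unusual"
    return "within fences"  # assumed label

pe_fences = fences(14, 26)
print(pe_fences)  # {'inner': (-4.0, 44.0), 'outer': (-22.0, 62.0)}
labels = {x: classify(x, 14, 26) for x in (45, 48, 55, 68, 91)}
print(labels)
```

The values 45, 48, and 55 fall between the inner and outer upper fences, while 68 and 91 lie beyond the outer fence, matching the three unusual values and two outliers.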

Grouped Data Nature of Grouped Data • Although some information is lost, grouped data are easier to display than raw data. • When bin limits are given, the mean and standard deviation can be estimated. • Accuracy of grouped estimates depends on - the number of bins - distribution of data within bins - bin frequencies

Grouped Data Mean and Standard Deviation • Consider the frequency distribution for prices of Lipitor® for three cities:

• Where mj = class midpoint k = number of classes

fj = class frequency n = sample size

Grouped Data Nature of Grouped Data

• Estimate the mean and standard deviation by

$\bar{x} = \sum_{j=1}^{k} \frac{f_j m_j}{n} = \frac{3427.5}{47} = 72.92553$

$s = \sqrt{\sum_{j=1}^{k} \frac{f_j (m_j - \bar{x})^2}{n-1}} = \sqrt{\frac{2091.48936}{47-1}} = 6.74293$

• Note: don’t round off too soon.