Describing distributions
We describe distributions for quantitative data in terms of shape, location and spread.
Shape
See this from a histogram, stemplot or bar chart. A distribution may be unimodal (a single peak) or bimodal (two peaks).
Unimodal distributions may be symmetric or skewed.
Symmetric: the two halves of the distribution are roughly mirror images of each other. (NB: not necessarily symmetric about zero.)
Right (positively) skewed: the longer tail is on the right.
Left (negatively) skewed: the longer tail is on the left.
Note we are looking at overall shape, not at fine detail.
Example: Difference in reaction times before and after drinking a double whisky. [Histogram of the differences, with 'Difference' on the horizontal axis.]
Outliers
Newcomb's data have 2 outliers. He decided to exclude the most extreme one (−44) and calculate the speed of light from the other 65 observations.
It can be dangerous to ignore outliers: they may be telling us something important. They should be checked to make sure they are not recording errors. If not, further investigation may reveal some reason for the outlying observations.
Location
Around what value is the distribution centred? There are several measures of central tendency.
(i) The mean of observations $x_1, x_2, \ldots, x_n$ is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
It is very sensitive to outliers: one extreme observation can have a large effect on x̄.
Mean of 10, 20, 30, 40, 50 is 30.
Mean of 10, 20, 30, 40, 150 is 50.
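As a minimal illustration (not from the original slides), the sample mean and its sensitivity to a single extreme value can be checked in a few lines of Python:

```python
def mean(xs):
    """Sample mean: sum of the observations divided by their number."""
    return sum(xs) / len(xs)

print(mean([10, 20, 30, 40, 50]))    # 30.0
print(mean([10, 20, 30, 40, 150]))   # 50.0  (one extreme value shifts the mean a lot)
```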
Because it is influenced by extreme observations, we say the mean is not resistant.
The mean of skewed data is pulled towards the long tail, so it may not be a good description of the location of the bulk of the data. (Example from a skewed histogram: mean = 30.013, median = 26.806.)
(ii) The median is the point with half the observations above it, half below.
To find the median M, order the observations from smallest to largest, $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
If n is odd, M is the middle observation, the value at position (n + 1)/2:
$$M = x_{\left(\frac{n+1}{2}\right)}$$
If n is even, M is halfway between the observations at positions n/2 and (n/2) + 1:
$$M = \tfrac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$$
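A short Python sketch of this odd/even rule (an illustrative addition, not part of the original notes):

```python
def median(xs):
    """Median via the odd/even position rule described above."""
    s = sorted(xs)                    # x_(1) <= x_(2) <= ... <= x_(n)
    n = len(s)
    if n % 2 == 1:                    # n odd: the middle observation
        return s[(n + 1) // 2 - 1]    # position (n+1)/2, shifted to a 0-based index
    mid = n // 2                      # n even: average the two middle observations
    return (s[mid - 1] + s[mid]) / 2

print(median([22, 25, 34, 35, 41, 41, 46]))   # 35 (n = 7, odd)
```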
The median is easy to find from a stemplot.
Example: Petrol consumption (mpg). Stemplot (leaves in tenths):
27, 28 | 3
29, 30 | 7
31, 32 | 4 4 8
33, 34 | 3 4
35, 36 | 1 3 4 5
37, 38 | 2
n = 12, x(6) = 33.3, x(7) = 33.4, so M = (33.3 + 33.4)/2 = 33.35 mpg.
The median is a resistant measure of location.
Median of 10, 20, 30, 40, 50 is 30.
Median of 10, 20, 30, 40, 150 is 30.
The median is easy to find for small samples, but for large samples ordering the data is time consuming. The mean is quicker to calculate.
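To make the comparison concrete, here is a small check (illustrative only) using Python's standard-library statistics module:

```python
import statistics

a = [10, 20, 30, 40, 50]
b = [10, 20, 30, 40, 150]   # same data but with one extreme observation

print(statistics.mean(a), statistics.median(a))   # 30, 30
print(statistics.mean(b), statistics.median(b))   # 50, 30  (the median is unchanged)
```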
(iii) The trimmed mean is a more resistant measure than the mean. To calculate the 100p% trimmed mean, for 0 < p < 0.5, discard the smallest 100p% and the largest 100p% of the observations and compute the mean of the remaining 100(1 − 2p)%.
Example: Newcomb light data. The mean is x̄ = 26.21, but we know the data include outliers. The sample size is n = 66, so for the 5% trimmed mean we discard the largest and smallest 66 × 0.05 = 3.3 observations. That is, discard the smallest 3 values (−44, −2, 16) and the largest 3 values (37, 39, 40), then compute the mean of the remaining 60 values.
5% trimmed mean = 27.40 (in coded units)
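A rough Python sketch of this procedure, assuming the observations are held in a list called data (variable names are illustrative):

```python
def trimmed_mean(data, p):
    """100p% trimmed mean: drop the smallest and largest 100p% of the
    observations (rounded down to whole observations) and average the rest."""
    s = sorted(data)
    k = int(len(s) * p)              # e.g. 66 * 0.05 = 3.3, so discard 3 from each end
    kept = s[k:len(s) - k] if k > 0 else s
    return sum(kept) / len(kept)

# Usage sketch: trimmed_mean(newcomb, 0.05) would drop the three smallest and
# three largest of the 66 coded observations before averaging the remaining 60.
```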
Which measure?
It depends on the shape of the distribution, and on what we're interested in. The mean is not resistant, but it contains more information than the median, since it is computed from all the values.
(a) Symmetric data: Mean = Median. (Example histogram: mean = 20, median = 20.)
(b) Positively skewed data: Mean > Median. (Example: mean = 9.1, median = 7.7.)
(c) Negatively skewed data: Mean < Median. (Example: mean = 0.75, median = 0.77.)
In general, the mean is used for approximately symmetric distributions (easier to calculate, contains more information) and the median for skewed data. It can be misleading to use the 'wrong' measure, e.g. for an income distribution; see the Gismo handout.
For a bimodal distribution, the mean and median may both be misleading, having values which rarely occur.
Example: ages of cyclists in fatal accidents (USA, 2009). [Histogram of frequency per 10-year interval against age, 0–90 years.]
The mean age of cyclists killed in accidents was x̄ = 41 years, but this is not very informative. It is best to say where the modes (peaks) occur: the modal age groups are 10–15 years and 45–54 years.
Spread
Again, there are several measures.
(i) The standard deviation (SD) is a measure of spread about the mean. It is the square root of the variance.
Variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
SD:
$$s = \sqrt{\text{Variance}}$$
(n is the number of observations, x̄ is the sample mean.)
When all observations have the same value (no spread), then s = 0. Otherwise s > 0.
It can be shown that
$$s^2 = \frac{1}{n-1}\left[\sum x_i^2 - \frac{1}{n}\Bigl(\sum x_i\Bigr)^2\right].$$
This is a better formula to use when computing by calculator.
Example: Petrol consumption data. n = 12, Σxᵢ = 398.8, Σxᵢ² = 13342.74.
$$s^2 = \frac{13342.74 - \frac{1}{12}(398.8)^2}{12 - 1} = 8.117 \text{ (mpg)}^2, \qquad s = \sqrt{8.117} = 2.85 \text{ mpg}$$
Standard deviation, unlike variance, is in the same units as the original observations.
The University exam calculator will give s and s² directly.
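A quick Python check of this computation (illustrative; the petrol totals are those quoted above):

```python
import math

n = 12
sum_x = 398.8        # sum of the 12 petrol consumption values (mpg)
sum_x2 = 13342.74    # sum of their squares

var = (sum_x2 - sum_x**2 / n) / (n - 1)   # computational formula for s^2
sd = math.sqrt(var)
print(round(var, 3), round(sd, 2))        # 8.117 (mpg)^2 and 2.85 mpg
```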
The deviations (xᵢ − x̄) must sum to zero (by definition of x̄), so we only have (n − 1) independent deviations. We say there are n − 1 degrees of freedom. Hence the division by (n − 1), rather than n, in calculating s².
The SD is only used when the mean is used as the measure of location. The SD is not a resistant measure of spread.
(ii) The range is the difference between the largest and smallest values, x(n) − x(1). It is not resistant, and not very useful.
(iii) The interquartile range (IQR) is the range of the middle 50% of the data.
The rth percentile is the value with r% of the observations at or below it.
The median is the 50th percentile, observation (n + 1)/2.
The lower quartile Q1 is the 25th percentile, observation (n + 1)/4.
The upper quartile Q3 is the 75th percentile, observation 3(n + 1)/4.
IQR = Q3 − Q1.
The IQR is resistant, and is often used when the median is used as the measure of location.
Example: 22, 25, 34, 35, 41, 41, 46. Here n = 7, so
M = x(4) = 35, Q1 = x(2) = 25, Q3 = x(6) = 41, IQR = 41 − 25 = 16.
Example: Newcomb light data, n = 66.
$$M = \text{obs. } 33\tfrac{1}{2} = \tfrac{1}{2}\bigl(x_{(33)} + x_{(34)}\bigr) = \tfrac{1}{2}(27 + 27) = 27$$
$$Q_1 = \text{obs. } 16\tfrac{3}{4} = \tfrac{1}{4}x_{(16)} + \tfrac{3}{4}x_{(17)} = \tfrac{1}{4} \times 24 + \tfrac{3}{4} \times 24 = 24$$
$$Q_3 = \text{obs. } 50\tfrac{1}{4} = \tfrac{3}{4}x_{(50)} + \tfrac{1}{4}x_{(51)} = \tfrac{3}{4} \times 31 + \tfrac{1}{4} \times 31 = 31$$
Interquartile range = 31 − 24 = 7 (coded units)
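A Python sketch of the '(n + 1) position' rule for percentiles used above (illustrative only; note that software packages often use slightly different quartile conventions):

```python
def quantile(xs, p):
    """Value at position (n + 1) * p in the ordered data, interpolating
    linearly between neighbouring order statistics."""
    s = sorted(xs)
    n = len(s)
    pos = (n + 1) * p        # 1-based position, possibly fractional
    k = int(pos)             # whole part of the position
    frac = pos - k           # fractional part used for interpolation
    if k < 1:
        return s[0]
    if k >= n:
        return s[-1]
    return (1 - frac) * s[k - 1] + frac * s[k]

data = [22, 25, 34, 35, 41, 41, 46]
q1, m, q3 = quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.75)
print(q1, m, q3, q3 - q1)   # Q1 = 25, M = 35, Q3 = 41, IQR = 16
```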
Boxplot
A boxplot is another simple way to display data. The box extends from the lower quartile to the upper quartile. In its simplest form, the whiskers extend from the smallest value to the largest value. The line within the box indicates the median value.
Example: 22, 25, 34, 35, 41, 41, 46. From above, M = 35, Q1 = 25, Q3 = 41. [Boxplot of these values, drawn on a 0–50 scale.]
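For example, a boxplot of this simplest kind could be drawn with matplotlib (a sketch, assuming matplotlib is installed; whis=(0, 100) makes the whiskers run to the minimum and maximum rather than applying an outlier rule):

```python
import matplotlib.pyplot as plt

data = [22, 25, 34, 35, 41, 41, 46]
plt.boxplot(data, whis=(0, 100))   # box = quartiles, line = median, whiskers = min/max
plt.ylim(0, 50)
plt.show()
```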
Boxplots show location, spread and skewness (lack of symmetry). [Example boxplots of symmetric, right-skewed and left-skewed samples, drawn on a common 0–30 scale.]
Boxplots do not show whether the data are unimodal or not, so you should really only use a boxplot if you believe the data are unimodal.
Boxplots are particularly useful for comparing samples: side-by-side boxplots allow us to compare many samples at once.
Often, outliers are separated off and plotted as individual points on the boxplot. Various rules exist for deciding what counts as an 'outlier'.
One popular rule for separating off outliers is as follows. The INNER FENCES are at a distance 1.5 × box length above or below the appropriate quartile. Any data values outside the inner fences are regarded as outliers, and are drawn individually on the plot. WHISKERS then extend to the largest and smallest data values which fall within the inner fences.
Example: Newcomb light data. n = 66, M = 27, Q1 = 24, Q3 = 31, IQR = 7.
Lower inner fence = 24 − 1.5 × 7 = 13.5
Upper inner fence = 31 + 1.5 × 7 = 41.5
The data values outside the inner fences are −44 and −2. These outlying values are plotted as separate points, and the whiskers extend to the smallest and largest of the remaining data values, that is, to 16 and 40.
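A small Python sketch of this outlier rule (illustrative; the quartiles are those computed above for the coded Newcomb data):

```python
q1, q3 = 24, 31                  # lower and upper quartiles (coded units)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr     # 13.5
upper_fence = q3 + 1.5 * iqr     # 41.5

# Any observation outside the fences is plotted individually as an outlier.
def outliers(data):
    return [x for x in data if x < lower_fence or x > upper_fence]

# e.g. outliers(newcomb) would return [-44, -2] for the coded Newcomb observations.
```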
Changing the units of measurement
A linear transformation $y_i = a + bx_i$ results in:
(a) Shape: no change.
(b) Location: mean, median and quartiles are all transformed in the same way as an individual observation, e.g. $\bar{y} = a + b\bar{x}$.
(c) Spread: SD and IQR are multiplied by |b|, e.g. $s_y = |b|\,s_x$.
Example: Celsius to Fahrenheit. x = °C, y = °F, a = 32, b = 9/5.
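As a quick illustration (not from the original notes), the effect on the mean and SD can be verified numerically for the Celsius-to-Fahrenheit conversion, using hypothetical temperatures:

```python
import statistics

celsius = [12.0, 15.5, 18.0, 21.0, 25.5]         # hypothetical temperatures in degrees C
fahrenheit = [32 + 9 / 5 * c for c in celsius]   # y = a + b*x with a = 32, b = 9/5

# Location transforms like an individual observation; spread is scaled by |b|.
print(statistics.mean(fahrenheit), 32 + 9 / 5 * statistics.mean(celsius))   # equal (up to rounding)
print(statistics.stdev(fahrenheit), 9 / 5 * statistics.stdev(celsius))      # equal (up to rounding)
```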