Understanding Your Data 1: Normal Distributions and Patterns of Dispersion
BML224: Data Analysis for Research
Aims: Last time
Aims:
• Understand the theory and assumptions relating to the distribution and variance of data
• Use SPSS to calculate measures of dispersion, including the median, range, standard deviation
• Use SPSS graphically to illustrate the distribution of the data through the use of charting elements such as frequency histograms, scatter plots and box plots
Aims: or simply clarifying what all this extra information means!
Descriptive Statistics – The Mean
Features of the Mean
• It makes use of every value in the distribution, leading to a mathematical exactness which is useful for further mathematical processing
• It can be determined if only the total value of the items and the number of items are known, without knowing individual values
• It can be distorted by extreme values in the distribution
• For a discrete distribution, the mean may be an ‘impossible’ figure e.g. average number of children per family = 2.4
Descriptive Statistics – The Median
Calculating the Median
A data series with an uneven number of items: 1, 2, 2, 4, 7, 7, 10
Median = 4 (the middle item of the ordered series)
Descriptive Statistics – The Median
Calculating the Median
A data series with an even number of items: 1, 2, 2, 4, 7, 7, 10, 12
Median = (4 + 7) / 2 = 5.5 (the mean of the two middle items)
Descriptive Statistics – The Median
Features of the Median
• Half the items in the series will have a value greater than or equal to the median and half less than or equal to the median
• It is therefore a measure of rank or position
• It is unaffected by the presence of extreme items in the distribution
Descriptive Statistics – The Mode
Calculating the Mode
Series 1: 1, 2, 2, 2, 7, 7, 10, 12
Single mode: 2. Data set is Unimodal
Descriptive Statistics – The Mode
Calculating the Mode
Series 2: 1, 2, 2, 2, 7, 7, 7, 10
Two modes: 2 and 7. Data set is Bimodal
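The mode calculation can be sketched in Python as a cross-check on the two series. This is a minimal illustration using only the standard library (Python 3.8 or later); the helper name describe_modes is an assumption made for this example and is not part of SPSS or the course materials.

```python
from statistics import multimode

# The two series from the slides
series_1 = [1, 2, 2, 2, 7, 7, 10, 12]
series_2 = [1, 2, 2, 2, 7, 7, 7, 10]


def describe_modes(values):
    """Return every value tied for the highest frequency, plus a label."""
    modes = multimode(values)  # all most-frequent values, in order of first appearance
    labels = {1: "unimodal", 2: "bimodal"}
    return modes, labels.get(len(modes), "multimodal")


print(describe_modes(series_1))  # ([2], 'unimodal')
print(describe_modes(series_2))  # ([2, 7], 'bimodal')
```

The more values multimode returns, the less the mode says about a typical value.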
Descriptive Statistics – The Mode
Features of the Mode
• For discrete data it is an actual single value
• For continuous data it is the point of highest frequency density, but it is not well suited to continuous data as different values each constitute a potential mode
• It may not be unique or clearly defined: the more modes there are, the less useful it is
• Extreme items do not affect its value
• It cannot be used for further mathematical processing
• It requires arrangement of the data, which may be time consuming
Mean, Median, Mode
Which to Use*?
• A measure of location must convey the distribution in a single figure
• It is therefore important to pick the right figure
• This depends on:
  • The type of data being used
  • The shape of the distribution (dispersion)
  • Whether the average will be the basis for further work on the data
[*Source: Buglear, 2000]
Mean, Median, Mode
Which to Use?
• Mode – when data is not numerical e.g. favourite cereal
• Median – where there are outliers
• Mean – where there are no outliers
[*Source: Buglear, 2000]
Descriptive Statistics
Which measure to use?
• 1, 2, 3, 4, 5, 6, 7, 8, 9, 10: Mean = 5.5, Median = 5.5
• 1, 2, 3, 4, 5, 6, 7, 8, 9, 20: Mean = 6.5, Median = 5.5
• 1, 2, 3, 4, 5, 6, 7, 8, 9, 100: Mean = 14.5, Median = 5.5
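The effect of a single extreme value on the mean, and its lack of effect on the median, can be reproduced with a short Python sketch. It uses only the standard library; the dictionary labels are illustrative names chosen for this example.

```python
from statistics import mean, median

# The three series from the slide; only the last value changes
series = {
    "no outlier": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "mild outlier": [1, 2, 3, 4, 5, 6, 7, 8, 9, 20],
    "extreme outlier": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
}

# Map each label to (mean, median)
summary = {name: (mean(values), median(values)) for name, values in series.items()}

for name, (avg, mid) in summary.items():
    print(f"{name}: mean = {avg}, median = {mid}")
```

The mean moves from 5.5 to 14.5 as the last value grows, while the median stays at 5.5, which is why the median is preferred where there are outliers.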
Descriptive Statistics
Which measure to use? [Turn to page 2-34]
• 1, 23, 25, 26, 27, 23, 29, 30: Median
• 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 5: Mode
• 1, 1, 2, 3, 4, 1, 2, 6, 5, 8, 3, 4, 5, 6, 7: Mean
• 1, 101, 104, 106, 111, 108, 109, 200: Median
Summary Measures
• Very powerful tool of analysis
• Each measure summarises a characteristic of the data into a single number
• Three different categories of measures:
  • CENTRAL TENDENCY (or LOCATION)
  • DISPERSION (or SCALE)
  • SKEWNESS (or SHAPE)
Distribution of Data
[Figure: Histogram of JJA temp (x-axis: JJA temp, 14 to 20; y-axis: Frequency, 0 to 30), annotated with Central Tendency, Dispersion and Skewness]
The Normal Distribution Curve
[Figure: normal distribution curve (y-axis: Frequency), Mean = 100, with 50% of cases on either side of the mean; Mean, Median and Mode coincide]
The Normal Distribution Curve
• Although these curves are different shapes they all have a normal distribution
The Normal Distribution Curve
[Figures: distribution curves for Sample size=10 and Sample size=100]
Kurtosis
Flatness or peakedness of the distribution (p. 3-98)
[Figure: Platykurtic, Leptokurtic and Mesokurtic curves]
Skewness
[Figure: three frequency curves (f against x): Positive Skewed Distribution, Symmetrical Distribution, Negative Skewed Distribution]
• Positive skew: scores cluster at the lower end and the tail points towards higher scores
• Negative skew: scores cluster at the higher end and the tail points towards lower values
Skewness and Kurtosis
To check if a distribution is normal we look at the values of Kurtosis and Skewness:
• Positive values of kurtosis indicate a tall, pointy peak and negative values indicate a flat, light-tailed distribution
• Positive values of skewness indicate too many low scores (left-hand side of graph) and negative values indicate too many high scores (right-hand side of graph)
• The further the value is from ZERO, the less likely it is that the data are normally distributed
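The sign conventions above can be checked with a small Python sketch. It assumes the simple moment-based definitions of skewness and excess kurtosis; SPSS applies small-sample corrections, so its reported values will differ slightly, although the signs should agree. The function names and the example data are illustrative assumptions.

```python
from statistics import fmean


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])
    m3 = fmean([(x - m) ** 3 for x in values])
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal distribution scores 0."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])
    m4 = fmean([(x - m) ** 4 for x in values])
    return m4 / m2 ** 2 - 3


symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
positively_skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]       # long tail towards high scores
negatively_skewed = [101 - x for x in positively_skewed]   # mirror image: tail towards low scores

for name, data in [("symmetric", symmetric),
                   ("positive skew", positively_skewed),
                   ("negative skew", negatively_skewed)]:
    print(f"{name}: skewness = {skewness(data):.2f}, excess kurtosis = {excess_kurtosis(data):.2f}")
```

The evenly spread symmetric series has zero skewness and a negative excess kurtosis, i.e. it is flatter than a normal curve.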
Skewness
Note:
• In a symmetric distribution: Mean=Median=Mode
• In a positively skewed distribution, typically: Mean>Median>Mode
• In a negatively skewed distribution, typically: Mean<Median<Mode
[Figure: Symmetrical, Positive Skewed and Negative Skewed distributions (f against x) with the positions of the Mean, Median and Mode marked]
Examples of Distributions
• In a normal distribution – kurtosis = 0
• Positive values indicate the distribution is leptokurtic - kurtosis > 0
• Negative values indicate the distribution is platykurtic - kurtosis < 0
• The greater the standard deviation the flatter the curve
Examples of Distributions
• In a normal distribution – Skewness = 0 (the distribution is symmetrical)
• Positive values indicate the distribution is positively skewed
• Negative values indicate the distribution is negatively skewed
Graphically Illustrating the Distribution of Data
Histograms
Graphically Describing Data
• Using the Split File option or the Factor List
• Adding the Normal Distribution Curve in Graph Builder or using Charts in Frequencies
Interpreting Boxplots
• Median: approx 50
• Upper value of the interquartile range: approx 79
• Lower value of the interquartile range: approx 39
• The box represents the middle 50% of the data, or interquartile range: approx 41
• Highest value (upper extreme): approx 121
• Lowest value (lower extreme): approx 18
• The range: approx 103
• A point plotted on its own represents an extreme value or outlier
• The vertical line (whisker) extends up to 1.5 times the box height from the edge of the box
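The quantities read off a boxplot can also be computed directly. The following Python sketch is an illustration under stated assumptions: the data are hypothetical values chosen to echo the approximate figures above, the function name boxplot_summary is invented for this example, and the quartiles use the standard library's "inclusive" method, which may differ slightly from the method SPSS uses.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Quartiles, box height (IQR), whisker ends and outliers using the 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # height of the box
    lower_fence = q1 - 1.5 * iqr       # furthest a whisker may reach downwards
    upper_fence = q3 + 1.5 * iqr       # furthest a whisker may reach upwards
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical data, not taken from the course data set
scores = [18, 39, 39, 45, 50, 60, 79, 121, 200]
summary = boxplot_summary(scores)
print(summary)
```

With these values the box runs from 39 to 79, the whiskers stop at 18 and 121, and 200 lies beyond the upper fence of 139, so it would be plotted individually as an outlier.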
Graphically Describing Data
• Using the Explore dialog box and using Area in the Factor List
Dispersion
• Grade Profiles for Assessment Components BML224 - 2009
Measures of Dispersion: The Range
Dispersion: Range
Two groups of 5 business and management students were asked to record weekly beer consumption:
• Group 1: 12, 12, 12, 12, 12
• Group 2: 0, 5, 10, 15, 30
The mean for both groups is 12, but this provides no indication of the differences between the two samples or of the level of dispersion
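A short Python sketch makes the point concrete: the two groups share a mean but have very different ranges. The helper name value_range is an assumption made for this illustration.

```python
from statistics import mean

group_1 = [12, 12, 12, 12, 12]
group_2 = [0, 5, 10, 15, 30]


def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)


for name, group in [("Group 1", group_1), ("Group 2", group_2)]:
    print(f"{name}: mean = {mean(group)}, range = {value_range(group)}")
```

Both means are 12, but the range is 0 for Group 1 and 30 for Group 2.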
Measures of Dispersion: The Standard Deviation
Standard Deviation
• An estimate of the average variability (spread) of a set of data, measured in the same units of measurement as the original data
• Standard deviation measures the dispersion around the average, but does so on the basis of the figures themselves, not just the rank order
• It is calculated from the deviations of each item from the arithmetic mean
Standard Deviation
• C. Andrews Brewery Tours - Salaries (£k)
  • £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
  • Mean = £23k (£184k/8)
  • Standard Deviation = 9.68
• Global Heritage Tours (£k)
  • £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
  • Mean = £23k (£184k/8)
  • Standard Deviation = 2.00
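The two standard deviations quoted for the salary data can be reproduced with the sample formula, which divides the sum of squared deviations by n - 1. The Python sketch below is illustrative only; the function name sample_sd is an assumption, and the standard library's stdev is included as a cross-check.

```python
from statistics import mean, stdev

brewery_salaries = [11, 15, 15, 18, 25, 30, 32, 38]    # £k
heritage_salaries = [20, 21, 22, 23, 23, 24, 25, 26]   # £k


def sample_sd(values):
    """Square root of the sum of squared deviations from the mean, divided by n - 1."""
    m = mean(values)
    squared_deviations = sum((x - m) ** 2 for x in values)
    return (squared_deviations / (len(values) - 1)) ** 0.5


for name, salaries in [("C. Andrews Brewery Tours", brewery_salaries),
                       ("Global Heritage Tours", heritage_salaries)]:
    print(f"{name}: mean = {mean(salaries)}, s.d. = {sample_sd(salaries):.2f}")
    # The library function uses the same n - 1 definition
    assert abs(sample_sd(salaries) - stdev(salaries)) < 1e-9
```

Both firms have a mean salary of £23k, but the standard deviations (9.68 against 2.00) show how much more widely the brewery salaries are spread.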
Standard Deviation
• ‘Quoting the standard deviation of a distribution is a way of indicating a kind of ‘average’ amount by which all the values deviate from the mean. The greater the dispersion the bigger the deviations and the bigger the standard (average) deviation’ [Rowntree, 1981, p. 54, cited in Riley, M. et al, 1998, p. 197]
The Normal Distribution Curve
[Figure: normal distribution curve (y-axis: Frequency), Mean = 100, showing 68% of cases within ±1 S.D., 95% within ±2 S.D. and 99% within ±3 S.D. of the mean]
The Normal Distribution Curve
Example: Sample of 20 men undertaking physical exercise as part of a research programme for Sports and Exercise Science
Measurements taken:
• Mean Heartbeat [beats per minute/bpm] = 123bpm
• Standard Deviation = 18bpm
The Normal Distribution Curve
[Figure: normal distribution curve (y-axis: Frequency), Mean = 123bpm, showing 68% of cases within ±1 S.D., 95% within ±2 S.D. and 99% within ±3 S.D. of the mean]
The Normal Distribution Curve
• About 68% of men had a heart rate between the mean minus 1 s.d. and the mean plus 1 s.d.; i.e. 123-(1*18) and 123+(1*18), or between 123-18 and 123+18; that is between 105 and 141 bpm
• About 95% of men had a heart rate between the mean minus 2 s.d. and the mean plus 2 s.d.; i.e. 123-(2*18) and 123+(2*18), or between 123-36 and 123+36; that is between 87 and 159 bpm
• About 99% of men had a heart rate between the mean minus 3 s.d. and the mean plus 3 s.d.; i.e. 123-(3*18) and 123+(3*18), or between 123-54 and 123+54; that is between 69 and 177 bpm
[Figure: normal distribution curve (y-axis: Frequency; x-axis: bpm), Mean = 123, with the 68% band from 105 to 141, the 95% band from 87 to 159 and the 99% band from 69 to 177]
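The heart-rate bands can be reproduced with a few lines of Python. This is a sketch assuming the data are normally distributed; the function name sd_bands is invented for this example. The second part uses the standard library's NormalDist to show the proportions normal theory gives for 1, 2 and 3 standard deviations, which the slides round to 68%, 95% and 99%.

```python
from statistics import NormalDist


def sd_bands(mean, sd, multiples=(1, 2, 3)):
    """Return {k: (mean - k*sd, mean + k*sd)} for each multiple k."""
    return {k: (mean - k * sd, mean + k * sd) for k in multiples}


bands = sd_bands(mean=123, sd=18)   # heart-rate example: mean 123bpm, s.d. 18bpm

# Proportion of a normal distribution lying within k standard deviations of the mean
standard_normal = NormalDist()
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, (low, high) in bands.items():
    print(f"±{k} s.d.: {low} to {high} bpm (about {coverage[k]:.1%} of cases)")
```

The printed proportions are roughly 68.3%, 95.4% and 99.7%, so the rounded figures on the slides are close approximations.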
The Normal Distribution Curve - Height
• 68% of the class have a height between the mean minus 1 s.d. and the mean plus 1 s.d.; i.e. 172-10 and 172+10, that is between 162cm and 182cm
• 95% of the class have a height between the mean minus 2 s.d. and the mean plus 2 s.d.; i.e. 172-(2*10) and 172+(2*10), or between 172-20 and 172+20; that is between 152cm and 192cm
• 99% of the class have a height between the mean minus 3 s.d. and the mean plus 3 s.d.; i.e. 172-(3*10) and 172+(3*10), or between 172-30 and 172+30; that is between 142cm and 202cm
Practical Exercise:
• Working in groups of 2, I would like you to analyse and present a brief statistical overview of one ratio data set (e.g. Profit10)
• As part of your presentation you must prepare a short 2-3 minute PowerPoint overview, in which you must:
  • Use descriptive statistics and appropriate charts/plots to examine your chosen variable
  • Examine relationships with other variables (e.g. average turnover by area or by town) using appropriate command options in SPSS (e.g. Split File/Factor List etc)
• You are free to cut and paste SPSS output into your PowerPoint presentation
• You must also provide a clear rationale for your chosen line of enquiry
The Standard Error
Standard Error
• Field (2003) examining ratings of lecturers (where population mean = 3)
• Each sample has a mean and this is represented in a frequency chart
• Some means are higher and some lower than the population mean = sampling variation
Standard Error
• The histogram = sampling distribution
• The histogram is centred on the population mean (3); therefore the average of all the sample means would give the value of the population mean
• Looking at the standard deviation between the sample means would give a measure of the degree of variability between the different samples
Standard Error
• The standard deviation of the sample means is the standard error of the mean and a measure of how representative a sample is likely to be of the population
Standard Error
• Standard deviation is useful in comparing distributions of different samples
• Use the standard error to judge how likely it is that the mean of your sample differs from the mean of the population
• Standard error accounts for sample size:
$\mathrm{SE}(\text{Mean}) = \dfrac{s}{\sqrt{n}}$, where $s$ is the Standard Deviation of the Sample and $n$ is the Sample Size
Standard Error - Example
Visitor Spending of Short Break Holiday Makers in Chichester
Measurements taken:
• Sample Size = 10
• Mean Value = £127
• Standard Deviation = 29.809
• Standard Error = 9.43
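The standard error quoted here follows from dividing the sample standard deviation by the square root of the sample size. A minimal Python sketch, with standard_error as an illustrative function name:

```python
from math import sqrt


def standard_error(sample_sd, sample_size):
    """Standard error of the mean: sample standard deviation / sqrt(sample size)."""
    return sample_sd / sqrt(sample_size)


# Chichester visitor spending example: s = 29.809, n = 10
se = standard_error(29.809, 10)
print(f"Standard Error = {se:.2f}")   # 9.43
```

Because n sits under a square root, quadrupling the sample size is needed to halve the standard error when the standard deviation stays the same.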
Standard Error - Example
• 68% - Visitor Spending: 127 ± 9.43 = 117.57 to 136.43 [Mean ± 1*Standard Error]
• 95% - Visitor Spending: 127 ± 18.86 = 108.14 to 145.86 [Mean ± 2*Standard Error]
• 99% - Visitor Spending: 127 ± 28.29 = 98.71 to 155.29 [Mean ± 3*Standard Error]
[Figure: sampling distribution curve (y-axis: Frequency; x-axis: Visitor Spending), Mean = 127, with the 68% band from 117.57 to 136.43, the 95% band from 108.14 to 145.86 and the 99% band from 98.71 to 155.29]
Standard Error - Example
• Notice that higher confidence levels are achieved at the expense of wider confidence intervals:
  • 68%: 117.57 to 136.43 (Range = 18.86)
  • 95%: 108.14 to 145.86 (Range = 37.72)
  • 99%: 98.71 to 155.29 (Range = 56.58)
• To improve the accuracy of sample estimates the best way is to increase sample size!
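The three intervals and their widths can be reproduced by taking the mean plus or minus 1, 2 and 3 standard errors, which is the approximation used in these slides (an exact 95% interval would use a multiplier nearer 1.96, or a t-value for small samples). The function name confidence_interval is an assumption made for this Python sketch.

```python
def confidence_interval(mean, standard_error, multiplier):
    """Return (lower, upper, width) for mean ± multiplier * standard error, to 2 d.p."""
    half_width = multiplier * standard_error
    return (round(mean - half_width, 2),
            round(mean + half_width, 2),
            round(2 * half_width, 2))


# Visitor spending example: mean = 127, standard error = 9.43
intervals = {level: confidence_interval(127, 9.43, k)
             for level, k in [("68%", 1), ("95%", 2), ("99%", 3)]}

for level, (low, high, width) in intervals.items():
    print(f"{level}: {low} to {high} (Range = {width})")
```

Each step up in confidence adds one more standard error to either side of the mean, which is why the interval widens from 18.86 to 56.58.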
Standard Error - Example
Visitor Spending of Short Break Holiday Makers in Chichester
Measurements taken (original values for n = 10 shown in square brackets):
• Sample Size = 20 [10]
• Mean Value = 127 [127]
• Standard Deviation = 21.07 [29.809]
• Standard Error = 4.71 [9.43]
Standard Error - Example
• Increasing the sample size significantly reduces the width of the confidence intervals:
• Before (n=10):
  • 95%: 108.14 to 145.86 (Mean ± 2*Standard Error, where 2*9.43 = 18.86; Range = 37.72)
• After (n=20):
  • 95%: 117.58 to 136.42 (Mean ± 2*Standard Error, where 2*4.71 = 9.42; Range = 18.84)
Standard Error - Example
• Increasing the sample size significantly reduces the width of the confidence intervals:
• Before (n=10):
  • 99%: 98.71 to 155.29 (Mean ± 3*Standard Error, where 3*9.43 = 28.29; Range = 56.58)
• After (n=20):
  • 99%: 112.87 to 141.13 (Mean ± 3*Standard Error, where 3*4.71 = 14.13; Range = 28.26)
Standard Error - Example
• Summary – The Influence of Sample Size

                     n=10      n=20
95%   Lower value    108.14    117.58
      Upper value    145.86    136.42
      Range          37.72     18.84
99%   Lower value    98.71     112.87
      Upper value    155.29    141.13
      Range          56.58     28.26
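The influence of sample size can be recomputed from the sample standard deviations and sizes given in the example. The Python sketch below is illustrative; the function name interval_for_sample is an assumption, and the standard error is rounded to two decimal places before use because that is how the worked figures in the slides were produced.

```python
from math import sqrt


def interval_for_sample(mean, sample_sd, n, multiplier):
    """Return (lower, upper, width) for mean ± multiplier * SE, with SE rounded to 2 d.p."""
    se = round(sample_sd / sqrt(n), 2)        # 9.43 for n=10, 4.71 for n=20
    half_width = multiplier * se
    return (round(mean - half_width, 2),
            round(mean + half_width, 2),
            round(2 * half_width, 2))


samples = {10: 29.809, 20: 21.07}             # sample size -> sample standard deviation

table = {(n, k): interval_for_sample(127, sd, n, k)
         for n, sd in samples.items()
         for k in (2, 3)}                     # 2 = 95% level, 3 = 99% level

for (n, k), (low, high, width) in sorted(table.items()):
    print(f"n={n}, ±{k} S.E.: {low} to {high} (Range = {width})")
```

For the 95% level the width falls from 37.72 (n=10) to 18.84 (n=20); for the 99% level it falls from 56.58 to 28.26.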
Standard Error – Height Example
• Mean = 166.55
• Standard Error = 1.08
• At the 95% confidence level the population mean will be between 166.55 ± (2*S.E.), where 2*S.E. = 2.16
• Lower value = 164.38
• Upper value = 168.72
(Values as reported; computed from the rounded figures above, 166.55 ± 2.16 gives 164.39 to 168.71)
Standard Deviation v Standard Error
Standard Error and Standard Deviation are often confused
• Key Points
  • The standard deviation is a measure of the variability of the population from which the sample is drawn
  • For data with a normal distribution, about 95% of individuals will have values within 2 standard deviations of the mean
  • We use the sample mean as an estimate of the mean for the whole population from which to make wider generalisations
Standard Deviation v Standard Error
• Key Points
  • The sample mean will vary from sample to sample; this variation is described as the “sampling distribution” of the mean. The SEM is a measure of sampling error because it describes the variability among all possible means that could potentially be sampled
  • The standard error is the standard deviation of the sample means and a measure of how representative a sample is likely to be of the population
Standard Deviation v Standard Error
• The Role of Size
  • The standard error of the sample mean depends on both the standard deviation and the sample size: the standard error falls as the sample size increases, as the extent of chance variation is reduced. Less variability, more reliability!
  • We increase our confidence in a particular sample (as being representative of the population) by increasing the size of the sample: the means of large samples tend to cluster tightly around the true population mean
  • In contrast, the standard deviation will not tend to change as we increase the size of our sample
Standard Deviation v Standard Error
• Therefore if we wanted to:
  (a) comment on the level of dispersion or how widely scattered some measurements were, we make reference to the standard deviation (focus on sample means)
  (b) indicate the uncertainty around the estimate of the mean measurement, we quote the standard error of the mean (focus on population means)
Practical Exercise:
• Working in groups of 2, I would like you to analyse and present a brief statistical overview of one ratio data set (e.g. Turnover 10)
• As part of your presentation you must prepare a short 2-3 minute PowerPoint overview, in which you must:
  • Use descriptive statistics and appropriate charts/plots to examine your chosen variable
  • Examine relationships with other variables (e.g. average turnover by area or by town) using appropriate command options in SPSS (e.g. Split File/Factor List etc)
• You are free to cut and paste SPSS output into your PowerPoint presentation
• You must also provide a clear rationale for your chosen line of enquiry