Normal Distribution and Patterns of Dispersion 2016


Understanding Your Data 1: Normal Distributions and Patterns of Dispersion

BML224: Data Analysis for Research


Aims: Last time


Aims:   Understand the theory and assumptions relating to the distribution and variance of data

Use SPSS to calculate measures of dispersion, including the median, range, standard deviation

Use SPSS graphically to illustrate the distribution of the data through the use of charting elements such as frequency histograms, scatter plots and box plots


Aims: or simply clarifying what all this extra information means!


Descriptive Statistics – The Mean   Features of the Mean

It makes use of every value in the distribution, leading to a mathematical exactness which is useful for further mathematical processing

It can be determined if only the total value of the items and the number of items are known, without knowing individual values

It can be distorted by extreme values in the distribution

For a discrete distribution, the mean may be an ‘impossible’ figure e.g. average number of children per family = 2.4


Descriptive Statistics – The Median   Calculating the Median   A data series with an uneven number of items:

1, 2, 2, 4, 7, 7, 10

Median = 4 (the middle value, with three items on either side)


Descriptive Statistics – The Median   Calculating the Median   A data series with an even number of items:

1, 2, 2, 4, 7, 7, 10, 12

Median = (4 + 7) / 2 = 5.5
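The two median calculations can be checked with a short Python sketch. The helper name `manual_median` is illustrative only and is not part of the course material, which uses SPSS; the standard library's `statistics.median` is shown alongside as a cross-check.

```python
from statistics import median

odd_series = [1, 2, 2, 4, 7, 7, 10]
even_series = [1, 2, 2, 4, 7, 7, 10, 12]


def manual_median(values):
    """Sort the values, then take the middle item (odd count)
    or the average of the two middle items (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(manual_median(odd_series))   # 4
print(manual_median(even_series))  # 5.5
print(median(even_series))         # 5.5, library cross-check
```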


Descriptive Statistics – The Median   Features of the Median

Half the items in the series will have a value greater than or equal to the median and half less than or equal to the median

It is therefore a measure of rank or position

It is unaffected by the presence of extreme items in the distribution


Descriptive Statistics – The Mode   Calculating the Mode   Series 1:

1, 2, 2, 2, 7, 7, 10, 12

Single mode: 2   Data set is unimodal


Descriptive Statistics – The Mode   Calculating the Mode   Series 2:

1, 2, 2, 2, 7, 7, 7, 10

Two modes: 2 and 7   Data set is bimodal
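As a rough check of the two mode examples, the sketch below uses `statistics.multimode` from the Python standard library (Python 3.8 or later), which returns every value tied for the highest frequency. The variable names are illustrative; the slides themselves rely on SPSS.

```python
from statistics import multimode

single_mode_series = [1, 2, 2, 2, 7, 7, 10, 12]
two_mode_series = [1, 2, 2, 2, 7, 7, 7, 10]

# multimode lists all values that share the highest count,
# in the order they first appear in the data.
print(multimode(single_mode_series))  # [2]
print(multimode(two_mode_series))     # [2, 7]
```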


Descriptive Statistics – The Mode   Features of the Mode

For discrete data it is an actual single value

For continuous data it is the point of highest frequency density – but not suited to continuous data as different values each constitute a potential mode

It may not be unique or clearly defined – the more modes there are the less useful it is to use

Extreme items do not affect its value

It cannot be used for further mathematical processing

It requires arrangement of the data which may be time consuming


Mean, Median, Mode   Which to Use*?

A measure of location must convey the distribution in a single figure

It is therefore important to pick the right figure. This depends on:

The type of data being used

The shape of the distribution (dispersion)

Whether the average will be the basis for further work on the data

[*Source: Buglear, 2000]


Mean, Median, Mode   Which to Use?

Mode – when data is not numerical e.g. favourite cereal

Median – where there are outliers

Mean – where there are no outliers

[*Source: Buglear, 2000]


Descriptive Statistics   Which measure to use?

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Mean = 5.5 Median = 5.5

1, 2, 3, 4, 5, 6, 7, 8, 9, 20

Mean = 6.5 Median = 5.5

1, 2, 3, 4, 5, 6, 7, 8, 9, 100

Mean = 14.5 Median = 5.5
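The effect of a single extreme value on the mean, but not on the median, can be reproduced with a few lines of Python. This is only an illustrative sketch using the standard library's `statistics` module; the dictionary keys are labels invented here for readability.

```python
from statistics import mean, median

series = {
    "no outlier": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "mild outlier": [1, 2, 3, 4, 5, 6, 7, 8, 9, 20],
    "extreme outlier": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
}

# Only the last value changes, so the median stays at 5.5
# while the mean is pulled towards the extreme value.
for label, values in series.items():
    print(f"{label}: mean = {mean(values)}, median = {median(values)}")
```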


Descriptive Statistics   Which measure to use? [Turn to page 2-34]

1, 23, 25, 26, 27, 23, 29, 30 – Median

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 5 – Mode

1, 1, 2, 3, 4, 1, 2, 6, 5, 8, 3, 4, 5, 6, 7 – Mean

1, 101, 104, 106, 111, 108, 109, 200 – Median


Summary Measures

Very powerful tool of analysis

Each measure summarises a characteristic of the data into a single number

Three different categories of measures:   CENTRAL TENDENCY (or LOCATION)   DISPERSION (or SCALE)   SKEWNESS (or SHAPE)


Distribution of Data

[Figure: Histogram of JJA temp (x-axis: JJA temp, 14 to 20; y-axis: Frequency, 0 to 30), annotated with Central Tendency, Dispersion and Skewness]


The Normal Distribution Curve


The Normal Distribution Curve

[Figure: normal curve (y-axis: Frequency), Mean=100, with 50% of cases on either side of the mean; the mean, median and mode coincide]


The Normal Distribution Curve

•  Although these curves are different shapes they all have a normal distribution


The Normal Distribution Curve

[Figure: distribution for Sample size=10]

[Figure: distribution for Sample size=100]


Kurtosis

Flatness or peakedness – kurtosis of the distribution (p. 3-98)

[Figure: Platykurtic, Leptokurtic and Mesokurtic curves]


Skewness

[Figure: three frequency curves (f against x): Positive Skewed Distribution, Symmetrical Distribution, Negative Skewed Distribution]

Positive skew: scores cluster at the lower end and the tail points towards higher scores

Negative skew: scores cluster at the higher end and the tail points towards the lower values


Skewness and Kurtosis   To check if distribution is normal we look at values of Kurtosis and Skewness:

Positive values of kurtosis indicate tall, pointy peaks and negative values indicate a flat, light-tailed distribution

Positive values of skewness indicate too many low scores (left-hand side of graph) and negative values indicate too many high scores (right-hand side of graph)

The further the values are from ZERO, the less likely it is that the data are normally distributed
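To make the sign conventions concrete, the following sketch computes skewness and excess kurtosis from central moments in plain Python. It is an assumption-laden illustration: the function names and the three small data sets are invented here, and SPSS applies small-sample corrections, so its reported values will differ somewhat from these simple moment-based figures.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    m = sum(values) / len(values)
    return sum((x - m) ** order for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: 0 for a symmetric data set."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal curve scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
tail_to_high_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
tail_to_low_scores = [-x for x in tail_to_high_scores]

print(skewness(symmetric))            # zero: symmetric
print(skewness(tail_to_high_scores))  # positive: scores bunch at the low end
print(skewness(tail_to_low_scores))   # negative: scores bunch at the high end
print(excess_kurtosis(symmetric))     # negative: flatter than a normal curve
```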


Skewness   Note:

In a symmetric distribution: Mean=Median=Mode

In a positively skewed distribution: Mean>Median>Mode

In a negatively skewed distribution: Mean<Median<Mode


Skewness

[Figure: Symmetrical, Positive Skewed and Negative Skewed Distributions (f against x), each marking the positions of the Mean, Median and Mode]


Examples of Distributions

•  In a normal distribution – kurtosis = 0
•  Positive values indicate the distribution is leptokurtic - kurtosis > 0
•  Negative values indicate the distribution is platykurtic - kurtosis < 0
•  The greater the standard deviation the flatter the curve


Examples of Distributions

•  In a normal distribution – Skewness = 0 (the distribution is symmetrical)
•  Positive values indicate the distribution is positively skewed
•  Negative values indicate the distribution is negatively skewed



Graphically Illustrating the Distribution of Data


Histograms

Graphically Describing Data


Graphically Describing Data   Using the Split File option or the Factor List


Graphically Describing Data   Adding the Normal Distribution Curve in Graph Builder or using Charts in Frequencies


Interpreting Boxplots

Median: approx 50

Upper value of the interquartile range: approx 79

Lower value of the interquartile range: approx 39

The box represents the middle 50% of the data, or interquartile range: approx 41

Highest value (upper extreme): approx 121

Lowest value (lower extreme): approx 18

The range: approx 103

A point plotted beyond the whiskers represents an extreme value or outlier

The vertical line (whisker) extends up to 1.5 times the box height (the interquartile range) beyond the box
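The 1.5 times box height rule for whiskers can be sketched in Python using the approximate quartiles read from the boxplot (39 and 79). The names `lower_fence`, `upper_fence` and `is_outlier` are illustrative, not SPSS terms. Note that 79 - 39 gives 40, close to the approx 41 quoted on the slide; both are readings taken from a chart.

```python
LOWER_QUARTILE = 39   # approx, read from the boxplot
UPPER_QUARTILE = 79   # approx, read from the boxplot

box_height = UPPER_QUARTILE - LOWER_QUARTILE   # interquartile range
lower_fence = LOWER_QUARTILE - 1.5 * box_height
upper_fence = UPPER_QUARTILE + 1.5 * box_height


def is_outlier(value):
    """True if the value lies further than 1.5 box heights from the box."""
    return value < lower_fence or value > upper_fence


print(box_height)                # 40
print(lower_fence, upper_fence)  # -21.0 139.0
print(is_outlier(121))           # False: within whisker reach
print(is_outlier(150))           # True: would be plotted as an outlier
```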


Graphically Describing Data Using the Explore dialog box and using Area in the Factor List



Dispersion •  Grade Profiles for Assessment Components BML224 - 2009



Measures of Dispersion: The Range


Dispersion: Range

Two groups of 5 business and management students were asked to record weekly beer consumption:
•  Group 1: 12, 12, 12, 12, 12
•  Group 2: 0, 5, 10, 15, 30

The mean for both groups is 12 but this provides no indication of the differences between the two samples and the level of dispersion
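A short Python sketch shows how the range separates the two groups even though their means are identical. The helper `value_range` is a name chosen here for illustration.

```python
group_1 = [12, 12, 12, 12, 12]
group_2 = [0, 5, 10, 15, 30]


def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)


for name, group in (("Group 1", group_1), ("Group 2", group_2)):
    group_mean = sum(group) / len(group)
    print(f"{name}: mean = {group_mean}, range = {value_range(group)}")
# Group 1: mean = 12.0, range = 0
# Group 2: mean = 12.0, range = 30
```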


Measures of Dispersion: The Standard Deviation


Standard Deviation

An estimate of the average variability (spread) of a set of data measured in the same units of measurement as the original data

Standard deviation measures the dispersion around the average, but does so on the basis of the figures themselves, not just the rank order

It is calculated from the deviations of each item from the arithmetic mean


Standard Deviation
• C. Andrews Brewery Tours - Salaries (£k)
• £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
• Mean = £23k (£184k/8)
• Standard Deviation = 9.68
• Global Heritage Tours - Salaries (£k)
• £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
• Mean = £23k (£184k/8)
• Standard Deviation = 2.00
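The two salary examples can be reproduced with the sample standard deviation (dividing the summed squared deviations by n - 1), which is what `statistics.stdev` in the Python standard library computes. The quoted figures of 9.68 and 2.00 match that sample formula; the variable names below are illustrative.

```python
from statistics import mean, stdev

brewery_tours = [11, 15, 15, 18, 25, 30, 32, 38]    # salaries in £k
heritage_tours = [20, 21, 22, 23, 23, 24, 25, 26]   # salaries in £k

# Same mean, very different spread around it.
for name, salaries in (("C. Andrews Brewery Tours", brewery_tours),
                       ("Global Heritage Tours", heritage_tours)):
    print(f"{name}: mean = {mean(salaries)}, "
          f"standard deviation = {stdev(salaries):.2f}")
```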


Standard Deviation

•  ‘Quoting the standard deviation of a distribution is a way of indicating a kind of ‘average’ amount by which all the values deviate from the mean. The greater the dispersion the bigger the deviations and the bigger the standard (average) deviation’ [Rowntree, 1981, p. 54, cited in Riley, M. et al, 1998, p. 197]


The Normal Distribution Curve

[Figure: normal curve (y-axis: Frequency), Mean=100, with 50% of cases on either side of the mean; the mean, median and mode coincide]

[Figure: normal curve, Mean=100, showing 68%, 95% and 99% of cases within ±1, ±2 and ±3 S.D. of the mean]


The Normal Distribution Curve

Example: Sample of 20 men undertaking physical exercise as part of a research programme for Sports and Exercise Science

Measurements taken:
•  Mean Heartbeat [beats per minute/bpm] = 123bpm
•  Standard Deviation = 18bpm


The Normal Distribution Curve

[Figure: normal curve (y-axis: Frequency), Mean=123bpm, showing 68%, 95% and 99% of cases within ±1, ±2 and ±3 S.D. of the mean]


The Normal Distribution Curve
•  68% of men had a heart rate between the mean minus 1 s.d and the mean plus 1 s.d; i.e.
•  123-(1*18) and 123+(1*18) or between 123-18 and 123+18; that is between 105 and 141 bpm

[Figure: normal curve, Mean=123, with the central 68% band running from 105 to 141 bpm]


The Normal Distribution Curve
•  95% of men had a heart rate between the mean minus 2 s.d and the mean plus 2 s.d; i.e.
•  123-(2*18) and 123+(2*18) or between 123-36 and 123+36; that is between 87 and 159 bpm

[Figure: normal curve, Mean=123, with the 68% band (105 to 141 bpm) and the 95% band (87 to 159 bpm)]


The Normal Distribution Curve
•  99% of men had a heart rate between the mean minus 3 s.d and the mean plus 3 s.d; i.e.
•  123-(3*18) and 123+(3*18) or between 123-54 and 123+54; that is between 69 and 177 bpm

[Figure: normal curve, Mean=123, with the 68% band (105 to 141 bpm), the 95% band (87 to 159 bpm) and the 99% band (69 to 177 bpm)]


The Normal Distribution Curve
•  68% of men had a heart rate between the mean minus 1 s.d and the mean plus 1 s.d; i.e.
•  123-18 and 123+18; that is between 105 and 141 bpm
•  95% of men had a heart rate between the mean minus 2 s.d and the mean plus 2 s.d; i.e.
•  123-(2*18) and 123+(2*18) or between 123-36 and 123+36; that is between 87 and 159 bpm
•  99% of men had a heart rate between the mean minus 3 s.d and the mean plus 3 s.d; i.e.
•  123-(3*18) and 123+(3*18) or between 123-54 and 123+54; that is between 69 and 177 bpm
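The heart-rate bands can be recomputed with a small Python sketch, assuming the heart rates really do follow a normal curve with mean 123 and standard deviation 18. `band` and `share_within` are illustrative helper names. `statistics.NormalDist` (Python 3.8 or later) gives the theoretical share of cases in each band: roughly 68.3%, 95.4% and 99.7%, which the slides round to 68%, 95% and 99%.

```python
from statistics import NormalDist

MEAN_BPM = 123
SD_BPM = 18
heart_rate = NormalDist(mu=MEAN_BPM, sigma=SD_BPM)


def band(mean, sd, k):
    """Lower and upper limits of mean ± k standard deviations."""
    return (mean - k * sd, mean + k * sd)


def share_within(k):
    """Theoretical proportion of a normal curve within ± k s.d."""
    low, high = band(MEAN_BPM, SD_BPM, k)
    return heart_rate.cdf(high) - heart_rate.cdf(low)


for k in (1, 2, 3):
    low, high = band(MEAN_BPM, SD_BPM, k)
    print(f"±{k} s.d: {low} to {high} bpm, about {share_within(k):.1%} of cases")
```

The same `band` helper applies to any mean and standard deviation, for example `band(172, 10, 2)` for a height example with mean 172cm and s.d 10cm.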



The Normal Distribution Curve - Height
•  68% of class have a height between the mean minus 1 s.d and the mean plus 1 s.d; i.e.
•  172-10 and 172+10; that is between 162cm and 182cm
•  95% of class have a height between the mean minus 2 s.d and the mean plus 2 s.d; i.e.
•  172-(2*10) and 172+(2*10) or between 172-20 and 172+20; that is between 152cm and 192cm
•  99% of class have a height between the mean minus 3 s.d and the mean plus 3 s.d; i.e.
•  172-(3*10) and 172+(3*10) or between 172-30 and 172+30; that is between 142cm and 202cm


Practical Exercise:
•  Working in groups of 2, I would like you to analyse and present a brief statistical overview of one ratio data set (e.g. Profit10):
•  As part of your presentation you must prepare a short 2-3 minute PowerPoint overview, in which you must:
•  Use descriptive statistics and appropriate charts/plots to examine your chosen variable
•  Examine relationships with other variables (e.g. average turnover by area or by town) using appropriate command options in SPSS (e.g. Split File/Factor List etc)
•  You are free to cut and paste SPSS output into your PowerPoint presentation
•  You must also provide a clear rationale for your chosen line of enquiry


The Standard Error


Standard Error
• Field (2003) examining ratings of lecturers (where population mean = 3)
• Each sample has a mean and this is represented in a frequency chart
• Some means are higher and some lower than the population mean = sampling variation


Standard Error
• The histogram = sampling distribution
• The histogram is centred on the population mean (3) therefore the average of all the sample means would give the value of the population mean
• Looking at the standard deviation between the sample means would give a measure of the degree of variability between the different samples


Standard Error
• The standard deviation of the sample means is the standard error of the mean and a measure of how representative a sample is likely to be of the population
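The idea of a sampling distribution can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population mean of 3 comes from the lecturer-ratings example, but the population standard deviation (1.0), the sample size (25), the number of samples (5000) and the use of a normal population are all invented here for illustration. The standard deviation of the simulated sample means should land close to the population s.d divided by the square root of the sample size.

```python
import math
import random
from statistics import fmean, stdev

random.seed(42)            # fixed seed so the run is repeatable

POPULATION_MEAN = 3.0      # from the lecturer-ratings example
POPULATION_SD = 1.0        # assumed for illustration
SAMPLE_SIZE = 25           # assumed for illustration
NUMBER_OF_SAMPLES = 5000   # assumed for illustration

# Draw many samples and record the mean of each one.
sample_means = [
    fmean([random.gauss(POPULATION_MEAN, POPULATION_SD)
           for _ in range(SAMPLE_SIZE)])
    for _ in range(NUMBER_OF_SAMPLES)
]

mean_of_means = fmean(sample_means)
sd_of_means = stdev(sample_means)   # empirical standard error
theoretical_se = POPULATION_SD / math.sqrt(SAMPLE_SIZE)

print(f"Average of the sample means: {mean_of_means:.3f}")
print(f"S.D of the sample means:     {sd_of_means:.3f}")
print(f"Population s.d / sqrt(n):    {theoretical_se:.3f}")
```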


Standard Error
•  Standard deviation is useful in comparing distributions of different samples
•  Use the standard error to determine how likely it is that the mean of your sample differs from the mean of the population
•  Standard error accounts for sample size

SE(Mean) = s / √n, where s = Standard Deviation of the Sample and n = Sample Size


Standard Error - Example

Visitor Spending of Short Break Holiday Makers in Chichester

Measurements taken:
•  Sample Size = 10
•  Mean Value = £127
•  Standard Deviation = 29.809
•  Standard Error = 9.43


Standard Error - Example
• 68% - Visitor Spending: 127 ± 9.43 = 117.57 to 136.43 [Mean ±1*Standard Error]
• 95% - Visitor Spending: 127 ± 18.86 = 108.14 to 145.86 [Mean ±2*Standard Error]
• 99% - Visitor Spending: 127 ± 28.29 = 98.71 to 155.29 [Mean ±3*Standard Error]
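The visitor spending intervals can be recomputed with the sketch below. It assumes the standard error is the sample standard deviation divided by the square root of the sample size, which reproduces the quoted 9.43, and it rounds the standard error to two decimal places before building the intervals so that the results line up with the figures quoted for this example. The names are illustrative.

```python
import math

SAMPLE_SIZE = 10
MEAN_SPEND = 127.0
SAMPLE_SD = 29.809

# Standard error of the mean, rounded to 2 d.p. as in the example.
standard_error = round(SAMPLE_SD / math.sqrt(SAMPLE_SIZE), 2)


def interval(multiplier):
    """Mean ± multiplier * standard error, rounded to pence."""
    margin = multiplier * standard_error
    return (round(MEAN_SPEND - margin, 2), round(MEAN_SPEND + margin, 2))


print(f"Standard error = {standard_error}")
for label, multiplier in (("68%", 1), ("95%", 2), ("99%", 3)):
    low, high = interval(multiplier)
    print(f"{label}: {low} to {high}")
```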


Standard Error - Example

[Figure: normal curve centred on Mean=127 (x-axis: Visitor Spending, y-axis: Frequency), marking the 68% band (117.57 to 136.43), the 95% band (108.14 to 145.86) and the 99% band (98.71 to 155.29)]


Standard Error - Example
•  Notice that higher confidence levels are achieved at the expense of wider confidence intervals:
•  68%: 117.57 to 136.43 (Range = 18.86)
•  95%: 108.14 to 145.86 (Range = 37.72)
•  99%: 98.71 to 155.29 (Range = 56.58)
•  To improve the accuracy of sample estimates the best way is to increase sample size!


Standard Error - Example

Visitor Spending of Short Break Holiday Makers in Chichester

Measurements taken (original values for the sample of 10 in square brackets):
•  Sample Size = 20 [10]
•  Mean Value = £127 [£127]
•  Standard Deviation = 21.07 [29.809]
•  Standard Error = 4.71 [9.43]


Standard Error - Example
•  Increasing the sample size significantly reduces the width of the confidence intervals:
•  Before (n=10):
•  95%: 108.14 to 145.86
•  (Mean ± 2*Standard Error = (2*9.43 = 18.86))
•  (Range = 37.72)
•  After (n=20):
•  95%: 117.58 to 136.42
•  (Mean ± 2*Standard Error = (2*4.71 = 9.42))
•  (Range = 18.84)


Standard Error - Example
•  Increasing the sample size significantly reduces the width of the confidence intervals:
•  Before (n=10):
•  99%: 98.71 to 155.29
•  (Mean ± 3*Standard Error = (3*9.43 = 28.29))
•  (Range = 56.58)
•  After (n=20):
•  99%: 112.87 to 141.13
•  (Mean ± 3*Standard Error = (3*4.71 = 14.13))
•  (Range = 28.26)
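The before and after comparison can be checked with a short Python sketch. It assumes the standard error is the sample standard deviation divided by the square root of the sample size, rounded to two decimal places as in the worked figures; `standard_error` and `interval_width` are illustrative names, and the "Range" quoted in the example is the width of the interval.

```python
import math


def standard_error(sample_sd, sample_size):
    """Standard error of the mean, rounded to 2 d.p. as in the example."""
    return round(sample_sd / math.sqrt(sample_size), 2)


def interval_width(sample_sd, sample_size, multiplier):
    """Width of the interval mean ± multiplier * standard error."""
    return round(2 * multiplier * standard_error(sample_sd, sample_size), 2)


# Sample size -> sample standard deviation, from the two sets of measurements.
samples = {10: 29.809, 20: 21.07}

for n, sd in samples.items():
    print(f"n={n}: SE = {standard_error(sd, n)}, "
          f"95% width = {interval_width(sd, n, 2)}, "
          f"99% width = {interval_width(sd, n, 3)}")
```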


Standard Error - Example
•  Summary – The Influence of Sample Size

Confidence level   Measure       n=10     n=20
95%                Lower value   108.14   117.58
95%                Upper value   145.86   136.42
95%                Range         37.72    18.84
99%                Lower value   98.71    112.87
99%                Upper value   155.29   141.13
99%                Range         56.58    28.26


Standard Error – Height Example
•  Mean = 166.55
•  Standard Error = 1.08
•  At the 95% confidence level the population mean will lie within 166.55 ± (2*S.E), where 2*S.E = 2.16
•  Lower value = 164.38
•  Upper value = 168.72


Standard Deviation v Standard Error

Standard Error and Standard Deviation are often confused
•  Key Points
•  The standard deviation is a measure of the variability of the population from which the sample is drawn
•  For data with a normal distribution, about 95% of individuals will have values within 2 standard deviations of the mean
•  We use the sample mean as an estimate of the mean for the whole population from which to make wider generalisations


Standard Deviation v Standard Error
•  Key Points
•  The sample mean will vary from sample to sample; this variation is described as the “sampling distribution” of the mean - SEM is a measure of sampling error because it describes the variability among all possible means that could be potentially sampled
•  The standard error is the standard deviation of the sample means and a measure of how representative a sample is likely to be of the population


Standard Deviation v Standard Error
•  The Role of Size
•  The standard error of the sample mean depends on both the standard deviation and the sample size - the standard error falls as the sample size increases, as the extent of chance variation is reduced – less variability more reliability!
•  We increase our confidence in a particular sample (as being representative of the population) by increasing the size of the sample - the means of large samples tend to cluster tightly around the true population mean
•  In contrast the standard deviation will not tend to change as we increase the size of our sample


Standard Deviation v Standard Error
•  Therefore if we wanted to:
(a) comment on level of dispersion or how widely scattered some measurements were, we make reference to the standard deviation (focus on sample means)
(b) indicate the uncertainty around the estimate of the mean measurement, we quote the standard error of the mean (focus on population means)


Practical Exercise:
•  Working in groups of 2, I would like you to analyse and present a brief statistical overview of one ratio data set (e.g. Turnover 10):
•  As part of your presentation you must prepare a short 2-3 minute PowerPoint overview, in which you must:
•  Use descriptive statistics and appropriate charts/plots to examine your chosen variable
•  Examine relationships with other variables (e.g. average turnover by area or by town) using appropriate command options in SPSS (e.g. Split File/Factor List etc)
•  You are free to cut and paste SPSS output into your PowerPoint presentation
•  You must also provide a clear rationale for your chosen line of enquiry

