BML246 - UNDERSTANDING YOUR DATA


BML246

Research Skills Understanding Your Data

Tutors:

Dr Andy Clegg and Dr Jorge Gutic


Understanding Your Data

Aims:
• To produce basic frequency counts, crosstabulations and descriptive statistics, including the mean, median and mode
• To understand the basic features of the measures of central tendency (single values that describe the centre of a distribution)
• To select and apply appropriate descriptive statistics to different data types


Understanding Your Data

NOIR
• Nominal, Ordinal: non-parametric
• Interval, Ratio: parametric


Frequency Counts

Question: What grade do you expect to get? (Nominal)

What grade do you expect to get for the module?

Grade      2011/2012   %        2010/2011   %
Grade A    6           8.5%     3           6%
Grade B    17          23.9%    16          32%
Grade C    37          52.1%    19          38%
Grade D    11          15.5%    11          22%
Grade E    0           0%       1           2%
Total      71          100%     50          100%
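A frequency count with percentages can be reproduced in a few lines. The sketch below is illustrative only: the list of responses is rebuilt from the 2011/2012 counts in the table (the raw survey data are not available), and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical raw responses rebuilt from the 2011/2012 counts in the table
# (6 A, 17 B, 37 C, 11 D, 0 E); the original survey data are not available.
responses = ["A"] * 6 + ["B"] * 17 + ["C"] * 37 + ["D"] * 11

grades = ["A", "B", "C", "D", "E"]
counts = Counter(responses)   # a Counter returns 0 for categories with no responses
n = len(responses)            # sample size

# Percentage of the sample in each category, rounded to one decimal place
percentages = {g: round(100 * counts[g] / n, 1) for g in grades}

for g in grades:
    print(f"Grade {g}: {counts[g]:>3}  {percentages[g]:>5}%")
print(f"Total:   {n:>3}")
```

Keeping the full list of categories separate from the data means that an empty category such as Grade E still appears in the output with a count of zero.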


Frequency Counts

Expected Grade for BML224

Question: What grade do you expect to get?

F (<40%): 2%

A (70%+): 6%

D (40-49%): 22%

B (60-69%): 32%

C (50-59%): 38%


Frequency Counts

Question: How Confident Are You? (Ordinal)

How confident are you about starting this module?

Response                     2011/2012   %        2010/2011   %
7 - Very Confident           0           0.0%     0           0.0%
6 - Quite Confident          4           5.6%     1           2.0%
5 - Confident                16          22.5%    10          20.0%
4 - Uncertain                28          39.4%    26          52.0%
3 - Anxious                  14          19.7%    7           14.0%
2 - Quite Anxious            4           5.6%     4           8.0%
1 - Very Anxious             5           7.0%     2           4.0%
Uncertain to very anxious    51          72%      39          78%
Sample (n)                   71                   50


Frequency Counts

Student Confidence Levels 2011

Question: How Confident Are You?

Ordinal

Very Anxious: 4%

Quite Anxious: 8%

Quite Confident: 2%

Confident: 20%

Anxious: 14%

Uncertain: 52%


Crosstabs

• Example: Length of Ownership by Response to Recession
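A crosstabulation counts how many cases fall into each combination of two categorical variables. The sketch below is a minimal illustration only: the category labels and the six records are invented, since the data behind the slide are not reproduced here.

```python
from collections import Counter

# Hypothetical (length of ownership, response to recession) pairs, one per business.
# Both the categories and the values are invented for illustration.
records = [
    ("Under 5 years", "Cut prices"),
    ("Under 5 years", "Cut prices"),
    ("Under 5 years", "No change"),
    ("5 years or more", "No change"),
    ("5 years or more", "No change"),
    ("5 years or more", "Cut prices"),
]

cell_counts = Counter(records)                 # one count per (row, column) cell
rows = sorted({row for row, _ in records})     # row categories
cols = sorted({col for _, col in records})     # column categories

# Print the table: one line per row category, one column per response
print(f"{'':<18}" + "".join(f"{c:<12}" for c in cols))
for r in rows:
    print(f"{r:<18}" + "".join(f"{cell_counts[(r, c)]:<12}" for c in cols))
```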




Crosstabs

NOIR
• Nominal, Ordinal: non-parametric
• Interval, Ratio: parametric


Descriptive Stats

Question: Business Turnover in 2010 (Ratio)

Descriptive Statistics: Turnover 2010

Mean                 £41,311.40
Median               £44,640.00
Mode                 £44,760.00
Standard Deviation   £9,191.03


The Mean

Features of the Mean
• It makes use of every value in the distribution, giving it a mathematical exactness that is useful for further mathematical processing
• It can be distorted by extreme values in the distribution
• For a discrete distribution, the mean may be an ‘impossible’ figure, e.g. average number of children per family = 2.4


The Median

Calculating the Median

A data series with an uneven number of items:

1, 2, 2, 4, 7, 7, 10

Median = 4 (the middle item, fourth of seven)




The Median

Calculating the Median

A data series with an even number of items:

1, 2, 2, 4, 7, 7, 10, 12

Median = (4 + 7) / 2 = 5.5
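Both cases, a series with an odd number of items and a series with an even number of items, can be checked with Python's standard `statistics` module. This is a small sketch using the two series from the slides; the variable names are illustrative.

```python
import statistics

odd_series = [1, 2, 2, 4, 7, 7, 10]        # 7 items: the middle (4th) value is the median
even_series = [1, 2, 2, 4, 7, 7, 10, 12]   # 8 items: the two middle values are averaged

median_odd = statistics.median(odd_series)
median_even = statistics.median(even_series)

print(median_odd)    # 4
print(median_even)   # 5.5, i.e. (4 + 7) / 2
```

`statistics.median` sorts the data internally, so the series does not need to be ranked by hand first.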


The Median

Features of the Median
• Half the items in the series will have a value greater than or equal to the median and half less than or equal to the median
• It is therefore a measure of rank or position
• It is unaffected by the presence of extreme items in the distribution
• It may be found when the values of all the items are not known, provided that the values of the middle items and the total number of items are known
• Ranking the items can be tedious
• The median cannot be used for further mathematical processing
• It may not be representative if there are few items


The Mode

Calculating the Mode

Series 1:

1, 2, 2, 2, 7, 7, 10, 12

Single mode: 2. The data set is unimodal.


The Mode

Calculating the Mode

Series 2:

1, 2, 2, 2, 7, 7, 7, 10

Two modes: 2 and 7. The data set is bimodal.
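The two series above can be checked with `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest frequency. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

unimodal_series = [1, 2, 2, 2, 7, 7, 10, 12]   # 2 occurs three times, 7 only twice
bimodal_series = [1, 2, 2, 2, 7, 7, 7, 10]     # 2 and 7 both occur three times

modes_unimodal = statistics.multimode(unimodal_series)
modes_bimodal = statistics.multimode(bimodal_series)

print(modes_unimodal)   # [2]
print(modes_bimodal)    # [2, 7]
```

Using `multimode` rather than `mode` makes ties visible, which matters because the more modes a data set has, the less useful the mode becomes as a summary.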


The Mode

Features of the Mode
• For discrete data it is an actual single value
• For continuous data it is the point of highest frequency density, but it is not well suited to continuous data because different values each constitute a potential mode
• It may not be unique or clearly defined: the more modes there are, the less useful it is
• Extreme items do not affect its value
• It cannot be used for further mathematical processing
• It requires arrangement of the data, which may be time consuming


Mean, Median, Mode

Which to Use*?
• A measure of location must convey the distribution in a single figure
• It is therefore important to pick the right figure
• This depends on:
  - The type of data being used
  - The shape of the distribution (dispersion)
  - Whether the average will be the basis for further work on the data

[*Source: Buglear, 2000]


Mean, Median, Mode

Which measure to use?
• 1, 2, 3, 4, 5, 6, 7, 8, 9, 10: Mean = 5.5, Median = 5.5
• 1, 2, 3, 4, 5, 6, 7, 8, 9, 20: Mean = 6.5, Median = 5.5
• 1, 2, 3, 4, 5, 6, 7, 8, 9, 100: Mean = 14.5, Median = 5.5
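The three series above show the mean being pulled upwards by a single extreme value while the median stays put. The sketch below recomputes those figures with the standard `statistics` module; the dictionary labels are illustrative.

```python
import statistics

# The three series from the slide: only the largest value changes
series = {
    "no extreme value": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "largest value 20": [1, 2, 3, 4, 5, 6, 7, 8, 9, 20],
    "largest value 100": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
}

# (mean, median) for each series
summary = {
    label: (statistics.mean(values), statistics.median(values))
    for label, values in series.items()
}

for label, (mean, median) in summary.items():
    print(f"{label}: mean = {mean}, median = {median}")
```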


Mean, Median, Mode

Which measure to use? [Turn to page 2-34]
• 1, 23, 25, 26, 27, 23, 29, 30: Median
• 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 5: Mode
• 1, 1, 2, 3, 4, 1, 2, 6, 5, 8, 3, 4, 5, 6, 7: Mean
• 1, 101, 104, 106, 111, 108, 109, 200: Median


Summary

Levels of measurement: Nominal, Ordinal, Interval, Ratio

Techniques: Frequency Counts, Crosstabulations, Mean, Median, Mode




Descriptive Statistics


Distribution of Data

Dispersion: a way of describing how spread out a set of data is.

[Figure: Histogram of JJA temp; x-axis: JJA temp (14 to 20); y-axis: Frequency (0 to 30)]


Distribution of Data

Skewness: a measure of symmetry (or lack of symmetry) in a distribution.

[Figure: Histogram of JJA temp; x-axis: JJA temp (14 to 20); y-axis: Frequency (0 to 30)]


Distribution of Data

Central Tendency: a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.

[Figure: Histogram of JJA temp; x-axis: JJA temp (14 to 20); y-axis: Frequency (0 to 30)]


Distribution of Data

Dispersion: a way of describing how spread out a set of data is.

[Figure: Histogram of JJA temp, annotated with Dispersion, Skewness and Central Tendency; x-axis: JJA temp (14 to 20); y-axis: Frequency (0 to 30)]


Normal Distribution Curve

[Figure: normal distribution curve; y-axis: Frequency; Mean = 100, with 50% of cases on each side of the mean; the mean, median and mode coincide]


Normal Distribution Curve

• Although these curves are different shapes they all have a normal distribution


Normal Distribution Curve Sample size=10


Normal Distribution Curve

Sample size=100


Kurtosis (p. 3-98)

Flatness or peakedness: the kurtosis of the distribution (the degree to which the tails of the distribution are ‘heavy’ or ‘light’)

[Figure: platykurtic, mesokurtic and leptokurtic curves]


Skewness

• Positive skew: scores cluster at the lower end and the tail points towards the higher scores
• Negative skew: scores cluster at the higher end and the tail points towards the lower values

[Figure: positively skewed, symmetrical and negatively skewed distributions; f plotted against x]


Kurtosis

• If kurtosis is close to 0, a normal distribution is often assumed: a mesokurtic distribution
• If kurtosis is greater than zero, the distribution has heavier tails and is called a leptokurtic distribution
• If kurtosis is less than zero, the distribution has light tails and is called a platykurtic distribution




Skewness and Kurtosis

To check whether a distribution is normal we look at the values of kurtosis and skewness:
• Positive values of kurtosis indicate tall, pointy peaks and negative values indicate a flat, light-tailed distribution
• Positive values of skewness indicate too many low scores (left-hand side of graph) and negative values indicate too many high scores (right-hand side of graph)
• The further the values are from zero, the less likely it is that the data are normally distributed
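To make the sign conventions concrete, the sketch below computes skewness and excess kurtosis from the central moments of a data set. It is an illustration under stated assumptions: the function names and the three small data sets are invented, and the formulas are the simple population (moment) versions. Statistical packages such as SPSS typically report sample-adjusted versions, so their output can differ slightly, especially for small samples.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: 0 for a symmetrical distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical data sets, invented for illustration
symmetric = [1, 2, 3, 4, 5]
high_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # most scores low, long tail to the right
low_outlier = [-x for x in high_outlier]          # mirror image: long tail to the left

print(skewness(symmetric))         # zero: symmetrical
print(skewness(high_outlier))      # positive: positively skewed
print(skewness(low_outlier))       # negative: negatively skewed
print(excess_kurtosis(symmetric))  # negative: flatter than normal (platykurtic)
```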


Skewness

[Figure: symmetrical, positively skewed and negatively skewed distributions (f plotted against x), showing the positions of the mean, median and mode]

Note:
• In a symmetric distribution: Mean = Median = Mode
• In a positively skewed distribution: Mean > Median > Mode
• In a negatively skewed distribution: Mean < Median < Mode


Distributions: GTBS10

• In a normal distribution, kurtosis = 0
• Positive values indicate the distribution is leptokurtic (kurtosis > 0)
• Negative values indicate the distribution is platykurtic (kurtosis < 0)
• The greater the standard deviation, the flatter the curve


Distributions: Savings

In a normal distribution – Skewness = 0 (the distribution is symmetrical)

Positive values indicate the distribution is positively skewed

Negative values indicate the distribution is negatively skewed


Distributions: Occupancy

In a normal distribution – Skewness = 0 (the distribution is symmetrical)

Positive values indicate the distribution is positively skewed

Negative values indicate the distribution is negatively skewed


Histograms

Graphing Data in SPSS


Graphing Data in SPSS

Using the Split File option or the Factor List


Graphing Data in SPSS

Adding the Normal Distribution Curve in Graph Builder or using Charts in Frequencies


Boxplots

• Median: approx. 50
• Upper value of the interquartile range (upper quartile): approx. 79
• Lower value of the interquartile range (lower quartile): approx. 39
• The box represents the middle 50% of the data, the interquartile range: approx. 41
• Highest value, excluding outliers: approx. 121
• Lowest value, excluding outliers: approx. 18
• The range: approx. 103
• Points plotted beyond the whiskers represent extreme values or outliers
• Each vertical line (whisker) extends up to 1.5 times the box height from the edge of the box
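The quantities read off a boxplot can also be computed directly. The sketch below is illustrative only: the scores are invented (they are not the data behind the slide), the variable names are assumptions, and quartile conventions differ between packages, so the quartiles here may not match SPSS output exactly.

```python
import statistics

# Hypothetical scores, invented for illustration (not the data behind the slide)
scores = [18, 30, 39, 42, 47, 50, 55, 63, 79, 85, 121, 190]

# Lower quartile, median and upper quartile (one of several quartile conventions)
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1   # interquartile range: the height of the box

# Whiskers may reach at most 1.5 box heights beyond each edge of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond the fences are plotted individually as outliers
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

# Each whisker ends at the most extreme value that is still inside the fences
inside = [x for x in scores if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Whiskers: {whisker_low} to {whisker_high}; outliers: {outliers}")
```

Note that a whisker stops at the last actual data value inside the fence, so it is usually shorter than the full 1.5 box heights.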


Boxplots

Using the Explore dialog box and using Area in the Factor List




Dispersion

• Grade Profiles for Assessment Components BML224 2009




Dispersion: The Range

Two groups of 5 business and management students were asked to record weekly beer consumption:
• Group 1: 12, 12, 12, 12, 12
• Group 2: 0, 5, 10, 15, 30

The mean for both groups is 12, but this gives no indication of the differences between the two samples or of their level of dispersion.
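The contrast between the two groups becomes visible as soon as the range (highest value minus lowest value) is reported alongside the mean. This is a short sketch using the figures from the slide; the function and variable names are illustrative.

```python
import statistics

group_1 = [12, 12, 12, 12, 12]
group_2 = [0, 5, 10, 15, 30]


def value_range(values):
    """Range: the highest value minus the lowest value."""
    return max(values) - min(values)


mean_1, mean_2 = statistics.mean(group_1), statistics.mean(group_2)
range_1, range_2 = value_range(group_1), value_range(group_2)

print(f"Group 1: mean = {mean_1}, range = {range_1}")   # identical means...
print(f"Group 2: mean = {mean_2}, range = {range_2}")   # ...very different ranges
```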


The Standard Deviation

• An estimate of the average variability (spread) of a set of data, measured in the same units of measurement as the original data
• Standard deviation measures the dispersion around the average, but does so on the basis of the figures themselves, not just the rank order
• It is calculated from the deviations of each item from the arithmetic mean


The Standard Deviation

• C. Andrews Brewery Tours, salaries (£k): £11, £15, £15, £18, £25, £30, £32, £38
  Mean = £23k (£184k / 8)
  Standard Deviation = 9.68
• Global Heritage Tours, salaries (£k): £20, £21, £22, £23, £23, £24, £25, £26
  Mean = £23k (£184k / 8)
  Standard Deviation = 2.00
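The figures on this slide can be reproduced with the sample standard deviation, which divides the sum of squared deviations by n - 1 rather than n; that is what `statistics.stdev` computes. The sketch below uses the salaries from the slide; the variable names are illustrative.

```python
import statistics

# Salaries in £k, taken from the slide
brewery_tours = [11, 15, 15, 18, 25, 30, 32, 38]
heritage_tours = [20, 21, 22, 23, 23, 24, 25, 26]

mean_brewery = statistics.mean(brewery_tours)
mean_heritage = statistics.mean(heritage_tours)

# Sample standard deviation: sqrt(sum of squared deviations / (n - 1))
sd_brewery = statistics.stdev(brewery_tours)
sd_heritage = statistics.stdev(heritage_tours)

print(f"Brewery tours:  mean = {mean_brewery}, SD = {sd_brewery:.2f}")
print(f"Heritage tours: mean = {mean_heritage}, SD = {sd_heritage:.2f}")
```

Both firms share a mean salary of £23k, but the larger standard deviation for the brewery tours reflects the much wider spread of salaries around that mean.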


The Standard Deviation

• ‘Quoting the standard deviation of a distribution is a way of indicating a kind of ‘average’ amount by which all the values deviate from the mean. The greater the dispersion the bigger the deviations and the bigger the standard (average) deviation’ [Rowntree, 1981, p. 54, cited in Riley, M. et al., 1998, p. 197]

