BML246
Research Skills Understanding Your Data
Tutors:
Dr Andy Clegg and Dr Jorge Gutic
Understanding Your Data
Aims:
• To produce basic frequency counts, crosstabulations and descriptive statistics including the mean, median and mode
• To understand the basic features of the measures of central tendency (single values that describe the centre of a set of data)
• To select and apply appropriate descriptive statistics to different data types
Understanding Your Data
NOIR
NOMINAL, ORDINAL: NON-PARAMETRIC
INTERVAL, RATIO: PARAMETRIC
Frequency Counts
Question: What grade do you expect to get?
Nominal
What grade do you expect to get for the module?

             2011/2012           2010/2011
             Count     %         Count     %
Grade A        6      8.5%         3       6%
Grade B       17     23.9%        16      32%
Grade C       37     52.1%        19      38%
Grade D       11     15.5%        11      22%
Grade E        0      0%           1       2%
Total         71    100%          50     100%
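The percentages in a frequency table are each count divided by the column total. A minimal Python sketch, using the 2011/2012 counts from the table above (the variable names are illustrative, not part of the module materials):

```python
# Frequency counts for 2011/2012, taken from the table above.
grade_counts = {
    "Grade A": 6,
    "Grade B": 17,
    "Grade C": 37,
    "Grade D": 11,
    "Grade E": 0,
}

# The sample size is the sum of all category counts.
total = sum(grade_counts.values())

# Percentage = count / total * 100, rounded to one decimal place.
percentages = {
    grade: round(100 * count / total, 1)
    for grade, count in grade_counts.items()
}

for grade, count in grade_counts.items():
    print(f"{grade}: {count:>3} {percentages[grade]:>5}%")
print(f"Total: {total}")
```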
Frequency Counts
Expected Grade for BML224
Question: What grade do you expect to get?
F (<40%): 2%
A (70%+): 6%
D (40-49%): 22%
B (60-69%): 32%
C (50-59%): 38%
Frequency Counts
Question: How Confident Are You?
Ordinal
How confident are you about starting this module?

                             2011/2012           2010/2011
                             Count     %         Count     %
7 - Very Confident             0      0.0%         0      0.0%
6 - Quite Confident            4      5.6%         1      2.0%
5 - Confident                 16     22.5%        10     20.0%
4 - Uncertain                 28     39.4%        26     52.0%
3 - Anxious                   14     19.7%         7     14.0%
2 - Quite Anxious              4      5.6%         4      8.0%
1 - Very Anxious               5      7.0%         2      4.0%
Uncertain to very anxious     51     72%          39     78%
Sample (n)                    71                  50
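Because the confidence scale is ordinal, adjacent categories can be combined into a cumulative figure such as "Uncertain to very anxious". A small Python sketch of that calculation for the 2011/2012 column (the dictionary layout is an assumption made for illustration):

```python
# 2011/2012 counts keyed by scale point (7 = Very Confident ... 1 = Very Anxious).
confidence_counts = {7: 0, 6: 4, 5: 16, 4: 28, 3: 14, 2: 4, 1: 5}

sample_size = sum(confidence_counts.values())

# "Uncertain to very anxious" covers scale points 4, 3, 2 and 1.
uncertain_to_very_anxious = sum(
    count for score, count in confidence_counts.items() if score <= 4
)

# Share of the whole sample, rounded to the nearest whole percent.
share = round(100 * uncertain_to_very_anxious / sample_size)

print(f"n = {sample_size}")
print(f"Uncertain to very anxious: {uncertain_to_very_anxious} ({share}%)")
```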
Frequency Counts
Student Confidence Levels 2011
Question: How Confident Are You?
Ordinal
Very Anxious: 4%
Quite Anxious: 8%
Quite Confident: 2%
Confident: 20%
Anxious: 14%
Uncertain: 52%
Crosstabs
• Examples: Length of Ownership by Response to Recession
Crosstabs
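A crosstabulation counts how often each combination of two categorical variables occurs. The sketch below builds one in plain Python; the six responses and the category labels are hypothetical and are not the survey data shown on the slides:

```python
from collections import Counter

# Hypothetical responses: (length of ownership, response to recession).
responses = [
    ("Under 5 years", "Cut prices"),
    ("Under 5 years", "Cut prices"),
    ("Under 5 years", "No change"),
    ("5 years or more", "No change"),
    ("5 years or more", "No change"),
    ("5 years or more", "Cut prices"),
]

# Each cell of the crosstab is the number of cases with that pair of values.
cell_counts = Counter(responses)

row_labels = sorted({ownership for ownership, _ in responses})
column_labels = sorted({response for _, response in responses})

for row in row_labels:
    cells = "  ".join(
        f"{column}: {cell_counts[(row, column)]}" for column in column_labels
    )
    print(f"{row:<16} {cells}")
```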
Descriptive Stats
Question: Business Turnover in 2010
Ratio
Descriptive Statistics – Turnover 2010

Mean                  £41,311.40
Median                £44,640.00
Mode                  £44,760.00
Standard Deviation     £9,191.03
The Mean
Features of the Mean
• It makes use of every value in the distribution, leading to a mathematical exactness which is useful for further mathematical processing
• It can be distorted by extreme values in the distribution
• For a discrete distribution, the mean may be an ‘impossible’ figure, e.g. average number of children per family = 2.4
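The ‘impossible’ figure can be reproduced with a few lines of Python. The family sizes below are hypothetical, chosen only so that the mean comes out at 2.4:

```python
from statistics import fmean

# Hypothetical number of children in five families (discrete data).
children_per_family = [1, 2, 2, 3, 4]

# The mean uses every value: (1 + 2 + 2 + 3 + 4) / 5.
mean_children = fmean(children_per_family)

print(f"Mean = {mean_children}")
# No family can actually have this number of children.
print(mean_children in children_per_family)
```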
The Median
• Calculating the Median
• A data series with an uneven number of items:
1, 2, 2, 4, 7, 7, 10
Median = 4 (the middle item)
The Median
• Calculating the Median
• A data series with an even number of items:
1, 2, 2, 4, 7, 7, 10, 12
Median = (4 + 7) / 2 = 5.5 (the mean of the two middle items)
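The two cases can be checked with Python's standard `statistics` module, which ranks the values and, for an even number of items, averages the two middle ones. This is a sketch for checking hand calculations, not the SPSS procedure used in the module:

```python
from statistics import median

# Series from the slides: seven items (odd) and eight items (even).
odd_series = [1, 2, 2, 4, 7, 7, 10]
even_series = [1, 2, 2, 4, 7, 7, 10, 12]

# Odd count: the single middle item of the ranked series.
median_odd = median(odd_series)

# Even count: the mean of the two middle items, (4 + 7) / 2.
median_even = median(even_series)

print(median_odd, median_even)
```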
The Median
Features of the Median
• Half the items in the series will have a value greater than or equal to the median and half less than or equal to the median
• It is therefore a measure of rank or position
• It is unaffected by the presence of extreme items in the distribution
• It may be found when the values of all the items are not known, provided that values of middle items and the total number of items are known
• Ranking the items can be tedious
• The median cannot be used for further mathematical processing
• It may not be representative if there are few items
The Mode
• Calculating the Mode
• Series 1:
1, 2, 2, 2, 7, 7, 10, 12
Single mode: 2
Data set is unimodal
The Mode
• Calculating the Mode
• Series 2:
1, 2, 2, 2, 7, 7, 7, 10
Two modes: 2 and 7
Data set is bimodal
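Both series can be checked in Python with `statistics.multimode` (available from Python 3.8), which returns every value that shares the highest frequency. A brief sketch using the two series from the slides:

```python
from statistics import multimode

# The two series from the slides.
single_mode_series = [1, 2, 2, 2, 7, 7, 10, 12]
two_mode_series = [1, 2, 2, 2, 7, 7, 7, 10]

# multimode returns a list containing all of the most frequent values.
modes_first = multimode(single_mode_series)
modes_second = multimode(two_mode_series)

print(modes_first)   # one mode: the series is unimodal
print(modes_second)  # two modes: the series is bimodal
```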
The Mode
Features of the Mode
• For discrete data it is an actual single value
• For continuous data it is the point of highest frequency density, but it is not suited to continuous data as different values each constitute a potential mode
• It may not be unique or clearly defined: the more modes there are, the less useful it is
• Extreme items do not affect its value
• It cannot be used for further mathematical processing
• It requires arrangement of the data, which may be time consuming
Mean, Median, Mode
Which to Use*?
• A measure of location must convey the distribution in a single figure
• It is therefore important to pick the right figure
• This depends on:
  - The type of data being used
  - The shape of the distribution (dispersion)
  - Whether the average will be the basis for further work on the data
[*Source: Buglear, 2000]
Mean, Median, Mode
Which measure to use?
1, 2, 3, 4, 5, 6, 7, 8, 9, 10: Mean = 5.5, Median = 5.5
1, 2, 3, 4, 5, 6, 7, 8, 9, 20: Mean = 6.5, Median = 5.5
1, 2, 3, 4, 5, 6, 7, 8, 9, 100: Mean = 14.5, Median = 5.5
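The three series show the mean being pulled upwards by a single extreme value while the median stays put. A short Python sketch that reproduces the figures (the dictionary labels are illustrative):

```python
from statistics import mean, median

# The three series from the slide; only the last value changes.
series = {
    "ends in 10": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "ends in 20": [1, 2, 3, 4, 5, 6, 7, 8, 9, 20],
    "ends in 100": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
}

# Store (mean, median) for each series.
summary = {label: (mean(values), median(values)) for label, values in series.items()}

for label, (series_mean, series_median) in summary.items():
    print(f"{label}: mean = {series_mean}, median = {series_median}")
```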
Mean, Median, Mode
Which measure to use? [Turn to page 2-34]
Median: 1, 23, 25, 26, 27, 23, 29, 30
Mode: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 5
Mean: 1, 1, 2, 3, 4, 1, 2, 6, 5, 8, 3, 4, 5, 6, 7
Median: 1, 101, 104, 106, 111, 108, 109, 200
Summary
                    NOMINAL   ORDINAL   INTERVAL   RATIO
Frequency Counts       ☐         ☐         ☐         ☐
Crosstabulations       ☐         ☐         ☐         ☐
Mean                   ☐         ☐         ☐         ☐
Median                 ☐         ☐         ☐         ☐
Mode                   ☐         ☐         ☐         ☐
Summary
NOMINAL
ORDINAL
INTERVAL
RATIO
Frequency Counts
☑
☑
Crosstabulations
☑
☑
Mean
☑
☑
☑
Median
☑
☑
☑
Mode
☑
☑
☑
☑
Descriptive Statistics
Distribution of Data
[Figure: Histogram of JJA temp; x-axis: JJA temp (14 to 20), y-axis: Frequency (0 to 30)]
• Dispersion: a way of describing how spread out a set of data is.
• Skewness: a measure of symmetry (or lack of symmetry) in a distribution.
• Central Tendency: a measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
Normal Distribution Curve
[Figure: normal distribution curve, y-axis: Frequency; Mean = 100, with 50% of cases on each side; Mean, Median and Mode marked at the centre]
Normal Distribution Curve
• Although these curves are different shapes they all have a normal distribution
[Figures: Normal Distribution Curve, Sample size = 10; Normal Distribution Curve, Sample size = 100]
Kurtosis [p. 3-98]
Flatness or peakedness: kurtosis of the distribution (degree to which tails of distribution are ‘heavy’ or ‘light’)
[Figure: Platykurtic, Mesokurtic and Leptokurtic curves]
Skewness
• Positive skew: clusters are at the lower end and the tail points towards higher scores
• Negative skew: clusters are at the higher end and the tail points towards the lower values
[Figures: Positive Skewed Distribution, Symmetrical Distribution, Negative Skewed Distribution; axes x and f]
Kurtosis
• If kurtosis is close to 0 then a normal distribution is often assumed: a mesokurtic distribution
• If kurtosis is greater than zero, then the distribution has heavier tails and is called a leptokurtic distribution
• If kurtosis is less than zero, then the distribution has light tails and is called a platykurtic distribution
Skewness and Kurtosis
To check if a distribution is normal we look at the values of kurtosis and skewness:
• Positive values of kurtosis indicate tall, pointy peaks and negative values indicate a flat, light-tailed distribution
• Positive values of skewness indicate too many low scores (left-hand side of graph) and negative values indicate too many high scores (right-hand side of graph)
• The further the values are from zero, the less likely the data are to be normally distributed
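To see where these values come from, the sketch below computes skewness and excess kurtosis from the simple moment formulas (kurtosis minus 3, so that a normal distribution scores 0). This is an illustration under stated assumptions: the function name and both data sets are hypothetical, and statistical packages typically apply small-sample corrections, so their output will differ slightly from these figures.

```python
from statistics import fmean


def moment_skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) using simple moment formulas."""
    n = len(values)
    centre = fmean(values)
    # Second, third and fourth central moments.
    m2 = sum((x - centre) ** 2 for x in values) / n
    m3 = sum((x - centre) ** 3 for x in values) / n
    m4 = sum((x - centre) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skewness, excess_kurtosis


# Hypothetical data: one symmetrical series, one with a long right-hand tail.
symmetrical = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 2, 3, 4, 10]

skew_sym, kurt_sym = moment_skewness_and_kurtosis(symmetrical)
skew_right, kurt_right = moment_skewness_and_kurtosis(right_tailed)

print(f"Symmetrical:  skewness = {skew_sym:.2f}, kurtosis = {kurt_sym:.2f}")
print(f"Right-tailed: skewness = {skew_right:.2f}, kurtosis = {kurt_right:.2f}")
```

The symmetrical series gives a skewness of zero and a negative (platykurtic) kurtosis, while the right-tailed series gives a positive skewness.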
Skewness
[Figures: Symmetrical Distribution, Positive Skewed Distribution and Negative Skewed Distribution (axes x and f), each marking the positions of the Mean, Median and Mode]
Note:
• In a symmetric distribution: Mean = Median = Mode
• In a positively skewed distribution: Mean > Median > Mode
• In a negatively skewed distribution: Mean < Median < Mode
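The ordering for a positively skewed distribution can be checked on a small example. The series below is hypothetical, chosen so that most values cluster at the low end with one high value in the tail:

```python
from statistics import mean, median, mode

# Hypothetical positively skewed series: cluster at the low end, long upper tail.
positively_skewed = [1, 1, 1, 2, 2, 3, 4, 10]

skewed_mean = mean(positively_skewed)      # 24 / 8
skewed_median = median(positively_skewed)  # mean of the two middle items (2 and 2)
skewed_mode = mode(positively_skewed)      # most frequent value

print(f"Mean = {skewed_mean}, Median = {skewed_median}, Mode = {skewed_mode}")
print(skewed_mean > skewed_median > skewed_mode)
```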
Distributions: GTBS10
Distributions
• In a normal distribution: kurtosis = 0
• Positive values indicate the distribution is leptokurtic: kurtosis > 0
• Negative values indicate the distribution is platykurtic: kurtosis < 0
• The greater the standard deviation the flatter the curve
Distributions: Savings
• In a normal distribution: skewness = 0 (the distribution is symmetrical)
• Positive values indicate the distribution is positively skewed
• Negative values indicate the distribution is negatively skewed
Distributions: Occupancy
• In a normal distribution: skewness = 0 (the distribution is symmetrical)
• Positive values indicate the distribution is positively skewed
• Negative values indicate the distribution is negatively skewed
Histograms
Graphing Data in SPSS
Using the Split File option or the Factor List
Graphing Data in SPSS
Adding the Normal Distribution Curve in Graph Builder or using Charts in Frequencies
Boxplots
• The line inside the box represents the median: approx 50
• The top of the box represents the upper value of the interquartile range: approx 79
• The bottom of the box represents the lower value of the interquartile range: approx 39
• The box represents the middle 50% of the data, or interquartile range: approx 41
• The end of the upper whisker represents the highest value: approx 121
• The end of the lower whisker represents the lowest value: approx 18
• The range: approx 103
• A point plotted beyond a whisker represents an extreme value or outlier
• The vertical line (whisker) extends up to 1.5 times the box height beyond the box
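The values read off a boxplot can also be calculated directly. The sketch below uses a small hypothetical data set, chosen so that the quartiles echo the approximate figures on the slides, and the common rule that whiskers reach no further than 1.5 times the interquartile range from the box. Software packages use slightly different quartile methods, so results on real data may not match exactly.

```python
from statistics import quantiles

# Hypothetical scores, sorted for readability.
scores = [18, 39, 45, 50, 60, 79, 121]

# Quartiles split the ranked data into four equal parts.
q1, q2, q3 = quantiles(scores, n=4)

# The box height is the interquartile range (middle 50% of the data).
iqr = q3 - q1

# Whiskers extend at most 1.5 box heights beyond the box;
# values outside these fences would be plotted as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

data_range = max(scores) - min(scores)

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: {lower_fence} to {upper_fence}; outliers: {outliers}")
print(f"Range = {data_range}")
```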
Using the Explore dialog box and using Area in the Factor List
Boxplots
Dispersion • Grade Profiles for Assessment Components BML224 2009
Dispersion – The Range
Two groups of 5 business and management students were asked to record weekly beer consumption:
• Group 1: 12, 12, 12, 12, 12
• Group 2: 0, 5, 10, 15, 30
The mean for both groups is 12, but this provides no indication of the differences between the two samples or of the level of dispersion.
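The range (highest value minus lowest value) makes the difference between the two groups visible. A short Python sketch using the figures above (the helper name `value_range` is illustrative):

```python
from statistics import mean

# Weekly beer consumption for the two groups of students.
group_1 = [12, 12, 12, 12, 12]
group_2 = [0, 5, 10, 15, 30]


def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)


mean_1, mean_2 = mean(group_1), mean(group_2)
range_1, range_2 = value_range(group_1), value_range(group_2)

print(f"Group 1: mean = {mean_1}, range = {range_1}")
print(f"Group 2: mean = {mean_2}, range = {range_2}")
```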
The Standard Deviation
• An estimate of the average variability (spread) of a set of data, measured in the same units of measurement as the original data
• Standard deviation measures the dispersion around the average, but does so on the basis of the figures themselves, not just the rank order
• It is calculated from the deviations of each item from the arithmetic mean
The Standard Deviation
• C. Andrews Brewery Tours - Salaries (£k)
• £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
• Mean = £23k (£184k/8)
• Standard Deviation = 9.68
• Global Heritage Tours - Salaries (£k)
• £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
• Mean = £23k (£184k/8)
• Standard Deviation = 2.00
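The two standard deviations can be reproduced with the sample formula (squared deviations from the mean, summed and divided by n - 1, then square-rooted), which is what `statistics.stdev` computes. A minimal Python sketch using the salary figures above:

```python
from statistics import mean, stdev

# Salaries in £k for the two companies.
brewery_tours = [11, 15, 15, 18, 25, 30, 32, 38]
heritage_tours = [20, 21, 22, 23, 23, 24, 25, 26]

# Both companies have the same mean salary.
brewery_mean = mean(brewery_tours)
heritage_mean = mean(heritage_tours)

# Sample standard deviation: sqrt(sum of squared deviations / (n - 1)).
brewery_sd = stdev(brewery_tours)
heritage_sd = stdev(heritage_tours)

print(f"Brewery Tours:  mean = {brewery_mean}, SD = {brewery_sd:.2f}")
print(f"Heritage Tours: mean = {heritage_mean}, SD = {heritage_sd:.2f}")
```

The same mean of £23k hides a much wider spread of salaries at the first company.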
The Standard Deviation
• ‘Quoting the standard deviation of a distribution is
a way of indicating a kind of ‘average’ amount by which all the values deviate from the mean. The greater the dispersion the bigger the deviations and the bigger the standard (average) deviation’ [Rowntree, 1981, p. 54, cited in Riley, M. et al, 1998, p. 197]