Section 3

Dispersion

Learning Outcomes

At the end of this session, you should be able to:

Understand the theory and assumptions relating to the distribution and variance of data

Calculate measures of dispersion, including the range, quartile deviation, standard deviation and standard error, both manually and using SPSS

Use confidence levels and z scores to establish the relationship between the sample mean and the population mean

Use the standard error to establish the extent to which the sample mean deviates from the population mean



Data Analysis for Research

3.0

Measures of Dispersion

Introduction

So far, you have been introduced to a number of different methods of graphically illustrating your data. But why is it important to do this? It is important because the way the data are distributed will influence the types of statistical test that are valid, as many of the statistical tests that you will be introduced to in this module make specific assumptions regarding the distribution of the data. One of the most important distributions that you need to consider is the normal distribution. Under a normal distribution, the characteristic frequency curve is bell-shaped and symmetrical around the mean. For example, if 1,000 people were asked to estimate the length of a room that was exactly 12 feet long, it is highly probable that most would say that the room was about 12 feet long. Some may guess as low as 11 feet and others may decide on 13 feet. However, we would expect most of the estimates to fall between 11 feet and 13 feet, and very few as far out as 9 feet or 15 feet. If the frequency distribution of the estimates were plotted on a graph, the pattern would tend to be bell-shaped, because most of the values would be clustered around the 12 feet mark, while the frequency of estimates would diminish away from this central value.

Figure 3.1:

Normal Distributions

The curves illustrated in Figure 3.1 all have a normal distribution, even though they are not quite the same. You can see that they differ in terms of how spread out the scores are and how peaked they are in the middle. Under a normal distribution, the mean, median and mode are exactly the same. These are features of a normal curve. Indeed, many natural phenomena, such as the heights of adult males and the weights of eggs, tend to produce the 'normal' (or Gaussian) distribution, and more significantly, most sampling will do so as well, regardless of the distribution of the population. This is why it, and sampling, are so important in statistics. The requirements of a normal distribution are not always met in research, especially when you are dealing with small sample sizes. If your sample size is less than 30, then reference to the normal distribution is not appropriate. It is generally found that the more scores from such variables that you plot, the more like the normal distribution they become. This can be seen in the following example. If you randomly selected 10 men and measured their heights in inches, the frequency histogram might be similar to Figure 3.2a. This histogram bears little resemblance to the normal distribution curve. If we were to select an additional 10 men and measure their heights, and then plot all 20 measurements, the resulting histogram (Figure 3.2b) would again not look like a normal distribution. However, you can see that as we select more and more men and plot their heights, the histogram becomes a closer approximation to the normal distribution (Figure 3.2c to 3.2e). By the time we have selected 100 men, you can see that the distribution is very close to normal.

Figure 3.2:

Normal Distribution and Sample Size

[Source: Dancey and Reid, 2002, p. 64]



3.1

Measures of Dispersion

Although the different types of average can help to describe frequency distributions to a certain extent, they are of limited use on their own, and additional measures are often required to illustrate the full picture and to assess how much variation there is in our sample or population. This situation is best illustrated by a simple example. Two groups of 5 SEMAL students were asked to record their weekly beer consumption. The results, in pints, were as follows:

Group 1: 12, 12, 12, 12, 12
Group 2: 0, 5, 10, 15, 30

Passing over the obvious comment that the 2nd group appears to contain someone who isn't a SEMAL student, the arithmetic mean for both groups is 12. However, this result gives no indication of the basic differences between the two sets of values. Therefore, a measure of dispersion (or spread) can be used to express the fact that one set of values is constant while the other ranges over a wide scale. The following section will highlight a number of ways in which the level of variance within a sample or population can be assessed.

3.1.1

The Range

The least sophisticated measure of dispersion is the range of a set of values. The range is simply the difference between the highest and lowest values of a series. As such, it only tells us about two values, which may be atypical of the rest of the data set. In reference to our previous example, the ranges for the beer consumption of the two groups of SEMAL students are:

Group 1: 0
Group 2: 30

Remember, the range is calculated by subtracting the minimum value from the maximum value. In this case:

Group 1: Max 12 − Min 12 = Range 0
Group 2: Max 30 − Min 0 = Range 30

Although the range tells us about the overall spread of scores, it does not give any indication of what is happening between these extremes. Ideally, we need an indication of the overall shape of the distribution and how much the scores vary from the mean. The range therefore gives only a crude indication of the spread of the scores and tells us little about the overall shape of the distribution of the sample.
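The range calculation above can be sketched in a few lines of Python, using the beer-consumption figures from the example:

```python
# Weekly beer consumption (pints) for the two groups in the text.
group1 = [12, 12, 12, 12, 12]
group2 = [0, 5, 10, 15, 30]

def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

print(value_range(group1))  # 0
print(value_range(group2))  # 30
```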




3.1.2

Quartile Deviation

The range, as a measure of dispersion, has the significant disadvantage of being susceptible to distortion by extreme values. One way of overcoming this is to ignore items in the top and bottom quarters of the distribution and to consider the range of the two middle quarters only. This is known as the interquartile range, since it is the difference between the first and third quartiles. The quartile deviation (semi-interquartile range) is one half of the interquartile range. For continuous data, the lower quartile (Q1) is determined by first ranking the data in order and then dividing the total sample number by 4. In the following example (see Figure 3.3), the lower quartile lies between the ages of the 2nd and 3rd visitors. Thus, the lower quartile value is 14 years (i.e. (13+15)/2). The upper quartile value is computed in a similar way, but by taking three quarters of the sample size. Thus the upper quartile lies between the ages of the 7th and 8th visitors, giving a value of 18 years (i.e. (18+18)/2). To summarise, we can now state that one quarter of visitors were aged 14 years or under, while one quarter were aged 18 years or more. In addition, we can also quote the interquartile range by stating that 50% of the visitors were aged between 14 and 18 years of age.

Figure 3.3:

Age Profile of Visitors to the Arun Youth Centre

10, 13, 15, 16, 16, 17, 18, 18, 18, 20

Lower quartile value = (13 + 15) / 2 = 14

Upper quartile value = (18 + 18) / 2 = 18

Effectively, the interquartile range is a refinement of the median and is most easily calculated from the cumulative frequency curve. In the Kano rainfall example, discussed in your descriptive statistics handout, the lower quartile is read off by tracing a line from the 25% level to the curve, and then down to the appropriate rainfall (about 725mm), and the upper quartile by reading across from 75% (to find about 1000mm). This means that over the period in question, half of the years had a rainfall between 725mm and 1000mm, with the interquartile range itself therefore being 275mm.
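The quartile method used for the visitor ages in Figure 3.3 can be sketched as follows. This is a minimal sketch of the handout's averaging approach, and it assumes (as in the example) that the quartile position falls between two ranked values:

```python
ages = sorted([10, 13, 15, 16, 16, 17, 18, 18, 18, 20])

def quartile(data, k):
    """Average the two ranked values straddling position k*n/4 (1-indexed),
    as in the handout's example; assumes the position is not a whole number."""
    pos = k * len(data) / 4          # 2.5 for Q1, 7.5 for Q3 when n = 10
    return (data[int(pos) - 1] + data[int(pos)]) / 2

q1 = quartile(ages, 1)
q3 = quartile(ages, 3)
print(q1, q3, q3 - q1)  # 14.0 18.0 4.0
```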




To calculate the quartile ranges for grouped data, it is first necessary to calculate the cumulative frequencies as in the Kano example. When trying to calculate the quartile values of grouped data it is again necessary to make assumptions regarding the distribution of values within the class. In this instance it is assumed that the distribution is even and the lower quartile is calculated as follows:

Q1 = LCL(Q1) + ((n/4 − cf(LC)) / f(Q1)) × w(Q1)

Where:

Q1: is the lower quartile
LCL(Q1): is the lower class limit of the class containing the lower quartile
n: is the sample size
cf(LC): is the cumulative frequency of the class immediately below that containing the lower quartile
w(Q1): is the width of the class interval containing the lower quartile
f(Q1): is the frequency of the class interval containing the lower quartile

The calculation for the upper quartile is:

Q3 = LCL(Q3) + ((3n/4 − cf(LC)) / f(Q3)) × w(Q3)

In this case, Q3 reflects the upper quartile, and the relevant upper-quartile values can be substituted into the description of terms stated for calculating Q1.
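The grouped-data formulas can be expressed as a single helper function. The class figures in the example below are hypothetical, chosen purely to show the arithmetic:

```python
def grouped_quartile(lcl, n, cf_below, f, w, k=1):
    """Q_k = LCL + ((k*n/4 - cf_below) / f) * w, the grouped-data formula above.
    lcl: lower class limit of the class containing the quartile;
    cf_below: cumulative frequency of the class immediately below it;
    f: frequency of that class; w: class width; k = 1 for Q1, 3 for Q3."""
    return lcl + ((k * n / 4 - cf_below) / f) * w

# Hypothetical example: Q1 falls in the class 10-20 (width 10, frequency 10),
# with n = 40 and 6 observations in the classes below.
print(grouped_quartile(lcl=10, n=40, cf_below=6, f=10, w=10, k=1))  # 14.0
```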




3.1.3

Mean Deviation

Unlike the range, the mean deviation measures dispersion about a particular average, namely the arithmetic mean. It is the average (arithmetic mean) of all the deviations of values from the arithmetic mean, ignoring minus signs. If deviations are considered with plus and minus signs and are measured from the mean, then their total will be zero, by definition of the arithmetic mean. Basically, the mean deviation tells us the average distance by which all items in a data set differ from their mean. For example, for the beer drinking figures of the 2nd group of SEMAL students:

Value:  0    5    10   15   30
d:     -12   -7   -2   +3   +18   (Σd = 0)
|d|:    12    7    2    3    18

[|d|, pronounced 'mod d', is the mathematical shorthand for saying: 'ignore minus signs'.]

The mean deviation = Σ|d| / n = (12 + 7 + 2 + 3 + 18) / 5 = 42 / 5 = 8.4

By ignoring minus signs, the mean deviation ignores the fact that some items are greater than the average and some less; consequently, this measure of dispersion gives no idea of the way the items are spread around the average.
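The mean deviation calculation above can be sketched directly:

```python
# The 2nd group's beer consumption from the example above.
drinks = [0, 5, 10, 15, 30]
mean = sum(drinks) / len(drinks)                              # 12.0

# Mean deviation: the average absolute distance from the mean.
mean_dev = sum(abs(x - mean) for x in drinks) / len(drinks)
print(mean_dev)  # (12 + 7 + 2 + 3 + 18) / 5 = 8.4
```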

3.1.4

Standard Deviation

Standard deviation is one of the most fundamental measures of dispersion used in statistical analysis. Standard deviation measures the dispersion around the average, but does so on the basis of the figures themselves, not just the rank order. Like the mean deviation, it is calculated from the deviations of each item from the arithmetic mean. To ensure that these deviations do not total zero, they are squared before being added together. This removes all minus signs, since two negative values multiplied together give a positive value. Thus, by summing the squares of the deviations, the 'sum of squares' (or sum of squared differences) is arrived at. The mean of the sum of squares is known as the 'variance'. The square root of the variance is the standard deviation. Standard deviation is symbolised by 's' for a sample and 'σ' for a population. For an ungrouped, discrete data series, the standard deviation can therefore be calculated as:

σ = √( Σ(x − x̄)² / n )

or alternatively,

σ = √( Σx² / n − (Σx / n)² )

Where:

σ: standard deviation
Σ: sum of
x: the value
x̄: the mean
n: the number of values

The calculation of the standard deviation is illustrated in the following example:

WORKED EXAMPLE: The Calculation of the Standard Deviation

Values (x)   (x − x̄)   (x − x̄)²
3             -0.6       0.36
2             -1.6       2.56
1             -2.6       6.76
2             -1.6       2.56
3             -0.6       0.36
4              0.4       0.16
3             -0.6       0.36
7              3.4      11.56
6              2.4       5.76
5              1.4       1.96
Totals: 36              32.4

Step 1: First calculate the mean of the sample:

x̄ = Σx / n = 36 / 10 = 3.6

Step 2: Now calculate the standard deviation:

σ = √( Σ(x − x̄)² / n ) = √(32.4 / 10) = 1.8

The standard deviation figure of 1.8 is useful, as it provides an indication of how closely the scores are clustered around the mean. The value of the standard deviation is best interpreted in the context of the normal distribution: roughly 68% of all scores fall within 1 standard deviation of the mean. In this example, with a standard deviation of 1.8, this tells us that the majority of scores in this sample are within 1.8 units above or below the mean (3.6 ± 1.8). The standard deviation is also useful when you want to compare samples using the same scale. For example, suppose we were to take a second sample of scores and calculate a standard deviation of 3.6. Comparing this to the standard deviation from our first sample would indicate that scores in the first sample are more closely clustered around the mean value than scores in the second sample.
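The worked example can be reproduced in a few lines of Python, using the same ten values:

```python
values = [3, 2, 1, 2, 3, 4, 3, 7, 6, 5]
n = len(values)

mean = sum(values) / n                                # 36 / 10 = 3.6
variance = sum((x - mean) ** 2 for x in values) / n   # 32.4 / 10 = 3.24
sd = variance ** 0.5                                  # square root of the variance
print(round(sd, 1))  # 1.8
```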




In conclusion, the standard deviation is a measure of dispersion which indicates the spread of the data values around the arithmetic mean. ‘Quoting the standard deviation of a distribution is a way of indicating a kind of ‘average’ amount by which all the values deviate from the mean. The greater the dispersion the bigger the deviations and the bigger the standard (average) deviation’ (Rowntree, 1981, p. 54, cited in Riley, M. et al, 1998, p. 197)





Activity 11: To demonstrate your familiarity with basic measures of dispersion, calculate the mean, median, range and standard deviation for the following data sets, relating to bedspace size of B&B establishments in Blackpool.

Sample A:
4 34 32 18 48 6 17 14 4 18
9 20 14 12 12 19 11 12 14 19
16 8 10 17 14 10 6 10 16 34
10 6 27 10 6 6 6 8 8 14

Results: Mean: Median: Range: Standard Deviation:

Sample B:
34 34 32 18 48 16 17 50 4 18
72 20 14 12 12 19 11 12 14 19
11 38 19 17 14 10 50 62 16 34
10 6 27 10 34 32 6 8 8 23

Results: Mean: Median: Range: Standard Deviation:

Sample C:
14 34 32 18 48 6 17 50 4 18
9 20 14 12 12 19 11 12 14 19
11 8 19 17 14 10 6 62 16 34
10 6 27 10 34 6 6 8 8 14

Results: Mean: Median: Range: Standard Deviation:



3.2

Other Distributions

There are, of course, variations on the normal distribution. Distributions can vary depending on how flat or peaked they are. The degree of flatness or peakedness is referred to as the kurtosis of the distribution. If a distribution is highly peaked it is leptokurtic, and if the distribution is flat it is platykurtic. Leptokurtic distributions appear relatively thin and somewhat pointy. In contrast, platykurtic distributions are flatter, reflecting a greater number of scores in the tails of the distribution. A distribution between the extremes of peakedness and flatness is classed as mesokurtic (see Figure 3.4). In a normal distribution curve, the value of kurtosis is 0 (i.e. the distribution is mesokurtic). If a distribution has a value above or below 0, then this indicates a level of deviation from the norm. You don't need to worry about kurtosis too much at this point, but you will notice that when you produce descriptive statistics in SPSS, a value for kurtosis is given. Positive values of kurtosis indicate that the distribution is leptokurtic, whereas negative values suggest that the distribution is platykurtic (Dancey and Reid, 2002).

Figure 3.4:

Examples of Leptokurtic, Platykurtic and Mesokurtic Distributions

[Source: Dancey and Reid, 2002, p. 70]



3.2.1

Skewed Distributions

Distributions can also be skewed (see Figure 3.5). A positive skew is when the peak lies to the left of the mean, and a negative skew when it lies to the right of the mean. The further the peak lies from the centre of the horizontal axis, the more the distribution is said to be skewed. If you come across badly skewed distributions, then you need to consider whether the mean is the best measure of central tendency, as the scores in the extended tail will be distorting your mean. As discussed in your descriptive statistics handout, in this situation it might be more appropriate to use the median or the mode to give a more representative indication of the typical score in your sample. The SPSS output for descriptive statistics will also provide a measure of skewness. A positive value suggests a positively skewed distribution, whereas a negative value suggests a negatively skewed distribution. A value of zero indicates that the distribution is not skewed in either direction (i.e. the distribution is symmetrical).

Figure 3.5:

Examples of Skewed Distributions

These refinements need not concern us here, but will need consideration when it comes to deciding which statistical tests you wish to use to examine the data. For now, it is perhaps enough to make a distinction between the most powerful 'parametric' tests, which rely on the data concerned being normally distributed, and the less powerful 'non-parametric' ones, which do not. If you have control over the collection of your data, you should do your best to collect data on which parametric tests can be conducted; but if you cannot ensure this quality, or need to use others' information, it may be better to use the less powerful tests.



3.3

The Standard Normal Distribution

The standard normal distribution (SND) is a probability distribution. The value of probability distributions is that there is a probability associated with each particular score from the distribution. More specifically, the area under the curve between any specified points represents the probability of obtaining scores within these specified points. For example, the probability of obtaining scores between -1 and +1 standard deviations from the mean is about 68% (see Figure 3.6). This means that:

- 68.26% of observations fall within plus or minus one standard deviation of the mean;
- 95.44% of observations fall within plus or minus two standard deviations of the mean;
- 99.7% of observations fall within plus or minus three standard deviations of the mean.

These percentage values will be referred to later as 'confidence limits'.

Figure 3.6:

The Standard Normal Distribution
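These percentages can be checked against the normal cumulative distribution, which is available through the error function in Python's standard library. This is a sketch for checking the figures, not part of the course material:

```python
from math import erf, sqrt

def area_within(z):
    """Proportion of the standard normal curve lying between -z and +z."""
    return erf(z / sqrt(2))

for z in (1, 2, 3):
    print(z, round(area_within(z) * 100, 2))  # 68.27, 95.45, 99.73
```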

Let me illustrate this with a specific example. Figure 3.7 illustrates the number of tourists bungee jumping off a bridge at an extreme sports academy in New Zealand. There were 150 tourists in total, and the brave souls were most frequently aged between 26 and 30 (the highest bar). The graph also indicates that very few people over the age of 60 participate in bungee jumping (thank goodness for that!). If we think about this distribution as a probability distribution, we could start asking specific questions. For example, how likely is it that a 60 year old will undertake a bungee jump in New Zealand? A look at the distribution and your answer might be 'not very likely'. However, if you were asked how likely it is that a 30 year old went bungee jumping, your answer would be 'quite likely'. Indeed, the distribution shows that 30 of the 150 tourists were aged around 30 (equating to 20% of the total sample). Therefore, using this data it is possible to estimate the probability that a particular score will occur.



Figure 3.7: Tourists Bungee Jumping in New Zealand

Using the characteristics of the SND, it is possible to calculate the probability of obtaining scores within any section of the distribution. Statisticians (much cleverer than me) have calculated the probability of certain scores occurring in a normal distribution with a mean of 0 and a standard deviation of 1. If our sample shares these values, then we can use a table of probabilities for the normal distribution to assess the likelihood of a particular score occurring. In reality, however, it is unlikely that the data we collect will have a mean of 0 and a standard deviation of 1. Fortunately, as Field (2003) points out, any data set can be converted into a data set that has a mean of 0 and a standard deviation of 1. First, to centre the data at zero, we take each score and subtract from it the mean of all the scores. Then, we divide the resulting score by the standard deviation to ensure the data have a standard deviation of 1. The resulting scores are called z scores. The z score is expressed in standard deviation units - the z score therefore tells us how many standard deviations above or below the mean our score is. A negative z score is below the mean and a positive z score is above the mean. Extreme z scores, for example greater than +2 or below -2, have a much smaller chance of being obtained than scores in the middle of the distribution. That is, the areas of the curve above +2 and below -2 are small in comparison with the area between -1 and +1 (see Figure 3.8).
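The two-step conversion to z scores (subtract the mean, divide by the standard deviation) can be sketched as follows; the raw scores here are hypothetical, purely for illustration:

```python
# Hypothetical raw scores, purely for illustration.
scores = [30, 25, 41, 32, 28, 36]
n = len(scores)
mean = sum(scores) / n
sd = (sum((x - mean) ** 2 for x in scores) / n) ** 0.5

# Convert each score to a z score: subtract the mean, divide by the SD.
z_scores = [(x - mean) / sd for x in scores]

# The transformed data now have mean 0 and standard deviation 1.
print(round(sum(z_scores) / n, 6))                           # 0.0
print(round((sum(z ** 2 for z in z_scores) / n) ** 0.5, 6))  # 1.0
```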




Figure 3.8:

Areas in the middle and extremes of the Standard Normal Distribution

Let us refer back to our example of bungee jumping in New Zealand, where we can now answer the question: what is the probability of someone over 60 doing a bungee jump? First we need to convert 60 into a z score. For the population, the mean age is 32 and the standard deviation is 11. The z score is calculated as:

z = (score − mean) / standard deviation

In this instance, 60 becomes:

(60 − 32) / 11 = 2.54

This indicates that the score is 2.54 standard deviations above the mean. Consider another example. The mean score for many IQ tests is 100 and the standard deviation is 15. If you had an IQ score of 130, then your z score would be:

(130 − 100) / 15 = 2

This indicates that your score is 2 standard deviations above the mean. Using the z score we can also calculate the proportion of the population who would score above or below your score - or, in the case of the normal distribution, the area under the normal distribution curve. Figure 3.9 illustrates that the IQ score of 130 is 2 standard deviations above the mean. The shaded area represents the proportion of the population who would score less than you, and the unshaded area represents those who would score more than you. To calculate the specific proportion of the population that would score more or less than you, we refer to a standard normal distribution table (see Table 3.1). The table indicates that the proportion falling below your z score is 0.9772, or 97.72%. In order to find the proportion above your score, you simply subtract this proportion (0.9772) from 1. In this case the proportion is 0.0228, or 2.28%. When using statistical tables for the SND you should note that only details of positive z scores are given (those that fall above the mean). If you have a negative z score, disregard the negative sign to find the relevant areas above and below your score (Figure 3.10).
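The table lookup for the IQ example can be checked with the normal cumulative distribution function, again built from the standard library's error function (a sketch rather than a replacement for the tables):

```python
from math import erf, sqrt

def phi(z):
    """Cumulative proportion of the standard normal curve below z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = (130 - 100) / 15           # IQ of 130 -> z = 2.0
print(round(phi(z), 4))        # 0.9772: 97.72% of the population score below you
print(round(1 - phi(z), 4))    # 0.0228: 2.28% score above you
```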




Figure 3.9: Normal Distribution showing the proportion of the population with an IQ of less than 130

Table 3.1:

Z Scores for Standard Normal Distribution




Figure 3.10: The proportions of the curve below positive z scores and above negative z scores

Let us now refer back to the z score calculated when asking about the probability of people over 60 bungee jumping in New Zealand. The calculated z score is 2.54. Refer to the table of probability values that has been included in the appendices. Look up the value of 2.54 in the column labelled 'smaller portion' (i.e. the area above the value 2.54). You should find that the probability value is 0.00554 - a 0.55% chance that a person over 60 would bungee jump. By looking at the values of the 'bigger portion', we find that the probability of those jumping being under the age of 60 is 0.99446. Put another way, there is a 99.45% probability that the tourists jumping were aged below 60 (0.99446 = 1 − 0.00554). Certain z scores are particularly important, as their values cut off certain important percentages of the distribution. As Field (2003) highlights, the first important value is 1.96, as this cuts off the top 2.5% of the distribution, and its counterpart at the opposite end (-1.96) cuts off the bottom 2.5% of the distribution. As such, these values together cut off 5% of scores; put another way, 95% of z scores lie between -1.96 and +1.96. The other important scores are ±2.58 and ±3.29, which cut off 1% and 0.1% of scores respectively. Put another way, 99% of z scores lie between -2.58 and +2.58, and 99.9% of z scores lie between -3.29 and +3.29. These values will crop up time and time again; indeed, we have already referred to them when discussing the characteristics of the normal distribution curve.
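The important cut-off values can be verified in the same way, by computing the proportion of z scores that lies between each pair of limits (again a quick check, not course material):

```python
from math import erf, sqrt

def within(z):
    """Proportion of z scores lying between -z and +z."""
    return erf(z / sqrt(2))

for z in (1.96, 2.58, 3.29):
    print(z, round(within(z) * 100, 2))   # approx. 95%, 99% and 99.9%
```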



3.4

Confidence Intervals

Although the sample mean is an approximation of the population mean, we are not sure how good an approximation it is. Because the sample mean is a particular value or point along a variable, it is known as a point estimate of the population mean. It represents one point on a variable, and because of this we do not know whether our sample mean is an underestimate or an overestimate of the population mean. We can therefore use confidence intervals to help us identify where on the variable the population mean may lie. Confidence intervals of the mean are interval estimates of where the population mean may lie: they provide us with a range of scores (an interval) within which we can be confident that the population mean lies (see Figure 3.11). Because we are still only using estimates of population parameters, it is not guaranteed that the population mean will fall within this range; we therefore have to give an expression of how confident we are that the range we calculate contains the population mean. Hence the term 'confidence intervals'.

Figure 3.11: The role of confidence intervals in determining the position of the population mean in relation to the sample mean




We have already discussed the characteristics of the sampling distribution of the mean: it tends to be normally distributed and provides a good approximation of the population mean. Using the basic characteristics of the normal distribution allows us to estimate how far our sample mean is from the population mean. As shown in Figure 3.12, we know that the sample mean is going to be a certain number of standard deviations above or below the population mean. Indeed, we can be 99.7% certain that the sample mean will fall within -3 and +3 standard deviations. As discussed earlier, this area accounts for most of the scores in the distribution. If we wanted to be 95% certain that a certain area of the distribution contained the sample mean, we would have to refer back to the z scores. As highlighted earlier, 95% of the area under the SND falls within -1.96 and +1.96 standard deviations. Thus we can be 95% certain that the sample mean will lie between -1.96 and +1.96 standard deviations of the population mean (see Figure 3.13).

Figure 3.12: Sample mean is a certain number of S.Ds above or below the population mean

Figure 3.13: Percentage of curve (95%) falling between -1.96 and +1.96 S.Ds




For illustration, assume that the sample mean is somewhere above the population mean. If we draw the distribution around the sample mean instead of the population mean, we see the situation in Figure 3.14.

Figure 3.14: Location of the population mean where the distribution is drawn around the sample mean

Applying the same logic, we can be confident that the population mean falls somewhere within 1.96 standard deviations below the sample mean. Similarly, if the sample mean is below the population mean, we can be confident that the population mean is within 1.96 standard deviations above the sample mean (Figure 3.15). We can therefore be 95% confident that the population mean is within the region 1.96 standard deviations above or below the sample mean. With this information we can now calculate how far the sample mean is from the population mean. All we need to know is the sample mean and the standard deviation of the sampling distribution of the mean (the standard error).

Figure 3.15: Distribution drawn around the sample mean when it falls below the population mean



Activity 12: Use the following table to calculate the following:

a] The probability that z is less than or equal to 0.7
b] The probability that z is more than 0.7
c] The probability that z is less than or equal to 2 and equal to or more than -2
d] The probability that z is less than or equal to 3 and equal to or more than -3

Record your answers below:



3.5

The Standard Error

One useful adjunct of the normal distribution is the standard error, or the standard deviation of the sampling distribution of the mean, which can be helpful in gauging the precision of your sample and in deciding, from a pilot study, how large your eventual sample should be. The standard error is a measure of the degree to which the sample means deviate from the mean of the sample means. Given that the mean of the sample means is also a close approximation of the population mean, the standard error of the mean must also tell us the degree to which the sample means deviate from the population mean. Consequently, once we are able to calculate the standard error, we can use this information to find out how good an estimate our sample mean is of the population mean. This is illustrated in Figure 3.16.

Figure 3.16: Calculating the Standard Error

[Source: Field, A, 2003, p. 16]




Figure 3.16 illustrates the process of taking samples from a population. In this case Field (2003) is looking at the ratings of lecturers. If we were to take the ratings of all lecturers, the mean value would be 3. As illustrated in Figure 3.16, each sample has a mean value, and these have been presented in a frequency chart. As you can see, some samples have the same mean as the population, some are lower and some are higher. These differences reflect sampling variation. The end result is a symmetrical distribution, known as a sampling distribution (Field, 2003). If we were to take the average of all the sample means, we would get the same value as the population mean. But how much do the sample means vary around the population mean? Just as we use the standard deviation as a measure of how representative the mean is of the observed data, measuring the standard deviation between sample means gives a measure of how much variability there is between the means of the different samples. The standard deviation of the sample means is known as the standard error of the mean. The standard error is very similar to the standard deviation, but takes account of sample size: the larger the sample size, the lower the standard error.

SE(mean) = Standard Deviation of the Sample (s) / √Sample Size (n)

Dividing the standard deviation by the square root of the sample size takes account of the fact that the larger the sample size, the more likely it is that the sample is representative, and vice versa. Any probability of the sample mean being close to the population mean can be calculated, but for our purposes we will only examine the population mean that we can estimate with 95% probability, which corresponds to approximately two standard errors away from the mean. For example, in investigating the geography of sport in Lancashire, you might want to find out how far Warrington's supporters travelled to the match. From sampling the crowd you might find a mean of 23km travelled to Wilderspool, and a standard error of 3km. This means that your sampling suggested (with 95% certainty) that the mean distance which supporters of Warrington RLFC travelled to the match is 23km ± 6km. This does not mean that 95% of supporters travel between 17km and 29km; rather, it is a measure of the confidence with which you state the mean. You can be pretty certain that if you sampled the crowd twenty times, nineteen of your answers would be within this range.
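The standard error formula can be sketched as a one-line helper. The Warrington text gives only the mean (23km) and the standard error (3km), so the standard deviation and sample size below are hypothetical figures chosen to be consistent with that example:

```python
from math import sqrt

def standard_error(sd, n):
    """Standard error of the mean = sample standard deviation / sqrt(n)."""
    return sd / sqrt(n)

# Hypothetical figures consistent with the Warrington example: a standard
# deviation of 30 km from a crowd sample of 100 supporters gives SE = 3 km,
# so the approximate 95% range is 23 km +/- 6 km.
se = standard_error(30, 100)
print(se)                        # 3.0
print(23 - 2 * se, 23 + 2 * se)  # 17.0 29.0
```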

p. 3-110



The following example highlights a practical application of the standard error in attempting to assess the mean spending of short break holidaymakers in Chichester.

WORKED EXAMPLE

In an example, the following visitor spending results (£) were obtained:

Values (x): 109  97  112  156  86  94  176  158  147  135

Step 1: First calculate the mean of the sample:

x̄ = Σx / n = 1270 / 10 = 127

Step 2: Now calculate the standard deviation:

Values (x)   (x − x̄)   (x − x̄)²
   109         −18        324
    97         −30        900
   112         −15        225
   156          29        841
    86         −41       1681
    94         −33       1089
   176          49       2401
   158          31        961
   147          20        400
   135           8         64
Totals                   8886

σ = √( Σ(x − x̄)² / n ) = √(8886 / 10) = 29.809

Step 3: Now calculate the standard error:

SE = Standard Deviation of the Sample (s) / √Sample Size (n)

SE = 29.809 / √10 = 29.809 / 3.16 = 9.43

In this example, the standard error has been calculated at 9.43. With reference back to the properties of the normal distribution curve, we can conclude that 68 times out of 100 (or approximately 2 in 3 times) the true mean of the population lies within the range 127 ± 9.43, that is between £117.57 and £136.43 (the Mean ± 1 x Standard Error (SE)). If we wish to predict the range with greater confidence, then the rule of plus or minus two standard errors can be applied to give a 95% confidence level. In this case the true mean of the population would lie within the range 127 ± 18.86, that is between £108.14 and £145.86 (the Mean ± 2 x SE).
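The arithmetic of the worked example can be checked with a few lines of Python (a sketch using the ten spending values from the table; the standard deviation is the population-style formula, dividing by n, as in the example):

```python
import math

# Visitor spending from the worked example (£)
spending = [109, 97, 112, 156, 86, 94, 176, 158, 147, 135]

n = len(spending)
mean = sum(spending) / n                             # 1270 / 10 = 127
sum_sq_dev = sum((x - mean) ** 2 for x in spending)  # Σ(x − x̄)² = 8886
sd = math.sqrt(sum_sq_dev / n)                       # population formula, as above
se = round(sd / math.sqrt(n), 2)                     # 9.43

print(mean, sum_sq_dev, round(sd, 3), se)
# 68% range: Mean ± 1 SE
print(round(mean - se, 2), round(mean + se, 2))        # 117.57 136.43
# 95% range: Mean ± 2 SE
print(round(mean - 2 * se, 2), round(mean + 2 * se, 2))  # 108.14 145.86
```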

p. 3-111



In the above example, the selected standard errors equated to critical z values of 1.0 and 2.0. These values help to establish and define the 'confidence limits'. As discussed, these limits are usually described in percentage rather than absolute terms, and you would therefore refer to 68.2%, 95.4% and 99.7% confidence levels. For the 95% (0.95) and 99% (0.99) levels (the percentage values have been rounded for convenience) the critical z values are 1.96 and 2.58 respectively. Therefore, if we refer back to our previous example, we can redefine our confidence limits and the expected ranges in which we would expect the population mean to lie. In the previous example, at the 95% confidence level the limits were given by:

127 ± (2 x 9.43) = £108.14 to £145.86

If we adopt the critical z value for the standard error at a 95% confidence level, the limits are now defined as:

127 ± (1.96 x 9.43) = £108.52 to £145.48

If we adopt the critical z value for the standard error at a 99% confidence level, the limits are now defined as:

127 ± (2.58 x 9.43) = £102.67 to £151.33

Effectively, higher confidence levels can only be achieved at the expense of wider confidence intervals. Therefore, we can be 99% certain that the population mean lies between £102.67 and £151.33, but only 95 per cent confident that it lies between the narrower limits of £108.52 and £145.48. Clearly the best way to gain greater accuracy in sample estimates is to increase the sample size (n). As the sample size (n) increases, the standard error, or spread, of the sampling distribution is reduced and the resulting confidence intervals are narrowed.
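The z-based limits can be computed directly. A minimal sketch, using the mean and standard error from the worked example:

```python
# Mean and standard error from the worked example
mean, se = 127, 9.43

# Critical z values for the common confidence levels
z = {"68%": 1.0, "95%": 1.96, "99%": 2.58}

for level, zval in z.items():
    lower = round(mean - zval * se, 2)
    upper = round(mean + zval * se, 2)
    print(f"{level} confidence interval: £{lower} to £{upper}")
# 95% prints £108.52 to £145.48; 99% prints £102.67 to £151.33
```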
Referring back to our previous example, which focused on visitor spending, increasing the sample size to 100 yields the following results:

Mean: £127   Std Dev: 29.95   Standard Error: 2.99

Adopting the same confidence limits as before, at the 95% confidence level the population mean lies between:

127 ± (1.96 x 2.99) = £121.14 to £132.86

and at the 99% confidence level between:

127 ± (2.58 x 2.99) = £119.29 to £134.71

As you can clearly see, increasing the sample size has significantly reduced the width of the confidence intervals.

p. 3-112



This is graphically illustrated in Figure 3.17 below.

Figure 3.17: Confidence Intervals with Sample Sizes

a] Sample Size of 10: the 95% confidence interval runs from £108.52 to £145.48 (range = 36.96) around the sample mean of 127.

b] Sample Size of 100: the 95% confidence interval runs from £121.14 to £132.86 (range = 11.72) around the sample mean of 127.

As is evident in Figure 3.17, increasing the sample size results in a much narrower range of scores and gives us a much clearer indication of where the population mean may be. This in turn underlines the importance of sample size when trying to estimate population parameters from sample statistics. Generally the larger the sample size, the better the estimate of the population we can get from it.

p. 3-113




Activity 14: Refer back to the exercise on page 3-93. This time calculate the standard error for each sample, and the standard error ranges at 95% and 99% (using z-scores).

Results:

Sample A: 4 34 32 18 48 6 17 14 4 18

9 20 14 12 12 19 11 12 14 19

16 8 10 17 14 10 6 10 16 34

10 6 27 10 6 6 6 8 8 14

Standard Error:

The standard error range at 95%    Lower:            Upper:

The standard error range at 99%    Lower:            Upper:

Results:

Sample B: 34 34 32 18 48 16 17 50 4 18

72 20 14 12 12 19 11 12 14 19

11 38 19 17 14 10 50 62 16 34

10 6 27 10 34 32 6 8 8 23

Standard Error:

The standard error range at 95%    Lower:            Upper:

The standard error range at 99%    Lower:            Upper:

Results:

Sample C: 14 34 32 18 48 6 17 50 4 18

9 20 14 12 12 19 11 12 14 19

11 8 19 17 14 10 6 62 16 34

10 6 27 10 34 6 6 8 8 14

Standard Error:

The standard error range at 95%    Lower:            Upper:

The standard error range at 99%    Lower:            Upper:
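Once you have worked the samples by hand, a small helper function along these lines can be used to check your answers (a sketch; it uses the population-style standard deviation, dividing by n, to match this section's worked example):

```python
import math

def se_ranges(sample, z95=1.96, z99=2.58):
    """Standard error and 95% / 99% ranges for a sample.

    Uses the population-style standard deviation (dividing by n),
    matching the worked example in this section.
    """
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    se = sd / math.sqrt(n)
    return {
        "mean": round(mean, 2),
        "se": round(se, 2),
        "95%": (round(mean - z95 * se, 2), round(mean + z95 * se, 2)),
        "99%": (round(mean - z99 * se, 2), round(mean + z99 * se, 2)),
    }

# Checked against the visitor-spending worked example earlier in this section:
print(se_ranges([109, 97, 112, 156, 86, 94, 176, 158, 147, 135]))
```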

p. 3-114



3.6

Looking at Distributions in SPSS

As discussed in this handbook, SPSS will produce basic descriptive statistics for dispersion in the Descriptives dialog box; refer back to your descriptive statistics section for guidance. Statistics for dispersion can also be produced via the Explore dialog box. The following example uses the Age variable in the Dataset file. Move the mouse over Analyse and press the left mouse button, then move the mouse over Descriptive Statistics and then over Explore and press the left mouse button.

The Explore dialog box appears.

Select Age and click the central arrow so that Age appears in the Dependent List.

p. 3-115



Move the mouse over Statistics and press the left mouse button. The Explore: Statistics dialog box opens. At this point we can assign a confidence interval for the mean (as discussed in the previous sections). Make sure that the confidence interval is set to 95%. Click Continue. This returns you to the Explore dialog box. Click OK.

A summary table is produced in the output window.

p. 3-116



This summary table provides you with basic descriptive statistics, including the mean and the median, and measures of dispersion, including the range, standard deviation and standard error. The output also provides the 95% confidence interval for the mean (47.07 to 48.34). Note that Age is ratio data; the mean would not be appropriate for ordinal or nominal data sets.
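The same statistics that Explore reports can be reproduced in standard-library Python. The sketch below uses invented ages (the Dataset file's Age variable is not reproduced here) and a z value of 1.96 for the interval, where SPSS's Explore uses the t-distribution and so reports a slightly wider interval for small samples:

```python
import math
import statistics

# Invented ages standing in for the Dataset file's Age variable
ages = [47, 52, 41, 60, 45, 49, 55, 43, 50, 48, 46, 53, 44, 51, 47]

n = len(ages)
mean = statistics.mean(ages)
median = statistics.median(ages)
rng = max(ages) - min(ages)
sd = statistics.stdev(ages)        # sample SD (divides by n - 1), as SPSS reports
se = sd / math.sqrt(n)

# Approximate 95% CI for the mean using z = 1.96; SPSS's Explore uses the
# t-distribution, so its interval is slightly wider for small samples.
ci = (round(mean - 1.96 * se, 2), round(mean + 1.96 * se, 2))
print(f"mean {mean:.2f}, median {median}, range {rng}, "
      f"SD {sd:.2f}, SE {se:.2f}, 95% CI {ci}")
```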

3.7

Graphically Looking at Distributions in SPSS

Refer back to your descriptive statistics handbook for information on how to produce basic frequency histograms, stem and leaf plots and box plots. SPSS will also allow you to plot the normal distribution curve over a frequency histogram, so you can see how the distribution of your sample compares with the normal distribution. The following example again uses the Age variable in the Dataset file. Move the mouse over Graphs, then Chart Builder, and press the left mouse button.

p. 3-117



The Chart Builder dialog box appears.

p. 3-118



Select Histogram in the Choose From: box. A series of charts are presented.

p. 3-119



Move the mouse over the Simple Histogram and, holding the left mouse button down, drag it into the chart window. Release the left mouse button and a simple histogram is presented. An Element Properties dialog box also appears; we will return to this shortly.

You will notice that the histogram presents options for the vertical Y-axis and the horizontal X-axis. In this case we need to assign Age to the X-axis. The vertical Y-axis will be frequency which SPSS will default to automatically.

p. 3-120



Move the mouse over Age in the Variables box and, holding down the left mouse button, drag Age over to the X-Axis box. Release the left mouse button and Age is assigned to the X-axis of the histogram.

We now need to assign a normal distribution curve to the histogram. Shift your attention to the accompanying Element Properties dialog box. Select Display normal curve in the dialog box and click Apply.

Notice that a normal distribution curve has been superimposed on top of the histogram in the Chart Builder window. Click OK in the Chart Builder window.

p. 3-121



A frequency histogram is produced in the output window, with a normal distribution curve plotted over it. As you can see from this output, the Age variable bears some resemblance to the normal distribution, although the overall shape of the curve is influenced by a number of outlying values.
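The comparison SPSS draws here, observed frequencies against a fitted normal curve, can also be sketched without SPSS. In the standard-library Python below (with invented, roughly normal data standing in for Age), each bin's observed count is set against the count the fitted normal distribution would predict:

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)
# Invented, roughly normal ages standing in for the Dataset Age variable
ages = [random.gauss(47.7, 8) for _ in range(200)]

# Fit a normal distribution using the sample mean and standard deviation
mu, sigma = statistics.mean(ages), statistics.stdev(ages)
dist = NormalDist(mu, sigma)

# Bin the data and compare observed counts with the counts the
# fitted normal distribution would predict for each bin.
lo, hi, bins = min(ages), max(ages), 8
width = (hi - lo) / bins
rows = []
for i in range(bins):
    left, right = lo + i * width, lo + (i + 1) * width
    observed = sum(left <= x < right or (i == bins - 1 and x >= right)
                   for x in ages)
    expected = len(ages) * (dist.cdf(right) - dist.cdf(left))
    rows.append((observed, expected))
    print(f"{left:5.1f}-{right:5.1f}  observed {observed:3d}  expected {expected:5.1f}")
```

Large gaps between the observed and expected columns, particularly in the tails, play the same role as outlying bars poking above the superimposed curve in the SPSS histogram.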

p. 3-122



As before, we can also use the Split File option to look at specific cases. For example, here Area has been selected and two separate distribution curves, one for the Chichester District and one for the Arun District, have been produced.

p. 3-123




Activity 15:



Table 15: Descriptive Statistics for GTBSscore08

GTBSscore08

Please cut and paste your histogram below and rescale accordingly

Descriptive Statistics
Mean:
Median:
Mode:
Standard Deviation:
Standard Error:
Skewness:
Kurtosis:

Please provide a brief summary of the distribution:

Table 16: Descriptive Statistics for GTBSscore08 - Chichester District

GTBSscore08: Chichester District

Please cut and paste your histogram below and rescale accordingly

Descriptive Statistics
Mean:
Median:
Mode:
Standard Deviation:
Standard Error:
Skewness:
Kurtosis:

Please provide a brief summary of the distribution:

p. 3-124



Table 17: Descriptive Statistics for GTBSscore08 - Arun District

GTBSscore08: Arun District Council

Please cut and paste your histogram below and rescale accordingly

Descriptive Statistics
Mean:
Median:
Mode:
Standard Deviation:
Standard Error:
Skewness:
Kurtosis:

Please provide a brief summary of the distribution:

Repeat this exercise for an additional variable (which should be interval or ratio in nature). Record your results by cutting and pasting your output into your log book.

p. 3-125



Notes:

p. 3-126

