BML224 - Statistics Manual

BML224: Data Analysis for Research

Dr Andrew Clegg



Contents

Course Outline	p. i

Section 1: Sampling and Types of Data
1.0 Introduction - Why use Statistics?	p. 1-1
1.1 Sampling	p. 1-1
1.2 Some Terms in Sampling	p. 1-2
1.3 Avoiding Bias	p. 1-2
1.4 Deciding on the Choice of Sampling Techniques	p. 1-2
1.5 Summary	p. 1-6
1.6 Types of Data	p. 1-7
1.7 Presenting Data	p. 1-11

Section 2: Descriptive Statistics
2.0 Introduction	p. 2-25
2.1 Measures of Central Tendency	p. 2-25
2.2 Arithmetic Mean	p. 2-25
2.3 The Median	p. 2-27
2.4 The Mode	p. 2-31
2.5 Comparison of the Mean, Median and Mode	p. 2-33
2.6 The Population Mean	p. 2-34
2.7 Skew and the Relationship of the Mean, Median and Mode	p. 2-36
2.8 Using SPSS to Calculate Descriptive Statistics	p. 2-37
2.9 Graphically Describing Data	p. 2-68
2.10 Graphically Describing Data in SPSS	p. 2-74
2.11 Creating Crosstabulations in SPSS	p. 2-80

Section 3: Measures of Dispersion
3.0 Introduction	p. 3-89
3.1 Measures of Dispersion	p. 3-91
3.2 Other Distributions	p. 3-98
3.3 The Standard Normal Distribution	p. 3-100
3.4 Confidence Intervals	p. 3-105
3.5 The Standard Error	p. 3-109
3.6 Looking at Distributions in SPSS	p. 3-115
3.7 Graphically Looking at Distributions in SPSS	p. 3-117

Section 4: Student T-Test, Paired Samples T-Test, Mann Whitney and Wilcoxon
4.0 Introduction	p. 4-127
4.1 Null and Alternative Hypotheses	p. 4-127
4.2 Hypothesis Testing	p. 4-129
4.3 One and Two Tailed Tests	p. 4-129
4.4 Choosing the Right Test	p. 4-132
4.5 Parametric Tests	p. 4-134
4.6 Using SPSS to Calculate the Student T-Test	p. 4-135
4.7 Using SPSS to Calculate the T-Test for Related Samples	p. 4-147
4.8 Non-Parametric Tests	p. 4-152
4.9 Using SPSS to Calculate Mann Whitney	p. 4-153
4.10 Using SPSS to Calculate Wilcoxon Signed Ranks Test	p. 4-161

Section 5: Chi-Squared
5.0 Introduction	p. 5-167
5.1 The One Sample Chi-Squared Test	p. 5-169
5.2 The Chi-Squared Test for Two or More Samples	p. 5-172
5.3 Yates Correction Factor	p. 5-175
5.4 Conditions Necessary for Conducting a Chi-Squared Test	p. 5-176
5.5 Using SPSS to Calculate Chi-Squared	p. 5-180

Section 6: Correlation
6.0 Introduction	p. 6-191
6.1 The Meaning of Correlation	p. 6-193
6.2 Identifying Signs of Correlation in the Data	p. 6-194
6.3 Correlation Analysis	p. 6-194
6.4 Using SPSS to Measure Correlation: Pearson's Product Moment Correlation Coefficient	p. 6-196
6.5 Using SPSS to Measure Correlation: Spearman's Rank Correlation Co-efficient	p. 6-206


Introduction

The acquisition, manipulation, interpretation and presentation of data are important skills for graduates. The aim of this module is to introduce students to the use of computer-based statistical techniques and database management systems for the analysis of quantitative data sources. The module provides an appropriate link to Business Research, where more qualitative research methodologies will be discussed. The module is designed to address the lack of confidence and anxiety felt by students when dealing with statistical techniques, often for the first time. The module will start at an introductory level, looking at data types and the use of descriptive statistics, before advancing towards the use of more advanced statistical techniques and database management systems.

Learning Outcomes

Knowledge and Understanding: on successful completion of this module students will be able to:

 Relate underlying statistical theory, such as the normal distribution, to statistical analysis;

 Distinguish between the characteristics of different data types and relate these to different approaches to statistical analysis;

 Analyse and present data employing descriptive statistics in SPSS;

 Select and apply appropriate advanced statistical techniques in SPSS and analyse the output accordingly;

 Construct, manipulate and interrogate a database management system;

 Demonstrate effective numeracy skills;

 Work independently and to time constraints.


Module Content

7/9 & 9/9: Week 1: Introduction to Statistics
14/9 & 16/9: Week 2: Introduction to Statistics 2
21/9 & 23/9: Week 3: Descriptive Statistics
28/9 & 30/9: Week 4: Dispersion
5/10 & 7/10: Week 5: Student T-Test and Paired Samples T-Test
12/10 & 14/10: Week 6: Mann Whitney and Wilcoxon
19/10 & 21/10: Week 7: Chi-Squared
26/10 & 28/10: Week 8: READING WEEK
2/11 & 4/11: Week 9: Correlation
9/11 & 11/11: Week 10: Exam 1
16/11 & 18/11: Week 11: Database Management [1]
23/11 & 25/11: Week 12: Database Management [2]
30/11 & 2/12: Week 13: Database Management [3]

Sessions for BML224 will be on a Tuesday and Thursday in ICT Room G6. You will be allocated to a specific group during the day. Specific learning outcomes for each session are provided in your statistics manual and are also detailed on the BML224 home page on Moodle. Please note that there is a provisional extra session on Tuesday 30th November and Thursday 2nd December.

Module Resources

Sessions will involve the use of a series of data files that can be downloaded from the BML224 home page on Moodle. Please download these files into your own file space ready for use in the session. Additional resources are also available and are detailed throughout this manual. Activities and resources are signposted by a number of icons including:


This icon refers to class-based and self-directed activities. Details relating to activities are provided in your manual.

This icon refers to your Word log book. The activities in this manual are mirrored in the log book and you will be expected to keep your log book up-to-date and record the activities that you have completed. You will also be expected to submit sections of the log book as you progress through the module, so that the module tutor can monitor your completion and understanding of different activities. Failure to complete the log book and meet specific submission dates will result in you having to attend additional workshop sessions.


This icon relates to online simulations that have been developed to provide additional guidance on the use of SPSS. Each of the statistical tests covered in the manual has an accompanying simulation that you can access online. The weblink to these online simulations is available on the BML224 home page on Moodle.

Self-Directed Activities

As part of the course, you will be asked to complete short tasks as part of the lecture session. Specific tasks will be allocated on a weekly basis. It is essential that these tasks are completed, in order to demonstrate your competency in the statistical methods that are being employed during the module. Please ensure that you read through the handouts provided thoroughly. The module will be using SPSS (PASW Statistics). You can get a copy of SPSS to install on your own PC from the library (free of charge!). This software will be licensed to you as long as you are a student at the University of Chichester.

Assessment

The assessment for this module will consist of two exam-based practicals. Additional detail relating to assessment tasks will be provided during the course of the module. The assessment criteria for the module are:

 Demonstration of underlying statistical theory in relation to the use of descriptive and advanced statistical techniques;

 Accuracy of answers based on the use of the appropriate statistical software and techniques and the interpretation of SPSS output;

 Ability to interrogate data accurately within a database management system.

Formative assessment points will provide opportunities to test your familiarity with the statistical concepts and techniques discussed during the course of the module. The resit for this module will consist of two practical examinations. Please note that, in line with University regulations, an overall failure in this module would involve resitting all elements of the assessment. A pass on the module is based on your overall grade profile: if you were to unexpectedly fail a specific element of the assessment but your overall grade profile was above 40% (including the fail), you would still pass the module. Remember that a non-submission in any part of the assessment would result in the failure of the whole module.


Student Support

I can be found on the top floor of the Mordington building (Room 2.19) on the Bognor Regis campus. If you have any problems please do not hesitate to come and see me. While I am usually around, consultancy work does take me off campus from time to time. Therefore, while you are welcome to pop in informally, please email me to make an appointment (a.clegg@chi.ac.uk/tel: 812017) to guarantee that I am in to see you. You will be introduced to a number of new concepts and techniques in this module. Statistics is not 'everybody's cup of tea' and I am very conscious of what is called 'statistics anxiety'. If at any time you are unclear about what we are doing, please do not hesitate to come and see me. The support materials for this module have been designed to make everything as self-explanatory as possible. Please make time to read through the materials provided, and use the online simulations to enhance your familiarity with the different statistical techniques we will be using. The materials have been expanded and developed as a result of feedback from student evaluations. It is imperative that you read through all the materials provided and take responsibility for your own learning. Failure to do so could result in you failing this module.

Evaluation

At the end of the module, you will have the opportunity to complete a module evaluation form to comment on the overall structure, content and quality of the programme. The module evaluation for 2009-2010 can be found on the BML224 homepage. If you have any immediate concerns about the quality of the module then please do not hesitate to come and talk to me directly.

Code of Conduct

Students often find statistical analysis rather difficult. Therefore considerable time and effort has gone into the design of learning and teaching materials. The sessions will be tutor-led to start, therefore students are asked to pay close attention to the instructions that are given. Please note:

[1] I will not expect to see students using software other than that being used in the session - no emailing, checking Facebook or equivalent. Students persistently infringing this request will be asked to leave the session.

[2] Please be punctual, as there is quite a lot of ground to cover in each of the weekly sessions.

[3] Mobile phones should be switched off before the session.

[4] This is a difficult module and you will need to concentrate. I will need to try and help everybody through the session. This is not helped by constant chatting; any students persistently talking or causing a distraction will be asked to leave the session.



Section 1

Sampling & Types of Data

Learning Outcomes

At the end of this session, you should be able to:

 Understand the rationale for the use of statistical techniques;

 Discuss approaches to developing sampling frameworks and methodologies;

 Define key terms in the use of statistical techniques;

 Understand the difference between different data types;

 Present numerical data effectively in graphical and tabular form.



Introduction to Statistical Terms and Sampling Frameworks

1.0 Introduction - Why use Statistics?

There is far more to research than measurement and analysis of quantifiable facts. A prime tool in the study of how people exist in their environment is the very fact of the investigator's common humanity: you know a lot about what people do because you are also a human being. And human actions and responses are affected by memory, prejudice and emotions which cannot be adequately quantified.

Even so, there are innumerable instances of relevant, quantified facts in geographical investigations: most questionnaire results contain some quantitative element, even if it is only how many people said 'yes' and how many people said 'no'; international comparisons can often make use of data from the World Health Organization, World Bank, Unicef or the United Nations Development Programme, amongst others; within Britain data from the census or health authorities exists for a wide range of areal units; in physical geography the geology, soils, vegetation, elevation, aspect and so on can all be quantified. You should not ignore these data.

You may feel that it is only necessary to present such information, perhaps using a table or a graph, and sometimes that may be enough. On the other hand, statistics will enable you to go much further in the understanding of the patterns and relationships displayed. Furthermore, they will help ascertain the quality of the information that you are using. This last point is perhaps the most important part of using statistics: for instance, your pie chart showing that 75% of respondents preferred Bognor to Barbados as a holiday destination may look impressive, but statistics will soon reveal that any conclusions to be drawn from the answers given by only four people are limited, to say the least. If the statistics can't test your hypotheses, the fault may be in your hypotheses, or more likely in your data collection, but it certainly isn't in the statistics.

Before considering the statistical manipulation of data, it is necessary to consider how the data is collected for use.

1.1 Sampling

You often have to make do with what information there is (if you are interested in the cultivation of mangelwurzels and the nineteenth century agricultural survey did not record them, there is not much you can do about it), but ideally in research you can go and collect the information yourself. In such a case you can ensure that the information you collect is as useful as possible. Sometimes you will be able to collect all the relevant information - the census population of each ward in the county, for instance - but in many cases you will need to collect a sample. For example, you would not practicably be able to find the opinions of all the people in a county, but using an appropriate sampling technique you could collect information from a smaller but representative sample of that population.


1.2 Some Terms in Sampling

 A variable is a property which can vary and be measured - temperature etc.

 An observation or variate is a particular measure.

 Population is the complete set of counts or measurements derived from all objects possessing one or more common characteristics. This can be infinite, as in the case of elevations in the field.

 Sample - part of a population.

1.3 Avoiding Bias

An important question to ask yourself at the start of sampling is 'What do I want my sampling to be representative of?' An example of where this might be important is in studying the patterns of farming in a region. For simplicity and clarity, let us assume that each farm only cultivates one crop. Selecting points on a map will tend to choose the bigger farms because they occupy a larger area. On the other hand, selecting farms from a list will tend to choose the smaller farms, because there are likely to be more of them within the same area. Therefore the first method will give a representative sample of the land use, the second of the farming. What can cause problems is using the first to find out about farms, or the second about land-use.

1.4 Deciding on the Choice of Sampling Techniques

Before you start sampling, you need to consider whether a convenient sampling frame exists. An example of a sampling frame may be a list of names on an electoral register or a membership directory of a particular organisation. Even when sampling frames do exist, they are often incomplete or out of date. The integrity of the data set will therefore influence your choice of sampling technique. However, it is often possible to construct your own sampling framework, although this could be costly and time-consuming. For example, if investigating the distribution of farm shops in West Sussex, you could use the farm shops listed in the Yellow Pages as a provisional framework and then supplement this with fieldwork to check for any farm shops not listed there.

For an area it may be necessary to create a grid with x and y axes, so that the whole area under investigation can be referred to using co-ordinates, like grid references. In this instance, you need to achieve a balance between having too few cells to give precise or even usable results (remember that a co-ordinate reference refers to an area rather than a point) and having so many that the sampling process becomes too time-consuming. Such decisions must be made with specific reference to the particular investigation and the time and resources at your disposal. Indeed, when designing a sampling strategy for a research project it is important to ask yourself whether you can afford the time and money to carry out the sample collection.

When deciding on the sampling technique, you also need to decide on the size of the sample. As a general guideline, the larger the sample, the more confident we can be that the statistics derived from it will be similar to the population parameters. However, a large sample with a poorly designed sampling frame may contain less information than a smaller but more carefully designed sample.


1.4.1 Random Sampling

The word random in this context does not mean haphazard. It refers to a definite method of selection aimed at eliminating bias as far as possible. A random sampling method should satisfy two important criteria: a) every individual must have an equal chance of inclusion in the sample throughout the sampling procedure; and b) the selection of a particular individual should not affect the chance of selection of any other individual. To put these criteria in more formal probability terms: the probabilities of inclusion in the sample must be equal and independent of each other. So, if the aim is to pick a random sample of 50 households from a population of 200, every household should have the same 50/200 or 0.25 probability of selection.

The simplest example of pure random sampling is a raffle or lottery. Thus to take a random sample of the population of the UK, the name of each resident would have to be written on a piece of paper, all the pieces of paper put in a giant drum and a random selection made: obviously not a practical method. More usually it is numbers, not names, which are used and, instead of picking these numbers out of a hat, a computer can be programmed to generate random number sequences. Alternatively, tables of random numbers can be used. Computers use the last digits of their internal clock to 'seed' their random numbers (otherwise they would just keep repeating the same sequence), and similarly when using random number tables it is worthwhile picking a starting point somewhere in the table 'at random' and then sometimes reading up, or left, rather than from left to right.
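To see the household example in practice outside SPSS, here is a minimal Python sketch (the household numbering is assumed purely for illustration):

```python
import random

# Sampling frame: 200 households, numbered 1 to 200.
households = list(range(1, 201))

# Seed the generator for a reproducible sequence (computers 'seed'
# their random numbers, e.g. from the internal clock).
random.seed(42)

# Simple random sample of 50 without replacement: every household has
# the same 50/200 = 0.25 probability of inclusion.
sample = random.sample(households, 50)
print(sorted(sample))
```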

1.4.2 Systematic Sampling

Systematic sampling is, as its name suggests, sampling according to a regular system. This involves choosing the first item at random and then selecting every nth item, where n will be determined by the size of the sample required. For example, if a sample of 50 items is required from a population of 500 items, every 10th one would be selected. Provided that there are no characteristics in the population which recur every 10th item, the sample will be unbiased; indeed this may be thought of simply as a short-cut method of producing a random sample (the population does not need to be numbered).
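The same 500-item example can be sketched in Python, again assuming a numbered frame purely for illustration:

```python
import random

# Sampling frame of 500 items; for a sample of 50, k = 500 / 50 = 10.
frame = list(range(1, 501))
k = len(frame) // 50

# Choose the first item at random from the first k items,
# then take every kth item after it.
random.seed(1)
start = random.randrange(k)
sample = frame[start::k]
print(len(sample), sample[:5])  # 50 items from a random start point
```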

1.4.3 Stratified Sampling

It is possible, in some instances, to improve on simple random sampling by stratification of the population. This is particularly true where the population is heterogeneous (i.e. made up of dissimilar groups) and the population can be stratified into homogeneous (i.e. similar) classes. These classes should define mutually exclusive categories. For example, suppose a bakery makes three different types of loaf: large, small and cottage. If a simple random sample was taken of the daily output, it would be possible, although unlikely, for it to include only one type of loaf. Stratification of the population before sampling can prevent this and, if carried out as described below, can produce a sample which is truly representative of the population. Assume that the bakery's output is 50% large, 40% small and 10% cottage loaves. The different loaves divide the population into three strata. Now if a sample of 50 loaves is required it should contain 25 large, 20 small and 5 cottage, thus ensuring that the proportions of each type of loaf in the population are reflected in the sample. Within these constraints, however, selection should be made on a random basis.
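A minimal Python sketch of the bakery example, assuming a daily output of 500 loaves so that the proportional allocation works out exactly:

```python
import random

random.seed(7)

# Daily output of 500 loaves: 50% large, 40% small, 10% cottage.
strata = {
    "large": list(range(250)),
    "small": list(range(200)),
    "cottage": list(range(50)),
}
total = sum(len(loaves) for loaves in strata.values())
sample_size = 50

# Allocate the sample proportionally (25 large, 20 small, 5 cottage),
# then select at random within each stratum.
sample = {}
for loaf_type, loaves in strata.items():
    n = round(sample_size * len(loaves) / total)
    sample[loaf_type] = random.sample(loaves, n)

print({k: len(v) for k, v in sample.items()})
# {'large': 25, 'small': 20, 'cottage': 5}
```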


1.4.4 Multi-Stage Sampling

Surveys covering the whole UK are frequently required but, as you can imagine, simple random sampling or even stratified sampling will not give an easy solution. Where the population is very spread out, particularly geographically, simple random sampling will result in a dispersed sample, leading to a considerable amount of travelling and time. Consequently, some method is needed to narrow the field down to a smaller area, with the resultant cost savings. Multi-stage sampling attempts to do this without adversely affecting the 'randomness' of the result. The first step is to divide the population into manageable, convenient groups or areas, such as counties or local authority regions. Indeed, stratification of areas such as counties or local authorities by principal geographical regions is often introduced in order to minimise geographical bias (Clark et al, 1998, p. 84). A number of areas are then selected at random. If the number of areas selected is still too large or dispersed, then these areas can be broken down further to reduce the sample size to more manageable proportions. For example, having chosen a random sample of local authorities, each one itself may be divided into political wards or streets or households. Finally, a simple random or systematic sample will be chosen.

1.4.5 Cluster Sampling

Cluster sampling can often be confused with multi-stage sampling as the first step appears identical. The important difference is that cluster sampling is used when the population has not been listed and it is the only way to obtain a sample. As an example, suppose that a survey is to be done on the proportion of elm trees attacked by Dutch elm disease in the UK. Obviously there is no list of the complete population of elm trees. Neither would it be possible to try and cover the whole population. To use cluster sampling in this case, the population could be divided into small 'clusters' by drawing a grid over the map of the country and choosing, at random, a few of these clusters for observation, each cluster being a small area. Within each area, the investigators will then be asked to find as many elm trees as possible and note how many of them are diseased.
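A minimal sketch of the first step, assuming (for illustration only) that the map is divided into a 10 x 10 grid of cells:

```python
import random

random.seed(3)

# Divide the map into a 10 x 10 grid of cells ('clusters'), as in the
# elm tree example, and choose a few cells at random for observation.
grid = [(row, col) for row in range(10) for col in range(10)]
chosen_clusters = random.sample(grid, 5)

# Investigators then find as many elm trees as possible within each
# chosen cell and note how many are diseased.
print(chosen_clusters)
```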

1.4.6 Non-Random Sampling

The previous paragraphs have been concerned with methods of random sampling, basically simple random sampling with several variations and refinements. The methods discussed in the previous section share a number of key elements. These include: a) the chances of obtaining an unrepresentative sample are small; b) this chance decreases as the size of the sample increases; c) this chance can be calculated; and d) the sampling error can be measured and therefore the results can be interpreted.

Unfortunately, occasions often arise when the selection of a random sample is not feasible. This may be because:

 It would be too costly;

 It would take too long; or

 All the items in the population are not known.


For these reasons, the following methods of non-random sampling are used, particularly in the field of market research.

1.4.6.1 Judgement Sampling

In this case an expert, or a team of experts, use their personal judgement to select what, in their opinion, is a truly representative sample. It certainly cannot be called a random sample as it involves human judgement which could involve bias. On the other hand, the sampling process does not require any numbering of the population or random number tables. It can be done more quickly and economically than random sampling and, if carried out sensibly, can produce very good results. For example, in an interview situation, a researcher may pick individuals because of the nature of the response they are likely to give, and the responses the researcher is looking for.

1.4.6.2 Quota Sampling

This is the method most often used in market research, where the data is collected by enumerators armed with questionnaires. To avoid the expense of having to 'track down' specific people chosen by random sampling methods, the enumerators are given a quota of say 400 people, and are told to interview all the people they can until their quota has been met. Such a quota is nearly always divided up into different types of people, with sub-quotas for each type. For example, out of a quota of 400, the enumerator may be told to interview 250 working wives, 100 non-working wives and 50 unmarried women, and within each of these three classes to have 50% who smoke and 50% non-smokers. Using this technique, the researcher has the choice of selecting certain people who might be included in the sample, and can therefore introduce an element of bias into the sample. The main advantage of this method is that, if a respondent refuses to answer the questions for any reason, the interviewer will just look for another person in the same category. With true random sampling, once a sample item has been decided upon, it must be used. Any substitution results in a non-random sample.

1.4.6.3 Convenience Sampling

As the name implies, the most important factor here is the ease of selecting the sample. No effort is made to introduce any element of randomness. An example of this is the quality controller who takes the first 20 items off the production line as his sample, a dangerous procedure as any fault occurring after this could remain unnoticed until the next sample is taken (maybe an hour later). For most purposes, this sampling method is simply not good enough, but for some pilot surveys the savings in cost, time and effort outweigh the disadvantages. The aim of a pilot survey could be to establish the most satisfactory form of questionnaire to be used in the actual survey. Since the actual results would not be used, it does not matter that the sample was not selected at random.


1.5 Summary

Sampling serves two purposes. One is the saving of time and effort in the collection of information. The second is the collection of information so that inferences and comparisons can be drawn using statistics. Although a simple subject, it is fundamental to much research, and needs to be done with care. Table 1 provides a summary of the key sampling methods that have been discussed.

Table 1: Sampling Methods

Judgemental (non-random)
Description: Sampling elements are selected based on the interviewer's experience that are likely to produce the required results.
Example: Several houses for sale in Belfast, perhaps with families known to the interviewer, are chosen subjectively.

Quota (non-random)
Description: Sampling elements are selected subject to a predefined quota control.
Example: The quota is the first 30 homeowners selling their houses in Belfast who are also making an intra-urban move, and are aged between 20-40 years.

Systematic (first unit selected at random)
Description: Sampling elements in the sampling frame are numbered. The first sampling unit is selected using random number tables. All other units are selected systematically, k units away from the previous unit.
Example: Sampling frame of 600 homeowners selling their houses in Belfast. These houses for sale are ordered and numbered. A random number is selected for a start point, from which every tenth property is selected for inclusion in the sample.

Simple random
Description: Sample size of n elements selected from a sampling frame without replacement, such that every possible member of the population has an equal chance of being selected.
Example: All 600 houses for sale in the sampling frame are numbered 1-600. A sample of 30 units is selected using a random number table, excluding those numbers outside the range 1-600.

Stratified random
Description: Sampling frame divided into subgroups (strata) which are then each sampled using the simple random method.
Example: All 600 houses for sale come from lists provided by six estate agents. These are each randomly sampled for houses to include in the sample.

Multi-stage random
Description: Sampling frame divided into hierarchical levels (stages). Each level is sampled using a simple random method which selects the elements to be included at the next level.
Example: All 600 houses for sale are distributed to enumeration districts within several wards. A random sample of these wards is selected, and from these, random samples of both enumeration districts and finally houses for sale are selected.

Clustered random
Description: Sampling frame divided into hierarchical levels (stages). Levels are selected using random sampling similar to the multi-stage random method. However, all elements are selected at the final stage.
Example: Similar to the above method, except that all the houses for sale in a given enumeration district are selected.

[Source: KITCHIN, R. AND TATE, N. (2000): Conducting Research into Human Geography, Prentice Hall, London, p. 55.]


1.6 Types of Data

Normally when we think of data quality, we think about reliability or accuracy. In statistics, data have quality in terms of what they represent and how they can be manipulated. The four levels of measurement are: nominal/categorical, ordinal, interval and ratio. Each measurement is outlined below:

 Categorical or nominal variables are the lowest level and are variables where numerical values have been assigned to separate categories, often viewed as unique from one another. For example, gender (male/female), hair colour (blonde, brown, ginger, grey), or direction (north, east, south, west).

 An ordinal variable can be ranked in order from highest to lowest, for example a league table. Alternatively, a questionnaire survey may ask respondents to rank satisfaction levels on a scale from 'Strongly Agree' to 'Strongly Disagree'. Ordinal variables do not allow comparable measurements; for example, 'Strongly Agree' is not worth double 'Slightly Agree'.

 Interval and ratio variables are concerned with quantitative data. Interval variables are in the form of a scale which possesses a fixed but arbitrary interval and an arbitrary origin. Addition of, or multiplication by, a constant will not alter the interval nature of the observations (e.g. 10°C, 20°C, 30°C, 40°C). A ratio measurement relates to a scale with a fixed interval, similar to interval data, but with a true zero origin. In these cases, where we are using numbers as we normally think of them, one value can be twice the size of another. For example, income is a ratio variable as a person can have no income. Ratio measurement commonly applies to metric quantities such as distance and mass, which possess a zero origin. [When importing data into SPSS, and using the Variable window, interval and ratio data are classed as Scale - see the Descriptive section in this handbook.]

It is important to remember that data can only be converted from higher to lower quality, and data can only be treated 'at their own level'. For instance, the numbers '1, 2, 3, 4' could be heights in metres (ratio), temperatures in degrees C (interval), the order of countries achieving Rostow's 'take off' (ordinal) or the answer to 'what is your favourite number?' (nominal): they must not be treated at a higher level than their meaning. As Mulberg (2002) points out, 'the thing to ask is if it makes sense to talk about one case being double another, or if there is a highest and a lowest' (see Figure 1). It is also important to understand the different types of data or variables, as this will influence the kind of statistical analysis that is possible. The levels of measurement are summarised in Table 2. In order to use parametric and non-parametric tests successfully later in the module, it is imperative that you understand the characteristics and differences between types of data. Please read through these notes carefully, and learn the different data types.


Figure 1: Judging Levels of Measurement

Start: Does it make sense to talk about one number being double another?
   Yes -> Ratio Level
   No -> Does it make sense to talk about one number being higher or lower than another?
      Yes -> Ordinal Level
      No -> Nominal Level

[Source: Mulberg, 2002, p. 8]
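Mulberg's two questions translate directly into a tiny decision function. A minimal Python sketch (the function name and examples are mine, for illustration; note that the figure does not separate the interval level from the ratio level):

```python
def level_of_measurement(double_meaningful, order_meaningful):
    """Apply the two questions from Figure 1 (Mulberg, 2002)."""
    if double_meaningful:
        return "Ratio"
    if order_meaningful:
        return "Ordinal"
    return "Nominal"

# Height in metres: 4 m is genuinely twice 2 m.
print(level_of_measurement(True, True))    # Ratio
# 'Strongly Agree' outranks 'Agree', but is not double it.
print(level_of_measurement(False, True))   # Ordinal
# A favourite number is just a label.
print(level_of_measurement(False, False))  # Nominal
```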

Additional terms that you will encounter include:

 A discrete variable is a variable whose numerical values vary in steps or where the values are integer numbers. Normally such variables are associated with counts; for example, you may count the number of firms, products or employees when conducting a survey. Discrete variables do not allow for decimal places.

 A continuous variable is a variable which assumes a value that can be denoted on a continuous scale. Examples include weights, heights and age. In reality, continuous variables relate to specific values that lie at a point on a continuum. For example, a person's age could be recorded in discrete form as being so many years, but in reality their age can be placed at a point on a continuum which reflects not only the number of years but also the number of days, minutes and seconds which have passed since the moment of their birth (Clark et al, 1998). Continuous variables allow for decimal places. Continuous variables can sometimes be described as demonstrating certain statistical properties that allow them to be used in parametric statistical tests. However, sometimes continuous variables do not show these particular properties, and when this happens, the variables are thought suitable to be used in non-parametric tests (Mulberg, 2002).

Variables can also be classed as 'dependent' or 'independent'. A dependent variable refers to a variable which is identified as having a relationship with, or dependence on, the value of one or more independent variables. For example, levels of car ownership may be directly dependent on a number of independent variables including average household income, age and the number of persons in the household.


Table 2: Data Quality

Nominal or Categorical
Description: Data assigned to discrete categories, in no natural order.
Examples: Clay, sandstone, granite; lifestyle groups such as singles or retired.

Ordinal
Description: The categories associated with a variable can be rank-ordered. Objects can be ordered in terms of a criterion from highest to lowest.
Examples: Cities in order of population size; opinions regarding service or product quality.

Interval
Description: With 'true' interval variables, categories associated with a variable can be rank-ordered, as with an ordinal variable, but the distances between categories are equal. Categories have no absolute zero point. Variables which strictly speaking are ordinal but which have a large number of categories, such as multiple-item questionnaire measures, are assumed to have similar properties to 'true' interval variables.
Examples: Temperature in degrees Celsius or Fahrenheit; goal difference.

Ratio
Description: Data with meaningful intervals and a true zero.
Examples: Age, distance.

When attempting to remember types of data use the abbreviation NOIR (nominal, ordinal, interval, ratio).

When using variables in statistical analysis, a further distinction is also drawn between descriptive and inferential statistics. Descriptive statistics refer to the sample that is created by the research/study process and, literally, to the methods and techniques used to describe and summarise data. Measures of central tendency (mode, median, mean) are the most basic descriptive statistics, to which we can also add basic measures of dispersion including the maximum, minimum and range of values.

Inferential statistics refer to those techniques which are adopted to draw conclusions about the population to which the sample belongs and which enable inferences about the characteristics that might be expected in other samples as yet to be selected from that same population. Inferential statistics give greater analytical power and bring into play probability theory and other statistical tests and measures that will be discussed later in this handbook.


However, as Lindsay (1997) points out, the use of inferential statistics carries greater responsibility, and as such any user must be aware of the following guidelines:

 Sampling must be independent. This means that the data generation method should give every observation in the population an equal chance of selection, and the choice of any one case should not affect the selection or value of any other case;

 The statistical test chosen should be fit for its purpose and appropriate for the type of data selected;

 The user must interpret the results of the exercise properly; the numerical outcome of a statistical test is only the result of an exercise, not its interpretation.


1.7 Presenting Data

Presenting numerical data accurately is an important element of essays, reports, presentations and posters. The aim of the following section is to provide a few basic guidelines on how to incorporate graphs and tables effectively, and at the same time creatively, into your work.

1.7.1 Using Graphs and Charts

Computer spreadsheets such as Excel now allow you to produce a range of graphs and charts (bar charts, column charts, pie charts, graphs) quickly and easily. As such, graphs can be used effectively to enhance the quality of reports, essays, posters and presentations. Carefully thought-out graphs can bring to life data from tables and allow comparisons to be made quickly. However, poorly designed graphs can easily fail and weaken a piece of work. It is very common for students to rush in and produce a whole plethora of charts and graphs without giving much thought to the data set they are using or what type of output would be most appropriate. It is therefore important to take your time and give careful consideration to what you actually want to achieve. First, ask yourself the following questions:

Is a graph or chart necessary? Students often use diagrams as a means of 'padding out' work, and as a result graphs not referred to in the text become 'window-dressing'. Therefore carefully consider whether the graph is actually needed - ask yourself whether the graph helps the reader understand a particular point or aspect of the data. If it does, fine - but make sure that it is integrated and referred to fully in your discussion. If not, provide a simple verbal description.

What is the purpose/objective/outcome? Are you producing a graph for an essay/report, poster or presentation? While the basic guidelines and formatting options are generic, you need to consider the overall purpose and intended audience. For example, graphs produced for a PowerPoint presentation will be different to those produced for inclusion in an essay or report. Carefully consider the importance of visual impact and clarity, and the type of media you are using.

What is the nature of the data set you are using? Graphs often fail because an incorrect chart type has been used or the graph is too complicated. Therefore, before you start, carefully consider the actual nature of the data set you are using. Above all you need to distinguish between 'continuous' data and 'discrete' quantities. A continuous quantity is one which can be chosen to any degree


of precision. Examples of continuous quantities include mass (kg), length (m), and time (s). Discrete quantities, in contrast, can only be expressed as integers (whole numbers), for example: 3 computers, 5 cars, 4 houses. In trying to decide if something is continuous or discrete, decide whether it is like a stream (continuous) or like people (discrete). Continuous variables are usually plotted on a graph, as this demonstrates the existence of a causal relationship between the data points, whereas discrete data series are plotted as bar charts or histograms. In addition to the nature of the data set, also consider whether you are referring to absolute values or percentage distributions. This will have a significant influence on the chart type that you use.

Second, how complicated is the data set? Is it best represented as a graph or a table? Can the data be manipulated to make it easier to use, for example by reformatting columns or excluding columns? Be prepared to modify the data set if necessary. However, make sure that when you do this you do not alter the accuracy or the representativeness of the data set you are using. The following graphs highlight the issue of using appropriate chart types.

Figure 2: Car Sales for Rover, BMW, and Jaguar 1995-2000

[Source: Believe, M., 2001]

In Figure 2, car sales for leading manufacturers have been plotted for a 5-year time period. In this instance we are dealing with discrete data (as you cannot sell half a car!). However, the data has been plotted as a line graph - is this correct? The answer is YES, as there is a logical year-to-year link and the 'joining the dots' technique illustrates the causal relationship between the x-axis variables. This data could also have been presented as a column chart. Compare this to Figure 3.


Figure 3: Resident Opinions on the Development of New Housing on Greenfield Sites in West Sussex

[Source: Believe, M., 2001]

Figure 3 highlights the attitudes of residents to new housing development in West Sussex. Is this graph the most effective form of presentation? The answer is NO. In this instance joining the dots is not appropriate, as there is no causal relationship between the x-axis variables. A column chart would have been more effective - see Figure 4.

Figure 4: Resident Opinions on the Development of New Housing on Greenfield Sites in West Sussex

[Source: Believe, M., 2001]

While Figure 4 is a definite improvement, is there any way of making the data in Figure 4 more effective, so that it really highlights the differences in resident opinions between the different areas? Again the answer is YES. So far we have graphed the absolute values relating to resident opinions. If we were to change this to a percentage distribution we could present the data as a bar chart - see Figure 5.

Figure 5: Resident Opinions on the Development of New Housing on Greenfield Sites in West Sussex

[Source: Believe, M., 2001]

As you can see in Figure 5, utilising the percentage distribution really succeeds in highlighting the differences in residents' opinions.

Let us consider a further example. Figure 6 illustrates the mean monthly temperature and rainfall totals for Edinburgh. Is the graph appropriate? Again the answer is YES, as there is a logical month-to-month link and the 'joining the dots' technique illustrates the causal relationship between the x-axis variables. However, although this graph allows us to compare monthly temperature and rainfall totals, the high values for temperature have masked the values for rainfall and a degree of accuracy has been lost. To overcome this we can change the type of the graph and plot temperature and rainfall on separate axes - see Figure 7.
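If you later want to reproduce this kind of dual-axis chart outside Excel, a minimal matplotlib sketch is shown below. The values are illustrative placeholders only, not the Bartholomew (1987) data plotted in Figures 6 and 7:

```python
import matplotlib.pyplot as plt

months = ["J", "F", "M", "A", "M", "J", "J", "A", "S", "O", "N", "D"]
# Illustrative placeholder values only.
temperature = [3, 4, 6, 8, 11, 14, 15, 15, 13, 10, 6, 4]     # degrees C
rainfall = [57, 42, 51, 41, 51, 51, 57, 65, 67, 65, 63, 58]  # mm

fig, ax_temp = plt.subplots()
ax_temp.plot(months, temperature, color="red", label="Temperature (°C)")
ax_temp.set_ylabel("Temperature (°C)")

# A second y-axis stops the series with larger values masking the other.
ax_rain = ax_temp.twinx()
ax_rain.bar(months, rainfall, alpha=0.3, label="Rainfall (mm)")
ax_rain.set_ylabel("Rainfall (mm)")

plt.title("Mean Monthly Temperature and Rainfall")
plt.show()
```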


Figure 6: Mean Monthly Temperature (°C) and Rainfall (mm) for Edinburgh

[Source: Bartholomew, 1987]

Figure 7: Mean Monthly Temperature (°C) and Rainfall (mm) for Edinburgh

[Source: Bartholomew, 1987]

So far our discussion has concentrated on the use of line graphs, column and bar charts. Another type of chart frequently used is the pie chart. The overall total number of cases represented by the pie chart should equal the sample size, or aggregate to 100% where segments denote proportional frequencies (Riley et al, 1998, p. 172). Let us consider some specific examples.


Figure 8: The Distribution of Serviced Establishments in Torbay by Size

[Source: Clegg, 1997]

Figure 8 refers to the percentage distribution of serviced establishments in Torbay by size. When using pie charts it is important to remember that they can only graph the percentage distribution of one specific variable and cannot be used to analyse time series data. For example, we could not use a pie chart to illustrate the car sales for Rover, BMW and Jaguar referred to in Figure 2. However, we could use a pie chart to analyse the market share of car sales for a specific year (see Figure 9).

Figure 9: Market Share of Car Sales for Rover, BMW and Jaguar in 1995

[Source: Believe, M., 2001]

By drawing and then combining two or more pie charts we could then compare market share for different years (see Figure 10).

Figure 10: Market Share of Car Sales for Rover, BMW and Jaguar in 1995 and 1999

[1995: Rover 27%, BMW 32%, Jaguar 41%. 1999: Rover 41%, BMW 27%, Jaguar 32%.]

[Source: Believe, M., 2001]

Programmes such as Excel will only allow you to draw one pie chart at a time; however, once drawn, you can arrange a number of pie charts on a worksheet and print them out. Alternatively, you can cut and paste Excel charts into Word or Publisher. Clearly, using the most appropriate type of graph is very important to ensure that the data is presented accurately. In addition to the type of chart, it is also important to ensure that the graph is presented effectively.


1.7.2 Producing Graphs

When producing graphs a number of basic rules and guidelines need to be considered. These are:

Is the graph completely self-explanatory? Is the graph clearly titled, labelled and sourced?

 The axes should be labelled, and clear indication given as to the scales being used and the numerical quantities being referred to;

 All dates and time periods should be explicitly stated in the title, and on the appropriate axis;

 In titles do not write 'A Graph Showing...'. This is obvious - instead refer to the specific content of the graph (see examples given in this section);

 The source of the data should be included, especially if the data are drawn from published material.

Are elements of the graph distinguishable?


 When using charts it is important that the different data series are clearly distinguishable, otherwise the graph will be meaningless;

 Consider carefully the number of data series you intend to graph. Too much data will over-complicate a graph and reduce its impact;

 When using pie charts it is recommended that the number of segments should not be too large. Too many segments make charts confusing and difficult to read;

 If charts are to be included in a black and white report, avoid shadings that involve colours, as the distinctions will be lost. Try and keep the use of colours to a minimum; use one colour and different shades;

 Ensure that each segment of the pie chart is clearly labelled and that the percentage values have been added to indicate quickly which are the principal groups and by how much;

 Avoid repetition; if labels and percentage values have been added to a pie chart there is no need to include the legend.


1.7.3 2D or 3D Graph Formats

Excel and similar packages allow you to enhance the quality of graphs by making them 3D. However, the use of 3D formatting needs to be treated with caution. If you are producing graphs on A4 for a presentation, 3D charts can work effectively. However, if you are preparing graphs for inclusion in an essay or report, 3D charts may not be appropriate and you may be better off with a standard 2D version. There are no hard and fast rules on this issue and, ultimately, the type of chart produced and the type of formatting applied will depend on the nature of the data set used. Let me illustrate this by referring to examples included in this section. Below is Figure 4, showing resident attitudes to housing development in West Sussex. At the moment this is a standard 2D column chart. Let us convert it into a 3D chart.

[2D and 3D versions of Figure 4]



Do you think this chart is effective? It looks good, but is not quite as easy to read as the standard chart. It is noticeable that, in order to create a 3D chart, Excel has to shrink the original chart. This is where problems lie, as in making the graph smaller the overall impact of the graph is diminished. Let us try another example. Below is Figure 8, which refers to the distribution of serviced accommodation in Torbay. As before, let us convert this into a 3D chart.

[2D and 3D versions of Figure 8]

In this instance the 3D chart is actually quite effective and has enhanced the standard 2D chart considerably. The basic rule seems to be that simple 2D charts can be converted into 3D charts quite effectively. However, the more detailed and complicated the standard chart, the less effective it becomes when you make it 3D. Your best option is to experiment with different data sets and formatting options to find the most effective form of presentation.



1.7.4 Using Tables

In addition to charts, tables are also an effective way of presenting information. Again, when producing tables a number of guidelines can be followed:

 Consider the purpose of presenting the data as a table, as there may be better ways of presenting it;

 Avoid the temptation of just photocopying tables out of text books and sticking them into essays. In many cases, tables contain information superfluous to the reader. Be prepared to modify data sets so that only relevant information is included in your table;

 Make sure that tables are completely self-explanatory. Provide a table number and title for each table. If abbreviations are used when labelling then provide a key;

 Make sure that the content of the table is fully referred to in the text - make sure that tables are not basically 'window-dressing';

 Allow sufficient space when designing the table for all figures to be clearly written;

 Make sure that the table/data is fully sourced.

Again let me illustrate with a number of examples.

Table 2: Visits Abroad by UK Residents 1994-1997

Number of Visits ('000s)

Year                  Total     North America   Western Europe   Rest of World
1994                  39,630    2,927           32,375           4,328
1995                  41,345    3,120           33,821           4,404
1996                  42,050    3,584           33,566           4,900
1997                  45,957    3,594           37,060           5,303
% Change 1996/1997    +9        0               +10              +8

[Source: ETB, 1999]

Table 2 is an example of a table I created for the Arun Tourism Strategy document. Does the table meet the guidelines highlighted above? The answer is YES. The table is clear, well laid out, titled, sourced and self-explanatory. Shading has also been used to try and enhance the visual impact of the table.


Now consider Table 3, which refers to regional tourism spending in England in 1997. Again this is a clear table that, for the purposes of the tourism strategy, had to contain a lot of detail. If you were using this table to illustrate patterns of regional spending it could be simplified to show the most obvious or important patterns. For example, in Table 3 it is evident that tourism spending is highest in the West Country and lowest in Northumbria.

Table 3: The Regional Distribution of Tourism Spending in England, 1997

                      All        Holidays   Short Holidays   Long Holidays   Business   VFR
                      Tourism               (1-3 nights)     (4+ nights)     and Work
England               £11,665    £7,725     £2,505           £5,215          £2,055     £1,415
Destination           %          %          %                %               %          %
Cumbria               3          5          5                5               1          1
Northumbria           3          3          3                3               3          5
North West England    9          8          11               6               12         10
Yorkshire             8          8          7                8               9          10
Heart of England      11         9          14               7               15         16
East of England       13         14         11               15              14         12
London                9          6          13               2               15         17
West Country          24         30         17               37              10         10
Southern              11         10         10               11              3          9
South East England    9          8          9                7               10         12

[Source: ETB, 1998]

The table could therefore be easily modified to really reinforce this message (see Table 4). Notice that in the amended Table 4, I have also changed the title so that the content of the new table becomes self-explanatory and reflects the actual purpose of the table. Table 3 could also have been modified by removing specific columns, thereby emphasising the patterns of spending in particular market areas.


Table 4: Selected Regional Differentials in the Distribution of Tourism Spending in England, 1997

                      All        Holidays   Short Holidays   Long Holidays   Business   VFR
                      Tourism               (1-3 nights)     (4+ nights)     and Work
England               £11,665    £7,725     £2,505           £5,215          £2,055     £1,415
Destination           %          %          %                %               %          %
Northumbria           3          3          3                3               3          5
East of England       13         14         11               15              14         12
West Country          24         30         17               37              10         10
South East England    9          8          9                7               10         12

[Source: ETB, 1998]



Section 2

Descriptive Statistics

Learning Outcomes

At the end of this session, you should be able to:

 Produce descriptive statistics including the mean, median and mode;

 Understand the features of measures of central tendency;

 Apply appropriate descriptive statistics to different data types;

 Import data into SPSS and use SPSS to produce descriptive statistics and cross-tabulations;

 Use SPSS to graphically describe data through the use of frequency histograms, stem and leaf plots and box plots.




2.0 Introduction

The first part of the data analysis process is the production of basic descriptive statistics, such as the mean, median, mode, standard deviation, standard error, and basic frequency and contingency tables. The analysis of the descriptive statistics can then be used to ascertain the nature of the data, especially in relation to its distribution, and what types of statistical tests can be used to analyse the data further.

2.1 Measures of Central Tendency

Averages, or measures of central tendency, give a simple summary of the characteristics of the data being described. How the data is described depends upon its quality. The three measures used are the mean, median and mode (see Table 2.1).

Table 2.1: Measures of Central Tendency

Name     Data Type           Description                Example
Mean     Ratio or interval   Total/Number of samples    'The mean July maximum in Bognor is 21°C'
Median   Ordinal             Middle in rank order       'Half of the customers travel more than 6km to Tescos'
Mode     Nominal             Most common category       'Most visitors are from London'

2.2 Arithmetic Mean

This is the figure that most people would produce if they were asked to give the average of a set of figures. The mean is the most commonly used of all averages and is calculated by adding together all the values in a series and dividing the total by the number of items in the series. The computation formula is:

n

x =  xi

/

n

i =1

The symbols may be explained as follows:

\bar{x}    pronounced ‘x-bar’, denotes the arithmetic mean of a sample;

\Sigma     pronounced ‘sigma’, means ‘the sum of’;

x_i        means all values of x, where x_1, x_2, x_3 ... x_n represent the values of each observation in a data set (thus i assumes, in turn, the values 1, 2, 3 and so on);

n          is the total number of observations in the data set.

Therefore, for the following data series:

8, 2, 4, 7, 3, 4, 1, 2, 2, 1

the arithmetic mean is calculated as:

\bar{x} = \frac{8 + 2 + 4 + \dots + 2 + 1}{10} = \frac{34}{10} = 3.4

2.2.1

Features of the Mean When using the mean, you should consider the following points:

The mean is easy to understand and calculate and is the most commonly used of all averages;

It makes use of every value in the distribution, leading to a mathematical exactness which is useful for further mathematical processing;

It can be determined if only the total value of the items and the number of items are known, without knowing individual values;

It can be distorted by extreme values in the distribution;

For a discrete distribution, the mean may be an ‘impossible’ figure e.g. 17.5 cigarettes per day when all values in the distribution are whole numbers.
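If you have Python to hand, the worked example above can be checked in a couple of lines. This is a minimal sketch of the same calculation, offered purely as a cross-check and not part of the module's SPSS workflow.

```python
# A minimal sketch of the mean calculation worked through above.
scores = [8, 2, 4, 7, 3, 4, 1, 2, 2, 1]

mean = sum(scores) / len(scores)  # 34 / 10
print(mean)                       # 3.4
```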


2.3

The Median There are, however, certain occasions when it is either not possible or not practical to use the arithmetic mean, particularly if the values of some of the extreme items are difficult to determine or if it is possible only to arrange the items in order without assigning numerical values to them. In such cases the representative or average figure may be taken as the middle item when the series is arranged in ascending or descending order. The statistical term for this middle item in a set of data is the median. The median is a positional average, or the value of the middle item of a series. For example, the median of the series 1, 2, 2, 4, 7, 7, 10 is the value 4, since it is the middle item. For a series with an even number of items (e.g. 1, 2, 3, 4), there is no middle item and yet a median may still be required. In this case the median is conventionally taken as the arithmetic mean of the two central items, in this case a value of 2.5. Therefore, to re-emphasise:

WORKED EXAMPLE

Example 1: A series with an uneven number of items
The data series in rank order is: 1, 2, 2, 4, 7, 7, 10
The median is the middle item, which in this case is 4.

Example 2: A series with an even number of items
The data series in rank order is: 1, 2, 2, 4, 7, 7, 10, 11
The median is the arithmetic mean of the two central items:

\frac{4 + 7}{2} = 5.5
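For readers who want to verify the two worked examples, a minimal Python sketch using the built-in statistics module is shown below; again this sits outside the module's SPSS workflow.

```python
# A minimal sketch of the two median examples worked through above.
import statistics

odd_series = [1, 2, 2, 4, 7, 7, 10]
even_series = [1, 2, 2, 4, 7, 7, 10, 11]

print(statistics.median(odd_series))   # 4   (the middle item)
print(statistics.median(even_series))  # 5.5 (mean of the two central items)
```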


2.3.1

The Median of a Grouped Distribution Strictly speaking, it should be impossible to find the median of a grouped distribution, as detailed information is lost when data is gathered into classes. However, as with the arithmetic mean, several assumptions are made and an answer is produced. There is also a convention to say which is the median item in a grouped frequency distribution with either an odd or an even number of items. If a frequency distribution contains a total of n items then the median item will be:

a) the \frac{n + 1}{2}th item if n is odd;

b) the \frac{n}{2}th item if n is even.

For a distribution of 401 items the median will thus be the \frac{401 + 1}{2} = 201st item. For a distribution of 400 items the median will be the \frac{400}{2} = 200th item.

To find the median within a grouped data set it is first necessary to construct a table showing the cumulative frequencies. The data on the following pages highlights the annual rainfall in Kano, a popular tourist destination in Nigeria, and it should be clear that Table 2.2 has been produced by dividing the annual rainfall totals into ranked categories (400-499mm etc) and then counting the number of years that fall into each of these categories. These are then added up to produce the cumulative frequency, which can be expressed as a percentage for easier interpretation (see Figure 2.1).


Table 2.2: Rainfall for Kano, Nigeria from 1907 to 1974

Year  Rainfall   Year  Rainfall   Year  Rainfall   Year  Rainfall
1907  930        1924  820        1941  740        1958  1070
1908  970        1925  1100       1942  1110       1959  1010
1909  650        1926  540        1943  810        1960  830
1910  890        1927  780        1944  840        1961  1020
1911  1230       1928  850        1945  620        1962  760
1912  850        1929  900        1946  790        1963  780
1913  750        1930  700        1947  480        1964  1140
1914  950        1931  770        1948  990        1965  700
1915  680        1932  890        1949  1060       1966  750
1916  1010       1933  830        1950  800        1967  900
1917  740        1934  1000       1951  700        1968  780
1918  480        1935  1180       1952  580        1969  970
1919  690        1936  1010       1953  920        1970  960
1920  820        1937  850        1954  810        1971  710
1921  990        1938  830        1955  1040       1972  660
1922  860        1939  940        1956  710        1973  410
1923  1040       1940  980        1957  1110       1974  560

Table 2.3: Cumulative Rainfall for Kano, Nigeria

Annual Rainfall (mm)   Frequency   Cumulative Frequency   Cumulative % Frequency
400-499                3           3                      4.4
500-599                3           6                      8.8
600-699                5           11                     16.2
700-799                15          26                     38.2
800-899                15          41                     60.3
900-999                12          53                     77.9
1000-1099              9           62                     91.2
1100-1199              5           67                     98.5
1200-1299              1           68                     100.0


Figure 2.1: Cumulative Frequency Curve for Kano, Nigeria

By reading across at 50% on the y axis (Cumulative % Frequency) to the curve, and then down to the x axis, the median is calculated at about 850mm. The median is, in fact, what is quite often meant by ‘the average’ in everyday conversation, in that half of the years tend to have more rainfall than this, and half less.
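Rather than reading the value off the curve by eye, the grouped median can also be estimated by linear interpolation within the median class. The following is a minimal Python sketch of that calculation using the figures from Table 2.3; it is offered as a cross-check on the graphical method, not as the method taught here.

```python
# A minimal sketch of estimating the grouped median for the Kano
# rainfall data (Table 2.3) by linear interpolation within the
# median class.

# (lower class boundary, frequency) for each class, in rank order
classes = [(400, 3), (500, 3), (600, 5), (700, 15), (800, 15),
           (900, 12), (1000, 9), (1100, 5), (1200, 1)]
width = 100
n = sum(freq for _, freq in classes)    # 68 years in total

target = n / 2                          # the 50% point (34 items)
cumulative = 0
for lower, freq in classes:
    if cumulative + freq >= target:
        # interpolate within the median class (800-899mm)
        median = lower + (target - cumulative) / freq * width
        break
    cumulative += freq

print(round(median))                    # about 853mm, close to the
                                        # 850mm read off the curve
```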

2.3.2

Features of the Median When using the median, you should consider the following points:

Half the items in the series will have a value greater than or equal to the median and half less than or equal to the median. It is therefore a measure of rank or position;

It is easy to understand;

It is unaffected by the presence of extreme items in the distribution;

If found directly (from ungrouped data) it will be the same as an actual item in the distribution;

It may be found when the values of all the items are not known, provided that values of middle items and the total number of items are known;

Ranking the items can be tedious;

The median cannot be used for further mathematical processing;

It may not be representative if there are few items.


2.4

The Mode In an ungrouped, discrete distribution the mode is the value which occurs most often; that is, the value with the highest frequency. The mode of the series 1,2,2,3,4 is the value of 2. Unlike the mean and the median, it is not necessarily unique. For example the series 1,2,2,3,4,4 has two modes: 2 and 4. In a continuous frequency distribution it is possible that no two values will be the same. In this sort of situation the mode is defined as the point where there is the greatest clustering of values, or maximum frequency density.
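As a quick illustration of this non-uniqueness, the sketch below applies Python's statistics.multimode (Python 3.8+) to the two series just mentioned; again, this is only a cross-check outside SPSS.

```python
# A minimal sketch of finding the mode(s) of an ungrouped series.
from statistics import multimode

print(multimode([1, 2, 2, 3, 4]))     # [2]
print(multimode([1, 2, 2, 3, 4, 4]))  # [2, 4] - the mode need not be unique
```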

2.4.1

Mode for Grouped Data To find the mode within a grouped data set it is first necessary to construct a histogram showing the frequency distribution (see Figure 2.2). Having constructed the graph, first identify the modal class (the class with the greatest frequency or frequency density). To calculate the actual value of the mode, draw a line from the top right-hand corner of the modal rectangle to the point where the top of the adjacent rectangle on the left meets it. Now draw a similar line from the top left-hand corner to the point where the top of the adjacent rectangle on the right meets it. Now draw a perpendicular from the point at which these lines cross to the horizontal axis. This point gives the value of the mode.

Figure 2.2: The Calculation of the Mode from a Frequency Histogram

[Figure: a frequency histogram of Age (x axis, 10 to 100) against Frequency (y axis, 0 to 70), with construction lines across the modal rectangle giving Mode = 34.]

While this technique will give the specific value of the mode, it is often more useful and meaningful to simply indicate the boundaries of the modal class. In other words, rather than attempting to calculate an accurate value for the mode, which may not be entirely accurate or representative, it would be more meaningful to say that more people, for example, fell within the 30 and under 40 age group than any other group described by Figure 2.2.


2.4.2

Features of the Mode When using the mode, you should consider the following points:

For discrete data it is an actual single value;

For continuous data it is the point of highest frequency density;

It is easy to understand;

Extreme items do not affect its value;

It can be estimated from incomplete data;

It cannot be used for further mathematical processing;

It may not be unique or clearly defined;

It requires arrangement of the data which may be time consuming.

Activity 1: For practice, work out the mean, median and mode for the following sets of scores relating to the number of bedspaces in serviced accommodation in Torquay.

Set 1: 4, 16, 16, 20, 32, 10, 9, 10, 20, 15, 14, 27

Mean =
Median =
Mode =

Set 2: 16, 15, 8, 8, 6, 15, 14, 12, 10, 2, 14, 30

Mean =
Median =
Mode =



2.5

Comparison of the Mean, Median and Mode The mean, median and mode are the three most important statistical measures of location and central tendency. Here are some guidelines to help you decide which value should be used in a particular case:


To determine what would result from an equal distribution use the mean (e.g. to determine the per capita consumption of jelly babies);

If position or ranking is involved use the median which gives the half-way value (e.g. a student interested in whether his exam mark places him in the upper or lower half of the class will need to compare his mark with the median mark);

Where the most typical value is required use the mode (e.g. a shoe manufacturer may want to know the average shoe size for ladies. For production planning it will be the mode that he requires as it will tell him the most common shoe size).

2.5.1 Which Measure Should You Use? The type of measure that you use will depend on the data that you are using, but ultimately whatever measure you choose should provide a good indication of the typical score in your sample. The mean is the most frequently used measure of central tendency, because it is calculated from the actual scores themselves, not from ranks, as is the case with the median, and not from frequency of occurrence, as in the case of the mode. However, as mentioned earlier, because the mean uses all the scores in the calculation it is sensitive to extreme values. Look at the following set of scores:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

The mean of this set of data is 5.5 (the same as the median). If we were to change one of the scores to make it more extreme, we would get the following:

1, 2, 3, 4, 5, 6, 7, 8, 9, 20

The mean is now 6.5, although the median is still 5.5. If we were to make the final score even more extreme we would get the following:

1, 2, 3, 4, 5, 6, 7, 8, 9, 100

The mean is now 14.5, which as you can see is not really representative of this set of scores. As we have only changed the highest score, the median remains 5.5. In this case, the median becomes a better measure of central tendency. Therefore, when deciding which measure to use, it is always useful to check the data for extreme values. Where extreme scores are present, use the median, as this simply gives you the score in the middle of the other scores when they are put into ascending order. This insensitivity to extreme values makes the median a useful alternative to the mean.
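The three series above can be checked with a short Python sketch; it simply confirms how the mean drifts with the extreme score while the median stays put.

```python
# A minimal sketch of how an extreme value pulls the mean but not
# the median, using the three series discussed above.
import statistics

for scores in ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
               [1, 2, 3, 4, 5, 6, 7, 8, 9, 20],
               [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]):
    print(statistics.mean(scores), statistics.median(scores))
# 5.5 5.5
# 6.5 5.5
# 14.5 5.5
```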


The mode can be used with any type of data, as it relates to the most frequently occurring score and does not require any calculation. The mean and median, however, cannot be used with certain types of data. For example, if you were discussing occupation or attraction classifications it would be meaningless to rank these in order of magnitude. Again, when using the mode it is important that it provides a good indication of the typical score. Consider the following two sets of data:

A] 1, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8
B] 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12

In set A there are more 2s than any other number and the mode would provide a suitable measure of central tendency. However, in set B, although the mode is again 2, it is not such a good indicator, as its frequency of occurrence is only just greater than that of all the other scores.

Activity 2: Which measure of central tendency would be most suitable for each of the following sets of data?

a] 1, 23, 25, 26, 27, 23, 29, 30                                  ........................................
b] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 5    ........................................
c] 1, 1, 2, 3, 4, 1, 2, 6, 5, 8, 3, 4, 5, 6, 7                   ........................................
d] 1, 101, 104, 106, 111, 108, 109, 200                           ........................................

2.6

The Population Mean The measures of central tendency outlined above are useful for giving an indication of the typical score in a sample. However, what if you wanted an indication of the typical score in a population? In theory, one could calculate the population mean (a parameter) in a similar way to the calculation of a sample mean: obtain scores from everyone in the population, sum them and divide by the number in the population. In practice, however, this is rarely possible. We therefore have to estimate the population parameters from the sample statistics.

One way of estimating the population mean is to calculate the means for a number of samples and then calculate the mean of these sample means. It has been found that this gives a close approximation of the population mean. So why does the mean of the sample means approximate the population mean? Imagine randomly selecting a sample of people and measuring their IQ. It has been found that the population mean for IQ is 100. It could be that, by chance, you have selected mainly geniuses and that the mean IQ of the sample is 150. This is clearly above the population mean of 100. You might select another sample that happens to have a mean IQ of 75, again not near the population mean. It is clear that the sample mean need not be a close approximation of the population mean. However, if we calculate the mean of these two sample means, we get a much closer approximation to the population mean:

(150 + 75) / 2 = 112.5


The mean of the sample means (112.5) is a closer approximation of the population mean (100) than either of the individual sample means (75 and 150). If several samples of the same size are taken from a population, some will have a mean higher than the population mean and some will have a lower mean. If all the sample means were plotted as a frequency histogram the graph would look similar to Figure 2.3.

Figure 2.3: Distribution of Sample Means Selected from a Population with a Mean of 100

[Figure: a bell-shaped frequency histogram of sample means, centred where the population mean and the mean of the sample means both equal 100.]

If we calculated the mean of all these sample means it would be equal to 100, which is also equal to the population mean. This tendency of the mean of sample means to equal the population mean is known in statistics as the Central Limit Theorem. Knowing that the mean of the sample means gives a good approximation of the population mean is important as it helps us to generalise from our samples to our population. This will be considered in more detail when we look at dispersion.
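This idea can be demonstrated with a small simulation. The sketch below is illustrative only: it draws IQ-style scores from an assumed population with mean 100 and standard deviation 15, takes repeated samples, and shows that the mean of the sample means sits close to 100.

```python
# A minimal sketch of the idea that the mean of many sample means
# approximates the population mean. The population is simulated
# here (mean 100, sd 15) purely for illustration.
import random

random.seed(1)
population = [random.gauss(100, 15) for _ in range(100_000)]

sample_means = []
for _ in range(200):                       # 200 samples of 25 people
    sample = random.sample(population, 25)
    sample_means.append(sum(sample) / len(sample))

print(sum(sample_means) / len(sample_means))  # close to 100
```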


2.7


Skew and the Relationship of the Mean, Median and Mode Skew is the term that is used to describe the shape of the data as depicted by its frequency distribution or frequency curve. Under a symmetrical distribution curve, or what is also called the ‘normal distribution’ (this will be covered in more detail when we look at measures of dispersion), the data builds up slowly from the left to a central peak or modal point and then declines to the right. In this situation, the mean, median and the mode all coincide (see Figure 2.4). A positive skew is when the peak lies to the left and a negative skew when it lies to the right. The further the peak lies from the centre of the horizontal axis, the more the distribution is said to be skewed.

Figure 2.4: Symmetrical, Positively and Negatively Skewed Data Distributions

Where the distribution is positively skewed, the mean and median will be pulled to the right of the mode, and where it is negatively skewed, the mean and median are pulled to the left. Consequently, in a positively skewed distribution, the mean will have the greatest value, the mode the lowest value and the median will fall between the two. Conversely, in a negatively skewed distribution, the mode will have the highest value and the mean will have a lower value than the median and the mode.
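A small positively skewed series makes this ordering concrete. In the sketch below (an invented data set, purely for illustration), the mean is the highest of the three measures and the mode the lowest, exactly as described above.

```python
# A minimal sketch of how positive skew separates the three
# measures: mean > median > mode.
import statistics

scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]   # a long tail to the right
print(statistics.mean(scores))    # 4.5 (highest, pulled by the tail)
print(statistics.median(scores))  # 3.0
print(statistics.mode(scores))    # 2   (lowest)
```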


2.8


Using SPSS to Calculate Descriptive Statistics Having considered the basic calculation of the mean, median and mode by hand (and hopefully not too painfully!), the aim of this next section is to show you how to produce basic descriptive statistics using SPSS. You can also produce descriptive statistics in Access, and this will be demonstrated later in the module. We first need to consider the basic elements of the SPSS operating system.

2.8.1

An Introduction to SPSS SPSS (PASW Statistics) is a powerful statistical tool that can be used to perform a wide range of statistical techniques. When analysing data in SPSS it is often convenient to transfer the data you wish to analyse over from an Excel spreadsheet. The following section will highlight how to import an Excel spreadsheet and provide a basic introduction to the SPSS environment, before detailing how to produce descriptive statistics. To import an Excel spreadsheet, first open SPSS. SPSS asks you what you would like to do. Move the mouse over Open an Existing Data Source and press the left mouse button. Either choose the required file or select More Files and click OK.


The Open File dialog box appears. Move the mouse over the drive containing the file you want to open and then press the left mouse button. The file Dataset is located in the BML224 home page on Moodle.

SPSS must be told to look for an Excel file. Therefore, in the Files of Type box make sure that Excel is selected [move the mouse over the drop-down arrow and press the left mouse button; a sub menu of different file types appears; move the mouse over Excel and press the left mouse button]. Now select the Dataset file and click Open. The Opening File Options dialog box appears. In the Excel spreadsheet you are going to import, the first row contains the field names of the variables you want to examine. To assist your data analysis, you need to ensure that SPSS recognises this.

Move the mouse over the Read Variable Names option and press the left mouse button (the empty check box becomes ticked). Move the mouse over OK and press the left mouse button.


SPSS now automatically imports the fields in the Excel spreadsheet and the data is displayed in the Data Editor window.

You now need to save this file to your own homespace on the network. Move the mouse over File and press the left mouse button. Move the mouse over Save As and press the left mouse button again. The Save As dialog box appears. Save the file as DATASET.SAV. Note that .SAV is the file extension for data tables in SPSS. If you need to reload this file at any point, select the DATASET.SAV file in the Open File dialog box.


Before using SPSS to perform basic frequency counts and descriptive statistics on the results of the Interview data you first need to understand the nature of the data. For example, some variables are based on numeric coding schemes (nominal, categorical data types) and others on specific data values (interval or ratio data types). For those questions based on numeric coding schemes, certain descriptive statistics are not appropriate, although in this case SPSS can be used to perform basic frequency counts. Details of the variables in the Dataset file are included in the Dataset guide which has been given to you as part of the module resources. Please read through this guide carefully and become familiar with the different types of data, as this will be central to your successful completion of this module.


2.8.2


Using the Variable View In SPSS, we can use the variable view to check the integrity of the data and to apply additional information to the coding schemes to aid our analysis of the data. In the bottom of the SPSS window, click on the Variable View tab. The Variable View window is displayed. This window provides specific information relating to the variables that we have imported in the Dataset file. A number of key areas need to be checked at this point. First, check the Type column. In order for SPSS to conduct statistical analysis on the variables in the Dataset file all the variables here should be listed as Numeric.

In this instance the Greenrank06 variable is listed as a String. This needs to be changed to Numeric. To do this move the mouse over String and press the left mouse button. The cell is highlighted and a button appears.

Click the button and the Variable Type dialog box appears.


Select Numeric and click OK.

Check the other variables to ensure that they are set as Numeric. We can also use the Variable View to check the Measurement type of the variables. In this instance the measurement type should look like this. Refer back to your introductory notes to check on different data types. If the measurement type is not correct for a specific variable, move the mouse over the measurement cell in question and press the left mouse button. The cell is highlighted and a button appears.

Click on the button and a sub menu appears, offering three options: Scale, Ordinal and Nominal. Move the mouse over the required data type and press the left mouse button. The new data type will be presented. Note that ratio and interval data (e.g. age/investment) are classified as Scale. In the Variable View we can also assign more specific value labels to each of the variables. For example, if we take Area as an example of the basic coding scheme in place here, Chichester District = 1 and Arun District = 2. Any subsequent analysis that we perform will use this base coding scheme in any output. In order to make the SPSS output more self-explanatory we can assign additional value labels so that any output actually refers to Chichester District and Arun District. In the Variable View move the mouse over Values for the Area variable and press the left mouse button. The cell is highlighted and a button appears. Click the button.

Click on the button and a sub menu appears, offering three options: Scale, Ordinal and Nominal. Move the move over the required data type and press the left mouse button. The new data type will be presented. Note that ratio and interval data (e.g. age/investment) are classified as Scale). In the Variable View we can also assign more specific value labels to each of the variables. For example if we take Area as an example of the basic coding scheme in place here, Chichester District = 1 and Arun District =2. Any subsequent analysis that we perform will use this base coding scheme in any output. In order to make the SPSS output more self-explanatory we can assign additional value labels so that any output actually refers to Chichester District and Arun District. In the Variable View move the mouse over Values for the Area variable and press the left mouse button. The cell is highlighted and a button appears. Click the button.


The Value Labels dialog box appears.

In the Value: box type 1. In the Value Label: box type Chichester District.


Click Add.

In the Value: box type 2. In the Value Label: box type Arun District. Click Add.


Click OK.

The changes you have made are reflected in the Variable View.

Repeat this process to add Value Labels to the remaining variables (where appropriate!). Return to the Data View and SAVE the file. We can now experiment with producing descriptive statistics.


By using the Value Labels button in the Data View window you can switch between the numeric coding and the full text labelling. Click the button to toggle between the different options.

Numeric Coding

Text Label


2.8.3


Working with SPSS Output Before we start producing descriptive statistics, it is worth mentioning that SPSS output can be cut and pasted into a Word document (or an equivalent package). The process is very simple. In the output window, select the item you want to cut and paste, in this case a histogram. When the item is selected a black border will appear. Copy the item (Edit > Copy or right mouse click > Copy).

Open Word and paste the selection into your document.


To print specific elements of the output, first select the element you wish to print. When the item is selected a black border will appear. Select Print from the File menu. The Print dialog box opens. Make sure that Selection is highlighted and click OK.

The required element is printed. Please use this method to print and annotate output that will be created during the module.



Please use the cut and paste process highlighted here to complete your log book that we will use throughout this module.

Additional guidance notes on the different features of SPSS are available in the appendices of this handbook. When using SPSS to analyse data, you should not be directly cutting and pasting SPSS output into your work. Output tables should ideally be recreated in Word, and data should be transferred into Excel to create appropriate graphs.


2.8.4


Producing Descriptive Statistics As mentioned earlier, before using SPSS to perform basic frequency counts and descriptive statistics on the results of the survey data you first need to understand the nature of the data (refer back to Section 1.6). In this case, we will start by exploring the categorical/nominal variable OCC (occupation). Remember that for this variable it would not be appropriate to apply the mean, median or standard deviation. To perform a basic frequency count, first decide on the variable you wish to examine. In this case we shall examine OCC.

To do so, first move the mouse over Analyse and press the left mouse button. Move the mouse over Descriptive Statistics and then over Frequencies and press the left mouse button again.

The Frequencies dialog box appears.


Move the mouse over the variable you want to examine (in this case Occ) and press the left mouse button. Move the mouse over the central arrow and press the left mouse button again. Alternatively, select the variable you want to examine and quickly double click the left mouse button. The selected variable moves across into the Variable(s) box. Note this procedure can be repeated for multiple variables. Click OK.


The results of the frequency count are displayed in the output window. Notice that the frequency table has listed the occupations as a result of you entering data for the Value Labels. This helps to make the table more self-explanatory. Any statistics you generate in SPSS will also be displayed in this output window. This is very useful as it means all your calculations are stored in one file that you can save and open at a later date. Save the output file to your own homespace on the network as DS-OUTPUT1. Repeat this procedure to perform the frequency counts needed to complete Tables 1 and 2 overleaf. Your additional frequency counts will appear in the output window. Save the output regularly. Record your results overleaf or, alternatively, print out and fully annotate your SPSS output and file it in your work folder. The information presented in the frequency table could now be copied and pasted into Excel, where you could create an Excel chart to show the distribution of the data.
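If you later want to reproduce a basic frequency count outside SPSS, a pandas equivalent looks like the sketch below. The tiny Occ series is invented purely for illustration; in the module you would work from the Dataset file in SPSS.

```python
# A minimal pandas sketch of a basic frequency count, equivalent in
# spirit to Analyse > Descriptive Statistics > Frequencies in SPSS.
import pandas as pd

occ = pd.Series(["Manager", "Manager", "Professional", "Sales", "Manager"],
                name="Occ")

print(occ.value_counts())                      # frequencies
print(occ.value_counts(normalize=True) * 100)  # percentages
```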


An online simulation of how to create basic frequency statistics is available on the BML224 home page. Please use this simulation to familiarise yourself with the basic procedures outlined here.


Activity 3:

Table 1: The Distribution of Accommodation by Size

Size      Frequency   Percentage
Small
Medium
Large

Table 2: The Distribution of Accommodation by Price

Price         Frequency   Percentage
Up to £30
£31 to £50
£51 to £70
£71 to £90
£91+

Having completed Tables 1 and 2, now have a go at completing Table 3. It is exactly the same process but you will need to perform a frequency count for each separate question in the table (the relevant variable name is given in the brackets).


Activity 4:


Table 3: Business Responses to Tourism Issues

You will have noticed that the frequency count produced relates to the entire sample of 300 businesses, and there is no differentiation based on specific cases such as location. By selecting specific cases we can use SPSS to produce more detailed frequency counts. In the following example we will produce a frequency count showing the frequency distribution of different occupation types by area.


Return to the Data View window in SPSS. Move the mouse over Data and press the left mouse button.


Move the mouse over Split File and press the left mouse button.

The Split File dialog box opens.

Select Compare groups

Then select Area and move it into the Groups Based on: box. Then click OK.


The frequency table is displayed in the output window. As you can see the frequency table now gives a breakdown of occupation type by area (our prior labelling clearly referring to the Chichester and Arun Districts). Let us repeat this frequency count but this time instead of using Area we will use Town Code.


Return to the Data View window in SPSS. Move the mouse over Data and press the left mouse button.

Move the mouse over Split File and press the left mouse button.

The Split File dialog box opens. Deselect Area and then select Town Code and move into the Groups Based on box.

Click OK.


Run the frequency count again and the frequency table is displayed in the output window. As you can see the frequency table now gives a breakdown of occupation type by Town Code (our prior labelling clearly referring to the actual towns).
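For comparison, the Split File idea maps onto a grouped frequency count in pandas, as in the minimal sketch below. The Area and Occ variable names follow the Dataset guide, but the data itself is invented for illustration.

```python
# A minimal pandas sketch of the Split File idea: the same
# frequency count, broken down by a grouping variable.
import pandas as pd

df = pd.DataFrame({
    "Area": ["Chichester", "Chichester", "Arun", "Arun", "Arun"],
    "Occ":  ["Manager", "Professional", "Manager", "Sales", "Sales"],
})

print(df.groupby("Area")["Occ"].value_counts())
```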


Activity 5:

Using the Split File option please complete the following tables.

Table 4: Size of Accommodation by Area

                          Size of Accommodation
Area                      Small           Medium          Large           Total
                          [No. of Ests]   [No. of Ests]   [No. of Ests]
Chichester District
  % Distribution
Arun District
  % Distribution

Table 5: Size of Accommodation by Town

                          Size of Accommodation
Town                      Small           Medium          Large           Total
                          [No. of Ests]   [No. of Ests]   [No. of Ests]
Chichester
  % Distribution
Midhurst
  % Distribution
Arundel
  % Distribution
Bognor Regis
  % Distribution


Activity 5:


Table 6: Business Response to Employment Opportunities by Area

Table 7: Business Response to Employment Opportunities by Town


Activity 6: Self-Directed Cut and paste the results from Table 5 in your SPSS output into Excel. Edit the layout of the results accordingly and produce the following graph. The graph should be presented on A4 in landscape format. Please copy the format of this chart exactly.

[Example chart: ‘The Size Structure of Accommodation in the Chichester and Arun Districts’, a 100% stacked horizontal bar chart with Town (Chichester, Midhurst, Arundel, Bognor Regis) on the y axis, Percentage (0% to 100%) on the x axis, and each bar split into Small, Medium and Large establishments.]

Please print off the chart and have it checked by the module tutor. File the chart in your work folder.


Before we do any additional analysis it is important to remember to reset the Split File dialog box, so that any subsequent analysis is based on the entire sample. Return to the Data View window in SPSS. Move the mouse over Data and press the left mouse button.

Move the mouse over Split File and press the left mouse button.

The Split File dialog box opens.

Select Analyze all cases, do not create groups and then click OK.

Failure to reset the Split File dialog box can result in inaccurate statistics being created.


There are a number of ways in which you can produce descriptive statistics for interval or ratio variables in SPSS. Method 1: First decide on the variable you wish to examine. In this case we shall examine the turnover of businesses in 2008 (Turnover08).

To do so, first move the mouse over Analyse and press the left mouse button.

Move the left mouse button over Descriptive Statistics and then over Frequencies and press the left mouse button again. The Frequencies dialog box appears.

Move the mouse over the variable you want to examine (in this case Turnover08) and press the left mouse button. Move the mouse over the central arrow and press the left mouse button again. Alternatively, select the variable you want to examine and quickly double click the left mouse button. The selected variable moves across into the Variables box. Note this procedure can be repeated for multiple variables. Move the mouse over Statistics and press the left mouse button.


The Frequencies: Statistics dialog box appears. This dialog box gives you the opportunity to select a wide range of descriptive statistics. Select the options you want to include by moving the mouse over the blank square and pressing the left mouse button so a tick appears. When you have completed your selection move the mouse over Continue and press the left mouse button.

Note that SPSS also allows you to select measures of dispersion. This will be discussed in more detail in the next session. This will take you back to the Frequencies dialog box. Move the mouse over OK and press the left mouse button. SPSS automatically calculates the necessary statistics and displays the results in the Output window. This method not only produces the basic descriptive statistics for the variable but also a frequency table (which can be deleted).


Descriptive statistics can also be produced by selecting Descriptives instead of Frequencies in the Descriptive Statistics sub menu. Follow the same procedures as in the previous example, however, in this case click Options to specify the descriptive statistics you want SPSS to produce.

Select the options you want to include by moving the mouse over the blank square and pressing the left mouse button so a tick appears.

When you have completed your selection move the mouse over Continue and press the left mouse button. This will take you back to the Descriptives dialog box. Move the mouse over OK and press the left mouse button. SPSS automatically calculates the necessary statistics and displays the results in the Output window.


You will have noticed that the descriptive statistics produced for Turnover08 relate to the entire sample of 300 businesses. By using the Split File option again we can look in more detail at the characteristics of turnover in relation to specific cases, such as size of business or location. In the following example, we can use the Split File option to look at the average turnover in the Chichester and Arun Districts. As before, open the Split File dialog box and select Compare groups. Select Area to go in the Groups Based on: box.

Now produce descriptive statistics for Turnover08 again (using either the descriptives or frequencies option). In the following example I have created descriptive statistics using the frequencies option and you can see in the output that descriptive statistics have now been produced for both the Chichester and Arun Districts.


Method 2: The second (and slightly faster) method is to use the Explore function. In this example we will again examine the turnover of businesses in 2008 (Turnover08). To do so, first move the mouse over Analyse and press the left mouse button.

Move the left mouse button over Descriptive Statistics and then over Explore and press the left mouse button again. The Explore dialog box appears.

Move the mouse over the variable you want to examine (in this case Turnover08) and press the left mouse button. Move the mouse over the Dependent List arrow and press the left mouse button again.


Turnover08 appears in the Dependent List.

Make sure that Statistics is selected in the dialog box. We will come back to plots later. Click OK. Descriptive statistics for Turnover08 are produced in the output window.


As in the previous method of producing descriptive statistics, the values given in the output relate to the entire sample. By adding variables to the Factor List in the Explore dialog box, we can differentiate by specific cases. Return to the Explore dialog box.

Select Area from the variable list and click the Factor List arrow. Area will appear in the Factor List window. This will give us separate descriptive statistics for the Arun and Chichester Districts. Remember in the previous method, we used the Split File option to group around specific cases. Click OK. Descriptive statistics for business turnover in the Arun and Chichester Districts are produced in the output window.
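The Factor List behaves like a grouped summary. A minimal pandas sketch of the same idea is shown below; the Turnover08 and Area names follow the Dataset guide, while the handful of figures is invented for illustration.

```python
# A minimal pandas sketch of Explore with a Factor List: descriptive
# statistics for a scale variable, grouped by a factor.
import pandas as pd

df = pd.DataFrame({
    "Area": ["Chichester", "Chichester", "Arun", "Arun"],
    "Turnover08": [120_000, 95_000, 80_000, 110_000],
})

print(df.groupby("Area")["Turnover08"].describe())
```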


Let me illustrate another example. Return to the Explore dialog box. Remove Area from the Factor List and replace it with E-Strategy. Click OK.

Descriptive statistics for business turnover for E-Commerce Adopters and E-Commerce Non-Adopters are produced in the output window.


Activity 7:

Using either method, attempt to complete the following tables. Table 8: Descriptive Statistics for Turnover08 by Town

Table 9: Descriptive Statistics for GTBS Score in 2008 [GTBS08] by Size of Business

GTBS08

Size of Business

Mean

Median

Mode

Standard Deviation

Range

Small Medium Large

Table 10: Descriptive Statistics for Invest by GStrategy


2.9


Graphically Describing Data As mentioned earlier, when using statistics it is important to understand the data that you are using. One of the best ways of doing this is through exploratory data analysis, and investigating your data using graphical techniques. The next section will consider three main elements: frequency histograms, stem and leaf plots and box plots.

2.9.1

Frequency Histograms In the above section you have used SPSS to perform basic frequency counts. The frequency histogram is a useful way of representing a frequency count more graphically, allowing us to inspect the data for any extreme values (see Figure 2.5). Any extreme values, and any possible errors made in inputting the data, are often easier to spot when you have graphed the data. The frequency histogram is also useful for discovering other important characteristics of your data. For example, you can easily record the value of the mode by looking for the tallest column in the chart. In addition, the histogram gives you useful information about how the values are distributed. However, when interpreting the distribution of the data, be aware that the interpretation of your histogram is dependent upon the particular intervals that the bars represent. The way that the data is distributed will become important when we look at the normal distribution and dispersion in the next session. The distribution and character of the data is also an important consideration in the use of the inferential statistics that will be examined later in this module.


Figure 2.5: Frequency Histogram showing the Mean, Median and Mode

[Figure: a frequency histogram with markers for the Mode, the Median and the Mean of 16.56 (the mean is not normally shown on histograms).]

[Note: The frequency histogram is based on the following data: 2, 12, 12, 19, 19, 20, 20, 20, 25]

2.9.2

Stem and Leaf Plots Stem and leaf plots are similar to frequency histograms in that they allow you to see how the scores are distributed. They also retain the values of the individual observations. A basic example of a stem and leaf plot is shown below:

Stem and Leaf Plot [a] [Data set = 2, 12, 12, 19, 19, 19, 20, 20, 20, 25]

Stem (Tens)   Leaf (Units)
0             2           <- the score of 2
1             22999
2             0005        <- the final leaf, 5, is the score of 25

A stem and leaf plot based on a larger data set is illustrated overleaf.


Stem and Leaf Plot [b] [Data set = 1, 1, 2, 2, 2, 5, 5, 5, 12, 12, 12, 12, 14, 14, 14, 14, 15, 15, 15, 15, 18, 18, 24, 24, 24, 24, 24, 25, 25, 25, 25, 25, 25, 25, 28, 28, 28, 28, 28, 28, 28, 28, 32, 32, 33, 33, 33, 33, 34, 34, 34, 34, 34, 35, 35, 35, 35, 35, 42, 42, 42, 43, 43, 44]

Stem   Leaf
0      11222555
1      22224444555588
2      44444555555588888888
3      2233334444455555
4      222334

You can see the similarities between histograms and stem and leaf plots if you turn the stem and leaf plot on its side. When you do this you get a good representation of the distribution of the data. In Stem and Leaf Plot [a] the first line contains the scores 0 to 9, the next line 10 to 19 and the last line 20 to 29. Therefore in this case the stem indicates the tens and the leaf the units. You can see the score of 2 is represented as 0 in the tens column (the stem) and 2 in the units column (the leaf); 25 is represented as a stem of 2 and a leaf of 5. The same pattern applies to Stem and Leaf Plot [b], which highlights that this approach is useful for presenting lots of data. However, there are times when the system of blocking in tens is not very informative. For example, look at the following Stem and Leaf Plot.

Stem and Leaf Plot [c]

Stem   Leaf
0      0000022222222333333333555555555555555777777777777799999999
1      000000033333888
2      3
6      4

This Stem and Leaf Plot is not really that informative, and only indicates that most of the values are below 20. An alternative system is to block the scores in groups of 5 (0-4, 5-9, 10-14, 15-19 etc).

Stem and Leaf Plot [d]

Block    Stem   Leaf
0-4      0.     0000022222222333333333
5-9      0*     555555555555555777777777777799999999
10-14    1.     000000033333
15-19    1*     888
20-24    2.     3
60-64    6.     4



This stem and leaf plot provides a much better indication of the distribution of scores. You can see that we use a full stop (.) following the stem to signify the first half of each block of ten scores (e.g. 0-4) and an asterisk (*) to signify the second half of each block of ten scores (e.g. 5-9).
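Building a stem and leaf plot is also a handy exercise in code. The sketch below reproduces Plot [a] with stems as tens and leaves as units; it does not attempt the five-score blocking convention just described.

```python
# A minimal sketch of building a simple stem and leaf plot
# (stems = tens, leaves = units), as in Plot [a] above.
from collections import defaultdict

data = [2, 12, 12, 19, 19, 19, 20, 20, 20, 25]

plot = defaultdict(list)
for score in sorted(data):
    plot[score // 10].append(score % 10)   # stem | leaf

for stem in sorted(plot):
    print(stem, "".join(str(leaf) for leaf in plot[stem]))
# 0 2
# 1 22999
# 2 0005
```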

2.9.3

Box Plots Extreme scores are sometimes difficult to spot in a large data set. In this instance an alternative graphical technique is the box plot or whisker plot, which gives a clear indication of the distribution of extreme scores, and like the stem and leaf plots and histograms discussed above, tells us how the scores are distributed. An example of a box plot is given in Figure 2.6:

Figure 2.6: An Example of a Box Plot

[Figure: a box plot of nine scores (N = 9) on a y axis from 0 to 40. The thick line inside the box represents the median; the edges of the box are the hinges; the whiskers extend from the box out to the adjacent values.]

Although SPSS will automatically create box plots, the following section will outline how to create them by hand so that you understand how to interpret them.

Step 1: The box plot in Figure 2.6 is based on the following data: 2, 20, 20, 12, 12, 19, 19, 25, 20. The first step is to calculate the median score: 2, 12, 12, 19, 19, 20, 20, 20, 25. Median score = 19 [position 5].


Step 2:

The next step is to calculate the hinges. These are the scores that cut off the top and bottom 25% of the data (the lower and upper quartiles): thus 50% of the scores fall within the hinges. The hinges form the outer boundaries of the box. The position of the hinges is calculated by adding 1 to the position of the median and then dividing by 2. In this instance the median was in position 5, therefore: (5 + 1) / 2 = 3

Step 3:

The upper and lower hinges are therefore the third score from the top and the third score from the bottom of the ranked list, which in this current example are 20 and 12 respectively.

Step 4:

From these scores we can work out the h-spread, which is the range of the scores between the two hinges. The score on the upper hinge is 20 and the score on the lower hinge is 12, therefore the h-spread is 8 (20 minus 12).

Step 5:

We define extreme values as those that fall one-and-a-half times the h-spread outside the upper and lower hinges. The points one-and-a-half times the h-spread outside the upper and lower hinges are called inner fences. One-and-a-half times the h-spread in this case is 12, that is 1.5*8: therefore any score that falls below 0 (lower hinge, 12, minus 12) or above 32 (upper hinge, 20, plus 12) is classed as an extreme score.

Step 6:

The scores that fall within the hinges and inner fences and which are closest to the inner fence are called adjacent scores. In our example, these scores are 2 and 25, as 2 is the closest score to 0 (the lower inner fence) and 25 is the closest to 32 (the upper inner fence). These are illustrated by the cross-bars on each of the whiskers.

Any extreme scores (those that fall outside the upper and lower fences) are shown on the box plot. You can see from Figure 2.6 that the h-spread is indicated by the box width (12 to 20) and that there are no extreme scores. The lines coming out from the edge of the box are called whiskers, and these represent the range of scores that fall outside the hinges but are within the limits defined by the inner fences. Any scores that fall outside the inner fences are classed as extreme scores (also called outliers). As shown in Figure 2.6, there are no scores outside the inner fences, which are 0 and 32. The inner fences are not necessarily shown on the plot. The lowest and highest scores that fall within the inner fences (the adjacent scores, 2 and 25) are indicated on the plot by the cross-lines on each of the whiskers. If we were to add a score of 33 to the data set illustrated in Figure 2.6, a revised box plot would now indicate the presence of an extreme score (see Figure 2.7). As shown in Figure 2.7, the score is marked as 10, indicating that the tenth score in our data set is an extreme score (in this case, 33). This value falls outside the inner fence of 32. In this situation it may be worth examining the data set to ensure that this extreme value has not been caused by an error in the data entry process.
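The six steps above can be followed directly in a short Python sketch, using the same nine scores and the hinge convention used in this section (position of the median plus one, divided by two).

```python
# A minimal sketch of the box plot quantities worked through above:
# median, hinges, h-spread and inner fences.
data = sorted([2, 20, 20, 12, 12, 19, 19, 25, 20])

n = len(data)                       # 9 scores
median_pos = (n + 1) // 2           # position 5
median = data[median_pos - 1]       # 19

hinge_pos = (median_pos + 1) // 2   # position 3 from each end
lower_hinge = data[hinge_pos - 1]   # 12
upper_hinge = data[-hinge_pos]      # 20

h_spread = upper_hinge - lower_hinge        # 8
lower_fence = lower_hinge - 1.5 * h_spread  # 0
upper_fence = upper_hinge + 1.5 * h_spread  # 32

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(median, lower_hinge, upper_hinge, outliers)   # 19 12 20 []
```

Adding the score of 33 to the data list, as discussed above, would make the outliers list non-empty, since 33 falls above the upper fence of 32.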


Figure 2.7: Revised Box Plot Indicating an Extreme Score

[Figure: the same box plot redrawn for ten scores (N = 10), y axis 0 to 40; the extreme score is plotted separately above the upper whisker and labelled 10, its position in the data set.]



2.10


Graphically Describing Data in SPSS Creating histograms, stem and leaf plots and box plots in SPSS is very straightforward. In the following example, we will generate graphical output relating to the Turnover08 variable in the dataset. Move the mouse over Analyse and press the left mouse button.

Move the left mouse button over Descriptive Statistics and then over Explore and press the left mouse button again.

The Explore dialog box appears.

Move the mouse over the variable you want to examine (in this case Turnover08) and press the left mouse button. Move the mouse over the central arrow and press the left mouse button again. Alternatively, select the variable you want to examine and quickly double click the left mouse button.


The selected variable moves across into the Dependent List. Move the mouse over Plots and press the left mouse button.


The Explore Plots dialog box appears. Select Stem and Leaf and Histogram. Click Continue.


You are returned to the Explore dialog box. Click OK.


SPSS generates a histogram, stem and leaf plot and box plot in the output window.


As before any graphical output produced is referring to the entire sample of 300 businesses. Using the Factor List option in the Explore dialog allows us to examine specific variables in more detail. For example the following output has been produced by selecting Area in the Factor List box. This is an extremely useful way of visually looking at the distribution of your data, which we will come back to when we look at dispersion and statistical testing.


Activity 8:


I would like you now to have a go at producing graphical output for a specific variable. Choose an appropriate variable (which must be ratio or interval in nature) and produce output for the entire sample, and then use the Factor List option in the Explore dialog box to investigate specific cases. Record your observations by cutting and pasting the output into your log book.


2.11

Creating Cross-tabulations in SPSS Another useful way of examining the relationship between variables is through the use of cross-tabulations. In the following example we will create a number of cross-tabulations using data from the Dataset file. To create a cross-tabulation in SPSS, move the mouse over Analyse and press the left mouse button. Move the mouse over Descriptive Statistics and then Crosstabs.

The Crosstabs dialog box appears. You need to think about the structure of your crosstab and decide which variable you want as a row and which variable you want as a column. Your crosstab should take the form of a contingency table.


Move the mouse over the variable you want to assign to Rows, in this case Area, and press the left mouse button. Move the mouse over the central arrow and press the left mouse button again. Area appears in the Row(s) box:

Move the mouse over the variable you want to assign to Columns, in this case Occ (Occupation), and press the left mouse button. Move the mouse over the central arrow and press the left mouse button again. Occ appears in the Column(s) box:

Click OK.


SPSS produces the crosstab in the output window:

The crosstab presented here is based on the absolute values of the data. We can repeat the process to include Row and Column percentages. This is often a good idea, as it provides a more representative overview if you have different sample sizes. In this instance we will add percentages to the rows.

Having selected the Row and Column variables move the mouse over Cells and press the left mouse button.


The Crosstabs: Cell Display dialog box appears.

Select Row in the Percentages window and then click Continue. This will return you to the Crosstabs dialog box. Click OK.

A second crosstab is produced in the output window - this time row percentages have been included. In this example the crosstab is showing the distribution of occupation categories within a specific District. For example in the Chichester District, 48.6% of businesses are run by previous managers and administrators, compared to 25.7% who were in professional occupations. Reference to the percentage distribution rather than the absolute values provides a more representative discussion, as it takes into account relative sample sizes. Repeat the process removing row percentages and adding column percentages.


When producing crosstabs it is important that you correctly assign row and column percentages as this can influence the accuracy of how you discuss the results. A simple rule of thumb is that row percentages should always total 100 when read across the row, and column percentages will always total 100 when read down the column. In the above example where we have used the column total we are looking at the distribution of specific occupation categories across the two Districts. For example, 70.2% of managers and administrators are within the Chichester District compared to 29.8% in the Arun District. In contrast, 63.6% of plant operatives are within the Arun District compared to 36.4% in the Chichester District.
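The same row/column logic can be checked in pandas with crosstab's normalize option, as in the minimal sketch below. Area and Occ follow the Dataset guide; the handful of rows is invented purely for illustration.

```python
# A minimal pandas sketch of a crosstab with row and column
# percentages, mirroring the SPSS Cells options discussed above.
import pandas as pd

df = pd.DataFrame({
    "Area": ["Chichester", "Chichester", "Arun", "Arun", "Arun"],
    "Occ":  ["Manager", "Professional", "Manager", "Sales", "Sales"],
})

counts = pd.crosstab(df["Area"], df["Occ"])
row_pct = pd.crosstab(df["Area"], df["Occ"], normalize="index") * 100
col_pct = pd.crosstab(df["Area"], df["Occ"], normalize="columns") * 100

print(counts)    # absolute values
print(row_pct)   # each row sums to 100
print(col_pct)   # each column sums to 100
```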


Activity 9:

Now attempt to complete the following tables. Please give consideration to whether you should be using row or column totals (the clue is in the table). Refer to your Dataset guide.

Table 11: Town Against G-Strategy

Table 12: E-Strategy Against Occupation

Rows (E-Strategy): E-Commerce Adopters (% Distribution); E-Commerce Non-Adopters (% Distribution)
Columns (Occupation): Managers and Administrators | Professional Occupations | Clerical and Secretarial | Sales Operations | Plant Operatives | Total


Activity 9 (continued):

Table 13: Perceived Value of the Internet Against E-Commerce and Marketing Course Attendance

Table 14: Town Against the Size of Business

Rows (Size of Business): Small (% Distribution); Medium (% Distribution); Large (% Distribution); Total
Columns (Town): Chichester | Midhurst | Arundel | Bognor Regis


Activity 10:

Using the Dataset file, create 3 additional crosstabs using appropriate variables. Record your results by cutting and pasting your output into your log book. Check your crosstabs with your module tutor to ensure that they are correct.

Please review the online simulations to ensure that you are familiar with the basic approaches of producing descriptive statistics in SPSS.

We can make crosstabs even more specific by using the Layer command. In the following example our initial crosstab is GStrategy v Occupation, but we are going to use the Layer command to examine any differences between GStrategy, Occupation and Area. In effect, the Layer command allows us to use Area as an additional filter. Select the variables to use as the basis of your crosstab. Here we are using GStrategy (row) and Occupation (column). Select Area and place it in the Layer box.

Click OK.


In the output window you will notice that a crosstab showing GStrategy v Occupation has been provided, but the results have also now been split by area, showing relative distributions in both the Arun and Chichester Districts.
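For reference again, the effect of the Layer command (a separate sub-table for each Area) can be approximated in pandas by supplying two row variables. A minimal sketch; the data below are invented stand-ins for the Dataset file:

```python
import pandas as pd

# hypothetical stand-in data (Area, GStrategy, Occ)
df = pd.DataFrame({
    "Area":      ["Chichester", "Chichester", "Arun", "Arun", "Chichester", "Arun"],
    "GStrategy": ["Yes", "No", "Yes", "Yes", "No", "No"],
    "Occ":       ["Manager", "Professional", "Manager", "Plant", "Manager", "Plant"],
})

# passing two row variables produces one GStrategy x Occ sub-table per Area,
# mirroring the effect of SPSS's Layer option
layered = pd.crosstab([df["Area"], df["GStrategy"]], df["Occ"])
print(layered)
```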



Section 3: Measures of Dispersion

Learning Outcomes

At the end of this session, you should be able to:

- Understand the theory and assumptions relating to the distribution and variance of data
- Calculate measures of dispersion, including the median, range, standard deviation and standard error, both manually and using SPSS
- Use confidence levels and z scores to establish the relationship between the sample mean and population mean
- Use the standard error to establish the extent to which the sample mean deviates from the population mean

3.0 Introduction

So far, you have been introduced to a number of different methods to graphically illustrate your data. But why is it important to do this? The way the data are distributed will influence the types of statistical test that are valid, as many of the statistical tests that you will be introduced to in this module make specific assumptions regarding the distribution of the data. One of the most important distributions that you need to consider is the normal distribution. Under a normal distribution, the characteristic frequency curve is bell-shaped and is symmetrical around the mean. For example, if 1,000 people were asked to estimate the length of a room that was exactly 12 feet long, it is highly probable that most estimates would be close to 12 feet. Some may guess as low as 11 feet and others may decide on 13 feet. However, we would expect that most of the estimates would be between 11 feet and 13 feet and very few as far out as 9 feet or 15 feet. If the frequency distribution of the measurements were plotted on a graph, the pattern would tend to be bell-shaped because most of the values would be clustered around the 12 feet mark, while the frequency of measurements would diminish away from this central value.

Figure 3.1: Normal Distributions

The curves illustrated in Figure 3.1 all have a normal distribution, even though they are not quite the same. You can see that they differ in terms of how spread out the scores are and how peaked they are in the middle. Under a normal distribution, the mean, median and mode are exactly the same. These are features of a normal curve. Indeed, many natural phenomena, such as the heights of adult males and the weights of eggs, tend to produce the 'normal' (or Gaussian) distribution, and more significantly, most sampling will do so as well, regardless of the distribution of the population. This is why it, and sampling, are so important in statistics.


The requirements of a normal distribution are not always met in research, especially when you are dealing with small sample sizes. If your sample size is less than 30, then reference to the normal distribution is not appropriate. It is generally found that the more scores from such variables that you plot, the more like the normal distribution they become. This can be seen in the following example. If you randomly selected 10 men and measured their heights in inches, the frequency histogram might be similar to Figure 3.2a. This histogram bears little resemblance to the normal distribution curve. If we were to select an additional 10 men and measure their heights, and then plot all 20 measurements, the resulting histogram (Figure 3.2b) would again not look like a normal distribution. However, you can see that as we select more and more men and plot their heights, the histogram becomes a closer approximation to the normal distribution (Figures 3.2c to 3.2e). By the time we have selected 100 men, you can see that the distribution is very close to normal.

Figure 3.2: Normal Distribution and Sample Size
[Source: Dancey and Reid, 2002, p. 64]


3.1 Measures of Dispersion

Although the different types of average can help to describe frequency distributions to a certain extent, they are of limited use on their own, and additional measures are often required to illustrate the full picture and to assess how much variation there is in our sample or population. This situation is best illustrated by a simple example. Two groups of 5 SEMAL students were asked to record their weekly beer consumption. The results in pints were as follows:

Group 1: 12, 12, 12, 12, 12
Group 2: 0, 5, 10, 15, 30

Passing over the obvious comment that the 2nd group appears to contain someone who isn't a SEMAL student, the arithmetic mean for both groups is 12. However, this result gives no indication of the basic differences between the two sets of values. Therefore, a measure of dispersion (or spread) can be used to express the fact that one set of values is constant while the other ranges over a wide scale. The following section will highlight a number of ways in which the level of variance within a sample or population can be assessed.

3.1.1 The Range

The least sophisticated measure of dispersion is the range of a set of values. The range is simply the difference between the highest and lowest values of a series. As such, it only tells us about two values, which may be atypical of the rest of the data set. In reference to our previous example, the ranges for the beer consumption of the two groups of SEMAL students are:

Group 1: 0
Group 2: 30

Remember the range is calculated by subtracting the minimum value from the maximum value. In this case:

Group 1: Max 12 − Min 12 = Range 0
Group 2: Max 30 − Min 0  = Range 30

Although the range tells us about the overall spread of scores, it does not give any indication of what is happening in between them. Ideally, we need an indication of the overall shape of the distribution and how much the scores vary from the mean. The range therefore gives only a crude indication of the spread of the scores, and tells us little about the overall shape of the distribution of the sample.



3.1.2 Quartile Deviation

The range, as a measure of dispersion, has the significant disadvantage of being susceptible to distortion by extreme values. One way of overcoming this is to ignore items in the top and bottom quarters of the distribution and to consider the range of the two middle quarters only. This is known as the interquartile range, since it is the difference between the first and third quartiles. The quartile deviation (semi-interquartile range) is one half of the interquartile range. For continuous data, the lower quartile (Q1) is determined by first ranking the data in order and then dividing the total sample number by 4. In the following example (see Figure 3.3), the lower quartile lies between the ages of the 2nd and 3rd visitors. Thus, the lower quartile value is 14 years (i.e. (13+15)/2). The upper quartile value is computed in a similar way, but by taking three quarters of the sample size (3n/4). Thus the upper quartile lies between the ages of the 7th and 8th visitors, giving a value of 18 years (i.e. (18+18)/2). To summarise, we can now state that one quarter of visitors were aged 14 years or under, while one quarter were aged 18 years or more. In addition, we can also quote the interquartile range by stating that 50% of the visitors were aged between 14 and 18 years of age.

Figure 3.3: Age Profile of Visitors to the Arun Youth Centre

10, 13, 15, 16, 16, 17, 18, 18, 18, 20

Lower quartile value = (13 + 15) / 2 = 14
Upper quartile value = (18 + 18) / 2 = 18

Effectively, the interquartile range is a refinement of the median and is most easily calculated from the cumulative frequency curve. In the Kano rainfall example, discussed in your descriptive statistics handout, the lower quartile is read off by tracing a line from the 25% level to the curve, and then down to the appropriate rainfall (about 725mm), and the upper quartile by reading across from the 75% level (to find about 1000mm). This means that over the period in question, half of the years had a rainfall of between 725mm and 1000mm, with the interquartile range itself therefore being 275mm.


To calculate the quartile ranges for grouped data, it is first necessary to calculate the cumulative frequencies as in the Kano example. When trying to calculate the quartile values of grouped data it is again necessary to make assumptions regarding the distribution of values within the class. In this instance it is assumed that the distribution is even and the lower quartile is calculated as follows:

Q1 = LCL(Q1) + [ (n/4 − cf(LC)) / f(Q1) ] × w(Q1)

Where:

Q1: the lower quartile
LCL(Q1): the lower class limit of the class containing the lower quartile
n: the sample size
cf(LC): the cumulative frequency of the class immediately below that containing the lower quartile
w(Q1): the width of the class interval containing the lower quartile
f(Q1): the frequency of the class interval containing the lower quartile

The calculation for the upper quartile is:

Q3 = LCL(Q3) + [ (3n/4 − cf(LC)) / f(Q3) ] × w(Q3)

In this case, Q3 reflects the relevant upper quartile values, which can be substituted into the description of terms stated for calculating Q1.
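For readers who want to check the arithmetic, the grouped-data formula translates directly into code. A minimal Python sketch, with arguments mirroring the terms defined above; the example figures at the bottom are hypothetical:

```python
def grouped_quartile(lcl, n, cf_below, f, w, quarter=1):
    """Quartile for grouped (classed) data, assuming values are
    evenly spread within the class containing the quartile.

    lcl      -- lower class limit of the class containing the quartile
    n        -- total sample size
    cf_below -- cumulative frequency of the class immediately below
    f        -- frequency of the class containing the quartile
    w        -- width of that class interval
    quarter  -- 1 for Q1, 3 for Q3
    """
    target = quarter * n / 4          # n/4 for Q1, 3n/4 for Q3
    return lcl + (target - cf_below) / f * w

# hypothetical illustration: 40 visitors; the lower quartile (10th value)
# falls in the 20-30 class, with 6 values below it and 8 values in the class
q1 = grouped_quartile(lcl=20, n=40, cf_below=6, f=8, w=10, quarter=1)
print(q1)  # 25.0
```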


3.1.3 Mean Deviation

Unlike the range, the mean deviation measures dispersion about a particular average, namely the arithmetic mean. It is the average (arithmetic mean) of all the deviations of values from the arithmetic mean, ignoring minus signs. If deviations are considered with plus and minus signs and are measured from the mean, then their total will be zero, by definition of the arithmetic mean. Basically, the mean deviation tells us the average distance by which all items in a data set differ from their mean. For example, for the beer drinking figures of the 2nd group of SEMAL students:

Value:          0    5    10   15   30
d (deviation): -12   -7   -2   +3   +18    (Σd = 0)
|d|:            12    7    2    3    18

[|d|, pronounced 'mod d', is the mathematical shorthand for saying: 'ignore minus signs'.]

The mean deviation = Σ|d| / n = (12 + 7 + 2 + 3 + 18) / 5 = 42 / 5 = 8.4

By ignoring minus signs, the mean deviation ignores the fact that some items are greater than the average and some less; consequently, this measure of dispersion gives no idea of the way the items are spread around the average.

3.1.4 Standard Deviation

The standard deviation is one of the most fundamental measures of dispersion used in statistical analysis. It measures the dispersion around the mean, but does so on the basis of the figures themselves, not just their rank order. Like the mean deviation, it is calculated from the deviations of each item from the arithmetic mean. To ensure that these deviations do not total zero, they are squared before being added together; this removes all minus signs, since two negative values multiplied give a positive value. Summing the squares of the deviations gives the 'sum of squares' (or sum of squared differences). The mean of the sum of squares is known as the 'variance', and the square root of the variance is the standard deviation. The standard deviation is symbolised by 's' for a sample and 'σ' for a population. For an ungrouped, discrete data series, the standard deviation can therefore be calculated as:

σ = √( Σ(x − x̄)² / n )

or alternatively,

σ = √( Σx² / n − (Σx / n)² )

Where:

σ: the standard deviation
Σ: 'the sum of'
x: the value
x̄: the mean
n: the number of values

The calculation of the standard deviation is illustrated in the following example:

WORKED EXAMPLE: The Calculation of the Standard Deviation

Values (x):   3      2      1      2      3      4      3      7      6      5      (Σx = 36)
(x − x̄):     -0.6   -1.6   -2.6   -1.6   -0.6   0.4    -0.6   3.4    2.4    1.4
(x − x̄)²:    0.36   2.56   6.76   2.56   0.36   0.16   0.36   11.56  5.76   1.96   (Σ(x − x̄)² = 32.4)

Step 1: First calculate the mean of the sample:

x̄ = Σx / n = 36 / 10 = 3.6

Step 2: Now calculate the standard deviation:

σ = √( Σ(x − x̄)² / n ) = √( 32.4 / 10 ) = 1.8

The standard deviation figure of 1.8 is useful as it provides an indication of how closely the scores are clustered around the mean. The value of the standard deviation is best understood when placed in the context of the normal distribution, where approximately 68% of all scores fall within 1 standard deviation of the mean. In this example, with a standard deviation of 1.8, this tells us that the majority of scores in this sample lie within 1.8 units above or below the mean (3.6 ± 1.8). The standard deviation is also useful when you want to compare samples using the same scale. For example, if we were to take a second sample of scores and calculate a standard deviation of 3.6, comparing this to the standard deviation from our first sample would indicate that scores in the first sample are more closely clustered around the mean value than scores in the second sample.
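If you want to verify these hand calculations, the following minimal Python sketch reproduces the range, mean deviation and standard deviation for the worked example above (note that it uses the population formula, dividing by n, to match the manual):

```python
from math import sqrt

x = [3, 2, 1, 2, 3, 4, 3, 7, 6, 5]
n = len(x)
mean = sum(x) / n                                # 3.6

value_range = max(x) - min(x)                    # 6
mean_dev = sum(abs(v - mean) for v in x) / n     # mean deviation
variance = sum((v - mean) ** 2 for v in x) / n   # 32.4 / 10 = 3.24
sd = sqrt(variance)                              # 1.8

print(mean, value_range, round(mean_dev, 2), sd)
```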


In conclusion, the standard deviation is a measure of dispersion which indicates the spread of the data values around the arithmetic mean. ‘Quoting the standard deviation of a distribution is a way of indicating a kind of ‘average’ amount by which all the values deviate from the mean. The greater the dispersion the bigger the deviations and the bigger the standard (average) deviation’ (Rowntree, 1981, p. 54, cited in Riley, M. et al, 1998, p. 197)


Activity 11:

To demonstrate your familiarity with basic measures of dispersion, calculate the mean, median, range and standard deviation for the following data sets, relating to the bedspace size of B&B establishments in Blackpool.

Sample A:
4 34 32 18 48 6 17 14 4 18
9 20 14 12 12 19 11 12 14 19
16 8 10 17 14 10 6 10 16 34
10 6 27 10 6 6 6 8 8 14

Results: Mean: ___  Median: ___  Range: ___  Standard Deviation: ___

Sample B:
34 34 32 18 48 16 17 50 4 18
72 20 14 12 12 19 11 12 14 19
11 38 19 17 14 10 50 62 16 34
10 6 27 10 34 32 6 8 8 23

Results: Mean: ___  Median: ___  Range: ___  Standard Deviation: ___

Sample C:
14 34 32 18 48 6 17 50 4 18
9 20 14 12 12 19 11 12 14 19
11 8 19 17 14 10 6 62 16 34
10 6 27 10 34 6 6 8 8 14

Results: Mean: ___  Median: ___  Range: ___  Standard Deviation: ___



3.2 Other Distributions

There are of course variations on the normal distribution. Distributions can vary depending on how flat or peaked they are. The degree of flatness or peakedness is referred to as the kurtosis of the distribution. If a distribution is highly peaked it is leptokurtic, and if the distribution is flat it is platykurtic. Leptokurtic distributions appear relatively thin and somewhat pointed; in contrast, platykurtic distributions are flatter, reflecting a greater number of scores in the tails of the distribution. A distribution between the extremes of peakedness and flatness is classed as mesokurtic (see Figure 3.4). In a normal distribution curve, the value of kurtosis is 0 (i.e. the distribution is mesokurtic). If a distribution has a value above or below 0, then this indicates a level of deviation from the normal curve. You don't need to worry about kurtosis too much at this point, but you will notice that when you produce descriptive statistics in SPSS, a value for kurtosis is given. Positive values of kurtosis indicate that the distribution is leptokurtic, whereas negative values suggest that the distribution is platykurtic (Dancey, 2002).

Figure 3.4: Examples of Leptokurtic, Platykurtic and Mesokurtic Distributions
[Source: Dancey and Reid, 2002, p. 70]


3.2.1 Skewed Distributions

Distributions can also be skewed (see Figure 3.5). A positive skew is when the peak lies to the left of the mean, and a negative skew when it lies to the right of the mean. The further the peak lies from the centre of the horizontal axis, the more skewed the distribution is said to be. If you come across badly skewed distributions, then you need to consider whether the mean is the best measure of central tendency, as the scores in the extended tail will distort your mean. As discussed in your descriptive statistics handbook, in such cases it might be more appropriate to use the median or the mode to give a more representative indication of the typical score in your sample. The SPSS output for descriptive statistics will also provide a measure of skewness. A positive value suggests a positively skewed distribution, whereas a negative value suggests a negatively skewed distribution. A value of zero indicates that the distribution is not skewed in either direction (i.e. the distribution is symmetrical).

Figure 3.5: Examples of Skewed Distributions

These refinements need not concern us here, but they will need consideration when it comes to deciding which statistical tests you wish to use to examine the data. For now, it is perhaps enough to make a distinction between the most powerful 'parametric' tests, which rely on the data concerned being normally distributed, and the less powerful 'non-parametric' ones, which do not. If you have control over the collection of your data you should do your best to collect data on which parametric tests can be conducted, but if you cannot ensure this quality, or need to use others' information, it may be better to use the less powerful tests.

3.3 The Standard Normal Distribution

The standard normal distribution (SND) is a probability distribution with a mean of 0 and a standard deviation of 1. The value of probability distributions is that there is a probability associated with each particular score from the distribution. More specifically, the area under the curve between any two specified points represents the probability of obtaining scores within those points. For example, the probability of obtaining scores between -1 and +1 standard deviations of the mean is about 68% (see Figure 3.6). This means that:

- 68.26% of observations fall within plus or minus one standard deviation of the mean;
- 95.44% of observations fall within plus or minus two standard deviations of the mean;
- 99.7% of observations fall within plus or minus three standard deviations of the mean.

These percentage values will be referred to later as 'confidence limits'.

Figure 3.6: The Standard Normal Distribution

Let me illustrate this through a specific example. Figure 3.7 illustrates the ages of tourists bungee jumping off a bridge at an extreme sports academy in New Zealand. There were 150 tourists in total, and the brave souls were most frequently aged between 26 and 30 (the highest bar). The graph also indicates that very few people over the age of 60 participate in bungee jumping (thank goodness for that!). If we think about this distribution as a probability distribution, we could start asking specific questions. For example, how likely is it that a 60 year old will undertake a bungee jump in New Zealand? A look at the distribution and your answer might be 'not very likely'. However, if you were asked how likely it is that a 30 year old went bungee jumping, your answer would be 'quite likely'. Indeed, the distribution shows that 30 of the 150 tourists were aged around 30 (equating to 20% of the total sample). Therefore, using this data it is possible to estimate the probability that a particular score will occur.

Figure 3.7: Tourists Bungee Jumping in New Zealand

Using the characteristics of the SND, it is possible to calculate the probability of obtaining scores within any section of the distribution. Statisticians (much cleverer than me) have calculated the probability of certain scores occurring in a normal distribution with a mean of 0 and a standard deviation of 1. If our sample shares these values, then we can use a table of probabilities for the normal distribution to assess the likelihood of a particular score occurring. In reality, however, the data we collect is unlikely to have a mean of 0 and a standard deviation of 1. But, as Field (2003) points out, any data set can be converted into one that does. First, to centre the data at zero, we take each score and subtract from it the mean of all the scores. Then we divide the resulting score by the standard deviation to ensure the data have a standard deviation of 1. The resulting scores are called z scores. The z score is expressed in standard deviation units: it tells us how many standard deviations above or below the mean our score is. A negative z score is below the mean and a positive z score is above the mean. Extreme z scores, for example greater than +2 or below -2, have a much smaller chance of being obtained than scores in the middle of the distribution. That is, the areas of the curve above +2 and below -2 are small in comparison with the area between -1 and +1 (see Figure 3.8).

Figure 3.8: Areas in the Middle and Extremes of the Standard Normal Distribution

Let us refer back to our example of bungee jumping in New Zealand, where we can now answer the question: what is the probability of someone over 60 doing a bungee jump? First we need to convert 60 into a z score. For this population the mean age is 32 and the standard deviation is 11:

z = (score − mean) / standard deviation

z = (60 − 32) / 11 = 2.54

This indicates that the score of 60 is 2.54 standard deviations above the mean. Consider another example. The mean score for many IQ tests is 100 and the standard deviation is 15. If you had an IQ score of 130, then your z score would be:

z = (130 − 100) / 15 = 2

This indicates that your score is 2 standard deviations above the mean. Using the z score we can also calculate the proportion of the population who would score above or below your score, or, in terms of the normal distribution, the area under the curve. Figure 3.9 illustrates that the IQ score of 130 is 2 standard deviations above the mean. The shaded area represents the proportion of the population who would score less than you, and the unshaded area represents those who would score more than you. To calculate the specific proportions, we refer to a standard normal distribution table (see Table 3.1). The table indicates that the proportion falling below your z score is 0.9772, or 97.72%. To find the proportion above your score, you simply subtract this proportion (0.9772) from 1; in this case the proportion is 0.0228, or 2.28%. When using statistical tables for the SND you should note that only positive z scores are given (those that fall above the mean). If you have a negative z score, disregard the negative sign to find the relevant areas above and below your score (Figure 3.10).

Figure 3.9: Normal Distribution Showing the Proportion of the Population with an IQ of Less Than 130 (97.72%)

Table 3.1: Z Scores for the Standard Normal Distribution

Figure 3.10: The Proportions of the Curve Below Positive z Scores and Above Negative z Scores

Let us now refer back to the z score calculated when asking about the probability of people over 60 bungee jumping in New Zealand. The calculated z score is 2.54. Refer to the table of probability values that has been included in the appendices. Look up the value of 2.54 in the column labelled 'smaller portion' (i.e. the area above the value 2.54). You should find that the probability value is 0.00554, or roughly a 0.55% chance that a person over 60 would bungee jump. By looking at the value in the 'bigger portion' column, we find that the probability of a jumper being under the age of 60 is 0.99446. Put another way, there is a 99.45% probability that the tourists jumping were aged below 60 (0.99446 = 1 − 0.00554). Certain z scores are particularly important, as their values cut off certain important percentages of the distribution. As Field (2003) highlights, the first important value is 1.96, as this cuts off the top 2.5% of the distribution, and its counterpart at the opposite end (-1.96) cuts off the bottom 2.5% of the distribution. As such, these values together cut off 5% of scores; put another way, 95% of z scores lie between -1.96 and 1.96. The other important scores are +/- 2.58 and +/- 3.29, which cut off 1% and 0.1% of scores respectively. Put another way, 99% of z scores lie between -2.58 and 2.58, and 99.9% of z scores lie between -3.29 and 3.29. These values will crop up time and time again; indeed, we have already referred to them when discussing the characteristics of the normal distribution curve.
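These table look-ups can be checked with Python's standard library. A minimal sketch using statistics.NormalDist (available from Python 3.8), applied to the bungee example's mean age of 32 and standard deviation of 11:

```python
from statistics import NormalDist

snd = NormalDist()                  # standard normal: mean 0, sd 1

# z score for a 60 year old in a population with mean 32, sd 11
z = (60 - 32) / 11                  # about 2.54

smaller_portion = 1 - snd.cdf(z)    # area above z: roughly 0.0055
bigger_portion = snd.cdf(z)         # area below z: roughly 0.9945
print(round(z, 2), round(smaller_portion, 5), round(bigger_portion, 5))

# the 'important' cut-offs: ~95%, ~99% and ~99.9% of z scores
for z_crit in (1.96, 2.58, 3.29):
    print(z_crit, round(snd.cdf(z_crit) - snd.cdf(-z_crit), 4))
```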

3.4 Confidence Intervals

Although the sample mean is an approximation of the population mean, we are not sure how good an approximation it is. Because the sample mean is a particular value, or point, along a variable, it is known as a point estimate of the population mean. It represents one point on a variable, and because of this we do not know whether our sample mean is an underestimate or an overestimate of the population mean. We can therefore use confidence intervals to help us identify where on the variable the population mean may lie. Confidence intervals of the mean are interval estimates of where the population mean may lie; they provide us with a range of scores (an interval) within which we can be confident that the population mean lies (see Figure 3.11). Because we are still only using estimates of population parameters, it is not guaranteed that the population mean will fall within this range; we therefore have to give an expression of how confident we are that the calculated range contains the population mean. Hence the term 'confidence intervals'.

Figure 3.11: The role of confidence intervals in determining the position of the population mean in relation to the sample mean


We have already discussed the characteristics of the distribution of sample means: it tends to be normally distributed, and its mean is a good approximation of the population mean. Using the basic characteristics of the normal distribution allows us to estimate how far our sample mean is from the population mean. As shown in Figure 3.12, we know that the sample mean is going to be a certain number of standard deviations above or below the population mean. Indeed, we can be 99.7% certain that the sample mean will fall within -3 and +3 standard deviations. As discussed earlier, this area accounts for most of the scores in the distribution. If we wanted to be 95% certain that a certain area of the distribution contained the sample mean, we would refer back to the z scores. As highlighted earlier, 95% of the area under the SND falls within -1.96 and +1.96 standard deviations. Thus we can be 95% certain that the sample mean will lie within -1.96 and +1.96 standard deviations of the population mean (see Figure 3.13).

Figure 3.12: The Sample Mean Lies a Certain Number of Standard Deviations Above or Below the Population Mean

Figure 3.13: Percentage of curve (95%) falling between -1.96 and +1.96 S.Ds


For illustration, assume that the sample mean is somewhere above the population mean. If we draw the distribution around the sample mean instead of the population mean, we see the situation in Figure 3.14.

Figure 3.14: Location of the Population Mean Where the Distribution is Drawn Around the Sample Mean

Applying the same logic, we can be confident that the population mean falls somewhere within 1.96 standard deviations below the sample mean. Similarly, if the sample mean is below the population mean, we can be confident that the population mean is within 1.96 standard deviations above the sample mean (Figure 3.15). We can therefore be 95% confident that the population mean is within the region 1.96 standard deviations above or below the sample mean. With this information we can now calculate how far the sample mean is from the population mean. All we need to know is the sample mean and the standard deviation of the sampling distribution of the mean (the standard error).

Figure 3.15: Distribution Drawn Around the Sample Mean When it Falls Below the Population Mean

Activity 12:

Use the following table to calculate:

a] the probability that z is less than or equal to 0.7;
b] the probability that z is more than 0.7;
c] the probability that z is less than or equal to 2 and equal to or more than -2;
d] the probability that z is less than or equal to 3 and equal to or more than -3.

Record your answers below:


3.5 The Standard Error

One useful adjunct of the normal distribution is the standard error, or the standard deviation of the sampling distribution of the mean, which can be helpful in gauging the precision of your sample, and in deciding from a pilot study how large your eventual sample should be. The standard error is a measure of the degree to which the sample means deviate from the mean of the sample means. Given that the mean of the sample means is also a close approximation of the population mean, the standard error must also tell us the degree to which the sample means deviate from the population mean. Consequently, once we are able to calculate the standard error, we can use this information to find out how good an estimate our sample mean is of the population mean. This is illustrated in Figure 3.16.

Figure 3.16: Calculating the Standard Error
[Source: Field, A., 2003, p. 16]


Figure 3.16 illustrates the process of taking samples from a population; in this case Field (2003) is looking at the ratings of lecturers. If we were to take the ratings of all lecturers, the mean value would be 3. As illustrated in Figure 3.16, each sample has its own mean value, and these have been presented in a frequency chart. As you can see, some samples have the same mean as the population, some are lower and some are higher. These differences reflect sampling variation. The end result is a symmetrical distribution, known as a sampling distribution (Field, 2003). If we were to take the average of all the sample means, we would get the same value as the population mean. But how representative is the population mean? We used the standard deviation as a measure of how representative the mean was of the observed data. If we were to measure the standard deviation between sample means, this would give a measure of how much variability there was between the means of the different samples. The standard deviation of the sample means is known as the standard error of the mean. The standard error is very similar to the standard deviation, but takes account of sample size: the larger the sample size, the lower the standard error.

SE(mean) = Standard Deviation of the Sample (s) / √Sample Size (n)

Dividing the standard deviation by the square root of the sample size takes account of the fact that the larger the sample size, the more likely it is that the sample is representative, and vice versa. The probability of the sample mean being close to the population mean can be calculated for any confidence level, but for our purposes we will only estimate the population mean with 95% probability, which corresponds to roughly two standard errors either side of the mean. For example, in investigating the geography of sport in Lancashire, you might want to find out how far Warrington's supporters travelled to the match. From sampling the crowd you might find a mean of 23km travelled to Wilderspool, and a standard error of 3km. This means that your sampling suggests (with 95% certainty) that the mean distance which supporters of Warrington RLFC travelled to the match is 23km ± 6km. This does not mean that 95% of supporters travel between 17km and 29km; rather, it is a measure of the confidence with which you state the mean. You can be pretty certain that if you sampled the crowd twenty times, nineteen of your answers would be within this range.


The following example highlights a practical application of the standard error in attempting to assess the mean spending of short-break holidaymakers in Chichester.

WORKED EXAMPLE: Calculating the Standard Error

In this example the following visitor spending (£) results were obtained:

Values (x):   109    97     112    156    86     94     176    158    147    135    (Σx = 1270)
(x − x̄):     -18    -30    -15    29     -41    -33    49     31     20     8
(x − x̄)²:    324    900    225    841    1681   1089   2401   961    400    64     (Σ(x − x̄)² = 8886)

Step 1: First calculate the mean of the sample:

x̄ = Σx / n = 1270 / 10 = 127

Step 2: Now calculate the standard deviation:

σ = √( Σ(x − x̄)² / n ) = √( 8886 / 10 ) = 29.809

Step 3: Now calculate the standard error:

SE = s / √n = 29.809 / √10 = 29.809 / 3.16 = 9.43

In this example, the standard error has been calculated at 9.43. With reference back to the properties of the normal distribution curve, we can conclude that 68 times out of 100 (or approximately 2 in 3 times) the true mean of the population is likely to lie within the range 127 ± 9.43, that is, between £117.57 and £136.43 (the mean ± 1 x standard error (SE)). If we wish to predict the range with greater confidence, then the rule of plus or minus two standard errors can be applied to give a 95% confidence level. In this case the true mean of the population would lie within the range 127 ± 18.86, that is, between £108.14 and £145.86 (the mean ± 2 x SE).


In the above example, the selected standard errors equated to critical z values of 1.0 and 2.0. These values help to establish and define the 'confidence limits'. As discussed, these limits are usually described in percentage rather than absolute terms, and you would therefore refer to the 68.2%, 95.4% and 99.7% confidence levels. For the 95% (0.95) and 99% (0.99) levels (the percentage values have been rounded for convenience) the critical z values are 1.96 and 2.58 respectively. Therefore, referring back to our previous example, we can redefine our confidence limits and the ranges in which we would expect the mean value of the population to lie. Using z = 2, the 95% limits were given by:

127 ± (2 x 9.43) = £108.14 to £145.86

If we adopt the exact critical z value for the 95% confidence level, the limits become:

127 ± (1.96 x 9.43) = £108.52 to £145.48

If we adopt the critical z value for the 99% confidence level, the limits are:

127 ± (2.58 x 9.43) = £102.67 to £151.33

Effectively, higher confidence levels can only be achieved at the expense of wider confidence intervals. We can be 99% certain that the population mean lies between £102.67 and £151.33, but only 95 per cent confident that it lies between the narrower bands of £108.52 and £145.48. Clearly the best way to gain greater accuracy in sample estimates is to increase the sample size (n). As the sample size increases, the standard error, or spread, of the sampling distribution is reduced and the resulting confidence intervals are narrowed. Referring back to our previous example on visitor spending, increasing the sample size to 100 yields the following results:

Mean: £127
Std Dev: 29.95
Standard Error: 2.99

Adopting the same confidence limits as before, we can now be 95% confident that the population mean lies between:

127 ± (1.96 x 2.99) = £121.14 to £132.86

and 99% confident that the population mean lies between:

127 ± (2.58 x 2.99) = £119.29 to £134.71

As you can clearly see, increasing the sample size has significantly reduced the width of the confidence intervals.
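The whole chain of calculations in this section (mean, standard deviation, standard error and confidence limits) can be scripted. A minimal Python sketch reproducing the visitor spending example above:

```python
from math import sqrt

spend = [109, 97, 112, 156, 86, 94, 176, 158, 147, 135]
n = len(spend)

mean = sum(spend) / n                               # 127.0
sd = sqrt(sum((v - mean) ** 2 for v in spend) / n)  # 29.809 (population formula)
se = sd / sqrt(n)                                   # 9.43

for label, z in (("95%", 1.96), ("99%", 2.58)):
    lower, upper = mean - z * se, mean + z * se
    print(f"{label} confidence limits: £{lower:.2f} to £{upper:.2f}")
```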


This is graphically illustrated in Figure 3.17 below.

Figure 3.17: Confidence Intervals and Sample Size

a] Sample size of 10: 95% confidence interval £108.52 to £145.48 (range = 36.96), sample mean of 127
b] Sample size of 100: 95% confidence interval £121.14 to £132.86 (range = 11.72), sample mean of 127

As is evident in Figure 3.17, increasing the sample size results in a much narrower range of scores and gives us a much clearer indication of where the population mean may be. This in turn underlines the importance of sample size when trying to estimate population parameters from sample statistics. Generally the larger the sample size, the better the estimate of the population we can get from it.

Activity 14:

Refer back to the data in Activity 11. This time calculate the standard error for each sample, and the standard error ranges at 95% and 99% (using z scores).

Sample A:
4 34 32 18 48 6 17 14 4 18
9 20 14 12 12 19 11 12 14 19
16 8 10 17 14 10 6 10 16 34
10 6 27 10 6 6 6 8 8 14

Results: Standard Error: ___
The standard error range at 95%: Lower: ___ Upper: ___
The standard error range at 99%: Lower: ___ Upper: ___

Sample B:
34 34 32 18 48 16 17 50 4 18
72 20 14 12 12 19 11 12 14 19
11 38 19 17 14 10 50 62 16 34
10 6 27 10 34 32 6 8 8 23

Results: Standard Error: ___
The standard error range at 95%: Lower: ___ Upper: ___
The standard error range at 99%: Lower: ___ Upper: ___

Sample C:
14 34 32 18 48 6 17 50 4 18
9 20 14 12 12 19 11 12 14 19
11 8 19 17 14 10 6 62 16 34
10 6 27 10 34 6 6 8 8 14

Results: Standard Error: ___
The standard error range at 95%: Lower: ___ Upper: ___
The standard error range at 99%: Lower: ___ Upper: ___


3.6 Looking at Distributions in SPSS

As discussed in this handbook, SPSS will produce basic descriptive statistics for dispersion via the Descriptives dialog box; refer back to your descriptive statistics section for guidance. Statistics for variance can also be created via the Explore dialog box. The following example uses the Age variable in the Dataset file. Move the mouse over Analyse and press the left mouse button. Move the mouse over Descriptive Statistics and then over Explore and press the left mouse button.

The Explore dialog box appears.

Select Age and click the central arrow so that Age appears in the Dependent List.




Move the mouse over Statistics and press the left mouse button. The Explore: Statistics dialog box opens. At this point we can assign a confidence interval for the mean (as discussed in the previous sections). Make sure that the confidence interval is set to 95%. Click Continue. This returns you to the Explore dialog box. Click OK.

A summary table is produced in the output window.


This summary table provides you with basic descriptive statistics, including the mean and the median, and measures of dispersion, including the range, standard deviation and standard error. The output also provides the confidence interval at 95% (47.07 to 48.34). Note that Age is measured at the ratio level; the mean would not be appropriate for ordinal or nominal data sets.

3.7 Graphically Looking at Distributions in SPSS

Refer back to your descriptive statistics handbook for information on how to produce basic frequency histograms, stem and leaf plots and box plots. SPSS will also allow you to plot the normal distribution curve over a frequency histogram, so you can ascertain how the distribution of your sample relates to the normal distribution. The following example again uses the Age variable in the Dataset file. Move the mouse over Graphs and then Chart Builder and press the left mouse button.


The Chart Builder dialog box appears.


Select Histogram in the Choose From: box. A series of charts is presented.


Move the mouse over the Simple Histogram and, holding the left mouse button down, drag it into the chart window. Release the left mouse button and a simple histogram is presented. An Element Properties dialog box also appears, and we will return to this shortly.

You will notice that the histogram presents options for the vertical Y-axis and the horizontal X-axis. In this case we need to assign Age to the X-axis. The vertical Y-axis will be frequency, which SPSS defaults to automatically.


Move the mouse over Age in the Variables box and, holding down the left mouse button, drag Age over to the X-Axis box. Release the left mouse button and Age is assigned to the X-axis of the histogram.

We now need to assign a normal distribution curve to the histogram. Shift your attention to the accompanying Element Properties dialog box. Select Display normal curve in the dialog box and click Apply.

Notice that a normal distribution curve has been superimposed on top of the histogram in the Chart Builder window. Click OK in the Chart Builder window.


A frequency histogram is produced in the output window, and a normal distribution curve has been plotted on it. As you can see from this output, the Age variable bears some resemblance to the normal distribution, although the overall shape of the curve is influenced by a number of outlying values.
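If you ever need the same kind of plot outside SPSS, a histogram with a superimposed normal curve can be produced with numpy and matplotlib. A minimal sketch; the age data below are randomly generated stand-ins for the Age variable:

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical stand-in for the Age variable in the Dataset file
ages = np.random.default_rng(1).normal(loc=47.7, scale=10, size=300)

mu, sigma = ages.mean(), ages.std()
x = np.linspace(ages.min(), ages.max(), 200)
normal_curve = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

plt.hist(ages, bins=20, density=True, alpha=0.6)  # density=True puts the
plt.plot(x, normal_curve)                         # histogram on the same
plt.xlabel("Age")                                 # scale as the curve
plt.ylabel("Density")
plt.show()
```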


As before, we can also use the Split File option to look at specific cases. For example, here Area has been selected and two separate distribution curves for the Chichester and Arun Districts have been produced.


Activity 15:

Table 15: Descriptive Statistics for GTBSscore08

GTBSscore08: please cut and paste your histogram below and rescale accordingly.

Descriptive statistics to record: Mean, Median, Mode, Standard Deviation, Standard Error, Skewness, Kurtosis

Please provide a brief summary of the distribution:

Table 16: Descriptive Statistics for GTBSscore08 - Chichester District

GTBSscore08 (Chichester District): please cut and paste your histogram below and rescale accordingly.

Descriptive statistics to record: Mean, Median, Mode, Standard Deviation, Standard Error, Skewness, Kurtosis

Please provide a brief summary of the distribution:


Activity 15 (continued):

Table 17: Descriptive Statistics for GTBSscore08 - Arun District

GTBSscore08 (Arun District): please cut and paste your histogram below and rescale accordingly.

Descriptive statistics to record: Mean, Median, Mode, Standard Deviation, Standard Error, Skewness, Kurtosis

Please provide a brief summary of the distribution:

Repeat this exercise for an additional variable (which should be interval or ratio in nature). Record your results by cutting and pasting your output into your log book.


Notes:



Section 4: Student T-Test, Paired Samples T-Test, Mann Whitney and Wilcoxon

Learning Outcomes

At the end of this session, you should be able to:

- Understand the rationale for the use of parametric and non-parametric tests
- Examine the relationship between variables using parametric and non-parametric tests, constructing suitable null and alternative hypotheses
- Apply the procedure for conducting parametric and non-parametric tests in SPSS in relation to the Student T-Test, the Paired Samples T-Test, Mann Whitney and Wilcoxon
- Interpret computer-generated SPSS output in relation to the above tests

4.0 Introduction

Statistical tests are used to make deductions about a particular data set or about relationships between different data sets. For example, you might have interviewed a random sample of 50 households from two rural villages in West Sussex to compare whether income levels are different. In village A, you calculate the mean income to be £17,650 and for village B, £22,200. In this instance, a statistical test can be used to determine whether we have a real difference or whether the difference could have occurred purely by chance. There are a wide variety of statistical tests, each designed to take account of the different characteristics of the data sets you may wish to examine. The choice of test can prove overwhelming, and indeed frightening, at first. At the most basic level, the principal distinction drawn between different statistical tests is whether they are 'parametric' or 'non-parametric' tests. Parametric tests can only be performed where the data conform to a normal distribution and are of an interval or ratio nature. In contrast, non-parametric tests involve less rigorous conditions and can be used on lower-level data which do not conform to a normal frequency distribution.

4.1 Null and Alternative Hypotheses

Before conducting a statistical test it is first necessary to establish a hypothesis, or statement, which the test then challenges. These hypotheses are referred to as the null hypothesis (H0) and the alternative or research hypothesis (H1). The null hypothesis is usually expressed as H0: μ1 = μ2, where μn is the mean for each group and the subscript n denotes the group. When stating a null hypothesis, the normal procedure is to start by assuming that there is no real difference between your data sets. A statistical test effectively helps the researcher to decide whether or not the null hypothesis is true, or more precisely, whether or not it should be accepted. If the result of the test shows that the null hypothesis should not be accepted and that it should be rejected, we can then go on to say, with some degree of confidence, that a difference does exist or a change has occurred (Riley, M. et al, 1998, p. 203). It is important that you express both H0 and H1 in the context of your own research problem before collecting your data and before starting your analysis. In reference to the rural income example quoted above, we could formulate the following hypotheses:

H0: μa = μb    There is no significant difference between the mean income of households in village A as compared with the mean income of households in village B; mean household income is not influenced by geographical location.

H1: μa ≠ μb    There is a significant difference in the mean household income for households in village A as compared with village B; mean household income is influenced by geographical location.

To determine whether or not your sampled data sets are consistent with the null hypothesis or the alternative hypothesis, we need to perform a probability-based significance test. However, before such a test is conducted, we must determine how big any difference has to be to be considered real, beyond that expected due to chance.
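As a preview of the tests covered later in this section, the village income comparison could be run as an independent-samples t-test. A minimal sketch using scipy; the income arrays are invented for illustration:

```python
from scipy import stats
import numpy as np

rng = np.random.default_rng(42)

# hypothetical income samples for the two villages (50 households each)
village_a = rng.normal(17650, 4000, 50)
village_b = rng.normal(22200, 4000, 50)

# two-tailed independent-samples t-test: H0 is that the means are equal
t_stat, p_value = stats.ttest_ind(village_a, village_b)

if p_value < 0.05:
    print(f"p = {p_value:.4f}: reject H0 at the 5% level")
else:
    print(f"p = {p_value:.4f}: fail to reject H0")
```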



4.2 Hypothesis Testing

Most tests follow the same basic logic, in that a research hypothesis (your alternative hypothesis) predicts a difference in distributions, whereas a null hypothesis predicts that they are the same. For each significance test, we can produce a probability distribution of a test statistic, termed a sampling distribution under the null hypothesis, calculated on the basis that the null hypothesis is true. A simple example relating to the probability distribution curve of the Student's t-statistic is shown in Figure 4.1.

Figure 4.1: Rejection Region for a Probability Distribution

A large visible difference between data sets corresponds to a probability towards the tails of the distribution, meaning that such a difference is unlikely to have occurred by chance. We can determine whether any difference between the data sets is large enough to be real by determining whether it falls in the tails of the distribution. We define a critical or rejection region as that part of the probability distribution beyond a critical value of a test statistic at a certain probability (see Figure 4.1). We compare this critical value with the calculated test statistic. If the calculated statistic is greater than the critical value, and therefore falls within the rejection region, the difference in the data is unlikely to have occurred by chance. Consequently, we can reject the null hypothesis and accept the alternative hypothesis. However, if the calculated value does not fall within the rejection region, this does not prove the truth of the null hypothesis, but merely fails to reject it. The size of the rejection region is determined by the significance level. Significance levels allow the researcher to state whether or not they believe a null hypothesis to be true with a given level of confidence. Significance levels are presented in statistical tables as probability values, normally expressed in decimal terms, i.e. 0.05 (5%, p = 0.05, or 1 in 20) and 0.01 (1%, p = 0.01, or 1 in 100). The value 0.05 indicates the 95 per cent confidence limit and represents the minimum limit for deciding whether or not a particular result is significant and whether or not the null hypothesis should be accepted or rejected. Anything lower than the 95 per cent confidence level, that is where the level is computed to be 94 per cent or less, means the null hypothesis is normally accepted and the result is regarded as not significant. If the significance level is found to be higher, that is, it indicates a confidence level of 95 per cent or more has been achieved, then we say the observed change or difference is significant. By selecting either a 5% or a 1% significance level, what we are saying is that we are willing to accept either a 5% or a 1% chance of making an error in


rejecting the null hypothesis when it is in fact true; this is known as a Type I error. A Type II error represents the probability of not rejecting the null hypothesis when it is in fact false. SPSS will report the significance of the calculated test statistic in terms of a probability value p. Where p<0.05, this would indicate a significant result at a 0.05 (5%) level, and p<0.01 would indicate a significant result at a 0.01 (1%) level, often termed 'highly significant'.

4.3 One and Two-Tailed Tests

When conducting a statistical test at a given significance level, it is important to consider how the hypothesis is worded, as this will create the conditions for either a one-tailed or a two-tailed test. Any statement including terms such as 'reduces' or 'increases', 'no lower' or 'no higher', implies a specific direction and consequently forms the basis of a one-tailed test. In contrast, any statement indicating no direction ('no different'/'no effect') forms the basis of a two-tailed test. In relation to the rural income example stated above, we would therefore perform a two-tailed test. This is because we would have to allow for the average income for village B to be either larger or smaller than that for village A. We could, however, have chosen a slightly different alternative hypothesis, for example:

H1: μa > μb

The mean household income for households in Village A is significantly larger as compared with Village B; mean household income is influenced by geographical location.

This is termed a one-tailed test as we are only interested in a difference in one direction, in this case positive differences (larger). As a result, the rejection region must be concentrated at one end of the distribution (hence the term one-tailed). For the sample mean to be larger than the population mean, the rejection region must lie at the positive end of the x-axis. The choice of a two-tailed or one-tailed test will determine the distribution of the rejection region. This will now be discussed in the following section.



4.3.1 Significance Levels and One and Two-Tailed Predictions

The relationship between significance levels and one and two-tailed predictions is explained by Hinton (2004) in the following extract:

When we undertake a one-tailed test we argue that if the test score has a probability lower than the significance level then it falls within the tail-end of the known distribution we are interested in. We interpret this as indicating that the score is unlikely to have come from a distribution the same as the known distribution, but from a different distribution. If the score arises anywhere outside this part of the tail cut off by the significance level, we reject the alternative hypothesis. This is shown in Figure 4.2. Notice that this shows a one-tailed prediction that the unknown distribution is higher than the known distribution.

Figure 4.2: A One-Tailed Prediction and the Significance Level

With a two-tailed prediction, unlike the one-tailed, both tails of the known distribution are of interest, as the unknown distribution could be at either end. However, if we set our significance level so that we take the 5 per cent at the end of each tail, we increase the risk of making an error. Recall that we are arguing that, when the probability is less than 0.05 that a score arises from the known distribution, then we conclude that the distributions are different. In this case the chance that we are wrong, and the distributions are the same, is less than 5 per cent. If we take 5 per cent at either end of the distribution, as we are tempted to do in a two-tailed test, we end up with a 10 per cent chance of an error, and we have increased the chance of making a mistake. We want to keep the risk of making an error down to 5 per cent overall, as otherwise there will be an increase in our false claims of differences in distributions, which can undermine our credibility with other researchers, who might stop taking our findings seriously. When we gamble on the unknown distribution being at either tail of the known distribution, to keep the overall error risk to 5 per cent, we must share our 5 per cent between the two tails of the known distribution, so we set our confidence level at 2.5 per cent at each end. If the score falls into one of the 2.5 per cent tails we then say it comes from a different distribution. Thus, when we undertake a two-tailed prediction, the result has to fall within a smaller area of the tail compared to a one-tailed prediction before we claim that the distributions are different, to compensate for hedging our bets in our prediction. This is shown in Figure 4.3.




Figure 4.3: A Two-Tailed Prediction and the Significance Level

[Extract taken from Hinton, P. (2004), Statistics Explained, Routledge, London]

The changes in the critical values between one- and two-tailed tests have important consequences, because it is possible for H0 to be accepted if the test is two-tailed but rejected if it is one-tailed. This happens with z values within the range 1.645 to 1.96: a test statistic of, say, 1.75 falls outside the two-tailed rejection region but within the one-tailed rejection region. Consequently, the phrasing and justification of the alternative hypothesis should be formulated with considerable care. Although the method for calculating the test statistic is not influenced by the wording of the hypotheses, the effect of stating a direction is to impose a more rigorous test, which in turn affects the significance level that can be quoted. By stating a direction in the hypotheses we are effectively establishing a more precise test.

Table 4.1: Critical z Values for the 0.01 and 0.05 Rejection Regions for One- and Two-Tailed Tests

Tailedness         0.05 Level          0.01 Level
One-tailed test    -1.645 or +1.645    -2.33 or +2.33
Two-tailed test    -1.96 or +1.96      -2.58 or +2.58
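As a quick check on Table 4.1, these critical values can be recovered from the standard normal distribution. The short Python sketch below is an illustrative aside (it is not part of the SPSS workflow and assumes the scipy library is available):

    from scipy.stats import norm

    # One-tailed test: all of the rejection region (e.g. 5%) sits in one tail.
    print(norm.ppf(0.95))   # 1.645 -> critical z at the 0.05 level
    print(norm.ppf(0.99))   # 2.326 -> critical z at the 0.01 level

    # Two-tailed test: the rejection region is split between the two tails,
    # so only 2.5% (or 0.5%) sits in each tail.
    print(norm.ppf(0.975))  # 1.960 -> critical z at the 0.05 level
    print(norm.ppf(0.995))  # 2.576 -> critical z at the 0.01 level

Note how the two-tailed values are larger: splitting the rejection region between the tails is exactly the "sharing of the 5 per cent" described in the Hinton extract above.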



4.4 Choosing the Right Test

The main motivation for choosing a statistical test to apply to a set of data has to be driven ultimately by the objectives of your research project. Indeed, your project should have been designed and data sampled with a certain test or set of tests in mind (Kitchin and Tate, 1999). When deciding upon a particular test, you need to consider the nature and characteristics of the data sets that you are investigating and, in particular, whether they will allow the use of parametric or non-parametric tests. The common characteristics of both parametric and non-parametric tests are listed in Table 4.2. Table 4.3 also provides a useful framework to help you choose the correct test.

Table 4.2: Common Characteristics of Parametric and Non-Parametric Tests

Parametric Tests:
• Independence of observations, except where the data are paired
• Random sampling of observations from a normally distributed population
• Interval scale measurement (at least) for the dependent variable
• A minimum sample size of about 30 per group is recommended
• Equal variances of the populations from which the data are drawn
• Hypotheses are usually made about the mean (μ) of the population

Non-Parametric Tests:
• Independence of randomly selected observations, except when paired
• Few assumptions concerning the distribution of the population
• Ordinal or nominal scale of measurement
• Ranks or frequencies of data are the focus of tests
• Hypotheses are posed regarding ranks, medians or frequencies
• Sample size requirements are less stringent than for parametric tests

[Kitchin and Tate, 1999, p. 113]



Table 4.3: Identifying the Right Test

Question 1: What combination of variables have you?
• Two categorical → Chi-Square
• Two separate continuous → Go to question 2
• Two continuous which is the same measure administered twice → Go to question 2
• Two continuous which is the same measure administered on three occasions or more → Go to question 2
• One categorical and one continuous → Go to question 2

Question 2: Should your continuous data be used with parametric tests or non-parametric tests?
• Two separate continuous: Parametric → Pearson; Non-parametric → Spearman
• Same measure administered twice: Parametric → Related t-test; Non-parametric → Wilcoxon signed-ranks
• Same measure administered on three occasions or more: Parametric → ANOVA (within subjects); Non-parametric → Friedman test
• One categorical and one continuous: Parametric → Go to question 3; Non-parametric → Go to question 3

Question 3: How many levels has your categorical variable?
• 2 levels, parametric → Independent-samples t-test
• 3 or more levels, parametric → ANOVA (between subjects)
• 2 levels, non-parametric → Mann-Whitney U
• 3 or more levels, non-parametric → Kruskal-Wallis

[Source: Maltby & Day, 2002]



4.5 Parametric Tests

4.5.1 T-Test or Student's T-Test

The t-test is most useful for testing whether or not a significant difference exists between the means of two samples, or alternatively, whether or not two samples come from one population. There are two principal versions of the t-test. One relates to samples involving independent data sets and the other to samples which involve paired comparisons. In both cases, the data must be ratio or interval in nature, randomly chosen and normally (or near normally) distributed. The variances of the two data sets should also be similar. Where there is doubt over the frequency distribution and the values of the variances that may jeopardise the accuracy of the test, alternative and less refined non-parametric tests should be used.

4.5.2 T-Test for Independent Samples

In this instance, the t-test compares two unrelated data sets by inspecting the amount of difference between their means and taking into account the variability of each data set. The larger the difference in the means, the more likely it is that a real, significant difference exists and that our samples come from different populations (see Figure 4.5).

Figure 4.5: Differences in Means and Populations

The following section will illustrate how to use SPSS to conduct a student t-test using variables from the Dataset file.



4.6 Using SPSS to Calculate the Student T-Test

The aim of the following section is to demonstrate how to use SPSS to perform the unrelated and related t-test. As already mentioned in this section, the t-test is most useful for testing whether or not a significant difference exists between the means of two samples, or alternatively, whether or not two samples come from one population. There are two principal versions of the t-test. One relates to samples involving independent data sets, and the other to samples which involve paired comparisons. In both cases, the data must be of interval nature, randomly chosen and normally (or near normally) distributed. The variances of the two data sets should also be similar. Where there is doubt over the frequency distribution and the values of the variances that may jeopardise the accuracy of the test, alternative and less refined non-parametric tests should be used.

To begin, open SPSS and open the Dataset file that you have used in previous sessions. We are going to use the Student t-test to examine the relationship between different variables. Let us consider a potential research scenario to help you place the use of the Student t-test in context.

Scenario:

As part of the bidding process to Tourism South East for future tourism funding, local tourism officers have to demonstrate if there is a significant difference in turnover between businesses in the Arun and Chichester Districts.

Variables:

We are therefore going to examine if there is a relationship between Area and Turnover08.

Before we start, we first need to establish a null and alternative hypothesis. In this case:

The Null Hypothesis: Ho: μa = μb

There is no significant difference in Turnover between Areas; business turnover is not influenced by location.

The Alternative Hypothesis: H1: μa ≠ μb

There is a significant difference in Turnover between Areas; business turnover is influenced by location.



4.6.1 T-Test for Independent Samples

To perform the unrelated t-test for two independent samples, first move the mouse over Analyse and press the left mouse button. Move the mouse over Compare Means and then over Independent-Samples T Test.

The Independent-Samples T Test dialog box appears.




Move the mouse over the variable Turnover08 and press the left mouse button. Move the mouse over the centre arrow and press the left mouse button so that the variable Turnover08 appears in the Test Variable(s) box.

Select the variable Area and press the lower arrow so that Area appears in the Grouping Variable box.

Move the mouse over Define Groups and press the left mouse button. The Define Groups dialog box appears. In the box beside Group 1: type 1, and in the box beside Group 2: type 2. Note that in this case the groups have been defined in terms of their two codes (1 = Chichester District and 2 = Arun District). A value can also be used as a cut-off point, at or above which all the values constitute one group while those below form the other group. In this instance the cut-off point would be 2, placed in parentheses after the grouping variable. Move the mouse over Continue and press the left mouse button. This will return you to the Independent-Samples T Test dialog box. Move the mouse over OK and press the left mouse button. SPSS performs the test and displays the results in the Output window.




In this case the following output is produced:


You are now wondering what this all means. Let us start by referring back to our null and alternative hypotheses.

In this case:

The Null Hypothesis: Ho: μa = μb

There is no significant difference in Turnover between Areas; business turnover is not influenced by location.

The Alternative Hypothesis: H1: μa ≠ μb

There is a significant difference in Turnover between Areas; business turnover is influenced by location.

The second subtable in the output provides the information we need by tabulating the value of t and its p-value (Sig. (2-tailed)), together with the 95% Confidence Interval of the Difference, for both Equal variances assumed and Equal variances not assumed. The key to which row to use lies in the first two columns, labelled Levene's Test for Equality of Variances, which is a test for the homogeneity of variance assumption of a valid t-test. One of the criteria for using a parametric t-test is the assumption that both populations have equal variances. If the test statistic F is significant, Levene's test has found that the two variances do differ significantly, in which case we must use the bottom row of values. Provided the test is not significant (p>0.05), the variances can be assumed to be homogeneous and the Equal variances assumed row of values for the t-test can be used. As Kinnear and Gray (1999) point out:

• If p > 0.05, then the homogeneity of variance assumption has not been violated and the normal t-test based on equal variances (Equal variances assumed) is used (the top line).

• If p < 0.05, then the homogeneity of variance assumption has been violated and the normal t-test based on equal variances should be replaced by one based on separate variance estimates (Equal variances not assumed) (the bottom line).




In this example, the Levene test is significant (p = 0.041, which is less than 0.05), so the t value calculated with separate variance estimates (Equal variances not assumed) is appropriate.

The results are relatively straightforward. The table includes the t-statistic, the degrees of freedom, and the two-tailed probability of the former being equalled or exceeded by chance alone (Sig.). This form of output does not give the critical t-value that must be exceeded for the null hypothesis to be rejected; Sig. is therefore of great importance in this and the other tests in whose output it is commonly listed. It allows us to dispense with tables of critical values: if this probability value is equal to or less than the selected significance level, the null hypothesis must be rejected. In this case, the test produces a two-tailed p-value of 0.000; this value is significant. Remember, for the p-value not to be significant at the 0.05 level, it would have to be greater than 0.05. In this case, the null hypothesis is rejected at the 0.05 significance level. In other words, we would conclude that there is a significant difference in mean turnover between areas, and that turnover is influenced by location. It is important to write up the results clearly and fully. In this instance we could write:

A Student t-test was conducted to determine if a significant difference in turnover between areas existed. A null hypothesis of no significant difference and an alternative hypothesis of a significant difference were established, and a 95% confidence level was assumed. The difference was significant: t = 6.354, p(<.0005)<0.05. Therefore the null hypothesis can be rejected and we can assume that there is a significant difference in turnover between areas, and that turnover is influenced by location.

Note that in the above we have reported the probability value as <.0005. You cannot have a probability value of 0.000. The reported probability value has actually been rounded to three decimal places and therefore, for accuracy, we report it as p<.0005.
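If you want to sanity-check this two-step workflow (Levene's test, then the appropriate version of the t-test) outside SPSS, the sketch below shows an illustrative Python equivalent. It assumes the scipy library is available, and the turnover figures are invented for the example rather than taken from the Dataset file:

    # Illustrative sketch only: hypothetical turnover values, not the module Dataset.
    from scipy.stats import levene, ttest_ind

    chichester = [52000, 48000, 45500, 61000, 39500, 47250, 55000]
    arun       = [36000, 41000, 33500, 39800, 37250, 40100, 35600]

    # Step 1: Levene's test for equality of variances.
    lev_stat, lev_p = levene(chichester, arun)

    # Step 2: choose the t-test to match the Levene result, as in the SPSS output:
    # equal_var=True  -> "Equal variances assumed" (top line)
    # equal_var=False -> "Equal variances not assumed" (bottom line)
    t_stat, t_p = ttest_ind(chichester, arun, equal_var=(lev_p > 0.05))

    print(f"Levene: p = {lev_p:.3f}")
    print(f"t = {t_stat:.3f}, two-tailed p = {t_p:.3f}")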




A note on significance testing, taken from Maltby and Day (2002):

'Significance testing is a criterion, based on probability, that researchers use to decide whether two variables are related. Remember, as researchers always use samples, and because of the possible error, they use significance testing to decide whether the relationships observed are real, or not. Researchers are then able to use a criterion level (significance testing) to decide whether or not their findings are probable (confident of their findings) and not probable (not confident of their findings). This criterion is expressed in terms of percentages, and their relationship to probability values. If we accept that we can never be 100 per cent sure of our findings, we have to set a criterion of how certain we want to be of our findings. Traditionally, two criteria are used. The first is that we are 95 per cent confident of our findings; the second is that we are 99 per cent confident of our findings. This is often expressed in another way: there is only a 5 per cent (95 per cent confidence) or 1 per cent (99 per cent confidence) probability that we have made an error. In terms of significance testing these two criteria are often termed the 0.05 (5 per cent) and 0.01 (1 per cent) significance levels.

Throughout this handbook, you will be using a number of tests to determine whether there is a significant association/relationship between two variables. These tests always provide a probability statistic in the form of a value, e.g. 0.75, 0.40, 0.15, 0.04, 0.03 and 0.002. Here, the notion of significance testing is essential. This probability statistic is compared against the criteria of 0.05 and 0.01 to decide whether the findings are significant. If the probability value (p) is less than 0.05 (p<0.05) or less than 0.01 (p<0.01) then we conclude that the finding is significant. If the probability value is more than 0.05 (p>0.05) then we decide that the finding is not significant. We can then use this information in relation to our research idea and determine whether our variables are significantly related, or not. Therefore, for the probability values stated above:

* The probability values of 0.75, 0.40 and 0.15 are greater than 0.05 and are therefore not significant at the 0.05 level (p>0.05).
* The probability values of 0.04 and 0.03 are less than 0.05 and are therefore significant at the 0.05 level (p<0.05).
* The probability value of 0.002 is less than 0.01 and is therefore significant at the 0.01 level (p<0.01).'



4.6.2 One- or Two-Tailed Tests

The above test has been based on a two-tailed test, as the null and alternative hypotheses did not specify any specific direction. If we were going to perform a one-tailed test we would first need to look at the mean values of the data and then rewrite our hypotheses accordingly. Remember that when applying a one-tailed test it is first necessary to establish whether the difference in the samples corresponds to the direction outlined in the alternative hypothesis. For example, if the alternative hypothesis is that the mean of sample Y is greater than the mean of sample X, the null hypothesis can only be rejected if the mean of sample Y is greater than the mean of sample X and if the difference is significant at the chosen level.

If we use Descriptive Statistics in SPSS to look at the mean turnovers for businesses in the Chichester and Arun Districts, we find that the mean turnover in the Chichester District is £43,968.47 and in the Arun District is £37,591.69. The mean turnover is higher in Chichester, which suggests that turnover may be influenced by location. We can therefore conduct a one-tailed t-test to test whether there is actually a significant difference between the two mean scores. In this case:

The Null Hypothesis: Ho: μa = μb

There is no significant difference in Turnover between Areas; business turnover is not influenced by location.

The Alternative Hypothesis: H1: μa > μb

There is a significant difference in Turnover between Areas; business turnover is higher in the Chichester District than in the Arun District.

To calculate the one-tailed level of significance, divide the two-tailed significance value by 2 (0.000/2). The resultant one-tailed value would be 0.000, which would still be significant (p<0.05).



4.6.3 Choosing the Correct Data for a T-Test

SPSS will not tell you if you are using the wrong data in a test, and it is therefore imperative that you are capable of selecting the right variables to use in a t-test. This will be central to your assessment in this module and it is vital that you get it right. Let us first refer back to Table 4.3 on page 4-133. This table clearly shows that a t-test requires a combination of one continuous variable and one categorical variable (with two levels).

In the worked example provided, Turnover08 was the continuous variable and Area was the categorical variable. Note that Area has two levels (i.e. 1 - Chichester District and 2 - Arun District). You can only use categorical variables that have two levels in a t-test. The Independent-Samples T Test dialog box provides a clue here, as you are only able to define two groups (levels) within the Grouping Variable.

In this case also note that the continuous variable (Turnover08) goes in the Test Variable box.




Activity 16: Referring to the variables in the Dataset file and your accompanying data set guide, attempt to complete the following diagram, listing Test Variables and Grouping Variables that would be suitable for use in a series of t-tests.

Test Variables

Grouping Variables




Activity 17: From the list of potential relationships that you have identified overleaf, please conduct 3 separate Ttests and record your results in the following tables. For each test, identify a research scenario that you are using the test to explore.

Table 18: Student T-Test 1

Research Scenario:
Test Variable:
Grouping Variable:
Null Hypothesis:
Alternative Hypothesis:

SPSS Output
Record the p value of the Levene Test:
Is this significant: yes/no?
Is your test based on: Equal variances assumed / Equal variances not assumed
Record the value of p (Sig. 2-tailed):
Is the value of p significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses):



Activity 17 (continued):

Table 19: Student T-Test 2

Research Scenario:
Test Variable:
Grouping Variable:
Null Hypothesis:
Alternative Hypothesis:

SPSS Output
Record the p value of the Levene Test:
Is this significant: yes/no?
Is your test based on: Equal variances assumed / Equal variances not assumed
Record the value of p (Sig. 2-tailed):
Is the value of p significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses):



Activity 17 (continued):

Table 20: Student T-Test 3

Research Scenario:
Test Variable:
Grouping Variable:
Null Hypothesis:
Alternative Hypothesis:

SPSS Output
Record the p value of the Levene Test:
Is this significant: yes/no?
Is your test based on: Equal variances assumed / Equal variances not assumed
Record the value of p (Sig. 2-tailed):
Is the value of p significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses):



4.7 Using SPSS to Calculate the T-Test for Related Samples

The t-test can also be used to examine means of the same participants in two conditions or at two points in time. The advantage of using the same participants or matched participants is that the amount of error deriving from differences between participants is reduced. The difference between a related and an unrelated t-test lies essentially in the fact that two scores from the same person are likely to vary less than two scores from two different people. For example, if you were to weigh the same person on two occasions, the difference between those two weights is likely to be less than the difference between the weights of two separate individuals. The variability of the standard error for the related t-test is less than that for the unrelated one. Indeed, the variability of the standard error of the differences in means for the related t-test will depend on the extent to which the pairs of scores are similar or related. The more similar they are, the less the variability of their estimated standard error will be.

In the following example we are going to look at paired data from the Dataset file. Let us consider a potential research scenario to help you place the use of the related t-test in context.

Scenario:

Between 2008 and 2010, Tourism South East ran a series of courses in conjunction with the Green Tourism Business Scheme to help GTBS members progress to the next stage of accreditation (e.g. bronze to silver; silver to gold). As part of the monitoring process, Tourism South East want to establish if these courses have had any impact on GTBS scores.

Variables:

We are going to examine differences in GTBS scores in 2008 and 2010.

As always, we need to start by defining our hypotheses. In this instance, the null and alternative hypotheses have been stated as:

Ho: μa = μb

There is no significant difference between the GTBS scores in 2008 and 2010.

H1: μa ≠ μb

There is a significant difference in the GTBS scores in 2008 and 2010.

To perform the related t-test, first move the mouse over Analyse and press the left mouse button. Move the mouse over Compare Means and then over Paired-Samples T Test.




The Paired-Samples T-Test dialog box appears.

Move the mouse over GTBS08 and press the left mouse button. GTBS08 is selected. Now move the mouse over GTBS10 and press the left mouse button. GTBS10 is selected. Move the mouse over the central button and press the left mouse button.

GTBS08 and GTBS10 now appear in the Paired Variables box. Click OK.




The procedure produces the following results in the output window:

The first table in the SPSS output is the Paired Samples Statistics, which reports the descriptive statistics. By observing the mean scores we can see that mean GTBS scores were higher in 2010 than in 2008. These differences seem to support our initial hypothesis. To establish whether this result is significant, or has merely occurred by chance, we refer to the Paired Samples Test. The key elements of the Paired Samples Test include:

(a) The test statistic - this is denoted as t; in this case the value of t = -11.386.

(b) The degrees of freedom - the degrees of freedom equal the size of the sample (300) minus 1; we subtract 1 because the same set of respondents provides both scores. The degrees of freedom value is placed in brackets between the t and the = sign (e.g. t(299) = -11.386).

(c) The probability value - as in all tests, we also have to report the probability value. Note that the value of p = .000 (which, remember, we report as p<.0005) is less than 0.05, which means that there has been a significant change in GTBS scores between 2008 and 2010.

Let us bring these different elements together. As can be seen from the SPSS output, the difference between the two means is significant. This is specifically reported as: There is a significant difference in GTBS scores between 2008 and 2010, t(299) = -11.386, p(<.0005)<0.05.




However, we can be more specific and, in our alternative hypothesis, look for an improvement in GTBS scores. As a result our alternative hypothesis would be:

H1: μb > μa (where μa is the 2008 mean and μb the 2010 mean)

There is a significant improvement in the GTBS scores between 2008 and 2010.

This therefore means we have conducted a one-tailed test, as we have specified a specific direction in which to examine change. To alter the output so that it complies with a one-tailed test, we merely divide the p-value by 2. The resultant value (.000) is still significant (p<0.05). As a result we can reject the null hypothesis and conclude that there has been a significant improvement in GTBS scores between 2008 and 2010, at the 95% confidence level. Specifically: There has been a significant improvement in GTBS scores between 2008 and 2010, t(299) = -11.386, p(<.0005)<0.05.
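For reference, the related t-test and its one-tailed variant can be reproduced outside SPSS. The following Python sketch is illustrative only: it assumes a recent version of the scipy library (for the alternative argument) and uses invented GTBS scores rather than the Dataset file:

    # Illustrative sketch only: hypothetical paired GTBS scores for six businesses.
    from scipy.stats import ttest_rel

    gtbs_2008 = [42, 55, 48, 61, 39, 50]
    gtbs_2010 = [47, 58, 55, 66, 45, 53]

    # Two-tailed related t-test (direction unspecified).
    t_two, p_two = ttest_rel(gtbs_2008, gtbs_2010)

    # One-tailed test: H1 is that 2010 scores are higher than 2008 scores.
    t_one, p_one = ttest_rel(gtbs_2010, gtbs_2008, alternative='greater')

    print(f"two-tailed: t = {t_two:.3f}, p = {p_two:.4f}")
    print(f"one-tailed: t = {t_one:.3f}, p = {p_one:.4f}")  # half the two-tailed p here

Note that the one-tailed p-value is half the two-tailed value when the observed difference lies in the hypothesised direction, which is exactly the halving rule used above.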




Activity 18: We are now going to use the Dataset file to conduct a number of additional related t-tests. Please complete the following tables, making clear reference to the SPSS output. You have been provided with research scenarios for each table to place the test in context.

Table 21: Related T-Test: Turnover08 Against Turnover10
[Tourism South East want to establish if regional marketing strategies implemented between 2008 and 2010 have had an impact on business turnover.]

Null Hypothesis:
Alternative Hypothesis:
Comment on the SPSS Output:

Table 22: Related T-Test: Green08 Against Green10
[Tourism South East want to establish if support given to the use of local produce has impacted on how much businesses spend on local produce.]

Null Hypothesis:
Alternative Hypothesis:
Comment on the SPSS Output:

Note that the tests conducted here relate to the entire sample. If we used the Split File option as we have done previously, we could conduct Related T-tests to provide comparisons between selected variables such as Area, Town or G-Strategy. Attempt to apply the Split File option and repeat one of the tests above. Cut and paste the output into your log book.



4.8 Non-Parametric Tests

4.8.1 The Mann-Whitney U Test (Independent Samples)

When comparing samples of geographical data, assumptions of normality, which underpin the accuracy of parametric tests such as the t-test, are often quite unrealistic. In these cases, the use of a non-parametric test, such as the Mann-Whitney U test, provides a convenient alternative. The Mann-Whitney U test is the non-parametric counterpart of the t-test for unrelated (independent) data. The test is used to determine whether ordinal data collected in two different samples differ significantly. As a non-parametric test it is not restricted by any assumptions regarding the nature of the population from which the sample was taken, and it is applicable to ordinal (ranked) data. In addition, the sample sizes of the data sets need not be equal. The test calculates whether there is a significant difference in the distribution (based on the median) of data by comparing the ranks of each data set. Within the Mann-Whitney U test the null hypothesis is that the two samples are taken from a common population, so that there should be no consistent difference between the two sets of values: any observed differences are due entirely to chance in the sampling process.

To begin, open SPSS and open the Dataset file that you have used in previous sessions. We are going to use Mann-Whitney to examine the relationship between different variables. Let us consider a potential research scenario to help you place the use of the Mann-Whitney test in context.

Scenario:

Tourism South East are developing a new e-tourism strategy and they want to establish if there is any relationship between e-strategy (e-commerce adopters and non-adopters) and business attitudes to the value of the internet.

Variables:

We are therefore going to examine the relationship between EStrategy and the perceived value of the internet in 2008 (Webqual08).

4.8.2 Writing Null and Alternative Hypotheses

Before we start, we first need to establish a null and alternative hypothesis.

In this case:

The Null Hypothesis: Ho:

There is no significant difference between the two groups in terms of their perceived value of the internet; e-strategy does not influence attitudes towards the internet.

The Alternative Hypothesis: H1:

There is a significant difference between the two groups in terms of their perceived value of the internet; e-strategy does influence attitudes towards the internet.


4.9 Using SPSS to Calculate Mann-Whitney

To perform the Mann-Whitney U test, first move the mouse over Analyse and press the left mouse button. Move the mouse over Nonparametric Tests and then over Legacy Dialogs. Select 2 Independent Samples. The Two-Independent-Samples Tests dialog box appears.

Select the variable labelled Webqual08. Move the mouse over the central arrow and press the left mouse button so Webqual08 appears in the Test Variable List.




Select the variable EStrategy and press the lower arrow so that EStrategy appears in the Grouping Variable box.

Move the mouse over Define Groups and press the left mouse button.

The Define Groups dialog box appears. In the box beside Group 1: type 1, and in the box beside Group 2: type 2. Note that in this case the groups have been defined in terms of their two codes (1 = E-Commerce Adopter and 2 = E-Commerce Non-Adopter). A value can also be used as a cut-off point, at or above which all the values constitute one group while those below form the other group. In this instance the cut-off point would be 2, placed in parentheses after the grouping variable. Move the mouse over Continue and press the left mouse button. This will return you to the Two-Independent-Samples Tests dialog box. Move the mouse over OK and press the left mouse button. SPSS performs the test and displays the results in the Output window.




The first subtable, Ranks, shows the number of businesses in each group and the total number of businesses. The Mean Rank indicates the mean rank of scores within each group, and the Sum of Ranks indicates the total sum of all ranks within each group. If our null hypothesis of no significant difference were true, we would expect the mean rank and sum of ranks to be roughly similar across the two groups. As we can see, the mean rank for E-Commerce Adopters is 178.82 and for E-Commerce Non-Adopters is 116.35. There is a clear difference between the two, and to determine whether this difference is significant we refer to the Test Statistics table below. This tells us that the Mann-Whitney U value is 6508.000 and that the probability value (p), ascertained by examining the Asymp. Sig. (2-tailed), is .000. In this case the p-value (.000, reported as p<.0005) is less than 0.05, so we can reject the null hypothesis and conclude that there is a significant difference between EStrategy and attitudes towards the internet.

Our Mann-Whitney test was two-tailed, but again we could be more specific by indicating a direction in our alternative hypothesis. In this case the alternative hypothesis would be:

H1:

There is a significant difference between the two groups in terms of their perceived value of the internet. E-commerce adopters rank the value of the internet higher than e-commerce non-adopters.

Note that an initial examination of the mean ranks would support our alternative hypothesis. As before, for a one-tailed test, the p-value needs to be halved (.000/2 = .000). In this case the test would still be significant, as the p-value (.000, reported as p<.0005) is less than 0.05, so we can again reject the null hypothesis and conclude that there is a significant difference between EStrategy and attitudes towards the internet, and that e-commerce adopters rank the value of the internet higher than e-commerce non-adopters.
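Again for reference, an illustrative Python equivalent of this test uses scipy's mannwhitneyu function; the attitude ratings below are invented for the sketch and are not taken from the Dataset file:

    # Illustrative sketch only: hypothetical 1-5 attitude ratings for two groups.
    from scipy.stats import mannwhitneyu

    adopters     = [5, 4, 5, 3, 4, 5, 4, 4]
    non_adopters = [2, 3, 1, 3, 2, 4, 2, 3]

    # Two-tailed test (direction unspecified in H1).
    u_stat, p_two = mannwhitneyu(adopters, non_adopters, alternative='two-sided')

    # One-tailed test: H1 is that adopters rank the internet higher.
    _, p_one = mannwhitneyu(adopters, non_adopters, alternative='greater')

    print(f"U = {u_stat}, two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")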



4.9.1 Choosing the Correct Data for a Mann-Whitney Test

SPSS will not tell you if you are using the wrong data in a test, and it is therefore imperative that you are capable of selecting the right variables to use in a Mann-Whitney test. This will be central to your assessment in this module and it is vital that you get it right. Let us first refer back to Table 4.3 on page 4-133. This table clearly shows that the Mann-Whitney test is non-parametric and requires a combination of one continuous variable and one categorical variable (with two levels).

In the worked example provided, Webqual08 was the continuous variable and EStrategy was the categorical variable. Note that EStrategy has two levels (i.e. 1 - E-Commerce Adopter and 2 - E-Commerce Non-Adopter). You can only use categorical variables that have two levels in a Mann-Whitney test. The Two-Independent-Samples Tests dialog box provides a clue here, as you are only able to define two groups (levels) within the Grouping Variable.

In this case also note that the continuous variable (Webqual08) goes in the Test Variable box.




Activity 19: Referring to the variables in the Dataset file, attempt to complete the following diagram listing Test Variables and Grouping Variables that would be appropriate for use in a series of Mann Whitney Tests.

Test Variables

Grouping Variables




Activity 20: From the list of potential relationships that you have identified overleaf, please conduct 3 separate Mann Whitney tests and record your results in the following tables. For each test, identify a research scenario that you are using the test to explore.

Table 23: Mann-Whitney Test 1

Research Scenario:
Test Variable:
Grouping Variable:
Null Hypothesis:
Alternative Hypothesis:

SPSS Output
Record the Mann-Whitney U value:
Record the value of p (Asymp. Sig. (2-tailed)):
Is the value of p significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses, and your test statistics):



Activity 20 (continued):

Table 24: Mann-Whitney Test 2

Research Scenario:
Test Variable:
Grouping Variable:
Null Hypothesis:
Alternative Hypothesis:

SPSS Output
Record the Mann-Whitney U value:
Record the value of p (Asymp. Sig. (2-tailed)):
Is the value of p significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses, and your test statistics):



Activity 20 (continued):

Table 25: Mann-Whitney Test 3

Research Scenario:
Test Variable:
Grouping Variable:
Null Hypothesis:
Alternative Hypothesis:

SPSS Output
Record the Mann-Whitney U value:
Record the value of p (Asymp. Sig. (2-tailed)):
Is the value of p significant: yes/no?
Your conclusions (with full reference to the null and alternative hypotheses, and your test statistics):



4.10 Using SPSS to Calculate the Wilcoxon Signed Ranks Test (Related Data Sets)

The Wilcoxon signed ranks test is the non-parametric counterpart of the t-test for related data, or paired t-test. The basic assumptions of the test are that the data are paired across conditions or times, and that the distribution of the differences is symmetrical but need not be normal or any other shape. The data should also be of at least ordinal level, which makes the test very useful for analysing data based on ranked scores. The test examines the differences between data collected in two different conditions or at two times by examining the ranks of the differences in values over the two conditions. For example, you may want to know whether a village's fertility or mortality rate changes significantly between dates, or whether the conditions under which a questionnaire or interview is conducted significantly influence the findings of a study. In this case, the test calculates whether there is a significant difference by examining whether the ranks of individual phenomena differ between conditions or times.

To begin, open SPSS and open the Dataset file that you have used in previous sessions. We are going to use the Wilcoxon test to examine the relationship between different variables. Let us consider a potential research scenario to help you place the use of the Wilcoxon test in context.

Scenario:

Between 2008 and 2010, Tourism South East have been running E-Commerce workshops across the South East region. As part of the monitoring process, Tourism South East want to establish if these workshops have had any impact on business attitudes to the value of the internet.

Variables:

We are therefore going to examine the relationship between Webqual08 and Webqual10.

In this instance, the null and alternative hypotheses have been stated as:

Ho:

There is no difference in business attitudes towards the value of the internet between 2008 and 2010

H1:

There is a difference in business attitudes towards the value of the internet between 2008 and 2010.

The significance level has been set at 0.05 (95%). Note that this is also a two-tailed test as no direction has been specified in the alternative hypothesis.




To perform the Wilcoxon Test, first move the mouse over Analyse and press the left mouse button.

Move the mouse over Nonparametric Tests and then Legacy Dialogs and then over 2 Related Samples and press the left mouse button.

The Two-Related Samples Tests dialog box appears.

Move the mouse over Webqual08 and press the left mouse button. Webqual08 is selected. Now move the mouse over Webqual10 and press the left mouse button. Webqual10 is selected. You will notice that in the Current Selections area in the dialog box, Webqual08 is now beside Variable 1 and Webqual10 is beside Variable 2.




Move the mouse over the central button and press the left mouse button. Webqual08 and Webqual10 now appear in the Paired Variables box.

Click OK. The procedure produces the following in the output window.




The first subtable, Ranks, shows the number of negative, positive and tied ranks, along with the mean rank and the sum of ranks. Let us explore this in additional detail. Key observations:

• Webqual10 has been entered into the equation first; the calculation is therefore based on the attitude scores in 2010 minus the attitude scores in 2008.

• The Negative Ranks indicate how many ranks of Webqual08 were larger than Webqual10. Here the value is 0, which would initially suggest that attitude scores have increased.

• The Positive Ranks indicate how many ranks of Webqual08 were smaller than Webqual10. The value here is 259.

• The Tied Ranks indicate how many of the rankings of Webqual08 and Webqual10 are the same. The value here is 41.

• The Total is the total number of ranks, which is equal to the number of attitude scores in the sample (in this case 300).

From the second subtable, Test Statistics, it can be seen that the value of z = -16.093, which is significant as the value of p (.000) is less than 0.05. We can therefore reject the null hypothesis and conclude that there is a significant difference in business attitudes towards the value of the internet between 2008 and 2010. The findings of the Wilcoxon test should be reported as: z = -16.093, p(<.0005)<0.05.

The Wilcoxon test was two-tailed, but again we could be more specific by indicating a direction in our alternative hypothesis. In this case the alternative hypothesis would be:

H1: There is a significant difference in attitudes towards the value of the internet between 2008 and 2010; business attitudes have improved.

Note that an initial examination of the data in the Ranks table supports our alternative hypothesis. As before, for a one-tailed test, the p-value needs to be halved (.000/2 = .000). In this case the test would still be significant, as the p-value (.000, reported as p<.0005) is less than 0.05, so we can again reject the null hypothesis and conclude that there is a significant difference in attitudes towards the value of the internet between 2008 and 2010, and that business attitudes have improved.




Activity 21: We are now going to use the Dataset file to conduct a number of additional Wilcoxon Tests. Please complete the following tables, making clear reference to the SPSS output. You have been provided with research scenarios for each table to place the test in context.

Table 26: Wilcoxon Test: BLINK08 Against BLINK10
[Following a complete review of their business advisory services, instigated by poor industry feedback in 2007, Business Link need to establish if business attitudes towards their advisory services have improved between 2008 and 2010.]

Null Hypothesis:
Alternative Hypothesis:
Comment on the SPSS Output:

Table 27: Wilcoxon Test: WEBVALUE08 Against WEBVALUE10
[Tourism South East want to establish if business attitudes to destination management systems have changed following the change of DMS platform and a complete relaunch of booking systems.]

Null Hypothesis:
Alternative Hypothesis:
Comment on the SPSS Output:

Note that the tests conducted here relate to the entire sample. If we used the Split File option as we have done previously, we could conduct Wilcoxon Tests to provide comparisons between selected cases such as Area, Town or G-Strategy. Attempt to apply the Split File option and repeat one of the tests above. Cut and paste the output into your log book.





Section 5

Chi-Squared

Learning Outcomes

At the end of this session, you should be able to:

• Understand the rationale for the use of chi-squared
• Understand the basic conditions and criteria involved in the use of chi-squared
• Apply the procedure for calculating chi-squared statistics both manually and in SPSS
• Interpret manually derived and computer-generated SPSS chi-squared output



5.0 Introduction

Chi-squared (χ²) is primarily employed to test a null hypothesis of 'no difference' between samples of frequency measurements. The method is used widely in the fields of Business and Management and is often employed in questionnaire analysis. In many ways it is the most flexible of such tests as:

• It can be applied to frequency data on any originally collected scale of measurement (nominal, ordinal or interval), provided that the data are grouped into independent and mutually exclusive categories.

• It may be used to test a null hypothesis of 'no difference' for any number of samples.

The chi-squared test involves computing a calculated χ² statistic and comparing this with an appropriate tabulated χ² statistic (or critical χ² value) to test a null hypothesis of 'no difference' at a selected significance level. Although considered less powerful than other tests, this is compensated for by its simple data requirements. Both ordinal and ratio scale data can be converted into nominal form, although such categorisation can often cause a loss of detail.

The chi-squared test requires that data be in the form of contingency tables, which are simply data matrices showing the frequency of observations in different categories (h) for one or more samples (k). The following are three examples of contingency tables.

Table 5.1: Categories of residence sampled in terms of the age of the resident

Category            Sample a:     Sample b:     Sample c:
                    Age 20-29     Age 30-39     Age 40 and over
Owner-occupied      18            42            28
Rented              31            29            12
Council housing     24            41            35
Other               17            4             1

No. of categories (h) = 4; No. of samples (k) = 3
Note: Measurement scale: nominal




Table 5.2: Typical questionnaire responses

Category    Strongly Agree   Agree   Neither Agree nor Disagree   Disagree   Strongly Disagree
Frequency   8                11      6                            19         12

No. of categories (h) = 5; No. of samples (k) = 1
Note: Measurement scale: ordinal

Table 5.3: Total dissolved solids in groundwater, sampled by rock type

Categories                   A: Granite   B: Basalt
(Concentration in mg l-1)    (n=30)       (n=30)
0-19                         3            1
20-39                        12           9
40-59                        10           11
60-79                        4            8
80-99                        1            3
100-119                      0            1

No. of categories (h) = 6; No. of samples (k) = 2
Note: Measurement scale: interval

The calculated χ² statistic compares the observed frequency (O) for each category and every sample against an expected frequency (E) using the general formula:

χ² = Σ (O − E)² / E

In the above equation the observed frequencies (O) are those that we measure (i.e. those that appear in the contingency tables). The expected frequencies (E) for each category are defined by our hypothesis. The null hypothesis of 'no difference' often involves testing for departure from a uniform distribution in the case of the single-sample test. This means that the expected value for each category is identical and equal to n/h. The chi-squared test can also be used to establish differences from a theoretical distribution, such as the normal distribution.




Although the χ² test is primarily employed as a one-tailed test of the significance of differences, it may also be employed to establish the significance of similarities between samples. Most χ² tables contain not only the usual values at the lower end of the significance scale, for testing differences, but also values at the upper end of this scale, for testing similarity. If we wish to establish the similarity of two or more samples, then our calculated χ² statistic must be less than, for example, the appropriate figure for the 95% significance level if we are to accept the null hypothesis of 'no difference' at this level.

5.1 The One-Sample Chi-Squared Test

The one-sample test is normally used to test the significance of differences between categories of a single sample. Consider the following example.

The frequency of rock falls from a popular cliff face in Snowdonia is recorded for two weeks in the summer, autumn, winter and spring by the local mountain rescue team. The results are recorded in Table 5.4.

Table 5.4: Rockfall frequency

Sampling Period          Summer   Autumn   Winter   Spring
Frequency of Rockfalls   17       14       10       23

h = 4; n = 64

If there were no differences in the frequency of rockfalls between seasons, then we would expect an equal frequency of rockfalls in each season. Basically, the expected frequency for each category would be:

E = n/h = 64/4 = 16

As with any test, we must first formalise the null and alternative hypotheses. In this case:

H0: There is no difference in the frequency of rock falls between seasons H1: The frequency of rock falls is significantly greater in some seasons than in others




The calculated χ² statistic can now be computed as follows:

χ² = Σ (O − E)² / E

Table 5.5: Rockfall frequency: the calculation of the χ² statistic

Category   O    E    (O-E)   (O-E)²   (O-E)²/E
Summer     17   16   1       1        0.0625
Autumn     14   16   -2      4        0.2500
Winter     10   16   -6      36       2.2500
Spring     23   16   7       49       3.0625

χ² = Σ (O − E)²/E = 5.6250

The degrees of freedom (v) for this one-sample chi-squared test are:

v = h - 1

which in this case equals:

v = h - 1 = 4 - 1 = 3

The calculated χ² statistic is then compared with a tabulated χ² statistic at a selected significance level (see Table 5.6). To reject the null hypothesis, the calculated χ² must exceed the tabulated χ². At the 0.05 significance level, the tabulated χ² statistic with three degrees of freedom is 7.82. As the calculated value is less than the tabulated value, we cannot reject the null hypothesis of 'no difference' at the 0.05 significance level, and we conclude that there is no significant difference in the frequency of rock falls between the seasons.
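This hand calculation is easy to verify outside SPSS. The following illustrative Python sketch (assuming the scipy library is available) reproduces the calculated χ² of 5.625 and the tabulated value of 7.82 for three degrees of freedom:

    from scipy.stats import chisquare, chi2

    observed = [17, 14, 10, 23]  # rockfalls per season

    # With no expected frequencies given, chisquare assumes a uniform
    # distribution (n/h = 16 per season), as in the worked example.
    stat, p = chisquare(observed)
    print(f"chi-squared = {stat:.4f}, p = {p:.4f}")  # chi-squared = 5.6250

    # Critical value at the 0.05 significance level with h - 1 = 3 df.
    print(f"critical value = {chi2.ppf(0.95, df=3):.2f}")  # 7.81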




Table 5.6: Critical Values of Chi-Squared (95% and 99% levels)


5.2 The Chi-Squared Test for Two or More Samples

The chi-squared test can also be used to test the differences or similarities between two or more samples, though it is most often used as a test of difference. The procedure is similar to that for the one-sample test, except that the calculation of the expected frequencies (E) is slightly more complex. Consider the following example.

A researcher in Ghana studying the distribution of malaria outbreaks among international tourists obtains the following results from a sample of 100 tourists who stayed in hotels on the river flood plain, and from 200 tourists who stayed in hotels on a plateau above the river. The results are recorded in Table 5.7.

Table 5.7: The incidence of malaria outbreaks in Ghana

Sample                Infected   Not Infected
Flood plain (n=100)   20         80
Plateau (n=200)       25         175

In this case we have two samples (k=2) and two categories (h=2). The researcher wishes to establish whether the two samples differ significantly in terms of the incidence of infection. The expected frequencies (E) are thus those that would be expected if there were indeed 'no differences' between the plateau and the flood plain in terms of the incidence of infection. The expected frequencies are calculated for each observation using the following formula:

E = (Column total x Row total) / Overall total

Or, alternatively, using notation in a contingency table format:

Table 5.7a: Calculation of expected values

Sample                   Infected               Not Infected           Row Total
Flood plain (expected)   Cell A = (N1 x T1)/T   Cell B = (N1 x T2)/T   N1
Plateau (expected)       Cell C = (N2 x T1)/T   Cell D = (N2 x T2)/T   N2
Column Totals            T1                     T2                     T



Therefore, in the case of the malaria outbreaks, the expected values are calculated in the following manner:

Table 5.7b: Calculation of expected values

Sample                   Infected   Not Infected   Row Total
Flood plain (observed)   20         80             100
Plateau (observed)       25         175            200
Column Totals            45         255            300

Hence the expected values are:

Table 5.7c: Calculation of expected values (cont.)

Sample                   Infected               Not Infected             Row Total
Flood plain (expected)   15 = (45 x 100)/300    85 = (255 x 100)/300     100
Plateau (expected)       30 = (45 x 200)/300    170 = (255 x 200)/300    200
Column Totals            45                     255                      300

Note that the row and column totals for the expected values are identical to those for the observed values. The χ² statistic is now calculated in the following manner:

Table 5.8: Calculation of the χ² statistic

Category                    O     E     (O-E)   (O-E)²   (O-E)²/E
Flood plain: Infected       20    15    5       25       1.667
Flood plain: Not Infected   80    85    -5      25       0.294
Plateau: Infected           25    30    -5      25       0.833
Plateau: Not Infected       175   170   5       25       0.147
Total                       300   300                    χ² = 2.941




When the chi-squared test is used to test two or more samples, the number of degrees of freedom is given by:

v = (h-1)(k-1)

In this case:

v = (h-1)(k-1) = (2-1)(2-1) = 1

Formally, the test of a null hypothesis of 'no difference' is as follows:

H0: There is no difference between the incidence of infection on the flood plain and that on the plateau
H1: There is a significant difference between the incidence of infection on the flood plain and that on the plateau

The calculated χ² statistic is then compared with a tabulated χ² statistic at a selected significance level. The tabulated value at the 0.1 significance level with 1 degree of freedom is 2.71 (see Table 5.6). As the calculated χ² statistic (2.941) exceeds the tabulated χ² statistic, we can reject the null hypothesis at the 0.1 significance level. In practical terms this means that, on the basis of the evidence of the chi-squared test, it is extremely unlikely that the observed difference between rates of infection is due only to chance in the sampling process; instead it reflects a 'real' difference between the rates of infection on the flood plain and the plateau.

Notice that the larger the test statistic, the stronger the evidence of association will be. This is not surprising, because the test statistic χ² is based on differences between the actual, or observed, frequencies and those we would expect if there were no association. If there were association, we would anticipate large differences between observed and expected frequencies; if there were no association, we would expect small differences.

In the above example, if a more stringent significance level had been chosen, say 0.05, then the calculated χ² statistic (2.941) would have been less than the tabulated χ² statistic (3.84) and so the null hypothesis could not have been rejected. This situation raises the issue of subjectivity: in order to reject a null hypothesis, a researcher may well be tempted to choose a less stringent significance level. The safest rule is to choose a significance level before the test is carried out and stick to it.




5.3 Yates Correction Factor

When using a χ² test with one degree of freedom, as in the previous example, it is necessary to make a slight adjustment to the calculations. The adjustment consists of either adding or subtracting 0.5 to each value of (O-E) before squaring it. The rule for deciding whether to add or subtract the 0.5 is:

a) If (O-E) is negative, then add;
b) If (O-E) is positive, then subtract.

It is probably more easily remembered by noting that the addition or subtraction should be performed with a view to making the value of χ² smaller. The effect of the Yates correction can be highlighted with reference to a new version of Table 5.8, which has been appropriately adjusted using the Yates correction factor (Table 5.9).

Table 5.9: Calculation of the χ² statistic with Yates correction (corrected values in brackets)

Category                    O     E     (O-E)           (O-E)²       (O-E)²/E
Flood plain: Infected       20    15    5-0.5 (4.5)     25 (20.25)   1.667 (1.35)
Flood plain: Not Infected   80    85    -5+0.5 (-4.5)   25 (20.25)   0.294 (0.24)
Plateau: Infected           25    30    -5+0.5 (-4.5)   25 (20.25)   0.833 (0.675)
Plateau: Not Infected       175   170   5-0.5 (4.5)     25 (20.25)   0.147 (0.119)
Total                       300   300                                2.941 (2.384)

The effect of the Yates correction is to introduce greater accuracy into the calculation and evaluation of the χ² statistic. In this case, the Yates correction has reduced the value of the calculated χ² statistic to the extent that it no longer exceeds the tabulated value of χ² with one degree of freedom. As such, the null hypothesis can no longer be rejected, and the researcher would have to conclude that there is no significant difference between the incidence of infection on the flood plain and the plateau.
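For comparison, both versions of the two-sample calculation can be reproduced with scipy's chi2_contingency function, whose correction argument toggles the Yates adjustment for 2 x 2 tables. The sketch below is an illustrative aside, not part of the manual's SPSS workflow:

    from scipy.stats import chi2_contingency

    # 2 x 2 contingency table from the malaria example (observed frequencies).
    table = [[20, 80],    # flood plain: infected, not infected
             [25, 175]]   # plateau:     infected, not infected

    # Without the Yates correction: reproduces chi-squared = 2.941.
    stat, p, dof, expected = chi2_contingency(table, correction=False)
    print(f"uncorrected: chi-squared = {stat:.3f}, p = {p:.3f}, df = {dof}")

    # With the Yates correction (the default for 2 x 2 tables): ~2.38
    # (Table 5.9 gives 2.384 because its cell values are rounded).
    stat_y, p_y, _, _ = chi2_contingency(table, correction=True)
    print(f"Yates-corrected: chi-squared = {stat_y:.3f}, p = {p_y:.3f}")

    print(expected)  # the expected frequencies, matching Table 5.7c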


5.4 Conditions Necessary for Conducting a Chi-Squared Test

When using chi-squared, a number of guidelines must be remembered (a worked check of the expected-frequency rules follows this list):

• Contingency tables must consist of at least two categories;

• Where there are only two categories, the expected frequency in each category must not be less than 5;

• Where there are more than two categories, no category should have an expected frequency of less than 1, and no more than one category in five should have an expected frequency of less than 5;

• Data must be in the form of frequencies (i.e. counted data in categories). The χ² statistic is best suited to comparing frequencies within nominal categories. It can also be applied to higher-order levels of measurement if data are grouped into categories prior to analysis. These tests are not applicable to ungrouped interval-scale data;

• No cell is allowed to have an expected frequency of less than 1. This requirement can sometimes be met through the amalgamation of rows and columns (i.e. fewer cells with more observations in each). However, be careful, as the regrouping of data can lead to a loss of information and to subtle differences between two data sets being obscured. Regrouping should therefore be avoided if at all possible, and larger sample sizes are recommended. In addition, the way that categories are constructed may determine whether or not significant associations are detected;

• Samples are assumed to be independent (the test is not applicable to dependent samples);

• Random sampling is assumed (other sampling procedures can be considered as long as they are proved to be unbiased);

• Data categories must be discrete and unambiguous;

• Frequencies must be absolute, not percentages or proportional values;

• The question of 'tailedness' of the alternative hypothesis does not arise in the context of chi-squared tests; because of the manner of its execution, the direction of departure is immaterial.
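As flagged above, the expected-frequency rules can be checked before the test is run. The sketch below is a minimal illustration assuming Python with SciPy, using a small hypothetical 2x2 table of counts invented purely for illustration:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical 2x2 table of observed counts, for illustration only
    observed = np.array([[6, 4],
                         [3, 7]])

    # chi2_contingency returns the expected frequencies alongside the statistic
    _, _, _, expected = chi2_contingency(observed, correction=True)
    print(expected)  # [[4.5 5.5] [4.5 5.5]]

    # Rule for 2x2 tables: no expected frequency may fall below 5
    if (expected < 5).any():
        print("Expected frequencies too small: regroup or enlarge the sample")
    else:
        print("Expected frequencies satisfy the chi-squared requirements")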


Activity 22: From an examination of destination preferences for second homes it appears that coastal counties of England and Wales are perceived as being more desirable holiday locations than inland counties. The results are summarised below.

Residential Desirability

Location            Low Preference   High Preference   Total
Coastal Counties          5                14            19
Inland Counties          19                15            34
Total                    24                29            53

Of the 19 coastal counties, 14 have preference scores of more than 30 and only 5 have preference scores of 30 or less. Of the 34 inland counties, 15 have high preference scores and 19 have low scores. Use the chi-squared test to decide whether there is in fact a significant difference at the 0.05 level between coastal and inland counties in terms of their destination desirability. Report your final result below:


Activity 23: In a survey commissioned by a TV travel programme, 135 people were asked what their favourite foreign holiday destination was. Some of the results are summarised in the contingency table below:

Use these sample results to test for association between gender and destination preference, using a 95% confidence level. When calculating the expected frequencies, check whether the data meet the requirements of the chi-squared test. How can you re-categorise the data to make them meet the criteria for the chi-squared test? Report your final result below:


Activity 24: Company managers at Butlins are investigating the relationship between job satisfaction and levels of absenteeism in the firm. They believe that satisfied individuals are less likely to be absent from work than those who are not satisfied. The results from a survey of 30 workers are displayed in the contingency table below.

                            Job Satisfaction
Absenteeism             Dissatisfied   Happy   Total
Absent from work              4          11      15
Not absent from work         10           5      15
Total                        14          16      30

Calculate the value of χ² for the difference between the observed and expected numbers. Is this difference significant at the 0.05 level? Record your final result below:


5.5 Using SPSS to Calculate Chi-Squared

Having considered how to calculate chi-squared manually, the aim of the following section is to highlight how to calculate chi-squared values using SPSS.

To start, we will repeat the earlier exercise relating to job satisfaction levels at Butlins (Activity 24). Load the 'Butlins1' exercise file into SPSS. Label the columns and values as you have done in previous sessions. The raw data are arranged case by case as follows:

Subject:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Job Satis:  1  1  2  1  2  2  1  1  2  1  2  2  2  2  2  1  1  2  1  2  2  1  2  2  1  1  1  2  1  1
Absent:     2  1  1  1  1  1  2  2  2  2  2  1  1  2  1  2  1  2  2  1  2  2  1  1  2  1  2  1  2  2

To perform a chi-squared test in SPSS, move the mouse over Analyse and press the left mouse button. Move the mouse over Descriptive Statistics and then Crosstabs.


The Crosstabs dialog box appears.

Move the mouse over Jobsatis and press the left mouse button. Press the top arrow button so that Jobsatis is selected in the Row(s): box. Move the mouse over Absent and press the left mouse button. Press the middle arrow button so that Absent is selected in the Column(s) box.

Move the mouse over Statistics and press the left mouse button. The Crosstabs: Statistics dialog box appears.

Select the chi-square option and then press Continue.


This takes you back to the Crosstabs dialog box. Move the mouse over Cells and press the left mouse button. The Crosstabs: Cell display dialog box appears. Make sure that Observed and Expected counts are selected and then press Continue. This will take you back to the initial Crosstabs dialog box.

Press OK and SPSS will automatically calculate the chi-square statistics and display the results in the output window. The output window will display a contingency table and the following output.


How do these results compare to your manual output? Well, first of all you should notice that the Pearson Chi-Square result gives you the χ² statistic prior to revision by the Yates correction (4.82). Second, the Continuity Correction chi-square gives you the χ² statistic as adjusted by the Yates correction (3.34).

But, from the SPSS output, how do you infer the significance level?

Although the output looks daunting, the answer is quite simple. In the output below, the significance value (Asymp. Sig. (2-sided)) for the corrected χ² statistic is .067. This value is greater than 0.05, which means it is not significant at the 0.05 significance level. This can also be reported as p>0.05 (not significant). Notice, however, that the value is less than 0.1, which means it is significant at the 0.1 significance level; this can alternatively be recorded as p<0.1 (significant). These are the same results as you should have calculated manually.
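If you want to verify the SPSS output independently, the sketch below is a minimal illustration assuming Python with SciPy; it reproduces both the Pearson and the continuity-corrected statistics for the Butlins table:

    from scipy.stats import chi2_contingency

    # Butlins contingency table (rows: absent, not absent;
    # columns: dissatisfied, happy)
    observed = [[4, 11],
                [10, 5]]

    pearson, p_pearson, df, _ = chi2_contingency(observed, correction=False)
    yates, p_yates, _, _ = chi2_contingency(observed, correction=True)

    print(round(pearson, 3), round(p_pearson, 3))  # 4.821, 0.028 (Pearson Chi-Square)
    print(round(yates, 3), round(p_yates, 3))      # 3.348, 0.067 (Continuity Correction)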


Remember:

• if the significance value (p) is <0.1 then the result is significant at the 0.1 significance level (90%);
• if the significance value (p) is <0.05 then the result is significant at the 0.05 significance level (95%);
• if the significance value (p) is <0.01 then the result is significant at the 0.01 significance level (99%).

Remember, however, that you should not switch between significance levels so that the null hypothesis can be rejected. The safest rule is to pick a significance level before you start and stick with it all the way through the test.

Accurately Reporting the Outcomes of the Chi-Squared Test

When reporting the chi-squared result, a number of key elements must be included:

• Suitable hypotheses. In this case:
  H0: There is no significant difference between job satisfaction and levels of absenteeism
  H1: There is a significant difference between job satisfaction and levels of absenteeism

• The test statistic. In your write-up you must state what χ² equals. In this example χ² = 3.34.

• The degrees of freedom. This is the number of rows minus 1, multiplied by the number of columns minus 1; the value is given in the SPSS output. The degrees of freedom are placed in brackets between χ² and the = sign. In this example the degrees of freedom = 1, therefore χ² (1) = 3.34.

• The probability. As highlighted above, this is stated in relation to whether your probability value was below 0.05 or 0.01 (and therefore significant) or above 0.05 (and therefore not significant), using the less than (<) or greater than (>) sign against the criterion level: p<0.05 (significant), p<0.01 (significant) or p>0.05 (not significant). Assuming a 95% confidence level in the above example, as p=0.067 we would write p>0.05 and place this after the reported χ² value. Therefore χ² (1) = 3.34, p (0.067)>0.05.


These elements must be incorporated into your text to ensure that your results are presented succinctly but effectively. You can also include a table. Using the findings above, we could report the following.

Table 1: Job Satisfaction v Job Absenteeism

                           Job Satisfaction
Absenteeism            Dissatisfied   Happy   Totals
Yes (observed)               4          11      15
  % of Total                13%         37%     50%
No (observed)               10           5      15
  % of Total                33%         17%     50%
Totals                      14          16      30

'Table 1 shows a breakdown of the distribution of respondents in terms of levels of job satisfaction and levels of absenteeism (with percentages of the total shown). A chi-squared test was used to determine whether there was a significant difference between the two variables. A null hypothesis of no significant difference and an alternative hypothesis of a significant difference were established, and a 95% confidence level was assumed. No significant difference was found between job satisfaction and absenteeism (χ² (1) = 3.34, p (0.067)>0.05). The null hypothesis of no significant difference can therefore not be rejected.'

Note that if we had assumed a 90% confidence level from the start we would write:

'Table 1 shows a breakdown of the distribution of respondents in terms of levels of job satisfaction and levels of absenteeism (with percentages of the total shown). A chi-squared test was used to determine whether there was a significant difference between the two variables. A null hypothesis of no significant difference and an alternative hypothesis of a significant difference were established, and a 90% confidence level was assumed. A significant difference was found between job satisfaction and absenteeism (χ² (1) = 3.34, p (0.067)<0.1). The null hypothesis of no significant difference can therefore be rejected.'


Activity 25: Load the 'Chi Square Exercises' file into SPSS. This file contains the data relating to the two additional practical exercises that you completed by hand. Perform two chi-squared tests and compare the output against your manual calculations. Note that the Excel file contains two spreadsheets that you will need to access. Import into SPSS in the normal way, but select the spreadsheet you wish to use from the Opening Excel Data Source dialog box. Record the results in your log book.

The following coding schemes have been used:

Residential Desirability:  Area: Coastal = 1; Inland = 2. Score: High = 1; Low = 2.
TV Survey:                 Gender: Female = 1; Male = 2. Location: Greece = 1; Spain = 2; Thailand = 3; Turkey = 4; USA = 5. Regroup: Europe = 1; Asia = 2; USA = 3.


Activity 26: Referring to the variables in the Dataset file, identify a series of relationships that could be examined using the chi-squared test. Remember you need to focus on category/nominal data for this exercise.



Activity 27: Using the Dataset file, conduct three appropriate chi-squared tests. Please complete the following tables, making clear reference to the SPSS output. For each test, identify a research scenario that you are using the test to explore.

Table 28: Chi-Squared 1

Chi-Squared Test
Research Scenario:
Row Variable:
Column Variable:
Null Hypothesis:
Alternative Hypothesis:
Comment on the SPSS Output:

Table 29: Chi-Squared 2

Chi-Squared Test
Research Scenario:
Row Variable:
Column Variable:
Null Hypothesis:
Alternative Hypothesis:
Comment on the SPSS Output:


Table 30: Chi-Squared 3

Chi-Squared Test
Research Scenario:
Row Variable:
Column Variable:
Null Hypothesis:
Alternative Hypothesis:
Comment on the SPSS Output:



Section 6: Correlation

Learning Outcomes

At the end of this session, you should be able to:

• explain the rationale for the use of correlation analysis;
• understand the basic conditions and criteria involved in the use of correlation analysis;
• use SPSS to calculate both Pearson's Product Moment Correlation Coefficient and Spearman's Rank Correlation Coefficient;
• interpret computer-generated SPSS correlation analysis output.


6.0 Introduction

The aim of this session is to help you understand the importance of correlation in statistical analysis. By the end of this session you should understand the meaning of correlation, how to check whether data fulfil the assumptions for parametric and non-parametric testing, and how to perform correlation statistics in SPSS.

6.1 The Meaning of Correlation

Correlation is one of the most widely used statistical techniques. It is a means of measuring the degree of association between two variables, that is, the extent to which changes in the values of one variable are matched by changes in another variable. For example, we would tend to expect that, other things being equal, the market price of houses increases as the size of the house increases: bigger houses cost more. Size and price are correlated. The amount of water flowing down a river would be expected to be closely related to the amount of rain that has recently fallen on the catchment: rainfall and river flow are correlated. We may have data on crime rates and on unemployment in a number of areas, and it may be that those areas with a high crime rate also, in general, have a higher rate of unemployment. These variables too are correlated.

Correlation may measure the extent to which higher values of one variable are matched with higher values of the other, which is called positive correlation, or the extent to which higher values of one variable are matched with lower values of the other, which is called negative correlation. For example, you might find a positive correlation between the amount of beer you drank the night before and the number of pneumatic drills you think are in your head the next day. However, there might be a negative correlation between the number of pints and your ability to perform particular tasks.

To repeat: correlation is a measure of association; it says nothing whatsoever about cause. Although variation in house size may cause variation in house price, and variation in amounts of rainfall may cause variation in river flow, there has been a long argument, political as well as sociological, about whether unemployment causes crime. It is possible to find sets of data which have absolutely nothing in common, except that they are correlated. Remember:

• If higher values of one variable are associated with higher values of the other variable, then the two variables are positively correlated.
• If higher values of one variable are associated with lower values of the other variable, then the two variables are negatively correlated.

There are several ways to measure correlation, using a range of different indices for different types of data. When variables are parametric in nature (e.g. interval/ratio data), by far the most common measure of correlation is Pearson's Product Moment Correlation Coefficient, often referred to as Pearson's r. Where data are ordinal (one or both variables are not measured on an interval scale), or are not normally distributed, or where other assumptions of the Pearson correlation coefficient are violated, we use the Spearman Rank Correlation Coefficient, referred to as Spearman's rs.


Activity 28: Referring to the variables in the Dataset file and your accompanying data set guide, attempt to complete the following diagram, listing variables that could be correlated using the Pearson Product Moment Correlation Coefficient or the Spearman Rank Correlation Coefficient.

Pearson Product Moment Correlation Coefficient:

Spearman Rank Correlation Coefficient:


6.2 Identifying Signs of Correlation in the Data

No matter what type of data you are using, an important first stage in measuring correlation is to obtain some idea of whether correlation may be present in the data. The simplest way to do this is to plot the variables and look carefully at the graph.

Figure 6.1 shows that the two variables are clearly related in some way: they are strongly correlated. The graph slopes up to the right, that is, there is an association between higher values of each variable, so the correlation is positive.

Figure 6.1: Strong Positive Correlation

In the case of Figure 6.2, the graph slopes down to the right, implying a negative relationship: as one variable increases, the other decreases.

Figure 6.2: Strong Negative Correlation

In addition to positive and negative relationships we sometimes find non-linear or curvilinear relationships, in which the shape of the relationship between the two variables is not straight, but curves at one or more points (see Figure 6.3).


Figure 6.3: Non-linear or Curvilinear Relationship

It is important to identify whether the relationship is non-linear because:

• it would affect the choice of correlation measurement technique;
• if the wrong technique were used, the result would be spurious.

Overall, scatter diagrams are useful aids in the preliminary steps of identifying correlation and allow three aspects of a relationship to be discerned: whether it is linear; the direction of the relationship (positive or negative); and the strength of the relationship. The amount of scatter is indicative of the strength of the relationship.
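SPSS's Scatter/Dot procedure is covered later in this section; purely as a minimal illustration of this preliminary plotting stage (assuming Python with matplotlib, and entirely hypothetical data), a scatter diagram can be produced as follows:

    import matplotlib.pyplot as plt

    # Hypothetical paired observations, e.g. house size (x) against price (y)
    size  = [50, 65, 70, 85, 90, 105, 120, 135, 150, 160]
    price = [110, 150, 148, 180, 210, 240, 250, 295, 320, 335]

    plt.scatter(size, price)
    plt.xlabel("House size (square metres)")
    plt.ylabel("House price (thousands)")
    plt.title("An upward-sloping cloud of points suggests positive correlation")
    plt.show()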

6.3 Correlation Analysis

The correlation coefficient (r) measures the linear relationship between two variables. Every correlation coefficient lies somewhere on the scale of possible values, that is, between -1 and +1 inclusive. A coefficient of -1 or +1 indicates a perfect relationship, negative or positive respectively, between the two variables. The complete absence of a relationship produces a computed coefficient of zero. The closer the correlation coefficient is to 1 (either positively or negatively), the stronger the relationship between the two variables; the nearer the correlation coefficient is to zero, the weaker the relationship. These ideas are displayed in Figure 6.4.


Figure 6.4: The Strength and Direction of Correlation Coefficients

Perfect Negative                        No                        Perfect Positive
  Correlation                       Correlation                     Correlation
      -1 <--- strong --- weak --------- 0 --------- weak --- strong ---> +1

If the correlation coefficient is 0.85, this indicates a strong positive relationship between the two variables, whereas a correlation coefficient of 0.28 denotes a weak positive relationship. Similarly, -0.75 and -0.36 are indicative of strong and weak negative relationships respectively. However, what counts as a large correlation? Cohen and Holliday (1982) suggest the following: 0.19 and below is very low; 0.20 to 0.39 is low; 0.40 to 0.69 is modest; 0.70 to 0.89 is high; and 0.90 to 1 is very high. These measures are a rule of thumb, however, and should not be regarded as definitive indications.

Caution is also required when comparing computed correlation coefficients. For example, we can say that a computed correlation coefficient of -0.60 is larger than one of -0.30, but we cannot say that the relationship is twice as strong. In order to understand this more clearly, we need to refer to the coefficient of determination (R²). This is quite simply the square of the correlation coefficient, multiplied by 100 to express it as a percentage. It provides us with an indication of how far variation in one variable is accounted for by the other. Thus if r = -0.6, then R² = 36 per cent, meaning that 36 per cent of the variance in one variable is accounted for by the other. When r = -0.3, R² is 9 per cent. Thus, although an r of -0.6 is twice as large as one of -0.3, the relationship is not twice as strong: four times more variance is being accounted for by an r of -0.6 than by an r of -0.3 (Bryman and Cramer, 1997).

Referring to the coefficient of determination can also influence your interpretation of r. For example, an r value of 0.75 may seem quite high, but it would only mean that 56 per cent of the variance in y can be attributed to x. In other words, 44 per cent of the variance in y is due to variables other than x.
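The arithmetic behind the coefficient of determination is easy to check; a minimal sketch in Python:

    # Coefficient of determination: R^2 is r squared, expressed as a percentage
    for r in (-0.6, -0.3, 0.75):
        r_squared = r ** 2 * 100
        print(f"r = {r:5.2f} -> R^2 = {r_squared:.0f}% of variance accounted for")

    # r = -0.60 -> R^2 = 36%
    # r = -0.30 -> R^2 =  9%
    # r =  0.75 -> R^2 = 56%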


6.4 Using SPSS to Measure Correlation: Pearson's Product Moment Correlation Coefficient

The most commonly used (and misused) measure of correlation is Pearson's Product Moment Correlation Coefficient. This is a powerful parametric measure, which can be used to test for significance and reliability as long as its assumptions are satisfied. The first two assumptions are:

• the relationship between the variables is linear;
• the variables are interval or ratio scale measurements.

Before we use Pearson's correlation coefficient to examine possible correlations in the Dataset file, let me illustrate correlation through a simple example. Load the Excel file 'Correlation' into SPSS. The details of this data file are highlighted below. Variables: CARS = number of cars; PERSONS = number of persons; INCOME = income (thousands); AGE = age; TRAVEXP = travel expenditure.

CARS  PERSONS  INCOME  AGE  TRAVEXP
  0      2        9     25      10
  2      3       25     37      50
  1      1       13     23      20
  2      4       30     30      60
  2      2       50     43      70
  0      1        4     18       5
  1      3       30     27     100
  2      2       43     55      30
  1      1       10     71      15
  1      3       50     20      20
  2      2       37     41      50
  1      2       25     51      90
  1      5       30     45      40
  2      4       50     40      80
  3      2       75     54     150
  1      3       45     34      50
  1      4       50     67      30
  0      3       20     44      20
  0      4       13     34      15
  1      3       35     54      50
  2      1       40     65      50
  1      1       75     45      30
  0      2       10     34      10
  1      2       50     26      30
  2      3       30     65      70
  3      4      100     32     100
  1      3       40     46      60
  2      3       30     55      50
  1      2       30     65      20

The above table refers to factors that might influence the level of car ownership in individual households. If you wanted to examine the relationship between the different variables, the first stage would be to produce a series of scatterplots to highlight the direction and strength of any possible relationships. Let us examine correlation through a specific example: the relationship between the number of persons in the household (Persons) and the number of cars (Cars).


To do so, click Graphs, move the mouse over Legacy Dialogs and then select Scatter/Dot.

The Scatterplot dialog box appears.

Ensure that Simple is selected and then press Define.

The Simple Scatterplot dialog box appears. Move the mouse over Cars (Number of Cars) and press the left mouse button. Move the mouse over the top arrow and press the left mouse button so that Cars is selected in the Y Axis: box. Move the mouse over Persons (Number of People) and press the left mouse button. Move the mouse over the centre arrow and press the left mouse button so that Persons is selected in the X Axis: box.

Press OK.


A scatterplot showing the relationship between the two variables appears.

The absence of any clear linear pattern in the scatterplot indicates a very weak correlation between the two variables. This can be confirmed by actually calculating the correlation coefficient. To do so, move the mouse over Analyse and press the left mouse button. Move the mouse over Correlate and then over Bivariate and press the left mouse button again. The Bivariate Correlations dialog box appears.


Move the mouse over Cars and press the left mouse button. Move the mouse over the top arrow so that Cars is selected in the Variables Box:.

Repeat the same procedure for Persons. Make sure that the Pearson correlation coefficient and a two-tailed test are selected. A two-tailed test is chosen because we do not know in which direction the relationship between the two variables will run, and we are looking for either a positive or a negative correlation. Press OK. SPSS produces a matrix of correlation coefficients in the output window. In this case the following output is produced:

As you can see from the output, the value of r for the two variables equals 0.129, which indicates a very weak positive correlation. You should also notice that the probability value (p) is not significant (p>0.05).


As with your previous exercises, you should also provide null and alternative hypotheses. In this case:

Null Hypothesis: There is no significant association between levels of car ownership and the number of persons in the household.

Alternative Hypothesis [Two-Tailed]: There is a significant association between levels of car ownership and the number of persons in the household.

Note that this alternative hypothesis is two-tailed as it does not specify a direction (for example, a positive or a negative association). An initial scatterplot of the data would reveal any possible association and allow you to specify a one-tailed test. In this case, a one-tailed alternative would look like this:

Alternative Hypothesis [One-Tailed]: There is a positive association between levels of car ownership and the number of persons in the household.

Referring back to the SPSS output for our initial correlation: the Pearson correlation test statistic = .129, and the output indicates that this is not significant (p = .503, >0.05). A conventional way of reporting these figures would be: r = .129, n = 29, p>0.05. The results indicate that there is no significant association between levels of car ownership and the number of persons in the household. Note that when using correlation you are examining the level of association, and this should be clearly reflected in your hypotheses.
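The same figures can be reproduced outside SPSS. The sketch below is a minimal illustration assuming Python with SciPy, using the Cars and Persons columns from the data table above:

    from scipy.stats import pearsonr

    cars    = [0, 2, 1, 2, 2, 0, 1, 2, 1, 1, 2, 1, 1, 2, 3,
               1, 1, 0, 0, 1, 2, 1, 0, 1, 2, 3, 1, 2, 1]
    persons = [2, 3, 1, 4, 2, 1, 3, 2, 1, 3, 2, 2, 5, 4, 2,
               3, 4, 3, 4, 3, 1, 1, 2, 2, 3, 4, 3, 3, 2]

    r, p = pearsonr(cars, persons)   # two-tailed p-value by default
    print(round(r, 3), round(p, 3))  # 0.129, 0.503 -> matches the SPSS output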


Let us now repeat this procedure to examine the relationship between additional variables within the dataset. In this case we will look at car ownership against household income. First create a scatterplot of car ownership against income. Your scatterplot should look similar to the graph below.

The scatterplot clearly indicates that there is a linear relationship between the two variables, and that there is evidence of a positive correlation: in this case, as household income increases so does the level of car ownership. Having established the existence of a linear relationship, now calculate the correlation coefficient. In the Bivariate Correlations dialog box, specify a one-tailed test, as in this case we are expecting a positive correlation, thus indicating a direction. SPSS will generate the following output.

Correlations

                                  Cars      Income
Cars     Pearson Correlation      1         .665**
         Sig. (1-tailed)          .         .000
         N                        29        29
Income   Pearson Correlation      .665**    1
         Sig. (1-tailed)          .000      .
         N                        29        29

** Correlation is significant at the 0.01 level (1-tailed).


The Pearson correlation test statistic = 0.665. SPSS indicates with ** that this is significant at the 0.01 level for a one-tailed prediction; the p value is shown as .000 (i.e. below .0005 when displayed to three decimal places). A conventional way of reporting these figures would be: r = 0.665, n = 29, p<0.01. The results indicate that as household income increases, car ownership also increases: a positive correlation. As the reported r value is positive and p<0.01, we can state that there is a significant positive correlation between our two variables and that the null hypothesis can be rejected.
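Again, this can be cross-checked outside SPSS; a minimal sketch assuming Python with SciPy. Note that pearsonr reports a two-tailed p-value by default, so it is halved here for the one-tailed prediction:

    from scipy.stats import pearsonr

    cars   = [0, 2, 1, 2, 2, 0, 1, 2, 1, 1, 2, 1, 1, 2, 3,
              1, 1, 0, 0, 1, 2, 1, 0, 1, 2, 3, 1, 2, 1]
    income = [9, 25, 13, 30, 50, 4, 30, 43, 10, 50, 37, 25, 30, 50, 75,
              45, 50, 20, 13, 35, 40, 75, 10, 50, 30, 100, 40, 30, 30]

    r, p_two_tailed = pearsonr(cars, income)
    print(round(r, 3))               # 0.665, as in the SPSS output
    print(p_two_tailed / 2 < 0.01)   # True: significant at the 0.01 level (1-tailed)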

Activity 29: Examine the remaining variables in the dataset and record your observations, using the tables below, which are also in your log book.

Table 31: Number of cars against age

Pearson's Product Moment Correlation Coefficient
Scatterplot (please cut and paste your scatterplot below and rescale accordingly. Note any evidence of a relationship: is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:


Table 32: Number of cars against income

Pearson's Product Moment Correlation Coefficient
Scatterplot (please cut and paste your scatterplot below and rescale accordingly. Note any evidence of a relationship: is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:

Table 33: Number of cars against monthly travel expenses

Pearson's Product Moment Correlation Coefficient
Scatterplot (please cut and paste your scatterplot below and rescale accordingly. Note any evidence of a relationship: is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:


Activity 30: Using the Dataset file, conduct two Pearson Product Moment Correlation Coefficient tests on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.

Table 34: Correlation 1

Pearson's Product Moment Correlation Coefficient
Research Scenario:
Scatterplot (please cut and paste your scatterplot below and rescale accordingly. Note any evidence of a relationship: is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:


Table 35: Correlation 2

Pearson's Product Moment Correlation Coefficient
Research Scenario:
Scatterplot (please cut and paste your scatterplot below and rescale accordingly. Note any evidence of a relationship: is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:


6.5 Non-Parametric Correlation: Spearman's Rank Correlation Coefficient

It is often the case that the data available do not fit the requirements for parametric testing. In this case, there is a non-parametric correlation measure available. Spearman's Rank Correlation Coefficient is mathematically derived from Pearson's coefficient, but instead of using the actual data values it uses rank or ordinal data. The Spearman correlation coefficient is known as rs. The main assumptions for the use of Spearman's rank correlation are:

• the relationship between the variables is monotonic, that is, y consistently increases as x increases, or y consistently decreases as x increases (a linear relationship is monotonic, but a monotonic relationship is not necessarily linear);
• the variables are ordinal (ranks), or are ranked interval or ratio scale measurements.

To highlight the use of Spearman's rank correlation, type the data table below into SPSS. The data refer to a survey of workers in a London hotel. The manager believed that employee commitment to customer care policies was influenced by overall job satisfaction. The data are ranked for Commitment (Commit) (1 = high commitment, 4 = poor commitment) and Job Satisfaction (Satis) (1 = high satisfaction, 4 = low satisfaction).

Commit: 1 2 1 4 4 1 1 1 2 4 3 4 1 1 1 2 1 3 4 4 1 1 2 3 1
Satis:  1 3 2 3 4 1 2 2 1 4 4 4 1 2 2 2 1 3 4 3 1 2 1 4 1

Use the same procedure as before to open the Bivariate Correlations dialog box. Select both Commit and Satis in the Variables: box. Instead of Pearson's r, make sure that the Spearman correlation coefficient is selected, and that a one-tailed test is also selected. This is because the manager believes that employee commitment increases with job satisfaction; this implies a direction in the alternative hypothesis, making it a one-tailed test.


Press OK and SPSS will automatically calculate the value of Spearman's rank correlation coefficient. In this case, the following output is produced.

As you can see from the output, there is a strong positive correlation between the two variables (rs = 0.78). The result is also significant (p<0.01), and the manager can be confident at the 99% confidence level that commitment increases with job satisfaction. The positive correlation is also reflected in a scatterplot of the two variables.
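As before, the result can be cross-checked outside SPSS; a minimal sketch assuming Python with SciPy, whose spearmanr handles the tied ranks in this data. Its p-value is two-tailed by default, so it is halved here for the one-tailed test:

    from scipy.stats import spearmanr

    commit = [1, 2, 1, 4, 4, 1, 1, 1, 2, 4, 3, 4, 1,
              1, 1, 2, 1, 3, 4, 4, 1, 1, 2, 3, 1]
    satis  = [1, 3, 2, 3, 4, 1, 2, 2, 1, 4, 4, 4, 1,
              2, 2, 2, 1, 3, 4, 3, 1, 2, 1, 4, 1]

    rs, p_two_tailed = spearmanr(commit, satis)
    print(round(rs, 2))              # ~0.78, as reported in the SPSS output above
    print(p_two_tailed / 2 < 0.01)   # True: significant at the 0.01 level (1-tailed)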


Activity 31: Using the Dataset file, conduct two Spearman Rank Correlation Coefficient tests on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.

Table 36: Correlation 3

Spearman's Rank Correlation Coefficient
Research Scenario:
Scatterplot (please cut and paste your scatterplot below and rescale accordingly. Note any evidence of a relationship: is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of rs?
Probability Value?
Please provide a brief summary of your findings here:


Table 37: Correlation 4

Spearman's Rank Correlation Coefficient
Research Scenario:
Scatterplot (please cut and paste your scatterplot below and rescale accordingly. Note any evidence of a relationship: is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of rs?
Probability Value?
Please provide a brief summary of your findings here:



Section 7: Useful Reading

7.0 Useful Reading

BRYMAN, A. AND CRAMER, D. (2001), Quantitative Data Analysis with SPSS Release 10 for Windows, Routledge, London.
BUGLEAR, J. (2000), Stats to Go, Butterworth Heinemann, London.
CLARK, M., RILEY, M., WILKIE, E. AND WOOD, R. (1998), Researching and Writing Dissertations in Hospitality and Tourism, Thomson Business Press, London.
DANCEY, C. AND REIDY, J. (2002), Statistics Without Maths for Psychology, Second Edition, Prentice Hall, London.
EBDON, D. (1985), Statistics in Geography, Blackwell, London.
FIELD, A. (2009), Discovering Statistics Using SPSS, Third Edition, Sage, London.
FINN, M., ELLIOTT-WHITE, M. AND WALTON, M. (2000), Tourism and Leisure Research Methods, Longman, London.
GHAURI, P. AND GRONHAUG, K. (2002), Research Methods in Business Studies, FT Prentice Hall, London.
HINTON, P. (2004), Statistics Explained, Routledge, London.
HINTON, P., BROWNLOW, C., McMURRAY, I. AND COZENS, B. (2004), SPSS Explained, Routledge, London.
KINNEAR, P. AND GRAY, C. (1999), SPSS for Windows Made Simple, Psychology Press, London.
KITCHIN, R. AND TATE, N. (2000), Conducting Research into Human Geography, Prentice Hall, London.
MALTBY, J. AND DAY, L. (2002), Early Success in Statistics, Prentice Hall, London.
McQUEEN, R. AND KNUSSEN, C. (2002), Research Methods for Social Science, Prentice Hall, London.
MICROSOFT PRESS (1997), Microsoft Access 97 - At a Glance, Microsoft Press, Washington.
MULBERG, J. (2002), Figuring Figures, Prentice Hall, London.
ROGERSON, P. (2001), Statistical Methods for Geography, Sage Publications, London.
SAUNDERS, M., LEWIS, P. AND THORNHILL, A. (2003), Research Methods for Business Studies, Third Edition, FT Prentice Hall, London.


Section 8: Appendices


