Section 6
Correlation
Learning Outcomes
At the end of this session, you should be able to:
Explain the rationale for the use of correlation analysis
Understand the basic conditions and criteria involved in the use of correlation analysis
Use SPSS to calculate both Pearson's Product Moment Correlation Coefficient and Spearman's Rank Correlation Coefficient
Interpret computer-generated SPSS correlation analysis output
6.0 Introduction
The aim of this session is to help you understand the importance of correlation in statistical analysis. By the end of this session you should understand the meaning of correlation, how to check whether data fulfil the assumptions for parametric and non-parametric testing, and how to perform correlation analysis in SPSS.
6.1 The Meaning of Correlation
Correlation is one of the most widely used statistical techniques. It is a means of measuring the degree of association between two variables, that is, the extent to which changes in the values of one variable are matched by changes in another variable. For example, we would tend to expect that, other things being equal, the market price of houses increases as the size of the house increases, that is, bigger houses cost more. Size and price are correlated. The amount of water flowing down a river would be expected to be closely related to the amount of rain which has recently fallen on the catchment. Rainfall and river flow are correlated. We may have data on crime rates and on unemployment in a number of areas. It may be that those areas with a high crime rate also, in general, have a higher rate of unemployment. These variables too are correlated.

Correlation may measure the extent to which higher values of one variable are matched with higher values of the other, which is called positive correlation, or it may measure the extent to which higher values of one variable are matched with lower values of the other, which is called negative correlation. For example, you might find a positive correlation between the amount of beer you drank the night before and the number of pneumatic drills you think are in your head the next day. However, there might be a negative correlation between the number of pints and your ability to perform particular tasks.

To repeat, correlation is a measure of association; it says nothing whatsoever about cause. Although variation in house size may cause variation in house price, and variation in rainfall may cause variation in river flow, there has been a long political, as well as sociological, argument about whether unemployment causes crime. It is possible to find sets of data which have absolutely nothing in common, except that they are correlated. Remember:
If higher values of one variable are associated with higher values of the other variable, then the two variables are positively correlated.
If higher values of one variable are associated with lower values of the other variable, then the two variables are negatively correlated.
There are several ways to measure correlation, using a range of different indices for different types of data. When the data are parametric (interval or ratio measurements that meet the other parametric assumptions), by far the most common measure of correlation is Pearson's Product Moment Correlation Coefficient, often referred to as Pearson's r. Where the data are ordinal (one or both variables are not measured on an interval scale), or are not normally distributed, or where other assumptions of the Pearson correlation coefficient are violated, we use the Spearman Correlation Coefficient, referred to as Spearman's rs.
Activity 28: Referring to the variables in the Dataset file and your accompanying data set guide, attempt to complete the following diagram, listing variables that could be correlated using the Pearson Product Moment Correlation Coefficient or the Spearman Rank Correlation Coefficient.
Pearson Product Moment Correlation
Spearman Rank Correlation Coefficient
6.2 Identifying Signs of Correlation in the Data
No matter what type of data you are using, an important first stage in measuring correlation is to obtain some idea of whether correlation may be present in the data. The simplest way to do this is to plot the variables and look carefully at the graph.
Figure 6.1 shows that the two variables are clearly related in some way: they are strongly correlated. The graph slopes up to the right, that is, there is an association between higher values, so the correlation is positive.
Figure 6.1: Strong Positive Correlation
In the case of Figure 6.2, the graph slopes down to the right, thereby implying a negative relationship, meaning that as one variable increases, the other decreases.
Figure 6.2: Strong Negative Correlation
In addition to positive and negative relationships we sometimes find non-linear or curvilinear relationships, in which the shape of the relationship between the two variables is not straight, but curves at one or more points (see Figure 6.3).
Figure 6.3: Non-linear or Curvilinear Relationship
It is important to identify if the relationship is non-linear as:
It would affect the choice of correlation measurement technique;
If the wrong technique were used, the result would be spurious.
Overall, scatter diagrams are useful aids in the preliminary steps of identifying correlation and allow three aspects of a relationship to be discerned: whether it is linear; the direction of the relationship (positive or negative); and the strength of the relationship. The amount of scatter is indicative of the strength of the relationship.
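If you would like to experiment with these patterns outside SPSS, the short Python sketch below is an optional illustration using synthetic data (it assumes the numpy and matplotlib libraries are installed, and is not part of the SPSS exercises). It draws examples of strong positive, strong negative and near-zero correlation.

# Illustrative sketch (synthetic data, not the module dataset): what strong
# positive, strong negative and near-zero correlation look like when plotted.
# Assumes numpy and matplotlib are installed.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)

patterns = {
    "Strong positive": 2 * x + rng.normal(0, 2, 100),
    "Strong negative": -2 * x + rng.normal(0, 2, 100),
    "No correlation": rng.normal(0, 5, 100),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (title, y) in zip(axes, patterns.items()):
    ax.scatter(x, y, s=10)   # one panel per pattern
    ax.set_title(title)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
plt.tight_layout()
plt.show()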
6.3 Correlation Analysis
The correlation coefficient (r) measures the linear relationship between two variables. Every correlation coefficient lies somewhere on the scale of possible values, that is, between -1 and +1 inclusive. A coefficient of -1 or +1 would indicate a perfect relationship, negative or positive respectively, between the two variables. The complete absence of a relationship would produce a computed coefficient of zero. The closer the correlation coefficient is to 1 (either positively or negatively), the stronger the relationship between the two variables; the nearer the correlation coefficient is to zero, the weaker the relationship. These ideas are displayed in Figure 6.4.
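For reference (you will never need to compute this by hand, as SPSS does the calculation), the coefficient SPSS reports is the standard sample Pearson correlation coefficient, defined for n paired observations with means \(\bar{x}\) and \(\bar{y}\) as

\[
r \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}},
\qquad -1 \le r \le +1 .
\]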
Figure 6.4: The Strength and Direction of Correlation Coefficients
[Scale from -1 to +1: -1 = perfect negative correlation; values between -1 and 0 run from strong to weak negative correlation; 0 = no correlation; values between 0 and +1 run from weak to strong positive correlation; +1 = perfect positive correlation]
If the correlation coefficient is 0.85, this would indicate a strong positive relationship between the two variables, whereas a correlation coefficient of 0.28 would denote a weak positive relationship. Similarly, -0.75 and -0.36 would be indicative of strong and weak negative relationships respectively. However, what is a large correlation? Cohen and Holliday (1982) suggest the following: 0.19 and below is very low; 0.20 to 0.39 is low; 0.40 to 0.69 is modest; 0.70 to 0.89 is high; and 0.90 to 1 is very high. However, these measures are a rule of thumb and should not be regarded as definitive indications.

Caution is also required when comparing computed correlation coefficients. For example, we can say that a computed correlation coefficient of -0.60 is larger than one of -0.30, but we cannot say that the relationship is twice as strong. In order to understand this more clearly, we need to refer to the coefficient of determination (R2). This is quite simply the square of the correlation coefficient multiplied by 100. It provides us with an indication of how far variation in one variable is due to the other. Thus, if r = -0.6, then R2 = 36 per cent. This means that 36 per cent of the variance in one variable is due to the other. When r = -0.3, R2 will be 9 per cent. Thus, although an r of -0.6 is twice as large as one of -0.3, it does not indicate that the former relationship is twice as strong as the latter, because four times more variance is being accounted for by an r of -0.6 than by one of -0.3 (Bryman and Cramer, 1997). Referring to the coefficient of determination can also influence your interpretation of r. For example, an r value of 0.75 may seem quite high, but it would only mean that 56 per cent of the variance in y can be attributed to x. In other words, 44 per cent of the variance in y is due to variables other than x.
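As a quick check of the arithmetic in the paragraph above:

\[
R^2 = r^2 \times 100\% : \quad
r=-0.6 \Rightarrow R^2 = 36\%, \quad
r=-0.3 \Rightarrow R^2 = 9\%, \quad
r=0.75 \Rightarrow R^2 = 56.25\% \approx 56\% .
\]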
6.4 Using SPSS to Measure Correlation: Pearson's Correlation Coefficient
The most commonly used (and misused) measure of correlation is Pearson's Product Moment Correlation Coefficient. This is a powerful parametric measure, which can be used to test for significance and reliability as long as its assumptions are satisfied. The first two assumptions are:
The relationship between the variables is linear;
The variables are interval or ratio scale measurements.
Before we use Pearson’s Correlation Coefficient to examine possible correlations in the Dataset file, let me illustrate correlation through a simple example. Load the Excel file ‘Correlation’ into SPSS. The details of this data file are highlighted below.
CARS    PERSONS   INCOME   AGE   TRAVEXP
0       2         9        25    10
2       3         25       37    50
1       1         13       23    20
2       4         30       30    60
2       2         50       43    70
0       1         4        18    5
1       3         30       27    100
2       2         43       55    30
1       1         10       71    15
1       3         50       20    20
2       2         37       41    50
1       2         25       51    90
1       5         30       45    40
2       4         50       40    80
3       2         75       54    150
1       3         45       34    50
1       4         50       67    30
0       3         20       44    20
0       4         13       34    15
1       3         35       54    50
2       1         40       65    50
1       1         75       45    30
0       2         10       34    10
1       2         50       26    30
2       3         30       65    70
3       4         100      32    100
1       3         40       46    60
2       3         30       55    50
1       2         30       65    20

[CARS = No. of cars; PERSONS = No. of persons; INCOME = Income (thousands); AGE = Age; TRAVEXP = Travel expenditure]
The table above lists factors that might influence the level of car ownership in individual households. If you wanted to examine the relationship between the different variables, the first stage would be to produce a series of scatterplots to highlight the direction and strength of any possible relationships. Let us examine correlation through a specific example. In this case, we will look at the relationship between the number of persons in the household (Persons) and the number of cars (Cars).
To do so, click Graphs, move the mouse over Legacy Dialogs and then select Scatter/Dot.
The Scatterplot dialog box appears.
Ensure that Simple is selected and then press Define.
The Simple Scatterplot dialog box appears. Move the mouse over Cars (Number of Cars) and press the left mouse button. Move the mouse over the top arrow and press the left mouse button so that Cars is selected in the Y Axis: box. Move the mouse over Persons (Number of People) and press the left mouse button. Move the mouse over the centre arrow and press the left mouse button so that Persons is selected in the X Axis: box.
Press OK.
A scatterplot showing the relationship between the two variables appears.
The lack of any clear linear pattern in the scatterplot suggests a very weak correlation between the two variables. This can be confirmed by actually calculating the correlation coefficient. To do so, move the mouse over Analyze and press the left mouse button. Move the mouse over Correlate and then over Bivariate and press the left mouse button again. The Bivariate Correlations dialog box appears.
Move the mouse over Cars and press the left mouse button. Move the mouse over the top arrow so that Cars is selected in the Variables: box.
Repeat the same procedure for Persons. Make sure that the Pearson correlation coefficient and a two-tailed test are selected. A two-tailed test is selected because we do not know in which direction the relationship between the two variables will run, so we are looking for either a positive or a negative correlation. Press OK. SPSS produces a matrix of correlation coefficients in the output window. In this case the following output is produced:
As you can see from the output, the value of r for the two variables equals 0.129, which indicates a very weak correlation. You should also notice that the probability value (p) is not significant (p > 0.05).
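If you want to cross-check this result outside SPSS, a minimal Python sketch is shown below. It assumes the SciPy library is installed and that the Cars and Persons columns have been typed in from the data table above; it is an optional check, not part of the SPSS procedure.

# Minimal cross-check of the SPSS result outside SPSS. Assumes SciPy is
# installed and the Cars and Persons columns are typed in from the table above.
from scipy.stats import pearsonr

cars = [0, 2, 1, 2, 2, 0, 1, 2, 1, 1, 2, 1, 1, 2, 3,
        1, 1, 0, 0, 1, 2, 1, 0, 1, 2, 3, 1, 2, 1]
persons = [2, 3, 1, 4, 2, 1, 3, 2, 1, 3, 2, 2, 5, 4, 2,
           3, 4, 3, 4, 3, 1, 1, 2, 2, 3, 4, 3, 3, 2]

r, p = pearsonr(cars, persons)      # two-tailed p-value by default
print(f"r = {r:.3f}, p = {p:.3f}")  # should agree with the SPSS matrix (r = .129, p = .503)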
As with your previous exercises, you should also provide null and alternative hypotheses. In this case:

Null Hypothesis: There is no significant association between levels of car ownership and the number of persons in the household.

Alternative Hypothesis [Two-Tailed]: There is a significant association between levels of car ownership and the number of persons in the household.

Note that this alternative hypothesis is two-tailed because it does not specify a direction (for example, a positive or a negative association). An initial scatterplot of the data would reveal any possible association and could allow you to specify a one-tailed test. In this case, a one-tailed alternative hypothesis would look like this:

Alternative Hypothesis [One-Tailed]: There is a positive association between levels of car ownership and the number of persons in the household.

Referring back to the SPSS output for our initial correlation:
The Pearson correlation test statistic = .129. The output indicates that this is not significant (p = .503, which is greater than 0.05). A conventional way of reporting these figures would be as follows: r = .129, n = 29, p > 0.05. The results indicate that there is no significant association between levels of car ownership and the number of persons in the household. Note that when using correlation you are examining the level of association, and this should be clearly reflected in your hypotheses.
Let us now repeat this procedure to examine the relationship between additional variables within the dataset. In this case we will look at car ownership against household income. First create a scatterplot of car ownership against income. Your scatterplot should look similar to the graph below:
The scatterplot clearly indicates that there is a linear relationship between the two variables, and that there is evidence of a positive correlation: in this case, as household income increases, so does the level of car ownership. Having established the existence of a linear relationship, now calculate the correlation coefficient. In the Bivariate Correlations dialog box, specify a one-tailed test, as in this case we are expecting a positive correlation and are therefore indicating a direction. SPSS will generate the following output.
Correlations

                                   Cars       Income
Cars      Pearson Correlation      1          .665(**)
          Sig. (1-tailed)          .          .000
          N                        29         29
Income    Pearson Correlation      .665(**)   1
          Sig. (1-tailed)          .000       .
          N                        29         29

** Correlation is significant at the 0.01 level (1-tailed).
The Pearson correlation test statistic = 0.665. SPSS indicates with ** that it is significant at the 0.01 level for a one-tailed prediction. The actual p value is displayed as .000 (i.e. p < 0.001). A conventional way of reporting these figures would be as follows: r = 0.665, n = 29, p < 0.01. The results indicate that as household income increases, car ownership also increases, which is a positive correlation. As the r value reported is positive and p < 0.01, we can state that there is a significant positive correlation between our two variables and that the null hypothesis can be rejected.
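The same one-tailed test can be cross-checked outside SPSS with the short Python sketch below. This is optional: it assumes SciPy is installed, and the 'alternative' keyword needs SciPy 1.9 or later (with an older version, compute the two-tailed p-value and halve it when the sign of r matches the predicted positive direction).

# Minimal sketch of the one-tailed Pearson test for Cars against Income.
# Assumes SciPy 1.9+ for the 'alternative' keyword; data typed in from the table above.
from scipy.stats import pearsonr

cars = [0, 2, 1, 2, 2, 0, 1, 2, 1, 1, 2, 1, 1, 2, 3,
        1, 1, 0, 0, 1, 2, 1, 0, 1, 2, 3, 1, 2, 1]
income = [9, 25, 13, 30, 50, 4, 30, 43, 10, 50, 37, 25, 30, 50, 75,
          45, 50, 20, 13, 35, 40, 75, 10, 50, 30, 100, 40, 30, 30]

r, p = pearsonr(cars, income, alternative="greater")  # one-tailed (positive) test
print(f"r = {r:.3f}, one-tailed p = {p:.4f}")
# SPSS reports r = .665 with the p value displayed as .000 for these data.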
Activity 29: Examine the remaining variables in the dataset and record your observations, using the tables below that are also in your log book.
Table 31: Number of cars against age
Pearson's Product Moment Correlation Coefficient
Please cut and paste your scatterplot below and rescale accordingly.
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Table 32: Number of cars against income
Pearson's Product Moment Correlation Coefficient
Please cut and paste your scatterplot below and rescale accordingly.
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Table 33: Number of cars against monthly travel expenses
Pearson's Product Moment Correlation Coefficient
Please cut and paste your scatterplot below and rescale accordingly.
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Activity 30:
Using the Dataset file, calculate two Pearson Product Moment Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 34: Correlation 1
Pearson's Product Moment Correlation Coefficient
Research Scenario
Please cut and paste your scatterplot below and rescale accordingly.
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
Table 35: Correlation 2
Pearson's Product Moment Correlation Coefficient
Research Scenario
Please cut and paste your scatterplot below and rescale accordingly.
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of r?
Probability Value?
Please provide a brief summary of your findings here:
6.5 Non-Parametric Correlation: Spearman's Rank Correlation Coefficient
It is often the case that the data available do not fit the requirements for parametric testing. In this case, there is a non-parametric correlation measure available. Spearman's Rank Correlation Coefficient is mathematically derived from Pearson's coefficient, but instead of using the actual data values it uses ranked (ordinal) data. The Spearman correlation coefficient is known as rs (a standard formula is given after the assumptions below). The main assumptions for the use of Spearman's rank correlation are:
The relationship between the variables is monotonic, that is, as x increases, y consistently increases or consistently decreases. A linear relationship is monotonic, but a monotonic relationship is not necessarily linear.
The variables are ordinal (ranks) or are ranked interval or ratio scale measurements.
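For reference, when no ranks are tied Spearman's coefficient can be written in the familiar form

\[
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^{2}-1)},
\]

where d_i is the difference between the two ranks for the i-th case and n is the number of cases. When ties are present, as in the example below, rs is equivalent to Pearson's r calculated on the ranks, which is what SPSS reports.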
Commit   Satis
1.00     1.00
2.00     3.00
1.00     2.00
4.00     3.00
4.00     4.00
1.00     1.00
1.00     2.00
1.00     2.00
2.00     1.00
4.00     4.00
3.00     4.00
4.00     4.00
1.00     1.00
1.00     2.00
1.00     2.00
2.00     2.00
1.00     1.00
3.00     3.00
4.00     4.00
4.00     3.00
1.00     1.00
1.00     2.00
2.00     1.00
3.00     4.00
1.00     1.00
To highlight the use of Spearman's rank correlation, type the data table into SPSS. The data refer to a survey of workers in a London hotel. The manager believed that employee commitment to customer care policies was influenced by overall job satisfaction. The data in the table are ranked for Commitment (Commit) (1 = High Commitment and 4 = Poor Commitment) and Job Satisfaction (Satis) (1 = High Satisfaction and 4 = Low Satisfaction).
Use the same procedure as for the Pearson correlation earlier in this session to open the Bivariate Correlations dialog box.
Select both Commit and Satis in the Variables: box. Instead of Pearson's r, make sure that the Spearman correlation coefficient is selected. Make sure that the one-tailed test is also selected. This is because the manager believes that employee commitment increases with job satisfaction, which implies a direction in the alternative hypothesis, making it a one-tailed test.
Press OK and SPSS will automatically calculate the value of the Spearman's rank correlation coefficient. In this case, the following output is produced.
As you can see from the output, there is a strong positive correlation between the two variables (rs = 0.78). The result is also significant (p < 0.01), and the manager can be confident at the 99% level that commitment increases with job satisfaction. The positive correlation is also reflected in a scatterplot of the two variables.
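As before, the same test can be reproduced outside SPSS with a minimal Python sketch. This is optional and assumes SciPy (1.7 or later, for the 'alternative' keyword) is installed and that the Commit and Satis scores have been typed in from the table above.

# Minimal sketch of the same Spearman test run outside SPSS.
# Assumes SciPy 1.7+ and the Commit/Satis scores from the table above.
from scipy.stats import spearmanr

commit = [1, 2, 1, 4, 4, 1, 1, 1, 2, 4, 3, 4, 1, 1, 1,
          2, 1, 3, 4, 4, 1, 1, 2, 3, 1]
satis = [1, 3, 2, 3, 4, 1, 2, 2, 1, 4, 4, 4, 1, 2, 2,
         2, 1, 3, 4, 3, 1, 2, 1, 4, 1]

rs, p = spearmanr(commit, satis, alternative="greater")  # one-tailed (positive) test
print(f"rs = {rs:.2f}, one-tailed p = {p:.4f}")
# The SPSS output in the handout reports rs of about 0.78, significant at p < 0.01.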
Activity 31:
Using the Dataset file, calculate two Spearman Rank Correlation Coefficients on appropriate variables and record your answers in the tables below, which can be found in your log book. For each test, identify a research scenario that you are using the test to explore.
Table 36: Correlation 3
Spearman's Rank Correlation Coefficient
Research Scenario
Please cut and paste your scatterplot below and rescale accordingly.
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of rs?
Probability Value?
Please provide a brief summary of your findings here:
Table 37: Correlation 4
Spearman's Rank Correlation Coefficient
Research Scenario
Please cut and paste your scatterplot below and rescale accordingly.
Scatterplot (Please note any evidence of a relationship. Is it linear or non-linear? Is it positive or negative?)
Null Hypothesis:
Alternative Hypothesis (one-tailed):
Value of rs?
Probability Value?
Please provide a brief summary of your findings here:
© Dr Andrew Clegg