Dr. Cathal Walsh
ST3002 – Statistical Analysis Assignment by Stephen Denham, JS MSISS
Stephen Denham 08339678 16/1/2010
Table of Contents
1. Introduction
2. Section One – Fisher’s Iris Flower Dataset
   Univariate
   Multivariate
   Parametric Relationships between Variables
   Model – Relative Importance of Variables
3. Section Two – Anscombe’s Quartet
   Statistical Similarities
   Plot
4. Conclusion
5. References
1. Introduction
This document presents a statistical analysis of two famous datasets: the Iris flower dataset (R. A. Fisher, 1936) and Anscombe’s Quartet (1973). The two differ greatly in what they can teach, but it is hoped that, through this analysis, the reader can improve his or her knowledge of information processing. In the same way that a building can be photographed from many different angles, from the sides, inside and out, this analysis describes the data using several different methods to provide a holistic description of the underlying trends. The open-source statistical software R was used. Complete copies of the code, output and graphs used are available in the same folder.
2. Section One – Fisher’s Iris Flower Dataset
This dataset contains measurements of sepal length, sepal width, petal length and petal width (in centimetres) for three iris species: setosa, versicolor and virginica. Here, a univariate, bivariate and multivariate analysis is carried out, and then a prediction model is described.
Univariate
First of all, it is important to note that there are equal numbers (fifty) of each of the three species of iris. This can be clearly seen in Figure 1. Having equal group sizes makes the data simpler to analyse, as no adjustment for unbalanced groups is needed. Figure 2 shows the distributions of sepal width, sepal length, petal width and petal length. Figure 2 (a), which describes sepal width, shows that this variable approximately follows a normal distribution. It is the only variable that appears to do so. Sepal length is quite irregular around the mean but is still similar to a normal distribution. The two petal variables are both quite irregular and require more in-depth study: both petal length and petal width seem to contain two distributions. Petal length has by far the largest variation about the mean, with a standard deviation of 1.76. Not surprisingly, especially after looking at Figure 2 (a) and Figure 3, sepal width has by far the smallest variation about the mean, with a standard deviation of 0.4.
Figure 3 is a boxplot of the four variables. The thick centre lines are the medians of each. The coloured boxes span the middle two quartiles (the interquartile range), and the horizontal lines at the top and bottom show the smallest and largest values that are not classed as outliers. The unfilled dots above and below sepal width are outliers. The boxplot is a more convenient way of viewing the range and general distribution of the variables than comparing several figures from the R output.
Multivariate
In this section, the analysis goes deeper and examines the relationships between the variables. Figure 4 gives a more holistic view of the underlying trends than Figure 2. It shows that, when the species are looked at individually, the variables are all much closer to normal distributions.
Figure 5 is a series of plots of the four variables against each other. This is the most descriptive graph available. It is clear from studying Figure 5 that the Iris setosa species has very different dimensions to the versicolor and virginica species. There seems to be a correlation between sepal length and sepal width for setosa. The other pairs of variables do not seem to show any proportionality for Iris setosa. For example, sepal length appears to increase without any increase in petal length. Iris versicolor and virginica are clearly a lot more similar to each other. Unlike Iris setosa, they do not seem to show a strong correlation between sepal length and sepal width. The two species follow similar correlations.
Parametric Relationships between Variables
Linear models for each pair of variables are available in the attached R output. The weakest correlation was between sepal length and sepal width, with a correlation coefficient of -0.12. The strongest was between petal length and petal width, with a correlation coefficient of 0.96. This relationship has the following linear model:
$y = 1.084 + 2.23x + \varepsilon$
where $x$ = petal width, $y$ = petal length and $\varepsilon$ = error. This is shown here as a demonstration; the other five linear models can be found in the attached R output.
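As a small illustration of how the fitted line can be used, the sketch below evaluates it for a few petal widths. It is written in Python purely for illustration (the original analysis used R) and assumes that the response is petal length and the predictor is petal width, both in centimetres. The function name `predict_petal_length` is hypothetical, and the error term is ignored.

```python
# Coefficients as quoted in the text. Which variable is the response is an
# assumption: petal length (cm) is taken to be modelled from petal width (cm).
INTERCEPT = 1.084
SLOPE = 2.23


def predict_petal_length(petal_width_cm: float) -> float:
    """Return the fitted petal length (cm) for a petal width (cm).

    The random error term of the model is ignored, so this is the value
    on the fitted line, not a prediction interval.
    """
    return INTERCEPT + SLOPE * petal_width_cm


if __name__ == "__main__":
    # Illustrative petal widths only; these are not rows from the dataset.
    for width in (0.2, 1.3, 2.0):
        fitted = predict_petal_length(width)
        print(f"petal width {width:.1f} cm -> fitted petal length {fitted:.2f} cm")
```

The slope can be read as roughly 2.2 cm of additional petal length for each additional centimetre of petal width.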
Model – Relative Importance of Variables
This section attempts to understand the relative importance of the four variables measured on the iris plants. A model was created to see which variables are the best predictors of iris species. This was done using a recursive partitioning (classification) tree. Such a model could be of much help to botanists when trying to determine the species of an iris.
These trees are a good way to see which of the four variables are best at determining iris species. They are read from the top down. For example, in Figure 6 (a), if the first condition is true (petal width less than 0.8), then there is a high probability that the plant is of the setosa species. If it is false (petal width of 0.8 or more), then the reader must go right. There, if the condition is true (petal width less than 1.75), the reader goes left and the plant has a high probability of being a versicolor. If it is false (petal width of 1.75 or more), the reader goes right and the plant has a high probability of being a virginica. Figure 6 (a) is the tree produced when only the petal width and petal length variables are available. As it does not make use of petal length, this suggests that petal length is not as useful as petal width in describing species. Figure 6 (b) is the tree produced when only sepal width and sepal length are available. It uses each of the two variables twice, so both are useful in determining iris species in this case.
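The decision rule described for Figure 6 (a) can be written out as a short function. This is a minimal sketch in Python (the original analysis used R) that hard-codes the two thresholds quoted above. The function name `classify_iris` is hypothetical, and the result is only the most probable species at each leaf, not a certainty.

```python
def classify_iris(petal_width_cm: float) -> str:
    """Return the most probable species under the Figure 6 (a) rule.

    The thresholds 0.8 and 1.75 are the split points quoted in the text.
    A value equal to a threshold fails the "less than" test and so
    follows the right-hand branch.
    """
    if petal_width_cm < 0.8:
        return "setosa"
    if petal_width_cm < 1.75:
        return "versicolor"
    return "virginica"


if __name__ == "__main__":
    # Illustrative petal widths only; these are not rows from the dataset.
    for width in (0.2, 1.3, 2.1):
        print(f"petal width {width} cm -> {classify_iris(width)}")
```

Because only one variable is tested, the whole tree reduces to two comparisons, which is part of what makes it easy to apply by hand.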
Figure 7 is the tree produced when all four variables are supplied. However, it only makes use of petal width, in the same way as Figure 6 (a). This suggests that petal width is the most descriptive single variable when trying to determine iris species.
The importance of the petal measurements is supported by a principal component analysis (PCA). This PCA divides the data up into four components, which can be seen in Figure 8. The first component accounts for 92% of the variance. Petal length accounts for 86% of this first component. From that, it is understood that petal length accounts for 79% of the overall variance. Figure 9 is the output from the PCA. Petal length is the longest arrow in this biplot because it accounts for the biggest portion of the variance. This indicates that it is the most important variable in terms of variance explained. Its arrow is almost horizontal because it contributes very little to the second component.
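The 79% figure is the product of the two proportions quoted above. As a worked step, reading the 86% as petal length's share of the first component, which is the interpretation used in the text:

```latex
\[
  \underbrace{0.92}_{\text{first component's share of variance}}
  \times
  \underbrace{0.86}_{\text{petal length's share of first component}}
  \approx 0.79
\]
```

This depends on how the 0.86 is read from the R output, which is not reproduced here. If it is a loading (an eigenvector coefficient) rather than a proportion, which is only an assumption, the share would instead be its square, $0.86^2 \approx 0.74$, and the product would be closer to $0.92 \times 0.74 \approx 0.68$.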
3. Section Two – Anscombe’s Quartet
This dataset is known as Anscombe’s Quartet. Francis John Anscombe used it to demonstrate the limitations of summary statistics. His 1973 article opened by criticising the following notions, which were conventional at the time:
1. Numerical calculations are exact, but graphs are rough.
2. For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis.
3. Performing intricate calculations is virtuous, whereas actually looking at the data is cheating.
(Anscombe, 1973)
The quartet consists of four sets of paired (X, Y) data which relate to each other in interesting ways.
Statistical Similarities
‘Say you were standing with one foot in the oven and one foot in an ice bucket. According to the percentage people, you should be perfectly comfortable.’ ~Bobby Bragan, 1963
If this data were described solely using summary statistics, it is likely that many incorrect conclusions would be drawn. It would appear that the four data subsets are very similar. When rounded, many key summary statistics are the same. These are:
• Mean of X and mean of Y in all four cases
• Variance of X and variance of Y
• Correlation coefficient of X and Y in all four cases
• Linear regression line of least squares
The exact figures are not relevant to this analysis but can be found in the attached output. There are also other similarities, such as the maximum and minimum of X being the same in three of the four cases.
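The similarities listed above can be checked directly. The following is a minimal sketch in plain Python (the original analysis used R) that hard-codes the quartet as published by Anscombe (1973) and computes each statistic from its definition. The names `QUARTET`, `summarise` and `SUMMARY` are hypothetical helpers introduced for this illustration.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Sets 1 to 3 share the same X values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Return means, sample variances, correlation and least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "intercept": mean_y - slope * mean_x,
        "slope": slope,
    }


SUMMARY = {label: summarise(x, y) for label, (x, y) in QUARTET.items()}

if __name__ == "__main__":
    for label, stats in SUMMARY.items():
        print(label, {name: round(value, 2) for name, value in stats.items()})
```

Rounded to two decimal places, each set has a mean of X of 9, a variance of X of 11, a mean of Y of 7.5, a correlation of about 0.82 and a fitted line of roughly Y = 3 + 0.5X.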
Plot
When the data is plotted, it paints a very different picture. There are large differences between the four plots.
Figure 10 (a): The first plot appears to be roughly normally scattered about a linear trend, so a linear least squares model is appropriate.
Figure 10 (b): The second plot appears to follow a negative quadratic function. If the series were to continue, it would be expected to descend. The linear model is not appropriate for this trend, but this cannot be seen without graphing the data.
Figure 10 (c): The third plot follows a perfectly straight line, except for one outlier.
Figure 10 (d): The fourth plot also contains an outlier, but of a very different kind. The rest of the data does not vary at all in the independent variable (X). If this were real data, these outliers would need to be investigated further to see if there is valid cause to remove them (e.g. incorrectly entered data).
What descriptive statistics are useful in seeing the differences in this data?
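One possible, partial answer to that question is to look at diagnostics that depend on individual points rather than on overall summaries. The sketch below, in plain Python (the original analysis used R), fits the least squares line to each set and reports the number of distinct X values and the observation with the largest absolute residual. The names `fit_line`, `diagnostics` and `DIAGNOSTICS` are hypothetical, and this is an illustration rather than the method used in the report.

```python
# Anscombe's quartet (Anscombe, 1973). Sets 1 to 3 share the same X values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def diagnostics(x, y):
    """Return point-level diagnostics that the summary statistics hide."""
    intercept, slope = fit_line(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Index of the observation lying furthest from the fitted line.
    worst = max(range(len(x)), key=lambda i: abs(residuals[i]))
    return {
        "distinct_x": len(set(x)),
        "worst_point": (x[worst], y[worst]),
        "worst_residual": residuals[worst],
    }


DIAGNOSTICS = {label: diagnostics(x, y) for label, (x, y) in QUARTET.items()}

if __name__ == "__main__":
    for label, result in DIAGNOSTICS.items():
        print(label, result)
```

For the third set, the largest residual belongs to the single point lying off the straight line. For the fourth set, the count of distinct X values is only two, which flags the problem. The residual of the isolated point itself is essentially zero there, because a line fitted to only two distinct X values passes through it, so residuals alone would not reveal it.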
4. Conclusion
In summary, these datasets are particularly interesting. They contain a large number of underlying trends that require in-depth study. The first dataset contains four variables which describe three species of iris flower. The petal measurements are by far the most important: petal length accounts for most of the variance and is strongly correlated with petal width, while petal width was the single best predictor of species in the tree models. The second dataset demonstrates that statistical analysis requires graphical representations as well as formulae for summary statistics. Without graphs, these figures are likely to be misinterpreted, which can be very costly. Both methods are needed to get a real representation of the data.
5. References
ANSCOMBE, F. J. 1973. Graphs in Statistical Analysis. The American Statistician, 27, 17-21.
FISHER, R. A. 1936. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179-188.