
Chapter 7 ■ Unsupervised Learning

Figure 7-2. Plot of iris petal length versus petal width


It does look as if we should be able to distinguish the species. Setosa stands out on both plots, but Versicolor and Virginica overlap on the first.
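For reference, a plot along the lines of Figure 7-2 can be produced with ggplot2 roughly like this (a sketch, not necessarily the code originally used for the figure):

library(dplyr)    # for the %>% pipe
library(ggplot2)

# Petal length against petal width, coloured by species
iris %>%
  ggplot() +
  geom_point(aes(x = Petal.Length, y = Petal.Width, colour = Species))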

Since this is such a simple dataset, and since there is obviously structure if we just plot a few dimensions against each other, this is not a case where we would usually pull out the cannon that is PCA, but this is a section on PCA so we will.

Since PCA only works on numerical data, we need to remove the Species column, but after that, we can do the transformation using the prcomp() function:

pca <- iris %>% select(-Species) %>% prcomp
pca
## Standard deviations:
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
##
## Rotation:
##                      PC1         PC2         PC3        PC4
## Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
## Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
## Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574

The object this produces contains several pieces of information about the result: the standard deviations tell us how much variance there is in each component, and the rotation tells us what the linear transformation between the original variables and the components is. If we plot the pca object, we will see how much of the variance in the data lies on each component, as shown in Figure 7-3.
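For reference, those pieces can also be pulled directly out of the object that prcomp() returns:

pca$sdev      # the standard deviation of each principal component
pca$rotation  # the rotation matrix, i.e. the linear transformation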


Figure 7-3. Plot of the variance on each principal component for the iris dataset

pca %>% plot

The first thing you want to look at after making the transformation is how the variance is distributed along the components. If the first few components do not contain most of the variance, the transformation has done little for you. When they do, there is some hope that plotting the first few components will tell you something about the data.
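One way to check this is to look at the variance of each component as a proportion of the total; a small sketch (exact numbers omitted here, but for the iris data the first component should carry the bulk of the variance):

# Proportion of the total variance captured by each component,
# and the running (cumulative) proportion
pca$sdev^2 / sum(pca$sdev^2)
cumsum(pca$sdev^2) / sum(pca$sdev^2)

# summary() reports the same information in a small table
summary(pca)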

To map the data to the new space spanned by the principal components, we use the predict() function:

mapped_iris <- pca %>% predict(iris)
mapped_iris %>% head
##            PC1        PC2         PC3          PC4
## [1,] -2.684126 -0.3193972  0.02791483  0.002262437
## [2,] -2.714142  0.1770012  0.21046427  0.099026550
## [3,] -2.888991  0.1449494 -0.01790026  0.019968390
## [4,] -2.745343  0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858

This can also be used with new data that wasn’t used to create the pca object. Here, we just give it the same data we used before. We don’t actually have to remove the Species variable; it will figure out which of the columns to use based on their names. We can now plot the first two components against each other, as shown in Figure 7-4.

mapped_iris %>%
  as.data.frame %>%
  cbind(Species = iris$Species) %>%
  ggplot() +
  geom_point(aes(x = PC1, y = PC2, colour = Species))

Figure 7-4. Plot of first two principal components for the iris dataset


The mapped_iris object returned from the predict() function is not a data frame but a matrix. That won’t work with ggplot() so we need to transform it back into a data frame, and we do that with as.data.frame. Since we want to color the plot according to species, we need to add that information again—remember the pca object does not know about this factor data—so we do that with cbind(). After that, we plot.
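Because predict() also works on observations that were not used to fit the PCA, we can map genuinely new measurements into the same space. A minimal sketch, where the new_flower values are simply made up for illustration:

# A hypothetical new observation (values invented for illustration)
new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.5,
                         Petal.Length = 1.5, Petal.Width = 0.25)
pca %>% predict(new_flower)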

We didn’t gain much from this; there was about as much information in the original columns as there is in the transformed data. But now that we have seen PCA in action, we can try it out on a slightly more interesting example.

We will look at the HouseVotes84 data from the mlbench package:

library(mlbench)
data(HouseVotes84)
HouseVotes84 %>% head
##        Class   V1 V2 V3   V4   V5 V6 V7 V8 V9 V10
## 1 republican    n  y  n    y    y  y  n  n  n   y
## 2 republican    n  y  n    y    y  y  n  n  n   n
## 3   democrat <NA>  y  y <NA>    y  y  n  n  n   n
## 4   democrat    n  y  y    n <NA>  y  n  n  n   n
## 5   democrat    y  y  y    n    y  y  n  n  n   n
## 6   democrat    n  y  y    n    y  y  n  n  n   n
##    V11  V12 V13 V14 V15  V16
## 1 <NA>    y   y   y   n    y
## 2    n    y   y   y   n <NA>
## 3    y    n   y   y   n    n
## 4    y    n   y   n   n    y
## 5    y <NA>   y   y   y    y
## 6    n    n   y   y   y    y

The data contains the votes cast by both republicans and democrats on 16 different proposals. The possible votes are yea, nay, and missing/unknown. Now, since votes are unlikely to be accidentally lost, missing data here means that someone actively decided not to vote, so it isn’t really missing; there is probably some information in that as well.

Now an interesting question we could ask is whether there are differences in voting patterns between republicans and democrats. We would expect that, but can we see it from the data?

The individual columns are binary (well, trinary if we consider the missing data as actually informative) and do not look very different between the two groups, so there is little information in each individual column. We can try doing a PCA on the data.

HouseVotes84 %>% select(-Class) %>% prcomp
## Error in colMeans(x, na.rm = TRUE): 'x' must be numeric

Okay, R is complaining that the data isn’t numeric. We know that PCA needs numeric data, but we are giving it factors. We need to change that so we can try to map the votes into zeros and ones.
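One way that mapping could look is sketched below; the vote_to_numeric helper is made up for illustration, and the missing votes are still NA afterwards, so they have to be dealt with before prcomp() will accept the data:

# Map y -> 1 and n -> 0; missing votes remain NA for now
vote_to_numeric <- function(x) ifelse(x == "y", 1, 0)

numeric_votes <- HouseVotes84 %>%
  select(-Class) %>%
  lapply(vote_to_numeric) %>%
  as.data.frame

numeric_votes %>% head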
