Figure 7-12. Clusters and species for iris for a bad clustering
If you go back and look at Figure 7-10 and think that some of the square points are closer to the center of the “triangular cluster” than to the center of the “square cluster”, or vice versa, you are right. Don't be too disturbed by this; two things are deceptive here. One is that the axes are not on the same scale, so a given visual distance along the x-axis corresponds to a larger difference than the same distance along the y-axis. The other is that the distances used to group data points are measured in the four-dimensional space of the original features, while the plot is a projection onto the two-dimensional plane of the first two principal components.
There is something to worry about concerning distances, though. The algorithm is based on the distance from cluster centers to data points, but if you have one axis in centimeters and another in meters, a distance along one axis is numerically a hundred times larger than the same distance along the other. This is not simply solved by representing all features in the same unit. First of all, that isn't always possible: there is no meaningful way of translating time or weight into a distance. And even when it is possible, what is being measured matters for which unit is appropriate. The height of a person is meaningfully measured in meters, but you would not want something like cell size to be measured in meters.
This is also an issue for principal component analysis. Obviously, a method that tries to create a vector space basis based on the variance in the data is going to be affected by the units used in the input data. The usual solution is to rescale all input features so they are centered at zero and have variance one. You subtract from each data point the mean of the feature and divide by the standard deviation. This means that measured in standard deviations, all dimensions have the same variation.
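To make this concrete, here is a small sketch (my own illustration, not code from the chapter) of standardizing a single iris feature by hand:

x <- iris$Sepal.Length
z <- (x - mean(x)) / sd(x)  # subtract the mean, divide by the standard deviation
mean(z)  # essentially zero, up to floating-point error
var(z)   # exactly one

The scale() function, which we return to below, does the same thing to every column of a matrix or data frame in one go.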
The prcomp() function takes parameters that handle this scaling for you. The parameter center, which defaults to TRUE, translates the data points to have mean zero, and the parameter scale. (notice the .), which defaults to FALSE, scales the data points to have variance one in all dimensions.
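As a sketch of what this looks like for the iris measurements (assuming the same four numerical columns used in the earlier examples):

# PCA with both centering and scaling switched on, so every feature
# contributes on the same scale.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component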
The kmeans() function does not take these parameters, but you can explicitly rescale a numerical data frame using the scale() function. I have left this as an exercise.
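If you want to try the exercise, a minimal sketch could look like the following (the column selection and the choice of three centers mirror the earlier iris example, and the seed is only there to make the random starting centers reproducible):

set.seed(42)                         # kmeans() starts from random centers
scaled_iris <- scale(iris[, 1:4])    # rescale to mean zero, variance one
scaled_clusters <- kmeans(scaled_iris, centers = 3)
table(iris$Species, scaled_clusters$cluster)

In the code below, the clusters object is assumed to be the result of a similar kmeans() call on the unscaled measurements earlier in the chapter.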
Now let's consider more formally how well the clustering predicts the species. This returns us to familiar territory: we can build a confusion matrix between species and clusters.
table(iris$Species, clusters$cluster)
## 
##               1  2  3
##   setosa     50  0  0
##   versicolor  0  2 48
##   virginica   0 36 14
One problem here is that the clustering doesn't know about the species, so even if there were a one-to-one correspondence between clusters and species, the confusion matrix would only be diagonal if the clusters and species happened to be listed in the same order.
We can associate each species with the cluster that most of its members are assigned to. This isn't a perfect solution (two species could be assigned to the same cluster this way, and then we still wouldn't be able to construct a sensible confusion matrix), but it will work for the case we consider here. We can count, for each species, how many of its observations fall in each cluster, and pick the cluster with the most, like this:
tbl <- table(iris$Species, clusters$cluster)
(counts <- apply(tbl, 1, which.max))
##     setosa versicolor  virginica 
##          1          3          2
We can then build a map from clusters to species and use it to get the confusion matrix like this:
map <- rep(NA, each = 3)
map[counts] <- names(counts)
table(iris$Species, map[clusters$cluster])
## 
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         2
##   virginica       0         14        36
A final word on k-means: since k is a parameter that needs to be specified, how do you pick it? Here we knew that there were three species, so we picked three for k as well. But how do we choose k when we don't know whether there is any clustering in the data to begin with, or whether there are many clusters? Unfortunately, there isn't a general answer to this. There are several rules of thumb, but there is no perfect solution you can always apply.
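To make one such rule of thumb concrete (this is the so-called elbow heuristic, my own sketch rather than something the chapter relies on), you can plot the total within-cluster sum of squares for a range of k values and look for the point where adding more clusters stops paying off:

# Total within-cluster sum of squares for k = 1..10; nstart restarts
# the algorithm several times to avoid a poor random initialization.
wss <- sapply(1:10, function(k)
  kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")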
Hierarchical Clustering
Hierarchical clustering is a technique you can use when you have a distance matrix of your data. The idea is to build up a tree structure of nested clusters by iteratively merging clusters. You start by putting each data point in its own singleton cluster. Then, iteratively, you find two clusters that are close together and merge them into a new cluster. You continue until all data points are in one large cluster. Different algorithms exist, and they mainly vary in how they choose which clusters to merge next.
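As a quick preview of what this looks like in R (a sketch using the base functions dist() and hclust(); the rescaling follows the earlier discussion of units):

iris_dist <- dist(scale(iris[, 1:4]))  # pairwise distances between observations
hc <- hclust(iris_dist)                # build the tree by iterative merging
plot(hc, labels = FALSE)               # the tree is usually drawn as a dendrogram

# Cutting the tree into three clusters gives a flat clustering we can
# compare to the species, just as we did for k-means.
table(iris$Species, cutree(hc, k = 3))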