##                     rhs   support confidence
## [1] {LANGUAGE=English} 0.6110308  0.9456204
## [2] {LANGUAGE=English} 0.5098410  0.8847935
## [3] {LANGUAGE=English} 0.5609919  0.8813767
## [4] {LANGUAGE=English} 0.8666741  0.8666741
## [5] {LANGUAGE=English} 0.5207384  0.8611622
##          lift
## [1] 1.0910911
## [2] 1.0209069
## [3] 1.0169644
## [4] 1.0000000
## [5] 0.9936402
Exercises
Try the following exercises to become more comfortable with the concepts discussed in this chapter.
Dealing with Missing Data in the HouseVotes84 Data
In the PCA analysis, we translated missing data into 0.5. We did this to move the analysis along, but it was probably not an appropriate decision. People who do not cast a vote are not necessarily undecided and therefore equally likely to vote yea or nay; there can be conflicts of interest or other reasons for abstaining. So we should instead translate each column into three binary columns.
You can use the transmute() function from dplyr to add new columns and remove the old ones; it is a bit of typing since you have to do it 16 times, but it will get the job done.
If you feel more like coding your way out of this transformation, look at the mutate_at() function from dplyr. You can combine it with column name matches and multiple functions to build the three binary vectors (for the ifelse() calls, remember that comparing with NA always gives you NA, so you need to check for that first). After you have created the new columns, you can remove the old ones using select() combined with matches().
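For a single column, the pattern could look like the sketch below. It assumes the HouseVotes84 data comes from the mlbench package, with vote columns V1 through V16; the new column names are my own choices.

library(dplyr)
data(HouseVotes84, package = "mlbench")

# Sketch for one column; the same pattern repeats for V1 through V16.
# Check for NA first, since comparing with NA always gives NA.
votes <- HouseVotes84 %>%
  mutate(V1_yea     = ifelse(is.na(V1), 0, ifelse(V1 == "y", 1, 0)),
         V1_nay     = ifelse(is.na(V1), 0, ifelse(V1 == "n", 1, 0)),
         V1_missing = ifelse(is.na(V1), 1, 0)) %>%
  select(-V1)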
Try to do the transformation and then the PCA again. Does anything change?
Rescaling for k-Means Clustering
Use the scale() function to rescale the iris dataset, then redo the k-means clustering analysis.
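A sketch of how the exercise could start, assuming three centers as in the earlier analysis:

# Rescale the four numeric measurement columns, then cluster as before.
iris_scaled <- scale(iris[, 1:4])
clusters <- kmeans(iris_scaled, centers = 3)
table(iris$Species, clusters$cluster)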
Varying k
Analyze the iris data with kmeans() with k ranging from 1 to 10. Plot the clusters for each k, coloring the data points according to the clustering.
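One possible sketch, plotting the petal measurements and coloring by cluster assignment (the choice of axes is mine):

library(ggplot2)

# Cluster with each k from 1 to 10 and plot the assignments.
for (k in 1:10) {
  clusters <- kmeans(iris[, 1:4], centers = k)
  p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
    geom_point(aes(colour = factor(clusters$cluster))) +
    ggtitle(paste("k =", k))
  print(p)
}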
Project 1
To see a data analysis in action, I use an analysis that my student, Dan Søndergaard, did the first year I held the data science class. I am redoing his analysis here with his permission.
The data contains physicochemical features measured from Portuguese Vinho Verde wines, and the goal was to try to predict wine quality from these measurements. The data is available from https://archive.ics.uci.edu/ml/datasets/Wine+Quality.
Importing Data
If we go to the data folder, we can see that the data is split into three files: the measurements from the red wines, the measurements from the white wines, and a description of the data (the file winequality.names). To avoid showing large URLs, I will not list the code for reading the files, but it is of this form:
read.table(URL, header=TRUE, sep=';')
We can tell from looking at the files that there is a header describing the columns and that the fields are separated by semicolons.
We load the red and white wine data into separate data frames called red and white.
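The URLs are deliberately not listed in the text; a sketch of the loading step, assuming the standard file locations in the UCI repository, might look like this:

# Assumed UCI file locations; adjust if the repository layout has changed.
base <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality"
red   <- read.table(paste(base, "winequality-red.csv", sep = "/"),
                    header = TRUE, sep = ";")
white <- read.table(paste(base, "winequality-white.csv", sep = "/"),
                    header = TRUE, sep = ";")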
We can combine the two data frames using this:
wines <- rbind(data.frame(type = "red", red), data.frame(type = "white", white))
Then we’ll see the summary:
summary(wines)
##     type      fixed.acidity    volatile.acidity
##  red  :1599   Min.   : 3.800   Min.   :0.0800
##  white:4898   1st Qu.: 6.400   1st Qu.:0.2300
##               Median : 7.000   Median :0.2900
##               Mean   : 7.215   Mean   :0.3397
##               3rd Qu.: 7.700   3rd Qu.:0.4000
##               Max.   :15.900   Max.   :1.5800
##   citric.acid     residual.sugar
##  Min.   :0.0000   Min.   : 0.600
##  1st Qu.:0.2500   1st Qu.: 1.800
##  Median :0.3100   Median : 3.000
##  Mean   :0.3186   Mean   : 5.443
##  3rd Qu.:0.3900   3rd Qu.: 8.100
##  Max.   :1.6600   Max.   :65.800
##    chlorides       free.sulfur.dioxide
##  Min.   :0.00900   Min.   :  1.00
##  1st Qu.:0.03800   1st Qu.: 17.00
##  Median :0.04700   Median : 29.00
##  Mean   :0.05603   Mean   : 30.53
##  3rd Qu.:0.06500   3rd Qu.: 41.00
##  Max.   :0.61100   Max.   :289.00
##  total.sulfur.dioxide    density
##  Min.   :  6.0        Min.   :0.9871
##  1st Qu.: 77.0        1st Qu.:0.9923
##  Median :118.0        Median :0.9949
##  Mean   :115.7        Mean   :0.9947
##  3rd Qu.:156.0        3rd Qu.:0.9970
##  Max.   :440.0        Max.   :1.0390
##        pH           sulfates         alcohol
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50
##  Median :3.210   Median :0.5100   Median :10.30
##  Mean   :3.219   Mean   :0.5313   Mean   :10.49
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90
##     quality
##  Min.   :3.000
##  1st Qu.:5.000
##  Median :6.000
##  Mean   :5.818
##  3rd Qu.:6.000
##  Max.   :9.000
There are 11 measurements for each wine, and each wine has an associated quality score based on sensory data: at least three wine experts judged each wine and scored it on a scale from 0 to 10. No wine achieved a score below 3 or above 9. There are no missing values. There is not really any measurement that we would want to translate into categorical data. The quality scores are given as discrete values, but they are ordered categories, and we might as well consider them numerical values for now.
Exploring the Data
With the data loaded, we first want to do some exploratory analysis to get a feeling for it.
Distribution of Quality Scores
The first thing Dan did was look at the distribution of quality scores for both types of wine, as shown in Figure 7-17.
ggplot(wines) +
  geom_bar(aes(x = factor(quality), fill = type), position = 'dodge') +
  xlab('Quality') +
  ylab('Frequency')
Figure 7-17. Distribution of wine qualities
There are very few wines with extremely low or high scores. The quality scores also look approximately normally distributed, if we ignore that they are discrete. This might make the analysis easier.
Is This Wine Red or White?
The dataset has two types of wine: red and white. As Dan noticed, wine experts typically describe the two types in very different words, but several experiments have shown that even the best wine experts cannot distinguish red from white if the color is obscured or the experts are blindfolded (see http://io9.com/winetasting-is-bullshit-heres-why-496098276). It is therefore interesting to see whether the physicochemical features available in the data can help decide whether a wine is red or white.
Dan used the Naive Bayes method to explore this, so we need the e1071 package.
library(e1071)
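Before the cross-validation, a quick sketch of the e1071 interface on the full data (an illustration of the API, not Dan's analysis): naiveBayes() fits a model from a formula and a data frame, and predict() classifies new observations.

# Fit a Naive Bayes model predicting wine type from all other columns,
# then look at the confusion table on the training data.
model <- naiveBayes(type ~ ., data = wines)
predictions <- predict(model, newdata = wines)
table(wines$type, predictions)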
He used a five-fold cross-validation to study this, but I will just use the partition() function from Chapter 6.
random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length = n), c(0, cumsum(probs)),
                    rightmost.closed = TRUE)
  names(probs)[sample(g)]
}
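For context, a minimal sketch of how a partition() helper could use random_group() to split a data frame into named groups; this is a reconstruction under assumptions, not necessarily the exact code from Chapter 6.

# Split df into the groups named in probs, repeated n times.
partition <- function(df, n, probs) {
  replicate(n, split(df, random_group(nrow(df), probs)), simplify = FALSE)
}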