MOVIE RECOMMENDATIONS using COLLABORATIVE FILTERING
Lecture material for DTU course 02525
by ANDRIUS BUTKUS, DTU Informatics
The world of media today can be characterized by us being exposed to incredible amounts of content - new movies, songs, books and all the other media are released every day. Is this a problem? NO! This is great, since it gives us a variety we never had before! What is not so good is that it's hard for us to keep up with all the new stuff. So the problem with media today is how to find the good stuff. And I may add that, due to the huge amounts of media, we need it to be done automatically. This is where recommendation systems come into the picture...
In the next 2 weeks we will explore how recommendation systems produce recommendations. Even though we'll be talking about movie recommendations, keep in mind that the process itself is the same for music, books, cars, clothes, and everything else that can possibly be recommended.
The formulas are the same; I just thought it may be easier for us to relate to the results if they happen to be popular movies. Besides, these 2 weeks have to be fun, right?... You'll learn the math behind the most popular recommendation approach - Collaborative Filtering. You'll also try out several ways to calculate the similarity between people based on their movie taste, as well as several ways to combine multiple user ratings into one prediction. First we will do a small exercise (just 8 people and 4 movies) by hand, so that you can really see what is happening. Later we'll try a bigger example (60+ users and 50 movies) using the popular programming language Python. In the end you'll have to write a report (in groups of 5-6 people), and with several of you I'll have the pleasure to meet during an oral exam at the end of the semester :) So without further delay, let's start...
RECOMMENDATION APPROACHES - features or the users?
As I already pointed out, the main challenge is how to AUTOMATICALLY find movies that a user hasn't seen yet but would like to see. How do we know that the user would like to see them? Well, we don't. We can only estimate. We make our estimations and predictions based on several assumptions. One of those assumptions is this:

"People will like movies that are similar to the ones they have liked before"

This seems obvious enough, but what do we mean by SIMILAR movies? What makes two movies similar? If they are similar then they must share something, right? In general we could say that two movies are similar if they share a number of features. The features here could mean any type of information we could state about the movie: genre, director, year of production, actors, keywords, tags, etc...

As you can see, some features are more meaningful than others. What makes two movies more similar - if they share a year of production or a genre? Moreover, the movie genres are very vague categories with very unclear boundaries. User generated keywords and tags normally give us a good idea about what the movie is like. But even here one has to be careful - certain keywords (features) carry more information and meaning than others. We as humans can see it quite easily, but if we want the process to be done automatically (and we do want that), then we need to teach machines to pick the "meaningful" features from the whole pile and then compare two movies based on these "meaningful" features. This is what content-based filtering is all about. Its main problems remain the extraction of "meaningful" features of a movie and then the interpretation of them.

Another approach to predict what movies a person will like is called Collaborative Filtering. First of all it starts with a different assumption:

"People will like movies that other similar people have liked before"

Much like the first one, this assumption is quite logical and we have all seen it work in our own life experience. If we need to find out what movie to see next, we quite often end up asking our friends about it. You have probably also noticed that some of your friends have "better" movie taste than others - "better" here meaning that it's closer to your own. So far we have learned two things about collaborative filtering: first, we need a number of people who have seen something we haven't yet, so that we can use their opinions about the movie (ratings) to predict how much we'll love or hate that particular movie. Secondly, these people need to be similar to us in their movie taste. OK, now how can we see if two people have similar taste in movies? If you and your friend usually end up rating the same movies in a similar way, then that could be a good indication that you do indeed share a movie taste and therefore can be good advisors for each other when it comes down to recommending new movies. So the third assumption is this:

"People are similar if they rated the same movies in the same way"

So what do we need to do now? Two things: first, find people who are similar to us based on the way they rate movies, and second, somehow combine their ratings for the movies that we haven't seen yet - and by doing that we'll get our recommendations. This is the essence of Collaborative Filtering.

Collaborative Filtering has one major advantage over content-based filtering - its ability to work with very limited input. It doesn't require complicated features of a movie to be extracted; instead it needs knowledge of who bought what and (if available) who rated what. That's all it really needs. Of course this means that we cannot say why a certain movie was recommended to us - is it because it is a drama, or is it because Naomi Watts is in it? All we will get is "people who bought this also bought this" type of recommendations. You are all familiar with the Amazon.com way of recommending things.
COLLABORATIVE FILTERING - how does it work?
Now that you know the main idea behind Collaborative Filtering, let's get more technical and see what kind of math is hiding behind it.

As you already know, all that collaborative filtering needs as input is a matrix containing information about which customer bought/rated which movie. This is called the interaction matrix. As you can imagine, this matrix is mostly empty. Even if you think you've seen many movies, it's very few compared to the total number of movies available today.

An individual entry in the matrix is the rating R of the user u for the movie i. There are a number of rating schemes using different numbers of possible values - love/hate (Last.fm), 5 stars (Amazon.com), 10 stars (imdb), etc... To keep things simple we'll use a common 5 star rating, where 3 is the neutral value surrounded by two positive and two negative values.

The overall goal of the algorithm is to predict what values will be in the currently empty cells of the interaction matrix - to predict the rating R of a user u for an item i. If we can do that, then it's easy to simply pick the top n movies and present them to the user as a sorted list. As already mentioned, it's a two step process:

STEP 1: form the neighborhood
STEP 2: predict the ratings

The neighborhood in this case refers to a group of people who have a movie taste similar to yours. Since the neighborhood is formed among users (calculated from the user-user similarity matrix), such collaborative filtering is called "user-based".

Due to the dynamic nature of user profiles, both steps need to be performed "on-line" - this can be computationally very expensive and causes the scalability issue of such a system, meaning that while this method is good for small databases, it's too slow when we have millions of users and movies.

As a solution to this problem Amazon.com introduced "item-based" collaborative filtering. The only difference is that instead of calculating user-user similarities, we calculate item-item similarities, and then present the user with the "most similar" unseen movies to the ones that the user has rated highly.
The main advantage here is that item-item relationships are much less dynamic than user-user ones. Thus the first step (forming the neighborhood) may be calculated offline, and the scalability problem is solved. The difference in dynamics can be explained by the fact that every movie is normally rated by a much higher number of users than the number of movies seen by a single person. Adding a new user or movie into the system therefore has different effects on the user-user and item-item similarity matrices.
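To make the user-based vs. item-based distinction concrete, here is a tiny Python sketch with made-up toy data (the similarity function is just a stand-in; the actual measures are introduced later in these notes):

```python
from math import sqrt

# A toy interaction matrix: rows are users, columns are movies.
interactions = {
    "user1": [5, 3, 4],
    "user2": [4, 2, 5],
    "user3": [1, 5, 2],
}

def similarity(p, q):
    # Stand-in measure (normalized Euclidean); Euclidean distance and
    # Pearson correlation are introduced properly further below.
    return 1 / (1 + sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))

# User-based CF compares rows (users) of the interaction matrix...
print(similarity(interactions["user1"], interactions["user2"]))

# ...item-based CF compares columns (movies) of the very same matrix.
columns = list(zip(*interactions.values()))  # transpose
print(similarity(columns[0], columns[2]))
```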
EXERCISE ONE - should Andrius see "District 9"?

Let's start the practical part of the lecture by calculating the Euclidean distance using the example below. This is the example of collaborative filtering that we'll stick to today. It is oversimplified, since we only have 8 users and the 4 movies they share. The example will show you the basic principles and math behind collaborative filtering, which can later be applied to much bigger data sets.
Imagine a user called Andrius. He has rated 4 movies in the following way:
             Dark Knight   The Fall   Saw   Love Actually
Andrius           3            5       3          5

There is also another movie that Andrius hasn't seen yet - "District 9". Now the question is:

How much will Andrius like "DISTRICT 9"? The answer has to be a rating in the range of 1 to 5, since that's the rating scheme we are using here.

To answer that question we need several other users who have rated both "District 9" and the four movies Andrius has seen. This is all the input we need:

             Dark Knight   The Fall   Saw   Love Actually   District 9
Michael           5            4       1          4              3
Christian         5            3       4          2              4
Gitte             2            4       1          5              5
Andrius           3            5       3          5              ?
Emilie            4            3       5          1              4
Sofie             5            2       3          3              4
Isabel            2            1       1          2              2
Wivi              3            3       1          5              2
...ready?....
STEP ONE - selecting the neighbors

When it comes down to calculating similarities between a set of items (be it users or movies), there are a number of different methods to do this. We'll look at two of them - the Euclidean Distance and the Pearson Correlation.
The Euclidean Distance

The Euclidean distance is probably the simplest way to calculate similarities between users. It takes the movies that the selected set of users have all rated and uses them as axes in a space. Users are mapped into the space based on their ratings of the selected movies. The similarity is then expressed as the Euclidean distance between people in this space. As you can guess, such a space will have as many dimensions as the number of movies we take.
The formula is basically just one rating minus another, with the result made positive, since a distance can never be negative. If we use only one movie, then the space will be one-dimensional and will have only one axis x. The distance between two people in such a space, P = (p_x) and Q = (q_x), can be calculated using this equation:

$$\sqrt{(p_x - q_x)^2} = |p_x - q_x|$$

If we have a two-dimensional space, P = (p_x, p_y) and Q = (q_x, q_y), then the formula turns into this:

$$\sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}$$

You can see where this is going, right? So if we have an N-dimensional space, P = (p_1, p_2, ..., p_N) and Q = (q_1, q_2, ..., q_N), the formula will look like this:

$$\sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_N - q_N)^2}$$
You get the idea...
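If you later want to check your hand calculations in Python, a minimal sketch of this distance could look like the following (the function name and the list-of-ratings representation are just choices made for this sketch):

```python
from math import sqrt

def euclidean_distance(p, q):
    """Euclidean distance between two users, each given as a list of
    ratings for the same movies (one rating per shared movie)."""
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Andrius vs. Gitte over all four shared movies:
print(euclidean_distance([3, 5, 3, 5], [2, 4, 1, 5]))  # ~2.45
```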
As you saw on the previous page, the Euclidean distance can be calculated between two points in an N-dimensional space. In our case N = 4, since we have ratings for four movies that we can use. However, it is problematic to effectively visualize a space with more than 3 dimensions. To make it easier to see, let's take only two movies ("Dark Knight" and "The Fall") and map all 8 users into such a 2D space based on their ratings.
[Figure: the 8 users plotted in a 2D space, with the "Dark Knight" rating on the x axis and "The Fall" rating on the y axis. The Euclidean distance between Andrius and Gitte in this space is 1.41.]
Now if we want to calculate the similarity between two persons in this space, we can use the formula given above. Let's take Andrius and Gitte as an example. Using only the first two movies (Andrius rated them 3 and 5, Gitte 2 and 4), the distance between them is:

$$\sqrt{(3 - 2)^2 + (5 - 4)^2} = \sqrt{2} \approx 1.41$$

See the figure above. This distance should be taken literally; you can imagine it as meters, for instance. It means that the smaller the number, the closer two people are. So the range of values is basically from 0 to infinity. This is great if we want to visualize the distance, but if we want to use the distance as a weighting function in our calculations, we have to normalize it - to make it fit between 0 and 1 (since it's a metric space there are no negative values).
How do we do that?... A simple way is to take the existing formula, add 1 to it (so that we don't get a division by zero) and invert it:

$$\frac{1}{1 + \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}}$$

We get that the similarity between Andrius and Gitte is 1/(1 + 1.41) = 0.41. In this case, the closer the value gets to 1, the more similar the users are. Now you can try to use all 4 movies to calculate the similarity between Andrius and Christian. If we use this method to calculate similarities between every pair of users in the interaction matrix (using all four movies), we get a user-user similarity matrix that looks like this:
            Michael  Christian  Gitte  Andrius  Emilie  Sofie  Isabel  Wivi
Michael       1.00      0.21     0.24    0.24     0.16   0.25    0.18   0.29
Christian     0.21      1.00     0.16    0.19     0.37   0.37    0.18   0.18
Gitte         0.24      0.16     1.00    0.29     0.14   0.18    0.19   0.41
Andrius       0.24      0.19     0.29    1.00     0.17   0.20    0.15   0.26
Emilie        0.16      0.37     0.14    0.17     1.00   0.24    0.17   0.15
Sofie         0.25      0.37     0.18    0.20     0.24   1.00    0.21   0.22
Isabel        0.18      0.18     0.19    0.15     0.17   0.21    1.00   0.21
Wivi          0.29      0.18     0.41    0.26     0.15   0.22    0.21   1.00
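If you would rather let Python do this bookkeeping, here is a minimal sketch (the `ratings` dictionary layout is a choice made for this sketch; it uses only the four movies everyone has rated):

```python
from math import sqrt

# Ratings for [Dark Knight, The Fall, Saw, Love Actually]
ratings = {
    "Michael":   [5, 4, 1, 4],
    "Christian": [5, 3, 4, 2],
    "Gitte":     [2, 4, 1, 5],
    "Andrius":   [3, 5, 3, 5],
    "Emilie":    [4, 3, 5, 1],
    "Sofie":     [5, 2, 3, 3],
    "Isabel":    [2, 1, 1, 2],
    "Wivi":      [3, 3, 1, 5],
}

def euclidean_similarity(p, q):
    """Normalized Euclidean similarity: 1 / (1 + distance), in (0, 1]."""
    return 1 / (1 + sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))

# Everybody's similarity to Andrius, most similar first:
sims = {u: euclidean_similarity(ratings["Andrius"], r)
        for u, r in ratings.items() if u != "Andrius"}
for user, s in sorted(sims.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{user}: {s:.2f}")  # Gitte 0.29, Wivi 0.26, Michael 0.24, ...
```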
What we get at the end of STEP 1 is a user-user similarity matrix. This matrix shows us which users are similar to Andrius. The similarity measure is based on the Euclidean distance and takes values from 0 to 1. The next thing to do is to rank the users in terms of their similarity to Andrius and select the most similar ones. This could be done in two ways:

ONE way is to select the N most similar users. This ensures that we have a sufficient number of "similar" users to make predictions, but we don't have control over how similar they are to Andrius.

ANOTHER way would be to set a certain threshold of similarity and then take everybody above that mark. This ensures that our selected users are sufficiently similar to Andrius, but we can't ensure that we have enough users to make quality predictions.

[Figure: bar chart of the similarities to Andrius, sorted in descending order: Gitte, Wivi, Michael, Sofie, Emilie, Isabel.]
In our example, let's take the top 3 users, or everybody above a similarity of 0.2.

The main problem with the Euclidean distance is that it doesn't take into consideration the fact that two users may simply have different rating habits. Look at Isabel. Her ratings are 2, 1, 1, 2. It seems that she never gives higher values than that. So maybe a rating of 2 is quite high for her, while for other users it would be quite low. Notice that the similarity between Isabel and Michael is 0.18. That's not much, even though we can see that in principle they both sort movies in a very similar way - they just use different ratings for them. The Euclidean distance takes movie ratings one by one and "compares" them separately. It doesn't see the big picture. Sometimes people's individual ratings are different but still share general tendencies and patterns. One way to capture such patterns is to measure how well two sets of data fit on a straight line. This is known as...
...the Pearson Correlation

A slightly more sophisticated way to determine the similarity between people's preferences is to use the Pearson correlation coefficient. The formula is a bit more complicated than the Euclidean distance, but it tends to give better results in situations where the data isn't well normalized - like in the example of Michael vs. Isabel. Take a look at the formula below. Notice that all we need as input are the individual ratings of each user for each movie, and the average rating of each user. Since the formula is quite easy to calculate by hand, we'll try to do that for two users (let's stick to Andrius and Gitte). The similarity score we get ranges from -1 to 1, and we can use it directly (it doesn't need to be normalized). A correlation of 1 shows that two users are as similar as possible. A correlation of -1 shows them being completely opposite (which is a useful thing to know as well).
$$sim(a,m) = \frac{\sum_{i \in I} (R_{a,i} - \bar{R}_a)(R_{m,i} - \bar{R}_m)}{\sqrt{\sum_{i \in I} (R_{a,i} - \bar{R}_a)^2}\,\sqrt{\sum_{i \in I} (R_{m,i} - \bar{R}_m)^2}}$$

where:
sim(a,m) - similarity between users a and m
R_{a,i} - rating of user a for item i
R_{m,i} - rating of user m for item i
\bar{R}_a - average rating of user a
\bar{R}_m - average rating of user m
I - the set of items rated by both users
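A minimal Python sketch of this formula, written out term by term rather than using a library call, so you can follow it against the equation above:

```python
from math import sqrt

def pearson(a, m):
    """Pearson correlation between two users' rating lists over the
    movies they have both rated (lists aligned movie-by-movie)."""
    mean_a = sum(a) / len(a)
    mean_m = sum(m) / len(m)
    dev_a = [r - mean_a for r in a]
    dev_m = [r - mean_m for r in m]
    numerator = sum(x * y for x, y in zip(dev_a, dev_m))
    denominator = (sqrt(sum(x * x for x in dev_a))
                   * sqrt(sum(y * y for y in dev_m)))
    # Note: the denominator is zero if a user rates every movie the same.
    return numerator / denominator

# Andrius vs. Gitte over the four shared movies:
print(round(pearson([3, 5, 3, 5], [2, 4, 1, 5]), 2))  # 0.95
```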
If we want to visualize the similarity between Andrius and Gitte using the Pearson Correlation Score, we have to plot both of their ratings on two axes (one axis for each person) and then see how well they are fitted by a straight line.
Here you see two examples of the Pearson correlation. Both graphs have 4 points because we have 4 rated movies. The mapping itself is pretty straightforward - simply plot each movie in the space, using, for example, Andrius's rating for it as the Y coordinate and Gitte's as the X coordinate. The red line that you see is called the best-fit line, because it comes as close to all the movies on the chart as possible.

[Figure: two scatter plots of rating pairs with best-fit lines. Left: Andrius against Gitte, sim(a,m) = 0.95. Right: Emilie against Andrius, sim(a,m) = -0.85.]

Take a look at the two graphs above. The correlation depends on two things here: first, how well the data fits on a straight line, and second, the slope of that line (positive or negative). The first contributes to the strength of the correlation, while the second determines whether it is positive or negative. How would the red line look if the ratings were identical?

The interesting aspect of the Pearson Correlation is that it takes into consideration the general rating tendencies of two users, and can produce a high correlation even if not a single rating is actually the same. Notice how Michael and Isabel are now similar to each other, with a correlation of 0.67.

The correlation between two datasets is a number. But how do we interpret it? What is "good" and "bad" here? The truth is that there is no universal answer - it all depends on the application. Sometimes even a correlation of 0.9 is not good enough. If we are talking about social networks with many users who rate things, then almost any positive correlation is already an indication that users are similar enough for this information to be useful. Now try to manually calculate the Pearson Correlation Score for Andrius and Sofie. In the table below you see the rest of the correlations already calculated.
            Michael  Christian  Gitte  Andrius  Emilie  Sofie  Isabel  Wivi
Michael       1.00      0.00     0.53    0.33    -0.51   0.38    0.67   0.71
Christian     0.00      1.00    -0.85   -0.89     0.83   0.72    0.00  -0.63
Gitte         0.53     -0.85     1.00    0.95    -0.96  -0.44    0.32   0.89
Andrius       0.33     -0.89     0.95    1.00    -0.85  -0.69    0.00   0.71
Emilie       -0.51      0.83    -0.96   -0.85     1.00   0.27   -0.51  -0.96
Sofie         0.38      0.72    -0.44   -0.69     0.27   1.00    0.69   0.00
Isabel        0.67      0.00     0.32    0.00    -0.51   0.69    1.00   0.71
Wivi          0.71     -0.63     0.89    0.71    -0.96   0.00    0.71   1.00
STEP TWO - calculating the prediction

After we get the user-user similarity table (no matter which similarity metric we choose to use), we want to select the users most similar to us. In our example the top 3 neighbors for Andrius are still the same (see the table below). Now that we have selected the neighbors and have their ratings for the movie "District 9", how do we combine these three numbers to produce a recommendation for Andrius?
Top 3 nearest neighbors   Rating for       Similarity
for Andrius               "District 9"     to Andrius
Gitte                          5              0.95
Wivi                           2              0.71
Michael                        3              0.33
There are several ways to do this:
1. First of all, we could just take a simple average of the three ratings. The obvious drawback of this average is that it ignores the fact that some users are more similar to us than others. In other words, it treats everybody the same.

$$pred(a,i) = \frac{\sum_{u \in NN} R_{u,i}}{k}$$

where NN is the set of selected nearest neighbors and k is their number.

2. In order to take the user-user similarity into consideration, we need to multiply each rating by the similarity index - which we choose as our weighting function. Then we need to divide everything not by the number of users (3) but by the sum of their individual similarities to Andrius.

$$pred(a,i) = \frac{\sum_{u \in NN} R_{u,i} \cdot sim(a,u)}{\sum_{u \in NN} sim(a,u)}$$

3. Until now we still haven't taken into consideration that all 4 people (the 3 neighbors and Andrius) may have different rating habits. To take differences in rating habits into account, we need to adjust our prediction by including the average rating of each user in the equation. Here's how it looks:

$$pred(a,i) = \bar{R}_a + \frac{\sum_{u \in NN} (R_{u,i} - \bar{R}_u) \cdot sim(a,u)}{\sum_{u \in NN} sim(a,u)}$$
Now calculate the prediction for Andrius using all three methods. Can you see the difference? Does it make sense?
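If you want to verify your hand calculations afterwards, here is a sketch of the three combination methods (the data comes from the table above; note that computing each user's average over the four commonly rated movies is just one defensible choice - including "District 9" in the neighbors' averages would be another):

```python
# Top-3 neighbors of Andrius: (name, similarity, rating for "District 9",
# average rating over the four commonly rated movies)
neighbors = [
    ("Gitte",   0.95, 5, (2 + 4 + 1 + 5) / 4),
    ("Wivi",    0.71, 2, (3 + 3 + 1 + 5) / 4),
    ("Michael", 0.33, 3, (5 + 4 + 1 + 4) / 4),
]
andrius_mean = (3 + 5 + 3 + 5) / 4
sim_sum = sum(s for _, s, _, _ in neighbors)

# 1) Simple average of the neighbors' ratings
simple = sum(r for _, _, r, _ in neighbors) / len(neighbors)

# 2) Similarity-weighted average
weighted = sum(s * r for _, s, r, _ in neighbors) / sim_sum

# 3) Mean-adjusted, similarity-weighted prediction
adjusted = andrius_mean + sum(
    s * (r - mean) for _, s, r, mean in neighbors) / sim_sum

print(f"{simple:.2f} {weighted:.2f} {adjusted:.2f}")
```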
EXERCISE TWO - with your own ratings
here’s the plan:
PREPARATION PHASE
1. work in groups
2. in the group, select 6 movies of your choice
3. each person rates all 6 of them (5 star rating scheme)
4. now you have 1 interaction matrix per group, all filled with ratings
5. each of you discards one movie of your choice (pretend that you haven't seen it)
6. now you have one interaction matrix per person, with one empty cell
7. the task for each of you is to calculate the predicted rating for the "missing movie"

CALCULATION PHASE
1. calculate the user-user similarities using the Euclidean distance
2. select the top 3 closest neighbors to you
3. calculate the user-user similarities using the Pearson Correlation
4. select the top 3 closest neighbors to you
5. each of you calculates the predictions for your one "missing movie"
6. use all three ways of combining the ratings of your neighbors (a compact Python sketch for checking the calculations follows after this plan)
7. now you have 6 predicted ratings: 3 using the Euclidean distance and 3 using the Pearson Correlation

DISCUSSION PHASE
1. discuss in the group what seems to work best in your case
2. what weaknesses can you identify in the Collaborative Filtering approach?
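As promised in the plan, here is a compact, self-contained sketch that bundles the earlier pieces so your group can sanity-check the hand calculations (all names and the data layout are choices made for this sketch):

```python
from math import sqrt

def euclidean_sim(p, q):
    """Normalized Euclidean similarity: 1 / (1 + distance)."""
    return 1 / (1 + sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))

def pearson(p, q):
    """Pearson correlation of two aligned rating lists."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    dp = [r - mp for r in p]
    dq = [r - mq for r in q]
    den = sqrt(sum(x * x for x in dp)) * sqrt(sum(y * y for y in dq))
    return sum(x * y for x, y in zip(dp, dq)) / den

def predict(target, others, sim, k=3):
    """Mean-adjusted prediction of the target's missing rating.

    target: the target user's known ratings.
    others: list of (known_ratings, rating_for_missing_movie) pairs,
            with known_ratings aligned to the target's movies.
    sim:    similarity function (euclidean_sim or pearson).
    Assumes the top-k similarities are positive.
    """
    # Keep the k neighbors with the highest similarity to the target.
    scored = sorted(((sim(target, known), miss, sum(known) / len(known))
                     for known, miss in others), reverse=True)[:k]
    target_mean = sum(target) / len(target)
    sim_sum = sum(s for s, _, _ in scored)
    return target_mean + sum(
        s * (miss - mean) for s, miss, mean in scored) / sim_sum

# Example with the lecture's data, predicting Andrius's "District 9" rating:
andrius = [3, 5, 3, 5]
others = [([5, 4, 1, 4], 3), ([5, 3, 4, 2], 4), ([2, 4, 1, 5], 5),
          ([4, 3, 5, 1], 4), ([5, 2, 3, 3], 4), ([2, 1, 1, 2], 2),
          ([3, 3, 1, 5], 2)]
print(round(predict(andrius, others, pearson), 2))
```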
THANK YOU FOR READING
andrius.ab@gmail.com