Matrix Factorization for Recommendation Systems

21 minute read

What are we doing?

We are building a basic version of a low-rank matrix factorization recommendation system and we will use it on a dataset obtained from https://grouplens.org/datasets/movielens/. It has 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users.

Another common technique for building a recommendation system is the item-based collaborative filtering approach. Collaborative filtering methods that compute distance relationships between items or users are generally thought of as “neighborhood” methods, since they are centered on the idea of “nearness”.

These methods do not scale well to larger datasets. There is a conceptual issue with them as well: the ratings matrices may be overfit and noisy representations of user tastes and preferences. When we use distance-based “neighborhood” approaches on raw data, we match on sparse low-level details that we assume represent the user’s preference vector, instead of on the preference vector itself. It’s a subtle difference, but it’s important.

If I’ve listened to ten Breaking Benjamin songs and you’ve listened to ten different Breaking Benjamin songs, the raw user action matrix wouldn’t have any overlap. We’d have nothing in common, even though it seems pretty likely we share at least some underlying preferences. We need a method that can derive the tastes and preference vectors from the raw data.
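To see the problem concretely, here is a tiny sketch with made-up listening counts: two fans of the same band who happened to play disjoint sets of songs look completely dissimilar to a distance-based method.

import numpy as np

# Hypothetical raw action matrix rows: 20 songs as columns,
# each listener has played a disjoint set of 10 of them
me = np.array([1] * 10 + [0] * 10)
you = np.array([0] * 10 + [1] * 10)

# Cosine similarity on the raw vectors: exactly zero overlap
cosine = me.dot(you) / (np.linalg.norm(me) * np.linalg.norm(you))
print(cosine)  # 0.0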

Low-Rank Matrix Factorization is one of those methods.

Basics of Matrix Factorization for Recommendation Systems

All of the theoretical explanation has been taken from: http://nicolas-hug.com/blog/matrix_facto_1

The problem we need to solve is that of rating prediction. The data we have on hand is a rating history.

It would look something like this:

(figure: a sample rating matrix, with users as rows and movies as columns)

Our R matrix is a 99% sparse matrix, with the columns as the items (movies, in our case) and the rows as individual users.

We will factorize the matrix R. The matrix factorization is linked to SVD (Singular Value Decomposition). It’s a beautiful result of Linear Algebra. When people say Math sucks, show them what SVD can do.

But, before we move on to SVD, we should review PCA (Principal Component Analysis). It’s only slightly less awesome than SVD, but it’s still pretty cool.

A little bit of PCA

We’ll play around with the Olivetti dataset. It’s a set of greyscale images of faces from 40 people, making up a total of 400 images.

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
%matplotlib inline
faces = fetch_olivetti_faces()
print(faces.DESCR)
Modified Olivetti faces dataset.

The original database was available from

    http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html

The version retrieved here comes in MATLAB format from the personal
web page of Sam Roweis:

    http://www.cs.nyu.edu/~roweis/

There are ten different images of each of 40 distinct subjects. For some
subjects, the images were taken at different times, varying the lighting,
facial expressions (open / closed eyes, smiling / not smiling) and facial
details (glasses / no glasses). All the images were taken against a dark
homogeneous background with the subjects in an upright, frontal position (with
tolerance for some side movement).

The original dataset consisted of 92 x 112, while the Roweis version
consists of 64x64 images.

Here are the first 10 people:

# Here are the first ten guys of the dataset
fig = plt.figure(figsize=(10, 10))
for i in range(10):
    ax = plt.subplot2grid((1, 10), (0, i))
    
    ax.imshow(faces.data[i * 10].reshape(64, 64), cmap=plt.cm.gray)
    ax.axis('off')

(figure: the first ten faces of the Olivetti dataset)

Each image size is 64 x 64 pixels. We will flatten each of these images (we thus get 400 vectors, each with 64 x 64 = 4096 elements). We can represent our dataset in a 400 x 4096 matrix:

(figure: the 400 x 4096 matrix X, one flattened face per row)
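A quick sanity check on the faces bunch loaded above confirms those shapes (just a verification, not part of the pipeline):

print(faces.data.shape)    # (400, 4096): the flattened matrix
print(faces.images.shape)  # (400, 64, 64): the unflattened images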

PCA, which stands for Principal Component Analysis, is an algorithm that will reveal 400 of these guys:

# Let's compute the PCA
pca = PCA()
pca.fit(faces.data)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
# Now, the creepy guys are in the components_ attribute.
# Here are the first ten ones:

fig = plt.figure(figsize=(10, 10))
for i in range(10):
    ax = plt.subplot2grid((1, 10), (0, i))
    
    ax.imshow(pca.components_[i].reshape(64, 64), cmap=plt.cm.gray)
    ax.axis('off')

(figure: the first ten principal components, reshaped to 64 x 64 and displayed as images)

This is pretty creepy, right?

We call these guys the principal components (hence the name of the technique), and when they represent faces, as they do here, we call them eigenfaces. Some really cool stuff can be done with eigenfaces, such as face recognition or optimizing your Tinder matches! The reason they’re called eigenfaces is that they are in fact the eigenvectors of the covariance matrix of X.

We obtain 400 principal components here because the original matrix X has 400 rows (or more precisely, because the rank of X is 400). As you may have guessed, each of the principal components is in fact a vector with the same dimension as the original faces, i.e. it has 64 x 64 = 4096 pixels.
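If you want to verify both counts, here is a quick check on the pca object fitted above (a sanity check only; the rank computation just confirms the claim):

import numpy as np

print(pca.components_.shape)              # (400, 4096): 400 components, each with 4096 pixels
print(np.linalg.matrix_rank(faces.data))  # should print 400, matching the claim above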

# Reconstruction process

from skimage.io import imsave

face = faces.data[0]  # we will reconstruct the first face

# During the reconstruction process we are actually computing, at the kth frame,
# a rank k approximation of the face. To get a rank k approximation of a face,
# we need to first transform it into the 'latent space', and then
# transform it back to the original space

# Step 1: transform the face into the latent space.
# It's now a vector with 400 components. The kth component gives the importance
# of the kth  creepy guy
trans = pca.transform(face.reshape(1, -1))  # Reshape for scikit learn

# Step 2: reconstruction. To build the kth frame, we use all the creepy guys
# up until the kth one.
# Warning: this will save 400 jpg images.
for k in range(400):
    rank_k_approx = trans[:, :k].dot(pca.components_[:k]) + pca.mean_
    imsave('{:>03}'.format(str(k)) + '.jpg', rank_k_approx.reshape(64, 64))
E:\Software\Anaconda2\envs\py36\lib\site-packages\skimage\util\dtype.py:122: UserWarning: Possible precision loss when converting from float64 to uint8
  .format(dtypeobj_in, dtypeobj_out))

As far as we’re concerned, we will call these guys the creepy guys.

Now, one amazing thing about them is that they can build back all of the original faces. Take a look at this (these are animated gifs, about 10s long):

(animated gifs: rank-k reconstructions of the first ten faces, adding one principal component per frame)

Each of the 400 original faces (i.e. each of the 400 original rows of the matrix) can be expressed as a (linear) combination of the creepy guys. That is, we can express the first original face (i.e. its pixel values) as a little bit of the first creepy guy, plus a little bit of the second creepy guy, plus a little bit of third, etc. until the last creepy guy. The same goes for all of the other original faces: they can all be expressed as a little bit of each creepy guy.

Face 1 = $\alpha_1$ ⋅ Creepy guy #1 + $\alpha_2$ ⋅ Creepy guy #2 + … + $\alpha_{400}$ ⋅ Creepy guy #400

The gifs you saw above are the very translation of these math equations: the first frame of a gif is the contribution of the first creepy guy, the second frame is the contribution of the first two creepy guys, etc. until the last creepy guy.
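In code, the equation above looks like this for the first face, reusing the pca and faces objects from earlier (the weights $\alpha_k$ are exactly what pca.transform returns):

import numpy as np

weights = pca.transform(faces.data[0].reshape(1, -1))       # one alpha per creepy guy
reconstruction = weights.dot(pca.components_) + pca.mean_   # weighted sum of all 400 creepy guys

# With all 400 components, we recover the original face (up to floating point noise)
print(np.allclose(reconstruction, faces.data[0], atol=1e-3))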

Latent Factors

We’ve actually been kind of harsh towards the creepy guys. They’re not creepy, they’re typical. The goal of PCA is to reveal typical vectors: each of the creepy/typical guys represents one specific aspect underlying the data. In an ideal world, the first typical guy would represent (e.g.) a typical elderly person, the second typical guy would represent a typical glasses wearer, and other typical guys would represent concepts such as smiley, sad-looking, big nose, stuff like that. And with these concepts, we could define a face as more or less elderly, more or less glassy, more or less smiling, etc. In practice, the concepts that PCA reveals are really not that clear: there is no clear semantic that we could associate with any of the creepy/typical guys that we obtained here. But the important fact remains: each of the typical guys captures a specific aspect of the data. We call these aspects the latent factors (latent, because they were there all the time; we just needed PCA to reveal them). Using barbaric terms, we say that each principal component (the creepy/typical guys) captures a specific latent factor.

Now, this is all good and fun, but we’re interested in matrix factorization for recommendation purposes, right? So where is our matrix factorization, and what does it have to do with recommendation? PCA is actually a plug-and-play method: it works for any matrix. If your matrix contains images, it will reveal some typical images that can build back all of your initial images, such as here. If your matrix contains potatoes, PCA will reveal some typical potatoes that can build back all of your original potatoes. If your matrix contains ratings, well… Here we come.

PCA on a (dense) rating matrix

Until stated otherwise, we will consider for now that our rating matrix R is completely dense, i.e. there are no missing entries. All the ratings are known. This is of course not the case in real recommendation problems, but bear with me.

PCA on the users

Here is our rating matrix, where rows are users and columns are movies:

(figure: the rating matrix R, users as rows and movies as columns)

Instead of having faces in the rows represented by pixel values, we now have users represented by their ratings. Just like PCA gave us some typical guys before, it will now give us some typical users, or rather some typical raters.

In an ideal world, we would obtain a typical action movie fan, a typical romance movie fan, a typical comedy fan, etc. In practice, the semantics behind the typical users are not clearly defined, but for the sake of simplicity we will assume that they are (it doesn’t change anything; this is just for intuition/explanation purposes).

Each of our initial users (Alice, Bob…) can be expressed as a combination of the typical users. For instance, Alice could be defined as a little bit of an action fan, a little bit of a comedy fan, a lot of a romance fan, etc. As for Bob, he could be more keen on action movies:

Alice = 10% Action fan + 10% Comedy fan + 50% Romance fan + …

Bob = 50% Action fan + 30% Comedy fan + 10% Romance fan + …

And the same goes for all of the users; you get the idea. (In practice, the coefficients are not necessarily percentages, but it’s convenient for us to think of them this way.)
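Here is a toy sketch of that idea with made-up ratings (nothing from MovieLens yet): PCA on a small dense rating matrix, where components_ plays the role of the typical raters and transform gives each user's mix of them.

import numpy as np
from sklearn.decomposition import PCA

# Made-up ratings: rows are users, columns are four movies
R_toy = np.array([
    [5.0, 4.5, 1.0, 1.5],   # Alice
    [1.0, 2.0, 5.0, 4.5],   # Bob
    [4.5, 5.0, 2.0, 1.0],   # Carol, whose tastes resemble Alice's
])

pca_toy = PCA(n_components=2)
user_mix = pca_toy.fit_transform(R_toy)

print(pca_toy.components_)  # the "typical raters"
print(user_mix)             # Alice and Carol land close together, far from Bob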

PCA on the movies

What would happen if we transposed our rating matrix? Instead of having users in the rows, we would now have movies, defined as their ratings:

(figure: the transposed rating matrix, movies as rows and users as columns)

In this case, PCA will not reveal typical faces or typical users, but typical movies (a typical action movie, a typical romance movie, and so on). Here again, we can associate a semantic meaning with each of the typical movies, and these typical movies can build back all of our original movies.

So what can SVD do for us? SVD is PCA on R and $R^T$, in one shot.

SVD will give you the two matrices U and M, at the same time. You get the typical users and the typical movies in one shot. SVD gives you U and M by factorizing R into three matrices. Here is the matrix factorization:

$R = M \Sigma U^T$

To be very clear: SVD is an algorithm that takes the matrix R as an input, and it gives you M, Σ and U, such that:

R is equal to the product $M \Sigma U^T$.

The columns of M can build back all of the columns of R (we already know this).

The columns of U can build back all of the rows of R (we already know this).

The columns of M are orthogonal, as well as the columns of U. I haven’t mentioned this before, so here it is: the principal components are always orthogonal. This is actually an extremely important feature of PCA (and SVD), but for our recommendation we actually don’t care (we’ll come to that).

Σ is a diagonal matrix (we’ll also come to that).

We can basically sum up all of the above points in one statement: the columns of M are an orthonormal basis that spans the column space of R, and the columns of U are an orthonormal basis that spans the row space of R. If this kind of phrasing works for you, great. Personally, I prefer to talk about creepy guys and typical potatoes.
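Those claims are easy to check numerically. Here is a quick NumPy sketch on a small random matrix (np.linalg.svd names the factors u, s and vt, which correspond to M, the diagonal of Σ, and $U^T$ in this post’s notation):

import numpy as np

R_small = np.random.rand(6, 4)
u, s, vt = np.linalg.svd(R_small, full_matrices=False)

print(np.allclose(R_small, u.dot(np.diag(s)).dot(vt)))  # R = M Sigma U^T
print(np.allclose(u.T.dot(u), np.eye(4)))               # columns of M are orthonormal
print(np.allclose(vt.dot(vt.T), np.eye(4)))             # columns of U are orthonormal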

The model behind SVD

When we compute and use the SVD of the rating matrix R, we are actually modeling the ratings in a very specific, and meaningful way. We will describe this modeling here.

For the sake of simplicity, we will forget about the matrix Σ: it is a diagonal matrix, so it simply acts as a scaling factor on M or $U^T$. Hence, we will pretend that it has been merged into one of the two other matrices. Our matrix factorization simply becomes:

$R = M U^T$

Now, with this factorization, let’s consider the rating of user u for item i, which we will denote $r_{ui}$:

(figure: $r_{ui}$ as the dot product of the $u$-th row of M and the $i$-th column of $U^T$)

Because of the way a matrix product is defined, the value of $r_{ui}$ is the result of a dot product between two vectors: a vector $p_u$, which is a row of M and is specific to the user u, and a vector $q_i$, which is a column of $U^T$ and is specific to the item i:

$r_{ui} = p_u \cdot q_i$

where ‘⋅’ stands for the usual dot product. Now, remember how we can describe our users and our items?

Alice = 10% Action fan + 10% Comedy fan + 50% Romance fan + …

Bob = 50% Action fan + 30% Comedy fan + 10% Romance fan + …

Titanic = 20% Action + 0% Comedy + 70% Romance + …

Toy Story = 30% Action + 60% Comedy + 0% Romance + …

Well, the values of the vectors $p_u$ and $q_i$ exactly correspond to the coefficients that we have assigned to each latent factor:

$p_{\text{Alice}} = (10\%, 10\%, 50\%, \dots)$

$p_{\text{Bob}} = (50\%, 30\%, 10\%, \dots)$

$q_{\text{Titanic}} = (20\%, 0\%, 70\%, \dots)$

$q_{\text{Toy Story}} = (30\%, 60\%, 0\%, \dots)$

The vector $p_u$ represents the affinity of user u for each of the latent factors. Similarly, the vector $q_i$ represents the affinity of the item i for the latent factors. Alice is represented as (10%, 10%, 50%, …), meaning that she’s only slightly sensitive to action and comedy movies, but she seems to like romance. As for Bob, he seems to prefer action movies above anything else. We can also see that Titanic is mostly a romance movie and that it’s not funny at all.

So, when we are using the SVD of R, we are modeling the rating of user u for item i as follows:

$r_{ui} = p_u \cdot q_i = \sum_f p_{uf} \, q_{if}$

In other words, if u has a taste for factors that are endorsed by i, then the rating $r_{ui}$ will be high. Conversely, if i is not the kind of item that u likes (i.e. the coefficients don’t match well), the rating $r_{ui}$ will be low. In our case, the rating of Alice for Titanic will be high, while that of Bob will be much lower because he’s not so keen on romance movies. His rating for Toy Story will, however, be higher than that of Alice.
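Plugging the illustrative vectors above into $r_{ui} = p_u \cdot q_i$ (treating the percentages as plain coefficients on the Action/Comedy/Romance factors) gives exactly that picture:

import numpy as np

p_alice     = np.array([0.10, 0.10, 0.50])
p_bob       = np.array([0.50, 0.30, 0.10])
q_titanic   = np.array([0.20, 0.00, 0.70])
q_toy_story = np.array([0.30, 0.60, 0.00])

print(p_alice.dot(q_titanic), p_bob.dot(q_titanic))      # 0.37 vs 0.17: Alice rates Titanic higher
print(p_alice.dot(q_toy_story), p_bob.dot(q_toy_story))  # 0.09 vs 0.33: Bob rates Toy Story higher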

We now have enough knowledge to apply SVD to a recommendation task.

Setting up the ratings data

import pandas as pd
import numpy as np
movies_df=pd.read_csv("E:/Git/Project markdowns/Matrix Factorization/ml-20m/movies.csv")
movies_df.head(10)
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
ratings_df=pd.read_csv("E:/Git/Project markdowns/Matrix Factorization/ml-20m/ratings.csv")
ratings_df.head(10)
userId movieId rating timestamp
0 1 2 3.5 1112486027
1 1 29 3.5 1112484676
2 1 32 3.5 1112484819
3 1 47 3.5 1112484727
4 1 50 3.5 1112484580
5 1 112 3.5 1094785740
6 1 151 4.0 1094785734
7 1 223 4.0 1112485573
8 1 253 4.0 1112484940
9 1 260 4.0 1112484826
#Defining the userIds we want to keep: range(0, 8001) covers userIds 1 to 8000
mylist = range(0, 8001)
#Getting the data for the first 8000 users
train_df=ratings_df[ratings_df["userId"].isin(mylist)]
train_df
userId movieId rating timestamp
0 1 2 3.5 1112486027
1 1 29 3.5 1112484676
2 1 32 3.5 1112484819
3 1 47 3.5 1112484727
4 1 50 3.5 1112484580
5 1 112 3.5 1094785740
6 1 151 4.0 1094785734
7 1 223 4.0 1112485573
8 1 253 4.0 1112484940
9 1 260 4.0 1112484826
10 1 293 4.0 1112484703
11 1 296 4.0 1112484767
12 1 318 4.0 1112484798
13 1 337 3.5 1094785709
14 1 367 3.5 1112485980
15 1 541 4.0 1112484603
16 1 589 3.5 1112485557
17 1 593 3.5 1112484661
18 1 653 3.0 1094785691
19 1 919 3.5 1094785621
20 1 924 3.5 1094785598
21 1 1009 3.5 1112486013
22 1 1036 4.0 1112485480
23 1 1079 4.0 1094785665
24 1 1080 3.5 1112485375
25 1 1089 3.5 1112484669
26 1 1090 4.0 1112485453
27 1 1097 4.0 1112485701
28 1 1136 3.5 1112484609
29 1 1193 3.5 1112484690
... ... ... ... ...
1170896 8000 858 5.0 1360263827
1170897 8000 1073 5.0 1360263822
1170898 8000 1092 4.5 1360262914
1170899 8000 1093 3.5 1360263017
1170900 8000 1136 4.0 1360263818
1170901 8000 1193 4.5 1360263372
1170902 8000 1213 4.0 1360263806
1170903 8000 1293 3.0 1360262929
1170904 8000 1333 4.0 1360262893
1170905 8000 1405 2.5 1360262963
1170906 8000 1544 0.5 1360263394
1170907 8000 1645 4.0 1360262901
1170908 8000 1673 4.0 1360263436
1170909 8000 1732 4.5 1360263799
1170910 8000 1884 3.5 1360263556
1170911 8000 2371 4.0 1360263061
1170912 8000 2539 3.0 1360262950
1170913 8000 3255 3.5 1360262934
1170914 8000 3671 4.5 1360263492
1170915 8000 3717 3.5 1360262956
1170916 8000 3911 4.5 1360262942
1170917 8000 5669 3.0 1360263898
1170918 8000 6863 4.0 1360263887
1170919 8000 7836 4.5 1360263540
1170920 8000 8376 4.0 1360263882
1170921 8000 8622 2.5 1360263878
1170922 8000 8917 4.0 1360263478
1170923 8000 35836 4.0 1360263689
1170924 8000 50872 4.5 1360263854
1170925 8000 55290 4.0 1360263330

1170926 rows × 4 columns

#Time to define the R matrix that we discussed earlier
R_df = train_df.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)
R_df.head(10)
movieId 1 2 3 4 5 6 7 8 9 10 ... 129822 129857 130052 130073 130219 130462 130490 130496 130642 130768
userId
1 0.0 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 4.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 5.0 0.0 3.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 3.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 4.0 0.0 5.0 0.0 0.0 3.0 0.0 0.0 0.0 4.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

10 rows × 14365 columns

The last thing we need to do is de-mean the data (normalize by each user's mean) and convert it from a dataframe to a numpy array.

With the ratings matrix properly formatted and normalized, we are ready to do the singular value decomposition.

R = R_df.values  # .as_matrix() has been removed in newer versions of pandas
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
R_demeaned
array([[-0.04559694,  3.45440306, -0.04559694, ..., -0.04559694,
        -0.04559694, -0.04559694],
       [-0.01698573, -0.01698573,  3.98301427, ..., -0.01698573,
        -0.01698573, -0.01698573],
       [ 3.94632788, -0.05367212, -0.05367212, ..., -0.05367212,
        -0.05367212, -0.05367212],
       ..., 
       [ 2.91806474, -0.08193526, -0.08193526, ..., -0.08193526,
        -0.08193526, -0.08193526],
       [-0.01141664, -0.01141664, -0.01141664, ..., -0.01141664,
        -0.01141664, -0.01141664],
       [-0.01127741, -0.01127741, -0.01127741, ..., -0.01127741,
        -0.01127741, -0.01127741]])

Scipy and Numpy both have functions to do the singular value decomposition. We will use the Scipy function svds because it lets us choose how many latent factors we want to use to approximate the original ratings matrix.

from scipy.sparse.linalg import svds
M, sigma, Ut = svds(R_demeaned, k = 50)

The function returns exactly those matrices detailed earlier in this post, except that the $\Sigma$ returned is just the values instead of a diagonal matrix. So, we will convert those numbers into a diagonal matrix.

sigma = np.diag(sigma)

Time for Predicting

We add the user means back to the predictions to get the ratings on their original scale.

allpredictedratings = np.dot(np.dot(M, sigma), Ut) + user_ratings_mean.reshape(-1, 1)

To put this kind of system into production, we would have to create a training and validation set and optimize the number of latent features ($k$) by minimizing the Root Mean Square Error (RMSE). Intuitively, the RMSE will decrease on the training set as $k$ increases (because we're approximating the original ratings matrix with a higher-rank matrix).

For movies, between 20 and 100 feature “preferences” vectors have been found to be optimal for generalizing to unseen data.

We won’t be optimizing the $k$ for this post.
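For reference, the search could look roughly like this sketch. It is not run in this post, and it assumes a boolean mask heldout (same shape as R_demeaned) marking known ratings that we hide from the factorization.

import numpy as np
from scipy.sparse.linalg import svds

def validation_rmse(R_demeaned, heldout, k):
    # Zero out the held-out cells, factorize what's left, then score only those cells
    R_train = np.where(heldout, 0.0, R_demeaned)
    M_k, sigma_k, Ut_k = svds(R_train, k=k)
    predicted = M_k.dot(np.diag(sigma_k)).dot(Ut_k)
    return np.sqrt(np.mean((predicted[heldout] - R_demeaned[heldout]) ** 2))

# Pick the k with the lowest validation RMSE, e.g.:
# best_k = min(range(20, 110, 10), key=lambda k: validation_rmse(R_demeaned, heldout, k))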

Giving the movie Recommendations

With the predictions matrix for every user, we can define a function to recommend movies for any user. All we need to do is return the movies with the highest predicted rating that the specified user hasn’t already rated.

We will also return the list of movies the user has already rated.

predicted_df = pd.DataFrame(allpredictedratings, columns = R_df.columns)
predicted_df.head()
movieId 1 2 3 4 5 6 7 8 9 10 ... 129822 129857 130052 130073 130219 130462 130490 130496 130642 130768
0 0.469187 0.766043 0.175666 0.020263 -0.144290 0.128583 -0.347741 0.011184 -0.166920 0.016176 ... -0.001710 -0.009119 -0.003445 0.001706 -0.011580 -0.009186 0.001071 -0.000878 -0.005326 -0.002008
1 1.059213 0.008119 0.334548 0.095283 0.205378 0.192142 0.410210 0.012785 0.107436 -0.171990 ... -0.002465 -0.005723 -0.002387 0.001833 -0.005137 0.000737 0.004796 -0.003082 -0.001402 -0.002621
2 2.025882 0.881859 -0.031231 0.003809 -0.009610 0.636919 0.006099 0.023052 -0.019402 0.196938 ... 0.004825 -0.002954 0.002429 0.024075 -0.002344 0.006869 0.007860 0.003550 -0.000376 0.004453
3 -0.545908 0.648594 0.387437 -0.008829 0.219286 0.852600 0.037864 0.083376 0.211910 0.977409 ... 0.002507 0.004520 0.002546 0.001082 0.003604 0.002208 0.004599 0.001259 0.001489 0.002720
4 2.023229 1.073306 1.197391 0.106130 1.185754 0.646488 1.362204 0.203931 0.284101 1.453561 ... 0.001165 -0.001161 -0.000224 0.000626 -0.000193 -0.000454 0.004080 -0.002701 0.000496 0.000509

5 rows × 14365 columns

def recommend_movies(predictions_df, userId, movies_df, originalratings_df, num_recommendations):
    
    # Get and sort the user's predictions
    userrownumber = userId - 1 # userId starts at 1, not 0
    sortedpredictions = predictions_df.iloc[userrownumber].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    userdata = originalratings_df[originalratings_df.userId == (userId)]
    usercomplete = (userdata.merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId').
                     sort_values(['rating'], ascending=False)
                 )

    print ('User {0} has already rated {1} movies.'.format(userId, usercomplete.shape[0]))
    print ('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['movieId'].isin(usercomplete['movieId'])].
         merge(pd.DataFrame(sortedpredictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {userrownumber: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return usercomplete, recommendations
ratedalready, predictions = recommend_movies(predicted_df, 1003, movies_df, ratings_df, 10)
User 1003 has already rated 174 movies.
Recommending highest 10 predicted ratings movies not already rated.
ratedalready.head(10)
userId movieId rating timestamp title genres
74 1003 2571 5.0 1209226214 Matrix, The (1999) Action|Sci-Fi|Thriller
36 1003 1210 5.0 1209308048 Star Wars: Episode VI - Return of the Jedi (1983) Action|Adventure|Sci-Fi
134 1003 7153 5.0 1209226291 Lord of the Rings: The Return of the King, The... Action|Adventure|Drama|Fantasy
6 1003 110 5.0 1209226276 Braveheart (1995) Action|Drama|War
130 1003 6874 5.0 1209226396 Kill Bill: Vol. 1 (2003) Action|Crime|Thriller
118 1003 5952 5.0 1209227126 Lord of the Rings: The Two Towers, The (2002) Adventure|Fantasy
94 1003 3578 5.0 1209226379 Gladiator (2000) Action|Adventure|Drama
47 1003 1527 4.5 1209227148 Fifth Element, The (1997) Action|Adventure|Comedy|Sci-Fi
137 1003 7438 4.5 1209226530 Kill Bill: Vol. 2 (2004) Action|Drama|Thriller
26 1003 780 4.5 1209226261 Independence Day (a.k.a. ID4) (1996) Action|Adventure|Sci-Fi|Thriller
predictions
movieId title genres
4759 4963 Ocean's Eleven (2001) Crime|Thriller
5208 5418 Bourne Identity, The (2002) Action|Mystery|Thriller
4788 4993 Lord of the Rings: The Fellowship of the Ring,... Adventure|Fantasy
1605 1721 Titanic (1997) Drama|Romance
2549 2716 Ghostbusters (a.k.a. Ghost Busters) (1984) Action|Comedy|Sci-Fi
42 47 Seven (a.k.a. Se7en) (1995) Mystery|Thriller
4692 4896 Harry Potter and the Sorcerer's Stone (a.k.a. ... Adventure|Children|Fantasy
1220 1291 Indiana Jones and the Last Crusade (1989) Action|Adventure
326 344 Ace Ventura: Pet Detective (1994) Comedy
2690 2858 American Beauty (1999) Comedy|Drama

The recommendations look pretty solid!

Conclusion

Low-dimensional matrix recommenders try to capture the underlying features driving the raw data (which we understand as tastes and preferences). From a theoretical perspective, if we want to make recommendations based on people’s tastes, this seems like the better approach. This technique also scales significantly better to larger datasets.

We do lose some meaningful signals by using a lower-rank matrix.

One particularly cool and effective strategy is to combine factorization and neighborhood methods into one framework (http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf). This research field is extremely active, and you should check out the Coursera specialization Introduction to Recommender Systems (https://www.coursera.org/specializations/recommender-systems) to understand it better.