<h1 id="what-are-we-doing"><strong>What are we doing?</strong></h1>
<p>We are building a basic version of a low-rank matrix factorization recommendation system and applying it to the MovieLens dataset from https://grouplens.org/datasets/movielens/. It contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users.</p>
<p>Another way to build a recommendation system is an item-based collaborative filtering approach. Collaborative filtering methods that compute distance relationships between items or users are generally thought of as “neighborhood” methods, since they are centered on the idea of “nearness”.</p>
<p>These methods do not scale well to larger datasets. There is a conceptual issue with them as well: the ratings matrices may be overfit and noisy representations of user tastes and preferences. When we use distance-based “neighborhood” approaches on raw data, we match on sparse low-level details that we assume represent the user’s preference vector, instead of the preference vector itself. It’s a subtle difference, but it’s important.</p>
<p>If I’ve listened to ten Breaking Benjamin songs and you’ve listened to ten different Breaking Benjamin songs, the raw user action matrix wouldn’t have any overlap. We’d have nothing in common, even though it seems pretty likely we share at least some underlying preferences. We need a method that can derive the tastes and preference vectors from the raw data.</p>
<p>Low-Rank Matrix Factorization is one of those methods.</p>
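<p>To make the overlap problem concrete, here is a small hypothetical sketch (the song counts are made up): two users with disjoint listening histories have a raw-vector similarity of exactly zero, so a neighborhood method sees nothing in common.</p>

```python
# Hypothetical illustration: two users who each listened to ten songs,
# with no overlap, in a 20-song raw play-count space.
import numpy as np

me = np.array([1] * 10 + [0] * 10, dtype=float)   # songs 1-10
you = np.array([0] * 10 + [1] * 10, dtype=float)  # songs 11-20

# Cosine similarity on the raw vectors: the dot product is zero,
# so no amount of normalization recovers the shared underlying taste.
cosine = me.dot(you) / (np.linalg.norm(me) * np.linalg.norm(you))
print(cosine)  # 0.0
```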
<h1 id="basics-of-matrix-factorization-for-recommendation-systems"><strong>Basics of Matrix Factorization for Recommendation systems</strong></h1>
<h4 id="all-of-the-theoretical-explanation-has-been-taken-from-httpnicolas-hugcomblogmatrix_facto_1">All of the theoretical explanation has been taken from: http://nicolas-hug.com/blog/matrix_facto_1</h4>
<h4 id="i-could-have-written-it-myself-but-i-loved-the-explanation-in-the-link-and-i-would-love-it-if-people-read-it-completely-to-understand-how-matrix-factorization-for-recommendation-systems-actally-works">I could have written it myself, but I loved the explanation in the link and I would love it if people read it completely to understand how Matrix Factorization for recommendation systems actally works</h4>
<p>The problem we need to assess is that of rating prediction. The data we would have on our hands is a <strong>rating history</strong>.</p>
<p>It would look something like this:</p>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/Rmatrix.JPG" alt="no-alignment" /></p>
<p>Our <strong>R matrix</strong> is a 99% sparse matrix with the columns as the items (movies, in our case) and the rows as individual users.</p>
<p>We will factorize the matrix R. The matrix factorization is linked to SVD (Singular Value Decomposition). It’s a beautiful result of Linear Algebra. When people say Math sucks, show them what SVD can do.</p>
<p>But, before we move onto SVD, we should review PCA (Principal Component Analysis). It’s only slightly less awesome than SVD, but it’s still pretty cool.</p>
<h1 id="a-little-bit-of-pca"><strong>A little bit of PCA</strong></h1>
<p>We’ll play around with the Olivetti dataset. It’s a set of greyscale images of faces from 40 people, making up a total of 400 images.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">fetch_olivetti_faces</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">faces</span> <span class="o">=</span> <span class="n">fetch_olivetti_faces</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">faces</span><span class="o">.</span><span class="n">DESCR</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Modified Olivetti faces dataset.
The original database was available from
http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
The version retrieved here comes in MATLAB format from the personal
web page of Sam Roweis:
http://www.cs.nyu.edu/~roweis/
There are ten different images of each of 40 distinct subjects. For some
subjects, the images were taken at different times, varying the lighting,
facial expressions (open / closed eyes, smiling / not smiling) and facial
details (glasses / no glasses). All the images were taken against a dark
homogeneous background with the subjects in an upright, frontal position (with
tolerance for some side movement).
The original dataset consisted of 92 x 112, while the Roweis version
consists of 64x64 images.
</code></pre></div></div>
<p><strong>Here are the first 10 people:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Here are the first ten guys of the dataset</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplot2grid</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">i</span><span class="p">))</span>
<span class="n">ax</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">faces</span><span class="o">.</span><span class="n">data</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="mi">10</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span> <span class="n">cmap</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">gray</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/output_11_0.png" alt="no-alignment" /></p>
<p><strong>Each image size is 64 x 64 pixels. We will flatten each of these images (we thus get 400 vectors, each with 64 x 64 = 4096 elements). We can represent our dataset in a 400 x 4096 matrix:</strong></p>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/Flattened.JPG" alt="no-alignment" /></p>
<p><strong>PCA, which stands for Principal Component Analysis, is an algorithm that will reveal 400 of these guys:</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Let's compute the PCA</span>
<span class="n">pca</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">()</span>
<span class="n">pca</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">faces</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Now, the creepy guys are in the components_ attribute.</span>
<span class="c"># Here are the first ten ones:</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplot2grid</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">i</span><span class="p">))</span>
<span class="n">ax</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">pca</span><span class="o">.</span><span class="n">components_</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span> <span class="n">cmap</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">cm</span><span class="o">.</span><span class="n">gray</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/output_15_0.png" alt="no-alignment" /></p>
<p><strong>This is pretty creepy, right?</strong></p>
<p>We call these guys the principal components (hence the name of the technique), and when they represent faces such as here we call them the eigenfaces. Some really cool stuff can be done with eigenfaces, such as face recognition, or optimizing your tinder matches! The reason why they’re called eigenfaces is that they are in fact the eigenvectors of the covariance matrix of X.</p>
<p>We obtain here 400 principal components because the original matrix X has 400 rows (or more precisely, because the rank of X is 400). As you may have guessed, each of the principal components is in fact a vector that has the same dimension as the original faces, i.e. it has 64 x 64 = 4096 pixels.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Reconstruction process</span>
<span class="kn">from</span> <span class="nn">skimage.io</span> <span class="kn">import</span> <span class="n">imsave</span>
<span class="n">face</span> <span class="o">=</span> <span class="n">faces</span><span class="o">.</span><span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c"># we will reconstruct the first face</span>
<span class="c"># During the reconstruction process we are actually computing, at the kth frame,</span>
<span class="c"># a rank k approximation of the face. To get a rank k approximation of a face,</span>
<span class="c"># we need to first transform it into the 'latent space', and then</span>
<span class="c"># transform it back to the original space</span>
<span class="c"># Step 1: transform the face into the latent space.</span>
<span class="c"># It's now a vector with 400 components. The kth component gives the importance</span>
<span class="c"># of the kth creepy guy</span>
<span class="n">trans</span> <span class="o">=</span> <span class="n">pca</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">face</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="c"># Reshape for scikit learn</span>
<span class="c"># Step 2: reconstruction. To build the kth frame, we use all the creepy guys</span>
<span class="c"># up until the kth one.</span>
<span class="c"># Warning: this will save 400 png images.</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">400</span><span class="p">):</span>
<span class="n">rank_k_approx</span> <span class="o">=</span> <span class="n">trans</span><span class="p">[:,</span> <span class="p">:</span><span class="n">k</span><span class="p">]</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">pca</span><span class="o">.</span><span class="n">components_</span><span class="p">[:</span><span class="n">k</span><span class="p">])</span> <span class="o">+</span> <span class="n">pca</span><span class="o">.</span><span class="n">mean_</span>
<span class="n">imsave</span><span class="p">(</span><span class="s">'{:>03}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">k</span><span class="p">))</span> <span class="o">+</span> <span class="s">'.jpg'</span><span class="p">,</span> <span class="n">rank_k_approx</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">))</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>E:\Software\Anaconda2\envs\py36\lib\site-packages\skimage\util\dtype.py:122: UserWarning: Possible precision loss when converting from float64 to uint8
.format(dtypeobj_in, dtypeobj_out))
</code></pre></div></div>
<p>As far as we’re concerned, we will call these guys the <strong>creepy guys</strong>.</p>
<p>Now, one amazing thing about them is that they can build back all of the original faces. Take a look at this (these are animated gifs, about 10s long):</p>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim.gif" alt="no-alignment" />
<img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim2.gif" alt="no-alignment" />
<img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim3.gif" alt="no-alignment" />
<img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim4.gif" alt="no-alignment" />
<img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim5.gif" alt="no-alignment" />
<img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim6.gif" alt="no-alignment" />
<img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim7.gif" alt="no-alignment" />
<img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim8.gif" alt="no-alignment" />
<img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim9.gif" alt="no-alignment" />
<img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/anim10.gif" alt="no-alignment" /></p>
<p>Each of the 400 original faces (i.e. each of the 400 original rows of the matrix) can be expressed as a (linear) combination of the creepy guys. That is, we can express the first original face (i.e. its pixel values) as a little bit of the first creepy guy, plus a little bit of the second creepy guy, plus a little bit of third, etc. until the last creepy guy. The same goes for all of the other original faces: they can all be expressed as a little bit of each creepy guy.</p>
<p><strong>Face 1 = α<sub>1</sub>⋅Creepy guy #1 + α<sub>2</sub>⋅Creepy guy #2 . . . + α<sub>400</sub>⋅Creepy guy #400</strong></p>
<p>The gifs you saw above are the very translation of these math equations: the first frame of a gif is the contribution of the first creepy guy, the second frame is the contribution of the first two creepy guys, etc. until the last creepy guy.</p>
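<p>The same rank-k reconstruction can be sketched with plain numpy instead of scikit-learn. The data here is random (standing in for the faces, to keep it self-contained); only the mechanics mirror the gif-building loop above: project a row onto the components, then rebuild it from the first k of them.</p>

```python
# Toy rank-k reconstruction: random data stands in for the 400 face images.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4096))
mean = X.mean(axis=0)

# The principal components are the right singular vectors of the
# mean-centered matrix (rows of Vt), which is what PCA computes internally.
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

coeffs = (X[0] - mean) @ Vt.T        # the alphas for "face" 0

k = 50
rank_k = coeffs[:k] @ Vt[:k] + mean  # a little bit of each of the first k guys
full = coeffs @ Vt + mean            # all components rebuild the row exactly
print(np.allclose(full, X[0]))  # True
```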
<h3 id="latent-factors">Latent Factors</h3>
<p>We’ve actually been kind of harsh towards the creepy guys. They’re not creepy, they’re typical. The goal of PCA is to reveal typical vectors: each of the creepy/typical guys represents one specific aspect underlying the data. In an ideal world, the first typical guy would represent (e.g.) a typical elder person, the second typical guy would represent a typical glasses wearer, and some other typical guys would represent concepts such as smiley, sad looking, big nose, stuff like that. And with these concepts, we could define a face as more or less elder, more or less glassy, more or less smiling, etc. In practice, the concepts that PCA reveals are really not that clear: there is no clear semantic that we could associate with any of the creepy/typical guys that we obtained here. But the important fact remains: each of the typical guys captures a specific aspect of the data. We call these aspects the latent factors (latent, because they were there all the time, we just needed PCA to reveal them). Using barbaric terms, we say that each principal component (the creepy/typical guys) captures a specific latent factor.</p>
<p>Now, this is all good and fun, but we’re interested in matrix factorization for recommendation purposes, right? So where is our matrix factorization, and what does it have to do with recommendation? PCA is actually a plug-and-play method: it works for any matrix. If your matrix contains images, it will reveal some typical images that can build back all of your initial images, such as here. If your matrix contains potatoes, PCA will reveal some typical potatoes that can build back all of your original potatoes. If your matrix contains ratings, well… Here we come.</p>
<h1 id="pca-on-a-dense-rating-matrix">PCA on a (dense) rating matrix</h1>
<p>Until stated otherwise, we will consider for now that our rating matrix R is completely dense, i.e. there are no missing entries. All the ratings are known. This is of course not the case in real recommendation problems, but bear with me.</p>
<h3 id="pca-on-the-users">PCA on the users</h3>
<p>Here is our rating matrix, where rows are users and columns are movies:</p>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/UserPCA.JPG" alt="no-alignment" /></p>
<p>Instead of having faces in the rows represented by pixel values, we now have users represented by their ratings. Just like PCA gave us some typical guys before, it will now give us some typical users, or rather some typical raters.</p>
<p>In an ideal world, we would obtain a typical action movie fan, a typical romance movie fan, a typical comedy fan, etc. In practice, the semantic behind the typical users is not clearly defined, but for the sake of simplicity we will assume that it is (it doesn’t change anything; this is just for intuition/explanation purposes).</p>
<p>Each of our initial users (Alice, Bob…) can be expressed as a combination of the typical users. For instance, Alice could be defined as a little bit of an action fan, a little bit of a comedy fan, a lot of a romance fan, etc. As for Bob, he could be more keen on action movies:</p>
<p><strong>Alice = 10% Action fan + 10% Comedy fan + 50% Romance fan + …</strong></p>
<p><strong>Bob = 50% Action fan + 30% Comedy fan + 10% Romance fan + …</strong></p>
<p>And the same goes for all of the users, you get the idea. (In practice the coefficients are not necessarily percentages, but it’s convenient for us to think of it this way).</p>
<h3 id="pca-on-the-movies">PCA on the movies</h3>
<p>What would happen if we transposed our rating matrix? Instead of having users in the rows, we would now have movies, defined as their ratings:</p>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/MoviesPCA.JPG" alt="no-alignment" /></p>
<p>In this case, PCA will reveal neither typical faces nor typical users, but of course typical movies. And here again, we will associate a semantic meaning with each of the typical movies, and these typical movies can build back all of our original movies.</p>
<p>And the same goes for all the other movies.</p>
<p>So what can SVD do for us? SVD is PCA on R and R<sup>T</sup>, in one shot.</p>
<p>SVD will give you the two matrices U and M, at the same time. You get the typical users and the typical movies in one shot. SVD gives you U and M by factorizing R into three matrices. Here is the matrix factorization:</p>
<p><strong><em>R=MΣU<sup>T</sup></em></strong></p>
<p>To be very clear: SVD is an algorithm that takes the matrix R as an input, and it gives you M, Σ and U, such that:</p>
<p>R is equal to the product <strong>MΣU<sup>T</sup>.</strong></p>
<p>The columns of M can build back all of the columns of R (we already know this).</p>
<p>The columns of U can build back all of the rows of R (we already know this).</p>
<p>The columns of M are orthogonal, as well as the columns of U. I haven’t mentioned this before, so here it is: the principal components are always orthogonal. This is actually an extremely important feature of PCA (and SVD), but for our recommendation we actually don’t care (we’ll come to that).</p>
<p>Σ is a diagonal matrix (we’ll also come to that).</p>
<p>We can basically sum up all of the above points in one statement: the columns of M are an orthonormal basis that spans the column space of R, and the columns of U are an orthonormal basis that spans the row space of R. If this kind of phrasing works for you, great. Personally, I prefer to talk about creepy guys and typical potatoes.</p>
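<p>A quick numpy check of this factorization on a small dense “rating” matrix. Note that numpy’s naming differs from the post’s: <code>np.linalg.svd</code> returns factors satisfying R = U·diag(s)·Vh, so numpy’s U plays the role of M here and Vh the role of U<sup>T</sup>.</p>

```python
# Verify R = M . Sigma . U^T and the orthonormality of M's columns.
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(5, 4)).astype(float)  # 5 users, 4 movies, ratings 1-5

M, sigma, Ut = np.linalg.svd(R, full_matrices=False)
R_rebuilt = M @ np.diag(sigma) @ Ut

print(np.allclose(R, R_rebuilt))        # True: R = M Sigma U^T
print(np.allclose(M.T @ M, np.eye(4)))  # True: columns of M are orthonormal
```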
<h3 id="the-model-behind-svd">The model behind SVD</h3>
<p>When we compute and use the SVD of the rating matrix R, we are actually modeling the ratings in a very specific, and meaningful way. We will describe this modeling here.</p>
<p>For the sake of simplicity, we will forget about the matrix Σ: it is a diagonal matrix, so it simply acts as a scaler on M or U<sup>T</sup>. Hence, we will pretend that it has been merged into one of the two matrices. Our matrix factorization simply becomes:</p>
<p><strong><em>R=MU<sup>T</sup></em></strong></p>
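<p>The merging of Σ described above is one line of numpy: scale the columns of M by the singular values, and the three-matrix factorization collapses into a plain two-matrix product.</p>

```python
# Fold Sigma into M: column j of M is multiplied by sigma[j] (broadcasting).
import numpy as np

rng = np.random.default_rng(1)
R = rng.normal(size=(6, 4))

M, sigma, Ut = np.linalg.svd(R, full_matrices=False)
M_scaled = M * sigma  # same as M @ np.diag(sigma)

print(np.allclose(M_scaled @ Ut, R))  # True: R = M U^T after merging Sigma
```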
<p>Now, with this factorization, let’s consider the rating of user u for item i, which we will denote r<sub>ui</sub>:</p>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/productmatrices.JPG" alt="no-alignment" /></p>
<p>Because of the way a matrix product is defined, the value of r<sub>ui</sub> is the result of a dot product between two vectors: a vector p<sub>u</sub>, which is a row of M and is specific to the user u, and a vector q<sub>i</sub>, which is a column of U<sup>T</sup> and is specific to the item i:</p>
<p><strong><em>r<sub>ui</sub>=p<sub>u</sub>⋅q<sub>i</sub></em></strong></p>
<p>where ‘⋅’ stands for the usual dot product. Now, remember how we can describe our users and our items?</p>
<p><strong>Alice = 10% Action fan + 10% Comedy fan + 50% Romance fan + …</strong></p>
<p><strong>Bob = 50% Action fan + 30% Comedy fan + 10% Romance fan + …</strong></p>
<p><strong>Titanic = 20% Action + 0% Comedy + 70% Romance + …</strong></p>
<p><strong>Toy Story = 30% Action + 60% Comedy + 0% Romance + …</strong></p>
<p>Well, the values of the vectors p<sub>u</sub> and q<sub>i</sub> exactly correspond to the coefficients that we have assigned to each latent factor:</p>
<p><strong>p<sub>Alice</sub>=(10%, 10%, 50%, …)</strong></p>
<p><strong>p<sub>Bob</sub>=(50%, 30%, 10%, …)</strong></p>
<p><strong>q<sub>Titanic</sub>=(20%, 0%, 70%, …)</strong></p>
<p><strong>q<sub>Toy Story</sub>=(30%, 60%, 0%, …)</strong></p>
<p>The vector p<sub>u</sub> represents the affinity of user u for each of the latent factors. Similarly, the vector q<sub>i</sub> represents the affinity of the item i for the latent factors. Alice is represented as (10%, 10%, 50%,…), meaning that she’s only slightly sensitive to action and comedy movies, but she seems to like romance. As for Bob, he seems to prefer action movies above anything else. We can also see that Titanic is mostly a romance movie and that it’s not funny at all.</p>
<p>So, when we are using the SVD of R, we are modeling the rating of user u for item i as follows:</p>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/matrixfactorization/equation.JPG" alt="no-alignment" /></p>
<p>In other words, if u has a taste for the factors that are endorsed by i, the rating r<sub>ui</sub> will be high. Conversely, if i is not the kind of item that u likes (i.e., the coefficients don’t match well), the rating r<sub>ui</sub> will be low. In our case, Alice’s rating for Titanic will be high, while Bob’s will be much lower because he’s not so keen on romance movies. His rating for Toy Story will, however, be higher than Alice’s.</p>
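<p>Plugging the illustrative coefficients from above into the dot-product model reproduces exactly these claims (the numbers are made up for intuition, just as in the text):</p>

```python
# r_ui = p_u . q_i with the made-up latent-factor coefficients from the text.
import numpy as np

p_alice = np.array([0.10, 0.10, 0.50])    # action, comedy, romance
p_bob = np.array([0.50, 0.30, 0.10])
q_titanic = np.array([0.20, 0.00, 0.70])
q_toystory = np.array([0.30, 0.60, 0.00])

# Alice's predicted Titanic rating beats Bob's...
print(p_alice @ q_titanic, p_bob @ q_titanic)    # Alice's is higher
# ...while Bob's predicted Toy Story rating beats Alice's.
print(p_alice @ q_toystory, p_bob @ q_toystory)  # Bob's is higher
```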
<p>We now have enough knowledge to apply SVD to a recommendation task.</p>
<h1 id="setting-up-the-ratings-data"><strong>Setting up the ratings data</strong></h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">movies_df</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"E:/Git/Project markdowns/Matrix Factorization/ml-20m/movies.csv"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">movies_df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>movieId</th>
<th>title</th>
<th>genres</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>Toy Story (1995)</td>
<td>Adventure|Animation|Children|Comedy|Fantasy</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>Jumanji (1995)</td>
<td>Adventure|Children|Fantasy</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>Grumpier Old Men (1995)</td>
<td>Comedy|Romance</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>Waiting to Exhale (1995)</td>
<td>Comedy|Drama|Romance</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>Father of the Bride Part II (1995)</td>
<td>Comedy</td>
</tr>
<tr>
<th>5</th>
<td>6</td>
<td>Heat (1995)</td>
<td>Action|Crime|Thriller</td>
</tr>
<tr>
<th>6</th>
<td>7</td>
<td>Sabrina (1995)</td>
<td>Comedy|Romance</td>
</tr>
<tr>
<th>7</th>
<td>8</td>
<td>Tom and Huck (1995)</td>
<td>Adventure|Children</td>
</tr>
<tr>
<th>8</th>
<td>9</td>
<td>Sudden Death (1995)</td>
<td>Action</td>
</tr>
<tr>
<th>9</th>
<td>10</td>
<td>GoldenEye (1995)</td>
<td>Action|Adventure|Thriller</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ratings_df</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"E:/Git/Project markdowns/Matrix Factorization/ml-20m/ratings.csv"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ratings_df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>userId</th>
<th>movieId</th>
<th>rating</th>
<th>timestamp</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>2</td>
<td>3.5</td>
<td>1112486027</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>29</td>
<td>3.5</td>
<td>1112484676</td>
</tr>
<tr>
<th>2</th>
<td>1</td>
<td>32</td>
<td>3.5</td>
<td>1112484819</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>47</td>
<td>3.5</td>
<td>1112484727</td>
</tr>
<tr>
<th>4</th>
<td>1</td>
<td>50</td>
<td>3.5</td>
<td>1112484580</td>
</tr>
<tr>
<th>5</th>
<td>1</td>
<td>112</td>
<td>3.5</td>
<td>1094785740</td>
</tr>
<tr>
<th>6</th>
<td>1</td>
<td>151</td>
<td>4.0</td>
<td>1094785734</td>
</tr>
<tr>
<th>7</th>
<td>1</td>
<td>223</td>
<td>4.0</td>
<td>1112485573</td>
</tr>
<tr>
<th>8</th>
<td>1</td>
<td>253</td>
<td>4.0</td>
<td>1112484940</td>
</tr>
<tr>
<th>9</th>
<td>1</td>
<td>260</td>
<td>4.0</td>
<td>1112484826</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Defining a list of numbers from 0 to 7999</span>
<span class="n">mylist</span><span class="o">=</span><span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">8001</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Getting the data for the first 8000 users</span>
<span class="n">train_df</span><span class="o">=</span><span class="n">ratings_df</span><span class="p">[</span><span class="n">ratings_df</span><span class="p">[</span><span class="s">"userId"</span><span class="p">]</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">mylist</span><span class="p">)]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train_df</span>
</code></pre></div></div>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>userId</th>
<th>movieId</th>
<th>rating</th>
<th>timestamp</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>2</td>
<td>3.5</td>
<td>1112486027</td>
</tr>
<tr>
<th>1</th>
<td>1</td>
<td>29</td>
<td>3.5</td>
<td>1112484676</td>
</tr>
<tr>
<th>2</th>
<td>1</td>
<td>32</td>
<td>3.5</td>
<td>1112484819</td>
</tr>
<tr>
<th>3</th>
<td>1</td>
<td>47</td>
<td>3.5</td>
<td>1112484727</td>
</tr>
<tr>
<th>4</th>
<td>1</td>
<td>50</td>
<td>3.5</td>
<td>1112484580</td>
</tr>
<tr>
<th>5</th>
<td>1</td>
<td>112</td>
<td>3.5</td>
<td>1094785740</td>
</tr>
<tr>
<th>6</th>
<td>1</td>
<td>151</td>
<td>4.0</td>
<td>1094785734</td>
</tr>
<tr>
<th>7</th>
<td>1</td>
<td>223</td>
<td>4.0</td>
<td>1112485573</td>
</tr>
<tr>
<th>8</th>
<td>1</td>
<td>253</td>
<td>4.0</td>
<td>1112484940</td>
</tr>
<tr>
<th>9</th>
<td>1</td>
<td>260</td>
<td>4.0</td>
<td>1112484826</td>
</tr>
<tr>
<th>10</th>
<td>1</td>
<td>293</td>
<td>4.0</td>
<td>1112484703</td>
</tr>
<tr>
<th>11</th>
<td>1</td>
<td>296</td>
<td>4.0</td>
<td>1112484767</td>
</tr>
<tr>
<th>12</th>
<td>1</td>
<td>318</td>
<td>4.0</td>
<td>1112484798</td>
</tr>
<tr>
<th>13</th>
<td>1</td>
<td>337</td>
<td>3.5</td>
<td>1094785709</td>
</tr>
<tr>
<th>14</th>
<td>1</td>
<td>367</td>
<td>3.5</td>
<td>1112485980</td>
</tr>
<tr>
<th>15</th>
<td>1</td>
<td>541</td>
<td>4.0</td>
<td>1112484603</td>
</tr>
<tr>
<th>16</th>
<td>1</td>
<td>589</td>
<td>3.5</td>
<td>1112485557</td>
</tr>
<tr>
<th>17</th>
<td>1</td>
<td>593</td>
<td>3.5</td>
<td>1112484661</td>
</tr>
<tr>
<th>18</th>
<td>1</td>
<td>653</td>
<td>3.0</td>
<td>1094785691</td>
</tr>
<tr>
<th>19</th>
<td>1</td>
<td>919</td>
<td>3.5</td>
<td>1094785621</td>
</tr>
<tr>
<th>20</th>
<td>1</td>
<td>924</td>
<td>3.5</td>
<td>1094785598</td>
</tr>
<tr>
<th>21</th>
<td>1</td>
<td>1009</td>
<td>3.5</td>
<td>1112486013</td>
</tr>
<tr>
<th>22</th>
<td>1</td>
<td>1036</td>
<td>4.0</td>
<td>1112485480</td>
</tr>
<tr>
<th>23</th>
<td>1</td>
<td>1079</td>
<td>4.0</td>
<td>1094785665</td>
</tr>
<tr>
<th>24</th>
<td>1</td>
<td>1080</td>
<td>3.5</td>
<td>1112485375</td>
</tr>
<tr>
<th>25</th>
<td>1</td>
<td>1089</td>
<td>3.5</td>
<td>1112484669</td>
</tr>
<tr>
<th>26</th>
<td>1</td>
<td>1090</td>
<td>4.0</td>
<td>1112485453</td>
</tr>
<tr>
<th>27</th>
<td>1</td>
<td>1097</td>
<td>4.0</td>
<td>1112485701</td>
</tr>
<tr>
<th>28</th>
<td>1</td>
<td>1136</td>
<td>3.5</td>
<td>1112484609</td>
</tr>
<tr>
<th>29</th>
<td>1</td>
<td>1193</td>
<td>3.5</td>
<td>1112484690</td>
</tr>
<tr>
<th>...</th>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<th>1170896</th>
<td>8000</td>
<td>858</td>
<td>5.0</td>
<td>1360263827</td>
</tr>
<tr>
<th>1170897</th>
<td>8000</td>
<td>1073</td>
<td>5.0</td>
<td>1360263822</td>
</tr>
<tr>
<th>1170898</th>
<td>8000</td>
<td>1092</td>
<td>4.5</td>
<td>1360262914</td>
</tr>
<tr>
<th>1170899</th>
<td>8000</td>
<td>1093</td>
<td>3.5</td>
<td>1360263017</td>
</tr>
<tr>
<th>1170900</th>
<td>8000</td>
<td>1136</td>
<td>4.0</td>
<td>1360263818</td>
</tr>
<tr>
<th>1170901</th>
<td>8000</td>
<td>1193</td>
<td>4.5</td>
<td>1360263372</td>
</tr>
<tr>
<th>1170902</th>
<td>8000</td>
<td>1213</td>
<td>4.0</td>
<td>1360263806</td>
</tr>
<tr>
<th>1170903</th>
<td>8000</td>
<td>1293</td>
<td>3.0</td>
<td>1360262929</td>
</tr>
<tr>
<th>1170904</th>
<td>8000</td>
<td>1333</td>
<td>4.0</td>
<td>1360262893</td>
</tr>
<tr>
<th>1170905</th>
<td>8000</td>
<td>1405</td>
<td>2.5</td>
<td>1360262963</td>
</tr>
<tr>
<th>1170906</th>
<td>8000</td>
<td>1544</td>
<td>0.5</td>
<td>1360263394</td>
</tr>
<tr>
<th>1170907</th>
<td>8000</td>
<td>1645</td>
<td>4.0</td>
<td>1360262901</td>
</tr>
<tr>
<th>1170908</th>
<td>8000</td>
<td>1673</td>
<td>4.0</td>
<td>1360263436</td>
</tr>
<tr>
<th>1170909</th>
<td>8000</td>
<td>1732</td>
<td>4.5</td>
<td>1360263799</td>
</tr>
<tr>
<th>1170910</th>
<td>8000</td>
<td>1884</td>
<td>3.5</td>
<td>1360263556</td>
</tr>
<tr>
<th>1170911</th>
<td>8000</td>
<td>2371</td>
<td>4.0</td>
<td>1360263061</td>
</tr>
<tr>
<th>1170912</th>
<td>8000</td>
<td>2539</td>
<td>3.0</td>
<td>1360262950</td>
</tr>
<tr>
<th>1170913</th>
<td>8000</td>
<td>3255</td>
<td>3.5</td>
<td>1360262934</td>
</tr>
<tr>
<th>1170914</th>
<td>8000</td>
<td>3671</td>
<td>4.5</td>
<td>1360263492</td>
</tr>
<tr>
<th>1170915</th>
<td>8000</td>
<td>3717</td>
<td>3.5</td>
<td>1360262956</td>
</tr>
<tr>
<th>1170916</th>
<td>8000</td>
<td>3911</td>
<td>4.5</td>
<td>1360262942</td>
</tr>
<tr>
<th>1170917</th>
<td>8000</td>
<td>5669</td>
<td>3.0</td>
<td>1360263898</td>
</tr>
<tr>
<th>1170918</th>
<td>8000</td>
<td>6863</td>
<td>4.0</td>
<td>1360263887</td>
</tr>
<tr>
<th>1170919</th>
<td>8000</td>
<td>7836</td>
<td>4.5</td>
<td>1360263540</td>
</tr>
<tr>
<th>1170920</th>
<td>8000</td>
<td>8376</td>
<td>4.0</td>
<td>1360263882</td>
</tr>
<tr>
<th>1170921</th>
<td>8000</td>
<td>8622</td>
<td>2.5</td>
<td>1360263878</td>
</tr>
<tr>
<th>1170922</th>
<td>8000</td>
<td>8917</td>
<td>4.0</td>
<td>1360263478</td>
</tr>
<tr>
<th>1170923</th>
<td>8000</td>
<td>35836</td>
<td>4.0</td>
<td>1360263689</td>
</tr>
<tr>
<th>1170924</th>
<td>8000</td>
<td>50872</td>
<td>4.5</td>
<td>1360263854</td>
</tr>
<tr>
<th>1170925</th>
<td>8000</td>
<td>55290</td>
<td>4.0</td>
<td>1360263330</td>
</tr>
</tbody>
</table>
<p>1170926 rows × 4 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Time to define the R matrix that we discussed earlier</span>
<span class="n">R_df</span> <span class="o">=</span> <span class="n">train_df</span><span class="o">.</span><span class="n">pivot</span><span class="p">(</span><span class="n">index</span> <span class="o">=</span> <span class="s">'userId'</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span><span class="s">'movieId'</span><span class="p">,</span> <span class="n">values</span> <span class="o">=</span> <span class="s">'rating'</span><span class="p">)</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">R_df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>movieId</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>...</th>
<th>129822</th>
<th>129857</th>
<th>130052</th>
<th>130073</th>
<th>130219</th>
<th>130462</th>
<th>130490</th>
<th>130496</th>
<th>130642</th>
<th>130768</th>
</tr>
<tr>
<th>userId</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>1</th>
<td>0.0</td>
<td>3.5</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>2</th>
<td>0.0</td>
<td>0.0</td>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>3</th>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>4</th>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>4.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>5</th>
<td>0.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>6</th>
<td>5.0</td>
<td>0.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>5.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>7</th>
<td>0.0</td>
<td>0.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>8</th>
<td>4.0</td>
<td>0.0</td>
<td>5.0</td>
<td>0.0</td>
<td>0.0</td>
<td>3.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>4.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>9</th>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<th>10</th>
<td>4.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>...</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>
<p>10 rows × 14365 columns</p>
</div>
<p><strong>The last thing we need to do is de-mean the data (normalize by each user’s mean) and convert it from a dataframe to a NumPy array.</strong></p>
<p><strong>With the ratings matrix properly formatted and normalized, we are ready to perform the singular value decomposition.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">R</span> <span class="o">=</span> <span class="n">R_df</span><span class="o">.</span><span class="n">values</span> <span class="c">#.as_matrix() is deprecated; .values returns the same NumPy array</span>
<span class="n">user_ratings_mean</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">R</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">R_demeaned</span> <span class="o">=</span> <span class="n">R</span> <span class="o">-</span> <span class="n">user_ratings_mean</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">R_demeaned</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([[-0.04559694, 3.45440306, -0.04559694, ..., -0.04559694,
-0.04559694, -0.04559694],
[-0.01698573, -0.01698573, 3.98301427, ..., -0.01698573,
-0.01698573, -0.01698573],
[ 3.94632788, -0.05367212, -0.05367212, ..., -0.05367212,
-0.05367212, -0.05367212],
...,
[ 2.91806474, -0.08193526, -0.08193526, ..., -0.08193526,
-0.08193526, -0.08193526],
[-0.01141664, -0.01141664, -0.01141664, ..., -0.01141664,
-0.01141664, -0.01141664],
[-0.01127741, -0.01127741, -0.01127741, ..., -0.01127741,
-0.01127741, -0.01127741]])
</code></pre></div></div>
<p><strong>SciPy and NumPy both have functions to perform the singular value decomposition. We will use the SciPy function svds because it lets us choose how many latent factors to use when approximating the original ratings matrix.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scipy.sparse.linalg</span> <span class="kn">import</span> <span class="n">svds</span>
<span class="n">M</span><span class="p">,</span> <span class="n">sigma</span><span class="p">,</span> <span class="n">Ut</span> <span class="o">=</span> <span class="n">svds</span><span class="p">(</span><span class="n">R_demeaned</span><span class="p">,</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">50</span><span class="p">)</span>
</code></pre></div></div>
<p><strong>The function returns exactly those matrices detailed earlier in this post, except that the $\Sigma$ returned is just the singular values as a 1-D array rather than a diagonal matrix. So, we will convert those values into a diagonal matrix.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span>
</code></pre></div></div>
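<p>To make the shapes concrete, here is a small self-contained sketch (a toy random matrix standing in for the demeaned ratings, so the numbers themselves are illustrative) showing what svds returns and why the diagonal conversion is needed before the factors can be multiplied back together:</p>

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy stand-in for the demeaned ratings matrix (NOT the MovieLens data)
rng = np.random.default_rng(0)
A = rng.random((20, 30))

# svds returns the left factor, the k singular values as a 1-D array,
# and the right factor -- not a diagonal Sigma matrix
M, sigma, Ut = svds(A, k=5)
print(M.shape, sigma.shape, Ut.shape)  # (20, 5) (5,) (5, 30)

# np.diag builds the 5x5 diagonal matrix, after which the three factors
# multiply back into a rank-5 approximation of A
approx = M @ np.diag(sigma) @ Ut
print(approx.shape)  # (20, 30)
```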
<h1 id="time-for-predicting"><strong>Time for Predicting</strong></h1>
<p><strong>We add each user’s mean rating back to the predictions to return them to the original rating scale.</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">allpredictedratings</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">sigma</span><span class="p">),</span> <span class="n">Ut</span><span class="p">)</span> <span class="o">+</span> <span class="n">user_ratings_mean</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>To put this kind of system into production, we would have to create a training and validation set and optimize the number of latent features ($k$) by minimizing the Root Mean Square Error (RMSE). Intuitively, the RMSE will decrease on the training set as $k$ increases (because we are approximating the original ratings matrix with a higher-rank matrix).</p>
<p>For movies, between 20 and 100 latent feature (“preference”) vectors have been found to work well for generalizing to unseen data.</p>
<p>We won’t be optimizing $k$ for this post.</p>
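<p>That training-set behaviour is easy to demonstrate, though. The sketch below (again a toy random matrix standing in for the demeaned ratings, so the numbers are illustrative only) computes the reconstruction RMSE for a few values of $k$ and shows it shrinking as the rank grows; on a held-out validation set the curve would eventually turn back up, and that minimum is the $k$ we would pick:</p>

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(1)
R_toy = rng.random((60, 80))  # stand-in for a demeaned ratings matrix

def rmse_for_k(R, k):
    # Rank-k truncated-SVD reconstruction error against the full matrix
    U, s, Vt = svds(R, k=k)
    approx = U @ np.diag(s) @ Vt
    return np.sqrt(np.mean((R - approx) ** 2))

errors = {k: rmse_for_k(R_toy, k) for k in (5, 20, 50)}
# Training RMSE strictly decreases as k grows
print(errors[5] > errors[20] > errors[50])  # True
```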
<h1 id="giving-the-movie-recommendations">Giving the Movie Recommendations</h1>
<p>With the predictions matrix for every user, we can define a function to recommend movies for any of them. All we need to do is return the movies with the highest predicted ratings that the specified user hasn’t already rated.</p>
<p>We will also return the list of movies the user has already rated.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predicted_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">allpredictedratings</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="n">R_df</span><span class="o">.</span><span class="n">columns</span><span class="p">)</span>
<span class="n">predicted_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>movieId</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>...</th>
<th>129822</th>
<th>129857</th>
<th>130052</th>
<th>130073</th>
<th>130219</th>
<th>130462</th>
<th>130490</th>
<th>130496</th>
<th>130642</th>
<th>130768</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.469187</td>
<td>0.766043</td>
<td>0.175666</td>
<td>0.020263</td>
<td>-0.144290</td>
<td>0.128583</td>
<td>-0.347741</td>
<td>0.011184</td>
<td>-0.166920</td>
<td>0.016176</td>
<td>...</td>
<td>-0.001710</td>
<td>-0.009119</td>
<td>-0.003445</td>
<td>0.001706</td>
<td>-0.011580</td>
<td>-0.009186</td>
<td>0.001071</td>
<td>-0.000878</td>
<td>-0.005326</td>
<td>-0.002008</td>
</tr>
<tr>
<th>1</th>
<td>1.059213</td>
<td>0.008119</td>
<td>0.334548</td>
<td>0.095283</td>
<td>0.205378</td>
<td>0.192142</td>
<td>0.410210</td>
<td>0.012785</td>
<td>0.107436</td>
<td>-0.171990</td>
<td>...</td>
<td>-0.002465</td>
<td>-0.005723</td>
<td>-0.002387</td>
<td>0.001833</td>
<td>-0.005137</td>
<td>0.000737</td>
<td>0.004796</td>
<td>-0.003082</td>
<td>-0.001402</td>
<td>-0.002621</td>
</tr>
<tr>
<th>2</th>
<td>2.025882</td>
<td>0.881859</td>
<td>-0.031231</td>
<td>0.003809</td>
<td>-0.009610</td>
<td>0.636919</td>
<td>0.006099</td>
<td>0.023052</td>
<td>-0.019402</td>
<td>0.196938</td>
<td>...</td>
<td>0.004825</td>
<td>-0.002954</td>
<td>0.002429</td>
<td>0.024075</td>
<td>-0.002344</td>
<td>0.006869</td>
<td>0.007860</td>
<td>0.003550</td>
<td>-0.000376</td>
<td>0.004453</td>
</tr>
<tr>
<th>3</th>
<td>-0.545908</td>
<td>0.648594</td>
<td>0.387437</td>
<td>-0.008829</td>
<td>0.219286</td>
<td>0.852600</td>
<td>0.037864</td>
<td>0.083376</td>
<td>0.211910</td>
<td>0.977409</td>
<td>...</td>
<td>0.002507</td>
<td>0.004520</td>
<td>0.002546</td>
<td>0.001082</td>
<td>0.003604</td>
<td>0.002208</td>
<td>0.004599</td>
<td>0.001259</td>
<td>0.001489</td>
<td>0.002720</td>
</tr>
<tr>
<th>4</th>
<td>2.023229</td>
<td>1.073306</td>
<td>1.197391</td>
<td>0.106130</td>
<td>1.185754</td>
<td>0.646488</td>
<td>1.362204</td>
<td>0.203931</td>
<td>0.284101</td>
<td>1.453561</td>
<td>...</td>
<td>0.001165</td>
<td>-0.001161</td>
<td>-0.000224</td>
<td>0.000626</td>
<td>-0.000193</td>
<td>-0.000454</td>
<td>0.004080</td>
<td>-0.002701</td>
<td>0.000496</td>
<td>0.000509</td>
</tr>
</tbody>
</table>
<p>5 rows × 14365 columns</p>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">recommend_movies</span><span class="p">(</span><span class="n">predictions_df</span><span class="p">,</span> <span class="n">userId</span><span class="p">,</span> <span class="n">movies_df</span><span class="p">,</span> <span class="n">originalratings_df</span><span class="p">,</span> <span class="n">num_recommendations</span><span class="p">):</span>
<span class="c"># Get and sort the user's predictions</span>
<span class="n">userrownumber</span> <span class="o">=</span> <span class="n">userId</span> <span class="o">-</span> <span class="mi">1</span> <span class="c"># userId starts at 1, not 0</span>
<span class="n">sortedpredictions</span> <span class="o">=</span> <span class="n">predictions_df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">userrownumber</span><span class="p">]</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="c"># Get the user's data and merge in the movie information.</span>
<span class="n">userdata</span> <span class="o">=</span> <span class="n">originalratings_df</span><span class="p">[</span><span class="n">originalratings_df</span><span class="o">.</span><span class="n">userId</span> <span class="o">==</span> <span class="p">(</span><span class="n">userId</span><span class="p">)]</span>
<span class="n">usercomplete</span> <span class="o">=</span> <span class="p">(</span><span class="n">userdata</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">movies_df</span><span class="p">,</span> <span class="n">how</span> <span class="o">=</span> <span class="s">'left'</span><span class="p">,</span> <span class="n">left_on</span> <span class="o">=</span> <span class="s">'movieId'</span><span class="p">,</span> <span class="n">right_on</span> <span class="o">=</span> <span class="s">'movieId'</span><span class="p">)</span><span class="o">.</span>
<span class="n">sort_values</span><span class="p">([</span><span class="s">'rating'</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'User {0} has already rated {1} movies.'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">userId</span><span class="p">,</span> <span class="n">usercomplete</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'Recommending highest {0} predicted ratings movies not already rated.'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">num_recommendations</span><span class="p">))</span>
<span class="c"># Recommend the highest predicted rating movies that the user hasn't seen yet.</span>
<span class="n">recommendations</span> <span class="o">=</span> <span class="p">(</span><span class="n">movies_df</span><span class="p">[</span><span class="o">~</span><span class="n">movies_df</span><span class="p">[</span><span class="s">'movieId'</span><span class="p">]</span><span class="o">.</span><span class="n">isin</span><span class="p">(</span><span class="n">usercomplete</span><span class="p">[</span><span class="s">'movieId'</span><span class="p">])]</span><span class="o">.</span>
<span class="n">merge</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">sortedpredictions</span><span class="p">)</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(),</span> <span class="n">how</span> <span class="o">=</span> <span class="s">'left'</span><span class="p">,</span>
<span class="n">left_on</span> <span class="o">=</span> <span class="s">'movieId'</span><span class="p">,</span>
<span class="n">right_on</span> <span class="o">=</span> <span class="s">'movieId'</span><span class="p">)</span><span class="o">.</span>
<span class="n">rename</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="p">{</span><span class="n">userrownumber</span><span class="p">:</span> <span class="s">'Predictions'</span><span class="p">})</span><span class="o">.</span>
<span class="n">sort_values</span><span class="p">(</span><span class="s">'Predictions'</span><span class="p">,</span> <span class="n">ascending</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span><span class="o">.</span>
<span class="n">iloc</span><span class="p">[:</span><span class="n">num_recommendations</span><span class="p">,</span> <span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">usercomplete</span><span class="p">,</span> <span class="n">recommendations</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ratedalready</span><span class="p">,</span> <span class="n">predictions</span> <span class="o">=</span> <span class="n">recommend_movies</span><span class="p">(</span><span class="n">predicted_df</span><span class="p">,</span> <span class="mi">1003</span><span class="p">,</span> <span class="n">movies_df</span><span class="p">,</span> <span class="n">ratings_df</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>User 1003 has already rated 174 movies.
Recommending highest 10 predicted ratings movies not already rated.
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ratedalready</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>userId</th>
<th>movieId</th>
<th>rating</th>
<th>timestamp</th>
<th>title</th>
<th>genres</th>
</tr>
</thead>
<tbody>
<tr>
<th>74</th>
<td>1003</td>
<td>2571</td>
<td>5.0</td>
<td>1209226214</td>
<td>Matrix, The (1999)</td>
<td>Action|Sci-Fi|Thriller</td>
</tr>
<tr>
<th>36</th>
<td>1003</td>
<td>1210</td>
<td>5.0</td>
<td>1209308048</td>
<td>Star Wars: Episode VI - Return of the Jedi (1983)</td>
<td>Action|Adventure|Sci-Fi</td>
</tr>
<tr>
<th>134</th>
<td>1003</td>
<td>7153</td>
<td>5.0</td>
<td>1209226291</td>
<td>Lord of the Rings: The Return of the King, The...</td>
<td>Action|Adventure|Drama|Fantasy</td>
</tr>
<tr>
<th>6</th>
<td>1003</td>
<td>110</td>
<td>5.0</td>
<td>1209226276</td>
<td>Braveheart (1995)</td>
<td>Action|Drama|War</td>
</tr>
<tr>
<th>130</th>
<td>1003</td>
<td>6874</td>
<td>5.0</td>
<td>1209226396</td>
<td>Kill Bill: Vol. 1 (2003)</td>
<td>Action|Crime|Thriller</td>
</tr>
<tr>
<th>118</th>
<td>1003</td>
<td>5952</td>
<td>5.0</td>
<td>1209227126</td>
<td>Lord of the Rings: The Two Towers, The (2002)</td>
<td>Adventure|Fantasy</td>
</tr>
<tr>
<th>94</th>
<td>1003</td>
<td>3578</td>
<td>5.0</td>
<td>1209226379</td>
<td>Gladiator (2000)</td>
<td>Action|Adventure|Drama</td>
</tr>
<tr>
<th>47</th>
<td>1003</td>
<td>1527</td>
<td>4.5</td>
<td>1209227148</td>
<td>Fifth Element, The (1997)</td>
<td>Action|Adventure|Comedy|Sci-Fi</td>
</tr>
<tr>
<th>137</th>
<td>1003</td>
<td>7438</td>
<td>4.5</td>
<td>1209226530</td>
<td>Kill Bill: Vol. 2 (2004)</td>
<td>Action|Drama|Thriller</td>
</tr>
<tr>
<th>26</th>
<td>1003</td>
<td>780</td>
<td>4.5</td>
<td>1209226261</td>
<td>Independence Day (a.k.a. ID4) (1996)</td>
<td>Action|Adventure|Sci-Fi|Thriller</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predictions</span>
</code></pre></div></div>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>movieId</th>
<th>title</th>
<th>genres</th>
</tr>
</thead>
<tbody>
<tr>
<th>4759</th>
<td>4963</td>
<td>Ocean's Eleven (2001)</td>
<td>Crime|Thriller</td>
</tr>
<tr>
<th>5208</th>
<td>5418</td>
<td>Bourne Identity, The (2002)</td>
<td>Action|Mystery|Thriller</td>
</tr>
<tr>
<th>4788</th>
<td>4993</td>
<td>Lord of the Rings: The Fellowship of the Ring,...</td>
<td>Adventure|Fantasy</td>
</tr>
<tr>
<th>1605</th>
<td>1721</td>
<td>Titanic (1997)</td>
<td>Drama|Romance</td>
</tr>
<tr>
<th>2549</th>
<td>2716</td>
<td>Ghostbusters (a.k.a. Ghost Busters) (1984)</td>
<td>Action|Comedy|Sci-Fi</td>
</tr>
<tr>
<th>42</th>
<td>47</td>
<td>Seven (a.k.a. Se7en) (1995)</td>
<td>Mystery|Thriller</td>
</tr>
<tr>
<th>4692</th>
<td>4896</td>
<td>Harry Potter and the Sorcerer's Stone (a.k.a. ...</td>
<td>Adventure|Children|Fantasy</td>
</tr>
<tr>
<th>1220</th>
<td>1291</td>
<td>Indiana Jones and the Last Crusade (1989)</td>
<td>Action|Adventure</td>
</tr>
<tr>
<th>326</th>
<td>344</td>
<td>Ace Ventura: Pet Detective (1994)</td>
<td>Comedy</td>
</tr>
<tr>
<th>2690</th>
<td>2858</td>
<td>American Beauty (1999)</td>
<td>Comedy|Drama</td>
</tr>
</tbody>
</table>
</div>
<h3 id="the-recommendations-look-pretty-solid"><strong>The recommendations look pretty solid!</strong></h3>
<h1 id="conclusion">Conclusion</h1>
<p>Low-dimensional matrix recommenders try to capture the underlying features driving the raw data (which we understand as tastes and preferences). From a theoretical perspective, if we want to make recommendations based on people’s tastes, this seems like the better approach. This technique also scales significantly better to larger datasets.</p>
<p>We do lose some meaningful signals by using a lower-rank matrix.</p>
<p>One particularly cool and effective strategy is to combine factorization and neighborhood methods into a single framework (http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf). This research field is extremely active; to understand it better, check out the Coursera specialization Introduction to Recommender Systems (https://www.coursera.org/specializations/recommender-systems).</p>Rajat MalhotraMatrix Factorization, Recommender Systems, SVD, PCANews Categorization using Multinomial Naive Bayes and Logistic Regression2018-02-25T00:00:00+00:002018-02-25T00:00:00+00:00https://malhotrajat.github.io/i-love-data/markup/NewsCategorization<h1 id="source">Source</h1>
<p>The dataset was taken from : https://www.kaggle.com/uciml/news-aggregator-dataset</p>
<p>This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014.</p>
<p>The categories in this dataset are business, science and technology, entertainment, and health. Different news articles that refer to the same news item (e.g., several articles about recently released employment statistics) are also categorized together.</p>
<h1 id="content">Content</h1>
<p>The columns included in this dataset are:</p>
<p><strong>ID</strong> : the numeric ID of the article</p>
<p><strong>TITLE</strong> : the headline of the article</p>
<p><strong>URL</strong> : the URL of the article</p>
<p><strong>PUBLISHER</strong> : the publisher of the article</p>
<p><strong>CATEGORY</strong> : the category of the news item; one of: b = business, t = science and technology, e = entertainment, m = health</p>
<p><strong>STORY</strong> : alphanumeric ID of the news story that the article discusses</p>
<p><strong>HOSTNAME</strong> : hostname where the article was posted</p>
<p><strong>TIMESTAMP</strong> : approximate timestamp of the article’s publication, given in Unix epoch time (this dataset stores milliseconds since midnight on Jan 1, 1970)</p>
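<p>As a quick aside, raw TIMESTAMP values can be turned into readable dates with pandas. The sample values below are invented for illustration (not actual rows from the file), and unit="ms" reflects the assumption that the CSV stores millisecond-precision epoch values:</p>

```python
import pandas as pd

# Illustrative epoch timestamps (not actual rows from the file);
# unit="ms" assumes millisecond-precision values
ts = pd.Series([1394470370698, 1407470000000])
dates = pd.to_datetime(ts, unit="ms")
print(dates.dt.year.tolist())  # [2014, 2014]
```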
<h1 id="acknowledgments">Acknowledgments</h1>
<p>This dataset comes from the UCI Machine Learning Repository. Any publications that use this data should cite the repository as follows:</p>
<p>Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.</p>
<p>This specific dataset can be found in the UCI ML Repository at this URL</p>
<h1 id="approach">Approach</h1>
<p>We will use the Multinomial Naive Bayes and Logistic Regression algorithms to categorize the news headlines.</p>
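<p>Before loading the real data, here is a minimal sketch of that approach on a few invented headlines: the scikit-learn building blocks imported below can be chained into a single pipeline and fit directly on raw text (the headlines, labels, and query here are made up for illustration):</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# A few invented headlines labelled with the dataset's category codes
headlines = ["Fed raises interest rates again",
             "New study links exercise to heart health",
             "Blockbuster sequel tops the box office chart",
             "Startup unveils new quantum computing chip"]
labels = ["b", "m", "e", "t"]  # business, health, entertainment, sci/tech

# Word counts -> tf-idf weights -> Multinomial Naive Bayes, in one pipeline
clf = Pipeline([("vect", CountVectorizer()),
                ("tfidf", TfidfTransformer()),
                ("nb", MultinomialNB())])
clf.fit(headlines, labels)
print(clf.predict(["quantum chip startup announces funding"]))
```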
<h1 id="importing-the-data-and-the-required-libraries">Importing the data and the required libraries</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#importing all the necessary libraries</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">CountVectorizer</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_extraction.text</span> <span class="kn">import</span> <span class="n">TfidfTransformer</span>
<span class="kn">from</span> <span class="nn">sklearn.naive_bayes</span> <span class="kn">import</span> <span class="n">MultinomialNB</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">metrics</span>
<span class="kn">import</span> <span class="nn">itertools</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"E:/Git/Project markdowns/Classifying news articles/uci-news-aggregator/uci-news-aggregator.csv"</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Analysing the structure of the data given to us</span>
<span class="n">data</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>
<div>
<style>
.dataframe thead tr:only-child th {
text-align: right;
}
.dataframe thead th {
text-align: left;
}
.dataframe tbody tr th {
vertical-align: top;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ID</th>
<th>TITLE</th>
<th>URL</th>
<th>PUBLISHER</th>
<th>CATEGORY</th>
<th>STORY</th>
<th>HOSTNAME</th>
<th>TIMESTAMP</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>Fed official says weak data caused by weather,...</td>
<td>http://www.latimes.com/business/money/la-fi-mo...</td>
<td>Los Angeles Times</td>
<td>b</td>
<td>ddUyU0VZz0BRneMioxUPQVP6sIxvM</td>
<td>www.latimes.com</td>
<td>1394470370698</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>Fed's Charles Plosser sees high bar for change...</td>
<td>http://www.livemint.com/Politics/H2EvwJSK2VE6O...</td>
<td>Livemint</td>
<td>b</td>
<td>ddUyU0VZz0BRneMioxUPQVP6sIxvM</td>
<td>www.livemint.com</td>
<td>1394470371207</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>US open: Stocks fall after Fed official hints ...</td>
<td>http://www.ifamagazine.com/news/us-open-stocks...</td>
<td>IFA Magazine</td>
<td>b</td>
<td>ddUyU0VZz0BRneMioxUPQVP6sIxvM</td>
<td>www.ifamagazine.com</td>
<td>1394470371550</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>Fed risks falling 'behind the curve', Charles ...</td>
<td>http://www.ifamagazine.com/news/fed-risks-fall...</td>
<td>IFA Magazine</td>
<td>b</td>
<td>ddUyU0VZz0BRneMioxUPQVP6sIxvM</td>
<td>www.ifamagazine.com</td>
<td>1394470371793</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>Fed's Plosser: Nasty Weather Has Curbed Job Gr...</td>
<td>http://www.moneynews.com/Economy/federal-reser...</td>
<td>Moneynews</td>
<td>b</td>
<td>ddUyU0VZz0BRneMioxUPQVP6sIxvM</td>
<td>www.moneynews.com</td>
<td>1394470372027</td>
</tr>
<tr>
<th>5</th>
<td>6</td>
<td>Plosser: Fed May Have to Accelerate Tapering Pace</td>
<td>http://www.nasdaq.com/article/plosser-fed-may-...</td>
<td>NASDAQ</td>
<td>b</td>
<td>ddUyU0VZz0BRneMioxUPQVP6sIxvM</td>
<td>www.nasdaq.com</td>
<td>1394470372212</td>
</tr>
<tr>
<th>6</th>
<td>7</td>
<td>Fed's Plosser: Taper pace may be too slow</td>
<td>http://www.marketwatch.com/story/feds-plosser-...</td>
<td>MarketWatch</td>
<td>b</td>
<td>ddUyU0VZz0BRneMioxUPQVP6sIxvM</td>
<td>www.marketwatch.com</td>
<td>1394470372405</td>
</tr>
<tr>
<th>7</th>
<td>8</td>
<td>Fed's Plosser expects US unemployment to fall ...</td>
<td>http://www.fxstreet.com/news/forex-news/articl...</td>
<td>FXstreet.com</td>
<td>b</td>
<td>ddUyU0VZz0BRneMioxUPQVP6sIxvM</td>
<td>www.fxstreet.com</td>
<td>1394470372615</td>
</tr>
<tr>
<th>8</th>
<td>9</td>
<td>US jobs growth last month hit by weather:Fed P...</td>
<td>http://economictimes.indiatimes.com/news/inter...</td>
<td>Economic Times</td>
<td>b</td>
<td>ddUyU0VZz0BRneMioxUPQVP6sIxvM</td>
<td>economictimes.indiatimes.com</td>
<td>1394470372792</td>
</tr>
<tr>
<th>9</th>
<td>10</td>
<td>ECB unlikely to end sterilisation of SMP purch...</td>
<td>http://www.iii.co.uk/news-opinion/reuters/news...</td>
<td>Interactive Investor</td>
<td>b</td>
<td>dPhGU51DcrolUIMxbRm0InaHGA2XM</td>
<td>www.iii.co.uk</td>
<td>1394470501265</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Getting the counts of each category</span>
<span class="n">data</span><span class="p">[</span><span class="s">"CATEGORY"</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>e 152469
b 115967
t 108344
m 45639
Name: CATEGORY, dtype: int64
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'Entertainment'</span><span class="p">,</span> <span class="s">'Business'</span><span class="p">,</span> <span class="s">'Science and Technology'</span><span class="p">,</span> <span class="s">'Health'</span><span class="p">]</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#visualizing the categories </span>
<span class="n">data</span><span class="p">[</span><span class="s">"CATEGORY"</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'pie'</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">names</span><span class="p">,</span> <span class="n">autopct</span><span class="o">=</span><span class="s">'</span><span class="si">%1.0</span><span class="s">f</span><span class="si">%%</span><span class="s">'</span><span class="p">,</span> <span class="n">subplots</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000227F372B320>], dtype=object)
</code></pre></div></div>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/newscategorization/output_11_1.png" alt="no-alignment" /></p>
<h1 id="preparing-the-data-to-be-fed-into-the-model">Preparing the data to be fed into the model</h1>
<p>We will split the original data into the training and testing sets using the train_test_split() function. We want a training size of 70% of the entire data.</p>
<p>It is worth noting that the train_test_split() function shuffles the rows before splitting, so with a dataset this large the training set’s class distribution stays close to that of the original data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">"TITLE"</span><span class="p">]</span>
<span class="n">y</span><span class="o">=</span><span class="n">data</span><span class="p">[</span><span class="s">"CATEGORY"</span><span class="p">]</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
</code></pre></div></div>
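<p>If we want to guarantee identical class proportions rather than rely on random shuffling, train_test_split() also accepts a <code>stratify</code> parameter. A minimal sketch on toy labels (not the actual dataset):</p>

```python
from sklearn.model_selection import train_test_split

# Toy labels: 8 of class 'b', 4 of class 'e' (a 2:1 ratio)
X = list(range(12))
y = ['b'] * 8 + ['e'] * 4

# stratify=y forces each split to preserve the 2:1 class ratio exactly
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(sorted(y_te))  # → ['b', 'b', 'e']
```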
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Calculating the number of rows in our train set</span>
<span class="nb">len</span><span class="p">(</span><span class="n">y_train</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>295693
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y_train</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'pie'</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">names</span><span class="p">,</span> <span class="n">autopct</span><span class="o">=</span><span class="s">'</span><span class="si">%1.0</span><span class="s">f</span><span class="si">%%</span><span class="s">'</span><span class="p">,</span> <span class="n">subplots</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000227F3748CF8>], dtype=object)
</code></pre></div></div>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/newscategorization/output_16_1.png" alt="no-alignment" /></p>
<p>We see that the training set has the same distribution as the original data and that’s what we wanted.</p>
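<p>The eyeball comparison of the two pie charts can also be made numeric: value_counts() accepts a <code>normalize=True</code> option that returns class proportions directly, so the original and training distributions can be compared in one line each. A sketch with toy labels:</p>

```python
import pandas as pd

y = pd.Series(list('eebbt' * 20))               # toy label column: 40% e, 40% b, 20% t
y_train = y.sample(frac=0.7, random_state=42)   # a 70% random subsample

# normalize=True returns proportions instead of raw counts
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
```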
<h1 id="training-the-multinomial-naive-bayes-classifier">Training the Multinomial Naive Bayes Classifier</h1>
<p>In order to train and test the classifier, the first step is to tokenize the headlines and count the number of occurrences of each word that appears in them.</p>
<p>We use the CountVectorizer() for that. Each term is assigned a unique integer index.</p>
<p>Then the counters are transformed to a TF-IDF representation using TfidfTransformer().</p>
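<p>The tokenize-count-reweight steps can be seen on a toy corpus (illustrative headlines, not rows from the dataset):</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["Fed official says weak data",
          "Stocks fall after Fed hints",
          "Fed risks falling behind"]

counts = CountVectorizer().fit_transform(corpus)   # sparse matrix of raw term counts
tfidf = TfidfTransformer().fit_transform(counts)   # same shape, reweighted by TF-IDF

print(counts.shape)  # → (3, 12): 3 headlines, 12 distinct terms
```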
<p>The last step creates the Multinomial Naive Bayes classifier.</p>
<p>In order to make the training process easier, scikit-learn provides a Pipeline class that behaves like a compound classifier.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">text_clf</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'vect'</span><span class="p">,</span> <span class="n">CountVectorizer</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'tfidf'</span><span class="p">,</span> <span class="n">TfidfTransformer</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'clf'</span><span class="p">,</span> <span class="n">MultinomialNB</span><span class="p">()),</span>
<span class="p">])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">text_clf</span> <span class="o">=</span> <span class="n">text_clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">predicted1</span> <span class="o">=</span> <span class="n">text_clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">metrics</span><span class="o">.</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predicted1</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.92273093130060124
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">metrics</span><span class="o">.</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predicted1</span><span class="p">,</span> <span class="n">target_names</span><span class="o">=</span><span class="nb">sorted</span><span class="p">(</span><span class="n">names</span><span class="p">)))</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> precision recall f1-score support
Business 0.89 0.91 0.90 34868
Entertainment 0.95 0.97 0.96 45630
Health 0.97 0.84 0.90 13658
Science and Technology 0.90 0.90 0.90 32570
avg / total 0.92 0.92 0.92 126726
</code></pre></div></div>
<p>We can see that the metrics (precision, recall and f1-score) average 0.92; the results for category e (entertainment) are even better.</p>
<p>The overall accuracy of classification is 92.273%</p>
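<p>Beyond the averaged scores, a confusion matrix shows which categories get mixed up with which; metrics.confusion_matrix (available from the imports above) works directly on the label arrays. A minimal sketch with stand-in labels rather than the real predictions:</p>

```python
from sklearn import metrics

# Stand-in labels for illustration, not the model's actual output
y_true = ['b', 'b', 'e', 'e', 'm', 't']
y_pred = ['b', 'e', 'e', 'e', 'm', 'b']

# rows = true class, columns = predicted class, in the order given by labels=
cm = metrics.confusion_matrix(y_true, y_pred, labels=['b', 'e', 'm', 't'])
print(cm)
```

<p>On the real test set, the same call with <code>y_test</code> and <code>predicted1</code> would show, for example, how many health headlines were misread as science and technology.</p>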
<h1 id="training-the-logistic-regression-classifier">Training the Logistic Regression Classifier</h1>
<p>In order to train and test the classifier, the first step is to tokenize the headlines and count the number of occurrences of each word that appears in them.</p>
<p>We use the CountVectorizer() for that. Each term is assigned a unique integer index.</p>
<p>Then the counters are transformed to a TF-IDF representation using TfidfTransformer().</p>
<p>The last step creates the Logistic Regression classifier. It is worth noting that the default mode for the LogisticRegression() function can only help us classify binary target variables. In order to be able to classify a multi-class problem, we specify <strong>multi_class=’multinomial’</strong>. Also worth noting is that the only solvers that can be used for a multiclass problem are: <strong>newton-cg, sag & lbfgs</strong> which are specified using the <strong>solver=…</strong> parameter in the LogisticRegression() function.</p>
<p>In order to make the training process easier, scikit-learn provides a Pipeline class that behaves like a compound classifier.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">text_np</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">([(</span><span class="s">'vect'</span><span class="p">,</span> <span class="n">CountVectorizer</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'tfidf'</span><span class="p">,</span> <span class="n">TfidfTransformer</span><span class="p">()),</span>
<span class="p">(</span><span class="s">'clf2'</span><span class="p">,</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">solver</span><span class="o">=</span><span class="s">'newton-cg'</span><span class="p">,</span> <span class="n">multi_class</span><span class="o">=</span><span class="s">'multinomial'</span><span class="p">)),</span>
<span class="p">])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">text_np</span> <span class="o">=</span> <span class="n">text_np</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">predicted2</span> <span class="o">=</span> <span class="n">text_np</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">metrics</span><span class="o">.</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predicted2</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.94406041380616446
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">metrics</span><span class="o">.</span><span class="n">classification_report</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predicted2</span><span class="p">,</span> <span class="n">target_names</span><span class="o">=</span><span class="nb">sorted</span><span class="p">(</span><span class="n">names</span><span class="p">)))</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> precision recall f1-score support
Business 0.92 0.93 0.92 34868
Entertainment 0.97 0.98 0.97 45630
Health 0.96 0.91 0.94 13658
Science and Technology 0.93 0.93 0.93 32570
avg / total 0.94 0.94 0.94 126726
</code></pre></div></div>
<p>The accuracy with the Logistic Regression model is better than with the Naive Bayes model. The average f1-score is 0.94, which is better than the 0.92 obtained with the Naive Bayes classifier.</p>
<h1 id="conclusion"><strong>Conclusion</strong></h1>
<p>A Naive Bayes method is slightly faster but the Logistic Regression model has a higher classification accuracy.</p>
<p>This difference arises because they optimize different objective functions even though both the algorithms utilize the same hypothesis space.</p>
<p>As the training size approaches infinity, the discriminative model (Logistic Regression) performs better than the generative model (Naive Bayes). However, the generative model reaches its asymptotic error faster, i.e., Naive Bayes approaches its asymptotic solution with fewer training examples than Logistic Regression does. Here the training set appears large enough for Logistic Regression to come out ahead. Another possible reason is that the features aren’t entirely conditionally independent (an assumption the Naive Bayes model relies on).</p>
<h1 id="scraping-indeedcom-for-jobs"><strong>Scraping Indeed.com for jobs</strong></h1>
<p><em>Rajat Malhotra, 2018-02-22, https://malhotrajat.github.io/i-love-data/markup/Indeedjobsearch</em></p>
<h1 id="motivation"><strong>Motivation</strong></h1>
<p>I am currently searching for a job, and one of the most time-consuming parts of the process is, well, the search itself. Finding all the relevant positions at all the companies across all cities by hand sounds impossible. So the motivation for creating a tool like this was simple: search the job roles posted daily and evaluate (at least very basically) whether the job descriptions match my skillset. For my analysis, I used Indeed.com, a major job aggregator used by many people daily.</p>
<h1 id="overview"><strong>Overview</strong></h1>
<p>This tool goes through all the jobs of a user-specified type in one or more cities and adds to a list the ones whose descriptions require skills that match the user’s skillset.
All of the code is written as functions so that the parameters, search terms, or the number of pages searched can be changed easily.</p>
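<p>As an illustration of how such a loop can be parameterized, Indeed’s search URLs take the query, the location, and a page offset as query-string parameters (the <code>q</code>, <code>l</code> and <code>sort</code> names mirror the URL used later in this post; the 10-results-per-page <code>start</code> offset is an assumption about the site’s URL scheme, which may change). A sketch using only the standard library:</p>

```python
from urllib.parse import urlencode

def search_urls(query, city, state, pages):
    """Build one Indeed search URL per results page.

    Assumes Indeed paginates 10 results per page via the 'start'
    parameter -- an inference from the site's URL scheme.
    """
    base = 'https://www.indeed.com/jobs?'
    return [base + urlencode({'q': query,
                              'l': f'{city}, {state}',
                              'sort': 'date',
                              'start': 10 * page})
            for page in range(pages)]

urls = search_urls('data scientist', 'Austin', 'TX', 2)
print(urls[0])
```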
<h1 id="scoring-a-job"><strong>Scoring a job</strong></h1>
<p>For my analysis, I will use the “data scientist” position, since that is what I am interested in.
Evaluating a data science job thoroughly isn’t simple, and every company has a different definition of who a “data scientist” is. But we can evaluate one superficially by checking whether a few keywords appear.
For my purposes, I will use the following keywords:</p>
<p><strong>R</strong></p>
<p><strong>SQL</strong></p>
<p><strong>Python</strong></p>
<p><strong>Hadoop</strong></p>
<p><strong>Tableau</strong></p>
<p>These are the skills that I possess and therefore if any job description contains any of these words, I want to know about it. Obviously, I won’t be applying to every job that contains any of these keywords, but a consolidated list of jobs I COULD apply to is a good start.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#importing the necessary libraries</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">import</span> <span class="nn">bs4</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">smtplib</span>
<span class="c">#Defining a function that would score a job based on the specific keywords you want the job description to contain</span>
<span class="k">def</span> <span class="nf">job_score</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="c">#obtaining the html script</span>
<span class="n">htmlcomplete</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">htmlcontent</span> <span class="o">=</span> <span class="n">bs4</span><span class="o">.</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">htmlcomplete</span><span class="o">.</span><span class="n">content</span><span class="p">,</span> <span class="s">'lxml'</span><span class="p">)</span>
<span class="n">htmlbody</span> <span class="o">=</span> <span class="n">htmlcontent</span><span class="p">(</span><span class="s">'body'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="c">#findin all the keywords</span>
<span class="n">r</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">'R[</span><span class="err">\</span><span class="s">,</span><span class="err">\</span><span class="s">.]'</span><span class="p">,</span> <span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="p">))</span>
<span class="n">sql</span> <span class="o">=</span> <span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'sql'</span><span class="p">)</span><span class="o">+</span><span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'Sql'</span><span class="p">)</span><span class="o">+</span><span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'SQL'</span><span class="p">)</span>
<span class="n">python</span> <span class="o">=</span> <span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'python'</span><span class="p">)</span><span class="o">+</span><span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'Python'</span><span class="p">)</span>
<span class="n">hadoop</span> <span class="o">=</span> <span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'hadoop'</span><span class="p">)</span><span class="o">+</span><span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'Hadoop'</span><span class="p">)</span><span class="o">+</span><span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'HADOOP'</span><span class="p">)</span>
<span class="n">tableau</span> <span class="o">=</span> <span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'tableau'</span><span class="p">)</span><span class="o">+</span><span class="n">htmlbody</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">'Tableau'</span><span class="p">)</span>
<span class="n">total</span><span class="o">=</span><span class="n">r</span><span class="o">+</span><span class="n">python</span><span class="o">+</span><span class="n">sql</span><span class="o">+</span><span class="n">hadoop</span><span class="o">+</span><span class="n">tableau</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'R count:'</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="s">','</span><span class="p">,</span><span class="s">'Python count:'</span><span class="p">,</span> <span class="n">python</span><span class="p">,</span> <span class="s">','</span><span class="p">,</span><span class="s">'SQL count:'</span><span class="p">,</span> <span class="n">sql</span><span class="p">,</span> <span class="s">','</span><span class="p">,</span><span class="s">'Hadoop count:'</span><span class="p">,</span> <span class="n">hadoop</span><span class="p">,</span> <span class="s">','</span><span class="p">,</span><span class="s">'Tableau count:'</span><span class="p">,</span> <span class="n">tableau</span><span class="p">,</span> <span class="s">','</span><span class="p">,)</span>
<span class="k">return</span> <span class="n">total</span>
</code></pre></div></div>
<h1 id="evaluating-an-example-job"><strong>Evaluating an example job</strong></h1>
<p>Let’s evaluate this “Data Insights Analyst” job from HomeAway.</p>
<p><img src="https://malhotrajat.github.io/i-love-data/assets/images/indeed_scraping/homeaway.jpg" alt="no-alignment" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">job_score</span><span class="p">(</span><span class="s">'https://www.indeed.com/viewjob?jk=29d57706cae9885e&tk=1c6l78ddmafhgf15&from=serp&vjs=3'</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>R count: 1 , Python count: 1 , SQL count: 2 , Hadoop count: 1 , Tableau count: 1 ,
6
</code></pre></div></div>
<h1 id="looking-at-the-html-script-behind-the-scenes"><strong>Looking at the HTML script behind the scenes</strong></h1>
<p>To extract information from the HTML script, we need to know how it is structured and where the relevant information is located within it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#This section of the code lets you see the html script so that you can understand the structure and what information can be extracted from which part of the script </span>
<span class="n">URL</span> <span class="o">=</span> <span class="s">'https://www.indeed.com/jobs?q=data&l=Austin</span><span class="si">%2</span><span class="s">C+TX&sort=date'</span>
<span class="c">#conducting a request of the stated URL above:</span>
<span class="n">complete</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">URL</span><span class="p">)</span>
<span class="c">#specifying a desired format of “page” using the html parser - this allows python to read the various components of the page, rather than treating it as one long string.</span>
<span class="n">content</span> <span class="o">=</span> <span class="n">bs4</span><span class="o">.</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">complete</span><span class="o">.</span><span class="n">text</span><span class="p">,</span> <span class="s">'html.parser'</span><span class="p">)</span>
<span class="c">#printing soup in a more structured tree format that makes for easier reading</span>
<span class="k">print</span><span class="p">(</span><span class="n">content</span><span class="o">.</span><span class="n">prettify</span><span class="p">())</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"><!DOCTYPE html></span>
<span class="nt"><html</span> <span class="na">lang=</span><span class="s">"en"</span><span class="nt">></span>
<span class="nt"><head></span>
<span class="nt"><meta</span> <span class="na">content=</span><span class="s">"text/html;charset=utf-8"</span> <span class="na">http-equiv=</span><span class="s">"content-type"</span><span class="nt">/></span>
<span class="nt"><script </span><span class="na">src=</span><span class="s">"/s/044574d/en_US.js"</span> <span class="na">type=</span><span class="s">"text/javascript"</span><span class="nt">></span>
<span class="nt"></script></span>
<span class="nt"><link</span> <span class="na">href=</span><span class="s">"/s/ecdfb5e/jobsearch_all.css"</span> <span class="na">rel=</span><span class="s">"stylesheet"</span> <span class="na">type=</span><span class="s">"text/css"</span><span class="nt">/></span>
<span class="nt"><link</span> <span class="na">href=</span><span class="s">"http://rss.indeed.com/rss?q=data&amp;l=Austin%2C+TX&amp;sort=date"</span> <span class="na">rel=</span><span class="s">"alternate"</span> <span class="na">title=</span><span class="s">"Data Jobs, Employment in Austin, TX"</span> <span class="na">type=</span><span class="s">"application/rss+xml"</span><span class="nt">/></span>
<span class="nt"><link</span> <span class="na">href=</span><span class="s">"/m/jobs?q=data&amp;l=Austin%2C+TX&amp;sort=date"</span> <span class="na">media=</span><span class="s">"only screen and (max-width: 640px)"</span> <span class="na">rel=</span><span class="s">"alternate"</span><span class="nt">/></span>
<span class="nt"><link</span> <span class="na">href=</span><span class="s">"/m/jobs?q=data&amp;l=Austin%2C+TX&amp;sort=date"</span> <span class="na">media=</span><span class="s">"handheld"</span> <span class="na">rel=</span><span class="s">"alternate"</span><span class="nt">/></span>
<span class="nt"><script </span><span class="na">type=</span><span class="s">"text/javascript"</span><span class="nt">></span>
<span class="k">if</span> <span class="p">(</span><span class="k">typeof</span> <span class="nb">window</span><span class="p">[</span><span class="s1">'closureReadyCallbacks'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">'undefined'</span><span class="p">)</span> <span class="p">{</span>
<span class="nb">window</span><span class="p">[</span><span class="s1">'closureReadyCallbacks'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[];</span>
<span class="p">}</span> <span class="p">.</span> <span class="p">.</span> <span class="p">.</span> <span class="p">.</span> <span class="p">.</span>
<span class="o"><</span><span class="sr">/a</span><span class="err">>
</span> <span class="o"><</span><span class="nx">div</span> <span class="kd">class</span><span class="o">=</span><span class="s2">" row result"</span> <span class="nx">data</span><span class="o">-</span><span class="nx">jk</span><span class="o">=</span><span class="s2">"0829198f649e9c08"</span> <span class="nx">data</span><span class="o">-</span><span class="nx">tn</span><span class="o">-</span><span class="nx">component</span><span class="o">=</span><span class="s2">"organicJob"</span> <span class="nx">data</span><span class="o">-</span><span class="nx">tu</span><span class="o">=</span><span class="s2">""</span> <span class="nx">id</span><span class="o">=</span><span class="s2">"p_0829198f649e9c08"</span><span class="o">></span>
<span class="o"><</span><span class="nx">h2</span> <span class="kd">class</span><span class="o">=</span><span class="s2">"jobtitle"</span> <span class="nx">id</span><span class="o">=</span><span class="s2">"jl_0829198f649e9c08"</span><span class="o">></span>
<span class="o"><</span><span class="nx">a</span> <span class="kd">class</span><span class="o">=</span><span class="s2">"turnstileLink"</span> <span class="nx">data</span><span class="o">-</span><span class="nx">tn</span><span class="o">-</span><span class="nx">element</span><span class="o">=</span><span class="s2">"jobTitle"</span> <span class="nx">href</span><span class="o">=</span><span class="s2">"/rc/clk?jk=0829198f649e9c08&amp;fccid=7c30762e902763ee&amp;vjs=3"</span> <span class="nx">onclick</span><span class="o">=</span><span class="s2">"setRefineByCookie([]); return rclk(this,jobmap[0],true,0);"</span> <span class="nx">onmousedown</span><span class="o">=</span><span class="s2">"return rclk(this,jobmap[0],0);"</span> <span class="nx">rel</span><span class="o">=</span><span class="s2">"noopener nofollow"</span> <span class="nx">target</span><span class="o">=</span><span class="s2">"_blank"</span> <span class="nx">title</span><span class="o">=</span><span class="s2">"Data Entry Associate"</span><span class="o">></span>
<span class="o"><</span><span class="nx">b</span><span class="o">></span>
<span class="nx">Data</span>
<span class="o"><</span><span class="sr">/b</span><span class="err">>
</span> <span class="nx">Entry</span> <span class="nx">Associate</span>
<span class="o"><</span><span class="sr">/a</span><span class="err">>
</span> <span class="o"><</span><span class="sr">/h2> . . . </span><span class="err">.
</span></code></pre></div></div>
<h1 id="extracting-job-data"><strong>Extracting Job Data</strong></h1>
<p>The next step, after defining a job scoring function, is to define a function that extracts all the relevant information from the HTML for every job on a single page.
We look only at non-sponsored (organic) postings and pull their attributes. These attributes contain plenty of information we don’t need, but we will let that be. What we do need are the following four things:</p>
<p><strong>Name of the company</strong></p>
<p><strong>Date when the job was posted</strong></p>
<p><strong>Title</strong></p>
<p><strong>Hyperlink to the job</strong></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">jobdata</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="n">htmlcomplete2</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">htmlcontent2</span> <span class="o">=</span> <span class="n">bs4</span><span class="o">.</span><span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">htmlcomplete2</span><span class="o">.</span><span class="n">content</span><span class="p">,</span> <span class="s">'lxml'</span><span class="p">)</span>
<span class="c">#only getting the tags for organic job postings and not the ones that are sponsored</span>
<span class="n">tags</span> <span class="o">=</span> <span class="n">htmlcontent2</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'div'</span><span class="p">,</span> <span class="p">{</span><span class="s">'data-tn-component'</span> <span class="p">:</span> <span class="s">"organicJob"</span><span class="p">})</span>
<span class="c">#getting the list of companies that have the organic job posting tags</span>
<span class="n">companies</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">span</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tags</span><span class="p">]</span>
<span class="c">#extracting the features like the company name, complete link, date, etc.</span>
<span class="n">attributes</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">h2</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">attrs</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tags</span><span class="p">]</span>
<span class="n">dates</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">find_all</span><span class="p">(</span><span class="s">'span'</span><span class="p">,</span> <span class="p">{</span><span class="s">'class'</span><span class="p">:</span><span class="s">'date'</span><span class="p">})</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">tags</span><span class="p">]</span>
<span class="c"># update attributes dictionaries with company name and date posted</span>
<span class="p">[</span><span class="n">attributes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">update</span><span class="p">({</span><span class="s">'company'</span><span class="p">:</span> <span class="n">companies</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">strip</span><span class="p">()})</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">attributes</span><span class="p">)]</span>
<span class="p">[</span><span class="n">attributes</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">update</span><span class="p">({</span><span class="s">'date posted'</span><span class="p">:</span> <span class="n">dates</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">strip</span><span class="p">()})</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">attributes</span><span class="p">)]</span>
<span class="k">return</span> <span class="n">attributes</span>
</code></pre></div></div>
<p>Now we can look at a sample of the attribute dictionary for the first job on the page I have specified.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">jobdata</span><span class="p">(</span><span class="s">'https://www.indeed.com/jobs?q=data&l=Austin</span><span class="si">%2</span><span class="s">C+TX&sort=date'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'class': ['turnstileLink'],
'company': 'Absolute Software',
'data-tn-element': 'jobTitle',
'date posted': 'Just posted',
'href': '/rc/clk?jk=0829198f649e9c08&fccid=7c30762e902763ee&vjs=3',
'onclick': 'setRefineByCookie([]); return rclk(this,jobmap[0],true,0);',
'onmousedown': 'return rclk(this,jobmap[0],0);',
'rel': ['noopener', 'nofollow'],
'target': '_blank',
'title': 'Data Entry Associate'}
</code></pre></div></div>
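<p>The full dictionary carries more than we need. As a quick standalone illustration (the <code class="highlighter-rouge">sample</code> dict below is just the output above, abbreviated, and is not part of the scraper), we can reduce it to the four fields of interest:</p>

```python
# hypothetical illustration: reduce one attributes dict (the sample output
# above, abbreviated) down to the four fields the scraper actually uses
sample = {'class': ['turnstileLink'],
          'company': 'Absolute Software',
          'date posted': 'Just posted',
          'href': '/rc/clk?jk=0829198f649e9c08&fccid=7c30762e902763ee&vjs=3',
          'title': 'Data Entry Associate'}
summary = {key: sample[key] for key in ('company', 'date posted', 'title', 'href')}
print(summary)
```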
<h1 id="defining-a-list-of-cities"><strong>Defining a list of cities</strong></h1>
<p>We define a list of cities that we want to search for jobs in.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#defining a list of cities you want to search jobs in</span>
<span class="n">citylist</span> <span class="o">=</span> <span class="p">[</span><span class="s">'New+York'</span><span class="p">,</span><span class="s">'Chicago'</span><span class="p">,</span> <span class="s">'Austin'</span><span class="p">]</span><span class="c">#, 'San+Francisco', 'Seattle', 'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Washington+DC', 'Boulder']</span>
</code></pre></div></div>
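<p>The <code class="highlighter-rouge">+</code> in each city name stands in for a URL-encoded space. Rather than writing the entries by hand, <code class="highlighter-rouge">urllib.parse.quote_plus</code> produces the same encoding, so new cities can be added programmatically (a small sketch, separate from the scraper itself):</p>

```python
from urllib.parse import quote_plus

# quote_plus encodes spaces as '+', matching the hand-written entries in citylist above
cities = ['New York', 'San Francisco', 'Washington DC']
encoded = [quote_plus(city) for city in cities]
print(encoded)  # ['New+York', 'San+Francisco', 'Washington+DC']
```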
<h1 id="searching-for-and-scoring-all-new-jobs"><strong>Searching for and Scoring all new jobs</strong></h1>
<p>I can now loop through Indeed.com and apply the functions defined above to every page.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#defining a list to store all the relevant jobs</span>
<span class="n">newjobslist</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c">#defining a new function to go through all the jobs posted in the last 'n' days for a specific role</span>
<span class="c">#essentially looping over two levels: cities, then result pages</span>
<span class="k">def</span> <span class="nf">newjobs</span><span class="p">(</span><span class="n">daysago</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">startingpage</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">pagelimit</span> <span class="o">=</span> <span class="mi">20</span><span class="p">,</span> <span class="n">position</span> <span class="o">=</span> <span class="s">'data+scientist'</span><span class="p">):</span>
<span class="k">for</span> <span class="n">city</span> <span class="ow">in</span> <span class="n">citylist</span><span class="p">:</span>
<span class="n">indeed_url</span> <span class="o">=</span> <span class="s">'http://www.indeed.com/jobs?q={0}&l={1}&sort=date&start='</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">position</span><span class="p">,</span> <span class="n">city</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">startingpage</span><span class="p">,</span> <span class="n">startingpage</span> <span class="o">+</span> <span class="n">pagelimit</span><span class="p">):</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'URL:'</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">indeed_url</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="o">*</span><span class="mi">10</span><span class="p">)),</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="n">attributes</span> <span class="o">=</span> <span class="n">jobdata</span><span class="p">(</span><span class="n">indeed_url</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="o">*</span><span class="mi">10</span><span class="p">))</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">attributes</span><span class="p">)):</span>
<span class="n">href</span> <span class="o">=</span> <span class="n">attributes</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'href'</span><span class="p">]</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">attributes</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'title'</span><span class="p">]</span>
<span class="n">company</span> <span class="o">=</span> <span class="n">attributes</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'company'</span><span class="p">]</span>
<span class="n">date_posted</span> <span class="o">=</span> <span class="n">attributes</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="s">'date posted'</span><span class="p">]</span>
<span class="k">print</span> <span class="p">(</span><span class="nb">repr</span><span class="p">(</span><span class="n">company</span><span class="p">),</span><span class="s">','</span><span class="p">,</span> <span class="nb">repr</span><span class="p">(</span><span class="n">title</span><span class="p">),</span><span class="s">','</span><span class="p">,</span> <span class="nb">repr</span><span class="p">(</span><span class="n">date_posted</span><span class="p">))</span>
<span class="n">evaluation</span> <span class="o">=</span> <span class="n">job_score</span><span class="p">(</span><span class="s">'http://indeed.com'</span> <span class="o">+</span> <span class="n">href</span><span class="p">)</span>
<span class="k">if</span> <span class="n">evaluation</span> <span class="o">>=</span> <span class="mi">1</span><span class="p">:</span>
<span class="n">newjobslist</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s">'{0}, {1}, {2}, {3}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">company</span><span class="p">,</span> <span class="n">title</span><span class="p">,</span> <span class="n">city</span><span class="p">,</span> <span class="s">'http://indeed.com'</span> <span class="o">+</span> <span class="n">href</span><span class="p">))</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">newjobsstring</span> <span class="o">=</span> <span class="s">'</span><span class="se">\n\n</span><span class="s">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">newjobslist</span><span class="p">)</span>
<span class="k">return</span> <span class="n">newjobsstring</span>
</code></pre></div></div>
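<p>Indeed paginates results with the <code class="highlighter-rouge">start</code> query parameter in steps of 10, which is why the loop above multiplies the page index by 10. A minimal standalone sketch of the URLs one city generates, assuming the same defaults as the function:</p>

```python
# sketch of the URLs the loop above requests for one city:
# page index i maps to Indeed's 'start' offset of i * 10
position, city = 'data+scientist', 'Austin'
indeed_url = 'http://www.indeed.com/jobs?q={0}&l={1}&sort=date&start='.format(position, city)
startingpage, pagelimit = 0, 2
urls = [indeed_url + str(i * 10) for i in range(startingpage, startingpage + pagelimit)]
print(urls)
```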
<h1 id="sending-an-email-to-myself"><strong>Sending an email to myself</strong></h1>
<p>I can now send an email to myself using the smtplib library.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">emailme</span><span class="p">(</span><span class="n">from_addr</span> <span class="o">=</span> <span class="s">'****'</span><span class="p">,</span> <span class="n">to_addr</span> <span class="o">=</span> <span class="s">'****'</span><span class="p">,</span> <span class="n">subject</span> <span class="o">=</span> <span class="s">'Daily Data Science Jobs Update Scraped from Indeed'</span><span class="p">,</span> <span class="n">text</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
<span class="n">message</span> <span class="o">=</span> <span class="s">'Subject: {0}</span><span class="se">\n\n</span><span class="s">Jobs: {1}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
<span class="c"># login information</span>
<span class="n">username</span> <span class="o">=</span> <span class="s">'****'</span>
<span class="n">password</span> <span class="o">=</span> <span class="s">'****'</span>
<span class="c"># send the message</span>
<span class="n">server</span> <span class="o">=</span> <span class="n">smtplib</span><span class="o">.</span><span class="n">SMTP</span><span class="p">(</span><span class="s">'smtp.gmail.com:587'</span><span class="p">)</span>
<span class="n">server</span><span class="o">.</span><span class="n">ehlo</span><span class="p">()</span>
<span class="n">server</span><span class="o">.</span><span class="n">starttls</span><span class="p">()</span>
<span class="n">server</span><span class="o">.</span><span class="n">login</span><span class="p">(</span><span class="n">username</span><span class="p">,</span> <span class="n">password</span><span class="p">)</span>
<span class="n">server</span><span class="o">.</span><span class="n">sendmail</span><span class="p">(</span><span class="n">from_addr</span><span class="p">,</span> <span class="n">to_addr</span><span class="p">,</span> <span class="n">message</span><span class="p">)</span>
<span class="n">server</span><span class="o">.</span><span class="n">quit</span><span class="p">()</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'Please check your mail'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
<span class="k">print</span> <span class="p">(</span><span class="s">'Searching for jobs...'</span><span class="p">)</span>
<span class="n">starting_page</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">page_limit</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">datascientist</span> <span class="o">=</span> <span class="n">newjobs</span><span class="p">(</span><span class="n">position</span> <span class="o">=</span> <span class="s">'data+scientist'</span><span class="p">,</span> <span class="n">startingpage</span> <span class="o">=</span> <span class="n">starting_page</span><span class="p">,</span> <span class="n">pagelimit</span> <span class="o">=</span> <span class="n">page_limit</span><span class="p">)</span>
<span class="n">emailme</span><span class="p">(</span><span class="n">text</span> <span class="o">=</span> <span class="n">datascientist</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Searching for jobs...
URL: http://www.indeed.com/jobs?q=data+scientist&l=New+York&sort=date&start=0
'J.Crew Group, Inc.' , 'Customer Analytics Manager' , 'Just posted'
R count: 0 , Python count: 1 , SQL count: 1 , Hadoop count: 0 , Tableau count: 0 ,
.
.
.
.
'Invenio Marketing Solutions' , 'Inside Sales Representative - Mediacom' , '4 days ago'
R count: 0 , Python count: 0 , SQL count: 0 , Hadoop count: 0 , Tableau count: 0 ,
Please check your mail.
</code></pre></div></div>
<p>Here’s a snapshot of the email you would receive:
<img src="https://malhotrajat.github.io/i-love-data/assets/images/indeed_scraping/email.JPG" alt="no-alignment" /></p>
<h1 id="ending-remarks"><strong>Ending Remarks</strong></h1>
<p>This was an interesting project to complete, and a lot of fun too. There are certainly improvements to be made, and the scraper could surface more information; I may make further changes in the future.</p>Rajat MalhotraIndeed.com, scraping, job descriptions