Introduction to Recommendations with Map-Reduce and mrjob

Thursday, August 23, 2012

Hi all,

In this post I will present how can we use map-reduce programming model for making recommendations.   Recommender systems are quite popular among shopping sites and social network thee days. How do they do it ?   Generally, the user interaction data available from items and products in shopping sites and social networks are enough information to build a recommendation engine using classic techniques such as Collaborative Filtering.

Why Map-Reduce ?

MapReduce is a framework originally developed at Google that allows easy large scale distributed computing across a number of domains. Apache Hadoop is an open source implementation of it.  It scales well to many thousands of nodes and can handle petabytes of data. For recommendations where we have to find the similar products to a product you are interested at , we must calculate how similar pairs of items are. For instance, if someone watches the movie Matrix, the recommender would suggest the film Blade Runner. So we need to compute the similarity between two movies. One way is to find correlation between pairs of items.  But if you own a shopping site, which has 500,00 products, potentially we would have to compute over 250 billion computations. Besides the computation, the correlation data will be sparse, because it's unlikely that every pair of items will have some user interested in them. So we have a large and sparse dataset. And we have also to deal with temporal aspect since the user interest in products changes with time, so we need the correlation calculation done periodically so that the results are up to date.  For these reason the best way to handle with this scenarion and problem is going after a divide and conquer pattern, and MapReduce is a powerful framework and can be used to implement data mining algorithms.  You can take a look at this post about MapReduce or go to these video classes about Hadoop.

Map-Reduce Architecture

Meeting mrjob

mrjob is a Python package that helps you write and run Hadoop Streaming jobs. It supports Amazon's Elastic MapReduce(EMR) and it also works with your own Hadoop cluster.  It has been released as an open-source framework by Yelp and we will use it as interface for Hadoop since its legibility and ease to handle with MapReduce tasks.  Check this link to see how to to download and use it.

Movie Similarities

Imagine that you own a online movie business, and you want to suggest for your clients movie recommendations.  Your system runs a rating system, that is, people can rate movies with 1 to 5 starts, and we will assume for simplicity that all of the ratings are stored in a csv file somewhere.
Our goal is to calculate how similar pairs of movies are, so that we recommend movies similar to movies you liked.  Using the correlation we can:

  • For every pair of movies A and B, find all the people  who rated botha A and B.
  • Use these ratings to form a Movie A vector and a Movie B vector.
  • Calculate the correlation between those two vectors
  • When someone watches a movie, you can recommend the movies most correlated with it

So the first step is to get our movies file which has three columns:  (user, movie, rating). For this task we will use the MovieLens Dataset of Movie Ratings with 10.000 ratings from 1000 users on 1700 movies (you can download it at this link).

Here it is a sample of the dataset file after normalized.

So let's start by reading the ratings into the MovieSimilarities job.

You want to compute how similar pairs of movies are, so that if someone watches the movie The Matrix, you can recommend movies like BladeRunner. So how should you define the similarity between two movies ?

One possibility is to compute their correlation. The basic idea behind it is for every pair of movies A and B, find all the people who rated both A and B. Use these ratings to form a Movie A vector and a Movie B vector.  Then, calculate the correlation between these two vectors.  Now when someone watches a movie, you can now recommend him the movies most correlated with it.

So let's divide to conquer. Our first task is for each user, emit a row containing their 'postings' (item, rating). And for reducer, emit the user rating sum and count for use later steps.

Before using these rating pairs to calculate correlation,  let's see how we can compute it.  We know that they can be formed as vectors of ratings, so we can use linear algebra to perform norms and dot products, as alo to compute the length of each vector or the sum over all elements in each vector. By representing them as matrices, we can perform several operations on those movies.
To summarize, each row in calculate similarity will compute the number of people who rated both movie and movie2 , the sum over all elements in each ratings vectors (sum_x, sum_y) and the squared sum of each vector (sum_xx, sum__yy). So  we can now can calculate the correlation between the movies. The correlation can be expressed as:

So that's it! Now the last step of the job that will sort the top-correlated items for each item and print it to the output.

So let's see the output. Here's a sample of the top output I got:

MovieA MovieB Correlation
Return of the Jedi (1983) Empire Strikes Back, The (1980) 0.787655
Star Trek: The Motion Picture (1979) Star Trek III: The Search for Spock (1984) 0.758751
Star Trek: Generations (1994) Star Trek V: The Final Frontier (1989) 0.72042
Star Wars (1977) Return of the Jedi (1983) 0.687749
Star Trek VI: The Undiscovered Country (1991) Star Trek III: The Search for Spock (1984) 0.635803
Star Trek V: The Final Frontier (1989) Star Trek III: The Search for Spock (1984) 0.632764
Star Trek: Generations (1994) Star Trek: First Contact (1996) 0.602729
Star Trek: The Motion Picture (1979) Star Trek: First Contact (1996) 0.593454
Star Trek: First Contact (1996) Star Trek VI: The Undiscovered Country (1991) 0.546233
Star Trek V: The Final Frontier (1989) Star Trek: Generations (1994) 0.4693
Star Trek: Generations (1994) Star Trek: The Wrath of Khan (1982) 0.424847
Star Trek IV: The Voyage Home (1986) Empire Strikes Back, The (1980) 0.38947
Star Trek III: The Search for Spock (1984) Empire Strikes Back, The (1980) 0.371294
Star Trek IV: The Voyage Home (1986) Star Trek VI: The Undiscovered Country (1991) 0.360103
Star Trek: The Wrath of Khan (1982) Empire Strikes Back, The (1980) 0.35366
Stargate (1994) Star Trek: Generations (1994) 0.347169
Star Trek VI: The Undiscovered Country (1991) Empire Strikes Back, The (1980) 0.340193
Star Trek V: The Final Frontier (1989) Stargate (1994) 0.315828
Star Trek: The Wrath of Khan (1982) Star Trek VI: The Undiscovered Country (1991) 0.222516
Star Wars (1977) Star Trek: Generations (1994) 0.219273
Star Trek V: The Final Frontier (1989) Star Trek: The Wrath of Khan (1982) 0.180544
Stargate (1994) Star Wars (1977) 0.153285
Star Trek V: The Final Frontier (1989) Empire Strikes Back, The (1980) 0.084117

As we would expect we can notice that
  • Star Trek movies are similar to other Star Trek movies.
  • The people  who likes Star Trek movies are not so fans of Star Wars and vice-versa;
  • Star Wars Fans will be always fans! :D
  • The Sci-Fi movies are quite similar to each other;
  • Star Trek III: The Search for Spock (1984) is one the best movies of Star Trek (several positive correlations)

To see the full code, checkout the Github repository here.

Book Similarities

Let's see another dataset. What about Book Ratings ? Let's see this dataset of 1 million book ratings.   Here's again a sample of it:

But now we want to compute other similarity measures besides correlation. Let's take a look on them.

Cossine Similarity

Another common vector-based similarity measure.

Regularized Correlation

We could use regularized correlation by adding N virtual movie pairs that have zero correlation. This helps avoid noise if some movie pairs have very few raters in common.

The implicit data can be useful. In some cases only because you rate a Toy Store movie, even if you rate it quite horribly, you can still be interested in similar animation movies.  So we can ignore the value itself of each rating and use a set-based similarity measure such as the Jaccard Similarity.

Now, let's add all those similarities to our mapreduce job and make some adjustments by making a new job for counting the number of raters for each movie. It will be required for computing the jaccard similarity.

Ok,  let's take a look at the book similarities now with those new fields.

BookA BookB Correlation Cossine Reg Corr Jaccard Mutual Raters
The Return of the King (The Lord of The Rings, Part 3)The Voyage of the Dawn Treader (rack) (Narnia) 0 0.998274 0 0.068966 2
The Return of the King (The Lord of the Rings, Part 3) The Man in the Black Suit : 4 Dark Tales 0 1 0 0.058824 6
The Fellowship of the Ring (The Lord of the Rings, Part 1) The Hobbit : The Enchanting Prelude to The Lord of the Rings 0.796478 0.997001 0.49014 0.045714 16
The Two Towers (The Lord of the Rings, Part 2) Harry Potter and the Prisoner of Azkaban (Book 3) -0.184302 0.992536 -0.087301 0.022277 9
Disney's 101 Dalmatians (Golden Look-Look Books) Walt Disney's Lady and the Tramp (Little Golden Book) 0.88383 1 0.45999 0.166667 5
Disney's 101 Dalmatians (Golden Look-Look Books) Disney's Beauty and the Beast (Golden Look-Look Book) 0.76444 1 0.2339 0.166667 7
Disney's Pocahontas (Little Golden Book) Disney's the Lion King (Little Golden Book) 0.54595 1 0.6777 0.1 4
Disney's the Lion King (Disney Classic Series) Walt Disney Pictures presents The rescuers downunder (A Little golden book) 0.34949 1 0.83833 0.142857 3
Harry Potter and the Order of the Phoenix (Book 5) Harry Potter and the Goblet of Fire (Book 4) 0.673429 0.994688 0.559288 0.119804 49
Harry Potter and the Chamber of Secrets (Book 2) Harry Potter and the Goblet of Fire (Book 4) 0.555423 0.993299 0.496957 0.17418 85
The Return of the King (The Lord of The Rings, Part 3) Harry Potter and the Goblet of Fire (Book 4) -0.2343 0.02022 -0.08383 0.015444 4

  • Lord of The Rings, books are similar to other Lord of The Ring books
  • Walt Disney books are similar to other Walt Disney books. 
  • Lord of The Ring books does not stick together Harry Potter books.

The possibilities are endless.

But is it possible to generalize our input and make our code to generate similarities for different inputs ? Yes it is.  Let's abstract our input. For this, we will create a VectorSimilarities Class that represents input data in the following format:

So if we want to define a new input format, just subclass the VectorSimilarities class and implement the method input.

So here's the class for the book recommendations using our new VectorSimilarities.

And here's the class for the movies recommendations. It simply reads from a data file and lets the VectorSimilarities superclass do the work.


As you noticed map-reduce is a powerful technique for numerical computation and speacially when you have to compute large datasets. There are several optimization I can do in those scripts such as numpy vectorizations for computing the similarities. I will explore more these features in the next posts: one handling with recommender systems and popular social networks as also how you can use the Amazon EMR infrastructure to compute your jobs!

I'd like to thank Edwin Chen and his post using those examples with Scala and whose post inspired me to explore these examples above in Python.

All code for those examples above can be downloaded at my github repository.

Stay tunned,

I hope you enjoyed this article,

Best regards,

Marcel Caraciolo


  1. Awesome! Thank you

  2. Thanks, great job!

  3. Thanks so much man! Really looking forward to your next few posts which hopefully will delve deeper into the workings of mrjob.

  4. Great post, Please let me know how " def calculate_similarity " works

    The output has 19,21 0.4,2 and 21,19 0.4,2 but there is only one yield statement

    In "def calculate_ranking", I think I should do 1- similarity to get the Highest correlated to the top .

    In "def top_similar_items" Where do you give the K to emit the closest items


  5. Nice online shopping blog thank you for sharing this important info about online shopping

  6. hai,Very nice.i have learn to this posts.Thanks for that.

    Hadoop Training in Chennai

  7. It's helpful!
    Thanks for sharing!

  8. Really is very interesting, I saw your website and get more details..Nice work. Thanks regards,
    Refer this link below,
    SAS Training in Chennai

  9. When I was taught about mapreduce one of the key components was the combiner. It is a step between the mapper and the reducer which essentially runs the reducer at the end of the map phase in order to decrease the number of lines of data that the mapper is outputting. As the size of the data I need to process increases (at the muti-terabyte scale), the reduce step becomes prohibitively slow. I talked to a friend of mine and he says that this has been his experience too, and that instead of using a combiner, he partitions his reduce key using a hash function which reduces the number of values that go to each key in the reduce step. I tried this and it worked. Has anyone else had this experience with the combiner step not scaling well, and why can't I find any documentation of this problem as well as the workaround? I'd rather not use a workaround if there is a way to make the combiner step scale.
    java barcode maker

  10. Thanks for sharing this informative blog. If anyone wants to get Big Data Training in Chennai visit fita academy located at Chennai, which offers best Hadoop Training in Chennai with years of experienced professionals.

  11. Thanks for sharing the information about recommendations with Map-Reduce and mrjob

    seo Training in chennai

  12. Thanks for sharing this informative blog. Recently I did Digital Marketing Training in Chennai at a leading digital marketing company. It's really useful for me to make a bright career. To know more details about this course please visit FITA.

  13. Hi, I am Victoria lives in Chennai. I am technology freak. I did Hadoop Training in Chennai at FITA which offers best Big Data Training in Chennai. This is useful for me to make a bright career in IT field.

  14. I have read all the articles in your blog; was really impressed after reading it.If anyone focus the Best sas training in Chennai. Let us know we are ready to serve for your career. FITA is pleased to inform you that; we provides practical training on all the technologies with the MNC exports having more than 5 years of experience in your preferred domain. Get your career with our knowledge.
    sas training institute in Chennai|sas training chennai

  15. SEO Training institute Chennai

    Your information is really useful for me.Thanks for sharing such a valuable information. If anyone wants to get SEO Course in Chennai visit FITA Academy located at Chennai. Rated as No.1 SEO Training Center in Chennai.

    SEO Training in Chennai


  16. The information you posted here is useful to make my career better keep updates...If anyone want to get Cloud Computing Training in Chennai, Please visit FITA academy located at Chennai Velachery which offer best Cloud Computing Course in Chennai.

    Cloud Computing Training Centers in Chennai

  17. Statistical Analysis System (SAS) is an integrated system of software products provided by SAS Institute Inc., The most common description of statistics is that it’s the process of analyzing data — number crunching, in a sense. Visit Us, SAS Training in Chennai

  18. Statistical Analysis System (SAS) is an hadoop training in chennai integrated system of software products provided by SAS Institute Inc.,oracle training in chennai The most common description of statistics is that it’s the process of analyzing data — number oracle dba training in chennai crunching, in a sense.

  19. This comment has been removed by the author.

  20. Thank you for sharing you article very useful informaction for haddop software testing.Hadoop Training Center in Chennai

  21. This is certainly one of the most valuable article. Great tips from beginning to till end. Lot of information are available here.Super article.
    SEO Training in chennai | SEO Training chennai | SEO Course in chennai | SEO Course chennai

  22. They are offer the software testing for hadoop programming language.This is very useful for Hadoop programming dots clear.I have read you article very useful information Software testing. Thank you for sharing you article.Hadoop Training Chennai | .Hadoop Training Courses in Chennai

  23. How to specify specific Movie and get top five results as output

  24. I love this post & you have shared valid information to our vision.FITA is the right place to take sas training in Chennai, we are the professional training institute provides on the entire technical course with the wonderful job assurance.
    sas course in Chennai | sas institutes in Chennai

  25. This comment has been removed by the author.

  26. Hi friends, This is Jamuna from Chennai. Your technical information is really useful for me. Keep update your blog.
    Best Oracle Training in Chennai

  27. Thanks for giving important information to training seekers,Keep posting useful information
    SharePoint Course in Chennai

  28. Nice post. PHP is one of the server side scripting language mainly used for designing website. So learning PHP Training in Chennai is really useful to make a better career.
    HTML5 Training in Chennai

  29. Thanks for sharing this information. Salesforce is a cloud based CRM. Nowadays most of the multinational companies used this CRM for managing their customers. To know more details call 9841746595.

    Salesforce Training in Chennai

  30. Thanks for sharing; Salesforce crm cloud application provides special cloud computing tools for your client management problems. It’s a fresh technology in IT industries for the business management.
    Salesforce training in Chennai|Informatica training in chennai

  31. Good to learn something new from this blog. Thanks for sharing such worthy article. You can also see some school website design here.


  32. The information you have given here is truly helpful to me. CCNA- It’s a certification program based on routing & switching for starting level network engineers that helps improve your investment in knowledge of networking & increase the value of employer’s network,
    ccna training institute in Chennai|ccna courses in Chennai|Salesforce training in Chennai

  33. Oracle Training in Chennai is one of the best oracle training institute in Chennai which offers complete Oracle training in Chennai by well experienced Oracle Training in chennai Consultants having more than 12+ years of IT experience.

  34. Informatica training in Chennai cover all aspects of DataWarehousing including
    Informatica Training in chennai Power Center,Get access to Informatica, ETL, Business Intelligence, Analytical SQL, Unix Shell Scripting, Data Modeling using ERWIN, ETL testing, Performance Tuning and Informatica administrator and more courses.

  35. Greens Technology provides Best PEGA training courses in chennai.PEGA training course content designed basic to advanced levels. Pega Training In Chennai we have a team of PEGA experts who are working professionals with hands on real time PEGA projects knowledge, which will give students an edge over other Training Institutes.

  36. Greens Technology Apache Hadoop training in Chennai is the expert source for Apache Hadoop training and certification. We offer public and private Hadoop Training in Chennai courses for developers and administrators with certification for IT professionals.

  37. HP Unified Functional Testing (UFT) software, formerly known as HP QuickTest Professional (QTP) QTP Training in Chennai,provides functional and regression test automation for software applications and environments.

  38. SAS (Statistical Analysis System) is one of the most popular softwares used in the world of analytics & big data.SAS helps in data management, data cleaning and statistically analysing data. SAS certifications offered by the SAS Training in Chennai Institute are highly sought after and globally recognized.

  39. Green Technologies In Chennai Greens Technology is a leading Training and Placement company in Chennai. We are known for our practical approach towards trainings that enable students to gain real-time exposure on competitive technologies.
    Trainings are offered by employees from MNCs to give a real corporate exposure.

  40. Thanks for the notes that you have published here. Though it looks familiar, it's a very good approach you have implemented here. Thanks for posting content in your blog. I feel awesome to be here.

    cloud computing training in chennai
    Best Institute for Cloud Computing in Chennai
    hadoop training in chennai

  41. Hi admin thanks for sharing informative article on hadoop technology. In coming years, hadoop and big data handling is going to be future of computing world. This field offer huge career prospects for talented professionals. Thus, taking Hadoop Training in Chennai will help you to enter big data technology.