Hi all,


At the last PythonBrasil I gave a tutorial on Python and data analysis focused on recommender systems, the main topic I've been studying for the last few years. There is a popular Python package among statisticians and data scientists called Pandas. I had watched several talks and keynotes about it, but had never tried it myself. The tutorial gave me that chance, and afterwards both the audience and I were quite excited about the potential and power this library offers.

This post starts a series of articles I will write about recommender systems, and also serves as an introduction to the refreshed library I am working on: Crab, a Python library for building recommender systems. :)

This post covers the first topic in the series: non-personalized recommender systems, with several examples using the Python package **Pandas**. In the future I will also post an alternative version of this article referencing Crab, showing how the same ideas work with it. But first, let's introduce what Pandas is.

## Introduction to Pandas

**Pandas** is a data analysis library for Python that is great for data preparation, joining, and ultimately generating well-formed tabular data that is easy to use in a variety of visualization tools or (as we will see here) machine learning applications. For a further introduction to Pandas, check this website or this notebook.

## Non-personalized Recommenders

**Non-personalized recommenders** recommend items to consumers based on what other consumers have said about the items on average. That is, the recommendations are independent of the customer, so every customer gets the same recommendations. For example, if you go to amazon.com as an anonymous user, it shows items that are currently being viewed by other members.

Generally, recommendations come in two flavours: predictions or recommendations. Predictions are simple statements expressed as scores, stars, or counts. Recommendations, on the other hand, are generally just a list of items shown without any number attached to them.

Let's go through an example:

*Simple prediction using an average*

The score, on a scale of 1 to 5, for the book Programming Collective Intelligence was **4.5 stars out of 5**.

This is an example of a simple prediction: it displays a simple average of other customers' reviews of the book.

The math behind it is quite simple:

Score = (65 × 5 + 18 × 4 + 7 × 3 + 4 × 2 + 2 × 1) / (65 + 18 + 7 + 4 + 2)

Score = 428 / 96

Score ≈ 4.46 ≈ 4.5 out of 5 stars
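The average above takes only a couple of lines of Python to verify, using the star counts from the review breakdown:

```python
# Star counts from the example above:
# 65 five-star, 18 four-star, 7 three-star, 4 two-star, 2 one-star reviews.
star_counts = {5: 65, 4: 18, 3: 7, 2: 4, 1: 2}

total_stars = sum(stars * count for stars, count in star_counts.items())
total_reviews = sum(star_counts.values())
average = total_stars / total_reviews

print(total_stars, total_reviews, round(average, 2))  # 428 96 4.46
```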

But how does Amazon come up with those recommendations? There are several techniques that could be applied. One is association rule mining, a data mining technique that generates a set of rules and combinations of items that were bought together. Another is a simple measure based on the proportion of customers who bought both X and Y among those who bought X. Let's explain with some maths:

Let X be the book Programming Collective Intelligence, and let Y be each of the other books purchased by customers who bought X. You compute the ratio given below for each book Y, sort the books in descending order, and finally pick the top K to show as related. :D

*Score(X, Y) = Total Customers who purchased X and Y / Total Customers who purchased X*

Using this simple score function for all the books, you will get:

- Python for Data Analysis: 100%
- Startup Playbook: 100%
- MongoDB: The Definitive Guide: 0%
- Machine Learning for Hackers: 0%
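A minimal sketch of this score function, using a small made-up purchase table (the customer IDs and purchases are hypothetical, chosen only to reproduce the percentages above):

```python
# Hypothetical purchase data, made up for illustration (not real Amazon data):
# each customer maps to the set of books they bought.
purchases = {
    "A": {"Programming Collective Intelligence", "Python for Data Analysis", "Startup Playbook"},
    "B": {"Programming Collective Intelligence", "Python for Data Analysis", "Startup Playbook"},
    "C": {"Python for Data Analysis", "Startup Playbook", "MongoDB: The Definitive Guide"},
    "D": {"Startup Playbook", "Machine Learning for Hackers"},
    "E": {"Startup Playbook"},
}

def simple_score(x, y, purchases):
    """Fraction of customers who bought x that also bought y."""
    bought_x = [books for books in purchases.values() if x in books]
    bought_both = [books for books in bought_x if y in books]
    return len(bought_both) / len(bought_x)

x = "Programming Collective Intelligence"
for y in ["Python for Data Analysis", "Startup Playbook",
          "MongoDB: The Definitive Guide", "Machine Learning for Hackers"]:
    print(y, simple_score(x, y, purchases))
```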

As we imagined, the book Python for Data Analysis makes perfect sense. But why did Startup Playbook come out on top when it is also heavily purchased by customers who have not bought Programming Collective Intelligence? This is a famous trap in e-commerce applications called the ***banana trap***. Let's explain: in a grocery store, most customers will buy bananas. If someone buys a razor and a banana, you cannot conclude that the purchase of the razor influenced the purchase of the banana. Hence we need to adjust the math to handle this case as well. The modified version:

*Score(X, Y) = (Total customers who purchased X and Y / Total customers who purchased X) / (Total customers who did **not** purchase X but purchased Y / Total customers who did **not** purchase X)*

Substituting the numbers, we get:

Python for Data Analysis = (2 / 2) / (1 / 3) = 1 / (1/3) = 3

Startup Playbook = (2 / 2) / (3 / 3) = 1
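The adjusted score is just as easy to sketch. Using a hypothetical purchase table chosen so the counts match the substitution above (two customers bought Programming Collective Intelligence, three did not):

```python
# Hypothetical purchase data, made up so the counts match the worked example.
purchases = {
    "A": {"Programming Collective Intelligence", "Python for Data Analysis", "Startup Playbook"},
    "B": {"Programming Collective Intelligence", "Python for Data Analysis", "Startup Playbook"},
    "C": {"Python for Data Analysis", "Startup Playbook", "MongoDB: The Definitive Guide"},
    "D": {"Startup Playbook", "Machine Learning for Hackers"},
    "E": {"Startup Playbook"},
}

def adjusted_score(x, y, purchases):
    """P(bought y | bought x) divided by P(bought y | did not buy x).

    Values well above 1 mean x really drives the purchase of y,
    while a value near 1 means y is just popular with everyone (a banana).
    """
    with_x = [books for books in purchases.values() if x in books]
    without_x = [books for books in purchases.values() if x not in books]
    p_y_given_x = sum(y in books for books in with_x) / len(with_x)
    p_y_given_not_x = sum(y in books for books in without_x) / len(without_x)
    return p_y_given_x / p_y_given_not_x

x = "Programming Collective Intelligence"
print(adjusted_score(x, "Python for Data Analysis", purchases))  # 3.0
print(adjusted_score(x, "Startup Playbook", purchases))          # 1.0
```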

The denominator acts as a **normalizer**, and you can see that Python for Data Analysis now clearly stands out. Interesting, isn't it?
In the next article I will work more with non-personalized recommenders, presenting some ranking algorithms that I developed for Atepassar.com to rank professors. :)

## Examples with a real dataset (let's play with the CourseTalk dataset)

Let's use **Pandas** to read all the data and start showing what we can do with Python, presenting a list of top courses ranked by some non-personalized metrics :)

**Update**: For better analysis I hosted all the code in an IPython Notebook at the following link, using nbviewer.

All the dataset and source code will be provided at Crab's GitHub; the idea is to work on those notebooks toward a future book about recommender systems :)
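As a taste of the kind of non-personalized ranking the notebook builds, here is a minimal sketch with Pandas. The tiny inline DataFrame and its column names (`course`, `rating`) are stand-ins for the CourseTalk dump, which would normally be loaded with `pd.read_csv`:

```python
import pandas as pd

# Toy stand-in for the CourseTalk ratings (hypothetical columns);
# the real data would be loaded with something like pd.read_csv("ratings.csv").
ratings = pd.DataFrame({
    "course": ["ML", "ML", "ML", "Stats", "Stats", "Algo"],
    "rating": [5, 4, 5, 3, 4, 5],
})

# Non-personalized ranking: average rating per course, plus the review count
# so we can later filter out courses with too few reviews.
stats = ratings.groupby("course")["rating"].agg(["mean", "count"])
top = stats.sort_values("mean", ascending=False)
print(top)
```

In the notebook the same idea is applied to the full dataset, where filtering by a minimum review count matters a lot more.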

I hope you enjoyed this article. Stay tuned for the next one, about another type of non-personalized recommender: ranking algorithms for vote up/vote down systems!

Special thanks to Diego Manillof for the tutorial :)

Cheers,

Marcel Caraciolo
