
Non-Personalized Recommender Systems with Pandas and Python

Tuesday, October 22, 2013


Hi all,

At the last PythonBrasil I gave a tutorial about Python and data analysis focused on recommender systems, the main topic I have been studying for the last few years. There is a Python package popular among statisticians and data scientists called Pandas. I had watched several talks and keynotes about it, but had never tried it myself. The tutorial gave me that chance, and afterwards both the audience and I were quite excited about the potential and power this library offers.

This post starts a series of articles that I will write about recommender systems, and also serves as an introduction to the refreshed library I am working on: Crab, a Python library for building recommender systems. :)

This first post covers non-personalized recommender systems, with several examples using the Python package Pandas. In the future I will also post an alternative version of this article referencing Crab, showing how the same ideas work with it.

But first let's introduce what Pandas is.

Introduction to Pandas


Pandas is a data analysis library for Python that is great for data preparation, joining and ultimately generating well-formed, tabular data that's easy to use in a variety of visualization tools or (as we will see here) machine learning applications. For further introduction about pandas, check this website or this notebook.
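As a quick taste, here is a minimal sketch of what working with Pandas looks like (the tiny ratings table below is made up just for illustration):

    import pandas as pd

    # A small, made-up table of ratings.
    ratings = pd.DataFrame({
        'item': ['Book A', 'Book A', 'Book B'],
        'user': ['alice', 'bob', 'alice'],
        'rating': [5, 4, 3],
    })

    # Group by item and compute the average rating of each one.
    print(ratings.groupby('item')['rating'].mean())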

Non-personalized Recommenders


Non-personalized recommenders recommend items to consumers based on what other consumers have said about the items on average. That is, the recommendations are independent of the customer, so every customer gets the same recommendation. For example, if you go to amazon.com as an anonymous user, it shows items that are currently being viewed by other members.

Generally these recommendations come in two flavours: predictions and recommendations. Predictions are simple statements presented in the form of scores, stars, or counts. Recommendations, on the other hand, are generally just a list of items shown without any number associated with them.

Let's walk through an example:

Simple Prediction using Average

On a scale of 1 to 5, the book Programming Collective Intelligence scored 4.5 stars out of 5. This is an example of a simple prediction: it displays the average of the other customers' reviews of the book. The math behind it is quite simple:

Score = (65 * 5 + 18 * 4 + 7 * 3 + 4 * 2 + 2 * 1) / (65 + 18 + 7 + 4 + 2)
Score = 428 / 96
Score = 4.46 ≈ 4.5 out of 5 stars

On the same page, Amazon also displays the other books that customers bought after buying Programming Collective Intelligence: a list of recommended books presented to anyone who visits the product page. This is an example of a recommendation.




But how does Amazon come up with those recommendations? There are several techniques that could be applied. One is association rule mining, a data mining technique that generates a set of rules describing combinations of items that were bought together. Another is a simple score based on the proportion of customers who bought both X and Y among those who bought X. Let's explain with some math:




Let X be the book Programming Collective Intelligence, and let Y range over the other books its customers purchased. You compute the ratio given below for each book Y, sort the books in descending order, and finally pick the top K books to show as related. :D

Score(X, Y) =  Total Customers who purchased X and Y / Total Customers who purchased X
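As a sketch of how this could be computed with Pandas, consider a toy purchase matrix (made up to match the numbers used later in this post), where 1 means the customer bought the book:

    import pandas as pd

    # Toy purchase matrix: rows are customers, columns are books.
    purchases = pd.DataFrame({
        'Programming Collective Intelligence': [1, 1, 0, 0, 0],
        'Python for Data Analysis':            [1, 1, 0, 1, 0],
        'Startup Playbook':                    [1, 1, 1, 1, 1],
    }, index=['c1', 'c2', 'c3', 'c4', 'c5'])

    x = 'Programming Collective Intelligence'
    bought_x = purchases[purchases[x] == 1]

    # Score(X, Y) = customers who bought X and Y / customers who bought X.
    scores = bought_x.drop(columns=x).mean().sort_values(ascending=False)
    print(scores)  # both remaining books score 1.0, i.e. 100%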


Applying this simple score function to all the books, you get:


Python for Data Analysis             100%
Startup Playbook                     100%
MongoDB: The Definitive Guide          0%
Machine Learning for Hackers           0%


As we imagined, the book Python for Data Analysis makes perfect sense. But why did Startup Playbook come out on top when it is also heavily purchased by customers who have not purchased Programming Collective Intelligence? This is a famous trap in e-commerce applications, the so-called banana trap. Let's explain: in a grocery store, most customers will buy bananas. If someone buys a razor and a banana, you cannot conclude that the purchase of the razor influenced the purchase of the banana. Hence we need to adjust the math to handle this case as well. Here is the modified version:

Score(X, Y) = (Total customers who purchased X and Y / Total customers who purchased X) /
              (Total customers who did not purchase X but purchased Y / Total customers who did not purchase X)

Substituting the numbers, we get:

Python for Data Analysis = (2 / 2) / (1 / 3) = 1 / (1/3) = 3

Startup Playbook = (2 / 2) / (3 / 3) = 1

The denominator acts as a normalizer, and you can see that Python for Data Analysis now clearly stands out. Interesting, isn't it?
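Continuing the toy sketch from above, the adjusted score is just one more division in Pandas:

    # Reusing purchases, x and bought_x from the previous sketch.
    not_x = purchases[purchases[x] == 0]

    # Score(X, Y) = P(Y | bought X) / P(Y | did not buy X).
    adjusted = (bought_x.drop(columns=x).mean()
                / not_x.drop(columns=x).mean()).sort_values(ascending=False)
    print(adjusted)  # Python for Data Analysis: 3.0, Startup Playbook: 1.0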

In the next article I will work more with non-personalized recommenders, presenting some ranking algorithms that I developed at Atepassar.com for ranking professors. :)

Examples with a real dataset (let's play with the CourseTalk dataset)

To present non-personalized recommenders, let's play with some data. I decided to crawl data from CourseTalk, a popular ranking site for MOOCs. It is an aggregator of several MOOCs where people can rate the courses and write reviews. The dataset is a snapshot from 10/11/2013 and is used here for study purposes only.



Let's use Pandas to read all the data, show what we can do with Python, and present a list of the top courses ranked by some non-personalized metrics. :)
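As a starting point, a sketch like the one below is enough to load the reviews and rank courses by average rating (the file name and column names here are assumptions for illustration; see the notebook linked below for the actual code):

    import pandas as pd

    # File and column names are assumptions; the real dataset may differ.
    reviews = pd.read_csv('coursetalk_reviews.csv')

    # Average rating and number of ratings per course.
    stats = reviews.groupby('course')['rating'].agg(['mean', 'count'])

    # Rank by average rating, keeping only courses with enough reviews.
    top = stats[stats['count'] >= 20].sort_values('mean', ascending=False)
    print(top.head(10))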

Update: For a better reading experience, I hosted all the code in an IPython Notebook, available at the following link via nbviewer.

The full dataset and source code will be provided in Crab's GitHub repository; the idea is to keep working on those notebooks towards a future book about recommender systems. :)

I hope you enjoyed this article, and stay tuned for the next one about another type of non-personalized recommender: ranking algorithms for vote-up/vote-down systems!

Special thanks to Diego Manillof for the tutorial. :)

Cheers,

Marcel Caraciolo

Review of the book Learning IPython for Interactive Computing and Data Visualization

Monday, June 17, 2013



Hi all,

I was invited to review a copy of the recently released book "Learning IPython for Interactive Computing and Data Visualization" by Cyrille Rossant. The book focuses on one of the best tools for working with Python: the enhanced interactive shell IPython. It was about time the tool received a dedicated book.


Learning IPython for Interactive Computing and Data Visualization

IPython is covered across six chapters using several basic examples related to scientific computing, along with other Python tools such as Matplotlib, NumPy, Pandas, etc. The first chapters explore IPython basics such as installation and the basic commands needed to get used to the tool.

The next chapters introduce NumPy and Pandas basics from within the IPython shell. Don't expect advanced examples with those tools; the idea is a simple demonstration of what can be done in IPython.

There is a chapter discussing data visualization with graphs and plotting in the IPython Notebook. However, I missed more details about the IPython Notebook; the chapter lacks deeper examples on the topic.

I really liked chapter 5, which shows some basics of MPI (Message Passing Interface), although the topic wasn't explored in depth. Still, the introduction points to great potential for more advanced books about IPython.

The last chapter shows how to create plugins for IPython, for instance a simple extension that introduces a new cell magic (write C++ code directly in a cell, and it will be automatically compiled and executed).
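To give a flavour of what that chapter covers, here is a minimal sketch of a custom cell magic (a trivial line-counting magic of my own, not the book's C++ example):

    from IPython.core.magic import register_cell_magic

    @register_cell_magic
    def countlines(line, cell):
        # A toy cell magic: %%countlines prints how many lines the cell has.
        print(len(cell.splitlines()))

Inside a running IPython session, starting a cell with %%countlines would then print the number of lines in that cell.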

My conclusion is that the book achieves its stated goal: a technical introduction to IPython. If you want a book that explores scientific computing in depth or advanced IPython topics such as MPI or the IPython Notebook, this is not that book yet. I can say this book is a good first step towards many more topics about the use of IPython, and I recommend it as a reference for starting to explore IPython! :D


Regards,

Marcel

Slides from Keynotes at VII PythonBrasil

Monday, October 3, 2011

Hi all,

I'd like to share the slides of the talks I gave at the VII PythonBrasil, the Brazilian Python users meeting that happens once a year. This year I had the opportunity to give two talks: one about open-source communities and the experience with the local community of Pernambuco, the Python User Group of Pernambuco (PUG-PE), and one about the framework I am currently working on: Crab - A Python Framework for Building Recommender Systems.

It was an amazing event with lots of great keynotes and opportunities to meet people and make new friends. I also had the opportunity to give two lightning talks: one about JobLib, the pipeline toolkit for scientific computations, and one about IPython.

Below are the slides:



                           The JobLib slides for download.



I'd like also to announce the launch of the new home page of the Crab project, with a reformulated design. It is still in development, with lots of work to do, but it's coming! The first release, 0.1, will be launched by the second week of October.

Crab's new home page


Thanks for the feedback from all the developers at PythonBrasil, and I look forward to new contributors on the project.

Regards,

Marcel Caraciolo

Scipy Conference 2011 and my participation!

Tuesday, July 19, 2011

Hi all,

Last week I was at the Scipy 2011 Conference in Austin, TX - my first international conference and also my first international talk! The Scipy Conference is an annual meeting for scientific computing developers and researchers who use Python scientific packages in their research or work. It was a great opportunity to meet new Python developers, find out what's happening in scientific Python nowadays, and learn about Scipy, Numpy and Matplotlib, considered the standard libraries for developers who want to start working in the scientific world.




On the first day of the conference, I had the opportunity to learn more about Numpy, a widely used library for numerical computation in Python, and also about Scikit-learn, a great open-source toolkit for machine learning developers written in Python, Numpy and Scipy.

You can access both tutorials at the Scipy Conference tutorials webpage. Numpy is an amazing library, and I have already started applying what I learned to Crab, the library for building recommender systems that I am currently working on. Scikit-learn is also an interesting framework, written with Scipy, Numpy and Matplotlib, offering several machine learning techniques; one of its main features is an easy-to-use interface with lots of examples and tutorials for machine learning beginners. It works so smoothly that I decided to use it as a dependency of the Crab framework.

The second day started with more advanced tutorials, especially one on Global Arrays with Numpy for high-performance computation. It is quite a powerful effort, and I believe it will soon be added to the Numpy core.

Another tutorial was an introduction to Traits, Matplotlib and Chaco - great tools for creating nice user interfaces and plotting charts. One of the best parts of this tutorial was how easy it is to create nice interfaces and animated plots with just a few lines of code. Take a look at what you can do here, or even see a real-time animated plot with Matplotlib.
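For reference, a real-time animated plot in Matplotlib takes only a few lines; here is a minimal sine-wave sketch (my own example, not the tutorial's):

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.animation import FuncAnimation

    fig, ax = plt.subplots()
    x = np.linspace(0, 2 * np.pi, 200)
    line, = ax.plot(x, np.sin(x))

    def update(frame):
        # Shift the sine wave a little on every frame.
        line.set_ydata(np.sin(x + frame / 10.0))
        return line,

    anim = FuncAnimation(fig, update, frames=200, interval=50, blit=True)
    plt.show()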








Traits and Chaco are part of the EPD package developed by the company Enthought, one of whose co-founders is also one of the main developers and founders of Numpy! Yeah :D These frameworks let you easily create nice interfaces using only model concepts. If you want to learn more, please check out the tutorials and the official website for how to download, install and use them.


Another interesting keynote was about IPython, the enhanced shell for scientific Python developers. What amazed me was the demonstration of Matplotlib charts embedded directly in the shell instead of opening in a new window! The work around IPython has been fantastic, with several features for Python developers. I highly recommend it!




The rest of the conference was dedicated to keynotes and talks about current work in data science, core technologies and data mining with Python, Scipy, Numpy and related libraries. I had the opportunity to give the talk Crab - A Python Framework for Building Recommender Systems, written by me, Bruno Melo and Ricardo Caspirro, currently the main contributors to this work. The idea is to give Python developers a recommender toolkit so they can easily create, test and deploy recommender engines, with simple interfaces written with the scientific Python packages Numpy, Scipy and Matplotlib.




You can check out my slides at the Scipy Conference here.


The project is currently being developed by the non-profit organization called Muriçoca, which we decided to create to manage and develop the Crab framework.


One of the best keynotes was the presentation by Hilary Mason, the data scientist at bit.ly. She gave a funny talk about her work and the current challenges of handling large data sets and the huge volume of URL shortening happening at the backend of bit.ly. It is quite amazing how much data there is and how much useful information can be extracted from it.

Finally, I also gave a lightning talk: Mining the Scipy Lectures, a simple talk showing what you can do with the data from the Scipy Conference schedule. I used some NLP techniques and clustered the lectures by their most frequent topics, based on the keywords from the talk titles, to see how the lectures at Scipy were distributed (I used the K-means algorithm for the clustering). To visualize the generated clusters in 3D, I used the graph visualization tool Ubigraph, as sketched below.
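A rough sketch of that clustering pipeline with scikit-learn (the talk titles below are placeholders; the real input was the conference schedule):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Placeholder titles standing in for the real Scipy schedule.
    titles = [
        'Numpy for high performance computing',
        'Machine learning with scikit-learn in Python',
        'Interactive plotting with Matplotlib and Chaco',
        'Building recommender systems in Python',
    ]

    # Turn titles into TF-IDF keyword vectors and cluster them with K-means.
    vectors = TfidfVectorizer(stop_words='english').fit_transform(titles)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
    print(labels)  # cluster id assigned to each title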




The slides are also available here and the source code here.

3D Lectures Clusters


Soon I will release the PDF of the submitted article, as well as the videos of both talks I presented. It was an amazing conference in Austin - I made new friends and lots of new partners! :D I definitely expect to be there next year! One of my goals this year is also to prepare a scientific computing course using Python (it will include Matplotlib, Scipy and Numpy); wait for more information soon here at the blog!

Cheers,

Marcel Caraciolo