
Friday, February 24, 2012

Guide to Recommender Systems Book Online



Hi all,

This year one of my goals is to write a book that serves as a guide to recommender systems for programmers. I know there are several textbooks that focus on providing a theoretical foundation for recommender systems and, as a result, may seem difficult to understand. For programmers who want to learn how to start using a recommender system or to understand its components, this book is what they are looking for.

This guide follows a learn-by-doing approach: I will introduce the theory and apply it through exercises and experiments with Python code.  I hope that when you complete the book you will understand how to build a recommender system and will have taken the first steps toward applying one in your own systems. The textbook is laid out as a series of small steps that will guide you toward understanding recommender system techniques.

This book is available for free download under a Creative Commons license. The project is also led by my colleague Ricardo Caspirro, who will review it and translate it into Portuguese.

Below I provide the table of contents of the book.


Guide to Recommender Systems


The link for the online guide is available here.


http://muricoca.github.com/recommendation-lectures/index.html


Table of Contents


Chapter 01: Introduction to Recommender Systems

Find out what a recommender system is and what problems it solves, with a quick review of what you will be able to do when you finish this book.

Chapter 02: Collaborative Filtering

This chapter focuses on how you can use state-of-the-art collaborative filtering techniques, which make automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from similar users (user-based) or similar items (item-based).


Chapter 03: Content Based Filtering
This chapter covers recommender systems that suggest an item to a user based upon a description of the item and a profile of the user's interests. Although the details of various systems differ, content-based recommendation systems share a means for describing the items that may be recommended, a means for creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend.


Chapter 04: Hybrid Based Filtering
This chapter will focus on how to pick the best features of collaborative and content-based filtering and mix them to build hybrid recommender systems. It will present the current work in this field, an example of how it works, and how you can decide on the best strategy.

Chapter 05:  Model-Based Recommenders
This chapter will cover techniques including memory-based techniques and data mining techniques such as association analysis, symbolic data analysis and classification/clustering.

Chapter 06:  Evaluation of Recommender Systems
This chapter starts with a short description of how to evaluate recommender systems and the metrics commonly used to compare recommender algorithms in the development and deployment stages.

Chapter 07:  Recommender Systems and Distributed-Computing
Recommender systems suffer from sparse matrices, where the user x item preference matrix is sparse (lots of missing values, i.e. missing preferences). This comes with large datasets of millions of items, users and preferences.  For this task, distributed computing techniques such as map-reduce can be used to distribute the computation of recommendations. This chapter will cover those topics.

Chapter 08:  Case Study
It will present a case study of a mobile recommender system that recommends users to other users, using several of the techniques shown above, and how we tested and deployed it.

Chapter 09:  Recommender Systems the Next Generation
This chapter presents the next generation of recommender systems, describing what research is pursuing in several fields such as ubiquity, semantics, etc.

Chapter 10:  Meeting Python-RecSys Framework
It will present the Python-RecSys framework for building recommender systems with Python in an easy way. It will describe how to build or test already implemented techniques or develop new ones, and how to deploy them with web and REST frameworks.


This book is under development, so please let me know if you have any suggestions or corrections for any of those chapters. If you think a topic needs an extra chapter, or that I am missing a topic, please also let me know in the comments.


I hope you enjoy this work, especially the developers!


Regards,

Marcel Caraciolo

Thursday, January 12, 2012

The Anatomy of Product Recommendations - Infographic - Tips and Best Practices

Hi all,

I'd like to share a post that I recently found on the blog of a product called LiftSuggest. It focuses on recommender engines for e-commerce and gives several tips on how to design the product recommender on your website for maximum clicks and increased sales.

They posted an infographic about the anatomy of a product recommender engine. The tips they give are quite important when you are designing a recommender system for your e-commerce site.


Product Recommender Engine

For further information, please visit their blog.


Regards,

Marcel Caraciolo

Wednesday, January 4, 2012

Deepjewel - Social Media powering Recommendations

Hi all,

Happy new year!  My first post this year will be about an idea that I had with my friend Ricardo Caspirro about the next generation of social recommenders in commerce applications and retail stores. What excites me is that this idea came from a conversation that we had in 2009, and since then we have discussed more about what the "Deepjewel" would be.

Deepjewel is a giant knowledge base that encapsulates interesting entities and relationships of the social world on the web.  The social world in this context means all the millions and billions of tweets, Facebook messages, profiles, relationships, blog posts, YouTube videos, and more: a living organism in itself, constantly evolving.


The Deepjewel

But what motivated us to create the Deepjewel?  One of the main problems that we face nowadays is the discovery of content and items that interest us.  Often, to find a book or a movie that we like, we have to search several websites and social networks across the web.  There isn't a tool that connects items from many domains in an organized and structured way that is also easy to access. Those objects are spread over the web, and the recommendations reach us through comments on social networks, the results of machine learning techniques, or queries on several web pages or search engines.  The problem becomes worse when we don't know that a certain item exists at all, in which case we might never find an item that would have interested us.

Social media is huge, and we need tools that perform a deep analysis of all this data, pick out the items that interest us, especially based on the historical data from our presence on the web (with our permission, of course), and bring us relevant items and products without losing the discovery process associated with serendipity.  One of the solutions is a powerful recommender engine fed by this Social Genome.



Hybrid Social Context-Aware Recommender RecDay


This hybrid social context-aware recommender (which we call RecDay) is an engine composed of several modular components, in which we employ a broad range of semantic analysis techniques, including information extraction and integration, natural language processing and machine learning. The main task of this recommender is to analyze information about a user's posts, bios and relationships/lists collected from the Social Genome and summarize it in profiles, which we call DNA (all this data represents the interests of the user). The profiles built by RecDay would infer the possible interests of the users and would serve as the basis for personalized recommendations of products and services from retail stores and e-commerce applications.


A perfect example of this process, which we call the translation, is when you mention Apple products (such as macbook, ipod, iphone, etc.) several times in your tweets. Even if you never use the word "Apple", we can use the Social Genome to detect the products and infer that you are interested in Apple products. The following figure illustrates certain kinds of entities and relationships collected in the Social Genome:


The relationships extracted from the social web data

The second step of this engine is to build the user profile. Unlike other approaches, which use only the content or the user's historical reviews and ratings, RecDay would go further and analyze the temporal context of the user's interests. Several reports on consumer behavior show that a user's desires are influenced by external factors and even by the person's mood or feelings at a given moment. It is necessary to collect the user's social data unobtrusively (with the user's permission, of course) and build a personality defined by several dimensions. Those dimensions represent the current state of the user, which may define what kind of suggestions they would like to receive at that particular moment. For example, if you are happy today because you got a new job and posted about it on Facebook, or even updated your profile with the new position, that would be valuable information for your DNA profile in order to recommend products and services to celebrate the occasion (you are happy and excited, don't you think?).

Another important component in this process is the product side. We need to extract more information from the stores' product portfolios. Items must be mined to extract all of their available metadata. Imagine the movie Batman, for which we have details about the year, genre, cast, production, direction, synopsis, etc. All this data can be used to build the DNA Item, expressed as a collection of dimensions that represent the item profile. With those profiles (DNA User and DNA Item) we compute the similarity between the user and the items in order to produce a list of top recommendations and related explanations.
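To make the profile matching concrete, here is a minimal sketch that scores items against a user with cosine similarity. The post does not say which similarity measure is used, so cosine similarity is an assumption, and the dimension names, weights and item list are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse profiles stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile matches nothing
    return dot / (norm_a * norm_b)


def recommend(user_dna, item_dnas, top_n=2):
    """Rank items by similarity to the user profile, best first."""
    scored = [(cosine_similarity(user_dna, dna), name)
              for name, dna in item_dnas.items()]
    scored.sort(reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:top_n]]


# Hypothetical profiles: each key is one dimension of the DNA.
user_dna = {"action": 0.9, "superhero": 0.8, "romance": 0.1}
item_dnas = {
    "Batman": {"action": 1.0, "superhero": 1.0},
    "Titanic": {"romance": 1.0, "drama": 0.8},
    "Spider-Man": {"action": 0.7, "superhero": 1.0, "romance": 0.3},
}

top = recommend(user_dna, item_dnas)
print(top)
```

The dimensions shared by the user and the top item (here "action" and "superhero") are a natural starting point for the explanations that accompany each recommendation.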


The Social Architecture of the Recommender


The final result can be shown in several mediums: mobile apps, widgets, the web, APIs, plugins, etc. Usability and the way you present all this information to a particular user are important. That's why the user interface must be simple and easy to navigate, and must have mechanisms to collect the user's feedback on the suggestions given by the engine. This process is cyclic: when you give feedback (a like, a dislike or a comment about a suggestion), this piece of information is processed and used to strengthen your DNA profile.

A particular medium of the recommendations: iPad Demo

To tackle all these interesting technical challenges, we needed to start developing our in-house solution called Crab, which processes all this data and employs several analysis and filtering techniques to deal with the peculiarities of these heterogeneous data sources. The first step is the launch of the Deepjewel Labs. Deepjewel is based on the principle that we can mine the wealth in data by identifying interesting entities and relationships and converting them into valuable information as input to the process of recommending items and services.

In summary, all those human and computational techniques can be used to perform a deep semantic analysis of web and social data. The result for a commerce application or retail store is the ability to offer, in a personalized way, what users want before they know that they want it. RecDay would be able to offer relevant products and services to customers daily, even products they did not know existed. This is a new way to shop in which you don't have to go and find products, services and information; the machine will help them find their way to you.

To learn more about Deepjewel Labs, visit the website (currently in Portuguese):


I hope you enjoyed it,

Marcel Caraciolo

Wednesday, December 21, 2011

Playing with Foursquare API with Python

Hi all,

I'd like to share a project I am developing that may be useful for anyone who wants to create datasets from mobile location networks.  Specifically, I developed a Python wrapper for accessing the Foursquare API, called PyFoursquare.

For anyone who doesn't know Foursquare, it is a popular mobile social location network with more than 10,000,000 users around the world. The idea is that you share your current location with your friends and, as a result, discover new places, find out where your friends are and even check tips and recommendations about a place and what to do when you arrive there. It is an amazing project with lots of data available for anyone who wants to develop new apps that connect to it or mine its data!

Foursquare Mobile Application

This Python API is one of the results of my master's degree project, where I proposed a new architecture for mobile recommenders that fetches reviews from social networks to improve the explanation and the quality of the given recommendations.  I used this library to collect tips (text reviews) from Foursquare for places in my area, Recife, Brazil.  The API was a little messy, so I decided to clean it up, organize it and document it in order to publish it for the open-source community.

One of the advantages of this API is that you can handle each entity from the Foursquare data as a Model object. So instead of dealing with JSON dictionaries, I encapsulate the results in the respective models (Venue, Tips, User, etc.) and you access their attributes as on any ordinary Python object!
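The idea of wrapping JSON results in model objects can be sketched in a few lines. This is only an illustration of the pattern and not PyFoursquare's actual code: the Model and Venue classes below, the parse method and the sample payload are all hypothetical.

```python
import json


class Model:
    """Hypothetical base class: exposes the keys of a JSON dict as attributes."""

    def __init__(self, data):
        for key, value in data.items():
            setattr(self, key, value)

    @classmethod
    def parse(cls, raw_json):
        """Build a model instance from a raw JSON string."""
        return cls(json.loads(raw_json))


class Venue(Model):
    def __repr__(self):
        return "<Venue %s>" % getattr(self, "name", "?")


# Made-up payload, only loosely shaped like a venue response.
raw = '{"id": "abc123", "name": "Burburinho", "location": {"city": "Recife"}}'
venue = Venue.parse(raw)

print(venue.name)              # attribute access instead of venue["name"]
print(venue.location["city"])  # nested values stay plain dicts in this sketch
```

A fuller implementation would probably also convert nested dictionaries (such as the location) into models of their own.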

I was inspired by the work of Joshua on Tweepy, which is a Python library for Twitter.  In this release, 0.0.1, I only implemented some API methods such as search/venues, venue_details and venue_tips.  In future releases I intend to add more models and support for more of the API methods available at Foursquare.

How can you use it in your project?

It is simple! Just download it from the project's home on Github, extract the source from the tar.gz and, in the project directory, run the command below:

$ python setup.py install

An easier way is to install it with easy_install:

$ easy_install pyfoursquare


After that, you can simply test the installation by running the command below in your Python shell:

>>> import pyfoursquare


Now let's see how you can get started with PyFoursquare:

First you need to create an application at Foursquare. The link is this.  There you can also get further information about the API, other libraries and several applications using the Foursquare APIs.

The Foursquare Developer's Settings


After creating your application, you must get your client_id and your client_secret. Those keys will be important to connect the app to the users' accounts.  Foursquare uses secure authentication based on OAuth2.  With the PyFoursquare API, you won't need to handle all the steps required by OAuth2.  It already encapsulates all the steps and handshakes between your app and the Foursquare servers. \m/

Below is the code you must write to authenticate a user and connect them to your app:




After the user has authorized your app, you can instantiate the PyFoursquare API.  It will give you access to the Foursquare API methods.  I implemented several methods, but feel free to add new ones! Don't forget to submit the final results as pull requests to the project's repository on Github.

In this example I fetched a venue by giving the latitude and longitude as input and querying for the place with the name 'Burburinho'.  Burburinho is a popular bar near where I work!

Source code




Now you can access the result and handle the Venue as a Python object. All elements of the venue are represented as attributes of the Venue object in PyFoursquare. The goal is to make the developer's life easier when accessing the Foursquare API by parsing all the JSON (the result) and placing it in the correct model.



I hope you enjoy this API. Feel free to use it in your applications or research!  I'd like to thank the Foursquare team for exposing their data by providing those APIs!  For data mining researchers interested in mobile location data, it is a gold mine!

You can find further information about PyFoursquare here.

Feel free to send suggestions, improvements and comments,

Regards,

Marcel Caraciolo

Monday, December 19, 2011

Machine Learning with Python: Meeting TF-IDF for Text Mining

Hi all,

This month I was studying information retrieval and text mining, especially how to convert the textual representation of information into a Vector Space Model (VSM).  The VSM is an algebraic model representing the importance of a term (tf-idf), or even just its absence or presence (Bag of Words), in a document. I'd like to mention the excellent post by the researcher Christian Perone on his blog Pyevolve about machine learning and text mining with TF-IDF, a great post to read.

I decided to keep this post shorter and give some examples using Python. I hope that by the end of this post you will feel comfortable using tf-idf in your text mining tasks.

By the way, I highly recommend that you check out the scikit.learn machine learning toolkit. It has a whole package for working with text classification, including TF-IDF, in Python!


What is TF-IDF?

Term Frequency - Inverse Document Frequency is a weighting scheme that is commonly used in information retrieval tasks. The goal is to model each document into a vector space, ignoring the exact ordering of the words in the document while retaining information about the occurrences of each word.

It is composed of two terms. The first is the normalized Term Frequency, which is the number of times a word appears in a document divided by the total number of words in that document. The second is the Inverse Document Frequency, which is computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term t_i appears. Or, in symbols:



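tf(t_i, d) = \frac{n_{i,d}}{\sum_k n_{k,d}}

and

idf(t_i) = \log \frac{|D|}{|\{d \in D : t_i \in d\}|}

where n_{i,d} is the number of times the term t_i appears in document d and |D| is the number of documents in the corpus.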




The TF-IDF tells how important a word is to a document in a collection, since it takes into consideration not only the isolated term but also the term within the document collection. The intuition is that a term that occurs frequently in many documents is not a good discriminator (why emphasize a term that is present in almost the entire corpus?). So it scales down the frequent terms while scaling up the rare terms; for instance, a term that occurs 10 times more often than another isn't 10 times more important than it.

To compute the TF-IDF weights for each document in the corpus, a series of steps is required: 1) tokenize the corpus, 2) model the vector space, and 3) compute the TF-IDF weight for each document in the corpus.

Let's go through each step:


Tokenization


First we need to tokenize the text. To do this, we can use the NLTK library, which is a collection of natural language processing algorithms written in Python. Tokenizing the documents in the corpus is a two-step process: first the text is split into sentences, and then the sentences are split into individual words. It is important to notice that several words are not relevant: terms like "the, is, at, on", etc. aren't going to help us, so we ignore them during information extraction. Those words are commonly called stop words; they are present in almost all documents, so they carry little information for us. Portuguese also has stop words, such as (a, as, os, um, umas, que, etc.).
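The two-step tokenization can be sketched with the standard library alone. This is a simplified stand-in for the NLTK tokenizers, not a reproduction of them: the regular expressions and the tiny stop word set are assumptions made for illustration.

```python
import re

# A tiny illustrative stop word set; real lists are much longer.
STOP_WORDS = {"the", "is", "at", "on", "a", "an", "and", "of"}


def split_sentences(text):
    """Step 1: split the text into sentences on ., ! and ?."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(text):
    """Step 2: split each sentence into lowercase words, dropping stop words."""
    tokens = []
    for sentence in split_sentences(text):
        for word in re.findall(r"\w+", sentence.lower()):
            if word not in STOP_WORDS:
                tokens.append(word)
    return tokens


tokens = tokenize("The coffee is great. The cheese bun is great too!")
print(tokens)
```

For Portuguese reviews the same function works once STOP_WORDS is replaced with a Portuguese list.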

Consider our example below:


We will tokenize this collection of documents and represent them as vectors (rows) of a matrix with |D| x F shape, where |D| is the cardinality of the document space, i.e. how many documents we have, and F is the number of features, which in our example is the vocabulary size.

So the matrix representation of our vectors above is:



As you may have noticed, these matrices representing the term frequencies (tf) tend to be very sparse (lots of zero elements), so you will usually see them represented as sparse matrices. The code shown below will tokenize each document in the corpus and compute the term frequencies.
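As a minimal sketch of this step, the snippet below builds the |D| x F matrix of raw term counts from documents that are already tokenized. The two toy documents are hypothetical, and a dense list of lists is used for readability even though a sparse structure would be preferable on a real corpus.

```python
from collections import Counter

# Hypothetical corpus: each document is already a list of tokens.
docs = [
    ["coffee", "great", "coffee"],
    ["cheese", "bun", "great"],
]

# The vocabulary defines the F columns of the matrix, in a fixed order.
vocabulary = sorted({term for doc in docs for term in doc})


def term_count_matrix(docs, vocabulary):
    """Return one row of raw term counts per document."""
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # a missing term counts as 0
        matrix.append([counts[term] for term in vocabulary])
    return matrix


tf_matrix = term_count_matrix(docs, vocabulary)
print(vocabulary)
for row in tf_matrix:
    print(row)
```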



Model the Vector Space

Now that each of the documents in the corpus has been tokenized, the next step is to compute the document frequency, that is, for each term, how many documents that term appears in. Before going to the IDF, it is important to normalize the term frequencies. Why?  Imagine a document that repeats a term on purpose to improve its ranking in an information retrieval system; raw counts also create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document. By normalizing the TF vector we can overcome this problem.
The code.
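A minimal sketch of the normalization, assuming the definition given earlier in the post (each count divided by the total number of words in the document). Other normalizations, such as dividing by the Euclidean norm of the row, are also common; the sample rows are hypothetical.

```python
def normalize_tf(row):
    """Divide each raw count by the document length (sum of the row)."""
    total = sum(row)
    if total == 0:
        return [0.0] * len(row)  # empty document: nothing to normalize
    return [count / total for count in row]


short_doc = [0, 0, 2, 2]   # 4 words in total
long_doc = [0, 0, 20, 20]  # same proportions, 10 times longer

print(normalize_tf(short_doc))
print(normalize_tf(long_doc))
```

Both rows normalize to the same vector, so the longer document no longer looks more important just because its counts are higher.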



Compute the TF-IDF

Now that you have seen how the vector normalization is applied, we have to compute the second term of tf-idf: the inverse document frequency. The code is provided below:
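Here is a minimal sketch of the IDF computation, following the definition given earlier (the logarithm of the number of documents divided by the number of documents containing the term). The natural logarithm and the toy corpus are assumptions; libraries often use smoothed variants that give slightly different values.

```python
import math


def inverse_document_frequencies(docs):
    """Map each term to log(N / df), where df counts documents containing it."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical tokenized corpus.
docs = [
    ["coffee", "great"],
    ["cheese", "great"],
    ["coffee", "bun", "great"],
]

idf = inverse_document_frequencies(docs)
for term in sorted(idf):
    print(term, round(idf[term], 4))
```

A term that appears in every document ("great" here) gets an IDF of zero, while a term that appears in a single document gets the highest value.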




The TF-IDF is the product of the TF and the IDF.  So a high tf-idf weight is reached when you have a high term frequency (tf) in the given document and a low document frequency of the term in the whole collection. Now let's see the tf-idf computed for each term present in the vector space.

The code.
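As a minimal end-to-end sketch, the function below multiplies the normalized term frequency by the inverse document frequency for every term of every document. It assumes the unsmoothed definitions from this post and a hypothetical toy corpus, so the numbers will not match those produced by libraries that apply smoothing or a different normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized corpus.
docs = [
    ["chocolate", "great"],
    ["cheese", "bun", "great"],
    ["chocolate", "chocolate", "great"],
]

weights = tf_idf(docs)
for row in weights:
    print({term: round(w, 4) for term, w in row.items()})
```

Only the non-zero entries are stored, which matches the sparse view of the matrix: "great" appears in every document and ends up with a weight of zero everywhere.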



Putting everything together, the following code will compute the TF-IDF weights for each document. The resulting matrix will be:




A row of this matrix would be:



I omitted the zero-valued elements of the row.

If we wanted to check the most relevant words for this place, by using the tf-idf we could see that the place has a nice hot chocolate drink (0.420955 - chocolate quente ótimo), that the soft drink nega maluca is also delicious (0.315716 - nega maluca uma delicia), and that its cheese bun is also quite good (0.252573 - pao de queijo muito bom).

And that is how we compute our M_{tf\mbox{-}idf} matrix.  You can take a look at this link and this one to learn how to use it with GenSim and Scikit.Learn, respectively.

That's all. I hope that you enjoyed this article and that it helps more people learn how to implement the tf-idf weight to mine their collections of texts.  Feel free to comment and make suggestions.

The source code of this example is also available.

Regards,

Marcel Caraciolo