Collaborative Filtering: Implementation with Python! [Part 02]

Friday, November 13, 2009

Hi all,

In my last article I talked about one of the information filtering (IF) techniques used to make recommendations: user-based collaborative filtering. A recommendation system based on this kind of collaborative filtering requires the ratings of every user in order to build its data set. Obviously, it works well with a small number of items and users, but the real world is not so benevolent: in web stores like Amazon, where there are millions of users and products, it would be impracticable to compare each user against every other user, and then against every item each user has rated. It would be extremely slow.

To overcome this limitation, one approach proposed in the literature is item-based collaborative filtering. When a big data set with millions of user profiles with rated items is available, this technique may show better results, and it allows the computation to be done in advance, so that a user demanding recommendations obtains them quickly. This article presents this filtering technique in action, with some implementations in the Python programming language.

The procedure for item-based filtering is similar to the user-based collaborative filtering I discussed earlier. The basic idea is to pre-compute the most similar items for each item. When a user asks for recommendations, the system examines the items with the greatest scores and builds a ranked list by calculating a weighted average of the items similar to those in the user's profile. The main difference between the two approaches is that in the item-based one the comparisons between items will not change as frequently as the comparisons between users in the user-based technique. This means it is not necessary to re-evaluate the similarity between the items every time; it can be done at times when the website has low traffic, or on servers separate from the ones hosting the main application.
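To make the weighted average concrete, here is a toy calculation. The similarities and ratings below are invented for illustration; they are not taken from any real data set:

```python
# Toy example: predict a user's rating for an unseen item from
# items the user has already rated. All numbers are invented.

# (similarity to the unseen item, user's rating) for each rated item
rated = [(0.9, 4.5), (0.4, 3.0), (0.1, 1.0)]

# Weighted sum of ratings, each weighted by its similarity
weighted_sum = sum(sim * rating for sim, rating in rated)

# Normalize by the total similarity so the prediction stays
# on the original rating scale
total_sim = sum(sim for sim, rating in rated)

prediction = weighted_sum / total_sim
print(round(prediction, 2))  # 3.82
```

Items that are very similar to what the user already liked dominate the prediction, while barely similar items contribute little.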

To compare the items, you must define a function that builds a data set of similar items. Depending on the size of the data set, it will take some time to build. As I said earlier, you don't need to call this function every time you want a new recommendation. Instead, you build the data set once and reuse it on every request.
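One simple way to implement "build once, reuse afterwards" is to persist the similarity table on disk. This is just a sketch of an illustrative choice of mine; the file name, the `load_or_build` helper and the use of pickle are assumptions, not something prescribed in this article:

```python
# Persist the item-similarity table so it is built only once
# (e.g. during low-traffic hours) and reused on later requests.
# File name and helper name are illustrative.
import os
import pickle

def load_or_build(build_fn, path='item_similarities.pkl'):
    # Reuse the cached table if it already exists on disk
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    # Otherwise build it once and save it for the next requests
    table = build_fn()
    with open(path, 'wb') as f:
        pickle.dump(table, f)
    return table
```

Here `build_fn` would be whatever function computes the similarity table, such as the one defined next in this article.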

I implemented this in a function called calculateSimilarItems; it builds a dictionary (dict) that maps each item to the items it is most similar to.

#Create a dictionary of items showing which other items they are most similar to.

def calculateSimilarItems(prefs, n=10):
    result = {}
    #Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        #Status updates for large datasets
        c += 1
        if c % 100 == 0:
            print("%d / %d" % (c, len(itemPrefs)))
        #Find the n most similar items to this one
        scores = topMatches(itemPrefs, item, n=n, similarity=sim_distance)
        result[item] = scores
    return result

As you can see, I reused some of the implementations from the user-based collaborative filtering article (transformPrefs, topMatches and the similarity function). The difference now is that we iterate through the items and calculate the similarity between them. The result is a dictionary that maps each item to a list of its most similar items.
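For reference, here is a minimal sketch of the preference-matrix inversion that transformPrefs performs; the users, books and ratings below are invented for this example:

```python
# Sketch of transformPrefs: invert {user: {item: rating}}
# into {item: {user: rating}} so items can be compared the
# same way users were. Sample data is invented.

def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Swap item and person
            result[item][person] = prefs[person][item]
    return result

prefs = {
    'Alice': {'Book A': 4.0, 'Book B': 3.5},
    'Bob':   {'Book A': 2.0, 'Book C': 5.0},
}

itemPrefs = transformPrefs(prefs)
print(itemPrefs['Book A'])  # {'Alice': 4.0, 'Bob': 2.0}
```

After the inversion, the same similarity routines used to compare users can be applied unchanged to compare items.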

Now we are ready to make recommendations using the data set of item similarities, without needing to examine the whole data set. The function getRecommendedItems does all the logic; the basic idea behind it is to compare the items I haven't rated yet against the items I have already evaluated. As the result of this comparison you receive a prediction of the rating I would give to each item. You can see the implementation as follows:

def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores = {}
    totalSim = {}

    #Loop over items rated by this user
    for (item, rating) in userRatings.items():

        #Loop over items similar to this one
        for (similarity, item2) in itemMatch[item]:

            #Ignore if this user has already rated this item
            if item2 in userRatings:
                continue

            #Weighted sum of rating times similarity
            scores.setdefault(item2, 0)
            scores[item2] += similarity * rating

            #Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity

    #Divide each total score by total weighting to get an average
    rankings = [(score / totalSim[item], item)
                for item, score in scores.items()]

    #Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings
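To see the whole pipeline run end to end, here is a minimal self-contained sketch. The three-user data set is invented for illustration, and sim_distance, topMatches, transformPrefs and the two main functions are simplified stand-ins for the versions discussed in this and the previous article:

```python
# Minimal end-to-end sketch of item-based collaborative filtering.
# Users, books and ratings are invented; helpers are simplified
# versions of those from the user-based article.

def sim_distance(prefs, p1, p2):
    # Euclidean-distance-based similarity in (0, 1]
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[p1][it] - prefs[p2][it]) ** 2 for it in shared)
    return 1.0 / (1.0 + sum_sq)

def topMatches(prefs, person, n=5, similarity=sim_distance):
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)
    return scores[:n]

def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})[person] = prefs[person][item]
    return result

def calculateSimilarItems(prefs, n=10):
    itemPrefs = transformPrefs(prefs)
    return {item: topMatches(itemPrefs, item, n=n) for item in itemPrefs}

def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores, totalSim = {}, {}
    for item, rating in userRatings.items():
        for similarity, item2 in itemMatch[item]:
            if item2 in userRatings:
                continue  # skip items the user already rated
            scores[item2] = scores.get(item2, 0) + similarity * rating
            totalSim[item2] = totalSim.get(item2, 0) + similarity
    rankings = [(score / totalSim[item], item)
                for item, score in scores.items()]
    rankings.sort(reverse=True)
    return rankings

prefs = {
    'Ana':   {'Book A': 4.0, 'Book B': 3.0, 'Book C': 5.0},
    'Bruno': {'Book A': 4.0, 'Book B': 3.0, 'Book D': 4.0},
    'Carla': {'Book B': 2.0, 'Book C': 4.0, 'Book D': 3.0},
}

itemsim = calculateSimilarItems(prefs)
recommendations = getRecommendedItems(prefs, itemsim, 'Bruno')
print(recommendations)
```

Bruno has not rated 'Book C', so it comes back as the single recommendation, with a predicted rating of 3.9 in this toy data set.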

Let's use the same data set as in the last article, the book data set; the target is to recommend some books to users based on their preferences. To download the data set and learn how to load it into your application, see the article I wrote about user-based collaborative filtering.

As you can see, the result is a ranked list of tuples with (predicted_rating, book) as values. So you can see not only the books you would probably like and haven't read yet, but also an estimate of the rating you would give each of them. Companies all around the world are working on improving the rating predictions over their data sets; there was even a challenge sponsored by Netflix, with a $1,000,000 prize, to improve the prediction accuracy of its recommendations by 10%. It's awesome!

So that's it: you now have a simple working demo of some of the most popular information filtering (IF) techniques for recommending new items to users. Play with the library I provided here and load up some new data sets; it will be a lot of fun, and you will learn more about recommendation systems. There are other types of algorithms for recommendation systems, and soon I will provide implementations of them too.

To sum up, it's important to notice that item-based collaborative filtering is significantly faster than the user-based one, especially when you want to extract a list of recommended items from a big amount of data. On the other hand, don't forget the extra time needed to maintain the item similarity table. Another difference between these two techniques is a difference in precision that depends on how "sparse" the available data set is. In the book data set we worked with here, it would be unlikely to find two users with the same list of evaluated books, so almost every recommendation is based on a small set of users, which implies a sparse data set. Item-based collaborative filtering generally performs better than the user-based one on sparse data sets; on dense ones their performance is almost the same.

You may also have noticed that the user-based collaborative filtering technique is easier to implement and doesn't require extra steps, so it is generally recommended for small data sets that fit in memory and change frequently. Finally, in some applications, showing people other users with similar preferences has value of its own, beyond recommending items.

In the next articles, I will present some ideas I am developing on how to find similar groups of users using clustering algorithms. I hope you enjoyed it. To download all the implementations provided here, click here.

Thanks! For any doubts, suggestions or recommendations, feel free to ask!

See you next time,

Marcel Caraciolo


  1. you just ripped off code from Programming Collective Intelligence... sad!

  4. give credit to the book, man..