Collaborative Filtering: Implementation with Python! [Part 02]

Friday, November 13, 2009

Hi all,

In my last article I talked about one of the information filtering (IF) techniques used to make recommendations: user-based collaborative filtering. A recommendation system built on that technique needs the ratings of every user to build its data set. It works well with a small number of items and users, but the real world is not so benevolent: for web stores like Amazon, with millions of users and products, comparing each user against every other user, and every item each of them has rated, becomes impractical. It would be extremely slow.

To overcome this limitation, one approach proposed in the literature is item-based collaborative filtering. When a large data set with millions of user profiles and rated items is available, this technique may show better results, and it allows much of the computation to be done in advance, so a user asking for recommendations gets them quickly. This article presents the technique in action, with some implementations in Python.

The procedure for item-based filtering is similar to the one I discussed earlier for user-based collaborative filtering. The basic idea is to precompute, for each item, the items most similar to it. When a user asks for a recommendation, the system looks at the items the user has rated most highly and builds a ranked list by calculating a weighted average over the items similar to those in the user's profile. The main difference between the two approaches is that comparisons between items do not change as often as comparisons between users. This means the item similarities do not have to be recomputed on every request; they can be computed when the website has little traffic, or on servers separate from the one hosting the main application.

To compare the items, you need a function that builds a data set of similar items. Depending on the size of the data, building it will take some time. Remember what I said earlier: you don't need to call this function every time you want a new recommendation. Instead, you build the data set once and reuse it whenever a recommendation is requested.
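One simple way to build the data set once and reuse it is to cache it on disk. The snippet below is only a sketch of that idea using the standard pickle module; the helper name load_or_build and the file name are hypothetical and are not part of the library provided with this article.

```python
import os
import pickle


def load_or_build(path, build):
    """Return the object cached at `path`; build and save it if the file is missing.

    `build` is any zero-argument function that computes the data set.
    (Hypothetical helper, for illustration only.)
    """
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = build()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Example usage, assuming `prefs` is already loaded and `build_similarities`
# is whatever function computes the item similarities:
# itemsim = load_or_build("itemsim.pkl", lambda: build_similarities(prefs))
```

With something like this, the slow build can run in a scheduled job while the web application only loads the cached file.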

I implemented this in a function called calculateSimilarItems. It builds a dictionary (dict) that maps each item to the items most similar to it.

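# Create a dictionary of items showing which other items they are most similar to.

def calculateSimilarItems(prefs, n=10):
    result = {}
    # Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        # Status updates for large datasets
        c += 1
        if c % 100 == 0:
            print("%d / %d" % (c, len(itemPrefs)))
        # Find the most similar items to this one.
        # Assumption: topMatches and sim_distance are the helpers from the
        # user-based article, returning the top n (similarity, item) tuples.
        scores = topMatches(itemPrefs, item, n=n, similarity=sim_distance)
        result[item] = scores
    return result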

As you can see, I reused some of the code I wrote for user-based collaborative filtering. The difference is that we now iterate over items and calculate the similarity between items. The result is a dictionary that pairs each item with the list of items most similar to it.
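calculateSimilarItems relies on helpers from the user-based article, such as transformPrefs, that are not repeated here. The following is a minimal sketch of what such helpers might look like. The names topMatches and sim_distance and the choice of a Euclidean-distance score are assumptions made for illustration, and may differ from the code in the earlier article.

```python
from math import sqrt


def transformPrefs(prefs):
    """Flip {user: {item: rating}} into {item: {user: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result


def sim_distance(prefs, a, b):
    """Euclidean-distance similarity in (0, 1]; 0 if nothing is shared."""
    shared = [k for k in prefs[a] if k in prefs[b]]
    if not shared:
        return 0
    sum_sq = sum((prefs[a][k] - prefs[b][k]) ** 2 for k in shared)
    return 1 / (1 + sqrt(sum_sq))


def topMatches(prefs, key, n=5, similarity=sim_distance):
    """Return the n entries most similar to `key` as (score, other) tuples."""
    scores = [(similarity(prefs, key, other), other)
              for other in prefs if other != key]
    scores.sort(reverse=True)
    return scores[:n]
```

Because the same functions work on any dictionary of dictionaries, inverting the preferences with transformPrefs is all it takes to go from comparing users to comparing items.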

Now we are ready to make recommendations from the item-similarity data set without examining the whole data set. The function getRecommendedItems does all the work. The basic idea is to compare the items I haven't rated yet against the items I have already rated. It weights my rating of each item by its similarity to the unrated item and normalizes by the total similarity, so the result is a prediction of the rating I would give to each unrated item. You can see the implementation below:

def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores = {}
    totalSim = {}

    # Loop over items rated by this user
    for (item, rating) in userRatings.items():

        # Loop over items similar to this one
        for (similarity, item2) in itemMatch[item]:

            # Ignore if this user has already rated this item
            if item2 in userRatings:
                continue
            # Weighted sum of rating times similarity
            scores.setdefault(item2, 0)
            scores[item2] += similarity * rating
            # Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity

    # Divide each total score by total weighting to get an average
    rankings = [(score / totalSim[item], item)
                for item, score in scores.items()]

    # Return the rankings from highest to lowest
    rankings.sort(reverse=True)
    return rankings
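To make the weighted average concrete, here is a small self-contained sketch with made-up ratings and a hand-written similarity table (not the book data set). The function recommend is a hypothetical, compact restatement of the same idea, written only so the example runs on its own.

```python
def recommend(prefs, item_match, user):
    """Rank unrated items by similarity-weighted average rating (illustrative)."""
    rated = prefs[user]
    scores, total_sim = {}, {}
    for item, rating in rated.items():
        for similarity, other in item_match[item]:
            if other in rated:
                continue  # skip items the user has already rated
            scores[other] = scores.get(other, 0.0) + similarity * rating
            total_sim[other] = total_sim.get(other, 0.0) + similarity
    # Skip items whose total similarity is zero to avoid dividing by zero
    ranked = [(scores[i] / total_sim[i], i) for i in scores if total_sim[i] > 0]
    return sorted(ranked, reverse=True)


# Made-up data: one user, two rated books, and precomputed similar items.
prefs = {"ana": {"Dune": 5.0, "Emma": 2.0}}
item_match = {
    "Dune": [(0.5, "Solaris"), (0.25, "Persuasion"), (0.25, "Emma")],
    "Emma": [(0.5, "Persuasion"), (0.25, "Dune")],
}

for predicted, title in recommend(prefs, item_match, "ana"):
    print("%.2f  %s" % (predicted, title))
# 5.00  Solaris
# 3.00  Persuasion
```

Solaris is only similar to Dune, so it inherits Dune's rating of 5. Persuasion is pulled between the two: (0.25 * 5 + 0.5 * 2) / (0.25 + 0.5) = 3.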

Let's use the same data set as in the last article, the book data set. The goal is to recommend books to users based on their preferences. To download the data set and see how to load it into your application, see my article on user-based collaborative filtering.

As you can see, the result is a ranked list of tuples of the form (rate_predicted, book). So you see not only the books that you haven't read and should like, but also an estimate of the rating you would give each one. Companies all around the world are working on improving rating predictions; Netflix even sponsored a challenge with a $1,000,000 prize for improving the accuracy of its recommendation predictions by 10%. It's awesome!
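Prediction accuracy is commonly measured with the root mean squared error (RMSE) between predicted and actual ratings, which is the metric the Netflix Prize used. Here is a minimal sketch, assuming both sets of ratings are dictionaries keyed by item; the function name and the sample numbers are made up for illustration.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error over the items present in both dicts."""
    shared = [k for k in predicted if k in actual]
    if not shared:
        raise ValueError("no items in common")
    total = sum((predicted[k] - actual[k]) ** 2 for k in shared)
    return sqrt(total / float(len(shared)))


# Hypothetical held-out ratings versus the system's predictions for one user.
actual = {"Book A": 4.0, "Book B": 2.0}
predicted = {"Book A": 3.0, "Book B": 3.0}
print(rmse(predicted, actual))  # 1.0
```

A lower RMSE means the predicted ratings are closer to the ratings users actually gave. To evaluate a recommender this way, you would hide some known ratings, predict them, and compare.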

So that's it: you now have a simple working demo of some of the most popular information filtering (IF) techniques for recommending new items to users. Play with the library that I provided here. Load up some new data sets; it will be a lot of fun and you will learn more about recommendation systems. There are other types of algorithms for recommendation systems, and soon I will provide implementations of them too.

To sum up, it's important to note that item-based collaborative filtering is significantly faster than the user-based approach, especially when you want to extract a list of recommendations from a large amount of data. On the other hand, don't forget the extra time needed to maintain the item similarity table. Another main difference between the two techniques is precision, which depends on how "sparse" the available data set is. In the book data set we worked with here, it is unlikely that you would find two users with the same list of rated books. So most of the recommendations are based on a small set of users, which makes the data set sparse. Item-based collaborative filtering generally performs better than user-based filtering on sparse data sets, while on dense ones the two perform almost the same.
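If you want a rough idea of how sparse a data set is, you can measure what fraction of all possible (user, item) pairs actually has a rating. This is just a sketch; the function name density and the sample ratings are made up for illustration.

```python
def density(prefs):
    """Fraction of the user x item grid that actually holds a rating."""
    items = set()
    for ratings in prefs.values():
        items.update(ratings)
    cells = len(prefs) * len(items)
    filled = sum(len(ratings) for ratings in prefs.values())
    return filled / float(cells) if cells else 0.0


# Made-up ratings: 3 users, 3 distinct items, 4 ratings in total.
prefs = {
    "ana": {"A": 5.0, "B": 3.0},
    "bia": {"C": 4.0},
    "caio": {"A": 2.0},
}
print("%.2f" % density(prefs))  # 0.44
```

A value close to 1 means a dense data set; values close to 0 mean most users have rated only a tiny share of the items.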

You may also have noticed that user-based collaborative filtering is easier to implement and has no extra steps, so it is generally recommended for small data sets that fit in memory and change frequently. Finally, in some applications, showing people other users with the same preferences has value of its own, beyond recommending items.

In the next articles, I will share some ideas I am developing on how to find groups of similar users using grouping algorithms. I hope you enjoyed it. To download all the implementation provided here, click here.

Thanks! If you have any doubts, suggestions or recommendations, feel free to ask!

See you next time,

Marcel Caraciolo


  1. you just ripped off code from Programming Collective intelligence...sad!

  2. Give credit to the book, man..

  3. Can you please help to calculate the accuracy of both user and item based for book recommendations.
