Collaborative Filtering: Implementation with Python! [Part 02]

Friday, November 13, 2009

Hi all,

In my last article I talked about one of the information filtering (IF) techniques used to make recommendations: user-based collaborative filtering. A recommendation system based on this kind of collaborative filtering requires the ratings of every user in order to build its data set. Obviously, it works well with a small number of items and users, but the real world is not so benevolent: in web stores like Amazon, where there are millions of users and products, it would be impracticable to compare each user against every other user, and then against every item each user has rated. It would be extremely slow.

To overcome this limitation, one approach proposed in the literature is item-based collaborative filtering. When a big data set with millions of user profiles with rated items is available, this technique may show better results, and it allows the computation to be done in advance, so that a user demanding recommendations obtains them quickly. This article presents this filtering technique in action, with some implementations in the Python programming language.

The procedure for item-based filtering is similar to the user-based collaborative filtering I discussed earlier. The basic idea is to pre-compute the most similar items for each item. When a user asks for recommendations, the system examines the items with the greatest scores and builds a ranked list by calculating a weighted average of the items similar to those in the user's profile. The main difference between the two approaches is that in the item-based one the comparisons between items will not change as frequently as the comparisons between users in the user-based technique. This means it is not necessary to re-evaluate the similarity between the items every time; it can be done at times when the website has low traffic, or on servers separate from the ones hosting the main application.
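To make the weighted average concrete, here is a toy calculation. The similarities and ratings below are invented for illustration; they are not taken from any real data set:

```python
# Toy example: predict a user's rating for an unseen item from
# items the user has already rated. All numbers are invented.

# (similarity to the unseen item, user's rating) for each rated item
rated = [(0.9, 4.5), (0.4, 3.0), (0.1, 1.0)]

# Weighted sum of ratings, each weighted by its similarity
weighted_sum = sum(sim * rating for sim, rating in rated)

# Normalize by the total similarity so the prediction stays
# on the original rating scale
total_sim = sum(sim for sim, rating in rated)

prediction = weighted_sum / total_sim
print(round(prediction, 2))  # 3.82
```

Items that are very similar to what the user already liked dominate the prediction, while barely similar items contribute little.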

To compare the items, you must define a function that builds a data set of similar items. Depending on the size of the data set, it will take some time to build. As I said earlier, you don't need to call this function every time you want a new recommendation. Instead, you build the data set once and reuse it on every request.
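One simple way to implement "build once, reuse afterwards" is to persist the similarity table on disk. This is just a sketch of an illustrative choice of mine; the file name, the `load_or_build` helper and the use of pickle are assumptions, not something prescribed in this article:

```python
# Persist the item-similarity table so it is built only once
# (e.g. during low-traffic hours) and reused on later requests.
# File name and helper name are illustrative.
import os
import pickle

def load_or_build(build_fn, path='item_similarities.pkl'):
    # Reuse the cached table if it already exists on disk
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    # Otherwise build it once and save it for the next requests
    table = build_fn()
    with open(path, 'wb') as f:
        pickle.dump(table, f)
    return table
```

Here `build_fn` would be whatever function computes the similarity table, such as the one defined next in this article.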

I implemented this in a function called calculateSimilarItems; it builds a dictionary (dict) that maps each item to the items it is most similar to.

#Create a dictionary of items showing which other items they are most similar to.

def calculateSimilarItems(prefs, n=10):
    result = {}
    #Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        #Status updates for large datasets
        c += 1
        if c % 100 == 0:
            print("%d / %d" % (c, len(itemPrefs)))
        #Find the n most similar items to this one
        scores = topMatches(itemPrefs, item, n=n, similarity=sim_distance)
        result[item] = scores
    return result

As you can see, I reused some of the implementations from the user-based collaborative filtering article (transformPrefs, topMatches and the similarity function). The difference now is that we iterate through the items and calculate the similarity between them. The result is a dictionary that maps each item to a list of its most similar items.
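For reference, here is a minimal sketch of the preference-matrix inversion that transformPrefs performs; the users, books and ratings below are invented for this example:

```python
# Sketch of transformPrefs: invert {user: {item: rating}}
# into {item: {user: rating}} so items can be compared the
# same way users were. Sample data is invented.

def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Swap item and person
            result[item][person] = prefs[person][item]
    return result

prefs = {
    'Alice': {'Book A': 4.0, 'Book B': 3.5},
    'Bob':   {'Book A': 2.0, 'Book C': 5.0},
}

itemPrefs = transformPrefs(prefs)
print(itemPrefs['Book A'])  # {'Alice': 4.0, 'Bob': 2.0}
```

After the inversion, the same similarity routines used to compare users can be applied unchanged to compare items.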

Now we are ready to make recommendations using the data set of item similarities, without needing to examine the whole data set. The function getRecommendedItems does all the logic; the basic idea behind it is to compare the items I haven't rated yet against the items I have already evaluated. As the result of this comparison you receive a prediction of the rating I would give to each item. You can see the implementation as follows:

def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores = {}
    totalSim = {}

    #Loop over items rated by this user
    for (item, rating) in userRatings.items():

        #Loop over items similar to this one
        for (similarity, item2) in itemMatch[item]:

            #Ignore if this user has already rated this item
            if item2 in userRatings:
                continue

            #Weighted sum of rating times similarity
            scores.setdefault(item2, 0)
            scores[item2] += similarity * rating

            #Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity

    #Divide each total score by total weighting to get an average
    rankings = [(score / totalSim[item], item)
                for item, score in scores.items()]

    #Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings
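To see the whole pipeline run end to end, here is a minimal self-contained sketch. The three-user data set is invented for illustration, and sim_distance, topMatches, transformPrefs and the two main functions are simplified stand-ins for the versions discussed in this and the previous article:

```python
# Minimal end-to-end sketch of item-based collaborative filtering.
# Users, books and ratings are invented; helpers are simplified
# versions of those from the user-based article.

def sim_distance(prefs, p1, p2):
    # Euclidean-distance-based similarity in (0, 1]
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[p1][it] - prefs[p2][it]) ** 2 for it in shared)
    return 1.0 / (1.0 + sum_sq)

def topMatches(prefs, person, n=5, similarity=sim_distance):
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)
    return scores[:n]

def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})[person] = prefs[person][item]
    return result

def calculateSimilarItems(prefs, n=10):
    itemPrefs = transformPrefs(prefs)
    return {item: topMatches(itemPrefs, item, n=n) for item in itemPrefs}

def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores, totalSim = {}, {}
    for item, rating in userRatings.items():
        for similarity, item2 in itemMatch[item]:
            if item2 in userRatings:
                continue  # skip items the user already rated
            scores[item2] = scores.get(item2, 0) + similarity * rating
            totalSim[item2] = totalSim.get(item2, 0) + similarity
    rankings = [(score / totalSim[item], item)
                for item, score in scores.items()]
    rankings.sort(reverse=True)
    return rankings

prefs = {
    'Ana':   {'Book A': 4.0, 'Book B': 3.0, 'Book C': 5.0},
    'Bruno': {'Book A': 4.0, 'Book B': 3.0, 'Book D': 4.0},
    'Carla': {'Book B': 2.0, 'Book C': 4.0, 'Book D': 3.0},
}

itemsim = calculateSimilarItems(prefs)
recommendations = getRecommendedItems(prefs, itemsim, 'Bruno')
print(recommendations)
```

Bruno has not rated 'Book C', so it comes back as the single recommendation, with a predicted rating of 3.9 in this toy data set.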

Let's use the same data set as in the last article, the book data set; the target is to recommend some books to users based on their preferences. To download the data set and learn how to load it into your application, see the article I wrote about user-based collaborative filtering.

As you can see, the result is a ranked list of tuples with (predicted_rating, book) as values. So you can see not only the books you would probably like and haven't read yet, but also an estimate of the rating you would give each of them. Companies all around the world are working on improving the rating predictions over their data sets; there was even a challenge sponsored by Netflix, with a $1,000,000 prize, to improve the prediction accuracy of its recommendations by 10%. It's awesome!

So that's it: you now have a simple working demo of some of the most popular information filtering (IF) techniques for recommending new items to users. Play with the library I provided here and load up some new data sets; it will be a lot of fun, and you will learn more about recommendation systems. There are other types of algorithms for recommendation systems, and soon I will provide implementations of them too.

To sum up, it's important to notice that item-based collaborative filtering is significantly faster than the user-based one, especially when you want to extract a list of recommended items from a big amount of data. On the other hand, don't forget the extra time needed to maintain the item similarity table. Another difference between these two techniques is a difference in precision that depends on how "sparse" the available data set is. In the book data set we worked with here, it would be unlikely to find two users with the same list of evaluated books, so almost every recommendation is based on a small set of users, which implies a sparse data set. Item-based collaborative filtering generally performs better than the user-based one on sparse data sets; on dense ones their performance is almost the same.

You may also have noticed that the user-based collaborative filtering technique is easier to implement and doesn't require extra steps, so it is generally recommended for small data sets that fit in memory and change frequently. Finally, in some applications, showing people other users with similar preferences has value of its own, beyond recommending items.

In the next articles, I will present some ideas I am developing on how to find similar groups of users using clustering algorithms. I hope you enjoyed it. To download all the implementations provided here, click here.

Thanks! For any doubts, suggestions or recommendations, feel free to ask!

See you next time,

Marcel Caraciolo


  1. you just ripped off code from Programming Collective Intelligence... sad!

  4. give credit to the book, man..