Know how many retweets (RT) you make and which user you most retweet on Twitter! -Text Mining

Saturday, November 21, 2009

Hi all,

While playing with some collective intelligence techniques on Twitter, some friends and I developed a simple application, based on an original idea by my friend Bruno Melo with suggestions from my friend Ricardo Caspirro, to count the number of retweets you make on Twitter and find the users you most retweet. We decided to name it BabaOvo (a Portuguese word for someone who pampers or flatters). The idea is to know how many retweets I've made across all my status updates and which users I retweeted (RT) most.

The code is quite simple and we've done it all in Python! Soon I'll release an official version along with the source code so you can use it on your own.

I ran a test with my Twitter username 'marcelcaraciolo' and discovered some really interesting insights: I like the user lucianans (hehe, she's my date); the users cinlug, larryolj and tarantulae are related to Python development and open source, which I like a lot; marcelobarros, telmo_mota and croozeus are associated with mobile topics, great people to follow if you want to know about the mobile area; and finally reciferock, davidsonFellipe and macmagazine cover my personal interests: rock music, blogs and the Apple world!

It's amazing what you can find out about yourself using only this simple text mining technique! Data mining is a powerful area, and natural language processing is incredible if you know how to handle it. I am playing with it, and soon I'll publish here some results of a project I am doing with recommendations and Twitter networking!
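The counting step itself needs very little code. Here is a hypothetical minimal sketch (BabaOvo's real source isn't released yet, and fetching the statuses from the Twitter API is omitted; the tweets are just plain strings here):

```python
import re
from collections import Counter

def count_retweets(tweets):
    """Count how many times each user was retweeted, given tweet texts.

    Looks for the classic 'RT @username' pattern anywhere in the text.
    """
    counts = Counter()
    for text in tweets:
        for username in re.findall(r'RT @(\w+)', text):
            counts[username] += 1
    return counts

# Made-up sample statuses
tweets = [
    "RT @lucianans: good morning!",
    "just had lunch",
    "RT @cinlug: Python meetup tonight",
    "RT @lucianans: see you soon",
]
print(count_retweets(tweets).most_common(2))
# [('lucianans', 2), ('cinlug', 1)]
```

The `most_common` result is exactly the ranking you need to feed a top-10 pie chart.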

For now, take a look at the pie chart I plotted after running the BabaOvo application. I've included only the top 10 users! The chart was drawn with the Matplotlib Python framework!

BabaOvo Pie Chart : Frequency Retweets

That's all!

Marcel Caraciolo

Collaborative Filtering : Implementation with Python! [Part 02]

Friday, November 13, 2009

Hi all,

In my last article I talked about one of the information filtering (IF) techniques used to make recommendations: user-based collaborative filtering. The way a recommendation system works with this kind of collaborative filtering requires the ratings of every user in order to build a data set. Obviously this works well with a small number of items and users, but the real world is not so benevolent: in web stores like Amazon, where there are millions of users and products, it becomes impracticable to compare each user against every other user, and also against each item the user has evaluated - it would be extremely slow.

To overcome this limitation, one approach proposed in the literature is item-based collaborative filtering. When a big data set is available, with millions of user profiles containing rated items, this technique may show better results, and it allows the computation to be done in advance, so a user asking for recommendations obtains them quickly. This article presents this filtering technique in action, with some implementations in the Python programming language.

The procedure for item-based filtering is similar to the user-based collaborative filtering I discussed earlier. The basic idea is to pre-compute the most similar items for each item. When the user asks for a recommendation, the system examines the items with the greatest scores and builds a ranked list, calculating a weighted average of the items similar to those in the user's profile. The main difference between the two approaches is that in the item-based one the comparisons between items will not change as frequently as the comparisons between users do in the user-based technique. This means it is not necessary to re-evaluate the similarity between items every time; it can be done at times when the website has low traffic, or on servers separate from where the main application is hosted.

To compare the items, you must define a function that builds a data set of similar items. Depending on the size of the data set, it will take some time to build. Remember what I said earlier: you don't need to call this function every time you want a new recommendation. Instead, you build the data set once and reuse it whenever it is needed.

I implemented this function, called calculateSimilarItems. It builds a dictionary (dict) of items showing, for each one, the items it is most similar to.

#Create a dictionary of items showing which other items they are most similar to.

def calculateSimilarItems(prefs, n=10):
    result = {}
    #Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        #Status updates for large datasets
        c += 1
        if c % 100 == 0:
            print("%d / %d" % (c, len(itemPrefs)))
        #Find the n most similar items to this one
        scores = topMatches(itemPrefs, item, n=n, similarity=sim_distance)
        result[item] = scores
    return result

As you can see, I reused some of the implementations from the user-based collaborative filtering article. The difference now is that we iterate through items and calculate the similarity between items. The result is a dictionary mapping each item to a list of its most similar items.

Now we are ready to make recommendations using the data set of similarities between items, without needing to examine the whole data set. The function getRecommendedItems does all the work; the basic idea behind it is to compare the items I haven't rated yet against the items I have already evaluated. This comparison uses some calculations, and as a result you receive a prediction of the rating I would give each item. You can see the implementation as follows:

def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores = {}
    totalSim = {}

    #Loop over items rated by this user
    for (item, rating) in userRatings.items():

        #Loop over items similar to this one
        for (similarity, item2) in itemMatch[item]:

            #Ignore if this user has already rated this item
            if item2 in userRatings: continue

            #Weighted sum of rating times similarity
            scores.setdefault(item2, 0)
            scores[item2] += similarity * rating

            #Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity

    #Divide each total score by total weighting to get an average
    rankings = [(score / totalSim[item], item)
                for item, score in scores.items()]

    #Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings

Let's use the same data set as in the last article - the book data set - where the goal is to recommend books to users based on their preferences. To download the data set and see how to load it into your application, see the article I wrote about user-based collaborative filtering.

As you can see, the result is a ranked list of tuples with (predicted_rate, book) as values. So you can see not only the books you would probably like and haven't read yet, but also an estimated rating you would give each one. Companies all around the world are working on improving the rating prediction for items in a dataset; there was even a challenge sponsored by Netflix, with a $1,000,000 prize, to improve the prediction accuracy of its recommendations by 10%. It's awesome!
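To see the whole pipeline working end to end, here is a self-contained toy run. It re-declares minimal versions of the helpers from the user-based article (transformPrefs, sim_distance, topMatches) plus getRecommendedItems, with a made-up three-user ratings dict standing in for the book data set:

```python
from math import sqrt

# Made-up ratings: user -> {book: rating}
prefs = {
    'A': {'Dune': 9.0, 'Hawaii': 6.0, 'It': 7.0},
    'B': {'Dune': 8.0, 'Hawaii': 5.0},
    'C': {'Hawaii': 4.0, 'It': 8.0},
}

def transformPrefs(prefs):
    # Flip {user: {item: rating}} into {item: {user: rating}}
    result = {}
    for user in prefs:
        for item in prefs[user]:
            result.setdefault(item, {})[user] = prefs[user][item]
    return result

def sim_distance(prefs, p1, p2):
    # Euclidean-distance-based similarity over commonly rated entries
    shared = [k for k in prefs[p1] if k in prefs[p2]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[p1][k] - prefs[p2][k]) ** 2 for k in shared)
    return 1.0 / (1.0 + sqrt(sum_sq))

def topMatches(prefs, person, n=10, similarity=sim_distance):
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    return sorted(scores, reverse=True)[:n]

def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores, totalSim = {}, {}
    for item, rating in userRatings.items():
        for similarity, item2 in itemMatch[item]:
            if item2 in userRatings:
                continue  # already rated by this user
            scores[item2] = scores.get(item2, 0.0) + similarity * rating
            totalSim[item2] = totalSim.get(item2, 0.0) + similarity
    rankings = [(score / totalSim[item], item)
                for item, score in scores.items()]
    return sorted(rankings, reverse=True)

# Pre-compute the item similarity table once...
itemPrefs = transformPrefs(prefs)
itemMatch = {item: topMatches(itemPrefs, item) for item in itemPrefs}

# ...then answer recommendation requests from it
print(getRecommendedItems(prefs, itemMatch, 'B'))
```

User 'B' has not rated 'It', so it comes back with a predicted rating of roughly 6.9: the weighted average of B's ratings for 'Dune' and 'Hawaii', weighted by how similar each is to 'It'.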

So that's it: you now have a simple working demo of some of the most popular information filtering (IF) techniques for recommending new items to users. Play with the library I provided here. Load up some new datasets; it will be a lot of fun and you will learn more about recommendation systems. There are other types of algorithms for recommendation systems, and soon I will provide implementations of them too.

To sum up, it's important to notice that item-based collaborative filtering is significantly faster than the user-based one, especially when you want to extract a recommendation list from a big amount of data, although you shouldn't forget the extra cost of maintaining the item similarity table. Another main difference between these two techniques is a difference in precision when the available data set is "sparse". In the data set we worked with here, for book recommendations, it would be unlikely to find two users with the same list of evaluated books, so almost every recommendation is based on a small set of users, which implies a sparse data set. Item-based collaborative filtering generally performs better than the user-based one on a sparse data set, while on a dense one their performance is almost the same.

You may also have noticed that the user-based collaborative filtering technique is easier to implement and doesn't have extra steps, so it's generally recommended for small data sets that can be kept in memory and that change frequently. Finally, in some applications, showing people other users with similar preferences has some value in itself, instead of recommending items.

In the next articles I will present some ideas I am developing, specifically on how to find similar groups of users using some clustering algorithms. I hope you enjoyed it. To download all the implementations provided here, click here.

Thanks! For any doubts, suggestions or recommendations, feel free to ask!

See you next time,

Marcel Caraciolo

Collaborative Filtering : Implementation with Python!

Tuesday, November 10, 2009

Continuing the recommendation engine article series, in this article I'm going to present an implementation of the collaborative filtering (CF) algorithm, which filters information for a user based on a collection of user profiles. Users with similar profiles may share similar interests. For a given user, information can be filtered in or out according to the behavior of his or her similar users.

User profiles can be collected either explicitly or implicitly. One can explicitly ask users to rate what they have used or purchased; such a profile is filled explicitly with the user's ratings. An implicit profile is based on passive observation and contains the user's historical interaction data.

The most common use of collaborative filtering is to make recommendations. That's why collaborative filtering is strongly correlated with recommender systems in the literature.

The implementation shown here is in Python, so if you're not familiar with the language you can learn more about it here. The advantage of using Python is that with so few lines of code you can easily get things running. Regardless of the underlying implementation, collaborative filters tend to solve the same broad problem using much the same data.

Generally you have a crowd of users, a big pile of items, and ratings from some of the users (what they think of the items). Finally, you want to suggest more items to a user, and you'd prefer to make your recommendations relevant to their likely interests. As you will see, the algorithm uses the opinions people have recorded about items they have bought to make a good guess as to which items they haven't bought but might like.

The first thing to do is collect the preferences of the users. My collaborative filtering implementation stores its data in a 2D matrix: for each user in a row, we have a column for each item he rated, as you can see in Figure 01 below.
Figure 01. The 2D Matrix User:Book:Rating

To keep things simple, let's represent our matrix as two levels of Python dict objects (a dictionary is simply a hash table, if you're not familiar with Python). The key of the first-level dict is a user ID, so to get the rating the user "Bryan" gave to the book "Classical Mythology", we look in the first-level dict for "Bryan", then in the second-level dict for "Classical Mythology". Our problem scope here will be book recommendations. The complete dataset can be downloaded for free at this link. Freely available for research, the Book-Crossing dataset contains 278,858 users (anonymized) providing 1,149,780 ratings (explicit/implicit) about 271,379 books, collected in a 4-week crawl (August/September 2004).
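The two-level dict looks like this (the ratings below are made up for illustration; in the real dataset users are keyed by numeric ID):

```python
# First-level key: user; second-level key: book title; value: rating
critics = {
    'Bryan': {'Classical Mythology': 5.0, 'The Lost Symbol': 8.0},
    'Alice': {'Classical Mythology': 7.0},
}

# The rating Bryan gave to 'Classical Mythology'
print(critics['Bryan']['Classical Mythology'])  # 5.0
```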

In this article we will use only the data stored in BX-Books.csv and BX-Book-Ratings.csv, which contain the identifiers and titles of the books, and the ratings given by the users, respectively. To download the data already pre-processed, click here. If you prefer to do it all by yourself, I also provided some code (loadDataset) in the implementation. It's important to notice that each user in this data set is represented by a unique numeric identifier, for the users' privacy.

>>> from critics import *
{'Fortune': 6.0}

After collecting the data about what the users prefer, you need a metric to determine how similar other users are to you. To measure this, you compare each user with every other user using a similarity measure. There are several functions for this metric, but in this article I will use the Euclidean distance and the Pearson correlation. I am not going to explain the mathematics behind these distance measures, because you can easily find plenty of information about them elsewhere. The basic idea behind them is that the more similar the users' tastes are, the closer they are to each other in the preference search space. Which one to use? It depends on your problem: test both and verify which one gets better results. Generally the Pearson correlation gets slightly better results, since it shows how much the variables change together. To play with them, check the implementations of the functions sim_pearson and sim_euclidean. These functions will be used as parameters of the functions defined in the rest of this article.

>>> recommendations.sim_distance(critics.critics, '98556', '180727')
>>> recommendations.sim_pearson(critics.critics, '180727', '177432')
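For reference, here is one possible implementation of the two measures over the two-level dict described earlier. These are sketches and may differ in detail from the ones in my library:

```python
from math import sqrt

def sim_euclidean(prefs, p1, p2):
    # Items rated by both users
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0.0
    sum_of_squares = sum((prefs[p1][it] - prefs[p2][it]) ** 2
                         for it in shared)
    # Map the distance into a 0..1 similarity (1 = identical ratings)
    return 1.0 / (1.0 + sqrt(sum_of_squares))

def sim_pearson(prefs, p1, p2):
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0.0
    # Sums and sums of squares over the commonly rated items
    sum1 = sum(prefs[p1][it] for it in shared)
    sum2 = sum(prefs[p2][it] for it in shared)
    sum1Sq = sum(prefs[p1][it] ** 2 for it in shared)
    sum2Sq = sum(prefs[p2][it] ** 2 for it in shared)
    pSum = sum(prefs[p1][it] * prefs[p2][it] for it in shared)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    if den == 0:
        return 0.0
    return num / den

# Toy data: u2's ratings are exactly half of u1's
prefs = {
    'u1': {'b1': 2.0, 'b2': 4.0, 'b3': 6.0},
    'u2': {'b1': 1.0, 'b2': 2.0, 'b3': 3.0},
}
print(sim_pearson(prefs, 'u1', 'u2'))  # 1.0 (perfectly correlated)
```

Notice how the two measures disagree on this toy data: Pearson gives 1.0 because the ratings move together, while the Euclidean measure penalizes the absolute difference.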

Now that we have distance measures to compare two users, we can define another function that ranks all users by their similarity to a given user and finds the most similar one. In this particular case, the goal is to find users who have rated books and have similar taste, so I know who to ask for advice when I want to choose a book. The function topMatches returns a sorted list of the n users with preferences most similar to the specified user. With this list, I can look at the ratings made by users whose preferences are like mine, see the books they rated, and then choose new books.

>>> recommendations.topMatches(critics.critics,'98556',10,recommendations.sim_distance)
[(1.0, '69721'), (1.0, '28667'), (1.0, '224646'), (1.0, '182212'),
(1.0, '11676'), (0.5, '4157'), (0.5, '28729'), (0.5, '224650'), (0.5, '199616'), 
(0.5, '189139')]
>>> recommendations.topMatches(critics.critics,'180727', 3) 
[(1.0, '189139'), (1.0, '11676'), (0.6622661785325219, '177432')]
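The topMatches function itself can be sketched in a few lines. This is a minimal version, with a toy similarity function and made-up data included so the snippet runs standalone:

```python
from math import sqrt

def sim_distance(prefs, p1, p2):
    # Euclidean-distance-based similarity over commonly rated items
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    if not shared:
        return 0.0
    return 1.0 / (1.0 + sqrt(sum((prefs[p1][it] - prefs[p2][it]) ** 2
                                 for it in shared)))

def topMatches(prefs, person, n=5, similarity=sim_distance):
    # Score every other user against 'person' and keep the n best
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)
    return scores[:n]

prefs = {
    'me': {'b1': 8.0, 'b2': 6.0},
    'u1': {'b1': 8.0, 'b2': 6.0},   # identical tastes
    'u2': {'b1': 2.0},
}
print(topMatches(prefs, 'me', n=2))
```

A user with identical ratings scores 1.0, which is why the transcripts above show several perfect matches at the head of the list.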

Finding someone with similar taste to get reading suggestions is great, but generally what we really want is a recommendation of books, not users. I could simply look at the most similar user's profile and pick books he likes that I haven't read yet, but this is not very clever. That approach could return a user who hasn't evaluated some books I might like, or a user who liked a book that was rated badly (low ratings) by all the other users returned by topMatches. To solve these problems, you score items using a weighted average that properly weights the evaluations. The implementation code for this item recommendation is simple and works with both distance measures.

The function getRecommendations looks at every user except the one passed as a parameter. It calculates how similar each of them is to the specified user, and then looks at each item rated by those users. As a result you get a ranked list of books, along with an estimated rating I would give each one. This report allows me to choose which book I want to read, or whether I'd rather do something else entirely. It's important to notice that you can decide not to make recommendations if no result reaches a threshold specified by the user.

>>> recommendations.getRecommendations(critics.critics,'180727')[0:3]
[(10.000000000000002, 'The Two Towers (The Lord of the Rings, Part 2)'),
(10.000000000000002, 'The Return of the King (The Lord of the Rings, Part 3)'),
(10.000000000000002, 'Hawaii')] 

>>> recommendations.getRecommendations(critics.critics,'180727', 
[(10.000000000000002, 'Dune'), (10.000000000000002, 'Best Friends'), 
(10.000000000000002, 'All Creatures Great and Small'),
(10.000000000000002, 'A Christmas Carol (Dover Thrift Editions)')] 
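A minimal sketch of how getRecommendations can work (hypothetical toy data; the Euclidean similarity is re-declared here so the snippet runs standalone):

```python
from math import sqrt

def sim_distance(prefs, p1, p2):
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    if not shared:
        return 0.0
    return 1.0 / (1.0 + sqrt(sum((prefs[p1][it] - prefs[p2][it]) ** 2
                                 for it in shared)))

def getRecommendations(prefs, person, similarity=sim_distance):
    totals = {}
    simSums = {}
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:
            continue
        for item, rating in prefs[other].items():
            # Only score items the target user hasn't rated yet
            if item not in prefs[person]:
                totals[item] = totals.get(item, 0.0) + rating * sim
                simSums[item] = simSums.get(item, 0.0) + sim
    # Divide each weighted total by the sum of weights: a weighted average
    rankings = [(total / simSums[item], item)
                for item, total in totals.items()]
    rankings.sort(reverse=True)
    return rankings

prefs = {
    'me': {'b1': 8.0},
    'u1': {'b1': 8.0, 'b2': 9.0},  # agrees with me on 'b1'
    'u2': {'b1': 4.0, 'b3': 3.0},  # disagrees with me on 'b1'
}
print(getRecommendations(prefs, 'me'))
```

Book 'b2' comes out on top with a predicted rating of 9.0, because it was rated by the user most similar to 'me'; 'b3', rated only by a dissimilar user, is predicted at about 3.0.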

Now we know how to find similar users and recommend items to a user, but what about finding similar items? You see those recommendations at web stores on the internet, especially when the store hasn't collected much information about your preferences. One web store that uses this type of recommendation is Amazon, as you can see here.

Figure 02. Amazon Web Store Recommendation System

In this case, you can evaluate similarity by searching for the users who liked a particular item and seeing which other items they appreciated in the same way. To do this you can use the same functions defined earlier in this article; the only change is to swap users and items. Then you can find the items most similar to a specified item.

I provided a function transformPrefs to do that. It rebuilds the dictionary, now keyed by book name, with (user, rate) pairs as values.

>>> critics = recommendations.transformPrefs(critics)
>>> critics
{'Robin Hood (Penguin Popular Classics)': {'81263': 8.0, '128653': 8.0},
 'Collected short stories [of] W. Somerset Maugham': {'180651': 8.0},
 'Signing Naturally: Student Videotext and Workbook Level 1
 (Vista American Sign Language Series Functional Notional Appr)': {'236948': 9.0},
 'Looking For Laura': {'98391': 8.0, '255952': 8.0, '5582': 4.0, '250192': 9.0, '72352': 7.0,
 '182085': 10.0, '67775': 7.0}, ...}
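The flip itself is a few lines of dictionary manipulation. A minimal sketch of transformPrefs (the library version may differ in detail), run on a tiny made-up slice of the data:

```python
def transformPrefs(prefs):
    # Flip {user: {item: rating}} into {item: {user: rating}}
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            result[item][person] = prefs[person][item]
    return result

critics = {
    '81263': {'Robin Hood (Penguin Popular Classics)': 8.0},
    '128653': {'Robin Hood (Penguin Popular Classics)': 8.0},
}
print(transformPrefs(critics))
# {'Robin Hood (Penguin Popular Classics)': {'81263': 8.0, '128653': 8.0}}
```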

It's not obvious that swapping users and items will lead to useful results, but in many cases it makes interesting comparisons possible. Imagine a web store that collects purchase histories in order to recommend products to individual people. Reversing people and products lets the system recommend users who are likely to buy specific products. This is very useful when the marketing department of your company wants to make a big push for a cut-price sale. It could also be used to check whether the links recommended on a web page are actually seen by the users most likely to enjoy them.

>>> recommendations.topMatches(critics,'Drums of Autumn')
[(1.0, 'Year of Wonders'), (1.0, 'Velvet Angel'), (1.0, 'Twice Loved'), 
(1.0, 'Trying to Save Piggy Sneed'), (1.0, 'The Zebra Wall')]

What if, instead, you want to find the specific people who have evaluated a given book - say, to send invitations to a book launch event?

>>> recommendations.getRecommendations(critics,'The Weight of Water')[0:5]
[(10.000000000000002, '92048'), (10.000000000000002, '211152'), (10.000000000000002, '198996'), 
(10.000000000000002, '156467'), (10.0, '99298')]

So that's it. I hope you enjoyed this article. As you can see, a recommendation engine using user-based collaborative filtering is very effective when you don't have a huge number of items or users. When you deal with a big store like Amazon, which has millions of users and items, comparing one user against all the others, and then against each evaluated item, can be extremely slow. An alternative technique to overcome this limitation is item-based filtering. It's very useful when you have a big dataset; it can give better results and allows many calculations to be done in advance, before a user asks for a recommendation, so the recommendations appear quickly.

You can download a copy of my sample collaborative filtering implementation here. In the next article we will study the item-based filtering technique.

PS: I'm planning, together with my colleague Ricardo Caspirro, to develop a Python library for recommendations. We are very excited and planning great stuff for Python and recommendation system enthusiasts! Wait for great news soon!


Marcel Caraciolo