Playing with the Twitter Data: Clustering/Grouping my friends based on their statuses with Python

Tuesday, December 29, 2009

Hi all,

As i said in my last post, i will begin to post some articles about the approaches that i developed to find new users on the web service Twitter (a real-time micro-blogging and social network service where users can post messages of up to 140 characters). The main goal is to create a recommender system that can find new users who share the same tastes and preferences as me.

These articles will introduce some basic concepts of data clustering analysis, a method used to discover and visualize things, people or ideas that are closely related. I will show how to gather and prepare the data provided by Twitter, and present some clustering algorithms together with well-known distance measures. Some graphs will also be shown, as graphic visualization tools, to observe the clusters created. Developing these clustering algorithms helped me understand how i could design my recommendation algorithm using a special distance measure that produces a score for a given user of the social network.

Before exploring this topic, it's important to show the reader the difference between supervised and unsupervised learning. Supervised learning techniques, such as neural networks and decision trees, use data together with expected results in order to "learn" how to produce results for new examples based on what they have learned so far. Unsupervised learning methods like clustering are different: these algorithms aren't trained with expected results. Their purpose is to find structure in the data set provided, where no single pattern is the expected answer; the goal is to use the data to find the distinct groups that may exist in it. In this article and the next ones, we will explore some unsupervised learning techniques such as hierarchical clustering and K-means.

Now, let's get started by exploring the twitter user profiles and showing how they can be grouped based on their status updates (text messages), and also how the words themselves can be grouped based on their use.

In the following steps, i will present how i captured the data, prepared it and clustered it, and finally i will present some interesting results.

Step 1: Getting the Twitter Data

The first step is to fetch the twitter data. For this article, i decided to analyse my own twitter social network, so i focused only on my profile and my friends' (following) profiles. The goal is to find out, based on my twitter statuses and those of my friends, how we are grouped. The expected result is that friends who write content similar to mine will end up close to me, and friends who write different posts will be far from me. To get the data i've used the python-twitter library, an open-source python interface for the Twitter API that is extremely useful for accessing twitter data. For further information, see its official project homepage.

For this experiment, i've used a subset of 100 friends randomly picked from my social network, and the clustered data will be the number of times that a particular set of words appears in each user's twitter statuses. A small subset of what this data looks like is shown in Table 01.

Table 01 - The frequencies of words of user profiles

By clustering user profiles based on word frequencies, it may be possible to verify whether there are groups of user profiles that frequently write about similar subjects or write in similar styles (e.g. in english or in portuguese). To generate this dataset, you'll download the twitter statuses of a set of users, extract the text from the entries, and create a table of word frequencies.

For downloading the statuses from Twitter, i've developed a customized version of the python-twitter library. In this module, i've added some new functions to get the social network of my friends on Twitter. I recommend that you download this module and use it. To download it, click here.

After that, i've played around with the Twitter API in order to get my statuses and my friends' statuses from the twitter social network. The function getFriendsIds is responsible for getting the user ids from twitter, including mine. You can see the code snippet below.
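Since the snippet itself ships with the downloadable module, here is only a rough sketch of what getFriendsIds could look like; the exact signature is my assumption, and GetUser/GetFriends are the method names of the python-twitter Api class:

```python
def getFriendsIds(api, username):
    """Return my own user id followed by the ids of all my friends.

    `api` is expected to behave like a python-twitter Api instance,
    e.g. api = twitter.Api(username='me', password='...').
    """
    ids = [api.GetUser(username).id]                      # my own id first
    ids.extend(friend.id for friend in api.GetFriends())  # people i follow
    return ids
```
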

The next step is to create a function that will extract all the words from the statuses. In this article i've downloaded up to the last 100 statuses from each friend, for a set of 100 user ids. The next snippet shows how i managed to get this done.
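As an illustration, a minimal sketch of the word-extraction step could look like the code below. The function names getwords/countwords are mine, and the actual timeline download (via python-twitter's GetUserTimeline) is left out:

```python
import re
from collections import defaultdict

def getwords(text):
    """Split a status text into lowercase words, dropping links and punctuation."""
    text = re.sub(r'https?://\S+', '', text)        # statuses are full of links
    words = re.split(r'[^a-z0-9]+', text.lower())   # break on anything non-alphanumeric
    return [w for w in words if 2 < len(w) < 20]    # keep reasonable-length words only

def countwords(statuses):
    """Build the {word: frequency} table for one user's list of statuses."""
    wordcounts = defaultdict(int)
    for status in statuses:
        for word in getwords(status):
            wordcounts[word] += 1
    return dict(wordcounts)
```
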

The next step is to generate the list of words that will actually be used in the counts for each user. Since words like 'the' will appear in almost all of them, you can reduce the total number of words by keeping only those words you consider useful for the list. I've created a list of stopwords in english and portuguese, which are words like pronouns and articles that you can eliminate from the final list of words. It's important to do this pre-processing in order to get better results. You can download my list of stopwords here. To use it, just import it like: import stopwords.
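The filtering itself is a one-liner once you have the stopword list loaded. A sketch, with a tiny illustrative subset standing in for the full english/portuguese list from the stopwords module:

```python
# a tiny illustrative subset; the real list lives in the stopwords module
STOPWORDS = set(['the', 'and', 'you', 'que', 'para', 'com', 'uma'])

def removestopwords(wordcounts):
    """Drop stopwords from a {word: count} table before clustering."""
    return dict((w, c) for w, c in wordcounts.items() if w not in STOPWORDS)
```
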

The final step is to use the list of words and the list of statuses to create a text file containing a big matrix of all the word counts for each user's statuses. I've used the pickle module to store those tables. The advantage is that you can easily load and dump the data without losing the objects' types, avoiding further parsing and processing to manage the data.
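A sketch of the save/load pair with pickle; the file layout (usernames, words, counts as a tuple) is my own choice for illustration:

```python
import pickle

def savewordmatrix(filename, usernames, words, counts):
    """Store the word-count matrix: counts[i][j] is how many times
    words[j] appears in the statuses of usernames[i]."""
    with open(filename, 'wb') as f:
        pickle.dump((usernames, words, counts), f)

def loadwordmatrix(filename):
    """Load the matrix back, with every object keeping its original type."""
    with open(filename, 'rb') as f:
        return pickle.load(f)
```
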

Step 2: The clustering algorithm

Hierarchical clustering will be used as the clustering algorithm in this article. Its main idea is to build up a hierarchy of groups by continuously merging the two most similar groups. Each of these groups starts as a single item, in this case an individual user profile. In each iteration this technique calculates the distances between every pair of groups, and the closest ones are merged together to form a new group. This is repeated until there is only one group. Figure 01 shows this process.

Figure 01 - The Hierarchical Clustering Algorithm in Action

As you can see in the figure, the similarity of the items is represented by their relative locations: the closer two items are, the more similar they are. At each step, the pair of groups that are closest together are merged to form a new group whose location is halfway between the two. After this process of forming and merging groups, the final step unifies the two remaining groups.

The next step is to define closeness. In this article i will use the Pearson correlation to determine how similar two user profiles are. Since some twitter profiles contain more entries, or much longer entries, than others, and will thus contain more words overall, the Pearson correlation will correct for this, as it tries to determine how well two sets of data fit onto a straight line. Remember that the Pearson correlation is 1.0 when two items match perfectly, and close to 0.0 when there's no relationship at all. Here i decided to use 1 minus the Pearson correlation, so that items that are more similar end up with a smaller distance between them.
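The "1 minus Pearson" distance can be written directly from the textbook formula. A self-contained sketch over two equal-length word-count vectors:

```python
from math import sqrt

def pearson(v1, v2):
    """Return 1 - Pearson correlation, so more similar profiles
    get a smaller distance (0.0 means a perfect match)."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq = sum(x ** 2 for x in v1)
    sum2sq = sum(x ** 2 for x in v2)
    psum = sum(a * b for a, b in zip(v1, v2))
    num = psum - (sum1 * sum2 / n)
    den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    if den == 0:
        return 0.0
    return 1.0 - num / den
```
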

The algorithm for hierarchical clustering begins by creating a group of clusters that are just the original items. The main loop of the function searches for the two best matches by trying every possible pair and calculating their correlation. The best pair of clusters is merged into a single cluster, and the data for this new cluster is the average of the data of the two old clusters. This process is repeated until only one cluster remains. You can see all the code for the hierarchical clustering in the function hcluster.
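For reference, a self-contained sketch of what hcluster looks like following the description above (the bicluster class is illustrative, and the pearson distance is repeated so the snippet runs on its own):

```python
from math import sqrt

def pearson(v1, v2):
    """1 - Pearson correlation, as defined earlier."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq, sum2sq = sum(x ** 2 for x in v1), sum(x ** 2 for x in v2)
    psum = sum(a * b for a, b in zip(v1, v2))
    num = psum - (sum1 * sum2 / n)
    den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    return 0.0 if den == 0 else 1.0 - num / den

class bicluster:
    """A node in the hierarchy: leaves hold the original vectors,
    inner nodes hold the average of their two children."""
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.vec, self.left, self.right = vec, left, right
        self.distance, self.id = distance, id

def hcluster(rows, distance=pearson):
    distances = {}       # cache of pairwise distances, keyed by cluster ids
    currentclustid = -1  # merged clusters get negative ids
    # every row starts out as its own cluster
    clust = [bicluster(rows[i], id=i) for i in range(len(rows))]
    while len(clust) > 1:
        lowestpair = (0, 1)
        closest = distance(clust[0].vec, clust[1].vec)
        # try every pair and remember the closest one
        for i in range(len(clust)):
            for j in range(i + 1, len(clust)):
                key = (clust[i].id, clust[j].id)
                if key not in distances:
                    distances[key] = distance(clust[i].vec, clust[j].vec)
                if distances[key] < closest:
                    closest = distances[key]
                    lowestpair = (i, j)
        # the new cluster's data is the average of the two merged ones
        a, b = clust[lowestpair[0]], clust[lowestpair[1]]
        mergevec = [(a.vec[k] + b.vec[k]) / 2.0 for k in range(len(a.vec))]
        clust[lowestpair[1]:lowestpair[1] + 1] = []
        clust[lowestpair[0]:lowestpair[0] + 1] = []
        clust.append(bicluster(mergevec, left=a, right=b,
                               distance=closest, id=currentclustid))
        currentclustid -= 1
    return clust[0]
```
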

Step 3: Showing the results

You can interpret the clusters more clearly by viewing them as a dendrogram. Hierarchical clustering results are usually viewed this way, since dendrograms pack a lot of information into a relatively small space. The idea of this graph is to display the nodes arranged into their hierarchy. The dendrogram for the example above is shown in Figure 02.

Figure 02 - The visualization of a dendrogram

This dendrogram not only uses connections to show which items ended up in each cluster, it also uses the distance to show how far apart the items were. You can see that the AB cluster is a lot closer to the individual A and B items than the DE cluster is to the individual D and E items. Rendering the graph this way can help you determine how similar the items within a cluster are, which can be interpreted as the tightness of the cluster.

In this article, in order to draw the dendrogram and save it as a JPG, i will use the Python Imaging Library (PIL).

PIL makes it very easy to generate images with text and lines, which is all you'll really need to construct a dendrogram. You can see how to draw the dendrogram in the source code.
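Before any drawing happens, you need to know how tall and how deep the final image must be. A sketch of the two helper functions that compute this from the cluster tree (the minimal node class here is only for illustration; in the real code these run over the clusters returned by hcluster):

```python
class node:
    """Minimal stand-in for a cluster node, just for illustration."""
    def __init__(self, left=None, right=None, distance=0.0):
        self.left, self.right, self.distance = left, right, distance

def getheight(clust):
    """A leaf occupies one row of the image; an inner node spans
    the rows of both of its branches."""
    if clust.left is None and clust.right is None:
        return 1
    return getheight(clust.left) + getheight(clust.right)

def getdepth(clust):
    """The depth (horizontal extent) of a node is its own distance
    plus the depth of its deeper branch; leaves have depth 0."""
    if clust.left is None and clust.right is None:
        return 0
    return max(getdepth(clust.left), getdepth(clust.right)) + clust.distance
```

With these two numbers you can scale the image and place each line with PIL's ImageDraw before saving the JPG.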

You can see the result of the clustering of my 100 friends, including me, in Figure 03. If you call the method drawdendrogram, it will generate a file called twitterclust.jpg with the dendrogram.

Figure 03 - Twitter Friends Dendrogram

Step 4: Interpreting the results

Now i'll comment on the results of the dendrogram! It gave me very interesting results, from which i could draw some relevant insights.

  1. Language: Portuguese x English
It's interesting that the algorithm could first cluster the twitter user profiles into two big groups. Based on the frequency of words in portuguese and in english, it could separate the users that write in english from the users that write in portuguese. And i'm amazed by the fact that i have an equivalent number of friends writing in portuguese and in english. See Figure 04 below.

Figure 04 - Twitter Friends Dendrogram

2. Python users

Another group formed by the algorithm is the group of python language enthusiasts, like me for example! The clustering algorithm could identify those users that write a lot about the python programming language and related stuff. As you can see by the names of the users, they are all very close: dakerfp, pythonbrasil, apyb, luciano, planetpython, gvanrossum (the creator of the Python language), pythonstuff and marcelcaraciolo (me).

Figure 05 - Python Friends Group

3. Mobile and Symbian Twitter speakers

The group formed by the mobile and symbian twitter users is also represented in the dendrogram. The algorithm grouped together the users that write about mobile and symbian (a mobile operating system). You can notice it just by looking at the user names: in general, they all have 'nokia', 'symbian' or 'mobile' in the name. Interesting, isn't it? Of course the names alone don't tell us everything; they are just a hint that a user may post a lot about mobile. You can check those users yourself: their statuses will probably contain words like 'mobile', 'cellphone', etc.

Figure 06 - Mobile Friends Group

Some tag clouds (word frequencies) of some users:


In this article i presented my first attempt to cluster/group my friends (all that i'm following) on the web service Twitter. For that, i only considered the twitter statuses of my friends, and based on what they write about (the words in the text), the presented algorithm groups the twitter user profiles.

The algorithm used here is one of the unsupervised learning techniques, called hierarchical clustering. It's powerful for clustering the data provided to it, but its drawback is that it's very slow (it consumes time and processing) since it calculates the distance metric between all the items to be clustered. The distance metric used here is the Pearson correlation.

The results shown here through the dendrograms and tag clouds give us some interesting conclusions. The algorithm successfully clustered the twitter profiles into groups, and with a simple analysis we could see the group of mobile users, the group of python users and the division between english and portuguese profiles.

Other techniques could be applied here, as well as other methods to present the results. In the next articles i will present some other techniques that i've used to cluster the twitter data, and some powerful tools to show the relationships between the clustered groups.

To download the source code that i've used here, check it here.

That's all,

Marcel Pinheiro Caraciolo

New articles coming soon!

Friday, December 11, 2009

Hi all,

Since the last time i posted here, i've been working on a Twitter Users Recommendation Tool, putting all my efforts into it. I've finally finished it, but it still needs some refinements before i can publish it as an open-source project for the community.

In my next posts, i will publish some results here with some implementations in Python that i've done during the project.

Finally, i will also publish another article about some utility tools that i've developed during the data mining project at my university, like plotting with Python the Lift, KS and ROC curves for performance evaluation in decision systems.

Stay tuned!

That's all,

Marcel Caraciolo

Know how many retweets (RT) and the user that you most retweet at Twitter! -Text Mining

Saturday, November 21, 2009

Hi all,

While playing with some collective intelligence techniques on Twitter, some friends and i developed a simple application, based on an original idea by my friend Bruno Melo with some suggestions from my friend Ricardo Caspirro, to count the number of retweets that you do on twitter and the users that you most retweet. We decided to name it BabaOvo (a portuguese word that means flatterer). The idea is to know how many retweets i've done across all my statuses and which users i retweeted (RT) the most.

The code is quite simple and we've done it all in Python! Soon i'll release an official version and the source code so you can use it on your own.
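The core counting idea can be sketched in a few lines, assuming the usual 'RT @username' retweet convention (this is only an illustration, not the BabaOvo source):

```python
import re
from collections import Counter

def count_retweets(statuses):
    """Count how many times each user appears as 'RT @user' in a
    list of status texts; returns a {username: count} Counter."""
    counts = Counter()
    for text in statuses:
        for username in re.findall(r'RT @(\w+)', text):
            counts[username] += 1
    return counts
```

From the resulting counts, a top-10 list is just `counts.most_common(10)`, which is what ends up in the pie chart below.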

I've made a test with my twitter username 'marcelcaraciolo', and i've discovered some really interesting insights: i like the user lucianans (hehe, she's my date); the users cinlug, larryolj and tarantulae are related to python development and open-source, which i like a lot; marcelobarros, telmo_mota and croozeus are associated with mobile topics, great guys to follow if you want to know about the mobile area; and finally reciferock, davidsonFellipe and macmagazine cover personal interests: rock music, blogs and the apple world!

It's amazing what you can find out about yourself using just this simple text mining technique! Data mining is a powerful area, and natural language processing is incredible if you know how to handle it. I am playing with it and soon i'll publish here some results of a project that i am doing with recommendations and Twitter networking!

For now, take a look at the pie chart that i plotted after running the babaOvo application. I've put in only the top-10 users! The chart was drawn with the Matplotlib Python framework!

BabaOvo Pie Chart : Frequency Retweets

That's all!

Marcel Caraciolo

Collaborative Filtering : Implementation with Python! [Part 02]

Friday, November 13, 2009

Hi all,

In my last article i talked about one of the information filtering (IF) techniques used to make recommendations: User-Based Collaborative Filtering. A recommendation system working with this kind of collaborative filtering requires the ratings of every user in order to build its data set. Obviously, this works well with a small number of items and users, but the real world is not so benevolent: in web stores like Amazon, with millions of users and products, it would be impracticable to compare each user against all the other users, and also against each item that each user has evaluated; it would be extremely slow.

To overcome this limitation, one approach proposed in the literature is Item-Based Collaborative Filtering. When a big data set is available, with millions of user profiles and rated items, this technique may show better results, and it allows the computation to be done in advance, so that a user demanding recommendations can obtain them quickly. This article will present this filtering technique in action, with some implementations in the Python programming language.

The procedure for filtering based on items is similar to the one i discussed earlier for user-based collaborative filtering. The basic idea is to pre-compute the most similar items for each item. Then, when the user asks for a recommendation, the system examines the items with the greatest scores and builds a ranked list, calculating a weighted average of the items similar to the ones in the user's profile. The main difference between the two approaches is that in the item-based one the comparisons between items will not change as frequently as the comparisons between users do in the user-based technique. This means it will not be necessary to evaluate the similarity between items all the time; it can be done at times when the website has low traffic, or on servers separate from the ones hosting the main application.
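The pre-computation starts by flipping the preference matrix so that items take the place of users, which lets the same similarity routines compare items instead. A sketch of the transformPrefs function used in the code below:

```python
def transformPrefs(prefs):
    """Invert a {user: {item: rating}} dict into {item: {user: rating}},
    so similarity can be computed between items instead of users."""
    result = {}
    for user in prefs:
        for item in prefs[user]:
            result.setdefault(item, {})
            # flip item and user
            result[item][user] = prefs[user][item]
    return result
```
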

To compare the items, you must define a function that builds a data set of similar items. Depending on the size of the data set, it may take some time to build it. Remember what i said earlier: you don't need to call this function every time you want a new recommendation. Instead, you build the data set once and reuse it whenever it is needed.

I implemented this function, called calculateSimilarItems; it builds a dictionary (dict) that gives, for each item, the items it is most similar to.

#Create a dictionary of items showing which other items they are most similar to.

def calculateSimilarItems(prefs, n=10):
    result = {}
    #Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        #Status updates for large datasets
        c += 1
        if c % 100 == 0:
            print "%d / %d" % (c, len(itemPrefs))
        #Find the most similar items to this one,
        #reusing topMatches from the user-based article
        scores = topMatches(itemPrefs, item, n=n, similarity=sim_distance)
        result[item] = scores
    return result

As you can see, i reused some of the implementations from the user-based collaborative filtering article. The difference now is that we iterate through the items and calculate the similarity between items. The result is a list of items, each joined with its respective most similar items.

Now we are ready to make recommendations using the data set of similarities between items, without needing to examine the whole data set. The function getRecommendedItems does all the logic; the basic idea behind it is to compare the items i haven't rated yet against the items that i have already evaluated. This comparison uses some calculations, and as a result you receive a prediction of the rating i would give the item. You can see the implementation as follows:

def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores = {}
    totalSim = {}

    #Loop over items rated by this user
    for (item, rating) in userRatings.items():

        #Loop over items similar to this one
        for (similarity, item2) in itemMatch[item]:

            #Ignore if this user has already rated this item
            if item2 in userRatings:
                continue
            #Weighted sum of rating times similarity
            scores.setdefault(item2, 0)
            scores[item2] += similarity * rating
            #Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity

    #Divide each total score by total weighting to get an average
    rankings = [(score / totalSim[item], item)
                for item, score in scores.items()]

    #Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings

Let's use the same data set as in the last article, the book data set; the target is to recommend some books to users based on their preferences. To download the data set, and to see how to load it into your application, see the article about User-Based Collaborative Filtering that i've written.

As you can see, the result will be a ranked list of tuples with (predicted_rate, book) as values. So you can see not only the books that you would probably like and haven't read, but also an estimated rating that you would give each one. Companies all around the world are working on improving the prediction of item ratings; there was even a challenge sponsored by NetFlix with a $1,000,000 prize for improving the prediction accuracy of its recommendations by 10%. It's awesome!
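To see the whole pipeline end to end, here is a toy run on a tiny made-up ratings dict. The similarity measure is a simple inverse-of-euclidean score (not the one from the book data set article) just to keep the example self-contained, and the book titles and users are invented:

```python
def sim_distance(prefs, it1, it2):
    """Toy similarity: inverse of squared rating differences over shared raters."""
    shared = [u for u in prefs[it1] if u in prefs[it2]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[it1][u] - prefs[it2][u]) ** 2 for u in shared)
    return 1.0 / (1.0 + sum_sq)

def transformPrefs(prefs):
    """Flip {user: {item: rating}} into {item: {user: rating}}."""
    result = {}
    for user in prefs:
        for item in prefs[user]:
            result.setdefault(item, {})[user] = prefs[user][item]
    return result

def calculateSimilarItems(prefs, n=10):
    """Pre-compute, for each item, its n most similar items."""
    itemPrefs = transformPrefs(prefs)
    result = {}
    for item in itemPrefs:
        scores = sorted(((sim_distance(itemPrefs, item, other), other)
                         for other in itemPrefs if other != item), reverse=True)
        result[item] = scores[:n]
    return result

def getRecommendedItems(prefs, itemMatch, user):
    """Rank unseen items by a similarity-weighted average of the user's ratings."""
    userRatings = prefs[user]
    scores, totalSim = {}, {}
    for item, rating in userRatings.items():
        for similarity, item2 in itemMatch[item]:
            if item2 in userRatings:
                continue  # already rated, nothing to predict
            scores[item2] = scores.get(item2, 0) + similarity * rating
            totalSim[item2] = totalSim.get(item2, 0) + similarity
    rankings = [(score / totalSim[item], item)
                for item, score in scores.items() if totalSim[item] > 0]
    rankings.sort(reverse=True)
    return rankings

prefs = {
    'ana':   {'Dive into Python': 5.0, 'Programming Python': 4.0},
    'bruno': {'Dive into Python': 4.0, 'Programming Python': 4.5, 'Learning Perl': 1.0},
    'carla': {'Learning Perl': 4.5, 'Programming Python': 2.0},
}
itemsim = calculateSimilarItems(prefs)
print(getRecommendedItems(prefs, itemsim, 'ana'))
```

Since ana rates the two python books highly and they correlate poorly with the perl book, the single recommendation she gets is 'Learning Perl' with a predicted rating blended from her existing ratings.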

So that's it, you now have a simple working demo of some of the most popular information filtering (IF) techniques for recommending new items to users. Play with the library that i provided here and load up some new datasets; it will be a lot of fun and you will learn more about recommendation systems. There are other types of algorithms for recommendation systems, and soon i will provide implementations of them too.

To sum up, it's important to notice that item-based collaborative filtering is significantly faster than the user-based one, especially when you want to extract a recommendation list from a big amount of data, though don't forget the extra time needed to maintain the item similarity table. Another difference between the two techniques is a precision difference related to whether the available data set is "sparse". In the data set we worked with here, for the book recommendations, it would be unlikely to find two users with the same list of evaluated books, so almost every recommendation is based on a small set of users, which implies a sparse data set. Item-based collaborative filtering generally performs better than the user-based one on a sparse data set, but on a dense one their performance is almost the same.

You may also have noticed that the user-based collaborative filtering technique is easier to implement and doesn't require extra steps, so it's generally recommended for small data sets that can be kept in memory and that change frequently. Finally, in some applications, showing people other users with similar preferences also has some value, instead of only recommending items.

In the next articles, i will present some ideas that i am developing, specifically related to finding similar groups of users using some clustering algorithms. I hope you enjoyed it. To download all the implementations provided here, click here.

Thanks! If you have any doubts, suggestions or recommendations, feel free to ask!

See you next time,

Marcel Caraciolo