Pages

Playing with the Twitter Data: Clustering/Grouping my friends based on their statuses with Python

Tuesday, December 29, 2009


Hi all,


As i said at my last post, i will begin to post some articles about some approaches that i developed in order to find new users from the web service Twitter (a real-time micro-blogging and social network web service that the user can post messages up to 140 characters). The main goal is to create a recommender system that could find new users that share the same tastes and preferences as me.


These articles will introduce some basic concepts about data clustering analysis, a method used to discover and visualize things, people or ideas that have close relations. For this, it will be presented how to gather and prepare all data provided from the Twitter and show some particular clustering algorithms associated with well-known distance measures. Some graphs will also be shown as part of graphic visualization tools in order to observe the clusters created. Developing those clustering algorithms helped me to understand how i could design my recommendation algorithm using a special distance measure giving as result a score for a specified user of the social network.


Before exploring this topic, it's important to show to the reader the difference between supervised and unsupervised learning. The supervised learning techniques use data and expected results in order to "learn" how to extract new information and produce a result based on what he has learned until that moment. However, the unsupervised learning like methods like clusterings they are different from a neural network or a decision tree. Those types of algorithms aren't trained with expected results. Its purpose is to find a structure in the data set provided and none of this patterns is the expected result. The goal is to use the data to find distinct groups that may exist. In this article and next ones, we will explore some of the unsupervised learning techniques like the hierarchical clustering and the K-means.


Now, let's get it started by exploring the twitter user profiles, and show based on their statuses updates (text messages), how they can be grouped in accordance to their statuses (text) and also the words based on their use.


In the following steps, i will present how i captured the data, prepared it and clustered it and finally presenting some interesting results.


Step 1: Getting the Twitter Data


The first step is to fetch all the twitter data. For this article, i decided to analyse my twitter social network. So i only focused on my profile and my friends (following) profile. The goal is to identify based on my twitter statuses and as my friends how we are grouped. So the expected result is that friends that write similar content to me will get closer and friends that write different posts will be far from me. To get all data i've used the python-twitter library. It's a open-source python interface for the Twitter API and extremely useful to access the twitter data. For further information about see its official code project homepage.


For this experiment, i've used the subset of 100 friends randomly picked from my friends social network and the clustered data will be the number of times that a particular set of words appears in each user's twitter statuses . A small subset of what of this looks like is shown in Table 01.


Table 01 - The frequencies of words of user profiles


By clustering user profiles based on word frequencies, it may be possible to verify if there are groups of user profiles that frequently write about similar subjects or write in similar styles (e.g. english or portuguese languages). To generate this dataset, you'll be downloading the twitter statuses from a set of users, extracting the text from the entries, and creating a table of word frequencies.


For downloading the statuses from Twitter, i've developed a customized version of the python-twitter library. In this module, i've developed some new functions in order to get the social network of my friends in Twitter. I recommend that you download this module and use it. To download the twitterT.py click here.


After that, i've played around with the Twitter API in order to get my statuses and get my friends statuses from the twitter social network. The function getFriendsIds is responsable to get the user ids from the twitter including mine. You can see the snippet code below.





The next step is to create a function that will extract all the words from the statuses. In this article i've downloaded up to last 100 statuses from each friend on a set of 100 user ids. The next snippet shows how i managed to get this done.





The next step is to generate the list of words that will actually be used in the counts for each blog. Since words like 'the' will appear in almost all of them, you can reduce the total number of words included by selecting only those word that you consider viable to appear in the list of words. I've created a list of stopwords in english/portuguese language which are words like pronoums and articles that you can eliminate from your granted list of words. It's important to do this pre-processing in order to get better results. You can download my list of stopwords here. To use it just import it like: import stopwords.




The final step is to use the list of words and the list of statuses to create a text file containing a big matrix of all word counts for each of the user statuses. I've used the module pickle to store those tables. The advantage is that you can easily load and dump the data without losing the type of the object and avoid further parsing and processing to manage the data.



Step 2: The clustering algorithm


The Hierarchical clustering will be used as the clustering algorithm in this article. Its main idea is to build up a hierarchy of groups by continuously merging the two most similar groups. Each of these groups starts as a single item, in this case an individual user profile. In each interaction this technique calculates the distances between every pair of groups, and the closest ones are merged together to form a new group. This is repeated until there is only one group. Figure 1 shows this process.


Figure 01 - The Hierarchical Clustering Algorithm in Action




As you can see at the figure, the similarity of the items is represented by their relative locations- the closer two items are, the more similar they are. As you can see each pair of groups (the closest together) are merged to form a new group whose location is halfway between two. After the process of forming and merging groups, the final step unifies the two remaining groups.


The next step is define the closeness. In this article i will use the Pearson correlation to determine how similar two user profiles are. Since some twitter statuses contain more entries or much longer entries than others, and will thus contain more words overall, the Pearson correlation will correct for this, trying to determine how well two sets of data fit onto a stright line. Remember that the Pearson correlation is 1.0 two items match perfectly, and it's close 0.0 when there's no relationship at all. Here we decided to use 1 minus the pearson correlation since we wanted to create a smaller distance between items that are more similar.


The algorithm for hierarchical clustering begins by creating a group of clusters that are just the original items. the main loop of the function searches for the two best matches by trying every possible pair and calculating their correlation. The best pair of clusters is merged into a single cluster. The data for this new cluster is the average of the data for the two old clusters. This process is repeated until only one cluster remains. You can see all the code for the hierarchical clustering in the function hcluster.



Step 3: Showing the results


You can interpret the clusters more clearly by viewing them as dendrogram. Hierarchical clustering clustering results are usually viewed this way, since dendrograms pack a lot of information into a relatively small space. The idea of this graph is display the nodes arranged into their hierarchy. The dendrogram for the example above is shown in Figure 02.



Figure 02 - The visualization of a dendrogram


This dendrogram not only uses connections to show which items ended up in each cluster and it also uses the distance to show how far the items were. You can see that the AB cluster is lot closer to the individual A and B items than the DE cluster is to the individual D and E items. Rendering the graph this way can help you determine how similar the items within a cluster are, which could be interpreted as the tightness of the cluster.


In this article in order to draw and save as JPG the dendrogram, i will use the Python library (PIL) which is available at http://pythonware.com

The PIL makes it very easy to generate images with text and lines, which is all you'll really need to construct a dendrogram. To see the respective code in how to draw the drendrogram see it at the code twitterClustering.py.


You can see the result of the clustering of my 100 friends including me here at the figure 03. If you call the method drawdendrogram, it will generate a file called twitterclust.jpg with the dendrogram.


Figure 03 - Twitter Friends Dendrogram




Step 4: Interpreting the results



Now i'll comment the results of the dendrogram! It gave me very interesting results which i could conclude some relevant insights.

  1. Language: Portuguese x English
Interesting that the algorithm could cluster the twitter user profiles into two big groups at first. Based on the frequency of the words in portuguese and english, it could separate users that write text in english language and users that write text in portuguese. And i'm amazed by the fact that i have a equivalent amount of friends that write in portuguese and in english. See the Figure 04 below.


Figure 04 - Twitter Friends Dendrogram





2. Python users

Another group formed by the algorithm is the group of python language enthusiasts. Like me for example! The clustering algorithm could identify those users that writes a lot about the programming language python and related stuff. As you can see by the name of the users, they all are very close: dakerfp, pythonbrasil, apyb, luciano, planetpython, gvanrossum (The creator of the Python language) , pythonstuff and marcelcaraciolo (me).

Figure 05 - Python Friends Group

























3. Mobile and Symbian Twitter speakers

The group formed by the mobile and symbian twitter users are also represented at the
dendrogram. The algorithm grouped all together the user that writes about mobile and
symbian (mobile operational system). You can notice just looking at the user names.
In general, they all have in the name nokia or symbian or mobile. Interesting isn't ?
Of course the names tell us anything, it's just a hint that could be a user that post statuses
a lot about mobile. You can check yourself that users, their statuses probably will have some
word written with 'mobile', 'cellphone', etc.

Figure 06 - Mobile Friends Group


























Some tags cloud (word frequencies) of some users:












Conclusions:


In this article i presented my first attempt to cluster/group my friends (all that i'm following)

on the webservice Twitter. For that, i only consider the twitter statuses of my friends, and based

on what they write about (words in the text), the presented algorithm will group the twitter

user profiles .

The algorithm here used is one of the unsupervised learning techniques called hierarchical

clustering. It's powerful to cluster the data provided to him, but its drawback is that it's

very slow (consumes time and processing) since it calculates the distance metric between

all the items to be clustered. The distance metric used here is the Pearson Correlation.

The results shown here through the dendrograms and tag clouds can give us some interesting

conclusions. The algorithm successfully clustered the twitter profiles into groups and with

a simple analysis we could see the group of mobile users, the python users and the division

between english/portuguese profiles.

Other techniques could be applied here as same as other methods to present the results.

In the next articles i will present some other techniques that i've used to cluster the twitter

data and some powerful tools to show the relationship between the clustered groups.


To download the source-codes that i've used here. Check it here.


That's all,


Marcel Pinheiro Caraciolo

New articles coming soon!

Friday, December 11, 2009

Hi all,

Since the last time i posted here, during that time until now, i've been working on a Twitter Users Recommendation Tool, putting all my efforts on it. Finally, i finished it, but it needs some refinements to publish it as an open-source project for the community.

In my next posts, i will publish some results here with some implementations in Python that i've done during the project.

Finally, i will publish also other article about some utility tools that i've developed during the data mining project at my university like plotting with Python the Lift, KS and ROC curves for performance evaluation in decision systems.

Stay tunned !

That's all,

Marcel Caraciolo