
Playing with the Twitter Data [Final]: Clustering/Grouping my friends with Python!

Sunday, February 14, 2010

Hi all,

One of the problems I'm now facing with Twitter is how to organize my friends (the users that I follow) in a way that I can easily read their posts based on similar interests or keywords.

The idea is to identify common groups of users and look at a segmentation based on user biography keywords. That biography information would allow us to segment Twitter users into groups of similar interests, professions and qualities.

I could do this manually, checking each user and their last statuses and organizing them in lists according to their history on Twitter. But since I have more than 100 friends, it would be painful to check and place them one by one.

So far we have seen various examples of using analytics to gain insights from Twitter, and cluster analysis is one of my favorites. By doing cluster analysis, it is possible to reveal similar Twitter users and, as a result, to create Twitter Lists based on the groups generated by the clustering algorithm.

Using text mining and some machine learning techniques, I'll present a simple tool to get all your Twitter friends and organize them in lists segmented by similar behavior.

Step 1: Getting the Twitter User Data

In this part, I'll show the code written in Python for getting the Twitter user data. I used the python-twitter library with some custom modifications, allowing me to download each Twitter user's data and their last statuses/posts. In order to get my friends' statuses, I've developed some functions and custom threads, accelerating the process of downloading the user profiles and dumping them into data files.

Take a look at the code provided in TwitterCollector.py. Due to Twitter's rate limits (150 requests per hour), depending on how many friends you have, it may not be possible to fetch all of them. So for this article, I'll be using a subset of 100 friends randomly picked from my Twitter user profile ('marcelcaraciolo').

The number of statuses selected will be up to the last 200 statuses from each user. I chose this number because I thought it was a good amount to summarize a user's behavior on Twitter. You can change this number if you like, but be aware that the script may fail during the process because of the request limits of the Twitter web service.
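Just to illustrate the fetching step, here is a minimal sketch using the stock python-twitter calls (the version used in the article has custom modifications, and the credentials and the 100-friend cut are placeholders):

import twitter

# basic-auth credentials (placeholders); 2010-era python-twitter API
api = twitter.Api(username='marcelcaraciolo', password='your-password')

friends = api.GetFriends()              # the users I follow
profiles = {}
for friend in friends[:100]:            # subset to respect the rate limit
    # up to the last 200 statuses of each friend
    statuses = api.GetUserTimeline(friend.screen_name, count=200)
    profiles[friend.screen_name] = [s.text for s in statuses]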

The Twitter Collector is really simple and effective. There is a pool of threads (TwitterCollector) where each one grabs the statuses of a user and then places them into a queue. This queue is a repository for holding the statuses that will be parsed. The parsing involves extracting the words from what the user has written in their posts and removing the stop words (using some natural language processing techniques). Another pool of threads (TwitterMiner) joins on this queue and is responsible for doing the parsing work on the statuses.

After a TwitterMiner finishes its job, it places the result into another queue, from where it is dumped into a data file, in this case handled by the 'pickle' module. The pickle module is very efficient at storing and retrieving (serializing) objects in data files.
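To make the pipeline concrete, here is a simplified sketch of that producer/consumer design. The class names follow the article, but the bodies are illustrative rather than the exact code in TwitterCollector.py; the stop-word list and the regular-expression tokenizer are placeholders.

import pickle
import re
import threading
from Queue import Queue      # 'queue' on Python 3

STOP_WORDS = set(['the', 'a', 'and', 'of', 'to', 'in'])   # toy stop-word list

status_queue = Queue()   # raw statuses waiting to be parsed
result_queue = Queue()   # (user, words) pairs ready to be pickled

class TwitterCollector(threading.Thread):
    """Grabs the statuses of each user and puts them into the queue."""
    def __init__(self, users, api):
        threading.Thread.__init__(self)
        self.users, self.api = users, api

    def run(self):
        for user in self.users:
            statuses = self.api.GetUserTimeline(user, count=200)
            status_queue.put((user, [s.text for s in statuses]))

class TwitterMiner(threading.Thread):
    """Parses statuses: extracts the words and removes stop words."""
    def run(self):
        while True:
            user, texts = status_queue.get()
            words = [w.lower() for text in texts
                     for w in re.findall(r'\w+', text)
                     if w.lower() not in STOP_WORDS]
            result_queue.put((user, words))
            status_queue.task_done()

def dump_profiles(filename):
    """Drains the result queue and serializes the profiles with pickle."""
    profiles = {}
    while not result_queue.empty():
        user, words = result_queue.get()
        profiles[user] = words
    with open(filename, 'wb') as f:
        pickle.dump(profiles, f)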


Step 2: Building the user profiles


Now let's start building the user profiles. In this step we'll build a profile for each user fetched in the previous step. The user profile is summarized by the statuses collected: for each user there is a table of word frequencies. This table will allow us to check whether there are groups of users that frequently write about similar subjects or write in similar styles (e.g. in English or Portuguese).

For instance, consider three users: me ('marcelcaraciolo'), who writes mostly about Python and data mining, 'symbian', who writes mostly about mobile and Symbian stuff, and the user 'parties', who writes about beer and parties around the world. After a clustering analysis, it is likely that 'symbian' and I will end up in the same group, while 'parties' will be placed in another group, due to the dissimilarity in our behavior on Twitter.

The code provided in TwitterOrganizer.py is responsible for opening the data file with the statuses collected in the first step and building the user profiles by creating a table of word frequencies for each user.
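As a rough illustration (the real logic lives in TwitterOrganizer.py; the function names here are mine), the profile table can be built by loading the pickled word lists, counting occurrences per user, and then turning the counts into rows over a shared vocabulary for the clustering step:

import pickle

def build_profiles(filename):
    """Returns {user: {word: count}} from the data file of step 1."""
    with open(filename, 'rb') as f:
        word_lists = pickle.load(f)     # {user: [word, word, ...]}
    word_counts = {}
    for user, words in word_lists.items():
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        word_counts[user] = counts
    return word_counts

def build_matrix(word_counts):
    """Builds (users, wordlist, rows) for the clustering step."""
    wordlist = sorted(set(w for counts in word_counts.values() for w in counts))
    users = list(word_counts)
    rows = [[word_counts[u].get(w, 0) for w in wordlist] for u in users]
    return users, wordlist, rows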




Step 3: Clustering Analysis

Now it's time to discover the groups of users that exhibit similar behavior. I've previously presented some techniques for clustering analysis, such as hierarchical clustering and multidimensional scaling, but both of them have a couple of disadvantages, especially regarding visualization and computational cost. An alternative method is K-means clustering. The main difference from hierarchical clustering is that the number of distinct clusters is specified in advance.

Let's talk a little bit about how K-means works.

The idea of K-means clustering is to determine the size of the clusters based on the structure of the data. It begins with k randomly placed centroids (points in space that represent the center of a cluster) and assigns every item to the nearest one. After the assignment, each centroid is moved to the average location of all the items assigned to it, and the assignments are redone. These steps repeat until the assignments stop changing. The result is a set of clusters within the ranges of each of the variables. The number of iterations needed to produce the final result is quite small compared to hierarchical clustering. It's important to notice that since K-means starts with random centroids, the results returned will usually differ from run to run; this happens because of the initial locations of the centroids. The chosen number of clusters (k) also affects the results.
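As a reference for the loop just described, here is a minimal K-means sketch in pure Python over the rows built in step 2 (Euclidean distance; illustrative, not necessarily the exact code used in the article's scripts):

import math
import random

def euclidean(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def kmeans(rows, k=15, max_iter=100):
    """Returns k clusters, each a list of row indices."""
    # random centroids within the range of each variable
    ranges = [(min(row[i] for row in rows), max(row[i] for row in rows))
              for i in range(len(rows[0]))]
    centroids = [[random.uniform(lo, hi) for lo, hi in ranges]
                 for _ in range(k)]
    last_matches = None
    for _ in range(max_iter):
        matches = [[] for _ in range(k)]
        # assign every row to its nearest centroid
        for j, row in enumerate(rows):
            best = min(range(k), key=lambda i: euclidean(centroids[i], row))
            matches[best].append(j)
        if matches == last_matches:
            break                        # assignments stopped changing
        last_matches = matches
        # move each centroid to the average of the rows assigned to it
        for i in range(k):
            if matches[i]:
                centroids[i] = [sum(rows[j][d] for j in matches[i]) / float(len(matches[i]))
                                for d in range(len(rows[0]))]
    return matches

With the users list from step 2, clusters = kmeans(rows, k=15) gives, for each cluster, the indices of the users (and therefore the Twitter screen names) that belong to it.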

You can see more about the K-means process here at this link.

I've run the algorithm a couple of times and selected one of the partitions generated by K-means over the Twitter user profiles built in step 2. The k selected for this example is 15. To present the results, I've decided to make things more interesting: I've used the Ubigraph framework, a Python library for displaying graphs and networks in 3D. As you can see in the video below, the clusters are generated during the K-means process. Each node represents a Twitter user, and each cluster is differentiated from the others by its color.
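For the curious, pushing the result to Ubigraph is quite direct over its XML-RPC interface. The sketch below assumes the ubigraph_server is running on its default port, and the attribute names ('shape', 'color', 'label') follow Ubigraph's documentation rather than my exact script:

import xmlrpclib      # 'xmlrpc.client' on Python 3

COLORS = ['#ff0000', '#00ff00', '#0000ff', '#ffff00', '#ff00ff']

server = xmlrpclib.Server('http://127.0.0.1:20738/RPC2')
G = server.ubigraph
G.clear()

def draw_clusters(users, clusters):
    for i, cluster in enumerate(clusters):
        color = COLORS[i % len(COLORS)]
        centroid = G.new_vertex()                    # the unlabeled cube
        G.set_vertex_attribute(centroid, 'shape', 'cube')
        for j in cluster:                            # j is a row/user index
            node = G.new_vertex()
            G.set_vertex_attribute(node, 'label', users[j])
            G.set_vertex_attribute(node, 'color', color)
            G.new_edge(centroid, node)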


After building the graph, I decided to play some more with the Twitter data. At the end of the video playback, you can see that I'm clicking on some cluster centroids (represented by the non-labeled cubes): it shows the most frequent words for each group. It's notable that the users are joined together exactly because of the frequency of those words in their user profiles. This will be useful for the last step of this article: creating the Twitter Lists.


Step 4: Creating the Twitter Lists


Now that I have my friends grouped by the similar subjects they post about on Twitter, it's time to make this clustering useful. I've decided to create Twitter Lists, a feature recently launched by Twitter. The idea is to create groups of users that the owner, and other users who are interested in the topics discussed by the members of the list, can follow. It's a handy tool for organizing your friends so you can read their statuses based on what they post. For instance, I could group all my friends that talk about Python in a list named 'PythonUsers'.

Based on the clustering results, I'll create the lists with the members shown in each cluster. Since the latest release of the python-twitter library doesn't support the Lists API, I've decided to create a wrapper on my own. TwitterListAPI is a simple Python wrapper for the Twitter API that handles some operations with lists, such as creating, updating and removing lists and adding/removing users from a list. You can check the implementation in the module twitterList.py.
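Just to give a feeling of the wrapper, here is a rough sketch of how such calls could look against the Lists endpoints of the REST API as documented at the time (basic auth); the endpoint paths and parameters below are my assumptions for illustration, not a copy of twitterList.py:

import base64
import urllib
import urllib2

class TwitterListAPI(object):
    def __init__(self, username, password):
        self.username = username
        auth = base64.b64encode('%s:%s' % (username, password))
        self.headers = {'Authorization': 'Basic %s' % auth}

    def _post(self, url, params):
        request = urllib2.Request(url, urllib.urlencode(params), self.headers)
        return urllib2.urlopen(request).read()

    def create_list(self, name, description='', mode='public'):
        url = 'http://api.twitter.com/1/%s/lists.json' % self.username
        return self._post(url, {'name': name,
                                'description': description,
                                'mode': mode})

    def add_member(self, list_id, user_id):
        url = 'http://api.twitter.com/1/%s/%s/members.json' % (self.username, list_id)
        return self._post(url, {'id': user_id})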


You can run it after running the clustering algorithm. The methods readFile() and readFiles() are responsible for reading the files generated as the result of the last step. They contain the name of the new list that you want to create, the description of the list (which is optional) and, finally, the ids of the users that you would like to add to the list. If you don't have a user's ID, don't worry: you can use the method GetUserID() to get this information for each username. You may also want to edit those files to your preference in order to remove or add users, or to change the list name or description.
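A hypothetical usage sketch (the actual layout of the generated files may differ; here I assume the list name on the first line, the description on the second and one user id per remaining line, and I gloss over how the list slug is obtained from the creation response):

def read_list_file(path):
    lines = [line.strip() for line in open(path) if line.strip()]
    name, description, user_ids = lines[0], lines[1], lines[2:]
    return name, description, user_ids

api = TwitterListAPI('marcelcaraciolo', 'your-password')
name, description, user_ids = read_list_file('cluster01.txt')
api.create_list(name, description)
for user_id in user_ids:
    api.add_member(name, user_id)    # assumes the list slug equals the name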

I've run it with my Twitter friends, and you can see in the figure below one of the lists created after the clustering algorithm.



Conclusion

So, this is my initial attempt at managing my Twitter friends and playing with Twitter data by clustering users based on their statuses. It combines an initial study of text mining with a machine learning technique, the K-means clustering algorithm, to group the user profiles, in this case summarized by user statuses and word frequencies. I believe there's more work to do, especially running other clustering algorithms and distance metrics to compare the results. I also think that k (the number of clusters) should be found automatically, in a way that balances the trade-off between the number of lists and the number of friends: a group with many friends posting about unrelated topics is not useful, and neither are many lists with only a few members each. The best value for k is a subject for future studies. Other good results of this work are the use of 3D techniques to show the clustering process and the new API for handling Twitter Lists. Finally, the pre-processing of the statuses will be expanded in order to improve the identification of topics and to reduce the number of topic variables (dimensionality), an issue for machine learning techniques due to heavy processing and time costs.


You can download all the code used in this article here.

That's all,

Special thanks to Luciana Nunes and Ricardo Caspirro for their support during this work!

Marcel Pinheiro Caraciolo