Pages

Playing with the Twitter Data [Part 02]: Clustering/Grouping my friends based on their statuses with Python

Wednesday, January 13, 2010


In this article i'll continue the saga of clustering and grouping my friends on twitter in order to obtain the degree of relationship or connection between what i post on Twitter (my statuses subjects) and what my friends post. In the first article i presented the hierarchical clustering method with its result represented by a dendrogram.

The dendrogram is a stylized tool for visualization of data in two dimmensions. But we have other types of visualization for data in 2D. In this article i will present a technique called multidimensional scaling, which will be useful to find a two-dimmensional representation of the dataset. This algorithm will take into account the difference between every pair of items (in this context friends) and tries to make a chart in which the distances between every pair of items match those differences.

Let's begin!

The algorithm first calculates the target distances between all the items (in this scenario, it would be the twitter user profiles). The Pearson correlation was used to compare the items. An example of this is shown in Table 01.

Table 01. The distance matrix

Next all the twitter user profiles represented by their usernames are placed randomly on the two-dimmensional chart, as show in Figure 01.


Figure 01. Initial position of the items

The current distances between all the items are calculated using the actual distance as shown in Figure 2.


Figure 02. Distance between items


Now for every pair of items, the target distance is compared to the current one and an error value is calculated. Every username is moved a small amount closer or further in proportion to the error between the two usernames. Figure 3 shows the force acting on the user marcelcaraciolo . The distance between marcelcaraciolo and nokia is 0.5 , but the target distance is only 0.2, so marcelcaraciolo has to be moved closer to nokia. At the same time, marcelcaraciolo is also being pushed away by pythonstories and acelera because it's too close.


Figure 03. Forces acting on the user marcelcaraciolo

During the algorithm process, every node is moved according to the combination of all other nodes pushing or pulling on it. The difference between the current distances and the target distances tends to get a bit smaller. The procedure is repeated many times until the total amount of error cannot be reduced by moving the usernames any more.

You can see all the code for the algorithm and the chart plotting (a image) at the code available for download at the end of this article.

I decided to test with my friends (whom i follow) on Twitter social network. The Figure 04 shows the outcome of the algorithm presented here. I've used only a subset of friends (97) and each friend represented by the 200 or less statuses written. You can see that the clustering doesn't break out quite well as they do on the dendrogram, but you can still clearly see some topical grouping, such as the users who writes posts in english (upper part of the graph) and the ones who writes posts in portuguese (lower part of the graph). Other interesting conclusion is that my username (rounded by a circle) 'marcelcaraciolo' is in the middle of the chart. It came to my mind that i write some posts in english/portuguese (i think that's why i'm in the middle) and that i'm close to some users that writes about python (my right side) and some users that write about startups, ideas and projects (some users like srlm, alexodrosgomes, marshallclark and others).


From users that doesn't write about the main topics that i write on twitter like mobile programming, python and startups and ideas, these ended up very far away from the my username. You can see some other groups like the people that talk about symbian and nokia very close to each other(upper and right). If you have this representation done in 3D, the clusters would be even better, but obviously this would be difficult to visualize on paper.


As you can see, using the multidimensional scaling algorithm is an effective way to take a dataset and view it in a easy way to interpret it. It's important to notice that some information is lost during the process of scaling, but the result should help you understand the algorithms better. It's a trade-off that you can live with depending on the details of the clusters/groups that you wanna find.

See all the code used in this article here.

That's all !

Wait for my next article, i will talk about k-means applied for this task with some 3D graph results!

Marcel Caraciolo


1 comment: