As I said in my last post, I am starting a series of articles about some approaches that I developed to find new users on Twitter (a real-time micro-blogging and social networking service where users can post messages of up to 140 characters). The main goal is to create a recommender system that can find new users who share my tastes and preferences.
These articles will introduce some basic concepts of data clustering, a method used to discover and visualize things, people or ideas that are closely related. To that end, I will show how to gather and prepare the data provided by Twitter and present some clustering algorithms together with well-known distance measures. Some graphs will also be shown as visualization tools for inspecting the clusters created. Developing these clustering algorithms helped me understand how I could design my recommendation algorithm, which uses a special distance measure to produce a score for a given user of the social network.
Before exploring this topic, it's important to explain the difference between supervised and unsupervised learning. Supervised learning techniques, such as neural networks and decision trees, use data together with expected results in order to "learn" how to extract new information, and they produce a result based on what they have learned up to that moment. Unsupervised learning methods such as clustering are different: they are not trained with expected results. Their purpose is to find structure in the data set provided, and none of the patterns found is an expected result. The goal is to use the data to find distinct groups that may exist. In this article and the next ones, we will explore some unsupervised learning techniques, such as hierarchical clustering and K-means.
Now, let's get started by exploring Twitter user profiles and showing how they can be grouped according to the text of their status updates, and how words can be grouped according to their use.
In the following steps, I will show how I captured the data, prepared it and clustered it, and finally present some interesting results.
Step 1: Getting the Twitter Data
The first step is to fetch all the Twitter data. For this article, I decided to analyse my own Twitter social network, so I focused only on my profile and the profiles of my friends (the users I follow). The goal is to see how my friends and I are grouped based on our Twitter statuses. The expected result is that friends who write content similar to mine will end up close to me, and friends who write different posts will end up far from me. To get all the data I used the python-twitter library. It's an open-source Python interface for the Twitter API and extremely useful for accessing Twitter data. For further information, see its official project homepage.
For this experiment, I used a subset of 100 friends randomly picked from my social network, and the data to be clustered is the number of times that each word in a particular set appears in each user's Twitter statuses. A small subset of what this looks like is shown in Table 01.
By clustering user profiles based on word frequencies, it may be possible to verify whether there are groups of user profiles that frequently write about similar subjects or write in similar styles (e.g. in English or in Portuguese). To generate this dataset, you'll be downloading the Twitter statuses from a set of users, extracting the text from the entries, and creating a table of word frequencies.
To download the statuses from Twitter, I developed a customized version of the python-twitter library. In this module, I added some new functions to get the social network of my friends on Twitter. I recommend that you download this module and use it. To download twitterT.py, click here.
After that, I played around with the Twitter API to get my statuses and my friends' statuses from the Twitter social network. The function getFriendsIds is responsible for getting the user ids from Twitter, including mine. You can see the code snippet below.
The next step is to create a function that will extract all the words from the statuses. In this article I downloaded up to the last 100 statuses from each friend in a set of 100 user ids. The next snippet shows how I managed to get this done.
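As an illustration only (this is not the code from twitterT.py), word extraction from a list of status texts could look roughly like the sketch below. The function names get_words and count_words are hypothetical, and dropping URLs, @mentions and one-letter words is an assumption about what reasonable cleaning looks like, not something the original module is known to do.

```python
import re
from collections import Counter

def get_words(text):
    """Split one status into lowercase words."""
    # Drop URLs and @mentions so they do not end up in the vocabulary
    # (an assumption for this sketch).
    text = re.sub(r"https?://\S+|@\w+", " ", text)
    # Keep runs of letters only; the pattern is Unicode-aware, so accented
    # Portuguese letters survive.
    words = re.findall(r"[^\W\d_]+", text)
    # Ignore one-letter words, which carry little information.
    return [w.lower() for w in words if len(w) > 1]

def count_words(statuses):
    """Count how often each word appears across all statuses of one user."""
    counts = Counter()
    for status in statuses:
        counts.update(get_words(status))
    return counts
```

Calling count_words on the list of status texts of each user gives one word-count table per user, which is the raw material for the steps that follow.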
The next step is to generate the list of words that will actually be used in the counts for each user. Since words like 'the' will appear in almost all of them, you can reduce the total number of words by keeping only those words that you consider worth including. I created a list of English and Portuguese stopwords, which are words like pronouns and articles that you can eliminate from your final list of words. This pre-processing is important for getting better results. You can download my list of stopwords here. To use it, just import it: import stopwords.
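A minimal sketch of this filtering step, under stated assumptions: STOPWORDS below is a tiny made-up sample standing in for the real stopwords module, build_word_list is a hypothetical name, and the min_users threshold (keep a word only if enough different users wrote it) is an extra filter added for illustration.

```python
# Tiny illustrative sample; the real stopword list is much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in",
             "o", "os", "as", "um", "uma", "de", "que"}

def build_word_list(user_counts, stopwords=STOPWORDS, min_users=2):
    """Return the sorted vocabulary to use as matrix columns.

    user_counts maps a user name to a dict of word -> count.
    A word is kept if it is not a stopword and was written by at
    least min_users different users.
    """
    users_per_word = {}
    for counts in user_counts.values():
        for word in counts:
            users_per_word[word] = users_per_word.get(word, 0) + 1
    return sorted(word for word, n in users_per_word.items()
                  if n >= min_users and word not in stopwords)
```

Words written by a single user cannot bring two profiles closer together, which is why a threshold like min_users can shrink the table without losing much.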
The final step is to use the list of words and the list of statuses to create a file containing a big matrix of all word counts for each user's statuses. I used the pickle module to store those tables. The advantage is that you can easily dump and load the data without losing the type of the object, which avoids further parsing and processing.
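The sketch below shows one way the matrix could be built and stored with pickle. The function names and the dictionary layout written to disk are assumptions made for this example; the original files may be organised differently.

```python
import pickle

def build_matrix(user_counts, word_list):
    """Return (users, rows): one row of word counts per user."""
    users = sorted(user_counts)
    rows = [[user_counts[user].get(word, 0) for word in word_list]
            for user in users]
    return users, rows

def save_table(path, users, word_list, rows):
    """Dump the whole table to disk in one pickle file."""
    with open(path, "wb") as handle:
        pickle.dump({"users": users, "words": word_list, "rows": rows}, handle)

def load_table(path):
    """Load the table back as the same Python objects, no parsing needed."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.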
Step 2: The clustering algorithm
Hierarchical clustering is the clustering algorithm used in this article. Its main idea is to build up a hierarchy of groups by continuously merging the two most similar groups. Each of these groups starts as a single item, in this case an individual user profile. In each iteration, this technique calculates the distances between every pair of groups, and the closest ones are merged together to form a new group. This is repeated until there is only one group. Figure 01 shows this process.
Figure 01 - The Hierarchical Clustering Algorithm in Action
As you can see in the figure, the similarity of the items is represented by their relative locations: the closer two items are, the more similar they are. At each step, the pair of groups that are closest together is merged to form a new group whose location is halfway between the two. After this process of forming and merging groups, the final step unifies the two remaining groups.
The next step is to define closeness. In this article I will use the Pearson correlation to determine how similar two user profiles are. Some users' Twitter statuses contain more entries or much longer entries than others, and thus contain more words overall. The Pearson correlation corrects for this, since it measures how well two sets of data fit onto a straight line. Remember that the Pearson correlation is 1.0 when two items match perfectly, and it's close to 0.0 when there's no relationship at all. Here we use 1 minus the Pearson correlation, so that items that are more similar have a smaller distance between them.
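The distance described above can be sketched as follows. The name pearson_distance is hypothetical, and returning 1.0 when one of the vectors is constant (so the correlation is undefined) is a choice made for this example.

```python
from math import sqrt

def pearson_distance(v1, v2):
    """Return 1 - Pearson correlation of two equal-length count vectors."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(y * y for y in v2)
    product_sum = sum(x * y for x, y in zip(v1, v2))

    # Covariance and variances, each scaled by n (the factors cancel out).
    covariance = product_sum - sum1 * sum2 / n
    variance1 = sum1_sq - sum1 ** 2 / n
    variance2 = sum2_sq - sum2 ** 2 / n

    # A constant vector has no defined correlation; treat it as
    # "no relationship" (an assumption of this sketch).
    if variance1 <= 0 or variance2 <= 0:
        return 1.0
    return 1.0 - covariance / sqrt(variance1 * variance2)
```

With this definition, a user who writes ten times as much as another but uses words in the same proportions gets a distance close to 0, while two users with opposite word usage get a distance close to 2.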
The algorithm for hierarchical clustering begins by creating a group of clusters that are just the original items. The main loop of the function searches for the two best matches by trying every possible pair and calculating their correlation. The best pair of clusters is merged into a single cluster. The data for this new cluster is the average of the data for the two old clusters. This process is repeated until only one cluster remains. You can see all the code for the hierarchical clustering in the function hcluster.
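The loop described above can be sketched as follows. This is an illustrative reconstruction, not the hcluster function from twitterClustering.py: the Cluster class, the negative ids given to merged clusters and the distance cache are assumptions made for this example.

```python
from math import sqrt

class Cluster:
    """A node of the hierarchy: a leaf (one user) or a merge of two nodes."""
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.vec, self.left, self.right = vec, left, right
        self.distance, self.id = distance, id

def pearson_distance(v1, v2):
    """1 - Pearson correlation; 1.0 if a vector is constant."""
    n = len(v1)
    s1, s2 = sum(v1), sum(v2)
    var1 = sum(x * x for x in v1) - s1 ** 2 / n
    var2 = sum(y * y for y in v2) - s2 ** 2 / n
    if var1 <= 0 or var2 <= 0:
        return 1.0
    cov = sum(x * y for x, y in zip(v1, v2)) - s1 * s2 / n
    return 1.0 - cov / sqrt(var1 * var2)

def hcluster(rows, distance=pearson_distance):
    """Merge the closest pair of clusters until one root remains."""
    clusters = [Cluster(row, id=i) for i, row in enumerate(rows)]
    cache = {}    # distances already computed, keyed by the pair of ids
    next_id = -1  # merged clusters get negative ids
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                key = (clusters[i].id, clusters[j].id)
                if key not in cache:
                    cache[key] = distance(clusters[i].vec, clusters[j].vec)
                if best is None or cache[key] < best[0]:
                    best = (cache[key], i, j)
        d, i, j = best
        # The new cluster's data is the average of the two old clusters.
        merged_vec = [(a + b) / 2.0
                      for a, b in zip(clusters[i].vec, clusters[j].vec)]
        merged = Cluster(merged_vec, clusters[i], clusters[j], d, next_id)
        next_id -= 1
        del clusters[j]  # j > i, so delete j first
        del clusters[i]
        clusters.append(merged)
    return clusters[0]

def leaves(cluster):
    """Return the ids of the original items under a cluster."""
    if cluster.left is None and cluster.right is None:
        return [cluster.id]
    return leaves(cluster.left) + leaves(cluster.right)
```

Because every pass compares all remaining pairs, the running time grows quickly with the number of profiles; the cache only avoids recomputing distances between clusters that have not changed.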
Step 3: Showing the results
You can interpret the clusters more clearly by viewing them as a dendrogram. Hierarchical clustering results are usually viewed this way, since dendrograms pack a lot of information into a relatively small space. The idea of this graph is to display the nodes arranged into their hierarchy. The dendrogram for the example above is shown in Figure 02.
This dendrogram not only uses connections to show which items ended up in each cluster, but also uses distance to show how far apart the items were. You can see that the AB cluster is a lot closer to the individual A and B items than the DE cluster is to the individual D and E items. Rendering the graph this way can help you determine how similar the items within a cluster are, which could be interpreted as the tightness of the cluster.
In this article, to draw the dendrogram and save it as a JPG, I will use the Python Imaging Library (PIL), which is available at http://pythonware.com
PIL makes it very easy to generate images with text and lines, which is all you'll really need to construct a dendrogram. To see how the dendrogram is drawn, look at the code in twitterClustering.py.
You can see the result of clustering my 100 friends, including me, in Figure 03. If you call the method drawdendrogram, it will generate a file called twitterclust.jpg with the dendrogram.
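Before drawing anything, the same information can be inspected as indented text. The sketch below assumes tree nodes with id, left, right and distance fields; the Node type, the format_tree name and the distances in the usage example are made up for illustration.

```python
from collections import namedtuple

# Hypothetical node type: leaves have left = right = None.
Node = namedtuple("Node", "id left right distance")

def format_tree(node, labels, depth=0):
    """Return the hierarchy as indented lines, one per node."""
    indent = "  " * depth
    if node.left is None and node.right is None:
        return [indent + labels[node.id]]
    # A merge node shows the distance at which its two children were joined.
    lines = [indent + "+ (distance %.2f)" % node.distance]
    lines += format_tree(node.left, labels, depth + 1)
    lines += format_tree(node.right, labels, depth + 1)
    return lines
```

For example, with made-up distances where A and B are merged at 0.10 and D and E at 0.60:

```python
labels = ["A", "B", "D", "E"]
leaf = [Node(i, None, None, 0.0) for i in range(4)]
ab = Node(-1, leaf[0], leaf[1], 0.10)
de = Node(-2, leaf[2], leaf[3], 0.60)
print("\n".join(format_tree(Node(-3, ab, de, 1.20), labels)))
```

The smaller merge distance of the AB group is what the dendrogram shows as a tighter cluster.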
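Whatever drawing library is used, a dendrogram renderer first needs to know how much room the tree takes. The sketch below covers only that layout step and does not call PIL; the Node type, the function names and the default sizes (20 pixels per leaf, a 1200 pixel wide image with a 150 pixel margin for labels) are assumptions for illustration and may differ from what twitterClustering.py does.

```python
from collections import namedtuple

# Hypothetical node type: leaves have left = right = None.
Node = namedtuple("Node", "id left right distance")

def is_leaf(node):
    return node.left is None and node.right is None

def get_height(node):
    """Number of leaves under a node: one text row per leaf."""
    if is_leaf(node):
        return 1
    return get_height(node.left) + get_height(node.right)

def get_depth(node):
    """Total distance from a node down to its deepest leaf."""
    if is_leaf(node):
        return 0.0
    return max(get_depth(node.left), get_depth(node.right)) + node.distance

def canvas_size(root, row_height=20, width=1200, margin=150):
    """Return (width, height, scaling) for the image.

    scaling converts a merge distance into a horizontal length in pixels.
    """
    height = get_height(root) * row_height
    depth = get_depth(root)
    scaling = (width - margin) / depth if depth else 1.0
    return width, height, scaling
```

With these numbers in hand, each merge node can be drawn as one vertical line joining its two children and two horizontal lines whose length is the merge distance multiplied by scaling.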
Figure 03 - Twitter Friends Dendrogram
Step 4: Interpreting the results
- Language: Portuguese x English
In this article I presented my first attempt to cluster my friends (everyone I'm following) on Twitter. For that, I considered only the Twitter statuses of my friends, and the presented algorithm groups the Twitter user profiles based on what they write about (the words in the text).
The algorithm used here is an unsupervised learning technique called hierarchical clustering. It's a powerful way to cluster the data provided to it, but its drawback is that it's very slow (it consumes time and processing power), since it calculates the distance metric between every pair of items to be clustered. The distance metric used here is based on the Pearson correlation.
The results shown here through the dendrograms and tag clouds lead to some interesting conclusions. The algorithm successfully clustered the Twitter profiles into groups, and with a simple analysis we could see the group of mobile users, the Python users and the division between English and Portuguese profiles.
Other techniques could be applied here, as well as other ways of presenting the results. In the next articles I will present some other techniques that I've used to cluster the Twitter data and some powerful tools to show the relationships between the clustered groups.
To download the source code that I used here, check it here.
Marcel Pinheiro Caraciolo