WordTree: Visualization Tool for Twitter

Monday, June 14, 2010

Hi all,

It has been a while since my last post, but I am still alive! In this post I'll talk about one of my latest projects based on natural language processing (NLP): a visual word tree.

But what is a Word Tree?

A word tree is a visual analysis tool for unstructured text, such as an article, a speech or a book. It is also a new visualization technique that makes it easy to explore repetitive contexts. The main idea behind it is the concordance principle.

Concordances have been used for centuries by biblical scholars to see how different words occur in religious texts. A concordance is a special type of index that shows, next to each word, some of the words that appear before or after it. For instance, consider the phrase "if love" in Romeo and Juliet, which occurs three times:

As the figure above shows, the words following "if love" contain many repeated phrases. For example, "be" follows "if love" in all three cases, and "be blind" follows in two of them. To create a word tree, the computer merges all the matching phrases, as in this diagram:

As you can see, this diagram has the shape of a tree in which each node represents a word, which makes it easy to interpret and visualize. It emphasizes the interactive exploration of short texts (like the short passages of the Bible). This visualization is called a WordTree and is based on a well-known data structure in computer science: the suffix tree (introduced in the 1970s).
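The merging step can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the diagrams: it uses a plain nested dictionary (a word-level trie) rather than a full suffix tree, and the names `tokenize`, `build_word_tree` and `show`, the `depth` parameter and the three sample phrases (chosen to match the counts described above) are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into words (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def new_node():
    # Each node stores how many phrases pass through it, plus its children.
    return {"count": 0, "children": defaultdict(new_node)}


def build_word_tree(texts, keyword, depth=4):
    """Merge the words that follow `keyword` in each text into one tree."""
    key = tokenize(keyword)
    root = new_node()
    for text in texts:
        words = tokenize(text)
        for i in range(len(words) - len(key) + 1):
            if words[i:i + len(key)] != key:
                continue
            root["count"] += 1
            node = root
            # Walk down the tree, sharing nodes for identical prefixes.
            for word in words[i + len(key):i + len(key) + depth]:
                node = node["children"][word]
                node["count"] += 1
    return root


def show(node, label, indent=0):
    """Print the tree, one word per line, with its occurrence count."""
    print("  " * indent + f"{label} ({node['count']})")
    for word, child in sorted(node["children"].items()):
        show(child, word, indent + 1)


# Illustrative sample phrases (an assumption for this sketch).
phrases = [
    "If love be rough with you, be rough with love",
    "If love be blind, love cannot hit the mark",
    "If love be blind, it best agrees with night",
]
tree = build_word_tree(phrases, "if love")
show(tree, "if love")
```

Phrases that start with the same words share the same nodes, so the count stored on each node tells how many occurrences pass through it. A real suffix tree would index every position of the text at once; the sketch only indexes the positions where the keyword appears.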

IBM scientists have developed an interactive graphical tool for this. Recently, at TEDx São Paulo, Fernanda Viegas from IBM gave a great keynote about data visualization techniques and presented an application of the word tree technique. The tool is also freely available to the public on the ManyEyes website.

But why did I develop one?

One of the problems with ManyEyes is that its use is forbidden for commercial applications and, based on what I've researched, there is also a maximum number of words supported by the tool. I've found other tools like ManyEyes, but unfortunately they were either not open source or not publicly available.

Therefore, I decided to build my own implementation of concordance based on the suffix tree. Unlike the ManyEyes tool, my goal is to automatically create word trees from statuses on the microblogging service Twitter. I was inspired by the work done by Vettalabs, who developed a WordTree for Twitter in Java.

Twitter has a powerful mechanism called Re-Tweet (RT), which users can use to repeat any tweet already posted by someone else in order to reinforce or support it and spread it to all their followers (by retweeting a tweet, you announce it to more people so they can see it). The more RTs a tweet gets, the more widely it has spread.

Thus, I've developed a simple system that monitors Twitter in real time, looking for tweets that contain keywords specified by the users. Furthermore, for every n tweets found, a new tree is created, which shows in an easy way what has been discussed about that topic, i.e., related to that keyword.
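The "one tree for every n tweets" idea can be sketched as a small generator. This is a minimal sketch under stated assumptions: the function name `batches_by_keyword` is hypothetical, the tweets are plain strings, and a hard-coded list stands in for the live Twitter stream (no Twitter API call is shown here).

```python
def batches_by_keyword(tweets, keywords, n):
    """Yield lists of `n` tweets that mention at least one keyword."""
    wanted = [k.lower() for k in keywords]
    batch = []
    for tweet in tweets:
        text = tweet.lower()
        if any(k in text for k in wanted):
            batch.append(tweet)
            if len(batch) == n:
                yield batch
                batch = []  # start collecting tweets for the next tree


# Stand-in for a live stream: any iterable of tweet texts works here.
sample_stream = [
    "Robin Hood was great!",
    "Lunch time",
    "RT @someone: Robin Hood was great!",
    "Watching the game tonight",
    "robin hood is out in theaters",
    "Robin Hood tickets sold out",
]

for batch in batches_by_keyword(sample_stream, ["robin hood"], n=2):
    print(batch)  # each batch would feed one new word tree
```

Because it is a generator, it works with an endless stream just as well as with a list. In this sketch, tweets left over in an incomplete batch are simply not emitted.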

Let's see an example: I've collected some tweets about a movie recently released in Brazilian theaters: Robin Hood. The figure below illustrates the word tree created:

WordTree related to the movie Robin Hood

As you can imagine, the number of tweets processed can be huge: thousands of items in one day. So I decided to prune the tree in order to present only relevant information. For that, we can use some natural language processing techniques, such as keeping only branches that have a subject in the first node and a verb in the second node of the tree. This simplifies the tree and focuses it on texts that begin with a subject + verb (the action of the keyword), etc.
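One possible reading of this pruning rule, sketched in Python. Everything here is an assumption made for illustration: the tiny hand-written tag table `TOY_TAGS` stands in for a real part-of-speech tagger, a noun at the root is treated as the subject, and the sample branches are made up.

```python
# Hypothetical, hand-written tag table standing in for a real POS tagger.
TOY_TAGS = {
    "brazil": "NOUN", "neymar": "NOUN",
    "wins": "VERB", "plays": "VERB", "scored": "VERB",
    "the": "DET", "today": "ADV", "again": "ADV", "so": "ADV",
}


def tag(word):
    """Return the toy part-of-speech tag for a word ('X' if unknown)."""
    return TOY_TAGS.get(word.lower(), "X")


def keep_subject_verb(paths, pattern=("NOUN", "VERB")):
    """Keep only root-to-leaf paths whose first nodes match `pattern`."""
    kept = []
    for path in paths:
        tags = tuple(tag(word) for word in path[:len(pattern)])
        if tags == pattern:
            kept.append(path)
    return kept


# Each path is one branch of the tree, starting at the keyword (the root).
# These branches are invented sample data, not real tweets.
paths = [
    ["brazil", "wins", "again"],
    ["brazil", "the", "best"],
    ["brazil", "plays", "today"],
    ["brazil", "so", "hot"],
]
print(keep_subject_verb(paths))
```

Only the branches that start with a noun followed by a verb survive. With a real tagger the `tag` function would be replaced, while the filtering logic could stay the same.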

The tree presented above is a reverse tree, which shows the words that precede a keyword (trees with higher depth). The other kind is the basic tree, where the keyword is the subject:

WordTree related to Brazil (a mention of the Soccer World Cup)!
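The difference between the two kinds of tree comes down to which side of the keyword the words are read from. A minimal sketch, assuming a hypothetical helper `contexts` with a `width` limit and an invented sample tweet:

```python
import re


def contexts(text, keyword, width=3, reverse=False):
    """Return the words after (or, if reverse, before) each keyword hit."""
    words = re.findall(r"[\w']+", text.lower())
    key = keyword.lower().split()
    found = []
    for i in range(len(words) - len(key) + 1):
        if words[i:i + len(key)] != key:
            continue
        if reverse:
            # Words before the keyword, nearest first, so that shared
            # endings line up when the lists are merged into a tree.
            before = words[max(0, i - width):i]
            found.append(before[::-1])
        else:
            found.append(words[i + len(key):i + len(key) + width])
    return found


tweet = "I really loved Robin Hood last night"  # invented sample tweet
print(contexts(tweet, "robin hood"))                # basic tree input
print(contexts(tweet, "robin hood", reverse=True))  # reverse tree input
```

In the reverse case the word lists are read backwards from the keyword, so merging them produces a tree that grows toward the start of each sentence.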

Both the suffix tree and the graph were developed using the Python programming language. The most interesting part is that you can easily visualize and extract information from Twitter in real time with this visualization tool. It also summarizes repeated words by increasing their font size, so the user can understand them directly in an intuitive way, especially in an environment with lots of text and information.
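The font-size idea can be sketched as a simple mapping from a word's count to a size. The square-root scaling, the 10 to 48 point range and the function name `font_size` are assumptions made for this example; the counts reuse the "if love" example from earlier in the post.

```python
import math


def font_size(count, max_count, min_size=10, max_size=48):
    """Map a word's occurrence count to a font size in points."""
    if max_count <= 1:
        return min_size
    # Square-root scaling: frequent words grow, but not linearly.
    ratio = math.sqrt(count - 1) / math.sqrt(max_count - 1)
    return round(min_size + ratio * (max_size - min_size))


counts = {"be": 3, "blind": 2, "rough": 1}
biggest = max(counts.values())
for word, count in counts.items():
    print(word, font_size(count, biggest))
```

A word seen once gets the smallest size and the most frequent word gets the largest. A sub-linear scale keeps one very popular word from dwarfing everything else in the tree.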

I'd like to mention Murilo Queiroga, who gave me some tips for this work. Thanks, Murilo!

See you next time,

Marcel Caraciolo

I am a Brazilian data scientist, entrepreneur, Python hacker and technology consultant. Nowadays I work with data-centric applications, especially in machine learning, recommender systems and bioinformatics. I am also interested in distributed computing, high performance and data visualization, and in educational and bioinformatics ventures.

Until 2013 I was the co-founder of two companies: Atepassar.com, a social network for students in Brazil, and PyCursos, an online startup for Python training and online courses. In 2014, I took on a new position as CTO at Genomika Diagnósticos, a Brazilian genetic testing laboratory.
