WordTree: Visualization Tool for Twitter

Monday, June 14, 2010

Hi all,

It has been a while since my last post, but I am still alive! In this post I'll talk about one of my latest natural language processing (NLP) projects: a visual Word Tree.

But what is a Word Tree?

A word tree is a visual analysis tool for unstructured text, such as an article, a speech or a book. It is also a new visualization technique that makes it easy to explore repeated contexts. The main idea behind it is the concordance.

Concordances have been used for centuries by biblical scholars to see how different words occur in religious texts. A concordance is a special kind of index that shows, next to each word, some of the words that appear before or after it. For instance, consider the phrase "if love" in Romeo and Juliet, which occurs three times:


As the figure above shows, the words following "if love" contain many repeated phrases. For example, "be" follows "if love" in all three cases, and "be blind" follows in two of them. To create a word tree, the computer merges all the matching phrases, as in this diagram:


As you can see, the diagram has the shape of a tree in which each node represents a word, so it can be easily read and interpreted. It is aimed at the interactive exploration of short texts (like the short passages of the Bible). This visualization is called a WordTree and is based on a well-known data structure in computer science: the suffix tree (introduced in the 1970s).
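The merging step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation behind this post: it uses a plain nested trie of the words that follow the phrase rather than a full suffix tree, and the names `tokenize`, `Node` and `build_word_tree` are made up for the example. The three sample lines are the "if love" lines from the play, typed in by hand.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class Node:
    """One word in the tree, counting how many contexts pass through it."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(Node)


def build_word_tree(lines, phrase, depth=4):
    """Merge the words that follow `phrase` in each line into one tree."""
    key = tokenize(phrase)
    root = Node()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - len(key) + 1):
            if tokens[i:i + len(key)] == key:
                root.count += 1
                node = root
                # Shared prefixes land on the same nodes, so counts add up.
                for word in tokens[i + len(key):i + len(key) + depth]:
                    node = node.children[word]
                    node.count += 1
    return root


lines = [
    "If love be rough with you, be rough with love.",
    "If love be blind, love cannot hit the mark.",
    "Or, if love be blind, it best agrees with night.",
]
tree = build_word_tree(lines, "if love")
```

Here `tree.count` is the number of times the phrase was found, and each child node carries the number of contexts that share that word, which is what lets "be" and "be blind" collapse into single branches.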

IBM scientists have developed an interactive graphical tool for it. Recently, at TEDx São Paulo, Fernanda Viegas from IBM gave a great talk about data visualization techniques in which she presented an application of the WordTree technique. The tool is also publicly available, free of charge, on the ManyEyes website.

But why did I develop one?

One of the problems with ManyEyes is that it cannot be used in commercial applications, and from what I've researched there is also a limit on the number of words the tool supports. I found other tools like ManyEyes, but unfortunately they were either not open source or not publicly available.

Therefore, I decided to build my own implementation of a concordance based on the suffix tree. Unlike the ManyEyes tool, my goal is to automatically create word trees from status updates on the microblogging service Twitter. I was inspired by the work of Vettalabs, who developed a WordTree for Twitter in Java.


Twitter has a powerful mechanism called Retweet (RT), which lets users repeat any tweet that someone has already posted, in order to reinforce or support it and spread it to all of their followers (by retweeting, you announce the tweet to more people). The more RTs a tweet gets, the more widely it has spread.

Thus, I developed a simple system that monitors Twitter in real time, looking for tweets that contain keywords specified by the user. For every n tweets found, a new tree is created, which shows in an easy way what is being discussed about that topic, i.e., what is being said in relation to that keyword.
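The "one new tree for every n tweets" idea can be sketched as a small generator. This is a rough sketch under assumptions: the live Twitter feed is replaced by a plain Python list, the sample tweets are made up, and `batches_of_matching_tweets` is a hypothetical name, not part of any Twitter library.

```python
def batches_of_matching_tweets(stream, keywords, n):
    """Yield lists of n tweets that mention any of the keywords."""
    wanted = [k.lower() for k in keywords]
    batch = []
    for tweet in stream:
        text = tweet.lower()
        if any(k in text for k in wanted):
            batch.append(tweet)
            if len(batch) == n:
                # A full batch is ready: hand it over and start a new one.
                yield batch
                batch = []


# Stand-in for a live feed; these tweets are invented for illustration.
sample_stream = [
    "Just watched Robin Hood, loved it",
    "Lunch time!",
    "RT @someone: Robin Hood is in theaters now",
    "robin hood was better than I expected",
    "Going to see Robin Hood tomorrow",
]
batches = list(batches_of_matching_tweets(sample_stream, ["robin hood"], n=2))
```

Each yielded batch would then be handed to the tree builder, so the picture refreshes as the conversation moves on. Tweets left over in an unfinished batch are simply not yielded in this sketch.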

Let's see an example: I collected some tweets about a movie recently released in Brazilian theaters, Robin Hood. The figure below shows the resulting word tree:



WordTree related to the movie Robin Hood


As you can see, the number of tweets processed can be huge: thousands of items in a single day. So I decided to prune the tree in order to present only relevant information. For that, we can use natural language processing techniques, such as keeping only the branches that have a subject in the first node of the tree and a verb in the second. This simplifies the tree and focuses it on texts that begin with subject + verb (the action of the keyword).
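A minimal sketch of that pruning rule, assuming the keyword itself plays the role of the subject, so the check reduces to "is the next word a verb?". The verb test here is a hypothetical stand-in (a tiny hand-written word list); a real system would ask a part-of-speech tagger instead.

```python
# Hypothetical stand-in for a real part-of-speech tagger: a tiny verb list.
SAMPLE_VERBS = {"is", "was", "opens", "wins", "plays"}


def is_verb(word):
    """Assumed helper; a real system would consult a POS tagger here."""
    return word.lower() in SAMPLE_VERBS


def prune_contexts(contexts, is_verb=is_verb):
    """Keep only contexts whose first word after the keyword is a verb."""
    return [words for words in contexts if words and is_verb(words[0])]


# Words that followed the keyword in four made-up tweets.
contexts = [
    ["is", "a", "great", "movie"],
    ["was", "fun"],
    ["tickets", "anyone"],
    ["lol"],
]
kept = prune_contexts(contexts)
```

Filtering the contexts before they are merged keeps the tree builder unchanged, and passing `is_verb` as a parameter makes it easy to swap the word list for a proper tagger later.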

The tree presented above is a reverse tree, which shows the words that precede a keyword (such trees tend to be deeper). The other kind is the basic tree, where the keyword is the subject:


WordTree related to Brazil (a mention of the Soccer World Cup)!
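The difference between the basic tree and the reverse tree comes down to which side of the keyword the contexts are taken from. A small sketch, with a made-up sentence and a hypothetical helper name `keyword_contexts`:

```python
def keyword_contexts(tokens, keyword, width=3, reverse=False):
    """Collect up to `width` words after (or, if reverse, before) each hit."""
    found = []
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        if reverse:
            # Walk backwards so the word nearest the keyword comes first.
            start = max(0, i - width)
            found.append(tokens[start:i][::-1])
        else:
            found.append(tokens[i + 1:i + 1 + width])
    return found


tokens = "i hope brazil wins the cup".split()
after = keyword_contexts(tokens, "brazil")
before = keyword_contexts(tokens, "brazil", reverse=True)
```

Because the reversed contexts start with the word closest to the keyword, the same merging code can build either kind of tree; only the drawing direction changes.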


Both the suffix tree and the graph were developed in the Python programming language. The most interesting part is that you can easily visualize and extract information from Twitter in real time with this visualization tool. It also summarizes repeated words by increasing their font size, so the user can understand them directly and intuitively, especially in an environment with lots of text and information.
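One simple way to turn repeat counts into font sizes is a linear scale between a minimum and a maximum size. This is only an assumed scheme for illustration; the point sizes and the function name `font_size` are not taken from the tool described in this post.

```python
def font_size(count, max_count, min_size=10, max_size=40):
    """Map a word's repeat count to a font size in points, linearly."""
    if max_count <= 1:
        # Nothing is repeated, so every word gets the smallest size.
        return min_size
    share = (count - 1) / (max_count - 1)
    return round(min_size + share * (max_size - min_size))


sizes = [font_size(c, max_count=3) for c in (1, 2, 3)]
```

A word seen once gets the smallest size and the most repeated word gets the largest, with everything else in between.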

I'd like to thank Murilo Queiroga, who gave me some tips for this work. Thanks, Murilo!

See you next time,

Marcel Caraciolo
