WordTree: Visualization Tool for Twitter

Monday, June 14, 2010

Hi all,

It has been a while since my last post, but I am still alive! In this post I'll talk about one of my latest projects based on natural language processing (NLP): a visual word tree.

But what is a word tree?

A word tree is a visual analysis tool for unstructured text, such as an article, a speech, or a book. It is also a new visualization technique that makes it easy to explore repeated context. The main idea behind it is the concordance principle.

Concordances have been used for centuries by biblical scholars to see how different words occur in religious texts. A concordance is a special kind of index that shows, next to each word, some of the words that appear before or after it. For instance, consider the phrase "if love" in Romeo and Juliet, which occurs three times:


As the figure above shows, the words following "if love" contain many repeated phrases. For example, "be" follows "if love" in all three cases, and "be blind" follows in two of them. To create a word tree, the computer merges all the matching phrases, as in this diagram:


As you can see, the diagram has the shape of a tree in which each node represents a word, so it can be easily interpreted and visualized. It emphasizes interactive exploration of short texts (like the short texts of the Bible). This visualization is called a WordTree and is based on a well-known data structure in computer science: the suffix tree (introduced in the 1970s).
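The merging step described above can be sketched in a few lines of Python. This is only an illustration, not the implementation behind this project: it uses a plain dictionary-based prefix tree instead of a real suffix tree, the tokenization is deliberately crude, and the function name `build_word_tree` and the node layout (`count`, `children`) are assumptions made for the example. The input lines are the three "if love" occurrences from Romeo and Juliet.

```python
def build_word_tree(lines, phrase):
    """Merge the words that follow `phrase` in each line into one tree.

    Each node is a dict with a `count` (how many contexts pass through it)
    and `children` (the next word mapped to its node).
    """
    target = phrase.lower().split()
    size = len(target)
    root = {"count": 0, "children": {}}
    for line in lines:
        # Crude tokenization: lowercase, drop commas, split on whitespace.
        tokens = line.lower().replace(",", " ").split()
        for start in range(len(tokens) - size + 1):
            if tokens[start:start + size] != target:
                continue
            root["count"] += 1
            node = root
            # Walk down the tree, sharing nodes with earlier contexts.
            for word in tokens[start + size:]:
                node = node["children"].setdefault(
                    word, {"count": 0, "children": {}})
                node["count"] += 1
    return root


lines = [
    "if love be rough with you, be rough with love",
    "if love be blind, love cannot hit the mark",
    "if love be blind, it best agrees with night",
]
tree = build_word_tree(lines, "if love")
be = tree["children"]["be"]
print(be["count"], be["children"]["blind"]["count"])  # prints: 3 2
```

The counts stored on each node are what a renderer would later use to decide how prominently to draw a word.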

IBM scientists have developed an interactive graphical tool based on this technique. Recently, at TEDx São Paulo, Fernanda Viegas from IBM gave a great keynote about data visualization techniques, in which she presented an application of the WordTree technique. The tool is also publicly available, free of charge, on the ManyEyes website.

But why did I develop one?

One of the problems with ManyEyes is that it may not be used for commercial applications and, based on what I've researched, there is also a limit on the number of words the tool supports. I found other tools like ManyEyes, but unfortunately they were either not open source or not publicly available.

Therefore, I decided to build my own implementation of concordance based on the suffix tree. Unlike the ManyEyes tool, my goal is to automatically create word trees from statuses on the microblogging service Twitter. I was inspired by the work of Vettalabs, who developed a WordTree for Twitter in Java.


Twitter has a powerful mechanism called retweet (RT), which users can use to repeat any tweet already posted by someone else, in order to reinforce or support it and spread it to all of their followers (by retweeting a tweet, you announce it to more people). The more RTs a tweet gets, the more widely it has spread.

Thus, I've developed a simple system that monitors Twitter in real time, looking for tweets that contain keywords specified by the user. For every n tweets found, a new tree is created, which shows in an easy way what is being discussed about that topic, i.e., what is related to that keyword.
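To make the "one tree for every n tweets" idea concrete, here is a rough sketch of the batching logic only. It assumes the tweets arrive as any iterable of plain strings, so no Twitter API call is shown; the name `batches_by_keyword` and the simple substring match are assumptions for illustration, not a description of the actual system.

```python
def batches_by_keyword(tweets, keywords, n):
    """Yield lists of n tweets that mention at least one keyword.

    `tweets` is any iterable of strings, e.g. a generator fed by a
    streaming client. Matching is a case-insensitive substring test.
    """
    wanted = [keyword.lower() for keyword in keywords]
    batch = []
    for tweet in tweets:
        text = tweet.lower()
        if any(keyword in text for keyword in wanted):
            batch.append(tweet)
            if len(batch) == n:
                # A full batch is ready: hand it over and start a new one.
                yield batch
                batch = []


stream = [
    "Robin Hood was great",
    "lunch time",
    "watching robin hood tonight",
    "Robin Hood again",
    "RT Robin Hood rocks",
]
batches = list(batches_by_keyword(stream, ["robin hood"], 2))
print(len(batches))  # prints: 2
```

Each yielded batch would then be handed to the tree-building step, so a fresh tree appears as soon as n matching tweets have been seen.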

Let's see an example: I collected some tweets about a movie recently released in Brazilian theaters, Robin Hood. The figure below illustrates the resulting word tree:



WordTree related to the movie Robin Hood


As you can see, the number of tweets processed can be huge: thousands of items in one day. So I decided to prune the tree in order to present only relevant information. For that, we can use some natural language processing techniques, such as keeping only branches that have a subject in the first node and a verb in the second node of the tree. This simplifies the tree and focuses it on texts that begin with subject + verb (an action involving the keyword), etc.
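The subject + verb rule can be sketched as a filter over branches, where each branch is the list of words along one path of the tree. The tagger here is a tiny hand-made lexicon (`TOY_TAGS`) that exists only for this example; a real system would plug in a proper part-of-speech tagger. Treating any pronoun or noun as a "subject" is also a simplifying assumption.

```python
# Hypothetical hand-made tag lexicon, only for this illustration.
TOY_TAGS = {
    "i": "PRON", "we": "PRON", "everyone": "PRON",
    "loved": "VERB", "watched": "VERB", "hates": "VERB",
    "the": "DET", "new": "ADJ",
}
# Simplification: any pronoun or noun counts as a possible subject.
SUBJECT_TAGS = {"PRON", "NOUN"}


def toy_tag(word):
    """Look a word up in the toy lexicon; unknown words get "OTHER"."""
    return TOY_TAGS.get(word.lower(), "OTHER")


def keep_subject_verb(branches, tag_of=toy_tag):
    """Keep branches whose first word is a subject and second is a verb."""
    kept = []
    for branch in branches:
        if len(branch) < 2:
            continue  # too short to hold subject + verb
        if tag_of(branch[0]) in SUBJECT_TAGS and tag_of(branch[1]) == "VERB":
            kept.append(branch)
    return kept


branches = [
    ["i", "loved", "robin", "hood"],
    ["the", "new", "robin", "hood"],
    ["we", "watched", "robin", "hood"],
    ["wow"],
]
pruned = keep_subject_verb(branches)
print(len(pruned))  # prints: 2
```

Passing the tagger in as `tag_of` keeps the pruning rule independent of whichever NLP library ends up supplying the tags.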

The tree presented above is a reverse tree, which shows the words that precede a keyword (these trees tend to have greater depth). The other kind is the basic tree, where the keyword is the subject:


WordTree related to Brazil (a mention of the Soccer World Cup)!
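The only difference between the two kinds of tree is the direction in which the context is read. As a rough sketch (again a plain dictionary-based tree rather than a real suffix tree, with the hypothetical name `build_reverse_tree` and made-up sample tweets), the reverse tree walks backwards through the words that come before the keyword:

```python
def build_reverse_tree(texts, keyword):
    """Merge the words that precede `keyword`, nearest word first."""
    target = keyword.lower().split()
    size = len(target)
    root = {"count": 0, "children": {}}
    for text in texts:
        tokens = text.lower().split()
        for start in range(len(tokens) - size + 1):
            if tokens[start:start + size] != target:
                continue
            root["count"] += 1
            node = root
            # Read the left context backwards, away from the keyword.
            for word in reversed(tokens[:start]):
                node = node["children"].setdefault(
                    word, {"count": 0, "children": {}})
                node["count"] += 1
    return root


# Made-up sample tweets, only for this illustration.
tweets = [
    "i loved robin hood",
    "we loved robin hood",
    "just watched robin hood",
]
reverse = build_reverse_tree(tweets, "robin hood")
loved = reverse["children"]["loved"]
print(loved["count"], sorted(loved["children"]))  # prints: 2 ['i', 'we']
```

Replacing `reversed(tokens[:start])` with `tokens[start + size:]` gives the basic tree, where the keyword comes first and its continuations branch out to the right.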


Both the suffix tree and the graph were developed in the Python programming language. The most interesting part is that you can easily visualize and extract information from Twitter in real time with this visualization tool. It also summarizes repeated words by increasing their font size, so the user can understand them directly and intuitively, especially in an environment with lots of text and information.
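One simple way to turn repetition into font size is a linear mapping from a word's count to a size range. The sketch below is an assumption about how this could be done, not the formula used by this tool; the function name `font_size` and the default sizes of 10 and 48 points are arbitrary.

```python
def font_size(count, max_count, min_size=10, max_size=48):
    """Map a word count to a font size between min_size and max_size.

    A word seen once gets min_size; the most repeated word gets max_size.
    """
    if max_count <= 1:
        return min_size  # nothing is repeated, so draw everything small
    scale = (count - 1) / (max_count - 1)
    return round(min_size + scale * (max_size - min_size))


print(font_size(1, 3), font_size(2, 3), font_size(3, 3))  # prints: 10 29 48
```

A square-root or logarithmic scale could be swapped in if a few very frequent words would otherwise dwarf the rest of the tree.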

I'd like to thank Murilo Queiroga, who gave me some tips for this work. Thanks, Murilo!

See you next time,

Marcel Caraciolo
