It has been a while since my last post. But, I am still alive!! In this post I'll be talking about one of my last projects based on natural language processing (NLP) that I've developed: A Visual Word Tree.
But what is a Word Tree ?A word tree is a visual analyzer tool for unstructured text, such as a article, speech or a book. It is also a new visualization technique that makes easy the exploration of repetitive context. The main idea behind it includes the concordance principle.
Concordances have been used for centuries at biblical scholars to see how different words occur in religious texts. It is a special type of indexation technique, which shows near to each word some words that appears before or after that one. For instance, consider the phrase "if love" in Romeo and Juliet which occurs three times:
As the figure shown above, you will notice that in the words following 'if love' there are many repeated phrases. For example, "be" follows "if love" in all three cases. And "be blind" follows in two cases. To create a word tree, the computer merges all the matching phrases, as in this diagram:
As you can notice, this diagram has a shape of a tree of each node represented by a word, which can be easily interpreted and visualized. It emphasizes in the interactive exploration of short texts (like the short texts of the Bible). This visualization is called WordTree and is based on a well-known data structure in computer science: suffix tree (introduced in the 70's).
IBM scientists have developed a interactive graphical tool, which recently at TEDx São Paulo, Fernanda Viegas from IBM have lectured a great keynote about data visualization techniques, presenting the application of the wordTree technique. The tool is also available for public and for free use at the website ManyEyes.
But why did i developed one ?
One of the problems of ManyEyes is its forbidden use for commercial applications and based on what I've researched there is also a maximum limit of words supported by the tool. I've found other ones like the ManyEyes, but unfortunatelly was not open-source or not available for public.
Therefore, I decided to build my own implementation of concordance based on the suffix tree. Different from the ManyEyes tool, my goal is to create automatically word trees from statuses from the web microblog Twitter. I was inspired by the work done by the Vettalabs who have developed a wordTree for Twitter in Java.
Twitter has a powerful mechanism called Re-Tweet (RT) which can be used by users to repeat any tweet that was already posted by someone, in order to reinforce or support that tweet and spread to all your followers (Making a RT of a tweet you're announcing that tweet for more people in order to see it). The more the number of RT's , the more divulged that tweet had.
Thus, I've developed a simple system that monitors the Twitter in real-time, seeking for tweets that has keywords specified by the users. Furthermore, for each n tweets found, one new tree is created, which shows in a easy way what it has been discussed about that topic, i.e., related to that keyword.
Let's see an example: I've collected some tweets about the recently launched movie in the brazilian theaters: Robin Hood. The figure below illustrates the new word tree created:
As you can notice, the quantity of tweets processed can be huge - thousand of items in one day. So I decided to prune the tree in order to present only relevant information. For that, we can use some natural language processing techniques such as to choose only nodes that have verbs in the second node and a subject in the first node of the tree. This could simplify the tree and focusing on the texts that have a subject + verb in the beginning (action of the keyword), etc.
The tree presented above is a reverse tree, which shows the words that precede a keyword.(trees with higher depth). The other one is the basic tree where the keyword is the subject:
Both the suffix tree and the graph was developed using Python programming language. The most interesting part is that you can easily visualize/extract information in real-time on Twitter with this visualization tool. It also summarizes repeated words by increasing its font/letter size so the user can directly understand in a intuitive way, specially in a environment with lots of text and information.
I'd like to mention Murilo Queiroga who has given me some tips to this work. Thanks Murilo!
See you next time,
Marcel Caraciolo
As the figure shown above, you will notice that in the words following 'if love' there are many repeated phrases. For example, "be" follows "if love" in all three cases. And "be blind" follows in two cases. To create a word tree, the computer merges all the matching phrases, as in this diagram:
As you can notice, this diagram has a shape of a tree of each node represented by a word, which can be easily interpreted and visualized. It emphasizes in the interactive exploration of short texts (like the short texts of the Bible). This visualization is called WordTree and is based on a well-known data structure in computer science: suffix tree (introduced in the 70's).
IBM scientists have developed a interactive graphical tool, which recently at TEDx São Paulo, Fernanda Viegas from IBM have lectured a great keynote about data visualization techniques, presenting the application of the wordTree technique. The tool is also available for public and for free use at the website ManyEyes.
But why did i developed one ?
One of the problems of ManyEyes is its forbidden use for commercial applications and based on what I've researched there is also a maximum limit of words supported by the tool. I've found other ones like the ManyEyes, but unfortunatelly was not open-source or not available for public.
Therefore, I decided to build my own implementation of concordance based on the suffix tree. Different from the ManyEyes tool, my goal is to create automatically word trees from statuses from the web microblog Twitter. I was inspired by the work done by the Vettalabs who have developed a wordTree for Twitter in Java.
Twitter has a powerful mechanism called Re-Tweet (RT) which can be used by users to repeat any tweet that was already posted by someone, in order to reinforce or support that tweet and spread to all your followers (Making a RT of a tweet you're announcing that tweet for more people in order to see it). The more the number of RT's , the more divulged that tweet had.
Thus, I've developed a simple system that monitors the Twitter in real-time, seeking for tweets that has keywords specified by the users. Furthermore, for each n tweets found, one new tree is created, which shows in a easy way what it has been discussed about that topic, i.e., related to that keyword.
Let's see an example: I've collected some tweets about the recently launched movie in the brazilian theaters: Robin Hood. The figure below illustrates the new word tree created:
As you can notice, the quantity of tweets processed can be huge - thousand of items in one day. So I decided to prune the tree in order to present only relevant information. For that, we can use some natural language processing techniques such as to choose only nodes that have verbs in the second node and a subject in the first node of the tree. This could simplify the tree and focusing on the texts that have a subject + verb in the beginning (action of the keyword), etc.
The tree presented above is a reverse tree, which shows the words that precede a keyword.(trees with higher depth). The other one is the basic tree where the keyword is the subject:
Both the suffix tree and the graph was developed using Python programming language. The most interesting part is that you can easily visualize/extract information in real-time on Twitter with this visualization tool. It also summarizes repeated words by increasing its font/letter size so the user can directly understand in a intuitive way, specially in a environment with lots of text and information.
I'd like to mention Murilo Queiroga who has given me some tips to this work. Thanks Murilo!
See you next time,
Marcel Caraciolo
شركة تسليك مجاري المطبخ بالرياض
ReplyDeleteشركة تسليك مجاري بالرياض
شركة تسليك مجارى الحمام بالرياض
level تسليك المجاري بالرياض
افضل شركة تنظيف بالرياض
تنظيف شقق بالرياض
شركة تنظيف منازل بالرياض
شركة غسيل خزنات بالرياض
افضل شركة مكافحة حشرات بالرياض
رش مبيدات بالرياض
شركة تخزين عفش بالرياض
شركة تنظيف مجالس بالرياض
تنظيف فلل بالرياض
ابى شركة تنظيف بالرياض
ReplyDeleteThanks for your informative article.
NLP training in Chennai
Thank you for your articles that you have shared with us. Hopefully you can give the article a good benefit to us. Machine Learning Training In Jaipur
ReplyDeleteAmazing content.
ReplyDeleteData Mining Service Providers in Bangalore
Vidyasthali Group of Institutions was founded by an eminent group of academics and industry leaders who are masters of the top significant achievements and accomplishments. Vidyasthali is a reputed B-school in Jaipur.
ReplyDeletevidyasthali institute of technology science&managment
best techonology science and management institute in jaipur
courses in vidyasthali group of institute
no 1 technology science & management college in jaipur
best management
vitsm
vidyasthali institute of technology science and managment
techonlogy science&management college near me,
best technology
best management
VITSM teaching staff
vitsm courses
top science college
best science colleges in jaipur
best science college in jaipur
vitsm fees structure
vitsm fee structure
vitsm result
vitsm result
This professional hacker is absolutely reliable and I strongly recommend him for any type of hack you require. I know this because I have hired him severally for various hacks and he has never disappointed me nor any of my friends who have hired him too, he can help you with any of the following hacks:
ReplyDelete-Phone hacks (remotely)
-Credit repair
-Bitcoin recovery (any cryptocurrency)
-Make money from home (USA only)
-Social media hacks
-Website hacks
-Erase criminal records (USA & Canada only)
-Grade change
-funds recovery
Email: onlineghosthacker247@ gmail .com
Excellent Blog! I would like to thank for the efforts you have made in writing this post. I am hoping the same best work from you in the future as well. I wanted to thank you for this websites! Thanks for sharing. Great websites!
ReplyDelete고스톱
The next time I read a blog, I hope that it doesn't disappoint me as much as this one. I mean, I know it was my choice to read, but I actually thought you have something interesting to say. All I hear is a bunch of whining about something that you could fix if you weren't too busy looking for attention.
ReplyDelete스포츠토토
Great posts
ReplyDeleteAbogado Disputas Contratos Comerciales
Disputas Contratos Litigio Mediación
Best Blog and very helpful too
ReplyDeleteAbogado Transito Fairfax VA
dui charges dropped
Great Post! Im looking forward to seeing more from this blog here.
ReplyDeleteThere are some great ideas above. Thanks for providing this good stories
ReplyDeleteThis website is useful. You made great points Many thanks for sharing.
ReplyDeleteVery nice article and straight to the point. Keep it up, Thanks.
ReplyDeleteHey there. Your article looks good. Keep on writing great article!
ReplyDeleteI really want to commend the author for addressing such a complex topic with such clarity! The way you divided it into manageable parts made all the difference. I feel much more confident in my understanding now. Thank you for putting in the effort to make this so accessible!
ReplyDeleteVisit our link for ISO Certification in Jeddah
ReplyDeleteThanks for some other wonderful article