CodeCereal: A blog about logic programming and artificial intelligence!

Thursday, July 29, 2010

Hi all,

I'd like to share a great blog that a colleague and friend of mine has started writing about Artificial Intelligence, Reasoning, and expert systems: the blog CodeCereal by Daker Fernandes, an undergraduate student from the Informatics Center (CIN) / UFPE.

I've read some posts, and the recent ones about abductive reasoning, knowledge representation, and building expert systems using Prolog are quite interesting. Anyone interested in this area and in logic programming must take a look at this blog!

Congratulations, Daker, and keep up this great work!



Working on Sentiment Analysis on Twitter with Portuguese Language

Tuesday, July 20, 2010

Hi all,

In this post I would like to write about the studies I have been focusing on lately. In recent years, the amount of information available on the Web has increased noticeably. People have started to share their knowledge and opinions on the Web through blogs, social networks, and other media.

On top of that, a new research field related to Natural Language Processing has appeared, called Opinion Mining, also known as Sentiment Analysis. Sentiment Analysis aims to identify the sentiment or feeling of users towards something, such as a product, company, place, or person, based on content published on the web. From the requester's point of view, it is possible to obtain a full report summarizing what people feel about an item without having to find and read every opinion and news article related to it.
In the machine learning area, sentiment analysis is a text categorization problem that aims to detect favorable and unfavorable opinions about a specific topic. Its main challenge is to identify how sentiments are expressed in text and whether they indicate a positive or a negative opinion.

There are many applications of Sentiment Analysis, such as the following:
  • Sentiment Analysis on Company Stocks: Important data for investors, who can identify the mood of the market towards companies' stocks based on the opinions of analysts, and thus identify trends in their prices.
  • Sentiment Analysis on Products: A company might be interested in its customers' opinions about a certain product. For example, Google (or any other company) could use sentiment mining to find out what people are saying about the Android phone Nexus One. This information can be used to improve products or even to identify new marketing strategies.
  • Sentiment Analysis on Places: A person who is going to travel might want to know the best places to visit or the best restaurants. Opinion mining could help such people by recommending good places while they plan their trip.
  • Sentiment Analysis on Elections: Voters could use sentiment analysis to identify the opinions of other voters about a specific candidate.
  • Sentiment Analysis on Games and Movies: It is also possible to mine sentiments about games and even movies. We will talk more about this in the future.
Another application of sentiment analysis is to status messages on social networks such as Twitter or Facebook. Twitter in particular has gained special attention recently, as people use it to express their opinions on all kinds of topics. Therefore, sentiment analysis tools for Twitter that automatically classify tweets as positive, negative, or neutral could be quite useful for companies and marketers. It is a widely available public data source that shouldn't be ignored.

Of course, there are already several tools in this area, but they are mostly focused on classifying tweets in English. When these tools are used on Portuguese comments, many tweets end up misclassified, since the tools are generally trained with algorithms modeled to parse and extract English words. In Brazil, there is a quite interesting web tool called Opsys, developed by Thomas Lopes, which is focused on mining opinions from feeds and tweets on the web. Currently, its focus is on the 2010 Brazilian elections and on investments (company stocks). It works really well with Portuguese texts and has a nice web summarization tool for analysis.

But since this is a new area, especially when working with Portuguese corpora, I've decided to develop a simple working sentiment analysis tool for identifying opinions on Twitter. For this, I apply a common and simple machine learning technique called Naive Bayes to classify a set of tweets related to movie reviews, which I will explain in the next sections.

OK, but how does sentiment analysis work? What are the required steps?
  1. Data Collection and Pre-processing: In this step, it is important to search the web for the item of interest, that is, the thing you want to know opinions about. If your system doesn't identify subjectivity, it is also important to remove everything that doesn't express an opinion, such as news and objective phrases; the focus is on users' opinions. Pre-processing also matters for removing unnecessary or irrelevant words before the next step: classification.
  2. Classification: The polarity of the content must be identified. Generally, the polarities used are positive, negative, and neutral.
  3. Presentation of Results: In this step, the classifications of several opinions must be summarized before being presented to the user. The goal is to ease understanding and give the user a general picture of what people are saying about an item. This summary can be expressed as charts or text.
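The three steps above can be sketched as a tiny pipeline. This is only an illustration: the function names and the keyword-matching "classifier" are placeholders of my own, not the real implementation (the real classification step uses Naive Bayes, described later in this post):

```python
# Illustrative sketch of the three-step sentiment analysis pipeline.
# All names and the keyword rules are placeholders, not real project code.

def collect_and_preprocess(raw_tweets):
    """Step 1: normalize text and drop empty entries.
    A real system would also filter out objective/news content."""
    cleaned = []
    for tweet in raw_tweets:
        text = tweet.strip().lower()
        if text:
            cleaned.append(text)
    return cleaned

def classify(tweet):
    """Step 2: assign a polarity label (toy keyword rule for illustration)."""
    if 'lindo' in tweet or 'amei' in tweet:
        return 'positive'
    if 'ruim' in tweet or 'horrivel' in tweet:
        return 'negative'
    return 'neutral'

def summarize(labels):
    """Step 3: aggregate label counts for presentation (e.g. a chart)."""
    summary = {}
    for label in labels:
        summary[label] = summary.get(label, 0) + 1
    return summary

tweets = ['Assisti hoje o filme Eclipse, ele é lindo!',
          'Filme ruim danado.',
          'Fui ao cinema hoje.']
labels = [classify(t) for t in collect_and_preprocess(tweets)]
print(summarize(labels))  # {'positive': 1, 'negative': 1, 'neutral': 1}
```

The summary dictionary produced in step 3 is exactly the kind of data a polarity chart is drawn from.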

Data Collection and Pre-processing

The first step (Data Collection) is related to information retrieval. It is necessary to extract keywords from the text that may lead to a correct classification. Keywords from the original data are usually stored in the form of a feature vector, F = (f1, f2, ..., fn). Each coordinate of the feature vector represents one word, also called a feature, of the original text. The value of each feature may be binary, indicating the presence or absence of the feature, or an integer, which can further express the intensity of the feature in the original text. A good selection of features is important, since it strongly influences the subsequent learning in the machine learning process. The goal of feature selection is to capture the properties of the original text that are relevant to the sentiment analysis task. Unfortunately, there is no general recipe for finding the best features. We have to rely on intuition, domain knowledge, and a lot of experimentation to choose the best feature set. I strongly recommend studying Natural Language Processing (NLP), which may help you understand this vast research field.

Our approach uses the Bag-of-Words model. It is a popular model in Information Retrieval that takes the individual words (unigrams) in a sentence as features, assuming their conditional independence. The whole text is thus represented by an unordered collection of words, and each feature in the vector represents the presence of one word. The challenge with this approach is choosing which words are appropriate to become features.

For instance, under this model, the tweet 'Assisti hoje o filme Eclipse, ele é lindo !' ('I watched the movie Eclipse today, it is beautiful!') may be represented by the following feature vector:

 F = {'Assisti': 1 , 'hoje': 1, 'o': 1, 'filme': 1, 'Eclipse': 1, 'ele': 1, 'é': 1, 'lindo':1}

Here we represent the feature vector as a Python dictionary.
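Building such a feature vector takes only a couple of lines of Python. The sketch below tokenizes naively on whitespace; a real tokenizer would also strip punctuation (note the comma in the original tweet):

```python
def extract_features(tweet):
    """Build a bag-of-words feature vector as a dict of word -> presence (1)."""
    return {word: 1 for word in tweet.split()}

features = extract_features('Assisti hoje o filme Eclipse ele é lindo')
print(features)
# {'Assisti': 1, 'hoje': 1, 'o': 1, 'filme': 1, 'Eclipse': 1, 'ele': 1, 'é': 1, 'lindo': 1}
```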

Obviously, for any real use, we would have to compare this vector against a feature vector with a much larger number of words. In fact, a dictionary of the whole language would be necessary; however, such a model would be inefficient, since it would overfit and perform badly on new examples.
In the literature, a common approach is to manually select the most important keywords. For example, the word 'lindo' (an adjective) is a good indicator of the author's opinion; keywords such as 'excelente', 'horrível', and 'fraco' would be selected as features, since they express the polarity of a sentence. However, Pang et al. show that the manual keyword model is outperformed by statistical models, in which a good set of feature words is selected by their occurrence in the training corpus. The quality of the selection then depends on the size of the corpus and on the similarity between the domains of the training and test data. Using a corpus from a different domain, one that doesn't share the properties of the text we want to classify, may lead to inaccurate results. For instance, if the analysis is done on a set of tweets related to a product but the model is trained on a set based on movies, most of the sentences will be misclassified. It is also important to create a dictionary that captures the most important features for classifying previously unseen sentences.
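A simple statistical selection of this kind can be done with the standard library: rank the words of the training corpus by frequency and keep the top n as the feature dictionary. This is only a sketch of the idea, not the project's actual code:

```python
from collections import Counter

def select_features(corpus, n=1000):
    """Pick the n most frequent words in the training corpus as the feature set."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    return [word for word, _ in counts.most_common(n)]

corpus = ['amei o filme', 'o filme é lindo', 'filme ruim']
print(select_features(corpus, n=2))  # ['filme', 'o']
```

In practice one would combine this with stop-word removal (below), otherwise the most frequent "features" end up being articles and prepositions, as the toy output already hints.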

Additionally, it is possible to remove words that carry little useful information, such as pronouns, articles, and prepositions (a list of stop words). Of course, the model presented here is simple and has several limitations, including the inability to capture polarity relations between words or the different meanings of a single word. Other limitations could be addressed by using regular expressions to handle negation and part-of-speech tagging for a syntactic analysis of the words.
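Stop-word removal is a one-liner once a stop-word list is available. The short Portuguese list below is just for illustration; real systems use much larger published lists:

```python
# A tiny illustrative Portuguese stop-word list; real systems use larger ones.
STOP_WORDS = {'o', 'a', 'os', 'as', 'um', 'uma', 'de', 'do', 'da', 'é', 'e', 'ele', 'ela'}

def remove_stop_words(tweet):
    """Lowercase the tweet and drop words that carry little information."""
    return [w for w in tweet.lower().split() if w not in STOP_WORDS]

print(remove_stop_words('Assisti hoje o filme Eclipse ele é lindo'))
# ['assisti', 'hoje', 'filme', 'eclipse', 'lindo']
```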


Classification

Classification algorithms are efficient and well-established techniques for this sentiment classification task, since they predict a label for a given input. However, depending on the approach used (supervised or unsupervised), a training set with labeled examples may be required before new tweets can be classified. It is important to train the model on a set whose domain is related to the domain of the input data. The labels we are interested in here are the subjectivity of the sentence and its polarity (neutral, positive, or negative). There are several machine learning techniques for this task, but in this article we use the simplest one, which nevertheless performs quite well on classification problems: the Naive Bayes technique.

The Naive Bayes model, or Naive Bayes classifier, is a simple probabilistic classifier based on Bayes' theorem with strong independence assumptions. In simple terms, this probability model assumes that the presence or absence of a particular feature of a class is unrelated to the presence or absence of any other feature. For instance, an object may be considered a car if it has four tires, an engine, and at least two doors. Even if these features depend on each other, the Naive Bayes classifier considers all of them to contribute independently to the probability that the object is a car.

In our case, each stemmed word in a tweet is taken to be a unique variable in the model, and the goal is to find the probability of that word, and consequently of the whole sentence, belonging to a certain class: Positive vs. Negative. In spite of their naive design and simplified assumptions, Naive Bayes classifiers have worked very well in many complex real-world situations. One of their extensive uses is in spam filtering; in fact, one of the most popular packages (SpamAssassin) includes a direct implementation of Naive Bayes. It is an amazingly simple approach and also a powerful classifier. More information about the Naive Bayes algorithm can be found here and here.
In this tutorial I have used the implementation of a simple Naive Bayes classifier. There are variations using the Fisher discriminant function, which will not be explored in this article.
In our problem domain, we use the Naive Bayes classifier to classify tweets as Positive or Negative. For this task, it is necessary to train the classifier by building a training set of tweets (or words) already labeled as Positive or Negative. Once the model is trained, Naive Bayes takes all the features in the feature vector (after the pre-processing step) and applies Bayes' rule to the tweet. Naive Bayes computes the prior probability of each label from its frequency in the training set. Each label then receives a likelihood estimate from the contributions of all features, and the sentence is assigned the label with the highest likelihood estimate, that is, 'Positive' or 'Negative'.
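To make the training and classification steps concrete, here is a minimal self-contained Naive Bayes over bag-of-words features, with Laplace smoothing and log probabilities to avoid underflow. This is an illustrative sketch, not the exact code used in this project; the class and the tiny training set are my own:

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes over bag-of-words features (illustrative)."""

    def train(self, labeled_tweets):
        self.label_counts = defaultdict(int)
        self.word_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()
        for tweet, label in labeled_tweets:
            self.label_counts[label] += 1
            for word in tweet.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, tweet):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float('-inf')
        for label, count in self.label_counts.items():
            # log prior + sum of log likelihoods with Laplace (add-one) smoothing
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tweet.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train = [('amei o filme', 'positivo'),
         ('filme lindo', 'positivo'),
         ('filme ruim', 'negativo'),
         ('pior filme da saga', 'negativo')]
nb = NaiveBayes()
nb.train(train)
print(nb.classify('filme lindo demais'))  # positivo
```

The word 'demais' never appears in the training set, but smoothing keeps its probability non-zero, so the known word 'lindo' still drives the decision.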


Presentation of Results

The last step is the presentation of the results. Generally, results are presented using text or charts. Summarization in text is not an easy task; there is a whole research field devoted to it. A common presentation is a list of sentences, with the classification of each one shown by its side.

Summarization by the TwitterSentiment Analyzer

The other common option is to use charts to present the results. This is the simplest way, and there are many flavors of this kind of presentation. And don't forget that charts are more attractive for end users and reports.

Summarization by the TwitterSentiment Analyzer

Our Case Study: Movie Reviews on Twitter

For a simple case study and demonstration, I've created a training set with about 300 sentences related to movie reviews on Twitter. Sentences such as 'Eu amei o filme Eclipse!' ('I loved the movie Eclipse!') or 'Eclipse poderia ser melhor. #horrível Pior filme da saga.' ('Eclipse could be better. #horrible Worst movie of the saga.') were labeled with positive and negative polarities to create the training set. It is important to notice that my data corpus consists of Portuguese tweets, so English ones will not work properly. I've used the Twitter Search API to search for tweets with the keyword 'Eclipse', the movie from the 'Twilight' saga newly released in theaters, and fetched tweets over three days (July 5th - July 7th, 2010).
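Fetching the raw tweets looks roughly like the sketch below, which builds a query URL for the Twitter Search API's JSON endpoint. Treat the endpoint, parameters ('rpp', 'page'), and the 'results'/'text' response fields as assumptions about the API as it existed at the time, not a guaranteed contract:

```python
import urllib.parse

def build_search_url(query, page=1):
    """Build a Twitter Search API query URL (endpoint assumed as of 2010)."""
    params = urllib.parse.urlencode({'q': query, 'rpp': 100, 'page': page})
    return 'http://search.twitter.com/search.json?' + params

url = build_search_url('Eclipse')
print(url)
# Fetching and decoding would then look like (requires network access):
#   import json, urllib.request
#   data = json.loads(urllib.request.urlopen(url).read())
#   tweets = [result['text'] for result in data['results']]
```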
So, what are people saying about the movie Eclipse on Twitter?
The task is to analyze the opinions about the movie on Twitter. Using the Naive Bayes classifier, with some adjustments to include a 'Neutral' class, we analyzed 4,265 tweets. I used some of the pre-processing techniques explained above. All the code was written in Python. You can see the result in the chart below.

Polarity Chart on tweets about the movie 'Eclipse'

As you may notice, Brazilians are really loving the movie Eclipse! There is also a great amount of neutral tweets, which means either that the classifier didn't have enough confidence to classify them as positive or negative (I'll explain this soon), or that the tweets simply don't express a sentiment (opinion).

You can see some of the classified tweets below:

Some classified tweets related to the movie Eclipse

For the neutral classification, I've added a simple threshold: after evaluating the probabilities for each polarity (Positive and Negative), the maximum one is compared to a threshold, and if its value isn't higher, the tweet is classified as neutral. This means the classifier doesn't have enough confidence to assign the tweet to one of the classes trained in the model.
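The thresholding rule can be sketched in a few lines. The threshold value 0.7 below is purely illustrative; in practice it must be tuned on real data:

```python
def classify_with_neutral(prob_positive, prob_negative, threshold=0.7):
    """Fall back to 'neutral' when the winning probability is below the threshold.

    The default threshold here is illustrative and must be tuned on real data.
    """
    best_label, best_prob = max(
        [('positive', prob_positive), ('negative', prob_negative)],
        key=lambda pair: pair[1])
    return best_label if best_prob >= threshold else 'neutral'

print(classify_with_neutral(0.9, 0.1))    # positive
print(classify_with_neutral(0.55, 0.45))  # neutral
```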

As you can see, Naive Bayes is a simple machine learning technique that is quite efficient at automatically extracting sentiment from short sentences like tweets. However, it is important to notice that it can still be improved with new feature extraction approaches or other, more robust machine learning techniques.


Soon I will provide all the source code for this project, but if you want to play with the classifier, I've developed a simple REST API for running quick tests. This version is still not official, and it only works in the movie reviews domain (in Portuguese, of course).


The output is in JSON. Please note that this API is still in development, so if you plan on integrating it into a product, please let me know, so that I can contact you and help you create your own.

The classifier currently works on a single arbitrary text (in Portuguese) and returns its polarity as the result. Example:


{"results": {"polarity": "negativo", "text": "Encontro Explosivo filme ruim Danado", "query": "encontro explosivo"}}

Request parameters:
- text: the text you want to classify. This should be URL encoded.
- query: the subject. This should be URL encoded. (optional)

Response fields:
- text: the original text submitted
- query: the original query submitted
- polarity: the detected polarity. The possible values are:
       - 'negativo': corresponds to negative
       - 'neutro': corresponds to neutral
       - 'positivo': corresponds to positive
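A response like the JSON example above can be consumed with Python's standard json module:

```python
import json

# The example JSON payload shown earlier, parsed with the standard library.
response = ('{"results": {"polarity": "negativo", '
            '"text": "Encontro Explosivo filme ruim Danado", '
            '"query": "encontro explosivo"}}')
result = json.loads(response)['results']
print(result['polarity'])  # negativo
```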


As we can notice, sentiment analysis is a trend on the Web, with several interesting applications and a lot of data sources provided by users. Microblogs and social networks such as Twitter, Facebook, and Orkut, among other web services, are powerful crowdsourcing platforms for obtaining users' opinions on the Web about any subject, and especially for helping answer the question of what people are interested in. Despite the challenges, more and more companies and researchers are working in this area, so that one day it may be easy for users and companies to obtain complete, rich, summarized reports of opinions from the Web to support the decision making in their daily lives.

On the machine learning side, there are also many improvements that could be made to this task, and several challenges for researchers to work on. I would say the most important ones are related to natural language processing, which is directly associated with sentiment analysis. I would cite subjectivity identification (distinguishing whether a text is really an opinion or a fact); detecting sarcasm, which is not easy to extract; and multiple references to items in the same sentence (e.g., a tweet referencing both the iPhone and the iPod) with different opinions, which may confuse the classification. Misspelled and abbreviated words (common on blogs and social networks) also make search and classification harder. Many others could be cited here, and they can be found by searching for sentiment analysis on Google.

In the end, I would say that applying sentiment analysis to the Portuguese language is still a new area with great text mining challenges, since the language differs from the English on which most techniques in the literature are currently applied. My work is an initial discussion of how we can learn more about sentiment analysis and how Portuguese could fit into this context. More analysis and a better evaluation should be done to compare other popular machine learning techniques with the Naive Bayes classifier.

I hope you have enjoyed this tutorial. It's quite long, but I believe it can be a useful start for beginners in machine learning.


See you next time!

Marcel Caraciolo