Hi all,
In this post I would like to write about my last studies that I was focusing on. In recent years it is noticeable the increase in amount of information available on the Web. People have started to share in the Web their knowledge and opinions in blogs, social networks and other medium.
On top of that, it have appeared a new research field related to Natural Language Processing, called Opinion Mining also known as Sentiment Analysis. Sentiment Anaylsis aims to identify the sentiment or feeling in the users to something such as a product, company, place, person and others based on the content published in the web. In the view of the requester, it is possible to obtain a full report including a summary about what people are feeling about an item without the need of find and read all opinions and news related to it.
In machine learning area, sentiment analysis is a text categorization problem which desires to detect favorable and unfavorable opinions related to a specific topic. Its main challenge is to identify how the sentiments are expressed in text and whether they point a positive opinion or a negative one.
There are many applications for the use of Sentiment Analysis, some as follows:
- Sentiment Analysis in Company Stocks : An important data for investors in order to identify the humor of the market to the companies stocks based on the opinion of analysts, therefore, identifying the trends in their prices.
- Sentiment Analysis in Products: A company might be interested in the opinion of their customers about a certain product. For example, Google (or another company) can use the sentiment mining to know what the people are speaking about the Android cellphone: Nexus One. This information can be used to improve their products or even identify new marketing strategies.
- Sentiment Analysis on Places: One person who will travel might want know the best places to visit or the best restaurant to eat. The opinion mining could help those people recommending good places during the planning of their travel.
- Sentiment Analysis on elections: The voters could use the sentiment analysis to identify the opinion of other voters about a specific candidate.
- Analysis on games and movies: It is possible to mine the sentiments about games and even movies. We will talk more about this in the future.
Another application for sentiment analysis is on status messages on social networks such as Twitter or Facebook. Twitter has gained a special attention recently where people used it to express their opinion on certain topic. Therefore, applying sentiment analysis tools for Twitter that attempt to classify tweets into either positive, negative or neutral categories automatically could be quite useful for companies and marketers. It is a widely public data source that couldn't be ignored.
Of course, there are several tools in this area, but more focused on classifying tweets in english language. When using this tools for portuguese comments, many tweets end up in the wrong classification. It happens since those tools are generally trained with algorithms modeled to parse and extract english words. In Brazil, there is a quite interesting web tool called Opsys developed by Thomas Lopes which is focused on mining opinions from feeds and tweets in the web. Currently, its focus is on brazilian elections 2010 and investments (companies stocks). It works really well with portuguese texts and have a nice web summarization tool for analysis.
But since it is a new area, specially working with portuguese corpora I've decided to develop a simple working sentiment analysis tool for identifying opinions on from Twitter. For this, I apply a common and simple machine learning technique called Naive Bayes to classify the set of tweets related to movie reviews which I will explain more about it in the next sections.
Ok, But how does the sentiment analysis work ?! What are the required steps ?
- Data Collection and Pre-processing: In this step, it is important to search in the web the item of interest, that is, what you want to know the opinion about. It is important also to remove all facts that don't express opinions like news and objective phrases. if your system doesn't identify subjectivity. The focus is on the user's opinions. The pre-processing is also important in order to remove unnecessary words or irrelevant words to the next step: The classification.
2. Classification: The polarity of the content that must be identified. Generally, the polarities used are positive, negative or neutral.
3. Presentation of Results: In this step, the classification of several opinions must be summarized in order to be presented to the user. The goal is to facilitate the understanding and give a general comprehension about what people are talking about an item. This summarization can be expressed in graphics or text.
Data Collection and Pre-processing
The first step (Data Collection) is related to the information retrieving. It is necessary to extract keywords from the text that may lead to correct classification. Keywords about the original data are usually stored in the form of a feature vector, F = (f1,f2,... fn). Each coordinate of a feature vector represents one word, also called a feature, of the original text. The value for each feature may be a binary value, indicating the presence or absence of the feature, an integer which may further express the intensity of the feature in the original text. It is important to have a good selection of features since it strongly influences the subsequent learning in the machine learning process. The goal of selecting good features is to capture the desired properties of the original text that are relevant for the sentiment analysis task. Unfortunately, this task for finding best features does not exist. It is required to rely on our intuition, the domain knowledge and a lot of experimentation to choose the best set of features. I strongly recommend the study of the Natural Language Processing (NLP) subject, which it may help you to understand this vast research field.
Our approach includes the use of Bag-of-Words. It is a popular model used in Information Retrieving that takes individual words (unigrams) in sentence as features, assuming their conditional independence. So the whole text is represented by a unordered collection of words. Each feature in the vector represents a existence of one word. The challenge with this approach is the choice of words that are appropriate to become features.
For instance, considering this model, the tweet: 'Assisti hoje o filme Eclipse, ele é lindo !' may be represented by the following feature vector:
F = {'Assisti': 1 , 'hoje': 1, 'o': 1, 'filme': 1, 'Eclipse': 1, 'ele': 1, 'é': 1, 'lindo':1}
Here we represent the feature vector as a python dictionary.
Obviously, for any real use, we have to compare this vector to a feature vector that would have much larger number of words. It would be necessary in fact a dictionary of the language, however this model would be inefficient since it would overfit and lead to bad performance when exposed to new examples.
In literature, a common approach is to manually select the most important keywords, for example the word 'lindo' (adjective) is a good indicator of the author's opinion. The most important keywords such as 'excelente', 'horrível', 'fraco' would be selected as features since they express polarity of a sentence). However, Pang et al. show that manual keyword model is outperformed by statistical models, where a good set of words that represent features are selected by their occurrence in the existent training corpus. So, the quality of the selection would depend on the size of the corpus and the similarity of domains of training and test data. The use of corpus from different domains that don't have the same properties of the domain of the text we want to classify, may lead to inaccurate results. For instance, if the analysis is done on a set of tweets related to a product and is trained with a set based on movies, the most of the sentences would be misclassified. It is important also to create a dictionary which will capture the most important features for classifying previously unseen sentences.
Additionally, it is possible to remove some of the existing words that bring little useful information like pronouns, articles, prepositions, etc (List of stop words). Of course, the model presented here is simple, and there are several limitations, which include the inability to capture the polarity relation between words and different meanings of one word. Another limitations would be suppressed by use of regular expressions for handling the negation and parts of speech for a syntax analysis of the word.
Classification
Classification algorithms are efficient and consolidated techniques for this sentiment classification task, since it predicts the label for a given input. However, depending on which approach used (supervised or unsupervised), it will be required a training set with labeled examples before new tweets be classified. It is important to train the model with the set in a domain related to the data domain that will be used as input. The labels we are interested here are the subjectivity of the sentence and the polarity of the sentence (neutral, positive or negative). There are several machine learning techniques for this task, but in this article we are using the simplest one but with a great efficience in classification problems: Naive Bayes technique.
The Naive Bayes model or Naive Bayer classifier is a simple probabilistic classifier based on the Baye's theorem with strong independence assumptions. In simple terms, this probability model assumes that the presence or absence of a particular feature of a class is unrelated to presence or absence of any other feature. For instance, a car may be considered to be a vehicle if there are 4 tires, engine and at least 2 doors. Even if these features depend on each other, the naive Bayes classifier considers all these properties to independently contribute to the probability that the car is an vehicle.
In our case, each stemmed word in a tweet is taken to be a unique variable in the model, and the goal is to find the probability of that word, and consequently the all sentence itself, belonging to a certain class: Positive vs Negative. In spite of their naive design and simplified assumptions, naive Bayes classifiers have worked very well in many complex real-word situations. One of the extensive uses is in spam filtering. In fact, one of the most popular packages (SpamAssassin) is a direct implementation of Naive Bayes. It is a an amazingly simple approach and also a powerful classifier. More information about the algorithm of Naive Bayes it can be found here and here.
In our case, each stemmed word in a tweet is taken to be a unique variable in the model, and the goal is to find the probability of that word, and consequently the all sentence itself, belonging to a certain class: Positive vs Negative. In spite of their naive design and simplified assumptions, naive Bayes classifiers have worked very well in many complex real-word situations. One of the extensive uses is in spam filtering. In fact, one of the most popular packages (SpamAssassin) is a direct implementation of Naive Bayes. It is a an amazingly simple approach and also a powerful classifier. More information about the algorithm of Naive Bayes it can be found here and here.
In this tutorial I have used the implementation of a simple Naive Bayesian Classifier. There are some variations using Fisher Score discriminant function, which will not be explored in this article now.
In our problem domain, we will use the Naive Bayes classifier to classify the tweets in Positive or Negative. To apply this task, it is necessary to train the classifier by creating a training set of tweets (or words) already classified (in Positive or Negative). Trained the model, Naive Bayes assumes that all features in the feature vector (after the pre-processing step), and applies Bayes' rule on the tweet. Naive Bayes calculates the prior probability frequency for each label in the training set. Each label is given a likelihood estimate from the contributions of all features, and the sentence is assigned the label with highest likelihood estimate, that is, 'Positive' or 'Negative'.
Summarization
The last step: The presentation of the results. Generally, it can be presented by using text or charts. The summarization in text is not a easy task, and it has a research field only for it. The common presentation is a list of sentences where for each the classification is shown by its side.
Summarization by TwitterSentiment Analyzer - http://twittersentiment.appspot.com/ |
The other common option is to use charts to present the results. It's the simplest way, and there are many flavors for this presentation. And don't forget they are more attractive for the end users and reports.
Summarization by TwitterSentiment Analyzer - http://twittersentiment.appspot.com/ |
Our Study Case: Movies Reviews on Twitter
For a simple study case and demonstration here, I've created a training set with more about 300 sentences related to movie reviews on Twitter. So sentences such as 'Eu amei o filme Eclipse!' or 'Eclipse poderia ser melhor. #horrível Pior filme da saga.' are labeled under positive and negative polarities for creating the training set. It is important to notice that my data corpus are related to portuguese tweets. So english ones will not work properly. I've used the Twitter Search API for searching tweets with the keyword 'Eclipse', the new released movie in the theaters from the saga 'Twilight' and fetched tweets during three days (July 05th 2010 - July 07th 2010).
What people are talking about the movie Eclipse on Twitter ?? |
The task is to analyze the opinions about the movie in Twitter. Using the Naive Bayes Classifier with some adjustments to include the 'Neutral' Classification, we have analyzed 4265 tweets. I have used some pre-processing techniques explained above. All the code was written in Python. The result you can see in the graph below.
As you may notice, brazilians are really loving the movie Eclipse!! There are also a great amount on neutral tweets, it means that the classifier didn't have the confidence to classify as positive or negative, (I'll explain soon) or tweets that don't express sentiment (opinions).
Polarity Chart on tweets about the movie 'Eclipse' |
As you may notice, brazilians are really loving the movie Eclipse!! There are also a great amount on neutral tweets, it means that the classifier didn't have the confidence to classify as positive or negative, (I'll explain soon) or tweets that don't express sentiment (opinions).
You can see some of the tweets classified as follows:
Some Tweets classified related to the movie Eclipse |
The neutral classification, I've added a simple threshold where after evaluating the probabilities for each polarity (Positive and Negative). The maximum one is compared to a threshold and if its value isn't higher, it is classified as a neutral one. It means that the classifier doesn't have the confidence to classify the tweet as one of the classes trained in the model.
As you can see, the Naive Bayes is a simple machine learning technique quite efficient for automatically extract sentiment from small sentences like Tweets. However, it is important to notice that it still can be improved with new feature extraction approaches or another robust machine learning techniques.
Demo
Soon I will provide all the source code for this project, but if you want to play around with the classifier, I've developed a simple REST API for using the classifier for simple tests. This version is still not official and it only works with movie reviews domain (in portuguese of course).
Docummentation
The output result is in JSON. PLEASE THIS API is still in development. So if you plan on integrating this API into a product, please let me know , so that I can contact you in order to help you to create your own.
The classifier is only working with a single arbitrary text (in portuguese language) and as result give you its polarity. Example:
As you can see, the Naive Bayes is a simple machine learning technique quite efficient for automatically extract sentiment from small sentences like Tweets. However, it is important to notice that it still can be improved with new feature extraction approaches or another robust machine learning techniques.
Demo
Soon I will provide all the source code for this project, but if you want to play around with the classifier, I've developed a simple REST API for using the classifier for simple tests. This version is still not official and it only works with movie reviews domain (in portuguese of course).
Docummentation
The output result is in JSON. PLEASE THIS API is still in development. So if you plan on integrating this API into a product, please let me know , so that I can contact you in order to help you to create your own.
The classifier is only working with a single arbitrary text (in portuguese language) and as result give you its polarity. Example:
Query:
http://mobnip.appspot.com/api/sentiment/classify?text=Encontro+Explosivo+filme+ruim+Danado&query=encontro+explosivo
Response:
{"results": {"polarity": "negativo", "text": "Encontro Explosivo filme ruim Danado", "query": "encontro explosivo"}}
Parameters:
- text: The text you want to classify. This should be URL encoded.
- text: The text you want to classify. This should be URL encoded.
- query: The subject. This should be URL encoded. (optional)
Response:
- text: original text submitted
- query: original query submitted
- polarity. The polarity values are:
- 'negativo': corresponds to negative
- 'neutro': corresponds to neutral
- 'positivo': corresponds to positive
- text: original text submitted
- query: original query submitted
- polarity. The polarity values are:
- 'negativo': corresponds to negative
- 'neutro': corresponds to neutral
- 'positivo': corresponds to positive
Conclusions
As we can notice that sentiment analysis is a trend in the Web, with several interesting application with a lot of data sources provided by users. Microblogs and Social Networks such as Twitter, Facebook and Orkut and another web services are powerful crowding sources for obtain opinions from the users in the Web about any subject and specially to help to answer the question about what people are interested on. Despite the challenge, more companies and researchers are working in this area until one day it would be easy for users and companies to simply obtain complete and rich summarized reports about the opinions from the Web in order to support them in the decision making process in their daily life.
In machine learning topics, there are also a lot of improvements that would be made for this task and several challenges for researchers to extend it. I would say the most important ones are related to the natural language processing, which is directly associated to sentiment analysis. I would cite the subjectivity identification (distinguish if a text is really a opinion or a fact), identification of sarcasms, which are not easy to extract, the multiple reference of items in a same sentence (a tweet referencing Iphone and Ipod) and with different opinions, which it may confuse the classification. Misspelling words and abbreviated words (common in blogs and social networks) are also difficulties for search and classification. Many others could be cited here but it may be found if you search on Google about sentiment analysis.
In the end, I would tell that applying sentiment analysis in portuguese language is still a new area with great challenges to deal with text mining since the language is different from the current ones applied on english in the literature. My work is a initial discussion about how we can learn more about sentiment analysis and how portuguese could be inserted in this context. More analysis and a better evaluation should be applied to compare another popular machine learning techniques with Naive Bayes classifier.
I expect you have enjoyed this tutorial, it's quite long but I believe it could be a useful start for the beginners in machine learning.
Thanks,
See you next time!
Marcel Caraciolo
Hi Marcel, I'm a master degree student and I'm very interesting about sentiment analysis, especially with social networks.
ReplyDeleteHow much is your accuracy? Can you share your training corpus?
Hi Marcel, I'm would like to know your training / test corpus and your accuracy, too... Can you comment about this ?
ReplyDeleteNice work. My graduation paper was on the same line.
ReplyDeleteI did some feature extraction and product sentiment analysis. I didnt get to the summarization part, though.
Hi Marcel,
ReplyDeleteCan you tell us how many tweet you had to tag manually to train your classifier ?
Thank you
@silvano I'm writting an article about this and I'm pretending to post the results and accuracy here at my blog. Initial analysis using a test set gave me 80,51% of accuracy. Further evaluation using more experiments and ROC will give more details about it.
ReplyDeleteSilvano unfortunately the training set is still not available, but I will release soon.
----
@murilohq Hi Murilo, my training set was collected manually around 1500 tweets labeled positive and negative. My main contribution is the Pre-processing where I am using some custom techniques (NLP) to remove inappropriate words (POS-TAGGER in Portuguese) and regular expressions to correct some slangs common in informal portuguese (ameeeei for example.)
The test set was also manually labeled and used about 25% of all data to be used to evaluate the accuracy. I am still working on this, while I am writing a paper about this experiment.
-----
@biggahead Great, I'd like to see your work and we can discuss more about it! Is it available ?
---
@KG Hi KG, I've collected around 3000 tweets and after a first filtering (removing all neutral tweets) it gave me around 1500 tweets. Of course, the worst part is to label one by one, and I spent a lot of effort on it: The building of the corpus and training set.
I'm waiting for your paper. Thanks.
ReplyDeleteComo devo montar a URL no caso da frase:
ReplyDeleteO filme não é bom
thanks for this post!
ReplyDeleteGreat post Marcel, I'm a senior and I'm doing my senior design project that's very similar. Trying to classify tweets as 'sports', 'love', 'entertainment', 'health'...etc. I agree that the pre-processing is definitely tedious and annoying.
ReplyDeleteUsing naive bayes you cannot have anything more than 65% without constantly calculating the conditional probability indexes. The best approach I've managed so far (for unsupervised calculation of conditional probabilities), is around 80% (78,41%), using hybrid rule-based grammar and Hierarchically Hidden Markov Models instead of Bayes to reach that accuracy. We implemented our own maximum entropy tagger, collected and generated our own Annotated Corpora for the specific use within twitter (blogs use a different set of vocabulary), and even like that the error we saw is mostly because of a poor annotated corpora.
ReplyDeleteOlá pessoal, também estou desenvolvendo um projeto que analisa sentimentos. Gostaria de saber se você criaram um dicionário de expressões positivas / negativas ou se baixaram algo da web. encontrei isso em inglês, mas nada em portugês.
ReplyDelete[]´s
Hi, Can you provide your corpus of training?
ReplyDeleteHi, Marcel, i'm Carlos, how are you??, well i want ask you if you can explain me, how function the algorithms with the information extracted and what algorithms, You had been use. well i hope your reply
ReplyDeleteHi Marcel, how do you alter the language of english to portuguese in python
ReplyDeleteOi Marcel, a ferramenta Demo já está disponível?
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteThanks for sharing this great article, I really enjoyed the insign you bring to the topic, awesome stuff!
ReplyDeleteTaking NLP training is like learning the language of your mind.
ReplyDeleteNLP Certification in Chennai
It’s grate too good information thank you so much for sharing. NLP Training in Chennai | ramsnlp
ReplyDeleteDoesn't API work any more?
ReplyDeleteAmazing content.
ReplyDeleteData Mining Service Providers in Bangalore
Great blog. Livewire Technologiesis actively engaged in the autonomous driving space. Over the years, we have developed experience in Data Collection & Staging, Labeling & Annotation, and ADAS support.
ReplyDeleteThis professional hacker is absolutely reliable and I strongly recommend him for any type of hack you require. I know this because I have hired him severally for various hacks and he has never disappointed me nor any of my friends who have hired him too, he can help you with any of the following hacks:
ReplyDelete-Phone hacks (remotely)
-Credit repair
-Bitcoin recovery (any cryptocurrency)
-Make money from home (USA only)
-Social media hacks
-Website hacks
-Erase criminal records (USA & Canada only)
-Grade change
-funds recovery
Email: onlineghosthacker247@ gmail .com
This is such a great resource that you are providing and you give it away for free. I love seeing blog that understand the value of providing a quality resource for free.
ReplyDelete일본야동
Terimakasih artikel nya sangat bermanfaat, jika berkenan silahkan kunjungi website kami :
ReplyDeletecara melakukan sentimen analisis
SMM PANEL
ReplyDeleteSMM PANEL
Https://isilanlariblog.com/
İnstagram Takipçi Satın Al
Hirdavatci
beyazesyateknikservisi.com.tr
Servis
tiktok jeton hilesi
beykoz arçelik klima servisi
ReplyDeleteüsküdar arçelik klima servisi
pendik samsung klima servisi
pendik mitsubishi klima servisi
tuzla vestel klima servisi
tuzla bosch klima servisi
tuzla arçelik klima servisi
çekmeköy alarko carrier klima servisi
kadıköy toshiba klima servisi
Good content. You write beautiful things.
ReplyDeletesportsbet
mrbahis
hacklink
mrbahis
vbet
hacklink
sportsbet
korsan taksi
vbet
Success Write content success. Thanks.
ReplyDeletecanlı slot siteleri
betmatik
kıbrıs bahis siteleri
kralbet
betpark
canlı poker siteleri
deneme bonusu
slot siteleri
ReplyDeletekralbet
betpark
tipobet
betmatik
kibris bahis siteleri
poker siteleri
bonus veren siteler
mobil ödeme bahis
BAAZG
elazığ
ReplyDeletekağıthane
kastamonu
nevşehir
niğde
yalova
O5H
bilecik
ReplyDeletegebze
ısparta
şırnak
alsancak
YE3A
salt likit
ReplyDeletesalt likit
5VGG
ağrı
ReplyDeletemuş
mersin
afyon
uşak
DPRQ
https://saglamproxy.com
ReplyDeletemetin2 proxy
proxy satın al
knight online proxy
mobil proxy satın al
7ZS
ds
ReplyDeleteThis is very interesting article and effective dude. Thankyou for sharing
ReplyDeleteYour style is so unique. I’ve enjoyed read stuff here. Waiting for your next post.
ReplyDeleteHello, I'm happy to see some great articles on this site. thankyou so much
ReplyDeleteI know this is one of the most meaningful information for me. Thanks
ReplyDeleteThank you for sharing this information with us, Keep on bloggingg here man!
ReplyDeleteEasily, the article is actually the best topic on this issue. Great job. Thankyou
ReplyDeleteThis was an incredible post. Really loved studying this site post. Excellent points
ReplyDeleteI believe that this weblog is interesting. Thank you for your time and efforts here.
ReplyDeleteI am glad to see this site share valued information Nice one! Great dayyy
ReplyDeleteReally amazed with your writing talent. Thanks for sharing, Love your works
ReplyDeleteFantastic website. A lot of helpful info here. Thanks to the effort you put into this blog
ReplyDeleteSentiment analysis is rapidly evolving, fueled by data from microblogs and social networks like Twitter and Facebook. Despite challenges like identifying subjectivity, sarcasm, and handling multiple references with differing opinions, advancements in natural language processing continue to drive improvements. For Portuguese, this field presents unique challenges compared to English, necessitating tailored approaches. As researchers explore these complexities, sentiment analysis holds promise for enhancing decision-making processes by distilling comprehensive opinions from web data. Exciting developments lie ahead, promising richer insights and broader applicability in machine learning.
ReplyDeleteftyfjghjkhgkjhkljhlk
ReplyDeleteشركة عزل اسطح بالقطيف
fjgfjghkjhlkdfgjfdhg
ReplyDeleteشركة عزل اسطح بالقطيف
شركة شراء اثاث مستعمل بالرياض
ReplyDeleteشراء اثاث مستعمل بالرياض
35468
شركة شراء اثاث مستعمل بالرياض
ReplyDeleteشراء اثاث مستعمل بالرياض
35689
شركة الركن المثالي تقدم خدمات متخصصة في صيانة ولحام خزانات المياه بكفاءة عالية وحلول مبتكرة لضمان استدامتها.
ReplyDeleteشركة لحام خزانات
564859
ReplyDeleteThanks very much for helping me
ReplyDeleteThis article is truly amazing. Appreciative for sharing such mind blowing information.
شركة مكافحة حشرات بالجبيل 7Jsa3tqhvz
ReplyDeleteThanks for some other wonderful article
ReplyDeleteLet's say your blog is nice. Really excited to read more. Fantastic.
ReplyDeleteI admire the valuable information you offer in your articles. Thanks for posting it.
شركة تسليك مجاري بالقطيف f85ZNFUpxi
ReplyDelete
ReplyDeleteWe are really grateful for this blog post. Absolutely a Great work
صيانة افران الغاز بمكة v7iqQBKSbY
ReplyDelete
ReplyDeleteThis is the perfect website for anyone who really wants to understand this topic.
شركة تسليك مجاري بالاحساء YXYMsGk8aG
ReplyDeleteشركة تنظيف بالدمام hT2ZDyekjj
ReplyDelete
ReplyDeletewow just what I was looking for
ReplyDeleteThanks for the info I will try to figure it out for more
ReplyDeleteUsually I don't read post on blogs, however I wish to say that this write-up very impressived!!
Great insights on sentiment analysis! Your breakdown makes a complex topic feel approachable. Looking forward to reading more of your work. Keep it up! abogado tráfico appomatox virginia
ReplyDeleteشركة مكافحة بق الفراش بالاحساء 2eF5L3GdPk
ReplyDelete