Free Online Book for Introducing Data Mining!

Friday, December 31, 2010

For anyone who wants to start on the data mining topics, I extremely recommend this  book a available online for free.  I found it last week and one of the interesting features of this book that caught my attention was its structure and visualization.  They provided a table of content which is actually a Data Mining Map.

The book is practical and more suited for people who wants to be introduced to Data Mining in a easy way. There are a lot of pictures, tables , definitions and exercises.   This project was created by the Department of Chemical Engineering and Applied Chemistry at the University of Toronto.

Great work and finding!

Marcel Caraciolo

Developing a Computer-Assisted Twitter User . Meet @caocurseiro !

Wednesday, December 29, 2010

Hi all,

It has been a while since my last post but the reason is that I've been working in some new stuff related to social recommendation and at my master thesis. But this post is to talk about a recent project that I developed for a brazilian social network called AtePassar.  AtePassar is a brazilian social network for people who wants apply for positions at brazilian civil (government) services.  One of the great features of this social network is because people can share their interests about studies and meet people all around Brazil with same interests or someone that will apply for the same exam as him.  Can you imagine the possibilities ?

AtePassar  Main Page

It is a social network for students into a virtual space where there are several relations of friendship, studies and even exam partners. All the network services are free for the users and there is also a e-commerce store where the users can buy resources for helping them in their studies like video lectures, papers, books, etc. My current job there is to apply intelligence in the social network, specifically collective intelligence. Right now I am working in a big project for a social recommender at AtePassar (it will be soon posted here explaining how I managed to build it) and small projects for helping the social network to fetch more new users.   

One of them that I'd like to comment is the new Twitter Bot that I developed that tries to simulate a human, specifically controlling all the actions of a twitter account.  Let me explain more about it.  The goal is not to develop a SPAM bot or a bot that only talk about specified terms.  I imagined it as a bot that could post new updates about open public positions at Brazil, self-helping quotes for stimulating people and simple mentions related to the AtePassar. Also it would retweet tweets from another users that speak about open public exams in Brazil, send mentions to people that are interest about those subjects presenting them the AtePassar social network and even following those users.

Written in Python and hosted at Google AppEngine I created the bot @caocurseiro, who is the mascot of the social network. What does it do ?
  • Post 1 to n updates in a random time set by a interval defined by the administrator.
  • Send 1 to n mentions to new users  (@newuser) in a random time set by a interval defined by the administrator
  • Follow 1 to n new users in a random time set by a interval defined by the administrator
  • Retweet 1 to n tweets in a random time set by a interval defined by the administrator.
Of course, since the Twitter politics is against at all kinds of SPAM, I developed this bot to obey the rules and the rare limits established by them.  Other feature is that  he even sleeps during the night (our Brazilian local time)  for a random period.  I am still thinking how to improve it by answering mentions when is directed to him and improve the quality of the posts that sometimes come repeated.

It has been online for about two days our mascot and he's already following 42 users and it's followed by 25 users.  As soon as I put a new retweets rules which it would retweet posts from another users about  subjects related to open public exams and related stuff, I think it will improve much more the quality of the bot.

Unfortunately the code is not yet open-source, but I am working on it to provide it as open-source.  I think the biggest contribution is how to develop a computer driven Twitter User to be similar as human user as possible. Although he still uses static rules, maybe I could put some intelligence to him by answering real questions about any open public exams at Brazil ( future ideas).

Caocurseiro Twitter Computer's Driven

That's all,

Marcel Caraciolo

My lecture about Recommender Systems at IX Pernambuco Python User Group Meeting and my contributions.

Tuesday, December 7, 2010

Hi all,

It has been while since my last post, but my master thesis is taking a lot of time available. Soon I will come back with posts and content related to what I am working now.

But the main reason of this post is to publish my lectures that were recorded at the IX Pernambuco Python User Group Meeting (PUG-PE) last month in November.  I had the opportunity to talk more about what I am studying, which is related to the topics of recommender systems and a lighting talk about lighting talks!

But since this blog concerns about artificial intelligence,  I will focus on the recommender systems. In this lecture I've introduced the main concepts behind recommender systems, how it works, the advantages and drawbacks of each classing filtering algorithm.  Both examples presented were used in my lecture at PythonBrasil (a main meeting that joins all Python Users of Brazil).  The result of this project will be explained in the future in two posts. But let me explain my main contributions in this field.

One is my work currently at the startup Orygens, where I am developing a novel recommender system applied on social networks. The idea is to recommend users and content to the users of a brazilian social network called AtePassar.

Main Profile of AtePassar

The other contribution is the development of a framework written purely in Python called Crab.  It was originally idealized by me to be a simple easy-to-use recommendation engine framework in order to  be applied on any domains. Besides it will be used to test new approaches, since it will be easy to plug-in new recommender algorithms and test them with the evaluation tools available.  This project is open-source is completely available at my account on

The main page of the Crab Project

Today we have four collaborators and we are planning to keep going forward with some demos and a distributed computing framework totally integrated with Crab.  More information I will provide soon here at my blog with some demonstrations.

My video about Recommender Systems.  The video is in portuguese.

Wait for news!  Please any further information, contact me!

Marcel Caraciolo

Recommender Engines and Data mining with Python at PythonBrasil Conference

Thursday, October 28, 2010

Hi all,

Last week I was at an important event in Curitiba called Python Brasil. It is a annual event where it joins several brazilian developers to discuss about technology and of course about the programming language Python.

I also had the opportunity  to lecture three presentations about several topics of my interest.   The official presentation was about recommendation engines with Python.  This work shows how developers could use python in developing recommendation engines with several examples and explaining the main concepts behind this subject.  The best part was the demos where I used real data from the web such as Twitter, to suggest users that are similar to me among the PythonBrasil followers.  The other example is based on collective buying, which i crawled some popular brazilian web sites and gathered real offers at Curitiba. The main idea is to recommend new offers according to my interests and what people similar to me also liked. It is the classical example of the collaborative filtering, commonly used at several e-commerces today including Amazon.
The presentation was great with lots of feedback.  If you want to take a look at the slides (it is on portuguese) please take a look here:

The main contribution of this work is a new library for building recommendation engines in Python language called Crab.  I've decided with some colleagues to develop this library in order to be a powerful tool for developers to use python as the main language to build and use classical recommender algorithms in their applications. Besides it is extensible so developers can add easily new algorithms to the engine. There is also a easy API so users can plug with their web apps running in different platforms such as Django, AppEngine, Web2Py, etc. 

If you want to collaborate or interest about this subject please feel free to join us at this work. The project is hosted at GitHub with the link:

My second presentation was a lighting talk. What's that ?  There is a extra category of presentations in PythonBrasil where you have 5 minutes to speak about any topic you want.  The one rule is 5 minutes, no more!   I was challenged so I one day before the presentation to develop a web crawler in order to scrap all the lectures submitted and approved at PythonBrasil conference.  With all this data in my hands I've decided to make an analysis to answer three questions in my mind:

a) Which are the main and  frequent topics showed at the keynotes at Python Brazil ?

b) Based on this information, how we could organize the speakers based on those topics ? That is group speakers with similar topics. A classical problem of clustering.

c) What information we could also extract such as level of expertise of the lectures, total time spent in the lectures, etc.

For all those questions I was seeking to answer I decided to use Python, Matplotlib and Ubigraph (A 3D Visualization tool for graphs).   It was really interesting because I really could find some groups based on similar interests.  The main subjects was Entrepreneurship, Hardware, Web, Design Patterns, Data Mining, Django and Artificial Intelligence.

With those subjects I could now group the speakers using a simple clustering algorithm such as K-means and organize them based what their topics were. I've recorded a simple video to present the result using the tool Ubigraph. Take a look:

The presentation in portuguese  you can see here:

In the end I think the event was awesome some great keynotes and of course lots of new contacts at my network.  I have to say it is a great opportunity to meet great people and share ideas and technology!

Next year it will be in São Paulo, Brazil. I expect to be there !

Best regards,

Marcel Caraciolo

My new experiment: TweetTalk : A Twitter Post Chatter ;D Update tweets from your Google Talk Account!

Saturday, October 2, 2010

Hi all,

During my recent studies about Chatter Bots, I've inspired myself to build a new one now integrating Twitter and Google Talk IM Service.  His name is  TweetTalk.  What's the catch ?  

TweetTalk is a Jabber IM Bot for anyone who wants to quickly update a post on your Twitter Account. So instead of going to a separated client, directly from anywhere you can access the Google Talk Engine (Web mail or Desktop Client) you can just write your tweet and the bot will responsible of sending the post to the Twitter.

Let's see it in action:

The tweet status on Twitter:

As you can see, it is really fast now for me to send a tweet from my webmail gmail. It is a simple experiment of how those type of bots can improve your life and work daily.

If you wanna try it, please add to your contacts:  

By the way I am using Python for developing the main logic of the Bot.  For web communication I used Django + Google AppEngine + Twitter API.  And as the bot infra-structure the Imified API.

I am writing some new articles about performance evaluation, recommendation engines, REST APIs and SVM with Keyword/Term Extraction. 

Stay tuned !

Marcel Caraciolo

Tools for Machine Learning Performance Evaluation: ROC Curves in Python

Tuesday, September 21, 2010

Hi all,

Continuing from the my last post about performance analysis on classification machine learning techniques, in this article I will talk about a specific analysis called Discriminatory Power Analysis by using Receiver-Operating Characteristic Curves  (ROC Curves).

First, let's review and introduce some concepts related to the classification problems. When we deal with systems that involves the detection, diagnostics or even the prediction of results, it is really important to check the obtained results in order to validate the discriminative power as good or not of a given analysis.  However only depending on the quantization of hits and misses on a test group does not really tell how good a system is, since it depends on the quality and distribution of the test group data.

For instance, let's consider a fraud detection system, which outputs are 1 or 0, indicating whether a transaction has been classified as a fraudulent or not.  Now suppose that we applied the system on a test group of 100 transactions which we already know which one was a fraud or not, and the system correctly identified 90% of conditions. Do you think it is a good performance, don't you ?

Actually it is not. One information is missing and it is how the test group was distributed.  Let's consider that actually 90% was fraudulent. The system, thus, could have considered  '1' to every input given, and still have achieved a 90% accuracy rate, as the 10% transactions left were not fraudulent. By now, we cannot be sure if the system was really good or worked as a random classifier considering everyone as fraudulent without any previous knowledge or calculations.

In this case, we have to consider another metrics in order to evaluate this unbalance in the test groups. As I said in a previous post in this blog about confusion matrix, it will act as a base for the measures used in the discriminatory power analysis in question.

Confusion Matrix

From this table, I will present some metrics used to help in this analysis.

1. Accuracy
It measures the proportion of correct predictions considering the positive and negative inputs.  It is highly dependant of the data set distribution which can easily lead to wrong conclusions about the system performance.

          = (TP + TN) / (P + N)

2. Sensitivity

It measures the proportion of the true positives, that is, the ability of the system on predicting the correct values in the cases presented.

           =  TP/ (TP+FN)

3. Specificity

It measures the proportion of the true negatives, that is, the ability of the system on predicting the 
correct values for the cases that are the opposite to the desired one.

           =  TN/ (TN +FP)
4. Efficiency

It is represented by the mean of Sensibility and Specificity.  The perfect decision would be  with 100% of specificity and 100% sensitivity. However this situation is rarely conceived, so a balance between both metrics must be obtained. This metric is a good evaluator for measure the responsiveness in a overall situation to the production of false positives and false negatives. Generally, when the system is too responsive to positive, it tends to produce many false positives and vice-versa.

EFF = (SENS + SPEC) /2

5. Positive Predictive Value
This measure indicates the estimation of how good the system is when making a positive affirmation. However it is not recommended to use it alone, causing easily to lead to wrong conclusions about system performance. It is the proportion of the true positives in contrast with all positive predictions.

         = TP / (TP + FP)
 6. Negative Predictive Value

This measure indicates the estimation of how good the system is when making a negative affirmation. However it is not recommended to use it alone, causing easily to lead to wrong conclusions about system performance. It is the proportion of the true negatives in contrast with all negative predictions.

         = TN / (TN + FN)

7. Phi (φ) Coefficient

It is a measure not commonly used and measure the quality of the confusion matrix in a single value which can be compared. For instance, if you have two binary classifications with same classes but with different sizes, it can be used to compare both of them. It returns a value between -1 and +1, where +1 represents a perfect prediction, 0 a random prediction, and -1 an inverse prediction.

φ =  (TP*TN - FP*FN) / sqrt( (TP+FP) * (TP +FN) * (TN+FP) * (TN +FN) )

Source Code

As I presented in the first post of this series, I've updated the confusion matrix by adding the new metrics presented above.

View the script at Gist.Github

The Receiver Operating Characteristic (ROC) Curve

The ROC curve was developed during the World War II and was extremely used by engineers to detect enemy objects in enemy fields. It became famous and widely used in other areas such as medicine, radiology, etc.  More recently, it has been introduced to the area of machine learning and dataming.

The main idea behind the ROC curves is to analyze the output from the classification systems, which are generally continuous.  Because of that, it is necessary to define a cut-off value (or a discriminatory threshold) to classify and count the number of positive and negative predictions (such as the fraudulent or legal transactions in the case of the statuses in bank transactions).  This threshold can be arbitrarily determined, so the best way to compare the performance of different classifiers is to select several cutoff values over the output data and study their effect.

Considering many thresholds, it is possible to calculate a set of pairs (sensitivity, 1-specificity) which can be plotted in a curve.  This curve will be the ROC curve for the system where the y-axis (ordenades) represents the sensitivity and the x-axis (abscissas) represents the complement of specificity (1 - specificity).

Examples of ROC Curves are presentend below.

Examples of ROC Curves (From: CezarSouza's Blog)

As you can see in the figure above, higher the ROC curve's area, better the system.  This is a standard measure for classifiers comparison, which is called the area under the ROC Curve (AUC). It is evaluated by a numerical integration, such as, for example, the trapezoidal rule.

Now let's go to the development of a simple ROC curve in Python.  During a data mining project that I worked on last year, I decided to implement a ROC curve in order to analyze the performance of some classifiers.  This work resulted into a open-source library called PyROC which you can use inside your own applications for the creation, visualization and analysis of ROC curves.

Source Code

The project is hosted at my personal repository at GitHub.

Feel free to download the source code and use in your applications.

Using the code

To start using the code in your projects, just create a new ROCData object passing as argument the list of tuples containing the actual data, as measure by the experiment, and the predicted data as given by the prediction classsifier.  The actual data must be a dichotomous variable, with only two possible valid values. Generally it is assumed the values 0 and 1 for represent the two states. For the test data, given by the classifier their values must be continuous and the range between the highest and lowest values considered in the actual data. For instance, if the values are 0 and 1, the test data values must be inside the [0,1] range.

from pyroc import *
random_sample  = random_mixture_Model()  # Generate a custom set randomly
print random_sample[(1, 0.53543926503331496), (1, 0.50937533997469853), (1, 0.58701681878005862), (1, 0.57043399840000497),
(1, 0.56229469766270523), (1, 0.6323079028948545), (1, 0.72283523937059946), (1, 0.55079104791257383),
(1, 0.59841921172330748), (1, 0.63361144887035825)]
To compute the area under the curve (AUC) and plot the ROC curve just call the RocData object's methods auc() and plot() . You can also pass the desired number of points to use  for different cutoff values.

#Example instance labels (first index) with the decision function , score (second index)
#-- positive class should be +1 and negative 0.
roc = ROCData(random_sample)  #Create the ROC Object
roc.auc() #get the area under the curve
roc.plot(title='ROC Curve') #Create a plot of the ROC curve
Some metrics are also available evaluated from the confusion matrix such as accuracy, specificity, sensitivity, etc.

#threshold passed as parameter - the cutoff value. (0.5) 
{'FP': 18, 'TN': 82, 'FN': 18, 'TP': 82}
         Actual class
        +(1)    -(0)
+(1)    82      18      Predicted
-(0)    18      82      class
{'FP': 18, 'TN': 82, 'FN': 18, 'TP': 82}
{'ACC': 0.81000000000000005, 'PHI': 0.62012403721240439, 'PPV': 0.81632653061224
492, 'SENS': 0.80000000000000004, 'NPV': 0.80392156862745101, 'SPEC': 0.81999999
999999995, 'EFF': 0.81000000000000005}

At least if you wanna plot two curves in the same chart, it is also possible by passing a list of RocData objects. It is useful when you want to compare different classifiers.

x = random_mixture_model()
r1 = ROCData(x)
y = random_mixture_model()
r2 = ROCData(y)
lista = [r1,r2]
plot_multiple_roc(lista,'Multiple ROC Curves',include_baseline=True)

There is also another option for you to use the script : stand-alone. Just run at the terminal the command below to see  the available options.

Run at your terminal python


Let's see some ROC curves plotted in action.

 I've finished the second part of the performance analysis on machine learning classifiers. This is a simple introduction, so if you want a better understanding of how ROC curves work and its meaning, please take a look at this website about ROC curves and its applications. It includes many applets for you experiment with the curves. I expect you have enjoyed this post. The next one in this series will be about the F1-Score also known as F-measure.

A.I. Challenge with the game Planet Wars sponsored by Google!

Sunday, September 19, 2010

Hi all,

Recently Google and the University of Waterloo's computer science club have launched an Artificial Intelligence challenge. The task is to write a program to compete in a Planet Wars tournament.  Your goal is to conquer all the planets in your side of space or destroy all of your opponents ships.  The contest supports languages in Python, Java, C# and C++. The support for Ruby, Lisp, Haskell and Perl are still in development.

The challenge has already starded and ends on November 27th.  It sounds quite interesting!!

If you don't know about this game, I can give you some glues. Planet Wars is a strategy game inspired by Galcon Iphone and desktop version. Take a look at Planet Wars game in action:

I am planning to participate also in this challenge. It may be quite interesting for reviewing and practicing machine learning techniques. There is also a rank, let's see which position I will place myself.  ;D

Marcel Caraciolo

Chatterbots and Natural Language Processing: A real example with Brazilian Correios Bot TweetComendas

Tuesday, September 14, 2010

Hi all,

In a previous post at this blog I've presented two web robots that I developed : Tweetcomendas and TransitoRE.  For Tweetcomendas, which is a Twitter bot for track mail orders from Brazilian post office Correios, I've developed a chat bot to interact with the users. Besides Twitter,  the user  also can track his orders by asking at real-time our Chat Client TweetComendas.

Recently, several approaches were proposed in the research domain of chatterbots. But first let's review the concepts of a chatterbot or chatbot.  Chatterbots are computer programs that seek to simulate a human in the conversation with other people. Its goal is to answer questions in a manner that people have the impression they are talking with a real person and not a computer program. This  tatement remembers me the Turing Test which is a test proposed by Alan Turing in his 1950 paper in order to evaluate the machine's ability to demonstrate intelligence by using conversation between humans and machines. It's quite interesting, I recommend the reading.

So after the dispatch of the questions in natural language, the program consults a knowledge database and then provides an answer which tries to mimic the human behavior.  Although, many chatterbots do appear to simulate the human input intelligently in their responses,  the bot that I developed tries to simulate an human assistant by scanning for keywords within the input and answers the closest as possible in natural language the statuses of the orders at Brazilian Correios.

Recently, chatbot techniques have been used for several practical purposes such as online help, personalised services or information acquisition, by working as a conversational agent. One of the classical examples of Chatterbot is A.L.I.C.E, which uses a weak A.I.. In the example of A.L.I.C.E, it uses a programming language called AIML, which is specific to its function as a conversational agent and has been adopted by many developers, which created the called Alicebots.


The one that I've liked a lot and at some way inspired to develop my own bot was the Aadvark ChatterBot. Aadvark, recently bought by Google, was  a startup focused on creating a social recommendation engine mashing-up some components of Yahoo! Answers and Instant Messaging.  Their service is to offer the user the possibility of asking a question to the bot, and based on its algorithms it will compute its own tag (keywords or you could give it a few) and then, it will find a right person to your question and it will forward it to him in order to  answer. The best part is that Aadvark works very well not only in their web page but also in another clients such as IM's, Iphone, etc.  So you have a question, just open the chat screen of the bot and ask a question and after some minutes you may get a useful answer in the same session (client). Try it for free and see its power!

Aadvark Chat Client

The Tweetcomendas ChatterBot is actually a Jabber Chat Clients which uses the XMPP protocol for receiving and pushing text messages. I've used as an example of another bots that I am planning to develop. Let's see a working example of how it works:

1- First, we have to add the chatter bot (it uses the IMIFIED IM's  bots infrastructure, so you can host your bot there for free.).

2 - After he comes online, you can send at anytime the tracking code of your package order from Correios in order to track it.

It uses regular expressions to match the tracking code.

3 - Wait while he looks at the Correios Tracking WebServices and receive the current status.

The current status in natural language for the user.

3 - Now, another useful feature of the Tweetcomendas ChatterBot is he can actually track for you automatically your order and any new updates he will come to you by giving the new status on your chat client if you're online or by sending you an e-mail (It only works with gmail by now.)

He will ask you politely if you want him to track your order for you.

A gentleman chatterBot :D

4 - Receive new updates of your order by e-mail or by your Jabber Client.

Updates on your e-mail even when you 're not online at the Google Talk

So, this is my first attempt to develop real useful chatter bots for helping users in order to track their post orders. There are many ideas in my mind that I want to create. I believe this kind of tool is the next generation of agents for interacting with people working as an assistant such as the Siri: a mobile assistant agent running on Iphones.  Take a look at the video and see the future of those kind of technologies.

I expect that you have enjoyed my simple bot, it still sometimes can be lazy but in almost of the time it works smoothly. Add to your Google Talk Contacts:   This is a personal experiment project that I've developed with the help of Ricardo Caspirro (@ricardocaspirro) on intelligent agents and bots.  If you want more information  about it, take a look at its homepage. There is also other client working on Twitter!

That's all,


Marcel Caraciolo

Knowing the 'hidden' class re.Scanner in Python and how to use it: Building a simple Tweet Extractor

Wednesday, September 8, 2010

Hi all,

I've written a simple post at my personal blog (Yes, I have one, but it talks more about my daily life, personal projects and geek stuff) Mobideia about the use of a hidden class 'Scanner' from the module re, responsible for handling with regular expressions.

The practical example includes a simple Tweet Extractor for identify the URLs, (,  usernames (@joao), hashtags (#django), Retweets (RT) and words.  It may be very useful for you who wants to play with Natural Language Processing (NLP) and text mining on Twitter.

In order to avoid the replication of all text here,  I've decided to place the links for accessing the post:

Portuguese Version

English Version (by Google Translate):

I hope you enjoy!


Marcel Caraciolo

Tools for Machine Learning Performance Evaluation: Confusion Matrix

Tuesday, August 31, 2010

Hi all, 

I'll start to write some posts starting from now about Supervised and Unsupervised learning, specific related to performance evaluation such as classification accuracy, lift, roc curves, F1-Score and errors.

The Confusion Matrix

Let's start with the one popular tools to evaluate the performance of a model in tasks of classification or prediction:  The confusion matrix (in unsupervised learning it is typically called a matching matrix). Its focus is on the predictive capability of a model  rather than how fast the model takes to perform the classification, scalability, etc.

The confusion matrix is represented by a matrix which each row represents the instances in a predicted class, while each column represents in an actual class. One of the advantages of using this performance evaluation tool is that the data mining analyzer can easily see if the model is confusing two classes (i.e. commonly  mislabeling one as another).

The matrix also shows the accuracy of the classifier as the percentage of correctly classified patterns in a given class divided by the total number of patterns in that class. The overall (average) accuracy of the classifier is also evaluated by using the  confusion matrix.

Let's see a confusion matrix in action by showing an example. Imagine that you have a dataset that consists of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS).  For a classifier trained with this dataset to classify an e-mail as 'Spam' or 'Non-Spam', we can use the confusion matrix to see the classification accuracy based on the training data. In the example confusion matrix below, of the 33 patterns that are 'Spam' (S),  27 were correctly predicted as 'Spams' while 6 were incorrectly predicted as 'Non-Spams' (NB) (achieving an accuracy of 81.8%).  On the other hand, of the 67 patterns that are 'Non-Spams', 57 are correctly predicted as 'Non-Spams' while 10 were incorrectly classified as 'Spams' (an accuracy of 85.1%).  The overall accuracy of the classifier  for predicting both classes given this dataset is evaluated achieving 83%.

Confusion Matrix on spam classification model

However the confusion matrix only tell us how the classifier is behaving for individual classes. When a data set is unbalanced (where the number of samples in one class is significantly more than that in the other class - it happens a lot with Spam/Non-Spam datasets) the accuracy evaluated of a classifier is not representative of the true performance of the classifier. For instance, imagine there are 990 patterns that are 'Non Spam'  and only 10 patterns that are 'Spam' , the classifier can easily be biased towards the class 'Non Spam'.  If the model classifies all the samples as 'Non-Spam', the accuracy will be 99%.  And this is not real indication of the classification's performance. The classifier has a 100% recognition rate for 'Non-Spam'  but a 0% error rate for 'Spam'. Looking at the matrix, the system has trouble in predicting the 'Spam' class, even though the system has to be 99% accurate in its prediction. Given that the prediction of 'Spam' class would be the one of actual interest, only using the confusion matrix to evaluate the model's performance is not enough, but it can give us an insight of how the model is predicting the classes and start to use other metrics that we will explain in the next section.

Confusion Matrix on a unbalanced dataset

The Table of  Confusion

In the Confusion Matrix, for each cell in the matrix we have fields as True Positives, False Positives, False Negatives and True Negatives.  These are defined as:
  • False Positive (FP):  Falsely Predicting a label (or saying that Non-Spam is a Spam).
  • False Negative (FN):  Missing and incoming label (or saying a Spam is Non-Spam).
  • True Positive (TP):  Correctly predicting a label (or saying a Spam is Spam).
  • True Negative (TN): Correctly predicting the other label (or saying Non-Spam is Non-Spam).

Looking at the confusion matrix in a general view is as follows:

Confusion Matrix
How can we use those metrics ?  For instance, let's consider the previous model now for predicting if a text message have positive or negative opinion associated (common in sentiment analysis task).  We have a data set with 10.000 text messages where the model correctly predicts 9.700 negative messages, and 100 positive messages. The model still incorrectly predicts 150 messages which are positive to be negative, and 50 messages which are negative to be positive.  The resulting Confusion Matrix is shown below.

Confusion Matrix on Sentiment classification task

For the binary classification problems, which was our case situation , we can derive from those metrics two equations called sensitivity and specificity. They are commonly used for the evaluation of any binary classifier. 

The Specificity (TNR) measures the proportion of messages that are negative (TN) of all the messages that are actually negative (TN+FP). It can be looked at as the probability that the message is classified as negative given that the message does not contain negative words. With higher specificity, fewer positive messages are labeled as negative.

On the other hand, Sensitivity (TPR) is the proportion of messages that are positive (TP) of all the messages that are actually positive (TP+FN).  It can be seen as the probability that the message is positive given that the patient contain positive words. With higher sensitivity, fewer actual messages will be classified as negative.  

Sensitivity can be expressed as :
  • TP / (TP+FN)
and then Specificity which is:
  • TN / (TN+FP)

In general here, Sensitivity means the accuracy on the class Negative, and Specificity means the accuracy on the class Positive. So using these metrics, what is the accuracy on Positive and Negative messages  ?
  • Sensitivity = TP / (TP+FN) = 100/(100+50) = 0.4 = 40% 
  • Specificity = TN / (TN+FP) = 9700/(9700+150) = 0.98 = 98% 

As you can see, if we have a test for sentiment classification with 40% sensitivity and 98% specificity, and we have to check 1000 messages, and 500 of them are positive and 500 are negative. You are likely to get about 200 messages true positives, 300 messages false negatives,  490 true negatives and 10 false positives. You can conclude that the the negative prediction is more confident, specially based on the high value of specificity and the low level of sensitivity. As you can see it's a important metric for analyzing the performance of your classifier only looking both separated.
The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using the ROC curve, which it will be one of the next posts about this topic.

I've developed some code in Python for evaluating the Confusion Matrix, Specificity and Sensitivity of a classifier here.  Please make the necessary changes for adapting for your classifier. 

That's all,

I expect you have enjoyed!


Marcel Caraciolo


MapReduce with MongoDB and Python

Saturday, August 21, 2010

Hi all,

 In this post, I'll present a demonstration of a map-reduce example with MongoDB and server side JavaScript.  Based on the fact that I've been working  with this technology recently, I thought it would be useful to present here a simple example of  how it works and how to integrate with Python.

But What is MongoDb ?

For you, who doesn't know what is and the basics of how to use MongoDB, it is important to explain a little bit about the No-SQL movement. Currently, there are several databases that break with the requirements present in the traditional relational database systems. I present as follows the main keypoints shown at several No-SQL databases:
  • SQL commands are not used as query API (Examples of APIs used include JSON, BSON, etc.)
  • Doesn't guarantee atomic operations.
  • Distributed and horizontally scalable.
  • It doesn't have to predefine schemas. (Non-Schema)
  • Non-tabular data storing (eg; key-value, object, graphs, etc).
Although it is not so obvious, No-SQL is an abbreviation  to Not Only SQL. The effort and development of this new approach have been doing a lot of noise since 2009. You can find more information about it here and here.  It is important to notice that the non-relational databases does not represent a complete replacement for relational databases. It is necessary to know the pros and cons of each approach and decide the most appropriate for your needs in the scenario that you're facing.

MongoDB is one of the most popular No-SQL today and what this article will focus on. It is a schemaless, document oriented, high performance, scalable database  that uses the key-values concepts to store documents as JSON structured documents. It also includes some relational database features such as indexing models and dynamic queries. It is used today in production in over than 40 websites, including web services such as SourceForge, GitHub, Eletronic Arts and The New York Times..

One of the best functionalities that I like in MongoDb is the Map-Reduce. In the next section I will explain  how it works illustrated with a simple example using MongoDb and Python.

If you want to install MongoDb or get more information, you can download it here and read a nice tutorial here.

Map- Reduce 

MapReduce is a programming model for processing and generating large data sets. It is a framework introduced by Google for support parallel computations large data sets spread over clusters of computers.  Now MapReduce is considered a popular model in distributed computing, inspired by the functions map and reduce commonly used in functional programming.  It can be considered  'Data-Oriented' which process data in two primary steps: Map and Reduce.  On top of that, the query is now executed on simultaneous data sources. The process of mapping the request of the input reader to the data set is called 'Map', and the process of aggregation of the intermediate results from the mapping function in a consolidated result is called 'Reduce'.  The paper about the MapReduce with more details it can be read here.

Today there are several implementations of MapReduce such as Hadoop, Disco, Skynet, etc. The most famous is Hadoop and is implemented in Java as an open-source project.  In MongoDB there is also a similar implementation in spirit like Hadoop with all input coming from a collection and output going to a collection. For a practical definition, Map-Reduce in MongoDB is useful for batch manipulation of data and aggregation operations.  In real case scenarios, in a situation where  you would have used GROUP BY in SQL,  map/reduce is the equivalent tool in MongoDB.

Now thtat we have introduced Map-Reduce, let's see how access the MongoDB by Python.


PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python. It's easy to install and to use. See here how to install  and use it.

Map-Reduce in Action

Now let's see Map-Reduce in action. For demonstrate the map-reduce I've decided to used of the classical problems solved using it: Word Frequency count across a series of documents. It's a simple problem and is suited to being solved by a map-reduce query.

I've decided to use two samples for this task. The first one is a list of simple sentences to illustrate how the map reduce works.  The second one is the 2009 Obama's Speech at his election for president. It will be used to show a real example illustrated by the code.

Let's consider the diagram below in order to help demonstrate how the map-reduce could be distributed. It shows four sentences that are split  in words and grouped by the function map and after reduced independently (aggregation)  by the function reduce. This is interesting as it means our query can be distributed into separate nodes (computers), resulting in faster processing in word count frequency runtime. It's also important to notice the example below shows a balanced tree, but it could be unbalanced or even show some redundancy.
Map-Reduce Distribution

Some notes you need to know before developing your map and reduce functions:
  •  The MapReduce engine may invoke reduce functions iteratively; thus; these functions must be idempotent. That is, the following must hold for your reduce function:
                 for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)
  •  Currently, the return value from a reduce function cannot be an array (it's typically an object or a number)
  • If you need to perform an operation only once, use a finalize function.

 Let's go now to the code. For this task, I'll use the Pymongo framework, which has support for Map/Reduce. As I said earlier, the input text will be the Obama's speech, which has by the way many repeated words. Take a look at the tags cloud (cloud of words which each word fontsize is evaluated based on its frequency) of Obama's Speech.

Obama's Speech in 2009

For writing our map and reduce functions, MongoDB allows clients to send JavaScript map and reduce implementations that will get evaluated and run on the server. Here is our map function.


As you can see the 'this' variable refers to the context from which the function is called. That is, MongoDB will call the map function on each document in the collection we are querying, and it will be pointing to document where it will have the access the key of a document such as 'text', by calling this.text.  The map function doesn't return a list, instead it calls an emit function which it expects to be defined. This parameters of this function (key, value) will be grouped with others  intermediate results from another map evaluations that have the same key (key, [value1, value2]) and passed to the function reduce that we will define now.

The reduce function must reduce a list of a chosen type to a single value of that same type; it must be transitive so it doesn't matter how the mapped items are grouped.

Now let's code our word count example using the Pymongo client and passing the map/reduce functions to the server.
Let's see the result now:

And it works! :D

With Map-Reduce function the word frequency count is extremely efficient and even performs better in a distributed environment. With this brief experiment we  can see the potential of map-reduce model for distributed computing, specially on large data sets.

All code used in this article can be download here.

My next posts will be about  performance evaluation on machine learning techniques.  Wait for news!

Marcel Caraciolo


CodeCereal : Blog about Logic programming and artificial intelligence!

Thursday, July 29, 2010

Hi all,

I'd like to share a great blog that my colleague and friend has started to write about Artificial Intelligence and Reasoning and expert systems: The blog CodeCereal by Daker Fernandes, an undergraduate student from  Informatics Center (CIN) / UFPE .

I've read some posts and it's quite interesting the recently posts about abudctive reasoning, knowledge representation and the building of expert systems using  Prolog.   Anyone interested about this area and logic programming must take a look at this blog!!

Congratulations Daker and keep going with this great work!



Working on Sentiment Analysis on Twitter with Portuguese Language

Tuesday, July 20, 2010

Hi all,

In this post I would like to write about my last studies that I was  focusing on. In recent years it is noticeable the increase in amount of information available on the Web. People have started to share in the Web their knowledge and opinions in blogs, social networks and other medium.

On top of that, it have appeared a new research field related to Natural Language Processing, called Opinion Mining also known as Sentiment Analysis. Sentiment Anaylsis aims to identify the sentiment or feeling in the users to something such as a product, company, place, person and others based on the content published in the web. In the view of the requester, it is possible to obtain a full report including a summary about what people are feeling about an item without the need of find and read all opinions and news related to it.
In machine learning area, sentiment analysis is a  text categorization problem which desires to detect favorable and unfavorable opinions related to a specific topic. Its main challenge is to identify how the sentiments are expressed in text and whether they point a positive opinion or a negative one.

There are many applications for the use of Sentiment Analysis, some as follows:
  • Sentiment Analysis in Company Stocks :  An important data for investors in order to identify the humor of the market to the companies stocks based on the opinion of analysts, therefore, identifying the trends in their prices.
  • Sentiment Analysis in Products: A company might be interested in the opinion of their customers about a certain product. For example,  Google (or another company) can use the sentiment mining to know what the people are speaking about the Android cellphone: Nexus One. This information can be used to improve their products or even identify new marketing strategies.
  • Sentiment Analysis on Places:  One person who will travel might want know the best places to visit or the best restaurant to eat. The opinion mining could help those people recommending good places during the planning of their travel.
  • Sentiment Analysis on elections: The voters could use the sentiment analysis to identify the  opinion of other voters about a specific candidate.
  • Analysis on games and movies:  It is possible to mine the sentiments about games and even movies. We will talk more about this in the future.
Another application for sentiment analysis is on status messages on social networks such as Twitter or Facebook. Twitter has gained a special attention recently where people used it  to express their opinion on certain topic.  Therefore, applying sentiment analysis tools  for Twitter that attempt to classify tweets into either positive, negative or neutral categories  automatically could be quite useful for companies and marketers. It is a widely public data source that couldn't be ignored.

Of course, there are several tools in this area, but more focused on classifying tweets in english language. When using this tools for portuguese comments, many tweets end up in the wrong classification. It happens since those tools are generally trained with algorithms modeled to parse and extract english words. In Brazil, there is a quite interesting web tool called Opsys developed by Thomas Lopes  which is focused on mining opinions from feeds and tweets in the web. Currently, its focus is on brazilian elections 2010 and investments (companies stocks). It works really well with portuguese texts and have a nice web summarization tool for analysis.

But since it is a new area, specially working with portuguese corpora I've decided to develop a simple working sentiment analysis tool for identifying opinions on from Twitter. For this, I apply a common and simple machine learning technique called Naive Bayes to classify the set of tweets related to movie reviews  which I will explain more about it in the next sections.

Ok, But how does the sentiment analysis work ?! What are the required steps ?
  1. Data Collection and Pre-processing:  In this step, it is important to search in the web the item of interest, that is, what you want to know the opinion about. It is important also to remove all facts that don't express opinions like news and objective phrases. if your system doesn't identify  subjectivity. The focus is on the user's opinions. The pre-processing is also important in order to remove unnecessary words or irrelevant words to the next step: The classification. 
      2. Classification: The polarity of the content that must be identified. Generally, the polarities used are positive, negative or neutral.

      3.  Presentation of Results: In this step, the classification of several opinions must be summarized in order to be presented to the user. The goal is to facilitate the understanding and give a general comprehension about what people are talking about an item.  This summarization can be expressed in graphics or text. 

Data Collection and Pre-processing

The first step (Data Collection) is related to the information retrieving. It is necessary to extract keywords from the text that may lead to correct classification. Keywords about the original data are usually stored in the form of a feature vector, F = (f1,f2,... fn).  Each coordinate of a feature vector represents one word, also called a feature, of the original text.  The value for each feature may be a binary value, indicating the presence or absence of the feature, an integer which may further express the intensity of the feature in the original text.  It is important to have a good selection of features since it strongly influences the subsequent learning in the machine learning process.  The goal of selecting good features is to capture the desired properties of the original text that are relevant for the sentiment analysis task. Unfortunately, this task for finding best features does not exist. It is required to rely on our intuition, the domain knowledge and a lot of experimentation to choose the best set of features. I strongly recommend the study of the  Natural Language Processing (NLP)  subject, which it may help you to understand this vast research field.

Our approach includes the use of Bag-of-Words. It is a popular model used in Information Retrieving that takes individual words (unigrams) in sentence as features, assuming their conditional independence. So the whole text is represented by a unordered collection of words.  Each feature in the vector represents a existence of one word. The challenge with this approach is the choice of words that are appropriate to become features.

For instance, considering this model, the tweet:  'Assisti hoje o filme Eclipse, ele é lindo !'  may be represented by the following feature vector:

 F = {'Assisti': 1 , 'hoje': 1, 'o': 1, 'filme': 1, 'Eclipse': 1, 'ele': 1, 'é': 1, 'lindo':1}

Here we represent the feature vector as a python dictionary.

Obviously, for any real use, we have to compare this vector to a feature vector that would have much larger number of words. It would be necessary in fact a dictionary of the language, however this model would be inefficient since it would overfit and lead to bad performance when exposed to new examples.
In literature, a common approach is to manually select the most important keywords, for example the word 'lindo' (adjective) is a good indicator of the author's opinion. The most important keywords such as 'excelente',  'horrível', 'fraco'  would be selected as features  since  they express polarity of a sentence). However, Pang et al. show that manual keyword model is outperformed by statistical models, where a good set of words that represent features are selected by their occurrence in the existent training corpus. So, the quality of the selection would depend on the size of the corpus and the similarity of domains of training and test data.  The use of corpus from different domains that don't have the same properties of the domain of the text we want to classify, may lead to inaccurate results. For instance, if the analysis is done on a set of tweets related to a product and is trained with a set based on movies, the most of the sentences would be misclassified.  It is important also to create a dictionary which will capture the most important features for classifying previously unseen sentences.

Additionally, it is possible to remove some of the existing words that bring little useful information like pronouns, articles, prepositions, etc (List of stop words).  Of course, the model presented here is simple, and there are several limitations, which include the inability to capture the polarity relation between words and different meanings of one word.  Another limitations would be suppressed by use of regular expressions for handling the negation and parts of speech for a syntax analysis of the word.


Classification algorithms are efficient and consolidated techniques for this sentiment classification task, since it predicts the label for a given input. However, depending on which approach used (supervised or unsupervised), it will be required a training set with labeled examples before new tweets be classified. It is important to train the model with the set in a domain related to the data domain that will be used as input. The labels we are interested here are the subjectivity of the sentence and the polarity of the sentence (neutral, positive or negative).  There are several machine learning techniques for this task, but in this article we are using the simplest one but with a great  efficience in classification problems: Naive Bayes technique.

The Naive Bayes model or Naive Bayer classifier is a simple probabilistic classifier based on the Baye's theorem with strong independence assumptions. In simple terms, this probability model assumes that the presence or absence of a particular feature of a class is unrelated to presence or absence of any other feature.  For instance, a car may be considered to be a vehicle if there are  4 tires, engine and at least 2 doors.  Even if these features depend on each other, the naive Bayes classifier considers all these properties to independently contribute to the probability that the car is an vehicle.

In our case, each stemmed word in a tweet is taken to be a unique variable in the model, and the goal is to find the probability of that word, and consequently the all sentence itself, belonging to a certain class: Positive vs Negative.  In spite of their naive design and simplified assumptions,  naive Bayes classifiers have worked very well in many complex real-word situations. One of the extensive uses is in spam filtering. In fact, one of the most popular packages (SpamAssassin) is a direct implementation of Naive Bayes. It is a an amazingly simple approach and also a powerful classifier.  More information about the algorithm of Naive Bayes it can be found here and here.
In this tutorial I have used the implementation of a simple Naive Bayesian Classifier. There are some variations using Fisher Score discriminant function, which will not be explored in this article now.
In our problem domain, we will use the Naive Bayes classifier to classify the tweets in Positive or Negative.  To apply this task, it is necessary to train the classifier by creating a training set of tweets (or words) already classified (in Positive or  Negative). Trained the model,  Naive Bayes assumes that all features in the feature vector (after the pre-processing step), and applies Bayes' rule on the tweet. Naive Bayes calculates the prior probability frequency for each label in the training set. Each label is given a likelihood estimate from the contributions of all features, and the sentence is assigned the label with highest likelihood estimate, that is, 'Positive' or 'Negative'. 


The last step: The presentation of the results. Generally, it can be presented by using text or charts. The summarization in text is not a easy task, and it has a research field only for it. The common presentation is a list of sentences where for each the classification is shown by its side.

Summarization by TwitterSentiment Analyzer  -

The other common option is to use charts to present the results. It's the simplest way, and there are many flavors for this presentation. And don't forget they are more attractive for the end users and reports.

Summarization by TwitterSentiment Analyzer  -

Our Study Case:  Movies Reviews on Twitter

For a simple study case and demonstration here, I've created a training set with more about 300 sentences related to movie reviews on Twitter. So sentences such as 'Eu amei o filme Eclipse!' or 'Eclipse poderia ser melhor. #horrível Pior filme da saga.'  are  labeled under positive and negative polarities for creating the training set. It is important to notice that my data corpus are related to portuguese tweets.  So english ones will not work properly.  I've used the Twitter Search API for searching tweets with the keyword 'Eclipse',  the new released  movie in the theaters from the saga  'Twilight' and fetched tweets during three days (July 05th 2010 - July 07th 2010). 
What people are talking about the movie Eclipse on Twitter ??
The task is to analyze the opinions about the movie in Twitter. Using the Naive Bayes Classifier with some adjustments to include the 'Neutral' Classification, we have analyzed 4265 tweets.  I have used some pre-processing techniques explained above. All the code was written in Python. The result you can see in the graph below.

Polarity Chart on tweets about the movie 'Eclipse'

As you may notice, brazilians are really loving the movie Eclipse!! There are also a great amount on neutral tweets, it means that the classifier didn't have the confidence to classify as positive or negative, (I'll explain soon) or tweets that don't express sentiment (opinions).

You can see some of the tweets classified as follows:

Some Tweets classified related to the movie Eclipse

The neutral classification, I've added a simple threshold where after evaluating the probabilities for each polarity (Positive and Negative). The maximum one is compared to a threshold and if its value isn't higher, it is classified as a neutral one. It means that the classifier doesn't have the confidence to classify the tweet as one of the classes trained in the model.

As you can see, the Naive Bayes is a simple machine learning technique quite efficient for automatically extract sentiment from small sentences like Tweets. However, it is important to notice that it still can be improved with new feature extraction approaches or another robust machine learning techniques.


Soon I will provide all the source code for this project, but if you want to play around with the classifier, I've developed a simple REST API for using the classifier for simple tests. This version is still not official and it only works with movie reviews domain (in portuguese of course).


The output result is in JSON. PLEASE THIS API is still in development. So if you plan on integrating this API into a product, please let me know , so that I can contact you in order to help you to create your own.

The classifier is only working with a single arbitrary text (in portuguese language)  and as result give you its polarity. Example:


{"results": {"polarity": "negativo", "text": "Encontro Explosivo filme ruim Danado", "query": "encontro explosivo"}}

- text: The text you want to classify. This should be URL encoded. 
- query: The subject. This should be URL encoded. (optional)
- text: original text submitted
- query: original query submitted
- polarity. The polarity values are:
       -  'negativo': corresponds to negative
       -  'neutro': corresponds to neutral
       - 'positivo': corresponds to positive


As we can notice that sentiment analysis is a trend in the Web, with several interesting application with a lot of data sources provided by users.  Microblogs and Social Networks such as Twitter, Facebook and Orkut and another web services are powerful crowding sources for obtain opinions from the users in the Web about any subject and specially to help to answer the question about what people are interested on. Despite the challenge, more companies and researchers are working in this area until one day it would be easy for users and companies to simply obtain complete and rich summarized reports about the opinions from the Web in order to support them in the decision making process in their daily life.

 In machine learning topics, there are also a lot of improvements that would be made for this task and several challenges for researchers to extend it. I would say the most important ones are related to the natural language processing, which is directly associated to sentiment analysis. I would cite the subjectivity identification (distinguish if a text is really a opinion or a fact), identification  of sarcasms, which are not easy to extract,  the multiple reference of items in a same sentence (a tweet referencing Iphone and Ipod) and with different opinions, which it may confuse the classification. Misspelling words and abbreviated words (common in blogs and social networks) are also difficulties for search and classification. Many others could be cited here but it may be found if you search on Google about sentiment analysis.

In the end, I would tell that applying sentiment analysis in portuguese language is still a new area with great challenges to deal with text mining since the language is different from the current ones applied on english in the literature. My work is  a initial discussion about how we can learn more about sentiment analysis and how portuguese could be inserted in this context.  More analysis and a better evaluation should be applied to compare another popular machine learning techniques with Naive Bayes classifier.

I expect   you have enjoyed this tutorial, it's quite long but I believe it could be a useful start for the beginners in machine learning.


See you next time!

Marcel Caraciolo