Tools for Machine Learning Performance Evaluation: ROC Curves in Python

Tuesday, September 21, 2010

Hi all,

Continuing from my last post about performance analysis of classification techniques in machine learning, in this article I will talk about a specific analysis called Discriminatory Power Analysis, using Receiver Operating Characteristic curves (ROC curves).

First, let's review some concepts related to classification problems. When we deal with systems that involve detection, diagnostics or even prediction of results, it is really important to check the obtained results in order to validate the discriminative power of a given analysis. However, simply counting hits and misses on a test group does not really tell how good a system is, since the result depends on the quality and distribution of the test group data.

For instance, let's consider a fraud detection system whose output is 1 or 0, indicating whether a transaction has been classified as fraudulent or not. Now suppose we applied the system to a test group of 100 transactions for which we already know which ones were frauds, and the system correctly identified 90% of the cases. That sounds like good performance, doesn't it?

Actually, it is not. One piece of information is missing: how the test group was distributed. Let's consider that 90% of the transactions were actually fraudulent. The system could then have answered '1' for every input and still have achieved a 90% accuracy rate, since only the remaining 10% of the transactions were legitimate. At this point, we cannot tell whether the system was really good or simply worked as a trivial classifier that marks every transaction as fraudulent without any previous knowledge or calculation.
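To make the point concrete, here is a minimal sketch (with made-up labels, not from any real system) of a trivial classifier that answers '1' for everything and still reaches 90% accuracy on the skewed test group:

```python
# A test group where 90 of 100 transactions are actually fraudulent (label 1).
actual = [1] * 90 + [0] * 10

# A trivial "classifier" that marks every transaction as fraudulent.
predicted = [1] * 100

# Plain accuracy: the fraction of matching labels.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.9
```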

In this case, we have to consider other metrics in order to account for this imbalance in the test groups. As I said in a previous post on this blog, the confusion matrix will act as the base for the measures used in the discriminatory power analysis in question.

Confusion Matrix

From this table, I will present some metrics used to help in this analysis.

1. Accuracy
It measures the proportion of correct predictions considering both the positive and the negative inputs. It is highly dependent on the data set distribution, which can easily lead to wrong conclusions about the system performance.

          ACC = (TP + TN) / (TP + TN + FP + FN)

2. Sensitivity

It measures the proportion of true positives, that is, the ability of the system to correctly detect the positive cases presented to it.

          SENS = TP / (TP + FN)

3. Specificity

It measures the proportion of true negatives, that is, the ability of the system to correctly reject the negative cases presented to it.

          SPEC = TN / (TN + FP)

4. Efficiency

It is the mean of sensitivity and specificity. The perfect decision would have 100% specificity and 100% sensitivity. However, this situation is rarely achieved, so a balance between both metrics must be found. This metric is a good overall indicator of the system's tendency to produce false positives and false negatives: generally, when a system is too responsive to positives, it tends to produce many false positives, and vice versa.

          EFF = (SENS + SPEC) / 2

5. Positive Predictive Value
This measure indicates how good the system is when making a positive statement. However, it is not recommended to use it alone, as it can easily lead to wrong conclusions about system performance. It is the proportion of true positives among all positive predictions.

          PPV = TP / (TP + FP)

6. Negative Predictive Value

This measure indicates how good the system is when making a negative statement. However, it is not recommended to use it alone, as it can easily lead to wrong conclusions about system performance. It is the proportion of true negatives among all negative predictions.

          NPV = TN / (TN + FN)

7. Phi (φ) Coefficient

It is a less commonly used measure that summarizes the quality of the confusion matrix in a single comparable value. For instance, if you have two binary classifications over the same classes but with different sizes, it can be used to compare them. It returns a value between -1 and +1, where +1 represents a perfect prediction, 0 a random prediction, and -1 an inverse prediction.

φ =  (TP*TN - FP*FN) / sqrt( (TP+FP) * (TP +FN) * (TN+FP) * (TN +FN) )
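As a quick illustration, all seven metrics can be computed directly from the four confusion matrix counts. This is a plain sketch written from the formulas above, not the PyROC implementation, and the function name is my own:

```python
from math import sqrt

def discrimination_metrics(TP, TN, FP, FN):
    """Compute the seven metrics above from raw confusion-matrix counts."""
    acc  = (TP + TN) / (TP + TN + FP + FN)   # 1. accuracy
    sens = TP / (TP + FN)                    # 2. sensitivity
    spec = TN / (TN + FP)                    # 3. specificity
    eff  = (sens + spec) / 2                 # 4. efficiency
    ppv  = TP / (TP + FP)                    # 5. positive predictive value
    npv  = TN / (TN + FN)                    # 6. negative predictive value
    phi  = (TP * TN - FP * FN) / sqrt(       # 7. phi coefficient
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return {'ACC': acc, 'SENS': sens, 'SPEC': spec, 'EFF': eff,
            'PPV': ppv, 'NPV': npv, 'PHI': phi}

print(discrimination_metrics(TP=82, TN=82, FP=18, FN=18))
```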

Source Code

As I presented in the first post of this series, I've updated the confusion matrix by adding the new metrics presented above.

View the script at Gist.Github

The Receiver Operating Characteristic (ROC) Curve

The ROC curve was developed during World War II and was extensively used by radar engineers to detect enemy objects on the battlefield. It became famous and widely used in other areas such as medicine and radiology. More recently, it has been introduced to the areas of machine learning and data mining.

The main idea behind ROC curves is to analyze the output of classification systems, which is generally continuous. Because of that, it is necessary to define a cut-off value (or discriminatory threshold) in order to classify and count the number of positive and negative predictions (such as fraudulent or legitimate transactions, in the case of bank transaction statuses). Since this threshold can be arbitrarily chosen, the best way to compare the performance of different classifiers is to select several cut-off values over the output data and study their effect.

Considering many thresholds, it is possible to calculate a set of pairs (sensitivity, 1 - specificity) which can be plotted as a curve. This curve is the ROC curve of the system, where the y-axis (ordinate) represents the sensitivity and the x-axis (abscissa) represents the complement of the specificity (1 - specificity).

Examples of ROC curves are presented below.

Examples of ROC Curves (From: CezarSouza's Blog)

As you can see in the figure above, the higher the area under the ROC curve, the better the system. This is a standard measure for comparing classifiers, called the area under the ROC curve (AUC). It is evaluated by numerical integration, for example with the trapezoidal rule.
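The procedure described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not the PyROC code; it assumes labels in {0, 1} and does not treat tied scores specially:

```python
def roc_points(data):
    """data: list of (actual_label, score) pairs, labels in {0, 1}.
    Returns (1 - specificity, sensitivity) pairs, one per score threshold."""
    P = sum(1 for label, _ in data if label == 1)   # total positives
    N = len(data) - P                               # total negatives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Sweep the threshold from the highest score downwards: each score
    # admitted as "positive" moves the curve up (TP) or right (FP).
    for label, _ in sorted(data, key=lambda pair: pair[1], reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2
    return area

sample = [(1, 0.9), (0, 0.8), (1, 0.7), (1, 0.6), (0, 0.3), (0, 0.2)]
print(auc(roc_points(sample)))
```

For this toy sample the AUC comes out to 7/9, matching the fraction of positive/negative pairs the scores rank correctly.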

Now let's go to the development of a simple ROC curve in Python. During a data mining project that I worked on last year, I decided to implement a ROC curve in order to analyze the performance of some classifiers. This work resulted in an open-source library called PyROC, which you can use inside your own applications for the creation, visualization and analysis of ROC curves.

Source Code

The project is hosted at my personal repository at GitHub.

Feel free to download the source code and use in your applications.

Using the code

To start using the code in your projects, just create a new ROCData object, passing as argument a list of tuples containing the actual value, as measured by the experiment, and the predicted value, as given by the classifier. The actual data must be a dichotomous variable, with only two possible valid values; generally the values 0 and 1 are assumed to represent the two states. The predicted values given by the classifier must be continuous and lie within the range between the lowest and highest values considered in the actual data. For instance, if the actual values are 0 and 1, the predicted values must be inside the [0, 1] range.

from pyroc import *
random_sample = random_mixture_model()  # Generate a random sample set
print(random_sample)
#[(1, 0.53543926503331496), (1, 0.50937533997469853), (1, 0.58701681878005862),
# (1, 0.57043399840000497), (1, 0.56229469766270523), (1, 0.6323079028948545),
# (1, 0.72283523937059946), (1, 0.55079104791257383), (1, 0.59841921172330748),
# (1, 0.63361144887035825)]
To compute the area under the curve (AUC) and plot the ROC curve, just call the ROCData object's methods auc() and plot(). You can also pass the desired number of points to use as different cut-off values.

#Example instance labels (first index) with the decision function , score (second index)
#-- positive class should be +1 and negative 0.
roc = ROCData(random_sample)  #Create the ROC Object
roc.auc() #get the area under the curve
roc.plot(title='ROC Curve') #Create a plot of the ROC curve
Some metrics evaluated from the confusion matrix are also available, such as accuracy, specificity, sensitivity, etc.

#threshold passed as parameter - the cutoff value (0.5)
{'FP': 18, 'TN': 82, 'FN': 18, 'TP': 82}

         Actual class
        +(1)    -(0)
+(1)    82      18      Predicted
-(0)    18      82      class

{'ACC': 0.81000000000000005, 'PHI': 0.62012403721240439,
 'PPV': 0.81632653061224492, 'SENS': 0.80000000000000004,
 'NPV': 0.80392156862745101, 'SPEC': 0.81999999999999995,
 'EFF': 0.81000000000000005}

Finally, if you want to plot two curves in the same chart, you can do so by passing a list of ROCData objects. It is useful when you want to compare different classifiers.

x = random_mixture_model()
r1 = ROCData(x)
y = random_mixture_model()
r2 = ROCData(y)
lista = [r1,r2]
plot_multiple_roc(lista,'Multiple ROC Curves',include_baseline=True)

There is also another option: using the script stand-alone. Just run it at your terminal to see the available options.


Let's see some ROC curves plotted in action.

So I've finished the second part of the performance analysis of machine learning classifiers. This is a simple introduction, so if you want a better understanding of how ROC curves work and what they mean, please take a look at this website about ROC curves and their applications. It includes many applets for you to experiment with the curves. I hope you have enjoyed this post. The next one in this series will be about the F1-Score, also known as F-measure.

A.I. Challenge with the game Planet Wars sponsored by Google!

Sunday, September 19, 2010

Hi all,

Recently Google and the University of Waterloo's computer science club launched an Artificial Intelligence challenge. The task is to write a program to compete in a Planet Wars tournament. Your goal is to conquer all the planets on your side of space or destroy all of your opponent's ships. The contest supports Python, Java, C# and C++. Support for Ruby, Lisp, Haskell and Perl is still in development.

The challenge has already started and ends on November 27th. It sounds quite interesting!!

If you don't know about this game, I can give you some clues. Planet Wars is a strategy game inspired by Galcon, in its iPhone and desktop versions. Take a look at the Planet Wars game in action:

I am also planning to participate in this challenge. It may be quite interesting for reviewing and practicing machine learning techniques. There is also a ranking; let's see which position I will reach. ;D

Marcel Caraciolo

Chatterbots and Natural Language Processing: A real example with Brazilian Correios Bot TweetComendas

Tuesday, September 14, 2010

Hi all,

In a previous post on this blog I presented two web robots that I developed: Tweetcomendas and TransitoRE. For Tweetcomendas, which is a Twitter bot for tracking mail orders from the Brazilian post office Correios, I developed a chat bot to interact with the users. Besides Twitter, the user can also track his orders by asking, in real time, our chat client TweetComendas.

Recently, several approaches have been proposed in the research domain of chatterbots. But first let's review the concept of a chatterbot, or chatbot. Chatterbots are computer programs that seek to simulate a human in conversation with other people. Their goal is to answer questions in such a manner that people have the impression they are talking with a real person and not a computer program. This reminds me of the Turing Test, proposed by Alan Turing in his 1950 paper, which evaluates a machine's ability to demonstrate intelligence through conversation between humans and machines. It's quite interesting; I recommend reading it.

So, after receiving a question in natural language, the program consults a knowledge database and then provides an answer which tries to mimic human behavior. Although many chatterbots appear to interpret human input intelligently in their responses, the bot that I developed simulates a human assistant by scanning for keywords within the input and answering, as close to natural language as possible, with the statuses of the orders at the Brazilian Correios.

Recently, chatbot techniques have been used for several practical purposes, such as online help, personalized services or information acquisition, working as a conversational agent. One of the classical examples of a chatterbot is A.L.I.C.E., which uses a weak A.I. approach. A.L.I.C.E. is built on a programming language called AIML, which is specific to its function as a conversational agent and has been adopted by many developers, who created the so-called Alicebots.


The one that I liked a lot, and which in some way inspired me to develop my own bot, was the Aardvark chatterbot. Aardvark, recently bought by Google, was a startup focused on creating a social recommendation engine, mashing up components of Yahoo! Answers and instant messaging. Their service offers the user the possibility of asking a question to the bot; based on its algorithms, it computes tags for the question (keywords, or you can provide a few), finds the right person for your question and forwards it to that person in order to get an answer. The best part is that Aardvark works very well not only on their web page but also in other clients such as IMs, the iPhone, etc. So if you have a question, just open the chat screen of the bot and ask, and after some minutes you may get a useful answer in the same session (client). Try it for free and see its power!

Aardvark Chat Client

The Tweetcomendas ChatterBot is actually a Jabber chat client which uses the XMPP protocol for receiving and pushing text messages. I've used it as a template for other bots that I am planning to develop. Let's see a working example of how it works:

1 - First, we have to add the chatter bot (it uses the IMified IM bot infrastructure, so you can host your bot there for free).

2 - After he comes online, you can send, at any time, the tracking code of your package order from Correios in order to track it.

It uses regular expressions to match the tracking code.
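A minimal version of that matching step might look like this. The tracking-code format used here, two letters, nine digits and a two-letter country code, is my assumption about the Correios format, and the helper name is hypothetical:

```python
import re

# Assumed Correios tracking-code format: two letters, nine digits,
# and a two-letter country suffix (e.g. "SS123456789BR").
TRACKING_CODE = re.compile(r'\b([A-Z]{2}\d{9}[A-Z]{2})\b')

def extract_tracking_codes(message):
    # Normalize to upper case so users can type the code either way.
    return TRACKING_CODE.findall(message.upper())

print(extract_tracking_codes('please track ss123456789br for me'))
# ['SS123456789BR']
```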

3 - Wait while he looks at the Correios tracking web services and receives the current status.

The current status in natural language for the user.

4 - Another useful feature of the Tweetcomendas ChatterBot is that he can actually track your order for you automatically: he will notify you of any new updates by giving the new status on your chat client if you're online, or by sending you an e-mail (it only works with Gmail by now).

He will ask you politely if you want him to track your order for you.

A gentleman chatterBot :D

5 - Receive new updates of your order by e-mail or through your Jabber client.

Updates on your e-mail even when you're not online on Google Talk

So, this is my first attempt to develop a really useful chatter bot for helping users track their post orders. There are many ideas in my mind that I want to create. I believe this kind of tool is the next generation of agents for interacting with people, working as an assistant such as Siri, a mobile assistant agent running on iPhones. Take a look at the video and see the future of this kind of technology.

I hope you have enjoyed my simple bot; it can still be lazy sometimes, but most of the time it works smoothly. Add it to your Google Talk contacts. This is a personal experiment project on intelligent agents and bots that I've developed with the help of Ricardo Caspirro (@ricardocaspirro). If you want more information about it, take a look at its homepage. There is also another client working on Twitter!

That's all,


Marcel Caraciolo

Knowing the 'hidden' class re.Scanner in Python and how to use it: Building a simple Tweet Extractor

Wednesday, September 8, 2010

Hi all,

I've written a simple post at my personal blog Mobideia (yes, I have one, but it's more about my daily life, personal projects and geek stuff) about the use of a 'hidden' class, Scanner, from the re module, which is responsible for regular expression handling.

The practical example includes a simple tweet extractor for identifying URLs, usernames (@joao), hashtags (#django), retweets (RT) and words. It may be very useful for anyone who wants to play with Natural Language Processing (NLP) and text mining on Twitter.
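To give a flavor of the idea without replicating the post here, this is a small sketch of such a tokenizer built on re.Scanner. The token names and patterns are my own simplified choices, not necessarily the ones from the original post:

```python
import re

def token(kind):
    # Each lexicon action receives (scanner, matched_text) and returns a token.
    return lambda scanner, text: (kind, text)

# re.Scanner takes a lexicon of (pattern, action) pairs; patterns are tried
# in order at each position, and a None action silently skips the match.
scanner = re.Scanner([
    (r'RT\b',          token('RETWEET')),
    (r'@\w+',          token('USERNAME')),
    (r'#\w+',          token('HASHTAG')),
    (r'https?://\S+',  token('URL')),
    (r'\w+',           token('WORD')),
    (r'\s+',           None),   # skip whitespace
    (r'.',             None),   # skip any other character
])

tokens, remainder = scanner.scan('RT @joao check http://example.com #django')
print(tokens)
# [('RETWEET', 'RT'), ('USERNAME', '@joao'), ('WORD', 'check'),
#  ('URL', 'http://example.com'), ('HASHTAG', '#django')]
```

scan() returns the list of tokens plus the unconsumed remainder of the string, which is empty when every character was matched by some pattern.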

In order to avoid the replication of all text here,  I've decided to place the links for accessing the post:

Portuguese Version

English Version (by Google Translate):

I hope you enjoy!


Marcel Caraciolo