Tools for Machine Learning Performance Evaluation: Confusion Matrix

Tuesday, August 31, 2010

Hi all, 

I'll start to write some posts starting from now about Supervised and Unsupervised learning, specific related to performance evaluation such as classification accuracy, lift, roc curves, F1-Score and errors.

The Confusion Matrix

Let's start with the one popular tools to evaluate the performance of a model in tasks of classification or prediction:  The confusion matrix (in unsupervised learning it is typically called a matching matrix). Its focus is on the predictive capability of a model  rather than how fast the model takes to perform the classification, scalability, etc.

The confusion matrix is represented by a matrix which each row represents the instances in a predicted class, while each column represents in an actual class. One of the advantages of using this performance evaluation tool is that the data mining analyzer can easily see if the model is confusing two classes (i.e. commonly  mislabeling one as another).

The matrix also shows the accuracy of the classifier as the percentage of correctly classified patterns in a given class divided by the total number of patterns in that class. The overall (average) accuracy of the classifier is also evaluated by using the  confusion matrix.

Let's see a confusion matrix in action by showing an example. Imagine that you have a dataset that consists of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS).  For a classifier trained with this dataset to classify an e-mail as 'Spam' or 'Non-Spam', we can use the confusion matrix to see the classification accuracy based on the training data. In the example confusion matrix below, of the 33 patterns that are 'Spam' (S),  27 were correctly predicted as 'Spams' while 6 were incorrectly predicted as 'Non-Spams' (NB) (achieving an accuracy of 81.8%).  On the other hand, of the 67 patterns that are 'Non-Spams', 57 are correctly predicted as 'Non-Spams' while 10 were incorrectly classified as 'Spams' (an accuracy of 85.1%).  The overall accuracy of the classifier  for predicting both classes given this dataset is evaluated achieving 83%.

Confusion Matrix on spam classification model

However the confusion matrix only tell us how the classifier is behaving for individual classes. When a data set is unbalanced (where the number of samples in one class is significantly more than that in the other class - it happens a lot with Spam/Non-Spam datasets) the accuracy evaluated of a classifier is not representative of the true performance of the classifier. For instance, imagine there are 990 patterns that are 'Non Spam'  and only 10 patterns that are 'Spam' , the classifier can easily be biased towards the class 'Non Spam'.  If the model classifies all the samples as 'Non-Spam', the accuracy will be 99%.  And this is not real indication of the classification's performance. The classifier has a 100% recognition rate for 'Non-Spam'  but a 0% error rate for 'Spam'. Looking at the matrix, the system has trouble in predicting the 'Spam' class, even though the system has to be 99% accurate in its prediction. Given that the prediction of 'Spam' class would be the one of actual interest, only using the confusion matrix to evaluate the model's performance is not enough, but it can give us an insight of how the model is predicting the classes and start to use other metrics that we will explain in the next section.

Confusion Matrix on a unbalanced dataset

The Table of  Confusion

In the Confusion Matrix, for each cell in the matrix we have fields as True Positives, False Positives, False Negatives and True Negatives.  These are defined as:
  • False Positive (FP):  Falsely Predicting a label (or saying that Non-Spam is a Spam).
  • False Negative (FN):  Missing and incoming label (or saying a Spam is Non-Spam).
  • True Positive (TP):  Correctly predicting a label (or saying a Spam is Spam).
  • True Negative (TN): Correctly predicting the other label (or saying Non-Spam is Non-Spam).

Looking at the confusion matrix in a general view is as follows:

Confusion Matrix
How can we use those metrics ?  For instance, let's consider the previous model now for predicting if a text message have positive or negative opinion associated (common in sentiment analysis task).  We have a data set with 10.000 text messages where the model correctly predicts 9.700 negative messages, and 100 positive messages. The model still incorrectly predicts 150 messages which are positive to be negative, and 50 messages which are negative to be positive.  The resulting Confusion Matrix is shown below.

Confusion Matrix on Sentiment classification task

For the binary classification problems, which was our case situation , we can derive from those metrics two equations called sensitivity and specificity. They are commonly used for the evaluation of any binary classifier. 

The Specificity (TNR) measures the proportion of messages that are negative (TN) of all the messages that are actually negative (TN+FP). It can be looked at as the probability that the message is classified as negative given that the message does not contain negative words. With higher specificity, fewer positive messages are labeled as negative.

On the other hand, Sensitivity (TPR) is the proportion of messages that are positive (TP) of all the messages that are actually positive (TP+FN).  It can be seen as the probability that the message is positive given that the patient contain positive words. With higher sensitivity, fewer actual messages will be classified as negative.  

Sensitivity can be expressed as :
  • TP / (TP+FN)
and then Specificity which is:
  • TN / (TN+FP)

In general here, Sensitivity means the accuracy on the class Negative, and Specificity means the accuracy on the class Positive. So using these metrics, what is the accuracy on Positive and Negative messages  ?
  • Sensitivity = TP / (TP+FN) = 100/(100+50) = 0.4 = 40% 
  • Specificity = TN / (TN+FP) = 9700/(9700+150) = 0.98 = 98% 

As you can see, if we have a test for sentiment classification with 40% sensitivity and 98% specificity, and we have to check 1000 messages, and 500 of them are positive and 500 are negative. You are likely to get about 200 messages true positives, 300 messages false negatives,  490 true negatives and 10 false positives. You can conclude that the the negative prediction is more confident, specially based on the high value of specificity and the low level of sensitivity. As you can see it's a important metric for analyzing the performance of your classifier only looking both separated.
The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using the ROC curve, which it will be one of the next posts about this topic.

I've developed some code in Python for evaluating the Confusion Matrix, Specificity and Sensitivity of a classifier here.  Please make the necessary changes for adapting for your classifier. 

That's all,

I expect you have enjoyed!


Marcel Caraciolo


MapReduce with MongoDB and Python

Saturday, August 21, 2010

Hi all,

 In this post, I'll present a demonstration of a map-reduce example with MongoDB and server side JavaScript.  Based on the fact that I've been working  with this technology recently, I thought it would be useful to present here a simple example of  how it works and how to integrate with Python.

But What is MongoDb ?

For you, who doesn't know what is and the basics of how to use MongoDB, it is important to explain a little bit about the No-SQL movement. Currently, there are several databases that break with the requirements present in the traditional relational database systems. I present as follows the main keypoints shown at several No-SQL databases:
  • SQL commands are not used as query API (Examples of APIs used include JSON, BSON, etc.)
  • Doesn't guarantee atomic operations.
  • Distributed and horizontally scalable.
  • It doesn't have to predefine schemas. (Non-Schema)
  • Non-tabular data storing (eg; key-value, object, graphs, etc).
Although it is not so obvious, No-SQL is an abbreviation  to Not Only SQL. The effort and development of this new approach have been doing a lot of noise since 2009. You can find more information about it here and here.  It is important to notice that the non-relational databases does not represent a complete replacement for relational databases. It is necessary to know the pros and cons of each approach and decide the most appropriate for your needs in the scenario that you're facing.

MongoDB is one of the most popular No-SQL today and what this article will focus on. It is a schemaless, document oriented, high performance, scalable database  that uses the key-values concepts to store documents as JSON structured documents. It also includes some relational database features such as indexing models and dynamic queries. It is used today in production in over than 40 websites, including web services such as SourceForge, GitHub, Eletronic Arts and The New York Times..

One of the best functionalities that I like in MongoDb is the Map-Reduce. In the next section I will explain  how it works illustrated with a simple example using MongoDb and Python.

If you want to install MongoDb or get more information, you can download it here and read a nice tutorial here.

Map- Reduce 

MapReduce is a programming model for processing and generating large data sets. It is a framework introduced by Google for support parallel computations large data sets spread over clusters of computers.  Now MapReduce is considered a popular model in distributed computing, inspired by the functions map and reduce commonly used in functional programming.  It can be considered  'Data-Oriented' which process data in two primary steps: Map and Reduce.  On top of that, the query is now executed on simultaneous data sources. The process of mapping the request of the input reader to the data set is called 'Map', and the process of aggregation of the intermediate results from the mapping function in a consolidated result is called 'Reduce'.  The paper about the MapReduce with more details it can be read here.

Today there are several implementations of MapReduce such as Hadoop, Disco, Skynet, etc. The most famous is Hadoop and is implemented in Java as an open-source project.  In MongoDB there is also a similar implementation in spirit like Hadoop with all input coming from a collection and output going to a collection. For a practical definition, Map-Reduce in MongoDB is useful for batch manipulation of data and aggregation operations.  In real case scenarios, in a situation where  you would have used GROUP BY in SQL,  map/reduce is the equivalent tool in MongoDB.

Now thtat we have introduced Map-Reduce, let's see how access the MongoDB by Python.


PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python. It's easy to install and to use. See here how to install  and use it.

Map-Reduce in Action

Now let's see Map-Reduce in action. For demonstrate the map-reduce I've decided to used of the classical problems solved using it: Word Frequency count across a series of documents. It's a simple problem and is suited to being solved by a map-reduce query.

I've decided to use two samples for this task. The first one is a list of simple sentences to illustrate how the map reduce works.  The second one is the 2009 Obama's Speech at his election for president. It will be used to show a real example illustrated by the code.

Let's consider the diagram below in order to help demonstrate how the map-reduce could be distributed. It shows four sentences that are split  in words and grouped by the function map and after reduced independently (aggregation)  by the function reduce. This is interesting as it means our query can be distributed into separate nodes (computers), resulting in faster processing in word count frequency runtime. It's also important to notice the example below shows a balanced tree, but it could be unbalanced or even show some redundancy.
Map-Reduce Distribution

Some notes you need to know before developing your map and reduce functions:
  •  The MapReduce engine may invoke reduce functions iteratively; thus; these functions must be idempotent. That is, the following must hold for your reduce function:
                 for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)
  •  Currently, the return value from a reduce function cannot be an array (it's typically an object or a number)
  • If you need to perform an operation only once, use a finalize function.

 Let's go now to the code. For this task, I'll use the Pymongo framework, which has support for Map/Reduce. As I said earlier, the input text will be the Obama's speech, which has by the way many repeated words. Take a look at the tags cloud (cloud of words which each word fontsize is evaluated based on its frequency) of Obama's Speech.

Obama's Speech in 2009

For writing our map and reduce functions, MongoDB allows clients to send JavaScript map and reduce implementations that will get evaluated and run on the server. Here is our map function.


As you can see the 'this' variable refers to the context from which the function is called. That is, MongoDB will call the map function on each document in the collection we are querying, and it will be pointing to document where it will have the access the key of a document such as 'text', by calling this.text.  The map function doesn't return a list, instead it calls an emit function which it expects to be defined. This parameters of this function (key, value) will be grouped with others  intermediate results from another map evaluations that have the same key (key, [value1, value2]) and passed to the function reduce that we will define now.

The reduce function must reduce a list of a chosen type to a single value of that same type; it must be transitive so it doesn't matter how the mapped items are grouped.

Now let's code our word count example using the Pymongo client and passing the map/reduce functions to the server.
Let's see the result now:

And it works! :D

With Map-Reduce function the word frequency count is extremely efficient and even performs better in a distributed environment. With this brief experiment we  can see the potential of map-reduce model for distributed computing, specially on large data sets.

All code used in this article can be download here.

My next posts will be about  performance evaluation on machine learning techniques.  Wait for news!

Marcel Caraciolo