Tools for Machine Learning Performance Evaluation: Confusion Matrix

Tuesday, August 31, 2010

Hi all, 

Starting today, I'll be writing a series of posts about supervised and unsupervised learning, specifically related to performance evaluation: classification accuracy, lift, ROC curves, F1-score, and error measures.

The Confusion Matrix

Let's start with one of the most popular tools for evaluating the performance of a model in classification or prediction tasks: the confusion matrix (in unsupervised learning it is typically called a matching matrix). Its focus is on the predictive capability of a model rather than on how long the model takes to perform the classification, its scalability, etc.

The confusion matrix is a matrix in which each row represents the instances in a predicted class, while each column represents the instances in an actual class. One of the advantages of this performance evaluation tool is that the analyst can easily see whether the model is confusing two classes (i.e. commonly mislabeling one as the other).

The matrix also shows the accuracy of the classifier for a given class: the number of correctly classified patterns in that class divided by the total number of patterns in that class. The overall (average) accuracy of the classifier can also be evaluated using the confusion matrix.

Let's see a confusion matrix in action with an example. Imagine you have a dataset consisting of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS). For a classifier trained on this dataset to classify an e-mail as 'Spam' or 'Non-Spam', we can use the confusion matrix to see the classification accuracy based on the training data. In the example confusion matrix below, of the 33 patterns that are 'Spam' (S), 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam' (NS), an accuracy of 81.8%. On the other hand, of the 67 patterns that are 'Non-Spam', 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam', an accuracy of 85.1%. The overall accuracy of the classifier on this dataset is (27 + 57) / 100 = 84%.

Confusion Matrix on spam classification model
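To make the arithmetic concrete, here is a minimal Python sketch of this example. The matrix values are the ones above, and it follows the convention used in this post: rows are predicted classes, columns are actual classes.

    import numpy as np

    # Rows = predicted class, columns = actual class, ordered (Spam, Non-Spam)
    cm = np.array([[27., 10.],   # predicted Spam:     27 actual Spam, 10 actual Non-Spam
                   [ 6., 57.]])  # predicted Non-Spam:  6 actual Spam, 57 actual Non-Spam

    per_class_accuracy = np.diag(cm) / cm.sum(axis=0)  # correct / total, per actual class
    overall_accuracy = np.trace(cm) / cm.sum()

    print(per_class_accuracy)  # [0.818 0.851] -> 81.8% for Spam, 85.1% for Non-Spam
    print(overall_accuracy)    # 0.84 -> 84%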

However, the confusion matrix only tells us how the classifier behaves for each individual class. When a dataset is unbalanced (the number of samples in one class is significantly larger than in the other, which happens a lot with Spam/Non-Spam datasets), the overall accuracy is not representative of the classifier's true performance. For instance, imagine there are 990 patterns that are 'Non-Spam' and only 10 patterns that are 'Spam': the classifier can easily be biased towards the 'Non-Spam' class. If the model classifies all the samples as 'Non-Spam', its accuracy will be 99%, yet this is not a real indication of the classification performance: the classifier has a 100% recognition rate for 'Non-Spam' but a 0% recognition rate for 'Spam'. Looking at the matrix, the system clearly has trouble predicting the 'Spam' class, even though it is 99% accurate overall. Given that the 'Spam' class is usually the one of actual interest, the confusion matrix alone is not enough to evaluate the model's performance, but it gives us insight into how the model predicts each class and points us towards the other metrics explained in the next section.

Confusion Matrix on an unbalanced dataset
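Here is a quick sketch of that accuracy trap, using hypothetical 'S'/'NS' labels:

    # 990 'Non-Spam' patterns, 10 'Spam' patterns
    y_true = ['NS'] * 990 + ['S'] * 10
    # A degenerate model that labels everything as 'Non-Spam'
    y_pred = ['NS'] * 1000

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    spam_recall = sum(p == 'S' for t, p in zip(y_true, y_pred) if t == 'S') / 10

    print(accuracy)     # 0.99 -> 99% overall accuracy...
    print(spam_recall)  # 0.0  -> ...but 0% recognition of the class we care about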

The Table of Confusion

In the confusion matrix, each cell counts one of four outcomes: True Positives, False Positives, False Negatives and True Negatives. These are defined as follows (a small counting sketch in Python appears after the list):
  • False Positive (FP): falsely predicting a label (or saying that a Non-Spam is a Spam).
  • False Negative (FN): missing an incoming label (or saying a Spam is a Non-Spam).
  • True Positive (TP): correctly predicting a label (or saying a Spam is a Spam).
  • True Negative (TN): correctly predicting the other label (or saying a Non-Spam is a Non-Spam).
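As an illustration, here is one way to count these four outcomes from paired lists of actual and predicted labels. This is a minimal sketch; the label value 'pos' is just an assumption for the example.

    def table_of_confusion(y_true, y_pred, positive='pos'):
        """Count TP, FP, FN and TN for a binary problem."""
        tp = fp = fn = tn = 0
        for actual, predicted in zip(y_true, y_pred):
            if predicted == positive:
                if actual == positive:
                    tp += 1  # correctly predicted positive
                else:
                    fp += 1  # negative mislabeled as positive
            else:
                if actual == positive:
                    fn += 1  # positive mislabeled as negative
                else:
                    tn += 1  # correctly predicted negative
        return tp, fp, fn, tn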

In a general view, the confusion matrix looks as follows:

Confusion Matrix
How can we use these metrics? For instance, let's now consider a model for predicting whether a text message carries a positive or negative opinion (a common task in sentiment analysis). We have a dataset with 10,000 text messages where the model correctly predicts 9,700 negative messages and 100 positive messages. The model incorrectly predicts 150 messages that are positive as negative, and 50 messages that are negative as positive. The resulting confusion matrix is shown below.

Confusion Matrix on Sentiment classification task

For binary classification problems, which is our case here, we can derive from these counts two measures called sensitivity and specificity. They are commonly used in the evaluation of any binary classifier.

The specificity (TNR) measures the proportion of messages classified as negative (TN) out of all the messages that are actually negative (TN + FP). It can be seen as the probability that a message is classified as negative given that it is actually negative. With higher specificity, fewer negative messages are mislabeled as positive.

Sensitivity (TPR), on the other hand, is the proportion of messages classified as positive (TP) out of all the messages that are actually positive (TP + FN). It can be seen as the probability that a message is classified as positive given that it is actually positive. With higher sensitivity, fewer actual positive messages will be misclassified as negative.

Sensitivity can be expressed as:
  • Sensitivity = TP / (TP + FN)
and Specificity as:
  • Specificity = TN / (TN + FP)
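Expressed in code, using the counts from the sentiment example above (TP = 100, FN = 150, TN = 9,700, FP = 50), a minimal sketch:

    def sensitivity(tp, fn):
        return tp / (tp + fn)  # true positive rate: accuracy on the positive class

    def specificity(tn, fp):
        return tn / (tn + fp)  # true negative rate: accuracy on the negative class

    print(sensitivity(100, 150))  # 0.4    -> 40%
    print(specificity(9700, 50))  # 0.9948 -> ~99.5%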

In other words, sensitivity is the accuracy on the positive class, and specificity is the accuracy on the negative class. So, using these metrics, what is the accuracy on positive and negative messages?
  • Sensitivity = TP / (TP + FN) = 100 / (100 + 150) = 0.4 = 40%
  • Specificity = TN / (TN + FP) = 9700 / (9700 + 50) ≈ 0.995 = 99.5%

As you can see, if we have a sentiment classifier with 40% sensitivity and 99.5% specificity, and we have to check 1,000 messages of which 500 are positive and 500 are negative, we can expect about 200 true positives, 300 false negatives, roughly 497 true negatives and 3 false positives. We can conclude that the negative predictions are the more trustworthy ones, given the high specificity and the low sensitivity. So it is important to analyze the performance of your classifier by looking at each class separately.
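Those expected counts fall straight out of the two rates; a quick sketch, assuming the 500/500 split above:

    n_pos, n_neg = 500, 500
    sens, spec = 0.40, 0.995

    tp = sens * n_pos        # ~200 expected true positives
    fn = (1 - sens) * n_pos  # ~300 expected false negatives
    tn = spec * n_neg        # ~497 expected true negatives
    fp = (1 - spec) * n_neg  # ~3   expected false positives
    print(tp, fn, tn, fp)    # 200.0 300.0 497.5 2.5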
The relationship between sensitivity and specificity, as well as the overall performance of the classifier, can be visualized and studied using the ROC curve, which will be the topic of one of the next posts in this series.

I've written some Python code for computing the confusion matrix, specificity and sensitivity of a classifier here. Please make the necessary changes to adapt it to your own classifier.

That's all,

I hope you have enjoyed it!


Marcel Caraciolo



  1. Great script. We'll be waiting for your comments about the ROC curve.

    Thank you very much, Marcel.

  2. Hello, you made a mistake in your post regarding the example:

    Specificity should be :
    9700/(9700+50) = 0.9948 ( 99.5% )


  3. "Sensitivity means the accuracy on the class Negative, and Specificity means the accuracy on the class Positive. So using these metrics, what is the accuracy on Positive and Negative messages?"

    Isn't it the other way around, Positive/Negative?

  4. I think the values in your picture are in the wrong places; the 150 and 50 should swap.

  5. Sensitivity = TP / (TP+FN) = 100/(100+50) = 0.4 = 40%
    This calculation is wrong and should be 2/3 or 67%.

  6. Hello,

    You might be interested in my project and its pip package.

    With this package a confusion matrix can be pretty-printed and plotted. You can binarize a confusion matrix and get class statistics such as TP, TN, FP, FN, ACC, TPR, FPR, FNR, TNR (SPC), LR+, LR-, DOR, PPV, FDR, FOR, NPV, as well as some overall statistics.

    Kind regards
