Tools for Machine Learning Performance Evaluation: Confusion Matrix

Tuesday, August 31, 2010

Hi all, 

Starting today, I'll be writing a series of posts about supervised and unsupervised learning, specifically related to performance evaluation: classification accuracy, lift, ROC curves, F1-score, and error measures.

The Confusion Matrix

Let's start with one of the most popular tools for evaluating the performance of a model in classification or prediction tasks: the confusion matrix (in unsupervised learning it is typically called a matching matrix). Its focus is on the predictive capability of a model rather than on how long the model takes to perform the classification, its scalability, etc.

The confusion matrix is a matrix in which each row represents the instances in a predicted class, while each column represents the instances in an actual class. One of the advantages of this performance evaluation tool is that the analyst can easily see whether the model is confusing two classes (i.e. commonly mislabeling one as the other).

The matrix also shows the accuracy of the classifier for a given class: the number of correctly classified patterns in that class divided by the total number of patterns in that class. The overall (average) accuracy of the classifier can also be evaluated using the confusion matrix.

Let's see a confusion matrix in action with an example. Imagine you have a dataset consisting of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS). For a classifier trained on this dataset to classify an e-mail as 'Spam' or 'Non-Spam', we can use the confusion matrix to see the classification accuracy based on the training data. In the example confusion matrix below, of the 33 patterns that are 'Spam' (S), 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam' (NS), an accuracy of 81.8%. On the other hand, of the 67 patterns that are 'Non-Spam', 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam', an accuracy of 85.1%. The overall accuracy of the classifier on this dataset is (27 + 57) / 100 = 84%.

Confusion Matrix on spam classification model
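To make the arithmetic concrete, here is a minimal Python sketch of this example. The matrix values are the ones above, and it follows the convention used in this post: rows are predicted classes, columns are actual classes.

    import numpy as np

    # Rows = predicted class, columns = actual class, ordered (Spam, Non-Spam)
    cm = np.array([[27., 10.],   # predicted Spam:     27 actual Spam, 10 actual Non-Spam
                   [ 6., 57.]])  # predicted Non-Spam:  6 actual Spam, 57 actual Non-Spam

    per_class_accuracy = np.diag(cm) / cm.sum(axis=0)  # correct / total, per actual class
    overall_accuracy = np.trace(cm) / cm.sum()

    print(per_class_accuracy)  # [0.818 0.851] -> 81.8% for Spam, 85.1% for Non-Spam
    print(overall_accuracy)    # 0.84 -> 84%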

However, the confusion matrix only tells us how the classifier behaves for each individual class. When a dataset is unbalanced (the number of samples in one class is significantly larger than in the other, which happens a lot with Spam/Non-Spam datasets), the overall accuracy is not representative of the classifier's true performance. For instance, imagine there are 990 patterns that are 'Non-Spam' and only 10 patterns that are 'Spam': the classifier can easily be biased towards the 'Non-Spam' class. If the model classifies all the samples as 'Non-Spam', its accuracy will be 99%, yet this is not a real indication of the classification performance: the classifier has a 100% recognition rate for 'Non-Spam' but a 0% recognition rate for 'Spam'. Looking at the matrix, the system clearly has trouble predicting the 'Spam' class, even though it is 99% accurate overall. Given that the 'Spam' class is usually the one of actual interest, the confusion matrix alone is not enough to evaluate the model's performance, but it gives us insight into how the model predicts each class and points us towards the other metrics explained in the next section.

Confusion Matrix on an unbalanced dataset
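Here is a quick sketch of that accuracy trap, using hypothetical 'S'/'NS' labels:

    # 990 'Non-Spam' patterns, 10 'Spam' patterns
    y_true = ['NS'] * 990 + ['S'] * 10
    # A degenerate model that labels everything as 'Non-Spam'
    y_pred = ['NS'] * 1000

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    spam_recall = sum(p == 'S' for t, p in zip(y_true, y_pred) if t == 'S') / 10

    print(accuracy)     # 0.99 -> 99% overall accuracy...
    print(spam_recall)  # 0.0  -> ...but 0% recognition of the class we care about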

The Table of Confusion

In the confusion matrix, each cell counts one of four outcomes: True Positives, False Positives, False Negatives and True Negatives. These are defined as follows (a small counting sketch in Python appears after the list):
  • False Positive (FP): falsely predicting a label (or saying that a Non-Spam is a Spam).
  • False Negative (FN): missing an incoming label (or saying a Spam is a Non-Spam).
  • True Positive (TP): correctly predicting a label (or saying a Spam is a Spam).
  • True Negative (TN): correctly predicting the other label (or saying a Non-Spam is a Non-Spam).
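As an illustration, here is one way to count these four outcomes from paired lists of actual and predicted labels. This is a minimal sketch; the label value 'pos' is just an assumption for the example.

    def table_of_confusion(y_true, y_pred, positive='pos'):
        """Count TP, FP, FN and TN for a binary problem."""
        tp = fp = fn = tn = 0
        for actual, predicted in zip(y_true, y_pred):
            if predicted == positive:
                if actual == positive:
                    tp += 1  # correctly predicted positive
                else:
                    fp += 1  # negative mislabeled as positive
            else:
                if actual == positive:
                    fn += 1  # positive mislabeled as negative
                else:
                    tn += 1  # correctly predicted negative
        return tp, fp, fn, tn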

In a general view, the confusion matrix looks as follows:

Confusion Matrix
How can we use these metrics? For instance, let's now consider a model for predicting whether a text message carries a positive or negative opinion (a common task in sentiment analysis). We have a dataset with 10,000 text messages where the model correctly predicts 9,700 negative messages and 100 positive messages. The model incorrectly predicts 150 messages that are positive as negative, and 50 messages that are negative as positive. The resulting confusion matrix is shown below.

Confusion Matrix on Sentiment classification task

For binary classification problems, which is our case here, we can derive from these counts two measures called sensitivity and specificity. They are commonly used in the evaluation of any binary classifier.

The specificity (TNR) measures the proportion of messages classified as negative (TN) out of all the messages that are actually negative (TN + FP). It can be seen as the probability that a message is classified as negative given that it is actually negative. With higher specificity, fewer negative messages are mislabeled as positive.

Sensitivity (TPR), on the other hand, is the proportion of messages classified as positive (TP) out of all the messages that are actually positive (TP + FN). It can be seen as the probability that a message is classified as positive given that it is actually positive. With higher sensitivity, fewer actual positive messages will be misclassified as negative.

Sensitivity can be expressed as:
  • Sensitivity = TP / (TP + FN)
and Specificity as:
  • Specificity = TN / (TN + FP)
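Expressed in code, using the counts from the sentiment example above (TP = 100, FN = 150, TN = 9,700, FP = 50), a minimal sketch:

    def sensitivity(tp, fn):
        return tp / (tp + fn)  # true positive rate: accuracy on the positive class

    def specificity(tn, fp):
        return tn / (tn + fp)  # true negative rate: accuracy on the negative class

    print(sensitivity(100, 150))  # 0.4    -> 40%
    print(specificity(9700, 50))  # 0.9948 -> ~99.5%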

In other words, sensitivity is the accuracy on the positive class, and specificity is the accuracy on the negative class. So, using these metrics, what is the accuracy on positive and negative messages?
  • Sensitivity = TP / (TP + FN) = 100 / (100 + 150) = 0.4 = 40%
  • Specificity = TN / (TN + FP) = 9700 / (9700 + 50) ≈ 0.995 = 99.5%

As you can see, if we have a sentiment classifier with 40% sensitivity and 99.5% specificity, and we have to check 1,000 messages of which 500 are positive and 500 are negative, we can expect about 200 true positives, 300 false negatives, roughly 497 true negatives and 3 false positives. We can conclude that the negative predictions are the more trustworthy ones, given the high specificity and the low sensitivity. So it is important to analyze the performance of your classifier by looking at each class separately.
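Those expected counts fall straight out of the two rates; a quick sketch, assuming the 500/500 split above:

    n_pos, n_neg = 500, 500
    sens, spec = 0.40, 0.995

    tp = sens * n_pos        # ~200 expected true positives
    fn = (1 - sens) * n_pos  # ~300 expected false negatives
    tn = spec * n_neg        # ~497 expected true negatives
    fp = (1 - spec) * n_neg  # ~3   expected false positives
    print(tp, fn, tn, fp)    # 200.0 300.0 497.5 2.5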
The relationship between sensitivity and specificity, as well as the overall performance of the classifier, can be visualized and studied using the ROC curve, which will be the topic of one of the next posts in this series.

I've written some Python code for computing the confusion matrix, specificity and sensitivity of a classifier here. Please make the necessary changes to adapt it to your own classifier.

That's all,

I hope you have enjoyed it!


Marcel Caraciolo



  1. Great script. We'll be waiting for your comments about the ROC curve.

    Thank you very much, Marcel.

  2. Hello, you made a mistake in your post regarding the example:

    Specificity should be :
    9700/(9700+50) = 0.9948 ( 99.5% )


  3. "Sensitivity means the accuracy on the class Negative, and Specificity means the accuracy on the class Positive. So using these metrics, what is the accuracy on Positive and Negative messages?"

    Isn't it the other way around, Positive/Negative?

  4. I think the values in your picture are in the wrong places; the 150 and 50 should swap.

  5. Sensitivity = TP / (TP+FN) = 100/(100+50) = 0.4 = 40%
    This calculation is wrong and should be 2/3 or 67%.

  6. Hello,

    You might be interested in my project and its pip package.

    With this package a confusion matrix can be pretty-printed and plotted. You can binarize a confusion matrix and get class statistics such as TP, TN, FP, FN, ACC, TPR, FPR, FNR, TNR (SPC), LR+, LR-, DOR, PPV, FDR, FOR, NPV, as well as some overall statistics.

    Kind regards
