
Performing runtime benchmarks with Python Monitoring Tool Benchy

Friday, March 22, 2013


Hi all,

I've been working over the last few weeks on a little project I developed called benchy. The goal of benchy is to answer simple questions such as: which version of the code is faster? Which algorithm consumes more memory? I know there are several tools suitable for this task, but I wanted to be able to create my own performance reports in Python.

Why did I create it? Since the beginning of the year I have been rewriting all the code in Crab, a Python framework for building recommender systems, and one of the main components that required refactoring was the set of pairwise metrics such as cosine, Pearson, euclidean, etc. I needed to test the performance of several versions of those functions, and doing that by hand is boring. That's what benchy is for!


What can benchy do?

Benchy is a lightweight Python library for running performance benchmarks over alternative versions of code. How can we use it?

Let's take the cosine function, a popular pairwise metric for comparing the similarity between two vectors or matrices in recommender systems.




Let's define the benchmarks to test:
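Here is a sketch of what those definitions might look like, assuming benchy exposes a Benchmark class that takes the statement and the setup as plain strings (the setups are the same ones that appear in the report further down; the exact signature is an assumption):

from benchy.api import Benchmark

# Shared setup: a random vector with 1000 elements.
common_setup = """
import numpy
X = numpy.random.uniform(1,5,(1000,))
"""

# One setup string per candidate implementation.
scipy_setup = common_setup + """
import scipy.spatial.distance as ssd
X = X.reshape(-1,1)
def cosine_distances(X, Y):
    return 1. - ssd.cdist(X, Y, 'cosine')
"""

sklearn_setup = common_setup + """
from sklearn.metrics.pairwise import cosine_similarity as cosine_distances
"""

nltk_setup = common_setup + """
from nltk import cluster
def cosine_distances(X, Y):
    return 1. - cluster.util.cosine_distance(X, Y)
"""

numpy_setup = common_setup + """
import numpy, math
def cosine_distances(X, Y):
    return 1. - numpy.dot(X, Y) / (math.sqrt(numpy.dot(X, X)) *
                                   math.sqrt(numpy.dot(Y, Y)))
"""

statement = "cosine_distances(X, X)"

benchmark1 = Benchmark(statement, scipy_setup, name='scipy.spatial 0.8.0')
benchmark2 = Benchmark(statement, sklearn_setup, name='sklearn 0.13.1')
benchmark3 = Benchmark(statement, nltk_setup, name='nltk.cluster')
benchmark4 = Benchmark(statement, numpy_setup, name='numpy')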



With all benchmarks created, we can run a single benchmark by calling the method run:
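A minimal sketch of that call, with the general shape of the result it returns (the timing numbers come from the report shown later; the memory key names are an assumption and the usage figure will vary):

results = benchmark1.run()
print(results)

# Roughly of this shape:
# {'memory':  {'repeat': 3, 'usage': ..., 'units': 'MB'},
#  'runtime': {'repeat': 3, 'timing': 18.36, 'loops': 10, 'units': 'ms'}}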


The dict associated with the key 'memory' holds the memory performance results: the number of times the statement was repeated and the average memory usage, together with its units. The key 'runtime' holds the runtime performance results: the number of repeats and loops, followed by the average time taken to execute the statement, also with its units.

Do you want to see a more presentable output? Just call the method to_rst with the results as a parameter:
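For example (assuming to_rst returns the reStructuredText report as a string):

rst_text = benchmark1.to_rst(results)
print(rst_text)   # renders to the report below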


Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

import scipy.spatial.distance as ssd
X = X.reshape(-1,1)
def cosine_distances(X, Y):
    return 1. - ssd.cdist(X, Y, 'cosine')
Benchmark statement
cosine_distances(X, X)
name                 repeat  timing  loops  units
scipy.spatial 0.8.0  3       18.36   10     ms


Now let's check which implementation is the fastest and which one consumes the least memory. First, let's create a BenchmarkSuite, which is simply a container for benchmarks:
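A sketch, assuming the suite exposes an add method (the method name is an assumption):

from benchy.api import BenchmarkSuite

suite = BenchmarkSuite()
for benchmark in [benchmark1, benchmark2, benchmark3, benchmark4]:
    suite.add(benchmark)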

Next, let's run all the benchmarks together with the BenchmarkRunner. This class loads all the benchmarks from the suite, runs each individual analysis and prints out the reports:
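Something along these lines (the constructor arguments and the return value of run are assumptions; tmp_dir is where the intermediate runs would be written):

from benchy.api import BenchmarkRunner

runner = BenchmarkRunner(benchmarks=suite, tmp_dir='.',
                         name='cosine benchmarks')
# run() is assumed to return the number of benchmarks and a results dict.
n_benchs, results = runner.run()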



Next, we will plot the relative timings, which show how much faster or slower each benchmark is compared to the reference one. Just call the method plot_relative:
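A sketch of that call (the horizontal flag and the use of matplotlib to save the figure are assumptions):

import matplotlib.pyplot as plt

fig = runner.plot_relative(results, horizontal=True)
plt.savefig('%s_r.png' % runner.name, bbox_inches='tight')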




As you can see in the resulting graph, the scipy.spatial.distance function is about 2129x slower and the sklearn approach about 19x; the best one is the plain numpy approach. Let's look at the absolute timings. Just call the method plot_absolute:
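Analogous to the previous call (again a sketch with the same assumptions):

runner.plot_absolute(results, horizontal=False)
plt.savefig('%s.png' % runner.name, bbox_inches='tight')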



Besides the bars representing the timings, this plot also draws a line showing the memory consumption of each statement. The one that consumes the least memory is the nltk.cluster approach!

Finally, benchy also provides a full report for all the benchmarks by calling the method to_rst:
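A sketch of that step (passing the plot filenames so they can be embedded in the report is an assumption):

rst_text = runner.to_rst(results, runner.name + '.png',
                         runner.name + '_r.png')
with open('cosine_benchmarks.rst', 'w') as f:
    f.write(rst_text)

The generated report looks like the one below.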




Performance Benchmarks

These historical benchmark graphs were produced with benchy.
Produced on a machine with
  • Intel Core i5 950 processor
  • Mac OS X 10.6
  • Python 2.6.5 64-bit
  • NumPy 1.6.1

scipy.spatial 0.8.0

Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

import scipy.spatial.distance as ssd
X = X.reshape(-1,1)
def cosine_distances(X, Y):
    return 1. - ssd.cdist(X, Y, 'cosine')
Benchmark statement
cosine_distances(X, X)
name                 repeat  timing  loops  units
scipy.spatial 0.8.0  3       19.19   10     ms

sklearn 0.13.1

Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

from sklearn.metrics.pairwise import cosine_similarity as cosine_distances
Benchmark statement
cosine_distances(X, X)
name            repeat  timing  loops  units
sklearn 0.13.1  3       0.1812  1000   ms

nltk.cluster

Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

from nltk import cluster
def cosine_distances(X, Y):
    return 1. - cluster.util.cosine_distance(X, Y)
Benchmark statement
cosine_distances(X, X)
name          repeat  timing   loops  units
nltk.cluster  3       0.01024  1e+04  ms

numpy

Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

import numpy, math
def cosine_distances(X, Y):
    return 1. -  numpy.dot(X, Y) / (math.sqrt(numpy.dot(X, X)) *
                                     math.sqrt(numpy.dot(Y, Y)))
Benchmark statement
cosine_distances(X, X)
name   repeat  timing    loops  units
numpy  3       0.009339  1e+05  ms

Final Results

name                 repeat  timing    loops  units  timeBaselines
scipy.spatial 0.8.0  3       19.19     10     ms     2055
sklearn 0.13.1       3       0.1812    1000   ms     19.41
nltk.cluster         3       0.01024   1e+04  ms     1.097
numpy                3       0.009339  1e+05  ms     1

Final code!
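The full script is basically the snippets above glued together. A condensed sketch, with the same caveats about the exact benchy API:

from benchy.api import Benchmark, BenchmarkSuite, BenchmarkRunner
import matplotlib.pyplot as plt

common_setup = """
import numpy
X = numpy.random.uniform(1,5,(1000,))
"""
statement = "cosine_distances(X, X)"

setups = {
    'scipy.spatial 0.8.0': common_setup + """
import scipy.spatial.distance as ssd
X = X.reshape(-1,1)
def cosine_distances(X, Y):
    return 1. - ssd.cdist(X, Y, 'cosine')
""",
    'sklearn 0.13.1': common_setup + """
from sklearn.metrics.pairwise import cosine_similarity as cosine_distances
""",
    'nltk.cluster': common_setup + """
from nltk import cluster
def cosine_distances(X, Y):
    return 1. - cluster.util.cosine_distance(X, Y)
""",
    'numpy': common_setup + """
import numpy, math
def cosine_distances(X, Y):
    return 1. - numpy.dot(X, Y) / (math.sqrt(numpy.dot(X, X)) *
                                   math.sqrt(numpy.dot(Y, Y)))
""",
}

# Build the suite, run everything and write the report with the plots.
suite = BenchmarkSuite()
for name, setup in setups.items():
    suite.add(Benchmark(statement, setup, name=name))

runner = BenchmarkRunner(benchmarks=suite, tmp_dir='.', name='cosine benchmarks')
n_benchs, results = runner.run()

runner.plot_relative(results, horizontal=True)
plt.savefig('%s_r.png' % runner.name, bbox_inches='tight')
runner.plot_absolute(results, horizontal=False)
plt.savefig('%s.png' % runner.name, bbox_inches='tight')

with open('cosine_benchmarks.rst', 'w') as f:
    f.write(runner.to_rst(results, runner.name + '.png', runner.name + '_r.png'))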

I should say this micro-project is still a prototype, but I tried to build it to be easily extensible. I have several ideas for extending it, and feel free to fork it and send suggestions and bug fixes. This project was inspired by the open-source project vbench, a framework for performance benchmarks over your source repository's history, which I highly recommend!

Benchy will help me test the several alternative pairwise functions in Crab. :)  Soon I will publish the performance results we got with the pairwise functions we built for Crab. :)

I hope you enjoyed,

Regards,

Marcel Caraciolo

Tools for Machine Learning Performance Evaluation: Confusion Matrix

Tuesday, August 31, 2010



Hi all, 

Starting now, I'll write some posts about supervised and unsupervised learning, specifically related to performance evaluation: classification accuracy, lift, ROC curves, F1-score and error measures.

The Confusion Matrix

Let's start with one of the most popular tools for evaluating the performance of a model in classification or prediction tasks: the confusion matrix (in unsupervised learning it is typically called a matching matrix). Its focus is on the predictive capability of a model rather than on how long the model takes to perform the classification, its scalability, and so on.

The confusion matrix is a matrix in which each row represents the instances of a predicted class, while each column represents the instances of an actual class. One of the advantages of this evaluation tool is that the analyst can easily see whether the model is confusing two classes (i.e. commonly mislabeling one as the other).

The matrix also gives the per-class accuracy of the classifier: the number of correctly classified patterns in a given class divided by the total number of patterns in that class. The overall (average) accuracy of the classifier can also be computed from the confusion matrix.

Let's see a confusion matrix in action with an example. Imagine you have a dataset with 33 patterns labeled 'Spam' (S) and 67 labeled 'Non-Spam' (NS). For a classifier trained on this dataset to label an e-mail as 'Spam' or 'Non-Spam', the confusion matrix shows the classification accuracy on the training data. In the example confusion matrix below, of the 33 'Spam' patterns, 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam' (an accuracy of 81.8%). Of the 67 'Non-Spam' patterns, 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam' (an accuracy of 85.1%). The overall accuracy of the classifier on this dataset is 84%.

Confusion Matrix on spam classification model
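A quick check of those numbers in Python (purely illustrative arithmetic):

spam_correct, spam_missed = 27, 6            # 33 actual 'Spam' patterns
nonspam_correct, nonspam_missed = 57, 10     # 67 actual 'Non-Spam' patterns

spam_accuracy = spam_correct / float(spam_correct + spam_missed)              # ~0.818
nonspam_accuracy = nonspam_correct / float(nonspam_correct + nonspam_missed)  # ~0.851
overall_accuracy = (spam_correct + nonspam_correct) / 100.0                   # 0.84

print(spam_accuracy, nonspam_accuracy, overall_accuracy)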

However, the confusion matrix only tells us how the classifier is behaving for individual classes. When a dataset is unbalanced (the number of samples in one class is much larger than in the other, which happens a lot with Spam/Non-Spam datasets), the overall accuracy is not representative of the true performance of the classifier. For instance, imagine there are 990 'Non-Spam' patterns and only 10 'Spam' patterns; the classifier can easily be biased towards the 'Non-Spam' class. If the model classifies all the samples as 'Non-Spam', the accuracy will be 99%, which is not a real indication of the classification performance: the classifier has a 100% recognition rate for 'Non-Spam' but a 0% recognition rate for 'Spam'. Looking at the matrix, the system clearly has trouble predicting the 'Spam' class, even though it is 99% accurate overall. Given that the 'Spam' class is usually the one of actual interest, the confusion matrix alone is not enough to evaluate the model's performance, but it gives us insight into how the model is predicting each class and is the starting point for the other metrics explained in the next section.

Confusion Matrix on an unbalanced dataset


The Table of Confusion

Each cell of the confusion matrix corresponds to one of the quantities True Positives, False Positives, False Negatives and True Negatives. These are defined as:
  • False Positive (FP):  Falsely Predicting a label (or saying that Non-Spam is a Spam).
  • False Negative (FN):  Missing an incoming label (or saying a Spam is Non-Spam).
  • True Positive (TP):  Correctly predicting a label (or saying a Spam is Spam).
  • True Negative (TN): Correctly predicting the other label (or saying Non-Spam is Non-Spam).

In its general form, the confusion matrix looks as follows:

Confusion Matrix
 
How can we use those metrics? For instance, let's now consider a model for predicting whether a text message carries a positive or a negative opinion (a common sentiment analysis task). We have a dataset with 10,000 text messages where the model correctly predicts 9,700 negative messages and 100 positive messages. It incorrectly predicts 150 positive messages as negative and 50 negative messages as positive. The resulting confusion matrix is shown below.

Confusion Matrix on Sentiment classification task


For binary classification problems, such as this one, we can derive from those counts two measures called sensitivity and specificity. They are commonly used to evaluate any binary classifier.

Specificity (the true negative rate, TNR) measures the proportion of messages classified as negative (TN) out of all the messages that are actually negative (TN+FP). It can be seen as the probability that a message is classified as negative given that it actually expresses a negative opinion. With higher specificity, fewer negative messages are mislabeled as positive.

On the other hand, sensitivity (the true positive rate, TPR) is the proportion of messages classified as positive (TP) out of all the messages that are actually positive (TP+FN). It can be seen as the probability that a message is classified as positive given that it actually expresses a positive opinion. With higher sensitivity, fewer positive messages are misclassified as negative.

Sensitivity can be expressed as :
  • TP / (TP+FN)
and Specificity as:
  • TN / (TN+FP)

In other words, sensitivity is the accuracy on the positive class and specificity is the accuracy on the negative class. So, using these metrics, what is the accuracy on positive and negative messages?
  • Sensitivity = TP / (TP+FN) = 100/(100+150) = 0.4 = 40%
  • Specificity = TN / (TN+FP) = 9700/(9700+50) ≈ 0.995 = 99.5%
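A minimal sketch of these computations, using the counts from the sentiment example above:

TP, FN = 100, 150   # positive messages: correctly / incorrectly classified
TN, FP = 9700, 50   # negative messages: correctly / incorrectly classified

sensitivity = TP / float(TP + FN)   # 0.40   -> accuracy on the positive class
specificity = TN / float(TN + FP)   # ~0.995 -> accuracy on the negative class

print(sensitivity, specificity)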

As you can see, if we have a classifier with 40% sensitivity and 99.5% specificity and we check 1,000 messages, 500 of them positive and 500 negative, we are likely to get about 200 true positives, 300 false negatives, roughly 497 true negatives and only a handful of false positives. We can conclude that the negative predictions are far more reliable, given the high specificity and the low sensitivity. This is why it is important to analyze the performance of your classifier by looking at both metrics separately.
The relationship between sensitivity and specificity, as well as the overall performance of the classifier, can be visualized and studied using the ROC curve, which will be the subject of one of the next posts on this topic.

I've developed some Python code for computing the confusion matrix, specificity and sensitivity of a classifier here. Please make the necessary changes to adapt it to your classifier.

That's all,

I hope you enjoyed it!

Cheers,

Marcel Caraciolo

