This article is the third in the series about High Computation with Python. If you missed the first and second parts, check this link and this one. The goal is to present approaches that make CPU-demanding tasks in Python run much faster.
The techniques that are being covered:
- Python Profiling - How to find bottlenecks
- Cython - Annotate your code and compile to C
- Numpy Vectors - Fast vector operations using numpy arrays
- Numpy integration with Cython - fast numerical Python library wrapped by Cython
- PyPy - Python's new Just in Time Compiler
In this post I will talk about Numpy Vectors and how you can wrap them with Cython!
The Problem
In this series we analyze how to optimize Spearman's rank correlation coefficient, a measure used to compute the similarity between items in recommender systems. It assesses how well the relationship between two variables can be described by a monotonic function. The source code for this metric can be found in the first post.
Numpy
Numpy is a powerful extension to Python that adds support for large, multi-dimensional arrays and matrices, along with many mathematical functions to manipulate them. To install it, type this command at your terminal:
$ sudo easy_install numpy
Or
$ pip install numpy
In our example we will change spearman.py. Import the numpy library and change spearman_correlation to look like the one below. If you run and test it you will get the same output as before.
Numpy's strength is that it simplifies many operations on vectors or matrices of numbers, since they work on the whole array at once rather than on one element at a time. Where before we had nested for loops over the individual terms of a list, with numpy you can do the same job in a faster and simpler way.
Some notes:
- You define an array with the numpy.array statement; in our case it is built from a list of tuples whose fields are indexed by the keys labels and ranks (lines 29 and 30).
- Many operations are already implemented in numpy, such as numpy.in1d, which tests whether each element of the first vector is present in the second vector and returns an array of bools.
- numpy.sort sorts the elements based on a key, in this example ranks (lines 16 and 17).
- diffs * diffs does an element-wise multiplication; think of it as computing diffs[0] * diffs[0], diffs[1] * diffs[1], ..., diffs[n-1] * diffs[n-1] (line 36).
- size is an attribute of numpy arrays that gives the total number of elements (m*n for an m-by-n array).
If it is still unclear, I suggest trying it at the command line, step by step, and looking over the results. Put a small number of elements in the array and see it in action.
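As a starting point for that kind of experiment, here is a minimal sketch that exercises the operations from the notes on a tiny array. It is not the spearman.py listing from this series: the toy data, the variable names and the final formula (the textbook Spearman formula for rankings without ties) are assumptions made for illustration. It uses numpy.isin, which the NumPy documentation recommends over numpy.in1d for new code and which behaves the same on one-dimensional arrays.

```python
import numpy as np

# Hypothetical toy data: two rankings stored as (label, rank) tuples.
dtype = [("labels", "U10"), ("ranks", "f8")]
ranks1 = np.array([("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0)], dtype=dtype)
ranks2 = np.array([("c", 1.0), ("a", 2.0), ("b", 3.0), ("e", 4.0)], dtype=dtype)

# Boolean masks: which labels of one array also appear in the other.
mask1 = np.isin(ranks1["labels"], ranks2["labels"])
mask2 = np.isin(ranks2["labels"], ranks1["labels"])

# Keep the common items and sort both by label so the rows line up.
common1 = np.sort(ranks1[mask1], order="labels")
common2 = np.sort(ranks2[mask2], order="labels")

# Element-wise arithmetic: no explicit loop over the items.
diffs = common1["ranks"] - common2["ranks"]
squared = diffs * diffs

# size is the total number of elements in the array.
n = diffs.size

# Assumption: textbook Spearman formula for rankings without ties.
rho = 1.0 - 6.0 * squared.sum() / (n * (n * n - 1))
print(mask1, squared, n, rho)
```

Printing each intermediate value (mask1, common1, diffs and so on) one at a time in the interpreter is a good way to see what every step does.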
Numpy with Cython
Numpy is a powerful library that uses very fast, C-optimized math libraries to perform these calculations quickly. You can also wrap your Python code with Cython. The main difference is the annotation of the numpy arrays; see the tutorial for further details. The differences are in how we import (cimport numpy as np) and in the signature of the function _rank_dists.
Special Notes - Meeting Scipy
Another powerful library is Scipy, a package for Python that brings several algebra techniques for dealing with matrices and vectors. One special module is scipy.stats, which comes with the spearmanr function. It receives two arrays with the observations and returns the Spearman coefficient, together with a p-value. Amazing! Let's see our code below:
In the next post we will study PyPy, a JIT compiler that can speed up your code with minimal changes!
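For the scipy.stats.spearmanr call described in the Scipy notes above, a minimal sketch could look like the following. The two observation arrays are made-up toy values, not data from this series.

```python
import numpy as np
from scipy import stats

# Hypothetical observations for two variables (toy values, no ties).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# spearmanr ranks the observations internally and returns the
# correlation coefficient together with a p-value.
rho, p_value = stats.spearmanr(x, y)
print(rho, p_value)
```

Because spearmanr does the ranking itself, you can pass raw observations rather than precomputed ranks.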