Hi all,
This article is the fourth in the series about High Performance Computation with Python. If you missed the first, second and third parts, check this link about Python Profiling, this one about Cython, and finally this one about Numpy Vectors. The goal of the series is to present approaches to make CPU-demanding tasks in Python run much faster.
The techniques that are being covered:
- Python Profiling - How to find bottlenecks
- Cython - Annotate your code and compile to C
- Numpy Vectors - Fast vector operations using numpy arrays
- Numpy integration with Cython - fast numerical Python library wrapped by Cython
- PyPy - Python's new Just in Time Compiler
In this post I will talk about PyPy - the JIT Compiler for Python!
The Problem
In this series we analyze how to optimize the Spearman Rank Correlation coefficient, a statistical measure used to compute the similarity between items in recommender systems; it assesses how well the relationship between two variables can be described by a monotonic function. The source code for this metric can be found in the first post.
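For reference, here is a simplified sketch of the metric (the complete implementation used throughout the series is the one from the first post). It assumes there are no tied ratings, in which case the classic formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies, where d is the difference between the ranks of each item in the two rating lists:

    # Simplified sketch of the Spearman Rank Correlation (no tied ratings).
    def rank(values):
        # Map each value to its 1-based rank in ascending order.
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0] * len(values)
        for position, index in enumerate(order):
            ranks[index] = position + 1
        return ranks

    def spearman(x, y):
        n = len(x)
        rank_x, rank_y = rank(x), rank(y)
        d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
        return 1.0 - (6.0 * d_squared) / (n * (n ** 2 - 1))

    print(spearman([3.0, 5.0, 4.0, 1.0], [4.0, 5.0, 3.0, 2.0]))  # 0.8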
PyPy
PyPy is a Just in Time (JIT) compiler for the Python programming language. It is multi-platform and it runs Python 2.7. Depending on how your code is written, running it under PyPy can make it faster (2 - 10x speed-ups). Sometimes some work has to be done in the code, because CPython tolerates shortcuts that aren't actually correct according to the Python specification.
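A classic example of such a shortcut is relying on CPython's reference counting to close a file as soon as it goes out of scope; PyPy's garbage collector gives no such guarantee, so the explicit form is needed (the file name below is only illustrative):

    # Works in CPython because the file object is reference-counted and closed
    # immediately, but under PyPy the handle may stay open for a long time:
    data = open("ratings.csv").read()

    # Portable form, correct under any Python implementation:
    with open("ratings.csv") as f:
        data = f.read()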
You can download and install PyPy here. To install it, just place it in your home directory and put a symlink to it somewhere on your PATH. Let's run spearman.py with PyPy and with plain Python and see the performance difference:
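Here is a minimal sketch of how the measurement could be done (the input size is arbitrary and the spearman() function is the one sketched above, not the exact benchmark from the post). The same script is then run with both interpreters, for example as python spearman.py and pypy spearman.py:

    # Time the metric on a large synthetic input; run this file with both
    # CPython and PyPy and compare the printed times. Assumes the spearman()
    # function sketched earlier is defined in the same file.
    import random
    import time

    size = 100000  # arbitrary size, chosen only for illustration
    x = [random.random() for _ in range(size)]
    y = [random.random() for _ in range(size)]

    start = time.time()
    coefficient = spearman(x, y)
    elapsed = time.time() - start

    print("coefficient: %.4f computed in %.2f seconds" % (coefficient, elapsed))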
PyPy was about 34.77% faster than pure Python for an input of size 190340 on my Macbook. The amazing part is that I didn't change a single line of my code! \m/
If you aren't using a C library like numpy, then you should check out PyPy - it might just make your code run several times faster. The team is still porting Numpy support to PyPy, since some of the C code that Numpy relies on must be rewritten. You can see some benchmarks of the porting effort here.
Although the PyPy team already offers a simple array interface that behaves in a numpy-like fashion, for now it has very few functions and only supports double arithmetic.
I strongly recommend you take a look at PyPy. It shows great promise for high-performance Python with little effort, and especially for the scientific community, support for the existing numpy would be a great advance!
I haven't mentioned it until now, but I will write a special post to close this series on High Performance with Python: it is about the multiprocessing module and how you can work with it. I will show some examples and a library called JobLib that wraps it, so you can easily use the power of your machine's processors and do some parallel work.
See you next time,
Regards,
Marcel Caraciolo