
Review of the book Numpy 1.5 - Beginner's Guide

Saturday, November 26, 2011

Hi all,

I'd like to share my review of the book Numpy 1.5 Beginner's Guide by Ivan Idris, one of the latest books in a series of manuals covering scientific computing libraries written in Python. This book covers the Numpy library for manipulating vectors and matrices and its support for mathematical functions.

Numpy 1.5 Beginner's Guide, from Packt Publishing

Quick Review

The book is a great and useful resource for anyone who wants to explore the Numpy scientific library further, since it covers almost all of the modules available in Numpy 1.5. It comes with several examples, especially for financial researchers and developers who work with financial data: the author explores several modules using stocks and historical price data. The author explains each function or operation with code and the expected results, so the reader can follow precisely what is happening as each module is presented. One of the strengths of the book is how it is organized: the step-by-step guide when presenting complex Numpy functions, for example the add.reduceat, add.accumulate and add.reduce operations.
The part I didn't like was the exercises, which were quite simple. I'd like to see deeper exercises exploring the resources given in the book, and I missed more information about NaN values. I also didn't see anything about the functions squeeze and choose, or about more complex structured arrays (arrays with tuples, etc.).

To sum up, I recommend this book for anyone wishing to learn about scientific computing with Python using the mathematical library Numpy, which is a great (and free!) alternative to Matlab, Mathematica and other packages. I hope a book covering the Scipy library follows soon! By the way, finance fans will love this book, since almost the entire book is built on examples using financial data!

Review


The book starts with a step-by-step installation process for Numpy, along with a little introduction about what Numpy is, its history, etc. I'd like to mention that, even with all the platforms covered in the book, Numpy is not as easy to install on Mac OS as the book suggests. The problem is that developers generally don't use the built-in Python that comes with the Mac, since it is outdated (my Snow Leopard comes with Python 2.6.1). It's when you install a new Python that the problems come: several compilation errors, messages you can't understand, etc. But if you use MacPorts, you will be free of all these errors! (After all the nightmare of the installation, I discovered MacPorts :P)

The following chapters, 2-4, present the Numpy fundamentals, covering array manipulation and the most commonly used operations. The book follows a cyclic process, where each function the author presents goes through an introduction to the problem to solve, the actions (how you can solve it with Numpy), auxiliary Numpy functions and operations, and finally a "what just happened" section, that is, an explanation of what was done after showing the solution. Most of the examples in the book use financial data and stock market values, an interesting choice since the author reuses the same examples through the chapters in a progressive and logical way. Having each function and Numpy feature described and explained makes the book a good reference guide for someone using the library. There were minor issues related to the imports: he doesn't mention the imports in some examples, for instance the numpy.loadtxt function when he uses the datetime module. For a beginner studying Python for the first time, it may be harder to follow the examples, since they cannot always tell where the functions or modules are coming from.

The second part of the book covers matrices, universal functions, some Scipy modules, the use of matplotlib, and testing. Chapter 5 covers the matrix module and universal functions such as add, divide, prod, sum and so on. I missed some functions that weren't covered in this chapter, such as numpy.choose or numpy.squeeze. I believe the author forgot or didn't have space to mention these specific functions, but it does not hurt the quality of the book at all. The chapter I liked most was the one about testing. Several developers, especially scientific researchers, are not used to testing their code, so I believe it is a great chapter for anyone who wants to assure quality and avoid future bugs using the Numpy testing modules. The chapter should be longer and include more examples, even creating test cases and tips for scientific developers.
Finally, the last two chapters focus on plotting and Scipy integration. I think the plotting chapter should be at the beginning of the book, because the author already uses lots of matplotlib examples in the previous chapters and only explains the library further at the end. The chapter is well written and gives you sufficient content to get started with matplotlib. The last chapter covers the use of several Scipy functions, but it does not give deeper explanations of how they work, as he did in the previous chapters with Numpy. However, it gives several useful examples of working with integration, image processing and even optimization. Many developers will enjoy this extra chapter covering the use of Scipy + Numpy.


Conclusions

My overall impression of this book is that it makes a useful reference guide for Numpy. For financial researchers and developers it will be a great book, since it also uses lots of examples with financial data to present the Numpy fundamentals. There were minor issues related to Scipy and matplotlib that should have been explained in more depth. For anyone who wants to start using Numpy it can be an excellent book to begin with, since it covers all the fundamental steps with a cyclic, progressive introduction to using the scientific packages in Python.

Regards,

Marcel Caraciolo

Machine Learning with Python - Logistic Regression

Sunday, November 6, 2011

 Hi all,

I decided to start a new series of posts focusing on general machine learning, with several snippets for anyone to use with real problems or real datasets. Since I am studying machine learning again through a great online course offered this semester by Stanford University, one of the best ways to review the content learned is to write some notes about it. The best part is that the posts will include examples with Python, Numpy and Scipy. I hope you enjoy them all!

The series:


In this post I will cover Logistic Regression and Regularization.


Logistic Regression


Logistic Regression is a type of regression that predicts the probability of occurrence of an event by fitting data to a logit function (the logistic function). Like many forms of regression analysis, it makes use of several predictor variables, which may be either numerical or categorical. For instance, the probability that a person will have a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. This regression is widely used in several scenarios, such as predicting a customer's propensity to purchase a product or cancel a subscription in marketing applications, among many others.


Visualizing the Data



Let's explain logistic regression by example. Suppose you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on the two exams and the admission decision. We will use logistic regression to build a model that estimates the probability of admission based on the scores from those two exams.

Let's first visualize our data on a 2-dimensional plot, as shown below. As you can see, the axes are the two exam scores, and the positive and negative examples are shown with different markers.

Sample training visualization

The code
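A minimal sketch of how this plot could be produced with Numpy and matplotlib. The file name ex2data1.txt and its comma-separated layout (two exam scores and a 0/1 admission label) are assumptions for illustration, not details from the original post.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed data file: each row is "exam1_score,exam2_score,label", label 1 = admitted, 0 = not admitted
data = np.loadtxt('ex2data1.txt', delimiter=',')
X = data[:, :2]   # the two exam scores
y = data[:, 2]    # admission decision

admitted = y == 1
rejected = y == 0

plt.scatter(X[admitted, 0], X[admitted, 1], marker='+', color='black', label='Admitted')
plt.scatter(X[rejected, 0], X[rejected, 1], marker='o', color='yellow', label='Not admitted')
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()
```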


Cost Function and Gradient



The logistic regression hypothesis is defined as:

h_theta(x) = g(theta^T x)

where the function g is the sigmoid function, defined as:

g(z) = 1 / (1 + e^(-z))

The sigmoid function has the special property of mapping any real value into the range [0, 1]. For large positive values of x the sigmoid is close to 1, while for large negative values it is close to 0.

Sigmoid Logistic Function 
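As a quick sketch, the sigmoid can be written in a couple of lines of Numpy (this snippet is illustrative, not code from the original post):

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid; works for scalars and Numpy arrays."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))      # 0.5
print(sigmoid(100))    # very close to 1
print(sigmoid(-100))   # very close to 0
```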

The cost function for logistic regression is given below:

J(theta) = (1/m) * sum_i [ -y(i) * log(h_theta(x(i))) - (1 - y(i)) * log(1 - h_theta(x(i))) ]

and the gradient of the cost is a vector of the same length as theta, where the j-th element is defined as follows:

dJ(theta)/d(theta_j) = (1/m) * sum_i (h_theta(x(i)) - y(i)) * x_j(i)


You may note that the gradient is quite similar to the linear regression gradient; the difference comes from the fact that linear and logistic regression have different definitions of h(x).

Let's see the code:
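Since the original snippet is not reproduced here, below is a minimal sketch of the cost and gradient, reusing the sigmoid function above. It assumes X already has a leading column of ones for the intercept and y holds 0/1 labels; the function names are my own.

```python
import numpy as np

def compute_cost(theta, X, y):
    """Logistic regression cost J(theta) for design matrix X (intercept column included) and 0/1 labels y."""
    m = y.size
    h = sigmoid(X.dot(theta))
    return (1.0 / m) * np.sum(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))

def compute_grad(theta, X, y):
    """Gradient of J(theta): (1/m) * X^T (h - y)."""
    m = y.size
    h = sigmoid(X.dot(theta))
    return (1.0 / m) * X.T.dot(h - y)
```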






Now, to find the minimum of this cost function, we will use a scipy built-in function called fmin_bfgs. It will find the best parameters theta for the logistic regression cost function, given a fixed dataset (of X and y values). A sketch of the call appears after the parameter list below.
The parameters are:
  • The initial values of the parameters you are trying to optimize;
  • A function that, when given the training set and a particular theta, computes the logistic regression cost and gradient with respect to theta for the dataset (X,y).
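Here is a sketch of that call, building on the arrays and functions from the previous snippets (X with the two exam scores, y with the labels, compute_cost and compute_grad as defined above):

```python
import numpy as np
from scipy.optimize import fmin_bfgs

# Add the intercept column so theta[0] is the bias term
m = X.shape[0]
X1 = np.hstack([np.ones((m, 1)), X])
initial_theta = np.zeros(X1.shape[1])

# fmin_bfgs minimizes compute_cost starting from initial_theta,
# using compute_grad as the analytical gradient; args passes the fixed dataset.
theta = fmin_bfgs(compute_cost, initial_theta, fprime=compute_grad, args=(X1, y))
print(theta)
```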

The final theta value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the one below.






Evaluating logistic regression



Now that you have learned the parameters of the model, you can use the model to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should see an admission probability of 0.776.

But you can go further and evaluate the quality of the parameters we have found, and see how well the learned model predicts on our training set. If we apply a threshold of 0.5 to our sigmoid logistic function, we can consider that:

h_theta(x) >= 0.5  ->  predict 1,    h_theta(x) < 0.5  ->  predict 0

where 1 represents admitted and 0 not admitted.

Turning to the code, we can calculate the training accuracy of our classifier, that is, the percentage of training examples it got correct. Source code.
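A sketch of the prediction and accuracy steps, reusing theta, X1 and y from the snippets above (the helper name predict is my own):

```python
import numpy as np

def predict(theta, X):
    """Predict 1 when the estimated probability is at least 0.5, otherwise 0."""
    return (sigmoid(X.dot(theta)) >= 0.5).astype(int)

# Admission probability for exam scores of 45 and 85 (note the leading 1 for the intercept)
prob = sigmoid(np.array([1.0, 45.0, 85.0]).dot(theta))
print('Admission probability: %.3f' % prob)

# Training accuracy: percentage of training examples classified correctly
accuracy = 100.0 * np.mean(predict(theta, X1) == y)
print('Train accuracy: %.2f%%' % accuracy)
```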



89%, not bad, huh?!



Regularized logistic regression



But what if your data cannot be separated into positive and negative examples by a straight line through the plot? Since our logistic regression will only be able to find a linear decision boundary, we will have to fit the data in a better way. Let's go through an example.

Suppose you are the product manager of a factory and you have the results of two different tests for some microchips. From these two tests you would like to determine whether the microchips should be accepted or rejected. We have a dataset of test results on past microchips, from which we can build a logistic regression model.





Visualizing the data



Let's visualize our data. As you can see in the figure below, the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.
Microchip training set

You may see that a model built for this task might predict the training data perfectly, and that can cause some troubling cases. Just because the model can perfectly reconstruct the training set does not mean that it has everything figured out. This is known as overfitting. You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least some regularization in there. Regularization is a powerful strategy to combat the overfitting problem. We will see it in action in the next sections.





Feature mapping



One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power.


As a result of this mapping, our vector of two features (the scores on the two QA tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2D plot.


Although the feature mapping allows us to build a more expressive classifier, it also makes it more susceptible to overfitting. That is where regularized logistic regression comes in, to fit the data while avoiding the overfitting problem.


Source code.
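A possible implementation of this feature mapping (the function name map_feature follows the course convention; the snippet itself is my own sketch):

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Map two features to all polynomial terms up to the given degree.

    The first column is all ones (intercept), followed by x1, x2, x1^2, x1*x2,
    x2^2, ..., up to x2^degree - 28 columns in total for degree 6.
    """
    x1 = np.atleast_1d(x1).astype(float).ravel()
    x2 = np.atleast_1d(x2).astype(float).ravel()
    columns = [np.ones(x1.size)]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append((x1 ** (i - j)) * (x2 ** j))
    return np.column_stack(columns)
```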


Cost function and gradient


The regularized cost function in logistic regression is:

J(theta) = (1/m) * sum_i [ -y(i) * log(h_theta(x(i))) - (1 - y(i)) * log(1 - h_theta(x(i))) ] + (lambda / 2m) * sum_{j=1..n} theta_j^2

Note that you should not regularize the parameter theta_0, so the final summation runs from j = 1 to n, not from j = 0 to n. The gradient of the cost function is a vector where the j-th element (for j >= 1) is defined as follows:

dJ(theta)/d(theta_j) = (1/m) * sum_i (h_theta(x(i)) - y(i)) * x_j(i) + (lambda/m) * theta_j


Now let's learn the optimal parameters theta. Using these new functions together with the scipy optimization function we used before, we will be able to learn the parameters theta.

The full code is now provided (code).
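Since the linked code is not reproduced here, the sketch below shows what the regularized cost and gradient could look like, reusing sigmoid, map_feature and fmin_bfgs from the previous snippets. X and y are assumed to hold the two microchip test scores and the 0/1 labels, and lambda = 1.0 is just an illustrative value.

```python
import numpy as np
from scipy.optimize import fmin_bfgs

def compute_cost_reg(theta, X, y, lam):
    """Regularized logistic regression cost; theta[0] (the intercept) is not penalized."""
    m = y.size
    h = sigmoid(X.dot(theta))
    cost = (1.0 / m) * np.sum(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
    return cost + (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)

def compute_grad_reg(theta, X, y, lam):
    """Gradient of the regularized cost; no regularization term for the intercept."""
    m = y.size
    h = sigmoid(X.dot(theta))
    grad = (1.0 / m) * X.T.dot(h - y)
    grad[1:] += (lam / m) * theta[1:]
    return grad

# Learn theta on the mapped 28-dimensional features
X_mapped = map_feature(X[:, 0], X[:, 1])
initial_theta = np.zeros(X_mapped.shape[1])
theta = fmin_bfgs(compute_cost_reg, initial_theta,
                  fprime=compute_grad_reg, args=(X_mapped, y, 1.0))
```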







Plotting the decision boundary



Let's visualize the model learned by the classifier. The plot will display the non-linear decision boundary that separates the positive and negative examples. 

Decision Boundary


As you can see, our model successfully predicted our data with an accuracy of 83.05%.

Code
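One way to draw a boundary like the one above is to evaluate theta^T * map_feature(u, v) over a grid and plot its zero-level contour (probability 0.5). This is a sketch reusing map_feature, theta, X and y from the previous snippets; the grid range is an assumed value that fits the normalized microchip scores.

```python
import numpy as np
import matplotlib.pyplot as plt

u = np.linspace(-1.0, 1.5, 50)
v = np.linspace(-1.0, 1.5, 50)
z = np.zeros((u.size, v.size))
for i, ui in enumerate(u):
    for j, vj in enumerate(v):
        z[i, j] = map_feature(ui, vj).dot(theta)[0]

# Training points plus the zero contour of the decision function
plt.scatter(X[y == 1, 0], X[y == 1, 1], marker='+', color='black', label='Accepted')
plt.scatter(X[y == 0, 0], X[y == 0, 1], marker='o', color='yellow', label='Rejected')
plt.contour(u, v, z.T, levels=[0], colors='green')
plt.xlabel('Microchip test 1')
plt.ylabel('Microchip test 2')
plt.legend()
plt.show()
```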





Scikit-learn



Scikit-learn is an amazing tool for machine learning, providing several modules for working with classification, regression and clustering problems. It uses Python, Numpy and Scipy, and it is open source!

If you want to use logistic regression or linear regression, you should consider scikit-learn. It comes with several examples and several types of regularization strategies to work with. Take a look at this link and see for yourself! I recommend it!
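As a small illustration, fitting the same kind of model with scikit-learn takes only a few lines (X and y here are assumed to be the raw feature matrix and 0/1 labels; C is the inverse of the regularization strength):

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0, penalty='l2')  # smaller C means stronger regularization
clf.fit(X, y)

print(clf.predict_proba(X[:5]))  # class probabilities for the first five examples
print(clf.score(X, y))           # mean accuracy on the training set
```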




Conclusions



Logistic regression has several advantages over linear regression; in particular, it is more robust and does not assume a linear relationship, since it can handle nonlinear effects. However, it requires much more data to achieve stable, meaningful results. There are other machine learning techniques for handling non-linear problems, which we will see in the next posts. I hope you enjoyed this article!


All the source code from this article is here.

Regards,

Marcel Caraciolo


Google AI Challenge this year is open! Ants Battlefield!

Friday, November 4, 2011

Hi all,

Once again, Google and the University of Waterloo's computer science club have launched an Artificial Intelligence challenge. This year the task is to write a program to compete in the Ants Multiplayer Challenge. The goal is to seek and destroy the most enemy ant hills while defending your own hills. You must create a bot that plays the game of Ants as intelligently as possible. The contest supports Python, Java, C# and C++.

The contest is currently open for submissions until December 18th. After that, there will be a final tournament between the contestants to decide the ultimate winner!

See the game in action in the video below.





It is a great opportunity to learn Artificial Intelligence and exercise your skills in programming, machine learning and logical reasoning!

Regards,

Marcel Caraciolo