## Sunday, November 6, 2011

### Machine Learning with Python - Logistic Regression

Hi all,

I decided to start a new series of posts focusing on general machine learning, with several snippets that anyone can use on real problems or real datasets. Since I am studying machine learning again through a great online course offered this semester by Stanford University, one of the best ways to review the content is to write some notes about what I learned. The best part is that the notes will include examples in Python, NumPy and SciPy. I hope you enjoy these posts!


In this post I will cover logistic regression and regularization.

Logistic Regression

Logistic regression is a type of regression that predicts the probability of occurrence of an event by fitting data to a logit function (logistic function). Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For instance, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is widely used in scenarios such as predicting a customer's propensity to purchase a product or cancel a subscription in marketing applications, among many others.

Visualizing the Data

Let's explain logistic regression by example. Suppose you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on two exams and the admissions decision. We will use logistic regression to build a model that estimates the probability of admission based on the scores from those two exams.

Let's first visualize our data on a 2-dimensional plot, as shown below. As you can see, the axes are the two exam scores, and the positive and negative examples are shown with different markers.

 Sample training visualization

The code

The logistic regression hypothesis is defined as:

$$h_\theta(x) = g(\theta^T x)$$

where the function g is the sigmoid function. It is defined as:

$$g(z) = \frac{1}{1 + e^{-z}}$$

The sigmoid function maps any real number to a value in the range (0, 1). For large positive values of its input, the sigmoid is close to 1, while for large negative values it is close to 0.

 Sigmoid Logistic Function

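The cost function for logistic regression, where m is the number of training examples, is given below:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$

and the gradient of the cost is a vector of the same length as theta whose j-th element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$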

You may note that this gradient looks identical to the linear regression gradient. The two formulas actually differ because linear and logistic regression have different definitions of h(x).

Let's see the code:
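As a minimal sketch (not necessarily the exact listing linked from this post), the sigmoid, the cost and the gradient can be written with NumPy as follows. The name `compute_grad` and the tiny dataset are assumptions made up for illustration; the first column of `X` is a column of ones for the intercept term.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def compute_cost(theta, X, y):
    """Average cross-entropy cost J(theta) for logistic regression."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))

def compute_grad(theta, X, y):
    """Gradient of J(theta): (1/m) * X^T (h - y)."""
    m = y.size
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / m

# Made-up toy data: intercept column plus two features per example.
X = np.array([[1.0, 0.5, 1.5],
              [1.0, 2.0, 0.5],
              [1.0, 3.0, 3.5],
              [1.0, 0.2, 0.1]])
y = np.array([0.0, 0.0, 1.0, 0.0])
theta0 = np.zeros(3)

cost0 = compute_cost(theta0, X, y)   # equals log(2) when theta is all zeros
grad0 = compute_grad(theta0, X, y)
```

With theta set to zero every prediction is 0.5, so the cost is log(2) regardless of the data, which makes a handy sanity check.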

Now, to find the minimum of this cost function, we will use the SciPy function fmin_bfgs (in scipy.optimize). It will find the best parameters theta for the logistic regression cost function given a fixed dataset (of X and y values).
Its inputs are:
• The initial values of the parameters you are trying to optimize;
• A function that, when given the training set and a particular theta, computes the logistic regression cost for the dataset (X, y), and optionally a second function that computes the gradient with respect to theta.
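The call described above could look like the sketch below. The data is synthetic (randomly generated, not the admissions dataset), and the clipping inside `cost` is an addition of this sketch to keep `log` away from zero; it is not part of the formula itself.

```python
import numpy as np
from scipy.optimize import fmin_bfgs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Clip probabilities so log() never sees exactly 0 (sketch-only safeguard).
    h = np.clip(sigmoid(X @ theta), 1e-12, 1.0 - 1e-12)
    return np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))

def grad(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / y.size

# Synthetic data: labels are drawn at random from a known logistic model.
rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(m) < sigmoid(X @ true_theta)).astype(float)

theta0 = np.zeros(X.shape[1])  # initial parameter values
# f is the cost, fprime the gradient; args are forwarded to both functions.
theta_opt = fmin_bfgs(cost, theta0, fprime=grad, args=(X, y), disp=False)
```

Passing the gradient through `fprime` is optional; without it, fmin_bfgs approximates the gradient numerically, which is slower but often good enough for small problems.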

The final theta value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the figure below.

Evaluating logistic regression

Now that you have learned the parameters of the model, you can use it to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should see an admission probability of 0.776.

But you can go further and evaluate the quality of the parameters that we have found by checking how well the learned model predicts on our training set. If we use a threshold of 0.5 on the output of the sigmoid function, we predict y = 1 when h(x) >= 0.5 and y = 0 when h(x) < 0.5.

By computing the training accuracy of our classifier in the code, we can evaluate the percentage of examples it classified correctly.  Source code.
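The thresholding rule and the training accuracy can be sketched as below. The function names `predict` and `accuracy`, the parameter vector and the four examples are all hypothetical, chosen only to show the mechanics; they are not the values learned from the admissions data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    """Return 1 where the estimated probability reaches the threshold, else 0."""
    return (sigmoid(X @ theta) >= threshold).astype(int)

def accuracy(theta, X, y):
    """Percentage of examples whose predicted label matches y."""
    return 100.0 * np.mean(predict(theta, X) == y)

# Hypothetical parameters and data, for illustration only.
theta = np.array([-3.0, 1.0, 1.0])
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 2.0, 2.0],
              [1.0, 1.0, 1.0],
              [1.0, 3.0, 0.5]])
y = np.array([0, 1, 1, 1])

labels = predict(theta, X)         # array([0, 1, 0, 1])
train_acc = accuracy(theta, X, y)  # 75.0, since the third example is missed
```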

Regularized logistic regression

But what if your data cannot be separated into positive and negative examples by a straight line through the plot? Since plain logistic regression can only find a linear decision boundary, we will have to fit the data in a better way. Let's go through an example.

Suppose you are the product manager of a factory and you have the results of two different tests for some microchips. From these two tests you would like to determine whether the microchips should be accepted or rejected. We have a dataset of test results on past microchips, from which we can build a logistic regression model.

Visualizing the data

Let's visualize our data. As you can see in the figure below, the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.
 Microchip training set

A model built for this task may predict all of the training data perfectly, and that can sometimes be a problem. Just because the model can perfectly reconstruct the training set does not mean that it has everything figured out. This is known as overfitting. You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least some regularization in there. Regularization is a powerful strategy to combat the overfitting problem. We will see it in action in the next sections.

Feature mapping

One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power.

As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary, which will appear nonlinear when drawn in our 2D plot.
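One way to write this mapping is sketched below, assuming the terms are ordered by total degree (1, x1, x2, x1^2, x1*x2, x2^2, and so on). The function name `map_feature` and that ordering are choices made for this sketch; the count of 28 columns does not depend on the ordering.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Map two feature columns to all monomials x1^(i-j) * x2^j with i <= degree."""
    x1 = np.atleast_1d(np.asarray(x1, dtype=float))
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))
    columns = [np.ones_like(x1)]  # constant term (total degree 0)
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append(x1 ** (i - j) * x2 ** j)
    return np.column_stack(columns)

# Two made-up data points (x1, x2) = (0.5, 2.0) and (-1.0, 3.0).
mapped = map_feature([0.5, -1.0], [2.0, 3.0])
n_features = mapped.shape[1]  # 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28
```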

Although the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. This is where regularized logistic regression comes in: it lets us fit the data while avoiding the overfitting problem.

Source code.

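The regularized cost function in logistic regression, where m is the number of training examples and lambda is the regularization parameter, is:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Note that you should not regularize the parameter theta_0, so the final summation runs from j = 1 to n, not from j = 0 to n.  The gradient of the cost function is a vector whose j-th element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)} \quad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad \text{for } j \geq 1$$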
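A possible NumPy version of the regularized cost and gradient is sketched below. The names `cost_reg`, `grad_reg` and `lam` are assumptions of this sketch, and the small dataset is made up; the only point is that the penalty skips the first parameter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    """Cross-entropy cost plus an L2 penalty on theta[1:] (theta[0] is excluded)."""
    m = y.size
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
    penalty = lam / (2.0 * m) * np.sum(theta[1:] ** 2)
    return data_term + penalty

def grad_reg(theta, X, y, lam):
    """Gradient of cost_reg; the penalty contributes only to entries j >= 1."""
    m = y.size
    g = X.T @ (sigmoid(X @ theta) - y) / m
    g[1:] += lam / m * theta[1:]
    return g

# Made-up data: intercept column plus two features.
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8],
              [1.0, 1.1, 0.4],
              [1.0, -0.9, -0.6]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.array([0.2, -0.4, 0.7])

J = cost_reg(theta, X, y, lam=1.0)
g = grad_reg(theta, X, y, lam=1.0)
```

These two functions can be handed to the same optimizer as the unregularized ones by passing `args=(X, y, lam)`.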

Now let's learn the optimal parameters theta.  With these new functions and the same SciPy optimization function we used before (fmin_bfgs), we will be able to learn the parameters theta.

The complete code is provided here (code).

Plotting the decision boundary

Let's visualize the model learned by the classifier. The plot will display the non-linear decision boundary that separates the positive and negative examples.
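To draw such a boundary, one common approach is to evaluate theta^T * map_feature(u, v) on a grid and trace the contour where it equals zero. The sketch below only computes the grid. The helper names and the parameter vector are hypothetical: `theta` is hand-picked so that the boundary is the unit circle, and it is not the vector learned from the microchip data.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Polynomial terms of x1 and x2 up to `degree`, ordered by total degree."""
    x1 = np.atleast_1d(np.asarray(x1, dtype=float))
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))
    columns = [np.ones_like(x1)]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append(x1 ** (i - j) * x2 ** j)
    return np.column_stack(columns)

def boundary_grid(theta, u, v, degree=6):
    """Evaluate theta^T * map_feature on a grid; rows follow v, columns follow u."""
    uu, vv = np.meshgrid(u, v)  # both have shape (len(v), len(u))
    feats = map_feature(uu.ravel(), vv.ravel(), degree)
    return (feats @ theta).reshape(uu.shape)

# Hypothetical parameters: 1 - x1^2 - x2^2, so the boundary is the unit circle.
theta = np.zeros(28)
theta[0] = 1.0    # constant term
theta[3] = -1.0   # x1^2
theta[5] = -1.0   # x2^2

u = np.linspace(-1.5, 1.5, 31)
v = np.linspace(-1.5, 1.5, 41)
z = boundary_grid(theta, u, v)
# With matplotlib, plt.contour(u, v, z, levels=[0]) would trace the boundary.
```

Points where `z` is positive are predicted as y = 1, since the sigmoid of a positive number is above 0.5.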

 Decision Boundary

As you can see, our model classified the training data with an accuracy of 83.05%.

Scikit-learn

Scikit-learn is an amazing tool for machine learning, providing several modules for working with classification, regression and clustering problems. It is built on Python, NumPy and SciPy, and it is open source!

If you want to use logistic regression or linear regression, you should consider scikit-learn. It has several examples and several types of regularization strategies to work with.  Take a look at this link and see for yourself!  I recommend it!

Conclusions

Logistic regression has several advantages over linear regression. In particular, it is more robust and does not assume a linear relationship between the features and the outcome, so it can handle nonlinear effects. However, it requires much more data to achieve stable, meaningful results.  There are other machine learning techniques for handling nonlinear problems, and we will see them in the next posts.   I hope you enjoyed this article!

Regards,

Marcel Caraciolo

#### 19 comments:

commercial storage company said...

Amazing numerics.........

chandan said...

wow..i am also doing the stanford exercises in python..i was stuck with the optimization part..thank you very much for the article..

Anonymous said...

the source code posted is giving errors..

Warning: divide by zero encountered in log
Warning: overflow encountered in power
Warning: overflow encountered in power

please check out! i am trying to debug it

Anonymous said...

Hi Anonymous:
Did you try feature scaling (mean normalization)?
Here is some quick code for that (N.B. I'm using other data, so the axes may have to change):

    import numpy

    def normalize(X):
        # Mean normalization: center each column, then divide by its range.
        mu = numpy.mean(X, axis=0)
        Smin = numpy.amin(X, axis=0)
        Smax = numpy.amax(X, axis=0)
        x = (X - mu) / (Smax - Smin)
        return x

Tom said...

Hi Marcel

Great work posting this.

I am getting similar warnings as anonymous using the above code (primarily when I expand the number of thetas to be estimated). The code works just fine for about 10 parameters or so.

The following warning also appears:
Warning: overflow encountered in double_scalars

More concerning, I suppose, the maximum likelihood value returned from fmin_bfgs (fmin_l_bfgs_b in my case) is nan, and the following error occurs: ABNORMAL_TERMINATION_IN_LNSRCH.

Also, my features have all been scaled already.

Any thoughts on what could possibly be occurring?

Suain Logistic said...

fantastic presentation of Logistic Regression..

MB said...

Can you stop linking to this image http://www.mblondel.org/tlml/_images/math/9dd37d56c18555e80d91e8f57a1ceeb83fc72a5a.png? (the bandwidth on my server is not for free)

Thanks.

kiruthi ka said...

Hey, nice site you have here! Keep up the excellent work!


David Reed said...

This is a great tutorial, but I am confused with the first example. Why is the theta vector of length 3? Shouldn't it be of length 2? The theta vector you are trying to optimize is the slope and y-intercept, correct?

Thanks for the help.

Rahul Kavi said...

How would I use fmin_bfgs if I'm training a neural network? The cost function there has more than one theta. How would I provide a list of thetas to the "decorated cost" function? I tried doing it, but I get errors in scipy optimize (the thetas don't change; the program crashes after a couple of iterations).

jeffy said...

Hi. Nice post. I am wondering if it is possible to tweak LogisticRegression in scikit-learn a little bit to get a "Regressor" rather than a "Classifier" like LogisticRegression? I went through all the code. It seems that one of the main base classes, BaseLibLinear, can only train different sets of coefficients for different y. I would really appreciate an answer. Thanks.

johndburger said...

Like several of the commenters, I get:

Warning: divide by zero encountered in log

because the elements of theta very quickly get big enough that sigmoid returns 1.0.

Has anyone gotten the basic logistic regression code to actually work (without the regularization)?

johndburger said...

I figured it out. Line 26 of compute_cost():

return - 1 * J.sum()

This negates the entire cost function, which makes it difficult for LBFGS to minimize it. (:

This explains why the thetas go through the roof.
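A side note on the "divide by zero encountered in log" warning discussed in this thread: even with the sign fixed, the sigmoid can round to exactly 0 or 1 for very large inputs. One possible workaround, offered here only as a sketch and not as the code from the post, is to compute the cost from z directly with `numpy.logaddexp`. The function name `stable_cost` and the data are made up for illustration.

```python
import numpy as np

def stable_cost(theta, X, y):
    """Logistic regression cost computed without ever evaluating log(0)."""
    z = X @ theta
    # -log(sigmoid(z))     = log(1 + exp(-z)) = logaddexp(0, -z)
    # -log(1 - sigmoid(z)) = log(1 + exp(z))  = logaddexp(0, z)
    return np.mean(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))

# Made-up data with a deliberately huge theta, so z is +1000 and -1000.
X = np.array([[1.0, 50.0],
              [1.0, -50.0]])
y = np.array([1.0, 1.0])
big_theta = np.array([0.0, 20.0])

value = stable_cost(big_theta, X, y)  # finite, even though sigmoid(-1000) rounds to 0
```

The result is a positive, finite number that the optimizer can minimize, instead of inf or nan.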

Matt Miller said...

I seem to be having an issue with the code. Downloaded from GitHub and run it. I would assume that in log_reg.py that the output from decorated_cost() function would be the theta values defining our boundary. In fact, the code hard codes those theta values rather than using the model output. If you use what is returned by decorated_cost(), it is not accurate. How did you generate the hard coded values? Am I missing something?

shakil hossain said...

This is an informative post review. I am so pleased to get this post article and nice information. I was looking forward to get such a post which is very helpful to us. A big thank for posting this article in this website. Keep it up.

Farhana awan said...

Thanks for sharing such kind of nice and wonderful collection......Nice post Dude keep it up.

I have appreciate with getting lot of good and reliable and legislative information with your post......

Fariha Chowdhury said...

I like totally and agree. And I think that in order to be comfortable with your style is to wear it more often. So wear your style to the lab on days that you don't have to do anything bloody, muddy or otherwise gross!