Hi all,

I decided to start a new series of posts focusing on general machine learning, with several snippets that anyone can use on real problems or real datasets. Since I am studying machine learning again through a great online course offered this semester by Stanford University, one of the best ways to review the content is to write some notes about what I learned. The best part is that it will include examples with Python, NumPy and SciPy. I hope you enjoy all those posts!

In this post I will cover logistic regression and regularization.

**Logistic Regression**

Logistic regression is a type of regression that predicts the probability of occurrence of an event by fitting data to a logistic function. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For instance, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. This regression is widely used in scenarios such as predicting a customer's propensity to purchase a product or cancel a subscription in marketing applications, among many others.

**Visualizing the Data**

Let's explain logistic regression by example. Suppose you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on two exams and the admissions decision. We will use logistic regression to build a model that estimates the probability of admission based on the scores from those two exams.

Let's first visualize our data on a 2-dimensional plot, as shown below. As you can see, the axes are the two exam scores, and the positive and negative examples are shown with different markers.

Figure: sample training data visualization.

The code

**Cost Function and Gradient**

The logistic regression hypothesis is defined as:

$$h_\theta(x) = g(\theta^T x)$$

where the function g is the sigmoid function. It is defined as:

$$g(z) = \frac{1}{1 + e^{-z}}$$

The sigmoid function maps any real input to a value in the range (0, 1). For large positive values of z, the sigmoid is close to 1, while for large negative values it is close to 0.

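The cost function for logistic regression is given below:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

and the gradient of the cost is a vector of the same length as theta, where the j-th element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

You may note that this gradient looks identical to the linear regression gradient; the difference is that linear and logistic regression have different definitions of h(x).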

Let's see the code:
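What follows is a minimal sketch of the cost and gradient with NumPy, not the original listing. It assumes the design matrix `X` already carries a leading column of ones for the intercept and that the labels `y` are coded as 0/1. The function names and the tiny dataset are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    """Average cross-entropy cost J(theta) for labels y in {0, 1}."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))

def compute_grad(theta, X, y):
    """Gradient of J: one partial derivative per entry of theta."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / y.size

# Tiny made-up dataset: an intercept column of ones plus one feature.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta0 = np.zeros(2)
cost_at_zero = compute_cost(theta0, X, y)   # equals log(2) when theta is all zeros
grad_at_zero = compute_grad(theta0, X, y)
```

With theta at zero every prediction is 0.5, so the cost equals log(2) regardless of the data, which makes a handy sanity check before running any optimizer.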

Now, to find the minimum of this cost function, we will use a SciPy function called fmin_bfgs (from scipy.optimize). It will find the best parameters theta for the logistic regression cost function given a fixed dataset (of X and y values).

The parameters are:

- The initial values of the parameters you are trying to optimize;
- A function that, when given the training set and a particular theta, computes the logistic regression cost and gradient with respect to theta for the dataset (X,y).
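The call can be sketched as follows, under some assumptions. The data is synthetic and merely stands in for the exam scores. The cost and gradient are passed as two separate functions through `fprime`, and the clipping inside `cost` is a small safety guard added here to keep `log` away from 0. Note that theta has three entries: one intercept plus one weight per exam.

```python
import numpy as np
from scipy.optimize import fmin_bfgs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Clipping is an assumption of this sketch: it avoids log(0) warnings.
    h = np.clip(sigmoid(X @ theta), 1e-12, 1.0 - 1e-12)
    return np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))

def grad(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / y.size

# Synthetic, noisy two-feature data standing in for the two exam scores.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
y = (features[:, 0] + features[:, 1] + rng.normal(size=200) > 0).astype(float)
X = np.column_stack([np.ones(200), features])   # prepend the intercept column

initial_theta = np.zeros(X.shape[1])            # first argument: starting point
theta_opt = fmin_bfgs(cost, initial_theta, fprime=grad, args=(X, y), disp=False)

train_accuracy = np.mean((sigmoid(X @ theta_opt) >= 0.5) == (y == 1))
```

The features here are already on a small scale. With raw exam scores in the tens, scaling the features first tends to keep `exp` and `log` well behaved during the line search.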

The final theta value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the figure below.
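For two features the boundary is the line where theta0 + theta1*x1 + theta2*x2 = 0, so two points are enough to draw it. The sketch below solves that equation for x2; the theta values are hypothetical, chosen only to give round numbers, and `boundary_x2` is an illustrative name.

```python
import numpy as np

def boundary_x2(theta, x1):
    """Second feature on the line theta0 + theta1*x1 + theta2*x2 = 0."""
    return -(theta[0] + theta[1] * x1) / theta[2]

theta = np.array([-6.0, 0.05, 0.05])   # hypothetical parameters, not fitted values
x1_ends = np.array([30.0, 100.0])      # two Exam 1 scores spanning the plot
x2_ends = boundary_x2(theta, x1_ends)  # matching Exam 2 scores on the boundary
```

Passing `x1_ends` and `x2_ends` to matplotlib's `plt.plot` on top of the scatter plot draws the line.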

**Evaluating logistic regression**

Now that you have learned the parameters of the model, you can use it to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should see an admission probability of 0.776.

But you can go further and evaluate the quality of the parameters we have found by checking how well the learned model predicts on our training set. Using a threshold of 0.5 on the sigmoid output, the prediction rule is:

$$\text{prediction} = \begin{cases} 1 & \text{if } h_\theta(x) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$

where 1 represents admitted and 0 not admitted.

Going to the code, we can calculate the training accuracy of our classifier, that is, the percentage of examples it got correct. Source code.
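A vectorized sketch of the prediction and accuracy step could look like this. It assumes labels coded as 0/1 and a design matrix with a leading column of ones; the parameter values and the five examples are made up so the result can be checked by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    return (sigmoid(X @ theta) >= threshold).astype(int)

def accuracy(theta, X, y):
    """Percentage of examples whose predicted label matches y."""
    return np.mean(predict(theta, X) == y) * 100.0

theta = np.array([-4.0, 1.0])   # hypothetical parameters
X = np.array([[1.0, 1.0], [1.0, 3.0], [1.0, 5.0], [1.0, 6.0], [1.0, 4.5]])
y = np.array([0, 0, 1, 1, 0])

labels = predict(theta, X)      # the last example is misclassified
score = accuracy(theta, X, y)   # 4 of 5 correct
```

Lowering `threshold` trades false negatives for false positives, so a different threshold can raise or lower accuracy on a particular dataset without the model itself changing.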

89%, not bad, huh?!

**Regularized logistic regression**

But what if your data cannot be separated into positive and negative examples by a straight line through the plot? Since plain logistic regression can only find a linear decision boundary, we will have to fit the data in a better way. Let's go through an example.

Suppose you are the product manager of a factory and you have the results of two different tests for some microchips. From these two tests you would like to determine whether the microchips should be accepted or rejected. We have a dataset of test results on past microchips, from which we can build a logistic regression model.

**Visualizing the data**

Let's visualize our data. As you can see in the figure below, the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.

You may see that a model built for this task can predict all the training data perfectly, and that can sometimes cause trouble. Just because the model can perfectly reconstruct the training set does not mean that it has everything figured out. This is known as overfitting. You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least some regularization in there. Regularization is a powerful strategy to combat the overfitting problem. We will see it in action in the next sections.

**Feature mapping**

One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power.

$$\text{mapFeature}(x) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 & x_1^3 & \cdots & x_1 x_2^5 & x_2^6 \end{bmatrix}^T$$

As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary, which will appear nonlinear when drawn in our 2D plot.

Although the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. This is where regularized logistic regression comes in, to fit the data while avoiding the overfitting problem.
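The mapping can be sketched in a few lines of NumPy. The name `map_feature` and the column ordering (grouped by total degree) are choices made for this illustration and may differ from the original source.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Return every monomial x1^a * x2^b with a + b <= degree, one per column."""
    x1 = np.atleast_1d(np.asarray(x1, dtype=float))
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))
    columns = []
    for total in range(degree + 1):       # total degree of the monomial
        for j in range(total + 1):        # power given to x2
            columns.append(x1 ** (total - j) * x2 ** j)
    return np.column_stack(columns)

mapped = map_feature([0.5, -1.0], [2.0, 3.0])
n_examples, n_features = mapped.shape     # two rows, 28 columns for degree 6
```

The first column is the constant 1 (total degree zero), so the mapped matrix already contains the intercept term and no extra column of ones needs to be added.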

Source code.

Source code.

**Cost function and gradient**

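The regularized cost function in logistic regression is:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

Note that you should not regularize the parameter theta_0, so the final summation is for j = 1 to n, not j = 0 to n. The gradient of the cost function is a vector where the j-th element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)} \quad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \quad \text{for } j \geq 1$$

Now let's learn the optimal parameters theta. Using these new functions and the same SciPy optimization function as before, we will be able to learn the parameters theta.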
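Here is a sketch of the regularized cost, its gradient and the fit, under stated assumptions. `X` is taken to hold the mapped features with the intercept column first. The data is synthetic, `lam` stands for the regularization strength lambda, and the clipping inside the cost is a guard added here against `log(0)`. SciPy's `check_grad` is used to compare the analytic gradient against finite differences.

```python
import numpy as np
from scipy.optimize import check_grad, fmin_bfgs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_reg(theta, X, y, lam):
    h = np.clip(sigmoid(X @ theta), 1e-12, 1.0 - 1e-12)
    data_term = np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
    penalty = lam / (2.0 * y.size) * np.sum(theta[1:] ** 2)   # skips theta[0]
    return data_term + penalty

def grad_reg(theta, X, y, lam):
    g = X.T @ (sigmoid(X @ theta) - y) / y.size
    g[1:] += lam / y.size * theta[1:]                         # intercept not penalized
    return g

# Synthetic noisy data: intercept column plus three features.
rng = np.random.default_rng(1)
features = rng.normal(size=(80, 3))
y = (features[:, 0] - features[:, 1] + rng.normal(size=80) > 0).astype(float)
X = np.column_stack([np.ones(80), features])

# Distance between the analytic gradient and a finite-difference estimate.
probe = 0.1 * rng.normal(size=X.shape[1])
grad_error = check_grad(cost_reg, grad_reg, probe, X, y, 1.0)

start = np.zeros(X.shape[1])
theta_weak = fmin_bfgs(cost_reg, start, fprime=grad_reg, args=(X, y, 0.1), disp=False)
theta_strong = fmin_bfgs(cost_reg, start, fprime=grad_reg, args=(X, y, 10.0), disp=False)
```

Comparing `theta_weak` with `theta_strong` shows the effect of the penalty: a larger lambda shrinks every weight except the intercept toward zero.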

All the code is now provided (code).

**Plotting the decision boundary**

Let's visualize the model learned by the classifier. The plot will display the non-linear decision boundary that separates the positive and negative examples.
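Because the boundary is no longer a straight line, one common approach is to evaluate theta^T mapFeature(u, v) on a grid and draw the contour where it equals zero. The sketch below assumes a `map_feature` helper with columns ordered by total degree. To keep it checkable, it uses a hypothetical theta that encodes the circle x1^2 + x2^2 = 1 rather than fitted parameters.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """All monomials x1^a * x2^b with a + b <= degree, one column each."""
    x1 = np.ravel(np.asarray(x1, dtype=float))
    x2 = np.ravel(np.asarray(x2, dtype=float))
    cols = [x1 ** (total - j) * x2 ** j
            for total in range(degree + 1) for j in range(total + 1)]
    return np.column_stack(cols)

# Hypothetical parameters: -1 + x1^2 + x2^2, i.e. the unit circle as boundary.
theta = np.zeros(28)
theta[0], theta[3], theta[5] = -1.0, 1.0, 1.0

u = np.linspace(-1.5, 1.5, 50)
v = np.linspace(-1.5, 1.5, 50)
uu, vv = np.meshgrid(u, v)

# z > 0 is the predicted-positive region; z == 0 is the decision boundary.
z = (map_feature(uu, vv) @ theta).reshape(uu.shape)
```

With matplotlib, `plt.contour(u, v, z, levels=[0])` then draws the boundary over the scatter plot of the data.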

**Scikit-learn**

Scikit-learn is an amazing tool for machine learning, providing several modules for working with classification, regression and clustering problems. It uses Python, NumPy and SciPy, and it is open source!

If you want to use logistic regression and linear regression, you should consider scikit-learn. It has several examples and several types of regularization strategies to work with. Take a look at this link and see for yourself! I recommend it!
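As a rough sketch of the same idea with scikit-learn (assuming a reasonably recent version is installed), the degree-6 feature mapping and an L2-regularized classifier can be chained in a pipeline. The dataset comes from `make_circles` and only stands in for the microchip data. In scikit-learn the parameter `C` is the inverse of the regularization strength, so a smaller `C` means stronger regularization.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two noisy concentric rings: not separable by a straight line.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=0)

model = make_pipeline(
    PolynomialFeatures(degree=6),              # polynomial terms up to degree 6
    LogisticRegression(C=1.0, max_iter=1000),  # L2 penalty is the default
)
model.fit(X, y)

train_accuracy = model.score(X, y)
n_features = model.named_steps["polynomialfeatures"].n_output_features_
```

`PolynomialFeatures` includes the constant column by default, so two inputs at degree 6 expand to the same 28 features described earlier.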

**Conclusions**

Logistic regression has several advantages over linear regression. One in particular is that it is more robust: it does not assume a linear relationship between the predictors and the outcome, since it can handle nonlinear effects. However, it requires much more data to achieve stable, meaningful results. There are other machine learning techniques for handling nonlinear problems, and we will see them in the next posts. I hope you enjoyed this article!

Regards,

Marcel Caraciolo

Amazing numerics.........

wow..i am also doing the stanford exercises in python..i was stuck with the optimization part..thank you very much for the article..

the source code posted is giving errors..

Warning: divide by zero encountered in log

Warning: overflow encountered in power

Warning: overflow encountered in power

please check out! i am trying to debug it

Hi Anonymous:

Did you try feature scaling (mean normalization)?

Here is some quick code for that (N.B. I'm using other data, so the axes may have to change):

```python
import numpy

def normalize(X):
    # Mean normalization: center each column, then divide by its range.
    mu = numpy.mean(X, axis=0)
    Smin = numpy.amin(X, axis=0)
    Smax = numpy.amax(X, axis=0)
    x = (X - mu) / (Smax - Smin)
    return x
```

Hi Marcel

Great work posting this.

I am getting similar warnings as anonymous using the above code (primarily when I expand the number of thetas to be estimated). The code works just fine for about 10 parameters or so.

The following warning also appears:

Warning: overflow encountered in double_scalars

More concerning I suppose, the value returned maximum likelihood from fmin_bfgs (fmin_l_bfgs_b in my case) is nan and the following error occurs ABNORMAL_TERMINATION_IN_LNSRCH.

Also, my features have already all be scaled as well.

Any thoughts on what could possibly be occurring?

fantastic presentation of Logistic Regression..

Can you stop linking to this image http://www.mblondel.org/tlml/_images/math/9dd37d56c18555e80d91e8f57a1ceeb83fc72a5a.png? (the bandwidth on my server is not for free)

Thanks.

This is a great tutorial, but I am confused with the first example. Why is the theta vector of length 3? Shouldn't it be of length 2? The theta vector you are trying to optimize is the slope and y-intercept, correct?

Thanks for the help.

How would I use fmin bfgs if I'm training for a Neural Network? The cost function over there has more than 1 theta. How would I provide a list of thetas to "decorated cost" function. I tried doing it but, I get errors in scipy optimize (the thetas don't change; program crashes after a couple of iterations.

Hi. Nice post. I am wondering if it is possible to tweak a little bit of LogisticRegression in scikit-learn to get a "Regressor" rather that a "Classifier" like LogisticRegression? I went through all the codes. It seems that one of the main base class BaseLibLinear can only train different set of coefficients for different y. I really appreciate if you happy to get an answer. thanks.

Like several of the commenters, I get:

Warning: divide by zero encountered in log

because the elements of theta very quickly get big enough that sigmoid returns 1.0.

Has anyone gotten the basic logistic regression code to actually work (without the regularization)?

I figured it out. Line 26 of compute_cost():

return - 1 * J.sum()

This negates the entire cost function, which makes it difficult for LBFGS to minimize it. (:

This explains why the thetas go through the roof.

I seem to be having an issue with the code. Downloaded from GitHub and run it. I would assume that in log_reg.py that the output from decorated_cost() function would be the theta values defining our boundary. In fact, the code hard codes those theta values rather than using the model output. If you use what is returned by decorated_cost(), it is not accurate. How did you generate the hard coded values? Am I missing something?

Hi all I solved the issue related to logistic regression, for a simple misunderstood I replaced the cost_function with wrong J , since the f_min receives only a single value and also the negative value which was wrong from the problem (minimization).

Hello Marcel, I can not make either one work.

the log_reg.py shows the "RuntimeWarning: overflow encountered in exp"

for the log_reg_regular.py, I changed the maxfun to maxiter

but it still shows thetaR = theta[1:, 0]

IndexError: too many indices"

anybody got it work? Can I have the code please?

Found your article and is very intersting after some effort to understand logistic regression. I notice that if the h[it] in predict function is changed from 0.5 to 0.2 or 0.3, the test accuracy result is sky rocketing to 0.92! Can you explain why ? How can we understand if that is a correct result or not ?

Thanks for any feedback.

In first example, what did you use to draw decision boundary?

In theory example 1 should yield better accuracy if we added more features the same way it's done in example 2. After adding additional features for some reason minimizing function doesn't want to converge and stays at 60% any ideas why?

Thank You For The Information