
Machine Learning with Python - Logistic Regression

Sunday, November 6, 2011

 Hi all,

I decided to start a new series of posts focusing on general machine learning, with several snippets that anyone can use with real problems or real datasets. Since I am studying machine learning again through a great online course offered this semester by Stanford University, one of the best ways to review the content is to write some notes about what I learned. The best part is that it will include examples with Python, NumPy and SciPy. I hope you enjoy these posts!

In this post I will cover Logistic Regression and Regularization.


Logistic Regression


Logistic Regression is a type of regression that predicts the probability of occurrence of an event by fitting data to a logit function (logistic function). Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For instance, the probability that a person will have a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is widely used in scenarios such as predicting a customer's propensity to purchase a product or cancel a subscription in marketing applications, among many others.


Visualizing the Data



Let's explain logistic regression by example. Suppose you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on the two exams and the admissions decision. We will use logistic regression to build a model that estimates the probability of admission based on the scores from those two exams.

Let's first visualize our data on a 2-dimensional plot as shown below. As you can see, the axes are the two exam scores, and the positive and negative examples are shown with different markers.

Sample training visualization

The code
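Here is a minimal plotting sketch (not the original source embedded in the post). It assumes the training data sits in a comma-separated file, here called ex2data1.txt, with one row per applicant: exam 1 score, exam 2 score, admission decision (1 or 0).

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('ex2data1.txt', delimiter=',')
scores = data[:, :2]          # exam 1 and exam 2 scores
admitted = data[:, 2] == 1    # boolean mask of admitted applicants

plt.plot(scores[admitted, 0], scores[admitted, 1], 'k+', label='Admitted')
plt.plot(scores[~admitted, 0], scores[~admitted, 1], 'yo', label='Not admitted')
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()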


Cost Function and Gradient



The logistic regression hypothesis is defined as:

$$h_\theta(x) = g(\theta^T x)$$

where the function g is the sigmoid (logistic) function, defined as:

$$g(z) = \frac{1}{1 + e^{-z}}$$
The sigmoid function maps any real value into the range [0, 1]: for large positive inputs the sigmoid is close to 1, while for large negative inputs it is close to 0.

Sigmoid Logistic Function 

The cost function for logistic regression is given below:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

and the gradient of the cost is a vector of the same length as $\theta$, where the j-th element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

You may note that this gradient is quite similar to the linear regression gradient; the difference comes from the fact that linear and logistic regression have different definitions of h(x).

Let's see the code:
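Below is a minimal sketch of these functions (not the original source embedded in the post). The names sigmoid, compute_cost and compute_grad are assumptions; X is the m x (n+1) design matrix with a leading column of ones, and y is the vector of 0/1 labels.

import numpy as np

def sigmoid(z):
    # logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    # average cross-entropy cost J(theta)
    m = y.size
    h = sigmoid(X.dot(theta))
    return (-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h))) / m

def compute_grad(theta, X, y):
    # gradient of J(theta) with respect to theta
    m = y.size
    h = sigmoid(X.dot(theta))
    return X.T.dot(h - y) / m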






Now, to find the minimum of this cost function, we will use a SciPy built-in function called fmin_bfgs. It will find the best parameters theta for the logistic regression cost function, given a fixed dataset (of X and y values). The parameters we pass are, as sketched below:
  • the initial values of the parameters we are trying to optimize;
  • a function that, when given the training set and a particular theta, computes the logistic regression cost and gradient with respect to theta for the dataset (X, y).
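A minimal sketch of this call, reusing the function names from the snippet above (X and y are again the design matrix and labels):

import numpy as np
from scipy.optimize import fmin_bfgs

initial_theta = np.zeros(X.shape[1])   # one parameter per column of X
theta = fmin_bfgs(compute_cost, initial_theta,
                  fprime=compute_grad, args=(X, y), maxiter=400)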

The final theta values will then be used to plot the decision boundary on the training data, resulting in a figure similar to the one below.






Evaluating logistic regression



Now that you have learned the parameters of the model, you can use the model to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should see an admission probability of 0.776.

But we can go further and evaluate the quality of the parameters we have found, and see how well the learned model predicts on our training set. If we use a threshold of 0.5 on the output of the sigmoid function, we can consider that:

$$\text{predict } y = 1 \text{ if } h_\theta(x) \geq 0.5, \qquad \text{predict } y = 0 \text{ otherwise}$$

where 1 represents admitted and 0 not admitted.

Turning to the code, we can calculate the training accuracy of our classifier, i.e., the percentage of training examples it classified correctly.  Source code.
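A sketch of the prediction step (predict is an assumed helper name; sigmoid, theta, X and y come from the snippets above):

import numpy as np

def predict(theta, X):
    # predict 1 when h_theta(x) >= 0.5, otherwise 0
    return (sigmoid(X.dot(theta)) >= 0.5).astype(int)

# probability of admission for exam scores 45 and 85 (with intercept term)
prob = sigmoid(np.array([1.0, 45.0, 85.0]).dot(theta))

# training accuracy: percentage of examples classified correctly
accuracy = 100.0 * np.mean(predict(theta, X) == y)
print('Train accuracy: %.2f%%' % accuracy)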



89%, not bad, huh?!



Regularized logistic regression



But what if your data cannot be separated into positive and negative examples by a straight line through the plot? Since our logistic regression will only be able to find a linear decision boundary, we will have to fit the data in a better way. Let's go through an example.

Suppose you are the product manager of a factory and you have the results of two different tests for some microchips. From these two tests you would like to determine whether the microchips should be accepted or rejected. We have a dataset of test results on past microchips, from which we can build a logistic regression model.





Visualizing the data



Let's visualize our data. As you can see in the figure below, the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.
Microchip training set

You may see that a model built for this task could predict all of the training data perfectly, and that can lead to trouble. Just because the model can perfectly reconstruct the training set does not mean that it has everything figured out. This is known as overfitting. You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least some regularization in there. Regularization is a powerful strategy to combat the overfitting problem. We will see it in action in the next sections.





Feature mapping



One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power:

$$\text{mapFeature}(x) = \left[\, 1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2,\; x_1^3,\; \ldots,\; x_1 x_2^5,\; x_2^6 \,\right]^T$$
As a result of this mapping, our vector of two features (the scores on the two QA tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2D plot.


Although the feature mapping allows us to build a more expressive classifier, it also makes it more susceptible to overfitting. This is where regularized logistic regression comes in: it lets us fit the data while avoiding the overfitting problem.


Source code.
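A sketch of such a mapping function (map_feature is an assumed name; the degree-6 expansion with an intercept column yields the 28 features mentioned above):

import numpy as np

def map_feature(x1, x2, degree=6):
    # map two feature columns to all polynomial terms up to 'degree',
    # including a leading column of ones for the intercept
    x1 = np.atleast_1d(np.asarray(x1, dtype=float))
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))
    columns = [np.ones(x1.shape[0])]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append((x1 ** (i - j)) * (x2 ** j))
    return np.column_stack(columns)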


Cost function and gradient


The regularized cost function in logistic regression is:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

Note that you should not regularize the parameter $\theta_0$, so the final summation runs from j = 1 to n, not from j = 0 to n. The gradient of the cost function is a vector where the j-th element is defined as follows:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad \text{for } j \geq 1$$

Now let's learn the optimal parameters theta. Using these new functions together with the same SciPy optimization routine as before, we will be able to learn the parameters theta.

The full code is now provided (code).
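A sketch of the regularized cost, gradient and optimization call (names follow the earlier snippets, not the original post's code; lambda_ is the regularization strength, theta[0] is left unregularized, and raw_scores/y are assumed to hold the two microchip test results and the 0/1 labels):

import numpy as np
from scipy.optimize import fmin_bfgs

def compute_cost_reg(theta, X, y, lambda_):
    # regularized cross-entropy cost; theta[0] is not penalized
    m = y.size
    h = sigmoid(X.dot(theta))
    J = (-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h))) / m
    return J + (lambda_ / (2.0 * m)) * np.sum(theta[1:] ** 2)

def compute_grad_reg(theta, X, y, lambda_):
    # regularized gradient; the penalty term is skipped for j = 0
    m = y.size
    h = sigmoid(X.dot(theta))
    grad = X.T.dot(h - y) / m
    grad[1:] += (lambda_ / m) * theta[1:]
    return grad

X_mapped = map_feature(raw_scores[:, 0], raw_scores[:, 1])
initial_theta = np.zeros(X_mapped.shape[1])
theta = fmin_bfgs(compute_cost_reg, initial_theta, fprime=compute_grad_reg,
                  args=(X_mapped, y, 1.0), maxiter=400)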







Plotting the decision boundary



Let's visualize the model learned by the classifier. The plot will display the non-linear decision boundary that separates the positive and negative examples. 

Decision Boundary


As you can see, our model successfully predicted our data with an accuracy of 83.05%.

Code
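A sketch of one way to draw that boundary: evaluate theta . map_feature(u, v) on a grid and plot the zero-level contour (names follow the earlier snippets; the grid range is an assumption):

import numpy as np
import matplotlib.pyplot as plt

u = np.linspace(-1.0, 1.5, 50)
v = np.linspace(-1.0, 1.5, 50)
z = np.zeros((u.size, v.size))
for i, ui in enumerate(u):
    for j, vj in enumerate(v):
        # boundary is where theta' * map_feature(u, v) = 0, i.e. h = 0.5
        z[i, j] = map_feature(ui, vj).dot(theta)[0]

plt.contour(u, v, z.T, levels=[0.0], colors='g')
plt.xlabel('Microchip test 1')
plt.ylabel('Microchip test 2')
plt.show()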





Scikit-learn



Scikit-learn is an amazing tool for machine learning, providing several modules for working with classification, regression and clustering problems. It is built on Python, NumPy and SciPy, and it is open source!

If you want to use logistic regression or linear regression, you should consider scikit-learn. It has several examples and several types of regularization strategies to work with. Take a look at this link and see for yourself! I recommend it!
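A quick sketch of what that looks like (the parameter values shown are just illustrative defaults, not recommendations from this post; X and y are your feature matrix and 0/1 labels):

from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularization strength (smaller C = stronger regularization)
clf = LogisticRegression(C=1.0, penalty='l2')
clf.fit(X, y)
probabilities = clf.predict_proba(X)[:, 1]   # estimated probability of class 1
predictions = clf.predict(X)                 # hard 0/1 predictions
print('Training accuracy:', clf.score(X, y))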




Conclusions



Logistic regression has several advantages over linear regression; in particular, it is more robust and does not assume a strictly linear relationship, since it can handle nonlinear effects through feature mapping. However, it requires much more data to achieve stable, meaningful results. There are other machine learning techniques for handling non-linear problems, and we will see them in the next posts. I hope you enjoyed this article!


All the source code from this article is available here.

Regards,

Marcel Caraciolo


31 comments:

  1. wow..i am also doing the stanford exercises in python..i was stuck with the optimization part..thank you very much for the article..

  2. the source code posted is giving errors..

    Warning: divide by zero encountered in log
    Warning: overflow encountered in power
    Warning: overflow encountered in power

    please check out! i am trying to debug it

  3. Hi Anonymous:
    Did you try feature scaling (mean normalization)?
    Here is some quick code for that (N.B. I'm using other data, so the axes may have to change):

    def normalize(X):
        mu = numpy.mean(X, axis=0)
        Smin = numpy.amin(X, axis=0)
        Smax = numpy.amax(X, axis=0)
        x = (X - mu) / (Smax - Smin)
        return x

  4. Hi Marcel

    Great work posting this.

    I am getting similar warnings as anonymous using the above code (primarily when I expand the number of thetas to be estimated). The code works just fine for about 10 parameters or so.

    The following warning also appears:
    Warning: overflow encountered in double_scalars

    More concerning I suppose, the value returned maximum likelihood from fmin_bfgs (fmin_l_bfgs_b in my case) is nan and the following error occurs ABNORMAL_TERMINATION_IN_LNSRCH.

    Also, my features have all already been scaled as well.

    Any thoughts on what could possibly be occurring?

  5. fantastic presentation of Logistic Regression..

  6. Can you stop linking to this image http://www.mblondel.org/tlml/_images/math/9dd37d56c18555e80d91e8f57a1ceeb83fc72a5a.png? (the bandwidth on my server is not for free)

    Thanks.

  8. This is a great tutorial, but I am confused with the first example. Why is the theta vector of length 3? Shouldn't it be of length 2? The theta vector you are trying to optimize is the slope and y-intercept, correct?

    Thanks for the help.

  9. How would I use fmin_bfgs if I'm training a Neural Network? The cost function there has more than one theta. How would I provide a list of thetas to the "decorated cost" function? I tried doing it but I get errors in scipy optimize (the thetas don't change; the program crashes after a couple of iterations).

  10. Hi. Nice post. I am wondering if it is possible to tweak LogisticRegression in scikit-learn a little bit to get a "Regressor" rather than a "Classifier". I went through all the code. It seems that one of the main base classes, BaseLibLinear, can only train different sets of coefficients for different y. I would really appreciate an answer. Thanks.

  12. Like several of the commenters, I get:

    Warning: divide by zero encountered in log

    because the elements of theta very quickly get big enough that sigmoid returns 1.0.

    Has anyone gotten the basic logistic regression code to actually work (without the regularization)?

  13. I figured it out. Line 26 of compute_cost():

    return - 1 * J.sum()

    This negates the entire cost function, which makes it difficult for LBFGS to minimize it. (:

    This explains why the thetas go through the roof.

  14. I seem to be having an issue with the code. Downloaded from GitHub and run it. I would assume that in log_reg.py that the output from decorated_cost() function would be the theta values defining our boundary. In fact, the code hard codes those theta values rather than using the model output. If you use what is returned by decorated_cost(), it is not accurate. How did you generate the hard coded values? Am I missing something?

  19. Hi all, I solved the issue related to logistic regression: due to a simple misunderstanding I had replaced the cost function with the wrong J, since fmin receives only a single value, and I also had the negative value, which was wrong for the (minimization) problem.

  20. Hello Marcel, I cannot make either one work.
    log_reg.py shows "RuntimeWarning: overflow encountered in exp".
    For log_reg_regular.py, I changed maxfun to maxiter, but it still shows:
    thetaR = theta[1:, 0]
    IndexError: too many indices

    anybody got it work? Can I have the code please?

  22. Found your article, and it is very interesting after some effort to understand logistic regression. I notice that if the h[it] in the predict function is changed from 0.5 to 0.2 or 0.3, the test accuracy result skyrockets to 0.92! Can you explain why? How can we understand whether that is a correct result or not?
    Thanks for any feedback.

  23. In first example, what did you use to draw decision boundary?

  24. In theory example 1 should yield better accuracy if we added more features the same way it's done in example 2. After adding additional features for some reason minimizing function doesn't want to converge and stays at 60% any ideas why?

  29. Hi, While running
    scipy.optimize.fmin_bfgs(costFunction(theta, X, y), initial_theta,maxiter = 10),
    it throws error - 'tuple' object is not callable.

    My cost function is:

    def costFunction(theta, X, y):
        J = 0
        grad = zeros((size(theta)))

        z = sigmoid(np.dot(X, theta))
        cost = -(y * log(z)) - (1 - y) * log(1 - z)

        J = sum(cost) / m

        grad = np.dot(X.T, (z - y)) / m

        return grad, J

    Can anyone help me with this?

  30. hi,
    I am trying to do event recommendation based on the tags of events, and I want to use logistic regression as the algorithm of the system. But logistic regression uses vectors (x, y), and I could not transform tags into vectors. Can anyone help me?

    thnx
