I decided to start a new series of posts focusing on general machine learning, with several snippets that anyone can use on real problems and real datasets. Since I am studying machine learning again through a great online course offered this semester by Stanford University, one of the best ways to review the content is to write some notes about what I learned. The best part is that the posts will include examples with Python, NumPy and SciPy. I hope you enjoy them all!
In this post I will cover Logistic Regression and Regularization.
Logistic Regression is a type of regression that predicts the probability of occurrence of an event by fitting data to a logit function (logistic function). Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For instance, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is widely used in several scenarios, such as predicting a customer's propensity to purchase a product or cancel a subscription in marketing applications, and many others.
Visualizing the Data
Let's explain logistic regression by example. Suppose you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on the two exams and the admission decision. We will use logistic regression to build a model that estimates the probability of admission based on the scores from those two exams.
Let's first visualize our data on a 2-dimensional plot as shown below. As you can see, the axes are the two exam scores, and the positive and negative examples are shown with different markers.
|Sample training visualization|
Cost Function and Gradient
The logistic regression hypothesis is defined as:
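$$h_\theta(x) = g(\theta^T x)$$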
where the function g is the sigmoid function. It is defined as:
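$$g(z) = \frac{1}{1 + e^{-z}}$$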
The sigmoid function has the special property that its output always lies in the range [0, 1]. For large positive values of its argument the sigmoid is close to 1, while for large negative values it is close to 0.
The cost function and gradient for logistic regression are given below. The cost function is:
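$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$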
and the gradient of the cost is a vector of the same length as theta, where the j-th element is defined as follows:
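$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$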
You may note that the gradient is quite similar to the linear regression gradient; the difference arises because linear and logistic regression have different definitions of h(x).
Let's see the code:
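Here is a minimal NumPy sketch of the sigmoid, the cost and the gradient. The helper names (`cost_function`, `compute_grad`) are just illustrative; the actual source is linked at the end of this post.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (logistic) function; works on scalars and NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-z))

def cost_function(theta, X, y):
    """Logistic regression cost J(theta) for data X (m x n) and 0/1 labels y (m,)."""
    m = y.size
    h = sigmoid(X.dot(theta))
    return (1.0 / m) * np.sum(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))

def compute_grad(theta, X, y):
    """Gradient of the cost with respect to theta (vector of length n)."""
    m = y.size
    h = sigmoid(X.dot(theta))
    return (1.0 / m) * X.T.dot(h - y)
```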
Now, to find the minimum of this cost function, we will use a SciPy built-in function called fmin_bfgs. It will find the best parameters theta for the logistic regression cost function, given a fixed dataset (of X and y values). A short usage sketch follows the parameter list below.
The parameters are:
- The initial values of the parameters you are trying to optimize;
- A function that, when given the training set and a particular theta, computes the logistic regression cost and gradient with respect to theta for the dataset (X,y).
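Putting this together, a minimal usage sketch might look like the following. It assumes the `cost_function` and `compute_grad` helpers sketched above, and that `X` already contains a leading column of ones for the intercept term.

```python
import numpy as np
from scipy.optimize import fmin_bfgs

# X: (m, 3) array with a column of ones plus the two exam scores
# y: (m,) array of 0/1 admission labels

# initial guess: all parameters set to zero
initial_theta = np.zeros(X.shape[1])

# minimize the cost; fprime supplies the analytic gradient
theta = fmin_bfgs(cost_function, initial_theta,
                  fprime=compute_grad, args=(X, y))
```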
The final theta value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the one below.
Evaluating logistic regression
Now that you have learned the parameters of the model, you can use the model to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should see an admission probability of 0.776.
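As a quick sanity check, here is a sketch of that prediction, assuming the `sigmoid` helper and the learned `theta` from above, with the intercept as the first parameter:

```python
import numpy as np

# probability of admission for Exam 1 = 45 and Exam 2 = 85
prob = sigmoid(np.array([1.0, 45.0, 85.0]).dot(theta))
print('Admission probability: %f' % prob)  # expected to be about 0.776
```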
But you can go further and evaluate the quality of the parameters we have found by checking how well the learned model predicts on our training set. If we apply a threshold of 0.5 to the output of the sigmoid, we predict y = 1 (admitted) whenever h(x) >= 0.5 and y = 0 (not admitted) otherwise.
Moving on to the code, we can calculate the training accuracy of our classifier, i.e. the percentage of training examples it classified correctly. Source code.
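A minimal sketch of that accuracy computation, assuming the same `X`, `y` and `theta` as above:

```python
import numpy as np

# predict y = 1 (admitted) whenever the sigmoid output is at least 0.5
predictions = sigmoid(X.dot(theta)) >= 0.5

# fraction of training examples classified correctly, as a percentage
accuracy = 100.0 * np.mean(predictions == y)
print('Train accuracy: %.2f%%' % accuracy)
```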
89%, not bad, huh?!
Regularized logistic regression
But what if your data cannot be separated into positive and negative examples by a straight line through the plot? Since our logistic regression will only be able to find a linear decision boundary, we will have to fit the data in a better way. Let's go through an example.
Suppose you are the product manager of a factory and you have the results of two different tests for some microchips. From these two tests, you would like to determine whether the microchips should be accepted or rejected. We have a dataset of test results on past microchips, from which we can build a logistic regression model.
Visualizing the data
Let's visualize our data. As you can see in the figure below, the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.
You may see that a model built for this task can predict all of the training data perfectly, and that sometimes causes trouble. Just because the model can perfectly reconstruct the training set does not mean that it has everything figured out. This is known as overfitting. You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least some regularization in there. Regularization is a powerful strategy to combat the overfitting problem. We will see it in action in the next sections.
One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power.
As a result of this mapping, our vector of two features (the scores on the two QA tests) is transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2D plot.
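A sketch of that feature mapping is below; the function name `map_feature` is mine, and the ordering of the polynomial terms is just one common convention.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Map two input features to all polynomial terms x1^(i-j) * x2^j with i <= degree.

    For degree 6 this yields 28 columns, including the leading column of ones
    (the bias term)."""
    x1 = np.atleast_1d(x1).astype(float)
    x2 = np.atleast_1d(x2).astype(float)
    columns = [np.ones(x1.shape[0])]
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append((x1 ** (i - j)) * (x2 ** j))
    return np.column_stack(columns)
```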
Although the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. That is where regularized logistic regression comes in: it lets us fit the data while avoiding the overfitting problem.
Cost function and gradient
The regularized cost function in logistic regression is:
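$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$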
Note that you should not regularize the parameter theta_0, so the final summation runs from j = 1 to n, not j = 0 to n. The gradient of the cost function is a vector where the j-th element is defined as follows:
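$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)} \qquad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad \text{for } j \ge 1$$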
Now let's learn the optimal parameters theta. Using these new functions together with our SciPy optimization function from before (fmin_bfgs), we will be able to learn the parameters theta.
All the code is now provided (code).
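For illustration, here is a rough sketch of how the regularized cost and gradient could be written on top of the earlier helpers. The parameter `lam` stands for the regularization strength lambda, and the function names are again just illustrative.

```python
import numpy as np

def cost_function_reg(theta, X, y, lam):
    """Regularized logistic regression cost; theta[0] (the bias) is not penalized."""
    m = y.size
    h = sigmoid(X.dot(theta))
    penalty = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    return (1.0 / m) * np.sum(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h)) + penalty

def compute_grad_reg(theta, X, y, lam):
    """Gradient of the regularized cost; no regularization term for theta[0]."""
    m = y.size
    h = sigmoid(X.dot(theta))
    grad = (1.0 / m) * X.T.dot(h - y)
    grad[1:] += (lam / m) * theta[1:]
    return grad

# same optimizer as before, now with the regularized cost and gradient
# (here with an example regularization strength of 1.0):
# theta = fmin_bfgs(cost_function_reg, np.zeros(X.shape[1]),
#                   fprime=compute_grad_reg, args=(X, y, 1.0))
```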
Plotting the decision boundary
Let's visualize the model learned by the classifier. The plot will display the non-linear decision boundary that separates the positive and negative examples.
As you can see, our model successfully predicted our training data with an accuracy of 83.05%.
Scikit-learn is an amazing tool for machine learning, providing several modules for working with classification, regression and clustering problems. It uses Python, NumPy and SciPy, and it is open source!
If you want to use logistic regression or linear regression, you should consider scikit-learn. It has several examples and several types of regularization strategies to work with. Take a look at this link and see for yourself! I recommend it!
Logistic regression has several advantages over linear regression; in particular, it is more robust and does not assume a linear relationship, since it can handle nonlinear effects. However, it requires much more data to achieve stable, meaningful results. There are other machine learning techniques for handling nonlinear problems, and we will see them in the next posts. I hope you enjoyed this article!
All the source code from this article is here.