Hi all,

I decided to start a new series of posts focusing on general machine learning, with several snippets that anyone can use on real problems and real datasets. Since I am studying machine learning again through a great online course offered this semester by Stanford University, one of the best ways to review the content is to write some notes about what I learned. The best part is that the posts will include examples with Python, NumPy, and SciPy. I hope you enjoy them!

**Linear Regression**

In this post I will implement linear regression and see it work on data. Linear regression is the oldest and most widely used predictive model in machine learning. The goal is to minimize the sum of the squared errors when fitting a straight line to a set of data points. (You can find further information on Wikipedia.)

The linear regression model fits a linear function to a set of data points. The form of the function is:

Y = β_0 + β_1\*X_1 + β_2\*X_2 + … + β_n\*X_n

where Y is the target variable, X_1, X_2, …, X_n are the predictor variables, β_1, β_2, …, β_n are the coefficients that multiply the predictor variables, and β_0 is a constant.
For example, suppose you are the CEO of a big company of shoes franchise and are considering different cities for opening a new store. The chain already has stores in various cities and you have data for profits and populations from the cities. You would like to use this data to help you select which city to expand next. You could use linear regression for evaluating the parameters of a function that predicts profits for the new store.

The final function would be:

Y = -3.63029144 + 1.16636235 \* X_1

There are two main approaches for linear regression: with one variable and with multiple variables. Let's see both!

**Linear regression with one variable**

Considering our last example, we have a file that contains the dataset of our linear regression problem. The first column is the population of the city and the second column is the profit of having a store in that city. A negative value for profit indicates a loss.

Before starting, it is useful to understand the data by visualizing it. We will use a scatter plot, since the data has only two properties to plot (profit and population). Many real-life problems are multi-dimensional and can't be visualized on a 2-D plot.
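The original code listing for this step was lost in formatting; below is a minimal sketch of the plotting step. The data values here are made-up stand-ins for the course dataset (an assumption), and the Agg backend is selected so the script also runs without a display:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for an interactive window
import matplotlib.pyplot as plt

# Stand-in for the dataset (illustrative values, not the real file):
# column 0 = city population in 10,000s, column 1 = profit in $10,000s.
data = np.array([[5.0, 2.1],
                 [6.1, 3.5],
                 [7.2, 4.0],
                 [8.3, 5.8],
                 [9.4, 6.2]])
X, y = data[:, 0], data[:, 1]

plt.scatter(X, y, marker="x", color="red")
plt.xlabel("Population of city in 10,000s")
plt.ylabel("Profit in $10,000s")
plt.title("Figure 1: training data")
plt.savefig("figure1.png")
```

With the real dataset you would load the two columns with `np.loadtxt(filename, delimiter=",")` instead of hard-coding them.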

If you run the code above (you must have the Matplotlib package installed in order to display the plots), you will see the scatter plot of the data shown in Figure 1.

Now you must fit the linear regression parameters to our dataset using gradient descent. The objective of linear regression is to minimize the cost function:

J(θ) = (1/2m) \* Σ_{i=1..m} (h_θ(x^(i)) − y^(i))²

where the hypothesis h_θ(x) is given by the linear model:

h_θ(x) = θ_0 + θ_1\*x

The parameters of your model are the θ values. These are the values you will adjust to minimize the cost J(θ). One way to do this is the batch gradient descent algorithm. In batch gradient descent, each iteration performs the update (simultaneously for all j):

θ_j := θ_j − α \* (1/m) \* Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) \* x_j^(i)

With each step of gradient descent, your parameters θ come closer to the optimal values that achieve the lowest cost J(θ).

For the initial setup we take our initial fitting parameters θ and our data, and add another dimension (a column of ones) to the data to accommodate the θ_0 intercept term. We also set our learning rate alpha to 0.01.
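The listing for this step did not survive the formatting; here is a minimal sketch of the setup and the batch update. Zero initial parameters, the iteration count, and the toy data are all assumptions:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch update: theta_j := theta_j - alpha*(1/m)*sum((h-y)*x_j), all j at once."""
    m = y.size
    for _ in range(num_iters):
        errors = X.dot(theta) - y              # h_theta(x) - y for every example
        theta = theta - (alpha / m) * X.T.dot(errors)
    return theta

# Toy data: profit = 2*population - 1 (illustrative, not the course dataset)
population = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
profit = 2.0 * population - 1.0

m = population.size
X = np.column_stack([np.ones(m), population])  # extra column of ones for theta_0
theta = np.zeros(2)                            # initial fitting parameters (assumption)
alpha = 0.01                                   # learning rate from the text

theta = gradient_descent(X, profit, theta, alpha, num_iters=10000)
print(theta)  # approaches [-1.0, 2.0], the line the toy data was built from
```

On noise-free data like this the parameters converge to the generating line; on real data they settle at the least-squares fit.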

As you perform gradient descent to minimize the cost function J(θ), it is helpful to monitor the convergence by computing the cost at each iteration.
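The cost-function listing was also lost; a minimal sketch of such a function (the name `compute_cost` is an assumption):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = (1/2m) * sum over i of (X[i].theta - y[i])^2."""
    m = y.size
    errors = X.dot(theta) - y
    return errors.dot(errors) / (2.0 * m)

# Quick check on a tiny dataset: with theta = [0, 0] the cost is sum(y^2)/(2m).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(compute_cost(X, y, np.zeros(2)))  # (1 + 4 + 9) / (2*3) = 2.333...
```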

A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. It should converge to a steady value by the end of the algorithm.

Your final values for θ will be used to make predictions on profits in areas of 35,000 and 70,000 people. For that we will use some matrix algebra functions from SciPy and NumPy, powerful Python packages for scientific computing.
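Using the final parameters reported in the text, the two predictions can be sketched like this (the convention that population is expressed in units of 10,000 people is an assumption that matches the course dataset):

```python
import numpy as np

# Final parameters from the text: Y = -3.63029144 + 1.16636235 * X1
theta = np.array([-3.63029144, 1.16636235])

# Populations in units of 10,000 people: 35,000 -> 3.5 and 70,000 -> 7.0.
# The leading 1.0 multiplies the intercept term theta_0.
predict1 = np.array([1.0, 3.5]).dot(theta) * 10000  # profit in dollars
predict2 = np.array([1.0, 7.0]).dot(theta) * 10000

print(predict1)  # ~4519.77
print(predict2)  # ~45342.45
```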

Our final values are shown below:

Y = -3.63029144 + 1.16636235 \* X_1

Now you can use this function to predict your profits! Applying it to our data produces a plot of the fitted line over the training points.

Another interesting visualization is the contour plot, which shows how J(θ) varies with changes in θ_0 and θ_1. The cost function J(θ) is bowl-shaped and has a global minimum, as you can see in the figure below.

This minimum is the optimal point for θ_0 and θ_1, and each step of gradient descent moves closer to this point.
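A contour plot like the one described can be sketched by evaluating J(θ) over a grid of parameter values. The toy data, grid ranges, and contour levels here are assumptions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for an interactive window
import matplotlib.pyplot as plt

def compute_cost(X, y, theta):
    """J(theta) = (1/2m) * (X.theta - y)^T (X.theta - y)."""
    m = y.size
    r = X.dot(theta) - y
    return r.dot(r) / (2.0 * m)

# Toy data along profit = 2*population - 1 (illustrative values)
population = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
profit = 2.0 * population - 1.0
X = np.column_stack([np.ones(population.size), population])

# Evaluate J over a grid of (theta_0, theta_1) values
theta0_vals = np.linspace(-10, 10, 100)
theta1_vals = np.linspace(-4, 8, 100)
J_vals = np.array([[compute_cost(X, profit, np.array([t0, t1]))
                    for t0 in theta0_vals] for t1 in theta1_vals])

# Logarithmically spaced levels make the bowl shape visible near the minimum
plt.contour(theta0_vals, theta1_vals, J_vals, levels=np.logspace(-2, 3, 20))
plt.xlabel("theta_0")
plt.ylabel("theta_1")
plt.savefig("contour.png")
```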

**Linear regression with multiple variables**

OK, but what if you have multiple variables? How do we work with them using linear regression? That is where linear regression with multiple variables comes in. Let's see an example:

Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.

Our training set of housing prices in Recife, Pernambuco, Brazil has three columns. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

But before going directly to linear regression it is important to analyze our data. Looking at the values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, it is important to perform feature scaling, which can make gradient descent converge much more quickly.

The basic steps are:

- Subtract the mean value of each feature from the dataset.
- After subtracting the mean, additionally scale (divide) the feature values by their respective “standard deviations.”

The standard deviation is a way of measuring how much variation there is in the range of values of a particular feature (most data points will lie within ±2 standard deviations of the mean); this is an alternative to taking the range of values (max-min).
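The two steps above can be sketched as a small helper function (the name `feature_normalize` and the toy housing values are assumptions):

```python
import numpy as np

def feature_normalize(X):
    """Scale each feature column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)       # per-feature mean
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

# Toy housing data (illustrative values): size in sq ft, number of bedrooms
X = np.array([[2100.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1400.0, 2.0]])

X_norm, mu, sigma = feature_normalize(X)
print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]
```

Keep `mu` and `sigma` around: any new house you want a prediction for must be normalized with the same values before applying θ.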

Now that you have your data scaled, you can implement the gradient descent and the cost function.

Previously, you implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.

In the multivariate case, the cost function can also be written in the following vectorized form:

J(θ) = (1/2m) \* (Xθ − y)^T (Xθ − y)
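A quick sketch confirming that this vectorized form agrees with the elementwise sum from the univariate section (function names and the random data are assumptions):

```python
import numpy as np

def cost_sum(X, y, theta):
    """Elementwise form: (1/2m) * sum over i of (h(x_i) - y_i)^2."""
    m = y.size
    return sum((X[i].dot(theta) - y[i]) ** 2 for i in range(m)) / (2.0 * m)

def cost_vec(X, y, theta):
    """Vectorized form: (1/2m) * (X.theta - y)^T (X.theta - y)."""
    m = y.size
    r = X.dot(theta) - y
    return r.dot(r) / (2.0 * m)

# Random multivariate problem: 20 examples, 3 features (incl. intercept column)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
theta = rng.normal(size=3)

print(abs(cost_sum(X, y, theta) - cost_vec(X, y, theta)))  # ~0
```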

After running our code, we obtain the following values of θ:

θ = [215810.61679138, 61446.18781361, 20070.13313796]

Gradient descent runs until convergence to find the final values of θ. Next, we will use these values of θ to predict the price of a house with 1650 square feet and 3 bedrooms. The vectorized update rule is:

θ := θ − α \* (1/m) \* X^T (Xθ − y)

**Predicted price of a 1650 sq-ft, 3 br house:** 183865.197988

If you plot the convergence of gradient descent, you will see the cost J(θ) decrease as the number of iterations grows.

**Extra Notes**

The SciPy package comes with several tools to help you with this task, even a module with linear regression already implemented for you to use!

The function is scipy.stats.linregress, which fits a straight line by ordinary least squares directly (no iterative θ updates needed) and also reports statistics such as the correlation coefficient and p-value. Check the SciPy documentation for more details.
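A minimal sketch of using it, with made-up data along a known line:

```python
import numpy as np
from scipy import stats

# Illustrative data along profit = 2*population - 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x - 1.0

result = stats.linregress(x, y)
print(result.slope, result.intercept)  # 2.0 and -1.0 (the data is an exact fit)
```

Note that linregress handles only one predictor variable; for the multivariate case you would keep the gradient descent approach (or a least-squares solver such as numpy.linalg.lstsq).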

**Conclusions**

The goal of regression is to determine the values of the β parameters that minimize the sum of the squared residuals (the differences between predicted and observed values) for the set of observations. Since linear regression is restricted to fitting linear (straight line/plane) functions to data, it is not as well suited to complex real-world data as more general techniques such as neural networks, which can model non-linear functions. But linear regression has some interesting advantages:

- Linear regression is the most widely used method, and it is well understood.
- Training a linear regression model is usually much faster than methods such as neural networks.
- Linear regression models are simple and require minimum memory to implement, so they work well on embedded controllers that have limited memory space.
- By examining the magnitude and sign of the regression coefficients (β) you can infer how the predictor variables affect the target outcome.
- It is one of the simplest algorithms and is available in several packages, even Microsoft Excel!

I hope you enjoyed this simple post, and in the next one I will explore another field of machine learning with Python! You can download the code at this link.

Marcel Caraciolo
