Machine Learning with Python - Linear Regression

Thursday, October 27, 2011

Hi all,

I decided to start a new series of posts focusing on general machine learning, with several snippets for anyone to use on real problems and real datasets. Since I am studying machine learning again with a great online course offered this semester by Stanford University, one of the best ways to review the content is to write some notes about what I learn. The best part is that the posts will include examples with Python, Numpy and Scipy. I hope you enjoy them!

Linear Regression

In this post I will implement linear regression and get to see it work on data. Linear regression is the oldest and most widely used predictive model in the field of machine learning. The goal is to minimize the sum of the squared errors in order to fit a straight line to a set of data points. (You can find further information at Wikipedia.)

The linear regression model fits a linear function to a set of data points. The form of the function is:

Y = β0 + β1*X1 + β2*X2 + … + βn*Xn

Where Y is the target variable, X1, X2, …, Xn are the predictor variables, β1, β2, …, βn are the coefficients that multiply the predictor variables, and β0 is a constant (the intercept).

For example, suppose you are the CEO of a large shoe-store franchise and are considering different cities for opening a new store. The chain already has stores in various cities, and you have data on profits and populations for those cities. You would like to use this data to help you select which city to expand to next. You could use linear regression to estimate the parameters of a function that predicts the profit of a new store.

The final function would be:

                                                         Y =   -3.63029144  + 1.16636235 * X1

There are two main approaches for linear regression: with one variable and with multiple variables. Let's see both!

Linear regression with one variable

Considering our last example, we have a file that contains the dataset of our linear regression problem. The first column is the population of the city and the second column is the profit of having a store in that city. A negative value for profit indicates a loss.

Before starting, it is useful to understand the data by visualizing it. We will use a scatter plot, since the data has only two properties to plot (profit and population). Many other problems in real life are multi-dimensional and can't be plotted on a 2-D plot.
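A minimal sketch of how this visualization could be done with Matplotlib; the data points below are just a tiny hand-picked sample standing in for the full dataset file:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the plot is saved to a file
import matplotlib.pyplot as plt

# Small illustrative sample: population (in 10,000s), profit (in $10,000s)
data = np.array([
    [6.1101, 17.592],
    [5.5277,  9.1302],
    [8.5186, 13.662],
    [7.0032, 11.854],
    [5.8598,  6.8233],
])
X, y = data[:, 0], data[:, 1]

plt.scatter(X, y, marker="x", c="r")
plt.xlabel("Population of City in 10,000s")
plt.ylabel("Profit in $10,000s")
plt.savefig("scatter.png")
```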

If you run this plotting code (you must have the Matplotlib package installed in order to display the plots), you will see the scatter plot of the data shown in Figure 1.


Now you must fit the linear regression parameters to our dataset using gradient descent. The objective of linear regression is to minimize the cost function (m is the number of training examples):

J(θ) = 1/(2m) * Σ ( hθ(x(i)) − y(i) )²

where the hypothesis hθ is given by the linear model:

hθ(x) = θ0 + θ1*x

The parameters of your model are the θ values. These are the values you will adjust to minimize the cost J(θ). One way to do it is to use the batch gradient descent algorithm, where each iteration performs the update (simultaneously for all j):

θj := θj − α * 1/m * Σ ( hθ(x(i)) − y(i) ) * xj(i)

With each step of gradient descent, your parameters θ move closer to the optimal values that achieve the lowest cost J(θ).

For our initial inputs we start with the fitting parameters θ set to zero, add an extra column of ones to our data to accommodate the θ0 intercept term, and set the learning rate alpha to 0.01.
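A minimal NumPy sketch of this setup and the batch gradient descent loop; the data points here are just a small sample for illustration, and the variable names are my own:

```python
import numpy as np

# Toy sample: population (in 10,000s) and profit (in $10,000s);
# the real data would be loaded from the dataset file
x = np.array([6.1101, 5.5277, 8.5186, 7.0032, 5.8598])
y = np.array([17.592, 9.1302, 13.662, 11.854, 6.8233])
m = y.size

# Add a column of ones so theta[0] acts as the intercept term
X = np.column_stack((np.ones(m), x))

theta = np.zeros(2)   # initial fitting parameters
alpha = 0.01          # learning rate
iterations = 1500

for _ in range(iterations):
    # Batch update: theta_j := theta_j - alpha * (1/m) * sum((h(x) - y) * x_j)
    errors = X.dot(theta) - y
    theta = theta - (alpha / m) * X.T.dot(errors)

print(theta)
```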

As you perform gradient descent to minimize the cost function J(θ), it is helpful to monitor the convergence by computing the cost at each iteration.

A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. It should converge to a steady value by the end of the algorithm.
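A cost function and convergence check along these lines might look like this (again a minimal sketch with a tiny data sample; the function names are my own):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum((X.theta - y)^2)"""
    m = y.size
    errors = X.dot(theta) - y
    return errors.dot(errors) / (2.0 * m)

# Record J(theta) at every gradient descent step to monitor convergence
x = np.array([6.1101, 5.5277, 8.5186, 7.0032, 5.8598])
y = np.array([17.592, 9.1302, 13.662, 11.854, 6.8233])
m = y.size
X = np.column_stack((np.ones(m), x))
theta, alpha = np.zeros(2), 0.01

history = []
for _ in range(500):
    theta -= (alpha / m) * X.T.dot(X.dot(theta) - y)
    history.append(compute_cost(X, y, theta))

# With a small enough alpha, J(theta) should never increase
print(history[0], history[-1])
```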

Your final values for θ will be used to make predictions on profits in areas of 35,000 and 70,000 people. For that we will use some matrix algebra functions from Scipy and Numpy, powerful Python packages for scientific computing.

Our final values are shown below:

                                                         Y =   -3.63029144  + 1.16636235 * X1

Now you can use this function to predict your profits! Applying it to our data produces the fitted line over the scatter plot.
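Using the fitted coefficients above, the two predictions mentioned earlier (cities of 35,000 and 70,000 people) can be computed like this; remember that in this dataset populations are expressed in units of 10,000 and profits in units of $10,000:

```python
import numpy as np

# Fitted coefficients from the final function above
theta = np.array([-3.63029144, 1.16636235])

predict1 = np.array([1, 3.5]).dot(theta)   # city of 35,000 people
predict2 = np.array([1, 7.0]).dot(theta)   # city of 70,000 people

print(predict1 * 10000)   # predicted profit in dollars
print(predict2 * 10000)
```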

Another interesting visualization is the contour plot, which shows how J(θ) varies with changes in θ0 and θ1. The cost function J(θ) is bowl-shaped and has a global minimum, as you can see in the figure below.

This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.

All the code is shown here.

Linear regression with multiple variables

OK, but what happens when you have multiple variables? How do we work with them using linear regression? That is where linear regression with multiple variables comes in. Let's see an example:

Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.

Our training set of housing prices in Recife, Pernambuco, Brazil is formed by three columns (three variables). The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

But before going directly to the linear regression, it is important to analyze our data. By looking at the values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, it is important to perform feature scaling, which can make gradient descent converge much more quickly.

The basic steps are:

  • Subtract the mean value of each feature from the dataset.
  • After subtracting the mean, additionally scale (divide) the feature values by their respective “standard deviations.”

The standard deviation is a way of measuring how much variation there is in the range of values of a particular feature (most data points will lie within ±2 standard deviations of the mean); this is an alternative to taking the range of values (max-min).
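The two steps above can be sketched as a small NumPy helper that also returns the mean and standard deviation, since you will need them later to normalize new examples with the same scaling (the training values here are a made-up toy set):

```python
import numpy as np

def feature_normalize(X):
    """Subtract each feature's mean and divide by its standard deviation.
    Returns the scaling parameters so new examples can be normalized
    with the same mu and sigma later."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma

# Toy training set: size in square feet, number of bedrooms
X = np.array([[2104.0, 3], [1600.0, 3], [2400.0, 3], [1416.0, 2], [3000.0, 4]])
X_norm, mu, sigma = feature_normalize(X)
```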

Now that you have your data scaled, you can implement the gradient descent and the cost function.

Previously, you implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.

In the multivariate case, the cost function can also be written in the following vectorized form:

J(θ) = 1/(2m) * (X*θ − y)' * (X*θ − y)
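A sketch of the vectorized cost next to the element-wise version, to show they compute the same thing (the function names are my own):

```python
import numpy as np

def compute_cost_vectorized(X, y, theta):
    """J(theta) = 1/(2m) * (X.theta - y)' * (X.theta - y)"""
    m = y.size
    residual = X.dot(theta) - y
    return residual.dot(residual) / (2.0 * m)

def compute_cost_loop(X, y, theta):
    """The same cost written as an element-wise sum, for comparison."""
    m = y.size
    return sum((X[i].dot(theta) - y[i]) ** 2 for i in range(m)) / (2.0 * m)
```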
After running our code, gradient descent converges to the following coefficients:

             θ = [ 215810.61679138,   61446.18781361,   20070.13313796 ]

The gradient descent will run until convergence to find the final values of θ. Next, we will use these values of θ to predict the price of a house with 1650 square feet and 3 bedrooms.
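The prediction pattern looks like this. Note that the mu and sigma values below are placeholders I made up for illustration; in practice you must reuse the exact mean and standard deviation computed during feature normalization of the training set:

```python
import numpy as np

# Final theta from gradient descent (values from the run above)
theta = np.array([215810.61679138, 61446.18781361, 20070.13313796])

# Placeholder scaling parameters -- reuse the real mu and sigma
# returned by your feature normalization step
mu = np.array([2000.0, 3.0])
sigma = np.array([800.0, 1.0])

# Normalize the new example with the *training* statistics,
# then prepend the intercept term before applying theta
x = (np.array([1650.0, 3.0]) - mu) / sigma
price = np.array([1.0, *x]).dot(theta)
print(price)
```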


Predicted price of a 1650 sq-ft, 3 br house: 183865.197988

If you plot the convergence curve of gradient descent, you will see that the cost J(θ) decreases as the number of iterations grows, flattening out as the algorithm converges.

The code for linear regression with multiple variables is available here.

Extra Notes

The Scipy package comes with several tools for helping you in this task, even with a module that has a linear regression implemented for you to use!

The module is scipy.stats.linregress, which computes a least-squares linear fit for you and also reports the correlation coefficient and the p-value of the fit. Check more about it here.
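A quick example of how linregress can be used; the data points are the same small illustrative sample from earlier in the post:

```python
import numpy as np
from scipy import stats

# Population (in 10,000s) and profit (in $10,000s)
x = np.array([6.1101, 5.5277, 8.5186, 7.0032, 5.8598])
y = np.array([17.592, 9.1302, 13.662, 11.854, 6.8233])

# Returns the fitted line plus the correlation coefficient,
# the p-value and the standard error of the estimate
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(slope, intercept, r_value)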


The goal of regression is to determine the values of the β parameters that minimize the sum of the squared residuals (the differences between predicted and observed values) for the set of observations. Since linear regression is restricted to fitting linear (straight line/plane) functions to data, it is not as well suited to some real-world data as more general techniques such as neural networks, which can model non-linear functions. But linear regression has some interesting advantages:

  • Linear regression is the most widely used method, and it is well understood.
  • Training a linear regression model is usually much faster than methods such as neural networks.
  • Linear regression models are simple and require minimal memory to implement, so they work well on embedded controllers that have limited memory space.
  • By examining the magnitude and sign of the regression coefficients (β) you can infer how predictor variables affect the target outcome.
  • It is one of the simplest algorithms and is available in several packages, even Microsoft Excel!

I hope you enjoyed this simple post, and in the next one I will explore another field of machine learning with Python! You can download the code at this link.

Marcel Caraciolo

High Performance Computation with Python - Part 04

Monday, October 10, 2011

Hi all,

This article is the fourth one of the series about High Performance Computation with Python. For anyone that missed the earlier installments, check this link about Python Profiling, this one about Cython, and finally this one about Numpy Vectors. The goal is to present approaches to make CPU-demanding tasks in Python run much faster.

The techniques that are being covered:

  1.  Python Profiling - How to find bottlenecks
  2.  Cython -  Annotate your code and compile to C
  3.  Numpy Vectors - Fast vector operations using numpy arrays
  4.  Numpy integration with Cython - fast numerical Python library wrapped by Cython
  5.  PyPy - Python's new Just in Time  Compiler

In this post I will talk about PyPy - the JIT compiler for Python!

The Problem

In this series we are analyzing how to optimize the Spearman rank correlation coefficient, a measure used to compute the similarity between items in recommender systems; it assesses how well the relationship between two variables can be described by a monotonic function. The source code for this metric can be found in the first post.
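For reference, a simplified pure-Python version of the metric (ignoring tied ranks; the original implementation is in the first post) looks roughly like this. Being pure Python with no C extensions, it is exactly the kind of code PyPy can speed up:

```python
def rank(values):
    """Map each value to its 1-based rank (assumes no ties)."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rx, ry = rank(x), rank(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - (6.0 * d_squared) / (n * (n ** 2 - 1))
```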


PyPy is a Just-in-Time compiler for the Python programming language. It is multi-platform and it runs Python 2.7. Running your code in PyPy can make it (depending on how it is written) run faster, with typical speed-ups of 2-10x. Sometimes some work has to be done in the code, because of the use of shortcuts that work in CPython but aren't actually correct according to the Python specification.

You can download and install PyPy here. To install it, just place it in your home directory and put a symlink from somewhere in your PATH to it. Let's run the code with PyPy and with standard CPython and see the performance difference:

PyPy is about 34.77% faster than pure Python on the input with 190,340 items on my Macbook. The amazing part is that I didn't change a single line of my code! \m/
If you aren't using a C library like numpy, then you should check out PyPy - it might just make your code run several times faster. The team is still porting PyPy to support Numpy, since some of the C code that Numpy relies on must be rewritten. You can see some benchmarks of the porting effort here.

Although the PyPy team provides a simple array interface that behaves in a numpy-like fashion, for now it has very few functions and only supports double arithmetic.

I strongly recommend you take a look at PyPy; it shows great promise for high-performance Python with little effort, and especially for the scientific community, support for existing numpy code would be a great advance!

I haven't mentioned it until now, but I will write a special post to close this series on High Performance with Python: it is about the multiprocessing module and how you can work with it. I will show some examples and a library called JobLib that wraps it, so you can easily use the power of your machine's processors to do some parallel work.

See you next time,


Marcel Caraciolo

Presentation at VII Brazilian Symposium of Collaborative Systems (SBSC) about Recommender Systems

Hi all,

I am sharing the slides from my keynote at the VII Brazilian Symposium of Collaborative Systems (SBSC), where I presented my current work on recommender systems focusing on social networks.

My work "Content Recommendation based on Data Mining in Adaptive Social Networks" presents how I built the recommender system at the educational social network AtéPassar and the current results behind it. It is a novel project among Brazilian social networks, especially because I worked hard on the explanations that come along with each recommendation.

Soon I will provide the paper. The special part of this event was the track dedicated to recommender systems! I had the opportunity to meet Brazilian researchers and developers interested in this field.

I hope to participate again next year! It is a great event for researchers who miss having events focused on recommender systems.



Slides from Keynotes at VII PythonBrasil

Monday, October 3, 2011

Hi all,

I'd like to share the slides of the talks I gave at VII PythonBrasil, the Brazilian Python Users Meeting that happens once a year. This year I had the opportunity to give two talks: one about open-source communities and the experience with the local community of Pernambuco, the Python User Group of Pernambuco (PUG-PE), and another about the framework I am currently working on: Crab - A Python Framework for Building Recommender Systems.

It was an amazing event, with lots of great keynotes and opportunities to meet people and make some friends. I also had the opportunity to give two more lightning talks: one about JobLib, a pipeline toolkit for scientific computations, and one about IPython.

Below the slides provided:

                           The JobLib slides for download.

I'd also like to announce the launch of the new home page of the Crab project, with a reformulated design. It is still in development, with lots of work to do, but it's coming! The first release, 0.1, will be launched by the second week of October.

Crab new Home Page 

Thanks for the feedback from all the developers at PythonBrasil, and I look forward to new contributors on the project.


Marcel Caraciolo