Data mining in practice: Learn about Linear Regression with Python

Monday, August 31, 2009

Hi all,

Let's continue our studies of data mining algorithms. In this article, we will see how to use linear regression to predict the values of a data series, with a simple implementation in the Python programming language.

Linear regression is more closely connected to statistics than to computing. It can be used to fit a predictive model to an observed data set of y and x values. After such a model is developed, if an additional value of x is given without its accompanying value of y, the fitted model can be used to predict the value of y. The model can be drawn as the line that best represents the data set. In general, the problems that linear regression can help with involve predicting a quantity of items at a certain moment.

To better understand how linear regression works, let's look at an example. Consider Table 01, which shows the yearly evolution of the unit price of a product and the quantity of units sold.

Table 01. History of the unit cost and quantity sold of a product.

Based on the data set shown in Table 01, the goal is to predict the quantity of products sold when the price (unit cost) of the product reaches the value 2.0. The prediction must consider only the data provided in the table, ignoring other possible factors. It's important to note that all the data provided are fictional, used here only to illustrate the use of linear regression to solve prediction problems.

To answer this question, we can apply linear regression. However, nothing guarantees that linear regression will make a "good" prediction, that is, that the linear model will fit the data well. To see how it works, let's plot the data from Table 01 in a scatter plot. Figure 01 shows this graph, with the values of the Unit Cost column on the x axis (horizontal) and the quantity sold on the y axis (vertical).

Fig. 01. Plot with the unit price x Quantity of products sold (1990-2004)

Visually analyzing the data plotted in Figure 01, we can mentally trace a line that adjusts itself to the points. Linear regression does exactly that: it analyzes the data and builds a line equation so we can predict the next points. One important question to discuss is: how good must this line be? A line is not the only model that can be generated from the data; we can imagine a curve (perhaps generated by an exponential equation) that also adapts itself to the data. In cases where no line fits the data well, we must use a non-linear model. To verify whether the data fit the linear model well, we can use the R² test, which checks numerically whether it is worthwhile to use the linear model for the data set.
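As a minimal sketch of what such a fit computes, the least-squares slope, intercept and R² can be obtained in pure Python. The function name and the data values below are illustrative (Table 01 is not reproduced in this post):

```python
def linear_regression(xs, ys):
    """Fit y = a*x + b by ordinary least squares and compute R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope a: covariance of x and y divided by the variance of x
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    # R^2: fraction of the variance in y explained by the fitted line
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    return a, b, r2

# Illustrative (unit cost, quantity sold) pairs
xs = [1.0, 1.2, 1.4, 1.6, 1.8]
ys = [100, 115, 135, 150, 170]
a, b, r2 = linear_regression(xs, ys)
print("y = %.4fx + %.4f  (R^2 = %.4f)" % (a, b, r2))
```

With these sample points the fit is almost perfect (R² close to 1), so a linear model would be a reasonable choice for them.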

There are many ways to use linear regression. A common way is to use Microsoft Excel, which allows adding a trend line to a plot based on the data points, along with its corresponding line equation. In this article, we will use a Python script implementation of the linear regression algorithm. For further information about the calculations behind linear regression, I recommend the readers visit Kardi Teknomo's website at the link:

First, let's suppose that we have Table 01 stored in a simple text file named 'dataset.txt'. As mentioned earlier, our goal is to predict the quantity of units sold when the price reaches $2.00 per unit.

To evaluate this prediction, we use the Python script '' and type the following command at the terminal:

>>> python 'C:/dataset.txt' 'quant_sold' 2.0

The result of the execution of the program is shown in Figure 02.

>>>y = ax + b (y = 82.9842x + 23.5645)

>>>x = 2 y = 189.533

>>>RR: 0.912826

>>>Linear Model

Figure 02. Result of the execution of the linear regression

Notice that the script printed four lines to the console. The first shows the line equation; the second, the value predicted for the quantity of items sold when the unit cost is $2.00; the third brings the evaluated R² metric; and the fourth shows the interpretation of the R² metric: if this value is below 0.8, it's recommended to use a non-linear model; otherwise, linear regression can be used for this prediction problem.
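We can check the predicted value in the second output line by plugging x = 2.0 into the equation printed in the first line:

```python
# Coefficients taken from the equation printed by the script
a, b = 82.9842, 23.5645
y = a * 2.0 + b
print(round(y, 3))  # 189.533, matching the script's output
```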

One detail that must be considered is the numerical precision of the Python implementation, which can differ from the equation presented by Excel. Another important observation is that the generated equation does not necessarily pass through all the points of the plot; that is, by using the equation we cannot obtain exactly the same values as the original data, so the equation generated by linear regression is an approximation of the values. Figure 03 shows the plot with the new predicted value, represented by the red dot.

Figure 03. The quantity of products sold predicted when the price = $2.00

In this example, we considered that the quantity sold depends only on the unit cost of the product. Based on this assumption, we worked with a price of $2.00 per unit and calculated the approximate quantity given by the sales model.

To download the script with all the files used in this example, just click here.

I hope you enjoyed this article,

If you have any doubts, please comment!

See you next time,

Marcel P. Caraciolo




Data mining in practice: Learn about the K-means Clustering Algorithm.

Sunday, August 23, 2009

Hello Folks!

In this article we will start a deeper study of the algorithms used in data mining. I will explain how to use a classic clustering algorithm for segmenting data into categories: the K-Means clustering algorithm. A simple version of the algorithm will be shown here, implemented in Python, similar to the other articles posted on this blog.

The Algorithm
The main idea of the K-Means algorithm is to classify a body of information based on the data itself. This classification, as will be shown next, is based on analysis and comparison of the numerical values in the data. Thus, the algorithm automatically provides an autonomous classification without human supervision, that is, with no pre-existing classification. Because of this characteristic, K-Means is considered an unsupervised data mining algorithm.
To understand how the algorithm works, imagine a table of lines and columns containing the samples to be classified. In this table, each column is called a dimension, and each line contains information for each dimension; the lines can also be called occurrences or points. Generally, the algorithm works with continuous samples, but it can also handle discrete data, provided the values are mapped to corresponding numerical values.
As I said earlier, the algorithm will analyze all the samples of this table and generate clusters (classifications). The algorithm will classify the data into clusters and indicate which lines (patterns) belong to each cluster (class). The user or the developer must provide the number of clusters (k) into which the data must be partitioned. This number of clusters (k) gives the algorithm the first letter of its name: K-means.

To generate the clusters and classify the samples, the algorithm compares the values of each line using a distance measure. Generally, the Euclidean distance is used to calculate how "far" the attributes of one pattern are from another. How this distance is evaluated depends on how many attributes the provided table has. After calculating the distances, the algorithm computes the centroid of each cluster. As the algorithm iterates, the value of each centroid is recomputed as the mean of the values of each attribute over the patterns that belong to that centroid. Thus, the algorithm ends with k centroids and places the patterns of the table according to their distance from the centroids.
To simplify the explanation of how the algorithm works, I will present the K-means process in the following steps:
Step 01: Begin with a decision on the value of k = number of clusters.
In this step, the k centroids must be initialized. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples of the table as single-element clusters
2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
Step 02: Create a distance matrix between each pattern and the centroids.
In this step, for each sample in sequence, compute its distance from the centroid of each cluster. The drawback of this step is the heavy calculation: since we have N samples and k centroids, the algorithm has to evaluate N x k distances.
Step 03: Put each sample in the cluster with the closest centroid (minimal distance).
Here, the samples are classified based on their distance from the centroid of each cluster. If a sample is not currently in the cluster with the closest centroid, switch it to that cluster. Notice that the algorithm ends when no sample moves to another cluster anymore.
Step 04: Update the new centroids for each cluster.
At this moment, the centroid locations are updated. For each centroid of a cluster that gained or lost a sample, its location is updated by calculating the mean of each attribute over all samples that belong to the respective cluster.
Step 05: Repeat until the convergence condition is satisfied.
The algorithm goes back to Step 02, repeating the adjustment of the location of each centroid until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
The flowchart of all the steps described above is illustrated in Figure 01 below.
Figure 01. K-means Algorithm Process
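The steps above can be sketched in pure Python. This is a minimal illustration for 2-D points with first-k initialization, not the script used in this post:

```python
def kmeans(points, k, max_iter=100):
    # Step 01: take the first k samples as the initial centroids
    centroids = [points[i] for i in range(k)]
    for _ in range(max_iter):
        # Steps 02-03: assign each sample to the cluster with the
        # nearest centroid (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 04: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 05: stop when no centroid moves (no new assignments)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups of 2-D samples
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, 2)
print(centroids)
```

For these six points the algorithm converges in a few passes, with one centroid near each of the two visible groups.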
Note that we obtain a classification that puts each sample in only one cluster. Thus, we can conclude that this algorithm generates a "hard clustering," since each sample can be classified into only one class. Other algorithms work with the "soft" clustering concept, where a metric measures the degree to which a sample belongs to each class.
If you want to read more about the K-Means algorithm, you can look further at the link below and even download other implementations of the algorithm:
Now that we have introduced the algorithm, let's see a practical example using the K-means technique.
Practical Example of the use of K-means
In this example, let's suppose a company sells products to clients through orders made up of a list of items (products). To ease comprehension of the problem and the data scenario, we will consider X client objects, each with two attributes or features, as shown in Figure 02 below.

Figure 02. Data Set samples
Based on this model, the marketing department wants to segment the clients to offer exclusive discounts and other benefits. Our goal is to group the clients of the marketing data set into three categories: Gold, Silver and Bronze clients. The client classification criteria must consider only two attributes: the total number of orders of each client and the total cost of all the client's orders without discounts. Obviously, the clients with more orders and higher total costs will be classified as Gold clients.
With all the objects shown in the table in Figure 02, each client is represented as a point with two attributes (X, Y), where X = number of orders and Y = total cost. Figure 03 shows the chart based on those attributes mapped into coordinates.
We will use the K-means algorithm to classify the data set according to what the marketing department wants. Since only two attributes were specified (total cost and number of orders), those will be used to classify the clients. In real problems, the K-Means algorithm can work with any number of attributes to classify the objects.
Analyzing the data in Figure 03, we can predict that three of the clients will be classified as Gold clients, since it's easy to see the distance between these clients and the others. However, it's not so easy to classify the rest of the clients into the Silver and Bronze categories.

Figure 03. Scatter Plot (Total of Orders x Total Cost)

To help classify these clients, we will use an implementation of the K-means algorithm that works with only two attributes. This Python implementation can be seen here.
To execute the algorithm, we run the module at the console with the following parameters:
%% python <"dataSet pathFile"> <k>
EX: %% python "C:/dataSet.txt" 3
Running the algorithm on the data set we presented earlier, the results were very satisfactory. In our example, K-means classified the data into three classes: 1, 2 and 3. Based on the definitions of the client types, we associate class 1 with Bronze, class 2 with Silver and class 3 with Gold clients. Putting these samples in a scatter plot, we can clearly visualize the classification of the clients. This chart is shown in Figure 04.
Interpreting the chart in Figure 04, the clients in green are the Gold clients, the clients in red are the Silver clients and the clients in blue are the Bronze clients. The three yellow triangles are the centroids calculated by the algorithm.

Figure 04. Clustered Data after running the K-means algorithm.
With the K-means algorithm, it's possible to classify the current clients according to their number of orders and the total cost of all their orders, as the company's marketing department desired. To classify a new client, just execute the implementation again and verify its classification. Thus, all clients will be analyzed and classified again.
One future feature that could be implemented is to compare the attributes of a new client to the centroids before including it in the data set. This comparison is done by calculating the distance between the values of the new client and the values of all centroids provided by the algorithm. The new client is then said to belong to the cluster whose centroid has the minimum distance from its data.
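A minimal sketch of this nearest-centroid check, assuming hypothetical centroid values (orders, total cost) for the three client categories:

```python
def nearest_cluster(client, centroids):
    # Squared Euclidean distance from the new client to each centroid;
    # the client belongs to the cluster whose centroid is closest.
    dists = [sum((a - b) ** 2 for a, b in zip(client, c)) for c in centroids]
    return dists.index(min(dists))

# Hypothetical centroids for Bronze, Silver and Gold clients
centroids = [(2, 150.0), (6, 600.0), (15, 2200.0)]
labels = ["Bronze", "Silver", "Gold"]
new_client = (7, 750.0)  # 7 orders, $750.00 in total
print(labels[nearest_cluster(new_client, centroids)])  # prints "Silver"
```

This avoids re-running the whole clustering just to label one new client, at the cost of not updating the centroids.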
To download the code implementation used in this example, click here.
PS: To fully run the example, showing the charts presented above, you must have Python and Matplotlib installed on your PC; Matplotlib is needed for plotting the charts.
I hope you enjoyed learning about some of the algorithms used in the data mining process. Wait for more tutorials soon!
See you next time,

Changes to the blog!!

Hello everyone,

After a few months without posting any interesting content on the blog, I decided to make some strategic decisions to improve its quality. First, from now on the posts will be written in English. The reason for this decision is to grow this blog's audience, reaching not only Brazilians but also people from other countries. Since English is currently considered the universal language, and it is my second language, there are more than enough reasons to choose it.

I am already preparing a new article, now beginning to focus on data mining, which is currently the subject of my master's degree. Other studies will also be covered here, especially in the area of Mobile Pervasive Intelligent Computing.

I hope you enjoy the blog and the content to come!!

I hope that you enjoy it!!

Regards,