In this article, we will explore one of the basic steps in the knowledge discovery process, "Data Preprocessing", an important step that can be considered a fundamental building block of data mining. Preprocessing involves many steps, but it can be summarized as the extraction, transformation, and loading of the data. More precisely, it means modifying the source data into a different format that:
(a) enables data mining algorithms to be applied easily
(b) improves the effectiveness and the performance of the mining algorithms
(c) represents the data in a format that is easy to understand for both humans and machines
(d) supports faster data retrieval from databases
(e) makes the data suitable for a specific analysis to be performed.
Real-world data can be extremely complicated to interpret without data preprocessing. I am going to explain this through a simple example based on normalization. For readers who come from a database background: this normalization is completely different from the first, second, and third normal forms used in relational database design. We are talking about another type of normalization, one that is a data preprocessing technique. To see how a simple data preprocessing technique can improve the effectiveness of an analysis by orders of magnitude, let's move on and talk about the Euclidean distance and how it can be used to evaluate the similarity between samples of data.
Consider two points in a two-dimensional space, (p1, p2) and (q1, q2); the distance between these two points is given by the formula shown in Figure 01.
Figure 01. Euclidean Distance Measure for 2-D Points
This is called the Euclidean distance between two points. The same concept extends easily to multidimensional space. If the points are (p1, p2, p3, p4, ...) and (q1, q2, q3, q4, ...), the distance between them is given by the formula in Figure 02.
Figure 02. Euclidean Distance for Multi-Dimensional Points
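As a quick sanity check on the formulas above, here is a minimal Python sketch of the multidimensional Euclidean distance (the function name is my own):

```python
from math import sqrt

def euclidean_distance(p, q):
    """Euclidean distance between two points of equal dimensionality."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# 2-D case: the classic 3-4-5 right triangle
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The same call works unchanged for points with any number of dimensions, which is exactly what we need for the multi-attribute profiles below.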
Now, I am going to introduce a sample data set of mobile users' profiles. The task is to find the mobile users who have similar profiles, that is, those who use their phones in a similar way based on their call and SMS logs.
Table 01. Example Data Set (Mobile users profiles during one month)
In this data set, the first column is a unique identifier and the remaining columns contain the profile information. Ignoring the ID attribute, if we treat each column (attribute) as a dimension, then each mobile user is represented by a point in 3-D space. The good thing is that we can calculate the distance between any two of these points. The distance between points 1 and 2 (users 1 and 2) can be calculated as below:
Figure 03. Euclidean Distance between users 01 and 02.
Looking at the data set above, we can see that users 1 and 4 have very similar profiles; that is, they use their phones in much the same way. So in the three-dimensional space the distance between them should be small: these two points should be very close.
The following table (Figure 04) gives the Euclidean distance from each user to every other user in our data set.
Figure 04. Euclidean Distance Matrix for all users
We can see from the table above that the distance between users 1 and 4 is about 2000.00, which is smaller than the distance between user 1 and any other user. So our interpretation seems to be correct. However, an alert reader will have noticed something significant here. If you look at the table again, you can see that the Euclidean distance values are very close to the differences in call duration. What does this mean?
Look at this: the difference between the call durations of users 1 and 4 is abs(25000 - 27000) = 2000, and the Euclidean distance between users 1 and 4 is 2000.30303! So this approach seems to be flawed: the Euclidean distances are dominated by the call duration amounts. It looks like you can't rely on the raw Euclidean distance to find mobile users with similar profiles.
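To make the dominance effect concrete, here is a small sketch with hypothetical profile values (not the exact rows from Table 01, which is shown as an image): even though the two users differ in SMS count and data consumed, the distance is almost entirely the call duration difference.

```python
from math import sqrt

def euclidean_distance(p, q):
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical profiles: (call duration, SMS count, data consumed)
user_a = (25000, 30, 4)
user_b = (27000, 23, 7)

d = euclidean_distance(user_a, user_b)
print(round(d, 2))                 # 2000.01
print(abs(user_a[0] - user_b[0]))  # 2000 -- the duration term alone
```

The SMS and data attributes contribute almost nothing to the total, because their numeric range is thousands of times smaller than the call duration range.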
How do we get out of this problem? We do not want the call duration attribute to dominate the Euclidean distance calculation. We can achieve this by applying a data preprocessing technique called normalization to the data set.
What is Normalization ?
The attribute data is scaled to fit into a specific range. There are many types of normalization; we will look at one technique called min-max normalization. Min-max normalization transforms a value A into a value B that fits into the range [C, D]. It is given by the formula below (Figure 05):
Figure 05. Min-Max Normalization Formula
Consider the example below: the call duration value is 50000 and we want to transform it into the range [0.0, 1.0]. First we find the maximum call duration, which is 55000, and the minimum, 25000; then the new scaled value for 50000 will be:
Figure 06. Min-Max for Duration Calls Attribute from the User 01
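The min-max formula can be sketched as a small Python helper (the function name is my own); the result below reproduces the worked example from the text, (50000 - 25000) / (55000 - 25000) ≈ 0.8333:

```python
def min_max_normalize(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Scale `value` from [old_min, old_max] into [new_min, new_max]."""
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# Worked example from the text: 50000 with min 25000 and max 55000
print(round(min_max_normalize(50000, 25000, 55000), 4))  # 0.8333
```

Note that the minimum of the attribute always maps to new_min and the maximum to new_max, so every attribute ends up on the same footing.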
Now let's apply the normalization technique to all the three attributes in our data set. Consider the following maximum and minimum values:
Max duration calls = 55001
Min duration calls = 24999
Max SMS = 33
Min SMS = 23
Max consume data = 8
Min consume data = 3
The attributes need to be scaled to fit in the range [0.0 , 1.0]. Applying the min-max normalization formula above, we get the normalized data set as given below (Figure 07):
Figure 07. Data set after Normalization
Now, given the fully normalized data set, we calculate the Euclidean distances between each user and the other users. They are given in the table below (Figure 08):
Now compare the Euclidean distance calculations before and after normalization. We can see that the distances are no longer dominated by the call duration attribute, and they make more sense now: the number of messages and the data consumed also contribute to the distance calculation. You can see how the normalization technique, and data preprocessing in general, can help you get useful and correct information out of your data before applying a machine learning or data mining algorithm.
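Putting the whole pipeline together, here is a sketch (again with hypothetical profile values, since the tables are images) that min-max normalizes each column independently and compares the distance between two similar users before and after:

```python
from math import sqrt

def euclidean_distance(p, q):
    return sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def min_max_normalize_column(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Hypothetical raw profiles: id -> (call duration, SMS count, data consumed)
users = {1: (25000, 30, 4), 2: (50000, 25, 3), 3: (55000, 33, 8), 4: (27000, 31, 5)}

# Normalize each attribute (column) independently into [0, 1]
columns = list(zip(*users.values()))
norm_columns = [min_max_normalize_column(col) for col in columns]
normalized = dict(zip(users, zip(*norm_columns)))

raw = euclidean_distance(users[1], users[4])
norm = euclidean_distance(normalized[1], normalized[4])
print(raw, norm)  # raw is ~2000; normalized distance is well under 1
```

After normalization every attribute lies in [0, 1], so a large SMS difference now matters as much as a large call duration difference.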
To conclude this tutorial, it is important to note that there are many measures similar to the Euclidean distance that can be used to calculate the similarity between two records, such as the Pearson coefficient, the Tanimoto coefficient, etc. You can try replacing the Euclidean distance with any of these measures and experimenting.
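As an illustration of swapping in another measure, here is a minimal sketch of the Pearson correlation coefficient (note that Pearson is a correlation, where values near 1.0 mean most similar, rather than a distance where smaller means closer):

```python
from math import sqrt

def pearson(p, q):
    """Pearson correlation coefficient between two equally sized vectors."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = sqrt(sum((a - mp) ** 2 for a in p))
    sq = sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq)

# Perfectly linearly related vectors correlate at 1.0
print(round(pearson([1, 2, 3], [2, 4, 6]), 6))  # 1.0
```

One practical difference: Pearson is insensitive to the scale of each vector, which is one reason it is sometimes preferred when attributes have very different ranges.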
You can read more about these measures in the following link. Take a special look at other normalization techniques, such as X^2 (X squared) normalization and decimal scaling, which are also worth trying.
Special thanks to the Blog IntelligenceMining that inspired me to write about this important topic!
Any doubts or suggestions are welcome!
Marcel Pinheiro Caraciolo