In this article, we will explore one of the basic steps in the knowledge discovery process: data preprocessing, a fundamental building block of data mining. Preprocessing involves many steps, but it can be summarized as the extraction, transformation, and loading of the data. More precisely, it transforms the source data into a different format that:

(a) enables data mining algorithms to be applied easily

(b) improves the effectiveness and the performance of the mining algorithms

(c) represents the data in a format that is easy to understand for both humans and machines

(d) supports faster data retrieval from databases

(e) makes the data suitable for a specific analysis to be performed.

Real-world data can be extremely difficult to interpret without data preprocessing. I am going to explain this through a simple example based on normalization. For readers who come from a database background: this is completely different from the first, second, and third normal forms used in relational database design. We are talking about another kind of normalization, a data preprocessing technique. To see how a simple preprocessing technique can improve the effectiveness of an analysis by orders of magnitude, let's first talk about Euclidean distance and how it can be used to evaluate the similarity between data samples.

**Euclidean Distance**

Consider two points in a two-dimensional space, (p1, p2) and (q1, q2). The distance between these two points is given by the formula shown in Figure 01.

**Figure 01. Euclidean Distance Measure for 2-D Points**

This is called the Euclidean distance between two points. The same concept extends easily to multidimensional space. If the points are (p1, p2, p3, p4, ...) and (q1, q2, q3, q4, ...), the distance between them is given by the formula in Figure 02.

**Figure 02**. Euclidean Distance for Multi-Dimensional Points
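As a sketch of this formula, a small Python function (using nothing beyond the standard library) might look like this:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length points:
    sqrt(sum over i of (p_i - q_i)^2)."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# The classic 3-4-5 right triangle in 2-D:
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The same function works unchanged for points of any dimension, which is exactly what we need for the multi-attribute user profiles below.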

Now, I am going to introduce a sample data set of mobile user profiles. The task is to find mobile users with similar profiles, that is, users whose phone usage is similar based on their call and SMS logs.

**Table 01.** Example Data Set (Mobile user profiles during one month)

In this data set, the first column is a unique identifier and the remaining columns contain usage information. Ignoring the ID attribute, if we treat each column (attribute) as a dimension, then each mobile user is represented by a point in 3-D space. The good thing is that we can calculate the distance between any pair of these points. The distance between points 1 and 2 (users 1 and 2) can be calculated as below:

**Figure 03**. Euclidean Distance between users 01 and 02.

Looking at the data set above, we can see that users 1 and 4 have almost identical profiles, that is, they use their phones very similarly. So in three-dimensional space the distance between them should be small: these two points should be very close.

The following table (Figure 04) gives the Euclidean distance from each user to every other user in our data set.

**Figure 04.** Euclidean Distance Matrix for all users
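Since the data set itself appears here only as an image, the following sketch computes such a distance matrix from hypothetical profile values (call duration, SMS sent, data consumed), chosen purely for illustration:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical profiles keyed by user ID: (duration of calls, SMS sent,
# data consumed). These values are illustrative, not the article's table.
users = {
    1: (25000, 30, 5),
    2: (40000, 25, 3),
    3: (55000, 23, 8),
    4: (27000, 33, 6),
    5: (35000, 28, 4),
}

# Print the upper triangle of the pairwise distance matrix.
for i in users:
    for j in users:
        if i < j:
            print(f"d(user {i}, user {j}) = {euclidean_distance(users[i], users[j]):.2f}")
```

Even on these made-up numbers, d(user 1, user 4) comes out around 2000 while every other pair involving user 1 is an order of magnitude larger, because the call duration differences swamp everything else.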

We can see from the matrix above that the distance between users 1 and 4 is about 2000, which is small compared with the distances between user 1 and the other users. So our interpretation seems to be correct. However, an alert reader will have noticed a significant observation here. If you look at the list again, you can see that the Euclidean distance values are very close to the differences in call duration. What does this mean?

Look closely: the difference between the call durations of users 1 and 4 is abs(25000 - 27000) = 2000, and the Euclidean distance between users 1 and 4 is 2000.30303! So this approach seems to be flawed. The Euclidean distances are dominated by the call duration attribute, so we cannot rely on them to find mobile users with similar profiles.

How do we get out of this problem? We do not want the call duration attribute to dominate the Euclidean distance calculation. We can achieve this by applying a data preprocessing technique called normalization to the data set.

**What is Normalization?**

Normalization scales attribute data to fit into a specific range. There are many types of normalization; here we will look at one technique called min-max normalization. Min-max normalization transforms a value A into a value B that fits in the range [C, D]. It is given by the formula below (Figure 05):

**Figure 05**. Min-Max Normalization Formula

Consider the example below: suppose a call duration value is 50000, and we want to transform it into the range [0.0, 1.0]. First we find the maximum call duration, 55000, and the minimum call duration, 25000. The new scaled value for 50000 will then be:

**Figure 06**. Min-Max for Duration Calls Attribute from the User 01
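A minimal sketch of the min-max formula, checked against the worked example above:

```python
def min_max_normalize(a, a_min, a_max, c=0.0, d=1.0):
    """Scale a value A from [a_min, a_max] into the target range [C, D]:
    B = (A - a_min) / (a_max - a_min) * (D - C) + C."""
    return (a - a_min) / (a_max - a_min) * (d - c) + c

# Worked example from the text: 50000 in [25000, 55000] -> [0.0, 1.0]
print(min_max_normalize(50000, 25000, 55000))  # 0.8333...
```

Note that the minimum of the attribute always maps to C and the maximum to D, so the target range defaults of C = 0.0 and D = 1.0 give the familiar [0, 1] scaling.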

Now let's apply the normalization technique to all three attributes in our data set. Consider the following maximum and minimum values:

Max duration calls = 55001

Min duration calls = 24999

Max SMS = 33

Min SMS = 23

Max data consumed = 8

Min data consumed = 3

The attributes need to be scaled to fit in the range [0.0, 1.0]. Applying the min-max normalization formula above, we get the normalized data set given below (Figure 07):

**Figure 07.** Data set after Normalization
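Using the maxima and minima listed above, the column-wise scaling can be sketched as follows (the user rows are hypothetical, since the original table is shown only as an image):

```python
def min_max_normalize(a, a_min, a_max, c=0.0, d=1.0):
    """Min-max scaling of a value A from [a_min, a_max] into [C, D]."""
    return (a - a_min) / (a_max - a_min) * (d - c) + c

# Per-attribute (min, max) pairs from the text:
# duration of calls, SMS sent, data consumed.
ranges = [(24999, 55001), (23, 33), (3, 8)]

# Hypothetical user rows; each tuple is (duration, sms, data).
users = [(25000, 30, 5), (40000, 25, 3), (55000, 23, 8), (27000, 33, 6)]

# Scale each attribute with its own min and max.
normalized = [
    tuple(min_max_normalize(v, lo, hi) for v, (lo, hi) in zip(row, ranges))
    for row in users
]

for row in normalized:
    print(tuple(round(v, 4) for v in row))
```

The key point is that each column is scaled with its *own* minimum and maximum, so after the transform every attribute lives in [0.0, 1.0] regardless of its original magnitude.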

Now, given the fully normalized data set, we calculate the Euclidean distances between each user and every other user. They are given in the table below (**Figure 08**).

Now compare the Euclidean distance calculation before and after normalization. We can see that the distances are no longer dominated by the call duration attribute, and they make much more sense now: the number of messages and the data consumed also contribute to the distance calculation. You can see how normalization, and data preprocessing in general, can really help you get correct and useful information about your data before applying a machine learning or data mining algorithm.
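As a sanity check, here is a sketch that recomputes the distances before and after min-max normalization. The attribute ranges come from the text; the user rows are illustrative, not the article's table:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_normalize(a, a_min, a_max):
    """Min-max scaling into [0.0, 1.0]."""
    return (a - a_min) / (a_max - a_min)

# Hypothetical profiles (duration, sms, data) and the text's attribute ranges.
users = {1: (25000, 30, 5), 2: (40000, 25, 3), 4: (27000, 33, 6)}
ranges = [(24999, 55001), (23, 33), (3, 8)]

norm = {
    uid: tuple(min_max_normalize(v, lo, hi) for v, (lo, hi) in zip(row, ranges))
    for uid, row in users.items()
}

# Before normalization: call duration dominates completely.
print(round(euclidean_distance(users[1], users[4]), 2))  # 2000.0
print(round(euclidean_distance(users[1], users[2]), 2))  # 15000.0
# After normalization: every attribute contributes on the same scale.
print(round(euclidean_distance(norm[1], norm[4]), 4))
print(round(euclidean_distance(norm[1], norm[2]), 4))
```

After normalization, users 1 and 4 remain the closest pair, but the SMS and data attributes now visibly influence the result instead of being drowned out by the raw call durations.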

To conclude this tutorial, it is worth noting that there are many measures similar to Euclidean distance that can be used to calculate the similarity between two records, such as the Pearson correlation coefficient, the Tanimoto coefficient, etc. You can try replacing Euclidean distance with any of these measures and experiment.

You can read more on these measures in the following link. Take a special look at other normalization techniques, such as X² (X square) normalization and decimal scaling, which are also worth trying.

Any doubts or suggestions, please feel welcome to share them!

Marcel Pinheiro Caraciolo
