In this article, we will explore one of the basic steps in the knowledge discovery process, "Data Preprocessing", an important step that can be considered a fundamental building block of data mining. Preprocessing involves many steps, but they can be summarized as the extraction, transformation, and loading of the data. More precisely, it means modifying the source data into a different format which:

(a) enables data mining algorithms to be applied easily

(b) improves the effectiveness and the performance of the mining algorithms

(c) represents the data in an easily understandable format for both humans and machines

(d) supports faster data retrieval from databases

(e) makes the data suitable for a specific analysis to be performed.

Real-world data can be extremely difficult to interpret without data preprocessing. I am going to illustrate this through a simple example based on normalization. For readers with a database background: this normalization is completely different from the first, second, and third normal forms used in relational database design. We are talking about another kind of normalization, a data preprocessing technique. To see how a simple preprocessing technique can improve the effectiveness of an analysis by orders of magnitude, let's first talk about the Euclidean distance and how it can be used to evaluate the similarity between samples of data.

**Euclidean Distance**

Consider two points in a two-dimensional space, (p1, p2) and (q1, q2). The distance between these two points, d = sqrt((p1 - q1)^2 + (p2 - q2)^2), is given by the formula shown in Figure 01.

**Figure 01. Euclidean Distance Measure for 2-D Points**

This is called the Euclidean distance between the two points. The same concept extends easily to multidimensional space: if the points are (p1, p2, p3, p4, ...) and (q1, q2, q3, q4, ...), the distance between them is given by the formula in Figure 02.

**Figure 02**. Euclidean Distance for Multi-Dimensional Points
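The formula in Figure 02 is straightforward to express in code. A minimal sketch in Python:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 2-D example: the distance between (0, 0) and (3, 4) is sqrt(9 + 16)
print(euclidean_distance((0, 0), (3, 4)))  # → 5.0
```

The same function works unchanged for any number of dimensions, which is exactly what we need for the 3-D user profiles below.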

Now I am going to introduce a sample data set of mobile user profiles. The task is to find the mobile users who have similar profiles, that is, similar phone usage based on their call and SMS logs.

**Table 01.** Example Data Set (Mobile user profiles during one month)

In this data set, the first column is a unique identifier and the remaining columns contain the usage information. Ignoring the ID attribute, if we consider each column (attribute) as a dimension, each mobile user is represented by a point in 3-D space. The nice thing is that we can calculate the distance between each pair of these points. The distance between points 1 and 2 (users 1 and 2) can be calculated as below:

**Figure 03**. Euclidean Distance between users 01 and 02.

Looking at the data set above, we can see on inspection that users 1 and 4 have very similar profiles, that is, they use their phones in very similar ways. So in the three-dimensional space the distance between them should be small: these two points should be very close.
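To see numerically why this works (and where it goes wrong), here is a quick check. The original table is an image, so the two rows below are illustrative values consistent with the text (user 1 has 25000 and user 4 has 27000 minutes of calls), not the article's exact data:

```python
import math

# Hypothetical profiles: (duration calls, SMS count, data consumed).
# Illustrative values only; the article's actual table is not reproduced here.
user1 = (25000, 30, 5)
user4 = (27000, 33, 7)

d = math.sqrt(sum((a - b) ** 2 for a, b in zip(user1, user4)))
print(d)  # ≈ 2000.003 – almost entirely the 2000 gap in call duration
```

Even with these made-up numbers, the SMS and data differences barely register: the squared call-duration gap (2000² = 4,000,000) swamps the other terms (9 and 4).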

The following table (Figure 04) gives the Euclidean distance between each user and every other user in our data set.

**Figure 04.** Euclidean Distance Matrix for all users

We can see from the matrix above that the distance between users 1 and 4 is 2000.00, which is smaller than the distance between user 1 and any of the other users. So our interpretation seems to be correct. However, an alert reader will have noticed something significant here. If you look at the matrix again, you can see that the Euclidean distance values are very close to the differences in call duration. What does this mean?

Look: the difference between the call durations of users 1 and 4 is abs(25000 - 27000) = 2000, and the Euclidean distance between users 1 and 4 is 2000.30303! So this approach seems to be flawed: the Euclidean distances are dominated by the call duration amounts. As it stands, you cannot rely on the Euclidean distance to find mobile users with similar profiles.

How do we get out of this problem? We do not want the call duration attribute to dominate the Euclidean distance calculation. We can achieve this by applying one of the data preprocessing techniques, called normalization, to the data set.

**What is Normalization?**

Normalization scales the attribute data to fit into a specific range. There are many types of normalization; here we will look at one technique called Min-Max Normalization, which transforms a value A into a value B that fits in the range [C, D]: B = ((A - min) / (max - min)) * (D - C) + C, where min and max are the minimum and maximum values of the attribute. It is given by the formula below (Figure 05):

**Figure 05**. Min-Max Normalization Formula

Consider the example below: the duration calls value is 50000, and we want to transform it into the range [0.0, 1.0]. First we find the maximum value of duration calls, which is 55000, and the minimum value, which is 25000; then the new scaled value for 50000 will be:

**Figure 06**. Min-Max for Duration Calls Attribute from the User 01
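The min-max step is easy to express in code. A minimal sketch, applied to the article's example (the value 50000 in the range [25000, 55000], scaled into [0.0, 1.0]):

```python
def min_max_normalize(a, a_min, a_max, new_min=0.0, new_max=1.0):
    """Scale value a from the range [a_min, a_max] into [new_min, new_max]."""
    return (a - a_min) / (a_max - a_min) * (new_max - new_min) + new_min

# The article's worked example: (50000 - 25000) / (55000 - 25000)
print(min_max_normalize(50000, 25000, 55000))  # ≈ 0.833
```

The minimum of the range maps to 0.0, the maximum to 1.0, and everything else lands proportionally in between.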

Now let's apply the normalization technique to all three attributes in our data set. Consider the following maximum and minimum values:

Max duration calls = 55001

Min duration calls = 24999

Min SMS = 23

Max SMS = 33

Max consume data = 8

Min consume data = 3

The attributes need to be scaled to fit in the range [0.0 , 1.0]. Applying the min-max normalization formula above, we get the normalized data set as given below (Figure 07):

**Figure 07.** Data set after Normalization
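Putting the two steps together, we can normalize a small data set and recompute the distances. Only the per-attribute min/max values below come from the article; the three user rows are illustrative placeholders, since the original table is an image:

```python
import math

def min_max(a, lo, hi):
    """Scale a from [lo, hi] into [0.0, 1.0]."""
    return (a - lo) / (hi - lo)

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative rows (duration calls, SMS, consumed data) – not the article's exact table.
users = {1: (25000, 30, 5), 2: (50000, 25, 8), 4: (27000, 33, 7)}
# Per-attribute (min, max), taken from the article's listed values.
ranges = [(24999, 55001), (23, 33), (3, 8)]

norm = {uid: tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(row, ranges))
        for uid, row in users.items()}

# After normalization, all three attributes contribute comparably:
print(euclidean(norm[1], norm[4]))  # users 1 and 4 are now clearly the closest pair
print(euclidean(norm[1], norm[2]))
```

With every attribute squeezed into [0.0, 1.0], no single column can dominate the distance, which is exactly the effect the article describes.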

Now, given the fully normalized data set, we calculate the Euclidean distances between each user and every other user. They are given in the table below (**Figure 08**). Now compare the Euclidean distance calculations before and after normalization. We can see that the distances are no longer dominated by the call duration attribute, and they make much more sense now: the number of messages and the data consumed also contribute to the distance calculation. You can see how the normalization technique, and data preprocessing in general, can really help you get useful and correct information about your data before applying a machine learning or data mining algorithm.

To conclude this tutorial, it is worth noting that there are many measures similar to the Euclidean distance that can be used to calculate the similarity between two records, such as the Pearson coefficient, the Tanimoto coefficient, etc. You can try replacing the Euclidean distance with any of these measures and experiment.
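As an example of swapping in another measure, here is a sketch of the Pearson correlation coefficient used as a similarity score (values near 1 mean the two profiles move together; values near -1 mean they are opposites):

```python
import math

def pearson_similarity(p, q):
    """Pearson correlation coefficient between two equal-length vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    std_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    std_q = math.sqrt(sum((b - mean_q) ** 2 for b in q))
    return cov / (std_p * std_q)

# Perfectly correlated profiles score 1.0, even at different scales
print(pearson_similarity((1, 2, 3), (2, 4, 6)))
```

Note one design difference: unlike the Euclidean distance, Pearson correlation is insensitive to the scale of each vector, so it is naturally less affected by the call-duration domination problem, though normalizing the data is still good practice.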

You can read more about these measures in the following link. Also take a special look at other normalization techniques, such as X^2 (X Square) normalization and decimal scaling, which are worth trying as well.

Any doubts or suggestions are welcome!

Marcel Pinheiro Caraciolo
