## Monday, September 28, 2009

In this article, we will explore one of the basic steps in the knowledge discovery process, data preprocessing, which can be considered a fundamental building block of data mining. Preprocessing has many steps, but it can be summarized as the extraction, transformation and loading of the data. More precisely, it means converting the source data into a different format which:

(a) enables data mining algorithms to be applied easily

(b) improves the effectiveness and the performance of the mining algorithms

(c) represents the data in a format that is easy to understand for both humans and machines

(d) supports faster data retrieval from databases

(e) makes the data suitable for a specific analysis to be performed.

Real-world data can be extremely hard to interpret without data preprocessing. I am going to explain this through a simple example based on normalization. For readers with a database background: this normalization is completely different from the first, second and third normal forms used in relational database design. Here, normalization is a data preprocessing technique. To see how a simple preprocessing technique can substantially improve the effectiveness of an analysis, let's first talk about Euclidean distance and how it can be used to evaluate the similarity between data samples.

Euclidean Distance

Consider two points in a two-dimensional space, (p1,p2) and (q1,q2). The distance between these two points is given by the formula shown in Figure 01.

Figure 01. Euclidean Distance Measure for 2-D Points

This is called the Euclidean distance between two points. The same concept extends easily to multidimensional space. If the points are (p1,p2,p3,p4,...) and (q1,q2,q3,q4,...), the distance between them is given by the formula in Figure 02.

Figure 02. Euclidean Distance for Multi-Dimensional Points
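As a minimal sketch of the formula described above, the distance is the square root of the sum of squared coordinate differences, d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ...). The function name `euclidean_distance` is my own choice for illustration and does not come from any figure in this post.

```python
import math


def euclidean_distance(p, q):
    """Return the Euclidean distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared differences per dimension, then take the square root.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))        # 5.0
print(euclidean_distance((1, 2, 3), (1, 2, 3)))  # 0.0
```

The same function works for any number of dimensions, which is what lets us treat each attribute of a record as one coordinate.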

Now, I am going to introduce a sample data set of mobile user profiles. The task is to find the mobile users who have similar profiles, that is, who use their phones in a similar way according to their call and SMS logs.

Table 01. Example Data Set (Mobile user profiles during one month)

In this data set, the first column is a unique identifier and the remaining columns contain the profile information. Ignoring the ID attribute, if we treat each column (attribute) as a dimension, each mobile user is represented by a point in 3-D space. The good thing is that we can calculate the distance between any two of these points. The distance between points 1 and 2 (users 1 and 2) can be calculated as below:

Figure 03. Euclidean Distance between users 01 and 02.

Looking at the data set above, we can see by inspection that users 1 and 4 have very similar profiles, that is, they use their phones in much the same way. So in three-dimensional space the distance between them should be small, that is, these two points should be very close.

The following table (Figure 04) gives the Euclidean distance between each user and every other user in our data set.

Figure 04. Euclidean Distance Matrix for all users

We can see from the table above that the distance between users 1 and 4 is 2000.00, which is smaller than the distance between user 1 and any other user. So our interpretation seems to be correct. However, an alert reader will have noticed something important here. If you look at the table again, you can see that the Euclidean distance values are very close to the differences in call duration. What does this mean?

Look at this: the difference between the call durations of users 1 and 4 is abs(25000 - 27000) = 2000, and the Euclidean distance between users 1 and 4 is 2000.30303! So this approach seems to be flawed. The Euclidean distances are dominated by the call duration values, because that attribute spans a far larger range than the other two, so its squared differences swamp theirs in the sum. So we can't rely on the Euclidean distance of the raw data to find mobile users with similar profiles.
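The effect can be reproduced with a small sketch. Only the two call durations (25000 and 27000) come from the text; the SMS and data values below are hypothetical, chosen just to show the order of magnitude, so the result will not match the figure exactly.

```python
import math

# Profiles as (call duration, SMS count, data consumed).
# The SMS and data values are made up for illustration.
user_1 = (25000, 24, 4)
user_4 = (27000, 27, 5)


def euclidean_distance(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


distance = euclidean_distance(user_1, user_4)
duration_gap = abs(user_1[0] - user_4[0])

print(duration_gap)  # 2000
print(distance)      # only a tiny fraction above 2000
```

With these numbers the squared call-duration difference is 4,000,000, while the other two attributes together contribute only 10, so they have almost no influence on the result.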

How do we get ourselves out of this problem? We do not want the call duration attribute to dominate the Euclidean distance calculation. We can achieve this by applying a data preprocessing technique called normalization to the data set.

What is Normalization?
Normalization scales the attribute data to fit into a specific range. There are many types of normalization; here we will look at one technique called min-max normalization. Min-max normalization transforms a value A into a value B that fits in the range [C,D]. It is given by the formula below (Figure 05):

Figure 05. Min-Max Normalization Formula

Consider the example below: the call duration value is 50000 and we want to transform it into the range [0.0 , 1.0]. First we find the maximum call duration, which is 55000, and the minimum call duration, which is 25000. Then the new scaled value for 50000 will be:

Figure 06. Min-Max for Duration Calls Attribute from the User 01
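The min-max formula and the worked example above can be sketched as follows, using the standard form B = (A - min) / (max - min) * (D - C) + C. The function and parameter names are my own and are not taken from the figures.

```python
def min_max_normalize(value, old_min, old_max, new_min=0.0, new_max=1.0):
    """Scale value from [old_min, old_max] into [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    fraction = (value - old_min) / (old_max - old_min)
    return fraction * (new_max - new_min) + new_min


# Worked example from the text: 50000 with min 25000 and max 55000.
scaled = min_max_normalize(50000, 25000, 55000)
print(round(scaled, 4))  # 0.8333
```

The minimum of the attribute maps to the lower bound of the target range and the maximum maps to the upper bound, so every other value lands in between.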

Now let's apply the normalization technique to all three attributes in our data set. Consider the following maximum and minimum values:

Max call duration = 55001
Min call duration = 24999
Max SMS = 33
Min SMS = 23
Max data consumed = 8
Min data consumed = 3

The attributes need to be scaled to fit in the range [0.0 , 1.0]. Applying the min-max normalization formula above, we get the normalized data set as given below (Figure 07):

Figure 07. Data set after Normalization

Now, given the fully normalized data set, we will calculate the Euclidean distance between each user and every other user. The result is given in the table below (Figure 08):

Figure 08. Euclidean Distance Matrix for the data set
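The whole procedure, normalizing every attribute and then computing all pairwise distances, can be sketched as below. The data set is hypothetical (the original table is not reproduced here), and for simplicity each column is scaled with its own observed minimum and maximum rather than the slightly wider bounds listed earlier.

```python
import math

# Hypothetical data: user id -> (call duration, SMS count, data consumed)
profiles = {
    1: (25000, 24, 4),
    2: (40000, 27, 5),
    3: (55000, 32, 7),
    4: (27000, 25, 5),
    5: (53000, 30, 5),
}


def normalize_columns(rows):
    """Min-max scale every column of rows into [0.0, 1.0]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi != lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


ids = list(profiles)
scaled = dict(zip(ids, normalize_columns(list(profiles.values()))))
matrix = {(a, b): euclidean_distance(scaled[a], scaled[b]) for a in ids for b in ids}

# Print one row of the distance matrix per user.
for a in ids:
    print(a, [round(matrix[(a, b)], 3) for b in ids])
```

After scaling, no single attribute can contribute more than 1.0 to the squared distance, so all three attributes carry comparable weight.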

Now compare the Euclidean distances before and after normalization. We can see that the distances are no longer dominated by the call duration attribute, and they make more sense now. The number of messages and the data consumed now also contribute to the distance calculation. This shows how normalization, and data preprocessing in general, can help you get useful and correct information out of your data before you apply a machine learning or data mining algorithm.

To conclude this tutorial, it's worth noting that there are many other measures besides Euclidean distance that can be used to calculate the similarity between two records, such as the Pearson coefficient and the Tanimoto coefficient. You can try replacing Euclidean distance with any of these measures and experiment.
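As a starting point for that experiment, here is a minimal sketch of the two measures named above, assuming the common definitions: Pearson as covariance divided by the product of the standard deviations, and Tanimoto for real-valued vectors as a·b / (|a|² + |b|² - a·b). Unlike a distance, both are similarity scores, so larger values mean more similar records.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("Pearson is undefined for a constant sequence")
    return sum(a * b for a, b in zip(dx, dy)) / denominator


def tanimoto(x, y):
    """Tanimoto coefficient for real-valued vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + sum(b * b for b in y) - dot
    if denominator == 0:
        raise ValueError("Tanimoto is undefined for two zero vectors")
    return dot / denominator


print(pearson([1, 2, 3], [2, 4, 6]))   # 1.0
print(tanimoto([1, 0, 1], [1, 0, 1]))  # 1.0
```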

You can read more on these measures in the following link. Take a special look at other normalization techniques such as X^2 (X-squared) normalization and decimal scaling, which are also worth trying.

If you have any questions or suggestions, please feel free to share them!

Marcel Pinheiro Caraciolo

1. thank you for this informative article.:D

4. how are you choosing 50000 as A

1. Same question here. Shouldn't it be 25000 instead of 50000?

5. thanks a lot i understand it
many many thanks

6. This seems awfully similar to http://intelligencemining.blogspot.com.ee/2009/07/data-preprocessing-normalization.html almost edging on plagiarism. How do you comment?

8. This is a plain copy of my work. Please take this down or give due credits by providing the original link

9. great explanation!! THANK YOU!

12. Dears,
I have the following case, I am not sure witch method to use:
Year math english City GPA
2000 90 out of 100 100 out of 150 City1 80/100
2001 88 out of 120 80 out of 100 City2 90/100
.
.
.
as you can see the math mark some time out of 100 and some time out of 120, the same as English and other subjects, and from year to year and from City to city this also changes.
which method appropriate for this case???
