Data mining in practice: Learn about Bayesian Classifier Algorithm with Python

Saturday, September 19, 2009

Hi all,

In this article we continue our study of data mining algorithms. This time I will present a supervised learning algorithm called Bayesian classification. As in the previous articles on this blog, the algorithm is illustrated with a simple example that can be run with the Python interpreter.

The Algorithm

The Bayesian classification algorithm takes its name from Bayes' probability theorem, on which it is based. It is also known as the Naïve Bayes classification rule or simply as the Bayesian classifier.

The algorithm predicts class membership probabilities, such as the probability that a given tuple or pattern belongs to a particular class; in other words, it predicts the most probable class for the pattern. This type of prediction is called statistical classification, because it is based entirely on probabilities.

This classifier is also called simple or naïve because it assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence, and it is made to simplify the computations involved.
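Written as a formula, the independence assumption lets the classifier score each class as a prior multiplied by one factor per attribute. The symbols here are introduced only for illustration and do not appear in the original text: C is a class, x_1 to x_m are the attribute values of the pattern, and the proportionality sign hides a normalizing constant that is the same for every class.

```latex
P(C \mid x_1, \dots, x_m) \;\propto\; P(C) \prod_{i=1}^{m} P(x_i \mid C)
```

The class with the largest right-hand side is the predicted class.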

Regarding attributes, it is important to note that the Bayesian classifier gets better results when the attribute values are categorical rather than continuous. This should become clearer in the example shown later in this article.

Another characteristic of the algorithm is that it requires a data set that has already been classified, that is, a set of patterns with their associated class labels. Given this data set, also called the training set, the algorithm receives as input a new pattern (unknown data) whose class label is not known, and returns as output the class for which the calculated probability is highest. Unlike the K-means algorithm seen in the previous article, the Bayesian classifier does not need a metric to compare the 'distance' between instances, and it cannot classify the unknown pattern on its own, since it needs an already classified data set (the training set). Because of this requirement, Bayesian classification is considered a supervised data mining algorithm.

To see how the algorithm works, let's summarize it in the following three steps:

Step 01: Evaluation of the class probabilities (class prior probability).

In this step, the probability of each class in the training set is calculated. Most of the time we work with only two classes; for instance, a class that indicates whether or not a certain consumer buys a product, based on his or her demographic characteristics. The calculation divides the number of patterns of a specific class by the total number of patterns in the training set.
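As a minimal sketch of this step, the priors can be obtained by counting labels. The function name class_priors and the five labels below are hypothetical and are not taken from the article's data set or script.

```python
from collections import Counter


def class_priors(labels):
    """Return P(class) for each label: count(class) / total patterns."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical class labels of a tiny training set (illustration only).
labels = ["YES", "NO", "NO", "YES", "NO"]
priors = class_priors(labels)
print(priors)  # {'YES': 0.4, 'NO': 0.6}
```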

Step 02: Evaluation of the probabilities of the training set.

Now, for each value of each attribute in the training set, the probability is calculated for each possible class. This is the most time-consuming step of the algorithm: depending on the number of attributes, classes and patterns in the training set, many calculations may be needed before any results (probabilities) are obtained.

It is important to note that this calculation depends entirely on the attribute values of the unknown sample, that is, the sample whose class label you want to predict. Supposing that there are k classes and m attributes in the training set, k x m probabilities must be calculated.
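The k x m probabilities can be sketched as follows. This is only an illustration under stated assumptions: the function name conditional_probs, the tuple layout of the rows and the toy data are all hypothetical, and the article's own script may work differently.

```python
def conditional_probs(rows, labels, unknown):
    """Return {class: [P(attribute_i = unknown[i] | class) for each i]}."""
    result = {}
    for cls in set(labels):
        # Keep only the training rows that belong to this class.
        class_rows = [row for row, label in zip(rows, labels) if label == cls]
        result[cls] = [
            sum(1 for row in class_rows if row[i] == value) / len(class_rows)
            for i, value in enumerate(unknown)
        ]
    return result


# Hypothetical toy data: each row is (GENDER, MARITAL_STATUS).
rows = [("MALE", "SINGLE"), ("FEMALE", "MARRIED"), ("MALE", "MARRIED"),
        ("FEMALE", "SINGLE"), ("MALE", "SINGLE")]
labels = ["YES", "NO", "NO", "YES", "NO"]

probs = conditional_probs(rows, labels, unknown=("MALE", "SINGLE"))
print(probs["YES"])  # [0.5, 1.0]
```

With k = 2 classes and m = 2 attributes, the result holds 2 x 2 = 4 probabilities.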

Step 03: Evaluate the probabilities of the unknown data.

In this step, the probabilities calculated for the attribute values of the unknown data are multiplied together for each class. The result is then multiplied by the class probability calculated in Step 01.

With the probability of each class calculated, check which class has the maximum value for the unknown data. The algorithm ends by returning that class (the predicted class).
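A short sketch of this final step, assuming the priors and the per-attribute probabilities are already available as dictionaries. The function name classify and the numbers are hypothetical; math.prod requires Python 3.8 or later.

```python
import math


def classify(priors, likelihoods):
    """Return (best_class, scores), where score = prior * product of likelihoods."""
    scores = {cls: priors[cls] * math.prod(likelihoods[cls]) for cls in priors}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical values for a two-class, two-attribute problem.
priors = {"YES": 0.4, "NO": 0.6}
likelihoods = {"YES": [0.5, 1.0], "NO": [2 / 3, 1 / 3]}

best, scores = classify(priors, likelihoods)
print(best)  # YES
```

The scores are not normalized, so they do not sum to 1; only their relative order matters for choosing the class.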

Further information about Bayesian classification can be found at the links below:

http://en.wikipedia.org/wiki/Na%C3%AFve_Bayes
http://www.devmedia.com.br/articles/viewcomp.asp?comp=2637


Now, let's see a practical example of the simple Bayesian classification algorithm, including the evaluation of the probabilities.

Example of the Algorithm

In this example, let's consider a bank loan officer who wants to predict whether a client will default. For this, the bank must consider its historical client profiles and some of their attributes. To make the scenario and the data model easier to understand, let's use a training set with only 15 rows and 4 columns (attributes). Figure 01 shows the training set used in this example.




Figure 01. The historic client profiles (Training Set).

The attributes shown in Figure 01 are described below:

CLIENT_ID: This column holds a unique sequential integer identifier. The algorithm does not need this attribute, but it may help to organize the rows of the data set.

GENDER: This attribute identifies the gender of the client. The only allowed values are MALE or FEMALE.

MARITAL_STATUS: This attribute gives the marital status of the client. The only allowed values are MARRIED or SINGLE.

EDUCATION: This attribute gives the education level of the client. It can assume only four different values: HIGHSCHOOL_INCOMPLETE, HIGHSCHOOL_COMPLETE, GRADUATION_INCOMPLETE and GRADUATION_COMPLETE.

INCOMES: This attribute refers to the earnings of the client. It can only have the values ONE_MINIMUM_SALARY, TWO_MINIMUM_SALARIES and UPPER_THREE_MINIMUM_SALARIES.

DEFAULTER: This column holds the class label of the patterns. In this example the label shows whether the client is a bank defaulter (DEFAULTER=YES) or not (DEFAULTER=NO). To make this easier to see, the clients in the training set who are defaulters are marked in red and those who are not are marked in blue.

Let's apply Bayesian classification to an unknown pattern. Based on the data shown in Figure 01, the goal is to predict the class label (DEFAULTER) of the new client shown in Figure 02.





Figure 02. The new Client to be classified as DEFAULTER or NOT DEFAULTER


Step 01: Evaluation of the class probabilities.

There are only two classes: one indicating that the client is a bank defaulter (DEFAULTER=YES) and another indicating that the client is not (DEFAULTER=NO). Calculating the probabilities of the classes, we have:

Probability of DEFAULTER=YES: 4/15 = 0.2667

Probability of DEFAULTER=NO: 11/15 = 0.7333


Step 02: Calculate the probabilities of the training set.

For the first attribute of the unknown data, GENDER=MALE, let's calculate the probability given DEFAULTER=YES:

Probability of GENDER=MALE given DEFAULTER=YES: 2/4 = 0.5

And for the case where the client is male and is not a defaulter, we have:

Probability of GENDER=MALE given DEFAULTER=NO: 4/11 = 0.3636

For the rest of the attribute values of the unknown data, we have:

Probability of MARITAL_STATUS=SINGLE given DEFAULTER=YES: 1/4 = 0.25
Probability of MARITAL_STATUS=SINGLE given DEFAULTER=NO: 6/11 = 0.5455

Probability of EDUCATION=HIGHSCHOOL_INCOMPLETE given DEFAULTER=YES: 1/4 = 0.25
Probability of EDUCATION=HIGHSCHOOL_INCOMPLETE given DEFAULTER=NO: 4/11 = 0.3636

Probability of INCOMES=ONE_MINIMUM_SALARY given DEFAULTER=YES: 1/4 = 0.25
Probability of INCOMES=ONE_MINIMUM_SALARY given DEFAULTER=NO: 4/11 = 0.3636


Step 03: Calculate the probability of the unknown data.

Multiplying the probabilities of the unknown data for the case DEFAULTER=YES by the prior probability of DEFAULTER=YES calculated in Step 01, we have:

0.5 x 0.25 x 0.25 x 0.25 x 0.2667 = 0.0021

Multiplying the probabilities of the unknown data for the case DEFAULTER=NO by the prior probability of DEFAULTER=NO calculated in Step 01, we have:

0.3636 x 0.5455 x 0.3636 x 0.3636 x 0.7333 = 0.0192

As 0.0192 > 0.0021, the algorithm classifies the unknown pattern as DEFAULTER=NO; that is, based on the previous data (training set) and the Bayesian classification, this new client is more likely not to become a bank defaulter than to become one.
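The arithmetic of the three steps can be checked with a few lines of Python. This sketch uses only the counts quoted in the worked example (it does not read the training set of Figure 01) and exact fractions to avoid rounding in the intermediate values. The last printed column, the scores rescaled so that they sum to 1, is an addition for illustration and is not part of the original calculation.

```python
from fractions import Fraction as F

# Counts taken from the worked example: priors from Step 01,
# per-attribute probabilities from Step 02 (GENDER, MARITAL_STATUS,
# EDUCATION, INCOMES, in that order).
priors = {"YES": F(4, 15), "NO": F(11, 15)}
likelihoods = {
    "YES": [F(2, 4), F(1, 4), F(1, 4), F(1, 4)],
    "NO": [F(4, 11), F(6, 11), F(4, 11), F(4, 11)],
}

# Step 03: multiply the per-attribute probabilities by the class prior.
scores = {}
for cls, prior in priors.items():
    score = prior
    for p in likelihoods[cls]:
        score *= p
    scores[cls] = score

total = sum(scores.values())
for cls, score in scores.items():
    print(cls, round(float(score), 4), round(float(score / total), 4))
# YES 0.0021 0.0977
# NO 0.0192 0.9023

predicted = max(scores, key=scores.get)
print("DEFAULTER=" + predicted)  # DEFAULTER=NO
```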

To help classify such clients, let's use an implementation of the Bayesian classification algorithm that works with any number of attributes, as long as their values are nominal (categorical). This implementation is the Python script 'bayesian_classify.py'.


>>> python bayesian_classify.py 'C:\dataset.txt' 'DEFAULTER' 'MALE;SINGLE;HIGHSCHOOL_INCOMPLETE;ONE_MINIMUM_SALARY' 1

Figure 03. The bayesian_classify.py call

The first argument of the script is the path of the data set file. The second must list the columns used in the classification, separated by commas in a single string. The third gives the column that holds the class labels. The fourth is the list of attribute values of the unknown pattern, separated by commas and in the same order as the attributes in the second argument. Figure 03 shows the call of the script at the console for the example above.

The script takes one more parameter. If it is passed as 0, the script returns the probabilities of every class. If it is passed as 1, the script returns only the classification of the unknown data. Figure 04 shows the result of the call presented in Figure 03.

>>> DEFAULTER=NO
Figure 04. Execution of bayesian_classify.py returning the classification.

Finally, some caveats must be considered before using the Bayesian classifier. The training set must be correct and consistent, since a single row with a wrong value can compromise the final result. Another drawback arises when an attribute value is missing: its probability is then assigned 0, which makes it difficult to classify certain samples correctly. Techniques have been proposed to work around these problems, but they are beyond the scope of this article.
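For readers curious about one such technique, a common workaround for the zero-probability problem is add-one (Laplace) smoothing. The sketch below is not part of the article's script; the function name smoothed_prob and the counts are hypothetical.

```python
def smoothed_prob(value_count, class_count, n_values, alpha=1):
    """Add-alpha (Laplace) estimate of P(attribute = value | class).

    value_count: rows of the class that have this attribute value
    class_count: total rows of the class
    n_values:    number of distinct values the attribute can take
    """
    return (value_count + alpha) / (class_count + alpha * n_values)


# Hypothetical case: a value never seen among 4 rows of a class,
# for an attribute with 3 possible values.
unsmoothed = 0 / 4
smoothed = smoothed_prob(0, 4, 3)
print(unsmoothed, smoothed)  # 0.0 and 1/7 (about 0.143) instead of zero
```

Because every value receives the same extra count, the smoothed probabilities of all values of an attribute still sum to 1 within a class, and a single unseen value no longer forces the whole product to zero.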

To download the script of the Bayesian classification algorithm and the data set used in this article, click here.

I hope you enjoyed this article and learned more about data mining algorithms!

See you next time,

Marcel Pinheiro Caraciolo

5 comments:

  1. Thanks for the article. You explained it very well. But coming to the practical application, what if we want to do it using Java? How would we provide the training data set in Java? Should we store it in a DB?

  4. How do I run the code on Windows?
