Pages

Showing posts with label machine learning. Show all posts
Showing posts with label machine learning. Show all posts

Ruby in the world of Recommendations (also machine learning, statistics and visualizations)

Tuesday, September 17, 2013

Hello everyone!

I am back with lots of news and articles! I've been quite busy but I returned. In this post I'd like to share my last presentation that I gave at Frevo On'Rails Pernambuco Ruby Meeting at Recife, PE. My  Ruby developer colleagues around Recife invited me to give a lecture there.  I was quite excited about the invitation and instanlty I accepted.

I decided to research more about scientific computing with Ruby and recommender libraries written focused on Ruby either.  Unfortunately the escossistem for scientific environment in Ruby is still at very beginning.  I found several libraries but most of them were abandoned by their developers or commiters.  But I didn't give up and decided to refine my searches. I found a espectular and promising work on scientific computing with ruby called SciRuby. The goal is to por several matrix and vector representations with Ruby using C and C++ at backend. It remembered me a lot the beginnings of the library Numpy and Scipy :)

About the recommenders, I didn't find any deep work as Mahout, but I found a library called Recommendable that uses memory-based collaborative filtering.  I really liked the design of the library and the workaround of the developer on making the linear algebra operations with Redis instead of Ruby :D

All those considerations and more insights I put on my slides, feel free to share :)



I hope you enjoyed, and even I love Python, I really like programming another languages :)


Regards,

Marcel

Machine learning and Data Mining - Association Analysis with Python

Friday, January 11, 2013

Hi all,


Recently I've been working with recommender systems and association analysis.  This last one, specially, is one of the most used machine learning algorithms to extract from large datasets hidden relationships. 

The famous example related to the study of association analysis is the history of the baby diapers and beers. This history reports that a certain grocery store in the Midwest of the United States increased their beers sells by putting them near where the stippers were placed. In fact, what happened is that the association rules  pointed out that men bought diapers and beers on Thursdays. So the store could have profited by placing those products together, which would increase the sales.

Association analysis is the task of finding interesting relationships in large data sets. There hidden relationships are then expressed as a collection of association rules and frequent item sets.  Frequent item sets are simply a collection of items that frequently occur together.  And association rules suggest a strong relationship that exists between two items.  Let's  illustrate these two concepts with an example.



A list of transactions from a grocery store is shown in the figure above. Frequent items are a list of items that commonly appear together.  One example is {wine, diapers, soy milk}.  From the data set we can also find an association rule such as diapers -> wine.  This means that if someone buys diapers, there is a good chance they will buy wine. With the frequent item sets and association rules retailers have a much better understanding of their customers.  Although common examples of association rulea are from the retail industry, it can be applied to a number of other categories, such as web site traffic, medicine, etc.

How do we define these so called relationships ? Who defines what is interesting ?  When we are looking for frequent item sets or association rules, we must look two parameters that defines its relevance.  The support of an itemset, which is defined as the percentage of the data set which containts this itemset. From the figure above the support of {soy milk} is 4/5. The support of {soy milk, diapers } is 3/5 because of the five transactions, three contained both soy milk and diapers. Support applies  to an itemset, so we can define a minimum support and only get the itemsets that meet that minimum support. And the cofidence which is defined for an association rule like diapers->wine. The confidence for this rule is defined as support({ diapers, wine})/ support(diapers).  From the figure above, support of {diapers, wine} is 3/5. The support for diapers is 4/5, so the confidence for diapers -> wine  is: 3/4 = 0.75.  That means in 75% of the items in our data set containing diapers our rule is correct.

The support and confidence are ways we can quantify the success of our association analysis. Let's assume we wanted to find all sets of items with a support greater than 0.8. How would we do that ?  We could generate a list of every combination of items then go and count how frequent that occurs. But this is extremely slow when we have thousands of items for sale. In the next section, we will analyze the Apriori principle which address this problem and reduces the number of calculations we need to do to learn association rules.

To sum up, one example of rule extracted from associatio analysis:

"If the client bought the product A1 and the product A4, so he also bought the product A6 in 74% of the cases. This rule applies on 20% of the studied cases."
The first parameter is the support, that is, the percentage of cases that the product A1 appears in the frequent item sets.  In the rule above, the value 20% points the support, which reports that the rule is applied on 20% of the studied cases.  The second parameter is the confidence, which tells the percentage of occurrences of the rule.  So the value 74% represents the confidence,  where in the 20% of the rules where products A1 and A4 are found, 74% of them we also can find the product A6.

Apriori Algorithm

Let's assume that we are running a market store with a very limited selection. We are interested in finding out which items were purchased together. We have only four items: item0, item1, item2 and item3. What are all the possible combinations that can be purchased ? We can have one item, say item0 alone, or two items, or threee items, or all of the items together. If someone purchased two of item0 and four of item2, we don't care. We are concerned only that they purchased one or more of an item.



A diagram showing all possible combinations of the items is shown in the figure above.  To make easier to interpret, we only use the item number such as 0 instead of item0.  The first set is a big Ø, which means the null set or a set containing no items. Lines connecting item sets indicate that two or more sets can be combined to form a larger set.


Remember that our goal is to find sets of items that are purchased together frequently. The support of a set counted the percentage of transactions that contained that set. How do we calculate this support for a given set, say {0,3}? Well. we go through every transaction and ask, ”Did this transaction contain 0 and 3?” If the transaction did contain both those items, we increment the total. After scanning all of our data, we divide the total by the number of transactions and we have our support. This result is for only one set: {0,3}. We will have to do this many times to get the support for every possible set. We can count the sets in Figure below and see that for four items, we have to go over the data 15 times. This number gets large quickly. A data set that contains N possible items can generate 2N-1 possible itemsets. Stores selling 10,000 or more items are not uncommon. Even a store selling 100 items can generate 1.26*1030 possible itemsets. This would take a very, very long time to compute on a modern computer.
To reduce the time needed to compute this value, researchers identified something called the Apriori principle. The Apriori principle helps us reduce the number of possible interesting itemsets. The Apriori principle is: if an itemset is frequent, then all of its subsets are frequent. In Figure below this means that if {0,1} is frequent then {0} and {1} have to be frequent. This rule as it is doesn’t really help us, but if we turn it inside out it will help us. The rule turned around reads: if an itemset is infrequent then its supersets are also infrequent, as shown in Figure  below.



As you can observe, the shaded itemset {2,3} is known to be infrequent. From this knowledge, we know that itemsets {0,2,3}, {1,2,3}, and {0,1,2,3} are also infrequent. This tells us that once we have computed the support of {2,3}, we don’t have to compute the support of {0,2,3}, {1,2,3}, and {0,1,2,3} because we know they won’t meet our requirements. Using this principle, we can halt the exponential growth of itemsets and in a reasonable amount of time compute a list of frequent item sets.



Finding frequent item sets


Now let's go through the Apriori algorithm.  We first need to find the frequent itemsets, then we can find association rules. The Apriori algorithm needs a minimum support level as an input and a data set. The algorithm will generate a list of all candidate itemsets with one item. The transaction data set will then be scanned to see which sets meet the minimum support level. Sets that do not meet the minimum support level will get tossed out. The remaining sets are then combined to make itemsets with two elements. Again, the transaction data set will be scanned and itemsets not meeting the minimum support level will get tossed. This procedure is repeated until all sets are tossed out.

We will use Python with Scipy and Numpy to implement the algorithm. The pseudocode for the Apriori Algorithm would look like this:


While the number of items in the set is greater than 0:
               Create a list of candidate itemsets of length k 
              Scan the dataset to see if each itemset is frequent
             Keep frequent itemsets to create itemsets of length k+1


The code in Python is shown below:



The source is docummented so  you can easily understand all the rules generation process.

Let's see in action ?!



First, we created our first candidate item set C1. C1 contains a list of all items in frozenset. After we created D, which is just a dataset in the form of list of sets (set is a list of unique elements).  With everything in set form we removed items that didn't meet our minimum support. So in our example 0.5 was the minimum support level chosen.  The result is represented by the list L1 and support_data. These four items that make our list L1 are the sets that occur in at least 50% of all transactions. Item 4 didn't make the minimum support level, so it's not a part of L1. By removing it, we have removed more work we needed to do when wen find the list of two-item-sets. Remember our pruned tree above?!



Two more functions now were used: apriori() and aprioriGen(). In apriori() given a data set and a support level, it will generate a list of candidate itemsets. AprioriGen() is responsible for generate the next list of candidate itemsets and scanD throw out itemsets that don't meet the minimum support levels. See the result of aprioriGen() it generates six unique two-item-sets.  If we run it with our apriori function we will check that L contains some lists of frequent itemsets that met a minimum support of 0.5.  The variable support_data is just a dictionary with the support values of our itemsets.  We don't care about those values now, but it will be useful in the next section.   We have now which items occur in 50% of all transactions and we can begin to draw conclusions from this.  In the next section we will generate association rules to try to get an if-then understanding of the data.


Mining association rules from frequent item sets


We just saw how we can find frequent itemsets with Apriori Algorithm. Now we have to find the association rules. To find them, we first start with a frequent itemset.  The set of items is unique, but we can check if there is anything else we can get out of these items. One item or one set of items can imply another item. From the example above, if we have a frequent itemset: {soy milk lettuce}, one example of an association rule is: soy milk -> lettuce.  This means if someone purchases soy and milk, then there is a statistically significant chance that he will purchase lettuce. Of course this is not always true, that is, just because soy milk -> lettuce is statistically significant, it does not mean that lettuce -> soy milk will be statistically significant.

We showed the support level as one measurement for quantifying the frequency of an itemset. We also have one for association rules. This measurement is called the confidence. The confidence for a rule P-> H is defined as: support (P|H)/support(P).  Remember that the | symbol is the set union. P|H means all the items in set P or in set H.  We already have the support values for all frequent itemsets provided in apriori function, remember ?  Now, to calculate the confidence, all we have todo is call up those support values and do one divide.

From one frequent itemset, how many association rules can we have ?  If you see the figure below, it shows all the different possible associations rules from the itemset {0, 1, 2, 3}. To find relevant rulese, we simply generate a list of possible rules, then test the confidence of each rule. If the confidence does not meet our minimum requirement, then we throw out the rule.




Usually we don't this. Imagine the number of rules if we have a itemset of 10 items, it would be huge. As we did with Apriori Algorithm we can also reduce the number of rules that we need to test. If the rule does not meet the minimum confidence requirement, then subsets of that rule also will not meet the minimum. So assuming that the rule 0,1,2 -> 3 does not meet the minimum confidence, so any rules where is subset of 0,1,2 ->3 will also not meet the minimum confidence.


By creating a list of sets with one item on the right hand size, testing them and merging the remaining rulese, we can create a list of rules of two items and do this recursively. Let's see the code below:



Let's see in action:



As you may seen in the fist call, it gives us three rules: {1} -> {3} , {5} -> {2} and {2} -> {5}. It is interesting to see the rules with 2 and 5 can be flipped around, but not the rule with 1 and 3. If we lower the confidencer threshold to 0.5, we may see more rules.  Let's go now to a real-life dataset and see how it does work.


Courses  at Atepassar.com

It's time to see these tools on a real life dataset. What can we use ?  I love educational data, so let's see an example using the Atepassar social network data. Atepassar is a popular social-learning environment where people can organzie the studies for public exams in order to get a civil job. I work there as Chief Scientist!






The data set provided can give us some interesting insights about the profiles of the users who buys on-line courses at Atepassar. So we will use the apriori() and generateRules() functions to find interesting information in the transactions.

So let's get started.  First I will create a transaction database and then generate the list of frequent itemsets and association rules.


Collect Dataset from Atepassar.

So let's get started.  First I will create a transaction database and then generate the list of frequent itemsets and association rules.

I have encoded some information related to our transactions, for example, the client's user profile such as state, gender and degree. I also extracted the bought courses and the category of the course such as 'national', 'regional', 'state' or 'local'.

This give us important hypothesis to test like "Male people buy more preparation courses for national exams?", "People from Sao Paulo tends to buy more for state exams ?" or even if male people graduated buys courses for 'local' exams ?"  Those are some examples of explanations that we may get with association analysis.

After loading my dataset we can see inside each transaction:




Let's see one of them in order to explain it:


So one transaction is composed by a woman who bought a preparation course for a national exam (Brazilian exam) and lives at Distrito Federal (DF).

Now,  let's use our Apriori algorithm. But first let's make a list of all the transactions. We will throw out the keys, which are the usernames. We are not interested in that.  What we are interested is the items and associations among them.  Now let's going to mine the frequent itemsets and association rules.

Applying the Apriori algorithm developed earlier,  and use support settings of 5%, we may see several of frequent itemsets. 



Now let's try to generate the association rules. We begin with minimum confidence of 70%.  



If we increase the confidence even more, we don't see any rules at all. That means lots of distinct transactions or  small amount of data unfortunately.



I will be with confidence of 70%. So we have the following rules:




What can we say about those insights ? Do you remember the support level of 5% ? That means that the rules above show up in at least 5% of all transactions. To sum up,  we are going to see these association rules in at least 5% of the transactions, and in the case of  DF --> Nacional, this rule is right 83,8% of the time.  People from Brasilia really purchase for better positions in the government,  or even get the best salaries! 

Conclusions

In this post I presented one of the most traditional tools in data mining to find interesting relationships in a large set of data.  We saw that there are two ways we can quantify the relationships: using frequent itemsets, which shows items that commonly appear in the data together. And the second one are the association rules which imply an if -> t hen relationship between items.

But wait,  before start using it at your dataset, I have to give sou some warnings.  Finding different combinations of items can be a very consuming task and expensive in terms of computer proccessing.  So you will need more intelligent approaches to find frequent itemsets in a small amount time.  Apriori is one approach that tries to reduce the number of sets that are chacked against the dataset. With Support measure and Confidence we can combine both to generate association rules.

The main problem of Apriori Algorithm is it requires to scan over the dataset each time we increase the length of our frequent itemsets. Imagine with a huge dataset, it can slow down the speed of finding frequent itemsets.  Alternative techniques for this issue is the FP-Growth algorithm, for example. It only needs to go over the dataset twice, which can led to a better performance.


The code in Python is shown here.



I hope you enjoyed this article, the first of 2013!

Regards,


Marcel Caraciolo

Keynotes and Tutorial at PythonBrasil 8 about education, data mining and big data!

Saturday, November 17, 2012

Hi all,

Next week I will be at PythonBrasil,  a annual meeting that joins all Python developers around Brazil to discuss programming, projects and opportunities to see some old friends, make new ones and even some business there. 


It will be a great event with more than 50 keynotes about Python and related applications and projects. I will host some keynotes and a hands-on tutorial during the event.

On Thurday, 22th - 2:00PM - 6:00PM


           I will present some learning techniques implemented with Python and Scikit-learn such as Linear Regression, Naive Bayes and several others. For  anyone who  knows Python and familiar with basic statistics.

On  Friday, 23th - 10:40AM- 11:10AM

          I will show how Python is used nowadays as main platform in several on-line e-learning system such as the innitatives Coursera, Udacity, CodeAcademy, Khan Academy worldwide. I also  present  my experiences with Python in brazilian enviroment such as Atepassar and PyCursos.  Python is also a good language to start learning and I will present how I and our team teached  it for more than de 400 students around Brazil.

On Saturday, 24th - 4:10 PM - 4:40 PM
       In this presentation I will show how I used Python, Hadoop and MapReduce to scale up our recommender system currently running at Atepassar, an educational  brazilian social network with         more than 150 thousand students.   Examples, tips,  guides to anyone who wants explore the         power of distributed computing at Amazon as also the current effort at the framework Crab to         include those features.

Those are my talks and tutorials above. There are several other talks quite interesting that I want to watch, I extremely recommend you to attend this conference, it will be awesome.

For you who wants to follow up the event,  follow @PythonBrasil at Twitter and like the PythonBrasil8 fanpage at Facebook.

Look for me at the event, for anyone who wants to talk about education, python, entrepreneurship, data mining and big data ;D

See you at the conference!

Marcel Caraciolo

Some data and machine learning talks videos from PyCon Us 2012

Monday, March 12, 2012



Unfortunately this year I couldn't participate of the PyCon, the world meeting of Python developers. I had a poster accepted this year, but I had some problems that didn't allow me to go. Apart from that, the best part is that I could watch talks, tutorials and keynotes from the congress.

Thanks to the PyVideo team, they uploaded all the videos from the PyCon quickly! The best ones that I enjoyed (of course related to data mining, natural language processing and machine learning) I will share with you:







17. IPython: Python at your fingertips




There are some of talks and tutorials that I enjoyed a lot. I recommend you to take a look at all the videos available from PyCon at this link.

I really hope next  year I can go to this meeting :)

Cheers,

Marcel

Guide to Recommender Systems Book Online

Friday, February 24, 2012



Hi all,

This year one of my goals is to write a book such as a guide to teach recommender systems for programmers. I know there are several textbooks that focus on providing a theorical foundation for recommender systems, and as result, may seem difficult to understand. For programmers that want to learn how to start to use or understand the components of a recommender system, this book is what they are looking for.  

This guide follows a learn-by-doing approach. Therefore, I will use theory and apply it through the exercises and experiment with Python code.  I hope when you complete the book you will be able to understand how to build a recommender system and give you the first steps to apply them at your own systems. The textbook is laid out as a series of small steps that will guide you for undestanding the recommender system techniques. 

This book is available for download for free under a Creative Commons license. This project is also leaded by my colleague Ricardo Caspirro, who will review and translate it to portuguese language.

Below I provide the table of contents of the book.


Guide to Recommender Systems


The link for the online guide is available here.


http://muricoca.github.com/recommendation-lectures/index.html


Table of Contents


Chapter 01: Introduction to Recommender Systems

Finding out what recommender system is and what problems it solves. And a fast review of what you will be able to do when you finish this book.

Chapter 02: Collaborative Filtering

This chapter focus on how you can use the state-of-the-art techniques of collaborative filtering that makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from similar users (user-based) or similar items( item-based).


Chapter 03: Content Based Filtering
Recommender systems that  suggest an item to a user based upon a description of the item and a profile of the user's interests. Although thedetails of various systems differ, content-based recommendation systems share in common a means for describing the items that may be recommended, a means for creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend. 


Chapter 04: Hybrid Based Filtering
This chapter will focus how to pick the best features of collaborative and of content and mix them to build hybrid recommender systems. It will present the current work on this field and an example of  how it works and how you can decide the best strategy to select.

Chapter 05:  Model - Based Recommenders
Techniques that will include memory-based techniques or data mining techniques such as association analysis, symbolic data analysis and classification/clustering techniques will be covered in this chapter.

Chapter 06:  Evaluation of Recommender Systems
This chapter starts with a short description on how to evaluate the recommender systems and the commonly used metrics for compare the recommender algorithms in the development and deployment stages.

Chapter 07:  Recommender Systems and Distributed-Computing
Recommender Systems suffer with sparse matrices where the user x items preferences are sparsed (lots of missing values - preferences). It results on large datasets with millions of items, users and preferences.  For this task it is considered to use distributed computing techniques such as map-reduce to distribute the recommendations. This chapter will cover those topics.

Chapter 08:  Study Case
It will present a study case of a mobile recommender system for recommend users to another users using several techniques showed above and how we tested and deployed it.

Chapter 09:  Recommender Systems the Next Generation
This chapter brings the next generation of recommender systems, describing what the research is going after in several fields such as ubiquity, semmantics, etc.

Chapter 10:  Meeting Python-RecSys Framework
It will present the Python-RecSys framework for building recommender systems with Python in a easy way. It will describe how to build or test already implemented techniques or develop new ones and deploy them with frameworks Web and REST.


This book is under development, please let me know if there are any suggestions or corrections to make over one of those chapters. If you see that there is a topic that needs an extra chapter or a topic that I am missing, please also let me know and comment.


I hope you enjoy this work, specially the developers!


Regards,

Marcel Caraciolo

Machine Learning with Python: Meeting TF-IDF for Text Mining

Monday, December 19, 2011

3Hi all,

This month I was studying about information retrieval and text mining, specially how to convert the textual representation of information into a Vector Space Model (VSM).  The VSM is an algebraic model representing the importance of a term (tf-idf) or even the absence or presence (Bag of Words) of it in a document. I'd like to mention the excellent post from the researcher Christian Perone at his blog Pyevolve about Machine learning and Text Mining with TF-IDF, a great post to read.

I decided in this post to be shorter and give some examples using Python . I expect at the end of this post you feel confortamble to use tf-idf at your tasks handling with text mining.

By the way, I extremely recommend you to check the scikit.learn machine learning toolkit. There is a whole package to work with text classification, including TF-IDF with Python!


What is TF-IDF ?

Term Frequency - Inverse Document Frequency is a weighting scheme that is commonly used in information retrieval tasks. The goal is to model each document into a vector space, ignoring the exact ordering of the words in the document while retaining information about the occurrences of each word.

It is composed by two terms: one first computes the normalized Term Frequency, which is the number of times a word appears in a documnet, divided by the total number of words in that document. Then, the second term is the Inverse Document Frequency, which is computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the term ti appears. Or, in symbols:



and 




The TF-IDF gives how important is a word to a document in a collection, since it takes in consideration not only the isolated term but also the term within the document collection. The intuition is that a term that occurs frequently in many documents is not a good discriminator ( why emphasize a term which is almost present in the entire corpus of your documents ?)  So it will scale down the frequent terms while scaling up the rare terms; for instance, a term that occurs 10 times more than another isn't 10 times more important thant it.

For computing the TF-IDF weights for each document in the corpus, it is required in the corpus a series of steps:  1) Tokenize the corpus  2)  Model the Vector  Space  and 3) Compute the TF-IDF weight for each document in the corpus.

Let's going through each step:


Tokenization


First we need to tokenize the text. To do this, we can use the NLTK library which is a collection of natural language processing algorithms written in Python. The process of tokenizing the documents in the corpous is a two steps:  First the text is splint into sentences, and then the sentences are split into the individual words. It is important to notice that there are several words that are not relevant, that is, terms like "the, is, at, on", etc...  aren't going to help us, so in the information extraction, we ignore them. Those words are commonly called stop words and they are present in almost all documents, so it is not relevant for us. In portuguese we also have those stop words such as (a, os , as , os, um , umas, que, etc.)

So considering our example below:


We will tokenize this collection of documents and represent them as vectors (rows) of a matrix with |D| x F shape, where |D|  is the cardinality of the document space, or how many documents we have and the F is the number of features, in our example it is represented by the vocabulary size.

So the matrix representation of our vectors above is:



As you have noticed, these matrices representing the term frequencies (tf) tend to be very sparse (lots of  zero-elements),  so you will usually see the representation of these matrices as sparse matrices. The code shown below will tokenize each document in the corpus and compute the term frequencies.



Model the Vector Space

Now that each of the documents in the corpus has been tokenized, the next step is to compute the document frequency quantity, that is, for each term, how many documents that term appears in. Before going to IDF, it is important to normalize the term-frequencies. Why ?  Imagine that we have a repeated term in document with porpuse of improving its ranking on an Information Retrieval System or even create a bias torwards long documents, making them look more important than they are just because of the high frequency of the term in the document. By normalizing the TF vector we can overcome this problem.
The code.



Compute the TF-IDF

Now that you saw how the vector normalization was applied, we will now have to compute the second term of tf-idf: the inverse document frequency. The code is provided below:




The TF-IDF is the product between the TF and IDF.  So a high weight of the tf-idf  is reached when you have a high term frequency (tf) in the given document and low document frequency of the term in the whole collection. Now let's see the tf-idf computed for each term present in the vector space.

The code.



Putting everything together, the following code will compute the TF-IDF weights for each document. And the result matrix it will be:




A row of this matrix would be:



I ommited the zero-values elements of the row.

If we would decide to check the most relevant words for this place, by using the tf-idf I could see that the place has a nice hot chocolate drink (0.420955 <= chocolate quente ótimo), the soft drink nega maluca is also delicious (0.315716 - nega maluca uma delicia),  its Cheese bun is also quite good (0.252573 - pao de queijo muito bom).

And that is how we comput  our M_{tf\mbox{-}idf} matrix.  You can take a look at this link and this one to know how to use it with GenSim and Scikit.Learn respectively.

That's all,  I hope that  you enjoyed this article and help more people to know how to implement the tf-idf weight to mine your collection of texts.  Feel free to comment and make suggestions.

The source code of this example is also available.

Regards,

Marcel Caraciolo

Machine Learning with Python - Logistic Regression

Sunday, November 6, 2011

 Hi all,

I decided to start a new series of posts now focusing on general machine learning with several snippets for anyone to use with real problems or real datasets.  Since I am studying machine learning again with a great course online offered this semester by Stanford University, one of  the best ways to review the content learned is to write some notes about what I learned. The best part is that it will include examples with Python, Numpy and Scipy. I expect you enjoy all those posts!

The series:


In this post I will cover the Logistic Regression and Regularization.


Logistic Regression


Logistic Regression is a type of regression that predicts the probability of ocurrence of an event by fitting data to a logit function (logistic function).  Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For instance, the probability that a person has a heart attack within a specified time period might be predicted from knowledege of the person's age, sex and body mass index. This regression is quite used in several scenarios such as prediction of customer's propensity to purchase a product or cease a subscription in marketing applications and many others. 


Visualizing the Data



Let's explain the logistic regression by example. Consider you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have the historical data from previous applicants that you can use as a trainning set for logistic regression.  For each training example, you have the applicant's scores on two exams and the admissions decision.   We will use logistic regression to build this model that estimates the probability of admission based the scores from those two exams.

Let's first visualize our data on a 2-dimensional plot as show below. As you can see the axes are the two exam scores, and the positive and negative examples are shown with different markers.

Sample training visualization

The code


Costing Function and Gradient



The logistic regression hypothesis is defined as:

where the function g  is the sigmoid function. It is defined as:


The sigmoid function has special properties that can result values in the range [0,1].  So you have large positive values of X, the sigmoid should be close to 1, while for large negative values,  the sigmoid should be close to 0.

Sigmoid Logistic Function 

The cost function and gradient for logistic regression is given as below:


and the gradient of the cost is a vector theta where the j element is defined as follows:



You may note that the gradient is quite similar to the linear regression gradient, the difference is actually because linear and logistic regression have different definitions of h(x).

Let's see the code:






Now to find the minimum of this cost function, we will use a scipy built-in function called fmin_bfgs.  It will find the best parameters theta for the logistic regression cost function given a fixed dataset (of X and Y values).
The parameters are:
  • The initial values of the parameters you are trying to optimize;
  • A function that, when given the training set and a particular theta, computes the logistic regression cost and gradient with respect to theta for the dataset (X,y).

The final theta value will then be used to plot the decision boundary on the training data, resulting in a figure similar to the figure below.






Evaluating logistic regression



Now that you learned the parameters of the model, you can use the model to predict whether a particular student will be admited. For a student with an Exam1 score of 45 and an Exam 2 score of 85, you should see an admission probability of 0.776.

But you can go further, and evaluate the quality of the parameters that we have found and see how well the learned model predicts on our training set.  If we consider the threshold of 0.5 using our sigmoid logistic function, we can consider that:


Where 1 represents admited and -1 not admited.

Going to the code and calculate the training accuracy of our classifier we can evaluate the percentage of examples it got correct.  Source code.



89% , not bad hun?! 



Regularized logistic regression



But when your data can not be separated into positive and negative examples by a straight-line trought the plot ?  Since our logistic regression will be only be able to find a linear decision boundary, we will have to fit the data in a better way. Let's go through an example.

Suppose you are the product manager of the factory and you have the test results for some microships  of two different tests. From these two tests you would like to determine whether the microships should be accepted or rejected.  We have a dataset of test results on past microships, from which we can build a logistic regression model.  





Visualizing the data



Let's visualize our data. As you can see in the figure below, the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.
Microship training set 

You may see that the model built for this task may predict perfectly all training data and sometimes it migh cause some troubling cases.  Just because ithe model can perfectly reconstruct the training set does not mean that it had everything figured out.  This is known as overfitting.   You can imagine that if you  were relying on this model to make important decisions, it would be desirable to have at least of regularization in there. Regularization is a powerful strategy to combat the overfitting problem. We will see it in action at the next sections.





Feature mapping



One way to fit the data better is to create more features from each data point. We will map the features  into all polynomial terms of x1 tand x2 up to the sixth power.


As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a 28-dimmensional vector. A logistic regression classifier trained on this higher dimension feature vector  will have a more complex decision boundary and will appear nonlinear when drawn in our 2D plot.


Although the feature mapping allows us to buid a more expressive classifier, it also me susceptible to overfitting. That comes the regularized logistic regression to fit the data and avoid the overfitting problem.


Source code.


Cost function and gradient


The regularized cost function in logistic regression is :


Note that you should not regularize the parameter theta, so the final summation is for j = 1 to n, not j= 0 to n.  The gradient of the cost function is a vector where the jn element is defined as follows:



Now let's learn the optimal parameters theta.  Considering now those new functions and our last numpy optimization function we will be able to learn the parameters theta. 

The all code now provided (code)







Plotting the decision boundary



Let's visualize the model learned by the classifier. The plot will display the non-linear decision boundary that separates the positive and negative examples. 

Decision Boundary


As you can see our model succesfully predicted our data with accuracy of 83.05%.

Code





Scikit-learn



Scikit-learn is an amazing tool for machine learning providing several modules for working with classification, regression and clustering problems. It uses python, numpy and scipy and it is open-source!

If you want to use logistic regression and linear regression you should take consider the scikit-learn. It has several examples and several types of regularization strategies to work with.  Take a look at this link and see by yourself!  I recommend!




Conclusions



Logistic regression has several advantages over linear regression, one specially it is more robust and does not assume linear relationship since it may handle nonlinear effects. However it requires much more data to achieve stable, meaningful results.  There are another machine learning techniques to handle with non-linear problems and we will see in the next posts.   I hope you enjoyed this article!


All source from this article here.

Regards,

Marcel Caraciolo