
Review of the book NumPy Cookbook!

Sunday, February 10, 2013


Hi all,

This year I had the opportunity to review the book NumPy Cookbook, written by Ivan Idris and published by Packt Publishing. The goal of the book is to present the NumPy library through several examples, which the author refers to as recipes.




My first impression, after taking a quick look through the book before reading it, is that it is not only about NumPy: it also covers other related libraries such as SciPy, the scikits, Matplotlib, Cython, Pandas and RPy, which I think makes the book much better. Even though the title doesn't quite fit (something like "Scientific Python Cookbook" would suit it better), it is a great book for scientific developers and for anyone not very familiar with these libraries, since it offers a fast way to explore them.

I believe the audience of this book is people who know how Python works and have some experience on the scientific side.


Content


The book is well structured and generally easy to read and digest (congratulations to Packt: all the Packt books I have read share this same organization). I like the way the author introduces the examples using the IPython shell instead of the regular one. The first experience with scientific Python should be through IPython for anyone who works in this field.

The first chapter covers IPython and its notebook feature. The notebook supports Matplotlib and is a good way of sharing scripts between people. Great point (even though the author doesn't mention it): the examples shown throughout the book run on the IPython Notebook! :)
In the next two chapters he tackles NumPy itself: array indexing and several NumPy functions. I didn't like the slight mess with the different installations of the packages, although I could install all of them from the links provided in the book. The recipes were quite interesting, and I liked the examples with stock trading. For financial developers those examples are quite attractive!

The fourth chapter handles NumPy's communication with other Python modules and other programming languages. There are several examples covering MATLAB, Octave, R, Java and even Google App Engine! All of them are simple recipes, so don't expect much real action or deeper use (unfortunately!). You still need to read the documentation of those interfaces to go beyond the very basics provided in the book.
I'd like to give extra credit to the author for mentioning the PiCloud environment, a distributed cloud computing provider that offers pre-installed Python software, including NumPy, for distributed scientific computing!

The next chapter covers audio and image signal processing, and I think it contains some of the best scientific recipes in the book. Even though there are no in-depth details, creating an audio filter was interesting! Some advanced functions covered in this chapter, such as memmap and clip, deserve more detail.
Chapter 6 presents more specialized NumPy classes that can be useful, such as masked arrays and record arrays. These advanced structures can be quite handy in complex algorithms.
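To give a flavor of these two classes (a minimal sketch with made-up data, not a recipe from the book): masked arrays let statistics skip invalid entries, and record arrays give named-field access to structured rows.

```python
import numpy as np

# Mask invalid readings so statistics ignore them; -999 is a
# made-up sentinel for a bad sensor value.
readings = np.array([1.0, 2.0, -999.0, 4.0])
masked = np.ma.masked_values(readings, -999.0)
mean_value = masked.mean()  # computed over the valid entries only

# A record array gives named-field access to structured rows.
people = np.rec.array([("ana", 31), ("bob", 25)],
                      dtype=[("name", "U10"), ("age", "i4")])
ages = people.age  # array([31, 25])
```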

The core of the book lies in the next two chapters, which cover topics many scientists usually forget: profiling, debugging and quality assurance. The author covers all of these with several recipes using several tools (I believe many of the examples use nosetests; the author should mention it). I really appreciated the BDD and mock-object recipes; I didn't know about them! In my opinion these chapters are worth reading, especially since these topics are so often overlooked in scientific books.

Chapter 9 gives some interesting recipes on optimizing your code with Cython. It is an excellent chapter, since there aren't many books covering these topics. Extra points for tackling a problem common to almost all scientists: calling C functions from Python! However, it does not mention Fortran (f2py) integration.

Finally, the last chapter covers the scikits and Pandas, which for me are among the best frameworks built on top of NumPy/SciPy nowadays, covering the machine learning, image processing and statistics research fields! Although the recipes don't go deep, you get a glance at how powerful each framework is, and it makes you curious to explore each one of them.

Conclusions

The book is well written and really easy to read and digest. I liked the author's approach of using several examples familiar to scientists and scientific developers. It's not a book for Python beginners; it focuses on showing, with recipes, what the Scientific Python environment is capable of. And it does that quite well! I recommend this book for everyone looking to study the NumPy/SciPy/Matplotlib frameworks in depth through hands-on experiments. I only missed deeper explanations of some topics, but that does not compromise the book. Congratulations to Packt and Ivan for this great contribution to the scientific Python community!



Regards,

Marcel Caraciolo

How to recommend on-line deals in coupon systems? My ranking model approach for this task!

Saturday, January 26, 2013

Hi all,

In this post, I will share with you some of my scientific contributions at the startup Favoritoz, a previous business where I was involved as Co-Founder and Chief Scientist. The main contribution was an algorithm to rank on-line deals in coupon systems. This algorithm was implemented and ran for the ten months the startup was on-line. I believe this implementation could serve as a baseline for anyone looking into work related to this field, and also as a contribution to the recommender systems community.

Favoritoz's Coupon System

Favoritoz was an on-line coupon system where retailers could publish offers with special discounts on their products to segmented customers interested in their products or in a specific brand. One of the critical features was personalization: people could say what they would like to receive daily on their deals wall. For instance, if I liked sushi, video games and junk food, the system could detect those interests and use our algorithm to rank offers and deals related to those topics higher.

My goal was to develop this baseline for ranking deals based on users' interests, using not only the user profile but also particular aspects of the on-line deals, such as the initial date of the deal, its popularity, or the number of coupons left. All those aspects could be measured in order to compute an overall score used to rank the recommendable items.

Ranking Model


Recommender systems aim to present desirable items for a person to choose from. This task is accomplished by sorting items in order of utility or interest for that person. Generally those items are organized in some form of list; in our scenario, the various deals organized as a wallboard on Favoritoz. But the question is: how do we order those deals so that users become interested, or even convert into transactions? That was my job! I came up with a simple ranking algorithm that uses the various types of information available to build a useful ranking system recommending the best deals for each of our customers at Favoritoz.

By looking at specific features of our domain (the coupons market), we could build a ranking function that gives personalized recommendations. For instance, we look for "hot" deals, represented by the most bought deals (popularity). But popularity alone leads to deals that are common to everyone, which is the opposite of personalization. So we also use other information, like the publishing and expiration dates of the deal, which give us the newest deals and the deals that are close to ending. Furthermore, we look at the number of coupons available: a deal with almost no coupons left could attract more users to purchase the last ones. Another important feature is the user's taste: on Favoritoz people could follow retailers (their favorites) and categories (sports, cuisine, automobile, etc.), and we use this information in the ranking equation as well. In the end, to produce rankings that balance all of these aspects, we came up with a prediction model using all those features.

I liked Netflix's approach to the recommendation problem at their company, where they came up with a simple scoring method by choosing the ranking function to be a linear combination of several features. We took the same approach, but using different features specific to our domain.

It's like this equation:

frank(u,d) = w1*SD + w2*ED + w3*P + w4*CA + w5*UP + w6*DV + b
where,
SD = publishing (start) date of the deal d,
ED = expiration date of the deal d,
P  = popularity of the deal d,
CA = coupons available for the deal d,
UP = user profile (user tastes),
DV = diversity (based on stores)
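As a rough illustration, the linear model above can be sketched in a few lines of Python. The weights below are hypothetical (the post does not give the values used in production), and the features are assumed to be already normalized to [0, 1]:

```python
# Hypothetical weights -- the real values at Favoritoz were tuned
# empirically and are not given in this post.
WEIGHTS = {"SD": 0.2, "ED": 0.2, "P": 0.25, "CA": 0.1, "UP": 0.2, "DV": 0.05}
BIAS = 0.0

def rank_score(features, weights=WEIGHTS, bias=BIAS):
    """Linear combination of normalized deal/user features, all in [0, 1]."""
    return sum(weights[name] * value for name, value in features.items()) + bias

# Example: a fresh, fairly popular deal that matches the user's tastes.
deal = {"SD": 0.9, "ED": 0.3, "P": 0.8, "CA": 0.6, "UP": 1.0, "DV": 0.5}
score = rank_score(deal)
```

Deals on the wall would then simply be sorted by this score, highest first.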

It's simple, and of course many improvements could be made. But it was our baseline, and in our production tests it performed very satisfactorily. In four months running we achieved an increase of almost 300% in click-through rate per user compared to using no ranking method at all. Of course several external factors could have influenced this result, but it showed promise for a first attempt. I implemented all the code in Python with a fork of my framework Crab, a Python framework for building recommender systems and variations.

The whole algorithm explained here can be accessed at this link.

Conclusion

As I presented, this ranking algorithm was my first attempt to solve this problem identified on our on-line website. Since the startup didn't go on (no more coupons), I couldn't reformulate the algorithm in terms of optimization and more thorough A/B tests. With more data available we could have achieved even better results. It was a rapid experiment, and I believe this work could be useful for someone working in this research field. I also invite anyone to evaluate my recommendation approach, improve it, or point out flaws. (-- under approval -- This article is indexed at arxiv.org.)

Feel free to analyze it and if you want to reference this article or extend this work, let me know. Favoritoz was a great research lab for me!

That's all,

Regards,

Marcel Caraciolo

Machine learning and Data Mining - Association Analysis with Python

Friday, January 11, 2013

Hi all,


Recently I've been working with recommender systems and association analysis. The latter, especially, is one of the most used machine learning techniques for extracting hidden relationships from large datasets.

The famous example in the study of association analysis is the story of baby diapers and beer. The story tells that a certain grocery store in the Midwest of the United States increased its beer sales by placing beer near where the diapers were shelved. In fact, what happened is that the association rules pointed out that men bought diapers and beer on Thursdays. So the store could profit by placing those products together, which would increase sales.

Association analysis is the task of finding interesting relationships in large data sets. These hidden relationships are then expressed as a collection of association rules and frequent item sets. Frequent item sets are simply collections of items that frequently occur together, and association rules suggest that a strong relationship exists between two items. Let's illustrate these two concepts with an example.



A list of transactions from a grocery store is shown in the figure above. Frequent item sets are lists of items that commonly appear together. One example is {wine, diapers, soy milk}. From the data set we can also find an association rule such as diapers -> wine. This means that if someone buys diapers, there is a good chance they will buy wine. With frequent item sets and association rules, retailers gain a much better understanding of their customers. Although the common examples of association rules come from the retail industry, the technique can be applied to a number of other domains, such as web site traffic, medicine, etc.

How do we define these so-called relationships? Who decides what is interesting? When we are looking for frequent item sets or association rules, there are two parameters that define their relevance. The first is the support of an itemset, defined as the percentage of the data set that contains the itemset. From the figure above, the support of {soy milk} is 4/5. The support of {soy milk, diapers} is 3/5, because of the five transactions, three contained both soy milk and diapers. Support applies to an itemset, so we can define a minimum support and keep only the itemsets that meet it. The second is the confidence, which is defined for an association rule like diapers -> wine. The confidence for this rule is defined as support({diapers, wine}) / support({diapers}). From the figure above, the support of {diapers, wine} is 3/5 and the support of {diapers} is 4/5, so the confidence for diapers -> wine is 3/4 = 0.75. That means our rule is correct for 75% of the transactions in our data set containing diapers.
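These definitions are easy to check in a few lines of Python. The transaction list below is hypothetical, constructed only to match the counts quoted above (the original figure is not reproduced here):

```python
# Hypothetical transactions matching the supports quoted in the text:
# support({soy milk}) = 4/5, support({soy milk, diapers}) = 3/5,
# support({diapers}) = 4/5, support({diapers, wine}) = 3/5.
transactions = [
    {"soy milk", "diapers", "wine"},
    {"soy milk", "diapers", "wine"},
    {"soy milk", "diapers", "wine"},
    {"diapers", "lettuce"},
    {"soy milk", "orange juice"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """confidence(P -> H) = support(P union H) / support(P)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"soy milk"}))             # 0.8
print(support({"soy milk", "diapers"}))  # 0.6
print(confidence({"diapers"}, {"wine"})) # 0.75
```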

Support and confidence are ways to quantify the success of our association analysis. Let's assume we wanted to find all sets of items with a support greater than 0.8. How would we do that? We could generate a list of every combination of items and then count how frequently each occurs. But this is extremely slow when we have thousands of items for sale. In the next section, we analyze the Apriori principle, which addresses this problem and reduces the number of calculations we need to do to learn association rules.

To sum up, here is one example of a rule extracted by association analysis:

"If a client bought product A1 and product A4, then they also bought product A6 in 74% of the cases. This rule applies to 20% of the studied cases."
The first parameter is the support: the value 20% means the rule applies to 20% of the studied cases, i.e., the itemset {A1, A4} appears in 20% of all transactions. The second parameter is the confidence, which tells the percentage of cases in which the rule holds: the value 74% means that among the transactions where products A1 and A4 are found, 74% also contain product A6.

Apriori Algorithm

Let's assume that we are running a market store with a very limited selection. We are interested in finding out which items are purchased together. We have only four items: item0, item1, item2 and item3. What are all the possible combinations that can be purchased? We can have one item, say item0 alone, or two items, or three items, or all of the items together. If someone purchased two of item0 and four of item2, we don't care; we are concerned only with whether they purchased one or more of an item.



A diagram showing all possible combinations of the items is shown in the figure above. To make it easier to interpret, we use only the item number, such as 0 instead of item0. The first set is Ø, the null set, a set containing no items. Lines connecting item sets indicate that two or more sets can be combined to form a larger set.


Remember that our goal is to find sets of items that are purchased together frequently. The support of a set counts the percentage of transactions containing that set. How do we calculate this support for a given set, say {0,3}? Well, we go through every transaction and ask, "Did this transaction contain 0 and 3?" If the transaction did contain both items, we increment the total. After scanning all of our data, we divide the total by the number of transactions, and we have our support. This result is for only one set: {0,3}. We would have to do this many times to get the support of every possible set. We can count the sets in the figure and see that for four items, we have to go over the data 15 times. This number gets large quickly: a data set that contains N possible items can generate 2^N - 1 possible itemsets. Stores selling 10,000 or more items are not uncommon. Even a store selling only 100 items can generate 1.26 * 10^30 possible itemsets. This would take a very, very long time to compute, even on a modern computer.
To reduce the time needed to compute this value, researchers identified the so-called Apriori principle, which helps us reduce the number of possibly interesting itemsets. The Apriori principle says: if an itemset is frequent, then all of its subsets are frequent. In the figure below this means that if {0,1} is frequent, then {0} and {1} have to be frequent. This rule as it stands doesn't really help us, but turned inside out it does: if an itemset is infrequent, then its supersets are also infrequent, as shown in the figure below.



As you can observe, the shaded itemset {2,3} is known to be infrequent. From this knowledge, we know that itemsets {0,2,3}, {1,2,3}, and {0,1,2,3} are also infrequent. This tells us that once we have computed the support of {2,3}, we don’t have to compute the support of {0,2,3}, {1,2,3}, and {0,1,2,3} because we know they won’t meet our requirements. Using this principle, we can halt the exponential growth of itemsets and in a reasonable amount of time compute a list of frequent item sets.



Finding frequent item sets


Now let's go through the Apriori algorithm.  We first need to find the frequent itemsets, then we can find association rules. The Apriori algorithm needs a minimum support level as an input and a data set. The algorithm will generate a list of all candidate itemsets with one item. The transaction data set will then be scanned to see which sets meet the minimum support level. Sets that do not meet the minimum support level will get tossed out. The remaining sets are then combined to make itemsets with two elements. Again, the transaction data set will be scanned and itemsets not meeting the minimum support level will get tossed. This procedure is repeated until all sets are tossed out.

We will use Python with SciPy and NumPy to implement the algorithm. The pseudocode for the Apriori algorithm looks like this:


While the number of items in the set is greater than 0:
    Create a list of candidate itemsets of length k
    Scan the dataset to see if each itemset is frequent
    Keep frequent itemsets to create itemsets of length k+1


The code in Python is shown below:
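(The original listing was published as an image; below is a self-contained sketch of the same idea, following the createC1/scanD/aprioriGen/apriori structure the text refers to. The function bodies are my reconstruction, not the exact original code.)

```python
def createC1(dataset):
    """Build the list of candidate 1-item sets, as frozensets."""
    items = sorted({item for transaction in dataset for item in transaction})
    return [frozenset([item]) for item in items]

def scanD(D, Ck, minSupport):
    """Keep the candidates in Ck meeting minSupport; also return their supports."""
    counts = {}
    for tid in D:
        for can in Ck:
            if can <= tid:  # candidate is a subset of this transaction
                counts[can] = counts.get(can, 0) + 1
    numItems = len(D)
    supportData = {can: cnt / numItems for can, cnt in counts.items()}
    retList = [can for can in Ck if supportData.get(can, 0) >= minSupport]
    return retList, supportData

def aprioriGen(Lk, k):
    """Merge frequent (k-1)-item sets sharing their first k-2 items into k-item sets."""
    retList = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            if sorted(Lk[i])[:k - 2] == sorted(Lk[j])[:k - 2]:
                retList.append(Lk[i] | Lk[j])
    return retList

def apriori(dataset, minSupport=0.5):
    """Grow frequent itemsets level by level until no candidate survives."""
    D = [set(t) for t in dataset]
    L1, supportData = scanD(D, createC1(dataset), minSupport)
    L = [L1]
    k = 2
    while L[-1]:
        Ck = aprioriGen(L[-1], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L[:-1], supportData
```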



The source is documented, so you can easily understand the whole process.

Let's see it in action!



First, we create our first candidate item set, C1. C1 contains a list of all items as frozensets. After that we create D, which is just the dataset in the form of a list of sets (a set is a collection of unique elements). With everything in set form, we remove the items that don't meet our minimum support; in our example, 0.5 was the minimum support level chosen. The result is represented by the list L1 and by support_data. The four items that make up L1 are the sets that occur in at least 50% of all transactions. Item 4 didn't make the minimum support level, so it's not part of L1. By removing it, we have removed work we would otherwise need to do when finding the list of two-item sets. Remember our pruned tree above?!



Two more functions are now used: apriori() and aprioriGen(). Given a data set and a support level, apriori() generates a list of candidate itemsets. aprioriGen() is responsible for generating the next list of candidate itemsets, and scanD throws out the itemsets that don't meet the minimum support level. Looking at the result of aprioriGen(), it generates six unique two-item sets. If we run our apriori function, we can check that L contains lists of frequent itemsets that met a minimum support of 0.5. The variable support_data is just a dictionary with the support values of our itemsets. We don't care about those values now, but they will be useful in the next section. We now know which items occur in 50% of all transactions, and we can begin to draw conclusions from this. In the next section we will generate association rules to try to get an if-then understanding of the data.


Mining association rules from frequent item sets


We just saw how to find frequent itemsets with the Apriori algorithm. Now we have to find the association rules. To find them, we start with a frequent itemset. The set of items is unique, but we can check if there is anything else we can get out of those items: one item, or one set of items, can imply another item. From the example above, if we have the frequent itemset {soy milk, lettuce}, one example of an association rule is soy milk -> lettuce. This means that if someone purchases soy milk, there is a statistically significant chance that they will purchase lettuce. The reverse is not automatic: just because soy milk -> lettuce is statistically significant, it does not mean that lettuce -> soy milk will be.

We showed the support level as one measure for quantifying the frequency of an itemset. We also have one for association rules: the confidence. The confidence for a rule P -> H is defined as support(P | H) / support(P). Remember that the | symbol denotes set union: P | H means all the items in set P or in set H. We already have the support values of all frequent itemsets from the apriori function, remember? Now, to calculate the confidence, all we have to do is look up those support values and do one division.

From one frequent itemset, how many association rules can we have? The figure below shows all the possible association rules from the itemset {0, 1, 2, 3}. To find the relevant rules, we simply generate a list of possible rules and then test the confidence of each one. If the confidence does not meet our minimum requirement, we throw the rule out.




Usually we don't do this exhaustively. Imagine the number of rules if we had an itemset of 10 items: it would be huge. As we did with the Apriori algorithm, we can reduce the number of rules we need to test. If a rule does not meet the minimum confidence requirement, then rules built from it with larger consequents will not meet it either. So, assuming that the rule 0,1,2 -> 3 does not meet the minimum confidence, any rule from {0,1,2,3} whose consequent contains 3, such as 0,1 -> 2,3, will also fail to meet the minimum confidence.


By creating a list of rules with one item on the right-hand side, testing them, and merging the consequents of the surviving rules, we can create rules with two items on the right-hand side, and so on recursively. Let's see the code below:
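(This listing was also originally an image; here is a self-contained sketch of the rule-generation step. It assumes apriori() returned the levels of frequent itemsets, levels[0] being the single-item sets, together with a support_data dictionary; details are my reconstruction, not the exact original code.)

```python
from itertools import combinations

def generateRules(L, supportData, minConf=0.7):
    """Emit (antecedent, consequent, confidence) triples for every rule
    meeting minConf, pruning consequent growth when confidence falls short."""
    rules = []
    for level in L[1:]:  # rules need itemsets with at least two items
        for freqSet in level:
            # start with single-item consequents, then merge the survivors
            consequents = [frozenset([item]) for item in freqSet]
            while consequents:
                kept = []
                for conseq in consequents:
                    antecedent = freqSet - conseq
                    if not antecedent:
                        continue
                    conf = supportData[freqSet] / supportData[antecedent]
                    if conf >= minConf:
                        rules.append((antecedent, conseq, conf))
                        kept.append(conseq)
                # merge surviving consequents into larger ones, Apriori-style
                size = len(consequents[0]) + 1
                consequents = list({a | b for a, b in combinations(kept, 2)
                                    if len(a | b) == size})
    return rules
```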



Let's see in action:



As you may have seen, the first call gives us three rules: {1} -> {3}, {5} -> {2} and {2} -> {5}. It is interesting that the rules with 2 and 5 can be flipped around, but not the rule with 1 and 3. If we lower the confidence threshold to 0.5, we get more rules. Let's now move to a real-life dataset and see how it works.


Courses  at Atepassar.com

It's time to see these tools on a real-life dataset. What can we use? I love educational data, so let's see an example using data from the Atepassar social network. Atepassar is a popular social-learning environment where people organize their studies for Brazilian public service exams. I work there as Chief Scientist!






The data set can give us some interesting insights into the profiles of the users who buy on-line courses at Atepassar. We will use the apriori() and generateRules() functions to find interesting information in the transactions.

So let's get started.  First I will create a transaction database and then generate the list of frequent itemsets and association rules.


Collect Dataset from Atepassar.


I have encoded some information related to our transactions: for example, the client's user profile (state, gender and degree). I also extracted the courses bought and the category of each course, such as 'national', 'regional', 'state' or 'local'.

This gives us important hypotheses to test, like "Do men buy more preparation courses for national exams?", "Do people from São Paulo tend to buy more courses for state exams?", or even "Do male graduates buy courses for 'local' exams?" Those are some examples of the explanations we may get with association analysis.

After loading my dataset we can see inside each transaction:




Let's see one of them in order to explain it:


So one transaction is composed of a woman who bought a preparation course for a national (Brazilian) exam and lives in Distrito Federal (DF).

Now let's use our Apriori algorithm. But first let's make a list of all the transactions. We will throw out the keys, which are the usernames; we are not interested in those. What we are interested in are the items and the associations among them. Now let's mine the frequent itemsets and association rules.

Applying the Apriori algorithm developed earlier with a support setting of 5%, we see several frequent itemsets.



Now let's try to generate the association rules, beginning with a minimum confidence of 70%.



If we increase the confidence even more, we don't see any rules at all. That means lots of distinct transactions or, unfortunately, a small amount of data.



I will stay with a confidence of 70%. So we have the following rules:




What can we say about those insights? Do you remember the support level of 5%? That means the rules above show up in at least 5% of all transactions, and in the case of DF --> Nacional, the rule is right 83.8% of the time. People from Brasília really do purchase courses to get better positions in the government, or even the best salaries!

Conclusions

In this post I presented one of the most traditional data mining tools for finding interesting relationships in a large set of data. We saw that there are two ways to quantify those relationships: frequent itemsets, which show the items that commonly appear together in the data, and association rules, which imply an if -> then relationship between items.

But wait: before you start using it on your dataset, I have to give you some warnings. Finding all the different combinations of items can be a very time-consuming task, expensive in terms of computer processing, so you need more intelligent approaches to find frequent itemsets in a reasonable amount of time. Apriori is one approach that reduces the number of sets checked against the dataset, and by combining the support and confidence measures we can generate association rules.

The main problem of the Apriori algorithm is that it requires a scan over the dataset each time we increase the length of the frequent itemsets. With a huge dataset, this can considerably slow down the process of finding frequent itemsets. An alternative technique for this issue is the FP-growth algorithm, for example, which only needs to go over the dataset twice and can lead to better performance.


The code in Python is shown here.



I hope you enjoyed this article, the first of 2013!

Regards,


Marcel Caraciolo

My slides from PythonBrazil's Conference available

Monday, November 26, 2012



Hi all,

This weekend I attended PythonBrasil, the annual Brazilian conference for Python developers. It was a great event for meeting old friends and keeping in touch with the best Python developers around Brazil. I had the opportunity to give two talks and one tutorial there.

My first presentation was about how Python is used nowadays as the main platform in several on-line e-learning initiatives worldwide, such as Coursera, Udacity, Codecademy and Khan Academy. I also presented my experiences with Python in Brazilian environments such as Atepassar and PyCursos. Python is also a good first language to learn, and I showed how our team and I taught it to more than 500 students around Brazil.


SlideShows here.




My second presentation was about how I used Python, Hadoop and MapReduce to scale up our recommender system currently running at Atepassar, an educational Brazilian social network with more than 150 thousand students. It included examples, tips and guides for anyone who wants to explore the power of distributed computing at Amazon, as well as the current effort in the Crab framework to include those features.



SlideShows here.





The tutorial I gave at FGV about machine learning with Python is being updated, and I will release it soon in this post.

Congratulations to the Rio staff for this exciting conference! Next year it will be in Brasília, Brazil.

See you there,

Marcel Caraciolo

Keynotes and Tutorial at PythonBrasil 8 about education, data mining and big data!

Saturday, November 17, 2012

Hi all,

Next week I will be at PythonBrasil, the annual meeting that brings together Python developers from around Brazil to discuss programming, projects and opportunities, see some old friends, make new ones, and even do some business.


It will be a great event with more than 50 talks about Python and related applications and projects. I will host some talks and a hands-on tutorial during the event.

On Thursday, 22nd - 2:00PM - 6:00PM


           I will present some machine learning techniques implemented with Python and scikit-learn, such as linear regression, Naive Bayes and several others. For anyone who knows Python and is familiar with basic statistics.

On Friday, 23rd - 10:40AM - 11:10AM

          I will show how Python is used nowadays as the main platform in several on-line e-learning initiatives worldwide, such as Coursera, Udacity, Codecademy and Khan Academy. I will also present my experiences with Python in Brazilian environments such as Atepassar and PyCursos. Python is also a good first language to learn, and I will show how our team and I taught it to more than 400 students around Brazil.

On Saturday, 24th - 4:10 PM - 4:40 PM
       In this presentation I will show how I used Python, Hadoop and MapReduce to scale up our recommender system currently running at Atepassar, an educational Brazilian social network with more than 150 thousand students. There will be examples, tips and guides for anyone who wants to explore the power of distributed computing at Amazon, as well as the current effort in the Crab framework to include those features.

Those are my talks and tutorials. There are several other quite interesting talks that I want to watch. I strongly recommend that you attend this conference; it will be awesome.

For those who want to follow the event, follow @PythonBrasil on Twitter and like the PythonBrasil8 fan page on Facebook.

Come find me at the event if you want to talk about education, Python, entrepreneurship, data mining and big data ;D

See you at the conference!

Marcel Caraciolo