Pages

Developing a Computer-Assisted Twitter User. Meet @caocurseiro!

Wednesday, December 29, 2010

Hi all,

It has been a while since my last post, but the reason is that I've been working on some new things related to social recommendation and on my master's thesis. This post, though, is about a recent project I developed for a Brazilian social network called AtePassar.  AtePassar is a social network for people who want to apply for positions in the Brazilian civil (government) service.  One of its great features is that people can share their study interests and meet people all around Brazil with the same interests, or even someone who will take the same exam. Can you imagine the possibilities?

AtePassar  Main Page

It is a social network that brings students together in a virtual space, with relations of friendship, study groups and even exam partners. All the network services are free for the users, and there is also an e-commerce store where users can buy resources to help them in their studies, such as video lectures, papers, books, etc. My current job there is to apply intelligence, specifically collective intelligence, to the social network. Right now I am working on a big project for a social recommender at AtePassar (I will soon post here explaining how I built it) and on smaller projects to help the social network attract new users.

One of them that I'd like to comment on is a new Twitter bot I developed that tries to simulate a human, controlling all the actions of a Twitter account.  Let me explain more about it.  The goal is not to develop a spam bot or a bot that only talks about specific terms.  I imagined it as a bot that could post updates about open civil service positions in Brazil, self-help quotes to encourage people, and simple mentions related to AtePassar. It would also retweet tweets from other users who talk about open public exams in Brazil, send mentions to people interested in those subjects introducing them to the AtePassar social network, and even follow those users.

Written in Python and hosted on Google App Engine, the bot is @caocurseiro, the mascot of the social network. What does it do?
  • Post 1 to n updates at a random time within an interval defined by the administrator.
  • Send 1 to n mentions to new users (@newuser) at a random time within an interval defined by the administrator.
  • Follow 1 to n new users at a random time within an interval defined by the administrator.
  • Retweet 1 to n tweets at a random time within an interval defined by the administrator.
Of course, since Twitter's policies forbid all kinds of spam, I developed the bot to obey the rules and the rate limits they establish.  Another feature is that it even sleeps during the night (Brazilian local time) for a random period.  I am still thinking about how to improve it, for instance by answering mentions directed to it and improving the quality of the posts, which sometimes come out repeated.
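The bot's code is not public yet, but here is a rough, self-contained sketch of the random-interval scheduling idea. The interval bounds, the night window and the post_update() helper are illustrative placeholders, not the actual AtePassar bot code.

import random
from datetime import datetime, time

# Illustrative bounds set by the administrator (minutes between actions).
MIN_INTERVAL, MAX_INTERVAL = 20, 90

# Illustrative "sleeping" window in Brazilian local time.
NIGHT_START, NIGHT_END = time(23, 0), time(7, 0)

def is_sleeping(now=None):
    # True when the bot should stay quiet (night time in Brazil).
    now = (now or datetime.now()).time()
    return now >= NIGHT_START or now <= NIGHT_END

def next_delay_minutes():
    # Pick a random delay inside the administrator-defined interval.
    return random.randint(MIN_INTERVAL, MAX_INTERVAL)

def run_cycle(post_update, pending_updates):
    # Post one pending update, unless the bot is sleeping, and return
    # the delay (in minutes) before the next cycle should run.
    if not is_sleeping() and pending_updates:
        post_update(pending_updates.pop(0))   # e.g. a call to the Twitter API
    return next_delay_minutes()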

Our mascot has been online for about two days, and it is already following 42 users and being followed by 25.  As soon as I add new retweet rules, so that it retweets posts from other users about subjects related to open public exams, I think the quality of the bot will improve even more.

Unfortunately the code is not open source yet, but I am working on releasing it.  I think the biggest contribution is showing how to develop a computer-driven Twitter user that behaves as similarly to a human as possible. Although it still uses static rules, maybe I could add some intelligence so it can answer real questions about open public exams in Brazil (future ideas).

@caocurseiro, the computer-driven Twitter user

That's all,

Marcel Caraciolo

My lecture about Recommender Systems at the IX Pernambuco Python User Group Meeting, and my contributions.

Tuesday, December 7, 2010

Hi all,

It has been a while since my last post, but my master's thesis is taking most of my available time. Soon I will come back with posts and content related to what I am working on now.

The main reason for this post is to publish my lectures recorded at the IX Pernambuco Python User Group Meeting (PUG-PE) last month, in November.  I had the opportunity to talk about what I am studying, which is related to recommender systems, and I also gave a lightning talk about lightning talks!

Since this blog is about artificial intelligence, I will focus on the recommender systems lecture. In it I introduced the main concepts behind recommender systems, how they work, and the advantages and drawbacks of each classic filtering algorithm.  Both examples presented were also used in my lecture at PythonBrasil (the main meeting that gathers Python users from all over Brazil).  The results of this project will be explained in two future posts, but let me describe my main contributions in this field.

One is my current work at the startup Orygens, where I am developing a novel recommender system applied to social networks. The idea is to recommend users and content to the users of a Brazilian social network called AtePassar.

Main Profile of AtePassar



The other contribution is the development of a framework written purely in Python called Crab.  I originally conceived it as a simple, easy-to-use recommendation engine framework that can be applied to any domain. It will also be used to test new approaches, since it is easy to plug in new recommender algorithms and test them with the evaluation tools available.  The project is open source and completely available at my account on GitHub.com.

The main page of the Crab Project


Today we have four collaborators, and we plan to keep moving forward with some demos and a distributed computing framework fully integrated with Crab.  I will soon provide more information here on my blog, with some demonstrations.


My video about recommender systems (in Portuguese):








Wait for news!  For any further information, please contact me!

Marcel Caraciolo

Recommender Engines and Data mining with Python at PythonBrasil Conference

Thursday, October 28, 2010

Hi all,

Last week I was at an important event in Curitiba called Python Brasil. It is an annual event that gathers Brazilian developers to discuss technology and, of course, the Python programming language.

I also had the opportunity to give three presentations about topics of my interest.   The official talk was about recommendation engines with Python.  It shows how developers can use Python to build recommendation engines, with several examples, and explains the main concepts behind the subject.  The best part was the demos, where I used real data from the web, such as Twitter, to suggest users similar to me among the PythonBrasil followers.  The other example was based on collective buying: I crawled some popular Brazilian web sites and gathered real offers in Curitiba. The main idea is to recommend new offers according to my interests and to what people similar to me also liked. It is the classic example of collaborative filtering, commonly used by many e-commerce sites today, including Amazon.
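To give an idea of what that demo does, here is a tiny, self-contained sketch of user-based collaborative filtering. The users, offers and ratings below are made up, and this is not the demo code itself, just the idea behind it.

from math import sqrt

# Made-up ratings: user -> {offer: rating}
ratings = {
    'marcel': {'pizza': 5.0, 'spa': 2.0, 'cinema': 4.0},
    'ana':    {'pizza': 4.0, 'spa': 1.0, 'cinema': 5.0, 'sushi': 4.0},
    'joao':   {'pizza': 1.0, 'spa': 5.0, 'sushi': 2.0},
}

def cosine_similarity(a, b):
    # Cosine similarity over the offers both users rated.
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    num = sum(ratings[a][o] * ratings[b][o] for o in common)
    den = sqrt(sum(ratings[a][o] ** 2 for o in common)) * \
          sqrt(sum(ratings[b][o] ** 2 for o in common))
    return num / den

def recommend(user):
    # Score offers the user has not seen yet, weighted by user similarity.
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = cosine_similarity(user, other)
        for offer, rating in ratings[other].items():
            if offer not in ratings[user]:
                scores[offer] = scores.get(offer, 0.0) + sim * rating
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print recommend('marcel')   # e.g. [('sushi', ...)]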
The presentation was great, with lots of feedback.  If you want to take a look at the slides (they are in Portuguese), see them here:







The main contribution of this work is a new library for building recommendation engines in Python called Crab.  I decided, together with some colleagues, to develop this library as a powerful tool for developers who want to use Python as the main language to build and use classic recommender algorithms in their applications.  It is also extensible, so developers can easily add new algorithms to the engine, and there is a simple API so users can plug it into web apps running on different platforms such as Django, AppEngine, Web2Py, etc.

If you want to collaborate or are interested in this subject, please feel free to join us in this work. The project is hosted at GitHub:



My second presentation was a lightning talk. What's that?  It is an extra category of presentations at PythonBrasil where you have 5 minutes to speak about any topic you want.  The only rule is: 5 minutes, no more!   I took it as a challenge and, one day before the presentation, developed a web crawler to scrape all the lectures submitted and approved at the PythonBrasil conference.  With all this data in my hands, I decided to run an analysis to answer three questions:

a) Which are the main and most frequent topics presented in the Python Brasil keynotes?

b) Based on this information, how could we organize the speakers by those topics? That is, group speakers with similar topics: a classic clustering problem.

c) What other information could we extract, such as the expertise level of the lectures, the total time spent on the lectures, etc.?


To answer those questions I decided to use Python, Matplotlib and Ubigraph (a 3D visualization tool for graphs).   It was really interesting because I could actually find some groups based on similar interests.  The main subjects were Entrepreneurship, Hardware, Web, Design Patterns, Data Mining, Django and Artificial Intelligence.

With those subjects I could then group the speakers using a simple clustering algorithm such as K-means, organizing them based on their topics. A rough sketch of this grouping idea is shown right below, and I've also recorded a simple video presenting the result with Ubigraph. Take a look:
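Just to illustrate the clustering step, here is a small, self-contained sketch: each speaker becomes a bag-of-topics vector and a very basic K-means pass groups them. The speakers and topics below are made up, and the talk itself used Matplotlib and Ubigraph rather than this code.

import random

TOPICS = ['web', 'data mining', 'hardware', 'django', 'ai']

# Made-up speakers described by the topics of their talks.
speakers = {
    'alice': set(['web', 'django']),
    'bob':   set(['data mining', 'ai']),
    'carol': set(['web']),
    'dave':  set(['ai', 'data mining']),
}

def to_vector(topic_set):
    # Binary bag-of-topics representation of a speaker.
    return [1.0 if t in topic_set else 0.0 for t in TOPICS]

def distance(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    # Very small K-means: returns the cluster index of each point.
    centroids = random.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: distance(p, centroids[c]))
                      for p in points]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment

names = list(speakers)
vectors = [to_vector(speakers[n]) for n in names]
for name, cluster in zip(names, kmeans(vectors, 2)):
    print name, '-> cluster', cluster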







You can see the presentation (in Portuguese) here:





In the end I think the event was awesome: some great keynotes and, of course, lots of new contacts for my network.  I have to say it is a great opportunity to meet great people and share ideas and technology!

Next year it will be in São Paulo, Brazil. I expect to be there!

Best regards,

Marcel Caraciolo

My new experiment: TweetTalk, a Twitter post chatter ;D Update tweets from your Google Talk account!

Saturday, October 2, 2010

Hi all,

During my recent studies about chatter bots, I was inspired to build a new one, this time integrating Twitter and the Google Talk IM service.  Its name is TweetTalk.  What's the catch?

TweetTalk is a Jabber IM bot for anyone who wants to quickly post an update to their Twitter account. Instead of opening a separate client, from anywhere you can access Google Talk (web mail or desktop client) you can just write your tweet, and the bot takes care of sending the post to Twitter.

Let's see it in action:

The tweet status on Twitter:



As you can see, it is now really fast for me to send a tweet from my Gmail web mail. It is a simple experiment showing how these kinds of bots can improve your daily life and work.

If you want to try it, please add it to your contacts:   tweettalk@bot.im

By the way, I am using Python to develop the main logic of the bot.  For web communication I used Django + Google App Engine + the Twitter API, and as the bot infrastructure, the Imified API.
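The bot's code is not listed in this post, but the core idea fits in a few lines: a web handler receives the message forwarded by the IM infrastructure and relays it to Twitter. The sketch below is only illustrative; the form field name 'msg', the credentials and the use of the tweepy library are assumptions, not the actual TweetTalk implementation.

from django.http import HttpResponse
import tweepy

# Illustrative credentials; the real bot keeps them in its own configuration.
CONSUMER_KEY, CONSUMER_SECRET = 'consumer-key', 'consumer-secret'
ACCESS_KEY, ACCESS_SECRET = 'access-key', 'access-secret'

def incoming_message(request):
    # Receive a message forwarded by the IM bot infrastructure and tweet it.
    text = request.POST.get('msg', '').strip()   # assumed field name
    if not text:
        return HttpResponse('Nothing to post.')
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
    tweepy.API(auth).update_status(text[:140])   # respect the 140-character limit
    return HttpResponse('Posted to Twitter!')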

I am writing some new articles about performance evaluation, recommendation engines, REST APIs and SVMs with keyword/term extraction.

Stay tuned !

Marcel Caraciolo

Tools for Machine Learning Performance Evaluation: ROC Curves in Python

Tuesday, September 21, 2010

Hi all,

Continuing from my last post about performance analysis of classification machine learning techniques, in this article I will talk about a specific analysis called discriminatory power analysis, using Receiver Operating Characteristic curves (ROC curves).

First, let's review and introduce some concepts related to classification problems. When we deal with systems that involve detection, diagnosis or prediction of results, it is really important to check the obtained results in order to validate whether the discriminative power of a given analysis is good or not.  However, relying only on the count of hits and misses on a test group does not really tell how good a system is, since the result depends on the quality and distribution of the test group data.

For instance, let's consider a fraud detection system whose outputs are 1 or 0, indicating whether a transaction has been classified as fraudulent or not.  Now suppose we apply the system to a test group of 100 transactions for which we already know which ones were fraudulent, and the system correctly classifies 90% of them. You think that is good performance, don't you?

Actually, it might not be. One piece of information is missing: how the test group was distributed.  Let's say that 90% of the transactions were actually fraudulent. The system could then have answered '1' for every input and still achieved a 90% accuracy rate, since only the remaining 10% of transactions were not fraudulent. So far, we cannot tell whether the system is really good or just works as a trivial classifier that labels every transaction as fraudulent, without any previous knowledge or calculation.

In this case, we have to consider other metrics in order to account for this imbalance in the test groups. As I said in a previous post on this blog, the confusion matrix will act as the basis for the measures used in the discriminatory power analysis in question.

Confusion Matrix

From this table, I will present some metrics used to help in this analysis.

1. Accuracy

It measures the proportion of correct predictions, considering both the positive and the negative inputs.  It is highly dependent on the data set distribution, which can easily lead to wrong conclusions about the system performance.

ACC = TOTAL HITS / NUMBER OF ENTRIES IN THE SET
    = (TP + TN) / (P + N)

2. Sensitivity

It measures the proportion of true positives, that is, the ability of the system to correctly predict the positive cases presented.

SENS = POSITIVE HITS / TOTAL POSITIVES
     = TP / (TP + FN)

3. Specificity

It measures the proportion of true negatives, that is, the ability of the system to correctly predict the cases that are the opposite of the desired one.

SPEC = NEGATIVE HITS / TOTAL NEGATIVES
     = TN / (TN + FP)

4. Efficiency

It is the mean of sensitivity and specificity.  A perfect classifier would have 100% specificity and 100% sensitivity; however, this situation is rarely achieved, so a balance between the two metrics must be found. This metric is a good overall indicator of the system's tendency to produce false positives and false negatives: generally, when a system is too responsive to positives, it tends to produce many false positives, and vice versa.

EFF = (SENS + SPEC) / 2

5. Positive Predictive Value

This measure indicates how good the system is when it makes a positive statement. It is the proportion of true positives among all positive predictions. It is not recommended to use it alone, since it can easily lead to wrong conclusions about system performance.

PPV = POSITIVE HITS / TOTAL POSITIVE PREDICTIONS
    = TP / (TP + FP)
 
6. Negative Predictive Value

This measure indicates how good the system is when it makes a negative statement. It is the proportion of true negatives among all negative predictions. As with the PPV, it is not recommended to use it alone, since it can easily lead to wrong conclusions about system performance.

NPV = NEGATIVE HITS / TOTAL NEGATIVE PREDICTIONS
    = TN / (TN + FN)

7. Phi (φ) Coefficient

It is a less commonly used measure that summarizes the quality of the confusion matrix in a single value that can be compared across classifiers. For instance, if you have two binary classifications over the same classes but with different sizes, it can be used to compare them. It returns a value between -1 and +1, where +1 represents a perfect prediction, 0 a random prediction, and -1 a completely inverse prediction.

φ = (TP*TN - FP*FN) / sqrt( (TP+FP) * (TP+FN) * (TN+FP) * (TN+FN) )
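All of these measures can be derived directly from the four counts of the confusion matrix. Here is a small, stand-alone sketch of that computation (the counts in the example call are made up, and this is not the updated script linked below):

from math import sqrt

def evaluate(tp, tn, fp, fn):
    # Compute the metrics above from the confusion matrix counts.
    # Assumes none of the denominators is zero.
    sens = tp / float(tp + fn)                 # sensitivity
    spec = tn / float(tn + fp)                 # specificity
    return {
        'ACC':  (tp + tn) / float(tp + tn + fp + fn),
        'SENS': sens,
        'SPEC': spec,
        'EFF':  (sens + spec) / 2.0,
        'PPV':  tp / float(tp + fp),
        'NPV':  tn / float(tn + fn),
        'PHI':  (tp * tn - fp * fn) /
                sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }

print evaluate(tp=45, tn=40, fp=10, fn=5)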


Source Code

As I presented in the first post of this series, I've updated the confusion matrix script by adding the new metrics presented above.

View the script at Gist.Github


The Receiver Operating Characteristic (ROC) Curve

The ROC curve was developed during World War II and was extensively used by engineers to detect enemy objects on the battlefield. It became famous and widely used in other areas such as medicine, radiology, etc.  More recently, it has been introduced to the areas of machine learning and data mining.

The main idea behind ROC curves is to analyze the output of classification systems, which is generally continuous.  Because of that, it is necessary to define a cut-off value (a discriminatory threshold) in order to classify and count the number of positive and negative predictions (such as fraudulent or legitimate transactions in the bank transaction example).  Since this threshold can be arbitrarily determined, the best way to compare the performance of different classifiers is to select several cut-off values over the output data and study their effect.

Considering many thresholds, it is possible to calculate a set of pairs (sensitivity, 1 - specificity) which can be plotted as a curve.  This is the ROC curve of the system, where the y-axis (ordinates) represents the sensitivity and the x-axis (abscissas) represents the complement of the specificity (1 - specificity).

Examples of ROC curves are presented below.

Examples of ROC Curves (From: CezarSouza's Blog)

As you can see in the figure above, the higher the area under the ROC curve, the better the system.  This is a standard measure for comparing classifiers, called the area under the ROC curve (AUC). It is evaluated by numerical integration, for example with the trapezoidal rule.
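Just to make the idea concrete, here is a small, self-contained sketch: sweep thresholds over the scores, collect the (1 - specificity, sensitivity) pairs and integrate them with the trapezoidal rule. The sample data is made up, and the library presented next does all of this (plus the plotting) properly.

def roc_points(data):
    # data: list of (actual label 0/1, score) pairs.
    # Returns the (1 - specificity, sensitivity) points sorted along the x-axis.
    pos = sum(1 for label, _ in data if label == 1)
    neg = len(data) - pos
    thresholds = sorted(set(score for _, score in data), reverse=True)
    points = []
    for t in [float('inf')] + thresholds:
        tp = sum(1 for label, score in data if label == 1 and score >= t)
        fp = sum(1 for label, score in data if label == 0 and score >= t)
        points.append((fp / float(neg), tp / float(pos)))
    return sorted(points)

def auc(points):
    # Trapezoidal rule over the ROC points.
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area

sample = [(1, 0.9), (1, 0.8), (0, 0.7), (1, 0.6), (0, 0.4), (0, 0.2)]
print auc(roc_points(sample))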

Now let's move on to the development of a simple ROC curve in Python.  During a data mining project I worked on last year, I decided to implement ROC curves in order to analyze the performance of some classifiers.  This work resulted in an open-source library called PyROC, which you can use inside your own applications for the creation, visualization and analysis of ROC curves.

Source Code

The project is hosted at my personal repository at GitHub.

Feel free to download the source code and use it in your applications.


Using the code


To start using the code in your projects, just create a new ROCData object, passing as an argument a list of tuples containing the actual data, as measured by the experiment, and the predicted data, as given by the classifier.  The actual data must be a dichotomous variable, with only two possible valid values; generally the values 0 and 1 are assumed to represent the two states. The test data given by the classifier must be continuous, with values in the range between the lowest and highest values of the actual data. For instance, if the actual values are 0 and 1, the test data values must lie inside the [0, 1] range.

from pyroc import *
random_sample = random_mixture_model()  # Generate a random sample set
print random_sample
[(1, 0.53543926503331496), (1, 0.50937533997469853), (1, 0.58701681878005862), (1, 0.57043399840000497),
 (1, 0.56229469766270523), (1, 0.6323079028948545), (1, 0.72283523937059946), (1, 0.55079104791257383),
 (1, 0.59841921172330748), (1, 0.63361144887035825)]
 
To compute the area under the curve (AUC) and plot the ROC curve, just call the ROCData object's auc() and plot() methods. You can also pass the desired number of points to use as different cut-off values.

# Example: instance labels (first index) with the decision function score (second index)
# -- the positive class should be +1 and the negative class 0.
roc = ROCData(random_sample)  # Create the ROCData object
roc.auc()  # Get the area under the curve
0.93470000000000053
roc.plot(title='ROC Curve')  # Create a plot of the ROC curve
Some metrics evaluated from the confusion matrix are also available, such as accuracy, specificity, sensitivity, etc.


# The threshold passed as a parameter is the cut-off value (here 0.5).
roc.confusion_matrix(0.5)
{'FP': 18, 'TN': 82, 'FN': 18, 'TP': 82}
roc.confusion_matrix(0.5, True)
         Actual class
        +(1)    -(0)
+(1)    82      18      Predicted
-(0)    18      82      class
{'FP': 18, 'TN': 82, 'FN': 18, 'TP': 82}
r1.evaluateMetrics(r1.confusion_matrix(0.5))
{'ACC': 0.81000000000000005, 'PHI': 0.62012403721240439, 'PPV': 0.81632653061224492, 'SENS': 0.80000000000000004, 'NPV': 0.80392156862745101, 'SPEC': 0.81999999999999995, 'EFF': 0.81000000000000005}


Lastly, if you want to plot two curves on the same chart, that is also possible: just pass a list of ROCData objects. It is useful when you want to compare different classifiers.

x = random_mixture_model()   # First sample set
r1 = ROCData(x)
y = random_mixture_model()   # Second sample set
r2 = ROCData(y)
lista = [r1, r2]
plot_multiple_roc(lista, 'Multiple ROC Curves', include_baseline=True)
 

There is also another way to use the script: stand-alone. Just run the command below in the terminal to see the available options.

python pyroc.py


Demonstration


Let's see some ROC curves plotted in action.


This finishes the second part of the performance analysis of machine learning classifiers. It is a simple introduction, so if you want a better understanding of how ROC curves work and what they mean, please take a look at this website about ROC curves and their applications; it includes many applets for you to experiment with the curves. I hope you have enjoyed this post. The next one in this series will be about the F1-score, also known as the F-measure.