## Wednesday, December 21, 2011

Hi all,

I'd like to share a project that I am developing that it may be useful for anyone who wants to create datasets from mobile location networks.  Specifically, I developed a wrapper in Python for accessing the Foursquare API called PyFoursquare

For anyone who doesn't know what is Foursquare, it is a popular mobile social-location network with more 10.000.000 of users around the world. The idea is that you can share your current location with your friends and as result discover new places, find where your friends are and even check some tips and recommendations about a place and what to do when you arrive there. It is an amazing project with lots of data available for anyone who wants to develop new apps for connect or mine (data mining) its data!

 Foursquare Mobile Application

This Python API is one of the results of my master degree project where I proposed a new architecture for mobile recommenders that fetches reviews from social networks to improve the explanation and the quality of the given recommendations.  I  used this library to collect tips (text reviews) from Foursquare from places at my neighborhood Recife, Brazil.  This API was a little messy, so I decided to clean it up, organize and documment it for publish for the open-source community.

One of advantages of this API is that you can handle each entity from the Foursquare data as Model object. So instead of handling with json dictionaries, I encapsulate the results in the respective models (Venue, Tips, User, etc.) and access its attributes as common object in Python!

I inspired myself at the work of Joshua at Tweepy, which is a Python library for Twitter.  In this version released 0.0.1 I only implemented some API's such as search/venues,  venue_details and venue_tips.  In future releases I pretend to add more models and support for more API methods available at Foursquare.

How can you use it at your project ?

It is simple! Just install it by downloading at the Github's home project, extract the source from the tar.gz and  at the directory of the project run the command below:

\$ python setup.py install

or the easier way is to install by the command easy_install:

\$ easy_install pyfoursquare

After that, you can  simple test by running the command below at your Python Shell

>>> import pyfoursquare

Now let's see how you can get started with the PyFoursquare:

First you need to create an application at Foursquare. The link is this.  There  you can also get further information about the API, another libraries and several applications using the Foursquarw API's.

 The Foursquare Developer's Settings

After creating your application, you must get the client_id and your client_secret. Those keys will be important to connect the app to the users' accounts.  Foursquare uses the secure authentication based on OAuth2.  In PyFoursquareAPI, you won't need to handle with all steps provided by OAuth2.  It already encapsulates all the steps and handshakes between your app and Foursquare servers. \m/

Below the  code you must write for authenticate an user to connect to your app:

After the user  authorized, you now can instantiate the PyFoursquare API.  It will give you access to the Foursquare API methods.  I implemented several methods, but feel free to add new ones! Don't forget to submit the final results as pull requests at the project's repository at Github.

In this example I fetched a venue by giving as input the latitude and longitude and querying for the place with the name 'Burburinho'.  Burburinho is a popular bar nearby where I work!

Source code

Now you can access the result and access the Venue as a Python Object. All elements of the Venue are represented as attributes of the object Venue at PyFoursquare. The goal is to make easier the life of the developer when he access the Foursquare API by parsing all the JSON (the result) and placing in the correct model for him.

I expect you enjoyed this API. Feel free to use it at your applications or research!  I'd like to thank the Foursquare team for expose their data by providing those API's!  For data mining researchers instered in mobile location data, it is a mine of gold!

Further information about PyFoursquare, you can find here.

Feel free to give sugestions, improvements and comments,

Regards,

Marcel Caraciolo

## Monday, December 19, 2011

3Hi all,

This month I was studying about information retrieval and text mining, specially how to convert the textual representation of information into a Vector Space Model (VSM).  The VSM is an algebraic model representing the importance of a term (tf-idf) or even the absence or presence (Bag of Words) of it in a document. I'd like to mention the excellent post from the researcher Christian Perone at his blog Pyevolve about Machine learning and Text Mining with TF-IDF, a great post to read.

I decided in this post to be shorter and give some examples using Python . I expect at the end of this post you feel confortamble to use tf-idf at your tasks handling with text mining.

By the way, I extremely recommend you to check the scikit.learn machine learning toolkit. There is a whole package to work with text classification, including TF-IDF with Python!

What is TF-IDF ?

Term Frequency - Inverse Document Frequency is a weighting scheme that is commonly used in information retrieval tasks. The goal is to model each document into a vector space, ignoring the exact ordering of the words in the document while retaining information about the occurrences of each word.

It is composed by two terms: one first computes the normalized Term Frequency, which is the number of times a word appears in a documnet, divided by the total number of words in that document. Then, the second term is the Inverse Document Frequency, which is computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the term ti appears. Or, in symbols:

and

The TF-IDF gives how important is a word to a document in a collection, since it takes in consideration not only the isolated term but also the term within the document collection. The intuition is that a term that occurs frequently in many documents is not a good discriminator ( why emphasize a term which is almost present in the entire corpus of your documents ?)  So it will scale down the frequent terms while scaling up the rare terms; for instance, a term that occurs 10 times more than another isn't 10 times more important thant it.

For computing the TF-IDF weights for each document in the corpus, it is required in the corpus a series of steps:  1) Tokenize the corpus  2)  Model the Vector  Space  and 3) Compute the TF-IDF weight for each document in the corpus.

Let's going through each step:

Tokenization

First we need to tokenize the text. To do this, we can use the NLTK library which is a collection of natural language processing algorithms written in Python. The process of tokenizing the documents in the corpous is a two steps:  First the text is splint into sentences, and then the sentences are split into the individual words. It is important to notice that there are several words that are not relevant, that is, terms like "the, is, at, on", etc...  aren't going to help us, so in the information extraction, we ignore them. Those words are commonly called stop words and they are present in almost all documents, so it is not relevant for us. In portuguese we also have those stop words such as (a, os , as , os, um , umas, que, etc.)

So considering our example below:

We will tokenize this collection of documents and represent them as vectors (rows) of a matrix with |D| x F shape, where |D|  is the cardinality of the document space, or how many documents we have and the F is the number of features, in our example it is represented by the vocabulary size.

So the matrix representation of our vectors above is:

As you have noticed, these matrices representing the term frequencies (tf) tend to be very sparse (lots of  zero-elements),  so you will usually see the representation of these matrices as sparse matrices. The code shown below will tokenize each document in the corpus and compute the term frequencies.

Model the Vector Space

Now that each of the documents in the corpus has been tokenized, the next step is to compute the document frequency quantity, that is, for each term, how many documents that term appears in. Before going to IDF, it is important to normalize the term-frequencies. Why ?  Imagine that we have a repeated term in document with porpuse of improving its ranking on an Information Retrieval System or even create a bias torwards long documents, making them look more important than they are just because of the high frequency of the term in the document. By normalizing the TF vector we can overcome this problem.
The code.

Compute the TF-IDF

Now that you saw how the vector normalization was applied, we will now have to compute the second term of tf-idf: the inverse document frequency. The code is provided below:

The TF-IDF is the product between the TF and IDF.  So a high weight of the tf-idf  is reached when you have a high term frequency (tf) in the given document and low document frequency of the term in the whole collection. Now let's see the tf-idf computed for each term present in the vector space.

The code.

Putting everything together, the following code will compute the TF-IDF weights for each document. And the result matrix it will be:

A row of this matrix would be:

I ommited the zero-values elements of the row.

If we would decide to check the most relevant words for this place, by using the tf-idf I could see that the place has a nice hot chocolate drink (0.420955 <= chocolate quente ótimo), the soft drink nega maluca is also delicious (0.315716 - nega maluca uma delicia),  its Cheese bun is also quite good (0.252573 - pao de queijo muito bom).

And that is how we comput  our $M_{tf\mbox{-}idf}$ matrix.  You can take a look at this link and this one to know how to use it with GenSim and Scikit.Learn respectively.

That's all,  I hope that  you enjoyed this article and help more people to know how to implement the tf-idf weight to mine your collection of texts.  Feel free to comment and make suggestions.

The source code of this example is also available.

Regards,

Marcel Caraciolo

## Wednesday, December 7, 2011

Hi all,

I am announcing the launch of the website PyCursos. Pycursos is a online-course and training platform for anyone who wants to learn Python programming language and its related extensions. The first course is being already announced, which is the Scientific Computing Programming with Python,  with me as teacher.

The goal of the course is to teach scientific computing, specially on how to solve scientific problems in your daily routine by using the packages that Python provides for free: Scipy, Numpy and Matplotlib.

With those tools, the student will learn how to reproduce their problems into a simple and legible code and to use the helpful tools to plot graphs, write reports, mathematics optmization, matrices manipulation, linear algebra and more.

The requirement to attend the course is only the student be motivated to learn and have some experience with programming.  The course will start in 2012, January in on-line mode, where the students will apply and follow a schedule of video-classes on-line and review exercises regularly.

We have also the option of the in-company training, where the student may watch the classes in a classroom with another students. In both modes the students, at the end of the course, will receive a conclusion certificate.

It is important to tell that the course now is all  in Portuguese! Sorry for anyone for another countries!