## Monday, December 19, 2011

3Hi all,

This month I was studying about information retrieval and text mining, specially how to convert the textual representation of information into a Vector Space Model (VSM).  The VSM is an algebraic model representing the importance of a term (tf-idf) or even the absence or presence (Bag of Words) of it in a document. I'd like to mention the excellent post from the researcher Christian Perone at his blog Pyevolve about Machine learning and Text Mining with TF-IDF, a great post to read.

I decided in this post to be shorter and give some examples using Python . I expect at the end of this post you feel confortamble to use tf-idf at your tasks handling with text mining.

By the way, I extremely recommend you to check the scikit.learn machine learning toolkit. There is a whole package to work with text classification, including TF-IDF with Python!

What is TF-IDF ?

Term Frequency - Inverse Document Frequency is a weighting scheme that is commonly used in information retrieval tasks. The goal is to model each document into a vector space, ignoring the exact ordering of the words in the document while retaining information about the occurrences of each word.

It is composed by two terms: one first computes the normalized Term Frequency, which is the number of times a word appears in a documnet, divided by the total number of words in that document. Then, the second term is the Inverse Document Frequency, which is computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the term ti appears. Or, in symbols:

and

The TF-IDF gives how important is a word to a document in a collection, since it takes in consideration not only the isolated term but also the term within the document collection. The intuition is that a term that occurs frequently in many documents is not a good discriminator ( why emphasize a term which is almost present in the entire corpus of your documents ?)  So it will scale down the frequent terms while scaling up the rare terms; for instance, a term that occurs 10 times more than another isn't 10 times more important thant it.

For computing the TF-IDF weights for each document in the corpus, it is required in the corpus a series of steps:  1) Tokenize the corpus  2)  Model the Vector  Space  and 3) Compute the TF-IDF weight for each document in the corpus.

Let's going through each step:

Tokenization

First we need to tokenize the text. To do this, we can use the NLTK library which is a collection of natural language processing algorithms written in Python. The process of tokenizing the documents in the corpous is a two steps:  First the text is splint into sentences, and then the sentences are split into the individual words. It is important to notice that there are several words that are not relevant, that is, terms like "the, is, at, on", etc...  aren't going to help us, so in the information extraction, we ignore them. Those words are commonly called stop words and they are present in almost all documents, so it is not relevant for us. In portuguese we also have those stop words such as (a, os , as , os, um , umas, que, etc.)

So considering our example below:

We will tokenize this collection of documents and represent them as vectors (rows) of a matrix with |D| x F shape, where |D|  is the cardinality of the document space, or how many documents we have and the F is the number of features, in our example it is represented by the vocabulary size.

So the matrix representation of our vectors above is:

As you have noticed, these matrices representing the term frequencies (tf) tend to be very sparse (lots of  zero-elements),  so you will usually see the representation of these matrices as sparse matrices. The code shown below will tokenize each document in the corpus and compute the term frequencies.

Model the Vector Space

Now that each of the documents in the corpus has been tokenized, the next step is to compute the document frequency quantity, that is, for each term, how many documents that term appears in. Before going to IDF, it is important to normalize the term-frequencies. Why ?  Imagine that we have a repeated term in document with porpuse of improving its ranking on an Information Retrieval System or even create a bias torwards long documents, making them look more important than they are just because of the high frequency of the term in the document. By normalizing the TF vector we can overcome this problem.
The code.

Compute the TF-IDF

Now that you saw how the vector normalization was applied, we will now have to compute the second term of tf-idf: the inverse document frequency. The code is provided below:

The TF-IDF is the product between the TF and IDF.  So a high weight of the tf-idf  is reached when you have a high term frequency (tf) in the given document and low document frequency of the term in the whole collection. Now let's see the tf-idf computed for each term present in the vector space.

The code.

Putting everything together, the following code will compute the TF-IDF weights for each document. And the result matrix it will be:

A row of this matrix would be:

I ommited the zero-values elements of the row.

If we would decide to check the most relevant words for this place, by using the tf-idf I could see that the place has a nice hot chocolate drink (0.420955 <= chocolate quente ótimo), the soft drink nega maluca is also delicious (0.315716 - nega maluca uma delicia),  its Cheese bun is also quite good (0.252573 - pao de queijo muito bom).

And that is how we comput  our $M_{tf\mbox{-}idf}$ matrix.  You can take a look at this link and this one to know how to use it with GenSim and Scikit.Learn respectively.

That's all,  I hope that  you enjoyed this article and help more people to know how to implement the tf-idf weight to mine your collection of texts.  Feel free to comment and make suggestions.

The source code of this example is also available.

Regards,

Marcel Caraciolo

#### 24 comments:

1. Marcelo,thanks a lot for your post. We will use it for teaching the young computer engineers.

2. i don't understand why def idf(word, list_of_docs): the variable is named list_of_doc when the variable vocabulary is input?

1. Yeah, I was wondering the same thing...

3. Thanks for this example. Could you reupload 1st and 2nd images. I think those are formulas.

Nice work

1. 1st and 2nd images are the just the formula for TF and IDF

4. These post were really helpful in understanding the meaning of text mining. It help in putting the unstructured text in a structured form. you can check the website LOGINWORKS http://www.loginworks.com/data-mining/ , to get the accurate data!!

5. Welcome to Wiztech Automation - Embedded System Training in Chennai. We have knowledgeable Team for Embedded Courses handling and we also are after Job Placements offer provide once your Successful Completion of Course. We are Providing on Microcontrollers such as 8051, PIC, AVR, ARM7, ARM9, ARM11 and RTOS. Free Accommodation, Individual Focus, Best Lab facilities, 100% Practical Training and Job opportunities.

Embedded System Training in chennai
Embedded System Training Institute in chennai
Embedded Training in chennai
Embedded Course in chennai
Best Embedded System Training in chennai
Best Embedded System Training Institute in chennai
Best Embedded System Training Institutes in chennai
Embedded Training Institute in chennai
Embedded System Course in chennai
Best Embedded System Training in chennai

6. WIZTECH Automation, Anna Nagar, Chennai, has earned reputation offering the best automation training in Chennai in the field of industrial automation. Flexible timings, hands-on-experience, 100% practical. The candidates are given enhanced job oriented practical training in all major brands of PLCs (AB, Keyence, ABB, GE-FANUC, OMRON, DELTA, SIEMENS, MITSUBISHI, SCHNEIDER, and MESSUNG)

PLC training in chennai
Automation training in chennai
Best plc training in chennai
PLC SCADA training in chennai
Process automation training in chennai
Final year eee projects in chennai
VLSI training in chennai

7. Embedded system training: Wiztech Automation Provides Excellent training in embedded system training in Chennai - IEEE Projects - Mechanical projects in Chennai. Wiztech provide 100% practical training, Individual focus, Free Accommodation, Placement for top companies. The study also includes standard microcontrollers such as Intel 8051, PIC, AVR, ARM, ARMCotex, Arduino, etc.

Embedded system training in chennai
Embedded Course training in chennai
Matlab training in chennai
Android training in chennai
LabVIEW training in chennai
Robotics training in chennai
Oracle training in chennai
Final year projects in chennai
Mechanical projects in chennai
ece projects in chennai

8. I like the way you start and then conclude your thoughts. Thanks for this nice information.

Data Analytics Courses in Chennai

9. Hi admin thanks for sharing informative article on hadoop technology. In coming years, hadoop and big data handling is going to be future of computing world. This field offer huge career prospects for talented professionals. Thus, taking Hadoop & Spark Training in Hyderabad will help you to enter big data hadoop & spark technology.

10. Hey how its possible to learn python to assemble machine language through it

Android training in chennai | Oracle dba Training in Chennai | Python Training in chennai

11. tf–idf, short for term frequency–inverse document frequency, is a numerical ... PHP Training in Chennai |
Pega Training in Chennai

12. Great and Useful Article.

Online Java Training | Online Java Training

13. Wiztech Automation Solutions is that the best PLC SCADA Marketing Communication Management: A Holistic Approach for Increased Profitability Training institute in Chennai and it's generating variety of PLC Engineers through its robust pillars like quality education, effective coaching, intimate with staffs to guide each student, spacious laboratories, Un-limited sensible sessions to boost their data as per the conditions of Automation trade.

PLC, SCADA training in chennai
PLC training in chennai
Automation training in chennai

14. Useful Post, Thanks for sharing with us

Best engineering colleges in jaipur

15. PHP course is very important nowadays.so learn language quickly & connect a link https://goo.gl/z0JFB7
https://goo.gl/WzKGr2
https://goo.gl/pO7P7J
https://goo.gl/Tshw0J
https://goo.gl/HZJgxn

16. TestComplete can be act as a backbone of the Web Automation Tool.| https://www.gangboard.com/software-testing-training/test-complete-training
https://www.gangboard.com/crm-training/salesforce-admin-training
https://www.gangboard.com/business-intelligence-training/msbi-training
https://www.gangboard.com/cloud-computing-training/vmware-training

17. Thank you so much for sharing. it’s useful for me....| Dot Net Training in Annanagar

Dot Net Training

18. Thank you for the programs

Python Training in Chennai