Machine Learning with Python: Meeting TF-IDF for Text Mining

Monday, December 19, 2011

Hi all,

This month I have been studying information retrieval and text mining, especially how to convert the textual representation of information into a Vector Space Model (VSM). The VSM is an algebraic model that represents each document as a vector whose components capture the importance of a term (tf-idf) or simply its presence or absence (Bag of Words) in the document. I'd like to mention the excellent post by the researcher Christian Perone on his blog Pyevolve about machine learning and text mining with TF-IDF; it is well worth reading.

I decided to keep this post shorter and give some examples using Python. I hope that by the end of it you feel comfortable using tf-idf in your own text mining tasks.

By the way, I strongly recommend checking out the scikit.learn machine learning toolkit. It has a whole package for working with text classification, including TF-IDF, in Python!

What is TF-IDF?

Term Frequency - Inverse Document Frequency is a weighting scheme that is commonly used in information retrieval tasks. The goal is to model each document into a vector space, ignoring the exact ordering of the words in the document while retaining information about the occurrences of each word.

It is composed of two terms. The first is the normalized Term Frequency, which is the number of times a word appears in a document divided by the total number of words in that document. The second is the Inverse Document Frequency, which is computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term t_i appears. Or, in symbols:
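
The original formula images are not available in this copy of the post. A reconstruction from the verbal definitions above might look like the following, where n_{i,d} (the count of term t_i in document d) and D (the set of documents in the corpus) are notation assumed here for illustration:

```latex
\mathrm{tf}(t_i, d) = \frac{n_{i,d}}{\sum_{k} n_{k,d}}
\qquad
\mathrm{idf}(t_i) = \log \frac{|D|}{|\{ d \in D : t_i \in d \}|}
\qquad
\mathrm{tf\mbox{-}idf}(t_i, d) = \mathrm{tf}(t_i, d) \times \mathrm{idf}(t_i)
```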


TF-IDF measures how important a word is to a document in a collection, since it takes into consideration not only the isolated term but also the term within the document collection. The intuition is that a term that occurs frequently in many documents is not a good discriminator (why emphasize a term that is present in almost the entire corpus?). So it scales down the frequent terms while scaling up the rare ones; for instance, a term that occurs 10 times more often than another isn't 10 times more important than it.

Computing the TF-IDF weights for each document in the corpus requires a series of steps: 1) tokenize the corpus, 2) model the vector space, and 3) compute the TF-IDF weight for each document in the corpus.

Let's go through each step:


First we need to tokenize the text. To do this, we can use the NLTK library, which is a collection of natural language processing algorithms written in Python. Tokenizing the documents in the corpus is a two-step process: first the text is split into sentences, and then the sentences are split into individual words. It is important to notice that several words are not relevant: terms like "the", "is", "at" and "on" aren't going to help us, so in the information extraction we ignore them. These words are commonly called stop words; they are present in almost all documents, so they carry little information for us. Portuguese also has stop words, such as "a", "os", "as", "um", "umas" and "que".
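
As a rough illustration of the two-step tokenization and stop-word filtering described above, here is a minimal sketch. It uses only the standard library as a stand-in for NLTK, and the tiny `STOP_WORDS` set, the helper names and the sample sentences are assumptions made for this example, not the original code of this post.

```python
import re

# Tiny illustrative stop-word set (an assumption; NLTK ships much fuller lists).
STOP_WORDS = {"the", "is", "at", "on", "a", "an", "and", "of", "in", "to"}


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Split into sentences, then into lowercase words, dropping stop words."""
    tokens = []
    for sentence in split_sentences(text):
        for word in re.findall(r"\w+", sentence.lower()):
            if word not in STOP_WORDS:
                tokens.append(word)
    return tokens


# Hypothetical sample documents, not the corpus used in the post.
docs = [
    "The hot chocolate is great. The cheese bun is good!",
    "The cake at this place is delicious.",
]
tokenized_docs = [tokenize(doc) for doc in docs]
print(tokenized_docs)
```

A real sentence splitter has to deal with abbreviations, numbers and so on, which is one reason to prefer a library tokenizer over this regular expression.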

So considering our example below:

We will tokenize this collection of documents and represent them as vectors (rows) of a matrix of shape |D| x F, where |D| is the cardinality of the document space (how many documents we have) and F is the number of features, which in our example is the vocabulary size.

So the matrix representation of our vectors above is:

As you may have noticed, these matrices representing the term frequencies (tf) tend to be very sparse (lots of zero elements), so you will usually see them stored as sparse matrices. The code shown below will tokenize each document in the corpus and compute the term frequencies.
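
The original listing is not preserved in this copy, so the following is only a sketch of how the |D| x F term-frequency matrix could be built from already tokenized documents. The function names and the sample tokens are hypothetical, and a dense list of lists is used for readability even though a sparse representation would be preferred for a real corpus.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for doc in tokenized_docs for term in doc})


def term_frequency_matrix(tokenized_docs, vocabulary):
    """Return a |D| x F list of lists holding raw term counts."""
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return matrix


# Hypothetical tokenized documents (stop words already removed).
tokenized_docs = [
    ["hot", "chocolate", "great", "chocolate"],
    ["cheese", "bun", "good"],
    ["chocolate", "cake", "good"],
]
vocabulary = build_vocabulary(tokenized_docs)
tf_matrix = term_frequency_matrix(tokenized_docs, vocabulary)
for row in tf_matrix:
    print(row)
```

Each row corresponds to one document and each column to one vocabulary term, in the sorted order returned by `build_vocabulary`.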

Model the Vector Space

Now that each of the documents in the corpus has been tokenized, the next step is to compute the document frequency, that is, for each term, how many documents that term appears in. Before going to IDF, it is important to normalize the term frequencies. Why? Imagine a document that repeats a term on purpose to improve its ranking in an information retrieval system. Raw counts also create a bias towards long documents, making them look more important than they are just because of the high frequency of the term in the document. By normalizing the TF vector we can overcome this problem.
The code.
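
Since the original listing is missing here, this is a minimal sketch of two ways the TF vector could be normalized. `normalize_by_length` follows the definition given earlier in this post (count divided by the total number of words in the document), while `l2_normalize` is a commonly used alternative included for comparison. Both function names are assumptions made for this example.

```python
import math


def normalize_by_length(tf_vector):
    """Divide each count by the total number of terms in the document."""
    total = sum(tf_vector)
    return [count / total if total else 0.0 for count in tf_vector]


def l2_normalize(tf_vector):
    """Scale the vector to unit Euclidean length (a common alternative)."""
    norm = math.sqrt(sum(count * count for count in tf_vector))
    return [count / norm if norm else 0.0 for count in tf_vector]


tf_vector = [0, 3, 4]  # hypothetical raw counts for one document
print(normalize_by_length(tf_vector))
print(l2_normalize(tf_vector))
```

Either way, a document that repeats a term many times no longer gets a larger weight just for being longer.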

Compute the TF-IDF

Now that you have seen how vector normalization is applied, we have to compute the second term of tf-idf: the inverse document frequency. The code is provided below:
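
The original listing is not preserved in this copy. A minimal sketch of the inverse document frequency, following the definition given earlier (logarithm of the number of documents divided by the number of documents containing the term), could look like this. The signature `idf(word, list_of_docs)` mirrors the one quoted in the comments below; here `list_of_docs` is assumed to be a list of token lists, and returning 0.0 for an unseen word is a choice made for this sketch.

```python
import math


def document_frequency(word, list_of_docs):
    """Count the documents (token lists) that contain the word at least once."""
    return sum(1 for doc in list_of_docs if word in doc)


def idf(word, list_of_docs):
    """Return log(N / df), or 0.0 for a word that appears in no document."""
    df = document_frequency(word, list_of_docs)
    if df == 0:
        return 0.0
    return math.log(len(list_of_docs) / df)


# Hypothetical tokenized documents.
list_of_docs = [
    ["hot", "chocolate", "great"],
    ["cheese", "bun", "good"],
    ["chocolate", "cake", "good"],
]
for word in ("chocolate", "cheese"):
    print(word, idf(word, list_of_docs))
```

A term that appears in every document gets an IDF of log(1) = 0, which is exactly the "not a good discriminator" case discussed above.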

The TF-IDF weight is the product of TF and IDF. So a high tf-idf weight is reached when a term has a high frequency (tf) in the given document and a low document frequency in the whole collection. Now let's see the tf-idf computed for each term present in the vector space.

The code.
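
The original listing is missing from this copy, so what follows is only a sketch of the whole computation under the definitions used in this post: length-normalized TF multiplied by log(N / df). The function name `tf_idf_matrix` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_matrix(tokenized_docs):
    """Return (vocabulary, matrix) where matrix[d][t] is the tf-idf weight."""
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: each document counts at most once per term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}

    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        row = [
            (counts[term] / total) * idf[term] if total else 0.0
            for term in vocabulary
        ]
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical tokenized documents.
tokenized_docs = [
    ["hot", "chocolate", "great", "chocolate"],
    ["cheese", "bun", "good"],
    ["chocolate", "cake", "good"],
]
vocabulary, matrix = tf_idf_matrix(tokenized_docs)

# Show the non-zero weights of the first document, highest first.
weights = [(w, term) for term, w in zip(vocabulary, matrix[0]) if w > 0]
for weight, term in sorted(weights, reverse=True):
    print(f"{weight:.6f} {term}")
```

Sorting the non-zero weights of a row is a simple way to list the most relevant terms of a document.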

Putting everything together, the following code will compute the TF-IDF weights for each document. The resulting matrix will be:

A row of this matrix would be:

I omitted the zero-valued elements of the row.

If we want to check the most relevant words for this place, the tf-idf weights show that it has a nice hot chocolate (0.420955 - chocolate quente ótimo, "great hot chocolate"), that the nega maluca (a chocolate cake) is also delicious (0.315716 - nega maluca uma delicia, "nega maluca, a delight"), and that its cheese bun is also quite good (0.252573 - pao de queijo muito bom, "very good cheese bun").

And that is how we compute our M_{tf\mbox{-}idf} matrix. You can take a look at this link and this one to see how to do the same with GenSim and Scikit.Learn respectively.

That's all. I hope you enjoyed this article and that it helps more people learn how to implement the tf-idf weighting to mine their own collections of texts. Feel free to comment and make suggestions.

The source code of this example is also available.


Marcel Caraciolo


  1. Marcelo, thanks a lot for your post. We will use it for teaching the young computer engineers.

  2. I don't understand why, in def idf(word, list_of_docs):, the parameter is named list_of_docs when the variable vocabulary is passed as input?

    1. Yeah, I was wondering the same thing...

  3. Thanks for this example. Could you reupload the 1st and 2nd images? I think those are formulas.

    Nice work

    1. The 1st and 2nd images are just the formulas for TF and IDF.
