Text Vectorization using Bag of Words and TF-IDF

Swagata Ashwani
Jun 2, 2022

Photo by Markus Spiske on Unsplash

In the previous article, we talked about basic text processing for text data. For any machine learning model, input has to be in numerical format. Hence, once we pre-process the text, the next step is to vectorize it, i.e., convert the text into a numerical format that can be fed to the machine learning model.

Bag of Words

The first method of converting text into numbers is called Bag of Words.

How does it work?

The Bag of Words model creates a vocabulary from all the words in the document/corpus.

Next, it counts the occurrences of words in one of three ways:

  1. Binary (present or not)
  2. Word count
  3. Frequencies
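To make the three options concrete, here is a minimal sketch that applies each of them to a single sentence. The helper names (tokenize, count_modes) are made up for this illustration, and the tokenizer (lowercase, letters only) is a simplifying assumption. "Frequency" is taken here to mean count divided by the number of words in the sentence, which is one common convention.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def count_modes(text):
    """Return (binary, counts, frequencies) for one piece of text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    # Binary: 1 if the word is present at all.
    binary = {word: 1 for word in counts}
    # Frequency (assumed convention): count / number of tokens in the text.
    freq = {word: c / total for word, c in counts.items()}
    return binary, dict(counts), freq

binary, counts, freq = count_modes("It is not a cat, it is a dog.")
print("binary:   ", binary)
print("counts:   ", counts)
print("frequency:", freq)
```

The word "it" occurs twice among the nine words, so it gets 1 in binary mode, 2 in word-count mode and 2/9 in frequency mode.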

Let’s take an example to understand how Bag of Words works with word counts.

Let’s say my document/corpus has the following three sentences.

  1. It is a dog.
  2. It is a cat.
  3. It is not a cat, it is a dog.

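After lowercasing and dropping punctuation, the vocabulary is: it, is, a, dog, cat, not. The Bag of Words word count for each sentence will look like this:

  1. it: 1, is: 1, a: 1, dog: 1, cat: 0, not: 0
  2. it: 1, is: 1, a: 1, dog: 0, cat: 1, not: 0
  3. it: 2, is: 2, a: 2, dog: 1, cat: 1, not: 1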
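The word counts for this example can be reproduced with a few lines of Python. This is only a sketch: the tokenizer (lowercase, letters only) and the names corpus, vocabulary and count_matrix are assumptions made for the illustration.

```python
import re
from collections import Counter

corpus = [
    "It is a dog.",
    "It is a cat.",
    "It is not a cat, it is a dog.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary: every distinct word, in order of first appearance.
vocabulary = list(dict.fromkeys(tok for doc in tokenized for tok in doc))

# One row per sentence, one column per vocabulary word.
doc_counts = [Counter(doc) for doc in tokenized]
count_matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row is the vector for one sentence. Libraries such as scikit-learn offer a CountVectorizer for the same job, but note that its default tokenizer ignores single-character words such as "a", so its vocabulary for this corpus would be smaller.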

Another way is to use frequencies, also known as term frequency.

Term Frequency

Term frequency (tf) increases the weight of words that occur often in a document. Applied to the example, each raw count in the Bag of Words word-count model above is replaced by the corresponding frequency.

Inverse Document Frequency

Inverse document frequency (idf) decreases the weights of commonly used words and increases the weights of rare words in the vocabulary.

idf(term) = log(no. of documents / (no. of documents containing the term + 1)) + 1

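Using the above equation, the idf values can be calculated. The word "it" appears in all 3 documents, so with the natural logarithm:

idf(it) = log(3 / (3 + 1)) + 1 = log(3/4) + 1 ≈ 0.71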

and similarly for all the other terms in the vocabulary.
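Here is a small sketch that evaluates the idf equation for every word in the three example sentences. It assumes the natural logarithm (the article does not state the base) and the same simple tokenizer as before. The function name idf and the variables are illustrative only.

```python
import math
import re

corpus = [
    "It is a dog.",
    "It is a cat.",
    "It is not a cat, it is a dog.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

# For document frequency we only care whether a word occurs in a document.
doc_sets = [set(tokenize(doc)) for doc in corpus]
n_docs = len(doc_sets)
vocabulary = sorted(set().union(*doc_sets))

def idf(term):
    # df = number of documents containing the term.
    df = sum(term in words for words in doc_sets)
    # The article's equation, with the natural log as an assumption.
    return math.log(n_docs / (df + 1)) + 1

for term in vocabulary:
    print(f"{term}: {idf(term):.3f}")
```

Under these assumptions, words found in all three sentences ("it", "is", "a") get about 0.712, words found in two ("dog", "cat") get exactly 1.0, and the rarest word ("not") gets about 1.405, so rarer words do end up with larger weights.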

Term Frequency Inverse Document Frequency

Term frequency-inverse document frequency (tf-idf) combines term frequency and inverse document frequency:

tf-idf(term, doc) = tf(term, doc) * idf(term)

The key intuition motivating tf-idf is that the importance of a term is inversely related to its frequency across documents.

tf gives us information on how often a term appears in a document.

idf gives us information about the relative rarity of a term in the collection of documents. By multiplying these values together we can get our final tf-idf value.
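Putting the two pieces together, the sketch below multiplies tf by idf for each word of the three example sentences. As before, this is an illustration under stated assumptions rather than a reference implementation: tf is taken as count divided by sentence length, idf uses the article's equation with the natural logarithm, and all function names are made up for the example.

```python
import math
import re

corpus = [
    "It is a dog.",
    "It is a cat.",
    "It is not a cat, it is a dog.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(tokenized)

def tf(term, tokens):
    # Assumed convention: occurrences of the term / words in the document.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Article's equation with the natural log (assumption).
    df = sum(term in tokens for tokens in tokenized)
    return math.log(n_docs / (df + 1)) + 1

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

for number, tokens in enumerate(tokenized, start=1):
    scores = {t: round(tf_idf(t, tokens), 3) for t in dict.fromkeys(tokens)}
    print(number, scores)
```

In the first sentence, "dog" scores 0.25 while "it" scores about 0.178, even though both occur once: the idf factor pushes down the word that appears in every sentence.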

The key difference between Bag of Words and tf-idf is that the former does not incorporate any sort of inverse document frequency (idf) and is only a frequency count (tf).

Happy Text Vectorizing!

Written by Swagata Ashwani

I love talking Data! Data Scientist with a passion for finding optimized solutions in the AI space. Follow me here: https://www.linkedin.com/in/swagata-ashwani/
