Machine Learning with Text Data — Text Processing

Swagata Ashwani
2 min read · Jun 1, 2022
Photo by Waldemar Brandt on Unsplash

Natural Language Processing, or NLP for short, is defined as the automatic manipulation of natural language, like speech and text, by software.

In other words, it can be described as Machine Learning with Text data.

Machine Learning models work only with well-defined numerical data. Hence, the first step with text data is to convert it into a numerical format. Before that conversion, the text usually goes through a few pre-processing steps, described in order below.

Text Pre-processing

1. Tokenization

In this step, the text data is split into small units called tokens, typically on white space and punctuation.

Example:

Input: “I like to dance.”

Tokenization output: “I”, “like”, “to”, “dance”
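
To make this concrete, here is a minimal sketch of tokenization using only Python's built-in `re` module. The function name `tokenize` is just an illustrative choice, and the single regular expression is an assumption about what counts as a token. Libraries such as NLTK or spaCy ship far more careful tokenizers.

```python
import re

def tokenize(text):
    # \w+ matches runs of letters, digits, and underscores, so white space
    # and punctuation act as separators and are dropped from the output.
    return re.findall(r"\w+", text)

tokens = tokenize("I like to dance.")
print(tokens)  # ['I', 'like', 'to', 'dance']
```

Note that this simple pattern also splits contractions: "don't" would come out as "don" and "t", which is one reason real tokenizers use more elaborate rules.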

2. Stop Word Removal

The next step in the processing is stop word removal.

Stop words are words that frequently appear in texts, but they don’t contribute too much to the overall meaning.
Common stop words: “a”, “the”, “so”, “is”, “it”, “at”, “in”, “this”, “there”, “that”, “my”
Example:

Input: “There is a restaurant near my house”

Stop word removal output: “restaurant near house”
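
A small sketch of this step, assuming the stop word list is just the handful of common words listed above (real stop word lists, like the ones bundled with NLP libraries, are much longer). The names `STOP_WORDS` and `remove_stop_words` are hypothetical helpers for illustration.

```python
# Stop words taken from the short list in this article (an illustrative subset).
STOP_WORDS = {"a", "the", "so", "is", "it", "at", "in",
              "this", "there", "that", "my"}

def remove_stop_words(tokens):
    # Compare in lowercase so that "There" matches the entry "there".
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "There is a restaurant near my house".split()
print(remove_stop_words(tokens))  # ['restaurant', 'near', 'house']
```

Storing the stop words in a set keeps each lookup fast, which matters once the list grows to a few hundred words.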

3. Stemming

Stemming applies a set of rules that cut a word down to a shorter substring, called the stem, which usually carries a more general meaning.
The goal is to remove word affixes (particularly suffixes) such as “s”, “es”, “ing”, “ed”, etc.
Example:

“playing”, “played”, and “plays” all become “play”
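
The idea of suffix stripping can be sketched in a few lines. This is a deliberately naive toy, not the Porter stemmer or any other published algorithm: the suffix list, the minimum stem length of three characters, and the name `naive_stem` are all assumptions made for illustration.

```python
# Suffixes to strip, checked in this order (longer ones first).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain,
        # so short words like "is" or "red" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["playing", "played", "plays", "play"]
print([naive_stem(w) for w in words])  # ['play', 'play', 'play', 'play']
```

Because the rules are purely mechanical, the result is not always a real word. For instance, this sketch turns "dances" into "danc". That is typical of stemming in general and is one of the reasons lemmatization exists.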

4. Lemmatization

Lemmatization is similar to stemming but more advanced. It takes linguistics into account through morphological analysis of the words. To do so, it needs detailed dictionaries that the algorithm can look through to link each word form back to its lemma.

“am”, “is”, “are” all become “be”
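
The dictionary lookup at the heart of lemmatization can be sketched as follows. The tiny `LEMMA_DICT` here is a hand-made, hypothetical stand-in for the detailed dictionaries mentioned above, and `lemmatize` is an illustrative name. A real lemmatizer also uses context such as the part of speech.

```python
# A hand-made mini dictionary mapping word forms to their lemma.
# Real lemmatizers rely on much larger linguistic resources.
LEMMA_DICT = {
    "am": "be", "is": "be", "are": "be", "was": "be", "were": "be",
    "mice": "mouse",
}

def lemmatize(word):
    # Look up the lowercase form; unknown words are returned unchanged.
    return LEMMA_DICT.get(word.lower(), word)

print([lemmatize(w) for w in ["am", "is", "are"]])  # ['be', 'be', 'be']
```

Notice that no suffix rule could turn "am" into "be" or "mice" into "mouse". Irregular forms like these are exactly where a dictionary-based approach does better than stemming.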

These are the basic processing steps when dealing with text data. The next step is converting the clean, processed data into numbers. Stay tuned for the next article, which dives into that part. Happy pre-processing!

Swagata Ashwani

I love talking Data! Data Scientist with a passion for finding optimized solutions in the AI space. Follow me here: https://www.linkedin.com/in/swagata-ashwani/