Machine Learning with Text Data — Text Processing
Natural Language Processing, or NLP for short, is defined as the automatic manipulation of natural language, like speech and text, by software.
In other words, it can be described as Machine Learning with text data.
Machine Learning models work only with well-defined numerical data, so the first step with text data is to convert it into a numerical format. Several steps are involved in Machine Learning with text data, as shown in the diagram below.
Text Pre-processing
1. Tokenization
In this step, the text is split into smaller units, called tokens, at white space and punctuation.
Example:
Input: “I like to dance.”
Tokenization output: “I”, “like”, “to”, “dance”
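A minimal sketch of this step, using only Python's built-in `re` module. The function name `tokenize` and the choice to keep only runs of letters, digits, and apostrophes are assumptions made for illustration; real tokenizers handle many more cases (hyphens, URLs, emoji, and so on).

```python
import re

def tokenize(text):
    # Keep runs of letters, digits, and apostrophes.
    # White space and punctuation act as separators and are dropped.
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("I like to dance.")
print(tokens)  # ['I', 'like', 'to', 'dance']
```

The trailing full stop is discarded because it is not part of any matched run of characters.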
2. Stop Word Removal
The next step in the processing is stop word removal.
Stop words are words that appear frequently in text but contribute little to its overall meaning.
Common stop words: “a”, “the”, “so”, “is”, “it”, “at”, “in”, “this”, “there”, “that”, “my”
Example:
Input: “There is a restaurant near my house”
Stop word removal output: “restaurant near house”
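The example above can be sketched in a few lines of plain Python. The stop word set here is just the short list from this article, and the function name `remove_stop_words` is hypothetical; NLP libraries usually ship much longer lists.

```python
# Stop words taken from the short list in this article (not a complete list).
STOP_WORDS = {"a", "the", "so", "is", "it", "at", "in", "this", "there", "that", "my"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lower case so that "There" matches "there".
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "There is a restaurant near my house".split()
print(remove_stop_words(tokens))  # ['restaurant', 'near', 'house']
```

The comparison is case-insensitive, but the surviving tokens keep their original casing.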
3. Stemming
Stemming applies a set of rules that slice a word down to a substring, the stem, which usually stands for a more general meaning.
The goal is to remove word affixes (particularly suffixes) such as “s”, “es”, “ing”, and “ed”.
Example:
“plays”, “playing”, and “played” all become “play”
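To make the idea of rule-based suffix stripping concrete, here is a deliberately simple sketch. It is a toy and not an established algorithm such as the Porter stemmer; the suffix list, the function name `simple_stem`, and the minimum stem length of 3 are all assumptions chosen for illustration.

```python
# Suffixes to strip, checked in this order (a toy rule set, not a real stemmer).
SUFFIXES = ("ing", "ed", "es", "s")

def simple_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, but never leave a very short stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["plays", "playing", "played", "play"]
print([simple_stem(w) for w in words])  # ['play', 'play', 'play', 'play']
```

Because the rules only slice strings, a stem does not have to be a real word: under this sketch, "dances" becomes "danc".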
4. Lemmatization
Lemmatization is similar to stemming but more advanced. It takes linguistics into account by analysing the morphology of each word. To do so, it needs detailed dictionaries that the algorithm can look through to link each word form back to its lemma.
Example:
“am”, “is”, and “are” all become “be”
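The dictionary lookup described above can be sketched as follows. The `LEMMA_DICT` mapping is a tiny, hand-written stand-in for the detailed dictionaries a real lemmatizer relies on, and the function name `lemmatize` is hypothetical; real tools also use part-of-speech information, which this sketch ignores.

```python
# A tiny hand-written dictionary mapping word forms to their lemma.
# Real lemmatizers use far larger dictionaries plus morphological analysis.
LEMMA_DICT = {
    "am": "be",
    "is": "be",
    "are": "be",
    "was": "be",
    "were": "be",
    "mice": "mouse",
}

def lemmatize(word, lemma_dict=LEMMA_DICT):
    word = word.lower()
    # Fall back to the word itself when the form is not in the dictionary.
    return lemma_dict.get(word, word)

print([lemmatize(w) for w in ["am", "is", "are"]])  # ['be', 'be', 'be']
```

Unlike suffix stripping, a lookup can link irregular forms such as "mice" to "mouse", which is why lemmatization depends on dictionary coverage.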
These are the basic processing steps when dealing with text data. The next step is converting the clean, processed data into numbers. Stay tuned for the next article, which dives into that part. Happy pre-processing!