What Is Stemming?
Stemming is a text normalization technique in natural language processing that reduces words to their root forms. It works primarily by removing affixes from words, so the result is not always a valid dictionary word.
Stemming is commonly used for:
- Information retrieval, where words that share a stem are treated as equivalent, so a search also matches other forms of the query terms
- Dimensionality reduction in engineering applications, where stemming leaves fewer distinct words to track and use as features in machine learning models
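As a rough illustration of the dimensionality point, the sketch below counts distinct words before and after stemming. The suffix list, the minimum stem length, and the name toy_stem are assumptions made for this example only; this is not a published stemming algorithm.

```python
# Toy suffix-stripping stemmer (hypothetical, for illustration only).
SUFFIXES = ("ing", "ed", "s")  # checked in this order

def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix,
    provided at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

tokens = ["connect", "connected", "connecting", "connects",
          "search", "searching", "searched"]

vocabulary = set(tokens)                            # distinct surface forms
stemmed_vocabulary = {toy_stem(t) for t in tokens}  # distinct stems

print(len(vocabulary), len(stemmed_vocabulary))  # prints: 7 2
```

Seven surface forms collapse to two stems here, which is also why a search for one form can match documents that contain the others.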
Porter’s Stemming Algorithm
The Porter stemming algorithm is one of the most popular stemming approaches for the English language and is based on simple heuristic rules. It is fast, but its output is not always accurate. Many other algorithms have been proposed since, but Porter’s stemming algorithm remains popular due to its speed and simplicity.
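To give a feel for what these heuristic rules look like, here is a minimal sketch of only the first rule group (commonly labeled step 1a) of the original algorithm, which handles plural endings. The full algorithm has several further steps, many of them conditioned on the shape of the remaining stem, and they are not reproduced here. The function and constant names are assumptions for this example.

```python
# Step 1a suffix rules: (suffix, replacement), longest suffix first.
STEP_1A_RULES = (
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),  # leave a double "s" untouched
    ("s", ""),
)

def porter_step_1a(word: str) -> str:
    """Apply the first matching step 1a rule to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ("caresses", "ponies", "ties", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
```

With these rules, caresses becomes caress and cats becomes cat, while ponies becomes poni, which shows how a fast rule can yield a stem that is not a dictionary word.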
Stemming vs. Lemmatization
Lemmatization is a related but more sophisticated approach than stemming. Differences between the two approaches include:
- Lemmatization uses vocabulary and morphological analysis, whereas stemming uses simple heuristic rules.
- Lemmatization returns dictionary forms of the words, whereas stemming may return strings that are not valid words.
Example results after lemmatization and stemming are shown in the following table.
| Input | After Lemmatization | After Stemming |
|---|---|---|
| Requiring | Require | Requir |
| Required | Require | Requir |
| Requirement | Requirement | Requir |
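The contrast in the table can be sketched with two toy functions: a lemmatizer that looks words up in a small vocabulary, and a stemmer that only strips suffixes. Both are assumptions made for illustration. The three-entry lexicon and the suffix list are hypothetical, and real lemmatizers also rely on morphological analysis rather than a lookup table alone.

```python
# Hypothetical mini-lexicon mapping inflected forms to dictionary forms.
TOY_LEXICON = {
    "requiring": "require",
    "required": "require",
    "requirement": "requirement",
}

# Hypothetical suffix list for the stemmer, longest suffix first.
TOY_SUFFIXES = ("ement", "ing", "ed")

def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if the word is in the lexicon."""
    word = word.lower()
    return TOY_LEXICON.get(word, word)

def suffix_stem(word: str) -> str:
    """Strip the first matching suffix without consulting any vocabulary."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word

for w in ("Requiring", "Required", "Requirement"):
    print(w, toy_lemmatize(w), suffix_stem(w))
```

Apart from lowercasing, the output matches the table: the lemmatizer keeps requirement distinct from require, while the stemmer maps all three inputs to requir.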
In MATLAB, stemming can be done using the normalizeWords function, whose default style option is stem. To learn more about stemming and building models with text data, see Text Analytics Toolbox™.
See also: natural language processing, sentiment analysis, word2vec, n-gram, text mining with MATLAB, Deep Learning Toolbox™, Statistics and Machine Learning Toolbox™