What Is Stemming?

Stemming is a text normalization technique in natural language processing that reduces words to their root forms, or stems. It works primarily by stripping affixes from words, so the resulting stem is not always a valid dictionary word.

Stemming is commonly used for:

Information retrieval, where words that share a stem are treated as equivalent so that a search also matches related word forms
Dimensionality reduction in engineering applications, where stemming leaves fewer distinct words to track and use as features in machine learning models
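
Both uses can be illustrated with a minimal sketch. The stemmer below is a toy suffix stripper written for this example, not Porter's algorithm or any library's implementation; the suffix list, the minimum stem length, and the sample tokens are all assumptions chosen for illustration.

```python
# Toy suffix-stripping stemmer (illustrative only; NOT the Porter algorithm).
# The suffix list and minimum stem length are assumptions made for this sketch.
SUFFIXES = ("ement", "ing", "ed", "es", "s")  # tried in this order
MIN_STEM_LEN = 3


def naive_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


tokens = ["requiring", "required", "requirement", "requires",
          "searching", "searched", "searches"]
stems = [naive_stem(t) for t in tokens]

# Dimensionality reduction: seven distinct words collapse to two stems.
vocabulary_before = set(tokens)
vocabulary_after = set(stems)
print(len(vocabulary_before), "->", len(vocabulary_after))  # 7 -> 2

# Information retrieval: index words by stem so a query matches related forms.
index = {}
for token, stem in zip(tokens, stems):
    index.setdefault(stem, []).append(token)

query = "searches"
print(index[naive_stem(query)])  # ['searching', 'searched', 'searches']
```

A stemmer this simple over-strips many ordinary words, which is why practical systems rely on a more carefully designed rule set.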

Porter’s Stemming Algorithm

The Porter stemming algorithm, published by Martin Porter in 1980, is one of the most popular stemming approaches for the English language. It is based on simple heuristic rules, which makes it fast but not always accurate. Many other algorithms have been proposed since, but Porter’s stemming algorithm remains popular due to its speed and simplicity.
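
To give a sense of what "simple heuristic rules" means, here is a sketch of only the first rule group of the algorithm (step 1a, which handles plural endings) as described in Porter's original paper. The function name is hypothetical, and this is not a complete stemmer: the full algorithm applies several further steps, most of them conditioned on the structure of the remaining stem.

```python
def porter_step_1a(word):
    """Apply step 1a of Porter's algorithm only (plural suffixes).

    The first matching rule wins, checked from longest suffix to shortest.
    """
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS (left unchanged)
    if word.endswith("s"):
        return word[:-1]  # S -> (removed)
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# ties -> ti
# caress -> caress
# cats -> cat
```

Note that "poni" and "ti" are not dictionary words, which is the kind of output the rule-based approach accepts in exchange for speed.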

Stemming vs. Lemmatization

Lemmatization is a related but more sophisticated approach. Compared to stemming:

Lemmatization uses vocabulary and morphological analysis, whereas stemming uses simple heuristic rules
Lemmatization returns the dictionary forms of words (lemmas), whereas stemming may return strings that are not valid words

The differences between lemmatization and stemming are shown below.

Actual Word	Lemmatization	Stemming
Requiring	Require	Requir
Required	Require	Requir
Requirement	Requirement	Requir
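
The contrast in the table can be reproduced with a small sketch. Both functions are toys written for this example: the three-entry lexicon is a hypothetical stand-in for the vocabulary and morphological analysis a real lemmatizer would use, and the suffix list is an assumption that happens to cover these three words.

```python
# Hypothetical mini-lexicon; a real lemmatizer consults a full vocabulary
# and morphological analysis rather than a hand-written table.
LEMMA_LEXICON = {
    "requiring": "require",
    "required": "require",
    "requirement": "requirement",
}


def toy_lemmatize(word):
    """Look the word up in the lexicon; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_LEXICON.get(word, word)


def toy_stem(word):
    """Strip the first matching suffix without consulting any vocabulary."""
    word = word.lower()
    for suffix in ("ement", "ing", "ed"):  # assumed suffix list for this sketch
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


rows = [(w, toy_lemmatize(w), toy_stem(w))
        for w in ["Requiring", "Required", "Requirement"]]
for word, lemma, stem in rows:
    print(f"{word:<12} {lemma:<12} {stem}")
```

The lemmatizer keeps "requirement" distinct from "require" because they are different dictionary entries, while the suffix stripper maps all three words to the non-word "requir".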

In MATLAB, stemming can be done using the “normalizeWords” function, whose style option defaults to ‘stem’. To learn more about stemming and building models with text data, see Text Analytics Toolbox™.

Examples and How To

Prepare Text Data for Analysis - Example
Create Simple Text Model for Classification - Example

Software Reference

Language Considerations - Documentation
Text Analytics Glossary - Documentation
Getting Started with Text Analytics Toolbox - Documentation
normalizeWords: Perform stemming or lemmatize words - Function
tokenizedDocument: Array of tokenized document for text analysis - Function
