normalizeWords

Stem or lemmatize words

Syntax

updatedDocuments = normalizeWords(documents)

updatedWords = normalizeWords(words)

updatedWords = normalizeWords(words,'Language',language)

___ = normalizeWords(___,'Style',style)

Description

Use normalizeWords to reduce words to a root form. To lemmatize English words (reduce them to their dictionary forms), set the 'Style' option to 'lemma'.

The function supports English, Japanese, German, and Korean text.

updatedDocuments = normalizeWords(documents) reduces the words in documents to a root form. For English and German text, the function, by default, stems the words using the Porter stemmer for the corresponding language. For Japanese and Korean text, the function, by default, lemmatizes the words using the MeCab tokenizer.

updatedWords = normalizeWords(words) reduces each word in the string array words to a root form.

updatedWords = normalizeWords(words,'Language',language) reduces the words to a root form and also specifies the word language.

___ = normalizeWords(___,'Style',style) also specifies normalization style. For example, normalizeWords(documents,'Style','lemma') lemmatizes the words in the input documents.

Examples

Stem Words in Documents

Stem the words in a document array using the Porter stemmer.

documents = tokenizedDocument([
    "a strongly worded collection of words"
    "another collection of words"]);
newDocuments = normalizeWords(documents)

newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: a strongli word collect of word
    4 tokens: anoth collect of word

Stem Words in String Array

Stem the words in a string array using the Porter stemmer. Each element of the string array must be a single word.

words = ["a" "strongly" "worded" "collection" "of" "words"];
newWords = normalizeWords(words)

newWords = 1x6 string
    "a"    "strongli"    "word"    "collect"    "of"    "word"
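
The 'Language' option is not covered by the examples above, so the following is a minimal sketch of how it might be used with a string array. The German sample words are illustrative assumptions; the result is not shown because it depends on the stemmer.

```matlab
% German words in a string array (sample words chosen for illustration).
% Each element is a single word.
words = ["Guten" "Morgen" "Heute"];

% State the word language explicitly instead of relying on automatic
% language detection.
newWords = normalizeWords(words,'Language','de')
```

Specifying the language may be useful when the input is only a few words long and gives automatic detection little text to work with.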

Lemmatize Words in Documents

Lemmatize the words in a document array.

documents = tokenizedDocument([
    "I am building a house."
    "The building has two floors."]);
newDocuments = normalizeWords(documents,'Style','lemma')

newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: i be build a house .
    6 tokens: the build have two floor .

To improve the lemmatization, first add part-of-speech details to the documents using the addPartOfSpeechDetails function. For example, if the documents contain part-of-speech details, then normalizeWords reduces only the verb "building" and not the noun "building".

documents = addPartOfSpeechDetails(documents);
newDocuments = normalizeWords(documents,'Style','lemma')

newDocuments = 
  2x1 tokenizedDocument:

    6 tokens: i be build a house .
    6 tokens: the building have two floor .

Lemmatize Japanese Text

Tokenize Japanese text using the tokenizedDocument function. The function automatically detects Japanese text.

str = [
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"
    "駅までは遠くて、歩けない。"
    "遠くの駅まで歩けない。"];
documents = tokenizedDocument(str);

Lemmatize the tokens using normalizeWords.

documents = normalizeWords(documents)

documents = 
  4x1 tokenizedDocument:

    10 tokens: 空 に 星 が 輝く 、 瞬く て いる 。
    10 tokens: 空 の 星 が 輝き を 増す て いる 。
     9 tokens: 駅 まで は 遠い て 、 歩ける ない 。
     7 tokens: 遠く の 駅 まで 歩ける ない 。

Stem German Text

Tokenize German text using the tokenizedDocument function. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);

Stem the tokens using normalizeWords.

documents = normalizeWords(documents)

documents = 
  2x1 tokenizedDocument:

    8 tokens: gut morg . wie geht es dir ?
    6 tokens: heut wird ein gut tag .

Input Arguments

`documents` — Input documents
`tokenizedDocument` array

Input documents, specified as a tokenizedDocument array.

`words` — Input words
string vector | character vector | cell array of character vectors

Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.

Data Types: string | char | cell
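
Because a character vector is treated as a single word, text that holds several words needs to be separated before it is passed as words. The following is one possible sketch, assuming whitespace-delimited text and using the standard split function; the sample text is hypothetical.

```matlab
% A sentence stored as one string (hypothetical sample text).
str = "a strongly worded collection";

% split breaks the text at whitespace and returns a column vector
% with one word per element.
words = split(str);

% Normalize each word. 'Language','en' states the language explicitly.
newWords = normalizeWords(words,'Language','en')
```

For text with punctuation or mixed token types, tokenizing with tokenizedDocument and passing the documents instead is likely the better fit.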

`style` — Normalization style
`'stem'` | `'lemma'`

Normalization style, specified as one of the following:

'stem' – Stem words using the Porter stemmer. This option supports English and German text only. For English and German text, this value is the default.
'lemma' – Extract the dictionary form of each word. This option supports English, Japanese, and Korean text only. If a word is not in the internal dictionary, then the function outputs the word unchanged. For English text, the output is lowercase. For Japanese and Korean text, this value is the default.

For tokenizedDocument input, the function normalizes only tokens with type 'letters' or 'other'. For more information on token types, see tokenDetails.
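
As a rough illustration of the token-type rule, the sketch below tokenizes a hypothetical sentence that contains a web address and inspects the token types before normalizing. The sample text is an assumption made for this example.

```matlab
% Hypothetical sample text containing a web address.
documents = tokenizedDocument("Visiting https://www.mathworks.com for more details");

% The Type variable of the token details table lists the type of each token.
tdetails = tokenDetails(documents);
tdetails(:,{'Token','Type'})

% Only tokens of type 'letters' or 'other' are normalized, so the
% web address token is left as it is.
newDocuments = normalizeWords(documents)
```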

Tip

For English text, to improve lemmatization of words in documents, first add part-of-speech details using the addPartOfSpeechDetails function.

`language` — Word language
`'en'` | `'de'`

Word language, specified as one of the following:

'en' – English language
'de' – German language

If you do not specify language, then the software detects the language automatically. To lemmatize Japanese or Korean text, use tokenizedDocument input.

Data Types: char | string

Output Arguments

`updatedDocuments` — Updated documents
`tokenizedDocument` array

Updated documents, returned as a tokenizedDocument array.

`updatedWords` — Updated words
string array | character vector | cell array of character vectors

Updated words, returned as a string array, character vector, or cell array of character vectors. words and updatedWords have the same data type.
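
The following sketch checks that the output type follows the input type for each of the three supported input forms. The sample words are taken from the examples on this page; the variable names are illustrative.

```matlab
% Character vector input: treated as a single word, returned as char.
charOut = normalizeWords('collection','Language','en');

% Cell array input: returned as a cell array of character vectors.
cellOut = normalizeWords({'worded','words'},'Language','en');

% String array input: returned as a string array.
strOut = normalizeWords(["worded" "words"],'Language','en');

% Display the class of each result.
class(charOut)
class(cellOut)
class(strOut)
```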

Algorithms

Language Details

tokenizedDocument objects contain details about the tokens including language details. The language details of the input documents determine the behavior of normalizeWords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the Language option of tokenizedDocument. To view the token details, use the tokenDetails function.
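
A minimal sketch of setting the language details manually and then checking them, assuming a short German sample sentence taken from the example on this page:

```matlab
str = "Heute wird ein guter Tag.";

% Set the language details manually instead of using automatic detection.
documents = tokenizedDocument(str,'Language','de');

% The Language variable of the token details table holds the language
% recorded for each token.
tdetails = tokenDetails(documents);
unique(tdetails.Language)

% normalizeWords picks its default style from these language details.
newDocuments = normalizeWords(documents)
```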

Version History

Introduced in R2017b

R2018b: `normalizeWords` skips complex tokens

Starting in R2018b, for tokenizedDocument input, normalizeWords normalizes tokens with type 'letters' or 'other' only. This behavior prevents the function from affecting complex tokens such as URLs and email addresses.

In previous versions, normalizeWords normalized all tokens. To reproduce this behavior, use the command updatedDocuments = docfun(@(str) normalizeWords(str),documents).
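
To put the docfun command in context, the sketch below applies both behaviors to the same documents so that the results can be compared. The sample text and the variable names defaultDocs and allDocs are assumptions made for this illustration.

```matlab
% Hypothetical sample text containing a web address.
documents = tokenizedDocument("Visiting https://www.mathworks.com for more details");

% Behavior in R2018b and later: only 'letters' and 'other' tokens
% are normalized.
defaultDocs = normalizeWords(documents)

% Earlier behavior: docfun passes the words of each document to
% normalizeWords as a string array, so every token is processed
% regardless of its type.
allDocs = docfun(@(str) normalizeWords(str),documents)
```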
