Language Considerations

Text Analytics Toolbox™ supports the languages English, Japanese, and German. Most Text Analytics Toolbox functions also work with text in other languages. The following sections summarize how to use Text Analytics Toolbox features with other languages.

Tokenization

The tokenizedDocument function has built-in rules for English, Japanese, and German only. For English and German text, the 'unicode' tokenization method of tokenizedDocument detects tokens using rules based on Unicode® Standard Annex #29 [1] and the ICU tokenizer [2], modified to better detect complex tokens such as hashtags and URLs. For Japanese text, the 'mecab' tokenization method detects tokens using rules based on the MeCab tokenizer [3].

For other languages, you can still try using tokenizedDocument. If tokenizedDocument does not produce useful results, then try tokenizing the text manually. To create a tokenizedDocument array from manually tokenized text, set the 'TokenizeMethod' option to 'none'.

For more information, see tokenizedDocument.
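For example, the following sketch creates a tokenizedDocument array from manually tokenized text (the Spanish tokens here are illustrative):

```matlab
% Manually tokenized text: one string array of tokens per document,
% collected in a cell array. The tokens below are illustrative.
tokens = { ...
    ["un" "ejemplo" "corto"], ...
    ["otra" "frase" "breve"]};

% Skip the built-in tokenization so the tokens are used as-is.
documents = tokenizedDocument(tokens,'TokenizeMethod','none');
```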

Stop word removal

The stopWords and removeStopWords functions support English, Japanese, and German stop words only.

To remove stop words from other languages, use removeWords and specify your own stop words to remove.
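For example, a minimal sketch using removeWords with a custom stop word list (the Spanish words below are illustrative, not a complete list):

```matlab
documents = tokenizedDocument([
    "este es un ejemplo corto"
    "otro ejemplo breve"]);

% Illustrative stop word list; supply your own list for your language.
customStopWords = ["este" "es" "un" "otro"];
documents = removeWords(documents,customStopWords);
```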

Sentence detection

The addSentenceDetails function detects sentence boundaries based on punctuation characters and line number information. For English and German text, the function also uses a list of abbreviations passed to the function.

For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails.

For more information, see addSentenceDetails.
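For example, a sketch passing a custom abbreviation list (the abbreviations here are illustrative; check the addSentenceDetails reference page for the exact form it expects):

```matlab
str = "El Sr. García llegó tarde. Luego se fue.";
documents = tokenizedDocument(str);

% Illustrative abbreviation list so that the period after "Sr." is
% not treated as a sentence boundary.
documents = addSentenceDetails(documents,'Abbreviations',["Sr" "Dra"]);
```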

Word clouds

For string input, the wordcloud function uses English, Japanese, and German tokenization, stop word removal, and word normalization.

For other languages, you might need to manually preprocess your text data and specify unique words and corresponding sizes in wordcloud.

To specify word sizes in wordcloud, input your data as a table or arrays containing the unique words and corresponding sizes.

For more information, see wordcloud.
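For example, a sketch that passes manually prepared words and sizes directly to wordcloud (the data is hypothetical):

```matlab
% Hypothetical unique words and corresponding counts.
words = ["palabra" "texto" "nube" "idioma" "ejemplo"];
counts = [25 18 12 9 5];

figure
wordcloud(words,counts);
```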

Word embeddings

File input to the trainWordEmbedding function requires words separated by whitespace.

For files containing non-English text, you might need to input a tokenizedDocument array to trainWordEmbedding.

To create a tokenizedDocument array from pretokenized text, use the tokenizedDocument function and set the 'TokenizeMethod' option to 'none'.

For more information, see trainWordEmbedding.
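For example, a minimal sketch of training an embedding from pretokenized text (the tokens and option values are illustrative; real training needs a much larger corpus):

```matlab
% Pretokenized text: one string array of tokens per document.
tokens = { ...
    ["palabra" "incrustación" "modelo"], ...
    ["otro" "documento" "de" "ejemplo"]};
documents = tokenizedDocument(tokens,'TokenizeMethod','none');

% Small illustrative settings; real corpora need many more documents.
emb = trainWordEmbedding(documents,'Dimension',50,'MinCount',1);
```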

Language-Independent Features

Word and N-Gram Counting

The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
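For example, a sketch counting words and bigrams from a tokenizedDocument array (the text is illustrative):

```matlab
documents = tokenizedDocument([
    "un ejemplo corto"
    "otro ejemplo breve"]);

bag = bagOfWords(documents);                         % word counts
bagNgrams = bagOfNgrams(documents,'NgramLengths',2); % bigram counts
```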

Modeling and Prediction

The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.

The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
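For example, a sketch fitting an LDA model to a bag-of-words model built from tokenized documents (the text and topic count are illustrative):

```matlab
documents = tokenizedDocument([
    "un ejemplo corto"
    "otro ejemplo breve"
    "un documento más"]);
bag = bagOfWords(documents);

% Fit an LDA model with 2 topics (illustrative).
mdl = fitlda(bag,2);
```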

References

[1] Unicode Text Segmentation. https://www.unicode.org/reports/tr29/

[2] International Components for Unicode (ICU). https://icu.unicode.org/

[3] MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/
