addLanguageDetails

Add language identifiers to documents

Use addLanguageDetails to add language identifiers to documents.

The function supports English, Japanese, and German text.

Syntax

updatedDocuments = addLanguageDetails(documents)
updatedDocuments = addLanguageDetails(documents,'Language',language)

Description

updatedDocuments = addLanguageDetails(documents) detects the language of documents and updates the token details. The function adds language details only to tokens that are missing them. To get the language details from updatedDocuments, use tokenDetails.

updatedDocuments = addLanguageDetails(documents,'Language',language) uses the specified language for the token details instead of detecting it.
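
As a minimal sketch of the second syntax, the following code manually tokenizes a short German phrase and sets the language explicitly. The sample text is an assumption chosen for illustration; it does not come from the original page.

```matlab
% Manually tokenize a short German phrase (illustrative sample text).
str = split("ein kurzer Satz")';
documents = tokenizedDocument(str,'TokenizeMethod','none');

% Set the language explicitly rather than relying on detection.
documents = addLanguageDetails(documents,'Language','de');

% Inspect the token details, including the Language column.
tdetails = tokenDetails(documents)
```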

Tip

Use addLanguageDetails before using the lower and upper functions, because addLanguageDetails uses information that these functions remove.
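
The ordering in this tip might look like the following sketch. The mixed-case sample text is an assumption made for illustration.

```matlab
% Manually tokenize some mixed-case text (illustrative sample text).
str = split("An Example Of A Short Sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');

% Add the language details first, while the original text is intact.
documents = addLanguageDetails(documents);

% Convert the documents to lowercase afterwards.
documents = lower(documents);
```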

Examples

Manually tokenize some text by splitting it into an array of words. Convert the manually tokenized text into a tokenizedDocument object by setting the 'TokenizeMethod' option to 'none'.

str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');

View the token details using tokenDetails.

tdetails = tokenDetails(documents)
tdetails=6×2 table
      Token       DocumentNumber
    __________    ______________

    "an"                1       
    "example"           1       
    "of"                1       
    "a"                 1       
    "short"             1       
    "sentence"          1       

When you specify 'TokenizeMethod','none', tokenizedDocument does not automatically detect the language details of the documents. To add the language details, use the addLanguageDetails function. By default, addLanguageDetails detects the language automatically.

documents = addLanguageDetails(documents);

View the updated token details using tokenDetails.

tdetails = tokenDetails(documents)
tdetails=6×4 table
      Token       DocumentNumber     Type      Language
    __________    ______________    _______    ________

    "an"                1           letters       en   
    "example"           1           letters       en   
    "of"                1           letters       en   
    "a"                 1           letters       en   
    "short"             1           letters       en   
    "sentence"          1           letters       en   

Input Arguments

Input documents, specified as a tokenizedDocument array.

Value of the 'Language' option, specified as one of the following:

  • 'en' – English

  • 'ja' – Japanese

  • 'de' – German

If you do not specify a value, then the function detects the language from the input text using the corpusLanguage function.
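
To make the detection step visible, one option is to call corpusLanguage yourself and pass the result to the 'Language' option. This is only a sketch: the variable names are illustrative, and treating it as equivalent to the default call is an assumption based on the description above.

```matlab
% Original text and its manually tokenized form.
txt = "an example of a short sentence";
documents = tokenizedDocument(split(txt)','TokenizeMethod','none');

% Detect the language of the text explicitly.
language = corpusLanguage(txt);

% Pass the detected language to addLanguageDetails.
documents = addLanguageDetails(documents,'Language',language);
```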

This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.
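
As a brief sketch of how the language details feed into a downstream function, the following code sets the language and then calls removeStopWords, one of the functions listed above. The sample text is reused from the example on this page.

```matlab
% Manually tokenize some text.
str = split("an example of a short sentence")';
documents = tokenizedDocument(str,'TokenizeMethod','none');

% Add English language details to the tokens.
documents = addLanguageDetails(documents,'Language','en');

% removeStopWords relies on the language details of the tokens.
documents = removeStopWords(documents)
```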

For more information about language support in Text Analytics Toolbox™, see Language Considerations.

Output Arguments

Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

Introduced in R2018b