tokenDetails

Details of tokens in tokenized document array

Syntax

tdetails = tokenDetails(documents)

Description

tdetails = tokenDetails(documents) returns a table of token details for the tokens in the tokenizedDocument array documents.

Examples

Create a tokenized document array.

str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
      Token       DocumentNumber    LineNumber       Type        Language
    __________    ______________    __________    ___________    ________

    "This"              1               1         letters           en   
    "is"                1               1         letters           en   
    "an"                1               1         letters           en   
    "example"           1               1         letters           en   
    "document"          1               1         letters           en   
    "."                 1               1         punctuation       en   
    "It"                1               1         letters           en   
    "has"               1               1         letters           en   

The Type variable contains the type of each token. Use it to select the emoticons in the documents.

idx = tdetails.Type == "emoticon";
tdetails(idx,:)
ans=2×5 table
    Token    DocumentNumber    LineNumber      Type      Language
    _____    ______________    __________    ________    ________

    ":)"           2               1         emoticon       en   
    ":D"           3               1         emoticon       en   

Create a tokenized document array.

str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);

Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×6 table
      Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    __________    ______________    ______________    __________    ___________    ________

    "This"              1                 1               1         letters           en   
    "is"                1                 1               1         letters           en   
    "an"                1                 1               1         letters           en   
    "example"           1                 1               1         letters           en   
    "document"          1                 1               1         letters           en   
    "."                 1                 1               1         punctuation       en   
    "It"                1                 2               1         letters           en   
    "has"               1                 2               1         letters           en   

View the token details of the second sentence of the third document.

idx = tdetails.DocumentNumber == 3 & ...
    tdetails.SentenceNumber == 2;
tdetails(idx,:)
ans=6×6 table
       Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    ___________    ______________    ______________    __________    ___________    ________

    "It"                 3                 2               1         letters           en   
    "also"               3                 2               1         letters           en   
    "has"                3                 2               1         letters           en   
    "two"                3                 2               1         letters           en   
    "sentences"          3                 2               1         letters           en   
    "."                  3                 2               1         punctuation       en   

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
       Token       DocumentNumber    LineNumber     Type      Language
    ___________    ______________    __________    _______    ________

    "fairest"            1               1         letters       en   
    "creatures"          1               1         letters       en   
    "desire"             1               1         letters       en   
    "increase"           1               1         letters       en   
    "thereby"            1               1         letters       en   
    "beautys"            1               1         letters       en   
    "rose"               1               1         letters       en   
    "might"              1               1         letters       en   

Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
       Token       DocumentNumber    SentenceNumber    LineNumber     Type      Language     PartOfSpeech 
    ___________    ______________    ______________    __________    _______    ________    ______________

    "fairest"            1                 1               1         letters       en       adjective     
    "creatures"          1                 1               1         letters       en       noun          
    "desire"             1                 1               1         letters       en       verb          
    "increase"           1                 1               1         letters       en       noun          
    "thereby"            1                 1               1         letters       en       adverb        
    "beautys"            1                 1               1         letters       en       verb          
    "rose"               1                 1               1         letters       en       noun          
    "might"              1                 1               1         letters       en       auxiliary-verb

Input Arguments

documents – Input documents, specified as a tokenizedDocument array.

Output Arguments

tdetails – Table of token details, returned as a table. tdetails has the following variables:

Token

Token text, returned as a string scalar.

DocumentNumber

Index of document that the token belongs to, returned as a positive integer.

SentenceNumber

Sentence number of token in document, returned as a positive integer. If these details are missing, then first add sentence details to documents using the addSentenceDetails function.

LineNumber

Line number of token in document, returned as a positive integer.

Type

The type of token, returned as one of the following:

  • 'letters' – string of letter characters only

  • 'digits' – string of digits only

  • 'punctuation' – string of punctuation and symbol characters only

  • 'email-address' – detected email address

  • 'web-address' – detected web address

  • 'hashtag' – detected hashtag (starts with "#" character followed by a letter)

  • 'at-mention' – detected at-mention (starts with "@" character)

  • 'emoticon' – detected emoticon

  • 'emoji' – detected emoji

  • 'other' – does not belong to the previous types and is not a custom type

If these details are missing, then first add type details to documents using the addTypeDetails function.

Language

Language of the token, returned as one of the following:

  • 'en' – English

  • 'ja' – Japanese

  • 'de' – German

These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.

If these details are missing, then first add language details to documents using the addLanguageDetails function.

For more information about language support in Text Analytics Toolbox™, see Language Considerations.

PartOfSpeech

Part-of-speech tag, returned as one of the following:

  • 'adjective'

  • 'adposition'

  • 'adverb'

  • 'auxiliary-verb'

  • 'coord-conjunction'

  • 'determiner'

  • 'interjection'

  • 'noun'

  • 'numeral'

  • 'particle'

  • 'pronoun'

  • 'proper-noun'

  • 'punctuation'

  • 'subord-conjunction'

  • 'symbol'

  • 'verb'

  • 'other'

If these details are missing, then first add part-of-speech details to documents using the addPartOfSpeechDetails function.

Entity

Entity tag, returned as one of the following:

  • 'location' – detected location

  • 'organization' – detected organization

  • 'person' – detected person

  • 'other' – detected entity, not belonging to the above categories

  • 'non-entity' – no entity detected

If these details are missing, then first add entity details to documents using the addEntityDetails function.

Lemma

Lemma form of the token. If these details are missing, then first add lemma details to documents using the addLemmaDetails function.
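
Several of the variables above appear only after the corresponding add function has been called. As a hedged sketch of that pattern, the following checks for the Lemma variable before and after calling addLemmaDetails. The example sentence and the names hasLemmaBefore and hasLemmaAfter are made up for illustration, and the exact lemmas returned may vary by release.

```matlab
% Tokenize a short illustrative sentence (not taken from the examples above).
documents = tokenizedDocument("The dogs were running in the park.");

% By default, the table of token details has no Lemma variable.
tdetails = tokenDetails(documents);
hasLemmaBefore = ismember("Lemma",string(tdetails.Properties.VariableNames))

% Add lemma details, then request the token details again.
documents = addLemmaDetails(documents);
tdetails = tokenDetails(documents);
hasLemmaAfter = ismember("Lemma",string(tdetails.Properties.VariableNames))

% View each token next to its lemma.
tdetails(:,["Token","Lemma"])
```

The same check-then-add approach should apply to the SentenceNumber, PartOfSpeech, and Entity variables with their respective add functions.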

Compatibility Considerations

Behavior changed in R2018b

Introduced in R2018a