trainHMMEntityModel
Syntax
Description
Use the trainHMMEntityModel function to train a model for named entity recognition (NER) that is based on a hidden Markov model (HMM).
The addEntityDetails function automatically detects person names, locations, organizations, and other named entities in text. To train a custom model that predicts different tags, or to train a model using your own data, use the trainHMMEntityModel function.
Examples
Read the example entity data from the exampleEntities CSV file into a table. 
tbl = readtable("exampleEntities.csv",TextType="string");
View the first few rows of the table. The table has two columns, Token and Entity, which correspond to the tokens and entities, respectively.
head(tbl)
             Token                 Entity   
    ________________________    ____________
    "Analyze"                   "non-entity"
    "text"                      "non-entity"
    "in"                        "non-entity"
    "MATLAB"                    "product"   
    "using"                     "non-entity"
    "Text Analytics Toolbox"    "product"   
    "."                         "non-entity"
    "Engineers"                 "non-entity"
Train an HMM-based NER model using the trainHMMEntityModel function.
mdl = trainHMMEntityModel(tbl)
mdl = 
  hmmEntityModel with properties:
    Entities: [3×1 categorical]
View the entities of the model.
mdl.Entities
ans = 3×1 categorical
     organization 
     product 
     non-entity 
To add entity details to documents using the trained hmmEntityModel object, use the addEntityDetails function and set the Model option to the trained NER model.
Create a tokenized document containing text data.
str = "MathWorks develops MATLAB and Simulink.";
document = tokenizedDocument(str);
Add entity details using the trained hmmEntityModel object and view the updated token details using the tokenDetails function. The Entity column contains the predicted entities.
document = addEntityDetails(document,Model=mdl);
details = tokenDetails(document)
details=6×8 table
       Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language      PartOfSpeech          Entity   
    ___________    ______________    ______________    __________    ___________    ________    _________________    ____________
    "MathWorks"          1                 1               1         letters           en       proper-noun          organization
    "develops"           1                 1               1         letters           en       verb                 non-entity  
    "MATLAB"             1                 1               1         letters           en       proper-noun          product     
    "and"                1                 1               1         letters           en       coord-conjunction    non-entity  
    "Simulink"           1                 1               1         letters           en       proper-noun          product     
    "."                  1                 1               1         punctuation       en       punctuation          non-entity  
Extract the tokens that are named entities.
idx = details.Entity ~= "non-entity";
details(idx,["Token" "Entity"])
ans=3×2 table
       Token          Entity   
    ___________    ____________
    "MathWorks"    organization
    "MATLAB"       product     
    "Simulink"     product     
Input Arguments
Table of tokens and corresponding labels, specified as a table with these variables:
- Token: Tokens, specified as string scalars or 1-by-1 cell arrays containing a character vector.
- Entity: Entity labels, specified as categorical scalars, string scalars, or 1-by-1 cell arrays containing a character vector.
You must specify pairs of tokens and entities in context. The algorithm does not support lists of independent token-entity pairs. For example, you can specify this table.
| Token | Entity | 
|---|---|
| "William Shakespeare" | "person" | 
| "was" | "non-entity" | 
| "born" | "non-entity" | 
| "in" | "non-entity" | 
| "Stratford-upon-Avon" | "location" | 
| "." | "non-entity" | 
To specify entities that span multiple tokens, use one of these approaches:
- Whitespace-delimited tokens: Specify multitoken entities as a single token with a single entity value. For example, specify the token "William Shakespeare" and the entity "person".
- IOB2 labeling scheme: For each entity, use the prefix "B-" (beginning) to denote the first token in each entity and use the prefix "I-" (inside) to denote subsequent tokens in multitoken entities. Specify which entity corresponds to the "O" (outside) tag using the name argument. For example, specify the successive tokens "William" and "Shakespeare", and the corresponding entities "B-person" and "I-person". For more information, see Inside, Outside, Beginning (IOB) Labeling Schemes.
If you use the IOB2 labeling scheme, then all tokens in the input must use this scheme.
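The two approaches describe the same data in different forms. As a minimal sketch, the following base MATLAB code rewrites a whitespace-delimited table as IOB2-labeled tokens. The variable names (iobToken, iobEntity, tblIOB2) are hypothetical and chosen only for illustration, and the code assumes that "non-entity" is the label for nonentity tokens, as in the example table. It stops before training and shows only the relabeling step.

```matlab
% Token-entity pairs in which each multitoken entity is a single token.
token = ["William Shakespeare"; "was"; "born"; "in"; "Stratford-upon-Avon"; "."];
entity = ["person"; "non-entity"; "non-entity"; "non-entity"; "location"; "non-entity"];

iobToken = strings(0,1);
iobEntity = strings(0,1);
for i = 1:numel(token)
    % Split the token on whitespace into its individual words.
    parts = split(token(i));
    parts = parts(:);
    if entity(i) == "non-entity"
        % Nonentity tokens receive the "O" (outside) tag.
        tags = repmat("O",numel(parts),1);
    else
        % The first word gets the "B-" prefix and later words get "I-".
        tags = repmat("I-" + entity(i),numel(parts),1);
        tags(1) = "B-" + entity(i);
    end
    iobToken = [iobToken; parts];
    iobEntity = [iobEntity; tags];
end

% Assemble a table with the variable names Token and Entity.
tblIOB2 = table(iobToken,iobEntity,VariableNames=["Token" "Entity"])
```

In this sketch, "William Shakespeare" becomes the two rows "William" with "B-person" and "Shakespeare" with "I-person", and every nonentity token is tagged "O".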
Data Types: table
List of tokens, specified as a tokenizedDocument scalar, a string array, or a cell array of character vectors.
You must specify tokens in context. The algorithm does not support lists of independent token-entity pairs. For example, you can specify the array of tokens ["William Shakespeare" "was" "born" "in" "Stratford-upon-Avon" "."].
List of named entities, specified as a categorical array, a string array, or a cell array of character vectors.
To specify entities that span multiple tokens, use one of these approaches:
- Whitespace-delimited tokens: Specify multitoken entities as a single token with a single entity value. For example, specify the token "William Shakespeare" and the entity "person".
- IOB2 labeling scheme: For each entity, use the prefix "B-" (beginning) to denote the first token in each entity and use the prefix "I-" (inside) to denote subsequent tokens in multitoken entities. Specify which entity corresponds to the "O" (outside) tag using the name argument. For example, specify the successive tokens "William" and "Shakespeare", and the corresponding entities "B-person" and "I-person". For more information, see Inside, Outside, Beginning (IOB) Labeling Schemes.
If you use the IOB2 labeling scheme, then all tokens in the input must use this scheme.
The software automatically removes leading and trailing spaces from the entities. The entities must contain at least one nonwhitespace character.
Data Types: char | string | cell | categorical
Name to assign to tokens that are not named entities, specified as a string scalar, a character vector, or a cell array of character vectors.
The software automatically removes leading and trailing spaces from the entities. The entities must contain at least one nonwhitespace character.
Data Types: char | string | cell
Output Arguments
NER model, returned as an hmmEntityModel object.
Algorithms
The inside, outside (IO) labeling scheme tags each token with "O" or with an entity tag that has the prefix "I-". The tag "O" (outside) denotes nonentities. For each token in an entity, the tag is prefixed with "I-" (inside), which signifies that the token is part of an entity.
The IO labeling scheme does not specify entity boundaries between adjacent entities of the same type. The inside, outside, beginning (IOB) labeling scheme, also known as the beginning, inside, outside (BIO) labeling scheme, addresses this limitation by introducing a "beginning" prefix.
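To make the limitation concrete, here is a small sketch in base MATLAB. The tokens and tags are made up for illustration. Two different IOB2 labelings of the same tokens, one with a single two-token entity and one with two adjacent single-token entities, collapse to the same IO labeling once the "B-" prefix is replaced with "I-".

```matlab
% Hypothetical tokens and two possible IOB2 labelings of them.
tokens = ["use" "MATLAB" "Simulink"];
oneEntity = ["O" "B-product" "I-product"];    % one two-token entity
twoEntities = ["O" "B-product" "B-product"];  % two adjacent entities

% Convert each labeling to the IO scheme by replacing the "B-" prefix
% with "I-". The IO scheme has no "beginning" prefix.
ioFromOne = regexprep(oneEntity,"^B-","I-");
ioFromTwo = regexprep(twoEntities,"^B-","I-");

% Both labelings become ["O" "I-product" "I-product"], so the IO tags
% alone cannot tell the two segmentations apart.
sameIO = isequal(ioFromOne,ioFromTwo)
```

Here sameIO is true, which is the ambiguity that the "beginning" prefix removes.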
The IOB labeling scheme has two variants: IOB1 and IOB2.
In the IOB2 labeling scheme, for each token in an entity, the tag is prefixed with one of these values:
- "B-" (beginning): The token is a single-token entity or the first token of a multitoken entity.
- "I-" (inside): The token is a subsequent token of a multitoken entity.
For a list of entity tags Entity, the IOB labeling scheme helps identify boundaries between adjacent entities of the same type by using this logic:
- If Entity(i) has the prefix "B-" and Entity(i+1) is "O" or has the prefix "B-", then Token(i) is a single-token entity.
- If Entity(i) has the prefix "B-", Entity(i+1), ..., Entity(N) have the prefix "I-", and Entity(N+1) is "O" or has the prefix "B-", then the phrase Token(i:N) is a multitoken entity.
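The boundary logic above can be sketched as a short loop in base MATLAB. This is an illustrative decoder and not the implementation that the toolbox uses. The variable names (phrase, label, entities) are hypothetical. Like the rules above, it relies only on the prefixes and does not check that the type after "I-" matches the type after the preceding "B-".

```matlab
% IOB2-labeled tokens.
token = ["William" "Shakespeare" "was" "born" "in" "Stratford-upon-Avon" "."];
entity = ["B-person" "I-person" "O" "O" "O" "B-location" "O"];

phrase = strings(0,1);
label = strings(0,1);
i = 1;
while i <= numel(entity)
    if startsWith(entity(i),"B-")
        % Extend the span while the following tags have the "I-" prefix.
        N = i;
        while N < numel(entity) && startsWith(entity(N+1),"I-")
            N = N + 1;
        end
        % Token(i:N) is one entity; its type is the text after "B-".
        phrase(end+1,1) = join(token(i:N)," ");
        label(end+1,1) = extractAfter(entity(i),"B-");
        i = N + 1;
    else
        i = i + 1;
    end
end

entities = table(phrase,label)
```

For these tags, the loop recovers "William Shakespeare" as a person and "Stratford-upon-Avon" as a location.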
The IOB1 labeling scheme does not use the prefix "B-" when an entity token follows an "O" tag. In this case, an entity token that is the first token in a list or that follows a nonentity token is the first token of an entity. That is, if Entity(i) has the prefix "I-" and i is equal to 1 or Entity(i-1) is "O", then Token(i) is a single-token entity or the first token of a multitoken entity.
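As a rough sketch of how the two variants relate, the following base MATLAB code converts a made-up IOB1 tag sequence to IOB2 by adding the "B-" prefix wherever an "I-" tag starts an entity. The first two conditions follow the rule above. The third condition, which treats a change of entity type as the start of a new entity, is an assumption based on the common IOB1 convention and is not stated on this page.

```matlab
% Hypothetical IOB1 tags. "B-" appears only between adjacent entities
% of the same type.
iob1 = ["I-person" "I-person" "O" "I-location" "B-location" "O"];

iob2 = iob1;
for i = 1:numel(iob1)
    if startsWith(iob1(i),"I-")
        kind = extractAfter(iob1(i),2);   % entity type after the prefix
        if i == 1 || iob1(i-1) == "O"
            % First tag in the list, or the tag follows a nonentity.
            startsEntity = true;
        else
            % Assumption: a different entity type also starts an entity.
            startsEntity = extractAfter(iob1(i-1),2) ~= kind;
        end
        if startsEntity
            iob2(i) = "B-" + kind;
        end
    end
end

iob2
```

For this input, iob2 is ["B-person" "I-person" "O" "B-location" "B-location" "O"], in which every entity begins with a "B-" tag.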
Version History
Introduced in R2023a