Text Analytics Toolbox: removing Entity details from documents

david cowan on 18 Nov 2023
Moved: Cris LaPierre on 19 Nov 2023
I have a very large set of documents that I am preprocessing for use in a BERT classification model.
I have tokenized the documents and added the entity details.
Now I want to remove all of the tokens within the documents that have been tagged as "organization".
I have the following variables:
documents: tokenized documents
tdetails: a table of tokens with the document number, sentence number, line number, Type, Language, PartOfSpeech and Entity.
Token                   DocumentNumber  SentenceNumber  LineNumber  Type       Language  PartOfSpeech   Entity
"Astoria"               1               2               3           'letters'  'en'      'proper-noun'  'person'
"Federal Savings Bank"  1               2               3           'other'    'en'      'proper-noun'  'organization'
"settled"               1               2               3           'letters'  'en'      'verb'         'non-entity'
How do I remove all of the tokens in the variable documents whose Entity is "organization"?
For example, in documents(1,1).Vocabulary(7) I can find "Federal Savings Bank", which is in row 7 of the example above. I could loop through all of the documents and check tdetails for Entity == "organization", but that would take quite a while.
I can't seem to figure out how to do this more simply.

Accepted Answer

Cris LaPierre on 18 Nov 2023
I would use removeWords.
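documents = tokenizedDocument(Text(:));
documents = addEntityDetails(documents);   % add entity tags (skip if already added)
tdetails = tokenDetails(documents);
% Pick the Token column for rows tagged "organization" (spelled as in the Entity column)
% and remove those words from the documents.
documents2 = removeWords(documents, tdetails.Token(tdetails.Entity == "organization"));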
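To try the approach on something small first, here is a minimal sketch on a single toy sentence. It assumes Text Analytics Toolbox is installed; the sentence and the names str, orgTokens, cleaned and remaining are made up for illustration and are not from the question. Which tokens the entity model actually tags depends on the model, so no output is shown. Note that removeWords matches on the token text, so any occurrence of the same text that was not tagged as an organization would be removed as well.

```matlab
% Toy input, illustrative only (not the poster's data).
str = "Astoria settled with Federal Savings Bank last year.";

% Tokenize and add entity tags. With default options addEntityDetails
% retokenizes so that a multi-word entity becomes a single token.
documents = tokenizedDocument(str);
documents = addEntityDetails(documents);

% One row per token, including the categorical Entity column.
tdetails = tokenDetails(documents);

% Unique token strings tagged as organizations.
orgTokens = unique(tdetails.Token(tdetails.Entity == "organization"));

% Remove every occurrence of those strings from the documents.
cleaned = removeWords(documents, orgTokens);

% Sanity check: none of the removed strings should remain as tokens.
remaining = tokenDetails(cleaned);
disp(any(ismember(remaining.Token, orgTokens)))
```

If only the tagged occurrences should go, and not every token with the same text, a per-document pass over tdetails would be needed instead; that is slower but more selective.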
1 Comment
david cowan on 19 Nov 2023
Moved: Cris LaPierre on 19 Nov 2023
Really appreciate that.
removeWords !!
I'll not forget that now - I knew there had to be a simple approach I was just missing

Release: R2023b
