How to extract a list of unique words from a set of one-row strings

Harrison on 14 Nov 2024 at 0:58
Commented: Harrison on 15 Nov 2024 at 16:56
Basically I have a set of 11 strings of words, and each string has no repeating words, but I need a list of every unique word in all 11 strings.
I've found that this works for one string at a time, but I can't get a list for all 11 strings this way.
A{1} = updatedDocuments(1,1)
B{1} = strjoin(unique(strtrim(strsplit(A{1}, ',')))', '')
Is it possible to index A{1} as updatedDocuments(1:11,1) or do something similar?

Accepted Answer

Madheswaran on 14 Nov 2024 at 9:32
Edited: Madheswaran on 15 Nov 2024 at 5:17
I am assuming the following:
  • 'updatedDocuments' is an array of 'tokenizedDocument'
  • Each document contains comma-separated text that doesn't end with a comma
To get the unique words across the entire set of strings, you can use the following approach:
% Remove the commas from the documents if you don't want them to be
% included in 'uniqueWords'
updatedDocuments = removeWords(updatedDocuments, ",");
uniqueWords = updatedDocuments.Vocabulary;
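As a minimal sketch of the same route on made-up data (this assumes the Text Analytics Toolbox is installed; 'sampleText' and 'sampleDocuments' are hypothetical names, not the asker's variables):

```matlab
% Hypothetical sample text standing in for the real documents
% (tokenizedDocument and removeWords require Text Analytics Toolbox)
sampleText = ["one, two, three"; "two, three, four"];
sampleDocuments = tokenizedDocument(sampleText);

% Commas become tokens of their own, so remove them before reading the vocabulary
sampleDocuments = removeWords(sampleDocuments, ",");

% Vocabulary lists each distinct token once across all documents
uniqueWords = sampleDocuments.Vocabulary;
```

Vocabulary is a property of the whole array, so no loop over the individual documents is needed.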
If 'updatedDocuments' is a cell array of character vectors, you can use the following approach instead:
updatedDocuments = strcat(updatedDocuments, ','); % Add a comma at the end of each cell
allWords = strjoin(updatedDocuments(1:11,1), ' '); % Join all documents into a single char vector
allWords = strtrim(strsplit(allWords, ',')); % Split with comma as delimiter and trim whitespace
allWords = allWords(~cellfun(@isempty, allWords)); % Drop the empty entry left by the trailing comma
uniqueWords = unique(allWords); % Unique words (1 x n cell, where n is the number of unique words)
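For a quick check of the cell-array route on a small input, here is a hedged sketch using only base MATLAB. The data in 'sampleDocuments' is invented for illustration, and the 'stable' variant is an optional extra, not something the question asked for:

```matlab
% Hypothetical sample data standing in for the 11 documents
sampleDocuments = {'cherry, banana'; 'banana, date'; 'apple, cherry'};

% Join every document with a comma so neighbouring words never fuse together
allText = strjoin(reshape(sampleDocuments, 1, []), ',');

% Split at commas and trim the surrounding whitespace from each word
allWords = strtrim(strsplit(allText, ','));

% Drop empty entries, which appear if a document ends with a comma
allWords = allWords(~cellfun(@isempty, allWords));

% Alphabetically sorted unique words
uniqueWords = unique(allWords);

% Unique words in order of first appearance
uniqueWordsStable = unique(allWords, 'stable');
```

By default unique returns the words sorted; the 'stable' option keeps them in the order they first occur, which may be easier to compare against the original documents.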
For more information, refer to the following documentation:
  1. https://mathworks.com/help/textanalytics/ref/tokenizeddocument.html
  2. https://mathworks.com/help/matlab/ref/double.unique.html
Hope this helps!
  3 Comments
Madheswaran on 15 Nov 2024 at 5:18
That is because I assumed 'updatedDocuments' to be a cell array of character vectors. If 'updatedDocuments' is an array of 'tokenizedDocument', resolving this issue is straightforward. I have updated the answer to include a solution for the 'tokenizedDocument' case, in addition to the existing explanation.
Let me know if that helps!
Harrison on 15 Nov 2024 at 16:56
That's exactly right! Thank you!!


More Answers (1)

Paul on 14 Nov 2024 at 1:09
If UpdatedDocuments is a 1D cell array of chars ...
UpdatedDocuments{1} = 'one,two,three,one';
UpdatedDocuments{2} = 'one,two,three,two';
UpdatedDocuments{3} = 'one,two,three,three';
result = cellfun(@(S) strjoin(unique(strtrim(strsplit(S, ','))),','),UpdatedDocuments,'Uni',false)
result = 1x3 cell array
{'one,three,two'} {'one,three,two'} {'one,three,two'}
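The call above removes duplicates within each cell separately. To pool the words across all cells, as the question asks, one possible sketch uses the string functions join, split and strip (this assumes a release with string array support, R2017a or later for the double-quoted literals; 'pooledWords' is a name made up here):

```matlab
% Same sample data as in the answer above
UpdatedDocuments = {'one,two,three,one', 'one,two,three,two', 'one,two,three,three'};

% Convert to a string array and join all cells into one comma-separated string
allText = join(string(UpdatedDocuments), ",");

% Split at commas into a column of words and trim surrounding whitespace
allWords = strip(split(allText, ","));

% Discard empty entries, then keep one sorted copy of each word
allWords(allWords == "") = [];
pooledWords = unique(allWords);
```

Here pooledWords is a column string array; wrap it in cellstr if a cell array of character vectors is preferred.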
  1 Comment
Paul on 15 Nov 2024 at 1:06
The Vocabulary property of tokenizedDocument returns the unique words in the array
documents = tokenizedDocument([
"an example of a short sentence an example of a short sentence "
"a second short sentence a second short sentence"]);
documents
documents =
  2x1 tokenizedDocument:

    12 tokens: an example of a short sentence an example of a short sentence
     8 tokens: a second short sentence a second short sentence
documents.Vocabulary
ans = 1x7 string array
"an" "example" "of" "a" "short" "sentence" "second"
