How to find the most used word in a text?
I have a Notepad file containing a literary text, and I need to find the most repeated word or words and how many times they appear in that text.
1 comment
the cyclist
April 3, 2023
Edited: the cyclist
April 3, 2023
FYI, this question was closed by another editor as a duplicate, but I don't think it was. This question is asking about repeated words, and the other was asking about repeated letters.
Answers (3)
the cyclist
April 3, 2023
Edited: the cyclist
April 3, 2023
I'm putting this answer here as possibly the "canonical" MATLAB answer, but I expect you do not have the Text Analytics Toolbox.
myTextFile = "sonnets.txt"; % Put your file name here
str = extractFileText(myTextFile);
T = wordCloudCounts(str);
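To get from that table to the most repeated word or words, one option is to keep only the rows whose count equals the maximum, which also reports ties. This is only a sketch: it assumes the table returned by wordCloudCounts has variables named Word and Count, and the small table below is a hand-built stand-in so the snippet runs without the toolbox.

```matlab
% Stand-in for T = wordCloudCounts(str); assumed variable names: Word, Count.
T = table(["thy"; "love"; "thou"; "time"], [5; 5; 3; 1], ...
    'VariableNames', {'Word', 'Count'});

% Keep every row that shares the highest count, so ties are reported too.
topWords = T(T.Count == max(T.Count), :);
disp(topWords)
```

With real data, replace the stand-in table with the output of wordCloudCounts and check the variable names with T.Properties.VariableNames first.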
0 comments
DGM
April 3, 2023
Edited: DGM
April 3, 2023
Define "word". Once you have defined "word" and have implemented a means to split a block of text into said words, then the rest is basic.
I'm sure this can be improved a lot, but I was in a hurry.
bunchofwords = fileread('wordpile.txt');
% i assume the capitalization doesn't matter
bunchofwords = lower(bunchofwords);
% try to fix words that are hyphenated on linebreaks
% but not all hyphenation is done with U+002D
bunchofwords = regexprep(bunchofwords,'(?<=\w+)-(\r\n|\r|\n)+(?=\w+)','');
% split the file into blobs separated by whitespace
% this causes lots of problems
%words = regexp(bunchofwords,'\S+','match');
% instead, split the file into blobs of "word" type characters
% this still has problems, but it's a bit better
words = regexp(bunchofwords,'\w+','match');
% find unique words
[uwords,~,uwidx] = unique(words);
% get histogram counts and sort them
hc = histcounts(uwidx,'binmethod','integers');
[hc, hcidx] = sort(hc,'descend');
% sort unique word list by frequency
uwordssorted = uwords(hcidx);
% display the results as a table as a cursory effort toward readability
table(uwordssorted.',hc.')
Note that this still has plenty of problems with contractions.
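One possible way to reduce the contraction problem is to let a "word" be a run of letters optionally joined by apostrophes, and then report every word that ties for the highest count. This is a rough sketch, not a complete tokenizer: the sample text is made up, only the ASCII apostrophe and the letters a-z are handled, and typographic apostrophes or accented letters would need a wider pattern.

```matlab
% Hypothetical sample text; for a real file use lower(fileread('yourfile.txt')).
txt = lower('Don''t stop. It''s late, and it''s cold; don''t wait.');

% A "word" here is a run of letters, optionally joined by apostrophes.
% The doubled '' is how a single apostrophe is written inside a char vector.
words = regexp(txt, '[a-z]+(?:''[a-z]+)*', 'match');

% Count each unique word, then keep every word that ties for the maximum.
[uwords, ~, idx] = unique(words);
counts = accumarray(idx(:), 1);
maxCount = max(counts);
mostUsed = uwords(counts == maxCount);
```

Here mostUsed holds the most repeated word or words and maxCount is how many times each of them appears.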
2 comments
Image Analyst
April 3, 2023
Or simpler than
words = regexp(bunchofwords,'\w+','match');
is to use strsplit
words = strsplit(bunchofwords);
DGM
April 3, 2023
Edited: DGM
April 3, 2023
No, that would be similar to the first example, naively splitting on whitespace. This causes problems with any punctuation. Note the cases of 'file', 'list', and 'words'.
bunchofwords = fileread('wordpile.txt');
bunchofwords = lower(bunchofwords);
uwords = unique(strsplit(bunchofwords))
uwords = unique(regexp(bunchofwords,'\S+','match'))
uwords = unique(regexp(bunchofwords,'\w+','match'))
I'm sure there are better ways to handle splitting into words, but using \w+ was simple enough.
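The difference is easy to see on a short made-up string. The sketch below uses a hypothetical sample text rather than the contents of wordpile.txt, and compares whitespace splitting with matching runs of word characters.

```matlab
% Hypothetical sample text with punctuation attached to some words.
txt = 'the file. a file, the list';

% Whitespace splitting keeps punctuation glued to the words,
% so 'file.' and 'file,' are counted as two different words.
bySpace = unique(strsplit(txt));

% Matching runs of word characters drops the punctuation,
% so both occurrences collapse into 'file'.
byWordChars = unique(regexp(txt, '\w+', 'match'));

fprintf('%d unique blobs by whitespace, %d by \\w+\n', ...
    numel(bySpace), numel(byWordChars));
```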
Image Analyst
April 3, 2023
If you don't have the Text Analytics Toolbox (which @the cyclist's solution requires), then you can get a histogram like this:
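str = 'abcddrd,ee,fghd,**^^###$s t q j'; % Whatever your character array is
% Convert characters to their numeric character codes.
strAscii = double(str);
% Center one bin on each integer code so no two codes share a bin.
edges = -0.5 : max(strAscii) + 0.5;
% histogram returns a graphics object; the counts are in its Values property.
h = histogram(strAscii, edges);
counts = h.Values; % counts(k+1) is the number of characters with code k
% Fancy up the plot.
grid on;
xlabel('ASCII value');
ylabel('Count');
title('Histogram of Characters')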
2 comments
the cyclist
April 3, 2023
Unless I misunderstand, this solution finds the count of characters. This question (and my solution) is about finding words.
Image Analyst
April 3, 2023
I think your solution is more like what the OP wants. But maybe I'll leave mine up in case someone in the future stumbles across it and wants a histogram of characters.
By the way, if he doesn't have that toolbox, is there a solution for a histogram of complete words?
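One toolbox-free possibility for a histogram of complete words is to split the text with the \w+ pattern from DGM's answer and treat each distinct word as a category. The sketch below uses a made-up sample sentence and only base MATLAB functions; it inherits all the caveats about what counts as a "word".

```matlab
% Hypothetical sample text; for a real file use lower(fileread('yourfile.txt')).
txt = lower('The cat saw the dog. The dog saw nothing.');

% Split into words using runs of word characters.
words = regexp(txt, '\w+', 'match');

% Treat each distinct word as a category and count occurrences.
c = categorical(words);
wordList = categories(c);     % distinct words, sorted alphabetically
wordCounts = countcats(c);    % counts in the same order as wordList

% Bar chart of word frequencies, most frequent first.
histogram(c, 'DisplayOrder', 'descend');
xlabel('Word');
ylabel('Count');
title('Histogram of Words');
```

For a long text the chart gets crowded, so it may help to plot only the most frequent words, for example by sorting wordCounts and keeping the first few entries.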