Count word frequency, please help

Question

Amr Hashem, 21 July 2015


https://jp.mathworks.com/matlabcentral/answers/230619-count-word-frequency-please-help

Commented: Cyril Nii Amankwah Nyankerh, 9 March 2022

Accepted answer: Cedric


I have a column in which each cell contains some text, and I want to count the frequency of each word.

I wrote this code, which works, to count the word frequencies and sort them:

 str = alldata{:,51};
 C = regexp(str, ' ', 'split')';
 [val, idxC, idxV] = unique(C);
 n = accumarray(idxV, 1);
 m = [n idxC];                  % n occurrences for word C(idxC)
 y = [val num2cell(n)];         % words and counts, unsorted
 [~, so] = sort(n, 'descend');  % sort by descending frequency; ties stay alphabetical
 words = val(so);               % words sorted by frequency
 freq = n(so);                  % sorted frequencies
 z = [words num2cell(freq)];    % words with their frequencies, sorted

It gives me a good result.

However, this works on the first cell only, and I want the word frequencies over all the cells.

I tried this:

 j=1;
 for i=1:size(alldata,1)    
   fid(j,1) = alldata(i,51);
   C(k,1) = regexp((fid(j,1)),' ','split')';
   [val{k},idxC{k}, idxV{k}] = unique(C{k});
   n{k} = accumarray(idxV(k),1);
   j=j+1;
 end
 m = [n idxC] ; 
 y = [val num2cell(n)];      
 [~, so] = sort(n,'descend');% s= sort(n);   
 words = val(so);  
 freq = n(so);          
 z = [words num2cell(freq)];

but it didn't work; it seems I have to merge all the cells first.

How can I get the word frequencies for the whole file?

3 comments

Cedric, 23 July 2015

I return the question to you ;-)

Alex, 11 November 2018

How can the same be implemented using MapReduce? Would each row of the column be sent to a different mapper?


Answer 1

Cedric, 23 July 2015


https://jp.mathworks.com/matlabcentral/answers/230619-count-word-frequency-please-help#answer_186922

Edited: Cedric, 23 July 2015


What do you mean by "it didn't work"? Did you get an error message? If so, please copy/paste it. If the output was not correct, please provide some source/test file and describe what is not correct. Without this, it is difficult to guess what is wrong.

At first sight, and without really knowing what is in column 51 of the alldata cell array, the first example works by "luck", because in the first line:

str = alldata{:,51};

the right-hand side is a comma-separated list (CSL), and only its first element, alldata{1,51}, is stored in the variable str.
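To make the CSL behaviour concrete, here is a minimal sketch. The cell array col and its contents are hypothetical stand-ins for one column of alldata; they are not taken from the original data.

```matlab
% Hypothetical 3-by-1 cell array standing in for one column of alldata.
col = {'the cat sat'; 'the dog ran'; 'a cat ran'};

% Brace indexing with a colon yields a comma-separated list of three items.
% With a single variable on the left, only the first item is assigned.
str = col{:,1};

% Parenthesis indexing keeps the cell array, so all three rows survive.
allRows = col(:,1);

disp(str)              % the cat sat
disp(numel(allRows))   % 3
```

This is why the first snippet silently processes one cell only instead of raising an error.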

Then, the way you get counts of occurrences using UNIQUE and ACCUMARRAY is smart and elegant as far as I am concerned. You don't seem to be using the next two lines where you define m and y, and the last lines with the sorting are correct.
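As a small illustration of why the UNIQUE plus ACCUMARRAY combination counts occurrences, consider the sketch below. The word list is made up for the example.

```matlab
% Hypothetical column cell array of words.
words = {'b'; 'a'; 'b'; 'c'; 'b'};

% The third output of unique maps every word to the position of its
% unique value, so equal words share the same index.
[uniqueWords, ~, idx] = unique(words);   % uniqueWords = {'a';'b';'c'}, idx = [2;1;2;3;2]

% accumarray adds 1 into bin idx(k) for each word, which yields the counts.
counts = accumarray(idx, 1);             % counts = [1;3;1]
```

Each row of counts lines up with the same row of uniqueWords, which is what makes the later sort by frequency a simple reindexing.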

Now in the second part, the image that you provide seems to indicate that you have a rectangular array of strings, but you are still working on column 51 exclusively. Why? You are using an index k that is not defined, and you seem to want to store UNIQUE outputs in cell arrays, but then you don't process them as cell arrays (it's like a copy/paste of what you had done above, but not adapted to this new setup). So there are many reasons for it not to work.
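For comparison, a loop over the rows can be made to work if the words are collected first and counted once at the end. The following is only a sketch of that idea, not the approach recommended in this answer; the sample alldata and the variable names allWords and rowWords are assumptions made for the example.

```matlab
% Hypothetical data: a 3-by-51 cell array whose column 51 holds the text.
alldata = cell(3, 51);
alldata(:, 51) = {'the cat sat'; 'the dog ran'; 'a cat ran'};

allWords = {};                                   % words gathered from every row
for i = 1:size(alldata, 1)
    % Brace indexing extracts the char array; split it into a column of words.
    rowWords = regexp(alldata{i, 51}, ' ', 'split')';
    allWords = [allWords; rowWords];             %#ok<AGROW> grows with each row
end

% Count once, over the merged word list.
[val, ~, idxV] = unique(allWords);
n = accumarray(idxV, 1);
```

Growing allWords inside the loop is acceptable for small inputs but can become slow for large ones.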

If you need to process all entries of the alldata cell array together, one way to achieve it without iterating through each cell and then having to perform a complex merge, is to create a giant string with all cells' content. You can achieve this by just copying the code for the first case, and replacing the lines:

 str = alldata{:,51} ;
 C = regexp((str),' ','split')';

by the following:

 % - Build giant string with all content, separated by white spaces.
 buffer = [alldata(:), repmat( {' '}, numel( alldata ), 1 )]' ;
 buffer = strtrim( [buffer{:}] ) ;
 % - Split giant string on white spaces.
 C = strsplit( buffer, ' ' )' ;

so your code should look like the following overall (where I took the liberty of making some variable names more explicit):

 % - Build giant string with all content, separated by white spaces.
 buffer = [alldata(:), repmat( {' '}, numel( alldata ), 1 )]' ;
 buffer = strtrim( [buffer{:}] ) ;
 % - Split giant string on white spaces, build array of unique words and get
 %  counts.
 words = strsplit( buffer, ' ' )' ;
 [words_u, ~, idxU] = unique( words ) ;
 counts = accumarray( idxU, 1 ) ;
 % - Sort entries by count.
 [~, idxS] = sort( counts, 'descend' ) ;
 words_us = words_u(idxS) ;
 counts_s = counts(idxS) ;
 % - Build cell array of unique words and counts.
 result = [words_us, num2cell( counts_s )] ;
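To try the approach without the original file, here is a self-contained run on a small made-up alldata (a 2-by-2 cell array of strings, which is an assumption; the real data is not shown in the thread). It assumes every cell holds a character row vector.

```matlab
% Hypothetical stand-in for alldata: every cell holds a char row vector.
alldata = {'the cat sat', 'the dog'; 'a cat ran', 'the end'};

% Interleave each cell with a single space, then concatenate everything.
buffer = [alldata(:), repmat({' '}, numel(alldata), 1)]';
buffer = strtrim([buffer{:}]);

% Split into words, find unique words and count them.
words = strsplit(buffer, ' ')';
[words_u, ~, idxU] = unique(words);
counts = accumarray(idxU, 1);

% Sort by descending count; ties keep the alphabetical order from unique.
[~, idxS] = sort(counts, 'descend');
result = [words_u(idxS), num2cell(counts(idxS))];

disp(result(1:2, :))   % 'the' with 3, then 'cat' with 2
```

Note that the comparison is case-sensitive and punctuation is not stripped, so 'Cat' and 'cat,' would be counted as separate words unless the text is normalised first.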

Hope it helps!

5 comments

Med Aymane Ahajjam, 24 February 2020

Perfect! Thank you even after 5 years!

Cyril Nii Amankwah Nyankerh, 9 March 2022

Thanks for this


Answer 2

Amr Hashem, 24 July 2015


https://jp.mathworks.com/matlabcentral/answers/230619-count-word-frequency-please-help#answer_187016


It works...

 buffer = [alldata(:,51), repmat({' '}, numel(alldata(:,51)), 1)]';
 buffer = strtrim([buffer{:}]);
 C = regexp(buffer, ' ', 'split')';
 [val, idxC, idxV] = unique(C);
 n = accumarray(idxV, 1);
 m = [n idxC];                  % n occurrences for word C(idxC)
 y = [val num2cell(n)];         % words and counts, unsorted
 [~, so] = sort(n, 'descend');  % sort by descending frequency; ties stay alphabetical
 words = val(so);               % words sorted by frequency
 freq = n(so);                  % sorted frequencies
 z = [words num2cell(freq)];    % words with their frequencies, sorted

Thanks to Cedric Wannaz.

