Finding duplicate strings in a cell array and their index
I have to convert a cell array with more than 100,000 elements to a structure array with four fields. Right now, I have something like:
% cell array = nameData
n = 1;
for j = 2:102
    for i = 2:length(nameData)
        S(n).name = nameData{i,j};
        S(n).frequency = 1;
        n = n+1;
    end
end
However, I need to find duplicate strings in this array, and find information about them. Basically, I am collecting a database of strings and if I run across a duplicate, increase the frequency of that string rather than adding it to the structure.
I had been using loops within the previous two loops to achieve this:
for k = 1:n
    if strcmpi(S(k).name, nameData{i,j})
        S(k).frequency = S(k).frequency + 1;
    end
end
However, I always just end up with all 100,000 structure elements. Any other solution I have gotten to work was entirely too slow, and this conversion from cell to structure array must happen in less than 20 seconds.
Thanks!
2 Comments
Paul Wintz
10 September 2021
The use of i and j as index variables is so ubiquitous in programming that I would say, instead, that you should avoid using i and j as the imaginary unit and use 1i or 1j, which cannot be overwritten.
Accepted Answer
Stephen23
12 April 2015
Edited: Stephen23
13 April 2015
Learn to write vectorized code to make your code neater, faster and more robust: loops are not the first choice for solving problems in MATLAB, vectorization is!
This solution takes less than one second on my machine. First we generate an array of fake data, consisting of 100000 two-character strings of random characters:
N = 100000;
C = cellstr(char(32+randi(94,N,2)));
tic
[D,~,X] = unique(C(:));
Y = hist(X,unique(X));
Z = struct('name',D,'freq',num2cell(Y(:)));
toc
Elapsed time is 0.379057 seconds.
And we can have a look at a random example of the output Z:
>> Z(5).name
ans =
!%
>> Z(5).freq
ans =
12
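To make the three steps easier to follow, here is a minimal sketch on a tiny, hand-checkable input. The fruit names are hypothetical stand-in data, not taken from the question; the calls are the same ones used above.

```matlab
% Small hand-checkable input (hypothetical data, not from the question)
C = {'apple','pear','apple','fig','pear','apple'};

% D: sorted unique strings; X: for each element of C, its row index in D
[D,~,X] = unique(C(:));

% Count how often each index occurs, one bin per unique string
Y = hist(X,unique(X));

% One struct element per unique string, together with its count
Z = struct('name',D,'freq',num2cell(Y(:)));

for k = 1:numel(Z)
    fprintf('%s: %d\n', Z(k).name, Z(k).freq);
end
```

Here D comes back sorted as 'apple', 'fig', 'pear', and X records which of those each original element matches, so counting the entries of X gives the frequency of each string without any explicit comparison loop.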
For newer versions you can use histogram instead. Note that vectorized code scales to larger arrays much better than loops do: even for one million elements in array C, this method took only 4.87 seconds on my machine.
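As one possible alternative to hist, the counting step can also be written with accumarray, which needs no histogram function at all. The sketch below also applies the idea to data shaped like the question's: it assumes nameData holds only character vectors, skips the first row and column as the question's loops do, and uses lower so that matching is case-insensitive, similar to strcmpi. The small nameData here is a hypothetical stand-in, and note that the stored names end up lowercased.

```matlab
% Assumption: hypothetical stand-in for the question's nameData,
% with a header row and a label column that should be skipped
nameData = {'id','n1','n2'; 'r1','Smith','JONES'; 'r2','smith','Lee'};

C = nameData(2:end,2:end);       % drop first row and first column
[D,~,X] = unique(lower(C(:)));   % lower() makes the matching case-insensitive
Y = accumarray(X(:),1);          % number of occurrences of each unique string

% Field names follow the question's structure (name, frequency)
S = struct('name',D,'frequency',num2cell(Y));
```

If the original capitalization has to be preserved, one option is to keep the second output of unique and use it to index back into C(:) for a representative spelling of each name.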