Loop through the unique values of a very large column and extract data
This is more a speed question than a "how to" question.
Assume I have the following problem:
Three variables A, B and C.
A is a series of IDs and B and C are data (e.g. dates and a measurement).
For various reasons I want to separate the data, so instead of three columns I have a structure with something like:
mystruct.First_ID_FROM_A = [B(indexFirstID,:) C(indexFirstID,:)]
Traditionally I just do the following:
[uA,IA,IB] = unique(A);
for i = 1:length(uA)
    ii = find(i == IB);   % rows of A that belong to the i-th unique ID
    mystruct.(uA{i,1}) = [B(ii,:) C(ii,:)];
    % sometimes I do other stuff here with some cross referencing, so the index ii is useful.
end
Job done. I have tried other methods, but this is pretty fast. The trouble is that I now have very large data (A, B and C each have close to a billion rows). So this is my second attempt, which I run on a server:
[uA,IA,IB] = unique(A);
N = length(uA);
temp = cell(N,1);
% do the indexing with a cell array that can be sliced by parfor
parfor i = 1:N
    ii = find(i == IB);
    temp{i,1} = [B(ii,:) C(ii,:)];
end
% do a second loop just to reallocate the data
for i = 1:N
    mystruct.(uA{i,1}) = temp{i,1};
end
Despite using two loops, this can be quicker because the extraction runs in parallel and the assignment is fast.
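The reallocation loop could probably be collapsed into a single call to cell2struct. The following is only a minimal sketch on made-up sample data (the IDs and values are hypothetical), and it assumes every entry of uA is a valid field name:

```matlab
% Hypothetical sample data standing in for A, B and C
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [1; 2; 3; 4; 5];
C = [10; 20; 30; 40; 50];

[uA, ~, IB] = unique(A);
N = numel(uA);
temp = cell(N, 1);
for i = 1:N                        % could equally be a parfor
    ii = find(IB == i);
    temp{i} = [B(ii,:) C(ii,:)];
end

% One call instead of a second loop: the field names are taken from uA
mystruct = cell2struct(temp, uA, 1);
```

Whether this is noticeably faster than the plain loop for a large number of IDs would need timing on the real data.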
Is there a neat way to do this faster without the loop, in either step of the second process, using something like an array-based version of a binary expansion function? Or should I write a C++ MEX routine to speed this tedious thing up? I think one problem here is that the size of the output array is not known in advance.
If so, does anyone have any experience or examples of how to create and map a MATLAB structure in C++ so that the output can be read by MATLAB? I use str2doubleq a lot, which takes a cell array of strings and outputs doubles and is quite vanilla, and I have written a few custom C and C++ routines for fast date and time pulls when datenum was too slow.
But this is annoying me, and I am sure there is a neater way to do it. Once the data is in the structure, it is really fast to just use the fieldnames command and then loop through the sub-data objects.
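One pure-MATLAB option worth timing before writing a MEX file is to sort the group index once, so that the rows of each ID form a contiguous block and no find call is needed inside the loop. This is only a sketch on made-up sample data (the IDs and values are hypothetical), not a benchmarked solution:

```matlab
% Hypothetical sample data standing in for A, B and C
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [1; 2; 3; 4; 5];
C = [10; 20; 30; 40; 50];

[uA, ~, IB] = unique(A);
N = numel(uA);

% Sort the group numbers once; each ID then occupies one block of 'order'
[~, order] = sort(IB);               % original row order is kept within a group
counts = accumarray(IB, 1, [N 1]);   % number of rows per unique ID
last   = cumsum(counts);             % last position of each block in 'order'
first  = last - counts + 1;          % first position of each block

mystruct = struct();
for i = 1:N
    ii = order(first(i):last(i));    % same rows that find(i == IB) would return
    mystruct.(uA{i}) = [B(ii,:) C(ii,:)];
end
```

The loop still runs once per unique ID, but each pass only touches the rows of that ID instead of comparing i against every element of IB.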
7 comments
Sindar
16 June 2020
The point of tables is that they act like a more organized structure array. If you are naming each structure field, you already spend that memory. Depending on the shape of your data, something similar to:
mytable = array2table([B(IB,:) C(IB,:)],'RowNames',num2str(uA))
should work without any loops
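As a rough illustration of the table idea, here is a sketch on made-up sample data (the IDs, values and variable names are hypothetical). Because table row names have to be unique, the IDs are kept as an ordinary variable here and grouped with findgroups:

```matlab
% Hypothetical sample data standing in for A, B and C
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [1; 2; 3; 4; 5];
C = [10; 20; 30; 40; 50];

% Keep everything in one table, with the ID as a grouping variable
T = table(A, B, C, 'VariableNames', {'ID', 'B', 'C'});
[G, ids] = findgroups(T.ID);          % G(k) is the group number of row k

% Per-ID summary without an explicit loop, e.g. the mean of C for each ID
meanC = splitapply(@mean, T.C, G);

% Rows of a single ID pulled on demand
rows_a = T(G == find(strcmp(ids, 'id_a')), :);
```

This keeps a single copy of the data and only extracts a per-ID block when it is actually needed.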
Answers (0)