Loop through the unique values of a very large column and extract data


Julian Williams on 15 Jun 2020

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/548415-loop-through-the-unique-values-of-a-very-large-column-and-extract-data

Commented: Julian Williams on 16 Jun 2020

This is more a speed question than a "how to" question.

Assume I have the following problem:

Three variables A, B and C.

A is a series of IDs and B and C are data (e.g. dates and a measurement).

For various reasons I want to separate the data, so instead of three columns I have a structure with one field per ID, something like:

mystruct.First_ID_FROM_A = [B(indexFirstID,:) C(indexFirstID,:)]

Traditionally I just do the following:

[uA,IA,IB] = unique(A);
for i=1:length(uA)
    ii = find(i==IB);
    mystruct.(uA{i,1}) = [B(ii,:) C(ii,:)];
    %sometimes I do other stuff here with some cross referencing so the index ii is useful.
end

Job done. I have tried other methods, but this is pretty fast, except that I now have really big data (A, B and C each have the best part of a billion rows). So this is my second attempt, which I run on a server:

[uA,IA,IB] = unique(A);
N = length(uA);
temp = cell(N,1);
% do the indexing into a cell array, which parfor can slice.
parfor i=1:N
    ii = find(i==IB);
    temp{i,1} = [B(ii,:) C(ii,:)];
end
% do a second loop just to reallocate the data
for i=1:N
    mystruct.(uA{i,1}) = temp{i,1};  
end

So despite using two loops, this can be quicker, because the extraction runs in parallel and the assignment is fast.
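One thing worth noting about both versions: `find(i==IB)` scans the whole of `IB` on every iteration, so the total work grows with the number of unique IDs times the number of rows. A minimal sketch of an alternative, assuming the IDs are valid field names as in the code above, is to sort the group numbers once so that each ID's rows become a contiguous slice. The toy values for `A`, `B` and `C` are made up for illustration, and whether this is actually faster at a billion rows would need timing on the real data.

```matlab
% Toy data standing in for the real A, B and C (hypothetical values).
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [10; 20; 30; 40; 50];
C = [1.5; 2.5; 3.5; 4.5; 5.5];

[uA, ~, IB] = unique(A);        % IB(k) is the group number of row k
N = numel(uA);

% Sort the group numbers once; rows of the same group become contiguous.
% sort is stable, so rows keep their original order within each group.
[sortedIB, order] = sort(IB);

% First position of each group in the sorted order, plus an end sentinel.
starts = [find(diff([0; sortedIB]) ~= 0); numel(IB) + 1];

mystruct = struct();
for i = 1:N
    ii = order(starts(i):starts(i+1)-1);   % same rows as find(i == IB)
    mystruct.(uA{i}) = [B(ii,:) C(ii,:)];
end
```

The index `ii` is still available inside the loop, so any cross-referencing that relies on it can stay where it is.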

Is there a fancy way to do this faster without the loop, in either step of the second process, using something like an array-based version of a binary expansion function? Or should I write a C++ MEX routine to speed this tedious thing up? I think one problem here is that the size of the output array is not known in advance.
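For the variable-sized output, one possible sketch uses `accumarray` with a cell-returning function to collect the row numbers of every ID in one call, and `cell2struct` to build the structure without a second explicit loop. This is an illustration on made-up toy data, not a benchmarked solution: `cellfun` is still a loop internally, and `data = [B C]` makes a full copy of the data.

```matlab
% Toy data standing in for the real A, B and C (hypothetical values).
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [10; 20; 30; 40; 50];
C = [1.5; 2.5; 3.5; 4.5; 5.5];

[uA, ~, IB] = unique(A);
rows = (1:numel(IB))';

% One cell per unique ID, each holding that ID's row numbers.
% accumarray does not guarantee the order within a group, hence the sort.
idx = accumarray(IB, rows, [], @(r) {sort(r)});

% Pull out each ID's block of data, then turn the cell array into a
% structure whose field names are the unique IDs.
data = [B C];
temp = cellfun(@(r) data(r,:), idx, 'UniformOutput', false);
mystruct = cell2struct(temp, uA, 1);
```

Because each cell can hold a block of a different height, the uncertain output size is not a problem here.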

If so, does anyone have experience with, or examples of, how to create and populate a MATLAB structure in C++ so that the output can be read by MATLAB? I use str2doubleq a lot; it takes a cell array of strings and outputs doubles, which is quite vanilla. I have also written a few custom C and C++ codes for fast date and time parsing, when datenum was too slow.

But this is annoying me; I am sure there is a neater way to do it. Once the data is in the structure, it is really fast to use the fieldnames command and then loop through the sub-data objects.
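For reference, the access pattern described here looks roughly like the following sketch. The structure contents are made-up toy values, and the per-ID operations (row count, mean of the second column) are only placeholders for whatever is done with each block.

```matlab
% A small structure of the kind built above (hypothetical values).
mystruct = struct('id_a', [20 2.5; 50 5.5], 'id_b', [10 1.5; 30 3.5]);

ids = fieldnames(mystruct);          % cell array of the ID names
nRows = zeros(numel(ids), 1);
for k = 1:numel(ids)
    block = mystruct.(ids{k});       % dynamic field access, no searching
    nRows(k) = size(block, 1);       % placeholder per-ID computation
end

% structfun applies a function to every field when the result is a scalar.
meanC = structfun(@(m) mean(m(:,2)), mystruct);
```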

7 Comments

Julian Williams on 15 Jun 2020

Hello,

Thanks for the suggestion. I don't want to write to disk, so I think tall tables are a no-no; I have no issue with memory. I think tables use structures and field names, no? I agree that once the data is in the structure, life is pretty easy and very fast. The tricky bit is getting it in there.

The problem with row names, as I see it, is the following: if I name the rows, I add a number of bytes per row, so the overhead is proportional to the size of the dataset.

If I have, say, 350 million rows, then I have just added about 8.4 GB (24 bytes per row to make a name big enough, times 350 million). I still then have to index the rows, so I get back to my original problem. Cell arrays are not very neat, and I dislike the fact that I double the data array in my second example. If I delete data as I assign it, I add time to the calculation, proportional to the data size, and I am super impatient!
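As a quick check of the row-name estimate, the 8.4 GB total corresponds to 24 bytes per row over 350 million rows. The per-row size is the assumption here; real row names are stored as character vectors with extra per-element overhead, so the actual cost could be higher.

```matlab
nRows = 350e6;          % number of rows quoted above
bytesPerName = 24;      % assumed per-row name size implied by the 8.4 GB total
totalGB = nRows * bytesPerName / 1e9;   % decimal gigabytes
fprintf('%.1f GB\n', totalGB);
```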

Thanks for the thoughts.

Julian

Sindar on 16 Jun 2020

The point of tables is that they act like a more organized structure array. If you are naming each structure field, you already spend that memory. Depending on the shape of your data, something similar to:

mytable = array2table([B(IB,:) C(IB,:)],'RowNames',num2str(uA))

should work without any loops
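One caveat with row names is that a table requires them to be unique, so IDs that repeat across rows cannot be used as row names directly. A possible variation, sketched here on made-up toy data, keeps the ID as an ordinary table variable and groups on it with `findgroups` and `splitapply`. The variable names `ID`, `B` and `C` are chosen for illustration only.

```matlab
% Toy data standing in for the real A, B and C (hypothetical values).
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [10; 20; 30; 40; 50];
C = [1.5; 2.5; 3.5; 4.5; 5.5];

% Keep the ID as a regular variable instead of a row name.
T = table(A, B, C, 'VariableNames', {'ID', 'B', 'C'});

% G(k) is the group number of row k; ids lists the unique IDs in order.
[G, ids] = findgroups(T.ID);

% Per-ID blocks: the function returns a cell because blocks differ in height.
blocks = splitapply(@(b, c) {[b c]}, T.B, T.C, G);

% Per-ID summaries need no cell wrapper when the result is a scalar.
meanC = splitapply(@mean, T.C, G);
```

If only per-ID summaries are needed, the `blocks` step can be skipped entirely, which avoids copying the data into separate pieces.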

Julian Williams on 16 Jun 2020

Benjamin, that is very neat, much appreciated. Sindar, many thanks for the point on the tables.


Product: MATLAB

Release: R2019b
