Is there a parallel version of splitapply()?

Group-based computation is naturally suitable for parallel computation. I wonder why MATLAB does not yet have a built-in parfor version of splitapply. On GitHub there is a repository called Matlab-FunUtils, which has cmap.m; it's like a parfor version of arrayfun. I couldn't find a similar function for splitapply. I have been trying to modify MATLAB's splitapply, but the code is very difficult to understand. Any suggestions? Should I write one from scratch?

Accepted Answer

Edric Ellis on 15 Sep 2023

0 votes

tall arrays support parallel execution of splitapply. Would that work for your case?

6 Comments

Simon on 15 Sep 2023
Thanks for your help. This is what I did:
tallA = tall(A);
g = findgroups(cat1, cat2);
tallg = tall(g);
tallout = splitapply(@mean, tallA, tallg);
shortout = gather(tallout);
I think I can then use shortout for 'normal' processing. Did I do it right?
Simon on 15 Sep 2023
Very glad I asked the question. It made me aware of a very important, very useful feature that I had ignored.
Note: advanced MATLAB utilities are hidden in the not-so-fine print at the bottom of the documentation.
Edric Ellis on 15 Sep 2023
I think you should apply findgroups to tallA.
Also note that if you want mean by groups, you can probably use some ideas from this example: https://www.mathworks.com/help/matlab/import_export/grouped-statistics-calculations-with-tall-arrays.html
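One way to read the suggestion above, as a minimal sketch rather than the answerer's exact code: build the tall table first, then call findgroups on a tall variable so the group numbers are themselves tall. The use of patients.mat, the conversion of Gender to categorical, and the variable names are assumptions made for illustration.

```matlab
% Sketch: grouped mean where both the data and the group numbers are tall.
load patients                                  % sample data shipped with MATLAB

% Assumption: convert Gender to categorical before making the table tall.
T  = table(categorical(Gender), Weight, 'VariableNames', {'Gender', 'Weight'});
tT = tall(T);                                  % in-memory table -> tall table

% findgroups is applied to the tall variable, so G is tall as well.
[G, genderNames] = findgroups(tT.Gender);

% Evaluation is deferred until gather is called.
tallMean = splitapply(@mean, tT.Weight, G);

% Collect the (small) results back into ordinary in-memory arrays.
[genderNames, meanWeight] = gather(genderNames, tallMean);
```

If Parallel Computing Toolbox is installed and a pool is open, the gather step is evaluated on the pool; otherwise it runs in the local MATLAB session.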
Simon on 16 Sep 2023
Moved: Voss on 16 Sep 2023
@Edric Ellis "I think you should apply findgroups to tallA. "
Great tips! Thank you.
"Also note that if you want mean by groups, you can probably use some ideas from this example: https://www.mathworks.com/help/matlab/import_export/grouped-statistics-calculations-with-tall-arrays.html"
That's a very relevant link. Mean is just an example. What I really do is build a new array/table from each group.
Simon on 20 Sep 2023
@Edric Ellis When my code has nested splitapply calls, the tall approach fails. This happens when I have a two-level nested grouping structure and splitapply() is used to process the groups at each level. For example, with patients.mat, suppose there is a new grouping variable, 'BloodType', within each Male/Female group of the Gender variable. tall will not allow splitapply to work at both levels of grouping. I guess this is an intrinsic limitation of the parallel toolbox? Similarly, it does not allow nested parfor-loops.
Edric Ellis on 20 Sep 2023
Yes, typically this sort of algorithm cannot work in a nested way. It isn't quite the same restriction as nested parfor (in fact, with thread-pools, you can get genuine 2-level parallelism with parfor, but there's a separate restriction that you need to "hide" the inner parfor inside a function).
As to your original question: does it work to make the groups based on the combinations of variables, as per the 2nd syntax of findgroups? I.e. something like:
load patients
G = findgroups(Smoker, Age >= 40)
G = 100×1
3 2 1 2 2 2 3 2 1 1
This divides the data into 4 groups, one for each combination of Smoker and Age >= 40.
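The combined-group idea can replace a two-level nested splitapply with a single flat call. The sketch below is an illustration under assumptions: patients.mat has no BloodType variable, so Smoker stands in as the second-level grouping variable, and the per-group function (centering Weight within each group) is a hypothetical example of producing a new array per group.

```matlab
% Sketch: one flat grouping over two variables instead of nested groups.
load patients
gender = categorical(Gender);

% One group number per (gender, Smoker) combination; the extra outputs
% report which combination each group number corresponds to.
[G, genderID, smokerID] = findgroups(gender, Smoker);

% Scalar result per group.
meanWeight   = splitapply(@mean, Weight, G);
groupSummary = table(genderID, smokerID, meanWeight);

% Array result per group: wrap the output in a cell so that splitapply
% receives one scalar (a 1-by-1 cell) per group.
centered = splitapply(@(w) {w - mean(w)}, Weight, G);
```

Because G is an ordinary grouping vector, the same call pattern should carry over to tall inputs, where a nested formulation does not work.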


More Answers (1)

Bruno Luong on 15 Sep 2023

0 votes

To me there are 2 reasons:
  • Not everyone has the parallel toolbox
  • The user function can be multi-threaded and already use the CPU cores efficiently; adding parallelism on top will then likely reduce performance. Parallel computation should never be applied automatically everywhere. It should be judged case by case.

2 Comments

Simon on 15 Sep 2023
I do computations with tall, tall tables. parfor saves huge amount of runtime for me compared with for-looping through rows serially. Now I am doing group-based computations and the number of groups I have is large. For that, I feel that splitapply is a little sluggish.
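For in-memory data with many groups, one possible hand-rolled alternative is to split the data into a cell array once and then loop over the groups with parfor. This is a sketch under assumptions, not a built-in parallel splitapply: patients.mat, the grouping variables, and the per-group function groupFcn are all hypothetical stand-ins, and whether it is faster than plain splitapply depends on how expensive the per-group work is.

```matlab
% Sketch: parfor over groups as a do-it-yourself parallel splitapply.
load patients
G       = findgroups(categorical(Gender), Smoker);
nGroups = max(G);

% Split once: one cell per group, so each parfor iteration receives only
% the data for its own group (a sliced input).
weightByGroup = splitapply(@(w) {w}, Weight, G);

% Hypothetical per-group function returning a fixed-size row vector.
groupFcn = @(w) [mean(w), std(w), numel(w)];

result = zeros(nGroups, 3);
parfor k = 1:nGroups
    % feval makes it explicit that groupFcn is a function handle.
    result(k, :) = feval(groupFcn, weightByGroup{k});
end
```

Without Parallel Computing Toolbox, the parfor loop simply runs serially, so the sketch still executes.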
Simon on 16 Sep 2023
@Bruno Luong "Not everyone has the parallel toolbox"
I know that. I was very hesitant to use the parallel toolbox. I thought it would be extremely difficult to adopt parallelization. But I dived in and gave it a try. It turned out that MATLAB has done a wonderful job of making a daunting task beginner-easy. I googled how parallelization could be done in Python, and what I found was a choice of possibly useful packages, new sets of documentation to digest, and so on.
IMHO, anyone who has chosen MATLAB should seriously consider adopting the parallel toolbox. It's easy to use and, combined with group-processing functions and functional programming, it shifts the whole coding game to a different paradigm, saving both coding time and run time.


Release

R2023a

Asked:

15 Sep 2023

Commented:

20 Sep 2023
