Why do workers keep aborting during parallel computation on a cluster?
I keep getting the warning
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 392)
In parallel_function>distributed_execution (line 741)
In parallel_function (line 573)
In fuction_pa1 (line 100)]
when I run a simulation that has a parfor loop on the cluster. I noticed that workers abort execution one after another, and this seems to happen more often on the cluster than on my PC.
I would like to know the reason for this issue. Is there a way to avoid it?
Thanks.
19 Comments
Mario Malic
7 Dec 2020
What kind of simulation?
Muh Alam
7 Dec 2020
Kojiro Saito
8 Dec 2020
matlab_crash_dump files might be stored in the JobStorageLocation of the parallel workers.
c=parcluster();
c.JobStorageLocation
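To search that location for crash dumps, something like the following might help. It is only a sketch: it assumes the default cluster profile keeps its jobs in a single folder, and it uses the `matlab_crash_dump` file name prefix mentioned above as the search pattern. The recursive `**` wildcard in `dir` needs a reasonably recent MATLAB release.

```matlab
% Sketch: list crash dump files under the job storage folder of the
% default cluster profile (assumption: the profile in use is the default).
c = parcluster();
storageDir = c.JobStorageLocation;

if ischar(storageDir) || isstring(storageDir)
    % Search the folder and all of its subfolders.
    dumps = dir(fullfile(storageDir, '**', 'matlab_crash_dump*'));
    if isempty(dumps)
        disp('No crash dump files found in the job storage location.');
    end
    for k = 1:numel(dumps)
        fprintf('%s  (%s)\n', fullfile(dumps(k).folder, dumps(k).name), dumps(k).date);
    end
else
    % Some profiles report the location in another form; show it as-is.
    disp('JobStorageLocation is not a single folder; inspect it manually:');
    disp(storageDir);
end
```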
Muh Alam
9 Dec 2020
Kojiro Saito
9 Dec 2020
Does your code have file I/O? For example, save.
Parallel workers might crash if multiple workers try to write to the same file.
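If results do have to be written inside the loop, one common way to avoid several workers touching the same file is to give every iteration its own file. The following is a minimal sketch of that idea; the helper `saveResult`, the temporary output folder, and the file name pattern are hypothetical and not taken from the original code.

```matlab
% Sketch: every iteration writes to its own file, so no two workers
% ever write to the same file. All names here are illustrative.
outDir = tempname;      % hypothetical output folder
mkdir(outDir);

parfor n = 1:8
    result = n^2;       % placeholder for the real computation
    fileName = fullfile(outDir, sprintf('result_%03d.mat', n));
    saveResult(fileName, result);
end

function saveResult(fileName, result)
% save is wrapped in a function because calling it directly in the
% body of a parfor loop breaks the loop's transparency rules.
save(fileName, 'result');
end
```

The per-iteration files can then be loaded and combined after the loop has finished.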
Muh Alam
9 Dec 2020
Kojiro Saito
10 Dec 2020
No, I meant save inside the parfor loop. Since you are calling save after the parfor loop, it is safe.
Did you try changing SpmdEnabled option to false?
parpool('SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
Muh Alam
10 Dec 2020
Kojiro Saito
10 Dec 2020
OK. Does this also occur if you request fewer workers?
Such as,
parpool(2, 'SpmdEnabled', false);
parfor n=1:100
% parallel codes
end
Muh Alam
10 Dec 2020
Kojiro Saito
11 Dec 2020
Does your cluster have enough resources?
On Linux, running the following from a terminal
ulimit -a
shows the resource limits (max processes etc.).
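The limits that matter are the ones seen by the workers on the compute nodes, which can differ from those of a login shell. As a rough sketch, the same command can be run from MATLAB, both on the client and on one worker of an open pool. This assumes a Linux or macOS machine whose shell provides `ulimit`, and a pool that is already open for the worker part.

```matlab
% Sketch: print resource limits as seen by MATLAB (Linux/macOS only).
if isunix
    [status, limits] = system('ulimit -a');
    if status == 0
        fprintf('Limits seen by the client:\n%s\n', limits);
    else
        warning('ulimit returned status %d on the client.', status);
    end

    % Run the same command on one worker, if a pool is already open.
    pool = gcp('nocreate');
    if ~isempty(pool)
        f = parfeval(pool, @system, 2, 'ulimit -a');
        [workerStatus, workerLimits] = fetchOutputs(f);
        if workerStatus == 0
            fprintf('Limits seen by a worker:\n%s\n', workerLimits);
        end
    end
end
```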
Muh Alam
14 Dec 2020
Muh Alam
3 Feb 2021
Kojiro Saito
3 Feb 2021
I don't think so. I think it is an ordinary script.
Are you able to check the SLURM log file?
Kojiro Saito
4 Feb 2021
I see. It was related to a memory error. As you mentioned, increasing the allocated memory, for example with "--mem-per-cpu=2G" in the sbatch options, should solve it.
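When the pool is started from MATLAB rather than from a hand-written sbatch script, the same option may be passed through the cluster object. The sketch below assumes that the default profile is a Slurm cluster object exposing a `SubmitArguments` property; profiles built on generic plugin scripts may use a different mechanism, in which case the block only prints a hint. The 2G value is the example figure from this thread, not a recommendation.

```matlab
% Sketch: request more memory per CPU for jobs submitted via this profile.
% Assumption: the default profile is a Slurm cluster with SubmitArguments.
c = parcluster();
if isprop(c, 'SubmitArguments')
    % Append to any arguments that are already configured.
    c.SubmitArguments = strtrim([char(c.SubmitArguments) ' --mem-per-cpu=2G']);
    disp(c.SubmitArguments);
else
    disp('This profile has no SubmitArguments property; set the memory request in the submission script instead.');
end
```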
Muh Alam
6 Feb 2021
Kojiro Saito
7 Feb 2021
A heterogeneous cluster could be a cause. This link describes the system requirements of MATLAB Parallel Server rather than Parallel Computing Toolbox, but it makes an important point:
"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."
Muh Alam
8 Feb 2021