I keep getting the following warning when I run a simulation that contains a parfor loop on the cluster:

Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 392)
In parallel_function>distributed_execution (line 741)
In parallel_function (line 573)
In fuction_pa1 (line 100)

I noticed that workers abort execution one after another, and this seems to happen more often on the cluster than on my PC. I would like to know the cause of this issue and whether there is a way to avoid it. Thanks.

Why do workers keep aborting during parallel computation on a cluster?

Mario Malic on 7 Dec 2020

What kind of simulation?

Muh Alam on 7 Dec 2020

A Monte Carlo simulation in MATLAB, not Simulink.

Kojiro Saito on 8 Dec 2020

According to this answer, it might be related to worker crashes.

matlab_crash_dump files might be stored in the JobStorageLocation of the parallel workers.

c=parcluster();
c.JobStorageLocation
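
The lookup above can be extended into a search for crash dumps. This is only a minimal sketch: it assumes the dumps follow the matlab_crash_dump* naming mentioned above, that JobStorageLocation is a plain folder path, and that the release supports recursive ** wildcards in dir (R2016b or later). The variable names are illustrative.

```matlab
% Search the job storage folder (and its subfolders) for worker crash dumps.
c = parcluster();                                  % default cluster profile
pattern = fullfile(c.JobStorageLocation, '**', 'matlab_crash_dump*');
dumps = dir(pattern);                              % '**' makes the search recursive

if isempty(dumps)
    disp('No crash dump files found in JobStorageLocation.');
else
    for k = 1:numel(dumps)
        % Print the full path and the modification date of each dump
        fprintf('%s  (%s)\n', fullfile(dumps(k).folder, dumps(k).name), dumps(k).date);
    end
end
```

If nothing is found here, that does not rule out a crash; it only means no dump was written to this folder.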

Muh Alam on 9 Dec 2020

Thanks Kojiro!

Based on the description in the linked answer, I checked the log files of the jobs (they appear as job#.log), but they are all empty. I am not sure whether setting the SpmdEnabled flag to false would help. I suspect that it is a communication-based issue forcing workers to abort.

Kojiro Saito on 9 Dec 2020

Does your code have file I/O? For example, save.

Parallel workers might crash if multiple workers try to write to the same file.
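
One common way to avoid that collision, if saving inside the loop is ever needed, is to give every iteration its own file. The sketch below is illustrative only: outDir, saveResult and the result_%03d.mat naming are hypothetical, and the computation is a stand-in. It is written as a script with a local function, which needs R2016b or later.

```matlab
% Each iteration writes to its own file, so no two workers share a target.
outDir = tempname;                      % hypothetical output folder
mkdir(outDir);

parfor n = 1:4
    result = n^2;                       % stand-in for the real computation
    fname = fullfile(outDir, sprintf('result_%03d.mat', n));
    saveResult(fname, result);          % save is called through a helper function
end

function saveResult(fname, result)
    % Calling save directly in a parfor body violates the loop's
    % transparency rules, so the call is wrapped in a function.
    save(fname, 'result');
end
```

The per-iteration files can then be loaded and merged on the client after the loop finishes.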

Muh Alam on 9 Dec 2020

Yes, I use save after the parfor loop ends. Isn't it only the body of the parfor loop that gets distributed?

Please correct me if I got this wrong.

Kojiro Saito on 10 Dec 2020

No, I meant save inside the parfor loop. Since you are using save after the parfor loop, it is safe.

Did you try changing the SpmdEnabled option to false?

parpool('SpmdEnabled', false);
parfor n=1:100
    % parallel codes
end

Muh Alam on 10 Dec 2020

Yes, it is false, but I still see the same warning.

Kojiro Saito on 10 Dec 2020

OK. Does this occur if you request fewer workers?

Such as,

parpool(2, 'SpmdEnabled', false);
parfor n=1:100
    % parallel codes
end

Muh Alam on 10 Dec 2020

I haven't tried a pool this small, but in some cases it keeps running with one remaining worker (e.g. 1 out of 12 or 24), and other times all workers abort the parfor execution.

Kojiro Saito on 11 Dec 2020

Does your cluster have enough resources?

On Linux, running this from a terminal

ulimit -a

shows the resource limits (max processes etc.).
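
The same check can be made from inside the MATLAB session, which may be useful because the limits seen by a batch job are not necessarily those of the login shell. A minimal sketch, assuming a Linux or other Unix-like system where ulimit is available as a shell builtin:

```matlab
% Query the resource limits that apply to this MATLAB process (Unix only).
if isunix
    [status, out] = system('ulimit -a');   % runs the shell builtin via system()
    if status == 0
        disp(out);                         % max user processes, open files, etc.
    else
        warning('Could not query limits (exit status %d).', status);
    end
else
    status = 0;                            % nothing to check on non-Unix systems
    disp('ulimit is not available on this platform.');
end
```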

Muh Alam on 14 Dec 2020

Yes, the resources are there, but they are subject to availability since many people at my university use the cluster. I can reserve multiple compute nodes through the Slurm scheduler for a single job or for several jobs.

Muh Alam on 3 Feb 2021

@Kojiro Saito could having parfor as the outermost loop be the reason?

For example:

parfor i=1:100
    %do something
    for l=1:1000
        % do something
    end
    for j=1:100
        % do another thing
    end
end

Kojiro Saito on 3 Feb 2021

@Muh Alam

I don't think so. I think it is a normal script.

Are you able to check the Slurm log file?

Muh Alam on 3 Feb 2021

Edited: Muh Alam on 3 Feb 2021

I found OOM (out of memory) errors, but after increasing the allocated memory I did not find other errors in the Slurm logs.

In the /.matlab/local_cluster_jobs directory, job.log is empty and the file job.metadata.text contains only 'concurrent'. I also want to add that the cluster profile is local only; that is, I don't have MATLAB Parallel Server, which would allow a cluster profile using Slurm or the MATLAB scheduler. So the way I did it is by reserving resources on the HPC system via Slurm and, once they are granted, running MATLAB locally on those resources.

Kojiro Saito on 4 Feb 2021

Understood. It was related to a memory error. As you mentioned, increasing the allocated memory, for example with "--mem-per-cpu=2G" in the sbatch options, should solve it.
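
When MATLAB runs with a local profile inside a Slurm allocation, as described above, it can help to size the pool from the allocation so that the workers do not exceed what was reserved. The following is a rough sketch, not a definitive recipe. It assumes the job was submitted with --cpus-per-task and --mem-per-cpu, so that Slurm sets SLURM_CPUS_PER_TASK and SLURM_MEM_PER_CPU, and that the profile is named 'local' as in this thread (newer releases may use a different name). The fallback of 2 workers is an arbitrary choice.

```matlab
% Size the parallel pool from the Slurm allocation (sketch).
nCpus       = str2double(getenv('SLURM_CPUS_PER_TASK'));  % NaN if unset
memPerCpuMB = str2double(getenv('SLURM_MEM_PER_CPU'));    % NaN if unset

if isnan(nCpus) || nCpus < 1
    nCpus = 2;                          % assumed fallback outside Slurm
end

fprintf('Requesting %d workers.\n', nCpus);
if ~isnan(memPerCpuMB)
    % Rough budget: each worker should stay below this amount of memory
    fprintf('Memory per CPU reserved by Slurm: %d MB.\n', memPerCpuMB);
end

c = parcluster('local');                % profile name as used in this thread
c.NumWorkers = nCpus;                   % allow a pool of this size
parpool(c, nCpus, 'SpmdEnabled', false);
```

If each worker needs more memory than the per-CPU reservation, requesting fewer workers than CPUs is one way to leave headroom.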

Muh Alam on 6 Feb 2021

That is not always the case. Sometimes it helps, and other times the result is just the same. I wonder if that is related to the cluster being very heterogeneous.

Kojiro Saito on 7 Feb 2021

Heterogeneity could be a cause. This link is to the system requirements of MATLAB Parallel Server, not Parallel Computing Toolbox, but it makes an important point:

"Parallel processing constructs that work on the infrastructure enabled by parpool—parfor, parfeval spmd, distributed arrays, and message passing functions—cannot be used on a heterogeneous cluster configuration. The underlying MPI infrastructure requires that all cluster computers have matching word sizes and processor endianness."
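
A quick way to see what the workers themselves report is to call computer on each of them and compare the results. This is a minimal sketch under the assumption that a pool can be opened and that parfevalOnAll is usable with it; on a single-machine local pool it will trivially report one architecture, so it is mainly informative when workers run on several nodes.

```matlab
% Compare architecture and endianness across all workers in the pool.
pool = gcp;                                        % current pool, started if needed
f = parfevalOnAll(pool, @computer, 3);             % request all 3 outputs of computer()
[arch, ~, endian] = fetchOutputs(f, 'UniformOutput', false);  % cell arrays, one entry per worker

archs   = unique(arch);                            % distinct architecture strings
endians = unique(endian);                          % 'L' (little) or 'B' (big)

if numel(archs) > 1 || numel(endians) > 1
    warning('Workers differ: %s / %s', strjoin(archs(:)', ', '), strjoin(endians(:)', ', '));
else
    fprintf('All %d workers report %s, endianness %s.\n', ...
        numel(arch), archs{1}, endians{1});
end
```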

Muh Alam on 8 Feb 2021

Interesting point! I think this is the reason in my case. Thank you @Kojiro Saito
