Why does my job on a cluster stop producing output?

Patrick Laux on 12 Apr 2016
Commented: Simone Stünzi on 9 Jul 2021
Hey, I am using the Parallel Computing Toolbox on a Linux cluster (istan nodes and SLURM scheduler). The main routine (the parfor loop section) looks as follows. A 2-D array (MASKE) is used to select the time series that contain values, and the function core_eQM is applied to these time series: ...
cd $WORKDIR
pc = parcluster('local')
pc.JobStorageLocation = strcat('$WORKDIR/',getenv('SLURM_JOB_ID'))
% start the matlabpool with maximum available workers
% control how many workers by setting ntasks in your sbatch script
matlabpool(pc, getenv('SLURM_CPUS_ON_NODE'))
...
pardim=size(MASKE,2);
XX1=NaN(7305,pardim);
parfor ii=1:pardim
    if ~isnan(MASKE(1,ii))
        fprintf('%i \t \n', ii); %shows me progress of job (creates files, which are empty)
        if meth == 1;
            xx1 = core_eQM(squeeze(VARIABLE_BSE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,3654:7305)))
        elseif meth == 2
            ...
        end
    end
end
I use the following script to submit it to the cluster:
#!/bin/bash
#SBATCH --job-name=imat_par_test
#SBATCH --output=matlab_parfor.out
#SBATCH --error=matlab_parfor.err
#SBATCH --partition=ivy
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=20
source /etc/profile.d/00-modules.sh
module load app/matlab2014b
cd $WORKDIR
# Create a local work directory
mkdir -p $WORKDIR/$SLURM_JOB_ID
#cd $WORKSDIR/$SLURM_JOB_ID
# Kick off matlab
matlab -nodesktop < script_apply_BC.m &
#wait
# Cleanup local work directory
rm -rf $WORKSDIR/$SLURM_JOB_ID
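In the script as posted, MATLAB is started in the background with `&`, the `wait` line is commented out, and the cleanup step refers to `$WORKSDIR` rather than `$WORKDIR`. Whether any of this is related to the stall is not established in this thread. Purely as a sketch of the "run, wait, then clean up" ordering, a helper could look like the following. `run_and_cleanup` is a hypothetical name, not a SLURM or MATLAB command, and the commented-out MATLAB invocation is an untested assumption.

```bash
#!/bin/bash
# Sketch only: run_and_cleanup is a hypothetical helper, not part of SLURM or MATLAB.
# It creates a scratch directory, runs a command in the background, waits for it,
# and removes the scratch directory only after the command has finished.
run_and_cleanup() {
    local scratch_dir="$1"
    shift
    mkdir -p "$scratch_dir"

    "$@" &                       # start the long-running command in the background
    local pid=$!

    local status=0
    wait "$pid" || status=$?     # block until the command ends, keep its exit status

    rm -rf -- "$scratch_dir"     # cleanup happens strictly after the command is done
    return "$status"
}

# Possible use inside the batch script (assumption, not tested on the cluster):
# run_and_cleanup "$WORKDIR/$SLURM_JOB_ID" \
#     matlab -nodesktop -r "run('script_apply_BC.m'); exit"
```

With this ordering the batch script does not reach the `rm -rf` line, or its own end, while the command is still running.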
At the beginning (the first few hours) the job runs fine. The size of pardim is 420. Once roughly 250 of the iterations have been processed, the procedure slows down and finally does not "continue", i.e. the job is still running but no longer produces output files. No problems are reported in the matlab_parfor.err file either. I do not know how to analyse the problem in this case.
Any ideas?
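One rough way to see how far the loop got before stalling is to count the distinct indices printed by the `fprintf` line. The sketch below assumes that each progress line ends up in a log file (for example `matlab_parfor.out`) and holds only the iteration index. That assumption may not match how the worker output is actually captured here. `count_progress` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch only: count_progress is a hypothetical helper.
# Assumption: every progress line in the log contains just one integer
# (the parfor index), possibly surrounded by spaces or tabs.
count_progress() {
    local logfile="$1"
    # Keep lines with exactly one all-digit field, then count distinct values.
    awk 'NF == 1 && $1 ~ /^[0-9]+$/ { seen[$1] = 1 }
         END { n = 0; for (k in seen) n++; print n }' "$logfile"
}

# Possible use on the cluster (assumption about where the output goes):
# count_progress matlab_parfor.out
```

Running it a few minutes apart and comparing the counts against pardim (420 here) shows whether any iterations are still completing.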
  5 Comments
Patrick Laux on 9 Jul 2021
Unfortunately not, Simone. I just gave up.
If you find out more, I would be happy if you let me know.
Patrick
Simone Stünzi on 9 Jul 2021
I've increased idleTimeout to Inf and will let you know if that solves my issue.
Best, Simone


Answers (0)
