Why does my job on the cluster stop producing output?
Hey, I am using the Parallel Computing Toolbox on a Linux cluster (istan nodes and SLURM scheduler). The main routine (the parfor loop section) looks as follows. A 2-D array (MASKE) is used to select the time series that contain values, and the function core_eQM is applied to these time series: ...
cd $WORKDIR
pc = parcluster('local')
pc.JobStorageLocation = strcat('$WORKDIR/',getenv('SLURM_JOB_ID'))
% start the matlabpool with maximum available workers
% control how many workers by setting ntasks in your sbatch script
matlabpool(pc, getenv('SLURM_CPUS_ON_NODE'))
...
pardim=size(MASKE,2);
XX1=NaN(7305,pardim);
parfor ii=1:pardim
if ~isnan(MASKE(1,ii))
fprintf('%i \t \n', ii); %shows me progress of job (creates files, which are empty)
if meth == 1;
xx1 = core_eQM(squeeze(VARIABLE_BSE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,1:3653)),squeeze(WRF_VARIABLE(ii,jj,3654:7305)))
elseif meth == 2
...
end
end
end
I use the following script to submit the job to the cluster:
#!/bin/bash
#SBATCH --job-name=imat_par_test
#SBATCH --output=matlab_parfor.out
#SBATCH --error=matlab_parfor.err
#SBATCH --partition=ivy
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=20
source /etc/profile.d/00-modules.sh
module load app/matlab2014b
cd $WORKDIR
# Create a local work directory
mkdir -p $WORKDIR/$SLURM_JOB_ID
#cd $WORKSDIR/$SLURM_JOB_ID
# Kick off matlab
matlab -nodesktop < script_apply_BC.m &
#wait
# Cleanup local work directory
rm -rf $WORKSDIR/$SLURM_JOB_ID
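Two details in the script above may be worth checking, offered as a sketch rather than a confirmed diagnosis. MATLAB is started in the background with `&` while `wait` is commented out, so the script proceeds straight to the cleanup line, and that line uses `$WORKSDIR` rather than the `$WORKDIR` used elsewhere. The self-contained example below shows the pattern of waiting for a background process before removing its work directory. The function `long_task` is a hypothetical stand-in for the MATLAB call, and a temporary directory stands in for `$WORKDIR/$SLURM_JOB_ID`.

```shell
#!/bin/bash
# Sketch only: long_task is a hypothetical stand-in for the MATLAB call,
# and the temporary directory stands in for $WORKDIR/$SLURM_JOB_ID.
set -u  # abort on unset variables, which catches typos such as $WORKSDIR

workdir="$(mktemp -d)"
jobdir="$workdir/job_12345"
mkdir -p "$jobdir"

long_task() {
    # Pretend to work, then write a result into the job directory.
    sleep 1
    echo "done" > "$jobdir/result.txt"
}

long_task &          # start in the background, as the original script does
task_pid=$!

wait "$task_pid"     # block until the background task has finished
task_status=$?

# Read the result before the work directory is removed
result="$(cat "$jobdir/result.txt")"
echo "task exit status: $task_status, result: $result"

# Cleanup happens only after the task is done
rm -rf "$jobdir"
```

In the submit script the same idea would amount to re-enabling `wait` (or dropping the `&`) and correcting the variable name in the `rm -rf` line. Whether this is related to the stall described below is not established.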
At the beginning (the first few hours) the job runs fine. The size of pardim is 420. After roughly 250 iterations, the procedure slows down and finally stops making progress, i.e. the job is still listed as running but no longer produces output files. No problems are reported in the matlab_parfor.err file. I do not know how to analyse the problem in this case.
Any ideas?
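One way to notice such a stall without watching the job by hand is to check how long ago an output file was last modified. The following is a minimal sketch, assuming a hypothetical helper `is_stale`. The 60-minute threshold is arbitrary, and the demonstration uses temporary files instead of the real job output.

```shell
#!/bin/bash
# is_stale FILE MINUTES: succeeds (status 0) if FILE exists and was last
# modified more than MINUTES minutes ago; returns 2 if FILE does not exist.
# Hypothetical helper, not part of SLURM or MATLAB.
is_stale() {
    local file="$1" minutes="$2"
    [ -e "$file" ] || return 2
    # find prints the path only when the modification time is older than the threshold
    [ -n "$(find "$file" -mmin "+$minutes" -print)" ]
}

# Self-contained demonstration with temporary files
demo_dir="$(mktemp -d)"
touch "$demo_dir/fresh.out"                  # modified just now
touch -t 200001010000 "$demo_dir/old.out"    # modification time set to the year 2000

is_stale "$demo_dir/fresh.out" 60; fresh_status=$?
is_stale "$demo_dir/old.out" 60;   old_status=$?
echo "fresh: $fresh_status, old: $old_status"

rm -rf "$demo_dir"
```

On the cluster this could be called as `is_stale matlab_parfor.out 60` from a login node. If the file is stale while `squeue` still lists the job as running, that matches the symptom described above, although it does not by itself explain the cause.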
5 comments
Simone Stünzi
9 July 2021
I've increased idleTimeout to Inf and will let you know if that solves my issue.
Best, Simone
Answers (0)