Submitting parallel compiled code to SLURM

12 views (last 30 days)
Leos Pohl on 22 Jul 2021
Answered: Raymond Norris on 22 Jul 2021
I have a compiled application that I am trying to run with SLURM across several nodes. The application itself is parallel, using a local parpool. When I submit it with sbatch and the SLURM script below, the parallel pool does not get created and I get errors and output (see the attached files). When I open an interactive shell and execute the same commands on each node by hand, everything works as expected. Interestingly, even if I use only a single node and a single srun command in the SLURM script, I still get an error, so I am not sure which processes are racing to write to those files; I suspect that error is of a different nature.
When I remove the lines that set MCR_CACHE_ROOT, I get a slightly different error:
Error using parpool (line 113)
Invalid default value for property 'ParallelNode' in class 'parallel.internal.settings.ParallelSettingsTree':
No value is set for setting 'PCTVersionNumber' at the any level.
Error in run_getIllumination (line 28)
MATLAB:settings:config:UndefinedSettingValueForLevel
srun: error: ec64: task 2: Exited with exit code 255
Error using parpool (line 113)
Parallel pool failed to start with the following error.
Error in run_getIllumination (line 28)
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line 676)
Failed to locate and destroy old interactive jobs.
Error using parallel.Cluster/findJob (line 74)
The job storage metadata file '/lustre/fs0/home/lpohl/.mcrCache9.5/run_ge0/local_cluster_jobs/R2018b/matlab_metadata.mat' does not exist or is corrupt. For assistance recovering job
data, contact MathWorks Support Team. Otherwise, delete all files in the JobStorageLocation and try again.
parallel:cluster:PoolCreateFailed
Error using parpool (line 113)
Parallel pool failed to start with the following error.
Can someone help out?
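The "matlab_metadata.mat does not exist or is corrupt" error is consistent with several copies of the application sharing one MCR_CACHE_ROOT, and therefore one local-cluster JobStorageLocation: with --ntasks-per-node=30, a bare `srun run_getIllumination ...` launches 30 independent copies of the compiled application, and every copy deploys into, and opens its local pool out of, the same cache directory. One way to rule that out is to give each task a private cache via a wrapper script, since an export in the batch script itself is evaluated only once, not once per task. The wrapper below is a hypothetical sketch (the name `mcr_task_wrapper.sh` and the /tmp layout are assumptions, not from the original post):

```shell
#!/bin/bash
# mcr_task_wrapper.sh (hypothetical helper): launched BY srun, once per
# task, so SLURM_PROCID differs in every copy and each copy gets a
# private MCR cache (and hence a private local_cluster_jobs directory).
export MCR_CACHE_ROOT="/tmp/mcr_cache_${USER}/${SLURM_JOB_ID:-0}.${SLURM_PROCID:-0}"
mkdir -p "$MCR_CACHE_ROOT"
# Hand off to whatever command follows, e.g. the compiled application:
#   srun mcr_task_wrapper.sh run_getIllumination constants.txt flatlon tp_grid1.txt
exec "$@"
```

The `exec "$@"` replaces the wrapper with the real application, so the per-task environment is inherited and no extra process lingers.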
#!/bin/bash
#SBATCH -A user
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=30
#SBATCH --time=00:20:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
#SBATCH --nodelist=ec64
#Select File to run
export file="run_getIllumination"
export args="~/IlluminationModel/matlab_code/constants.txt"
#Select how logs get stored
mkdir -p "$SLURM_JOB_ID"
export debug_logs="$SLURM_JOB_ID/job_$SLURM_JOB_ID.log"
export benchmark_logs="$SLURM_JOB_ID/benchmark_$SLURM_JOB_ID.log"
#Load Modules
module load matlab/matlab-R2018b
# Enter Working Directory
cd $SLURM_SUBMIT_DIR
# Create Log File
echo $SLURM_SUBMIT_DIR
echo "JobID: $SLURM_JOB_ID" >> $debug_logs
echo "Running on $SLURM_NODELIST" >> $debug_logs
echo "Running on $SLURM_NNODES nodes." >> $debug_logs
echo "Running on $SLURM_NPROCS processors." >> $debug_logs
echo "Current working directory is `pwd`" >> $debug_logs
# Module debugging
module list >> $debug_logs
date >> $benchmark_logs
echo "ulimit -l: " >> $benchmark_logs
ulimit -l >> $benchmark_logs
export MCR_CACHE_ROOT="/tmp/mcr_cache_root_$USER"
mkdir -p $MCR_CACHE_ROOT
# Run job
#ls -d ~/IlluminationModel/matlab_code/*.txt | egrep 'tp_grid[0-9]+?\.txt$' | xargs -I {} srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon {}
srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid1.txt
#srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid2.txt
#srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid3.txt
echo "Program is finished with exit code $? at: `date`"
sleep 3
date >> $benchmark_logs
echo "ulimit -l: " >> $benchmark_logs
ulimit -l >> $benchmark_logs
mv job.$SLURM_JOB_ID.err $SLURM_JOB_ID/
mv job.$SLURM_JOB_ID.out $SLURM_JOB_ID/
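If the intended design is one copy of the application per node, with the parallelism coming from the application's own local parpool rather than from SLURM tasks, a common pattern is to request a single task with many CPUs instead of many tasks. This is a sketch under that assumption, not a verified fix for this cluster:

```shell
# Sketch (assumption: the app should run once per node and its own
# local parpool supplies the parallelism). One task with 30 CPUs,
# instead of 30 tasks that each start their own pool:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=30

# A plain srun now launches exactly one copy of the compiled application:
srun run_getIllumination ~/IlluminationModel/matlab_code/constants.txt flatlon ~/IlluminationModel/matlab_code/tp_grid1.txt
```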

Answers (1)

Raymond Norris on 22 Jul 2021
