Why do I get MPI_Abort errors when trying to submit a parallel job?

Paul Zhang on 23 May 2014
Commented: Edric Ellis on 23 May 2014
The core of my job submission code is below:
jopt.email_notif = 0;
jopt.toggleleft = left_list(j);
jopt.toggleCausalDir = dir_list(k);
jopt.toggleChoice = choice(l);
jopt.od_number = od_list(i);
jopt.connectivity = 1;
sched = findResource('scheduler', 'configuration', 'NeuroEcon.local')
set(sched,'SubmitArguments', '-l walltime=0:20:00')
pjob = createParallelJob(sched);
set(pjob, 'FileDependencies', {'multiDCMset1.m'})
set(pjob, 'MaximumNumberOfWorkers', 1)
set(pjob, 'MinimumNumberOfWorkers', 1)
t = createTask(pjob, @multiDCMset1, 1, {jopt})
t_all{1,jj}=t; jj=jj+1;
submit(pjob);
---------------------------------------
The following is the error message I get in the job submission log, after the job finishes running. I don't understand the error or what could cause it. I do know that the same script runs fine on another person's computer. Do I need some specific settings to submit parallel jobs?
------------------
Node file: /opt/torque/aux//2075983.neuroecon.caltech.edu
Starting SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -s -phrase MATLAB -port 25983
All SMPDs launched
"/opt/matlab//bin/mw_mpiexec" -phrase MATLAB -port 25983 -l -n 1
-machinefile /opt/torque/aux//2075983.neuroecon.caltech.edu -genvlist
MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,MDCE_DEBUG
"/opt/matlab/bin/worker" -parallel
[0]which: no shopt in
(/opt/matlab/bin:/usr/kerberos/bin:/usr/java/latest/bin:/opt/intel/itac/7.1/bin:/opt/intel/fce/10.1.018/bin:/opt/intel/idbe/10.1.018/bin:/opt/intel/cce/10.1.018/bin:/usr/local/bin:/bin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/openmpi/bin/:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin)
[0] < M A T L A B (R) >
[0] Copyright 1984-2009 The MathWorks, Inc.
[0] Version 7.8.0.347 (R2009a) 64-bit (glnxa64)
[0] February 12, 2009
[0]
[0] To get started, type one of these: helpwin, helpdesk, or demo.
[0] For product information, visit www.mathworks.com.
[0]
job aborted:
rank: node: exit code[: error message]
0: compute-1-30: -2: application called MPI_Abort(MPI_COMM_WORLD, 42) -
process 0
Stopping SMPD on compute-1-30 ...
ssh compute-1-30 "/opt/matlab//bin/mw_smpd" -shutdown -phrase MATLAB -port
25983
Exiting with code: 42
1 Comment
Edric Ellis on 23 May 2014
Is there any error in the task of the job? Check using:
pjob.Tasks(1).Error
or even
getReport(pjob.Tasks(1).Error)
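As a rough sketch of the same check in the get/set style used in the question: the snippet below waits for the job to finish and prints any error recorded on its tasks. It assumes pjob is the job object created in the submission code above and is still in the workspace. It reads the ErrorIdentifier and ErrorMessage task properties, which hold the error as plain strings; whether the Error property is also available may depend on the release, so treat this as an alternative rather than a replacement.

```matlab
% Sketch only: assumes pjob is the parallel job submitted in the question.
waitForState(pjob, 'finished');        % block until the scheduler reports completion

tasks = get(pjob, 'Tasks');            % all task objects belonging to the job
for n = 1:numel(tasks)
    msg = get(tasks(n), 'ErrorMessage');
    if isempty(msg)
        fprintf('Task %d: no error recorded\n', n);
    else
        % ErrorIdentifier may be empty if the error had no identifier
        fprintf('Task %d failed (%s): %s\n', ...
            n, get(tasks(n), 'ErrorIdentifier'), msg);
    end
end
```

If a task reports an error here, it usually says more about the cause than the MPI_Abort line in the scheduler log does.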


Answers (0)
