Failing to run a reinforcement learning job on the cluster

Ahmad Momani on 14 May 2024
Commented: Edric Ellis on 15 May 2024
I have a custom reinforcement learning environment in which I train an agent using the SAC algorithm. Training runs smoothly on my four-core desktop, but my attempts to speed it up on the university cluster have failed. The job properties and the scheduler debug log are below. Can this issue be resolved?
>> (jobRL5)
jobRL5 =
  Job
    Properties:
      ID: 103
      Type: pool
      Username: amomani1
      State: failed
      SubmitDateTime: 13-May-2024 18:25:52
      StartDateTime: 13-May-2024 18:27:12
      RunningDuration: 0 days 13h 39m 28s
      NumWorkersRange: [11 11]
      NumThreads: 2
      AutoAttachFiles: true
      Auto Attached Files: List files
      AttachedFiles: R:\amomani1\matlabcodes_SI_2023a\talbot_inversion.m
                     R:\amomani1\matlabcodes_SI_2023a\talbot_inversion2.m
                     R:\amomani1\matlabcodes_SI_2023a\talbotcode.m
      AutoAddClientPath: true
      AdditionalPaths: \\lightning.bu.binghamton.edu\matlab\nonshared\23a\IntegrationScripts\spiedie
                       \\lightning.bu.binghamton.edu\matlab\nonshared\23a
                       C:\Users\amomani1\Documents\MATLAB
                       C:\Users\amomani1\AppData\Local\Temp\8\Editor_retgg
      FileStore: [1x1 parallel.FileStore]
      ValueStore: [1x1 parallel.ValueStore]
      EnvironmentVariables: {}
    Associated Tasks:
      Number Pending: 0
      Number Running: 0
      Number Finished: 11
      Task ID of Errors: []
      Task ID of Warnings: []
      Task Scheduler IDs: 4857192
>> c.getDebugLog(jobRL5)
LOG FILE OUTPUT:
Node file: compute[078,162]
Starting SMPD on compute078 compute162 ...
srun --ntasks-per-node=1 --ntasks=2 /cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -debug 0 &
Checking that SMPD processes are running (Attempt 1 of 60)
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute078 > /dev/null 2>&1
No SMPD process running on compute078
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute162 > /dev/null 2>&1
No SMPD process running on compute162
Checking that SMPD processes are running (Attempt 2 of 60)
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute078 > /dev/null 2>&1
SMPD process found running on compute078
/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -phrase MATLAB -port 27192 -status compute162 > /dev/null 2>&1
SMPD process found running on compute162
All SMPDs launched
Machine args: -hosts 2 compute078 6 compute162 6
"/cm/shared/apps/Mathworks-MPS/2023a/bin/mw_mpiexec" -smpd -phrase MATLAB -port 27192 -l -hosts 2 compute078 6 compute162 6 -genvlist PARALLEL_SERVER_DECODE_FUNCTION,PARALLEL_SERVER_STORAGE_LOCATION,PARALLEL_SERVER_STORAGE_CONSTRUCTOR,PARALLEL_SERVER_JOB_LOCATION,PARALLEL_SERVER_DEBUG,PARALLEL_SERVER_LICENSE_NUMBER,MLM_WEB_LICENSE,MLM_WEB_USER_CRED,MLM_WEB_ID,TZ,MDCE_DECODE_FUNCTION,MDCE_STORAGE_LOCATION,MDCE_STORAGE_CONSTRUCTOR,MDCE_JOB_LOCATION,MDCE_DEBUG,MDCE_LICENSE_NUMBER "/cm/shared/apps/Mathworks-MPS/2023a/bin/worker" -parallel
job aborted:
rank: node: exit code[: error message]
0: compute078: -2
1: compute078: -2
2: compute078: -2
3: compute078: -2
4: compute078: -2
5: compute078: -2
6: compute162: -2
7: compute162: -2
8: compute162: -2
9: compute162: -2
10: compute162: -2
11: compute162: 1: process 11 exited without calling finalize
Stopping SMPD ...
srun --ntasks-per-node=1 --ntasks=2 /cm/shared/apps/Mathworks-MPS/2023a/bin/mw_smpd -shutdown -phrase MATLAB -port 27192
Exiting with code: 123
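
The exit-code table in the log above can also be summarised programmatically, which may help when a job spans many more ranks. The following is a minimal MATLAB sketch, not part of the original post. It parses three lines copied from the table, using the "rank: node: exit code[: error message]" layout printed in the log, and lists the ranks that reported a message. The variable names are illustrative, and treating the rank that carries a message as the first place to look is an assumption rather than documented behaviour.

```matlab
% A few lines copied from the exit-code table in the log above.
tableLines = {
    '0: compute078: -2'
    '6: compute162: -2'
    '11: compute162: 1: process 11 exited without calling finalize'
    };

% Each line has the form "rank: node: exit code[: error message]".
% The last group captures whatever follows the exit code (possibly nothing).
pattern = '^(\d+):\s*([^:\s]+):\s*(-?\d+)(.*)$';
tokens = regexp(tableLines, pattern, 'tokens', 'once');

ranks = cellfun(@(t) str2double(t{1}), tokens);
nodes = cellfun(@(t) t{2}, tokens, 'UniformOutput', false);
codes = cellfun(@(t) str2double(t{3}), tokens);

% Strip the leading ": " from the optional message part.
messages = cellfun(@(t) regexprep(t{4}, '^:\s*', ''), tokens, ...
    'UniformOutput', false);

% Print only the ranks that reported a message.
hasMessage = ~cellfun(@isempty, messages);
for k = find(hasMessage).'
    fprintf('rank %d on %s exited with code %d: %s\n', ...
        ranks(k), nodes{k}, codes(k), messages{k});
end
```

In a live session, the lines could instead be taken from the text returned by getDebugLog(c, jobRL5), split on newlines and filtered to those that match the pattern.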
1 Comment
Edric Ellis on 15 May 2024
It looks like you aren't getting as far as running any sort of job on the cluster. Contact MathWorks support; they can help sort out this kind of problem.
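
Before contacting support, it may be worth collecting the per-task details and the scheduler log in one place, since the job display reports no task errors even though the job state is failed. The following is a rough sketch, not something from the thread. It assumes that c and jobRL5 are the cluster and job objects used in the question, and the output file name jobRL5_debug.log is a made-up placeholder.

```matlab
% Assumption: c and jobRL5 are the cluster and job objects from the question.
% In a new session they could be recreated with, for example,
%   c = parcluster('myProfile');      % 'myProfile' is a placeholder name
%   jobRL5 = findJob(c, 'ID', 103);

% Print the state and any error message recorded for each task.
tasks = jobRL5.Tasks;
for k = 1:numel(tasks)
    t = tasks(k);
    fprintf('Task %d: state = %s\n', t.ID, t.State);
    if ~isempty(t.ErrorMessage)
        fprintf('  error: %s\n', t.ErrorMessage);
    end
end

% Save the scheduler debug log so it can be attached to a support request.
logText = getDebugLog(c, jobRL5);
fid = fopen('jobRL5_debug.log', 'w');   % hypothetical file name
assert(fid ~= -1, 'Could not open the log file for writing.');
fprintf(fid, '%s', logText);
fclose(fid);
```

If every task shows an empty error message, as the job display suggests, the failure probably happened outside the task code itself, which would be consistent with the comment above.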


Answers (0)
