Parpool consistently failing to initialize
I'm consistently running into problems getting parpool to initialize on Linux clusters. These systems typically have 39 to 128 idle cores and 76 GB to 4.5 TB of free RAM. Sometimes I can launch a parpool with 128 workers; other times I can't start one with as few as 4 workers. I've been using MATLAB R2019a and R2018b. Any ideas?
>> n=16; %number of workers you want
>> setlocal = parcluster('local');
>> setlocal.NumWorkers = n;
>> parpool(setlocal);
Starting parallel pool (parpool) using the 'local' profile ...
Warning: The system time zone setting, 'Navajo', does not specify a single time
zone unambiguously. It will be treated as 'America/Denver'. See the <a
href="matlab:doc('datetime.TimeZone')">datetime.TimeZone property</a> for
details about specifying time zones.
> In verifyTimeZone (line 34)
In datetime (line 543)
In parallel.internal.cluster.FileSerializer>iLoadDate (line 342)
In parallel.internal.cluster.FileSerializer/getFields (line 100)
In parallel.internal.cluster.CJSSupport/getProperties (line 260)
In parallel.internal.cluster.CJSSupport/getJobProperties (line 478)
In parallel.internal.cluster.CJSJobMixin/hGetProperty (line 85)
In parallel.internal.cluster.CJSJobMethods.setJobTerminalStateFromCluster (line 179)
In parallel.internal.cluster.CJSJobMixin/hSetTerminalStateFromCluster (line 116)
In parallel.cluster.CJSCluster/hGetJobState (line 401)
In parallel.internal.cluster.CJSJobMixin/getStateEnum (line 159)
In parallel.Job/get.StateEnum (line 238)
In parallel.Job/get.State (line 230)
In parallel.internal.customattr.CustomGetSet>iVectorisedGetHelper (line 128)
In parallel.internal.customattr.CustomGetSet>@(a,b,c)iVectorisedGetHelper(obj,a,b,c) (line 102)
In parallel.internal.customattr.CustomGetSet/doVectorisedGet (line 103)
In parallel.internal.customattr.CustomGetSet/hVectorisedGet (line 76)
In parallel.internal.customattr.GetSetImpl>iAccessProperties (line 322)
In parallel.internal.customattr.GetSetImpl>iGetAllPropertiesVec (line 264)
In parallel.internal.customattr.GetSetImpl.getImpl (line 133)
In parallel.internal.customattr.CustomGetSet>iHetFunGetFunction (line 154)
In parallel.internal.customattr.CustomGetSet>@(o)iHetFunGetFunction(o,props) (line 139)
In parallel.internal.cluster.hetfun (line 46)
In parallel.internal.customattr.CustomGetSet>iHetFunGetProperty (line 139)
In parallel.internal.customattr.CustomGetSet/get (line 38)
In parallel.internal.pool.InteractiveClient/pRemoveOldJobs (line 474)
In parallel.internal.pool.InteractiveClient/start (line 315)
In parallel.Pool>iStartClient (line 796)
In parallel.Pool.hBuildPool (line 585)
In parallel.internal.pool.doParpool (line 18)
In parallel.Cluster/parpool (line 71)
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: corrupted double-linked list: 0x00007f3c402e1bc0 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c402b1e00 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c40249390 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c40238380 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: double free or corruption (!prev): 0x00007f3c40238380 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c40019530 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f3c40019110 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3211a75e5e]
/lib64/libc.so.6[0x3211a78cf0]
/usr/local/matlab/R2018b/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so(+0x5dccb9)[0x7f3c471bccb9]
/lib64/libc.so.6(exit+0xe2)[0x3211a35a02]
/usr/local/matlab/R2018b/bin/glnxa64/libtbb.so.2(+0x1cb1a)[0x7f3c6f2cbb1a]
/usr/local/matlab/R2018b/bin/glnxa64/libtbb.so.2(+0x1c5ce)[0x7f3c6f2cb5ce]
/usr/local/matlab/R2018b/bin/glnxa64/libtbb.so.2(+0x1c5a6)[0x7f3c6f2cb5a6]
/lib64/libpthread.so.0[0x3211e07aa1]
/lib64/libc.so.6(clone+0x6d)[0x3211ae8c4d]
======= Memory map: ========
00400000-0040e000 r-xp 00000000 00:23 419417053 /usr/local/matlab/R2018b/bin/glnxa64/MATLAB
0060d000-0060e000 r--p 0000d000 00:23 419417053 /usr/local/matlab/R2018b/bin/glnxa64/MATLAB
0060e000-0060f000 rw-p 0000e000 00:23 419417053 /usr/local/matlab/R2018b/bin/glnxa64/MATLAB
0206a000-0224a000 rw-p 00000000 00:00 0 [heap]
3211600000-3211620000 r-xp 00000000 08:03 14577410 /lib64/ld-2.12.so
3211820000-3211821000 r--p 00020000 08:03 14577410 /lib64/ld-2.12.so
3211821000-3211822000 rw-p 00021000 08:03 14577410 /lib64/ld-2.12.so
3211822000-3211823000 rw-p 00000000 00:00 0
3211a00000-3211b8b000 r-xp 00000000 08:03 14577415 /lib64/libc-2.12.so
3211b8b000-3211d8a000 ---p 0018b000 08:03 14577415 /lib64/libc-2.12.so
3211d8a000-3211d8e000 r--p 0018a000 08:03 14577415 /lib64/libc-2.12.so
3211d8e000-3211d90000 rw-p 0018e000 08:03 14577415 /lib64/libc-2.12.so
3211d90000-3211d94000 rw-p 00000000 00:00 0
3211e00000-3211e17000 r-xp 00000000 08:03 14577416 /lib64/libpthread-2.12.so
3211e17000-3212017000 ---p 00017000 08:03 14577416 /lib64/libpthread-2.12.so
3212017000-3212018000 r--p 00017000 08:03 14577416 /lib64/libpthread-2.12.so
3212018000-3212019000 rw-p 00018000 08:03 14577416 /lib64/libpthread-2.12.so
3212019000-321201d000 rw-p 00000000 00:00 0
3212200000-3212283000 r-xp 00000000 08:03 14577561 /lib64/libm-2.12.so
3212283000-3212482000 ---p 00083000 08:03 14577561 /lib64/libm-2.12.so
3212482000-3212483000 r--p 00082000 08:03 14577561 /lib64/libm-2.12.so
3212483000-3212484000 rw-p 00083000 08:03 14577561 /lib64/libm-2.12.so
3212600000-3212602000 r-xp 00000000 08:03 14577435 /lib64/libdl-2.12.so
3212602000-3212802000 ---p 00002000 08:03 14577435 /lib64/libdl-2.12.so
3212802000-3212803000 r--p 00002000 08:03 14577435 /lib64/libdl-2.12.so
3212803000-3212804000 rw-p 00003000 08:03 14577435 /lib64/libdl-2.12.so
3212a00000-3212a15000 r-xp 00000000 08:03 14577501 /lib64/libz.so.1.2.3
3212a15000-3212c14000 ---p 00015000 08:03 14577501 /lib64/libz.so.1.2.3
3212c14000-3212c15000 r--p 00014000 08:03 14577501 /lib64/libz.so.1.2.3
3212c15000-3212c16000 rw-p 00015000 08:03 14577501 /lib64/libz.so.1.2.3
3212e00000-3212e07000 r-xp 00000000 08:03 14577419 /lib64/librt-2.12.so
3212e07000-3213006000 ---p 00007000 08:03 14577419 /lib64/librt-2.12.so
3213006000-3213007000 r--p 00006000 08:03 14577419 /lib64/librt-2.12.so
3213007000-3213008000 rw-p 00007000 08:03 14577419 /lib64/librt-2.12.so
3214600000-3214602000 r-xp 00000000 08:03 6962157 /usr/lib64/libXau.so.6.0.0
3214602000-3214802000 ---p 00002000 08:03 6962157 /usr/lib64/libXau.so.6.0.0
3214802000-3214803000 rw-p 00002000 08:03 6962157 /usr/lib64/libXau.so.6.0.0
3214a00000-3214a24000 r-xp 00000000 08:03 6962578 /usr/lib64/libxcb.so.1.1.0
3214a24000-3214c24000 ---p 00024000 08:03 6962578 /usr/lib64/libxcb.so.1.1.0
3214c24000-3214c25000 rw-p 00024000 08:03 6962578 /usr/lib64/libxcb.so.1.1.0
3214e00000-3214f37000 r-xp 00000000 08:03 6962601 /usr/lib64/libX11.so.6.3.0
3214f37000-3215137000 ---p 00137000 08:03 6962601 /usr/lib64/libX11.so.6.3.0
3215137000-321513d000 rw-p 00137000 08:03 6962601 /usr/lib64/libX11.so.6.3.0
3215200000-3215211000 r-xp 00000000 08:03 6963060 /usr/lib64/libXext.so.6.4.0
3215211000-3215411000 ---p 00011000 08:03 6963060 /usr/lib64/libXext.so.6.4.0
3215411000-3215412000 rw-p 00011000 08:03 6963060 /usr/lib64/libXext.so.6.4.0
3218e00000-3218e04000 r-xp 00000000 08:03 14577556 /lib64/libuuid.so.1.3.0
3218e04000-3219003000 ---p 00004000 08:03 14577556 /lib64/libuuid.so.1.3.0
3219003000-3219004000 rw-p 00003000 08:03 14577556 /lib64/libuuid.so.1.3.0
3219200000-321920c000 r-xp 00000000 08:03 14578841 /lib64/libpam.so.0.82.2
321920c000-321940c000 ---p 0000c000 08:03 14578841 /lib64/libpam.so.0.82.2
321940c000-321940d000 r--p 0000c000 08:03 14578841 /lib64/libpam.so.0.82.2
321940d000-321940e000 rw-p 0000d000 08:03 14578841 /lib64/libpam.so.0.82.2
3219e00000-3219e07000 r-xp 00000000 08:03 6962154 /usr/lib64/libSM.so.6.0.1
3219e07000-321a007000 ---p 00007000 08:03 6962154 /usr/lib64/libSM.so.6.0.1
321a007000-321a008000 rw-p 00007000 08:03 6962154 /usr/lib64/libSM.so.6.0.1
321aa00000-321aa17000 r-xp 00000000 08:03 6961964 /usr/lib64/libICE.so.6.3.0
321aa17000-321ac17000 ---p 00017000 08:03 6961964 /usr/lib64/libICE.so.6.3.0
321ac17000-321ac18000 rw-p 00017000 08:03 6961964 /usr/lib64/libICE.so.6.3.0
321ac18000-321ac1c000 rw-p 00000000 00:00 0
3220200000-3220207000 r-xp 00000000 08:03 14577723 /lib64/libcrypt-2.12.so
3220207000-3220407000 ---p 00007000 08:03 14577723 /lib64/libcrypt-2.12.so
3220407000-3220408000 r--p 00007000 08:03 14577723 /lib64/libcrypt-2.12.so
3220408000-3220409000 rw-p 00008000 08:03 14577723 /lib64/libcrypt-2.12.so
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c26c710 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c2f1da0 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c2a4ac0 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c2a1500 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: double free or corruption (!prev): 0x00007f066c2a1500 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c2973b0 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: double free or corruption (!prev): 0x00007f066c2a1500 ***
*** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: free(): corrupted unsorted chunks: 0x00007f066c245a70 ***
*** glibc detected *** *** glibc detected *** /usr/local/matlab/R2018b/bin/glnxa64/MATLAB: double free or corruption (!prev): 0x00007f066c2a1500 ***
Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile 'local' in the Cluster Profile Manager.
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
676)
Failed to initialize the interactive session.
Error using
parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus
(line 790)
The interactive communicating job failed with no message.
6 Comments
Frank Schluenzen
2 Dec 2020
Edited: Frank Schluenzen
3 Dec 2020
I'm having the same problem with R2020b (and releases after R2019a) on CentOS 7. I also see the glibc errors, but validating the local parallel pool pointed more towards TBB threads running out of memory. The only places I could find references to heap settings were ~/.matlab/R2020b/matlab.prf and, indirectly, ~/.matlab/R2020b/toolbox_cache-9.9.0-34542918-glnxa64.xml.
After a lot of fiddling with stack and heap settings, I finally just removed ~/.matlab/R2020b/toolbox_cache-9.9.0-34542918-glnxa64.xml before starting MATLAB, and could then consistently use all 40+ available cores on several different machines. Disabling toolbox path caching in the MATLAB General preferences seems to have the same effect and is of course the better choice. I have no idea why that helps, but it at least seems to work for R2020b on CentOS 7.
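To see whether a toolbox path cache file like the one mentioned above exists on a given machine, something along these lines may help. It is only a sketch: the wildcard pattern is an assumption based on the single file name quoted in this comment, and the code just lists matches without deleting anything.

```matlab
% Sketch: locate toolbox path cache files in the preferences folder.
% prefdir returns the release-specific preferences folder
% (for example ~/.matlab/R2020b on Linux).
% The pattern below is an assumption derived from the file name
% quoted in the comment; other releases may name the file differently.
cacheFiles = dir(fullfile(prefdir, 'toolbox_cache-*.xml'));

if isempty(cacheFiles)
    fprintf('No toolbox cache file found in %s\n', prefdir);
else
    for k = 1:numel(cacheFiles)
        fprintf('Found cache file: %s (%d bytes)\n', ...
            fullfile(cacheFiles(k).folder, cacheFiles(k).name), ...
            cacheFiles(k).bytes);
    end
end
```

The comment removed the file before starting MATLAB, so any deletion is probably best done from the shell with MATLAB closed rather than from the running session.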
Frank Schluenzen
4 Dec 2020
MathWorks recommendation: ulimit -u 63536. This works with the toolbox path cache enabled.
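A quick way to check which process limit the current MATLAB session actually inherited is sketched below. It assumes bash is installed (other shells spell this command differently) and uses the 63536 figure exactly as quoted in the comment. Child processes inherit the limit from their parent, so the value has to be raised in the shell that launches MATLAB; this snippet only reports it.

```matlab
% Sketch: report the per-user process limit (ulimit -u) seen by MATLAB.
% bash is invoked explicitly so the result does not depend on the
% user's login shell. Assumes bash is available on the system.
[status, out] = system('bash -c "ulimit -u"');
limitStr = strtrim(out);

recommended = 63536;  % value quoted in the comment above

if status ~= 0
    warning('Could not query the process limit: %s', limitStr);
elseif strcmp(limitStr, 'unlimited')
    fprintf('Process limit: unlimited\n');
else
    fprintf('Process limit (ulimit -u): %s\n', limitStr);
    if str2double(limitStr) < recommended
        fprintf('Below %d; raise it in the shell before starting MATLAB.\n', ...
            recommended);
    end
end
```

If the reported value is low, running ulimit -u 63536 in the same shell session before launching MATLAB is the step the comment describes. Whether a non-root user may raise it that far depends on the hard limit configured on the cluster.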
Answers (1)
Raghav Bansal
26 Nov 2024
Hi Ross,
Based on your error, I assume that you are using Simulia iSight. I think this issue can be resolved by unchecking the following setting in iSight: "Use iSight JRE for Matlab application".
If that does not help, you can try the following debugging steps:
1) Look for possible permission issues, especially with the job storage location. Process Monitor can be helpful for finding potential issues.
2) Try disabling MPI, as discussed in the following MATLAB Answers post: https://www.mathworks.com/matlabcentral/answers/196549-failed-to-start-a-parallel-pool-in-matlab2015a
3) Try clearing out the local scheduler files, as discussed in the following MATLAB Answers Post: https://www.mathworks.com/matlabcentral/answers/92445-why-do-i-receive-errors-when-using-the-local-scheduler-in-the-parallel-computing-toolbox
4) Check if any logs containing helpful diagnostic information were created. The location of the logs can be found using the following two commands:
c = parcluster();
c.JobStorageLocation
5) Add the following two lines of code to trigger additional information to be printed to the console and to be added to the log files.
setSchedulerMessageHandler(@disp);
setenv('MDCE_DEBUG','true')
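Steps 1, 3 and 4 can be roughed out in a few lines for the 'local' profile. This is a sketch, not the procedure from the linked posts: the write probe is just one simple way to test permissions, and delete(c.Jobs) permanently removes stored job data, so keep anything you still need before running it.

```matlab
% Sketch of steps 1, 3 and 4 for the 'local' profile.
c = parcluster('local');
loc = c.JobStorageLocation;
fprintf('Job storage location: %s\n', loc);

% Step 1: confirm the folder is writable by creating and removing
% a temporary probe file inside it.
probe = [tempname(loc) '.tmp'];
fid = fopen(probe, 'w');
if fid == -1
    warning('Cannot write to %s; check permissions.', loc);
else
    fclose(fid);
    delete(probe);
end

% Step 4: count the files currently stored there; any logs written
% by earlier pool attempts would be among them.
listing = dir(loc);
listing = listing(~[listing.isdir]);
fprintf('%d file(s) in the job storage location\n', numel(listing));

% Step 3: clear out old jobs held by the local scheduler.
% WARNING: this deletes job data permanently.
delete(gcp('nocreate'));   % shut down any open pool first
delete(c.Jobs);
```

After this, retrying parpool(c) with a small worker count shows whether stale job data was part of the problem.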
I hope this helps.
0 Comments