Inconsistent lost connection with worker error
I have a program which runs an spmd code block. At the end of the block, I have each worker save their workspace to file. Sometimes I get the following error:
The client lost connection to worker #. This might be due to network problems, or the interactive communicating job might have errored.
Based on printed output from my code, I know that the error is most likely occurring in the portion that saves the workspace, after the rest of the program has executed.
This error does not always happen, however. I find it generally happens more often when the workers are trying to save larger files, but not always. I can run the same code twice and once it will error and once it will not. I am running the code on a server, so I'm not sure if the memory demands on the server might be contributing (if it's a memory issue). Any thoughts?
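Since the failure is intermittent, one way to narrow it down is to wrap the save in a small helper that catches the error, logs its identifier, and checks that the file can be read back. The following is only a sketch: `saveAndVerify`, its arguments, and the one-second retry delay are hypothetical names and values chosen for illustration, not part of the original program.

```matlab
function ok = saveAndVerify(fileName, data, maxAttempts)
% Hypothetical helper: save DATA to FILENAME, then confirm the file loads.
% Returns true on success, false if every attempt failed.
ok = false;
for attempt = 1:maxAttempts
    try
        save(fileName, 'data');
        check = load(fileName, 'data');   % errors if the file is unreadable
        ok = isfield(check, 'data');
    catch err
        % Log the identifier so intermittent failures can be compared later.
        fprintf('Attempt %d failed: %s (%s)\n', ...
            attempt, err.message, err.identifier);
    end
    if ok
        return
    end
    pause(1);   % arbitrary delay before retrying (assumption)
end
end
```

Calling the helper from inside the spmd block (for example `ok = saveAndVerify(fileName, localResult, 3);`) also avoids calling `save` directly in the spmd body. Note that reading the file back temporarily needs extra memory, and that a single variable larger than 2 GB requires the `'-v7.3'` flag to `save`.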
EDIT:
Because the processes exchange messages frequently in the spmd block, the files are likely being written simultaneously. I wonder if, with these larger files, there's a higher probability of writing to the same disk space and creating corrupt files (often the .mat files exist but cannot be read). Perhaps forcing the program to save sequentially will help?
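A minimal sketch of what saving sequentially could look like, assuming a release such as R2017b where `labindex`, `numlabs`, and `labBarrier` are available inside spmd. The function names, the output directory, and the placeholder data are hypothetical, and whether serializing the writes actually removes the error is untested here.

```matlab
function saveSequentiallyDemo()
% Hypothetical demo: each worker writes its own file, one worker at a time.
outDir = tempdir;   % assumption: a directory every worker can write to
spmd
    localResult = labindex * ones(3);   % placeholder for the real per-worker data
    for turn = 1:numlabs
        if labindex == turn
            fileName = fullfile(outDir, sprintf('worker_%d.mat', labindex));
            saveWorkerData(fileName, localResult);
        end
        labBarrier;   % every worker waits here until the current writer is done
    end
end
end

function saveWorkerData(fileName, data)
% save cannot be called directly in an spmd body (transparency), so wrap it.
save(fileName, 'data');
end
```

Each worker reaches `labBarrier` once per turn, so only one file is being written at any moment. The cost is that the total save time becomes the sum of the individual save times rather than the maximum.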
EDIT:
I also get the following message when it fails to write the files:
message with properties:
Identifier: 'MATLAB:connector:connector:ConnectorNotRunning'
Arguments: {}
Answers (2)
Jiannan Zhou
25 Aug 2018
I encountered exactly the same problem on R2017b, using the Parallel Computing Toolbox and the save function.
Manhui Wang
8 Feb 2019
I see a similar problem with R2017b:
message with properties:
Identifier: 'MATLAB:connector:connector:ConnectorNotRunning'
Arguments: {}
but it appears to work fine with R2018a.