Inconsistent lost connection with worker error

rp, 7 Feb 2018
Answered: Manhui Wang, 8 Feb 2019
I have a program which runs an spmd code block. At the end of the block, I have each worker save their workspace to file. Sometimes I get the following error:
The client lost connection to worker #. This might be due to network problems, or the interactive communicating job might have errored.
Based on printed output from my code, I know that the error most likely occurs near the portion that saves the workspace, after the rest of the program has executed.
This error does not always happen, however. I find it generally happens more often when the workers are trying to save larger files, but not always. I can run the same code twice and once it will error and once it will not. I am running the code on a server, so I'm not sure if the memory demands on the server might be contributing (if it's a memory issue). Any thoughts?
EDIT:
Because the processes send messages frequently in the spmd block, it is likely that the files are being written simultaneously. I wonder whether, with these larger files, there's a higher probability of writing to the same disk space and creating corrupt files (often the .mat files exist but cannot be read). Perhaps forcing the program to save sequentially will help?
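A minimal sketch of what saving sequentially could look like, assuming R2017b-era function names (labindex, numlabs, labBarrier). The folder outDir, the variable result, and the helper saveWorkerData are hypothetical stand-ins, and the '-v7.3' flag is only an assumption that some variables may exceed 2 GB. Whether this actually avoids the lost-connection error is untested.

```matlab
% Hypothetical output folder; replace with a path all workers can write to.
outDir = tempdir;

spmd
    % Stand-in for the real per-worker data.
    result = rand(1000) * labindex;

    % Let exactly one worker write at a time, in order of labindex.
    for k = 1:numlabs
        if labindex == k
            fname = fullfile(outDir, sprintf('worker_%d.mat', labindex));
            saveWorkerData(fname, result);
        end
        labBarrier;   % all workers wait here until worker k has finished
    end
end

% Local helper (script-local functions need R2016b or later).
% save is called here rather than directly in the spmd body.
function saveWorkerData(fname, result)
    % '-v7.3' is an assumption: it allows variables larger than 2 GB.
    save(fname, 'result', '-v7.3');
end
```

Every worker calls labBarrier the same number of times, so no worker can run ahead and start its own write early. The cost is that the total save time becomes the sum of the individual save times.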
EDIT:
I also get the following message when it fails to write the files:
message with properties:
Identifier: 'MATLAB:connector:connector:ConnectorNotRunning'
Arguments: {}
3 comments
rp, 8 Feb 2018
I believe it's just the Parallel Computing Toolbox. MATLAB version 9.3.0.713579 (R2017b). CentOS Linux release 7.4.1708 (Core).
rp, 8 Feb 2018
Also, I have been calling the save function from the spmd block through another function to get the job done. I'm wondering if that might be contributing to the problem, since apparently calling save inside spmd is a bad thing.
https://www.mathworks.com/matlabcentral/answers/215594-saving-within-spmd-or-parfor
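For reference, wrapping save in a separate function is, as far as I know, the usual way to save from an spmd or parfor body, since calling save directly there violates workspace transparency. Below is a minimal sketch of such a helper that also checks that the file can be read back, given that the .mat files sometimes exist but cannot be read. The name saveWorkerState and the field names in the usage comment are hypothetical.

```matlab
function ok = saveWorkerState(fname, state)
% Write each field of the scalar struct STATE to FNAME as its own variable,
% then check that the variable list of the file can be read back.
    save(fname, '-struct', 'state');
    try
        info = whos('-file', fname);   % expected to error on an unreadable file
        ok = numel(info) == numel(fieldnames(state));
    catch
        ok = false;
    end
end

% Possible use inside the spmd block (names are placeholders):
%   state = struct();
%   state.result = result;
%   state.worker = labindex;
%   ok = saveWorkerState(sprintf('worker_%d.mat', labindex), state);
```

Passing the data in as a struct keeps the helper independent of whatever variables happen to exist on the worker, and the returned flag gives each worker a way to report or retry a failed write.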


Answers (2)

Jiannan Zhou, 25 Aug 2018
I encountered exactly the same problem on R2017b, using the Parallel Computing Toolbox and the save function.

Manhui Wang, 8 Feb 2019
I see a similar problem with R2017b:
message with properties:
Identifier: 'MATLAB:connector:connector:ConnectorNotRunning'
Arguments: {}
but it appears to work fine with R2018a.
