Parallel reinforcement learning on HPC with warning "Received duplicate id = x from worker"

Mirjan Heubaum, 29 November 2021
Commented: Walter Roberson, 5 December 2021
When I train a reinforcement learning agent on an HPC cluster with Parallel Computing Toolbox, I get the warning "Received duplicate id = 22 from worker" (or another id) after, for example, 180 training episodes. Training then appears to stop, with no further error or warning. I start the .m script with these commands:
module load matlab/R2021a
matlab -nodisplay < rl_training.m
When I set
trainOpts.UseParallel = false;
I often get the warning "Error reading character from command line" instead. Does anyone know why these messages occur, and is there a way to continue the training?
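For comparison, MATLAB R2019a and later provide a `-batch` startup option intended for non-interactive runs: it executes a statement (here, the script name without the `.m` extension), exits when the statement finishes, and returns a nonzero exit status if it errors, instead of reading the script from standard input. The sketch below is a job-script fragment that assumes the same site-specific `module load matlab/R2021a` environment and that `rl_training.m` is in the current working directory. Nothing in this thread confirms that it avoids either of the messages above.

```shell
#!/bin/bash
# Load the same MATLAB module used in the question (module name is site-specific).
module load matlab/R2021a

# Run rl_training.m non-interactively instead of piping it through stdin.
# -batch exits MATLAB when the script finishes and returns a nonzero
# exit status if the script throws an error.
matlab -batch "rl_training"
```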
5 comments
Image Analyst, 2 December 2021
If you have a maintenance contract in place, I'd call MathWorks Technical Support on the phone. Of course, you can also use email, as @Raymond Norris said. I never use email or the support page; when I encounter a problem I need an immediate solution, so I call them.
Walter Roberson, 5 December 2021
I never call them myself; I open support cases, where I can describe the problem and include code and results that show clearly what is expected and what is received instead. 85% of the time the response is going to be "You are right, that's not good, the developers have been notified and it might get fixed some day".

Answers (0)

Category: Third-Party Cluster Configuration
