[bug?] 2018a trainnetwork accuracy suddenly dropped with multi-gpu


khcy82dyc on 14 Apr 2018

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/394843-bug-2018a-trainnetwork-accuracy-suddenly-dropped-with-multi-gpu

Commented: khcy82dyc on 17 Apr 2018

When I use trainNetwork, the accuracy suddenly dropped and never recovered. The following are my training options.

options = trainingOptions('sgdm', ...
    'Momentum', 0.9, ...
    'InitialLearnRate', 1e-3, ...
    'L2Regularization', 0.0005, ...
    'MaxEpochs', 20000, ...
    'MiniBatchSize', 4, ...
    'Shuffle', 'every-epoch', ...
    'CheckpointPath', newdirstoragetraicheckpoint, ...
    'ExecutionEnvironment', 'multi-gpu', ...
    'Plots', 'training-progress', ...
    'VerboseFrequency', 2);

| Epoch | Iteration | Time Elapsed (hh:mm:ss) | Mini-batch Accuracy | Mini-batch Loss | Base Learning Rate |
| 2320 | 39426 | 10:23:50 | 82.76% | 0.2362 | 0.0010 |
| 2320 | 39428 | 10:23:52 | 83.29% | 0.2832 | 0.0010 |
| 2320 | 39430 | 10:23:54 | 25.52% | 3.1097 | 0.0010 |
| 2320 | 39432 | 10:23:56 | 27.04% | 3.0014 | 0.0010 |
| 2320 | 39434 | 10:23:58 | 23.22% | 2.9561 | 0.0010 |

I never had this issue in R2017b, so I suspect it has something to do with the new trainNetwork in R2018a. One thing I noticed is that R2017b didn't have multi-gpu support for 'ExecutionEnvironment'; could this be the reason? I'm currently rerunning the same script in R2017b with 'ExecutionEnvironment' set to 'gpu' to see whether the problem occurs there.

2 Comments

Joss Knight on 14 Apr 2018

Nothing obvious changed in the multi-gpu training between R2017b and R2018a, although NCCL was upgraded. What happens if you take the most recent checkpoint from before the loss jumped and feed the layers from that network back into training? Does the same thing happen?

This sort of behaviour isn't unheard of, because the loss landscape can be non-smooth near the solution and you can suddenly step to a bad solution with no means of escaping the local minimum. You may have been unlucky and this will never happen again. Try lowering the learn rate or use a learn rate drop schedule to ensure the learn rate is lower when you reach this unstable region.
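As a rough sketch of the learn rate drop schedule suggested above, the options from the question could be extended with the piecewise schedule settings of trainingOptions. The drop factor and drop period below are illustrative assumptions, not values from this thread, and would need tuning for the actual problem.

```matlab
% Sketch only: same options as in the question, plus a piecewise learn rate
% schedule. 'LearnRateDropFactor' and 'LearnRateDropPeriod' values are
% illustrative assumptions, not recommendations from the thread.
options = trainingOptions('sgdm', ...
    'Momentum', 0.9, ...
    'InitialLearnRate', 1e-3, ...
    'LearnRateSchedule', 'piecewise', ...   % multiply the learn rate by the drop factor periodically
    'LearnRateDropFactor', 0.5, ...         % assumed value: halve the learn rate at each drop
    'LearnRateDropPeriod', 1000, ...        % assumed value: drop every 1000 epochs
    'L2Regularization', 0.0005, ...
    'MaxEpochs', 20000, ...
    'MiniBatchSize', 4, ...
    'Shuffle', 'every-epoch', ...
    'CheckpointPath', newdirstoragetraicheckpoint, ...
    'ExecutionEnvironment', 'multi-gpu', ...
    'Plots', 'training-progress', ...
    'VerboseFrequency', 2);
```

With these assumed values the learn rate would already have been reduced twice by epoch 2320, where the accuracy collapsed in the log from the question. Over 20000 epochs a fixed halving period shrinks the rate to a very small value, so a longer period or a larger drop factor may be more sensible.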

khcy82dyc on 17 Apr 2018

You are right! I continued from the checkpoint using the same learning rate and it was running with no issue for 22 hours until I stopped it manually. I guess I was just unlucky... Thanks for this!
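For anyone wanting to repeat this check, continuing from a checkpoint can be sketched roughly as follows. It assumes the checkpoint files follow the net_checkpoint__<iteration>__<timestamp>.mat naming that trainNetwork uses and store the network in a variable named net. trainingData and badIteration are hypothetical names introduced here; the training data is not shown in the thread.

```matlab
% Sketch only: resume training from the latest checkpoint saved before the
% accuracy collapsed. trainingData is a hypothetical placeholder for the
% data originally passed to trainNetwork; options is the same
% trainingOptions object as in the question.
badIteration = 39430;  % first iteration with collapsed accuracy in the log

% List checkpoint files in the folder given as 'CheckpointPath'.
files = dir(fullfile(newdirstoragetraicheckpoint, 'net_checkpoint__*.mat'));

% Extract the iteration number from each file name.
iters = zeros(numel(files), 1);
for k = 1:numel(files)
    tok = regexp(files(k).name, '^net_checkpoint__(\d+)__', 'tokens', 'once');
    iters(k) = str2double(tok{1});
end

% Keep the most recent checkpoint saved before the bad iteration.
candidates = find(iters < badIteration);
assert(~isempty(candidates), 'No checkpoint found before the bad iteration.');
[~, idx] = max(iters(candidates));
chosen = files(candidates(idx));

% Load the saved network and continue training from its layers.
% (Assumes a SeriesNetwork; for a DAGNetwork, pass layerGraph(s.net) instead.)
s = load(fullfile(chosen.folder, chosen.name), 'net');
net2 = trainNetwork(trainingData, s.net.Layers, options);
```

If the collapse does not reappear when training continues from the same point, that is consistent with a one-off unlucky step rather than a reproducible fault.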
