Odd error occurs while training a deep learning model on a GPU

So, I want to run this deep learning model in MATLAB. It's frustrating: I believe the code is correct, but I'm still receiving weird errors.
ValidationSet = imageDatastore('C:\Users\V\Documents\MATLAB\cifar10Validation',...
'FileExtensions', {'.png'}, 'IncludeSubfolders',true, ...
'LabelSource','foldernames');
TrainingSet = imageDatastore('C:\Users\V\Documents\MATLAB\cifar10Train',...
'FileExtensions', {'.png'}, 'IncludeSubfolders',true, ...
'LabelSource','foldernames');
%-------------------Define Layers
varSize = 64;
conv1 = convolution2dLayer(5,varSize,'Padding',2,'BiasLearnRateFactor',2);
conv1.Weights = gpuArray(single(randn([5 5 3 varSize])*0.0001));
fc1 = fullyConnectedLayer(2560,'BiasLearnRateFactor',2);
fc1.Weights = gpuArray(single(randn([2560 23040])*0.1));
fc2 = fullyConnectedLayer(160,'BiasLearnRateFactor',2);
fc2.Weights = gpuArray(single(randn([160 2560])*0.1));
fc3 = fullyConnectedLayer(10,'BiasLearnRateFactor',2);
fc3.Weights = gpuArray(single(randn([10 160])*0.1));
layers = [
imageInputLayer([varSize varSize 3]);
conv1;
maxPooling2dLayer(3,'Stride',2);
reluLayer();
convolution2dLayer(5,32,'Padding',2,'BiasLearnRateFactor',2);
reluLayer();
averagePooling2dLayer(3,'Stride',2);
convolution2dLayer(5,2560,'Padding',2,'BiasLearnRateFactor',2);
reluLayer();
averagePooling2dLayer(3,'Stride',2);
convolution2dLayer(5, 2560, 'Padding', 2, 'BiasLearnRateFactor', 2);
reluLayer();
averagePooling2dLayer(3,'Stride',2);
fc1;
reluLayer();
fc2;
reluLayer();
fc3;
reluLayer();
softmaxLayer()
classificationLayer()];
%-----------TrainingOptions
options = trainingOptions('sgdm', ...
'InitialLearnRate', 0.001, ...
'LearnRateSchedule', 'piecewise', ...
'LearnRateDropFactor', 0.1, ...
'LearnRateDropPeriod', 8, ...
'L2Regularization', 0.004, ...
'MaxEpochs', 10, ...
'MiniBatchSize', 100, ...
'Verbose', true);
%Train------------------------
inputSize = layers(1).InputSize;
pixelRange = [-32 32];
imageAugmenter = imageDataAugmenter( ...
'RandXReflection',true, ...
'RandXTranslation',pixelRange, ...
'RandYTranslation',pixelRange);
augTrainingSet = augmentedImageDatastore(inputSize(1:2),TrainingSet, ...
'DataAugmentation',imageAugmenter);
%--------------
augValidationSet = augmentedImageDatastore(inputSize(1:2),ValidationSet);
[net, info] = trainNetwork(augTrainingSet, layers, options);
%Testing Part---------------------------------
[YPred,scores] = classify(net,augValidationSet);
idx = randperm(numel(ValidationSet.Files),64);
figure
for i = 1:64
    subplot(8,8,i)
    % Read from ValidationSet (imdsValidation was never defined in this script)
    I = readimage(ValidationSet,idx(i));
    imshow(I)
    label = YPred(idx(i));
    title(string(label));
end
YValidation = ValidationSet.Labels;
accuracy = mean(YPred == YValidation);
I should mention that the laptop I'm running this on has an NVIDIA 1050 Ti GPU. I would appreciate any help, such as pointers to necessary updates.
This is the current error:
Error using trainNetwork (line 150)
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Update: After installing CUDA 10 and updating my NVIDIA driver, I still receive the same error.
Thanks!

6 Comments

Walter Roberson on 3 Feb 2019
I do not see a copy of the error messages?
Which MATLAB release are you using, and which CUDA driver do you have installed?
Sorena Sirousi on 4 Feb 2019
Edited: Stephen23 on 4 Feb 2019
So, I installed CUDA version 10 and updated the driver for my NVIDIA GPU. Still, I receive the same error. Does anyone know a way to circumvent this issue?
I'm using release R2018b, with CUDA 10 and an updated NVIDIA driver. Is there any way around this at all?
The difference now is that I get this warning first, followed by the error above.
This is the warning: Warning: An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
Matt J on 4 Feb 2019
Did you reboot after upgrading?
Sorena Sirousi on 4 Feb 2019
I guess so. I restarted both my computer and MATLAB each time after installing CUDA and updating my NVIDIA driver.
Jan on 9 Jul 2019
[MOVED from flags] ruchika lalit about 4 hours ago.
Can you please help me with how to install CUDA?
[Please use flags only to inform admins and editors about inappropriate content like rudeness or spam. Thanks]


Answers (1)

Joss Knight on 4 Feb 2019


This error is nearly always a kernel timeout. Use Windows regedit and follow the instructions on this page to disable TDR (Timeout Detection and Recovery) by setting TdrLevel to 0.
Try this and get back to me.
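To check whether the setting is in place, here is a minimal sketch that reads the value back from within MATLAB. It assumes Windows and the GraphicsDrivers registry key that Microsoft documents for TDR; the variable names are illustrative. winqueryreg only reads the registry, so the value itself still has to be created in regedit, and a reboot is typically needed before it takes effect.

```matlab
% Sketch: read back the TDR setting from the Windows registry.
rootKey   = 'HKEY_LOCAL_MACHINE';
subKey    = 'SYSTEM\CurrentControlSet\Control\GraphicsDrivers';
valueName = 'TdrLevel';

if ispc
    try
        % winqueryreg errors if the value does not exist
        tdrLevel = winqueryreg(rootKey, subKey, valueName);
        fprintf('%s = %d (0 means timeout detection is disabled)\n', valueName, tdrLevel);
    catch
        fprintf('%s is not set, so Windows uses its default timeout behavior.\n', valueName);
    end
else
    disp('TDR is a Windows feature; nothing to check on this platform.');
end
```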

8 Comments

Sorena Sirousi on 4 Feb 2019
Edited: Sorena Sirousi on 4 Feb 2019
Thanks, I created a new TdrLevel value set to zero (I did not have one before). But now I receive a new error.
Error using trainNetwork (line 150)
GPU out of memory. Try reducing 'MiniBatchSize' using the trainingOptions function.
Caused by:
Error using +
Out of memory on device. To view more detail about available memory on the GPU, use 'gpuDevice()'. If the problem persists, reset the GPU by calling 'gpuDevice(1)'.
Joss Knight on 4 Feb 2019
Great! Now you know you don't have enough GPU memory to run things as they are. You should reduce the MiniBatchSize in the trainingOptions.
Sorena Sirousi on 4 Feb 2019
Edited: Sorena Sirousi on 4 Feb 2019
I even set it to 1 (the minimum value), and this error still pops up. Is there any way to work this out?
Joss Knight on 4 Feb 2019
Well, how much memory do you have?
gpu = gpuDevice;
gpu.AvailableMemory
Your first fully connected layer alone needs 225 MB, so you've got a pretty large network. I don't suppose that's the problem, but it's worth a look. You could try reducing the number of outputs from that layer.
Also, let's see exactly where you're running out of memory. After the error, type:
getReport(MException.last.UnderlyingCause)
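The 225 MB figure can be reproduced with a little arithmetic. This sketch assumes single-precision (4-byte) weights and uses the 2560-by-23040 weight size given for fc1 in the question; the variable names are illustrative.

```matlab
% Memory taken by the fc1 weight matrix, assuming single precision.
numOutputs     = 2560;    % outputs of fc1
numInputs      = 23040;   % 3 x 3 x 2560 activations feeding fc1
bytesPerSingle = 4;

fc1Bytes = numOutputs * numInputs * bytesPerSingle;
fc1MiB   = fc1Bytes / 2^20;
fprintf('fc1 weights: %d bytes = %.0f MiB\n', fc1Bytes, fc1MiB);
```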
Sorena Sirousi on 4 Feb 2019
Edited: Sorena Sirousi on 4 Feb 2019
I'm sorry, but I can't tell from these lines which layer is causing the error.
I have 3.3 GB of memory available out of the 4 GB on my graphics card.
'Error using +
Out of memory on device. To view more detail about available memory on the GPU, use 'gpuDevice()'. If the problem persists, reset the GPU by calling 'gpuDevice(1)'.
Error in nnet.internal.cnn.SeriesNetwork/updateLearnableParameters (line 431)
this.Layers{el}.LearnableParameters(param).Value = this.Layers{el}.LearnableParameters(param).Value + deltas{currentDelta};
Error in nnet.internal.cnn.Trainer/train (line 96)
net = net.updateLearnableParameters(velocity);
Error in trainNetwork>doTrainNetwork (line 218)
trainedNet = trainer.train(trainedNet, trainingDispatcher);
Error in trainNetwork (line 148)
[trainedNet, info] = doTrainNetwork(layersOrGraph, opts, X, Y);
Error in scratchnetwork (line 60)
[net, info] = trainNetwork(augTrainingSet, layers, options);'
Sorena Sirousi on 4 Feb 2019
Is it possible to work it out with parallel computing?
Walter Roberson on 4 Feb 2019
I suggest putting a breakpoint at nnet.internal.cnn.SeriesNetwork/updateLearnableParameters (line 431) and examining the size() of this.Layers{el}.LearnableParameters(param).Value and deltas{currentDelta}. I am wondering whether you might accidentally be adding a row vector to a column vector, which would try to generate a full rectangular matrix as the result.
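To illustrate the kind of accident described here (a generic example, not the toolbox's internal code): since R2016b, adding a row vector to a column vector expands both operands into a full matrix instead of raising an error.

```matlab
% Implicit expansion: row + column gives a full matrix, not a vector.
rowVec = ones(1, 4, 'single');   % 1-by-4
colVec = ones(4, 1, 'single');   % 4-by-1

result = rowVec + colVec;        % expands to 4-by-4
disp(size(result))

% For a bias-sized vector the blow-up is quadratic: two 2560-element
% vectors would produce a 2560-by-2560 matrix.
expandedElements = 2560 * 2560;
```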
Joss Knight on 5 Feb 2019
So, you have nearly enough memory but not quite, because in order to update the model parameters we need to take a temporary copy of each on this line of code.
You really are close to the wire with this model. I'm afraid I haven't the patience to do it for you, but if you run analyzeNetwork on your input layer array, and add up all the sizes of all the activations and model parameters, you'll probably find your model needs about 1 GB of space. The way training works, you need some significant multiple of that, around 3x, because you need to retain activations in memory and at least one copy of the weights.
Perhaps there's a deep learning expert here who can comment on whether your model needs to be as big as it is. Certainly, your sudden jump from 32 to 2560 channels in layer 8 seems unusual, and is probably giving you a lot of unused filters.
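A rough version of that tally can be done by hand. The sketch below is a back-of-the-envelope estimate, not output from analyzeNetwork: it assumes single-precision parameters and takes the filter sizes and channel counts from the layer array in the question. The factor of 4 (weights, gradients, SGDM velocity, and the temporary copy made during the update) is an assumption about what is resident at once, and the helper names are illustrative.

```matlab
% Learnable parameters per layer: weights + biases.
convParams = @(k, cIn, cOut) k*k*cIn*cOut + cOut;
fcParams   = @(nIn, nOut)    nIn*nOut + nOut;

params = [ ...
    convParams(5,    3,   64);    % conv1
    convParams(5,   64,   32);    % second convolution
    convParams(5,   32, 2560);    % third convolution
    convParams(5, 2560, 2560);    % fourth convolution
    fcParams(23040, 2560);        % fc1
    fcParams(2560,   160);        % fc2
    fcParams(160,     10)];       % fc3

totalParams = sum(params);
paramGiB    = totalParams * 4 / 2^30;   % 4 bytes per single

% Assumed copies held during an SGDM update: weights, gradients,
% velocity, and one temporary for the addition.
assumedCopies = 4;
fprintf('Parameters: %d (%.2f GiB)\n', totalParams, paramGiB);
fprintf('With %d copies: %.2f GiB, before any activations\n', ...
    assumedCopies, assumedCopies * paramGiB);
```

Under these assumptions the parameters alone come to roughly 0.84 GiB, and four copies to roughly 3.4 GiB, which would exceed the 3.3 GB reported as available even with a MiniBatchSize of 1. Almost three quarters of the total sits in the fourth convolution, so narrowing that layer is the obvious first thing to try.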


Asked: 3 Feb 2019

Commented: 9 Jul 2019
