Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.
Hi,
This error stops training when 'ExecutionEnvironment' is set to 'parallel', 'multi-gpu', or 'gpu'. Training runs uninterrupted when it is set to 'cpu'. I'm running the code for the first time on an Ubuntu 20.04.2 LTS system with a 12-core Intel i9 CPU and 2x 3070 GPUs. It reports only 12 workers and does not seem to recognize the GPUs.
Any suggestions or help are welcome.
Thank you
Accepted Answer
Joss Knight
21 Mar 2021
Edited: Joss Knight
23 Mar 2021
After some investigation (see thread below), this problem seems to be limited to RTX 3080 and 3070 and Linux. It can be worked around by disabling tensor cores. Restart MATLAB and run
setenv NVIDIA_TF32_OVERRIDE 0
before you do anything else. Further investigations are under way to look for a solution that doesn't require this workaround, which will reduce performance.
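Because the variable apparently has to be set before the GPU libraries are first used, one way to apply the workaround consistently is to put it in a startup.m file, which MATLAB runs at launch when it is on the search path. The following is a minimal sketch under that assumption, not an official fix; the confirmation message is only illustrative.

```matlab
% startup.m (sketch): MATLAB runs this file at launch if it is on the search path.
% Disable TF32 tensor-core math before anything initializes the GPU.
setenv('NVIDIA_TF32_OVERRIDE', '0');

% Confirm the value that libraries loaded later in this session will see.
fprintf('NVIDIA_TF32_OVERRIDE = %s\n', getenv('NVIDIA_TF32_OVERRIDE'));
```

Setting the variable after a gpuArray or gpuDevice call has already been made in the session may have no effect, which is why the answer asks for a restart first.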
Original answer:
Are you running MATLAB release R2021a? The 3070 is not supported on earlier releases.
47 Comments
Scott Stearns
21 Mar 2021
Yes. Thanks Joss. I installed the R2021a Linux version on our new LambdaLabs machine.
Is there some other installation required (outside MATLAB) for cuDNN that might be missing?
Appreciate you.
Scott Stearns
21 Mar 2021
gpuDeviceTable
ans =
2×5 table
    Index           Name           ComputeCapability    DeviceAvailable    DeviceSelected
    _____    __________________    _________________    _______________    ______________
      1      "GeForce RTX 3070"          "8.6"               true               true
      2      "GeForce RTX 3070"          "8.6"               true              false
%%%%%%%%%%%%%%%
the gpuDevice(i) output for both is the same:
CUDADevice with properties:
Name: 'GeForce RTX 3070'
Index: 1
ComputeCapability: '8.6'
SupportsDouble: 1
DriverVersion: 11.2000
ToolkitVersion: 11
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.3701e+09
AvailableMemory: 8.0412e+09
MultiprocessorCount: 46
ClockRateKHz: 1770000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceAvailable: 1
DeviceSelected: 1
Joss Knight
21 Mar 2021
We need to disambiguate here. You say "seems to not recognise the gpus", but gpuDeviceTable shows that your devices are recognised and available. You should probably show all of the output so we can see what you're doing. You may have been confused by the warning about idle workers into thinking your devices are not recognised.
On the face of it, there is some sort of bug in cuDNN, so we'd have to get a copy of your actual network to reproduce.
Scott Stearns
21 Mar 2021
Thanks Joss. Agreed. I am new to using gpus and was not aware of the gpuDeviceTable() function. Thanks for that. The code loads a modified GoogLeNet (class nnet.cnn.LayerGraph) as lgraph. It is too big to attach. The only modifications are a new fully connected layer (layer 142 with InputSize 'auto' and OutputSize 2) for our 2 class problem; and new ClassificationOutputLayer (layer 144).
Q: Are you aware of a way to check that cuDNN is installed properly?
I may set up a smaller network to troubleshoot this.
thanks.
Scott Stearns
21 Mar 2021
Here's the code:
% Load network with adjusted fully connected and output layers
%
load new_GoogLeNet % network graph in lgraph
% Set up training options
%
options = trainingOptions('sgdm',...
'Momentum',0.95,...
'MiniBatchSize',80,...
'MaxEpochs',10,...
'InitialLearnRate',5e-4,...
'ValidationData',imagesValidds,...
'ValidationFrequency',5,...
'Verbose',true,...
'ExecutionEnvironment','parallel',...
'Plots','training-progress');
rng default
% Train the network
disp('training network....')
tic
trainedNet = trainNetwork(imagesTrainds, lgraph, options);
toc
Joss Knight
21 Mar 2021
Why don't you attach the code for generating your layer graph? You can load it into Deep Network Designer (deepNetworkDesigner(lgraph)) then use the Export to Live Script function, then just attach the script here.
cuDNN is a library that is installed with MATLAB. If there's something wrong with the installation then it's our fault. However, it's plausible that you've had some sort of installation problem. Did you have the R2021a prerelease and then installed the General Release? That could cause trouble if they got mixed together since they use different versions of cuDNN. Perhaps you should delete your installation and re-install just in case.
Scott Stearns
21 Mar 2021
Here we go. I will try to re-install on the ubuntu machine. We did not start with the prerelease. Did the installation for the first time on Friday. Thanks Joss.
Scott Stearns
21 Mar 2021
Joss, sorry I didn't get this earlier. Here is the output from the attempted training (ExecutionEnvironment 'multi-gpu'):
training network....
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 2).
Initializing input data normalization.
|======================================================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Validation  |  Mini-batch  |  Validation  |  Base Learning  |
|         |             |   (hh:mm:ss)   |   Accuracy   |   Accuracy   |     Loss     |     Loss     |      Rate       |
|======================================================================================================================|
Error using trainNetwork (line 184)
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.
Error in deepLearnWhip1 (line 106)
trainedNet = trainNetwork(imagesTrainds, lgraph, options);
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 96)
Error detected on worker 1.
Error using nnet.internal.cnn.layer.util.Convolution2DGPUStrategy/backward (line 82)
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.
Artem Lenskiy
22 Mar 2021
I am having the same issue.
Running this example
TrainABasicConvolutionalNeuralNetworkForClassificationExample.mlx
on Ubuntu 20.04, MATLAB R2021a.
This call
net = trainNetwork(imdsTrain,layers,options);
produces
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.
I have two RTX 3080 installed.
>> gpuDeviceTable
ans =
2×5 table
    Index           Name           ComputeCapability    DeviceAvailable    DeviceSelected
    _____    __________________    _________________    _______________    ______________
      1      "GeForce RTX 3080"          "8.6"               true               true
      2      "GeForce RTX 3080"          "8.6"               true              false
Artem Lenskiy
22 Mar 2021
With the following set of parameters
options = trainingOptions('sgdm', ...
'InitialLearnRate',0.01, ...
'ExecutionEnvironment',"parallel",...
'MaxEpochs',4, ...
'Shuffle','every-epoch', ...
'ValidationData',imdsValidation, ...
'ValidationFrequency',30, ...
'Verbose',false, ...
'Plots','training-progress');
I get
Error using trainNetwork (line 184)
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.
Caused by:
Error using nnet.internal.cnn.ParallelTrainer/train (line 96)
Error detected on worker 1.
Error using nnet.internal.cnn.layer.util.Convolution2DGPUStrategy/backward (line 82)
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.
Joss Knight
22 Mar 2021
So only when training in parallel? Not when training with ExecutionEnvironment 'gpu'?
Joss Knight
22 Mar 2021
Please restart MATLAB and then disable one of your GPUs by executing
setenv CUDA_VISIBLE_DEVICES 0
and then run your training code (in serial) and tell me if that makes any difference. I'm trying to see if this can only be reproduced on a 2-GPU system.
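A quick way to confirm that the restriction took effect is to check what MATLAB can see afterwards. This is only a sketch: it uses the function form of setenv, and it assumes canUseGPU, gpuDeviceCount, and gpuDeviceTable are available (as they are in the R2021a session shown in this thread).

```matlab
% Run immediately after restarting MATLAB, before any gpuArray or gpuDevice call.
setenv('CUDA_VISIBLE_DEVICES', '0');   % expose only the first CUDA device (ordinal 0)

% canUseGPU returns false rather than erroring when no usable GPU is found.
if canUseGPU()
    fprintf('GPUs visible to MATLAB: %d\n', gpuDeviceCount);
    disp(gpuDeviceTable);              % expected to list a single device
else
    disp('No usable GPU is visible to MATLAB in this session.');
end
```

Note that CUDA ordinals start at 0 while MATLAB device indices start at 1, so the remaining device should appear with Index 1.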
Scott Stearns
22 Mar 2021
Good morning. Reinstalled, and executed the setenv function. Checked:
>> gpuDeviceTable
ans =
1×5 table
    Index           Name           ComputeCapability    DeviceAvailable    DeviceSelected
    _____    __________________    _________________    _______________    ______________
      1      "GeForce RTX 3070"          "8.6"               true               true
Changed ExecutionEnvironment to 'gpu' and re-ran with the same result:
Error using trainNetwork (line 184)
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.
Error in deepLearnWhip1 (line 106)
trainedNet = trainNetwork(imagesTrainds, lgraph, options);
This time the error message did not have all the details ('caused by... ') as before.
Joss Knight
22 Mar 2021
We are looking into this. What happens when you reduce the MiniBatchSize to 1?
Scott Stearns
22 Mar 2021
Restarted MATLAB.
>> setenv CUDA_VISIBLE_DEVICES 0
>> gpuDeviceTable
ans =
1×5 table
    Index           Name           ComputeCapability    DeviceAvailable    DeviceSelected
    _____    __________________    _________________    _______________    ______________
      1      "GeForce RTX 3070"          "8.6"               true              false
Set MiniBatchSize to 1, ExecutionEnvironment to 'gpu'
Same result.
Error using trainNetwork (line 184)
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.
Error in deepLearnWhip1 (line 106)
trainedNet = trainNetwork(imagesTrainds, lgraph, options);
Hmmm...
Scott Stearns
22 Mar 2021
Wondering if there is a cluster profile that should be defined other than the default?
Joss Knight
22 Mar 2021
Edited: Joss Knight
22 Mar 2021
It's clearly nothing to do with parallel execution. It's an issue with cuDNN and compute capability 8.6 devices. And Ubuntu.
Do you mind if I use you as a guinea pig? Can you run this code, see if it succeeds?
%%
X = gpuArray.ones(7,7,16,32,'single');
W = gpuArray.ones(3,3,16,32,'single');
bias = gpuArray.ones(1,1,32,'single');
Z = gpuArray.ones(7,7,32,32,'single');
padding = [1 1];
stride = [1 1];
dilation = [1 1];
numGroups = 1;
dW = nnet.internal.cnngpu.convolveBackwardFilterND(X,W,Z, ...
padding,padding,stride,dilation,numGroups);
dX = nnet.internal.cnngpu.convolveBackwardDataND(X,W,Z, ...
padding,padding,stride,dilation,numGroups);
Scott Stearns
22 Mar 2021
Sure Joss. It's on my sprint this week 8-). Was in a meeting. Here is the result:
Error using gpuArray.ones
Unknown or unsupported data type: single .
Error in testscript (line 3)
X = gpuArray.ones(7,7,16,32,'single ');
With try/catch ME:
{'ME id:parallel:gpu:array:UnknownDataType' }
{'ME message:Unknown or unsupported data type: single .'}
{'cause : ' }
Scott Stearns
22 Mar 2021
If I remove the space at the end of 'single ' :
>> testscript
{'ME id:nnet_cnn:internal:cnngpu:CuDNNError' }
{'ME message:Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.'}
{'cause : ' }
Joss Knight
22 Mar 2021
That's weird, there's no space in what I pasted there. Anyway, whatever, thanks for that, that's incredibly helpful. Can you just show the output in the command window? I need to know which one of those two functions errored.
Did you try reducing the batch size to see if that helped? You can try that with the code above by changing the 4th dimension of X and Z:
%%
X = gpuArray.ones(7,7,16,1,'single');
W = gpuArray.ones(3,3,16,32,'single');
bias = gpuArray.ones(1,1,32,'single');
Z = gpuArray.ones(7,7,32,1,'single');
padding = [1 1];
stride = [1 1];
dilation = [1 1];
numGroups = 1;
dW = nnet.internal.cnngpu.convolveBackwardFilterND(X,W,Z, ...
padding,padding,stride,dilation,numGroups);
dX = nnet.internal.cnngpu.convolveBackwardDataND(X,W,Z, ...
padding,padding,stride,dilation,numGroups);
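To see which of the two calls errors without relying on where the script stops, each call can be wrapped in its own try/catch. The sketch below reuses the undocumented internal functions exactly as quoted in this thread (they are not a supported API and may change between releases); the results and names variables and the canUseGPU guard are additions for illustration.

```matlab
results = strings(1,0);   % one status line per call, filled in below

if canUseGPU()
    % Same inputs as the batch-size-1 test above.
    X = gpuArray.ones(7,7,16,1,'single');
    W = gpuArray.ones(3,3,16,32,'single');
    Z = gpuArray.ones(7,7,32,1,'single');
    padding = [1 1]; stride = [1 1]; dilation = [1 1]; numGroups = 1;

    % Wrap each internal call in an anonymous function so both run the same way.
    names = ["convolveBackwardFilterND", "convolveBackwardDataND"];
    calls = { ...
        @() nnet.internal.cnngpu.convolveBackwardFilterND(X,W,Z, ...
                padding,padding,stride,dilation,numGroups), ...
        @() nnet.internal.cnngpu.convolveBackwardDataND(X,W,Z, ...
                padding,padding,stride,dilation,numGroups)};

    for k = 1:numel(calls)
        try
            out = calls{k}();                       % result is discarded
            results(end+1) = names(k) + ": OK";
        catch ME
            results(end+1) = names(k) + ": FAILED (" + ME.message + ")";
        end
    end
else
    disp('No usable GPU found; skipping the cuDNN check.');
end

disp(results);
```

Because the second call still runs after the first one fails, the output reports the status of both functions in one pass.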
Scott Stearns
22 Mar 2021
>> testscript
Error using nnet.internal.cnngpu.convolveBackwardFilterND
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.
Error in testscript (line 11)
dW = nnet.internal.cnngpu.convolveBackwardFilterND(X,W,Z , ...
Scott Stearns
22 Mar 2021
yes, after disabling one gpu (re comments from 3 hr ago) and setting MiniBatchSize to 1... same result.
Joss Knight
22 Mar 2021
Brilliant, thank you so much. We'll try to chase this down but it's not looking like there is any straightforward workaround.
Joss Knight
22 Mar 2021
One more guess at something that might work: restart MATLAB and disable tensor cores by running
setenv NVIDIA_TF32_OVERRIDE 0
before you do anything else.
Worth a try.
Artem Lenskiy
23 Mar 2021
Joss it worked for me as well. Thank you!
I am just curious what the effect of this variable is. The documentation says that when set to 0, the GPU will "never accelerate FP32 computations with TF32 tensor cores". As far as I know, all computations are performed in FP32 on consumer GPUs.
Joss Knight
23 Mar 2021
That worked? Awesome, great news! Can you confirm that you're saying that disabling TF32 fixes this issue?
This disables the use of the special 'TF32' datatype which allows single precision (FP32) convolutions to be performed on Ampere's tensor cores, making single precision much faster (at the expense of a little accuracy). It is an internal optimisation that only applies to Ampere (and future) cards.
Joss Knight
23 Mar 2021
Note: we cannot reproduce this on a 3090 card on Ubuntu 20. The only differences I can see are the presence of two cards, which could I suppose be some sort of power issue (very unlikely), the driver version (possible), or the precise card (3090/3080/3070) (unlikely but...maybe).
What driver version do you have? Run nvidia-smi in a terminal.
Thanks.
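The same information can be pulled from within MATLAB by shelling out to nvidia-smi. This is a sketch that assumes nvidia-smi is on the system PATH; the query flags used are standard nvidia-smi options. Note that the DriverVersion field in the gpuDevice output earlier in this thread reads 11.2, which appears to correspond to the CUDA version rather than the NVIDIA driver version, so nvidia-smi is the more direct source.

```matlab
% Query GPU name and NVIDIA driver version (requires nvidia-smi on the PATH).
[status, out] = system('nvidia-smi --query-gpu=name,driver_version --format=csv');

if status == 0
    disp(strtrim(out));
else
    fprintf('nvidia-smi did not run successfully (exit status %d).\n', status);
end
```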
Artem Lenskiy
23 Mar 2021
Edited: Artem Lenskiy
23 Mar 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3080    Off  | 00000000:01:00.0  On |                  N/A |
| 30%   29C    P0    85W / 320W |    440MiB / 10001MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3080    Off  | 00000000:21:00.0 Off |                  N/A |
| 30%   23C    P8     4W / 320W |     10MiB / 10018MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Please let me know if it helps.
Scott Stearns
23 Mar 2021
Edited: Scott Stearns
23 Mar 2021
Tue Mar 23 08:23:23 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3070    Off  | 00000000:1A:00.0 Off |                  N/A |
|  0%   49C    P8    27W / 240W |   3203MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3070    Off  | 00000000:68:00.0  On |                  N/A |
|  0%   48C    P8    29W / 240W |    271MiB /  7974MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Scott Stearns
23 Mar 2021
Joss! I tried to submit a thank-you comment in this thread last night ... but YES, this change (setenv NVIDIA_TF32_OVERRIDE 0) worked. I was able to run 'gpu', 'multi-gpu', and 'parallel' with MiniBatchSize greater than 1. Thank you. Please send a reference for the environment variables; I would like to learn more about how and why this worked.
You had us worried there for a while! Awesome. Thanks.
Joss Knight
23 Mar 2021
Okay, I'm going to modify my answer to mention this workaround, but I might ask you to do further investigations if you don't mind continuing to be guinea pigs. Use of this workaround will significantly reduce your performance so ideally there would be a 'proper' fix.
Unfortunately that is the same driver as we have tested with. I guess it could be a 3080/3070 thing.
Scott Stearns
23 Mar 2021
Sure. For us, reduced performance on 2x gpus is better than no performance or cpu-only training. Thanks. Happy to help with investigations.
Joss Knight
26 Mar 2021
Are you also driving your display with one of your cards? Can you plug your display into your motherboard's on-board display port instead and see if the problem goes away? Thanks.
David
11 Apr 2021
Edited: David
11 Apr 2021
Joss,
I just got a new machine with a 3080 and Ubuntu 18.04, and I am having the same problem on the very first Deep Learning demo that uses GoogLeNet. I would be happy to be a guinea pig for any tests you might want to try (since I just forked over $3K for this machine to speed up my MATLAB Deep Learning code).
Just to clarify, does your
setenv NVIDIA_TF32_OVERRIDE 0
workaround make the code CPU only or is it just slower GPU code?
Thanks.
Also, if I revert to R2020b does this problem go away? I'd just try but it won't let me install both versions at once for some reason.
Joss Knight
11 Apr 2021
Hi David. Thanks for the offer! We have now reproduced this in-house and so have NVIDIA. It turns out to be a bug in NVIDIA's cuDNN library. We are working with them to characterize the problem better so we can provide the fix that has the least possible performance impact, and to see if there are any better workarounds in the meantime.
The TF32 override will only make GPU execution slower - although for your problem you may find it makes no difference. It only works because it changes some internal logic for algorithm selection, so it isn't really a final fix. Indeed, we believe the problem might occur on older cards too.
If you downgrade you might not be able to use your 3080. Due to forward compatibility issues, cuDNN sometimes gives errors or wrong answers, so we can't safely recommend use for deep learning.
David
11 Apr 2021
I did actually do a full removal of all versions and a clean install of R2020b, and yes, it gave me a forward compatibility error. I switched off the warning as per the red text suggestion (I forget what it was), and nothing worked after that. So I'm back on R2021a with your workaround in my startup script. Is there a way to subscribe to a bug so I know when it's fixed?
Thanks,
Dave
Joss Knight
11 Apr 2021
Yes, once we have published an external bug report for this you will be able to subscribe to it so you know when it is fixed (or worked around). We're not quite ready to publish the bug report but I'll let you know here when we have.
There is a hardcore workaround which is to download cuDNN 8.2 and drop new libraries into your MATLAB installation. Since this involves modding MATLAB it is not for MATLAB Answers - if you're interested please create a tech support query.
David
14 Apr 2021
Edited: David
15 Apr 2021
Thanks! Before I go the hardcore route do you have a guesstimate how much speed improvement I could expect? I'm not really in a situation where I'm speed critical, yet, but obviously once I start seriously training for some task every speed increase will matter.
Joss Knight
15 Apr 2021
I have not yet seen a truly significant improvement from the use of TF32, but I can't truly answer that question without more investigation.
Scott Stearns
17 Jun 2021
Hi Joss,
This workaround is no longer working. Do we have any progress on the toolboxes/GPU issues?
Here is what I'm seeing:
training network....
Error using trainNetwork (line 184)
GPU support for deep neural networks requires Parallel Computing Toolbox and a supported GPU device.
Error in deepLearnUCSF (line 139)
trainedNet = trainNetwork(imagesTrainds, lgraph, options);
Caused by:
Error using feval
Unable to find a supported GPU device. For more information on GPU support, see GPU Support by Release.
>>
I restarted MATLAB and have setenv NVIDIA_TF32_OVERRIDE 0 at the top of my code. Here are the trainingOptions I'm using. I'm frustrated that this expensive machine is not being used. I hope there's help on this...
Thanks,
Scott
options = trainingOptions('sgdm', ...
'InitialLearnRate',initialLearnRate,...
'Momentum',momentumFactor,...
'MaxEpochs',maxEpochs, ...
'MiniBatchSize',miniBatchSize, ...
'Shuffle','every-epoch',...
'Verbose',true, ...
'ValidationFrequency',floor(NumTrain/miniBatchSize),...
'ValidationData',imagesValidds,...
'Plots','training-progress',...
'LearnRateSchedule','piecewise',...
'LearnRateDropFactor',learnRateDropFactor, ...
'LearnRateDropPeriod',learnRateDropPeriod, ...
'CheckpointPath', checkpointPath,...
'ExecutionEnvironment','multi-gpu');
Joss Knight
18 Jun 2021
Edited: Joss Knight
18 Jun 2021
Looks like your device is no longer being recognised, which could be because you need to upgrade or install a graphics driver, or because your device is incorrectly installed. What does nvidia-smi say?
We are actively working on a workaround for the NVIDIA bug that is the root cause of the original problem. It has involved a relatively complex back-and-forth with NVIDIA to work out how we can disable the malfunctioning code without making everyone's training slower. However, the error message you are getting is something else, not to do with this issue.
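A small pre-flight check before training can make this kind of failure easier to spot. The following is a sketch, not an official recommendation: the execEnv and numGpus variables are hypothetical names, and it assumes canUseGPU and gpuDeviceCount are available in your release.

```matlab
% Choose an execution environment based on what MATLAB can actually see.
if canUseGPU()
    numGpus = gpuDeviceCount;          % number of GPU devices MATLAB detects
    if numGpus > 1
        execEnv = 'multi-gpu';
    else
        execEnv = 'gpu';
    end
else
    warning('No supported GPU detected; check the driver with nvidia-smi.');
    execEnv = 'cpu';
end

fprintf('Using ExecutionEnvironment = ''%s''\n', execEnv);
% execEnv can then be passed to trainingOptions(..., 'ExecutionEnvironment', execEnv).
```

Falling back to 'cpu' keeps the script running, but the warning makes it obvious that the GPUs are not being used, rather than leaving that to be discovered from the training speed.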
Scott Stearns
18 Jun 2021
Yes, thank you Joss! After posting this comment, I found that the gpuDeviceTable was empty... this led to updating Additional Drivers and getting the most recent driver. I think an installation of LibreOffice changed the driver. So I'm back on track. Thank you for your quick reply. We're training a deep learning network on dual GPUs and it is incredibly fast.
Glad to hear your team and NVIDIA are working on the original problem. I still use the TF32 override workaround and it seems to be fine.
Best regards,
Scott
Tom Van den heuvel
21 Sep 2021
Hi,
I'm facing the same problem, using two RTX A5000s on Ubuntu 20.04 with MATLAB R2021a. I see this issue has existed since March 2021. When can we expect a fix that does not reduce performance?
Can you confirm this issue does not manifest on Windows 10, as that would be a better workaround for me in the short term?
Kr,
Tom
Joss Knight
21 Sep 2021
Edited: Joss Knight
21 Sep 2021
This is fixed in the next update of MATLAB R2021a, however you'd be better off simply downloading R2021b which will be out in a week or so.
Unfortunately NVIDIA weren't able to provide us with a fix that has no effect on performance, but we can at least limit the workaround to the problematic convolutions. A proper fix will arrive with the next CUDA upgrade.
We've never seen this problem on Windows.