Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.


Scott Stearns on 20 Mar 2021


Commented: Tom Van den heuvel on 21 Sep 2021

Hi,

This error stops training when 'ExecutionEnvironment' is set to 'parallel', 'multi-gpu', or 'gpu'. Training runs uninterrupted when it is set to 'cpu'. I'm running this code for the first time on an Ubuntu 20.04.2 LTS system with a 12-core Intel i9 CPU and two RTX 3070 GPUs. It reports only 12 workers and does not seem to recognize the GPUs.

Any suggestions or help are welcome.

Thank you


Accepted Answer

Joss Knight on 21 Mar 2021


Edited: Joss Knight on 23 Mar 2021


After some investigation (see thread below), this problem seems to be limited to RTX 3080 and RTX 3070 GPUs on Linux. It can be worked around by disabling tensor cores. Restart MATLAB and run

setenv NVIDIA_TF32_OVERRIDE 0

before you do anything else. This workaround reduces performance, so further investigation is under way to find a solution that does not require it.
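As a minimal sketch of the workaround above, the snippet below uses the function form of setenv and checks the value with getenv before the GPU is first touched. The order of the steps and the printed message are illustrative assumptions, not part of the original answer. Whether parallel pool workers started afterwards pick up the variable is not confirmed in this thread.

```matlab
% Sketch only: run this at the very start of a freshly restarted MATLAB session,
% before any GPU function has been called.
setenv('NVIDIA_TF32_OVERRIDE', '0');   % function form of: setenv NVIDIA_TF32_OVERRIDE 0

% Confirm that the variable is visible to this MATLAB process.
assert(strcmp(getenv('NVIDIA_TF32_OVERRIDE'), '0'), ...
    'NVIDIA_TF32_OVERRIDE was not set');

% Only now select the GPU, which is when the CUDA libraries get initialized.
gpu = gpuDevice;
fprintf('Using %s (compute capability %s)\n', gpu.Name, gpu.ComputeCapability);
```

Placing these lines in a startup script rather than in the training script is one way to make sure nothing touches the GPU first.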

Original answer:

Are you running MATLAB release R2021a? The 3070 is not supported on earlier releases.
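One way to answer this question from the command line is sketched below. It assumes that R2021a corresponds to MATLAB version 9.10 and that gpuDeviceTable is available from that release onward; the warning text is illustrative.

```matlab
% Sketch only: check whether this MATLAB release is at least R2021a (version 9.10).
if verLessThan('matlab', '9.10')
    warning('This MATLAB release is older than R2021a; the RTX 3070 is not supported.');
else
    % List the GPUs that MATLAB can see, with their availability.
    disp(gpuDeviceTable);
end
```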

47 Comments

Scott Stearns on 21 Mar 2021

gpuDeviceTable

ans =

  2×5 table

    Index           Name           ComputeCapability    DeviceAvailable    DeviceSelected
    _____    __________________    _________________    _______________    ______________

      1      "GeForce RTX 3070"          "8.6"               true              true
      2      "GeForce RTX 3070"          "8.6"               true              false

%%%%%%%%%%%%%%%

the gpuDevice(i) output for both is the same:

  CUDADevice with properties:

                      Name: 'GeForce RTX 3070'
                     Index: 1
         ComputeCapability: '8.6'
            SupportsDouble: 1
             DriverVersion: 11.2000
            ToolkitVersion: 11
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 8.3701e+09
           AvailableMemory: 8.0412e+09
       MultiprocessorCount: 46
              ClockRateKHz: 1770000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

Scott Stearns on 21 Mar 2021

Joss, sorry I didn't get this earlier. Here is the output from the attempted training (ExecutionEnvironment 'multi-gpu'):

training network....
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 2).
Initializing input data normalization.
|======================================================================================================================|
Mini-batch | Validation | Base Learning |
| Loss | Loss | Rate |
|======================================================================================================================|
Error using trainNetwork (line 184)
Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.

Error in deepLearnWhip1 (line 106)
trainedNet = trainNetwork(imagesTrainds, lgraph, options);

Caused by:
    Error using nnet.internal.cnn.ParallelTrainer/train (line 96)
    Error detected on worker 1.
        Error using nnet.internal.cnn.layer.util.Convolution2DGPUStrategy/backward (line 82)
        Unexpected error calling cuDNN: CUDNN_STATUS_BAD_PARAM.

Artem Lenskiy on 23 Mar 2021

Edited: Artem Lenskiy on 23 Mar 2021

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3080    Off  | 00000000:01:00.0  On |                  N/A |
| 30%   29C    P0    85W / 320W |    440MiB / 10001MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3080    Off  | 00000000:21:00.0 Off |                  N/A |
| 30%   23C    P8     4W / 320W |     10MiB / 10018MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Please let me know if it helps.

Scott Stearns on 23 Mar 2021

Edited: Scott Stearns on 23 Mar 2021

Tue Mar 23 08:23:23 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3070    Off  | 00000000:1A:00.0 Off |                  N/A |
|  0%   49C    P8    27W / 240W |   3203MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3070    Off  | 00000000:68:00.0  On |                  N/A |
|  0%   48C    P8    29W / 240W |    271MiB /  7974MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Scott Stearns on 17 Jun 2021

Hi Joss,

This workaround is no longer working. Do we have any progress on the toolboxes/GPU issues?

Here is what I'm seeing:

training network....
Error using trainNetwork (line 184)
GPU support for deep neural networks requires Parallel Computing Toolbox and a supported GPU device.

Error in deepLearnUCSF (line 139)
trainedNet = trainNetwork(imagesTrainds, lgraph, options);

Caused by:
    Error using feval
    Unable to find a supported GPU device. For more information on GPU support, see GPU Support by Release.

I restarted MATLAB and have setenv NVIDIA_TF32_OVERRIDE 0 at the top of my code. Here are the trainingOptions I'm using. It's frustrating that this expensive machine is not being used. I hope there's help for this...

Thanks,

Scott

options = trainingOptions('sgdm', ...
    'InitialLearnRate', initialLearnRate, ...
    'Momentum', momentumFactor, ...
    'MaxEpochs', maxEpochs, ...
    'MiniBatchSize', miniBatchSize, ...
    'Shuffle', 'every-epoch', ...
    'Verbose', true, ...
    'ValidationFrequency', floor(NumTrain/miniBatchSize), ...
    'ValidationData', imagesValidds, ...
    'Plots', 'training-progress', ...
    'LearnRateSchedule', 'piecewise', ...
    'LearnRateDropFactor', learnRateDropFactor, ...
    'LearnRateDropPeriod', learnRateDropPeriod, ...
    'CheckpointPath', checkpointPath, ...
    'ExecutionEnvironment', 'multi-gpu');
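Since the error above reports that no supported GPU was found, one possible guard is to pick the execution environment from what MATLAB can actually detect. The sketch below is an illustration, not something proposed in the thread: the variable name env and the fallback order are assumptions, and the trainingOptions call is reduced to the one relevant option.

```matlab
% Sketch only: choose 'ExecutionEnvironment' based on the GPUs MATLAB can use.
if ~canUseGPU
    env = 'cpu';          % no supported GPU found, so training still runs on the CPU
elseif gpuDeviceCount > 1
    env = 'multi-gpu';    % more than one GPU is present
else
    env = 'gpu';          % exactly one GPU is present
end
fprintf('Training will use ExecutionEnvironment = ''%s''\n', env);

% Pass the chosen value alongside the other training options.
options = trainingOptions('sgdm', 'ExecutionEnvironment', env);
```

This does not fix GPU detection; it only makes the fallback to 'cpu' explicit instead of stopping with an error.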

Joss Knight on 21 Sep 2021

Edited: Joss Knight on 21 Sep 2021

This is fixed in the next update of MATLAB R2021a; however, you'd be better off simply downloading R2021b, which will be out in a week or so.

Unfortunately NVIDIA weren't able to provide us with a fix that has no effect on performance, but we can at least limit the workaround to the problematic convolutions. A proper fix will arrive with the next CUDA upgrade.

We've never seen this problem on Windows.

Tom Van den heuvel on 21 Sep 2021

Thx for the update!


Release

R2021a
