GPUCoder does not generate parallelized code

Question

Le Ki, 26 April 2022


Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/1705955-gpucoder-does-not-generate-parallelized-code

Commented: Joss Knight, 1 May 2022

I am currently optimizing a program that otherwise runs on the CPU, and I am using GPU Coder to run that code on the GPU. However, this does not provide a significant speedup.

To investigate, I tried building the simplest possible function with GPU Coder.

function [out] = simple_function(vector)
% Three chained elementwise square roots
out = sqrt(sqrt(sqrt(vector)));
end

I then call this function with very large input vectors, which definitely requires computation. However, when I analyze the function with gpucoder.profile(...) and open the profiling result in the NVIDIA Visual Profiler, it indicates that the code is not well optimized. In particular, it reports that parallelized computations are running 0% of the time.

This should be a very easy function to parallelize. Is there any way to configure GPU Coder to parallelize more?
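For reference, the build-and-profile workflow described above might look roughly like the following. This is a minimal sketch, not the exact commands used: it assumes GPU Coder and a supported NVIDIA GPU, the input size is an illustrative guess, and the toolboxes that gpucoder.profile needs can differ between releases.

```matlab
% Sketch only: assumes GPU Coder and a supported NVIDIA GPU.
% The input size is illustrative, not taken from the original post.
vector = rand(1, 1e7, 'single');      % hypothetical "very large" input

cfg = coder.gpuConfig('mex');         % configuration for a CUDA MEX target
codegen -config cfg -args {vector} simple_function

% Host array in, host array out: the data crosses to the GPU and back
% on every call.
out = simple_function_mex(vector);

% Profile the same entry point with the same example input.
gpucoder.profile('simple_function', {vector});
```

Because the MEX function is generated here for an ordinary host array, each call includes a host-to-device copy and a device-to-host copy around the single elementwise kernel.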

Thanks


Answer 1

Joss Knight, 29 April 2022


Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/1705955-gpucoder-does-not-generate-parallelized-code#answer_953775


This looks about right to me: your kernel is too simple, and you're transferring data to and from the CPU on every call. Try recompiling with gpuArray input and output (if you have Parallel Computing Toolbox) to remove the data transfer, or else write some code that requires the GPU to launch multiple kernels. Do some reductions, perhaps?

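sz = size(x);
for i = 1:100
    % Elementwise kernels followed by a reduction over all elements
    y = sum(sqrt(sqrt(sqrt(abs(x)))),"all");
    % New random data, scaled by the previous result, for the next iteration
    x = y*randn(sz,"like",x);
end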
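To compile a loop like this, it needs to sit inside an entry-point function. A minimal sketch follows; the function name reduction_loop and the preallocation of y are assumptions added for illustration and are not part of the answer.

```matlab
function y = reduction_loop(x) %#codegen
% reduction_loop  Hypothetical entry point wrapping the loop above.
% x is a numeric array; y is the scalar result of the last reduction.
sz = size(x);
y = zeros(1, "like", x);                 % define y before the loop
for i = 1:100
    % Elementwise kernels followed by a reduction over all elements
    y = sum(sqrt(sqrt(sqrt(abs(x)))), "all");
    % Each iteration depends on the result of the previous one
    x = y * randn(sz, "like", x);
end
end
```

With the function saved as reduction_loop.m, passing a gpuArray example value to -args should mark the input as living on the GPU, which is the "gpuArray input" suggestion above. This assumes Parallel Computing Toolbox and a GPU Coder release that supports gpuArray entry-point arguments, and the input size is again illustrative:

```matlab
cfg = coder.gpuConfig('mex');
xg = gpuArray(rand(1, 1e7, 'single'));   % data already on the device
codegen -config cfg -args {xg} reduction_loop
y = reduction_loop_mex(xg);
```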

2 Comments

Le Ki, 1 May 2022

Thank you very much for the helpful and quick answer.

I will definitely test whether I can still achieve a speedup with GPU arrays in my specific use case.

I have also profiled the program you suggested. The result is much better and shows more parallelization. Nevertheless, the profiler indicates that there is no kernel concurrency, and although the compute utilization has increased to 14%, that is still not very high. Is this the maximum I can expect when working with GPU Coder, or are there other ways to improve this?

Also, I wanted to ask if it's normal that the NVIDIA Visual Profiler analysis tools available for coded CUDA kernels don't work for this kind of profiling. For example, when I press "More...." on the kernel concurrency details in the screenshotted window, nothing happens. Is this normal for this type of profiling or is my version of the profiler broken?

Thanks a lot

Joss Knight, 1 May 2022

Thank goodness there's no kernel concurrency! Each operation is dependent on the outcome of the last. Write some code that doesn't have such dependencies and you'll see more concurrency.
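As an illustration of what code without such dependencies could look like, here is a minimal sketch. The function name independent_ops and its two inputs are hypothetical, and whether the generated CUDA code actually overlaps the two computations depends on GPU Coder and the hardware, so treat it as a starting point for profiling rather than a guarantee of concurrency.

```matlab
function [a, b] = independent_ops(x, y) %#codegen
% independent_ops  Hypothetical example of two computations that share
% no data: neither result is needed to compute the other.
a = sum(sqrt(abs(x)), "all");   % depends only on x
b = sum(sqrt(abs(y)), "all");   % depends only on y
end
```

In the earlier loop, by contrast, every iteration reads the y produced by the previous one, so the kernels have to run one after another.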

If I were you, I'd just profile some real code rather than testing with toy examples. Ultimately this code is too simple to trouble the GPU very much, so no doubt you're still bound by your memory transfer operations.
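One rough way to check how much the transfers cost relative to the arithmetic is to time the two separately in MATLAB. The sketch below uses gpuArray from Parallel Computing Toolbox rather than GPU Coder, and the input size is an assumption, so the numbers are only indicative of the generated code's behavior.

```matlab
% Sketch only: assumes Parallel Computing Toolbox and a supported GPU.
x = rand(1, 1e7, 'single');                      % illustrative size

% Host -> device -> host round trip with no computation in between.
tTransfer = timeit(@() gather(gpuArray(x)));

% The three square roots alone, with the data already on the device.
xg = gpuArray(x);
tCompute = gputimeit(@() sqrt(sqrt(sqrt(xg))));

fprintf('transfer: %.3g s, compute: %.3g s, ratio: %.1f\n', ...
    tTransfer, tCompute, tTransfer / tCompute);
```

If the ratio is well above 1, the transfer time outweighs the kernel time, which would be consistent with the low compute utilization the profiler reports.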

I'm afraid I can't help you with your question about the profiler.
