GPU arrayfun is so slow, what is going on?

Question

Hao Zhang on 11 Dec 2018

https://jp.mathworks.com/matlabcentral/answers/435152-gpu-arrayfun-is-so-slow-what-is-going-on

Commented: Derrick Ling on 20 Apr 2019

Accepted Answer: Matt J

Hi,

I am trying to understand what GPU arrayfun is doing. The following is my test code.

clear;clc;close all
gd=gpuDevice();
reset(gd);
N=2e3;
a=rand(60,N,'single','gpuArray');
tic;
b=sum(a,1);
wait(gd);
toc;
tic;
c=arrayfun(@(i) sum(a(:,i),1),(1:N));
wait(gd);
toc;

The results are:

Elapsed time is 0.000468 seconds.
Elapsed time is 0.584521 seconds.

What is going on here? A 1000 times difference? I would expect a similar runtime, since GPU arrayfun is supposed to execute in parallel on the GPU cores. Did I make a silly mistake in how I use arrayfun?

Thanks!

1 Comment

Hao Zhang on 13 Dec 2018

Hi, I am back with an update. I have successfully vectorized and implemented my particle simulation on the GPU. The speedup is astonishing, ~10 times faster than the CPU code. Thanks to Matt J and Joss Knight for their wonderful suggestions.

Now the rest of the code (everything except the neighbor search, i.e. solving the fluid equations) is so fast that the limiting part is the MATLAB function knnsearch, which uses the kd-tree algorithm running on the CPU. According to the code profiler results, it takes 85% of the runtime.

The function 'knnCPU_kdtree_func' uses the MATLAB built-in function knnsearch with the kd-tree algorithm running on the CPU. The other functions, which do the real math and run on the GPU, consume only 10% of the total time.

Is there any GPU implementation of k-nearest neighbor search that I can download for free and call as a function from my MATLAB code? Many thanks.
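One possible stopgap, sketched here under stated assumptions rather than as a recommended replacement: a brute-force (exhaustive) k-nearest-neighbor search written only with operations that accept gpuArray inputs (sum, matrix multiply, implicit expansion, sort). This is not the kd-tree algorithm used by knnsearch; it builds a full N-by-N distance matrix, so it only fits in memory for moderate N. The names X, k, D2 and idx are hypothetical, and the code runs on the CPU as written; wrapping X in gpuArray would move the same computation to the GPU.

```matlab
% Brute-force kNN sketch (exhaustive search, NOT a kd-tree).
rng(0);                          % fixed seed so the run is repeatable
X = rand(500, 3);                % hypothetical particle positions, one point per row
k = 5;                           % number of neighbors to return per point

% Squared Euclidean distances via |x|^2 + |y|^2 - 2*x*y'.
sqn = sum(X.^2, 2);              % N-by-1 squared norms
D2  = sqn + sqn.' - 2*(X*X.');   % N-by-N matrix, built with implicit expansion

% Sort each row; the first k columns are the nearest neighbors.
[~, order] = sort(D2, 2);
idx = order(:, 1:k);             % column 1 is the point itself (distance ~0)
```

Because the query points are the data points themselves, each point is returned as its own first neighbor; dropping the first column leaves the k-1 nearest other points.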

Answer 1

Matt J on 11 Dec 2018

https://jp.mathworks.com/matlabcentral/answers/435152-gpu-arrayfun-is-so-slow-what-is-going-on#answer_351810

Edited: Matt J on 11 Dec 2018

"What is the most efficient way to vectorize the above code?"

I would say, as follows,

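% uint8 only holds values up to 255, so indices up to N=2e3 need a wider type such as uint32
idx_Neighbor=randi([1 N],60,N,'uint32');
temp=p(idx_Neighbor);      % 60-by-N matrix of neighbor values
temp=temp+p.';             % add p(j) to column j via implicit expansion
ax=sum(temp .* delW_x,1);  % 1-by-N row vector; transpose to match the loop's N-by-1 ax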
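A quick way to gain confidence in a vectorization like this is to compare it against the original loop on the same random data. The sketch below does that on the CPU, assuming MATLAB R2016b or later for implicit expansion; the variable names follow the thread, while ax_loop, ax_vec and err are introduced here for illustration.

```matlab
% Compare the for-loop from the question with the vectorized form.
N = 2e3;
idx_Neighbor = randi([1 N], 60, N);   % neighbor indices, one column per particle
p      = rand(N, 1);
delW_x = rand(60, N);

% Reference: the original loop.
ax_loop = zeros(N, 1);
for j = 1:N
    inner      = p(idx_Neighbor(:, j)) + p(j);
    ax_loop(j) = sum(inner .* delW_x(:, j));
end

% Vectorized: p(idx_Neighbor) is 60-by-N, p.' expands along the rows.
ax_vec = sum((p(idx_Neighbor) + p.') .* delW_x, 1).';   % transpose to N-by-1

err = max(abs(ax_vec - ax_loop));     % expected to be at rounding-error level
```

The same vectorized line should also work when p, idx_Neighbor and delW_x are gpuArrays, since it uses only indexing, element-wise arithmetic and sum.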

4 Comments

Hao Zhang on 11 Dec 2018

Yes, but one can apply uint8 to any of the code above. And if the indices are larger, one has to use uint32. But the improvement from vectorizing is so large that a bit more memory usage is worth it.

Maybe a MATLAB developer can say something about the efficiency compared to CUDA C. I hope it will be really close.

Derrick Ling on 20 Apr 2019

Hi, which code did you use to run on the GPU? arrayfun? gpuArray?

Where or how did you insert the code?

And is there any advice you would give to make this code nicer? An undefined function error appears if I remove bbb = 0.

bbb = 0;
bbb = bbb + h*W_4';

Answer 2

Joss Knight on 11 Dec 2018

https://jp.mathworks.com/matlabcentral/answers/435152-gpu-arrayfun-is-so-slow-what-is-going-on#answer_351763

You haven't called GPU arrayfun here, you've called CPU arrayfun and in the arrayfun function you are doing stuff on the GPU. This is because none of the arguments to your arrayfun call is a gpuArray.

You could force it to use GPU arrayfun by converting your input:

c = arrayfun(@(i) sum(a(:,i),1), gpuArray(1:N));

However, you'll immediately find it errors, because sum is not supported for GPU arrayfun. Obviously this is just a toy example, but the solution here is sum(a,1), not arrayfun.
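To illustrate the distinction, here is a minimal sketch of a call that does dispatch to GPU arrayfun: the data input is a gpuArray and the function is purely element-wise, with the reduction done outside arrayfun. The function body x*x + 1 and the name sq are arbitrary illustrations, not taken from the thread, and the code assumes Parallel Computing Toolbox with a supported GPU.

```matlab
% GPU arrayfun: gpuArray input + element-wise scalar function.
N = 2e3;
a = rand(60, N, 'single', 'gpuArray');

% Each element of a is processed independently on the GPU;
% the anonymous function only sees scalars.
sq = arrayfun(@(x) x*x + 1, a);   % 60-by-N gpuArray

% Reductions are not allowed inside GPU arrayfun, so sum afterwards.
b = sum(sq, 1);                   % 1-by-N gpuArray
```

For an expression this simple, plain vectorized code (a.*a + 1) does the same job; arrayfun tends to pay off when the per-element function is longer or contains branches.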

4 Comments

Hao Zhang on 11 Dec 2018

Hi Joss,

Thanks a lot for your reply! So the sum function is running on the GPU, but the arrayfun is actually running on the CPU. I got it, no wonder it is soooo slow.

Is it true that I cannot even pass something with the colon operator to GPU arrayfun? It seems that if I try to pass a(:,i), even if the function itself is supported for GPU arrayfun, I still get the following error message: Function passed as first input argument contains unsupported or unknown function 'colon'.

I understand that sum(a,1) is the solution here, but what if I need to do something more complex? For example, suppose I want to multiply (.*) each column of a (60 by 2e3) by another column vector b (60 by 1), and then sum over the first dimension of the resulting matrix, i.e., I need to do

c=sum(a.*repmat(b,1,2e3),1);

This leads to fast and correct results; however, it is not memory efficient, since I need to repmat the vector b into a big matrix. If the vector b is very long, this will eventually limit the size of the problem I can solve. Is there any way to do this in a memory-friendly way on the GPU?

Thanks!
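For the column-scaling question above, two repmat-free formulations are worth trying. This is a sketch, not necessarily what the replies in the thread proposed, and it assumes MATLAB R2016b or later (for implicit expansion) plus Parallel Computing Toolbox with a supported GPU. The array sizes follow the post; the names c_repmat, c_expand and c_mtimes are introduced here for comparison.

```matlab
N = 2e3;
a = rand(60, N, 'single', 'gpuArray');
b = rand(60, 1, 'single', 'gpuArray');

% Baseline from the post: explicitly replicates b into a 60-by-N matrix.
c_repmat = sum(a .* repmat(b, 1, N), 1);

% Implicit expansion: b is broadcast across the columns of a
% without building the replicated copy of b.
c_expand = sum(a .* b, 1);

% Matrix-vector product: sum_i a(i,j)*b(i) is exactly (b.' * a)(j),
% so no 60-by-N intermediate product is formed either.
c_mtimes = b.' * a;
```

All three produce a 1-by-N row vector; the results can differ in the last few bits because the summation order is not the same.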

Joss Knight on 11 Dec 2018

Thanks Matt.

GPU arrayfun is very special, you should read the documentation and list of supported functions. It only supports element-wise functionality, so you can't do any vector operations. That means you can't index an array unless you're indexing a single element or an up-level variable, you can't call sum or any other reduction or accumulation, and you can't output anything other than a scalar. This is because your arrayfun function gets compiled into a single CUDA kernel with no inter-thread communication. So it's incredibly useful and efficient when used within its limitations.

Nearly always (in my experience) when you want to do something more complex with vector operations, you can translate your code into a series of vectorized calls to normal MATLAB matrix functions, arrayfun, and pagefun.
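As a small illustration of the last tool mentioned, pagefun applies a matrix function to every page of a multidimensional gpuArray in one call. The sketch below batches 100 independent matrix products; the sizes and the names A, B and C are arbitrary choices for the example, and it assumes Parallel Computing Toolbox with a supported GPU.

```matlab
% Batched matrix multiply: C(:,:,k) = A(:,:,k) * B(:,:,k) for every page k.
A = rand(4, 3, 100, 'gpuArray');
B = rand(3, 5, 100, 'gpuArray');

C = pagefun(@mtimes, A, B);   % 4-by-5-by-100 gpuArray, no explicit loop over pages
```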

Hao Zhang on 11 Dec 2018

This is a brilliant solution! Thanks for making this happen.

However, I want to do something even more complex (and this is what I actually need to do, instead of the toy examples before :)). I need to run the following code:

clear;clc;close all;
N=2e3;
idx_Neighbor=randi([1 N],60,N);
p=rand(N,1);
delW_x=rand(60,N);
ax=zeros(N,1);
tic;
for j=1:N
    temp=idx_Neighbor(:,j);
    inner=p(temp)+p(j);
    ax(j)=sum(inner.*delW_x(:,j));
end
toc;

What is the most efficient way to vectorize the above code (so without using the for loop) and avoiding using repmat as much as possible? Thanks!
