Why is the gpuArray version of my code slower?

Question

Ariel Lanza, 15 July 2019

https://jp.mathworks.com/matlabcentral/answers/471725-why-is-the-gpuarray-version-of-my-code-slower

Commented: Walter Roberson, 17 July 2019

Hello,

I am doing some experiments with gpuArrays on a simple problem: for every pair of distinct rows of a random matrix, I want to find the difference between the sums of the elements in the two rows. (I am aware that this formulation of the problem is linear; however, let's assume that the function I am applying to the pair of rows is not linear.)

n = 100;
gpu = true;
if gpu
    returns = rand(n, 'gpuArray');
    ansmat = zeros(n*(n-1)/2, 3,'gpuArray');
else
    returns = rand(n);
    ansmat = zeros(n*(n-1)/2, 3);
end
itermat = 1;
for v1 = 1:n
    for v2 = 1:v1-1
        ansmat(itermat,1:2) = [v1,v2];
        itermat = itermat + 1;
    end
    for v2 = v1+1:n
        ansmat(itermat,1:2) = [v1,v2];
        itermat = itermat + 1;
    end
end
tic
for i=1:size(ansmat,1)
    sum1 = sum(returns(ansmat(i,1),:),2);
    sum2 = sum(returns(ansmat(i,2),:),2);
    ansmat(i,3) = sum1-sum2;
end
toc

When I set

gpu = true;

I get

>> ZZZ
Elapsed time is 3.120591 seconds.

When I set

gpu = false;

I get

>> ZZZ
Elapsed time is 0.016989 seconds.

I think I am missing a few fundamentals. For example, when I tried to rewrite my code in terms of arrayfun with

myfun = @(x, y) sum(returns(ansmat(x,1),:),2) - sum(returns(ansmat(y,2),:),2);
tic
ansmat(:,3) = arrayfun(@(x,y) myfun(x,y), ansmat(:,1), ansmat(:,2));
toc

it works without the GPU, however using the GPU there is a problem:

>> ZZZ
Error using ZZZ (line 33)
Use of functional workspace is not supported.
For more information see Tips and Restrictions.

But I could not find those tips.

What is the correct way to gain some speed in solving my problem using the GPU?

Walter Roberson, 15 July 2019

Indexing on the GPU is not efficient.

Walter Roberson, 15 July 2019

Using data as indices is particularly inefficient on the GPU.

Answer 1

Stephane Dauvillier, 15 July 2019

https://jp.mathworks.com/matlabcentral/answers/471725-why-is-the-gpuarray-version-of-my-code-slower#answer_383273

OK,

Basically, if your program takes less than a second on your CPU, it is no surprise that it runs slower on a GPU, or even on a remote cluster.

I'll assume you have a suitable GPU.

When dealing with parallel computation (GPU computing behaves like "normal" parallel computing in this respect), there are two kinds of time: computation and communication. Parallel execution beats sequential execution only when the computation time is larger than the communication time.

Let's imagine your CPU as one person and the cluster (in your case, the GPU) as a team of 10 people.

Let's assume you want the result of the operation 1+2+...+n. (Let's also assume you don't know the formula 1+2+...+n = n*(n+1)/2.)

If n is very big, you alone will take more time than the team of 10 people (each one computes a tenth of the sum, and then someone adds up the partial results).

Now imagine n isn't huge at all (let's say 20). Then you will spend more time splitting the work among the team than the team will spend actually doing the computation.

Bonus: after doing the calculation on the GPU, retrieve the result in your MATLAB session (that is, bring the result back from the GPU to the CPU) with the function gather.
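To make the team analogy concrete, here is a minimal sketch of the same sum done on the CPU and, when a GPU is available, on the GPU with the result brought back by gather. The variable names and the hasGPU check are my own assumptions for illustration, not part of the original post.

```matlab
% Sketch: compute 1 + 2 + ... + n on the CPU and, if possible, on the GPU.
n = 1e6;
expected = n*(n+1)/2;              % closed form, used here only as a check

cpuTotal = sum(1:n);               % sequential "one person" version

% Assumption: a nonzero gpuDeviceCount means a usable GPU is present.
hasGPU = exist('gpuDeviceCount', 'file') ~= 0 && gpuDeviceCount > 0;
if hasGPU
    x = gpuArray(1:n);             % communication: copy the data CPU -> GPU
    s = sum(x);                    % computation: the result stays on the GPU
    gpuTotal = gather(s);          % communication: copy the scalar GPU -> CPU
else
    gpuTotal = cpuTotal;           % no GPU: fall back so the script still runs
end
```

For a small n the two copies are likely to dominate, which is the "splitting the work takes longer than doing it" case described above.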

Stephane Dauvillier, 17 July 2019

What I mean is the same thing as Andrea Picciau: there is a cost to transferring data from CPU memory to GPU memory. If the computing time is less than the data transfer time, you will see what you observed.

Your data are too small to gain anything from parallel or GPU computing.

If you look at the "Speedup of computations on GPU compared to CPU" graph (the last one) here, you will see that when the matrix is tiny, the CPU is faster than the GPU.

By the way:

You can simplify (and optimize) the creation of ansmat:

ansmat = zeros(n*(n-1)/2,3);
ansmat(:,1:2) = nchoosek(1:n,2);

nchoosek gives you all the combinations of 2 elements chosen from the vector 1:n.

You can also vectorize your code: first compute the row sums of your matrix returns, then take the differences without a for loop:

SReturn = sum(returns,2);
SR1 = SReturn(ansmat(:,1),:);
SR2 = SReturn(ansmat(:,2),:);
ansmat(:,3) = SR1-SR2 ;

By doing so, on my PC, with n = 1e3, it takes 0.05 seconds on the CPU.
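Putting the two suggestions together, a complete version might look like the sketch below. It runs on the GPU when one is found and on the CPU otherwise; the hasGPU check and the names pairs, rowSums and diffs are my own assumptions for illustration.

```matlab
% Sketch: vectorized pairwise differences of row sums, on GPU or CPU.
n = 200;

% Assumption: a nonzero gpuDeviceCount means a usable GPU is present.
hasGPU = exist('gpuDeviceCount', 'file') ~= 0 && gpuDeviceCount > 0;
if hasGPU
    returns = rand(n, 'gpuArray');   % created directly on the GPU
else
    returns = rand(n);
end

pairs = nchoosek(1:n, 2);            % all n*(n-1)/2 unordered pairs, built on the CPU
rowSums = sum(returns, 2);           % one reduction over the whole matrix

% Index with whole vectors rather than one scalar per loop iteration.
diffs = rowSums(pairs(:,1)) - rowSums(pairs(:,2));
if hasGPU
    diffs = gather(diffs);           % bring the result back to the CPU once
end

ansmat = [pairs, diffs];             % columns: row index 1, row index 2, difference
```

Note that nchoosek lists each pair once, so ansmat has n*(n-1)/2 rows, matching the preallocation in the question.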

Stephane Dauvillier, 17 July 2019

"Is it possible to pass the variables ''returns'' and ''ansmat'' only one time to the GPU or do they need to be passed many times hence making the GPU implementation always slower?"

These variables are created directly on the GPU by

returns = rand(n, 'gpuArray');
ansmat = zeros(n*(n-1)/2, 3, 'gpuArray');

In that case the CPU sends only the variable n to the GPU.

That is why, in the for loop, the index is not on the GPU and needs to be sent.

Also note that most (not to say all) mathematical functions are optimized for the GPU, so the five arithmetic operators and their element-wise versions (+, -, *, /, ^, .*, ./, .^) are optimized for gpuArray.

which plus -all

returns, for instance (I have kept only the relevant line):

    C:\MATLAB\64bits\R2019a\toolbox\distcomp\gpu\@gpuArray\plus.m                 % gpuArray method

Walter Roberson, 17 July 2019

You need to know something about how CUDA works. CUDA has a whole bunch of arithmetic units, but it has only one instruction decoder per group of arithmetic units. When an instruction applies to only some of the arithmetic units (some of the data locations) managed by a particular controller, the controller sends "pause" to the unselected units while the selected units execute the instruction.

Now when you index by a scalar, at most one of the arithmetic units managed by a controller is going to be activated and the rest are going to be paused for the instruction. It is like going through a bunch of student solo recitals one by one, each arithmetic unit getting its time in the spotlight while the others stand around. If you were indexing an array of length 1000 element by element, then each of the arithmetic units would be idle 999/1000 of the time.

Reformulating code from scalar indexing to vectorized form can make a huge difference for CUDA.
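As a rough illustration of that reformulation for a function that is not linear, the sketch below compares a scalar-indexed loop with a version that indexes with whole vectors. The pairwise function sum(abs(r1 - r2).^1.5) is purely hypothetical, picked only because it cannot be reduced to a difference of row sums; the code runs on the CPU as written, and replacing rand(n) with rand(n, 'gpuArray') should exercise the same two patterns on a GPU.

```matlab
% Sketch: scalar-indexed loop versus vector-indexed form of a nonlinear pairwise function.
n = 50;
returns = rand(n);
pairs = nchoosek(1:n, 2);            % one row per unordered pair of rows
numPairs = size(pairs, 1);

% Loop version: each iteration touches two rows through scalar indices.
loopResult = zeros(numPairs, 1);
for k = 1:numPairs
    r1 = returns(pairs(k,1), :);
    r2 = returns(pairs(k,2), :);
    loopResult(k) = sum(abs(r1 - r2).^1.5);   % hypothetical nonlinear function
end

% Vectorized version: gather all first rows and all second rows at once,
% then apply the same function to the two full matrices.
R1 = returns(pairs(:,1), :);         % numPairs-by-n
R2 = returns(pairs(:,2), :);         % numPairs-by-n
vecResult = sum(abs(R1 - R2).^1.5, 2);

maxDiff = max(abs(loopResult - vecResult));   % should be at rounding level
```

The trade-off is memory: R1 and R2 each hold n*(n-1)/2 rows, so for large n the pairs may need to be processed in blocks.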

Answer 2

Andrea Picciau, 16 July 2019

https://jp.mathworks.com/matlabcentral/answers/471725-why-is-the-gpuarray-version-of-my-code-slower#answer_383375

Edited: Andrea Picciau, 16 July 2019

Hi Ariel,

There are three problems with your script.

1. Your code does a lot of for loops and indexing of gpuArray data, which kills performance (this would happen with any GPU code, not just MATLAB!). Have a look at this answer to see why this is a problem and how you can improve your code.
2. Your matrices and vectors are too small. You need matrices of at least 1000x1000 elements to amortise the cost of data transfer to and from the GPU.
3. You're not measuring GPU performance correctly. You should use timeit and gputimeit, as I discuss in this other answer.
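For the third point, a minimal timing sketch could look like the following. The operation being timed (a row sum of a 1000x1000 matrix) and the hasGPU check are my own choices for illustration; the GPU branch is skipped when no device is found.

```matlab
% Sketch: time the same operation with timeit (CPU) and gputimeit (GPU).
n = 1000;                            % matrix size, an arbitrary choice
A = rand(n);

tCPU = timeit(@() sum(A, 2));        % timeit calls the function repeatedly and returns a typical time
fprintf('CPU: %.3g s\n', tCPU);

% Assumption: a nonzero gpuDeviceCount means a usable GPU is present.
hasGPU = exist('gpuDeviceCount', 'file') ~= 0 && gpuDeviceCount > 0;
if hasGPU
    G = gpuArray(A);                 % transfer once, outside the timed function
    tGPU = gputimeit(@() sum(G, 2)); % gputimeit makes sure the GPU work has finished before it stops timing
    fprintf('GPU: %.3g s\n', tGPU);
end
```

With tic and toc alone, the GPU timing can be misleading because GPU operations may still be running when toc is called.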
