Same gpu operation in loop but two speeds

Question

François Fabre, 6 May 2020

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/523392-same-gpu-operation-in-loop-but-two-speeds

Commented: François Fabre, 7 May 2020

I am running the code below on 64-bit Windows 10 with an Intel i5-8300H and an NVIDIA GTX 1060:

dev = gpuDevice;
time = zeros(1, length(t)); % preallocation of time
time2 = zeros(1,length(t)); % preallocation of time2
for i = 2:length(t)-1 %looping over time vector 
    wait(dev); tic;
    % Computation of dot product between A(:, i:-1:1) and C(:,1:i,1) + C(:,1:i,2) for each row
    approx_conv_pair = sum(reshape(A(txN-i*N+1:txN-N).*(C(1:(i-1)*N) + C(txN+1:txN+(i-1)*N)),[N,i-1]),2);
    wait(dev); time(i)=toc;
    wait(dev); tic;
    % Computation of dot product between B(:, i:-1:1) and C(:,1:i,1) - C(:,1:i,2) for each row
    approx_conv_impair = sum(reshape(B(txN-i*N+1:txN-N).*(C(1:(i-1)*N) - C(txN+1:txN+(i-1)*N)),[N,i-1]),2);
    wait(dev); time2(i)=toc;
end

A, B and C are gpuArrays of size (N, length(t)), (N, length(t)) and (N, length(t), 2), respectively.

My indexing simply accesses A(:, i:-1:2), B(:, i:-1:2), C(:, 1:i-1, 1) and C(:, 1:i-1, 2) as column vectors.
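The correspondence between linear-index ranges and column blocks can be checked on a small CPU array. The sketch below covers only the two slices of C, and it assumes txN = N*length(t), which the post does not state. The sizes N = 3, T = 5 and the loop value i = 4 are arbitrary stand-ins.

```matlab
% Small CPU-only check; sizes are arbitrary and txN = N*T is an assumption.
N = 3; T = 5; i = 4;
txN = N*T;                         % assumed definition, not stated in the post
C = reshape(1:N*T*2, [N, T, 2]);   % stand-in data with distinct entries

% Linear-index ranges as used in the loop body
p1 = C(1:(i-1)*N);
p2 = C(txN+1:txN+(i-1)*N);

% The same elements addressed with subscripts
q1 = C(:, 1:i-1, 1);
q2 = C(:, 1:i-1, 2);

assert(isequal(p1(:), q1(:)));     % first page, columns 1..i-1
assert(isequal(p2(:), q2(:)));     % second page, columns 1..i-1

% Reshaping a linear slice to [N, i-1] recovers the column block
assert(isequal(reshape(p1, [N, i-1]), q1));
```

Because MATLAB stores arrays in column-major order, a contiguous linear range that starts at a column boundary covers whole columns, which is why the reshape to [N, i-1] in the loop body is valid.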

Here is what I obtain when I plot time and time2 against the loop index i, with N = 252 and length(t) = 13200 (plot not reproduced here).

Does anybody know why there is such a difference between the execution times?

Is it due to my way of coding or something linked to overhead time on the GPU?

FYI, I tried swapping the order of approx_conv_pair and approx_conv_impair, and I observe the same problem (the second operation is almost twice as fast).

Thank you in advance!

Answer 1

Joss Knight, 6 May 2020

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/523392-same-gpu-operation-in-loop-but-two-speeds#answer_430819

MATLAB applies some optimisations when it sees you doing the same thing repeatedly. The one that matters here concerns memory allocation for the temporary variables you create, such as the result of A(txN-i*N+1:txN-N).*(C(1:(i-1)*N) + C(txN+1:txN+(i-1)*N)). Because this temporary changes size on every iteration, new allocations have to be made each time. In the second operation, however, the memory allocated for the first operation has been pooled and can be reused, so the cost of allocating memory (except for the output) is eliminated and you see only the raw cost of computation.
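One way to probe this explanation is to time two identical operations back to back, with temporaries that grow on every iteration. The following is only a rough sketch: it needs Parallel Computing Toolbox and a supported GPU, the arrays are random stand-ins, T is reduced from the value in the question to keep the run short, and whether the gap shows up may depend on the hardware and MATLAB release.

```matlab
% Sketch: time the same variable-size operation twice per iteration.
dev = gpuDevice;
N = 252; T = 2000;                 % T chosen smaller than in the question
A = rand(N, T, 'gpuArray');        % random stand-in data
C = rand(N, T, 'gpuArray');

tFirst  = zeros(1, T);
tSecond = zeros(1, T);
for k = 2:T
    n = (k-1)*N;                   % temporary size changes every iteration

    wait(dev); tic;
    r1 = sum(reshape(A(1:n) .* C(1:n), [N, k-1]), 2);
    wait(dev); tFirst(k) = toc;

    wait(dev); tic;
    r2 = sum(reshape(A(1:n) .* C(1:n), [N, k-1]), 2);   % identical work
    wait(dev); tSecond(k) = toc;
end

fprintf('median first: %.3g s, median second: %.3g s\n', ...
    median(tFirst(2:end)), median(tSecond(2:end)));
```

If the second timing is consistently lower even though both lines do the same work, that is consistent with the second call reusing memory pooled after the first.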

François Fabre, 7 May 2020

Thank you for this clear answer. That's a good thing to have in mind! Then I guess I'll have to approach my problem differently.
