Optimize GPU code with nested pagemtimes

Hello all,
I'm trying to speed up a computation using the GPUs that are available to me. Right now I have two arrays, Q and W.
size(W) = (16 1 1000)
size(Q) = (16 16 1 2000)
I want to do a pseudo-matrix multiplication M = W' * Q * W to get size(M) = (1000 2000).
To do this I use two calls to pagemtimes, which can run on the GPU. Here's the code:
%%
tic
Sar_pm_gpu = zeros(num_psar_kept,2,size(shim_pm_gpu,3),'single','gpuArray');
for n = 1:size(W,3)
    inter_calc = pagemtimes(Q_gpu,shim_pm_gpu(:,1,n));
    Sar_this_shim = squeeze(pagemtimes(shim_pm_gpu_left(:,:,n),inter_calc)); % in a test, this one is ~15% faster
    [Sar_maxk, index_maxk] = max(Sar_this_shim);
    Sar_pm_gpu(:,:,n) = [Sar_maxk,index_maxk];
end
With this code I get a ~5x speedup vs. running it on the CPU. However, I'd expect it to be quite a bit faster than that. I then used nvidia-smi, and the power consumption on the GPU was ~35 W. For reference, the resting power consumption is 30 W, so I don't think this code is actually utilizing the GPU. If anyone sees a way to speed this up, it would be much appreciated! (An explanation of why the GPU power consumption is so low with the posted code would also be much appreciated; I assume it has something to do with memory.)

2 Comments

Matt J on 28 Jul 2022
You shouldn't be using tic/toc for timing gpuArray operations,
tiwwexx on 29 Jul 2022
I clipped off the end of the code by accident. I make sure to call
gather(output)
before calling toc, so the timing is accurate.
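
For reference, here is a minimal sketch of two common ways to time gpuArray code. It assumes Parallel Computing Toolbox and a supported GPU; the workload `f` and the matrix size are placeholders for illustration, not code from this thread. The truncated comment above may or may not have been pointing at `gputimeit`.

```matlab
% Hypothetical workload: any function handle that returns a gpuArray.
A = rand(2000,'single','gpuArray');
f = @() A*A;

% Option 1: gputimeit calls f several times and synchronizes with the
% device, so the reported time covers the completed GPU work.
tAuto = gputimeit(f);

% Option 2: tic/toc with explicit synchronization around the timed region.
dev = gpuDevice;
wait(dev);            % make sure earlier GPU work has finished
tic
B = f();
wait(dev);            % block until the multiplication is complete
tManual = toc;
```

Calling gather before toc also forces the computation to finish, but the measured time then includes the device-to-host transfer.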


Accepted Answer

Matt J on 28 Jul 2022

1 vote

I don't think you need either a loop or a second pagemtimes call.
Wr=reshape(W,16,1000);
Qr=reshape(Q,16,16,2000);
M=sum(pagemtimes(Qr,Wr).*Wr,1);
M=reshape(M,1000,2000);
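
A quick way to convince yourself that the reshaped expression matches M(i,j) = W(:,i)' * Q(:,:,j) * W(:,i) is to compare it against an explicit double loop on small CPU arrays. This is only a sketch: the sizes `n`, `nW`, `nQ` are made up for the check, and `conj(Wr)` is an addition that matters only if W is complex (for real W it is a no-op and the expression reduces to the one above).

```matlab
% Small CPU check of the vectorized quadratic form against a loop.
rng(0);
n = 4; nW = 5; nQ = 3;                 % hypothetical small sizes
W = rand(n,1,nW);
Q = rand(n,n,1,nQ);

Wr = reshape(W,n,nW);                  % n-by-nW
Qr = reshape(Q,n,n,nQ);                % n-by-n-by-nQ

% pagemtimes(Qr,Wr) is n-by-nW-by-nQ; multiplying by conj(Wr) and summing
% over dim 1 gives W(:,i)'*Q(:,:,j)*W(:,i) for every pair (i,j).
M = reshape(sum(pagemtimes(Qr,Wr).*conj(Wr),1),nW,nQ);

% Reference result from an explicit double loop.
Mref = zeros(nW,nQ);
for i = 1:nW
    for j = 1:nQ
        Mref(i,j) = Wr(:,i)' * Qr(:,:,j) * Wr(:,i);
    end
end
maxErr = max(abs(M(:) - Mref(:)));
```

Note that the intermediate array has 16*1000*2000 elements at the sizes in the question (about 128 MB in single precision), so it needs to fit in GPU memory.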

4 Comments

tiwwexx on 28 Jul 2022
Edited: tiwwexx on 28 Jul 2022
That works amazingly well. Varying the 1000 and 2000 dimensions, this code gets linearly faster than the original code. @Matt J, would you mind giving a brief explanation as to why your code is so much faster than the code below? (A reference link, or a way to see what memory calls are being made in these different MATLAB functions, would also be very much appreciated!)
squeeze(pagemtimes(W,'ctranspose',pagemtimes(Qt,W),'none'))
Matt J on 28 Jul 2022
"Would you mind a brief explanation as to why this code is so much faster than the code below."
I don't find it to be so much faster. On my machine, it is only a little bit faster:
W=rand(16,1,1000);
Q=rand(16,16,1,2000);
timeit(@()version1(Q,W))
ans = 0.0974
timeit(@()version2(Q,W))
ans = 0.1330
function M=version1(Q,W)
Wr=reshape(W,16,1000);
Qr=reshape(Q,16,16,2000);
M=sum(pagemtimes(Qr,Wr).*Wr,1);
M=reshape(M,1000,2000);
end
function M=version2(Q,W)
M = squeeze(pagemtimes(W,'ctranspose',pagemtimes(Q,'transpose',W,'none'),'none'));
end
Matt J on 28 Jul 2022
Edited: Matt J on 28 Jul 2022
It seems to be slower only on the GPU. pagemtimes isn't well-optimized for the GPU, it would appear.
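
To repeat the comparison on the GPU, something like the sketch below could be used. It assumes Parallel Computing Toolbox and a supported GPU, wraps the bodies of version1 and version2 in anonymous function handles, and uses single precision as in the question; no timings are quoted here because they depend on the hardware and release.

```matlab
% Sketch: time both formulations on the GPU with gputimeit.
W = rand(16,1,1000,'single','gpuArray');
Q = rand(16,16,1,2000,'single','gpuArray');

% The reshapes are done once, outside the timed handles.
Wr = reshape(W,16,1000);
Qr = reshape(Q,16,16,2000);

% Same computations as version1 and version2 in the comment above.
v1 = @() reshape(sum(pagemtimes(Qr,Wr).*Wr,1),1000,2000);
v2 = @() squeeze(pagemtimes(W,'ctranspose', ...
                 pagemtimes(Q,'transpose',W,'none'),'none'));

t1 = gputimeit(v1);    % one pagemtimes call plus elementwise multiply and sum
t2 = gputimeit(v2);    % two nested pagemtimes calls

% For real-valued W the two results should agree up to rounding error.
M1 = v1();
M2 = v2();
relDiff = gather(max(abs(M1(:) - M2(:))) / max(abs(M1(:))));

fprintf('version1: %.4f s, version2: %.4f s, relDiff: %.2e\n', t1, t2, relDiff);
```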
tiwwexx on 28 Jul 2022
Hmm, very interesting indeed. I have a feeling that I'm eventually going to need to learn CUDA since I run into these problems quite often...


More Answers (0)

Release: R2021b

Asked: 28 Jul 2022

Commented: 29 Jul 2022
