Matrix multiplication optimization using GPU parallel computation

Dear all,
I have two questions.
(1) How do I monitor GPU core usage when I am running a simulation? Is there any visual tool to dynamically check GPU core usage?
(2) Mathematically the new and old approaches are the same, so why is the new approach 5-10 times faster?
%%% Code for new approach %%%
M = gpuArray(M);
for nt = 1:STEPs
    if (there is a periodic boundary condition)  % pseudocode condition
        M = A1 * M + A2 * f * M;
    else
        % diffusion
        M = A1 * M;
    end
end

6 comments

Jan on 18 Aug 2022
Just curious: What timings do you get for:
M = (A1 + A2 * f) * M;
Are A1, A2 and f gpuArrays also?
Nick on 18 Aug 2022
Hi Jan,
The old approach takes about 600 seconds and the new approach about 120 seconds on the consumer GPU (a 5x speedup); on the professional GPU card the speedup is about 10x.
Do you know how we can check GPU core and memory usage in real time (graphically, if possible)?
Thanks!
Jan on 18 Aug 2022
@Nick: I do not understand what the "old" and the "new" approaches are. I asked for the speed of:
M = (A1 + A2 * f) * M;
which might avoid a matrix multiplication. Are A1, A2, and f gpuArrays as well?
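As an aside, the algebraic equivalence Jan's refactoring relies on, (A1 + A2*f)*M == A1*M + A2*f*M, is easy to check on a toy example. The sketch below uses plain Python with made-up 2x2 matrices (not the poster's actual data) to confirm the identity and to show why the factored form saves work: the operator A1 + A2*f can be formed once outside the time loop, leaving a single product per iteration.

```python
# Hedged sketch (plain Python, not MATLAB): verifies that the two update
# rules compared in this thread are algebraically identical on tiny
# made-up dense matrices.

def matmul(X, Y):
    """Naive dense matrix product."""
    n, k, m = len(X), len(Y), len(Y[0])
    return [[sum(X[i][p] * Y[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matadd(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(len(X[0]))]
            for i in range(len(X))]

A1 = [[1.0, 2.0], [0.0, 1.0]]   # made-up example data
A2 = [[0.5, 0.0], [1.0, 0.5]]
f  = [[2.0, 1.0], [0.0, 2.0]]
M  = [[1.0], [3.0]]

# Original update: three matrix products plus one addition per step.
two_step = matadd(matmul(A1, M), matmul(matmul(A2, f), M))

# Refactored update: form the operator once, then one product per step.
op = matadd(A1, matmul(A2, f))
one_step = matmul(op, M)

print(two_step)   # -> [[9.5], [11.0]]
print(one_step)   # -> [[9.5], [11.0]]
assert two_step == one_step
```

The identity follows from distributivity and associativity of matrix products, so the two loops compute the same M at every step; any timing difference comes purely from the amount of work per iteration.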
Nick on 19 Aug 2022
I tried making A1, A2, and f gpuArrays. It does not improve the calculation speed.
Jan on 19 Aug 2022
Okay. As far as I understand, you do not want to tell me the speed difference between
M = A1 * M + A2 * f * M;
and
M = (A1 + A2 * f) * M
and you do not want to show the complete code for the "old" implementation. Then I cannot estimate whether storing the data in "B(t_n)" is a cause of the problem.
Nick on 20 Aug 2022
Hi Jan,
The following table summarizes the computation times for the different approaches with the GPU enabled and disabled.
The new one-step approach shows no improvement.


Accepted Answer

Matt J on 18 Aug 2022 (edited 18 Aug 2022)

0 votes

Because in your second formulation, there is no need to build a table of non-zero entries for the sparse matrix B. The table-building step requires sorting operations, which your second version avoids.
Also, if B has many columns, it will consume a lot of memory in proportion to the number of columns (independent of the sparsity). That is avoided as well by the second implementation.
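To illustrate what "a table of non-zero entries" means here, the following is a hedged, simplified sketch in plain Python (COO-style triplets, not MATLAB's actual compressed-column storage): adding two sparse matrices forces construction of a new nonzero table, including a sort/merge over the union of both sparsity patterns, whereas a product with an already-built matrix just streams its stored entries. All matrices and values below are made up for illustration.

```python
# Hedged illustration: why forming a new sparse matrix (B = A1 + A2)
# costs more bookkeeping than reusing the existing ones for mat-vec.
# {(row, col): value} is a simplification of real sparse storage.

A1 = {(0, 0): 2.0, (1, 2): 3.0}
A2 = {(0, 0): 1.0, (2, 1): 5.0}

def sparse_add(X, Y):
    # The union of the two sparsity patterns must be built and re-sorted
    # into canonical column-major order -- the "table building" step.
    out = dict(X)
    for key, v in Y.items():
        out[key] = out.get(key, 0.0) + v
    return sorted(out.items(), key=lambda kv: (kv[0][1], kv[0][0]))

def sparse_matvec(X, b):
    # No new table: just stream the entries each matrix already stores.
    y = [0.0] * (1 + max(r for r, _ in X))
    for (r, c), v in X.items():
        y[r] += v * b[c]
    return y

b = [1.0, 2.0, 3.0]
print(sparse_add(A1, A2))    # newly merged, sorted nonzero table
print(sparse_matvec(A1, b))  # -> [2.0, 9.0]
print(sparse_matvec(A2, b))  # -> [1.0, 0.0, 10.0]
```

In the loop from the question this distinction matters because the table-building cost would be paid on every iteration, while the streaming cost is the part both versions share.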

10 comments

Nick on 18 Aug 2022
Hi Matt,
Thanks for your insights!
(1) I am surprised that MATLAB didn't optimize this step in the old approach (by simple substitution, as in the new approach). Is this due to MATLAB's line-by-line script execution? Will MATLAB apply this optimization if I compile the program into a standalone EXE?
(2) Do you know how we can check GPU core and memory usage in real time (graphically, if possible)?
(3) With (2), can we time and monitor the GPU calculation in real time?
Thanks!
Matt J on 19 Aug 2022 (edited 21 Aug 2022)
I am surprised that MATLAB didn't optimize this step in the old approach (by simple substitution, as in the new approach). Is this due to MATLAB's line-by-line script execution?
If you write code instructing MATLAB to create a matrix B, MATLAB must assume you actually want to use B, even if B is never used later in the code. If, for example, you were to insert a breakpoint in the code, B would need to be available so that you could examine it.
Will MATLAB apply this optimization if I compile the program into a standalone EXE?
If you convert the code to C/C++ with MATLAB Coder, it might.
Nick on 22 Aug 2022
Hi Matt,
Do you have any reference on how MATLAB builds a matrix (e.g., the sorting operations you mentioned)? Is it done on the CPU?
Or, how does MATLAB handle matrices in general?
I observed about 70% CPU usage after gpuArray is called. What operations does the CPU perform while the GPU is running?
Thanks!
Matt J on 22 Aug 2022
If you build the matrix on the CPU and then transfer it to the GPU, that would explain why you see CPU activity.
Nick on 29 Aug 2022
Hi Matt, do you have some reference reading on MATLAB's matrix handling?
Thanks!!
Matt J on 29 Aug 2022 (edited 29 Aug 2022)
There is this rather general doc,
but I want to emphasize that nothing you are seeing is likely related to CPU/GPU transfers or GPU versus CPU differences. It is simply more expensive to create a sparse matrix than to do matrix/vector multiplication with that matrix, even in the plain vanilla case where all processing is done on the CPU (see below). In your case, by avoiding the creation of an additional sparse matrix B, your second version avoids very obvious overhead.
A1=sprand(1e5,1e5,0.001);
A2=sprand(1e5,1e5,0.001);
b=rand(1e5,1);
tic;
B=A1+A2;
toc
Elapsed time is 0.185736 seconds.
tic
B*b;
toc
Elapsed time is 0.039143 seconds.
tic
A1*b; A2*b;
toc
Elapsed time is 0.037451 seconds.
Nick on 3 Sep 2022
Hi Matt, Thanks!
Nick on 19 Jan 2023
Hi Matt,
I am trying to better understand what you said here: "there is no need to build a table of non-zero entries for the sparse matrix B".
Do you know how MATLAB manages sparse array elements? For example, in a 1000x1000 sparse matrix with only 100 non-zero elements, will MATLAB save the non-zero elements in a table? If so, will any operation on those non-zero elements cause the sorting operations you mentioned above?
Do you have some MATLAB reference about the sparse array handling?
Thanks in advance!
Matt J on 19 Jan 2023 (edited 19 Jan 2023)
Do you know how MATLAB manages sparse array elements?
Here is some detail on how sparse matrices are stored:
If so, will any operation on those non-zero elements cause the sorting operations you mentioned above?
If a new sparsity pattern is generated, then it will. Here is another example showing how this can make sparse operations slower than full operations:
N=5000;
A=sprand(N,N,1/5);
B=sprand(N,N,1/5);
tic;
A+B;
toc; %sparse matrix addition
Elapsed time is 0.085529 seconds.
A=full(A); B=full(B);
tic
A+B;
toc %full matrix addition
Elapsed time is 0.049478 seconds.
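To make "new sparsity pattern" concrete, here is a hedged sketch of compressed sparse column (CSC) storage, the scheme MATLAB's sparse matrices use: three arrays holding column pointers, row indices, and values. The sketch is plain Python rather than MATLAB, and the 3x3 example matrix is made up.

```python
# Hedged CSC storage sketch for the made-up matrix
#     [10  0  0]
# A = [ 0  0 20]
#     [30  0 40]

col_ptr = [0, 2, 2, 4]       # col j's entries sit at positions col_ptr[j]:col_ptr[j+1]
row_idx = [0, 2, 1, 2]       # row index of each stored nonzero (column-major order)
values  = [10.0, 30.0, 20.0, 40.0]

def csc_get(i, j):
    """Look up A[i][j]; zero if (i, j) is not in the stored pattern."""
    for p in range(col_ptr[j], col_ptr[j + 1]):
        if row_idx[p] == i:
            return values[p]
    return 0.0

dense = [[csc_get(i, j) for j in range(3)] for i in range(3)]
print(dense)  # -> [[10.0, 0.0, 0.0], [0.0, 0.0, 20.0], [30.0, 0.0, 40.0]]

# Overwriting an existing nonzero in-place is cheap.  Inserting a
# nonzero at a *new* (i, j) would force shifting row_idx/values and
# rewriting col_ptr -- i.e., rebuilding the pattern, which is the
# expensive part whenever an operation produces a new sparsity pattern.
```

This is why operations that only reuse an existing pattern (like multiplying by a fixed sparse matrix) avoid the overhead that pattern-creating operations (like sparse addition) pay.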
Nick on 23 Jan 2023
Matt,
Thank you!


More Answers (1)

Joss Knight on 19 Aug 2022

1 vote

The Windows Task Manager lets you track GPU utilization and memory graphically, and the utility nvidia-smi lets you do it in a terminal window.
Neither the CUDA driver nor the runtime provides access to which core is running what, although you might be able to hand-code something using NVML.

3 comments

Nick on 19 Aug 2022
Hi Joss,
Thanks for your tip.
It does consume a lot of CPU power (~70%) after executing the gpuArray command, and this drops to 12% at the end of the GPU simulation.
I observed the dedicated GPU memory usage increase from 0.2 GB to 2.6 GB, but all the GPU performance graphs (e.g. 3D, Copy, Video Encode and Decode) stay at almost 0% usage with only tiny ripples.
I am curious which GPU parameter is the key indicator for matrix multiplication. Would you please advise?
Joss Knight on 20 Aug 2022
Ah, I forgot that you cannot see utilization information for GeForce cards, sorry. Those charts are for graphics and so are not relevant for compute (except the memory one).
You'll have to use nvidia-smi.
Nick on 29 Aug 2022
Hi Joss, thanks for your info!


Release: R2022a
Asked: 18 Aug 2022
Last comment: 23 Jan 2023
