GPU backslash performance much slower than CPU

I am doing numerical power flow calculation by modifying the functions of MATPOWER, an open-source toolbox. By modifying its function newtonpf.m, GPU computation can be enabled. However, I found that GPU performance is much, much slower than CPU performance. When calculating MATPOWER's built-in case3012wp, the matrices in newtonpf.m are:
A: 5725 × 5725 sparse double, b: 5725 × 1 double.
Solving A \ b in the first iteration of newtonpf() generally takes around 0.01 s on my i7-10750H + RTX 2070 Super (MSI GL65).
But if A and b are converted to gpuArray, A \ b takes the following times depending on the type of A:
full double: 0.8 s
sparse double: 4 s
full single: 0.1 s
(sparse single is not supported)
So why the difference in performance? I thought the GPU could do things much faster than the CPU.
Files are attached as follows. Atest is sparse and Agpu is a sparse gpuArray; all are double.
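For reference, a minimal timing script along these lines (the .mat file name is illustrative; the variable names Atest and b are assumed from the attachments):

```matlab
% Load the attached system (file and variable names assumed)
load('Atest.mat', 'Atest', 'b');

% CPU: sparse double
tCPU = timeit(@() Atest \ b);

% GPU: move the data once, then time only the solve
Agpu = gpuArray(Atest);          % sparse gpuArray (double)
bgpu = gpuArray(b);
tGPUsparse = gputimeit(@() Agpu \ bgpu);

Afull = gpuArray(full(Atest));   % full gpuArray (double)
tGPUfull = gputimeit(@() Afull \ bgpu);

fprintf('CPU sparse: %.4g s\nGPU sparse: %.4g s\nGPU full: %.4g s\n', ...
        tCPU, tGPUsparse, tGPUfull);
```

timeit and gputimeit both run the function several times and exclude first-call overhead, so the comparison is not skewed by GPU warm-up or host-to-device transfer.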

9 Comments

Walter Roberson, 27 Dec 2020
A\b or A/b ??
Meme Young, 27 Dec 2020
A \ b, my bad
Matt J, 27 Dec 2020 (edited)
What graphics card are you using? The easiest would be to show us the output of
>> gpuDevice
Also, I recommend attaching a .mat file containing A and b.
Meme Young, 27 Dec 2020
Just as I mentioned, it's the RTX 2070 Super, notebook version.
Name: 'GeForce RTX 2070 Super'
Index: 1
ComputeCapability: '7.5'
SupportsDouble: 1
DriverVersion: 10.2000
ToolkitVersion: 10.2000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
AvailableMemory: 5.6821e+09
MultiprocessorCount: 40
ClockRateKHz: 1380000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Matt J, 27 Dec 2020
I recommend attaching a .mat file containing A and b.
Meme Young, 27 Dec 2020
The files are attached.
Matt J, 27 Dec 2020
Please attach all the variables in a single .mat file, to make the download more convenient.
kant, 26 May 2022
I also have this problem with my MATLAB code. Has it been solved?
Matt J, 26 May 2022 (edited)
@kant It has been concluded that this is expected behavior, but see below.


Answers (1)

Matt J, 27 Dec 2020


This thread looks relevant. It appears that sparse mldivide on the GPU is not expected to be faster.

13 Comments

Meme Young, 28 Dec 2020
I have found that backslash (\) on sparse gpuArrays is slower than on full gpuArrays.
What surprised me is that CPU sparse arrays are a lot faster than GPU full arrays. This makes GPU computation unattractive for power flow, which usually requires \ on a sparse matrix.
I have uploaded another zip file with the mat files combined and an m file. Just run the m file and you can see the difference.
Matt J, 28 Dec 2020
What surprised me is that CPU sparse arrays are a lot faster than GPU full arrays.
There's no reason to think that GPU full arrays will be faster than CPU sparse arrays.
Matt J, 28 Dec 2020 (edited)
The condition number of your Atest matrix is quite poor:
>> cond(full(Atest))
ans =
2.1049e+06
When the condition number is better, I find that the advice provided at the link I gave you works quite favorably:
N=5725;
A=sprand(N,N,0.005);
A=A.'*A+speye(N);
b=rand(N,1);
Ag=gpuArray(A);
bg=gpuArray(b);
timeit(@()A\b) %0.8228 seconds
timeit(@()pcg(A,b,1e-6,1e3)) %0.2709 seconds
gputimeit(@()pcg(Ag,bg,1e-6,1e3)) %0.0538 seconds
Walter Roberson, 28 Dec 2020
Whether the GPU on full() would be faster than the CPU on the sparse array depends on the sparsity.
At one end, an empty sparse array is easily detected and the CPU could finish quickly. At the other end, a sparse array that is mostly filled in (sparse in name only) would be processed faster as full() on the GPU.
Matt J, 28 Dec 2020 (edited)
At one end, an empty sparse array is easily detected and CPU could finish it quickly. At the other end, a sparse array that is mostly filled in (sparse in name only) would be faster processed in full() on GPU.
Yes, but why is the GPU so counterproductive when the matrix truly is sparse?
N=5725;
A=sprand(N,N,0.001);
A=A.'*A+speye(N);
b=rand(N,1);
Ag=gpuArray(A);
bg=gpuArray(b);
timeit(@()A\b) %0.3567 seconds
gputimeit(@()Ag\bg) %38 seconds
Meme Young, 28 Dec 2020
Thank you, man. I think we have found a limitation of GPU computing. Sadly, in most power flow calculations for networks with more than 300 buses, the condition number of the Jacobian matrix (Atest) in the first iteration is on the order of 1e5 to 1e6, as I have tested. So maybe the speed-up of GPU computing is meaningless for such applications :( . I think the GPU speed-up works much better for element-wise division (./) than for the matrix backslash.
Matt J, 28 Dec 2020 (edited)
If the condition number is 1e5 to 1e6, I question whether even the CPU is giving you a meaningful result. Surely there is something you should be doing to regularize the problem...?
Meme Young, 28 Dec 2020
I think it is because MATLAB does not provide optimized algorithms for truly sparse GPU matrices. GPU computing is not as well supported and validated as CPU computing, since it is still a newly developed, immature technology.
Matt J, 28 Dec 2020 (edited)
No, as I showed above, the GPU outperforms the CPU when the equations are well-conditioned. When they are ill-conditioned, speed is irrelevant: the solution is too numerically sensitive to be of any value.
Meme Young, 28 Dec 2020
Easy, man. 1e5 to 1e6 is not that horrible. MATLAB reports a warning only when you backslash a matrix with rcond ≈ 1e-15, i.e. cond ≈ 1e15. You can check by testing whether Atest \ Atest equals eye(); the error is around 1e-16, which is numerically precise, so backslash on Atest is still accurate.
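A quick way to quantify this is the relative residual of the solve rather than the condition number alone; a minimal check, assuming Atest and b are loaded from the attachments:

```matlab
x = Atest \ b;                          % CPU sparse solve
relres = norm(Atest*x - b) / norm(b)    % relative residual of the solution

% With cond(Atest) ~ 2.1e6 and double precision (eps ~ 2.2e-16), the
% worst-case relative forward error is roughly cond*eps ~ 5e-10, so the
% solution still carries about 9-10 correct digits.
```

A small residual together with a moderate condition number is what makes the direct solve here trustworthy despite the 1e6 conditioning.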
Joss Knight, 29 Dec 2020
We recommend the sparse solver algorithms with preconditioning for solving sparse systems on the GPU (and on the CPU in most cases). Direct solves using the backslash operator are generally inefficient to compute.
Meme Young, 30 Dec 2020
What do you mean by sparse solver algorithms, Mr Knight? Like pcg()? I have tried it, and it is not as efficient as this approach: reordering with amd(), LU decomposition, and two backslashes based on the factors, especially when coping with the type of sparse matrix that I uploaded.
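The reorder-then-factor approach described here might look like the following (a generic sketch, not the poster's attached code):

```matlab
% Solve Atest*x = b via fill-reducing reordering + sparse LU
p = amd(Atest);                    % approximate minimum degree ordering
[L, U, P, Q] = lu(Atest(p, p));    % sparse LU: P*Atest(p,p)*Q = L*U

% Two triangular backslashes in place of one direct solve
y = Q * (U \ (L \ (P * b(p))));    % solves Atest(p,p)*y = b(p)
x(p, 1) = y;                       % undo the reordering
```

Once L and U are cached, repeated Newton iterations with the same sparsity pattern only pay for the cheap triangular solves, which is why this can beat both backslash and an unpreconditioned pcg() on matrices like this.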
Joss Knight, 10 Jan 2021 (edited)
Yes: PCG, GMRES, CGS, LSQR, QMR, TFQMR, BICG, BICGSTAB. Try them all and play with the tolerance, iteration count, and preconditioning; something is likely to work. I'm not an expert in this field, but this is what the sparse community tends to do.
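For a symmetric positive definite system like the earlier sprand examples, a preconditioned iterative solve along the lines Joss describes might look like this (the ichol preconditioner and the tolerance are illustrative choices):

```matlab
% Well-conditioned SPD test system, as in the earlier comments
N = 5725;
A = sprand(N, N, 0.001);
A = A.'*A + speye(N);              % make it symmetric positive definite
b = rand(N, 1);

% Incomplete Cholesky preconditioner: M = Lp*Lp' approximates A
Lp = ichol(A);

% Preconditioned conjugate gradients; compare against plain pcg
[x, flag, relres, iter] = pcg(A, b, 1e-6, 1000, Lp, Lp');
```

A good preconditioner cuts the iteration count sharply, and on the GPU each iteration is dominated by sparse matrix-vector products, which parallelize far better than the sparse triangular solves inside backslash.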



Asked: 27 Dec 2020
Edited: 26 May 2022
