Preconditioning for iterative solvers on GPU - Performance issues

Paulo Ribeiro on 14 Nov 2019

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/491178-preconditioning-for-iterative-solvers-on-gpu-performance-issues

Commented: Joss Knight on 25 Nov 2019

Dear all,

I'm experimenting with some preconditioners for iterative solvers on the GPU for a linear system [A]{x}={B}. The problem is defined by this simple command:

sol=pcg(A_gpu,B_gpu,tol,maxit,P)

where A_gpu and B_gpu are gpuArrays and P is the preconditioner.

Some simple tests show that the solution is faster than any iterative CPU solver whenever P=[ ], with speedups of up to 12x.

However, what I still can't figure out is why the performance drops whenever any type of preconditioner is selected. For instance, using incomplete Cholesky factorization:

L=ichol(A)
sol=pcg(A_gpu,B_gpu,tol,maxit,L*L')

This performs far worse than using no preconditioner at all on the GPU. The solution is even slower than the CPU version, where this same preconditioner improves CPU performance by 1.5x. That's really strange.
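
For context on what the fifth argument does: pcg treats the preconditioner M as a matrix it has to solve with at every iteration, and on the CPU it also accepts the preconditioner as two factors M1 and M2 with M = M1*M2. The sketch below contrasts the two forms on a small CPU test problem. The test matrix, the tolerances, and all variable names are illustrative assumptions, not the data from this question, and whether the gpuArray version of pcg accepts the two-factor form depends on the MATLAB release.

```matlab
% Small SPD test problem (assumption: stands in for the real [A], {B}).
A = delsq(numgrid('S', 32));      % 900-by-900 sparse 2-D Laplacian
b = ones(size(A, 1), 1);
tol = 1e-8;
maxit = 500;

L = ichol(A);                     % zero-fill incomplete Cholesky, A ~ L*L'

% Form 1: one assembled matrix. pcg has to solve with M = L*L' itself
% at every iteration, without knowing that M came from triangular factors.
M = L * L';
[x1, flag1, ~, iter1] = pcg(A, b, tol, maxit, M);

% Form 2: two triangular factors, M1 = L and M2 = L'.
[x2, flag2, ~, iter2] = pcg(A, b, tol, maxit, L, L');

% Reference: no preconditioner.
[x0, flag0, ~, iter0] = pcg(A, b, tol, maxit);

fprintf('no preconditioner: %d iterations\n', iter0);
fprintf('M = L*L''         : %d iterations\n', iter1);
fprintf('M1 = L, M2 = L''   : %d iterations\n', iter2);
```

Both preconditioned forms describe the same M, so they should need about the same number of iterations; any difference between them is in the cost of applying M per iteration.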

I've also tried passing A_gpu as preconditioner, but the solution takes forever:

sol=pcg(A_gpu,B_gpu,tol,maxit,A_gpu)

This issue also affects other iterative solvers, such as BICG and SYMMLQ.

Am I doing something wrong? It appears that any preconditioner on the GPU is acting as a drawback, even when it is efficient for the CPU version.

Please share your thoughts and experiences. Thanks!

7 Comments

Paulo Ribeiro on 15 Nov 2019

Edited: Paulo Ribeiro on 16 Nov 2019

Hi Joss and Walter.

I'm providing a OneDrive link for [A] and {B}:

Download [A] and {B} - mat file

The mat file exceeds 5 MB and cannot be attached to this message. For this specific benchmark, there is not a single case where a preconditioner on the GPU performs better than a run with no preconditioner at all.

Benchmarks are based on the following test:

P=diag(diag(A));
A=gpuArray(A);
B=gpuArray(B);
sol=pcg(A,B,1e-5,1e5,P);

Some comments:

a) when P=[ ], the solver converges in 3.26 s and 5346 iterations (that's the GPU case with no preconditioner).

b) when P is set to diag(diag(A)), it converges in 8.73 s and 5501 iterations.

c) using BICG as the iterative solver with P=[ ], it converges in 4.90 s and 5395 iterations. When P is set to diag(diag(A)), it converges in 16.58 s and 5567 iterations.

d) using Incomplete Cholesky factorization with:

L=ichol(A);
P=L*L';

performs far worse with both methods (PCG and BICG): processing time exceeded 300 s, and I cancelled the operation. On the other hand, this same preconditioner provides a significant speedup (1.6x) with the CPU solver.

e) ILU preconditioner is also a problem.

f) in my experiments there's not a single case where a preconditioner provides better results than the scenario with P= [ ].

g) my current setup is a NVIDIA RTX 2070 SUPER in a Windows 10 environment using MATLAB R2019a.

h) [A] is a sparse, symmetric, diagonally dominant matrix.

Many thanks for your support. Hope that your experience can help me on this issue. Regards.

PS: I wonder if preconditioning prior to the iterative solver call will provide better performance.
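
One way to read "preconditioning prior to the solver call" for the diagonal (Jacobi) case is to scale the system symmetrically once, up front, and then call pcg with no preconditioner. The sketch below shows that idea on a small CPU test problem. The test matrix and all variable names are illustrative assumptions, not the data from this thread, and nothing here has been timed on a GPU.

```matlab
% Small SPD test problem with a strongly varying diagonal
% (assumption: stands in for the real [A], {B}).
n = 400;
e = ones(n, 1);
A = spdiags([-e, 4*e + (1:n)', -e], [-1, 0, 1], n, n);
B = ones(n, 1);
tol = 1e-10;
maxit = 1000;

% Symmetric Jacobi scaling: with S = D^(-1/2), solve (S*A*S)*y = S*B,
% then recover x = S*y. S*A*S has a unit diagonal and remains symmetric
% (up to rounding).
s = 1 ./ sqrt(full(diag(A)));
S = spdiags(s, 0, n, n);
As = S * A * S;
Bs = s .* B;

% As and Bs could be moved to the GPU here with gpuArray (not done in
% this sketch so that it runs without a GPU).
[y, flag, relres, iter] = pcg(As, Bs, tol, maxit);
x = s .* y;

fprintf('flag = %d, iterations = %d\n', flag, iter);
fprintf('true relative residual = %.2e\n', norm(A*x - B) / norm(B));
```

Note that the tolerance now applies to the residual of the scaled system, so the residual of the original system should be checked separately, as the last line does.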

Joss Knight on 21 Nov 2019

For what it's worth, these are the results I got for your data on a Titan V, which has around 7 TFLOPS in double precision. I saw similar issues when passing the reconstructed Cholesky or ILU factors. I can't explain that, but perhaps the sparsity pattern is just a really poor match for the GPU factorization algorithm. We do intend to provide a future enhancement that will allow two triangular preconditioners to be passed to the solver so that the decomposition can be done independently.

>> Ag = gpuArray(A);
>> Bg = gpuArray(B);
>> P = diag(diag(A));
>> tic; pcg(A,B,1e-5,6000); toc 
pcg converged at iteration 5346 to a solution with relative residual 1e-05.
Elapsed time is 25.906744 seconds.
>> tic; pcg(Ag,Bg,1e-5,6000); toc
pcg converged at iteration 5345 to a solution with relative residual 1e-05.
Elapsed time is 1.399854 seconds.
>> tic; pcg(A,B,1e-5,6000,P); toc
pcg converged at iteration 5501 to a solution with relative residual 1e-05.
Elapsed time is 34.181677 seconds.
>> tic; pcg(Ag,Bg,1e-5,6000,P); toc
pcg converged at iteration 5502 to a solution with relative residual 9.8e-06.
Elapsed time is 2.404074 seconds.

In other words, preconditioner or no, the GPU is giving a great performance improvement.
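
The timings above use tic/toc around a single call. A hedged alternative is to use timeit on the CPU and gputimeit on the GPU, which repeat the call and, in the GPU case, wait for the device to finish before stopping the clock. The sketch below assumes a small stand-in test matrix and illustrative variable names, and it skips the GPU part when no device or no Parallel Computing Toolbox is available.

```matlab
% Small SPD test problem (assumption: stands in for the real [A], {B}).
A = delsq(numgrid('S', 32));
B = ones(size(A, 1), 1);
tol = 1e-5;
maxit = 6000;

% CPU timing: timeit calls the function several times and returns a
% typical run time. Requesting two outputs (x and flag) keeps pcg from
% printing its convergence message on every run.
tCpu = timeit(@() pcg(A, B, tol, maxit), 2);
fprintf('CPU pcg: %.4f s\n', tCpu);

% GPU timing, only if Parallel Computing Toolbox and a device are present.
hasGpu = ~isempty(which('gpuDeviceCount')) && gpuDeviceCount > 0;
if hasGpu
    Ag = gpuArray(A);
    Bg = gpuArray(B);
    % gputimeit synchronizes with the device before stopping the clock.
    tGpu = gputimeit(@() pcg(Ag, Bg, tol, maxit), 2);
    fprintf('GPU pcg: %.4f s (speedup %.1fx)\n', tGpu, tCpu / tGpu);
end
```

On a problem this small the GPU is unlikely to show a speedup; the point of the sketch is only the measurement pattern.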

Paulo Ribeiro on 21 Nov 2019

Edited: Paulo Ribeiro on 22 Nov 2019

Thanks Joss. These are really impressive results on a Titan V. It's even faster than a backslash solver A\B on the CPU with an Intel i7 8700:

tic; A\B; toc
Elapsed time is 1.712258 seconds.

For this specific case it appears that the best option is to avoid preconditioning on the GPU.

Regards.

Joss Knight on 25 Nov 2019

I investigated further and found that applying the preconditioner, not just decomposing it, does appear to be taking an unusually long time. This warrants further investigation, since these two triangular solves should be fast and your system matrix is band-diagonal. It does have quite a large bandwidth of 543, however, so that could be the issue.
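
For readers who want to check the bandwidth of their own matrix, the following is a minimal sketch that measures it and then tries a reverse Cuthill-McKee reordering with symrcm. The shuffled test matrix and the variable names are illustrative assumptions; whether a smaller bandwidth would actually speed up the GPU preconditioner solves in this thread has not been tested.

```matlab
% Test matrix (assumption: stands in for the real [A]): a 2-D Laplacian
% whose rows and columns are shuffled to inflate its bandwidth.
A0 = delsq(numgrid('S', 32));          % 900-by-900, natural ordering
rng(0);                                % reproducible shuffle
q = randperm(size(A0, 1));
A = A0(q, q);

% Bandwidth = largest distance of a nonzero from the main diagonal.
[row, col] = find(A);
bwBefore = max(abs(row - col));

% Reverse Cuthill-McKee reordering tends to pull nonzeros toward the
% diagonal.
p = symrcm(A);
Ar = A(p, p);
[row, col] = find(Ar);
bwAfter = max(abs(row - col));

fprintf('bandwidth before RCM: %d\n', bwBefore);
fprintf('bandwidth after  RCM: %d\n', bwAfter);

% To solve with the reordered system: solve Ar*z = B(p), then x(p) = z.
```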

Iterative solvers are typically faster than direct solves for large sparse matrices (assuming they have reasonable convergence properties). Direct solves can be hugely memory intensive because there is often a lot of fill-in during factorization.
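
Fill-in can be observed directly by comparing the number of nonzeros in a complete Cholesky factor with the number in the matrix itself. The sketch below does this on a small stand-in matrix, with and without the fill-reducing permutation that chol can return. The test matrix and variable names are illustrative assumptions, not the matrix from this thread.

```matlab
% Test matrix (assumption: stands in for the real [A]).
A = delsq(numgrid('S', 32));           % 900-by-900 sparse SPD matrix

% Complete Cholesky factor, A = R'*R, in the natural ordering.
R = chol(A);

% Same factorization with a fill-reducing permutation chosen by chol:
% Rp'*Rp = A(perm, perm).
[Rp, flag, perm] = chol(A, 'vector');

nnzA  = nnz(triu(A));                  % stored upper triangle of A
nnzR  = nnz(R);
nnzRp = nnz(Rp);

fprintf('nnz(triu(A))          = %d\n', nnzA);
fprintf('nnz(R), natural order = %d (%.1fx)\n', nnzR, nnzR / nnzA);
fprintf('nnz(R), reordered     = %d (%.1fx)\n', nnzRp, nnzRp / nnzA);
```

The ratio printed in parentheses is the growth in stored nonzeros relative to the original upper triangle, which is a rough proxy for the extra memory a direct solve needs.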
