Hi all,
Iterative solvers can be useful with full matrices too, which is my case. I would like to use an iterative solver such as GMRES where the full matrix and the RHS are gpuArrays, but this does not seem to be supported in MATLAB R2013a.
My data are
>> n = 1024;
>> Acpu = rand(n)+100*eye(n);
>> bcpu = rand(n,1);
>> Agpu = gpuArray(Acpu); bgpu = gpuArray(bcpu);
I tried either
>> x = gmres(Agpu,bgpu,[]);
Error using iterchk (line 39)
Argument must be a floating point matrix or a function handle.
Error in gmres (line 86)
[atype,afun,afcnstr] = iterchk(A);
and
>> x = gmres(@(x)(Agpu*x),bgpu,[]);
The following error occurred converting from gpuArray to double:
Conversion to double from gpuArray is not possible
Error in gmres (line 297)
U(:,1) = u;
The only way I found to make it work is
>> x = gmres(@(x)gather(Agpu*x),bcpu,[]);
gmres converged at iteration 7 to a solution with relative residual 2.4e-07.
That is terribly ugly because the result of every matrix-vector product is copied from the GPU back to system memory. Any suggestions for running GMRES on the GPU using MATLAB built-in functions?
Thanks in advance, Fabio

2 Comments

Matt J on 16 Sep 2014
Are you saying you get no acceleration over CPU-gmres? I wouldn't expect the data transfer of Agpu*x to be such a big penalty. It's not like you're transferring all of Agpu, after all.
I also vaguely wonder whether this would continue to be a problem on newer graphics cards and newer versions of CUDA. My understanding was that the newer CUDA versions could share memory with the CPU.
Fabio Freschi on 16 Sep 2014
Not yet implemented in Matlab 2013a. I get out-of-memory pretty soon if I exceed the GPU memory (12GB in my case, with Tesla K40)


Accepted Answer

Matt J on 16 Sep 2014
Edited: Matt J on 16 Sep 2014

1 vote

Even for a much larger problem size (n = 10240) and a not-so-new graphics card (GTX 580), I see negligible overhead from swapping between CPU and GPU:
n = 1024*10;
Acpu = rand(n)+100*eye(n);
bcpu = rand(n,1);
Agpu = gpuArray(Acpu);
bgpu= gpuArray(bcpu);
gputimeit(@() Agpu*bgpu) %all data on gpu
%0.0052sec
gputimeit(@() gather( Agpu*bcpu )) %requires data transfer
%0.0054sec
The speed-up in GMRES also seems pretty good (a factor of 4):
tic;
x = gmres(@(x) Acpu*x,bcpu,[]);
toc
%Elapsed time is 0.391786 seconds.
tic;
x = gmres(@(x)gather(Agpu*x),bcpu,[]);
toc
%Elapsed time is 0.097924 seconds.

5 Comments

Fabio Freschi on 16 Sep 2014
Thanks Matt J for the quick reply.
My MATLAB version does not support gputimeit. In any case, my tic-toc timings (Tesla K40) are slightly different:
>> tic; x = Agpu*bgpu; toc % GPU
Elapsed time is 0.000792 seconds.
>> tic; x = gather(Agpu*bgpu); toc % GPU+transfer
Elapsed time is 0.006088 seconds.
>> tic; x = gather(Agpu*bcpu); toc % GPU+transfer+transfer
Elapsed time is 0.006120 seconds.
(Note that the last MVP is between GPU-matrix and CPU-vector like in your example.)
Your solution works, but does not take advantage of full GPU computing and my overhead seems more pronounced.
Matt J on 16 Sep 2014
Edited: Matt J on 16 Sep 2014
You can't rely on tic...toc, I'm afraid. Is this with n=10240? Too-small data sizes will also color the comparison. And what timing comparison do you see for gmres (CPU vs GPU)?
Fabio Freschi on 16 Sep 2014
Edited: Fabio Freschi on 16 Sep 2014
Yes, it is with n = 10240. I'll try to get an updated Matlab version to check results.
GMRES timings
>> tic; x = gmres(@(x) Acpu*x,bcpu,[]); toc
gmres stopped at iteration 10 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 10) has relative residual 7.3e-06.
Elapsed time is 0.842030 seconds.
>> tic; x = gmres(@(x)gather(Agpu*x),bcpu,[]); toc
gmres stopped at iteration 10 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 10) has relative residual 7.3e-06.
Elapsed time is 0.098119 seconds.
That seems to agree with the tic-toc timings of the MVP (10 MVPs plus overhead).
Matt J on 16 Sep 2014
Edited: Matt J on 16 Sep 2014
If you must use tic...toc, the following would be a better set of tests:
tic;
x=gather( Agpu*bcpu );x(:)=1;
toc %requires data transfer
tic; for ii=1:10,
x= Agpu*bgpu;
end;
x=gather(x);
x(:)=1;
toc/10 %all data on gpu
tic; x= Acpu*bcpu;x(:)=1; toc
Notice that the second test is the most realistic representation of what you would like to do, i.e., many iterations of GPU operations plus a final gather() operation at the end of the iterations.
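A minimal sketch of that pattern, under some assumptions: real-valued data, a nonsingular matrix, and a release in which zeros(...,'like',p) accepts a gpuArray prototype. It is a hand-written, unrestarted GMRES (not the built-in gmres) in which all length-n vectors stay in the class of b, so with gpuArray inputs only a handful of scalars per iteration cross to the host. The names gmresOnDevice and toHost are hypothetical helpers, not MathWorks functions.

```matlab
function [x, relres, iter] = gmresOnDevice(A, b, tol, maxit)
% Unrestarted GMRES sketch with zero initial guess (hypothetical helper).
% A is a matrix (full or gpuArray); b decides where the n-vectors live.
    b = b(:);
    n = numel(b);
    m = min(maxit, n);
    normb = toHost(norm(b));
    if normb == 0                      % trivial right-hand side
        x = 0*b; relres = 0; iter = 0;
        return
    end
    V  = zeros(n, m+1, 'like', b);     % Arnoldi basis, same class as b
    R  = zeros(m, m);                  % triangular factor, kept on the host
    cs = zeros(m, 1);                  % Givens cosines (host)
    sn = zeros(m, 1);                  % Givens sines (host)
    g  = zeros(m+1, 1);                % rotated right-hand side (host)
    g(1) = normb;
    V(:,1) = b / normb;
    iter = 0; relres = 1;
    for j = 1:m
        w = A * V(:,j);                % matrix-vector product, stays on device
        % Classical Gram-Schmidt, applied twice for stability
        h  = toHost(V(:,1:j)' * w);   w = w - V(:,1:j) * h;
        h2 = toHost(V(:,1:j)' * w);   w = w - V(:,1:j) * h2;
        hnext = toHost(norm(w));
        col = [h + h2; hnext];         % new Hessenberg column, length j+1
        for i = 1:j-1                  % apply the previous rotations
            t        =  cs(i)*col(i) + sn(i)*col(i+1);
            col(i+1) = -sn(i)*col(i) + cs(i)*col(i+1);
            col(i)   =  t;
        end
        r = hypot(col(j), col(j+1));   % new rotation zeroes col(j+1)
        cs(j) = col(j) / r;
        sn(j) = col(j+1) / r;
        col(j) = r;
        R(1:j, j) = col(1:j);
        g(j+1) = -sn(j) * g(j);
        g(j)   =  cs(j) * g(j);
        iter = j;
        relres = abs(g(j+1)) / normb;  % residual estimate from the rotations
        if relres <= tol
            break
        end
        V(:,j+1) = w / hnext;
    end
    y = R(1:iter, 1:iter) \ g(1:iter); % small triangular solve on the host
    x = V(:,1:iter) * y;               % solution, same class as b
end

function v = toHost(v)
% Bring small quantities to the host; no-op for ordinary arrays.
    if isa(v, 'gpuArray')
        v = gather(v);
    end
end
```

With the variables from this thread, a call might look like x = gather(gmresOnDevice(Agpu, bgpu, 1e-6, 10)); so the only large transfer is the final gather. Keeping the small Hessenberg bookkeeping on the host is a design choice here: it avoids scalar indexing into gpuArrays at the cost of moving j+1 scalars per iteration.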
Fabio Freschi on 16 Sep 2014
Edited: Fabio Freschi on 16 Sep 2014
Following a suggestion found on the MathWorks website:
>> gd = gpuDevice;
>> tic; for i = 1:100, x = Agpu*bgpu; end; wait(gd); toc
Elapsed time is 0.537721 seconds.
>> tic; for i = 1:100, x = gather(Agpu*bgpu); end; wait(gd); toc
Elapsed time is 0.547418 seconds.
These are in accordance with your experiments.
EDIT: I now see your comment, which is similar to this implementation.


More Answers (1)

Joss Knight on 7 Sep 2015

0 votes

If you download the R2015b release of MATLAB (released on 3rd September) you will find that gmres is now supported for sparse gpuArrays, including support for a single sparse matrix preconditioner. See http://www.mathworks.com/help/distcomp/release-notes.html.

Asked: 16 Sep 2014

Answered: 7 Sep 2015
