Pre-load data on multiple GPUs for parfor loop

Massimiliano Zanoli, 15 March 2021
Commented: Edric Ellis, 17 March 2021
I have two GPUs with 6 GB of RAM each.
I need to perform a Particle Swarm optimization that evaluates a cost function which is very well suited for GPU computation, but the data arrays are huge (~4 GB).
I have code that successfully works using one GPU and no parallelization. The code pre-loads the arrays into the GPU (which is time consuming) and subsequently enters the optimization process, where the cost function is quickly evaluated.
Now, I'd like to exploit the second GPU, but for that I need to start a parallel pool with 2 workers, and assign a GPU to each. The problem is the pre-loading of the arrays.
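For reference, this is roughly how I assign a GPU to each worker (a sketch; labindex is the worker's index inside spmd, renamed spmdIndex in newer releases):

```matlab
% Start one worker per GPU (two here), then pin each worker to a
% distinct device so the assignment is deterministic.
parpool(2);
spmd
    gpuDevice(labindex);
end
```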
I have tried different options, including those suggested in MATLAB's blogs, documentation and Answers, but they don't work.
For instance, this:
% create a 4 GB array
A = rand(1024, 1024, 512);
spmd
    % copies the array to each worker and loads it onto its respective GPU;
    % A becomes a 2x1 Composite
    A = gpuArray(A);
end
% for each potential solution to evaluate
parfor n = 1 : N
    % evaluate the cost function (*)
    <..> = costFunction(A, n, ..);
end
will throw an error at (*) because "Composites are not supported in parfor loops". Since Particle Swarm uses a parfor, I cannot go this way.
The only other way to pre-load data onto the workers is via a parallel.pool.Constant, but:
  • it does not work meaningfully with gpuArray (at least not in my version, R2020a);
  • it is a wrapper not fully integrated with MATLAB's language: every access has to go through <variable>.Value, which forces you to maintain two versions of the code, one for parallel and one for serial computation.
In particular:
A = rand(1024, 1024, 512);
A = parallel.pool.Constant(A);
spmd
    % will turn A back into a Composite, defeating the purpose
    A = gpuArray(A.Value);
end
and:
A = rand(1024, 1024, 512);
% load on one of the GPUs from the main thread, occupying 4 GB of RAM
A = gpuArray(A);
% copy A to each worker and load it on its respective GPU (*)
A = parallel.pool.Constant(A);
will throw an out-of-memory error at (*) because one GPU already has 4 GB occupied by the main thread. It also suffers from memory leaks: the original gpuArray from the main thread remains on the GPU even after all references to it are gone.
Is there a way to pre-load massive arrays into each GPU and run parfor evaluations on them? Maybe something in the new releases?

Accepted Answer

Edric Ellis, 16 March 2021
You've got a number of options here depending on whether you can build the value of A directly on the workers. The simplest case is where you can do that, and then you'd do this:
Ac = parallel.pool.Constant(@() rand(1024,1024,512,'gpuArray'));
parfor ...
    doStuff(Ac.Value);
end
Things are a little trickier if the value of A must be calculated on the client. But it should work to do this:
A = rand(1024,1024,512);
Ac = parallel.pool.Constant(@() gpuArray(A));
In that case, the CPU value of A is embedded in the anonymous function handle's workspace, and it is pushed to the GPU only on the workers, when the function handle is evaluated.
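Putting the pieces together, a minimal end-to-end sketch (here costFunction and N stand in for your actual cost function and swarm size, and I've assumed a scalar cost per particle):

```matlab
% Build the array once on the client (CPU memory only).
A = rand(1024, 1024, 512);

% Each worker evaluates the handle once, pushing its own copy of A
% onto its own GPU; the client's GPU memory is left untouched.
Ac = parallel.pool.Constant(@() gpuArray(A));

results = zeros(1, N);
parfor n = 1:N
    % Ac.Value is the per-worker gpuArray, created lazily on first use.
    results(n) = gather(costFunction(Ac.Value, n));
end
```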
2 Comments
Massimiliano Zanoli, 16 March 2021
Thank you Edric, I had totally missed the parallel.pool.Constant(<function_handle>) constructor. This does indeed do what I need. Too bad, though, that it suffers from huge memory leaks in combination with gpuArray (at least from what I see in my version). Once the array is transferred to a worker and loaded onto its GPU, calling <Constant_handle>.delete will not release the GPU memory. If I do:
spmd
    A = gather(Ac.Value);
end
Ac.delete;
the array is transferred back from the GPU to the worker and becomes a Composite, but then this:
clear A;
apparently does not free the worker's memory. Maybe I'm missing something else :) But your solution is a good one.
Edric Ellis, 17 March 2021
One thing to note about Composite values: they do not release worker memory until the first spmd block after they are cleared (this is to avoid excessive client-worker communication). So, if you do
clear A
spmd, end
you should see the memory returned. (In your code, this would be CPU memory.)
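For GPU memory specifically, a sketch of a full cleanup sequence (note that reset destroys every array on that worker's device, so only do this once nothing else lives there):

```matlab
delete(Ac);     % drop the Constant and its per-worker gpuArray
clear A;        % clear the Composite on the client
spmd, end       % empty spmd block: flushes cleared values on the workers
spmd
    reset(gpuDevice);   % releases everything allocated on that worker's GPU
end
```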


