How to efficiently allocate memory using a parfor loop

Hello all, I have a quick optimization question.
I'm doing calculations on some very large point cloud data. The calculation I'm doing is
for n=1:size(E_mat,1)
Q_matrix(n,:,:) = sigmaE(n)/2/mass_density(n)*squeeze(E_mat(n,:,:))'*squeeze(E_mat(n,:,:));
end
where size(E_mat) is roughly 70000000 x 3 x 24. This code should be highly parallelizable, but when I use parfor I get a memory issue. I have access to a good compute server with 40 cores and 512 GB of RAM. The current for loop uses about 300 GB of RAM but only 1.2% CPU. I'm pretty new to high-performance computing, but given the low CPU usage I'm fairly sure the for loop is running single-threaded. Is there a simple way to fix this?
Thanks so much for the help!!

4 Comments

Walter Roberson — 28 Jun 2022
Note that you would have higher memory efficiency if n was the last dimension instead of the first dimension, for each of the variables.
tiwwexx — 28 Jun 2022
Very true, then I wouldn't have to use the squeeze function...
Walter Roberson — 28 Jun 2022
squeeze is fast. It is extracting the data that is slow. The memory layout is
(1,1,1) (2,1,1) (3,1,1) (4,1,1)... (70000000,1,1), (1,2,1) (2,2,1)... (70000000, 2,1) and so on. The data for (n, :, :) is all over the place in memory. If you make 70000000 the final dimension then each 3x24 is stored in consecutive memory.
tiwwexx — 29 Jun 2022
Thanks for the explanation!
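To make the layout point concrete, here is a small timing sketch (array sizes shrunk so it runs quickly; E_first and E_last are illustration names, not variables from the thread):

```matlab
% MATLAB stores arrays column-major, so with n as the FIRST dimension
% the 3*24 elements of E_first(n,:,:) are strided N apart in memory.
% With n LAST, each 3x24 page is one contiguous block.
N = 1e5;
E_first = rand(N, 3, 24);               % n first: strided page access
E_last  = permute(E_first, [2, 3, 1]);  % n last: contiguous pages

tic
for n = 1:N
    p = squeeze(E_first(n, :, :));      % scattered reads, then squeeze
end
toc

tic
for n = 1:N
    p = E_last(:, :, n);                % single contiguous block copy
end
toc
```

On most machines the second loop is markedly faster even though both do the same work: the difference is purely memory-access pattern.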


Accepted Answer

Jan — 29 Jun 2022 (edited 29 Jun 2022)


ET = permute(E_mat, [2,3,1]);
Q = zeros(size(ET));
parfor n = 1:size(ET, 3)
Q(:,:,n) = sigmaE(n) / 2 / mass_density(n) * ET(:, :, n)' * ET(:, :, n);
% Or maybe this is faster:
% tmp = ET(:, :, n);
% Q(:,:,n) = sigmaE(n) / 2 / mass_density(n) * tmp' * tmp;
end
I'm curious: What do you observe?
Do you really mean ctranspose, or is ET real? If it is real, .' would be the transposition.
What about using pagemtimes ?
ET = permute(E_mat, [2,3,1]);
Q = pagemtimes(ET, 'transpose', ET, 'none');
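For anyone wanting to verify the pagemtimes equivalence before committing to the full 70000000-page array, a small sanity check along these lines (toy sizes, made-up data) should reproduce the per-page products:

```matlab
% Check: pagemtimes(ET,'ctranspose',ET,'none') matches the per-page
% ET(:,:,n)' * ET(:,:,n) loop (complex data, so ctranspose matters).
len = 10;
ET = rand(3, 24, len) + 1i * rand(3, 24, len);

Q1 = pagemtimes(ET, 'ctranspose', ET, 'none');

Q2 = complex(zeros(24, 24, len));
for n = 1:len
    Q2(:,:,n) = ET(:,:,n)' * ET(:,:,n);
end

max(abs(Q1(:) - Q2(:)))   % should be ~0 up to floating-point noise
```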

5 Comments

tiwwexx — 29 Jun 2022 (edited 29 Jun 2022)
For reference my E_mat is complex and I did want conj transpose.
Using
tic
Q_test = pagemtimes(E_mat_pt,'ctranspose',E_mat_pt,'none');
for n=1:size(Q_test,3)
Q_test(:,:,n) = sigmaE_pt(n) / 2 / mass_pt(n) * Q_test(:,:,n);
end
toc
I get ~60% CPU usage the whole time. This makes sense, since size(Q_test(:,:,n)) is 24x24 and 60% CPU corresponds to about 24 cores. The calculation ends up taking about 85 seconds.
Then the following,
ET = permute(E_mat, [2,3,1]);
Q = zeros(size(ET));
parfor n = 1:size(ET, 3)
Q(:,:,n) = sigmaE(n) / 2 / mass_density(n) * ET(:, :, n)' * ET(:, :, n);
end
took around 45 seconds. For reference, I started a 12-worker parpool before running the code, so the time to allocate the parpool wasn't included in the tic/toc. I also tried the temp-variable version inside the loop, and that ran in ~65 seconds. It should also be noted that the parpool was a bit less memory efficient and used ~10 GB more RAM. Very interesting.
I'm going to keep the question open just a bit longer to see if anyone has any insight on the GPU optimization. I've tried running
%% move to GPU and run pagemtimes
tic
E_mat_gpu = gpuArray(E_mat);
toc
tic
Q_gpu = pagemtimes(E_mat_gpu,'ctranspose',E_mat_gpu,'none');
toc
The data transfer is fast (~1 s), and the pagemtimes is also fast (~1 s). That's a pretty good speedup, but I still think it could be better with more optimized GPU code. It also has a problem with the second part of the computation:
for n=1:size(E_mat_gpu,3)
Q_gpu(:,:,n) = sigmaE_gpu(n) / 2 / mass_gpu(n) * Q_gpu(:,:,n);
end
Any suggestions on how to optimize this for-loop for a GPU?
Jan — 30 Jun 2022 (edited 30 Jun 2022)
What about:
len = size(E_mat_gpu,3);
Q_gpu = reshape(0.5 * sigmaE_gpu ./ mass_gpu, 1, 1, len) .* Q_gpu;
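A small CPU-side check (toy sizes, hypothetical variable names) confirms that this implicit-expansion form matches the per-page scaling loop:

```matlab
% The 1x1xlen factor broadcasts across each 24x24 page, replacing
% the explicit for-loop over n with one vectorized multiply.
len = 5;
Q      = rand(24, 24, len);
sigmaE = rand(len, 1);
mass   = rand(len, 1);

Qloop = zeros(size(Q));
for n = 1:len
    Qloop(:,:,n) = sigmaE(n) / 2 / mass(n) * Q(:,:,n);
end

Qbcast = reshape(0.5 * sigmaE ./ mass, 1, 1, len) .* Q;

max(abs(Qloop(:) - Qbcast(:)))   % ~0; the two forms agree
```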
tiwwexx — 30 Jun 2022
legendary!!
%% move to GPU and run pagemtimes
%%% Measure data transfer %%%
clear E_mat_gpu sigmaE_gpu mass_gpu Q_gpu
tic
E_mat_gpu = gpuArray(E_mat_pt(:,:,1:end));
sigmaE_gpu = gpuArray(sigmaE_pt(1:end));
mass_gpu = gpuArray(mass_pt(1:end));
%%% measure pagemtimes performance %%%
Q_gpu = pagemtimes(E_mat_gpu,'ctranspose',E_mat_gpu,'none');
Q_gpu = reshape(0.5*sigmaE_gpu./mass_gpu,1,1,size(sigmaE_gpu,1)).*Q_gpu;
%Q_gpu = arrayfun(@scale_Q,sigmaE_gpu,mass_gpu,Q_gpu,size(Q_gpu,3));
Q_cpu_trans = gather(Q_gpu);
toc
Runs in 5 seconds! That's 9x faster than CPU-only, and it includes the data transfer time.
Jan — 30 Jun 2022
By the way: A=E_mat_pt(:,:,1:end) is less efficient than A=E_mat_pt .
tiwwexx — 30 Jun 2022
That was a byproduct of my GPU running out of memory; I had to split the array into a few parts to fit it on the GPU.


More Answers (0)

Category: Parallel Computing Fundamentals
Release: R2021b
