Alternatives to accumarray for faster calculations?

Question

Nathan Zechar 2020 年 4 月 20 日

0
リンク

この質問への直接リンク

https://jp.mathworks.com/matlabcentral/answers/519212-alternatives-to-accumarray-for-faster-calculations

編集済み: Nathan Zechar 2020 年 4 月 27 日

Hello, I have taken someone else's code which had several for-loops in it, and vectorized the code. The code now runs extremelly fast and even faster when using gpuArray(). I'm at a point now where the slowest part of the code is where I used the accumarray function. I'm not sure if there are alternative approaches to vectorizing code without using accum array.

Here is a rough example of the code I vectorized and an exampe of that code, vectorized

nx = 16;
nz = 512;
A  = rand(nx,nx,nz);
%% First Method For Loops
R1  = zeros(nx,nx,nz);
B1  = zeros(nx,nz);
N1  = zeros(nx,nz);
for k =         1:nz
    for j =     1:nx
        for i = 1:nx
            
            r = round((i^2+j^2)^0.5);
            
            if(r <= nx)
                R1(i,j,k) = r;
                N1(R1(i,j,k),k)  = N1(R1(i,j,k),k)+1;
                B1(R1(i,j,k),k)  = B1(R1(i,j,k),k) + A(i,j,k);
                
            end
        end
    end
    N1(1,k) = 1;
end
B1 = B1./N1;
%% Second Method Vectorized
[I,J]   = ndgrid(1:nx,1:nx);
r       = (I.^2+J.^2).^0.5;
R2      = round(r);
L21     = (R2 <= nx);
N2      = squeeze(repmat(histcounts(L21.*R2(:,:,1),1:nx+1),[1 1 nz]));
A2      = reshape(A,[1,numel(A)]);
B2      = reshape(repmat(L21.*R2,[1 1 nz]),[1,numel(A)]);
L22     = logical(B2).*repelem(0:nx:(nz-1)*nx,(nx)^2);
B2      = B2+L22;
B2      = reshape(accumarray((B2+logical(~B2)).',A2.*logical(B2)).',[nx nz])./N2;
% Check that Vectorized Code = For Loop Code
sum(sum(sum(B1-B2)))

The vectorized code isn't much faster when using gpuArray() and is only a tiny bit faster than the non-vectorized code without using gpuArray(). It is currently the bottleneck within my code and I'm not sure what other functions might be able to help me out here.

The great thing about the vectorized code is that I can put it into a function and pull several of the variables out of the function in an initialize step. Still though, the accumarray takes a significant amount of time to run compared to all other parts of my code (not shown) that is vectorized. And when accumarray is put into a function, is appears to be slower when I put tic/toc just outside of the function.

Thanks!

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

サインインしてこの質問に回答する。

Answer 1

Andrea Picciau 2020 年 4 月 20 日

3
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/519212-alternatives-to-accumarray-for-faster-calculations#answer_427067

編集済み: Andrea Picciau 2020 年 4 月 20 日

MATLAB Online で開く

Hi Nathan,

Accumulation operations require synchronisation and are never going to squeeze the best performance out of your GPU. This is valid for all parallel computer architectures, by the way.

As a way to help myself understand your code, I rewrote the last line:

function tempData = iOriginalCode(A, B)
% Breakdown of original code
logicalB = logical(B);
indexes = B + ~logicalB;
indexes = indexes.';
values = A.*logicalB;
values = values.';
tempData = accumarray(indexes, values);
end

I can't immediately see how to change this code to improve performance and the rest of your vectorized code is not readable enough. I'm 90% sure the right thing to do is to go back to designing your algorithm to avoid that accumarray... you should take some of these things into consideration:

Try not to think in terms of indexes and values. Instead, try to formulate as many things as possible in terms of mathematical operations. You're already doing this in some parts of your code. For example, you're writing

A(~logicalB)=0

as

A.*logicalB

(see my re-written code). Eventually, you might be able re-write your accumarray as a matrix-vector multiplication, which should help improve performance.

Use the structure in your data to simplify your code. You're using a lot of repmats, reshapes, and a repelem to generate your indexes array. You can most likely find a better way to re-write that for-loop-based code.

I would go back to the original code and think again about the structure of this R

j = 1:nx;
i = 1:nx;
r = (i^2+j^2)^0.5;
R = round(r);
R(R>nx/2) = 0;

Use gputimeit to measure your GPU performance. The right way to measure the performance of the GPU on this is to use gputimeit, and on my K40c (with the data obtained from your script) I got this:

>> gputimeit(@() iOriginalCode(A, B), 1)
ans =
    0.1309 

The profiler can help you, but it's going to mess with the underlying flow of operations on the GPU. By the way, the profiler agrees that accumarray is the bottleneck here.

9 件のコメント
7 件の古いコメントを表示7 件の古いコメントを非表示

Nathan Zechar 2020 年 4 月 20 日

編集済み: Nathan Zechar 2020 年 4 月 21 日

Hey, thank you for your reply Andrea! It was helpful in that I did not know about gputimeit(). I was afriad I might have to find another completely different approach to this. It was very hard coming up with what I currently have - but it sounds like I'm just going to have to put in the time and try a little harder.

It was also helpful for you to code in a way that explicitly shows indicies and values (edit: I know you said don't think on these terms but given the structure of R it's a little hard not to) Unfortunately for me I have indicies that range from 0 to the length of a vector instead of 1 to the length of a vector so I have to come up with a unique way to acount for that or makes sure the values = 0 where the corresponding indicies = 0. Then when I sum over indicies similiar to accum array, those values won't contribute anything.

Back to the drawing board here.

Andrea Picciau 2020 年 4 月 21 日

編集済み: Andrea Picciau 2020 年 4 月 21 日

MATLAB Online で開く

nathansProblem.m

Hi Nathan,

Yesterday I couldn't get your problem out of my head and I found myself thinking about better ways to vectorize the code. I thought you might be interested in seeing a couple of MATLAB tricks in action, so here's what I came up with:

function [N,B] = iVectorizedCode(A, B, N)
% New vectorized code
nx = cast(size(A, 1), 'like', A);
% Precompute R
[IJ, selR, numR] = iSelectRows(nx);
% Update N
N = N + iColumn(numR);
N(1, :) = 1;
% Update B
sumA = iSumByPage(A, IJ);
B(selR, :) = B(selR, :) + iRow(sumA);
B = B./N;
    function [IJ, selR, numR] = iSelectRows(nx)
        % Pre-compute the rows to select
        
        allIndices = 1:nx;
        
        % Compute R for all the combinations of I and J
        R = round(sqrt(allindices.^2+allindices'.^2));
        IJ = R<=nx/2;
        R = R(IJ); % Remove all the unnecessary Rs
        
        % Find the unique indices in R
        uniqueR = unique(R);
        numUniqueR = sum(R==uniqueR');
        
        % Convert uniqueR (indices) to row selectors (logicals)
        selR = ismember(allIndices, uniqueR);
        numR = zeros(1, nx, 'like', nx);
        numR(selR) = numUniqueR;
    end
    function sumA = iSumByPage(A, IJ)
        % Sum A by page (a page = dimension>2 in an array);
        
        A = A.*IJ;
        sumA = sum(A, [1, 2]);
    end
    function colVector = iColumn(anyVector)
        colVector = reshape(anyVector, [], 1);
    end
    function rowVector = iRow(anyVector)
        rowVector = reshape(anyVector, 1, []);
    end
end

On my K40, this has better performance than the original vectorised code. To be honest, I'm not even sure this is correct! I guess you'll have to test that...

originalVectorized =
    0.2965
originalVectorizedCpu =
    0.8719
newVectorized =
    0.0149

My comments:

In your vectorized code, A N and B are not gpuArrays initially! Be careful not to mix GPU and CPU computations unless you mean to. Otherwise, you won't be able to trust your performance results.
In the original code R doesn't depend on k. I took it out of the loop and pre-computed it together with the number of times you update N and B in the original code (represented by sumR).
I made extensive use of implicit expansion which can be useful in these cases (for example R==uniqueR'). Note that implicit expansion might limit the size of the problems you can work on...but I think it'll be fine in this case.
Also, R is very regular and you should be able to generate its values in a much faster way! You should be able to find a closed form expression to give you numR given the row index.
If you keep working on something like this, I'm sure you can make it much faster!

Let me know what you think about this! I attached the whole script to this reply...

Nathan Zechar 2020 年 4 月 24 日

MATLAB Online で開く

Hey Anrea, I just now attempted to run your code, but I don't believe it is producing the same values as the other methods. I edited my original post to make more clear what is happening with the R value by adding subscripts to it in the loop and rand instead of zeros for A.

Here is everything all together.

clear all
%% Values that will be used in all three methods
nx = 16;
nz = 512;
A  = rand(nx,nx,nz);
%% First Method For Loops
R1  = zeros(nx,nx,nz);
B1  = zeros(nx,nz);
N1  = zeros(nx,nz);
for k =         1:nz
    for j =     1:nx
        for i = 1:nx
            
            r = round((i^2+j^2)^0.5);
            
            if(r <= nx)
                R1(i,j,k) = r;
                N1(R1(i,j,k),k)  = N1(R1(i,j,k),k)+1;
                B1(R1(i,j,k),k)  = B1(R1(i,j,k),k) + A(i,j,k);
                
            end
        end
    end
    N1(1,k) = 1;
end
B1 = B1./N1;
%% Second Method Vectorized
[I,J]   = ndgrid(1:nx,1:nx);
r       = (I.^2+J.^2).^0.5;
R2      = round(r);
L21     = (R2 <= nx);
N2      = squeeze(repmat(histcounts(L21.*R2(:,:,1),1:nx+1),[1 1 nz]));
A2      = reshape(A,[1,numel(A)]);
B2      = reshape(repmat(L21.*R2,[1 1 nz]),[1,numel(A)]);
L22     = logical(B2).*repelem(0:nx:(nz-1)*nx,(nx)^2);
B2      = B2+L22;
B2      = reshape(accumarray((B2+logical(~B2)).',A2.*logical(B2)).',[nx nz])./N2;
%% Check First and Second Method Are Equal
sum(sum(sum(B1-B2)))
%% Third Method
B  = zeros(nx,nz);
N  = zeros(nx,nz);
[N3,B3] = iVectorizedCode(A, B, N);
%% Check First and Third Method Are Equal
sum(sum(sum(B1-B3)))
function [N,B] = iVectorizedCode(A, B, N)
% New vectorized code
nx = cast(size(A, 1), 'like', A);
% Precompute R
[IJ, selR, numR] = iSelectRows(nx);
% Update N
N = N + iColumn(numR);
N(1, :) = 1;
% Update B
sumA = iSumByPage(A, IJ);
B(selR, :) = B(selR, :) + iRow(sumA);
B = B./N;
    function [IJ, selR, numR] = iSelectRows(nx)
        % Pre-compute the rows to select
        
        allindices = 1:nx;
        
        % Compute R for all the combinations of I and J
        R = round(sqrt(allindices.^2+allindices'.^2));
        IJ = R<=nx/2;
        R = R(IJ); % Remove all the unnecessary Rs
        
        % Find the unique indices in R
        uniqueR = unique(R);
        numUniqueR = sum(R==uniqueR');
        
        % Convert uniqueR (indices) to row selectors (logicals)
        selR = ismember(allindices, uniqueR);
        numR = zeros(1, nx, 'like', nx);
        numR(selR) = numUniqueR;
    end
    function sumA = iSumByPage(A, IJ)
        % Sum A by page (a page = dimension>2 in an array);
        
        A = A.*IJ;
        sumA = sum(A, [1, 2]);
    end
    function colVector = iColumn(anyVector)
        colVector = reshape(anyVector, [], 1);
    end
    function rowVector = iRow(anyVector)
        rowVector = reshape(anyVector, 1, []);
    end
end

Andrea Picciau 2020 年 4 月 24 日

編集済み: Andrea Picciau 2020 年 4 月 24 日

MATLAB Online で開く

Hi Nathan,

I really wanted to give you working code, so here's the previous code, fixed:

function [B, N] = hThirdMethod(nx, nz, A)
B  = zeros(nx,nz, 'like', A);
N  = zeros(nx,nz, 'like', A);
nx = cast(nx, 'like', A);
% Precompute R
[indices, numR] = iSelectRows(nx);
% Update N
N = N + iColumn(numR);
N(1, :) = 1;
% Update B
sumA = iSumByPage(A, indices);
B = B + sumA;
B = B./N;
end
function [indexPerPage, numR] = iSelectRows(nx)
% Pre-compute the rows to select
allindices = 1:nx;
% Compute R for all the combinations of I and J
R = round(sqrt(allindices.^2+allindices'.^2));
R(R>nx) = 0;
% Create an array of size nx * nx * nx, where dimension = 3 corresponds to 
% a different value in R
allIndicesPerPage = reshape(allindices, [1, 1, nx]);
indexPerPage = R==allIndicesPerPage;
% Sum across every page in R
numR = sum(indexPerPage, [1, 2]);
end
function sumA = iSumByPage(A, indices)
% Sum A by page (a page = dimension>2 in an array);
% Replicate indices through a 4th dimension, which goes from 1 to nz
indices = repmat(indices, [1, 1, 1, size(A,3)]);
% Convert A to an array of size nx * nx * nx * nz to match the dimensionality
% of indices
A = reshape(A, [size(A,1), size(A,2), 1, size(A,3)]);
A = A.*indices;
sumA = sum(A, [1, 2]);
% Shifting sumA back to a matrix of size nx * nz
sumA = reshape(sumA, [size(sumA,3), size(sumA,4)]);
end
function colVector = iColumn(anyVector)
colVector = reshape(anyVector, [], 1);
end

I hope seeing the working code is going to help you in your investigation. Keep in mind the implicit dimension expansion!

From my tests, this should be about 2x faster than the version of the vectorised code you attached in the last comment.

Andrea Picciau 2020 年 4 月 25 日

MATLAB Online で開く

I used that cast as a hack to make sure everything's done on the GPU. If you remove that, you'll have to have

allindices = gpuArray(1:nx);

otherwise iSelectRows is going to be executed on the CPU instead of the GPU!

The multi-dimension technique in iSumByPage is really necessary to be able to select the elements of A as an elementwise array product

indices = repmat(indices, [1, 1, 1, size(A,3)]);
A = reshape(A, [size(A,1), size(A,2), 1, size(A,3)]);
A = A.*indices;

which is quite fast on a GPU, instead of using indexing, which I would expect to be slower. Keep in mind that this technique uses more GPU memory: at some point, allocating all that GPU memory is going to be slower than a for loop with indexing. You'll have to evaluate the performance depending on your application...

I enjoyed answering your question very much. If you found the answer useful, please upvote it for the others to see!

Nathan Zechar 2020 年 4 月 27 日

Thank you again Andrea, hope to be as amazing as you someday!

サインインしてコメントする。

Alternatives to accumarray for faster calculations?

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

採用された回答

9 件のコメント
7 件の古いコメントを表示7 件の古いコメントを非表示

その他の回答 (0 件)

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

Alternatives to accumarray for faster calculations?

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

採用された回答

9 件のコメント 7 件の古いコメントを表示7 件の古いコメントを非表示

その他の回答 (0 件)

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

9 件のコメント
7 件の古いコメントを表示7 件の古いコメントを非表示