Accumulation operations require synchronisation and are never going to squeeze the best performance out of your GPU. This is valid for all parallel computer architectures, by the way.
As a way to help myself understand your code, I rewrote the last line:
function tempData = iOriginalCode(A, B)
logicalB = logical(B);
indexes = B + ~logicalB;
indexes = indexes.';
values = A.*logicalB;
values = values.';
tempData = accumarray(indexes, values);
I can't immediately see how to change this code to improve performance and the rest of your vectorized code is not readable enough. I'm 90% sure the right thing to do is to go back to designing your algorithm to avoid that accumarray... you should take some of these things into consideration:
Try not to think in terms of indexes and values. Instead, try to formulate as many things as possible in terms of mathematical operations. You're already doing this in some parts of your code. For example, you're writing
(see my re-written code). Eventually, you might be able re-write your accumarray as a matrix-vector multiplication, which should help improve performance.
Use the structure in your data to simplify your code. You're using a lot of repmats, reshapes, and a repelem to generate your indexes array. You can most likely find a better way to re-write that for-loop-based code.
I would go back to the original code and think again about the structure of this R
j = 1:nx;
i = 1:nx;
r = (i^2+j^2)^0.5;
R = round(r);
R(R>nx/2) = 0;
Use gputimeit to measure your GPU performance. The right way to measure the performance of the GPU on this is to use gputimeit, and on my K40c (with the data obtained from your script) I got this:
>> gputimeit(@() iOriginalCode(A, B), 1)
The profiler can help you, but it's going to mess with the underlying flow of operations on the GPU. By the way, the profiler agrees that accumarray is the bottleneck here.