Have I reached the limitations of GPU processing?
I'm working on a project that requires massive, repetitive use of a single function. spmd would be an ideal solution to my problem, but I only have one graphics card running on a single computer.
I'm using a trial version of the Parallel Computing Toolbox, and I want to make sure that I'm not overlooking any optimization techniques before concluding that a GPU, and the toolbox, won't be suitable for my needs.
The sample function performs an operation on two arguments, and produces a single output.
C = operation(A,B);
where A, B, and C are all simple numerical arrays of the same type, each with the same dimensions and the same number of elements.
The "operation" function works as follows:
function C = operation(A,B)
Fixed_Size = 100;
V{1} = [3 5; 2 8; 10 7]; %Three rows
V{2} = [8 3]; % One row
V{3} = [23 2; 19 4; 16 5; 34 9]; % Four rows
% ... (entries V{4} through V{99} defined similarly)
V{Fixed_Size} = [73 28; 45 22]; % Two rows
C = zeros(1,Fixed_Size);
for n = 1:Fixed_Size
part1 = A(V{n}(:,1)) .* B(V{n}(:,2));
C(n) = rem(sum(part1),19);
end
I've tried eliminating the loop with further vectorization, but that actually slows things down, because the matrices in V have varying numbers of rows. To get rid of the loop and the cell array, I would have to pad the rows so that they're all the same size, which would let "V" be a single matrix. However, that approach is slower because there are more elements to work with.
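For reference, the padded-matrix variant described above can be sketched as follows. This is a hypothetical layout, not my actual code: each V{n} becomes one column of two index matrices, and the padding uses the index of a zero element appended to A and B, so the extra products contribute nothing to the sums.

```matlab
% Sketch of the padding approach: pad every V{n} to maxRows rows using an
% index that points at an appended zero, then do everything in one shot.
maxRows = max(cellfun(@(x) size(x,1), V));
padIdx  = numel(A) + 1;                  % index of the appended zero element
Apad = [A(:); 0];
Bpad = [B(:); 0];
Ia = padIdx * ones(maxRows, numel(V));   % column n will hold V{n}(:,1), padded
Ib = padIdx * ones(maxRows, numel(V));   % column n will hold V{n}(:,2), padded
for n = 1:numel(V)                       % one-time setup, since V is fixed
    r = size(V{n}, 1);
    Ia(1:r, n) = V{n}(:,1);
    Ib(1:r, n) = V{n}(:,2);
end
C = rem(sum(Apad(Ia) .* Bpad(Ib), 1), 19);   % no per-n loop
```

The padded products are all 0*0, so the column sums are unchanged; the cost is exactly the extra elements mentioned above.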
The "operation" function is used in the following way, without the GPU:
tic;
C1 = operation(A1,B1);
C2 = operation(A2,B2);
F = operation(C1,C2);
toc;
Here's the fastest way that I can figure to achieve this using the GPU:
% first the inputs are put into the GPU.
% The time taken to do this isn't included in the time comparison.
A1 = gpuArray(A1);
A2 = gpuArray(A2);
B1 = gpuArray(B1);
B2 = gpuArray(B2);
tic;
F = operation(operation(A1,B1), operation(A2,B2));
toc;
In practice, there are WAY more inputs than just two pairs. Instead of 2 "A" inputs, and 2 "B" inputs, imagine having nearly 200,000. With that many inputs, there will obviously be limits to how many pairs can be handled in a single line for the GPU. But the concept remains the same -- I'm using the GPU to process the "operation" function simultaneously.
Using the GPU in this way is still a lot slower than the first method, which doesn't use the GPU. The only real benefit I've noticed is in scaling: as I increase the number of inputs, the CPU time grows linearly, as expected, while the GPU time doesn't. In one case, doubling the number of inputs increased the GPU time by only about 50%, instead of 100%. Once the number of inputs got considerably large, though, the GPU's time growth became linear as well.
Is this the best that I can expect from a GPU, or am I missing something? I'd hate to conclude that this setup is a failure, only to find that it would have succeeded with a properly used technique. I'm new to GPU computing, so I've only tried what makes sense. Is there some trick to unleashing the real power, or is this the best kind of performance that I could expect for such an algorithm?
Thanks for your time, and for any input. I apologize for such a long question, but I wanted to be thorough.
*Please note: in this sample "operation" function, I've shown "V" being declared inside the function. That's not how it's set up in practice. In practice, V is a global cell array that the function has access to, which prevents the function from having to rebuild "V" every time it is called.
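For concreteness, the global setup described above looks roughly like this (a minimal sketch; in MATLAB, `global V` must be declared both where V is built and inside any function that reads it):

```matlab
% One-time setup (e.g. in a script or at the command line):
%   global V
%   V = cell(1, 100);
%   V{1} = [3 5; 2 8; 10 7];
%   ... and so on for the remaining entries.

function C = operation(A, B)
global V                      % same global cell array; no per-call rebuild
Fixed_Size = numel(V);
C = zeros(1, Fixed_Size);
for n = 1:Fixed_Size
    part1 = A(V{n}(:,1)) .* B(V{n}(:,2));
    C(n) = rem(sum(part1), 19);
end
end
```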
Accepted Answer
Joss Knight
3 Dec 2015
What you've done is unlikely to work well on the GPU: you are not doing enough work inside the loop.
I will assume for the purposes of this answer that you cannot reorganize your index data in a more efficient way manually. In that case, I suggest vectorizing by concatenating all your indices and creating a single long vector of values from A and B. Keep track of where the last value is in each section and then use that to index the data later to get your dot products:
VMat = cell2mat(V');
LastRowIndex = cumsum(cellfun(@(x)size(x,1), V));
ProdAll = A(VMat(:,1)) .* B(VMat(:,2));
DotProdAll = cumsum(ProdAll);
DotProd = DotProdAll(LastRowIndex); % Index into cumsum to get answer
DotProd = [DotProd(1); diff(DotProd)];
C = rem(DotProd,19);
Make A and B gpuArrays and Bob's your Uncle. Only problem here is the potential for overflow with such a big accumulation.
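Assuming V never changes, VMat and LastRowIndex only need to be computed once and can be reused for every pair of inputs. Here is a hedged sketch wrapping the same steps in a function (operationVec is a made-up name); forcing the products to double pushes back the overflow point, since the result only needs to be exact before the final rem:

```matlab
function C = operationVec(A, B, VMat, LastRowIndex)
% Vectorized form of "operation". Precompute once, outside this function:
%   VMat         = cell2mat(V');
%   LastRowIndex = cumsum(cellfun(@(x) size(x,1), V));
ProdAll    = double(A(VMat(:,1))) .* double(B(VMat(:,2)));  % double delays overflow
DotProdAll = cumsum(ProdAll);                % running sums over all sections at once
DotProd    = DotProdAll(LastRowIndex(:));    % sum up to the end of each section
DotProd    = [DotProd(1); diff(DotProd)];    % per-section dot products
C = rem(DotProd, 19);
end
```

With A and B (and, optionally, VMat and LastRowIndex) passed in as gpuArrays, each call runs as a handful of vectorized operations instead of Fixed_Size loop iterations.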
4 Comments
Joss Knight
7 Dec 2015
You're welcome. Too bad I didn't get there in time.
Vectorizing code is definitely a skill that takes time to acquire. I can recommend this blog article and this page in the MATLAB documentation, which give a breakdown of the process and an overview of the various tools available to you.
Stephen23
7 Dec 2015
It might take a bit of time, but learning to write vectorized code is well worth the effort. Vectorization is the most powerful way of writing fast and efficient MATLAB code.