Applying large logical masks to large arrays quickly, row-wise
I am filtering many large NxM arrays (n > 500000, m > 10) by applying an NxM mask to generate many column vectors. The mask can only be true in a single column per row, so each resulting column vector is nx1.
Directly applying the mask to the array is not an option, because MATLAB applies the mask in a column-wise fashion. This is undesirable, as the rows in the data represent time stamps.
Is there a quicker way to apply a mask in a row-wise fashion, or to convert a mask to linear indices?
Example:
%% SETUP
% Data to be filtered
data = ...
[ 17 24 1 8 15; ...
23 5 7 14 16; ...
4 6 13 20 22; ...
10 12 19 21 3; ...
11 18 25 2 9 ];
% Mask to be applied
mask = ...
[ 0 0 1 0 0 ; ...
0 0 1 0 0 ; ...
0 1 0 0 0 ; ...
0 1 0 0 0 ; ...
1 0 0 0 0 ...
];
mask = logical(mask);
%% DESIRED DATA
wanted = [ 1;7;6;12;11]
%% DIRECT APPLICATION OF MASK
tic
direct = data(mask); % [11;6;12;1;7]
toc
disp(direct)
%% CURRENT SLOW SOLUTION
tic
[row,col] = find(mask); % This line is slow when called many times with large arrays
[row,sortIdx] = sort(row);
col = col(sortIdx);
idx = sub2ind(size(mask), row, col);
current = data(idx);
toc
disp(current)
4 Comments
Dyuman Joshi
16 Dec 2022
I see, it goes top to bottom rather than left to right.
I have a query regarding your comment below, if you don't mind answering, @Jan:
"The creation of large temporary arrays is expensive."
How do I know whether something is memory-expensive or time-expensive (I don't know if the two are different or not)?
Jan
17 Dec 2022
You can simply count how much memory a piece of code costs. Split the execution into atomic parts, e.g.:
Z = exp(sin(X + Y).^2); % where X and Y are matrices
% In steps:
A = X + Y;
B = sin(A);
C = B.^2;
Z = exp(C);
Three temporary arrays are required. How expensive this is depends on the size of the arrays and the CPU. If the arrays fit into the CPU caches, there is no latency from waiting for the slower RAM.
How expensive the calculations are also depends on the CPU, and even on the Matlab version. SIN, EXP and the element-wise power can be multi-threaded; Matlab then distributes the processing to several cores. How many cores are used for a specific data size depends on the implementation.
There is no general way to predict how much time the processing and the memory access will need. You have to try it. Look at my answers to your question: I've tried 10 different algorithms, but posted only the most promising versions.
When I write a function, I implement different methods and store them in a unit-test function. On one hand I use them to test the output. On the other hand I can check for new Matlab versions, if the formerly fastest implementation is still the best one.
Don't forget: The time for solving a problem is:
T_all = T_design + T_program + T_debug + T_documentation + T_run
Avoid premature optimization of the run time. Write stable code first and then focus on the bottleneck determined by the profiler.
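A minimal sketch of such a comparison harness (my addition, not from this thread; implA and implB are hypothetical placeholders for two candidate implementations), using timeit so each candidate is measured with a warmed-up JIT:
% Hedged sketch: a tiny harness that checks that two hypothetical
% implementations agree and compares their run times with timeit.
function compareImpl()
    data = rand(500000, 10);
    mask = false(500000, 10);
    mask(sub2ind(size(mask), (1:500000).', randi(10, 500000, 1))) = true;
    ref = implA(data, mask);                      % reference result
    assert(isequal(ref, implB(data, mask)));      % unit test: outputs must match
    tA = timeit(@() implA(data, mask));           % median-based timing
    tB = timeit(@() implB(data, mask));
    fprintf('implA: %.3f ms, implB: %.3f ms\n', 1e3*tA, 1e3*tB);
end
function R = implA(data, mask)                    % placeholder: transpose approach
    D = data.';
    R = D(mask.');
end
function R = implB(data, mask)                    % placeholder: find-based approach
    [I, J] = find(mask);
    R(I, 1) = data(sub2ind(size(data), I, J));
end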
Accepted Answer
Jan
15 Dec 2022
Edited: Jan
15 Dec 2022
Here is a set of different approaches. Run the code to see timings locally on your computer.
n = 500000;
m = 10;
data = rand(n, m);
mask = false(n, m);
mask(sub2ind([n, m], 1:n, randi([1, m], 1, n))) = true;

tic;
for k = 1:100
    y0 = origSol(data, mask);
end
toc

tic;
for k = 1:100
    y1 = transSol(data, mask);
end
toc
assert(isequal(y0, y1), 'Results differ');

tic;
for k = 1:100
    y2 = matchSol(data, mask);
end
toc
assert(isequal(y0, y2), 'Results differ');

tic;
for k = 1:100
    y3 = loopSol(data, mask);
end
toc
assert(isequal(y0, y3), 'Results differ');

% ========================================================================
function data = origSol(data, mask)  % Written by George Hackney
    [row, col] = find(mask);
    [row, sortIdx] = sort(row);
    col = col(sortIdx);
    idx = sub2ind(size(mask), row, col);
    data = data(idx);
end

function data = transSol(data, mask)  % Transpose both matrices
    D = data.';
    data = D(mask.');
end

function R = matchSol(data, mask)  % Leaner version of the original code
    [I, J] = find(mask);
    R(I) = data(sub2ind(size(data), I, J));
    R = R(:);
end

function R = loopSol(data, mask)  % C-style loops
    [m, n] = size(data);
    R = zeros(m, 1);
    for in = 1:n
        for im = 1:m
            if mask(im, in)
                R(im) = data(im, in);
            end
        end
    end
end
Transposing large arrays is expensive, so the speed-up of transSol() is not impressive.
The ugly loop solution is inspired by the C-mex implementation provided in my 2nd answer. Timings on my R2018b, i5-3320M:
% 2.60 sec Original
% 2.29 sec transpose data and mask
% 1.83 sec matchSol(): find(mask)
% 1.13 sec dull loops
% 0.45 sec C-mex
5 Comments
Jan
16 Dec 2022
Edited: Jan
16 Dec 2022
@George Hackney: Yes, you are told to use fast vectorized code. But this is not necessarily the code with the shortest run-time.
There is an easy rule: The creation of large temporary arrays is expensive. In this case:
function R = transSol(data, mask)  % Transpose both matrices
    D = data.';   % Expensive
    M = mask.';   % Expensive
    R = D(M);     % Fairly efficient
end
To my surprise even D(M) can be accelerated by a simple C-mex function: FEX: CopyMask. But the most expensive parts are the explicit transposes of the inputs. CPUs have become very fast, even when just one core is used; in comparison, the RAM access can be more expensive than the actual processing. If the processed memory does not fit into the CPU caches, this slows down the processing.
I do love Matlab's elegant vectorized commands. The compact code is nice, easy to maintain and less prone to typos than nested loops. But "vectorized" means that vectors (or, in general, arrays) are processed. If this requires large temporary arrays, the speed suffers.
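If you want to stay vectorized but shrink the temporaries, one option (my own sketch, not taken from this thread, and relying on the stated assumption of exactly one TRUE per row) is to locate the TRUE column with max instead of transposing the full matrices; the only temporaries are n-by-1 vectors:
function R = maxSol(data, mask)  % hedged sketch, assumes one TRUE per row of mask
    [~, col] = max(mask, [], 2);                 % column index of the TRUE in each row
    row = (1:size(data, 1)).';                   % row indices 1..n
    R = data(sub2ind(size(data), row, col));     % pick one element per row
end
Whether this beats the transposed indexing depends on the machine, so it would need the same timing comparison as above.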
Jan
16 Dec 2022
@George Hackney: Another attempt to improve the speed: Actually, processing a matrix column-wise is cheaper, because the CPU imports memory into its caches in blocks of 64 bytes (the "cache line size"). Therefore accessing neighboring bytes is faster than random access. Matlab stores arrays in column-major order, so processing them column-wise is faster (a small stand-alone illustration follows after the code below).
In spite of this, the row-wise processing in loopSolRow() is faster, because the inner loop can be stopped after the first match (there is only one TRUE per row). This saves half of the inner loop on average if the TRUEs are randomly distributed.
n = 500000;
m = 10;
data = rand(n, m);
mask = false(n, m);
mask(sub2ind([n, m], 1:n, randi([1, m], 1, n))) = true;

tic;
for k = 1:100
    y3 = loopSol(data, mask);
end
toc

tic;
for k = 1:100
    y3 = loopSolRow(data, mask);
end
toc

% ============================================================
function R = loopSol(data, mask)  % C-style loops, column-wise
    [m, n] = size(data);
    R = zeros(m, 1);
    for in = 1:n
        for im = 1:m
            if mask(im, in)
                R(im) = data(im, in);
            end
        end
    end
end

function R = loopSolRow(data, mask)  % C-style loops, row-wise
    [m, n] = size(data);
    R = zeros(m, 1);
    for im = 1:m
        for in = 1:n
            if mask(im, in)
                R(im) = data(im, in);
                break;
            end
        end
    end
end
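As a stand-alone illustration of the cache argument above (my own sketch, not from the thread), summing the same matrix with the inner loop walking down a column tends to be faster on many machines than with the inner loop jumping across a row, because MATLAB stores arrays in column-major order:
A = rand(500000, 10);   % hedged sketch: traversal order only, no masking
[m, n] = size(A);
tic                     % inner loop walks down a column: contiguous memory
s = 0;
for in = 1:n
    for im = 1:m
        s = s + A(im, in);
    end
end
toc
tic                     % inner loop jumps across a row: stride of m doubles
s = 0;
for im = 1:m
    for in = 1:n
        s = s + A(im, in);
    end
end
toc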
More Answers (1)
Jan
15 Dec 2022
Edited: Jan
15 Dec 2022
A C-mex version:
// File: maskTCopy.c
// Compile: mex -O -R2018a maskTCopy.c
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double    *data, *out;
    mxLogical *mask;
    mwSize    m, n, im, in;

    data = mxGetDoubles(prhs[0]);
    m    = mxGetM(prhs[0]);
    n    = mxGetN(prhs[0]);
    mask = mxGetLogicals(prhs[1]);

    plhs[0] = mxCreateDoubleMatrix(m, 1, mxREAL);
    out     = mxGetDoubles(plhs[0]);

    for (in = 0; in < n; in++) {
        for (im = 0; im < m; im++) {
            out[im] += *mask++ * *data++;
        }
    }
}
Under my Matlab R2018b this needs 0.39 sec instead of 1.83 sec for matchSol() from my other answer and 2.65 sec for the original version. [EDITED, new version of the C code, 0.45 -> 0.39 sec]
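For reference, the accumulation in the mex loop is the same arithmetic as a plain MATLAB sum over the masked columns. A hedged one-line equivalent (my addition, not part of this answer) that assumes exactly one TRUE per row and no NaNs outside the mask, and which builds a full n-by-m temporary, so it is expected to be slower than the mex:
R = sum(data .* mask, 2);   % mask acts as 0/1 weights; one nonzero term per row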
0 Comments