speed up ismember with two-dim data

Question

Felix Müller 2021 年 7 月 13 日

1
リンク

この質問への直接リンク

https://jp.mathworks.com/matlabcentral/answers/877513-speed-up-ismember-with-two-dim-data

コメント済み: Felix Müller 2021 年 7 月 15 日

I have two large cell-arrays (X and Y) where each field is a two dimensional array representing pixel positions. The cell arrays are not the same size (can be in the range of thousands) and each entry can be be a different length (ranging from length 1 to length >1000). Maybe relevant: Over X and Y individually, no point can be included twice, i.e. all entries in X have no points in common (same for Y).

Now I want to find the overlap between all entries in X with all entries in Y. (in reality I filter the entries in Y for a criterion and compute only roughly 10-20 percent of the comparisons)

Is there a way to speed up this process? I have read about ismembc which requires sorted data (and thus I thought it cannot be used in my case).

Minimal working example:

X{1} = [1 1; 1 2; 1 3];
X{2} = [3 1; 3 2; 3 3];
Y{1} = [1 1; 1 2];
Y{2} = [2 3; 3 1; 3 2; 3 3];
Y{3} = [5 5];
mat_overlap = NaN(length(X), length(Y));
for ii = 1:length(X)
for jj = 1:length(Y)
  mat_overlap(ii,jj) = sum(ismember(X{ii}, Y{jj}, 'rows'));
end
end

6 件のコメント
4 件の古いコメントを表示4 件の古いコメントを非表示

Jan 2021 年 7 月 13 日

編集済み: Jan 2021 年 7 月 13 日

MATLAB Online で開く

Are the values integers or floating point numbers? How large are the values?

The current set of data can be converted to 1D data by:

m     = 10;
XX{i} = X{i}(:, 1) * m + X{i}(:, 2);

Afterwards ismembc can acclerate the search again. This works only if there is an m > max(Data(:, 2)) which does not cause values > 2^53, where rounding effects would invalidate the integrity.

An alternative is to calculate the MD5 hash of each row, but this takes some time and I guess, that this is slower in total.

Optimizing the runtime depends on the specific input: How large are the data in average? How many elements overlap? Would this be a realistic input:

X = cell(1, 1000);
for k = 1:1000
   n    = randi([1, 100]);
   X{k} = rand(n, 2);     % Or: randi([1,50], n, 2) ?
   % Or: unique(randi([1,50], n, 2), 'rows')?
end
% And the same for Y

Are the rows of the elements of X and Y guaranteed to be unique? Or is this possible:

X{1} = [1 1; 1 2; 1 1];

Felix Müller 2021 年 7 月 14 日

編集済み: Felix Müller 2021 年 7 月 14 日

MATLAB Online で開く

Wow, I am amazed by you. Thanks so much for the help!

all values in X and Y are integers <= 4000 (and for each similar application there would be such an upper bound as well)
each row in X itself (and in Y itself) is unique (not only within X{1} but across the whole of X)
an example input could be X or Y having 400-1000 fields with length 15-1500 and the entries are random 2-dim points in [1,4000] (integer)
between 0% and 90% of points overlap (between X{ii} and Y{jj}
if important: within one entry the points will be four-connected (i.e. each point has at least one other point directly adjacent (up, right, down, left) to it but they are usually not listed in order)

To take your example, it would be more like this. (a bit bloated I guess, but I didn't know how else to code the unique-ness across X)

X = cell(1, 1000);
arr = randi([15, 1500], 1000, 1);
num_total = sum(arr);
% create vector with unique rows
random_vector = [];
while length(random_vector) < num_total
    num_missing = num_total - length(random_vector);
    append = round(4000*rand(num_missing, 2));
    random_vector = unique([random_vector; append], 'rows');
end
% fill X with the correct different lengths
for k = 1:1000
    idx_first = sum(arr(1:(k-1))) + 1;
    idx_last = sum(arr(1:k));
    X{k} = random_vector(idx_first:idx_last);
end
% And the same for Y

Jan 2021 年 7 月 14 日

MATLAB Online で開く

@Felix Müller: I assume, there is a typo in the code to produce the test data:

X{k} = random_vector(idx_first:idx_last, :);
%                                      ^^^  added to get 2 columns

I've posted a comparison with ismembc and get a speedup of a factor 100 - much more then 60%.

Felix Müller 2021 年 7 月 15 日

True, typo! (I tested with my real data)

Yes, that can well be! I'll make it clearer in my answer above. I tested my whole routine which does other things as well besides only using ismember/ismembc and the 60% speed up was for that. For the raw comparison it will be much more but of course that is the interesting number here!

サインインしてコメントする。

サインインしてこの質問に回答する。

Answer 1

Jan 2021 年 7 月 14 日

0
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/877513-speed-up-ismember-with-two-dim-data#answer_747033

MATLAB Online で開く

function Test_Felix
X = TestData;
Y = TestData;
% Original method:
tic;
mat_overlap = NaN(length(X), length(Y));
for ii = 1:length(X)
   for jj = 1:length(Y)
      mat_overlap(ii,jj) = sum(ismember(X{ii}, Y{jj}, 'rows'));
   end
end
toc
% Modified:
tic;
nX = numel(X);  % Condense the 2 columns to 1:
X2 = cell(1, nX);
for iX = 1:nX
   X2{iX} = X{iX}(:,1) * 5000 + X{iX}(:, 2);
end
nY = numel(Y);
Y2 = cell(1, nY);
for iY = 1:nY
   Y2{iY} = Y{iY}(:,1) * 5000 + Y{iY}(:, 2);
end
mat_overlap2 = NaN(nX, nY);
for iY = 1:nY          % Swap loops
   Yi = sort(Y2{iY});  % Sort once only  
   for iX = 1:nX
      mat_overlap2(iX, iY) = sum(ismembc(X2{iX}, Yi));
   end
end
toc
isequal(mat_overlap, mat_overlap2)
end
function X = TestData
n   = 200;
X   = cell(1, n);
arr = randi([15, 1500], n, 1);
num_total = sum(arr);
% create vector with unique rows
random_vector = [];
while length(random_vector) < num_total
    num_missing = num_total - length(random_vector);
    append = round(4000*rand(num_missing, 2));
    random_vector = unique([random_vector; append], 'rows');
end
% fill X with the correct different lengths
for k = 1:n
    idx_first = sum(arr(1:(k-1))) + 1;
    idx_last  = sum(arr(1:k));
    X{k}      = random_vector(idx_first:idx_last, :);
end
end

Timings on Matlab R2018b, i5 mobile:

Elapsed time is 17.209408 seconds.
Elapsed time is 0.175748 seconds.

A factor of 100. Nice! I've reduced the data size from n=1000 to 200 for faster tests.

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

Felix Müller 2021 年 7 月 15 日

wonderful!

サインインしてコメントする。

speed up ismember with two-dim data

6 件のコメント
4 件の古いコメントを表示4 件の古いコメントを非表示

採用された回答

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

その他の回答 (0 件)

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

speed up ismember with two-dim data

6 件のコメント 4 件の古いコメントを表示4 件の古いコメントを非表示

採用された回答

1 件のコメント -1 件の古いコメントを表示-1 件の古いコメントを非表示

その他の回答 (0 件)

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

6 件のコメント
4 件の古いコメントを表示4 件の古いコメントを非表示

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示