Intersect() with with repetition

42 ビュー (過去 30 日間)
Yi-xiao Liu
Yi-xiao Liu 2022 年 8 月 17 日
編集済み: Yi-xiao Liu 2022 年 9 月 14 日
The syntax [C,ia,ib] = intersect(A,B,'rows') returns elements without repetitions. However, I need every potential combination of ia and ib that gives C. For example:
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
I need the output to be:
C=[1,1;1,1;1,1;1,1];
ia=[1;1;2;2];
ib=[2;3;2;3];
The answer here generates indcies into A and B that are intersecting elements: https://www.mathworks.com/matlabcentral/answers/338649-arrays-intersection-with-repetition However there are no correspondace between ia and ib generated by this method. e.g., A(ia,:)~=B(ib,:). it also doesn't give every potential combination of indices.
Any ideas?
Thanks
  2 件のコメント
Yi-xiao Liu
Yi-xiao Liu 2022 年 8 月 18 日
Fixed the typo in ib. Good catch!

サインインしてコメントする。

採用された回答

Jan
Jan 2022 年 9 月 9 日
If you really need the redundant information in iAB_C, iAB_Cprime, idx1, idx2, this is faster than the original version tested with the data:
A = randi([0, 200], 1e6, 2);
B = randi([0, 200], 1e6, 2);
I get 1.12 s instead of 1.86 s.
function [C, iAB_C, iAB_Cprime, idx1, idx2] = repintersect_1b(A2, B2)
% Join data for faster sorting:
A = A2 * [4294967296; 1];
B = B2 * [4294967296; 1];
[uA, ~, iAx] = unique(A, "stable");
[a, idx] = sort(iAx);
aV = 1:numel(A);
aV = aV(idx).';
aI = RunLen(a);
[uB,~,iBx] = unique(B, "stable");
[b, idx] = sort(iBx);
bV = 1:numel(B);
bV = bV(idx).';
bL = cumsum([1, RunLen(b)]);
[C2, iuA, iuB] = intersect(uA, uB, "stable");
% Split data for the output:
C = [floor(C2 / 4294967296), rem(C2, 4294967296)];
iAB_C = cell(numel(iuA), 1);
a0 = 1;
for ii = 1:numel(iuA)
a1 = a0 + aI(ii); % Easier indexing for A
aa = aV(a0:a1 - 1);
a0 = a1;
na = numel(aa);
b0 = bL(iuB(ii)); % Need accumulated RunLength for B
b1 = bL(iuB(ii) + 1) - 1;
bb = bV(b0:b1);
nb = numel(bb);
qa = repmat(aa.', nb, 1); % Replace MESHGRID
qb = repmat(bb, 1, na);
iAB_C{ii} = [qa(:), qb(:)];
end
iAB_Cprime = cell2mat(iAB_C);
idx2 = cumsum(cellfun('size', iAB_C, 1)); % Faster than cellfun(@(x) size(x,1), C)
idx1 = [1; idx2(1:end-1) + 1];
end
% Helper function:
function n = RunLen(x)
d = [true; diff(x(:)) ~= 0]; % TRUE if values change
n = diff(find([d.', true])); % Indices of changes
end
I've omitted the temporarily used cell arrays.
Further time could be saved, if you do not create iAB_Cprime and the set of iAB_C and the indices, because both contain the same information.
  3 件のコメント
Yi-xiao Liu
Yi-xiao Liu 2022 年 9 月 14 日
編集済み: Yi-xiao Liu 2022 年 9 月 14 日
That (the distribution of na/nb) was the observation when I was trying your function on the data.
However, now come to think about it, the majority of na/nb being small means that the size of repetition is consistent for most intersections. So the function can (I can) be optimized to use vectorized code to cover those na==nb==1, and then loop over only the case when na>1|nb>1. (or depending on the data, also vectorize (na==1&nb==2)|(na==2&nb==1), etc.)
I know the output is redundant, and I only need one (set) of them. I was just saying eihter is acceptable.
In short, thanks for your help. I now know how to tailor it for my data.

サインインしてコメントする。

その他の回答 (3 件)

the cyclist
the cyclist 2022 年 8 月 18 日
I frankly have not quite figured out the logic of what you want as output, but I'm pretty sure that you can construct what you want by using the ismember command instead of intersect. You'll probably need both "directions", and possibly all outputs:
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
[tf_ab, loc_ab] = ismember(A,B,"rows")
tf_ab = 3×1 logical array
1 1 0
loc_ab = 3×1
2 2 0
[tf_ba, loc_ba] = ismember(B,A,"rows")
tf_ba = 3×1 logical array
0 1 1
loc_ba = 3×1
0 1 1
  2 件のコメント
the cyclist
the cyclist 2022 年 8 月 18 日
Ah, I get it now. Also, in my mind I was thinking that ismember had that third output like unique does, that gives the mapping back to all elements of the original vector.

サインインしてコメントする。


Yi-xiao Liu
Yi-xiao Liu 2022 年 8 月 18 日
Here is my current solution.
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
[uA,~,iA]=unique(A,"rows","stable");
iA=sortrows([iA,(1:size(A,1))'],1);
iA=mat2cell(iA(:,2),accumarray(iA(:,1),1));
[uB,~,iB]=unique(B,"rows","stable");
iB=sortrows([iB,(1:size(B,1))'],1);
iB=mat2cell(iB(:,2),accumarray(iB(:,1),1));
[C,iuA,iuB]=intersect(uA,uB,"rows","stable");
iA_C=cell(size(iuA));iB_C=cell(size(iuA));iAB_C=cell(size(iuA));
for ii=1:numel(iAB_C)
iA_C{ii}=iA{iuA(ii)};
iB_C{ii}=iB{iuB(ii)};
[iAB_C_iiA,iAB_C_iiB]=meshgrid(iA_C{ii},iB_C{ii});
iAB_C{ii}=[iAB_C_iiA(:),iAB_C_iiB(:)];
end
disp(iAB_C)
{4×2 double}
disp(iAB_C{1})
1 2 1 3 2 2 2 3
s
  11 件のコメント
Yi-xiao Liu
Yi-xiao Liu 2022 年 9 月 9 日
Sorry. I was posting late last night just before sleep.
For typical inputs, (elements of) A and B are integers in range of 1 to 2^24. They are expected to have ~10^10 rows each, in the same scale of output C. If it helps, A usually is a subset of B (not always).
And about our enivorment, the typical node we can request has 48 CPU cores and 177GB memory. We can request large memory (~1TB) nodes if necessary.

サインインしてコメントする。


Jan
Jan 2022 年 8 月 18 日
編集済み: Jan 2022 年 8 月 18 日
A simple loop approach:
A = [1,1; ...
1,1; ...
1,2];
B = [0,1; ...
1,1; ...
1,1];
[C, iA, iB] = RepIntersectRows(A, B)
C = 4×2
1 1 1 1 1 1 1 1
iA = 4×1
1 1 2 2
iB = 4×1
2 3 2 3
function [C, iA, iB] = RepIntersectRows(A, B)
% Sizes of inputs:
nA = size(A, 1);
nB = size(B, 1);
w = size(A, 2);
% Pre-allocate output:
C = zeros(nA * nB, w);
iA = zeros(nA * nB, 1);
iB = zeros(nA * nB, 1);
% Find matchs:
iC = 0;
for kA = 1:nA
a = A(kA, :);
for kB = 1:nB
if isequal(a, B(kB, :))
iC = iC + 1;
C(iC, :) = a;
iA(iC) = kA;
iB(iC) = kB;
end
end
end
% Crop unused elements:
C = C(1:iC, :);
iA = iA(1:iC);
iB = iB(1:iC);
end
If A and B have 1e4 rows, the runtime is 4 seconds already. But it is thought to clarify,what you exactly need. The output of your approach does not match the original question exactly. Before I optimize my code, I want to know, if it matchs your needs at all.
  4 件のコメント
Jan
Jan 2022 年 8 月 18 日
@Yi-xiao Liu: Matlab's unique and intersect are based on binary searchs, while the loops (e.g. in my prove of concept code) perform a linear search.
Binary search means, that the data are sorted at first, such that you do not have to compare all elements, but log2 of the elements only by dividing the inteval of interest by 2 in each step.
I'm try to find a shorter and faster solution, when I'm at home.

サインインしてコメントする。

製品


リリース

R2019b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by