Memory optimization and code speed up

142 views (last 30 days)
Lorenco Santos Vasconcelos on 20 Dec 2024
Commented: Lorenco Santos Vasconcelos on 27 Jan 2025 at 14:23
Hi! I have a function to compute what is called the Fuzzy Rand Index. It is a way to compare a fuzzy (soft) or probabilistic data partition (a clustering result) with a reference hard partition. The math behind it is basically an algebra with t-norms and t-conorms (s-norms). Among the possible choices, we can use min() as the t-norm and max() as the s-norm, or the algebraic product and probabilistic OR (probor) as t-norm and s-norm. In the present case I'm using min() and max().
The input data are two partition matrices (N×d), where N is the number of data points and d the number of clusters. Each element h_ji of a partition matrix is precisely the probability that data point j belongs to cluster i. In general, N >> d (e.g. N = 20000 to 50000 points, while d is typically 2 to 5 clusters).
The math behind the index requires computing a sort of "pairwise min" between each data point and all the others within the same column (min(h_ji, h_ki)), so I'll have N*(N-1)/2 values for each of the d columns. To these d arrays of size N*(N-1)/2, I apply the max s-norm element-wise to end up with a quantity V, an array of size N*(N-1)/2.
I must also compute the same "pairwise min" t-norm between every data point and each other one, but mixing the columns, like min(h_ji, h_kp). So I end up with factorial(d)/factorial(d-2) arrays of size N*(N-1)/2, and to these I also apply the max s-norm element-wise to end up with a quantity X, an array of size N*(N-1)/2.
What I call "pairwise min" works much like pdist does, comparing each point with the following ones, but not with the previous ones.
Later, this V and X from one partition matrix are compared, with the min t-norm, against V0 and X0 from the reference partition matrix.
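For small inputs, the definitions above can be pinned down with a brute-force sketch (written here in NumPy purely for illustration; the function name fuzzy_pairs and the h_ji ↔ H[j, i] mapping are my own conventions, and pairs are ordered j < k as in pdist):

```python
import numpy as np

def fuzzy_pairs(H):
    """Brute-force V and X with t-norm = min and s-norm = max.

    H is an N x d partition matrix; returns two arrays of length N*(N-1)/2.
    Only usable for small N -- this is a correctness oracle, not a fast path.
    """
    N, d = H.shape
    V = np.empty(N * (N - 1) // 2, dtype=H.dtype)
    X = np.empty_like(V)
    idx = 0
    for j in range(N - 1):
        for k in range(j + 1, N):
            # same-column pairs: V(j,k) = max_i min(h_ji, h_ki)
            V[idx] = np.max(np.minimum(H[j], H[k]))
            # mixed-column pairs: X(j,k) = max over i != p of min(h_ji, h_kp)
            m = np.minimum(H[j][:, None], H[k][None, :])  # d x d table of mins
            np.fill_diagonal(m, -np.inf)                  # exclude i == p
            X[idx] = m.max()
            idx += 1
    return V, X
```

For H = [[0.9, 0.1], [0.2, 0.8]] this gives V = [0.2] and X = [0.8], which matches working the formulas by hand.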
I have implemented this in 3 different ways.
The slowest one, which also has the highest memory consumption, is the most compact form:
% you can use as an example U = rand(20000,4);
U = rand(30000,4); % just for test
[N,d] = size(U); % U is a partition matrix
comb = ones(d,2).*(1:d)';
V = max(cell2mat(arrayfun(@(k) cell2mat(arrayfun(@(j) pdist2(U((j+1):N,comb(k,1)),U(j,comb(k,2)),@min),(1:(N-1)).','UniformOutput',false)),1:size(comb,1),'UniformOutput',false)).');
comb = nchoosek(1:d,2);
comb = [comb;fliplr(comb)];
X = max(cell2mat(arrayfun(@(k) cell2mat(arrayfun(@(j) pdist2(U((j+1):N,comb(k,1)),U(j,comb(k,2)),@min),(1:(N-1)).','UniformOutput',false)),1:size(comb,1),'UniformOutput',false)).');
In an intermediate version, I tried parallel code like this:
% you can use as an example U = rand(20000,4);
U = rand(30000,4); % just for test
[N,d] = size(U); % U is a partition matrix
comb = nchoosek(1:d,2);
ncomb = size(comb,1);
X = cell(N-1,1);
V = cell(N-1,1);
parfor j = 1:N-1
    tmp = zeros(N-j,1);
    for k = 1:ncomb
        tmp = max(tmp,max(min(U(j,comb(k,1)), U((j+1):end,comb(k,2))) , min(U(j,comb(k,2)), U((j+1):end,comb(k,1)))));
    end
    X{j} = tmp;
    tmp = zeros(N-j,1);
    for k = 1:d
        tmp = max(tmp,min(U(j,k), U((j+1):end,k)));
    end
    V{j} = tmp;
end
X2 = cell2mat(X);
V2 = cell2mat(V);
Finally, the version where I got the best results in terms of speed and memory usage is the following, where I used a trick with circshift and used the single type, since double precision is not important to me:
% you can use as an example A = rand(20000,4);
A = rand(30000,4,'single'); % just for test
[N,d] = size(A); % A is a partition matrix
NC = fix(N/2);
numelements = N*(N-1)/2;
V = zeros(numelements,1,'single');
X = zeros(numelements,1,'single');
temp = zeros(N,1,'single');
for nc1 = 1:NC-1
    Ac = circshift(A,-nc1,1);
    V((nc1-1)*N + 1 : nc1*N) = max(min(A,Ac),[],2);
    X((nc1-1)*N + 1 : nc1*N) = func1(A,Ac,temp,d);
    temp(:) = 0;
end
if N/2 - NC == 0
    V((NC-1)*N + 1 : end) = max(min(A(1:N/2,:),A(N/2+1:end,:)),[],2);
    X((NC-1)*N + 1 : end) = func2(A,N,d,temp(1:N/2));
else
    Ac = circshift(A,-NC,1);
    V((NC-1)*N + 1 : end) = max(min(A,Ac),[],2);
    X((NC-1)*N + 1 : end) = func1(A,Ac,temp,d); % pass temp (zeros), not N
end
function temp = func1(A,Ac,temp,d)
    for nc2 = 1:d-1
        temp = max(temp,max(min(A,circshift(Ac,-nc2,2)),[],2));
    end
end
function temp = func2(A,N,d,temp)
    for nc2 = 1:d-1
        temp = max(temp,max(min(A(1:N/2,:),circshift(A(N/2+1:end,:),-nc2,2)),[],2));
    end
end
The problem is that this calculation must be done many times, to compare many different clustering results against the reference one, and for datasets with N > 20000 things take a long time. Do you think there is room for more optimization, to make this faster and more memory-efficient than the third version with the circshifts?
Also, since this is done multiple times (2300 times for each dataset), this function is called inside a parfor, so each worker executes it. But min and max are themselves multithreaded functions. My approach was to set up the parpool with 6 workers (12 threads) and leave 2 workers (4 threads) out of the pool, hoping those 4 threads would be used for the min() and max() multithreading. Is that right? I thought that using all workers within the parpool would disable the multithreading of min() and max() and slow things down.
  10 Comments
Mike Croucher on 17 Jan 2025
@Lorenco Santos Vasconcelos Yes, it runs. Thank you. I spent some time trying to make it faster but failed. I also tried using a GPU and it didn't work well.
I think that your strategy of parallelising at the level above this function is the right approach. Threads on parpools are complicated things, and the behaviour of the number of threads per worker depends on whether you have a process pool or a thread pool.
You said this "My approach was to set the parpool with 6 workers (12 threads) and left 2 workers (4 threads) out of the parpool willing that these 4 threads will be used for the min() and max() multi-threading. Is that right? I thought that using all workers within the parpool would disable the multi-threading of min() and max() and get things slower."
This is not correct. I'll write up how it works in a blog post soon.
What I suggest is taking an empirical approach. You say you have 2300 iterations of this per dataset. I would set up a version that works through 100-200 iterations and simply time it: first using no pool, then a pool of 2 workers, 4 workers, and so on. Whatever is fastest is your sweet spot.
If you work at an institution that has a High Performance Computing cluster, even better.
It's not always the case that using all available cores gives the fastest execution, and there can be many reasons for this. It's often algorithm- and hardware-dependent.
Lorenco Santos Vasconcelos on 22 Jan 2025 at 11:27
Hi @Mike Croucher! Thank you very much for your comments.
Yes, after your comments I realized that my approach (setting the parpool with 6 workers and leaving 2 workers out of it, hoping their threads would serve the min()/max() multithreading) is not the right way to do this.
Then I thought of the following: start a parpool with the Processes profile, but control the number of threads used by each process with
pctRunOnAll maxNumCompThreads(nThreads)
This way I could use, for example, 2 or 4 threads inside each worker, so those threads would be used for the built-in multithreading of max and min. Example: 4 processes, each with 4 threads available via maxNumCompThreads(4). Each process would then do part of the 2300 parfor iterations, and each iteration would have 4 threads to speed up the min and max calls, right?
I've tried several combinations: 8 processes with 2 threads each, 4 processes with 4 threads, and so on. Surprisingly, none of them was better than parpool('threads',8). My machine is an Intel Xeon W-11955M with 8 physical cores (16 logical cores).
When I run with parpool('threads',8), i.e. using all physical cores, can min and max still use their built-in multithreading, or do they become sequential?
I think my main question is: since my time-consuming function uses a lot of min and max, is there a way to run it inside a parfor (to perform the iterations in parallel) and still ensure that min and max use multithreading? Would that configuration, if possible, speed things up, or is just using parfor with the maximum available hardware, forgetting about min/max multithreading, the best way to go?
I could not think of any code changes to speed up the FRI function.


Accepted Answer

Chris Turnes on 22 Jan 2025 at 14:32
Edited: Chris Turnes on 22 Jan 2025 at 14:34
I think it will be somewhat hard to make something drastically faster than what you have, mostly because of the sheer amount of computation. The non-linear nature of the combined min-max means that there is a lot of computation that can't really be avoided. That said, a lot of the effort is going toward building the array X, and I think there is an opportunity to cut it down a good bit.
For the computation of any given value of X, we're looking at a comparison of some row (call it j) with all column shifts of another row (call it k) except the 0-shift; i.e., we circularly shift the second row but ignore its original alignment. For a row k, call its maximum value Mk and its second-largest value (the penultimate maximum) Pk. Then for any pair of rows, looking at both in their original ordering, we have one of two scenarios:
  • If Mj and Mk occur in different columns, then the value of X for that pair is min(Mj, Mk) (the minimum of the two row maxima). You can reason your way to this by arguing that you need to find the pair of row values with the largest minimum, and the largest possible minimum is the maximum of the row that has the lower maximum.
  • If Mj and Mk occur in the same column, then by similar logic the value of X for that pair must be max(min(Mj, Pk), min(Pj, Mk)).
This means to compute X, we only have to keep track of the maxima and penultimate maxima of each row, and we don't actually have to compute any explicit column shifts. What is particularly nice about this is that it makes the computational cost of getting X independent of the actual value of d (though you can't say the same for V, but I don't see a way around that).
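The two bullet cases can be sanity-checked numerically. Below is a small NumPy sketch (not part of this answer's MATLAB code; mixed_max_brute and mixed_max_fast are names I made up) comparing the exhaustive mixed-column maximum against the closed form built from the maxima and penultimate maxima:

```python
import numpy as np

def mixed_max_brute(a, b):
    """max over i != p of min(a[i], b[p]), by exhaustive search."""
    m = np.minimum(a[:, None], b[None, :])
    np.fill_diagonal(m, -np.inf)  # skip the i == p (0-shift) pairs
    return m.max()

def mixed_max_fast(a, b):
    """Same quantity using only each row's maximum M and penultimate maximum P."""
    sa, sb = np.sort(a), np.sort(b)
    Ma, Pa = sa[-1], sa[-2]
    Mb, Pb = sb[-1], sb[-2]
    if np.argmax(a) != np.argmax(b):
        return min(Ma, Mb)                # maxima in different columns
    return max(min(Ma, Pb), min(Pa, Mb))  # maxima in the same column
```

Running both on random rows of length 2 to 5 gives identical results, which matches the case analysis above.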
Here's a version of the code that does this, with comments that hopefully explain the logic along the way. I also included a slightly modified version of your original code so we can verify both give the same results (I incorporated Jeremy's point about single precision).
%% Set up.
N = 30000;
d = 4;
A = rand(N,d,'single');
[N,d] = size(A); % A is a partition matrix
NC = fix(N/2);
numelements = N*(N-1)/2;
%% Circshifted version from the post:
tic;
V = zeros(numelements,1,"like",A);
X = zeros(numelements,1,"like",A);
temp = zeros(N,1,"like", A);
for nc1 = 1:NC-1
    Ac = circshift(A,-nc1,1);
    V((nc1-1)*N + 1 : nc1*N) = max(min(A,Ac),[],2);
    X((nc1-1)*N + 1 : nc1*N) = func1(A,Ac,temp,d);
    % temp is never reassigned here, so it doesn't need to be re-initialized
    %temp(:) = 0;
end
if N/2 - NC == 0
    V((NC-1)*N + 1 : end) = max(min(A(1:N/2,:),A(N/2+1:end,:)),[],2);
    X((NC-1)*N + 1 : end) = func2(A,N,d,temp(1:N/2));
else
    Ac = circshift(A,-NC,1);
    V((NC-1)*N + 1 : end) = max(min(A,Ac),[],2);
    X((NC-1)*N + 1 : end) = func1(A,Ac,temp,d); % pass temp (zeros), not N
end
toc
Elapsed time is 10.231751 seconds.
%% A version using sorting to reduce computation for X.
tic;
V2 = zeros(numelements,1,'single');
X2 = zeros(numelements,1,'single');
% Get the sort order of each *row* of A.
[As,mcol] = sort(A, 2);
% Only bother keeping track of the maxima and penultimate maxima.
As = As(:, (end-1):end);
% Flip the positions of the maxima and penultimate maxima to help
% us make more efficient comparisons.
Asc = As(:, [2 1]);
% Keep around the indices telling us which row contributed the maxima.
mcol = mcol(:,end);
mcols = mcol;
for nc1 = 1:NC-1
    Ac = circshift(A,-nc1,1);
    % Compute V as before:
    V2((nc1-1)*N + 1 : nc1*N) = max(min(A,Ac),[],2);
    % Compute X:
    Asc = circshift(Asc, -1, 1);
    mcols = circshift(mcols, -1, 1);
    % temp1 holds max(min(Mk,Pj), min(Pk,Mj))
    temp1 = max(min(As, Asc), [], 2);
    % temp2 holds min(Mj,Mk)
    temp2 = min(As(:,2), Asc(:,1));
    % Find which pairs (same argmax column or not) should use temp1 vs temp2
    mask = mcol == mcols;
    X2((nc1-1)*N + 1 : nc1*N) = temp1.*mask + temp2.*~mask;
end
% I ignored the last iteration for simplicity, but you could surely take the same
% approach. It's a small fraction of the runtime so it shouldn't matter.
if N/2 - NC == 0
    temp = zeros(N, 1, 'single');
    V2((NC-1)*N + 1 : end) = max(min(A(1:N/2,:),A(N/2+1:end,:)),[],2);
    X2((NC-1)*N + 1 : end) = func2(A,N,d,temp(1:N/2));
else
    Ac = circshift(A,-NC,1);
    V2((NC-1)*N + 1 : end) = max(min(A,Ac),[],2);
    X2((NC-1)*N + 1 : end) = func1(A,Ac,zeros(N,1,'single'),d); % zeros, not N
end
toc;
Elapsed time is 5.435947 seconds.
% Make sure everything lines up.
assert(isequal(V, V2));
assert(isequal(X, X2));
%% Helper functions
function temp = func1(A,Ac,temp,d)
    for nc2 = 1:d-1
        temp = max(temp,max(min(A,circshift(Ac,-nc2,2)),[],2));
    end
end
function temp = func2(A,N,d,temp)
    for nc2 = 1:d-1
        temp = max(temp,max(min(A(1:N/2,:),circshift(A(N/2+1:end,:),-nc2,2)),[],2));
    end
end
  5 Comments
Chris Turnes on 22 Jan 2025 at 19:27
Edited: Chris Turnes on 27 Jan 2025 at 14:09
Well, as a caveat: if you're going to try to use MEX to speed this up, you'll want to make sure you have some kind of threading framework to place it in. The current MATLAB implementation is heavily multithreaded, so a plain serial MEX implementation would likely be a good bit slower.
I think the main potential gains come from two places:
  • No explicit copies would be necessary: rather than circularly shifting, you can just keep track of pointer offsets and array extents.
  • You wouldn't need to compute intermediate results like min(A,Ac) as full arrays; you could fuse them into a single calculation. This also means that for individual rows, you could use the maxima + penultimate-maxima trick to skip computation in V as well, when possible.
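To make the second point concrete, here is a minimal sketch (Python standing in for C, with a made-up name v_pair_fused) of fusing the min and the max reduction into one pass, so that no row-sized temporary is ever allocated:

```python
def v_pair_fused(a, b):
    """max_i min(a[i], b[i]) for one pair of rows, in a single fused pass.

    In a MEX/C version the same loop would walk raw pointers with an offset
    instead of materializing circshift copies and the min(A, Ac) temporary.
    """
    best = float('-inf')
    for ai, bi in zip(a, b):        # one pass over the d columns
        m = ai if ai < bi else bi   # t-norm: min
        if m > best:                # s-norm: running max
            best = m
    return best
```

With d as small as 2 to 5, the win comes less from the loop itself than from never writing the intermediate min array to memory.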
Lorenco Santos Vasconcelos on 27 Jan 2025 at 14:23
Yes, I think keeping things in MATLAB only is better for now because of the multithreading. Using MEX would probably require considerable effort for a small gain. At this point, I think optimizing the computation of V is what would help a little more, but it is more intricate. I'll try to come up with something; if I get anything faster than the current max(min(A,Ac),[],2), I'll let you know.


More Answers (1)

Jeremy Hughes on 21 Jan 2025 at 15:48
Following on the comments above:
"where I did the trick using circshifts and have used single type" - @Lorenco Santos Vasconcelos
I checked, and using single doesn't appear to improve performance unless you use single everywhere. Since your input is double, mixing in single doesn't actually do what you expect; the singles are likely being upcast internally.
But if you use single throughout, I see a difference (from about ~13 seconds on my machine down to ~9). That might not be what you want if you need to preserve the input precision, but if you really don't care about that level of precision, cast A to single and get a ~25% speedup without changing the code. This is my version, with temporaries removed and made agnostic to the type of A.
function [V,X] = repro_func(A)
    [N,~] = size(A); % A is a partition matrix
    NC = fix(N/2);
    numelements = N*(N-1)/2;
    V = zeros(numelements,1,"like",A);
    X = zeros(numelements,1,"like",A);
    for nc1 = 1:NC-1
        Ac = circshift(A,-nc1,1);
        V((nc1-1)*N + 1 : nc1*N) = max(min(A,Ac),[],2);
        X((nc1-1)*N + 1 : nc1*N) = minmax1(A,Ac);
    end
    if N/2 - NC == 0
        V((NC-1)*N + 1 : end) = max(min(A(1:N/2,:),A(N/2+1:end,:)),[],2);
        X((NC-1)*N + 1 : end) = minmax2(A);
    else
        Ac = circshift(A,-NC,1);
        V((NC-1)*N + 1 : end) = max(min(A,Ac),[],2);
        X((NC-1)*N + 1 : end) = minmax1(A,Ac); % Ac is needed here too
    end
end
function res = minmax1(A,Ac)
    sz = size(A);
    res = zeros(sz(1),1,"like",A);
    for nc2 = 1:width(A)-1
        res = max(res,max(min(A,circshift(Ac,-nc2,2)),[],2));
    end
end
function res = minmax2(A)
    sz = size(A);
    res = zeros(sz(1)/2,1,"like",A);
    for nc2 = 1:width(A)-1
        res = max(res,max(min(A(1:sz(1)/2,:),circshift(A(sz(1)/2+1:end,:),-nc2,2)),[],2));
    end
end
Using circshift lets you take advantage of vectorization in min, at the cost of allocating more data, but I don't think it's the optimal way to do what you want.
You probably want to write out the nested loops and vectorize the innermost call to get the most speed.
I don't think this does exactly what you want, but it calculates the minimum between each element and each of the following elements, then takes the max of those mins. It's much faster and visits the same set of points:
function M = pairminmax(A)
    arguments
        A(:,1) {mustBeNumeric}
    end
    M = zeros(numel(A),1,'like',A);
    for ii = 1:numel(A) - 1
        M(ii) = max(min(A(ii),A((ii+1):numel(A))));
    end
end
And this is the naive version, which just uses singleton expansion (and does twice as many comparisons) and is only slightly slower than the one above:
max(min(A(:),A(:)'),[],2)
If I had the expressions written out, I could probably craft something specific to this case, but I hope this gives you enough to go on to write a faster loop.
  2 Comments
Lorenco Santos Vasconcelos on 21 Jan 2025 at 17:20
Hi @Jeremy Hughes! Thank you for taking a look into this.
You're right about the single-type comment: it only makes sense when A is already of single type. I forgot to change A = rand(30000,4); to A = rand(30000,4,'single'); when creating the test matrix.
I'll test your pairminmax function to see the impact and come back to tell you.
I'm sending the expressions here so you can analyze further and maybe come up with a better (faster) idea.
Here are the definitions of the terms:
V_i: fuzzy set of pairs of data objects (j,k) belonging to the i-th class in A. The membership degree of each pair of objects is given by the truth value of the proposition "object j belongs to the i-th class AND object k belongs to the i-th class", that is, V_i(j,k) = t(h_ji, h_ki), where "t" is a triangular norm (t-norm, such as min or the algebraic product) used as a conjunction to implement the connective "AND" of the proposition.
X_ip: fuzzy set of pairs of data objects (j,k) belonging to different classes i and p in A. The membership degree of each pair of objects is given by the truth value of the proposition "object j belongs to the i-th class AND object k belongs to the p-th class", that is, X_ip(j,k) = t(h_ji, h_kp).
V: fuzzy set of pairs of data objects belonging to the same class in A. The membership degree of each pair of objects is given by the truth value of the proposition "(object j belongs to the 1st class AND object k belongs to the 1st class) OR (object j belongs to the 2nd class AND object k belongs to the 2nd class) OR ... OR (object j belongs to the d-th class AND object k belongs to the d-th class)", that is, V(j,k) = s_i( t(h_ji, h_ki) ), where "s" is a triangular co-norm (s-norm, e.g. max or probabilistic OR) used as a disjunction to implement the connective "OR" of the proposition.
X: fuzzy set of pairs of data objects belonging to different classes in A. The membership degree of each pair of objects is given by the truth value of the disjunction (connective "OR") of the propositions "(object j belongs to the i-th class AND object k belongs to the p-th class)" for all i ≠ p, that is, X(j,k) = s_{i≠p}( t(h_ji, h_kp) ).
When using min as the t-norm and max as the s-norm, the expressions are:
V(j,k) = max_i( min(h_ji, h_ki) ) and X(j,k) = max_{i≠p}( min(h_ji, h_kp) ).
Lorenco Santos Vasconcelos on 22 Jan 2025 at 12:12
Yes, your pairminmax function does not do what I need.
Matrix A has N rows (data points) by d columns (clusters); it is a fuzzy partition matrix.
What I need is to compute the pairwise minimum within each column and then take the max across each row.
The first part is correctly computed (for a single column) by
for ii = 1:numel(A) - 1
    min(A(ii),A((ii+1):numel(A)));
end
but this needs to be done for each column of A.
This will result in a matrix with N*(N-1)/2 rows and d columns.
Then I compute the max of each row.
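This two-step recipe can be written directly in NumPy as a semantic reference (a hedged sketch; V_direct is my own name, and it uses the pdist-style j < k pair ordering rather than the circshift ordering, so its values match the FRI code's V only as an unordered collection):

```python
import numpy as np

def V_direct(H):
    """Pairwise column-wise minima for rows j < k, then max across the d columns."""
    N = H.shape[0]
    out = []
    for j in range(N - 1):
        # broadcast row j against all later rows: an (N-j-1) x d block of mins,
        # then apply the max s-norm across the columns
        out.append(np.minimum(H[j], H[j + 1:]).max(axis=1))
    return np.concatenate(out)
```

The intermediate (N-j-1) x d blocks stay small because d is only 2 to 5, which is the same reason the column-wise formulation is memory-friendly in MATLAB.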
I've written it with only nested loops in the most basic way (the FRI2 function), and the performance is worse:
A = rand(30000,4,'single'); % just for test
tic
FRI(A);
toc
tic
FRI2(A);
toc
function [V,X] = FRI(A)
    [N,d] = size(A); % A is a partition matrix
    NC = fix(N/2);
    numelements = N*(N-1)/2;
    V = zeros(numelements,1,'single');
    X = zeros(numelements,1,'single');
    temp = zeros(N,1,'single');
    for nc1 = 1:NC-1
        Ac = circshift(A,-nc1,1);
        V((nc1-1)*N + 1 : nc1*N) = max(min(A,Ac),[],2);
        X((nc1-1)*N + 1 : nc1*N) = func1(A,Ac,temp,d);
        temp(:) = 0;
    end
    if N/2 - NC == 0
        V((NC-1)*N + 1 : end) = max(min(A(1:N/2,:),A(N/2+1:end,:)),[],2);
        X((NC-1)*N + 1 : end) = func2(A,N,d,temp(1:N/2));
    else
        Ac = circshift(A,-NC,1);
        V((NC-1)*N + 1 : end) = max(min(A,Ac),[],2);
        X((NC-1)*N + 1 : end) = func1(A,Ac,temp,d); % pass temp (zeros), not N
    end
end
function temp = func1(A,Ac,temp,d)
    for nc2 = 1:d-1
        temp = max(temp,max(min(A,circshift(Ac,-nc2,2)),[],2));
    end
end
function temp = func2(A,N,d,temp)
    for nc2 = 1:d-1
        temp = max(temp,max(min(A(1:N/2,:),circshift(A(N/2+1:end,:),-nc2,2)),[],2));
    end
end
function [V,X] = FRI2(A)
    [N,d] = size(A);
    numelements = N*(N-1)/2;
    V = zeros(numelements,1,'single');
    X = zeros(numelements,1,'single');
    k = 0;
    for ii = 1:N-1
        V(k + 1 : k + N - ii) = max(min(A(ii,:),A((ii+1):N,:)),[],2);
        for jj = 1:d-1
            for kk = jj+1:d
                X(k + 1 : k + N - ii) = max(X(k + 1 : k + N - ii),max(min(A(ii,jj),A((ii+1):N,kk)),min(A(ii,kk),A((ii+1):N,jj))));
            end
        end
        k = k + N - ii;
    end
end


Release: R2023a
