# Improving the performance of matrix multiplication and division with nested for loops

4 ビュー (過去 30 日間)
Matthew Kehoe 2021 年 7 月 25 日
コメント済み: Matthew Kehoe 2021 年 7 月 31 日
My Matlab code has a section of code which repeatedly performs matrix multiplication and division (through the backslash operator =A\b). When testing large datasets, a certain function named solvebvp_colloc.m is called millions of times. As this function contains multiple matrix multiplications and divisions, it tends to take up a lot of runtime. I'm curious if I could rewrite solvebvp_collect.m (or the code before solvebvp_collect.m) in order to improve performance. I'm not sure if matrix decomposition will help or if I could remove one of the for loops referencing j=1:Nx completely.
My original implementation is shown below. The function solvebvp_colloc.m is harmful to the overall performance of my Matlab code.
% Test Data
N = 3;
M = 3;
% In real test scenarios, I would have larger N,M such as N=M=30
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
% My production data doesn't have rand(). I am only creating random
% matrices to test peformance.
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_new = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 150; % I'm creating a ntests variable to test running the code below multiple times. The
% ntests variable is not part of my production code (and is only used for testing performance).
tic
% Original Implementation
for ii=1:ntests
for n=0:N
for m=0:M
for j=1:Nx
Fnmhat_p = Fnmhat(j,:).';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p(j)^2 - 2*alpha*p(j);
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(j,n+1,m+1);
d_max = -1i*gammap(j);
n_max = 1.0;
r_max = Jnmhat(j);
% This subroutine is expensive for large N,M. The problem is that
% solvebvp_collec.m needs to run against three for loops from
% 1:N,1:M,1:Nx and can be called millions of times.
uhat_p = solvebvp_colloc(Fnmhat_p,alphaalpha,betabeta,gammagamma,...
d_min,n_min,r_min,d_max,n_max,r_max,identy,D,D2);
Uhat(j,:) = uhat_p.';
end
end
end
end
toc
where the function solvebvp_colloc is of the form
function [utilde] = solvebvp_colloc(ftilde,alpha,beta,gamma,d_min,n_min,...
r_min,d_max,n_max,r_max,identy,D,D2)
A = alpha*D2 + beta*D + gamma*identy;
b = ftilde;
A(end,:) = n_min*D(end,:);
A(end) = A(end) + d_min;
b(end) = r_min;
A(1,:) = n_max*D(1,:);
A(1,1) = A(1,1) + d_max;
b(1) = r_max;
utilde = A\b;
return;
This code is slow for large N,M (large being anything over N=M=10. I would consider N=M=30 as a resonable test case for testing large datasets.) I think that my code design has three major performance flaws including:
1. The function solvebvp_colloc.m is processed a large number of times (as it needs to run for all values of the for loops 1:N, 1:M, and 1:Nx). When N and M are large (say N=M=20), the profiler shows that this function is called 20,829,312 times and takes a total runtime of 1042 seconds or 17.3 minutes. The operation utilde = A\b is expensive and calling this millions of times takes up a lot of runtime. Calling the matrix multiplication operations in solvebvp_colloc.m is also expensive.
2. Running an extra for loop with j=1:Nx creates larger runtime.
3. I don't use matrix decomposition. Since the matrix A changes in every iteration of j (because of the value of gammagamma is different for different values of j), I'm not sure if I can use matrix decomposition to speed up this calculation.
My initial thought was to avoid writing the additional for loop from j=1:Nx. I implemented this in a second impementation below. I wasn't able to figure out how to move the matrix A out of the for loop because the size of gammagamma is (Nx,1) and the size of the identity matrix (Nx+1,Nx+1) are different.
% Test Data
N = 3;
M = 3;
% In real test scenarios, I would have larger N,M such as N=M=30
Nx = 32; % Ny=Nx=Nz
Nz = 32;
Ny = 32;
% My production data doesn't have rand(). I am only creating random
% matrices to test peformance.
Fnmhat = rand(Nx,Nz+1);
Jnmhat = rand(Nx,1);
xi_n_m_hat = rand(Nx,N+1,M+1);
Uhat = zeros(Nx,Nz+1);
Uhat_new = zeros(Nx,Nz+1);
identy = eye(Ny+1,Ny+1);
p = rand(Nx,1);
gammap = rand(Nx,1);
D = rand(Nx+1,Ny+1);
D2 = rand(Nx+1,Ny+1);
gamma = 1.5;
alpha = 0; % this could be non-zero
ntests = 150;
% New Implementation
tic
for ii=1:ntests
for n=0:N
for m=0:M
% Moved outside of the for loop
Fnmhat_p = Fnmhat.';
alphaalpha = 1.0;
betabeta = 0.0; % this could be non-zero
gammagamma = gamma*gamma - p.^2 - 2*alpha.*p; % size (Nx,1)
d_min = 1.0;
n_min = 0.0; % this could be non-zero
r_min = xi_n_m_hat(:,n+1,m+1);
d_max = -1i.*gammap;
n_max = 1.0;
r_max = Jnmhat;
% Moved b outside of the for loop
b = Fnmhat_p;
% I don't know how to move A outside of the for loop as gammagamma
% is of size(Nx,1) and identy is of size(Ny+1,Ny+1) so writing
% A = alphaalpha*D2 + betabeta*D + gammagamma.*identy
% wouldn't work since the two matrices are of different sizes.
for j=1:Nx
uhat_p_new = solvebvp_colloc_new(b,alphaalpha,betabeta,gammagamma,...
d_min,n_min,r_min,d_max,n_max,r_max,identy,D,D2,j);
Uhat_new(j,:) = uhat_p_new.';
end
end
end
end
toc
diff = norm(Uhat - Uhat_new,inf); % is zero
where the new function solvebvp_colloc_new.m is written as
function [utilde] = solvebvp_colloc_new(b,alpha,beta,gamma,d_min,n_min,...
r_min,d_max,n_max,r_max,identy,D,D2,j)
A = alpha*D2 + beta*D + gamma(j)*identy;
A(end,:) = n_min*D(end,:);
A(end) = A(end) + d_min;
b(end,j) = r_min(j);
A(1,:) = n_max*D(1,:);
A(1,1) = A(1,1) + d_max(j);
b(1,j) = r_max(j);
utilde = A\b(:,j);
return;
A local test shows that the second implementation is "slightly faster."
Elapsed time of original implementation is 4.087457 seconds.
Elapsed time of new implementation 3.934619 seconds.
For large M,N (say N=M=20 or N=M=30) this difference would be larger. However, I wasn't able to significantly speed up the calculation because I still do a large amount of matrix multiplications and divisions inside the new function solvebvp_colloc_new.m. I'm interested in analyzing how I can speed up this calculation (I will need to test N=M=20 and N=M=30) and the profiler shows that this calculation takes up about 1/3 of the total runtime. I'm interested in analyzing the following performance improvements:
1. Can Matlab matrix decomposition help here even though A changes for every value of j (since gammagamma is different for every value of j)?
2. Is it possible to avoid writing the for loop from j=1:Nx in order to perform utilde = A\b less and matrix multiplication less?
3. Are there any other performance improvements that would speed up this calculation (the function solvebvp_colloc.m is called millions of times for large N,M)? I think that bsxfun may help where the backslash operator becomes utilde = bsxfun(@rdivide, A, b); although this doesn't match the dimensions of the inner for loop through j.

サインインしてコメントする。

### 採用された回答

Chunru 2021 年 7 月 25 日

Analyse your program to see if the matrix A is independent of which loop variables (it seems A is independent to some loop variables at the first look of the program). Compute A outside the loops of the independent variables. In inner-loop contactate the right-hand b into a matrix B. Then solve multiple linear systems with the same A in one-go: X = A \ B. This way you will only do one matrix "division" for all the independent loops (if any) and hopefully the performance can be improved.
If A is changing over all loop variable, then you may not be able to improve the performance drastically.
##### 4 件のコメント表示非表示 3 件の古いコメント
Matthew Kehoe 2021 年 7 月 31 日
I've optimized some of the code and have asked a different question. I am going to close this question by accepting your answer.

サインインしてコメントする。

R2020a

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by