Fast Elementwise Matrix-Multiplications

Question

0 votes

Hello,

I'm trying to optimize some code for speed and my code currently has a few bottlenecks in lines where a lot of elementwise multidimensional matrix-matrix multiplications are performed.

A simplified example:

M1=rand(1e2,2e2,1e4);
M2=rand(1e2,2e2,1e4);
M3=rand(1e2,2e2,1e4);
% .. and more
M = M1.*M2 + M1.*M3 + M2.*M3; % ... actually more multiplications
% example lines:
% detJdr = dxdrr.*dyds.*dzdt-dxdrr.*dydt.*dzds-dxdrs.*dydr.*dzdt+dxdrs.*dydt.*dzdr+dxdrt.*dydr.*dzds-dxdrt.*dyds.*dzdr-dydrr.*dxds.*dzdt+dydrr.*dxdt.*dzds+dydrs.*dxdr.*dzdt-dydrs.*dxdt.*dzdr-dydrt.*dxdr.*dzds+dydrt.*dxds.*dzdr+dzdrr.*dxds.*dydt-dzdrr.*dxdt.*dyds-dzdrs.*dxdr.*dydt+dzdrs.*dxdt.*dydr+dzdrt.*dxdr.*dyds-dzdrt.*dxds.*dydr;
% drdxJdt = dydst.*dzdt-dydtt.*dzds-dzdst.*dydt+dzdtt.*dyds;

The matrix sizes used in the example indicate typical sizes used in the actual code.

I've already tried MATLAB Coder to auto-generate MEX files, but the generated code was much longer and ran slightly slower.

Can anyone comment on whether my code would benefit from implementing these routines in C/Fortran, where I could use v?Mul for fast elementwise multiplications? Or does MATLAB already use these routines for the .* operation?

2 Comments

Jan on 11 Jun 2019

It would be useful if you mentioned what exactly "actually more multiplications" means. Code is easier to optimize when it is known exactly which code is meant.

I guess that specific C code can be faster. The current example looks like there is a pattern in the matrices to be multiplied, perhaps the sum of the products of all pairs of inputs. If you define this pattern, it can be exploited.

thengineer on 11 Jun 2019

@Jan: Thanks for looking into this. I've updated my question with some typical lines. As you can see, the pattern is covered in the minimal example above. I am actually looking for a fast implementation of the times() operator, rather than a C-coded version of my whole function. That would be more flexible.

Answer 1

Jan on 11 Jun 2019

Edited: Jan on 11 Jun 2019

1 vote

Of course times is already implemented efficiently and most likely uses the MKL, but this is not documented, and reverse engineering MATLAB is prohibited by the license conditions. There is no magically faster version of times for users who need more speed.

If the matrices are used multiple times in the list of multiplications, this can be exploited to gain more speed. Unfortunately, this example does not shed light on the underlying pattern:

dxdrr.*dyds.*dzdt-dxdrr.*dydt.*dzds-dxdrs.*dydr.*dzdt+ ...
dxdrs.*dydt.*dzdr+dxdrt.*dydr.*dzds-dxdrt.*dyds.*dzdr-...
dydrr.*dxds.*dzdt+dydrr.*dxdt.*dzds+dydrs.*dxdr.*dzdt-...
dydrs.*dxdt.*dzdr-dydrt.*dxdr.*dzds+dydrt.*dxds.*dzdr+...
dzdrr.*dxds.*dydt-dzdrr.*dxdt.*dyds-dzdrs.*dxdr.*dydt+...
dzdrs.*dxdt.*dydr+dzdrt.*dxdr.*dyds-dzdrt.*dxds.*dydr

If you know which pattern is applied here, please explain it. This would be more efficient than letting me guess what the pattern is. Once the pattern is known, you can apply arithmetic simplifications to reduce the number of multiplications, e.g.:

dxdrr .* dyds .* dzdt - dxdrr .* dydt .* dzds
% ==>
dxdrr .* (dyds .* dzdt - dydt .* dzds)
% 3 instead of 4 multiplications!
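
The same factoring can be applied to the full detJdr line from the question. The following is only a sketch, not the poster's actual routine: the variable names are taken from the question, while the array size and the random test data are placeholders chosen so that the snippet runs on its own. Grouping the 18 terms by their second-derivative factor reduces the count from 36 to 27 elementwise multiplications.

```matlab
% Placeholder test data: names follow the question, the size is arbitrary.
sz = [4, 5, 6];
dxdr  = rand(sz); dxds  = rand(sz); dxdt  = rand(sz);
dydr  = rand(sz); dyds  = rand(sz); dydt  = rand(sz);
dzdr  = rand(sz); dzds  = rand(sz); dzdt  = rand(sz);
dxdrr = rand(sz); dxdrs = rand(sz); dxdrt = rand(sz);
dydrr = rand(sz); dydrs = rand(sz); dydrt = rand(sz);
dzdrr = rand(sz); dzdrs = rand(sz); dzdrt = rand(sz);

% Expanded form as posted: 18 terms with 2 multiplications each (36 total).
detJdrExpanded = ...
    dxdrr.*dyds.*dzdt - dxdrr.*dydt.*dzds - dxdrs.*dydr.*dzdt + ...
    dxdrs.*dydt.*dzdr + dxdrt.*dydr.*dzds - dxdrt.*dyds.*dzdr - ...
    dydrr.*dxds.*dzdt + dydrr.*dxdt.*dzds + dydrs.*dxdr.*dzdt - ...
    dydrs.*dxdt.*dzdr - dydrt.*dxdr.*dzds + dydrt.*dxds.*dzdr + ...
    dzdrr.*dxds.*dydt - dzdrr.*dxdt.*dyds - dzdrs.*dxdr.*dydt + ...
    dzdrs.*dxdt.*dydr + dzdrt.*dxdr.*dyds - dzdrt.*dxds.*dydr;

% Factored form: 9 groups with 3 multiplications each (27 total).
detJdrFactored = ...
    dxdrr.*(dyds.*dzdt - dydt.*dzds) - dxdrs.*(dydr.*dzdt - dydt.*dzdr) + ...
    dxdrt.*(dydr.*dzds - dyds.*dzdr) - dydrr.*(dxds.*dzdt - dxdt.*dzds) + ...
    dydrs.*(dxdr.*dzdt - dxdt.*dzdr) - dydrt.*(dxdr.*dzds - dxds.*dzdr) + ...
    dzdrr.*(dxds.*dydt - dxdt.*dyds) - dzdrs.*(dxdr.*dydt - dxdt.*dydr) + ...
    dzdrt.*(dxdr.*dyds - dxds.*dydr);

% The two forms should agree up to floating-point rounding.
d = detJdrExpanded - detJdrFactored;
maxDiff = max(abs(d(:)));
fprintf('largest absolute difference: %g\n', maxDiff);
```

If the same bracketed sub-expressions recur in other lines of the routine, they could be computed once and reused; whether they do is not visible from the lines posted in the question.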

Try it:

M1=rand(1e2,2e2,1e4);
M2=rand(1e2,2e2,1e4);
M3=rand(1e2,2e2,1e4);
tic;
for k = 1:100
    y2 = M1 .* M2 + M2 .* M3 + M1 .* M3;
end
toc
tic;
for k = 1:100
    y1 = M1 .* (M2 + M3) + M2 .* M3;
end
toc
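
For a less noisy comparison than a single tic/toc pair, the two forms can also be timed with timeit, which calls the function handle several times. This is a sketch under the assumption that arrays smaller than those in the question are representative enough; the sizes and the names expanded, factored, tExpanded and tFactored are chosen here for illustration only.

```matlab
% Smaller arrays than in the question, so the sketch runs with modest memory.
M1 = rand(100, 200, 100);
M2 = rand(100, 200, 100);
M3 = rand(100, 200, 100);

expanded = @() M1 .* M2 + M2 .* M3 + M1 .* M3;   % 3 multiplications, 2 additions
factored = @() M1 .* (M2 + M3) + M2 .* M3;       % 2 multiplications, 2 additions

% timeit runs each handle repeatedly and returns a typical run time in seconds.
tExpanded = timeit(expanded);
tFactored = timeit(factored);
fprintf('expanded: %.4f s, factored: %.4f s\n', tExpanded, tFactored);

% Both forms should agree up to floating-point rounding.
d = expanded() - factored();
maxDiff = max(abs(d(:)));
fprintf('largest absolute difference: %g\n', maxDiff);
```

The measured ratio depends on the machine, the number of threads and the MATLAB release, so it is worth repeating the measurement with the real array sizes.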

2 Comments

thengineer on 12 Jun 2019

Well then the good news is that the code is already as fast as it can be :) (except for potential algebraic optimizations of course)

Just out of curiosity: Do you think a complete implementation of my routine, which has almost 200 lines of such multiplications, would give a performance gain?

Jan on 12 Jun 2019

@thengineer: A "complete implementation"? A performance gain compared to what?

Exploiting the underlying pattern to reduce the number of arithmetic operations is the first step in optimizing code. By reordering the terms you can save 33% of the multiplications in my example. It is hard to beat this by using highly optimized libraries. First reduce the work, then try to work more efficiently.

The next step is to use efficient methods for the memory access: RAM is accessed in blocks (cache lines, typically 64 bytes) that are transferred to the CPU cache. Therefore neighboring elements are faster to process than elements far away from each other.
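
MATLAB stores arrays in column-major order, so elements that differ only in the first index are neighbors in memory. A minimal sketch of what this means for explicit loops follows; the matrix size is an arbitrary choice, and the size of the timing gap depends on the machine and the MATLAB version.

```matlab
A = rand(2000, 2000);   % arbitrary test size

% Inner loop over the first index: consecutive memory addresses.
tic;
s1 = 0;
for j = 1:size(A, 2)
    for i = 1:size(A, 1)
        s1 = s1 + A(i, j);
    end
end
tColumnwise = toc;

% Inner loop over the second index: each access jumps by one full column.
tic;
s2 = 0;
for i = 1:size(A, 1)
    for j = 1:size(A, 2)
        s2 = s2 + A(i, j);
    end
end
tRowwise = toc;

fprintf('column-wise: %.4f s, row-wise: %.4f s\n', tColumnwise, tRowwise);
```

Vectorized operations such as .* already walk through memory in storage order, so this matters mainly for hand-written loops and for C code called through MEX.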

If the array sizes exceed the available RAM, splitting the work into blocks is useful to avoid using the slow disk as virtual RAM.
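
Splitting the work into blocks could look roughly like the following sketch. The sizes and the name blockSize are hypothetical, and the inputs are kept fully in memory here for simplicity; with arrays that really exceed the RAM, each block would have to be loaded or generated inside the loop instead. Slicing along the last dimension keeps each block contiguous in memory.

```matlab
% Hypothetical sizes, much smaller than in the question.
n3 = 400;
blockSize = 64;                      % number of pages processed per step
M1 = rand(50, 60, n3);
M2 = rand(50, 60, n3);
M3 = rand(50, 60, n3);

M = zeros(size(M1));
for first = 1:blockSize:n3
    last = min(first + blockSize - 1, n3);
    idx = first:last;
    % Temporaries created here have the size of one block only.
    A = M1(:, :, idx);
    B = M2(:, :, idx);
    C = M3(:, :, idx);
    M(:, :, idx) = A .* (B + C) + B .* C;
end

% Compare with the unblocked computation.
Mfull = M1 .* (M2 + M3) + M2 .* M3;
d = M - Mfull;
maxDiff = max(abs(d(:)));
fprintf('largest absolute difference: %g\n', maxDiff);
```

When everything fits into RAM, the blocked version is not expected to be faster than the single vectorized line; the point is only to bound the peak memory use.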

Code optimization is a serious job. Speculation helps to get motivated to try it, but it remains a trial-and-error procedure.

Answer 2

James Tursa on 11 Jun 2019

Edited: James Tursa on 11 Jun 2019

1 vote

The element-wise times operation in MATLAB is already multi-threaded. You are not going to beat it by writing your own low level code.
