The following MEX code runs too slowly, but I don't know why or how to make it faster. Any help is greatly appreciated!
calculate_my_way.cpp
#include "mex.hpp"
#include "mexAdapter.hpp"
#include <cmath>

class MexFunction : public matlab::mex::Function {
public:
    void operator()(matlab::mex::ArgumentList outputs, matlab::mex::ArgumentList inputs) {
        // Four double arrays; var0 and var1 are overwritten and returned as the outputs
        matlab::data::TypedArray<double> var0 = inputs[0];
        matlab::data::TypedArray<double> var1 = inputs[1];
        matlab::data::TypedArray<double> var2 = inputs[2];
        matlab::data::TypedArray<double> var3 = inputs[3];
        auto var0Iter = var0.begin();
        auto var1Iter = var1.begin();
        auto var2Iter = var2.begin();
        auto var3Iter = var3.begin();
        // No argument checks: assumes every input has at least as many elements as var0
        const int numOfElements = var0.getNumberOfElements();
        double buffer = 0;
        for (int x = 0; x < numOfElements; x++)
        {
            buffer = std::sin(*var0Iter) + std::sin(*var1Iter) + std::sin(*var2Iter) + std::cos(*var3Iter);
            *var0Iter = buffer;  // first result written back into var0
            buffer = std::sin(*var1Iter + *var2Iter) + std::cos(*var3Iter);
            *var1Iter = buffer;  // second result written back into var1
            var0Iter++;
            var1Iter++;
            var2Iter++;
            var3Iter++;
        }
        outputs[0] = std::move(var0);
        outputs[1] = std::move(var1);
    }
};
It's just a simple calculation, but this code runs even slower than MATLAB's own distance function, which performs a much more complicated calculation than a few sin and cos calls.
I'm using the compiler that came with Visual Studio 2017. Below is how I run mex, along with the compiler setup info.
mex -v calculate_my_way.cpp
...
Compiler location: C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\
...
OPTIMFLAGS : /O2 /Oy- /DNDEBUG
And this is how I measure the performance:
clear
size_test = 1e7;
var1 = zeros(size_test, 1);
var2 = zeros(size_test, 1);
var3 = zeros(size_test, 1);
var4 = zeros(size_test, 1);
cant_beat_me = @() distance(var1,var2,var3,var4);
distance_time = timeit(cant_beat_me)   % Mapping Toolbox distance
mex_slow = @() calculate_my_way(var1,var2,var3,var4);
mex_time = timeit(mex_slow)            % C++ MEX file

15 Comments

Rik on 2 Nov 2022
Apart from the segfault if var1 is longer than the others, did you try with a random test set as well? The distance function may have some calls optimized away.
I might be able to try this code on my desktop later today.
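Rik's segfault point is that the loop trusts every input to have at least as many elements as var0. A minimal sketch of the missing check in plain C, so it runs outside MATLAB; all_lengths_match is a hypothetical helper, and in a real MEX gateway the lengths would come from mxGetNumberOfElements (C API) or getNumberOfElements() (C++ API):

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 when every input has the same number of
   elements as the first one, 0 otherwise. A MEX gateway would call this
   before the loop and raise an error instead of reading past the end of
   a shorter array. */
int all_lengths_match(const size_t *lengths, size_t count)
{
    size_t k;
    for (k = 1; k < count; k++) {
        if (lengths[k] != lengths[0]) {
            return 0;
        }
    }
    return 1;
}
```

In a C MEX gateway, a zero return would typically be turned into an error with mexErrMsgIdAndTxt before any data pointer is dereferenced. The check runs once per call, so its cost is negligible next to 1e7 sin/cos evaluations.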
Walter Roberson on 2 Nov 2022
buffer = std::sin(*var0Iter) + std::sin(*var1Iter) + std::sin(*var2Iter) + std::cos(*var3Iter);
*var0Iter = buffer;
buffer = std::sin(*var1Iter + *var2Iter) + std::cos(*var3Iter);
You calculate std::cos(*var3Iter) twice
Yifan Lin on 2 Nov 2022
@Rik, I'm not concerned about safety just yet. I can't even catch up without safety checks; how could I catch up with additional argument checks?
@Walter Roberson, sure, I did. But I bet the distance function repeats more than just one sin calculation.
I'm guessing this comes down to the compiler? Does MATLAB use an Intel compiler that I don't have?
Bruno Luong on 2 Nov 2022
Edited: Bruno Luong on 2 Nov 2022
"I'm guessing this is a compiler choice? Does matlab uses intel compiler that I don't have?"
I have the Intel compiler, so I can test it.
But MATLAB can implement this with vectorized arithmetic and multi-threading; you could do the same with OpenMP.
There are a few people here who work miracles with MEX programming, James Tursa and Jan Simon to name a few, but I believe they are C-oriented rather than C++.
Walter Roberson on 2 Nov 2022
Which distance function are you comparing to?
Yifan Lin on 2 Nov 2022
@Bruno Luong Thanks! It'd be great if you could test it with the Intel compiler! I wish I could use C, but I probably can't, since I want to be able to call some other functions that are written in C++.
@Walter Roberson, the Mapping Toolbox distance function. I chose it purely for convenience, since it also has 4 inputs and 2 outputs. Distance between points on sphere or ellipsoid - MATLAB distance (mathworks.com)
Walter Roberson on 2 Nov 2022
The Mapping Toolbox distance() function is not written as a MEX file; you can read its MATLAB source code. The code converts the angles to radians, uses its local function greatcircledist() to compute the distance with the haversine formula, and then does something involving atan2() that I do not recognize at the moment, at least for the default calculation. A different code path is taken if you use some of the options.
Bruno Luong on 2 Nov 2022
timeit results for your code with the VS compiler and the Intel oneAPI compiler (2022):
VS_elapsed_time % 0.1795
Intel_elapsed_time % 0.1781
Bruno Luong on 2 Nov 2022
Edited: Bruno Luong on 2 Nov 2022
Obviously, the run time of evaluating cos/sin depends on the data.
Comparison between MATLAB and C++ with zero data:
clear
size_test = 1e7;
var1 = zeros(size_test, 1);
var2 = zeros(size_test, 1);
var3 = zeros(size_test, 1);
var4 = zeros(size_test, 1);
cant_beat_me = @() distance(var1,var2,var3,var4);
mex_slow = @() calculate_my_way(var1,var2,var3,var4);
MATLAB_elapsed_time = timeit(cant_beat_me) % 0.0274
Intel_elapsed_time = timeit(mex_slow) % 0.1803
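% Local function with the same arithmetic as the MEX file; it takes precedence over the Mapping Toolbox distance
function [out0,out1] = distance(var0, var1, var2, var3)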
out0 = sin(var0) + sin(var1) + sin(var2) + cos(var3);
out1 = sin(var1 + var2) + cos(var3);
end
With random data:
clear
size_test = 1e7;
var1 = 2*pi*rand(size_test, 1);
var2 = 2*pi*rand(size_test, 1);
var3 = 2*pi*rand(size_test, 1);
var4 = 2*pi*rand(size_test, 1);
cant_beat_me = @() distance(var1,var2,var3,var4);
mex_slow = @() calculate_my_way(var1,var2,var3,var4);
MATLAB_elapsed_time = timeit(cant_beat_me) % 0.1560
Intel_elapsed_time = timeit(mex_slow) % 0.5101
The factor of
>> 0.5101/0.156
ans =
3.2699
could well be explained by multi-threading.
Yifan Lin on 2 Nov 2022
@Bruno Luong Thanks! And darn. I guess it's not the compiler, then? So I think what's left to try is:
  1. Do this in C, to eliminate any possible C++ MEX overhead.
  2. Vectorized arithmetic with multi-threading, using OpenMP as you suggested.
Bruno Luong on 2 Nov 2022
Or stay with MATLAB?
Yifan Lin on 2 Nov 2022
@Bruno Luong It'd be nice to stay with MATLAB, but my code is just an example of the eventual implementation, which won't be just simple sin/cos. Right now I'm trying to understand whether MEX can actually achieve the performance I need.
Yifan Lin on 2 Nov 2022
@Bruno Luong Thanks again! I will definitely give OpenMP a try!
Bruno Luong on 3 Nov 2022
Out of curiosity, I coded the same calculation in C. The time is 0.24 s: twice as fast as C++ (0.5 s) but 60% slower than MATLAB (0.147 s).
/* mex -g -R2018a calculate_C_way.c
   -R2018a selects the interleaved complex API, which provides mxGetDoubles;
   -g adds debugging symbols and disables compiler optimization */
#include "mex.h"
#include <math.h>

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int i, n;
    double *var0Iter, *var1Iter, *var2Iter, *var3Iter, *out0Iter, *out1Iter;

    n = (int)mxGetNumberOfElements(prhs[0]);

    /* Fresh output arrays shaped like the first input;
       the inputs themselves are only read, never written */
    plhs[0] = mxCreateNumericMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxDOUBLE_CLASS, mxREAL);
    plhs[1] = mxCreateNumericMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxDOUBLE_CLASS, mxREAL);

    var0Iter = mxGetDoubles(prhs[0]);
    var1Iter = mxGetDoubles(prhs[1]);
    var2Iter = mxGetDoubles(prhs[2]);
    var3Iter = mxGetDoubles(prhs[3]);
    out0Iter = mxGetDoubles(plhs[0]);
    out1Iter = mxGetDoubles(plhs[1]);

    /* Walk all six arrays in lockstep with pointer increments */
    for (i = 0; i < n; i++) {
        *out0Iter = sin(*var0Iter) + sin(*var1Iter) + sin(*var2Iter) + cos(*var3Iter);
        *out1Iter = sin(*var1Iter + *var2Iter) + cos(*var3Iter);
        out0Iter++;
        out1Iter++;
        var0Iter++;
        var1Iter++;
        var2Iter++;
        var3Iter++;
    }
}
Yifan Lin on 3 Nov 2022
@Bruno Luong, thanks! I was also curious and wanted to give this a try, but you beat me to it! Yes, apparently the C++ API is slower than the C API for MATLAB; see this post: Is C++ MEX API significantly slower than the C MEX API? - MATLAB Answers - MATLAB Central (mathworks.com). I've also tried OpenMP as you suggested, but since I was using VS2017, I couldn't use #pragma omp simd. I'll wait for my VS2019 install to finish and try again there with the C API.


Accepted Answer

Bruno Luong on 3 Nov 2022
Edited: Bruno Luong on 3 Nov 2022


One last experiment: timing with C and OpenMP, Intel Parallel Studio XE 2022:
CIntel_elapsed_time = 0.0574 [sec]
That's 2.5 times faster than MATLAB (finally I beat MATLAB).
To get a fast MEX file: use the C API (not C++), make it multi-threaded, and choose a decent compiler.
/* Compile with the Intel compiler:
   mex -O COMPFLAGS="$COMPFLAGS /MD /Qopenmp" -R2018a calculate_C_way.c */
#include "mex.h"
#include <math.h>

/* Set to 1 to enable OpenMP,
   to 0 to disable it */
#define OPENMP_FLAG 1
#if OPENMP_FLAG == 1
#include <omp.h>
#endif

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    int i, n;
    double *var0Iter, *var1Iter, *var2Iter, *var3Iter, *out0Iter, *out1Iter;

    n = (int)mxGetNumberOfElements(prhs[0]);

    /* Outputs take the shape of the first input */
    plhs[0] = mxCreateNumericMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxDOUBLE_CLASS, mxREAL);
    plhs[1] = mxCreateNumericMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxDOUBLE_CLASS, mxREAL);

    var0Iter = mxGetDoubles(prhs[0]);
    var1Iter = mxGetDoubles(prhs[1]);
    var2Iter = mxGetDoubles(prhs[2]);
    var3Iter = mxGetDoubles(prhs[3]);
    out0Iter = mxGetDoubles(plhs[0]);
    out1Iter = mxGetDoubles(plhs[1]);

#if OPENMP_FLAG==1
    /* Static schedule: each thread gets one contiguous block of indices.
       Indexing with [i] (not pointer increments) lets any thread start anywhere. */
#pragma omp parallel for default(none) private(i) \
    schedule(static) \
    shared(n, out0Iter, out1Iter, var0Iter, var1Iter, var2Iter, var3Iter)
#endif
    for (i = 0; i < n; i++) {
        out0Iter[i] = sin(var0Iter[i]) + sin(var1Iter[i]) + sin(var2Iter[i]) + cos(var3Iter[i]);
        out1Iter[i] = sin(var1Iter[i] + var2Iter[i]) + cos(var3Iter[i]);
    }
}

2 Comments

Yifan Lin on 3 Nov 2022
@Bruno Luong Thank you very much!!!! This is exactly what I was looking for!
James Tursa on 7 Nov 2022
Typically, instead of this
#define OPENMP_FLAG 1
#if OPENMP_FLAG == 1
#include <omp.h>
#endif
you can use this:
#ifdef _OPENMP
#include <omp.h>
#endif
The _OPENMP macro is defined by the compiler when OpenMP support is enabled for the build.
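The same guard also covers calls into the OpenMP runtime, not just the #include. A minimal C sketch, assuming nothing beyond the standard omp_get_max_threads() routine; available_threads is a hypothetical helper that falls back to 1 when the file is compiled without OpenMP:

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: number of threads a parallel region may use.
   Without OpenMP the code compiles to a constant 1. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Compared with a hand-maintained OPENMP_FLAG, this cannot get out of sync with the compiler switches: dropping /openmp or /Qopenmp from the build line is enough to get the serial version.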


More Answers (1)

Bruno Luong on 2 Nov 2022
Edited: Bruno Luong on 2 Nov 2022


I don't know C++ well, but I have practiced MEX in C quite a lot.
It looks like these statements just move a bunch of data:
outputs[0] = std::move(var0);
outputs[1] = std::move(var1);
Also, I wonder whether your inputs 0 and 1 would change
*var0Iter = buffer;
...
*var1Iter = buffer;
after calling the mex, which is NOT allowed.

2 Comments

Yifan Lin on 2 Nov 2022
@Bruno Luong! Another one of your answers here helped me tremendously a few years back! Thank you!
I've tested the var0 and var1 values: they did change, and they get moved to the outputs.
So for [a,b] = calculate_my_way(0,0,0,0), a and b will both be 1.
I suspect this slowness may be due to one of the following:
1. MSVC is not as good as the compiler MathWorks uses (probably Intel Parallel Studio).
2. Calling a C++ MEX function may carry some massive overhead that I'm not aware of.
3. I'm just doing something wrong in my C++ code?
Bruno Luong on 2 Nov 2022
" Another one of your answer here helped me tremendously a few years back! thank you! "
Oh... really glad to read that...


Category: Write C Functions Callable from MATLAB (MEX Files)

Release: R2019b

Asked: 1 Nov 2022

Last commented: 7 Nov 2022
