MATLAB Coder: generated C code for general functions has a big performance problem

As far as I know, the C code/MEX file generated for the MATLAB built-in function imwarp performs much worse than the open-source library OpenCV. I originally wanted to use MATLAB Coder/Embedded Coder to generate C code and deploy the algorithm, but found that performance is a big problem. Below is a simple function to test this; "myEntry" is the entry-point function for code generation. I also wrote the same functionality in a few lines of C++ with OpenCV. The performance comparison follows:
function outImg = myEntry(inImg)%#codegen
% Alternatively, use "tform = rigidtform2d(30,[100,200]);". The matrix
% below is taken directly from tform.A.
% Note: pass the matrix as "single". With "double" input, rigidtform2d is
% very sensitive and readily rejects this rounded matrix as invalid!
tform = rigidtform2d(single([0.8660 -0.5000 100.0000
0.5000 0.8660 200.0000
0 0 1.0000]));
% get output view full range
[h,w,~] =size(inImg);
xLimitsIn = [0.5,w+0.5];
yLimitsIn = [0.5,h+0.5];
[xLimitsOut,yLimitsOut] = outputLimits(tform,xLimitsIn,yLimitsIn);
% global full view range
xWorldLimits = [min(xLimitsIn(1),xLimitsOut(1)),max(xLimitsIn(2),xLimitsOut(2))];
yWorldLimits = [min(yLimitsIn(1),yLimitsOut(1)),max(yLimitsIn(2),yLimitsOut(2))];
W = round(xWorldLimits(2)-xWorldLimits(1));
H = round(yWorldLimits(2)-yWorldLimits(1));
outputView = imref2d([H,W],xWorldLimits,yWorldLimits);
% warp image
outImg = imwarp(inImg,tform,'OutputView',outputView);
end
The following script was then used to generate the MEX/C code and time it:
%% opencv VS matlab origin code VS generate mex/C performance
%% generate mex file
inImg = im2single(imread("peppers.png"));
codegen -config:mex myEntry -args {inImg}
%% test speed
num = 1000;
t1 = tic;
for i = 1:num
outImg1 = myEntry(inImg);
end
t = toc(t1);
t2 = tic;
for i = 1:num
outImg2 = myEntry_mex(inImg);
end
tt = toc(t2);
fprintf("origin matlab code take time:%.5f seconds,generate C/mex code take time:%.5f seconds\n",t,tt);
figure;imshowpair(outImg1,outImg2,'montage')
Code generation successful.
origin matlab code take time:5.53857 seconds,generate C/mex code take time:5.46453 seconds
The two take about the same amount of time. It looks like imwarp calls a pre-compiled library, so even though C code is generated, there is no significant speed advantage!
For the same functionality, I implemented it again using OpenCV C++ as follows:
cv::Mat img = cv::imread("peppers.png");
cv::Mat dst;
size_t nums = 1000;
double t1 = cv::getTickCount();
for (size_t i = 0; i < nums; i++) {
cv::Mat transMat = (cv::Mat_<float>(2, 3) << 0.8660, -0.5000, 100.0000,
0.5000, 0.8660, 200.0000);
// Compute the maximum extent that contains the target image
std::vector<cv::Point2f> srcCorners = {cv::Point2f(0, 0), cv::Point2f(img.cols, 0), cv::Point2f(img.cols, img.rows), cv::Point2f(0, img.rows)};
std::vector<cv::Point2f> dstCorners;
cv::transform(srcCorners, dstCorners, transMat); // corresponds to MATLAB's transformPointsForward
dstCorners.insert(dstCorners.end(), srcCorners.begin(), srcCorners.end());
cv::Rect outputView = cv::boundingRect(dstCorners);
// Translate into the visible region
transMat.colRange(2, 3) = transMat.colRange(2, 3) - (cv::Mat_<float>(2, 1) << outputView.x, outputView.y);
cv::warpAffine(img, dst, transMat, cv::Size(outputView.width, outputView.height));
}
double t2 = cv::getTickCount();
std::printf("it take time:%.5f seconds. dst image size:(%d*%d)\n", (t2 - t1) * 1.0 / cv::getTickFrequency(), dst.rows, dst.cols);
it take time:0.79880 seconds. dst image size:(789*636)
As you can see above, the performance gap between imwarp (built-in or generated code) and OpenCV is about 6.89 times, so performance does seem to be an issue.
Run in R2022b, Windows 10.

Accepted Answer

Matan Silver, 7 Nov 2022

Hello,
I took a look at the generated code for these reproduction steps and found that a large performance penalty is paid in runtime integrity checks. These checks are generated by default for MEX and cover things like index bounds, integer overflow, divide by zero, and more. I don't know how OpenCV works under the hood, but it likely does not perform as many runtime checks. You can turn these checks off by changing the codegen script to:
cfg = coder.config('mex');
cfg.IntegrityChecks = false;
codegen -config cfg myEntry -args {inImg}
This could give a closer comparison between generated code and OpenCV. Running your repro after making those changes, I see the following significant improvement:
>> doit
Code generation successful.
origin matlab code take time:4.58261 seconds,generate C/mex code take time:1.37438 seconds
Hopefully that helps. Note that turning off IntegrityChecks has the downside that invalid input to the MEX can potentially cause the MEX to crash.
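To give a feel for the trade-off, here is a minimal C++ sketch. It is not the code MATLAB Coder emits, and the function names are hypothetical. The first function validates every index at run time, in the spirit of integrity checks; the second only asserts, so the check disappears when compiled with -DNDEBUG.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustration only: NOT the code MATLAB Coder emits. Names are hypothetical.

// Every access is validated at run time, in the spirit of integrity checks.
float sumChecked(const std::vector<float>& data,
                 const std::vector<std::size_t>& indices) {
    float sum = 0.0f;
    for (std::size_t i : indices) {
        if (i >= data.size()) {
            throw std::out_of_range("index exceeds array bounds");
        }
        sum += data[i];
    }
    return sum;
}

// The check is only an assert, so it vanishes when compiled with -DNDEBUG.
// An invalid index is then undefined behavior instead of a clean error.
float sumUnchecked(const std::vector<float>& data,
                   const std::vector<std::size_t>& indices) {
    float sum = 0.0f;
    for (std::size_t i : indices) {
        assert(i < data.size());
        sum += data[i];
    }
    return sum;
}
```

With valid indices both functions return the same sum; only the checked version reports a bad index cleanly, which mirrors the crash risk mentioned above.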
Matan

5 Comments

xingxingcui, 8 Nov 2022 (edited 8 Nov 2022)
@Matan Silver Thanks for pointing out the problem, but on my computer the code you provided does not give such a large speedup.
cfg = coder.config( "mex" );
cfg.ExtrinsicCalls = false;
cfg.IntegrityChecks = false;
cfg.SaturateOnIntegerOverflow = false;
cfg.ResponsivenessChecks = false;
cfg.NumberOfCpuThreads = 16;
I have also tried the additional options above to improve efficiency further, and the printed time is:
origin matlab code take time:5.06883 seconds,generate C/mex code take time:2.18149 seconds
Is there anything else that can be done to further improve execution efficiency, for example by embedding OpenCV library functions directly into the generated C/C++ code?
------------------
Another question is how to investigate which part of the MATLAB code causes the MEX file to run slowly. For example, is the built-in imwarp function the main cause of the poor performance in the example above? If so, would it be possible to supply a user-implemented "efficient C function" in place of the built-in imwarp?
Thank you again!
Matan Silver, 8 Nov 2022
Hello,
You found some other good config parameters to turn off to improve the runtime performance of the generated code.
Regarding embedding calls to OpenCV in the generated code: it might be possible using "coder.ceval", but we should think about this benchmark a bit more first.
I'd like to suggest we restructure the benchmark in a way that treats codegen more fairly compared to OpenCV. Even after we've turned off the runtime checks, responsiveness checks, etc., the MEX is doing extra work: it has to convert the MATLAB value into a C representation on entry, and convert the output back to the MATLAB representation before returning. This is called "marshalling". OpenCV does not interface directly with MATLAB, so it does not have that overhead. We can keep this repeated overhead out of the benchmark by moving the for loop and the tic/toc into the entry-point function. This way, we only marshal inputs and outputs once instead of 1000 times.
Additionally, it doesn't look like the OpenCV code uses the output value, so the MATLAB version has the additional overhead of storing the output image in a buffer and returning it to MATLAB. We can also try tricks like using "parfor" to generate OpenMP code. If we combine all this (loop inside the entry-point, tic/toc inside the entry-point, parfor, ignoring the transformation output), we might write the following code:
myEntry.m:
function outT = myEntry(inImg)%#codegen
tic;
parfor i = 1:1000
myEntrySingle(inImg);
end
outT = toc;
end
function outImg = myEntrySingle(inImg)
% Alternatively, use "tform = rigidtform2d(30,[100,200]);". The matrix
% below is taken directly from tform.A.
% Note: pass the matrix as "single". With "double" input, rigidtform2d is
% very sensitive and readily rejects this rounded matrix as invalid!
tform = rigidtform2d(single([0.8660 -0.5000 100.0000
0.5000 0.8660 200.0000
0 0 1.0000]));
% get output view full range
[h,w,~] =size(inImg);
xLimitsIn = [0.5,w+0.5];
yLimitsIn = [0.5,h+0.5];
[xLimitsOut,yLimitsOut] = outputLimits(tform,xLimitsIn,yLimitsIn);
% global full view range
xWorldLimits = [min(xLimitsIn(1),xLimitsOut(1)),max(xLimitsIn(2),xLimitsOut(2))];
yWorldLimits = [min(yLimitsIn(1),yLimitsOut(1)),max(yLimitsIn(2),yLimitsOut(2))];
W = round(xWorldLimits(2)-xWorldLimits(1));
H = round(yWorldLimits(2)-yWorldLimits(1));
outputView = imref2d([H,W],xWorldLimits,yWorldLimits);
% warp image
outImg = imwarp(inImg,tform,'OutputView',outputView);
end
doit.m:
%% opencv VS matlab origin code VS generate mex/C performance
%% generate mex file
inImg = im2single(imread("peppers.png"));
cfg = coder.config( "mex" );
cfg.ExtrinsicCalls = false;
cfg.IntegrityChecks = false;
cfg.SaturateOnIntegerOverflow = false;
cfg.ResponsivenessChecks = false;
cfg.NumberOfCpuThreads = 16;
codegen -config cfg myEntry -args {inImg}
%% test speed
t1 = myEntry(inImg);
t2 = myEntry_mex(inImg);
fprintf("origin matlab code take time:%.5f seconds,generate C/mex code take time:%.5f seconds\n",t1,t2);
which results in the following runtime performance for me:
>> doit
Code generation successful.
origin matlab code take time:2.53481 seconds,generate C/mex code take time:0.65263 seconds
Make sure to run doit twice, because the first run takes a while to start up the parallel pool. That looks a bit better! If you wanted, you could also level the playing field in the other direction (e.g. write a hand-written MEX wrapper around your OpenCV code and call it from MATLAB).
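The effect of crossing the MEX boundary on every iteration can be modelled roughly as follows. This is a C++ sketch with hypothetical names (kernel, callThroughBoundary, loopOutside, loopInside); plain buffer copies stand in for marshalling, and the real MEX entry code is more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many elements were copied across the (modelled) boundary.
struct CopyStats {
    std::size_t elementsCopied = 0;
};

// Stand-in for the real work. Hypothetical; this is not imwarp.
void kernel(const std::vector<float>& in, std::vector<float>& out) {
    assert(in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = 2.0f * in[i];
    }
}

// Models one call through the boundary: copy in, compute, copy out.
std::vector<float> callThroughBoundary(const std::vector<float>& hostIn,
                                       CopyStats& stats) {
    std::vector<float> nativeIn(hostIn);            // "marshal" the input
    stats.elementsCopied += nativeIn.size();
    std::vector<float> nativeOut(nativeIn.size());
    kernel(nativeIn, nativeOut);
    std::vector<float> hostOut(nativeOut);          // "marshal" the output
    stats.elementsCopied += hostOut.size();
    return hostOut;
}

// Benchmark loop outside the boundary: n crossings, 2*n copies.
CopyStats loopOutside(const std::vector<float>& hostIn, int n) {
    CopyStats stats;
    for (int i = 0; i < n; ++i) {
        callThroughBoundary(hostIn, stats);
    }
    return stats;
}

// Benchmark loop inside the boundary: one crossing, 2 copies in total.
CopyStats loopInside(const std::vector<float>& hostIn, int n) {
    CopyStats stats;
    std::vector<float> nativeIn(hostIn);
    stats.elementsCopied += nativeIn.size();
    std::vector<float> nativeOut(nativeIn.size());
    for (int i = 0; i < n; ++i) {
        kernel(nativeIn, nativeOut);
    }
    std::vector<float> hostOut(nativeOut);
    stats.elementsCopied += hostOut.size();
    return stats;
}
```

For a buffer of N elements and 1000 iterations, loopOutside copies 2000*N elements while loopInside copies 2*N, which is the overhead the restructured benchmark removes.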
Regarding your last question about investigating performance inefficiencies in generated code: it can be tricky because we don't always know exactly what the generated code is doing, but looking at the generated code can be helpful. That's how I found the integrity check calls, which suggested turning them off. In addition, we want to make sure we actually measure the same thing when benchmarking codegen against hand-written code or third-party libraries. Putting tic/toc and any benchmarking loops inside the entry-point can help with that. You can also generate standalone code and write benchmarking code in C/C++ that links against it. One trick is to use cfg.EnableMexProfiling and the MATLAB profiler ("profile on; mexFcn; profile viewer") to see if there are any long-running functions in the generated code. If you have tried all of these options, you can always use profiling tools like VTune or Valgrind to debug performance or memory issues.
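For the standalone route, a benchmarking harness in C++ might look like the sketch below. The helper name timeLoop is hypothetical, and it uses std::chrono from the standard library so the same harness can wrap either the generated entry point or a hand-written routine.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs fn `iterations` times and returns the elapsed wall-clock seconds.
// timeLoop is a hypothetical helper, not part of MATLAB Coder or OpenCV.
double timeLoop(const std::function<void()>& fn, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Assuming the generated entry point has the signature myEntry(const float*, float*), it could be timed with timeLoop([&] { myEntry(in, out); }, 1000), and the OpenCV loop body could be passed the same way so that both sides are measured identically.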
I don't think "imwarp" is the source of the differences in the benchmark here. As you can see, when we do a more apples-to-apples comparison, the performance looks good.
Matan
xingxingcui, 9 Nov 2022 (edited 9 Nov 2022)
Thank you very much for your detailed and insightful comments, which have helped me better understand the ins and outs of generating C/C++ code. Your answers are very professional and deserve recognition! What you said about doing a fair comparison on the same benchmark makes sense, and I will follow up with further testing on Ubuntu in addition to Windows to verify. Thank you again! @Matan Silver
--------------
Under the most consistent benchmark possible (no parfor, no "marshalling" effects, and the target platform set to "MATLAB Host Computer"), I regenerated the standalone C code and measured both implementations separately with the CPU profiler from gperftools. The performance was comparable, because "imwarp" uses a pre-compiled, platform-specific shared library for the "MATLAB Host Computer" target.
Note: when I configure cfg for standalone C code instead of a MEX target,
cfg = coder.config( "lib", "ecoder", true ); % regardless of whether the ecoder option is true or false
cfg does not have the "IntegrityChecks" property.
I don't know why; if anyone knows, please let me know.
See the attached performance report. (Code generated with R2022b; compiled with gcc 9.4 on Ubuntu 20.04.)
// Evaluation of the MATLAB-generated code
static float outImg[1503045]; // output buffer: 789 * 635 * 3 floats
static float fv[589824];      // input buffer: 384 * 512 * 3 floats (peppers.png)
cv::Mat inImg, matBigImg;
int rows = 789;
int cols = 635;
int channels = 3;
img.convertTo(inImg, CV_32FC3, 1.0 / 255);
convertCVToMatrix(inImg, fv); // "marshalling"
ProfilerStart("./matlab_codegen.prof");
t1 = cv::getTickCount();
for (size_t i = 0; i < nums; i++) {
/* Call the entry-point 'myEntry'. */
myEntry(fv, outImg); // main evaluation function
}
t2 = cv::getTickCount();
std::printf("it take time:%.5f seconds. \n", (t2 - t1) * 1.0 / cv::getTickFrequency());
ProfilerStop();
convertToMat(outImg, rows, cols, channels, matBigImg);// "marshalling"
cv::imwrite("matlab_codegen_dst.jpg", matBigImg);
/* Terminate the application.
You do not need to do this more than one time. */
myEntry_terminate();
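The bodies of convertCVToMatrix and convertToMat are not shown in this thread. As a rough idea of what the input-side conversion has to do, here is a C++ sketch that works on raw float buffers rather than cv::Mat. It assumes the generated code uses MATLAB's default column-major layout with one plane per channel, and that the source is row-major with interleaved channels as in OpenCV. The function name and the reverseChannels flag (for BGR to RGB) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a helper like convertCVToMatrix.
// src: rows x cols x channels, row-major, channels interleaved.
// dst: column-major, one full plane per channel.
// reverseChannels flips the channel order, e.g. BGR to RGB.
void interleavedToPlanar(const float* src, int rows, int cols, int channels,
                         bool reverseChannels, float* dst) {
    assert(src != nullptr && dst != nullptr);
    assert(rows > 0 && cols > 0 && channels > 0);
    const std::size_t planeSize = static_cast<std::size_t>(rows) * cols;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            for (int k = 0; k < channels; ++k) {
                const int kOut = reverseChannels ? channels - 1 - k : k;
                const std::size_t srcIdx =
                    (static_cast<std::size_t>(r) * cols + c) * channels + k;
                const std::size_t dstIdx =
                    kOut * planeSize + static_cast<std::size_t>(c) * rows + r;
                dst[dstIdx] = src[srcIdx];
            }
        }
    }
}
```

Because this conversion runs once before the timed loop in the snippet above, it stays out of the measurement, which is the point of keeping the "marshalling" steps outside ProfilerStart/ProfilerStop.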
Matan Silver, 9 Nov 2022
Hello,
IntegrityChecks only exists on MEX configs. In standalone code, there is a different property called RuntimeChecks. Because it's off by default, you shouldn't need to change it. But you can read more about that here:
Matan
xingxingcui, 10 Nov 2022
@Matan Silver Thank you very much again!
