MATLAB Coder: Generating C code for general functions is a big performance problem

Question


As far as I know, the C code/MEX file generated for the MATLAB built-in function imwarp performs far worse than the open-source library OpenCV. I originally wanted to use MATLAB Coder/Embedded Coder to generate C code for deploying my algorithm, but found that performance is a big problem. To test this, I wrote the simple function below; "myEntry" is the entry-point function from which the C code is generated. I then implemented the same functionality in a few lines of C++. The performance comparison follows:

function outImg = myEntry(inImg)%#codegen
% Alternatively use "tform = rigidtform2d(30,[100,200]);". The matrix below
% is taken directly from tform.A.
% Note: the matrix is passed as single. With double input the rounded values
% are much more likely to be rejected as an invalid rigid transformation.
tform = rigidtform2d(single([0.8660   -0.5000  100.0000 
                      0.5000    0.8660  200.0000
                      0         0    1.0000]));
% get output view full range
[h,w,~] =size(inImg);
xLimitsIn = [0.5,w+0.5];
yLimitsIn = [0.5,h+0.5];
[xLimitsOut,yLimitsOut] = outputLimits(tform,xLimitsIn,yLimitsIn);
% global full view range
xWorldLimits = [min(xLimitsIn(1),xLimitsOut(1)),max(xLimitsIn(2),xLimitsOut(2))];
yWorldLimits = [min(yLimitsIn(1),yLimitsOut(1)),max(yLimitsIn(2),yLimitsOut(2))];
W = round(xWorldLimits(2)-xWorldLimits(1));
H = round(yWorldLimits(2)-yWorldLimits(1));
outputView = imref2d([H,W],xWorldLimits,yWorldLimits);
% warp image
outImg = imwarp(inImg,tform,'OutputView',outputView);
end

I then used the following script to generate the MEX file/C code:

%% opencv VS matlab origin code VS generate mex/C performance
%% generate mex file 
inImg = im2single(imread("peppers.png"));
codegen -config:mex myEntry -args {inImg}
%% test speed
num = 1000;
t1 = tic;
for i = 1:num
    outImg1 = myEntry(inImg);
end
t = toc(t1);
t2 = tic;
for i = 1:num
    outImg2 = myEntry_mex(inImg);
end
tt = toc(t2);
fprintf("origin matlab code take time:%.5f seconds,generate C/mex code take time:%.5f seconds\n",t,tt);
figure;imshowpair(outImg1,outImg2,'montage')

Code generation successful.

origin matlab code take time:5.53857 seconds,generate C/mex code take time:5.46453 seconds

The two take about the same amount of time. It looks like imwarp uses a pre-compiled library, so even the generated C code has no significant speed advantage!

I then implemented the same functionality in C++ with OpenCV:

#include <cstdio>
#include <vector>
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat img = cv::imread("peppers.png");
    cv::Mat dst;
    size_t nums = 1000;
    double t1 = cv::getTickCount();
    for (size_t i = 0; i < nums; i++) {
        cv::Mat transMat = (cv::Mat_<float>(2, 3) << 0.8660, -0.5000, 100.0000,
                            0.5000, 0.8660, 200.0000);
        // Compute the largest extent that contains the target image
        std::vector<cv::Point2f> srcCorners = {cv::Point2f(0, 0), cv::Point2f(img.cols, 0), cv::Point2f(img.cols, img.rows), cv::Point2f(0, img.rows)};
        std::vector<cv::Point2f> dstCorners;
        cv::transform(srcCorners, dstCorners, transMat);  // corresponds to MATLAB's transformPointsForward
        dstCorners.insert(dstCorners.end(), srcCorners.begin(), srcCorners.end());
        cv::Rect outputView = cv::boundingRect(dstCorners);
        // Shift the translation so the result lands in the visible region
        transMat.colRange(2, 3) = transMat.colRange(2, 3) - (cv::Mat_<float>(2, 1) << outputView.x, outputView.y);
        cv::warpAffine(img, dst, transMat, cv::Size(outputView.width, outputView.height));
    }
    double t2 = cv::getTickCount();
    std::printf("it take time:%.5f seconds. dst image size:(%d*%d)\n", (t2 - t1) * 1.0 / cv::getTickFrequency(), dst.rows, dst.cols);
    return 0;
}
    

it take time:0.79880 seconds. dst image size:(789*636)

As shown above, both the original imwarp code and the generated MEX are roughly 6.8 to 6.9 times slower than OpenCV (5.46 to 5.54 seconds versus 0.80 seconds for 1000 iterations), so performance does seem to be an issue.

Run in R2022b on Windows 10.


Answer 1

Matan Silver on 7 Nov 2022


Hello,

I took a look at the generated code for these reproduction steps and found that a large performance penalty is paid in run-time integrity checks. These checks are generated by default for MEX targets and cover things like index bounds, integer overflow, division by zero, and more. I don't know how OpenCV works under the hood, but it is likely not doing as many run-time checks. You can turn off these checks by changing the codegen script to run:

cfg = coder.config('mex');
cfg.IntegrityChecks = false;
codegen -config cfg myEntry -args {inImg}

This could give a closer comparison between the generated code and OpenCV. Running your repro after making those changes, I see the following significant improvement:

>> doit

Code generation successful.

origin matlab code take time:4.58261 seconds,generate C/mex code take time:1.37438 seconds

Hopefully that helps. Note that turning off IntegrityChecks has the downside that invalid input to the MEX can potentially cause the MEX to crash.

Matan


xingxingcui on 8 Nov 2022

Edited: xingxingcui on 8 Nov 2022

@Matan Silver Thanks for pointing out the problem, but the code you provided does not give such a large speedup on my computer.

cfg = coder.config( "mex" );
cfg.ExtrinsicCalls = false;
cfg.IntegrityChecks = false;
cfg.SaturateOnIntegerOverflow = false;
cfg.ResponsivenessChecks = false;
cfg.NumberOfCpuThreads = 16;

I also tried the additional settings above to improve efficiency further. The timing is printed as:

origin matlab code take time:5.06883 seconds,generate C/mex code take time:2.18149 seconds

Is there anything else that can be done to further improve execution efficiency, such as the possibility of embedding OpenCV library functions directly into the generated C/C++ code?

------------------

Another question: how can I investigate which part of the MATLAB code causes the poor execution efficiency of the MEX file? For example, is the built-in imwarp function the main cause of the poor performance in the example above? If so, is it possible to supply a user-implemented "efficient C function" in place of the built-in imwarp?

Thank you again!

Matan Silver on 8 Nov 2022

Hello,

You found some good other config parameters to turn off to try to improve the runtime performance of the generated code.

Regarding embedding calls to OpenCV in the generated code, it might be possible using "coder.ceval", but we should think about this benchmark a bit more.

I'd like to suggest we restructure the benchmark in a way that treats codegen more fairly compared to OpenCV. Even after we've turned off the runtime checks, responsiveness checks, etc., the MEX is doing extra work. The MEX has to convert the MATLAB value into a C representation on entry, and then it has to convert the output back to the MATLAB representation before returning. This is called "marshalling". OpenCV does not interface directly with MATLAB, so it does not have that overhead. We can avoid including this repeated overhead in our benchmark by moving the for loop and the tic/toc into the entry-point function. This way, we only need to marshal inputs and outputs once, instead of 1000 times.

Additionally, it doesn't look like the OpenCV code is using the output value, so the MATLAB version has the additional overhead of storing the output image in a buffer and returning it to MATLAB. We can also try some tricks like using "parfor" to generate OpenMP code. If we combine all this (loop inside the entry-point, tic/toc inside the entry-point, parfor, ignoring the transformation output), we might write the following code:

myEntry.m:

function outT = myEntry(inImg)%#codegen
tic;
parfor i = 1:1000
    myEntrySingle(inImg);
end
outT = toc;
end

function outImg = myEntrySingle(inImg)
% or use "tform = rigidtform2d(30,[100,200]);" The following matrix is
% obtained directly from tform.A
% note: here should "single" type input,while very sensitive to "double"
% type and highly susceptible to invalid matrix inputs!
tform = rigidtform2d(single([0.8660   -0.5000  100.0000
                      0.5000    0.8660  200.0000
                      0         0    1.0000]));
% get output view full range
[h,w,~] =size(inImg);
xLimitsIn = [0.5,w+0.5];
yLimitsIn = [0.5,h+0.5];
[xLimitsOut,yLimitsOut] = outputLimits(tform,xLimitsIn,yLimitsIn);
% global full view range
xWorldLimits = [min(xLimitsIn(1),xLimitsOut(1)),max(xLimitsIn(2),xLimitsOut(2))];
yWorldLimits = [min(yLimitsIn(1),yLimitsOut(1)),max(yLimitsIn(2),yLimitsOut(2))];
W = round(xWorldLimits(2)-xWorldLimits(1));
H = round(yWorldLimits(2)-yWorldLimits(1));
outputView = imref2d([H,W],xWorldLimits,yWorldLimits);
% warp image
outImg = imwarp(inImg,tform,'OutputView',outputView);
end

doit.m:

%% opencv VS matlab origin code VS generate mex/C performance
%% generate mex file
inImg = im2single(imread("peppers.png"));
cfg = coder.config( "mex" );
cfg.ExtrinsicCalls = false;
cfg.IntegrityChecks = false;
cfg.SaturateOnIntegerOverflow = false;
cfg.ResponsivenessChecks = false;
cfg.NumberOfCpuThreads = 16;
codegen -config cfg myEntry -args {inImg}
%% test speed
t1 = myEntry(inImg);
t2 = myEntry_mex(inImg);
fprintf("origin matlab code take time:%.5f seconds,generate C/mex code take time:%.5f seconds\n",t1,t2);

which results in the following runtime performance for me:

>> doit

Code generation successful.

origin matlab code take time:2.53481 seconds,generate C/mex code take time:0.65263 seconds

Make sure to run doit twice, because the first run takes a while to start up the parallel pool. That looks a bit better! If you wanted, you could also level the playing field in the other direction (e.g. write a hand-written MEX wrapper around your OpenCV code and call it from MATLAB).

Regarding your last question about investigating performance inefficiencies in generated code: it can be tricky because we don't always know exactly what the generated code is doing, but looking at the generated code can be helpful. That's how I found the integrity check calls, which suggested to me that we could try turning them off. In addition, we want to make sure we actually measure the same thing when benchmarking codegen against hand-written code or third-party libraries. Putting tic/toc and any benchmarking loops inside the entry-point can be helpful for that. You can also try generating standalone code and writing benchmarking code in C/C++ to link against the standalone code. One trick is to use cfg.EnableMexProfiling and the MATLAB profiler ("profile on; mexFcn; profile viewer") to see if there are any long-running functions in generated code. If you have really tried all options, you can always use profiling tools like VTune or Valgrind to debug performance or memory issues.
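The profiling trick with cfg.EnableMexProfiling might look roughly like the following. This is a minimal sketch that assumes the myEntry entry point and the peppers.png input used in this thread; the instrumentation adds overhead of its own, so the reported times are better read as relative than absolute.

```matlab
% Sketch: build an instrumented MEX and inspect it with the MATLAB profiler.
% Assumes myEntry.m from this thread is on the path.
inImg = im2single(imread("peppers.png"));

cfg = coder.config("mex");
cfg.EnableMexProfiling = true;   % instrument the generated code for the profiler
codegen -config cfg myEntry -args {inImg}

profile on
myEntry_mex(inImg);              % run the instrumented MEX once
profile viewer                   % stop profiling and open the report
```

In the report, functions from the generated code appear alongside the MATLAB call that invoked the MEX, which helps to see whether the time is spent in the warp itself or elsewhere.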

I don't think "imwarp" is the source of the differences in the benchmark here. As you can see, when we do a more apples-to-apples comparison, the performance looks good.
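On the earlier question of calling OpenCV from generated code, a coder.ceval wrapper could be sketched as below. Everything named myWarpAffine, my_warp.h and my_warp.cpp is hypothetical user-supplied code (for example a thin C-linkage wrapper around cv::warpAffine), not part of MATLAB or OpenCV. The include paths and link flags needed for OpenCV are omitted, and the MATLAB fallback uses a default output view rather than the one computed in myEntry.

```matlab
function outImg = warpAffineExternal(inImg, A, outSize) %#codegen
% Hypothetical sketch: call a hand-written C/C++ warp routine from generated code.
%   inImg   : H-by-W-by-C single image
%   A       : 3-by-3 single affine matrix (same form as tform.A)
%   outSize : [rows cols] of the output image
outImg = zeros(outSize(1), outSize(2), size(inImg,3), 'single');
if coder.target('MATLAB')
    % Running in MATLAB: fall back to imwarp so the function stays usable there.
    outImg = imwarp(inImg, affinetform2d(A), 'OutputView', imref2d(outSize));
else
    coder.cinclude('my_warp.h');                             % hypothetical header
    coder.updateBuildInfo('addSourceFiles', 'my_warp.cpp');  % hypothetical source
    % Arrays are passed in MATLAB's column-major, planar layout by default.
    % The C side must convert to whatever layout the library expects
    % (cv::Mat is row-major with interleaved channels).
    coder.ceval('myWarpAffine', ...
        coder.rref(inImg), int32(size(inImg,1)), int32(size(inImg,2)), ...
        int32(size(inImg,3)), coder.rref(A), ...
        coder.wref(outImg), int32(outSize(1)), int32(outSize(2)));
end
end
```

The layout conversion on the C side is itself a form of marshalling, so it should be included in any timing comparison against the pure generated code.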

Matan

xingxingcui on 9 Nov 2022

Edited: xingxingcui on 9 Nov 2022

Thank you very much for your detailed and insightful comments, which have helped me better understand the ins and outs of generating C/C++ code. Your answers are very professional and deserve recognition! What you said about making a fair comparison on the same benchmark makes sense, and I will follow up with further testing on Ubuntu in addition to Windows to verify. Thank you again! @Matan Silver

--------------

Under the most consistent benchmark possible (no parfor, no "marshalling" effects, and "MATLAB Host Computer" as the target platform), I regenerated the standalone C code and measured each implementation separately with the CPU profiler from gperftools. The performance of the two was comparable, because "imwarp" uses a pre-compiled, platform-specific shared library for the "MATLAB Host Computer" target.

Note: when I configure cfg for standalone C code generation instead of a MEX target,

cfg = coder.config( "lib", "ecoder", true ); % regardless of whether ecoder option is true or false

cfg does not have the "IntegrityChecks" property.

I don't know why; if anyone knows, please let me know.

See the attached performance report (code generated with R2022b; compiled with gcc 9.4 on Ubuntu 20.04).

    // matlab code generation Evaluation
    static float outImg[1503045];
    static float fv[589824];
    cv::Mat inImg, matBigImg;
    int rows = 789;
    int cols = 635;
    int channels = 3;
    img.convertTo(inImg, CV_32FC3, 1.0 / 255);
    convertCVToMatrix(inImg, fv);  // "marshalling"
    ProfilerStart("./matlab_codegen.prof");
    t1 = cv::getTickCount();
    for (size_t i = 0; i < nums; i++) {
        /* Call the entry-point 'myEntry'. */
        myEntry(fv, outImg); // main evaluation function
    }
    t2 = cv::getTickCount();
    std::printf("it take time:%.5f seconds. \n", (t2 - t1) * 1.0 / cv::getTickFrequency());
    ProfilerStop();
    convertToMat(outImg, rows, cols, channels, matBigImg);// "marshalling"
    cv::imwrite("matlab_codegen_dst.jpg", matBigImg);
    
    /* Terminate the application.
       You do not need to do this more than one time. */
    myEntry_terminate();

Matan Silver on 9 Nov 2022

Hello,

IntegrityChecks only exists on MEX configs. In standalone code, there is a different property called RuntimeChecks. Because it's off by default, you shouldn't need to change it. But you can read more about that here:

https://www.mathworks.com/help/coder/ug/generate-standalone-code-that-detects-and-reports-run-time-errors.html
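For the standalone case, the equivalent configuration could look like this. It is a minimal sketch that assumes the same myEntry entry point and input image as the scripts in this thread; setting RuntimeChecks explicitly is redundant when the default is kept and is shown only to make the property visible.

```matlab
% Sketch: standalone library configuration (no IntegrityChecks property here).
inImg = im2single(imread("peppers.png"));

cfg = coder.config("lib");    % standalone static library instead of MEX
cfg.RuntimeChecks = false;    % already off by default; set explicitly for clarity
codegen -config cfg myEntry -args {inImg}
```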

Matan

xingxingcui on 10 Nov 2022

@Matan Silver Thank you very much again!
