Diagnosing parallelization bottlenecks / Differences between Intel and AMD parallel computing performance?

9 views (last 30 days)
I am trying to pinpoint/diagnose a parallel-computing bottleneck that I've encountered on two different computers. For the computation, each worker within the parfor loop is assigned one sparse array out of 101 total (each array's 'full' size is approximately 50,000x250). Each worker: 1) turns the sparse array into a 'full' array, 2) convolves the array with a small Gaussian kernel (which is also passed into the worker), and 3) performs ICA using the 'fast_ica' function, whose output is the independent component weight matrix. Recently I started working on a new computer with a substantially higher core count, but the performance of this code seems to be hitting some bottleneck such that I am not seeing any further performance increases. The old system has an Intel i7-8700K CPU (6 physical/12 logical cores); the new system has an AMD Ryzen 9 5950X (16 physical/32 logical cores). Both systems have 64 GB of RAM, both are running Windows 10, and both have hyper-threading/SMT enabled (the old system runs MATLAB R2018b and the new one R2020b). To compare the parallel performance across systems, I ran the same code on both computers using different numbers of workers:
Top row shows the result for the older Intel system and the bottom row for the newer AMD system. In the left column, the left axis shows the total execution time of each of the parfor runs (as measured by tic/toc before and after), and the right axis shows the difference in execution time using N vs. N+1 workers (i.e. points near 0 mean no improvement from N+1 compared to N workers); vertical dotted lines mark the number of physical cores. The right column shows the system resource utilization during each of these runs. What I noticed is that in both cases, using more than about 8 or 9 workers does not improve performance. This is despite the fact that a) more RAM and CPU resources are being used, and b) 9 workers represent 150% of the physical cores (75% of the logical cores) on the Intel system but only 56% of the physical cores (28% of the logical cores) on the AMD system. The diminishing benefit of multi-threading past the physical core count can't be at issue here, given that the 'bottleneck' occurs well below the physical core count of the AMD system (16) and well above it on the Intel system (6). To me the most interesting 'clue' is that the worker count at which no further improvement occurs is about 8 or 9 on both systems; however, it's not impossible that this is a coincidence, and I don't know quite how to interpret it. So my questions are:
  1. Given the differences in CPU memory architecture, are there known differences in parallel computing performance between Intel and AMD Ryzen CPUs?
  2. Given that the few physical limitations that I looked at (RAM, CPU utilization, physical core count) do not seem to be the problem, what else is likely to be bottlenecking me here?
  3. How can I further diagnose the source of the bottleneck (in terms of potential answers to question 2, or more generally)?
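For reference, here is a minimal sketch of the workload and the per-pool-size timing runs described above. The variable names (sparseArrays, gaussKernel, weights) and the call signature of fast_ica are assumptions on my part; fast_ica is the (inherited) ICA function mentioned above, not shown here.
% Sketch only - names and the fast_ica signature are assumed, not the actual code.
nWorkersToTest = 1:20;
runTime = zeros(size(nWorkersToTest));
for k = 1:numel(nWorkersToTest)
    delete(gcp('nocreate'));             % close any existing pool
    parpool(nWorkersToTest(k));          % open a pool with k workers
    tStart = tic;
    weights = cell(1, numel(sparseArrays));
    parfor i = 1:numel(sparseArrays)     % 101 sparse arrays, one per iteration
        A = full(sparseArrays{i});           % 1) sparse -> full (~50,000 x 250)
        A = conv2(A, gaussKernel, 'same');   % 2) convolve with the small Gaussian kernel
        weights{i} = fast_ica(A);            % 3) ICA -> independent component weight matrix
    end
    runTime(k) = toc(tStart);            % total parfor time for this pool size
end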
6 Comments
Edric Ellis — 17 Mar 2021
Hm, this definitely feels like you're hitting a resource limit somewhere. I'm no expert on CPU architectures, but I took a quick look at https://en.wikichip.org/wiki/amd/ryzen_9/5950x vs. https://en.wikichip.org/wiki/intel/core_i7/i7-8700k - and one thing that I notice is that the AMD has only moderately higher memory bandwidth than the Intel chip. So, it is possible that memory bandwidth is the limiting factor (I don't know of a way to prove that though). You could consider doing something a bit like this:
spmd
    t = zeros(1, numlabs);
    for nw = 1:numlabs
        labBarrier();
        timer = tic();
        if labindex <= nw
            % Only run on the first nw workers
            for idx = 1:10
                performCalculation();
            end
        end
        labBarrier();
        t(nw) = toc(timer);
    end
end
t{:}   % t is a Composite on the client; this displays each worker's timings
The aim here is to perform the timing without any of the potential parfor overheads getting in the way. I'm sort-of expecting the iteration time to increase the more workers are contending. The other thing you could try is using mpiprofile to run the MATLAB profiler on the workers, to see if there's a particular part of your computation that gets slower as more contention is involved. I.e., with a couple of different pool sizes, try:
mpiprofile on
spmd
    for idx = 1:10
        performCalculation();
    end
end
mpiprofile viewer
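To compare a couple of pool sizes directly, you could wrap that in a loop over pool sizes, something along these lines (with performCalculation standing in for your real workload):
for poolSize = [9 20]
    delete(gcp('nocreate'));   % close any existing pool
    parpool(poolSize);         % open a pool of this size
    mpiprofile on
    spmd
        for idx = 1:10
            performCalculation();
        end
    end
    mpiprofile viewer          % inspect per-worker timings for this pool size
end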
Andres G. — 17 Mar 2021 (edited)
Hi Edric, first of all thanks for the help! I will try to implement your first suggestion tonight. Note that it seems similar to the 'parTicToc' example I show in the second comment from the top in this thread. I also briefly looked at mpiprofile viewer, and what I saw was consistent with what the parTicToc example showed: the same function calls were taking the longest time, but in the 20-worker condition they simply took proportionally longer than in the 9-worker condition, so that in the end the total time the parfor ran for was exactly the same. However, I should look at this again more carefully.
About the memory bandwidth: I was worried the bottleneck might be due to this, since there is not much that can be done about it. However, are you aware of any predictions that could be made if this were indeed the case? For instance, thinking about it somewhat naively: if, for the sake of testing, I ran this same function on only the first half of each of the input matrices (i.e. I just take the first n rows to be fast_ica'ed), then to a first approximation less memory IO would be needed per worker. If memory bandwidth were the limit, I should see the 'no further benefit' worker limit go up to some higher number (say, 13 instead of 9) before being bottlenecked again. Or is that too simplistic? I might try doing that later tonight.
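For concreteness, the test I have in mind would look roughly like this (variable names are just placeholders for my actual ones):
weights = cell(1, numel(sparseArrays));
parfor i = 1:numel(sparseArrays)
    A = full(sparseArrays{i});
    A = A(1:floor(end/2), :);            % keep only the first half of the rows
    A = conv2(A, gaussKernel, 'same');
    weights{i} = fast_ica(A);            % roughly half the memory traffic per worker
end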


Answers (1)

Andres G. — 19 Mar 2021 (edited)
So, for posterity: I dug into the ICA code I was using (which I inherited) and it turns out it was almost comically memory-inefficient. Optimizing this code a bit led to a several-fold performance increase and, more relevantly, it also raised the maximum number of workers that yields a performance improvement from the 9 above to the 12-14 range (i.e. further workers beyond this number did not increase performance). Likewise, performing the test where the size of the input arrays is halved also increased this number to around 12 or 14. More conclusively, using 'single' as opposed to 'double' precision values roughly halved the overall computation time and raised the maximum useful worker count to about 18-19. I am now about as memory-efficient as is apparently possible with this particular problem in MATLAB, so while I'm not 100% sure, I think the issue may very well be what Edric suggested in the comments: the memory bandwidth of the chips themselves. The fact that previously both the Intel and AMD chips 'maxed out' at around 9 workers may therefore have been more or less a coincidence: the AMD chip has a slightly higher memory bandwidth but is also proportionally faster, so they reached their bottleneck at around the same worker count.
What I still find a bit puzzling though is: if I am reaching the limit of the CPU's memory bandwidth, how is my computer still so responsive? Opening a web browser and playing a video happens instantaneously and seems to only negligibly affect the computation time of the MATLAB functions. If I am literally at the limit of the CPU's memory IO, how is this possible?
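For reference, the single-precision change amounted to little more than converting the arrays before processing, something along these lines (names are illustrative, and this assumes the ICA code was adapted to accept single-precision input):
A = single(full(S));                       % S: one of the sparse input arrays; single halves the bytes moved
A = conv2(A, single(gaussKernel), 'same'); % kernel converted too, to keep everything in single
W = fast_ica(A);                           % weight matrix, now computed in single precision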
PS: The thing I found most surprising about this whole thing is that I couldn't seem to find resources/analysis directly relevant to this behavior online (or maybe I didn't know how to search for it?). That is nearly a first for me in my years of using MATLAB... These higher-core-count chips are pretty common now; I'm surely not the only one running large-ish parallel computing jobs on them, no!? So if you stumble across this thread and know of someone who has practical advice on this, or who has analyzed this issue properly (or you have done so yourself), drop me a line! I'd love to learn more about it!
