How are labs distributed across cores?

Mark Brandon on 9 Mar 2019
Commented: Mark Brandon on 11 Mar 2019
I have been running a large parallel job on a cluster at my university. Occasionally the file output will be missing a line of text. My support people suggest that this might be due to a "race condition", where the "master" is interrupted during an fprintf call. My job is "embarrassingly simple", where a "master" runs the same "task" across a set of "labs". The jobs are deployed and recovered by the master in a serial fashion, so I see no obvious reason for the race condition. For those who would like more details: the job is run using the Parallel Computing Toolbox (PCT) with 28 cores on a single compute node. The repeating task takes about 1.5 minutes of wall time on a single core, so the master has to handle input and output for one lab about every 3 seconds.
I have found that the race condition goes away when I open the output file with the "W" permission in fopen, which makes fprintf write through a 4k buffer instead of flushing after every call. That means that I have to wait for about 50 tasks to finish before I see the first output. I would prefer to see the output occur more frequently, for troubleshooting and quality control.
I got a suggestion from one of the cluster support people to start the parpool with one less core than available. Up until now, I have been running 28 labs on the 28 available cores. That means that there are actually 29 tasks running on the computing node. In other words, one of the cores is handling a coexisting master and lab.
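For reference, the support suggestion above could be sketched as follows. This is only an illustration: it assumes the default cluster profile is the local machine and that its `NumWorkers` limit reflects the core count, which may not hold on every installation.

```matlab
% Sketch: leave one core free for the client by starting one fewer worker.
c = parcluster();                      % default cluster profile (assumed local)
nWorkers = max(1, c.NumWorkers - 1);   % profile's worker limit minus one
delete(gcp('nocreate'));               % close any pool that is already open
pool = parpool(c, nWorkers);           % start the smaller pool
fprintf('Started %d workers\n', pool.NumWorkers);
```

Whether this actually keeps the client alone on a core still depends on the operating system scheduler.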
This issue has sparked my question: How does PCT allocate labs across available cores? Does it actually try to avoid starting labs on the core where the master resides? To be clear, the master is the first instance of MATLAB, and it is where the labs are initiated via a call to the parpool function.
I have searched for an answer to this question, and found nothing yet on the web. I am hoping that there is someone out there who, given experience, knows the answer to this question. I thought to do some experimenting myself to find an answer, but it is not obvious to me how to determine where, among the available cores, the labs and master reside.
Best,
Mark

Accepted Answer

Walter Roberson on 9 Mar 2019
It might depend upon whether you are using spmd compared to parfor or parfeval, but generally MATLAB creates Java task queues for parfor and parfeval work. The calls to create those do not deliberately pass in information about which CPU is being executed on -- which is something that could change without notice as MATLAB itself does not bind in an affinity for the core it starts executing on.
As best I can tell, scheduling and CPU affinity are all left to the operating system. Operating systems will tend to distribute tasks to cores that are idle, and operating systems contain logic for process migration. The operating system might or might not be configured to retain locality (that is, to reduce process migration).
I wonder if you should be using a DataQueue or PollableDataQueue to communicate from the processes back to the client? https://www.mathworks.com/matlabcentral/answers/350727-how-to-pass-data-between-parfeval#answer_275995
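One way to observe where the processes end up is to ask the operating system directly. The following is a rough sketch that assumes Linux with the `ps` utility available; `pidAndCpu` is a hypothetical helper, and `feature('getpid')` is an undocumented MATLAB call that may change between releases. The `psr` column of `ps` reports the logical CPU the process last ran on, which is only a snapshot because the scheduler can migrate the process at any time.

```matlab
% Sketch, Linux only: report the logical CPU each MATLAB process last ran on.
pool = gcp();                                 % opens a default pool if none exists
f = parfevalOnAll(pool, @pidAndCpu, 1);       % run the query once on every worker
workerInfo = fetchOutputs(f);                 % one row per worker: [pid, cpu]
clientInfo = pidAndCpu();                     % same query for the client process
disp(array2table(workerInfo, 'VariableNames', {'pid', 'cpu'}));
fprintf('client: pid %d on cpu %d\n', clientInfo);

function info = pidAndCpu()
    % Hypothetical helper: returns [process id, last-used logical CPU].
    pid = feature('getpid');                  % undocumented MATLAB feature
    [status, out] = system(sprintf('ps -o psr= -p %d', pid));
    if status == 0
        cpu = str2double(strtrim(out));
    else
        cpu = NaN;                            % ps not available on this system
    end
    info = [pid, cpu];
end
```

Running this several times during a job gives a feel for how often the operating system moves the workers and the client between cores.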
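A minimal sketch of the DataQueue idea with parfeval is shown here. The task function `reportingTask` and its squared-number result are placeholders, not part of the original job. The `afterEach` callback runs on the client, so messages from different workers are handled one at a time.

```matlab
% Sketch: workers send progress messages to the client through a DataQueue.
pool = gcp();                                   % opens a default pool if none exists
q = parallel.pool.DataQueue;
afterEach(q, @(msg) fprintf('%s\n', msg));      % callback executes on the client

nTasks = 4;
futures(1:nTasks) = parallel.FevalFuture;       % preallocate the future array
for k = 1:nTasks
    futures(k) = parfeval(pool, @reportingTask, 1, k, q);
end
results = fetchOutputs(futures);                % blocks until all tasks finish

function y = reportingTask(k, q)
    % Placeholder task: computes k^2 and reports completion to the client.
    y = k^2;
    send(q, sprintf('task %d done, result %g', k, y));
end
```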
If you are using parfeval() then instead of having the process write the result to the file, return the result and have the client fetch that from the future and write it to the file.
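The arrangement described above might look like the following sketch, in which only the client touches the file. The task function `placeholderTask`, the task count, and the output path are assumptions made for illustration.

```matlab
% Sketch: workers return results; the client alone writes them to the file.
pool = gcp();                                   % opens a default pool if none exists
nTasks = 6;
futures(1:nTasks) = parallel.FevalFuture;       % preallocate the future array
for k = 1:nTasks
    futures(k) = parfeval(pool, @placeholderTask, 1, k);
end

outFile = fullfile(tempdir, 'parfeval_results.txt');   % hypothetical output path
fid = fopen(outFile, 'w');                      % lowercase 'w': flushed automatically
assert(fid ~= -1, 'Could not open %s', outFile);
for n = 1:nTasks
    [idx, value] = fetchNext(futures);          % next finished task, in any order
    fprintf(fid, '%d\t%.6f\n', idx, value);     % one line per completed task
end
fclose(fid);

function y = placeholderTask(k)
    % Placeholder for the real 1.5-minute computation.
    y = sqrt(k);
end
```

Because `fetchNext` returns results as they complete, each line appears in the file shortly after its task finishes, without any worker writing concurrently.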
1 Comment
Mark Brandon on 11 Mar 2019
Walter,
Thanks for the detailed comments. Very useful. And sorry, I should have indicated that I am using parfeval, NOT parfor. And, yes, I am using the arrangement you describe in the last paragraph. In other words, the client is the only process that writes to the file.
Best,
Mark

Release: R2018b
