ga does not use all the available workers

Theo on 8 May 2012
Hello, I have successfully configured a cluster to use a MATLAB job scheduler. Admin Center communicates with all the available nodes and successfully starts the requested workers (40 in total). MATLAB connects to those workers, but when I run my code (a genetic algorithm in parallel), only 20 of the 40 workers seem to do any work. Any ideas why this happens? Thanks.
2 Comments
Jason Ross on 8 May 2012
Can you elaborate further on:
1) How you picked the number of workers.
2) How you are determining whether a worker is working (or not).
3) Whether your processors support hyper-threading.
Theo on 8 May 2012
1) First, through Admin Center I created the job manager and added 40 workers to it. Then, from MATLAB's menu under Manage Configurations, I selected that job manager (in its properties I set both the maximum and minimum number of workers to 40). Finally, before calling the ga function, I ran the matlabpool open 40 command. It connects to 40 labs.
2) I have set the ga function to run in parallel ('UseParallel', 'always', 'Vectorized', 'off'), and I check the time needed to run the fitness function (tic, toc).
Elapsed time is 144.131249 seconds.
Elapsed time is 144.345565 seconds.
Elapsed time is 164.260365 seconds.
Elapsed time is 166.840440 seconds.
Elapsed time is 167.682153 seconds.
Elapsed time is 168.689102 seconds.
Elapsed time is 168.848986 seconds.
Elapsed time is 171.553278 seconds.
Elapsed time is 174.286497 seconds.
Elapsed time is 174.885536 seconds.
Elapsed time is 174.981250 seconds.
Elapsed time is 175.822999 seconds.
Elapsed time is 178.663270 seconds.
Elapsed time is 180.424678 seconds.
Elapsed time is 181.579599 seconds.
Elapsed time is 184.609482 seconds.
Elapsed time is 184.989887 seconds.
Elapsed time is 186.188509 seconds.
Elapsed time is 186.607582 seconds.
Elapsed time is 187.694278 seconds.
Elapsed time is 156.831662 seconds.
Elapsed time is 157.956449 seconds.
Elapsed time is 164.016603 seconds.
Elapsed time is 164.102494 seconds.
Elapsed time is 164.721493 seconds.
Elapsed time is 169.058468 seconds.
Elapsed time is 170.300810 seconds.
Elapsed time is 170.383235 seconds.
Elapsed time is 172.590894 seconds.
Elapsed time is 173.413034 seconds.
Elapsed time is 175.455189 seconds.
Elapsed time is 176.130289 seconds.
Elapsed time is 177.155515 seconds.
Elapsed time is 178.157808 seconds.
Elapsed time is 178.309543 seconds.
Elapsed time is 179.606388 seconds.
Elapsed time is 184.866483 seconds.
Elapsed time is 184.948895 seconds.
Elapsed time is 188.210774 seconds.
Elapsed time is 192.255420 seconds.
So I noticed that the first 20 results are produced at about the same time, and then from the 21st result the timings start again from the beginning and follow the same pattern. I assume that only 20 workers run in parallel at any one time.
3) I don't think this processor technology supports hyper-threading.
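The setup described in points 1) and 2) could be sketched roughly as follows, with one addition: each fitness evaluation also reports which task and host ran it, so you can count how many distinct workers actually appear in the output. This is only a sketch under assumptions. `myFitness`, `nvars`, and the file name `runGaTimed.m` are hypothetical placeholders, not from the post, and it assumes that `getCurrentTask` returns a non-empty task on pool workers and that the file is visible to the workers.

```matlab
function runGaTimed()
% Sketch only. myFitness and nvars are hypothetical placeholders for the
% real objective function and problem size; they are not from the thread.
nvars = 5;                                   % assumed number of variables

if matlabpool('size') == 0                   % 0 means no pool is open
    matlabpool open 40
end
fprintf('Pool size: %d workers\n', matlabpool('size'));

% Same options as quoted in point 2) above.
options = gaoptimset('UseParallel', 'always', 'Vectorized', 'off');
[x, fval] = ga(@timedFitness, nvars, [], [], [], [], [], [], [], options);
disp(x);
fprintf('Best fitness value: %g\n', fval);
end

function y = timedFitness(x)
% Wraps the real objective and reports where and how long it ran.
t0 = tic;
y = myFitness(x);                            % hypothetical objective
task = getCurrentTask();                     % empty when run on the client
if isempty(task)
    id = 0;                                  % 0 marks a client-side call
else
    id = task.ID;
end
[~, host] = system('hostname');
fprintf('task %d on %s: %.1f s\n', id, strtrim(host), toc(t0));
end
```

If only about 20 distinct task IDs ever show up, half the pool is idle. If all 40 show up but the timings still arrive in two waves, the workers are more likely competing for some shared resource. Note that later MATLAB releases replace `matlabpool` with `parpool`.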


Accepted Answer

Jason Ross on 9 May 2012
My first suspicion is that you have 20 (real) cores available to do the work, so you end up essentially queuing the second set of 20 units of work: when the first 20 finish, they free up the resources for the second set.
Check the specifications for your processor to see if it has hyper-threading. If it's enabled, the OS will report twice as many compute cores as you actually have.
Generally the starting recommendation for PCT is one worker per core, but that's only a starting point. You might do better with fewer, you might do better with more. It's also possible that another resource (memory, disk I/O, network I/O) is affecting how the work gets scheduled. You'd need to monitor a cluster node while a job is running to see whether that is the case.
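One way to compare workers against cores is to ask each pool worker for its host name and core counts, then tally workers per host on the client. The sketch below assumes an open `matlabpool` and a JVM running on the workers. `feature('numcores')` is an undocumented MATLAB call, so treat its result as indicative only. The Java call reports logical processors, so a logical count that is twice the physical count would suggest hyper-threading is enabled.

```matlab
% Sketch only: run on the client while the pool is open.
n = matlabpool('size');                      % number of workers in the pool

spmd
    [~, h] = system('hostname');
    h = strtrim(h);
    physCores = feature('numcores');         % undocumented; physical cores
    logCores = double(java.lang.Runtime.getRuntime().availableProcessors());
end

% Copy the per-worker values out of the Composite objects.
hosts = cell(1, n);
phys = zeros(1, n);
logical_ = zeros(1, n);
for k = 1:n
    hosts{k} = h{k};
    phys(k) = physCores{k};
    logical_(k) = logCores{k};
end

% Tally workers per host and print them next to that host's core counts.
[uHosts, firstIdx, idx] = unique(hosts);
workersPerHost = accumarray(idx(:), 1);
for j = 1:numel(uHosts)
    fprintf('%s: %d workers, %d physical cores, %d logical processors\n', ...
        uHosts{j}, workersPerHost(j), phys(firstIdx(j)), logical_(firstIdx(j)));
end
```

A host that shows more workers than physical cores would be consistent with the queuing explanation above. If the counts match, the memory, disk, or network usage on a node is the next thing to look at.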

More Answers (0)
