Potential bug with parfeval: cumulative slowdown after several hours of operation, eventually exceeding 10x the initial compute time.

There seems to be a bug when using parfeval in the Parallel Computing Toolbox. After hours of running, the compute time of the parallel tasks starts increasing.
I tried running in all-serial mode, and there the compute time remains similar even after hours of operation.
I have monitored memory usage, and it does not increase with time.
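For reference, this is roughly how I sample memory on the client and on each worker once per iteration (a sketch; note the memory function is Windows-only):
%%%%%%%%%%%%%%%%%%%%%%%%
% Sample MATLAB memory use on the client (the 'memory' function is Windows-only)
m = memory;
fprintf('Client MemUsedMATLAB: %.1f MB\n', m.MemUsedMATLAB/2^20);
% Sample memory use on every worker in the pool
f = parfevalOnAll(POOL, @() getfield(memory(), 'MemUsedMATLAB'), 1);
workerBytes = fetchOutputs(f);              % one value per worker
fprintf('Worker MemUsedMATLAB (MB): %s\n', mat2str(round(workerBytes(:)'/2^20)));
%%%%%%%%%%%%%%%%%%%%%%%%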
I also monitored the parallel compute time, tracking its running low and high values. Once the difference exceeds a certain threshold (20%), I manually perform the following:
delete(gcp('nocreate'));
POOL=parpool('local', NO_PAR_POOLS);
Resetting the parallel pool seems to bring the parallel compute time back to the expected value.
Here is the pseudo code:
%%%%%%%%%%%%%%%%%%%%%%%%
tic_sum1      = 0;
tic_sum1_high = 0;
tic_sum1_low  = 0;
for mini_batch_no = 1:NO_OF_MINI_BATCHES
    t0 = tic;   % handle-based timer, safe against nested tic/toc calls
    % Launch (N-1) parallel asynchronous jobs
    job{1} = parfeval(POOL, @read_dataset_from_hdd, 1, mini_batch_no, CONST_DATA, 1);
    job{2} = parfeval(POOL, @compute_cpu_task, 1, BUFF_DATA_TRAIN.batch_file_read(:,:,:,:,set_cpu_num), CONST_DATA);
    job{3} = parfeval(POOL, @compute_cpu_dwt_task, 1, BUFF_DATA_TRAIN.batch_file_process_1(:,:,:,:,set_cpu_dwt_num), CONST_DATA.IM_RESIZE, CONST_DATA);
    % Perform the Nth parallel job on the host
    compute_gpu_task({BUFF_DATA_TRAIN.batch_file_process_2(:,:,:,:,set_gpu_num), BUFF_DATA_TRAIN.batch_file_label_read(:,:,:,:,set_gpu_num)});
    % Collect results; fetchOutputs blocks until each job finishes
    result{1} = fetchOutputs(job{1});
    result{2} = fetchOutputs(job{2});
    result{3} = fetchOutputs(job{3});
    tic_sum1 = tic_sum1 + toc(t0);
    % Track the running high and low per-iteration compute times
    if (tic_sum1 >= tic_sum1_high)
        tic_sum1_high = tic_sum1;
    end
    if (tic_sum1 <= tic_sum1_low)
        tic_sum1_low = tic_sum1;
    elseif (tic_sum1_low == 0)
        tic_sum1_low = tic_sum1;   % first iteration seeds the low value
    end
    tic_sum1 = 0;
    % Reset the parallel pool if the high-low difference exceeds the threshold percentage
    if (tic_sum1_high ~= 0 && tic_sum1_low ~= 0)
        if ((100*(tic_sum1_high - tic_sum1_low)/tic_sum1_low) > CPU_ALLOWABLE_COMPUTE_TIME_LOW_VS_HIGH_DIFF_PERCENTAGE)
            delete(gcp('nocreate'));
            POOL = parpool('local', NO_PAR_POOLS);
            tic_sum1_high = 0;
            tic_sum1_low  = 0;
        end
    end
end
%%%%%%%%%%%%%%%%%%%%%%%%
The iterations run over 5-6 days, and after the first day the compute time of the loop exceeds 10x the initial time.
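For anyone trying to reproduce this, here is a minimal self-contained sketch of the same pattern, with dummy tasks standing in for my real workload (the task functions and sizes are placeholders, not my actual code):
%%%%%%%%%%%%%%%%%%%%%%%%
% Minimal repro sketch: dummy tasks stand in for the real workload
POOL = parpool('local', 2);
elapsed = zeros(1, 1000);
for k = 1:1000
    t0 = tic;
    f1 = parfeval(POOL, @(n) sum(rand(n,1)), 1, 4e6);    % dummy worker task
    f2 = parfeval(POOL, @(n) fft(rand(n,1)), 1, 2^20);   % dummy worker task
    hostOut = svd(rand(500));                            % dummy host-side task
    r1 = fetchOutputs(f1);                               % blocks until done
    r2 = fetchOutputs(f2);
    elapsed(k) = toc(t0);
end
plot(elapsed), xlabel('iteration'), ylabel('seconds')    % look for upward drift
%%%%%%%%%%%%%%%%%%%%%%%%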
6 Comments
Pavel Sinha on 23 Jul 2018
Hi,
Thank you for your reply. Here are my thoughts:
1) It cannot be a race condition, because each of the async tasks operates on an independent buffer. All reads and writes during task execution happen within those independent buffers; there is no communication or dependency between the tasks, they are completely independent. All async tasks are synchronized and their data collected before the next launch of the parallel jobs, every time.
2) The above point also rules out deadlock: the tasks are synchronized before the parallel jobs are launched again, and the async tasks do not interact during execution.
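To make the synchronization explicit: fetchOutputs itself is the barrier, so each iteration has this shape (a simplified sketch of the loop above):
%%%%%%%%%%%%%%%%%%%%%%%%
% fetchOutputs blocks the client until the future completes, so the
% next iteration cannot start before all async tasks have finished
% and their outputs have been copied back.
f = parfeval(POOL, @svd, 1, rand(1000));   % async task on a worker
hostOut = fft2(rand(1000));                % overlapped work on the host
workerOut = fetchOutputs(f);               % barrier: waits for the worker
%%%%%%%%%%%%%%%%%%%%%%%%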
Pavel Sinha on 12 Sep 2018
I found that initializing MATLAB with just 1 parallel worker in the settings, and then activating the required number of parallel workers from code, fixes this issue to a great extent: it now takes much longer before the processing loops become significantly slower than the initial ones. But eventually, once a processing loop exceeds the initial compute time by 20%, I still have to reset the parallel pool and re-create the required number of parallel workers.
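Concretely, I now create the pool explicitly from code at startup instead of relying on the preferred pool size in the parallel preferences (a sketch of the pattern; NO_PAR_POOLS is the desired worker count, as above):
%%%%%%%%%%%%%%%%%%%%%%%%
% Create the pool from code rather than from the preference setting
POOL = gcp('nocreate');               % [] if no pool is running yet
if isempty(POOL)
    POOL = parpool('local', NO_PAR_POOLS);
elseif POOL.NumWorkers ~= NO_PAR_POOLS
    delete(POOL);                     % wrong size: tear down and recreate
    POOL = parpool('local', NO_PAR_POOLS);
end
%%%%%%%%%%%%%%%%%%%%%%%%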


Answers (0)
