Parallel execution gets stuck / hangs (on fetching data)

I'm solving a differential equation using ode45. Because I have to explore the parameter space and each execution takes some time, I would like to solve it in parallel for different parameters. The problem I keep running into is that the execution stops at random. My code essentially looks like:
parfor k=1:N
[tt xx]=ode45(@(t,x) f(t,x,a(k), ...);
X(:,k)=xx;
end
I'm running it on a cluster of machines (12 x 8 cores) running Scientific Linux 6 (kernel 2.6.32-573) and Matlab R2015a. After I start it, it will run for a while and then everything will just stop: CPU load will go down across the cluster and my Matlab session (running on the head node of the cluster) will freeze (and doesn't recover within any reasonable amount of time). This sometimes happens after a few minutes, but sometimes after a few hours. The MJS, which is also running on the head node, is reporting all of the worker nodes as busy. If I force quit my Matlab session, no error log is generated (or at least I didn't find one). From what I can tell it doesn't appear to be a memory or a communication issue (all of the nodes are reachable and have plenty of free RAM).
If I replace parfor with parfeval I can submit my jobs, but then fetchNext hangs in the same way that parfor does.
I would greatly appreciate any help because I'm out of ideas at this point. If any additional information is required please let me know.
Many thanks!

5 Comments

Walter Roberson on 22 Jul 2016
Are you specifying more than 2 elements in your tspan vector? If you are not then the number of items returned can vary but your code assumes that it is always the same number.
Edric Ellis on 22 Jul 2016
This kind of thing can be pretty tough to diagnose. I would try and work out if there are particular iterations of your parfor loop that are failing to make progress, and see if you can reproduce that problem (or not) on the MATLAB client.
In fact, it's probably easier to diagnose with the parfeval variant. When your fetchNext loop stalls, you should be able to CTRL-C out of that and work out which iterations are still executing (look for futures in state 'running'). You could add @odeprint to your ode45 call, and do something like this:
dvdt = @(t,r) [-398601*r(1)*((r(1)^2+r(2)^2+r(3)^2)^(-1));...
-398601*r(2)*((r(1)^2+r(2)^2+r(3)^2)^(-1)); ...
-398601*r(3)*((r(1)^2+r(2)^2+r(3)^2)^(-1))];
f = parfeval(@ode45, 2, dvdt, [0 1000], [0 8 0], ...
odeset('OutputFcn', @odeprint));
f.Diary
Here, f.Diary is updated as the ode45 call makes progress (or, in this example, fails to).
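A minimal sketch of the parfeval diagnosis described above, assuming an open parallel pool. The test equation, the parameter vector a, and the 60 second timeout are hypothetical stand-ins and not the original poster's problem. It passes a timeout to fetchNext so the client never blocks indefinitely, and reports which futures are still in state 'running' when nothing arrives in time.

```matlab
% Hypothetical stand-in problem: N short, independent ODE solves.
N = 8;
a = linspace(0.5, 2, N);            % hypothetical parameter values
tspan = linspace(0, 1, 11);         % fixed output grid (11 points)

% Submit one ode45 call per parameter, requesting 2 outputs (t and x).
F(1:N) = parallel.FevalFuture;      % preallocate the array of futures
for k = 1:N
    ak = a(k);
    F(k) = parfeval(@ode45, 2, @(t, x) -ak * x, tspan, 1);
end

X = nan(numel(tspan), N);
timeoutSeconds = 60;                % hypothetical limit per fetchNext call
remaining = N;
while remaining > 0
    % With a timeout, fetchNext returns an empty idx if no result arrives.
    [idx, ~, xx] = fetchNext(F, timeoutSeconds);
    if isempty(idx)
        stuck = find(strcmp({F.State}, 'running'));
        fprintf('No result within %d s; still running: %s\n', ...
            timeoutSeconds, mat2str(stuck));
        break
    end
    X(:, idx) = xx;                 % store the solution for iteration idx
    remaining = remaining - 1;
end
```

If the loop breaks, the indices in stuck identify the iterations to rerun on the client, and F(stuck(1)).Diary shows whatever those workers have printed so far.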
Marko Kuzmanovic on 22 Jul 2016
Indeed I am using a tspan vector with more than two elements (because I don't want all the steps that ode45 generates), but as far as I can see the ode45 specification doesn't state something like that. Could you, please, clarify what you meant by that?
Also I verified the code locally before executing it on the cluster and it seems to work fine. If something like that was a problem how would it explain the huge differences between the times that it needs to get stuck?
Thank you for the reply!
Walter Roberson on 22 Jul 2016
  • If tspan has two elements, [t0 tf], then the solver returns the solution evaluated at each internal integration step within the interval.
  • If tspan contains more than two elements [t0,t1,t2,...,tf], then the solver returns the solution evaluated at the given points.
You are specifying more than two elements, so you will get results at the locations you specify, and the same number of them as you specify. If you had provided only two elements, you would have gotten the result at each internal integration step. The number of integration steps and their spacing vary as required to meet the integration tolerances, so the same function with two marginally different time spans or two marginally different initial conditions might produce very different numbers of internal points, because one of the two might skip over a difficult-to-integrate point. This is especially true if there is a singularity in the time span: you can end up with a very large number of points as MATLAB tries to resolve the singularity.
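The difference between the two tspan forms can be checked with a small sketch. The test equation x' = -x and the initial values are hypothetical and chosen only for illustration.

```matlab
odefun = @(t, x) -x;                     % hypothetical test equation x' = -x

% Two-element tspan: the solver decides how many rows to return.
[tA, xA] = ode45(odefun, [0 1], 1);
[tB, xB] = ode45(odefun, [0 1], 1000);   % same equation, different x0

% Multi-element tspan: one row per requested time.
tspan = linspace(0, 1, 11);
[tC, xC] = ode45(odefun, tspan, 1);

fprintf('two-element tspan: %d and %d rows\n', numel(tA), numel(tB));
fprintf('11-element tspan:  %d rows\n', numel(tC));
```

Only the last call has a row count known in advance, which is what an assignment such as X(:,k)=xx relies on. The two row counts in the first printout may or may not coincide for a given problem.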
Marko Kuzmanovic on 22 Jul 2016
If it were generating arrays of different lengths I would get a dimension mismatch error, which I don't get. My tspan looks like tspan=0:(1/f/100):(1000/f); where f is some frequency defined by the physical system I'm modelling. There are no divergences, and I have set the max step to be 1/f/1000 through odeset. The execution stops and the CPU load goes down; it doesn't get stuck trying to integrate the equation.
Thanks!
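For reference, the setup described in this comment can be sketched as follows. The frequency value, the damped oscillator, and the parameter vector a are hypothetical stand-ins for the real system, and only 10 periods are integrated instead of 1000 to keep the run short. Because tspan has many elements, every iteration returns numel(tspan) rows, so the sliced assignment into the preallocated X always matches in size.

```matlab
f = 50;                                  % hypothetical frequency
tspan = 0:(1/f/100):(10/f);              % 100 output points per period
opts = odeset('MaxStep', 1/f/1000);      % max step as described above
a = linspace(0.1, 1, 4);                 % hypothetical damping parameters
N = numel(a);
w = 2*pi*f;

X = zeros(numel(tspan), N);              % one column per parameter value
parfor k = 1:N
    ak = a(k);
    % Hypothetical right-hand side: damped harmonic oscillator.
    rhs = @(t, x) [x(2); -w^2 * x(1) - ak * x(2)];
    [~, xx] = ode45(rhs, tspan, [1; 0], opts);
    X(:, k) = xx(:, 1);                  % keep only the first state
end
```

If ode45 stopped early in some iteration and returned fewer rows, the assignment into X would raise a dimension mismatch error rather than hang.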


Answers (0)


Asked: 21 Jul 2016

Commented: 22 Jul 2016
