Cluster error: Opening log file

Question


Hi everybody,

I am running MATLAB code on my university's cluster. It is basically a for loop that submits a job to the cluster, waits 2.5 hours for the results to be generated, and then moves on to the next iteration. However, say it completes generation 8, and after 2.5 hours it starts generation 9 and completes that as well. At the point where it is supposed to move on to generation 10, this message appears on the screen: "Opening log file: /eng/cvcluster/eggurkanc/java.log.3643", and it never moves on to the 10th generation. I have no idea how to deal with this; any help would be appreciated.

Thanks in advance.


Answer 1

Jason Ross, February 25, 2013

Edited: Jason Ross, February 25, 2013

Are you out of disk space? Have you exceeded a disk quota? Looks like you aren't in a normal "home" directory, so there may be more restrictive limits on the cluster.

Does the queue you are submitting to have restrictions on job time or hours of the day it runs? You might need to check with the admins.

Are you getting pre-empted by some other job that jumps the queue?

Are there any emails from the cluster about your job?

If you check the job status, what does it show? (The exact command depends on the scheduler you are using, but it might be something like qstat.)
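The disk-space and quota questions above can also be probed from inside MATLAB. The following is only a minimal sketch: `logDir` is a placeholder (set here to `tempdir` so the snippet runs anywhere) and should be replaced with the directory named in the error message. A write that fails because of a full disk or an exceeded quota may only show up when the file is closed, so the sketch checks the status returned by `fclose` as well.

```matlab
% Sketch: test whether a directory still accepts new files.
logDir = tempdir;                      % assumption: replace with the log directory
probe  = [tempname(logDir) '.tmp'];    % unique, throwaway file name

[fid, msg] = fopen(probe, 'w');
if fid == -1
    % fopen returns the system error text in msg
    fprintf('Cannot create a file in %s: %s\n', logDir, msg);
    writable = false;
else
    count  = fprintf(fid, 'probe\n');  % number of bytes written
    status = fclose(fid);              % 0 on success, -1 on failure
    writable = (count > 0) && (status == 0);
    delete(probe);                     % clean up the probe file
    fprintf('Write test in %s succeeded: %d\n', logDir, writable);
end
```

If the probe fails only while the loop is running, that would point toward a quota or disk-space limit rather than a problem in the code itself.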


Ceren GURKAN, February 26, 2013

I am not sure whether I have understood you completely, so first of all, sorry for that :( . What I can say is that I am running just this specific code and nothing else. So I am not sure whether I could be using the log file simultaneously, and if I am, how can I detect that and prevent it from happening?

Jason Ross, February 26, 2013

One of the common problems on clustered systems is that something you test or prototype in a single execution, where it works, becomes a shared resource when you run it on a cluster. Since you can now have multiple threads of execution acting on the same resource, this can become a problem. For example, the following will work fine with one process:

cd to /cluster/shared/filesystem
open a file named "myresults"
write to "myresults"
close "myresults" when done.

Then you submit this to a cluster and the problems start. When you had one process working on that file, everything was OK. Now you have n processes trying to write to the file simultaneously. At best you end up with a jumbled mess of output; at worst the processes deadlock.

To get out of this, there are many solutions. One is to use the PID to try to make the log unique (which it looks like is already being tried, but you can still get a clash). You can also use random numbers, the machine name, etc. to make file names even less likely to collide (and then concatenate the files at the end of your run).
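As a rough illustration of the unique-file idea, the sketch below builds a file name from the machine name plus a random token and then concatenates all pieces at the end. The names `outDir`, `myresults_*.log`, and `myresults_all.txt` are made up for the example, and `tempdir` stands in for the shared filesystem. The PID is left out here; the machine name and the random part of `tempname` are used instead.

```matlab
% Sketch: one output file per process, merged after the run.
outDir = tempdir;                        % assumption: stand-in for the shared directory

[status, host] = system('hostname');     % machine name from the operating system
if status ~= 0
    host = 'unknownhost';
end
host = strtrim(host);
[~, token] = fileparts(tempname);        % random component taken from tempname

logName = fullfile(outDir, sprintf('myresults_%s_%s.log', host, token));
fid = fopen(logName, 'w');
assert(fid ~= -1, 'Could not open the per-process log file.');
fprintf(fid, 'result written on %s\n', host);
fclose(fid);

% After all processes have finished: concatenate the pieces into one file.
parts  = dir(fullfile(outDir, 'myresults_*.log'));
merged = fullfile(outDir, 'myresults_all.txt');
out = fopen(merged, 'w');
assert(out ~= -1, 'Could not open the merged file.');
for k = 1:numel(parts)
    fprintf(out, '%s', fileread(fullfile(outDir, parts(k).name)));
end
fclose(out);
```

Because every process writes only to its own file, no two processes share a file handle; the merge step runs once, in a single process, after the workers are done.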

This is a pretty simple example, but I'd inspect and further instrument the code to see how far it gets and what is stopping the execution.
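One way to instrument the loop is to write a timestamped line before and after each generation, so the last line in the file shows how far the run got. This is only a sketch under assumptions: `progressFile`, `nGenerations`, and the `pause` placeholder are invented for the example and stand in for the real submit-and-wait code, which is not shown in the question.

```matlab
% Sketch: timestamped progress log around each generation.
progressFile = fullfile(tempdir, 'progress.log');   % assumption: any writable path
nGenerations = 3;                                   % placeholder count
stampFormat  = 'yyyy-mm-dd HH:MM:SS';

fid = fopen(progressFile, 'a');    % 'a' appends and flushes after each write
assert(fid ~= -1, 'Could not open the progress file.');
for gen = 1:nGenerations
    fprintf(fid, '%s generation %d: start\n', datestr(now, stampFormat), gen);
    try
        pause(0.01);               % placeholder for submitting the job and waiting
        fprintf(fid, '%s generation %d: finished\n', datestr(now, stampFormat), gen);
    catch err
        % Record the error text before passing the error on.
        fprintf(fid, '%s generation %d: failed: %s\n', ...
                datestr(now, stampFormat), gen, err.message);
        fclose(fid);
        rethrow(err);
    end
end
fclose(fid);
```

If the file ends with a "start" line that has no matching "finished" line, the run stopped inside that generation rather than between generations.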
