Bug 2940 - rescheduling jobs
Status | CLOSED FIXED |
Reported | 2015-07-31 14:57:00 +0200 |
Modified | 2016-06-14 16:14:55 +0200 |
Product: | FieldTrip |
Component: | qsub |
Version: | unspecified |
Hardware: | PC |
Operating System: | Windows |
Importance: | P5 normal |
Assigned to: | Robert Oostenveld |
Marcel Zwiers - 2015-07-31 14:57:12 +0200
When a MATLAB session on an execution host accepts a job and reads it in, it deletes the input.mat file immediately, i.e. before the job has completed successfully. If that MATLAB session crashes, Torque/Maui/Moab reschedules the job and reruns it on a different host, where the new MATLAB session fails because it cannot find the (already deleted) input.mat file. Proposed solution: make 'rerunable' an option in qsubcellfun, and if rerunable==true, delete the input.mat file only at the very end of the job.
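The proposed behaviour can be illustrated with a minimal, language-neutral sketch. This is not the FieldTrip code: the function name `run_job`, the JSON input file, and the way the job function is passed in are all hypothetical, chosen only to show the difference between deleting the input file right after reading it and deleting it after the job has finished.

```python
import json
import os


def run_job(func, input_path, rerunable=False):
    """Read the job arguments from input_path, run func on them, return the result.

    Hypothetical illustration only. With rerunable=False the input file is
    removed right after it has been read, as described in the report. With
    rerunable=True it is removed only after func has returned, so a
    rescheduled job can still find it.
    """
    with open(input_path) as f:
        args = json.load(f)

    if not rerunable:
        # Early cleanup: if the job fails below, a rerun finds no input file.
        os.remove(input_path)

    result = func(*args)

    if rerunable:
        # Late cleanup: reached only when the job completed without an error.
        os.remove(input_path)

    return result
```

With `rerunable=True`, a job that fails partway leaves its input file in place, so a second attempt (for instance on another host that sees the same shared directory) can read it again. The cost is that input files of jobs that never succeed stay on disk until something else cleans them up.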
Marcel Zwiers - 2015-07-31 15:14:58 +0200
Just to be clear, I run into this problem all the time: (massive multi-core) nodes keep crashing, and after such a node reboots, Torque reschedules the job to another node, where MATLAB then reports the missing input.mat file.
Robert Oostenveld - 2015-08-19 15:52:43 +0200
done!

mac011> svn commit
Sending        qsubcellfun.m
Sending        qsubexec.m
Sending        qsubfeval.m
Transmitting file data ...
Committed revision 10607.