Bug 2940 - rescheduling jobs
| Status | CLOSED FIXED |
| Reported | 2015-07-31 14:57:00 +0200 |
| Modified | 2016-06-14 16:14:55 +0200 |
| Product: | FieldTrip |
| Component: | qsub |
| Version: | unspecified |
| Hardware: | PC |
| Operating System: | Windows |
| Importance: | P5 normal |
| Assigned to: | Robert Oostenveld |
| URL: | |
| Tags: | |
| Depends on: | |
| Blocks: | |
| See also: | |
Marcel Zwiers - 2015-07-31 14:57:12 +0200
When the MATLAB session on an execution host accepts and reads in a job, it deletes the input.mat file immediately, i.e. before the job has completed successfully. If that MATLAB session crashes, Torque/Maui/Moab reschedules the job and reruns it on a different host. The new MATLAB session then fails because it cannot find the (deleted) input.mat file. Proposed solution: make 'rerunable' an option in qsubcellfun, and if rerunable==true, delete the input.mat file only at the very end of the job.
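The proposed behaviour can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and is not the FieldTrip code: `write_job`, `run_job` and the pickled job file are hypothetical stand-ins for the input.mat handling, and a raised exception stands in for a crashed session. Only the `rerunable` flag name is taken from the proposal above.

```python
import os
import pickle
import tempfile


def write_job(path, func, args):
    """Hypothetical helper: store a job (function plus arguments) on disk."""
    with open(path, "wb") as fh:
        pickle.dump((func, args), fh)


def run_job(path, rerunable=False):
    """Hypothetical worker: read the job file, run the job, clean up.

    With rerunable=False the input file is removed right after it is read,
    which is the behaviour described in this report. With rerunable=True it
    is removed only after the job has finished successfully, so a rerun on
    another host can still find it.
    """
    with open(path, "rb") as fh:
        func, args = pickle.load(fh)

    if not rerunable:
        os.remove(path)  # early delete: a rerun would not find the input

    result = func(*args)  # if this fails, the lines below are never reached

    if rerunable:
        os.remove(path)  # late delete: only after successful completion
    return result


if __name__ == "__main__":
    job_file = os.path.join(tempfile.mkdtemp(), "input.pkl")

    # First attempt fails (division by zero stands in for a crash).
    write_job(job_file, divmod, (7, 0))
    try:
        run_job(job_file, rerunable=True)
    except ZeroDivisionError:
        pass
    print("input kept after failed run:", os.path.exists(job_file))

    # A rerun with valid input succeeds and only then removes the file.
    write_job(job_file, divmod, (7, 2))
    print("result:", run_job(job_file, rerunable=True))
    print("input kept after successful run:", os.path.exists(job_file))
```

The trade-off is that with late deletion, input files of jobs that never complete stay on disk until something else removes them, which is presumably why the proposal makes this an option rather than the default.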
Marcel Zwiers - 2015-07-31 15:14:58 +0200
Just to be clear, I run into this problem all the time because (massive multi-core) nodes keep crashing. After such a node reboots, Torque reschedules the job to another node, and MATLAB then reports the missing input.mat file error.
Robert Oostenveld - 2015-08-19 15:52:43 +0200
done!

mac011> svn commit
Sending qsubcellfun.m
Sending qsubexec.m
Sending qsubfeval.m
Transmitting file data ...
Committed revision 10607.