Bug 1336 - StopOnError=false does not work with exceeded walltime errors
Status | CLOSED FIXED |
Reported | 2012-02-21 17:04:00 +0100 |
Modified | 2014-07-15 17:25:10 +0200 |
Product: | FieldTrip |
Component: | qsub |
Version: | unspecified |
Hardware: | PC |
Operating System: | Linux |
Importance: | P3 major |
Assigned to: | Robert Oostenveld |
Marcel Zwiers - 2012-02-21 17:04:30 +0100
qsubcellfun(@pause, {0,240,0}, 'timreq',1,'memreq',1024^2, 'StopOnError',false, 'Stack',1)
submitting job marzwi_mentat304_p19124_b22_j001... qstat job id 390837.dccn-l014.dccn.nl
submitting job marzwi_mentat304_p19124_b22_j002... qstat job id 390838.dccn-l014.dccn.nl
submitting job marzwi_mentat304_p19124_b22_j003... qstat job id 390839.dccn-l014.dccn.nl
job marzwi_mentat304_p19124_b22_j003 returned, it required 0 seconds and 5.3 MB
job marzwi_mentat304_p19124_b22_j001 returned, it required 0 seconds and 4.8 MB
=>> PBS: job killed: walltime 188 exceeded limit 181
Warning: cleaning up all scheduled and running jobs, don't worry if you see warnings from "qdel"
> In qsublist at 111
  In qsubcellfun>cleanupfun at 429
  In onCleanup>onCleanup.delete at 61
qdel: Request invalid for state of job MSG=invalid state for job - COMPLETE 390838.dccn-l014.dccn.nl
qdel 390838.dccn-l014.dccn.nl: Signal 42
??? Error using ==> qsubget at 75
the batch queue system returned an error for job marzwi_mentat304_p19124_b22_j002, now aborting

Error in ==> qsubcellfun at 321
[argout, options] = qsubget(jobid{collect}, 'output', 'cell', 'diary', diary, 'StopOnError', StopOnError);
Marcel Zwiers - 2012-02-28 21:07:27 +0100
I think that embedding the while (~all(collected)) .. end block (lines 315-343 in qsubcellfun) in a try-catch statement, such as:

  try
    while (~all(collected))
      ..
    end
  catch Exception
    if ~StopOnError
      rethrow(Exception)
    else
      % Notify the user and let qsubcellfun finish normally with the already collected output only
    end
  end

would be a quick solution to the main problem, which is that all the already collected results are discarded after unexpected MATLAB crashes or walltime errors. This could be an easy intermediate patch that makes qsubcellfun behave much more as expected when setting StopOnError=false (though I also see that it is not the ultimate/ideal solution, because it does stop, just not with an error).
Marcel Zwiers - 2012-02-28 21:11:58 +0100
Ahum, the ~ in the if ~StopOnError statement should of course not be there :-)
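With that correction applied, the proposed wrapper could look roughly like the sketch below. It is not the code from qsubcellfun: the loop, the job order, the simulated failure of job 2 and the squared "results" are all made up for illustration, and only the try-catch structure and the StopOnError test follow the proposal.

```matlab
% Hypothetical stand-in for the collection loop in qsubcellfun.
StopOnError = false;
numjob      = 3;
collected   = false(1, numjob);
argout      = cell(1, numjob);

try
  for collect = [3 1 2]              % assume the jobs return in this order
    if collect==2
      % simulate the error that is thrown for a job killed by the queue
      error('the batch queue system returned an error for job %d, now aborting', collect);
    end
    argout{collect}    = collect^2;  % pretend this is the output of the job
    collected(collect) = true;
  end
catch Exception
  if StopOnError
    rethrow(Exception);
  else
    % keep what was collected so far and report the failure as a warning
    warning('collection stopped early: %s', Exception.message);
  end
end

disp(collected)  % shows which jobs were collected before the failure
```

With StopOnError set to true, the same script ends with the original error instead of the warning.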
Robert Oostenveld - 2012-02-28 23:11:52 +0100
StopOnError is functionality implemented in qsubget, not qsubcellfun. The error should be dealt with in qsubget in a similar fashion as the errors that are caught by the remote MATLAB. My idea is that the queue error (which is detected in the "master") is represented as an error, just like the error that is sent along by the remote MATLAB. The following section from line 154 onward in qsubget

  if StopOnError
    if ischar(err)
      error(err);
    else
      rethrow(err);
    end
  else
    warning('error during remote execution: %s', errmsg);
  end
  end % ~isempty(err)

can then stay the same. The section to be changed is from line 70 onward, which states

  % the STDERR output log file should be empty, otherwise it indicates an error
  tmp = dir(logerr);
  if ~isempty(tmp) && tmp.bytes>0
    % show the error that was printed on STDERR
    type(fullfile(curPwd, tmp.name));
    error('the batch queue system returned an error for job %s, now aborting', jobid);
  end

Instead of throwing the error at that location, it should result in

  err = ft_getopt(options, 'lasterr');
  diarystring = ft_getopt(options, 'diary');

on lines 108 and 109 representing the oXXX and eXXX file contents in err and diarystring.
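A minimal sketch of this idea follows, under stated assumptions. It is not the actual qsubget code: a temporary file stands in for the eXXX log, the job name is invented, and the message format is made up. It only illustrates how a non-empty STDERR log could be stored in err rather than thrown, so that the existing StopOnError section decides what happens next.

```matlab
% Hypothetical stand-ins: logerr plays the role of the eXXX log file and
% jobid is an invented job name.
StopOnError = false;
jobid  = 'example_job';
logerr = [tempname '.e123'];
fid = fopen(logerr, 'w');
fprintf(fid, '=>> PBS: job killed: walltime 188 exceeded limit 181\n');
fclose(fid);

err = [];
tmp = dir(logerr);
if ~isempty(tmp) && tmp.bytes>0
  % represent the queue error as a string instead of throwing it here
  err = sprintf('the batch queue system returned an error for job %s: %s', ...
    jobid, strtrim(fileread(logerr)));
end
delete(logerr);

% same structure as the section quoted from qsubget, using err for the message
if ~isempty(err)
  if StopOnError
    if ischar(err)
      error(err);
    else
      rethrow(err);
    end
  else
    warning('error during remote execution: %s', err);
  end
end
```

With StopOnError set to false the script only issues a warning, so a caller could continue collecting the remaining jobs.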
Marcel Zwiers - 2014-04-16 16:33:17 +0200
I'm reviving this report as it would be a great feature to have (as discussed with Robert).
Robert Oostenveld - 2014-05-14 23:13:05 +0200
I have implemented consistent parsing of MATLAB and torque errors. This allows custom error handling in case torque jobs get killed. I tested it with

  a = qsubcellfun(@myexit, {1, 0, 0}, 'memreq', 1e8, 'timreq', 300, 'StopOnError', 0)

where myexit is a function that will exit MATLAB if the input is 1 and will continue if the input is 0. I also tested some other cases. I have not yet tested walltime/mem violations, but those should behave similarly to the test above.

  roboos@mentat001> svn commit
  Sending        qsub/qsubcellfun.m
  Sending        qsub/qsubget.m
  Sending        qsub/qsublist.m
  Transmitting file data ...
  Committed revision 9531.

@Marcel, please test and reopen if it does not work as expected.
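The myexit function itself is not included in this report. A hypothetical reconstruction that matches the description, to be saved as myexit.m, might look like the sketch below. Returning the input value is an assumption made here so that qsubcellfun has an output to collect.

```matlab
function out = myexit(doexit)
% MYEXIT is a hypothetical test helper: it quits MATLAB if the input is 1,
% which makes the job die without returning a result, and otherwise
% continues and returns its input.
if doexit
  exit;
end
out = doexit;
end
```

Calling myexit(0) returns 0, whereas myexit(1) ends the MATLAB session, so it should only be run inside a job.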