Bug 309 - fail to submit a job
Status | CLOSED FIXED |
Reported | 2010-12-15 13:18:00 +0100 |
Modified | 2011-01-05 12:01:08 +0100 |
Product: | FieldTrip |
Component: | peer |
Version: | unspecified |
Hardware: | PC |
Operating System: | Windows |
Importance: | P1 normal |
Assigned to: | Robert Oostenveld |
Marcel Zwiers - 2010-12-15 13:18:24 +0100
peerlist abc >>..
>>there are 79 peers running on 32 hosts as idle slave...
peercellfun('exp',{2})
>> warning: resubmitting job 1 because it takes too long to get started
Robert Oostenveld - 2010-12-15 15:52:13 +0100
This is probably not a bug. The job is submitted to a slave; the slave tries to start the engine, finds that it cannot get a license (because of license limitations during office hours), drops the job, and switches to zombie. The master resubmits (to another slave) because the job never started on the first slave.
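The sequence described above can be sketched as a small simulation. This is only an illustration in Python, not the FieldTrip peer code (which is MATLAB and C): the names `Slave`, `has_license` and `submit_with_resubmission` are hypothetical, and the model assumes the master simply tries the next non-zombie slave after each failed start.

```python
import math
from dataclasses import dataclass


@dataclass
class Slave:
    """Hypothetical stand-in for a command-line peerslave."""
    name: str
    has_license: bool
    zombie: bool = False

    def run(self, func, arg):
        # Starting the engine needs a license; without one the slave
        # drops the job and switches to zombie.
        if not self.has_license:
            self.zombie = True
            raise RuntimeError(f"{self.name}: cannot get a license")
        return func(arg)


def submit_with_resubmission(slaves, func, arg):
    """Offer the job to non-zombie slaves in turn until one executes it."""
    attempts = 0
    for slave in slaves:
        if slave.zombie:
            continue
        attempts += 1
        try:
            return slave.run(func, arg), attempts
        except RuntimeError:
            # The job never started on this slave: resubmit to another one.
            continue
    raise RuntimeError("no slave could start the job")


# Two slaves without a license, one with: the job runs on the third attempt.
slaves = [Slave("s1", False), Slave("s2", False), Slave("s3", True)]
result, attempts = submit_with_resubmission(slaves, math.exp, 2)
print(attempts, result)
```

In this toy model the cost of each failed attempt is hidden; in the real system every failed start also costs the time the master waits before it decides to resubmit.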
Marcel Zwiers - 2010-12-15 16:12:03 +0100
If it's not a bug it must be a feature :-) FYI, I just ran peercellfun('exp',{2}, 'timreq',0.1) again and it hasn't finished yet (after more than 20 resubmissions and 10 minutes elapsed time)...
Marcel Zwiers - 2010-12-15 16:40:40 +0100
Another half hour has passed and it just found a slave that was willing to process my job in 0 sec... :-)
Robert Oostenveld - 2010-12-15 20:14:58 +0100
The bug/feature has been there from the beginning and is a design consequence of the command-line peerslaves. If there are no licenses available, the peerslaves cannot start an engine and the job cannot be executed. It would be a bug if the job never executed at all, but there is never a guarantee that peercellfun will actually speed up the job. With competing users (in this case one with many big jobs and another with a single small job), the single job is at a disadvantage. Had the single job been bigger, it would not have made a difference. The disappointing performance (and the resubmissions every 30 seconds) has to do with the many peerslaves that cannot get a license but are still running. What do you suggest to solve the problem?
Marcel Zwiers - 2010-12-16 10:38:51 +0100
I suggest that the slave should switch to zombie mode for an hour if it can't get a license.
Robert Oostenveld - 2010-12-19 09:34:10 +0100
r2468 | roboos | 2010-12-19 09:32:18 +0100 (Sun, 19 Dec 2010) | 7 lines
increase zombietimeout in peerslave.exe to 900 seconds (15 minutes)
peerslave.exe returns an error if the matlab engine fails to start
peerslave catches the error and resubmits immediately (used to take 30 seconds)
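A rough back-of-the-envelope check of what this change means for the case reported above, written as a Python sketch. The helper names are hypothetical, and the model assumes that a failed start costs exactly the resubmission delay and nothing else (engine start-up and network time are ignored, and "immediately" is treated as 0 seconds).

```python
import math


def time_lost_to_failed_starts(failed_attempts, delay_per_attempt_s):
    """Total seconds spent waiting before resubmitting after failed starts."""
    return failed_attempts * delay_per_attempt_s


def max_failed_pickups(window_s, zombie_timeout_s):
    """Upper bound on how often one unlicensed slave can accept (and drop)
    a job within a time window, if it sleeps zombie_timeout_s after each try."""
    return math.ceil(window_s / zombie_timeout_s)


# 20 resubmissions at 30 s each: matches the roughly 10 minutes reported earlier.
before = time_lost_to_failed_starts(20, 30)   # 600
# With the error returned right away, the same 20 failures cost no waiting time
# in this simplified model.
after = time_lost_to_failed_starts(20, 0)     # 0
# With zombietimeout at 900 s, one unlicensed slave can fail at most 4 times per hour.
per_hour = max_failed_pickups(3600, 900)      # 4
print(before, after, per_hour)
```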
Robert Oostenveld - 2011-01-05 11:57:03 +0100
selected a long list of resolved bugs from roboos and changed the status into "RESOLVED"