Bug 306 - too many resubmissions with small nr of collected jobs
Status | CLOSED FIXED |
Reported | 2010-12-15 12:13:00 +0100 |
Modified | 2011-01-05 12:01:08 +0100 |
Product: | FieldTrip |
Component: | peer |
Version: | unspecified |
Hardware: | PC |
Operating System: | Windows |
Importance: | P1 enhancement |
Assigned to: | Robert Oostenveld |
URL: | |
Tags: | |
Depends on: | |
Blocks: | |
See also: |
Marcel Zwiers - 2010-12-15 12:13:57 +0100
peercellfun resubmits massively when all jobs have been submitted but only a small number of them (e.g. 1) has been collected, i.e. when (estimated_max - estimated_min) is very small (e.g. zero) and unreliable.

Suggested solution: gradually move from the situation in which no jobs have been collected (see line 399):
    estimated = 3*timreq
to the situation in which jobs have been collected (line 396):
    estimated = estimated_avg + 2*(estimated_max - estimated_min)

I suggest replacing lines 393 (which also contains a logical bug) to 399 with the following weighted average of the two:
    estimated_avg = mean(collecttime(collected) - submittime(collected));
    estimated = (3*timreq + sum(collected)*(estimated_avg + 2*(estimated_max - estimated_min))) / (1 + sum(collected))
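For illustration only, the proposed weighting can be sketched in Python (not the MATLAB code of peercellfun). The function name `blended_estimate` and the list-based arguments are hypothetical. The sketch also assumes that estimated_max and estimated_min are the largest and smallest observed (collecttime - submittime) values, which the comment above does not state explicitly.

```python
from statistics import mean

def blended_estimate(timreq, submittime, collecttime, collected):
    """Weighted average of the prior guess (3*timreq) and the observed estimate.

    Hypothetical helper: `collected` is a list of booleans marking which jobs
    have returned; uncollected entries of `collecttime` are ignored.
    """
    # Turnaround time of every job that has been collected so far
    elapsed = [c - s for s, c, done in zip(submittime, collecttime, collected) if done]
    n = len(elapsed)  # plays the role of sum(collected)
    prior = 3 * timreq
    if n == 0:
        # No observations yet: fall back to the prior alone
        return prior
    # Assumed meaning of estimated_avg, estimated_max and estimated_min
    observed = mean(elapsed) + 2 * (max(elapsed) - min(elapsed))
    # The prior counts as one pseudo-observation, so its weight shrinks as n grows
    return (prior + n * observed) / (1 + n)
```

With a single collected job the result stays close to the midpoint of the prior and the one measurement, so a spread of zero no longer drives the estimate on its own.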
Marcel Zwiers - 2010-12-15 13:09:08 +0100
P.S. Line 389 should, of course, also be adapted (N.B. timreq is never empty) to: elseif ~isempty(timreq)
Robert Oostenveld - 2010-12-15 15:54:00 +0100
If you specify an appropriate timreq or resubmittime, you should not have this problem. Can you please try with either one of these two options?
Marcel Zwiers - 2010-12-15 16:18:19 +0100
Passing an appropriate timreq does not do anything (that is obvious from the code). Passing resubmittime does avoid the problem (that is also obvious from the code), but it is not a good solution, only an undesirable workaround, basically because resubmittime is static and typically very hard to estimate beforehand.
Robert Oostenveld - 2010-12-15 20:05:42 +0100
timreq is currently indeed not acting as expected and should be fixed. Should resubmittime again be removed from the code (it was added upon your request)?
Robert Oostenveld - 2010-12-19 09:36:58 +0100
r2468 | roboos | 2010-12-19 09:32:18 +0100 (Sun, 19 Dec 2010) | 7 lines

Fixed. At the moment it does not use the distribution at all, only the estimated timreq (which is the nanmax of collecttime-submittime). It uses 3*timreq, just like the killswitch on the peerslave. Note that the timreq might slightly increase over time (with more jobs returning) and does not reflect the timreq that was used when submitting.
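As a rough Python sketch of the rule described in this commit message (not the actual r2468 code): timreq is taken as the NaN-ignoring maximum of collecttime - submittime, and a job counts as overdue once it has been out for more than 3*timreq. The function names, the NaN convention for uncollected jobs, the fallback to an initial timreq, and the strict comparison are all assumptions made for the example.

```python
import math

def estimate_timreq(submittime, collecttime, initial_timreq):
    """NaN-ignoring maximum of (collecttime - submittime), like nanmax.

    Assumption: uncollected jobs carry NaN as collecttime, and the initial
    timreq is used until at least one job has returned.
    """
    elapsed = [c - s for s, c in zip(submittime, collecttime) if not math.isnan(c)]
    return max(elapsed) if elapsed else initial_timreq

def should_resubmit(now, submitted_at, timreq):
    """Hypothetical check: resubmit once a job is out longer than 3*timreq."""
    return (now - submitted_at) > 3 * timreq
```

Because the maximum can only grow as more jobs return, the threshold 3*timreq may drift upward during a run, which matches the note that timreq might slightly increase over time.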
Robert Oostenveld - 2011-01-05 11:57:03 +0100
selected a long list of resolved bugs from roboos and changed the status into "RESOLVED"