Bug 144 - peer slaves seem to switch to zombie while the master is sending a job
Status | CLOSED FIXED |
Reported | 2010-09-02 16:30:00 +0200 |
Modified | 2011-01-05 12:01:10 +0100 |
Product: | FieldTrip |
Component: | peer |
Version: | unspecified |
Hardware: | PC |
Operating System: | Mac OS |
Importance: | P1 normal |
Assigned to: | Robert Oostenveld |
URL: | |
Tags: | |
Depends on: | |
Blocks: | |
See also: |
Robert Oostenveld - 2010-09-02 16:30:00 +0200
I observed that the master sends many jobs to the slaves and that, due to smartcpu, some of the idle slaves immediately switch to zombie. So far, so good. But the master initially seems to think that the jobs have been submitted, and after some time they have to be resubmitted because they don't return. Since resubmission is postponed, the failed jobs delay the completion of the full batch considerably. The problem suggests that the tcpserver accepts a job while the status is being toggled in the announce thread (which calls smartcpu_update and smartmem_update). The locking of the mutexhost in the announce thread should be checked, and potentially prolonged.
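The suspected race can be illustrated with a minimal sketch. This is not the peer code: it is a Python toy model in which the class `Host`, its methods and the status constants are all hypothetical names, and a `threading.Lock` stands in for mutexhost. It only shows the idea that the status check and the job acceptance should happen inside the same critical section that the announce thread uses when it toggles the status.

```python
import threading

# Hypothetical status values, named after the states mentioned in the report.
IDLE, ZOMBIE, BUSY = "idle", "zombie", "busy"


class Host:
    """Toy stand-in for the peer's shared host record (all names are assumptions)."""

    def __init__(self):
        self.status = IDLE
        self.lock = threading.Lock()  # plays the role of mutexhost

    def announce_update(self, cpu_is_loaded):
        # Announce thread: toggle the status while holding the lock.
        with self.lock:
            if self.status == IDLE and cpu_is_loaded:
                self.status = ZOMBIE
            elif self.status == ZOMBIE and not cpu_is_loaded:
                self.status = IDLE

    def try_accept_job(self):
        # Server thread: check and claim in one critical section, so the
        # status cannot change between the check and the acceptance.
        with self.lock:
            if self.status != IDLE:
                return False  # refused explicitly, so the sender knows at once
            self.status = BUSY
            return True


host = Host()
host.announce_update(cpu_is_loaded=True)   # slave switches to zombie
refused = host.try_accept_job()            # job is refused instead of being lost
host.announce_update(cpu_is_loaded=False)  # slave becomes idle again
accepted = host.try_accept_job()           # job is accepted
```

With this arrangement a job offered to a zombie slave is rejected immediately, rather than appearing to be submitted and only failing after a timeout.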
Robert Oostenveld - 2010-09-02 17:02:50 +0200
note that this was observed with
>> peercellfun(@peertest, repmat({1000}, 1, n), repmat({30}, 1, n));
and
>> type peertest
function peertest(x, y)
% use the memory
tmp1 = zeros(x*1024*1024/8,1);
% create the cpu load
stopwatch = tic;
while toc(stopwatch)<y
  tmp2 = inv(randn(100));
end
Robert Oostenveld - 2010-09-07 17:25:31 +0200
This has been resolved. It was due to smartcpu being triggered when starting the MATLAB engine. The timing differences seem to be due to differences in memory-allocation speed on mentat005 (fast) and mentat24x (slow). Furthermore, the resubmission in peercellfun has been improved (not sequential, but parallel).
Robert Oostenveld - 2011-01-05 11:57:05 +0100
selected a long list of resolved bugs from roboos and changed the status into "RESOLVED"