
Bug 144 - peer slaves seem to switch to zombie while the master is sending a job

Reported 2010-09-02 16:30:00 +0200
Modified 2011-01-05 12:01:10 +0100
Product: FieldTrip
Component: peer
Version: unspecified
Hardware: PC
Operating System: Mac OS
Importance: P1 normal
Assigned to: Robert Oostenveld
Depends on:
See also:

Robert Oostenveld - 2010-09-02 16:30:00 +0200

I observed that the master sends many jobs to the slaves and that, due to smartcpu, some of the idle slaves immediately switch to zombie. So far, so good. But the master initially seems to think that the jobs have been submitted, and after some time they have to be resubmitted because they don't return. Since resubmission is postponed, the failed jobs considerably delay the completion of the full batch. This suggests that the tcpserver accepts a job while the status is being toggled in the announce thread (which calls smartcpu_update and smartmem_update). The locking of mutexhost in the announce thread should be checked, and the lock potentially held for longer.
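
The suspected race can be sketched in C with POSIX threads. This is a minimal illustration under assumptions, not the actual peer code: only the name mutexhost comes from this report, while the status values and the functions announce_update, try_accept_job and job_finished are hypothetical. The idea is that the tcpserver checks the status and claims the peer inside the same critical section that the announce thread uses to toggle the status, so a job can never be accepted by a peer that is switching to zombie.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical status values; the real peer code may use other names. */
enum peer_status { STATUS_ZOMBIE = 0, STATUS_IDLE = 1, STATUS_BUSY = 2 };

/* Shared host state, protected by mutexhost (name taken from the report). */
static pthread_mutex_t mutexhost = PTHREAD_MUTEX_INITIALIZER;
static enum peer_status current_status = STATUS_IDLE;

/* Announce-thread side (hypothetical): toggle between idle and zombie
 * depending on the machine load. A busy peer is left untouched, so a
 * peer that has already accepted a job is never demoted to zombie. */
void announce_update(int machine_is_loaded)
{
    pthread_mutex_lock(&mutexhost);
    if (current_status != STATUS_BUSY)
        current_status = machine_is_loaded ? STATUS_ZOMBIE : STATUS_IDLE;
    pthread_mutex_unlock(&mutexhost);
}

/* tcpserver side (hypothetical): accept a job only if the peer is idle,
 * and claim it (idle -> busy) before the lock is released.
 * Returns 1 if the job was accepted, 0 if it was refused. */
int try_accept_job(void)
{
    int accepted = 0;
    pthread_mutex_lock(&mutexhost);
    if (current_status == STATUS_IDLE) {
        current_status = STATUS_BUSY;
        accepted = 1;
    }
    pthread_mutex_unlock(&mutexhost);
    return accepted;
}

/* Called when the job has been executed: the peer becomes idle again. */
void job_finished(void)
{
    pthread_mutex_lock(&mutexhost);
    assert(current_status == STATUS_BUSY); /* only a busy peer can finish */
    current_status = STATUS_IDLE;
    pthread_mutex_unlock(&mutexhost);
}
```

With this arrangement a refused job is reported to the master immediately, instead of being discovered only when the result fails to return.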

Robert Oostenveld - 2010-09-02 17:02:50 +0200

note that this was observed with

>> peercellfun(@peertest, repmat({1000}, 1, n), repmat({30}, 1, n));

and

>> type peertest

function peertest(x, y)
% use the memory
tmp1 = zeros(x*1024*1024/8,1);
% create the cpu load
stopwatch = tic;
while toc(stopwatch)<y
  tmp2 = inv(randn(100));
end

Robert Oostenveld - 2010-09-07 17:25:31 +0200

This has been resolved. It was due to smartcpu being triggered when starting the MATLAB engine. The timing differences seem to be due to differences in memory-allocation speed on mentat005 (fast) and mentat24x (slow). Furthermore, the resubmission in peercellfun has been improved (not sequential, but parallel).

Robert Oostenveld - 2011-01-05 11:57:05 +0100

selected a long list of resolved bugs from roboos and changed the status into "RESOLVED"

Robert Oostenveld - 2011-01-05 12:01:10 +0100

selected all old bugs from roboos with status RESOLVED and changed it into CLOSED