Bug 933 - Fieldtrip buffer stops and freezes Matlab
Status: CLOSED FIXED
Reported: 2011-09-05 11:44:00 +0200
Modified: 2012-06-20 15:03:19 +0200
Product: FieldTrip
Component: realtime
Version: unspecified
Hardware: All
Operating System: All
Importance: P1 critical
Assigned to: Boris Reuderink
URL:
Tags:
Depends on: 1234
Blocks: 1263
See also:
Philip van den Broek - 2011-09-05 11:44:16 +0200
In the current BrainStreams implementation, FieldTrip buffers play a central role, both in communicating events and in providing access to the data. For optional preprocessing, BrainStreams automatically starts an extra Matlab session with its own FieldTrip buffer. In addition, for viewing purposes, another Matlab client connects to the same buffer and retrieves data from it at regular intervals. The problem is that at some arbitrary point in time the FieldTrip buffer server stops writing data to the buffer/socket. A call to the buffer mex-file is responsible for this, and it seems to wait for something that is never going to arrive. This phenomenon occurs in our lab (see the description below), but can also be simulated with the test function connectMultiBuffers, also below. At the end, some example output of the test function is included.

==============================================================================
In our lab, the exact configuration is as follows:
------------------------------------------------------------------------------
buffer1 = biosemi2ft --> Matlab: preprocess data and write to new buffer 2 (attached to Matlab)
Buffer 1 and buffer 2 run on the same mac (OSX).
BrainStream connects to buffer 2 and reads both data and events (this runs on another mac (OSX)).
Another machine (Windows XP) connects to buffer 1 and every second updates a plot of the incoming data.
==============================================================================

The test function:

function connectMultiBuffers(isclient)
% start different Matlab sessions and start:
%   one server    : connectMultiBuffers(0)
%   first client  : connectMultiBuffers(1)
%   second client : connectMultiBuffers(1)
%   third client  : ...
% Presumably, increasing N shortens the time until the issue occurs.
url = 'buffer://localhost:1972';
T = 0.01;
N = 1;
hdr.Fs       = 500;
hdr.nChans   = 80;
hdr.nSamples = 0;
epoch = floor(hdr.Fs*T);

if ~isclient
  ft_create_buffer(1972);
  ft_write_data(url,[],'Header',hdr,'Append',false);
  % write data to the buffer in an endless loop
  num = 0;
  while 1
    ft_write_data(url,rand(hdr.nChans,epoch),'Header',hdr,'Append',true);
    num = num + epoch;
    fprintf('Number of written samples: %d\n',num);
    pause(T);
  end
else
  % connect to the buffer and wait until enough samples are available
  pause(T*N);
  while hdr.nSamples < N*epoch
    hdr = ft_read_header(url);
  end
  % read the most recent N*epoch samples in an endless loop
  while 1
    hdr = ft_read_header(url);
    dat = ft_read_data(url,'Header',hdr,'begsample',hdr.nSamples-N*epoch+1,'endsample',hdr.nSamples);
    fprintf('%d\tSize of read data: (%d,%d)\n',hdr.nSamples,size(dat,1),size(dat,2));
    pause(T);
  end
end

=======================================================================
Example output of the test function:
-----------------------------------------------------------------------
Server:

Number of written samples: 161655
Number of written samples: 161660
Number of written samples: 161665
Number of written samples: 161670
Number of written samples: 161675
Number of written samples: 161680
Number of written samples: 161685
Number of written samples: 161690
Number of written samples: 161695
* Matlab freezes, cannot break out with ctrl-c

Client1:

161650  Size of read data: (80,5)
161660  Size of read data: (80,5)
161665  Size of read data: (80,5)
161670  Size of read data: (80,5)
161680  Size of read data: (80,5)
161685  Size of read data: (80,5)
161695  Size of read data: (80,5)
bufread: Interrupted system call
packet size = 0, should be 8
Trying to close socket 222
??? Operation terminated by user during ==> ft_read_header at 777
  In ==> connectMultiBuffers at 31
  hdr = ft_read_header(url);
>>

Client2:

161650  Size of read data: (80,5)
161660  Size of read data: (80,5)
161665  Size of read data: (80,5)
161675  Size of read data: (80,5)
161680  Size of read data: (80,5)
161690  Size of read data: (80,5)
bufread: Interrupted system call
packet size = 0, should be 8
Trying to close socket 222
??? Operation terminated by user during ==> ft_read_data at 593
  In ==> connectMultiBuffers at 32
  dat = ft_read_data(url,'Header',hdr,'begsample',hdr.nSamples-epoch+1,'endsample',hdr.nSamples);
>>
Boris Reuderink - 2011-11-16 15:42:35 +0100
I could reproduce this problem on mentat301 (Redhat Linux), but not on Windows 7. Fieldtrip version was r4724. It seems that the problem occurs *earlier* when a lot of clients (~8) are reading.
Boris Reuderink - 2011-11-17 09:50:41 +0100
With no clients, the problem does not seem to happen on Redhat Linux. So it appears there is some (platform-specific) interaction between the consumer reading from the buffer and the producer writing to the buffer.
Boris Reuderink - 2011-11-23 10:16:29 +0100
Update from Rutger Vlek (translated from Dutch): A similar problem happened to occur in a new setup that I am currently testing. It consists of the application biosemi2ft, which puts data from the amplifier into a FieldTrip buffer, and a Matlab function that takes this data out of the buffer and runs it through preprocessing; everything runs via localhost. Very occasionally the entire data stream (so including biosemi2ft) suddenly stops, in the same way Philip described on Bugzilla. In this setup I poll very frequently for the presence of new data in the FieldTrip buffer, so that my preprocessing follows the data acquisition as closely as possible. I therefore suspect that the problem is not specific to the scenario with multiple clients, but mainly depends on the general load on the socket. At the moment the data stream stopped and the applications were hanging, I tried to request the buffer header from an extra Matlab process. The biosemi2ft program did still respond to this with a message that a new socket was being opened, but the header was not returned and the Matlab process hung as well. While googling I came across something about network stack sizes and how they can differ per operating system. A possible explanation for the problem could be that this stack fills up and that a kind of deadlock arises in which processes wait for each other indefinitely.
Robert Oostenveld - 2011-12-21 15:56:59 +0100
I have put some time into this. The fastest way for me to reproduce the problem is to first start the demo_buffer executable in a separate window, disable ft_create_buffer in the test function, and then start

  connectMultiBuffers(0)
  connectMultiBuffers(0)
  connectMultiBuffers(0)
  connectMultiBuffers(1)

in 4 Matlab instances.

Since I suspect it to be a concurrent thread locking problem, I will focus on dmarequest. I reverted the source code of dmarequest.c to revision 794, which is prior to many changes from Stefan. After removing the properties code it compiles fine, and I found that the bug was already present back then. I then reverted to the HEAD version. With verbose=2 and the following lines added

  pthread_mutex_lock(&mutexblocker);
  fprintf(stderr, "locked mutexblocker\n");
  switch (request->def->command) {
    ....
  }
  pthread_mutex_unlock(&mutexblocker);
  fprintf(stderr, "unlocked mutexblocker\n");

it appears that I can prevent the locking problem. The disadvantage is that the wait is much longer like this, but it helps in zooming in... The mutexblocker around only case PUT_DAT is not enough to prevent the problem; the mutexblocker around both case PUT_DAT and case GET_DAT is enough. So somehow they don't like each other's locking style...

Gotcha!!! The code reveals that case GET_DAT does the following

  pthread_mutex_lock(&mutexdata);
  pthread_mutex_lock(&mutexheader);

whereas case PUT_DAT does

  pthread_mutex_lock(&mutexheader);
  pthread_mutex_lock(&mutexdata);

So it is possible that GET_DAT locks the one and PUT_DAT the other, after which each starts waiting for the lock that the other one holds... The solution is simple, but the underlying problem (an inconsistent order of locking a sequence of mutexes) has to be checked for throughout the code. I have to go now, will discuss with Boris tomorrow.
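For reference, below is a minimal, self-contained C sketch of the deadlock pattern described above. It is not the actual dmarequest.c code; the reader/writer functions and the bare pthread setup are illustrative only. Two threads acquire the same pair of mutexes in opposite order, so each can end up holding one lock while waiting forever for the other; acquiring the mutexes in one fixed order everywhere removes the deadlock.

  /* deadlock_sketch.c -- illustrative only, not the actual dmarequest.c code */
  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_mutex_t mutexheader = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t mutexdata   = PTHREAD_MUTEX_INITIALIZER;

  /* GET_DAT-style handler: locks mutexdata first, then mutexheader */
  static void *reader(void *arg) {
    for (;;) {
      pthread_mutex_lock(&mutexdata);
      pthread_mutex_lock(&mutexheader);  /* can block while the writer holds mutexheader */
      /* ... copy the requested samples out of the ring buffer ... */
      pthread_mutex_unlock(&mutexheader);
      pthread_mutex_unlock(&mutexdata);
    }
    return NULL;
  }

  /* PUT_DAT-style handler: locks mutexheader first, then mutexdata (opposite order) */
  static void *writer(void *arg) {
    for (;;) {
      pthread_mutex_lock(&mutexheader);
      pthread_mutex_lock(&mutexdata);    /* can block while the reader holds mutexdata */
      /* ... append new samples and update the sample count in the header ... */
      pthread_mutex_unlock(&mutexdata);
      pthread_mutex_unlock(&mutexheader);
    }
    return NULL;
  }

  int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, reader, NULL);
    pthread_create(&t2, NULL, writer, NULL);
    /* Sooner or later the reader holds mutexdata while the writer holds mutexheader,
       and both block forever in their second lock call: a classic lock-order deadlock.
       Fix: take the mutexes in the same order (e.g. header before data) in every handler. */
    sleep(10);
    printf("still running (the deadlock is probabilistic; rerun or increase the load)\n");
    return 0;
  }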
Robert Oostenveld - 2011-12-22 09:01:43 +0100
(In reply to comment #4) Yesterday around 16h I started the recompiled stand-alone buffer with 3 connectMultiBuffers writers and one reader. They are all still running. I have monitored it in top; after an initial increase, RSIZE and RPRVT have settled at reasonable numbers, so memory-wise it also seems to behave correctly. I will commit the changed source code.

Work that remains to be done:
1) check for the same error elsewhere in the code
2) recompile the dependent mex files and executables that are in fieldtrip
3) get it recompiled for the artinis NIRS problem

I suggest that 1 is dealt with by Boris, that 2 is made conditional on bug #1234 (i.e. first fix that one, subsequently the required stuff to recompile will surface automatically), and that 3 is moved to a separate bug for discussion with Jason et al.
Robert Oostenveld - 2011-12-22 09:03:07 +0100
manzana> svn commit
Sending        src/dmarequest.c
Transmitting file data .
Committed revision 5067.
Boris Reuderink - 2012-01-17 12:32:40 +0100
(In reply to comment #5)
1) I'll take a look at the buffer code.
2) Indeed, see bug 1234.
3) I'll contact the Artinis people.
I have updated the time estimates to reflect this.
Boris Reuderink - 2012-01-17 14:53:39 +0100
I have added issue 1263 (contacting Artinis to update the ft-buffer in their software). What remains to be done in the context of this bug is to check for other occurrences of the same locking problem in the buffer software.
Rutger Vlek - 2012-01-17 15:00:05 +0100
(In reply to comment #8) The changed code was tested in the DCC BCI setups and so far all runs well! Thank you!
Boris Reuderink - 2012-02-03 10:52:32 +0100
I have found a few instances of the problem in other locations, and fixed them. See SVN revision 5233.
Boris Reuderink - 2012-02-28 15:36:29 +0100
All the tasks in the context of this bug have been completed; for the remaining work, see the related bugs. Marking this bug as RESOLVED:FIXED.