Bug 1319 - implement an engine-based distributed computing toolbox
Status | CLOSED FIXED |
Reported | 2012-02-08 13:29:00 +0100 |
Modified | 2012-12-31 11:46:24 +0100 |
Product: | FieldTrip |
Component: | peer |
Version: | unspecified |
Hardware: | PC |
Operating System: | Mac OS |
Importance: | P3 normal |
Assigned to: | Robert Oostenveld |
URL: | |
Tags: | |
Depends on: | |
Blocks: | |
See also: |
Robert Oostenveld - 2012-02-08 13:29:20 +0100
At the moment the peer implementation has a number of features that are only about 90% supported, i.e. they work more or less, but not perfectly. These features have to be either
- removed from the code, or
- improved to make them work 100%.

I discussed with Guillaume that peer will be used for within-computer distribution of jobs in SPM12. So it is better to get it into a shape that all SPM12 users will be happy with on their quad- and octo-core computers. It might also be desirable to change some defaults to make it more secure, e.g. only allow localhost and this user to connect.
Robert Oostenveld - 2012-03-02 15:30:41 +0100
After the follow-up discussion with Guillaume, the following was concluded: instead of using peer, we will try to implement a local-computer solution that is robust and easy to use, based on MATLAB engines. The old beowulf code can be used as a starting point. The interface should be similar to qsubfeval and qsubcellfun. E.g.

  engpool open 4
  engcellfun(@pause, {1, 2, 3, 4})
  engpool close
Robert Oostenveld - 2012-03-05 10:15:57 +0100
After an initial attempt to use the existing engIEvalString, I decided to switch to a solution with threads to run processes in the background. The threaded solution gives a robust method to check when an engine is busy or free, regardless of whether the engine is local or remote (with ssh on linux).

manzana> svn commit
Adding engine
Adding engine/engcellfun.m
Adding engine/engexec.m
Adding engine/engfeval.m
Adding engine/engget.m
Adding engine/engpool.m
Adding engine/private
Adding engine/private/engine.c
Adding engine/private/engine.m
Adding (bin) engine/private/engine.mexmaci64
Adding engine/private/fexec.m
Adding engine/private/ft_getopt.c
Adding engine/private/ft_getopt.m
Adding (bin) engine/private/ft_getopt.mexa64
Adding (bin) engine/private/ft_getopt.mexglx
Adding (bin) engine/private/ft_getopt.mexmaci
Adding (bin) engine/private/ft_getopt.mexmaci64
Adding (bin) engine/private/ft_getopt.mexw32
Adding (bin) engine/private/ft_getopt.mexw64
Adding engine/private/generatejobid.m
Adding engine/private/getcustompath.m
Adding engine/private/getcustompwd.m
Adding engine/private/getglobal.m
Adding engine/private/matlabversion.m
Adding engine/private/print_mem.m
Adding engine/private/print_tim.m
Sending src/ft_getopt.c
Transmitting file data .........
Committed revision 5384.
Robert Oostenveld - 2012-03-05 10:18:22 +0100
(In reply to comment #2)
Some known issues with the current solution include
- crash of mex file at exit of matlab
- cleaning up of busy engines needs to be improved
- the known exceptions/errors have not yet been tested
- mex file needs to be compiled and tested on apple32, win32/win64 and linux32/linux64
Robert Oostenveld - 2012-03-05 10:49:11 +0100
There were some private functions missing, I added them from qsub/private.

manzana> svn commit
Sending engine/engcellfun.m
Sending engine/engget.m
Adding engine/private/memprofile.m
Adding (bin) engine/private/memprofile.mexa64
Adding (bin) engine/private/memprofile.mexglx
Adding (bin) engine/private/memprofile.mexmaci
Adding (bin) engine/private/memprofile.mexmaci64
Adding (bin) engine/private/memprofile.mexw32
Adding (bin) engine/private/memprofile.mexw64
Adding engine/private/setcustompath.m
Adding engine/private/setcustompwd.m
Adding engine/private/setglobal.m
Adding engine/private/tokenize.m
Transmitting file data ..
Committed revision 5385.
Robert Oostenveld - 2012-03-05 12:57:27 +0100
I made some general cleanups to the code, removed the FIXME sections, increased the speed of submission and collection, and added the missing private/istrue function.

manzana> svn commit
Sending engine/engcellfun.m
Sending engine/engpool.m
Adding engine/private/istrue.m
Transmitting file data ..
Committed revision 5387.
Robert Oostenveld - 2012-10-29 14:22:51 +0100
(In reply to comment #3)
The crash of MATLAB at the end persists. I have done

  enginepool open 2
  done some computations
  enginepool close

and then did some unrelated stuff for 30 minutes or so. When doing

  clear all

the following happened.

---------
This error was detected while a MEX-file was running. If the MEX-file is not an official MathWorks function, please examine its source code for errors. Please consult the External Interfaces Guide for information on debugging MEX-files.

If this problem is reproducible, please submit a Service Request via: http://www.mathworks.com/support/contact_us/

A technical support engineer might contact you with further information. Thank you for your help.

MATLAB may attempt to recover, but even if recovery appears successful, we recommend that you save your workspace and restart MATLAB as soon as possible.

Warning: An error occurred while running the atExit function for the MEX-file /Volumes/Data/roboos/matlab/fieldtrip/engine/private/engine.mexmaci64. However, the MEX-file was cleared from memory.
Robert Oostenveld - 2012-10-31 09:24:24 +0100
Since I won't be able to develop and test on all platforms, I have deleted the 32 bit mex files for now.

mac001> svn commit
Deleting private/engine.mexglx
Deleting private/engine.mexmaci
Deleting private/ft_getopt.mexglx
Deleting private/ft_getopt.mexmaci
Deleting private/ft_getopt.mexw32
Deleting private/memprofile.mexglx
Deleting private/memprofile.mexmaci
Deleting private/memprofile.mexw32
Committed revision 6838.
Robert Oostenveld - 2012-11-04 21:48:01 +0100
(In reply to comment #6)
The problem with the crash upon cleanup was due to an indexing error (num-1 versus num). Fixed the problem. Re-enabled some code that was commented out due to previous debugging attempts. Recompiled on maci64.

mbp> svn commit
Sending private/engine.c
Sending private/engine.mexmaci64
Transmitting file data ..
Committed revision 6872.
Robert Oostenveld - 2012-11-04 21:59:45 +0100
(In reply to comment #8) Also recompiled on linux64, it seems to work fine in a simple test. See revision 6873.
Robert Oostenveld - 2012-11-10 08:49:40 +0100
mac001> svn commit
Sending engine/enginepool.m
Sending engine/private/compile.m
Adding engine/private/compiler.h
Sending engine/private/engine.c
Sending engine/private/engine.mexmaci64
Sending engine/private/ft_getopt.c
Sending engine/private/ft_getopt.mexmaci64
Transmitting file data ......
Committed revision 6904.

r6904 | roboos | 2012-11-10 08:47:27 +0100 (Sat, 10 Nov 2012) | 2 lines

enhancement - implemented the suggestions from Guillaume, keep the mex file locked if engines are running, reuse the same code to close engines, removed matrix.h from includes, updated compile script, fixed some function pointer problems, do not use getpref but rather construct it from within matlab, some improvements to starting up the engines on windows. Tested and recompiled on osx.
Robert Oostenveld - 2012-11-19 22:29:25 +0100
enhancement - various small changes and extensive testing on OSX, the alternative implementation works reasonably well, but is not bug-free. Include the alternative as mex file.

mbp> svn commit
Sending engine/enginefeval.m
Sending engine/private/alternative.c
Sending engine/private/compile.m
Sending engine/private/engine.m
Sending engine/private/engine.mexmaci64
Transmitting file data .....
Committed revision 6957.
Robert Oostenveld - 2012-11-26 10:39:01 +0100
I believe that I figured out why the thread synching failed. Rather than synching separately at the start and end, it should be done like this

  MUTEX_LOCK(&mutex_finish[engine]);
  if (status[engine]!=ENGINE_IDLE) {
    MUTEX_UNLOCK(&mutex_finish[engine]);
    mexErrMsgTxt("The specified engine is not available");
  }

  /* HERE THE RELEVANT PREPARATION HAPPENS */

  COND_SIGNAL(&cond_start[engine]);
  COND_WAIT(&cond_finish[engine], &mutex_finish[engine]);
  DEBUG_PRINT("The engine thread finished\n");
  MUTEX_UNLOCK(&mutex_finish[engine]);

The relevant part is that between the COND_SIGNAL and the COND_WAIT there should not be anything. It starts the thread and then waits until it finishes.
Robert Oostenveld - 2012-11-26 10:40:01 +0100
(In reply to comment #12)

dhcp-97-167> svn commit
Sending private/alternative.c
Sending private/engine.mexmaci64
Transmitting file data ..
Committed revision 6984.

I will now try on windows...
Robert Oostenveld - 2012-11-29 23:19:05 +0100
Ok, so it still did not work. I found this

http://www.multicoreinfo.com/research/misc/Pthread-Tutorial-Peter.pdf

which on page 11 and esp. page 12 describes the following:

  There is a further subtlety regarding the use of condition variables. Under certain conditions the wait function might return even though the condition variable has not actually been signaled. For example, if the Unix process in general receives a signal, the thread blocked in pthread_cond_wait() might be elected to process the signal handling function. If system calls are not restarting (the default in many cases) the pthread_cond_wait() call might return with an interrupted system call error code. This has nothing to do with the state of the condition so proceeding as if the condition is true would be inappropriate.

I was just calling pthread_cond_wait without checking the condition itself. I have now changed that. After that I did not get any deadlocks. However, there still was the occasional "thread busy" at unexpected times, which was due to the thread taking a bit longer to return to IDLE than the mex main loop. For that I built in a pause of one second (only triggered if the IDLE condition problem is detected).

I have been able to run a bunch of tests on linux without further problems.

roboos@mentat001> svn commit
Sending engine/private/alternative.c
Sending engine/private/engine.c
Sending engine/private/engine.mexa64
Sending engine/private/ft_getopt.mexa64
Deleting engine/private/pthreadVC2.dll
Transmitting file data ....
Committed revision 7058.
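For readers who want to see the pattern in isolation, here is a minimal sketch of a start/finish handshake in which every wait is guarded by a predicate, so that a wakeup without a real signal is harmless. It is not the code from engine.c or alternative.c: the names engine_t, has_job, job_done, engine_thread, engine_run and demo are made up for this illustration, and "doubling an integer" stands in for evaluating a job.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical names; these are not the identifiers used in engine.c. */
enum { ENGINE_IDLE, ENGINE_BUSY };

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond_start;   /* main thread -> worker: a job is ready  */
    pthread_cond_t  cond_finish;  /* worker -> main thread: the job is done */
    int status;                   /* ENGINE_IDLE or ENGINE_BUSY             */
    int has_job;                  /* predicate paired with cond_start       */
    int job_done;                 /* predicate paired with cond_finish      */
    int input, output;
} engine_t;

void *engine_thread(void *arg)
{
    engine_t *e = arg;
    pthread_mutex_lock(&e->mutex);
    /* Loop on the predicate: a wakeup alone does not prove a job arrived. */
    while (!e->has_job)
        pthread_cond_wait(&e->cond_start, &e->mutex);
    e->has_job = 0;
    e->output = 2 * e->input;     /* stand-in for evaluating the job */
    e->job_done = 1;
    e->status = ENGINE_IDLE;      /* back to idle before signalling  */
    pthread_cond_signal(&e->cond_finish);
    pthread_mutex_unlock(&e->mutex);
    return NULL;
}

/* Submit one job and block until it is done; returns -1 if the engine is busy. */
int engine_run(engine_t *e, int input, int *output)
{
    pthread_mutex_lock(&e->mutex);
    if (e->status != ENGINE_IDLE) {
        pthread_mutex_unlock(&e->mutex);
        return -1;
    }
    e->status = ENGINE_BUSY;
    e->input = input;
    e->job_done = 0;
    e->has_job = 1;
    pthread_cond_signal(&e->cond_start);
    /* Same rule on this side: re-check job_done after every wakeup. */
    while (!e->job_done)
        pthread_cond_wait(&e->cond_finish, &e->mutex);
    *output = e->output;
    pthread_mutex_unlock(&e->mutex);
    return 0;
}

/* Start one worker, push a single job through it and return the result. */
int demo(int input)
{
    engine_t e = { .status = ENGINE_IDLE };
    pthread_t tid;
    int output = 0;
    int rc;

    pthread_mutex_init(&e.mutex, NULL);
    pthread_cond_init(&e.cond_start, NULL);
    pthread_cond_init(&e.cond_finish, NULL);

    rc = pthread_create(&tid, NULL, engine_thread, &e);
    assert(rc == 0);
    rc = engine_run(&e, input, &output);
    assert(rc == 0);
    pthread_join(tid, NULL);

    pthread_cond_destroy(&e.cond_finish);
    pthread_cond_destroy(&e.cond_start);
    pthread_mutex_destroy(&e.mutex);
    return output;
}
```

Because the predicates are set while the mutex is held, the handshake works regardless of which thread gets the mutex first: if the job is posted before the worker starts waiting, the worker sees has_job and skips the wait. In this sketch the status also flips back to idle under the same mutex before the finish signal is sent, which is one possible way to avoid a window in which the caller sees a finished job on a busy engine. On most systems this builds with something like cc -pthread.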
Robert Oostenveld - 2012-11-29 23:28:27 +0100
(In reply to comment #14)
Note that the spurious wakeups are also described in the answers here

http://stackoverflow.com/questions/1136371/pthread-and-wait-conditions

Furthermore note that I still suspect some memory to leak in the mex file.

The "enginepool open N" command has to be changed a bit. Right now it immediately returns, which is a bit unexpected. Better would be to wait until all engines run.

----

I have now also tested it on maci64, it also seems to work fine there.

mbp> svn commit
Sending engine/private/engine.mexmaci64
Transmitting file data .
Committed revision 7060.

The big question is now whether it works on windows... that is something for tomorrow.
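The suggestion above, that "enginepool open N" should return only once all engines run, could be approached along the following lines. This is only a sketch under assumptions, not what enginepool or engine.c does: pool_t, pool_worker, pool_open and MAX_ENGINES are hypothetical names, and the workers here report ready immediately instead of starting a MATLAB engine.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_ENGINES 16            /* arbitrary limit for this sketch */

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond_ready;
    int ready;                    /* number of workers that have reported in */
} pool_t;

void *pool_worker(void *arg)
{
    pool_t *p = arg;
    /* A real worker would start its engine here, which can take a while. */
    pthread_mutex_lock(&p->mutex);
    p->ready++;
    pthread_cond_signal(&p->cond_ready);
    pthread_mutex_unlock(&p->mutex);
    return NULL;
}

/* Start n workers and return only after all of them have reported ready.
 * Returns the number of ready workers, or -1 if n is out of range. */
int pool_open(int n)
{
    pool_t p = { .ready = 0 };
    pthread_t tid[MAX_ENGINES];
    int i, rc, ready;

    if (n < 0 || n > MAX_ENGINES)
        return -1;

    pthread_mutex_init(&p.mutex, NULL);
    pthread_cond_init(&p.cond_ready, NULL);

    for (i = 0; i < n; i++) {
        rc = pthread_create(&tid[i], NULL, pool_worker, &p);
        assert(rc == 0);
    }

    pthread_mutex_lock(&p.mutex);
    /* Wait on the counter, not on the signal, so spurious wakeups are harmless. */
    while (p.ready < n)
        pthread_cond_wait(&p.cond_ready, &p.mutex);
    ready = p.ready;
    pthread_mutex_unlock(&p.mutex);

    /* The sketch tears everything down again; a real pool would keep running. */
    for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    pthread_cond_destroy(&p.cond_ready);
    pthread_mutex_destroy(&p.mutex);
    return ready;
}
```

The caller blocks on a counter that each worker increments under the mutex, so it does not matter in which order the workers come up or whether some signals arrive before the caller starts waiting.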
Robert Oostenveld - 2012-11-30 10:39:11 +0100
roboos@mentat001> svn commit
Sending engine/private/alternative.c
Sending engine/private/engine.mexa64
Sending engine/private/engine.mexmaci64
Sending engine/private/engine.mexw64
Sending engine/private/ft_getopt.mexw64
Transmitting file data .....
Committed revision 7062.

It works!

@Guillaume: please test it. If there are specific issues, report them as separate (new) bugs rather than continuing along this thread.