Bug 1731 - implement tighter provenance using md5 hashes on input and output
Status | CLOSED FIXED |
Reported | 2012-09-20 23:26:00 +0200 |
Modified | 2012-12-31 11:46:22 +0100 |
Product: | FieldTrip |
Component: | core |
Version: | unspecified |
Hardware: | PC |
Operating System: | Mac OS |
Importance: | P3 normal |
Assigned to: | Robert Oostenveld |
URL: | |
Tags: | |
Depends on: | |
Blocks: | |
See also: | http://bugzilla.fcdonders.nl/show_bug.cgi?id=1565 |
Robert Oostenveld - 2012-09-20 23:26:45 +0200
For example, do

  % do the general setup of the function
  ft_defaults
  ft_preamble help
  ft_preamble callinfo
  ft_preamble trackconfig
  ft_preamble loadvar data
  ft_preamble inputhash data

and then

--------
% FT_PREAMBLE_INPUTHASH is a helper script
% the name of the variable is passed in the preamble field
global ft_default
inputhash = cell(size(ft_default.preamble));
for i=1:numel(ft_default.preamble)
  inputhash{i} = CalcMD5(mxSerialize(eval(ft_default.preamble{i})));
end
--------

and compare the inputhash to the one specified in the actual data. If they differ, the data has been tampered with. Do something similar for the output, but only store the hash in the output. FT_ANALYSISPROTOCOL should be able to compare them as well and indicate steps where something happened with the data in between the FT functions.
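The comparison described above can be sketched outside MATLAB. This is a minimal illustration in Python, not FieldTrip code: `pickle.dumps` stands in for `mxSerialize`, `hashlib.md5` stands in for `CalcMD5`, and the names `content_hash` and `is_unchanged` are hypothetical.

```python
import hashlib
import pickle


def content_hash(value):
    """Return the MD5 hex digest of a serialized copy of `value`.

    pickle.dumps plays the role of mxSerialize and hashlib.md5 the role of
    CalcMD5; both are stand-ins chosen for this sketch.
    """
    serialized = pickle.dumps(value, protocol=4)
    return hashlib.md5(serialized).hexdigest()


def is_unchanged(value, stored_hash):
    """True if `value` still hashes to the hash stored by the previous step."""
    return content_hash(value) == stored_hash


# A previous step stores the hash of its output alongside the data.
data = {"label": ["Cz", "Pz"], "trial": [[1.0, 2.0], [3.0, 4.0]]}
stored_hash = content_hash(data)

# The next step recomputes the hash of its input and compares.
unchanged_before_edit = is_unchanged(data, stored_hash)

# Modifying the data between the two steps changes the hash.
data["trial"][0][0] = 99.0
unchanged_after_edit = is_unchanged(data, stored_hash)
```

A mismatch only shows that the data changed between two steps, not what changed or why.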
Robert Oostenveld - 2012-10-09 22:38:40 +0200
On http://rrcns.readthedocs.org/en/latest/provenance_tracking.html I read the following interesting list. We can use it as a checklist. FT already implements quite a few of them.

- the code that was run:
    - the version of Matlab, Python, NEURON, NEST, etc.;
    - the compilation options;
    - a copy of the simulation script(s) (or the version number and repository URL, if using version control)
    - copies (or URLs + version numbers) of any external modules/packages/toolboxes that are imported/included
- how it was run:
    - parameters;
    - input data (filename plus cryptographic identifier to be sure the data hasn’t been changed, later);
    - command-line options;
- the platform on which it was run:
    - operating system;
    - processor architecture;
    - network distribution, if using parallelization;
- output data produced (again, filename plus cryptographic identifier)
- including log files, warnings, etc.
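A few of the checklist items can be collected with standard-library calls. The following is a rough Python sketch, not FieldTrip code and not from the linked page; the function names, the record layout and the example parameter are assumptions made for illustration.

```python
import hashlib
import os
import platform
import sys
import tempfile


def file_md5(path, chunk_size=65536):
    """MD5 hex digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(input_path, parameters):
    """Collect a subset of the checklist items for one run (hypothetical layout)."""
    return {
        "code": {"interpreter": "Python " + platform.python_version()},
        "run": {
            "parameters": dict(parameters),
            "input": {
                "filename": os.path.basename(input_path),
                "md5": file_md5(input_path),
            },
            "command_line": list(sys.argv),
        },
        "platform": {
            "operating_system": platform.system(),
            "architecture": platform.machine(),
        },
    }


# Usage with a throwaway input file; the parameter is only an example.
with tempfile.NamedTemporaryFile(delete=False, suffix=".dat") as tmp:
    tmp.write(b"abc")
record = provenance_record(tmp.name, {"lpfreq": 30})
os.remove(tmp.name)
```

The sketch leaves out compilation options, external toolbox versions, network distribution and log files, which depend on the setup.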
Robert Oostenveld - 2012-10-13 10:12:01 +0200
I suggest storing the input and output information in cfg.datainfo, just like cfg.callinfo.
Robert Oostenveld - 2012-10-13 10:49:27 +0200
I have been thinking about inputcfg and outputcfg. How about configinfo or paraminfo? Right now we only output the "used" configuration settings, not the desired "input" configuration settings.

Conceptually, the output from a computation depends on the data, the environment (e.g. the MATLAB and FieldTrip version) and the parameters for the algorithm.

  data.cfg.callinfo contains the environment information
  data.cfg.datainfo contains the information about the input data
  data.cfg.paraminfo contains the input parameter information

This leaves data.cfg for the actual parameters as used by the algorithm.
Robert Oostenveld - 2012-10-13 17:13:54 +0200
I have implemented this in

  dataout.cfg.datainfo.input = {}
  dataout.cfg.datainfo.output = {}

Furthermore I have added three ft_default fields: trackcallinfo, trackdatainfo and trackparaminfo; the defaults are yes, no, no. So the default behaviour (so far) has not changed.

I have added ft_version to callinfo (fieldtrip version). It made more sense to move ft_version to utilities, and CalcMD5 along with it. The MD5 calculation is now also used by datainfo.

I will demonstrate this in one of the next FT meetings.

Committed revision 6750. See http://code.google.com/p/fieldtrip/source/detail?r=6750 for all details.
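The effect of the three tracking switches can be illustrated with a small Python sketch. It is not the committed FieldTrip code: `run_step` and `md5_of` are hypothetical, the contents of callinfo and paraminfo are simplified guesses, and the hash field names follow the correction given later in this thread.

```python
import hashlib
import pickle
import platform

# Defaults as described above: callinfo on, datainfo and paraminfo off.
DEFAULT_TRACKING = {
    "trackcallinfo": "yes",
    "trackdatainfo": "no",
    "trackparaminfo": "no",
}


def md5_of(value):
    """MD5 hex digest of a pickled value (stand-in for CalcMD5 + mxSerialize)."""
    return hashlib.md5(pickle.dumps(value, protocol=4)).hexdigest()


def run_step(func, datain, cfg, tracking=None):
    """Run func(datain, cfg) and attach only the provenance that is switched on."""
    flags = dict(DEFAULT_TRACKING)
    flags.update(tracking or {})

    input_hash = md5_of(datain)  # taken before func gets a chance to modify the input
    result = func(datain, cfg)

    out_cfg = dict(cfg)  # the parameters as used by the algorithm
    if flags["trackcallinfo"] == "yes":
        out_cfg["callinfo"] = {"interpreter": platform.python_version()}
    if flags["trackparaminfo"] == "yes":
        out_cfg["paraminfo"] = dict(cfg)  # assumption: the parameters as requested
    if flags["trackdatainfo"] == "yes":
        out_cfg["datainfo"] = {
            "inputhash": [input_hash],
            "outputhash": [md5_of(result)],
        }
    return {"value": result, "cfg": out_cfg}


def scale(data, cfg):
    """Toy analysis step used only for the usage example."""
    return [x * cfg["factor"] for x in data]


default_out = run_step(scale, [1, 2, 3], {"factor": 2})
tracked_out = run_step(scale, [1, 2, 3], {"factor": 2}, {"trackdatainfo": "yes"})
```

With the defaults, only callinfo is attached, which matches the statement that the default behaviour has not changed.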
Robert Oostenveld - 2012-10-13 17:16:36 +0200
It would be useful to have a provenance tutorial. Slightly related to this: the distributed computing tutorial should be revised.
Robert Oostenveld - 2012-12-24 09:29:15 +0100
(In reply to comment #4)

It is not

  dataout.cfg.datainfo.input = {}
  dataout.cfg.datainfo.output = {}

but

  dataout.cfg.datainfo.inputhash = {}
  dataout.cfg.datainfo.outputhash = {}

The desired functionality is in place, so this feature request is resolved.
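With an inputhash and an outputhash stored per step, a pipeline can be checked for changes made between functions, as suggested for FT_ANALYSISPROTOCOL in the first comment. The Python sketch below is only an illustration under simplifying assumptions: each step has a single input and a single output, the hash strings are placeholders rather than real MD5 values, and `find_breaks` is a hypothetical name.

```python
def find_breaks(steps):
    """Return the names of steps whose inputhash differs from the previous outputhash."""
    breaks = []
    for previous, current in zip(steps, steps[1:]):
        if current["inputhash"] != previous["outputhash"]:
            breaks.append(current["name"])
    return breaks


# Placeholder hashes; the step names are used for illustration only.
steps = [
    {"name": "ft_preprocessing", "inputhash": None, "outputhash": "aaa"},
    {"name": "ft_timelockanalysis", "inputhash": "aaa", "outputhash": "bbb"},
    {"name": "ft_timelockbaseline", "inputhash": "ccc", "outputhash": "ddd"},
]

modified_before = find_breaks(steps)
```

In this example the input of the last step does not match the output of the step before it, so the data was changed somewhere between those two calls.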