[CMake] Using CMake/CPack for code provenance

Tue Mar 30 08:52:21 EDT 2010

On 30. Mar, 2010, at 13:38, Biddiscombe, John A. wrote:

> We have a project where the data generated is very large and too costly to store permanently, so we'd like to be able to:
> 
> a) Tag the source code used when the run is initiated using the SCM (svn/git etc.)
> 
> b) Automatically store all the user-configured CMake options (using CMake's CONFIGURE_FILE or something like that to generate a file with all the user-selected options)
> 
> c) Automatically copy all input files (initial boundary conditions etc.) into a particular location alongside the generated settings from b) {user data will be set via cmake options as in a)}
> 
> d) Run CPack to archive all the source code, initial conditions data, other generated files, and user-configured options into a tarball, which can then be stored in a permanent archive so that...
> 
> e) in theory we can untar the package and essentially recreate the complete build/environment in which the simulation was run and reproduce the data rather than storing it, should the need ever arise. (Of course the compiler may not be available in N years, but if most of the flags/settings are stored we can do our best to ensure that all is reproduced. We have to draw the line somewhere.)
> 
> Has anyone done anything similar to this, and if so, are there any things I need to watch out for that I haven't considered? Are there any projects out there that already do something similar that I can get some tips from?
> 
> thanks
> 
> JB

Not sure CPack is the right tool for this... I'd go with custom scripts that are triggered by a custom target.
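A minimal sketch of what such a custom target might look like, assuming the sources (and input files) live in a git repository. The target name "provenance" and the PROVENANCE_DIR variable are made up for illustration; adapt them to your project.

```cmake
# Hypothetical names: the "provenance" target and PROVENANCE_DIR are
# illustrative only and not part of any existing project.
find_package(Git REQUIRED)
set(PROVENANCE_DIR "${CMAKE_BINARY_DIR}/provenance")

add_custom_target(provenance
  # Create the directory that collects everything to be archived.
  COMMAND ${CMAKE_COMMAND} -E make_directory "${PROVENANCE_DIR}"
  # Export the committed tree at HEAD as a tarball.
  COMMAND ${GIT_EXECUTABLE} archive --format=tar.gz
          "--output=${PROVENANCE_DIR}/source.tar.gz" HEAD
  # Keep a copy of the cache, which records the configured options.
  COMMAND ${CMAKE_COMMAND} -E copy
          "${CMAKE_BINARY_DIR}/CMakeCache.txt"
          "${PROVENANCE_DIR}/CMakeCache.txt"
  WORKING_DIRECTORY "${CMAKE_SOURCE_DIR}"
  COMMENT "Collecting source snapshot and build settings"
  VERBATIM)
```

Building the target (e.g. "make provenance" with a Makefile generator) would fill the directory. Note that "git archive HEAD" only captures committed state, so uncommitted local changes would not end up in the snapshot.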

I'm doing similar things by managing input files and source code with git (tagging known states is very useful, obviously), but so far I haven't worried about build settings. Perhaps you could use a cache-initialization script (refer to the docs of cmake's -C option), which you can manage along with the code and the input files. Of course, you then have to somehow ensure/trust that people don't modify the cache directly, but instead modify the cache-initialization script, remove the build tree and rebuild from scratch. Another option would be to include code in your CMakeLists.txt files that automatically writes such a cache-initialization script, which one can then use to reproduce the build.
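The second option could be sketched roughly as follows. This is only an illustration: the SIM_* variable names and the output file name are hypothetical, and the simple quoting used here would break for values that contain double quotes or backslashes.

```cmake
# Place near the end of the top-level CMakeLists.txt, after all options
# have been defined. SIM_GRID_SIZE and SIM_INPUT_DIR are hypothetical
# project options; list whichever variables matter for your build.
set(_init_file "${CMAKE_BINARY_DIR}/reproduce-cache.cmake")
file(WRITE "${_init_file}" "# Generated cache-initialization script\n")

foreach(_var CMAKE_BUILD_TYPE SIM_GRID_SIZE SIM_INPUT_DIR)
  if(DEFINED ${_var})
    # Record the current value as a cache entry for a later fresh build.
    # Every entry is written with type STRING for simplicity.
    file(APPEND "${_init_file}"
      "set(${_var} \"${${_var}}\" CACHE STRING \"recorded value\")\n")
  endif()
endforeach()
```

The generated file can then be archived with the sources and used on an empty build tree via "cmake -C reproduce-cache.cmake <path-to-source>". Since the entries are written without FORCE, they do not override values in an existing cache, which is one more reason to start from a clean build tree.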

I don't think it is very useful to store specific compiler information, though, because you would then have to ensure that the versions of the compiler, all required libraries (including standard libraries), the operating system etc. match. After all, there might be a new bug in the compiler, or a bug fix which changes the results. For complete reproducibility you'd have to archive the whole OS plus hardware, which in your case is probably out of the question ("Excuse me, do we have some spare shelf space for a Cray XT5 somewhere around here?") ;-)

Michael