[Insight-developers] New workflow to add test data

Cory Quammen cquammen at cs.unc.edu
Mon Jun 20 13:31:24 EDT 2011


On Mon, Jun 20, 2011 at 12:23 PM, Brad King <brad.king at kitware.com> wrote:
> On 06/20/2011 11:21 AM, Cory Quammen wrote:
>> The new workflow for adding test data is very nice. You are too modest
>> about the quality of your instructions in the email (they were quite
>> detailed).
>
> Thanks.
>
>> One thing that caused me some headaches was the hiding of my original
>> test data after running CMake. When I made a mistake while going
>> through this process (which happened a number of times)
>
> What were some of the mistakes you made?  If appropriate I'd like to
> add troubleshooting links to the right-hand column on the instructions
> page.

My biggest problem was assuming that the scripts for this workflow
were up to date in my branch. I was working on a branch based on a
commit in master that evidently lacked some of the necessary updates
to the ExternalData scripts. After a few false starts I realized my
error, ran 'git pullall' on master, and rebased my working branch on
top of it. This was my fault, and it shouldn't be a problem once
people base their branches on more recent commits to master.

My other "problems" were things I tried when the gerrit builds were
still failing after I rebased my branch. Since the failures were
caused by a server misconfiguration, the actions I took weren't
necessary, but I still re-ran the process from the start each time.

>
>> I deleted the MD5 files and started the process again from a clean slate.
>
> From the point of view of the ExternalData module (and this workflow)
> removing a content link is equivalent to removing the real data file.
> In the latter case you would have to copy it in again too.

> A key concept to this whole approach is that the content link and real
> data object are equivalent and interchangeable in the source tree.  Once
> one has the content link there should never be a reason to put the real
> file back (unless you try to mess with the .ExternalData* files which
> are implementation details).  If you've found a way to wander off the
> documented workflow in a way that requires it then we need to address
> that case.

The content link and real data object aren't interchangeable if you
want to edit the real data object after the content link has been
generated. Editing a data object may be relatively rare compared to
regenerating it entirely and copying it back into the tree, but I
expect it will happen.

If the process continues to remove the data object files, it would be
good to add a warning to the wiki page stating that the real data
object disappears from the source tree after CMake runs, and advising
developers to keep a copy of it outside the source tree.
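
For anyone bitten by this, the content link is just a small text file
holding the hash, so it is easy to check whether a copy of the data
object kept outside the tree still matches its link. A rough sketch in
Python (assuming the link holds a single MD5 hex digest, which is what
the .md5 extension suggests):

    import hashlib
    import sys

    def matches_content_link(data_path, link_path):
        # The content link is a small text file holding the hash.
        with open(link_path) as f:
            expected = f.read().strip()
        md5 = hashlib.md5()
        with open(data_path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                md5.update(chunk)
        return md5.hexdigest() == expected

    if __name__ == '__main__':
        data, link = sys.argv[1], sys.argv[2]
        print('match' if matches_content_link(data, link)
              else 'MISMATCH')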

>> It would be nice if the test data could be left in place
>
> I originally had this goal in mind.  However, I later realized that it
> is both hard to do and conceptually incorrect:
>
> - It is hard to .gitignore data files (see pre-commit response below)
>  so this requires a separate "git add" for every content link rather
>  than "git add ." in the directory.

That is acceptable to me.

> - The trigger for CMake to do the data->link conversion is the presence
>  of the real file named by DATA{}.  Since developers may "mv" the files
>  into the source tree we cannot rely on modification timestamps to
>  know if the content links are out of date.  Therefore we would have
>  to transform every time CMake runs, and the transform is expensive
>  compared to reading a content link.

Not to get too far off topic, but it seems there may also be a
problem if a developer moves source code files into the tree via
"mv". Is that the case?

> - The ExternalData staging ".ExternalData_${algo}_${hash}" file must
>  be created atomically.  A cheap, easy, and reliable way to do this
>  is to rename the original file.
>
> - Since the real data file and content link are equivalent (see above)
>  it makes sense to have only one present at a time.  If you copy in
>  the original data file again CMake will transform it again to
>  preserve this equivalence.
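
That makes sense. For the archives, here is my rough mental model of
the conversion, as an illustrative sketch only (not the module's
actual implementation; the file-name patterns are taken from your
description above):

    import hashlib
    import os

    def convert_to_content_link(path):
        # Hash the real data file.
        md5 = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                md5.update(chunk)
        digest = md5.hexdigest()
        # Stage the real file under its hash with an atomic rename
        # (the ".ExternalData_MD5_<hash>" staging file above).
        os.rename(path, os.path.join(os.path.dirname(path) or '.',
                                     '.ExternalData_MD5_' + digest))
        # Leave a small content link in the file's place.
        with open(path + '.md5', 'w') as f:
            f.write(digest + '\n')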
>
>> Perhaps a pre-commit check could be added to prevent this original test
>> data from being committed.
>
> It is hard to write such a check.  It has no idea what the file names
> will be (data can have many possible extensions).  One may also add
> legitimate images like tiny icons in documentation.  My plan is to rely
> on Gerrit reviews for rejecting real data files from commits along with
> a simple max-blob-size on the server.

I was thinking that you could prevent the commit of a data file when a
content link corresponding to the data file's name is present (e.g., I
accidentally commit "test.png" while its content link "test.png.md5"
is also in the commit). If there is no .md5 file for a given image
file, then it should be up to Gerrit reviews to determine whether that
file gets in, as you say.
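
Concretely, I was imagining something like this in the pre-commit
hook (an untested Python sketch; it only inspects the staged file
names):

    import subprocess
    import sys

    # Names of all files staged for this commit.
    out = subprocess.check_output(['git', 'diff', '--cached',
                                   '--name-only'])
    staged = set(out.decode().splitlines())

    # A staged file whose content link is staged alongside it is
    # almost certainly real data committed by mistake.
    offenders = sorted(f for f in staged if f + '.md5' in staged)
    for f in offenders:
        sys.stderr.write('error: "%s" looks like a real data file; '
                         'its content link "%s.md5" is also staged\n'
                         % (f, f))
    if offenders:
        sys.exit(1)

A more thorough check would also consult content links already in the
tree, but the staged-pair case above is the mistake I had in mind.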

Thanks,
Cory

>
>> Another point of concern: It seems possible that the gerrit builds
>> could begin before the data objects are transferred to the web server
>> if the processes are not synchronized. Is it set up so that the
>> transfer of the objects always happens prior to running the gerrit
>> builds?
>
> Yes, that is a potential problem.  The data sync robot runs every minute
> but it is possible for the test builds to jump in.  I still need to work
> with Marcus (who runs the auto-build driver) to combine the two robots
> so that things go in order.
>
> -Brad
>



-- 
Cory Quammen
Research Associate
Department of Computer Science
University of North Carolina at Chapel Hill
http://www.cs.unc.edu/~cquammen

