[Insight-developers] New workflow to add test data

Cory Quammen cquammen at cs.unc.edu
Tue Jun 21 16:13:20 EDT 2011


On Mon, Jun 20, 2011 at 2:37 PM, Brad King <brad.king at kitware.com> wrote:
[snip]

> What kind of files are you editing?  If a file is editable text one
> might argue that it is a source file and should not be treated like
> data with this approach in the first place.

My test input data in this case was a 4 pixel by 4 pixel image of all
white, something that is easily generated in ImageJ. Luckily I had it
in the "old" test data location in my local source tree so I could
copy it back to the new test data location. If I hadn't had a backup,
I would have had to regenerate it in ImageJ. That would have taken 30
seconds, but it's another step that I don't think I should have to
take.

Maybe a compromise would be a script that moves the original file
back, given the .md5 file name as an argument. I'm still out of luck
if I delete the .md5 files, but oh well.
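A minimal sketch of such a script might look like the following. It assumes the object store keeps files under <store>/MD5/<hash>, as the CMake ExternalData module does; the function name and the store path passed in are hypothetical, not part of any existing tool.

```python
import os
import shutil


def restore(content_link, store):
    """Copy the data object named by a .md5 content link back beside it.

    content_link: path to a file like 'Testing/Data/Input/test.png.md5'
                  whose contents are the MD5 hash of the real data file.
    store:        path to a local ExternalData object store, assumed to
                  hold objects at <store>/MD5/<hash>.
    """
    if not content_link.endswith(".md5"):
        raise ValueError("expected a .md5 content link: %s" % content_link)
    with open(content_link) as f:
        md5 = f.read().strip()
    obj = os.path.join(store, "MD5", md5)
    if not os.path.exists(obj):
        raise IOError("object %s not found in store %s" % (md5, store))
    # Strip the .md5 suffix to recover the original file name.
    target = content_link[:-len(".md5")]
    shutil.copyfile(obj, target)
    return target
```

Invoked as, say, `restore("test.png.md5", os.environ["ExternalData_OBJECT_STORES"])`, it would put test.png back next to its content link without touching the store.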

>> If the process continues to remove the data object files, then it
>> would be good to add a warning to the wiki page stating that the real
>> data object will disappear after running CMake, so keeping a copy of
>> the data object outside the source tree is advisable.
>
> I added a note in the new instructions page in the extra-info column
> on the right of that step:
>
>  http://www.itk.org/Wiki/ITK/Git/Develop/Data#Run_CMake
>
> It links to the details discussion at the bottom of the page which
> says where the data go.

The text isn't very noticeable in the right column. From my point of
view, the moving of the data objects is very unexpected behavior and
*must* be emphasized to the developer, so I suggest putting it in the
main line of the text, right after "This means that CMake converted
the file into a data object referenced by a "content link"."

>>>> It would be nice if the test data could be left in place
>>>
>>> I originally had this goal in mind.  However, I later realized that it
>>> is both hard to do and conceptually incorrect:
>
> One more point:
>
> - The real data object must go into a .ExternalData_${algo}_${hash} file.
>  If the original file were to stay around then we would need to make a
>  copy.  Since the approach is meant to work equally well with large data
>  files we should keep as few copies as possible.  Right now we make no
>  copies and just move/rename the original file a couple of times.

We could argue forever about reasons for moving vs copying the
original data objects, how to determine whether the content links need
to be regenerated, etc.

I've filed my protest against moving the data objects, and you've
given implementation details that justify why they should be moved.
Others should chime in.

>>>  so this requires a separate "git add" for every content link rather
>>>  than "git add ." in the directory.
>>
>> That is acceptable to me.
>
> The DATA{} syntax supports image series as documented in the ExternalData
> module so some developers may add a large number of files at once.  I
> think it is too likely that a real file may leak through in this case.

The pre-commit hook setup that prevents committing data files should take care of this, no?

[snip]

>> I was thinking that you could prevent the commit of a data file if
>> there were a content link corresponding to the data file name (e.g., I
>> accidentally commit "test.png" but its content link "test.png.md5" is
>> also in the commit). If there is no .md5 file for a given image file,
>> then it should be up to Gerrit reviews to determine whether that file
>> gets in, as you say.
>
> Great idea, thanks!  I'll look at adding that because it is useful
> independent of the above discussion.

Great.
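For the record, the check I have in mind could be sketched roughly like this. The function name is hypothetical, and a real hook would feed it the staged file list from `git diff --cached --name-only`; this is only an illustration of the rule, not ITK's actual hook.

```python
def find_leaked_data(staged):
    """Return staged data files whose .md5 content link is also staged.

    A file like 'test.png' is flagged only when 'test.png.md5' appears
    in the same commit; a data file with no content link is left for
    Gerrit review to judge, as discussed above.
    """
    staged = set(staged)
    return sorted(f[:-len(".md5")]
                  for f in staged
                  if f.endswith(".md5") and f[:-len(".md5")] in staged)
```

A pre-commit hook would run this over the staged paths and abort the commit with a message listing the offending files if the result is non-empty.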


-- 
Cory Quammen
Research Associate
Department of Computer Science
University of North Carolina at Chapel Hill
http://www.cs.unc.edu/~cquammen
