[Insight-developers] Testing Data

Mon Feb 7 09:58:38 EST 2011

Hi,

As Terry asked, here are my thoughts on testing data management.
This issue has been discussed several times here, and some parts may
not seem new: they have been copied and pasted from previous mails.

i. git submodule is bad for this task

The ITK development process has become more efficient in ITK v4,
especially with the use of git and gerrit, but also significantly
more complicated. I'm afraid this complexity may prevent some new
developers from joining the development effort.
The Testing/Data submodule is the worst example to date.

i.a. The workload for the developer is significantly higher than
what it was, or what it could be.
Here are a few examples to highlight the differences with other  
technical solutions:

* with cvs (ITK up to version 3.20), adding a new test with some
test images was:

    cvs add Testing/Code/...
    cvs add Testing/Data/...
    cvs ci

* with git alone, it would be:

    git add Testing/Data/...
    git add Testing/Code/...
    git commit
    git push

* with git submodule, it is:

    cd Testing/Data
    git add ...
    git commit
    git push
    cd -
    git add Testing/Data
    git add Testing/Code/...
    git commit
    git config "hooks.Testing/Data.update" 085e657..9dc1292  # copy/paste from the error message of the previous commit
    git commit
    git push

That is 11 commands instead of 3, about 3.7 times as many as in the
cvs case (an increase of roughly 267%).

i.b. git submodule is not contribution friendly

Because write access is required to push to ITKData, contributors
who don't have this write access will find it very difficult to
submit new tests to gerrit. The contributors can still publish
their test images elsewhere (where?) but then
* the review in gerrit becomes harder, because the reviewer has to get
the testing data by hand,
* the workload for the committer to the main ITK repository is
increased: he has to commit the images himself in the ITKData
repository and modify the submitted patch to point to the right
version of the ITKData submodule.
Also, if a patch is rejected in gerrit but the data have already been
committed in ITKData, the useless data will stay forever in the
ITKData repository.

i.c. git submodule is error prone

As shown several times already in real life examples, it is very easy
to get the wrong ITKData version when merging several patches which
have modified the required ITKData submodule version.
This should be fixed now, by using an extra git hook. This hook still
adds some maintenance complexity though.

i.d. git submodule makes the history harder to read

Because the history of the main repository and the submodule are not  
tightly coupled, it is hard to know why a test image was added or  
which image was added or modified to fix or add a new test.

So, to summarize, I understand that git submodule may have been
tempting to manage ITK's testing data, but real life usage has shown
that git submodule is not well suited for this task. I'll personally
be glad when we move away from this solution.

ii. Testing/Data is not that big

  ITKData repository takes 74 MB.
  ITK repository takes 154 MB.
  ITK build directory takes 1.3 GB, or 8.4 GB if we don't take care to
remove the temporary data after running the tests.
  ITK build with wrapping takes 5.3 GB.

and it could be smaller. The files, without the .git directory, use 36  
MB, and could be reduced to 22 MB by removing the few files bigger  
than 512 KB.
This is the result of 10 years of development. Continuing at that
pace seems quite reasonable.
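
For anyone who wants to reproduce this kind of measurement on their own
checkout, here is a minimal sketch. It is not the script used for the
figures above; the function name data_sizes and the default limit
argument are only illustrative. It sums the file sizes under a
directory, skipping .git, and also reports the total once files above
the limit are left out.

```python
import os


def data_sizes(root, limit=512 * 1024):
    """Return (total_bytes, bytes_without_large_files) for the tree under root.

    The .git directory is skipped, and files larger than `limit` bytes are
    left out of the second number only.
    """
    total = 0
    without_large = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not follow or count symbolic links
            size = os.path.getsize(path)
            total += size
            if size <= limit:
                without_large += size
    return total, without_large
```

Calling data_sizes("Testing/Data") from the top of a source tree would
give both totals in bytes.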

iii. ITK needs to be able to store large data files

Kitware's solution seems fine for this task, even if it seems to have  
several potential problems at this time:

* Added complexity to manage the testing data (but this can be
improved, see below)
* No ability to commit offline as promised by the switch to git
* Will make it difficult to run the tests offline
* Would encourage developers to submit bigger testing data than
needed, which may, in the long term, lead to significant network
traffic and storage usage, and probably to a longer testing time.

iv. How to store the testing data

iv.a. Using two solutions

My preference is still to commit the testing data with the tests in
the main ITK repository. Having code, tests and testing data stored in
the same place and in the same commit, as a transactional set, seems
logical. What sense does a test make without its data?
A hook is already in place to limit the size of the files in this
repository. While using two methods for this task may not seem
optimal, this would

* Keep the workload quite low for the developers.
* Encourage the developers to use small baselines.
* Make it easier to review new or fixed tests and their data in
gerrit, by allowing the submission to include their testing data.
* Make it easy to run only a subset of the tests, depending on
whether an internet connection is available.

The large data would be stored online using Kitware's solution.

iv.b. Using Kitware's solution only

A single solution means less to learn for the developer. Not all
developers may need to upload large testing data, though.
The good point: after git submodules, it is very likely that this  
solution would be more convenient than the current one!
See also iii. for more details.

v. Enhancing the developer experience

If it is decided to use Kitware's solution, I would like to see
these goals reached:

* Don't require any new user registration on a new website

Gerrit already requires registration to submit a change.
This account should be enough.

* Keep all data management in git subcommands and aliases.

We have already added several aliases to make gerrit usage easier, and
it works very well.
The same should be done for the data management to keep all the
development management in git. This is closely related to git anyway,
because the .md5 files will have to be committed in git.

* Use very few command lines - ideally, not many more than what a
developer would have to use with a git only solution.

For example, it can be:

    git adddata Testing/Data/...
    git add Testing/Code/...
    git commit
    git push    # or gerrit-push

The first command, git adddata, would
  - compute the md5 hash of each file,
  - git add the .md5 files produced by the previous step,
  - and upload the files to a remote host.

Uploading should be possible even for ordinary contributors, as it
is now for gerrit, not only for the ITK developers with write
access to the main repository.
On the user side, the extra steps which may be required for the data
management (for example, moving the data from a temporary location to
the final one) should be transparent and not require any user action.
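
To make the three steps of the proposed git adddata more concrete, here
is a minimal sketch of what such a helper could do. Everything in it is
an assumption made for illustration: git adddata does not exist, the
names md5_of and add_data are hypothetical, and the "upload" is
replaced by a copy into a local directory (store_dir) because the
actual protocol of the remote host is not specified here.

```python
import hashlib
import os
import shutil


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of the file at `path`."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that large images are not loaded in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def add_data(path, store_dir):
    """Hypothetical core of `git adddata` for a single file.

    1. compute the md5 hash of the file,
    2. write it to <path>.md5 (the file that would then be `git add`ed),
    3. "upload" the data; here, a plain copy to store_dir/<md5>, which
       only stands in for a real remote host.
    Returns the path of the .md5 file.
    """
    digest = md5_of(path)
    md5_path = path + ".md5"
    with open(md5_path, "w") as f:
        f.write(digest + "\n")
    os.makedirs(store_dir, exist_ok=True)
    target = os.path.join(store_dir, digest)
    if not os.path.exists(target):  # identical content is stored only once
        shutil.copyfile(path, target)
    return md5_path
```

A git alias wrapping such a helper would then run git add on the
returned .md5 path, so that the developer only types one command per
data file.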

* Retaining the ability to test ITK and commit offline would be nice.  
This would require
  - a tool to get all the needed testing data at once without having  
to build anything
  - the ability to put the testing data in a cache if it cannot be  
uploaded immediately, and trigger the upload once connected.
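
The two requirements above could look roughly like the sketch below.
It is only an illustration under several assumptions: each data file
is represented in the source tree by a .md5 file containing its hash,
the remote host is simulated by a local directory (store_dir), and the
function names find_md5_files, fetch_all and flush_pending are
hypothetical.

```python
import os
import shutil


def find_md5_files(root):
    """Yield the path of every .md5 file under root, skipping .git."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            if name.endswith(".md5"):
                yield os.path.join(dirpath, name)


def fetch_all(root, store_dir, cache_dir):
    """Fill cache_dir with every object referenced under root, without building.

    Returns the list of digests that could not be found in store_dir.
    """
    os.makedirs(cache_dir, exist_ok=True)
    missing = []
    for md5_path in find_md5_files(root):
        with open(md5_path) as f:
            digest = f.read().strip()
        cached = os.path.join(cache_dir, digest)
        if os.path.exists(cached):
            continue  # already available offline
        source = os.path.join(store_dir, digest)
        if os.path.exists(source):
            shutil.copyfile(source, cached)
        else:
            missing.append(digest)
    return missing


def flush_pending(pending_dir, store_dir):
    """Send the objects queued while offline; returns how many were sent.

    The "upload" is a move into store_dir, standing in for a real transfer.
    """
    if not os.path.isdir(pending_dir):
        return 0
    os.makedirs(store_dir, exist_ok=True)
    sent = 0
    for name in sorted(os.listdir(pending_dir)):
        shutil.move(os.path.join(pending_dir, name),
                    os.path.join(store_dir, name))
        sent += 1
    return sent
```

With something like this, fetch_all would be run once while connected,
and flush_pending could be triggered by a later push once the
connection is back.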

* Encourage the developers to reuse the existing testing data when
possible instead of uploading a new large data set. Not sure how to do
that; any idea is welcome.

Then the points listed in iii. would be mostly gone.

Regards,

Gaëtan

PS: I noticed during my trip to the NAMIC week and the ITK v4 meeting
that I'm still far from grasping the subtleties of the English
language. For example, I still don't understand how "simple" may upset
anyone in the name SimpleITK. If you feel offended by anything in this
mail, please don't be, there is no such intention on my side.

-- 
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66    fax: 01 34 65 29 09
http://voxel.jouy.inra.fr  http://www.itk.org
http://www.mandriva.org  http://www.bepo.fr
