[Insight-developers] Testing Data
Gaëtan Lehmann
gaetan.lehmann at jouy.inra.fr
Mon Feb 7 09:58:38 EST 2011
Hi,
As asked by Terry, here are my thoughts on the testing data management.
This issue has been discussed several times here, and some parts may
not seem new — this is because they have been copy/pasted from some
previous mails.
i. git submodule is bad for this task
The ITK development process has become more efficient in ITK v4,
especially with the usage of git and gerrit, but also significantly
more complicated. I'm afraid this complexity may prevent some new
developers from joining the development effort.
The Testing/Data submodule is the worst example to date.
i.a. The workload for the developer is very significantly higher than
what it was, or what it could be.
Here are a few examples to highlight the differences with other
technical solutions:
* with cvs (ITK up to version 3.20), adding a new test with some
test images was:
cvs add Testing/Code/...
cvs add Testing/Data/...
cvs ci
* with git alone, it would be:
git add Testing/Data/...
git add Testing/Code/...
git commit
git push
* with git submodule, it is:
cd Testing/Data
git add ...
git commit
git push
cd -
git add Testing/Data
git add Testing/Code/...
git commit
git config "hooks.Testing/Data.update" 085e657..9dc1292 # copy/paste from the error message of the previous commit
git commit
git push
That is 11 command lines instead of 3: about 3.7 times as many as in
the cvs case, an increase of roughly 267%.
i.b. git submodule is not contribution friendly
Because write access is required to push to ITKData, contributors
who don't have this write access will find it very difficult to submit
new tests to gerrit. The contributors can still publish their test
images elsewhere (where?) but then
* the review in gerrit becomes harder, because the reviewer has to get
the testing data by hand,
* the workload for the committer to the main ITK repository is
increased: he has to commit the images himself in the ITKData
repository and modify the submitted patch to point to the right
version of the ITKData submodule.
Also, if a patch is rejected in gerrit but the data have already been
committed in ITKData, the useless data will stay forever in the
ITKData repository.
i.c. git submodule is error prone
As shown several times already in real life examples, it is very easy
to get the wrong ITKData version when merging several patches which
have modified the required ITKData submodule version.
This should be fixed now, by using an extra git hook. This hook still
adds some maintenance complexity though.
i.d. git submodule makes the history harder to read
Because the histories of the main repository and the submodule are not
tightly coupled, it is hard to know why a test image was added or
which image was added or modified to fix or add a new test.
So, to summarize, I understand that git submodule may have been
tempting to manage ITK's testing data, but real life usage has shown
that git submodule is not well suited for this task. I'll personally
be glad when we move away from this solution.
ii. Testing/Data is not that big
ITKData repository takes 74 MB.
ITK repository takes 154 MB.
ITK build directory takes 1.3 GB, or 8.4 GB if we don't take care to
remove the temporary data after running the tests.
ITK build with wrapping takes 5.3 GB.
And Testing/Data could be smaller: the files, without the .git
directory, use 36 MB, and could be reduced to 22 MB by removing the
few files bigger than 512 KB.
This is the result of 10 years of development. Continuing at that
pace seems quite reasonable.
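To see which files sit above the 512 KB mark mentioned above, something
like the following could be used. This is only a sketch: the function
name list_large_files is made up for illustration, and the -size +Nk
form relies on the "k" suffix as implemented by GNU and BSD find.

```shell
# List regular files larger than a size threshold, skipping any .git entry.
# list_large_files is a hypothetical helper name, not an existing ITK script.
list_large_files() {
    dir=$1
    limit_kb=${2:-512}   # threshold in KiB, defaults to 512
    # -size +Nk matches files whose size, rounded up to KiB, exceeds N
    find "$dir" -name .git -prune -o -type f -size +"${limit_kb}k" -print
}

# Example: list_large_files Testing/Data 512
```

Running it on a Testing/Data checkout would print one path per line,
which makes it easy to review the candidates before removing anything.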
iii. ITK needs to be able to store large data files
Kitware's solution seems fine for this task, even if it seems to have
several potential problems at this time:
* Added complexity to manage the testing data (but this can be
improved, see below)
* No ability to commit offline, as promised by the switch to git
* Will cause problems when running the tests offline
* Would encourage the developers to submit bigger testing data than
needed, which may, in the long term, lead to significant network
traffic and storage usage, and probably to a longer testing time.
iv. How to store the testing data
iv.a. Using two solutions
My preference is still to commit the testing data with the tests in
the main ITK repository. Having code, tests and testing data stored in
the same place and in the same commit, as a transactional set, seems
logical. What is the sense of a test without its data?
A hook is already in place to limit the size of the files in this
repository. While using two methods for this task may not seem
optimal, this would
* Keep the workload quite low for the developers.
* Encourage the developers to use small baselines.
* Make it easier to review the new or fixed tests and their data in
gerrit, by allowing the submission to include their testing data.
* Make it easy to select only a subset of the tests to run, depending
on whether an internet connection is available.
The large data would be stored online using Kitware's solution.
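To make the idea of a size limit concrete, here is a rough sketch of
what such a check could look like on the developer side. It is not the
hook actually installed for ITK, whose implementation is not described
here; the function name check_file_sizes and the byte limit passed to
it are assumptions for illustration only.

```shell
# Report every given file whose size exceeds max_bytes; return 1 if any does.
# check_file_sizes is a hypothetical name; the real ITK hook may work differently.
check_file_sizes() {
    max_bytes=$1
    shift
    status=0
    for f in "$@"; do
        [ -f "$f" ] || continue              # ignore deleted or non-regular paths
        size=$(($(wc -c < "$f")))            # size in bytes, whitespace stripped
        if [ "$size" -gt "$max_bytes" ]; then
            echo "$f: $size bytes exceeds the limit of $max_bytes bytes" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible use from a pre-commit hook (assumed wiring, file names without spaces):
#   check_file_sizes 524288 $(git diff --cached --name-only --diff-filter=AM)
```

A check of this kind only looks at the working-tree copies of the
staged files; files above the limit would then go to the online store
instead.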
iv.b. Using Kitware's solution only
A single solution means less to learn for the developer. Not all
developers will have to upload large testing data though.
The good point: after git submodules, it is very likely that this
solution would be more convenient than the current one!
See also iii. for more details.
v. Enhancing the developer experience
If it is decided to use Kitware's solution, I would like to see
these goals reached:
* Don't require any new user registration on a new website
Gerrit already requires registration to be able to submit a change.
This account should be enough.
* Keep all data management in git subcommands and aliases.
We have already added several aliases to make gerrit usage easier, and
it works very well.
The same should be done for the data management to keep all the
development management in git. This is closely related to git anyway,
because the .md5 files will have to be committed in git.
* Use very few command lines - ideally, not many more than what a
developer would have to use with a git only solution.
For example, it can be:
git adddata Testing/Data/...
git add Testing/Code/...
git commit
git push # or gerrit-push
The first command, git adddata, would
- compute the md5 hashes of the files,
- git add the .md5 files produced by the previous step,
- and upload the files to a remote host.
Uploading should be possible even for ordinary contributors, like it
is now for gerrit, not only for the ITK developers with write access
to the main repository.
On the user side, the extra steps which may be required for the data
management (for example, moving the data from a temporary location to
the final one) should be transparent and not require any user action.
* Retaining the ability to test ITK and commit offline would be nice.
This would require
- a tool to get all the needed testing data at once without having
to build anything
- the ability to put the testing data in a cache if it cannot be
uploaded immediately, and trigger the upload once connected.
* Encourage the developers to reuse the existing testing data when
possible instead of uploading a new large data set. Not sure how to do
that; any ideas are welcome.
Then the points listed in iii. would be mostly gone.
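As an illustration of the proposed git adddata step, here is a minimal
sketch of its local half: hashing the files, writing the .md5 files,
staging them, and parking the data in a cache until it can be
uploaded. Everything named here (adddata, md5_of, ITK_DATA_CACHE, the
pending/ layout) is hypothetical and not an existing tool. The upload
itself is deliberately left out, since how the remote host would be
accessed has not been defined.

```shell
# Sketch only: adddata, md5_of and ITK_DATA_CACHE are hypothetical names.
ITK_DATA_CACHE=${ITK_DATA_CACHE:-$HOME/.itk-data-cache}

# Print the md5 hash of one file.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d' ' -f1      # GNU coreutils
    else
        md5 -q "$1"                      # BSD / macOS
    fi
}

adddata() {
    mkdir -p "$ITK_DATA_CACHE/pending" || return 1
    for f in "$@"; do
        [ -f "$f" ] || { echo "adddata: no such file: $f" >&2; return 1; }
        hash=$(md5_of "$f") || return 1
        # 1. record the hash next to the data file
        printf '%s\n' "$hash" > "$f.md5"
        # 2. keep a copy named by its hash until it can be uploaded
        cp "$f" "$ITK_DATA_CACHE/pending/$hash" || return 1
        # 3. stage the small .md5 file when the file lives in a git work tree
        dir=$(dirname "$f")
        if git -C "$dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
            git -C "$dir" add "$(basename "$f").md5"
        fi
    done
}

# Could be exposed as "git adddata" by installing a script named git-adddata
# on the PATH that calls adddata "$@".
```

Keeping the pending copies under their hash means an upload command
could later send whatever is in pending/ and delete each file once the
transfer succeeds, which would also cover the offline case.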
Regards,
Gaëtan
PS: I've noted during my trip to the namic week and the itk v4 meeting
that I'm still far from grasping the subtleties of the english
language (I still don't understand how "simple" may upset anyone in
the name SimpleITK, for example). If you feel offended by anything in
this mail, please don't be, there is no such intention on my side.
--
Gaëtan Lehmann
Biologie du Développement et de la Reproduction
INRA de Jouy-en-Josas (France)
tel: +33 1 34 65 29 66 fax: 01 34 65 29 09
http://voxel.jouy.inra.fr http://www.itk.org
http://www.mandriva.org http://www.bepo.fr