[Insight-developers] Testing Data

Daniel Blezek Blezek.Daniel at mayo.edu
Mon Feb 7 14:07:10 EST 2011


Hi Bill,

  While I cast my vote to put test data into the repo directly (which we
have not done with SimpleITK yet), I appreciate that ITK could use more
varied and large datasets.  Test writers tend to focus on a subset of images
that already exist.  Many of these are PNG images, which do not contain
unusual spacing or orientations.

  I have a lot of data that I could contribute to ITK, covering many
modalities, spacings and orientations, but there is no good place to put
300 DICOM CT images (150 MB) or a 256^3 rotational angiography image.

  Striking a balance between developer efficiency and large-scale testing
data will be difficult.

-dan

On 2/7/11 11:50 AM, "Bill Lorensen" <bill.lorensen at gmail.com> wrote:

> Gaëtan,
> 
> Thanks for such a detailed analysis of the problems we face with Testing/Data.
> 
> My preference is well known. I prefer that we make the Testing/Data
> part of the main repo. In addition to the complexity that you
> mentioned, your argument for including input data and baselines in the
> gerrit patches is especially compelling.
> 
> I hope other developers will speak up so that we can resolve this
> issue. We have been discussing it since last summer.
> 
> Bill
> 
> 2011/2/7 Gaëtan Lehmann <gaetan.lehmann at jouy.inra.fr>:
>> 
>> Hi,
>> 
>> As asked by Terry, here are my thoughts on testing data management.
>> This issue has been discussed several times here, and some parts may not
>> seem new: they have been copied and pasted from previous mails.
>> 
>> i. git submodule is bad for this task
>> 
>> The ITK development process has become more efficient in ITK v4, especially
>> with the use of git and gerrit, but also significantly more complicated.
>> I'm afraid this complexity may prevent some new developers from joining the
>> development effort.
>> The Testing/Data submodule is the worst example to date.
>> 
>> i.a. The workload for the developer is significantly higher than what
>> it was, or what it could be.
>> Here are a few examples to highlight the differences with other technical
>> solutions:
>> 
>> * with cvs (ITK up to version 3.20), adding a new test with some test
>> images was:
>> 
>>   cvs add Testing/Code/...
>>   cvs add Testing/Data/...
>>   cvs ci
>> 
>> * with git alone, it would be:
>> 
>>   git add Testing/Data/...
>>   git add Testing/Code/...
>>   git commit
>>   git push
>> 
>> * with git submodule, it is:
>> 
>>   cd Testing/Data
>>   git add ...
>>   git commit
>>   git push
>>   cd -
>>   git add Testing/Data
>>   git add Testing/Code/...
>>   git commit
>>   # copy/paste the next line from the error message of the previous commit
>>   git config "hooks.Testing/Data.update" 085e657..9dc1292
>>   git commit
>>   git push
>> 
>> That is 11 commands against 3 in the cvs case, about 3.7 times as many.
>> 
>> 
>> i.b. git submodule is not contribution friendly
>> 
>> Because write access is required to push to ITKData, contributors
>> who don't have this write access will find it very difficult to submit new
>> tests to gerrit. They can still publish their test images
>> elsewhere (where?) but then
>> * the review in gerrit becomes harder, because the reviewer has to get the
>> testing data by hand,
>> * the workload for the committer to the ITK main repository is increased: the
>> committer has to commit the images to the ITKData repository and modify the
>> submitted patch to point to the right version of the ITKData submodule.
>> Also, if a patch is rejected in gerrit but the data have already been
>> committed to ITKData, the useless data will stay in the ITKData
>> repository forever.
>> 
>> 
>> i.c. git submodule is error prone
>> 
>> As shown several times already in real life examples, it is very easy to get
>> the wrong ITKData version when merging several patches which have modified
>> the required ITKData submodule version.
>> This should be fixed now, by using an extra git hook. This hook still adds
>> some maintenance complexity though.
>> 
>> 
>> i.d. git submodule makes the history harder to read
>> 
>> Because the history of the main repository and the submodule are not tightly
>> coupled, it is hard to know why a test image was added or which image was
>> added or modified to fix or add a new test.
>> 
>> 
>> So, to summarize, I understand that git submodule may have been tempting to
>> manage ITK's testing data, but real life usage has shown that git submodule
>> is not well suited for this task. I'll personally be glad when we move
>> away from this solution.
>> 
>> 
>> 
>> ii. Testing/Data is not that big
>> 
>>  ITKData repository takes 74 MB.
>>  ITK repository takes 154 MB.
>>  ITK build directory takes 1.3 GB, or 8.4 GB if we don't take care to remove
>> the temporary data after running the tests.
>>  ITK build with wrapping takes 5.3 GB.
>> 
>> and it could be smaller. The files, without the .git directory, use 36 MB,
>> and could be reduced to 22 MB by removing the few files bigger than 512 KB.
>> This is the result of 10 years of development. Continuing at that pace
>> seems quite reasonable.
>> 
>> 
>> 
>> iii. ITK needs to be able to store large data files
>> 
>> Kitware's solution seems fine for this task, even if it seems to have
>> several potential problems at this time:
>> 
>> * Added complexity to manage the testing data (but this can be improved, see
>> below)
>> * No ability to commit offline, which the switch to git had promised
>> * Will cause problems when running the tests offline
>> * Would encourage the developers to submit bigger testing data than needed,
>> which may, in the long term, lead to significant network traffic and
>> storage usage, and probably to a longer testing time.
>> 
>> 
>> 
>> iv. How to store the testing data
>> 
>> iv.a. Using two solutions
>> 
>> My preference is still to commit the testing data with the tests in the
>> main ITK repository. Having code, tests and testing data stored in the same
>> place and in the same commit, as a transactional set, seems logical. What is
>> the sense of a test without its data?
>> A hook is already in place to limit the size of the files in this
>> repository. While using two methods for this task may not seem optimal, this
>> would
>> 
>> * Keep the workload quite low for the developers.
>> * Encourage the developers to use small baselines.
>> * Make the review of new or fixed tests and their data easier in gerrit,
>> by allowing the submission to include their testing data.
>> * Make it easy to run only a subset of the tests, depending on whether an
>> internet connection is available.
>> 
>> The large data would be stored online using Kitware's solution.
>> 
>> 
>> iv.b. Using Kitware's solution only
>> 
>> A single solution means less to learn for the developer. Not all developers
>> will have to upload large testing data, though.
>> The good point: after git submodules, it is very likely that this solution
>> would be more convenient than the current one!
>> See also iii. for more details.
>> 
>> 
>> 
>> v. Enhancing the developer experience
>> 
>> If Kitware's solution is chosen, I would like to see these goals
>> reached:
>> 
>> * Don't require any new user registration on a new website
>> 
>> Gerrit already requires registration to submit a change. This
>> account should be enough.
>> 
>> * Keep all data management in git subcommands and aliases.
>> 
>> We have already added several aliases to make gerrit usage easier, and it
>> works very well.
>> The same should be done for data management, to keep all the development
>> management in git. This is closely related to git anyway, because the .md5
>> files will have to be committed in git.
>> 
>> * Use very few command lines - ideally, not many more than what a developer
>> would have to use with a git-only solution.
>> 
>> For example, it can be:
>> 
>>   git adddata Testing/Data/...
>>   git add Testing/Code/...
>>   git commit
>>   git push    # or gerrit-push
>> 
>> The first command, git adddata, would
>>  - compute the md5 hashes of the files,
>>  - git add the .md5 files produced by the previous step,
>>  - and upload the files to a remote host.
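
The hashing step of the proposed git adddata could look roughly like the Python sketch below. This is only an illustration under stated assumptions: the function name write_md5_sidecar is hypothetical, the ".md5 file holds just the hex digest" layout is assumed rather than taken from this thread, and the upload step is left out because the thread does not describe the remote host.

```python
import hashlib
from pathlib import Path


def write_md5_sidecar(data_file: Path) -> Path:
    """Hash data_file and write the hex digest to '<name>.md5' next to it.

    Hypothetical helper: the one-line sidecar format is an assumption.
    """
    digest = hashlib.md5()
    with data_file.open("rb") as stream:
        # Read in 1 MiB chunks so large images need not fit in memory.
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    sidecar = data_file.with_name(data_file.name + ".md5")
    sidecar.write_text(digest.hexdigest() + "\n")
    return sidecar
```

The returned path is what a wrapper would then pass to git add, so that only the small .md5 file enters the repository while the data itself is stored elsewhere.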
>> 
>> Uploading should be possible even for ordinary contributors, as it is
>> now for gerrit, not only for the ITK developers with write access to the
>> main repository.
>> On the user side, the extra steps which may be required for data
>> management (for example, moving the data from a temporary location to the
>> final one) should be transparent and not require any user action.
>> 
>> * Retaining the ability to test ITK and commit offline would be nice. This
>> would require
>>  - a tool to get all the needed testing data at once without having to build
>> anything
>>  - the ability to put the testing data in a cache if it cannot be uploaded
>> immediately, and trigger the upload once connected.
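
One way to picture the offline requirement is a check that lists which referenced data files are not yet available locally, without building anything. The Python sketch below assumes a hypothetical layout in which each .md5 file contains one hex digest and cached data files are named after that digest; neither the function name nor the cache layout comes from this thread.

```python
from pathlib import Path
from typing import List


def missing_from_cache(source_tree: Path, cache_dir: Path) -> List[Path]:
    """Return the .md5 files whose referenced data is not in cache_dir.

    Hypothetical layout: cached files are named after their MD5 digest.
    """
    missing = []
    for sidecar in sorted(source_tree.rglob("*.md5")):
        digest = sidecar.read_text().strip()
        # A data file counts as available only if a regular file named
        # after its digest exists in the cache directory.
        if not (cache_dir / digest).is_file():
            missing.append(sidecar)
    return missing
```

A developer could run such a check before going offline and fetch whatever is reported; an empty list would mean every test in the tree has its data at hand.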
>> 
>> * Encourage the developers to reuse the existing testing data when possible
>> instead of uploading a new large data set. Not sure how to do that; any
>> ideas are welcome.
>> 
>> Then the points listed in iii. would be mostly gone.
>> 
>> 
>> Regards,
>> 
>> Gaëtan
>> 
>> 
>> 
>> PS: I've noted during my trip to the namic week and the itk v4 meeting that
>> I'm still far from getting the subtleties of the English language (I still
>> don't understand how "simple" may upset anyone in the name SimpleITK, for
>> example). If you feel offended by anything in this mail, please don't be;
>> there is no such intention on my side.
>> 
>> 
>> --
>> Gaëtan Lehmann
>> Biologie du Développement et de la Reproduction
>> INRA de Jouy-en-Josas (France)
>> tel: +33 1 34 65 29 66    fax: 01 34 65 29 09
>> http://voxel.jouy.inra.fr  http://www.itk.org
>> http://www.mandriva.org  http://www.bepo.fr
>> 
>> 
>> _______________________________________________
>> Powered by www.kitware.com
>> 
>> Visit other Kitware open-source projects at
>> http://www.kitware.com/opensource/opensource.html
>> 
>> Kitware offers ITK Training Courses, for more information visit:
>> http://kitware.com/products/protraining.html
>> 
>> Please keep messages on-topic and check the ITK FAQ at:
>> http://www.itk.org/Wiki/ITK_FAQ
>> 
>> Follow this link to subscribe/unsubscribe:
>> http://www.itk.org/mailman/listinfo/insight-developers
>> 
>> 

-- 
Daniel Blezek, PhD
Medical Imaging Informatics Innovation Center

P 127 or (77) 8 8886
T 507 538 8886
E blezek.daniel at mayo.edu

Mayo Clinic
200 First St. S.W.
Harwick SL-44
Rochester, MN 55905
mayoclinic.org
"It is more complicated than you think." -- RFC 1925
