[Insight-developers] Testing Data
Daniel Blezek
Blezek.Daniel at mayo.edu
Mon Feb 7 14:07:10 EST 2011
Hi Bill,
While I cast my vote to put test data into the repo directly (which we
have not done with SimpleITK yet), I appreciate that ITK could use more
varied and large datasets. Test writers tend to focus on a subset of images
that already exist. Many of these are PNG images, which do not contain
unusual spacing or orientations.
I have a lot of data which I could contribute to ITK, from many
modalities, spacing and orientations, but there is not a good place to put
300 DICOM CT images (150M) or a 256^3 rotational angiography image.
Striking a balance between developer efficiency and large-scale testing
data will be difficult.
-dan
On 2/7/11 11:50 AM, "Bill Lorensen" <bill.lorensen at gmail.com> wrote:
> Gaëtan,
>
> Thanks for such a detailed analysis of the problems we face with Testing/Data.
>
> My preference is well known. I prefer that we make the Testing/Data
> part of the main repo. In addition to the complexity that you
> mentioned, your argument for including input data and baselines in the
> gerrit patches is especially compelling.
>
> I hope other developers will speak up so that we can resolve this
> issue. We have been discussing it since last summer.
>
> Bill
>
> 2011/2/7 Gaëtan Lehmann <gaetan.lehmann at jouy.inra.fr>:
>>
>> Hi,
>>
>> As asked by Terry, here are my thoughts on the testing data management.
>> This issue has been discussed several times here, and some parts may not
>> seem new: this is because they have been copied and pasted from previous
>> mails.
>>
>> i. git submodule is bad for this task
>>
>> The ITK development process has become more efficient in ITK v4, especially
>> with the usage of git and gerrit, but also significantly more complicated.
>> I'm afraid this complexity may prevent some new developers from joining the
>> development effort.
>> The Testing/Data submodule is the worst example to date.
>>
>> i.a. The workload for the developer is very significantly higher than what
>> it was, or what it could be.
>> Here are a few examples to highlight the differences with other technical
>> solutions:
>>
>> * with cvs (ITK up to version 3.20), adding a new test with some test
>> images was:
>>
>> cvs add Testing/Code/...
>> cvs add Testing/Data/...
>> cvs ci
>>
>> * with git alone, it would be:
>>
>> git add Testing/Data/...
>> git add Testing/Code/...
>> git commit
>> git push
>>
>> * with git submodule, it is:
>>
>> cd Testing/Data
>> git add ...
>> git commit
>> git push
>> cd -
>> git add Testing/Data
>> git add Testing/Code/...
>> git commit
>> git config "hooks.Testing/Data.update" 085e657..9dc1292 # copy/paste from the error message of the previous commit
>> git commit
>> git push
>>
>> That is 11 commands instead of 3: about 3.7 times as many as in the cvs
>> case (an increase of roughly 267%).
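
The comparison can be checked by counting the commands in the three listings above (3 for cvs, 4 for git alone, 11 for git submodule). A small sketch of that arithmetic; the dictionary and the helper name `relative_increase` are only illustrative:

```python
# Command counts taken from the three listings above.
commands = {"cvs": 3, "git": 4, "git submodule": 11}


def relative_increase(new, base):
    """Return the increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


baseline = commands["cvs"]
for name, count in commands.items():
    ratio = count / baseline
    increase = relative_increase(count, baseline)
    print(f"{name}: {count} commands, {ratio:.2f}x cvs, {increase:+.0f}% vs cvs")
```

The ratio (count divided by the cvs count) and the relative increase differ by exactly 100 percentage points, which is easy to mix up when quoting a single figure.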
>>
>>
>> i.b. git submodule is not contribution friendly
>>
>> Because the write access is required to push to ITKData, the contributors
>> who don't have this write access will find it very difficult to submit new
>> tests to gerrit. The contributors can still publish their test images
>> elsewhere (where?) but then
>> * the review in gerrit becomes harder, because the reviewer has to get the
>> testing data by hand,
>> * the workload for the committer to ITK main repository is increased: he has
>> to commit the images by himself in the ITKData repository and modify the
>> submitted patch to point to the right version of the ITKData submodule.
>> Also, if a patch is rejected in gerrit but the data have been already
>> committed in ITKData, the useless data will stay forever in ITKData
>> repository.
>>
>>
>> i.c. git submodule is error prone
>>
>> As shown several times already in real life examples, it is very easy to get
>> the wrong ITKData version when merging several patches which have modified
>> the required ITKData submodule version.
>> This should be fixed now, by using an extra git hook. This hook still adds
>> some maintenance complexity though.
>>
>>
>> i.d. git submodule makes the history harder to read
>>
>> Because the history of the main repository and the submodule are not tightly
>> coupled, it is hard to know why a test image was added or which image was
>> added or modified to fix or add a new test.
>>
>>
>> So, to summarize, I understand that git submodule may have been tempting to
>> manage ITK's testing data, but real life usage has shown that git submodule
>> is not well suited for this task. I'll personally be glad when we move
>> away from this solution.
>>
>>
>>
>> ii. Testing/Data is not that big
>>
>> ITKData repository takes 74 MB.
>> ITK repository takes 154 MB.
>> ITK build directory takes 1.3 GB, or 8.4 GB if we don't take care to remove
>> the temporary data after running the tests.
>> ITK build with wrapping takes 5.3 GB.
>>
>> and it could be smaller. The files, without the .git directory, use 36 MB,
>> and could be reduced to 22 MB by removing the few files bigger than 512 KB.
>> This is the result of 10 years of development. Continuing at that pace
>> seems quite reasonable.
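
Figures like the ones above (total size without the .git directory, and the size left after dropping files bigger than 512 KB) could be measured with something like the following. This is only a sketch: the function name `survey` and its default threshold are illustrative, and it assumes `.git` is a directory to be skipped.

```python
import os


def survey(root, threshold=512 * 1024):
    """Return (total_bytes, bytes_in_files_at_or_below_threshold) under root.

    Any directory named .git is skipped, and symbolic links are not counted.
    """
    total = 0
    small = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            size = os.path.getsize(path)
            total += size
            if size <= threshold:
                small += size
    return total, small
```

Calling `survey("Testing/Data")` from the top of a checkout would return both numbers in bytes; the difference between them is what the files above the threshold cost.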
>>
>>
>>
>> iii. ITK needs to be able to store large data files
>>
>> Kitware's solution seems fine for this task, even if it seems to have
>> several potential problems at this time:
>>
>> * Added complexity to manage the testing data (but this can be improved,
>> see below)
>> * No ability to commit offline as promised by the switch to git
>> * Will give problems to run the tests offline
>> * Would incite the developers to submit bigger testing data than needed,
>> which may, in the long term, lead to significant network traffic and
>> storage usage, and probably to a longer testing time.
>>
>>
>>
>> iv. How to store the testing data
>>
>> iv.a. Using two solutions
>>
>> My preference is still to commit the testing data with the tests in the
>> main ITK repository. Having code, tests and testing data stored in the same
>> place and in the same commit, as a transactional set, seems logical. What is
>> the sense of a test without its data?
>> A hook is already in place to limit the size of the files in this
>> repository. While using two methods for this task may not seem optimal, this
>> would
>>
>> * Keep the workload quite low for the developers.
>> * Incite the developers to use small baselines.
>> * Make the review of new or fixed tests and their data easier in gerrit,
>> by allowing the submission to include their testing data.
>> * Make it easy to select and run only a subset of the tests, depending on
>> whether an internet connection is available.
>>
>> The large data would be stored online using Kitware's solution.
>>
>>
>> iv.b. Using Kitware's solution only
>>
>> A single solution means less to learn for the developer. Not all developers
>> will need to upload large testing data, though.
>> The good point: after git submodules, it is very likely that this solution
>> would be more convenient than the current one!
>> See also iii. for more details.
>>
>>
>>
>> v. Enhancing the developer experience
>>
>> If it is decided to use Kitware's solution, I would like to see these
>> goals reached:
>>
>> * Don't require any new user registration on a new website
>>
>> Gerrit already requires registration to be able to submit a change. This
>> account should be enough.
>>
>> * Keep all data management in git subcommands and aliases.
>>
>> We already have added several aliases to make gerrit usage easier, and it
>> works very well.
>> The same should be done for the data management to keep all the development
>> management in git. This is closely related to git anyway, because the .md5
>> files will have to be committed in git.
>>
>> * Use very few command lines: ideally, not many more than what a developer
>> would have to use with a git only solution.
>>
>> For example, it can be:
>>
>> git adddata Testing/Data/...
>> git add Testing/Code/...
>> git commit
>> git push # or gerrit-push
>>
>> The first command, git adddata, would
>> - compute the md5 hashes of the files,
>> - git add the .md5 files produced by the previous step,
>> - and upload the files to a remote host.
>>
>> Uploading should be possible even for ordinary contributors, like it is
>> now for gerrit, not only for the ITK developers with write access to the
>> main repository.
>> On the user side, the extra steps which may be required for the data
>> management (for example, moving the data from a temporary location to the
>> final one) should be transparent and not require any user action.
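
To make the three steps of the proposed git adddata concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the command does not exist, the function names are made up, and the `upload` step simply copies the file into a local directory where a real tool would contact the remote host.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_stub(path):
    """Write <path>.md5 containing the digest; return (stub_path, digest)."""
    path = Path(path)
    digest = md5_of(path)
    stub = path.with_name(path.name + ".md5")
    stub.write_text(digest + "\n")
    return stub, digest


def upload(path, digest, store):
    """Placeholder upload: copy the file into `store`, named by its digest.

    A real implementation would send the file to the remote host instead.
    """
    store = Path(store)
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(path, store / digest)


def adddata(paths, store):
    """Hash each file, stage its .md5 stub with git, then 'upload' it."""
    for path in paths:
        stub, digest = write_md5_stub(path)
        subprocess.run(["git", "add", "--", str(stub)], check=True)
        upload(path, digest, store)
```

Since git dispatches unknown subcommands to an executable named `git-<name>` found on the PATH, a script like this with a small command-line wrapper, installed as `git-adddata`, would be callable as `git adddata`.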
>>
>> * Retaining the ability to test ITK and commit offline would be nice. This
>> would require
>> - a tool to get all the needed testing data at once without having to build
>> anything
>> - the ability to put the testing data in a cache if it cannot be uploaded
>> immediately, and trigger the upload once connected.
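
The two offline requirements above could be approached along these lines. This is a sketch under assumptions: data files are represented in the source tree by `.md5` stubs, the local cache stores files named by their digest, and the names `needed_digests`, `missing_from_cache` and `flush_pending` are hypothetical.

```python
from pathlib import Path


def needed_digests(source_root):
    """Map each digest found in a .md5 stub to the data file it stands for."""
    needed = {}
    for stub in Path(source_root).rglob("*.md5"):
        digest = stub.read_text().strip()
        needed[digest] = stub.with_suffix("")  # drop the trailing .md5
    return needed


def missing_from_cache(source_root, cache_dir):
    """Return the digests referenced in the tree but absent from the cache."""
    cache = Path(cache_dir)
    return sorted(
        digest
        for digest in needed_digests(source_root)
        if not (cache / digest).is_file()
    )


def flush_pending(pending_dir, upload):
    """Try to upload every queued file; keep the ones that fail.

    `upload` is any callable taking a path and raising OSError on failure.
    """
    for item in sorted(Path(pending_dir).iterdir()):
        try:
            upload(item)
        except OSError:
            continue  # still offline: leave the file queued
        item.unlink()
```

Listing what is missing needs only the source tree, not a build, so everything could be fetched in one pass while connected. Queued files are removed only after a successful upload, so an interrupted run can simply be repeated.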
>>
>> * Incite the developers to reuse the existing testing data when possible
>> instead of uploading a new large data set. Not sure how to do that; any
>> idea is welcome.
>>
>> Then the points listed in iii. would be mostly gone.
>>
>>
>> Regards,
>>
>> Gaëtan
>>
>>
>>
>> PS: I've noted during my trip to the namic week and the itk v4 meeting that
>> I'm still far from getting the subtleties of the English language. I still
>> don't understand how "simple" may upset anyone in the name SimpleITK, for
>> example. If you feel offended by anything in that mail, please don't be,
>> there is no such intention on my side.
>>
>>
>> --
>> Gaëtan Lehmann
>> Biologie du Développement et de la Reproduction
>> INRA de Jouy-en-Josas (France)
>> tel: +33 1 34 65 29 66 fax: 01 34 65 29 09
>> http://voxel.jouy.inra.fr http://www.itk.org
>> http://www.mandriva.org http://www.bepo.fr
>>
>>
>> _______________________________________________
>> Powered by www.kitware.com
>>
>> Visit other Kitware open-source projects at
>> http://www.kitware.com/opensource/opensource.html
>>
>> Kitware offers ITK Training Courses, for more information visit:
>> http://kitware.com/products/protraining.html
>>
>> Please keep messages on-topic and check the ITK FAQ at:
>> http://www.itk.org/Wiki/ITK_FAQ
>>
>> Follow this link to subscribe/unsubscribe:
>> http://www.itk.org/mailman/listinfo/insight-developers
>>
>>
--
Daniel Blezek, PhD
Medical Imaging Informatics Innovation Center
P 127 or (77) 8 8886
T 507 538 8886
E blezek.daniel at mayo.edu
Mayo Clinic
200 First St. S.W.
Harwick SL-44
Rochester, MN 55905
mayoclinic.org
"It is more complicated than you think." -- RFC 1925