[Insight-developers] Data submodule was reverted again ( staging check needed? )

Brad King brad.king at kitware.com
Wed Jan 26 08:29:36 EST 2011


On 01/26/2011 04:22 AM, Gaëtan Lehmann wrote:
>    ITKData repository takes 74 MB.
>    ITK repository takes 154 MB.

After using

 $ git repack -a -f -d --window=250 --depth=250

to pack both repositories tightly I get

 $ du -sk ITK.git/objects ITKData.git/objects
 59354   ITK.git/objects
 38005   ITKData.git/objects

The main problem is that data files like png images do not compress well
in Git's pack format: they are already compressed, so a new version can
rarely be represented as a small delta against another png file.  Every
version of every image must be carried in full in history, even after
the file has been removed.  Contrast this with source files, which are
usually updated by small patches and compress very well.  The
data-to-source ratio will only grow over time.
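
To see which files dominate a repository's history, something like the
following can help.  It is only a sketch: the function name largest_blobs
is made up for illustration, it assumes a Git new enough to support
format strings in "cat-file --batch-check", and the sizes it prints are
uncompressed object sizes rather than on-disk pack sizes.

```shell
#!/bin/sh
# Sketch: list the N largest blobs reachable from any ref, including
# files that were later deleted from the tree.
# Usage (inside a repository): largest_blobs [N]
# "largest_blobs" is a hypothetical helper name, not a Git command.
largest_blobs() {
    count=${1:-10}
    # rev-list prints "<sha> <path>" for every reachable object;
    # %(rest) carries the path through cat-file unchanged.
    git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -rn |
    head -n "$count"
}
```

Each output line is "<size in bytes> <path>", largest first.  A file
that was removed from the tree long ago still shows up, because its blob
is still reachable from older commits.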

Note also that the sizes above are not representative of pure-source
vs. pure-data, because ITK had some data files in its history that were
not in Testing/Data and thus ended up in the history of the main source.

>    ITK build directory takes 1.3 GB – 8.4 GB if we don't take care to  
> remove the temporary data after running the tests.
>    ITK build with wrapping takes 5.3 GB.

The build directory sizes don't count.  We're talking about source sizes.

> The large data is actually a problem. Midas can be a solution for  
> that, and we can put a file limit for the main repository.

There is already a limit on blob size in the ITK repo and a bigger limit
in the ITKData repo.

> So I still think, at this time, that the extra complexity of the
> submodule management is not compensated by the size gain in the main
> repository.

Yes, but once the files appear in the main history we can never go back.
The submodule approach keeps us treading water while something better
is developed.

Bill Lorensen wrote:
> But the current setup with data as a submodule adds complexity to
> checkins and is subject to unexpected abuse as recently reported by
> Brad L.

I can address this with a commit check that ensures no ITK commit's
Testing/Data submodule references an older version than one of its
parents.
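
A rough sketch of what such a check could look like follows.  This is
not the actual hook: the function names, the way the submodule's
repository is located (a git dir passed as the third argument), and the
choice to let diverged histories pass are all assumptions made for
illustration.  It also relies on "git merge-base --is-ancestor", which
needs a reasonably recent Git.

```shell
#!/bin/sh
# Sketch only; names and arguments are hypothetical.

# Print the commit id recorded for submodule path $2 in commit $1
# (prints nothing if the path is not a gitlink in that commit).
gitlink() {
    git ls-tree "$1" -- "$2" | awk '$2 == "commit" { print $3 }'
}

# check_submodule_not_rewound <commit> <submodule-path> <submodule-git-dir>
# Fails if <commit> records a submodule version that is strictly older
# than the version recorded by one of its parents.
check_submodule_not_rewound() {
    commit=$1
    path=$2
    data_git_dir=$3
    new=$(gitlink "$commit" "$path")
    [ -n "$new" ] || return 0        # no submodule here: nothing to check
    # "<commit>^@" expands to all parents (nothing for a root commit).
    for parent in $(git rev-parse "$commit^@"); do
        old=$(gitlink "$parent" "$path")
        [ -n "$old" ] || continue
        [ "$old" != "$new" ] || continue
        # Rewind: the new version is an ancestor of the parent's version.
        # Diverged versions are not caught by this sketch.
        if git --git-dir="$data_git_dir" merge-base --is-ancestor "$new" "$old"
        then
            echo "$commit moves $path back from $old to $new" >&2
            return 1
        fi
    done
    return 0
}
```

A server-side hook could run this for every new commit in a push, e.g.
"check_submodule_not_rewound $sha Testing/Data /path/to/ITKData.git"
(the path here is a placeholder), and reject the push on failure.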

> The midas solution adds even more complexity, especially for baselines.
>
> I'm looking forward to a solution that keeps the footprint low but
> keeps the workload on a developer at or near what it was in ITK 3.

That's a design goal.

-Brad K

