[ITK-dev] [ITK] About the last optimization on VariableLengthVector
Matt McCormick
matt.mccormick at kitware.com
Fri Nov 13 18:04:48 EST 2015
Hello Luc,
Thanks for all your awesome work on increasing the performance of
itk::VariableLengthVector's and ITK. For those that have not kept up
on Gerrit activity, Luc has increased performance of resampling
VariableLengthVector's, added expression template support, and
populated a new itk::mpl:: namespace with meta-programming classes!
> The *purpose* of this patch (and the previous ones as well) is to
> eliminate allocations made of each pixel.
>
> This will permit (on a Linear Interpolation test case to reduce the
> computation time, on a VectorImage, from 85s to 27s -- it was around 2
> minutes before the first patches, and on the same image read as an
> itk::Image it takes ~15-17s)
Excellent!
> First problem: PixelType Evalutate(...);
> Currently, when we apply an ImageFunction, we call Evaluate() function
> which requires the creation of a new VLV. Indeed this function return a
> new pixel value by value.
> The only way to get rid of the variable creation is to return the new
> value as an output reference parameter.
>
> This means all classes that inherit from ImageFunction will need to
> change the signature of their EvaluateXxxx() functions. And that all
> client classes must take this into account. In other words, many
> modifications will be required if we want to have better performances on
> VLV.
>
> Note however, it's quite simple to migrate gradually the source code to
> the new interface : we can have all client code explicitly use the new
> (and badly named) EvaluateNoCopy(), which in turn call the old
> Evaluate() -- or the other way around (which has a drawback regarding
> cached variables).
This sounds good. EvaluateNoCopy() seems like a reasonable name to me.
> Second Problem: Temporary pixel values
> Some computations will produce new pixel values. For instance
> p = p0 + (p1 - p0) * d;
> produces 3 new pixels values.
> This has already been taken care of in the previous patches thanks to
> Expression Templates.
Nice :-)
> Third Problem: Local Variables
> 3.1- Auxiliary variables
> Some algorithms (e.g. BSplines interpolations) require dynamic data
> structures like matrices. This implies allocation at each iteration.
> This particular problem isn't addressed in this patch. The solution to
> 3.2 will help to provide an solution.
>
> 3.2- Factorized computations
> Sometimes, a complex computation will take advantage of code
> factorisation (easier to maintain, and better performances). e.g.
>
> Pixel p4 = p0 + (p1 - p0) * d0
> Pixel p5 = p2 + (p3 - p2) * d1
> output = p4 + (p5 - p4) * d2
>
> This implies new pixel values, hence dynamic allocations.
>
> In c++11, we could (almost) have written
> auto p4 = p0 + (p1 - p0) * d0
>
> Except that p4 is an expression template. Its type is quite complex to
> express in plain C++98/03. But the problem will be that the current
> implementation of expression templates uses references. Here p4 will be
> made of dangling references, the code would be bugged.
> Workarounds would be possible in C++11 as it will permit us to
> distinguish lvalue references from rvalue references in order to copy
> the latter, and keep references to the former.
> We are not in C++11 yet. Moreover, this would not factorize the
> computations.
>
> We need a C++98 solution.
> The only way I see is to rely on cached variables.
> This means that the ImageFunctions are no longer [[pure]] functions :
> they'll mutate some variables that we want to allocate once per
> ImageFunction and ... per thread.
> Indeed the ImageFunctions shall be reentrant.
>
> I was first hoping to use thread_local variables. Alas they cannot be
> non-static attributes. This is a problem: if I'm not wrong, a same
> ImageFunction could be used multiple times in a same processing chain.
>
> We still have a simple solution, but it requires a second modification
> to Evaluate() signature: we need to pass the current ThreadId.
>
> The new drawback: code migration requires more work. We can not write
> Evaluate() in terms of EvaluateNoCopy() as the latter requires the
> threadId that the former doesn't know. This is the reason why I expect
> the tests to fail.
> Instead we need to convert all client code (like the WarpImageFilter)
> and have the default implementation of EvaluateNoCopy() to call the
> already written and not yet converted Evalutate().
This would be messy, and it would not work very well since the
ImageFunction's are not involved in orchestration of the threading.
Instead, it would be better to make the necessary auxiliary variables
private variables. Parallel code that uses ImageFunction's should
create one ImageFunction instance per thread.
> Fourth Problem: heterogeneous data types (i.e. castings)
> The input image type, the output image type, and the internal pixel type
> may all be different. That's why the code contain so many static_casts.
>
> Alas a static_cast<> on pixels will induce the creation of new pixel values.
> This is contrary to the purpose of this patch.
> Copy constructors and assignment operators already support on-the-fly
> implicit static_casting.
> We still need to be able to require casting only when it's required.
> That's where the new function CastInto(), MoveInto() and As() come into
> rescue.
So compiler's are not intelligent to make a static_cast<> a no-op when
the types are the same?
Thanks,
Matt
More information about the Insight-developers
mailing list