VTK/Threaded Image Algorithms: Difference between revisions

From KitwarePublic
< VTK
Jump to navigationJump to search
No edit summary
 
(20 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Review of Streaming Pipeline ==
== Review of the Streaming Pipeline ==


The job of a vtkAlgorithm is to perform some operation upon some data (specifically, upon some data encapsulated in a vtkDataObject).  One important aspect of VTK's streaming pipeline is that the vtkDataObject that the algorithm operates on might be one part of a larger data set.  For example, a vtkImageData object might only contain a few slices of a large stack of image slices.  
The job of a [http://www.vtk.org/doc/nightly/html/classvtkAlgorithm.html vtkAlgorithm] is to perform some operation upon some data (specifically, upon some data encapsulated in a [http://www.vtk.org/doc/nightly/html/classvtkDataObject.html vtkDataObject]).  One important aspect of VTK's streaming pipeline is that the vtkDataObject that the algorithm operates on might be one part of a larger data set.  For example, a [http://www.vtk.org/doc/nightly/html/classvtkImageData.html vtkImageData] object might only contain a few slices of a large stack of image slices.  


The portion of a data set that is contained within one vtkImageData object is described by the ''Extent'' array, which provides a (first, last) index for each of the three dimensions:
The portion of a data set that is contained within one vtkImageData object is described by the ''Extent'' array, which provides a (first, last) index for each of the three dimensions:
Line 7: Line 7:
  Extent = { first_idx_x, last_idx_x, first_idx_y, last_idx_y, first_idx_z, last_idx_z }
  Extent = { first_idx_x, last_idx_x, first_idx_y, last_idx_y, first_idx_z, last_idx_z }


The pipeline tells the algorithm two important pieces of information:
The pipeline provides the algorithm with two important pieces of information:
# How large the entire data set is (the '''WHOLE_EXTENT''').
# How large the entire data set is (the '''WHOLE_EXTENT''').
# What part of the data set the output vtkImageData object should contain after the algorithm runs (the '''UPDATE_EXTENT''').
# What part of the data set the output vtkImageData object should contain after the algorithm runs (the '''UPDATE_EXTENT''').


== Splitting the UPDATE_EXTENT among threads ==
== Splitting the UPDATE_EXTENT Among the Threads ==


The algorithm is responsible for producing a vtkImageData object whose size and position within the entire data set described by the '''UPDATE_EXTENT'''.  But wait, what if we want to divide the work among several threads?  Then the '''UPDATE_EXTENT''' must be broken into several chunks, each of which is also an ''extent''.  For example, let's say that the update extent is  { <span style="color:#800000">128</span>, <span style="color:#800000">255</span>, <span style="color:#008000">0</span>, <span style="color:#008000">255</span>, <span style="color:#0000D0">1</span>, <span style="color:#8000D0">100</span> }:
The algorithm is responsible for producing a vtkImageData object whose size and position is described by the '''UPDATE_EXTENT'''.  But what if we want to divide the work among several threads?  Then the '''UPDATE_EXTENT''' must be broken into several chunks, each of which is also an ''extent''.  For example, let's say that the update extent is  { <span style="color:#800000">128</span>, <span style="color:#800000">255</span>, <span style="color:#008000">0</span>, <span style="color:#008000">255</span>, <span style="color:#0000E0">1</span>, <span style="color:#0000E0">100</span> }:


   UPDATE_EXTENT: { 128, 255, 0, 255, 1, 100 }
   UPDATE_EXTENT:     { 128, 255, 0, 255, 1, 100 }
   Thread 1 extent:  { 128, 255, 0, 255, 1, 25 }
   Thread 0 extent:  { 128, 255, 0, 255, 1, 25 }
   Thread 2 extent:  { 128, 255, 0, 255, 26, 50 }
   Thread 1 extent:  { 128, 255, 0, 255, 26, 50 }
   Thread 3 extent:  { 128, 255, 0, 255, 51, 75 }
   Thread 2 extent:  { 128, 255, 0, 255, 51, 75 }
   Thread 4 extent:  { 128, 255, 0, 255, 76, 100 }
   Thread 3 extent:  { 128, 255, 0, 255, 76, 100 }


In this example, the data has been divided along the Z direction.
In this example, the data has been divided along the Z direction. There is, in fact, a specific method in vtkThreadedImageAlgorithm whose only responsibility is to split the data into pieces, one piece for each thread:


== Multithreading ==
int SplitExtent(int pieceExtent[6], int updateExtent[6], int piece, int total)
 
The ''total'' is the total number of pieces to divide the ''updateExtent'' into (this is always the number of threads to be used).  Each thread calls this SplitExtent method with ''piece'' set to a different number (in the above example, with numbers 0 through 3).  The SplitExtent method returns the the extent of data for that thread to operate on in the ''pieceExtent'' array.
 
The ''updateExtent'' is usually split up exactly as shown above, with the split occurring along the Z direction.  This is a very poor choice if there are eight CPU cores available to work on the data, but only two slices in the stack!  Only two of the cores get work to do, and the other six cores get nothing!  To avoid the absolute worst case where one thread does all the work, if there is only one slice in the stack, then SplitExtent will do the division along the Y direction instead of the Z direction.
 
== Multithreaded Execution ==
 
Now that we have seen how the work is divided into pieces, the next stop on our tour of the [http://www.vtk.org/doc/nightly/html/classvtkThreadedImageAlgorithm.html vtkThreadedImageAlgorithm] is the main execution method, which is a virtual method that you would define in your own subclasses:
 
virtual void vtkThreadedImageAlgorithm::ThreadedRequestData(
  vtkInformation *request,
  vtkInformationVector **inputVector,
  vtkInformationVector *outputVector,
  vtkImageData ***inputData,
  vtkImageData **outputData,
  int pieceExtent[6],
  int piece)
{
  // code goes here
}
 
This method is called from each thread that will operate on the data.  Let's ignore ''request'', which is merely used for pipeline bookkeeping, and the ''inputVector'',''outputVector'' which are very important but beyond the current topic of discussion.  What we have left are the ''inputData'', the ''outputData'', and the ''pieceExtent'' and ''piece'' that we discussed in the previous section.  In the VTK codebase, you'll usually see ''threadId'' instead of ''piece'', because each thread gets one piece of the data.  The job of this method is to fill in the voxel values that correspond to the ''pieceExtent''.  Sounds simple, right?


== Progress Reporting ==
== Progress Reporting ==


== Tricky Details ==
The algorithm can communicate to the rest of the application while it is executing.  In order to allow non-thread-safe modes of communication to be used, only one of the algorithm's threads is allowed to talk to the application.  Specifically, only the main thread (the one that gets piece number 0) can do this.  This thread is special, as it is the main thread that the other threads were split ''from''.
 
You will often see code like the following inside the VTK algorithms, where the algorithm tells the application what fraction of the work has been done.  Of course, this is only an estimate of the progress, because this thread only knows how much of ''its own'' work has been done.  It doesn't know what the other threads are doing!
 
  if (threadId == 0)
    {
    // Provide a fraction between 0.0 and 1.0
    this->UpdateProgress(progressFraction);
    }
 
== Some Tricky Problems ==
 
One problem has already been discussed: the way the data is divided into pieces can lead to poor load balancing, and better strategies for dividing the data would be a good thing!  Here is a list of problems:
# The current splitting method leads to poor load balancing
# If the algorithm's operation is masked with a stencil, load balancing is terrible
 
When a stencil is applied, it means that the algorithm operates only on the irregularly-shaped region of the image that is ''inside'' the stencil.  All the voxels ''outside'' the stencil are copied directly from the input, or are simply colored black.  In other words, the cost associated with operating on voxels that are ''inside'' is often many times higher than the cost of operating on pixels that are ''outside.''


== Future ==
== Future ==
The current implementation of vtkThreadedImageAlgorithm uses the coarsest granularity possible, with one (or zero) pieces going to each thread.  This can be problematic, because it can be difficult to set the sizes of the pieces such that all the threads finish at the same time.
One simple way of improving load balancing is to use finer granularity.  If there are many times as many pieces as there are threads, then the pieces don't all have to take the same amount of time.  Instead, every time a thread finishes a piece, it fetches the next piece from the queue.  If there are approximately 10 times as many pieces as there are threads, then we wouldn't expect to see any threads sitting idle until over 90% of the work has been completed.
The new [http://www.vtk.org/doc/nightly/html/classvtkSMPTools.html vtkSMPTools] provide a simple interface for requesting granularity, and provides back-ends for optimized threading libraries.  Threaded image algorithms should migrate from [http://www.vtk.org/doc/nightly/html/classvtkMultiThreader.html vtkMultiThreader] to vtkSMPTools.

Latest revision as of 20:22, 30 March 2015

Review of the Streaming Pipeline

The job of a vtkAlgorithm is to perform some operation upon some data (specifically, upon some data encapsulated in a vtkDataObject). One important aspect of VTK's streaming pipeline is that the vtkDataObject that the algorithm operates on might be one part of a larger data set. For example, a vtkImageData object might only contain a few slices of a large stack of image slices.

The portion of a data set that is contained within one vtkImageData object is described by the Extent array, which provides a (first, last) index for each of the three dimensions:

Extent = { first_idx_x, last_idx_x, first_idx_y, last_idx_y, first_idx_z, last_idx_z }

The pipeline provides the algorithm with two important pieces of information:

  1. How large the entire data set is (the WHOLE_EXTENT).
  2. What part of the data set the output vtkImageData object should contain after the algorithm runs (the UPDATE_EXTENT).

Splitting the UPDATE_EXTENT Among the Threads

The algorithm is responsible for producing a vtkImageData object whose size and position is described by the UPDATE_EXTENT. But what if we want to divide the work among several threads? Then the UPDATE_EXTENT must be broken into several chunks, each of which is also an extent. For example, let's say that the update extent is { 128, 255, 0, 255, 1, 100 }:

 UPDATE_EXTENT:     { 128, 255, 0, 255, 1, 100 }
 Thread 0 extent:   { 128, 255, 0, 255, 1, 25 }
 Thread 1 extent:   { 128, 255, 0, 255, 26, 50 }
 Thread 2 extent:   { 128, 255, 0, 255, 51, 75 }
 Thread 3 extent:   { 128, 255, 0, 255, 76, 100 }

In this example, the data has been divided along the Z direction. There is, in fact, a specific method in vtkThreadedImageAlgorithm whose only responsibility is to split the data into pieces, one piece for each thread:

int SplitExtent(int pieceExtent[6], int updateExtent[6], int piece, int total)

The total is the total number of pieces to divide the updateExtent into (this is always the number of threads to be used). Each thread calls this SplitExtent method with piece set to a different number (in the above example, with numbers 0 through 3). The SplitExtent method returns the the extent of data for that thread to operate on in the pieceExtent array.

The updateExtent is usually split up exactly as shown above, with the split occurring along the Z direction. This is a very poor choice if there are eight CPU cores available to work on the data, but only two slices in the stack! Only two of the cores get work to do, and the other six cores get nothing! To avoid the absolute worst case where one thread does all the work, if there is only one slice in the stack, then SplitExtent will do the division along the Y direction instead of the Z direction.

Multithreaded Execution

Now that we have seen how the work is divided into pieces, the next stop on our tour of the vtkThreadedImageAlgorithm is the main execution method, which is a virtual method that you would define in your own subclasses:

virtual void vtkThreadedImageAlgorithm::ThreadedRequestData(
  vtkInformation *request,
  vtkInformationVector **inputVector,
  vtkInformationVector *outputVector,
  vtkImageData ***inputData,
  vtkImageData **outputData,
  int pieceExtent[6],
  int piece)
{
  // code goes here
}

This method is called from each thread that will operate on the data. Let's ignore request, which is merely used for pipeline bookkeeping, and the inputVector,outputVector which are very important but beyond the current topic of discussion. What we have left are the inputData, the outputData, and the pieceExtent and piece that we discussed in the previous section. In the VTK codebase, you'll usually see threadId instead of piece, because each thread gets one piece of the data. The job of this method is to fill in the voxel values that correspond to the pieceExtent. Sounds simple, right?

Progress Reporting

The algorithm can communicate to the rest of the application while it is executing. In order to allow non-thread-safe modes of communication to be used, only one of the algorithm's threads is allowed to talk to the application. Specifically, only the main thread (the one that gets piece number 0) can do this. This thread is special, as it is the main thread that the other threads were split from.

You will often see code like the following inside the VTK algorithms, where the algorithm tells the application what fraction of the work has been done. Of course, this is only an estimate of the progress, because this thread only knows how much of its own work has been done. It doesn't know what the other threads are doing!

 if (threadId == 0)
   {
   // Provide a fraction between 0.0 and 1.0
   this->UpdateProgress(progressFraction);
   }

Some Tricky Problems

One problem has already been discussed: the way the data is divided into pieces can lead to poor load balancing, and better strategies for dividing the data would be a good thing! Here is a list of problems:

  1. The current splitting method leads to poor load balancing
  2. If the algorithm's operation is masked with a stencil, load balancing is terrible

When a stencil is applied, it means that the algorithm operates only on the irregularly-shaped region of the image that is inside the stencil. All the voxels outside the stencil are copied directly from the input, or are simply colored black. In other words, the cost associated with operating on voxels that are inside is often many times higher than the cost of operating on pixels that are outside.

Future

The current implementation of vtkThreadedImageAlgorithm uses the coarsest granularity possible, with one (or zero) pieces going to each thread. This can be problematic, because it can be difficult to set the sizes of the pieces such that all the threads finish at the same time.

One simple way of improving load balancing is to use finer granularity. If there are many times as many pieces as there are threads, then the pieces don't all have to take the same amount of time. Instead, every time a thread finishes a piece, it fetches the next piece from the queue. If there are approximately 10 times as many pieces as there are threads, then we wouldn't expect to see any threads sitting idle until over 90% of the work has been completed.

The new vtkSMPTools provide a simple interface for requesting granularity, and provides back-ends for optimized threading libraries. Threaded image algorithms should migrate from vtkMultiThreader to vtkSMPTools.