# Proposals:GridComputing

## Development and extension of ITK to be applied to grid computing

For more information contact: Simon Warfield, Steve Pieper, Ron Kikinis

There is an increasing trend to deploy grid computing infrastructures to support computations on extremely large datasets like those found in the Visible Human Project. We therefore believe that it is important for critical aspects of the architecture of ITK revisited and refined to support the emerging standards in the grid computing community and to develop example applications to demonstrate the power of the ITK/grid combination in real-world research computing scenarios.

**Task 1:** Survey of existing grid infrastructure software including the Globus Toolkit and associated middleware. Needs analysis of the bottlenecks in the current ITK as applied to collaborator projects within the SPL. Creation of initial ITK on grid computing examples using task-level parallelization of large population data sets (n > 1000).

Other examples of task level parallelism include parametric search in which an algorithm is applied to a single or small number of data sets many times (n > 1000) with different parameter settings.

**Task 2:** Focus on optimizing the current ITK inner-loops and threading model for CPU architectures and compilers deployed as compute grid nodes. Work with computer science collaborators to identify bottlenecks that can benefit from machine-level code optimizations to accommodate issues such as cache size, special instruction sets, and memory bandwidth hierarchies in the context of a heterogeneous compute grid environment.

Cache size varies from CPU model to CPU model and can have a tremendous impact upon performance. Also, utilization of vector instruction sets such as SSE, and scheduling and branch prediction must be carefully optimized for the specific CPU model, instruction cache size, data cache size and scheduling pipeline for best performance. Compilers, such as GCC or the Intel compiler, can generate optimized code that outperforms hand-tuned code. Our intention is to investigate the potential for performance gains on certain platforms through the generation of cache-optimized, instruction-set optimized, scheduler optimized binaries. We will also test the accuracy and precision of highly optimized code to ensure correct results are obtained.

We also want to optimize and accelerate specific algorithms, such as STAPLE. We think that at this stage, performance boosts can come from optimized thread parallel implementations of key algorithms and inner loops.

**Task 3:** Integrate message passing parallelism using the globus compatible MPI implementations (mpich-g2 or its successors) first to distribute ITK processing pipelines to among grid processing nodes and second to break individual algorithms across nodes.

The grid protocols have been undergoing tremendous change recently. There is no clear preferred protocol for distributing ITK processing pipelines among grid processing nodes at this time. Scheduling and access across Teragrid nodes has also been evolving. Some Teragrid sites are using tools such as PBS for managing jobs. We will focus our efforts on utilizing the protocols which are most stable and provide for the most significant benefits for the effort invested.

**Task 4:** Ongoing support of ITK grid tools at the task, threading, and message passing layers. Review of any newly developed grid middleware emerging during the project period and rework of ITK architecture to accommodate same.

In our initial effort we used Image Guided Neurosurgery as the clinical application driver, and optimized algorithms and performance especially suited for this application. This application requirements range from the need for highly optimized special instruction set inner loops and thread parallel execution for specific algorithms, such as intrasubject registration, to task level parallelism inherent in the statistical atlas construction that underlies robust segmentation methods. This also leverages our NSF funded ITR 0426558 *A Novel Grid Architecture Integrating Real-Time Data and Intervention During Image Guided Therapy* which provides active collaboration with Teragrid sites and access to expertise in the evolving grid protocols. We intend to maintain these application specific optimizations as ITK evolves, while adding new application drivers that ensure the changes we make are relevant and will have significant clinical impact.

## Compute platforms

We would like to get feedback about the compute platforms used in the ITK community. In particular, we would like to know what effort on which hardware platforms would have the largest impact on the community.

CPU model family, user count:

Intel : 1 AMD : 0 SPARC : 0

## Specific Tasks

- Thread-level parallelization of three key algorithms
**STAPLE**STAPLE is an expectation-maximization algorithm for the simultaneous truth and performance level estimation of a collection of segmentations via the computation of a probabilistic estimate of the true segmentation and a measure of the performance level represented by each segmentation. The source of each segmentation in the collection may be an appropriately trained human rater or raters, or may be an automated segmentation algorithm. The probabilistic estimate of the true segmentation is formed by estimating an optimal combination of the segmentations, weighting each segmentation depending upon the estimated performance level, and incorporating a prior model for the spatial distribution of structures being segmented as well as spatial homogeneity constraints. The following tasks will be performed:- Needs analysis of the current state of STAPLE in ITK. The current implementation is from the algorithm published by Warfield, Zou and Wells in MICCAI 2002 which is limited to binary labels. We will extend the implementation to allow for multi-category labels as described in (Warfield, Zou, Wells IEEE TMI 2004).
- Profiling of the current STAPLE algorithm to find areas which are relatively fast, those that are slow and can be optimized, and any true performance bottlenecks.
- Evaluation of thread scaling performance as a function of number of CPUs (1 to 16 CPUs on SMP machines).
- Evaluation of impact of different hardware platform on performance.
- Thread-level parallelization within STAPLE.
- CPU specific cache size, pipeline scheduling and instruction set optimizations for different hardware platforms.

**Resampling**Resampling is the process of taking one set of data sampled on a discrete lattice and interpolating the values of that data on another discrete lattice. Resampling is required in many areas of image analysis because most transformations that are applied to a data set will move the data points off the original pixel lattice and these new points will consequently need to be placed back on a new lattice in order to be displayed as an image. The following set of tasks will be performed:- Needs analysis of the current state of resampling in ITK
- Profiling of the current ITK algorithm to find areas which are relatively fast, those that are slow and can be optimized, and any true performance bottlenecks.
- Testing of both thread scaling performance of current algorithms (with multiple threads and different hardware platforms) on 1-16 CPU SMPs.
- Development and implementation of a resampling algorithm for complex number data-types currently not supported by ITK, required for interpolation of Fourier transforms.
- Thread-level parallelization with the resampling algorithms
- CPU specific cache size, pipeline scheduling and instruction set optimizations for different hardware platforms.

**Mattes Mutal Information Rigid Registration**The Mattes MI rigid registration algorithm is a mutual-information based registration technique that uses a single spatial sample to calculate the marginal and joint probabilities at discrete points (or bins) uniformly spread within the dynamic range of the images. The following set of tasks will be performed:- Needs analysis of the current state of Mattes MI
- Profiling of the current ITK algorithm to find areas which are relatively fast, those that are slow and can be optimized, and any true performance bottlenecks.
- Testing of the sensitivity of this algorithm to initial conditions in the placement of data sets.
- Comparison of the performance of Mattes MI with respect to other algorithms (such as Wells Viola MI)
- Testing the possibility to register multi-modality image datasets (e.g. CT to MRI)
- Profiling Mattes metric for various image modalities registration (e.g . MRI to MRI, CT to MRI) and decide on the suitable optimizer to be used
- Explore the possibility of combining Mattes MI with other algorithms (such as Normalized Correlation), with the goal of creating novel registration schemes
- Thread-level parallelization of the Mattes MI algorithm
- Cpu-specific optimization of the Mattes MI algorithm for multiple architectures

**Fourier Transform in ITK**The fourier transform is a ubiquitous transform used throughout engineering and computer science for many different roles. ITK currently has two different approaches to interfacing with this transform: via the VNL library (that is a wrapper around the FFTPACK library developed in the 1970's) and via the FFTW library (a high performance FFT library developed at MIT in the late 1990's). Both approaches have the relative strengths and weaknesses: FFTW must be built separately outside of ITK, is licensed under the GNU General Public License, and uses relatively newer technology to optimize the fourier transform across a wide variety of architectures; VNL's FFT is distributed as part of ITK, has the same licensing terms as the rest of ITK, and uses older well-established techniques to perform the fourier transform. In addition many hardware vendors have independently developed and implemented the fourier transform for their particular architecture because of the wide-spread use of this algorithm (eg. Intel MKL, Mac OS X optimized FFT, Sun Sparc FFT in the Sun Scientific Solutions library, and others). The following set of tasks will be performed:- Perform validity testing on all currently available versions of the fourier transform. Analyze the results to see if the various algorithms work, are fast, and are reliable.
- Obtain input from Kitware engineers for architectural considerations of current ITK infrastructure and it's ability to support the FFT across multiple platforms.
- Perform cache and CPU specific optimization for each algorithm

**Grid protocols and Batch-level processing with ITK**Grid computing and using grid level protocols on large grids (such as teragrid, which uses PBS to distribute and schedule jobs) has become more and more important as the level of work required to generate solutions to complicated problems has increased. This is especially relevant in the field of medical imaging where the size of the data sets and the space of solutions can be extremely large requiring single-CPU machines to spend many days to solve certain problems. Experimenting with a grid-based approach to solving these problems will provide insight into the strengths and pitfalls of such an approach, and the value of close integration with ITK. The following set of tasks will be performed:- Setup a local grid from a collection of machines to demonstrate usefulness of using a grid computing approach to solve large scale problems.
- Test the scalability of this approach.
- Assess the performance boost that is possible.