[vtk-developers] Flying Edges

Sat Jan 10 10:04:46 EST 2015

Over the holidays I gave myself the gift of designing and implementing a
new contouring algorithm :-) To my utter delight and amazement I stumbled
upon a novel approach that is readily parallelizable and yet wicked fast in
serial mode as well. The algorithm (vtkFlyingEdges2D/vtkFlyingEdges3D) has
been pushed into gerrit for review if anyone is interested (FlyingEdges
branch). Consider this a work in progress, we are benchmarking and working
through parallel implementation details now. We are planning a paper
submission in the near future.

TLDR Details
+ As many of you know the VTK community is taking aim at seriously
reworking the system to better support parallel computing. Much of this
work is being led by Berk Geveci and his team. For example, Berk introduced
the vtkSMPTools into VTK about a year ago, and we are actively working with
some big name chip manufacturers, DOE Labs, research organizations, and
other customers/collaborators to get this ambitious undertaking done.
Hopefully individuals in this VTK community will also lend their
significant expertise as well.
+ Under prodding by Berk, and inspired by Patrick O'Leary's "born parallel"
slogan to rethinking systems and algorithms for parallel computing, the
motivation was to take one of the most important visualization algorithms
(isocontouring) and see if we could redesign it for emerging parallel
approaches.
+ The venerable Marching Cubes algorithm introduced the key idea that the
topology of an isosurface within a voxel can be captured with an index into
a simple, finite case table. Yet the implementation of the algorithm can be
slower than you might expect because, for example, on interior voxels,
voxel edges can be processed up to four times, and vertices visited eight
times (ignoring gradient calculations). Also the output mesh is dynamically
created: a priori the number of output points and triangles is unknown,
which means that memory resizing (realloc's) are required. Finally, the
algorithm is often implemented with a "locator" or hash table to prevent
the creation of duplicate points on shared voxel edges (the so-called point
merging problem). Not good for parallel computing.
+ Ken Martin's extremely fast Synchronized Templates algorithm introduced
the notion of a "voxel axes" or "edge triad" which consists of the three
x-y-z voxel axes located at the voxel origin. This triad is moved through
the volume in such a way as to intersect each voxel edge only one time
(with special treatment on the +x+y+z volume boundaries). (An aside: Ken
wrote this like 15 years ago, it's been the VTK standard ever since.
Initially we filed a patent application [since abandoned] but that was
before we realized how stupid it is for an open source company to hold
patents. Live and learn :-))
+ Flying Edges builds on these concepts and introduces some novel ideas. It
is a 3-pass algorithm: the first two passes gather information about the
volume including "intersections" along x-edges, and the number of output
primitives and points. The final pass generates output into preallocated
arrays. Some of the key ideas include:
-- each pass traverses x-rows in the volume completely independently
meaning that it is readily parallelizable.
-- the case table is based on the four voxel x-edges (which can be easily
transformed to and from a vertex-based MC table). Because the cases are
edge based, the computation of cases is parallel separable.
-- The novel concept of "computational trimming" is used to avoid massive
amounts of work. Basically the fact that isosurfaces are topologically
smooth is combined with the results of the first pass (building
x-voxel-edge cases) to reason about where the isosurface can exist in the
volume. This very simple topological reasoning enables the algorithm to
skip portions of x-rows, entire x-rows, and even x-y slices very rapidly.
-- Some simple edge metadata is gathered for each x-row volume edge which
can be used to allocate and partition memory for separate threads to write
into without contention (i.e., no mutexes or locks). This metadata also
enables the algorithm to process the voxels along each x-row in such a way
as to synchronously increment the point ids and triangle ids without
requiring point merging or hashing. In fact, topology generation (i.e.,
creating triangles) is completely independent from point/geometry
generation (i.e., interpolating point coordinates and attributes along
edges).
+ Early results show a 2-5x speedup over Synchronized Templates (>10x over
vtkImageMarchingCubes) when running in serial. However, this is very data
dependent and I suspect compiler dependent since there is a lot of C++
templating and in-lining going on (make sure to build optimized/release
mode). Anyway I am very excited because once the parallel loops are
functioning properly we should see much more performance improvement. The
scalability may be hampered however because data movement over the memory
bus may slow things up (computations are relatively modest compared to the
total data processed). We'll be benchmarking performance over the next
couple of months to see where we really are and of course we'll update this
community at that time.
+ The memory overhead of the algorithm is roughly 2-bits per vertex.
However as implemented now I'm avoiding bit packing and just use a byte
(unsigned char) per vertex.
+ Other interesting tidbits:
-- The basic algorithm is extremely simple to implement. Most of the
complications come from dealing with volume boundaries. It would be
possible to significantly reduce code complexity by either padding the
+x+y+z volume boundaries by a layer of voxels, or alternatively not
processing the outermost layer of voxels.
-- Because I am lazy, the algorithm reuses the MC case table as enumerated
in VTK (vtkMarchingCubesTriangleCases.h). However at instantiation the MC
table is converted to a Flying Edges edge-based case table. Also note that
the MC table as originally defined is for a left-handed coordinate system
(see the original paper). So reordering of triangles is required to make
sure that normals and triangle ordering is consistent.
-- On our radar is potential vectorization of operation like interpolation
across voxel edges. As you know some of the big hardware manufacturers are
really pushing this now and we'd like to learn how to best take advantage
of this emerging resource.
-- Probably the biggest downside of this algorithm is that is can produce
degenerate triangles (i.e., zero-area triangles in 3D, zero-length line
segments in 2D). The major reason is that when preallocating output memory,
we do so based on topological evaluation. However, during actual point
generation in degenerate cases (a degenerate case occurs when one of more
vertex scalar values == isocontour value) multiple intersection points can
occur at the vertex producing degeneracies. Since discarding a degenerate
triangle would mess up the output preallocation we can't just throw away
degenerate primitives. However I have already thought through a solution to
this problem but it is based on (spoiler alert) a much more complex case
table that enumerates 3-values per vertex: (<,==,>) when compared to the
isocontour value. This is a novel idea as well, mostly the academic
literature completely ignores the reality of degeneracies. Maybe next
holiday season we'll slip it into the algorithm.....

Okay I've got to get back to the pointy-haired management stuff. But please
feel free to provide feedback, or offers of collaboration :-) In particular
we'd love support, either in funding or brainpower, to do even more cool
parallel processing stuff. Hoping to hear from you, and I hope your coming
year is as exciting as mine is looking to be!

Best,
W

-- 
William J. Schroeder, PhD
Kitware, Inc.
28 Corporate Drive
Clifton Park, NY 12065
will.schroeder at kitware.com
http://www.kitware.com
(518) 881-4902
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://public.kitware.com/pipermail/vtk-developers/attachments/20150110/a2105b18/attachment-0001.html>