[vtk-developers] Rendering speed with large numbers of vtkActors (long)

Mon Jun 18 16:13:39 EDT 2001

Very interesting!
I've been working with Roger Brown at SGI to figure out why
performance on the SGI IR2 was so terrible with just a
single actor.  We started on this about a year-and-a-half
ago, but didn't know the appropriate venue to discuss
results.  

We were able to get a 3x performance improvement on the
"sphere" benchmark using the following modifications (some
of which are very similar to yours).  Some of these
improvements come directly from the OpenGL performance
tuning guide and others from profiling the GL pipeline
directly.

http://techpubs.sgi.com/library/dynaweb_bin/ebt-bin/0650/nph-infosrch.cgi/infosrchtpl/SGI_Developer/OpenGLonSGI/@InfoSearch__BookTextView/540?DwebQuery=opengl%2Band%2Btuning

So we did the following to improve raw polygon rendering
performance with a single actor.

1) Eliminated redundant gl state changes by caching GL state
calls.  Like you, we created a state-cache for GL state
calls.  However, we made it a member of the
vtkOpenGLRenderWindow class because that class holds the
GLXContext.  Whenever the context is made current in the
RenderWindow, it copies the pointer to its own StateCache
object into a static class member, so that while the context
is current you can simply refer to the cache with
	vtkOpenGLRenderWindow::CurrentStateCache
When the cache lived in the Renderer, problems arose with
caching state for the wrong rendering context.

This resulted in the largest performance boost because IR
pipes have a sizable overhead for state changes.  It also
enabled the following adjustments to show performance
improvements.  Without state caching, the typical things you
do to "tune" OpenGL had no tangible effect.

2) Using:
     glNewList(this->ListId,GL_COMPILE);
     glCallList(this->ListId);
   instead of
     glNewList(this->ListId,GL_COMPILE_AND_EXECUTE);
  I know it's counterintuitive, but this does make a
performance difference in some cases.

3) Calling glDisable( GL_DITHER );
  Even if you are not actually dithering, the pipeline seems
to take a performance hit if it is enabled.

4) *Not* calling glEnable( GL_NORMALIZE );
  The vtkNormals are already normalized as far as I can
tell.

5) Lengthening the time before any GL state calls are done
after a glXSwapBuffers().  glXSwapBuffers has a huge latency
associated with it that blocks the GL pipeline until it is
completed (sometimes 15% of the total rendering time when
you are rendering at 10fps).  So it is always good to do
*anything* that isn't a GL operation immediately after the
SwapBuffers call.  However, the vtkRenderer calls the
RenderOverlay method of vtkProp almost immediately after the
SwapBuffers, which in turn makes many blocking gl state
changes.  Using the StateCache, we can delay the actual
playback of these operations until after all of the state
changes (if any) have been collected.  This buys enough time
to hide the SwapBuffers latency because navigating VTK's
class hierarchy to accumulate gl state change info is
costly.

6) Lighting needs to be disabled when making glMaterial
state calls.  (At least this is true on SGI machines.)

7) There are many redundant calls to
glMatrixMode(GL_MODELVIEW) (the state caching fixed this
problem as well).

8) Eliminated glFlush() calls where glXSwapBuffers() is
going to be called anyway.  glFlush() is costly and in
particular it is redundant if you call glXSwapBuffers()
immediately afterwards, as is done in the
vtkOpenGLRenderer::Frame() function.  So glFlush() is *only*
called if glXSwapBuffers() will *not* be called.
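
A minimal sketch of the per-context state cache described in
item 1 is given here.  It is an illustration under stated
assumptions, not the actual patch: FakeGL, StateCache,
RenderWindowSketch, and their method names are hypothetical,
and FakeGL only counts calls so that the example runs without
an OpenGL context.  The one idea taken from the text is the
static CurrentStateCache pointer that is updated whenever a
window is made current.

```cpp
#include <cassert>
#include <map>

// Values of GL_MODELVIEW and GL_LIGHTING from gl.h, repeated here so the
// sketch does not need the OpenGL headers.
const int kModelView = 0x1700;
const int kLighting = 0x0B50;

// Hypothetical stand-in for the GL driver: it only counts the calls that
// actually reach it.
struct FakeGL {
  int EnableCalls = 0;
  int MatrixModeCalls = 0;
  void Enable(int /*cap*/, bool /*on*/) { ++EnableCalls; }
  void MatrixMode(int /*mode*/) { ++MatrixModeCalls; }
};

// Hypothetical state cache, one per rendering context.
class StateCache {
public:
  explicit StateCache(FakeGL* gl) : GL(gl) { assert(gl != nullptr); }

  // Forward an enable/disable only if it changes the cached value.
  void SetEnabled(int cap, bool on) {
    std::map<int, bool>::iterator it = Enabled.find(cap);
    if (it != Enabled.end() && it->second == on) return;  // redundant call
    Enabled[cap] = on;
    GL->Enable(cap, on);
  }

  // Forward a matrix-mode change only if the mode differs from the last one.
  void SetMatrixMode(int mode) {
    if (HaveMatrixMode && MatrixMode == mode) return;  // redundant call
    HaveMatrixMode = true;
    MatrixMode = mode;
    GL->MatrixMode(mode);
  }

private:
  FakeGL* GL;
  std::map<int, bool> Enabled;  // unknown until first set, so first call passes
  bool HaveMatrixMode = false;
  int MatrixMode = 0;
};

// Hypothetical render window: owns the cache for its context and publishes
// it through a static pointer when the context is made current.
class RenderWindowSketch {
public:
  explicit RenderWindowSketch(FakeGL* gl) : Cache(gl) {}
  void MakeCurrent() { CurrentStateCache = &Cache; }
  static StateCache* CurrentStateCache;

private:
  StateCache Cache;
};

StateCache* RenderWindowSketch::CurrentStateCache = nullptr;
```

Because each window owns its own cache, switching contexts
switches caches too, which avoids the wrong-context problem
mentioned in item 1.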

The current implementation I have is kind of kludgy because
we only fixed the state-cache for things that were involved
in the benchmarks we were using.  It would be a bit of work
to make this fix "production quality".  I think these
changes will have benefits for a number of GL pipeline
architectures (not just SGI's IR pipeline).

John T., looking through your results, I think we can
combine many of the gl-state fixes that you found with the
ones that Roger and I discovered to at least accelerate the
raw polygon performance of VTK as an intermediate step
toward the modifications you require to improve performance
with large numbers of actors.  The question is whether this
methodology will work for the multithreaded variety of VTK. 
I think that it will, but I'm interested in other people's
comments on that.

-john

John Tourtellott wrote:
> 
> From time to time, Simmetrix has posted messages regarding the apparent
> slowdown of VTK when used to render very large numbers (100's to 1000's)
> of actors. Last year, Bob O'Bara created a benchmark program that
> demonstrates this effect quantitatively. This program showed, for
> example, that rendering 1000 actors, each with one geometric primitive
> (we use a cube), runs 15 times slower than rendering a single actor with
> 1000 geometric primitives. In both cases, the actual geometry rendered
> by OpenGL, and the resulting images, are identical, yet the 1000-actor
> case takes 15X more time.
> 
> In the past few weeks, by making a number of modifications to VTK 3.2
> (using the May 11 nightly build) we have reduced this difference in
> rendering times from a factor of 15X to 2X. From this work, we have
> identified many issues that influence rendering speed for large numbers
> of actors, and would like to share our major findings with the
> developers. Our goal is to stimulate discussion and suggest possible
> enhancements to VTK that would benefit applications using large numbers
> of actors. Following is a list of the major issues we discovered and the
> code modifications we used to improve rendering speed. We would very
> much like to see these improvements (or equivalent improvements based
> on suggestions of the "right way" to do some of these things)
> incorporated into VTK.
> 
> Issue 1: Numerous OpenGL "state" calls are made by the
> vtkOpenGLProperty::Render() method. At least 12 gl* calls are made for
> each actor every frame. (This was discussed in the vtk-developers list
> beginning May 16.)
> 
> Action: We modified the vtkOpenGLProperty::Render() method to call new
> methods in the vtkOpenGLRenderer class instead of calling OpenGL
> directly. The new methods in vtkOpenGLRenderer store and check the last
> OpenGL settings for things such as material colors, shade model, and
> point size, and only make the gl* call if there's a change.
> 
> Issue 2: glMultMatrixd() is called by vtkOpenGLActor::Render(). In the
> general case, this is needed to render the actor in the right place. In
> our applications, however, the actors nearly always have an identity
> matrix, making the glMultMatrixd() call unnecessary. (Mark Beall posted
> this to the developers list on May 17, but there were no responses.)
> 
> Action: We added an "identity" flag to the vtkMatrix4x4 class to keep
> track of its state. It's not ideal since the matrix elements in this
> class are public (i.e., someone can change the matrix without the class,
> or our identity flag, knowing about it). We also, of course, modified
> the vtkOpenGLActor::Render() method to check the matrix identity flag
> before calling glMultMatrixd().
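
A minimal sketch of the identity-flag idea from Issue 2 is
given here.  It is not the Simmetrix patch: Matrix4x4Sketch,
MatrixCallCounter, and RenderActor are hypothetical names.  The
sketch also assumes the elements are private, so every write
goes through SetElement() and the flag cannot go stale, which
is exactly what the public elements of vtkMatrix4x4 prevent.

```cpp
#include <cassert>

// Hypothetical 4x4 matrix that tracks whether it currently equals identity.
class Matrix4x4Sketch {
public:
  Matrix4x4Sketch() { Identity(); }

  // Reset to the identity matrix and set the flag.
  void Identity() {
    for (int r = 0; r < 4; ++r)
      for (int c = 0; c < 4; ++c)
        Element[r][c] = (r == c) ? 1.0 : 0.0;
    IdentityFlag = true;
  }

  // All writes pass through here, so the flag is refreshed on every change.
  void SetElement(int r, int c, double v) {
    assert(r >= 0 && r < 4 && c >= 0 && c < 4);
    Element[r][c] = v;
    IdentityFlag = ComputeIsIdentity();  // 16 comparisons per write
  }

  double GetElement(int r, int c) const { return Element[r][c]; }
  bool IsIdentity() const { return IdentityFlag; }

private:
  bool ComputeIsIdentity() const {
    for (int r = 0; r < 4; ++r)
      for (int c = 0; c < 4; ++c)
        if (Element[r][c] != ((r == c) ? 1.0 : 0.0)) return false;
    return true;
  }

  double Element[4][4];
  bool IdentityFlag;
};

// Stand-in for the GL driver: counts simulated glMultMatrixd() calls.
struct MatrixCallCounter {
  int MultMatrixCalls = 0;
};

// Skip the matrix multiply when the actor's matrix is the identity.
void RenderActor(MatrixCallCounter& gl, const Matrix4x4Sketch& m) {
  if (!m.IsIdentity()) {
    ++gl.MultMatrixCalls;  // real code would call glMultMatrixd() here
  }
  // ... draw the actor's geometry ...
}
```

The check in RenderActor() is a single boolean test per actor,
so it costs far less than the call it avoids when most actors
carry an identity matrix.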
> 
> Issue 3: The vtkFieldData::GetMTime() method creates an iterator even
> when the field data contains no arrays.
> 
> Action: We added a line to check if vtkFieldData::NumberOfArrays is
> greater than zero before instantiating a vtkFieldData::Iterator.
> 
> Issue 4: The vtkActor::GetBounds() method traverses the VTK pipeline
> checking modified times ad infinitum. With 1000 actors and 120 frames,
> this means 120,000 calls to vtkActor::GetMTime(),
> vtkPolyDataMapper::GetMTime(), vtkDataObject::GetMTime(), and (our
> friend from Issue 3) vtkFieldData::GetMTime().
> 
> Action: Since we almost never modify our actors (especially their
> geometry), we added a new protected member called "UpdateMode" to
> vtkProp3D. It is used to encode 1 of 3 possible modes:
> 
>         VTK_UPDATE_DEFAULT
>         VTK_UPDATE_ONCE
>         VTK_UPDATE_DONE
> 
> In VTK_UPDATE_DEFAULT mode, the actor behaves in VTK's normal way. The
> VTK_UPDATE_ONCE mode configures the actor to update one time only, and
> the VTK_UPDATE_DONE mode indicates not to perform the update, but
> instead to use the results from the previous/initial update.
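
The three update modes can be sketched as a small state
machine.  This is an illustration only: Prop3DSketch and its
PipelineTraversals counter are hypothetical, and the sketch
assumes that a VTK_UPDATE_ONCE prop switches itself to
VTK_UPDATE_DONE after its first update, which the description
above implies but does not state.

```cpp
#include <cassert>

// Mode names taken from the text; the enum itself is a sketch.
enum UpdateMode { VTK_UPDATE_DEFAULT, VTK_UPDATE_ONCE, VTK_UPDATE_DONE };

// Hypothetical prop whose GetBounds() normally walks the pipeline.
class Prop3DSketch {
public:
  void SetUpdateMode(UpdateMode mode) { Mode = mode; }
  UpdateMode GetUpdateMode() const { return Mode; }

  const double* GetBounds() {
    if (Mode == VTK_UPDATE_DONE) {
      return Bounds;  // reuse the result of the previous update
    }
    RecomputeBounds();
    if (Mode == VTK_UPDATE_ONCE) {
      Mode = VTK_UPDATE_DONE;  // assumed transition after the single update
    }
    return Bounds;
  }

  // Number of (simulated) pipeline traversals performed so far.
  int PipelineTraversals = 0;

private:
  // Stands in for the GetMTime() checks and the bounds computation.
  void RecomputeBounds() {
    ++PipelineTraversals;
    Bounds[0] = Bounds[2] = Bounds[4] = -0.5;  // unit cube, as in the benchmark
    Bounds[1] = Bounds[3] = Bounds[5] = 0.5;
  }

  UpdateMode Mode = VTK_UPDATE_DEFAULT;
  double Bounds[6] = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
};
```

One caveat of this design: a prop put directly into
VTK_UPDATE_DONE before any update would return stale bounds,
so an application would normally start from VTK_UPDATE_ONCE.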
> 
> Issue 5: The vtkPolyDataMapper::RenderPiece() method traverses the VTK
> pipeline checking modification times.
> 
> Action: Using our new vtkProp3D::UpdateMode, we bypassed nearly all of
> the logic in the RenderPiece method for VTK_UPDATE_DONE actors.
> 
> Issue 6: The vtkPolyDataMapper::RenderPiece() method uses the vtkTimer
> to measure rendering time.
> 
> Action: In VTK_UPDATE_DONE mode, we bypass the timer, using the result
> computed in the previous/initial render.
> 
> Issue 7: The vtkProp3D::GetMatrix() method traverses the VTK pipeline
> checking modification times, and then calls vtkMatrix4x4::DeepCopy().
> Note that in this case the DeepCopy is just copying the matrix to
> itself.
> 
> Action: Although we don't know why the DeepCopy call is in there, we
> took it out and now return a pointer to the vtkProp3D::Matrix data
> member.
> 
> Issue 8: The vtkOpenGLPolyDataMapper::Draw() method puts 3 OpenGL calls
> -- glDisable(GL_COLOR_MATERIAL), glDisable(GL_LIGHTING), and
> glEnable(GL_LIGHTING) -- in every display list.
> 
> Action: For the GL_COLOR_MATERIAL case, we used the same workaround as
> in Issue #1 -- adding new logic in the vtkOpenGLRenderer class to check
> the OpenGL setting before making the gl* call. The GL_LIGHTING calls
> were a bit more interesting. The Draw() method disables lighting for
> the case when there are no normals available for either lines or
> vertices (and then re-enables lighting for surfaces). We modified the
> code to simply check for the presence of lines or vertices first,
> before worrying about normals. Since our test case actors don't display
> lines or vertices, this kept the GL_LIGHTING calls out of our display
> lists.
> 
> Note: This suggests that, in order to get high throughput with large
> numbers of actors, you cannot mix and match lines and surfaces randomly,
> but will be much better off grouping actors with lines only and surfaces
> only.
> 
> Issue 9: The vtkOpenGLActor::Render() calls glDepthMask(), to either set
> or clear it depending on the actor opacity.
> 
> Action: We added Enable/Disable methods to the vtkRenderer that check
> the OpenGL state before making the gl* call, in the same way that we
> replaced the gl* calls in the vtkOpenGLProperty::Render() method
> (Issue #1).
> 
> Issue 10: The vtkRenderer::Render() method allocates and destroys 3
> arrays (PropArray, RayCastPropArray, and RenderIntoImagePropArray) each
> frame.
> 
> Action: We took a shortcut and only destroy the arrays if an actor is
> added or removed between frames. In the general case, we'd suggest
> reallocating the arrays only if the number of actors has increased since
> the previous render.
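
The grow-only reallocation suggested for the general case in
Issue 10 might look like the sketch below.  PropArrayCache and
its Allocations counter are hypothetical and are not part of
VTK; the sketch only shows the policy of reallocating when the
requested size exceeds the current capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical holder for a per-frame prop array that is kept between frames.
class PropArrayCache {
public:
  // Return storage for at least n pointers, reallocating only on growth.
  void** Acquire(std::size_t n) {
    if (n > Capacity) {
      Storage.reset(new void*[n]);
      Capacity = n;
      ++Allocations;
    }
    return Storage.get();
  }

  std::size_t GetCapacity() const { return Capacity; }

  // Number of heap allocations performed so far.
  int Allocations = 0;

private:
  std::unique_ptr<void*[]> Storage;
  std::size_t Capacity = 0;
};
```

With a fixed actor count this allocates once for the whole run
rather than once per frame, and removing actors never triggers
a reallocation.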
> 
> Issue 11: The vtkOpenGLPolyDataMapper::RenderPiece() method calls
> MakeCurrent(), which in turn calls glXGetCurrentContext(). This was
> discussed in the vtk-developers list beginning May 28.
> 
> Action: Nothing was done for this one.
> 
> Issue 12: The vtkRenderer::RenderOverlay() method ends up calling
> vtkProp::SafeDownCast() for each actor every frame. This is a bit
> frustrating since, as we understand it, none of our actors contribute
> anything to the RenderOverlay method.
> 
> Action: Nothing has been done yet, but one suggestion is to let the
> application software enable/disable the RenderOverlay method.
> Another approach would be to change how the renderer stores the props.
> Right now everything is stored in one big list. It seems it would be
> possible to divide this up into multiple lists based on what the type of
> the prop is. For example, when the prop  is added to the renderer, the
> SafeDownCast call could be made and it could be put in the appropriate
> list. This would also solve Issue 13 below.
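
The multiple-list idea from Issue 12 can be sketched as
follows.  All types here (Prop, GeometryProp, OverlayProp,
RendererSketch) are hypothetical stand-ins, and dynamic_cast
plays the role of SafeDownCast(); the point is only that the
type test happens once, when the prop is added, rather than for
every prop on every frame.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical prop hierarchy standing in for vtkProp and its subclasses.
struct Prop { virtual ~Prop() = default; };
struct GeometryProp : Prop {};  // props drawn in the main geometry pass
struct OverlayProp : Prop {};   // props that contribute to the overlay pass

class RendererSketch {
public:
  // Classify the prop once, at insertion time.
  void AddProp(Prop* prop) {
    assert(prop != nullptr);
    if (OverlayProp* overlay = dynamic_cast<OverlayProp*>(prop)) {
      Overlays.push_back(overlay);
    } else if (GeometryProp* geometry = dynamic_cast<GeometryProp*>(prop)) {
      Geometry.push_back(geometry);
    } else {
      Others.push_back(prop);
    }
  }

  // Each per-frame pass walks only its own list, with no casts.
  std::size_t OverlayPassVisits() const { return Overlays.size(); }
  std::size_t GeometryPassVisits() const { return Geometry.size(); }
  std::size_t UnclassifiedCount() const { return Others.size(); }

private:
  std::vector<OverlayProp*> Overlays;
  std::vector<GeometryProp*> Geometry;
  std::vector<Prop*> Others;
};
```

In the 1000-actor benchmark, an overlay pass built this way
would visit no props at all when none of them are overlay
props.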
> 
> Issue 13: The vtkRenderer::Render() ends up calling
> vtkProp::SafeDownCast() for each actor every frame.
> 
> Action: Nothing has been done yet.
> 
> Issue 14: The vtkFrustumCoverageCuller::Cull() processes the same actor
> geometry every frame. Our actors have very simple geometry, and culling
> may not be of any benefit except when zoomed in very close.
> 
> Action: Nothing has been done yet, but we suggest making culling
> optional, either on an actor-by-actor basis or for the whole frame.
> 
> Issue 15: Using Quantify to measure CPU time, we have observed that the
> glFlush() calls for the 1000-actor case consume twice as many CPU cycles
> as the 1-actor case (82.1 million vs. 42.3 million). At the bottom of
> the glFlush() call hierarchy is the system writev() function. For the
> 1-actor case, each call to writev() takes an average of 338K cycles,
> whereas for the 1000-actor case, each call takes an average of 664K
> cycles.
> 
> Action: Nothing has been done, other than to verify that the sequence of
> gl* calls and display lists are equivalent for both test cases.
> 
> John Tourtellott
> -------------------------------------------------------
> Simmetrix Inc.
> 1223 Peoples Avenue
> Troy, NY  12180
> voice:  518-276-2728
> fax:    518-276-2944
> mailto: johnt at simmetrix.com
> -------------------------------------------------------