[Paraview] Parallel Streamtracer
Burlen Loring
bloring at lbl.gov
Thu Jun 7 11:54:16 EDT 2012
Hi Stephan,
I've experienced the scaling behavior that you report when I was working
on a project that required generating millions of streamlines for a
topological mapping algorithm interactively in ParaView. To get the
required scaling I wrote a stream tracer that uses a load-on-demand
approach with a tunable block cache, so that all ranks can integrate any
streamline and stay busy throughout the entire computation. It was very
effective on our data, and I've used it to integrate 30 million
streamlines in about 10 minutes on 256 cores. If you really need better
scalability than the distributed-data tracing approach implemented in
PV, you might take a look at our work. The downside of our approach is
that, in order to provide the demand loading, the reader has to implement
a VTK object whose API gives the integrator direct access to I/O
functionality. In case you're interested, the stream tracer class is
vtkSQFieldTracer and our reader is vtkSQBOVReader. The latest release
can be found here:
https://github.com/burlen/SciberQuestToolKit/tarball/SQTK-20120531
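
To make the idea concrete, here is a minimal sketch of the load-on-demand
pattern. All names in it (BlockCache, read_block, advect, and so on) are
hypothetical illustrations of the approach, not the actual
SciberQuestToolKit API:

    # Sketch only: hypothetical names, not the SciberQuestToolKit API.
    from collections import OrderedDict

    class BlockCache:
        """Tunable LRU cache mapping block ids to loaded field blocks."""
        def __init__(self, reader, max_blocks):
            self.reader = reader          # must expose block-level I/O
            self.max_blocks = max_blocks  # tunable cache size
            self.blocks = OrderedDict()

        def get(self, block_id):
            if block_id in self.blocks:
                self.blocks.move_to_end(block_id)    # most recently used
            else:
                if len(self.blocks) >= self.max_blocks:
                    self.blocks.popitem(last=False)  # evict LRU block
                self.blocks[block_id] = self.reader.read_block(block_id)
            return self.blocks[block_id]

    def trace(seed, cache, domain, step):
        """Integrate one streamline; any rank can run this for any seed."""
        line, x = [seed], seed
        while domain.contains(x):
            block = cache.get(domain.block_of(x))  # load data on demand
            x = block.advect(x, step)              # one integration step
            line.append(x)
        return line

Because every rank can pull any block through its cache, seeds can be
assigned to ranks freely and no rank idles waiting for streamlines that
happen to live in someone else's partition.
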
Burlen
On 06/04/2012 02:21 AM, Stephan Rogge wrote:
> Hello Leo,
>
> OK, I took the "disk_out_ref.ex2" example data set and did some time
> measurements. Remember, my machine has 4 cores + Hyper-Threading.
>
> My first observation is that PV seems to have a problem with distributing
> the data when the Multi-Core option (GUI) is enabled. When PV was started
> with builtin Multi-Core, I was not able to apply a stream tracer with more
> than 1000 seed points (PV froze and never came back). When the pvserver
> processes were started manually instead, I was able to set up to 100,000
> seed points. Is this a bug?
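>
> For reference, a minimal sketch of the manual setup (default port; the
> server launch line is typed in a shell, the rest in pvpython or the
> ParaView Python shell):
>
>     # in a shell: start 4 server ranks by hand
>     #   mpiexec -np 4 pvserver --server-port=11111
>     # then connect from pvpython:
>     from paraview.simple import *
>     Connect("localhost", 11111)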
>
> Now let's have a look at the scaling performance. As you suggested, I've
> used the D3 filter to distribute the data across the processes (see the
> pvpython sketch below). The stream tracer execution time for 10,000 seed
> points:
>
> ## Builtin: 10.063 seconds
> ## 1 MPI-Process (no D3): 10.162 seconds
> ## 4 MPI-Processes: 15.615 seconds
> ## 8 MPI-Processes: 14.103 seconds
>
> and 100,000 seed points:
>
> ## Builtin: 100.603 seconds
> ## 1 MPI-Process (no D3): 100.967 seconds
> ## 4 MPI-Processes: 168.1 seconds
> ## 8 MPI-Processes: 171.325 seconds
>
> I cannot see any positive scaling behavior here. Maybe this example is not
> appropriate for scaling measurements?
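>
> For completeness, here is a minimal pvpython sketch of the setup I timed.
> Proxy and property names (D3, StreamTracer, "Point Source", NumberOfPoints,
> the "V" vector array) follow the master branch I'm using and may differ in
> other ParaView versions:
>
>     from paraview.simple import *
>     reader = OpenDataFile("disk_out_ref.ex2")
>     d3 = D3(Input=reader)                    # redistribute across ranks
>     tracer = StreamTracer(Input=d3, SeedType="Point Source")
>     tracer.SeedType.NumberOfPoints = 10000   # seed resolution
>     tracer.Vectors = ["POINTS", "V"]         # velocity array in this set
>     Show(tracer)
>     Render()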
>
> One more thing: I've visualized the vtkProcessId and saw that the whole
> vector field is partitioned. I thought that each streamline is integrated
> in its own process, but it seems that this is not the case. This could
> explain my scaling issues: for small vector fields the synchronization
> overhead becomes too large and decreases the overall performance.
>
> My suggestion is to have a parallel StreamTracer which is built for a single
> machine with several threads. Could it be worth randomly distributing the
> seeds over all available (local) processes? Of course, each process would
> have access to the whole vector field.
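>
> A rough sketch of what I mean, in plain VTK Python (VTK 6 API; "field" and
> "all_seeds" are assumed to exist, and the chunks are run serially here only
> to keep the sketch short; each call is independent and could go to its own
> thread or process):
>
>     import vtk
>
>     def trace_chunk(field, seed_points):
>         """Trace one chunk of seeds; every worker sees the whole field."""
>         pts = vtk.vtkPoints()
>         for p in seed_points:
>             pts.InsertNextPoint(p)
>         seeds = vtk.vtkPolyData()
>         seeds.SetPoints(pts)
>         tracer = vtk.vtkStreamTracer()
>         tracer.SetInputData(field)   # full field; its point data needs an
>                                      # active vector array
>         tracer.SetSourceData(seeds)  # this worker's share of the seeds
>         tracer.Update()
>         return tracer.GetOutput()
>
>     # round-robin the seeds over N local workers
>     N = 8
>     chunks = [all_seeds[i::N] for i in range(N)]
>     results = [trace_chunk(field, c) for c in chunks]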
>
> Cheers,
> Stephan
>
>
>
> From: Yuanxin Liu [mailto:leo.liu at kitware.com]
> Sent: Friday, June 1, 2012 16:13
> To: Stephan Rogge
> Cc: Andy Bauer; paraview at paraview.org
> Subject: Re: [Paraview] Parallel Streamtracer
>
> Hi, Stephan,
> I did measure the performance at some point and was able to get fairly
> decent speedup with more processors, so I am surprised you are seeing huge
> latency.
>
> Of course, the performance is sensitive to the input. It is also
> sensitive to how readers distribute data. So, one thing you might want to
> try is to attach the "D3" filter to the reader.
>
> If that doesn't help, I will be happy to get your data and take a look.
>
> Leo
>
> On Fri, Jun 1, 2012 at 1:54 AM, Stephan Rogge<Stephan.Rogge at tu-cottbus.de>
> wrote:
> Leo,
>
> As I mentioned in my initial post of this thread: I used the up-to-date
> master branch of ParaView, which means I have already used your
> implementation.
>
> I can imagine that parallelizing this algorithm can be very tough. And I can
> see that distributing the calculation over 8 processes does not lead to nice
> scaling.
>
> But I don't understand the huge latency when using the StreamTracer in
> Cave mode with two viewports and two pvserver processes on the same
> machine (plus an extra machine for the client). I guess the tracer filter
> is applied for each viewport separately? That would be OK as long as both
> filter executions run in parallel, and I doubt that this is the case.
>
> Can you help to clarify my problem?
>
> Regards,
> Stephan
>
>
> From: Yuanxin Liu [mailto:leo.liu at kitware.com]
> Sent: Thursday, May 31, 2012 21:33
> To: Stephan Rogge
> Cc: Andy Bauer; paraview at paraview.org
> Subject: Re: [Paraview] Parallel Streamtracer
>
> It is in the current VTK and ParaView master. The class is
> vtkPStreamTracer.
>
> Leo
> On Thu, May 31, 2012 at 3:31 PM, Stephan Rogge<stephan.rogge at tu-cottbus.de>
> wrote:
> Hi, Andy and Leo,
>
> thanks for your replies.
>
> Is it possible to get this new implementation? I would like to give it a try.
>
> Regards,
> Stephan
>
> On May 31, 2012, at 17:48, Yuanxin Liu<leo.liu at kitware.com> wrote:
> Hi, Stephan,
> The previous implementation only has serial performance: it traces the
> streamlines one at a time and never starts a new streamline until the
> previous one finishes. With the communication overhead, it is not surprising
> that it got slower.
>
> My new implementation lets the processes work on different streamlines
> simultaneously and should scale much better.
>
> Leo
>
> On Thu, May 31, 2012 at 11:27 AM, Andy Bauer<andy.bauer at kitware.com> wrote:
> Hi Stephan,
>
> The parallel stream tracer uses the partitioning of the grid to determine
> which process does the integration. When the streamline exits the subdomain
> of a process, there is a search to see whether it enters a subdomain assigned
> to any other process before figuring out whether it has left the entire
> domain.
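>
> In pseudocode, the handoff logic is roughly the following (a sketch with
> hypothetical names, not the actual VTK code):
>
>     def advance(x, my_subdomain, other_ranks):
>         while my_subdomain.contains(x):
>             x = integrate_step(x)          # advance inside local data
>         for r in other_ranks:              # search the other subdomains
>             if r.subdomain_contains(x):
>                 return ("handoff", r, x)   # streamline continues on rank r
>         return ("done", None, x)           # left the entire domain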
>
> Leo, copied here, has been improving the streamline implementation inside
> VTK, so you may want to get his newer version. It is a pretty tough algorithm
> to parallelize efficiently without making any assumptions about the flow or
> partitioning.
>
> Andy
>
> On Thu, May 31, 2012 at 4:16 AM, Stephan Rogge<Stephan.Rogge at tu-cottbus.de>
> wrote:
> Hello,
>
> I have a question related to the parallelism of the stream tracer: as I
> understand the code, each line integration (trace) is processed in its own
> MPI process. Right?
>
> To test the scalability of the stream tracer, I've loaded a structured
> (curvilinear) grid, applied the filter with a seed resolution of 1500, and
> checked the timings in single-threaded and multi-threaded (Multi Core
> enabled in the PV GUI) runs.
>
> I was really surprised that multi-core slows the execution down to 4
> seconds, while a single core takes only 1.2 seconds. Data migration cannot be
> the explanation for that behavior (0.5 seconds). What is the problem here?
>
> Please find some statistics attached...
>
> Data:
> * Structured (Curvilinear) Grid
> * 244030 Cells
> * 37 MB Memory
>
> System:
> * Intel i7-2600K (4 Cores + HT = 8 Threads)
> * 16 GB Ram
> * Windows 7 64 Bit
> * ParaView (master-branch, 64 bit compilation)
>
> #################################
> Single Thread (Seed resolution 1500):
> #################################
>
> Local Process
> Still Render, 0.014 seconds
> RenderView::Update, 1.222 seconds
> vtkPVView::Update, 1.222 seconds
> Execute vtkStreamTracer id: 2184, 1.214 seconds
> Still Render, 0.015 seconds
>
> #################################
> Eight Threads (Seed resolution 1500):
> #################################
>
> Local Process
> Still Render, 0.029 seconds
> RenderView::Update, 4.134 seconds
> vtkSMDataDeliveryManager: Deliver Geome, 0.619 seconds
> FullRes Data Migration, 0.619 seconds
> Still Render, 0.042 seconds
> OpenGL Dev Render, 0.01 seconds
>
>
> Render Server, Process 0
> RenderView::Update, 4.134 seconds
> vtkPVView::Update, 4.132 seconds
> Execute vtkStreamTracer id: 2193, 3.941 seconds
> FullRes Data Migration, 0.567 seconds
> Dataserver gathering to 0, 0.318 seconds
> Dataserver sending to client, 0.243 seconds
>
> Render Server, Process 1
> Execute vtkStreamTracer id: 2193, 3.939 seconds
>
> Render Server, Process 2
> Execute vtkStreamTracer id: 2193, 3.938 seconds
>
> Render Server, Process 3
> Execute vtkStreamTracer id: 2193, 4.12 seconds
>
> Render Server, Process 4
> Execute vtkStreamTracer id: 2193, 3.938 seconds
>
> Render Server, Process 5
> Execute vtkStreamTracer id: 2193, 3.939 seconds
>
> Render Server, Process 6
> Execute vtkStreamTracer id: 2193, 3.938 seconds
>
> Render Server, Process 7
> Execute vtkStreamTracer id: 2193, 3.939 seconds
>
> Cheers,
> Stephan
>
>