[Paraview] Parallel Streamtracer

Tue Jun 5 05:59:42 EDT 2012

Thanks, Leo.

That sounds great. I'm looking forward to having a parallel Stream Tracer
for small vector fields.

Stephan

From: Yuanxin Liu [mailto:leo.liu at kitware.com]
Sent: Monday, June 4, 2012 19:31
To: Stephan Rogge
Cc: Andy Bauer; paraview at paraview.org
Subject: Re: [Paraview] Parallel Streamtracer

Hi, Stephan,
  I will look into the multi-core issue as well as the performance issue.

  Some quick answers:

  - Yes, the whole vector fields are partitioned and the streamlines are
passed from one process to another. This is why the performance can be
highly sensitive to how data are distributed and how the streamlines travel
between data partitions. 

  - Your suggestion makes sense if the data is small enough to be run on a
single machine. This is definitely something we would like to do in the
future. Right now, the implementation is more targeted towards handling
large data that have to be distributed across multiple machines.   

Leo

On Mon, Jun 4, 2012 at 5:21 AM, Stephan Rogge <Stephan.Rogge at tu-cottbus.de>
wrote:
Hello Leo,

ok, I took the "disk_out_ref.ex2" example data set and did some time
measurements. Remember, my machine has 4 Cores + HyperThreading.

My first observation is that PV seems to have a problem distributing
the data when the Multi-Core option (GUI) is enabled. When PV is started
with builtin Multi-Core, I was not able to apply a stream tracer with more
than 1000 seed points (PV freezes and never comes back). In contrast, when
the pvserver processes were started manually, I was able to use up to
100,000 seed points. Is this a bug?

Now let's look at the scaling performance. As you suggested, I used the
D3 filter to distribute the data across the processes. The stream
tracer execution times for 10,000 seed points:

##   Builtin: 10.063 seconds
##   1 MPI-Process (no D3): 10.162 seconds
##   4 MPI-Processes: 15.615 seconds
##   8 MPI-Processes: 14.103 seconds

and for 100,000 seed points:

##   Builtin: 100.603 seconds
##   1 MPI-Process (no D3): 100.967 seconds
##   4 MPI-Processes: 168.1 seconds
##   8 MPI-Processes: 171.325 seconds

I cannot see any positive scaling behavior here. Maybe this example is not
appropriate for scaling measurements?
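
To put the timings above into numbers, here is a minimal Python sketch that
computes speedup and parallel efficiency relative to the single-process run
(no D3). Only the timings are taken from the measurements above; the
dictionary layout and the function names are illustrative assumptions.

```python
# Measured stream tracer times in seconds, keyed by number of seed points.
# "serial" is the 1 MPI-process run without D3; 4 and 8 are MPI process counts.
timings = {
    10_000: {"serial": 10.162, 4: 15.615, 8: 14.103},
    100_000: {"serial": 100.967, 4: 168.1, 8: 171.325},
}


def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel; S < 1 means a slowdown."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency E = S / n_procs; 1.0 would be ideal scaling."""
    return speedup(t_serial, t_parallel) / n_procs


for n_seeds, t in timings.items():
    for n_procs in (4, 8):
        s = speedup(t["serial"], t[n_procs])
        e = efficiency(t["serial"], t[n_procs], n_procs)
        print(f"{n_seeds} seeds, {n_procs} processes: "
              f"speedup {s:.2f}, efficiency {e:.2f}")
```

Every speedup comes out below 1, so in these runs adding processes made
the filter slower rather than faster.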

One more thing: I've visualized the vtkProcessId and saw that the whole
vector field is partitioned. I had thought that each streamline is integrated
in its own process, but this does not seem to be the case. This could
explain my scaling issues: for small vector fields, the synchronization
overhead becomes too large and decreases the overall performance.

My suggestion is to have a parallel StreamTracer which is built for a single
machine with several threads. Could it be worth randomly distributing the
seeds over all available (local) processes? Of course, each process would
have access to the whole vector field.
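
A minimal sketch of this seed-distribution idea, in plain Python rather
than VTK. The analytic velocity field, the RK4 integrator, and all function
names are hypothetical stand-ins and not part of the ParaView or VTK API;
the only point is that seeds are shuffled, dealt out to local workers, and
every worker can evaluate the complete field.

```python
import math
import random
from concurrent.futures import ProcessPoolExecutor


def velocity(x, y):
    """Toy analytic field (rigid rotation); stands in for the real data."""
    return -y, x


def trace(seed, step=0.01, n_steps=200):
    """Integrate one streamline from `seed` with classical RK4."""
    x, y = seed
    points = [(x, y)]
    for _ in range(n_steps):
        k1x, k1y = velocity(x, y)
        k2x, k2y = velocity(x + 0.5 * step * k1x, y + 0.5 * step * k1y)
        k3x, k3y = velocity(x + 0.5 * step * k2x, y + 0.5 * step * k2y)
        k4x, k4y = velocity(x + step * k3x, y + step * k3y)
        x += step * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        y += step * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
        points.append((x, y))
    return points


def trace_chunk(seeds):
    """Work item for one local process: trace all seeds it was given."""
    return [trace(s) for s in seeds]


def split_seeds(seeds, n_workers, rng_seed=0):
    """Shuffle the seeds and deal them round-robin into n_workers chunks."""
    shuffled = list(seeds)
    random.Random(rng_seed).shuffle(shuffled)
    return [shuffled[i::n_workers] for i in range(n_workers)]


def trace_parallel(seeds, n_workers=4):
    """Trace all seeds using n_workers local processes."""
    chunks = split_seeds(seeds, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(trace_chunk, chunks)
    return [line for chunk in results for line in chunk]
```

A call such as trace_parallel(seeds, n_workers=8) should be made from under
an `if __name__ == "__main__":` guard so that worker processes can be
started safely on every platform. Because each streamline stays inside one
process, no hand-off between processes is needed, at the cost of every
process holding the whole field in memory.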

Cheers,
Stephan

From: Yuanxin Liu [mailto:leo.liu at kitware.com]
Sent: Friday, June 1, 2012 16:13
To: Stephan Rogge
Cc: Andy Bauer; paraview at paraview.org
Subject: Re: [Paraview] Parallel Streamtracer

Hi, Stephan,
  I did measure the performance at some point and was able to get fairly
decent speed up with more processors. So I am surprised you are seeing huge
latency.

  Of course, the performance is sensitive to the input.  It is also
sensitive to how readers distribute data. So, one thing you might want to
try is to attach the "D3" filter to the reader.

  If that doesn't help,  I will be happy to get your data and take a look.

Leo

On Fri, Jun 1, 2012 at 1:54 AM, Stephan Rogge <Stephan.Rogge at tu-cottbus.de>
wrote:
Leo,

As I mentioned in my initial post of this thread, I used the up-to-date
master branch of ParaView, which means I have already used your
implementation.

I can imagine that parallelizing this algorithm is very tough. And I can
see that distributing the calculation over 8 processes does not lead to
nice scaling.

But I don't understand the huge latency when using the
StreamTracer in Cave mode with two viewports and two pvserver processes
on the same machine (with an extra machine for the client). I guess the
tracer filter is applied for each viewport separately? This would be OK as
long as both filter executions run in parallel, but I doubt that this is
the case.

Can you help to clarify my problem?

Regards,
Stephan

From: Yuanxin Liu [mailto:leo.liu at kitware.com]
Sent: Thursday, May 31, 2012 21:33
To: Stephan Rogge
Cc: Andy Bauer; paraview at paraview.org
Subject: Re: [Paraview] Parallel Streamtracer

It is in the current VTK and ParaView master.  The class is
vtkPStreamTracer. 

Leo
On Thu, May 31, 2012 at 3:31 PM, Stephan Rogge <stephan.rogge at tu-cottbus.de>
wrote:
Hi, Andy and Leo,

thanks for your replies.

Is it possible to get this new implementation? I would like to give it a try.

Regards,
Stephan

On May 31, 2012, at 17:48, Yuanxin Liu <leo.liu at kitware.com> wrote:
Hi, Stephan,
   The previous implementation only has serial performance:  It traces the
streamlines one at a time and never starts a new streamline until the
previous one finishes.  With communication overhead, it is not surprising
that it got slower.

  My new implementation lets the processes work on different
streamlines simultaneously and should scale much better.

Leo

On Thu, May 31, 2012 at 11:27 AM, Andy Bauer <andy.bauer at kitware.com> wrote:
Hi Stephan,

The parallel stream tracer uses the partitioning of the grid to determine
which process does the integration. When the streamline exits the subdomain
of a process, there is a search to see whether it enters a subdomain
assigned to any other process before deciding that it has left the entire
domain.
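
The ownership search described above can be sketched in one dimension.
This is a toy model and not the vtkPStreamTracer code: the slab bounds, the
constant velocity, and the function names are all assumptions made for
illustration. Each slab stands for the subdomain of one process, and each
change of owner stands for one hand-off between processes.

```python
def owner_of(x, bounds):
    """Return the rank whose slab [lo, hi) contains x, or None if outside."""
    for rank, (lo, hi) in enumerate(bounds):
        if lo <= x < hi:
            return rank
    return None


def trace_with_handoffs(seed, velocity, bounds, step=0.01, max_steps=1000):
    """Advance a point and record the sequence of ranks that own it."""
    x = seed
    rank = owner_of(x, bounds)
    visited = []
    if rank is None:
        return visited  # seed lies outside the whole domain
    visited.append(rank)
    for _ in range(max_steps):
        x += step * velocity  # toy integration step (constant velocity)
        lo, hi = bounds[rank]
        if lo <= x < hi:
            continue  # still inside the current subdomain
        new_rank = owner_of(x, bounds)  # search the other subdomains
        if new_rank is None:
            break  # no owner found: the streamline left the domain
        rank = new_rank
        visited.append(rank)
    return visited


# Four equal slabs over [0, 1), one per hypothetical process.
bounds = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
ranks = trace_with_handoffs(0.1, 1.0, bounds)
print("ranks visited:", ranks, "hand-offs:", len(ranks) - 1)
```

In this model a streamline that crosses all four slabs needs three
hand-offs, while the same streamline on an unpartitioned domain needs none,
which illustrates why the cost depends on how the data is partitioned
relative to the flow.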

Leo, copied here, has been improving the streamline implementation inside of
VTK so you may want to get his newer version. It is a pretty tough algorithm
to parallelize efficiently without making any assumptions on the flow or
partitioning.

Andy

On Thu, May 31, 2012 at 4:16 AM, Stephan Rogge <Stephan.Rogge at tu-cottbus.de>
wrote:
Hello,

I have a question related to the parallelism of the stream tracer: If I
understand the code correctly, each line integration (trace) is processed in
its own MPI process. Right?

To test the scalability of the stream tracer, I loaded a structured
(curvilinear) grid, applied the filter with a seed resolution of 1500, and
checked the timings in a single-thread and a multi-thread (Multi-Core
enabled in the PV GUI) situation.

I was really surprised that multi-core slows the execution down to about 4
seconds, while the single core takes only 1.2 seconds. Data migration cannot
be the explanation for that behavior (0.5 seconds). What is the problem here?

Please see attached some statistics...

Data:
* Structured (Curvilinear) Grid
* 244030 Cells
* 37 MB Memory

System:
* Intel i7-2600K (4 Cores + HT = 8 Threads)
* 16 GB Ram
* Windows 7 64 Bit
* ParaView (master-branch, 64 bit compilation)

#################################
Single Thread (Seed resolution 1500):
#################################

Local Process
Still Render,  0.014 seconds
RenderView::Update,  1.222 seconds
   vtkPVView::Update,  1.222 seconds
       Execute vtkStreamTracer id: 2184,  1.214 seconds
Still Render,  0.015 seconds

#################################
Eight Threads (Seed resolution 1500):
#################################

Local Process
Still Render,  0.029 seconds
RenderView::Update,  4.134 seconds
vtkSMDataDeliveryManager: Deliver Geome,  0.619 seconds
   FullRes Data Migration,  0.619 seconds
Still Render,  0.042 seconds
   OpenGL Dev Render,  0.01 seconds

Render Server, Process 0
RenderView::Update,  4.134 seconds
   vtkPVView::Update,  4.132 seconds
       Execute vtkStreamTracer id: 2193,  3.941 seconds
FullRes Data Migration,  0.567 seconds
   Dataserver gathering to 0,  0.318 seconds
   Dataserver sending to client,  0.243 seconds

Render Server, Process 1
Execute vtkStreamTracer id: 2193,  3.939 seconds

Render Server, Process 2
Execute vtkStreamTracer id: 2193,  3.938 seconds

Render Server, Process 3
Execute vtkStreamTracer id: 2193,  4.12 seconds

Render Server, Process 4
Execute vtkStreamTracer id: 2193,  3.938 seconds

Render Server, Process 5
Execute vtkStreamTracer id: 2193,  3.939 seconds

Render Server, Process 6
Execute vtkStreamTracer id: 2193,  3.938 seconds

Render Server, Process 7
Execute vtkStreamTracer id: 2193,  3.939 seconds

Cheers,
Stephan
