[Paraview] Parallel Streamtracer

burlen burlen.loring at gmail.com
Fri Jun 8 14:14:07 EDT 2012


OK, you had me a little worried there. ;)

I will send you some instructions and example data to test with; our 
network is down due to an unexpected power outage, so it won't be today.

Burlen

On 06/08/2012 07:25 AM, Stephan Rogge wrote:
> Someone told me that you have to clear your build directory completely and
> start a fresh PV build.
>
> Stephan
>
> -----Original Message-----
> From: burlen [mailto:burlen.loring at gmail.com]
> Sent: Friday, June 8, 2012 16:21
> To: Stephan Rogge
> Cc: 'Yuanxin Liu'; paraview at paraview.org
> Subject: Re: [Paraview] Parallel Streamtracer
>
> Hi Stephan,
>
> Oh, thanks for the update, I wasn't aware of these changes. I have been
> working with 3.14.1.
>
> Burlen
>
> On 06/08/2012 01:47 AM, Stephan Rogge wrote:
>> Hello Burlen,
>>
>> thank you very much for your post. I would really like to test your
>> plugin, so I've started to build it. Unfortunately, I got a lot of
>> compiler errors (e.g. vtkstd isn't used in PV master anymore). Which
>> PV version is the base for your plugin?
>>
>> Regards,
>> Stephan
>>
>> -----Original Message-----
>> From: Burlen Loring [mailto:bloring at lbl.gov]
>> Sent: Thursday, June 7, 2012 17:54
>> To: Stephan Rogge
>> Cc: 'Yuanxin Liu'; paraview at paraview.org
>> Subject: Re: [Paraview] Parallel Streamtracer
>>
>> Hi Stephan,
>>
>> I experienced the scaling behavior you report when I was working on a
>> project that required generating millions of streamlines for a
>> topological mapping algorithm interactively in ParaView. To get the
>> required scaling I wrote a stream tracer that uses a load-on-demand
>> approach with a tunable block cache, so that all ranks can integrate
>> any streamline and stay busy throughout the entire computation. It was
>> very effective on our data, and I've used it to integrate 30 million
>> streamlines in about 10 min on 256 cores. If you really need better
>> scalability than the distributed-data tracing approach implemented in
>> PV, you might take a look at our work. The downside of our approach is
>> that, in order to provide the demand loading, the reader has to
>> implement a VTK object that provides an API giving the integrator
>> direct access to I/O functionality. In case you're interested, the
>> stream tracer class is vtkSQFieldTracer and our reader is
>> vtkSQBOVReader. The latest release can be found here:
>> https://github.com/burlen/SciberQuestToolKit/tarball/SQTK-20120531
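>>
>> Once it builds, the plugin can be loaded in pvpython with something
>> like the following (rough sketch; the library path below is only a
>> placeholder, and the proxy names the plugin registers are not listed
>> here):
>>
>>     from paraview.simple import LoadPlugin
>>
>>     # load on both client and server; after this the plugin's readers and
>>     # filters (built around vtkSQBOVReader / vtkSQFieldTracer) show up as
>>     # new proxies in the GUI menus and in the paraview.simple namespace
>>     LoadPlugin('/path/to/libSciberQuestToolKit.so', remote=True, ns=globals())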
>>
>> Burlen
>>
>> On 06/04/2012 02:21 AM, Stephan Rogge wrote:
>>> Hello Leo,
>>>
>>> OK, I took the "disk_out_ref.ex2" example data set and did some time
>>> measurements. Remember, my machine has 4 cores + Hyper-Threading.
>>>
>>> My first observation is that PV seems to have a problem with
>>> distributing the data when the Multi-Core option (GUI) is enabled.
>>> When PV is started with builtin Multi-Core, I was not able to apply a
>>> stream tracer with more than 1000 seed points (PV freezes and never
>>> comes back). However, when the pvserver processes were started
>>> manually, I was able to use up to 100,000 seed points. Is this a bug?
>>> Now let's have a look at the scaling performance. As you suggested,
>>> I've used the D3 filter to distribute the data across the processes.
>>> The stream tracer execution times for 10,000 seed points:
>>>
>>> ##   Builtin: 10.063 seconds
>>> ##   1 MPI-Process (no D3): 10.162 seconds
>>> ##   4 MPI-Processes: 15.615 seconds
>>> ##   8 MPI-Processes: 14.103 seconds
>>>
>>> and 100,000 seed points:
>>>
>>> ##   Builtin: 100.603 seconds
>>> ##   1 MPI-Process (no D3): 100.967 seconds
>>> ##   4 MPI-Processes: 168.1 seconds
>>> ##   8 MPI-Processes: 171.325 seconds
>>>
>>> I cannot see any positive scaling behavior here. Maybe this example
>>> is not appropriate for scaling measurements?
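>>>
>>> For reference, the setup can be reproduced with a pvpython script
>>> along these lines (untested sketch; the 'V' array and the 'Point
>>> Source' property names are what I would expect for disk_out_ref.ex2
>>> and may differ slightly between ParaView versions):
>>>
>>>     from paraview.simple import OpenDataFile, D3, StreamTracer
>>>     import time
>>>
>>>     # run in the builtin Python shell, or while connected to pvserver
>>>     # processes started with e.g.:  mpiexec -n 4 pvserver
>>>     reader = OpenDataFile('disk_out_ref.ex2')
>>>     reader.PointVariables = ['V']          # make sure the velocity field is loaded
>>>     d3 = D3(Input=reader)                  # redistribute the data across the ranks
>>>
>>>     tracer = StreamTracer(Input=d3, SeedType='Point Source')
>>>     tracer.Vectors = ['POINTS', 'V']
>>>     tracer.SeedType.NumberOfPoints = 10000   # or 100000
>>>     tracer.SeedType.Radius = 3.0             # seed sphere roughly covering the domain
>>>
>>>     t0 = time.time()
>>>     tracer.UpdatePipeline()
>>>     print('stream tracer took %.3f seconds' % (time.time() - t0))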
>>>
>>> One more thing: I've visualized the vtkProcessId array and saw that
>>> the whole vector field is partitioned. I thought that each streamline
>>> was integrated in its own process, but it seems that this is not the
>>> case. This could explain my scaling issues: for small vector fields,
>>> the synchronization overhead becomes too large and decreases the
>>> overall performance.
>>> My suggestion is to have a parallel StreamTracer that is built for a
>>> single machine with several threads. Could it be worthwhile to
>>> randomly distribute the seeds over all available (local) processes?
>>> Of course, each process would have access to the whole vector field.
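>>>
>>> Just to illustrate the idea (this is not existing ParaView
>>> functionality, only a rough sketch; the file name and seed positions
>>> are placeholders): every process loads the full field and integrates
>>> only its own share of the seeds.
>>>
>>>     # run with e.g.:  mpiexec -n 8 python trace_seeds.py
>>>     from mpi4py import MPI
>>>     import vtk
>>>
>>>     comm = MPI.COMM_WORLD
>>>     rank, size = comm.Get_rank(), comm.Get_size()
>>>
>>>     # every rank reads the whole (small) vector field
>>>     reader = vtk.vtkXMLStructuredGridReader()
>>>     reader.SetFileName('field.vts')
>>>     reader.Update()
>>>
>>>     # round-robin split of the seed points over the local processes
>>>     all_seeds = [(0.1 * i, 0.0, 0.0) for i in range(1500)]
>>>     seeds = vtk.vtkPoints()
>>>     for i, p in enumerate(all_seeds):
>>>         if i % size == rank:
>>>             seeds.InsertNextPoint(p)
>>>     seed_poly = vtk.vtkPolyData()
>>>     seed_poly.SetPoints(seeds)
>>>
>>>     # vtkStreamTracer integrates the grid's active vector array by default
>>>     tracer = vtk.vtkStreamTracer()
>>>     tracer.SetInputConnection(reader.GetOutputPort())
>>>     tracer.SetSourceData(seed_poly)      # SetSource() in older VTK
>>>     tracer.SetIntegrationDirectionToBoth()
>>>     tracer.Update()
>>>     # each rank now holds the streamlines for its own seeds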
>>>
>>> Cheers,
>>> Stephan
>>>
>>>
>>>
>>> From: Yuanxin Liu [mailto:leo.liu at kitware.com]
>>> Sent: Friday, June 1, 2012 16:13
>>> To: Stephan Rogge
>>> Cc: Andy Bauer; paraview at paraview.org
>>> Subject: Re: [Paraview] Parallel Streamtracer
>>>
>>> Hi, Stephan,
>>>      I did measure the performance at some point and was able to get
>>> fairly decent speed up with more processors. So I am surprised you
>>> are seeing huge latency.
>>>
>>>      Of course, the performance is sensitive to the input.  It is also
>>> sensitive to how readers distribute data. So, one thing you might
>>> want to try is to attach the "D3" filter to the reader.
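>>>
>>>      In the Python shell, attaching D3 would look something like this
>>> (the file name is just a placeholder):
>>>
>>>          from paraview.simple import OpenDataFile, D3
>>>          reader = OpenDataFile('your_data.ex2')
>>>          redistributed = D3(Input=reader)   # feed this into the stream tracer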
>>>
>>>      If that doesn't help, I will be happy to get your data and take a look.
>>> Leo
>>>
>>> On Fri, Jun 1, 2012 at 1:54 AM, Stephan
>>> Rogge<Stephan.Rogge at tu-cottbus.de>
>>> wrote:
>>> Leo,
>>>
>>> As I mentioned in my initial post in this thread, I used the
>>> up-to-date master branch of ParaView, which means I have already used
>>> your implementation.
>>>
>>> I can imagine that parallelizing this algorithm can be very tough, and
>>> I can see that distributing the calculation over 8 processes does not
>>> lead to nice scaling.
>>>
>>> But I don't understand the huge latency when using the StreamTracer
>>> in CAVE mode with two viewports and two pvserver processes on the same
>>> machine (plus an extra machine for the client). I guess the tracer
>>> filter is applied to each viewport separately? That would be OK as
>>> long as both filter executions run in parallel, and I doubt that this
>>> is the case.
>>> Can you help clarify my problem?
>>>
>>> Regards,
>>> Stephan
>>>
>>>
>>> From: Yuanxin Liu [mailto:leo.liu at kitware.com]
>>> Sent: Thursday, May 31, 2012 21:33
>>> To: Stephan Rogge
>>> Cc: Andy Bauer; paraview at paraview.org
>>> Subject: Re: [Paraview] Parallel Streamtracer
>>>
>>> It is in the current VTK and ParaView master.  The class is
>>> vtkPStreamTracer.
>>>
>>> Leo
>>> On Thu, May 31, 2012 at 3:31 PM, Stephan
>>> Rogge<stephan.rogge at tu-cottbus.de>
>>> wrote:
>>> Hi, Andy and Leo,
>>>
>>> thanks for your replies.
>>>
>>> Is it possible to get this new implementation? I would like to give it a try.
>>>
>>> Regards,
>>> Stephan
>>>
>>> On May 31, 2012, at 17:48, Yuanxin Liu <leo.liu at kitware.com> wrote:
>>> Hi, Stephan,
>>>       The previous implementation only has serial performance:  It
>>> traces the streamlines one at a time and never starts a new
>>> streamline until the previous one finishes.  With communication
>>> overhead, it is not surprising it got slower.
>>>
>>>      My new implementation is able to let the processes work on
>>> different streamlines simultaneously and should scale much better.
>>>
>>> Leo
>>>
>>> On Thu, May 31, 2012 at 11:27 AM, Andy Bauer <andy.bauer at kitware.com>
>>> wrote:
>>> Hi Stephan,
>>>
>>> The parallel stream tracer uses the partitioning of the grid to
>>> determine which process does the integration. When a streamline exits
>>> the subdomain of a process, there is a search to see whether it enters
>>> a subdomain assigned to any other process, before figuring out whether
>>> it has left the entire domain.
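>>>
>>> A toy, purely serial illustration of that hand-off logic (not the
>>> actual vtkPStreamTracer code; the domain here is just the interval
>>> [0, 4) split into four blocks, and a "streamline" is a point marching
>>> with a fixed step):
>>>
>>>     def owner(x, nblocks=4):
>>>         # which block (process) contains x; None means outside the domain
>>>         return int(x) if 0.0 <= x < nblocks else None
>>>
>>>     def trace(seed, step=0.3, nblocks=4):
>>>         x, path, rank = seed, [seed], owner(seed, nblocks)
>>>         while rank is not None:
>>>             # integrate while the point stays in the current block
>>>             while owner(x, nblocks) == rank:
>>>                 x += step
>>>                 path.append(x)
>>>             # the point left the block: search for the block that now
>>>             # owns it; if none does, it has left the entire domain
>>>             rank = owner(x, nblocks)
>>>         return path
>>>
>>>     print(trace(0.5))   # hand-offs happen near x = 1, 2 and 3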
>>>
>>> Leo, copied here, has been improving the streamline implementation
>>> inside of VTK so you may want to get his newer version. It is a
>>> pretty tough algorithm to parallelize efficiently without making any
>>> assumptions on the flow or partitioning.
>>>
>>> Andy
>>>
>>> On Thu, May 31, 2012 at 4:16 AM, Stephan
>>> Rogge<Stephan.Rogge at tu-cottbus.de>
>>> wrote:
>>> Hello,
>>>
>>> I have a question related to the parallelism of the stream tracer: as
>>> I understand the code, each line integration (trace) is processed in
>>> its own MPI process. Right?
>>>
>>> To test the scalability of the stream tracer, I loaded a structured
>>> (curvilinear) grid, applied the filter with a seed resolution of 1500,
>>> and checked the timings in single-process and multi-process (Multi
>>> Core enabled in the PV GUI) situations.
>>>
>>> I was really surprised that multi-core slows the execution down to 4
>>> seconds; a single core takes only 1.2 seconds. Data migration cannot
>>> be the explanation for that behavior (0.5 seconds). What is the
>>> problem here? Please see some statistics attached...
>>>
>>> Data:
>>> * Structured (Curvilinear) Grid
>>> * 244030 Cells
>>> * 37 MB Memory
>>>
>>> System:
>>> * Intel i7-2600K (4 Cores + HT = 8 Threads)
>>> * 16 GB Ram
>>> * Windows 7 64 Bit
>>> * ParaView (master-branch, 64 bit compilation)
>>>
>>> #################################
>>> Single Thread (Seed resolution 1500):
>>> #################################
>>>
>>> Local Process
>>> Still Render,  0.014 seconds
>>> RenderView::Update,  1.222 seconds
>>>       vtkPVView::Update,  1.222 seconds
>>>           Execute vtkStreamTracer id: 2184,  1.214 seconds
>>> Still Render,  0.015 seconds
>>>
>>> #################################
>>> Eight Threads (Seed resolution 1500):
>>> #################################
>>>
>>> Local Process
>>> Still Render,  0.029 seconds
>>> RenderView::Update,  4.134 seconds
>>> vtkSMDataDeliveryManager: Deliver Geome,  0.619 seconds
>>>       FullRes Data Migration,  0.619 seconds
>>> Still Render,  0.042 seconds
>>>       OpenGL Dev Render,  0.01 seconds
>>>
>>>
>>> Render Server, Process 0
>>> RenderView::Update,  4.134 seconds
>>>       vtkPVView::Update,  4.132 seconds
>>>           Execute vtkStreamTracer id: 2193,  3.941 seconds
>>> FullRes Data Migration,  0.567 seconds
>>>       Dataserver gathering to 0,  0.318 seconds
>>>       Dataserver sending to client,  0.243 seconds
>>>
>>> Render Server, Process 1
>>> Execute vtkStreamTracer id: 2193,  3.939 seconds
>>>
>>> Render Server, Process 2
>>> Execute vtkStreamTracer id: 2193,  3.938 seconds
>>>
>>> Render Server, Process 3
>>> Execute vtkStreamTracer id: 2193,  4.12 seconds
>>>
>>> Render Server, Process 4
>>> Execute vtkStreamTracer id: 2193,  3.938 seconds
>>>
>>> Render Server, Process 5
>>> Execute vtkStreamTracer id: 2193,  3.939 seconds
>>>
>>> Render Server, Process 6
>>> Execute vtkStreamTracer id: 2193,  3.938 seconds
>>>
>>> Render Server, Process 7
>>> Execute vtkStreamTracer id: 2193,  3.939 seconds
>>>
>>> Cheers,
>>> Stephan
>>>
>>>


