[Paraview] Issues with PVSB 5.2 and OSMesa support

Michel Rasquin michel.rasquin at colorado.edu
Wed Feb 8 05:48:38 EST 2017


Hi Chuck,


Mea culpa. I forgot to set the KNOB_MAX_WORKER_THREADS env variable in my swr tests. Sorry for this oversight.


Your suspicion was right: the large virtual memory consumption was indeed due to a massive oversubscription of threads fired by swr by default. Setting KNOB_MAX_WORKER_THREADS to 1 fixed both the thread oversubscription and the memory consumption issue when the number of MPI processes equals the number of cores requested.
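
For reference, here is a minimal sketch of the fix in a job script (the mpiexec invocation is illustrative and assumes Intel MPI on a fully subscribed 24-core node; adjust to your own scheduler and core count):

    # One swr worker thread per rank, since the ranks already fill every core.
    export KNOB_MAX_WORKER_THREADS=1
    mpiexec -n 24 pvserver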


I have a last question about the performance of the swr library vs the default llvmpipe library. I have found some benchmarks of swr vs llvmpipe showing acceleration factors ranging from 30 to 50 depending on the number of tets (http://openswr.org/slides/SWR_Sept15.pdf). But it is not clear to me which part of this acceleration is due to the better threading of the swr library and which part is due to a potentially better usage of the AVX* instructions. If I disable threading in swr by setting KNOB_MAX_WORKER_THREADS to 1 (basically serial swr vs serial llvmpipe), do you know what performance gain I could still expect from a potentially better usage of the AVX instructions alone?


Thanks a lot for your help.


Best regards,


Michel




________________________________
From: Chuck Atkins <chuck.atkins at kitware.com>
Sent: Tuesday, February 7, 2017 10:03:50 PM
To: Michel Rasquin
Cc: paraview at paraview.org
Subject: Re: [Paraview] Issues with PVSB 5.2 and OSMesa support

Hi Michel,

Indeed, I built PVSB 5.2 with the Intel 2016.2.181 compilers and Intel MPI 5.1.3.181, then ran the resulting pvserver on Haswell CPU nodes (Intel E5-2680v3), which support AVX2 instructions.  So this fits exactly the known issue you mentioned in your email.

Yep, that'll do it.  The problem is due to a bug in the Intel compiler performing over-aggressive vectorized code generation.  I'm not sure whether it's fixed in >= 17, but I definitely know it's broken in <= 16.x.  GALLIUM_DRIVER=SWR is going to give you the best performance in this situation anyway, and is the recommended osmesa driver on x86_64 CPUs.
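
For example, a minimal sketch of selecting the driver at launch (the launch line is illustrative; exported environment variables are typically propagated to the ranks by Intel MPI, but check your MPI's documentation):

    # Use the OpenSWR Gallium driver instead of the default llvmpipe.
    export GALLIUM_DRIVER=swr
    mpiexec -n 24 pvserver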



Exporting the GALLIUM_DRIVER env variable to swr then leads to an interesting behavior. With the swr driver, the good news is that I can connect to my pvserver built in release mode without crashing.

Great!


As a reminder, the llvmpipe driver compiled in release mode crashes during the client/server connection, whereas the llvmpipe driver compiled in debug mode works fine.

This lines up with the issue being bad vectorization, since the compiler won't be doing most of those optimizations in a debug build.


However, our PBS scheduler quickly killed my interactive job because the virtual memory was exhausted, which was puzzling. Increasing the number of cores requested for my job and keeping some of them idle allowed me to increase the available memory, at the cost of wasted CPU resources.

I suspect the problem is a massive oversubscription of threads by swr.  The default behavior of swr is to use all available CPU cores on the node.  However, when running multiple MPI processes per node, they have no way of knowing about each other.  So if you've got 24 cores per node and run 24 pvservers, you'll end up with 24^2 = 576 rendering threads on a node; not so great.

You can control this with the KNOB_MAX_WORKER_THREADS environment variable.  Typically you'll want to set it to the number of cores per node divided by the number of processes per node your job is running.  So if your node has 24 cores and you run 24 processes per node, then set KNOB_MAX_WORKER_THREADS to 1, but if you're running 4 processes per node, then set it to 6; you get the idea.  That should address the virtual memory problem.

It's a balance, since rendering will typically perform better with fewer processes per node and more threads per process, but the filters, like Contour, parallelize at the MPI level and work better with more processes per node.  You'll need to find the right balance for your use case depending on whether it's render-heavy or pipeline-processing-heavy.
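
As a concrete sketch of that arithmetic (the variable names and mpiexec flags below are illustrative placeholders, not standard scheduler variables; substitute your actual core and process counts):

    CORES_PER_NODE=24
    PPN=4                                    # MPI processes per node
    # Split one node's cores evenly among the ranks: 24 / 4 = 6 threads each.
    export KNOB_MAX_WORKER_THREADS=$((CORES_PER_NODE / PPN))
    mpiexec -ppn $PPN pvserver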


Would you also know if this known issue with the llvmpipe driver will be fixed in PV 5.3 (granting that the swr driver should be faster on Intel CPUs, provided that it does not exhaust the available memory)?

It's actually an Intel compiler bug, not a ParaView (or even Mesa, for that matter) issue, so probably not.  It may be fixed in future releases of icc, but I wouldn't know without testing it.


- Chuck
