<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body>

<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>

<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif;" dir="ltr">

<p>Hi Chuck,</p>

<p><br>

</p>

<p>Mea culpa. I forgot to set the <span>KNOB_MAX_WORKER_THREADS </span><span style="font-size: 12pt;">env variable in my swr tests. Sorry for this oversight.</span></p>

<p><span style="font-size: 12pt;"><br>

</span></p>

<p><span style="font-size: 12pt;"></span><span style="font-size: 12pt;">Your suspicion was right, and the large virtual memory consumption was indeed due to</span><span style="font-size: 12pt;"> a massive oversubscription of threads spawned by swr by default. </span><span style="font-size: 12pt;">Setting </span><span style="font-size: 12pt;">KNOB_MAX_WORKER_THREADS

 to 1 fixed the <span>oversubscription of threads and</span> the memory consumption </span><span style="font-size: 12pt;">issue

</span><span style="font-size: 12pt;">when the </span><span style="font-size: 12pt;">number of mpi processes is </span><span style="font-size: 12pt;">equal to the number of cores requested.</span></p>

<p><span style="font-size: 12pt;"><br>

</span></p>

<p><span style="font-size: 12pt;"><span>I have a last question about the performance of the swr library vs the default llvmpipe library. </span></span><span style="font-size: 12pt;">I have

</span><span style="font-size: 12pt;">found some benchmarks of swr vs llvmpipe</span><span style="font-size: 12pt;"> showing speedup factors ranging from 30 to 50 depending on the number of tets (</span><a href="http://openswr.org/slides/SWR_Sept15.pdf" class="OWAAutoLink" id="LPlnk984363" previewremoved="true" style="font-size: 12pt;">http://openswr.org/slides/SWR_Sept15.pdf</a><span style="font-size: 12pt;">).

 B</span><span style="font-size: 12pt;">ut it </span><span style="font-size: 12pt;">is not clear to me which part of this acceleration is due to better threading</span><span style="font-size: 12pt;"> in the swr library and which part is due to potentially

 better usage of the AVX* instructions. If I do not allow threading of swr by setting <span>KNOB_MAX_WORKER_THREADS to 1 (basically serial swr vs serial llvmpipe)</span>, do you know what performance gain I could still expect by relying only on potentially

 better usage of the AVX instructions?</span><span style="font-size: 12pt;"></span></p>

<p><span style="font-size: 12pt;"><br>

</span></p>

<p><span style="font-size: 12pt;">Thanks a lot for your help.</span></p>

<p><span style="font-size: 12pt;"><br>

</span></p>

<p><span style="font-size: 12pt;">Best regards,</span></p>

<p><span style="font-size: 12pt;"><br>

</span></p>

<p><span style="font-size: 12pt;">Michel</span></p>

<p><br>

</p>


</div>

<hr style="display:inline-block;width:98%" tabindex="-1">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Chuck Atkins &lt;chuck.atkins@kitware.com&gt;<br>

<b>Sent:</b> Tuesday, February 7, 2017 10:03:50 PM<br>

<b>To:</b> Michel Rasquin<br>

<b>Cc:</b> paraview@paraview.org<br>

<b>Subject:</b> Re: [Paraview] Issues with PVSB 5.2 and OSMesa support</font>

<div> </div>

</div>

<div>

<div dir="ltr">

<div class="gmail_extra">

<div class="gmail_quote">

<div>Hi Michel, <br>

</div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

<p>Indeed, I built PVSB 5.2 with the <span>intel 2016.2.181 and intelmpi <span>5.1.3.181 compilers, then </span></span><span style="font-size:12pt">ran the resulting pvserver </span><span style="font-size:12pt">on

</span><span style="font-size:12pt">Haswell CPU nodes (</span><span style="font-size:12pt">Intel E5-2680v3), which support AVX2 instructions.  </span><span style="font-size:12pt">So this fits exactly the known issue you mentioned in your email. </span></p>

</div>

</div>

</div>

</blockquote>

<div>Yep, that'll do it.  The problem is due to a bug in the Intel compiler performing over-aggressive vectorized code generation.  I'm not sure if it's fixed in &gt;= 17 or not but I definitely know it's broken in &lt;= 16.x.  GALLIUM_DRIVER=swr is going to give

 you the best performance in this situation anyway, though, and is the recommended osmesa driver on x86_64 CPUs.<br>

</div>

<div> <br>

<br>

</div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

<p><span style="font-size:12pt"><span><span></span></span></span></p>

<p><span style="font-size:12pt"><span><span>Exporting the GALLIUM_DRIVER env variable to swr then leads to an interesting behavior. </span></span></span><span style="font-size:12pt">With the swr driver</span><span style="font-size:12pt">, the good news is that

 I can connect my pvserver built in release mode without crashing. </span><span style="font-size:12pt"></span></p>

</div>

</div>

</div>

</blockquote>

<div>Great!<br>

</div>

<div> </div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

<p><span style="font-size:12pt">As a reminder, the llvmpipe driver compiled in release mode crashes during the client/server connection, whereas the llvmpipe driver compiled in debug mode works fine.</span></p>

</div>

</div>

</div>

</blockquote>

<div>This lines up with the issue being bad vectorization since the compiler won't be doing most of those optimizations in a debug build.<br>

<br>

</div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

<p></p>

<p>However, our PBS scheduler quickly killed my interactive job because the virtual memory was exhausted, which was puzzling. <span style="font-size:12pt">Increasing the number of cores requested for my job and keeping some of them idle allowed me to increase

 the available memory at the cost of wasted CPU resources.</span></p>

</div>

</div>

</div>

</blockquote>

<div>I suspect the problem is a massive oversubscription of threads by swr.  The default behavior of swr is to use all available CPU cores on the node.  However, when running multiple MPI processes per node, they have no way of knowing about each other. 

 So if you've got 24 cores per node and run 24 pvservers, you'll end up with 24^2 = 576 rendering threads on a node; not so great.  You can control this with the KNOB_MAX_WORKER_THREADS environment variable.  Typically you'll want to set it to the number of cores per node divided by the number

 of processes per node your job is running.  So if your node has 24 cores and you run 24 processes per node, then set KNOB_MAX_WORKER_THREADS to 1, but if you're running 4 processes per node, then set it to 6; you get the idea.  That should address the virtual

 memory problem.  It's a balance since typically rendering will perform better with fewer ppn and more threads per process, but the filters, like Contour, parallelize at the MPI level and work better with a higher ppn.  You'll need to find the right balance

 for your use case depending on whether it's render-heavy or pipeline-processing heavy.<br>

</div>
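<div>A minimal sketch of the arithmetic above, in Python. The helper names (swr_worker_threads, render_threads_per_node, swr_env) are made up for illustration and are not part of swr or ParaView; only the GALLIUM_DRIVER and KNOB_MAX_WORKER_THREADS variable names come from this thread. It assumes every MPI process on a node gets an equal share of the cores, rounded down, and never less than one thread:</div>

<pre>
```python
def swr_worker_threads(cores_per_node, procs_per_node):
    """Threads each process may use: cores divided by processes per node, at least 1."""
    procs = max(1, procs_per_node)
    return max(1, cores_per_node // procs)


def render_threads_per_node(procs_per_node, threads_per_proc):
    """Total swr rendering threads running on one node."""
    return procs_per_node * threads_per_proc


def swr_env(cores_per_node, procs_per_node):
    """Environment variables to export before launching each pvserver process."""
    threads = swr_worker_threads(cores_per_node, procs_per_node)
    return {
        "GALLIUM_DRIVER": "swr",
        "KNOB_MAX_WORKER_THREADS": str(threads),
    }


if __name__ == "__main__":
    # Default behaviour described above: 24 processes, each using all 24 cores.
    print(render_threads_per_node(24, 24))
    # One thread per process when the node is full; 6 threads with 4 processes.
    print(swr_env(24, 24))
    print(swr_env(24, 4))
```
</pre>

<div>How the resulting values reach each pvserver (job script export, MPI launcher option, etc.) depends on your site setup and is not covered by this sketch.</div>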

<div><br>

 </div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

<div id="m_8550353998187298182m_3945460361337263553divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Arial,Helvetica,sans-serif" dir="ltr">

Would you also know if this known issue with the llvmpipe driver will be fixed in PV 5.3 (agreeing that the swr driver should be faster on Intel CPUs, provided that it does not exhaust the memory)?</div>

</div>

</div>

</blockquote>

<div><br>

</div>

<div>It's actually an Intel compiler bug and not a ParaView (or even Mesa for that matter) issue, so probably not.  It may be fixed in future releases of icc but I wouldn't know without testing it.<br>

</div>

<br>

<br>

</div>

<div class="gmail_quote">- Chuck<br>

</div>

<br>

</div>

</div>

</div>

</body>

</html>