<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Hello Chuck,<div><br></div><div>Thanks for running and posting those performance numbers. Sadly, it seems that 1:1 is most often the most efficient use of CPU cycles.</div><div><br></div><div>It's interesting to see how this architecture scales with a large number of processes, since each core seems designed for 8 lighter-weight threads.</div><div><br></div><div>I was hoping to run a similar performance test on lhcp-rh6, which has 80 virtual cores: 4 sockets, each with 10 cores plus hyper-threading. Unfortunately, I need more practice with ninja, as my timing results appear to come from cached compilations rather than from actually running the compiler.</div><div><br></div><div>I hope we are able to improve ITK threading performance with this system. But given the finickiness of this type of performance measurement, and without direct access to the system to easily run performance analysis, I am a little unclear on how best to utilize it.</div><div><br></div><div>Thanks!</div><div>Brad</div><div><br><div><div>On Apr 23, 2015, at 6:21 PM, Chuck Atkins <<a href="mailto:chuck.atkins@kitware.com">chuck.atkins@kitware.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr"><div>In case anybody's interested, here's the "spread_numa.sh" script I use to evenly distribute across NUMA domains and bind to CPU cores:<br><br></div><span style="font-family:monospace,monospace">----------BEGIN spread_numa.sh----------<br></span><div><span style="font-family:monospace,monospace">#!/bin/bash<br><br># Evenly spread a command across NUMA domains for a given number of CPU cores<br>function spread()<br>{<br> NUM_CORES=$1<br> shift<br><br> # Use this wicked awk script to parse the numactl hardware layout and<br> # select an equal number of cores from each NUMA domain, evenly spaced<br> # across each domain<br> SPREAD="$(numactl -H | sed -n 's|.*cpus: \(.*\)|\1|p' | awk -v NC=${NUM_CORES} -v ND=${NUMA_DOMAINS} 'BEGIN{CPD=NC/ND} {S=NF/CPD; for(C=0;C<CPD;C++){F0=C*S; F1=(F0==int(F0)?F0:int(F0)+1)+1; printf("%d", $F1); if(!(NR==ND && C==CPD-1)){printf(",")} } }')"<br><br> echo Executing: numactl --physcpubind=${SPREAD} "$@"<br> numactl --physcpubind=${SPREAD} "$@"<br>}<br><br># Check command arguments<br>if [ $# -lt 2 ]<br>then<br> echo "Usage: $0 [NUM_CORES_TO_USE] [cmd [arg1] ... [argn]]"<br> exit 1<br>fi<br><br># Determine the total number of CPU cores<br>MAX_CORES=$(numactl -s | sed -n 's|physcpubind: \(.*\)|\1|p' | wc -w)<br><br># Determine the total number of NUMA domains<br>NUMA_DOMAINS=$(numactl -H | sed -n 's|available: \([0-9]*\).*|\1|p')<br><br># Verify the number of cores is sane<br>NUM_CORES=$1<br>shift<br>if [ $NUM_CORES -gt $MAX_CORES ]<br>then<br> echo "WARNING: $NUM_CORES cores is out of bounds. Setting to $MAX_CORES cores."<br> NUM_CORES=$MAX_CORES<br>fi<br>if [ $((NUM_CORES%NUMA_DOMAINS)) -ne 0 ]<br>then<br> TMP=$(( ((NUM_CORES/NUMA_DOMAINS) + 1) * NUMA_DOMAINS ))<br> echo "WARNING: $NUM_CORES core(s) are not evenly divided across $NUMA_DOMAINS NUMA domains. Setting to $TMP."<br>
 NUM_CORES=$TMP<br>fi<br><br>echo "Using ${NUM_CORES}/${MAX_CORES} cores across ${NUMA_DOMAINS} NUMA domains"<br><br>spread ${NUM_CORES} "$@"<br>----------END spread_numa.sh----------<br></span><br></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr">- Chuck<br></div></div></div>
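<div><br></div><div>P.S. A quick usage sketch, in case it helps (the job count here is just an example; any command can stand in for ninja). Cleaning first keeps the timing honest, so you measure real compilation rather than cached results:<br><br><span style="font-family:monospace,monospace"># rebuild from scratch with 40 ninja jobs bound to 40 evenly spread cores<br>ninja clean<br>time ./spread_numa.sh 40 ninja -j40<br></span></div>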
<br><div class="gmail_quote">On Thu, Apr 23, 2015 at 4:57 PM, Chuck Atkins <span dir="ltr"><<a href="mailto:chuck.atkins@kitware.com" target="_blank">chuck.atkins@kitware.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto;"><div dir="ltr"><div>(re-sent for the rest of the dev list)<br></div><div><div class="h5"><div>Hi Bradley,<br><br></div>It's pretty fast. The interesting numbers
are for 20, 40, 80, and 160, which align with 1:1, 2:1, 4:1, and 8:1
thread-to-core ratios. Starting from the already configured
ITKLinuxPOWER8 currently being built, I did a ninja clean and then "time
ninja -jN". Watching the cpu load for 20, 40, and 80 cores though, I
see a fair amount of both process migration and unbalanced thread
distribution; e.g., for -j20 I'll often see 2 cores with 6 or 8 threads
and the rest with only 1 or 2. So in addition to the -jN settings, I
also ran 20, 40, and 80 threads using numactl with fixed binding to
physical CPU cores to evenly distribute the threads across cores and
prevent thread migration. See timings below in seconds:<br><br><table dir="ltr" style="table-layout: fixed; font-size: 13px; font-family: arial, sans, sans-serif; border-collapse: collapse; border: 1px solid rgb(204, 204, 204); position: static; z-index: auto;" border="1" cellpadding="0" cellspacing="0" height="204" width="481"><colgroup><col width="103"><col width="59"><col width="66"><col width="59"><col width="100"></colgroup><tbody><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom;text-align:right">Threads</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">Real</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">User</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">Sys</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">Total CPU Time</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom;text-align:right">20</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">1037.097</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">19866.685</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">429.796</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">20296.481</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>(Numa Bind) 20</b></td><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>915.910</b></td><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>16290.589</b></td><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>319.017</b></td><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>16609.606</b></td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom;text-align:right">40</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">713.772</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">26953.663</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">556.960</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">27510.623</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom;text-align:right">(Numa Bind) 40</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">641.924</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">22442.685</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">432.379</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">22875.064</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom;text-align:right">80</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">588.357</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">40970.439</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">822.944</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">41793.383</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>(Numa Bind) 80</b></td><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>538.801</b></td><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>35366.297</b></td><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>637.922</b></td><td style="padding:2px 3px;vertical-align:bottom;text-align:right"><b>36004.219</b></td></tr><tr style="height:21px"><td style="padding:2px 
3px;vertical-align:bottom;text-align:right">160</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">572.492</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">62542.901</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">1289.864</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">63832.765</td></tr><tr style="height:21px"><td style="padding:2px 3px;vertical-align:bottom;text-align:right">(Numa Bind) 160</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">549.742</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">61864.666</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">1242.975</td><td style="padding:2px 3px;vertical-align:bottom;text-align:right">63107.641</td></tr></tbody></table><br><div><br><br>So it seems like core binding gives us an approximate 10%
performance increase for all thread configurations. And while clearly
the core-locked 4:1 gave us the best time, looking at the total CPU time
(user+sys), the 1:1 ratio looks to be the most efficient in actual cycles
used.<br><br></div></div></div><div>It's interesting to watch how the whole system
gets fully utilized for most of the build, but everything periodically
gets gated on a handful of linker processes. And of course, it's always
cool to see a screen cap of htop with a whole boatload of cores at 100%.<div><span class="HOEnZb"><font color="#888888"><br></font></span></div></div><div class="gmail_extra"><span class="HOEnZb"><font color="#888888"><br clear="all"><div><div dir="ltr">- Chuck<br></div></div>
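<div><br></div><div>P.S. For concreteness, the fixed-binding runs use an invocation along the lines of the sketch below. The core list is illustrative only (the actual list is computed from the "numactl -H" topology of the machine), but the shape of the command is the same:<br><br><span style="font-family:monospace,monospace"># pin 20 build threads to 20 fixed physical cores, spread evenly across<br># the NUMA domains (example core numbering, not the actual layout)<br>numactl --physcpubind=0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152 ninja -j20<br></span></div>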
<br></font></span><div class="gmail_quote"><span class="">On Thu, Apr 23, 2015 at 10:01 AM, Bradley Lowekamp <span dir="ltr"><<a href="mailto:blowekamp@mail.nih.gov" target="_blank">blowekamp@mail.nih.gov</a>></span> wrote:<br></span><div><div class="h5"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Matt,<br>
<br>
I'd love to explore the build performance of this system.<br>
<br>
Any chance you could run clean builds of ITK on this system with 20, 40, 60, 80, 100, 120, 140, and 160 processes and record the timings?<br>
<br>
I am very curious how this unique system scales with multiple heavyweight processes, as its design appears to be uniquely suited to lighter-weight multi-threading.<br>
<br>
Thanks,<br>
Brad<br>
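<br>
P.S. Something like the following would collect those timings (an untested sketch; it assumes an already configured ninja build tree):<br>
<br>
for N in 20 40 60 80 100 120 140 160; do<br>
&nbsp;&nbsp;ninja clean<br>
&nbsp;&nbsp;# the shell's "time" output (and ninja's stderr) lands in the log<br>
&nbsp;&nbsp;{ time ninja -j$N ; } 2> time-j$N.log<br>
done<br>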
<div><br>
On Apr 22, 2015, at 11:51 PM, Matt McCormick <<a href="mailto:matt.mccormick@kitware.com" target="_blank">matt.mccormick@kitware.com</a>> wrote:<br>
<br>
> Hi folks,<br>
><br>
> With thanks to Chuck Atkins and FSF France, we have a new build on the<br>
> dashboard [1] for the IBM POWER8 [2] system. This is a PowerPC64<br>
> system with 20 cores and 8 threads per core -- a great system where we<br>
> can test and improve ITK parallel computing performance!<br>
><br>
><br>
> To generate a test build on Gerrit, add<br>
><br>
> request build: power8<br>
><br>
> in a review's comments.<br>
><br>
><br>
> There are currently some build warnings and test failures that should<br>
> be addressed before we can use the system effectively. Any<br>
> help here is appreciated.<br>
><br>
> Thanks,<br>
> Matt<br>
><br>
><br>
> [1] <a href="https://open.cdash.org/index.php?project=Insight&date=2015-04-22&filtercount=1&showfilters=1&field1=site/string&compare1=63&value1=gcc112" target="_blank">https://open.cdash.org/index.php?project=Insight&date=2015-04-22&filtercount=1&showfilters=1&field1=site/string&compare1=63&value1=gcc112</a><br>
><br>
> [2] <a href="https://en.wikipedia.org/wiki/POWER8" target="_blank">https://en.wikipedia.org/wiki/POWER8</a><br>
</div>
<br>
</blockquote></div></div></div><br></div></div>
</blockquote></div><br></div>
</blockquote></div><br></div></body></html>