[ITK] [ITK-dev] New Highly Parallel Build System, the POWER8

Bradley Lowekamp blowekamp at mail.nih.gov
Fri Apr 24 09:34:56 EDT 2015


Hello Chuck,

Thanks for running and posting those performance numbers. Sadly, it seems like 1:1 is most often the most efficient use of CPU cycles.

It's interesting to see how this architecture scales with a large number of processes, even though each core seems designed for 8 lighter-weight threads.

I was hoping to run a similar performance test on lhcp-rh6, which has 80 virtual cores (4 sockets, each with 10 cores plus hyper-threading). Unfortunately, I need to get more familiar with ninja, as my timing results appear to be from cached compilations rather than from actually running the compiler.

I hope we are able to improve ITK threading performance with this system. But given how finicky this type of performance tuning is, and since I don't have direct access to the system to easily run performance analysis, I am a little unclear on how best to utilize it.

Thanks!
Brad

On Apr 23, 2015, at 6:21 PM, Chuck Atkins <chuck.atkins at kitware.com> wrote:

> In case anybody's interested, here's the "spread_numa.sh" script I use to evenly distribute across NUMA domains and bind to CPU cores:
> 
> ----------BEGIN spread_numa.sh----------
> #!/bin/bash
> 
> # Evenly spread a command across numa domains for a given number of CPU cores
> function spread()
> {
>   NUM_CORES=$1
>   shift
> 
>   # Use this wicked awk script to parse the numactl hardware layout and
>   # select an equal number of cores from each NUMA domain, evenly spaced
>   # across each domain
>   SPREAD="$(numactl -H | sed -n 's|.*cpus: \(.*\)|\1|p' | awk -v NC=${NUM_CORES} -v ND=${NUMA_DOMAINS} 'BEGIN{CPD=NC/ND} {S=NF/CPD; for(C=0;C<CPD;C++){F0=C*S; F1=(F0==int(F0)?F0:int(F0)+1)+1; printf("%d", $F1); if(!(NR==ND && C==CPD-1)){printf(",")} } }')"
> 
>   echo Executing: numactl --physcpubind=${SPREAD} "$@"
>   numactl --physcpubind=${SPREAD} "$@"
> }
> 
> # Check command arguments
> if [ $# -lt 2 ]
> then
>   echo "Usage: $0 [NUM_CORES_TO_USE] [cmd [arg1] ... [argn]]"
>   exit 1
> fi
> 
> # Determine the total number of CPU cores
> MAX_CORES=$(numactl -s | sed -n 's|physcpubind: \(.*\)|\1|p' | wc -w)
> 
> # Determine the total number of NUMA domains
> NUMA_DOMAINS=$(numactl -H | sed -n 's|available: \([0-9]*\).*|\1|p')
> 
> # Verify the number of cores is sane
> NUM_CORES=$1
> shift
> if [ $NUM_CORES -gt $MAX_CORES ]
> then
>   echo "WARNING: $NUM_CORES cores is out of bounds.  Setting to $MAX_CORES cores."
>   NUM_CORES=$MAX_CORES
> fi
> if [ $((NUM_CORES%NUMA_DOMAINS)) -ne 0 ]
> then
>   TMP=$(( ((NUM_CORES/NUMA_DOMAINS) + 1) * NUMA_DOMAINS ))
>   echo "WARNING: $NUM_CORES core(s) are not evenly divided across $NUMA_DOMAINS NUMA domains.  Setting to $TMP."
>   NUM_CORES=$TMP
> fi
> 
> echo "Using ${NUM_CORES}/${MAX_CORES} cores across ${NUMA_DOMAINS} NUMA domains"
> 
> spread ${NUM_CORES} "$@"
> ----------END spread_numa.sh----------
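
To see what the core-selection awk one-liner in the script actually picks, here is a small standalone demo that feeds it a canned two-node topology in the same format "numactl -H" prints. The topology below is hypothetical (2 NUMA nodes, 8 cores each), so this runs anywhere, with no numactl required:

```shell
#!/bin/sh
# Hypothetical 2-node, 16-core layout, mimicking "numactl -H" output.
TOPOLOGY='available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 1 cpus: 8 9 10 11 12 13 14 15'

NUM_CORES=4      # cores requested
NUMA_DOMAINS=2   # parsed from the "available:" line in the real script

# Same sed/awk pipeline as spread_numa.sh, applied to the canned topology:
# CPD cores per domain, stride S across each domain's cpu list.
SPREAD="$(printf '%s\n' "$TOPOLOGY" \
  | sed -n 's|.*cpus: \(.*\)|\1|p' \
  | awk -v NC=${NUM_CORES} -v ND=${NUMA_DOMAINS} \
      'BEGIN{CPD=NC/ND} {S=NF/CPD; for(C=0;C<CPD;C++){F0=C*S; F1=(F0==int(F0)?F0:int(F0)+1)+1; printf("%d", $F1); if(!(NR==ND && C==CPD-1)){printf(",")} } }')"

echo "$SPREAD"   # prints: 0,4,8,12
```

So asking for 4 cores picks 2 per domain, evenly spaced: cores 0 and 4 from node 0, and 8 and 12 from node 1 — the list that ends up in "numactl --physcpubind=".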
> 
> 
> - Chuck
> 
> On Thu, Apr 23, 2015 at 4:57 PM, Chuck Atkins <chuck.atkins at kitware.com> wrote:
> (re-sent for the rest of the dev list)
> Hi Bradley,
> 
> It's pretty fast. The interesting numbers are for 20, 40, 80, and 160.  That aligns with 1:1, 2:1, 4:1, and 8:1 threads to core ratio.  Starting from the already configured ITKLinuxPOWER8 currently being built, I did a ninja clean and then "time ninja -jN".  Watching the cpu load for 20, 40, and 80 cores though, I see a fair amount of both process migration and unbalanced thread distribution, i.e. for -j20 I'll often see 2 cores with 6 or 8 threads and the rest with only 1 or 2.  So in addition to the -jN settings, I also ran 20, 40, and 80 threads using numactl with fixed binding to physical CPU cores to evenly distribute the threads across cores and prevent thread migration.  See timings below in seconds:
> 
> Threads          Real      User       Sys       Total CPU Time
> 20               1037.097  19866.685   429.796  20296.481
> 20 (NUMA bind)    915.910  16290.589   319.017  16609.606
> 40                713.772  26953.663   556.960  27510.623
> 40 (NUMA bind)    641.924  22442.685   432.379  22875.064
> 80                588.357  40970.439   822.944  41793.383
> 80 (NUMA bind)    538.801  35366.297   637.922  36004.219
> 160               572.492  62542.901  1289.864  63832.765
> 160 (NUMA bind)   549.742  61864.666  1242.975  63107.641
> 
> 
> 
> So it seems like core binding gives us roughly a 10% wall-time improvement for most thread configurations.  And while the core-locked 4:1 clearly gave us the best wall time, looking at the total CPU time (user + sys), 1:1 looks to be the most efficient in actual cycles used.
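
For reference, the wall-clock improvement from binding can be computed straight from the "Real" column of the table above; a quick sketch:

```shell
#!/bin/sh
# Percent improvement of the NUMA-bound run over the unbound run,
# from the "Real" (wall-clock seconds) column of the timing table.
speedup() { awk -v a="$1" -v b="$2" 'BEGIN{printf "%.1f%%\n", 100*(a-b)/a}'; }

speedup 1037.097 915.910   # 20 threads  -> 11.7%
speedup  713.772 641.924   # 40 threads  -> 10.1%
speedup  588.357 538.801   # 80 threads  -> 8.4%
speedup  572.492 549.742   # 160 threads -> 4.0%
```

So the gain is around 8–12% up through 4:1, and tails off at 8:1, where the cores are saturated either way.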
> 
> It's interesting to watch how the whole system gets used for most of the build, but everything periodically gets gated on a handful of linker processes.  And of course, it's always cool to see a screen cap of htop with a whole boatload of cores at 100%.
> 
> 
> - Chuck
> 
> On Thu, Apr 23, 2015 at 10:01 AM, Bradley Lowekamp <blowekamp at mail.nih.gov> wrote:
> Matt,
> 
> I'd love to explore the build performance of this system.
> 
> Any chance you could run clean builds of ITK on this system with 20, 40, 60, 80, 100, 120, 140, and 160 processes and record the timings?
> 
> I am very curious how this unique system scales with many heavyweight processes, as its design appears to be uniquely suited to lighter-weight multi-threading.
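
A minimal sketch of the requested sweep (a hypothetical harness, assuming an already-configured ninja build tree; the command passed to sweep below is a stand-in, not the real build invocation):

```shell
#!/bin/sh
# Run a command once per requested parallelism level, passing the level as
# the last argument.  On the real system the command would be something like:
#   sweep sh -c 'ninja clean >/dev/null && time ninja -j"$1"' timed
sweep() {
    for N in 20 40 60 80 100 120 140 160; do
        "$@" "$N"
    done
}

# Stand-in command so the sketch runs anywhere:
sweep echo "would build with -j"
```

Each clean-and-build timing would then be recorded per process count, which is all the table needs.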
> 
> Thanks,
> Brad
> 
> On Apr 22, 2015, at 11:51 PM, Matt McCormick <matt.mccormick at kitware.com> wrote:
> 
> > Hi folks,
> >
> > With thanks to Chuck Atkins and FSF France, we have a new build on the
> > dashboard [1] for the IBM POWER8 [2] system.  This is a PowerPC64
> > system with 20 cores and 8 threads per core -- a great system where we
> > can test and improve ITK parallel computing performance!
> >
> >
> > To generate a test build on Gerrit, add
> >
> >  request build: power8
> >
> > in a review's comments.
> >
> >
> > There are currently some build warnings and test failures that should
> > be addressed before we will be able to use the system effectively. Any
> > help here is appreciated.
> >
> > Thanks,
> > Matt
> >
> >
> > [1] https://open.cdash.org/index.php?project=Insight&date=2015-04-22&filtercount=1&showfilters=1&field1=site/string&compare1=63&value1=gcc112
> >
> > [2] https://en.wikipedia.org/wiki/POWER8
> > _______________________________________________
> > Powered by www.kitware.com
> >
> > Visit other Kitware open-source projects at
> > http://www.kitware.com/opensource/opensource.html
> >
> > Kitware offers ITK Training Courses, for more information visit:
> > http://kitware.com/products/protraining.php
> >
> > Please keep messages on-topic and check the ITK FAQ at:
> > http://www.itk.org/Wiki/ITK_FAQ
> >
> > Follow this link to subscribe/unsubscribe:
> > http://public.kitware.com/mailman/listinfo/insight-developers
> > _______________________________________________
> > Community mailing list
> > Community at itk.org
> > http://public.kitware.com/mailman/listinfo/community
> 
> 
> 
