[Insight-developers] itk performance numbers
Rupert Brooks
rupert.brooks at gmail.com
Thu Jul 26 14:02:06 EDT 2012
OK, that makes way more sense; sorry I didn't understand the first time around.
Just so I've got it right:
Threads   3.20       4.2+patch   4.2 as percent of 3.20
1         0.347567   0.383342    110.293%
2         0.300869   0.335328    111.453%
4         0.348677   0.315688     90.5388%
8         0.182681   0.192132    105.173%
So there's about 10% more time with ITK 4.2 in the 1- and 2-thread cases.
That is definitely better than what we were getting. Cool.
Rupert
--------------------------------------------------------------
Rupert Brooks
rupert.brooks at gmail.com
On Thu, Jul 26, 2012 at 1:13 PM, Bradley Lowekamp <blowekamp at mail.nih.gov>wrote:
> Sorry for not being clear! I got too excited by finding the solution to
> the performance issue with ITKv3 registration in ITKv4.
>
> The first is vanilla 3.20, the second is 4.2 + the gerrit patch. The
> third is the gerrit patch with the pre-allocation of the Jacobian outside the
> threaded section! Vanilla 4.2 is ~2x 3.20 for this test on my system too.
>
> Summary for the MeanSquares metric in your test:
>
> 3.20: 1X
> 4.2: 2+X
> 4.2+gerrit patch: 1X
> 4.2+gerrit patch + single-threaded preallocation of jacobian: 1.5X
>
> On Jul 26, 2012, at 12:48 PM, Rupert Brooks wrote:
>
> Brad,
>
> Am I reading this right? The second set of numbers is vanilla ITK 4.2
> and the third set is with your patch?
>
>
> no.
>
>
> a) Single-threaded: allocating outside the threaded part made the mean
> time go up from 0.38 to 0.40, and it seems to be a real effect.
>
>
> That is not a significant change compared to the effect of CPU temperature
> and other outside influences.
>
> b) After moving allocation outside the threads, the multithreaded performance
> is worse both absolutely and in the serial/parallel ratio.
>
> Weird.
>
> I also note that going from 4 to 8 threads seems to get you a greater
> ratio of improvement than going from 1 to 2, or 2 to 4, especially with
> the vanilla ITK 4.2. Generally parallel improvement drops off with the
> number of threads, because you are only subdividing the parallel part of the
> algorithm. I have found in the past that this sort of weirdness can occur
> on NUMA systems depending on whether the OS puts multiple threads on the
> same physical chip or not.
>
>
> That effect is not on vanilla 4.2, it's on the gerrit-patched build. I agree
> that on multi-socket, multi-core systems with hyper-threading, oddities due
> to the location of caches frequently occur.
>
>
> Are you really fortunate enough to have 288 cores in that Mac, or is it
> more like 3 chips at 8 cores each or something?
>
>
> I have a dual-Xeon Mac Pro. It has 2 physical CPUs, each with 6 cores,
> which can be hyper-threaded.
>
>
> My personal tendency would be to try to understand why the single-thread
> performance got so much worse, then (if relevant) try to fix the
> threading. ITK 3.20 doesn't do very well with increasing numbers of
> threads either.
>
>
> The single-threaded performance is worse because there is a malloc
> occurring for every sample. My patch addressed that.
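For readers of the archive, a minimal sketch of the idea being described, with hypothetical names (this is not the actual gerrit change): each worker sizes its Jacobian buffer once before the sample loop and then reuses it, so no malloc occurs per sample.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for itk::Array2D; names are illustrative only.
using JacobianType = std::vector<double>;

struct ThreadState {
  JacobianType jacobian; // allocated once, reused for every sample
};

double ProcessSamples(std::size_t numSamples, std::size_t jacobianSize)
{
  ThreadState state;
  state.jacobian.resize(jacobianSize); // single allocation, up front

  double accum = 0.0;
  for (std::size_t i = 0; i < numSamples; ++i) {
    // The transform writes into the existing buffer; no malloc per sample.
    for (std::size_t j = 0; j < jacobianSize; ++j) {
      state.jacobian[j] = static_cast<double>(j + 1);
    }
    accum += state.jacobian[0];
  }
  return accum;
}
```

Each thread owns its own `ThreadState`, so the reuse needs no locking.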
>
>
> Just to reiterate where I'm coming from: I want to show that, accuracy-
> and performance-wise, ITK4 is at least no worse than ITK3. That justifies
> upgrading. In the longer run, it may make more sense to rewrite old ITK
> code to use the new metrics, but that is a separate project.
>
>
> My intention was to demonstrate that with the patch in gerrit the
> performance is about the same as ITKv3.
>
> Brad
>
>
> Rupert
> --------------------------------------------------------------
> Rupert Brooks
> rupert.brooks at gmail.com
>
>
>
> On Thu, Jul 26, 2012 at 11:53 AM, Bradley Lowekamp <blowekamp at mail.nih.gov
> > wrote:
>
>> Hello,
>>
>> Well I did get to it before you:
>>
>> http://review.source.kitware.com/#/c/6614/
>>
>> I also upped the size of the image 100x in your test; here is the current
>> performance on my system:
>>
>> System: victoria.nlm.nih.gov
>> Processor: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
>> Serial #:
>> Cache: 32768
>> Clock: 2794.27
>> Cores: 12 cpus x 24 Cores = 288
>> OSName: Mac OS X
>> Release: 10.6.8
>> Version: 10K549
>> Platform: x86_64
>> Operating System is 64 bit
>> ITK Version: 3.20.1
>> Virtual Memory: Total: 256 Available: 228
>> Physical Memory: Total:65536 Available: 58374
>> Probe Name              Count  Min       Mean      Stdev       Max       Total
>> MeanSquares_1_threads   20     0.344348  0.347567  0.00244733  0.352629  6.95134
>> MeanSquares_2_threads   20     0.251223  0.300869  0.0179305   0.321404  6.01738
>> MeanSquares_4_threads   20     0.215516  0.348677  0.173645    0.678274  6.97355
>> MeanSquares_8_threads   20     0.138184  0.182681  0.0297812   0.237129  3.65362
>> System: victoria.nlm.nih.gov
>> Processor:
>> Serial #:
>> Cache: 32768
>> Clock: 2930
>> Cores: 12 cpus x 24 Cores = 288
>> OSName: Mac OS X
>> Release: 10.6.8
>> Version: 10K549
>> Platform: x86_64
>> Operating System is 64 bit
>> ITK Version: 4.2.0
>> Virtual Memory: Total: 256 Available: 228
>> Physical Memory: Total:65536 Available: 58371
>> Probe Name              Count  Min       Mean      Stdev       Max       Total
>> MeanSquares_1_threads   20     0.382481  0.383342  0.00186954  0.391027  7.66685
>> MeanSquares_2_threads   20     0.211908  0.335328  0.0777408   0.435574  6.70655
>> MeanSquares_4_threads   20     0.271531  0.315688  0.0390751   0.385683  6.31377
>> MeanSquares_8_threads   20     0.147544  0.192132  0.0299427   0.240976  3.84263
>>
>>
>> In the patch provided, the allocation is implicitly done on assignment on a
>> per-thread basis. What was most unexpected was that when the allocation of
>> the Jacobian was explicitly done outside the threaded part, the time went up
>> by 50%! I presume that allocating the doubles sequentially in the master
>> thread placed them next to each other, which may be a more insidious form of
>> false sharing. Below are the numbers from this run; notice the lack of
>> speed-up with more threads:
>>
>> System: victoria.nlm.nih.gov
>> Processor:
>> Serial #:
>> Cache: 32768
>> Clock: 2930
>> Cores: 12 cpus x 24 Cores = 288
>> OSName: Mac OS X
>> Release: 10.6.8
>> Version: 10K549
>> Platform: x86_64
>> Operating System is 64 bit
>> ITK Version: 4.2.0
>> Virtual Memory: Total: 256 Available: 226
>> Physical Memory: Total:65536 Available: 57091
>> Probe Name              Count  Min       Mean      Stdev       Max       Total
>> MeanSquares_1_threads   20     0.403931  0.40648   0.00213043  0.41389   8.1296
>> MeanSquares_2_threads   20     0.243789  0.367603  0.0894637   0.65006   7.35206
>> MeanSquares_4_threads   20     0.281336  0.354749  0.0431082   0.440161  7.09497
>> MeanSquares_8_threads   20     0.24615   0.301576  0.0552998   0.446528  6.03151
>>
>>
>> Brad
>>
>>
>> On Jul 26, 2012, at 8:56 AM, Rupert Brooks wrote:
>>
>> Brad,
>>
>> The false sharing issue is a good point; however, I don't think it is
>> the cause of the performance degradation. This part of the class
>> (m_Threader, etc.) has not changed since 3.20. (I used the optimized
>> metrics in my 3.20 builds, so it's in Review/itkOptMeanSquares....) It also
>> does not explain the performance drop in single-threaded mode.
>>
>> Testing will tell... Seems like a Friday afternoon project to me, unless
>> someone else gets there first.
>>
>> Rupert
>>
>> --------------------------------------------------------------
>> Rupert Brooks
>> rupert.brooks at gmail.com
>>
>>
>>
>> On Wed, Jul 25, 2012 at 5:18 PM, Bradley Lowekamp <blowekamp at mail.nih.gov
>> > wrote:
>>
>>> Hello,
>>>
>>> Continuing to glance at the class.... I also see the following member
>>> variables for the MeanSquares class:
>>>
>>> MeasureType * m_ThreaderMSE;
>>> DerivativeType *m_ThreaderMSEDerivatives;
>>>
>>> These are indexed by the thread ID and accessed simultaneously across
>>> the threads, which creates the potential for false sharing. That can be
>>> a MAJOR problem with threaded algorithms.
>>>
>>> I would think a good solution would be to create a per-thread data
>>> structure consisting of the Jacobian, MeasureType, and DerivativeType, plus
>>> padding to prevent false sharing, or equivalently assigning maximum data
>>> alignment to the structure.
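A minimal sketch of what such a padded per-thread structure might look like, using `alignas`; the field names are hypothetical and plain doubles stand in for the ITK types:

```cpp
#include <cstddef>

// A typical x86 cache line is 64 bytes; this is an assumption, not something
// the compiler guarantees for every target.
constexpr std::size_t kCacheLineSize = 64;

// Group each thread's accumulators together and align the struct to a cache
// line, so perThread[i] and perThread[i + 1] never share a line.
struct alignas(kCacheLineSize) PerThreadData {
  double measure = 0.0;       // stand-in for the m_ThreaderMSE slot
  double derivative[8] = {};  // stand-in for DerivativeType storage
  // alignas pads the struct out to a multiple of 64 bytes, eliminating
  // false sharing between adjacent threads' slots.
};

static_assert(sizeof(PerThreadData) % kCacheLineSize == 0,
              "each slot must occupy whole cache lines");
```

An array of `PerThreadData`, one element per thread, then replaces the separate parallel arrays.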
>>>
>>> Rupert, would you like to take a stab at this fix?
>>>
>>> Brad
>>>
>>>
>>> On Jul 25, 2012, at 4:31 PM, Rupert Brooks wrote:
>>>
>>> Sorry if this repeats; I just got a bounce from Insight-developers, so
>>> I'm trimming the message and resending....
>>> --------------------------------------------------------------
>>> Rupert Brooks
>>> rupert.brooks at gmail.com
>>>
>>>
>>>
>>> On Wed, Jul 25, 2012 at 4:12 PM, Rupert Brooks <rupert.brooks at gmail.com>wrote:
>>>
>>>> Aha. Here's around line 183 of itkTranslationTransform:
>>>>
>>>> // Compute the Jacobian in one position
>>>> template <class TScalarType, unsigned int NDimensions>
>>>> void
>>>> TranslationTransform<TScalarType, NDimensions>
>>>> ::ComputeJacobianWithRespectToParameters(const InputPointType &,
>>>>                                          JacobianType & jacobian) const
>>>> {
>>>>   // the Jacobian is constant for this transform, and it has already been
>>>>   // initialized in the constructor, so we just need to return it here.
>>>>   jacobian = this->m_IdentityJacobian;
>>>>   return;
>>>> }
>>>>
>>>> That's probably the culprit, although the root cause may be the
>>>> reallocation of the Jacobian every time through the loop.
>>>> Rupert
>>>>
>>>> <snipped>
>>>>
>>>
>>>
>>
>> ========================================================
>> Bradley Lowekamp
>> Medical Science and Computing for
>> Office of High Performance Computing and Communications
>> National Library of Medicine
>> blowekamp at mail.nih.gov
>>
>>
>>
>>
>
> ========================================================
> Bradley Lowekamp
> Medical Science and Computing for
> Office of High Performance Computing and Communications
> National Library of Medicine
> blowekamp at mail.nih.gov
>