[Insight-developers] itk performance numbers
Rupert Brooks
rupert.brooks at gmail.com
Thu Jul 26 14:02:06 EDT 2012
OK, that makes way more sense; sorry I didn't understand the first time around.
Just so I've got it right:
Threads   3.20       4.2+patch   4.2 as percent of 3.20
1         0.347567   0.383342    110.293%
2         0.300869   0.335328    111.453%
4         0.348677   0.315688     90.5388%
8         0.182681   0.192132    105.173%
So there's about 10% more time with ITK 4.2 in the 1- and 2-thread cases.
That is definitely better than what we were getting. Cool.
Rupert
--------------------------------------------------------------
Rupert Brooks
rupert.brooks at gmail.com
On Thu, Jul 26, 2012 at 1:13 PM, Bradley Lowekamp <blowekamp at mail.nih.gov>wrote:
> Sorry for not being clear! I got too excited by finding the solution to
> the performance issue with ITKv3 registration in ITKv4.
>
> The first is vanilla 3.20, the second is 4.2 + the gerrit patch. The
> third is the gerrit patch with the pre-allocation of the Jacobian outside the
> threaded section! Vanilla 4.2 is ~2x 3.20 for this test on my system too.
>
> Summary for the MeanSquares metric in your test:
>
> 3.20: 1X
> 4.2: 2+X
> 4.2+gerrit patch: 1X
> 4.2+gerrit patch + single-threaded preallocation of jacobian: 1.5X
>
> On Jul 26, 2012, at 12:48 PM, Rupert Brooks wrote:
>
> Brad,
>
> Am I reading this right? The second set of numbers is vanilla ITK 4.2
> and the third set is with your patch?
>
>
> no.
>
>
> a) Single-threaded: allocating outside the threaded part made the mean
> time go up from 0.38 to 0.40, and it seems to be a real effect.
>
>
> That is not a significant change compared to the effect of CPU temperature
> and other outside influences.
>
> b) After moving allocation outside the threads, the multithreaded performance
> is worse both absolutely and in the serial/parallel ratio.
>
> Weird.
>
> I also note that going from 4 to 8 threads seems to get you a greater
> ratio of improvement than going from 1 to 2, or 2 to 4, especially with
> the vanilla ITK 4.2. Generally parallel improvement drops off with the
> number of threads, because you are only subdividing the parallel part of the
> algorithm. I have found in the past that this sort of weirdness can occur
> on NUMA systems depending on whether the OS puts multiple threads on the
> same physical chip or not.
>
>
> That effect is not on vanilla 4.2, it's on the gerrit-patched build. I agree
> that on multi-socket, multi-core systems with hyper-threading, oddities due
> to the location of caches frequently occur.
>
>
> Are you really fortunate enough to have 288 cores in that Mac, or is it
> more like 3 chips at 8 cores each or something?
>
>
> I have a dual-Xeon Mac Pro. It has 2 physical CPUs, each with 6 cores,
> which can be hyper-threaded.
>
>
> My personal tendency would be to try to understand why the single-thread
> performance got so much worse, then (if relevant) try to fix the
> threading. ITK 3.20 doesn't do very well with increasing numbers of
> threads either.
>
>
> The single-threaded performance is worse because there is a malloc
> occurring for every sample. My patch addressed that.
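For readers of the archive, a minimal sketch of the idea being described, with hypothetical names (this is not the actual gerrit change): each worker sizes its Jacobian buffer once before the sample loop and then reuses it, so no malloc occurs per sample.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for itk::Array2D; names are illustrative only.
using JacobianType = std::vector<double>;

struct ThreadState {
  JacobianType jacobian; // allocated once, reused for every sample
};

double ProcessSamples(std::size_t numSamples, std::size_t jacobianSize)
{
  ThreadState state;
  state.jacobian.resize(jacobianSize); // single allocation, up front

  double accum = 0.0;
  for (std::size_t i = 0; i < numSamples; ++i) {
    // The transform writes into the existing buffer; no malloc per sample.
    for (std::size_t j = 0; j < jacobianSize; ++j) {
      state.jacobian[j] = static_cast<double>(j + 1);
    }
    accum += state.jacobian[0];
  }
  return accum;
}
```

Each thread owns its own `ThreadState`, so the reuse needs no locking.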
>
>
> Just to reiterate where I'm coming from: I want to show that, accuracy-
> and performance-wise, ITK4 is at least no worse than ITK3. That justifies
> upgrading. In the longer run, it may make more sense to rewrite old ITK
> code to use the new metrics, but that is a separate project.
>
>
> My intention was to demonstrate that with the patch in gerrit the
> performance is about the same as ITKv3.
>
> Brad
>
>
> Rupert
> --------------------------------------------------------------
> Rupert Brooks
> rupert.brooks at gmail.com
>
>
>
> On Thu, Jul 26, 2012 at 11:53 AM, Bradley Lowekamp <blowekamp at mail.nih.gov
> > wrote:
>
>> Hello,
>>
>> Well I did get to it before you:
>>
>> http://review.source.kitware.com/#/c/6614/
>>
>> I also upped the size of the image 100x in your test; here is the current
>> performance on my system:
>>
>> System: victoria.nlm.nih.gov
>> Processor: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
>> Serial #:
>> Cache: 32768
>> Clock: 2794.27
>> Cores: 12 cpus x 24 Cores = 288
>> OSName: Mac OS X
>> Release: 10.6.8
>> Version: 10K549
>> Platform: x86_64
>> Operating System is 64 bit
>> ITK Version: 3.20.1
>> Virtual Memory: Total: 256 Available: 228
>> Physical Memory: Total:65536 Available: 58374
>> Probe Name              Count  Min       Mean      Stdev       Max       Total
>> MeanSquares_1_threads   20     0.344348  0.347567  0.00244733  0.352629  6.95134
>> MeanSquares_2_threads   20     0.251223  0.300869  0.0179305   0.321404  6.01738
>> MeanSquares_4_threads   20     0.215516  0.348677  0.173645    0.678274  6.97355
>> MeanSquares_8_threads   20     0.138184  0.182681  0.0297812   0.237129  3.65362
>> System: victoria.nlm.nih.gov
>> Processor:
>> Serial #:
>> Cache: 32768
>> Clock: 2930
>> Cores: 12 cpus x 24 Cores = 288
>> OSName: Mac OS X
>> Release: 10.6.8
>> Version: 10K549
>> Platform: x86_64
>> Operating System is 64 bit
>> ITK Version: 4.2.0
>> Virtual Memory: Total: 256 Available: 228
>> Physical Memory: Total:65536 Available: 58371
>> Probe Name              Count  Min       Mean      Stdev       Max       Total
>> MeanSquares_1_threads   20     0.382481  0.383342  0.00186954  0.391027  7.66685
>> MeanSquares_2_threads   20     0.211908  0.335328  0.0777408   0.435574  6.70655
>> MeanSquares_4_threads   20     0.271531  0.315688  0.0390751   0.385683  6.31377
>> MeanSquares_8_threads   20     0.147544  0.192132  0.0299427   0.240976  3.84263
>>
>>
>> In the patch provided, the allocation is implicitly done on assignment on a
>> per-thread basis. What was most unexpected was that when the allocation of
>> the Jacobian was explicitly done outside the threaded part, the time went up
>> by 50%! I presume that allocating the doubles sequentially in the master
>> thread placed them next to each other, which may be a more insidious form of
>> false sharing. Below are the numbers from this run; notice the lack of
>> speed-up with more threads:
>>
>> System: victoria.nlm.nih.gov
>> Processor:
>> Serial #:
>> Cache: 32768
>> Clock: 2930
>> Cores: 12 cpus x 24 Cores = 288
>> OSName: Mac OS X
>> Release: 10.6.8
>> Version: 10K549
>> Platform: x86_64
>> Operating System is 64 bit
>> ITK Version: 4.2.0
>> Virtual Memory: Total: 256 Available: 226
>> Physical Memory: Total:65536 Available: 57091
>> Probe Name              Count  Min       Mean      Stdev       Max       Total
>> MeanSquares_1_threads   20     0.403931  0.40648   0.00213043  0.41389   8.1296
>> MeanSquares_2_threads   20     0.243789  0.367603  0.0894637   0.65006   7.35206
>> MeanSquares_4_threads   20     0.281336  0.354749  0.0431082   0.440161  7.09497
>> MeanSquares_8_threads   20     0.24615   0.301576  0.0552998   0.446528  6.03151
>>
>>
>> Brad
>>
>>
>> On Jul 26, 2012, at 8:56 AM, Rupert Brooks wrote:
>>
>> Brad,
>>
>> The false sharing issue is a good point; however, I don't think it is
>> the cause of the performance degradation. This part of the class
>> (m_Threader, etc.) has not changed since 3.20. (I used the optimized
>> metrics in my 3.20 builds, so it's in Review/itkOptMeanSquares....) It also
>> does not explain the performance drop in single-threaded mode.
>>
>> Testing will tell... Seems like a Friday afternoon project to me, unless
>> someone else gets there first.
>>
>> Rupert
>>
>> --------------------------------------------------------------
>> Rupert Brooks
>> rupert.brooks at gmail.com
>>
>>
>>
>> On Wed, Jul 25, 2012 at 5:18 PM, Bradley Lowekamp <blowekamp at mail.nih.gov
>> > wrote:
>>
>>> Hello,
>>>
>>> Continuing to glance at the class.... I also see the following member
>>> variables for the MeanSquares class:
>>>
>>> MeasureType * m_ThreaderMSE;
>>> DerivativeType *m_ThreaderMSEDerivatives;
>>>
>>> These are indexed by the thread ID and accessed simultaneously across
>>> the threads, which creates the potential for false sharing. That can be
>>> a MAJOR problem with threaded algorithms.
>>>
>>> I would think a good solution would be to create a per-thread data
>>> structure consisting of the Jacobian, MeasureType, and DerivativeType, plus
>>> padding to prevent false sharing, or equivalently assigning maximum data
>>> alignment to the structure.
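A minimal sketch of what such a padded per-thread structure might look like, using `alignas`; the field names are hypothetical and plain doubles stand in for the ITK types:

```cpp
#include <cstddef>

// A typical x86 cache line is 64 bytes; this is an assumption, not something
// the compiler guarantees for every target.
constexpr std::size_t kCacheLineSize = 64;

// Group each thread's accumulators together and align the struct to a cache
// line, so perThread[i] and perThread[i + 1] never share a line.
struct alignas(kCacheLineSize) PerThreadData {
  double measure = 0.0;       // stand-in for the m_ThreaderMSE slot
  double derivative[8] = {};  // stand-in for DerivativeType storage
  // alignas pads the struct out to a multiple of 64 bytes, eliminating
  // false sharing between adjacent threads' slots.
};

static_assert(sizeof(PerThreadData) % kCacheLineSize == 0,
              "each slot must occupy whole cache lines");
```

An array of `PerThreadData`, one element per thread, then replaces the separate parallel arrays.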
>>>
>>> Rupert, would you like to take a stab at this fix?
>>>
>>> Brad
>>>
>>>
>>> On Jul 25, 2012, at 4:31 PM, Rupert Brooks wrote:
>>>
>>> Sorry if this repeats; I just got a bounce from Insight-developers, so
>>> I'm trimming the message and resending....
>>> --------------------------------------------------------------
>>> Rupert Brooks
>>> rupert.brooks at gmail.com
>>>
>>>
>>>
>>> On Wed, Jul 25, 2012 at 4:12 PM, Rupert Brooks <rupert.brooks at gmail.com>wrote:
>>>
>>>> Aha. Here's around line 183 of itkTranslationTransform:
>>>>
>>>> // Compute the Jacobian in one position
>>>> template <class TScalarType, unsigned int NDimensions>
>>>> void
>>>> TranslationTransform<TScalarType, NDimensions>
>>>> ::ComputeJacobianWithRespectToParameters(const InputPointType &,
>>>>                                          JacobianType & jacobian) const
>>>> {
>>>>   // the Jacobian is constant for this transform, and it has already been
>>>>   // initialized in the constructor, so we just need to return it here.
>>>>   jacobian = this->m_IdentityJacobian;
>>>>   return;
>>>> }
>>>>
>>>> That's probably the culprit, although the root cause may be the
>>>> reallocation of the Jacobian every time through the loop.
>>>> Rupert
>>>>
>>>> <snipped>
>>>>
>>>
>>>
>>
>> ========================================================
>> Bradley Lowekamp
>> Medical Science and Computing for
>> Office of High Performance Computing and Communications
>> National Library of Medicine
>> blowekamp at mail.nih.gov
>>
>>
>>
>>
>
> ========================================================
> Bradley Lowekamp
> Medical Science and Computing for
> Office of High Performance Computing and Communications
> National Library of Medicine
> blowekamp at mail.nih.gov
>