[Insight-developers] itk performance numbers

Thu Jul 26 12:50:48 EDT 2012

Brad,

I think that the TR1 extensions have a method for alignment:

http://msdn.microsoft.com/en-us/library/bb983063.aspx

Hans
--
Hans J. Johnson, Ph.D.
hans-johnson at uiowa.edu
Assistant Professor of Psychiatry
University of Iowa Carver College of Medicine
W278 GH, 200 Hawkins Drive
Iowa City, Iowa 52242
Phone:  319-353-8587

From: Bradley Lowekamp <blowekamp at mail.nih.gov>
Date: Thursday, July 26, 2012 10:53 AM
To: Rupert Brooks <rupert.brooks at gmail.com>
Cc: ITK <insight-developers at itk.org>
Subject: Re: [Insight-developers] itk performance numbers

Hello,

Well, I did get to it before you:

http://review.source.kitware.com/#/c/6614/

I also upped the size of the image 100x in your test. Here is the current performance on my system:

System: victoria.nlm.nih.gov
Processor: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz
 Serial #:
    Cache: 32768
    Clock: 2794.27
    Cores: 12 cpus x 24 Cores = 288
OSName:     Mac OS X
  Release:  10.6.8
  Version:  10K549
  Platform: x86_64
  Operating System is 64 bit
ITK Version: 3.20.1
Virtual Memory: Total: 256 Available: 228
Physical Memory: Total:65536 Available: 58374
           Probe Name:        Count          Min           Mean         Stdev            Max        Total
 MeanSquares_1_threads            20      0.344348      0.347567    0.00244733      0.352629       6.95134
 MeanSquares_2_threads            20      0.251223      0.300869     0.0179305      0.321404       6.01738
 MeanSquares_4_threads            20      0.215516      0.348677      0.173645      0.678274       6.97355
 MeanSquares_8_threads            20      0.138184      0.182681     0.0297812      0.237129       3.65362
System: victoria.nlm.nih.gov
Processor:
 Serial #:
    Cache: 32768
    Clock: 2930
    Cores: 12 cpus x 24 Cores = 288
OSName:     Mac OS X
  Release:  10.6.8
  Version:  10K549
  Platform: x86_64
  Operating System is 64 bit
ITK Version: 4.2.0
Virtual Memory: Total: 256 Available: 228
Physical Memory: Total:65536 Available: 58371
           Probe Name:        Count          Min           Mean         Stdev            Max        Total
 MeanSquares_1_threads            20      0.382481      0.383342    0.00186954      0.391027       7.66685
 MeanSquares_2_threads            20      0.211908      0.335328     0.0777408      0.435574       6.70655
 MeanSquares_4_threads            20      0.271531      0.315688     0.0390751      0.385683       6.31377
 MeanSquares_8_threads            20      0.147544      0.192132     0.0299427      0.240976       3.84263

In the patch provided, the allocation is done implicitly on assignment, on a per-thread basis. What was most unexpected was that when the allocation of the Jacobian was done explicitly outside the threaded part, the time went up by 50%! I presume that allocating the doubles sequentially in the master thread placed them next to each other in memory, which may be a more insidious form of false sharing. Below are the numbers from this run; notice the lack of speedup with more threads:

System: victoria.nlm.nih.gov
Processor:
 Serial #:
    Cache: 32768
    Clock: 2930
    Cores: 12 cpus x 24 Cores = 288
OSName:     Mac OS X
  Release:  10.6.8
  Version:  10K549
  Platform: x86_64
  Operating System is 64 bit
ITK Version: 4.2.0
Virtual Memory: Total: 256 Available: 226
Physical Memory: Total:65536 Available: 57091
           Probe Name:        Count          Min           Mean         Stdev            Max        Total
 MeanSquares_1_threads            20      0.403931       0.40648    0.00213043       0.41389        8.1296
 MeanSquares_2_threads            20      0.243789      0.367603     0.0894637       0.65006       7.35206
 MeanSquares_4_threads            20      0.281336      0.354749     0.0431082      0.440161       7.09497
 MeanSquares_8_threads            20       0.24615      0.301576     0.0552998      0.446528       6.03151
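
To put a number on the scaling in the three tables above, one can divide the 1-thread mean by the 8-thread mean. This is only a rough sketch; the function names are made up for illustration and are not part of ITK. Plugging in the means reported above gives a speedup of roughly 1.9x for 3.20.1, roughly 2.0x for 4.2.0 with the patch, and roughly 1.35x for the run with the allocation outside the threaded part.

```cpp
#include <cassert>

// Speedup of a multi-threaded run relative to the single-threaded run,
// computed from mean wall-clock times (hypothetical helper, not ITK API).
double Speedup(double meanOneThread, double meanNThreads)
{
  assert(meanNThreads > 0.0);
  return meanOneThread / meanNThreads;
}

// Parallel efficiency: speedup divided by the thread count (1.0 is ideal).
double Efficiency(double meanOneThread, double meanNThreads, unsigned int numberOfThreads)
{
  assert(numberOfThreads > 0);
  return Speedup(meanOneThread, meanNThreads) / numberOfThreads;
}
```

For example, Speedup(0.40648, 0.301576) uses the Mean column of the last table for 1 and 8 threads.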

Brad

On Jul 26, 2012, at 8:56 AM, Rupert Brooks wrote:

Brad,

The false sharing issue is a good point; however, I don't think this is the cause of the performance degradation.  This part of the class (m_Threader, etc.) has not changed since 3.20.  (I used the optimized metrics in my 3.20 builds, so it's in Review/itkOptMeanSquares....) It also does not explain the performance drop in single-threaded mode.

Testing will tell...  Seems like a Friday afternoon project to me, unless someone else gets there first.

Rupert

--------------------------------------------------------------
Rupert Brooks
rupert.brooks at gmail.com

On Wed, Jul 25, 2012 at 5:18 PM, Bradley Lowekamp <blowekamp at mail.nih.gov> wrote:
Hello,

Continuing to glance at the class.... I also see the following member variables for the MeanSquares class:

  MeasureType *   m_ThreaderMSE;
  DerivativeType *m_ThreaderMSEDerivatives;

Because these are indexed by the thread ID and accessed simultaneously across the threads, there is potential for false sharing, which can be a MAJOR problem with threaded algorithms.

I would think a good solution would be to create a per-thread data structure consisting of the Jacobian, MeasureType, and DerivativeType, plus padding to prevent false sharing, or equivalently assigning max data alignment to the structure.
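
A minimal sketch of such a per-thread structure, assuming a C++11 compiler and a 64-byte cache line (both are assumptions; the names PerThreadData, kCacheLineSize, SumMSE, and IsCacheLineAligned are hypothetical and are not ITK code). std::vector<double> stands in for DerivativeType and the Jacobian.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache-line size; 64 bytes is typical for x86_64 but not guaranteed.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical per-thread accumulator. alignas rounds sizeof up to a multiple
// of the alignment, so adjacent array elements never share a cache line.
struct alignas(kCacheLineSize) PerThreadData
{
  double              mse = 0.0;   // stands in for MeasureType
  std::vector<double> derivative;  // stands in for DerivativeType
  std::vector<double> jacobian;    // stands in for the Jacobian buffer
};

static_assert(alignof(PerThreadData) == kCacheLineSize, "unexpected alignment");
static_assert(sizeof(PerThreadData) % kCacheLineSize == 0, "size must be padded");

// True if the address starts on an (assumed) cache-line boundary.
bool IsCacheLineAligned(const void * p)
{
  return reinterpret_cast<std::uintptr_t>(p) % kCacheLineSize == 0;
}

// Sum the per-thread partial results after the threaded section has finished.
double SumMSE(const PerThreadData * data, std::size_t numberOfThreads)
{
  assert(data != nullptr);
  double total = 0.0;
  for (std::size_t i = 0; i < numberOfThreads; ++i)
  {
    total += data[i].mse;
  }
  return total;
}
```

Note that the padding only separates the structures themselves. The heap buffers owned by the vectors are allocated separately and could still end up adjacent, depending on the allocator and on which thread performs the allocation.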

Rupert, would you like to take a stab at this fix?

Brad

On Jul 25, 2012, at 4:31 PM, Rupert Brooks wrote:

Sorry if this repeats - I just got a bounce from Insight Developers, so I'm trimming the message and resending....
--------------------------------------------------------------
Rupert Brooks
rupert.brooks at gmail.com

On Wed, Jul 25, 2012 at 4:12 PM, Rupert Brooks <rupert.brooks at gmail.com> wrote:
Aha.  Here's the code around line 183 of itkTranslationTransform.

// Compute the Jacobian in one position
template <class TScalarType, unsigned int NDimensions>
void
TranslationTransform<TScalarType, NDimensions>::ComputeJacobianWithRespectToParameters(
  const InputPointType &,
  JacobianType & jacobian) const
{
  // the Jacobian is constant for this transform, and it has already been
  // initialized in the constructor, so we just need to return it here.
  jacobian = this->m_IdentityJacobian;
  return;
}

That's probably the culprit, although the root cause may be the reallocation of the Jacobian every time through the loop.
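
To illustrate the difference, here is a toy sketch, not the ITK code: ToyTranslation and both functions are hypothetical, and std::vector<double> stands in for JacobianType. Variant A creates a fresh buffer on every iteration, so the copy assignment has to allocate each time. Variant B hoists the buffer out of the loop, so the assignment can typically reuse the existing storage when the sizes match.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a transform whose Jacobian is constant.
struct ToyTranslation
{
  std::vector<double> identityJacobian;  // n x n identity, row-major

  explicit ToyTranslation(std::size_t n) : identityJacobian(n * n, 0.0)
  {
    for (std::size_t i = 0; i < n; ++i)
    {
      identityJacobian[i * n + i] = 1.0;
    }
  }

  // Same shape as the method quoted above: copy into the caller's buffer.
  void ComputeJacobian(std::vector<double> & jacobian) const
  {
    jacobian = identityJacobian;
  }
};

// Variant A: a new buffer per point, so every assignment allocates.
double TraceSumFreshBuffer(const ToyTranslation & t, std::size_t n, std::size_t numberOfPoints)
{
  double sum = 0.0;
  for (std::size_t p = 0; p < numberOfPoints; ++p)
  {
    std::vector<double> jacobian;  // allocated and freed every iteration
    t.ComputeJacobian(jacobian);
    for (std::size_t i = 0; i < n; ++i) { sum += jacobian[i * n + i]; }
  }
  return sum;
}

// Variant B: one buffer allocated up front and reused for every point.
double TraceSumReusedBuffer(const ToyTranslation & t, std::size_t n, std::size_t numberOfPoints)
{
  double sum = 0.0;
  std::vector<double> jacobian(n * n);  // allocated once, outside the loop
  for (std::size_t p = 0; p < numberOfPoints; ++p)
  {
    t.ComputeJacobian(jacobian);  // copies values; storage is typically reused
    for (std::size_t i = 0; i < n; ++i) { sum += jacobian[i * n + i]; }
  }
  return sum;
}
```

Both variants compute the same result; only the number of allocations differs. Whether that accounts for the slowdown seen here would still need to be measured.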

Rupert

<snipped>

========================================================

Bradley Lowekamp

Medical Science and Computing for

Office of High Performance Computing and Communications

National Library of Medicine

blowekamp at mail.nih.gov
