ITK/PerformanceOptimization
Recent performance improvement efforts:
* [http://review.source.kitware.com/#/c/22387/ itk::ThreadPool refactoring]
* [http://review.source.kitware.com/#/c/22377/ Change itkMultiThreader to use OpenMP]
* [http://www.insight-journal.org/browse/publication/974 ITK TBB Insight Journal article]
== ITK Performance Optimization Meeting Notes ==

=== 2017-04-07 ===
Links:
* [http://www.insight-journal.org/browse/publication/974 TBB Insight Journal article]
* [https://github.com/InsightSoftwareConsortium/ITKModuleTemplate ITK Module Template]
* [https://itk.org/ITKSoftwareGuide/html/Book1/ITKSoftwareGuide-Book1ch9.html#x48-1440009 ITK Module Documentation]
* [https://github.com/hjmjohnson/tbb CMake'ified TBB]
* [https://blog.kitware.com/simple-parallel-computing-with-vtksmptools-2/ VTK SMPTools]
* [https://itk.org/Doxygen/html/classitk_1_1ImageScanlineIterator.html itkScanlineIterator]
* [https://itk.org/Doxygen/html/classitk_1_1ForwardFFTImageFilter.html ITK FFT classes]
Possible next steps:
* TBB Insight Journal update
* TBB Insight Journal -> Remote Module
* Improved compiler optimization flags
* MKL backend for FFT
* CMake-ified TBB upstreaming / available as an ITK module
* Improved awareness of ImageScanlineIterator
== Performance Analysis == | |||
=== Per Thread Cost of Job Dispatch === | |||
An important metric in analyzing the scalability of a multi-threaded algorithm is the cost of spawning or using an additional thread for a task. Knowing this cost allows us to estimate the improvement an additional thread can be expected to provide for a perfectly scalable algorithm. It also provides an estimate of the threading overhead: for example, if you are running X filters per second with N threads, you can compute the time spent on threading overhead. This helps determine whether more improvement can be expected from improving the threading model or from improving the algorithms.
The overhead of spawning threads is computed by measuring the time it takes the `AddImageFilter` to run with 1 thread on 1 pixel, and the time it takes to run with N threads on N pixels. Each thread performs the trivial operation on one pixel. The difference in execution time is considered the overhead of spawning the threads, and dividing it by the number of additional threads gives the overhead cost of "spawning" or dispatching a single thread. To estimate this cost, the following SimpleITK code was initially used:
import timeit
import SimpleITK as sitk

def compute_thread_cost(n):
    # Baseline: a trivial one-pixel Add with a single thread.
    sitk.ProcessObject_SetGlobalDefaultNumberOfThreads(1)
    img = sitk.Image(1, 1, sitk.sitkFloat32)
    t1 = min(timeit.repeat(lambda img=img: sitk.Add(img, img), number=1, repeat=25))
    # An n-pixel Add with n threads, i.e. one pixel of work per thread.
    sitk.ProcessObject_SetGlobalDefaultNumberOfThreads(n)
    img = sitk.Image(1, n, sitk.sitkFloat32)
    t2 = min(timeit.repeat(lambda img=img: sitk.Add(img, img), number=1, repeat=25))
    print("1 Thread: {0}\n{1} Threads: {2}".format(t1, n, t2))
    # The time difference is attributed to dispatching the n-1 additional threads.
    cost = (t2 - t1) / (n - 1.0)
    print("Cost per thread: {0}".format(cost))
SimpleITK was recompiled against the default ITK, the [http://review.source.kitware.com/#/c/22387/5 rewritten-thread-pool] topic, and the [http://review.source.kitware.com/#/c/22377/1 TheReturnOfOpenMP] topic.
The results below were obtained on the following system:
Architecture: x86_64 | |||
CPU op-mode(s): 32-bit, 64-bit | |||
Byte Order: Little Endian | |||
CPU(s): 88 | |||
On-line CPU(s) list: 0-87 | |||
Thread(s) per core: 2 | |||
Core(s) per socket: 22 | |||
Socket(s): 2 | |||
NUMA node(s): 2 | |||
Vendor ID: GenuineIntel | |||
CPU family: 6 | |||
Model: 79 | |||
Model name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz | |||
Stepping: 1 | |||
CPU MHz: 2800.617 | |||
BogoMIPS: 4410.78 | |||
Virtualization: VT-x | |||
L1d cache: 32K | |||
L1i cache: 32K | |||
L2 cache: 256K | |||
L3 cache: 56320K | |||
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86 | |||
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87 | |||
Here are the results. | |||
Default ITK: | |||
>compute_thread_cost(88) | |||
1 Thread: 7.9870223999e-05 | |||
88 Threads: 0.00240182876587 | |||
Cost per thread: 2.66891786422e-05 (seconds) | |||
rewriting-thread-pool: | |||
>compute_thread_cost(88) | |||
1 Thread: 7.60555267334e-05 | |||
88 Threads: 0.000521183013916 | |||
Cost per thread: 5.11640789865e-06 (seconds) | |||
TheReturnOfOpenMP: | |||
>compute_thread_cost(88) | |||
1 Thread: 7.79628753662e-05 | |||
88 Threads: 0.000182867050171 | |||
Cost per thread: 1.2057951127e-06 (seconds) | |||
To summarize these results: the rewriting-thread-pool topic gives a 5.2X speedup for spawning threads, and the TheReturnOfOpenMP topic gives a 22X speedup (in each case the ratio of the Default ITK per-thread cost to the topic's per-thread cost).
Re-running the test code shows some variability (~20%, with occasional outliers). The results are quite similar when spawning a smaller number of threads, so this type of testing can also be done on a smaller system.
So if you are running 100 filters per second with 100 threads each, the thread overhead will consume 1% to 26% of the time, depending on the threading model. That is a lot of filters and threads to encounter this potential bottleneck; however, we need to know the current usage and the expected number of filters and threads run per second.
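To make that arithmetic explicit, here is a small sketch (not part of the original notes) that derives those percentages from the per-thread costs measured above; the rounded cost values are taken from the runs reported earlier:

# Fraction of each second spent on thread dispatch, assuming
# `filters_per_sec` filter runs per second, each dispatching
# (threads - 1) additional threads at `cost` seconds apiece.
def overhead_fraction(cost, threads=100, filters_per_sec=100):
    return cost * (threads - 1) * filters_per_sec

for name, cost in [("Default ITK", 2.67e-05),
                   ("rewriting-thread-pool", 5.12e-06),
                   ("TheReturnOfOpenMP", 1.21e-06)]:
    print("{0}: {1:.1%}".format(name, overhead_fraction(cost)))
# Prints roughly: Default ITK: 26.4%, rewriting-thread-pool: 5.1%,
# TheReturnOfOpenMP: 1.2% -- the "1% to 26%" range quoted above.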