[Insight-developers] Empty FixedArray destructor: Performance hit using gcc (times 2)
Bill Lorensen
bill.lorensen at gmail.com
Fri Jun 6 13:12:53 EDT 2008
OK, I am way out of my league here, but the difference may lie in the
way POD ("Plain Old Data") types and non-POD types handle memory
allocation.
See: http://www.fnal.gov/docs/working-groups/fpcltf/Pkg/ISOcxx/doc/POD.html
It may be that, by having a destructor, the class becomes a non-POD
class. Without it, it may be a POD class. The POD-ness rules are
complex, at least for me.
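
To make that guess concrete, here is a minimal sketch, not ITK code. It
assumes a C++11 compiler (for <type_traits>), and PlainPair, PairWithDtor
and is_pod_like are made-up names standing in for a FixedArray-like class
with and without an empty destructor. It only shows that an empty but
user-provided destructor is enough to lose POD-ness, while the size of
the object stays the same.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for a FixedArray-like class; not ITK code.
struct PlainPair {
  double data[2];
};

struct PairWithDtor {
  double data[2];
  ~PairWithDtor() {}  // empty, but user-provided
};

// C++11 definition of POD: trivial and standard-layout.
template <typename T>
constexpr bool is_pod_like() {
  return std::is_trivial<T>::value && std::is_standard_layout<T>::value;
}
```

With these definitions, is_pod_like<PlainPair>() is true and
is_pod_like<PairWithDtor>() is false, because the user-provided
destructor makes the type non-trivially destructible.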
Perhaps Brad King can shed some light on this. These questions are NOT
out of his league.
Bill
On Fri, Jun 6, 2008 at 9:14 AM, Luis Ibanez <luis.ibanez at kitware.com> wrote:
>
> Hi Tom,
>
> Thanks for providing the code of your test.
>
> I was misinterpreting your description of 8-byte versus 4-byte alignment.
>
> The test code has been committed under:
>
>
> Insight/Testing/Code/Common/itkFixedArrayTest2.cxx
>
>
> This should help us to see the effect of the changes across all
> platforms.
>
>
> Is it fair to say that the test should fail when we observe
> that the array of FixedArrays has been allocated with a pointer
> that is not on an 8-byte boundary?
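
A pass/fail criterion like the one in the question above could be
written as a small predicate. This is only a sketch: is_aligned is a
made-up helper, not something in ITK or in the committed test, and it
assumes a C++11 compiler (for std::uintptr_t) and a power-of-two
alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p sits on an `alignment`-byte boundary.
// `alignment` is assumed to be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
  return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

A test could then report failure when is_aligned(vec, 8) is false for
the pointer returned by new[].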
>
>
> Thanks
>
>
> Luis
>
>
>
>
> ----------------------
> Tom Vercauteren wrote:
>>
>> Hi,
>>
>> Thanks for your tests, it's great to see such quick reactions!
>>
>> Below is another test that will show the performance hit. You don't
>> need to recompile ITK to use it. What we did was to run a simple loop
>> on a C array of FixedArray. Then we hack around to get an 8-byte
>> aligned C array of FixedArray and run the loop again.
>>
>> In this case, the performance hit is clearly not as large as the one
>> we get in the real world case but is still large enough to be
>> conclusive.
>>
>> Initial alignment: 4
>> Initial execution time: 920ms
>> New alignment: 0
>> Execution time: 880ms
>>
>> Let me know what it gives on your setup.
>>
>> If the destructor is not implemented, you would get "Initial
>> alignment: 0" and the same timing for both runs.
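
One possible mechanism for the shifted pointer, offered as an assumption
rather than a diagnosis: when the element type has a non-trivial
destructor, new[] typically requests a few extra bytes to remember the
element count, and returns a pointer just past them. The sketch below
measures that overhead with a class-level operator new[]. Recorder,
PlainPair, PairWithDtor and array_overhead are all made-up names, not
ITK code, and the amount of overhead is implementation-specific.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Bytes requested by the most recent class-level operator new[] call.
static std::size_t g_last_request = 0;

// Hypothetical base that records allocation sizes; not part of ITK.
struct Recorder {
  static void* operator new[](std::size_t bytes) {
    g_last_request = bytes;
    void* p = std::malloc(bytes);
    if (!p) throw std::bad_alloc();
    return p;
  }
  static void operator delete[](void* p) noexcept { std::free(p); }
};

struct PlainPair : Recorder {
  double data[2];
};

struct PairWithDtor : Recorder {
  double data[2];
  ~PairWithDtor() {}  // empty, but user-provided
};

// Extra bytes that new[] asked for beyond count * sizeof(T).
template <typename T>
std::size_t array_overhead(std::size_t count) {
  T* a = new T[count];
  std::size_t extra = g_last_request - count * sizeof(T);
  delete[] a;
  return extra;
}
```

Comparing array_overhead<PlainPair>(n) with
array_overhead<PairWithDtor>(n) on a given compiler shows whether the
destructor changes how much memory is requested. On a platform where
the extra bytes amount to 4, an 8-byte-aligned block would hand back an
array starting at offset 4, which would be consistent with the
"Initial alignment: 4" reported above.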
>>
>> Tom
>>
>>
>>
>> #include <cstddef>   // std::size_t
>> #include <cstring>   // std::memset
>> #include <ctime>     // std::clock, CLOCKS_PER_SEC
>> #include <iostream>
>> #include <itkFixedArray.h>
>>
>> int main()
>> {
>>   // Define the number of elements in the array
>>   const unsigned int nelements = 10000000;
>>
>>   // Define the number of runs used for timing
>>   const unsigned int nrun = 10;
>>
>>   // Declare a simple timer
>>   std::clock_t t;
>>
>>   typedef itk::FixedArray<double,2> ArrayType;
>>
>>   // Declare an array of nelements FixedArray
>>   // and add a small margin to play with pointers
>>   // without stepping outside the allocated memory
>>   ArrayType * vec = new ArrayType[nelements+8];
>>
>>   // Fill it up with zeros
>>   std::memset(vec,0,(nelements+8)*sizeof(ArrayType));
>>
>>   // Display the alignment of the array (address modulo 8).
>>   // The pointer is converted to std::size_t rather than int so that
>>   // the cast also compiles on 64-bit platforms.
>>   std::cout << "Initial alignment: "
>>             << (reinterpret_cast<std::size_t>(vec) & 7) << "\n";
>>
>>   // Start a simple experiment
>>   t = std::clock();
>>   double acc1 = 0.0;
>>   for (unsigned int i=0;i<nrun;++i)
>>     {
>>     for (unsigned int j=0;j<nelements;++j)
>>       {
>>       acc1+=vec[j][0];
>>       }
>>     }
>>
>>   // Get the final timing and display it
>>   t = std::clock() - t;
>>
>>   std::cout << "Initial execution time: "
>>             << (t*1000.0) / CLOCKS_PER_SEC << "ms\n";
>>
>>   // We now emulate an 8-byte aligned array
>>
>>   // Cast the pointer to char to play with bytes
>>   char * p = reinterpret_cast<char*>( vec );
>>
>>   // Move the char pointer until it is aligned on 8 bytes
>>   while (reinterpret_cast<std::size_t>(p) % 8) ++p;
>>
>>   // Cast the 8-byte aligned pointer back to the original type
>>   ArrayType * vec2 = reinterpret_cast<ArrayType*>( p );
>>
>>   // Make sure the new pointer is well aligned by
>>   // displaying the alignment
>>   std::cout << "New alignment: "
>>             << (reinterpret_cast<std::size_t>(vec2) & 7) << "\n";
>>
>>   // Start the simple experiment on the 8-byte aligned array
>>   t = std::clock();
>>   double acc2 = 0.0;
>>   for (unsigned int i=0;i<nrun;++i)
>>     {
>>     for (unsigned int j=0;j<nelements;++j)
>>       {
>>       acc2+=vec2[j][0];
>>       }
>>     }
>>
>>   // Get the final timing and display it
>>   t = std::clock() - t;
>>
>>   std::cout << "Execution time: "
>>             << (t*1000.0) / CLOCKS_PER_SEC << "ms\n";
>>
>>   // Free up the memory
>>   delete [] vec;
>>
>>   // Make sure we do something with the sums otherwise everything
>>   // could be optimized away by the compiler
>>   return static_cast<int>(acc1+acc2);
>> }
>>
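
The byte-by-byte pointer walk in the test above could also be expressed
with std::align from <memory>. This is a sketch under assumptions: it
needs a C++11 compiler, and first_aligned is a made-up wrapper, not part
of ITK or of the test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical helper: first address inside [buffer, buffer + space) that
// is aligned to `alignment` and still leaves room for `bytes`.
// Returns nullptr when no such address exists.
inline void* first_aligned(void* buffer, std::size_t space,
                           std::size_t alignment, std::size_t bytes) {
  // std::align adjusts its local copies of buffer and space.
  return std::align(alignment, bytes, buffer, space);
}
```

Unlike the manual loop, this variant also reports (via nullptr) when the
margin is too small to hold the requested number of bytes after
alignment.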
>>
>>
>> On Thu, Jun 5, 2008 at 5:04 PM, Gert Wollny <gert at die.upm.es> wrote:
>>
>>> On Thursday, 05.06.2008, at 10:24 -0400, Luis Ibanez wrote:
>>>
>>>> Hi Gert,
>>>>
>>>> Thanks for the quick report !
>>>>
>>>> It makes sense that the -g flag will prevent the method
>>>> from being optimized away.
>>>>
>>>> If you have a chance,
>>>> could you please test what happens when no -g is
>>>> used, and the optimization flag is set to -O3 ?
>>>
>>> It was not optimized away, and valgrind/kcachegrind tells me the
>>> destructor is located in libITKCommon.so.
>>>
>>> Actually, with -O3 the whole loop was optimized away. This is weird, to
>>> say the least, because if the compiler doesn't see the implementation
>>> of the constructor and the destructor and uses the explicitly
>>> instantiated ones, it cannot know whether something essential is done
>>> in either of them, like changing a global variable.
>>>
>>> I've added some code to force the loop (attached).
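
The attachment is not reproduced here. As a generic illustration only,
one common way to keep a timing loop from being discarded is to store
its result in a volatile object. Pair, g_sink and sum_first below are
made-up names, not taken from the attached code or from ITK.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element type standing in for FixedArray<double,2>.
struct Pair {
  double v[2];
};

// A write to a volatile object is an observable side effect, so the
// compiler has to produce the value that is stored here.
volatile double g_sink = 0.0;

// Sums the first component of each element and publishes the total.
inline double sum_first(const Pair* pairs, std::size_t count) {
  double acc = 0.0;
  for (std::size_t j = 0; j < count; ++j) {
    acc += pairs[j].v[0];
  }
  g_sink = acc;  // volatile write keeps the result live
  return acc;
}
```

The optimizer may still transform the loop, but it can no longer drop
the result entirely.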
>>>
>>> BTW: I think -g doesn't change the optimizers at all (with g++).
>>>
>>> Best
>>>
>>> Gert
>>>
>>>
>>>
>>>
>>>
>>
>>
> _______________________________________________
> Insight-developers mailing list
> Insight-developers at itk.org
> http://www.itk.org/mailman/listinfo/insight-developers
>