<div dir="ltr">Hi Simon,<div><br></div><div>Thanks for the suggestions.<div><br></div><div>The problem can be reproduced here (8 GB RAM, 1.5 GB graphics memory (GRAM), RTK 1.0.0) with:<br></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px">


<div>rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384</div><div>rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt</div><div>rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l</div>


<div><br></div></blockquote><div>With #define VERBOSE enabled (btw, I found it in itkCudaDataManager.cxx rather than <span style="font-family:arial,sans-serif;font-size:14px">itkCudaImageDataManager.hxx</span>), I now have a better view of the GRAM usage.</div>


<div>I found that the size of the volume data in GRAM can be reduced with --divisions, but the amount of projection data sent to GRAM is not influenced by the --lowmem switch.</div><div>So --divisions does not help much when it is mainly the projection data that takes up GRAM, and --lowmem does not help at all. I have not looked at the upstream part of the code, so I am not sure whether this is the intended behaviour.</div>
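<div><br></div><div>To put rough numbers on this, here is a small sketch of the raw buffer sizes for the commands above. It assumes 32-bit float pixels and an even split of the volume across divisions; floatBytes and volumeBytes are names made up for this illustration, not RTK functions. FFT padding and the cuFFT plan workspace would come on top of these figures.</div>

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for a dense nx x ny x nz array of 32-bit floats.
// ASSUMPTION: pixels and voxels are stored as float.
inline std::size_t floatBytes(std::size_t nx, std::size_t ny, std::size_t nz)
{
  return nx * ny * nz * sizeof(float);
}

// Share of the 640 x 250 x 640 volume from the rtkfdk call that is held at
// once if the volume is split evenly into `divisions` pieces.
// ASSUMPTION: an even split; the real streaming regions may differ.
inline std::size_t volumeBytes(std::size_t divisions)
{
  assert(divisions > 0);
  return floatBytes(640, 250, 640) / divisions;
}
```

<div>With these assumptions, floatBytes(1944, 1536, 30) gives 358318080 bytes (about 342 MiB) for the full projection stack, and volumeBytes(1) gives 409600000 bytes (about 391 MiB) for the volume. volumeBytes(2) halves only the second figure, so if all projections stay on the GPU, the first figure is unaffected by --divisions.</div>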


<div><br></div><div>On the other hand, I am also looking for ways to reduce the GRAM used by the CUDA ramp filter. At least one thing should be changed, and another may be worth considering:</div><div>- In rtkCudaFFTRampImageFilter.cu, the forward FFT plan (fftFwd) should be destroyed earlier, right after it has been executed. A plan takes up at least as much memory as the data.</div>


<div>- cufftExecR2C and cufftExecC2R can operate in place. However, I do not have a clear idea of how to pad deviceProjection to the size required by its cufftComplex counterpart.</div><div><div><br></div><div>Any comments?</div>
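<div><br></div><div>On the in-place point above, one possible padding scheme is sketched below on the host side. It assumes the FFTW-style layout in which the fastest dimension (x) of length nx is transformed into nx/2 + 1 complex values, so each real row needs 2 * (nx/2 + 1) floats. paddedRowLength and padForInPlaceR2C are hypothetical helper names, not part of RTK or cuFFT.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Floats per row needed for an in-place real-to-complex FFT of length nx
// along x: (nx/2 + 1) complex values at 2 floats each.
inline std::size_t paddedRowLength(std::size_t nx)
{
  return 2 * (nx / 2 + 1);
}

// Copy a tightly packed nx-by-ny real image into the padded row layout.
// The 1 or 2 extra floats at the end of each row are set to zero here;
// an in-place transform would use that space for the last coefficient.
std::vector<float> padForInPlaceR2C(const std::vector<float>& src,
                                    std::size_t nx, std::size_t ny)
{
  assert(src.size() == nx * ny);
  const std::size_t pitch = paddedRowLength(nx);
  std::vector<float> dst(pitch * ny, 0.0f);
  for (std::size_t y = 0; y < ny; ++y)
    for (std::size_t x = 0; x < nx; ++x)
      dst[y * pitch + x] = src[y * nx + x];
  return dst;
}
```

<div>On the device, the same layout could probably be produced with cudaMemcpy2D, using paddedRowLength(nx) * sizeof(float) as the destination pitch, but I have not tested that. For an even nx the overhead is only two floats per row, which is small next to a separate cufftComplex buffer.</div>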


</div><div><br></div><div>Best regards,</div><div>Chao</div><div><br></div></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2014-05-21 14:30 GMT+02:00 Simon Rit <span dir="ltr">&lt;<a href="mailto:simon.rit@creatis.insa-lyon.fr" target="_blank">simon.rit@creatis.insa-lyon.fr</a>&gt;</span>:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Since it fails in cufft, it's the memory of the projections that is a<br>

problem. Therefore, it is not surprising that --divisions has no<br>

influence. But --lowmem should have an influence. I would suggest:<br>

- to uncomment<br>

//#define VERBOSE<br>

in itkCudaImageDataManager.hxx and try to see how much memory<br>

is requested.<br>

- to try to reproduce the problem with simulated data so that we can<br>

help you in finding a solution.<br>

<span class="HOEnZb"><font color="#888888">Simon<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Wed, May 21, 2014 at 2:21 PM, Chao Wu &lt;<a href="mailto:wuchao04@gmail.com">wuchao04@gmail.com</a>&gt; wrote:<br>

> Hi Simon,<br>

><br>

> Yes, I switched the --lowmem option on and off, and it has no influence on the<br>

> behaviour I mentioned.<br>

> In my case the system memory is sufficient to handle the projections plus<br>

> the volume.<br>

> The major bottleneck is the amount of graphics memory.<br>

> If I reconstruct slightly more slices than the limit that I found with<br>

> one stream, the allocation of GPU resource for CUFFT in the<br>

> CudaFFTRampImageFilter will fail (which was more or less expected).<br>

> However with --divisions > 1 it is indeed able to reconstruct more slices,<br>

> but only a very few more; otherwise the CUFFT would fail again.<br>

> I would expect the limitations of the amount of slices to be approximately<br>

> proportional to the number of streams, or am I missing something about stream<br>

> division?<br>

><br>

> Thanks,<br>

> Chao<br>

><br>

><br>

><br>

> 2014-05-21 13:43 GMT+02:00 Simon Rit <<a href="mailto:simon.rit@creatis.insa-lyon.fr">simon.rit@creatis.insa-lyon.fr</a>>:<br>

><br>

>> Hi Chao,<br>

>> There are two things that use memory, the volume and the projections.<br>

>> The --divisions option divides the volume only. The --lowmem option<br>

>> works on a subset of projections at a time. Did you try this?<br>

>> Simon<br>

>><br>

>> On Wed, May 21, 2014 at 12:18 PM, Chao Wu <<a href="mailto:wuchao04@gmail.com">wuchao04@gmail.com</a>> wrote:<br>

>> > Hi,<br>

>> ><br>

>> > I may need some hint about how the stream division works in rtkfdk.<br>

>> > I noticed that the StreamingImageFilter from ITK is used but I cannot<br>

>> > figure<br>

>> > out quickly how the division has been performed.<br>

>> > I ran some tests reconstructing 400 projections of 1500x1200 pixels into a<br>

>> > 640xNx640 volume (the pixel and voxel size are comparable).<br>

>> > The reconstructions were executed by rtkfdk with CUDA.<br>

>> > When I leave the origin of the volume at the center by default, I can<br>

>> > reconstruct up to N=200 slices with --divisions=1 due to the limitation<br>

>> > of<br>

>> > the graphic memory. Then when I increase the number of divisions to 2, I<br>

>> > can<br>

>> > only reconstruct up to 215 slices; and with 3 divisions only up to<br>

>> > 219<br>

>> > slices. Does anyone have an idea why it scales like this?<br>

>> > Thanks in advance.<br>

>> ><br>

>> > Best regards,<br>

>> > Chao<br>

>> ><br>

>> > _______________________________________________<br>

>> > Rtk-users mailing list<br>

>> > <a href="mailto:Rtk-users@openrtk.org">Rtk-users@openrtk.org</a><br>

>> > <a href="http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users" target="_blank">http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users</a><br>

>> ><br>

><br>

><br>

</div></div></blockquote></div><br></div>