[Rtk-users] Stream divisions in rtkfdk

Simon Rit simon.rit at creatis.insa-lyon.fr
Fri May 30 05:12:41 EDT 2014


Hi Chao,
I added the option --subsetsize.
Thanks for the detailed report. I don't understand all of it, it's quite
complicated... Do you really have such severe memory limitation problems
that you want to go in that direction? Using the two streaming options
(--subsetsize + --divisions), you should be able to reduce your memory
consumption sufficiently.
If you really want to go further with the in-place implementation, I
think a code patch would be more helpful, but you must confine the
changes to rtk::CudaFFTRampImageFilter. We don't want to modify
itk::CudaDataManager for such a specific purpose.
Simon

On Tue, May 27, 2014 at 2:24 PM, Chao Wu <wuchao04 at gmail.com> wrote:
> Hi Simon,
>
> Thanks for your reply. I have been looking into the in-place FFT these days,
> and tuning the number of projections sent to the ramp filter is exactly what
> I planned to look into next, so now I know where to look. I think it is a
> good idea to make it an option of rtkfdk, or to regulate it automatically by
> querying the amount of free memory with cudaMemGetInfo and estimating the
> memory needed for storing the projections, the ramp kernel, the FFT plan and
> the chunk of volume. The latter may be difficult though, since such an
> estimate is not easy at a stage before the projections are even padded...
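>
> To illustrate what I mean by querying the free memory, something along these
> lines (an untested sketch; the per-projection byte count is exactly the
> hard-to-estimate part, and the names are only illustrative):
>
>   #include <cuda_runtime.h>
>   #include <cstddef>
>
>   // Sketch: choose how many projections to process at once based on the
>   // currently free GPU memory. bytesPerProjection would have to include
>   // padding, the FFT plan and the ramp kernel, which is the hard estimate.
>   int SuggestSubsetSize(std::size_t bytesPerProjection, std::size_t volumeChunkBytes)
>   {
>     std::size_t freeBytes = 0, totalBytes = 0;
>     cudaMemGetInfo(&freeBytes, &totalBytes);
>     std::size_t budget = (freeBytes > volumeChunkBytes) ? freeBytes - volumeChunkBytes : 0;
>     int n = static_cast<int>(budget / (2 * bytesPerProjection)); // 2x safety margin
>     return n < 1 ? 1 : n;
>   }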
>
> Back to the in-place FFT subject. I am not sure about ITK's FFT, but both
> FFTW and cuFFT can perform the FFT in-place. So in principle
> rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter
> may also be made in-place if FFTW is used. However, "in-place" here is at a
> lower level and may not be compatible with the meaning of "in-place" in
> itk::InPlaceImageFilter.
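>
> (For what it is worth, with FFTW an in-place r2c transform would look
> roughly like this; just an illustration of the padded layout, not RTK code:)
>
>   #include <fftw3.h>
>
>   // In-place r2c with FFTW: each row of nx reals is stored in a padded row
>   // of 2*(nx/2+1) floats, and the complex output aliases the same buffer.
>   void InPlaceFFTWExample(int nx, int ny)
>   {
>     const int nxp = 2 * (nx / 2 + 1);
>     float * buf = fftwf_alloc_real((std::size_t)nxp * ny);
>     // ... fill buf with the (padded) projection data here ...
>     fftwf_complex * cbuf = reinterpret_cast<fftwf_complex *>(buf);
>     fftwf_plan p = fftwf_plan_dft_r2c_2d(ny, nx, buf, cbuf, FFTW_ESTIMATE);
>     fftwf_execute(p);
>     fftwf_destroy_plan(p);
>     fftwf_free(buf);
>   }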
>
> Anyway, since system memory is not a problem for me, I focus only on the CUDA
> filter. I already have a sort of "dirty" implementation for my own use:
>
> First, in rtkCudaFFTRampImageFilter.cu I commented out the cudaMalloc and
> cudaFree of deviceProjectionFFT, and then simply let deviceProjectionFFT =
> (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that
> the size of the buffer (now used by both deviceProjectionFFT and
> deviceProjection) must be 2*(x/2+1)*y*z instead of x*y*z.
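>
> In code the idea is roughly this (a sketch, not the actual filter code; a
> batched 1-D transform along x is shown only as an example):
>
>   #include <cufft.h>
>
>   // Sketch: in-place batched R2C along x. deviceProjection must have been
>   // allocated with 2*(x/2+1)*y*z floats so that each padded row can hold
>   // its own half-spectrum.
>   void InPlaceForwardFFT(float * deviceProjection, int x, int y, int z)
>   {
>     cufftComplex * deviceProjectionFFT =
>       reinterpret_cast<cufftComplex *>(deviceProjection);
>     cufftHandle fftFwd;
>     cufftPlan1d(&fftFwd, x, CUFFT_R2C, y * z);
>     cufftExecR2C(fftFwd, deviceProjection, deviceProjectionFFT); // in == out
>     cufftDestroy(fftFwd);
>   }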
>
> Then I moved on to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above
> is maintained in paddedImage. Its size is determined in
> PadInputImageRegion(…) (line 60), and the actual GPU memory allocation and
> CPU-to-GPU data copying are done by
> paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first
> attempt was to make the image regions of paddedImage differ from each other
> by modifying FFTRampImageFilter<…>::PadInputImageRegion(…) in
> rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z, storing
> the padded projection data as it does now, while its BufferedRegion becomes
> 2*(x/2+1) by y by z, with the additional part reserved for the in-place FFT.
> Other small changes were made to calculate inputDimension and
> kernelDimension correctly based on the RequestedRegion. I later realized that
> this does not work, since cuFFT sees the buffer simply as linear space: all
> image data must lie contiguously from the beginning of the buffer with any
> unused space at the end, but here the reserved space was at the end of each
> row along the x (first) dimension, so it ended up scattered throughout the
> linear buffer.
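>
> (For reference, the failed attempt looked roughly like this; the
> BufferedRegion grows along x, so the extra space sits at the end of every row
> instead of at the end of the linear buffer:)
>
>   #include <itkCudaImage.h>
>
>   // Sketch of the attempt: RequestedRegion stays x*y*z, BufferedRegion
>   // becomes 2*(x/2+1)*y*z. cuFFT cannot use the row-wise gaps this creates.
>   void ReserveInPlacePadding(itk::CudaImage<float, 3>::Pointer paddedImage)
>   {
>     typedef itk::CudaImage<float, 3> ImageType;
>     ImageType::RegionType buffered = paddedImage->GetRequestedRegion();
>     ImageType::SizeType   size     = buffered.GetSize();
>     size[0] = 2 * (size[0] / 2 + 1);
>     buffered.SetSize(size);
>     paddedImage->SetBufferedRegion(buffered);
>   }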
>
> So this is where the "dirty" changes start. First, instead of calling
> PadInputImageRegion(…) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an
> altered version named PadInputImageRegionInPlaceFFT(…) (because I have not
> checked whether the modification also works for the CPU or any other
> situation, I prefer to branch where possible instead of changing code
> directly). The latter is a copy of the former in rtkFFTRampImageFilter.txx,
> with the only change being that the allocation call paddedImage->Allocate()
> becomes paddedImage->AllocateInPlaceFFT(). Again,
> CudaImage<…>::AllocateInPlaceFFT() is an altered version of
> CudaImage<…>::Allocate() in itkCudaImage.hxx. There, after calculating and
> setting CudaDataManager::m_BufferSize as before, I also calculate the buffer
> size required for the in-place FFT and store the value in a new member of
> CudaDataManager, namely m_BufferSizeInPlaceFFT. Then, in
> CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of
> simply doing this->Allocate(), I first check whether m_BufferSize and
> m_BufferSizeInPlaceFFT are equal. If not, I set m_BufferSize =
> m_BufferSizeInPlaceFFT before calling this->Allocate(), and afterwards
> restore m_BufferSize to its original value. Other changes ensure that
> m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for backward
> compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void
> CudaDataManager::SetBufferSize(unsigned int num), so that other allocation
> paths (although I have not checked them one by one) are not affected by the
> new code. Finally, in GPUMemPointer::Allocate(size_t bufferSize) in
> itkCudaDataManager.h, I added a cudaMemset after the cudaMalloc to
> zero-initialize the buffer, since the additional space in this buffer never
> gets a chance to be initialized by the CPU-to-GPU data copy; the copied data
> is shorter than the buffer.
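>
> The core of the changes boils down to something like this (a sketch using the
> names described above; the GPUMemPointer member name is approximate):
>
>   // In CudaDataManager::UpdateGPUBuffer(): temporarily use the larger
>   // in-place-FFT size when it differs from the normal buffer size.
>   if (m_BufferSize != m_BufferSizeInPlaceFFT)
>   {
>     unsigned int originalSize = m_BufferSize;
>     m_BufferSize = m_BufferSizeInPlaceFFT;
>     this->Allocate();
>     m_BufferSize = originalSize;
>   }
>   else
>   {
>     this->Allocate();
>   }
>
>   // In GPUMemPointer::Allocate(size_t bufferSize): zero the whole buffer so
>   // the padding that the CPU-to-GPU copy never touches is well defined.
>   cudaMalloc(&m_GPUBuffer, bufferSize);
>   cudaMemset(m_GPUBuffer, 0, bufferSize);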
>
> It works for me so far. Please let me know if you have a better way to
> implement this. Thank you.
>
> Best regards,
> Chao
>
> 2014-05-27 0:12 GMT+02:00 Simon Rit <simon.rit at creatis.insa-lyon.fr>:
>
>> Hi Chao,
>> Thanks for the detailed report.
>>
>>
>> On Thu, May 22, 2014 at 10:06 AM, Chao Wu <wuchao04 at gmail.com> wrote:
>>>
>>> Hi Simon,
>>>
>>> Thanks for the suggestions.
>>>
>>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by:
>>>
>>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384
>>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing
>>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt
>>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4
>>> --dimension 640,250,640 --hardware=cuda -v -l
>>>
>>> With #define VERBOSE (by the way, I found it in itkCudaDataManager.cxx
>>> instead of itkCudaImageDataManager.hxx) I now have a better view of the
>>> GRAM usage. I found that the size of the volume data in GRAM can be reduced
>>> by --divisions, but the amount of projection data sent to GRAM is not
>>> influenced by the --lowmem switch.
>>
>> After looking at the code again, lowmem acts on the reading, so it is not
>> related to the GPU memory but to the CPU memory, sorry about that. The
>> reconstruction algorithm does stream the projections, but by default it
>> processes 16 projections at a time. You can change this in
>> rtkFDKConeBeamReconstructionFilter.txx, line 28, to, e.g., 2. This will
>> reduce your GPU memory consumption (I checked and it works for me). Let me
>> know if it works for you and whether you think this should be made an option
>> of rtkfdk.
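>>
>> (The default is set in the constructor, roughly like the following; the
>> exact member name may differ:)
>>
>>   // rtkFDKConeBeamReconstructionFilter.txx, constructor (approximate):
>>   m_ProjectionSubsetSize = 16; // lower this, e.g. to 2, to cut GPU memory use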
>>
>>>
>>> So --divisions does not help much if it is mainly the projection data that
>>> takes up GRAM, while --lowmem does not help at all. I have not looked into
>>> the earlier part of the code, so I am not sure whether this is the designed
>>> behaviour.
>>>
>>> On the other hand, I am also looking for possibilities to reduce the GRAM
>>> used in the CUDA ramp filter. At least one thing should be changed, and one
>>> thing may be considered:
>>> - in rtkCudaFFTRampImageFilter.cu, the forward FFT plan (fftFwd) should be
>>> destroyed earlier, right after the plan has been executed. A plan takes up
>>> at least the same amount of memory as the data.
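>>>
>>> Concretely, something along these lines (just a sketch with illustrative
>>> names, not the actual file):
>>>
>>>   cufftExecR2C(fftFwd, deviceProjection, deviceProjectionFFT);
>>>   cufftDestroy(fftFwd);   // release the plan's workspace right away
>>>   // ... ramp-kernel multiplication ...
>>>   cufftPlan1d(&fftInv, x, CUFFT_C2R, y * z); // create the inverse plan only now
>>>   cufftExecC2R(fftInv, deviceProjectionFFT, deviceProjection);
>>>   cufftDestroy(fftInv);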
>>
>> Good point, I changed it:
>>
>> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35
>>
>>>
>>> - cufftExecR2C and cufftExecC2R can be executed in-place. However, I do not
>>> have a clear idea of how to pad deviceProjection to the required size of
>>> its cufftComplex counterpart.
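>>>
>>> One way to obtain the padded layout (a sketch with illustrative names, not
>>> checked against the actual filter) would be a strided device-to-device copy
>>> into a suitably sized buffer:
>>>
>>>   // Copy x*y*z tightly packed floats into a buffer whose rows are padded
>>>   // to 2*(x/2+1) floats, as required by an in-place R2C transform.
>>>   size_t srcPitch = (size_t)x * sizeof(float);
>>>   size_t dstPitch = (size_t)(2 * (x / 2 + 1)) * sizeof(float);
>>>   cudaMemcpy2D(paddedBuffer, dstPitch,
>>>                deviceProjection, srcPitch,
>>>                srcPitch, (size_t)y * z,
>>>                cudaMemcpyDeviceToDevice);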
>>
>> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is
>> not an itk::InPlaceImageFilter. It might be possible but I would have to
>> check. Let me know if you investigate this further.
>> Thanks again,
>> Simon
>>
>>>
>>>
>>> Any comments?
>>>
>>> Best regards,
>>> Chao
>>>
>>>
>>>
>>> 2014-05-21 14:30 GMT+02:00 Simon Rit <simon.rit at creatis.insa-lyon.fr>:
>>>
>>>> Since it fails in cufft, it is the memory of the projections that is the
>>>> problem. Therefore, it is not surprising that --divisions has no
>>>> influence. But --lowmem should have an influence. I would suggest:
>>>> - uncommenting
>>>> //#define VERBOSE
>>>> in itkCudaImageDataManager.hxx and checking how much memory is requested;
>>>> - trying to reproduce the problem with simulated data so that we can
>>>> help you find a solution.
>>>> Simon
>>>>
>>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu <wuchao04 at gmail.com> wrote:
>>>> > Hi Simon,
>>>> >
>>>> > Yes, I switched the --lowmem option on and off and it has no influence
>>>> > on the behaviour I mentioned.
>>>> > In my case the system memory is sufficient to hold the projections plus
>>>> > the volume; the major bottleneck is the amount of graphics memory.
>>>> > If I reconstruct a few more slices than the limit I found with one
>>>> > stream, the allocation of GPU resources for cuFFT in the
>>>> > CudaFFTRampImageFilter fails (which was more or less expected).
>>>> > However, with --divisions > 1 it is indeed able to reconstruct more
>>>> > slices, but only very few more; otherwise the cuFFT would fail again.
>>>> > I would expect the limit on the number of slices to be approximately
>>>> > proportional to the number of streams, or am I missing something about
>>>> > stream division?
>>>> >
>>>> > Thanks,
>>>> > Chao
>>>> >
>>>> >
>>>> >
>>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit <simon.rit at creatis.insa-lyon.fr>:
>>>> >
>>>> >> Hi Chao,
>>>> >> There are two things that use memory, the volume and the projections.
>>>> >> The --divisions option divides the volume only. The --lowmem option
>>>> >> works on a subset of projections at a time. Did you try this?
>>>> >> Simon
>>>> >>
>>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu <wuchao04 at gmail.com> wrote:
>>>> >> > Hi,
>>>> >> >
>>>> >> > I may need some hints about how the stream division works in rtkfdk.
>>>> >> > I noticed that the StreamingImageFilter from ITK is used, but I cannot
>>>> >> > quickly figure out how the division is performed.
>>>> >> > I did some tests reconstructing 400 projections of 1500x1200 pixels
>>>> >> > into a 640xNx640 volume (the pixel and voxel sizes are comparable).
>>>> >> > The reconstructions were executed by rtkfdk with CUDA.
>>>> >> > When I leave the origin of the volume at the center by default, I can
>>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the
>>>> >> > limitation of the graphics memory. When I increase the number of
>>>> >> > divisions to 2, I can only reconstruct up to 215 slices, and with 3
>>>> >> > divisions only up to 219 slices. Does anyone have an idea why it
>>>> >> > scales like this?
>>>> >> > Thanks in advance.
>>>> >> >
>>>> >> > Best regards,
>>>> >> > Chao
>>>> >> >
>>>> >> > _______________________________________________
>>>> >> > Rtk-users mailing list
>>>> >> > Rtk-users at openrtk.org
>>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users
>>>> >> >
>>>> >
>>>> >
>>>
>>>
>>
>


