[Rtk-users] Stream divisions in rtkfdk

Chao Wu wuchao04 at gmail.com
Tue May 27 08:24:19 EDT 2014


Hi Simon,

Thanks for your reply. I have been looking into the in-place FFT these days,
and tuning the number of projections sent to the ramp filter is exactly what
I planned to look into next, so now I know where to look. I think it would be
a good idea to make it an option of rtkfdk, or to regulate it automatically
by querying the amount of free memory with cudaMemGetInfo and estimating the
memory needed for storing the projections, the ramp kernel, the FFT plan and
the chunk of volume. The latter may be difficult though, since such an
estimation is not easy at a stage before the projections have even been
padded...
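Something along these lines could be a starting point for the automatic
regulation (just a rough sketch; the chooseProjectionChunk helper and the
per-projection factor are made up, not existing RTK code):

#include <cuda_runtime.h>
#include <cstddef>

// Rough sketch: choose how many projections to send to the ramp filter at
// once, based on the free GPU memory reported by cudaMemGetInfo.
int chooseProjectionChunk(std::size_t paddedProjectionBytes, int requested)
{
  std::size_t freeBytes = 0, totalBytes = 0;
  if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess)
    return requested; // query failed, keep the requested chunk size

  // Assume roughly 3x the padded projection size per projection (input data,
  // FFT output, plan work area) and keep a 10% safety margin. These factors
  // are guesses and would need tuning.
  std::size_t usable = static_cast<std::size_t>(0.9 * freeBytes);
  std::size_t perProjection = 3 * paddedProjectionBytes;
  int fit = static_cast<int>(usable / perProjection);
  if (fit < 1) fit = 1;
  return fit < requested ? fit : requested;
}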

Back to the in-place FFT subject. I am not sure about ITK's FFT, but both
FFTW and cuFFT can perform the FFT in-place. So in principle
rtk::CudaFFTRampImageFilter could be made in-place, and rtk::FFTRampImageFilter
could be as well if FFTW is used. However, "in-place" here applies at a lower
level and may not be compatible with the meaning of "in-place" in
itk::InPlaceImageFilter.
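For instance, with FFTW (single precision) an in-place real-to-complex
transform only requires the real buffer to be padded to 2*(n/2+1) floats; a
minimal sketch, not RTK code:

#include <fftw3.h>

void fftwInPlaceExample(int n)
{
  // For an in-place r2c transform the real array must hold 2*(n/2+1) floats.
  float *data = fftwf_alloc_real(2 * (n / 2 + 1));
  fftwf_complex *spectrum = reinterpret_cast<fftwf_complex *>(data);

  fftwf_plan plan = fftwf_plan_dft_r2c_1d(n, data, spectrum, FFTW_ESTIMATE);
  // ... fill data[0..n-1] with the projection row ...
  fftwf_execute(plan); // the spectrum overwrites the input buffer

  fftwf_destroy_plan(plan);
  fftwf_free(data);
}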

Anyway, since system memory is not a problem for me, I will only focus on the
Cuda filter. I already have a somewhat "dirty" implementation for my own use:

First, in rtkCudaFFTRampImageFilter.cu I commented out the cudaMalloc and
cudaFree of deviceProjectionFFT and simply set deviceProjectionFFT = (float2*)
deviceProjection. The cuFFT is now in-place; the only requirement is that the
size of the buffer (now shared by deviceProjectionFFT and deviceProjection)
must be 2*(x/2+1)*y*z instead of x*y*z.
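To make the idea concrete, here is a standalone sketch of an in-place,
row-wise R2C transform with cuFFT. It is not the actual
rtkCudaFFTRampImageFilter.cu code (the plan layout there may differ), just an
illustration of the pointer aliasing and the enlarged buffer:

#include <cufft.h>
#include <cuda_runtime.h>

void inPlaceRowFFT(int x, int y, int z)
{
  // For an in-place R2C transform each real x-row must be stored with a
  // stride of 2*(x/2+1) floats, hence the enlarged buffer.
  const int paddedX = 2 * (x / 2 + 1);
  float *deviceProjection = 0;
  cudaMalloc((void **)&deviceProjection,
             sizeof(float) * paddedX * (size_t)y * z);

  // Alias the complex view onto the same buffer: no second allocation.
  cufftComplex *deviceProjectionFFT = (cufftComplex *)deviceProjection;

  cufftHandle fftFwd;
  int n[1]       = { x };           // 1D transform length along x
  int inembed[1] = { paddedX };     // real-row stride, in floats
  int onembed[1] = { x / 2 + 1 };   // complex-row stride, in cufftComplex
  cufftPlanMany(&fftFwd, 1, n,
                inembed, 1, paddedX,
                onembed, 1, x / 2 + 1,
                CUFFT_R2C, y * z);
  cufftExecR2C(fftFwd, deviceProjection, deviceProjectionFFT);
  cufftDestroy(fftFwd); // destroy the plan right after execution
  // ... multiply deviceProjectionFFT by the ramp kernel, then C2R back ...

  cudaFree(deviceProjection);
}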

Then I moved on to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above
is maintained in paddedImage. Its size is determined in PadInputImageRegion(…)
(line 60), and the actual GPU memory allocation and CPU-to-GPU data copy are
triggered by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line
98). My first attempt was to give paddedImage different image regions by
modifying FFTRampImageFilter<…>::PadInputImageRegion(…) in
rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z, storing
the padded projection data as it does now, while its BufferedRegion becomes
2*(x/2+1) by y by z, with the additional part reserved for the in-place FFT.
Other small changes were made to compute inputDimension and kernelDimension
correctly based on the RequestedRegion. I later realized that this does not
work, since cuFFT sees the buffer simply as a linear space: all image data
should lie contiguously at the beginning of the buffer with all unused space
at the end, but here the reserved space sat at the end of each row along the
x (first) dimension and was therefore scattered throughout the linear buffer.
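In other words (illustrative only, not RTK code), the two layouts index a
voxel (i, j, k) differently; the enlarged BufferedRegion interleaves the
padding with the data, while the existing copy and filtering code expect the
packed layout:

#include <cstddef>

// Index with BufferedRegion 2*(x/2+1) by y by z: each x-row of length x is
// followed by its own unused padding, scattered through the linear buffer.
size_t paddedRowIndex(int i, int j, int k, int x, int y)
{
  return i + (size_t)(2 * (x / 2 + 1)) * (j + (size_t)y * k);
}

// Index with the data packed contiguously from the start of the buffer,
// with all unused space collected at the end.
size_t packedIndex(int i, int j, int k, int x, int y)
{
  return i + (size_t)x * (j + (size_t)y * k);
}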

So this is where the "dirty" changes start. First of all, instead of calling
PadInputImageRegion(…) at line 60 of rtkCudaFFTRampImageFilter.cxx, I call an
altered version named PadInputImageRegionInPlaceFFT(…) (since I have not
checked whether the modification also works on the CPU or in other
situations, I prefer to branch where possible rather than change code
directly). The latter is a copy of the former in rtkFFTRampImageFilter.txx,
with the only change being that the allocation call paddedImage->Allocate()
becomes paddedImage->AllocateInPlaceFFT(). In turn,
CudaImage<…>::AllocateInPlaceFFT() is an altered version of
CudaImage<…>::Allocate() in itkCudaImage.hxx: after computing and setting
CudaDataManager::m_BufferSize as before, it also computes the buffer size
required for the in-place FFT and stores that value in a new member of
CudaDataManager, m_BufferSizeInPlaceFFT. Then, in
CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of
simply calling this->Allocate(), I first check whether m_BufferSize and
m_BufferSizeInPlaceFFT are equal. If not, I set m_BufferSize =
m_BufferSizeInPlaceFFT before calling this->Allocate(), and restore
m_BufferSize to its original value afterwards. Other changes ensure that
m_BufferSizeInPlaceFFT otherwise always equals m_BufferSize for backward
compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void
CudaDataManager::SetBufferSize(unsigned int num), so that other allocation
paths (although I have not checked them one by one) are not affected by the
new code. Finally, in GPUMemPointer::Allocate(size_t bufferSize) in
itkCudaDataManager.h, I added a cudaMemset after the cudaMalloc to
zero-initialize the buffer, since the additional space in this buffer never
gets a chance to be initialized by the CPU-to-GPU data copy: the copied data
is shorter than the buffer.
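For clarity, here is a condensed sketch of the last two hooks. The member
names are approximations of the real ITK/CudaCommon code and
m_BufferSizeInPlaceFFT is my own addition, so treat it as a sketch rather
than a drop-in patch:

#include <cuda_runtime.h>
#include <cstddef>

// In CudaDataManager::UpdateGPUBuffer(): temporarily swap in the larger size
// so that Allocate() reserves room for the in-place FFT, then restore it so
// the CPU-to-GPU copy length stays unchanged.
//
//   if (m_BufferSize != m_BufferSizeInPlaceFFT)
//   {
//     unsigned int originalSize = m_BufferSize;
//     m_BufferSize = m_BufferSizeInPlaceFFT;
//     this->Allocate();
//     m_BufferSize = originalSize;
//   }
//   else
//   {
//     this->Allocate();
//   }

// In GPUMemPointer::Allocate(size_t bufferSize): zero the whole buffer right
// after cudaMalloc, since the later CPU-to-GPU copy only covers the original
// (shorter) data length. A standalone equivalent:
void allocateZeroedGpuBuffer(void **gpuBuffer, std::size_t bufferSize)
{
  cudaMalloc(gpuBuffer, bufferSize);
  cudaMemset(*gpuBuffer, 0, bufferSize); // padding stays well-defined (zero)
}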

It works for me so far. Please let me know if you can think of a better way
to implement this. Thank you.

Best regards,
Chao

2014-05-27 0:12 GMT+02:00 Simon Rit <simon.rit at creatis.insa-lyon.fr>:

> Hi Chao,
> Thanks for the detailed report.
>
>
> On Thu, May 22, 2014 at 10:06 AM, Chao Wu <wuchao04 at gmail.com> wrote:
>
>> Hi Simon,
>>
>> Thanks for the suggestions.
>>
>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by:
>>
>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384
>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing
>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt
>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4
>> --dimension 640,250,640 --hardware=cuda -v -l
>>
>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of
>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM
>> usage.
>> I found that the size of the volume data in the GRAM could be reduced by
>> --divisions but the amount of projection data sent to the GRAM is not
>> influenced by the --lowmem switch.
>>
> After looking at the code again, lowmem acts on the reading so it's not
> related to the GPU memory but to the CPU memory, sorry about that. The
> reconstruction algorithm does stream the projections but it processes by
> default 16 projections at a time. You can change this in
> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will
> reduce your GPU memory consumption (I checked and it works for me). Let me
> know if it works for you and if you think that this should be made an
> option of rtkfdk.
>
>
>> So --divisions does not help much if it is mainly the projection data
>> which takes up GRAM, while --lowmem does not help at all. I did not look
>> into the earlier part of the code so I am not sure if this is the
>> designed behaviour.
>>
>> On the other hand, I am also looking for possibilities to reduce GRAM
>> used in the CUDA ramp filter. At least one thing should be changed, and one
>> thing may be considered:
>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be
>> destroyed earlier, right after the plan has been executed. A plan takes up at
>> least the same amount of memory as the data.
>>
> Good point, I changed it:
>
> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35
>
>
>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a
>> clear idea about how to pad deviceProjection to the required size of
>> its cufftComplex counterpart.
>>
> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is
> not an itk::InPlaceImageFilter. It might be possible but I would have to
> check. Let me know if you investigate this further.
> Thanks again,
> Simon
>
>
>>
>> Any comments?
>>
>> Best regards,
>> Chao
>>
>>
>>
>> 2014-05-21 14:30 GMT+02:00 Simon Rit <simon.rit at creatis.insa-lyon.fr>:
>>
>>> Since it fails in cufft, it's the memory of the projections that is a
>>> problem. Therefore, it is not surprising that --divisions has no
>>> influence. But --lowmem should have an influence. I would suggest:
>>> - to uncomment
>>> //#define VERBOSE
>>> in itkCudaImageDataManager.hxx and try to see what amount of memory
>>> is requested.
>>> - to try to reproduce the problem with simulated data so that we can
>>> help you in finding a solution.
>>> Simon
>>>
>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu <wuchao04 at gmail.com> wrote:
>>> > Hi Simon,
>>> >
>>> > Yes, I switched the --lowmem option on and off and it has no influence
>>> > on the behaviour I mentioned.
>>> > In my case the system memory is sufficient to handle the projections
>>> > plus the volume.
>>> > The major bottleneck is the amount of graphics memory.
>>> > If I reconstruct a few more slices than the limit that I found with
>>> > one stream, the allocation of GPU resources for CUFFT in the
>>> > CudaFFTRampImageFilter will fail (which was more or less expected).
>>> > However, with --divisions > 1 it is indeed able to reconstruct more
>>> > slices, but only a very few more; otherwise the CUFFT fails again.
>>> > I would expect the limit on the number of slices to be approximately
>>> > proportional to the number of streams, or am I missing something about
>>> > stream division?
>>> >
>>> > Thanks,
>>> > Chao
>>> >
>>> >
>>> >
>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit <simon.rit at creatis.insa-lyon.fr>:
>>> >
>>> >> Hi Chao,
>>> >> There are two things that use memory, the volume and the projections.
>>> >> The --divisions option divides the volume only. The --lowmem option
>>> >> works on a subset of projections at a time. Did you try this?
>>> >> Simon
>>> >>
>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu <wuchao04 at gmail.com> wrote:
>>> >> > Hoi,
>>> >> >
>>> >> > I may need a hint about how the stream division works in rtkfdk.
>>> >> > I noticed that the StreamingImageFilter from ITK is used, but I
>>> >> > cannot quickly figure out how the division is performed.
>>> >> > I did some tests reconstructing 400 1500x1200 projections into a
>>> >> > 640xNx640 volume (the pixel and voxel sizes are comparable).
>>> >> > The reconstructions were executed by rtkfdk with CUDA.
>>> >> > When I leave the origin of the volume at the center by default, I
>>> >> > can reconstruct up to N=200 slices with --divisions=1 due to the
>>> >> > limitation of the graphics memory. Then when I increase the number
>>> >> > of divisions to 2, I can only reconstruct up to 215 slices; and with
>>> >> > 3 divisions only up to 219 slices. Does anyone have an idea why it
>>> >> > scales like this?
>>> >> > Thanks in advance.
>>> >> >
>>> >> > Best regards,
>>> >> > Chao
>>> >> >
>>> >> > _______________________________________________
>>> >> > Rtk-users mailing list
>>> >> > Rtk-users at openrtk.org
>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users
>>> >> >
>>> >
>>> >
>>>
>>
>>
>