From wuchao04 at gmail.com Wed May 21 06:18:57 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Wed, 21 May 2014 12:18:57 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk Message-ID: Hoi, I may need some hints about how the stream division works in rtkfdk. I noticed that the StreamingImageFilter from ITK is used, but I cannot quickly figure out how the division is performed. I did some tests reconstructing 400 1500x1200 projections into a 640xNx640 volume (the pixel and voxel sizes are comparable). The reconstructions were executed by rtkfdk with CUDA. When I leave the origin of the volume at the center by default, I can reconstruct up to N=200 slices with --divisions=1 due to the limited graphics memory. Then when I increase the number of divisions to 2, I can only reconstruct up to 215 slices, and with 3 divisions only up to 219 slices. Does anyone have an idea why it scales like this? Thanks in advance. Best regards, Chao -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 21 07:43:40 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 13:43:40 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, There are two things that use memory: the volume and the projections. The --divisions option divides the volume only. The --lowmem option works on a subset of projections at a time. Did you try this? Simon On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > Hoi, > > I may need some hint about how the stream division works in rtkfdk. > I noticed that the StreamingImageFilter from ITK is used but I cannot figure > out quickly how the division has been performed. > I did some test with reconstructing 400 1500x1200 projections into a > 640xNx640 volume (the pixel and voxel size are comparable). > The reconstructions were executed by rtkfdk with CUDA.
> When I leave the origin of the volume at the center by default, I can > reconstruct up to N=200 slices with --divisions=1 due to the limitation of > the graphic memory. Then when I increase the number of divisions to 2, I can > only reconstruct up to 215 slices; and with divisions to 3 only up to 219 > slices. Does anyone have an idea why it scales like this? > Thanks in advance. > > Best regards, > Chao > > _______________________________________________ > Rtk-users mailing list > Rtk-users at openrtk.org > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > From wuchao04 at gmail.com Wed May 21 08:21:00 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Wed, 21 May 2014 14:21:00 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Yes, I switched the --lowmem option on and off and it has no influence on the behaviour I mentioned. In my case the system memory is sufficient to handle the projections plus the volume. The major bottleneck is the amount of graphics memory. If I reconstruct slightly more slices than the limit I found with one stream, the allocation of GPU resources for CUFFT in the CudaFFTRampImageFilter fails (which was more or less expected). However, with --divisions > 1 it is indeed able to reconstruct more slices, but only a few more; beyond that, CUFFT fails again. I would expect the maximum number of slices to be approximately proportional to the number of streams, or am I missing something about the stream division? Thanks, Chao 2014-05-21 13:43 GMT+02:00 Simon Rit : > Hi Chao, > There are two things that use memory, the volume and the projections. > The --divisions option divides the volume only. The --lowmem option > works on a subset of projections at a time. Did you try this? > Simon > > On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > > Hoi, > > > > I may need some hint about how the stream division works in rtkfdk.
> > I noticed that the StreamingImageFilter from ITK is used but I cannot > figure > out quickly how the division has been performed. > > I did some test with reconstructing 400 1500x1200 projections into a > > 640xNx640 volume (the pixel and voxel size are comparable). > > The reconstructions were executed by rtkfdk with CUDA. > > When I leave the origin of the volume at the center by default, I can > > reconstruct up to N=200 slices with --divisions=1 due to the limitation > of > > the graphic memory. Then when I increase the number of divisions to 2, I > can > > only reconstruct up to 215 slices; and with divisions to 3 only up to 219 > > slices. Does anyone have an idea why it scales like this? > > Thanks in advance. > > > > Best regards, > > Chao > > > > _______________________________________________ > > Rtk-users mailing list > > Rtk-users at openrtk.org > > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 21 08:30:21 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 14:30:21 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Since it fails in cufft, it is the memory of the projections that is the problem. Therefore, it is not surprising that --divisions has no influence. But --lowmem should have an influence. I would suggest: - uncommenting //#define VERBOSE in itkCudaImageDataManager.hxx to see how much memory is requested; - trying to reproduce the problem with simulated data so that we can help you find a solution. Simon On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > Hi Simon, > > Yes I switched on an off the --lowmem option and it has no influence on the > behaviour I mentioned. > In my case the system memory is sufficient to handle the projections plus > the volume.
> The major bottleneck is the amount of graphics memory. > If I reconstruct a little bit more slices than the limit that I found with > one stream, the allocation of GPU resource for CUFFT in the > CudaFFTRampImageFilter will fail (which was more or less expected). > However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> > > >> > Best regards >> > Chao >> > >> > _______________________________________________ >> > Rtk-users mailing list >> > Rtk-users at openrtk.org >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> > > > From simon.rit at creatis.insa-lyon.fr Wed May 21 10:19:26 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 16:19:26 +0200 Subject: [Rtk-users] Backward incompatible change: angles in radians Message-ID: Dear all, Be aware that I have just pushed a backward-incompatible change: https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194 I usually try to maintain backward compatibility, but I felt that the class rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on: - all angles stored or returned by the class are in radians; - only the function AddProjection takes angles in degrees as parameters. AddProjectionInRadians lets you avoid converting angles that are already in radians if you prefer it; - angles in geometry files are still in degrees. I believe you will only have issues with this if you were using one of the following methods: - GetGantryAngles - GetOutOfPlaneAngles - GetInPlaneAngles The returned values are now in radians, no longer in degrees. I apologize in advance for any inconvenience and I am available to help if this causes you any trouble. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From wuchao04 at gmail.com Thu May 22 04:06:44 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Thu, 22 May 2014 10:06:44 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for the suggestions. The problem can be reproduced here (8G RAM, 1.5G GRAM, RTK 1.0.0) by: rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt rtkfdk -p .
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l With #define VERBOSE (by the way, I found it in itkCudaDataManager.cxx instead of itkCudaImageDataManager.hxx) I can now have a better view of the GRAM usage. I found that the size of the volume data in the GRAM can be reduced by --divisions, but the amount of projection data sent to the GRAM is not influenced by the --lowmem switch. So --divisions does not help much if it is mainly the projection data that takes up GRAM, while --lowmem does not help at all. I did not look into the earlier part of the code, so I am not sure whether this is the intended behaviour. On the other hand, I am also looking for possibilities to reduce the GRAM used in the CUDA ramp filter. At least one thing should be changed, and one thing may be considered: - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be destroyed earlier, right after the plan is executed. A plan takes up at least the same amount of memory as the data. - cufftExecR2C and cufftExecC2R can be in-place. However, I do not have a clear idea of how to pad deviceProjection to the required size of its cufftComplex counterpart. Any comments? Best regards, Chao 2014-05-21 14:30 GMT+02:00 Simon Rit : > Since it fails in cufft, it's the memory of the projections that is a > problem. Therefore, it is not surprising that --divisions has no > influence. But --lowmem should have an influence. I would suggest: > - to uncomment > //#define VERBOSE > in itkCudaImageDataManager.hxx and try to see what amount of memory > are requested. > - to try to reproduce the problem with simulated data so that we can > help you in finding a solution. > Simon > > On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > > Hi Simon, > > > > Yes I switched on an off the --lowmem option and it has no influence on > the > > behaviour I mentioned. > > In my case the system memory is sufficient to handle the projections plus > > the volume.
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> > > Best regards > >> > Chao > >> > > >> > _______________________________________________ > >> > Rtk-users mailing list > >> > Rtk-users at openrtk.org > >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Mon May 26 18:12:50 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 00:12:50 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, Thanks for the detailed report. On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > Hi Simon, > > Thanks for the suggestions. > > The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: > > rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 > rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing > 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt > rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 > --dimension 640,250,640 --hardware=cuda -v -l > > With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of > itkCudaImageDataManager.hxx) now I can have a better view of the GRAM > usage. > I found that the size of the volume data in the GRAM could be reduced by > --divisions but the amount of projection data sent to the GRAM are not > influenced by --lowmem switch. > After looking at the code again, lowmem acts on the reading, so it is not related to the GPU memory but to the CPU memory, sorry about that. The reconstruction algorithm does stream the projections, but by default it processes 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx, line 28, to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk.
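For anyone else tuning this value, the effect of the per-chunk projection count on GPU memory can be sketched with rough arithmetic. This is a back-of-the-envelope illustration only: the power-of-two zero-padding rule and the "one real plus one complex buffer" accounting below are assumptions, not RTK's exact buffer layout.

```python
# Rough GPU-memory estimate for one chunk of projections in an FFT-based
# ramp filter. Padding rule and buffer accounting are illustrative
# assumptions; RTK's actual allocations may differ.
def ramp_chunk_bytes(width, height, n_proj, bytes_per_float=4):
    padded = 1
    while padded < 2 * width:   # assume zero-padding to the next power of
        padded *= 2             # two of twice the projection width
    real_buf = padded * height * n_proj * bytes_per_float               # padded input
    cplx_buf = (padded // 2 + 1) * height * n_proj * 2 * bytes_per_float  # R2C output
    return real_buf + cplx_buf

w, h = 1944, 1536  # projection size from the example in this thread
for chunk in (16, 2):
    mib = ramp_chunk_bytes(w, h, chunk) / 2**20
    print(f"{chunk:2d} projections per chunk: ~{mib:.0f} MiB")  # ~768 vs ~96
```

Under these assumptions, dropping from 16 to 2 projections per chunk cuts the ramp filter's working set by a factor of eight, which is consistent with the out-of-memory behaviour reported on a 1.5 GB card.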
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction. 
I was looking into the in-place FFT these days, and the way of tuning the number of projections sent to the ramp filter is exactly what I planned to look at next, so now I know it directly. I think it would be a good idea to make it an option of rtkfdk, or to regulate it automatically by querying the amount of free memory with cudaMemGetInfo and estimating the memory needed for storing the projections, ramp kernel, FFT plan and the chunk of volume. The latter may be difficult though, since such an estimate is not easy at that stage, even before the projections are padded... Back to the in-place FFT subject. I am not sure about ITK's FFT, but both FFTW and cuFFT can perform the FFT in-place. So in principle rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter may also be made in-place if FFTW is used. However, the "in-place" here is on a lower level and may not be compatible with the meaning of "in-place" in itk::InPlaceImageFilter. Anyway, since system memory is not a problem for me, I only focus on the CUDA filter. I already have a sort of "dirty" implementation for my own use: First, in rtkCudaFFTRampImageFilter.cu I commented out the cudaMalloc and cudaFree of deviceProjectionFFT, and simply let deviceProjectionFFT = (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that the size of the buffer (now used by both deviceProjectionFFT and deviceProjection) should be 2*(x/2+1)*y*z instead of x*y*z. Then I moved on to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above is maintained in paddedImage. Its size is determined in PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and CPU-to-GPU data copying are done by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first attempt was to make the image regions of paddedImage differ from each other by modifying FFTRampImageFilter::PadInputImageRegion(...) in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z, storing the padded projection data as it does now, while its BufferedRegion becomes 2*(x/2+1) by y by z, with the additional part reserved for the in-place FFT. Other small changes were made to calculate inputDimension and kernelDimension correctly based on the RequestedRegion. Later I realized that this did not work, since cuFFT sees the buffer just as a linear space. All image data should be contiguous from the beginning of the buffer with all unused space at the end, but in this case the reserved space sat at the end of each row along the x (first) dimension, so it was scattered throughout the linear buffer. This is where the "dirty" changes started. First of all, instead of calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an altered copy named PadInputImageRegionInPlaceFFT(...) (because I did not check whether the modification also works for the CPU or any other situation, I prefer to make branches where possible instead of direct changes). The latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only change being the allocation call: paddedImage->Allocate() becomes paddedImage->AllocateInPlaceFFT(). In turn, CudaImage::AllocateInPlaceFFT() is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. There, after CudaDataManager::m_BufferSize is computed and set as before, I also calculate the required buffer size for the in-place FFT and store the value in a new member of CudaDataManager, namely m_BufferSizeInPlaceFFT. Then, in CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check whether m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I set m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and afterwards restore m_BufferSize to its original value.
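As a side note for readers of the archive: the 2*(x/2+1)*y*z size used in this thread is the standard in-place real-to-complex layout of FFTW and cuFFT, where each row of x real samples lives in a buffer row of 2*(x/2+1) floats so that the x/2+1 complex outputs can overwrite the input. A small NumPy sketch of that layout (purely illustrative, not RTK code; the array names are made up):

```python
import numpy as np

# In-place R2C convention: a real row of x samples is stored in a buffer row
# of 2*(x//2 + 1) floats, so the x//2 + 1 complex FFT outputs can overwrite
# the input without a separate output buffer.
x, y = 8, 3                                  # toy projection: 8 wide, 3 rows
buf = np.zeros((y, 2 * (x // 2 + 1)), dtype=np.float32)
data = np.random.default_rng(0).random((y, x), dtype=np.float32)
buf[:, :x] = data                            # signal fills the first x floats

spectrum = np.fft.rfft(buf[:, :x], axis=1)   # x//2 + 1 complex values per row
# Counted as real numbers, the spectrum is exactly as wide as the padded row:
assert spectrum.shape == (y, x // 2 + 1)
assert 2 * spectrum.shape[1] == buf.shape[1]
```

Because the padding is per row (along x), the data is no longer tightly packed in the linear buffer, which is the layout issue discussed above when widening the BufferedRegion.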
Other changes were made to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for backward compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked them one by one) are not influenced by the new code. Finally, in GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after the cudaMalloc I added a cudaMemset to initialize the buffer to all zeros, since the additional space in this buffer never gets a chance to be initialized later by CPU-to-GPU data copying; the length of the data is shorter than the buffer size. It works for me so far. Please see if you have a better way to implement this. Thank you. Best regards, Chao 2014-05-27 0:12 GMT+02:00 Simon Rit : > Hi Chao, > Thanks for the detailed report. > > > On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > >> Hi Simon, >> >> Thanks for the suggestions. >> >> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >> >> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >> --dimension 640,250,640 --hardware=cuda -v -l >> >> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM >> usage. >> I found that the size of the volume data in the GRAM could be reduced by >> --divisions but the amount of projection data sent to the GRAM are not >> influenced by --lowmem switch. >> > After looking at the code again, lowmem acts on the reading so it's not > related to the GPU memory but on the CPU memory, sorry about that.
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right? 
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performance? If you had FFTW, can you >> provide the size of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags).
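A practical footnote on the FFTW advice quoted above: ITK_USE_FFTWD and ITK_USE_FFTWF are CMake options of the ITK build itself, so enabling them looks roughly like the following sketch (paths are placeholders, and FFTW must be installed or fetched by the build):

```shell
# Sketch: reconfigure an existing ITK build tree with FFTW-backed FFTs
# (double and single precision), then rebuild. Paths are placeholders.
cd /path/to/ITK-build
cmake -DITK_USE_FFTWD=ON -DITK_USE_FFTWF=ON /path/to/ITK-source
make
# Then re-run CMake/make on the RTK build tree against this ITK build.
```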
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performances? If you had FFTW, can you >>> provide the sie of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit. 
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such memory limitation problems that you want to go in that direction? Using the two streaming options (--subsetsize + --divisions), you should be able to sufficiently reduce your memory consumption. If you really want to go further in the in-place implementation, I think a code patch would be more helpful but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at the stage even before padding the projections... > > Back to the in-place FFT subject.
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the "in-place" here is on > a lower level and may not be compatible with the meaning of "in-place" of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have sort of "dirty" implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt is to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(...) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing > the padded projection data as how it works now; while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space.
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension so that they were distributed in the > linear buffer. > > So this was where the "dirty" changes started. First of all, instead of > calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change of the call for allocation from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and set of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > stored the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check > if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT > = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code.
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM is not >>> influenced by the --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading, so it relates >> not to the GPU memory but to the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk.
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
Then, your 10 slices are sufficient but the default unmatched forward/back-projector (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for a description of this) gives bad results. You can now solve this if you match them with the option --bp NormalizedJoseph that Cyril has implemented. So even a better implementation of SART (the current one) does not solve the problems that you have pointed out. You need a large enough CT image given the input data to solve the problem. I hope this will be helpful, maybe not to you if it's too late but to some others. Simon On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote: > Hi Simon, > > I believe it really is a problem with the sum of the weights. > > I first tried with the Shepp-Logan-phantom and afterwards with my data. > The geometry is that of a standard cone-beam micro-CT. > > The data I posted before were the reconstruction of just the middle > plane. As I did the same with the Shepp-Logan-phantom data, similar > effects were seen. As soon as one reconstructs a larger region around > the middle plane, the artefacts vanish in the inner parts of the > reconstructed volume, while in the top and bottom parts artefacts remain. > > The program calls were: > > create geometry: > ---------------- > rtksimulatedgeometry --nproj="1200" --output="geometry.xml" > --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360" > > project the phantom: > -------------------- > rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing > 2.5 --dimension 256 --phantomfile SheppLogan.txt > > do a reference FDK reconstruction: > ---------------------------------- > rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > SART reconstruction of middle plane: > ------------------------------------ > rtksart -p .
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the >>> same parameters and already tried with different values of lambda and >>> niterations. >>> >>> Does anyone have an idea what went wrong? Is there some kind of >>> smoothing or regularisation applied in the SART implementation? >>> >>> Many thanks in advance! >>> >>> Cheers, >>> >>> Andreas >>> >>> >>> -- >>> >>> =============================================================== >>> Dr. Andreas Staude >>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie >>> BAM Bundesanstalt f?r Materialforschung und -pr?fung >>> Unter den Eichen 87 >>> D-12205 Berlin >>> Germany >>> >>> Tel.: ++49 30 8104 4140 >>> Fax: ++49 30 8104 1837 >>> =============================================================== >>> >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > > -- > > =============================================================== > Dr. Andreas Staude > Fachbereich 8.5 "Mikro-ZfP", Computertomographie > BAM Bundesanstalt f?r Materialforschung und -pr?fung > Unter den Eichen 87 > D-12205 Berlin > Germany > > Tel.: ++49 30 8104 4140 > Fax: ++49 30 8104 1837 > =============================================================== From wuchao04 at gmail.com Wed May 21 06:18:57 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Wed, 21 May 2014 12:18:57 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk Message-ID: Hoi, I may need some hint about how the stream division works in rtkfdk. I noticed that the StreamingImageFilter from ITK is used but I cannot figure out quickly how the division has been performed. I did some test with reconstructing 400 1500x1200 projections into a 640xNx640 volume (the pixel and voxel size are comparable). The reconstructions were executed by rtkfdk with CUDA. 
When I leave the origin of the volume at the center by default, I can reconstruct up to N=200 slices with --divisions=1 due to the limitation of the graphic memory. Then when I increase the number of divisions to 2, I can only reconstruct up to 215 slices; and with divisions to 3 only up to 219 slices. Does anyone have an idea why it scales like this? Thanks in advance. Best regards, Chao -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 21 07:43:40 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 13:43:40 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, There are two things that use memory, the volume and the projections. The --divisions option divides the volume only. The --lowmem option works on a subset of projections at a time. Did you try this? Simon On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > Hoi, > > I may need some hint about how the stream division works in rtkfdk. > I noticed that the StreamingImageFilter from ITK is used but I cannot figure > out quickly how the division has been performed. > I did some test with reconstructing 400 1500x1200 projections into a > 640xNx640 volume (the pixel and voxel size are comparable). > The reconstructions were executed by rtkfdk with CUDA. > When I leave the origin of the volume at the center by default, I can > reconstruct up to N=200 slices with --divisions=1 due to the limitation of > the graphic memory. Then when I increase the number of divisions to 2, I can > only reconstruct up to 215 slices; and with divisions to 3 only up to 219 > slices. Does anyone have an idea why it scales like this? > Thanks in advance. 
> > Best regards, > Chao > > _______________________________________________ > Rtk-users mailing list > Rtk-users at openrtk.org > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > From wuchao04 at gmail.com Wed May 21 08:21:00 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Wed, 21 May 2014 14:21:00 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Yes I switched on an off the --lowmem option and it has no influence on the behaviour I mentioned. In my case the system memory is sufficient to handle the projections plus the volume. The major bottleneck is the amount of graphics memory. If I reconstruct a little bit more slices than the limit that I found with one stream, the allocation of GPU resource for CUFFT in the CudaFFTRampImageFilter will fail (which was more or less expected). However with --divisions > 1 it is indeed able to reconstruct more slices, but only a very few more; otherwise the CUFFT would fail again. I would expect the limitations of the amount of slices to be approximately proportional to the number of streams, or do I miss anything about stream division? Thanks, Chao 2014-05-21 13:43 GMT+02:00 Simon Rit : > Hi Chao, > There are two things that use memory, the volume and the projections. > The --divisions option divides the volume only. The --lowmem option > works on a subset of projections at a time. Did you try this? > Simon > > On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > > Hoi, > > > > I may need some hint about how the stream division works in rtkfdk. > > I noticed that the StreamingImageFilter from ITK is used but I cannot > figure > > out quickly how the division has been performed. > > I did some test with reconstructing 400 1500x1200 projections into a > > 640xNx640 volume (the pixel and voxel size are comparable). > > The reconstructions were executed by rtkfdk with CUDA. 
> > When I leave the origin of the volume at the center by default, I can > > reconstruct up to N=200 slices with --divisions=1 due to the limitation > of > > the graphic memory. Then when I increase the number of divisions to 2, I > can > > only reconstruct up to 215 slices; and with divisions to 3 only up to 219 > > slices. Does anyone have an idea why it scales like this? > > Thanks in advance. > > > > Best regards, > > Chao > > > > _______________________________________________ > > Rtk-users mailing list > > Rtk-users at openrtk.org > > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 21 08:30:21 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 14:30:21 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Since it fails in cufft, it's the memory of the projections that is a problem. Therefore, it is not surprising that --divisions has no influence. But --lowmem should have an influence. I would suggest: - to uncomment //#define VERBOSE in itkCudaImageDataManager.hxx and try to see what amount of memory are requested. - to try to reproduce the problem with simulated data so that we can help you in finding a solution. Simon On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > Hi Simon, > > Yes I switched on an off the --lowmem option and it has no influence on the > behaviour I mentioned. > In my case the system memory is sufficient to handle the projections plus > the volume. > The major bottleneck is the amount of graphics memory. > If I reconstruct a little bit more slices than the limit that I found with > one stream, the allocation of GPU resource for CUFFT in the > CudaFFTRampImageFilter will fail (which was more or less expected). 
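The memory balance behind this exchange can be sketched numerically. The batch size of 16 projections and the 2x zero-padding below are assumptions for illustration, not values taken from RTK's code; the point is only that the ramp filter's projection buffers can dominate the volume chunk, so --divisions alone gives diminishing returns:

```python
# Rough GPU-memory split for the reported test: a 640xNx640 float volume
# versus one batch of padded 1500x1200 projections in the CUDA ramp filter.
# Batch size (16) and padding factor (2) are assumptions, not RTK's values.
def volume_bytes(nx, ny, nz, divisions):
    return 4 * nx * ny * nz // divisions              # float32 volume chunk

def projection_batch_bytes(px, py, batch=16, pad=2):
    real = 4 * (pad * px) * py * batch                # padded float32 input
    cplx = 8 * (pad * px // 2 + 1) * py * batch       # cuFFT complex output
    return real + cplx

proj = projection_batch_bytes(1500, 1200)
for d in (1, 2, 3):
    vol = volume_bytes(640, 219, 640, d)
    print(d, round((vol + proj) / 2**20), "MiB")      # shrinks slowly with d
```

Under these assumptions the projection term (roughly 440 MiB) is fixed, so doubling or tripling the number of divisions frees only the volume term, which is consistent with the small gains from 200 to 215 to 219 slices.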
> However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> > > Best regards, >> > Chao >> > >> > _______________________________________________ >> > Rtk-users mailing list >> > Rtk-users at openrtk.org >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> > > > From simon.rit at creatis.insa-lyon.fr Wed May 21 10:19:26 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 16:19:26 +0200 Subject: [Rtk-users] Backward incompatible change: angles in radians Message-ID: Dear all, Be aware that I have just pushed a backward incompatible change: https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194 I usually try to maintain backward compatibility but I felt that the class rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on: - all angles stored or returned by the class are in radians - only the function AddProjection takes angles in degrees as parameters. AddProjectionInRadians allows you to avoid conversion of angles that are already in radians if you prefer it. - angles in geometry files are still in degrees. I believe that you will only have issues with this if you were using one of the following methods: - GetGantryAngles - GetOutOfPlaneAngles - GetInPlaneAngles The returned values are now in radians, not in degrees anymore. I apologize in advance for any inconvenience and I'm available to help if it causes you any trouble. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From wuchao04 at gmail.com Thu May 22 04:06:44 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Thu, 22 May 2014 10:06:44 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for the suggestions. The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt rtkfdk -p .
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. I found that the size of the volume data in the GRAM could be reduced by --divisions but the amount of projection data sent to the GRAM is not influenced by the --lowmem switch. So --divisions does not help much if it is mainly the projection data which takes up GRAM, while --lowmem does not help at all. I did not look into the earlier part of the code so I am not sure if this is the designed behaviour. On the other hand, I am also looking for possibilities to reduce GRAM used in the CUDA ramp filter. At least one thing should be changed, and one thing may be considered: - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be destroyed earlier, right after the plan is executed. A plan takes up at least the same amount of memory as the data. - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a clear idea about how to pad deviceProjection to the required size of its cufftComplex counterpart. Any comments? Best regards, Chao 2014-05-21 14:30 GMT+02:00 Simon Rit : > Since it fails in cufft, it's the memory of the projections that is a > problem. Therefore, it is not surprising that --divisions has no > influence. But --lowmem should have an influence. I would suggest: > - to uncomment > //#define VERBOSE > in itkCudaImageDataManager.hxx and try to see what amount of memory > are requested. > - to try to reproduce the problem with simulated data so that we can > help you in finding a solution. > Simon > > On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > > Hi Simon, > > > > Yes I switched on an off the --lowmem option and it has no influence on > the > > behaviour I mentioned. > > In my case the system memory is sufficient to handle the projections plus > > the volume.
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> > > >> > Best regards, > >> > Chao > >> > > >> > _______________________________________________ > >> > Rtk-users mailing list > >> > Rtk-users at openrtk.org > >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Mon May 26 18:12:50 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 00:12:50 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, Thanks for the detailed report. On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > Hi Simon, > > Thanks for the suggestions. > > The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: > > rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 > rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing > 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt > rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 > --dimension 640,250,640 --hardware=cuda -v -l > > With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of > itkCudaImageDataManager.hxx) now I can have a better view of the GRAM > usage. > I found that the size of the volume data in the GRAM could be reduced by > --divisions but the amount of projection data sent to the GRAM are not > influenced by --lowmem switch. > After looking at the code again, lowmem acts on the reading so it's not related to the GPU memory but on the CPU memory, sorry about that. The reconstruction algorithm does stream the projections but it processes by default 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk. 
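Simon's suggestion can be quantified with the same kind of back-of-the-envelope estimate. The 2x padding and the plan-workspace term below are assumptions, not RTK's exact allocation; the sketch only shows that the ramp filter's GPU footprint scales linearly with the number of projections processed per subset, so dropping the hard-coded 16 to 2 cuts it roughly eightfold:

```python
# GPU bytes needed by the ramp filter per subset of projections, for the
# 1944x1536 simulated projections above. Padding (2x) and the FFT-plan
# workspace (about the data size) are illustrative assumptions.
def subset_bytes(px, py, nproj, pad=2):
    real = 4 * (pad * px) * py * nproj            # padded float32 input
    cplx = 8 * (pad * px // 2 + 1) * py * nproj   # cuFFT complex buffer
    plan = real                                   # plan workspace estimate
    return real + cplx + plan

for n in (16, 8, 2):
    print(n, round(subset_bytes(1944, 1536, n) / 2**20), "MiB")
```

With these assumptions a subset of 16 projections already approaches the 1.5 GB card used in the reproduction above, while a subset of 2 leaves ample headroom, matching Simon's observation that lowering the default works.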
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction. 
I was looking into the in-place FFT these days, and the way of tuning the number of projections sent to the ramp filter is exactly what I plan to look for next. Now I know that directly. I think it is a good idea to make it an option of rtkfdk, or to regulate it automatically by inquiring the amount of free memory with cudaMemGetInfo and estimating the memory needed for storing the projections, ramp kernel, FFT plan and the chunk of volume. The latter may be difficult though since such estimation is not easy at the stage even before padding the projections... Back to the in-place FFT subject. Not sure about ITKFFT, but both FFTW and cuFFT could perform FFT in-place. So in principle rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter may also be made in-place if FFTW is used. However the "in-place" here is on a lower level and may not be compatible with the meaning of "in-place" of itk::InPlaceImageFilter. Anyway, since system memory is not a problem to me, I only focus on the Cuda filter. I already have a sort of "dirty" implementation for my own use: First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that the size of the buffer (now used by both deviceProjectionFFT and deviceProjection) should be 2*(x/2+1)*y*z instead of x*y*z. Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above is maintained in paddedImage. Its size is determined in PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and CPU-to-GPU data copying is by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first attempt is to make the image regions of paddedImage different from each other by modifying FFTRampImageFilter::PadInputImageRegion(...)
in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing the padded projection data as it works now; while its BufferedRegion should be 2*(x/2+1) by y by z, with the additional part reserved for in-place FFT. Other small changes were done to calculate inputDimension and kernelDimension correctly based on RequestedRegion. Later I realized that this did not work, since cuFFT sees the buffer just as a linear space. All image data should come continuously from the beginning of the buffer and all unused spaces are at the end, but in this case the reserved spaces were at the end along the x (first) dimension so that they were distributed in the linear buffer. So this was where the "dirty" changes started. First of all, instead of calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did not check if the modification works for CPU or any other situations as well, so I prefer to make branches when possible instead of direct changes). The latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only change of the call for allocation from paddedImage->Allocate() to paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. There, after the calculation and set of CudaDataManager::m_BufferSize as before, I also calculate the required buffer size for in-place FFT and stored the value in a new member of CudaDataManager, namely m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and after that restore m_BufferSize to its original value.
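The buffer arithmetic in this patch can be sketched as follows. The function names are hypothetical; the 2*(x/2+1) rule is cuFFT's documented real-to-complex convention, under which the real input and the complex output of an in-place transform share one buffer:

```python
# In-place vs out-of-place buffer sizes (in floats) for an x*y*z real image
# transformed along x, following the 2*(x//2+1) R2C padding rule Chao cites.
# Function names are illustrative, not RTK identifiers.
def inplace_floats(x, y, z):
    # real data padded to 2*(x//2+1) floats per row; the (x//2+1) complex
    # outputs of the transform reuse the same storage
    return 2 * (x // 2 + 1) * y * z

def out_of_place_floats(x, y, z):
    # separate real input buffer plus complex output buffer
    return x * y * z + 2 * (x // 2 + 1) * y * z

x, y, z = 2048, 1536, 16            # an illustrative padded subset
print(inplace_floats(x, y, z) / out_of_place_floats(x, y, z))
```

The ratio is essentially one half: the in-place variant trades two extra floats per x-row for eliminating the entire separate complex buffer, which is what makes it attractive when graphics memory is the bottleneck.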
Other changes have been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked those one by one) will not be influenced by the piece of new code. Finally, under GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the additional space in this buffer will never have a chance later to be initialized by means of CPU-to-GPU data copying. The length of the data is shorter than the buffer size. It works for me so far. Please see if you have any better routine to implement this. Thank you. Best regards, Chao 2014-05-27 0:12 GMT+02:00 Simon Rit : > Hi Chao, > Thanks for the detailed report. > > On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > >> Hi Simon, >> >> Thanks for the suggestions. >> >> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >> >> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >> --dimension 640,250,640 --hardware=cuda -v -l >> >> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM >> usage. >> I found that the size of the volume data in the GRAM could be reduced by >> --divisions but the amount of projection data sent to the GRAM are not >> influenced by --lowmem switch. >> > After looking at the code again, lowmem acts on the reading so it's not > related to the GPU memory but on the CPU memory, sorry about that.
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right? 
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performances? If you had FFTW, can you >> provide the sie of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags). 
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performance? If you had FFTW, can you >>> provide the size of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit.
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such severe memory limitation problems that you want to go in that direction? Using the two streaming options (--subsetsize + --divisions), you should be able to reduce your memory consumption sufficiently. If you really want to go further with the in-place implementation, I think a code patch would be more helpful, but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at that stage, even before padding the projections... > > Back to the in-place FFT subject.
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the "in-place" here is on > a lower level and may not be compatible with the meaning of "in-place" of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have a sort of "dirty" implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is done by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt was to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(...) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z, storing > the padded projection data as it works now, while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space.
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension, so they were scattered through the > linear buffer. > > So this was where the "dirty" changes started. First of all, instead of > calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change being the allocation call, from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and setting of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > store the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then in CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check > whether m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been made to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for backward compatibility, such as adding "m_BufferSizeInPlaceFFT > = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code.
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM are not >>> influenced by --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading so it's not >> related to the GPU memory but on the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk. 
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
Then, your 10 slices are sufficient but the default unmatched forward/back-projector (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for a description of this) gives bad results. You can now solve this if you match them with the option --bp NormalizedJoseph that Cyril has implemented. So even a better implementation of SART (the current one) does not solve the problems that you have pointed out. You need a large enough CT image given the input data to solve the problem. I hope this will be helpful, maybe not to you if it's too late but to some others. Simon On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote: > Hi Simon, > > I believe it really is a problem with the sum of the weights. > > I first tried with the Shepp-Logan-phantom and afterwards with my data. > The geometry is that of a standard cone-beam micro-CT. > > The data I posted before were the reconstruction of just the middle > plane. As I did the same with the Shepp-Logan-phantom data, similar > effects were seen. As soon as one reconstructs a larger region around > the middle plane, the artefacts vanish in the inner parts of the > reconstructed volume, while in the top and bottom parts artefacts remain. > > The program calls were: > > create geometry: > ---------------- > rtksimulatedgeometry --nproj="1200" --output="geometry.xml" > --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360" > > project the phantom: > -------------------- > rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing > 2.5 --dimension 256 --phantomfile SheppLogan.txt > > do a reference FDK reconstruction: > ---------------------------------- > rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > SART reconstruction of middle plane: > ------------------------------------ > rtksart -p .
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the >>> same parameters and already tried with different values of lambda and >>> niterations. >>> >>> Does anyone have an idea what went wrong? Is there some kind of >>> smoothing or regularisation applied in the SART implementation? >>> >>> Many thanks in advance! >>> >>> Cheers, >>> >>> Andreas >>> >>> >>> -- >>> >>> =============================================================== >>> Dr. Andreas Staude >>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie >>> BAM Bundesanstalt für Materialforschung und -prüfung >>> Unter den Eichen 87 >>> D-12205 Berlin >>> Germany >>> >>> Tel.: ++49 30 8104 4140 >>> Fax: ++49 30 8104 1837 >>> =============================================================== >>> >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > > -- > > =============================================================== > Dr. Andreas Staude > Fachbereich 8.5 "Mikro-ZfP", Computertomographie > BAM Bundesanstalt für Materialforschung und -prüfung > Unter den Eichen 87 > D-12205 Berlin > Germany > > Tel.: ++49 30 8104 4140 > Fax: ++49 30 8104 1837 > =============================================================== From simon.rit at creatis.insa-lyon.fr Wed May 21 08:30:21 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 14:30:21 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Since it fails in cufft, it's the memory of the projections that is a problem. Therefore, it is not surprising that --divisions has no influence. But --lowmem should have an influence. I would suggest: - to uncomment //#define VERBOSE in itkCudaImageDataManager.hxx and try to see what amounts of memory are requested. - to try to reproduce the problem with simulated data so that we can help you in finding a solution. Simon On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > Hi Simon, > > Yes I switched on and off the --lowmem option and it has no influence on the > behaviour I mentioned. > In my case the system memory is sufficient to handle the projections plus > the volume. > The major bottleneck is the amount of graphics memory. > If I reconstruct a little bit more slices than the limit that I found with > one stream, the allocation of GPU resource for CUFFT in the > CudaFFTRampImageFilter will fail (which was more or less expected).
> However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> > >> > Best regards, >> > Chao >> > >> > _______________________________________________ >> > Rtk-users mailing list >> > Rtk-users at openrtk.org >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> > > > From simon.rit at creatis.insa-lyon.fr Wed May 21 10:19:26 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 16:19:26 +0200 Subject: [Rtk-users] Backward incompatible change: angles in radians Message-ID: Dear all, Be aware that I have just pushed a backward incompatible change: https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194 I usually try to maintain backward compatibility but I felt that the class rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on: - all angles stored or returned by the class are in radians - only the function AddProjection takes angles in degrees as parameters. AddProjectionInRadians allows you to avoid conversion of angles that are already in radians if you prefer it. - angles in geometry files are still in degrees. I believe that you will only have issues with this if you were using one of the following methods: - GetGantryAngles - GetOutOfPlaneAngles - GetInPlaneAngles The returned values are now in radians, not in degrees anymore. I apologize in advance for any inconvenience and I'm available to help you if there is one. Simon From wuchao04 at gmail.com Thu May 22 04:06:44 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Thu, 22 May 2014 10:06:44 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for the suggestions. The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt rtkfdk -p .
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. I found that the size of the volume data in the GRAM could be reduced by --divisions, but the amount of projection data sent to the GRAM is not influenced by the --lowmem switch. So --divisions does not help much if it is mainly the projection data which takes up GRAM, while --lowmem does not help at all. I did not look into the earlier part of the code so I am not sure if this is the designed behaviour. On the other hand, I am also looking for possibilities to reduce the GRAM used in the CUDA ramp filter. At least one thing should be changed, and one thing may be considered: - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be destroyed earlier, right after the plan is executed. A plan takes up at least the same amount of memory as the data. - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a clear idea about how to pad deviceProjection to the required size of its cufftComplex counterpart. Any comments? Best regards, Chao 2014-05-21 14:30 GMT+02:00 Simon Rit : > Since it fails in cufft, it's the memory of the projections that is a > problem. Therefore, it is not surprising that --divisions has no > influence. But --lowmem should have an influence. I would suggest: > - to uncomment > //#define VERBOSE > in itkCudaImageDataManager.hxx and try to see what amount of memory > are requested. > - to try to reproduce the problem with simulated data so that we can > help you in finding a solution. > Simon > > On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > > Hi Simon, > > > > Yes I switched on an off the --lowmem option and it has no influence on > the > > behaviour I mentioned. > > In my case the system memory is sufficient to handle the projections plus > > the volume. 
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> > > >> > Best regards, > >> > Chao > >> > > >> > _______________________________________________ > >> > Rtk-users mailing list > >> > Rtk-users at openrtk.org > >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Mon May 26 18:12:50 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 00:12:50 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, Thanks for the detailed report. On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > Hi Simon, > > Thanks for the suggestions. > > The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: > > rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 > rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing > 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt > rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 > --dimension 640,250,640 --hardware=cuda -v -l > > With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of > itkCudaImageDataManager.hxx) now I can have a better view of the GRAM > usage. > I found that the size of the volume data in the GRAM could be reduced by > --divisions but the amount of projection data sent to the GRAM are not > influenced by --lowmem switch. > After looking at the code again, lowmem acts on the reading so it's not related to the GPU memory but on the CPU memory, sorry about that. The reconstruction algorithm does stream the projections but it processes by default 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk. 
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction. 
I was looking into the in-place FFT these days, and the way of tuning the number of projections sent to the ramp filter is exactly what I plan to look for next. Now I know that directly. I think it is a good idea to make it an option of rtkfdk, or to regulate it automatically by inquiring the amount of free memory with cudaMemGetInfo and estimating the memory needed for storing the projections, ramp kernel, FFT plan and the chunk of volume. The latter may be difficult though, since such estimation is not easy at that stage, even before padding the projections... Back to the in-place FFT subject. Not sure about ITKFFT, but both FFTW and cuFFT can perform the FFT in-place. So in principle rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter may also be made in-place if FFTW is used. However, the "in-place" here is on a lower level and may not be compatible with the meaning of "in-place" in itk::InPlaceImageFilter. Anyway, since system memory is not a problem for me, I only focus on the Cuda filter. I already have a sort of "dirty" implementation for my own use: First, in rtkCudaFFTRampImageFilter.cu I commented out the cudaMalloc and cudaFree of deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that the size of the buffer (now used by both deviceProjectionFFT and deviceProjection) should be 2*(x/2+1)*y*z instead of x*y*z. Then I went to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above is maintained in paddedImage. Its size is determined in PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and CPU-to-GPU data copying is done by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first attempt was to make the image regions of paddedImage different from each other by modifying FFTRampImageFilter::PadInputImageRegion(...) 
in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z, storing the padded projection data as it does now, while its BufferedRegion should be 2*(x/2+1) by y by z, with the additional part reserved for the in-place FFT. Other small changes were made to calculate inputDimension and kernelDimension correctly based on RequestedRegion. Later I realized that this did not work, since cuFFT sees the buffer just as a linear space. All image data should come contiguously from the beginning of the buffer with all unused space at the end, but in this case the reserved space was at the end along the x (first) dimension, so it was scattered throughout the linear buffer. So this was where the "dirty" changes started. First of all, instead of calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did not check whether the modification works for the CPU or any other situations as well, so I prefer to make branches when possible instead of direct changes). The latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only change being the call for allocation from paddedImage->Allocate() to paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. There, after the calculation and setting of CudaDataManager::m_BufferSize as before, I also calculate the required buffer size for the in-place FFT and store the value in a new member of CudaDataManager, namely m_BufferSizeInPlaceFFT. Then in CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check whether m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and after that restore m_BufferSize to its original value. 
Other changes have been made to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for backward compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked them one by one) will not be influenced by the new code. Finally, in GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after cudaMalloc I added cudaMemset to initialize the buffer to all zeros, since the additional space in this buffer will never get a chance later to be initialized by CPU-to-GPU data copying; the length of the data is shorter than the buffer size. It works for me so far. Please see if you have any better routine to implement this. Thank you. Best regards, Chao 2014-05-27 0:12 GMT+02:00 Simon Rit : > Hi Chao, > Thanks for the detailed report. > > > On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > >> Hi Simon, >> >> Thanks for the suggestions. >> >> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >> >> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >> --dimension 640,250,640 --hardware=cuda -v -l >> >> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM >> usage. >> I found that the size of the volume data in the GRAM could be reduced by >> --divisions but the amount of projection data sent to the GRAM are not >> influenced by --lowmem switch. >> After looking at the code again, lowmem acts on the reading so it's not > related to the GPU memory but on the CPU memory, sorry about that. 
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right? 
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performances? If you had FFTW, can you >> provide the sie of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags). 
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performances? If you had FFTW, can you >>> provide the sie of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit. 
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option, --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such memory limitation problems that you want to go in that direction? Using the two streaming options (--subsetsize + --divisions), you should be able to sufficiently reduce your memory consumption. If you really want to go further in the in-place implementation, I think a code patch would be more helpful but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at the stage even before padding the projections... > > Back to the in-place FFT subject.
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the 'in-place' here is on > a lower level and may not be compatible with the meaning of 'in-place' of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have a sort of 'dirty' implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt is to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(...) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing > the padded projection data as it works now; while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space.
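As a concrete illustration of the size requirement above: for a real-to-complex transform along x, each x-row of x floats must be padded to 2*(x/2+1) floats so that the x/2+1 complex outputs fit in the same allocation. A minimal sketch of this arithmetic (hypothetical helper names, not RTK code):

```cpp
#include <cassert>
#include <cstddef>

// Buffer arithmetic for an in-place real-to-complex FFT along x.
// The padding is per x-row, so in the flat (linear) buffer the reserved
// elements end up interleaved between rows rather than collected at the end.
struct InPlaceFFTSizes
{
  std::size_t realElems;   // x*y*z floats of actual projection data
  std::size_t rowStride;   // 2*(x/2+1) floats allocated per x-row
  std::size_t bufferElems; // 2*(x/2+1)*y*z floats in total
};

InPlaceFFTSizes
ComputeInPlaceFFTSizes(std::size_t x, std::size_t y, std::size_t z)
{
  InPlaceFFTSizes s;
  s.realElems = x * y * z;
  s.rowStride = 2 * (x / 2 + 1);
  s.bufferElems = s.rowStride * y * z;
  return s;
}
```

For even x the overhead is only two extra floats per row (e.g. a 1944-wide row becomes 1946), but the real data must already be laid out with that row stride before the transform.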
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension so that they were distributed in the > linear buffer. > > So this was where the 'dirty' changes started. First of all, instead of > calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change of the call for allocation from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and set of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > stored the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check > if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT > = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code.
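The UpdateGPUBuffer() change described above amounts to a temporary size swap around the allocation call. A stripped-down sketch of the idea (a plain C++ stand-in with invented members, not the actual itk::CudaDataManager):

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the CudaDataManager changes described in the email:
// m_BufferSizeInPlaceFFT normally tracks m_BufferSize; only when the two
// differ is the larger in-place-FFT size substituted for the allocation,
// after which the original value is restored for back-compatibility.
class DataManagerSketch
{
public:
  void SetBufferSize(std::size_t num)
  {
    m_BufferSize = num;
    m_BufferSizeInPlaceFFT = num; // keep the two equal by default
  }

  void SetBufferSizeInPlaceFFT(std::size_t num) { m_BufferSizeInPlaceFFT = num; }

  std::size_t GetBufferSize() const { return m_BufferSize; }

  void UpdateGPUBuffer()
  {
    if (m_BufferSize != m_BufferSizeInPlaceFFT)
    {
      const std::size_t saved = m_BufferSize;
      m_BufferSize = m_BufferSizeInPlaceFFT; // allocate the padded size
      Allocate();
      m_BufferSize = saved;                  // restore the original value
    }
    else
    {
      Allocate();
    }
  }

  std::size_t m_Allocated = 0; // stands in for the real GPU allocation

private:
  void Allocate() { m_Allocated = m_BufferSize; }

  std::size_t m_BufferSize = 0;
  std::size_t m_BufferSizeInPlaceFFT = 0;
};
```

The point of the swap-and-restore is that every other caller of Allocate() stays unaware of the extra padding.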
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM are not >>> influenced by --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading so it's not >> related to the GPU memory but on the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk. 
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
Then, your 10 slices are sufficient but the default unmatched forward/back-projector (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for a description of this) gives bad results. You can now solve this if you match them with the option --bp NormalizedJoseph that Cyril has implemented. So even a better implementation of SART (the current one) does not solve the problems that you have pointed out. You need a large enough CT image given the input data to solve the problem. I hope this will be helpful, maybe not to you if it's too late but to some others. Simon On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote: > Hi Simon, > > I believe it really is a problem with the sum of the weights. > > I first tried with the Shepp-Logan-phantom and afterwards with my data. > The geometry is that of a standard cone-beam micro-CT. > > The data I posted before were the reconstruction of just the middle > plane. As I did the same with the Shepp-Logan-phantom data, similar > effects were seen. As soon as one reconstructs a larger region around > the middle plane, the artefacts vanish in the inner parts of the > reconstructed volume, while in the top and bottom parts artefacts remain. > > The program calls were: > > create geometry: > ---------------- > rtksimulatedgeometry --nproj="1200" --output="geometry.xml" > --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360" > > project the phantom: > -------------------- > rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing > 2.5 --dimension 256 --phantomfile SheppLogan.txt > > do a reference FDK reconstruction: > ---------------------------------- > rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > SART reconstruction of middle plane: > ------------------------------------ > rtksart -p .
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the >>> same parameters and already tried with different values of lambda and >>> niterations. >>> >>> Does anyone have an idea what went wrong? Is there some kind of >>> smoothing or regularisation applied in the SART implementation? >>> >>> Many thanks in advance! >>> >>> Cheers, >>> >>> Andreas >>> >>> >>> -- >>> >>> =============================================================== >>> Dr. Andreas Staude >>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie >>> BAM Bundesanstalt für Materialforschung und -prüfung >>> Unter den Eichen 87 >>> D-12205 Berlin >>> Germany >>> >>> Tel.: ++49 30 8104 4140 >>> Fax: ++49 30 8104 1837 >>> =============================================================== >>> >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > > -- > > =============================================================== > Dr. Andreas Staude > Fachbereich 8.5 "Mikro-ZfP", Computertomographie > BAM Bundesanstalt für Materialforschung und -prüfung > Unter den Eichen 87 > D-12205 Berlin > Germany > > Tel.: ++49 30 8104 4140 > Fax: ++49 30 8104 1837 > ===============================================================
From simon.rit at creatis.insa-lyon.fr Wed May 21 08:30:21 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 14:30:21 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Since it fails in cufft, it's the memory of the projections that is a problem. Therefore, it is not surprising that --divisions has no influence. But --lowmem should have an influence. I would suggest: - to uncomment //#define VERBOSE in itkCudaImageDataManager.hxx and try to see what amount of memory are requested. - to try to reproduce the problem with simulated data so that we can help you in finding a solution. Simon On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > Hi Simon, > > Yes I switched on an off the --lowmem option and it has no influence on the > behaviour I mentioned. > In my case the system memory is sufficient to handle the projections plus > the volume. > The major bottleneck is the amount of graphics memory. > If I reconstruct a little bit more slices than the limit that I found with > one stream, the allocation of GPU resource for CUFFT in the > CudaFFTRampImageFilter will fail (which was more or less expected).
> However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> > >> > Best regards, >> > Chao >> > >> > _______________________________________________ >> > Rtk-users mailing list >> > Rtk-users at openrtk.org >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> > > > From simon.rit at creatis.insa-lyon.fr Wed May 21 10:19:26 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 16:19:26 +0200 Subject: [Rtk-users] Backward incompatible change: angles in radians Message-ID: Dear all, Be aware that I have just pushed a backward incompatible change: https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194 I usually try to maintain backward compatibility but I felt that the class rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on: - all angles stored or returned by the class are in radians - only the function AddProjection takes angles in degrees as parameters. AddProjectionInRadians allows you to avoid conversion of angles that are already in radians if you prefer it. - angles in geometry files are still in degrees. I believe that you will only have issues with this if you were using one of the following methods: - GetGantryAngles - GetOutOfPlaneAngles - GetInPlaneAngles The returned values are now in radians, not in degrees anymore. I apologize in advance for any inconvenience and I'm available to help you if it is one. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From wuchao04 at gmail.com Thu May 22 04:06:44 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Thu, 22 May 2014 10:06:44 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for the suggestions. The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt rtkfdk -p .
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. I found that the size of the volume data in the GRAM could be reduced by --divisions but the amount of projection data sent to the GRAM are not influenced by --lowmem switch. So --divisions does not help much if it is mainly the projection data which takes up GRAM, while --lowmem does not help at all. I did not look into the more front part of the code so I am not sure if this is the designed behaviour. On the other hand, I am also looking for possibilities to reduce GRAM used in the CUDA ramp filter. At least one thing should be changed, and one thing may be considered: - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be destroyed earlier, right after the plan being executed. A plan takes up at least the same amount of memory as the data. - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a clear idea about how to pad deviceProjection to the required size of its cufftComplex counterpart. Any comments? Best regards, Chao 2014-05-21 14:30 GMT+02:00 Simon Rit : > Since it fails in cufft, it's the memory of the projections that is a > problem. Therefore, it is not surprising that --divisions has no > influence. But --lowmem should have an influence. I would suggest: > - to uncomment > //#define VERBOSE > in itkCudaImageDataManager.hxx and try to see what amount of memory > are requested. > - to try to reproduce the problem with simulated data so that we can > help you in finding a solution. > Simon > > On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > > Hi Simon, > > > > Yes I switched on an off the --lowmem option and it has no influence on > the > > behaviour I mentioned. > > In my case the system memory is sufficient to handle the projections plus > > the volume. 
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> > > >> > Best regards, > >> > Chao > >> > > >> > _______________________________________________ > >> > Rtk-users mailing list > >> > Rtk-users at openrtk.org > >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Mon May 26 18:12:50 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 00:12:50 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, Thanks for the detailed report. On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > Hi Simon, > > Thanks for the suggestions. > > The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: > > rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 > rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing > 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt > rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 > --dimension 640,250,640 --hardware=cuda -v -l > > With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of > itkCudaImageDataManager.hxx) now I can have a better view of the GRAM > usage. > I found that the size of the volume data in the GRAM could be reduced by > --divisions but the amount of projection data sent to the GRAM are not > influenced by --lowmem switch. > After looking at the code again, lowmem acts on the reading so it's not related to the GPU memory but on the CPU memory, sorry about that. The reconstruction algorithm does stream the projections but it processes by default 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk. 
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction. 
I was looking into the in-place FFT these days, and the way of tuning the number of projections sent to the ramp filter is exactly what I planned to look at next. Now I know it directly. I think it is a good idea to make it an option of rtkfdk, or to regulate it automatically by inquiring the amount of free memory with cudaMemGetInfo and estimating the memory needed for storing the projections, ramp kernel, FFT plan and the chunk of volume. The latter may be difficult though, since such estimation is not easy at that stage, even before padding the projections... Back to the in-place FFT subject. Not sure about ITKFFT, but both FFTW and cuFFT can perform the FFT in-place. So in principle rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter may also be made in-place if FFTW is used. However, the "in-place" here is on a lower level and may not be compatible with the meaning of "in-place" in itk::InPlaceImageFilter. Anyway, since system memory is not a problem for me, I only focus on the Cuda filter. I already have a sort of "dirty" implementation for my own use: First, in rtkCudaFFTRampImageFilter.cu I commented out the cudaMalloc and cudaFree of deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that the size of the buffer (now used by both deviceProjectionFFT and deviceProjection) should be 2*(x/2+1)*y*z instead of x*y*z. Then I went to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above is maintained in paddedImage. Its size is determined in PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and CPU-to-GPU data copying is done by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first attempt was to make the image regions of paddedImage different from each other by modifying FFTRampImageFilter::PadInputImageRegion(...)
in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z, storing the padded projection data as it does now, while its BufferedRegion should be 2*(x/2+1) by y by z, with the additional part reserved for the in-place FFT. Other small changes were made to calculate inputDimension and kernelDimension correctly based on RequestedRegion. Later I realized that this did not work, since cuFFT sees the buffer just as a linear space. All image data should come contiguously from the beginning of the buffer and all unused space should be at the end, but in this case the reserved space was at the end along the x (first) dimension, so it was scattered through the linear buffer. So this was where the "dirty" changes started. First of all, instead of calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did not check whether the modification works for the CPU or any other situation as well, so I prefer to make branches when possible instead of direct changes). The latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only change being the call for allocation from paddedImage->Allocate() to paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. There, after the calculation and setting of CudaDataManager::m_BufferSize as before, I also calculate the required buffer size for the in-place FFT and store the value in a new member of CudaDataManager, namely m_BufferSizeInPlaceFFT. Then in CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check whether m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and after that restore m_BufferSize to its original value.
Other changes have been made to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for backward compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked those one by one) will not be influenced by the new code. Finally, in GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after cudaMalloc I added cudaMemset to initialize the buffer to all zero, since the additional space in this buffer never gets a chance to be initialized later by CPU-to-GPU data copying; the length of the data is shorter than the buffer size. It works for me so far. Please see if you have any better routine to implement this. Thank you. Best regards, Chao 2014-05-27 0:12 GMT+02:00 Simon Rit : > Hi Chao, > Thanks for the detailed report. > > On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > >> Hi Simon, >> >> Thanks for the suggestions. >> >> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >> >> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >> --dimension 640,250,640 --hardware=cuda -v -l >> >> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM >> usage. >> I found that the size of the volume data in the GRAM could be reduced by >> --divisions but the amount of projection data sent to the GRAM are not >> influenced by --lowmem switch. >> > After looking at the code again, lowmem acts on the reading so it's not > related to the GPU memory but on the CPU memory, sorry about that.
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right? 
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performances? If you had FFTW, can you >> provide the sie of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags). 
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performances? If you had FFTW, can you >>> provide the sie of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit. 
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option, --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such memory limitations problems that you want to go in that direction? Using the two streaming options (--subset + --divisions), you should be able to sufficiently reduce your memory consumption. If you really want to go further in the in-place implementation, I think a code patch would be more helpful but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at the stage even before padding the projections... > > Back to the in-place FFT subject. 
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the "in-place" here is on > a lower level and may not be compatible with the meaning of "in-place" of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have sort of "dirty" implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt is to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(...) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing > the padded projection data as how it works now; while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space.
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension so that they were distributed in the > linear buffer. > > So this was where the "dirty" changes started. First of all, instead of > calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change of the call for allocation from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and setting of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > store the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check > if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT > = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code. 
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM are not >>> influenced by --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading so it's not >> related to the GPU memory but on the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk. 
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
Then, your 10 slices are sufficient but the default unmatched forward/back-projector (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for a description of this) gives bad results. You can now solve this if you match them with the option --bp NormalizedJoseph that Cyril has implemented. So even a better implementation of SART (the current one) does not solve the problems that you have pointed out. You need a large enough CT image given the input data to solve the problem. I hope this will be helpful, maybe not to you if it's too late but to some others. Simon On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote: > Hi Simon, > > I believe it really is a problem with the sum of the weights. > > I first tried with the Shepp-Logan-phantom and afterwards with my data. > The geometry is that of a standard cone-beam micro-CT. > > The data I posted before were the reconstruction of just the middle > plane. As I did the same with the Shepp-Logan-phantom data, similar > effects were seen. As soon as one reconstructs a larger region around > the middle plane, the artefacts vanish in the inner parts of the > reconstructed volume, while in the top and bottom parts artefacts remain. > > The program calls were: > > create geometry: > ---------------- > rtksimulatedgeometry --nproj="1200" --output="geometry.xml" > --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360" > > project the phantom: > -------------------- > rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing > 2.5 --dimension 256 --phantomfile SheppLogan.txt > > do a reference FDK reconstruction: > ---------------------------------- > rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > SART reconstruction of middle plane: > ------------------------------------ > rtksart -p . 
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the >>> same parameters and already tried with different values of lambda and >>> niterations. >>> >>> Does anyone have an idea what went wrong? Is there some kind of >>> smoothing or regularisation applied in the SART implementation? >>> >>> Many thanks in advance! >>> >>> Cheers, >>> >>> Andreas >>> >>> >>> -- >>> >>> =============================================================== >>> Dr. Andreas Staude >>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie >>> BAM Bundesanstalt für Materialforschung und -prüfung >>> Unter den Eichen 87 >>> D-12205 Berlin >>> Germany >>> >>> Tel.: ++49 30 8104 4140 >>> Fax: ++49 30 8104 1837 >>> =============================================================== >>> >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > > -- > > =============================================================== > Dr. Andreas Staude > Fachbereich 8.5 "Mikro-ZfP", Computertomographie > BAM Bundesanstalt für Materialforschung und -prüfung > Unter den Eichen 87 > D-12205 Berlin > Germany > > Tel.: ++49 30 8104 4140 > Fax: ++49 30 8104 1837 > ===============================================================
From simon.rit at creatis.insa-lyon.fr Wed May 21 08:30:21 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 14:30:21 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Since it fails in cufft, it's the memory of the projections that is a problem. Therefore, it is not surprising that --divisions has no influence. But --lowmem should have an influence. I would suggest: - to uncomment //#define VERBOSE in itkCudaImageDataManager.hxx and try to see what amount of memory are requested. - to try to reproduce the problem with simulated data so that we can help you in finding a solution. Simon On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > Hi Simon, > > Yes I switched on an off the --lowmem option and it has no influence on the > behaviour I mentioned. > In my case the system memory is sufficient to handle the projections plus > the volume. > The major bottleneck is the amount of graphics memory. > If I reconstruct a little bit more slices than the limit that I found with > one stream, the allocation of GPU resource for CUFFT in the > CudaFFTRampImageFilter will fail (which was more or less expected). 
> However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> > >> > Best regards, >> > Chao >> > >> > _______________________________________________ >> > Rtk-users mailing list >> > Rtk-users at openrtk.org >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> > > > From simon.rit at creatis.insa-lyon.fr Wed May 21 10:19:26 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 16:19:26 +0200 Subject: [Rtk-users] Backward incompatible change: angles in radians Message-ID: Dear all, Be aware that I have just pushed a backward incompatible change: https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194 I usually try to maintain backward compatibility but I felt that the class rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on: - all angles stored or returned by the class are in radians - only the function AddProjection takes angles in degrees as parameters. AddProjectionInRadians allows you to avoid conversion of angles that are already in radians if you prefer it. - angles in geometry files are still in degrees. I believe that you will only have issues with this if you were using one of the following methods: - GetGantryAngles - GetOutOfPlaneAngles - GetInPlaneAngles The returned values are now in radians, not in degrees anymore. I apologize in advance for any inconveniece and I'm available to help you if it is one. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From wuchao04 at gmail.com Thu May 22 04:06:44 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Thu, 22 May 2014 10:06:44 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for the suggestions. The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt rtkfdk -p . 
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. I found that the size of the volume data in the GRAM could be reduced by --divisions but the amount of projection data sent to the GRAM are not influenced by --lowmem switch. So --divisions does not help much if it is mainly the projection data which takes up GRAM, while --lowmem does not help at all. I did not look into the more front part of the code so I am not sure if this is the designed behaviour. On the other hand, I am also looking for possibilities to reduce GRAM used in the CUDA ramp filter. At least one thing should be changed, and one thing may be considered: - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be destroyed earlier, right after the plan being executed. A plan takes up at least the same amount of memory as the data. - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a clear idea about how to pad deviceProjection to the required size of its cufftComplex counterpart. Any comments? Best regards, Chao 2014-05-21 14:30 GMT+02:00 Simon Rit : > Since it fails in cufft, it's the memory of the projections that is a > problem. Therefore, it is not surprising that --divisions has no > influence. But --lowmem should have an influence. I would suggest: > - to uncomment > //#define VERBOSE > in itkCudaImageDataManager.hxx and try to see what amount of memory > are requested. > - to try to reproduce the problem with simulated data so that we can > help you in finding a solution. > Simon > > On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > > Hi Simon, > > > > Yes I switched on an off the --lowmem option and it has no influence on > the > > behaviour I mentioned. > > In my case the system memory is sufficient to handle the projections plus > > the volume. 
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> > > >> > Best regards, > >> > Chao > >> > > >> > _______________________________________________ > >> > Rtk-users mailing list > >> > Rtk-users at openrtk.org > >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Mon May 26 18:12:50 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 00:12:50 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, Thanks for the detailed report. On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > Hi Simon, > > Thanks for the suggestions. > > The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: > > rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 > rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing > 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt > rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 > --dimension 640,250,640 --hardware=cuda -v -l > > With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of > itkCudaImageDataManager.hxx) now I can have a better view of the GRAM > usage. > I found that the size of the volume data in the GRAM could be reduced by > --divisions but the amount of projection data sent to the GRAM are not > influenced by --lowmem switch. > After looking at the code again, lowmem acts on the reading so it's not related to the GPU memory but on the CPU memory, sorry about that. The reconstruction algorithm does stream the projections but it processes by default 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk. 
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> >> > From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction.
I was looking into the in-place FFT these days, and the way of tuning the number of projections sent to the ramp filter is exactly what I planned to look for next. Now I know it directly. I think it is a good idea to make it an option of rtkfdk, or to regulate it automatically by inquiring the amount of free memory with cudaMemGetInfo and estimating the memory needed for storing the projections, the ramp kernel, the FFT plan and the chunk of volume. The latter may be difficult though, since such an estimation is not easy even before padding the projections... Back to the in-place FFT subject. I am not sure about ITKFFT, but both FFTW and cuFFT can perform the FFT in-place. So in principle rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter may also be made in-place if FFTW is used. However, "in-place" here is at a lower level and may not be compatible with the meaning of "in-place" in itk::InPlaceImageFilter. Anyway, since system memory is not a problem for me, I only focus on the CUDA filter. I already have a sort of "dirty" implementation for my own use: First, in rtkCudaFFTRampImageFilter.cu I commented out the cudaMalloc and cudaFree of deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that the size of the buffer (now used by both deviceProjectionFFT and deviceProjection) should be 2*(x/2+1)*y*z instead of x*y*z. Then I went to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above is maintained in paddedImage. Its size is determined in PadInputImageRegion(...) (line 60), and the actual GPU memory allocation and CPU-to-GPU data copying are done by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first attempt was to make the image regions of paddedImage differ from each other by modifying FFTRampImageFilter::PadInputImageRegion(...)
in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z, storing the padded projection data as it does now, while its BufferedRegion becomes 2*(x/2+1) by y by z, with the additional part reserved for the in-place FFT. Other small changes were made to calculate inputDimension and kernelDimension correctly based on RequestedRegion. Later I realized that this did not work, since cuFFT sees the buffer just as a linear space. All image data should come contiguously from the beginning of the buffer and all unused space should be at the end, but in this case the reserved spaces were at the end along the x (first) dimension, so they were scattered throughout the linear buffer. So this was where the "dirty" changes started. First of all, instead of calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did not check whether the modification works for the CPU or any other situation as well, so I prefer to make branches when possible instead of direct changes). The latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only change being the call for allocation, from paddedImage->Allocate() to paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. There, after calculating and setting CudaDataManager::m_BufferSize as before, I also calculate the required buffer size for the in-place FFT and store the value in a new member of CudaDataManager, namely m_BufferSizeInPlaceFFT. Then in CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check whether m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and after that restore m_BufferSize to its original value.
Other changes have been made to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for backward compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked them one by one) will not be influenced by the new code. Finally, in GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after cudaMalloc I added cudaMemset to initialize the buffer to all zeros, since the additional space in this buffer will never have a chance to be initialized later by means of CPU-to-GPU data copying: the length of the data is shorter than the buffer size. It works for me so far. Please see if you have any better routine to implement this. Thank you. Best regards, Chao 2014-05-27 0:12 GMT+02:00 Simon Rit : > Hi Chao, > Thanks for the detailed report. > > > On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > >> Hi Simon, >> >> Thanks for the suggestions. >> >> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >> >> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >> --dimension 640,250,640 --hardware=cuda -v -l >> >> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM >> usage. >> I found that the size of the volume data in the GRAM could be reduced by >> --divisions but the amount of projection data sent to the GRAM are not >> influenced by --lowmem switch. >> > After looking at the code again, lowmem acts on the reading so it's not > related to the GPU memory but on the CPU memory, sorry about that.
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right?
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performances? If you had FFTW, can you >> provide the sie of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags). 
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performances? If you had FFTW, can you >>> provide the sie of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit. 
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option, --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such memory limitation problems that you want to go in that direction? Using the two streaming options (--subset + --divisions), you should be able to sufficiently reduce your memory consumption. If you really want to go further in the in-place implementation, I think a code patch would be more helpful but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at the stage even before padding the projections... > > Back to the in-place FFT subject.
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the ?in-place? here is on > a lower level and may not be compatible with the meaning of ?in-place? of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have sort of ?dirty? implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(?) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt is to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(?) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing > the padded projection data as how it works now; while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space. 
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension so that they were distributed in the > linear buffer. > > So this was where the ?dirty? changes started. First of all, instead of > calling PadInputImageRegion(?) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(?) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change of the call for allocation from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and set of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > stored the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply do this->Allocate(), I first check > if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for back-compatibility, such as adding ?m_BufferSizeInPlaceFFT > = num? in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code. 
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM are not >>> influenced by --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading so it's not >> related to the GPU memory but on the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk. 
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
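[Editor's note: the "one plane is not enough" remark follows from how interpolating projectors sample the volume: the usable support along an axis runs from the center of the first voxel to the center of the last. A small helper, hypothetical rather than an RTK function, makes that explicit.]

```cpp
#include <cassert>

// Usable support of a volume axis for an interpolating forward projector:
// rays are interpolated between voxel centers, so the support spans
// (n - 1) * spacing. With a single plane (n == 1) the support is zero,
// which is why at least two slices are needed. Illustrative helper only.
double InterpolationSupport(unsigned int n, double spacing)
{
  return (n - 1) * spacing;
}
```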
Then, your 10 slices are sufficient but the default unmatched forward/back-projector (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for a description of this) gives bad results. You can now solve this if you match them with the option --bp NormalizedJoseph that Cyril has implemented. So even a better implementation of SART (the current one) does not solve the problems that you have pointed out. You need a CT image that is large enough for the input data to solve the problem. I hope this will be helpful, maybe not to you if it's too late but to some others. Simon On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote: > Hi Simon, > > I believe it really is a problem with the sum of the weights. > > I first tried with the Shepp-Logan-phantom and afterwards with my data. > The geometry is that of a standard cone-beam micro-CT. > > The data I posted before were the reconstruction of just the middle > plane. As I did the same with the Shepp-Logan-phantom data, similar > effects were seen. As soon as one reconstructs a larger region around > the middle plane, the artefacts vanish in the inner parts of the > reconstructed volume, while in the top and bottom parts artefacts remain. > > The program calls were: > > create geometry: > ---------------- > rtksimulatedgeometry --nproj="1200" --output="geometry.xml" > --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360" > > project the phantom: > -------------------- > rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing > 2.5 --dimension 256 --phantomfile SheppLogan.txt > > do a reference FDK reconstruction: > ---------------------------------- > rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > SART reconstruction of middle plane: > ------------------------------------ > rtksart -p .
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the >>> same parameters and already tried with different values of lambda and >>> niterations. >>> >>> Does anyone have an idea what went wrong? Is there some kind of >>> smoothing or regularisation applied in the SART implementation? >>> >>> Many thanks in advance! >>> >>> Cheers, >>> >>> Andreas >>> >>> >>> -- >>> >>> =============================================================== >>> Dr. Andreas Staude >>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie >>> BAM Bundesanstalt für Materialforschung und -prüfung >>> Unter den Eichen 87 >>> D-12205 Berlin >>> Germany >>> >>> Tel.: ++49 30 8104 4140 >>> Fax: ++49 30 8104 1837 >>> =============================================================== >>> >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > > -- > > =============================================================== > Dr. Andreas Staude > Fachbereich 8.5 "Mikro-ZfP", Computertomographie > BAM Bundesanstalt für Materialforschung und -prüfung > Unter den Eichen 87 > D-12205 Berlin > Germany > > Tel.: ++49 30 8104 4140 > Fax: ++49 30 8104 1837 > =============================================================== From simon.rit at creatis.insa-lyon.fr Wed May 21 08:30:21 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 14:30:21 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Since it fails in cufft, it's the memory of the projections that is a problem. Therefore, it is not surprising that --divisions has no influence. But --lowmem should have an influence. I would suggest: - to uncomment //#define VERBOSE in itkCudaImageDataManager.hxx and try to see what amount of memory are requested. - to try to reproduce the problem with simulated data so that we can help you in finding a solution. Simon On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > Hi Simon, > > Yes I switched on an off the --lowmem option and it has no influence on the > behaviour I mentioned. > In my case the system memory is sufficient to handle the projections plus > the volume. > The major bottleneck is the amount of graphics memory. > If I reconstruct a little bit more slices than the limit that I found with > one stream, the allocation of GPU resource for CUFFT in the > CudaFFTRampImageFilter will fail (which was more or less expected).
> However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> > >> > Best regards, >> > Chao >> > >> > _______________________________________________ >> > Rtk-users mailing list >> > Rtk-users at openrtk.org >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> > > > From simon.rit at creatis.insa-lyon.fr Wed May 21 10:19:26 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 16:19:26 +0200 Subject: [Rtk-users] Backward incompatible change: angles in radians Message-ID: Dear all, Be aware that I have just pushed a backward incompatible change: https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194 I usually try to maintain backward compatibility but I felt that the class rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on: - all angles stored or returned by the class are in radians - only the function AddProjection takes angles in degrees as parameters. AddProjectionInRadians allows you to avoid conversion of angles that are already in radians if you prefer it. - angles in geometry files are still in degrees. I believe that you will only have issues with this if you were using one of the following methods: - GetGantryAngles - GetOutOfPlaneAngles - GetInPlaneAngles The returned values are now in radians, not in degrees anymore. I apologize in advance for any inconvenience and I'm available to help you if there is one. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From wuchao04 at gmail.com Thu May 22 04:06:44 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Thu, 22 May 2014 10:06:44 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for the suggestions. The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt rtkfdk -p .
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. I found that the size of the volume data in the GRAM could be reduced by --divisions but the amount of projection data sent to the GRAM are not influenced by --lowmem switch. So --divisions does not help much if it is mainly the projection data which takes up GRAM, while --lowmem does not help at all. I did not look into the more front part of the code so I am not sure if this is the designed behaviour. On the other hand, I am also looking for possibilities to reduce GRAM used in the CUDA ramp filter. At least one thing should be changed, and one thing may be considered: - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be destroyed earlier, right after the plan being executed. A plan takes up at least the same amount of memory as the data. - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a clear idea about how to pad deviceProjection to the required size of its cufftComplex counterpart. Any comments? Best regards, Chao 2014-05-21 14:30 GMT+02:00 Simon Rit : > Since it fails in cufft, it's the memory of the projections that is a > problem. Therefore, it is not surprising that --divisions has no > influence. But --lowmem should have an influence. I would suggest: > - to uncomment > //#define VERBOSE > in itkCudaImageDataManager.hxx and try to see what amount of memory > are requested. > - to try to reproduce the problem with simulated data so that we can > help you in finding a solution. > Simon > > On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > > Hi Simon, > > > > Yes I switched on an off the --lowmem option and it has no influence on > the > > behaviour I mentioned. > > In my case the system memory is sufficient to handle the projections plus > > the volume. 
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> > > >> > Best regards, > >> > Chao > >> > > >> > _______________________________________________ > >> > Rtk-users mailing list > >> > Rtk-users at openrtk.org > >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Mon May 26 18:12:50 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 00:12:50 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, Thanks for the detailed report. On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > Hi Simon, > > Thanks for the suggestions. > > The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: > > rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 > rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing > 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt > rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 > --dimension 640,250,640 --hardware=cuda -v -l > > With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of > itkCudaImageDataManager.hxx) now I can have a better view of the GRAM > usage. > I found that the size of the volume data in the GRAM could be reduced by > --divisions but the amount of projection data sent to the GRAM are not > influenced by --lowmem switch. > After looking at the code again, lowmem acts on the reading so it's not related to the GPU memory but on the CPU memory, sorry about that. The reconstruction algorithm does stream the projections but it processes by default 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk. 
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction. 
I was looking into the in-place FFT these days, and the way of tuning the number of projections sent to the ramp filter is exactly what I plan to look for next. Now I know that directly. I think it is a good idea to make it an option of rtkfdk, or to regulate it automatically by inquiring the amount of free memory with cudaMemGetInfo and estimating the memory needed for storing the projections, ramp kernel, FFT plan and the chunk of volume. The latter may be difficult though since such estimation is not easy at the stage even before padding the projections... Back to the in-place FFT subject. Not sure about ITKFFT, but both FFTW and cuFFT could perform FFT in-place. So in principle rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter may also be made in-place if FFTW is used. However the "in-place" here is on a lower level and may not be compatible with the meaning of "in-place" of itk::InPlaceImageFilter. Anyway, since system memory is not a problem to me, I only focus on the Cuda filter. I already have a sort of "dirty" implementation for my own use: First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that the size of the buffer (now used by both deviceProjectionFFT and deviceProjection) should be 2*(x/2+1)*y*z instead of x*y*z. Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above is maintained in paddedImage. Its size is determined in PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and CPU-to-GPU data copying is by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first attempt is to make the image regions of paddedImage different from each other by modifying FFTRampImageFilter::PadInputImageRegion(...)
in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing the padded projection data as it works now; while its BufferedRegion should be 2*(x/2+1) by y by z, with the additional part reserved for in-place FFT. Other small changes were done to calculate inputDimension and kernelDimension correctly based on RequestedRegion. Later I realized that this did not work, since cuFFT sees the buffer just as a linear space. All image data should come continuously from the beginning of the buffer and all unused spaces are at the end, but in this case the reserved spaces were at the end along the x (first) dimension so that they were distributed in the linear buffer. So this was where the "dirty" changes started. First of all, instead of calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did not check if the modification works for CPU or any other situations as well, so I prefer to make branches when possible instead of direct changes). The latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only change of the call for allocation from paddedImage->Allocate() to paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. There, after the calculation and set of CudaDataManager::m_BufferSize as before, I also calculate the required buffer size for in-place FFT and store the value in a new member of CudaDataManager, namely m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and after that restore m_BufferSize to its original value.
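The padded-buffer arithmetic described above (2*(x/2+1)*y*z floats instead of x*y*z) can be sketched in plain C++; the helper names below are illustrative, not RTK code:

```cpp
#include <cstddef>

// Real (float) elements in an out-of-place padded projection stack of
// size x*y*z.
std::size_t OutOfPlaceElements(std::size_t x, std::size_t y, std::size_t z)
{
  return x * y * z;
}

// Real elements needed when the same buffer must also hold the R2C output
// in-place: cuFFT/FFTW store x/2+1 complex values per row, which occupy
// 2*(x/2+1) floats along the first dimension.
std::size_t InPlaceFFTElements(std::size_t x, std::size_t y, std::size_t z)
{
  return 2 * (x / 2 + 1) * y * z;
}
```

For a 1944x1536 projection the overhead is only two extra float columns per row (1946 instead of 1944), but the allocation must account for them, otherwise the in-place transform writes past the end of the buffer.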
Other changes have been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for backward compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked those one by one) will not be influenced by the piece of new code. At last, under GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the additional space in this buffer will never have a chance later to be initialized by means of CPU-to-GPU data copying. The length of the data is shorter than the buffer size. It works for me so far. Please see if you have any better routine to implement this. Thank you. Best regards, Chao 2014-05-27 0:12 GMT+02:00 Simon Rit : > Hi Chao, > Thanks for the detailed report. > > > On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > >> Hi Simon, >> >> Thanks for the suggestions. >> >> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >> >> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >> --dimension 640,250,640 --hardware=cuda -v -l >> >> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM >> usage. >> I found that the size of the volume data in the GRAM could be reduced by >> --divisions but the amount of projection data sent to the GRAM are not >> influenced by --lowmem switch. >> > After looking at the code again, lowmem acts on the reading so it's not > related to the GPU memory but on the CPU memory, sorry about that.
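Chao's earlier suggestion of regulating the streaming automatically from the free GPU memory reported by cudaMemGetInfo could look like the following arithmetic. This is an illustrative helper, not RTK code; in a real implementation freeBytes would come from cudaMemGetInfo, and the factor of two for the plan follows the observation elsewhere in the thread that a cuFFT plan takes at least as much memory as the data it transforms:

```cpp
#include <cstddef>

// Rough upper bound on how many x-by-y projections can be ramp-filtered on
// the GPU at once. Each projection padded for an in-place R2C FFT needs
// 2*(x/2+1)*y floats, and the FFT plan needs about the same again.
// Hypothetical helper for illustration; freeBytes would be obtained from
// cudaMemGetInfo in practice.
std::size_t ProjectionsPerSubset(std::size_t freeBytes,
                                 std::size_t x, std::size_t y)
{
  const std::size_t dataBytes = 2 * (x / 2 + 1) * y * sizeof(float);
  const std::size_t bytesPerProjection = 2 * dataBytes;  // data + plan
  if (bytesPerProjection == 0)
    return 0;
  return freeBytes / bytesPerProjection;
}
```

The estimate is deliberately conservative; whatever the chosen heuristic, the subset size would still have to leave room for the reconstructed volume chunk on the GPU.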
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right? 
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performance? If you had FFTW, can you >> provide the size of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags).
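Activating FFTW as Simon suggests above is a CMake configuration step on the ITK build tree; a minimal sketch, with placeholder paths:

```shell
# Reconfigure an existing ITK4 build with FFTW enabled in both double and
# single precision, then rebuild. The two paths are placeholders for the
# actual ITK build and source directories.
cd /path/to/ITK-build
cmake -DITK_USE_FFTWD=ON -DITK_USE_FFTWF=ON /path/to/ITK-source
make
```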
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performance? If you had FFTW, can you >>> provide the size of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit.
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option, --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such memory limitations problems that you want to go in that direction? Using the two streaming options (--subset + --divisions), you should be able to sufficiently reduce your memory consumption. If you really want to go further in the in-place implementation, I think a code patch would be more helpful but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at the stage even before padding the projections... > > Back to the in-place FFT subject. 
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the "in-place" here is on > a lower level and may not be compatible with the meaning of "in-place" of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have a sort of "dirty" implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt is to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(...) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing > the padded projection data as it works now; while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space.
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension so that they were distributed in the > linear buffer. > > So this was where the "dirty" changes started. First of all, instead of > calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change of the call for allocation from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and set of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > store the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check > if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for backward compatibility, such as adding "m_BufferSizeInPlaceFFT > = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code.
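The temporary size swap in UpdateGPUBuffer() described above amounts to the following pattern. The struct and function names are stand-ins for illustration, not the actual itk::CudaDataManager code:

```cpp
#include <cstddef>

// Stand-ins for the two CudaDataManager members discussed in the thread.
struct BufferSizes
{
  std::size_t m_BufferSize;            // nominal image buffer size (bytes)
  std::size_t m_BufferSizeInPlaceFFT;  // enlarged size for in-place FFT
};

// Returns the byte count that would be passed to the allocation, while
// leaving m_BufferSize unchanged afterwards: swap in the enlarged size,
// "allocate", then restore the original value.
std::size_t SizeUsedForAllocation(BufferSizes& s)
{
  const std::size_t saved = s.m_BufferSize;
  if (s.m_BufferSize != s.m_BufferSizeInPlaceFFT)
    s.m_BufferSize = s.m_BufferSizeInPlaceFFT;
  const std::size_t allocated = s.m_BufferSize;  // this->Allocate() here
  s.m_BufferSize = saved;                        // restore for later users
  return allocated;
}
```

The point of the restore step is exactly the backward compatibility mentioned above: any code that reads m_BufferSize afterwards still sees the nominal size.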
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM are not >>> influenced by --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading so it's not >> related to the GPU memory but on the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk. 
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
Then, your 10 slices are sufficient but the default unmatched
forward/back-projector (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for
a description of this) gives bad results. You can now solve this if you
match them with the option --bp NormalizedJoseph that Cyril has implemented.
So even a better implementation of SART (the current one) does not solve the
problems that you have pointed out. You need a large enough CT image given
the input data to solve the problem.
I hope this will be helpful, maybe not to you if it's too late but to some
others.
Simon

On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote:
> Hi Simon,
>
> I believe it really is a problem with the sum of the weights.
>
> I first tried with the Shepp-Logan-phantom and afterwards with my data.
> The geometry is that of a standard cone-beam micro-CT.
>
> The data I posted before were the reconstruction of just the middle
> plane. As I did the same with the Shepp-Logan-phantom data, similar
> effects were seen. As soon as one reconstructs a larger region around
> the middle plane, the artefacts vanish in the inner parts of the
> reconstructed volume, while in the top and bottom parts artefacts remain.
>
> The program calls were:
>
> create geometry:
> ----------------
> rtksimulatedgeometry --nproj="1200" --output="geometry.xml"
> --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360"
>
> project the phantom:
> --------------------
> rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing
> 2.5 --dimension 256 --phantomfile SheppLogan.txt
>
> do a reference FDK reconstruction:
> ----------------------------------
> rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g
> geometry.xml --spacing 1 --dimension 256
>
> SART reconstruction of middle plane:
> ------------------------------------
> rtksart -p .
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the
>>> same parameters and already tried with different values of lambda and
>>> niterations.
>>>
>>> Does anyone have an idea what went wrong? Is there some kind of
>>> smoothing or regularisation applied in the SART implementation?
>>>
>>> Many thanks in advance!
>>>
>>> Cheers,
>>>
>>> Andreas
>>>
>>>
>>> --
>>>
>>> ===============================================================
>>> Dr. Andreas Staude
>>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie
>>> BAM Bundesanstalt für Materialforschung und -prüfung
>>> Unter den Eichen 87
>>> D-12205 Berlin
>>> Germany
>>>
>>> Tel.: ++49 30 8104 4140
>>> Fax: ++49 30 8104 1837
>>> ===============================================================
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Rtk-users mailing list
>>> Rtk-users at openrtk.org
>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users
>>>
>
> --
>
> ===============================================================
> Dr. Andreas Staude
> Fachbereich 8.5 "Mikro-ZfP", Computertomographie
> BAM Bundesanstalt für Materialforschung und -prüfung
> Unter den Eichen 87
> D-12205 Berlin
> Germany
>
> Tel.: ++49 30 8104 4140
> Fax: ++49 30 8104 1837
> ===============================================================

From simon.rit at creatis.insa-lyon.fr Wed May 21 08:30:21 2014
From: simon.rit at creatis.insa-lyon.fr (Simon Rit)
Date: Wed, 21 May 2014 14:30:21 +0200
Subject: [Rtk-users] Stream divisions in rtkfdk
In-Reply-To: References: Message-ID:

Since it fails in cufft, it's the memory of the projections that is a
problem. Therefore, it is not surprising that --divisions has no
influence. But --lowmem should have an influence. I would suggest:
- to uncomment
//#define VERBOSE
in itkCudaImageDataManager.hxx and try to see what amount of memory
are requested.
- to try to reproduce the problem with simulated data so that we can
help you in finding a solution.
Simon

On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote:
> Hi Simon,
>
> Yes I switched on and off the --lowmem option and it has no influence on the
> behaviour I mentioned.
> In my case the system memory is sufficient to handle the projections plus
> the volume.
> The major bottleneck is the amount of graphics memory.
> If I reconstruct a little bit more slices than the limit that I found with
> one stream, the allocation of GPU resource for CUFFT in the
> CudaFFTRampImageFilter will fail (which was more or less expected).
> However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> > > Best regards,
>> > Chao
>> >
>> > _______________________________________________
>> > Rtk-users mailing list
>> > Rtk-users at openrtk.org
>> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users
>> >
>
>

From simon.rit at creatis.insa-lyon.fr Wed May 21 10:19:26 2014
From: simon.rit at creatis.insa-lyon.fr (Simon Rit)
Date: Wed, 21 May 2014 16:19:26 +0200
Subject: [Rtk-users] Backward incompatible change: angles in radians
Message-ID:

Dear all,
Be aware that I have just pushed a backward incompatible change:
https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194
I usually try to maintain backward compatibility but I felt that the class
rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on:
- all angles stored or returned by the class are in radians
- only the function AddProjection takes angles in degrees as parameters.
AddProjectionInRadians allows you to avoid conversion of angles that are
already in radians if you prefer it.
- angles in geometry files are still in degrees.
I believe that you will only have issues with this if you were using one of
the following methods:
- GetGantryAngles
- GetOutOfPlaneAngles
- GetInPlaneAngles
The returned values are now in radians, not in degrees anymore.
I apologize in advance for any inconvenience and I'm available to help you
if there is one.
Simon
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wuchao04 at gmail.com Thu May 22 04:06:44 2014
From: wuchao04 at gmail.com (Chao Wu)
Date: Thu, 22 May 2014 10:06:44 +0200
Subject: [Rtk-users] Stream divisions in rtkfdk
In-Reply-To: References: Message-ID:

Hi Simon,

Thanks for the suggestions.

The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by:

rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384
rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing
0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt
rtkfdk -p .
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4
--dimension 640,250,640 --hardware=cuda -v -l

With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of
itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage.
I found that the size of the volume data in the GRAM could be reduced by
--divisions, but the amount of projection data sent to the GRAM is not
influenced by the --lowmem switch.
So --divisions does not help much if it is mainly the projection data which
takes up GRAM, while --lowmem does not help at all. I did not look into the
earlier part of the code so I am not sure if this is the designed behaviour.

On the other hand, I am also looking for possibilities to reduce GRAM used
in the CUDA ramp filter. At least one thing should be changed, and one
thing may be considered:
- in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be
destroyed earlier, right after the plan is executed. A plan takes up at
least the same amount of memory as the data.
- cufftExecR2C and cufftExecC2R can be in-place. However I do not have a
clear idea about how to pad deviceProjection to the required size of its
cufftComplex counterpart.

Any comments?

Best regards,
Chao

2014-05-21 14:30 GMT+02:00 Simon Rit :

> Since it fails in cufft, it's the memory of the projections that is a
> problem. Therefore, it is not surprising that --divisions has no
> influence. But --lowmem should have an influence. I would suggest:
> - to uncomment
> //#define VERBOSE
> in itkCudaImageDataManager.hxx and try to see what amount of memory
> are requested.
> - to try to reproduce the problem with simulated data so that we can
> help you in finding a solution.
> Simon
>
> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote:
> > Hi Simon,
> >
> > Yes I switched on and off the --lowmem option and it has no influence on
> the
> > behaviour I mentioned.
> > In my case the system memory is sufficient to handle the projections plus
> > the volume.
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> > > >> > Best regards, > >> > Chao > >> > > >> > _______________________________________________ > >> > Rtk-users mailing list > >> > Rtk-users at openrtk.org > >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Mon May 26 18:12:50 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 00:12:50 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, Thanks for the detailed report. On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > Hi Simon, > > Thanks for the suggestions. > > The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: > > rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 > rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing > 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt > rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 > --dimension 640,250,640 --hardware=cuda -v -l > > With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of > itkCudaImageDataManager.hxx) now I can have a better view of the GRAM > usage. > I found that the size of the volume data in the GRAM could be reduced by > --divisions but the amount of projection data sent to the GRAM are not > influenced by --lowmem switch. > After looking at the code again, lowmem acts on the reading so it's not related to the GPU memory but on the CPU memory, sorry about that. The reconstruction algorithm does stream the projections but it processes by default 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk. 
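[A rough back-of-the-envelope model shows why the number of projections
filtered per pass, not --divisions, dominates the GPU memory here. This is an
illustrative Python sketch, not RTK code: the factor-of-two zero-padding, the
complex-buffer layout and the "plan costs about as much as the data" rule are
assumptions taken from this thread, and ramp_filter_bytes is a made-up name.]

```python
# Rough model of the GPU memory needed to ramp-filter a subset of projections.
# Assumptions (not RTK's actual accounting): projections are zero-padded by 2x
# for the ramp kernel, the R2C spectrum needs padded_x//2 + 1 complex values
# per row, and the cuFFT plan workspace is about the size of the data.
def ramp_filter_bytes(proj_x, proj_y, subset):
    padded_x = 2 * proj_x                              # zero-padding for the ramp kernel
    real = padded_x * proj_y * subset * 4              # float32 padded projections
    cplx = (padded_x // 2 + 1) * proj_y * subset * 8   # complex64 R2C output
    plan = cplx                                        # plan workspace ~ data size
    return real + cplx + plan

# The thread's case: 400 projections of 1500x1200 pixels.
all_at_once = ramp_filter_bytes(1500, 1200, 400)
per_16 = ramp_filter_bytes(1500, 1200, 16)   # the default subset size
per_2 = ramp_filter_bytes(1500, 1200, 2)     # the suggested smaller subset
print(all_at_once / 2**30, per_16 / 2**30, per_2 / 2**30)  # GiB
```

[Under these assumptions the cost is linear in the subset size, so filtering
2 projections at a time needs 1/200th of the memory of filtering all 400 at
once — consistent with the subset size being the knob that matters.]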
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction. 
I was looking into the in-place FFT these days, and the way of tuning the
number of projections sent to the ramp filter is exactly what I plan to look
for next. Now I know that directly.

I think it is a good idea to make it an option of rtkfdk, or to regulate it
automatically by inquiring the amount of free memory with cudaMemGetInfo and
estimating the memory needed for storing the projections, ramp kernel, FFT
plan and the chunk of volume. The latter may be difficult though since such
estimation is not easy at the stage even before padding the projections...

Back to the in-place FFT subject. Not sure about ITKFFT, but both FFTW and
cuFFT could perform FFT in-place. So in principle rtk::CudaFFTRampImageFilter
could be in-place, and rtk::FFTRampImageFilter may also be made in-place if
FFTW is used. However the "in-place" here is on a lower level and may not be
compatible with the meaning of "in-place" of itk::InPlaceImageFilter. Anyway,
since system memory is not a problem to me, I only focus on the Cuda filter.

I already have a sort of "dirty" implementation for my own use: First, in
rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of
deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*)
deviceProjection. Now the cuFFT is in-place; the only thing is that the size
of the buffer (now used by both deviceProjectionFFT and deviceProjection)
should be 2*(x/2+1)*y*z instead of x*y*z.

Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above
is maintained in paddedImage. Its size is determined in
PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and
CPU-to-GPU data copying is by
paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98).

My first attempt is to make the image regions of paddedImage different from
each other by modifying FFTRampImageFilter::PadInputImageRegion(...)
in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z,
storing the padded projection data as it does now, while its BufferedRegion
should be 2*(x/2+1) by y by z, with the additional part reserved for the
in-place FFT. Other small changes were done to calculate inputDimension and
kernelDimension correctly based on RequestedRegion.

Later I realized that this did not work, since cuFFT sees the buffer just as
a linear space. All image data should come continuously from the beginning
of the buffer and all unused spaces are at the end, but in this case the
reserved spaces were at the end along the x (first) dimension so that they
were distributed in the linear buffer.

So this was where the "dirty" changes started. First of all, instead of
calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx,
I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did
not check if the modification works for CPU or any other situations as well,
so I prefer to make branches when possible instead of direct changes). The
latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only
change of the call for allocation from paddedImage->Allocate() to
paddedImage->AllocateInPlaceFFT().

Again, CudaImage::AllocateInPlaceFFT() is an altered version of
CudaImage::Allocate() in itkCudaImage.hxx. There, after the calculation and
set of CudaDataManager::m_BufferSize as before, I also calculate the required
buffer size for in-place FFT and store the value in a new member of
CudaDataManager, namely m_BufferSizeInPlaceFFT.

Then under CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx,
instead of simply doing this->Allocate(), I first check if m_BufferSize and
m_BufferSizeInPlaceFFT are equal. If not, I let m_BufferSize =
m_BufferSizeInPlaceFFT before doing this->Allocate(), and after that restore
m_BufferSize to its original value.
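[The 2*(x/2+1)*y*z buffer-size rule above can be illustrated without cuFFT.
The following NumPy sketch — a conceptual illustration, not the RTK/cuFFT
code — shows that a real row of x samples stored with a row stride of
2*(x//2+1) floats occupies exactly as many bytes as the x//2+1 complex
coefficients of its R2C transform, which is what makes an in-place transform
possible.]

```python
import numpy as np

# A real row of x samples needs 2*(x//2 + 1) floats of storage so that the
# same buffer can later hold the x//2 + 1 complex coefficients of an
# in-place real-to-complex FFT of that row.
def inplace_r2c_buffer(x, y):
    padded_x = 2 * (x // 2 + 1)            # row stride in floats
    return np.zeros((y, padded_x), dtype=np.float32), padded_x

x, y = 1944, 4                              # e.g. a few projection rows
buf, padded_x = inplace_r2c_buffer(x, y)
buf[:, :x] = np.random.rand(y, x).astype(np.float32)  # real data; padded tail stays 0

# The complex spectrum fits exactly in the same storage:
spectrum = np.fft.rfft(buf[:, :x], axis=1).astype(np.complex64)
assert spectrum.shape == (y, x // 2 + 1)
assert spectrum.nbytes == buf.nbytes        # same byte count -> in-place is possible
buf.view(np.complex64)[:] = spectrum        # overwrite the padded buffer in place
```

[The same arithmetic gives Chao's total of 2*(x/2+1)*y*z floats for a stack
of z projections.]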
Other changes were made to ensure that m_BufferSizeInPlaceFFT otherwise always equals m_BufferSize for backward compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked them one by one) will not be influenced by the new code. Finally, in GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after cudaMalloc I added a cudaMemset to initialize the buffer to all zeros, since the additional space in this buffer will never get a chance to be initialized later by CPU-to-GPU data copying (the length of the data is shorter than the buffer size). It works for me so far. Please see if you have any better routine to implement this. Thank you. Best regards, Chao 2014-05-27 0:12 GMT+02:00 Simon Rit : > Hi Chao, > Thanks for the detailed report. > > > On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > >> Hi Simon, >> >> Thanks for the suggestions. >> >> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >> >> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >> --dimension 640,250,640 --hardware=cuda -v -l >> >> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM >> usage. >> I found that the size of the volume data in the GRAM could be reduced by >> --divisions but the amount of projection data sent to the GRAM are not >> influenced by --lowmem switch. >> > After looking at the code again, lowmem acts on the reading so it's not > related to the GPU memory but on the CPU memory, sorry about that.
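[Editor's note] The m_BufferSize / m_BufferSizeInPlaceFFT bookkeeping Chao describes can be modeled in a few lines. This Python model is purely illustrative: the names mirror the ITK classes from the mail, but it is not the real API.

```python
class CudaDataManagerModel:
    """Toy model of the two-buffer-size bookkeeping described in the thread."""
    def __init__(self):
        self.buffer_size = 0
        self.buffer_size_inplace_fft = 0
        self.allocated = 0

    def set_buffer_size(self, num):
        # Keep the two sizes in lockstep for backward compatibility ...
        self.buffer_size = num
        self.buffer_size_inplace_fft = num

    def allocate_in_place_fft(self, x, y, z, itemsize=4):
        # ... except when an in-place FFT needs the larger buffer.
        self.set_buffer_size(x * y * z * itemsize)
        self.buffer_size_inplace_fft = 2 * (x // 2 + 1) * y * z * itemsize

    def update_gpu_buffer(self):
        # Temporarily swap in the larger size, allocate, then restore.
        saved = self.buffer_size
        self.buffer_size = self.buffer_size_inplace_fft
        self.allocated = self.buffer_size   # stands in for this->Allocate()
        self.buffer_size = saved

m = CudaDataManagerModel()
m.allocate_in_place_fft(1944, 1536, 1)
m.update_gpu_buffer()
assert m.allocated == 2 * (1944 // 2 + 1) * 1536 * 4   # padded size allocated
assert m.buffer_size == 1944 * 1536 * 4                # reported size restored
```

The point of the swap-and-restore is that everything downstream still sees the original m_BufferSize, while the physical allocation is padded.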
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right? 
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performances? If you had FFTW, can you >> provide the sie of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags). 
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performances? If you had FFTW, can you >>> provide the sie of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit. 
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option, --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such severe memory limitations that you want to go in that direction? Using the two streaming options (--subset + --divisions), you should be able to reduce your memory consumption sufficiently. If you really want to go further with the in-place implementation, I think a code patch would be more helpful, but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at the stage even before padding the projections... > > Back to the in-place FFT subject.
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the "in-place" here is on > a lower level and may not be compatible with the meaning of "in-place" of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have sort of "dirty" implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt is to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(...) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing > the padded projection data as how it works now; while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space.
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension so that they were distributed in the > linear buffer. > > So this was where the "dirty" changes started. First of all, instead of > calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change of the call for allocation from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and set of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > stored the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply do this->Allocate(), I first check > if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT > = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code.
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM are not >>> influenced by --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading so it's not >> related to the GPU memory but on the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk. 
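[Editor's note] To get a feel for the 16-versus-2 trade-off quoted above, here is a back-of-the-envelope estimate in Python. It is illustrative only: the real allocation pattern of RTK's ramp filter (e.g. padding the projections for the convolution) is not modeled, and the "plan costs about as much as the data" rule of thumb comes from this thread, not from measurement.

```python
def subset_gram_bytes(nproj, x=1944, y=1536, bytes_per_float=4):
    """Very rough GPU footprint of one streamed batch in the ramp filter:
    padded projection data plus a cuFFT plan of comparable size."""
    padded_x = 2 * (x // 2 + 1)      # in-place R2C row padding
    data = nproj * padded_x * y * bytes_per_float
    return 2 * data                  # data + plan (a plan can cost as much again)

default = subset_gram_bytes(16)      # 16 projections per pass (the default)
reduced = subset_gram_bytes(2)       # after editing rtkFDKConeBeamReconstructionFilter.txx
print(default // 2**20, "MiB vs", reduced // 2**20, "MiB")

# The same arithmetic could drive the automatic choice Chao suggests,
# using the free-memory figure returned by cudaMemGetInfo:
free_bytes = int(1.2 * 2**30)        # e.g. 1.5 GiB card, minus headroom for volume chunk
nproj_fit = free_bytes // subset_gram_bytes(1)
assert subset_gram_bytes(nproj_fit) <= free_bytes
```

Because the footprint scales linearly with the batch size, dropping from 16 to 2 projections per pass cuts this part of the GPU memory use by a factor of 8.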
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
Then, your 10 slices are sufficient but the default unmatched forward/back-projector (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for a description of this) gives bad results. You can now solve this if you match them with the option --bp NormalizedJoseph that Cyril has implemented. So even a better implementation of SART (the current one) does not solve the problems that you have pointed out. You need a large enough CT image given the input data to solve the problem. I hope this will be helpful, maybe not to you if it's too late but to some others. Simon On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote: > Hi Simon, > > I believe it really is a problem with the sum of the weights. > > I first tried with the Shepp-Logan-phantom and afterwards with my data. > The geometry is that of a standard cone-beam micro-CT. > > The data I posted before were the reconstruction of just the middle > plane. As I did the same with the Shepp-Logan-phantom data, similar > effects were seen. As soon as one reconstructs a larger region around > the middle plane, the artefacts vanish in the inner parts of the > reconstructed volume, while in the top and bottom parts artefacts remain. > > The program calls were: > > create geometry: > ---------------- > rtksimulatedgeometry --nproj="1200" --output="geometry.xml" > --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360" > > project the phantom: > -------------------- > rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing > 2.5 --dimension 256 --phantomfile SheppLogan.txt > > do a reference FDK reconstruction: > ---------------------------------- > rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > SART reconstruction of middle plane: > ------------------------------------ > rtksart -p .
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the >>> same parameters and already tried with different values of lambda and >>> niterations. >>> >>> Does anyone have an idea what went wrong? Is there some kind of >>> smoothing or regularisation applied in the SART implementation? >>> >>> Many thanks in advance! >>> >>> Cheers, >>> >>> Andreas >>> >>> >>> -- >>> >>> =============================================================== >>> Dr. Andreas Staude >>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie >>> BAM Bundesanstalt für Materialforschung und -prüfung >>> Unter den Eichen 87 >>> D-12205 Berlin >>> Germany >>> >>> Tel.: ++49 30 8104 4140 >>> Fax: ++49 30 8104 1837 >>> =============================================================== >>> >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > > -- > > =============================================================== > Dr. Andreas Staude > Fachbereich 8.5 "Mikro-ZfP", Computertomographie > BAM Bundesanstalt für Materialforschung und -prüfung > Unter den Eichen 87 > D-12205 Berlin > Germany > > Tel.: ++49 30 8104 4140 > Fax: ++49 30 8104 1837 > =============================================================== From simon.rit at creatis.insa-lyon.fr Wed May 21 08:30:21 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 14:30:21 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Since it fails in cufft, it's the memory of the projections that is a problem. Therefore, it is not surprising that --divisions has no influence. But --lowmem should have an influence. I would suggest: - to uncomment //#define VERBOSE in itkCudaImageDataManager.hxx and try to see what amount of memory are requested. - to try to reproduce the problem with simulated data so that we can help you in finding a solution. Simon On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > Hi Simon, > > Yes I switched on an off the --lowmem option and it has no influence on the > behaviour I mentioned. > In my case the system memory is sufficient to handle the projections plus > the volume. > The major bottleneck is the amount of graphics memory. > If I reconstruct a little bit more slices than the limit that I found with > one stream, the allocation of GPU resource for CUFFT in the > CudaFFTRampImageFilter will fail (which was more or less expected).
> However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> > >> > Best regards, >> > Chao >> > >> > _______________________________________________ >> > Rtk-users mailing list >> > Rtk-users at openrtk.org >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> > > > From simon.rit at creatis.insa-lyon.fr Wed May 21 10:19:26 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 16:19:26 +0200 Subject: [Rtk-users] Backward incompatible change: angles in radians Message-ID: Dear all, Be aware that I have just pushed a backward incompatible change: https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194 I usually try to maintain backward compatibility but I felt that the class rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on: - all angles stored or returned by the class are in radians - only the function AddProjection takes angles in degrees as parameters. AddProjectionInRadians allows you to avoid conversion of angles that are already in radians if you prefer it. - angles in geometry files are still in degrees. I believe that you will only have issues with this if you were using one of the following methods: - GetGantryAngles - GetOutOfPlaneAngles - GetInPlaneAngles The returned values are now in radians, not in degrees anymore. I apologize in advance for any inconvenience and I'm available to help you if it is one. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From wuchao04 at gmail.com Thu May 22 04:06:44 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Thu, 22 May 2014 10:06:44 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for the suggestions. The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt rtkfdk -p .
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. I found that the size of the volume data in the GRAM could be reduced by --divisions but the amount of projection data sent to the GRAM is not influenced by the --lowmem switch. So --divisions does not help much if it is mainly the projection data which takes up GRAM, while --lowmem does not help at all. I did not look into the earlier part of the code so I am not sure if this is the designed behaviour. On the other hand, I am also looking for possibilities to reduce GRAM used in the CUDA ramp filter. At least one thing should be changed, and one thing may be considered: - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be destroyed earlier, right after the plan is executed. A plan takes up at least the same amount of memory as the data. - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a clear idea about how to pad deviceProjection to the required size of its cufftComplex counterpart. Any comments? Best regards, Chao 2014-05-21 14:30 GMT+02:00 Simon Rit : > Since it fails in cufft, it's the memory of the projections that is a > problem. Therefore, it is not surprising that --divisions has no > influence. But --lowmem should have an influence. I would suggest: > - to uncomment > //#define VERBOSE > in itkCudaImageDataManager.hxx and try to see what amount of memory > are requested. > - to try to reproduce the problem with simulated data so that we can > help you in finding a solution. > Simon > > On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > > Hi Simon, > > > > Yes I switched on an off the --lowmem option and it has no influence on > the > > behaviour I mentioned. > > In my case the system memory is sufficient to handle the projections plus > > the volume.
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> > > >> > Best regards, > >> > Chao > >> > > >> > _______________________________________________ > >> > Rtk-users mailing list > >> > Rtk-users at openrtk.org > >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Mon May 26 18:12:50 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 00:12:50 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, Thanks for the detailed report. On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > Hi Simon, > > Thanks for the suggestions. > > The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: > > rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 > rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing > 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt > rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 > --dimension 640,250,640 --hardware=cuda -v -l > > With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of > itkCudaImageDataManager.hxx) now I can have a better view of the GRAM > usage. > I found that the size of the volume data in the GRAM could be reduced by > --divisions but the amount of projection data sent to the GRAM are not > influenced by --lowmem switch. > After looking at the code again, lowmem acts on the reading so it's not related to the GPU memory but on the CPU memory, sorry about that. The reconstruction algorithm does stream the projections but it processes by default 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk. 
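The subset streaming described above (processing a fixed number of projections per GPU pass instead of the whole stack) boils down to chunking the projection range. A minimal sketch with hypothetical names, not the actual RTK code:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper (not RTK code): split nproj projections into
// subsets of at most chunkSize, mirroring how the FDK filter streams a
// fixed number of projections at a time through the GPU.
std::vector<std::pair<int, int>> MakeProjectionSubsets(int nproj, int chunkSize)
{
  std::vector<std::pair<int, int>> subsets; // inclusive [first, last] ranges
  for (int first = 0; first < nproj; first += chunkSize)
    subsets.push_back({ first, std::min(first + chunkSize, nproj) - 1 });
  return subsets;
}
```

With 400 projections, a subset size of 16 means 25 GPU passes; dropping it to 2 means 200 passes but only an eighth of the projection memory resident at a time.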
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction. 
I was looking into the in-place FFT these days, and the way of tuning the number of projections sent to the ramp filter is exactly what I plan to look for next. Now I know that directly. I think it is a good idea to make it an option of rtkfdk, or to regulate it automatically by inquiring the amount of free memory with cudaMemGetInfo and estimating the memory needed for storing the projections, ramp kernel, FFT plan and the chunk of volume. The latter may be difficult though since such estimation is not easy at the stage even before padding the projections... Back to the in-place FFT subject. Not sure about ITKFFT, but both FFTW and cuFFT could perform FFT in-place. So in principle rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter may also be made in-place if FFTW is used. However the "in-place" here is on a lower level and may not be compatible with the meaning of "in-place" of itk::InPlaceImageFilter. Anyway, since system memory is not a problem to me, I only focus on the Cuda filter. I already have a sort of "dirty" implementation for my own use: First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that the size of the buffer (now used by both deviceProjectionFFT and deviceProjection) should be 2*(x/2+1)*y*z instead of x*y*z. Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above is maintained in paddedImage. Its size is determined in PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and CPU-to-GPU data copying is by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first attempt is to make the image regions of paddedImage different from each other by modifying FFTRampImageFilter::PadInputImageRegion(...)
in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing the padded projection data as how it works now; while its BufferedRegion should be 2*(x/2+1) by y by z, with the additional part reserved for in-place FFT. Other small changes were done to calculate inputDimension and kernelDimension correctly based on RequestedRegion. Later I realized that this did not work, since cuFFT sees the buffer just as a linear space. All image data should come continuously from the beginning of the buffer and all unused spaces are at the end, but in this case the reserved spaces were at the end along the x (first) dimension so that they were distributed in the linear buffer. So this was where the "dirty" changes started. First of all, instead of calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did not check if the modification works for CPU or any other situations as well, so I prefer to make branches when possible instead of direct changes). The latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only change of the call for allocation from paddedImage->Allocate() to paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. There, after the calculation and set of CudaDataManager::m_BufferSize as before, I also calculate the required buffer size for in-place FFT and store the value in a new member of CudaDataManager, namely m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and after that restore m_BufferSize to its original value.
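The 2*(x/2+1) figure that keeps appearing above is the standard layout for an in-place real-to-complex transform: the x/2+1 complex outputs of each row must fit in the same buffer that held the x real inputs. A small sketch of the arithmetic (illustrative helpers, not the RTK code):

```cpp
#include <cstddef>

// Illustrative helpers (not RTK code): buffer sizes, counted in floats,
// for a padded projection stack of x*y*z real samples.
std::size_t RealBufferFloats(std::size_t x, std::size_t y, std::size_t z)
{
  return x * y * z; // out-of-place: the real input alone
}

// In-place real-to-complex FFT: each row of x reals is stored in
// 2*(x/2+1) floats so the x/2+1 complex outputs fit in the same buffer.
std::size_t InPlaceBufferFloats(std::size_t x, std::size_t y, std::size_t z)
{
  return 2 * (x / 2 + 1) * y * z;
}
```

For the 1944-wide projections in the example commands above, the padded row is 2*(1944/2+1) = 1946 floats, so the in-place buffer is only about 0.1% larger than the real data, whereas keeping a separate cufftComplex buffer roughly doubles the total footprint.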
Other changes have been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked those one by one) will not be influenced by the piece of new code. At last, under GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the additional space in this buffer will never have a chance later to be initialized by means of CPU-to-GPU data copying. The length of the data is shorter than the buffer size. It works for me so far. Please see if you have any better routine to implement this. Thank you. Best regards, Chao 2014-05-27 0:12 GMT+02:00 Simon Rit : > Hi Chao, > Thanks for the detailed report. > > > On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > >> Hi Simon, >> >> Thanks for the suggestions. >> >> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >> >> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >> --dimension 640,250,640 --hardware=cuda -v -l >> >> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM >> usage. >> I found that the size of the volume data in the GRAM could be reduced by >> --divisions but the amount of projection data sent to the GRAM are not >> influenced by --lowmem switch. >> > After looking at the code again, lowmem acts on the reading so it's not > related to the GPU memory but on the CPU memory, sorry about that.
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right? 
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performances? If you had FFTW, can you >> provide the sie of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags). 
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performances? If you had FFTW, can you >>> provide the sie of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit. 
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option, --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such memory limitations problems that you want to go in that direction? Using the two streaming options (--subset + --divisions), you should be able to sufficiently reduce your memory consumption. If you really want to go further in the in-place implementation, I think a code patch would be more helpful but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at the stage even before padding the projections... > > Back to the in-place FFT subject. 
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the "in-place" here is on > a lower level and may not be compatible with the meaning of "in-place" of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have sort of "dirty" implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt is to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(...) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing > the padded projection data as how it works now; while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space. 
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension so that they were distributed in the > linear buffer. > > So this was where the "dirty" changes started. First of all, instead of > calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change of the call for allocation from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and set of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > stored the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply do this->Allocate(), I first check > if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT > = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code. 
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM are not >>> influenced by --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading so it's not >> related to the GPU memory but on the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk. 
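[Archive note: the 2*(x/2+1)*y*z buffer size quoted above is the standard layout rule for in-place real-to-complex transforms. It can be sanity-checked without a GPU using NumPy's rfft, which follows the same x/2+1 output-length rule as cuFFT's R2C transform. This is an illustrative sketch, not RTK code; the dimensions are made up.]

```python
import numpy as np

# Illustrative sizes only (the thread uses 1944x1536 projections).
x, y = 8, 3  # x is the transformed (first/fastest) dimension

proj = np.random.rand(y, x).astype(np.float32)

# A real-to-complex FFT along x yields x//2 + 1 complex samples per row,
# thanks to Hermitian symmetry of the spectrum of real input.
spec = np.fft.rfft(proj, axis=1)
assert spec.shape == (y, x // 2 + 1)

# Stored in-place as interleaved (re, im) floats, each row therefore needs
# 2*(x//2 + 1) floats -- hence a 2*(x/2+1)*y*z buffer instead of x*y*z.
floats_per_row = 2 * (x // 2 + 1)
print(floats_per_row)  # 10 floats for x=8, i.e. 2 floats of padding per row
```

[For even x this is exactly x+2 floats per row, which matches the observation in the thread that the in-place buffer only needs to grow slightly beyond the original projection data.]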
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
Then, your 10 slices are sufficient but the default unmatched forward/back-projector (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for a description of this) give bad results. You can now solve this if you match them with the option --bp NormalizedJoseph that Cyril has implemented. So even a better of implementation of SART (the current one) does not solve the problems that you have pointed out. You need a large enough CT image given input data to solve the problem. I hope this will be helpful, maybe not to you if it's too late but to some others. Simon On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote: > Hi Simon, > > I believe it really is a problem with the sum of the weights. > > I first tried with the Shepp-Logan-phantom and afterwards with my data. > The geometry is that of a standard cone-beam micro-CT. > > The data I posted before were the reconstruction of just the middle > plane. As I did the same with the Shepp-Logan-phantom data, similar > effects were seen. As soon as one reconstructs a larger region around > the middle plane, the artefacts vanish in the inner parts of the > reconstructed volume, while in the top and bottom parts artefacts remain. > > The program calls were: > > create geometry: > ---------------- > rtksimulatedgeometry --nproj="1200" --output="geometry.xml" > --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360" > > project the phantom: > -------------------- > rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing > 2.5 --dimension 256 --phantomfile SheppLogan.txt > > do a reference FDK reconstruction: > ---------------------------------- > rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > SART reconstruction of middle plane: > ------------------------------------ > rtksart -p . 
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the >>> same parameters and already tried with different values of lambda and >>> niterations. >>> >>> Does anyone have an idea what went wrong? Is there some kind of >>> smoothing or regularisation applied in the SART implementation? >>> >>> Many thanks in advance! >>> >>> Cheers, >>> >>> Andreas >>> >>> >>> -- >>> >>> =============================================================== >>> Dr. Andreas Staude >>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie >>> BAM Bundesanstalt f?r Materialforschung und -pr?fung >>> Unter den Eichen 87 >>> D-12205 Berlin >>> Germany >>> >>> Tel.: ++49 30 8104 4140 >>> Fax: ++49 30 8104 1837 >>> =============================================================== >>> >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > > -- > > =============================================================== > Dr. Andreas Staude > Fachbereich 8.5 "Mikro-ZfP", Computertomographie > BAM Bundesanstalt f?r Materialforschung und -pr?fung > Unter den Eichen 87 > D-12205 Berlin > Germany > > Tel.: ++49 30 8104 4140 > Fax: ++49 30 8104 1837 > =============================================================== From wuchao04 at gmail.com Wed May 21 06:18:57 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Wed, 21 May 2014 12:18:57 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk Message-ID: Hoi, I may need some hint about how the stream division works in rtkfdk. I noticed that the StreamingImageFilter from ITK is used but I cannot figure out quickly how the division has been performed. I did some test with reconstructing 400 1500x1200 projections into a 640xNx640 volume (the pixel and voxel size are comparable). The reconstructions were executed by rtkfdk with CUDA. 
When I leave the origin of the volume at the center by default, I can reconstruct up to N=200 slices with --divisions=1 due to the limitation of the graphic memory. Then when I increase the number of divisions to 2, I can only reconstruct up to 215 slices; and with divisions to 3 only up to 219 slices. Does anyone have an idea why it scales like this? Thanks in advance. Best regards, Chao -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 21 07:43:40 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 13:43:40 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, There are two things that use memory, the volume and the projections. The --divisions option divides the volume only. The --lowmem option works on a subset of projections at a time. Did you try this? Simon On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > Hoi, > > I may need some hint about how the stream division works in rtkfdk. > I noticed that the StreamingImageFilter from ITK is used but I cannot figure > out quickly how the division has been performed. > I did some test with reconstructing 400 1500x1200 projections into a > 640xNx640 volume (the pixel and voxel size are comparable). > The reconstructions were executed by rtkfdk with CUDA. > When I leave the origin of the volume at the center by default, I can > reconstruct up to N=200 slices with --divisions=1 due to the limitation of > the graphic memory. Then when I increase the number of divisions to 2, I can > only reconstruct up to 215 slices; and with divisions to 3 only up to 219 > slices. Does anyone have an idea why it scales like this? > Thanks in advance. 
> > Best regards, > Chao > > _______________________________________________ > Rtk-users mailing list > Rtk-users at openrtk.org > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > From wuchao04 at gmail.com Wed May 21 08:21:00 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Wed, 21 May 2014 14:21:00 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Yes I switched on and off the --lowmem option and it has no influence on the behaviour I mentioned. In my case the system memory is sufficient to handle the projections plus the volume. The major bottleneck is the amount of graphics memory. If I reconstruct a little bit more slices than the limit that I found with one stream, the allocation of GPU resource for CUFFT in the CudaFFTRampImageFilter will fail (which was more or less expected). However with --divisions > 1 it is indeed able to reconstruct more slices, but only a very few more; otherwise the CUFFT would fail again. I would expect the limitations of the amount of slices to be approximately proportional to the number of streams, or do I miss anything about stream division? Thanks, Chao 2014-05-21 13:43 GMT+02:00 Simon Rit : > Hi Chao, > There are two things that use memory, the volume and the projections. > The --divisions option divides the volume only. The --lowmem option > works on a subset of projections at a time. Did you try this? > Simon > > On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > > Hoi, > > > > I may need some hint about how the stream division works in rtkfdk. > > I noticed that the StreamingImageFilter from ITK is used but I cannot > figure > > out quickly how the division has been performed. > > I did some test with reconstructing 400 1500x1200 projections into a > > 640xNx640 volume (the pixel and voxel size are comparable). > > The reconstructions were executed by rtkfdk with CUDA.
> > When I leave the origin of the volume at the center by default, I can > > reconstruct up to N=200 slices with --divisions=1 due to the limitation > of > > the graphic memory. Then when I increase the number of divisions to 2, I > can > > only reconstruct up to 215 slices; and with divisions to 3 only up to 219 > > slices. Does anyone have an idea why it scales like this? > > Thanks in advance. > > > > Best regards, > > Chao > > > > _______________________________________________ > > Rtk-users mailing list > > Rtk-users at openrtk.org > > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 21 08:30:21 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 14:30:21 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Since it fails in cufft, it's the memory of the projections that is a problem. Therefore, it is not surprising that --divisions has no influence. But --lowmem should have an influence. I would suggest: - to uncomment //#define VERBOSE in itkCudaImageDataManager.hxx and try to see what amount of memory are requested. - to try to reproduce the problem with simulated data so that we can help you in finding a solution. Simon On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > Hi Simon, > > Yes I switched on an off the --lowmem option and it has no influence on the > behaviour I mentioned. > In my case the system memory is sufficient to handle the projections plus > the volume. > The major bottleneck is the amount of graphics memory. > If I reconstruct a little bit more slices than the limit that I found with > one stream, the allocation of GPU resource for CUFFT in the > CudaFFTRampImageFilter will fail (which was more or less expected). 
> However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> > >> > Best regards, >> > Chao >> > >> > _______________________________________________ >> > Rtk-users mailing list >> > Rtk-users at openrtk.org >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> > > > From simon.rit at creatis.insa-lyon.fr Wed May 21 10:19:26 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 16:19:26 +0200 Subject: [Rtk-users] Backward incompatible change: angles in radians Message-ID: Dear all, Be aware that I have just pushed a backward incompatible change: https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194 I usually try to maintain backward compatibility but I felt that the class rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on: - all angles stored or returned by the class are in radians - only the function AddProjection takes angles in degrees as parameters. AddProjectionInRadians allows you to avoid conversion of angles that are already in radians if you prefer it. - angles in geometry files are still in degrees. I believe that you will only have issues with this if you were using one of the following methods: - GetGantryAngles - GetOutOfPlaneAngles - GetInPlaneAngles The returned values are now in radians, not in degrees anymore. I apologize in advance for any inconvenience and I'm available to help you if it causes any trouble. Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From wuchao04 at gmail.com Thu May 22 04:06:44 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Thu, 22 May 2014 10:06:44 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for the suggestions. The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt rtkfdk -p .
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. I found that the size of the volume data in the GRAM could be reduced by --divisions but the amount of projection data sent to the GRAM are not influenced by --lowmem switch. So --divisions does not help much if it is mainly the projection data which takes up GRAM, while --lowmem does not help at all. I did not look into the more front part of the code so I am not sure if this is the designed behaviour. On the other hand, I am also looking for possibilities to reduce GRAM used in the CUDA ramp filter. At least one thing should be changed, and one thing may be considered: - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be destroyed earlier, right after the plan being executed. A plan takes up at least the same amount of memory as the data. - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a clear idea about how to pad deviceProjection to the required size of its cufftComplex counterpart. Any comments? Best regards, Chao 2014-05-21 14:30 GMT+02:00 Simon Rit : > Since it fails in cufft, it's the memory of the projections that is a > problem. Therefore, it is not surprising that --divisions has no > influence. But --lowmem should have an influence. I would suggest: > - to uncomment > //#define VERBOSE > in itkCudaImageDataManager.hxx and try to see what amount of memory > are requested. > - to try to reproduce the problem with simulated data so that we can > help you in finding a solution. > Simon > > On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: > > Hi Simon, > > > > Yes I switched on an off the --lowmem option and it has no influence on > the > > behaviour I mentioned. > > In my case the system memory is sufficient to handle the projections plus > > the volume. 
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> > > >> > Best regards, > >> > Chao > >> > > >> > _______________________________________________ > >> > Rtk-users mailing list > >> > Rtk-users at openrtk.org > >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Mon May 26 18:12:50 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 00:12:50 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, Thanks for the detailed report. On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > Hi Simon, > > Thanks for the suggestions. > > The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: > > rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 > rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing > 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt > rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 > --dimension 640,250,640 --hardware=cuda -v -l > > With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of > itkCudaImageDataManager.hxx) now I can have a better view of the GRAM > usage. > I found that the size of the volume data in the GRAM could be reduced by > --divisions but the amount of projection data sent to the GRAM are not > influenced by --lowmem switch. > After looking at the code again, lowmem acts on the reading so it's not related to the GPU memory but on the CPU memory, sorry about that. The reconstruction algorithm does stream the projections but it processes by default 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk. 
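Simon's point above (the reconstruction streams the projections but processes 16 at a time by default) can be paired with a back-of-the-envelope memory estimate. The sketch below is illustrative only, not RTK code: the 2x zero-padding factor along x and the plan-sized cuFFT work area are assumptions, and the 1944x1536 projection size is the one from Chao's reproduction commands.

```python
# Rough estimate of GPU memory the ramp filter needs per chunk of
# projections -- an illustrative sketch, NOT RTK's actual allocation
# scheme (padding factor and plan overhead are assumptions).

def ramp_chunk_bytes(nproj, nx, ny, pad=2.0):
    nx_pad = int(nx * pad)                         # zero-padding along x
    real_buf = nproj * nx_pad * ny * 4             # float32 projections
    cplx_buf = nproj * (nx_pad // 2 + 1) * ny * 8  # float2 half spectra
    plan = real_buf                                # cuFFT work area, ~data size
    return real_buf + cplx_buf + plan

MB = 1024 * 1024
for n in (16, 2):  # default chunk size vs the value suggested above
    print(n, "projections ->", ramp_chunk_bytes(n, 1944, 1536) // MB, "MB")
```

With numbers of this order (roughly a gigabyte for 16 projections versus under 150 MB for 2), it is clear why shrinking the chunk fits a 1.5 GB card.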
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction. 
I was looking into the in-place FFT these days, and the way of tuning the number of projections sent to the ramp filter is exactly what I plan to look for next. Now I know that directly. I think it is a good idea to make it an option of rtkfdk, or to regulate it automatically by inquiring the amount of free memory with cudaMemGetInfo and estimating the memory needed for storing the projections, ramp kernel, FFT plan and the chunk of volume. The latter may be difficult though since such estimation is not easy at the stage even before padding the projections... Back to the in-place FFT subject. Not sure about ITKFFT, but both FFTW and cuFFT could perform FFT in-place. So in principle rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter may also be made in-place if FFTW is used. However the "in-place" here is on a lower level and may not be compatible with the meaning of "in-place" of itk::InPlaceImageFilter. Anyway, since system memory is not a problem to me, I only focus on the Cuda filter. I already have a sort of "dirty" implementation for my own use: First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that the size of the buffer (now used by both deviceProjectionFFT and deviceProjection) should be 2*(x/2+1)*y*z instead of x*y*z. Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above is maintained in paddedImage. Its size is determined in PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and CPU-to-GPU data copying is by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first attempt is to make the image regions of paddedImage different from each other by modifying FFTRampImageFilter::PadInputImageRegion(...)
in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing the padded projection data as it does now; while its BufferedRegion should be 2*(x/2+1) by y by z, with the additional part reserved for in-place FFT. Other small changes were done to calculate inputDimension and kernelDimension correctly based on RequestedRegion. Later I realized that this did not work, since cuFFT sees the buffer just as a linear space. All image data should come continuously from the beginning of the buffer and all unused spaces are at the end, but in this case the reserved spaces were at the end along the x (first) dimension so that they were distributed in the linear buffer. So this was where the "dirty" changes started. First of all, instead of calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did not check if the modification works for CPU or any other situations as well, so I prefer to make branches when possible instead of direct changes). The latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only change of the call for allocation from paddedImage->Allocate() to paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. There, after the calculation and set of CudaDataManager::m_BufferSize as before, I also calculate the required buffer size for in-place FFT and stored the value in a new member of CudaDataManager, namely m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and after that restore m_BufferSize to its original value.
Other changes have been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked those one by one) will not be influenced by the piece of new code. At last, under GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the additional space in this buffer will never have a chance later to be initialized by means of CPU-to-GPU data copying. The length of the data is shorter than the buffer size. It works for me so far. Please see if you have any better routine to implement this. Thank you. Best regards, Chao 2014-05-27 0:12 GMT+02:00 Simon Rit : > Hi Chao, > Thanks for the detailed report. > > > On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: > >> Hi Simon, >> >> Thanks for the suggestions. >> >> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >> >> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >> --dimension 640,250,640 --hardware=cuda -v -l >> >> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM >> usage. >> I found that the size of the volume data in the GRAM could be reduced by >> --divisions but the amount of projection data sent to the GRAM are not >> influenced by --lowmem switch. >> > After looking at the code again, lowmem acts on the reading so it's not > related to the GPU memory but on the CPU memory, sorry about that.
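The buffer arithmetic behind the in-place change described above can be sanity-checked without a GPU. A minimal sketch in plain Python (nothing here calls cuFFT; the 1944x1536x16 sizes are just the ones discussed in this thread): cufftExecR2C stores nx/2+1 complex values per transformed row, so an in-place transform needs the real buffer over-allocated to 2*(nx/2+1)*y*z floats, as stated in the email.

```python
# Compare the plain real buffer with the over-allocated buffer that an
# in-place real-to-complex FFT needs (2*(nx//2+1) floats per row, since
# the half spectrum is written back over the input).

def out_of_place_floats(nx, ny, nz):
    return nx * ny * nz                  # padded real projections only

def in_place_floats(nx, ny, nz):
    return 2 * (nx // 2 + 1) * ny * nz   # holds data, then spectrum

x, y, z = 1944, 1536, 16                 # padded width, height, chunk
extra = in_place_floats(x, y, z) - out_of_place_floats(x, y, z)
# For even nx the overhead is only 2 floats per row -- tiny compared to
# the separate complex buffer it replaces:
print(extra, "extra floats =", extra * 4, "extra bytes")
```

The design trade-off is then visible in numbers: a per-row overhead of a few bytes buys back an entire spectrum-sized allocation.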
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right? 
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performances? If you had FFTW, can you >> provide the sie of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags). 
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performances? If you had FFTW, can you >>> provide the sie of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit. 
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option, --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such memory limitations problems that you want to go in that direction? Using the two streaming options (--subset + --divisions), you should be able to sufficiently reduce your memory consumption. If you really want to go further in the in-place implementation, I think a code patch would be more helpful but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at the stage even before padding the projections... > > Back to the in-place FFT subject. 
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the "in-place" here is on > a lower level and may not be compatible with the meaning of "in-place" of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have sort of "dirty" implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt is to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(...) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing > the padded projection data as how it works now; while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space.
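The 2*(x/2+1) sizing discussed above can be made concrete with a small sketch. This is illustrative arithmetic, not RTK code: it counts the floats an in-place real-to-complex FFT needs per x-row, and shows why the spare elements end up interleaved through the linear buffer (one pad per row end) rather than collected in one free block at the end.

```cpp
#include <cassert>
#include <cstddef>

// An in-place real-to-complex FFT over the first (x) dimension, as in FFTW
// and cuFFT, stores each row of x real samples in 2*(x/2+1) floats so the
// x/2+1 complex outputs fit in the same memory. The extra floats sit at the
// END of every row, so the padding is scattered throughout the linear
// buffer -- enlarging the BufferedRegion along x therefore does not leave
// one contiguous block of image data followed by free space.
std::size_t inplaceRowFloats(std::size_t x) { return 2 * (x / 2 + 1); }

std::size_t inplaceBufferFloats(std::size_t x, std::size_t y, std::size_t z)
{
  return inplaceRowFloats(x) * y * z;
}
```

For example, a 1500-sample row needs 2*(1500/2+1) = 1502 floats, so two spare floats follow every row of the projection data.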
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension so that they were distributed in the > linear buffer. > > So this was where the "dirty" changes started. First of all, instead of > calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change of the call for allocation from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and set of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > stored the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply do this->Allocate(), I first check > if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT > = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code.
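The allocate-with-larger-size pattern described above can be sketched in a few lines. The class below is a hypothetical stand-in for itk::CudaDataManager (member names follow the email; everything else is invented for illustration): the logical size keeps reporting x*y*z elements while the actual allocation uses the larger in-place-FFT size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the described CudaDataManager patch.
// m_BufferSizeInPlaceFFT tracks m_BufferSize unless an in-place FFT needs a
// larger allocation; the larger size is used for the allocation itself while
// the logical buffer size stays unchanged for the rest of the pipeline.
class DataManagerSketch
{
public:
  void SetBufferSize(std::size_t n)
  {
    m_BufferSize = n;
    m_BufferSizeInPlaceFFT = n; // keep the two equal by default
  }
  void SetBufferSizeInPlaceFFT(std::size_t n) { m_BufferSizeInPlaceFFT = n; }
  void UpdateBuffer()
  {
    // Allocate the (possibly larger) in-place size, zero-initialized: the
    // pad region is never touched by host-to-device copies, so it must
    // start out cleared.
    m_Storage.assign(m_BufferSizeInPlaceFFT, 0.0f);
  }
  std::size_t LogicalSize() const { return m_BufferSize; }
  std::size_t AllocatedSize() const { return m_Storage.size(); }

private:
  std::size_t m_BufferSize = 0;
  std::size_t m_BufferSizeInPlaceFFT = 0;
  std::vector<float> m_Storage; // stands in for the GPU allocation
};
```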
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM are not >>> influenced by --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading so it's not >> related to the GPU memory but on the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk. 
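The effect of shrinking the projection subset can be weighed with back-of-the-envelope arithmetic. The function below is an assumption-laden sketch, not RTK code: it counts the padded real projection buffer, the complex FFT buffer of (px/2+1) frequencies per row, and a cuFFT plan workspace assumed to be at least as large as the data (as noted in the thread).

```cpp
#include <cassert>
#include <cstddef>

// Rough GPU memory estimate for one subset of nproj projections of padded
// size px x py in the ramp filter: real buffer + complex FFT buffer + plan
// workspace (taken here as a lower bound equal to the real data size).
std::size_t rampFilterBytes(std::size_t px, std::size_t py, std::size_t nproj)
{
  const std::size_t realBytes    = px * py * nproj * sizeof(float);
  const std::size_t complexBytes = (px / 2 + 1) * py * nproj * 2 * sizeof(float);
  const std::size_t planBytes    = realBytes; // lower bound for the plan
  return realBytes + complexBytes + planBytes;
}
```

With 16 projections of an assumed padded size of 2048x1200, this comes to roughly half a gigabyte, so cutting the subset from 16 to 2 projections frees most of the graphics memory, while --divisions only shrinks the volume chunk.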
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
Then, your 10 slices are sufficient but the default unmatched forward/back-projector (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for a description of this) gives bad results. You can now solve this if you match them with the option --bp NormalizedJoseph that Cyril has implemented. So even a better implementation of SART (the current one) does not solve the problems that you have pointed out. You need a large enough CT image given the input data to solve the problem. I hope this will be helpful, maybe not to you if it's too late but to some others. Simon On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote: > Hi Simon, > > I believe it really is a problem with the sum of the weights. > > I first tried with the Shepp-Logan-phantom and afterwards with my data. > The geometry is that of a standard cone-beam micro-CT. > > The data I posted before were the reconstruction of just the middle > plane. As I did the same with the Shepp-Logan-phantom data, similar > effects were seen. As soon as one reconstructs a larger region around > the middle plane, the artefacts vanish in the inner parts of the > reconstructed volume, while in the top and bottom parts artefacts remain. > > The program calls were: > > create geometry: > ---------------- > rtksimulatedgeometry --nproj="1200" --output="geometry.xml" > --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360" > > project the phantom: > -------------------- > rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing > 2.5 --dimension 256 --phantomfile SheppLogan.txt > > do a reference FDK reconstruction: > ---------------------------------- > rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > SART reconstruction of middle plane: > ------------------------------------ > rtksart -p . 
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the >>> same parameters and already tried with different values of lambda and >>> niterations. >>> >>> Does anyone have an idea what went wrong? Is there some kind of >>> smoothing or regularisation applied in the SART implementation? >>> >>> Many thanks in advance! >>> >>> Cheers, >>> >>> Andreas >>> >>> >>> -- >>> >>> =============================================================== >>> Dr. Andreas Staude >>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie >>> BAM Bundesanstalt f?r Materialforschung und -pr?fung >>> Unter den Eichen 87 >>> D-12205 Berlin >>> Germany >>> >>> Tel.: ++49 30 8104 4140 >>> Fax: ++49 30 8104 1837 >>> =============================================================== >>> >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > > -- > > =============================================================== > Dr. Andreas Staude > Fachbereich 8.5 "Mikro-ZfP", Computertomographie > BAM Bundesanstalt f?r Materialforschung und -pr?fung > Unter den Eichen 87 > D-12205 Berlin > Germany > > Tel.: ++49 30 8104 4140 > Fax: ++49 30 8104 1837 > =============================================================== From wuchao04 at gmail.com Wed May 21 06:18:57 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Wed, 21 May 2014 12:18:57 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk Message-ID: Hoi, I may need some hint about how the stream division works in rtkfdk. I noticed that the StreamingImageFilter from ITK is used but I cannot figure out quickly how the division has been performed. I did some test with reconstructing 400 1500x1200 projections into a 640xNx640 volume (the pixel and voxel size are comparable). The reconstructions were executed by rtkfdk with CUDA. 
When I leave the origin of the volume at the center by default, I can reconstruct up to N=200 slices with --divisions=1 due to the limitation of the graphic memory. Then when I increase the number of divisions to 2, I can only reconstruct up to 215 slices; and with divisions to 3 only up to 219 slices. Does anyone have an idea why it scales like this? Thanks in advance. Best regards, Chao -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 21 07:43:40 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 21 May 2014 13:43:40 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, There are two things that use memory, the volume and the projections. The --divisions option divides the volume only. The --lowmem option works on a subset of projections at a time. Did you try this? Simon On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > Hoi, > > I may need some hint about how the stream division works in rtkfdk. > I noticed that the StreamingImageFilter from ITK is used but I cannot figure > out quickly how the division has been performed. > I did some test with reconstructing 400 1500x1200 projections into a > 640xNx640 volume (the pixel and voxel size are comparable). > The reconstructions were executed by rtkfdk with CUDA. > When I leave the origin of the volume at the center by default, I can > reconstruct up to N=200 slices with --divisions=1 due to the limitation of > the graphic memory. Then when I increase the number of divisions to 2, I can > only reconstruct up to 215 slices; and with divisions to 3 only up to 219 > slices. Does anyone have an idea why it scales like this? > Thanks in advance. 
> > Best regards, > Chao > > _______________________________________________ > Rtk-users mailing list > Rtk-users at openrtk.org > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users > From wuchao04 at gmail.com Wed May 21 08:21:00 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Wed, 21 May 2014 14:21:00 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Yes I switched on an off the --lowmem option and it has no influence on the behaviour I mentioned. In my case the system memory is sufficient to handle the projections plus the volume. The major bottleneck is the amount of graphics memory. If I reconstruct a little bit more slices than the limit that I found with one stream, the allocation of GPU resource for CUFFT in the CudaFFTRampImageFilter will fail (which was more or less expected). However with --divisions > 1 it is indeed able to reconstruct more slices, but only a very few more; otherwise the CUFFT would fail again. I would expect the limitations of the amount of slices to be approximately proportional to the number of streams, or do I miss anything about stream division? Thanks, Chao 2014-05-21 13:43 GMT+02:00 Simon Rit : > Hi Chao, > There are two things that use memory, the volume and the projections. > The --divisions option divides the volume only. The --lowmem option > works on a subset of projections at a time. Did you try this? > Simon > > On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > > Hoi, > > > > I may need some hint about how the stream division works in rtkfdk. > > I noticed that the StreamingImageFilter from ITK is used but I cannot > figure > > out quickly how the division has been performed. > > I did some test with reconstructing 400 1500x1200 projections into a > > 640xNx640 volume (the pixel and voxel size are comparable). > > The reconstructions were executed by rtkfdk with CUDA. 
> > When I leave the origin of the volume at the center by default, I can
> > reconstruct up to N=200 slices with --divisions=1 due to the limitation of
> > the graphic memory. Then when I increase the number of divisions to 2, I can
> > only reconstruct up to 215 slices; and with divisions to 3 only up to 219
> > slices. Does anyone have an idea why it scales like this?
> > Thanks in advance.
> >
> > Best regards,
> > Chao
> >
> > _______________________________________________
> > Rtk-users mailing list
> > Rtk-users at openrtk.org
> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From simon.rit at creatis.insa-lyon.fr  Wed May 21 08:30:21 2014
From: simon.rit at creatis.insa-lyon.fr (Simon Rit)
Date: Wed, 21 May 2014 14:30:21 +0200
Subject: [Rtk-users] Stream divisions in rtkfdk
In-Reply-To: 
References: 
Message-ID: 

Since it fails in cufft, it's the memory of the projections that is a problem. Therefore, it is not surprising that --divisions has no influence. But --lowmem should have an influence. I would suggest:
- to uncomment
//#define VERBOSE
in itkCudaImageDataManager.hxx and try to see what amount of memory is requested;
- to try to reproduce the problem with simulated data so that we can help you in finding a solution.
Simon

On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote:
> Hi Simon,
>
> Yes I switched on an off the --lowmem option and it has no influence on the
> behaviour I mentioned.
> In my case the system memory is sufficient to handle the projections plus
> the volume.
> The major bottleneck is the amount of graphics memory.
> If I reconstruct a little bit more slices than the limit that I found with
> one stream, the allocation of GPU resource for CUFFT in the
> CudaFFTRampImageFilter will fail (which was more or less expected).
> However with --divisions > 1 it is indeed able to reconstruct more slices, > but only a very few more; otherwise the CUFFT would fail again. > I would expect the limitations of the amount of slices to be approximately > proportional to the number of streams, or do I miss anything about stream > division? > > Thanks, > Chao > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > >> Hi Chao, >> There are two things that use memory, the volume and the projections. >> The --divisions option divides the volume only. The --lowmem option >> works on a subset of projections at a time. Did you try this? >> Simon >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> > Hoi, >> > >> > I may need some hint about how the stream division works in rtkfdk. >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> > figure >> > out quickly how the division has been performed. >> > I did some test with reconstructing 400 1500x1200 projections into a >> > 640xNx640 volume (the pixel and voxel size are comparable). >> > The reconstructions were executed by rtkfdk with CUDA. >> > When I leave the origin of the volume at the center by default, I can >> > reconstruct up to N=200 slices with --divisions=1 due to the limitation >> > of >> > the graphic memory. Then when I increase the number of divisions to 2, I >> > can >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> > 219 >> > slices. Does anyone have an idea why it scales like this? >> > Thanks in advance. 
>> >
>> > Best regards,
>> > Chao
>> >
>> > _______________________________________________
>> > Rtk-users mailing list
>> > Rtk-users at openrtk.org
>> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users
>> >
>
>

From simon.rit at creatis.insa-lyon.fr  Wed May 21 10:19:26 2014
From: simon.rit at creatis.insa-lyon.fr (Simon Rit)
Date: Wed, 21 May 2014 16:19:26 +0200
Subject: [Rtk-users] Backward incompatible change: angles in radians
Message-ID: 

Dear all,
Be aware that I have just pushed a backward incompatible change:
https://github.com/SimonRit/RTK/commit/b6661f59a0a5730545474163f73438a978053194
I usually try to maintain backward compatibility but I felt that the class rtk::ThreeDCircularProjectionGeometry was really too messy. So from now on:
- all angles stored or returned by the class are in radians;
- only the function AddProjection takes angles in degrees as parameters. AddProjectionInRadians allows you to avoid conversion of angles that are already in radians if you prefer it;
- angles in geometry files are still in degrees.
I believe that you will only have issues with this if you were using one of the following methods:
- GetGantryAngles
- GetOutOfPlaneAngles
- GetInPlaneAngles
The returned values are now in radians, not in degrees anymore.
I apologize in advance for any inconvenience and I'm available to help you if it causes any.
Simon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From wuchao04 at gmail.com  Thu May 22 04:06:44 2014
From: wuchao04 at gmail.com (Chao Wu)
Date: Thu, 22 May 2014 10:06:44 +0200
Subject: [Rtk-users] Stream divisions in rtkfdk
In-Reply-To: 
References: 
Message-ID: 

Hi Simon,

Thanks for the suggestions.

The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by:

rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384
rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt
rtkfdk -p .
-r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l

With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage.
I found that the size of the volume data in the GRAM could be reduced by --divisions, but the amount of projection data sent to the GRAM is not influenced by the --lowmem switch. So --divisions does not help much if it is mainly the projection data which takes up GRAM, while --lowmem does not help at all. I did not look into the earlier part of the code, so I am not sure if this is the intended behaviour.

On the other hand, I am also looking for possibilities to reduce the GRAM used in the CUDA ramp filter. At least one thing should be changed, and one thing may be considered:
- in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be destroyed earlier, right after it has been executed. A plan takes up at least the same amount of memory as the data.
- cufftExecR2C and cufftExecC2R can be in-place. However, I do not have a clear idea about how to pad deviceProjection to the required size of its cufftComplex counterpart.

Any comments?

Best regards,
Chao

2014-05-21 14:30 GMT+02:00 Simon Rit :
> Since it fails in cufft, it's the memory of the projections that is a
> problem. Therefore, it is not surprising that --divisions has no
> influence. But --lowmem should have an influence. I would suggest:
> - to uncomment
> //#define VERBOSE
> in itkCudaImageDataManager.hxx and try to see what amount of memory
> are requested.
> - to try to reproduce the problem with simulated data so that we can
> help you in finding a solution.
> Simon
>
> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote:
> > Hi Simon,
> >
> > Yes I switched on an off the --lowmem option and it has no influence on the
> > behaviour I mentioned.
> > In my case the system memory is sufficient to handle the projections plus
> > the volume.
> > The major bottleneck is the amount of graphics memory. > > If I reconstruct a little bit more slices than the limit that I found > with > > one stream, the allocation of GPU resource for CUFFT in the > > CudaFFTRampImageFilter will fail (which was more or less expected). > > However with --divisions > 1 it is indeed able to reconstruct more > slices, > > but only a very few more; otherwise the CUFFT would fail again. > > I would expect the limitations of the amount of slices to be > approximately > > proportional to the number of streams, or do I miss anything about stream > > division? > > > > Thanks, > > Chao > > > > > > > > 2014-05-21 13:43 GMT+02:00 Simon Rit : > > > >> Hi Chao, > >> There are two things that use memory, the volume and the projections. > >> The --divisions option divides the volume only. The --lowmem option > >> works on a subset of projections at a time. Did you try this? > >> Simon > >> > >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: > >> > Hoi, > >> > > >> > I may need some hint about how the stream division works in rtkfdk. > >> > I noticed that the StreamingImageFilter from ITK is used but I cannot > >> > figure > >> > out quickly how the division has been performed. > >> > I did some test with reconstructing 400 1500x1200 projections into a > >> > 640xNx640 volume (the pixel and voxel size are comparable). > >> > The reconstructions were executed by rtkfdk with CUDA. > >> > When I leave the origin of the volume at the center by default, I can > >> > reconstruct up to N=200 slices with --divisions=1 due to the > limitation > >> > of > >> > the graphic memory. Then when I increase the number of divisions to > 2, I > >> > can > >> > only reconstruct up to 215 slices; and with divisions to 3 only up to > >> > 219 > >> > slices. Does anyone have an idea why it scales like this? > >> > Thanks in advance. 
> >> >
> >> > Best regards,
> >> > Chao
> >> >
> >> > _______________________________________________
> >> > Rtk-users mailing list
> >> > Rtk-users at openrtk.org
> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users
> >> >
> >
> >

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From simon.rit at creatis.insa-lyon.fr  Mon May 26 18:12:50 2014
From: simon.rit at creatis.insa-lyon.fr (Simon Rit)
Date: Tue, 27 May 2014 00:12:50 +0200
Subject: [Rtk-users] Stream divisions in rtkfdk
In-Reply-To: 
References: 
Message-ID: 

Hi Chao,
Thanks for the detailed report.

On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote:
> Hi Simon,
>
> Thanks for the suggestions.
>
> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by:
>
> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384
> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt
> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l
>
> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of
> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM
> usage.
> I found that the size of the volume data in the GRAM could be reduced by
> --divisions but the amount of projection data sent to the GRAM are not
> influenced by --lowmem switch.
>
After looking at the code again, lowmem acts on the reading, so it's not related to the GPU memory but to the CPU memory, sorry about that. The reconstruction algorithm does stream the projections but it processes by default 16 projections at a time. You can change this in rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce your GPU memory consumption (I checked and it works for me). Let me know if it works for you and if you think that this should be made an option of rtkfdk.
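The effect of that per-subset projection count can be sketched with a toy model (illustrative C++ only, not RTK source; the function and variable names are invented for this sketch):

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative sketch (not RTK code): stream nProj projections through a
// filter in subsets of at most subsetSize, and return the peak number of
// projections resident in GPU memory at any one time. Shrinking subsetSize
// (e.g. from 16 to 2) shrinks this peak, at the cost of more passes.
std::size_t peakResidentProjections(std::size_t nProj, std::size_t subsetSize)
{
  std::size_t peak = 0;
  for (std::size_t first = 0; first < nProj; first += subsetSize)
  {
    // Only this subset is uploaded, ramp-filtered and backprojected at a time.
    const std::size_t count = std::min(subsetSize, nProj - first);
    peak = std::max(peak, count);
  }
  return peak;
}
```

With 400 projections, for example, a subset size of 2 keeps at most 2 projections on the GPU instead of 16, trading memory for extra passes through the filter.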
> So --divisions does not help much if it is mainly the projection data > which takes up GRAM, while --lowmem does not help at all. I did not look > into the more front part of the code so I am not sure if this is the > designed behaviour. > > On the other hand, I am also looking for possibilities to reduce GRAM used > in the CUDA ramp filter. At least one thing should be changed, and one > thing may be considered: > - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be > destroyed earlier, right after the plan being executed. A plan takes up at > least the same amount of memory as the data. > Good point, I changed it: https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a > clear idea about how to pad deviceProjection to the required size of > its cufftComplex counterpart. > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is not an itk::InPlaceImageFilter. It might be possible but I would have to check. Let me know if you investigate this further. Thanks again, Simon > > Any comments? > > Best regards, > Chao > > > > 2014-05-21 14:30 GMT+02:00 Simon Rit : > > Since it fails in cufft, it's the memory of the projections that is a >> problem. Therefore, it is not surprising that --divisions has no >> influence. But --lowmem should have an influence. I would suggest: >> - to uncomment >> //#define VERBOSE >> in itkCudaImageDataManager.hxx and try to see what amount of memory >> are requested. >> - to try to reproduce the problem with simulated data so that we can >> help you in finding a solution. >> Simon >> >> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >> > Hi Simon, >> > >> > Yes I switched on an off the --lowmem option and it has no influence on >> the >> > behaviour I mentioned. >> > In my case the system memory is sufficient to handle the projections >> plus >> > the volume. 
>> > The major bottleneck is the amount of graphics memory. >> > If I reconstruct a little bit more slices than the limit that I found >> with >> > one stream, the allocation of GPU resource for CUFFT in the >> > CudaFFTRampImageFilter will fail (which was more or less expected). >> > However with --divisions > 1 it is indeed able to reconstruct more >> slices, >> > but only a very few more; otherwise the CUFFT would fail again. >> > I would expect the limitations of the amount of slices to be >> approximately >> > proportional to the number of streams, or do I miss anything about >> stream >> > division? >> > >> > Thanks, >> > Chao >> > >> > >> > >> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >> > >> >> Hi Chao, >> >> There are two things that use memory, the volume and the projections. >> >> The --divisions option divides the volume only. The --lowmem option >> >> works on a subset of projections at a time. Did you try this? >> >> Simon >> >> >> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >> >> > Hoi, >> >> > >> >> > I may need some hint about how the stream division works in rtkfdk. >> >> > I noticed that the StreamingImageFilter from ITK is used but I cannot >> >> > figure >> >> > out quickly how the division has been performed. >> >> > I did some test with reconstructing 400 1500x1200 projections into a >> >> > 640xNx640 volume (the pixel and voxel size are comparable). >> >> > The reconstructions were executed by rtkfdk with CUDA. >> >> > When I leave the origin of the volume at the center by default, I can >> >> > reconstruct up to N=200 slices with --divisions=1 due to the >> limitation >> >> > of >> >> > the graphic memory. Then when I increase the number of divisions to >> 2, I >> >> > can >> >> > only reconstruct up to 215 slices; and with divisions to 3 only up to >> >> > 219 >> >> > slices. Does anyone have an idea why it scales like this? >> >> > Thanks in advance. 
>> >> > >> >> > Best regards, >> >> > Chao >> >> > >> >> > _______________________________________________ >> >> > Rtk-users mailing list >> >> > Rtk-users at openrtk.org >> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >> >> > >> > >> > >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Tue May 27 08:23:51 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Tue, 27 May 2014 14:23:51 +0200 Subject: [Rtk-users] Test phantoms for RTK In-Reply-To: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> References: <31A5856E30ED6242B799932F22FF200A508CE1@ee-mbx2.ee.emp-eaw.ch> Message-ID: Hi, Please use the mailing list, your question might be of interest to others. The use of phantoms is described on the wiki (http://wiki.openrtk.org). For example, look for the Elekta and Varian section to see how to reconstruct these datasets. Let us know if something is not clear there with a more specific question, we'll be happy to improve the description. Thanks, Simon On Tue, May 27, 2014 at 11:28 AM, Liu, Yu wrote: > Dear Mr. Rit, > > > > I am doing my PhD at Empa in Switzerland. Currently I am trying to use RTK > to implement some of my algorithms. > > I found some test phantoms you uploaded to kitware > (http://midas3.kitware.com/midas/community/20#) and you referred to them in > one of your publications. > > However, you did not provide any documents on how to use them (at least how > to read the files). Is it possible that you give me some hints on this > issue? > > > > Thank you. > > Best regards, > > Yu Liu From wuchao04 at gmail.com Tue May 27 08:24:19 2014 From: wuchao04 at gmail.com (Chao Wu) Date: Tue, 27 May 2014 14:24:19 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Simon, Thanks for your reaction. 
I was looking into the in-place FFT these days, and the way of tuning the number of projections sent to the ramp filter is exactly what I planned to look for next. Now I know it directly.
I think it is a good idea to make it an option of rtkfdk, or to regulate it automatically by inquiring the amount of free memory with cudaMemGetInfo and estimating the memory needed for storing the projections, the ramp kernel, the FFT plan and the chunk of volume. The latter may be difficult though, since such an estimation is not easy at that stage, even before padding the projections...

Back to the in-place FFT subject. Not sure about ITK's FFT, but both FFTW and cuFFT can perform the FFT in-place. So in principle rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter may also be made in-place if FFTW is used. However, the "in-place" here is on a lower level and may not be compatible with the meaning of "in-place" of itk::InPlaceImageFilter.

Anyway, since system memory is not a problem for me, I only focus on the CUDA filter. I already have a sort of "dirty" implementation for my own use:

First, in rtkCudaFFTRampImageFilter.cu I commented out the cudaMalloc and cudaFree of deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) deviceProjection. Now the cuFFT is in-place; the only thing is that the size of the buffer (now used by both deviceProjectionFFT and deviceProjection) should be 2*(x/2+1)*y*z instead of x*y*z.

Then I went to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above is maintained in paddedImage. Its size is determined in PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and CPU-to-GPU data copying are done by paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98).

My first attempt was to make the image regions of paddedImage different from each other by modifying FFTRampImageFilter::PadInputImageRegion(...)
in rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z, storing the padded projection data as it does now, while its BufferedRegion should be 2*(x/2+1) by y by z, with the additional part reserved for the in-place FFT. Other small changes were made to calculate inputDimension and kernelDimension correctly based on the RequestedRegion.

Later I realized that this did not work, since cuFFT sees the buffer just as a linear space. All image data should come contiguously from the beginning of the buffer with all unused space at the end, but in this case the reserved space was at the end along the x (first) dimension, so it was scattered throughout the linear buffer.

So this is where the "dirty" changes started.

First of all, instead of calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, I call an altered version named PadInputImageRegionInPlaceFFT(...) (because I did not check whether the modification works for the CPU or in any other situations as well, so I prefer to make branches where possible instead of direct changes). The latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only change being the allocation call, from paddedImage->Allocate() to paddedImage->AllocateInPlaceFFT().

Again, CudaImage::AllocateInPlaceFFT() is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. There, after calculating and setting CudaDataManager::m_BufferSize as before, I also calculate the required buffer size for the in-place FFT and store the value in a new member of CudaDataManager, namely m_BufferSizeInPlaceFFT.

Then, in CudaDataManager::UpdateGPUBuffer() in itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check whether m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and after that restore m_BufferSize to its original value.
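The 2*(x/2+1)*y*z sizing rule used above can be written out as a small helper (illustrative C++ sketch; the function name is invented, not part of RTK):

```cpp
#include <cstddef>

// Illustrative sketch (not RTK code): number of floats needed for a buffer
// that holds an x*y*z real projection stack and, in place, its half-spectrum
// R2C FFT result along x. Each row of x reals transforms to x/2+1 complex
// values, i.e. 2*(x/2+1) floats, so the real data must be padded accordingly.
std::size_t inPlaceFFTBufferFloats(std::size_t x, std::size_t y, std::size_t z)
{
  return 2 * (x / 2 + 1) * y * z;
}
```

For the 1944-pixel-wide projections in the example above, each row grows from 1944 to 2*(1944/2+1) = 1946 floats, i.e. two extra floats of padding per row for even x.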
Other changes have been made to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to m_BufferSize for backward compatibility, such as adding "m_BufferSizeInPlaceFFT = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any other allocation actions (although I have not checked those one by one) will not be influenced by the new code.

Finally, in GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after the cudaMalloc I added a cudaMemset to initialize the buffer to all zeros, since the additional space in this buffer will never get a chance to be initialized later by CPU-to-GPU data copying (the length of the data is shorter than the buffer size).

It works for me so far. Please see if you have any better way to implement this.

Thank you.

Best regards,
Chao

2014-05-27 0:12 GMT+02:00 Simon Rit :
> Hi Chao,
> Thanks for the detailed report.
>
> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote:
>> Hi Simon,
>>
>> Thanks for the suggestions.
>>
>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by:
>>
>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384
>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt
>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 --dimension 640,250,640 --hardware=cuda -v -l
>>
>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of
>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM
>> usage.
>> I found that the size of the volume data in the GRAM could be reduced by
>> --divisions but the amount of projection data sent to the GRAM are not
>> influenced by --lowmem switch.
>>
> After looking at the code again, lowmem acts on the reading so it's not
> related to the GPU memory but on the CPU memory, sorry about that.
The > reconstruction algorithm does stream the projections but it processes by > default 16 projections at a time. You can change this in > rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will > reduce your GPU memory consumption (I checked and it works for me). Let me > know if it works for you and if you think that this should be made an > option of rtkfdk. > > >> So --divisions does not help much if it is mainly the projection data >> which takes up GRAM, while --lowmem does not help at all. I did not look >> into the more front part of the code so I am not sure if this is the >> designed behaviour. >> >> On the other hand, I am also looking for possibilities to reduce GRAM >> used in the CUDA ramp filter. At least one thing should be changed, and one >> thing may be considered: >> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >> destroyed earlier, right after the plan being executed. A plan takes up at >> least the same amount of memory as the data. >> > Good point, I changed it: > > https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 > > >> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >> clear idea about how to pad deviceProjection to the required size of >> its cufftComplex counterpart. >> > I'm not sure it should be done in-place since rtk::FFTRampImageFilter is > not an itk::InPlaceImageFilter. It might be possible but I would have to > check. Let me know if you investigate this further. > Thanks again, > Simon > > >> >> Any comments? >> >> Best regards, >> Chao >> >> >> >> 2014-05-21 14:30 GMT+02:00 Simon Rit : >> >> Since it fails in cufft, it's the memory of the projections that is a >>> problem. Therefore, it is not surprising that --divisions has no >>> influence. But --lowmem should have an influence. I would suggest: >>> - to uncomment >>> //#define VERBOSE >>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>> are requested. 
>>> - to try to reproduce the problem with simulated data so that we can >>> help you in finding a solution. >>> Simon >>> >>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>> > Hi Simon, >>> > >>> > Yes I switched on an off the --lowmem option and it has no influence >>> on the >>> > behaviour I mentioned. >>> > In my case the system memory is sufficient to handle the projections >>> plus >>> > the volume. >>> > The major bottleneck is the amount of graphics memory. >>> > If I reconstruct a little bit more slices than the limit that I found >>> with >>> > one stream, the allocation of GPU resource for CUFFT in the >>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>> > However with --divisions > 1 it is indeed able to reconstruct more >>> slices, >>> > but only a very few more; otherwise the CUFFT would fail again. >>> > I would expect the limitations of the amount of slices to be >>> approximately >>> > proportional to the number of streams, or do I miss anything about >>> stream >>> > division? >>> > >>> > Thanks, >>> > Chao >>> > >>> > >>> > >>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>> > >>> >> Hi Chao, >>> >> There are two things that use memory, the volume and the projections. >>> >> The --divisions option divides the volume only. The --lowmem option >>> >> works on a subset of projections at a time. Did you try this? >>> >> Simon >>> >> >>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>> >> > Hoi, >>> >> > >>> >> > I may need some hint about how the stream division works in rtkfdk. >>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>> cannot >>> >> > figure >>> >> > out quickly how the division has been performed. >>> >> > I did some test with reconstructing 400 1500x1200 projections into a >>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>> >> > The reconstructions were executed by rtkfdk with CUDA. 
>>> >> > When I leave the origin of the volume at the center by default, I >>> can >>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>> limitation >>> >> > of >>> >> > the graphic memory. Then when I increase the number of divisions to >>> 2, I >>> >> > can >>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>> to >>> >> > 219 >>> >> > slices. Does anyone have an idea why it scales like this? >>> >> > Thanks in advance. >>> >> > >>> >> > Best regards, >>> >> > Chao >>> >> > >>> >> > _______________________________________________ >>> >> > Rtk-users mailing list >>> >> > Rtk-users at openrtk.org >>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> >> > >>> > >>> > >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.rit at creatis.insa-lyon.fr Wed May 28 10:48:20 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Wed, 28 May 2014 16:48:20 +0200 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: <5305E503.3000506@ucl.ac.uk> References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: Hi Ben, It was on my todo list. I found the problem and here is the fix: https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 Multithreading was indeed disabled as you pointed out, I had to remember pieces of code that were quite old (for an animal like me). Thanks again for the detailed report, Simon On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion wrote: > Hi Simon, > > Really appreciate your prompt response! > > Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster > reconstructions, and the time increase between the two commits reduces to a > little over 2x (See below). > > My dataset consists of 344 projections (about 172.0 MB) > > Does this sound about right? 
The CPU utilization still looks a bit like a > series of spikes for the latter commit (but different than before). > > Reconstructing and writing... It took 36.0746 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.59479 s > Ramp filter: 19.3106 s > Backprojection: 13.8042 s > > ***versus*** > > Reconstructing and writing... It took 83.4121 s > FDKConeBeamReconstructionFilter timing: > Prefilter operations: 2.62535 s > Ramp filter: 66.5537 s > Backprojection: 13.8829 s > > Thanks again, > > Ben > > > > > On 20/02/14 06:57, Simon Rit wrote: >> >> Hi, >> Thank you Ben for the amazing report. I can spot a few things that >> could have gone wrong there but it seems to me that your >> reconstruction is slow both before and after the commit... Two >> potential reasons: >> - you have not activated FFTW in ITK. You should definitely do that, >> the FFT of ITK is (very) slow and probably not multithreaded. You must >> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >> version of ITK4, I had some issues with the first versions, see >> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >> - you are using a huge dataset. >> If you did not use FFTW, could you try again with FFTW and tell us if >> you still observe a drop in performances? If you had FFTW, can you >> provide the sie of the dataset you used? >> Thanks, >> Simon >> >> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >> wrote: >>> >>> Hello, >>> >>> First of all, many thanks to the RTK community for this useful toolkit! >>> >>> While experimenting with different versions of the code (I'm a relatively >>> new user), I've encountered large differences in rtkfdk (CPU) >>> reconstruction >>> speed between code versions (a newer version being substantially slower >>> than >>> an older version). >>> >>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>> required -g, -p, -r and -o flags, but no other flags). 
>>> >>> Using git-bisect, I narrowed it down to a particular commit. The parent >>> commit runs quite quickly, but the child commit shows nearly 4x >>> reconstruction time, and less-uniform CPU utilization (it looks like a >>> series of spikes). >>> >>> (See below) >>> >>> Looking at the diffs, it seems that in addition to adding the HannY >>> functionality (which should be disabled by default?), there were some >>> changes in this commit related to threading (in >>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>> misleading and the substantial difference consists in changing the FFT >>> Ramp >>> Kernel. >>> >>> I'm currently reading the source to try to understand those changes, but >>> I >>> thought I would post in case someone is able to point me in the right >>> direction. Although these differences are unexpected to me, I doubt that >>> they are unexpected to more experienced users...! >>> >>> Apologies if I've left out any critical information (or if I've provided >>> too >>> much!). >>> >>> Many thanks in advance, >>> Ben >>> >>> ****** Parent Commit ****** >>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>> Author: Julien Jomier >>> Date: Fri Nov 30 09:30:59 2012 +0100 >>> >>> ENH: Minimum CMake version is 2.8.3 >>> >>> ***Partial output*** >>> >>> Reconstructing and writing... It took 44.3992 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.67915 s >>> Ramp filter: 26.3847 s >>> Backprojection: 13.0447 s >>> >>> ***Screenshot of CPU usage attached: >>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>> >>> ****** Child Commit ****** >>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>> Author: Simon Rit >>> Date: Wed Dec 5 16:22:47 2012 +0100 >>> >>> First version of Hann windowing in the second direction >>> (perpendicular >>> to the ramp) >>> >>> ***Partial output*** >>> Reconstructing and writing... 
It took 126.911 s >>> FDKConeBeamReconstructionFilter timing: >>> Prefilter operations: 2.47678 s >>> Ramp filter: 108.254 s >>> Backprojection: 13.2973 s >>> >>> ***Screenshot of CPU usage attached: >>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > From benjamin.champion.13 at ucl.ac.uk Thu May 29 05:19:37 2014 From: benjamin.champion.13 at ucl.ac.uk (Ben Champion) Date: Thu, 29 May 2014 10:19:37 +0100 Subject: [Rtk-users] Difference in rtkfdk (cpu) speed/threading In-Reply-To: References: <5304EB7F.4080601@ucl.ac.uk> <5305E503.3000506@ucl.ac.uk> Message-ID: <5386FBA9.6020402@ucl.ac.uk> Hi Simon, Glad to hear you found a fix! Thanks for looking into it. Best wishes, Ben On 28/05/14 15:48, Simon Rit wrote: > Hi Ben, > It was on my todo list. I found the problem and here is the fix: > https://github.com/SimonRit/RTK/commit/8eca086de6d67f390f985a74d8df239a60a09ce7 > Multithreading was indeed disabled as you pointed out, I had to > remember pieces of code that were quite old (for an animal like me). > Thanks again for the detailed report, > Simon > > On Thu, Feb 20, 2014 at 12:20 PM, Ben Champion > wrote: >> Hi Simon, >> >> Really appreciate your prompt response! >> >> Indeed, I was not using FFTW. After rebuilding ITK with FFTW, I get faster >> reconstructions, and the time increase between the two commits reduces to a >> little over 2x (See below). >> >> My dataset consists of 344 projections (about 172.0 MB) >> >> Does this sound about right? The CPU utilization still looks a bit like a >> series of spikes for the latter commit (but different than before). >> >> Reconstructing and writing... 
It took 36.0746 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.59479 s >> Ramp filter: 19.3106 s >> Backprojection: 13.8042 s >> >> ***versus*** >> >> Reconstructing and writing... It took 83.4121 s >> FDKConeBeamReconstructionFilter timing: >> Prefilter operations: 2.62535 s >> Ramp filter: 66.5537 s >> Backprojection: 13.8829 s >> >> Thanks again, >> >> Ben >> >> >> >> >> On 20/02/14 06:57, Simon Rit wrote: >>> Hi, >>> Thank you Ben for the amazing report. I can spot a few things that >>> could have gone wrong there but it seems to me that your >>> reconstruction is slow both before and after the commit... Two >>> potential reasons: >>> - you have not activated FFTW in ITK. You should definitely do that, >>> the FFT of ITK is (very) slow and probably not multithreaded. You must >>> turn on ITK_USE_FFTWD and ITK_USE_FFTWF. Be careful to use a recent >>> version of ITK4, I had some issues with the first versions, see >>> http://www.itk.org/pipermail/insight-users/2013-April/047562.html >>> - you are using a huge dataset. >>> If you did not use FFTW, could you try again with FFTW and tell us if >>> you still observe a drop in performances? If you had FFTW, can you >>> provide the sie of the dataset you used? >>> Thanks, >>> Simon >>> >>> On Wed, Feb 19, 2014 at 6:35 PM, Ben Champion >>> wrote: >>>> Hello, >>>> >>>> First of all, many thanks to the RTK community for this useful toolkit! >>>> >>>> While experimenting with different versions of the code (I'm a relatively >>>> new user), I've encountered large differences in rtkfdk (CPU) >>>> reconstruction >>>> speed between code versions (a newer version being substantially slower >>>> than >>>> an older version). >>>> >>>> To test I ran rtkfdk with "--hardware 'cpu' --verbose" (as well as the >>>> required -g, -p, -r and -o flags, but no other flags). >>>> >>>> Using git-bisect, I narrowed it down to a particular commit. 
The parent >>>> commit runs quite quickly, but the child commit shows nearly 4x >>>> reconstruction time, and less-uniform CPU utilization (it looks like a >>>> series of spikes). >>>> >>>> (See below) >>>> >>>> Looking at the diffs, it seems that in addition to adding the HannY >>>> functionality (which should be disabled by default?), there were some >>>> changes in this commit related to threading (in >>>> code/rtkFFTRampImageFilter.{h,txx}). However, perhaps threading is >>>> misleading and the substantial difference consists in changing the FFT >>>> Ramp >>>> Kernel. >>>> >>>> I'm currently reading the source to try to understand those changes, but >>>> I >>>> thought I would post in case someone is able to point me in the right >>>> direction. Although these differences are unexpected to me, I doubt that >>>> they are unexpected to more experienced users...! >>>> >>>> Apologies if I've left out any critical information (or if I've provided >>>> too >>>> much!). >>>> >>>> Many thanks in advance, >>>> Ben >>>> >>>> ****** Parent Commit ****** >>>> commit 9df6108ae0293f86b455a2dcd4b35801e4815718 >>>> Author: Julien Jomier >>>> Date: Fri Nov 30 09:30:59 2012 +0100 >>>> >>>> ENH: Minimum CMake version is 2.8.3 >>>> >>>> ***Partial output*** >>>> >>>> Reconstructing and writing... It took 44.3992 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.67915 s >>>> Ramp filter: 26.3847 s >>>> Backprojection: 13.0447 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> 9df6108ae0293f86b455a2dcd4b35801e4815718.png *** >>>> >>>> ****** Child Commit ****** >>>> commit e223a2ed2200bbd7d86966d4eb27319ed589ee00 >>>> Author: Simon Rit >>>> Date: Wed Dec 5 16:22:47 2012 +0100 >>>> >>>> First version of Hann windowing in the second direction >>>> (perpendicular >>>> to the ramp) >>>> >>>> ***Partial output*** >>>> Reconstructing and writing... 
It took 126.911 s >>>> FDKConeBeamReconstructionFilter timing: >>>> Prefilter operations: 2.47678 s >>>> Ramp filter: 108.254 s >>>> Backprojection: 13.2973 s >>>> >>>> ***Screenshot of CPU usage attached: >>>> e223a2ed2200bbd7d86966d4eb27319ed589ee00.png*** >>>> >>>> >>>> >>>> _______________________________________________ >>>> Rtk-users mailing list >>>> Rtk-users at openrtk.org >>>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> From simon.rit at creatis.insa-lyon.fr Fri May 30 05:12:41 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 11:12:41 +0200 Subject: [Rtk-users] Stream divisions in rtkfdk In-Reply-To: References: Message-ID: Hi Chao, I added the option, --subsetsize. Thanks for the detailed report. I don't understand it all, it's quite complicated... Do you really have such memory limitations problems that you want to go in that direction? Using the two streaming options (--subset + --divisions), you should be able to sufficiently reduce your memory consumption. If you really want to go further in the in-place implementation, I think a code patch would be more helpful but you must confine the changes to rtk::CudaFFTRampImageFilter. We don't want to modify itk::CudaDataManager for such a specific purpose. Simon On Tue, May 27, 2014 at 2:24 PM, Chao Wu wrote: > Hi Simon, > > Thanks for your reaction. I was looking into the in-place FFT these days, > and the way of tuning the number of projections sent to the ramp filter is > exactly what I plan to look for next. Now I know that directly. I think it > is a good idea to make it an option of rtkfdk, or to regulate it > automatically by inquiring the amount of free memory with cudaMemGetInfo and > estimating the memory needed for storing the projections, ramp kernel, FFT > plan and the chunk of volume. The latter may be difficult though since such > estimation is not easy at the stage even before padding the projections... > > Back to the in-place FFT subject. 
Not sure about ITKFFT, but both FFTW and > cuFFT could perform FFT in-place. So in principle > rtk::CudaFFTRampImageFilter could be in-place, and rtk::FFTRampImageFilter > may also be made in-place if FFTW is used. However the "in-place" here is on > a lower level and may not be compatible with the meaning of "in-place" of > itk::InPlaceImageFilter. > > Anyway, since system memory is not a problem to me, I only focus on the Cuda > filter. I already have a sort of "dirty" implementation for my own use: > > First in rtkCudaFFTRampImageFilter.cu I commented cudaMalloc and cudaFree of > deviceProjectionFFT, and then just let deviceProjectionFFT = (float2*) > deviceProjection. Now the cuFFT is in-place; the only thing is that the size > of the buffer (now used by both deviceProjectionFFT and deviceProjection) > should be 2*(x/2+1)*y*z instead of x*y*z. > > Then I went out to rtkCudaFFTRampImageFilter.cxx. The buffer mentioned above > is maintained in paddedImage. Its size is determined in > PadInputImageRegion(...) (line 60) and the actual GPU memory allocation and > CPU-to-GPU data copying is by > paddedImage->GetCudaDataManager()->GetGPUBufferPointer() (line 98). My first > attempt is to make the image regions of paddedImage different from each > other by modifying FFTRampImageFilter::PadInputImageRegion(...) in > rtkFFTRampImageFilter.txx: its RequestedRegion remains x by y by z storing > the padded projection data as it works now; while its BufferedRegion > should be 2*(x/2+1) by y by z, with the additional part reserved for > in-place FFT. Other small changes were done to calculate inputDimension and > kernelDimension correctly based on RequestedRegion. Later I realized that > this did not work, since cuFFT sees the buffer just as a linear space.
All > image data should come continuously from the beginning of the buffer and all > unused spaces are at the end, but in this case the reserved spaces were at > the end along the x (first) dimension so that they were distributed in the > linear buffer. > > So this was where the "dirty" changes started. First of all, instead of > calling PadInputImageRegion(...) at line 60 in rtkCudaFFTRampImageFilter.cxx, > I call an altered one named PadInputImageRegionInPlaceFFT(...) (because I did > not check if the modification works for CPU or any other situations as well, > so I prefer to make branches when possible instead of direct changes). The > latter is a copy of the former in rtkFFTRampImageFilter.txx, with the only > change of the call for allocation from paddedImage->Allocate() to > paddedImage->AllocateInPlaceFFT(). Again, CudaImage::AllocateInPlaceFFT() > is an altered version of CudaImage::Allocate() in itkCudaImage.hxx. > There, after the calculation and set of CudaDataManager::m_BufferSize as > before, I also calculate the required buffer size for in-place FFT and > store the value in a new member of CudaDataManager, namely > m_BufferSizeInPlaceFFT. Then under CudaDataManager::UpdateGPUBuffer() in > itkCudaDataManager.cxx, instead of simply doing this->Allocate(), I first check > if m_BufferSize and m_BufferSizeInPlaceFFT are equal. If not, I let > m_BufferSize = m_BufferSizeInPlaceFFT before doing this->Allocate(), and > after that restore m_BufferSize to its original value. Other changes have > been done to ensure that m_BufferSizeInPlaceFFT is otherwise always equal to > m_BufferSize for back-compatibility, such as adding "m_BufferSizeInPlaceFFT > = num" in void CudaDataManager::SetBufferSize(unsigned int num), so that any > other allocation actions (although I have not checked those one by one) will > not be influenced by the piece of new code.
At last, under > GPUMemPointer::Allocate(size_t bufferSize) in itkCudaDataManager.h, after > cudaMalloc I add cudaMemset to initialize the buffer to all zero, since the > additional space in this buffer will never have a chance later to be > initialized by means of CPU-to-GPU data copying. The length of the data is > shorter than the buffer size. > > It works for me so far. Please see if you have any better routine to > implement this. Thank you. > > Best regards, > Chao > > > > > > > > > 2014-05-27 0:12 GMT+02:00 Simon Rit : > >> Hi Chao, >> Thanks for the detailed report. >> >> >> On Thu, May 22, 2014 at 10:06 AM, Chao Wu wrote: >>> >>> Hi Simon, >>> >>> Thanks for the suggestions. >>> >>> The problem could be reproduced here (8G RAM, 1.5G GRAM, RTK1.0.0) by: >>> >>> rtksimulatedgeometry -n 30 -o geometry.xml --sdd=1536 --sid=384 >>> rtkprojectgeometricphantom -g geometry.xml -o projections.nii --spacing >>> 0.6 --dimension 1944,1536 --phantomfile SheppLogan.txt >>> rtkfdk -p . -r projections.nii -o fdk.nii -g geometry.xml --spacing 0.4 >>> --dimension 640,250,640 --hardware=cuda -v -l >>> >>> With #define VERBOSE (btw I got it in itkCudaDataManager.cxx instead of >>> itkCudaImageDataManager.hxx) now I can have a better view of the GRAM usage. >>> I found that the size of the volume data in the GRAM could be reduced by >>> --divisions but the amount of projection data sent to the GRAM are not >>> influenced by --lowmem switch. >> >> After looking at the code again, lowmem acts on the reading so it's not >> related to the GPU memory but on the CPU memory, sorry about that. The >> reconstruction algorithm does stream the projections but it processes by >> default 16 projections at a time. You can change this in >> rtkFDKConeBeamReconstructionFilter.txx line 28 to, e.g., 2. This will reduce >> your GPU memory consumption (I checked and it works for me). Let me know if >> it works for you and if you think that this should be made an option of >> rtkfdk. 
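[Editor's note] The buffer arithmetic behind the two points above — the 2*(x/2+1)*y*z padding required for an in-place real-to-complex FFT, and the default batch of 16 projections streamed to the GPU — can be checked with a short sketch. The helper name below is made up for illustration and none of this is RTK code; the projection dimensions are those of the simulated test in this thread:

```python
# Back-of-the-envelope GPU memory arithmetic for (1) the padded buffer an
# in-place real-to-complex FFT needs and (2) how much projection data one
# streaming batch sends to the GPU. Illustrative only, not RTK's accounting.

def inplace_r2c_floats(x, y, z):
    """Floats needed so x*y*z real samples can hold their own R2C FFT.

    A real-to-complex FFT of x samples yields x//2 + 1 complex bins,
    i.e. 2*(x//2 + 1) floats per row -- slightly more than x, which is
    why the buffer must be over-allocated for an in-place transform."""
    return 2 * (x // 2 + 1) * y * z

x, y = 1944, 1536          # one projection, as in the simulated test
batch = 16                 # projections processed per stream by default

plain = x * y * batch                     # floats, out-of-place input
padded = inplace_r2c_floats(x, y, batch)  # floats, in-place layout

print(f"plain  buffer: {plain * 4 / 2**20:.1f} MiB")
print(f"padded buffer: {padded * 4 / 2**20:.1f} MiB")
print(f"overhead: {padded - plain} floats "
      f"({(padded / plain - 1) * 100:.3f} %)")
```

The padding overhead per batch is only 2 floats per FFT row, so the real memory lever is the batch size: shrinking it (e.g. 16 to 2) scales the projection buffer linearly, which is why lowering the per-stream projection count helps where --divisions (which only splits the volume) does not.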
>> >>> >>> So --divisions does not help much if it is mainly the projection data >>> which takes up GRAM, while --lowmem does not help at all. I did not look >>> into the more front part of the code so I am not sure if this is the >>> designed behaviour. >>> >>> On the other hand, I am also looking for possibilities to reduce GRAM >>> used in the CUDA ramp filter. At least one thing should be changed, and one >>> thing may be considered: >>> - in rtkCudaFFTRampImageFilter.cu the forward FFT plan (fftFwd) should be >>> destroyed earlier, right after the plan being executed. A plan takes up at >>> least the same amount of memory as the data. >> >> Good point, I changed it: >> >> https://github.com/SimonRit/RTK/commit/bbba5ccd86d34ab8b4d9bc47b3ce6e2e176afc35 >> >>> >>> - cufftExecR2C and cufftExecC2R can be in-place. However I do not have a >>> clear idea about how to pad deviceProjection to the required size of its >>> cufftComplex counterpart. >> >> I'm not sure it should be done in-place since rtk::FFTRampImageFilter is >> not an itk::InPlaceImageFilter. It might be possible but I would have to >> check. Let me know if you investigate this further. >> Thanks again, >> Simon >> >>> >>> >>> Any comments? >>> >>> Best regards, >>> Chao >>> >>> >>> >>> 2014-05-21 14:30 GMT+02:00 Simon Rit : >>> >>>> Since it fails in cufft, it's the memory of the projections that is a >>>> problem. Therefore, it is not surprising that --divisions has no >>>> influence. But --lowmem should have an influence. I would suggest: >>>> - to uncomment >>>> //#define VERBOSE >>>> in itkCudaImageDataManager.hxx and try to see what amount of memory >>>> are requested. >>>> - to try to reproduce the problem with simulated data so that we can >>>> help you in finding a solution. >>>> Simon >>>> >>>> On Wed, May 21, 2014 at 2:21 PM, Chao Wu wrote: >>>> > Hi Simon, >>>> > >>>> > Yes I switched on an off the --lowmem option and it has no influence >>>> > on the >>>> > behaviour I mentioned. 
>>>> > In my case the system memory is sufficient to handle the projections >>>> > plus >>>> > the volume. >>>> > The major bottleneck is the amount of graphics memory. >>>> > If I reconstruct a little bit more slices than the limit that I found >>>> > with >>>> > one stream, the allocation of GPU resource for CUFFT in the >>>> > CudaFFTRampImageFilter will fail (which was more or less expected). >>>> > However with --divisions > 1 it is indeed able to reconstruct more >>>> > slices, >>>> > but only a very few more; otherwise the CUFFT would fail again. >>>> > I would expect the limitations of the amount of slices to be >>>> > approximately >>>> > proportional to the number of streams, or do I miss anything about >>>> > stream >>>> > division? >>>> > >>>> > Thanks, >>>> > Chao >>>> > >>>> > >>>> > >>>> > 2014-05-21 13:43 GMT+02:00 Simon Rit : >>>> > >>>> >> Hi Chao, >>>> >> There are two things that use memory, the volume and the projections. >>>> >> The --divisions option divides the volume only. The --lowmem option >>>> >> works on a subset of projections at a time. Did you try this? >>>> >> Simon >>>> >> >>>> >> On Wed, May 21, 2014 at 12:18 PM, Chao Wu wrote: >>>> >> > Hoi, >>>> >> > >>>> >> > I may need some hint about how the stream division works in rtkfdk. >>>> >> > I noticed that the StreamingImageFilter from ITK is used but I >>>> >> > cannot >>>> >> > figure >>>> >> > out quickly how the division has been performed. >>>> >> > I did some test with reconstructing 400 1500x1200 projections into >>>> >> > a >>>> >> > 640xNx640 volume (the pixel and voxel size are comparable). >>>> >> > The reconstructions were executed by rtkfdk with CUDA. >>>> >> > When I leave the origin of the volume at the center by default, I >>>> >> > can >>>> >> > reconstruct up to N=200 slices with --divisions=1 due to the >>>> >> > limitation >>>> >> > of >>>> >> > the graphic memory. 
Then when I increase the number of divisions to >>>> >> > 2, I >>>> >> > can >>>> >> > only reconstruct up to 215 slices; and with divisions to 3 only up >>>> >> > to >>>> >> > 219 >>>> >> > slices. Does anyone have an idea why it scales like this? >>>> >> > Thanks in advance. >>>> >> > >>>> >> > Best regards, >>>> >> > Chao >>>> >> > >>>> >> > _______________________________________________ >>>> >> > Rtk-users mailing list >>>> >> > Rtk-users at openrtk.org >>>> >> > http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>>> >> > >>>> > >>>> > >>> >>> >> > From simon.rit at creatis.insa-lyon.fr Fri May 30 07:12:49 2014 From: simon.rit at creatis.insa-lyon.fr (Simon Rit) Date: Fri, 30 May 2014 13:12:49 +0200 Subject: [Rtk-users] Result from SART is worse than from FDK In-Reply-To: <52B44FCA.7000800@bam.de> References: <527914C3.8030706@bam.de> <527918B5.9080709@bam.de> <52B44FCA.7000800@bam.de> Message-ID: Hi Andreas, I apologize for never getting back to you despite the clear description of the problem. Cyril Mory has done many developments in iterative reconstruction since your email, including some improvement of SART. See for example http://wiki.openrtk.org/index.php/RTK/Examples/ADMMTVReconstruction. I have launched the three cases you suggested with the "new" SART - SART reconstruction of middle plane: this cannot work because our forward projector assumes that the volume goes from the middle of the first voxel to the middle of the last voxel. Therefore, one plane is not enough, you need at least two. - SART reconstruction of 10 planes around middle plane: there is a truncation problem here and I don't see how it could be solved in this manner. In general, one needs to use a reconstruction support that is large enough for the problem at hand (see for example http://www.ncbi.nlm.nih.gov/pubmed/17441239). The situation is different if you reduce the data to the reconstruction of a single plane (with --dimension 256,1 in rtkprojectgeometricphantom). 
Then, your 10 slices are sufficient but the default unmatched forward/back-projectors (see http://www.ncbi.nlm.nih.gov/pubmed/11021698 for a description of this) give bad results. You can now solve this if you match them with the option --bp NormalizedJoseph that Cyril has implemented. So even a better implementation of SART (the current one) does not solve the problems that you have pointed out. You need a CT image that is large enough given the input data to solve the problem. I hope this will be helpful, maybe not to you if it's too late but to some others. Simon On Fri, Dec 20, 2013 at 3:10 PM, Staude, Andreas wrote: > Hi Simon, > > I believe it really is a problem with the sum of the weights. > > I first tried with the Shepp-Logan-phantom and afterwards with my data. > The geometry is that of a standard cone-beam micro-CT. > > The data I posted before were the reconstruction of just the middle > plane. As I did the same with the Shepp-Logan-phantom data, similar > effects were seen. As soon as one reconstructs a larger region around > the middle plane, the artefacts vanish in the inner parts of the > reconstructed volume, while in the top and bottom parts artefacts remain. > > The program calls were: > > create geometry: > ---------------- > rtksimulatedgeometry --nproj="1200" --output="geometry.xml" > --sdd="1169.59" --sid="451.645" --arc="-360" --first_angle="360" > > project the phantom: > -------------------- > rtkprojectgeometricphantom -g geometry.xml -o projections3.mha --spacing > 2.5 --dimension 256 --phantomfile SheppLogan.txt > > do a reference FDK reconstruction: > ---------------------------------- > rtkfdk -p . -r projections3.mha -o shepp-logan_fdk3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > SART reconstruction of middle plane: > ------------------------------------ > rtksart -p .
-r projections3.mha -o shepp-logan_sart3_2D.mha -g > geometry.xml --spacing 1 --dimension 256,1,256 > > SART reconstruction of 10 planes around middle plane: > ------------------------------------------------------- > rtksart -p . -r projections3.mha -o shepp-logan_sart3_2.5D.mha -g > geometry.xml --spacing 1 --dimension 256,10,256 > > SART reconstruction of whole object: > ------------------------------------ > rtksart -p . -r projections3.mha -o shepp-logan_sart3_3D.mha -g > geometry.xml --spacing 1 --dimension 256 > > > Reconstruction of more slices of the real data-set also gave a good > result. Only the slices near bottom and top are not reconstructed correctly. > > So it seems that the normalisation does not only take the values inside > the reconstructed volume into account, but also (wrong) values outside. > > What do you think? > > Cheers, > > Andreas > > > > On 11/05/2013 07:11 PM, Simon Rit wrote: >> Hi Andreas, >> Thanks for the report. We know that the implementation of SART is >> imperfect, we haven't been working a lot on it... It seems that you >> haven't reached convergence. One potential cause is that we use a >> heuristic for the sum of the weights (denominator in the SART formula) >> instead of the exact sum. The weight is constant and equals the >> diagonal of your volume (see line 165 in >> rtkSARTConeBeamReconstructionFilter.txx). Maybe this is completely >> wrong in your case. Could you try to increase lambda to see if that >> helps? >> To help us do some tests, I would advise you do reproduce your >> geometry with simulations of the Shepp Logan phantom (see >> wiki.openrtk.org). >> Simon >> >> On Tue, Nov 5, 2013 at 5:11 PM, Staude, Andreas wrote: >>> Hello RTk-users, >>> >>> I try to use the SART algorithm, but the results are worse than those >>> obtained with FDK (see attached images). >>> >>> The FDK result looks like expected, so I assume that I have the data >>> format and the reconstruction geometry set properly. 
For SART I used the >>> same parameters and already tried with different values of lambda and >>> niterations. >>> >>> Does anyone have an idea what went wrong? Is there some kind of >>> smoothing or regularisation applied in the SART implementation? >>> >>> Many thanks in advance! >>> >>> Cheers, >>> >>> Andreas >>> >>> >>> -- >>> >>> =============================================================== >>> Dr. Andreas Staude >>> Fachbereich 8.5 "Mikro-ZfP", Computertomographie >>> BAM Bundesanstalt für Materialforschung und -prüfung >>> Unter den Eichen 87 >>> D-12205 Berlin >>> Germany >>> >>> Tel.: ++49 30 8104 4140 >>> Fax: ++49 30 8104 1837 >>> =============================================================== >>> >>> >>> >>> >>> _______________________________________________ >>> Rtk-users mailing list >>> Rtk-users at openrtk.org >>> http://public.kitware.com/cgi-bin/mailman/listinfo/rtk-users >>> > > -- > > =============================================================== > Dr. Andreas Staude > Fachbereich 8.5 "Mikro-ZfP", Computertomographie > BAM Bundesanstalt für Materialforschung und -prüfung > Unter den Eichen 87 > D-12205 Berlin > Germany > > Tel.: ++49 30 8104 4140 > Fax: ++49 30 8104 1837 > ===============================================================
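[Editor's note] For reference in the weight discussion of this thread, the classical SART update (Andersen & Kak, 1984) can be written as below; the denominator over the rays of the current projection is the "sum of the weights" that the implementation discussed here replaces with a constant heuristic (the volume diagonal):

```latex
f_j^{(k+1)} \;=\; f_j^{(k)} \;+\; \lambda\,
\frac{\displaystyle\sum_{i \in S_\theta} a_{ij}\,
      \frac{p_i - \sum_{l} a_{il}\, f_l^{(k)}}{\sum_{l} a_{il}}}
     {\displaystyle\sum_{i \in S_\theta} a_{ij}}
```

Here $S_\theta$ is the set of rays in the projection at angle $\theta$, $a_{il}$ the intersection length of ray $i$ with voxel $l$, and $p_i$ the measured line integral; rays that graze voxels outside the reconstructed support still contribute to the sums, which is consistent with the truncation artefacts reported above for thin reconstruction volumes.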