Suggestions to speed up FFT of convolution
The following code is at the center of a physics model. Since it scales approximately as O(N log N), it is the computationally most expensive part of the simulation. It therefore makes sense to optimize this part of the code as much as possible, and I am looking for ideas for what else to try. Conceptually, the calculation involves an FFT, a multiplication, and an inverse FFT to obtain the result. The simulations are done for 3D volumes [nx,ny,nz] of vectors (so the total number of cells is N=3*nx*ny*nz), and the size of the volumes varies significantly between problems. The code now mostly runs on GPUs, but it can also run on CPUs (this is mostly done for initial testing), so optimizing performance on GPUs is the main goal. One observation I have made is that component-wise multiplication is faster than multiprod.
Any suggestions for what else to try to reduce the time spent on this calculation?
Below is a trimmed-down version of the main part of the code (updated to include examples for the variables):
rng(1,'twister');
M=gpuArray(rand([100,100,100,3],"double")); %create a random vector field
Periodicity=[0 0 0];
%M: [n1,n2,n3,3] 3D vector field
[nx,ny,nz,n]=size(M);
nxP=2*nx;
nyP=2*ny;
nzP=2*nz;
ftN=gpuArray(rand([nxP,nyP,nzP,n,n],"double")); %for simplicity, create ftN tensors with random entries
%Treat periodicity: Periodicity: [true/false, true/false, true/false]
rep=[1 1 1]+Periodicity;
Mhelp=repmat(M,[rep(1),rep(2),rep(3),1]);
ftM=zeros([nxP nyP nzP n],'like',M);
%Fourier transform
ftM(:,:,:,1)=fftn(Mhelp(:,:,:,1),[nxP nyP nzP]);
ftM(:,:,:,2)=fftn(Mhelp(:,:,:,2),[nxP nyP nzP]);
ftM(:,:,:,3)=fftn(Mhelp(:,:,:,3),[nxP nyP nzP]);
%ftN: [nxP,nyP,nzP,3,3] 3D volume of 3x3 tensors
%ftH=multiprod(ftN,ftM,[4 5],[4]);
%06/08/2022 version without multiprod - on GPUs this achieved on the order of a 25% reduction in calculation time compared to multiprod.
ftH(:,:,:,1)=ftN(:,:,:,1,1).*ftM(:,:,:,1)+ftN(:,:,:,1,2).*ftM(:,:,:,2)+ftN(:,:,:,1,3).*ftM(:,:,:,3);
ftH(:,:,:,2)=ftN(:,:,:,2,1).*ftM(:,:,:,1)+ftN(:,:,:,2,2).*ftM(:,:,:,2)+ftN(:,:,:,2,3).*ftM(:,:,:,3);
ftH(:,:,:,3)=ftN(:,:,:,3,1).*ftM(:,:,:,1)+ftN(:,:,:,3,2).*ftM(:,:,:,2)+ftN(:,:,:,3,3).*ftM(:,:,:,3);
%Inverse FFT
Hh=zeros([nxP nyP nzP n],'like',M);
Hh(:,:,:,3)=-real((ifftn(ftH(:,:,:,3))));
Hh(:,:,:,1)=-real((ifftn(ftH(:,:,:,1))));
Hh(:,:,:,2)=-real((ifftn(ftH(:,:,:,2))));
%Final result
H=Hh(nx:nxP-1,ny:nyP-1,nz:nzP-1,:);
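To compare optimization variants consistently, the block above can be wrapped in a function and timed with gputimeit, which makes sure the GPU work has actually finished before the timer stops (a sketch; demagField is just a hypothetical name for such a wrapper):
t=gputimeit(@() demagField(M,ftN,Periodicity)); %demagField: hypothetical wrapper around the code above
fprintf('One field evaluation took %.4f s\n',t);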
5 Comments
Tim
on 8 Jun 2022
Thanks for the quick feedback. A typical size would be nx=ny=100 and nz=20, but the volumes are often as large as the GPU permits. You can think of M as being unit vectors of random orientation. ftN contains 3x3 tensors with a trace of 1, but I don't think this affects the performance. Calculating ftN is done separately. If it helps I can try to come up with some mockup inputs - I just have to get back to my PC.
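In case it helps already, a minimal sketch of mock inputs matching this description (sizes taken from the comment above; random unit vectors obtained by normalizing Gaussian samples):
nx=100; ny=100; nz=20;
M=gpuArray(randn(nx,ny,nz,3)); %random directions
M=M./vecnorm(M,2,4); %normalize along the 4th (vector) dimension to get unit vectors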
Paul
on 8 Jun 2022
On my system running on the CPU and using the Profiler, I'm seeing that the computations for ftH take the most time, even after pre-allocation of ftH.
For reasons I don't understand, the fftn() computation for the first slice of ftM takes about 3x longer than for the second and third slices.
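One way to narrow this down is to time each slice's transform separately (a sketch for the CPU case, since timeit targets CPU code; a warm-up call first would also show whether the extra time for the first slice is just a one-time setup cost for that transform size):
MhelpCPU=gather(Mhelp); %CPU copy for timeit
fftn(MhelpCPU(:,:,:,1),[nxP nyP nzP]); %warm-up call
for k=1:3
    t=timeit(@() fftn(MhelpCPU(:,:,:,k),[nxP nyP nzP]));
    fprintf('fftn of slice %d: %.4f s\n',k,t);
end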
Tim
on 8 Jun 2022
I'm not too concerned about the performance on a CPU, as we only use this for testing. But you may want to try the multiprod version of the multiplication; I seem to recall it was comparable to the component-wise version on CPUs. I'm not sure why the timing would be that much different either.
Answers (2)
Matt J
on 8 Jun 2022
Edited: Matt J
on 8 Jun 2022
I see a doubling of the speed if you re-organize your 4D and 5D arrays as cell arrays:
Mhelp=num2cell(Mhelp,[1,2,3]);
ftN=squeeze(num2cell(ftN,[1,2,3]));
H=cell(1,3);
gputimeit(@()executeIt)
function executeIt
%Fourier transform
ftM{1}=fftn(Mhelp{1},[nxP nyP nzP]);
ftM{2}=fftn(Mhelp{2},[nxP nyP nzP]);
ftM{3}=fftn(Mhelp{3},[nxP nyP nzP]);
ftH=cell(1,3);
ftH{1}=ftN{1,1}.*ftM{1}+ftN{1,2}.*ftM{2}+ftN{1,3}.*ftM{3};
ftH{2}=ftN{2,1}.*ftM{1}+ftN{2,2}.*ftM{2}+ftN{2,3}.*ftM{3};
ftH{3}=ftN{3,1}.*ftM{1}+ftN{3,2}.*ftM{2}+ftN{3,3}.*ftM{3};
for i=1:3
tmp=-real( ifftn(ftH{i} ) );
H{i}=tmp(nx:nxP-1,ny:nyP-1,nz:nzP-1);
end
end
8 Comments
Tim
on 9 Jun 2022
I see, that makes a lot of sense. This was also why I originally thought that using multiprod would be faster, but that was not the case.
I implemented your suggestion and noticed that the line
ftN=squeeze(num2cell(ftN,[1,2,3]));
took the most time. However, this only needs to be done once as the tensor ftN stays the same, so I was able to move this out of the function above.
Even though for now I still do the num2cell conversion and the conversion back inside this function (needed because all other code currently expects H to be an [nx,ny,nz,3] array), I was able to get about a 20% improvement over the best solution I had so far.
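For reference, a sketch of that arrangement - the ftN conversion done once up front and the cell result converted back to the array shape the rest of the code expects (ftNcell and H4 are just illustrative names):
ftNcell=squeeze(num2cell(ftN,[1,2,3])); %one-time conversion, outside the timed function
%... run the cell-array version above, producing H as a 1x3 cell ...
H4=cat(4,H{:}); %back to an [nx,ny,nz,3] array for the rest of the code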
Thanks a lot Matt, I would have never thought of using cell arrays. This now has me thinking about other parts of the code and whether using cell arrays there may be helpful - I will definitely look into that.
Any other suggestions?
Tim
on 9 Jun 2022
I did a few more tests with larger volumes, and the improvement from using cell arrays seems to diminish. This is in part because I currently have to include the num2cell conversion in the function (not doing so would require a major restructuring of the rest of the code). But even when ignoring that penalty, the improvements are not as large as they are for smaller volumes.
I guess based on your earlier comment @Matt J, this is what one might expect, as the costs for the actual fftn will tend to dominate over the overhead for making a copy for larger array sizes.
It would still seem that I could reduce the computational time by about 20% by using cell arrays throughout the code. So I will put this on my list of things to tackle - thanks again @Matt J!
Any other suggestions?
Matt J
on 9 Jun 2022
You're welcome, but please Accept-click the answer if you consider the question resolved.
My main suggestion would be that you bite the bullet and restructure the code to avoid num2cell and create things in cell array form from the get-go. You don't seem to be using the 5-dimensional nature of the data at all.
Matt J
on 9 Jun 2022
Edited: Matt J
on 9 Jun 2022
If you have R2022a,
Mhelp=repmat(M,[rep(1),rep(2),rep(3),1]);
%Fourier transform
ftM=fftn(-Mhelp,[nxP nyP nzP,1]);
%Tensorprod requires R2022a
ftH=tensorprod(ftN,ftM,[4,5],4);
%Inverse FFT
Hh=real(ifftn(ftH,[nxP nyP nzP,1] ));
H=Hh(nx:nxP-1,ny:nyP-1,nz:nzP-1,:);
27 Comments
Tim
on 9 Jun 2022
Edited: Tim
on 9 Jun 2022
I may be missing something here - is it possible that this newly introduced function does not support gpuArrays?! I get a
Error using tensorprod
Invalid data types. Input arrays must be double or single.
error that goes away if I use gather() on the arguments (at which point tensorprod complains that 'Dimension arguments must have the same length.' but that's something I could figure out).
Btw. the line:
ftM=fftn(-Mhelp,[nxP nyP nzP,1]);
creates an ftM of size [nx,ny,nz,1] instead of the desired size [nx,ny,nz,3].
But I have to admit that it would be nice to be able to replace the three lines for the fftn with one (the same applies for the ifftn).
Matt J
on 9 Jun 2022
Edited: Matt J
on 9 Jun 2022
However,
ftH(:,:,:,1)=ftN(:,:,:,1,1).*ftM(:,:,:,1)+ftN(:,:,:,1,2).*ftM(:,:,:,2)+ftN(:,:,:,1,3).*ftM(:,:,:,3);
ftH(:,:,:,2)=ftN(:,:,:,2,1).*ftM(:,:,:,1)+ftN(:,:,:,2,2).*ftM(:,:,:,2)+ftN(:,:,:,2,3).*ftM(:,:,:,3);
ftH(:,:,:,3)=ftN(:,:,:,3,1).*ftM(:,:,:,1)+ftN(:,:,:,3,2).*ftM(:,:,:,2)+ftN(:,:,:,3,3).*ftM(:,:,:,3);
is equivalent to,
ftH=sum( reshape(ftN,[],3,3).*reshape(ftM,[],1,3),2);
ftH=reshape(ftH,size(ftM));
which requires no sub-array extraction.
Tim
on 9 Jun 2022
While what you have there does not seem to be equivalent, I think I see where you are going with this.
Right now:
ftH(:,:,:,1)=ftN(:,:,:,1,1).*ftM(:,:,:,1)+ftN(:,:,:,1,2).*ftM(:,:,:,2)+ftN(:,:,:,1,3).*ftM(:,:,:,3);
ftH(:,:,:,2)=ftN(:,:,:,2,1).*ftM(:,:,:,1)+ftN(:,:,:,2,2).*ftM(:,:,:,2)+ftN(:,:,:,2,3).*ftM(:,:,:,3);
ftH(:,:,:,3)=ftN(:,:,:,3,1).*ftM(:,:,:,1)+ftN(:,:,:,3,2).*ftM(:,:,:,2)+ftN(:,:,:,3,3).*ftM(:,:,:,3);
ftH2=sum( reshape(ftN,[],3,3).*reshape(ftM,[],1,3),2);
ftH2=reshape(ftH2,size(ftM));
gives
isequal(ftH,ftH2)
ans = logical 0
max(max(max(max(ftH-ftH2))))
ans = 3.8772e+10 - 3.1259e+10i
Matt J
on 9 Jun 2022
Edited: Matt J
on 9 Jun 2022
Sorry. Small mistake:
N=200;
c=@(z) complex(z,z);
ftN=c(rand(N,N,N,3,3));
ftM=c(rand(N,N,N,3));
tic;
ftH=nan(size(ftM));
ftH(:,:,:,1)=ftN(:,:,:,1,1).*ftM(:,:,:,1)+ftN(:,:,:,1,2).*ftM(:,:,:,2)+ftN(:,:,:,1,3).*ftM(:,:,:,3);
ftH(:,:,:,2)=ftN(:,:,:,2,1).*ftM(:,:,:,1)+ftN(:,:,:,2,2).*ftM(:,:,:,2)+ftN(:,:,:,2,3).*ftM(:,:,:,3);
ftH(:,:,:,3)=ftN(:,:,:,3,1).*ftM(:,:,:,1)+ftN(:,:,:,3,2).*ftM(:,:,:,2)+ftN(:,:,:,3,3).*ftM(:,:,:,3);
toc
Elapsed time is 1.682320 seconds.
tic;
ftH2=sum( reshape(ftN,[],3,3).*reshape(ftM,[],1,3),3);
ftH2=reshape(ftH2,size(ftM));
toc
Elapsed time is 0.860240 seconds.
Discrepancy=norm(ftH2(:)-ftH(:),inf)
Discrepancy = 0
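Another formulation of the same per-voxel 3x3 matrix-vector product that may be worth benchmarking is pagemtimes (introduced in R2020b; check that your release supports it for gpuArray and complex inputs). Whether the extra permutes offset the gain will depend on sizes and hardware, so treat this as a sketch to time rather than a recommendation:
A=permute(reshape(ftN,[],3,3),[2 3 1]); %3-by-3-by-nVoxels, A(:,:,p) is the tensor at voxel p
b=permute(reshape(ftM,[],1,3),[3 2 1]); %3-by-1-by-nVoxels, b(:,1,p) is the vector at voxel p
ftH3=pagemtimes(A,b); %3-by-1-by-nVoxels matrix-vector products
ftH3=reshape(permute(ftH3,[3 1 2]),size(ftM)); %back to the same layout as ftM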
Matt J
on 9 Jun 2022
Additionally, this seems to accelerate the FFT part:
N=200;
[nxP,nyP,nzP]=deal(N);
Mhelp=rand(N,N,N,3);
tic;
ftM=zeros(nxP,nyP,nzP,3);
ftM(:,:,:,1)=fftn(Mhelp(:,:,:,1),[nxP nyP nzP]);
ftM(:,:,:,2)=fftn(Mhelp(:,:,:,2),[nxP nyP nzP]);
ftM(:,:,:,3)=fftn(Mhelp(:,:,:,3),[nxP nyP nzP]);
toc
Elapsed time is 0.701800 seconds.
tic
ftM2=fft3(Mhelp,[nxP nyP nzP]);
toc
Elapsed time is 0.232527 seconds.
Discrepancy=norm(ftM2(:)-ftM(:),inf)
Discrepancy = 1.5253e-12
function Z=fft3(Z,dims)
for i=1:3
Z=fft(Z,dims(i),i);
end
end
Tim
on 9 Jun 2022
Great, the first one leads to about 10% speedup for large volumes.
The second change gives only a minor improvement for large volumes. I'll do some more testing to see whether this difference is significant and how it changes with volume size - if it is speeding things up a little, it may still make sense to do the same for the inverse FFT.
I am trying to understand why this approach would be expected to be faster than using fftn - any ideas?
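For the inverse transform mentioned above, a counterpart to the fft3 helper might look like this (a sketch along the same lines; it should match applying ifftn to each component separately):
function Z=ifft3(Z)
%apply the 1-D inverse FFT along each of the first three dimensions
for i=1:3
    Z=ifft(Z,[],i);
end
end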
Matt J
on 9 Jun 2022
Edited: Matt J
on 9 Jun 2022
Anyway, it seems that the impact on the GPU is far less predictable than on the CPU. On the GTX 1080Ti, for N=200, I get negligible difference in speed, although for N=300 and single precision, it seems quite significant:
function test
N=300;
c=@(z) complex(z,z);
ftN=c(gpuArray.rand(N,N,N,3,3,'single'));
ftM=c(gpuArray.rand(N,N,N,3,'single'));
[ftH,ftH2]=deal(gpuArray(1));
gputimeit(@()version1) %0.1480 sec
gputimeit(@()version2) %0.0532 sec.
Discrepancy=norm(ftH2(:)-ftH(:),inf)
function version1
ftH=gpuArray.ones(size(ftM));
ftH(:,:,:,1)=ftN(:,:,:,1,1).*ftM(:,:,:,1)+ftN(:,:,:,1,2).*ftM(:,:,:,2)+ftN(:,:,:,1,3).*ftM(:,:,:,3);
ftH(:,:,:,2)=ftN(:,:,:,2,1).*ftM(:,:,:,1)+ftN(:,:,:,2,2).*ftM(:,:,:,2)+ftN(:,:,:,2,3).*ftM(:,:,:,3);
ftH(:,:,:,3)=ftN(:,:,:,3,1).*ftM(:,:,:,1)+ftN(:,:,:,3,2).*ftM(:,:,:,2)+ftN(:,:,:,3,3).*ftM(:,:,:,3);
end
function version2
ftH2=sum( reshape(ftN,[],3,3).*reshape(ftM,[],1,3),3);
ftH2=reshape(ftH2,size(ftM));
end
end
Tim
on 9 Jun 2022
Edited: Tim
on 9 Jun 2022
Yes, there is certainly more code in the actual function, but those are the relevant lines, and your suggestions are really helpful. Are you running this on a GPU or a CPU? I am running this on a GPU and I wish I were seeing the kind of speedup that you are seeing - although I have to agree with Jan, it's definitely not 100% ;-)
I am using the profiler to get a more detailed look at which lines contribute the most - this also adds some overhead.
Why should using 1D FFTs instead of fftn speed up the computation? They should be equivalent, so is it just the overhead that fftn adds?
Tim
on 9 Jun 2022
Yes, that's more like what I am seeing - there is still a small improvement on the GPU but not nearly as much as what you saw on the CPU. As for the N=300 - that is pushing it in terms of size, but it is good to know that the improvement seems to increase with size and not the opposite.
I am still a little baffled as to why fftn would not be on par with or better than doing what your fft3 does.
Matt J
on 9 Jun 2022
Edited: Matt J
on 9 Jun 2022
I am still a little baffled as to why fftn would not be on par with or better than doing what your fft3 does.
Well, probably because of what we talked about earlier - fft3 does no sub-array extractions. In any case, the speed-up of fft3 doesn't seem to carry over to the GPU. It's even a little slower on mine.
Tim
on 9 Jun 2022
@Matt J just a comment on your code above - in version1 you initialize the gpuArray, whereas in version2 you do not. I think it was pointed out here in the past that initializing arrays does not provide an advantage on a GPU (unfortunately, I cannot find the post where this was mentioned) - by not initializing the gpuArrays the full code runs slightly faster, but not by much.
Matt J
on 9 Jun 2022
Edited: Matt J
on 9 Jun 2022
Are you talking about the line
ftH=gpuArray.ones(size(ftM));
The reason it's there is only because it's required for version1 to run, and because it reflects what you're doing in your current code. version2 doesn't require a prior initialization of the array, so it is deliberately omitted.
Matt J
on 9 Jun 2022
Edited: Matt J
on 10 Jun 2022
On the GPU, I think you mean. I still see substantial speed-up on the CPU with ifft3.
I have basically come to the conclusion that your current code is pretty well GPU-optimized. More importantly, though, I find that with the volume sizes that you posted (N=200), the code runs in less than 0.02 seconds on the GTX 1080Ti. It's hard to imagine that you can (or need to?) improve on that, at least not without dedicated CUDA MEX files.
Tim
on 10 Jun 2022
Well, if this function were only called once, I would not worry about the 0.2 s it takes to execute. But this function (and a bunch of others that fortunately scale linearly with N) gets called quite a few times in the simulation - hundreds of thousands of times. So if I can reduce the time it takes to execute this function, it will have a significant impact on the total time it takes to complete a simulation (or on how big a problem one can solve in a reasonable amount of time).
You mention CUDA MEX files - while I have found some information online, do you know of a good source that could get me started on this? How would I go about this?
Matt J
on 10 Jun 2022
If you've seen this already,
then I probably wouldn't have too much more to add. I am really doubtful, even if you are a hardcore CUDA programmer, that you can exceed the performance you have. The operations you're doing (FFTs and matrix arithmetic) are really well-optimized in MATLAB. It's not going to be easy to outperform what's already there.
My best recommendation is that instead of relying on a single GPU, you look into deploying to multiple GPU cards in parallel, using parfor. It may also help to use single precision instead of double, though I can't know if your application can tolerate that.
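To make the multi-GPU suggestion concrete, here is a rough sketch, assuming the workload can be split into several independent problems (e.g. separate simulations or parameter sweeps); computeField and problems are hypothetical stand-ins for the field computation above and its inputs, and a single time-stepping run where each step depends on the previous one cannot be split up this way. Workers in a local pool should each pick up their own GPU by default when enough GPUs are available (worth verifying with gpuDevice on the workers):
parpool('local',gpuDeviceCount); %one worker per GPU
results=cell(1,numel(problems)); %problems: hypothetical cell array of independent inputs
parfor k=1:numel(problems)
    Mk=gpuArray(problems{k}); %move this problem's data to the worker's GPU
    results{k}=gather(computeField(Mk)); %computeField: hypothetical wrapper for the code above
end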
Tim
on 10 Jun 2022
Thanks Matt, yes I had seen that one. I also came across this one that uses the GPU coder and I am slowly making progress on that front:
Even if it does not work out it is always fun to learn something new. I found this article encouraging in terms of performance improvements:
Yes, 'single' can be an option in certain cases and this is implemented in the main code, but I also want to be able to use 'double', which is why I use something like
ftM=zeros([nxP nyP nzP n],'like',M);
to ensure the precision remains the same.
Thanks again for all your help, this has been very insightful.
Tim
on 10 Jun 2022
Yes, I agree gputimeit would have been the way to go.
After installing a compiler, toolkits, etc., I was able to create a CUDA MEX file that replaces this piece of code using the GPU Coder app. However, it was actually much slower (by a factor of 4) than the MATLAB code. To get the MEX code to perform even that well, I had to give up variable-size input arrays and instead generate the file for a specific size. If I use dynamic memory allocation for variable-size arrays, the MEX code is even slower. Looking at the generated code, I am sure that with a little background one could significantly improve on it - but that's not for me.
But I did learn something new.
Tim
on 10 Jun 2022
One more observation: in the MATLAB function for the actual problem, the result (ftH) before the inverse FFT should be conjugate symmetric.
For conjugate-symmetric arrays, ifftn provides the option 'symmetric' to "compute the inverse Fourier transform faster" and ensure that the result is real. However, even for very large arrays, the gains I saw on GPUs were minimal. Maybe the @MathWorks Support Team has some insights on this? Unfortunately, the documentation does not have much information on the underlying algorithm that could provide clues about what to expect.
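For reference, using that option would replace the -real(ifftn(...)) calls above with something like this (sketch):
Hh(:,:,:,1)=-ifftn(ftH(:,:,:,1),'symmetric');
Hh(:,:,:,2)=-ifftn(ftH(:,:,:,2),'symmetric');
Hh(:,:,:,3)=-ifftn(ftH(:,:,:,3),'symmetric');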