Effectiveness of implicit and explicit parallelization of fft2

Hi,
I have many images on which I want to do some FFTs, and I want to accelerate this image processing with multithreading. I am wondering why explicit parallelization with parfor, as shown in the example below, is not much more efficient than the implicit multithreading of the MATLAB function fft2:
clear
%% parameters & variable declaration
N_slice = 40;
N_img = 80;
a = rand(1024, 1024, N_slice);
for ii = 1:N_img
    img{ii} = rand(1024, 1024);
    out{ii} = 0;
end
%% multithreading inside the loop
maxNumCompThreads(4);
% 39 s with 4 threads
% 88 s with 1 thread
% parallelization efficiency of 88/(39*4) = 56%
tic
for ii = 1:N_img
    out{ii} = sum(ifft2(a.*fft2(img{ii})), 'all');
end
toc
%% parfor version
% 33 s with parpool(4)
% parallelization efficiency of 88/(32*4) = 68%
maxNumCompThreads(1); % avoid the relative inefficiency of fft2 parallelization
pool = parpool('Processes', 4, 'SpmdEnabled', false);
b = parallel.pool.Constant(a);
tic
parfor ii = 1:N_img
    % out{ii} = sum(ifft2(b.Value.*fft2(img{ii})), 'all');
    out{ii} = fun(b.Value, img{ii});
end
toc

function out = fun(x, y)
% Just in case: I read somewhere that parfor might be faster when calling
% a function rather than running script code - it changed nothing here.
out = sum(ifft2(x.*fft2(y)), 'all');
end
fft2 supports multithreading by default, and on my PC with 4 cores I get a 56% multithreading efficiency (defined as T1/(T2*N), where T1 is the time to complete a task with one thread and T2 the time with N threads).
I expected a larger multithreading efficiency by creating a pool of 4 workers and making them work on my images in parallel. Because there is so little data transfer and so much computation time, I expected the multithreading efficiency to be close to one. But I only get around 68%, as you can see in my example. Why is that?
Matthieu

Accepted Answer

Damian Pietrus on 19 Mar 2024
Hello Matt,
As mentioned already, there could be a few reasons for this. When you start up a pool of 4 workers in the "Processes" profile, you are starting 4 separate MATLAB workers, each with their own process and memory space. There does end up being some overhead with parfor, as MATLAB not only needs to do some data transfer to each worker, but also needs to break the problem up in the first place. FFT is already well optimized and can avoid some of this overhead.
One thing I'd recommend as a comparison point is to use the "Threads" cluster profile instead of "Processes". Here, parallel workers run as threads within the main MATLAB process and share memory, so I'm curious to see if/how that impacts your overall results.
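For reference, a minimal sketch of what that change could look like (assuming the same a, img, N_img, and out variables from your script; since thread workers share the client's memory, the parallel.pool.Constant wrapper can simply be dropped):
% Minimal sketch: same loop on a thread-based pool (shared memory)
delete(gcp('nocreate'));     % shut down any existing process-based pool
pool = parpool('Threads');   % workers are threads inside the MATLAB process
tic
parfor ii = 1:N_img
    % 'a' is read directly from shared memory; no parallel.pool.Constant needed
    out{ii} = sum(ifft2(a.*fft2(img{ii})), 'all');
end
toc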
  2 Comments
Matt on 21 Mar 2024
Hi,
Switching from processes to threads, I went from 68% to 88% efficiency! And that is with a CPU that runs at a slightly lower frequency, so the real gain is probably even higher.
Very good advice. So why is this so much better?
Matt
Damian Pietrus on 21 Mar 2024
Based on the timing of your calculations, I think the difference in efficiency comes down to the overhead required to share data between workers. Threaded workers can directly access the same memory space, while process workers require a bit more overhead to get the right data copied to the right process.
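If you want to quantify the data-transfer part of that overhead on the process-based pool, ticBytes/tocBytes report how much data is shipped to and from each worker. A minimal sketch, assuming the pool and the parallel.pool.Constant b from the original script are still in place:
pool = gcp;        % the current process-based pool
ticBytes(pool);
parfor ii = 1:N_img
    out{ii} = sum(ifft2(b.Value.*fft2(img{ii})), 'all');
end
tocBytes(pool)     % reports bytes sent to / received from each worker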


More Answers (1)

Pratyush on 18 Mar 2024
Hi Matt,
The observed discrepancy in multithreading efficiency when using MATLAB's "parfor" for FFT operations on images, achieving around 68% efficiency instead of the expected near-perfect efficiency, can be attributed to several factors:
  • Setting up parallel pools, transferring data, and scheduling tasks introduce overhead that reduces overall efficiency.
  • MATLAB's "fft2" is already optimized for multithreading, making explicit parallelization less beneficial.
  • Due to Amdahl's Law, as more workers are added, the speedup gained from parallelization tends to plateau because of the non-parallelizable portions of the task and increasing overhead (see the rough sketch below this list).
  • Competition for CPU time and memory bandwidth among workers and the main MATLAB process can hinder performance.
  • While "parfor" is efficient for many tasks, its overhead can impact tasks that are already optimized, like FFT operations.
The gist is that achieving near-linear scaling in parallel computing is challenging, especially for operations that are already optimized. The 68% efficiency observed is relatively good, considering these factors. Improving parallel efficiency might involve minimizing data transfer, experimenting with the number of workers, and ensuring optimal hardware and MATLAB configurations.
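As a rough illustration of the Amdahl's Law point, here is a back-of-the-envelope sketch (the parallel fraction p is an assumed, illustrative value, not something measured from the code above):
p = 0.95;                         % assumed fraction of runtime that parallelizes
N = 4;                            % number of workers
speedup    = 1/((1 - p) + p/N);   % Amdahl's Law bound on speedup
efficiency = speedup/N            % about 0.87: even 5% serial work caps efficiency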
  1 Comment
Matt on 18 Mar 2024
Hi,
Indeed there are many factors that can lower parallelization efficiency, but here I fail to see which one is predominant:
  • the creation of the pool is not timed
  • data transfer should not be a bottleneck, since the data is a light 1024x1024 array per worker compared to an execution time of around 1 s per task on one thread.
  • I asked for 4 workers on a 4-core CPU, and I have set the number of threads per worker to one to avoid competition for resources (see the check sketched after this comment).
  • 100% of the code is parallelized.
I don't see where I lose 32% of efficiency here?
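(Side note on that last point, as a hedged aside: calling maxNumCompThreads(1) in the client session does not carry over to the process workers; they default to one computational thread through the cluster profile's NumThreads setting anyway. If in doubt, this can be verified on the running pool with something like:
pool = gcp;                                       % the running pool
f = parfevalOnAll(pool, @maxNumCompThreads, 1);   % query every worker
workerThreads = fetchOutputs(f)                   % expect a column of ones
)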

