Why does performance of functions saturate with number of cores using parfeval but not with parfor?
I am developing an application that MUST take advantage of parallelization, and ideally offer real-time updates after each iteration, which makes parfeval preferable. I believe the algorithm that I have developed is highly parallelizable (see attached for the performance of 'WT_Ex_2_b' as a function of the number of cores used with parfeval). From 1 to 8 cores, the speedup factor agrees with the theoretical expectation (Amdahl's Law with p=0.95); however, the performance of my application saturates at 8 cores. This led me to create a dummy function (see attached script) to compare the performance of parfor and parfeval as a function of the number of cores. I discovered that the parfor version behaves quite similarly to the theoretical expectation (Amdahl's Law, also with p=0.95), whereas the parfeval version continues to show strange saturation behavior, even for the dummy function. Notice how the speedup factor improves with the number of cores up to 12 cores, then suddenly no further improvement is observed. I have attached the script in case you want to reproduce this behavior on your end.
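For reference, the theoretical curve mentioned above is Amdahl's Law. The block below is only a worked restatement of that formula for the p=0.95 quoted in the question; the symbols S, n and p are the usual textbook ones and are not taken from the attached script.

```latex
% Amdahl's Law: speedup S on n cores when a fraction p of the work is parallelizable
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}

% With p = 0.95:
S(8)  = \frac{1}{0.05 + 0.95/8}  \approx 5.93
S(12) = \frac{1}{0.05 + 0.95/12} \approx 7.74
\lim_{n \to \infty} S(n) = \frac{1}{1 - p} = 20
```

Under this model the speedup should keep rising, slowly, beyond 8 or 12 cores, which is why an abrupt plateau at those counts looks anomalous.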
Is there a fundamental limitation on the number of cores that parfeval can leverage? Or is there an obvious mistake in the way I am using parfeval? Why does the performance of the dummy algorithm suddenly saturate at 12 cores? Do you have any recommendations on how to make parfeval perform as well as parfor?
I would like to emphasize that I have already developed my application to use parfeval, so converting to parfor would be time-consuming and prevent me from utilizing the update-after-iteration feature of parfeval.
Thank you for your help on this critical matter.
Edric Ellis on 1 Jul 2020
The main difference between parfor and parfeval is that in the parfeval case, you are responsible for scheduling the work on the workers. parfor has an advantage over parfeval in that it knows how many loop iterations there are, so it schedules a fixed number of chunks of work per worker (see the documentation for parforOptions; the chunks are referred to as "sub-ranges"). So, in your case, parfeval will incur more overhead, since each parfeval request is sent to a worker on its own, whereas parfor groups iterations together. Grouping is generally more efficient when each request takes about as long as the overhead of making a single remote request.
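To illustrate the grouping idea, here is a minimal sketch of sending one parfeval request per chunk of tasks instead of one per task. The names batchedParfevalSketch, runChunk and dummyTask are hypothetical, the task body is a placeholder, and the choice of about four chunks per worker is an assumed starting point rather than a tuned value.

```matlab
function results = batchedParfevalSketch()
% Sketch only: group many short tasks into a few parfeval requests.
    pool = gcp();                                  % start or reuse a pool
    numTasks = 1000;                               % assumed number of short tasks
    numChunks = min(numTasks, 4 * pool.NumWorkers);% assumed heuristic, tune as needed
    edges = round(linspace(0, numTasks, numChunks + 1));

    % One request per chunk rather than one per task.
    futures(1:numChunks) = parallel.FevalFuture;
    for c = 1:numChunks
        idx = (edges(c) + 1):edges(c + 1);
        futures(c) = parfeval(pool, @runChunk, 1, idx);
    end

    % Collect chunks as they finish; the client can update after each one.
    results = zeros(1, numTasks);
    for k = 1:numChunks
        [c, chunkResult] = fetchNext(futures);
        idx = (edges(c) + 1):edges(c + 1);
        results(idx) = chunkResult;
        fprintf('Chunk %d of %d finished\n', k, numChunks);
    end
end

function out = runChunk(idx)
% Runs on a worker: evaluate every task in this chunk serially.
    out = zeros(1, numel(idx));
    for k = 1:numel(idx)
        out(k) = dummyTask(idx(k));
    end
end

function y = dummyTask(i)
% Placeholder for the real per-iteration work (hypothetical).
    y = i^2;
end
```

With this layout the client is updated per chunk rather than per task, so the chunk size trades update granularity against request overhead.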
So, parfeval doesn't have a fundamental limitation in this regard, but you might need to amalgamate your requests if they are too short to match parfor performance. Another option might be to use parfor together with DataQueue which would let you perform updates at the client after each parfor iteration completes.
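The parfor plus DataQueue option could look roughly like the sketch below. It assumes the update is a simple progress message; parforWithUpdatesSketch, reportProgress and dummyTask are hypothetical names standing in for the real application code.

```matlab
function results = parforWithUpdatesSketch()
% Sketch only: parfor does the scheduling, a DataQueue reports progress.
    numTasks = 200;                       % assumed number of iterations
    results = zeros(1, numTasks);

    q = parallel.pool.DataQueue;
    afterEach(q, @reportProgress);        % callback runs on the client per send()

    parfor i = 1:numTasks
        results(i) = dummyTask(i);        % sliced output variable
        send(q, i);                       % notify the client that iteration i is done
    end
end

function reportProgress(i)
% Runs on the client each time a worker sends data.
    fprintf('Iteration %d finished\n', i);
end

function y = dummyTask(i)
% Placeholder for the real per-iteration work (hypothetical).
    y = i^2;
end
```

Iterations complete in no particular order, so the callback should not assume that the indices arrive sequentially.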