What functions need GPU support

Recently there have been a few posts about functions that do not fully support gpuArray and that could benefit from more intensive GPU support from TMW.
I am opening this thread so that users can submit their wishes and explain a typical use case and why a given GPU feature is important to them.
I put here a list by category, but if you have a specific function that does not fall into any category, you are welcome to add it.
Basic array arithmetic and linear algebra seem pretty well covered (?), but I do not know whether anything is missing from this huge library.
The following areas, however, still need to be investigated:
  • Optimization functions, especially the gradient method where the gradient calculation can be performed in parallel on GPU
  • ODE functions where the Jacobian can be estimated in parallel
  • Interpolation functions where the preparation step and inquiring step can be both parallelized
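As an illustration of the first wish, here is a minimal sketch (the function name `fdGradientGPU` is hypothetical) of how a finite-difference gradient could be batched on the GPU, assuming the objective `f` is vectorized so that each column of a matrix input is treated as a separate point:

```matlab
% Hypothetical sketch: estimate the gradient of f at x by evaluating all
% n forward perturbations at once on the GPU. Assumes f is vectorized so
% that f(X) returns a 1-by-size(X,2) row of objective values, and that
% x is an n-by-1 gpuArray.
function g = fdGradientGPU(f, x, h)
    n  = numel(x);
    X  = x + h*gpuArray.eye(n);   % n perturbed points, one per column
    fx = f(x);                    % base value
    g  = (f(X) - fx).' / h;       % all n difference quotients in parallel
end
```

Whether this beats a sequential CPU loop depends entirely on how expensive one evaluation of `f` is; it is a sketch of the parallelization pattern, not a benchmark.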

8 Comments

Hi Bruno, thanks for opening this topic. Could you give a bit more detail about "Interpolation functions"? Basic interpolation functions such as interp1,2,3,n have been supported on the GPU for a long time. It would be good to know which other interpolation functions you need. (As you say above, knowing the use-case would also help us prioritize.)
If this is mentioned in other posts, could you link them here for reference?
@Ben Tordoff Hello, I gave two links in the answer below (so we can discuss interpolation in a separate answer).
Matt J
Matt J on 29 Aug 2023
Edited: Matt J on 29 Aug 2023
Optimization functions, especially the gradient method where the gradient calculation can be performed in parallel on GPU
Sadly, TMW have heard this proposal and disagree that it would be beneficial. Transcript below:
GPU array support for Optimization Toolbox
Inbox
Matt J
Jun 12, 2023, 10:58AM
Hi Mike,
In addition to gpuArray support for griddedInterpolant, I was also wondering about enabling gpuArray types for the Optimization Toolbox solvers. The Optimization Toolbox implements a number of iterative function minimization methods, where the function to be minimized is specified by a user-defined function handle. Currently, the toolbox solvers work only with CPU-double data and the user-supplied function is required to return its results in CPU-double form. This introduces a lot of bottlenecks. It would be good if these solvers could work entirely on the GPU. This shouldn't be too hard to enable, since all of the operations that the toolbox solvers perform on doubles are probably already implemented for gpuArrays as well.
Mike Croucher
Jun 14, 2023, 11:30AM
to me
Hi Matt
I can confirm that there is definitely gpuArray support for griddedInterpolant in R2023b. The pre-release will be available soon so you can try it out.
Regarding gpuArray support for optimisation solvers. There are currently no plans to do so.
One of our developers recently worked on a Proof of Concept for a customer solving a large set of nonlinear equations using GPU arrays and the speed up was marginal, at best.
Elsewhere, there is no evidence of GPU use by the usual competitors (Gurobi, CPLEX etc) and there seems to be similar conclusions in the open source world.
With that said, if you know of any evidence showing GPU acceleration of optimization algorithms that are relevant to your work, I’d be interested in knowing it.
With respect to your own optimization problems. Given that GPU acceleration seems to be off the table. What else might we try? Do you have something concrete I could look at?
Best Wishes,
Mike
Mike Croucher
Customer Success Engineer, MathWorks
The real benefit perhaps depends on how well the user-supplied function exploits the GPU.
IMO there is no clear-cut divide between Matt's and Mike's positions.
@Matt J However, if you can supply a use case where the GPU is desired, it would be great for a concrete discussion.
The flavor of the issue is illustrated by the code below, which I also shared with TMW. It runs a few iterations of fminunc() to solve a basic set of equations A*x=b on both the CPU and the GPU. On a GTX 1080 Ti, I see nearly a 3x speed-up. However, as you can see in the user-provided objective() code, I am forced to use gather() to send the results back to the CPU every time the objective is invoked. As you make N smaller (e.g., N=500), this becomes a bottleneck and the GPU is outperformed by the CPU by a factor of 3. If you additionally set SpecifyObjectiveGradient=false, it is outperformed by a factor of 10.
N = 8e3;
opts = optimoptions('fminunc','Display','none','MaxIterations',4,...
    'SpecifyObjectiveGradient',true,'Algorithm','quasi-newton',...
    'HessUpdate','steepdesc');

% CPU
A = rand(N);
b = A*rand(N,1);
tic;
x = fminunc(@(x) objective(x,A,b), ones(N,1), opts);
toc % Elapsed time is 3.915851 seconds.
disp ' '

% GPU
A = gpuArray(A);
b = gpuArray(b);
tic
x = fminunc(@(x) objective(x,A,b), ones(N,1), opts);
toc % Elapsed time is 1.564494 seconds.

function [fval,grad] = objective(x,A,b)
    err  = A*x - b;
    fval = norm(err).^2/2;
    fval = gather(fval);       % forced transfer back to the CPU
    if nargout > 1
        grad = A.'*err;
        grad = gather(grad);   % forced transfer back to the CPU
    end
end
Same code, but only a factor of 1.7 on my side:
CPU: Elapsed time is 0.972671 seconds.
GPU (with gather): Elapsed time is 0.569152 seconds.
And for N=500 with/without SpecifyObjectiveGradient?
I ran it several times and selected the best. The CPU is much faster:
  • CPU : Elapsed time is 0.109399 seconds.
  • GPU: Elapsed time is 2.010831 seconds.


Answers (3)

Bruno Luong
Bruno Luong on 29 Aug 2023

4 Comments

Matt J
Matt J on 29 Aug 2023
Edited: Matt J on 29 Aug 2023
Note from my correspondence above, where Mike Croucher from TMW says:
I can confirm that there is definitely gpuArray support for griddedInterpolant in R2023b. The pre-release will be available soon so you can try it out.
I haven't tried the pre-release, though, to see if all functionality has been implemented.
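For reference, a minimal sketch of what gpuArray use with griddedInterpolant could look like, assuming the R2023b support works as described above (untested against the pre-release):

```matlab
% Assumes R2023b gpuArray support for griddedInterpolant, as stated above.
x  = gpuArray.linspace(0, 2*pi, 1000);     % sample grid built on the GPU
v  = sin(x);                               % sample values on the GPU
F  = griddedInterpolant(x, v, 'linear');   % interpolant from gpuArray data
xq = gpuArray.rand(1, 1e6) * 2*pi;         % query points stay on the GPU
vq = F(xq);                                % evaluation on the GPU
```

The point of such support would be that xq and vq never leave the device, avoiding the gather() bottleneck discussed in the optimization example above.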
Michal
Michal on 30 Aug 2023
Edited: Michal on 30 Aug 2023
@Matt J I just tested 1D griddedInterpolant (Method "spline" and "linear") in R2023b with gpuArrays and, at least on my midrange GPU (NVIDIA Quadro A1000), the overall speed-up is not significant, even when the GPU -> CPU data transfer is not taken into account. With some high-end GPU the performance could be better, but my current tests show a maximum speed-up of ~5-15% (and, for a small number of query points, a slowdown of ~25-75%).
Finally, so far nothing impressive ... :(
Matt J
Matt J on 30 Aug 2023
Edited: Matt J on 30 Aug 2023
@Michal You should probably present your tests.
Michal
Michal on 30 Aug 2023
Edited: Michal on 30 Aug 2023
@Matt J I should delete my comment, because I just realized that none of my timing results are reliably reproducible. The measured timings depend strongly on the installed version of the NVIDIA driver (driver 535.x vs 525.x), at least on Ubuntu Linux 22.04. Please ignore my comment!


Matt J
Matt J on 26 Aug 2023
Edited: Matt J on 26 Aug 2023
Sparse array indexing would be one example, e.g.
>> A=gpuArray.speye(5)
A =
(1,1) 1
(2,2) 1
(3,3) 1
(4,4) 1
(5,5) 1
>> A(1,:)
Error using indexing
Sparse gpuArrays do not support indexing.
The lack of this might be the reason why some of what you've listed in the OP is not supported. I'm sure a number of the Optimization Toolbox functions need to support indexing.

2 Comments

It looks like an important basic piece of the library is missing.
Bruno Luong
Bruno Luong on 29 Aug 2023
Edited: Bruno Luong on 29 Aug 2023
One of the reasons why sparse indexing code on the GPU and CPU is not directly "transposable" is that storage on the CPU is compressed sparse column (CSC), whereas the GPU uses compressed sparse row (CSR).
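To illustrate the CSC side on the CPU: column access maps directly onto the storage layout, which is one reason CPU sparse indexing code would not port verbatim to a CSR layout. A small sketch:

```matlab
% CPU sparse matrices are stored column-wise (CSC), so extracting a column
% is a contiguous read of the stored entries, while extracting a row must
% scan the index list of every column.
A = sparse([1 3 2], [1 1 3], [10 20 30], 3, 3);
col = A(:,1);          % contiguous in CSC storage: cheap
row = A(2,:);          % must search each column's index list: costly
[i, j, v] = find(A);   % entries come back in column-major (CSC) order
```

On a CSR device format the situation is mirrored: row slicing is the cheap operation, so the two indexing code paths cannot simply be shared.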


Asked: on 26 Aug 2023
Edited: on 30 Aug 2023
