Possible to speed up this gpuArray calculation with arrayfun() (or otherwise)?

Question

Thomas Barrett on 18 Jan 2021

0
Link

Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/720250-possible-to-speed-up-this-gpuarray-calculation-with-arrayfun-or-otherwise

Commented: Edric Ellis on 12 Feb 2021

I have a complex matrix A, and would like to modify it Nt times according to A = exp( -1i*(A + abs(A).^2) ). The size of A is typically 1000x1000, and the number of times to run would be around 10000.

I am looking to reduce the time taken to carry out these operations. For 1000 iterations on the CPU, I measure around 6.4 seconds. Following the Matlab documentation, I was able to move this to the GPU, which reduced the time taken to 0.07 seconds (an incredible x91 improvement!). So far so good.

However, I also now read this link in the docs, which describes how we can sometimes find even further improvement for element-wise calculations if we use arrayfun() as well. If I try to follow the tutorial, the time taken is actually worse, clocking in at 0.47 seconds. My tests are shown below:

Nt = 1000; % Number of times to run each method
test_functionFcn = @test_function;
A = rand( 500, 600, 'double' ) + rand( 500, 600, 'double' )*1i; % Define an initial complex matrix
    
gpu_A = gpuArray(A); % Transfer matrix to a GPU array
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%%%%%%%%%
cpu_data_out = A;
tic
for k = 1:Nt 
    cpu_data_out = test_function( cpu_data_out );
end
tcpu = toc;
%%%%%%%%%%%%%%%%% Run the calculation Nt times on GPU directly %%%%%%%%%%%%%%%%%%%%
gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function(gpu_data_out);
end
tgpu = toc;
%%%%%%%%%%%%%% Run the calculation Nt times on GPU using arrayfun() %%%%%%%%%%%%%%
gpuarrayfun_data_out = gpu_A;
tic
for k = 1:Nt
    gpuarrayfun_data_out = arrayfun( test_functionFcn, gpuarrayfun_data_out );
end
tgpu_arrayfun = toc;
%%% Print results %%%
fprintf( 'Time taken using only CPU: %g\n', tcpu );
fprintf( 'Time taken using gpuArray directly: %g\n', tgpu );
fprintf( 'Time taken using GPU + arrayfun(): %g\n', tgpu_arrayfun );
%%% Function to operate on matrices %%%
function y = test_function(x)
y = exp(-1i*(x + abs(x).^2));
end

and the results are:

Time taken using only CPU: 6.38785

Time taken using gpuArray directly: 0.0680587

Time taken using GPU + arrayfun(): 0.474612

My questions are:

1) Am I using arrayfun() correctly in this situation, and it is expected that arrayfun() should be worse?

2) If so, and it is really just expected that it is slower than the direct gpuArray method, is there any easy (i.e non-MEX) way to speed up such a calculation? (I see they also mention using pagefun() for example).

Thanks in advance for any advice.

(The graphics card is Nvidia Quadro M4000, and I am running Matlab R2018a)

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Edric Ellis on 19 Jan 2021

1
Link

Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/720250-possible-to-speed-up-this-gpuarray-calculation-with-arrayfun-or-otherwise#answer_601004

Open in MATLAB Online

A few points here. Firstly, (and most importantly), to time code on the GPU, you need to use either gputimeit, or you need to inject a call to wait(gpuDevice) before calling `toc`. That's because work is launched asynchronously on the GPU, and you only get accurate timings by waiting for it to finish. With those minor modifications, on my GPU, I see 0.09 seconds for the `gpuArray` method, and 0.18 seconds for the `arrayfun` version.

Running a loop of GPU operations is generally inefficient, so the main gain you can get here is by pushing the loop inside the `arrayfun` function body so that that loop runs directly on the GPU. Like this:

    %%% Function to operate on matrices %%%
    function x = test_function(x,Nt)
    for ii = 1:Nt
        x = exp(-1i*(x + abs(x).^2));
    end
    end

You'll need to invoke it like `A = arrayfun(@test_function, A, Nt)`. On my GPU, this brings the `arrayfun` time down to 0.05 seconds, so about twice as fast as the plain `gpuArray` version.

9 Comments
Show 7 older commentsHide 7 older comments

Thomas Barrett on 19 Jan 2021

Edited: Thomas Barrett on 12 Feb 2021

Open in MATLAB Online

Good morning @Edric Ellis, thank you very much for your answer.

1) Thanks for your advice about gputimeit - I did not realise how critical it was to obtain proper benchmarking. I will use this in the future.

2) I think I see now from your example a little better how arrayfun() can lead to gains. One thing I didn't mention is that in my actual main code, inside the k=1:Nt loop there is an additional operation which is a transpose and matrix multiplication with a sparse, tridiagonal matrix. Here is a more fleshed out MWE of what is really going on:

Nt = 1000; % Number of times to run each method
N_rows = 500;
N_cols = 600;
test_functionFcn = @test_function;
A = rand( N_rows, N_cols, 'double' ) + rand( N_rows, N_cols, 'double' )*1i; % Define an initial complex matrix
%%% Generate a sparse, tridiagonal, square transformation matrix %%%%%%%%
mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);
M(1,1) = 6; % Set a couple of other entries
M(2,1) = 3;
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%
gpu_data_out = gpuArray(A); % Move the arrays to the GPU
gpu_M = gpuArray(M);
for k = 1:Nt 
    gpu_data_out = test_function( gpu_data_out );
    gpu_data_out = gpu_data_out*gpu_M.';
end
%%% Function to operate on matrices %%%
function y = test_function(x)
y = exp(-1i*(x + abs(x).^2));
end

I'm very sorry for not including that in the OP - I did not realise at the time that it might be relevant to the solution. Does this change things? Are there still gains to be made with arrayfun() on the GPU, or is this now not suitable for converting to arrayfun() ?

Thanks for your support.

Thomas Barrett on 12 Feb 2021

Hi @Edric Ellis yes this is the same for on my system as well unfortunately. Are there any other potential routes I should explore (even if they are not particularly simple to implement)?

For example, I have read that there is the possibility of writing CUDA code, or CUDA mex files - is this something that is worth exploring? Or does MATLAB effectively do all this under the hood anyway automatically when the array is specified to be a gpuArray?

Edric Ellis on 12 Feb 2021

You could certainly try writing a CUDA MEX file - doc here https://www.mathworks.com/help/parallel-computing/run-mex-functions-containing-cuda-code.html - but frankly in this case, I'm not convinced you'd be able to go much faster than the arrayfun version. gpuArray arrayfun already interprets your MATLAB code and sends an optimised version to the GPU, and is generally pretty efficient. I suspect to get any real benefits here, your next option really is to perhaps write a custom variant of a specialized matrix-matrix product - providing the structure of your M matrix is simple enough to permit that. In principle, that would allow you to launch a single GPU kernel per iteration of your loop, and do everything there.

Sign in to comment.

Possible to speed up this gpuArray calculation with arrayfun() (or otherwise)?

0 Comments
Show -2 older commentsHide -2 older comments

Accepted Answer

9 Comments
Show 7 older commentsHide 7 older comments

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Possible to speed up this gpuArray calculation with arrayfun() (or otherwise)?

0 Comments Show -2 older commentsHide -2 older comments

Accepted Answer

9 Comments Show 7 older commentsHide 7 older comments

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

9 Comments
Show 7 older commentsHide 7 older comments