GPU knnsearch performs slower than CPU for large matrices?

I am currently running knnsearch on the CPU with a large number of query points (10 million by 3). For each of the 10 million rows, I want the index of the nearest row in a 2,500 by 3 matrix.
On the CPU, knnsearch takes roughly 11 seconds, but on the GPU it takes about 45 seconds. When I lowered the number of query points to 1 million, the GPU took roughly 1.5 seconds and the CPU only 1.3 seconds.
I expected the GPU to make knnsearch faster, but that doesn't seem to be the case. Am I implementing this correctly? Any other advice on how to get indices from knnsearch faster, with or without the GPU, is greatly appreciated!
Example code below:
Note: the "X" and "Y" in my actual implementation are not generated by randn, but they have the same dimensions. I don't think that changes what I am trying to do, but I thought I'd mention it.
rng default
%% 10 million queries
% Using CPU
X = randn(2500,3);
Y = randn(10000000,3);
c = @() knnsearch(X,Y);
tcpu = timeit(c);
% Using GPU (inputs are copied to the GPU before timing starts)
gx = gpuArray(X);
gy = gpuArray(Y);
g = @() knnsearch(gx,gy);
tgpu = gputimeit(g);
%% Now just 1 million queries
% Using CPU
X = randn(2500,3);
Y = randn(1000000,3);
c2 = @() knnsearch(X,Y);
tcpu2 = timeit(c2);
% Using GPU (inputs are copied to the GPU before timing starts)
gx = gpuArray(X);
gy = gpuArray(Y);
g2 = @() knnsearch(gx,gy);
tgpu2 = gputimeit(g2);

Answers (1)

Damian Pietrus on 19 Mar 2024
Hello Ishan,
There is some overhead when moving variables between the CPU and the GPU. For shorter-running calculations, this transfer overhead can account for a significant share of your overall compute time. As a comparison, try measuring both the overall GPU time (transfer plus computation) and the time for the GPU computation alone. This will give you an idea of how much of the total time is spent on data transfer.
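One way to make that comparison is sketched below. It is only an illustration under a few assumptions: Parallel Computing Toolbox and Statistics and Machine Learning Toolbox are available, the 1 million query size from the question is used, and the variable names tTransfer, tCompute, and tTotal are made up for this example. Note that in the code from the question, gx and gy are already on the GPU before gputimeit runs, so that measurement covers the computation only.

```matlab
% Sketch: separate host-to-device transfer time from GPU compute time.
rng default
X = randn(2500,3);
Y = randn(1000000,3);

dev = gpuDevice;   % currently selected GPU; this call also initializes it

% 1) Transfer only: copy the inputs to the GPU and block until finished.
tic
gx = gpuArray(X);
gy = gpuArray(Y);
wait(dev)
tTransfer = toc;

% 2) Compute only: inputs are already on the GPU, and gputimeit
%    waits for the GPU to finish before it stops the clock.
tCompute = gputimeit(@() knnsearch(gx,gy));

% 3) End to end: copy in, search, and gather the indices back to the host.
tic
idx = gather(knnsearch(gpuArray(X), gpuArray(Y)));
wait(dev)
tTotal = toc;

fprintf('transfer %.3f s, compute %.3f s, end to end %.3f s\n', ...
    tTransfer, tCompute, tTotal);
```

The tic/toc numbers come from a single run, so they will be noisier than the gputimeit result, which is measured over several calls. Repeating steps 1 and 3 a few times and taking the median should give a steadier comparison.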
  1 Comment
Ishan Phadke on 21 Mar 2024
Edited: Ishan Phadke on 21 Mar 2024
How would I measure just the time for the GPU computation instead of the overall GPU time? Would this be using something like tic/toc before executing wait() for the GPU to complete, or is there a better way to do this?
I also have a related follow-up question, but if this is better as a separate question, let me know. Does the GPU perform better for calculations with "more square" matrices? By "more square" I simply mean that the two dimensions of an m-by-n matrix are closer to each other. I tried running knnsearch in a loop over chunks of the large "gy" matrix, and this ran faster than the non-looped version on the GPU. It may just be a fluke, but it surprised me, so I thought I'd ask.
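For reference, the chunked loop described above might look roughly like the sketch below. This is a reconstruction rather than the exact code that was run: chunkSize is a hypothetical tuning value, the query size is the 1 million case from the question, and each block of rows is copied to the GPU inside the loop instead of slicing a "gy" that is already there.

```matlab
% Sketch: run the GPU knnsearch over row blocks of the query matrix.
rng default
X = randn(2500,3);
Y = randn(1000000,3);
chunkSize = 250000;               % hypothetical block size, worth tuning

gx = gpuArray(X);                 % reference points stay on the GPU
nQuery = size(Y,1);
idx = zeros(nQuery,1);            % results are collected in host memory

for first = 1:chunkSize:nQuery
    last = min(first + chunkSize - 1, nQuery);
    gyChunk = gpuArray(Y(first:last,:));                % move one block to the GPU
    idx(first:last) = gather(knnsearch(gx, gyChunk));   % nearest row of X per query
end
```

Timing this loop for a few values of chunkSize against the single-call version would show whether the speedup is repeatable or a fluke.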


Release

R2023a
