I am multiplying a large sparse matrix by a dense matrix using gpuArray. On my GTX 1080, MATLAB's sparse matrix multiplication runs in 5.04 ms (only the multiplication is timed, using tic/toc):

    tic;
    gpu_mmm = gpu_matrix * gpu_input;
    mvm_time = toc;

I also have a CUDA 10.2 implementation of the same sparse matrix multiplication using cuSPARSE, which runs in 7.25 ms (timed with the NVIDIA profiler). However, my CUDA implementation uses float32, while MATLAB only supports sparse matrices of type double. To my knowledge, GPUs are much faster at single-precision than at double-precision calculations, so I am wondering why MATLAB performs this calculation faster despite the difference in precision.
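For comparison, here is a sketch of a synchronized version of the MATLAB timing. gpuArray operations can return before the GPU has finished, so a plain tic/toc pair may not capture the full kernel time. The sizes and density below are hypothetical placeholders, not my real data; `wait(gpuDevice)` and `gputimeit` are the standard Parallel Computing Toolbox calls for this.

```matlab
% Hypothetical sizes and density, for illustration only (not the real data).
n = 20000;          % sparse matrix is n-by-n
k = 64;             % dense input has k columns
density = 1e-3;

gpu_matrix = gpuArray(sprand(n, n, density));   % sparse double on the GPU
gpu_input  = gpuArray(rand(n, k));              % dense double on the GPU

dev = gpuDevice;

% Warm-up run so one-time initialization is not included in the timing.
gpu_mmm = gpu_matrix * gpu_input;
wait(dev);

% tic/toc with explicit synchronization before toc.
tic;
gpu_mmm = gpu_matrix * gpu_input;
wait(dev);          % block until the GPU has finished the multiplication
mvm_time = toc;

% Alternative: gputimeit synchronizes internally and repeats the call.
mvm_time_avg = gputimeit(@() gpu_matrix * gpu_input);

fprintf('tic/toc + wait: %.3f ms, gputimeit: %.3f ms\n', ...
        1e3 * mvm_time, 1e3 * mvm_time_avg);
```

If the synchronized time comes out noticeably higher than the unsynchronized 5.04 ms, the two measurements above were probably not comparable in the first place.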