Why is my GPU code faster with the profiler on in RTX GPUs?

I need to process large multidimensional arrays with a series of 1D convolutions, and because the kernel is very small I found it faster to implement the convolution by hand in a for loop instead of using conv. However, my code runs significantly faster when the profiler is on, but only on certain GPUs. In particular, it is consistently 1.5x to 2x faster on an NVIDIA RTX 3080 or RTX 2070; on an NVIDIA A4500 or A5000 there is no significant difference. This matters because processing a single dataset can take hours.
This behavior is consistent across multiple computers, all running Linux (Ubuntu 22.04), and tested with R2021a and R2022a and with NVIDIA driver versions 515 and 520. My question is: how can I make sure I get the "fast" performance without having to embed profile on and profile off around the relevant parts of my code? I have actually done this, and I do benefit from the improved performance when processing an entire dataset, but it is hacky and interferes with the expected use of the profiler in the rest of the code.
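For context, the hack currently looks roughly like this (a sketch; processDataset is a hypothetical stand-in for my actual processing function):
% Sketch of the current workaround: wrap the GPU-heavy section in profiler calls.
% processDataset is a hypothetical placeholder for the real processing code.
profile('on')
results = processDataset(data);   % runs noticeably faster on the RTX cards this way
profile('off')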
The MWE is below. I placed the fastest run first to rule out the second instance running faster due to JIT compilation or caching, I clear the large variables between runs to rule out memory-allocation effects, and I use the results to calculate arrayMean so the JIT cannot skip operations whose results are unused. Interestingly, none of these three concerns matters in practice, and the code runs consistently faster with the profiler on.
% Define common parms
clear
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);

% Run with profiler on. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('on')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOn = toc;
profile('off')
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler ON: %g seconds.\n', timeProfOn)

% Run with profiler off. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('off')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOff = toc;
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler OFF: %g seconds.\n', timeProfOff)

Accepted Answer

Joss Knight on 1 Dec 2022
This is due to an optimization that does not perform ideally under memory pressure. If you reduce the size of your input, you'll see that the discrepancy only appears when you're near the limit of your GPU memory.
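As a rough check (a sketch using the variable names from the MWE and the GPUDevice object's AvailableMemory property), you can compare the array footprints against the free device memory before starting the loop:
% Rough memory-headroom check (sketch; exact numbers vary by card and driver).
g = gpuDevice();
bytesPerSingle = 4;
inputBytes  = prod(largeArraySizes) * bytesPerSingle;                           % largeArray
outputBytes = prod(largeArraySizes - [0, 0, 2 * convSize, 0]) * bytesPerSingle;  % largeArrayConv
fprintf('Input %.2f GB, output %.2f GB, free on GPU %.2f GB\n', ...
    inputBytes / 1e9, outputBytes / 1e9, g.AvailableMemory / 1e9);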
When PCT (Parallel Computing Toolbox) sees a series of element-wise operations like this, it fuses them together so it can run a single kernel, as in
largeArrayConv = largeArrayConv + k1.*largeArray(idx1) + k2.*(largeArray(idx2)) + k3.*(largeArray(idx3)) ...
Unfortunately, this means that memory must be allocated for the intermediates, and when you're low on memory you'll end up with a lot of raw allocations and frees. When the profiler is on, this optimization is disabled so that the measurements make sense, and so you only ever need one temporary array allocation per loop iteration.
Of the various possible workarounds, the easiest is probably just to add wait(gpuDevice) before the end of your for loop.
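Applied to the loop in the question, that workaround would look roughly like this (only the wait call is new; everything else is unchanged from the MWE):
% Same manual convolution loop as in the MWE, with a synchronization point added.
for thisShift = -convSize:convSize
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
    wait(gpuDevice)   % block until the queued kernels finish before the next iteration
end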
I agree that the optimization is misbehaving in this case and we'll take a look at how it might be improved.
  2 Comments
Néstor on 2 Dec 2022
Thanks, Joss. This explanation is compatible with my observations: increasing the array size to use a similar fraction of the available memory on the A4500 card was sufficient to reproduce the behavior I was seeing with the RTX cards.
I tried adding wait(gpuDevice), but although there is a small improvement (~1.2x faster), the code with the profiler off is still significantly slower (~1.5x) than with the profiler on.
I would be happy to try more complicated workarounds to recover the full performance; what other suggestions do you have?
Joss Knight on 2 Dec 2022
I'm surprised about that. This is how I adapted your code:
clear
gpu = gpuDevice();
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);

% Run with profiler on. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single');
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
wait(gpu)

profile off
gputimeit(@()runConvolutionFull(convSize, largeArray, convKernel))
profile on
gputimeit(@()runConvolutionFull(convSize, largeArray, convKernel))
profile off

function largeArrayConv = runConvolutionFull(convSize, largeArray, convKernel)
    largeArrayConv = 0;
    for thisShift = -convSize:convSize
        % Shifted index
        idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
        % Sum over convolved index
        k = convKernel(convSize + 1 + thisShift) / (2 * convSize + 1);
        largeArrayPiece = largeArray(:, :, idx, :, :, :);
        largeArrayConv = k .* largeArrayPiece + largeArrayConv;
        wait(gpuDevice)
    end
end
This makes 100% sure we're only timing the things that are consistent between the two scenarios.
