Why is my code running slower on the GPU?

Hi,
I've been writing a deep learning neural network from scratch, so I can build an intuitive understanding of how it works. The code I've written works fine, and I've spent a great deal of time optimizing it, but I seem to have hit a bottleneck in the GPU code. The network is dynamic: it's implemented as a struct array, with the index into the array representing layer depth. The model uses sigmoid activation functions and a cross-entropy cost function.
First things first, there are three files: the main script, and the backprop and feed_forward functions.
The main script
clc;
clear all; close all;
%% Load Data
load('numbers.mat');
% Convert each scalar label into a one-hot row vector.
for i = 1:length(numbers)
    temp = numbers(i).label;
    numbers(i).label = zeros(1, 10);
    numbers(i).label(temp+1) = 1;
end
validation = numbers(1:10000);
training = numbers(10001:end);
%% Hyperparameters
batch_size = 10;
numEpochs = 5;
% Decaying learning-rate schedule, interpolated over the epochs.
rateFunc = interp1(0.5 ./ (1:20), linspace(1, 20, numEpochs));
numInput = size(training(1).data, 1) * size(training(1).data, 2);
%% Initialization
net = create_net([numInput 100 10]);
numLayers = length(net);
average = [];
%% Main
for epoch = 1:numEpochs
    tic;
    %% Backprop
    randIndex = randperm(size(training, 2));
    for i = 1:batch_size:length(training)-batch_size
        [net, gradient] = backprop(net, training(randIndex(i:i+batch_size-1)), rateFunc(epoch));
    end
    %% Validate Net
    fprintf('Epoch(%d): %fs', epoch, toc);
    [average(end+1), error] = validate_net(net, validation);
    if mod(epoch, 5) == 0
        train_error = validate_net(net, training);
        fprintf('\nError(Training): %f\n', train_error);
        if train_error >= 0.99, break; end
    end
    fprintf('\nError: %f', average(end));
    fprintf('\n---------------\n');
end
%% Functions
function [average, error] = validate_net(net, inputData)
    % Returns classification accuracy and the list of (predicted, true) labels.
    error = [];
    for i = 1:size(inputData, 2)
        layer = feed_forward(net, inputData(i).data);
        [~, ix] = max(layer(end).a);
        [~, iy] = max(inputData(i).label);
        error = [error; [ix-1, iy-1]];
    end
    % Compute the accuracy once, after the loop, rather than on every pass.
    average = mean(error(:,1) == error(:,2));
end
function net = create_net(structure)
    % One struct element per layer; weights scaled by 1/sqrt(fan-in).
    numLayers = length(structure) - 1;
    net = struct('b', [], 'w', cell(1, numLayers));
    for i = 1:numLayers
        net(i).w = randn(structure(i), structure(i+1)) / sqrt(structure(i));
        net(i).b = randn(1, structure(i+1));
    end
end
Backprop
function [net, gradient] = backprop(net, inputData, rate)
    numLayers = length(net);
    delta = struct('b', [], 'w', cell(1, numLayers));
    gradient = struct('b', 0, 'w', num2cell(zeros(1, numLayers)));
    for i = 1:length(inputData)
        layer = feed_forward(net, inputData(i).data);
        % Output layer: cross-entropy with sigmoid gives delta = a - label.
        delta(numLayers).b = layer(numLayers).a - inputData(i).label;
        delta(numLayers).w = layer(numLayers-1).a' * delta(numLayers).b;
        % Hidden layers: backpropagate through the sigmoid derivative.
        % Since a = sigma(z), sigma'(z) = a .* (1 - a), avoiding a second exp.
        for L = numLayers-1:-1:2
            delta(L).b = (delta(L+1).b * net(L+1).w') .* layer(L).a .* (1 - layer(L).a);
            delta(L).w = layer(L-1).a' * delta(L).b;
        end
        delta(1).b = (delta(2).b * net(2).w') .* layer(1).a .* (1 - layer(1).a);
        delta(1).w = inputData(i).data' * delta(1).b;
        % Accumulate the per-example gradients over the batch.
        for L = 1:numLayers
            gradient(L).b = gradient(L).b + delta(L).b;
            gradient(L).w = gradient(L).w + delta(L).w;
        end
    end
    % Gradient-descent step, averaged over the batch.
    for L = 1:numLayers
        net(L).b = net(L).b - rate/length(inputData) * gradient(L).b;
        net(L).w = net(L).w - rate/length(inputData) * gradient(L).w;
    end
end
Feed_forward
function layer = feed_forward(net, inputData)
    % Forward pass: z = a_prev * w + b, a = sigmoid(z), one row per example.
    layer = struct('z', [], 'a', cell(1, length(net)));
    layer(1).z = inputData * net(1).w + net(1).b;
    layer(1).a = 1 ./ (1 + exp(-layer(1).z));
    for i = 2:length(net)
        layer(i).z = layer(i-1).a * net(i).w + net(i).b;
        layer(i).a = 1 ./ (1 + exp(-layer(i).z));
    end
end
The dataset I'm using is the classic MNIST digit recognition problem, and I've been able to get close to 98% accuracy on it. Each epoch takes roughly 5 seconds, but on the GPU it takes about 6 times as long. I use the GPU by changing the create_net function, like so:
function net = create_net(structure)
    numLayers = length(structure) - 1;
    net = struct('b', [], 'w', cell(1, numLayers));
    for i = 1:numLayers
        % Same initialization as before, but allocated on the GPU.
        net(i).w = gpuArray(randn(structure(i), structure(i+1)) / sqrt(structure(i)));
        net(i).b = gpuArray(randn(1, structure(i+1)));
    end
end
Am I doing something wrong here? I'd appreciate any feedback on optimizing the code, and on how to solve this GPU issue.
Thanks for reading


AlexRD on 31 Mar 2021 (edited by Walter Roberson on 31 Mar 2021)
Oh, apparently I can attach files!
(numbers.mat is too big for the 5 MB restriction, so here's a link for numbers.mat: https://drive.google.com/file/d/1GnSfTkDD1GYzy26Y5nhpNXaCZgbNHVlf/view?usp=sharing )
Well done on implementing your own neural net in MATLAB! I can't see anything obviously wrong. But of course, the GPU is only really effective when it's fully utilized. For a network processing data like MNIST, which typically has inputs of size 28x28, a batch size of 10 means the GPU is only processing about 10,000 numbers at once, which barely scratches the surface. What happens when you increase the batch size to something like 256? Or 1024?
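A quick way to see this under-utilization directly is to time one layer-sized multiply at a few batch sizes (a rough sketch, not from the thread; it assumes Parallel Computing Toolbox, and the sizes are illustrative):

```matlab
% Time one 784x100 layer multiply for different batch sizes.
% With small batches the GPU time barely changes, because kernel
% launch overhead dominates and the device is under-utilized.
W = randn(784, 100, 'single', 'gpuArray');
for batch = [10 100 1000 10000]
    X = randn(batch, 784, 'single', 'gpuArray');
    t = gputimeit(@() X * W);
    fprintf('batch %5d: %.4f ms\n', batch, 1e3 * t);
end
```

Until the times start growing with the batch size, the GPU has spare capacity and larger batches are effectively free.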
Increasing the batch size has little effect on the time it takes my algorithm to finish an epoch; I think that's because the number of calculations per epoch is fixed. The time it takes to train the network, however, increases significantly.
Changing it from 10 to 100 actually gives me a better time per epoch, since I imagine there are fewer calls to backprop (from ~5 s on the CPU to ~4.5 s, and the same for the GPU), but the time it takes for the network to fully finish training increases in proportion to the batch size.


 Accepted Answer

Increasing the batch size alone cannot improve convergence in a simple MLP; you need to match it with an increase in the learning rate.
But more to the point of your question: does increasing the batch size improve the GPU performance relative to the CPU?


AlexRD on 1 Apr 2021 (edited by AlexRD on 1 Apr 2021)
Hey Joss,
(CPU) Increasing the batch size from 10 to 1000 brings the average time per epoch from roughly 5.2 s down to around 4.6 s. A small decrease, but consistent enough not to dismiss as randomness.
(GPU) Increasing the batch size from 10 to 1000 has no noticeable impact on the time; it stays around 28 s per epoch.
Correct me if I'm wrong, but decreasing the batch size means that more steps are taken along the gradient vector. So shouldn't changing the batch size alone affect convergence speed? In fact, that's what I see in my algorithm: when I change the batch size, the time it takes for the training accuracy to reach 99% is proportional to the batch size.
Yes, you're right in one sense: it doesn't look as though you're normalizing the gradients by the batch size, so their magnitude should scale in proportion to the batch size, thus increasing the effective learning rate. But your two comments seem contradictory. Does it converge in fewer iterations or more? It should take proportionally fewer iterations to converge with a larger batch size, because you're taking bigger steps. The reason you can safely increase the step size is that a larger batch gives a better estimate of the true gradient direction. So a rule of thumb is that it takes a roughly fixed number of observations (images) to reach a certain accuracy, but with a large batch size your throughput is higher, so you get to the answer quicker. This has its limits, of course: at a certain point the batch size becomes too large to give any further benefit.
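The claim that a larger batch gives a better estimate of the true gradient can be checked with a quick simulation (an illustrative sketch, not from the thread; per-example gradients are modelled as noisy draws around a true value):

```matlab
% The spread of the batch-mean gradient shrinks like 1/sqrt(batch),
% which is what justifies taking a proportionally larger step.
true_g = 1.0;
for batch = [10 100 1000]
    est = mean(true_g + randn(batch, 1000), 1);  % 1000 independent batch means
    fprintf('batch %4d: std of batch-mean = %.3f\n', batch, std(est));
end
```

With unit noise, the printed standard deviations fall roughly as 0.32, 0.10, 0.03, so each tenfold increase in batch buys about a 3x better gradient estimate.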
The reason the GPU speed is unchanged with larger batch sizes is utilization. Until you're fully utilizing the GPU, it has a pretty much fixed execution time per operation, but it's processing more data in that time. You're seeing similar behaviour from the CPU, I see, which might indicate an issue with your code, or could be the equivalent happening with your multicore CPU.
Thank you very much for the help, Joss.
I have overhauled the way my backprop algorithm works, and I've got it down to 0.5 s per epoch on the CPU by changing how the calculations are made and using good ol' matrix multiplication to its full potential (how good is 0.5 s per epoch on an Intel i9-9900K compared to other implementations?).
I've run it multiple times through a batch algorithm and measured the time per epoch, and it's still surprising to see that the GPU is slower in some scenarios.
The GPU is only really faster when the neuron count is massive (well, neuron count × input size, really), so my guess is it's only really worthwhile for the first layer, and not for the subsequent layers, since they contain far fewer weights. My question to you is: would it be possible to create a hybrid system where the first layer runs on the GPU but the subsequent layers run on the CPU? My concern is that the transfer time from the GPU would make this not worthwhile.
What is your neuron count? Normally I'd take that to mean the number of weights in the weight matrices, but you might be using it for the number of layers? Certainly you're going to need a weight matrix with several thousand weights before the GPU out-performs the CPU, and that is completely expected.
There are a huge number of factors here. What is your GPU, and are you using single precision? Unless you have a Titan V, it's unlikely your GPU has any significant double-precision FLOPS, so you'll need to make sure your data is in single precision. It certainly doesn't look from your code as though you're making your weights and data single precision.
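Applied to the create_net function from the question, that suggestion might look like this (a sketch, not from the thread; the 'single','gpuArray' form of randn requires Parallel Computing Toolbox):

```matlab
function net = create_net(structure)
    % Same initialization as the question's GPU version, but in single
    % precision. Consumer GPUs such as the GTX 1070 have far higher
    % single-precision throughput than double-precision.
    numLayers = length(structure) - 1;
    net = struct('b', [], 'w', cell(1, numLayers));
    for i = 1:numLayers
        net(i).w = randn(structure(i), structure(i+1), 'single', 'gpuArray') / sqrt(structure(i));
        net(i).b = randn(1, structure(i+1), 'single', 'gpuArray');
    end
end
```

The input data would need the same treatment, e.g. `single(...)` (and `gpuArray(...)`) on the images and labels before training, so the whole pipeline stays in one precision.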
Thank you very much for your answers, Joss; your insight is extremely helpful. I wasn't using single precision, although I experimented with it on the CPU and noticed a good speedup. Speaking of which, do you have any recommendations on how to store the network? I've been using struct arrays because they're fast and let me store matrices of different sizes, but indexing them is kind of a pain, since they don't let you do structure(x:y).data, for example, to get concatenated data.
The neuron count is the number of neurons in the single hidden layer the system has. In those tests the weight matrices were effectively:
1st weight matrix: 784 × neuron count
2nd weight matrix: neuron count × 10
I have a GTX 1070 and an i9-9900K.
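On the struct-indexing question above, one possible alternative (offered here as a sketch, not something from the thread) is to keep the weights and biases in cell arrays: per-layer sizes can still differ, and slices are easier to operate on as a group:

```matlab
% Hypothetical cell-array storage for the same network shape.
structure = [784 100 10];
numLayers = length(structure) - 1;
W = cell(1, numLayers);
b = cell(1, numLayers);
for i = 1:numLayers
    W{i} = randn(structure(i), structure(i+1)) / sqrt(structure(i));
    b{i} = randn(1, structure(i+1));
end
% Unlike struct arrays, a cell slice is easy to process in one call:
sizes = cellfun(@numel, W(1:2));   % weights per layer, here [78400 1000]
```

This trades the `net(i).w` syntax for `W{i}`, but `cellfun` and slicing replace the awkward `structure(x:y).data` pattern.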
Well, using double precision is definitely your problem, so add 'single' to your calls to randn and you should find everything goes much faster on the GPU.
As for vectorizing, as long as you are performing your operations on the whole batch at once, that's the best you can do, because each layer is separated by a non-linear operation that forces the layers to be processed sequentially.
You know what, looking at your code, it does look as though you are processing the data one observation at a time, which totally defeats the point of batches. You need to pack the data in each batch together, so that, for instance, your weight matrix is multiplied by a matrix of data in which each column is one observation.
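The packing Joss describes can be sketched as follows: with one observation per column, each layer becomes a single matrix multiply for the whole batch (dimensions here are illustrative, and the weight matrix is transposed relative to the question's row-vector convention):

```matlab
% One forward step for a whole batch: X is 784 x batch_size,
% each column one flattened image.
batch_size = 256;
X = rand(784, batch_size, 'single');         % batch of inputs, one per column
W = randn(100, 784, 'single') / sqrt(784);   % layer weights, fanOut x fanIn
b = randn(100, 1, 'single');                 % bias column vector
Z = W * X + b;              % bias broadcasts across columns (R2016b+)
A = 1 ./ (1 + exp(-Z));     % sigmoid activations, 100 x batch_size
```

One large multiply per layer is what lets both the CPU's BLAS and the GPU reach high utilization, instead of many tiny vector-matrix products.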
Yeah, that's what I overhauled about my backprop code, and it improved the timing significantly.
This is the new code I've written:
function [net, performance] = train_network(net, tdata, tlabel, vdata, vlabel, numEpochs, batch_size)
    % Template structs; multiplying by net(1).b(1)*0 keeps the class
    % (e.g. single/gpuArray) of the network's own arrays.
    emptyLayer = struct('a', num2cell(net(1).b(1) * zeros(1, length(net))));
    emptyNet = struct('b', 0, 'w', num2cell(zeros(1, length(net))));
    rateFunc = interp1(0.5 ./ (1:10), linspace(1, 10, numEpochs));
    performance = [];
    %% Backpropagation
    for epoch = 1:numEpochs
        tic;
        % Shuffle the training set; columns are observations.
        randIndex = randperm(size(tdata, 2));
        tdata = tdata(:, randIndex);
        tlabel = tlabel(:, randIndex);
        mainIndex = 1:batch_size:size(tdata, 2);
        mainIndex(end) = size(tdata, 2);
        for i = 1:length(mainIndex)-1
            delta = emptyNet;
            layer = feed_forward(net, tdata(:, mainIndex(i):mainIndex(i+1)-1), emptyLayer);
            % Output-layer delta for the whole batch at once.
            delta(length(net)).b = layer(length(net)).a - tlabel(:, mainIndex(i):mainIndex(i+1)-1);
            delta(length(net)).w = delta(length(net)).b * layer(length(net)-1).a';
            for L = length(net)-1:-1:2
                delta(L).b = (net(L+1).w' * delta(L+1).b) .* sigma_prime(layer(L).a);
                delta(L).w = delta(L).b * layer(L-1).a';
            end
            delta(1).b = (net(2).w' * delta(2).b) .* sigma_prime(layer(1).a);
            delta(1).w = delta(1).b * tdata(:, mainIndex(i):mainIndex(i+1)-1)';
            % Update step, averaging the gradient over the batch.
            for L = 1:length(net)
                net(L).b = net(L).b - rateFunc(epoch)/length(mainIndex(i):mainIndex(i+1)) * sum(delta(L).b, 2);
                net(L).w = net(L).w - rateFunc(epoch)/length(mainIndex(i):mainIndex(i+1)) * delta(L).w;
            end
        end
        %% Validate Net
        performance(end+1, 1) = validate_net(net, vdata, vlabel);
        if mod(epoch, 1) == 0
            train_error = validate_net(net, tdata, tlabel);
            fprintf('Epoch(%d): %fs', epoch, toc);
            fprintf('\nError(Training): %f', train_error);
            fprintf('\nError: %f\n---------------\n', performance(end, 1));
            % if train_error >= 0.995, break; end
        else
            fprintf('Epoch(%d): %fs', epoch, toc);
            fprintf('\nError: %f\n---------------\n', performance(end, 1));
        end
    end
end
It works great, as the functions now all have a much bigger matrix to work with. The single-precision change gives a nice little 10% performance boost.
One interesting thing I noticed: I expected performance to be roughly proportional to neuron count × batch size, but in my timings the neuron count appears to play a much bigger role.
Your numbers are actually saying that the size of your weight matrix has almost no effect on performance, except on the CPU once it gets large enough. On the GPU you can see that, despite performing much larger operations, the performance is unchanged with neuron count, which must mean you haven't fully utilized the GPU at these sizes. You should pick your neuron count based on getting good results without overfitting, while the batch size should simply be as large as possible while still giving good convergence.
Thank you very much!



Release: R2021a
Asked: 30 Mar 2021
Commented: 5 Apr 2021
