Why is my code running slower on the GPU?
Hi,
I've been writing a deep learning neural network model from scratch, so I can get an intuitive understanding of how it works. The code I've written works fine, and I've spent a great amount of time optimizing it, but I seem to have hit a bottleneck in the GPU code. I've implemented a dynamic network through the use of structures, with the index into the structure array representing layer depth. The model uses sigmoid activation functions and a cross-entropy cost function.
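A quick aside on the math before the code (this snippet is just a sanity check, not part of the three files below): with a sigmoid output layer and a cross-entropy cost, the derivative of the cost with respect to the output pre-activation collapses to a - y, which is why the output-layer delta in the backprop code is simply the activation minus the label.
% Numerical check that d(cross-entropy)/dz = a - y for a sigmoid output unit
z = 0.3; y = 1;                                     % arbitrary pre-activation and target
a = 1/(1 + exp(-z));                                % sigmoid activation
C = @(t) -(y*log(1/(1+exp(-t))) + (1-y)*log(1 - 1/(1+exp(-t))));
dC_numeric  = (C(z + 1e-6) - C(z - 1e-6)) / 2e-6;   % central difference
dC_analytic = a - y;
fprintf('numeric: %.6f   analytic: %.6f\n', dC_numeric, dC_analytic);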
First things first, there are three files: the main script, and the backprop and feed_forward functions.
The main script
clc;
clear all; close all;
%% Load Data
load ('numbers.mat');
% convert each label to a one-hot row vector
for i=1:length(numbers)
temp = numbers(i).label;
numbers(i).label = zeros(1,10);
numbers(i).label(temp+1) = 1;
end
validation = numbers(1:10000);
training = numbers(10001:end);
%% Hyperparameters
batch_size = 10;
numEpochs = 5;
% learning rate schedule: sample the decaying curve 0.5./(1:20) at numEpochs evenly spaced points
rateFunc = interp1(0.5 ./ (1:20), linspace(1, 20, numEpochs));
numInput = size(training(1).data, 1) * size(training(1).data, 2);
%% Initialization
net = create_net([numInput 100 10]);
numLayers = length(net);
average = [];
%% Main
for epoch=1:numEpochs
tic;
%% Backprop
randIndex = randperm(length(training));
for i=1:batch_size:length(training)-batch_size
[net, gradient] = backprop(net, training(randIndex(i:i+batch_size-1)), rateFunc(epoch));
end
%% Validate Net
fprintf ('Epoch(%d): %fs', epoch, toc);
average(end+1) = validate_net(net, validation);   % only the accuracy is needed here
if mod(epoch, 5) == 0
train_accuracy = validate_net(net, training);   % validate_net returns classification accuracy
fprintf ('\nAccuracy (training): %f\n', train_accuracy);
if train_accuracy >= 0.99, break; end
end
fprintf ('\nAccuracy (validation): %f', average(end));
fprintf ('\n---------------\n');
end
%% Functions
function [average, results] = validate_net(net, inputData)
% Returns the classification accuracy plus a list of [prediction, truth] pairs
results = zeros(length(inputData), 2);
for i=1:length(inputData)
layer = feed_forward(net, inputData(i).data);
[~,ix] = max(layer(end).a);          % predicted digit (indices are 1-based, digits 0-9)
[~,iy] = max(inputData(i).label);    % true digit
results(i,:) = [ix-1, iy-1];
end
average = mean(results(:,1) == results(:,2));
end
function net = create_net(structure)
numLayers = length(structure) - 1;
net = struct('b', [], 'w', cell(1, numLayers));
for i=1:numLayers
net(i).w = randn(structure(i), structure(i+1)) / sqrt(structure(i));   % scale weights by 1/sqrt(fan-in)
net(i).b = randn(1, structure(i+1));
end
end
Backprop
function [net, gradient] = backprop(net, inputData, rate)
% Accumulate gradients over the mini-batch, then take one gradient-descent step
numLayers = length(net);
delta = struct('b', [], 'w', cell(1, length(net)));
gradient = struct('b', 0, 'w', num2cell(zeros(1, length(net))));   % scalar 0 accumulators expand to full size on first addition
for i=1:length(inputData)
layer = feed_forward(net, inputData(i).data);
delta(numLayers).b = layer(numLayers).a - inputData(i).label;   % output delta: a - y for cross-entropy with a sigmoid output
delta(numLayers).w = layer(numLayers-1).a' * delta(numLayers).b;
for L=numLayers-1:-1:2
delta(L).b = (delta(L+1).b * net(L+1).w') .* layer(L).a .* (1 - layer(L).a);   % sigmoid'(z) = a.*(1-a), reusing the stored activation
delta(L).w = layer(L-1).a' * delta(L).b;
end
delta(1).b = (delta(2).b * net(2).w') .* layer(1).a .* (1 - layer(1).a);
delta(1).w = inputData(i).data' * delta(1).b;
for L=1:numLayers
gradient(L).b = gradient(L).b + delta(L).b;
gradient(L).w = gradient(L).w + delta(L).w;
end
end
for L=1:numLayers
net(L).b = net(L).b - rate/length(inputData)*gradient(L).b;
net(L).w = net(L).w - rate/length(inputData)*gradient(L).w;
end
end
Feed_forward
function layer = feed_forward(net, inputData)
% Forward pass: store pre-activation z and sigmoid activation a for every layer
layer = struct('z', [], 'a', cell(1, length(net)));
layer(1).z = inputData * net(1).w + net(1).b;
layer(1).a = 1./ (1 + exp(-layer(1).z));
for i=2:length(net)
layer(i).z = layer(i-1).a * net(i).w + net(i).b;
layer(i).a = 1./ (1 + exp(-layer(i).z));
end
end
The dataset I'm using is the classic MNIST digit recognition problem, and I've been able to get close to 98% accuracy on it. Training takes roughly 5 seconds per epoch on the CPU, but on the GPU it takes about 6 times as long. I use the GPU by changing the create_net function, like so:
function net = create_net(structure)
numLayers = length(structure) - 1;
net = struct('b', [], 'w', cell(1, numLayers));
for i=1:numLayers
net(i).w = gpuArray(randn(structure(i), structure(i+1))/sqrt(structure(i)));
net(i).b = gpuArray(randn(1, structure(i+1)));
end
end
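Side note: with create_net written like this, everything downstream (activations, gradients, trained weights) stays on the GPU as gpuArray objects. If the trained net is ever needed back in host memory, e.g. to save it, a small helper along these lines would do it (gather_net is only an illustrative sketch, not one of the files above):
function net = gather_net(net)
% Copy each layer's weights and biases from the GPU back to ordinary arrays
for i = 1:length(net)
net(i).w = gather(net(i).w);
net(i).b = gather(net(i).b);
end
end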
Am I doing something wrong here? I'd appreciate any feedback on optimizing the code, and on how to solve this GPU issue.
Thanks for reading
3 Comments
Joss Knight on 31 Mar 2021
Well done on implementing your own neural net in MATLAB! I can't see anything obviously wrong. But of course, the GPU is only really effective when it's fully utilized. For a network processing data like MNIST, which typically has inputs of size 28x28, a batch size of 10 means the GPU is only processing about 10000 numbers at once - barely scratching the surface really. What happens when you increase the batch size to something like 256? Or 1024...?
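For illustration (a rough sketch, not from the original thread): timing a single 784-to-100 sigmoid layer with gputimeit at a few batch sizes shows how the per-sample cost drops as the batch grows, which is the utilization effect described above.
% Assumes Parallel Computing Toolbox; sizes match the 784-100-10 network above
w = gpuArray(randn(784, 100));
b = gpuArray(randn(1, 100));
for batch = [10 256 1024]
x = gpuArray(randn(batch, 784));
t = gputimeit(@() 1./(1 + exp(-(x*w + b))));   % one hidden-layer forward pass
fprintf('batch %4d: %.3g s per call, %.3g s per sample\n', batch, t, t/batch);
end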
