Initialize Learnable Parameters for Model Function

When you train a network using layers, layer graphs, or dlnetwork objects, the software automatically initializes the learnable parameters according to the layer initialization properties. When you define a deep learning model as a function, you must initialize the learnable parameters manually.

How you initialize learnable parameters (for example, weights and biases) can have a big impact on how quickly a deep learning model converges.

Tip

This topic explains how to initialize learnable parameters for a deep learning model defined as a function in a custom training loop. To learn how to specify the learnable parameter initialization for a deep learning layer, use the corresponding layer property. For example, to set the weights initializer of a convolution2dLayer object, use the WeightsInitializer property.
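
For example, the following code creates a 2-D convolution layer that uses He initialization for its weights (the filter size and number of filters here are illustrative):

% Create a convolution layer whose weights use He initialization.
layer = convolution2dLayer(5,128,'WeightsInitializer','he');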

Default Layer Initializations

This table shows the default initializations for the learnable parameters of each layer. The sections on this page show how to initialize learnable parameters for model functions by using the same initializations.

| Layer | Learnable Parameter | Default Initialization |
| --- | --- | --- |
| convolution2dLayer | Weights | Glorot Initialization |
| convolution2dLayer | Bias | Zeros Initialization |
| convolution3dLayer | Weights | Glorot Initialization |
| convolution3dLayer | Bias | Zeros Initialization |
| groupedConvolution2dLayer | Weights | Glorot Initialization |
| groupedConvolution2dLayer | Bias | Zeros Initialization |
| transposedConv2dLayer | Weights | Glorot Initialization |
| transposedConv2dLayer | Bias | Zeros Initialization |
| transposedConv3dLayer | Weights | Glorot Initialization |
| transposedConv3dLayer | Bias | Zeros Initialization |
| fullyConnectedLayer | Weights | Glorot Initialization |
| fullyConnectedLayer | Bias | Zeros Initialization |
| batchNormalizationLayer | Offset | Zeros Initialization |
| batchNormalizationLayer | Scale | Ones Initialization |
| lstmLayer | Input weights | Glorot Initialization |
| lstmLayer | Recurrent weights | Orthogonal Initialization |
| lstmLayer | Bias | Unit Forget Gate Initialization |
| gruLayer | Input weights | Glorot Initialization |
| gruLayer | Recurrent weights | Orthogonal Initialization |
| gruLayer | Bias | Zeros Initialization |
| wordEmbeddingLayer | Weights | Gaussian Initialization, with mean 0 and standard deviation 0.01 |

Learnable Parameter Sizes

When initializing learnable parameters for model functions, you must specify parameters of the correct size. The size of the learnable parameters depends on the type of deep learning operation.

In this table, filterSize is a 1-by-K vector specifying the filter size, where K is the number of spatial dimensions.

| Operation | Learnable Parameter | Size |
| --- | --- | --- |
| batchnorm | Offset | [numChannels 1], where numChannels is the number of input channels |
| batchnorm | Scale | [numChannels 1], where numChannels is the number of input channels |
| dlconv | Weights | [filterSize numChannels numFilters], where numChannels is the number of input channels and numFilters is the number of filters |
| dlconv | Bias | [numFilters 1], where numFilters is the number of filters, or [1 1] |
| dlconv (grouped) | Weights | [filterSize numChannelsPerGroup numFiltersPerGroup numGroups], where numChannelsPerGroup is the number of input channels for each group, numFiltersPerGroup is the number of filters for each group, and numGroups is the number of groups |
| dlconv (grouped) | Bias | [numFiltersPerGroup 1], where numFiltersPerGroup is the number of filters for each group, or [1 1] |
| dltranspconv | Weights | [filterSize numFilters numChannels], where numChannels is the number of input channels and numFilters is the number of filters |
| dltranspconv | Bias | [numFilters 1], where numFilters is the number of filters, or [1 1] |
| dltranspconv (grouped) | Weights | [filterSize numFiltersPerGroup numChannelsPerGroup numGroups], where numChannelsPerGroup is the number of input channels for each group, numFiltersPerGroup is the number of filters for each group, and numGroups is the number of groups |
| dltranspconv (grouped) | Bias | [numFiltersPerGroup 1], where numFiltersPerGroup is the number of filters for each group, or [1 1] |
| fullyconnect | Weights | [outputSize inputSize], where outputSize and inputSize are the numbers of output and input channels, respectively |
| fullyconnect | Bias | [outputSize 1], where outputSize is the number of output channels |
| gru | Input weights | [3*numHiddenUnits inputSize], where numHiddenUnits is the number of hidden units of the operation and inputSize is the number of input channels |
| gru | Recurrent weights | [3*numHiddenUnits numHiddenUnits], where numHiddenUnits is the number of hidden units of the operation |
| gru | Bias | [3*numHiddenUnits 1], where numHiddenUnits is the number of hidden units of the operation |
| lstm | Input weights | [4*numHiddenUnits inputSize], where numHiddenUnits is the number of hidden units of the operation and inputSize is the number of input channels |
| lstm | Recurrent weights | [4*numHiddenUnits numHiddenUnits], where numHiddenUnits is the number of hidden units of the operation |
| lstm | Bias | [4*numHiddenUnits 1], where numHiddenUnits is the number of hidden units of the operation |
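
For example, the following sketch computes the sizes for the weights and bias of a fullyconnect operation and initializes them using the helper functions defined in the sections below (the output and input sizes here are illustrative):

outputSize = 10;
inputSize = 256;

% Sizes follow the fullyconnect rows of the table above.
szWeights = [outputSize inputSize];
szBias = [outputSize 1];

parameters.fc.Weights = initializeGlorot(szWeights,outputSize,inputSize);
parameters.fc.Bias = initializeZeros(szBias);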

Glorot Initialization

The Glorot (also known as Xavier) initializer [1] samples weights from the uniform distribution with bounds [-sqrt(6/(No + Ni)), sqrt(6/(No + Ni))], where the values of No and Ni depend on the type of deep learning operation.

In this table, filterSize is a 1-by-K vector containing the filter size, where K is the number of spatial dimensions.

| Operation | Learnable Parameter | No | Ni |
| --- | --- | --- | --- |
| dlconv | Weights | prod(filterSize)*numFilters, where numFilters is the number of filters | prod(filterSize)*numChannels, where numChannels is the number of input channels |
| dlconv (grouped) | Weights | prod(filterSize)*numFiltersPerGroup, where numFiltersPerGroup is the number of filters for each group | prod(filterSize)*numChannelsPerGroup, where numChannelsPerGroup is the number of input channels for each group |
| dltranspconv | Weights | prod(filterSize)*numFilters, where numFilters is the number of filters | prod(filterSize)*numChannels, where numChannels is the number of input channels |
| dltranspconv (grouped) | Weights | prod(filterSize)*numFiltersPerGroup, where numFiltersPerGroup is the number of filters for each group | prod(filterSize)*numChannelsPerGroup, where numChannelsPerGroup is the number of input channels for each group |
| fullyconnect | Weights | Number of output channels of the operation | Number of input channels of the operation |
| gru | Input weights | 3*numHiddenUnits, where numHiddenUnits is the number of hidden units of the operation | Number of input channels of the operation |
| gru | Recurrent weights | 3*numHiddenUnits, where numHiddenUnits is the number of hidden units of the operation | Number of hidden units of the operation |
| lstm | Input weights | 4*numHiddenUnits, where numHiddenUnits is the number of hidden units of the operation | Number of input channels of the operation |
| lstm | Recurrent weights | 4*numHiddenUnits, where numHiddenUnits is the number of hidden units of the operation | Number of hidden units of the operation |

To initialize learnable parameters using the Glorot initializer, you can define a custom function. The function initializeGlorot takes as input the size of the learnable parameters sz and the values of No and Ni (numOut and numIn, respectively), and returns the sampled weights as a dlarray object with underlying type 'single'.

function weights = initializeGlorot(sz,numOut,numIn)

% Sample from the uniform distribution on [-1,1], then scale to
% [-sqrt(6/(numIn+numOut)), sqrt(6/(numIn+numOut))].
Z = 2*rand(sz,'single') - 1;
bound = sqrt(6 / (numIn + numOut));

weights = bound * Z;
weights = dlarray(weights);

end

Example

Initialize the weights for a convolutional operation with 128 filters of size 5-by-5 and 3 input channels.

filterSize = [5 5];
numChannels = 3;
numFilters = 128;

sz = [filterSize numChannels numFilters];
numOut = prod(filterSize) * numFilters;
numIn = prod(filterSize) * numChannels;

parameters.conv.Weights = initializeGlorot(sz,numOut,numIn);

He Initialization

The He initializer [2] samples weights from the normal distribution with zero mean and variance 2/Ni, where the value of Ni depends on the type of deep learning operation.

| Operation | Learnable Parameter | Ni |
| --- | --- | --- |
| dlconv | Weights | prod(filterSize)*numChannelsPerGroup, where filterSize is a 1-by-K vector containing the filter size, numChannelsPerGroup is the number of input channels for each group, and K is the number of spatial dimensions |
| dltranspconv | Weights | prod(filterSize)*numChannelsPerGroup, where filterSize is a 1-by-K vector containing the filter size, numChannelsPerGroup is the number of input channels for each group, and K is the number of spatial dimensions |
| fullyconnect | Weights | Number of input channels of the operation |
| gru | Input weights | Number of input channels of the operation |
| gru | Recurrent weights | Number of hidden units of the operation |
| lstm | Input weights | Number of input channels of the operation |
| lstm | Recurrent weights | Number of hidden units of the operation |

To initialize learnable parameters using the He initializer, you can define a custom function. The function initializeHe takes as input the size of the learnable parameters sz and the value of Ni (numIn), and returns the sampled weights as a dlarray object with underlying type 'single'.

function weights = initializeHe(sz,numIn)

% Sample from the normal distribution with zero mean and variance 2/numIn.
weights = randn(sz,'single') * sqrt(2/numIn);
weights = dlarray(weights);

end

Example

Initialize the weights for a convolutional operation with 128 filters of size 5-by-5 and 3 input channels.

filterSize = [5 5];
numChannels = 3;
numFilters = 128;

sz = [filterSize numChannels numFilters];
numIn = prod(filterSize) * numChannels;

parameters.conv.Weights = initializeHe(sz,numIn);

Gaussian Initialization

The Gaussian initializer samples weights from a normal distribution.

To initialize learnable parameters using the Gaussian initializer, you can define a custom function. The function initializeGaussian takes as input the size of the learnable parameters sz, the distribution mean mu, and the distribution standard deviation sigma, and returns the sampled weights as a dlarray object with underlying type 'single'.

function weights = initializeGaussian(sz,mu,sigma)

% Scale and shift standard normal samples to mean mu and standard deviation sigma.
weights = randn(sz,'single')*sigma + mu;
weights = dlarray(weights);

end

Example

Initialize the weights for an embedding operation with a dimension of 300 and vocabulary size of 5000 using the Gaussian initializer with mean 0 and standard deviation 0.01.

embeddingDimension = 300;
vocabularySize = 5000;
mu = 0;
sigma = 0.01;

sz = [embeddingDimension vocabularySize];

parameters.emb.Weights = initializeGaussian(sz,mu,sigma);

Uniform Initialization

The uniform initializer samples weights from a uniform distribution.

To initialize learnable parameters using the uniform initializer, you can define a custom function. The function initializeUniform takes as input the size of the learnable parameters sz and the distribution bound bound, and returns the sampled weights as a dlarray object with underlying type 'single'.

function parameter = initializeUniform(sz,bound)

% Sample from the uniform distribution on [-bound, bound].
Z = 2*rand(sz,'single') - 1;
parameter = bound * Z;
parameter = dlarray(parameter);

end

Example

Initialize the weights for an attention mechanism with size 100-by-100 and bound 0.1 using the uniform initializer.

sz = [100 100];
bound = 0.1;

parameters.attention.Weights = initializeUniform(sz,bound);

Orthogonal Initialization

The orthogonal initializer returns the orthogonal matrix Q given by the QR decomposition of Z = QR, where Z is sampled from a unit normal distribution and the size of Z matches the size of the learnable parameter.

To initialize learnable parameters using the orthogonal initializer, you can define a custom function. The function initializeOrthogonal takes as input the size of the learnable parameters sz and returns the orthogonal matrix as a dlarray object with underlying type 'single'.

function parameter = initializeOrthogonal(sz)

% Compute the economy-size QR decomposition of a random matrix.
Z = randn(sz,'single');
[Q,R] = qr(Z,0);

% Make the decomposition unique by forcing the diagonal of R to be positive.
D = diag(R);
Q = Q * diag(D ./ abs(D));

parameter = dlarray(Q);

end

Example

Initialize the recurrent weights for an LSTM operation with 100 hidden units using the orthogonal initializer.

numHiddenUnits = 100;

sz = [4*numHiddenUnits numHiddenUnits];

parameters.lstm.RecurrentWeights = initializeOrthogonal(sz);

Unit Forget Gate Initialization

The unit forget gate initializer initializes the bias for an LSTM operation such that the forget gate components of the bias are ones and the remaining entries are zeros.

To initialize learnable parameters using the unit forget gate initializer, you can define a custom function. The function initializeUnitForgetGate takes as input the number of hidden units in the LSTM operation and returns the bias as a dlarray object with underlying type 'single'.

function bias = initializeUnitForgetGate(numHiddenUnits)

bias = zeros(4*numHiddenUnits,1,'single');

% Set the forget gate entries (the second block of numHiddenUnits elements) to 1.
idx = numHiddenUnits+1:2*numHiddenUnits;
bias(idx) = 1;

bias = dlarray(bias);

end

Example

Initialize the bias of an LSTM operation with 100 hidden units using the unit forget gate initializer.

numHiddenUnits = 100;

parameters.lstm.Bias = initializeUnitForgetGate(numHiddenUnits);

Ones Initialization

To initialize learnable parameters with ones, you can define a custom function. The function initializeOnes takes as input the size of the learnable parameters sz and returns the parameters as a dlarray object with underlying type 'single'.

function parameter = initializeOnes(sz)

parameter = ones(sz,'single');
parameter = dlarray(parameter);

end

Example

Initialize the scale for a batch normalization operation with 128 input channels with ones.

numChannels = 128;

sz = [numChannels 1];

parameters.bn.Scale = initializeOnes(sz);

Zeros Initialization

To initialize learnable parameters with zeros, you can define a custom function. The function initializeZeros takes as input the size of the learnable parameters sz and returns the parameters as a dlarray object with underlying type 'single'.

function parameter = initializeZeros(sz)

parameter = zeros(sz,'single');
parameter = dlarray(parameter);

end

Example

Initialize the offset for a batch normalization operation with 128 input channels with zeros.

numChannels = 128;

sz = [numChannels 1];

parameters.bn.Offset = initializeZeros(sz);

Storing Learnable Parameters

It is recommended to store the learnable parameters for a given model function in a single object, such as a structure, table, or cell array. For an example showing how to initialize learnable parameters as a structure, see Train Network Using Model Function.
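
For example, a parameters structure for a hypothetical model with one convolution operation followed by one fully connect operation might look like the following sketch (all sizes and field names are illustrative, and the initializers are the helper functions defined above):

% Illustrative sizes for the convolution operation.
filterSize = [3 3];
numChannels = 3;
numFilters = 16;

parameters.conv.Weights = initializeGlorot([filterSize numChannels numFilters], ...
    prod(filterSize)*numFilters,prod(filterSize)*numChannels);
parameters.conv.Bias = initializeZeros([numFilters 1]);

% Fully connect operation with 10 output channels (assumes the convolution
% output is pooled to one value per channel).
parameters.fc.Weights = initializeGlorot([10 numFilters],10,numFilters);
parameters.fc.Bias = initializeZeros([10 1]);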

Storing Parameters on GPU

If you train your model using a GPU, then the software converts the learnable parameters of the model function to gpuArray objects, which are stored on the GPU.

To make it easier to load learnable parameters on machines without a GPU, it is best practice to gather all the parameters to the local workspace before saving them. To gather learnable parameters stored as a structure, table, or cell array of dlarray objects, use the dlupdate function with the gather function. For example, if you have network learnable parameters stored on the GPU in the structure, table, or cell array parameters, you can transfer the parameters to the local workspace by using the following code:

parameters = dlupdate(@gather,parameters);

If you load learnable parameters that are not on the GPU, you can move the parameters onto the GPU using the dlupdate function with the gpuArray function. Doing so ensures that your network executes on the GPU for training and inference, regardless of where the input data is stored. For example, you can move the parameters stored in the structure, table, or cell array parameters to the GPU by using the following code:

parameters = dlupdate(@gpuArray,parameters);
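
If you are not sure whether a supported GPU is available, you can guard the transfer with the canUseGPU function so that the same code also runs on CPU-only machines, as in this sketch:

% Move the parameters to the GPU only when a supported GPU is available.
if canUseGPU
    parameters = dlupdate(@gpuArray,parameters);
end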

References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the Difficulty of Training Deep Feedforward Neural Networks." In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256. Sardinia, Italy: AISTATS, 2010. https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

[2] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." In 2015 IEEE International Conference on Computer Vision (ICCV), 1026–34. Santiago, Chile: IEEE, 2015. https://doi.org/10.1109/ICCV.2015.123
