## Initialize Learnable Parameters for Model Functions

When training a network using layers, layer graphs, or `dlnetwork` objects, the software automatically initializes the learnable parameters according to the layer initialization properties. When defining a deep learning model as a function, you must initialize the learnable parameters manually.

The initialization of learnable parameters (for example, weights and biases) can have a big impact on how quickly a deep learning model converges.

Tip

This page explains how to initialize learnable parameters for deep learning models defined as functions in a custom training loop. To specify the learnable parameter initialization for deep learning layers, use the corresponding layer properties. For example, to set the weights initializer of a `convolution2dLayer` object, use the `WeightsInitializer` property.

### Default Layer Initializations

This table shows the default initializations for the learnable parameters for each layer, and provides links that show how to initialize learnable parameters for model functions using the same initialization.

| Layer | Learnable Parameter | Default Initialization |
| --- | --- | --- |
| `convolution2dLayer` | Weights | Glorot Initialization |
| | Bias | Zeros Initialization |
| `convolution3dLayer` | Weights | Glorot Initialization |
| | Bias | Zeros Initialization |
| `groupedConvolution2dLayer` | Weights | Glorot Initialization |
| | Bias | Zeros Initialization |
| `transposedConv2dLayer` | Weights | Glorot Initialization |
| | Bias | Zeros Initialization |
| `transposedConv3dLayer` | Weights | Glorot Initialization |
| | Bias | Zeros Initialization |
| `fullyConnectedLayer` | Weights | Glorot Initialization |
| | Bias | Zeros Initialization |
| `batchNormalizationLayer` | Offset | Zeros Initialization |
| | Scale | Ones Initialization |
| `lstmLayer` | Input weights | Glorot Initialization |
| | Recurrent weights | Orthogonal Initialization |
| | Bias | Unit Forget Gate Initialization |
| `gruLayer` | Input weights | Glorot Initialization |
| | Recurrent weights | Orthogonal Initialization |
| | Bias | Zeros Initialization |
| `wordEmbeddingLayer` | Weights | Gaussian Initialization, with mean 0 and standard deviation 0.01 |

### Learnable Parameter Sizes

When initializing learnable parameters for model functions, you must specify parameters of the correct size. The size of the learnable parameters depends on the type of deep learning operation.

| Operation | Learnable Parameter | Size |
| --- | --- | --- |
| `batchnorm` | Offset | `[numChannels 1]`, where `numChannels` is the number of input channels. |
| | Scale | `[numChannels 1]`, where `numChannels` is the number of input channels. |
| `dlconv` | Weights | `[filterSize numChannels numFilters]`, where `filterSize` is a 1-by-`K` vector specifying the filter size, `numChannels` is the number of input channels, `numFilters` is the number of filters, and `K` is the number of spatial dimensions. |
| | Bias | One of the following: `[numFilters 1]`, where `numFilters` is the number of filters, or `[1 1]`. |
| `dlconv` (grouped) | Weights | `[filterSize numChannelsPerGroup numFiltersPerGroup numGroups]`, where `filterSize` is a 1-by-`K` vector specifying the filter size, `numChannelsPerGroup` is the number of input channels for each group, `numFiltersPerGroup` is the number of filters for each group, `numGroups` is the number of groups, and `K` is the number of spatial dimensions. |
| | Bias | One of the following: `[numFiltersPerGroup 1]`, where `numFiltersPerGroup` is the number of filters for each group, or `[1 1]`. |
| `dltranspconv` | Weights | `[filterSize numFilters numChannels]`, where `filterSize` is a 1-by-`K` vector specifying the filter size, `numChannels` is the number of input channels, `numFilters` is the number of filters, and `K` is the number of spatial dimensions. |
| | Bias | One of the following: `[numFilters 1]`, where `numFilters` is the number of filters, or `[1 1]`. |
| `dltranspconv` (grouped) | Weights | `[filterSize numFiltersPerGroup numChannelsPerGroup numGroups]`, where `filterSize` is a 1-by-`K` vector specifying the filter size, `numChannelsPerGroup` is the number of input channels for each group, `numFiltersPerGroup` is the number of filters for each group, `numGroups` is the number of groups, and `K` is the number of spatial dimensions. |
| | Bias | One of the following: `[numFiltersPerGroup 1]`, where `numFiltersPerGroup` is the number of filters for each group, or `[1 1]`. |
| `fullyconnect` | Weights | `[outputSize inputSize]`, where `outputSize` and `inputSize` are the numbers of output and input channels, respectively. |
| | Bias | `[outputSize 1]`, where `outputSize` is the number of output channels. |
| `gru` | Input weights | `[3*numHiddenUnits inputSize]`, where `numHiddenUnits` is the number of hidden units of the operation and `inputSize` is the number of input channels. |
| | Recurrent weights | `[3*numHiddenUnits numHiddenUnits]`, where `numHiddenUnits` is the number of hidden units of the operation. |
| | Bias | `[3*numHiddenUnits 1]`, where `numHiddenUnits` is the number of hidden units of the operation. |
| `lstm` | Input weights | `[4*numHiddenUnits inputSize]`, where `numHiddenUnits` is the number of hidden units of the operation and `inputSize` is the number of input channels. |
| | Recurrent weights | `[4*numHiddenUnits numHiddenUnits]`, where `numHiddenUnits` is the number of hidden units of the operation. |
| | Bias | `[4*numHiddenUnits 1]`, where `numHiddenUnits` is the number of hidden units of the operation. |
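As a quick illustration of these size rules, the following sketch builds the size vectors for a grouped convolution and for an LSTM operation. The dimensions are hypothetical values chosen only for this example, and the variable names (`szConvWeights` and so on) are illustrative rather than part of any API.

```matlab
% Hypothetical grouped 2-D convolution: 2 groups, each mapping
% 4 input channels to 8 filters of size 3-by-3.
filterSize = [3 3];
numChannelsPerGroup = 4;
numFiltersPerGroup = 8;
numGroups = 2;

% Weights: [filterSize numChannelsPerGroup numFiltersPerGroup numGroups]
szConvWeights = [filterSize numChannelsPerGroup numFiltersPerGroup numGroups];

% Hypothetical LSTM operation: 100 hidden units, 12 input channels.
numHiddenUnits = 100;
inputSize = 12;

szInputWeights = [4*numHiddenUnits inputSize];
szRecurrentWeights = [4*numHiddenUnits numHiddenUnits];
szBias = [4*numHiddenUnits 1];

disp(szConvWeights)       % 3 3 4 8 2
disp(szInputWeights)      % 400 12
disp(szRecurrentWeights)  % 400 100
disp(szBias)              % 400 1
```

Each of these vectors can be passed as the `sz` argument of an initializer function.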

### Glorot Initialization

The Glorot (also known as Xavier) initializer [1] samples weights from the uniform distribution with bounds $\left[-\sqrt{\frac{6}{N_o+N_i}},\sqrt{\frac{6}{N_o+N_i}}\right]$, where the values of $N_o$ and $N_i$ depend on the type of deep learning operation:

| Operation | Learnable Parameter | $N_o$ | $N_i$ |
| --- | --- | --- | --- |
| `dlconv` | Weights | `prod(filterSize)*numFilters`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numFilters` is the number of filters, and `K` is the number of spatial dimensions. | `prod(filterSize)*numChannels`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numChannels` is the number of input channels, and `K` is the number of spatial dimensions. |
| `dlconv` (grouped) | Weights | `prod(filterSize)*numFiltersPerGroup`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numFiltersPerGroup` is the number of filters for each group, and `K` is the number of spatial dimensions. | `prod(filterSize)*numChannelsPerGroup`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numChannelsPerGroup` is the number of input channels for each group, and `K` is the number of spatial dimensions. |
| `dltranspconv` | Weights | `prod(filterSize)*numFilters`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numFilters` is the number of filters, and `K` is the number of spatial dimensions. | `prod(filterSize)*numChannels`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numChannels` is the number of input channels, and `K` is the number of spatial dimensions. |
| `dltranspconv` (grouped) | Weights | `prod(filterSize)*numFiltersPerGroup`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numFiltersPerGroup` is the number of filters for each group, and `K` is the number of spatial dimensions. | `prod(filterSize)*numChannelsPerGroup`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numChannelsPerGroup` is the number of input channels for each group, and `K` is the number of spatial dimensions. |
| `fullyconnect` | Weights | The number of output channels of the operation. | The number of input channels of the operation. |
| `gru` | Input weights | `3*numHiddenUnits`, where `numHiddenUnits` is the number of hidden units of the operation. | The number of input channels of the operation. |
| | Recurrent weights | `3*numHiddenUnits`, where `numHiddenUnits` is the number of hidden units of the operation. | The number of hidden units of the operation. |
| `lstm` | Input weights | `4*numHiddenUnits`, where `numHiddenUnits` is the number of hidden units of the operation. | The number of input channels of the operation. |
| | Recurrent weights | `4*numHiddenUnits`, where `numHiddenUnits` is the number of hidden units of the operation. | The number of hidden units of the operation. |

To initialize learnable parameters using the Glorot initializer easily, you can define a custom function. The function `initializeGlorot` takes as input the size of the learnable parameters `sz` and the values $N_o$ and $N_i$ (`numOut` and `numIn`, respectively), and returns the sampled weights as a `dlarray` object with underlying type `'single'`.

```
function weights = initializeGlorot(sz,numOut,numIn)

% Sample uniformly from [-1, 1], then scale to the Glorot bound.
Z = 2*rand(sz,'single') - 1;
bound = sqrt(6 / (numIn + numOut));

weights = bound * Z;
weights = dlarray(weights);

end
```

#### Example

Initialize the weights for a convolutional operation with 128 filters of size 5-by-5 and 3 input channels.

```
filterSize = [5 5];
numChannels = 3;
numFilters = 128;

sz = [filterSize numChannels numFilters];
numOut = prod(filterSize) * numFilters;
numIn = prod(filterSize) * numChannels;

parameters.conv.Weights = initializeGlorot(sz,numOut,numIn);
```
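As a sanity check on the bound formula, the following sketch computes $N_o$, $N_i$, and the resulting bound for the convolution described above (128 filters of size 5-by-5, 3 input channels). It samples the weights inline on a plain `single` array instead of calling `initializeGlorot`, and confirms that no sample exceeds the bound. The name `maxAbs` is illustrative only.

```matlab
filterSize = [5 5];
numChannels = 3;
numFilters = 128;
sz = [filterSize numChannels numFilters];

numOut = prod(filterSize) * numFilters;    % 25*128 = 3200
numIn = prod(filterSize) * numChannels;    % 25*3 = 75

% Glorot bound sqrt(6/(numIn + numOut)), approximately 0.0428 here.
bound = sqrt(6 / (numIn + numOut));

% Same sampling scheme as the Glorot initializer, without the dlarray wrapper.
weights = bound * (2*rand(sz,'single') - 1);

% Largest magnitude among the samples; it should not exceed the bound
% (up to single-precision rounding).
maxAbs = max(abs(weights(:)));
fprintf('bound = %.4f, largest |weight| = %.4f\n',bound,maxAbs);
```

Because $N_o$ is much larger than $N_i$ for this operation, the bound is dominated by the number of filters.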

### He Initialization

The He initializer [2] samples weights from the normal distribution with zero mean and variance $\frac{2}{N_i}$, where the value $N_i$ depends on the type of deep learning operation:

| Operation | Learnable Parameter | $N_i$ |
| --- | --- | --- |
| `dlconv` | Weights | `prod(filterSize)*numChannelsPerGroup`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numChannelsPerGroup` is the number of input channels for each group, and `K` is the number of spatial dimensions. |
| `dltranspconv` | Weights | `prod(filterSize)*numChannelsPerGroup`, where `filterSize` is a 1-by-`K` vector containing the filter size, `numChannelsPerGroup` is the number of input channels for each group, and `K` is the number of spatial dimensions. |
| `fullyconnect` | Weights | The number of input channels of the operation. |
| `gru` | Input weights | The number of input channels of the operation. |
| | Recurrent weights | The number of hidden units of the operation. |
| `lstm` | Input weights | The number of input channels of the operation. |
| | Recurrent weights | The number of hidden units of the operation. |

To initialize learnable parameters using the He initializer easily, you can define a custom function. The function `initializeHe` takes as input the size of the learnable parameters `sz` and the value $N_i$, and returns the sampled weights as a `dlarray` object with underlying type `'single'`.

```
function weights = initializeHe(sz,numIn)

% Scale unit normal samples to standard deviation sqrt(2/numIn).
weights = randn(sz,'single') * sqrt(2/numIn);
weights = dlarray(weights);

end
```

#### Example

Initialize the weights for a convolutional operation with 128 filters of size 5-by-5 and 3 input channels.

```
filterSize = [5 5];
numChannels = 3;
numFilters = 128;

sz = [filterSize numChannels numFilters];
numIn = prod(filterSize) * numChannels;

parameters.conv.Weights = initializeHe(sz,numIn);
```
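To see the He scaling at work, the following sketch samples weights for a hypothetical fully connected operation (1000 output channels, 500 input channels, values chosen only for illustration) and compares the sample standard deviation with the target $\sqrt{2/N_i}$. The sampling is done inline on a plain `single` array rather than through `initializeHe`, and the names `targetStd` and `sampleStd` are illustrative.

```matlab
% Hypothetical fully connected operation.
outputSize = 1000;
inputSize = 500;

sz = [outputSize inputSize];
numIn = inputSize;                 % N_i for fullyconnect weights

% He sampling: zero mean, variance 2/numIn.
weights = randn(sz,'single') * sqrt(2/numIn);

targetStd = sqrt(2/numIn);
sampleStd = std(double(weights(:)));
sampleMean = mean(double(weights(:)));

fprintf('target std = %.4f, sample std = %.4f, sample mean = %.5f\n', ...
    targetStd,sampleStd,sampleMean);
```

With 500,000 samples, the sample standard deviation should agree with the target to within a fraction of a percent.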

### Gaussian Initialization

The Gaussian initializer samples weights from a normal distribution.

To initialize learnable parameters using the Gaussian initializer easily, you can define a custom function. The function `initializeGaussian` takes as input the size of the learnable parameters `sz`, the distribution mean `mu`, and the distribution standard deviation `sigma`, and returns the sampled weights as a `dlarray` object with underlying type `'single'`.

```
function weights = initializeGaussian(sz,mu,sigma)

% Shift and scale unit normal samples.
weights = randn(sz,'single')*sigma + mu;
weights = dlarray(weights);

end
```

#### Example

Initialize the weights for an embedding operation with dimension 300 and vocabulary size 5000 using the Gaussian initializer with mean 0 and standard deviation 0.01.

```
embeddingDimension = 300;
vocabularySize = 5000;
mu = 0;
sigma = 0.01;

sz = [embeddingDimension vocabularySize];

parameters.emb.Weights = initializeGaussian(sz,mu,sigma);
```

### Uniform Initialization

The uniform initializer samples weights from a uniform distribution.

To initialize learnable parameters using the uniform initializer easily, you can define a custom function. The function `initializeUniform` takes as input the size of the learnable parameters `sz` and the distribution bound `bound`, and returns the sampled weights as a `dlarray` object with underlying type `'single'`.

```
function parameter = initializeUniform(sz,bound)

% Sample uniformly from [-1, 1], then scale to [-bound, bound].
Z = 2*rand(sz,'single') - 1;
parameter = bound * Z;
parameter = dlarray(parameter);

end
```

#### Example

Initialize the weights for an attention mechanism with size 100-by-100 and bound 0.1 using the uniform initializer.

```
sz = [100 100];
bound = 0.1;

parameters.attention.Weights = initializeUniform(sz,bound);
```

### Orthogonal Initialization

The orthogonal initializer returns the orthogonal matrix $Q$ given by the QR decomposition $Z = QR$, where $Z$ is sampled from a unit normal distribution and the size of $Z$ matches the size of the learnable parameter.

To initialize learnable parameters using the orthogonal initializer easily, you can define a custom function. The function `initializeOrthogonal` takes as input the size of the learnable parameters `sz` and returns the orthogonal matrix as a `dlarray` object with underlying type `'single'`.

```
function parameter = initializeOrthogonal(sz)

Z = randn(sz,'single');
[Q,R] = qr(Z,0);

% Flip the sign of each column of Q whose diagonal entry of R is negative.
D = diag(R);
Q = Q * diag(D ./ abs(D));

parameter = dlarray(Q);

end
```

#### Example

Initialize the recurrent weights for an LSTM operation with 100 hidden units using the orthogonal initializer.

```
numHiddenUnits = 100;

sz = [4*numHiddenUnits numHiddenUnits];

parameters.lstm.RecurrentWeights = initializeOrthogonal(sz);
```
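For a non-square size such as the 400-by-100 recurrent weights above, the economy-size QR decomposition yields a matrix whose columns are orthonormal, so $Q^\top Q$ is the identity. The following sketch repeats the steps of the orthogonal initializer inline on a plain `single` array (without the `dlarray` wrapper) and measures how far $Q^\top Q$ is from the identity. The name `err` is illustrative.

```matlab
numHiddenUnits = 100;
sz = [4*numHiddenUnits numHiddenUnits];

% Same steps as the orthogonal initializer, kept as a numeric array.
Z = randn(sz,'single');
[Q,R] = qr(Z,0);
D = diag(R);
Q = Q * diag(D ./ abs(D));

% Deviation of Q'*Q from the identity; this should be on the order of
% single-precision rounding error.
err = max(max(abs(Q'*Q - eye(numHiddenUnits,'single'))));
fprintf('size(Q) = [%d %d], max deviation from identity = %g\n', ...
    size(Q,1),size(Q,2),err);
```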

### Unit Forget Gate Initialization

The unit forget gate initializer initializes the bias for an LSTM operation such that the forget gate component of the bias is ones and the remaining entries are zeros.

To initialize learnable parameters using the unit forget gate initializer easily, you can define a custom function. The function `initializeUnitForgetGate` takes as input the number of hidden units in the LSTM operation and returns the bias as a `dlarray` object with underlying type `'single'`.

```
function bias = initializeUnitForgetGate(numHiddenUnits)

bias = zeros(4*numHiddenUnits,1,'single');

% The forget gate occupies the second block of numHiddenUnits entries.
idx = numHiddenUnits+1:2*numHiddenUnits;
bias(idx) = 1;

bias = dlarray(bias);

end
```

#### Example

Initialize the bias of an LSTM operation with 100 hidden units using the unit forget gate initializer.

```
numHiddenUnits = 100;

parameters.lstm.Bias = initializeUnitForgetGate(numHiddenUnits);
```
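To make the layout of the bias vector visible, the following sketch builds the unit forget gate bias for a deliberately tiny, hypothetical LSTM with 3 hidden units and reshapes it so that each column holds one block of `numHiddenUnits` entries. It assumes the component ordering documented for `lstmLayer` (input gate, forget gate, cell candidate, output gate) and uses a plain `single` array; the name `gates` is illustrative.

```matlab
numHiddenUnits = 3;   % hypothetical, small enough to display

% Same steps as the unit forget gate initializer, without the dlarray wrapper.
bias = zeros(4*numHiddenUnits,1,'single');
idx = numHiddenUnits+1:2*numHiddenUnits;
bias(idx) = 1;

% One column per block of the concatenated bias. Under the assumed ordering:
% column 1 = input gate, column 2 = forget gate,
% column 3 = cell candidate, column 4 = output gate.
gates = reshape(bias,numHiddenUnits,4);
disp(gates)
```

Only the second column contains ones; every other entry is zero.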

### Ones Initialization

To initialize learnable parameters with ones easily, you can define a custom function. The function `initializeOnes` takes as input the size of the learnable parameters `sz` and returns the parameters as a `dlarray` object with underlying type `'single'`.

```
function parameter = initializeOnes(sz)

parameter = ones(sz,'single');
parameter = dlarray(parameter);

end
```

#### Example

Initialize the scale for a batch normalization operation with 128 input channels with ones.

```
numChannels = 128;

sz = [numChannels 1];

parameters.bn.Scale = initializeOnes(sz);
```

### Zeros Initialization

To initialize learnable parameters with zeros easily, you can define a custom function. The function `initializeZeros` takes as input the size of the learnable parameters `sz` and returns the parameters as a `dlarray` object with underlying type `'single'`.

```
function parameter = initializeZeros(sz)

parameter = zeros(sz,'single');
parameter = dlarray(parameter);

end
```

#### Example

Initialize the offset for a batch normalization operation with 128 input channels with zeros.

```
numChannels = 128;

sz = [numChannels 1];

parameters.bn.Offset = initializeZeros(sz);
```
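One way to see why ones and zeros are natural starting values for the batch normalization scale and offset: with these values, the per-channel scale-and-shift step leaves already-normalized data unchanged. The following sketch shows this on a plain `single` array with channels along the first dimension. It is a simplified picture of the scale-and-shift step only, not a call to `batchnorm`, and the sizes and names (`Xnorm`, `isIdentity`) are hypothetical.

```matlab
numChannels = 4;        % hypothetical sizes chosen for illustration
numObservations = 6;

% Stand-in for data that has already been normalized per channel.
Xnorm = randn(numChannels,numObservations,'single');

% Scale of ones and offset of zeros, sized [numChannels 1].
scale = ones([numChannels 1],'single');
offset = zeros([numChannels 1],'single');

% Per-channel scale and shift (implicit expansion along the second dimension).
Y = scale .* Xnorm + offset;

isIdentity = isequal(Y,Xnorm);
disp(isIdentity)
```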

## References

[1] Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249-256. 2010.

[2] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." In Proceedings of the IEEE international conference on computer vision, pp. 1026-1034. 2015.