# lbfgsState

State of limited-memory BFGS (L-BFGS) solver

Since R2023a

## Description

An `lbfgsState` object stores information about steps in the L-BFGS algorithm.

The L-BFGS algorithm [1] is a quasi-Newton method that approximates the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. Use the L-BFGS algorithm for small networks and data sets that you can process in a single batch.

Use `lbfgsState` objects in conjunction with the `lbfgsupdate` function to train a neural network using the L-BFGS algorithm.
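For example, a typical custom training loop calls `lbfgsupdate` repeatedly, passing the state object from one iteration to the next. Here is a minimal sketch, assuming `net` is an initialized `dlnetwork` object and `lossFcn` is a function handle with the syntax `[loss,gradients] = lossFcn(net)`, as in the Examples section:

```
% Minimal sketch: thread the solver state through repeated updates.
% Assumes net (dlnetwork) and lossFcn are already defined.
solverState = lbfgsState;
for iteration = 1:30
    [net,solverState] = lbfgsupdate(net,lossFcn,solverState);
end
```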

## Creation

### Syntax

```
solverState = lbfgsState
solverState = lbfgsState(Name=Value)
```

### Description

`solverState = lbfgsState` creates an L-BFGS state object with a history size of 10 and an initial inverse Hessian factor of 1.


`solverState = lbfgsState(Name=Value)` sets the `HistorySize` and `InitialInverseHessianFactor` properties using one or more name-value arguments.


## Properties


### L-BFGS State

#### `HistorySize`

Number of state updates to store, specified as a positive integer. The default value is 10. Values between 3 and 20 suit most tasks.

The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.

After creating the `lbfgsState` object, this property is read-only.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`
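To illustrate how the stored history produces a step direction, this sketch implements the standard two-loop recursion from [1]. It is a textbook illustration under assumed names (`g`, `S`, `Y`, `lambda`), not the `lbfgsupdate` implementation:

```
function d = lbfgsDirection(g,S,Y,lambda)
% Illustrative two-loop recursion, not the lbfgsupdate implementation.
% g      - current gradient (column vector)
% S      - cell array of the last m steps, ordered oldest to newest
% Y      - cell array of the corresponding gradient differences
% lambda - inverse Hessian factor, so B^(-1) is approximated by lambda*I
m = numel(S);
alpha = zeros(m,1);
q = g;
for i = m:-1:1                      % first loop: newest to oldest
    rho = 1/(Y{i}'*S{i});
    alpha(i) = rho*(S{i}'*q);
    q = q - alpha(i)*Y{i};
end
r = lambda*q;                       % apply the initial inverse Hessian
for i = 1:m                         % second loop: oldest to newest
    rho = 1/(Y{i}'*S{i});
    beta = rho*(Y{i}'*r);
    r = r + S{i}*(alpha(i) - beta);
end
d = -r;                             % quasi-Newton descent direction
end
```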

#### `InitialInverseHessianFactor`

Initial value that characterizes the approximate inverse Hessian matrix, specified as a positive scalar. The default value is 1.

To save memory, the L-BFGS algorithm does not store and invert the dense Hessian matrix $B$. Instead, the algorithm uses the approximation $B_{k-m}^{-1} \approx \lambda_k I$, where $m$ is the history size, the inverse Hessian factor $\lambda_k$ is a scalar, and $I$ is the identity matrix. The algorithm then stores the scalar inverse Hessian factor only. The algorithm updates the inverse Hessian factor at each step.

The initial inverse Hessian factor is the value of $\lambda_0$.

After creating the `lbfgsState` object, this property is read-only.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

#### `InverseHessianFactor`

Value that characterizes the approximate inverse Hessian matrix, specified as a positive scalar.

To save memory, the L-BFGS algorithm does not store and invert the dense Hessian matrix $B$. Instead, the algorithm uses the approximation $B_{k-m}^{-1} \approx \lambda_k I$, where $m$ is the history size, the inverse Hessian factor $\lambda_k$ is a scalar, and $I$ is the identity matrix. The algorithm then stores the scalar inverse Hessian factor only. The algorithm updates the inverse Hessian factor at each step.

Data Types: `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

#### `InitialGradientsNorm`

Since R2023b

Norm of the initial gradients, specified as a `dlarray` scalar or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `InitialGradientsNorm` is the first value that the `GradientsNorm` property takes. Otherwise, `InitialGradientsNorm` is `[]`.
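For example, you can use `InitialGradientsNorm` to build a relative stopping test in a custom training loop. This is a sketch, assuming a loop like the one in the Examples section:

```
% Sketch: stop when the gradient norm falls four orders of magnitude
% below its initial value.
if solverState.GradientsNorm < 1e-4*solverState.InitialGradientsNorm
    disp("Relative gradient tolerance met; stopping training.")
end
```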

#### `InitialStepSize`

Since R2024b

Initial step size, specified as one of these values:

• `[]` — Do not use an initial step size to determine the initial Hessian approximation.

• `"auto"` — Determine the initial step size automatically. The software uses an initial step size of $‖{s}_{0}{‖}_{\infty }=\frac{1}{2}‖{W}_{0}{‖}_{\infty }+0.1$, where W0 are the initial learnable parameters of the network.

• Positive real scalar — Use the specified value as the initial step size $\|s_0\|_\infty$.

If `InitialStepSize` is `"auto"` or a positive real scalar, then the software approximates the initial inverse Hessian using $\lambda_0 = \frac{\|s_0\|_\infty}{\|\nabla J(W_0)\|_\infty}$, where $\lambda_0$ is the initial inverse Hessian factor and $\nabla J(W_0)$ denotes the gradients of the loss with respect to the initial learnable parameters. For more information, see Limited-Memory BFGS.
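As a worked illustration of these formulas with made-up numbers (not defaults from the software):

```
% Hypothetical values to illustrate the "auto" initial step size.
W0 = [0.4 -1.2 0.8];             % example initial learnable parameters
g0 = [0.05 -0.3 0.1];            % example initial gradients
s0 = 0.5*norm(W0,Inf) + 0.1      % "auto" step size: 0.5*1.2 + 0.1 = 0.7
lambda0 = s0/norm(g0,Inf)        % initial factor: 0.7/0.3 = 2.3333
```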

#### `StepHistory`

Step history, specified as a cell array.

The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.

Data Types: `cell`

#### `GradientsDifferenceHistory`

Gradients difference history, specified as a cell array.

The L-BFGS algorithm uses a history of gradient calculations to approximate the Hessian matrix recursively. For more information, see Limited-Memory BFGS.

Data Types: `cell`

#### `HistoryIndices`

History indices, specified as a row vector.

`HistoryIndices` is a 1-by-`HistorySize` vector, where `StepHistory(i)` and `GradientsDifferenceHistory(i)` correspond to iteration `HistoryIndices(i)`.

Data Types: `double`
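For example, after a few updates you can inspect the state to see which iterations the stored history covers. This sketch assumes `net` and `lossFcn` are defined as in the Examples section:

```
% Sketch: run a few updates, then inspect the stored history.
solverState = lbfgsState(HistorySize=3);
for iteration = 1:5
    [net,solverState] = lbfgsupdate(net,lossFcn,solverState);
end
solverState.HistoryIndices           % iterations the history covers
solverState.StepHistory              % stored steps (cell array)
solverState.InverseHessianFactor     % updated scalar factor
```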

### Iteration Information

#### `Loss`

Loss, specified as a `dlarray` scalar, a numeric scalar, or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `Loss` is the first output of the loss function that you pass to the `lbfgsupdate` function. Otherwise, `Loss` is `[]`.

#### `Gradients`

Gradients, specified as a `dlarray` object, a numeric array, a cell array, a structure, a table, or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `Gradients` is the second output of the loss function that you pass to the `lbfgsupdate` function. Otherwise, `Gradients` is `[]`.

#### `AdditionalLossFunctionOutputs`

Additional loss function outputs, specified as a cell array.

If the state object is the output of the `lbfgsupdate` function, then `AdditionalLossFunctionOutputs` is a cell array containing additional outputs of the loss function that you pass to the `lbfgsupdate` function. Otherwise, `AdditionalLossFunctionOutputs` is a 1-by-0 cell array.

Data Types: `cell`
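For example, a loss function can return outputs beyond the loss and gradients, such as predictions. This sketch assumes a hypothetical three-output loss function `modelLossWithPredictions` and assumes that `lbfgsupdate` accepts a `NumLossFunctionOutputs` name-value argument for such loss functions:

```
% Sketch with a hypothetical loss function that also returns predictions:
% [loss,gradients,Y] = modelLossWithPredictions(net,X,T)
lossFcn = @(net) dlfeval(@modelLossWithPredictions,net,XTrain,TTrain);
[net,solverState] = lbfgsupdate(net,lossFcn,solverState, ...
    NumLossFunctionOutputs=3);       % assumed name-value option
YPred = solverState.AdditionalLossFunctionOutputs{1};
```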

#### `StepNorm`

Norm of the step, specified as a `dlarray` scalar, a numeric scalar, or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `StepNorm` is the norm of the step that the `lbfgsupdate` function calculates. Otherwise, `StepNorm` is `[]`.

#### `GradientsNorm`

Norm of the gradients, specified as a `dlarray` scalar, a numeric scalar, or `[]`.

If the state object is the output of the `lbfgsupdate` function, then `GradientsNorm` is the norm of the second output of the loss function that you pass to the `lbfgsupdate` function. Otherwise, `GradientsNorm` is `[]`.

#### `LineSearchStatus`

Status of the line search algorithm, specified as `""`, `"completed"`, or `"failed"`.

If the state object is the output of the `lbfgsupdate` function, then `LineSearchStatus` is one of these values:

• `"completed"` — The algorithm finds a learning rate that satisfies the `LineSearchMethod` and `MaxNumLineSearchIterations` options that the `lbfgsupdate` function uses.

• `"failed"` — The algorithm fails to find a learning rate that satisfies the `LineSearchMethod` and `MaxNumLineSearchIterations` options that the `lbfgsupdate` function uses.

Otherwise, `LineSearchStatus` is `""`.
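For example, a training loop can use this status to stop early, as the loop in the Examples section does:

```
% Sketch: exit a custom training loop if the line search fails.
if solverState.LineSearchStatus == "failed"
    break
end
```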

#### `LineSearchMethod`

Method the solver uses to find a suitable learning rate, specified as `"weak-wolfe"`, `"strong-wolfe"`, `"backtracking"`, or `""`.

If the state object is the output of the `lbfgsupdate` function, then `LineSearchMethod` is the line search method that the `lbfgsupdate` function uses. Otherwise, `LineSearchMethod` is `""`.

#### `MaxNumLineSearchIterations`

Maximum number of line search iterations, specified as a nonnegative integer.

If the state object is the output of the `lbfgsupdate` function, then `MaxNumLineSearchIterations` is the maximum number of line search iterations that the `lbfgsupdate` function uses. Otherwise, `MaxNumLineSearchIterations` is `0`.

Data Types: `double`

## Examples


### Create L-BFGS Solver State Object

Create an L-BFGS solver state object.

`solverState = lbfgsState`
```
solverState = 
  LBFGSState with properties:

          InverseHessianFactor: 1
                   StepHistory: {}
    GradientsDifferenceHistory: {}
                HistoryIndices: [1x0 double]

   Iteration Information

                          Loss: []
                     Gradients: []
 AdditionalLossFunctionOutputs: {1x0 cell}
                 GradientsNorm: []
                      StepNorm: []
              LineSearchStatus: ""

  Show all properties
```

### Train Network Using L-BFGS Solver

Read the transmission casing data from the CSV file `"transmissionCasingData.csv"`.

```
filename = "transmissionCasingData.csv";
tbl = readtable(filename,TextType="String");
```

Convert the labels for prediction to categorical using the `convertvars` function.

```
labelName = "GearToothCondition";
tbl = convertvars(tbl,labelName,"categorical");
```

To train a network using categorical features, first convert the categorical predictors to categorical variables using the `convertvars` function, specifying a string array that contains the names of all the categorical input variables.

```
categoricalPredictorNames = ["SensorCondition" "ShaftCondition"];
tbl = convertvars(tbl,categoricalPredictorNames,"categorical");
```

Loop over the categorical input variables. For each variable, convert the categorical values to one-hot encoded vectors using the `onehotencode` function.

```
for i = 1:numel(categoricalPredictorNames)
    name = categoricalPredictorNames(i);
    tbl.(name) = onehotencode(tbl.(name),2);
end
```

View the first few rows of the table.

`head(tbl)`
```
    SigMean     SigMedian    SigRMS    SigVar     SigPeak    SigPeak2Peak    SigSkewness    SigKurtosis    SigCrestFactor    SigMAD     SigRangeCumSum    SigCorrDimension    SigApproxEntropy    SigLyapExponent    PeakFreq    HighFreqPower    EnvPower    PeakSpecKurtosis    SensorCondition    ShaftCondition    GearToothCondition
    ________    _________    ______    _______    _______    ____________    ___________    ___________    ______________    _______    ______________    ________________    ________________    _______________    ________    _____________    ________    ________________    _______________    ______________    __________________

    -0.94876    -0.9722      1.3726    0.98387    0.81571    3.6314          -0.041525      2.2666         2.0514            0.8081     28562             1.1429              0.031581            79.931             0           6.75e-06         3.23e-07    162.13              0  1               1  0              No Tooth Fault
    -0.97537    -0.98958     1.3937    0.99105    0.81571    3.6314          -0.023777      2.2598         2.0203            0.81017    29418             1.1362              0.037835            70.325             0           5.08e-08         9.16e-08    226.12              0  1               1  0              No Tooth Fault
     1.0502      1.0267      1.4449    0.98491    2.8157     3.6314          -0.04162       2.2658         1.9487            0.80853    31710             1.1479              0.031565            125.19             0           6.74e-06         2.85e-07    162.13              0  1               0  1              No Tooth Fault
     1.0227      1.0045      1.4288    0.99553    2.8157     3.6314          -0.016356      2.2483         1.9707            0.81324    30984             1.1472              0.032088            112.5              0           4.99e-06         2.4e-07     162.13              0  1               0  1              No Tooth Fault
     1.0123      1.0024      1.4202    0.99233    2.8157     3.6314          -0.014701      2.2542         1.9826            0.81156    30661             1.1469              0.03287             108.86             0           3.62e-06         2.28e-07    230.39              0  1               0  1              No Tooth Fault
     1.0275      1.0102      1.4338    1.0001     2.8157     3.6314          -0.02659       2.2439         1.9638            0.81589    31102             1.0985              0.033427            64.576             0           2.55e-06         1.65e-07    230.39              0  1               0  1              No Tooth Fault
     1.0464      1.0275      1.4477    1.0011     2.8157     3.6314          -0.042849      2.2455         1.9449            0.81595    31665             1.1417              0.034159            98.838             0           1.73e-06         1.55e-07    230.39              0  1               0  1              No Tooth Fault
     1.0459      1.0257      1.4402    0.98047    2.8157     3.6314          -0.035405      2.2757         1.955             0.80583    31554             1.1345              0.0353              44.223             0           1.11e-06         1.39e-07    230.39              0  1               0  1              No Tooth Fault
```

Extract the training data.

```
predictorNames = ["SigMean" "SigMedian" "SigRMS" "SigVar" "SigPeak" "SigPeak2Peak" ...
    "SigSkewness" "SigKurtosis" "SigCrestFactor" "SigMAD" "SigRangeCumSum" ...
    "SigCorrDimension" "SigApproxEntropy" "SigLyapExponent" "PeakFreq" ...
    "HighFreqPower" "EnvPower" "PeakSpecKurtosis" "SensorCondition" "ShaftCondition"];
XTrain = table2array(tbl(:,predictorNames));
numInputFeatures = size(XTrain,2);
```

Extract the targets and convert them to one-hot encoded vectors.

```
TTrain = tbl.(labelName);
TTrain = onehotencode(TTrain,2);
numClasses = size(TTrain,2);
```

Convert the predictors and targets to `dlarray` objects with format `"BC"` (batch, channel).

```
XTrain = dlarray(XTrain,"BC");
TTrain = dlarray(TTrain,"BC");
```

Define the network architecture.

```
numHiddenUnits = 16;

layers = [
    featureInputLayer(numInputFeatures)
    fullyConnectedLayer(numHiddenUnits)
    layerNormalizationLayer
    reluLayer
    fullyConnectedLayer(numClasses)
    softmaxLayer];

net = dlnetwork(layers);
```

Define the `modelLoss` function, listed in the Model Loss Function section of the example. This function takes as input a neural network, input data, and targets. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.

The `lbfgsupdate` function requires a loss function with the syntax `[loss,gradients] = f(net)`. Create a variable that parameterizes the evaluated `modelLoss` function to take a single input argument.

`lossFcn = @(net) dlfeval(@modelLoss,net,XTrain,TTrain);`

Initialize an L-BFGS solver state object with a maximum history size of 3 and an initial inverse Hessian approximation factor of 1.1.

```
solverState = lbfgsState( ...
    HistorySize=3, ...
    InitialInverseHessianFactor=1.1);
```

Train the network for a maximum of 200 iterations. Stop training early when the norm of the gradients or the norm of the step is less than `1e-5`, or when the line search fails. Print the training loss every 10 iterations.

```
maxIterations = 200;
gradientTolerance = 1e-5;
stepTolerance = 1e-5;

iteration = 0;

while iteration < maxIterations
    iteration = iteration + 1;
    [net, solverState] = lbfgsupdate(net,lossFcn,solverState);

    if iteration==1 || mod(iteration,10)==0
        fprintf("Iteration %d: Loss: %d\n",iteration,solverState.Loss);
    end

    if solverState.GradientsNorm < gradientTolerance || ...
            solverState.StepNorm < stepTolerance || ...
            solverState.LineSearchStatus == "failed"
        break
    end
end
```
```
Iteration 1: Loss: 9.343236e-01
Iteration 10: Loss: 4.721475e-01
Iteration 20: Loss: 4.678575e-01
Iteration 30: Loss: 4.666964e-01
Iteration 40: Loss: 4.665921e-01
Iteration 50: Loss: 4.663871e-01
Iteration 60: Loss: 4.662519e-01
Iteration 70: Loss: 4.660451e-01
Iteration 80: Loss: 4.645303e-01
Iteration 90: Loss: 4.591753e-01
Iteration 100: Loss: 4.562556e-01
Iteration 110: Loss: 4.531167e-01
Iteration 120: Loss: 4.489444e-01
Iteration 130: Loss: 4.392228e-01
Iteration 140: Loss: 4.347853e-01
Iteration 150: Loss: 4.341757e-01
Iteration 160: Loss: 4.325102e-01
Iteration 170: Loss: 4.321948e-01
Iteration 180: Loss: 4.318990e-01
Iteration 190: Loss: 4.313784e-01
Iteration 200: Loss: 4.311314e-01
```

#### Model Loss Function

The `modelLoss` function takes as input a neural network `net`, input data `X`, and targets `T`. The function returns the loss and the gradients of the loss with respect to the network learnable parameters.

```
function [loss, gradients] = modelLoss(net, X, T)

Y = forward(net,X);
loss = crossentropy(Y,T);
gradients = dlgradient(loss,net.Learnables);

end
```


## References

[1] Liu, Dong C., and Jorge Nocedal. "On the Limited Memory BFGS Method for Large Scale Optimization." *Mathematical Programming* 45, no. 1 (August 1989): 503–528. https://doi.org/10.1007/BF01589116.

## Version History

Introduced in R2023a
