The linear networks discussed in this section are similar to the perceptron, but their transfer function is linear rather than hard-limiting. This allows their outputs to take on any value, whereas the perceptron output is limited to either 0 or 1. Linear networks, like the perceptron, can only solve linearly separable problems.

Here you design a linear network that, when presented with a set of given input vectors, produces outputs of corresponding target vectors. For each input vector, you can calculate the network's output vector. The difference between an output vector and its target vector is the error. You would like to find values for the network weights and biases such that the sum of the squares of the errors is minimized or below a specific value. This problem is manageable because linear systems have a single error minimum. In most cases, you can calculate a linear network directly, such that its error is a minimum for the given input vectors and target vectors. In other cases, numerical problems prohibit direct calculation. Fortunately, you can always train the network to have a minimum error by using the least mean squares (Widrow-Hoff) algorithm.

This section introduces `linearlayer`, a function that creates a linear layer, and `newlind`, a function that designs a linear layer for a specific purpose.

A linear neuron with *R* inputs is shown below.

This network has the same basic structure as the perceptron. The only difference is that the linear neuron uses a linear transfer function, `purelin`.

The linear transfer function calculates the neuron's output by simply returning the value passed to it.

$$a=purelin(n)=purelin(Wp+b)=Wp+b$$

This neuron can be trained to learn an affine function of its inputs, or to find a linear approximation to a nonlinear function. A linear network cannot, of course, be made to perform a nonlinear computation.

The linear network
shown below has one layer of *S* neurons connected
to *R* inputs through a matrix of weights **W**.

Note that the figure on the right defines an *S*-length
output vector **a**.

A single-layer linear network is shown. However, this network is just as capable as multilayer linear networks. For every multilayer linear network, there is an equivalent single-layer linear network.
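The equivalence follows from expanding the two-layer expression: W2(W1 p + b1) + b2 = (W2 W1) p + (W2 b1 + b2). The following is a minimal numerical check in base MATLAB. The layer sizes and all weight, bias, and input values are hypothetical and chosen only for illustration.

```matlab
% Hypothetical two-layer linear network: 3 inputs -> 2 neurons -> 1 neuron.
W1 = [1 -2 0.5; 0 3 1];   b1 = [0.2; -1];
W2 = [2 -1];              b2 = 0.7;

% Equivalent single layer, from expanding W2*(W1*p + b1) + b2.
Weq = W2*W1;
beq = W2*b1 + b2;

p = [1; 2; 3];                      % arbitrary test input
aTwoLayer = W2*(W1*p + b1) + b2;    % purelin is the identity, so it is omitted
aOneLayer = Weq*p + beq;
maxDiff = max(abs(aTwoLayer - aOneLayer));
fprintf('two-layer: %g, single-layer: %g\n', aTwoLayer, aOneLayer);
```

Both expressions give the same output for any input, which is why adding linear layers does not add representational power.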

Consider a single linear neuron with two inputs. The following figure shows the diagram for this network.

The weight matrix **W** in this
case has only one row. The network output is

$$a=purelin(n)=purelin(Wp+b)=Wp+b$$

or

$$a={w}_{1,1}{p}_{1}+{w}_{1,2}{p}_{2}+b$$

Like the perceptron, the linear network has a *decision boundary* that is determined by the input vectors for which the net input *n* is zero. For *n* = 0 the equation **Wp** + *b* = 0 specifies such a decision boundary, as shown below (adapted with thanks from [HDB96]).

Input vectors in the upper right gray area lead to an output greater than 0. Input vectors in the lower left white area lead to an output less than 0. Thus, the linear network can be used to classify objects into two categories. However, it can classify in this way only if the objects are linearly separable. Thus, the linear network has the same limitation as the perceptron.

You can create this network using `linearlayer`, and configure its dimensions with two values so the input has two elements and the output has one.

net = linearlayer;
net = configure(net,[0;0],0);

The network weights and biases are set to zero by default. You can see the current values with the commands

W = net.IW{1,1}
W =
     0     0

and

b = net.b{1}
b =
     0

However, you can give the weights any values that you want, such as 2 and 3, respectively, with

net.IW{1,1} = [2 3];
W = net.IW{1,1}
W =
     2     3

You can set and check the bias in the same way.

net.b{1} = [-4];
b = net.b{1}
b =
    -4

You can simulate the linear network for a particular input vector. Try

p = [5;6];

You can find the network output by calling the network directly, which is equivalent to using the function `sim` as `sim(net,p)`.

a = net(p)
a =
    24
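The value 24 can be checked by hand from *a* = **Wp** + *b*. The sketch below repeats that arithmetic in base MATLAB, without any toolbox objects, and also evaluates one point on the decision boundary. The variable names `isPositiveSide`, `pOnBoundary`, and `nBoundary` are illustrative only.

```matlab
W = [2 3];          % weights set above
b = -4;             % bias set above
p = [5; 6];         % input vector

n = W*p + b;        % net input: 2*5 + 3*6 - 4
a = n;              % purelin returns its argument unchanged
isPositiveSide = a > 0;      % true when p lies on the positive side of the boundary

pOnBoundary = [2; 0];        % a point satisfying 2*p1 + 3*p2 - 4 = 0
nBoundary = W*pOnBoundary + b;
fprintf('a = %g, net input on boundary = %g\n', a, nBoundary);
```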

To summarize, you can create a linear network with `linearlayer`, adjust its elements as you want, and simulate it with `sim`.

Like the perceptron learning rule, the least mean square error (LMS) algorithm is an example of supervised training, in which the learning rule is provided with a set of examples of desired network behavior:

$$\left\{{p}_{1},{t}_{1}\right\},\left\{{p}_{2},{t}_{2}\right\},\dots \left\{{p}_{Q},{t}_{Q}\right\}$$

Here **p**_q is an input to the network, and **t**_q is the corresponding target output. As each input is applied to the network, the network output is compared to the target. The error is calculated as the difference between the target output and the network output. The goal is to minimize the average of the squares of these errors.

$$mse=\frac{1}{Q}{\displaystyle \sum _{k=1}^{Q}e{(k)}^{2}}=\frac{1}{Q}{\displaystyle \sum _{k=1}^{Q}{(t(k)-a(k))}^{2}}$$

The LMS algorithm adjusts the weights and biases of the linear network so as to minimize this mean square error.

Fortunately, the mean square error performance index for the linear network is a quadratic function. Thus, the performance index will either have one global minimum, a weak minimum, or no minimum, depending on the characteristics of the input vectors. Specifically, the characteristics of the input vectors determine whether or not a unique solution exists.

You can find more about this topic in Chapter 10 of [HDB96].

Unlike most other network architectures, linear networks can be designed directly if input/target vector pairs are known. You can obtain specific network values for weights and biases to minimize the mean square error by using the function `newlind`.

Suppose that the inputs and targets are

P = [1 2 3];
T = [2.0 4.1 5.9];

Now you can design a network.

net = newlind(P,T);

You can simulate the network behavior to check that the design was done properly.

Y = net(P)
Y =
    2.0500    4.0000    5.9500

Note that the network outputs are quite close to the desired targets.
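A direct design of this kind amounts to a linear least-squares fit. The following is a sketch in base MATLAB of one way to obtain such a fit. It is not necessarily how `newlind` is implemented. The names `X` and `theta` are introduced here for illustration.

```matlab
P = [1 2 3];
T = [2.0 4.1 5.9];

X = [P; ones(1, size(P, 2))];   % append a row of ones so the bias is fitted too
theta = T / X;                  % least-squares solution of theta*X = T
W = theta(1:end-1);             % fitted weight
b = theta(end);                 % fitted bias

Y = W*P + b;                    % outputs of the fitted linear neuron
fprintf('W = %.4f, b = %.4f\n', W, b);
disp(Y);
```

For this data the fit gives *W* = 1.95 and *b* = 0.1, which yields the outputs 2.05, 4.00, and 5.95 shown above.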

You might try `demolin1`. It shows error surfaces for a particular problem, illustrates the design, and plots the designed solution.

You can also use the function `newlind` to design linear networks having delays in the input. Such networks are discussed in Linear Networks with Delays. First, however, delays must be discussed.

You need a new component, the tapped delay line, to make full use of
the linear network. Such a delay line is shown below. There the input
signal enters from the left and passes through *N*-1
delays. The output of the tapped delay line (TDL) is an *N*-dimensional
vector, made up of the input signal at the current time, the previous
input signal, etc.

You can combine a tapped delay line with a linear network to create the *linear filter* shown.

The output of the filter is given by

$$a(k)=purelin(Wp+b)={\displaystyle \sum _{i=1}^{R}{w}_{1,i}p(k-i+1)+b}$$

The network shown is referred to in the digital signal processing field as a finite impulse response (FIR) filter [WiSt85]. Look at the code used to generate and simulate such a network.
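Before turning to the toolbox code, the filter equation can be illustrated with a short base MATLAB sketch. The input sequence, weights, and bias below are hypothetical, and the delay states are assumed to be zero before the sequence starts. The result is compared with the built-in `filter` function, which computes the same FIR response.

```matlab
p = [1 2 3 4];            % hypothetical input sequence
w = [0.5 0.3 0.2];        % hypothetical weights w(1,1), w(1,2), w(1,3)
b = 0.1;                  % hypothetical bias
N = numel(w);             % number of taps
pPast = zeros(1, N-1);    % delay states, assumed zero initially

a = zeros(size(p));
for k = 1:numel(p)
    tdl = [p(k), pPast];      % TDL output: current input, then older inputs
    a(k) = w*tdl.' + b;       % linear neuron applied to the TDL output
    pPast = tdl(1:N-1);       % shift the delay line by one step
end

aCheck = filter(w, 1, p) + b; % same response from the built-in FIR filter
disp(a);
```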

Suppose that you want a linear layer that outputs the sequence `T`, given the sequence `P` and two initial input delay states `Pi`.

P = {1 2 1 3 3 2};
Pi = {1 3};
T = {5 6 4 20 7 8};

You can use `newlind` to design a network with delays to give the appropriate outputs for the inputs. The delay initial outputs are supplied as a third argument, as shown below.

net = newlind(P,T,Pi);

You can obtain the output of the designed network with

Y = net(P,Pi)

to give

Y =
    [2.7297]    [10.5405]    [5.0090]    [14.9550]    [10.7838]    [5.9820]

As you can see, the network outputs are not equal to the targets, because three weights and one bias cannot fit six targets exactly. The design is nevertheless the one that minimizes the mean square error.
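The same outputs can be reproduced with an ordinary least-squares fit over the delayed inputs. The sketch below uses base MATLAB only and is not necessarily what `newlind` does internally. It assumes the two initial delay states are listed oldest first, so that the input history before the first time step is 1, then 3. The names `pInit`, `pAll`, `X`, and `theta` are illustrative.

```matlab
p = [1 2 1 3 3 2];       % input sequence
pInit = [1 3];           % initial delay states, assumed oldest first
t = [5 6 4 20 7 8];      % target sequence

pAll = [pInit, p];       % full input history
nd = numel(pInit);       % number of delays
Q = numel(p);            % number of time steps

% Rows of X: p(k), p(k-1), p(k-2), and a row of ones for the bias.
X = ones(nd + 2, Q);
for d = 0:nd
    X(d+1, :) = pAll((1:Q) + nd - d);
end

theta = t / X;           % least-squares weights and bias
y = theta*X;             % outputs of the fitted filter
disp(y);
```

Under that ordering assumption, `y` matches the six values displayed above to the precision shown.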

The LMS algorithm, or Widrow-Hoff learning algorithm, is based on an approximate steepest descent procedure. Here again, linear networks are trained on examples of correct behavior.

Widrow and Hoff had the insight that they could estimate the
mean square error by using the squared error at each iteration. If
you take the partial derivative of the squared error with respect
to the weights and biases at the *k*th iteration,
you have

$$\frac{\partial {e}^{2}(k)}{\partial {w}_{1,j}}=2e(k)\frac{\partial e(k)}{\partial {w}_{1,j}}$$

for *j* = 1,2,…,*R* and

$$\frac{\partial {e}^{2}(k)}{\partial b}=2e(k)\frac{\partial e(k)}{\partial b}$$

Next look at the partial derivative of the error with respect to the weights.

$$\frac{\partial e(k)}{\partial {w}_{1,j}}=\frac{\partial [t(k)-a(k)]}{\partial {w}_{1,j}}=\frac{\partial}{\partial {w}_{1,j}}[t(k)-(Wp(k)+b)]$$

or

$$\frac{\partial e(k)}{\partial {w}_{1,j}}=\frac{\partial}{\partial {w}_{1,j}}\left[t(k)-\left({\displaystyle \sum _{i=1}^{R}{w}_{1,i}{p}_{i}(k)+b}\right)\right]$$

Here *p_i*(*k*) is the *i*th element of the input vector at the *k*th iteration.

This can be simplified to

$$\frac{\partial e(k)}{\partial {w}_{1,j}}=-{p}_{j}(k)$$

and

$$\frac{\partial e(k)}{\partial b}=-1$$

Finally, the change to the weight matrix and the bias will be

2α*e*(*k*)**p**(*k*)

and

2α*e*(*k*)

These two equations form the basis of the Widrow-Hoff (LMS) learning algorithm.

These results can be extended to the case of multiple neurons, and written in matrix form as

$$\begin{array}{l}W(k+1)=W(k)+2\alpha e(k){p}^{T}(k)\\ b(k+1)=b(k)+2\alpha e(k)\end{array}$$

Here the error **e** and the bias **b** are vectors, and α is a *learning rate*. If α is large, learning occurs quickly, but if it is too large it can lead to instability and errors might even increase. To ensure stable learning, the learning rate must be less than the reciprocal of the largest eigenvalue of the correlation matrix **pp**^{T} of the input vectors.
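As a rough illustration of this bound, the sketch below estimates the correlation matrix as the average of the outer products of the input vectors and takes the reciprocal of its largest eigenvalue. The input vectors are example values, the bias is ignored, and the averaging convention is an assumption of this sketch rather than something the text specifies.

```matlab
P = [2 1 -2 -1; 2 -2 2 1];   % example input vectors, one per column
Q = size(P, 2);

R = (P*P.')/Q;               % sample correlation matrix of the inputs
lambdaMax = max(eig(R));     % largest eigenvalue
alphaMax = 1/lambdaMax;      % learning rates below this satisfy the stated bound

fprintf('largest eigenvalue = %.4f, bound on alpha = %.4f\n', lambdaMax, alphaMax);
```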

You might want to read some of Chapter 10 of [HDB96] for more information about the LMS algorithm and its convergence.

Fortunately, there is a toolbox function, `learnwh`, that does all the calculation for you. It calculates the change in weights as

dw = lr*e*p'

and the bias change as

db = lr*e

The constant 2, shown a few lines above, has been absorbed into the code learning rate `lr`. The function `maxlinlr` calculates this maximum stable learning rate `lr` as 0.999 times the reciprocal of the largest eigenvalue of `P*P'`.
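The two update expressions can be tried out in a few lines of base MATLAB. The sketch below applies them one input at a time, without calling `learnwh`. The data are hypothetical and follow an exactly linear rule (*t* = 2*p* + 1), so the error can reach zero. The learning rate and the number of passes are assumptions chosen for this small example.

```matlab
P = [-1 0 1];            % hypothetical inputs
T = 2*P + 1;             % hypothetical targets from an exactly linear rule
W = 0;                   % initial weight
b = 0;                   % initial bias
lr = 0.2;                % assumed learning rate (the factor 2 is already absorbed)

for pass = 1:200
    for q = 1:numel(P)
        p = P(:, q);
        e = T(q) - (W*p + b);   % error for this input
        W = W + lr*e*p.';       % dw = lr*e*p'
        b = b + lr*e;           % db = lr*e
    end
end

fprintf('W = %.6f, b = %.6f\n', W, b);
```

With these settings the weight and bias approach 2 and 1, the values that generated the targets.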

Type `help learnwh` and `help maxlinlr` for more details about these two functions.

Linear networks can be trained to perform linear classification with the function `train`. This function applies each vector of a set of input vectors and calculates the network weight and bias increments due to each of the inputs according to `learnwh`. Then the network is adjusted with the sum of all these corrections. Each pass through the input vectors is called an *epoch*. This contrasts with `adapt`, which adjusts weights for each input vector as it is presented.

Finally, `train` applies the inputs to the new network, calculates the outputs, compares them to the associated targets, and calculates a mean square error. If the error goal is met, or if the maximum number of epochs is reached, the training is stopped, and `train` returns the new network and a training record. Otherwise `train` goes through another epoch. Fortunately, the LMS algorithm converges when this procedure is executed.
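The epoch procedure described above can be sketched in base MATLAB as follows. This is an illustration of the idea, not the toolbox implementation of `train`. The data, learning rate, error goal, and epoch limit are assumptions chosen so that the loop terminates quickly.

```matlab
P = [1 2 3];             % example inputs
T = [2.0 4.1 5.9];       % example targets
W = 0;                   % initial weight
b = 0;                   % initial bias
lr = 0.1;                % assumed learning rate
goal = 0.01;             % assumed mean square error goal
maxEpochs = 1000;        % assumed epoch limit

for epoch = 1:maxEpochs
    E = T - (W*P + b);           % errors for all inputs with the current weights
    perf = mean(E.^2);           % mean square error for this epoch
    if perf <= goal
        break;                   % error goal met, stop training
    end
    W = W + lr*E*P.';            % sum over inputs of the corrections lr*e*p'
    b = b + lr*sum(E);           % sum over inputs of the corrections lr*e
end

fprintf('stopped at epoch %d with mse %.4f\n', epoch, perf);
```

Unlike an incremental scheme, the weights here change only once per epoch, after the corrections from all inputs have been summed.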

A simple problem illustrates this procedure. Consider the linear network introduced earlier.

Suppose you have the following classification problem.

$$\left\{{p}_{1}=\left[\begin{array}{c}2\\ 2\end{array}\right],{t}_{1}=0\right\}\left\{{p}_{2}=\left[\begin{array}{c}1\\ -2\end{array}\right],{t}_{2}=1\right\}\left\{{p}_{3}=\left[\begin{array}{c}-2\\ 2\end{array}\right],{t}_{3}=0\right\}\left\{{p}_{4}=\left[\begin{array}{c}-1\\ 1\end{array}\right],{t}_{4}=1\right\}$$

Here there are four input vectors, and you want a network that produces the output corresponding to each input vector when that vector is presented.

Use `train` to get the weights and biases for a network that produces the correct targets for each input vector. The initial weights and bias for the new network are 0 by default. Set the error goal to 0.1 rather than accept its default of 0.

P = [2 1 -2 -1;2 -2 2 1];
T = [0 1 0 1];
net = linearlayer;
net.trainParam.goal = 0.1;
net = train(net,P,T);

The problem runs for 64 epochs, achieving a mean square error of 0.0999. The new weights and bias are

weights = net.IW{1,1}
weights =
   -0.0615   -0.2194
bias = net.b(1)
bias =
    [0.5899]

You can simulate the new network as shown below.

A = net(P)
A =
    0.0282    0.9672    0.2741    0.4320

You can also calculate the error.

err = T - sim(net,P)
err =
   -0.0282    0.0328   -0.2741    0.5680

Note that the targets are not realized exactly. The problem would have run longer in an attempt to get perfect results had a smaller error goal been chosen, but in this problem it is not possible to obtain a goal of 0. The network is limited in its capability. See Limitations and Cautions for examples of various limitations.
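To see why a goal of 0 is out of reach, you can compute the smallest mean square error that any single linear neuron can achieve on these four pairs. The following base MATLAB sketch does this with a least-squares fit. The names `X`, `theta`, and `mseMin` are illustrative, and no toolbox functions are used.

```matlab
P = [2 1 -2 -1; 2 -2 2 1];
T = [0 1 0 1];

X = [P; ones(1, size(P, 2))];     % inputs with a row of ones for the bias
theta = T / X;                    % least-squares weights and bias
mseMin = mean((T - theta*X).^2);  % smallest achievable mean square error

fprintf('weights = [%.4f %.4f], bias = %.4f, minimum mse = %.4f\n', ...
    theta(1), theta(2), theta(3), mseMin);
```

The minimum is roughly 0.091, so the goal of 0.1 used above sits only slightly above the best possible value.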

The example program `demolin2` shows the training of a linear neuron and plots the weight trajectory and error during training.

You might also try running the example program `nnd10lc`. It addresses a classic and historically interesting problem, shows how a network can be trained to classify various patterns, and shows how the trained network responds when noisy patterns are presented.

Linear networks can only learn linear relationships between input and output vectors. Thus, they cannot find solutions to some problems. However, even if a perfect solution does not exist, the linear network will minimize the sum of squared errors if the learning rate `lr` is sufficiently small. The network will find as close a solution as is possible given the linear nature of the network's architecture. This property holds because the error surface of a linear network is a multidimensional parabola. Because parabolas have only one minimum, a gradient descent algorithm (such as the LMS rule) must produce a solution at that minimum.

Linear networks have various other limitations. Some of them are discussed below.

Consider an overdetermined system. Suppose that you have a network to be trained with four one-element input vectors and four targets. A perfect solution to *wp* + *b* = *t* for each of the inputs might not exist, for there are four constraining equations, and only one weight and one bias to adjust. However, the LMS rule still minimizes the error. You might try `demolin4` to see how this is done.

Consider a single linear neuron with one input. This time, in `demolin5`, train it on only one one-element input vector and its one-element target vector:

P = [1.0];
T = [0.5];

Note that while there is only one constraint arising from the single input/target pair, there are two variables, the weight and the bias. Having more variables than constraints results in an underdetermined problem with an infinite number of solutions. You can try `demolin5` to explore this topic.
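The following base MATLAB sketch makes the underdetermined case concrete. It checks two different weight and bias pairs that both fit the single input/target pair exactly, and computes the minimum-norm solution with `pinv`. The specific candidate pairs are hypothetical.

```matlab
P = 1.0;
T = 0.5;
X = [P; 1];                 % input with a one appended for the bias

thetaA = [0.5 0];           % candidate: w = 0.5, b = 0
thetaB = [-1 1.5];          % candidate: w = -1,  b = 1.5
thetaMin = T*pinv(X);       % minimum-norm solution among all exact fits

errA = T - thetaA*X;        % zero error
errB = T - thetaB*X;        % zero error as well
errMin = T - thetaMin*X;

fprintf('minimum-norm solution: w = %.2f, b = %.2f\n', thetaMin(1), thetaMin(2));
```

Any pair with *w* + *b* = 0.5 is an exact solution. Which one a training run reaches depends on the initial weights and bias.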

Normally it is a straightforward job to determine whether or not a linear network can solve a problem. Commonly, if a linear network has at least as many degrees of freedom (*S* × *R* + *S* = number of weights and biases) as constraints (*Q* = pairs of input/target vectors), then the network can solve the problem. This is true except when the input vectors are linearly dependent and they are applied to a network without biases. In this case, as shown with the example `demolin6`, the network cannot solve the problem with zero error. You might want to try `demolin6`.
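A small base MATLAB sketch illustrates the exception. The data are hypothetical and are not taken from `demolin6`: the second input vector is twice the first, the network has two weights and no bias, and the targets do not share that proportion.

```matlab
P = [1 2; 2 4];          % hypothetical inputs: second column is twice the first
T = [1 3];               % hypothetical targets that break that proportion

nWeights = size(P, 1);   % two weights, no bias
Q = size(P, 2);          % two constraints

W = T*pinv(P);           % least-squares weights for a network without bias
E = T - W*P;             % remaining errors
sse = sum(E.^2);

fprintf('rank(P) = %d, degrees of freedom = %d, constraints = %d, sse = %.4f\n', ...
    rank(P), nWeights, Q, sse);
```

Although there are as many weights as constraints, the rank of the input matrix is only 1, so the error cannot be driven to zero.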

You can always train a linear network with the Widrow-Hoff rule to find the minimum error solution for its weights and biases, as long as the learning rate is small enough. Example `demolin7` shows what happens when a neuron with one input and a bias is trained with a learning rate larger than that recommended by `maxlinlr`. The network is trained with two different learning rates to show the results of using too large a learning rate.
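The effect can be reproduced in a few lines of base MATLAB. The sketch below is not `demolin7`. It uses example data and a batch update in which the parameter error is multiplied by (I - lr·X·X') at every epoch, so this particular update diverges once `lr` exceeds 2 divided by the largest eigenvalue of `X*X'`. The two test rates are set to half and one and a half times that threshold.

```matlab
P = [1 2 3];                       % example inputs
T = [2.0 4.1 5.9];                 % example targets
X = [P; ones(1, numel(P))];        % inputs with a row of ones for the bias

lrLimit = 2/max(eig(X*X.'));       % divergence threshold for this batch update
rates = [0.5 1.5]*lrLimit;         % one rate below the threshold, one above
finalMse = zeros(size(rates));

for r = 1:numel(rates)
    theta = [0 0];                 % weight and bias start at zero
    for epoch = 1:50
        E = T - theta*X;                   % errors for all inputs
        theta = theta + rates(r)*E*X.';    % batch correction
    end
    finalMse(r) = mean((T - theta*X).^2);
end

initialMse = mean(T.^2);           % error of the zero network
fprintf('initial mse %.3g, small rate %.3g, large rate %.3g\n', ...
    initialMse, finalMse(1), finalMse(2));
```

With the smaller rate the error falls well below its initial value. With the larger rate it grows at every epoch.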