Setting initial hidden state of an LSTM with a dense layer

Yildirim Kocoglu on 7 Jan 2021
I have been working on an LSTM seq2seq forecasting problem and I wanted to set the "initial hidden states" of the network myself using static features (not changing over time).
Problem description:
I'm trying to use a stateless LSTM since, in my case, each time series is independent of the others, so I use a mini-batch size of 1. I use the default sequence length, 'longest'.
Questions:
1) I'm not sure whether 'longest' for the sequence length means the longest sequence across all mini-batches. For example, if I have two independent time series with different numbers of time steps (say 10 time steps for batch #1 and 15 for batch #2), does the 'longest' option change per batch (sequence length 10 for batch #1 and 15 for batch #2), or is the longest of both batches (15) used for both, with the 'missing' values in batch #1 padded with a default value?
2) The real problem I'm encountering is that the initial hidden state must have dimensions (num_hidden_units, 1), and I believe (though I'm not sure) the same initial hidden state is used across mini-batches. Is the initial hidden state I set automatically reset to that same value between mini-batches during training if my mini-batch size is 1? I'm also not sure whether the required column dimension of 1 is due to my choice of mini-batch size = 1.
In my case, there are 7 static features for each independent time series, so with 10 independent time series I have a matrix of size (7, 10).
To set the initial state of the LSTM, I pass this (7, 10) matrix of static features through a dense layer that outputs the required size (num_hidden_units, 1) and assign the result as the initial hidden state. However, if the hidden state resets to the same value between batches, using the same initial hidden state for all batches doesn't make sense to me, because each series then seems to lose its individual properties.
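For reference, here is a minimal sketch of the idea (assuming the Deep Learning Toolbox; the projection weights W and b are placeholders standing in for a learned dense layer, not actual learned values):

```matlab
% Sketch: map one series' static features to an initial hidden state
% through a (placeholder) learned projection, then assign it to the
% LSTM layer before working with that series.
numHiddenUnits = 100;
numStatic = 7;                        % static features per series
staticFeatures = randn(numStatic,1);  % one independent series' features

% Placeholder projection weights (a trained dense layer would supply these)
W = randn(numHiddenUnits, numStatic);
b = zeros(numHiddenUnits, 1);
h0 = tanh(W*staticFeatures + b);      % size (numHiddenUnits, 1)

layer = lstmLayer(numHiddenUnits);
layer.HiddenState = h0;               % must be numHiddenUnits-by-1
```

Note this projection is not trained jointly with the LSTM in the standard trainNetwork workflow, which is part of what I'm asking about.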
I understand that the questions are different from each other (even if they are related to one problem) and I'm not expecting all the answers but, any kind of clarification will be appreciated.
Thank you.

Answers (1)

Asvin Kumar on 10 Feb 2021
Edited: Asvin Kumar on 10 Feb 2021
  1. Sequences are padded to the 'longest' in a mini-batch. More on that here: Sequence Options
  2. The same initial value for the hidden state is used across all sequences and mini-batches. The column dimension is 1 because the hidden state is the same for all sequences; it is not related to the mini-batch size being 1. Refer docs here. This is also related to your follow-up comment from another question: the column dimension of the 'HiddenState' property would never be 2.
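A quick illustration of both points (a hedged sketch; this only inspects defaults and options):

```matlab
% Point 1 - padding: with 'SequenceLength','longest' (the default),
% sequences are padded only to the longest sequence within each
% mini-batch, not across the whole dataset.
options = trainingOptions('adam', ...
    'MiniBatchSize',1, ...
    'SequenceLength','longest');

% Point 2 - hidden state: the 'HiddenState' property is a
% numHiddenUnits-by-1 column vector regardless of mini-batch size.
layer = lstmLayer(200);
disp(size(layer.HiddenState))   % expected numHiddenUnits-by-1, here 200-by-1
```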
You mention this:
I'm trying to have a stateless lstm since each time series is independent from one another in my case so I use a minibatch size= 1.
and this:
It does not make sense to me to use the same initial hidden state [...] for all the batches because then It seems to me that it loses its individual properties [...]
An LSTM network trains on a dataset of one particular kind. Its learnable parameters are 'InputWeights', 'RecurrentWeights' and 'Bias', as seen in this example here. Every LSTM has a fixed initial 'HiddenState' property. When a network is trained, the input and recurrent weights are learned and adapted to minimize the error against the required targets. The initial hidden state has only a small influence when your sequences are relatively long.
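The shapes of those learnables can be read off the layer definition (a sketch; the 12-feature input and 9-class output sizes here are only illustrative):

```matlab
% Sketch: an LSTM classification network and its learnable parameters.
numFeatures = 12;      % illustrative input dimension
numHiddenUnits = 100;
layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(numHiddenUnits,'OutputMode','last')
    fullyConnectedLayer(9)   % illustrative number of classes
    softmaxLayer
    classificationLayer];

% After training with trainNetwork, the lstmLayer holds:
%   InputWeights     : 4*numHiddenUnits-by-numFeatures
%   RecurrentWeights : 4*numHiddenUnits-by-numHiddenUnits
%   Bias             : 4*numHiddenUnits-by-1
% The initial 'HiddenState' is not a learnable; it keeps its fixed value.
```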
So, to clear up some confusion, the individual properties of sequences from a dataset are captured in the input and recurrent weights. Every LSTM has to be trained on some dataset to learn from it. I'm not sure what you mean by a stateless LSTM, or by each time series (sequence) being independent from the others.
Take the example of the Japanese Vowel Dataset. All the sequences are separate from each other but they also share similarities in the sense that they are all about vowels. Training an LSTM on such a dataset would mean that the network captures the individual properties in its weights.

Release

R2020b