When does Keras reset an LSTM state?


I have read all sorts of texts about this, and none of them seems to answer this very basic question. The answer is always ambiguous:

In a stateful=False LSTM layer, does Keras reset states after:

  • Each sequence; or
  • Each batch?

Suppose I have X_train shaped as (1000, 20, 1), meaning 1000 sequences of 20 steps of a single value. If I call:

model.fit(X_train, y_train, batch_size=200, nb_epoch=15)

Will it reset states for every single sequence (resets states 1000 times)?
Or will it reset states for every batch (resets states 5 times)?

5 Answers: 

After checking with some tests, I reached the following conclusion, which agrees with the documentation and with Nassim's answer:

First, there isn't a single state in a layer, but one state per sample in the batch. There are batch_size parallel states in such a layer.


In a stateful=False case, all the states are reset together after each batch.

  • A batch with 10 sequences would create 10 states, and all 10 states are reset automatically after the batch is processed.

  • The next batch with 10 sequences will create 10 new states, which will also be reset after this batch is processed.

If all those sequences have length (timesteps) = 7, the practical result of these two batches is:

20 individual sequences, each with length 7

None of the sequences are related. But of course the weights (not the states) are shared by the whole layer, and they represent what the layer has learned from all the sequences.

  • A state is: Where am I now inside a sequence? Which time step is it? How is this particular sequence behaving since its beginning up to now?
  • A weight is: What do I know about the general behavior of all sequences I've seen so far?
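The stateful=False behaviour described above can be mimicked with a toy stand-in. This is only a sketch under loud assumptions: ToyRecurrentLayer and process_batch are hypothetical names, not Keras API, and the "state" here is just a running sum per sample instead of real LSTM hidden and cell vectors.

```python
# Toy stand-in for a stateless recurrent layer (NOT Keras code).
# Assumption: the "state" of a sample is the running sum of its values.
class ToyRecurrentLayer:
    def __init__(self):
        self.states = None  # one entry per sample in the current batch

    def process_batch(self, batch):
        # batch: list of sequences, each sequence a list of numbers.
        # Stateless behaviour: every batch starts from fresh zero states,
        # one state per sample, whatever the previous batch left behind.
        self.states = [0.0] * len(batch)
        for i, sequence in enumerate(batch):
            for value in sequence:
                self.states[i] += value
        return list(self.states)


layer = ToyRecurrentLayer()
batch_1 = [[1.0] * 7 for _ in range(10)]    # 10 sequences, 7 steps each
batch_2 = [[10.0] * 7 for _ in range(10)]   # 10 unrelated sequences

final_1 = layer.process_batch(batch_1)
final_2 = layer.process_batch(batch_2)

print(len(final_1), final_1[0])  # 10 parallel states, each 7.0
print(final_2[0])                # 70.0, not 77.0: batch_1 left no trace
```

The point of the sketch is only that there are as many states as samples in the batch, and that nothing from batch_1 leaks into batch_2.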


In a stateful=True case, there is the same number of parallel states, but they are simply not reset at all.

  • A batch with 10 sequences will create 10 states that will remain as they are at the end of the batch.

  • The next batch with 10 sequences (it is required to be 10, since the first was 10) will reuse the same 10 states that were created before.

The practical result is: the 10 sequences in the second batch are just continuing the 10 sequences of the first batch, as if there had been no interruption at all.

If each sequence has length (timesteps) = 7, then the actual meaning is:

10 individual sequences, each with length 14

When you have reached the total length of the sequences, you call model.reset_states(). This means you will not continue the previous sequences anymore and will start feeding new sequences.
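To make the "10 sequences of length 14" claim concrete, here is a minimal sketch using a toy recurrence. Everything in it is an assumption made for illustration: step_through is a hypothetical helper, and the update h = 0.5 * h + x stands in for a real LSTM cell. It compares carrying the states across two batches of 7 steps against processing one batch of 14 steps.

```python
def step_through(batch, states):
    """Advance one toy state per sample through its sequence.

    Toy recurrence (an assumption, not an LSTM): h = 0.5 * h + x
    """
    new_states = []
    for h, sequence in zip(states, batch):
        for x in sequence:
            h = 0.5 * h + x
        new_states.append(h)
    return new_states


n_samples = 10
# 10 sequences of 14 steps each, then cut into two batches of 7 steps.
full = [[float(i + t) for t in range(14)] for i in range(n_samples)]
first_half = [seq[:7] for seq in full]
second_half = [seq[7:] for seq in full]
zeros = [0.0] * n_samples

# stateful=True-like: the states of batch 1 are the initial states of batch 2.
carried = step_through(second_half, step_through(first_half, zeros))

# Reference: the same data fed as a single batch of length 14.
reference = step_through(full, zeros)

# stateful=False-like: the states are reset before the second batch.
reset = step_through(second_half, zeros)

print(carried == reference)  # True
print(reset == reference)    # False
```

Resetting the states before the second batch corresponds to what model.reset_states() is used for in the stateful case: the next batch is treated as the start of new sequences.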


In the documentation of the RNN code you can read this:

Note on using statefulness in RNNs:

You can set RNN layers to be 'stateful', which means that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch. This assumes a one-to-one mapping between samples in different successive batches.

I know that this doesn't directly answer your question, but to me it confirms what I was thinking: when an LSTM is not stateful, the state is reset after every sample. LSTMs don't work by batches; the idea of a batch is that every sample in it is independent of the others.

So you have 1000 resets of the state in your example.


In Keras there are two modes for maintaining states: 1) The default mode (stateful=False), where the state is reset after each batch. AFAIK the state will still be maintained between different samples within a batch. So for your example the state would be reset 5 times in each epoch.

2) The stateful mode, where the state is never reset. It is up to the user to reset the state before a new epoch; Keras itself won't reset it. In this mode the state is propagated from sample "i" of one batch to sample "i" of the next batch. Generally it is recommended to reset the state after each epoch, as the state may grow for too long and become unstable. However, in my experience with small datasets (20,000 to 40,000 samples), resetting or not resetting the state after an epoch does not make much of a difference to the end result. For bigger datasets it may make a difference.

A stateful model will be useful if you have patterns that span hundreds of time steps. Otherwise the default mode is sufficient. In my experience, setting the batch size roughly equal to the size (in time steps) of the patterns in the data also helps.

The stateful setup can be quite difficult to grasp at first. One would expect the state to be transferred from the last sample of one batch to the first sample of the next batch. But the state is actually propagated across batches between the same-numbered samples. The authors had two choices and they chose the latter. Read about this here. Also look at the relevant Keras FAQ section on stateful RNNs.
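Because the state travels from sample "i" of one batch to sample "i" of the next, the data has to be laid out accordingly. The following is a rough sketch of one way to do that in plain Python; make_stateful_batches is a hypothetical helper written for this illustration, not something Keras provides.

```python
def make_stateful_batches(long_sequences, timesteps):
    """Cut every long sequence into consecutive windows of `timesteps` steps.

    Batch k holds window k of every sequence, always in the same order,
    so sample i of batch k+1 continues sample i of batch k.
    """
    length = len(long_sequences[0])
    if any(len(seq) != length for seq in long_sequences):
        raise ValueError("all sequences must have the same length")
    if length % timesteps != 0:
        raise ValueError("sequence length must be a multiple of timesteps")
    return [
        [seq[start:start + timesteps] for seq in long_sequences]
        for start in range(0, length, timesteps)
    ]


# Three long sequences of 6 steps, fed to the model 2 steps at a time.
sequences = [list(range(0, 6)), list(range(100, 106)), list(range(200, 206))]
batches = make_stateful_batches(sequences, timesteps=2)

for k, batch in enumerate(batches):
    print(k, batch)
# 0 [[0, 1], [100, 101], [200, 201]]
# 1 [[2, 3], [102, 103], [202, 203]]
# 2 [[4, 5], [104, 105], [204, 205]]
```

With model.fit, this ordering only survives if shuffling is turned off (shuffle=False), since shuffling would break the one-to-one mapping between samples of successive batches.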


Expanding on @Nassim_Ben's answer, it is true that each sequence is considered independent for each instance of the batch. However, you need to keep in mind that the RNN's hidden state and cell memory get passed along to the next cell for 20 steps. The hidden state and cell memory are typically set to zero for the very first of the 20 cells.

After the 20th cell, and after the hidden state (only that, not the cell memory) gets passed on to the layers above the RNN, the state gets reset. I'm going to assume that "state" means both the cell memory and the hidden state here.

So yes, it does get reset for all 1000 instances. However, considering that your batch_size=200, it gets reset 5 times, with each batch getting reset after it is done passing information through those 20 steps. Hopefully this helps you get your head around it.
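The two counts (1000 and 5) can be reconciled with a little arithmetic. This is just a back-of-the-envelope sketch using the numbers from the question; the variable names are made up for illustration.

```python
import math

n_samples = 1000   # sequences in X_train
batch_size = 200
epochs = 15

# One reset event happens after each batch in the stateful=False case.
reset_events_per_epoch = math.ceil(n_samples / batch_size)

# Each reset event clears one state per sample in the batch
# (exact here because 1000 is a multiple of 200).
states_cleared_per_epoch = reset_events_per_epoch * batch_size

total_reset_events = reset_events_per_epoch * epochs

print(reset_events_per_epoch)    # 5
print(states_cleared_per_epoch)  # 1000
print(total_reset_events)        # 75
```

So "5 resets" counts reset events per epoch, while "1000 resets" counts the individual per-sequence states that those events clear.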

Here's a project I did where I had the same question. Pay special attention to cell 15 and its explanation in the blob after cell 11. I kept appending letters because the state was getting reset otherwise.


Everyone seems to be making this too confusing. A Keras LSTM resets its state after every batch.

Here is a good blog: https://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/

Read the sections "LSTM State Within A Batch" and "Stateful LSTM for a One-Char to One-Char Mapping" in that post. They show why the state must be reset only after each batch.

