This article is part of our coverage of the latest in AI research.
Transformers have become the dominant architecture for large language models (LLMs) and other sequence-processing applications. However, their quadratic computational complexity makes scaling them to very long sequences expensive. This has led to interest in architectures with linear complexity and constant memory requirements.
In a new paper, researchers from Mila and Borealis AI revisit recurrent neural networks (RNNs) as a potential alternative to the Transformer architecture. They introduce minimal versions of two popular RNN variants, long short-term memory (LSTM) and gated recurrent units (GRU). These two models, minLSTM and minGRU, are fully parallelizable during training and use significantly fewer parameters, making them a fast and efficient alternative to Transformers.
The limitations of Transformers and the resurgence of RNNs
Every time you double the size of the input sequence given to a Transformer model, you require four times as much memory and computation. This quadratic complexity makes Transformers expensive for long sequences and especially problematic for resource-constrained environments.
RNNs, on the other hand, process input data sequentially and have linear computational complexity with respect to sequence length. They also require constant memory during inference, making them suitable for very long sequences. Traditional RNNs, however, suffered from vanishing and exploding gradients, which occur when the gradients used to update the network's weights become too small or too large, hindering effective learning. This problem limited their ability to learn long-term dependencies. LSTMs and GRUs addressed it by introducing gating mechanisms that control the flow of information through the network.
Despite their advantages, traditional LSTMs and GRUs had a critical limitation: they could only be computed sequentially. This meant they had to use backpropagation through time (BPTT) during training, a slow process that greatly limited their ability to scale to long contexts.
The limitations of Transformers have renewed interest in recurrent models. The past year has seen the introduction of novel recurrent architectures such as S4 and Mamba, which promise to address the scalability problems of Transformers while achieving comparable performance. These models use techniques such as the parallel prefix scan, which speeds up training by parallelizing computation across the input sequence.
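To make the idea concrete, here is a minimal sketch (in PyTorch; not code from the paper) of how a linear recurrence of the form h_t = a_t * h_(t-1) + b_t can be computed with a parallel scan. Each timestep is treated as an affine map, and because composing these maps is associative, all prefixes can be combined in O(log T) parallel passes instead of T sequential steps. The function name parallel_linear_scan and the Hillis-Steele formulation are my own illustrative choices.

```python
import torch

def parallel_linear_scan(a, b):
    """Solve h_t = a_t * h_{t-1} + b_t with h_0 = 0, for all t at once.

    Each step is an affine map h -> a_t * h + b_t. Composing two maps is
    associative: (a1, b1) followed by (a2, b2) gives (a1 * a2, a2 * b1 + b2),
    so prefixes can be combined with a Hillis-Steele scan in O(log T) passes.
    a, b: tensors of shape (T, d).
    """
    T = a.shape[0]
    A, B = a.clone(), b.clone()
    offset = 1
    while offset < T:
        # Pair every position with the prefix ending `offset` steps earlier;
        # positions with no earlier prefix combine with the identity map (1, 0).
        prev_A = torch.ones_like(A)
        prev_B = torch.zeros_like(B)
        prev_A[offset:] = A[:-offset]
        prev_B[offset:] = B[:-offset]
        A, B = prev_A * A, A * prev_B + B
        offset *= 2
    return B  # with h_0 = 0, B[t] equals h_t
```

The same trick applies to any elementwise linear recurrence, which is what makes it relevant to the gated RNNs discussed next.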
Revisiting LSTMs and GRUs
Inspired by the algorithmic similarities between recently proposed sequence models, the researchers revisited LSTMs and GRUs. They found that by removing dependencies on previous hidden states from their gating mechanisms, these models could be trained efficiently using the parallel scan algorithm.
Traditional LSTMs and GRUs have multiple gates that control the flow of information in the network. These gates rely on the previous hidden state to determine how much of the current input and previous memory to keep or discard. This creates a sequential dependency that requires the model to process tokens one at a time.
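For reference, a single step of a standard GRU looks roughly like the following sketch (illustrative PyTorch, not the authors' code). Every gate reads h_prev, which is what forces token-by-token processing.

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """Illustrative (unoptimized) GRU step: every gate reads h_prev, so step t
    cannot start until step t-1 has finished."""

    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.lin_z = nn.Linear(d_in + d_hidden, d_hidden)  # update gate
        self.lin_r = nn.Linear(d_in + d_hidden, d_hidden)  # reset gate
        self.lin_h = nn.Linear(d_in + d_hidden, d_hidden)  # candidate state

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        z = torch.sigmoid(self.lin_z(xh))     # depends on h_prev
        r = torch.sigmoid(self.lin_r(xh))     # depends on h_prev
        h_tilde = torch.tanh(self.lin_h(torch.cat([x_t, r * h_prev], dim=-1)))
        return (1 - z) * h_prev + z * h_tilde
```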
The researchers found that they could remove the dependency on previous hidden states while maintaining temporal consistency in the computation. This made it possible to train the models with the parallel scan algorithm. They further simplified the architectures by removing computations that were no longer necessary. The result is minimal LSTMs (minLSTMs) and minimal GRUs (minGRUs), which use fewer parameters and can be trained much faster.
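Putting the two ideas together, here is a hedged sketch of what a minGRU layer might look like, based on the paper's description: the update gate and candidate state are computed from x_t alone, and the remaining recurrence h_t = (1 - z_t) * h_(t-1) + z_t * h̃_t is handled by the parallel_linear_scan sketch above. The class name and dimensions are illustrative, and the log-space formulation the authors use for numerical stability is omitted here.

```python
import torch
import torch.nn as nn

class MinGRUSketch(nn.Module):
    """Sketch of a minGRU layer: the update gate and candidate state depend
    only on x_t, so they can be computed for every timestep at once, and the
    recurrence h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t is a linear scan."""

    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.lin_z = nn.Linear(d_in, d_hidden)  # update gate: no h_{t-1} input
        self.lin_h = nn.Linear(d_in, d_hidden)  # candidate state: no h_{t-1} input

    def forward(self, x):                       # x: (T, d_in)
        z = torch.sigmoid(self.lin_z(x))        # all timesteps in parallel
        h_tilde = self.lin_h(x)                 # all timesteps in parallel
        # a_t = 1 - z_t, b_t = z_t * h_tilde_t
        return parallel_linear_scan(1 - z, z * h_tilde)  # scan sketch from above
```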
minGRUs and minLSTMs address the training bottleneck of traditional RNNs by enabling parallel computation. This change corresponded to a 175x speedup for minGRU and a 235x speedup for minLSTM compared to their traditional counterparts for a sequence length of 512 tokens on a T4 GPU. The improvement becomes more significant with longer sequences. For a sequence of length 4096, minGRU and minLSTM were more than 1300x faster than their traditional versions.
“As such, in a setting where minGRU would take a day to finish training for a fixed number of epochs, its traditional counterpart GRU could take over 3 years,” the researchers write.
minGRU reduces the number of required parameters by up to 87% compared to classic GRUs, and minLSTM reduces parameters by up to 85% compared to classic LSTMs.
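As a rough illustration of where the savings come from (the dimensions below are assumptions chosen for the example, not the paper's configuration; the exact percentage depends on the ratio of hidden size to input size):

```python
# Weights only, biases ignored; d_h = 4 * d_x is an assumed expansion factor.
d_x, d_h = 256, 4 * 256

# GRU: update gate, reset gate, and candidate state, each reading x_t and h_{t-1}.
gru_params = 3 * (d_x * d_h + d_h * d_h)

# minGRU: update gate and candidate state, each reading x_t only.
mingru_params = 2 * (d_x * d_h)

print(f"{1 - mingru_params / gru_params:.0%} fewer parameters")  # ~87%
```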
Minimized RNNs vs SOTA recurrent models
The researchers compared the performance of minLSTMs and minGRUs to Mamba, a state-of-the-art recurrent sequence model. They measured training time, memory usage, and performance on several tasks, including selective copying, reinforcement learning (RL), and language modeling.
In terms of runtime, minLSTM and minGRU achieved results similar to those of Mamba. While they used more memory than their traditional counterparts, they were still more memory-efficient than Mamba.

On selective copying, a task that requires content-aware reasoning and memorization, the performance of minLSTM and minGRU was comparable to Mamba’s.
In RL experiments on the D4RL benchmark, minLSTM and minGRU outperformed all baselines except for Decision Mamba, where the difference was marginal.
On language modeling tasks, minLSTM and minGRU were slightly slower than Mamba in reaching peak performance during training but converged at a lower loss. Notably, they were significantly more efficient than Transformers, which needed 2.5x more time to reach their best performance.

Like other similar works that examine Transformer alternatives, one of the limitations of the minimized RNN study is the scale of the experiments. It remains to be seen whether these architectures can deliver similar results with very large models and context windows.
Nonetheless, the results of this study are significant because they show that as new information becomes available, it is worth revisiting old ideas.
“Considering the strong empirical performance of these simplified RNNs and their fundamental similarities with many recently proposed recurrent sequence methods, we question ‘Were RNNs all we needed?’” the researchers write.
On a separate note, I’m experimenting with paper discussions generated with NotebookLM. Here is an 11-minute discussion of the paper “Were RNNs All We Needed?” I think it did a pretty good job of summarizing the paper, though it got some minor points wrong. Let me know what you think!