Hochreiter & Schmidhuber (1997) - Long Short-Term Memory (LSTM)
LSTM networks, introduced by Hochreiter and Schmidhuber in 1997, mitigated the vanishing gradient problem and became the dominant architecture for sequence modeling prior to Transformers.

The Long Short-Term Memory (LSTM) architecture, introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, was a breakthrough in solving one of the biggest problems in training neural networks on sequential data: the vanishing gradient problem. Prior to LSTMs, recurrent neural networks (RNNs) struggled to retain information across long sequences, making them ineffective for many natural language and time-series tasks.
LSTM introduced a memory cell and gating mechanisms that allowed information to persist or be forgotten over time. This innovation enabled practical sequence modeling for years before LSTMs were eventually overtaken by Transformers.
Historical Context
Standard RNNs had been known since the 1980s but were plagued by instability during training. Gradients would either vanish or explode as they were propagated back through time, making it difficult to learn long-range dependencies.
Hochreiter’s 1991 diploma thesis first formalized the vanishing gradient issue, and the 1997 LSTM paper provided a practical architectural solution. Over time, LSTMs became the de facto standard for sequence tasks such as speech recognition, language modeling, and machine translation, typically trained using backpropagation through time (BPTT).
Technical Summary
LSTMs extend traditional RNNs by introducing a memory cell and three gates (the 1997 paper introduced the input and output gates; the forget gate was added by Gers et al., 2000, listed under Further Reading):
- Input gate: Controls how much of the new input to store in memory
- Forget gate: Decides what information to discard from the cell
- Output gate: Determines how much of the memory to expose as output
These gates are implemented using learned weights and sigmoid activations, which scale values between 0 and 1. The memory cell acts as an accumulator of state information over time.
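As a concrete sketch of how these gates interact, the snippet below implements a single LSTM time step in NumPy. It is illustrative only: the weight names (W_f, W_i, W_g, W_o), the concatenated [h; x] input layout, and the example sizes are assumptions chosen for brevity, not notation from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single example (vectors, no batching)."""
    z = np.concatenate([h_prev, x])                   # shared input to every gate
    f = sigmoid(params["W_f"] @ z + params["b_f"])    # forget gate: what to discard
    i = sigmoid(params["W_i"] @ z + params["b_i"])    # input gate: what to store
    g = np.tanh(params["W_g"] @ z + params["b_g"])    # candidate cell update
    o = sigmoid(params["W_o"] @ z + params["b_o"])    # output gate: what to expose
    c = f * c_prev + i * g                            # memory cell accumulates state
    h = o * np.tanh(c)                                # hidden state seen by the next layer
    return h, c

# Example usage with random parameters (hypothetical sizes: hidden 4, input 3).
rng = np.random.default_rng(0)
params = {name: rng.standard_normal((4, 7)) * 0.1 for name in ("W_f", "W_i", "W_g", "W_o")}
params |= {name: np.zeros(4) for name in ("b_f", "b_i", "b_g", "b_o")}
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), params)
```

Because the cell state c is updated additively (f * c_prev + i * g) rather than repeatedly squashed through a nonlinearity, gradients can flow through it across many time steps, which is the core of the vanishing-gradient fix.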
Key Benefits:
- Mitigates vanishing gradients: Maintains stable gradients over long sequences
- Captures long-range dependencies: Crucial for language, audio, and time-series data
- Modular: Can be stacked into deep recurrent networks (a stacking sketch follows this list)
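In modern frameworks, stacking is a single configuration choice. The PyTorch sketch below (layer sizes are arbitrary examples, not values from any particular system) wires two LSTM layers so that the second consumes the first layer's hidden-state sequence:

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers; the second layer's input is the first layer's output sequence.
stacked_lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)

x = torch.randn(4, 50, 128)             # (batch, time, features)
outputs, (h_n, c_n) = stacked_lstm(x)   # outputs: (4, 50, 256); h_n, c_n: (2, 4, 256)
```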
Applications and Impact
LSTMs powered major advances in NLP and speech systems from the late 2000s through the mid-2010s:
- Google Voice Search and translation
- Handwriting and speech recognition
- Time-series forecasting and anomaly detection
They were also foundational in early encoder-decoder architectures, such as the sequence-to-sequence model of Sutskever et al. (2014), in which one LSTM encoded the input sequence and another decoded the output sequence.
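A minimal sketch of that encoder-decoder pattern is shown below in PyTorch. The module and hyperparameter names are assumptions made for illustration; the actual Sutskever et al. system used deep LSTMs, reversed source sequences, and other details omitted here.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy LSTM encoder-decoder: encode the source, decode conditioned on its final state."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder compresses the whole source sequence into its final (h, c) state.
        _, state = self.encoder(self.src_embed(src_ids))
        # Decoder starts from that state and unrolls over the (teacher-forced) target.
        out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        return self.proj(out)  # per-step logits over the target vocabulary
```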
Transition to Attention
Although LSTMs reigned for nearly two decades, their sequential nature limited parallelism and scalability. The introduction of attention mechanisms and ultimately Transformers offered superior performance and efficiency on large-scale tasks.
Yet many of the problems LSTMs addressed - particularly long-range temporal dependencies and persistent memory - remain relevant when evaluating modern models.
Related Work
- Backpropagation: Used in training LSTMs via BPTT
- Encoder-Decoder Architectures: Early applications of LSTMs
- Transformers: Replaced LSTMs as the dominant sequence model
Further Reading
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
- Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10), 2451–2471.
- Evolution of Model Architectures overview
Tags: Papers, Architectures