LSTM - Long Short-Term Memory

Understanding Gates, Cell State and Memory in Neural Networks

Running example sentences: "Deep learning is fun"  |  "I love learning"

Why LSTM? The Vanishing Gradient Problem

Standard RNNs struggle to remember information from many steps ago. Gradients shrink to nearly zero during backpropagation. LSTM solves this with gated memory.

Analogy: Reading "Deep learning is fun" then "I love learning." A plain RNN forgets "Deep" by the time it reaches "love". LSTM can hold onto context as long as it is relevant.
Key idea: LSTM uses three gates plus one cell state to control what to remember, what to forget, and what to output at each time step.

The Four Components

  • ft - Forget Gate: decides what to erase from memory
  • it - Input Gate: decides what new info to write
  • Ot - Output Gate: decides what to send forward
  • Ct - Cell State: the long-term memory highway

Inputs and Notation

  • St-1 = previous hidden state (old state)
  • Xt = current input word vector
  • Ct-1 = previous cell state
  • Wf, Wi, Wo, Wc = recurrent weight matrices (applied to St-1)
  • Uf, Ui, Uo, Uc = input weight matrices (applied to Xt)
  • sigma = sigmoid function (output: 0 to 1)
  • tanh = hyperbolic tangent (output: -1 to 1)
Bias terms are omitted throughout for clarity.

All LSTM Equations

ft = sigma( Wf * St-1 + Uf * Xt )
it = sigma( Wi * St-1 + Ui * Xt )
Ot = sigma( Wo * St-1 + Uo * Xt )
Ct' = tanh( Wc * St-1 + Uc * Xt )
Ct = ( it * Ct' ) + ( ft * Ct-1 )
St = Ot * tanh( Ct )
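The six equations can be sketched as one function. This is a minimal scalar version, assuming toy scalar weights in place of the real weight matrices and omitting biases; the name `lstm_step` is illustrative, not a library API.

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^-x), squashes any input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(s_prev, c_prev, x, W):
    # W holds one scalar (state, input) weight pair per gate;
    # a real LSTM uses weight matrices and bias vectors instead.
    f = sigmoid(W["Wf"] * s_prev + W["Uf"] * x)          # forget gate ft
    i = sigmoid(W["Wi"] * s_prev + W["Ui"] * x)          # input gate it
    o = sigmoid(W["Wo"] * s_prev + W["Uo"] * x)          # output gate Ot
    c_cand = math.tanh(W["Wc"] * s_prev + W["Uc"] * x)   # candidate Ct'
    c = i * c_cand + f * c_prev                          # Ct = it*Ct' + ft*Ct-1
    s = o * math.tanh(c)                                 # St = Ot * tanh(Ct)
    return s, c

# One step from a zero state with all weights set to 1.0 (toy values):
W = {k: 1.0 for k in ("Wf", "Uf", "Wi", "Ui", "Wo", "Uo", "Wc", "Uc")}
s1, c1 = lstm_step(0.0, 0.0, 1.0, W)
print(f"S1 = {s1:.3f}, C1 = {c1:.3f}")
```

Note how the cell state update is purely additive and multiplicative; there is no extra nonlinearity on the Ct-1 path, which is what lets gradients flow over long spans.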
FG
Forget Gate
Decides what old information to erase
  • Forget Gate is represented as ft
  • Uses a sigmoid activation function
  • Output is between 0 and 1 for each cell state value
  • 0 = completely forget  |  1 = completely keep
  • Takes St-1 (old state) and Xt (current input)
ft = sigma( Wf * St-1 + Uf * Xt )
Wf, Uf = forget-gate weights (for St-1 and Xt)  |  St-1 = old state  |  Xt = Input
What it does: First sigmoid in the network. Takes previous hidden state and current input. Values near 0 mean forget, near 1 mean remember.

Sigmoid Function - Visual

sigma(x) = 1 / (1 + e^-x) - output always between 0 and 1
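In place of a plot, sampling the function at a few points shows the squashing behaviour: large negative inputs map near 0 (forget), large positive inputs near 1 (keep).

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

# Sample sigma(x) to see the 0-to-1 squashing around the midpoint x = 0.
for x in (-6, -2, 0, 2, 6):
    print(f"sigma({x:+d}) = {sigmoid(x):.3f}")
```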

Forget Gate - Sentence Walkthrough

Processing: "Deep learning is fun" then "I love learning"

Word (Xt) | Old state (St-1)  | ft output   | Decision
Deep      | S0 = 0 (start)    | approx 0.9  | Keep most
learning  | S1 (from "Deep")  | approx 0.6  | Keep partial
is        | S2                | approx 0.4  | Forget more
fun       | S3                | approx 0.85 | Keep most
I         | S4 (from sent. 1) | approx 0.2  | Forget old context
love      | S5                | approx 0.95 | Strong keep
learning  | S6                | approx 0.7  | Keep (new context)
Key insight: When "I" arrives at the second sentence, ft is approx 0.2 - it clears most of the first sentence's context because a new sentence has begun.
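The sentence-boundary reset above is just a multiplication. A small sketch with the illustrative values from the table: a low ft scales the old cell state down, wiping most of sentence 1 from memory.

```python
# Illustrative values from the walkthrough, not learned quantities.
c_prev = 0.91     # cell state after "fun" (end of sentence 1)
f_reset = 0.2     # forget gate output when "I" starts sentence 2

# ft * Ct-1: the only part of the old memory that survives
kept = f_reset * c_prev
print(f"kept after reset: {kept:.3f}")
```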
IG
Input Gate
Decides what new information to write to memory
  • Input Gate is represented as it
  • Uses a sigmoid function - same structure as forget gate
  • Works together with the intermediate cell state Ct' (tanh)
  • Two parallel computations happen simultaneously
it = sigma( Wi * St-1 + Ui * Xt )
Ct' = tanh( Wc * St-1 + Uc * Xt )
tanh output: -1 to +1 (can represent adding or removing info)
What it does: sigmoid (it) decides how much of the candidate (Ct') to write. Together: it * Ct' = new information added to cell state.

Input Gate - Sentence Walkthrough

What new info gets written to cell state at each step?

Word     | it value    | Ct' value    | Written (it * Ct')
Deep     | approx 0.9  | approx +0.8  | +0.72 strong write
learning | approx 0.5  | approx +0.4  | +0.20 partial write
is       | approx 0.3  | approx -0.1  | -0.03 tiny erase
fun      | approx 0.8  | approx +0.9  | +0.72 strong write
I        | approx 0.85 | approx +0.7  | +0.60 new sentence
love     | approx 0.95 | approx +0.95 | +0.90 strong emotion
learning | approx 0.6  | approx +0.5  | +0.30 moderate write
Why "love" gets a strong write: In "I love learning", the word "love" carries strong semantic signal. The input gate writes this information into the cell state so future words can use this emotional context.
OG
Output Gate
Decides what part of cell state to expose as hidden state
  • Output Gate is represented as Ot
  • Uses a sigmoid activation function
  • Controls what the next hidden state (St) will be
  • Determines what information leaves the LSTM cell
  • Output St becomes the next cell's St-1
Ot = sigma( Wo * St-1 + Uo * Xt )
St = Ot * tanh( Ct )
tanh squashes Ct to [-1, +1]. Ot gates how much of it flows out.
What it does: Highlights which information from the cell state should pass to the next hidden state. The output gate filters what is currently relevant.
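A small sketch of St = Ot * tanh(Ct): tanh first bounds the cell state to (-1, +1), then Ot scales how much of it leaves the cell. The cell-state value is the illustrative one after "love".

```python
import math

c_t = 1.64               # illustrative cell state after "love"
for o_t in (0.1, 0.85):  # nearly closed vs. mostly open output gate
    # same memory, very different exposure to the next step
    print(f"Ot={o_t}: St = {o_t * math.tanh(c_t):.2f}")
```

The same memory can thus be held internally (gate near 0) or broadcast strongly (gate near 1) depending on what is currently relevant.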

Output Gate - Sentence Walkthrough

Word     | Ot          | tanh(Ct)     | St = Ot * tanh(Ct)
Deep     | approx 0.7  | approx +0.62 | +0.43 outputs concept
learning | approx 0.5  | approx +0.56 | +0.28
is       | approx 0.3  | approx +0.22 | +0.07 weak signal
fun      | approx 0.9  | approx +0.72 | +0.65 strong output
I        | approx 0.6  | approx +0.65 | +0.39 fresh start
love     | approx 0.85 | approx +0.93 | +0.79 strong output
learning | approx 0.65 | approx +0.90 | +0.58 contextual
Note on "fun" and "love": Both produce high output values because they carry strong semantic content. The LSTM pushes this information strongly into the next hidden state.
CS
Cell State - The Memory Highway
Long-term memory running through the entire sequence

The cell state is the "conveyor belt" of the LSTM. Information can flow along it with minimal modification - gates selectively write to and read from it.

Ct' = tanh( Wc * St-1 + Uc * Xt )
What could be written to memory
Ct = ( it * Ct' ) + ( ft * Ct-1 )
it * Ct' = new info written  |  ft * Ct-1 = old info kept
St = Ot * tanh( Ct )
Big picture: Cell state combines what to remember from before (ft * Ct-1) with what new thing to write (it * Ct'). This is the core of LSTM's long-term memory capability.

Cell State Evolution - Step by Step

Tracking C through both sentences

Step | Word     | ft * Ct-1 (keep)   | it * Ct' (new)     | Ct (result)
t=0  | start    | -                  | -                  | C0 = 0
t=1  | Deep     | 0.9 * 0 = 0        | 0.9 * 0.8 = 0.72   | C1 = 0.72
t=2  | learning | 0.6 * 0.72 = 0.43  | 0.5 * 0.4 = 0.20   | C2 = 0.63
t=3  | is       | 0.4 * 0.63 = 0.25  | 0.3 * -0.1 = -0.03 | C3 = 0.22
t=4  | fun      | 0.85 * 0.22 = 0.19 | 0.8 * 0.9 = 0.72   | C4 = 0.91
t=5  | I        | 0.2 * 0.91 = 0.18  | 0.85 * 0.7 = 0.60  | C5 = 0.78
t=6  | love     | 0.95 * 0.78 = 0.74 | 0.95 * 0.95 = 0.90 | C6 = 1.64
t=7  | learning | 0.7 * 1.64 = 1.15  | 0.6 * 0.5 = 0.30   | C7 = 1.45
Observe: At t=5 ("I"), the forget gate drops old context (ft=0.2) so most of sentence 1 is erased. By t=6 ("love"), the cell state grows strongly - "love learning" is being committed to memory.
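The whole evolution is a single recurrence. This loop reproduces the arithmetic above; all gate values are the illustrative approximations from the table, not learned ones.

```python
words   = ["Deep", "learning", "is", "fun", "I", "love", "learning"]
f_vals  = [0.9, 0.6, 0.4, 0.85, 0.2, 0.95, 0.7]    # forget gate ft
i_vals  = [0.9, 0.5, 0.3, 0.8, 0.85, 0.95, 0.6]    # input gate it
c_cands = [0.8, 0.4, -0.1, 0.9, 0.7, 0.95, 0.5]    # candidates Ct'

c = 0.0  # C0: cell state starts at zero
for word, f, i, cc in zip(words, f_vals, i_vals, c_cands):
    c = f * c + i * cc     # Ct = ft*Ct-1 + it*Ct'
    print(f"{word:9s} Ct = {c:.2f}")
```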

Complete LSTM Cell Diagram

All gates and states shown together. The brown horizontal line is the Cell State highway. X = element-wise multiply, + = add.

[Diagram: one LSTM cell for a single time step. A horizontal Cell State highway carries C0 in and C1 out, passing through an X (forget) and a + (write). S0 (old state) and X1 (input) feed the three sigmoid gates ft, it, Ot and the tanh candidate Ct'; X marks element-wise multiplication, + marks addition. Example inputs from both sentences are shown entering the cell.]

Reading the Diagram

  • Brown lines = Cell State highway
  • Teal lines = Hidden state flow
  • Pink lines = Weight inputs to gates
  • Yellow X = Element-wise multiply
  • sigma boxes = Sigmoid activation (0 to 1)
  • tanh boxes = Tanh activation (-1 to +1)

Data Flow Order

S0, X1  ->  three sigmoid gates + tanh candidate  ->  ft, it, Ot, Ct'  ->  Ct = it*Ct' + ft*C0  ->  C1 (new cell state)  ->  S1 = Ot * tanh(C1)

Complete Sentence Walkthrough - Both Sentences, All Gates

Sentences: "Deep learning is fun" then "I love learning". Values are illustrative approximations.

t | Word Xt  | ft (forget)      | it (input)        | Ct' (cand.) | Ot (output) | Ct (cell) | St (hidden) | What is happening
0 | start    | -                | -                 | -           | -           | C0 = 0    | S0 = 0      | Initialise all to zero
1 | Deep     | 0.9 keep         | 0.9 write         | +0.80       | 0.70        | 0.72      | 0.43        | "Deep" - strong entity signal, high write and keep
2 | learning | 0.6 partial      | 0.5 partial       | +0.40       | 0.50        | 0.63      | 0.28        | "learning" - not entity, moderate memory update
3 | is       | 0.4 drop some    | 0.3 low write     | -0.10       | 0.30        | 0.22      | 0.07        | "is" - verb, low signal, most context fades
4 | fun      | 0.85 keep        | 0.80 write        | +0.90       | 0.90        | 0.91      | 0.65        | "fun" - strong positive signal, high output
5 | I        | 0.2 RESET        | 0.85 new start    | +0.70       | 0.60        | 0.78      | 0.39        | New sentence - forget gate clears old context
6 | love     | 0.95 strong keep | 0.95 strong write | +0.95       | 0.85        | 1.64      | 0.79        | "love" - highest signal in both sentences
7 | learning | 0.70 keep        | 0.60 write        | +0.50       | 0.65        | 1.45      | 0.58        | "learning" after "love" - different context from t=2
Key Observation: The word "learning" appears at t=2 and t=7. Despite being the same word, the LSTM produces very different cell states (0.63 vs 1.45) because the context is different. At t=2 it follows "Deep". At t=7 it follows "love". The LSTM captures this difference through its accumulated cell state.
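The full pass, including the hidden state, fits in a short loop. The gate values are the illustrative table approximations (not learned weights); the point is that the same word at t=2 and t=7 yields different states.

```python
import math

# (word, ft, it, Ct', Ot) - illustrative values per time step
rows = [
    ("Deep",     0.9,  0.9,  0.80, 0.70),
    ("learning", 0.6,  0.5,  0.40, 0.50),
    ("is",       0.4,  0.3, -0.10, 0.30),
    ("fun",      0.85, 0.8,  0.90, 0.90),
    ("I",        0.2,  0.85, 0.70, 0.60),
    ("love",     0.95, 0.95, 0.95, 0.85),
    ("learning", 0.7,  0.6,  0.50, 0.65),
]
c, states = 0.0, {}
for t, (word, f, i, cc, o) in enumerate(rows, start=1):
    c = f * c + i * cc            # Ct = ft*Ct-1 + it*Ct'
    s = o * math.tanh(c)          # St = Ot * tanh(Ct)
    states[t] = (word, round(c, 2), round(s, 2))

# Same word, different accumulated context: "learning" at t=2 vs t=7
print(states[2], states[7])
```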

Summary: What Each Gate Does

FG
Forget Gate ft
Sigmoid on old cell state. Low value = forget, high value = keep.
IG
Input Gate it + Ct'
Sigmoid decides how much to write. tanh creates the candidate. Together: new info added.
CS
Cell State Ct
Ct = it*Ct' + ft*Ct-1. Long-term memory highway. Updated every step.
OG
Output Gate Ot - St
St = Ot * tanh(Ct). Filters what part of cell state flows to next step.

LSTM vs Vanilla RNN

Feature            | Vanilla RNN                 | LSTM
Memory span        | Short (3-5 steps)           | Long (100+ steps)
Vanishing gradient | Severe problem              | Largely solved
Gates              | None                        | 3 gates + cell state
Parameters         | Fewer                       | Approx 4x more
"learning" context | Context largely faded       | Different output - uses context
Sentence boundary  | No mechanism                | Forget gate resets
Bottom line: For "Deep learning is fun" followed by "I love learning", LSTM correctly understands that the second "learning" has a different meaning because it is preceded by "love".
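The "approx 4x more parameters" row follows from the equations: an LSTM computes four weight blocks (ft, it, Ot, Ct') where a vanilla RNN computes one. A rough count, assuming input size d and hidden size h with biases included (these helper names are illustrative):

```python
def rnn_params(d, h):
    # one (state + input) weight block plus bias
    return h * (d + h) + h

def lstm_params(d, h):
    # four such blocks: forget, input, output gates + tanh candidate
    return 4 * (h * (d + h) + h)

print(lstm_params(100, 128) / rnn_params(100, 128))  # exactly 4.0
```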