LSTM - Long Short-Term Memory

Understanding Gates, Cell State and Memory in Neural Networks

Running example sentences: "Deep learning is fun"  |  "I love learning"

Why LSTM? The Vanishing Gradient Problem

Standard RNNs struggle to remember information from many steps ago. Gradients shrink to nearly zero during backpropagation. LSTM solves this with gated memory.

Analogy: Reading "Deep learning is fun" then "I love learning." A plain RNN forgets "Deep" by the time it reaches "love". LSTM can hold onto context as long as it is relevant.
Key idea: LSTM uses three gates plus one cell state to control what to remember, what to forget, and what to output at each time step.

The Four Components

  • ft - Forget Gate: decides what to erase from memory
  • it - Input Gate: decides what new info to write
  • Ot - Output Gate: decides what to send forward
  • Ct - Cell State: the long-term memory highway

Inputs and Notation

  • St-1 = previous hidden state (old state)
  • Xt = current input word vector
  • Ct-1 = previous cell state
  • Wf, Wi, Wo, Wc = recurrent weight matrices (applied to St-1)
  • Uf, Ui, Uo, Uc = input weight matrices (applied to Xt)
  • sigma = sigmoid function (output: 0 to 1)
  • tanh = hyperbolic tangent (output: -1 to 1)
Bias terms are omitted throughout for clarity.

All LSTM Equations

ft = sigma( Wf * St-1 + Uf * Xt )
it = sigma( Wi * St-1 + Ui * Xt )
Ot = sigma( Wo * St-1 + Uo * Xt )
Ct' = tanh( Wc * St-1 + Uc * Xt )
Ct = ( it * Ct' ) + ( ft * Ct-1 )
St = Ot * tanh( Ct )
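The six equations can be sketched as one function. This is a minimal scalar version, assuming toy scalar weights in place of the real weight matrices and omitting biases; the name `lstm_step` is illustrative, not a library API.

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^-x), squashes any input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(s_prev, c_prev, x, W):
    # W holds one scalar (state, input) weight pair per gate;
    # a real LSTM uses weight matrices and bias vectors instead.
    f = sigmoid(W["Wf"] * s_prev + W["Uf"] * x)          # forget gate ft
    i = sigmoid(W["Wi"] * s_prev + W["Ui"] * x)          # input gate it
    o = sigmoid(W["Wo"] * s_prev + W["Uo"] * x)          # output gate Ot
    c_cand = math.tanh(W["Wc"] * s_prev + W["Uc"] * x)   # candidate Ct'
    c = i * c_cand + f * c_prev                          # Ct = it*Ct' + ft*Ct-1
    s = o * math.tanh(c)                                 # St = Ot * tanh(Ct)
    return s, c

# One step from a zero state with all weights set to 1.0 (toy values):
W = {k: 1.0 for k in ("Wf", "Uf", "Wi", "Ui", "Wo", "Uo", "Wc", "Uc")}
s1, c1 = lstm_step(0.0, 0.0, 1.0, W)
print(f"S1 = {s1:.3f}, C1 = {c1:.3f}")
```

Note how the cell state update is purely additive and multiplicative; there is no extra nonlinearity on the Ct-1 path, which is what lets gradients flow over long spans.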
FG
Forget Gate
Decides what old information to erase
  • Forget Gate is represented as ft
  • Uses a sigmoid activation function
  • Output is between 0 and 1 for each cell state value
  • 0 = completely forget  |  1 = completely keep
  • Takes St-1 (old state) and Xt (current input)
ft = sigma( Wf * St-1 + Uf * Xt )
Wf, Uf = forget-gate weights (for St-1 and Xt)  |  St-1 = old state  |  Xt = Input
What it does: First sigmoid in the network. Takes previous hidden state and current input. Values near 0 mean forget, near 1 mean remember.

Sigmoid Function - Visual

sigma(x) = 1 / (1 + e^-x) - output always between 0 and 1
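In place of a plot, sampling the function at a few points shows the squashing behaviour: large negative inputs map near 0 (forget), large positive inputs near 1 (keep).

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

# Sample sigma(x) to see the 0-to-1 squashing around the midpoint x = 0.
for x in (-6, -2, 0, 2, 6):
    print(f"sigma({x:+d}) = {sigmoid(x):.3f}")
```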

Forget Gate - Sentence Walkthrough

Processing: "Deep learning is fun" then "I love learning"

Word (Xt) | Old state (St-1)  | ft output   | Decision
Deep      | S0 = 0 (start)    | approx 0.9  | Keep most
learning  | S1 (from "Deep")  | approx 0.6  | Keep partial
is        | S2                | approx 0.4  | Forget more
fun       | S3                | approx 0.85 | Keep most
I         | S4 (from sent. 1) | approx 0.2  | Forget old context
love      | S5                | approx 0.95 | Strong keep
learning  | S6                | approx 0.7  | Keep (new context)
Key insight: When "I" arrives at the second sentence, ft is approx 0.2 - it clears most of the first sentence's context because a new sentence has begun.
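The sentence-boundary reset above is just a multiplication. A small sketch with the illustrative values from the table: a low ft scales the old cell state down, wiping most of sentence 1 from memory.

```python
# Illustrative values from the walkthrough, not learned quantities.
c_prev = 0.91     # cell state after "fun" (end of sentence 1)
f_reset = 0.2     # forget gate output when "I" starts sentence 2

# ft * Ct-1: the only part of the old memory that survives
kept = f_reset * c_prev
print(f"kept after reset: {kept:.3f}")
```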
IG
Input Gate
Decides what new information to write to memory
  • Input Gate is represented as it
  • Uses a sigmoid function - same structure as forget gate
  • Works together with the intermediate cell state Ct' (tanh)
  • Two parallel computations happen simultaneously
it = sigma( Wi * St-1 + Ui * Xt )
Ct' = tanh( Wc * St-1 + Uc * Xt )
tanh output: -1 to +1 (can represent adding or removing info)
What it does: sigmoid (it) decides how much of the candidate (Ct') to write. Together: it * Ct' = new information added to cell state.

Input Gate - Sentence Walkthrough

What new info gets written to cell state at each step?

Word     | it value    | Ct' value    | Written (it * Ct')
Deep     | approx 0.9  | approx +0.8  | +0.72 strong write
learning | approx 0.5  | approx +0.4  | +0.20 partial write
is       | approx 0.3  | approx -0.1  | -0.03 tiny erase
fun      | approx 0.8  | approx +0.9  | +0.72 strong write
I        | approx 0.85 | approx +0.7  | +0.60 new sentence
love     | approx 0.95 | approx +0.95 | +0.90 strong emotion
learning | approx 0.6  | approx +0.5  | +0.30 moderate write
Why "love" gets a strong write: In "I love learning", the word "love" carries strong semantic signal. The input gate writes this information into the cell state so future words can use this emotional context.
OG
Output Gate
Decides what part of cell state to expose as hidden state
  • Output Gate is represented as Ot
  • Uses a sigmoid activation function
  • Controls what the next hidden state (St) will be
  • Determines what information leaves the LSTM cell
  • Output St becomes the next cell's St-1
Ot = sigma( Wo * St-1 + Uo * Xt )
St = Ot * tanh( Ct )
tanh squashes Ct to [-1, +1]. Ot gates how much of it flows out.
What it does: Highlights which information from the cell state should pass to the next hidden state. The output gate filters what is currently relevant.
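A small sketch of St = Ot * tanh(Ct): tanh first bounds the cell state to (-1, +1), then Ot scales how much of it leaves the cell. The cell-state value is the illustrative one after "love".

```python
import math

c_t = 1.64               # illustrative cell state after "love"
for o_t in (0.1, 0.85):  # nearly closed vs. mostly open output gate
    # same memory, very different exposure to the next step
    print(f"Ot={o_t}: St = {o_t * math.tanh(c_t):.2f}")
```

The same memory can thus be held internally (gate near 0) or broadcast strongly (gate near 1) depending on what is currently relevant.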

Output Gate - Sentence Walkthrough

Word     | Ot          | tanh(Ct)     | St = Ot * tanh(Ct)
Deep     | approx 0.7  | approx +0.62 | +0.43 outputs concept
learning | approx 0.5  | approx +0.56 | +0.28
is       | approx 0.3  | approx +0.22 | +0.07 weak signal
fun      | approx 0.9  | approx +0.72 | +0.65 strong output
I        | approx 0.6  | approx +0.65 | +0.39 fresh start
love     | approx 0.85 | approx +0.93 | +0.79 strong output
learning | approx 0.65 | approx +0.90 | +0.58 contextual
Note on "fun" and "love": Both produce high output values because they carry strong semantic content. The LSTM pushes this information strongly into the next hidden state.
CS
Cell State - The Memory Highway
Long-term memory running through the entire sequence

The cell state is the "conveyor belt" of the LSTM. Information can flow along it with minimal modification - gates selectively write to and read from it.

Ct' = tanh( Wc * St-1 + Uc * Xt )
What could be written to memory
Ct = ( it * Ct' ) + ( ft * Ct-1 )
it * Ct' = new info written  |  ft * Ct-1 = old info kept
St = Ot * tanh( Ct )
Big picture: Cell state combines what to remember from before (ft * Ct-1) with what new thing to write (it * Ct'). This is the core of LSTM's long-term memory capability.

Cell State Evolution - Step by Step

Tracking C through both sentences

Step | Word     | ft * Ct-1 (keep)   | it * Ct' (new)     | Ct (result)
t=0  | start    | -                  | -                  | C0 = 0
t=1  | Deep     | 0.9 * 0 = 0        | 0.9 * 0.8 = 0.72   | C1 = 0.72
t=2  | learning | 0.6 * 0.72 = 0.43  | 0.5 * 0.4 = 0.20   | C2 = 0.63
t=3  | is       | 0.4 * 0.63 = 0.25  | 0.3 * -0.1 = -0.03 | C3 = 0.22
t=4  | fun      | 0.85 * 0.22 = 0.19 | 0.8 * 0.9 = 0.72   | C4 = 0.91
t=5  | I        | 0.2 * 0.91 = 0.18  | 0.85 * 0.7 = 0.60  | C5 = 0.78
t=6  | love     | 0.95 * 0.78 = 0.74 | 0.95 * 0.95 = 0.90 | C6 = 1.64
t=7  | learning | 0.7 * 1.64 = 1.15  | 0.6 * 0.5 = 0.30   | C7 = 1.45
Observe: At t=5 ("I"), the forget gate drops old context (ft=0.2) so most of sentence 1 is erased. By t=6 ("love"), the cell state grows strongly - "love learning" is being committed to memory.
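The whole evolution is a single recurrence. This loop reproduces the arithmetic above; all gate values are the illustrative approximations from the table, not learned ones.

```python
words   = ["Deep", "learning", "is", "fun", "I", "love", "learning"]
f_vals  = [0.9, 0.6, 0.4, 0.85, 0.2, 0.95, 0.7]    # forget gate ft
i_vals  = [0.9, 0.5, 0.3, 0.8, 0.85, 0.95, 0.6]    # input gate it
c_cands = [0.8, 0.4, -0.1, 0.9, 0.7, 0.95, 0.5]    # candidates Ct'

c = 0.0  # C0: cell state starts at zero
for word, f, i, cc in zip(words, f_vals, i_vals, c_cands):
    c = f * c + i * cc     # Ct = ft*Ct-1 + it*Ct'
    print(f"{word:9s} Ct = {c:.2f}")
```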

Complete LSTM Cell Diagram

All gates and states shown together. The brown horizontal line is the Cell State highway. X = element-wise multiply, + = add.

[Diagram: one LSTM cell for a single time step. A horizontal Cell State highway carries C0 in and C1 out, passing through an X (forget) and a + (write). S0 (old state) and X1 (input) feed the three sigmoid gates ft, it, Ot and the tanh candidate Ct'; X marks element-wise multiplication, + marks addition. Example inputs from both sentences are shown entering the cell.]

Reading the Diagram

  • Brown lines = Cell State highway
  • Teal lines = Hidden state flow
  • Pink lines = Weight inputs to gates
  • Yellow X = Element-wise multiply
  • sigma boxes = Sigmoid activation (0 to 1)
  • tanh boxes = Tanh activation (-1 to +1)

Data Flow Order

S0, X1  ->  three sigmoid gates + tanh candidate  ->  ft, it, Ot, Ct'  ->  Ct = it*Ct' + ft*C0  ->  C1 (new cell state)  ->  S1 = Ot * tanh(C1)

Complete Sentence Walkthrough - Both Sentences, All Gates

Sentences: "Deep learning is fun" then "I love learning". Values are illustrative approximations.

t | Word Xt  | ft (forget)      | it (input)        | Ct' (cand.) | Ot (output) | Ct (cell) | St (hidden) | What is happening
0 | start    | -                | -                 | -           | -           | C0 = 0    | S0 = 0      | Initialise all to zero
1 | Deep     | 0.9 keep         | 0.9 write         | +0.80       | 0.70        | 0.72      | 0.43        | "Deep" - strong entity signal, high write and keep
2 | learning | 0.6 partial      | 0.5 partial       | +0.40       | 0.50        | 0.63      | 0.28        | "learning" - not entity, moderate memory update
3 | is       | 0.4 drop some    | 0.3 low write     | -0.10       | 0.30        | 0.22      | 0.07        | "is" - verb, low signal, most context fades
4 | fun      | 0.85 keep        | 0.80 write        | +0.90       | 0.90        | 0.91      | 0.65        | "fun" - strong positive signal, high output
5 | I        | 0.2 RESET        | 0.85 new start    | +0.70       | 0.60        | 0.78      | 0.39        | New sentence - forget gate clears old context
6 | love     | 0.95 strong keep | 0.95 strong write | +0.95       | 0.85        | 1.64      | 0.79        | "love" - highest signal in both sentences
7 | learning | 0.70 keep        | 0.60 write        | +0.50       | 0.65        | 1.45      | 0.58        | "learning" after "love" - different context from t=2
Key Observation: The word "learning" appears at t=2 and t=7. Despite being the same word, the LSTM produces very different cell states (0.63 vs 1.45) because the context is different. At t=2 it follows "Deep". At t=7 it follows "love". The LSTM captures this difference through its accumulated cell state.
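The full pass, including the hidden state, fits in a short loop. The gate values are the illustrative table approximations (not learned weights); the point is that the same word at t=2 and t=7 yields different states.

```python
import math

# (word, ft, it, Ct', Ot) - illustrative values per time step
rows = [
    ("Deep",     0.9,  0.9,  0.80, 0.70),
    ("learning", 0.6,  0.5,  0.40, 0.50),
    ("is",       0.4,  0.3, -0.10, 0.30),
    ("fun",      0.85, 0.8,  0.90, 0.90),
    ("I",        0.2,  0.85, 0.70, 0.60),
    ("love",     0.95, 0.95, 0.95, 0.85),
    ("learning", 0.7,  0.6,  0.50, 0.65),
]
c, states = 0.0, {}
for t, (word, f, i, cc, o) in enumerate(rows, start=1):
    c = f * c + i * cc            # Ct = ft*Ct-1 + it*Ct'
    s = o * math.tanh(c)          # St = Ot * tanh(Ct)
    states[t] = (word, round(c, 2), round(s, 2))

# Same word, different accumulated context: "learning" at t=2 vs t=7
print(states[2], states[7])
```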

Summary: What Each Gate Does

FG
Forget Gate ft
Sigmoid on old cell state. Low value = forget, high value = keep.
IG
Input Gate it + Ct'
Sigmoid decides how much to write. tanh creates the candidate. Together: new info added.
CS
Cell State Ct
Ct = it*Ct' + ft*Ct-1. Long-term memory highway. Updated every step.
OG
Output Gate Ot - St
St = Ot * tanh(Ct). Filters what part of cell state flows to next step.

LSTM vs Vanilla RNN

Feature            | Vanilla RNN                 | LSTM
Memory span        | Short (3-5 steps)           | Long (100+ steps)
Vanishing gradient | Severe problem              | Largely solved
Gates              | None                        | 3 gates + cell state
Parameters         | Fewer                       | Approx 4x more
"learning" context | Context largely faded       | Different output - uses context
Sentence boundary  | No mechanism                | Forget gate resets
Bottom line: For "Deep learning is fun" followed by "I love learning", LSTM correctly understands that the second "learning" has a different meaning because it is preceded by "love".
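The "approx 4x more parameters" row follows from the equations: an LSTM computes four weight blocks (ft, it, Ot, Ct') where a vanilla RNN computes one. A rough count, assuming input size d and hidden size h with biases included (these helper names are illustrative):

```python
def rnn_params(d, h):
    # one (state + input) weight block plus bias
    return h * (d + h) + h

def lstm_params(d, h):
    # four such blocks: forget, input, output gates + tanh candidate
    return 4 * (h * (d + h) + h)

print(lstm_params(100, 128) / rnn_params(100, 128))  # exactly 4.0
```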