Understanding Gates, Cell State and Memory in Neural Networks
Standard RNNs struggle to remember information from many steps back: during backpropagation through time the gradients shrink toward zero, so early inputs stop influencing learning. The LSTM addresses this with gated memory.
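The shrinking can be seen with one line of arithmetic. A sketch, assuming each backprop step scales the gradient by a constant factor of 0.9 (an illustrative stand-in, not a number measured from a real network):

```python
# Why gradients vanish in a plain RNN: backprop through T steps multiplies
# T per-step factors; if each factor is below 1, the product decays
# geometrically toward zero.
factor = 0.9  # illustrative per-step gradient scale
for steps in (5, 20, 100):
    print(f"after {steps:3d} steps: gradient scale ~ {factor ** steps:.2e}")
```

After 100 steps the gradient is five orders of magnitude smaller, which is why early words effectively stop contributing to learning.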
An LSTM cell adds four components:

- Forget gate (ft): decides what to erase from memory
- Input gate (it): decides what new information to write
- Output gate (Ot): decides what to send forward as the hidden state
- Cell state (Ct): the long-term memory highway

Each gate uses the sigmoid function, sigma(x) = 1 / (1 + e^(-x)), whose output always lies between 0 and 1: near 0 means "block", near 1 means "let through".
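A minimal sketch of the gating idea (the function name and sample inputs here are mine, chosen for illustration):

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real number into (0, 1) -- the gate's 'fraction to pass'."""
    return 1.0 / (1.0 + math.exp(-x))

# A gate multiplies a value by sigmoid(score):
print(sigmoid(4.0))   # near 1 -> keep almost everything
print(sigmoid(-4.0))  # near 0 -> erase almost everything
print(sigmoid(0.0))   # exactly 0.5 -> keep half
```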
Forget gate (ft): processing "Deep learning is fun" and then "I love learning". At each step, ft decides how much of the old state survives.
| Word (Xt) | Old state (St-1) | ft output | Decision |
|---|---|---|---|
| Deep | S0 = 0 (start) | approx 0.9 | Keep most |
| learning | S1 (from Deep) | approx 0.6 | Keep partial |
| is | S2 | approx 0.4 | Forget more |
| fun | S3 | approx 0.85 | Keep most |
| I | S4 (from sent 1) | approx 0.2 | Forget old context |
| love | S5 | approx 0.95 | Strong keep |
| learning | S6 | approx 0.7 | Keep (new context) |
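As a sketch, the forget gate is a sigmoid over the previous state and the current input. The weights `Wf` and bias `bf` below are made-up illustrative numbers, not trained values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Forget gate sketch: ft = sigmoid(Wf . [S_{t-1}, Xt] + bf).
# Wf and bf are illustrative, not trained weights.
def forget_gate(prev_state, x, Wf=(0.5, 1.5), bf=0.1):
    score = Wf[0] * prev_state + Wf[1] * x + bf
    return sigmoid(score)

# A strongly positive score pushes ft toward 1 (keep memory),
# a strongly negative one pushes ft toward 0 (erase memory).
print(forget_gate(prev_state=0.3, x=2.0))   # near 1: keep most
print(forget_gate(prev_state=0.3, x=-2.0))  # near 0: forget old context
```

In a trained network the weights learn which inputs (like a sentence boundary) should trigger a low ft and reset the context.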
Input gate (it): what new information gets written to the cell state at each step?
| Word | it value | Ct' value | Written (it * Ct') |
|---|---|---|---|
| Deep | approx 0.9 | approx +0.8 | +0.72 strong write |
| learning | approx 0.5 | approx +0.4 | +0.20 partial write |
| is | approx 0.3 | approx -0.1 | -0.03 tiny erase |
| fun | approx 0.8 | approx +0.9 | +0.72 strong write |
| I | approx 0.85 | approx +0.7 | +0.60 new sentence |
| love | approx 0.95 | approx +0.95 | +0.90 strong emotion |
| learning | approx 0.6 | approx +0.5 | +0.30 moderate write |
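The "Written" column above is just the product of the two parts: it (0 to 1, how much to write) times the candidate Ct' (-1 to 1, what to write). A one-liner using the table's illustrative numbers:

```python
# Input-gate write: it (how much) times Ct' (what).
def written(i_t, c_candidate):
    return i_t * c_candidate

print(round(written(0.9, 0.8), 2))   # "Deep": 0.72, strong write
print(round(written(0.3, -0.1), 2))  # "is": -0.03, tiny erase
```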
Output gate (Ot): what gets sent forward as the hidden state, St = Ot * tanh(Ct).

| Word | Ot | tanh(Ct) | St = Ot * tanh(Ct) |
|---|---|---|---|
| Deep | approx 0.7 | approx +0.62 | +0.43 outputs concept |
| learning | approx 0.5 | approx +0.56 | +0.28 |
| is | approx 0.3 | approx +0.22 | +0.07 weak signal |
| fun | approx 0.9 | approx +0.72 | +0.65 strong output |
| I | approx 0.6 | approx +0.65 | +0.39 fresh start |
| love | approx 0.85 | approx +0.93 | +0.79 strong output |
| learning | approx 0.65 | approx +0.90 | +0.58 contextual |
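The formula in the header can be checked directly. Here tanh squashes the unbounded cell state back into (-1, 1) before the output gate decides how much to expose; the numbers are the illustrative Ot and Ct values for "fun" at t=4:

```python
import math

# Hidden state read-out: St = Ot * tanh(Ct).
def hidden_state(o_t, c_t):
    return o_t * math.tanh(c_t)

# "fun" at t=4: Ot ~ 0.9, Ct ~ 0.91 (from the cell-state tracking table)
print(round(hidden_state(0.9, 0.91), 2))  # -> 0.65
```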
The cell state is the "conveyor belt" of the LSTM. Information can flow along it with minimal modification - gates selectively write to and read from it.
Tracking the cell state through both sentences with the update Ct = ft * Ct-1 + it * Ct':
| Step | Word | ft*Ct-1 (keep) | it*Ct' (new) | Ct (result) |
|---|---|---|---|---|
| t=0 | start | - | - | C0 = 0 |
| t=1 | Deep | 0.9 * 0 = 0 | 0.9 * 0.8 = 0.72 | C1 = 0.72 |
| t=2 | learning | 0.6 * 0.72 = 0.43 | 0.5 * 0.4 = 0.20 | C2 = 0.63 |
| t=3 | is | 0.4 * 0.63 = 0.25 | 0.3 * -0.1 = -0.03 | C3 = 0.22 |
| t=4 | fun | 0.85 * 0.22 = 0.19 | 0.8 * 0.9 = 0.72 | C4 = 0.91 |
| t=5 | I | 0.2 * 0.91 = 0.18 | 0.85 * 0.7 = 0.60 | C5 = 0.78 |
| t=6 | love | 0.95 * 0.78 = 0.74 | 0.95 * 0.95 = 0.90 | C6 = 1.64 |
| t=7 | learning | 0.7 * 1.64 = 1.15 | 0.6 * 0.5 = 0.30 | C7 = 1.45 |
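The whole Ct column can be reproduced in a few lines by applying the update Ct = ft * Ct-1 + it * Ct' with the illustrative gate values from the tables above:

```python
# (word, ft, it, Ct') per step -- the illustrative values from the tables.
steps = [
    ("Deep",     0.90, 0.90,  0.80),
    ("learning", 0.60, 0.50,  0.40),
    ("is",       0.40, 0.30, -0.10),
    ("fun",      0.85, 0.80,  0.90),
    ("I",        0.20, 0.85,  0.70),
    ("love",     0.95, 0.95,  0.95),
    ("learning", 0.70, 0.60,  0.50),
]

C = 0.0  # C0: memory starts empty
for word, f_t, i_t, c_cand in steps:
    C = f_t * C + i_t * c_cand  # keep a fraction of old memory, write new
    print(f"{word:8s} Ct = {C:.2f}")
```

Running this reproduces the Ct column, ending at C7 = 1.45.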
All gates and states shown together. In the standard LSTM diagram, the horizontal line running across the top of the cell is the cell-state highway; x marks element-wise multiplication and + element-wise addition.
Sentences: "Deep learning is fun" then "I love learning". Values are illustrative approximations.
| t | Word Xt | ft (forget) | it (input) | Ct' (candidate) | Ot (output) | Ct (cell state) | St (hidden) | What is happening |
|---|---|---|---|---|---|---|---|---|
| 0 | start | - | - | - | - | C0 = 0 | S0 = 0 | Initialise all to zero |
| 1 | Deep | 0.9 keep | 0.9 write | +0.80 | 0.70 | 0.72 | 0.43 | "Deep" - strong entity signal, high write and keep |
| 2 | learning | 0.6 partial | 0.5 partial | +0.40 | 0.50 | 0.63 | 0.28 | "learning" - not entity, moderate memory update |
| 3 | is | 0.4 drop some | 0.3 low write | -0.10 | 0.30 | 0.22 | 0.07 | "is" - verb, low signal, most context fades |
| 4 | fun | 0.85 keep | 0.80 write | +0.90 | 0.90 | 0.91 | 0.65 | "fun" - strong positive entity, high output |
| 5 | I | 0.2 RESET | 0.85 new start | +0.70 | 0.60 | 0.78 | 0.39 | New sentence - forget gate clears old context |
| 6 | love | 0.95 strong keep | 0.95 strong write | +0.95 | 0.85 | 1.64 | 0.79 | "love" - highest signal in both sentences |
| 7 | learning | 0.70 keep | 0.60 write | +0.50 | 0.65 | 1.45 | 0.58 | "learning" after "love" - different context from t=2 |
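Wiring the pieces together, here is one full LSTM step for scalar states. The gate values are supplied directly (as in the table) rather than computed from trained weights, which this sketch does not have:

```python
import math

# One LSTM step: cell-state update followed by gated read-out.
def lstm_step(C_prev, f_t, i_t, c_cand, o_t):
    C_t = f_t * C_prev + i_t * c_cand   # keep old memory + write new info
    S_t = o_t * math.tanh(C_t)          # expose a gated view of the memory
    return C_t, S_t

# t=6, "love": the strongest signal in both sentences
C6, S6 = lstm_step(C_prev=0.78, f_t=0.95, i_t=0.95, c_cand=0.95, o_t=0.85)
print(round(C6, 2), round(S6, 2))  # -> 1.64 0.79
```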
| Feature | Vanilla RNN | LSTM |
|---|---|---|
| Memory span | Short (3-5 steps) | Long (100+ steps) |
| Vanishing gradient | Severe problem | Largely solved |
| Gates | None | 3 gates + cell state |
| Parameters | Fewer | Approx 4x more |
| "learning" context | Same output both times | Different - uses context |
| Sentence boundary | No mechanism | Forget gate resets |
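The "approx 4x more" row follows from counting weights: an LSTM repeats the RNN's weight/bias set once per gate plus the candidate. A sketch with hypothetical sizes, counting only the recurrent cell (no output projection):

```python
# Parameters of one recurrent cell with input size d and hidden size h.
def rnn_params(d, h):
    # one weight matrix over [input, hidden] plus one bias vector
    return h * (d + h) + h

def lstm_params(d, h):
    # four such sets: forget, input, candidate, output
    return 4 * rnn_params(d, h)

d, h = 100, 128  # hypothetical sizes for illustration
print(rnn_params(d, h), lstm_params(d, h))
print(lstm_params(d, h) / rnn_params(d, h))  # -> exactly 4.0
```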