⬅️ Lecture 5: Backward Propagation

The magic behind how neural networks learn. We'll travel backward through the network to correct its mistakes.

❓Part 1: The Big Question: "How Wrong Were We?"

In the last lecture, our network performed Forward Propagation to make a guess. For our student who studied 5 hours and slept 8, it predicted a 67% chance of passing ($\hat{y} = 0.670$).

But what if we know the student actually failed? The correct answer ($y$) was 0. Our network was wrong! Now comes the most important part of deep learning: learning from that mistake.

Backward Propagation (or Backpropagation) is the process of figuring out exactly *how* wrong the network was and assigning blame to every single weight and bias that contributed to the error.

Analogy: The Detective Story

Imagine our neural network is a team of detectives trying to solve a case.
  • Forward Propagation is them making their initial accusation: "We think Butler Bob did it!"
  • The "Error" is new evidence proving Bob is innocent. Their guess was wrong.
  • Backward Propagation is the chief detective going back through the entire investigation, step-by-step, to see where the faulty logic occurred. Which clue was misinterpreted? Which detective made a bad assumption? They trace the error backward from the final accusation to the initial clues to correct their reasoning for the next time.

📉Part 2: Step 1 - Measuring the Mistake (The Loss Function)

Before we can assign blame, we need to quantify the mistake. How wrong is "very wrong"? We use a Loss Function to calculate a single number that represents the total error.

A common choice is the Mean Squared Error (MSE).

$$ L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Simple Explanation:
• For each prediction, calculate the difference between the true answer ($y$) and our prediction ($\hat{y}$).
• Square this difference to make it positive and to penalize larger errors more.
• Average these squared differences across all our examples.
• The final number $L$ is our Loss. A high loss means big mistakes. The goal of training is to get this number as close to zero as possible.

Our Student's Error: The true answer $y$ was 0, but our prediction $\hat{y}$ was 0.670.
The squared error for this one student is $(0 - 0.670)^2 = 0.4489$. This is our starting point for backpropagation.
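
Here is the loss calculation as a minimal Python sketch (reusing our student's numbers; the function name is just for illustration):

```python
def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared differences."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Our single student: true label 0 (failed), prediction 0.670.
print(mse([0.0], [0.670]))  # ~0.4489, matching the hand calculation above
```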

🧭Part 3: Gradients - The Direction of Learning

Okay, we have a Loss of 0.4489. Now what? We need to know *how to change our weights and biases* to make this Loss smaller. Should we increase a weight? Decrease it? By how much?

The answer lies in the gradient. The gradient is a vector of partial derivatives. In simple terms, it tells us the direction of the steepest ascent for the loss function.

Analogy: The Hiker on a Foggy Mountain

Imagine you are a hiker on a mountain, and it's completely foggy. Your goal is to get to the lowest point in the valley (the point of minimum loss). You can't see anything, but you can feel the ground right under your feet.

  • The gradient is the direction the ground slopes **uphill** most steeply from where you are standing.
  • To get to the valley, you need to take a step in the exact **opposite** direction of the gradient.
  • Backpropagation is the process of calculating this gradient for every single weight and bias in our network.
$$ \nabla L = \begin{bmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \vdots \\ \frac{\partial L}{\partial w_k} \end{bmatrix} $$

Simple Explanation: The gradient $\nabla L$ is just a big list of all the partial derivatives. Each number in this list tells us two things about its corresponding weight:
1. The sign (+ or -): Tells us if increasing the weight will increase or decrease the loss.
2. The magnitude: Tells us how much influence this weight has on the final loss. A big number means a big influence.
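
To watch the hiker in action, here is a minimal sketch on a toy one-weight loss $L(w) = w^2$ (not our network, just the simplest possible valley), whose gradient is $2w$:

```python
def grad(w):
    return 2 * w          # gradient of the toy loss L(w) = w**2

w = 3.0                   # start partway up the hillside
learning_rate = 0.1
for step in range(5):
    g = grad(w)                    # positive g: uphill is toward larger w
    w = w - learning_rate * g      # so step the opposite way, downhill
    print(f"step {step}: w = {w:.4f}, loss = {w * w:.4f}")
# w shrinks toward 0 (the valley floor): 2.4, 1.92, 1.536, 1.2288, 0.9830
```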

🔗Part 4: The Secret Weapon - The Chain Rule

Here's the puzzle: the final error $L$ is at the very end of the network. The weights $W^{[1]}$ are at the very beginning. How does a change in $W^{[1]}$ affect $L$? They aren't directly connected!

This is where calculus comes in with a powerful tool: the Chain Rule. It lets us calculate the effect of one thing on another through a long chain of intermediate steps.

Analogy: The Ripple Effect

Imagine you want to know how turning a sprinkler on in your garden ($W^{[1]}$) affects the water level of a distant river ($L$). You can't measure it directly, but you can use the chain rule:

  • How does the sprinkler affect the garden soil moisture? ($\partial(\text{Soil Moisture}) / \partial(\text{Sprinkler})$)
  • How does the soil moisture affect the groundwater level? ($\partial(\text{Groundwater}) / \partial(\text{Soil Moisture})$)
  • How does the groundwater level affect the river level? ($\partial(\text{River}) / \partial(\text{Groundwater})$)

By multiplying these individual effects, you can find the total effect of the sprinkler on the river!

$$ \frac{\partial L}{\partial W^{[1]}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial A^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}} \cdot \frac{\partial Z^{[1]}}{\partial W^{[1]}} $$

Simple Explanation: This looks terrifying, but it's just our ripple effect analogy in math. We are working backward from the Loss $L$, step-by-step, calculating how each part of the forward pass contributed to the final error, until we finally figure out the "blame" for the first set of weights $W^{[1]}$.
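
To see that this multiplication really works, here is a minimal sketch with a made-up three-step chain (standing in for sprinkler → soil → groundwater → river); multiplying the local derivatives matches the end-to-end effect measured with a tiny nudge:

```python
import math

# A made-up chain of three steps: x -> g -> h -> f.
def f_of_x(x):
    g = 2 * x             # local derivative: dg/dx = 2
    h = g ** 2            # local derivative: dh/dg = 2g
    return math.sin(h)    # local derivative: df/dh = cos(h)

x = 0.5
g, h = 2 * x, (2 * x) ** 2
chain_rule = math.cos(h) * (2 * g) * 2   # df/dh * dh/dg * dg/dx

# Sanity check: nudge x a tiny bit and measure the ripple directly.
eps = 1e-6
numeric = (f_of_x(x + eps) - f_of_x(x - eps)) / (2 * eps)
print(chain_rule, numeric)  # both ~2.1612 -- the local effects multiply
```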

⚙️Part 5: The Full Algorithm in 4 Steps

The entire training process for one batch of data can be summarized in four key steps.

1. Forward Pass: make a prediction.
2. Calculate Loss: measure the error.
3. Backward Pass: compute gradients.
4. Update Weights: take a small step.
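
In code, these four steps become the body of a training loop. Below is a minimal sketch for a network shaped like ours (2 inputs, 2 hidden sigmoid neurons, 1 sigmoid output); the starting weights are random placeholders rather than Lecture 4's actual values, and step 4 uses the update rule explained next.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative parameters for a 2-2-1 network (placeholders, not Lecture 4's values).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros((2, 1))   # input -> hidden
W2, b2 = rng.normal(size=(1, 2)), np.zeros((1, 1))   # hidden -> output

x = np.array([[5.0], [8.0]])   # 5 hours studied, 8 hours slept
y = np.array([[0.0]])          # the student actually failed
alpha = 0.1                    # learning rate

for epoch in range(100):
    # 1. Forward pass: make a prediction.
    Z1 = W1 @ x + b1; A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2; y_hat = sigmoid(Z2)
    # 2. Calculate loss: measure the error.
    loss = np.mean((y - y_hat) ** 2)
    # 3. Backward pass: compute gradients with the chain rule.
    dZ2 = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # dL/dZ2
    dW2 = dZ2 @ A1.T; db2 = dZ2
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)            # dL/dZ1
    dW1 = dZ1 @ x.T; db1 = dZ1
    # 4. Update weights: take a small step downhill.
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

print(f"final prediction: {y_hat[0, 0]:.3f}")  # drifts toward the true label 0
```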

The Weight Update Rule

$$ W_{\text{new}} = W_{\text{old}} - \alpha \cdot \frac{\partial L}{\partial W_{\text{old}}} $$

Simple Explanation:
• $W_{\text{new}}$: The new, slightly improved weight.
• $W_{\text{old}}$: The weight we started with.
• $\alpha$ (alpha): The Learning Rate. This is a small number (like 0.01) that controls how big of a step we take. Too big, and we might overshoot the valley. Too small, and it will take forever to get there.
• $\frac{\partial L}{\partial W_{\text{old}}}$: This is the gradient we calculated during backpropagation. It tells us the direction of the hill. We subtract because we want to go downhill!
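
Plugging in numbers makes the rule concrete. A tiny sketch (the weight and gradient values here are made up for illustration):

```python
w_old = 0.5     # the weight we started with (illustrative)
grad  = 0.3     # dL/dw from backpropagation (illustrative)
alpha = 0.1     # learning rate

w_new = w_old - alpha * grad
print(w_new)    # 0.47 -- the positive gradient said "uphill is this way",
                # so we stepped slightly the other way
```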

🔬Part 6: A Backpropagation Step in Action

Let's calculate the "blame" for a single weight: the one connecting Hidden Neuron H2 to the Output, $W^{[2]}_{2,1}$ (which was -0.8). How much did this specific weight contribute to our error of 0.4489?

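Here is that calculation as a minimal sketch. It assumes the output neuron uses a sigmoid (so $\partial \hat{y} / \partial z = \hat{y}(1 - \hat{y})$) and stands in an illustrative value of 0.9 for H2's activation $a_2$, which in practice you'd reuse from Lecture 4's forward pass:

```python
# Chain rule for the single weight W2_21 connecting H2 to the output:
# dL/dW2_21 = dL/dy_hat * dy_hat/dz * dz/dW2_21

y, y_hat = 0.0, 0.670   # true label and our prediction
a2 = 0.9                # H2's activation (illustrative stand-in)

dL_dyhat = 2 * (y_hat - y)          # from L = (y - y_hat)**2  -> 1.3400
dyhat_dz = y_hat * (1 - y_hat)      # sigmoid derivative       -> 0.2211
dz_dw    = a2                       # z = ... + W2_21 * a2 + ...

blame = dL_dyhat * dyhat_dz * dz_dw # ~0.2666: this weight's share of the error

# Apply the update rule with a learning rate of 0.1:
w_old = -0.8
w_new = w_old - 0.1 * blame
print(f"gradient = {blame:.4f}, updated weight = {w_new:.4f}")  # -> -0.8267
```

Notice the sign: the gradient is positive, so the update pushes $W^{[2]}_{2,1}$ slightly further negative, nudging the next prediction closer to the true answer of 0.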