Efficient Training with Smart Batch Processing
By the end of this lecture, you will understand the difference between batch, stochastic, and mini-batch gradient descent, know when to use each, and see how batch size trades off speed, memory, and accuracy.
The Problem: Imagine you are a chef who needs to improve a recipe based on customer feedback. You have three different approaches:
🍽️ Batch Gradient Descent (Traditional Method):
Wait for ALL customers of the day to finish eating, collect ALL feedback, then improve your recipe once at the end of the day.
⚡ Stochastic Gradient Descent (Quick Method):
After EACH customer finishes, immediately adjust your recipe based on their feedback alone.
🎯 Mini-batch Gradient Descent (Smart Method):
Wait for a SMALL GROUP of customers (say 10) to finish, collect their feedback, then improve your recipe. Repeat this throughout the day.
In machine learning, the recipe is the model's parameters, the customers are the training examples, and each piece of feedback is a gradient. The three approaches compare as follows:
Method | Data Used Per Update | Speed | Memory Usage | Accuracy |
---|---|---|---|---|
Batch GD | All data (entire dataset) | Slow (few updates) | High | Very Accurate |
Stochastic GD | Single data point | Fast (many updates) | Low | Noisy but gets there |
Mini-batch GD | Small batch (32, 64, 128...) | Balanced | Medium | Good balance |
Think of this as: "New Recipe = Old Recipe - Learning Rate × Feedback" (a code sketch of this rule follows the three examples below).
Example: If you have 1000 customers, you wait for feedback from all 1000 of them, average it, then update your recipe once.
Example: After each customer leaves, immediately adjust your recipe based on their feedback alone.
Example: Wait for 10 customers to finish, average their feedback, update recipe, then repeat with next 10 customers.
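Here is that sketch in minimal NumPy; the toy squared-error loss and names such as train and gradient are my own illustrations, not from any particular library. All three methods apply the exact same update rule and differ only in how much data feeds each update:

```python
import numpy as np

def gradient(w, x, y):
    """The "feedback": gradient of squared error for the toy model prediction = w * x."""
    return 2 * np.mean((w * x - y) * x)

def train(x, y, method="mini-batch", batch_size=10, lr=0.01, epochs=5):
    w = 0.0                                   # the "recipe" we keep improving
    n = len(x)
    for _ in range(epochs):                   # one epoch = one full day of customers
        if method == "batch":                 # serve everyone, then ONE update
            w -= lr * gradient(w, x, y)
        elif method == "stochastic":          # one update after EACH customer
            for i in np.random.permutation(n):
                w -= lr * gradient(w, x[i:i+1], y[i:i+1])
        else:                                 # mini-batch: one update per small group
            idx = np.random.permutation(n)
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                w -= lr * gradient(w, x[batch], y[batch])
    return w
```

The only real difference between the three branches is which slice of the data is handed to gradient, which is exactly the trade-off summarized in the table above.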
Let's say we want to predict house prices using: Price = w × Size + b
We have training data: 4 houses with sizes [1000, 1500, 2000, 2500] sq ft and prices [200k, 300k, 400k, 500k]
Batch GD - Step 1: Calculate the error for ALL 4 houses: error = (w × Size + b) - Price for each house.
Batch GD - Step 2: Average the gradients over all 4 houses and update w and b once.
Stochastic GD: Update after EACH house - 4 updates per pass through the data.
Mini-batch GD (batch size 2): Update after every 2 houses - 2 updates per pass.
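Below is a minimal NumPy sketch of this worked example. Rescaling sizes to thousands of square feet and prices to thousands of dollars, and the learning rate of 0.05, are my own illustrative choices to keep the numbers well behaved; they are not part of the original data statement.

```python
import numpy as np

# The same 4 houses, expressed in thousands: sizes in 1000 sq ft, prices in $1000s.
sizes  = np.array([1.0, 1.5, 2.0, 2.5])
prices = np.array([200.0, 300.0, 400.0, 500.0])

def gradients(w, b, x, y):
    """Step 1: per-house error; Step 2: gradients averaged over the houses given."""
    error = (w * x + b) - y
    return 2 * np.mean(error * x), 2 * np.mean(error)

lr = 0.05

# Batch GD: one update per pass, using ALL 4 houses at once.
w, b = 0.0, 0.0
gw, gb = gradients(w, b, sizes, prices)
w, b = w - lr * gw, b - lr * gb

# Stochastic GD: 4 updates per pass, one after EACH house.
w, b = 0.0, 0.0
for x_i, y_i in zip(sizes, prices):
    gw, gb = gradients(w, b, np.array([x_i]), np.array([y_i]))
    w, b = w - lr * gw, b - lr * gb

# Mini-batch GD (batch size 2): 2 updates per pass, one after every 2 houses.
w, b = 0.0, 0.0
for start in range(0, len(sizes), 2):
    gw, gb = gradients(w, b, sizes[start:start + 2], prices[start:start + 2])
    w, b = w - lr * gw, b - lr * gb
```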
Like Goldilocks and the three bears, we want a batch size that's "just right" - not too big, not too small!
Batch Size | When to Use | Characteristics |
---|---|---|
32 | Small datasets, limited memory | Fast, noisy, good for experimentation |
64 | Most common choice | Good balance for most problems |
128 | Medium to large datasets | More stable, still efficient |
256+ | Large datasets, powerful hardware | Very stable, requires more memory |
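In practice, the batch size is usually just one argument to the data loader. As a hedged illustration, assuming PyTorch is available, the random placeholder tensors and the batch_size=64 choice below are mine and simply mirror the "most common choice" row of the table:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random placeholder data: 1000 samples with 10 features each (illustration only).
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)

# batch_size is the single knob that moves you between the rows of the table above.
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for xb, yb in loader:              # each iteration is one mini-batch update
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()
```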
Batch Size = 1 (SGD): Like sprinting - very fast but exhausting, lots of direction changes
Batch Size = All Data: Like planning the entire marathon route before starting - slow to start but steady
Mini-batch: Like jogging with regular checkpoints - sustainable pace with course corrections
Imagine you're training a model to recognize cats vs dogs with 1000 images:
Batch GD: Look at ALL 1000 images → Calculate average error → Update model ONCE
1 update per epoch, very slow but stable
Stochastic GD: Look at 1 image → Update model → Look at next image → Update again...
1000 updates per epoch, fast but jumpy
Mini-batch GD (batch size 50): Look at 50 images → Calculate average error → Update model → Repeat...
20 updates per epoch, balanced approach
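The arithmetic behind those update counts is simply the dataset size divided by the batch size; a quick sketch using the same 1000-image numbers:

```python
import math

n_images = 1000

for batch_size in (1000, 1, 50):   # batch GD, SGD, mini-batch of 50
    updates_per_epoch = math.ceil(n_images / batch_size)
    print(f"batch size {batch_size}: {updates_per_epoch} updates per epoch")
# Prints 1, 1000, and 20 updates per epoch, matching the three cases above.
```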
Mini-batch gradient descent combines the advantages of both batch and stochastic methods:
🎯 Batch GD Path: Smooth, straight line to minimum (like a train on tracks)
⚡ SGD Path: Zigzag, noisy path that eventually reaches minimum (like a drunk person walking home)
🎯 Mini-batch Path: Slightly curved but mostly straight path (like a careful driver navigating to destination)
Batch GD: Like a tightrope walker who studies the entire rope before taking any step - safe but slow
SGD: Like a juggler who adjusts after each ball throw - quick reactions but chaotic
Mini-batch: Like a trapeze artist who coordinates with a small team - balanced precision and speed
Modern deep learning uses enhanced versions of mini-batch SGD:
Momentum - Simple Explanation: Remember previous updates and use them to build momentum, making convergence faster and more stable.
Adaptive Learning Rates - Simple Explanation: Automatically adjust the learning rate for each parameter individually, making training more robust.
Analogy: Like driving - start fast on the highway, slow down in the city, crawl in the parking lot.
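A minimal NumPy sketch of these two ideas - the formulas below are the standard momentum and AdaGrad-style update rules with commonly used constants, written here for illustration rather than taken from this lecture:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """Momentum: keep a running 'velocity' of past gradients and move along it."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adagrad_step(w, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    """Adaptive step: parameters that have accumulated large gradients get smaller
    steps - the highway / city / parking-lot slowdown, applied per parameter."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    return w - lr * grad / (np.sqrt(grad_sq_sum) + eps), grad_sq_sum

# One update of a 3-parameter model with each rule:
w, velocity, grad_sq_sum = np.zeros(3), np.zeros(3), np.zeros(3)
grad = np.array([0.2, -0.1, 0.05])
w, velocity = momentum_step(w, grad, velocity)
w, grad_sq_sum = adagrad_step(w, grad, grad_sq_sum)
```

Optimizers in the Adam family essentially combine these two ideas.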
Let's see how these approaches compare on a typical image-classification task:
Method | Time per Epoch | Memory Usage | Final Accuracy | Epochs to Converge |
---|---|---|---|---|
Batch GD | 10 minutes | 16 GB | 95.2% | 50 epochs |
SGD (batch=1) | 45 minutes | 2 GB | 94.8% | 100 epochs |
Mini-batch (64) | 8 minutes | 4 GB | 95.1% | 30 epochs |
Mini-batch (128) | 6 minutes | 6 GB | 95.3% | 25 epochs |
Winner: Mini-batch with size 128 - best balance of speed, memory, and accuracy!
You now understand the three main approaches to gradient descent and when to use each!
Remember our restaurant analogy: Mini-batch is like having a smart chef who collects feedback from small groups of customers throughout the day, making gradual improvements that keep everyone happy!
Next steps for becoming an expert:
You've mastered Mini-batch & Stochastic Gradient Descent!
You can now optimize neural networks efficiently and understand the trade-offs between speed, memory, and accuracy.
GitHub Repository: Interactive Deep Learning Lectures
Next Lecture: Lecture 8 - Advanced Optimization Techniques
Created by Prof. Daya Shankar | School of Sciences | Woxsen University
Making Deep Learning Accessible to Everyone