Imagine a student who memorizes textbook answers perfectly but fails when asked slightly different questions on the exam. This is overfitting - when a model learns the training data too well but can't generalize to new, unseen data.
Overfitting occurs when a model fits the noise and idiosyncrasies of the training data rather than the underlying pattern.
Key Sign: Large gap between training and validation performance
Training Error: E_train = (1/n) × Σ(y_true - y_pred)²
Test Error: E_test = (1/m) × Σ(y_true - y_pred)²
Overfitting occurs when: E_test >> E_train
Good generalization when: E_test ≈ E_train
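The gap can be measured directly. Below is a minimal sketch on synthetic data (the dataset, the degree-15 polynomial, and the train/test split are illustrative assumptions, not part of any particular workflow):

# Sketch: a large gap between training and test error signals overfitting
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 1))                     # 30 noisy samples of a sine curve
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=30)
X_train, X_test = X[:20], X[20:]
y_train, y_test = y[:20], y[20:]

# A very flexible model on very little data tends to memorize the training set
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

e_train = mean_squared_error(y_train, model.predict(X_train))
e_test = mean_squared_error(y_test, model.predict(X_test))
print(f"E_train = {e_train:.3f}, E_test = {e_test:.3f}")  # E_test >> E_train indicates overfitting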
Image Classification Model:
Problem: The model memorized specific cats, lighting conditions, and backgrounds instead of learning what makes a cat a cat.
Dropout randomly turns off (sets to zero) a percentage of neurons during training. This prevents neurons from becoming too dependent on each other and forces the network to learn more robust features.
Think of it as: Training a sports team where random players sit out each practice - everyone becomes more versatile!
During Training:
For each neuron i with dropout probability p:
• Keep neuron: probability = (1-p)
• Drop neuron: probability = p
• If kept: output = input / (1-p) [scaling factor]
• If dropped: output = 0
During Testing:
• Use all neurons without dropout
• No scaling needed
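As a rough illustration, the inverted-dropout rule above can be written in a few lines of NumPy (this is a simplified sketch, not how Keras implements it internally):

# Minimal NumPy sketch of inverted dropout
import numpy as np

def dropout_forward(activations, p, training=True):
    """Apply inverted dropout with drop probability p."""
    if not training or p == 0.0:
        return activations                                # testing: use all neurons, no scaling
    mask = np.random.rand(*activations.shape) >= p        # keep each neuron with probability (1 - p)
    return activations * mask / (1.0 - p)                 # scale kept neurons by 1/(1 - p)

x = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_forward(x, p=0.5))                           # roughly half the entries zeroed, the rest doubled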
Python Code:
# Keras/TensorFlow
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.3))   # drop 30% of this layer's outputs during training
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))   # drop 50% of this layer's outputs during training
model.add(Dense(10, activation='softmax'))
Best Practices:
• Use dropout rates of roughly 0.2-0.5 for hidden layers
• Apply dropout during training only; use all neurons at inference time
• Combine dropout with other regularizers (e.g., L2, early stopping) in deep networks
L1 and L2 regularization add penalty terms to the loss function to control model complexity:
Goal: Prevent weights from becoming too large, reducing overfitting
Original Loss Function:
L = (1/n) × Σ(y_true - y_pred)²
L1 Regularized Loss:
L_L1 = (1/n) × Σ(y_true - y_pred)² + λ₁ × Σ|wᵢ|
L2 Regularized Loss:
L_L2 = (1/n) × Σ(y_true - y_pred)² + λ₂ × Σwᵢ²
Elastic Net (L1 + L2):
L_EN = (1/n) × Σ(y_true - y_pred)² + λ₁×Σ|wᵢ| + λ₂×Σwᵢ²
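To make the penalty terms concrete, here is a small sketch that computes them by hand (the λ values and arrays are illustrative):

# Sketch of adding L1/L2 penalty terms to an MSE loss
import numpy as np

def regularized_loss(y_true, y_pred, weights, lam1=0.01, lam2=0.01):
    mse = np.mean((y_true - y_pred) ** 2)          # (1/n) × Σ(y_true - y_pred)²
    l1_penalty = lam1 * np.sum(np.abs(weights))    # λ₁ × Σ|wᵢ|
    l2_penalty = lam2 * np.sum(weights ** 2)       # λ₂ × Σwᵢ²
    return mse + l1_penalty + l2_penalty           # Elastic Net when both λ > 0

w = np.array([0.8, 0.5, 0.0])
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(regularized_loss(y_true, y_pred, w))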
Example weights after training:
• L1 (Lasso): Feature 1: 0.8, Feature 2: 0.5, Feature 3: 0.0 → eliminates weak features (sparse solution)
• L2 (Ridge): Feature 1: 0.7, Feature 2: 0.4, Feature 3: 0.2 → shrinks all weights without zeroing them

Scenario | Use L1 When | Use L2 When |
---|---|---|
Feature Selection | ✅ Want automatic feature selection | ❌ Want to keep all features |
Interpretability | ✅ Need sparse, interpretable model | ❌ Complexity is acceptable |
Stability | ❌ Can be unstable with correlated features | ✅ More stable, handles correlation well |
Computational Cost | ✅ Faster inference (sparse model) | ❌ All features computed |
# Scikit-learn
from sklearn.linear_model import Lasso, Ridge
# L1 Regularization
lasso = Lasso(alpha=0.01) # alpha = λ
lasso.fit(X_train, y_train)
# L2 Regularization
ridge = Ridge(alpha=0.01)
ridge.fit(X_train, y_train)
# Neural Networks (Keras)
from keras.regularizers import l1, l2
model.add(Dense(64, kernel_regularizer=l2(0.01)))
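To see the sparsity difference from the comparison above in action, a minimal experiment on synthetic data might look like this (the dataset, alpha values, and the deliberately irrelevant third feature are assumptions for illustration):

# Comparing L1 sparsity vs. L2 shrinkage on synthetic data
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([3.0, 1.5, 0.0])                 # the third feature is irrelevant
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("L1 (Lasso) coefficients:", np.round(lasso.coef_, 2))  # weak feature typically driven to exactly 0
print("L2 (Ridge) coefficients:", np.round(ridge.coef_, 2))  # all coefficients shrunk, none exactly 0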
Early stopping monitors validation performance during training and halts training as soon as that performance stops improving, preventing the model from overfitting.
Think of it as: A wise coach who knows when the team has practiced enough!
Parameters:
• patience = number of epochs to wait without improvement before stopping
• min_delta = minimum change in validation loss that counts as an improvement
Algorithm:
1. Initialize: best_loss = ∞, wait_count = 0
2. For each epoch:
   a. Calculate validation_loss
   b. If validation_loss < (best_loss - min_delta):
      best_loss = validation_loss
      wait_count = 0
      save_model()
   c. Else: wait_count += 1
   d. If wait_count >= patience: STOP
3. Restore best model
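The same loop can be sketched in plain Python; train_one_epoch and evaluate below are caller-supplied helper functions assumed for illustration, not part of any library:

# Plain-Python sketch of the early-stopping loop above
import copy, math

def fit_with_early_stopping(model, train_one_epoch, evaluate,
                            patience=10, min_delta=0.001, max_epochs=100):
    """train_one_epoch(model) runs one training pass; evaluate(model) returns validation loss."""
    best_loss, wait_count, best_model = math.inf, 0, None
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)
        if val_loss < best_loss - min_delta:       # meaningful improvement
            best_loss, wait_count = val_loss, 0
            best_model = copy.deepcopy(model)      # save the best model so far
        else:
            wait_count += 1
            if wait_count >= patience:             # patience exhausted: stop
                break
    return best_model                              # restore the best model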
# Keras Implementation
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor='val_loss',           # watch validation loss
    patience=10,                  # stop after 10 epochs without improvement
    min_delta=0.001,              # minimum change that counts as improvement
    restore_best_weights=True     # roll back to the best weights seen
)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          callbacks=[early_stopping])
Best Practices:
• Monitor validation loss (val_loss), not training loss
• Use a patience of roughly 5-20 epochs so short plateaus do not end training prematurely
• Set restore_best_weights=True (or save checkpoints) so the best model is the one you keep
Data augmentation creates new training examples by applying label-preserving transformations to existing data. This increases dataset size and improves model robustness.
Key principle: Transform the input while keeping the label the same
Geometric Transformations:
• Rotation: [x', y'] = [x×cos(θ) - y×sin(θ), x×sin(θ) + y×cos(θ)]
• Scaling: [x', y'] = [s×x, s×y]
• Translation: [x', y'] = [x + dx, y + dy]
Photometric Transformations:
• Brightness: I' = I + β
• Contrast: I' = α × I
• Gamma correction: I' = I^γ
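As a sketch, the photometric transforms above can be applied directly with NumPy (the parameter values and the random image are illustrative):

# NumPy sketch of the photometric transforms
import numpy as np

def adjust_brightness(img, beta):
    return np.clip(img + beta, 0.0, 1.0)        # I' = I + β

def adjust_contrast(img, alpha):
    return np.clip(alpha * img, 0.0, 1.0)       # I' = α × I

def gamma_correct(img, gamma):
    return np.clip(img ** gamma, 0.0, 1.0)      # I' = I^γ

img = np.random.rand(32, 32, 3)                 # image with values in [0, 1]
augmented = gamma_correct(adjust_contrast(adjust_brightness(img, 0.1), 1.2), 0.8)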
# Keras ImageDataGenerator
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # rotate up to ±20 degrees
    width_shift_range=0.2,    # shift horizontally by up to 20% of the width
    height_shift_range=0.2,   # shift vertically by up to 20% of the height
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'       # fill newly exposed pixels with the nearest values
)

# Training with augmentation
batch_size = 32               # illustrative batch size
model.fit(datagen.flow(X_train, y_train, batch_size=batch_size),
          steps_per_epoch=len(X_train) // batch_size,
          epochs=100)
Different regularization techniques address different aspects of overfitting, so they are usually combined in practice.
Combined effect: more robust and generalizable models
Full Formula:
L_total = L_original + λ₁×Σ|wᵢ| + λ₂×Σwᵢ² + L_dropout_implicit
Where:
• L_original = Base prediction loss
• λ₁×Σ|wᵢ| = L1 penalty term
• λ₂×Σwᵢ² = L2 penalty term
• L_dropout_implicit = Regularization from dropout
• Early stopping + Data augmentation provide additional regularization
Problem Type | Best Combination | Reasoning |
---|---|---|
Image Classification | Data Aug + Dropout + L2 + Early Stop | Images benefit from augmentation, CNNs from dropout |
Text Classification | Dropout + L2 + Early Stop | Text augmentation is tricky, focus on model regularization |
Small Dataset | Heavy Data Aug + Light L2 + Early Stop | Augmentation crucial when data is limited |
High-Dimensional | L1 + L2 + Early Stop | L1 for feature selection, L2 for stability |
Deep Networks | Dropout + BatchNorm + L2 + Early Stop | Deep networks need strong regularization |
# Complete regularized model
import tensorflow as tf
from tensorflow.keras import layers, regularizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Model with all regularization techniques
model = tf.keras.Sequential([
    layers.Flatten(),  # flatten image batches from the generator for the Dense layers
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01)),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

# Data augmentation
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True
)

# Training
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(
    datagen.flow(X_train, y_train, batch_size=32),
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stop]
)
Technique | Best For | Parameters | Implementation |
---|---|---|---|
Dropout | Deep neural networks | 0.2-0.5 for hidden layers | Add Dropout() layers |
L1 Regularization | Feature selection needed | λ = 0.001-0.01 | kernel_regularizer=l1(λ) |
L2 Regularization | Weight control, stability | λ = 0.001-0.01 | kernel_regularizer=l2(λ) |
Early Stopping | All training scenarios | patience = 5-20 | EarlyStopping callback |
Data Augmentation | Limited data, images | 2-10x increase | ImageDataGenerator |
Question: You have a deep neural network with 95% training accuracy but only 70% validation accuracy. Which regularization techniques would you apply?