Lab 6: 🖼️ CNN for Image Recognition

Teaching a computer to "see" with the brain of a computer vision expert.

Libraries: TensorFlow, Keras, OpenCV • Estimated Time: 3 hours

Part 1: A Better Way to See

In our last labs, we used "Dense" neural networks. We flattened our 28x28 images into a long 784-pixel line. This worked, but it's not how we see things. When you look for a cat in a photo, you don't scan pixel by pixel. You look for shapes: pointy ears, whiskers, a cat-like nose. The *spatial relationship* between pixels matters!

Flattening an image throws that relationship away. A Convolutional Neural Network (CNN) is designed to preserve it. It acts like a detective, scanning an image with a set of "magnifying glasses" (called filters) to find specific features like edges, corners, textures, and eventually, complex shapes like eyes or wheels.
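
To make the point about flattening concrete, here is a minimal NumPy sketch. The 4x4 "image" is a made-up toy example, not CIFAR-10 data, and the variable names are chosen only for illustration.

```python
import numpy as np

# A tiny 4x4 "image" whose pixel values are simply their flattened positions.
image = np.arange(16).reshape(4, 4)
flat = image.flatten()

# Two pixels that are vertical neighbours in the 2D image...
top, bottom = image[1, 2], image[2, 2]

# ...end up a full row width apart in the flattened vector.
gap = int(np.where(flat == bottom)[0][0] - np.where(flat == top)[0][0])
print("2D neighbours:", top, bottom)
print("Distance after flattening:", gap)
```

In a 28x28 image the same two vertical neighbours would sit 28 positions apart, and a Dense layer has no built-in way to know they were ever adjacent.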

The Two Core Ideas of a CNN:

  • Convolution (`Conv2D`): The act of sliding a filter over an image to detect features and create "feature maps".
  • Pooling (`MaxPooling2D`): The act of down-sampling or summarizing the feature maps to make the model more efficient and robust.

Today, we'll build a CNN to classify images from the CIFAR-10 dataset, which contains 10 types of objects like airplanes, cars, birds, and cats.

Part 2: The CNN Toolkit Explained

The Convolutional Layer (`Conv2D`)

Imagine a tiny 3x3 pixel magnifying glass. This is our filter (or kernel). This filter contains a pattern, for example, a pattern for a vertical edge. We slide this filter over every part of our main image. When the filter is over a part of the image that matches its pattern, it gives a high activation score. The resulting grid of scores is called a feature map.
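
The sliding-filter idea can be sketched in plain NumPy. This is a simplified illustration (no padding, stride 1, a single channel), and the helper name `convolve2d_valid` and the hand-written vertical-edge filter are assumptions made for this example. In a real `Conv2D` layer the filter values are learned during training rather than written by hand.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            # Multiply the patch under the filter element-wise and sum the result.
            patch = image[row:row + kh, col:col + kw]
            feature_map[row, col] = np.sum(patch * kernel)
    return feature_map

# 5x5 image: dark (0) on the left, bright (1) on the right, so it has one vertical edge.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written vertical-edge filter: responds to dark-to-bright transitions.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

feature_map = convolve2d_valid(image, vertical_edge)
print(feature_map)
```

Every row of the resulting 3x3 feature map is `[3, 3, 0]`: the score is high wherever the filter window straddles the edge and zero over the uniformly bright region.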

A single `Conv2D` layer doesn't just have one filter; it has many (e.g., 32 or 64). Each filter learns to detect a different feature. One might learn to find horizontal edges, another might find green-to-blue transitions, and another might find a specific curve.

[Figure: animation of the convolution operation]

The Pooling Layer (`MaxPooling2D`)

After a convolution, we have a bunch of detailed feature maps. A pooling layer simplifies them. The most common type is Max Pooling. It takes a small window (e.g., 2x2 pixels) on the feature map and keeps only the maximum value from that window.
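
Max pooling with a 2x2 window can be sketched in a few lines of NumPy. The helper `max_pool_2x2` and the sample values are made up for this example, and the sketch assumes the feature map has even height and width.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 window (height and width must be even)."""
    h, w = feature_map.shape
    # Split rows and columns into pairs, then take the max inside each pair-by-pair window.
    windows = feature_map.reshape(h // 2, 2, w // 2, 2)
    return windows.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])

pooled = max_pool_2x2(feature_map)
print(pooled)
```

The 4x4 map shrinks to 2x2, `[[4, 2], [2, 8]]`, keeping only the strongest response from each window.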

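Why do this? Pooling shrinks each feature map (a 2x2 window halves its height and width), so later layers have fewer values to process and the model becomes more efficient. Keeping only the strongest response in each window also makes the model more robust: a feature that shifts by a pixel or two still produces nearly the same pooled output.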

Part 3: Loading the CIFAR-10 Dataset

Let's get our hands dirty. We'll load the CIFAR-10 dataset, which is built into Keras.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt

# Load the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Normalize pixel values to be between 0 and 1
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

print(f"x_train shape: {x_train.shape}") # (50000, 32, 32, 3) -> 50k images, 32x32 pixels, 3 color channels (RGB)
print(f"y_train shape: {y_train.shape}") # (50000, 1) -> 50k labels

💡 Your Turn: Visualize the Data

Let's see what these images look like. Use the code below to display the first 25 images. Can you tell what each one is?

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

plt.figure(figsize=(10,10))
for i in range(25):
  plt.subplot(5,5,i+1)
  plt.xticks([])
  plt.yticks([])
  plt.grid(False)
  plt.imshow(x_train[i])
  plt.xlabel(class_names[y_train[i][0]])
plt.show()

Part 4: Assembling Our First CNN

Let's build a simple CNN. A common pattern is to stack a few `Conv2D` and `MaxPooling2D` layers, then follow it up with a `Flatten` layer and a few `Dense` layers for classification.

model = keras.Sequential([
  # Input Layer - specify the shape of our images
  keras.Input(shape=(32, 32, 3)),

  # Convolutional Block 1
  layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'),
  layers.MaxPooling2D(pool_size=(2, 2)),

  # Convolutional Block 2
  layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same'),
  layers.MaxPooling2D(pool_size=(2, 2)),

  # Classifier Head
  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dense(10, activation='softmax')
])

model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                    Output Shape          Param #
=================================================================
 conv2d (Conv2D)                 (None, 32, 32, 32)    896
 max_pooling2d (MaxPooling2D)    (None, 16, 16, 32)    0
 conv2d_1 (Conv2D)               (None, 16, 16, 64)    18496
 max_pooling2d_1 (MaxPooling2D)  (None, 8, 8, 64)      0
 flatten (Flatten)               (None, 4096)          0
 dense (Dense)                   (None, 128)           524416
 dense_1 (Dense)                 (None, 10)            1290
=================================================================
Total params: 545,098
Trainable params: 545,098
Non-trainable params: 0

Detective's Note: Look at the output shapes. After the first `MaxPooling2D` layer, the image dimensions are halved from 32x32 to 16x16. After the second, they are halved again to 8x8. The `Flatten` layer then unrolls this `8x8x64` tensor into a long vector of `4096` features to feed into the classifier.
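
The parameter counts in the summary can be checked by hand. The sketch below uses two hypothetical helper functions (they are not part of Keras) that apply the usual counting rule: one weight per connection plus one bias per filter or unit.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_features, units):
    """Each unit has one weight per input feature plus one bias."""
    return (in_features + 1) * units

conv1 = conv2d_params(3, 3, 3, 32)     # 3x3 filters over 3 RGB channels
conv2 = conv2d_params(3, 3, 32, 64)    # 3x3 filters over 32 feature maps
flat_features = 8 * 8 * 64             # size of the tensor that Flatten unrolls
dense1 = dense_params(flat_features, 128)
dense2 = dense_params(128, 10)
total = conv1 + conv2 + dense1 + dense2

print(conv1, conv2, flat_features, dense1, dense2, total)
```

The printed values (896, 18496, 4096, 524416, 1290, 545098) match the summary above. Pooling and `Flatten` layers contribute no parameters because they have nothing to learn.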

💡 Your Turn: Experiment with Kernel Size

The `kernel_size` determines the size of the "magnifying glass." We used `(3, 3)`. Rebuild the model but change the `kernel_size` in the first `Conv2D` layer to `(5, 5)`. Run `model.summary()` again. How does this change the number of parameters in that layer? Why do you think that is?

Part 5: Training Our CNN

💡 Your Turn: Choose an Optimizer

We've been using `'adam'`, which is a great default. But what about others? Before running the main training, compile the model with a different optimizer, like `'sgd'`, and train it for 15 epochs. Does it train faster or slower? Is the final accuracy better or worse than Adam's?

model.compile(optimizer='sgd',  # Try this one!
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

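Now, let's compile with Adam and train the model properly. If you trained with SGD above, re-run the model definition cell from Part 4 first: calling `compile` again does not reset the weights, so the model would otherwise continue from where SGD left off.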

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

history = model.fit(x_train, y_train, epochs=15, validation_data=(x_test, y_test))
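
The `'sparse_categorical_crossentropy'` loss passed to `compile` works directly on integer labels like CIFAR-10's: for each image it takes the probability the model assigned to the correct class and penalizes it with a negative logarithm. The NumPy function below is a simplified sketch of that idea with made-up probabilities. It is not the Keras implementation, which also clips probabilities for numerical stability.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability assigned to the correct class)."""
    y_true = np.asarray(y_true).reshape(-1)
    correct_class_prob = y_prob[np.arange(len(y_true)), y_true]
    return float(np.mean(-np.log(correct_class_prob)))

# Two made-up softmax outputs over 3 classes (each row sums to 1).
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]])

good_labels = np.array([[0], [2]])  # integer labels shaped (N, 1), like y_train
bad_labels = np.array([[1], [0]])   # the same predictions scored against other labels

good_loss = sparse_categorical_crossentropy(good_labels, y_prob)
bad_loss = sparse_categorical_crossentropy(bad_labels, y_prob)
print(f"confident and correct: {good_loss:.3f}")
print(f"confident and wrong:   {bad_loss:.3f}")
```

The loss is small when the model puts high probability on the right class and grows quickly when it does not, which is the signal the optimizer uses to update the filters.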

💡 Your Turn: Plot the Results

Training can take a few minutes. Once it's done, use the code below to plot the training history. Does the validation accuracy improve over time? Do you see any signs of overfitting (where training accuracy keeps going up but validation accuracy flattens or goes down)?

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.ylim([0, 1])  # full range, so early epochs below 50% accuracy are not cut off
plt.legend(loc='lower right')
plt.show()
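
Besides reading the plot, the same question can be answered numerically. The sketch below uses a hypothetical helper, `best_epoch`, and made-up accuracy curves (not real training results) shaped like `history.history`. With a real run you would pass `history.history['val_accuracy']` instead.

```python
def best_epoch(val_accuracy):
    """Return (1-based epoch, value) at which validation accuracy peaked."""
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    return best_index + 1, val_accuracy[best_index]

# Made-up curves for illustration only.
example_history = {
    "accuracy":     [0.45, 0.58, 0.66, 0.72, 0.78, 0.84],
    "val_accuracy": [0.50, 0.60, 0.65, 0.68, 0.67, 0.66],
}

epoch, value = best_epoch(example_history["val_accuracy"])
final_gap = example_history["accuracy"][-1] - example_history["val_accuracy"][-1]

print(f"Validation accuracy peaked at epoch {epoch} ({value:.2f})")
print(f"Final train/validation gap: {final_gap:.2f}")
```

A validation peak well before the last epoch, combined with a growing gap between the two curves, is the overfitting pattern described above.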

Part 6: Peeking Inside the Black Box

Let's do something amazing: let's visualize the feature maps from our convolutional layers to see what the network is actually detecting.

# Get the outputs of the first four layers
layer_outputs = [layer.output for layer in model.layers[:4]]
activation_model = keras.Model(inputs=model.input, outputs=layer_outputs)

# Pick an image to visualize (e.g., the 7th test image, an automobile)
img_tensor = np.expand_dims(x_test[6], axis=0)
activations = activation_model.predict(img_tensor)

# Function to plot the feature maps of one layer in a grid
def display_activation(activations, col_size, row_size, act_index):
  activation = activations[act_index]
  activation_index = 0
  # figsize is (width, height): width scales with columns, height with rows
  fig, ax = plt.subplots(row_size, col_size, figsize=(col_size*1.5, row_size*1.5))
  for row in range(0, row_size):
    for col in range(0, col_size):
      ax[row][col].imshow(activation[0, :, :, activation_index], cmap='viridis')
      ax[row][col].axis('off')  # hide tick marks so the maps are easier to compare
      activation_index += 1
  plt.show()

# Display the 32 feature maps from the first Conv2D layer
print("First Conv Layer Activations")
display_activation(activations, 8, 4, 0)

When you run this code, you will see grids of images. Each small image is a feature map. Notice how the first layer's maps are very basic (simple edges, colors).

💡 Your Turn: Visualize Different Features

The code above shows the activations for an automobile. Find an image of a cat in the `x_test` set (label 3, the index of 'cat' in `class_names`). Change `x_test[6]` to the index of your cat image and re-run the cell. Do the feature maps look different? Can you spot any filters that seem to be activating on "cat-like" features (like ears or fur texture)?
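
One way to locate images of a given class is to search the label array with NumPy. The sketch below uses a small made-up label array, `y_example`, as a stand-in for `y_test` so that it runs on its own. With the real data you would use `y_test` in its place.

```python
import numpy as np

# Stand-in for y_test: CIFAR-10 labels have shape (N, 1). These values are made up.
y_example = np.array([[5], [3], [8], [3], [0]])

CAT = 3  # index of 'cat' in class_names

# Flatten the (N, 1) labels to (N,) and collect the positions equal to CAT.
cat_indices = np.where(y_example.flatten() == CAT)[0]
print(cat_indices)
```

For this toy array the result is `[1 3]`. Any index found this way in `y_test` can replace the `6` in `x_test[6]`.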

Part 7: Your Mission - Build a Better CNN

Assignment: Improve the CIFAR-10 Classifier

Our simple CNN gets around 70-75% accuracy. Your goal is to improve this! Can you get it above 80%? Create a new, better model architecture in your Colab notebook.

Ideas to Try:

  1. Go Deeper: Add a third convolutional block (`Conv2D` + `MaxPooling2D`). Deeper networks can learn more complex hierarchies of features.
  2. More Filters: Instead of `32` and `64` filters, try `64` and `128`, or even more.
  3. Use Dropout: Add `layers.Dropout(0.25)` after your pooling layers or `layers.Dropout(0.5)` after your dense layers to fight overfitting.
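  4. Batch Normalization: Add `layers.BatchNormalization()` after `Conv2D` or `Dense` layers. To apply it before the activation, remove `activation='relu'` from the layer and add `layers.Activation('relu')` after the normalization layer. This can help stabilize and speed up training.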
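
The layer ordering behind ideas 3 and 4 can be sketched as follows. The helper `conv_block`, the filter counts, and the dropout rate are assumptions chosen for illustration, not a recommended final architecture. Treat it as a starting point for your own experiments.

```python
from tensorflow import keras
from tensorflow.keras import layers

def conv_block(filters, dropout_rate=0.25):
    """Conv2D -> BatchNormalization -> ReLU -> MaxPooling2D -> Dropout, as a list of layers."""
    return [
        # No activation here: it is applied after batch normalization.
        # use_bias=False because BatchNormalization learns its own offset.
        layers.Conv2D(filters, kernel_size=(3, 3), padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(dropout_rate),
    ]

sketch_model = keras.Sequential(
    [keras.Input(shape=(32, 32, 3))]
    + conv_block(32)
    + conv_block(64)
    + [layers.Flatten(), layers.Dense(10, activation='softmax')]
)

print(sketch_model.output_shape)
```

Building blocks from a small function like this makes it easy to try a third block or different filter counts by changing one line.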

For your final submission, present your best model. Train it for at least 20 epochs, plot its history, and report its final test accuracy.

Part 8: Bonus - Cats vs. Dogs

The "Cats vs. Dogs" Kaggle competition is a classic entry point into real-world image classification. Unlike CIFAR-10, the images are of different sizes and are stored in folders on your disk.

Kaggle: Dogs vs. Cats

The goal is simple: given an image, predict whether it's a cat or a dog. This is a binary classification problem.

Your Challenge:

  1. Data Loading: You cannot load this dataset directly like CIFAR-10. You need to download the data from Kaggle and use `tf.keras.utils.image_dataset_from_directory` to load the images. This is a critical skill for working with your own datasets. You will also need to resize all images to a standard size (e.g., 180x180).
  2. Data Augmentation: To prevent overfitting and make your model more robust, create a data augmentation layer: `keras.Sequential([layers.RandomFlip("horizontal"), layers.RandomRotation(0.1),])`. Place this at the beginning of your model.
  3. Adapt your Model: Your final `Dense` layer must have only **1 neuron** and use the **`'sigmoid'` activation function**. The loss function should be `'binary_crossentropy'`.
  4. Train and Evaluate: Build the best CNN you can and see how high you can get your accuracy!
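
To see why a single sigmoid neuron is enough for step 3, here is a minimal NumPy sketch of the sigmoid output and the binary cross-entropy loss. The raw outputs are made up, and the label convention (0 = cat, 1 = dog) is an assumption for this example: check how your own dataset assigns labels.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)]."""
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

raw_outputs = np.array([2.0, -1.0, 0.0])  # made-up outputs of the single neuron before sigmoid
y_true = np.array([1.0, 0.0, 1.0])        # assumed convention: 1 = dog, 0 = cat

p_dog = sigmoid(raw_outputs)              # probability of "dog" for each image
predictions = (p_dog >= 0.5).astype(int)  # threshold at 0.5 to get a class
loss = binary_crossentropy(y_true, p_dog)

print(np.round(p_dog, 3), predictions, round(loss, 3))
```

One number per image is sufficient because the probability of "cat" is simply one minus the probability of "dog".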

Part 9: Submission Guidelines

  1. Complete all "Your Turn" tasks and the Part 7 assignment in a single Google Colab notebook. The Cats vs. Dogs Kaggle project (Part 8) is a bonus.
  2. For the assignment, present the code for your final, best model. Include the `model.summary()` output.
  3. Show the code used to train your model and the plot of its training/validation history.
  4. Add a Text Cell at the end reporting the final test accuracy and summarizing the architectural choices you made to achieve it.
  5. Ensure all your code cells have been run so that their outputs and plots are visible.
  6. When you are finished, generate a shareable link. In Colab, click "Share" and set access to "Anyone with the link" with the "Viewer" role.
  7. Click "Copy link" and submit this link as your assignment.