Teaching a computer not just *what* an object is, but *where* it is.
Libraries: YOLO (Ultralytics), OpenCV, TensorFlow • Estimated Time: 3 hours
So far, we've built models that can look at an image and say "this is a cat." That's Image Classification. But what if there's a cat, a dog, and a duck in the same image? How do we find each one?
This is Object Detection. The goal is to produce two things for every object in an image: a **bounding box** (rectangle coordinates telling us *where* the object is) and a **class label** with a confidence score (telling us *what* the object is).
Older object detection systems were slow: they examined many candidate regions of an image one by one. YOLO (You Only Look Once) revolutionized this by being incredibly fast while remaining accurate. It looks at the entire image just once and predicts all the bounding boxes and class probabilities simultaneously.
Think of it like this: to find your keys in a room, you don't scan every square inch. You glance around the whole room and your brain instantly picks out key-like shapes. YOLO works in a similar, clever way.
YOLO divides the input image into a grid. For each grid cell, it predicts several bounding boxes and the probability that an object's center falls within that cell. It then combines this with class predictions to produce the final detections.
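To make the grid idea concrete, here is a toy sketch of the classic YOLO grid assignment in plain Python. This is an illustration of the concept, not the real model internals (the grid size `7` here is just an example value):

```python
# Toy illustration of YOLO's grid assignment (not the real implementation).
# An object's center (cx, cy) falls into exactly one cell of an SxS grid;
# that cell is "responsible" for predicting the object's bounding box.

def responsible_cell(cx, cy, img_size=640, grid_size=7):
    cell = img_size / grid_size   # width/height of one grid cell in pixels
    col = int(cx // cell)         # grid column containing the center
    row = int(cy // cell)         # grid row containing the center
    return row, col

# A dog centered at (320, 100) in a 640x640 image with a 7x7 grid:
print(responsible_cell(320, 100))  # -> (1, 3)
```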
We'll be using the `ultralytics` library, which provides a super easy-to-use implementation of the latest YOLO models (like YOLOv8). We also need `OpenCV` for handling images and videos.
The `ultralytics` library offers different model sizes, each with a trade-off between speed and accuracy. Try loading a medium-sized model like `'yolov8m.pt'` or the large (and very accurate) `'yolov8l.pt'`. The first time you use a model, the library will download it automatically.
Let's start by detecting objects in a single image. First, we need an image. You can upload your own to Colab or download one from the internet.
You should see the original image with bounding boxes drawn around the people and the bus, along with their class labels and confidence scores!
Instead of just displaying the image, you can save the annotated version to a file. The `predict` method has a `save=True` argument that does this automatically, placing the result in a `runs/detect/predict` folder. Try it!
The `results[0]` object is powerful. You can access the raw data directly.
By default, YOLO shows detections with a confidence of 0.25 or higher. You can be more strict. Run the prediction again, but this time set the `conf` argument. How do the results change if you set it to `0.7`?
IoU (Intersection over Union) controls how overlapping boxes are handled during non-maximum suppression (NMS): boxes that overlap an already-accepted box by more than the `iou` threshold are discarded as duplicates, so a lower threshold (e.g., `0.3`) will cause the model to discard more redundant boxes. Run the prediction again, this time adjusting the `iou` parameter. Do you see any difference in the bus image?
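It helps to see what IoU actually measures before tweaking the parameter. A minimal implementation of the metric itself:

```python
# Intersection over Union for two boxes in (x1, y1, x2, y2) format.
def iou(a, b):
    # Corners of the overlapping rectangle (if any).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes shifted by half a width: 50 px overlap / 150 px union.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # -> 0.333...
```

During NMS, two boxes whose IoU exceeds the threshold are treated as the same object, and the lower-confidence one is dropped.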
This is where YOLO truly shines. We can run it on a live video stream from your webcam (if you're running this on your local machine) or on a video file in Colab.
First, let's download a sample video.
Now, we can call the model on the video path. We'll set `stream=True` to process it frame-by-frame efficiently.
After running this, check your file browser in Colab. You'll find a new `runs` directory containing the processed video with bounding boxes drawn on it. You can download and play it!
What if you only want to detect people and ignore everything else? You can use the `classes` argument. The class ID for 'person' is 0. Run the prediction on the video again, but only look for people. Does the output video look cleaner?
Processing at a higher resolution gives more accuracy but is slower. Use the `imgsz` argument to set the image size. Try running detection on the video at `imgsz=320` and then at `imgsz=1280`. It won't save a video, but it will print the processing speed for each frame. Compare the average speed (in ms) for each resolution.
Simply counting objects in each frame is flawed. A car that stays in view for 100 frames gets counted 100 times. A better approach is to count objects as they cross a virtual line. This is a real-world technique used in traffic analysis.
Hint: You'll need a dictionary to store the positions of objects from the previous frame. This requires a simple tracking mechanism. A good starting point is to assume an object in a similar position in the next frame is the same object.
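A simplified, pure-Python sketch of the line-crossing idea. It assumes an upstream tracker already gives each object a stable ID and a center point per frame (here that input is hard-coded, and the counting line is a hypothetical horizontal line at `y = 100`):

```python
# Toy line-crossing counter. Assumes a tracker supplies per-object centers;
# real code would build `tracks` from YOLO detections frame by frame.
LINE_Y = 100  # hypothetical horizontal counting line

def count_crossings(tracks):
    """tracks maps object ID -> list of (x, y) centers, one per frame."""
    crossings = 0
    for _, centers in tracks.items():
        for (_, y_prev), (_, y_curr) in zip(centers, centers[1:]):
            # Count when the center moves from above the line to below it.
            if y_prev < LINE_Y <= y_curr:
                crossings += 1
    return crossings

# Object 1 crosses downward; object 2 stays above the line.
tracks = {1: [(50, 80), (52, 95), (55, 110)], 2: [(200, 40), (201, 60)]}
print(count_crossings(tracks))  # -> 1
```

Because each object is counted once, at the moment it crosses, a car that lingers in view for 100 frames no longer inflates the total.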
The true power of YOLO is training it on your own custom dataset. This is the biggest challenge in object detection and a highly valuable skill.
The `COCO8` dataset is a small example dataset provided by Ultralytics, perfect for learning how to train. It has 8 images and the corresponding labels in YOLO format.
Your goal is to "fine-tune" the pre-trained YOLOv8 model on this tiny dataset. The `ultralytics` library makes this surprisingly easy.
Running this code will download the dataset, start the training process, and save your new, fine-tuned model weights in the `runs/detect/train/` directory. You can then load this new model and use it just like you used the pre-trained one!