Lab 1: Python & Environment Setup

🐍 Your journey into AI starts here. Master the tools of the trade.

Estimated Time: 3 hours

Part 1: Your AI Workbench - Jupyter & Colab

Before writing code, we need to set up our workspace. We will use an interactive environment to experiment, visualize, and share our work. This is where Jupyter notebooks shine.

What is a Jupyter Notebook?

A Jupyter Notebook is an interactive document that lets you mix live code, equations, visualizations, and narrative text in blocks called "cells". It's the standard for data science and AI because it's perfect for exploratory work.

Google Colaboratory (Colab): Your Best Starting Point

Google Colab is a free Jupyter notebook environment hosted in the cloud and accessed entirely through your browser. No installation is required!

Getting Started with Colab

  1. Go to colab.research.google.com and sign in with a Google account.
  2. Click "New notebook".
  3. You'll see your interactive workspace. At the top left, click "+ Code" to add a cell for Python code and "+ Text" to add a cell for notes (like this text you're reading).
  4. Type code in a code cell and run it by clicking the play button on the left or pressing Shift + Enter.

🚀 Exercise: Your First Colab Notebook

1. Open a new Google Colab notebook and name it "Lab 1".

2. Create a text cell and write "My First AI Lab".

3. Below it, create a code cell and type `a = 10; b = 20; print(f"The sum is {a+b}")`.

4. Run the code cell. The output should appear directly below it. This confirms your environment is working!

Part 2: Python Fundamentals for AI

Now that your environment is ready, let's learn the language of AI. Python's simple syntax lets you focus on the logic rather than the language itself. Below, we'll cover the essentials you'll use every day. For each example, read the explanation, study the code, and then try the "Your Turn" task to solidify your understanding.

2.1 Basic Data Types

In AI, you're constantly working with different types of data: numbers for pixel values or model weights, text for natural language processing, and booleans for controlling program flow.

# Numbers can be whole (integers) or have decimals (floats)
pixel_intensity = 255 # An integer, common in image data (0-255)
learning_rate = 0.001 # A float, a key hyperparameter in model training

# Text data is stored in strings
model_name = "ImageClassifier_v1"

# Booleans represent True or False states
is_training = True # Useful for changing model behavior (e.g., turning on dropout)

# We can use an f-string to easily print variables
print(f"Model: {model_name}, LR: {learning_rate}")
Model: ImageClassifier_v1, LR: 0.001

💡 Your Turn

Declare a variable for `batch_size` (an integer) and another for `model_accuracy` (a float). Print them out in a formatted string.

2.2 Data Structures: Lists & Dictionaries

Lists are ordered collections, perfect for storing a sequence of features or data points. Dictionaries are collections of key-value pairs, accessed by key rather than by position, which makes them ideal for storing model configurations or labeled data.

# A list stores a sequence of items, accessed by a numerical index (starting from 0).
# Imagine these are features for a house: [area_sq_ft, num_bedrooms, age_years]
house_features = [1500, 3, 10]
print(f"Number of bedrooms: {house_features[1]}") # Access the second item

# A dictionary stores key-value pairs, accessed by the key.
# This is perfect for storing hyperparameters.
hyperparameters = {
  "learning_rate": 0.01,
  "epochs": 50,
  "optimizer": "Adam"
}
print(f"Optimizer: {hyperparameters['optimizer']}") # Access the value associated with the 'optimizer' key
Number of bedrooms: 3
Optimizer: Adam

💡 Your Turn

Create a dictionary for a `student` with keys 'name', 'major', and a list of 'courses'. Print the student's name and their second course.

2.3 Control Flow: Loops & Conditionals

Training a model involves iterating through your dataset thousands of times (`for` loops) and making decisions based on performance (`if/else` statements).

# We often loop through a mini-batch of loss values from our model
loss_values = [0.8, 0.5, 0.2, 0.1]
for loss in loss_values:
  # Use an if/else statement to decide what to do
  if loss < 0.5:
    print(f"Loss {loss:.2f} is good. Continuing training.") # .2f formats to 2 decimal places
  else:
    print(f"Loss {loss:.2f} is high. Check model.")
Loss 0.80 is high. Check model.
Loss 0.50 is high. Check model.
Loss 0.20 is good. Continuing training.
Loss 0.10 is good. Continuing training.

💡 Your Turn

Create a list of accuracies (e.g., `[0.91, 0.85, 0.99]`). Loop through them and print "Excellent" if accuracy is > 0.9, "Good" if > 0.8, and "Needs Improvement" otherwise.
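
The example above only used if/else; a three-way decision like this calls for elif. Here is a minimal sketch of the if/elif/else shape on a single placeholder value (the variable and value are illustrative; looping over the list is still up to you):

# Sketch: a three-way decision with if/elif/else
accuracy = 0.87 # a placeholder value, not one from the task's list
if accuracy > 0.9:
  print("Excellent")
elif accuracy > 0.8: # checked only if the first condition was False
  print("Good")
else:
  print("Needs Improvement")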

2.4 Functions

Functions are crucial for writing clean, reusable code. In AI, you'll write functions for data preprocessing, model building, and training steps.

# A function to preprocess an image (a common task)
def preprocess_image(image_data, target_size):
  """A placeholder function to demonstrate structure."""
  print(f"Resizing image to {target_size}x{target_size}...")
  # In a real scenario, code to resize and normalize would be here
  resized_image = "some_processed_data"
  return resized_image

# Call the function, passing target_size as a keyword argument for clarity
my_image = "raw_image_data"
processed = preprocess_image(my_image, target_size=224)
print(f"Function returned: {processed}")
Resizing image to 224x224...
Function returned: some_processed_data

💡 Your Turn

Write a function `calculate_average_loss` that takes a list of loss values and returns their average.

Part 3: The AI Power Tools

Let's explore the three most important libraries for any AI practitioner. These are pre-installed in Colab.

3.1 NumPy: The Bedrock of AI Math

NumPy is critical because it provides the `ndarray` (N-dimensional array), the data structure that modern AI frameworks (like TensorFlow and PyTorch) model their tensors on. Operations on NumPy arrays run in optimized, compiled code, making them far faster than equivalent Python loops. Your image data, model weights, and feature vectors will all be arrays of this kind.

import numpy as np # 'np' is the universal alias for numpy

# Create a 1D array (a vector)
vector = np.array([1, 2, 3])
# Create a 2D array (a matrix)
matrix = np.array([ [1, 2], [3, 4] ])

print(f"Vector shape: {vector.shape}")
print(f"Matrix shape: {matrix.shape}")

# Perform a "vectorized" operation - much faster than a Python loop!
scaled_vector = vector * 5
print(f"Scaled vector: {scaled_vector}")
Vector shape: (3,)
Matrix shape: (2, 2)
Scaled vector: [ 5 10 15]
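
If you want to see why vectorized operations matter, here is a rough timing sketch you can paste into a cell (exact numbers vary by machine, but the vectorized version is typically orders of magnitude faster):

import time
import numpy as np

big_array = np.arange(1_000_000)

# Time a pure-Python loop over the array
start = time.perf_counter()
looped = [x * 5 for x in big_array]
loop_seconds = time.perf_counter() - start

# Time the equivalent vectorized NumPy operation
start = time.perf_counter()
vectorized = big_array * 5
numpy_seconds = time.perf_counter() - start

print(f"Loop: {loop_seconds:.4f}s, NumPy: {numpy_seconds:.4f}s")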

💡 Your Turn

Create a 3x3 NumPy matrix of the numbers from 1 to 9 (hint: `np.arange(1, 10).reshape(3, 3)` builds it in one line). Then, find the mean (average) of the entire matrix using `matrix.mean()`.

3.2 Pandas: For Structuring and Cleaning Data

Pandas is your tool for data manipulation and analysis. Before you can train a model, you need to load, explore, and clean your data. Pandas provides the DataFrame, a powerful table-like structure, to make this easy.

import pandas as pd # 'pd' is the universal alias for pandas

# Create a DataFrame from a dictionary
data = {
  'Model Name': ['ResNet50', 'MobileNetV2', 'EfficientNetB0'],
  'Top-1 Accuracy': [0.76, 0.72, 0.77],
  'Size (MB)': [102, 14, 21]
}
df = pd.DataFrame(data)

print("--- Full DataFrame ---")
print(df)

# Select a single column (returns a Pandas Series)
print("\n--- Accuracy Column ---")
print(df['Top-1 Accuracy'])
--- Full DataFrame ---
       Model Name  Top-1 Accuracy  Size (MB)
0        ResNet50            0.76        102
1     MobileNetV2            0.72         14
2  EfficientNetB0            0.77         21

--- Accuracy Column ---
0    0.76
1    0.72
2    0.77
Name: Top-1 Accuracy, dtype: float64
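
Selecting a column is just the start; filtering rows and sorting are everyday Pandas moves too. A quick sketch using the same `df` (the 50 MB cutoff is arbitrary, chosen only for illustration):

# Filter rows with a boolean condition: models smaller than 50 MB
small_models = df[df['Size (MB)'] < 50]
print(small_models)

# Sort all models by accuracy, best first
print(df.sort_values('Top-1 Accuracy', ascending=False))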

💡 Your Turn

Add a new column to the DataFrame called `Performance`, calculated as `df['Top-1 Accuracy'] / df['Size (MB)']`. Then, print the updated DataFrame.

3.3 Matplotlib: Visualizing Your Results

Matplotlib is the classic library for creating plots and charts. Visualizing your model's training progress (like loss over time) or your data's distribution is essential for understanding and debugging.

import matplotlib.pyplot as plt # 'plt' is the standard alias

# Data for our plot: training vs validation loss
epochs = [1, 2, 3, 4, 5, 6]
training_loss = [0.8, 0.6, 0.45, 0.3, 0.25, 0.22]
validation_loss = [0.85, 0.68, 0.55, 0.48, 0.46, 0.45]

# Create the plot
plt.figure(figsize=(8,5)) # Set the figure size
plt.plot(epochs, training_loss, marker='o', linestyle='--', label='Training Loss')
plt.plot(epochs, validation_loss, marker='s', linestyle='-', label='Validation Loss')
# Add labels and a title for clarity
plt.title('Model Loss Over Time', fontsize=16)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True)
plt.legend() # Display the labels for each line
plt.show() # This command renders the plot in your notebook
[Output: a Matplotlib line plot showing training and validation loss decreasing over six epochs]

💡 Your Turn

Using the Pandas DataFrame `df` from the previous section, create a bar chart comparing the 'Top-1 Accuracy' of the different models. Use `plt.bar(df['Model Name'], df['Top-1 Accuracy'])` as a starting point.

Part 4: Lab Assignment

It's time to put everything together. This assignment requires you to perform a mini data analysis project, combining all the skills you've learned above.

Assignment: Analyzing a Toy Dataset

You are given a small dataset of used car information. Your task is to load it, perform some basic analysis, and visualize the results.

Task 1: Setup and Data Loading

  1. In a new Colab cell, import NumPy, Pandas, and Matplotlib with their standard aliases.
  2. Create a Pandas DataFrame using the following code snippet:
    car_data = {
      'Make': ['Honda', 'Toyota', 'Ford', 'Honda', 'Toyota', 'Ford'],
      'Engine_Size_L': [1.5, 2.5, 5.0, 1.8, 2.4, 4.6],
      'Price_USD': [22000, 28000, 35000, 24000, 29000, 32000]
    }
    car_df = pd.DataFrame(car_data)
  3. Print the entire DataFrame to verify it loaded correctly.

Task 2: Data Analysis with NumPy & Pandas

  1. Calculate and print the average (mean) price of all cars.
  2. Calculate and print the average engine size.
  3. Find the car with the highest price using Pandas functions (hint: look up `.idxmax()`; a sketch of the pattern follows this list). Print the details of this car.
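
If `.idxmax()` is new to you, here is the general pattern on an unrelated toy DataFrame (the fruit data is made up purely for illustration):

fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Mango', 'Kiwi'], 'Price': [1.2, 2.5, 0.9]})
# .idxmax() returns the index label of the row with the largest value
priciest = fruit_df['Price'].idxmax()
print(fruit_df.loc[priciest]) # .loc retrieves the full row by that label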

Task 3: Visualization with Matplotlib

  1. Create a scatter plot to visualize the relationship between 'Engine_Size_L' (x-axis) and 'Price_USD' (y-axis).
  2. Give your plot a clear title ("Car Price vs. Engine Size") and label the x and y axes.
  3. Create a bar chart that shows the average price for each car 'Make'. You will need to use the Pandas `.groupby()` function first. (Hint: `avg_price_by_make = car_df.groupby('Make')['Price_USD'].mean()`; one way to plot the result is sketched after this list.)
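
One way to turn that grouped result into a bar chart, building on the hint above (Pandas Series also offer a built-in `.plot(kind='bar')` shortcut if you prefer):

avg_price_by_make = car_df.groupby('Make')['Price_USD'].mean()
plt.bar(avg_price_by_make.index, avg_price_by_make.values)
plt.title('Average Price by Make')
plt.xlabel('Make')
plt.ylabel('Average Price (USD)')
plt.show()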

Part 5: Bonus - Your First Kaggle Project

Ready to apply your skills to a world-famous dataset? This optional bonus section will guide you through the first steps of a real data science project on Kaggle.

Kaggle & The Titanic Dataset

Kaggle is a platform where data scientists compete by building the best models for a given problem. The "Titanic: Machine Learning from Disaster" competition is the "Hello, World!" of data science. Your goal is to predict which passengers survived the shipwreck.

Task 1: Get the Data

  1. Go to the Titanic competition's data page on Kaggle (kaggle.com/competitions/titanic). You will need to create a free account.
  2. Download the `train.csv` and `test.csv` files to your computer.
  3. In your Colab notebook, click the "Files" icon on the left sidebar and upload `train.csv`.

Task 2: Load and Explore

Use Pandas to load the data and take your first look.

import pandas as pd # 'pd' may already be imported if you're working in the same notebook

# Load the training data from the uploaded file
titanic_df = pd.read_csv('train.csv')

# Display the first 5 rows to see the columns
print("--- First 5 Rows ---")
print(titanic_df.head())

# Get a summary of the data types and non-null values
print("\n--- Data Info ---")
titanic_df.info()
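
Two more standard exploration calls worth knowing at this point (the second is especially relevant to question 4 below):

# Summary statistics for the numeric columns
print(titanic_df.describe())

# Count missing (NaN) values in each column
print(titanic_df.isnull().sum())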

Task 3: Your Challenge - Exploratory Data Analysis

Now, it's your turn to be the data detective. Answer the following questions using Pandas and Matplotlib. There is no single correct answer for the code; your goal is to find the answer and visualize it.

  1. Survival Rate: What was the overall survival rate? Create a bar chart showing the raw counts of passengers who survived (1) vs. those who did not (0). (Hint: use `.value_counts()` on the 'Survived' column; one possible shape is sketched after this list.)
  2. Survival by Gender: Did gender play a role in survival? Create a bar chart showing the survival rate for males vs. females. (Hint: use `.groupby('Sex')['Survived'].mean()`).
  3. Survival by Class: What about passenger class ('Pclass')? Create a bar chart showing the survival rate for each of the three classes.
  4. Missing Data: Which important column has a lot of missing data? How might you handle this in a real project? (Just answer in a text cell).
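
To get you started on question 1, here is one possible shape for the code (a sketch, not the only correct approach):

survival_counts = titanic_df['Survived'].value_counts()
plt.bar(survival_counts.index.astype(str), survival_counts.values)
plt.title('Survival Counts (0 = Did Not Survive, 1 = Survived)')
plt.xlabel('Survived')
plt.ylabel('Number of Passengers')
plt.show()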

Part 6: Submission Guidelines

To complete this lab, please follow these instructions carefully.

  1. Complete all "Your Turn" tasks and the main "Lab Assignment" in a single Google Colab notebook. The Kaggle project is a bonus, but we encourage you to try it!
  2. Use Text Cells in your notebook to label each section (e.g., "Part 2.4 Your Turn", "Assignment Task 1", "Bonus Kaggle Project", etc.) to keep your work organized.
  3. Ensure all your code cells have been run so that their outputs are visible below them. An unrun notebook is an incomplete notebook!
  4. When you are finished, generate a shareable link. In Colab, click the "Share" button in the top right.
  5. In the popup, under "General access", change "Restricted" to "Anyone with the link" and ensure the role is set to "Viewer".
  6. Click "Copy link" and submit this link as your assignment.