Lab 1: Python & Environment Setup

🐍 Your journey into AI starts here. Master the tools of the trade.

Estimated Time: 3 hours

Part 1: Your AI Workbench - Jupyter & Colab

Before writing code, we need to set up our workspace. We will use an interactive environment to experiment, visualize, and share our work. This is where Jupyter notebooks shine.

What is a Jupyter Notebook?

A Jupyter Notebook is an interactive document that lets you mix live code, equations, visualizations, and narrative text in blocks called "cells". It's the standard for data science and AI because it's perfect for exploratory work.

Google Colaboratory (Colab): Your Best Starting Point

Google Colab is a free Jupyter notebook environment hosted in the cloud and accessed entirely through your browser. No installation is required!

Getting Started with Colab

  1. Go to colab.research.google.com and sign in with a Google account.
  2. Click "New notebook".
  3. You'll see your interactive workspace. At the top left, click "+ Code" to add a cell for Python code and "+ Text" to add a cell for notes (like this text you're reading).
  4. Type code in a code cell and run it by clicking the play button on the left or pressing Shift + Enter.

🚀 Exercise: Your First Colab Notebook

1. Open a new Google Colab notebook and name it "Lab 1".

2. Create a text cell and write "My First AI Lab".

3. Below it, create a code cell and type `a = 10; b = 20; print(f"The sum is {a+b}")`.

4. Run the code cell. The output should appear directly below it. This confirms your environment is working!

Part 2: Python Fundamentals for AI

Now that your environment is ready, let's learn the language of AI. Python's simple syntax lets you focus on the logic rather than the language itself. Below, we'll cover the essentials you'll use every day. For each example, read the explanation, study the code, and then try the "Your Turn" task to solidify your understanding.

2.1 Basic Data Types

In AI, you're constantly working with different types of data: numbers for pixel values or model weights, text for natural language processing, and booleans for controlling program flow.

# Numbers can be whole (integers) or have decimals (floats)
pixel_intensity = 255 # An integer, common in image data (0-255)
learning_rate = 0.001 # A float, a key hyperparameter in model training

# Text data is stored in strings
model_name = "ImageClassifier_v1"

# Booleans represent True or False states
is_training = True # Useful for changing model behavior (e.g., turning on dropout)

# We can use an f-string to easily print variables
print(f"Model: {model_name}, LR: {learning_rate}")
Model: ImageClassifier_v1, LR: 0.001

💡 Your Turn

Declare a variable for `batch_size` (an integer) and another for `model_accuracy` (a float). Print them out in a formatted string.

2.2 Data Structures: Lists & Dictionaries

Lists are ordered collections, perfect for storing a sequence of features or data points. Dictionaries are collections of key-value pairs, accessed by key rather than by position, which makes them ideal for storing model configurations or labeled data.

# A list stores a sequence of items, accessed by a numerical index (starting from 0).
# Imagine these are features for a house: [area_sq_ft, num_bedrooms, age_years]
house_features = [1500, 3, 10]
print(f"Number of bedrooms: {house_features[1]}") # Access the second item

# A dictionary stores key-value pairs, accessed by the key.
# This is perfect for storing hyperparameters.
hyperparameters = {
  "learning_rate": 0.01,
  "epochs": 50,
  "optimizer": "Adam"
}
print(f"Optimizer: {hyperparameters['optimizer']}") # Access the value associated with the 'optimizer' key
Number of bedrooms: 3
Optimizer: Adam

💡 Your Turn

Create a dictionary for a `student` with keys 'name', 'major', and a list of 'courses'. Print the student's name and their second course.

2.3 Control Flow: Loops & Conditionals

Training a model involves iterating through your dataset thousands of times (`for` loops) and making decisions based on performance (`if/else` statements).

# We often loop through a mini-batch of loss values from our model
loss_values = [0.8, 0.5, 0.2, 0.1]
for loss in loss_values:
  # Use an if/else statement to decide what to do
  if loss < 0.5:
    print(f"Loss {loss:.2f} is good. Continuing training.") # .2f formats to 2 decimal places
  else:
    print(f"Loss {loss:.2f} is high. Check model.")
Loss 0.80 is high. Check model.
Loss 0.50 is high. Check model.
Loss 0.20 is good. Continuing training.
Loss 0.10 is good. Continuing training.

💡 Your Turn

Create a list of accuracies (e.g., `[0.91, 0.85, 0.99]`). Loop through them and print "Excellent" if accuracy is > 0.9, "Good" if > 0.8, and "Needs Improvement" otherwise.
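
The example above only used if/else; a three-way decision like this calls for elif. Here is a minimal sketch of the if/elif/else shape on a single placeholder value (the variable and value are illustrative; looping over the list is still up to you):

# Sketch: a three-way decision with if/elif/else
accuracy = 0.87 # a placeholder value, not one from the task's list
if accuracy > 0.9:
  print("Excellent")
elif accuracy > 0.8: # checked only if the first condition was False
  print("Good")
else:
  print("Needs Improvement")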

2.4 Functions

Functions are crucial for writing clean, reusable code. In AI, you'll write functions for data preprocessing, model building, and training steps.

# A function to preprocess an image (a common task)
def preprocess_image(image_data, target_size):
  """A placeholder function to demonstrate structure."""
  print(f"Resizing image to {target_size}x{target_size}...")
  # In a real scenario, code to resize and normalize would be here
  resized_image = "some_processed_data"
  return resized_image

# Call the function, passing target_size as a keyword argument for clarity
my_image = "raw_image_data"
processed = preprocess_image(my_image, target_size=224)
print(f"Function returned: {processed}")
Resizing image to 224x224...
Function returned: some_processed_data

💡 Your Turn

Write a function `calculate_average_loss` that takes a list of loss values and returns their average.

Part 3: The AI Power Tools

Let's explore the three most important libraries for any AI practitioner. These are pre-installed in Colab.

3.1 NumPy: The Bedrock of AI Math

NumPy is critical because it provides the `ndarray` (N-dimensional array), the data structure that modern AI frameworks (like TensorFlow and PyTorch) model their tensors on. Operations on NumPy arrays run in optimized, compiled code, making them far faster than equivalent Python loops. Your image data, model weights, and feature vectors will all be arrays of this kind.

import numpy as np # 'np' is the universal alias for numpy

# Create a 1D array (a vector)
vector = np.array([1, 2, 3])
# Create a 2D array (a matrix)
matrix = np.array([ [1, 2], [3, 4] ])

print(f"Vector shape: {vector.shape}")
print(f"Matrix shape: {matrix.shape}")

# Perform a "vectorized" operation - much faster than a Python loop!
scaled_vector = vector * 5
print(f"Scaled vector: {scaled_vector}")
Vector shape: (3,)
Matrix shape: (2, 2)
Scaled vector: [ 5 10 15]
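
If you want to see why vectorized operations matter, here is a rough timing sketch you can paste into a cell (exact numbers vary by machine, but the vectorized version is typically orders of magnitude faster):

import time
import numpy as np

big_array = np.arange(1_000_000)

# Time a pure-Python loop over the array
start = time.perf_counter()
looped = [x * 5 for x in big_array]
loop_seconds = time.perf_counter() - start

# Time the equivalent vectorized NumPy operation
start = time.perf_counter()
vectorized = big_array * 5
numpy_seconds = time.perf_counter() - start

print(f"Loop: {loop_seconds:.4f}s, NumPy: {numpy_seconds:.4f}s")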

💡 Your Turn

Create a 3x3 NumPy matrix of the numbers from 1 to 9 (hint: `np.arange(1, 10).reshape(3, 3)` builds it in one line). Then, find the mean (average) of the entire matrix using `matrix.mean()`.

3.2 Pandas: For Structuring and Cleaning Data

Pandas is your tool for data manipulation and analysis. Before you can train a model, you need to load, explore, and clean your data. Pandas provides the DataFrame, a powerful table-like structure, to make this easy.

import pandas as pd # 'pd' is the universal alias for pandas

# Create a DataFrame from a dictionary
data = {
  'Model Name': ['ResNet50', 'MobileNetV2', 'EfficientNetB0'],
  'Top-1 Accuracy': [0.76, 0.72, 0.77],
  'Size (MB)': [102, 14, 21]
}
df = pd.DataFrame(data)

print("--- Full DataFrame ---")
print(df)

# Select a single column (returns a Pandas Series)
print("\n--- Accuracy Column ---")
print(df['Top-1 Accuracy'])
--- Full DataFrame ---
       Model Name  Top-1 Accuracy  Size (MB)
0        ResNet50            0.76        102
1     MobileNetV2            0.72         14
2  EfficientNetB0            0.77         21

--- Accuracy Column ---
0    0.76
1    0.72
2    0.77
Name: Top-1 Accuracy, dtype: float64
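
Selecting a column is just the start; filtering rows and sorting are everyday Pandas moves too. A quick sketch using the same `df` (the 50 MB cutoff is arbitrary, chosen only for illustration):

# Filter rows with a boolean condition: models smaller than 50 MB
small_models = df[df['Size (MB)'] < 50]
print(small_models)

# Sort all models by accuracy, best first
print(df.sort_values('Top-1 Accuracy', ascending=False))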

💡 Your Turn

Add a new column to the DataFrame called `Performance`, calculated as `df['Top-1 Accuracy'] / df['Size (MB)']`. Then, print the updated DataFrame.

3.3 Matplotlib: Visualizing Your Results

Matplotlib is the classic library for creating plots and charts. Visualizing your model's training progress (like loss over time) or your data's distribution is essential for understanding and debugging.

import matplotlib.pyplot as plt # 'plt' is the standard alias

# Data for our plot: training vs validation loss
epochs = [1, 2, 3, 4, 5, 6]
training_loss = [0.8, 0.6, 0.45, 0.3, 0.25, 0.22]
validation_loss = [0.85, 0.68, 0.55, 0.48, 0.46, 0.45]

# Create the plot
plt.figure(figsize=(8,5)) # Set the figure size
plt.plot(epochs, training_loss, marker='o', linestyle='--', label='Training Loss')
plt.plot(epochs, validation_loss, marker='s', linestyle='-', label='Validation Loss')
# Add labels and a title for clarity
plt.title('Model Loss Over Time', fontsize=16)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True)
plt.legend() # Display the labels for each line
plt.show() # This command renders the plot in your notebook
[Output: a Matplotlib line plot showing training and validation loss decreasing over six epochs]

💡 Your Turn

Using the Pandas DataFrame `df` from the previous section, create a bar chart comparing the 'Top-1 Accuracy' of the different models. Use `plt.bar(df['Model Name'], df['Top-1 Accuracy'])` as a starting point.

Part 4: Lab Assignment

It's time to put everything together. This assignment requires you to perform a mini data analysis project, combining all the skills you've learned above.

Assignment: Analyzing a Toy Dataset

You are given a small dataset of used car information. Your task is to load it, perform some basic analysis, and visualize the results.

Task 1: Setup and Data Loading

  1. In a new Colab cell, import NumPy, Pandas, and Matplotlib with their standard aliases.
  2. Create a Pandas DataFrame using the following code snippet:
    car_data = {
      'Make': ['Honda', 'Toyota', 'Ford', 'Honda', 'Toyota', 'Ford'],
      'Engine_Size_L': [1.5, 2.5, 5.0, 1.8, 2.4, 4.6],
      'Price_USD': [22000, 28000, 35000, 24000, 29000, 32000]
    }
    car_df = pd.DataFrame(car_data)
  3. Print the entire DataFrame to verify it loaded correctly.

Task 2: Data Analysis with NumPy & Pandas

  1. Calculate and print the average (mean) price of all cars.
  2. Calculate and print the average engine size.
  3. Find the car with the highest price using Pandas functions (hint: look up `.idxmax()`; a sketch of the pattern follows this list). Print the details of this car.
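
If `.idxmax()` is new to you, here is the general pattern on an unrelated toy DataFrame (the fruit data is made up purely for illustration):

fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Mango', 'Kiwi'], 'Price': [1.2, 2.5, 0.9]})
# .idxmax() returns the index label of the row with the largest value
priciest = fruit_df['Price'].idxmax()
print(fruit_df.loc[priciest]) # .loc retrieves the full row by that label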

Task 3: Visualization with Matplotlib

  1. Create a scatter plot to visualize the relationship between 'Engine_Size_L' (x-axis) and 'Price_USD' (y-axis).
  2. Give your plot a clear title ("Car Price vs. Engine Size") and label the x and y axes.
  3. Create a bar chart that shows the average price for each car 'Make'. You will need to use the Pandas `.groupby()` function first. (Hint: `avg_price_by_make = car_df.groupby('Make')['Price_USD'].mean()`; one way to plot the result is sketched after this list.)
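
One way to turn that grouped result into a bar chart, building on the hint above (Pandas Series also offer a built-in `.plot(kind='bar')` shortcut if you prefer):

avg_price_by_make = car_df.groupby('Make')['Price_USD'].mean()
plt.bar(avg_price_by_make.index, avg_price_by_make.values)
plt.title('Average Price by Make')
plt.xlabel('Make')
plt.ylabel('Average Price (USD)')
plt.show()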

Part 5: Bonus - Your First Kaggle Project

Ready to apply your skills to a world-famous dataset? This optional bonus section will guide you through the first steps of a real data science project on Kaggle.

Kaggle & The Titanic Dataset

Kaggle is a platform where data scientists compete by building the best models for a given problem. The "Titanic: Machine Learning from Disaster" competition is the "Hello, World!" of data science. Your goal is to predict which passengers survived the shipwreck.

Task 1: Get the Data

  1. Go to the Titanic competition's data page on Kaggle (kaggle.com/competitions/titanic). You will need to create a free account.
  2. Download the `train.csv` and `test.csv` files to your computer.
  3. In your Colab notebook, click the "Files" icon on the left sidebar and upload `train.csv`.

Task 2: Load and Explore

Use Pandas to load the data and take your first look.

import pandas as pd # 'pd' may already be imported if you're working in the same notebook

# Load the training data from the uploaded file
titanic_df = pd.read_csv('train.csv')

# Display the first 5 rows to see the columns
print("--- First 5 Rows ---")
print(titanic_df.head())

# Get a summary of the data types and non-null values
print("\n--- Data Info ---")
titanic_df.info()
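
Two more standard exploration calls worth knowing at this point (the second is especially relevant to question 4 below):

# Summary statistics for the numeric columns
print(titanic_df.describe())

# Count missing (NaN) values in each column
print(titanic_df.isnull().sum())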

Task 3: Your Challenge - Exploratory Data Analysis

Now, it's your turn to be the data detective. Answer the following questions using Pandas and Matplotlib. There is no single correct answer for the code; your goal is to find the answer and visualize it.

  1. Survival Rate: What was the overall survival rate? Create a bar chart showing the raw counts of passengers who survived (1) vs. those who did not (0). (Hint: use `.value_counts()` on the 'Survived' column; one possible shape is sketched after this list.)
  2. Survival by Gender: Did gender play a role in survival? Create a bar chart showing the survival rate for males vs. females. (Hint: use `.groupby('Sex')['Survived'].mean()`).
  3. Survival by Class: What about passenger class ('Pclass')? Create a bar chart showing the survival rate for each of the three classes.
  4. Missing Data: Which important column has a lot of missing data? How might you handle this in a real project? (Just answer in a text cell).
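
To get you started on question 1, here is one possible shape for the code (a sketch, not the only correct approach):

survival_counts = titanic_df['Survived'].value_counts()
plt.bar(survival_counts.index.astype(str), survival_counts.values)
plt.title('Survival Counts (0 = Did Not Survive, 1 = Survived)')
plt.xlabel('Survived')
plt.ylabel('Number of Passengers')
plt.show()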

Part 6: Submission Guidelines

To complete this lab, please follow these instructions carefully.

  1. Complete all "Your Turn" tasks and the main "Lab Assignment" in a single Google Colab notebook. The Kaggle project is a bonus, but we encourage you to try it!
  2. Use Text Cells in your notebook to label each section (e.g., "Part 2.4 Your Turn", "Assignment Task 1", "Bonus Kaggle Project", etc.) to keep your work organized.
  3. Ensure all your code cells have been run so that their outputs are visible below them. An unrun notebook is an incomplete notebook!
  4. When you are finished, generate a shareable link. In Colab, click the "Share" button in the top right.
  5. In the popup, under "General access", change "Restricted" to "Anyone with the link" and ensure the role is set to "Viewer".
  6. Click "Copy link" and submit this link as your assignment.