Lab 2: Data Science Foundations

Part 1: What is Data Science?

Imagine you're a detective. You have clues (data), and your job is to examine them, find patterns, and tell the story of what happened. That's data science!

It's the art of turning raw, messy information into clear insights. Today, you will be a data detective. Your case file is a famous dataset about flowers, and your tools will be Pandas, Matplotlib, and Seaborn.

Our Detective Tools:

Pandas 🐼: The ultimate magnifying glass. It helps us load, organize, and inspect our data in a neat table called a DataFrame.
Matplotlib 🎨: The sketchbook. It lets us draw basic charts and graphs to visualize our findings.
Seaborn ✨: The professional presentation kit. It builds on Matplotlib to make beautiful, informative, and complex charts with very little code.

Part 2: The Case File - The Iris Flower Dataset

Our first case involves identifying species of Iris flowers based on their measurements. It's a classic in data science because it's simple, clean, and great for learning.

The Clues (Features):

For each flower, we have four measurements:

Sepal Length: The length of the outer green leaf.
Sepal Width: The width of the outer green leaf.
Petal Length: The length of the colorful inner petal.
Petal Width: The width of the colorful inner petal.

The Mystery to Solve (Target):

Based on these four clues, we want to identify which of the three species the flower belongs to: Setosa, Versicolor, or Virginica.

Part 3: First Inspection (Dataset Exploration)

Let's open our case file. We'll use the scikit-learn library to easily load the dataset, then convert it to a Pandas DataFrame to start our investigation.

3.1 Loading the Data

This first code block imports our tools and loads the data into a Pandas DataFrame, our main workspace.

                # Import our libraries
                import pandas as pd
                import numpy as np
                import matplotlib.pyplot as plt
                import seaborn as sns
                from sklearn.datasets import load_iris  # A function to load our case file

                # Load the dataset
                iris_data = load_iris()

                # Create a DataFrame (our table)
                iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)

                # Add the 'species' column to our table
                iris_df['species'] = iris_data.target_names[iris_data.target]
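The last line of the loading code packs a lot into one expression: `iris_data.target` holds a number (0, 1, or 2) for each flower, and indexing `target_names` with that whole array swaps every number for its name. Here is a minimal sketch of the same trick on a tiny example. The arrays below are hypothetical stand-ins, not the real dataset.

```python
import numpy as np

# Hypothetical stand-ins for iris_data.target_names and iris_data.target
target_names = np.array(["setosa", "versicolor", "virginica"])
target = np.array([0, 0, 1, 2, 1])

# Indexing an array with an array of integers looks up one name per code
species = target_names[target]

print(species)
```

The result has one entry per flower, which is why it can be assigned straight into a new DataFrame column.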

3.2 The First Glance: `.head()`

The `.head()` command shows us the first 5 rows. It's a great way to quickly check if our data loaded correctly.

                # Show the top 5 rows
                print(iris_df.head())

                   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
                0                5.1               3.5                1.4               0.2  setosa
                1                4.9               3.0                1.4               0.2  setosa
                2                4.7               3.2                1.3               0.2  setosa
                3                4.6               3.1                1.5               0.2  setosa
                4                5.0               3.6                1.4               0.2  setosa

💡 Your Turn

Use the `.tail()` command to see the last 5 rows of the DataFrame. What species do you see? Type `iris_df.tail()` in a new code cell and run it.

3.3 The File Summary: `.info()`

The `.info()` command gives us a technical summary: how many rows we have, the names of our columns, and crucially, if any data is missing.

                # Get a summary of the DataFrame
                iris_df.info()

                <class 'pandas.core.frame.DataFrame'>
                RangeIndex: 150 entries, 0 to 149
                Data columns (total 5 columns):
                 #   Column             Non-Null Count  Dtype
                ---  ------             --------------  -----
                 0   sepal length (cm)  150 non-null    float64
                 1   sepal width (cm)   150 non-null    float64
                 2   petal length (cm)  150 non-null    float64
                 3   petal width (cm)   150 non-null    float64
                 4   species            150 non-null    object
                dtypes: float64(4), object(1)
                memory usage: 6.0+ KB

Detective's Note: We have 150 flowers ("entries"). All columns have "150 non-null" values, which means no data is missing! This is a very clean case file.

💡 Your Turn

Use the `.shape` attribute to see the dimensions of the table. It will show `(rows, columns)`. Type `iris_df.shape` in a new code cell.

3.4 The Statistical Profile: `.describe()`

The `.describe()` command calculates key statistics (like average, min, max) for our numerical columns. This helps us understand the range and distribution of our measurements.

                # Get descriptive statistics
                print(iris_df.describe())

                       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
                count         150.000000        150.000000         150.000000        150.000000
                mean            5.843333          3.057333           3.758000          1.199333
                std             0.828066          0.435866           1.765298          0.762238
                min             4.300000          2.000000           1.000000          0.100000
                25%             5.100000          2.800000           1.600000          0.300000
                50%             5.800000          3.000000           4.350000          1.300000
                75%             6.400000          3.300000           5.100000          1.800000
                max             7.900000          4.400000           6.900000          2.500000

Detective's Note: Look at the "mean" (average) row. The average petal length (about 3.76 cm) is roughly three times the average petal width (about 1.2 cm). Interesting...
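Each row of the `.describe()` table is an ordinary statistic that you could also compute on its own. The sketch below checks this on a small made-up Series (hypothetical numbers, not Iris measurements), so you can see how the rows relate to methods like `.mean()` and `.median()`.

```python
import pandas as pd

# Hypothetical toy measurements, not taken from the Iris data
lengths = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

summary = lengths.describe()

# Each pair below prints the same number, reached by two routes
print(summary["mean"], lengths.mean())    # the average
print(summary["50%"], lengths.median())   # the 50% row is the median
print(summary["std"], lengths.std())      # the sample standard deviation
```

Notice that the single large value (10.0) pulls the mean above the median. Comparing those two rows is a quick way to spot lopsided data.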

💡 Your Turn

Select just the `petal length (cm)` column and then use `.describe()` on it to get statistics for only that clue. Type `iris_df['petal length (cm)'].describe()`

Part 4: Cleaning The Scene (Data Preprocessing)

Before we can draw conclusions, we must ensure our evidence is clean. This means checking for missing clues (null values) and accidental duplicates.

4.1 Checking for Missing Values

We already saw from `.info()` that we have no missing values, but this is the direct command to check. A result of '0' for every column is perfect.

                # Check for null (missing) values in each column
                print(iris_df.isnull().sum())

                sepal length (cm)    0
                sepal width (cm)     0
                petal length (cm)    0
                petal width (cm)     0
                species              0
                dtype: int64
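Because the Iris data has nothing missing, the check above only ever prints zeros. To see what a messier case file looks like, here is a minimal sketch on a hypothetical toy table with two gaps planted on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two deliberately missing measurements
toy_df = pd.DataFrame({
    "petal length (cm)": [1.4, np.nan, 4.7],
    "petal width (cm)": [0.2, 1.3, np.nan],
    "species": ["setosa", "versicolor", "versicolor"],
})

# Count the missing values in each column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value
rows_with_gaps = toy_df[toy_df.isnull().any(axis=1)]
print(rows_with_gaps)
```

Here `isnull()` marks each cell as True or False, and `.sum()` counts the True values column by column.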

4.2 Checking for Duplicates

We also need to check if any rows are exact copies of each other.

                # Count the number of duplicated rows
                duplicates = iris_df.duplicated().sum()
                print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 1

We found one duplicate! Let's remove it to keep our evidence clean.

                # Remove duplicates, keeping the first instance
                iris_df = iris_df.drop_duplicates()
                print("Duplicates removed. Let's check info again:")
                iris_df.info()

                Duplicates removed. Let's check info again:
                <class 'pandas.core.frame.DataFrame'>
                Index: 149 entries, 0 to 149
                Data columns (total 5 columns):
                ...
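You may have spotted something odd in that summary: 149 entries, but the index still runs from 0 to 149. That is because `drop_duplicates()` removes the row without renumbering the ones that remain. A minimal sketch on a hypothetical three-row table shows the same effect, plus `reset_index(drop=True)` as one optional way to renumber.

```python
import pandas as pd

# Hypothetical toy table: row 1 is an exact copy of row 0
toy_df = pd.DataFrame({
    "petal length (cm)": [5.1, 5.1, 4.7],
    "species": ["virginica", "virginica", "versicolor"],
})

# duplicated() flags every repeat of an earlier row
print(toy_df.duplicated().tolist())

# Dropping the copy leaves a gap in the index labels
deduped = toy_df.drop_duplicates()
print(deduped.index.tolist())

# Optional: renumber the rows from 0 again
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())
```

For this lab the gap is harmless, so renumbering is a matter of taste rather than a required step.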

💡 Your Turn

Run the duplicate check again to confirm they are gone. Type `iris_df.duplicated().sum()` in a new cell. The output should now be 0.

Case File Secured!

Our data is now loaded, explored, and cleaned. The scene is secure. Now, the real detective work begins: finding patterns with visualization!

Part 5: Drawing The Connections (Visualization)

A picture is worth a thousand data points. We will use Seaborn to create beautiful plots that reveal the hidden stories in our data.

5.1 Analyzing One Clue at a Time (Univariate Plots)

Let's start by looking at each measurement individually. A histogram shows us the distribution of a single variable—where the values are most common.

                # Set the style of our plots to be pretty
                sns.set_style('whitegrid')

                # Create a histogram of sepal width
                plt.figure(figsize=(8, 6))  # Create a canvas for our plot
                sns.histplot(data=iris_df, x='sepal width (cm)', kde=True, color='purple')
                plt.title('Distribution of Sepal Width', fontsize=16)
                plt.show()

Detective's Note: The most common sepal width is around 3.0 cm. The smooth curve (called a Kernel Density Estimate, or KDE) has a roughly bell-like shape, similar to what statisticians call a "normal distribution."
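The bars of a histogram are simply counts of how many values land in each range (or "bin"). Here is a minimal sketch of that counting step using NumPy's `np.histogram`. The widths are made-up numbers, not rows from the Iris data, and the bin edges are chosen by hand for readability.

```python
import numpy as np

# Hypothetical sepal widths in cm, invented for illustration
widths = np.array([2.2, 2.8, 2.9, 3.1, 3.1, 3.2, 3.4, 2.7, 3.6, 3.8])

# Split the range 2.0 to 4.0 into 4 equal bins and count values in each
counts, edges = np.histogram(widths, bins=4, range=(2.0, 4.0))

print(counts)  # how tall each bar would be
print(edges)   # where each bin starts and ends

# The tallest bar marks the most common range of values
tallest = counts.argmax()
print(f"Most common range: {edges[tallest]} to {edges[tallest + 1]} cm")
```

`sns.histplot` chooses its bins automatically, so the bars in your real plot will not match these hand-picked edges.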

💡 Your Turn

Create a histogram for a different feature: `'sepal length (cm)'`. Change the color to `'green'`. What do you notice about its shape?

5.2 Comparing Two Clues (Bivariate Plots)

Now let's see how two measurements relate to each other. A scatter plot is perfect for this. We'll compare petal length and petal width. We can also use `hue` to color the dots by their species. This is where we might crack the case!

                # Create a scatter plot of petal length vs petal width
                # The 'hue' parameter colors the points based on the 'species' column
                plt.figure(figsize=(10, 7))
                sns.scatterplot(data=iris_df, x='petal length (cm)', y='petal width (cm)', hue='species', s=100, palette='viridis')
                plt.title('Petal Length vs. Petal Width by Species', fontsize=16)
                plt.show()

Detective's Note: This is a huge breakthrough! The Setosa species have small petals and form a distinct, separate cluster. Versicolor and Virginica are harder to tell apart, but Virginica generally has larger petals than Versicolor. The petal measurements seem to be the key!
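A good detective backs up what the eye sees with numbers. One possible check, sketched below, is to compare the smallest and largest petal length in each species. The block reloads the data from scikit-learn so it runs on its own; the variable names `df` and `petal_ranges` are just for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the table from scratch so this cell is self-contained
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Smallest and largest petal length within each species
petal_ranges = df.groupby("species")["petal length (cm)"].agg(["min", "max"])
print(petal_ranges)
```

If the largest Setosa petal is shorter than the smallest Versicolor petal, the two ranges do not touch, which matches the separate cluster in the scatter plot. Overlapping ranges for Versicolor and Virginica would explain why those two are harder to tell apart.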

💡 Your Turn

Create another scatter plot, this time comparing `'sepal length (cm)'` vs. `'sepal width (cm)'`. Don't forget to set the `hue` to `'species'`. Is the separation between species as clear with these clues?

5.3 The Big Picture (Multivariate Plots)

What if we could see all the relationships at once? A pair plot does exactly that. It creates a grid of scatter plots for every pair of features.

                # Create a pair plot of the entire dataset, colored by species
                sns.pairplot(data=iris_df, hue='species')
                plt.show()

Detective's Note: Look at the plots in the "petal length" row and "petal width" column. They show the clearest separation between the three species. The case is almost closed!
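A pair plot shows every pair of features as a picture. A correlation matrix is one way to summarize the same pairs as numbers: values near 1 mean two measurements rise together, values near 0 mean little straight-line relationship. The sketch below is a companion to the plot, not a replacement. It measures how features move together, not how well they separate the species.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load only the four numeric measurements
iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise correlation between every pair of measurements
corr = features.corr()
print(corr.round(2))
```

Compare the number for petal length and petal width with the matching panel in the pair plot: a tight diagonal band of points goes with a correlation close to 1.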

Part 6: Your Turn to be the Detective

Assignment: Deeper into the Iris Data

Use the `iris_df` DataFrame to answer the following questions. For each task, write the code and display the plot or output in your Colab notebook.

Task 1: Basic Analysis

How many flowers of each species are in our (cleaned) dataset? (Hint: use `.value_counts()` on the 'species' column).
What is the average petal length for each species? (Hint: use `.groupby('species')['petal length (cm)'].mean()`).
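If `.value_counts()` and `.groupby()` are new to you, here is a minimal sketch of both on a hypothetical fruit table (nothing to do with the Iris data), so you can see the pattern before applying it to `iris_df`.

```python
import pandas as pd

# Hypothetical toy table, invented for illustration
toy_df = pd.DataFrame({
    "fruit": ["apple", "pear", "apple", "apple", "pear"],
    "weight (g)": [150.0, 170.0, 160.0, 140.0, 190.0],
})

# How many rows belong to each category?
counts = toy_df["fruit"].value_counts()
print(counts)

# What is the average weight within each category?
mean_weight = toy_df.groupby("fruit")["weight (g)"].mean()
print(mean_weight)
```

The pattern is always the same: pick the column that defines the groups, pick the column to measure, then choose the summary.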

Task 2: Visualization

Create a histogram for 'petal length (cm)'. Does it look like one bell curve, or more? Why do you think that is? (Write your answer in a text cell).
Create a box plot to compare the 'sepal width (cm)' for each of the three species. (Hint: `sns.boxplot(data=iris_df, x='species', y='sepal width (cm)')`). Are there any outliers (dots outside the whiskers)? Which species has the highest median sepal width?
Create a violin plot, which is like a box plot and a histogram combined. (Hint: `sns.violinplot(data=iris_df, x='species', y='petal width (cm)')`). What does this plot tell you about the distribution of petal widths for each species?
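For the box plot question, it helps to know where the whiskers end. By default, box plots commonly use the "1.5 x IQR" rule: any point more than 1.5 times the interquartile range beyond the box is drawn as an outlier dot. Here is a minimal sketch of that rule on made-up values (hypothetical, not Iris rows), assuming the default whisker setting.

```python
import pandas as pd

# Hypothetical measurements with one very low and one very high value
values = pd.Series([2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 4.4])

# The box spans the middle half of the data
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond these fences is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

You could run the same steps on one species at a time to check whether the dots in your own box plot agree with the numbers.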

Part 7: Bonus - The Housing Price Case

Ready for a more challenging case? We'll look at a real dataset from a Kaggle competition about predicting house prices in Ames, Iowa.

Kaggle & The Housing Price Dataset

Kaggle is a platform where data scientists compete. The "House Prices" competition is a fantastic real-world problem.

Task 1: Get the Data

Go to the House Prices data page. You will need a Kaggle account.
Download `train.csv`.
In your Colab notebook, upload `train.csv` using the file browser on the left.
Load it with `house_df = pd.read_csv('train.csv')`.

Task 2: Your Challenge - Exploratory Data Analysis

This dataset is much bigger and messier than the Iris dataset. Your goal is to find initial clues about what features might influence the final `SalePrice`.

First Look: Use `.info()` to look at the data. How many columns are there? Do you see any with missing values?
Price Distribution: Create a histogram of the `SalePrice`. Is it a symmetric bell curve, or does it have a longer tail on one side? (A lopsided distribution like that is called "skewed.")
Living Area vs. Price: The `GrLivArea` is the above-ground living area in square feet. Create a scatter plot of `GrLivArea` (x-axis) vs. `SalePrice` (y-axis). Does there appear to be a positive relationship?
Overall Quality: The `OverallQual` feature rates the quality of the house from 1 to 10. Create a box plot to show the relationship between `OverallQual` (x-axis) and `SalePrice` (y-axis). What does this plot tell you about how quality affects price?
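For the price distribution question, you can put a number on "skewed" as well as judging it by eye. The sketch below uses a handful of invented prices (hypothetical values in thousands of dollars, not taken from `train.csv`) to show two common signs of a long right tail: a mean that sits above the median, and a positive value from `.skew()`.

```python
import pandas as pd

# Hypothetical sale prices in thousands of dollars, invented for illustration
prices = pd.Series([100, 120, 130, 150, 160, 180, 200, 250, 400, 755], dtype=float)

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()

print(f"Mean:   {mean_price}")
print(f"Median: {median_price}")
print(f"Skew:   {skewness:.2f}")
```

A few very expensive houses drag the mean upward while the median barely moves. Once the real data is loaded, `house_df['SalePrice'].skew()` would give the same kind of summary.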

Part 8: Submission Guidelines

To complete this lab, please follow these instructions carefully.

Complete all "Your Turn" tasks and the main assignment (Part 6) in a single Google Colab notebook. The Kaggle project is a bonus, but highly recommended!
Use Text Cells to label each section and answer any written questions.
Ensure all your code cells have been run so that their outputs and plots are visible.
When you are finished, generate a shareable link. In Colab, click the "Share" button in the top right.
In the popup, under "General access", change "Restricted" to "Anyone with the link" and ensure the role is set to "Viewer".
Click "Copy link" and submit this link as your assignment.