Lab 2: Data Science Foundations
📊 Learn to explore, clean, and visualize data like a pro.
Libraries: Pandas, Matplotlib, Seaborn • Estimated Time: 3 hours
Part 1: What is Data Science?
Imagine you're a detective. You have clues (data), and your job is to examine them, find patterns, and
tell the story of what happened. That's data science!
It's the art of turning raw, messy information into clear insights. Today, you will be a
data detective. Your case file is a famous dataset about flowers, and your tools will be
Pandas, Matplotlib, and Seaborn.
Our Detective Tools:
- Pandas 🐼: The ultimate magnifying glass. It helps us load, organize, and
inspect our data in a neat table called a DataFrame.
- Matplotlib 🎨: The sketchbook. It lets us draw basic charts and graphs to
visualize our findings.
- Seaborn ✨: The professional presentation kit. It builds on Matplotlib to make
beautiful, informative, and complex charts with very little code.
Part 2: The Case File - The Iris Flower Dataset
Our first case involves identifying species of Iris flowers based on their measurements. It's a classic
in data science because it's simple, clean, and great for learning.
The Clues (Features):
For each flower, we have four measurements:
- Sepal Length: The length of the outer green leaf.
- Sepal Width: The width of the outer green leaf.
- Petal Length: The length of the colorful inner petal.
- Petal Width: The width of the colorful inner petal.
The Mystery to Solve (Target):
Based on these four clues, we want to identify which of the three species the flower belongs to:
Setosa, Versicolor, or Virginica.
Part 3: First Inspection (Dataset Exploration)
Let's open our case file. We'll use the scikit-learn library to easily load the dataset, then convert it
to a Pandas DataFrame to start our investigation.
3.1 Loading the Data
This first code block imports our tools and loads the data into a Pandas DataFrame, our main workspace.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
iris_data = load_iris()
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
iris_df['species'] = iris_data.target_names[iris_data.target]
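The last line of that block may look like magic, so here is a minimal sketch of what it does. The dataset stores each species as an integer code, and indexing the array of names with the array of codes looks up one name per flower. The short arrays below are stand-ins made up for illustration, not the real 150 values.

```python
import numpy as np

# Stand-in values: the three species names and a short array of integer codes.
target_names = np.array(["setosa", "versicolor", "virginica"])
target = np.array([0, 0, 1, 2, 1])

# Indexing an array with an integer array returns one name per code.
species = target_names[target]
print(species)
```

The same lookup runs on all 150 codes at once when we build the `species` column.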
3.2 The First Glance: `.head()`
The `.head()` command shows us the first 5 rows. It's a great way to quickly check if our data loaded
correctly.
print(iris_df.head())
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
💡 Your Turn
Use the `.tail()` command to see the last 5 rows of the DataFrame. What species do
you see? Type `iris_df.tail()` in a new code cell and run it.
3.3 The File Summary: `.info()`
The `.info()` command gives us a technical summary: how many rows we have, the names and data types of
our columns, and, crucially, whether any data is missing.
iris_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal length (cm) 150 non-null float64
1 sepal width (cm) 150 non-null float64
2 petal length (cm) 150 non-null float64
3 petal width (cm) 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Detective's Note: We have 150 flowers ("entries"). All columns have "150 non-null"
values, which means no data is missing! This is a very clean case file.
💡 Your Turn
Use the `.shape` attribute to see the dimensions of the table. It will show `(rows, columns)`. Type
`iris_df.shape` in a new code cell.
3.4 The Statistical Profile: `.describe()`
The `.describe()` command calculates key statistics (like average, min, max) for our numerical columns.
This helps us understand the range and distribution of our measurements.
print(iris_df.describe())
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Detective's Note: Look at the "mean" (average) row. The average petal length (3.76 cm)
is much larger than the average petal width (1.2 cm). Interesting...
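One more clue hides in the `.describe()` table: the "50%" row is the median, the middle value when the measurements are sorted. In the output above, petal length has a mean of 3.758 but a median of 4.35, which hints that its values are not spread evenly. The sketch below uses a small made-up Series (not Iris data) to show how a single large value pulls the mean away from the median.

```python
import pandas as pd

# Made-up measurements: four small values and one large one.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0], name="length (cm)")

stats = toy.describe()

# The "50%" entry of describe() is the median of the values.
print("mean:  ", stats["mean"])
print("median:", stats["50%"])
```

Here the mean is 4.0 while the median is 3.0, because the single value of 10.0 drags the average upward.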
💡 Your Turn
Select just the `petal length (cm)` column and then use `.describe()` on it to get statistics for
only that clue. Type `iris_df['petal length (cm)'].describe()`
Part 4: Cleaning The Scene (Data Preprocessing)
Before we can draw conclusions, we must ensure our evidence is clean. This means checking for missing
clues (null values) and accidental duplicates.
4.1 Checking for Missing Values
We already saw from `.info()` that we have no missing values, but this is the direct command to check. A
result of '0' for every column is perfect.
print(iris_df.isnull().sum())
sepal length (cm) 0
sepal width (cm) 0
petal length (cm) 0
petal width (cm) 0
species 0
dtype: int64
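Because the Iris data is so clean, it may help to see what this check looks like when clues really are missing. The sketch below uses a tiny made-up table (the values are hypothetical, not real Iris rows) with two gaps in it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two missing measurements (np.nan).
toy_df = pd.DataFrame({
    "petal length (cm)": [1.4, np.nan, 4.7],
    "petal width (cm)": [0.2, 1.3, np.nan],
    "species": ["setosa", "versicolor", "versicolor"],
})

# isnull() marks each missing cell as True; sum() counts the Trues per column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Adding up the per-column counts gives the total number of missing cells.
total_missing = int(missing_per_column.sum())
print("Total missing cells:", total_missing)
```

Any column with a count above 0 would need attention before analysis.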
4.2 Checking for Duplicates
We also need to check if any rows are exact copies of each other.
duplicates = iris_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
Number of duplicate rows: 1
We found one duplicate! Let's remove it to keep our evidence clean.
iris_df = iris_df.drop_duplicates()
print("Duplicates removed. Let's check info again:")
iris_df.info()
Duplicates removed. Let's check info again:
<class 'pandas.core.frame.DataFrame'>
Index: 149 entries, 0 to 149
Data columns (total 5 columns):
...
💡 Your Turn
Run the duplicate check again to confirm they are gone. Type `iris_df.duplicated().sum()` in a new
cell. The output should now be 0.
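You may have noticed that the summary above reads `Index: 149 entries, 0 to 149`. That is because `drop_duplicates()` keeps the original row labels, so the dropped row leaves a gap in the numbering. Here is a minimal sketch on a made-up four-row table showing the gap and one optional way to renumber the rows with `reset_index(drop=True)`. The lab itself does not require this step.

```python
import pandas as pd

# Made-up table: row 2 is an exact copy of row 0.
toy_df = pd.DataFrame({
    "petal length (cm)": [1.4, 5.1, 1.4, 4.7],
    "species": ["setosa", "virginica", "setosa", "versicolor"],
})

# duplicated() flags a row only if an identical row appeared earlier.
flags = toy_df.duplicated()
print(flags.tolist())               # [False, False, True, False]

# Dropping the duplicate keeps the original labels, so label 2 is missing.
deduped = toy_df.drop_duplicates()
print(deduped.index.tolist())       # [0, 1, 3]

# Optional: renumber the rows from 0 with no gaps.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())    # [0, 1, 2]
```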
Case File Secured!
Our data is now loaded, explored, and cleaned. The scene is secure. Now, the real detective work
begins: finding patterns with visualization!
Part 5: Drawing The Connections (Visualization)
A picture is worth a thousand data points. We will use Seaborn to create beautiful plots that reveal the
hidden stories in our data.
5.1 Analyzing One Clue at a Time (Univariate Plots)
Let's start by looking at each measurement individually. A histogram shows us the
distribution of a single variable—where the values are most common.
sns.set_style('whitegrid')
plt.figure(figsize=(8, 6))
sns.histplot(data=iris_df, x='sepal width (cm)', kde=True, color='purple')
plt.title('Distribution of Sepal Width', fontsize=16)
plt.show()
Detective's Note: The most common sepal width is around 3.0 cm. The curve (called a
Kernel Density Estimate or KDE) is roughly bell-shaped, which resembles what statisticians call a
"normal distribution."
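Behind every histogram is a simple count: the value range is cut into bins, and each bar's height is the number of values that fall in that bin. The sketch below does that counting by hand with NumPy on ten made-up widths. The bin edges here are chosen explicitly for illustration, whereas Seaborn picks its bins automatically.

```python
import numpy as np

# Ten made-up widths in cm (not real Iris measurements).
widths = np.array([2.0, 2.5, 2.8, 3.0, 3.0, 3.1, 3.2, 3.4, 3.8, 4.4])

# Cut the range 2.0 to 5.0 into three equal bins and count values in each.
counts, edges = np.histogram(widths, bins=3, range=(2.0, 5.0))

print("bin edges:", edges.tolist())    # [2.0, 3.0, 4.0, 5.0]
print("counts:   ", counts.tolist())   # [3, 6, 1]
```

The tallest bar would sit over the 3.0 to 4.0 bin, where six of the ten values fall.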
💡 Your Turn
Create a histogram for a different feature: `'sepal length (cm)'`. Change the color to `'green'`.
What do you notice about its shape?
5.2 Comparing Two Clues (Bivariate Plots)
Now let's see how two measurements relate to each other. A scatter plot is perfect for
this. We'll compare petal length and petal width. We can also use `hue` to color the dots by their
species. This is where we might crack the case!
plt.figure(figsize=(10, 7))
sns.scatterplot(data=iris_df, x='petal length (cm)', y='petal width (cm)', hue='species', s=100, palette='viridis')
plt.title('Petal Length vs. Petal Width by Species', fontsize=16)
plt.show()
Detective's Note: This is a huge breakthrough! Setosa flowers have small petals and
form a distinct, separate cluster. Versicolor and Virginica are harder to tell apart, but Virginica
generally has larger petals than Versicolor. The petal measurements seem to be the key!
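A scatter plot shows a relationship to the eye. If you also want a number for it, one common option is the Pearson correlation, which pandas provides as `.corr()`. The sketch below runs it on five made-up points with hypothetical column names, so the result says nothing about the Iris data itself.

```python
import pandas as pd

# Five made-up points where width rises steadily with length.
toy_df = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "width": [0.5, 1.1, 1.4, 2.1, 2.4],
})

# Pearson correlation: near +1 means the two columns rise together,
# near -1 means one falls as the other rises, near 0 means no linear trend.
r = toy_df["length"].corr(toy_df["width"])
print(f"correlation: {r:.2f}")
```

A value close to +1, as here, matches dots that line up along a rising diagonal.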
💡 Your Turn
Create another scatter plot, this time comparing `'sepal length (cm)'` vs. `'sepal width (cm)'`.
Don't forget to set the `hue` to `'species'`. Is the separation between species as clear with these
clues?
5.3 The Big Picture (Multivariate Plots)
What if we could see all the relationships at once? A pair plot does exactly that. It
creates a grid of scatter plots for every pair of features.
sns.pairplot(data=iris_df, hue='species')
plt.show()
Detective's Note: Look at the plots in the "petal length" and "petal width" rows and columns.
They show the clearest separation between the three species. The case is almost closed!
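To get a feel for how big that grid is, the sketch below counts the panels using only Python's standard library. With four features the grid has 4 x 4 = 16 panels, but only 6 distinct pairs of features. The panels on the diagonal pair a feature with itself, so the pair plot shows that feature's distribution there instead of a scatter plot.

```python
from itertools import combinations

features = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# The pair plot grid has one row and one column per feature.
grid_panels = len(features) ** 2

# Each unordered pair of different features appears twice in the grid
# (once above the diagonal, once below it).
unique_pairs = list(combinations(features, 2))

print("panels in the grid:", grid_panels)
print("distinct feature pairs:", len(unique_pairs))
for x, y in unique_pairs:
    print(f"  {x}  vs.  {y}")
```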
Part 6: Your Turn to be the Detective
Assignment: Deeper into the Iris Data
Use the `iris_df` DataFrame to answer the following questions. For each task, write the
code and display the plot or output in your Colab notebook.
Task 1: Basic Analysis
- How many flowers of each species are in our (cleaned) dataset? (Hint: use `.value_counts()` on
the 'species' column).
- What is the average petal length for each species? (Hint: use
`.groupby('species')['petal length (cm)'].mean()`).
Task 2: Visualization
- Create a histogram for 'petal length (cm)'. Does it look like one bell curve,
or more? Why do you think that is? (Write your answer in a text cell).
- Create a box plot to compare the 'sepal width (cm)' for each of the three
species. (Hint: `sns.boxplot(data=iris_df, x='species', y='sepal width (cm)')`). Are there any
outliers (dots outside the whiskers)? Which species has the highest median sepal width?
- Create a violin plot, which is like a box plot and a histogram combined. (Hint:
`sns.violinplot(data=iris_df, x='species', y='petal width (cm)')`). What does this plot tell you
about the distribution of petal widths for each species?
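Before hunting for outliers in the box plot, it helps to know how the dots are decided. By default, Seaborn draws the whiskers out to the furthest points that lie within 1.5 times the interquartile range (IQR) of the box, and anything beyond that is drawn as a dot. The sketch below applies the same rule by hand to seven made-up values, not to the Iris data.

```python
import pandas as pd

# Seven made-up values with one unusually low and one unusually high value.
values = pd.Series([2.0, 2.8, 3.0, 3.1, 3.2, 3.4, 4.9])

# The box spans the first quartile (25%) to the third quartile (75%).
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"fences: {lower_fence:.2f} to {upper_fence:.2f}")
print("outliers:", outliers.tolist())   # [2.0, 4.9]
```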
Part 7: Bonus - The Housing Price Case
Ready for a more challenging case? We'll look at a real dataset from a Kaggle competition about
predicting house prices in Ames, Iowa.
Kaggle & The Housing Price Dataset
Kaggle is a platform where data scientists compete. The "House Prices" competition is a fantastic real-world
problem.
Task 1: Get the Data
- Go to the House Prices data page. You will
need a Kaggle account.
- Download `train.csv`.
- In your Colab notebook, upload `train.csv` using the file browser on the left.
- Load it with `house_df = pd.read_csv('train.csv')`.
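If you want to see how `pd.read_csv` behaves before downloading anything, the sketch below reads a tiny CSV from a string instead of a file. It borrows three column names from the housing data, but every value is made up for illustration. Note how the empty field becomes a missing value.

```python
import io
import pandas as pd

# Made-up rows that borrow three column names from train.csv.
# The empty first field in the second data row stands in for a missing value.
csv_text = """GrLivArea,OverallQual,SalePrice
1500,7,200000
,5,120000
2100,8,310000
"""

# read_csv accepts a file path or a file-like object such as StringIO.
toy_house_df = pd.read_csv(io.StringIO(csv_text))

print(toy_house_df.shape)
print(toy_house_df.isnull().sum())
```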
Task 2: Your Challenge - Exploratory Data Analysis
This dataset is much bigger and messier than the Iris dataset. Your goal is to find initial clues about
what features might influence the final `SalePrice`.
- First Look: Use `.info()` to look at the data. How many columns are there? Do you
see any with missing values?
- Price Distribution: Create a histogram of the `SalePrice`. Is it a perfect bell
curve, or does one tail stretch out further than the other? (A lopsided distribution like that is
called "skewed").
- Living Area vs. Price: The `GrLivArea` is the above-ground living area in square
feet. Create a scatter plot of `GrLivArea` (x-axis) vs. `SalePrice` (y-axis). Does there appear to
be a positive relationship?
- Overall Quality: The `OverallQual` feature rates the quality of the house from 1 to
10. Create a box plot to show the relationship between `OverallQual` (x-axis) and
`SalePrice` (y-axis). What does this plot tell you about how quality affects price?
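As a companion to the histogram, pandas can put a number on skew with `.skew()`. A positive result means a longer tail on the right (a few unusually large values), a negative result means a longer tail on the left, and a value near 0 means roughly symmetric. The prices below are made up for illustration and are not taken from the Kaggle data.

```python
import pandas as pd

# Made-up prices (in thousands): six ordinary values and one very high one.
prices = pd.Series([100, 110, 120, 130, 140, 150, 400])

# A positive skew value signals a long right tail.
skewness = prices.skew()
print(f"skew: {skewness:.2f}")
```

You could run the same call on `house_df['SalePrice']` and compare the sign with what your histogram shows.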
Part 8: Submission Guidelines
To complete this lab, please follow these instructions carefully.
- Complete all "Your Turn" tasks and the main assignment in Part 6 in a single Google Colab notebook. The
Kaggle project is a bonus, but highly recommended!
- Use Text Cells to label each section and answer any written questions.
- Ensure all your code cells have been run so that their outputs and plots are visible.
- When you are finished, generate a shareable link. In Colab, click the "Share"
button in the top right.
- In the popup, under "General access", change "Restricted" to "Anyone with the link"
and ensure the role is set to "Viewer".
- Click "Copy link" and submit this link as your assignment.