🔬
Lab Data Cleaning Missing Values Normalization Imputation

Topic 5.7: Laboratory - Professional Data Preparation

Cleaning messy data and preparing it for analysis through hands-on practice

🎯 Lab Overview

Real-world data is messy. Sensors fail, users skip fields, systems crash mid-recording. Before you can analyze data, you must clean it. This lab walks you through the two most common data preparation tasks: handling missing values and normalizing scales.

You'll work with a dataset of temperature readings from weather sensors. Some readings are missing (NaN), and the different sensors use different measurement scales. By the end, you'll have clean, standardized data ready for analysis.

ℹ️
What You'll Learn

How to detect missing values, choose between deletion and imputation strategies, understand when to use nan-safe functions, normalize data to a common scale using Min-Max scaling, and verify your transformations worked correctly.

🔍 Part 1: Detecting Missing Data

Step 1: Create a sample dataset with missing values. In NumPy, missing data is represented as np.nan (Not a Number):

Python
import numpy as np

# Temperature readings from 5 sensors over 6 hours
# Some readings are missing (np.nan)
temperatures = np.array([
    [22.5, 23.1, np.nan, 24.2, 23.8, 24.5],  # Sensor 1
    [21.8, 22.3, 22.9, np.nan, 23.5, 24.1],  # Sensor 2
    [23.2, np.nan, 24.1, 24.8, 25.2, 25.6],  # Sensor 3
    [22.0, 22.6, 23.2, 23.8, np.nan, 24.9],  # Sensor 4
    [21.5, 22.1, 22.7, 23.3, 23.9, 24.5]     # Sensor 5 (no missing)
])

print("Raw temperature data:")
print(temperatures)
print(f"\nShape: {temperatures.shape}")
▶ Output
Raw temperature data:
[[22.5 23.1 nan 24.2 23.8 24.5]
 [21.8 22.3 22.9 nan 23.5 24.1]
 [23.2 nan 24.1 24.8 25.2 25.6]
 [22. 22.6 23.2 23.8 nan 24.9]
 [21.5 22.1 22.7 23.3 23.9 24.5]]

Shape: (5, 6)

Step 2: Detect where the missing values are using np.isnan(). This creates a boolean mask:

Python
# Create a boolean mask: True where data is missing
nan_mask = np.isnan(temperatures)

print("Missing value mask (True = missing):")
print(nan_mask)

# Count total missing values
total_missing = np.sum(nan_mask)
print(f"\nTotal missing readings: {total_missing}")

# Count missing per sensor (row)
missing_per_sensor = np.sum(nan_mask, axis=1)
print(f"Missing per sensor: {missing_per_sensor}")
▶ Output
Missing value mask (True = missing):
[[False False True False False False]
 [False False False True False False]
 [False True False False False False]
 [False False False False True False]
 [False False False False False False]]

Total missing readings: 4
Missing per sensor: [1 1 1 1 0]

Step 3: Understand what you found. We have 4 missing readings total. Sensors 1-4 each have one missing reading. Sensor 5 has perfect data. This tells us the problem isn't catastrophic—we're not losing entire sensors, just individual readings.

🩹 Part 2: Imputing Missing Values

Step 4: Decide on a strategy. You have two options when data is missing:

Delete: Remove rows/columns with missing data (simple but loses information)
Impute: Fill missing values with reasonable estimates (keeps data but adds assumptions)

Since we only have 4 missing values out of 30 total (13%), and they're scattered, imputation makes more sense than deletion.

Step 5: Calculate the mean temperature for each sensor (ignoring missing values). We need nan-safe functions:

Python
# Calculate mean for each sensor (axis=1 means across columns)
# MUST use np.nanmean() not .mean() because NaN poisons normal mean
sensor_means = np.nanmean(temperatures, axis=1)

print("Average temperature per sensor:")
for i, avg in enumerate(sensor_means):
    print(f"  Sensor {i+1}: {avg:.2f}°C")
▶ Output
Average temperature per sensor:
  Sensor 1: 23.62°C
  Sensor 2: 22.92°C
  Sensor 3: 24.58°C
  Sensor 4: 23.30°C
  Sensor 5: 23.00°C

Step 6: Fill each sensor's missing reading with that sensor's average. We loop through each row (sensor) and use the boolean mask to find and fill NaN values:

Python
# Create a copy to preserve original
temps_cleaned = temperatures.copy()

# Fill missing values row by row
for sensor_idx in range(temps_cleaned.shape[0]):
    # Get mask for this sensor's row
    row_mask = nan_mask[sensor_idx]
    
    # Fill NaN positions with this sensor's average
    temps_cleaned[sensor_idx][row_mask] = sensor_means[sensor_idx]

print("Data after imputation:")
print(temps_cleaned)

# Verify no NaNs remain
remaining_nans = np.sum(np.isnan(temps_cleaned))
print(f"\nRemaining NaN values: {remaining_nans}")
▶ Output
Data after imputation:
[[22.5 23.1 23.62 24.2 23.8 24.5 ]
 [21.8 22.3 22.9 22.92 23.5 24.1 ]
 [23.2 24.58 24.1 24.8 25.2 25.6 ]
 [22. 22.6 23.2 23.8 23.3 24.9 ]
 [21.5 22.1 22.7 23.3 23.9 24.5 ]]

Remaining NaN values: 0

Step 7: Verify the imputation makes sense. The filled values (like 23.62 for Sensor 1's hour 2) are reasonable—they're close to surrounding readings. The data is now complete and ready for the next step.

📐 Part 3: Normalizing to Common Scale

Step 8: Understand the problem. Look at the data ranges for each sensor:

Python
# Find min and max for each sensor
for i in range(temps_cleaned.shape[0]):
    sensor_data = temps_cleaned[i]
    print(f"Sensor {i+1}: {sensor_data.min():.2f}°C to {sensor_data.max():.2f}°C (range: {sensor_data.max() - sensor_data.min():.2f})")
▶ Output
Sensor 1: 22.50°C to 24.50°C (range: 2.00)
Sensor 2: 21.80°C to 24.10°C (range: 2.30)
Sensor 3: 23.20°C to 25.60°C (range: 2.40)
Sensor 4: 22.00°C to 24.90°C (range: 2.90)
Sensor 5: 21.50°C to 24.50°C (range: 3.00)

The sensors have different ranges. If we want to compare readings across sensors fairly, we need to normalize them to a common scale. We'll use Min-Max normalization to scale all values to [0, 1].

Step 9: Understand the Min-Max formula. For each value X in a dataset:

normalized = (X - min) / (max - min)

This formula: (1) Subtracts the minimum (shifts the range to start at 0), (2) Divides by the range (scales so the max becomes 1)

Step 10: Apply Min-Max normalization. We'll normalize each sensor individually so each has values from 0 to 1:

Python
# Create array for normalized data
temps_normalized = np.zeros_like(temps_cleaned)

# Normalize each sensor's data
for i in range(temps_cleaned.shape[0]):
    sensor_data = temps_cleaned[i]
    
    # Find min and max for this sensor
    data_min = sensor_data.min()
    data_max = sensor_data.max()
    
    # Apply Min-Max formula
    # Subtract min (shifts to 0), divide by range (scales to [0,1])
    temps_normalized[i] = (sensor_data - data_min) / (data_max - data_min)

print("Normalized temperature data:")
print(temps_normalized)
▶ Output
Normalized temperature data:
[[0. 0.3 0.56 0.85 0.65 1. ]
 [0. 0.22 0.48 0.49 0.74 1. ]
 [0. 0.57 0.38 0.67 0.83 1. ]
 [0. 0.21 0.41 0.62 0.45 1. ]
 [0. 0.2 0.4 0.6 0.8 1. ]]

Step 11: Verify the normalization worked correctly:

Python
# Check: every row should have min=0 and max=1
for i in range(temps_normalized.shape[0]):
    row_min = temps_normalized[i].min()
    row_max = temps_normalized[i].max()
    print(f"Sensor {i+1}: min={row_min:.2f}, max={row_max:.2f}")

# Verify all values are in [0, 1]
all_in_range = np.all((temps_normalized >= 0) & (temps_normalized <= 1))
print(f"\nAll values in [0,1]: {all_in_range}")
▶ Output
Sensor 1: min=0.00, max=1.00
Sensor 2: min=0.00, max=1.00
Sensor 3: min=0.00, max=1.00
Sensor 4: min=0.00, max=1.00
Sensor 5: min=0.00, max=1.00

All values in [0,1]: True

Perfect! Every sensor now has data ranging from exactly 0 (coldest reading for that sensor) to exactly 1 (warmest reading for that sensor). The values in between are proportionally scaled.

🎯 Understanding What You Accomplished

Step 12: Compare original vs cleaned vs normalized data to see the complete transformation:

Python
# Show Sensor 1's transformation
print("Sensor 1 journey:")
print(f"Original:   {temperatures[0]}")
print(f"Cleaned:    {temps_cleaned[0]}")
print(f"Normalized: {temps_normalized[0]}")
▶ Output
Sensor 1 journey:
Original: [22.5 23.1 nan 24.2 23.8 24.5]
Cleaned: [22.5 23.1 23.62 24.2 23.8 24.5 ]
Normalized: [0. 0.3 0.56 0.85 0.65 1. ]

Notice the progression: the original had a missing value, the cleaned version filled it with the sensor's average (23.62°C), and the normalized version scaled everything to [0, 1] where 0 represents the coldest reading (22.5°C) and 1 represents the warmest (24.5°C).

ℹ️
Why This Matters

Data cleaning and normalization are prerequisites for machine learning. Models can't process NaN values—they'll crash or produce garbage. And when features have wildly different scales, models give unfair weight to larger numbers. These preprocessing steps ensure your data is ready for whatever analysis comes next.

💡 Key Techniques Summary

You practiced several essential data preparation techniques:

1 Detection: Use np.isnan() to create boolean masks revealing missing data locations
2 Quantification: Use np.sum() on boolean masks to count missing values across different axes
3 Nan-safe operations: Always use np.nanmean(), np.nanstd(), etc. when data might contain NaN
4 Imputation: Fill missing values with statistical estimates (mean, median) rather than deleting data
5 Min-Max normalization: Scale data to [0,1] using (X - min) / (max - min)
6 Verification: Always check your transformations worked (no NaNs remain, min=0, max=1)

These steps form the foundation of real-world data science work. In Week 6, you'll learn Pandas, which makes these operations even easier with built-in functions like .fillna() and .dropna(). But understanding the NumPy implementation shows you what's happening under the hood.

  • Real-world data always has quality issues—missing values, inconsistent scales, errors—that must be fixed before analysis.
  • np.isnan() creates boolean masks to detect missing data; np.sum() on masks counts missing values across chosen axes.
  • Nan-safe functions (nanmean, nanstd, nanmin, nanmax) ignore NaN values in calculations instead of returning NaN.
  • Imputation fills missing values with estimates—using row/column means preserves local patterns better than global means.
  • Min-Max normalization scales data to [0,1]: subtract min (shift to 0), divide by range (scale to 1).
  • Always verify transformations: check for remaining NaNs, confirm min/max values match expectations.
  • Data preparation typically consumes 80% of analysis time—it's not glamorous but it's essential for valid results.
📚External Resources