Topic 5.7: Laboratory - Professional Data Preparation
Cleaning messy data and preparing it for analysis through hands-on practice
Real-world data is messy. Sensors fail, users skip fields, systems crash mid-recording. Before you can analyze data, you must clean it. This lab walks you through the two most common data preparation tasks: handling missing values and normalizing scales.
You'll work with a dataset of temperature readings from weather sensors. Some readings are missing (NaN), and the different sensors use different measurement scales. By the end, you'll have clean, standardized data ready for analysis.
How to detect missing values, choose between deletion and imputation strategies, understand when to use nan-safe functions, normalize data to a common scale using Min-Max scaling, and verify your transformations worked correctly.
Step 1: Create a sample dataset with missing values. In NumPy, missing data is represented as np.nan (Not a Number):
import numpy as np # Temperature readings from 5 sensors over 6 hours # Some readings are missing (np.nan) temperatures = np.array([ [22.5, 23.1, np.nan, 24.2, 23.8, 24.5], # Sensor 1 [21.8, 22.3, 22.9, np.nan, 23.5, 24.1], # Sensor 2 [23.2, np.nan, 24.1, 24.8, 25.2, 25.6], # Sensor 3 [22.0, 22.6, 23.2, 23.8, np.nan, 24.9], # Sensor 4 [21.5, 22.1, 22.7, 23.3, 23.9, 24.5] # Sensor 5 (no missing) ]) print("Raw temperature data:") print(temperatures) print(f"\nShape: {temperatures.shape}")
[[22.5 23.1 nan 24.2 23.8 24.5]
[21.8 22.3 22.9 nan 23.5 24.1]
[23.2 nan 24.1 24.8 25.2 25.6]
[22. 22.6 23.2 23.8 nan 24.9]
[21.5 22.1 22.7 23.3 23.9 24.5]]
Shape: (5, 6)
Step 2: Detect where the missing values are using np.isnan(). This creates a boolean mask:
# Create a boolean mask: True where data is missing nan_mask = np.isnan(temperatures) print("Missing value mask (True = missing):") print(nan_mask) # Count total missing values total_missing = np.sum(nan_mask) print(f"\nTotal missing readings: {total_missing}") # Count missing per sensor (row) missing_per_sensor = np.sum(nan_mask, axis=1) print(f"Missing per sensor: {missing_per_sensor}")
[[False False True False False False]
[False False False True False False]
[False True False False False False]
[False False False False True False]
[False False False False False False]]
Total missing readings: 4
Missing per sensor: [1 1 1 1 0]
Step 3: Understand what you found. We have 4 missing readings total. Sensors 1-4 each have one missing reading. Sensor 5 has perfect data. This tells us the problem isn't catastrophic—we're not losing entire sensors, just individual readings.
Step 4: Decide on a strategy. You have two options when data is missing:
• Delete: Remove rows/columns with missing data (simple but loses information)
• Impute: Fill missing values with reasonable estimates (keeps data but adds assumptions)
Since we only have 4 missing values out of 30 total (13%), and they're scattered, imputation makes more sense than deletion.
Step 5: Calculate the mean temperature for each sensor (ignoring missing values). We need nan-safe functions:
# Calculate mean for each sensor (axis=1 means across columns) # MUST use np.nanmean() not .mean() because NaN poisons normal mean sensor_means = np.nanmean(temperatures, axis=1) print("Average temperature per sensor:") for i, avg in enumerate(sensor_means): print(f" Sensor {i+1}: {avg:.2f}°C")
Sensor 1: 23.62°C
Sensor 2: 22.92°C
Sensor 3: 24.58°C
Sensor 4: 23.30°C
Sensor 5: 23.00°C
Step 6: Fill each sensor's missing reading with that sensor's average. We loop through each row (sensor) and use the boolean mask to find and fill NaN values:
# Create a copy to preserve original temps_cleaned = temperatures.copy() # Fill missing values row by row for sensor_idx in range(temps_cleaned.shape[0]): # Get mask for this sensor's row row_mask = nan_mask[sensor_idx] # Fill NaN positions with this sensor's average temps_cleaned[sensor_idx][row_mask] = sensor_means[sensor_idx] print("Data after imputation:") print(temps_cleaned) # Verify no NaNs remain remaining_nans = np.sum(np.isnan(temps_cleaned)) print(f"\nRemaining NaN values: {remaining_nans}")
[[22.5 23.1 23.62 24.2 23.8 24.5 ]
[21.8 22.3 22.9 22.92 23.5 24.1 ]
[23.2 24.58 24.1 24.8 25.2 25.6 ]
[22. 22.6 23.2 23.8 23.3 24.9 ]
[21.5 22.1 22.7 23.3 23.9 24.5 ]]
Remaining NaN values: 0
Step 7: Verify the imputation makes sense. The filled values (like 23.62 for Sensor 1's hour 2) are reasonable—they're close to surrounding readings. The data is now complete and ready for the next step.
Step 8: Understand the problem. Look at the data ranges for each sensor:
# Find min and max for each sensor for i in range(temps_cleaned.shape[0]): sensor_data = temps_cleaned[i] print(f"Sensor {i+1}: {sensor_data.min():.2f}°C to {sensor_data.max():.2f}°C (range: {sensor_data.max() - sensor_data.min():.2f})")
Sensor 2: 21.80°C to 24.10°C (range: 2.30)
Sensor 3: 23.20°C to 25.60°C (range: 2.40)
Sensor 4: 22.00°C to 24.90°C (range: 2.90)
Sensor 5: 21.50°C to 24.50°C (range: 3.00)
The sensors have different ranges. If we want to compare readings across sensors fairly, we need to normalize them to a common scale. We'll use Min-Max normalization to scale all values to [0, 1].
Step 9: Understand the Min-Max formula. For each value X in a dataset:
normalized = (X - min) / (max - min)
This formula: (1) Subtracts the minimum (shifts the range to start at 0), (2) Divides by the range (scales so the max becomes 1)
Step 10: Apply Min-Max normalization. We'll normalize each sensor individually so each has values from 0 to 1:
# Create array for normalized data temps_normalized = np.zeros_like(temps_cleaned) # Normalize each sensor's data for i in range(temps_cleaned.shape[0]): sensor_data = temps_cleaned[i] # Find min and max for this sensor data_min = sensor_data.min() data_max = sensor_data.max() # Apply Min-Max formula # Subtract min (shifts to 0), divide by range (scales to [0,1]) temps_normalized[i] = (sensor_data - data_min) / (data_max - data_min) print("Normalized temperature data:") print(temps_normalized)
[[0. 0.3 0.56 0.85 0.65 1. ]
[0. 0.22 0.48 0.49 0.74 1. ]
[0. 0.57 0.38 0.67 0.83 1. ]
[0. 0.21 0.41 0.62 0.45 1. ]
[0. 0.2 0.4 0.6 0.8 1. ]]
Step 11: Verify the normalization worked correctly:
# Check: every row should have min=0 and max=1 for i in range(temps_normalized.shape[0]): row_min = temps_normalized[i].min() row_max = temps_normalized[i].max() print(f"Sensor {i+1}: min={row_min:.2f}, max={row_max:.2f}") # Verify all values are in [0, 1] all_in_range = np.all((temps_normalized>=0) & (temps_normalized<=1)) print(f"\nAll values in [0,1]: {all_in_range}")
Sensor 2: min=0.00, max=1.00
Sensor 3: min=0.00, max=1.00
Sensor 4: min=0.00, max=1.00
Sensor 5: min=0.00, max=1.00
All values in [0,1]: True
Perfect! Every sensor now has data ranging from exactly 0 (coldest reading for that sensor) to exactly 1 (warmest reading for that sensor). The values in between are proportionally scaled.
Step 12: Compare original vs cleaned vs normalized data to see the complete transformation:
# Show Sensor 1's transformation print("Sensor 1 journey:") print(f"Original: {temperatures[0]}") print(f"Cleaned: {temps_cleaned[0]}") print(f"Normalized: {temps_normalized[0]}")
Original: [22.5 23.1 nan 24.2 23.8 24.5]
Cleaned: [22.5 23.1 23.62 24.2 23.8 24.5 ]
Normalized: [0. 0.3 0.56 0.85 0.65 1. ]
Notice the progression: the original had a missing value, the cleaned version filled it with the sensor's average (23.62°C), and the normalized version scaled everything to [0, 1] where 0 represents the coldest reading (22.5°C) and 1 represents the warmest (24.5°C).
Data cleaning and normalization are prerequisites for machine learning. Models can't process NaN values—they'll crash or produce garbage. And when features have wildly different scales, models give unfair weight to larger numbers. These preprocessing steps ensure your data is ready for whatever analysis comes next.
You practiced several essential data preparation techniques:
These steps form the foundation of real-world data science work. In Week 6, you'll learn Pandas, which makes these operations even easier with built-in functions like .fillna() and .dropna(). But understanding the NumPy implementation shows you what's happening under the hood.
- Real-world data always has quality issues—missing values, inconsistent scales, errors—that must be fixed before analysis.
- np.isnan() creates boolean masks to detect missing data; np.sum() on masks counts missing values across chosen axes.
- Nan-safe functions (nanmean, nanstd, nanmin, nanmax) ignore NaN values in calculations instead of returning NaN.
- Imputation fills missing values with estimates—using row/column means preserves local patterns better than global means.
- Min-Max normalization scales data to [0,1]: subtract min (shift to 0), divide by range (scale to 1).
- Always verify transformations: check for remaining NaNs, confirm min/max values match expectations.
- Data preparation typically consumes 80% of analysis time—it's not glamorous but it's essential for valid results.
- ↗ Handling Missing Data in NumPy
https://numpy.org/doc/stable/user/misc.html#handling-nan-in-ufuncs - ↗ Data Normalization Techniques
https://scikit-learn.org/stable/modules/preprocessing.html - ↗ Real-World Data Cleaning Guide
https://realpython.com/python-data-cleaning-numpy-pandas/