Topic 8.6: Handling Missing Values

🌉 The Incomplete Artefact

▼

When archaeologists excavate an ancient site, they frequently encounter incomplete artefacts — a pottery vessel with a missing handle, a mosaic with several tiles absent, a scroll that is partially burned. For each incomplete artefact, they face the same decision: reconstruct the missing part based on what is known about similar artefacts, or preserve the artefact as-is and document the gap. Reconstruction is appropriate when the missing piece can be estimated reliably from surrounding context. Preservation of the gap is appropriate when reconstruction would involve too much guesswork.

The same decision applies to missing values in data. Imputation (filling the gap with an estimated value) is appropriate when the missing value can be reasonably estimated from the available data. Deletion (removing the row) is appropriate when filling would introduce more error than removing. The question is not 'which is better' in the abstract — it is 'which is better for this specific column, given its missingness rate and its role in the analysis'.

ℹ️

The Fundamental Trade-Off

Every approach to missing values involves a trade-off. Deletion preserves the accuracy of the remaining data but reduces the dataset's size, potentially losing important rows. Imputation preserves the dataset's size but introduces estimated values that may not be accurate — potentially adding noise or bias. Neither option is free of cost.

🔧 Fill with Median

▼

Before deciding how to fill anything, the first step is to measure the scope of the problem with .isna().sum().

Python

import pandas as pd
import numpy as np

df = pd.read_csv('../Datasets/8_4_competition_results.csv')

# Check missing values per column
print(df.isna().sum())
print(f"\nTotal rows: {len(df)}")

With the scope of missingness established, .fillna() replaces NaN values in a Series or DataFrame column with a specified value. For numeric columns like score, the most robust choice is the median — not the mean. The median is not affected by extreme values (outliers), making it a more reliable central estimate for skewed distributions.

Python

# Reload and use median — more robust when outliers exist
df = pd.read_csv('../Datasets/8_4_competition_results.csv')
median_score = df['score'].median()
df['score'] = df['score'].fillna(median_score)

print(f"Median score: {median_score}")

▶ Output

Median score: 82.0

The missing scores have been replaced with the column median. The median is computed only from the non-missing values, so it is unaffected by the rows that were missing — a key advantage over forward-filling or zero-filling.

🔧 Fill with Mode

▼

For categorical (text) columns, the appropriate imputation is the mode — the most frequently occurring value. There is no concept of a 'median' for text categories, and the mean is irrelevant. The mode represents the most common category in the dataset and is a reasonable default for a missing categorical value. The notebook demonstrates this on the city column:

Python

# For categorical columns — use the most frequent value (mode)
mode_city = df['city'].mode()[0]
df['city'] = df['city'].fillna(mode_city)

print(f"Mode city used: {mode_city}")
print(f"Missing cities remaining: {df['city'].isna().sum()}")

The same pattern — compute .mode()[0], then .fillna() with that value — applies identically to any categorical column in the dataset, including level (beginner/intermediate/advanced). Whichever column you apply it to, the most frequent category becomes the fill value for every missing entry in that column.

ℹ️

Why mode()[0] and Not Just mode()?

The .mode() method returns a Series of mode values — not a single value. If two categories are tied for most frequent, both appear in the result. Using [0] extracts the first (alphabetically earliest) mode value as a scalar. Always use [0] when passing the mode to fillna(), otherwise Pandas receives a Series instead of a scalar and the fill will not work as expected.

🗑️ Drop the Row

▼

The .dropna() method removes rows (or columns) that contain missing values. Its default behaviour is aggressive — it drops any row with at least one NaN in any column. The subset and thresh parameters allow much more targeted deletion.

Python

# Drop any row with at least one missing value
df_dropped = df.dropna()

print(f"Rows before: {len(df)}")
print(f"Rows after dropna: {len(df_dropped)}")

This is the aggressive default behaviour — any row with even one missing value in any column is removed entirely. The subset parameter narrows this down to a targeted deletion based on specific columns only:

Python

# Drop rows only where 'score' is missing — ignore other columns
df_clean = df.dropna(subset=['score'])

print(f"Rows after targeted dropna: {len(df_clean)}")
print(f"Missing scores remaining: {df_clean['score'].isna().sum()}")

Results analysis: a plain dropna() removes every row that has at least one missing value anywhere — across score and level, this removes more rows than necessary if you only care about completeness in one column. dropna(subset=['score']) is the targeted alternative — it removes only the rows missing a score, leaving rows with a missing level untouched.

dropna() default

Drops any row with at least one NaN in any column. Most aggressive — can remove rows that are nearly complete because of one unimportant missing value.

dropna(subset=['col'])

Drops rows only if the specified column(s) are missing. Targeted and safe — other missing values do not trigger deletion.

📊 Verify the Fix

▼

After applying imputation, always verify that missing values have been correctly filled and that the column's statistical properties remain sensible. The score distribution should not change dramatically after median imputation — all 12 filled values are set to the median, which by definition does not shift the median.

Python

# Complete the cleaning: apply both imputation strategies
df_clean = pd.read_csv('../Datasets/8_4_competition_results.csv')

# Fill score with median
df_clean['score'] = df_clean['score'].fillna(df_clean['score'].median())

# Fill level with mode
df_clean['level'] = df_clean['level'].fillna(df_clean['level'].mode()[0])

# Verify: no missing values remain
print("Missing values after both imputation steps:")
print(df_clean.isna().sum())

print(f"\nFinal shape: {df_clean.shape}")
print(f"\nScore distribution after imputation:")
print(df_clean['score'].describe().round(2))

▶ Output

Missing values after both imputation steps: name 0 age 0 city 0 score 0 grade 0 level 0 date 0 dtype: int64 Final shape: (200, 7) Score distribution after imputation: count 200.00 mean 80.55 std 11.88 min 45.00 25% 73.00 50% 82.00 75% 90.00 max 99.00

Zero missing values across all columns. The score distribution has shifted slightly (mean from 80.34 to 80.55) because 12 rows that were previously excluded from the mean calculation are now included with a value of 82. The median remains at 82 — confirming the imputation did not distort the central tendency.

📋 Fill or Drop?

▼

The choice between imputation and deletion — and between mean, median, and mode imputation — depends on three factors: the column's data type, the percentage of missing values, and whether the missingness appears random.

Rule 1

Use median imputation for numeric columns with potential skew or outliers.

The median is not pulled by extreme values. For exam scores, salaries, or age — wherever an outlier could distort the mean — the median is the safer imputation value.

Rule 2

Use mode imputation for categorical columns.

The most common category is the best single-value estimate for a missing categorical entry. Use .mode()[0] to extract the scalar value.

Rule 3

Use dropna(subset=['col']) when the column is critical and imputing would introduce too much uncertainty.

If the missing column is the target variable in a prediction task, imputing it can corrupt model training. Deletion is safer.

Rule 4

Use dropna(thresh=N) when rows must meet a minimum completeness standard.

Useful for datasets with many columns where a row missing more than a few values is not reliable enough to use.

Rule 5

Never impute before removing duplicates.

Imputing values on duplicated rows means the same estimated value is applied twice — once to the original and once to the copy. Remove duplicates first.

📊 Too Much Missing

▼

Not all missing value situations call for the same strategy. The proportion of missing values relative to the total dataset size is the primary driver of the decision: impute, delete rows, or delete the column entirely.

Missingness Severity Guide

0–5% Missing (Low)

Simple imputation (mean, median, mode) is appropriate
Row deletion is also acceptable — minimal data loss
W8 level column: 2.4% missing → mode imputation

5–20% Missing (Moderate)

Imputation preferred to preserve data volume
Investigate whether missingness is random or patterned
W8 score column: 5.8% missing → median imputation

20–50% Missing (High)

Column-level deletion may be appropriate
Imputation risks introducing significant bias
Discuss with stakeholders before deciding

>50% Missing (Critical)

The column is mostly absent — usually drop it
Discuss with stakeholders whether data can be re-collected
Keeping the column introduces more noise than signal

For the W8 dataset, both missing columns fall in the 0–20% range. The score column (5.8%) and level column (2.4%) are both candidates for imputation — deletion would be acceptable but wasteful given the small proportion of missing data.

Python

# Demonstrate the severity guide applied to the W8 dataset
df = pd.read_csv('../Datasets/8_4_competition_results.csv')

missing_pct = df.isna().mean() * 100
print("Missing percentage per column:")
for col, pct in missing_pct.items():
    if pct > 0:
        severity = 'Low (impute)' if pct < 5 else 'Moderate (impute)' if pct < 20 else 'High (consider drop)'
        print(f"  {col}: {pct:.1f}% → {severity}")
    else:
        print(f"  {col}: 0.0% → No action needed")

▶ Output

Missing percentage per column: name: 0.0% → No action needed age: 0.0% → No action needed city: 0.0% → No action needed score: 5.8% → Moderate (impute) grade: 0.0% → No action needed level: 2.4% → Low (impute) date: 0.0% → No action needed

The output confirms the imputation strategy for the W8 dataset: score gets median imputation (moderate missingness), level gets mode imputation (low missingness). No columns are in the high or critical range. The decision is straightforward for this dataset, though in real-world projects the decision often requires domain knowledge about why the values are missing.

⚠️ Missing Data Myths

▼

⚠️

Misconception: Mean imputation is always the best choice for numeric columns

Mean imputation is only appropriate for roughly symmetric distributions without strong outliers. For skewed distributions — common in financial data, competition scores, and population metrics — the mean is an unrepresentative summary. Median imputation is more robust for skewed data because it is not pulled toward the extreme values.

⚠️

Misconception: .dropna() by default only removes completely empty rows

.dropna() with default parameters (how='any') removes any row that has at least one missing value in any column. In a 7-column dataset, a row missing only one value in an unimportant column will be removed entirely. Use the subset parameter to restrict deletion to columns that actually matter for the analysis.

⚠️

Misconception: After imputation, the dataset is as good as a complete dataset

Imputed values are estimates — not real observations. Any analysis that includes imputed values has more uncertainty than an analysis of a fully observed dataset, even if there are no more NaN values. The decision to impute should be documented, and the sensitivity of conclusions to the imputation strategy should be considered when presenting results.

✅ Fill or Drop?

▼

?

The W8 'score' column has 12 missing values and a slight right skew (mean 80.34, median 82.0). Which imputation strategy is most appropriate, and why?

?

After applying df = df.dropna(subset=['score']), how many rows remain in the W8 dataset (after deduplication)?

.fillna(value) replaces NaN values with a specified value — use the median for numeric columns (robust to skew) and the mode for categorical columns (.mode()[0] to extract the scalar).
.dropna() removes rows with missing values — use subset= to target specific columns, how='all' for rows missing everything, and thresh=N for a minimum completeness requirement.
The W8 dataset: 12 missing scores filled with the median (82.0); 5 missing level values filled with the mode ('intermediate').
Imputation preserves data volume but introduces estimated values; deletion preserves accuracy but reduces data size — neither is free of cost.
Always impute after removing duplicates, and always verify the missing value count is zero after imputation using df.isna().sum().
For the W8 dataset, both missing columns are below the 20% threshold — imputation is appropriate and row deletion is not needed.

📂Dataset & Notebook

▼

↗
Dataset: 8_4_competition_results.csv
The W8 competition dataset used in the code examples above
↗
Notebook: 8_6_Missing_Values.ipynb
Follow-along notebook for this topic's imputation steps

📚External Resources

▼

↗
Pandas Documentation: DataFrame.fillna()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
↗
Pandas Documentation: DataFrame.dropna()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
↗
scikit-learn: Imputation of Missing Values
https://scikit-learn.org/stable/modules/impute.html