Topic 8.6: Handling Missing Values
Choosing between fillna() and dropna() â and matching the imputation strategy to each column's type and missingness pattern
When archaeologists excavate an ancient site, they frequently encounter incomplete artefacts â a pottery vessel with a missing handle, a mosaic with several tiles absent, a scroll that is partially burned. For each incomplete artefact, they face the same decision: reconstruct the missing part based on what is known about similar artefacts, or preserve the artefact as-is and document the gap. Reconstruction is appropriate when the missing piece can be estimated reliably from surrounding context. Preservation of the gap is appropriate when reconstruction would involve too much guesswork.
The same decision applies to missing values in data. Imputation (filling the gap with an estimated value) is appropriate when the missing value can be reasonably estimated from the available data. Deletion (removing the row) is appropriate when filling would introduce more error than removing. The question is not 'which is better' in the abstract â it is 'which is better for this specific column, given its missingness rate and its role in the analysis'.
Every approach to missing values involves a trade-off. Deletion preserves the accuracy of the remaining data but reduces the dataset's size, potentially losing important rows. Imputation preserves the dataset's size but introduces estimated values that may not be accurate â potentially adding noise or bias. Neither option is free of cost.
Before deciding how to fill anything, the first step is to measure the scope of the problem with .isna().sum().
import pandas as pd import numpy as np df = pd.read_csv('../Datasets/8_4_competition_results.csv') # Check missing values per column print(df.isna().sum()) print(f"\nTotal rows: {len(df)}")
With the scope of missingness established, .fillna() replaces NaN values in a Series or DataFrame column with a specified value. For numeric columns like score, the most robust choice is the median â not the mean. The median is not affected by extreme values (outliers), making it a more reliable central estimate for skewed distributions.
# Reload and use median â more robust when outliers exist df = pd.read_csv('../Datasets/8_4_competition_results.csv') median_score = df['score'].median() df['score'] = df['score'].fillna(median_score) print(f"Median score: {median_score}")
The missing scores have been replaced with the column median. The median is computed only from the non-missing values, so it is unaffected by the rows that were missing â a key advantage over forward-filling or zero-filling.
For categorical (text) columns, the appropriate imputation is the mode â the most frequently occurring value. There is no concept of a 'median' for text categories, and the mean is irrelevant. The mode represents the most common category in the dataset and is a reasonable default for a missing categorical value. The notebook demonstrates this on the city column:
# For categorical columns â use the most frequent value (mode) mode_city = df['city'].mode()[0] df['city'] = df['city'].fillna(mode_city) print(f"Mode city used: {mode_city}") print(f"Missing cities remaining: {df['city'].isna().sum()}")
The same pattern â compute .mode()[0], then .fillna() with that value â applies identically to any categorical column in the dataset, including level (beginner/intermediate/advanced). Whichever column you apply it to, the most frequent category becomes the fill value for every missing entry in that column.
The .mode() method returns a Series of mode values â not a single value. If two categories are tied for most frequent, both appear in the result. Using [0] extracts the first (alphabetically earliest) mode value as a scalar. Always use [0] when passing the mode to fillna(), otherwise Pandas receives a Series instead of a scalar and the fill will not work as expected.
The .dropna() method removes rows (or columns) that contain missing values. Its default behaviour is aggressive â it drops any row with at least one NaN in any column. The subset and thresh parameters allow much more targeted deletion.
# Drop any row with at least one missing value df_dropped = df.dropna() print(f"Rows before: {len(df)}") print(f"Rows after dropna: {len(df_dropped)}")
This is the aggressive default behaviour â any row with even one missing value in any column is removed entirely. The subset parameter narrows this down to a targeted deletion based on specific columns only:
# Drop rows only where 'score' is missing â ignore other columns df_clean = df.dropna(subset=['score']) print(f"Rows after targeted dropna: {len(df_clean)}") print(f"Missing scores remaining: {df_clean['score'].isna().sum()}")
Results analysis: a plain dropna() removes every row that has at least one missing value anywhere â across score and level, this removes more rows than necessary if you only care about completeness in one column. dropna(subset=['score']) is the targeted alternative â it removes only the rows missing a score, leaving rows with a missing level untouched.
After applying imputation, always verify that missing values have been correctly filled and that the column's statistical properties remain sensible. The score distribution should not change dramatically after median imputation â all 12 filled values are set to the median, which by definition does not shift the median.
# Complete the cleaning: apply both imputation strategies df_clean = pd.read_csv('../Datasets/8_4_competition_results.csv') # Fill score with median df_clean['score'] = df_clean['score'].fillna(df_clean['score'].median()) # Fill level with mode df_clean['level'] = df_clean['level'].fillna(df_clean['level'].mode()[0]) # Verify: no missing values remain print("Missing values after both imputation steps:") print(df_clean.isna().sum()) print(f"\nFinal shape: {df_clean.shape}") print(f"\nScore distribution after imputation:") print(df_clean['score'].describe().round(2))
Zero missing values across all columns. The score distribution has shifted slightly (mean from 80.34 to 80.55) because 12 rows that were previously excluded from the mean calculation are now included with a value of 82. The median remains at 82 â confirming the imputation did not distort the central tendency.
The choice between imputation and deletion â and between mean, median, and mode imputation â depends on three factors: the column's data type, the percentage of missing values, and whether the missingness appears random.
Not all missing value situations call for the same strategy. The proportion of missing values relative to the total dataset size is the primary driver of the decision: impute, delete rows, or delete the column entirely.
- Simple imputation (mean, median, mode) is appropriate
- Row deletion is also acceptable â minimal data loss
- W8 level column: 2.4% missing â mode imputation
- Imputation preferred to preserve data volume
- Investigate whether missingness is random or patterned
- W8 score column: 5.8% missing â median imputation
- Column-level deletion may be appropriate
- Imputation risks introducing significant bias
- Discuss with stakeholders before deciding
- The column is mostly absent â usually drop it
- Discuss with stakeholders whether data can be re-collected
- Keeping the column introduces more noise than signal
For the W8 dataset, both missing columns fall in the 0â20% range. The score column (5.8%) and level column (2.4%) are both candidates for imputation â deletion would be acceptable but wasteful given the small proportion of missing data.
# Demonstrate the severity guide applied to the W8 dataset df = pd.read_csv('../Datasets/8_4_competition_results.csv') missing_pct = df.isna().mean() * 100 print("Missing percentage per column:") for col, pct in missing_pct.items(): if pct > 0: severity = 'Low (impute)' if pct < 5 else 'Moderate (impute)' if pct < 20 else 'High (consider drop)' print(f" {col}: {pct:.1f}% â {severity}") else: print(f" {col}: 0.0% â No action needed")
The output confirms the imputation strategy for the W8 dataset: score gets median imputation (moderate missingness), level gets mode imputation (low missingness). No columns are in the high or critical range. The decision is straightforward for this dataset, though in real-world projects the decision often requires domain knowledge about why the values are missing.
Mean imputation is only appropriate for roughly symmetric distributions without strong outliers. For skewed distributions â common in financial data, competition scores, and population metrics â the mean is an unrepresentative summary. Median imputation is more robust for skewed data because it is not pulled toward the extreme values.
.dropna() with default parameters (how='any') removes any row that has at least one missing value in any column. In a 7-column dataset, a row missing only one value in an unimportant column will be removed entirely. Use the subset parameter to restrict deletion to columns that actually matter for the analysis.
Imputed values are estimates â not real observations. Any analysis that includes imputed values has more uncertainty than an analysis of a fully observed dataset, even if there are no more NaN values. The decision to impute should be documented, and the sensitivity of conclusions to the imputation strategy should be considered when presenting results.
- .fillna(value) replaces NaN values with a specified value â use the median for numeric columns (robust to skew) and the mode for categorical columns (.mode()[0] to extract the scalar).
- .dropna() removes rows with missing values â use subset= to target specific columns, how='all' for rows missing everything, and thresh=N for a minimum completeness requirement.
- The W8 dataset: 12 missing scores filled with the median (82.0); 5 missing level values filled with the mode ('intermediate').
- Imputation preserves data volume but introduces estimated values; deletion preserves accuracy but reduces data size â neither is free of cost.
- Always impute after removing duplicates, and always verify the missing value count is zero after imputation using df.isna().sum().
- For the W8 dataset, both missing columns are below the 20% threshold â imputation is appropriate and row deletion is not needed.
- â Dataset: 8_4_competition_results.csv
The W8 competition dataset used in the code examples above - â Notebook: 8_6_Missing_Values.ipynb
Follow-along notebook for this topic's imputation steps
- â Pandas Documentation: DataFrame.fillna()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html - â Pandas Documentation: DataFrame.dropna()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html - â scikit-learn: Imputation of Missing Values
https://scikit-learn.org/stable/modules/impute.html