Topic 8.3: Common Quality Issues
Detecting missing values, duplicates, type mismatches, formatting inconsistencies, and outliers in the W8 competition dataset
An environmental agency installs water-quality sensors along a river at five locations. Over one year, scientists find four recurring problems: sensors occasionally went offline, leaving gaps in the record; a data-export script ran twice on one day, duplicating all readings; one sensor reported temperatures of -999 (a hardware-error sentinel value); and different teams recorded flow in different units. These four problems â missing readings, duplicate records, impossible values, and format inconsistencies â appear in nearly every real-world dataset regardless of domain.
Before any cleaning can begin, each problem type must be detected and measured. This topic covers the five most common quality issues and shows exactly how to find each one in the W8 competition dataset using Pandas' detection methods.
The detection methods in this topic are all read-only â they measure the problem without changing any data. The actual corrections (fillna, drop_duplicates, str.title) are covered in Topics 8.5â8.7. Running detection first gives you a complete picture before any cleaning decision is made.
A missing value is a cell in a dataset that contains no value. In Pandas, missing values appear as NaN (Not a Number) or None. They arise when information was not collected, was lost during a system migration, or was intentionally left blank because it did not apply to a particular record.
The method .isna().sum() counts the missing values in each column. .isna() returns a DataFrame of True/False values (True = missing); chaining .sum() counts the True values per column.
import pandas as pd df = pd.read_csv('8_4_competition_results.csv') # Count missing values per column print("Missing value counts:") print(df.isna().sum()) # Also show as percentage print("\nMissing value percentages:") print((df.isna().sum() / len(df) * 100).round(2))
The dataset has 15 missing ages (7.25% of rows), 8 missing cities (3.86%), and 10 missing scores (4.83%). All are below the 20% threshold where column deletion becomes necessary, but they must be addressed before analysis or modelling. The correct strategy for each will be determined in Topic 8.6.
Statisticians classify missingness into three types. MCAR (Missing Completely At Random): no pattern â a random sensor failure. MAR (Missing At Random): missingness depends on other observed variables. MNAR (Missing Not At Random): missingness depends on the missing value itself â for example, high scorers may have skipped entering their score. The type of missingness affects which treatment strategy is most appropriate.
A duplicate record is a row that represents the same real-world entity as another row â either exactly (identical values in every column) or partially (identical in the key identifying columns, with minor differences elsewhere). Duplicates arise from data-entry errors, system failures, or merging datasets from multiple sources.
The method .duplicated().sum() counts rows that are exact copies of a previous row. To inspect the actual duplicate records, filter the DataFrame using the Boolean Series returned by .duplicated().
# Count exact duplicate rows print(f"Total duplicate rows: {df.duplicated().sum()}") # Show the duplicate rows for inspection print("\nDuplicate rows:") print(df[df.duplicated()])
There are 7 duplicate rows. df.duplicated() flags each row that is an exact copy of an earlier row, so filtering with it shows the 7 later copies â each has an identical twin earlier in the dataset. Duplicates inflate every count-based aggregate: the mean score includes each duplicated student's value twice, and the student count is overstated by 7. These must be removed before any summary statistics are computed.
A type mismatch occurs when a column's stored data type does not match the logical type of the data it should contain. The most common case in real datasets: a column that should be datetime64 is stored as object (string) because dates were entered in inconsistent formats that Pandas could not parse automatically.
The method .dtypes shows the stored type of each column. The method .info() shows types alongside null counts, making it a combined structural and completeness snapshot.
# Check data types of all columns print("Column data types:") print(df.dtypes) print("\nDataFrame info (types + null counts):") df.info()
The date column is stored as str (text) even though it should be datetime64. This prevents any time-based operation: you cannot sort by date, extract the month, or compute date differences until the column is converted. The cause is the several mixed date formats in the dataset â Pandas could not parse them automatically and stored them as text instead.
Inconsistent formatting means that values referring to the same real-world entity are written in different ways within the same column. Unlike missing values (which show as NaN) or duplicates (which can be counted), formatting inconsistencies look like valid data. The computer never raises an error â the problem only appears when analysis produces nonsensical results.
The method .value_counts() on a text column reveals formatting inconsistencies immediately: instead of seeing each city once, you see each variant separately. The count is split across spellings that should be one category.
# Detect formatting inconsistencies in city column print("City value counts (revealing inconsistencies):") print(df['city'].value_counts()) print(f"\nUnique city spellings found: {df['city'].nunique()}")
There are only 3 actual cities (Cairo, Giza, Alex) but 7 different spellings. Any groupby('city') produces 7 groups instead of 3, splitting Cairo's 103 records into three partial groups (Cairo=61, cairo=24, CAIRO=18). The total numbers still add up correctly, but they are spread across false categories.
'Cairo' and 'cairo' are different strings to Python. No error is raised, no warning is shown. A model trained on this column would treat them as entirely different locations. The only reliable way to detect formatting inconsistencies is systematic inspection using value_counts() and unique() â not waiting for an error.
An outlier is a value that falls far outside the range of the majority of values in a column. The word 'outlier' is neutral â it describes a position in the distribution, not a judgment about correctness. Some outliers are genuine (a genuinely exceptional score), others are errors (a score of 999 due to a data-entry mistake), and others are sentinel values (-999 used to signal a sensor error).
The method .describe() on the numeric columns produces the summary statistics that signal outlier presence: compare the mean to the 50% (median), and examine the min and max values.
# Detect outliers using describe() print("Statistical summary of numeric columns:") print(df[['age', 'score']].describe().round(2))
For score: the minimum is 40 and the maximum is 100, both within the valid range of 0â100. The mean (76.96) is close to the median (77.40), suggesting no extreme outliers are distorting the distribution. For age: the range is 18â35, reasonable for a competition dataset. The count of 192 for age and 197 for score confirms the missing values seen earlier. No sentinel values or impossible values are detected in the numeric columns of this dataset.
| Outlier Type | Description | First Response |
|---|---|---|
| Genuine extreme value | A real measurement that is exceptionally large or small â the top-scoring student. | Investigate and keep â it represents a real phenomenon. |
| Data-entry error | A value that cannot be correct: score of 850 on a 100-point exam. | Correct if the true value is known; replace with NaN if not. |
| Sentinel value | A numeric code used to indicate a system state: -999, 9999, or 0 where zero is impossible. | Replace with NaN, then treat as a missing value. |
Removing outliers without investigation destroys real data. A genuinely exceptional score â the top student in the competition â is the most important data point in an analysis of high performers. The correct approach is to investigate every outlier: determine whether it is an error, a sentinel value, or a genuine observation. Only errors and sentinel values should be removed.
Whether a small percentage of missing values matters depends entirely on where they are missing. If 5% of rows are missing the target variable (the value being predicted), that may represent a significant portion of validation data. The quantity of missing values is less important than the pattern and location of missingness.
Exact duplicates â rows where every column has the same value â are easy to detect. Near-duplicates are far more common and harder to find. A student record where name, age, city, and score match but the date differs slightly may represent the same student submitted twice with a logging error. Topic 8.5 covers how to detect and remove this kind of partial duplicate.
- Missing values are detected with isna().sum() â the W8 dataset has 15 missing ages (7.25%), 8 missing cities (3.86%), and 10 missing scores (4.83%).
- Duplicate records are detected with duplicated().sum() â the W8 dataset has 7 exact duplicate rows that inflate all count-based aggregates.
- Type mismatches are detected with .dtypes and .info() â the W8 date column is stored as str instead of datetime64 due to several mixed date formats.
- Formatting inconsistencies are detected with value_counts() and nunique() â the W8 city column has 7 different spellings for 3 actual cities.
- Outliers are detected with .describe() â compare min/max to the valid range and compare mean to median to detect distribution distortion.
- All five detection methods are read-only: they measure problems without changing data. Correction begins in Topics 8.5â8.7.
- â Dataset: 8_4_competition_results.csv
The W8 competition dataset shown in the code examples above
- â Pandas Documentation: DataFrame.isna()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html - â Pandas Documentation: DataFrame.duplicated()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html - â Pandas Documentation: DataFrame.describe()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html - â Towards Data Science: Handling Missing Data
https://towardsdatascience.com/handling-missing-data-f998715fb73f/