Topic 8.3: Common Quality Issues

🌉 The River Station

▼

An environmental agency installs water-quality sensors along a river at five locations. Over one year, scientists find four recurring problems: sensors occasionally went offline, leaving gaps in the record; a data-export script ran twice on one day, duplicating all readings; one sensor reported temperatures of -999 (a hardware-error sentinel value); and different teams recorded flow in different units. These four problems — missing readings, duplicate records, impossible values, and format inconsistencies — appear in nearly every real-world dataset regardless of domain.

Before any cleaning can begin, each problem type must be detected and measured. This topic covers the five most common quality issues and shows exactly how to find each one in the W8 competition dataset using Pandas' detection methods.

ℹ️

Detection Is Not Correction

The detection methods in this topic are all read-only — they measure the problem without changing any data. The actual corrections (fillna, drop_duplicates, str.title) are covered in Topics 8.5–8.7. Running detection first gives you a complete picture before any cleaning decision is made.

❓ Empty Cells

▼

A missing value is a cell in a dataset that contains no value. In Pandas, missing values appear as NaN (Not a Number) or None. They arise when information was not collected, was lost during a system migration, or was intentionally left blank because it did not apply to a particular record.

The method .isna().sum() counts the missing values in each column. .isna() returns a DataFrame of True/False values (True = missing); chaining .sum() counts the True values per column.

Python

import pandas as pd

df = pd.read_csv('8_4_competition_results.csv')

# Count missing values per column
print("Missing value counts:")
print(df.isna().sum())

# Also show as percentage
print("\nMissing value percentages:")
print((df.isna().sum() / len(df) * 100).round(2))

▶ Output

Missing value counts: name 0 age 15 city 8 score 10 grade 0 level 0 date 0 dtype: int64 Missing value percentages: name 0.00 age 7.25 city 3.86 score 4.83 grade 0.00 level 0.00 date 0.00 dtype: float64

The dataset has 15 missing ages (7.25% of rows), 8 missing cities (3.86%), and 10 missing scores (4.83%). All are below the 20% threshold where column deletion becomes necessary, but they must be addressed before analysis or modelling. The correct strategy for each will be determined in Topic 8.6.

ℹ️

Not All Missing Values Are the Same

Statisticians classify missingness into three types. MCAR (Missing Completely At Random): no pattern — a random sensor failure. MAR (Missing At Random): missingness depends on other observed variables. MNAR (Missing Not At Random): missingness depends on the missing value itself — for example, high scorers may have skipped entering their score. The type of missingness affects which treatment strategy is most appropriate.

🔄 Repeated Rows

▼

A duplicate record is a row that represents the same real-world entity as another row — either exactly (identical values in every column) or partially (identical in the key identifying columns, with minor differences elsewhere). Duplicates arise from data-entry errors, system failures, or merging datasets from multiple sources.

The method .duplicated().sum() counts rows that are exact copies of a previous row. To inspect the actual duplicate records, filter the DataFrame using the Boolean Series returned by .duplicated().

Python

# Count exact duplicate rows
print(f"Total duplicate rows: {df.duplicated().sum()}")

# Show the duplicate rows for inspection
print("\nDuplicate rows:")
print(df[df.duplicated()])

▶ Output

Total duplicate rows: 7 Duplicate rows: name age city score grade level date 86 Noha Ibrahim 27.0 giza 91.2 A advanced 19-05-2024 116 Samira Hassan 22.0 CAIRO 84.0 A advanced 30-12-2023 157 Mido Gamal 27.0 Giza 81.0 A advanced 2024-04-14 174 Wedad Ali 23.0 Alex 56.7 C beginner 2023-11-26 187 Khaled Hassan 24.0 GIZA 83.9 A advanced 19/12/2023 198 Diaa Adel 22.0 Cairo 65.8 B intermediate 25-10-2023 199 Diaa Adel 31.0 Giza 100.0 A advanced 2024-04-21

There are 7 duplicate rows. df.duplicated() flags each row that is an exact copy of an earlier row, so filtering with it shows the 7 later copies — each has an identical twin earlier in the dataset. Duplicates inflate every count-based aggregate: the mean score includes each duplicated student's value twice, and the student count is overstated by 7. These must be removed before any summary statistics are computed.

Full duplicate

Every column is identical between the two rows — the same student, same score, same date. Caused by a system export running twice. Detected with .duplicated() using all columns (default).

Partial duplicate

Key columns (name, age, city) match but non-key columns (date, score) differ slightly. Caused by logging errors or concurrent data entry. Detecting and removing these is covered in Topic 8.5.

W8 duplicate type

All 7 duplicate pairs in the W8 dataset are full duplicates — identical across all 7 columns. Safe to remove with drop_duplicates(keep='first').

🔤 Numbers as Text

▼

A type mismatch occurs when a column's stored data type does not match the logical type of the data it should contain. The most common case in real datasets: a column that should be datetime64 is stored as object (string) because dates were entered in inconsistent formats that Pandas could not parse automatically.

The method .dtypes shows the stored type of each column. The method .info() shows types alongside null counts, making it a combined structural and completeness snapshot.

Python

# Check data types of all columns
print("Column data types:")
print(df.dtypes)

print("\nDataFrame info (types + null counts):")
df.info()

▶ Output

Column data types: name str age float64 city str score float64 grade str level str date str dtype: object <class 'pandas.core.frame.DataFrame'> RangeIndex: 207 entries, 0 to 206 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 207 non-null str 1 age 192 non-null float64 2 city 199 non-null str 3 score 197 non-null float64 4 grade 207 non-null str 5 level 207 non-null str 6 date 207 non-null str dtypes: float64(2), str(5) memory usage: 11.4 KB

The date column is stored as str (text) even though it should be datetime64. This prevents any time-based operation: you cannot sort by date, extract the month, or compute date differences until the column is converted. The cause is the several mixed date formats in the dataset — Pandas could not parse them automatically and stored them as text instead.

date (str — actual)

Stored as plain text string. No sorting, no .dt.month, no date arithmetic possible. Values: '07/02/2024', '09/13/2023', '2024-04-02', '21-06-2024'.

date (datetime64 — expected)

Stored as a proper timestamp. Enables .min(), .max(), .dt.year, .dt.month, .dt.day_name(), and chronological sorting.

age (float64 — signals NaN)

Ages are whole numbers, so this column should logically be int64, but Pandas uses float64 when at least one NaN exists in the column. The dtype itself reveals hidden missing values. (score is also float64, but that is expected because scores are decimals such as 97.5.)

🏙️ Seven City Spellings

▼

Inconsistent formatting means that values referring to the same real-world entity are written in different ways within the same column. Unlike missing values (which show as NaN) or duplicates (which can be counted), formatting inconsistencies look like valid data. The computer never raises an error — the problem only appears when analysis produces nonsensical results.

The method .value_counts() on a text column reveals formatting inconsistencies immediately: instead of seeing each city once, you see each variant separately. The count is split across spellings that should be one category.

Python

# Detect formatting inconsistencies in city column
print("City value counts (revealing inconsistencies):")
print(df['city'].value_counts())

print(f"\nUnique city spellings found: {df['city'].nunique()}")

▶ Output

City value counts (revealing inconsistencies): city Cairo 61 cairo 24 CAIRO 18 Giza 38 giza 19 GIZA 11 Alex 28 Name: count, dtype: int64 Unique city spellings found: 7

There are only 3 actual cities (Cairo, Giza, Alex) but 7 different spellings. Any groupby('city') produces 7 groups instead of 3, splitting Cairo's 103 records into three partial groups (Cairo=61, cairo=24, CAIRO=18). The total numbers still add up correctly, but they are spread across false categories.

⚠️

Formatting Errors Are Invisible to the Computer

'Cairo' and 'cairo' are different strings to Python. No error is raised, no warning is shown. A model trained on this column would treat them as entirely different locations. The only reliable way to detect formatting inconsistencies is systematic inspection using value_counts() and unique() — not waiting for an error.

📡 Extreme Values

▼

An outlier is a value that falls far outside the range of the majority of values in a column. The word 'outlier' is neutral — it describes a position in the distribution, not a judgment about correctness. Some outliers are genuine (a genuinely exceptional score), others are errors (a score of 999 due to a data-entry mistake), and others are sentinel values (-999 used to signal a sensor error).

The method .describe() on the numeric columns produces the summary statistics that signal outlier presence: compare the mean to the 50% (median), and examine the min and max values.

Python

# Detect outliers using describe()
print("Statistical summary of numeric columns:")
print(df[['age', 'score']].describe().round(2))

▶ Output

Statistical summary of numeric columns: age score count 192.00 197.00 mean 24.21 76.96 std 3.95 12.71 min 18.00 40.00 25% 21.00 67.60 50% 24.00 77.40 75% 27.00 85.70 max 35.00 100.00

For score: the minimum is 40 and the maximum is 100, both within the valid range of 0–100. The mean (76.96) is close to the median (77.40), suggesting no extreme outliers are distorting the distribution. For age: the range is 18–35, reasonable for a competition dataset. The count of 192 for age and 197 for score confirms the missing values seen earlier. No sentinel values or impossible values are detected in the numeric columns of this dataset.

Outlier Type	Description	First Response
Genuine extreme value	A real measurement that is exceptionally large or small — the top-scoring student.	Investigate and keep — it represents a real phenomenon.
Data-entry error	A value that cannot be correct: score of 850 on a 100-point exam.	Correct if the true value is known; replace with NaN if not.
Sentinel value	A numeric code used to indicate a system state: -999, 9999, or 0 where zero is impossible.	Replace with NaN, then treat as a missing value.

⚠️ Duplicate Traps

▼

⚠️

Misconception: Outliers should always be removed

Removing outliers without investigation destroys real data. A genuinely exceptional score — the top student in the competition — is the most important data point in an analysis of high performers. The correct approach is to investigate every outlier: determine whether it is an error, a sentinel value, or a genuine observation. Only errors and sentinel values should be removed.

⚠️

Misconception: A small percentage of missing values is always harmless

Whether a small percentage of missing values matters depends entirely on where they are missing. If 5% of rows are missing the target variable (the value being predicted), that may represent a significant portion of validation data. The quantity of missing values is less important than the pattern and location of missingness.

⚠️

Misconception: Duplicate rows are always exactly identical

Exact duplicates — rows where every column has the same value — are easy to detect. Near-duplicates are far more common and harder to find. A student record where name, age, city, and score match but the date differs slightly may represent the same student submitted twice with a logging error. Topic 8.5 covers how to detect and remove this kind of partial duplicate.

✅ Spot the Issue

▼

?

After running df['city'].value_counts() on the W8 dataset, you see 7 different city values when there should only be 3 (Cairo, Giza, Alex). What type of issue is this, and which method is used to detect it?

?

Running df.dtypes shows that the 'date' column has dtype 'object' even though it should be datetime. What does this most likely indicate?

Missing values are detected with isna().sum() — the W8 dataset has 15 missing ages (7.25%), 8 missing cities (3.86%), and 10 missing scores (4.83%).
Duplicate records are detected with duplicated().sum() — the W8 dataset has 7 exact duplicate rows that inflate all count-based aggregates.
Type mismatches are detected with .dtypes and .info() — the W8 date column is stored as str instead of datetime64 due to several mixed date formats.
Formatting inconsistencies are detected with value_counts() and nunique() — the W8 city column has 7 different spellings for 3 actual cities.
Outliers are detected with .describe() — compare min/max to the valid range and compare mean to median to detect distribution distortion.
All five detection methods are read-only: they measure problems without changing data. Correction begins in Topics 8.5–8.7.

📂Dataset & Notebook

▼

↗
Dataset: 8_4_competition_results.csv
The W8 competition dataset shown in the code examples above

📚External Resources

▼

↗
Pandas Documentation: DataFrame.isna()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html
↗
Pandas Documentation: DataFrame.duplicated()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
↗
Pandas Documentation: DataFrame.describe()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
↗
Towards Data Science: Handling Missing Data
https://towardsdatascience.com/handling-missing-data-f998715fb73f/