Topic 8.5: Resolving Duplicates

🌉 The Postal Registry Merge

▼

A national postal service merges records from two regional offices. After the merge, many addresses appear twice — some because both offices served the same street, others because an export script ran twice during the migration. For each duplicated address, the registry must decide which copy to keep. In most cases the newer record is more accurate — it reflects recent building renames or postcode corrections. But for some records, the newer one introduced a typo, making the older one correct.

This is the practical reality of deduplication: it is not sufficient to 'just remove duplicates'. You must decide which copy to keep, whether to match on all columns or only the key identifier columns, and how to handle cases where duplicates are not entirely identical. Pandas provides exactly the parameters needed to make each of these decisions with precision.

ℹ️

Why Deduplication Comes First in Cleaning

In the cleaning sequence, duplicates are removed before missing values are handled. This order matters: if you impute a missing value in a column where duplicated rows exist, you may impute the same value twice — once for the original row and once for the duplicate. Removing duplicates first reduces the problem size and prevents imputation from being applied to rows that will subsequently be deleted.

🔍 Look First

▼

Before any row is deleted, the analyst should inspect the duplicate records. The keep=False option on .duplicated() marks every occurrence of any duplicated row — including the first — making it easy to compare each pair side-by-side.

Python

import pandas as pd

df = pd.read_csv('../Datasets/8_4_competition_results.csv')

# Detect duplicates — True means the row is a duplicate
df.duplicated()

▶ Output

0 False 1 False 2 False 3 False 4 False ... 202 False 203 False 204 False 205 False 206 False Length: 207, dtype: bool

The boolean mask alone is hard to read at this size. Filtering the DataFrame down to only the duplicated rows makes the actual content of the duplication visible.

Python

# Show only the duplicate rows
df[df.duplicated()]

▶ Output

name age city score grade level date 86 Noha Ibrahim 27.0 giza 91.2 A advanced 19-05-2024 116 Samira Hassan 22.0 CAIRO 84.0 A advanced 30-12-2023 157 Mido Gamal 27.0 Giza 81.0 A advanced 2024-04-14 174 Wedad Ali 23.0 Alex 56.7 C beginner 2023-11-26 187 Khaled Hassan 24.0 GIZA 83.9 A advanced 19/12/2023 198 Diaa Adel 22.0 Cairo 65.8 B intermediate 25-10-2023 199 Diaa Adel 31.0 Giza 100.0 A advanced 2024-04-21

All 7 duplicates are exact copies across all 7 columns — every field matches between the pair. This is consistent with a system export error (the export script ran twice), not with legitimate repeat entries. The correct action is to remove the second occurrence and keep the first.

ℹ️

Always Inspect Before Removing

Review the duplicate rows before calling drop_duplicates(). Confirm: (a) Are all fields truly identical, or only some? (b) Is the duplication from a system error or a legitimate repeat event? (c) Which copy should be kept — the first or the last? Answering these three questions before writing any removal code prevents accidental data loss.

🗑️ Simple Deduplication

▼

The .drop_duplicates() method removes duplicate rows. Its default behaviour — keep='first' — retains the first occurrence of each duplicated set and removes all subsequent copies. This is the most common use case when the original entry is assumed to be the correct one.

Python

# Remove duplicates — keep the FIRST occurrence (default behavior)
df_clean = df.drop_duplicates(keep='first')

print(f"Rows before: {len(df)}")
print(f"Rows after:  {len(df_clean)}")

▶ Output

Rows before: 207 Rows after: 200

The DataFrame now has 200 rows — exactly as expected (207 original rows minus 7 duplicate copies). All 7 duplicate rows were removed, and the verification confirms zero duplicates remain. The original df is unchanged; the cleaned result is stored in df_clean.

Default behaviour (keep='first')

Keeps the first occurrence of each duplicated group and removes all subsequent copies. The safest default — assumes the earliest entry is the most reliable.

Return value

drop_duplicates() returns a new DataFrame. The original df is NOT modified. You must assign the result: df = df.drop_duplicates() or df_clean = df.drop_duplicates().

Verification step

After removal, always check: df_clean.duplicated().sum() == 0 and len(df_clean) == expected_rows. Both must pass before proceeding to the next cleaning step.

🎯 Which Copy Stays?

▼

The keep parameter gives precise control over which copy of each duplicated group is retained. The three options are 'first', 'last', and False. Each serves a different scenario:

Python

# Keep the LAST occurrence instead
df_clean_last = df.drop_duplicates(keep='last')

print(f"Rows after keep=last: {len(df_clean_last)}")

# Remove ALL copies — keep none of the duplicated rows
df_no_dupes = df.drop_duplicates(keep=False)

print(f"Rows remaining: {len(df_no_dupes)}")

▶ Output

Rows after keep=last: 200 Rows remaining: 193

With keep='first' or keep='last': 200 rows remain (7 duplicates removed, first or last copy kept). With keep=False: 193 rows remain — all 14 rows involved in duplication (7 originals + 7 copies) are deleted. This leaves only rows that were never duplicated at all. Use keep=False when you want to study only the rows that have no duplicates whatsoever — not the typical deduplication use case.

keep= value	Which row is kept	Rows in result	When to Use
'first' (default)	First occurrence of each duplicated group	207 − 7 = 200	When the earliest record is the most reliable.
'last'	Last occurrence of each duplicated group	207 − 7 = 200	When the most recent record is more accurate (e.g., after an update corrected an error).
False	No copy is kept — all occurrences of any duplicated row are removed	207 − 14 = 193	When you want only rows that are completely unique — diagnostic use.

⚠️

keep=False Does Not 'Remove One Copy and Keep One'

A common misconception: keep=False removes every copy of any duplicated row — including the first occurrence. If a row appears three times, all three are removed. This is useful for isolating rows that have zero duplication. It is NOT the right choice if you want to keep one representative row for each duplicated group — use keep='first' or keep='last' for that.

🎯 Partial Duplicates

▼

The subset parameter restricts duplicate detection to a specific set of columns rather than all columns. This is essential when records that should be considered duplicates are not perfectly identical across every field — for example, two rows with the same student name, age, and city but a slightly different timestamp.

In the W8 dataset, a useful subset check is whether any student's core identifying information — name and city — appears twice, regardless of whether other fields match exactly:

Python

# Detect duplicates based on specific columns only
df.duplicated(subset=['name', 'city'])

▶ Output

0 False 1 False 2 False 3 False 4 False ... 202 False 203 True 204 False 205 False 206 False Length: 207, dtype: bool

This flags more rows than the full-row check: 203 students share both a name and a city with another row, even though their score, grade, level, or date may differ. Removing these rows with drop_duplicates on the same subset narrows the dataset down to one row per unique name+city combination:

Python

# Remove duplicates based on student name AND school — keep first
df_clean = df.drop_duplicates(
    subset=['name', 'city'],
    keep='first'
)

print(f"Rows before: {len(df)}")
print(f"Rows after:  {len(df_clean)}")

▶ Output

Rows before: 207 Rows after: 184

Note that subset deduplication on name+city removes 23 rows — far more than the 7 full-row duplicates. This is expected: many distinct students happen to share a name and city by coincidence, even though their score, grade, level, and date differ. This is exactly why the subset parameter must be chosen deliberately, matched to what truly constitutes a "duplicate" for the data at hand, rather than applied automatically.

⚠️

Warning: Partial vs. Full Duplicates

Full duplicates (identical in all columns) are straightforward to detect and remove. Partial duplicates (identical in key columns but different in others) require a deliberate subset= specification. Using the default (all columns) on partial duplicates will silently fail to detect them — the rows will remain in the dataset. Always consider whether your data may have partial duplicates before choosing a deduplication strategy.

📊 Full Dedup Workflow

▼

A professional deduplication workflow combines all the methods seen in this topic into a single, verifiable pipeline. The structure: measure before, remove, measure after, verify. This three-step verification ensures no unexpected rows were removed and no duplicates remain.

Python

import pandas as pd

# Full deduplication workflow with verification
df = pd.read_csv('../Datasets/8_4_competition_results.csv')

# Step 1: Measure before
print(f"=== BEFORE DEDUPLICATION ===")
print(f"Total rows:          {len(df)}")
print(f"Duplicate rows:      {df.duplicated().sum()}")
print(f"Unique rows:         {len(df) - df.duplicated().sum()}")

# Step 2: Remove duplicates (keep first occurrence)
df_clean = df.drop_duplicates(keep='first')

# Step 3: Measure after and verify
print(f"\n=== AFTER DEDUPLICATION ===")
print(f"Total rows:          {len(df_clean)}")
print(f"Duplicate rows:      {df_clean.duplicated().sum()}")
print(f"Rows removed:        {len(df) - len(df_clean)}")
print(f"\nVerification: {len(df) - len(df_clean)} duplicates removed — expected 7")

▶ Output

=== BEFORE DEDUPLICATION === Total rows: 207 Duplicate rows: 7 Unique rows: 200 === AFTER DEDUPLICATION === Total rows: 200 Duplicate rows: 0 Rows removed: 7 Verification: 7 duplicates removed — expected 7

The verification confirms: 7 rows removed, 200 rows remain, 0 duplicates in the result. All three checks pass. The dataset is ready to proceed to missing value handling (Topic 8.6).

Verification Check	Expected Value	Why It Matters
Rows removed = df.duplicated().sum() before	7	Confirms exactly the right number of rows were removed — not more, not fewer.
df_clean.duplicated().sum() == 0	0	Confirms no duplicates remain in the result. Should always be zero after drop_duplicates.
len(df_clean) == len(df) - duplicates	200	Confirms the final row count matches the expected post-deduplication size.

ℹ️

Save the Cleaned DataFrame Before Proceeding

After verification, assign the deduplicated result back to df (or a new variable like df_clean) before moving to the next cleaning step. If you forget to reassign, all subsequent code will operate on the original 207-row DataFrame — not the cleaned 200-row version. A common pattern is df = df.drop_duplicates() to update the working DataFrame in place.

⚠️ Dedup Myths

▼

⚠️

Misconception: All duplicates are data errors

Some duplicates are intentional and meaningful. In a competition results table, the same student could legitimately compete in two different events on the same day — their rows would look like duplicates but represent distinct real events. Always inspect duplicate rows before removing them — understand whether the duplication reflects an error or a genuine repeat observation.

⚠️

Misconception: .drop_duplicates() modifies the original DataFrame

By default, drop_duplicates() returns a new DataFrame without modifying the original. The original df still contains all 207 rows. To persist the change, either reassign the result (df = df.drop_duplicates()) or use inplace=True. Always verify the result's row count after deduplication.

⚠️

Misconception: Deduplication is fast and risk-free

Deduplication permanently removes rows. If the wrong rows are removed — because keep= was set incorrectly or because the subset was too narrow — the analysis will be based on the wrong copies. For production datasets, save a backup before running deduplication, and verify the result row count against expectations before proceeding to the next cleaning step.

✅ Find the Copy

▼

?

The W8 dataset has 207 rows and 7 duplicate rows. After running df.drop_duplicates(keep='first'), how many rows does the resulting DataFrame have?

?

A student dataset has the columns: name, age, city, score, grade, level, date. Two rows have identical name, age, city, and score values, but the 'date' column differs by one day due to a logging inconsistency. Which approach correctly identifies and removes this duplication?

.duplicated().sum() counts exact duplicate rows; using keep=False shows all rows involved in any duplication — useful for inspection before removal.
.drop_duplicates() removes duplicate rows. The keep parameter controls which copy is retained: 'first' keeps the earliest, 'last' keeps the most recent, False removes all copies.
The W8 dataset has 7 duplicate rows — removing them reduces the shape from (207, 7) to (200, 7).
The subset parameter limits duplicate detection to specified columns — essential when records are the same entity but differ in non-key fields like timestamps or logging metadata.
Deduplication must precede missing value imputation — imputing values before removing duplicates applies the same imputed value to rows that will later be deleted.
Always verify the result row count after deduplication and re-run .duplicated().sum() on the cleaned DataFrame to confirm zero duplicates remain.

📂Dataset & Notebook

▼

↗
Dataset: 8_4_competition_results.csv
The W8 competition dataset used in the code examples above
↗
Notebook: 8_5_Resolving_Duplicates.ipynb
Follow-along notebook for this topic's deduplication steps

📚External Resources

▼

↗
Pandas Documentation: DataFrame.drop_duplicates()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
↗
Pandas Documentation: DataFrame.duplicated()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
↗
Real Python: Pandas DataFrame Manipulation
https://realpython.com/pandas-dataframe/