Topic 8.5: Resolving Duplicates
Removing duplicate rows with drop_duplicates() â controlling which copy to keep and targeting key column subsets
A national postal service merges records from two regional offices. After the merge, many addresses appear twice â some because both offices served the same street, others because an export script ran twice during the migration. For each duplicated address, the registry must decide which copy to keep. In most cases the newer record is more accurate â it reflects recent building renames or postcode corrections. But for some records, the newer one introduced a typo, making the older one correct.
This is the practical reality of deduplication: it is not sufficient to 'just remove duplicates'. You must decide which copy to keep, whether to match on all columns or only the key identifier columns, and how to handle cases where duplicates are not entirely identical. Pandas provides exactly the parameters needed to make each of these decisions with precision.
In the cleaning sequence, duplicates are removed before missing values are handled. This order matters: if you impute a missing value in a column where duplicated rows exist, you may impute the same value twice â once for the original row and once for the duplicate. Removing duplicates first reduces the problem size and prevents imputation from being applied to rows that will subsequently be deleted.
Before any row is deleted, the analyst should inspect the duplicate records. The keep=False option on .duplicated() marks every occurrence of any duplicated row â including the first â making it easy to compare each pair side-by-side.
import pandas as pd df = pd.read_csv('../Datasets/8_4_competition_results.csv') # Detect duplicates â True means the row is a duplicate df.duplicated()
The boolean mask alone is hard to read at this size. Filtering the DataFrame down to only the duplicated rows makes the actual content of the duplication visible.
# Show only the duplicate rows df[df.duplicated()]
All 7 duplicates are exact copies across all 7 columns â every field matches between the pair. This is consistent with a system export error (the export script ran twice), not with legitimate repeat entries. The correct action is to remove the second occurrence and keep the first.
Review the duplicate rows before calling drop_duplicates(). Confirm: (a) Are all fields truly identical, or only some? (b) Is the duplication from a system error or a legitimate repeat event? (c) Which copy should be kept â the first or the last? Answering these three questions before writing any removal code prevents accidental data loss.
The .drop_duplicates() method removes duplicate rows. Its default behaviour â keep='first' â retains the first occurrence of each duplicated set and removes all subsequent copies. This is the most common use case when the original entry is assumed to be the correct one.
# Remove duplicates â keep the FIRST occurrence (default behavior) df_clean = df.drop_duplicates(keep='first') print(f"Rows before: {len(df)}") print(f"Rows after: {len(df_clean)}")
The DataFrame now has 200 rows â exactly as expected (207 original rows minus 7 duplicate copies). All 7 duplicate rows were removed, and the verification confirms zero duplicates remain. The original df is unchanged; the cleaned result is stored in df_clean.
The keep parameter gives precise control over which copy of each duplicated group is retained. The three options are 'first', 'last', and False. Each serves a different scenario:
# Keep the LAST occurrence instead df_clean_last = df.drop_duplicates(keep='last') print(f"Rows after keep=last: {len(df_clean_last)}") # Remove ALL copies â keep none of the duplicated rows df_no_dupes = df.drop_duplicates(keep=False) print(f"Rows remaining: {len(df_no_dupes)}")
With keep='first' or keep='last': 200 rows remain (7 duplicates removed, first or last copy kept). With keep=False: 193 rows remain â all 14 rows involved in duplication (7 originals + 7 copies) are deleted. This leaves only rows that were never duplicated at all. Use keep=False when you want to study only the rows that have no duplicates whatsoever â not the typical deduplication use case.
| keep= value | Which row is kept | Rows in result | When to Use |
|---|---|---|---|
| 'first' (default) | First occurrence of each duplicated group | 207 â 7 = 200 | When the earliest record is the most reliable. |
| 'last' | Last occurrence of each duplicated group | 207 â 7 = 200 | When the most recent record is more accurate (e.g., after an update corrected an error). |
| False | No copy is kept â all occurrences of any duplicated row are removed | 207 â 14 = 193 | When you want only rows that are completely unique â diagnostic use. |
A common misconception: keep=False removes every copy of any duplicated row â including the first occurrence. If a row appears three times, all three are removed. This is useful for isolating rows that have zero duplication. It is NOT the right choice if you want to keep one representative row for each duplicated group â use keep='first' or keep='last' for that.
The subset parameter restricts duplicate detection to a specific set of columns rather than all columns. This is essential when records that should be considered duplicates are not perfectly identical across every field â for example, two rows with the same student name, age, and city but a slightly different timestamp.
In the W8 dataset, a useful subset check is whether any student's core identifying information â name and city â appears twice, regardless of whether other fields match exactly:
# Detect duplicates based on specific columns only df.duplicated(subset=['name', 'city'])
This flags more rows than the full-row check: 203 students share both a name and a city with another row, even though their score, grade, level, or date may differ. Removing these rows with drop_duplicates on the same subset narrows the dataset down to one row per unique name+city combination:
# Remove duplicates based on student name AND school â keep first df_clean = df.drop_duplicates( subset=['name', 'city'], keep='first' ) print(f"Rows before: {len(df)}") print(f"Rows after: {len(df_clean)}")
Note that subset deduplication on name+city removes 23 rows â far more than the 7 full-row duplicates. This is expected: many distinct students happen to share a name and city by coincidence, even though their score, grade, level, and date differ. This is exactly why the subset parameter must be chosen deliberately, matched to what truly constitutes a "duplicate" for the data at hand, rather than applied automatically.
Full duplicates (identical in all columns) are straightforward to detect and remove. Partial duplicates (identical in key columns but different in others) require a deliberate subset= specification. Using the default (all columns) on partial duplicates will silently fail to detect them â the rows will remain in the dataset. Always consider whether your data may have partial duplicates before choosing a deduplication strategy.
A professional deduplication workflow combines all the methods seen in this topic into a single, verifiable pipeline. The structure: measure before, remove, measure after, verify. This three-step verification ensures no unexpected rows were removed and no duplicates remain.
import pandas as pd # Full deduplication workflow with verification df = pd.read_csv('../Datasets/8_4_competition_results.csv') # Step 1: Measure before print(f"=== BEFORE DEDUPLICATION ===") print(f"Total rows: {len(df)}") print(f"Duplicate rows: {df.duplicated().sum()}") print(f"Unique rows: {len(df) - df.duplicated().sum()}") # Step 2: Remove duplicates (keep first occurrence) df_clean = df.drop_duplicates(keep='first') # Step 3: Measure after and verify print(f"\n=== AFTER DEDUPLICATION ===") print(f"Total rows: {len(df_clean)}") print(f"Duplicate rows: {df_clean.duplicated().sum()}") print(f"Rows removed: {len(df) - len(df_clean)}") print(f"\nVerification: {len(df) - len(df_clean)} duplicates removed â expected 7")
The verification confirms: 7 rows removed, 200 rows remain, 0 duplicates in the result. All three checks pass. The dataset is ready to proceed to missing value handling (Topic 8.6).
| Verification Check | Expected Value | Why It Matters |
|---|---|---|
| Rows removed = df.duplicated().sum() before | 7 | Confirms exactly the right number of rows were removed â not more, not fewer. |
| df_clean.duplicated().sum() == 0 | 0 | Confirms no duplicates remain in the result. Should always be zero after drop_duplicates. |
| len(df_clean) == len(df) - duplicates | 200 | Confirms the final row count matches the expected post-deduplication size. |
After verification, assign the deduplicated result back to df (or a new variable like df_clean) before moving to the next cleaning step. If you forget to reassign, all subsequent code will operate on the original 207-row DataFrame â not the cleaned 200-row version. A common pattern is df = df.drop_duplicates() to update the working DataFrame in place.
Some duplicates are intentional and meaningful. In a competition results table, the same student could legitimately compete in two different events on the same day â their rows would look like duplicates but represent distinct real events. Always inspect duplicate rows before removing them â understand whether the duplication reflects an error or a genuine repeat observation.
By default, drop_duplicates() returns a new DataFrame without modifying the original. The original df still contains all 207 rows. To persist the change, either reassign the result (df = df.drop_duplicates()) or use inplace=True. Always verify the result's row count after deduplication.
Deduplication permanently removes rows. If the wrong rows are removed â because keep= was set incorrectly or because the subset was too narrow â the analysis will be based on the wrong copies. For production datasets, save a backup before running deduplication, and verify the result row count against expectations before proceeding to the next cleaning step.
- .duplicated().sum() counts exact duplicate rows; using keep=False shows all rows involved in any duplication â useful for inspection before removal.
- .drop_duplicates() removes duplicate rows. The keep parameter controls which copy is retained: 'first' keeps the earliest, 'last' keeps the most recent, False removes all copies.
- The W8 dataset has 7 duplicate rows â removing them reduces the shape from (207, 7) to (200, 7).
- The subset parameter limits duplicate detection to specified columns â essential when records are the same entity but differ in non-key fields like timestamps or logging metadata.
- Deduplication must precede missing value imputation â imputing values before removing duplicates applies the same imputed value to rows that will later be deleted.
- Always verify the result row count after deduplication and re-run .duplicated().sum() on the cleaned DataFrame to confirm zero duplicates remain.
- â Dataset: 8_4_competition_results.csv
The W8 competition dataset used in the code examples above - â Notebook: 8_5_Resolving_Duplicates.ipynb
Follow-along notebook for this topic's deduplication steps
- â Pandas Documentation: DataFrame.drop_duplicates()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html - â Pandas Documentation: DataFrame.duplicated()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html - â Real Python: Pandas DataFrame Manipulation
https://realpython.com/pandas-dataframe/