Topic 8.7: Format Standardization

🌉 The Bus Stop Merge

▼

A transport authority merges bus-stop records from three regional systems. The same stops appear under different names: 'Tahrir Sq', 'tahrir sq.', 'TAHRIR SQUARE'. When the authority generates a route map, each spelling becomes a separate stop — the same location appears three times. Route calculations give passengers conflicting information, and journey planners show phantom stops that do not exist.

This is the invisible failure mode of formatting inconsistencies. The data is structurally correct — one column, one variable, one value per cell. But the content is semantically broken: the same real-world entity is written in multiple ways. The W8 competition dataset has exactly this problem in the city column (7 spellings for 3 cities) and in the date column (4 different date format conventions).

ℹ️

Why Formatting Comes After Duplicates and Missing Values

Format standardisation is intentionally placed after deduplication and missing value handling. Encoding requires that all text values have been standardised first — you cannot reliably assign a consistent numeric code to 'Cairo' and 'cairo' if they are still treated as different categories. Standardisation is the bridge between raw text and usable categorical data.

🏙️ The City Problem

▼

The first step in any string standardisation is to see exactly what variants exist. The W8 dataset has 7 different city spellings for only 3 actual cities. The table below maps every variant to its canonical (correct) form.

Python

import pandas as pd

df = pd.read_csv('../Datasets/8_4_competition_results.csv')

# See how many unique 'cities' exist due to formatting
print(df['city'].value_counts())

▶ Output

city Cairo 68 cairo 19 Alexandria 32 Alex 14 Giza 28 giza 12 CAIRO 8 Name: city, dtype: int64

cairo, CAIRO, Cairo

→ Cairo (target canonical form)

Alex, Alexandria

→ Alexandria (target canonical form)

giza, Giza

→ Giza (target canonical form)

The problem has two components: case variation ('cairo' vs 'Cairo' vs 'CAIRO') and abbreviation ('Alex' vs 'Alexandria'). Case variation can be solved with string methods alone. Abbreviations require an explicit mapping — there is no string function that knows 'Alex' means 'Alexandria'.

🔡 Fix the Casing

▼

The two most important string cleaning operations for any text column are case normalisation and whitespace removal. The notebook applies .str.lower() first to convert every value to lowercase, then chains .str.strip() to remove leading and trailing whitespace.

Python

# Standardize all city names to lowercase
df['city'] = df['city'].str.lower()

print(df['city'].value_counts())

▶ Output

city cairo 95 alexandria 32 alex 14 giza 40 Name: city, dtype: int64

From 7 spellings to 4 in one line. 'cairo', 'CAIRO', and 'Cairo' have all merged into 'cairo' (95 records). 'giza' and 'Giza' have merged into 'giza' (40 records). The remaining problem is 'alex' vs 'alexandria' — a case-only fix cannot resolve an abbreviation. Chaining .str.strip() onto the same line additionally removes any leading/trailing whitespace left over from manual data entry:

Python

# Chain lower() and strip() together
df['city'] = df['city'].str.lower().str.strip()

print(df['city'].value_counts())

Chaining string methods is a common and idiomatic pandas pattern: each .str. call returns a new Series, so the next .str. call can be applied directly on the result. The order does not matter for lower() and strip() specifically, since they act on different aspects of the string (case vs. surrounding whitespace).

🗺️ Fix the Abbreviation

▼

Abbreviation variants require an explicit, whole-value replacement — there is no string function that knows 'alex' means 'alexandria'. The notebook uses .replace() on the Series (not .str.replace()), which matters: .str.replace() matches substrings, so replacing 'alex' would also corrupt 'alexandria' into 'alexandriaandria'. Plain .replace() only matches whole values.

Python

# Replace a specific wrong spelling with the correct one
# IMPORTANT: use .replace() (whole-value match), NOT .str.replace() which matches substrings
# .str.replace('alex','alexandria') would turn 'alexandria' into 'alexandriaandria'
df['city'] = df['city'].replace('alex', 'alexandria')

# Verify the fix
print(df['city'].value_counts())

▶ Output

city cairo 95 alexandria 46 giza 40 Name: city, dtype: int64

The city column is now fully standardised: 3 cities, 3 spellings, correct counts. 'cairo' has 95 records, 'alexandria' has 46 (32 originally spelled out + 14 from the 'alex' abbreviation), and 'giza' has 40. Any groupby('city') operation will now produce exactly 3 groups, each containing all records for that city.

📅 Fix the Dates

▼

The W8 dataset's date column contains four different date format conventions in the same column. No single format string can parse all four — the solution is pd.to_datetime(format='mixed'), which inspects each value individually and applies the appropriate parsing rule.

Python

# Inspect date variants before conversion
print("Date value sample (showing all format variants):")
print(df['date'].value_counts().head(10))
print(f"\nCurrent dtype: {df['date'].dtype}")

▶ Output

Date value sample (showing all format variants): date 2024-01-15 43 16/01/2024 28 15-01-2024 22 01/16/2024 19 2024-01-16 18 17/01/2024 15 15/01/2024 14 01/17/2024 12 2024-01-17 11 16-01-2024 9 Name: date, dtype: object

Python

# Convert string dates to proper datetime format
# format='mixed' lets pandas infer per element when the column mixes formats
df['date'] = pd.to_datetime(df['date'], format='mixed')

print(df['date'].dtype)
print(df['date'].head())

▶ Output

datetime64[ns] 0 2024-01-15 1 2024-01-15 2 2024-01-15 3 2024-01-15 4 2024-01-15 Name: date, dtype: datetime64[ns]

The date column is now stored as datetime64[ns] instead of plain text. All four format variants have been parsed correctly and stored in a unified internal representation. Date-based operations like .min(), .max(), .dt.month, and time-series sorting all now work correctly — they would fail or behave unpredictably on the original string column.

With the column now a true datetime type, individual date components can be pulled out as their own columns using the .dt accessor:

Python

# Extract year, month, day as separate columns
df['year']  = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day']   = df['date'].dt.day

print(df[['date', 'year', 'month', 'day']].head())

ℹ️

format='mixed' vs. Specifying a Format String

When all dates in a column follow one consistent format, specify it explicitly: pd.to_datetime(df['date'], format='%d/%m/%Y'). This is faster and prevents misinterpretation. Use format='mixed' only when the column genuinely contains multiple format conventions — as in the W8 dataset. With format='mixed', Pandas inspects each value independently and chooses the best interpretation, which is slower but necessary for multi-format columns.

✅ Verify the Cleanup

▼

After all format standardisation steps, a final verification confirms that the column types and value counts match expectations before moving to encoding (Topic 8.8).

Python

# Full post-standardisation verification
print("=== Column types after standardisation ===")
print(df.dtypes)

print("\n=== City distribution (should be 3 cities) ===")
print(df['city'].value_counts())

print("\n=== Date range ===")
print(f"Min: {df['date'].min().date()}, Max: {df['date'].max().date()}")

print("\n=== Any remaining missing values? ===")
print(df.isna().sum())

▶ Output

=== Column types after standardisation === name object age int64 city object score float64 grade object level object date datetime64[ns] dtype: object === City distribution (should be 3 cities) === city cairo 95 alexandria 46 giza 40 Name: city, dtype: int64 === Date range === Min: 2024-01-15, Max: 2024-01-22 === Any remaining missing values? === name 0 age 0 city 0 score 0 grade 0 level 0 date 0 dtype: int64

All verifications pass. The date column is now datetime64[ns]. The city column now has exactly 3 categories: cairo (95 records), alexandria (46 records), and giza (40 records). Every value_counts() check returns 3 categories instead of 7, so any groupby('city') operation now produces clean, complete results. The dataset is ready for encoding in Topic 8.8.

📅 After Date Fix

▼

Converting the date column to datetime64 is not merely a type-correction exercise — it unlocks the full power of Pandas' time-based operations through the .dt accessor. Once dates are stored as datetime, you can extract components, sort chronologically, filter by date range, and compute time differences.

Python

import pandas as pd

# After conversion, demonstrate .dt accessor operations
df = pd.read_csv('../Datasets/8_4_competition_results.csv')
df['city'] = df['city'].str.lower().str.strip().replace('alex', 'alexandria')
df['date'] = pd.to_datetime(df['date'], format='mixed')
df['year']  = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day']   = df['date'].dt.day

print("Date-based operations after conversion:")
print(f"Month extracted: {df['date'].dt.month.unique()}")
print(f"Day names: {df['date'].dt.day_name().unique()}")
print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}")

# Sort by date — works correctly as datetime
print("\nEarliest 3 records by date:")
print(df.sort_values('date')[['name', 'city', 'score', 'date']].head(3).to_string())

▶ Output

Date-based operations after conversion: Month extracted: [1] Day names: ['Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday' 'Saturday' 'Sunday'] Date range: 2024-01-15 to 2024-01-22 Earliest 3 records by date: name city score date 0 Ahmed cairo 85.0 2024-01-15 1 Sara alexandria 92.0 2024-01-15 3 Nour giza 88.0 2024-01-15

The competition ran across 8 days (January 15–22, 2024). All dates fall within January, and the day-name extraction confirms the competition ran across all days of the week. None of this time-based analysis is possible while the column is stored as a text string — chronological sorting on text would sort alphabetically, not chronologically.

ℹ️

Sorting Text Dates vs. Datetime Dates

When the date column is stored as object (text), sorting it produces alphabetical order: '01/15/2024' would sort before '15-01-2024' because '0' comes before '1' in ASCII order — regardless of which date is earlier. Converting to datetime ensures that all date comparisons and sorts use actual chronological order, not alphabetical string order.

⚠️ Format Myths

▼

⚠️

Misconception: Formatting inconsistencies are minor cosmetic issues

Formatting inconsistencies cause real analytical failures. A groupby('city') on a column containing 'Cairo', 'cairo', and 'CAIRO' produces three separate groups, each with a fraction of the Cairo records. A merge between two tables on a date column where one side uses 'YYYY-MM-DD' and the other uses 'DD/MM/YYYY' silently matches no records. These are not cosmetic issues — they are analysis failures.

⚠️

Misconception: .str.lower() is always safer than .str.title()

Both .str.lower() and .str.title() normalise case — the choice depends on the desired canonical form. For city names, 'Cairo' (title case) is more readable and conventional than 'cairo' (lowercase). For machine identifiers or codes, lowercase may be preferred. Always match the canonical form to the domain convention, not just a default preference.

⚠️

Misconception: String methods modify the column in place

All Pandas string accessor methods (.str.strip(), .str.title(), .str.lower()) return a new Series — they do not modify the original column. The result must be explicitly assigned back: df['city'] = df['city'].str.lower().str.strip(). Without the assignment, the transformation is computed and immediately discarded.

✅ Fix the Format

▼

?

The W8 'city' column contains 'Cairo', 'cairo', 'CAIRO', 'Alex', 'Alexandria', 'Giza', and 'giza'. After applying df['city'] = df['city'].str.lower(), how many unique city values remain?

?

The W8 'date' column contains values in four formats: 'YYYY-MM-DD', 'DD/MM/YYYY', 'DD-MM-YYYY', and 'MM/DD/YYYY'. Which pd.to_datetime() call correctly converts all of them?

str.strip() removes leading and trailing whitespace from string values — always apply before any other string transformation to eliminate hidden spacing issues.
str.lower() converts text to lowercase — collapses 'cairo', 'CAIRO', and 'Cairo' into a single canonical 'cairo'. str.strip() removes leading and trailing whitespace; chaining .str.lower().str.strip() handles both in one line.
.replace('old', 'new') on a Series maps a specific whole value to a new one — used to resolve abbreviation variants like 'alex' → 'alexandria' that cannot be fixed by case normalisation alone. Use .replace(), not .str.replace(), to avoid matching substrings inside other values.
The W8 city column is standardised from 7 spellings to 3 canonical city names (cairo: 95, alexandria: 46, giza: 40) using two steps: str.lower().str.strip() followed by .replace('alex', 'alexandria').
pd.to_datetime(format='mixed') parses a column containing multiple date format conventions — each value is parsed individually using the best-matching rule.
After conversion, the date column becomes datetime64[ns], enabling sorting, date arithmetic, and .dt accessor operations like .dt.year, .dt.month, and .dt.day.

📂Dataset & Notebook

▼

↗
Dataset: 8_4_competition_results.csv
The W8 competition dataset used in the code examples above
↗
Notebook: 8_7_Inconsistent_Formatting.ipynb
Follow-along notebook for this topic's standardisation steps

📚External Resources

▼

↗
Pandas Documentation: Working with Text Data
https://pandas.pydata.org/docs/user_guide/text.html
↗
Pandas Documentation: pd.to_datetime()
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
↗
Pandas Documentation: Series.replace()
https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html