Topic 8.8: Categorical Encoding

🌉 Theatre Seating Numbers

A theatre assigns seats on two dimensions: the row (A, B, C, D — with A being closest to the stage) and the section (Left, Centre, Right). For its ticketing database, both dimensions must be stored as numbers. For the row letters, a simple number assignment works well — A=1, B=2, C=3, D=4 — because there is a natural order: row A is genuinely closer to the stage than row D. The ordering is meaningful.

For the section (Left, Centre, Right), a simple number assignment is problematic. If Left=1, Centre=2, and Right=3, the database implicitly claims that Centre is the mathematical average of Left and Right, and that Right is three times Left. Neither claim is true — the sections are distinct locations with no numeric relationship. The correct approach is to create three separate binary columns: is_left, is_centre, is_right — one column per option.
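The theatre analogy can be sketched in pandas. The seat records below are invented for illustration, and the column names (row_number, is_left, is_centre, is_right) simply follow the names used in the analogy; this is a minimal sketch, not a real ticketing schema.

```python
import pandas as pd

# Hypothetical seat records, invented for illustration
seats = pd.DataFrame({
    "row": ["A", "C", "B", "D"],
    "section": ["Left", "Centre", "Right", "Centre"],
})

# Ordinal: row letters have a real order, so integers are meaningful
row_order = {"A": 1, "B": 2, "C": 3, "D": 4}
seats["row_number"] = seats["row"].map(row_order)

# Nominal: sections have no order, so each one gets its own binary column
section_flags = pd.get_dummies(seats["section"], prefix="is", dtype=int)
section_flags.columns = section_flags.columns.str.lower()

seats_encoded = pd.concat([seats[["row_number"]], section_flags], axis=1)
print(seats_encoded)
```

Each row of seats_encoded has exactly one section flag set to 1, while row_number keeps the front-to-back ordering.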

ℹ️

The Core Question of Every Encoding Decision

Every categorical encoding decision reduces to one question: does this category have a meaningful order that the model should be aware of? If yes → ordinal → use Label Encoding. If no → nominal → use One-Hot Encoding. The W8 dataset has both: 'level' (beginner < intermediate < advanced) is ordinal; 'city' (Cairo, Alexandria, Giza) is nominal.

🏷️ Ordinal to Numbers

The LabelEncoder from sklearn.preprocessing is the standard tool for this. It assigns each unique category a distinct integer. It is the correct choice when the categorical variable is ordinal — when the categories have a meaningful, rankable order. The W8 level column is a clear example: beginner < intermediate < advanced represents a genuine learning progression.

Python

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('../Datasets/8_4_competition_results.csv')

# Inspect the categorical column
print(df['level'].unique())
print(df['city'].unique())

With the categories identified, LabelEncoder can be fit directly on the level column to assign each unique value an integer:

Python

# Label Encoding — assign an integer to each unique category
le = LabelEncoder()
df['level_encoded'] = le.fit_transform(df['level'])

print(df[['level', 'level_encoded']].head(10))

▶ Output

          level  level_encoded
0      advanced              0
1      advanced              0
2  intermediate              2
3      advanced              0
4  intermediate              2
5      advanced              0
6      beginner              1
7  intermediate              2
8  intermediate              2
9  intermediate              2

The exact integer assigned to each category is visible via le.classes_, which lists the categories in the order they were mapped:

Python

# See the exact mapping: class → integer
print("Classes:", le.classes_)
# Output: ['advanced' 'beginner' 'intermediate']
# advanced=0, beginner=1, intermediate=2

▶ Output

Classes: ['advanced' 'beginner' 'intermediate']

LabelEncoder assigns integers alphabetically, not in any meaningful learning-progression order. Here that produces advanced=0, beginner=1, intermediate=2, which scrambles the logical order (beginner should be the lowest level, advanced the highest). A model trained on this encoding would treat 'advanced' (0) as ranking below both 'beginner' (1) and 'intermediate' (2), which contradicts the real-world meaning.

⚠️

Always Check le.classes_ for Ordinal Columns

LabelEncoder does not know that 'beginner' should come before 'advanced' — it only sees text and sorts it alphabetically. For a genuinely ordinal column, always inspect le.classes_ after fitting and confirm the resulting integer order matches the real-world order. If it doesn't (as with level here), the safer alternative is a manual dictionary mapping — e.g. {'beginner': 0, 'intermediate': 1, 'advanced': 2} applied with .map() — which guarantees the encoded integers reflect the true progression.
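A minimal sketch of the manual mapping follows. The four rows are a stand-in for the W8 level column, invented for illustration, and the unmapped-value check is an added safeguard rather than part of the original notebook.

```python
import pandas as pd

# Small stand-in for the W8 'level' column (values invented for illustration)
df_demo = pd.DataFrame({"level": ["advanced", "beginner", "intermediate", "advanced"]})

# Explicit mapping that follows the real learning progression
level_order = {"beginner": 0, "intermediate": 1, "advanced": 2}
df_demo["level_encoded"] = df_demo["level"].map(level_order)

# .map() returns NaN for any value missing from the dictionary,
# so check for unmapped categories before using the column
unmapped = df_demo.loc[df_demo["level_encoded"].isna(), "level"].unique()
assert len(unmapped) == 0, f"Unmapped level values: {unmapped}"

print(df_demo)
```

Unlike LabelEncoder, the dictionary makes the intended order visible in the code itself, so there is nothing to verify after the fact.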

📊 City to Columns

One-Hot Encoding converts a categorical column with N unique values into N binary (0/1) columns — one per category. Each new column represents a yes/no question: 'Is this row in this category?' For any given row, exactly one column is 1 and all others are 0 — hence 'one-hot'.

In Pandas, One-Hot Encoding is performed with pd.get_dummies(). It is the correct choice for nominal categories — categories with no meaningful ordering. The W8 city column is nominal: there is no sense in which Cairo > Alexandria or Giza < Cairo. Each city is an independent category.

Python

# One-Hot Encoding — create a binary column for each category
df_encoded = pd.get_dummies(df, columns=['city'])

print("New columns:", [c for c in df_encoded.columns if 'city' in c])
df_encoded.head()

▶ Output

New columns: ['city_ALEX', 'city_Alex', 'city_Alexandria', 'city_CAIRO', 'city_Cairo', 'city_GIZA', 'city_Giza', 'city_Giza ', 'city_alex', 'city_cairo', 'city_cairo ', 'city_giza']

             name   age  score  ...  city_cairo  city_cairo   city_giza
0     Lina Sherif  28.0   97.5  ...       False        False      False
1  Celine Ibrahim   NaN   86.6  ...       False        False      False
2    Walid Hassan  31.0   76.1  ...       False        False      False
3      Xena Salah  26.0   88.6  ...       False         True      False
4     Fatma Kamal  28.0   79.1  ...       False        False      False

[5 rows x 19 columns]

Notice that get_dummies() creates a separate column for every distinct string it sees, including formatting variants. Because the city column here still contains its original 12 spellings of 3 real cities (e.g. 'Cairo', 'cairo', 'CAIRO', and 'cairo ' with a trailing space), one-hot encoding produces 12 columns instead of 3. In pandas 2.0 and later, pd.get_dummies() also returns boolean (True/False) columns by default, not 0/1 integers; pass dtype=int if integers are needed. This is exactly why format standardisation (Topic 8.7) must happen before encoding: otherwise the model receives a separate, meaningless column for every spelling variant.

Raw city values	One-hot columns created
'Cairo', 'cairo', 'CAIRO', 'cairo ' (trailing space)	'city_Cairo', 'city_cairo', 'city_CAIRO', 'city_cairo ' (4 separate columns)
'Alex', 'alex', 'ALEX', 'Alexandria'	'city_Alex', 'city_alex', 'city_ALEX', 'city_Alexandria' (4 separate columns)
'Giza', 'giza', 'GIZA', 'Giza ' (trailing space)	'city_Giza', 'city_giza', 'city_GIZA', 'city_Giza ' (4 separate columns)
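The effect of cleaning first can be sketched with the 12 spellings listed above. The cleanup rule here (strip, title-case, expand 'Alex') is an assumption chosen for illustration and is not necessarily the method used in Topic 8.7.

```python
import pandas as pd

# The 12 raw spellings seen in the output above, one row each
raw_city = pd.Series(
    ["Cairo", "cairo", "CAIRO", "cairo ",
     "Alex", "alex", "ALEX", "Alexandria",
     "Giza", "giza", "GIZA", "Giza "],
    name="city",
)

# Assumed cleanup (not necessarily the Topic 8.7 method):
# trim whitespace, normalise case, then expand the 'Alex' abbreviation
clean_city = (
    raw_city.str.strip()
            .str.title()
            .replace({"Alex": "Alexandria"})
)

raw_dummies = pd.get_dummies(raw_city, prefix="city")
clean_dummies = pd.get_dummies(clean_city, prefix="city")

print(raw_dummies.shape[1], "columns before cleaning")
print(clean_dummies.columns.tolist())
```

After cleaning, the same 12 rows collapse into three dummy columns, one per real city.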

🪤 Choose an Encoding

Choosing between the two encoding methods depends entirely on whether the category is ordinal or nominal. Applying the wrong method introduces either a false arithmetic relationship (Label Encoding on nominal data) or unnecessary dimensionality (One-Hot Encoding on ordinal data with many levels).

Property	LabelEncoder	pd.get_dummies()
Best for	Ordinal categories (meaningful order)	Nominal categories (no order)
W8 example	level: advanced=0, beginner=1, intermediate=2 (alphabetical — check before trusting)	city: one column per distinct spelling (12 columns if unstandardised, 3 if cleaned first)
Output columns	1 integer column (here added alongside the original)	N columns replace the original (N−1 with drop_first=True)
Implies ordering?	Yes — higher integer = higher rank	No — each column is independent (True/False)
Risk if misapplied	Assigns false order to nominal categories, or wrong order to ordinal ones if not verified	Discards the order information in ordinal categories
With N=2 categories	Equivalent: a single 0/1 column	Equivalent once drop_first=True leaves a single column

The W8 dataset applies both methods: LabelEncoder for level (ordinal — with its alphabetical-order quirk checked via le.classes_) and pd.get_dummies() for city (nominal). Using Label Encoding for city would imply an arbitrary alphabetical ranking between cities — a completely false ordering that would cause any linear or distance-based model to learn an incorrect geographic relationship.
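The two-category case in the table's last row can be checked directly. The passed column below is hypothetical (it is not part of the W8 dataset) and serves only to show that both methods reduce to the same single 0/1 column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-category column, invented for illustration
passed = pd.Series(["no", "yes", "yes", "no", "yes"], name="passed")

# LabelEncoder sorts alphabetically: no=0, yes=1
label_codes = LabelEncoder().fit_transform(passed)

# One-hot with drop_first keeps only the 'yes' indicator column
dummy_codes = pd.get_dummies(passed, drop_first=True, dtype=int)["yes"]

print(label_codes.tolist())
print(dummy_codes.tolist())
```

With only two categories there is no intermediate value to misrank, which is why the ordinal-versus-nominal distinction stops mattering in this case.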

⚠️

Format Standardisation Must Precede Encoding

If 'cairo', 'Cairo', and 'CAIRO' are still present as distinct strings when pd.get_dummies() is run, they will be encoded as three separate categories — three columns instead of one. The entire benefit of the city standardisation in Topic 8.7 would be lost. Always standardise text before encoding.

📐 Encoding Trap

When One-Hot Encoding a column with N categories, the N resulting columns are mathematically redundant: knowing the values of N−1 columns always determines the Nth. If the city column were already standardised to 3 clean categories, then knowing city_Cairo=0 and city_Alexandria=0 would be enough to know city_Giza=1 — three columns would carry only two independent pieces of information.

This is the dummy variable trap: in a linear model with an intercept, the N dummy columns always sum to 1, which duplicates the intercept column. The result is perfect multicollinearity, so the coefficients have no unique solution. The fix is to drop one dummy column using drop_first=True.

Python

# Avoid dummy variable trap — drop the first category column
df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True)

print("Columns after drop_first:", [c for c in df_encoded.columns if 'city' in c])

▶ Output

Columns after drop_first: ['city_Alex', 'city_Alexandria', 'city_CAIRO', 'city_Cairo', 'city_GIZA', 'city_Giza', 'city_Giza ', 'city_alex', 'city_cairo', 'city_cairo ', 'city_giza']

With drop_first=True, the first category column in sorted order (city_ALEX) is dropped, leaving 11 of the original 12 columns. The same logic applies regardless of how many distinct spellings exist: for a row where every remaining dummy column is False, the model infers it belongs to the dropped reference category. This example also shows concretely why standardising city in Topic 8.7 first matters: applying drop_first to 12 unstandardised spelling variants still leaves 11 columns, most of them duplicates of the same city, instead of the 2 you'd get from 3 clean categories.

ℹ️

The Reference Category Is Not Lost — It Is Implicit

When drop_first=True drops one column, the information is not gone — it is encoded implicitly. For a row where all remaining dummy columns are 0, the model infers it belongs to the dropped (reference) category. In a fully standardised version of this dataset — with city reduced to just Cairo, Alexandria, and Giza — dropping the first column would leave exactly 2 columns, and a row with both 0 would implicitly mean the dropped city. Tree-based models (decision trees, random forests) do not require drop_first — multicollinearity does not affect them.
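The implicit reference category can be seen on a small, already-clean example. The four city values below are invented for illustration, and the rebuilt column exists only to demonstrate that no information was lost.

```python
import pandas as pd

# Hypothetical, already-standardised city values
cities = pd.DataFrame({"city": ["Cairo", "Giza", "Alexandria", "Cairo"]})

# drop_first removes the first category in sorted order (Alexandria)
dummies = pd.get_dummies(cities, columns=["city"], drop_first=True, dtype=int)
remaining = dummies.columns.tolist()
print(remaining)

# The dropped column can be rebuilt: it is 1 exactly where the others are all 0
dummies["city_Alexandria_rebuilt"] = 1 - dummies[remaining].sum(axis=1)
print(dummies)
```

Only the third row has both remaining dummies at 0, and that is the row whose city is Alexandria.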

🔧 Full Encoding Run

In a real project, encoding is applied after all cleaning and standardisation steps are complete. The W8 dataset has two columns requiring encoding: level (ordinal, LabelEncoder) and city (nominal, get_dummies). The full pipeline below applies both in sequence and prints the shape and column names before and after encoding.

Python

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('../Datasets/8_4_competition_results.csv')

print(f"Shape before encoding: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

# Step 1: Label Encoding for level (ordinal)
le = LabelEncoder()
df['level_encoded'] = le.fit_transform(df['level'])

# Step 2: One-Hot Encoding for city (nominal), with drop_first for linear models
df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True)

print(f"\nShape after encoding: {df_encoded.shape}")
print(f"Columns: {df_encoded.columns.tolist()}")

This combines both Scenes from the notebook in sequence: LabelEncoder turns level into level_encoded (verify the order with le.classes_ before trusting it), and get_dummies(..., drop_first=True) expands city into one boolean column per spelling variant, minus one reference column. Because the city column here is still the original, unstandardised text, this produces 11 dummy columns rather than the 2 you'd get after Topic 8.7's cleanup, a concrete illustration of why standardisation should happen before encoding in a real pipeline.

Python

# Verify the encoding is correct
print("Level classes (alphabetical order):", le.classes_)
print(df_encoded[['level', 'level_encoded']].drop_duplicates().sort_values('level_encoded'))

print("\nCity dummy columns created:")
print([c for c in df_encoded.columns if 'city' in c])

level column (original)

Text categories: 'beginner', 'intermediate', 'advanced'. Kept for interpretability — human-readable.

level_encoded column (new)

Integer assigned by LabelEncoder, alphabetically — not automatically in learning-progression order. Always check le.classes_.

city column (dropped)

Removed by pd.get_dummies(). Replaced by one boolean dummy column per distinct spelling, minus the dropped reference column.

city_* dummy columns (new)

Boolean columns: True = yes, False = no. The alphabetically-first category is the implicit reference when all dummies are False.
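For comparison, here is a sketch of how the same run might look once the two caveats from this topic are applied: city standardised first, and level mapped in its true order. The helper encode_results, the cleanup rule, and the sample rows are all hypothetical, written for illustration rather than taken from the notebook.

```python
import pandas as pd

# True learning progression, written out explicitly
LEVEL_ORDER = {"beginner": 0, "intermediate": 1, "advanced": 2}

def encode_results(df: pd.DataFrame) -> pd.DataFrame:
    """Standardise city, map level in its true order, then one-hot encode city."""
    out = df.copy()
    # Assumed cleanup rule; the actual Topic 8.7 steps may differ
    out["city"] = out["city"].str.strip().str.title().replace({"Alex": "Alexandria"})
    out["level_encoded"] = out["level"].map(LEVEL_ORDER)
    # dtype=int gives 0/1 columns instead of the default booleans
    return pd.get_dummies(out, columns=["city"], drop_first=True, dtype=int)

# Hypothetical rows standing in for the W8 data
sample = pd.DataFrame({
    "level": ["advanced", "beginner", "intermediate", "beginner"],
    "city": ["cairo ", "ALEX", "Giza", "Cairo"],
})

encoded = encode_results(sample)
print(encoded)
```

With clean input, the result has two city dummies (Alexandria is the implicit reference) and a level_encoded column that rises from beginner to advanced.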

⚠️ Encoding Myths

⚠️

Misconception: Label Encoding is always faster and simpler, so it should be the default

Label Encoding is only appropriate for ordinal data. Applying it to nominal categories like city names introduces false arithmetic relationships: with LabelEncoder's alphabetical codes (Alexandria=0, Cairo=1, Giza=2), the model may learn that the 'average' of Alexandria and Giza is Cairo. This is numerically coherent but factually meaningless, and it can distort any model that treats the codes as magnitudes.
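Whatever integers are assigned, one city always ends up numerically "between" the other two. The sketch below shows this with LabelEncoder on three clean city names (an illustrative input, not the raw W8 column), where the codes follow alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Three clean city names, used here only for illustration
cities = pd.Series(["Cairo", "Alexandria", "Giza"], name="city")

le_city = LabelEncoder()
le_city.fit(cities)

# classes_ is sorted, so position in the array is the assigned integer
mapping = {str(name): i for i, name in enumerate(le_city.classes_)}
print(mapping)

# The mean of the lowest and highest codes lands exactly on the middle city
midpoint = (mapping["Alexandria"] + mapping["Giza"]) / 2
print(midpoint == mapping["Cairo"])
```

The arithmetic works, but it describes the alphabet rather than anything about the cities.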

⚠️

Misconception: The dummy variable trap only matters for large datasets

The dummy variable trap is a mathematical problem, not a computational one. It occurs even with three dummy columns and three rows in a linear model with an intercept. The issue is that the matrix XᵀX that ordinary least squares must invert is singular when perfect collinearity is present, regardless of the dataset size.
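The three-row case can be verified numerically. The design matrix below is a hand-built illustration (an intercept column plus one dummy per city), not data from the W8 file.

```python
import numpy as np

# Three rows, one per city: intercept column plus all three dummies
#                   const  Alexandria  Cairo  Giza
X_full = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 0, 0, 1]], dtype=float)

# Same data with the first dummy dropped (the drop_first=True layout)
X_dropped = X_full[:, [0, 2, 3]]

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)
print(rank_full, "independent columns out of", X_full.shape[1])
print(rank_dropped, "independent columns out of", X_dropped.shape[1])

# X^T X is singular for the full design, so its determinant is numerically zero
singular = bool(np.isclose(np.linalg.det(X_full.T @ X_full), 0.0))
print(singular)
```

The full design has four columns but only three independent ones, because the intercept equals the sum of the dummies; dropping one dummy restores full column rank.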

✅ Encode This

?

The W8 'level' column has three values: 'beginner', 'intermediate', 'advanced'. After running LabelEncoder().fit_transform() on this column, why is it important to check le.classes_ before using the result?

Label Encoding (sklearn's LabelEncoder) assigns an integer to each category — appropriate only for ordinal categories where the order of values is meaningful (beginner < intermediate < advanced).
LabelEncoder assigns integers alphabetically, which may not match the intended logical order. Always inspect le.classes_ after fitting; if alphabetical order contradicts the true ranking, a manual dictionary mapping with .map() is the safer alternative.
One-Hot Encoding (pd.get_dummies()) converts a nominal categorical column into N binary (True/False) columns — appropriate when there is no meaningful ordering between categories (city names).
The dummy variable trap occurs when all N dummy columns are included in a linear model, creating perfect multicollinearity. Use drop_first=True to drop one column and designate a reference category.
Format standardisation must precede encoding — unstandardised variants like 'cairo' and 'Cairo' are encoded as separate categories, multiplying the number of dummy columns (12 columns for 3 real cities in the raw W8 data).
The W8 dataset uses LabelEncoder for 'level' (ordinal) and get_dummies for 'city' (nominal: one column per distinct spelling, reduced by drop_first).

📂Dataset & Notebook

Dataset: 8_4_competition_results.csv
The W8 competition dataset used in the code examples above
Notebook: 8_8_Categorical_Encoding.ipynb
Follow-along notebook for this topic's encoding steps

📚External Resources

scikit-learn Documentation: LabelEncoder
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
Pandas Documentation: pd.get_dummies()
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
GeeksforGeeks: One-Hot Encoding vs Label Encoding
https://www.geeksforgeeks.org/machine-learning/one-hot-encoding-vs-label-encoding/