Topic 8.8: Categorical Encoding
Converting text categories to numbers using LabelEncoder for ordinal data and pd.get_dummies() for nominal data
A theatre assigns seats on two dimensions: the row (A, B, C, D, with A being closest to the stage) and the section (Left, Centre, Right). For its ticketing database, both dimensions must be stored as numbers. For the row letters, a simple number assignment works well (A=1, B=2, C=3, D=4) because there is a natural order: row A is genuinely closer to the stage than row D. The ordering is meaningful.
For the section (Left, Centre, Right), a simple number assignment is problematic. If Left=1, Centre=2, and Right=3, the database implicitly claims that Centre is the mathematical average of Left and Right, and that Right is three times Left. Neither claim is true: the sections are distinct locations with no numeric relationship. The correct approach is to create three separate binary columns, is_left, is_centre, and is_right, one column per option.
Every categorical encoding decision reduces to one question: does this category have a meaningful order that the model should be aware of? If yes (ordinal), use Label Encoding. If no (nominal), use One-Hot Encoding. The W8 dataset has both: 'level' (beginner < intermediate < advanced) is ordinal; 'city' (Cairo, Alexandria, Giza) is nominal.
The LabelEncoder from sklearn.preprocessing is the tool this topic uses for ordinal columns. It assigns each unique category a distinct integer. It is the appropriate choice when the categorical variable is ordinal, that is, when the categories have a meaningful, rankable order. The W8 level column is a clear example: beginner < intermediate < advanced represents a genuine learning progression.
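The theatre example can be sketched in a few lines of pandas. The seat records, the `row_order` dictionary, and the `is_*` column names below are illustrative assumptions, not part of any real ticketing system.

```python
import pandas as pd

# Hypothetical seat records, invented for illustration.
seats = pd.DataFrame({
    "row": ["A", "C", "B", "D"],
    "section": ["Left", "Centre", "Right", "Left"],
})

# Ordinal: row letters have a real order, so one integer column is enough.
row_order = {"A": 1, "B": 2, "C": 3, "D": 4}
seats["row_number"] = seats["row"].map(row_order)

# Nominal: one 0/1 indicator column per section, with no implied ranking.
for name in ["Left", "Centre", "Right"]:
    seats[f"is_{name.lower()}"] = (seats["section"] == name).astype(int)

print(seats)
```

Each seat ends up with a single rank for its row and exactly one indicator set to 1 for its section.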
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('../Datasets/8_4_competition_results.csv')

# Inspect the categorical columns
print(df['level'].unique())
print(df['city'].unique())
```
With the categories identified, LabelEncoder can be fit directly on the level column to assign each unique value an integer:
```python
# Label Encoding: assign an integer to each unique category
le = LabelEncoder()
df['level_encoded'] = le.fit_transform(df['level'])
print(df[['level', 'level_encoded']].head(10))
```
The exact integer assigned to each category is visible via le.classes_, which lists the categories in the order they were mapped:
```python
# See the exact mapping: class -> integer
print("Classes:", le.classes_)
# Output: ['advanced' 'beginner' 'intermediate']
# advanced=0, beginner=1, intermediate=2
```
LabelEncoder assigns integers in sorted (alphabetical) order, not in any meaningful learning-progression order. Here that produces advanced=0, beginner=1, intermediate=2, which scrambles the logical order (beginner should be the lowest level, advanced the highest). A model trained on this encoding would treat 'advanced' (0) as ranking below 'beginner' (1), and 'intermediate' (2) as the highest level, which contradicts the real-world meaning.
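To make the sorted assignment explicit, the fitted encoder can be turned into a category-to-integer dictionary. This is a minimal sketch on a toy list standing in for the level column; the list contents are an assumption, not read from the CSV.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the W8 level column (assumed, not loaded from the CSV).
levels = ["beginner", "advanced", "intermediate", "beginner"]

le = LabelEncoder()
codes = le.fit_transform(levels)

# The position of each label in le.classes_ is the integer it was assigned.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())
```

Printing the dictionary rather than only le.classes_ makes a mismatch with the intended ranking easier to spot.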
LabelEncoder does not know that 'beginner' should come before 'advanced'; it only sees text and sorts it alphabetically. For a genuinely ordinal column, always inspect le.classes_ after fitting and confirm the resulting integer order matches the real-world order. If it doesn't (as with level here), the safer alternative is a manual dictionary mapping, e.g. {'beginner': 0, 'intermediate': 1, 'advanced': 2} applied with .map(), which guarantees the encoded integers reflect the true progression.
One-Hot Encoding converts a categorical column with N unique values into N binary (0/1) columns, one per category. Each new column represents a yes/no question: 'Is this row in this category?' For any given row, exactly one column is 1 and all others are 0, hence 'one-hot'.
In Pandas, One-Hot Encoding is performed with pd.get_dummies(). It is the correct choice for nominal categories, that is, categories with no meaningful ordering. The W8 city column is nominal: there is no sense in which Cairo > Alexandria or Giza < Cairo. Each city is an independent category.
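A minimal sketch of the manual mapping follows, using a toy DataFrame in place of the W8 data (the values are assumed). It also shows one alternative that is not used in this topic: scikit-learn's OrdinalEncoder, which accepts an explicit category order through its categories parameter.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for the W8 data (assumed values).
df = pd.DataFrame({"level": ["beginner", "advanced", "intermediate", "beginner"]})

# Option 1: explicit dictionary, applied with .map().
level_order = {"beginner": 0, "intermediate": 1, "advanced": 2}
df["level_encoded"] = df["level"].map(level_order)

# .map() returns NaN for any value missing from the dictionary, so check for gaps.
unmapped = df.loc[df["level_encoded"].isna(), "level"].unique()
print("Unmapped values:", list(unmapped))

# Option 2 (alternative, not the topic's method): OrdinalEncoder with the order spelled out.
encoder = OrdinalEncoder(categories=[["beginner", "intermediate", "advanced"]])
df["level_ordinal"] = encoder.fit_transform(df[["level"]]).ravel()

print(df)
```

Both options give beginner the lowest code and advanced the highest. The .map() version is easier to read in a notebook, while an encoder object can be reused on new data.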
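The mechanism can be sketched by hand with NumPy before reaching for a library helper. The city values and the fixed category list are assumptions chosen for illustration.

```python
import numpy as np

# Toy values; the category list is fixed up front so the column order is explicit.
values = np.array(["Cairo", "Giza", "Alexandria", "Cairo"])
categories = np.array(["Alexandria", "Cairo", "Giza"])

# Compare every value against every category: result has shape (rows, categories).
one_hot = (values[:, None] == categories[None, :]).astype(int)

print(one_hot)
print("Row sums:", one_hot.sum(axis=1))
```

Every row sum is 1, which is the 'one-hot' property described above.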
```python
# One-Hot Encoding: create a binary column for each category
df_encoded = pd.get_dummies(df, columns=['city'])
print("New columns:", [c for c in df_encoded.columns if 'city' in c])
df_encoded.head()
```
Notice that get_dummies() creates a separate column for every distinct string it sees, including formatting variants. Because the city column here still contains its original 12 spellings of 3 real cities (e.g. 'Cairo', 'cairo', 'CAIRO', and 'cairo ' with a trailing space), one-hot encoding produces 12 columns instead of 3. In pandas 2.0 and later, pd.get_dummies() also returns boolean (True/False) columns by default rather than 0/1 integers; pass dtype=int if integer columns are needed. This is exactly why format standardisation (Topic 8.7) must happen before encoding: otherwise the model receives a separate, meaningless column for every spelling variant.
| Raw city value | One-hot column created |
|---|---|
| Cairo, cairo, CAIRO, cairo (trailing space) | city_Cairo, city_cairo, city_CAIRO, city_cairo with trailing space (4 separate columns) |
| Alex, alex, ALEX, Alexandria | city_Alex, city_alex, city_ALEX, city_Alexandria (4 separate columns) |
| Giza, giza, GIZA, Giza (trailing space) | city_Giza, city_giza, city_GIZA, city_Giza with trailing space (4 separate columns) |
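The effect in the table can be reproduced on a toy column holding the twelve variants. The cleaning rule (strip, lowercase, then map through a `canonical` dictionary) is an assumption made for this sketch and is not necessarily the exact procedure from Topic 8.7.

```python
import pandas as pd

# Toy column containing the spelling variants listed in the table.
raw = pd.DataFrame({"city": ["Cairo", "cairo", "CAIRO", "cairo ",
                             "Alex", "alex", "ALEX", "Alexandria",
                             "Giza", "giza", "GIZA", "Giza "]})

n_before = pd.get_dummies(raw, columns=["city"]).shape[1]
print("Columns before cleaning:", n_before)

# Assumed cleaning rule: trim whitespace, lowercase, then map to canonical names.
canonical = {"cairo": "Cairo", "alex": "Alexandria",
             "alexandria": "Alexandria", "giza": "Giza"}
clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.lower().map(canonical)

# dtype=int requests 0/1 integers instead of booleans.
dummies = pd.get_dummies(clean, columns=["city"], dtype=int)
print("Columns after cleaning:", dummies.columns.tolist())
```

After cleaning, the same twelve rows collapse into three dummy columns, one per real city.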
Choosing between the two encoding methods depends entirely on whether the category is ordinal or nominal. Applying the wrong method introduces either a false arithmetic relationship (Label Encoding on nominal data) or unnecessary dimensionality (One-Hot Encoding on ordinal data with many levels).
| Property | LabelEncoder | pd.get_dummies() |
|---|---|---|
| Best for | Ordinal categories (meaningful order) | Nominal categories (no order) |
| W8 example | level: advanced=0, beginner=1, intermediate=2 (alphabetical, so check before trusting) | city: one column per distinct spelling (12 columns if unstandardised, 3 if cleaned first) |
| Output columns added | 1 integer column (level_encoded, stored next to the original in the code above) | N columns replace the original (N-1 with drop_first=True) |
| Implies ordering? | Yes: higher integer = higher rank | No: each column is independent (True/False) |
| Risk if misapplied | Assigns false order to nominal categories, or wrong order to ordinal ones if not verified | Discards the ordering of ordinal categories and adds extra columns |
| With N=2 categories | A single 0/1 column | A single indicator column when drop_first=True, equivalent to the LabelEncoder result |
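The two-category row of the table can be checked directly. The `member` column below is a hypothetical example, and the comparison assumes drop_first=True so that get_dummies keeps a single indicator column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-category column, used only for illustration.
df = pd.DataFrame({"member": ["no", "yes", "yes", "no", "yes"]})

# LabelEncoder: sorted order gives no=0, yes=1.
label_codes = LabelEncoder().fit_transform(df["member"])

# get_dummies with drop_first=True drops 'no' and keeps one 0/1 column named 'yes'.
dummy = pd.get_dummies(df["member"], drop_first=True, dtype=int)

print(label_codes.tolist())
print(dummy["yes"].tolist())
```

With only two categories, both approaches yield the same single column, so the ordinal-versus-nominal distinction has no practical effect.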
The W8 dataset applies both methods: LabelEncoder for level (ordinal, with its alphabetical-order quirk checked via le.classes_) and pd.get_dummies() for city (nominal). Using Label Encoding for city would imply an arbitrary alphabetical ranking between cities, a completely false ordering that would cause any linear or distance-based model to learn an incorrect geographic relationship.
If 'cairo', 'Cairo', and 'CAIRO' are still present as distinct strings when pd.get_dummies() is run, they will be encoded as three separate categories: three columns instead of one. The entire benefit of the city standardisation in Topic 8.7 would be lost. Always standardise text before encoding.
When One-Hot Encoding a column with N categories, the N resulting columns are mathematically redundant: knowing the values of N-1 columns always determines the Nth. If the city column were already standardised to 3 clean categories, then knowing city_Cairo=0 and city_Alexandria=0 would be enough to know city_Giza=1, so three columns would carry only two independent pieces of information.
This is the dummy variable trap: including all N columns introduces perfect multicollinearity in linear models, making coefficient estimation mathematically unstable. The solution is to drop one dummy column using drop_first=True.
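The redundancy can be verified on a small, already-clean city column (the values are assumed): the Giza column is rebuilt from the other two and compared with the real one.

```python
import pandas as pd

# Assumed clean city values, as they might look after standardisation.
cities = pd.Series(["Cairo", "Giza", "Alexandria", "Giza", "Cairo"], name="city")
d = pd.get_dummies(cities, dtype=int)

# Each row has exactly one 1, so the third column is 1 minus the other two.
giza_rebuilt = 1 - d["Cairo"] - d["Alexandria"]

matches = bool((giza_rebuilt == d["Giza"]).all())
print(matches)
```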
```python
# Avoid the dummy variable trap: drop the first category column
df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True)
print("Columns after drop_first:", [c for c in df_encoded.columns if 'city' in c])
```
With drop_first=True, the category that sorts first (city_ALEX, since uppercase letters sort before lowercase) is dropped, leaving 11 of the original 12 columns. This example also shows concretely why standardising city in Topic 8.7 first matters: applying drop_first to 12 unstandardised spelling variants still leaves 11 columns, most of them spurious, instead of the 2 you'd get from 3 clean categories.
When drop_first=True drops one column, the information is not gone; it is encoded implicitly. For a row where all remaining dummy columns are 0, the model infers it belongs to the dropped (reference) category. In a fully standardised version of this dataset, with city reduced to just Cairo, Alexandria, and Giza, dropping the first column would leave exactly 2 columns, and a row with both 0 would implicitly mean the dropped city. Tree-based models (decision trees, random forests) do not require drop_first, because multicollinearity does not affect them.
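The implicit reference category can be made visible by decoding the dummy columns back into city names. This sketch assumes three clean cities, so the dropped reference is Alexandria; the data is invented for illustration.

```python
import pandas as pd

# Assumed clean city values.
cities = pd.Series(["Alexandria", "Cairo", "Giza", "Alexandria"], name="city")
d = pd.get_dummies(cities, prefix="city", drop_first=True, dtype=int)
# Remaining columns: city_Cairo, city_Giza. Dropped reference: Alexandria.

# Rows where every remaining dummy is 0 belong to the reference category.
is_reference = d.sum(axis=1) == 0

# For the other rows, the column holding the 1 names the city.
named = d.idxmax(axis=1).str.replace("city_", "", regex=False)
recovered = named.where(~is_reference, "Alexandria")

print(recovered.tolist())
```

The original column is fully recoverable, which confirms that dropping one dummy loses no information.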
In a real project, encoding is applied after all cleaning and standardisation steps are complete. The W8 dataset has two columns requiring encoding: level (ordinal: LabelEncoder here, with the resulting order checked against le.classes_) and city (nominal: get_dummies). The full pipeline below applies both in sequence and shows the final encoded DataFrame.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('../Datasets/8_4_competition_results.csv')
print(f"Shape before encoding: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

# Step 1: Label Encoding for level (ordinal)
le = LabelEncoder()
df['level_encoded'] = le.fit_transform(df['level'])

# Step 2: One-Hot Encoding for city (nominal), with drop_first for linear models
df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True)

print(f"\nShape after encoding: {df_encoded.shape}")
print(f"Columns: {df_encoded.columns.tolist()}")
```
This combines both Scenes from the notebook in sequence: LabelEncoder turns level into level_encoded (verify the order with le.classes_ before trusting it), and get_dummies(..., drop_first=True) expands city into one boolean column per spelling variant, minus one reference column. Because the city column here is still the original, unstandardised text, this produces many more dummy columns than the 2 you'd get after Topic 8.7's cleanup, a concrete illustration of why standardisation should happen before encoding in a real pipeline.
```python
# Verify the encoding is correct
print("Level classes (alphabetical order):", le.classes_)
print(df_encoded[['level', 'level_encoded']].drop_duplicates().sort_values('level_encoded'))

print("\nCity dummy columns created:")
print([c for c in df_encoded.columns if 'city' in c])
```
Label Encoding is only appropriate for ordinal data. Applying it to nominal categories like city names introduces false arithmetic relationships: with an alphabetical assignment, the model may learn that the 'average' of Alexandria (encoded as 0) and Giza (encoded as 2) is Cairo (encoded as 1). This is numerically coherent but factually meaningless, and it can produce incorrect model behaviour.
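The false arithmetic is easy to reproduce. Which city lands 'in the middle' depends only on the integer assignment. In this sketch, which assumes three clean city names, LabelEncoder's sorted assignment puts Cairo between Alexandria and Giza.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed clean city names.
cities = ["Cairo", "Alexandria", "Giza"]
le = LabelEncoder().fit(cities)

# Category -> integer, following the sorted order in le.classes_.
codes = {str(c): int(i) for i, c in enumerate(le.classes_)}
print(codes)

# The mean of the lowest and highest codes equals the middle city's code.
low, mid, high = (str(c) for c in le.classes_)
mean_of_extremes = (codes[low] + codes[high]) / 2
print(f"mean({low}, {high}) = {mean_of_extremes}, the code assigned to {mid}")
```

A linear or distance-based model has no way to tell that this arithmetic is an accident of sorting rather than a property of the cities.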
The dummy variable trap is a mathematical problem, not a computational one. It occurs even with three dummy columns and three rows in a linear model. When the model includes an intercept, the N dummy columns always sum to the intercept column, so the columns of the design matrix X are linearly dependent and X^T X cannot be inverted, regardless of the dataset size.
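The collinearity can be checked numerically with a rank computation. This sketch assumes a linear model with an intercept column and three clean cities. The six rows are invented, and adding more rows would not change the result.

```python
import numpy as np

# Six rows cycling through three cities; columns are Alexandria, Cairo, Giza.
dummies = np.eye(3)[[0, 1, 2, 0, 1, 2]]
intercept = np.ones((6, 1))

# All N dummies plus an intercept: 4 columns.
X_full = np.hstack([intercept, dummies])
# First dummy dropped (as drop_first=True would do): 3 columns.
X_drop = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)

print("Full:   ", rank_full, "independent columns out of", X_full.shape[1])
print("Dropped:", rank_drop, "independent columns out of", X_drop.shape[1])
```

The full matrix has fewer independent columns than it has columns, which is what makes X^T X singular. After dropping one dummy, the rank equals the column count.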
- Label Encoding (sklearn's LabelEncoder) assigns an integer to each category. It is appropriate only for ordinal categories where the order of values is meaningful (beginner < intermediate < advanced).
- LabelEncoder assigns integers alphabetically, which may not match the intended logical order. Always inspect le.classes_ after fitting; if alphabetical order contradicts the true ranking, a manual dictionary mapping with .map() is the safer alternative.
- One-Hot Encoding (pd.get_dummies()) converts a nominal categorical column into N binary (True/False) columns. It is appropriate when there is no meaningful ordering between categories (city names).
- The dummy variable trap occurs when all N dummy columns are included in a linear model, creating perfect multicollinearity. Use drop_first=True to drop one column and designate a reference category.
- Format standardisation must precede encoding: unstandardised variants like 'cairo' and 'Cairo' are encoded as separate categories, multiplying the number of dummy columns (12 columns for 3 real cities in the raw W8 data).
- The W8 dataset uses LabelEncoder for 'level' (ordinal) and get_dummies for 'city' (nominal: one column per distinct spelling, reduced by drop_first).
- Dataset: 8_4_competition_results.csv
  The W8 competition dataset used in the code examples above
- Notebook: 8_8_Categorical_Encoding.ipynb
  Follow-along notebook for this topic's encoding steps
- scikit-learn Documentation: LabelEncoder
  https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
- Pandas Documentation: pd.get_dummies()
  https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
- GeeksforGeeks: One-Hot Encoding vs Label Encoding
  https://www.geeksforgeeks.org/machine-learning/one-hot-encoding-vs-label-encoding/