Feature Scaling MinMaxScaler StandardScaler Normalisation Standardisation sklearn

Topic 8.9: Feature Scaling

Bringing numeric features onto a common scale using MinMaxScaler and StandardScaler from sklearn.preprocessing

🌉 The Athletics Combine

A sports academy evaluates recruits across three tests: a 100-metre sprint time (10–15 seconds, lower is better), a standing jump height (50–120 cm, higher is better), and a grip strength test (20–80 kg, higher is better). To compute a total score by summing all three, the academy faces a problem: the measurements are in completely different units and on different scales. Adding seconds to centimetres to kilograms produces a meaningless total where the jump height (range of 70 cm) dominates the sprint time (range of 5 seconds) simply because the numbers are larger.

Feature scaling solves this problem. When numeric features are measured in different units and on different scales, combining them directly produces a result that reflects the units more than the data. Any model that computes distances between data points, or runs gradient descent over a parameter space, is dominated by whichever feature has the largest numeric range. Scaling brings all features onto a comparable range before they are combined.
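To make the dominance effect concrete, here is a minimal sketch using three hypothetical recruits (the values are invented for illustration; only the test ranges come from the combine description). It compares Euclidean distances before and after rescaling each test to [0, 1].

```python
import numpy as np

# Hypothetical recruits: [sprint time (s), jump height (cm), grip strength (kg)]
a = np.array([10.5, 80.0, 50.0])
b = np.array([14.5, 82.0, 50.0])   # much slower sprint, almost the same jump
c = np.array([10.6, 100.0, 50.0])  # almost the same sprint, higher jump

# Test ranges taken from the combine description
lo = np.array([10.0, 50.0, 20.0])
hi = np.array([15.0, 120.0, 80.0])


def min_max(x):
    """Rescale each measurement to [0, 1] using that test's range."""
    return (x - lo) / (hi - lo)


raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
scaled_ab = np.linalg.norm(min_max(a) - min_max(b))
scaled_ac = np.linalg.norm(min_max(a) - min_max(c))

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

On the raw values, recruit b looks closer to a than c does, even though b's sprint differs by 4 of the 5 available seconds. After scaling, the ordering reverses, because each test now contributes in proportion to its own range.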

ℹ️
Scaling Is the Final Transformation Step

Feature scaling is placed last in the data preparation pipeline, after deduplication, missing value handling, formatting, and encoding, because it operates on numeric data and should see the final, cleaned values. Text columns cannot be scaled, so categorical columns must be encoded first. Once encoding is done, the resulting dummy columns are left out of the scaling step, and only the continuous numeric features are scaled.

📏 Scale to 0–1

Min-Max Normalisation rescales a feature to a fixed range — by default, [0, 1]. Every value is transformed using the formula: X_scaled = (X − X_min) / (X_max − X_min). The minimum value becomes exactly 0, the maximum becomes exactly 1, and all other values fall proportionally between them.

In the W8 dataset, the score column (range 45–99) is a natural candidate for Min-Max scaling: it has a bounded range and no severe outliers that would compress the scaled values.

Python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('../Datasets/8_4_competition_results.csv')

# Select numeric columns to scale
numeric_cols = ['age', 'score']

scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df[numeric_cols].describe().round(3))
▶ Output
           age    score
count  192.000  197.000
mean     0.365    0.616
std      0.232    0.212
min      0.000    0.000
25%      0.176    0.460
50%      0.353    0.623
75%      0.529    0.762
max      1.000    1.000

Both age and score are scaled together in a single MinMaxScaler call by passing both column names in numeric_cols. For each column independently, the minimum value maps to exactly 0.000 and the maximum maps to exactly 1.000 — that's confirmed by the min and max rows above. Note the row counts differ (192 for age, 197 for score) because both columns still contain missing values at this point in the notebook — scaling does not require imputation first, since MinMaxScaler simply skips NaN entries when computing min/max and leaves them as NaN in the output.
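The missing-value behaviour described above can be checked on a small example. This is a sketch on a toy frame with invented values (not the W8 data), assuming a scikit-learn version in which MinMaxScaler accepts NaN input (0.20 or later).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with one missing value per column (illustrative values only)
toy = pd.DataFrame({
    "age":   [15.0, 17.0, np.nan, 19.0],
    "score": [45.0, np.nan, 81.0, 99.0],
})

# NaN entries are ignored when the per-column min/max are computed
# and are passed through unchanged in the output
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(scaled)
print(scaled.count())  # non-missing count per column is unchanged
```

The NaN cells stay NaN, while the remaining values in each column are still mapped onto [0, 1] from that column's own observed minimum and maximum.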

Formula
X_scaled = (X − X_min) / (X_max − X_min). For W8 score: (X − 45) / (99 − 45) = (X − 45) / 54.
Output range
[0.0, 1.0] for the data the scaler was fitted on: the fitted minimum maps to 0.0 and the fitted maximum maps to 1.0. New values outside the fitted range will fall outside [0, 1].
Outlier sensitivity
High. If one extreme outlier exists (e.g., score = 850), it becomes 1.0 and all legitimate scores compress near 0. Investigate outliers before applying.
Best for
Bounded features with no outliers: exam scores (0–100), percentages, pixel intensities (0–255).
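The outlier sensitivity noted above can be seen on a handful of scores. This is a sketch with invented values; the 850 mirrors the example outlier mentioned in the text.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative scores inside the 45-99 range
clean = np.array([[45.0], [60.0], [81.0], [99.0]])
# The same scores plus one data-entry error
with_outlier = np.vstack([clean, [[850.0]]])

clean_scaled = MinMaxScaler().fit_transform(clean).ravel()
outlier_scaled = MinMaxScaler().fit_transform(with_outlier).ravel()

print("without outlier:", clean_scaled.round(3))
print("with outlier:   ", outlier_scaled.round(3))
```

With the outlier present, the denominator becomes 850 − 45 = 805 instead of 54, so every legitimate score lands below 0.07 and the differences between them almost vanish.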
📐 Distance from Average

Z-Score Standardisation rescales a feature so that it has a mean of 0 and a standard deviation of 1. Every value is transformed using: X_scaled = (X − μ) / σ, where μ is the column mean and σ is the standard deviation. Positive scaled values are above average; negative values are below average.

In the W8 dataset, the age column is well-suited for StandardScaler: the values (15–19) are already in a narrow bounded range, and standardisation will allow the model to compare age relative to the group mean rather than in absolute years.

Python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('../Datasets/8_4_competition_results.csv')

numeric_cols = ['age', 'score']

scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df[numeric_cols].describe().round(3))
▶ Output
           age    score
count  192.000  197.000
mean     0.000   -0.000
std      1.003    1.003
min     -1.578   -2.914
25%     -0.815   -0.738
50%     -0.053    0.035
75%      0.709    0.689
max      2.742    1.817

After standardisation, both age and score have a mean of (essentially) 0 and a standard deviation of (essentially) 1. That is the defining property of StandardScaler, applied independently to each column. The std row reads 1.003 rather than exactly 1 because describe() reports the sample standard deviation (dividing by n − 1), while StandardScaler divides by the population standard deviation (dividing by n). Unlike MinMaxScaler, the output is unbounded: age ranges from about −1.58 to +2.74 standard deviations from its mean, and score ranges from about −2.91 to +1.82. A value's sign and magnitude now express how far it sits from that column's own average, in standard-deviation units rather than in years or raw points.

Formula
X_scaled = (X − μ) / σ, where μ is the column mean and σ is the standard deviation. Applied independently to each column passed to the scaler: here, age and score each get their own μ and σ.
Output range
Unbounded — no fixed minimum or maximum. Typical values fall between −3 and +3. Values outside that range represent genuine extremes.
Interpretation
Positive z-score = above average. Negative z-score = below average. z = 0 exactly means the value equals the mean.
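A describe() table printed after StandardScaler usually shows a std slightly above 1. The sketch below, on synthetic ages (randomly generated, not the W8 data), suggests why: StandardScaler normalises with the population standard deviation (ddof=0), whereas pandas reports the sample standard deviation (ddof=1).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic ages between 15 and 19; 192 rows to match the non-missing age count
rng = np.random.default_rng(0)
ages = rng.integers(15, 20, size=192).astype(float).reshape(-1, 1)

z = StandardScaler().fit_transform(ages).ravel()

pop_std = z.std(ddof=0)     # the quantity StandardScaler sets to 1
sample_std = z.std(ddof=1)  # the quantity pandas describe() reports

print(f"population std: {pop_std:.3f}")
print(f"sample std:     {sample_std:.3f}")
print(f"sqrt(n/(n-1)):  {np.sqrt(192 / 191):.3f}")
```

The two differ by a factor of sqrt(n / (n − 1)), which for n = 192 rounds to 1.003.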
🎯 Which Scaler?

The choice between MinMaxScaler and StandardScaler depends on the feature's distribution, the presence of outliers, and the algorithm that will receive the scaled data. Neither is universally superior — each is designed for different circumstances.

Property | MinMaxScaler | StandardScaler
Output range | [0, 1]: all values bounded | Unbounded: no fixed minimum or maximum
Formula | (X − min) / (max − min) | (X − mean) / std
Outlier sensitivity | High: one outlier compresses all other values towards one end of [0, 1] | Moderate: an outlier shifts the mean and inflates the std, shrinking the other z-scores, but the output is not pinned to a range set by the outlier
Best for | Bounded features with no outliers (exam scores, percentages) | Features without a natural bound or with mild outliers (age, income); algorithms that expect centred, unit-variance inputs
Preserves distribution shape? | Yes: same shape, rescaled | Yes: same shape, recentred and rescaled
Rule 1
Use MinMaxScaler for bounded features with a known, clean range.
Exam scores (0–100), probability outputs (0–1), pixel values (0–255). All values have a natural minimum and maximum with no extreme outliers.
Rule 2
Use StandardScaler when the algorithm expects centred features with comparable variance.
PCA, regularised linear and logistic regression, and models trained by gradient descent are sensitive to each feature's mean and variance. StandardScaler gives every feature a mean of 0 and a std of 1; it does not make a feature normally distributed.
Rule 3
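Prefer StandardScaler when outliers are present in the feature.
A single extreme outlier causes MinMaxScaler to compress all other values near 0. StandardScaler is less severely affected: the outlier shifts the mean and inflates the std, which shrinks the other z-scores, but the result is not forced into a fixed range defined by the outlier. Neither scaler removes the outlier.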
Rule 4
Scale only the numeric feature columns — never binary dummy columns or the target variable.
Dummy columns already encode 0/1 information correctly. Scaling them would destroy their binary interpretation. The target column must remain in its original scale for result interpretability.
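One way to apply Rule 4 in practice is scikit-learn's ColumnTransformer, which scales the listed columns and passes the rest through untouched. The sketch below uses a hypothetical frame: the column city_B stands in for a one-hot dummy and passed for a target, and neither name is taken from the W8 dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature frame: two numeric features and one dummy column
X = pd.DataFrame({
    "age":    [15, 16, 17, 18, 19],
    "score":  [45.0, 60.0, 72.0, 88.0, 99.0],
    "city_B": [1, 0, 0, 1, 0],
})
# Hypothetical target, kept separate and never passed to a scaler
y = pd.Series([0, 1, 1, 0, 1], name="passed")

scale_numeric = ColumnTransformer(
    transformers=[
        ("minmax", MinMaxScaler(), ["score"]),
        ("zscore", StandardScaler(), ["age"]),
    ],
    remainder="passthrough",  # unlisted columns are appended unchanged
)

# Output column order: transformer columns first, then passthrough columns
X_scaled = pd.DataFrame(
    scale_numeric.fit_transform(X),
    columns=["score", "age", "city_B"],
)
print(X_scaled.round(3))
```

The dummy column comes out as the same 0/1 values it went in with, while score is bounded to [0, 1] and age is centred at 0.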
📊 Before vs. After

A direct comparison of the two scalers' output ranges shows the practical difference: MinMaxScaler bounds everything to [0, 1], while StandardScaler is unbounded and centred at 0. Re-running each scaler (as in the previous two sections) and comparing the describe() output side by side makes this concrete:

Python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

numeric_cols = ['age', 'score']

# MinMaxScaler result
df_mm = pd.read_csv('../Datasets/8_4_competition_results.csv')
df_mm[numeric_cols] = MinMaxScaler().fit_transform(df_mm[numeric_cols])

# StandardScaler result
df_std = pd.read_csv('../Datasets/8_4_competition_results.csv')
df_std[numeric_cols] = StandardScaler().fit_transform(df_std[numeric_cols])

print("MinMaxScaler — bounded to [0, 1]:")
print(df_mm[numeric_cols].describe().round(3))

print("\nStandardScaler — centred at 0, unbounded:")
print(df_std[numeric_cols].describe().round(3))
▶ Output
MinMaxScaler — bounded to [0, 1]:
           age    score
count  192.000  197.000
mean     0.365    0.616
std      0.232    0.212
min      0.000    0.000
25%      0.176    0.460
50%      0.353    0.623
75%      0.529    0.762
max      1.000    1.000

StandardScaler — centred at 0, unbounded:
           age    score
count  192.000  197.000
mean     0.000   -0.000
std      1.003    1.003
min     -1.578   -2.914
25%     -0.815   -0.738
50%     -0.053    0.035
75%      0.709    0.689
max      2.742    1.817

Key observations. With MinMaxScaler, every column's min is exactly 0.000 and max is exactly 1.000, as the min and max rows show. With StandardScaler, every column's mean is (essentially) 0.000 and std is (essentially) 1.000, as the mean and std rows show. The relative ordering of values within each column is identical between the two methods; only the numeric scale and reference point differ.
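The claim about relative ordering can be checked directly. This is a small sketch on invented scores: both scalers are linear with a positive slope, so the sort order is unchanged and the two outputs are perfectly correlated with each other.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative, deliberately unsorted scores
scores = pd.DataFrame({"score": [45.0, 99.0, 60.0, 72.0, 88.0]})

mm = MinMaxScaler().fit_transform(scores).ravel()
z = StandardScaler().fit_transform(scores).ravel()

# argsort gives the row positions in ascending order of value
raw_order = np.argsort(scores["score"].to_numpy())
same_order = (
    np.array_equal(raw_order, np.argsort(mm))
    and np.array_equal(raw_order, np.argsort(z))
)
corr = np.corrcoef(mm, z)[0, 1]

print("same ordering:", same_order)
print("correlation between the two scaled versions:", round(corr, 6))
```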

🔧 Complete Pipeline

With both scalers introduced, a final check applies each one to its intended feature (StandardScaler to age, MinMaxScaler to score) and summarises the result for both features.

Python
# Summary comparison
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('../Datasets/8_4_competition_results.csv')

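# Standardise age (z-score); ravel() flattens the (n, 1) result into a 1-D column
scaler_std = StandardScaler()
df['age_scaled'] = scaler_std.fit_transform(df[['age']]).ravel()

# Normalise score to [0, 1]
scaler_mm = MinMaxScaler()
df['score_scaled'] = scaler_mm.fit_transform(df[['score']]).ravel()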

print("Feature scaling results summary:")
print(f"{'Feature':<15} {'Original Min':>12} {'Original Max':>12} {'Scaled Min':>10} {'Scaled Max':>10} {'Method'}")
print(f"{'score':<15} {df['score'].min():>12.1f} {df['score'].max():>12.1f} {df['score_scaled'].min():>10.4f} {df['score_scaled'].max():>10.4f} MinMaxScaler")
print(f"{'age':<15} {df['age'].min():>12.1f} {df['age'].max():>12.1f} {df['age_scaled'].min():>10.4f} {df['age_scaled'].max():>10.4f} StandardScaler")
▶ Output
Feature scaling results summary:
Feature         Original Min Original Max Scaled Min Scaled Max Method
score                   45.0         99.0     0.0000     1.0000 MinMaxScaler
age                     15.0         19.0    -1.8729     1.5169 StandardScaler

The scaling results confirm the expected behaviour. Score: min=0.0000, max=1.0000 — perfectly bounded [0,1]. Age: min=−1.8729, max=+1.5169 — unbounded, centred near 0 with spread determined by the standard deviation. The two scalers produce fundamentally different output ranges, which is why the choice of scaler must match the algorithm's requirements.

ℹ️
The W8 Data Preparation Pipeline — Complete

Across Topics 8.5 through 8.9, the W8 dataset can be progressively cleaned and transformed: duplicates removed (207 → 200 rows), missing values imputed (score: median, city: mode), city spelling standardised (7 spellings → 3 canonical cities), dates parsed (object → datetime64), level label-encoded with sklearn's LabelEncoder (checking le.classes_ for correct order), city one-hot encoded with pd.get_dummies(..., drop_first=True), score normalised (MinMax: [0,1]), and age standardised (Z-score: mean=0, std=1). Each notebook in this sequence demonstrates its technique independently on a fresh load of the same underlying dataset; chaining all the steps together in one pipeline, as sketched here, produces a fully analysis-ready dataset for machine learning.

⚠️ Scaling Myths
⚠️
Misconception: Scaling changes the distribution shape of a feature

Neither MinMaxScaler nor StandardScaler changes the shape of a distribution. A right-skewed score distribution remains right-skewed after scaling; it is just expressed on a different numeric range. Scaling is a linear transformation: it rescales and recentres, but it cannot convert a non-normal distribution into a normal one. If a more symmetric distribution is required, a non-linear transformation such as a log transform (for positive, right-skewed values) should be applied before scaling, because scaled values can be zero or negative, where the logarithm is undefined.
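Skewness gives a quick numerical check of this point. The sketch below uses synthetic right-skewed data (an exponential sample, not the W8 scores) and compares the skewness before scaling, after each scaler, and after a log transform; np.log1p is used here as one possible log variant.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic right-skewed, strictly non-negative values
rng = np.random.default_rng(42)
raw = rng.exponential(scale=10.0, size=1000)
column = raw.reshape(-1, 1)  # scalers expect a 2-D array

mm = MinMaxScaler().fit_transform(column).ravel()
z = StandardScaler().fit_transform(column).ravel()
logged = np.log1p(raw)  # non-linear: this one does change the shape

skew_raw = pd.Series(raw).skew()
skew_mm = pd.Series(mm).skew()
skew_z = pd.Series(z).skew()
skew_log = pd.Series(logged).skew()

print(f"raw: {skew_raw:.3f}  minmax: {skew_mm:.3f}  zscore: {skew_z:.3f}  log1p: {skew_log:.3f}")
```

The first three skewness values agree, since both scalers are linear; only the log transform moves the value towards 0.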

⚠️
Misconception: Scaling should be applied to all columns

Scaling should only be applied to continuous numeric features that are inputs to a model. Do NOT scale: binary dummy columns (already in [0,1] and meaningfully interpretable as 0/1); target/label columns (must remain in the original scale for result interpretation); or ID columns (not modelling features). Applying StandardScaler to a dummy column that is 1 in 20% of rows would convert '0' and '1' into '−0.5' and '+2.0', destroying the binary interpretation.
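The effect on a dummy column is easy to reproduce. This sketch uses an invented five-row dummy column in which one row in five is 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative dummy column: 1 in 20% of rows
dummy = np.array([[1.0], [0.0], [0.0], [0.0], [0.0]])

z = StandardScaler().fit_transform(dummy).ravel()
mm = MinMaxScaler().fit_transform(dummy).ravel()

print("StandardScaler:", z)   # mean 0.2, std 0.4, so 1 -> 2.0 and 0 -> -0.5
print("MinMaxScaler:  ", mm)  # already spans [0, 1], so nothing changes
```

StandardScaler replaces the 0/1 coding with values that depend on how common the category is, which is why dummy columns are best excluded from the scaling step. MinMaxScaler leaves them unchanged, so including them gains nothing.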

⚠️
Misconception: Scaling eliminates outliers

Scaling compresses or shifts values; it does not remove outliers. After MinMaxScaler, an outlier that was 10 times the IQR above Q3 is still there, expressed as 1.0, while all other values are compressed near 0. After StandardScaler, an outlier with a z-score of 50 is still present, just expressed as 50 instead of the original unit value. Outliers must be investigated (Topic 8.3) before scaling; scaling cannot be relied on to handle them.

✅ Scale This
?
After applying MinMaxScaler to the W8 'score' column (min=45, max=99), what is the scaled value for a student with a score of 72?
  • Min-Max Normalisation (MinMaxScaler) rescales features to [0, 1] using (X − min) / (max − min). The minimum maps to 0, the maximum to 1, all others proportionally between.
  • Z-Score Standardisation (StandardScaler) rescales features to mean = 0, std = 1 using (X − mean) / std. Positive values are above average; negative values are below average.
  • In the W8 dataset: score is scaled with MinMaxScaler (bounded range, no outliers), and age is scaled with StandardScaler (narrow range, relative deviation is meaningful).
  • MinMaxScaler is sensitive to outliers: one extreme value compresses all others near 0. StandardScaler is less severely affected: an outlier shifts the mean and inflates the std, shrinking the other z-scores, but the output is not pinned to a fixed range set by the outlier.
  • Neither scaler changes the distribution shape — a right-skewed distribution remains right-skewed after scaling. Scaling is a linear transformation.
  • Scale only continuous numeric feature columns — never binary dummy columns, target variables, or ID columns. Scaling dummy columns destroys their binary interpretation.