Topic 8.9: Feature Scaling
Bringing numeric features onto a common scale using MinMaxScaler and StandardScaler from sklearn.preprocessing
A sports academy evaluates recruits across three tests: a 100-metre sprint time (10–15 seconds, lower is better), a standing jump height (50–120 cm, higher is better), and a grip strength test (20–80 kg, higher is better). To compute a total score by summing all three, the academy faces a problem: the measurements are in completely different units and on different scales. Adding seconds to centimetres to kilograms produces a meaningless total in which the jump height (range of 70 cm) dominates the sprint time (range of 5 seconds) simply because the numbers are larger.
Feature scaling solves this problem. When numeric features are measured in different units and on different scales, combining them directly gives the feature with the largest numeric spread the most influence, regardless of its real importance. Models that compute distances between data points (such as k-nearest neighbours or k-means) or that are trained with gradient descent are especially affected. Scaling brings all features onto a comparable range before they are combined.
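The sports academy problem can be illustrated with a minimal sketch. The three recruits below are hypothetical values chosen to span the ranges quoted above; they are not taken from any dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical recruits: [sprint_seconds, jump_cm, grip_kg]
recruits = np.array([
    [10.0, 50.0, 20.0],
    [12.5, 85.0, 50.0],
    [15.0, 120.0, 80.0],
])

# Spread (max - min) of each raw column: 5 s, 70 cm, 60 kg
raw_ranges = recruits.max(axis=0) - recruits.min(axis=0)
print("Raw ranges:   ", raw_ranges)

# After Min-Max scaling every column spans exactly the same range
scaled = MinMaxScaler().fit_transform(recruits)
scaled_ranges = scaled.max(axis=0) - scaled.min(axis=0)
print("Scaled ranges:", scaled_ranges)
```

Before scaling, the jump column can move a summed total by up to 70 units while the sprint column can move it by at most 5. After scaling, each test can contribute at most 1. Scaling does not fix direction: the sprint time is still "lower is better" and would need to be inverted before it is summed with the other two.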
Feature scaling is placed last in the data preparation pipeline, after deduplication, missing value handling, formatting, and encoding, because it operates on numeric data and depends on the final values in each column. Text columns cannot be scaled, so encoding has to come first. Any earlier cleaning step that changes a column's values (removing duplicates, imputing, correcting entries) also changes the minimum, maximum, mean, and standard deviation that the scaler learns, so those steps should be finished before the scaler is fitted.
Min-Max Normalisation rescales a feature to a fixed range, by default [0, 1]. Every value is transformed using the formula: X_scaled = (X − X_min) / (X_max − X_min). The minimum value becomes exactly 0, the maximum becomes exactly 1, and all other values fall proportionally between them.
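As a quick check of the formula, the sketch below applies it by hand with NumPy and compares the result with MinMaxScaler. The four scores are hypothetical illustration values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical scores (illustration only)
scores = np.array([45.0, 60.0, 72.0, 99.0])

# Manual Min-Max formula: (X - X_min) / (X_max - X_min)
manual = (scores - scores.min()) / (scores.max() - scores.min())

# MinMaxScaler expects a 2-D array with one column per feature
from_sklearn = MinMaxScaler().fit_transform(scores.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, from_sklearn))
```

The value 72 sits exactly halfway between 45 and 99, so it maps to 0.5 under both the manual formula and the scaler.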
In the W8 dataset, the score column (range 45–99) is a natural candidate for Min-Max scaling: it has a bounded range and no severe outliers that would compress the scaled values.
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv('../Datasets/8_4_competition_results.csv')

    # Select numeric columns to scale
    numeric_cols = ['age', 'score']

    scaler = MinMaxScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    print(df[numeric_cols].describe().round(3))
Both age and score are scaled in a single MinMaxScaler call by passing both column names in numeric_cols. Each column is scaled independently: its minimum maps to exactly 0.000 and its maximum to exactly 1.000, as the min and max rows of the describe() output confirm. The count row differs between the columns (192 for age, 197 for score) because both still contain missing values at this point in the notebook. Scaling does not require imputation first: MinMaxScaler ignores NaN entries when computing the minimum and maximum and leaves them as NaN in the output.
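The NaN behaviour described above can be seen on a small hypothetical frame (the values are illustrative, not from the W8 dataset). This sketch assumes a reasonably recent scikit-learn release, where the scalers pass missing values through.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame with one missing value per column
df_demo = pd.DataFrame({
    "age": [15.0, np.nan, 17.0, 19.0],
    "score": [45.0, 72.0, np.nan, 99.0],
})

scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df_demo),
    columns=df_demo.columns,
)

print(scaled)
print(scaled.count())  # non-missing values per column
```

The missing entries stay missing, and the remaining values are scaled using the minimum and maximum of the non-missing entries only.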
Z-Score Standardisation rescales a feature so that it has a mean of 0 and a standard deviation of 1. Every value is transformed using: X_scaled = (X − μ) / σ, where μ is the column mean and σ is the standard deviation. Positive scaled values are above average; negative values are below average.
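The z-score formula can be checked by hand in the same way. The ages below are hypothetical illustration values; note that NumPy's default std() is the population standard deviation, which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages (illustration only)
ages = np.array([15.0, 16.0, 17.0, 18.0, 19.0])

mu = ages.mean()    # column mean
sigma = ages.std()  # population standard deviation (ddof=0)

# Manual z-score formula: (X - mu) / sigma
manual_z = (ages - mu) / sigma

from_sklearn_z = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

print(manual_z.round(3))
print(np.allclose(manual_z, from_sklearn_z))
```

The middle value (17, equal to the mean) maps to 0, and values on either side of it come out negative or positive.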
In the W8 dataset, the age column is well-suited for StandardScaler: the values (15–19) are already in a narrow bounded range, and standardisation will allow the model to compare age relative to the group mean rather than in absolute years.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv('../Datasets/8_4_competition_results.csv')

    numeric_cols = ['age', 'score']

    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    print(df[numeric_cols].describe().round(3))
After standardisation, both age and score have a mean of (essentially) 0 and a standard deviation of (essentially) 1. This is the defining property of StandardScaler, applied independently to each column. Unlike MinMaxScaler, the output is unbounded: age ranges from about −1.58 to +2.74 standard deviations from its mean, and score ranges from about −2.91 to +1.82. A value's sign and magnitude now express how far it sits from that column's own average, in standard-deviation units, no longer in years or raw points.
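One reason the standard deviation printed by describe() is only "essentially" 1: StandardScaler divides by the population standard deviation (ddof=0), while pandas reports the sample standard deviation (ddof=1) by default. The sketch below shows the gap on five hypothetical values; on a column with a couple of hundred rows the difference is far smaller.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical ages (illustration only)
values = pd.DataFrame({"age": [15.0, 16.0, 17.0, 18.0, 19.0]})

scaled_age = pd.Series(StandardScaler().fit_transform(values).ravel())

pop_std = scaled_age.std(ddof=0)     # what StandardScaler normalises to 1
sample_std = scaled_age.std(ddof=1)  # what describe() reports

print(f"population std: {pop_std:.4f}")
print(f"sample std:     {sample_std:.4f}")
```

With n rows, the sample standard deviation of a standardised column is sqrt(n / (n − 1)), which approaches 1 as n grows.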
The choice between MinMaxScaler and StandardScaler depends on the feature's distribution, the presence of outliers, and the algorithm that will receive the scaled data. Neither is universally superior; each is designed for different circumstances.
| Property | MinMaxScaler | StandardScaler |
|---|---|---|
| Output range | [0, 1]: all values bounded | Unbounded: no fixed minimum or maximum |
| Formula | (X − min) / (max − min) | (X − mean) / std |
| Outlier sensitivity | High: one outlier sets the min or max and squeezes all other values into a narrow band | Moderate: outliers shift the mean and inflate the std, but the output is not forced into a fixed range |
| Best for | Bounded features with no outliers (exam scores, percentages) | Features without natural bounds or with mild outliers, and algorithms that expect centred inputs (age, income) |
| Preserves distribution shape? | Yes: same shape, rescaled | Yes: same shape, recentred |
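The outlier row of the table can be explored with a small sketch that shows how a single extreme value affects each scaler's output. The five values are hypothetical, with 1000 playing the role of the outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical column: four ordinary values and one extreme outlier
values = np.array([50.0, 60.0, 70.0, 80.0, 1000.0]).reshape(-1, 1)

mm_out = MinMaxScaler().fit_transform(values).ravel()
std_out = StandardScaler().fit_transform(values).ravel()

print("MinMaxScaler:  ", mm_out.round(3))
print("StandardScaler:", std_out.round(3))

# Width of the band occupied by the four ordinary values
print("Min-Max band width: ", round(mm_out[3] - mm_out[0], 3))
print("Z-score band width: ", round(std_out[3] - std_out[0], 3))
```

Under MinMaxScaler the four ordinary values occupy only a few percent of the [0, 1] range, because the outlier defines the maximum. Under StandardScaler the outlier inflates the standard deviation, so the ordinary values also end up close together, but the outlier itself is free to sit far from the rest rather than pinning the top of a fixed range.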
A direct comparison of the two scalers' output ranges shows the practical difference: MinMaxScaler bounds everything to [0, 1], while StandardScaler is unbounded and centred at 0. Re-running each scaler (as in the previous two sections) and comparing the describe() output side by side makes this concrete:
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    numeric_cols = ['age', 'score']

    # MinMaxScaler result
    df_mm = pd.read_csv('../Datasets/8_4_competition_results.csv')
    df_mm[numeric_cols] = MinMaxScaler().fit_transform(df_mm[numeric_cols])

    # StandardScaler result
    df_std = pd.read_csv('../Datasets/8_4_competition_results.csv')
    df_std[numeric_cols] = StandardScaler().fit_transform(df_std[numeric_cols])

    print("MinMaxScaler (bounded to [0, 1]):")
    print(df_mm[numeric_cols].describe().round(3))
    print("\nStandardScaler (centred at 0, unbounded):")
    print(df_std[numeric_cols].describe().round(3))
Key observations. With MinMaxScaler: every column's min is exactly 0.000 and max is exactly 1.000, as the min and max rows show. With StandardScaler: every column's mean is (essentially) 0.000 and std is (essentially) 1.000, as the mean and std rows show. The relative ordering of values within each column is identical between the two methods; only the numeric scale and reference point differ.
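The claim about identical ordering can be checked directly. In this sketch (hypothetical, unsorted values) both scaled versions sort in the same order as the raw data, and the two outputs are perfectly linearly related.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical, deliberately unsorted scores
raw_scores = np.array([45.0, 99.0, 60.0, 72.0]).reshape(-1, 1)

mm_scores = MinMaxScaler().fit_transform(raw_scores).ravel()
z_scores = StandardScaler().fit_transform(raw_scores).ravel()

# argsort returns the row order from smallest to largest value
order_raw = np.argsort(raw_scores.ravel())
order_mm = np.argsort(mm_scores)
order_z = np.argsort(z_scores)

print(order_raw, order_mm, order_z)

# Correlation between the two scaled versions
corr = np.corrcoef(mm_scores, z_scores)[0, 1]
print(round(corr, 6))
```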
With both scalers demonstrated, a final check applies each scaler to its designated column on a fresh load of the W8 dataset (StandardScaler for age, MinMaxScaler for score) and summarises the result for both features.
    # Summary comparison
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    df = pd.read_csv('../Datasets/8_4_competition_results.csv')

    # Standardise age (z-score); ravel() flattens the (n, 1) result into a 1-D column
    scaler_std = StandardScaler()
    df['age_scaled'] = scaler_std.fit_transform(df[['age']]).ravel()

    # Normalise score to [0, 1]
    scaler_mm = MinMaxScaler()
    df['score_scaled'] = scaler_mm.fit_transform(df[['score']]).ravel()

    print("Feature scaling results summary:")
    print(f"{'Feature':<15} {'Original Min':>12} {'Original Max':>12} {'Scaled Min':>10} {'Scaled Max':>10} {'Method'}")
    print(f"{'score':<15} {df['score'].min():>12.1f} {df['score'].max():>12.1f} {df['score_scaled'].min():>10.4f} {df['score_scaled'].max():>10.4f} MinMaxScaler")
    print(f"{'age':<15} {df['age'].min():>12.1f} {df['age'].max():>12.1f} {df['age_scaled'].min():>10.4f} {df['age_scaled'].max():>10.4f} StandardScaler")
The scaling results confirm the expected behaviour. Score: min=0.0000, max=1.0000, bounded exactly to [0, 1]. Age: min=−1.8729, max=+1.5169, unbounded and centred near 0 with spread determined by the standard deviation. The two scalers produce fundamentally different output ranges, which is why the choice of scaler must match the algorithm's requirements.
Across Topics 8.5 through 8.9, the W8 dataset can be progressively cleaned and transformed: duplicates removed (207 → 200 rows), missing values imputed (score: median, city: mode), city spelling standardised (7 spellings → 3 canonical cities), dates parsed (object → datetime64), level label-encoded with sklearn's LabelEncoder (checking le.classes_ for correct order), city one-hot encoded with pd.get_dummies(..., drop_first=True), score normalised (MinMax: [0,1]), and age standardised (Z-score: mean=0, std=1). Each notebook in this sequence demonstrates its technique independently on a fresh load of the same underlying dataset; chaining all the steps together in one pipeline, as sketched here, produces a fully analysis-ready dataset for machine learning.
Neither MinMaxScaler nor StandardScaler changes the shape of a distribution. A right-skewed score distribution remains right-skewed after scaling; it is just expressed on a different numeric range. Scaling is a linear transformation: it rescales and recentres, but it cannot convert a non-normal distribution into a normal one. If a more symmetric, closer-to-normal shape is required, a non-linear transformation such as a log transform is needed, and it should be applied before scaling, because the logarithm is undefined for the zero and negative values that scaling produces.
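A minimal sketch of this point, using synthetic right-skewed data (an exponential sample generated here purely for illustration) and pandas' skew() as the shape measure:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic right-skewed, strictly positive data (illustration only)
rng = np.random.default_rng(0)
raw = pd.Series(rng.exponential(scale=10.0, size=1000))
raw_2d = raw.to_numpy().reshape(-1, 1)

mm_scaled = pd.Series(MinMaxScaler().fit_transform(raw_2d).ravel())
z_scaled = pd.Series(StandardScaler().fit_transform(raw_2d).ravel())

# A non-linear transform, applied to the raw values before any scaling
logged = np.log1p(raw)

print("skew raw:     ", round(raw.skew(), 3))
print("skew min-max: ", round(mm_scaled.skew(), 3))
print("skew z-score: ", round(z_scaled.skew(), 3))
print("skew log1p:   ", round(logged.skew(), 3))
```

The skewness is the same before and after either scaler, because both are linear transformations. Only the log transform changes the shape. np.log1p is used here as one common choice for non-negative data; it is an assumption of this sketch, not a step from the W8 notebooks.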
Scaling should only be applied to continuous numeric features that are inputs to a model. Do NOT scale: binary dummy columns (already in [0, 1] and meaningfully interpretable as 0/1); target/label columns (must remain in the original scale for result interpretation); or ID columns (not modelling features). Applying StandardScaler to a dummy column would convert 0 and 1 into values such as −1 and +1 (the exact values depend on the proportion of ones in the column), destroying the binary interpretation.
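One simple way to follow this rule is to keep an explicit list of the continuous columns and pass only those to the scaler. The frame and column names below (id, score, city_B) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical frame: an ID, one continuous feature, one one-hot dummy
df_demo = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "score": [45.0, 60.0, 72.0, 99.0],
    "city_B": [0, 1, 0, 1],
})

# Scale only the columns listed here; everything else is left untouched
continuous_cols = ["score"]
df_scaled = df_demo.copy()
df_scaled[continuous_cols] = MinMaxScaler().fit_transform(df_demo[continuous_cols])
print(df_scaled)

# For contrast: what StandardScaler would do to the dummy column
dummy_z = StandardScaler().fit_transform(df_demo[["city_B"]]).ravel()
print(dummy_z)  # no longer 0/1
```

The id and city_B columns keep their original values, while the standardised dummy is no longer readable as "absent/present".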
Scaling compresses or shifts values; it does not remove outliers. After MinMaxScaler, an outlier that was 10 times the IQR above Q3 is still there, expressed as 1.0 (or very close to it), while all other values are compressed near 0. After StandardScaler, an outlier with a z-score of 50 is still present, now expressed as 50 rather than in the original units. Outlier investigation (Topic 8.3) must happen before scaling; scaling cannot be relied on to handle outliers.
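To see that scaling leaves outliers in place, the sketch below flags outliers with the common 1.5 × IQR rule before and after each scaler. The helper iqr_outlier_mask and the data are hypothetical, written for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def iqr_outlier_mask(x):
    """Return True for values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Hypothetical column with one extreme value at the end
raw_values = np.array([50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 1000.0])
raw_col = raw_values.reshape(-1, 1)

mm_values = MinMaxScaler().fit_transform(raw_col).ravel()
z_values = StandardScaler().fit_transform(raw_col).ravel()

mask_raw = iqr_outlier_mask(raw_values)
mask_mm = iqr_outlier_mask(mm_values)
mask_z = iqr_outlier_mask(z_values)

print(mask_raw)
print(mask_mm)
print(mask_z)
```

The same row is flagged in all three versions: a linear rescaling moves the quartiles and the outlier together, so the outlier is exactly as extreme after scaling as it was before.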
- Min-Max Normalisation (MinMaxScaler) rescales features to [0, 1] using (X − min) / (max − min). The minimum maps to 0, the maximum to 1, all others proportionally between.
- Z-Score Standardisation (StandardScaler) rescales features to mean = 0, std = 1 using (X − mean) / std. Positive values are above average; negative values are below average.
- In the W8 dataset: score is scaled with MinMaxScaler (bounded range, no outliers), and age is scaled with StandardScaler (narrow range, relative deviation is meaningful).
- MinMaxScaler is sensitive to outliers: one extreme value compresses all others near 0. StandardScaler is less severely affected: outliers shift the mean and inflate the std, but the output is not forced into a fixed range.
- Neither scaler changes the distribution shape: a right-skewed distribution remains right-skewed after scaling. Scaling is a linear transformation.
- Scale only continuous numeric feature columns, never binary dummy columns, target variables, or ID columns. Scaling dummy columns destroys their binary interpretation.
- Dataset: 8_4_competition_results.csv
  The W8 competition dataset used in the code examples above
- Notebook: 8_9_Feature_Scaling.ipynb
  Follow-along notebook for this topic's scaling steps
- scikit-learn Documentation: MinMaxScaler
  https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
- scikit-learn Documentation: StandardScaler
  https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- GeeksforGeeks: Feature Scaling, Normalization and Standardization
  https://www.geeksforgeeks.org/machine-learning/feature-engineering-scaling-normalization-and-standardization/