ML: Normalization

2 minute read

Published:

Normalization makes optimization easier by controlling the scale and distribution of activations, features, weights, or gradients. The right normalization depends on the architecture and batch regime.

Updated: May 24, 2026.

Data Normalization

Before the model, normalize inputs using statistics from the training set only. Common examples:

  • Standardization: subtract mean and divide by standard deviation.
  • Min-max scaling: map values to a fixed range.
  • Image normalization: channel-wise mean and standard deviation.
  • Text normalization: case folding, Unicode normalization, or domain-specific cleanup when appropriate.

Do not leak validation or test statistics into the training pipeline.

BatchNorm

Batch Normalization normalizes activations using batch statistics during training and running estimates during evaluation. It works very well in many convolutional networks.

Limitations:

  • Small batch sizes can make statistics noisy.
  • Train and evaluation modes behave differently.
  • Distributed and gradient-accumulated training need care because the statistical batch may not match the optimization batch.

In-Place Activated BatchNorm

InPlace-ABN combines batch normalization and activation while saving memory. It can be useful for memory-heavy segmentation or dense prediction models. In current practice, memory pressure is also handled with mixed precision, activation checkpointing, fused kernels, and better hardware, so InPlace-ABN is more specialized than BatchNorm or LayerNorm.

GroupNorm

GroupNorm splits channels into groups and normalizes within each group. It does not depend on batch size, which makes it useful for small-batch vision tasks such as detection, segmentation, and high-resolution training.

LayerNorm

LayerNorm normalizes features within each sample. It does not depend on other examples in the batch, so it is stable for sequence models and variable batch sizes.

LayerNorm is standard in Transformers. Modern Transformer variants often use pre-normalization, where the normalization layer is applied before attention or feed-forward blocks.

RMSNorm

RMSNorm is a simplified LayerNorm variant that scales by root mean square without subtracting the mean. It is common in modern LLM architectures because it is cheaper and works well in pre-normalized Transformer blocks.

SpectralNorm

Spectral Normalization constrains a weight matrix by its largest singular value. It is not an activation normalization layer. It is often used when Lipschitz control matters, especially in GAN discriminators and some stability-sensitive models.

Current Practice

As of 2026, the rough defaults are:

  • CNN image classification: BatchNorm or newer architecture-specific norms.
  • Small-batch vision: GroupNorm or LayerNorm-like variants.
  • Transformers and LLMs: LayerNorm or RMSNorm, usually pre-norm.
  • GAN discriminators: SpectralNorm can help stability.
  • Tabular ML: normalize numeric features for linear models and neural networks; tree ensembles usually need less feature scaling.

Further reading: