
https://arxiv.org/pdf/2503.10622 ("Transformers without Normalization")

In recent years, novel architectures often seek to replace attention or convolution layers (Tolstikhin et al., 2021; Gu and Dao, 2023; Sun et al., 2024; Feng et al., 2024), but they almost always retain the normalization layers.

Benchmarking

DyT is slower than RMSNorm in practice because tanh lacks SIMD support (see the benchmark linked below).

https://twitter.com/ngxson/status/1901050558246515007?t=LUGasEEQbhpYZGA2dtuVWw&s=19
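
A rough feel for the comparison can be sketched with a toy timing harness; this is a minimal sketch in PyTorch, not the linked benchmark, and all shapes, values, and function names here are illustrative.

```python
import time
import torch

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by the reciprocal root-mean-square, no mean subtraction
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def dyt(x, alpha, weight, bias):
    # DyT: elementwise tanh of a learnably scaled input, then an affine transform
    return weight * torch.tanh(alpha * x) + bias

d = 4096
x = torch.randn(8, 1024, d)
w, b = torch.ones(d), torch.zeros(d)
alpha = torch.tensor(0.5)

def bench(fn, *args, iters=20):
    fn(*args)  # warm-up pass
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters

print(f"RMSNorm: {bench(rms_norm, x, w) * 1e3:.2f} ms")
print(f"DyT:     {bench(dyt, x, alpha, w, b) * 1e3:.2f} ms")
```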

Observation: LayerNorm's input-output mapping in trained Transformers looks like a tanh-shaped S-curve → replace it with a different formula that matches the curve directly, DyT(x) = weight · tanh(α · x) + bias.
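
A minimal PyTorch sketch of that replacement, following the formula above; the default α initialization and parameter shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: y = weight * tanh(alpha * x) + bias,
    used as a drop-in replacement for a normalization layer."""

    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar
        self.weight = nn.Parameter(torch.ones(dim))              # per-channel scale
        self.bias = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x):
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# Usage: apply over the channel dimension, like a LayerNorm
y = DyT(768)(torch.randn(2, 16, 768))
```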


History of Normalization

Batch Normalization → Layer Normalization → RMSNorm

Each step improves inference speed and reduces memory-bandwidth usage, as the sketch below shows.
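
The progression is easiest to see side by side: each variant normalizes over simpler statistics. A plain-PyTorch sketch (affine scale/shift omitted, eps values illustrative):

```python
import torch

x = torch.randn(32, 128, 768)  # (batch, tokens, channels); shapes illustrative

# BatchNorm: statistics over batch and token dims, per channel;
# needs running statistics at inference time.
bn_mean = x.mean(dim=(0, 1), keepdim=True)
bn_var = x.var(dim=(0, 1), keepdim=True, unbiased=False)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)

# LayerNorm: statistics over the channel dim, per token;
# no batch dependence, nothing to track at inference.
ln_mean = x.mean(dim=-1, keepdim=True)
ln_var = x.var(dim=-1, keepdim=True, unbiased=False)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)

# RMSNorm: drops the mean entirely; a single statistic, less memory traffic.
x_rms = x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)
```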

Such beauty