https://arxiv.org/pdf/2503.10622
From the paper: "In recent years, novel architectures often seek to replace attention or convolution layers (Tolstikhin et al., 2021; Gu and Dao, 2023; Sun et al., 2024; Feng et al., 2024), but almost always retain the normalization layers."
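The replacement the paper proposes is Dynamic Tanh (DyT): an element-wise tanh(αx) with a learnable scalar α, followed by the usual per-channel affine, instead of any normalization. A minimal PyTorch sketch (module layout and the α init value here are my assumptions, not the paper's reference code):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: drop-in stand-in for LayerNorm/RMSNorm (sketch)."""

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))            # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))            # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No per-token statistics, no reductions: just squash and rescale.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```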
In practice, though, DyT came out slower than RMSNorm due to lack of SIMD support:
https://twitter.com/ngxson/status/1901050558246515007?t=LUGasEEQbhpYZGA2dtuVWw&s=19
Batch Normalization → Layer Normalization → RMSNorm
Each step in that progression reduces the compute and memory bandwidth needed at inference (sketch below).
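For intuition on why each step is cheaper: BatchNorm depends on batch statistics (running mean/variance), LayerNorm computes two per-token statistics (mean and variance) plus scale and shift, and RMSNorm drops the mean subtraction and the shift, keeping only the root mean square. A rough PyTorch reference sketch (function names, eps, and signatures are mine, not from any particular library):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-6):
    # Two reductions per token: mean and variance, then scale + shift.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    # One reduction per token: root mean square; no centering, no bias.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gamma * x / rms
```

DyT (sketched above) goes one step further still: no reduction at all, just an element-wise op plus the affine.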