Some workloads on M1 absolutely smash other ARM processors, in part because of M1's special-purpose hardware. In particular, the undocumented AMX matrix coprocessor is really nice for distance matrix calculations, vector search, embeddings, etc.
Non-scientific example: for inference, whisper.cpp links against Accelerate.framework to do fast matrix multiplies. On M1, one configuration gets ~6x realtime speed, but on a very beefy AWS Graviton processor, the same configuration only achieves 0.5x realtime, even after choosing an optimal thread count and linking against a NEON-optimized BLAS. (Maybe I'm doing something wrong though.)