So here are my preliminary benchmarks with my own implementation on an AMD EPYC 9334 32-Core processo. I need to double checks things - so take this with a grain of salt for now. Time is in seconds for 100000 iterations of manorboy(10). So far, the only implementation which clearly sucks is std::function<>. Even trampolines are suprisingly good (but I can imagine that they are much worse on other CPUs / architectures)