mmoskal 8 months ago | on: Look Ma, No Bubbles: Designing a Low-Latency Megak...

The SGLang and vLLM numbers are with CUDA graphs enabled. Having said that, a 1B model is an extreme example, hence the 1.5x speedup. For regular models and batch sizes this would probably buy you a few percent.
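Here is a rough back-of-envelope for why the gain shrinks. Assume the megakernel removes a roughly fixed per-step overhead (the bubbles between kernels that remain even with CUDA graphs) and leaves the actual compute and memory work unchanged. A tiny model at small batch does little work per step, so that overhead is a big slice of step time. A bigger model or batch does much more work, so the same overhead barely registers. The overhead and work figures below are made up for illustration and are not measurements from the post.

```python
def speedup(overhead_ms: float, work_ms: float) -> float:
    """Speedup from eliminating a fixed per-step overhead.

    Hypothetical model: step time = work + overhead before, and just
    work after (overhead fully removed, work unchanged).
    """
    return (work_ms + overhead_ms) / work_ms


# Hypothetical numbers: the same fixed overhead, very different amounts of work.
OVERHEAD_MS = 1.0
cases = [("1B model, small batch", 2.0), ("larger model / batch", 40.0)]

for label, work_ms in cases:
    print(f"{label}: {speedup(OVERHEAD_MS, work_ms):.3f}x")
```

With these made-up inputs the small case comes out at 1.5x and the large one at about 1.025x, i.e. a few percent.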