
The SGLang and vLLM numbers are with CUDA graphs enabled.

Having said that, a 1B model is an extreme example, hence the 1.5x speedup. For regular model sizes and batch sizes, this would probably buy you only a few percent.
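
To make the intuition concrete, here is a rough back-of-the-envelope sketch. It assumes the gain comes from removing a fixed per-step overhead that does not grow with model size. That is my assumption for illustration, and all the millisecond figures below are made up, not measurements.

```python
def speedup(compute_ms: float, overhead_ms: float, overhead_after_ms: float = 0.0) -> float:
    """Ratio of step time before vs. after cutting a fixed per-step overhead."""
    before = compute_ms + overhead_ms
    after = compute_ms + overhead_after_ms
    return before / after

# Hypothetical numbers: the same 0.5 ms of fixed overhead per decode step.
small_model = speedup(compute_ms=1.0, overhead_ms=0.5)   # tiny model, little compute per step
large_model = speedup(compute_ms=20.0, overhead_ms=0.5)  # bigger model or batch, compute dominates

print(f"small: {small_model:.2f}x, large: {large_model:.3f}x")
```

With these made-up inputs the small case comes out at 1.5x and the large case at roughly 2.5%, which is the shape of the argument: the same fixed saving matters less as compute per step grows.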


