
While Tokasaurus’s Async-TP shows impressive throughput gains, it seems over-engineered for common use cases. The CPU overhead from async tensor parallelism only pays off at 6k+ token batches, and you need NVLink-connected GPUs to see real benefits. Most prod deployments don’t need this complexity — you’re better off with simpler approaches unless you’re specifically optimizing for massive batch throughput. The adaptive manager skipping “optional” tasks under load also feels concerning from a reliability perspective.


Depends on what production means for you. This is useful for batch production jobs.

Also, this seems very useful for generating synthetic data or labelling a bunch of data. A 6k batch size is small for data labelling; a rough sketch of what such a job looks like is below.
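
For a sense of scale, here's a minimal sketch of a batch-labelling job against an OpenAI-compatible endpoint. The URL, model name, prompt, and worker count are placeholders, not anything Tokasaurus-specific:

    # Minimal batch-labelling sketch against an OpenAI-compatible server.
    # Endpoint, model name, and prompt are placeholders, not Tokasaurus-specific.
    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    LABELS = ["positive", "negative", "neutral"]

    def label(text: str) -> str:
        resp = client.chat.completions.create(
            model="my-model",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": f"Classify the sentiment as one of {LABELS}. "
                            "Reply with the label only."},
                {"role": "user", "content": text},
            ],
            temperature=0.0,
            max_tokens=5,
        )
        return resp.choices[0].message.content.strip().lower()

    texts = [f"example document {i}" for i in range(6000)]  # 6k items is a modest batch here

    # Fire requests concurrently and let the server batch them internally.
    with ThreadPoolExecutor(max_workers=64) as pool:
        labels = list(pool.map(label, texts))

At that point the bottleneck is entirely server-side throughput, which is exactly what this kind of engine is optimizing for.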


How big of a use case is synthetic data generation? I'm curious, as I see a lot about it coming from academic projects, but I haven't seen much related to commercial use cases.


Tiny NNs distilled from LLMs can produce some amazing results; I'm surprised it's not more common, tbh.
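
For what it's worth, a rough sketch of the distillation idea: have an LLM label a pile of text (as in the snippet upthread), then train a tiny student on those labels. The encoder and classifier choices here are illustrative, not anyone's specific recipe:

    # Sketch: distill LLM-produced labels into a tiny student model.
    # Assumes `texts` and `labels` already exist from an LLM labelling pass.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast sentence encoder
    X = encoder.encode(texts)                          # (n, 384) float embeddings

    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

    student = LogisticRegression(max_iter=1000)
    student.fit(X_tr, y_tr)
    print("held-out agreement with LLM labels:", student.score(X_te, y_te))

The student is orders of magnitude cheaper to serve than the teacher, which is the whole appeal.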


I agree, there are impressive results. This just came out of Berkeley: https://arxiv.org/abs/2506.04178

But still, I mainly see work in this direction coming from academia.


But surely next year's production deployments will be very different from today's, with different use cases, etc.


Sure. Things change over time. Is there a reason to believe they'd be different in such a way that this would be more useful than in today's landscape? I haven't seen such a forecast myself.



