
Thanks! I should have been clearer -- embeddings are pretty fast (relatively) -- it's inference that's slow (I'm at 5 tokens/second on AKS).


Could you sidestep inference altogether? Just return the top N results by cosine similarity (or full text search) and let the user find what they need?
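
For anyone following along, a minimal numpy sketch of the top-N idea (the names here are placeholders, and it assumes the document embeddings are already computed):

    import numpy as np

    def top_n_by_cosine(query_vec, doc_embeddings, n=10):
        # query_vec: shape (d,), doc_embeddings: shape (num_docs, d)
        # Normalize so a plain dot product equals cosine similarity.
        q = query_vec / np.linalg.norm(query_vec)
        docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
        sims = docs @ q
        # argsort is ascending; take the last n and reverse for best-first order.
        return np.argsort(sims)[-n:][::-1]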

The models on https://ollama.com also work really well on most modern hardware


I'm running Ollama, but it's still slow there (it's actually quite fast on my M2). My working theory is that with standard cloud VMs, memory <-> CPU bandwidth is the bottleneck. I'm looking into vLLM.
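
(For context, vLLM's offline batch API looks roughly like the sketch below; the model name is a placeholder, and vLLM mostly assumes a GPU-backed VM rather than a plain CPU instance.)

    from vllm import LLM, SamplingParams

    # Placeholder model; any Hugging Face-compatible checkpoint works.
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    params = SamplingParams(temperature=0.7, max_tokens=256)

    outputs = llm.generate(["Why does memory bandwidth limit CPU inference?"], params)
    print(outputs[0].outputs[0].text)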

And as to sidestepping inference, I can totally do that. But I think it's so much better to be able to ask the LLM a question, run a vector similarity search to pull relevant content, and then have the LLM summarize this all in a way that answers my question.
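
Roughly this flow, sketched with the ollama Python client (the model names are assumptions, and in practice you'd cache the document embeddings rather than recompute them per query):

    import numpy as np
    import ollama

    def answer(question, docs):
        # Embed the question and the documents.
        q = np.array(ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"])
        d = np.array([ollama.embeddings(model="nomic-embed-text", prompt=doc)["embedding"]
                      for doc in docs])

        # Vector similarity search: pull the 3 most relevant documents.
        sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
        context = "\n\n".join(docs[i] for i in np.argsort(sims)[-3:][::-1])

        # Have the LLM summarize the retrieved content as an answer to the question.
        resp = ollama.chat(model="llama3", messages=[
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ])
        return resp["message"]["content"]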


Oh yeah! What I meant was having Ollama run on the user's machine. Might not work for the use case you're building for, though :)



