
Can you say a bit more about evals and your approach?

High level, the approach is:

- I'm pain point driven:

  - I can't compare what I can't measure. 

  - I can't trust this "AI" tool to run on its own.
- That's automation, which is about intentionality (can I describe what I want?) and understanding the risk profile (what's the blast radius / worst that could happen?).

Then I treat it as if it were an Integration Test / Test-Driven Development exercise of sorts.

- I don't start designing an entire cloud infrastructure.

- I make sure the "agent" lives where the users actually live, so that it can be the equivalent of an extra paid pair of hands.

- I ask questions or replicate user stories, and use deterministic tests wherever I can. Don't just reach for LLM-as-a-judge (LLMaaJ). What's the simplest check you can think of?

- The important thing is rapid iteration and control. Just like in unit testing, it's not about writing 100 tests but about writing the ones that qualitatively allow you to move as fast as possible.

- At this stage, where the space is moving so fast and we're learning so much, don't assume or over-optimize places that don't hurt. Instead aim for minimalism, ease of change, parameterization, and ease of comparison, both between the components that form "the black box" and against itself over time.

- Once you have the benchmarks you want, you can make decisions like picking the cheapest model/agent configuration that does the job within an acceptable timeframe.
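To illustrate the deterministic-tests point above: a minimal sketch of checking properties of an agent's output directly instead of asking another LLM to judge it. `run_agent` and the invoice scenario are hypothetical stand-ins, not anything from the comment.

```python
def run_agent(prompt: str) -> str:
    # Placeholder: in practice this calls your actual agent/model.
    return "The invoice total is $42.00, due 2024-06-01."

def eval_invoice_question() -> dict:
    # Deterministic checks: cheap, fast, and unambiguous to compare across runs.
    answer = run_agent("What is the total on invoice #123 and when is it due?")
    return {
        "mentions_total": "$42.00" in answer,
        "mentions_due_date": "2024-06-01" in answer,
        "reasonable_length": len(answer) < 500,
    }

results = eval_invoice_question()
assert all(results.values()), results
```

Substring and length checks like these won't cover everything, but they give you a fast, repeatable baseline before you pay for a judge model.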
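The last bullet's selection step can be sketched as a simple filter-then-minimize over benchmark results. All configuration names, numbers, and thresholds here are made up for illustration.

```python
# Each entry: one agent configuration with its benchmark results (illustrative values).
benchmarks = [
    {"config": "big-model",   "pass_rate": 0.98, "p95_latency_s": 12.0, "cost_per_run": 0.050},
    {"config": "mid-model",   "pass_rate": 0.95, "p95_latency_s": 6.0,  "cost_per_run": 0.012},
    {"config": "small-model", "pass_rate": 0.80, "p95_latency_s": 2.0,  "cost_per_run": 0.002},
]

def pick_config(results, min_pass_rate=0.9, max_latency_s=10.0):
    # Keep only configurations that meet the quality and latency bars,
    # then take the cheapest of those.
    acceptable = [
        r for r in results
        if r["pass_rate"] >= min_pass_rate and r["p95_latency_s"] <= max_latency_s
    ]
    return min(acceptable, key=lambda r: r["cost_per_run"]) if acceptable else None

best = pick_config(benchmarks)
print(best["config"])  # the cheapest configuration that clears both thresholds
```

With these made-up numbers, the big model is disqualified on latency and the small one on pass rate, so the mid-size configuration wins on cost among the acceptable ones.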

Happy to go deeper on these. I have some practical/runnable samples and writing I can share on the topic after the weekend. I'll drop a link here when it's ready.



