- I can't compare what I can't measure.
- I can't trust this "AI" tool to run on its own.
- That's automation, which comes down to intentionality (can I describe what I want?) and understanding the risk profile (what's the blast radius, i.e. the worst that could happen?).
Then I treat it as an Integration Test/Test-Driven Development exercise of sorts.
- I don't start by designing an entire cloud infrastructure.
- I make sure the "agent" lives where the users actually live, so it can act as the equivalent of an extra paid set of hands.
- I ask questions or replicate user stories, and I use deterministic tests wherever I can. Don't just reach for LLM-as-a-Judge (LLMaaJ). What's the simplest check you can think of? (See the first sketch after this list.)
- The important thing is rapid iteration and control. Just like in unit testing, it's not about writing 100 tests; it's about writing the ones that qualitatively let you move as fast as possible.
- At this stage, when the space is moving so fast and we're learning so much, don't assume or over-optimize in places that don't hurt. Instead, aim for minimalism, ease of change, parameterization, and ease of comparison, both against the other components that form "the black box" and against earlier versions of itself.
- Once you have the benchmarks you want, you can make decisions like picking the cheapest model/agent configuration that does the job within an acceptable timeframe. (See the second sketch below.)
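To make the "deterministic tests, not just LLMaaJ" point concrete, here's a minimal sketch in Python. The `run_agent` function and the JSON fields are hypothetical placeholders for whatever your agent actually returns; the point is that the assertions are plain, cheap structure/string checks you can rerun on every iteration.

```python
# Minimal sketch: deterministic checks on agent output instead of LLM-as-a-Judge.
# `run_agent` and the expected JSON fields are hypothetical placeholders.
import json

def run_agent(prompt: str) -> str:
    # Placeholder: swap in your real agent/tool call.
    # Canned output is returned here only so the sketch runs end to end.
    return json.dumps({
        "action": "create_refund",
        "order_id": "456",
        "reply": "We apologize for the inconvenience; your refund is on its way.",
    })

def test_refund_user_story():
    # Replicate a real user story as the input.
    output = run_agent("Customer #123 wants a refund for order #456.")

    # Structural check: the agent must return valid JSON.
    payload = json.loads(output)

    # Content checks: exact fields and simple substrings, not vibes.
    assert payload["action"] == "create_refund"
    assert payload["order_id"] == "456"
    assert "apolog" in payload["reply"].lower()
```

Run it with pytest: when it fails, the failure is reproducible and debuggable, which is exactly what an LLMaaJ score usually isn't.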
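And once those tests exist, the "pick the cheapest configuration that does the job" decision can be a loop. Everything below (model names, prices, the `run_suite` helper, the time budget) is a made-up assumption for illustration; the shape of the comparison is what matters.

```python
# Minimal sketch: pick the cheapest configuration that passes the test suite
# within an acceptable time budget. All names and numbers are hypothetical.
import time
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    cost_per_run_usd: float  # made-up pricing, for illustration only

def run_suite(config: Config) -> bool:
    # Placeholder: run your deterministic tests against this config.
    # Always passes here so the sketch runs end to end.
    return True

CONFIGS = [
    Config("small-model", 0.001),
    Config("medium-model", 0.01),
    Config("large-model", 0.10),
]

MAX_SECONDS = 30.0  # acceptable wall-clock budget per suite run

def pick_cheapest(configs: list[Config]) -> Config | None:
    # Try candidates from cheapest to most expensive; first pass wins.
    for config in sorted(configs, key=lambda c: c.cost_per_run_usd):
        start = time.monotonic()
        passed = run_suite(config)
        elapsed = time.monotonic() - start
        if passed and elapsed <= MAX_SECONDS:
            return config
    return None

if __name__ == "__main__":
    winner = pick_cheapest(CONFIGS)
    print(f"Cheapest passing config: {winner.name if winner else 'none'}")
```

Swap `run_suite` for your real suite and the canned pass for actual results; the moment you change a prompt or a model, you rerun the same loop and compare like with like.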
Happy to go deeper on any of these. I have some practical/runnable samples and notes I can share on the topic after the weekend. I'll drop a link here when it's ready.