One example is that there used to be a whole complex apparatus around getting models to do chain of thought reasoning, e.g., LangChain. Now that is built in as reasoning and they are heavily trained to do it. Same with structured outputs and tool calls: you used to have to do a bunch of stuff to get models to produce valid JSON in the shape you want; now it's built in and, again, they are specifically trained for it. It used to be that you would have to go find all relevant context up front and give it to the model. Now agent loops can dynamically figure out what they need and make the tool calls to retrieve it. Etc etc.
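To make the structured-output point concrete, here is a minimal sketch, assuming the OpenAI Python SDK and its json_schema response format; the model name and schema are illustrative, not prescriptive:

    # Minimal sketch: ask for output constrained to a JSON schema instead of
    # prompt-engineering for valid JSON. Model name and schema are illustrative.
    from openai import OpenAI

    client = OpenAI()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": "Extract the invoice total and currency."}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "invoice",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "total": {"type": "number"},
                        "currency": {"type": "string"},
                    },
                    "required": ["total", "currency"],
                    "additionalProperties": False,
                },
            },
        },
    )
    # The message content is constrained to parse against the schema above.
    print(resp.choices[0].message.content)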
LangChain generally felt pointless for me to use, not a good abstraction. If anything, it keeps you away from the most important thing you need in this fast-evolving ecosystem: a direct, prompt-level (if you can even call that low level) understanding of what is going on.
For JSON I agree: now I can just mention JSON and provide examples and the response always comes back in the right format. But for tool calling and information retrieval I have never seen a system that actually works, nor have they worked in my own tests.
Now, I'm open to the idea that I'm just using it wrong, but I've seen several reports around the web that the best tool-calling accuracy people get is around 80%, which is unusable for any production system. For information retrieval, I've also seen models lose coherence the more data is available overall.
Is there a model that has actually achieved 100% tool-calling accuracy?
So far I've had to build those systems myself, surrounding the LLM, and only that way has it worked well in production.
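For illustration, here is a minimal sketch of the kind of validation-and-retry guardrail one might wrap around tool calls; the schema, helper names, and retry policy are hypothetical, not a description of the system above:

    # Sketch of a guardrail around LLM tool calls: validate the model's
    # arguments against a JSON schema and retry on failure. All names here
    # (schema, helper, retry count) are hypothetical.
    import json
    from jsonschema import ValidationError, validate

    SEARCH_ARGS_SCHEMA = {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer", "minimum": 1},
        },
        "required": ["query"],
        "additionalProperties": False,
    }

    def call_tool_with_validation(get_tool_call, max_retries=3):
        """get_tool_call() asks the LLM for a tool call and returns its raw JSON arguments."""
        for attempt in range(max_retries):
            raw = get_tool_call()
            try:
                args = json.loads(raw)
                validate(args, SEARCH_ARGS_SCHEMA)
                return args  # only validated arguments reach the real tool
            except (json.JSONDecodeError, ValidationError) as err:
                print(f"attempt {attempt + 1} rejected: {err}")
        raise RuntimeError("model never produced valid tool arguments")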