Abstract
As agents become mainstream, teams want to improve theirs, either by making fewer mistakes on existing tasks or by taking on harder ones. This work usually starts once an agent is already deployed in production. A team trying to make an existing agent system better sees a bad trace and ends up asking the same question:
Should the team update the prompt, add a skill, change memory, fine-tune a model, rewrite a tool, add an eval, or rethink the architecture?
This talk presents practical heuristics for improving production agents safely and repeatedly, and for building the kind of self-improving agents teams are now after. It focuses on two core questions: how do you know the agent actually got better, and what part of the agent should you update when something goes wrong? We'll cover failure attribution, scalable vs. one-off fixes, overfitting to individual traces, regression prevention, and how teams can build a manual improvement loop that turns agent failures into durable system improvements.
Main Takeaways
How to tell whether an agent actually improved, rather than just performing better on a single failure case.
How to identify which part of an agent system should change: prompt, skill, memory, model (via fine-tuning), tool, eval, workflow, or architecture.
How to distinguish scalable improvements from brittle one-off patches that create future maintenance problems.
Speaker
Abhinav Sinha
CEO @Lucidic AI. Previously @Stanford AI Lab, @Citadel, @Susquehanna International Group, and @Apple
Abhinav (BS, MS Stanford CS, AI specialization) is the CEO of Lucidic AI and has worked in reinforcement learning since the GPT-2 era. Before founding Lucidic, he conducted research at the Stanford AI Lab, explored deep learning–based trading strategies as a quant at Citadel and Susquehanna International Group, and worked on distributed systems engineering on the Find My team at Apple.