Skills, Memory, or Fine-Tuning? The Engineering Loop Behind Self-Improving Agents

Abstract

As agents become mainstream, every team wants to improve theirs, either by making fewer mistakes on existing tasks or by taking on harder ones. This work usually starts after the agent is already deployed in production. A team sees a bad trace and ends up asking the same question:

Should the team update the prompt, add a skill, change memory, fine-tune a model, rewrite a tool, add an eval, or rethink the architecture?

This talk presents practical heuristics for improving production agents safely and repeatedly and for building the kind of self-improving agents teams are now after. It focuses on two core questions: how do you know the agent actually got better, and what part of the agent should you update when something goes wrong? We'll cover failure attribution, scalable vs. one-off fixes, overfitting to individual traces, regression prevention, and how teams can build a manual improvement loop that turns agent failures into durable system improvements.

Main Takeaways

  1. How to tell whether an agent actually improved, rather than just performing better on a single failure case.

  2. How to identify which part of an agent system should change: prompt, skill, memory, model (via fine-tuning), tool, eval, workflow, or architecture.

  3. How to distinguish scalable improvements from brittle one-off patches that create future maintenance problems.
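
As a rough illustration of the first takeaway, the sketch below compares a baseline agent and a candidate agent across a whole eval suite instead of a single trace. It is a minimal sketch under stated assumptions, not the speaker's method: the names `compare_runs` and `Comparison`, the case IDs, and the pass/fail results are all hypothetical, and the rule "no regressions and a higher pass rate" is just one possible acceptance criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    fixed: list[str]          # cases that failed before and pass now
    regressed: list[str]      # cases that passed before and fail now
    pass_rate_before: float
    pass_rate_after: float

    @property
    def improved(self) -> bool:
        # Assumed acceptance rule: nothing regressed and the pass rate went up.
        return not self.regressed and self.pass_rate_after > self.pass_rate_before


def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> Comparison:
    """Compare pass/fail results of two agent versions on the same eval cases."""
    if not before or before.keys() != after.keys():
        raise ValueError("both runs must cover the same non-empty set of eval cases")
    fixed = sorted(c for c in before if not before[c] and after[c])
    regressed = sorted(c for c in before if before[c] and not after[c])
    total = len(before)
    return Comparison(
        fixed=fixed,
        regressed=regressed,
        pass_rate_before=sum(before.values()) / total,
        pass_rate_after=sum(after.values()) / total,
    )


# Hypothetical results: the patch fixes the bad trace but breaks another case.
baseline = {"refund_lookup": True, "address_change": True, "bad_trace_1042": False}
candidate = {"refund_lookup": False, "address_change": True, "bad_trace_1042": True}

result = compare_runs(baseline, candidate)
print(result.fixed, result.regressed, result.improved)
```

In this made-up run the candidate fixes the original failure but regresses a case that used to pass, so the overall pass rate is unchanged and the change would not count as an improvement under the assumed rule.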


Speaker

Abhinav Sinha

CEO @Lucidic AI, Previously @Stanford AI Lab, @Citadel and Susquehanna International Group, and @Apple

Abhinav (BS, MS Stanford CS, AI specialization) is the CEO of Lucidic AI and has worked in reinforcement learning since the GPT-2 era. Before founding Lucidic, he conducted research at the Stanford AI Lab, explored deep learning–based trading strategies as a quant at Citadel and Susquehanna International Group, and worked on distributed systems engineering on the Find My team at Apple.

From the same track

Session

Progressive Failure Modes of Modern AI Serving Systems

Inference platforms fail in layers. Most organizations focus on model quality while underestimating the systems engineering required to operate production AI workloads safely and reliably at scale.

Abi Aryan

AI Infrastructure Engineer and Educator

Session

The Revenge of the Data Scientist: Why Reliable AI Needs Evals, Traces, and Metrics

Most teams can now ship an AI prototype by calling a foundation-model API. The hard part is knowing whether that system works when real users, messy data, and business consequences arrive.

Hamel Husain

Machine Learning Engineer, 20+ Years in Applied AI, Machine Learning, and Data Science