The First Principles

The Agentic Engineering Field Guide, Part 4: Where I Land, and Why

Navneet Singh — Mon, 04 May 2026 02:30:56 GMT

Where I Land

For my typical client, I recommend Microsoft Agent Framework on Microsoft Foundry.

The typical client is an enterprise shop. Regulated industry. .NET heavy. Azure already in production. Approval gates required. Audit trails required. Data residency matters. The buyer is a CIO or CTO whose procurement team has a standing Microsoft agreement. That describes most of my healthcare, financial services, and government work. If that also describes your shop, my recommendation is to start there.

This is not an abstract preference. It is the same decision I make with every qualifying client after we have walked through the six questions from Part 1.

State and durability. The workflow graph checkpoints at superstep boundaries. Cosmos DB checkpoint storage shipped in MAF Python 1.0.1. Durable Task integration through the Azure Functions extension handles the day-long and week-long pauses that happen when a human approval takes its time. This is the question that kills most frameworks in production. MAF plus Foundry answers it without ceremony.

Approval gates. Durable pause state that survives application restarts is first class. Not bolted on. Not a hack you build with external queues. The framework holds the pause. The pattern composes with the rest of your workflow.

Observability. OpenTelemetry everywhere. App Insights out of the box. The trace surface maps cleanly to every other Azure service your SRE team already monitors. If your observability story is already Azure Monitor, you do not introduce a new tool.

Identity. Every agent gets a dedicated Microsoft Entra identity with RBAC scoped to the resources it needs. Entra Agent Registry catalogs the deployed agents. This is the first production-grade identity story I have seen in an agent platform. For regulated industries, this alone is worth the choice.

Compliance. Azure carries the broadest compliance portfolio in the cloud market. FedRAMP High via Azure Government. HIPAA. ISO 27001, 27017, 27018, 27701. HITRUST. SOC. PCI. EU Data Boundary. Microsoft Cloud for Sovereignty for the EU sovereign stack. 21Vianet for China. India RBI, IRDAI, MeitY. When procurement asks for the compliance matrix, it already exists.

Model choice. Foundry is model agnostic despite Microsoft's commercial incentives. OpenAI, Anthropic, Meta, Mistral, Cohere, DeepSeek, NVIDIA, Microsoft's own Phi family, plus 1,500+ models through Hugging Face compute. You are not forced into any one model provider. That is the right call for enterprise buyers who do not want to bet on one model lab.

On-prem and air-gapped. Foundry Local is the quiet differentiator nobody else has. C#, JavaScript, Rust, Python SDKs. ONNX Runtime under the hood. OpenAI-compatible API. No Azure subscription required. Same SDK patterns as cloud Foundry. For the healthcare customer who needs to keep data on-premises, or the government customer who needs air-gapped inference, you get a Microsoft-shipped runtime that you can deploy inside the compliance boundary. Neither Vertex nor Bedrock has a first-party on-device counterpart with matching SDK ergonomics.

Framework-agnostic runtime. The quiet competitive move: Foundry Hosted Agents accept LangGraph or arbitrary containerized code, not just MAF. Foundry positions itself as a control plane, not a framework lock-in. So even if your team decides tomorrow that LangGraph is the better fit for a specific workflow, you can deploy it inside Foundry and keep the enterprise spine. That is the option I want on my side when I am betting on a platform.

The procurement story matters more than the feature story sometimes. When your buyer already has a Microsoft Enterprise Agreement, adding Foundry is a line item, not a new vendor evaluation. That shortens the deal cycle by months. The technical merits make the case easy. The procurement reality closes the deal.

When I Do Not Pick It

MAF plus Foundry is my default for enterprise, regulated, Microsoft-shop clients. It is not the right call for every shop. Here is where I go elsewhere.

Python-pure shops with no .NET footprint. If your engineering team has zero .NET expertise, no appetite to learn it, and your existing services are all Python, then LangGraph plus LangSmith is the stronger technical fit. The ecosystem is larger. The graph semantics are more mature. The checkpointer ecosystem is richer (Postgres, Redis, SQLite, in-memory). LangSmith is the most battle-tested tracing product in the space. You get most of the enterprise capability without the .NET surface area you will never use.

GCP-native shops. If your organisation runs on GCP, BigQuery is your data gravity, and Gemini is your preferred model family, then Google ADK plus Vertex AI Agent Engine is cleaner than bolting Azure services onto a Google stack. Cross-cloud integration tax is real. Pick the cloud your infrastructure already lives in.

AWS-native shops with strict cost or latency SLAs. AWS Strands plus Bedrock Agents is the path of least resistance on AWS. Bedrock's service tier controls (Priority, Standard, Flex) that Strands exposes are unique. If your workload has genuine SLA sensitivity on inference cost or latency, this is real. If you are already spending heavily on Bedrock, the economics of staying in that ecosystem make sense.

TypeScript-first product teams. If you ship a web product on Next.js or a Node backend and your engineering team writes TypeScript end to end, Mastra is the serious option. The MAF TypeScript story is thin. LangGraph has TypeScript bindings but Python is the first-class surface. Mastra is built for the Node world and has matured enough to be credible.

Observability-critical early stage startups. If your team is small, observability is the feature that saves you, and LangSmith is going to be your debugging lifeline from day one, then LangGraph plus LangSmith gets you there faster than any other combination. The coupling is tight and the ergonomics are designed around it. You can migrate to a more enterprise stack later when your buyer changes shape.

Pure prototyping speed. For a two-week proof of concept, CrewAI is hard to beat. The "team of agents" metaphor maps onto a slide deck and a demo faster than any other framework. This does not mean I would ship it to production. It means I would not fight the client who wants to prove out the idea in CrewAI first, then re-implement in a more durable stack when the scope is clear.

Coding or developer tools. If you are building a coding agent, a CLI tool, or anything that inherits the shape of Claude Code, the Anthropic Claude Agent SDK gives you the mature tool loop (Read, Write, Edit, Bash, etc.) without rebuilding it. For that narrow use case, it is the fastest path.

Low-code citizen developer agents. If the business user is the builder and the agent lives inside Microsoft 365 or Teams workflows, Copilot Studio is a different product with a different audience. The MAF plus Foundry story is pro-code and developer-centric. Do not force the wrong tool.

Agentforce or ServiceNow as a buy-vs-build choice. If the agent's job is to answer a well-scoped question inside Salesforce or ServiceNow data, and the SaaS vendor has already shipped that agent, buy it. Build only what gives you differentiation. I have talked several clients out of building their own customer service agent because Agentforce already solves 80 percent of their problem at a fraction of the cost of building it well.

Three War Stories

These are archetypal patterns from client work, generalised to protect confidentiality. Each reflects a real build, a real decision point, and something I would do differently with today's tools.

Healthcare contract analysis

A hospital group needed to process executed vendor contracts. Extract key terms. Flag clauses that deviated from their standard template. Surface renewal dates and termination windows. Generate a summary for the procurement team. Twelve documents per week in good weeks. Forty in bad ones.

We built this on a twelve-step graph. Document ingestion, OCR cleanup, section classification, entity extraction, clause comparison against a library of approved standards, risk scoring, flagging, summarization, storage, notification. Each step was an executor. The graph ran in roughly four minutes end to end in the happy path.

The fifth deployment, step nine started failing. The internal API it called had been migrated. We did not know for three hours because the only signal was customer-facing summaries that looked a little worse than usual. When we looked at the checkpoint state, we could see the exact input that was crashing the step. We fixed the integration, resumed every queued workflow from the checkpoint, and the customer never knew there had been a partial outage.

That same workflow, a year earlier on a different framework, would have lost every in-flight contract to a full restart. We would have burned the token budget for the first eight steps on every failed run until we shipped the fix.

The lesson: pick the framework that treats state as a first-class citizen, not as a debugging convenience.

Financial services reconciliation

A mid-market finance firm wanted to reconcile end-of-day trade confirmations across three upstream systems. The inputs disagreed often enough to be a real problem. A human team was spending two hours every evening resolving the discrepancies by hand.

The first architecture we considered was a single agent with many tools. Load this, load that, compare, output the differences. It would have worked for the first month. The tool count for the three upstream systems plus the output destinations was already fourteen. I knew where this ended.

We built it as a hierarchical multi-agent workflow instead. A manager agent owned the reconciliation task. Three specialist agents each owned one upstream system. A fourth specialist owned output and escalation. The specialists did not know about each other. The manager composed their outputs and decided when a discrepancy needed human review versus when the rules made the resolution obvious.

Six months in, a fourth upstream system came into scope. Adding it was one new specialist agent and one new edge in the graph. The other specialists did not change. That is the compound value of composition done right. Monolithic agents do not have this property.

What I would do differently: we under-invested in the eval harness at the start. We shipped with regression tests at the agent level but not at the workflow level. Two months in, the model provider bumped a minor version and the manager's routing accuracy dropped a few points. We caught it in production through a spike in manual escalations. If I were building this today, I would have workflow-level evals in place from day one and alerts on routing accuracy drift.

Government citizen services

A state government wanted to reduce call volume at a busy services line. First-line triage. Answer simple questions. Route complex ones to human agents. Keep the conversation in the citizen's preferred language. Nothing leaves the sovereign cloud.

This was a case where the platform choice mattered more than the framework choice. The non-negotiables were data residency, audit trails that would survive a regulatory review, and the ability to run inference inside the sovereign boundary. Foundry's compliance posture, Entra Agent Registry for agent identity, and Foundry Local for the on-premises inference scenario were the answer.

The agent uses a hand-off pattern. A triage agent reads the incoming request, detects language, identifies intent. Simple queries (where is my application, how do I pay this fee) are answered directly from the internal knowledge base via Foundry IQ. Complex or sensitive queries route to a human agent through the same platform with the conversation history intact.

Observability is where this build paid off. Every agent decision is traced, queryable, and exportable. When the regulator audits, we can show exactly which queries were answered by the agent, which went to humans, and what the conversation looked like. That is the difference between a system you can defend and a system you cannot.

The lesson: for regulated and sovereign scenarios, the platform spine matters more than the framework elegance.

Where This Is All Heading

Twelve month view. What I expect the landscape to look like by April 2027.

Protocols will matter more than frameworks. MCP is already universal. A2A reached v1.0 in March 2026. By this time next year, the lock-in cost of picking the wrong framework will be lower than it is today because you will be able to swap frameworks while keeping your protocol-level integrations. Bet on frameworks that speak protocols fluently. The specific framework matters less every quarter.

On-premises inference will resurge. The first wave of enterprise AI was cloud-first because that is where the capable models lived. Open-weights models are now capable enough for many production workloads. Foundry Local, vLLM, Ollama, and the NVIDIA AI Enterprise stack make on-device and on-premises deployment credible. For regulated industries and data-sensitive workloads, expect a real shift back to on-premises or sovereign cloud over the next year. The vendors that have shipped the tooling for this (Microsoft with Foundry Local, Amazon Bedrock with Outposts, Google with Distributed Cloud) will benefit.

Multi-agent will hit a limit and retreat. The last eighteen months have been a period of maximalist multi-agent design. Five specialists. Seven specialists. Ten specialists. I am already seeing teams retreat from this. The marginal specialist often adds more latency and failure modes than value. Expect the consensus to settle around three to five specialists for a typical production workflow, with a preference for hierarchy over peer collaboration.

Evaluation becomes the bottleneck. Once everyone has durable state, checkpointing, approval gates, and tracing, the remaining differentiator is whether your evals catch regressions before your customers do. Expect eval tooling to become the hottest segment of the agentic infrastructure market in the next year. Expect acquisition activity. Expect major framework vendors to ship first-party eval platforms (some already are).

Memory becomes a category, not a framework feature. Letta, Zep, and new entrants are making memory a first-class database tier that sits alongside your vector store and your Postgres. Frameworks are good at orchestration. They are mediocre at memory. Decay policies, reflection trees, virtual context paging, and sleep-time consolidation are not things you want to maintain inside your agent framework. Expect memory-as-a-service to win a meaningful slice of the agentic infrastructure market by next April, much as vector databases did in 2023. For CTOs, the question in 2026 is still "what memory primitives does my framework ship." In 2027 it will be "which memory vendor do I pick."

Cognitive architectures become pluggable. Decay-aware retrieval, sleep-time compute, reflection trees, skill libraries, and virtual context paging are bespoke builds today. By next April, expect these to ship as framework-native features or pluggable middleware. Teams that invested in building them from scratch this year will rebase onto the shipping versions. Teams that skipped them will get them for free by upgrading. Either group wins relative to teams that never knew the patterns existed.

Self-improving agent systems ship in the human-in-the-loop form. Fully autonomous self-improvement is not arriving in the next twelve months. Production-grade eval-to-training pipelines, trajectory distillation through RFT or open-weight fine-tuning, and prompt optimization feedback loops are all shipping now and will be standard by mid-2027. The differentiator is whether your team has built the pipeline that feeds production traces back into offline training. Most have not. The teams that do compound. The teams that do not ship the same agent they launched with, just older.

Regulatory pressure reshapes the stack. EU AI Act enforcement is already biting. India's DPDP enforcement is arriving. The US is catching up state by state. For regulated industries, the ability to document training data provenance, model versioning, and agent decisions will become a procurement checkbox. Frameworks that cannot answer these questions will get filtered out of enterprise deals regardless of technical merit.

The SaaS agent layer will mature and compete directly with custom builds. Salesforce Agentforce, ServiceNow AI Agents, and the Microsoft 365 Copilot ecosystem are already changing what enterprises build in-house. A year from now, the default recommendation for well-scoped customer service or ITSM agents will be "buy the SaaS layer, build custom only where you differentiate." Custom agent builds will concentrate in the workflows that are genuinely unique to your business.

The model layer will become boring, and that is good. The gap between top labs is narrowing. The gap between top open-weights models and top closed models is narrowing. Pricing is dropping across the board. Prompt caching has changed the economics of long-context workflows. By next April, model selection will feel more like database selection. Important. Reversible. Not the differentiator it was two years ago.

The Thirty Minute CTO Decision Framework

You have a meeting in thirty minutes with your team. You need to pick a direction. Run this.

Question 1: What is your dominant engineering stack?

.NET → Microsoft Agent Framework plus Foundry. Python → LangGraph plus LangSmith, or the provider SDK matching your cloud. TypeScript or Node → Mastra, or LangGraph TypeScript. Java or Go → Google ADK has the best multi-language story.

Question 2: Which cloud, and is multi-cloud required?

Azure-heavy → Foundry. AWS-heavy → Bedrock plus Strands. GCP-heavy → Vertex plus ADK. Multi-cloud required → open framework plus protocols (MCP, A2A). Bet on interop.

Question 3: What regulatory constraints do you have?

HIPAA, FedRAMP, sovereign cloud, data residency → Microsoft Foundry has the broadest compliance coverage. Bedrock is the close second for GovCloud scenarios.

Question 4: How long is your longest workflow?

Under thirty seconds → any framework works. Thirty seconds to five minutes → you need checkpointing. Five minutes to hours → graph-based framework with durable state is required. Hours to days → MAF plus Durable Task, LangGraph plus Postgres checkpointer, or a purpose-built workflow engine (Temporal, Dagster with AI) with an agent layer.

Question 5: Will humans approve before writes?

Yes, and the approval can take hours or days → you need first-class human-in-the-loop with durable pauses. MAF, LangGraph, and Pydantic AI have this. Many others do not.

Question 6: What is your observability maturity?

Already running OpenTelemetry → pick a framework with clean OTel conventions for agents. No observability yet → add LangSmith, Braintrust, or Foundry Observability as a dependency of the build.

Question 7: What is your buy-versus-build reality?

Is this a workflow that differentiates your business, or is it the same customer service, HR, or ITSM flow every enterprise runs? If the latter, evaluate Agentforce or ServiceNow before you build. Build only where you differentiate.

A note on frontier patterns. The seven questions above decide your framework. They do not decide whether to invest in decay-aware memory, sleep-time compute, reflection trees, or trajectory distillation. Those are capability-level decisions, not framework-level. Most teams should ship the basics first and reach for frontier patterns when the baseline is solid and the cost of staleness, context overflow, or stagnant skill is visibly hurting the product. If your agents run fewer than twenty turns per session and each session is independent, you probably need none of them. If they run for months with the same user and cannot afford to rebuild from scratch, every frontier pattern starts to pay. Part 3 walks each one and tells you when it is worth the complexity.

Run these seven questions. In most cases, the answer collapses to one or two frameworks. From there, prototype and ship.

Series Close

This is the end of the first pass. Four parts. The questions, the landscape, the building blocks, and where I land.

The frameworks in this guide will change. Some will fade. Others will emerge. The protocols will mature. The eval tooling will consolidate. The model layer will become boring. That is all fine. The questions in Part 1 are the durable part. The mental model is what you keep.

I will update this guide every quarter. The next revision lands in July. Some parts will change heavily. Part 2 will change the most because the landscape moves fastest. Part 1 will change least because the questions do not age.

If you are building an enterprise agent system this year and you want a second pair of eyes on your architecture, email or find me on LinkedIn. The clients I work with best are the ones who have walked through the six questions, know where they stand, and want to pressure-test the answer. I read every reply.

We are early. The systems we build this year will shape what enterprise AI looks like for the rest of the decade. Worth getting right.

Navneet Singh is the founder and CEO of Webority Technologies. He builds enterprise AI systems for clients in healthcare, financial services, and government, and writes weekly about what actually works.

The Agentic Engineering Field Guide, Part 3: The Building Blocks

Navneet Singh — Fri, 01 May 2026 02:30:12 GMT

Why This Part Matters More Than The Framework Choice

Framework choice is the decision teams obsess over. The building blocks are the decision teams skip. That is backwards.

A team on the best framework with the wrong memory pattern has a bad product. A team on a mediocre framework with the right evaluation harness ships reliably. The patterns in this piece are what separate the agent systems that survive in production from the ones that get quietly shelved after six months.

I have been in the room for a lot of those quiet shelvings. Every one of them failed on a building block, not on the framework. The agent never remembered the user correctly. The retrieval missed the key document. The eval harness did not catch the regression that the customer did. The cost doubled after a model provider changed its pricing and nobody noticed for three weeks.

This piece walks each building block in the order I evaluate them with clients. Memory. RAG. Tools. Context. Safety. Cost. Identity. Planning. For each one: what the pattern is, what the ecosystem ships in April 2026, and the honest failure modes you will hit in production.

Memory

The vocabulary has settled. The CoALA taxonomy from 2024 is now the consensus. Three kinds of memory.

Semantic memory is facts. What the user's name is. What their preferences are. Which account tier they are on. What the contract terms are. The things you would put in a database if your system were not agentic.

Episodic memory is experiences. What the agent did in the last session. What tool calls succeeded and failed. Which responses worked and which got a frustrated reply. The trajectory history.

Procedural memory is instructions. How to handle a particular scenario. Which steps to take for a given task type. This usually ends up encoded in the system prompt, but recent work treats it as something the agent can revise over time.

The second axis is scope. Short-term memory lives inside the context window and is managed by the framework's checkpointer. Long-term memory lives in a namespace that survives sessions and is retrieved on demand. A third term, working memory, now gets used two different ways. In some frameworks it means the same thing as short-term. In Mastra it means a persistent structured profile that is always loaded. When you talk to your team about working memory, be explicit about which meaning you are using. I have seen real architectural disagreements traced to this one word.

The Consolidation Pattern Is The Thing

The pattern that makes long-running agents work is consolidation. A background process reads the episodic log and writes semantic or procedural summaries. Without it, your agents either lose information (short context) or drown in it (context overflow).

The reference implementations to study:

Mastra's Observational Memory, generally available in early 2026, uses background agents to maintain a dense observation log that replaces raw message history as it grows. Context stays small. Long-term memory stays rich.

LangChain's LangMem SDK ships create_memory_manager with explicit control over hot-path extraction (synchronous, cheap, brittle) versus background extraction (async, expensive, more reliable). This is the right abstraction. Most production systems need both.

Anthropic's memory tool shipped public beta in September 2025 on Claude Sonnet 4.5 and is now generally available on Opus 4.6 and Sonnet 4.6. It is a file based tool Claude calls with create, read, update, and delete operations against a developer-managed storage backend. Anthropic reports a 39 percent improvement on internal agentic-search evaluations when combined with context editing, and 84 percent token reduction on 100-turn web search tasks. Critical detail: the memory tool operates entirely client side through tool calls. You own the persistence layer.

Claude Code's pattern is conceptually the same thing. Background consolidation into a persistent memory directory, then reinsertion on every turn.

Framework State April 2026

Anthropic memory tool. GA on 4.6 models. You manage storage.

LangGraph Store plus LangMem. BaseStore API with namespace scoping, semantic search, filter by content. LangMem adds memory manager, store manager, and prompt optimization primitives on top. The most complete memory story in an open framework today.

Microsoft Foundry Memory tool. Preview as of March 2026. Integrated into Foundry Agent Service alongside web search, file search, and code interpreter. Region-gated.

OpenAI Memory. The developer-facing story is the Responses API plus stored conversations plus File Search. There is no first-class memory tool equivalent to Anthropic's. Teams build their own.

Mastra Memory. Working memory (structured profile), semantic recall over past messages, Observational Memory. Automatic thread and resource isolation in multi-agent systems. Deterministic subagent resource IDs derived from parent. The cleanest memory ergonomics if you are on TypeScript.

CrewAI memory. Short-term, long-term (SQLite by default), entity memory (RAG over named entities), user memory. Lighter weight than LangGraph or Mastra. Fewer knobs to turn.

Pydantic AI. Does not ship memory primitives. You pass message history into the agent call. Memory is your problem. This is an intentional design choice that says "we do not hide the model."

Honest Failure Modes

Memory staleness is the first failure you will see. The agent confidently asserts a fact the user corrected weeks ago because the old fact was never overwritten. Collection-style memories are worse than profile-style here. If a user's preference can change, make sure your memory pattern supports update, not just append.

Memory explosion is the second. Naive "remember everything" pipelines produce thousands of near-duplicate entries. Retrieval drowns in noise. The cure is deduplication, compaction, and ruthless eviction.

Cross-user bleed is the third and scariest. Namespace bugs leak one user's memory to another. The LangMem team calls this out explicitly as the reason they made user_id a first-class namespace. If your memory story does not have tenant isolation at the storage layer, you have a security bug waiting to happen.

Hot-path extraction slowing every turn by 2 to 4 seconds is the fourth. It feels fine until you measure. Run extraction in the background.

Consolidation agents hallucinating summaries and poisoning the knowledge base permanently is the fifth. Review queues matter. For high-stakes memories, a human gate on consolidation is cheap insurance.

Frontier Patterns

Consolidation is the table stakes pattern. The next tier of memory architecture is where production systems are still catching up to research. These are the patterns you should know about, evaluate with judgment, and reach for when the baseline is no longer enough.

Decay and recency-weighted retrieval. The Generative Agents paper from Park et al. at Stanford in 2023 proposed scoring each memory on three axes. Recency, with exponential decay over elapsed time since last access. Importance, a model-scored salience label. Relevance, embedding similarity to the current query. Retrieval ranks by the weighted sum of the three. The model is better than naive similarity alone because a highly-relevant but stale memory loses to a less-relevant but fresh one when that is the right call.

In production, fixed TTL on memories is almost always wrong. User preferences can persist for years. A tool result is often stale in hours. A session fact is useful for a day. The decay function should be per-memory-type, not global. Letta exposes per-block configuration. Zep's temporal knowledge graph tracks validity windows on individual facts. Mastra's Observational Memory compacts older observations into denser summaries rather than deleting them, which is a softer form of decay. Anthropic's memory tool leaves decay entirely to your storage layer, which is honest but unhelpful if you do not have a decay policy of your own.

What I tell clients: do not build a one-size-fits-all decay curve. Tag every memory with a type (preference, session fact, tool result, observation, commitment) and set decay policy per type. This adds a day of work up front and prevents the category of bug where the agent remembers something it should have forgotten, or forgets something it was supposed to keep for a year.

Virtual context and paging. MemGPT from Packer et al. in 2023 made the case that context-window management should look like an operating system. A small core memory stays resident. Archival memory lives on disk and pages in on demand through explicit agent-issued tool calls. The agent literally reads and writes its own memory with functions like core_memory_append, archival_memory_insert, and archival_memory_search. The model manages its own working set.

Letta is the commercial heir. Active project, Python-first, agent-as-service architecture where the agent runs as a persistent process with memory state surviving across restarts. Used in production at a growing number of companies, with integrations appearing for LangGraph and Mastra that let Letta serve as a pluggable memory backend. The right fit when your agent holds long-running state across sessions with the same user and you want the model to explicitly decide what stays in core context.

Be honest about where paging wins. For short, stateless request-response agents, virtual context is overkill. For an agent that has a months-long relationship with a user, it is one of the few patterns that scales without getting gradually dumber. The middle ground, where ordinary consolidation into a summary is enough, is where most production teams actually sit. Reach for virtual context when the consolidation pattern is already in place and you are still losing important facts in the summary step.

Sleep-time compute and offline reflection. The idea. Agents do their best thinking when idle. Between user turns, overnight, or in batch, a background process reads recent trajectories and produces reflections, revised plans, updated skill documents, or rewritten memory entries. The user never waits for this work. The agent wakes up smarter than it went to sleep.

Letta ships explicit sleep-time agents that run on a configurable cadence. Mastra Observational Memory does a narrower version implicitly through background consolidation. Claude Code's pattern of background memory rewrites between sessions is another narrow form of the same idea. Recent academic work has framed sleep-time compute as a distinct scaling axis, alongside pre-training compute and test-time reasoning compute, with its own cost and quality curves.

Production use is early. The failure mode is obvious. If your reflection agent hallucinates a bad summary overnight, you wake up with a confidently wrong agent. The mitigations are the same as with consolidation. Review queues on high-stakes rewrites. Diff logs so a human can see what the reflection changed. Never let a reflection agent overwrite user-provided facts.

The CTO call. Consider sleep-time compute for agents that are accessed infrequently by each user (once a day, once a week) where the reflection cost is amortized over many user interactions, and where coherence across sessions is a visible product feature. Skip it for high-throughput short-turn agents where every interaction stands alone.

Reflection trees. The other half of the Generative Agents contribution. Periodically, the agent looks at a batch of recent memories, asks itself what questions those memories raise, answers the questions using the same memory store, and writes the answers back as higher-level memories. The tree emerges from recursion. A week-old interaction gets abstracted into a one-line insight. The one-line insights get abstracted into a paragraph-long trait. The agent builds its own theory of the user without anyone writing the theory down.

This is a pattern you can implement on top of any memory system that supports insertion with metadata. Mastra, LangMem, and Letta all have the primitives. No mainstream framework ships it as a turnkey feature. The value is real when your agents have a rich episodic log and the user cares about long-term coherence (coaching, therapy, executive assistant, tutor). The cost is model calls for the reflection work, paid on a schedule you control. Budget for it.

Workflow memory. Agent Workflow Memory takes procedural memory further. The agent extracts reusable workflow patterns from its own successful trajectories and stores them as named recipes, parameterized by the variable parts. Next time a similar task arrives, it retrieves the matching workflow and follows it instead of planning from scratch. Research results on web navigation benchmarks are meaningful.

Production adoption is shallow. The closest shipping analogue is Anthropic Skills, but Skills are human-authored folders, not auto-extracted workflows. Automatic workflow synthesis from traces remains research-grade. Early adopters are building this themselves. The pattern I have seen work is a weekly batch job that reads LangSmith or Braintrust trace exports, clusters trajectories by task similarity, proposes candidate workflows, runs a human review, and promotes the approved ones into the agent's skill library. That pipeline ships. It is not turnkey.

Hierarchical memory beyond two-tier. MemGPT's two-tier model (core plus archival) is the current de facto standard. CoALA hints at richer hierarchies. Sensory buffers. Short-term working memory. Medium-term episodic buffers. Long-term consolidated knowledge. In practice, nobody has shipped a five-tier production system that outperforms the two-tier one by enough to justify the complexity. The tier count is not the insight. The insight is that different memory types need different retention, retrieval, and consolidation policies. You get most of the benefit with a two-tier architecture plus per-type policies, not with adding more tiers.

When To Reach For Frontier Memory

A simple decision rule. If your agent runs fewer than twenty turns per session and each session is independent, the standard consolidation pattern is enough. Stop there. If your agent runs for weeks or months with the same user and the quality of the relationship compounds, you will need at least two of the frontier patterns. Start with decay-aware retrieval and reflection trees. Add virtual context if context overflow becomes the binding constraint. Add sleep-time compute if user-perceived intelligence-per-session is what you are selling.

Do not add all of them at once. Each has a cost in complexity and in model calls. Each is also a new surface where the agent can corrupt its own memory. Add them one at a time with eval coverage on what you expect to improve.

Retrieval Augmented Generation

Classic RAG is still the baseline. Chunk the documents. Embed the chunks. Retrieve the most similar chunks for a query. Stuff them into the prompt. It remains the right starting point because it is cheap, understandable, and composable. It rarely ships alone anymore.

The Patterns You Actually Ship

Agentic RAG is now the dominant production pattern. The agent decides whether to retrieve at all, issues multiple queries with different formulations, reads tool descriptions, and can re-query after inspecting initial results. Foundry File Search, OpenAI File Search in the Responses API, and Anthropic's web-fetch and web-search tools are all agent driven. The model, not a pre-built pipeline, decides when to reach for data.

Graph RAG. Microsoft's GraphRAG builds a knowledge graph of entities, relationships, and claims, runs Leiden clustering, generates community summaries, and answers via Global, Local, or DRIFT search. Global is for broad "what are the themes" questions. Local walks entity neighborhoods. LazyGraphRAG, from Microsoft Research in late 2024, defers LLM work to query time. It builds a cheap graph index with NLP-extracted concepts instead of full LLM summarization, then uses the LLM only during search. Microsoft's published claim is roughly 0.1 percent of GraphRAG's indexing cost at comparable answer quality. That is the difference between GraphRAG being viable on millions of documents and not.

Contextual retrieval. Anthropic's technique from September 2024. Prepend a 50 to 100 token chunk-specific context to each chunk before embedding and before BM25 indexing. Reported results: contextual embeddings alone cut retrieval failure rate by 35 percent. Contextual embeddings plus contextual BM25 cut it by 49 percent. Adding a reranker pushed the total improvement to 67 percent. The preprocessing cost is about 1.02 dollars per million document tokens using Haiku with prompt caching. This is now standard. Most new RAG stacks bake it in by default.

Hybrid retrieval. BM25 plus dense vectors, fused with Reciprocal Rank Fusion. The consensus default for production. Pure vector search loses exact-match queries like product codes, SKUs, and error codes. Pure BM25 loses semantic paraphrase. Every serious retrieval stack does both.

Rerankers

Cohere Rerank 3.5 from December 2024 is the commercial leader with strong multilingual reasoning. Cohere Rerank v4 is now sold directly through Azure Foundry Models. Voyage rerank-2.5 is popular for finance and legal. BGE-reranker-v2 and Jina Reranker v2 are the open-weight choices. BGE for cost. Jina for multilingual. Cross-encoder models from the ms-marco-MiniLM family still beat nothing for pennies when latency is not critical.

Rule of thumb: retrieve top 50 to 100 candidates, rerank to top 5 to 10. Reranking adds 50 to 300 milliseconds of latency. That is why latency-sensitive apps skip it. Most apps should not.

Vector Databases, April 2026 Positions

Pinecone is still the managed default for teams who want zero ops. The serverless tier is aggressive on price.

Qdrant is the Pinecone alternative of choice for self-hosted. Rust core, strong filtering.

Weaviate is hybrid native from day one with strong multi-tenancy.

pgvector with pgvectorscale from Timescale is the right answer when you already have Postgres. StreamingDiskANN through pgvectorscale closed the performance gap for most workloads.

Turbopuffer is object-storage backed and extremely cheap for cold data. The darling of 2025 for billion-scale low-QPS workloads.

LanceDB is embedded and popular for desktop, edge, and Mastra-style local-first stacks.

Azure AI Search, Vertex Vector Search, and Bedrock Knowledge Bases are the hyperscaler answers. Pick based on where your compliance and data gravity already live.

Chroma is dev and prototype. Production deployments have mostly migrated off.

Managed RAG Offerings

Foundry IQ is Microsoft's knowledge layer for enterprise agents. It unifies Azure AI Search, SharePoint, OneDrive, and custom sources, and integrates with File Search in Agent Service.

Vertex RAG Engine is managed chunking plus embedding plus retrieval. Integrates with Vertex Vector Search and Gemini grounding.

Bedrock Knowledge Bases handle managed ingestion from S3, Confluence, SharePoint, Salesforce, and web sources. Supports hybrid search, reranking, and GraphRAG via Neptune Analytics.

OpenAI File Search is a tool inside the Responses API. Handles vector store creation, chunking, retrieval, and citations with minimal configuration.

Honest Failure Modes

Chunks without context is the single biggest accuracy killer. The "3 percent revenue growth" example from the Anthropic contextual retrieval paper is not an edge case. It is the median production bug. Every chunk needs enough context for standalone retrieval to make sense.

Recall cliffs at top-k boundaries. Your ranking puts the right document at position 11 when you retrieve 10. Expand your retrieval window, then rerank.

Embedding model mismatch. Your stored embeddings are from mpnet-base-v2 from 2022. Your queries are from text-embedding-3-large. Your similarity scores are meaningless. Re-embed when you upgrade.

Evaluator theater. LLM as judge on RAG that rewards fluent wrong answers. Use groundedness metrics, not just answer quality.

Freshness. Documents change. Embeddings drift. Nobody re-indexes until a customer complains. Put a freshness policy in place from day one.

Tools Beyond MCP

Function calling has converged. OpenAI, Anthropic, and Gemini all support structured tool schemas, parallel tool calls, and forced tool calls. Parity for basic use cases. Differences show up in fine-grained streaming, programmatic tool calling, and tool-search features for large tool catalogs. Anthropic ships tool search and programmatic tool calling as first-class features when you have more than 30 tools in the loop. This matters if you have a big tool inventory.

Computer use. Anthropic's computer use tool is beta with the header computer-use-2025-11-24 for Opus 4.6, Sonnet 4.6, and Opus 4.5. Screenshot, mouse, keyboard, and desktop automation. State-of-the-art results on WebArena among single-agent systems. OpenAI ships Operator (consumer) and the computer-use-preview model for developers through the Responses API. Gemini's equivalent is Computer Use in the Gemini API. All three remain reliability-limited for multi-step workflows. Expect 40 to 70 percent task completion on real workloads. If your use case depends on computer use in production, plan for human fallback.

Code interpreters. OpenAI Code Interpreter is generally available inside the Responses API and Assistants. Anthropic's Code Execution Tool is beta and required for Skills. Microsoft Foundry Code Interpreter is generally available, Python sandboxed, supports data analysis and chart generation. All three are stateful within a session and ephemeral across sessions unless you mount storage.

Skills. Anthropic Agent Skills launched October 16, 2025, and became an open standard on December 18. A skill is a folder with a SKILL.md file plus scripts and resources that Claude loads only when relevant. Composable, portable, progressive disclosure. Identical format across Claude apps, Claude Code, and the API through the /v1/skills endpoint. Requires the Code Execution Tool beta. Microsoft Agent Framework's class-based skills in MAF 1.x are the closest equivalent in the Microsoft stack, with class hierarchies instead of Markdown folders. The industry direction is clearly progressive disclosure. Ship instructions as files. Load on demand. Do not put everything in the system prompt.

Browser automation. Playwright is the plumbing. Microsoft Playwright MCP and the native Anthropic Playwright extension cover most use cases. Browserbase is the managed cloud for running headless browsers at scale with session replay, stealth mode, and proxy rotation. The default for teams that do not want to run Playwright infrastructure.

Tool registries. Foundry's tool catalog advertises roughly 1,400 tools including MCP servers, connectors, Logic Apps, and SharePoint. The biggest single catalog. OpenAI's ecosystem is MCP native now. Anthropic's Connectors directory is smaller but curated, with organization-wide skill management for Team and Enterprise. The question is no longer who has more tools. The question is whose tool permissions, auth, and audit story your compliance team will sign off on.

Context Engineering

Prompt Caching Is A Free Win

Anthropic's pricing is public and aggressive.

Opus 4.6 base is 5 dollars per million input tokens. Cache write is 6.25 dollars at the 5-minute TTL or 10 dollars at the 1-hour TTL. Cache hit is 50 cents. Ten times cheaper than the base rate.

Sonnet 4.6 base is 3 dollars per million input tokens. Cache write is 3.75 or 6. Cache hit is 30 cents.

Break-even is after roughly two hits. The cache has paid for itself. Everything after that is 90 percent savings.

The 5-minute default refreshes for free on each hit. The 1-hour extended cache is for long-running agent workflows where context is stable but access is sparse. Anthropic's own contextual retrieval cost analysis of 1.02 dollars per million document tokens relies on this.

OpenAI supports automatic prompt caching for prompts over 1024 tokens at a standard 50 percent discount on cached input. Gemini offers context caching with a minimum cache size and a storage-per-hour charge. Every serious production stack now caches system prompts, tool schemas, and long document contexts. If yours does not, you are overpaying by something like 5 to 10 times on every stable prefix.

Compression And Reinsertion

The dominant compression patterns. Rolling summaries every N turns. Hierarchical summaries where you keep the last N turns in full plus older turns summarized. Tool-result pruning where stale tool calls are automatically removed. Anthropic's context editing reports 84 percent token reduction on 100-turn workloads using this. Selective keep based on relevance scoring.

Context reinsertion. Claude Code's pattern is now de facto standard for coding agents. Re-inject the per-project memory file plus the memory directory plus directory listings plus recently modified files at the start of every turn. Cursor, Windsurf, and Aider all follow variants.

Context Windows And The Long Context Question

GPT-5.4 and the 5.x family ship 1,050,000 token context with 128,000 token output on Azure Foundry. Claude Opus 4.6 and Sonnet 4.6 offer 1 million tokens for qualified customers. Gemini 3.x ships 1 million plus tokens standard across the family with some models accepting 2 million.

The 1 million token era has not killed RAG. It has killed RAG for small knowledge bases. The decision rule most teams use now:

Under 500 thousand tokens: stuff and cache. No retrieval needed.

500 thousand to 10 million tokens: hybrid. Stuff the hot data. RAG the cold.

Over 10 million tokens: RAG with contextual retrieval and reranking.

Anthropic's own guidance: if your knowledge base is smaller than 200,000 tokens or roughly 500 pages, include the entire knowledge base in the prompt with no retrieval. I have had real client debates where we deleted half a retrieval pipeline because the documents fit. The simpler answer beats the complicated one when the simpler answer works.

Evaluation

LLM As Judge

Production standard but not trusted alone. Known failure modes. Position bias, which prefers the first of two candidates. Verbosity bias, where longer answers score higher. Self-preference, where GPT-4o judging GPT-4o rates itself higher than a blind comparison. Brittleness to prompt phrasing.

The mitigations are well known. Pairwise over pointwise. Calibrate with human labels. Use multiple judges and majority vote. Always pin the judge model so results are comparable across runs.

Rubric Based Scoring

G-Eval with chain-of-thought grading is the reference pattern, implemented in DeepEval, Ragas (RAG-specific: faithfulness, answer relevance, context precision, context recall), and Promptfoo. These remain the fastest path to getting something in CI.

Regression Suites

The bar has shifted. Every eval framework now integrates with CI/CD. Braintrust's "promote playground to experiment, run in CI, catch regressions before production" is representative of the whole commercial category. Foundry Evaluations integrates directly into the Agent Service lifecycle.

Red Teaming

UK AISI's Inspect framework is the most credible open-source evaluation harness for safety and has been adopted by multiple national AI safety institutes. Commercial options for red-team work include Haize Labs, Patronus AI, and Lakera Red.

Tooling Landscape, April 2026

Open source:

DeepEval has 14 plus metrics, pytest-native, good for unit-test-style evals.

Ragas is the RAG-eval default.

Giskard handles test generation and red-teaming with strong coverage on bias and reliability.

Phoenix and Arize offer OSS tracing plus eval and are now the most common choice for teams on OpenTelemetry.

Promptfoo is YAML-driven with easy CI integration and model comparison matrices.

Inspect is the rigorous choice used for publishable safety evaluations.

Commercial:

LangSmith Evals has the tightest LangChain integration and is strong for tracing plus eval in one tool.

Braintrust provides playgrounds, experiments, and online scoring. Product-led with good UI for teams that want one.

HoneyHive offers similar positioning with more enterprise feature completeness.

Pydantic Evals is code-first and minimal. Pairs with Pydantic AI.

Foundry Evaluations is native to Azure with Application Insights and RBAC integration.

Galileo is enterprise focused with hallucination detection and compliance reporting.

Honest Take

Eval tooling still falls short on four things.

Agent trajectory evaluation. It is easy to grade a final answer. It is hard to grade a 30-step agent loop.

Grounding drift. Judges do not reliably catch fluent but unsupported claims.

Cost of running evals. Full eval suites can exceed your training-data cost.

Test set rot. Your golden set becomes your model's memorized set over time.

Budget 15 to 25 percent of agent engineering time on eval infrastructure. That is the number I use with clients. It is more than most teams plan for, and it is the reason the teams that plan for it ship reliably.

Safety And Guardrails

Content Filters

OpenAI Moderation API, free and category-scored. Azure AI Content Safety, including Jailbreak shield, Groundedness Detection, Protected Material Detection, and Indirect Prompt Attack detection. Integrated into Foundry Agent Service by default. Bedrock Guardrails with managed policies across Bedrock models and configurable denied topics plus PII filters. Gemini safety settings with four category thresholds configurable per request.

Prompt Injection

No technique is fully proven. The layered defense that works in practice combines five techniques.

Spotlighting. Mark untrusted input with delimiters plus explicit instructions.

Separate trust zones. Tool outputs and user input are not the same thing. The prompt structure should make that clear to the model.

Output validation. Structured outputs plus tool-call shape checks. The output should be parseable and pass schema validation.

Least-privilege tools. The agent can only do what it could do safely if fully compromised. If your agent can drop database tables, that is an architecture problem, not a security problem.

Content filter pre-check on external inputs. Azure's XPIA detection in Defender for Foundry is the first shipping commercial detector specifically targeted at cross-prompt injection attacks.

Jailbreak Detection

Azure Defender for Foundry Tools includes XPIA detection and jailbreak classification. Anthropic's Constitutional AI underpins Claude's refusal behavior and remains the strongest published defense-in-depth against jailbreaks. Third-party options include Lakera Guard and Protect AI.

Guardrails Frameworks

NVIDIA NeMo Guardrails offers programmable rails through Colang with a multi-rail architecture covering input, output, retrieval, and execution. Widely adopted.

Meta Llama Guard and Prompt Guard are open-weight classifiers. The default drop-in for self-hosted stacks.

Guardrails AI (the library) plus Guardrails Hub offers a validator catalog. Strong for structured output validation.

Invariant Labs is newer but noteworthy for agent-specific trace-based policies.

Output Validation

Pydantic AI enforces Pydantic schemas on LLM output with automatic retry on parse failure. Cleanest API in the space. Instructor offers similar functionality with broader model support. TypeChat is Microsoft's TypeScript-first approach. OpenAI's native Structured Outputs and Anthropic's JSON mode reduce the need for these wrappers for simple cases but do not replace validation logic.

Protected Material

Azure Content Safety's Protected Material Detection flags verbatim copyrighted text in model output: song lyrics, news articles, recipe collections. Bedrock Guardrails includes similar capabilities. This is now a compliance checkbox rather than a differentiator.

Cost Management

Model Routing

RouteLLM from Berkeley remains the open-source reference. Train a binary classifier to route between a strong and weak model. Claims 85 percent cost reduction at 95 percent GPT-4 quality on benchmarks.

AWS Bedrock Intelligent Prompt Routing, in GA since 2025, routes between same-family models like Claude Haiku versus Sonnet with target latency and cost constraints.

OpenAI tier selection through model aliases and reasoning effort controls.

Foundry model-router is a first-party model listed in the Azure catalog that selects among Azure OpenAI models automatically.

Pattern to adopt: route small model for easy tasks, large model for hard tasks, and make the routing decision observable. The savings are real. The complexity cost is also real. Add routing when you have volume, not before.

Prompt Caching Economics, Worked

With Opus 4.6, a 100 thousand token system prompt costs 500 dollars on first call for cache write and 50 dollars on each subsequent hit at the cache rate. Ten times cheaper. Agents with stable instructions that run hourly realize near-total savings after the first call.

Sonnet 4.6 is even steeper. 300 dollars write. 30 dollars per hit.

For a 50-turn agent loop reusing the same tools and system, prompt caching typically cuts total input cost by 70 to 90 percent. This is the single most valuable cost optimization in the stack. If you are not caching, start today.

Semantic Caching

GPTCache and derivatives like Upstash Semantic Cache and Helicone work well for deterministic-query workloads (FAQ bots, documentation search) and poorly for agentic workloads (too much per-request state). Hit rates of 20 to 40 percent on FAQ patterns. Single digits on agent loops. The honest answer is that semantic caching is a niche tool. Useful where it fits. Not a general solution.

Budget Controls

Foundry, Bedrock, and OpenAI Enterprise all ship per-project spend caps, per-user rate limits, and alerts. LangSmith, Helicone, Braintrust, and OpenMeter provide cross-provider observability. Kill switches through feature flags (LaunchDarkly, Statsig) are the norm for staged rollouts. If you do not have a kill switch for agents, you are one prompt injection from a five-figure weekend bill.

Observability For Cost

Every serious platform now emits OpenTelemetry with LLM-specific semantic conventions. Tokens in, tokens out, cache hit, cache miss, model, tool calls. Phoenix, Arize, Langfuse, and Helicone standardized on this. If your platform does not emit OTel LLM spans in 2026, that is a red flag.

Identity And Auth For Agents

The Microsoft Pattern Is The Reference

Service-managed credentials plus on-behalf-of is now standard in Foundry Agent Service. The agent gets its own identity. When a user invokes the agent, the identity flows through OBO so downstream resources see the user's permissions, not the agent's. No shared secrets. Every call auditable. This is the right default for enterprise.

Entra Agent Identity

Entra now treats agents as first-class directory principals with their own object IDs, conditional access, RBAC, and lifecycle. Foundry publishes agents to the Entra Agent Registry for discoverability across Teams and Microsoft 365 Copilot. Each agent can have a dedicated Entra identity enabling secure scoped access to resources and APIs without sharing credentials. This changes your audit model fundamentally. Every action traces to the agent, the user who invoked it, the tool called, and the data accessed.

Google And AWS Equivalents

Vertex AI Agent Builder uses Google Cloud service accounts plus IAM. No dedicated agent principal type yet. Agentspace layers user-delegated access on top. The primitives exist. The UX around agents as a directory object is less mature than Entra's.

Bedrock AgentCore introduced agent IAM roles with session-scoped credentials and short-term credential issuance through STS. Similar functional scope to Entra. Less directory integration.

Compliance Implications

SOC 2, ISO 27001, HIPAA auditors are starting to expect this in 2026. You have a directory principal that performs actions. Every action is traceable. Without agent identities, you have a service account doing things on behalf of humans, which is the audit pattern nobody wants to defend.

The CTO-level decision: do not let agents run on shared service accounts past pilot. Before you ship to production, every agent has its own identity with scoped permissions.

Planning Patterns

ReAct Is Still The Default

ReAct, the Reason-then-Act pattern from Yao et al. in 2022, is still the honest baseline in production. Critique is well known. Verbose, doubling token cost for thinking then acting. Brittle on complex multi-step tasks. The thought step can hallucinate plans that contradict the action taken.

Most frameworks use ReAct-shaped loops by default, often with model-native reasoning (extended thinking in Claude, reasoning effort in GPT-5.x) replacing the visible thought channel. The result is cleaner output without the "Thought:" lines while keeping the planning capability underneath.

Tree Of Thoughts

Use when you can cheaply evaluate partial states. Puzzles, math, code that either compiles or does not. Rarely used in production agent loops because the branching cost dominates and most production tasks lack a cheap evaluator. Useful inside specialized sub-tasks like SQL query generation or theorem proving. Not a general-purpose pattern.

Reflexion

Store a short self-critique from each failed attempt and re-inject on retry. Reliable, cheap, broadly applicable. Most modern agent frameworks include some form of this, explicit or folded into extended thinking.

Task Decomposition

Production-standard for long workflows. A planner agent emits a plan. Specialist agents execute nodes. A critic agent verifies and triggers replanning. LangGraph, AutoGen, CrewAI, and Foundry's workflow agents all support this.

Anthropic's recent "advisor strategy" is a lightweight variant. The main agent consults a stronger advisor model for hard sub-decisions without spending full Opus tokens on every turn. I like this pattern. It matches how real teams work. The junior engineer does most of the work and escalates to the senior when they need to.

Plan Critique

The critic-actor split works. The common footgun is the critic using the same model as the actor and agreeing with whatever the actor produced. Self-consistency bias. Mitigation: use a different model family for the critic (Claude critic for GPT actor, or the reverse), or a smaller model with explicit rubrics.

What Is Actually Used In Production

ReAct with native reasoning models (GPT-5.x, Claude 4.6, Gemini 3.x with thinking) covers about 80 percent of simple to moderate agents.

Hierarchical planner plus specialist executors is the standard for anything multi-hour or multi-domain.

Reflexion-style retry loops are universal, usually embedded in framework retry policies.

Tree of Thoughts, Chain of Thought with Self-Consistency, Algorithm of Thoughts, and more exotic research patterns are confined to specialized sub-tasks or absent.

The research novelty to production ratio is clear. ReAct and Reflexion have made it. Tree of Thoughts is partial. Most of the novel planning papers from 2023 to 2025 have not. Do not let your architecture lead with a research paper.

Self-Improving Agents

The honest status. Most self-improvement patterns are still research. A few are landing in production. This is the area where the gap between the papers and the frameworks is widest. If you are hearing a pitch about agents that get better over time without human involvement, check the evidence before you believe it.

The Three Axes Of Self-Improvement

An agent can improve on three axes. Its knowledge, what it remembers. Its instructions, how it behaves. Its skills, what it can do. Each axis has a different maturity curve in April 2026.

Knowledge improvement is largely solved. Consolidation and reflection already do this. Memory grows, gets compacted, stays useful. Your agent knows more about each user after every session when the memory pattern is right.

Instruction improvement is the prompt optimization problem. DSPy, LangMem's prompt-optimization primitives, and several research frameworks iteratively refine system prompts based on observed failures. Honest take. These work on narrow benchmarks and rarely survive the messiness of production traffic without human oversight. For most teams, prompt updates are a human job informed by production traces. Automated prompt optimization is worth a weekend experiment. Do not bet your architecture on it yet.

Skill improvement is where the interesting work is happening, and where the gap between research and production is the most interesting to watch.

Trajectory Distillation And Reinforcement Fine-Tuning

The pattern. Take successful trajectories from production, grade them, and fine-tune the model on the traces. The agent gets measurably better at your specific workflow without anyone writing a new prompt.

OpenAI Reinforcement Fine-Tuning is in production for o-series and GPT-5 models with a grader-based workflow. Submit a dataset of prompts plus a grader function (code or LLM-based). OpenAI fine-tunes the model against the grader signal. Clients running RFT on a narrow workflow see 10 to 20 point gains over the base model on the target task. Not a toy.

Anthropic does not offer equivalent public fine-tuning today. Claude fine-tuning is available through AWS Bedrock but is narrower than the OpenAI path. The tradeoff is clear. If your task lives on OpenAI, RFT is a real option. If it lives on Claude or Gemini, the pattern is weaker for you in 2026 and you will likely wait for the equivalents to mature.

OpenPipe's ART library (Agent Reinforcement Trainer) is the open-source direction for teams that want to run the pattern on open-weights models. Pairs with vLLM for inference and a local eval harness.

The critical caveat. Trajectory distillation concentrates the successes you already have. It does not teach the agent what it never saw. A workflow that worked 60 percent of the time might hit 80 after distillation. It will not hit 95 without genuinely new data or a better base model. Plan accordingly, and never position distillation internally as "the model will figure out the remaining failure modes on its own."

Skill Synthesis From Traces

Voyager from Wang et al. in 2023 proposed an agent that writes its own skills. A curriculum generator proposes tasks. An execution agent attempts them. A skill librarian saves successful solutions as named, reusable code. Over time the library grows and the agent composes skills rather than solving from scratch.

In production, automatic skill synthesis is rare. The closest shipping pattern is human-in-the-loop skill authoring. A developer reviews production traces, identifies a recurring pattern worth encoding, and writes a skill. The Anthropic Skills open standard formalizes this. The folder is portable. The review gate is manual.

The interesting research-to-production gap. The trust boundary on auto-generated skills has not been solved. A skill is code the agent can run. An auto-generated skill is code the agent wrote and chose to run. You cannot let that into production without review. And the automation wins disappear once you add the review step. For now, treat skills as a human-authored abstraction and use traces as input to your authoring queue, not as automatic training signal.

Metacognitive Loops Beyond Reflexion

The active research directions. Self-refine, where the agent critiques and revises its own output before committing. Critic models, a separate smaller model trained specifically to grade the primary agent's work. Constitutional-style self-critique, where the agent checks its output against a list of explicit rules before responding.

Production status, honestly. Self-refine is a prompt pattern most teams have tried. It helps for writing and code. Mixed results on agent actions. Critic models are showing up in commercial products (Patronus AI, Braintrust online scoring, Lakera Guard, Azure Content Safety's groundedness models). These are typically small fine-tuned models scoring outputs against rubrics, and they are the most production-ready piece of the metacognition story.

Constitutional AI as a technique is internal to Anthropic's training process for Claude. As a user-facing pattern, the closest equivalents are guardrails frameworks with rule-based output validation. Functionally similar, mechanically different.

Eval-To-Training Pipelines

The pattern that is actually mature. Use production eval traces as training or prompt-tuning data. LangSmith and Braintrust both support promoting production examples to datasets that feed back into offline experiments. Mastra has similar hooks. Foundry Evaluations integrates with Agent Service runs.

The workflow is familiar. Production agent runs. Eval catches a regression or a low-score output. That example becomes a test case. The team revises the prompt, or fine-tunes the model, or updates a skill. The fix goes back to production. Cycle time is days to weeks. Humans in the loop at every step.

This is not fully self-improving. It is the human-in-the-loop version. It is also the only version I have seen work reliably at production scale. If you want an agent that gets better over time, build this pipeline before you chase anything more exotic. Three to five percent of engineering time on ingesting production traces back into evals and training data is the right range for teams that care about compound improvement.

The CTO Take

Self-improving agents as marketed do not exist yet. Self-improving agent systems, where humans close the loop on evals and distill improvements back into prompts or fine-tunes, are real and shipping now. Budget for the pipeline, not the magic.

The frontier will mature. Some of these patterns will move into the non-negotiable layer over the next year. I expect decay-aware memory, sleep-time compute, and eval-to-training pipelines to be on the CTO shortlist by April 2027. I do not expect auto-skill synthesis or fully automated prompt optimization to reach production status by then. Invest accordingly.

The CTO Shortlist

If you are deciding what to build into your architecture today, this is the non-negotiable shortlist.

Prompt caching on every stable prefix. Ten times savings on cached input. Unambiguous win.

Hybrid retrieval plus contextual embeddings plus rerank. 49 to 67 percent fewer retrieval failures. Pays for itself with one avoided human escalation.

Background memory consolidation. If your agents run more than 20 turns, you need this. Pick Mastra Observational Memory, LangMem, or Anthropic's memory tool based on your stack.

Skills or folder-based capability packaging. An open standard now. Invest in authoring infrastructure, not bespoke RAG for skills.

Agent identities with on-behalf-of. Your auditors will ask in 2026. Be ahead of them.

Eval in CI. Online scoring in production. Non-negotiable for anything customer facing.

Layered prompt injection defense. Spotlighting plus structured outputs plus tool least-privilege plus an XPIA classifier. No single technique is enough.

Get these seven right and your framework choice matters less than any of them. Get them wrong and no framework saves you.

The Frontier Memory Patterns and the Self-Improving Agents section earlier in this piece are the next layer. Worth tracking, worth evaluating, not yet the universal non-negotiable the seven above are. I expect two or three of them to move onto this shortlist in the next twelve months. The teams that build them early will have a quiet advantage. The teams that wait will get them for free when the frameworks catch up.

What Is Coming In Part 4

This piece went deep on the implementation patterns. The next piece closes the series.

Part 4 is where I land. The recommendation I make to most of my enterprise clients and when I do not make it. Three client war stories across healthcare, financial services, and government. Where agentic AI is heading over the next twelve months. And a thirty minute decision framework your team can run this week.

The Agentic Engineering Field Guide, Part 2: The Framework and Platform Landscape

Navneet Singh — Mon, 27 Apr 2026 02:31:02 GMT

The Three Layer Decision

Most framework debates skip the decision that matters more. Before you pick a framework, you have to pick a layer.

There are three. From most control to least.

Open framework plus your own runtime. You write the agent code in LangGraph, Microsoft Agent Framework, Mastra, or similar. You deploy it on infrastructure you own or operate. You wire in your own checkpointing, observability, and scaling. The framework does the orchestration work. Everything else is yours.

Managed platform. A hyperscaler runs the agent runtime. You write code in the matching open SDK, ship a container or config, and the platform handles deployment, scaling, identity, observability, and state. Microsoft Foundry, Google Vertex AI Agent Engine, and AWS Bedrock Agents all live here. You trade some control for a lot less plumbing.

SaaS agent layer. You do not write the agent. The vendor built it. You configure topics, actions, and data connectors inside their platform. Salesforce Agentforce and ServiceNow AI Agents live here. You trade most control for fastest time to value.

The rule of thumb I use with clients: pick the lowest layer at which your control requirements are met. Every layer up trades control for speed.

Your reasoning should go in this order.

Is there a SaaS agent that already solves this? If your workflow is customer service routing inside Salesforce data or ITSM triage inside ServiceNow data, the SaaS layer probably already has an answer. Evaluate it first.
If not, does a managed platform give you enough control? If your team can live with the runtime the cloud provides and your data already lives on that cloud, the managed platform is faster than building your own runtime.
Only if the managed platform cannot meet your requirements do you go to open framework plus your own runtime. This is where differentiation happens, and where most of the engineering cost lives.

The rest of this piece walks the three layers in that order. SaaS first. Platforms second. Open frameworks third. Then the protocols that tie them together.

Layer 1: The SaaS Agent Layer

Salesforce Agentforce

Agentforce is the Agentforce 360 Platform now, rebranded in 2025. Agent Builder for low code, plus pro code extension through Apex, JavaScript, Flows, Prompt Builder, and MuleSoft connectors. Atlas Reasoning Engine is the orchestrator under the hood. Model pluggable across OpenAI, Anthropic, Google, and Salesforce's own Einstein models.

The pricing is the thing. Agentforce has the most transparent unit economics of any enterprise SaaS agent product. Flex Credits at 500 dollars per 100,000 credits. An agent action costs 20 credits or 10 cents. A voice action costs 30 credits or 15 cents. Customer facing conversations are priced at 2 dollars each on a pre-purchase plan. Agentforce User License is 5 dollars per user per month with metered usage on top. The full Sales or Service add-on is 125 dollars per user per month unmetered. The Agentforce 1 Editions start at 550 dollars per user per month with a million Flex Credits per org per year included.

A worked example from their own pricing page: 100 users doing 3 case management tasks per day, 20 working days, 6 actions per task, comes to 1,800 dollars per month for that use case. That kind of math is what makes Agentforce a buy versus build decision rather than a platform evaluation.

Where Agentforce wins: your system of record is already Salesforce, your data already lives in Data 360 (the renamed Data Cloud), and the workflow maps onto Service, Sales, SDR, or Commerce patterns. Pre built templates for all of these ship out of the box. AgentExchange is the partner marketplace for custom topics and actions.

Where Agentforce loses: you need novel multi agent topologies, your customer is not already a Salesforce shop, or you need to run the agent outside the Salesforce security perimeter. You also give up model choice in practice. Atlas uses the models Salesforce has integrated, which is a wide set but not every model.

Named enterprise customers include Workday, OpenTable, ADP, Wiley, Heathrow Airport, FedEx, and Saks Fifth Avenue. Systems integrators have dedicated Agentforce practices at every tier.

ServiceNow AI Agents

Three layered product on the Now Platform. Now Assist is the copilot layer, generally available across ITSM, HR, CSM, Creator Workflows, Security Operations, and Sourcing. AI Agents are autonomous, expanded significantly in the Zurich release in early 2026. AI Agent Studio is the low code builder for custom agents.

The structural advantage for ServiceNow is that agents run as first class Now Platform objects. They inherit the platform's identity, data model, and audit trail without any integration work. Flow Designer, Integration Hub, MID Server access for on-premises systems, full RBAC and ACL. For internal facing workflows where your system of record is already ServiceNow, this eliminates most of the engineering you would do in a custom build.

Model strategy is pluggable. Now LLM for common workflow tasks, Azure OpenAI and Anthropic for higher reasoning, bring your own LLM for customers with existing relationships.

Pricing is per user SKU with Pro, Pro Plus, and Enterprise tiers layered on top of existing product licences. ServiceNow has been shifting toward consumption based pricing for autonomous agents based on their 2025 earnings commentary.

Where ServiceNow wins: internal facing automations, ITSM triage, HR employee services, SOC triage, and anywhere the Now Platform is already the system of record. Time to value is measured in weeks because the connectors, audit trails, and identity are already there.

Where it loses: customer facing experiences outside the Now data model, novel multi agent topologies, anything that requires custom observability or evaluation beyond what the platform exposes.

Named customers include Adobe, Hitachi, NVIDIA (large internal deployment), BT Group, AstraZeneca, Dell, and Equinix.

The SaaS Layer Reality

I have talked several clients out of building their own customer service agent because Agentforce already solves 80 percent of their problem at a fraction of the cost of building it well. The same goes for ITSM workflows and ServiceNow. Buy the SaaS layer where it fits. Build only where you differentiate. That advice sounds obvious. Most teams skip it because the SaaS layer is not where interesting engineering happens. Interesting engineering is not the goal. Outcomes are.

Layer 2: The Managed Platforms

Microsoft Foundry plus Microsoft Agent Framework

Microsoft Foundry is the rebrand of Azure AI Foundry that shipped at Ignite 2025. The old Hub plus OpenAI resource plus AI Services model collapsed into a single Foundry resource with projects. Assistants, Threads, Messages, and Runs became Responses, Conversations, Items, and Agent Versions under the Responses API. SDKs consolidated behind azure-ai-projects 2.x. The whole platform is in the middle of that naming shift. Expect procurement conversations in 2026 to include the phrase "what is this thing called now."

Three agent types ship. Prompt agents are generally available, low code, defined by instructions plus a model plus tools. Workflow agents are in preview, declarative YAML or visual designer. Hosted agents are in preview, code based, shipped as containers. Hosted agents explicitly accept MAF, LangGraph, or arbitrary code. That last detail reframes Foundry as a control plane rather than a Microsoft only runtime.

The Foundry model catalog has 1,900 plus models across foundation, reasoning, small, multimodal, domain, and industry categories. Azure Direct sold by Microsoft covers OpenAI, DeepSeek, Meta, Mistral, Cohere, NVIDIA, and Microsoft's Phi. Partners and community covers Anthropic Claude via Models as a Service, plus hundreds of Hugging Face models on managed compute. The tool catalog has 1,400 plus entries including MCP servers added directly from the portal.

Microsoft Agent Framework shipped 1.0 on April 2 for both Python and .NET. The .NET 1.1.0 landed April 10. It is the successor to AutoGen and Semantic Kernel, built by the same teams. The pairing with Foundry is the tightest in the industry. Only agent-framework-foundry is generally available. Every other provider package, Anthropic, Bedrock, Cosmos, AI Search, Durable Task, Azure Functions, Copilot Studio, Purview, is beta.

Identity through Entra. Every agent gets a dedicated Entra identity with RBAC scoped to the resources it needs. Entra Agent Registry catalogs the deployed agents. Defender for Foundry Tools surfaces prompt injection, jailbreak, and cross-prompt injection attack alerts. Application Insights and OpenTelemetry are built in.

Foundry Local is the quiet differentiator nobody is matching. Generally available in the 2026 wave. C#, JavaScript, Rust, Python SDKs. ONNX Runtime under the hood. OpenAI compatible API. No Azure subscription required. Runs on Windows, macOS, and Linux. Catalog includes GPT OSS, Qwen, DeepSeek, Mistral, Phi, Whisper. Same SDK patterns as cloud Foundry. Neither Vertex nor Bedrock has a first party on device counterpart with the same ergonomics. For healthcare on-premises, government air-gapped, or edge scenarios, this is the real story.

Compliance is the broadest portfolio in the cloud market. FedRAMP High via Azure Government. HIPAA with BAA. HITRUST. SOC 1, 2, 3. ISO 27001, 27017, 27018, 27701. PCI DSS. EU Data Boundary. Microsoft Cloud for Sovereignty for the EU sovereign stack. 21Vianet for China. India RBI, IRDAI, MeitY. When procurement asks for the compliance matrix, it already exists.

Limitations worth naming. Hosted agents preview does not yet support private networking, which matters for regulated scenarios. The evaluations SDK in MAF is still marked experimental. The rebrand has created documentation churn: classic portal, new portal, old service names in old tutorials. Expect confusion during Q2 2026 buying conversations.

Google Vertex AI Agent Engine plus ADK

Vertex AI Agent Engine is the managed agent runtime inside Vertex AI Agent Builder. The API object is still named ReasoningEngine for backward compatibility, a reminder that this product has been through two name changes.

Services inside Agent Engine, April 2026 state: Runtime is generally available, autoscaling, VPC Service Controls, configurable IAM, managed containerization. Sessions is generally available, durable per user conversation state. Memory Bank is generally available, cross session long term memory using Gemini models to generate memories, IAM Conditions support, regional ML processing. Code Execution is generally available, sandboxed code execution for agent generated code. Example Store is in preview, stores few shot examples. Quality and Evaluation is in preview, integrates the Gen AI Evaluation service and supports Gemini fine tuning for agent optimisation. Threat Detection is in preview, built into Security Command Center for attack pattern monitoring. Agent Identity is in preview.

The framework support tiers are explicit. Full integration covers ADK, LangChain, and LangGraph. Vertex AI SDK integration covers AG2 and LlamaIndex. Custom template covers CrewAI and everything else. Agent Engine runs agents speaking the A2A protocol natively. ADK itself has Python, Java, and Go SDKs, with Python the most mature.

Deployment is Terraform driven through the Agent Starter Pack. Pre built templates for ReAct, RAG, multi agent patterns, a playground UI, automated Cloud Build CI/CD, Cloud Trace and Cloud Logging wired in. Observability is Cloud Trace with OpenTelemetry, Cloud Monitoring, and Cloud Logging. Agent Engine specific dashboards surface latency, errors, token usage.

Models are any model accessible to Vertex AI. Gemini 2.x as first class. Model Garden includes Anthropic Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5, Meta Llama, Mistral, and others. Non Gemini models are a first class option, not a workaround.

Compliance. HIPAA is explicitly supported. VPC Service Controls. Customer Managed Encryption Keys. Data Residency at rest. Access Transparency and Access Approval. Private Service Connect for private VPC egress. FedRAMP is covered through Google Cloud's broader FedRAMP High Assured Workloads programme.

Where Vertex wins: GCP native shops, Gemini first, Java or Go preferred languages, BigQuery data gravity, existing Vertex AI investment. The Memory Bank primitive is genuinely unique and worth studying even if you do not pick Vertex. The first party Gen AI Evaluation integration is the cleanest eval story any managed platform ships.

Where it loses: cross cloud deployments, Azure shops, and teams with zero Google footprint. The Agent Engine feature split between generally available and preview is the most complex among the three hyperscalers. Your feature selection affects your support tier.

AWS Bedrock Agents plus Strands plus AgentCore

AWS runs two paths in parallel and the split matters.

Bedrock Agents is the config first managed agent service, generally available. You define Action Groups as OpenAPI schemas plus Lambda functions or return of control callbacks. Knowledge Bases provide retrieval augmented generation as a service, with ingestion from S3, SharePoint, Confluence, Salesforce, or the web, and vector storage on OpenSearch Serverless, Aurora PostgreSQL pgvector, Pinecone, Redis Enterprise, MongoDB Atlas, or Neptune Analytics. Guardrails for Bedrock cover content filters, PII redaction, denied topics, word filters, and contextual grounding checks. Prompt Flows is the visual canvas orchestration tool. Multi agent collaboration is generally available, explicitly hierarchical supervisor and collaborator topology rather than a free form graph.

Strands Agents is the open, Apache 2.0 licensed agent SDK. Code first, Python only for now. Latest release is 1.35.0 on April 8, with Bedrock Service Tier control for Priority, Standard, and Flex as a unique feature. Strands is deliberately portable. You can run a Strands agent anywhere. Bedrock is the preferred deployment target but not a required one.

Bedrock AgentCore is the newer managed runtime that hosts Strands, LangGraph, and ADK agents on Bedrock infrastructure. It was announced at AWS re:Invent 2024 and exists in the Bedrock top navigation as of April 2026. Specific generally available status has been moving and is worth verifying with your AWS account team at contract time.

Models on Bedrock as of April 2026. Anthropic Claude Opus 4.6 at 5 dollars input and 25 dollars output per million tokens. Claude Sonnet 4.6 at 3 and 15. Claude Haiku 4.5 at 1 and 5. Amazon Nova family across Understanding, Creative, Speech to Speech, and Embeddings. Meta Llama 4. Mistral. Cohere Rerank 3.5. DeepSeek v3.2 at 62 cents input and 1.85 dollars output per million, which is notable. Google Gemma 3. MiniMax. Qwen. Stability AI. TwelveLabs. Writer. Z AI.

Pricing model. Agents themselves carry no separate per agent charge. You pay for underlying model tokens plus Knowledge Base and Guardrails usage. Batch inference is 50 percent discount on most models. Service tiers: Standard, Flex at 50 percent discount on best effort latency, Priority at 75 percent premium on lower latency, Reserved for capacity commitment.

Compliance. HIPAA BAA eligible. SOC 1, 2, 3. PCI DSS. ISO 27001, 27017, 27018. Bedrock is available in AWS GovCloud with FedRAMP High, which is the primary pathway for U.S. federal agencies running Anthropic workloads. Regional availability covers all the major AWS regions plus the new European Sovereign region.

Where AWS wins: AWS native shops, existing Bedrock investment, Claude as the preferred model family with FedRAMP High coverage, strict latency SLAs (Service Tiers are real), and mixed model fleets with strong DeepSeek or Llama economics.

Where it loses: Azure native shops, GCP native shops, and teams that need a true graph orchestration framework rather than hierarchical supervisor patterns. The Bedrock Agents config model is less flexible than AF's graph or LangGraph's Python code.

Anthropic Developer Platform

Anthropic's platform is the exception to the hyperscaler pattern. There is no Anthropic managed agent runtime. The SDK is the story. Claude runs on the Messages API plus tools, on Bedrock AgentCore, or on Vertex Agent Engine, whichever managed runtime you prefer. This is a deliberate positioning choice. Anthropic focuses its own engineering on models, the SDK, and safety. It leans on AWS and Google for managed deployment.

Model family April 2026, verified on platform.claude.com. Claude Opus 4.6 has 1 million token context, 128k output, 5 dollars input and 25 dollars output per million. Claude Sonnet 4.6 has 1 million token context, 64k output, 3 and 15 dollars. Claude Haiku 4.5 has 200k context, 64k output, 1 and 5 dollars. Extended thinking is supported across all 4.x models. Adaptive thinking is Opus and Sonnet only. Haiku 3 retires April 19.

Opus 4.6 and Sonnet 4.6 support 300k output tokens via the Batches API with the output-300k-2026-03-24 beta header. That is a material change for long report generation workflows.

Platform capabilities beyond the Messages API. Batches API at 50 percent discount. Prompt caching, with cache writes at roughly 1.25 times the base input rate and cache reads at roughly 10 percent of base. Files API for persistent cross request storage. Computer Use is still beta with a new zoom action on Opus 4.6, Sonnet 4.6, and Opus 4.5. Tool primitives include bash, text editor, and custom tools. MCP is first class, both on Claude.ai and the API.

Enterprise offerings. Claude for Enterprise provides SSO, audit logs, and fine grained access controls. Claude for Government is the dedicated product line for US national security customers, deployed in classified environments. Compliance certifications commonly cited include SOC 2 Type 2, ISO 27001, ISO 42001, HIPAA BAA through AWS Bedrock or GCP Vertex, and FedRAMP High via AWS GovCloud.

Notable customers from Anthropic press and partner announcements include Lyft, Snowflake, Notion, Pfizer, Quora, Robinhood, Asana, Zoom, LexisNexis, Intuit, Palo Alto Networks, and Palantir. The high profile 2025 announcement was Palantir plus Anthropic plus AWS for classified U.S. government workloads.

Where Anthropic wins: teams that want the best current model for agentic work, minimum vendor lock in, and the flexibility to run on any cloud's managed runtime. The Claude Agent SDK gives you the mature tool loop from Claude Code if you are building developer tools.

Where it loses: if you want a single vendor managed story end to end, you cannot get it from Anthropic alone. You pair them with a hyperscaler runtime.

Layer 3: The Open Frameworks

The platforms above are where production deployments land. The frameworks below are where the code is written. Some frameworks pair cleanly with a specific platform. Others run anywhere.

LangGraph plus LangSmith

The open cross cloud winner. Python and TypeScript near parity. The most mature graph semantics in the space. LangGraph 1.1.6 is the current stable. LangChain itself has repositioned as the high level wrapper built on LangGraph. 28,990 GitHub stars, largest ecosystem, trusted by Klarna, Replit, and Elastic per their own README.

State management is the strongest story in the field. langgraph-checkpoint-postgres for production durability, plus SQLite, Redis, and in memory checkpointers for lighter scenarios. Durable execution is the headline feature. Human in the loop through interrupt() and Command(resume=...). State can be inspected and modified mid flight.

LangSmith is the flagship tracing and eval companion. Battle tested in production. The coupling is tight, which is a benefit and a tax. Production grade observability without LangSmith requires your own OTel pipeline work.

Where it wins: Python first multi cloud shops, graph oriented orchestration needs, teams with the ops maturity to run checkpointers and LangSmith. Hosted Agents on Foundry explicitly accept LangGraph. Vertex Agent Engine has LangGraph as a full integration tier.

Where it loses: .NET shops (TypeScript support exists but Python is first class), procurement contexts where open source plus SaaS observability is a harder sell than a hyperscaler contract, and single cloud shops where the hyperscaler's own platform is the easier path.

CrewAI

The "team of agents" metaphor, 48,643 GitHub stars, Python only. Two architectural modes: Crews for autonomous role playing agents, Flows for event driven single LLM call precision. CrewAI AMP Suite is the enterprise bundle with tracing, Control Plane, and on-premises or cloud deployment.

CrewAI is the fastest path to a multi agent proof of concept. The role, goal, and backstory metaphor maps onto a slide deck and a demo in hours. For prototyping and validating the idea, it is hard to beat.

Where it wins: role based automations (research, sales, operations), teams building "a team of agents" products, and proof of concept speed.

Where it loses: anywhere you need graph orchestration, enterprise procurement where open source plus AMP Suite pricing is a harder sell, and production systems where the looser orchestration becomes a constraint.

Pydantic AI

Best static typing story in the field, from the Pydantic team. Python only. 1.80.0 is the current stable. Capabilities, Agent Specs in YAML or JSON, server side compaction capabilities for OpenAI and Anthropic shipped recently. Pydantic Logfire is the companion observability product, OTel based.

Durable execution through DBOS and Temporal style backends. Human in the loop with per tool approval that can be conditional on call arguments, conversation history, or user preferences. MCP, A2A, and AG-UI all integrated natively.

Where it wins: teams already using Pydantic or FastAPI, type safety as a design priority, novel approaches like YAML agent specs for code free deployment, and openness about where compaction and caching happen.

Where it loses: anywhere graph support is central, since graph is not the primary pattern. TypeScript and .NET shops.

Mastra

TypeScript first, from the team behind Gatsby. 22,906 stars. Y Combinator W25. Currently the only serious TypeScript option with graph workflows, MCP server authoring, and suspend and resume. Dual license with Apache 2.0 core and enterprise license for specific modules.

Where it wins: TypeScript and Next.js product teams, Node backend shops, teams shipping AI features into existing web apps.

Where it loses: non TypeScript stacks, enterprise contexts where the YC stage still matters for procurement.

OpenAI Agents SDK

Pre 1.0 after a year of public development. Version 0.13.6 as of April 2026. Python plus a separate TypeScript SDK. Provider agnostic. Primitives are Agent, Runner, Handoffs, Tools, Guardrails, Sessions, and Tracing. Realtime Agents for voice with gpt-realtime-1.5 are a differentiator.

Where it wins: teams already on the OpenAI Responses API who want minimum ceremony, voice agent builders.

Where it loses: anywhere you need durable execution or graph orchestration. The SDK is explicitly lightweight, not a production workflow engine.

Claude Agent SDK

Pre 1.0, Python, wrapping the Claude Code CLI. Built specifically to give developers programmatic access to the agent loop that powers Claude Code. Release cadence is near daily.

Where it wins: coding agents, internal developer tools, anything that wants the mature Read, Write, Edit, Bash tool ergonomics from Claude Code without rebuilding them.

Where it loses: multi agent orchestration, non coding use cases, production systems that need stability guarantees from a pre 1.0 SDK.

The Protocols That Tie It All Together

Model Context Protocol

MCP is the de facto standard for tool and context integration as of April 2026. Current spec revision is 2025-11-25. Every non legacy framework in this guide supports MCP natively: Claude Agent SDK, Microsoft Agent Framework, Google ADK, Pydantic AI, Mastra (bidirectional, consume and author), OpenAI Agents SDK, Strands Agents, CrewAI through adapters, Semantic Kernel.

Practical implication: you can write your tool integrations once as MCP servers and call them from any framework. That changes the economics of framework choice. The lock in cost is lower than it looks.

Agent to Agent Protocol

A2A reached 1.0.0 on March 12, 2026, under Linux Foundation governance. The spec refactor separated application protocol from transport bindings, modernised OAuth 2.0 to remove implicit and password flows and add device code and PKCE, added multi tenancy via gRPC scope fields, and shipped tasks/list with filtering and pagination.

Adoption is narrower than MCP but growing. Microsoft Agent Framework advertises cross runtime interoperability via A2A. Google ADK has A2ATransport as a default supported transport. Vertex Agent Engine runs agents speaking the A2A protocol natively. Pydantic AI has A2A integration. For cross framework, cross cloud, cross team agent interoperability, A2A is where to bet.

The Protocol Bet

A bet on frameworks is a snapshot of one moment. A bet on protocols is durable. Design your agent system to speak MCP and A2A fluently. Your tools, your inter agent messages, and your external integrations all go through standardised protocols. The underlying framework becomes swappable. That is the architecture I recommend to every client planning a production build this year.

Lock In Analysis

What is hardest to migrate away from, in order:

Agentforce and ServiceNow are hardest. Your agents are objects in someone else's metadata model. Migration means rebuild.

Bedrock Agents config first is next. Action Groups, Knowledge Bases, and Guardrails are AWS native objects. The Strands SDK path deliberately reduces this because Strands agents can move off Bedrock.

Vertex Agent Engine is similar. Sessions, Memory Bank, and Example Store are Google native. ADK itself is open and portable. The surrounding services are not.

Microsoft Foundry is similar. MAF SDK is open. The Foundry runtime, tool ecosystem, and Entra Agent Registry are not.

Open SDK plus your own runtime is the lowest lock in to infrastructure. You are still locked to the SDK's abstractions, which means a framework change is a rewrite, but the infrastructure is yours.

What is practically sticky across all layers is your prompts, evals, and tool schemas. These are portable in theory and rarely in practice. Teams underestimate how much implicit behaviour is encoded in prompt tuning for a specific orchestrator.

The Migration Test

Before committing to any layer, pick one agent. Build it twice. Once on the SaaS layer or managed platform you are considering. Once on an open SDK deployed to your own runtime. Measure four things.

Time to first production deployment.

Per conversation cost at ten times your current volume.

Evaluation score against your golden dataset.

Time to add a new tool, specifically a customer specific integration your vendor does not ship.

The answers will not match what the vendor decks suggested. They rarely do. The difference between what the decks say and what the numbers show is the single most valuable piece of evidence you can bring to a platform selection meeting.

What Is Coming in Part 3

This piece mapped the layers and named the options. The next piece goes into what you build inside whatever layer you pick.

Part 3 covers the building blocks. Memory patterns. RAG patterns including agentic, graph, and contextual retrieval. Tools and capabilities beyond MCP including computer use and code interpreters. Context engineering and prompt caching economics. Evaluation frameworks. Safety and guardrails. Cost management. Identity and authentication. Planning patterns.

These are the patterns every production agent build hits regardless of framework. Get them right and your architecture compounds. Get them wrong and you are rebuilding in six months.

The Agentic Engineering Field Guide, Part 1: How I Evaluate Agent Frameworks

Navneet Singh — Fri, 24 Apr 2026 07:28:38 GMT

Why I Am Writing This Series

Agentic AI went from research demos to production systems in eighteen months. Faster than microservices. Faster than containers. Faster than the mobile shift for enterprise. The consequence is that most engineering teams are picking agent frameworks the way people used to pick JavaScript frameworks in 2015. You commit to one. Three months later you realise it does not answer your actual production problems. You rewrite. That rewrite costs six to twelve months. Some of my clients are on their third framework. A few are on their fourth.

This series is the guide I wish I had eighteen months ago. It is not a product comparison. It is the mental model I use when a client asks me which framework to bet on, what questions to ask, what to worry about, and where this is all heading.

One note before we start. I build mostly on the Microsoft stack. My client base is healthcare, financial services, and government, most of them .NET heavy and Azure native. That shapes the view in this series. Where other stacks win, I say so. Where I land on Microsoft Agent Framework plus Microsoft Foundry, the reasoning is in Part 4. Read the whole thing for the balanced view. Read Part 4 for my pick.

Four parts.

Part 1 (this piece) is about the frame. The six questions every production agent system must answer before you touch a framework. The orchestration patterns you will actually use. The production readiness checklist.

Part 2 walks the landscape. Microsoft Foundry plus Agent Framework. Google Vertex plus ADK. AWS Bedrock plus Strands. Anthropic, OpenAI, LangGraph, CrewAI, Pydantic AI, Mastra. Protocols. Plus the enterprise SaaS agent layer (Agentforce, ServiceNow) because your buyers will ask about it.

Part 3 goes into the building blocks. Memory. RAG. Tools. Context engineering. Safety. Cost. Evaluation. Identity. The patterns every production build hits regardless of framework.

Part 4 is where I land and why. The explicit recommendation for my typical client. When I do not pick it. Three war stories across different stacks. Where this is all heading. And a thirty minute CTO decision framework you can run with your team this week.

If you bookmark this, I will update it every quarter. The frameworks churn. The questions do not.

The Six Questions Every Production Agent System Must Answer

Start with the questions. Frameworks are answers to questions. If you pick an answer before you have written down the question, you will pick wrong. I ask every client these six questions before we pick a stack. In that order.

1. How do you manage state across long running workflows?

Agents are not stateless. A real workflow runs for minutes. Sometimes hours. Sometimes days. State is the conversation history, the intermediate tool outputs, the decisions already made, the files already generated, the approval statuses from earlier in the flow. If your state lives in memory, your workflow dies when the process restarts. If your state lives in a flat key value store, you cannot branch, merge, or rewind.

The production answer is durable, typed, serializable state that survives process restarts. Ideally, state you can inspect with a debugger and modify if needed.

Failure mode when you get this wrong: you cannot recover from a failure halfway through a long workflow. You cannot reproduce a decision three days later when a user disputes it. You cannot run your agent on Azure Functions or AWS Lambda because your state assumes a long lived process.

2. How do you recover from partial failures?

Real workflows fail. APIs time out. LLMs return malformed JSON. Networks partition. Rate limits hit. A production agent system does not restart from the top when step nine of twelve fails. It resumes from the last known good state.

This is where the checkpointing story matters. Every framework claims to handle failure. Not all of them actually do. Ask to see the code that runs when a step fails. If there is no checkpointing, there is no recovery. If checkpoints serialize only the happy path, they will not survive a malformed intermediate output.

Failure mode when you get this wrong: your costs scale with your failure rate. You burn tokens re running the first eight steps of every failed workflow. Results diverge on retry because temperature is not zero. Customer-facing outputs become inconsistent for identical inputs. The client services team ends up explaining to the lawyer why today's contract summary differs from yesterday's for the same document.

3. Where does a human approve before writes happen?

The single most important architectural decision in enterprise agentic AI. Where is the gate between "the agent has decided what to do" and "the agent has done it?"

If you have no gate, you cannot ship to a regulated client. You cannot ship to a client whose legal team has seen an agent make a mistake. You cannot ship to most enterprises at all.

The gate has to be durable. Nobody approves a contract in the same millisecond the agent generated it. The approval arrives three hours later, from a different person, on a different machine, possibly after the original workflow process has died. The framework has to hold that pause. In my experience, this is the feature where most frameworks either shine or quietly fail. The difference is whether the pause is an in-memory await (dies on restart) or a durable suspend (survives weeks if needed).

Failure mode when you get this wrong: you ship write operations before human review. Either nothing gets approved because the workflow dies while waiting, or the agent writes something catastrophic and you spend a quarter explaining it to compliance.

4. How do you observe, debug, and audit agent decisions?

Every agent decision will be questioned eventually. By a user. By a compliance officer. By a regulator. By your own engineering team during the post mortem. If you cannot reconstruct why an agent chose a particular path, you have an audit problem and a debugging problem at the same time.

What you need: structured traces with inputs, outputs, tool calls, prompts, and timing. Exportable. Queryable. Tied to workflow runs, not just LLM calls. Retention that meets your compliance requirements. Ideally, the ability to replay a workflow from a captured trace.

This is the area where most frameworks are weakest. The base tracing is usually OpenTelemetry compatible, which is good. The production layer on top, where you actually query and alert and audit, almost always requires a second tool. LangSmith. Braintrust. Langfuse. Foundry Observability. Pydantic Logfire. Budget for one of these from day one.

Failure mode when you get this wrong: a client asks why the agent made a specific decision. You have no answer. The compliance officer asks for an audit trail for the last ninety days. You have log files with free text that cannot be queried. Your engineering team guesses at why a production regression happened.

5. How do you test systems with non deterministic components?

Agent systems are non deterministic. Traditional unit tests do not work cleanly. You need evaluations, not just assertions. Rubric based scoring. Regression suites that compare behaviour across model versions. A way to simulate failure modes at specific workflow steps. An LLM-as-judge harness for open ended outputs. Red team tests for prompt injection and jailbreak attempts.

This is where the maturity of the ecosystem matters. Evaluation frameworks exist. Most are young. The integration between agent frameworks and eval tooling is uneven. I have seen teams ship to production with no regression suite because the eval story was too painful to build. Then a model provider bumps a minor version and the agent's accuracy silently drops ten points. Nobody notices for two weeks.

Failure mode when you get this wrong: you ship to production with no way to know if a model upgrade broke your workflow. You cannot defend accuracy numbers to a client. You are testing manually by running the agent and reading the outputs, which does not scale past four people.

6. How do you compose specialists versus run generalists?

The question most teams get wrong. A single agent with many tools feels simpler. Until the tool count exceeds twenty. Until a single prompt has to accommodate ten different use cases. Until you need to give different teams ownership of different capabilities.

Multi agent composition is not always the answer. Sometimes one agent with good tool design is better. The question is when to split, when to stay monolithic, and what the cost of composition is. Every handoff between agents adds latency, context loss, and failure modes. Every specialist adds a new prompt to maintain. Every graph edge adds a routing decision that can go wrong.

Graph based orchestration frameworks give you the escape valve when the single agent model runs out. Prompt-only frameworks do not. If you expect your system to grow in scope, you need the graph option available even if you do not use it on day one.

Failure mode when you get this wrong: your agent works for six weeks. Then as you add features, response quality collapses. Latency climbs. Prompt length becomes unmanageable. You cannot onboard a new team onto the agent without teaching them the entire system. You end up rewriting as a multi agent graph under time pressure, which is the worst time to do it.

The Orchestration Patterns You Will Actually Use

Regardless of framework, these are the patterns. The names vary by vendor. The shapes repeat. Every production build I do combines two or three of these in a single workflow.

Sequential. Agents run one after another. Output of agent one becomes input to agent two. Useful when the process is linear and the order matters. Example: a due diligence workflow where you first extract entities from a document, then look them up in a registry, then generate a summary.

Concurrent. Agents run in parallel on the same input. Results get aggregated. Useful when you want multiple independent perspectives. Example: four agents each reviewing a legal clause for a different risk (compliance, commercial, IP, operational). Aggregated into a single risk summary. Faster than sequential because the agents do not wait for each other. More expensive because you pay for all branches whether you use the output or not.

Hand off. One agent decides which specialist to route to next based on context. Useful for triage patterns. Example: customer service routing. A front line agent reads the request, decides if it is billing, technical, or account access, routes to the right specialist. The routing decision itself is an LLM call with a constrained output. This is where framework maturity matters. In immature frameworks the routing agent hallucinates a specialist that does not exist and you discover it in production.

Hierarchical. A manager agent delegates to workers. Workers report back. The manager composes results and decides what to do with partial results. Similar to concurrent but with explicit supervision. Example: code review pipelines where a reviewer agent delegates to style, security, and test coverage specialists, then composes the review comments into a single PR review.

Magentic or dynamic planning. The orchestrator maintains a shared task ledger, delegates, observes results, and re plans. Useful for open ended problems where the right sequence of steps cannot be determined upfront. Example: research tasks. "Figure out the compliance posture of this company" is not a fixed sequence. It is a loop of searches, reads, cross checks, and synthesis. Magentic patterns are powerful and expensive. Use them only when the problem genuinely needs dynamic planning. Most problems do not.

Event driven. Agents react to events from external systems rather than being driven by a single entry point workflow. Useful when the agent sits inside a larger event driven architecture. Example: an agent that monitors a support queue, picks up tickets matching a pattern, processes them, publishes results back. This is not a replacement for the other patterns. It is a harness around them.

The framework question becomes: does my framework let me compose these patterns, or does it lock me into one model? This is where graph based frameworks pull ahead of purely hierarchical ones. If you want to build a hand off inside a hierarchical flow inside an event driven harness, you need a framework that treats the graph as the primary abstraction.

The Production Readiness Checklist

Before you ship, walk through this list. If you cannot answer yes to all of it, you are not shipping a product. You are shipping a demo with uptime. I run this with every client in the week before go live.

State - State is serialized and durable across process restarts - State schema is versioned so you can migrate between deployments - You can inspect, modify, and re run from any point in the workflow

Failure recovery - Individual step failures resume from the last good checkpoint - Tool timeout and retry behaviour is configurable per step - You can kill and restart the entire workflow engine without losing in-flight work

Approval gates - Human approval is a durable pause, not an in memory await - Approval notifications go through your real systems (email, Slack, queue) - Gate timeouts have defined behaviour (auto reject, auto escalate, notify)

Observability - Every agent decision is traced with inputs, outputs, prompts, tool calls, timing - Traces are queryable by workflow run and by user session - Trace retention meets your compliance requirements - You can replay a workflow from a trace

Evaluation - You have a regression suite that runs before every deployment - You can detect behaviour drift across model upgrades - You have rubric based scoring for open ended outputs - You have a red team suite for prompt injection and jailbreak

Cost - You know the token cost per workflow run, not just per call - You have budgets that halt runs that exceed thresholds - You can route to cheaper models for easy subtasks - You have prompt caching enabled where the provider supports it

Compliance - Data residency is enforced at the framework level, not just the model provider - You can explain any agent decision to a regulator - Sensitive data is redacted before it reaches external providers - You have an audit log that is separate from your application logs

Rollback - You can roll back a prompt change without a full redeploy - You can pin model versions - You can A/B test prompts and agent configurations in production

If you build this checklist into your architecture from the start, framework choice matters less. If you do not, no framework saves you. I have watched teams with the best framework in the space fail in production because they skipped the checklist. I have watched teams with a mediocre framework ship reliably because they built the checklist in from day one.

The checklist is the thing. The framework is the accelerator.

What Is Coming in the Rest of the Series

The rest of this series maps the landscape against the questions.

Part 2 covers frameworks and platforms. The sharper mental model is that every hyperscaler now has a two layer story. An open SDK on the bottom (the framework). A managed platform on top (the platform). Microsoft has Agent Framework and Foundry. Google has ADK and Vertex. AWS has Strands and Bedrock. Anthropic has the Claude Agent SDK and the Claude developer platform. LangGraph plus LangSmith is the credible open source version of the same story. The decision that comes before framework choice is framework versus platform. Part 2 covers all of it.

Part 3 goes into the building blocks you will use inside whatever framework you pick. Memory patterns, the ones that work and the ones that look good on slides but fail in production. RAG patterns, classic and agentic and graph based. Tool patterns beyond MCP. Context engineering, including prompt caching economics. Evaluation frameworks. Safety and guardrails. Cost management. Identity and auth for agents. This is the implementation reality every CTO needs to plan for.

Part 4 is where I land. The recommendation I make to most of my enterprise clients. The explicit "when I do not pick it" caveats. Three client war stories across different stacks. Where agentic AI is heading over the next twelve months. And a thirty minute decision framework you can run with your team.

If you are evaluating a stack for a production build and want a second pair of eyes before you commit, reply to this email or find me on LinkedIn. I read every reply. I will also note which parts of this guide readers push back on the hardest. Those become the sections I rewrite in the quarterly update.

Advanced Claude Code Techniques: Managing Context, Sessions, and Token Efficiency

Navneet Singh — Sat, 18 Apr 2026 09:47:19 GMT

You already know the basics. You write slash commands, you wire up tools, you let the model edit files. And yet your sessions are bleeding tokens. A ten-minute task somehow swallows 400K tokens, and you hit a rate limit before lunch. The model didn’t get worse. Your context got fat.

Every section below is a problem, a tactic, and a tradeoff. Nothing else.

1. The core mechanic: why context is the bottleneck

The model is stateless. There is no memory on the server. Every single turn, every keystroke you send, every tool call the model makes, the client repackages the entire conversation so far and re-sends it. System prompt, all tool schemas, every user message, every assistant message, every tool result. All of it. Every turn.

The 1M-token context window is forgiving, and that’s the problem. You never get the “context full” signal that would force you to clean up. You just get a steadily rising bill and, eventually, a rate-limit slap.

Prompt caching softens the blow but doesn’t fix it. Anthropic’s prompt cache is a server-side optimization that stores the prefix of a request, with a default TTL of about 5 minutes (a 1-hour option exists for an extra cache-write premium). When your next turn starts with the exact same bytes as the cached prefix, you skip re-processing and pay a cache-read rate of roughly 10% of input cost. The model is still stateless. The cache is just a network and compute shortcut for repeated prefixes. It doesn’t remember anything. A few things to know:

The cache works on the entire request prefix, including system prompt, tool definitions, prior assistant messages, and tool results. Not just user reads.
Any Edit to a file the model already read breaks the cache from that point forward. The tool result for that Read now contains different bytes, and everything after it re-bills at full rate.
Pause for more than 5 minutes (coffee, meeting, lunch) and the default cache evaporates. Your next turn pays full price for the whole conversation. The 1-hour TTL is worth setting on long sessions if your client supports it.
Cache doesn’t shrink the conversation. You still pay output tokens for every response, and the uncached portion after any change gets billed in full.

The math of tool bloat. A single Read on a 500-line file is roughly 15-25K tokens (line numbers add bulk). A Grep across a large repo with output_mode: "content" and no head_limit can easily return 30K+ tokens. A screenshot from Chrome DevTools MCP comes back as base64. A 100KB PNG inflates to about 135KB of text, which on Claude’s tokenizer comes out around 35K tokens when the screenshot is inlined as text content. (Images sent through the proper vision endpoint are billed by image size, not base64 expansion. Different math.) Ten of those text-blob screenshots and you’ve burned 350K tokens on pictures. The conversation doesn’t forget any of it. Every subsequent turn re-sends all of it.

The model’s intelligence is fixed per turn. The cost is determined by what you put in front of it. That’s the whole game.

2. Context hygiene tactics (per-turn wins)

These are the cheap wins. Do them reflexively.

Never re-Read a file already in context

If a file appeared in the conversation this session, whether through Read, a Grep with -C, or a tool’s output, it’s still there. Scroll up mentally. Don’t re-read it.

Claude Code now tracks file state and will often warn you when you re-read. The discipline still has to be yours. The common anti-pattern: reading a file, editing it, then re-reading the whole thing to “verify.” The edit tool already errors if the change didn’t apply cleanly. Re-reading a 600-line file to confirm a 3-line change is an 18K-token mistake, and I’ve watched myself make it more times than I want to admit.

Scoped reads beat full-file dumps

Before you Read a big file, Grep for the symbol you care about. Then Read with offset and limit around the hit.

# Bad: reads all 1,200 lines of the controller
Read(file_path="...OrderController.cs")

# Good: find the method, read only the window you need
Grep(pattern="ProcessRefund", path="...OrderController.cs", -n=true)
Read(file_path="...OrderController.cs", offset=340, limit=80)

A 1,200-line file is roughly 30K tokens. An 80-line window is roughly 2K. Do this 20 times in a session and you’ve saved half a million tokens.

Stop using Bash for file operations

Never cat, head, tail, find, ls -R, grep, or rg from the Bash tool. Use Read, Glob, and Grep.

Bash output is unstructured text that the model has to re-parse. The dedicated tools are optimized. Grep uses ripgrep internally with sensible defaults. Glob returns sorted paths. Read adds line numbers the model can reference in follow-up Edit calls. Bash find . on a node_modules-adjacent tree can dump 50K+ tokens of garbage.

The only time to shell out for file ops is when you need something structurally impossible otherwise, like git log --stat for change history.

Budget your screenshots

Treat browser and Figma screenshots as expensive. Take one, do the work, don’t take another unless state changed.

Chrome DevTools MCP screenshots run 80-150KB of PNG data before base64 inflation. Figma get_screenshot is similar. If you’re iterating on a UI fix and take a screenshot after every edit, you’re paying tens of thousands of tokens per iteration just to look at the page. Make the full change, then screenshot once. If you need to inspect a specific element, use get_page_text or the DOM tools. They’re a fraction of the size.

Quiet your build and test output

Configure your build commands for minimum output. Verbose logs are context poison.

# .NET
dotnet build -v q --nologo
dotnet test --verbosity quiet --nologo

# Node
npm ci --silent
npm run build -- --silent
pnpm install --reporter=silent

# Python
pytest -q --no-header

A noisy dotnet build emits 8-15K tokens of per-project spam nobody reads. Quiet mode drops it to a few hundred. Same story for npm install pulling down 600 packages with progress bars.

If a build fails, re-run verbose. Don’t run verbose by default.

3. Session boundary management (the hard problem)

This is the section most people skip and then wonder why they’re out of tokens at 2pm.

The reality

Sessions don’t have clean endings. You’re not going to finish one discrete task, run /clear, and start the next. Real work drifts. You start investigating a bug, end up reading unrelated code for context, fix the bug, then pivot to a related refactor, then someone asks about the deployment. By hour three, your context is 80% sediment. Tool results from three tangents ago that nobody needs anymore.

The goal isn’t clean sessions. The goal is to shed context aggressively when topics shift.

Four tools, four use cases

Tool What it does When to use /clear Wipes conversation history. Clean slate. Starting a genuinely different task. Previous context has no value to the next step. /compact Summarizes conversation so far into a short blob; keeps working state. Mid-task when context is bloated but you still need continuity. /rewind (Esc-Esc) Restores to a prior prompt; you choose files only, conversation only, or both. You went down the wrong path two turns ago and want to back up without losing earlier work. Handoff (lifecycle) Write key state to a timestamped file under ./.claude/handoffs/, then /clear. Between phases of long work, especially with multiple threads in flight, where you want deterministic continuity instead of a model-generated summary.

/rewind is newer and underused. It auto-checkpoints before every Edit/Write call (note: Bash-modified files are NOT tracked). Picking “restore both” actually forks the session, so you can branch and explore. Combined with git commits, you have two layers of undo: /rewind for fine-grained tool-level steps, git checkout for code-level rollback.

`/compact` is the underused middle ground

Most people treat /compact as automatic. “It’ll happen when I hit the limit.” Use it deliberately, with instructions.

/compact Keep the database schema we worked out and the list of files changed. Drop all the exploratory reads and the Stack Overflow tangent.

The instruction steers what survives. Without instructions, you get a generic summary that may discard exactly the thing you needed. The token cost of /compact is real because it does a full-context read to produce the summary. But the next 20 turns operate on a fraction of the prior bulk, which is the whole point.

Tasks: persistent state inside a session

TaskCreate, TaskList, TaskUpdate, TaskGet. Tasks are file-backed work items that survive /clear and can be read by parallel subagents. Earlier versions called these “Todos” and they vanished with the session. Tasks now persist, support addBlockedBy / addBlocks dependency chains, and can coordinate across worktrees via CLAUDE_CODE_TASK_LIST_ID.

Use them when a piece of work has 3+ stages, when you want a subagent to pick up where you left off, or when you genuinely want progress visible across /clear boundaries. Don’t use them for trivial single-step work.

The handoff lifecycle (not just a note)

When you’ve hit a natural phase boundary, write the essential state to disk, then clear.

The naive version is one file: write ./HANDOFF.md, /clear, read it in the next session. That works exactly until the moment you pause feature A to fix urgent bug B. Now B’s handoff overwrites A’s, and when you come back to A next week you’ve got nothing. Writing the note is easy. Managing handoffs over time is the real problem. Multiple concurrent threads, detecting when a handoff has gone stale, archiving the ones that are done.

Plenty of people solve the single-file problem. Fewer solve the lifecycle. What I built, running daily in production now, is a directory of handoffs, three custom slash commands, and staleness detection driven by git reachability. None of it ships with Claude Code. You wire it up yourself.

Location

./.claude/handoffs/
  2026-04-13T01-07-42Z_auth-refactor.md
  2026-04-12T15-30-00Z_deploy-debugging.md
  archive/
    2026-04-10T09-00-00Z_feature-x-done.md

Per-repo, local-only, gitignored. Handoffs reference uncommitted working-tree state tied to one machine. They have no business being synced or committed. If the information is worth sharing across machines, that’s what git commits are for.

File header: the lifecycle anchor

Every handoff starts with a YAML frontmatter block:

---
created: 2026-04-13T01:07:42Z
topic: auth-refactor
branch: development
head_commit: abc1234
uncommitted_files: 3
status: active
---

head_commit is the hook that makes staleness detection possible. If that commit no longer exists (rebased away, branch force-pushed), the handoff is almost certainly pointing at code that isn’t there anymore. If current HEAD is 50+ commits ahead of it, the world has moved on.

Three custom slash commands do the whole lifecycle

/handoff writes a new one. - Captures branch, HEAD commit, uncommitted file count into the frontmatter. - Prompts for a topic slug (auth-refactor, not handoff-1). - Writes ./.claude/handoffs/_.md. - Adds .claude/handoffs/ to .gitignore on first run. - Recommends /clear when done.

/pickup picks up an old one. - Lists active handoffs newest-first: topic, age, branch, whether the head_commit is still reachable. - Auto-picks if there’s only one active. Otherwise you choose. - Surfaces staleness warnings before handing you the content. - Moves the file to archive/ after you’ve read it, so the same handoff doesn’t get picked up twice.

/handoff-prune does cleanup, typically run at session start. - Auto-archives handoffs older than 14 days. - Auto-archives any whose head_commit no longer exists or whose branch was deleted. - Flags (but doesn’t auto-archive) when current HEAD is 50+ commits ahead, or when the topic slug shows up in recent commit messages. Both signal the work probably landed. - Deletes files in archive/ older than 90 days.

Staleness rules at a glance

Signal Action Age >14 days Auto-archive head_commit no longer exists Auto-archive Branch deleted Auto-archive HEAD >50 commits ahead Warn on pickup, offer archive Topic slug appears in recent commits Warn on pickup (likely completed) Archive file >90 days old Delete

Why this beats the single-file version

Concurrent threads stop stepping on each other. Pause feature A, /handoff. Get pulled into bug B, /handoff again. Two files in active/, neither overwritten. /pickup lets you choose which one to continue.

Stale handoffs self-destruct. The prune logic runs at session start, so last month’s abandoned debugging note doesn’t sit around pretending to be live context.

You get a short-term audit trail. archive/ holds recently-completed work for a window, and when you catch yourself thinking “how did I approach this last time?”, the answer is often still on disk.

It stays local. Handoffs belong to the machine where the uncommitted work lives. Anything worth propagating across machines gets committed.

Anti-patterns

Don’t commit handoffs. They reference uncommitted local state. Gitignore them on day one.

Don’t skip the slug. handoff-1 tells you nothing a week later. auth-token-migration tells you exactly what you were in the middle of.

Don’t resume stale blindly. The staleness warnings are there because force-pushes and rebases happen. If the handoff points at a commit that no longer exists, trust the warning over the note.

Don’t treat handoffs as documentation. If a fact is worth keeping past two weeks, it belongs in memory, a project doc, or a commit message. Not in an ephemeral handoff file.

This beats /compact when you want control over what carries forward. You wrote the note, so you know what’s in it. A model-generated summary can quietly lose the one critical fact you needed.

Git commits as natural checkpoints

Commit at every logical boundary, not at the end of the session.

An uncommitted working tree is a reason not to /clear. The model needs the context to keep its work coherent. A committed working tree is freedom. The code is safe, the state is in git, and the next session can git log its way back to orientation in 2K tokens instead of re-reading everything.

The anti-pattern is hoarding 15 changes into one big commit at end of day. You’ve now locked yourself into one giant session for the whole day’s work, because clearing means losing context the model needs to finish.

Parallel sessions for genuinely different tasks

If you’re working on two unrelated things, open two Claude Code instances.

One terminal for the backend refactor. Another for the email template tweak. Neither pollutes the other. When you /clear in one, the other is untouched. Don’t try to multiplex unrelated work in a single session. The shared context is pure overhead for both.

`/resume` is your safety net

/clear is not permanent. /resume can pull back prior sessions on the same machine (sessions persist as JSONL under ~/.claude/projects/). This should make you more willing to clear, not less. If clearing turns out to be premature, you haven’t lost anything. You’ve just paid for a cleaner workspace.

The mental model

Stop thinking of context as “the conversation.” Think of it as the minimum state needed to produce the next correct action. Most of what’s in your context right now isn’t minimum state. It’s sediment.

4. Subagent delegation (biggest lever for long sessions)

If you take one architectural idea from this article, take this one.

Why subagents are the single biggest win

When you spawn a subagent, it runs with its own separate context. It does the searches, reads the files, runs the tools. Then it returns a summary, usually a few hundred to a few thousand tokens. All the tool noise stays in the subagent’s context and vanishes when the subagent finishes.

Your main session sees a task dispatch and a short report. Not the 50 files the subagent read to produce that report.

When to reach for a subagent

Broad codebase search. “Find every place we handle refund state transitions.” A grep-read-grep-read loop across 30 files is a context disaster inline. Perfect subagent work.

Log scans. “Check the last 500 lines of Application Insights for errors in the checkout flow.” The subagent can pull, grep, and summarize. You see a one-paragraph digest.

Multi-file exploration. “Trace how OrderId flows from the API layer down to the database.” Ten files opened, one summary out.

Any large read. A 2,000-line generated client file you need three things from. Subagent.

Specialized vs general-purpose subagents

A general-purpose subagent works for most exploration. For repeated flows, define specialized agents with tighter system prompts, restricted tools, and domain knowledge baked in. A sql-analyst that already knows your schema. A deployment-checker that knows your Azure setup. These run faster and produce tighter summaries because they don’t need orientation.

How to define a specialized agent

Agents are markdown files with YAML frontmatter at ~/.claude/agents/.md (global) or ./.claude/agents/.md (per-project). A definition has a name, description, model preference (usually Haiku or Sonnet), tool restrictions, and a system prompt. The description is what the main model reads when deciding whether to dispatch, so be specific. Tool restrictions are load-bearing. A log scanner doesn’t need Edit or Write.

Example ~/.claude/agents/log-scanner.md:

---
name: log-scanner
description: Scans Application Insights and Azure Log Analytics for specific
  error patterns. Returns concise summaries.
model: haiku
tools: [Bash, Grep]
---

You are a log-scanning specialist. Given a time window and search pattern:

1. Query the logs with `az monitor app-insights query` or equivalent.
2. Filter for the pattern, group by severity.
3. Return a one-paragraph summary with counts and top 3 most relevant messages.
4. Do NOT dump full log lines unless specifically asked.
5. If a query would return >500 lines, sample instead of returning all.

When is it worth defining one? If you’ve done a similar scan/query/check more than three times, turn it into an agent. Below three, general-purpose is fine. Above, the specialized agent pays for itself in tighter summaries and faster turnaround, and the definition doubles as documentation for future-you.

Background agents for long operations

Use run_in_background: true on Bash calls that take more than a few seconds.

Bash(command="dotnet publish -c Release -o ./publish", run_in_background=true)

You get a process handle back immediately, continue working, and get a notification when it finishes. Otherwise you’re blocking the conversation on a 90-second build. Same logic for terraform apply, long test runs, az webapp deployment, npm run build on a big Next.js app.

Worktree isolation for risky experiments

For subagent work that might break things (mass refactors, dependency upgrades, schema migrations), run the subagent in an isolated git worktree. The main branch stays clean. Pass isolation: "worktree" in the subagent invocation, or launch a parent session with claude --worktree . Either way you get a separate filesystem and branch. If the experiment works, merge. If it doesn’t, nuke the worktree.

5. Memory system for cross-session continuity

Memory is for things you want to persist beyond the current session without re-teaching them.

What belongs in memory

Stable facts. “The staging DB connection string lives in Key Vault secret db-staging-cs.” Doesn’t change often.

Incident learnings. “Last time we deployed to prod on a Friday, the Redis connection pool exhausted. Don’t do Friday deploys.”

User preferences. “Always use DateTimeOffset not DateTime.” These also belong in CLAUDE.md, but memory is better for context-dependent preferences.

What does NOT belong in memory

Transient state. “We’re currently debugging the refund bug.” That’s session state; use a handoff note.

Things that change. “The current version is 1.4.2.” It won’t be in three weeks.

Large documents. Memory is not a file system. Keep entries short and indexable.

Where it lives

Memory files live under ~/.claude/projects//memory/. Plain markdown, one file per entry:

~/.claude/projects/webority-lps-web/memory/
  MEMORY.md
  user_role.md
  feedback_friday_deploys.md
  project_auth_rewrite.md
  reference_grafana_oncall.md

Filename prefixes mirror the four types I use: User (preferences), Feedback (corrections from prior sessions), Project (codebase-specific), Reference (pointers to docs, runbooks). This is a personal taxonomy on top of Claude Code’s memory layer; the underlying system also recognizes managed-policy and project-rules files at higher precedence, so check the docs before adopting these names verbatim.

Anatomy of a memory file

Short. YAML frontmatter for metadata. Structured so the model can apply the rule without reading the whole story:

---
name: Production deploys must be on weekdays
description: Hard rule, never deploy to prod on Friday afternoon
type: feedback
---

Production deploys must be completed by 2pm on weekdays only.

**Why:** Last two Friday deploys had Redis connection-pool issues that weren't
caught until Monday. Team lost 6 hours of weekend debugging.

**How to apply:** If the user proposes a prod deploy after 2pm any weekday or
any time on Friday, pause and ask for explicit confirmation with reasoning.

One-sentence rule, Why anchored to a real incident (so future-you won’t delete it casually), How-to-apply spelling out the trigger.

MEMORY.md as the index

Without an index, the model has no cheap way to know what’s there. A working MEMORY.md:

- [User role](user_role.md): senior .NET architect, primary stack Azure Functions
- [No Friday deploys](feedback_friday_deploys.md): hard rule, past incident
- [Auth middleware rewrite](project_auth_rewrite.md): compliance-driven, Q2 goal
- [Grafana oncall dashboard](reference_grafana_oncall.md): pointer for request-path debugging

Title plus a one-phrase reason-for-existing. The model scans the index in a few hundred tokens and reads only relevant files. Prune stale entries. Resist letting it grow past a screenful.

Anti-patterns specific to memory

Ephemeral task state. “We’re fixing the refund bug” is active work, that’s a handoff.

Facts that change. Shelf life under a month, don’t persist.

Full documents. Memory is an index to knowledge, not a container.

Redundant with CLAUDE.md. Don’t duplicate. You’ll double-bill and eventually update one and forget the other.

6. CLAUDE.md hierarchy and discipline

CLAUDE.md is loaded into every session’s system context. Every line costs you tokens on every turn, forever.

Levels

Global (~/.claude/CLAUDE.md) applies to every project. Keep this tight.

Project (./CLAUDE.md at repo root) applies to this codebase only.

Managed policies and drop-ins exist for enterprise installs and override user-level files. Most solo developers never touch them, but if you’re on a managed workstation, your CLAUDE.md may not be the highest-priority instruction set.

Push project-specific rules down. If a rule only matters for one repo, it doesn’t belong in global. A global CLAUDE.md that mentions specific database schemas or specific Azure subscriptions is bloat for every other project.

Ruthless pruning

Read your CLAUDE.md and for each line ask: does this change model behavior in a way I can observe? If not, delete it.

Common bloat:

Aspirational principles (”write clean code”) that don’t steer any specific decision.
Redundancy with well-known conventions. You don’t need to tell the model what SOLID is.
Rules the model already follows by default.

A 400-line global CLAUDE.md is maybe 8K tokens added to every turn of every session. That’s not free.

Enforceable vs unenforceable rules

Some rules prose can enforce. “Use DateOnly for dates” works because the model reads it naturally and does it. Others prose cannot. “Never run rm -rf without confirming” is the kind of thing the model can forget under pressure. Unenforceable rules belong in settings.json as deny rules or hooks (next section).

Syncing global CLAUDE.md across machines

If you use Claude Code on multiple machines, you need a sync story. A simple one: keep CLAUDE.md in a GitHub gist and pull it at session start.

# In global CLAUDE.md:
# At session start: gh gist view  -f CLAUDE.md
# Compare last-sync timestamps, pull if remote is newer

Pair this with a SessionStart hook that runs the fetch automatically. Done once, works forever.

7. Deterministic enforcement via settings.json

Prose is probabilistic. Hooks and permissions are deterministic. If a rule absolutely must not be broken, don’t write it in CLAUDE.md. Put it in settings.json.

Deny rules for destructive or wasteful commands

{
  "permissions": {
    "deny": [
      "Bash(rm -rf *)",
      "Bash(git push --force*)",
      "Bash(find /*)",
      "Bash(cat *)",
      "Bash(grep *)"
    ]
  }
}

The last three deny the cat/grep/find anti-patterns from section 2 at the enforcement layer. The model can’t bypass them under any prompt. Effective against both the model’s bad habits and your own when you’re tired.

A caveat the docs themselves call out: Bash pattern matching is naive. A determined adversary or a confused model can sometimes bypass Bash(rm -rf *) via subshells, env-var expansion, or quoting tricks. Treat Bash deny rules as friction against everyday mistakes, not as a true security boundary. For real isolation, use sandboxed execution or filesystem-level write protection.

Start narrow. Aggressive denies have false positives

Those aggressive patterns block legitimate work. Bash(cat *) kills cat file | jq pipes. Bash(grep *) kills git log --oneline | grep fix and every | grep pipeline. Bash(find /*) is narrow (root-rooted only), but people often widen it to Bash(find *), which kills find . -name '*.cs' traversal.

Start with rules for things purely destructive or wasteful, observe friction for a week, then narrow. Safer starting point:

"deny": [
  "Read(**/.env*)",
  "Read(**/*secret*)",
  "Read(**/*.pem)",
  "Bash(rm -rf /*)",
  "Bash(git push --force*)",
  "Bash(*Compress-Archive*)"
]

Bash(rm -rf /*) blocks root wipes. Bash(rm -rf *) also blocks rm -rf node_modules and rm -rf ./publish, which are legitimate cleanups. Write the narrow pattern, not the scary-looking one.

The full hook surface

The hook system has grown well past SessionStart and PostToolUse. The full set worth knowing:

SessionStart (with matcher of startup/resume/clear): sync configs, fetch credentials.
UserPromptSubmit: stdout gets injected into the conversation. Use this to inject git status, current time in IST, or a small runbook every turn.
PreToolUse / PostToolUse / PostToolUseFailure: gate or react to tool calls. Auto-format on Edit, deny shell access for certain commands, log on failure.
Stop / SubagentStop: fire when the model finishes. Trigger a build, ping a notifier, archive a transcript.
PreCompact / PostCompact: archive the full transcript before auto-compaction destroys it.
SessionEnd, Notification, PermissionRequest, TaskCreated, FileChanged, WorktreeCreate: niche but useful in specific automation flows.

A full list lives in the Claude Code docs. The two I underuse and now wish I’d wired up earlier: UserPromptSubmit for ambient context injection, and PreCompact for archival.

SessionStart hook for auto-sync

{
  "hooks": {
    "SessionStart": [
      {
        "command": "gh gist view af524f... -f CLAUDE.md > ~/.claude/CLAUDE.md.remote && "
      }
    ]
  }
}

Runs before the first turn. User doesn’t have to remember to sync. Verify the exact schema in your version’s docs; matcher and field names evolve.

PostToolUse hook for auto-format

Illustrative pseudo-code. Hook schema, matcher syntax, and substitution variables (like $file_path) vary between versions. Verify against your release notes before wiring up.

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit",
        "command": "prettier --write $file_path"
      }
    ]
  }
}

The model doesn’t have to remember to format. The format just happens.

When to prefer hooks over prose

Behavior that must be 100% reliable (safety, compliance, formatting).

Behavior the model tends to forget under context pressure.

Behavior that’s cheap to automate but expensive to repeat in prose.

Prose is still right for nuanced guidance like “prefer composition over inheritance.” Hooks are right for mechanical rules.

8. MCP server diet

Every MCP server you have connected loads its tool schemas into your system prompt on every turn. A Figma server with 15 tools, a Chrome server with 20, a Gmail server with 8. That’s 40+ tool schemas, each with parameter descriptions, easily 10-20K tokens just sitting there in case you need them.

Audit regularly

claude mcp list

For each server, ask: did I use this in the last two weeks? If no, disconnect.

Common offenders

Canva, Gamma. Connected once to try them, never used again.

Gmail. Connected for one task, now hanging around.

Notion, Linear, Jira. Huge schemas, only useful in specific workflows.

Disconnect them. Reconnect when you need them. It takes 30 seconds.

Local vs claude.ai-registered servers

Servers registered on claude.ai propagate everywhere you use Claude. Servers configured locally only affect the current machine. For personal tools, prefer local. It keeps your global footprint smaller.

Deferred tool loading (first-class token win)

On supported versions, tool names and short descriptions sit in the system prompt, but full JSON schemas (parameters, descriptions, enums) only load when the model actually calls a tool, typically via a ToolSearch mechanism fetching on demand.

Why it matters. A Chrome DevTools MCP with 25 tools and full schemas weighs ~15K tokens per turn. Deferred loading turns that into ~2K of names plus an on-demand fetch. Across three or four MCPs, the difference between a 40K-token and 6K-token system prompt, every turn.

When it’s available. Check your version. Some enable it automatically, some need a flag, some don’t support it. Worth upgrading for.

The tradeoff. A small round-trip the first time the model needs an unloaded tool. A few hundred extra tokens and a bit of latency for that call. Against 10K+ tokens saved per turn, obvious win.

Relation to the MCP diet. Not a license to keep every MCP ever installed. Loaded names still cost. Not 15K, but not zero. Priority: disconnect unused > defer loaded > keep-loaded.

9. Model routing

Different tasks deserve different models. Running Opus for a one-line typo fix is waste. Running Haiku to architect a system is malpractice.

Rough allocation

Opus 4.7 (claude-opus-4-7) for planning, architecture, ambiguous requirements, multi-step reasoning, “figure out why this is broken.” Best quality reasoning, highest cost.

Sonnet 4.6 (claude-sonnet-4-6) for implementation from a clear plan, writing code, editing files, running tests. The workhorse. You’ll use it most.

Haiku 4.5 (claude-haiku-4-5) for lookups, simple file reads, log scanning, “what’s the current time in IST,” renaming a variable, scanning through hundreds of log lines. Fast and cheap.

Setting the default and switching

Launch with an explicit model. claude --model haiku for grep-heavy exploration, claude --model sonnet for normal coding, claude --model opus for planning or hard debugging. Defaulting to Sonnet and explicitly upgrading to Opus avoids paying Opus rates for a session you never needed to.

Mid-session (syntax varies): /model if supported, otherwise handoff + /clear and relaunch. Don’t flip casually. Switches can reset cached prefix state. Pick a phase (plan / build / verify), set the model, switch at boundaries.

Per-subagent model selection

The bigger win than switching the main session is selecting a cheaper model for subagents. An agent that scans logs, searches code, or summarizes a report doesn’t need Opus-level reasoning. Your main Sonnet thread pays Sonnet rates for a short summary instead of Opus rates for 20K tokens of log noise. Set model: haiku (or sonnet) in the agent’s frontmatter.

Cost math (current pricing)

Per million input tokens, the 4.x generation is roughly: Opus 4.7 ~$5, Sonnet 4.6 ~$3, Haiku 4.5 ~$1. Output tokens cost 5x the input rate at each tier. So Opus is roughly 1.7x Sonnet and 5x Haiku on input cost. Much narrower than the older Opus 3 / 4.1 generation when Opus was 5x Sonnet and 15x Haiku. The relative gap has tightened considerably. Always check current Anthropic pricing before optimizing aggressively, because these numbers move.

Operating rule: never use an expensive model for work where the answer is smaller than the inputs. Bulk-read-then-summarize is Haiku. Sonnet for structured code. Opus when reasoning is the work.

10. Visibility tools

You can’t manage what you can’t see.

`/context`

Shows your current context usage. Use it. Before you take a screenshot or do a big read, glance at where you are. At 30% you’re fine. At 75% maybe delegate or compact before the expensive operation.

Custom status line

Claude Code supports custom status lines. Put current context usage on it so you see it every turn.

Version-dependent example. The variable names below (CLAUDE_CONTEXT_PERCENT, CLAUDE_MODEL) are illustrative. The actual contract (env vars, stdin JSON, or a state file) varies between Claude Code versions. Check your release notes or claude --help.

{
  "statusLine": {
    "command": "echo \"ctx: ${CLAUDE_CONTEXT_PERCENT}%  model: ${CLAUDE_MODEL}\""
  }
}

More portable: have the status-line command shell out to a small script that reads whatever your version exposes. When the contract changes, you update one script instead of editing JSON.

The principle matters more than the syntax. If you can see context usage every turn, you’ll act on it. If you can’t, you won’t.

Why visibility changes behavior

It’s the speedometer effect. You drive differently when you can see your speed. You use Claude Code differently when you can see that you’re at 60% context three turns into a task.

11. Skills, slash commands, and plugins

Three flavors of model-extending markdown. Worth understanding the differences.

Slash commands: user-invoked recipes

~/.claude/commands/*.md for global, ./.claude/commands/*.md for project-specific. The filename (without .md) becomes the slash command name. A slash command runs only when you type it.

---
description: Write a handoff note capturing current work state so the user can /clear and resume cleanly
---

Write a new handoff file capturing the state of the current work thread.

Steps:
1. Gather context: current branch, HEAD commit short SHA, uncommitted file count.
2. Ensure `./.claude/handoffs/` exists and is gitignored.
3. Prompt the user for a kebab-case topic slug.
4. Write `./.claude/handoffs/_.md` with YAML frontmatter and body sections.
5. Keep total under 400 words.
6. Recommend `/clear` to the user.

The description field is what Claude Code shows in the command palette and what the model reads when deciding whether the command is relevant. First-class, not decoration.

Skills: model-invoked capabilities

Skills live in ~/.claude/skills//SKILL.md (or per-project equivalent). Same markdown + YAML structure, but the key difference is auto-invocation. With user-invocable: true they behave like slash commands. With user-invocable: false they’re background knowledge the model can pull in when relevant, without you typing anything.

---
name: research
description: Multi-source topic research across HN, Reddit, RSS, Twitter, blogs
user-invocable: true
---

The right mental model: a slash command is a button the user presses. A skill is a tool the model knows it has. Both are markdown files; the lifecycle is what differs. Anthropic ships first-party skills (/simplify, /loop, /claude-api) and there’s a growing third-party ecosystem.

Skills also support live reloading. Drop a new SKILL.md into the directory and the next turn picks it up. No restart.

Plugins: bundled distribution

A plugin packages skills + agents + hooks + slash commands into one installable unit. Manage them with /plugin. Install with /plugin install @. Useful when you want to share a workflow across a team without copy-pasting markdown files into every machine.

For solo work, raw skills and slash commands are usually enough. Plugins matter when you have 5+ teammates who all need the same setup.

Discipline

Don’t create a slash command or skill for a flow you might need. Create it when you’ve done the flow manually three times and it’s clearly sticking.

Example: `/deploy-backend` slash command

1. Run dotnet build -c Release -v q --nologo; stop on error.
2. Publish to ./publish (clean first).
3. Zip ./publish using System.IO.Compression.ZipFile (fastest).
4. Deploy with az functionapp deployment source config-zip.
5. Restart the function app.
6. Check app state with az functionapp show --query state.

Beats re-typing the sequence or watching the model reinvent it from memory every time.

Example: `/ist` skill

Run: date -u "+%Y-%m-%d %H:%M:%SZ" and convert to IST (UTC+5:30).
Report as: DD-MMM-YYYY HH:MM IST.

Silly but useful. I invoke it 20 times a week.

12. Going headless and autonomous

The interactive TUI is one mode. Claude Code also runs headless, scriptable, and on a schedule. This is where it stops being a coding assistant and becomes a worker.

Headless mode: `claude -p`

claude -p "Find all TODO comments older than 30 days and summarize" \
  --allowedTools "Read,Grep,Bash" \
  --output-format stream-json

Prints to stdout, no TUI, no interactive loop. --output-format json gives you structured output. --output-format stream-json emits NDJSON for real-time piping. Combine with --json-schema to enforce a response shape.

This is how you embed Claude Code in CI: pre-commit hooks, PR review bots, scheduled audits. The same agent loop, the same tool system, no terminal.

Agent SDK (Python and TypeScript)

If you want to embed deeper than CLI piping, the Agent SDK exposes the agent loop, context manager, tool registration, and subagent dispatch as a library. Build a custom orchestrator. Wire it to your existing services. The mental model is identical to interactive Claude Code, but you control the host.

Worth picking up when “claude -p” with a long pipe stops being expressive enough. Below that, the CLI is fine.

`/loop`: in-session intervals

/loop 5m /check-deploy

Re-runs /check-deploy every 5 minutes inside the current session, until you stop it or hit the cap (50 concurrent loops, 3-day expiry). Useful for poll-style work: monitor a build, watch a metric, wait for a webhook to fire.

Remote Tasks: cron in the cloud

Define a GitHub repo, a prompt, and a cron schedule. Anthropic’s cloud spins up a Claude Code instance on the schedule, runs the prompt against the repo, and you don’t need your laptop open. Your “scheduled audit” or “weekly content batch” runs without you.

This is the part of Claude Code that most people haven’t internalized yet. You can hand off a recurring workflow entirely. The constraint shifts from “can I be at my desk?” to “can I describe the work clearly enough to leave it running?” Different problem.

When to reach for it: any task you’d otherwise do at the same time every day or week. Ad reports. Lead followups. Index health checks. Anything that’s deterministic in shape but tedious to remember.

13. Diagnosing bloat from session logs

Every hygiene tactic here assumes you can tell when you’re violating it. Self-reporting is unreliable. You’ll swear you didn’t re-read that file and the logs will disagree.

Claude Code persists every session as JSONL, one event per line, at ~/.claude/projects//.jsonl. The project-dir name is your cwd with non-alphanumeric characters replaced by hyphens. These files answer why did this session burn so many tokens.

Useful diagnostic queries

jq gets you most of what you need. Record shapes evolve between versions, so adjust paths.

# Message type distribution
jq -r '.type' session.jsonl | sort | uniq -c

# Most-read files (duplicate-read detection)
jq -r 'select(.type=="tool_use")
       | select(.message.content[].name=="Read")
       | .message.content[].input.file_path' session.jsonl \
  | sort | uniq -c | sort -rn | head

# Biggest tool results by byte size
jq -c 'select(.type=="tool_result")' session.jsonl \
  | awk '{print length, NR}' | sort -rn | head

The second query is the most useful one in this article. Any file with a count of 5+ means re-read discipline is failing. The third points at the tool results that blew up the session, usually a screenshot, a giant Grep without head_limit, or a full-file Read.

Patterns to look for

Same file Read 5+ times: re-read discipline failing.
Single tool results >50KB: screenshot or full-file-read culprits.

500 messages across topics with no /clear: topic-shift discipline failing.

Zero Agent calls in a 5MB session: missed delegation.
Back-to-back Edit calls on the same file with small diffs: thrashing from unclear instructions.

Post-mortem, not live monitoring. Once a week, pick your worst session and run the three queries. Five minutes. The patterns repeat, and you’ll learn which discipline is weakest and target it. Token waste is measurable. Treat it like any other performance problem.

14. Putting it together: a daily workflow

A tight checklist. Each item maps back to a section above.

Session start (30 seconds)

/context to verify clean slate.
git pull.
claude mcp list and disconnect anything unused.

During work

Discover first (Grep/Glob), then Read scoped.
Delegate anything bulky (searches, scans, multi-file) to subagents.
Background long ops (builds, deploys).
Watch status-line context %.

At checkpoints

Logical unit done → commit.
Wrong path two turns ago → /rewind.
Topic shift → handoff + /clear (or /compact with instructions for related continuations).
Stuck → dump state to file, /clear, come back fresh.

Session end

Persist anything cross-session-worthy to memory or CLAUDE.md.
Handoff open work.

Weekly

Pick your worst session log and run the diagnostic queries from section 13.
Audit MCP servers; disconnect anything unused.
Promote any thrice-repeated workflow into a slash command or skill.

This is a reference, not an exhaustive guide. The detail for each bullet lives in the relevant section.

Closing

Claude Code’s ceiling isn’t model intelligence. It’s how much relevant context you can hold at low cost while doing the next step. Re-reading is waste. Full-file dumps are waste. Bash for file ops is waste. Stale MCP servers are waste. Bloated CLAUDE.md files are waste.

Keep the context lean. Delegate bulk work to subagents. Commit often. /clear without fear because /resume and /rewind are always there. Enforce the rules that matter via settings.json instead of hoping prose holds. Turn the recurring stuff into skills, slash commands, and remote tasks so you stop doing it by hand.

Do these things for a week and your token usage drops by half for the same work. A month in and you won’t go back.

If you try one of these and it meaningfully changes how a session feels, I’d like to hear which one. Especially the tactic you didn’t expect to matter.

Navneet Singh is Founder and CEO of Webority Technologies, an engineering-first company shipping enterprise AI systems, healthcare IT platforms, and agentic engineering infrastructure. He runs millions of Claude tokens a month on production work for clients across ten countries.

What the Claude Code leak actually shows about building AI products

Navneet Singh — Sun, 12 Apr 2026 05:35:20 GMT

AI in Enterprise: What Actually Works. Part 1

None of that is the interesting part.

512,000 lines of TypeScript from a production AI agent, involuntarily published to the world. The real question is what it tells you about how AI products actually get built at scale. If you read the source for what it is, not a scandal but a case study, several things that practitioners have suspected for months are now confirmed.

How the source got out

A developer going by @Fried_rice found a .map file bundled inside the Claude Code npm package. Source maps exist for debugging. They bridge minified production code and readable TypeScript. Someone forgot to strip them before shipping.

Within hours, the source was on GitHub. Within a day, it had 8,000-plus forks. Anthropic filed DMCA takedowns, which is the expected move. The less expected move: their automated DMCA process hit their own GitHub fork network in the blast radius, taking down thousands of legitimate repos that had nothing to do with the leak. Bystanders got caught in the sweep.

Boris Cherny, the engineer who created Claude Code, responded publicly. “It’s never an individual’s fault, it’s the system’s.” Contrast that with the SolarWinds breach, where the CEO testified before Congress that the cause was an intern who set a password to “solarwinds123.” Anthropic and SolarWinds are very different organizations. The response to a failure says more about an engineering team than the failure itself.

What’s interesting about the Cherny response isn’t the diplomacy. It’s the accuracy. A .map file in an npm package is a process gap, not a rogue actor. Someone at some point made a default configuration decision, no one caught it in review, and the package shipped. The only unusual thing is that the file happened to be 512,000 lines of commercially significant TypeScript.

What the media covered (and why it distracted everyone)

Three things dominated the coverage.

Anti-distillation. The source contains logic labeled ANTI_DISTILLATION_CC. Fake tools injected to corrupt any training data scraped from Claude Code sessions. The goal is poisoning competitor model training. Technically clever, ethically contested, probably legally murky. Every major AI lab has some version of this or is thinking about it.

Undercover mode. A setting that strips AI attribution from git commits, apparently for use by Anthropic employees contributing to open-source projects without revealing they’re using their own tool. The reaction was louder than the substance. A company using its own product while not advertising it is not exactly a revelation. The attribution question in AI-assisted code is genuinely complicated, but this particular implementation is more mundane than the coverage implied.

The Tamagotchis. Yes, there are 18 species. Yes, there’s gacha rarity. Yes, the developers hex-encoded the word “duck” as a hex string to avoid an internal scanner that would flag it as a model codename. The duck thing is the most endearing thing in 512,000 lines. Some engineer spent real time making sure their virtual duck would survive code review.

These three things were the story for most of the coverage. They’re fine. They’re interesting. They’re not why practitioners spent days reading this source.

KAIROS: what a production background agent actually looks like

The most developed unreleased feature in the codebase is something called KAIROS. It has over 150 source references. Based on what’s visible in the source, it’s the real version of what every AI agent demo claims to be.

The architecture is specific. KAIROS fires periodic prompts when the terminal is unfocused or the user has been idle. Each tick has a hard 15-second blocking budget. The agent has exclusive access to tools that the standard Claude Code instance doesn’t get: SendUserFileTool, PushNotificationTool, SubscribePRTool. The design intent, readable in the surrounding prompt strings, is to surface things “the user hasn’t asked for and needs to see now.”

Most production AI agents today are pull-based. The user asks, the agent answers. KAIROS is push-based with a budget constraint. It monitors, decides what’s worth interrupting you for, and then pings you rather than waiting for you to ask.

This is what everyone in AI is trying to build. Background agents that actually behave like competent colleagues rather than overeager interns who respond only when poked. The gap between “AI that answers questions” and “AI that notices things” is where most enterprise AI projects fall flat.

Anthropic built it. They haven’t shipped it yet. The 15-second blocking budget suggests they’re still working out the cost-latency tradeoff for background execution, each tick is a full model inference. But the architecture is there, the tooling is specific, and the intent is clear.

If you’re building enterprise AI agents right now, KAIROS is worth understanding as a design target. Not to copy it. The specific implementation is Anthropic’s and the IP questions are obvious. But internalize the architectural decision: push-based with a hard budget, selective interruption, dedicated tool access for the background context.

AutoDream: the memory problem is harder than anyone says out loud

Memory consolidation across AI sessions is the problem that most enterprise AI deployments haven’t solved, and most of the AI product demos pretend doesn’t exist.

The source shows what Anthropic’s internal attempt looks like. The system is called AutoDream, and it’s triple-gated before it fires: the session must be idle, there must be a gap of at least 24 hours since the last session, and the user must have accumulated at least five sessions. Only then does it run.

When it does run, the process is a full reflective pass across four phases: Orient, Gather, Consolidate, Prune. It removes near-duplicates. It resolves contradictions between memories. It flags memories that have “drifted.” Things that were true at the time they were stored but are no longer consistent with recent behavior.

The Hacker News thread on the leak had a comment that sat at the top for most of the discussion: “Memory consolidation between sessions is the actual unsolved problem.” Every enterprise AI deployment I’ve worked on or seen runs into the same wall eventually: what happens to context across sessions? Most products either give up (stateless, every session starts cold) or fake it (dump a summary into the system prompt that degrades quickly in quality).

AutoDream is a genuine attempt at structured consolidation. The orientation/gather/consolidate/prune cycle maps to something you’d recognize from systems design. Not unlike how a database handles transaction log compaction. The fact that it’s triple-gated is notable: Anthropic clearly ran this more aggressively in earlier versions and got bad results. The 24-hour gap and five-session threshold suggest the consolidation model works better with more data and more time, not less.

There’s one other detail from the source worth calling out. CLAUDE.md, the context file that defines project-level instructions, is reinserted into the context every turn, not just at session start. This was suspected by people who had observed certain prompt cache patterns, but it’s confirmed in the source. The implication for enterprise deployments is practical: any project-level instruction set needs to be written with turn-by-turn reinsertion in mind, not just session initialization.

Multi-agent orchestration is a prompt, not a framework

The section of the codebase that generated the most reaction in engineering communities is coordinatorMode.ts.

The multi-agent orchestration logic, the part that decides how to break down a complex task, spin up parallel workers, manage dependencies between subtasks, and route results back, is written as a natural language prompt. Not code. A prompt.

Hacker News put it plainly: “So much for LangChain and LangGraph. If Anthropic themselves aren’t using it...”

The technical architecture underneath is coherent. Parallel workers share a prompt cache, which means the cost doesn’t multiply linearly with the number of agents. Risk is classified at the task level (LOW, MEDIUM, HIGH) and HIGH-risk actions require a human gate before execution. These are sound engineering decisions.

But the orchestration logic itself, the stuff that would be code in any conventional architecture, is a prompt.

Most of the orchestration abstraction layers being built right now (multi-agent frameworks, task graph libraries, agent communication protocols) are solving problems that may not need to be solved at this stage. The evidence from the Anthropic source is that if your model is capable enough, the coordination problem reduces to a prompt engineering problem. You describe the task decomposition strategy in natural language, and the model executes it.

Orchestration frameworks aren’t useless. They’re just premature at this stage. The coordination abstractions make sense when you need deterministic execution guarantees, audit trails, or cross-model coordination where you can’t rely on a single capable model to reason about the task structure. For everything else, which is most of what people are building right now, the natural language coordinator is probably good enough and significantly easier to change.

Unreleased features as a product roadmap

Four features in the source aren’t shipped yet. Read together, they outline where agentic AI tooling is going.

ULTRAPLAN offloads planning for complex tasks to a 30-minute remote Opus session. There’s keyword detection logic that routes certain task types to ULTRAPLAN rather than inline planning. The architecture includes a “teleport sentinel,” a mechanism to retrieve the results of the remote planning session back into the local context when it completes. The implication is that some tasks are better planned with a longer-horizon, higher-capability pass before execution begins. Most current agents do planning and execution in a single context window. Separating them, with different capability and time budgets for each, is a different design philosophy.

Voice mode is further along than the existence of unreleased features usually suggests. Full push-to-talk, using Deepgram Nova 3 for transcription, routing through api.anthropic.com rather than claude.ai. The routing detail matters: Cloudflare’s TLS fingerprinting apparently blocks non-browser clients from the .ai domain. The solution was to route voice through the API domain instead. This is the kind of operational detail that only shows up in production code. Not a design document, not a demo. Code that has run into a real network problem and implemented a workaround.

Bridge Mode extends Claude Code sessions to browser and mobile. The architectural pattern makes sense: the local agent maintains state, the remote session connects to it. The challenge, readable in the surrounding code comments, is session handoff. Transferring enough context to the remote session to make it useful without being slow.

TungstenTool is Anthropic-internal-only. Not shipped to users. It gives Claude direct keystroke and screen-capture control over the terminal via tmux. The implication is that there’s an internal version of Claude Code that is considerably more capable at direct system control than what ships externally.

MagicDocs is the most practically interesting for enterprise contexts. Files starting with # MAGIC DOC: are tracked automatically. After each turn, a Sonnet sub-agent reviews what changed in the session and updates those files to reflect the current state of the project. Auto-updating documentation, maintained by a background agent, without explicit user action. The hard problem with documentation isn’t generating it. It’s keeping it current. MagicDocs is a specific attempt to solve that with a background sub-agent rather than developer discipline.

The security findings most people didn’t read closely

The coverage spent more words on the Tamagotchis than on two security findings that actually matter.

The 50-subcommand cap. The Bash tool limits analysis to 50 subcommands. Above that count, it falls back to “ask,” the safe default, declining to execute. This sounds like a security feature. It partially is. But the cap was added to fix a UI freeze bug, not to prevent malicious commands. Security was incidental. A crafted command with more than 50 subcommands may bypass deny rules that were intended to block specific patterns. Not a theoretical vulnerability. A real gap in the deny rule system with a traceable cause.

Token cache drain. There’s a session serialization bug. When a session is saved, attachment types are stripped. When the session resumes, tool announcements are rebuilt from scratch. This shifts the cache prefix positions, breaking cache alignment. The result: cache hit ratio degrades from roughly 67% at the start of a long session to around 26% by the end. In practical terms, this makes long sessions significantly more expensive than they should be.

The community patched this using the leaked source before Anthropic had issued any official fix. Boris Cherny confirmed the bug. Neither finding is catastrophic. Both are the kind of thing that happens in a fast-moving production codebase when the team is building features faster than the QA surface can cover.

The codebase quality debate

Reddit’s r/programming thread was titled “Anthropic’s codebase is absolutely unhinged” and gathered 5,500-plus upvotes. The specific evidence: 460 eslint-disable comments, deprecated functions still in production, TODOs sitting in error handlers.

Then the engineers actually read it.

The community consensus that emerged over the next few days was more measured: “This is just a codebase.” The eslint-disable comments are high but not unusual for a product that moved from prototype to production at pace. The deprecated functions exist because the cost of removing them mid-feature build is higher than leaving them temporarily. The TODOs in error handlers are the honest acknowledgment that someone knew what needed fixing and left a marker rather than pretending it was done.

Production code at scale looks different from tutorial code. Anyone who looked at the source and was surprised by the debt has probably not shipped a 512,000-line product under competitive pressure.

The Mythos proximity

One thing worth noting that happened within days of the leak: Anthropic announced Mythos, an AI security agent. Not a public release. Just a preview. “Already found thousands of high-severity vulnerabilities in every major OS and web browser” through autonomous scanning. Available only to 40-plus critical infrastructure organizations through a $100M credits program.

Some on HN theorized both the Claude Code leak and the Mythos announcement were coordinated intentional disclosures. More likely explanation: Anthropic is shipping very fast and their release process hasn’t scaled proportionally. The .map file in the npm package and the narrow Mythos preview both suggest a company moving faster than its own process can contain. That’s not unique to Anthropic. It’s the signature of a company in a specific phase of growth where velocity is high and operational discipline around releases is still catching up.

What converging architectural patterns mean for enterprise builders

The source confirms something that practitioners have been arriving at independently: there is a converging architecture for production AI agents, and it looks like this.

Skeptical memory with structured consolidation, not naive accumulation. Background monitoring with budget constraints, not always-on reactive. Risk classification baked into the execution path, not bolted on later. Context reinsertion every turn, not just at session start. Natural language coordination for task decomposition, not custom orchestration code. Human gates for high-risk actions, not blind autonomous execution.

Multiple independent teams have arrived at this architecture by building things and watching them fail. The Claude Code source confirms that the team with the most direct access to the underlying model capability arrived at similar conclusions. That’s a signal worth taking seriously. These patterns aren’t framework-specific or model-specific. They’re structural solutions to structural problems in agentic AI systems.

Claude Code ranks 39th on TerminalBench, the agentic coding benchmark, despite being built by the same company that makes the underlying model. Thirty-ninth. The moat isn’t the model. It’s the harness. The orchestration, the memory, the context management, the risk classification, the prompt engineering around task decomposition. All of that is what makes an AI agent actually useful in production, and none of it comes free with API access.

This is the finding that enterprise builders should sit with. The architectural patterns in this source are transferable. Not the code. The code has obvious IP considerations. But the design decisions: how to handle memory drift, how to structure background monitoring, how to build risk gates that don’t create so much friction that users route around them, how to write a context file that survives turn-by-turn reinsertion without becoming noise.

These problems exist in every enterprise AI deployment. The source shows one production solution to each of them. That’s worth more than the Tamagotchi discourse.

The question the leak actually raises for practitioners isn’t about Anthropic’s security practices or their corporate culture. The question is this: given that we now have a detailed look at how a production agentic system gets built at scale (memory consolidation, background agents, risk classification, multi-agent coordination, context management), what does your current architecture look like against that bar?

Most enterprise AI deployments don’t have structured memory consolidation. Most don’t have risk classification at the execution level. Most don’t have background agents with budget constraints. Most are still reactive, stateless, and hoping the model compensates for the missing infrastructure.

The source confirms the bar. The bar is 512,000 lines. Some of it messy. Some of it clever. All of it earned through shipping.

The architectural patterns are there to study. The operational discipline to actually build that way is the part no source code can give you.

Navneet Singh is Founder and CEO at Webority Technologies, where the team builds enterprise AI systems, healthcare IT platforms, and agentic engineering infrastructure for clients across industries.

Software is about to stop looking like software

Navneet Singh — Sun, 05 Apr 2026 05:38:19 GMT

This is Part 2 of the Future of Software series. Read Part 1: Software Is a Proxy. AI Makes It Obsolete.

“AI will replace all software UIs with a chat window.” Open a text box, type what you want, get an answer. No dashboards. No navigation. No buttons.

Sounds elegant. Also wrong.

The opposite take is equally wrong. That AI just means adding a chatbot sidebar to your existing Salesforce instance and calling it transformation. I’ve seen four enterprise clients do exactly this in the last 18 months. Users ignore the chatbot within two weeks. Every time.

The product interface isn’t dying. It’s going through the biggest shift since desktop to web. And most builders are getting it wrong because they think it’s a binary choice. Chat or dashboard. Pick one.

The future is neither.

Why pure chat fails

Before I get into where things are going, it’s worth understanding why the “just a chat box” pitch falls apart the moment someone tries to use it.

Humans think visually, not sequentially. A business owner asks “how is my company doing?” and they don’t want 500 words back. They want to see revenue trending up, one project turning red, cash reserves stable. A well designed visual communicates in 2 seconds what takes 2 minutes to read. We built a project management tool for a healthcare client last year. First version had a conversational interface. Client’s project managers hated it. They needed to see the sprint board AND team capacity AND client deadline all at once. Not one at a time through a chat window.

Chat has no persistent state. Ask “show me overdue tasks.” Get a list. Close the chat. Where’s the list? Gone. Ask again tomorrow, get a new list. Humans need spatial anchoring. The feeling that information lives somewhere stable, not summoned from thin air each time you ask.

Chat can’t handle parallel attention. A project manager needs four data streams in peripheral vision while focusing on one. Chat forces everything into a single sequential thread. That’s a downgrade, not an upgrade.

So pure chat is out. But traditional dashboards are also increasingly inadequate. They show everything to everyone, all the time, regardless of what actually matters right now. They make humans come to the information instead of bringing information to the human.

What’s left?

Four stages of product interface evolution

This transition isn’t a single leap. It’s a progression through four stages. Each one changes the fundamental relationship between the user and the product. I’ve been building B2B software for 18 years and we’re watching this play out in real time across our client base.

Stage 1: Static dashboards (where most products are today)

You navigate to information. The product has a fixed structure. Tabs, sidebars, pages, settings. The same dashboard shows the same layout to everyone, every time.

This is how Salesforce, Jira, HubSpot, Tally, and virtually every B2B tool works today. The UI is the product. Learning the UI is learning the product. The skill is in navigation.

We operate at this stage with most of our enterprise clients still. About 80% of the custom software we build in any given quarter is Stage 1. It works. It’s just not where things are heading.

Stage 2: Conversational hybrid (where the progressive products are heading)

You ask for information, the system responds with both text and visual context. The dashboard still exists but it’s increasingly secondary.

“What should I focus on today?” returns a prioritised list with context about why each item matters. Not because you navigated to a “priorities” page, but because the system synthesised across your tasks, calendar, team status, and deadlines to generate an answer.

We started building Stage 2 interfaces about eight months ago. A healthcare client wanted their clinical coordinators to be able to ask “which patients need follow up today” instead of navigating through four different screens to compile the list manually. The coordinator’s daily workflow went from 45 minutes of screen navigation to a 3 minute conversation. Same data, same decisions, completely different interaction model.

Most users in Stage 2 products spend 80% of their time in conversation mode and 20% in dashboard mode. The inverse of today.Stage 3: Ambient intelligence (where things get interesting)

The system doesn’t wait for you to ask. It proactively delivers the 2 or 3 things that need your attention, through whatever channel you’re already on. WhatsApp, Slack, email, push notification.

8 AM, your phone buzzes: “Two items need attention. One team member is blocked. Reply 1 or 2 to choose an approach and unblock them. One deadline is at risk. Approve a scope adjustment or extend the timeline. Everything else is on track.”

You reply “2” and “approve.” Twelve seconds. Your entire morning interaction with what used to be a complex software product.

The design principle inverts: if the system hasn’t contacted you, everything is fine. Silence is the signal that things are working. This is how a great executive assistant operates. They don’t hand you a 50 page report every morning. They tell you the three things that need your judgment. Everything else is handled.

We’re prototyping this for a logistics client right now. Their operations manager currently spends 90 minutes every morning reviewing dashboards across three different systems. The goal is to bring that down to under 5 minutes of notification responses. We’re about halfway there.

Stage 4: Generated interfaces

No pre-built screens exist at all. When you want to see something, you describe what you want and the system generates a visualisation on the fly. Perfectly tailored to your question, your role, and your context.

“Show me how the team is doing” generates a completely different view for a CEO than for a team lead. The CEO sees cross-team velocity trends and budget utilisation. The team lead sees individual contributor output and blocker resolution times. Same question, different generated interface, because the system understands who’s asking and what they actually need.

“Compare this quarter with last quarter” doesn’t load a pre-built comparison page. It generates a side by side analysis highlighting specifically what changed, what caused the change, and what actions might address it. The visualisation is ephemeral. It exists for this moment, for this question, and then dissolves.

There is no “analytics page” with 15 pre-configured charts that someone built once and nobody updates. Every view is generated on demand, contextual to the user, and disposable after use.

We haven’t built a full Stage 4 system for a client yet. But we’ve built pieces. A government reporting module where the interface generates different compliance views depending on which regulatory body is requesting the data. Same underlying dataset, completely different presentation. That’s a taste of where this goes when it matures.

Five properties of the future interface

Regardless of stage, the direction is clear. Future interfaces share five properties that are fundamentally different from today’s software.

1. Exception based, not comprehensive

Today’s products show you everything. All tasks, all deals, all transactions. The user’s job is to scan through comprehensive views and find what matters. That’s an enormous cognitive tax.

Future products show you only what needs attention. Three items are red. Everything else is fine. The 200 transactions that reconciled correctly are invisible. The 3 that didn’t are surfaced.

The metric for a well designed future product is not “time spent in app.” It’s “decisions made per minute of attention.” If a user can process their entire daily workload in 90 seconds of interaction, that’s a triumph. Not a retention problem.

I keep having this argument with product managers who are worried about engagement metrics dropping. If your users accomplished everything they needed and left in 45 seconds, you won. If they spent 20 minutes browsing dashboards and left without taking an action, you lost. The engagement model of the future is the opposite of consumer social media.

2. Multi-modal, not screen bound

Today’s products live on a screen. You open the app, use it, close it.

Future products exist across multiple channels simultaneously. Morning summary arrives as a WhatsApp message. Time sensitive approval pops up as a push notification. Deep analysis request generates a visual in the web app. Quick status check answered by voice assistant.

The product meets you where you are, in the format appropriate for the interaction’s complexity. Channel selection isn’t a user preference to configure. It’s a decision the system makes based on urgency, complexity, and context.

We’re seeing this with our own internal tools. Our marketing system runs across 10 tracks with automated alerting. When ad spend drifts more than 15% overnight, the alert goes to Slack. When a review drops below 3 stars, the notification goes to the responsible person’s phone. When someone wants to dig into campaign performance, they open the dashboard. Different channels for different urgency levels. Nobody configured that. The system decides.

3. Conversation first, visualisation second

The primary interaction model is natural language. But the response isn’t always text. That’s the nuance the “just a chat box” crowd misses.

When you ask “how are we doing on revenue?” the best answer is a chart. When you ask “who’s falling behind?” the best answer is a ranked list. When you ask “what happened last Tuesday?” the best answer might be a timeline.

The AI should know when to respond with words, when to respond with a visualisation, and when to respond with an action. The interface is generated in response to the question, not pre-built awaiting it. Pre-built dashboards assume the designer knows what the user will want to see. Generated visualisations assume the system can figure it out in real time. The latter is almost always more useful, because what matters changes every day.4. Progressively deep. Glanceable to explorable

The initial response should be absurdly simple. A traffic light. A single number. A one sentence summary. “Everything’s on track” or “Two things need your attention.”

But depth should be instantly available. Tap on “two things need attention” and you get the specifics. Tap on a specific issue and you get the full context. Conversation history, timeline, root cause analysis, recommended actions. Tap further and you get the raw data.

The design principle: the top layer should be understandable in 3 seconds. Every deeper layer is opt in. Most days, most users never go past the first layer. That’s success, not failure.

Think of it like a newspaper versus a research library. Today’s software is a research library. Everything is there, find what you need. Tomorrow’s software is a newspaper with a library behind it. The headlines tell you what matters, and you can always dig deeper.

5. Action oriented, not information oriented

This might be the most underappreciated shift. Today’s products end at “here’s the information.” The user then has to decide what to do and go somewhere else to do it. See a problem in the dashboard. Think about the solution. Switch to email to communicate it. Switch to the task tool to update it. Switch back to verify.

Future products end at “here’s the recommended action. Approve?”

“A team member is blocked on a decision. Here are both options with trade offs. Reply 1 or 2 to unblock them.” That’s not information delivery. That’s a decision point with pre-packaged execution. The user makes a judgment call, the system handles everything else. Communicating the decision, updating records, adjusting timelines, notifying stakeholders.

Every interaction should conclude with the work being done, not with the user having more information to act on manually.

What this means if you’re building

If you’re building a B2B product right now, or redesigning one, a few things follow from this.

Your home screen should not be a dashboard. It should be “What needs your attention right now.” Two or three cards, maximum. Everything on track is invisible.

Your primary interaction model should be conversational. A text input at the centre, not a navigation menu. “What’s behind schedule?” gets an intelligent answer. “Reassign the homepage task to Sarah” is executed, not routed to a form.

Your notification layer is your most important surface. More users will interact through notifications than through your app. The notification isn’t a pointer to the app. The notification is the app for 90% of interactions.

Build zero permanent analytics pages. Every chart should be generated on demand in response to a question. Kill the “Analytics” tab with its 15 pre-built charts that 3% of users visit monthly.

Every piece of information should end with a suggested action. Don’t show “3 invoices are overdue.” Show “3 invoices are overdue. Payment reminders drafted. Send all three?”

Measure decisions per minute, not time in app. Less time in your product means more value from your product.

The uncomfortable bit

Most B2B products being built today, including products that call themselves “AI-powered,” are designed around Stage 1 assumptions. They have dashboards. They have navigation menus. They have settings pages and configuration wizards. They’ve added an AI chat sidebar. That’s decoration.

The products that win the next decade will be designed around Stage 3 and Stage 4 from the ground up. They’ll feel less like “software” and more like “a team member who happens to be omniscient and tireless.” The interface won’t be something you learn or navigate or spend time in.

It will be something that works for you. Mostly in the background, occasionally surfacing for your judgment, always concluding with the work actually getting done.

The dashboard is dead. Not because screens are dead or because visual information is dead. But because the idea that a human should spend time navigating pre-built screens to find information and then separately act on it. That idea is dead.

What replaces it is better. Faster. Quieter. And, paradoxically, more visual than ever. Because when the system generates exactly the right visualisation for exactly the right question at exactly the right moment, the visual impact is far greater than a permanent dashboard full of charts nobody looks at.

The future of product UI isn’t no interface. It’s the right interface, at the right time, for the right person, generated in the moment it’s needed, and gone the moment it’s done.

We’re building some of this right now. The early results are encouraging. And honestly a little unsettling. When the first thing your client says after using the new system is “I don’t feel like I’m using software anymore,” you know something fundamental shifted.

Navneet Singh is the Founder & CEO of Webority Technologies. He writes about engineering-first approaches to building technology companies.

Software Is a Proxy. AI Makes It Obsolete.

Navneet Singh — Sun, 29 Mar 2026 14:06:20 GMT

Every piece of B2B software ever built exists for one reason: humans can’t hold enough information in their heads.

A project manager can’t track 500 tasks, their dependencies, their statuses, and their blockers simultaneously. So we built Jira. A salesperson can’t remember every interaction with every prospect. So we built CRM systems. Accountants, HR teams, marketers, all the same story. We kept building structured tools to compensate for the limits of human memory and attention.

Every SaaS product is fundamentally a structured information proxy for humans. The UI, the database schema, the workflow engine, all of it exists because humans need structure to process information.

Now ask the uncomfortable question: what happens when AI doesn’t have those limitations?

An AI can hold the entire context of 500 tasks, their dependencies, every conversation that created them, every commit that relates to them, and every team member’s workload. Simultaneously. In memory. Without a database UI. It doesn’t need a kanban board to “see” the work. It doesn’t need a pipeline view to “understand” the sales funnel. It doesn’t need a trial balance to “know” the financial position.

The proxy becomes unnecessary. And that means the product as a category starts to dissolve.

Three eras of software

Era 1 (1980–2010): Software as a tool.

You operate it. Excel. Photoshop. Tally. AutoCAD. The human does the work, the software is the instrument. A master of Excel is more productive than a novice. You pay for what the software can do.

Era 2 (2010–2025): Software as a system.

You configure it. Salesforce. Jira. HubSpot. Workday. The human designs workflows, sets up automations, defines rules. The software executes them at scale. The skill shifts from operating to configuring. An entire consulting industry emerges around “Salesforce implementation” and “Jira administration.”

You pay for processes the software runs.

Era 3 (2025–???): Software as an agent.

You direct it. State the outcome. The AI figures out the process, executes the work, reports back. No configuration. No workflow design. No administration.

You pay for work done.

We’re at the very beginning of Era 3. Most of the industry hasn’t figured out what that means yet.

I run a technology company. I’ve spent the last year watching our own clients ask for AI-native rebuilds of tools they’ve used for a decade. Not “add a chatbot.” Rebuild it. That’s the signal.

The Proxy Stack

I think about this as layers. Every piece of software sits somewhere on what I call the Proxy Stack: how much stands between a human’s intent and the outcome they want.

Layer 4 — Thick Proxy. You do the work, software is the instrument. Excel, Tally, Photoshop. (Era 1)
Layer 3 — Structured Proxy. You configure workflows, software executes. Salesforce, Jira, HubSpot. (Era 2)
Layer 2 — Assisted Proxy. You operate the system, AI helps at the edges. Chatbot in the sidebar, auto-generated summaries. (Era 2.5)
Layer 1 — Thin Proxy. You approve and direct, AI does most of the work. (Early Era 3)
Layer 0 — Zero Proxy. You state the outcome, AI handles everything. (Full Era 3)

Most of the SaaS industry right now is fighting over the move from Layer 3 to Layer 2. They’re calling it transformation. It’s a one-layer improvement while the market is heading to zero.

The interesting question for any software product becomes simple: what layer are you on, and how fast are you moving down?

The Era 2.5 trap

Right now, every SaaS company is bolting AI onto their existing product. A chatbot in the sidebar. An “AI insights” panel. Auto-generated summaries. Content suggestions.

They’re calling this “AI-powered.” It’s not. It’s Era 2 software with Era 3 marketing. Layer 3 products wearing a Layer 2 costume.

The fundamental architecture hasn’t changed. The database schema is the same. The UI is the same. The workflow engine is the same. The user still needs to create records, update statuses, configure automations, and navigate dashboards. The AI is a feature, not the foundation.

I call this Era 2.5. And it’s a trap, because it gives incumbents the illusion of transformation while leaving them structurally vulnerable to genuine Layer 1 and Layer 0 products.

The pattern is familiar. When the web emerged, most software companies built a “web portal” on top of their existing client-server architecture. It looked modern. It wasn’t. Salesforce, a true web-native architecture, ate their lunch. When mobile emerged, most companies built “responsive” versions of their desktop UIs. Instagram, Uber, WhatsApp didn’t do that. They were mobile-native from the ground up and created entirely new categories.

The same structural displacement is beginning now. It’s already starting. Klarna replaced 700 customer service agents with AI and reported the same satisfaction scores. Harvey is handling legal research that junior associates used to do. Cursor and similar tools are writing production code that ships without human review. These aren’t experiments. They’re early Layer 1 products eating into Layer 3 territory.

What Layer 0 products actually look like

The UI collapses

A Layer 0 product might not have a traditional UI at all. Not “minimal UI,” potentially no persistent visual interface. The interaction model is conversational plus notifications plus generated artifacts.

You don’t “open the project management tool.” You ask “what should I focus on today?” and get an answer that pulls from projects, calendar, team capacity, and deadlines. The answer IS the product.

You don’t “check the CRM.” The CRM tells you, unprompted, that your biggest prospect hasn’t responded in 12 days and drafts a follow-up referencing their Q3 budget cycle.

The entire concept of “logging into software” becomes as archaic as “dialling up the internet.”

Now, “no UI” taken literally is an overcorrection. Humans are visual. Even at Layer 0, people will want to see their financial position, see their project timeline. The difference is that these visualisations are generated on demand, not pre-built dashboards you navigate to. You won’t click through “Reports > Financial > P&L > FY 2026.” You’ll say “show me how we’re doing financially” and get something tailored to your context. The UI isn’t dead. It’s generated, not designed. Ephemeral, not permanent.

The database becomes invisible

At Layer 3, users interact with structured data through forms and views. Create a contact. Update a deal stage. Change a task status. Every interaction is a human translating reality into a structured record.

At Layer 0, the AI maintains structured data as a side effect of understanding unstructured reality. A conversation happened on WhatsApp, the AI extracted a task, identified the owner, estimated the effort, slotted it into the sprint. A payment arrived in the bank account, the AI matched it to an invoice, updated receivables, adjusted the cash flow forecast.

The database still exists. But no human ever touches it directly. It’s the AI’s memory, not the user’s interface.

This single shift eliminates the number one complaint about every B2B tool in existence: the overhead of keeping it updated. Nobody hates managing projects. They hate maintaining the project management tool. Nobody hates tracking sales. They hate updating the CRM. The administrative tax of structured data entry is universal friction. Layer 0 removes it entirely.

Products merge into agents

If the UI is conversational and the database is invisible, what actually separates a “PM tool” from a “CRM” from an “accounting system”?

At Layer 3, they’re different products because they have different UIs, different data models, different workflows. You switch between applications. You export from one and import into another. You build integrations to connect them.

At Layer 0, they’re different capabilities of the same agent. “Follow up with the client” is a CRM action. “Create the invoice for that deal” is an accounting action. “Assign the implementation to the engineering team” is a PM action. But to the user, it’s one continuous conversation with one system that understands the full context.

The concept of “software categories” dissolves. CRM, ERP, HRM, PM become capabilities, not products. And the company that assembles the most comprehensive set of capabilities into a single coherent agent wins the entire enterprise software market.

Though I’d push back on myself here: some domains will resist this merger for a long time. Finance, HR, legal, healthcare, anywhere errors carry regulatory or legal consequences, the trust curve is steep. An engineer might accept AI auto-creating tasks from Day 1. A CFO won’t trust AI to auto-file tax returns until it’s proven correct for 12 consecutive months. That’s not irrational resistance. It’s appropriate caution with real stakes.

The moat shifts from product to data

At Layer 3, the moat is the product: features, integrations, ecosystem, brand. Salesforce’s moat is its AppExchange ecosystem and millions of trained admins. Jira’s moat is deep integration with the Atlassian suite and the inertia of enterprise workflows built around it.

At Layer 0, the product layer commoditises. AI can generate UIs dynamically. Workflows configure themselves. Integrations are just API calls that an agent handles. The traditional product moat erodes.

What replaces it is accumulated domain intelligence.

An AI that has processed 10,000 Indian SMB accounting datasets understands GST edge cases that a generic AI never will. An AI that has managed 5,000 engineering sprints predicts blockers with precision a new entrant can’t match. Features can be copied overnight. Domain intelligence compounds over years.

The implication for builders: your first 1,000 customers aren’t revenue. They’re your dataset. The competitive distance between you and a new entrant is measured in accumulated learning, not feature parity.

The interaction frequency inverts

This one challenges every SaaS metric we’ve been taught to worship.

Layer 3 products optimise for engagement. Daily active users. Time-in-app. Sessions per day. Product teams celebrate when users spend more time in their tool. Growth teams optimise for habit formation. The entire product strategy assumes that a “sticky” product is a good product.

Layer 0 inverts this completely.

The best AI agent is the one you interact with the least, because it’s handling everything autonomously. A project management system that requires zero daily interaction beats one that requires 30 minutes. An accounting system that needs one approval per day beats one demanding 2 hours of data entry.

Success is measured by the absence of human involvement, not its presence.

This breaks every SaaS business metric. DAU becomes meaningless. Time-in-app becomes a failure indicator. The companies that figure out new metrics (tasks autonomously completed, decisions made without human intervention, exceptions per 1,000 transactions) will build the products that actually matter.

The north star for Layer 0: how much valuable work happened without anyone touching the product today?

And if engagement metrics break, pricing does too. Per-seat makes no sense when AI does the work of five people. The model shifts to outcomes: per transaction processed, per ticket resolved, per project managed. You pay for work done, not tools provided.

What this means if you’re building

If you’re building B2B software today, the strategic question isn’t “what features should we add?” It’s where are you on the Proxy Stack, and how fast are you moving down?

Sitting at Layer 3? Your window is closing. Not today, not this year, but within 3-5 years, Layer 1 alternatives will begin displacing products that require manual data entry, workflow configuration, and dashboard navigation.

Moving from Layer 3 to Layer 2? You’re buying time, not building a moat. The AI chatbot in your sidebar doesn’t change the fundamental architecture. Use the time well. Start rebuilding the foundation, not the facade.

Building for Layer 0 from scratch? You have the structural advantage but face the distribution disadvantage. Incumbents have millions of users, thousands of integrations, and decades of trust. Your product needs to be dramatically better on the one dimension that matters most: elimination of human effort.

And if you’re not building software but running a business on it? Start asking your vendors a different question. Not “what features are on the roadmap” but “when does your product stop needing me to operate it?”

The transition will be slow and messy. Spreadsheets didn’t kill paper ledgers in a year. SaaS didn’t kill on-premise in a year. Layer 0 won’t kill Layer 3 in a year either. Hybrid products will dominate the market for a while. Pure autonomous products will work for some use cases and fail for others.

But the direction is clear. The human-in-the-loop isn’t going away, it’s moving up. Instead of entering data and configuring workflows, humans do work that actually requires human judgment: setting strategy, making ethical calls, building relationships, evaluating ambiguity. The AI handles the structured, repeatable, process-driven layer. The human handles everything else.

Every software category will be rebuilt around this principle. The only question is who rebuilds it first.

Part 2 of this series is now live: Software is about to stop looking like software

Navneet Singh is the Founder & CEO of Webority Technologies. He writes about engineering-first approaches to building technology companies.

CRED’s Identity Crisis: How a Premium Club Became a Platform Still Searching for Its Purpose

Navneet Singh — Sun, 14 Dec 2025 05:23:30 GMT

When CRED launched in 2018 under founder Kunal Shah, it carried the aura of a modern black card moment for Indian consumers.

When CRED launched, access was restricted to individuals with a credit score of 750+, reinforcing the idea that this was a genuine club for India’s most financially disciplined consumers.

The pitch felt simple and powerful:

If you are financially disciplined, have a high credit score, and pay on time, you belong to an exclusive club. We’ll reward you for it.

In spirit, it evoked the mythology of the iconic Amex Black Card: a quiet signal that you’ve “made it” financially and now get access to a different world of benefits.

CRED’s early positioning was clear:

Aggregate India’s high‑credit‑score, financially responsible consumers.
Give them an elegant, reward‑driven experience for paying credit card bills on time.
Build a community with trust, high spending capacity, and strong brand appeal.

On paper, that is a powerful thesis. Capture a subset of consumers who are trusted by banks, have high scores, spend more, and likely have higher disposable income. then build premium financial services around them. The underlying bet: this audience should be monetizable at much higher margins than the typical mass‑market user.

In practice, CRED built something quite different.

My Personal Experience with CRED: Why I Could Not Fully Trust It

I signed up for CRED when it launched, partly out of curiosity and partly because I fell squarely into their target segment. I wanted to see what this “exclusive club” actually felt like from the inside.

At one point, CRED requested access to my Gmail account so it could scan card statements and provide personalized insights. I granted it briefly. just to evaluate the experience. but very quickly revoked access. The idea of giving a private company full visibility into all my emails didn’t sit well with me. It felt like too much trust, too early, without enough clarity on how my data would be used.

That moment stayed with me. It highlighted a deeper issue that ties directly into CRED’s business model:

If the value offered isn’t truly premium, users will hesitate to offer premium‑level access.

Trust is the currency of financial products. And for a platform attempting to position itself as a premium financial ally, trust must be earned with substance. not just sleek design or clever marketing.

Feature Timeline: How CRED Shifted Directions Over the Years (and What Each Shift Tells Us)

To understand CRED’s identity crisis, it helps to look at its major product launches chronologically. The pattern reveals a platform trying multiple paths, each time nudging away from its original promise.

2018: The So‑Called Exclusive Credit Card Bill Payment Club

CRED launches as an invite‑only app for people with a 750+ credit score, advertised as a premium club. although in reality, nothing inside felt genuinely exclusive.**.

Simple value: Pay your credit card bill → earn CRED coins → redeem rewards.

2019: Gamification Takes Over (The First Visible Drift)

CRED begins layering gamified mechanics. kill‑the‑bill, scratch cards, jackpots. These shifts moved the product away from being a disciplined financial club and closer to a dopamine‑driven engagement app.
CRED starts introducing:

“Kill the bill” games
Scratch cards
Jackpot‑style campaigns

2020: Launch of CRED Store

A marketplace redeemable through coins. but lacking anything truly exclusive. Instead of premium financial value, users saw repackaged discounts available elsewhere, weakening the club narrative.**
A curated marketplace of brands redeemable through coins. but almost nothing about it was truly premium. The supposed club offered no exclusive access, no elite experiences, and no member‑only advantages.

2020: Introduction of CRED RentPay (Useful, But Not Premium)

A creative tool for rent payments via credit cards, but unrelated to premium benefits. It expanded utility, not exclusivity.**
Users could pay rent using a credit card by paying a fee.

2021: CRED Cash (Short‑Term Credit That Premium Users Didn’t Need)

A lending product offering instant credit. But high‑credibility users typically already access better rates from banks, making this feel misaligned with their needs.**
Instant credit lines offered through partner NBFCs.

2021: CRED Mint (Risky, Not Exclusive, and Not Tailored to High‑Credibility Users)

Peer‑to‑peer lending opened a new avenue, yet introduced risk without delivering premium‑grade wealth management. It felt experimental, not curated.**
Allowing users to lend money and earn interest.

2022: CRED Pay (Checkout Integration)

A merchant‑checkout layer that positioned CRED as a payment facilitator. Useful for brands, but added no distinctive value for premium users.**
CRED partners with merchants to enable payment via CRED at checkout.

2022: Happay Acquisition (A Move Completely Outside the “Elite Club” Narrative)

A major entry into B2B expense management. A sharp turn away from the consumer‑focused, premium‑club identity.**
A major expansion into B2B financial operations.

2023–2024: Diversification Wave

Travel deals, dining offers, insurance, utilities. a scatter of features that broadened the platform but diluted the original promise even further.**
CRED begins offering:

Travel deals
Dining experiences
Utility payments
Insurance‑related products

What CRED Actually Built: Features and Revenue Streams

To understand the gap between promise and monetization, it helps to map what CRED has built and how it makes (or claims to make) money.

Features and user experience

Over the years, CRED has evolved from “pay your credit card bill and earn coins” to a broader fintech and commerce platform. The major pieces:

Credit‑card bill payment hub: A single place to track multiple cards, get reminders, pay on time, and see basic analytics.
Rewards and “CRED coins”: Every bill payment earns coins that can be redeemed for offers, discounts, and experiences.
“Members‑only” positioning: Eligibility based on credit score and branding around financial responsibility created a club‑like feel.
Additional verticals layered on top:
- Rent payments via credit card (CRED RentPay) with convenience fees.
- Short‑term credit lines (CRED Cash / Stash) and peer‑to‑peer style products (CRED Mint).
- A curated e‑commerce “Store” and brand‑offers marketplace.
- Corporate and expense‑management capabilities via acquisitions like Happay.

Revenue and monetization channels

Publicly, the monetization story looks something like this:

Brand/merchant listing fees and commissions: Brands pay to list offers, get visibility, and access this high‑credit‑score audience. CRED earns fees or commissions when users redeem.
Transaction and processing fees: On flows like RentPay and certain payments, CRED charges convenience or service fees.
Lending and interest income: From credit lines, lending products, and partnerships with banks/NBFCs.
Data and user‑insight monetization: In theory, enabling highly targeted access to a premium user base within regulatory bounds.

The financial picture in brief

Even with growing revenue and a multi‑billion‑dollar valuation at its peak, CRED has consistently reported substantial operating losses. The topline is scaling; the bottom line still hasn’t convincingly followed.

So the natural question arises: If you’ve aggregated some of the country’s most creditworthy consumers, why is monetization still this hard?

The Core Problem: Premium Audience, Non‑Premium Value

CRED’s thesis is straightforward: a user who is credit‑worthy is more monetizable. Banks want them, brands want them, and advertisers love them.

The catch is this: premium users expect premium value.

If what they receive, after clearing the “exclusive” bar, is an endless feed of discounts on consumer products, the experience quickly starts to feel less like a black card and more like a glorified coupon engine.

1. The premium audience is valuable only if they feel they’re treated as premium

The entire club narrative works only if the club feels different.

Instead of:

“Because you’re financially solid, we’ll help you build more wealth, access better credit, and unlock superior financial opportunities.”

the user journey often becomes:

“Because you’re financially solid, here are some offers and cheap stuff you can buy if you redeem enough coins.”

That’s a fundamental mismatch. You attract strong, disciplined financial profiles and then mostly nudge them to consume more, not necessarily to become richer, more secure, or more empowered.

2. The margin economics of “cheap deals” is weak

If your core loop is:

pay bill → earn coins → redeem discount,

then your economics are tied to the economics of discounts.

Deals are typically:

Low‑margin.
Often subsidized by brands as marketing spends.
Weakly linked to long‑term loyalty or deep financial engagement.

This can drive engagement and app opens, but it rarely produces high‑margin, defensible revenue. You end up playing in the same arena as any rewards app or offer aggregator, just with better branding.

3. The shift to real financial services is necessary: and difficult

The only way to truly monetize a high‑quality, high‑trust user base is through financial products:

Credit lines.
Investment products.
Insurance.
Wealth and advisory services.

These products can deliver meaningful margin. But they also come with:

Risk (credit risk, market risk).
Regulatory complexity.
The need for strong underwriting and data models.
Longer product cycles and slower, more disciplined growth.

CRED has taken steps into this territory, but it’s still not clear that these initiatives have scaled to match the size and promise of the user base.

4. Exclusivity vs scale: the strategic tension

Exclusivity is a differentiator. but it’s also a constraint.

If you insist on only the highest scores, your audience is relatively small. That’s fine if your monetization per user is strong. If it isn’t, exclusivity becomes a liability rather than a moat.

You’re then trapped between two unattractive choices:

Loosen eligibility and dilute the premium narrative to chase volume.
Stay exclusive and accept thin monetization for a long time.

Neither fully honors the original “elite club” story.

5. The marketing and burn‑rate trap

Reward‑driven businesses burn capital quickly:

You subsidize user acquisition.
You subsidize rewards.
You spend on brand and advertising.

If the monetization machine lags behind, you end up in a cycle of chasing ever more volume to justify ever more spend. without a clear path to sustainable profitability.

Where CRED Missed the Opportunity

In my view, the real miss is simple:

CRED figured out how to aggregate highly credible, disciplined financial users.
It did not figure out how to monetize them in a way that matches their profile.

These are precisely the users who would respond to:

Better investing options.
Sophisticated wealth‑building tools.
Access to unique financial products.
Ways to optimize taxes, debt, and long‑term planning.

Instead, the experience has largely centered around:

Games of chance.
Lucky draws.
Flashy campaigns.
A catalogue of offers that often look like the same promotions available elsewhere, repackaged inside a premium wrapper.

It’s an emotional mismatch: for a user who sees themselves as financially solid and responsible, the platform often feels more like a gamified shopping feed than a serious financial ally.

What CRED Could Have Built Instead

If I were building a business around high‑credibility users, the design would tilt more towards wealth, leverage, and long‑term value, and less towards discounts and dopamine.

A. From “cheap deals” to “wealth‑creation and financial empowerment”

The narrative shift should be:

From: “You’ve paid your bills on time, here’s a deal.”
To: “You’ve proven financial discipline, now we’ll help you compound it.”

Concrete ways to do that:

Curated investment products aligned to user maturity and risk profile.
Premium credit lines with clearly superior terms.
Tools for long‑term wealth planning, tax optimization, and goal‑based investing.
Access to alternative assets and structured products typically not marketed to the average retail user.

B. Embed genuinely sticky financial services

The goal is to become the primary financial cockpit for this user segment.

That means:

A consolidated dashboard of cards, loans, investments, and liabilities.
Intelligent nudges not just to spend but to save, invest, rebalance, and de‑risk.
Monetization via:
- Advisory or subscription fees.
- Asset‑under‑management-based income.
- Premium credit lines and structured lending products.

Here, the revenue comes not from selling “stuff”, but from helping users manage and grow their money.

C. Make the value exchange explicit: and premium

A high‑score user with a solid financial profile does not want to feel like just another target in an ad funnel.

The platform must answer clearly:

“Because you are financially disciplined, what can we help you do that others cannot access?”

A smaller set of high‑quality, high‑relevance benefits will outperform an endless scroll of generic offers.

D. Drive high margin, not just high volume

Premium audiences justify premium economics. That requires:

Tight tracking of unit economics per user.
Ensuring each reward, campaign, or partnership has a clear business case and high‑value outcome.
Lending products built with rigorous risk controls and clear margin, not just vanity volume.

E. Scale without diluting the club

Exclusivity can be maintained even while scaling, if you:

Expand into adjacent segments that are still financially strong (entrepreneurs, business owners, high‑earning professionals).
Create internal tiers (for example, a more “elite” layer on top of an already strong base) with radically differentiated benefits.

The key is that growth should not come at the cost of the brand’s core promise.

Why CRED Still Feels Stuck in the “Gap Phase”

CRED has done the hard work of:

Building a brand.
Acquiring a desirable user base.
Owning a mind‑share position as the app for “credit‑worthy” people.

But the monetization architecture has not yet fully caught up with that ambition.

Today, the experience still leans heavily on:

Rewards.
Offers.
Consumer spending.

and far less on:

Long‑term wealth.
Serious financial leverage.
High‑margin financial relationships.

Until that gap closes, the business will continue to feel like a premium wrapper around a discount‑driven core.

The Way Forward for CRED and for Builders Watching It

If CRED wants to live up to the mythology it created, the shift ahead is clear:

Re‑position from a reward app with elite branding to a serious financial empowerment platform.
Monetize through high‑margin financial products and services, not just through deals.
Use rewards as a hook, not the foundation of the business.
Preserve exclusivity, even as it scales into adjacent premium segments.
Be radically transparent about data, insights, and how they’re used to create value for users.

For founders, technologists, and product leaders, there’s a broader lesson here:

Aggregating a great user base is only half the game.
Architecting a monetization model that genuinely respects who those users are. and what they actually need. is the other half.

Conclusion

CRED began as an ambitious attempt to build an exclusive club for India’s most financially disciplined consumers. On that front, it has achieved something rare: a brand that people recognize and a user base that most financial institutions would love to own.

But brand and user base are inputs, not outcomes.

The next phase demands an equally thoughtful monetization engine. one that treats high‑credibility users not as a crowd to be sold discounted products, but as partners in long‑term financial growth.

Whether CRED can make that leap. from rewards‑driven engagement to genuine financial empowerment. will define its real legacy. If it gets this right, it will finally become what it hinted at in the beginning: not just another app on your phone, but the closest thing India has to a true, modern, digital “black card” experience for the financially disciplined.

Navneet Singh is the Founder & CEO of Webority Technologies. He writes about engineering-first approaches to building technology companies.