The Agentic Engineering Field Guide, Part 3: The Building Blocks

Memory, RAG, tools, context, safety, cost, identity, and planning. The patterns every production agent build hits regardless of framework. Where the ecosystem actually lands in April 2026.

May 01, 2026

Why This Part Matters More Than The Framework Choice

Framework choice is the decision teams obsess over. The building blocks are the decision teams skip. That is backwards.

A team on the best framework with the wrong memory pattern has a bad product. A team on a mediocre framework with the right evaluation harness ships reliably. The patterns in this piece are what separate the agent systems that survive in production from the ones that get quietly shelved after six months.

I have been in the room for a lot of those quiet shelvings. Every one of them failed on a building block, not on the framework. The agent never remembered the user correctly. The retrieval missed the key document. The eval harness did not catch the regression that the customer did. The cost doubled after a model provider changed its pricing and nobody noticed for three weeks.

This piece walks each building block in the order I evaluate them with clients. Memory. RAG. Tools. Context. Safety. Cost. Identity. Planning. For each one: what the pattern is, what the ecosystem ships in April 2026, and the honest failure modes you will hit in production.

Memory

The vocabulary has settled. The CoALA taxonomy from 2024 is now the consensus. Three kinds of memory.

Semantic memory is facts. What the user's name is. What their preferences are. Which account tier they are on. What the contract terms are. The things you would put in a database if your system were not agentic.

Episodic memory is experiences. What the agent did in the last session. What tool calls succeeded and failed. Which responses worked and which got a frustrated reply. The trajectory history.

Procedural memory is instructions. How to handle a particular scenario. Which steps to take for a given task type. This usually ends up encoded in the system prompt, but recent work treats it as something the agent can revise over time.

The second axis is scope. Short-term memory lives inside the context window and is managed by the framework's checkpointer. Long-term memory lives in a namespace that survives sessions and is retrieved on demand. A third term, working memory, now gets used two different ways. In some frameworks it means the same thing as short-term. In Mastra it means a persistent structured profile that is always loaded. When you talk to your team about working memory, be explicit about which meaning you are using. I have seen real architectural disagreements traced to this one word.

The Consolidation Pattern Is The Thing

The pattern that makes long-running agents work is consolidation. A background process reads the episodic log and writes semantic or procedural summaries. Without it, your agents either lose information (short context) or drown in it (context overflow).

The reference implementations to study:

Mastra's Observational Memory, generally available in early 2026, uses background agents to maintain a dense observation log that replaces raw message history as it grows. Context stays small. Long-term memory stays rich.

LangChain's LangMem SDK ships create_memory_manager with explicit control over hot-path extraction (synchronous, cheap, brittle) versus background extraction (async, expensive, more reliable). This is the right abstraction. Most production systems need both.

Anthropic's memory tool shipped public beta in September 2025 on Claude Sonnet 4.5 and is now generally available on Opus 4.6 and Sonnet 4.6. It is a file based tool Claude calls with create, read, update, and delete operations against a developer-managed storage backend. Anthropic reports a 39 percent improvement on internal agentic-search evaluations when combined with context editing, and 84 percent token reduction on 100-turn web search tasks. Critical detail: the memory tool operates entirely client side through tool calls. You own the persistence layer.

Claude Code's pattern is conceptually the same thing. Background consolidation into a persistent memory directory, then reinsertion on every turn.

Framework State April 2026

Anthropic memory tool. GA on 4.6 models. You manage storage.

LangGraph Store plus LangMem. BaseStore API with namespace scoping, semantic search, filter by content. LangMem adds memory manager, store manager, and prompt optimization primitives on top. The most complete memory story in an open framework today.

Microsoft Foundry Memory tool. Preview as of March 2026. Integrated into Foundry Agent Service alongside web search, file search, and code interpreter. Region-gated.

OpenAI Memory. The developer-facing story is the Responses API plus stored conversations plus File Search. There is no first-class memory tool equivalent to Anthropic's. Teams build their own.

Mastra Memory. Working memory (structured profile), semantic recall over past messages, Observational Memory. Automatic thread and resource isolation in multi-agent systems. Deterministic subagent resource IDs derived from parent. The cleanest memory ergonomics if you are on TypeScript.

CrewAI memory. Short-term, long-term (SQLite by default), entity memory (RAG over named entities), user memory. Lighter weight than LangGraph or Mastra. Fewer knobs to turn.

Pydantic AI. Does not ship memory primitives. You pass message history into the agent call. Memory is your problem. This is an intentional design choice that says "we do not hide the model."

Honest Failure Modes

Memory staleness is the first failure you will see. The agent confidently asserts a fact the user corrected weeks ago because the old fact was never overwritten. Collection-style memories are worse than profile-style here. If a user's preference can change, make sure your memory pattern supports update, not just append.

Memory explosion is the second. Naive "remember everything" pipelines produce thousands of near-duplicate entries. Retrieval drowns in noise. The cure is deduplication, compaction, and ruthless eviction.

Cross-user bleed is the third and scariest. Namespace bugs leak one user's memory to another. The LangMem team calls this out explicitly as the reason they made user_id a first-class namespace. If your memory story does not have tenant isolation at the storage layer, you have a security bug waiting to happen.

Hot-path extraction slowing every turn by 2 to 4 seconds is the fourth. It feels fine until you measure. Run extraction in the background.

Consolidation agents hallucinating summaries and poisoning the knowledge base permanently is the fifth. Review queues matter. For high-stakes memories, a human gate on consolidation is cheap insurance.

Frontier Patterns

Consolidation is the table stakes pattern. The next tier of memory architecture is where production systems are still catching up to research. These are the patterns you should know about, evaluate with judgment, and reach for when the baseline is no longer enough.

Decay and recency-weighted retrieval. The Generative Agents paper from Park et al. at Stanford in 2023 proposed scoring each memory on three axes. Recency, with exponential decay over elapsed time since last access. Importance, a model-scored salience label. Relevance, embedding similarity to the current query. Retrieval ranks by the weighted sum of the three. The model is better than naive similarity alone because a highly-relevant but stale memory loses to a less-relevant but fresh one when that is the right call.

In production, fixed TTL on memories is almost always wrong. User preferences can persist for years. A tool result is often stale in hours. A session fact is useful for a day. The decay function should be per-memory-type, not global. Letta exposes per-block configuration. Zep's temporal knowledge graph tracks validity windows on individual facts. Mastra's Observational Memory compacts older observations into denser summaries rather than deleting them, which is a softer form of decay. Anthropic's memory tool leaves decay entirely to your storage layer, which is honest but unhelpful if you do not have a decay policy of your own.

What I tell clients: do not build a one-size-fits-all decay curve. Tag every memory with a type (preference, session fact, tool result, observation, commitment) and set decay policy per type. This adds a day of work up front and prevents the category of bug where the agent remembers something it should have forgotten, or forgets something it was supposed to keep for a year.

Virtual context and paging. MemGPT from Packer et al. in 2023 made the case that context-window management should look like an operating system. A small core memory stays resident. Archival memory lives on disk and pages in on demand through explicit agent-issued tool calls. The agent literally reads and writes its own memory with functions like core_memory_append, archival_memory_insert, and archival_memory_search. The model manages its own working set.

Letta is the commercial heir. Active project, Python-first, agent-as-service architecture where the agent runs as a persistent process with memory state surviving across restarts. Used in production at a growing number of companies, with integrations appearing for LangGraph and Mastra that let Letta serve as a pluggable memory backend. The right fit when your agent holds long-running state across sessions with the same user and you want the model to explicitly decide what stays in core context.

Be honest about where paging wins. For short, stateless request-response agents, virtual context is overkill. For an agent that has a months-long relationship with a user, it is one of the few patterns that scales without getting gradually dumber. The middle ground, where ordinary consolidation into a summary is enough, is where most production teams actually sit. Reach for virtual context when the consolidation pattern is already in place and you are still losing important facts in the summary step.

Sleep-time compute and offline reflection. The idea. Agents do their best thinking when idle. Between user turns, overnight, or in batch, a background process reads recent trajectories and produces reflections, revised plans, updated skill documents, or rewritten memory entries. The user never waits for this work. The agent wakes up smarter than it went to sleep.

Letta ships explicit sleep-time agents that run on a configurable cadence. Mastra Observational Memory does a narrower version implicitly through background consolidation. Claude Code's pattern of background memory rewrites between sessions is another narrow form of the same idea. Recent academic work has framed sleep-time compute as a distinct scaling axis, alongside pre-training compute and test-time reasoning compute, with its own cost and quality curves.

Production use is early. The failure mode is obvious. If your reflection agent hallucinates a bad summary overnight, you wake up with a confidently wrong agent. The mitigations are the same as with consolidation. Review queues on high-stakes rewrites. Diff logs so a human can see what the reflection changed. Never let a reflection agent overwrite user-provided facts.

The CTO call. Consider sleep-time compute for agents that are accessed infrequently by each user (once a day, once a week) where the reflection cost is amortized over many user interactions, and where coherence across sessions is a visible product feature. Skip it for high-throughput short-turn agents where every interaction stands alone.

Reflection trees. The other half of the Generative Agents contribution. Periodically, the agent looks at a batch of recent memories, asks itself what questions those memories raise, answers the questions using the same memory store, and writes the answers back as higher-level memories. The tree emerges from recursion. A week-old interaction gets abstracted into a one-line insight. The one-line insights get abstracted into a paragraph-long trait. The agent builds its own theory of the user without anyone writing the theory down.

This is a pattern you can implement on top of any memory system that supports insertion with metadata. Mastra, LangMem, and Letta all have the primitives. No mainstream framework ships it as a turnkey feature. The value is real when your agents have a rich episodic log and the user cares about long-term coherence (coaching, therapy, executive assistant, tutor). The cost is model calls for the reflection work, paid on a schedule you control. Budget for it.

Workflow memory. Agent Workflow Memory takes procedural memory further. The agent extracts reusable workflow patterns from its own successful trajectories and stores them as named recipes, parameterized by the variable parts. Next time a similar task arrives, it retrieves the matching workflow and follows it instead of planning from scratch. Research results on web navigation benchmarks are meaningful.

Production adoption is shallow. The closest shipping analogue is Anthropic Skills, but Skills are human-authored folders, not auto-extracted workflows. Automatic workflow synthesis from traces remains research-grade. Early adopters are building this themselves. The pattern I have seen work is a weekly batch job that reads LangSmith or Braintrust trace exports, clusters trajectories by task similarity, proposes candidate workflows, runs a human review, and promotes the approved ones into the agent's skill library. That pipeline ships. It is not turnkey.

Hierarchical memory beyond two-tier. MemGPT's two-tier model (core plus archival) is the current de facto standard. CoALA hints at richer hierarchies. Sensory buffers. Short-term working memory. Medium-term episodic buffers. Long-term consolidated knowledge. In practice, nobody has shipped a five-tier production system that outperforms the two-tier one by enough to justify the complexity. The tier count is not the insight. The insight is that different memory types need different retention, retrieval, and consolidation policies. You get most of the benefit with a two-tier architecture plus per-type policies, not with adding more tiers.

When To Reach For Frontier Memory

A simple decision rule. If your agent runs fewer than twenty turns per session and each session is independent, the standard consolidation pattern is enough. Stop there. If your agent runs for weeks or months with the same user and the quality of the relationship compounds, you will need at least two of the frontier patterns. Start with decay-aware retrieval and reflection trees. Add virtual context if context overflow becomes the binding constraint. Add sleep-time compute if user-perceived intelligence-per-session is what you are selling.

Do not add all of them at once. Each has a cost in complexity and in model calls. Each is also a new surface where the agent can corrupt its own memory. Add them one at a time with eval coverage on what you expect to improve.

Retrieval Augmented Generation

Classic RAG is still the baseline. Chunk the documents. Embed the chunks. Retrieve the most similar chunks for a query. Stuff them into the prompt. It remains the right starting point because it is cheap, understandable, and composable. It rarely ships alone anymore.

The Patterns You Actually Ship

Agentic RAG is now the dominant production pattern. The agent decides whether to retrieve at all, issues multiple queries with different formulations, reads tool descriptions, and can re-query after inspecting initial results. Foundry File Search, OpenAI File Search in the Responses API, and Anthropic's web-fetch and web-search tools are all agent driven. The model, not a pre-built pipeline, decides when to reach for data.

Graph RAG. Microsoft's GraphRAG builds a knowledge graph of entities, relationships, and claims, runs Leiden clustering, generates community summaries, and answers via Global, Local, or DRIFT search. Global is for broad "what are the themes" questions. Local walks entity neighborhoods. LazyGraphRAG, from Microsoft Research in late 2024, defers LLM work to query time. It builds a cheap graph index with NLP-extracted concepts instead of full LLM summarization, then uses the LLM only during search. Microsoft's published claim is roughly 0.1 percent of GraphRAG's indexing cost at comparable answer quality. That is the difference between GraphRAG being viable on millions of documents and not.

Contextual retrieval. Anthropic's technique from September 2024. Prepend a 50 to 100 token chunk-specific context to each chunk before embedding and before BM25 indexing. Reported results: contextual embeddings alone cut retrieval failure rate by 35 percent. Contextual embeddings plus contextual BM25 cut it by 49 percent. Adding a reranker pushed the total improvement to 67 percent. The preprocessing cost is about 1.02 dollars per million document tokens using Haiku with prompt caching. This is now standard. Most new RAG stacks bake it in by default.

Hybrid retrieval. BM25 plus dense vectors, fused with Reciprocal Rank Fusion. The consensus default for production. Pure vector search loses exact-match queries like product codes, SKUs, and error codes. Pure BM25 loses semantic paraphrase. Every serious retrieval stack does both.

Rerankers

Cohere Rerank 3.5 from December 2024 is the commercial leader with strong multilingual reasoning. Cohere Rerank v4 is now sold directly through Azure Foundry Models. Voyage rerank-2.5 is popular for finance and legal. BGE-reranker-v2 and Jina Reranker v2 are the open-weight choices. BGE for cost. Jina for multilingual. Cross-encoder models from the ms-marco-MiniLM family still beat nothing for pennies when latency is not critical.

Rule of thumb: retrieve top 50 to 100 candidates, rerank to top 5 to 10. Reranking adds 50 to 300 milliseconds of latency. That is why latency-sensitive apps skip it. Most apps should not.

Vector Databases, April 2026 Positions

Pinecone is still the managed default for teams who want zero ops. The serverless tier is aggressive on price.

Qdrant is the Pinecone alternative of choice for self-hosted. Rust core, strong filtering.

Weaviate is hybrid native from day one with strong multi-tenancy.

pgvector with pgvectorscale from Timescale is the right answer when you already have Postgres. StreamingDiskANN through pgvectorscale closed the performance gap for most workloads.

Turbopuffer is object-storage backed and extremely cheap for cold data. The darling of 2025 for billion-scale low-QPS workloads.

LanceDB is embedded and popular for desktop, edge, and Mastra-style local-first stacks.

Azure AI Search, Vertex Vector Search, and Bedrock Knowledge Bases are the hyperscaler answers. Pick based on where your compliance and data gravity already live.

Chroma is dev and prototype. Production deployments have mostly migrated off.

Managed RAG Offerings

Foundry IQ is Microsoft's knowledge layer for enterprise agents. It unifies Azure AI Search, SharePoint, OneDrive, and custom sources, and integrates with File Search in Agent Service.

Vertex RAG Engine is managed chunking plus embedding plus retrieval. Integrates with Vertex Vector Search and Gemini grounding.

Bedrock Knowledge Bases handle managed ingestion from S3, Confluence, SharePoint, Salesforce, and web sources. Supports hybrid search, reranking, and GraphRAG via Neptune Analytics.

OpenAI File Search is a tool inside the Responses API. Handles vector store creation, chunking, retrieval, and citations with minimal configuration.

Honest Failure Modes

Chunks without context is the single biggest accuracy killer. The "3 percent revenue growth" example from the Anthropic contextual retrieval paper is not an edge case. It is the median production bug. Every chunk needs enough context for standalone retrieval to make sense.

Recall cliffs at top-k boundaries. Your ranking puts the right document at position 11 when you retrieve 10. Expand your retrieval window, then rerank.

Embedding model mismatch. Your stored embeddings are from mpnet-base-v2 from 2022. Your queries are from text-embedding-3-large. Your similarity scores are meaningless. Re-embed when you upgrade.

Evaluator theater. LLM as judge on RAG that rewards fluent wrong answers. Use groundedness metrics, not just answer quality.

Freshness. Documents change. Embeddings drift. Nobody re-indexes until a customer complains. Put a freshness policy in place from day one.

Tools Beyond MCP

Function calling has converged. OpenAI, Anthropic, and Gemini all support structured tool schemas, parallel tool calls, and forced tool calls. Parity for basic use cases. Differences show up in fine-grained streaming, programmatic tool calling, and tool-search features for large tool catalogs. Anthropic ships tool search and programmatic tool calling as first-class features when you have more than 30 tools in the loop. This matters if you have a big tool inventory.

Computer use. Anthropic's computer use tool is beta with the header computer-use-2025-11-24 for Opus 4.6, Sonnet 4.6, and Opus 4.5. Screenshot, mouse, keyboard, and desktop automation. State-of-the-art results on WebArena among single-agent systems. OpenAI ships Operator (consumer) and the computer-use-preview model for developers through the Responses API. Gemini's equivalent is Computer Use in the Gemini API. All three remain reliability-limited for multi-step workflows. Expect 40 to 70 percent task completion on real workloads. If your use case depends on computer use in production, plan for human fallback.

Code interpreters. OpenAI Code Interpreter is generally available inside the Responses API and Assistants. Anthropic's Code Execution Tool is beta and required for Skills. Microsoft Foundry Code Interpreter is generally available, Python sandboxed, supports data analysis and chart generation. All three are stateful within a session and ephemeral across sessions unless you mount storage.

Skills. Anthropic Agent Skills launched October 16, 2025, and became an open standard on December 18. A skill is a folder with a SKILL.md file plus scripts and resources that Claude loads only when relevant. Composable, portable, progressive disclosure. Identical format across Claude apps, Claude Code, and the API through the /v1/skills endpoint. Requires the Code Execution Tool beta. Microsoft Agent Framework's class-based skills in MAF 1.x are the closest equivalent in the Microsoft stack, with class hierarchies instead of Markdown folders. The industry direction is clearly progressive disclosure. Ship instructions as files. Load on demand. Do not put everything in the system prompt.

Browser automation. Playwright is the plumbing. Microsoft Playwright MCP and the native Anthropic Playwright extension cover most use cases. Browserbase is the managed cloud for running headless browsers at scale with session replay, stealth mode, and proxy rotation. The default for teams that do not want to run Playwright infrastructure.

Tool registries. Foundry's tool catalog advertises roughly 1,400 tools including MCP servers, connectors, Logic Apps, and SharePoint. The biggest single catalog. OpenAI's ecosystem is MCP native now. Anthropic's Connectors directory is smaller but curated, with organization-wide skill management for Team and Enterprise. The question is no longer who has more tools. The question is whose tool permissions, auth, and audit story your compliance team will sign off on.

Context Engineering

Prompt Caching Is A Free Win

Anthropic's pricing is public and aggressive.

Opus 4.6 base is 5 dollars per million input tokens. Cache write is 6.25 dollars at the 5-minute TTL or 10 dollars at the 1-hour TTL. Cache hit is 50 cents. Ten times cheaper than the base rate.

Sonnet 4.6 base is 3 dollars per million input tokens. Cache write is 3.75 or 6. Cache hit is 30 cents.

Break-even is after roughly two hits. The cache has paid for itself. Everything after that is 90 percent savings.

The 5-minute default refreshes for free on each hit. The 1-hour extended cache is for long-running agent workflows where context is stable but access is sparse. Anthropic's own contextual retrieval cost analysis of 1.02 dollars per million document tokens relies on this.

OpenAI supports automatic prompt caching for prompts over 1024 tokens at a standard 50 percent discount on cached input. Gemini offers context caching with a minimum cache size and a storage-per-hour charge. Every serious production stack now caches system prompts, tool schemas, and long document contexts. If yours does not, you are overpaying by something like 5 to 10 times on every stable prefix.

Compression And Reinsertion

The dominant compression patterns. Rolling summaries every N turns. Hierarchical summaries where you keep the last N turns in full plus older turns summarized. Tool-result pruning where stale tool calls are automatically removed. Anthropic's context editing reports 84 percent token reduction on 100-turn workloads using this. Selective keep based on relevance scoring.

Context reinsertion. Claude Code's pattern is now de facto standard for coding agents. Re-inject the per-project memory file plus the memory directory plus directory listings plus recently modified files at the start of every turn. Cursor, Windsurf, and Aider all follow variants.

Context Windows And The Long Context Question

GPT-5.4 and the 5.x family ship 1,050,000 token context with 128,000 token output on Azure Foundry. Claude Opus 4.6 and Sonnet 4.6 offer 1 million tokens for qualified customers. Gemini 3.x ships 1 million plus tokens standard across the family with some models accepting 2 million.

The 1 million token era has not killed RAG. It has killed RAG for small knowledge bases. The decision rule most teams use now:

Under 500 thousand tokens: stuff and cache. No retrieval needed.

500 thousand to 10 million tokens: hybrid. Stuff the hot data. RAG the cold.

Over 10 million tokens: RAG with contextual retrieval and reranking.

Anthropic's own guidance: if your knowledge base is smaller than 200,000 tokens or roughly 500 pages, include the entire knowledge base in the prompt with no retrieval. I have had real client debates where we deleted half a retrieval pipeline because the documents fit. The simpler answer beats the complicated one when the simpler answer works.

Evaluation

LLM As Judge

Production standard but not trusted alone. Known failure modes. Position bias, which prefers the first of two candidates. Verbosity bias, where longer answers score higher. Self-preference, where GPT-4o judging GPT-4o rates itself higher than a blind comparison. Brittleness to prompt phrasing.

The mitigations are well known. Pairwise over pointwise. Calibrate with human labels. Use multiple judges and majority vote. Always pin the judge model so results are comparable across runs.

Rubric Based Scoring

G-Eval with chain-of-thought grading is the reference pattern, implemented in DeepEval, Ragas (RAG-specific: faithfulness, answer relevance, context precision, context recall), and Promptfoo. These remain the fastest path to getting something in CI.

Regression Suites

The bar has shifted. Every eval framework now integrates with CI/CD. Braintrust's "promote playground to experiment, run in CI, catch regressions before production" is representative of the whole commercial category. Foundry Evaluations integrates directly into the Agent Service lifecycle.

Red Teaming

UK AISI's Inspect framework is the most credible open-source evaluation harness for safety and has been adopted by multiple national AI safety institutes. Commercial options for red-team work include Haize Labs, Patronus AI, and Lakera Red.

Tooling Landscape, April 2026

Open source:

DeepEval has 14 plus metrics, pytest-native, good for unit-test-style evals.

Ragas is the RAG-eval default.

Giskard handles test generation and red-teaming with strong coverage on bias and reliability.

Phoenix and Arize offer OSS tracing plus eval and are now the most common choice for teams on OpenTelemetry.

Promptfoo is YAML-driven with easy CI integration and model comparison matrices.

Inspect is the rigorous choice used for publishable safety evaluations.

Commercial:

LangSmith Evals has the tightest LangChain integration and is strong for tracing plus eval in one tool.

Braintrust provides playgrounds, experiments, and online scoring. Product-led with good UI for teams that want one.

HoneyHive offers similar positioning with more enterprise feature completeness.

Pydantic Evals is code-first and minimal. Pairs with Pydantic AI.

Foundry Evaluations is native to Azure with Application Insights and RBAC integration.

Galileo is enterprise focused with hallucination detection and compliance reporting.

Honest Take

Eval tooling still falls short on four things.

Agent trajectory evaluation. It is easy to grade a final answer. It is hard to grade a 30-step agent loop.

Grounding drift. Judges do not reliably catch fluent but unsupported claims.

Cost of running evals. Full eval suites can exceed your training-data cost.

Test set rot. Your golden set becomes your model's memorized set over time.

Budget 15 to 25 percent of agent engineering time on eval infrastructure. That is the number I use with clients. It is more than most teams plan for, and it is the reason the teams that plan for it ship reliably.

Safety And Guardrails

Content Filters

OpenAI Moderation API, free and category-scored. Azure AI Content Safety, including Jailbreak shield, Groundedness Detection, Protected Material Detection, and Indirect Prompt Attack detection. Integrated into Foundry Agent Service by default. Bedrock Guardrails with managed policies across Bedrock models and configurable denied topics plus PII filters. Gemini safety settings with four category thresholds configurable per request.

Prompt Injection

No technique is fully proven. The layered defense that works in practice combines five techniques.

Spotlighting. Mark untrusted input with delimiters plus explicit instructions.

Separate trust zones. Tool outputs and user input are not the same thing. The prompt structure should make that clear to the model.

Output validation. Structured outputs plus tool-call shape checks. The output should be parseable and pass schema validation.

Least-privilege tools. The agent can only do what it could do safely if fully compromised. If your agent can drop database tables, that is an architecture problem, not a security problem.

Content filter pre-check on external inputs. Azure's XPIA detection in Defender for Foundry is the first shipping commercial detector specifically targeted at cross-prompt injection attacks.

Jailbreak Detection

Azure Defender for Foundry Tools includes XPIA detection and jailbreak classification. Anthropic's Constitutional AI underpins Claude's refusal behavior and remains the strongest published defense-in-depth against jailbreaks. Third-party options include Lakera Guard and Protect AI.

Guardrails Frameworks

NVIDIA NeMo Guardrails offers programmable rails through Colang with a multi-rail architecture covering input, output, retrieval, and execution. Widely adopted.

Meta Llama Guard and Prompt Guard are open-weight classifiers. The default drop-in for self-hosted stacks.

Guardrails AI (the library) plus Guardrails Hub offers a validator catalog. Strong for structured output validation.

Invariant Labs is newer but noteworthy for agent-specific trace-based policies.

Output Validation

Pydantic AI enforces Pydantic schemas on LLM output with automatic retry on parse failure. Cleanest API in the space. Instructor offers similar functionality with broader model support. TypeChat is Microsoft's TypeScript-first approach. OpenAI's native Structured Outputs and Anthropic's JSON mode reduce the need for these wrappers for simple cases but do not replace validation logic.

Protected Material

Azure Content Safety's Protected Material Detection flags verbatim copyrighted text in model output: song lyrics, news articles, recipe collections. Bedrock Guardrails includes similar capabilities. This is now a compliance checkbox rather than a differentiator.

Cost Management

Model Routing

RouteLLM from Berkeley remains the open-source reference. Train a binary classifier to route between a strong and weak model. Claims 85 percent cost reduction at 95 percent GPT-4 quality on benchmarks.

AWS Bedrock Intelligent Prompt Routing, in GA since 2025, routes between same-family models like Claude Haiku versus Sonnet with target latency and cost constraints.

OpenAI tier selection through model aliases and reasoning effort controls.

Foundry model-router is a first-party model listed in the Azure catalog that selects among Azure OpenAI models automatically.

Pattern to adopt: route small model for easy tasks, large model for hard tasks, and make the routing decision observable. The savings are real. The complexity cost is also real. Add routing when you have volume, not before.

Prompt Caching Economics, Worked

With Opus 4.6, a 100 thousand token system prompt costs 500 dollars on first call for cache write and 50 dollars on each subsequent hit at the cache rate. Ten times cheaper. Agents with stable instructions that run hourly realize near-total savings after the first call.

Sonnet 4.6 is even steeper. 300 dollars write. 30 dollars per hit.

For a 50-turn agent loop reusing the same tools and system, prompt caching typically cuts total input cost by 70 to 90 percent. This is the single most valuable cost optimization in the stack. If you are not caching, start today.

Semantic Caching

GPTCache and derivatives like Upstash Semantic Cache and Helicone work well for deterministic-query workloads (FAQ bots, documentation search) and poorly for agentic workloads (too much per-request state). Hit rates of 20 to 40 percent on FAQ patterns. Single digits on agent loops. The honest answer is that semantic caching is a niche tool. Useful where it fits. Not a general solution.

Budget Controls

Foundry, Bedrock, and OpenAI Enterprise all ship per-project spend caps, per-user rate limits, and alerts. LangSmith, Helicone, Braintrust, and OpenMeter provide cross-provider observability. Kill switches through feature flags (LaunchDarkly, Statsig) are the norm for staged rollouts. If you do not have a kill switch for agents, you are one prompt injection from a five-figure weekend bill.

Observability For Cost

Every serious platform now emits OpenTelemetry with LLM-specific semantic conventions. Tokens in, tokens out, cache hit, cache miss, model, tool calls. Phoenix, Arize, Langfuse, and Helicone standardized on this. If your platform does not emit OTel LLM spans in 2026, that is a red flag.

Identity And Auth For Agents

The Microsoft Pattern Is The Reference

Service-managed credentials plus on-behalf-of is now standard in Foundry Agent Service. The agent gets its own identity. When a user invokes the agent, the identity flows through OBO so downstream resources see the user's permissions, not the agent's. No shared secrets. Every call auditable. This is the right default for enterprise.

Entra Agent Identity

Entra now treats agents as first-class directory principals with their own object IDs, conditional access, RBAC, and lifecycle. Foundry publishes agents to the Entra Agent Registry for discoverability across Teams and Microsoft 365 Copilot. Each agent can have a dedicated Entra identity enabling secure scoped access to resources and APIs without sharing credentials. This changes your audit model fundamentally. Every action traces to the agent, the user who invoked it, the tool called, and the data accessed.

Google And AWS Equivalents

Vertex AI Agent Builder uses Google Cloud service accounts plus IAM. No dedicated agent principal type yet. Agentspace layers user-delegated access on top. The primitives exist. The UX around agents as a directory object is less mature than Entra's.

Bedrock AgentCore introduced agent IAM roles with session-scoped credentials and short-term credential issuance through STS. Similar functional scope to Entra. Less directory integration.

Compliance Implications

SOC 2, ISO 27001, HIPAA auditors are starting to expect this in 2026. You have a directory principal that performs actions. Every action is traceable. Without agent identities, you have a service account doing things on behalf of humans, which is the audit pattern nobody wants to defend.

The CTO-level decision: do not let agents run on shared service accounts past pilot. Before you ship to production, every agent has its own identity with scoped permissions.

Planning Patterns

ReAct Is Still The Default

ReAct, the Reason-then-Act pattern from Yao et al. in 2022, is still the honest baseline in production. Critique is well known. Verbose, doubling token cost for thinking then acting. Brittle on complex multi-step tasks. The thought step can hallucinate plans that contradict the action taken.

Most frameworks use ReAct-shaped loops by default, often with model-native reasoning (extended thinking in Claude, reasoning effort in GPT-5.x) replacing the visible thought channel. The result is cleaner output without the "Thought:" lines while keeping the planning capability underneath.

Tree Of Thoughts

Use when you can cheaply evaluate partial states. Puzzles, math, code that either compiles or does not. Rarely used in production agent loops because the branching cost dominates and most production tasks lack a cheap evaluator. Useful inside specialized sub-tasks like SQL query generation or theorem proving. Not a general-purpose pattern.

Reflexion

Store a short self-critique from each failed attempt and re-inject on retry. Reliable, cheap, broadly applicable. Most modern agent frameworks include some form of this, explicit or folded into extended thinking.

Task Decomposition

Production-standard for long workflows. A planner agent emits a plan. Specialist agents execute nodes. A critic agent verifies and triggers replanning. LangGraph, AutoGen, CrewAI, and Foundry's workflow agents all support this.

Anthropic's recent "advisor strategy" is a lightweight variant. The main agent consults a stronger advisor model for hard sub-decisions without spending full Opus tokens on every turn. I like this pattern. It matches how real teams work. The junior engineer does most of the work and escalates to the senior when they need to.

Plan Critique

The critic-actor split works. The common footgun is the critic using the same model as the actor and agreeing with whatever the actor produced. Self-consistency bias. Mitigation: use a different model family for the critic (Claude critic for GPT actor, or the reverse), or a smaller model with explicit rubrics.

What Is Actually Used In Production

ReAct with native reasoning models (GPT-5.x, Claude 4.6, Gemini 3.x with thinking) covers about 80 percent of simple to moderate agents.

Hierarchical planner plus specialist executors is the standard for anything multi-hour or multi-domain.

Reflexion-style retry loops are universal, usually embedded in framework retry policies.

Tree of Thoughts, Chain of Thought with Self-Consistency, Algorithm of Thoughts, and more exotic research patterns are confined to specialized sub-tasks or absent.

The research novelty to production ratio is clear. ReAct and Reflexion have made it. Tree of Thoughts is partial. Most of the novel planning papers from 2023 to 2025 have not. Do not let your architecture lead with a research paper.

Self-Improving Agents

The honest status. Most self-improvement patterns are still research. A few are landing in production. This is the area where the gap between the papers and the frameworks is widest. If you are hearing a pitch about agents that get better over time without human involvement, check the evidence before you believe it.

The Three Axes Of Self-Improvement

An agent can improve on three axes. Its knowledge, what it remembers. Its instructions, how it behaves. Its skills, what it can do. Each axis has a different maturity curve in April 2026.

Knowledge improvement is largely solved. Consolidation and reflection already do this. Memory grows, gets compacted, stays useful. Your agent knows more about each user after every session when the memory pattern is right.

Instruction improvement is the prompt optimization problem. DSPy, LangMem's prompt-optimization primitives, and several research frameworks iteratively refine system prompts based on observed failures. Honest take. These work on narrow benchmarks and rarely survive the messiness of production traffic without human oversight. For most teams, prompt updates are a human job informed by production traces. Automated prompt optimization is worth a weekend experiment. Do not bet your architecture on it yet.

Skill improvement is where the interesting work is happening, and where the gap between research and production is the most interesting to watch.

Trajectory Distillation And Reinforcement Fine-Tuning

The pattern. Take successful trajectories from production, grade them, and fine-tune the model on the traces. The agent gets measurably better at your specific workflow without anyone writing a new prompt.

OpenAI Reinforcement Fine-Tuning is in production for o-series and GPT-5 models with a grader-based workflow. Submit a dataset of prompts plus a grader function (code or LLM-based). OpenAI fine-tunes the model against the grader signal. Clients running RFT on a narrow workflow see 10 to 20 point gains over the base model on the target task. Not a toy.

Anthropic does not offer equivalent public fine-tuning today. Claude fine-tuning is available through AWS Bedrock but is narrower than the OpenAI path. The tradeoff is clear. If your task lives on OpenAI, RFT is a real option. If it lives on Claude or Gemini, the pattern is weaker for you in 2026 and you will likely wait for the equivalents to mature.

OpenPipe's ART library (Agent Reinforcement Trainer) is the open-source direction for teams that want to run the pattern on open-weights models. Pairs with vLLM for inference and a local eval harness.

The critical caveat. Trajectory distillation concentrates the successes you already have. It does not teach the agent what it never saw. A workflow that worked 60 percent of the time might hit 80 after distillation. It will not hit 95 without genuinely new data or a better base model. Plan accordingly, and never position distillation internally as "the model will figure out the remaining failure modes on its own."

Skill Synthesis From Traces

Voyager from Wang et al. in 2023 proposed an agent that writes its own skills. A curriculum generator proposes tasks. An execution agent attempts them. A skill librarian saves successful solutions as named, reusable code. Over time the library grows and the agent composes skills rather than solving from scratch.

In production, automatic skill synthesis is rare. The closest shipping pattern is human-in-the-loop skill authoring. A developer reviews production traces, identifies a recurring pattern worth encoding, and writes a skill. The Anthropic Skills open standard formalizes this. The folder is portable. The review gate is manual.

The interesting research-to-production gap. The trust boundary on auto-generated skills has not been solved. A skill is code the agent can run. An auto-generated skill is code the agent wrote and chose to run. You cannot let that into production without review. And the automation wins disappear once you add the review step. For now, treat skills as a human-authored abstraction and use traces as input to your authoring queue, not as automatic training signal.

Metacognitive Loops Beyond Reflexion

The active research directions. Self-refine, where the agent critiques and revises its own output before committing. Critic models, a separate smaller model trained specifically to grade the primary agent's work. Constitutional-style self-critique, where the agent checks its output against a list of explicit rules before responding.

Production status, honestly. Self-refine is a prompt pattern most teams have tried. It helps for writing and code. Mixed results on agent actions. Critic models are showing up in commercial products (Patronus AI, Braintrust online scoring, Lakera Guard, Azure Content Safety's groundedness models). These are typically small fine-tuned models scoring outputs against rubrics, and they are the most production-ready piece of the metacognition story.

Constitutional AI as a technique is internal to Anthropic's training process for Claude. As a user-facing pattern, the closest equivalents are guardrails frameworks with rule-based output validation. Functionally similar, mechanically different.

Eval-To-Training Pipelines

The pattern that is actually mature. Use production eval traces as training or prompt-tuning data. LangSmith and Braintrust both support promoting production examples to datasets that feed back into offline experiments. Mastra has similar hooks. Foundry Evaluations integrates with Agent Service runs.

The workflow is familiar. Production agent runs. Eval catches a regression or a low-score output. That example becomes a test case. The team revises the prompt, or fine-tunes the model, or updates a skill. The fix goes back to production. Cycle time is days to weeks. Humans in the loop at every step.

This is not fully self-improving. It is the human-in-the-loop version. It is also the only version I have seen work reliably at production scale. If you want an agent that gets better over time, build this pipeline before you chase anything more exotic. Three to five percent of engineering time on ingesting production traces back into evals and training data is the right range for teams that care about compound improvement.

The CTO Take

Self-improving agents as marketed do not exist yet. Self-improving agent systems, where humans close the loop on evals and distill improvements back into prompts or fine-tunes, are real and shipping now. Budget for the pipeline, not the magic.

The frontier will mature. Some of these patterns will move into the non-negotiable layer over the next year. I expect decay-aware memory, sleep-time compute, and eval-to-training pipelines to be on the CTO shortlist by April 2027. I do not expect auto-skill synthesis or fully automated prompt optimization to reach production status by then. Invest accordingly.

The CTO Shortlist

If you are deciding what to build into your architecture today, this is the non-negotiable shortlist.

Prompt caching on every stable prefix. Ten times savings on cached input. Unambiguous win.

Hybrid retrieval plus contextual embeddings plus rerank. 49 to 67 percent fewer retrieval failures. Pays for itself with one avoided human escalation.

Background memory consolidation. If your agents run more than 20 turns, you need this. Pick Mastra Observational Memory, LangMem, or Anthropic's memory tool based on your stack.

Skills or folder-based capability packaging. An open standard now. Invest in authoring infrastructure, not bespoke RAG for skills.

Agent identities with on-behalf-of. Your auditors will ask in 2026. Be ahead of them.

Eval in CI. Online scoring in production. Non-negotiable for anything customer facing.

Layered prompt injection defense. Spotlighting plus structured outputs plus tool least-privilege plus an XPIA classifier. No single technique is enough.

Get these seven right and your framework choice matters less than any of them. Get them wrong and no framework saves you.

The Frontier Memory Patterns and the Self-Improving Agents section earlier in this piece are the next layer. Worth tracking, worth evaluating, not yet the universal non-negotiable the seven above are. I expect two or three of them to move onto this shortlist in the next twelve months. The teams that build them early will have a quiet advantage. The teams that wait will get them for free when the frameworks catch up.

What Is Coming In Part 4

This piece went deep on the implementation patterns. The next piece closes the series.

Part 4 is where I land. The recommendation I make to most of my enterprise clients and when I do not make it. Three client war stories across healthcare, financial services, and government. Where agentic AI is heading over the next twelve months. And a thirty minute decision framework your team can run this week.

Navneet Singh is the founder and CEO of Webority Technologies. He builds enterprise AI systems for clients in healthcare, financial services, and government, and writes weekly about what actually works.

The First Principles

Discussion about this post

Ready for more?