The Agentic Engineering Field Guide, Part 4: Where I Land, and Why
The recommendation I make to most of my enterprise clients, the cases where I do not make it, three war stories, where this is all heading, and a thirty-minute decision framework you can run with your team this week.
Where I Land
For my typical client, I recommend Microsoft Agent Framework on Microsoft Foundry.
The typical client is an enterprise shop. Regulated industry. .NET heavy. Azure already in production. Approval gates required. Audit trails required. Data residency matters. The buyer is a CIO or CTO whose procurement team has a standing Microsoft agreement. That describes most of my healthcare, financial services, and government work. If that also describes your shop, my recommendation is to start there.
This is not an abstract preference. It is the same decision I make with every qualifying client after we have walked through the six questions from Part 1.
State and durability. The workflow graph checkpoints at superstep boundaries. Cosmos DB checkpoint storage shipped in MAF Python 1.0.1. Durable Task integration through the Azure Functions extension handles the day-long and week-long pauses that happen when a human approval takes its time. This is the question that kills most frameworks in production. MAF plus Foundry answers it without ceremony.
Approval gates. Durable pause state that survives application restarts is first class. Not bolted on. Not a hack you build with external queues. The framework holds the pause. The pattern composes with the rest of your workflow.
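What that pattern looks like, stripped of any framework: checkpoint at every step boundary, and let a gate write its pause into the same durable state. Below is a framework-neutral sketch of the mechanic, not MAF's API; every name in it is illustrative.

```python
# Framework-neutral sketch of checkpoint-plus-pause. Every name here is
# illustrative; MAF's real API differs, but the mechanic is the same.
import json
import pathlib

CHECKPOINT = pathlib.Path("workflow_checkpoint.json")

def run(steps, state):
    """Run steps in order, persisting state at every boundary.

    A step returns "pause" when it is waiting on a human approval.
    The process can then exit entirely; resume() picks up days later."""
    i = state.get("_next", 0)
    while i < len(steps):
        if steps[i](state) == "pause":
            state["_next"] = i                        # re-run the gate on resume
            CHECKPOINT.write_text(json.dumps(state))
            return "paused"
        i += 1
        state["_next"] = i
        CHECKPOINT.write_text(json.dumps(state))      # superstep boundary
    return "done"

def resume(steps):
    return run(steps, json.loads(CHECKPOINT.read_text()))

def approval_gate(state):
    # The gate is just a step that pauses until approval lands in state.
    return None if state.get("approved") else "pause"
```

The point is that the pause is data, not a blocked thread: an approval that lands two days later just flips a field in the checkpoint, and resume() continues from the gate rather than from step one.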
Observability. OpenTelemetry everywhere. App Insights out of the box. The trace surface maps cleanly to every other Azure service your SRE team already monitors. If your observability story is already Azure Monitor, you do not introduce a new tool.
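In practice that means your agent steps emit ordinary OTel spans. A minimal sketch with the standard OpenTelemetry Python API; the span and attribute names are mine, not a fixed convention, and in an Azure shop you would wire the exporter through the azure-monitor-opentelemetry distro.

```python
# Tracing one agent step with the standard OpenTelemetry Python API.
# Span and attribute names are illustrative, not a fixed convention.
from opentelemetry import trace

tracer = trace.get_tracer("contract-analysis-workflow")

def _call_model(text: str) -> list[str]:
    # Placeholder: the real implementation calls your model endpoint.
    return ["definitions", "payment-terms", "termination"]

def classify_sections(document_id: str, text: str) -> list[str]:
    with tracer.start_as_current_span("agent.classify_sections") as span:
        span.set_attribute("document.id", document_id)
        sections = _call_model(text)
        span.set_attribute("sections.count", len(sections))
        return sections
```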
Identity. Every agent gets a dedicated Microsoft Entra identity with RBAC scoped to the resources it needs. Entra Agent Registry catalogs the deployed agents. This is the first production-grade identity story I have seen in an agent platform. For regulated industries, this alone is worth the choice.
Compliance. Azure carries the broadest compliance portfolio in the cloud market. FedRAMP High via Azure Government. HIPAA. ISO 27001, 27017, 27018, 27701. HITRUST. SOC. PCI. EU Data Boundary. Microsoft Cloud for Sovereignty for the EU sovereign stack. 21Vianet for China. India RBI, IRDAI, MeitY. When procurement asks for the compliance matrix, it already exists.
Model choice. Foundry is model agnostic despite Microsoft's commercial incentives. OpenAI, Anthropic, Meta, Mistral, Cohere, DeepSeek, NVIDIA, Microsoft's own Phi family, plus 1,500+ models through Hugging Face compute. You are not forced into any one model provider. That is the right call for enterprise buyers who do not want to bet on one model lab.
On-prem and air-gapped. Foundry Local is the quiet differentiator nobody else has. C#, JavaScript, Rust, Python SDKs. ONNX Runtime under the hood. OpenAI-compatible API. No Azure subscription required. Same SDK patterns as cloud Foundry. For the healthcare customer who needs to keep data on-premises, or the government customer who needs air-gapped inference, you get a Microsoft-shipped runtime that you can deploy inside the compliance boundary. Neither Vertex nor Bedrock has a first-party on-device counterpart with matching SDK ergonomics.
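Because the surface is OpenAI-compatible, the client code is the code your team already writes. A sketch assuming a local endpoint; the port and model alias below are placeholders from a hypothetical install, not fixed values.

```python
# Talking to a local OpenAI-compatible endpoint with the openai SDK.
# The base_url port and model alias are placeholders for your local install.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5273/v1",  # placeholder: your local endpoint
    api_key="not-needed-for-local",       # local endpoints typically ignore this
)

response = client.chat.completions.create(
    model="phi-4-mini",  # placeholder: whatever model alias you pulled locally
    messages=[{"role": "user", "content": "Summarize this clause: ..."}],
)
print(response.choices[0].message.content)
```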
Framework-agnostic runtime. The quiet competitive move: Foundry Hosted Agents accept LangGraph or arbitrary containerized code, not just MAF. Foundry positions itself as a control plane, not a framework lock-in. So even if your team decides tomorrow that LangGraph is the better fit for a specific workflow, you can deploy it inside Foundry and keep the enterprise spine. That is the option I want on my side when I am betting on a platform.
The procurement story matters more than the feature story sometimes. When your buyer already has a Microsoft Enterprise Agreement, adding Foundry is a line item, not a new vendor evaluation. That shortens the deal cycle by months. The technical merits make the case easy. The procurement reality closes the deal.
When I Do Not Pick It
MAF plus Foundry is my default for enterprise, regulated, Microsoft-shop clients. It is not the right call for every shop. Here is where I go elsewhere.
Python-pure shops with no .NET footprint. If your engineering team has zero .NET expertise, no appetite to learn it, and your existing services are all Python, then LangGraph plus LangSmith is the stronger technical fit. The ecosystem is larger. The graph semantics are more mature. The checkpointer ecosystem is richer (Postgres, Redis, SQLite, in-memory). LangSmith is the most battle-tested tracing product in the space. You get most of the enterprise capability without the .NET surface area you will never use.
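For a sense of the ergonomics, here is a minimal graph compiled with a checkpointer. MemorySaver stands in for the Postgres or SQLite savers, which expose the same interface; the node bodies are stubbed.

```python
# A minimal LangGraph graph compiled with a checkpointer. MemorySaver is
# for illustration; the Postgres/SQLite savers expose the same interface
# for durable production state.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    trades: list
    breaks: list

def load_trades(state: State) -> dict:
    return {"trades": ["t1", "t2"]}   # placeholder for the real loader

def find_breaks(state: State) -> dict:
    return {"breaks": []}             # placeholder for the real comparison

builder = StateGraph(State)
builder.add_node("load_trades", load_trades)
builder.add_node("find_breaks", find_breaks)
builder.add_edge(START, "load_trades")
builder.add_edge("load_trades", "find_breaks")
builder.add_edge("find_breaks", END)

graph = builder.compile(checkpointer=MemorySaver())

# Every invocation under a thread_id is checkpointed and resumable.
result = graph.invoke({"trades": [], "breaks": []},
                      config={"configurable": {"thread_id": "eod-2026-04-01"}})
```

With a durable saver behind that same interface, a process restart resumes from the last checkpoint instead of replaying the run from the start.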
GCP-native shops. If your organization runs on GCP, BigQuery is your data gravity, and Gemini is your preferred model family, then Google ADK plus Vertex AI Agent Engine is cleaner than bolting Azure services onto a Google stack. The cross-cloud integration tax is real. Pick the cloud your infrastructure already lives in.
AWS-native shops with strict cost or latency SLAs. AWS Strands plus Bedrock Agents is the path of least resistance on AWS. Bedrock's service tier controls (Priority, Standard, Flex) that Strands exposes are unique. If your workload has genuine SLA sensitivity on inference cost or latency, this is real. If you are already spending heavily on Bedrock, the economics of staying in that ecosystem make sense.
TypeScript-first product teams. If you ship a web product on Next.js or a Node backend and your engineering team writes TypeScript end to end, Mastra is the serious option. The MAF TypeScript story is thin. LangGraph has TypeScript bindings but Python is the first-class surface. Mastra is built for the Node world and has matured enough to be credible.
Observability-critical early-stage startups. If your team is small, observability is the feature that saves you, and LangSmith is going to be your debugging lifeline from day one, then LangGraph plus LangSmith gets you there faster than any other combination. The coupling is tight and the ergonomics are designed around it. You can migrate to a more enterprise stack later when your buyer changes shape.
Pure prototyping speed. For a two-week proof of concept, CrewAI is hard to beat. The "team of agents" metaphor maps onto a slide deck and a demo faster than any other framework. This does not mean I would ship it to production. It means I would not fight the client who wants to prove out the idea in CrewAI first, then re-implement in a more durable stack when the scope is clear.
Coding or developer tools. If you are building a coding agent, a CLI tool, or anything that inherits the shape of Claude Code, the Anthropic Claude Agent SDK gives you the mature tool loop (Read, Write, Edit, Bash, etc.) without rebuilding it. For that narrow use case, it is the fastest path.
Low-code citizen developer agents. If the business user is the builder and the agent lives inside Microsoft 365 or Teams workflows, Copilot Studio is a different product with a different audience. The MAF plus Foundry story is pro-code and developer-centric. Do not force the wrong tool.
Agentforce or ServiceNow as a buy-vs-build choice. If the agent's job is to answer a well-scoped question inside Salesforce or ServiceNow data, and the SaaS vendor has already shipped that agent, buy it. Build only what gives you differentiation. I have talked several clients out of building their own customer service agent because Agentforce already solves 80 percent of their problem at a fraction of the cost of building it well.
Three War Stories
These are archetypal patterns from client work, generalized to protect confidentiality. Each reflects a real build, a real decision point, and something I would do differently with today's tools.
Healthcare contract analysis
A hospital group needed to process executed vendor contracts. Extract key terms. Flag clauses that deviated from their standard template. Surface renewal dates and termination windows. Generate a summary for the procurement team. Twelve documents per week in good weeks. Forty in bad ones.
We built this on a twelve-step graph. The main steps: document ingestion, OCR cleanup, section classification, entity extraction, clause comparison against a library of approved standards, risk scoring, flagging, summarization, storage, and notification. Each step was an executor. On the happy path, the graph ran in roughly four minutes end to end.
On the fifth deployment, step nine started failing. The internal API it called had been migrated. We did not know for three hours because the only signal was customer-facing summaries that looked a little worse than usual. When we looked at the checkpoint state, we could see the exact input that was crashing the step. We fixed the integration, resumed every queued workflow from its checkpoint, and the customer never knew there had been a partial outage.
That same workflow, a year earlier on a different framework, would have lost every in-flight contract to a full restart. We would have burned the token budget for the first eight steps on every failed run until we shipped the fix.
The lesson: pick the framework that treats state as a first-class citizen, not as a debugging convenience.
Financial services reconciliation
A mid-market finance firm wanted to reconcile end-of-day trade confirmations across three upstream systems. The inputs disagreed often enough to be a real problem. A human team was spending two hours every evening resolving the discrepancies by hand.
The first architecture we considered was a single agent with many tools. Load this, load that, compare, output the differences. It would have worked for the first month. The tool count for the three upstream systems plus the output destinations was already fourteen. I knew where this ended.
We built it as a hierarchical multi-agent workflow instead. A manager agent owned the reconciliation task. Three specialist agents each owned one upstream system. A fourth specialist owned output and escalation. The specialists did not know about each other. The manager composed their outputs and decided when a discrepancy needed human review versus when the rules made the resolution obvious.
Six months in, a fourth upstream system came into scope. Adding it was one new specialist agent and one new edge in the graph. The other specialists did not change. That is the compound value of composition done right. Monolithic agents do not have this property.
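The shape of that composition, as a framework-neutral sketch with the model calls stubbed out. The names are illustrative; the point is where the fourth system lands.

```python
# The hierarchical pattern, framework-neutral and stubbed. Specialists
# only know their own upstream system; the manager owns composition.
def specialist(system_name: str):
    def fetch_confirmations(trade_date: str) -> list[dict]:
        # Placeholder: call one upstream system, normalize its records.
        return []
    return fetch_confirmations

def compare(views: dict) -> list[dict]:
    # Placeholder: diff the normalized views field by field.
    return []

def triage(breaks: list[dict]) -> tuple[list, list]:
    # Placeholder: deterministic rules first, escalate the remainder.
    return [], []

SPECIALISTS = {
    "custodian": specialist("custodian"),
    "oms": specialist("oms"),
    "clearing": specialist("clearing"),
    # Adding a fourth upstream system is one new entry here,
    # not a change to any existing specialist.
}

def manager(trade_date: str) -> dict:
    views = {name: fetch(trade_date) for name, fetch in SPECIALISTS.items()}
    obvious, needs_human = triage(compare(views))
    return {"auto_resolved": obvious, "escalated": needs_human}
```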
What I would do differently: we under-invested in the eval harness at the start. We shipped with regression tests at the agent level but not at the workflow level. Two months in, the model provider bumped a minor version and the manager's routing accuracy dropped a few points. We caught it in production through a spike in manual escalations. If I were building this today, I would have workflow-level evals in place from day one and alerts on routing accuracy drift.
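The fix is not exotic. A workflow-level routing eval is a table of golden cases run through the real manager, with an accuracy floor; the cases and the threshold below are illustrative.

```python
# A workflow-level routing eval: golden cases through the real manager,
# with an accuracy floor that fails the check on drift.
# The fixtures and the 0.95 threshold are illustrative.
ROUTING_CASES = [
    {"input": "qty mismatch, custodian vs oms", "expected_route": "auto"},
    {"input": "counterparty name conflict",     "expected_route": "human"},
    # ... dozens more golden cases, refreshed from production traces
]

def eval_routing(route_fn, threshold: float = 0.95) -> float:
    correct = sum(
        route_fn(case["input"]) == case["expected_route"]
        for case in ROUTING_CASES
    )
    accuracy = correct / len(ROUTING_CASES)
    assert accuracy >= threshold, (
        f"routing accuracy {accuracy:.2%} below floor {threshold:.0%}"
    )
    return accuracy
```

Wire that into CI and into a scheduled production canary, and a minor model-version bump shows up as a failed check, not a spike in manual escalations.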
Government citizen services
A state government wanted to reduce call volume at a busy services line. First-line triage. Answer simple questions. Route complex ones to human agents. Keep the conversation in the citizen's preferred language. Nothing leaves the sovereign cloud.
This was a case where the platform choice mattered more than the framework choice. The non-negotiables were data residency, audit trails that would survive a regulatory review, and the ability to run inference inside the sovereign boundary. Foundry's compliance posture, Entra Agent Registry for agent identity, and Foundry Local for the on-premises inference scenario were the answer.
The agent uses a hand-off pattern. A triage agent reads the incoming request, detects language, identifies intent. Simple queries (where is my application, how do I pay this fee) are answered directly from the internal knowledge base via Foundry IQ. Complex or sensitive queries route to a human agent through the same platform with the conversation history intact.
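Reduced to control flow, the hand-off is a few lines. Everything below is a placeholder sketch of the routing decision, not the production code: the intent labels, the confidence threshold, and the knowledge-base call are all stand-ins.

```python
# The hand-off pattern as control flow. All names and the 0.8 threshold
# are placeholders, not the production values.
SIMPLE_INTENTS = {"application_status", "fee_payment", "office_hours"}

def detect_language(message: str) -> str:
    return "en"                                 # placeholder

def classify_intent(message: str) -> tuple[str, float]:
    return "application_status", 0.9            # placeholder

def knowledge_base_answer(intent: str, message: str, language: str) -> str:
    return "stub answer from the knowledge base"  # placeholder

def handle_request(message: str) -> dict:
    language = detect_language(message)
    intent, confidence = classify_intent(message)
    if intent in SIMPLE_INTENTS and confidence >= 0.8:
        answer = knowledge_base_answer(intent, message, language)
        return {"route": "agent", "answer": answer}
    # Complex or low-confidence: hand off with the full context intact.
    return {"route": "human",
            "context": {"message": message,
                        "language": language,
                        "intent": intent}}
```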
Observability is where this build paid off. Every agent decision is traced, queryable, and exportable. When the regulator audits, we can show exactly which queries were answered by the agent, which went to humans, and what the conversation looked like. That is the difference between a system you can defend and a system you cannot.
The lesson: for regulated and sovereign scenarios, the platform spine matters more than the framework elegance.
Where This Is All Heading
Twelve-month view. What I expect the landscape to look like by April 2027.
Protocols will matter more than frameworks. MCP is already universal. A2A reached v1.0 in March 2026. By this time next year, the lock-in cost of picking the wrong framework will be lower than it is today because you will be able to swap frameworks while keeping your protocol-level integrations. Bet on frameworks that speak protocols fluently. The specific framework matters less every quarter.
On-premises inference will resurge. The first wave of enterprise AI was cloud-first because that is where the capable models lived. Open-weights models are now capable enough for many production workloads. Foundry Local, vLLM, Ollama, and the NVIDIA AI Enterprise stack make on-device and on-premises deployment credible. For regulated industries and data-sensitive workloads, expect a real shift back to on-premises or sovereign cloud over the next year. The vendors that have shipped the tooling for this (Microsoft with Foundry Local, Amazon Bedrock with Outposts, Google with Distributed Cloud) will benefit.
Multi-agent will hit a limit and retreat. The last eighteen months have been a period of maximalist multi-agent design. Five specialists. Seven specialists. Ten specialists. I am already seeing teams retreat from this. The marginal specialist often adds more latency and failure modes than value. Expect the consensus to settle around three to five specialists for a typical production workflow, with a preference for hierarchy over peer collaboration.
Evaluation becomes the bottleneck. Once everyone has durable state, checkpointing, approval gates, and tracing, the remaining differentiator is whether your evals catch regressions before your customers do. Expect eval tooling to become the hottest segment of the agentic infrastructure market in the next year. Expect acquisition activity. Expect major framework vendors to ship first-party eval platforms (some already are).
Memory becomes a category, not a framework feature. Letta, Zep, and new entrants are making memory a first-class database tier that sits alongside your vector store and your Postgres. Frameworks are good at orchestration. They are mediocre at memory. Decay policies, reflection trees, virtual context paging, and sleep-time consolidation are not things you want to maintain inside your agent framework. Expect memory-as-a-service to win a meaningful slice of the agentic infrastructure market by next April, much as vector databases did in 2023. For CTOs, the question in 2026 is still "what memory primitives does my framework ship." In 2027 it will be "which memory vendor do I pick."
Cognitive architectures become pluggable. Decay-aware retrieval, sleep-time compute, reflection trees, skill libraries, and virtual context paging are bespoke builds today. By next April, expect these to ship as framework-native features or pluggable middleware. Teams that invested in building them from scratch this year will rebase onto the shipping versions. Teams that skipped them will get them for free by upgrading. Either group wins relative to teams that never knew the patterns existed.
Self-improving agent systems ship in the human-in-the-loop form. Fully autonomous self-improvement is not arriving in the next twelve months. Production-grade eval-to-training pipelines, trajectory distillation through RFT or open-weight fine-tuning, and prompt optimization feedback loops are all shipping now and will be standard by mid-2027. The differentiator is whether your team has built the pipeline that feeds production traces back into offline training. Most have not. The teams that do, compound. The teams that do not, ship the same agent they launched with, just older.
Regulatory pressure reshapes the stack. EU AI Act enforcement is already biting. India's DPDP enforcement is arriving. The US is catching up state by state. For regulated industries, the ability to document training data provenance, model versioning, and agent decisions will become a procurement checkbox. Frameworks that cannot answer these questions will get filtered out of enterprise deals regardless of technical merit.
The SaaS agent layer will mature and compete directly with custom builds. Salesforce Agentforce, ServiceNow AI Agents, and the Microsoft 365 Copilot ecosystem are already changing what enterprises build in-house. A year from now, the default recommendation for well-scoped customer service or ITSM agents will be "buy the SaaS layer, build custom only where you differentiate." Custom agent builds will concentrate in the workflows that are genuinely unique to your business.
The model layer will become boring, and that is good. The gap between top labs is narrowing. The gap between top open-weights models and top closed models is narrowing. Pricing is dropping across the board. Prompt caching has changed the economics of long-context workflows. By next April, model selection will feel more like database selection. Important. Reversible. Not the differentiator it was two years ago.
The Thirty-Minute CTO Decision Framework
You have a meeting in thirty minutes with your team. You need to pick a direction. Run this.
Question 1: What is your dominant engineering stack?
.NET → Microsoft Agent Framework plus Foundry.
Python → LangGraph plus LangSmith, or the provider SDK matching your cloud.
TypeScript or Node → Mastra, or LangGraph TypeScript.
Java or Go → Google ADK has the best multi-language story.
Question 2: Which cloud, and is multi-cloud required?
Azure-heavy → Foundry.
AWS-heavy → Bedrock plus Strands.
GCP-heavy → Vertex plus ADK.
Multi-cloud required → open framework plus protocols (MCP, A2A). Bet on interop.
Question 3: What regulatory constraints do you have?
HIPAA, FedRAMP, sovereign cloud, data residency → Microsoft Foundry has the broadest compliance coverage. Bedrock is the close second for GovCloud scenarios.
Question 4: How long is your longest workflow?
Under thirty seconds → any framework works.
Thirty seconds to five minutes → you need checkpointing.
Five minutes to hours → a graph-based framework with durable state is required.
Hours to days → MAF plus Durable Task, LangGraph plus a Postgres checkpointer, or a purpose-built workflow engine (Temporal, Dagster with AI) with an agent layer.
Question 5: Will humans approve before writes?
Yes, and the approval can take hours or days → you need first-class human-in-the-loop with durable pauses. MAF, LangGraph, and Pydantic AI have this. Many others do not.
Question 6: What is your observability maturity?
Already running OpenTelemetry → pick a framework with clean OTel conventions for agents.
No observability yet → add LangSmith, Braintrust, or Foundry Observability as a dependency of the build.
Question 7: What is your buy-versus-build reality?
Is this a workflow that differentiates your business, or is it the same customer service, HR, or ITSM flow every enterprise runs? If the latter, evaluate Agentforce or ServiceNow before you build. Build only where you differentiate.
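If it helps to literally run the checklist in the meeting, here is a toy encoding of the seven questions. It is deliberately lossy and compresses the mappings above; treat the output as a shortlist to prototype against, not a verdict.

```python
# A toy encoding of the seven questions. Deliberately lossy: it yields a
# shortlist to prototype against, not a verdict.
def shortlist(stack: str, cloud: str, regulated: bool,
              workflow_minutes: float, durable_approvals: bool,
              differentiating: bool) -> list[str]:
    if not differentiating:
        return ["buy: Agentforce / ServiceNow / Copilot ecosystem"]
    by_stack_cloud = {
        (".net", "azure"):   ["MAF + Foundry"],
        ("python", "azure"): ["MAF Python + Foundry", "LangGraph + LangSmith"],
        ("python", "aws"):   ["Strands + Bedrock"],
        ("python", "gcp"):   ["ADK + Vertex Agent Engine"],
        ("typescript", "*"): ["Mastra", "LangGraph TypeScript"],
        ("java", "*"):       ["Google ADK"],
    }
    picks = (by_stack_cloud.get((stack, cloud))
             or by_stack_cloud.get((stack, "*"))
             or ["LangGraph + LangSmith"])
    if regulated and not any("Foundry" in p or "Bedrock" in p for p in picks):
        picks.append("re-check Foundry or Bedrock for the compliance spine")
    if workflow_minutes > 5 or durable_approvals:
        picks.append("require durable checkpointing, whatever you pick")
    return picks

print(shortlist(".net", "azure", regulated=True,
                workflow_minutes=60, durable_approvals=True,
                differentiating=True))
```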
A note on frontier patterns. The seven questions above decide your framework. They do not decide whether to invest in decay-aware memory, sleep-time compute, reflection trees, or trajectory distillation. Those are capability-level decisions, not framework-level. Most teams should ship the basics first and reach for frontier patterns when the baseline is solid and the cost of staleness, context overflow, or stagnant skill is visibly hurting the product. If your agents run fewer than twenty turns per session and each session is independent, you probably need none of them. If they run for months with the same user and cannot afford to rebuild from scratch, every frontier pattern starts to pay. Part 3 walks through each one and tells you when it is worth the complexity.
Run these seven questions. In most cases, the answer collapses to one or two frameworks. From there, prototype and ship.
Series Close
This is the end of the first pass. Four parts. The questions, the landscape, the building blocks, and where I land.
The frameworks in this guide will change. Some will fade. Others will emerge. The protocols will mature. The eval tooling will consolidate. The model layer will become boring. That is all fine. The questions in Part 1 are the durable part. The mental model is what you keep.
I will update this guide every quarter. The next revision lands in July. Some parts will change heavily. Part 2 will change the most because the landscape moves fastest. Part 1 will change least because the questions do not age.
If you are building an enterprise agent system this year and you want a second pair of eyes on your architecture, email me or find me on LinkedIn. The clients I work with best are the ones who have walked through the six questions, know where they stand, and want to pressure-test the answer. I read every reply.
We are early. The systems we build this year will shape what enterprise AI looks like for the rest of the decade. Worth getting right.
Navneet Singh is the founder and CEO of Webority Technologies. He builds enterprise AI systems for clients in healthcare, financial services, and government, and writes weekly about what actually works.
