The Agentic Engineering Field Guide, Part 1: How I Evaluate Agent Frameworks
Six questions every production system must answer, the orchestration patterns you will actually use, and the readiness checklist I run with every client.
Why I Am Writing This Series
Agentic AI went from research demos to production systems in eighteen months. Faster than microservices. Faster than containers. Faster than the mobile shift for enterprise. The consequence is that most engineering teams are picking agent frameworks the way people used to pick JavaScript frameworks in 2015. You commit to one. Three months later you realise it does not answer your actual production problems. You rewrite. That rewrite costs six to twelve months. Some of my clients are on their third framework. A few are on their fourth.
This series is the guide I wish I had eighteen months ago. It is not a product comparison. It is the mental model I use when a client asks me which framework to bet on, what questions to ask, what to worry about, and where this is all heading.
One note before we start. I build mostly on the Microsoft stack. My client base is healthcare, financial services, and government, most of them .NET heavy and Azure native. That shapes the view in this series. Where other stacks win, I say so. Where I land on Microsoft Agent Framework plus Microsoft Foundry, the reasoning is in Part 4. Read the whole thing for the balanced view. Read Part 4 for my pick.
Four parts.
Part 1 (this piece) is about the frame. The six questions every production agent system must answer before you touch a framework. The orchestration patterns you will actually use. The production readiness checklist.
Part 2 walks the landscape. Microsoft Foundry plus Agent Framework. Google Vertex plus ADK. AWS Bedrock plus Strands. Anthropic, OpenAI, LangGraph, CrewAI, Pydantic AI, Mastra. Protocols. Plus the enterprise SaaS agent layer (Agentforce, ServiceNow) because your buyers will ask about it.
Part 3 goes into the building blocks. Memory. RAG. Tools. Context engineering. Safety. Cost. Evaluation. Identity. The patterns every production build hits regardless of framework.
Part 4 is where I land and why. The explicit recommendation for my typical client. When I do not pick it. Three war stories across different stacks. Where this is all heading. And a thirty-minute CTO decision framework you can run with your team this week.
If you bookmark this, I will update it every quarter. The frameworks churn. The questions do not.
The Six Questions Every Production Agent System Must Answer
Start with the questions. Frameworks are answers to questions. If you pick an answer before you have written down the question, you will pick wrong. I ask every client these six questions before we pick a stack. In that order.
1. How do you manage state across long-running workflows?
Agents are not stateless. A real workflow runs for minutes. Sometimes hours. Sometimes days. State is the conversation history, the intermediate tool outputs, the decisions already made, the files already generated, the approval statuses from earlier in the flow. If your state lives in memory, your workflow dies when the process restarts. If your state lives in a flat key-value store, you cannot branch, merge, or rewind.
The production answer is durable, typed, serializable state that survives process restarts. Ideally, state you can inspect with a debugger and modify if needed.
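As a sketch of what that can look like, assuming a JSON-serializable store and illustrative field names (nothing framework-specific):

```python
import json
from dataclasses import dataclass, field, asdict

SCHEMA_VERSION = 2  # bump whenever the shape of WorkflowState changes

@dataclass
class WorkflowState:
    """Typed, serializable state that survives process restarts."""
    run_id: str
    step: int = 0
    decisions: list = field(default_factory=list)
    approvals: dict = field(default_factory=dict)
    schema_version: int = SCHEMA_VERSION

def save(state: WorkflowState) -> str:
    # In production this lands in a database or blob store, not a string.
    return json.dumps(asdict(state))

def load(raw: str) -> WorkflowState:
    data = json.loads(raw)
    if data.get("schema_version", 1) < SCHEMA_VERSION:
        # Migrate old payloads forward instead of crashing mid-deploy.
        data.setdefault("approvals", {})
        data["schema_version"] = SCHEMA_VERSION
    return WorkflowState(**data)
```

The version field is the part teams skip and regret: without it, the first deployment that changes the state shape strands every in-flight workflow.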
Failure mode when you get this wrong: you cannot recover from a failure halfway through a long workflow. You cannot reproduce a decision three days later when a user disputes it. You cannot run your agent on Azure Functions or AWS Lambda because your state assumes a long-lived process.
2. How do you recover from partial failures?
Real workflows fail. APIs time out. LLMs return malformed JSON. Networks partition. Rate limits hit. A production agent system does not restart from the top when step nine of twelve fails. It resumes from the last known good state.
This is where the checkpointing story matters. Every framework claims to handle failure. Not all of them actually do. Ask to see the code that runs when a step fails. If there is no checkpointing, there is no recovery. If checkpoints serialize only the happy path, they will not survive a malformed intermediate output.
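A minimal sketch of the resume-from-checkpoint idea, with an in-memory dict standing in for a durable store. All names here are illustrative, not any framework's API:

```python
# Persist state after every successful step; on retry, resume from the last
# checkpoint instead of restarting from step one.
CHECKPOINTS: dict = {}  # stand-in for a durable store (database, blob)

def run(run_id: str, steps: list) -> dict:
    state = CHECKPOINTS.get(run_id, {"next_step": 0, "outputs": []})
    for i in range(state["next_step"], len(steps)):
        result = steps[i](state["outputs"])  # may raise: timeout, bad JSON...
        state["outputs"].append(result)
        state["next_step"] = i + 1
        CHECKPOINTS[run_id] = state          # checkpoint AFTER the step succeeds
    return state
```

The ordering is the whole trick: the checkpoint is written only after a step succeeds, so a failure at step nine leaves steps one through eight paid for exactly once.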
Failure mode when you get this wrong: your costs scale with your failure rate. You burn tokens re-running the first eight steps of every failed workflow. Results diverge on retry because temperature is not zero. Customer-facing outputs become inconsistent for identical inputs. The client services team ends up explaining to the lawyer why today's contract summary differs from yesterday's for the same document.
3. Where does a human approve before writes happen?
The single most important architectural decision in enterprise agentic AI. Where is the gate between "the agent has decided what to do" and "the agent has done it"?
If you have no gate, you cannot ship to a regulated client. You cannot ship to a client whose legal team has seen an agent make a mistake. You cannot ship to most enterprises at all.
The gate has to be durable. Nobody approves a contract in the same millisecond the agent generated it. The approval arrives three hours later, from a different person, on a different machine, possibly after the original workflow process has died. The framework has to hold that pause. In my experience, this is the feature where most frameworks either shine or quietly fail. The difference is whether the pause is an in-memory await (dies on restart) or a durable suspend (survives weeks if needed).
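One way to sketch the durable-suspend shape, with an in-memory dict standing in for a durable approvals table. Names are illustrative:

```python
import time

PENDING: dict = {}  # stand-in for a durable table keyed by approval id

def request_approval(run_id: str, payload: dict) -> str:
    approval_id = f"appr-{run_id}"
    PENDING[approval_id] = {
        "run_id": run_id,
        "payload": payload,
        "status": "pending",
        "requested_at": time.time(),
    }
    # Here you would notify a human (email, Slack, queue) and then RETURN,
    # releasing the process. No in-memory await that dies on restart.
    return approval_id

def record_decision(approval_id: str, approved: bool) -> dict:
    record = PENDING[approval_id]
    record["status"] = "approved" if approved else "rejected"
    # In a real system this event re-hydrates the suspended workflow
    # from its checkpoint and continues past the gate.
    return record
```

The important property is that nothing between `request_approval` and `record_decision` needs the original process to still be alive.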
Failure mode when you get this wrong: you ship write operations before human review. Either nothing gets approved because the workflow dies while waiting, or the agent writes something catastrophic and you spend a quarter explaining it to compliance.
4. How do you observe, debug, and audit agent decisions?
Every agent decision will be questioned eventually. By a user. By a compliance officer. By a regulator. By your own engineering team during the post-mortem. If you cannot reconstruct why an agent chose a particular path, you have an audit problem and a debugging problem at the same time.
What you need: structured traces with inputs, outputs, tool calls, prompts, and timing. Exportable. Queryable. Tied to workflow runs, not just LLM calls. Retention that meets your compliance requirements. Ideally, the ability to replay a workflow from a captured trace.
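A rough sketch of what a run-scoped trace record can look like. Field names are illustrative; in practice you would map these onto OpenTelemetry span attributes and an exporter rather than a list:

```python
import uuid

TRACES: list = []  # stand-in for a real exporter / trace backend

def trace_step(run_id: str, step: str, inputs, outputs, tool_calls=None,
               started: float = 0.0, ended: float = 0.0) -> dict:
    record = {
        "trace_id": str(uuid.uuid4()),
        "run_id": run_id,          # ties the record to the workflow run
        "step": step,
        "inputs": inputs,
        "outputs": outputs,
        "tool_calls": tool_calls or [],
        "duration_ms": round((ended - started) * 1000, 1),
    }
    TRACES.append(record)
    return record

def query(run_id: str) -> list:
    # "Queryable by workflow run" is the requirement; here it is a filter.
    return [t for t in TRACES if t["run_id"] == run_id]
```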
This is the area where most frameworks are weakest. The base tracing is usually OpenTelemetry compatible, which is good. The production layer on top, where you actually query and alert and audit, almost always requires a second tool. LangSmith. Braintrust. Langfuse. Foundry Observability. Pydantic Logfire. Budget for one of these from day one.
Failure mode when you get this wrong: a client asks why the agent made a specific decision. You have no answer. The compliance officer asks for an audit trail for the last ninety days. You have log files with free text that cannot be queried. Your engineering team guesses at why a production regression happened.
5. How do you test systems with non-deterministic components?
Agent systems are non-deterministic. Traditional unit tests do not work cleanly. You need evaluations, not just assertions. Rubric-based scoring. Regression suites that compare behaviour across model versions. A way to simulate failure modes at specific workflow steps. An LLM-as-judge harness for open-ended outputs. Red team tests for prompt injection and jailbreak attempts.
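A toy version of rubric-based scoring, with a deterministic stand-in for the judge. In practice the judge is usually an LLM call; all names here are illustrative:

```python
def judge(output: str, rubric: dict) -> float:
    """Score one output against simple rubric checks (0.0 to 1.0)."""
    score = 0.0
    if rubric.get("must_mention") and all(
            term.lower() in output.lower() for term in rubric["must_mention"]):
        score += 0.5
    if rubric.get("max_words") and len(output.split()) <= rubric["max_words"]:
        score += 0.5
    return score

def run_suite(cases: list, threshold: float = 0.8) -> dict:
    # Fail the deployment when the mean score drifts below baseline.
    scores = [judge(case["output"], case["rubric"]) for case in cases]
    mean = sum(scores) / len(scores)
    return {"mean": mean, "passed": mean >= threshold, "scores": scores}
```

Run it before every deployment and on every model version bump; the threshold is what turns "the agent feels worse" into a blocked release.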
This is where the maturity of the ecosystem matters. Evaluation frameworks exist. Most are young. The integration between agent frameworks and eval tooling is uneven. I have seen teams ship to production with no regression suite because the eval story was too painful to build. Then a model provider bumps a minor version and the agent's accuracy silently drops ten points. Nobody notices for two weeks.
Failure mode when you get this wrong: you ship to production with no way to know if a model upgrade broke your workflow. You cannot defend accuracy numbers to a client. You are testing manually by running the agent and reading the outputs, which does not scale past four people.
6. How do you compose specialists versus run generalists?
The question most teams get wrong. A single agent with many tools feels simpler. Until the tool count exceeds twenty. Until a single prompt has to accommodate ten different use cases. Until you need to give different teams ownership of different capabilities.
Multi-agent composition is not always the answer. Sometimes one agent with good tool design is better. The question is when to split, when to stay monolithic, and what the cost of composition is. Every handoff between agents adds latency, context loss, and failure modes. Every specialist adds a new prompt to maintain. Every graph edge adds a routing decision that can go wrong.
Graph-based orchestration frameworks give you the escape valve when the single agent model runs out. Prompt-only frameworks do not. If you expect your system to grow in scope, you need the graph option available even if you do not use it on day one.
Failure mode when you get this wrong: your agent works for six weeks. Then as you add features, response quality collapses. Latency climbs. Prompt length becomes unmanageable. You cannot onboard a new team onto the agent without teaching them the entire system. You end up rewriting as a multi-agent graph under time pressure, which is the worst time to do it.
The Orchestration Patterns You Will Actually Use
Regardless of framework, these are the patterns. The names vary by vendor. The shapes repeat. Every production build I do combines two or three of these in a single workflow.
Sequential. Agents run one after another. Output of agent one becomes input to agent two. Useful when the process is linear and the order matters. Example: a due diligence workflow where you first extract entities from a document, then look them up in a registry, then generate a summary.
Concurrent. Agents run in parallel on the same input. Results get aggregated. Useful when you want multiple independent perspectives. Example: four agents each reviewing a legal clause for a different risk (compliance, commercial, IP, operational). Aggregated into a single risk summary. Faster than sequential because the agents do not wait for each other. More expensive because you pay for all branches whether you use the output or not.
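The fan-out shape, sketched with asyncio and stub reviewer functions standing in for real agent calls:

```python
import asyncio

async def review(risk: str, clause: str) -> dict:
    await asyncio.sleep(0)  # real version: one LLM call per reviewer
    return {"risk": risk, "finding": f"{risk}: reviewed '{clause}'"}

async def review_clause(clause: str) -> list:
    risks = ["compliance", "commercial", "ip", "operational"]
    # All four branches run (and are billed) whether or not you use them.
    return await asyncio.gather(*(review(r, clause) for r in risks))
```

The aggregation step after the `gather` is where the real design work lives: deciding what to do when two reviewers disagree.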
Handoff. One agent decides which specialist to route to next based on context. Useful for triage patterns. Example: customer service routing. A front-line agent reads the request, decides if it is billing, technical, or account access, and routes to the right specialist. The routing decision itself is an LLM call with a constrained output. This is where framework maturity matters. In immature frameworks the routing agent hallucinates a specialist that does not exist and you discover it in production.
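A small sketch of guarding that routing decision, with hypothetical specialist names:

```python
# Constrain the routing decision to known specialists and fall back
# deterministically when the model names one that does not exist.
SPECIALISTS = {"billing", "technical", "account_access"}

def route(raw_llm_choice: str, fallback: str = "technical") -> str:
    choice = raw_llm_choice.strip().lower().replace(" ", "_")
    if choice in SPECIALISTS:
        return choice
    # The hallucinated-specialist failure mode: validate and fall back here,
    # rather than discovering it in production.
    return fallback
```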
Hierarchical. A manager agent delegates to workers. Workers report back. The manager composes results and decides what to do with partial results. Similar to concurrent but with explicit supervision. Example: code review pipelines where a reviewer agent delegates to style, security, and test coverage specialists, then composes the review comments into a single PR review.
Magentic or dynamic planning. The orchestrator maintains a shared task ledger, delegates, observes results, and re-plans. Useful for open-ended problems where the right sequence of steps cannot be determined upfront. Example: research tasks. "Figure out the compliance posture of this company" is not a fixed sequence. It is a loop of searches, reads, cross-checks, and synthesis. Magentic patterns are powerful and expensive. Use them only when the problem genuinely needs dynamic planning. Most problems do not.
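The ledger-loop shape, sketched with stand-in plan and act functions. This is the pattern itself, not any vendor's implementation:

```python
def magentic_loop(goal: str, act, plan, max_rounds: int = 10) -> dict:
    """Orchestrator loop over a shared task ledger: act, observe, re-plan."""
    ledger = {"goal": goal, "tasks": plan(goal, []), "facts": []}
    rounds = 0
    while ledger["tasks"] and rounds < max_rounds:
        task = ledger["tasks"].pop(0)
        ledger["facts"].append(act(task))          # delegate and observe
        ledger["tasks"] = plan(goal, ledger["facts"])  # re-plan on new facts
        rounds += 1
    return ledger
```

Note the `max_rounds` cap: an uncapped re-planning loop is the single easiest way to burn a token budget.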
Event-driven. Agents react to events from external systems rather than being driven by a single entry-point workflow. Useful when the agent sits inside a larger event-driven architecture. Example: an agent that monitors a support queue, picks up tickets matching a pattern, processes them, and publishes results back. This is not a replacement for the other patterns. It is a harness around them.
The framework question becomes: does my framework let me compose these patterns, or does it lock me into one model? This is where graph-based frameworks pull ahead of purely hierarchical ones. If you want to build a handoff inside a hierarchical flow inside an event-driven harness, you need a framework that treats the graph as the primary abstraction.
The Production Readiness Checklist
Before you ship, walk through this list. If you cannot answer yes to all of it, you are not shipping a product. You are shipping a demo with uptime. I run this with every client in the week before go live.
State
- State is serialized and durable across process restarts
- State schema is versioned so you can migrate between deployments
- You can inspect, modify, and re-run from any point in the workflow

Failure recovery
- Individual step failures resume from the last good checkpoint
- Tool timeout and retry behaviour is configurable per step
- You can kill and restart the entire workflow engine without losing in-flight work

Approval gates
- Human approval is a durable pause, not an in-memory await
- Approval notifications go through your real systems (email, Slack, queue)
- Gate timeouts have defined behaviour (auto-reject, auto-escalate, notify)

Observability
- Every agent decision is traced with inputs, outputs, prompts, tool calls, timing
- Traces are queryable by workflow run and by user session
- Trace retention meets your compliance requirements
- You can replay a workflow from a trace

Evaluation
- You have a regression suite that runs before every deployment
- You can detect behaviour drift across model upgrades
- You have rubric-based scoring for open-ended outputs
- You have a red team suite for prompt injection and jailbreak attempts

Cost
- You know the token cost per workflow run, not just per call
- You have budgets that halt runs that exceed thresholds
- You can route to cheaper models for easy subtasks
- You have prompt caching enabled where the provider supports it

Compliance
- Data residency is enforced at the framework level, not just the model provider
- You can explain any agent decision to a regulator
- Sensitive data is redacted before it reaches external providers
- You have an audit log that is separate from your application logs

Rollback
- You can roll back a prompt change without a full redeploy
- You can pin model versions
- You can A/B test prompts and agent configurations in production
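A sketch of per-run budget enforcement. The prices are placeholders, not real provider rates:

```python
class BudgetExceeded(Exception):
    pass

class RunBudget:
    """Track token spend across every call in a workflow run; halt over budget."""

    def __init__(self, run_id: str, max_usd: float):
        self.run_id, self.max_usd, self.spent_usd = run_id, max_usd, 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               usd_per_1k_in: float = 0.003, usd_per_1k_out: float = 0.015):
        # Placeholder rates; substitute your provider's actual pricing.
        self.spent_usd += (input_tokens / 1000) * usd_per_1k_in \
                        + (output_tokens / 1000) * usd_per_1k_out
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(f"run {self.run_id}: ${self.spent_usd:.4f}")
```

The point is the unit of accounting: a budget per workflow run, checked before the next call fires, not a monthly invoice surprise.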
If you build this checklist into your architecture from the start, framework choice matters less. If you do not, no framework saves you. I have watched teams with the best framework in the space fail in production because they skipped the checklist. I have watched teams with a mediocre framework ship reliably because they built the checklist in from day one.
The checklist is the thing. The framework is the accelerator.
What Is Coming in the Rest of the Series
The rest of this series maps the landscape against the questions.
Part 2 covers frameworks and platforms. The sharper mental model is that every hyperscaler now has a two-layer story. An open SDK on the bottom (the framework). A managed platform on top (the platform). Microsoft has Agent Framework and Foundry. Google has ADK and Vertex. AWS has Strands and Bedrock. Anthropic has the Claude Agent SDK and the Claude developer platform. LangGraph plus LangSmith is the credible open source version of the same story. The decision that comes before framework choice is framework versus platform. Part 2 covers all of it.
Part 3 goes into the building blocks you will use inside whatever framework you pick. Memory patterns, the ones that work and the ones that look good on slides but fail in production. RAG patterns, classic and agentic and graph-based. Tool patterns beyond MCP. Context engineering, including prompt caching economics. Evaluation frameworks. Safety and guardrails. Cost management. Identity and auth for agents. This is the implementation reality every CTO needs to plan for.
Part 4 is where I land. The recommendation I make to most of my enterprise clients. The explicit "when I do not pick it" caveats. Three client war stories across different stacks. Where agentic AI is heading over the next twelve months. And a thirty-minute decision framework you can run with your team.
If you are evaluating a stack for a production build and want a second pair of eyes before you commit, reply to this email or find me on LinkedIn. I read every reply. I will also note which parts of this guide readers push back on the hardest. Those become the sections I rewrite in the quarterly update.
Navneet Singh is the founder and CEO of Webority Technologies. He builds enterprise AI systems for clients in healthcare, financial services, and government, and writes weekly about what actually works.
