TL;DR: Most production “AI agents” are actually deterministic workflows — and that’s fine, but the architecture decision you make right now determines whether your system costs $0.10 or $50 per request, whether it completes in 3 seconds or 3 minutes, and whether you can debug it when it fails at 2am. This post maps Anthropic’s five canonical patterns against AWS Bedrock AgentCore, Google ADK, and Azure Foundry Agent Service, and surfaces the five production failure modes that quietly kill agent projects after the demo.
Audience: architect
Reading Time: ~16 min (~3,300 words)
Prerequisites
You should understand:
- What LLMs are and how tool calling/function calling works
- The basics of serverless compute and event-driven architectures on at least one cloud
- Why context windows matter for LLM-based systems
You do not need to have built an agent system before — this post is specifically for engineers evaluating whether and how to build one.
The Problem: The Vocabulary Gap That Causes Production Failures
Two things reliably happen when teams ship their first AI agent to production.
In week one, it works surprisingly well on the demo scenarios. In week six, it has cost 40× more than budgeted, failed silently on three edge cases, introduced a latency spike that the team cannot trace, and the on-call engineer has no idea how to reproduce the failure.
The root cause is rarely the model. It is almost always an architecture mismatch caused by a vocabulary problem: the word “agent” is applied to everything from a simple LLM wrapper to a fully autonomous multi-agent orchestration system. The architecture decisions that follow — which pattern to use, what guardrails to add, how to instrument the system — are profoundly different depending on which of those you actually built.
Anthropic’s December 2024 paper “Building Effective Agents” introduced the most useful distinction in the field:
- Workflows: systems where LLMs are components in predefined code paths. The execution sequence is fixed by the developer.
- Agents: systems where LLMs dynamically direct their own process, including deciding which tools to call and in what order.
The practical implication: most production systems labeled “agents” are workflows. That is not a criticism — workflows are more predictable, cheaper to run, and far easier to debug. But if you architect a workflow using an agentic framework with autonomous loop control, you get the cost and complexity of agents without the flexibility benefit.
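The distinction is easiest to see side by side. The sketch below is illustrative only: `fake_llm`, `TOOLS`, `run_workflow`, and `run_agent` are hypothetical names, and the model is replaced by a canned stub so the example runs offline. In the workflow, the developer's code fixes the step order; in the agent, the model's output decides what happens next.

```python
from typing import Callable

# Hypothetical stand-in for a real model call. It returns canned text so the
# sketch runs offline; swap in your provider's client in real code.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("PLAN"):
        # Pretend the model picks the next action from what it has seen so far.
        return "FINISH" if "lookup_order ->" in prompt else "lookup_order"
    return f"<answer to: {prompt[:30]}>"

# Hypothetical tool registry with a single canned tool.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda query: "order 123: shipped",
}

def run_workflow(query: str) -> list[str]:
    """Workflow: the developer fixes the sequence of steps in code."""
    steps = []
    intent = fake_llm(f"Classify intent: {query}")
    steps.append("classify")
    entities = fake_llm(f"Extract entities: {query} / {intent}")
    steps.append("extract")
    fake_llm(f"Respond using: {entities}")
    steps.append("respond")
    return steps

def run_agent(query: str, max_steps: int = 5) -> list[str]:
    """Agent: the model chooses the next action on every turn."""
    steps: list[str] = []
    scratchpad = f"Task: {query}"
    for _ in range(max_steps):  # hard cap so the loop always terminates
        action = fake_llm(f"PLAN next action.\n{scratchpad}")
        if action == "FINISH" or action not in TOOLS:
            break
        scratchpad += f"\n{action} -> {TOOLS[action](query)}"
        steps.append(action)
    return steps
```

The workflow always runs the same three steps. The number and order of steps in the agent depend on what the model returns, which is exactly where the variable cost and harder debugging come from.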
This post gives you the map to make that call correctly.
Background: The Five Patterns
Anthropic’s paper identifies five core patterns that cover the vast majority of LLM-based systems. Understanding these is a prerequisite to everything that follows:
1. Prompt Chaining — The output of one LLM call becomes the input of the next. Steps are sequential and fixed. Example: classify intent → extract entities → generate response.
2. Routing — A classifier decides which specialized subpipeline handles the input. Example: customer query → classify as billing/technical/escalate → route to appropriate handler.
3. Parallelization — Multiple LLM calls run concurrently on independent subtasks. Includes the “sectioning” variant (divide a document, process each section independently) and the “voting” variant (run the same prompt N times, take a majority or best result).
4. Orchestrator-Workers — An orchestrator LLM dynamically decomposes a task and delegates subtasks to specialized worker agents. Workers call tools; the orchestrator synthesizes results. This is the first “true agent” pattern — the execution path is not fixed in advance.
5. Evaluator-Optimizer — A generator LLM produces output, an evaluator LLM scores it, and the system iterates until a quality threshold is met (or a max iteration cap is hit). The evaluator is the critical component that people forget to build before this pattern reaches production.
Patterns 1–3 are mostly workflows: deterministic execution paths that happen to include LLM calls. This post treats patterns 4–5 as agentic because the LLM controls branching and iteration, although Anthropic’s paper itself groups all five under workflows and discusses fully autonomous agents separately. Your architecture, cost model, observability requirements, and failure modes differ fundamentally between the two groups.
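As a concrete instance of a workflow pattern, here is a minimal sketch of the “voting” variant of parallelization. It assumes a classifier callable that takes the input text and a sample index; `majority_vote` and `stub_classifier` are hypothetical names, and the stub returns canned labels in place of a sampled LLM call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def majority_vote(classify: Callable[[str, int], str], text: str, n: int = 5) -> str:
    """Run the same classification n times concurrently and return the most common label."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        labels = list(pool.map(lambda sample: classify(text, sample), range(n)))
    return Counter(labels).most_common(1)[0][0]

# Hypothetical stand-in for a sampled LLM call: different samples disagree.
_CANNED = ["billing", "technical", "billing", "billing", "escalate"]

def stub_classifier(text: str, sample: int) -> str:
    return _CANNED[sample % len(_CANNED)]
```

The control flow is still fixed by the developer: the code, not the model, decides that there are exactly `n` calls and how their results are combined.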
Choosing the Right Pattern
Before selecting a cloud platform, pick the right pattern. The flowchart below maps task characteristics to the appropriate pattern. Use the least complex pattern that can complete the task.
Diagram

Pattern Selection Summary
| Pattern | Dynamic control flow | Cost predictability | P99 latency | Debugging complexity | Best fit |
|---|---|---|---|---|---|
| Prompt chaining | No | Very high | Low (1–5s) | Very low | Document processing, deterministic transformation pipelines |
| Routing | Partial | High | Low (1–5s) | Low | Intent classification, tiered support, rule-based dispatch |
| Parallelization | No | High | Low (parallel steps) | Medium | Independent operations, consensus/voting on critical outputs |
| Orchestrator-workers | Yes | Low | High (15–60s+) | High | Open-ended research, multi-domain tasks, unknown subtask count |
| Evaluator-optimizer | Yes (iteration) | Medium (bounded) | High | Medium | Code generation, long-form content, quality-critical outputs |
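The key architectural insight from this table is that task flexibility trades off against both cost predictability and low latency: the more control flow you hand to the model, the less predictable the cost and the longer the tail latency. You cannot have all three; pick two.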
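The “least complex pattern that works” rule can be written down as a small decision function. This is a sketch under stated assumptions, not a reproduction of the flowchart: the `Task` fields and the order of the checks are hypothetical and are derived only from the “Dynamic control flow” and “Best fit” columns of the table.

```python
from dataclasses import dataclass

@dataclass
class Task:
    steps_known_in_advance: bool      # can you enumerate the steps up front?
    needs_iterative_refinement: bool  # is there a measurable quality bar to iterate toward?
    needs_input_dispatch: bool        # do different inputs need different handlers?
    has_independent_subtasks: bool    # can parts run without each other's output?

def choose_pattern(task: Task) -> str:
    """Return the least complex pattern that fits (assumed ordering, see lead-in)."""
    if not task.steps_known_in_advance:
        return "orchestrator-workers"   # unknown subtask count needs dynamic decomposition
    if task.needs_iterative_refinement:
        return "evaluator-optimizer"    # fixed steps, but a bounded quality loop
    if task.needs_input_dispatch:
        return "routing"
    if task.has_independent_subtasks:
        return "parallelization"
    return "prompt chaining"            # the default: sequential and fixed
```

Treat the output as a starting point for discussion rather than a verdict; real tasks often combine patterns, for example routing into a chain.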
Three Platform Approaches
All three major cloud providers now offer production-grade infrastructure for agent systems. The right platform depends less on capability gaps (which have narrowed substantially) and more on your existing cloud footprint, compliance requirements, and whether you need framework-agnostic portability.
Context Diagram

AWS Bedrock AgentCore
Bedrock AgentCore — currently GA — is AWS’s take on “the agent runtime that gets out of your way.” The key design decision: framework-agnostic by default. You bring your LangGraph, CrewAI, AutoGen, or custom framework code; AgentCore provides nine fully managed services around it.
The nine components are split into two groups:
Infrastructure services: Runtime (serverless hosting, session isolation, 8-hour async support), Gateway (converts Lambda/REST/MCP servers into agent-callable tools with semantic discovery), Memory (cross-session context persistence), Identity (managed IAM for agents, On-Behalf-Of authentication to third-party tools).
Quality and safety services: Policy (Cedar-based real-time tool-call interception before execution), Evaluations (live output sampling and scoring with built-in and custom evaluators), Observability (CloudWatch + OpenTelemetry across every agent hop), Code Interpreter (isolated multi-language sandbox), Browser (serverless browser runtime scaling to hundreds of concurrent sessions).
The Cedar policy engine deserves emphasis. Cedar policies are evaluated before each tool call executes — not just at input/output boundaries. This is what makes production guardrails possible: you can reject a tool call that would write to a database outside business hours or exceed a per-agent cost budget, without modifying your agent code.
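To make “evaluated before each tool call executes” concrete, here is a plain-Python sketch of the interception idea. It is not Cedar syntax and not the AgentCore API: `ToolCall`, `PolicyGate`, and both policy functions are hypothetical, and the 09:00–17:00 window and the budget figure are made-up values.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional

@dataclass
class ToolCall:
    agent_id: str
    tool: str
    is_write: bool
    estimated_cost_usd: float
    at: datetime

# A policy returns a denial reason, or None to allow the call.
Policy = Callable[[ToolCall, float], Optional[str]]

def business_hours_writes(call: ToolCall, spent_usd: float) -> Optional[str]:
    if call.is_write and not (9 <= call.at.hour < 17):
        return "writes are only allowed 09:00-17:00"
    return None

def make_budget_policy(limit_usd: float) -> Policy:
    def policy(call: ToolCall, spent_usd: float) -> Optional[str]:
        if spent_usd + call.estimated_cost_usd > limit_usd:
            return f"would exceed per-agent budget of ${limit_usd:.2f}"
        return None
    return policy

class PolicyGate:
    """Sits between the agent and its tools; agent code never sees the policies."""

    def __init__(self, policies: list[Policy]):
        self.policies = policies
        self.spent_usd: dict[str, float] = {}

    def execute(self, call: ToolCall, tool_fn: Callable[[], object]) -> object:
        spent = self.spent_usd.get(call.agent_id, 0.0)
        for policy in self.policies:
            reason = policy(call, spent)
            if reason is not None:
                # Denied before the tool runs, so there is no side effect to undo.
                raise PermissionError(f"{call.tool} denied: {reason}")
        self.spent_usd[call.agent_id] = spent + call.estimated_cost_usd
        return tool_fn()
```

The design point is the placement of the check: because it wraps the tool invocation rather than the model's input or output, changing a rule means editing the policy list, not the agent.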
Container Diagram

Google ADK + Agent Engine
Google’s Agent Development Kit (ADK) is open-source (Apache 2.0) and production-ready in Python, TypeScript, Go 1.0, and Java 1.0. Unlike AgentCore (a runtime platform you deploy code onto), ADK is a framework you code against, deployed via Agent Engine on Vertex AI, Cloud Run, or GKE.
ADK’s differentiated capabilities:
- Native multi-agent: First-class support for orchestrator-workers architectures, sequential and parallel workflow agents, and loop agents — built into the framework SDK.
- A2A protocol: ADK supports the Agent-to-Agent (A2A) communication protocol, enabling standardized inter-agent communication at the API level. This is the foundation for multi-vendor agent interoperability — still emerging, but worth watching.
- MCP native: ADK has a dedicated /mcp/ module for MCP server integration, making tool definitions portable across any MCP-compliant host.
- Evaluation framework: Built in, not bolted on. ADK’s evaluation tooling is a first-class SDK component with visual debugging.
- Model agnostic: Gemini, Claude, Gemma, Vertex AI hosted models, Ollama, vLLM, LiteLLM — same ADK code, swappable model configuration.
If you are already on GCP or if open-source portability is a hard requirement, ADK is the strongest option. The trade-off against AgentCore: you own more infrastructure (the Agent Engine handles the runtime, but the framework itself is yours to upgrade and maintain).
Azure Foundry Agent Service
Microsoft Foundry Agent Service (last updated April 16, 2026) positions itself differently from the other two, offering three tiers based on how much control you need.
Prompt agents (GA): configuration-only, no code required. Define instructions, model, and tools in the portal. Agent Service handles hosting and orchestration. Best for rapid prototyping and internal tools.
Workflow agents (preview): YAML-based or visual builder orchestration of multiple agents. Supports branching logic, human-in-the-loop steps, and group-chat patterns. For teams that want multi-agent without writing orchestration code.
Hosted agents (preview): Bring your own framework (LangGraph or custom), deploy as a container on the Agent Service infrastructure; Foundry manages runtime and scaling. This is the equivalent of AgentCore Runtime.
Microsoft’s enterprise integrations are the real differentiator. Published agents can be distributed directly through Microsoft Teams and Microsoft 365 Copilot — agents become available where enterprise users already work, without building a custom UI. The Entra Agent Registry provides identity-based agent discovery comparable to AWS’s semantic tool discovery in AgentCore Gateway.
Platform Comparison
| Capability | AWS Bedrock AgentCore | Google ADK + Agent Engine | Azure Foundry Agent Service |
|---|---|---|---|
| GA status | GA (all tiers) | Production-ready (Python/TS/Go/Java 1.0) | Prompt agents: GA; Workflow/Hosted: Preview |
| Approach | Runtime platform (framework-agnostic) | Framework (Apache 2.0 open source) | Tiered service: no-code → code-based |
| Multi-agent coordination | Via supported frameworks in Runtime | Native multi-agent in ADK SDK | Workflow agents (YAML/visual) + Hosted |
| MCP support | AgentCore Gateway (connect to MCP servers) | Native MCP module in ADK | MCP servers via Azure Functions + catalog |
| A2A protocol | Framework-dependent | Native A2A protocol in ADK SDK | Entra Agent Registry (distribution layer) |
| Policy enforcement | Cedar-based, per-tool-call inline | Evaluation framework + Apigee AI Gateway | Content safety filters, XPIA protection, Azure RBAC |
| Identity model | AgentCore Identity (IAM, OBO auth) | GCP service accounts, Vertex AI IAM | Microsoft Entra per-agent identity, OBO passthrough |
| Observability | CloudWatch + OpenTelemetry native | ADK logging + evaluation + Cloud Trace | Application Insights, end-to-end agent tracing |
| Pricing model | Per-use, serverless (agentcore.aws) | Vertex AI consumption-based | Consumption-based (Azure AI Foundry pricing) |
| Key differentiator | Cedar policies, framework-agnostic, 9 managed services | A2A protocol, open source portability, multi-language | M365/Teams distribution, Entra identity, enterprise compliance |
Production Failure Modes
This is the section you will not find in the vendor documentation. These are the five patterns that reliably cause production agent failures after a successful demo.
Failure 1: Compounding Error Amplification
If you chain five agents, each with 90% task accuracy, and their errors are independent, the end-to-end system accuracy is about 0.9⁵ ≈ 59%. This is not a theoretical problem; it is arithmetic. In practice, accuracy is harder to define and is often lower.
What this looks like in production: An orchestrator-workers pipeline succeeds in 95% of test cases. After launch, you observe the real distribution of inputs, and the accuracy is 60%. Each individual agent looks fine in isolation. The problem is the chain.
Mitigation: Measure each agent’s accuracy independently on the real task distribution before chaining. Use an explicit evaluator pattern (see Pattern 5 above) between high-stakes handoffs in the chain, not just at the output boundary.
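The arithmetic is worth running against your own numbers. The sketch below assumes that every step must succeed and that step errors are independent, which is a simplification; `chain_accuracy` and `max_chain_length` are hypothetical helper names.

```python
import math

def chain_accuracy(step_accuracies: list[float]) -> float:
    """End-to-end success probability if every step must succeed and errors are independent."""
    return math.prod(step_accuracies)

def max_chain_length(per_step: float, target: float) -> int:
    """Largest number of equally accurate steps that keeps end-to-end accuracy >= target."""
    if not (0.0 < per_step < 1.0) or not (0.0 < target <= 1.0):
        raise ValueError("per_step must be in (0, 1) and target in (0, 1]")
    steps, accuracy = 0, 1.0
    while accuracy * per_step >= target:
        accuracy *= per_step
        steps += 1
    return steps
```

With 90% per-step accuracy, `chain_accuracy([0.9] * 5)` comes out near 0.59, and `max_chain_length(0.9, 0.8)` says only two such steps fit under an 80% end-to-end target. Measured per-step accuracies on real inputs should replace the illustrative 0.9.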
Failure 2: Unbounded Cost from Feedback Loops
An evaluator-optimizer with a weak or poorly calibrated evaluator can loop indefinitely, or until your cloud budget is exhausted. There is no natural stopping condition in the majority of open-source agent frameworks.
What this looks like in production: An overnight batch job for content generation runs all night, loops 40 iterations on one document because the evaluator’s quality bar is set to a score of 0.95 on outputs that only ever reach 0.88, and hits a $3,000 bill by morning.
Mitigation: Every evaluator-optimizer loop must have a hard iteration cap (e.g., max 3 loops), a time ceiling, and a cost ceiling. All three. AgentCore Policy (Cedar rules) and Azure Foundry’s tool configuration can enforce budget limits per agent session without modifying agent code.
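A minimal sketch of a loop with all three limits follows. The function and field names are hypothetical, the generator and evaluator are passed in as plain callables, and cost is approximated with a flat per-iteration estimate; a real system would meter actual token usage.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class LoopResult:
    output: str
    score: float
    iterations: int
    stopped_because: str

def refine(
    generate: Callable[[str], str],    # takes feedback, returns a new draft
    evaluate: Callable[[str], float],  # returns a score in [0, 1]
    threshold: float = 0.95,
    max_iterations: int = 3,
    max_seconds: float = 30.0,
    max_cost_usd: float = 1.0,
    cost_per_iteration_usd: float = 0.05,  # assumed flat estimate
    clock: Callable[[], float] = time.monotonic,
) -> LoopResult:
    start = clock()
    best_output, best_score = "", float("-inf")
    spent, iterations, feedback = 0.0, 0, ""
    reason = "max_iterations"
    while iterations < max_iterations:
        if clock() - start > max_seconds:
            reason = "time_ceiling"
            break
        if spent + cost_per_iteration_usd > max_cost_usd:
            reason = "cost_ceiling"
            break
        output = generate(feedback)
        score = evaluate(output)
        iterations += 1
        spent += cost_per_iteration_usd
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            reason = "threshold_met"
            break
        feedback = f"previous score {score:.2f}; improve the draft"
    # Always return the best draft seen, even when the quality bar was never met.
    return LoopResult(best_output, best_score, iterations, reason)
```

Recording `stopped_because` matters as much as the caps themselves: a loop that routinely exits on `max_iterations` is a signal that the evaluator's threshold is miscalibrated, as in the 0.95 versus 0.88 scenario above.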
Failure 3: Latency Accumulation in Sequential Chains
Each LLM call in a sequential chain adds latency. An orchestrator that delegates to four workers one after another, where each worker makes three LLM calls, incurs at least 12 LLM inference round-trips before counting the orchestrator’s own planning and synthesis calls, which often means 30–90 seconds with frontier models.
What this looks like in production: An agent assistant that works well in a Slack slash command context suddenly feels broken when embedded in a UI where users expect sub-5-second responses.
Mitigation: Identify independent sub-tasks and parallelize them (Pattern 3). Use smaller, faster models for intermediate steps where quality requirements are lower. Set explicit latency budgets before designing agent topology — work backward from an acceptable user-facing latency to determine how many sequential LLM hops you can afford.
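Working backward from a latency budget is a one-line calculation, and the effect of parallelizing independent calls is easy to demonstrate. In this sketch the per-hop latency is an assumed constant, `fake_llm_call` simulates a model call with `asyncio.sleep`, and all names are hypothetical.

```python
import asyncio
import time

def max_sequential_hops(budget_s: float, per_hop_s: float, overhead_s: float = 0.0) -> int:
    """How many back-to-back LLM calls fit inside a user-facing latency budget."""
    return max(0, int((budget_s - overhead_s) // per_hop_s))

async def fake_llm_call(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stand-in for network + inference time
    return name

async def run_sequential(calls: list[tuple[str, float]]) -> list[str]:
    # Total time is roughly the sum of the individual call times.
    return [await fake_llm_call(name, seconds) for name, seconds in calls]

async def run_parallel(calls: list[tuple[str, float]]) -> list[str]:
    # Total time is roughly the slowest single call; only valid for independent sub-tasks.
    return await asyncio.gather(*(fake_llm_call(name, seconds) for name, seconds in calls))

def timed(runner, calls: list[tuple[str, float]]) -> tuple[list[str], float]:
    started = time.perf_counter()
    result = asyncio.run(runner(calls))
    return result, time.perf_counter() - started
```

For example, `max_sequential_hops(5.0, 2.0)` returns 2: with a 5-second budget and roughly 2 seconds per call, a third sequential hop is already out of reach, so anything beyond that has to be parallelized, moved to a faster model, or cut.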
Failure 4: Debugging Opacity
Tracing a failure through four agents, each making eight tool calls, across distributed log streams is extremely difficult without purpose-built instrumentation. Standard application logging is insufficient — you need to reconstruct the exact sequence of LLM decisions, tool calls, inputs, and outputs that led to the failure.
What this looks like in production: An agent produces an incorrect output. The engineer opens CloudWatch or Application Insights, finds 47 log lines from four agents, has no way to correlate them to a single user request, and cannot reproduce the issue because the agent’s internal state was not persisted.
Mitigation: Before shipping to production, verify that every agent boundary is instrumented with OpenTelemetry traces, including: request ID propagation across all hops; every tool call with its exact input and output; every LLM call with its model, token count, and latency; and the full conversation context at each handoff. All three platforms provide this when configured correctly: AgentCore Observability (CloudWatch + OTel), ADK’s logging module, and Azure’s Application Insights agent tracing.
Persist conversation context to the memory layer (AgentCore Memory, ADK Sessions, Azure Cosmos DB) on every run, not just on success. You cannot debug failures you cannot reconstruct.
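The two requirements above, one request ID on every event and persistence on failure as well as success, can be illustrated with the standard library alone. This is a stand-in for OpenTelemetry and a real memory layer, not a substitute for them: `TRACE_LOG`, `start_request`, `traced_tool_call`, and `events_for` are hypothetical names, and the in-memory list plays the role of the trace exporter.

```python
import contextvars
import time
import uuid
from typing import Any, Callable

# Carries the request ID implicitly across function calls within one request.
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar("request_id", default="unset")

# Stand-in for a trace exporter or memory layer.
TRACE_LOG: list[dict[str, Any]] = []

def start_request() -> str:
    request_id = uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id

def traced_tool_call(agent: str, tool: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
    """Record input, output or error, and latency for one tool call, tagged with the request ID."""
    event: dict[str, Any] = {
        "request_id": request_id_var.get(),
        "agent": agent,
        "tool": tool,
        "input": kwargs,
    }
    started = time.perf_counter()
    try:
        event["output"] = fn(**kwargs)
        event["status"] = "ok"
        return event["output"]
    except Exception as exc:
        event["status"] = "error"
        event["error"] = repr(exc)
        raise
    finally:
        event["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
        TRACE_LOG.append(event)  # written on failure too, not just on success

def events_for(request_id: str) -> list[dict[str, Any]]:
    """Reconstruct the ordered sequence of tool calls for a single user request."""
    return [event for event in TRACE_LOG if event["request_id"] == request_id]
```

The `finally` block is the important part: the failing call is exactly the one you will want to replay, so its input and error must be recorded before the exception propagates.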
Failure 5: Tool Call Storms from Insufficiently Scoped Permissions
An orchestrator agent with access to 50+ tools and no constraints tends to over-call them. A three-step research task can balloon to 28 tool calls as the LLM speculatively tries tools to satisfy ambiguous instructions.
What this looks like in production: Your agent’s per-request cost is 40× higher in production than in testing because test cases were well-specified, but production inputs are ambiguous, and the model tries more tools when uncertain.
Mitigation: Scope each agent’s tool permissions to the minimum required for its role. Workers should not have access to orchestrator-level tools. Use Cedar policies (AgentCore) or Foundry’s tool configuration scope to enforce this at the platform level, not just in your system prompt. Write explicit tool descriptions that specify when not to use a tool — poor tool definitions are the primary cause of tool call storms.
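Least-privilege tool scoping can be sketched in a few lines. The class below is a hypothetical application-level illustration, not a platform API: each role is handed only the tools on its allowlist plus a per-request call cap, so an uncertain model cannot reach for tools it was never given.

```python
from typing import Any, Callable

class ScopedToolbox:
    """Exposes a role-specific subset of a shared tool registry, with a call cap."""

    def __init__(
        self,
        role: str,
        registry: dict[str, Callable[..., Any]],
        allowed: set[str],
        max_calls: int,
    ):
        unknown = allowed - set(registry)
        if unknown:
            raise ValueError(f"unknown tools for role {role!r}: {sorted(unknown)}")
        self.role = role
        self._tools = {name: registry[name] for name in allowed}
        self.max_calls = max_calls
        self.calls_made = 0

    def visible_tools(self) -> list[str]:
        """Only these names should ever appear in the tool list sent to the model."""
        return sorted(self._tools)

    def call(self, name: str, *args: Any, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise PermissionError(f"role {self.role!r} may not call {name!r}")
        if self.calls_made >= self.max_calls:
            raise RuntimeError(f"tool-call budget of {self.max_calls} exhausted")
        self.calls_made += 1
        return self._tools[name](*args, **kwargs)
```

Filtering the visible tool list matters as much as rejecting the call: a tool the model never sees in its prompt cannot be speculatively tried, which removes the token cost of the attempt as well as the side effect.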
Real-World Example: Ericsson on AWS AgentCore
Ericsson — which manages telecommunications infrastructure for billions of users globally — has deployed agent systems on AWS Bedrock AgentCore, with Dag Lindbo (SVP Cloud Software & Services) specifically citing the framework flexibility as the key production consideration: the ability to bring existing agent code without rewriting it for a proprietary runtime (AWS AgentCore product page).
This is the pattern you will encounter with enterprise adoption: teams that built agent systems on LangGraph or CrewAI in 2024 now need production-grade infra (session management, identity, guardrails, observability) without discarding the framework code they have already written and tested. Framework-agnostic runtimes (AgentCore, Foundry Hosted agents) are the primary response to this constraint.
The alternative — adopting a platform-native orchestration layer (ADK, Foundry Workflow agents) — makes sense for greenfield deployments where framework portability is less critical than tight platform integration.
Trade-offs and When NOT to Use Multi-Agent Systems
When NOT to use agents or multi-agent systems
Latency budget under 2 seconds: Each LLM hop adds latency, so a single optimized prompt will almost always be faster than a chain of agents. If your user-facing latency budget is under 2 seconds, you need a single prompt, not a pipeline.
Fully deterministic tasks: If you can enumerate every branch in your pipeline, build a coded workflow. LLM-controlled routing or loop management adds cost and non-determinism to problems that do not require either.
Cost per request must be predictable: Autonomous agents produce variable token usage. If you need a predictable per-request cost for billing or budgeting, agents are the wrong tool. Use chaining or routing patterns where token counts are bounded.
High-stakes irreversible actions without human review: Financial transactions, patient record writes, infrastructure mutations — these require deterministic human-in-the-loop checkpoints at decision boundaries. An autonomous agent deciding to delete a resource or initiate a payment is an incident waiting to happen.
Team has no distributed tracing infrastructure: Do not deploy multi-agent systems into environments without first setting up OpenTelemetry or an equivalent distributed tracing solution. You will not be able to diagnose failures, and failures are guaranteed.
Trade-off Matrix
| Factor | Workflow (Chaining/Routing) | ReAct Agent | Orchestrator-Workers |
|---|---|---|---|
| Development time | Fast | Medium | Slow |
| Cost per request | Predictable | Variable | Highly variable |
| Latency (P50) | 1–5s | 5–20s | 20–60s+ |
| Task flexibility | Low | Medium | High |
| Debuggability | High | Medium | Low without tracing |
| Failure detection | Explicit | Harder | Requires instrumentation |
Key Takeaways
- Most production “agents” are workflows — classify what you built accurately before you architect around it. Workflows (patterns 1–3) are cheaper, faster, and easier to debug.
- Compounding error rates are a multi-agent system’s primary accuracy killer. Measure each agent’s accuracy on your real input distribution before chaining. Do not chain 5 agents if you have not measured each one independently.
- Every evaluator-optimizer loop requires three explicit limits: a max iteration count, a wall-clock time ceiling, and a per-session cost ceiling. Missing any one of these is a support incident.
- Instrument every agent boundary before launching. OpenTelemetry traces propagating request IDs across all hops are the minimum viable observability baseline for any multi-agent system.
- MCP is now table stakes — AWS AgentCore Gateway, Google ADK, and Azure Foundry Agent Service all support MCP. Tool portability is no longer a platform differentiator.
- Cedar-style per-tool-call policy enforcement (AgentCore) or Foundry’s tool scope configuration lets you enforce production guardrails without modifying agent code. Use them.
- Platform selection is an integration question. AgentCore for framework portability and Cedar-based policy control. ADK for open-source portability, A2A interoperability, and multi-language teams. Azure Foundry for enterprise Microsoft 365/Teams distribution and Entra identity.
- Start with the simplest pattern. “Start with simple prompts, optimize with evals, add agents only when simpler solutions fall short.” — Anthropic, December 2024. This is still the right production principle.
