Building Scalable GenAI Apps on AWS: From Prototype to Production

November 8, 2025

TL;DR

  • A support team ships a GenAI assistant on Amazon Bedrock in 30 days, moving from demo to production without runaway cost or risk.
  • Key choices: Bedrock for models, OpenSearch Serverless for vectors, S3 for lineage, API Gateway + Lambda for orchestration.
  • What matters: safety guardrails, measurable eval, observability, and CI/CD—not just a clever prompt.

Project timeline snapshot

| Phase | Day | Milestone |
| --- | --- | --- |
| Kickoff | 0 | Backlog crosses 1,200 tickets; AI assistant charter approved. |
| Prototype (Act I) | 3 | Bedrock-based helper drafting answers with lineage and manual QA. |
| Reliability (Act II) | 14 | RAG + eval harness hits ≥85% correct with citations. |
| Production hardening (Act III) | 28 | Guardrails, SLOs, CI/CD, and cost controls live across accounts. |

Starting prerequisites

  • AWS org-level guardrails already enforce dedicated support accounts, VPC endpoints, and CloudTrail aggregation.
  • A lightweight ticket export feed to S3 exists for analytics, reused here for eval gold sets.
  • Terraform baseline modules (network, observability, IAM foundations) were in place before Act I, so the 30-day clock focuses on the GenAI stack itself.

Introduction — The night the queue broke

On a Tuesday night, Anna (Support Lead) watched the backlog climb past 1,200 tickets. The average first response time exceeded 9 hours, and Finance flagged the rising overtime. Arjun (Tech Lead) made a call: “We ship an AI assistant in 30 days or we scale headcount by 40%.” The constraints: strict PII controls, sub-2s median latency, and no vendor lock-in that locks data in. This is the story of how they built, measured, and hardened a GenAI app on AWS—without turning the system into a black box.


With leadership aligned on the constraints and baseline guardrails already deployed, the team sprinted into the first build phase.

Act I — The three-day prototype: make it helpful

Goal

Prove the assistant can draft high-quality replies from FAQs and past resolutions.

Decision journal

  • Context: Need fast model access and optionality (Claude, Llama, Mistral, Titan)
  • Options: Direct model APIs vs Amazon Bedrock
  • Decision: Bedrock for unified API, governance, and model choice; store all I/O in S3 for evaluation

Architecture (prototype)

flowchart LR
    U[Agent Console] --> APIGW[API Gateway]
    APIGW --> L[Lambda: Orchestrator]
    L --> BR[Bedrock: FM]
    L --> S3[S3: Prompts + Transcripts]

    classDef aws fill:#FF9900,stroke:#333,stroke-width:1px,color:#111
    class APIGW,L,BR,S3 aws

Prototype operating notes

  • Split the single S3 bucket into /prompts, /responses, and /fixtures prefixes so that every request—good or bad—has a lineage for later evaluation.
  • Secrets (model API keys, Slack webhooks) reside in AWS Systems Manager Parameter Store; Lambda reads them via short-lived decrypt permissions instead of bundling environment variables.
  • A pytest harness (internal repo support-assistant/tests) replays 15 canonical tickets after each prompt edit and exports a CSV that product and support can annotate for quick alignment.
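
To make the Parameter Store pattern concrete, here is a minimal sketch of the lookup the orchestrator might perform; the parameter path and in-memory cache are illustrative assumptions, not the team's actual code.

import boto3

ssm = boto3.client("ssm")
_cache = {}

def get_secret(name: str) -> str:
    # SecureString parameters are decrypted in-flight via the Lambda role's KMS grant.
    if name not in _cache:
        _cache[name] = ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]
    return _cache[name]

# Hypothetical parameter path; the real name lives in the team's Terraform.
slack_webhook = get_secret("/support-assistant/slack-webhook-url")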

First working call (conceptual)

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Anthropic Messages API on Bedrock: the system prompt is a top-level field,
# not a message with role "system".
body = {
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 400,
  "system": "You write concise, empathetic replies.",
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Refund policy for annual plans?"}]}
  ]
}

resp = bedrock.invoke_model(
  modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
  contentType="application/json",
  accept="application/json",
  body=json.dumps(body)
)

# The response body is a stream; read it once and pull out the first text block.
answer = json.loads(resp["body"].read())["content"][0]["text"]
print(answer)

Outcome (Day 3)

  • Helpfulness (manual review, n=25): 72% acceptable drafts
  • Median latency: 1.6s; Cost/request: $0.003–$0.006 (prompt size dependent)
  • Gap: Hallucinations on policy edge cases; no citations for trust

Key metrics recap — Act I

| Metric | Value | Notes |
| --- | --- | --- |
| Manual helpfulness | 72% acceptable (n=25) | Quick human review across priority ticket types. |
| Median latency | 1.6s | Measured via CloudWatch p50 at API Gateway. |
| Cost per request | $0.003–$0.006 | Depends on prompt size; tracked via Bedrock billing. |
| Regression harness size | 15 canonical tickets | Pytest replay suite exported to CSV for annotation. |

Prototype exit criteria

  • Support leadership signed off only after seeing a side-by-side comparison in the console, a Jira comment proving transcript capture, and an alarm that fires if Bedrock median latency exceeds 2.5 seconds for five minutes.
  • Engineering committed the three core prompts and their instructions to Git, so every prompt version used in the prototype is reproducible for audits.

Three intense days later, the assistant could draft helpful replies, so the team pivoted to grounding and evaluation.


Act II — Make it reliable: RAG and evaluation

Goal

Ground answers in the company’s policy corpus and recent resolutions; show sources.

Decision journal

  • Context: Need vector search + filters (region, product, plan)
  • Options: OpenSearch Serverless vs Aurora + pgvector vs Kendra
  • Decision: OpenSearch Serverless for managed vector engine and VPC access; Kendra reserved for SaaS connectors later

RAG request flow

sequenceDiagram
    participant Agent
    participant API as API Gateway
    participant L as Lambda Orchestrator
    participant EMB as Bedrock Embeddings
    participant VEC as OpenSearch Serverless
    participant S3 as S3 (Docs)
    participant FM as Bedrock FM

    Agent->>API: Query
    API->>L: Invoke(query)
    L->>EMB: Embed(query)
    L->>VEC: Vector search + filters (k=6)
    VEC-->>L: Top-k with metadata
    L->>S3: Fetch originals (optional)
    L->>FM: Prompt + citations
    FM-->>L: Answer + sources
    L-->>API: Response
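
For reference, a minimal sketch of the filtered k=6 query from the flow above, written for opensearch-py; the index name, field names, and filter terms are assumptions, and the exact filter syntax depends on the k-NN engine backing the collection.

def search_chunks(os_client, query_embedding, region, product, k=6):
    # Filtered approximate k-NN search; metadata terms narrow the corpus before scoring.
    body = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_embedding,
                    "k": k,
                    "filter": {"bool": {"must": [
                        {"term": {"region": region}},
                        {"term": {"product": product}},
                    ]}},
                }
            }
        },
        "_source": ["doc_id", "title", "chunk_text", "source_url"],
    }
    return os_client.search(index="support-docs", body=body)["hits"]["hits"]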

Ingestion pipeline (batch and streaming)

  • S3 Put events -> Lambda: chunk, clean, embed, index
  • Re-embed on content updates; invalidate by metadata prefix
  • Tag docs (owner, region, product, retention)
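
A condensed sketch of that S3-triggered ingestion Lambda, assuming Titan text embeddings and an opensearch-py client signed for the serverless collection; the endpoint, index name, and metadata values are placeholders, and chunk_with_overlap is sketched after the prep heuristics below.

import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

REGION = "eu-west-1"                                      # placeholder region
COLLECTION_HOST = "example.eu-west-1.aoss.amazonaws.com"  # placeholder endpoint

bedrock = boto3.client("bedrock-runtime", region_name=REGION)
s3 = boto3.client("s3")
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "aoss")
os_client = OpenSearch(hosts=[{"host": COLLECTION_HOST, "port": 443}],
                       http_auth=auth, use_ssl=True,
                       connection_class=RequestsHttpConnection)

def embed(text):
    # Titan text embeddings; swap the modelId if another embedder is standardized on.
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0",
                                body=json.dumps({"inputText": text}))
    return json.loads(resp["body"].read())["embedding"]

def handler(event, context):
    # One record per S3 Put event: fetch, chunk, embed, index.
    for record in event["Records"]:
        bucket, key = record["s3"]["bucket"]["name"], record["s3"]["object"]["key"]
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for i, chunk in enumerate(chunk_with_overlap(text)):
            os_client.index(index="support-docs", body={
                "doc_id": key, "chunk_no": i, "chunk_text": chunk,
                "embedding": embed(chunk),
                "region": REGION, "product": "billing",   # metadata tags for filters
            })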

Document prep heuristics

  • Ticket macros and FAQs are chunked at ~650 tokens with a 100-token overlap so context fits in a single FM window while keeping policy paragraphs intact.
  • We drop HTML boilerplate, normalize tables to markdown, and persist source SHA hashes; if content hasn’t changed, we skip expensive re-embeds.
  • Sensitive rows (HIPAA, PCI) carry a compliance_scope attribute so OpenSearch filters and prompt guards can automatically refuse cross-region access.
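
The overlap chunker and hash check referenced above reduce to a few lines; in this sketch whitespace tokens stand in for model tokens, so the 650/100 numbers are approximate by design.

import hashlib

def chunk_with_overlap(text, max_tokens=650, overlap=100):
    # Approximate token counting via whitespace split; good enough for sizing chunks.
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap   # 100-token overlap keeps policy paragraphs intact
    return chunks

def source_unchanged(text, previous_sha):
    # Skip re-embedding when the stored SHA matches the incoming content.
    return previous_sha == hashlib.sha256(text.encode("utf-8")).hexdigest()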

Evaluation harness (gold set)

  • Built 60-question gold set from ticket archive; labels: correct, partially-correct, off-topic
  • Offline metrics: groundedness, citation coverage, refusal quality; threshold to ship ≥85% correct
  • Online A/B: prompt version v5 vs v6; measured CSAT delta and cost/request
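
A minimal aggregator over a graded gold-set run, assuming each JSONL row carries the grader's label plus cited and expected source IDs; field names and file layout are illustrative.

import json

def score_gold_run(results_path):
    # Each row: {"question_id", "label", "cited_sources", "expected_sources"}.
    rows = [json.loads(line) for line in open(results_path, encoding="utf-8")]
    total = len(rows)
    correct = sum(r["label"] == "correct" for r in rows)
    cited = sum(bool(set(r["cited_sources"]) & set(r["expected_sources"])) for r in rows)
    return {
        "accuracy": correct / total,
        "citation_coverage": cited / total,
        "ship": correct / total >= 0.85,   # the ≥85% ship threshold
    }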

Evaluation automation

  • Nightly Step Functions job replays the gold set, logs token counts, and publishes trend lines to the “Support AI Eval Trends” QuickSight dashboard, allowing product teams to spot regressions before agents complain.
  • Retrieval hit rate is computed per intent, and anything below 70% automatically files a Jira ticket (automation rule GENAI-RAG-coverage) requesting additional corpus coverage.
  • Manual graders record rationales in a shared CSV stored under the /evals/rationales/ S3 prefix that becomes few-shot supervision data for future hallucination detectors.
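
The per-intent hit-rate check behind that Jira rule is also small; the record shape is an assumption, and the 70% floor comes from the bullet above.

from collections import defaultdict

def retrieval_hit_rates(runs, floor=0.70):
    # runs: iterable of {"intent": str, "hit": bool} records from the nightly replay.
    totals, hits = defaultdict(int), defaultdict(int)
    for r in runs:
        totals[r["intent"]] += 1
        hits[r["intent"]] += int(r["hit"])
    rates = {intent: hits[intent] / totals[intent] for intent in totals}
    flagged = [intent for intent, rate in rates.items() if rate < floor]  # -> Jira filing
    return rates, flagged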

Outcome (Week 2)

  • Helpfulness: 91% correct on gold set; citations present for 88%
  • Median latency: 1.9s (RAG adds ~0.3s); Cost/request: $0.004–$0.007
  • Gap: Occasional PII leakage in copy-paste tickets; need guardrails

Key metrics recap — Act II

| Metric | Value | Notes |
| --- | --- | --- |
| Gold-set accuracy | 91% correct (n=60) | Threshold to ship was ≥85%. |
| Citation coverage | 88% | Each answer cites at least one doc reference. |
| Median latency | 1.9s | RAG adds ~0.3s over prototype. |
| Cost per request | $0.004–$0.007 | Higher due to embeddings + retrieval. |

RAG and evaluation discipline unlocked trustworthy answers, setting the stage for production-grade guardrails and operations.


Act III — Production hardening: safety, observability, CI/CD

Goal

Reduce risk, make issues diagnosable, and deploy safely across accounts.

Decision journal

  • Context: PII exposure risk and jailbreak attempts
  • Options: App-level regex + classifiers vs Bedrock Guardrails
  • Decision: Bedrock Guardrails + app validators; reject or redact before generation and before response
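
At generation time, attaching the guardrail is a small change to the Act I call; the guardrail ID and version below are placeholders for values created out of band.

resp = bedrock.invoke_model(
  modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
  contentType="application/json",
  accept="application/json",
  body=json.dumps(body),
  guardrailIdentifier="gr-placeholder-id",   # hypothetical guardrail ID
  guardrailVersion="1",
  trace="ENABLED",                           # guardrail trace feeds the audit log
)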

Runtime architecture (prod)

flowchart TD
    Client-->APIGW[API Gateway]
    APIGW-->Auth[Cognito/JWT Authorizer]
    APIGW-->Orch[Lambda Orchestrator]
    Orch-->Bedrock
    Orch-->Vec[OpenSearch Serverless]
    Orch-->S3
    Orch-->Obs[CloudWatch + X-Ray]
    Orch-->Cfg["AppConfig (feature flags)"]
    subgraph Async
        Orch-->Q[SQS]
        Q-->SFN[Step Functions]
        SFN-->Ingest[Ingestion Lambdas]
    end

    classDef aws fill:#FF9900,stroke:#333,stroke-width:1px,color:#111
    class APIGW,Auth,Orch,Bedrock,Vec,S3,Obs,Cfg,Q,SFN,Ingest aws

Production upgrades

  • Safety: Guardrails policies + output validators; redaction filters; audit log of blocked content
  • Observability: Request IDs, prompt version IDs, retrieval hit rate, token usage, refusal rate, traces to X-Ray
  • Security: IAM least-privilege per path, VPC endpoints, KMS on S3/OpenSearch, signed URLs for citations
  • CI/CD: Terraform modules for API, roles, OpenSearch, buckets; GitHub Actions → plan/apply; canary by header; quick rollback by alias
  • Cost: Budgets + alarms; prompt compression; result cache for frequent intents
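
For the result cache on frequent intents, one plausible shape is a DynamoDB table with a TTL attribute; the table name, key schema, and TTL field below are assumptions, not the team's actual design.

import hashlib
import time
import boto3

cache = boto3.resource("dynamodb").Table("support-assistant-response-cache")  # hypothetical table

def cache_key(intent, normalized_query, prompt_version):
    return hashlib.sha256(f"{intent}|{normalized_query}|{prompt_version}".encode()).hexdigest()

def get_cached(key):
    item = cache.get_item(Key={"cache_key": key}).get("Item")
    return item["response"] if item else None

def put_cached(key, response, ttl_seconds=3600):
    # "expires_at" must be configured as the table's TTL attribute.
    cache.put_item(Item={"cache_key": key, "response": response,
                         "expires_at": int(time.time()) + ttl_seconds})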

Guardrail configuration snapshot

{
  "name": "support-assistant-prod",
  "sensitiveInformationPolicy": {
    "piiEntities": ["EMAIL", "SSN", "PHONE"],
    "action": "REDACT"
  },
  "wordFilters": [
    {"match": "credit card number", "action": "BLOCK"},
    {"match": "share internal roadmap", "action": "CHALLENGE"}
  ],
  "contextualGrounding": {
    "relevanceThreshold": 0.68,
    "responseAction": "REFUSE_WITH_TEMPLATE"
  },
  "deniedTopics": ["Self-harm instructions", "Malware creation"]
}

Operations scorecard

  • SLO: 99% of requests under 4s measured at API Gateway; Lambda emits custom metrics keyed by prompt version and OpenSearch collection alias.
  • Paging: on-call receives a single aggregated alarm that fires when Bedrock throttling, OpenSearch write errors, or guardrail rejections spike beyond three standard deviations.
  • The runbook mandates synthetic queries every 5 minutes per region, and any mismatch between synthetic and production latency automatically redirects traffic to the previous alias.
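
A sketch of the per-request custom metrics keyed by prompt version; the namespace and dimension names are illustrative, not the team's actual schema.

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_request_metrics(prompt_version, latency_ms, tokens_used, refused):
    dims = [{"Name": "PromptVersion", "Value": prompt_version}]
    cloudwatch.put_metric_data(
        Namespace="SupportAssistant",   # hypothetical namespace
        MetricData=[
            {"MetricName": "LatencyMs", "Value": latency_ms, "Unit": "Milliseconds", "Dimensions": dims},
            {"MetricName": "TokensUsed", "Value": tokens_used, "Unit": "Count", "Dimensions": dims},
            {"MetricName": "Refusals", "Value": float(refused), "Unit": "Count", "Dimensions": dims},
        ],
    )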

Outcome (Week 4)

  • CSAT +18%, first-response down to 1h 45m
  • p50 1.9s, p95 3.8s; cost/request -22% via shorter contexts and caching
  • Zero PII incidents in logs; jailbreak tests pass

Key metrics recap — Act III

| Metric | Value | Notes |
| --- | --- | --- |
| CSAT change | +18% | Relative to four-week baseline before rollout. |
| First-response time | 1h 45m | Down from 9h pre-project. |
| Latency | p50 1.9s / p95 3.8s | Enforced via 99% <4s SLO. |
| Cost per request | -22% vs prototype | Prompt compression + caching. |
| PII / jailbreak incidents | 0 | Guardrail and validator coverage. |

Appendix — Minimal Terraform sketch (conceptual)

resource "aws_opensearchserverless_collection" "vec" {
  name = "support-assistant"
  type = "VECTORSEARCH"
}

resource "aws_iam_role" "orchestrator" {
  name               = "support-assistant-orchestrator"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
}

resource "aws_iam_role_policy" "allow_services" {
  role = aws_iam_role.orchestrator.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      { Effect = "Allow", Action = ["bedrock:InvokeModel","bedrock:InvokeModelWithResponseStream"], Resource = "*"},
      { Effect = "Allow", Action = ["aoss:APIAccessAll"], Resource = "*"},
      { Effect = "Allow", Action = ["s3:GetObject","s3:PutObject"], Resource = ["arn:aws:s3:::support-assistant-*","arn:aws:s3:::support-assistant-*/*"]}
    ]
  })
}

Ops runbook snippet (abridged)

alerts:
  - name: bedrock-latency
    trigger: p95_latency > 3.5s for 3m
    action: scale_concurrency(lambda_orchestrator)
  - name: vector-ingest-backlog
    trigger: sqs_visible > 500 messages
    action: rerun_step_function ingest-replay
playbooks:
  guardrail-spike:
    1: Inspect CloudWatch Logs Insights query `fields guardrailReason | stats count(*) by reason`.
    2: If >30% are `contextual-grounding`, lower relevance threshold to 0.64 temporarily via AppConfig.
    3: Announce mitigation in #support-ai channel.

Ship checklist

  • [ ] Gold set ≥85% correct; online CSAT flat or up
  • [ ] Guardrails + validators enabled; jailbreak suite green
  • [ ] Traces + token/retrieval metrics; S3 transcripts with retention
  • [ ] IAM scoped; VPC endpoints; KMS on data at rest
  • [ ] Canary deploy + alias rollback tested; budgets/alerts in place

Epilogue — What we’d do next

  • Rerankers to shrink context; per-tenant vector filters; result caching
  • Bedrock Agents for constrained tool use; human review for escalations
  • Quarterly eval refresh; drift detection on intents and docs