Building Scalable GenAI Apps on AWS: From Prototype to Production

November 8, 2025

TL;DR

  • A support team ships a GenAI assistant on Amazon Bedrock in 30 days, moving from demo to production without runaway cost or risk.
  • Key choices: Bedrock for models, OpenSearch Serverless for vectors, S3 for lineage, API Gateway + Lambda for orchestration.
  • What matters: safety guardrails, measurable eval, observability, and CI/CD—not just a clever prompt.

Project timeline snapshot

| Phase | Day | Milestone |
| --- | --- | --- |
| Kickoff | 0 | Backlog crosses 1,200 tickets; AI assistant charter approved. |
| Prototype (Act I) | 3 | Bedrock-based helper drafting answers with lineage and manual QA. |
| Reliability (Act II) | 14 | RAG + eval harness hits ≥85% correct with citations. |
| Production hardening (Act III) | 28 | Guardrails, SLOs, CI/CD, and cost controls live across accounts. |

Starting prerequisites

  • AWS org-level guardrails already enforce dedicated support accounts, VPC endpoints, and CloudTrail aggregation.
  • A lightweight ticket export feed to S3 exists for analytics, reused here for eval gold sets.
  • Terraform baseline modules (network, observability, IAM foundations) were in place before Act I, so the 30-day clock focuses on the GenAI stack itself.

Introduction — The night the queue broke

On a Tuesday night, Anna (Support Lead) watched the backlog climb past 1,200 tickets. The average first response time exceeded 9 hours, and Finance flagged the rising overtime. Arjun (Tech Lead) made a call: “We ship an AI assistant in 30 days or we scale headcount by 40%.” The constraints: strict PII controls, sub-2s median latency, and no vendor lock-in that locks data in. This is the story of how they built, measured, and hardened a GenAI app on AWS—without turning the system into a black box.


With leadership aligned on the constraints and baseline guardrails already deployed, the team sprinted into the first build phase.

Act I — The three-day prototype: make it helpful

Goal

Prove the assistant can draft high-quality replies from FAQs and past resolutions.

Decision journal

  • Context: Need fast model access and optionality (Claude, Llama, Mistral, Titan)
  • Options: Direct model APIs vs Amazon Bedrock
  • Decision: Bedrock for unified API, governance, and model choice; store all I/O in S3 for evaluation

Architecture (prototype)

flowchart LR
    U[Agent Console] --> APIGW[API Gateway]
    APIGW --> L[Lambda: Orchestrator]
    L --> BR[Bedrock: FM]
    L --> S3[S3: Prompts + Transcripts]

    classDef aws fill:#FF9900,stroke:#333,stroke-width:1px,color:#111
    class APIGW,L,BR,S3 aws

Prototype operating notes

  • Split the single S3 bucket into /prompts, /responses, and /fixtures prefixes so that every request—good or bad—has a lineage for later evaluation.
  • Secrets (model API keys, Slack webhooks) reside in AWS Systems Manager Parameter Store; Lambda reads them via short-lived decrypt permissions instead of bundling environment variables.
  • A pytest harness (internal repo support-assistant/tests) replays 15 canonical tickets after each prompt edit and exports a CSV that product and support can annotate for quick alignment.
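
To make the Parameter Store pattern concrete, here is a minimal sketch of the lookup the orchestrator might perform; the parameter path and in-memory cache are illustrative assumptions, not the team's actual code.

import boto3

ssm = boto3.client("ssm")
_cache = {}

def get_secret(name: str) -> str:
    # SecureString parameters are decrypted in-flight via the Lambda role's KMS grant.
    if name not in _cache:
        _cache[name] = ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]
    return _cache[name]

# Hypothetical parameter path; the real name lives in the team's Terraform.
slack_webhook = get_secret("/support-assistant/slack-webhook-url")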

First working call (conceptual)

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Anthropic Messages API on Bedrock: the system prompt is a top-level field,
# not a message with role "system".
body = {
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 400,
  "system": "You write concise, empathetic replies.",
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Refund policy for annual plans?"}]}
  ]
}

resp = bedrock.invoke_model(
  modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
  contentType="application/json",
  accept="application/json",
  body=json.dumps(body)
)

# The response body is a stream; read it once and pull out the first text block.
answer = json.loads(resp["body"].read())["content"][0]["text"]
print(answer)

Outcome (Day 3)

  • Helpfulness (manual review, n=25): 72% acceptable drafts
  • Median latency: 1.6s; Cost/request: $0.003–$0.006 (prompt size dependent)
  • Gap: Hallucinations on policy edge cases; no citations for trust

Key metrics recap — Act I

| Metric | Value | Notes |
| --- | --- | --- |
| Manual helpfulness | 72% acceptable (n=25) | Quick human review across priority ticket types. |
| Median latency | 1.6s | Measured via CloudWatch p50 at API Gateway. |
| Cost per request | $0.003–$0.006 | Depends on prompt size; tracked via Bedrock billing. |
| Regression harness size | 15 canonical tickets | Pytest replay suite exported to CSV for annotation. |

Prototype exit criteria

  • Support leadership signed off only after seeing a side-by-side comparison in the console, a Jira comment proving transcript capture, and an alarm that fires if Bedrock median latency exceeds 2.5 seconds for five minutes.
  • Engineering committed the three core prompts and their instructions to Git, so every prompt version used in the prototype is reproducible for audits.

Three intense days later, the assistant could draft helpful replies, so the team pivoted to grounding and evaluation.


Act II — Make it reliable: RAG and evaluation

Goal

Ground answers in the company’s policy corpus and recent resolutions; show sources.

Decision journal

  • Context: Need vector search + filters (region, product, plan)
  • Options: OpenSearch Serverless vs Aurora + pgvector vs Kendra
  • Decision: OpenSearch Serverless for managed vector engine and VPC access; Kendra reserved for SaaS connectors later

RAG request flow

sequenceDiagram
    participant Agent
    participant API as API Gateway
    participant L as Lambda Orchestrator
    participant EMB as Bedrock Embeddings
    participant VEC as OpenSearch Serverless
    participant S3 as S3 (Docs)
    participant FM as Bedrock FM

    Agent->>API: Query
    API->>L: Invoke(query)
    L->>EMB: Embed(query)
    L->>VEC: Vector search + filters (k=6)
    VEC-->>L: Top-k with metadata
    L->>S3: Fetch originals (optional)
    L->>FM: Prompt + citations
    FM-->>L: Answer + sources
    L-->>API: Response
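
For reference, a minimal sketch of the filtered k=6 query from the flow above, written for opensearch-py; the index name, field names, and filter terms are assumptions, and the exact filter syntax depends on the k-NN engine backing the collection.

def search_chunks(os_client, query_embedding, region, product, k=6):
    # Filtered approximate k-NN search; metadata terms narrow the corpus before scoring.
    body = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_embedding,
                    "k": k,
                    "filter": {"bool": {"must": [
                        {"term": {"region": region}},
                        {"term": {"product": product}},
                    ]}},
                }
            }
        },
        "_source": ["doc_id", "title", "chunk_text", "source_url"],
    }
    return os_client.search(index="support-docs", body=body)["hits"]["hits"]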

Ingestion pipeline (batch and streaming)

  • S3 Put events -> Lambda: chunk, clean, embed, index
  • Re-embed on content updates; invalidate by metadata prefix
  • Tag docs (owner, region, product, retention)
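
A condensed sketch of that S3-triggered ingestion Lambda, assuming Titan text embeddings and an opensearch-py client signed for the serverless collection; the endpoint, index name, and metadata values are placeholders, and chunk_with_overlap is sketched after the prep heuristics below.

import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

REGION = "eu-west-1"                                      # placeholder region
COLLECTION_HOST = "example.eu-west-1.aoss.amazonaws.com"  # placeholder endpoint

bedrock = boto3.client("bedrock-runtime", region_name=REGION)
s3 = boto3.client("s3")
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "aoss")
os_client = OpenSearch(hosts=[{"host": COLLECTION_HOST, "port": 443}],
                       http_auth=auth, use_ssl=True,
                       connection_class=RequestsHttpConnection)

def embed(text):
    # Titan text embeddings; swap the modelId if another embedder is standardized on.
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v2:0",
                                body=json.dumps({"inputText": text}))
    return json.loads(resp["body"].read())["embedding"]

def handler(event, context):
    # One record per S3 Put event: fetch, chunk, embed, index.
    for record in event["Records"]:
        bucket, key = record["s3"]["bucket"]["name"], record["s3"]["object"]["key"]
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for i, chunk in enumerate(chunk_with_overlap(text)):
            os_client.index(index="support-docs", body={
                "doc_id": key, "chunk_no": i, "chunk_text": chunk,
                "embedding": embed(chunk),
                "region": REGION, "product": "billing",   # metadata tags for filters
            })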

Document prep heuristics

  • Ticket macros and FAQs are chunked at ~650 tokens with a 100-token overlap so context fits in a single FM window while keeping policy paragraphs intact.
  • We drop HTML boilerplate, normalize tables to markdown, and persist source SHA hashes; if content hasn’t changed, we skip expensive re-embeds.
  • Sensitive rows (HIPAA, PCI) carry a compliance_scope attribute so OpenSearch filters and prompt guards can automatically refuse cross-region access.
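
The overlap chunker and hash check referenced above reduce to a few lines; in this sketch whitespace tokens stand in for model tokens, so the 650/100 numbers are approximate by design.

import hashlib

def chunk_with_overlap(text, max_tokens=650, overlap=100):
    # Approximate token counting via whitespace split; good enough for sizing chunks.
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(" ".join(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap   # 100-token overlap keeps policy paragraphs intact
    return chunks

def source_unchanged(text, previous_sha):
    # Skip re-embedding when the stored SHA matches the incoming content.
    return previous_sha == hashlib.sha256(text.encode("utf-8")).hexdigest()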

Evaluation harness (gold set)

  • Built 60-question gold set from ticket archive; labels: correct, partially-correct, off-topic
  • Offline metrics: groundedness, citation coverage, refusal quality; threshold to ship ≥85% correct
  • Online A/B: prompt version v5 vs v6; measured CSAT delta and cost/request
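
A minimal aggregator over a graded gold-set run, assuming each JSONL row carries the grader's label plus cited and expected source IDs; field names and file layout are illustrative.

import json

def score_gold_run(results_path):
    # Each row: {"question_id", "label", "cited_sources", "expected_sources"}.
    rows = [json.loads(line) for line in open(results_path, encoding="utf-8")]
    total = len(rows)
    correct = sum(r["label"] == "correct" for r in rows)
    cited = sum(bool(set(r["cited_sources"]) & set(r["expected_sources"])) for r in rows)
    return {
        "accuracy": correct / total,
        "citation_coverage": cited / total,
        "ship": correct / total >= 0.85,   # the ≥85% ship threshold
    }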

Evaluation automation

  • Nightly Step Functions job replays the gold set, logs token counts, and publishes trend lines to the “Support AI Eval Trends” QuickSight dashboard, allowing product teams to spot regressions before agents complain.
  • Retrieval hit rate is computed per intent, and anything below 70% automatically files a Jira ticket (automation rule GENAI-RAG-coverage) requesting additional corpus coverage.
  • Manual graders record rationales in a shared CSV stored under the /evals/rationales/ S3 prefix that becomes few-shot supervision data for future hallucination detectors.
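
The per-intent hit-rate check behind that Jira rule is also small; the record shape is an assumption, and the 70% floor comes from the bullet above.

from collections import defaultdict

def retrieval_hit_rates(runs, floor=0.70):
    # runs: iterable of {"intent": str, "hit": bool} records from the nightly replay.
    totals, hits = defaultdict(int), defaultdict(int)
    for r in runs:
        totals[r["intent"]] += 1
        hits[r["intent"]] += int(r["hit"])
    rates = {intent: hits[intent] / totals[intent] for intent in totals}
    flagged = [intent for intent, rate in rates.items() if rate < floor]  # -> Jira filing
    return rates, flagged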

Outcome (Week 2)

  • Helpfulness: 91% correct on gold set; citations present for 88%
  • Median latency: 1.9s (RAG adds ~0.3s); Cost/request: $0.004–$0.007
  • Gap: Occasional PII leakage in copy-paste tickets; need guardrails

Key metrics recap — Act II

| Metric | Value | Notes |
| --- | --- | --- |
| Gold-set accuracy | 91% correct (n=60) | Threshold to ship was ≥85%. |
| Citation coverage | 88% | Each answer cites at least one doc reference. |
| Median latency | 1.9s | RAG adds ~0.3s over prototype. |
| Cost per request | $0.004–$0.007 | Higher due to embeddings + retrieval. |

RAG and evaluation discipline unlocked trustworthy answers, setting the stage for production-grade guardrails and operations.


Act III — Production hardening: safety, observability, CI/CD

Goal

Reduce risk, make issues diagnosable, and deploy safely across accounts.

Decision journal

  • Context: PII exposure risk and jailbreak attempts
  • Options: App-level regex + classifiers vs Bedrock Guardrails
  • Decision: Bedrock Guardrails + app validators; reject or redact before generation and before response
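
At generation time, attaching the guardrail is a small change to the Act I call; the guardrail ID and version below are placeholders for values created out of band.

resp = bedrock.invoke_model(
  modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
  contentType="application/json",
  accept="application/json",
  body=json.dumps(body),
  guardrailIdentifier="gr-placeholder-id",   # hypothetical guardrail ID
  guardrailVersion="1",
  trace="ENABLED",                           # guardrail trace feeds the audit log
)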

Runtime architecture (prod)

flowchart TD
    Client-->APIGW[API Gateway]
    APIGW-->Auth[Cognito/JWT Authorizer]
    APIGW-->Orch[Lambda Orchestrator]
    Orch-->Bedrock
    Orch-->Vec[OpenSearch Serverless]
    Orch-->S3
    Orch-->Obs[CloudWatch + X-Ray]
    Orch-->Cfg["AppConfig (feature flags)"]
    subgraph Async
        Orch-->Q[SQS]
        Q-->SFN[Step Functions]
        SFN-->Ingest[Ingestion Lambdas]
    end

    classDef aws fill:#FF9900,stroke:#333,stroke-width:1px,color:#111
    class APIGW,Auth,Orch,Bedrock,Vec,S3,Obs,Cfg,Q,SFN,Ingest aws

Production upgrades

  • Safety: Guardrails policies + output validators; redaction filters; audit log of blocked content
  • Observability: Request IDs, prompt version IDs, retrieval hit rate, token usage, refusal rate, traces to X-Ray
  • Security: IAM least-privilege per path, VPC endpoints, KMS on S3/OpenSearch, signed URLs for citations
  • CI/CD: Terraform modules for API, roles, OpenSearch, buckets; GitHub Actions → plan/apply; canary by header; quick rollback by alias
  • Cost: Budgets + alarms; prompt compression; result cache for frequent intents
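
For the result cache on frequent intents, one plausible shape is a DynamoDB table with a TTL attribute; the table name, key schema, and TTL field below are assumptions, not the team's actual design.

import hashlib
import time
import boto3

cache = boto3.resource("dynamodb").Table("support-assistant-response-cache")  # hypothetical table

def cache_key(intent, normalized_query, prompt_version):
    return hashlib.sha256(f"{intent}|{normalized_query}|{prompt_version}".encode()).hexdigest()

def get_cached(key):
    item = cache.get_item(Key={"cache_key": key}).get("Item")
    return item["response"] if item else None

def put_cached(key, response, ttl_seconds=3600):
    # "expires_at" must be configured as the table's TTL attribute.
    cache.put_item(Item={"cache_key": key, "response": response,
                         "expires_at": int(time.time()) + ttl_seconds})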

Guardrail configuration snapshot

{
  "name": "support-assistant-prod",
  "sensitiveInformationPolicy": {
    "piiEntities": ["EMAIL", "SSN", "PHONE"],
    "action": "REDACT"
  },
  "wordFilters": [
    {"match": "credit card number", "action": "BLOCK"},
    {"match": "share internal roadmap", "action": "CHALLENGE"}
  ],
  "contextualGrounding": {
    "relevanceThreshold": 0.68,
    "responseAction": "REFUSE_WITH_TEMPLATE"
  },
  "deniedTopics": ["Self-harm instructions", "Malware creation"]
}

Operations scorecard

  • SLO: 99% of requests under 4s measured at API Gateway; Lambda emits custom metrics keyed by prompt version and OpenSearch collection alias.
  • Paging: on-call receives a single aggregated alarm that fires when Bedrock throttling, OpenSearch write errors, or guardrail rejections spike beyond three standard deviations.
  • The runbook mandates synthetic queries every 5 minutes per region, and any mismatch between synthetic and production latency automatically redirects traffic to the previous alias.
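
A sketch of the per-request custom metrics keyed by prompt version; the namespace and dimension names are illustrative, not the team's actual schema.

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_request_metrics(prompt_version, latency_ms, tokens_used, refused):
    dims = [{"Name": "PromptVersion", "Value": prompt_version}]
    cloudwatch.put_metric_data(
        Namespace="SupportAssistant",   # hypothetical namespace
        MetricData=[
            {"MetricName": "LatencyMs", "Value": latency_ms, "Unit": "Milliseconds", "Dimensions": dims},
            {"MetricName": "TokensUsed", "Value": tokens_used, "Unit": "Count", "Dimensions": dims},
            {"MetricName": "Refusals", "Value": float(refused), "Unit": "Count", "Dimensions": dims},
        ],
    )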

Outcome (Week 4)

  • CSAT +18%, first-response down to 1h 45m
  • p50 1.9s, p95 3.8s; cost/request -22% via shorter contexts and caching
  • Zero PII incidents in logs; jailbreak tests pass

Key metrics recap — Act III

| Metric | Value | Notes |
| --- | --- | --- |
| CSAT change | +18% | Relative to four-week baseline before rollout. |
| First-response time | 1h 45m | Down from 9h pre-project. |
| Latency | p50 1.9s / p95 3.8s | Enforced via 99% <4s SLO. |
| Cost per request | -22% vs prototype | Prompt compression + caching. |
| PII / jailbreak incidents | 0 | Guardrail and validator coverage. |

Appendix — Minimal Terraform sketch (conceptual)

resource "aws_opensearchserverless_collection" "vec" {
  name = "support-assistant"
  type = "VECTORSEARCH"
}

resource "aws_iam_role" "orchestrator" {
  name               = "support-assistant-orchestrator"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
}

resource "aws_iam_role_policy" "allow_services" {
  role = aws_iam_role.orchestrator.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      { Effect = "Allow", Action = ["bedrock:InvokeModel","bedrock:InvokeModelWithResponseStream"], Resource = "*"},
      { Effect = "Allow", Action = ["aoss:APIAccessAll"], Resource = "*"},
      { Effect = "Allow", Action = ["s3:GetObject","s3:PutObject"], Resource = ["arn:aws:s3:::support-assistant-*","arn:aws:s3:::support-assistant-*/*"]}
    ]
  })
}

Ops runbook snippet (abridged)

alerts:
  - name: bedrock-latency
    trigger: p95_latency > 3.5s for 3m
    action: scale_concurrency(lambda_orchestrator)
  - name: vector-ingest-backlog
    trigger: sqs_visible > 500 messages
    action: rerun_step_function ingest-replay
playbooks:
  guardrail-spike:
    1: Inspect CloudWatch Logs Insights query `fields guardrailReason | stats count(*) by reason`.
    2: If >30% are `contextual-grounding`, lower relevance threshold to 0.64 temporarily via AppConfig.
    3: Announce mitigation in #support-ai channel.

Ship checklist

  • [ ] Gold set ≥85% correct; online CSAT flat or up
  • [ ] Guardrails + validators enabled; jailbreak suite green
  • [ ] Traces + token/retrieval metrics; S3 transcripts with retention
  • [ ] IAM scoped; VPC endpoints; KMS on data at rest
  • [ ] Canary deploy + alias rollback tested; budgets/alerts in place

Epilogue — What we’d do next

  • Rerankers to shrink context; per-tenant vector filters; result caching
  • Bedrock Agents for constrained tool use; human review for escalations
  • Quarterly eval refresh; drift detection on intents and docs