TL;DR
- A support team ships a GenAI assistant in 30 days on Amazon Bedrock, moving from demo to production without runaway cost or risk.
- Key choices: Bedrock for models, OpenSearch Serverless for vectors, S3 for lineage, API Gateway + Lambda for orchestration.
- What matters: safety guardrails, measurable eval, observability, and CI/CD—not just a clever prompt.
Project timeline snapshot
| Phase | Day | Milestone |
|---|---|---|
| Kickoff | 0 | Backlog crosses 1,200 tickets; AI assistant charter approved. |
| Prototype (Act I) | 3 | Bedrock-based helper drafting answers with lineage and manual QA. |
| Reliability (Act II) | 14 | RAG + eval harness hits ≥85% correct with citations. |
| Production hardening (Act III) | 28 | Guardrails, SLOs, CI/CD, and cost controls live across accounts. |
Starting prerequisites
- AWS org-level guardrails already enforce dedicated support accounts, VPC endpoints, and CloudTrail aggregation.
- A lightweight ticket export feed to S3 exists for analytics, reused here for eval gold sets.
- Terraform baseline modules (network, observability, IAM foundations) were in place before Act I, so the 30-day clock focuses on the GenAI stack itself.
Introduction — The night the queue broke
On a Tuesday night, Anna (Support Lead) watched the backlog climb past 1,200 tickets. The average first response time exceeded 9 hours, and Finance flagged the rising overtime. Arjun (Tech Lead) made a call: “We ship an AI assistant in 30 days or we scale headcount by 40%.” The constraints: strict PII controls, sub-2s median latency, and no vendor lock-in that locks data in. This is the story of how they built, measured, and hardened a GenAI app on AWS—without turning the system into a black box.
With leadership aligned on the constraints and baseline guardrails already deployed, the team sprinted into the first build phase.
Act I — The three-day prototype: make it helpful
Goal
Prove the assistant can draft high-quality replies from FAQs and past resolutions.
Decision journal
- Context: Need fast model access and optionality (Claude, Llama, Mistral, Titan)
- Options: Direct model APIs vs Amazon Bedrock
- Decision: Bedrock for unified API, governance, and model choice; store all I/O in S3 for evaluation
Architecture (prototype)
flowchart LR
U[Agent Console] --> APIGW[API Gateway]
APIGW --> L[Lambda: Orchestrator]
L --> BR[Bedrock: FM]
L --> S3[S3: Prompts + Transcripts]
classDef aws fill:#FF9900,stroke:#333,stroke-width:1px,color:#111
class APIGW,L,BR,S3 aws
Prototype operating notes
- Split the single S3 bucket into `/prompts`, `/responses`, and `/fixtures` prefixes so that every request, good or bad, has lineage for later evaluation.
- Secrets (model API keys, Slack webhooks) reside in AWS Systems Manager Parameter Store; Lambda reads them via short-lived decrypt permissions instead of bundling them as environment variables.
- A pytest harness (internal repo `support-assistant/tests`) replays 15 canonical tickets after each prompt edit and exports a CSV that product and support can annotate for quick alignment (a sketch of the harness follows this list).
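A minimal sketch of what that replay harness can look like; the fixture layout, the `ask()` stub, and the `must_mention` field are illustrative assumptions, not the actual contents of `support-assistant/tests`.
import csv
import json
import pathlib

import pytest

FIXTURES = pathlib.Path("tests/fixtures")        # hypothetical home of the 15 canonical tickets
RESULTS = pathlib.Path("tests/out/replay.csv")   # CSV that product and support annotate

def ask(question: str) -> str:
    """Call the orchestrator endpoint; stubbed in this sketch."""
    raise NotImplementedError

@pytest.mark.parametrize("fixture_path", sorted(FIXTURES.glob("*.json")))
def test_canonical_ticket(fixture_path):
    case = json.loads(fixture_path.read_text())
    draft = ask(case["question"])
    # Record every draft so reviewers can grade it after the run.
    RESULTS.parent.mkdir(parents=True, exist_ok=True)
    with RESULTS.open("a", newline="") as f:
        csv.writer(f).writerow([case["id"], case["question"], draft])
    # Cheap structural checks; substantive grading stays with humans.
    assert draft.strip(), "empty draft"
    for phrase in case.get("must_mention", []):
        assert phrase.lower() in draft.lower()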
First working call (conceptual)
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# The Anthropic Messages API on Bedrock takes the system prompt as a top-level
# "system" field; the messages list carries only user/assistant turns.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 400,
    "system": "You write concise, empathetic replies.",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Refund policy for annual plans?"}]}
    ]
}
resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    contentType="application/json",
    accept="application/json",
    body=json.dumps(body)
)
answer = json.loads(resp["body"].read())["content"][0]["text"]
Outcome (Day 3)
- Helpfulness (manual review, n=25): 72% acceptable drafts
- Median latency: 1.6s; Cost/request: $0.003–$0.006 (prompt size dependent)
- Gap: Hallucinations on policy edge cases; no citations for trust
Key metrics recap — Act I
| Metric | Value | Notes |
|---|---|---|
| Manual helpfulness | 72% acceptable (n=25) | Quick human review across priority ticket types. |
| Median latency | 1.6s | Measured via CloudWatch p50 at API Gateway. |
| Cost per request | $0.003–$0.006 | Depends on prompt size; tracked via Bedrock billing. |
| Regression harness size | 15 canonical tickets | Pytest replay suite exported to CSV for annotation. |
Prototype exit criteria
- Support leadership signed off only after seeing a side-by-side comparison in the console, a Jira comment proving transcript capture, and an alarm that fires if Bedrock median latency exceeds 2.5 seconds for five minutes (an alarm sketch follows this list).
- Engineering archived the three core prompts, along with instructions, into Git, ensuring that every version used in the prototype is reproducible for audits.
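The latency alarm from the sign-off criteria can be expressed as a single CloudWatch call; the namespace, metric name, and SNS topic below are assumptions, since the exact metrics the orchestrator emits are internal.
import boto3

cloudwatch = boto3.client("cloudwatch")

# Fires when p50 Bedrock latency stays above 2.5 s (2,500 ms) for a five-minute
# period; the custom metric is assumed to be emitted in milliseconds.
cloudwatch.put_metric_alarm(
    AlarmName="support-assistant-bedrock-p50-latency",
    Namespace="SupportAssistant",            # assumed custom namespace
    MetricName="BedrockLatencyMs",           # assumed metric name
    ExtendedStatistic="p50",
    Period=300,
    EvaluationPeriods=1,
    Threshold=2500,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:support-ai-alerts"],  # placeholder topic
)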
Three intense days later, the assistant could draft helpful replies, so the team pivoted to grounding and evaluation.
Act II — Make it reliable: RAG and evaluation
Goal
Ground answers in the company’s policy corpus and recent resolutions; show sources.
Decision journal
- Context: Need vector search + filters (region, product, plan)
- Options: OpenSearch Serverless vs Aurora + pgvector vs Kendra
- Decision: OpenSearch Serverless for managed vector engine and VPC access; Kendra reserved for SaaS connectors later
RAG request flow
sequenceDiagram
participant Agent
participant API as API Gateway
participant L as Lambda Orchestrator
participant EMB as Bedrock Embeddings
participant VEC as OpenSearch Serverless
participant S3 as S3 (Docs)
participant FM as Bedrock FM
Agent->>API: Query
API->>L: Invoke(query)
L->>EMB: Embed(query)
L->>VEC: Vector search + filters (k=6)
VEC-->>L: Top-k with metadata
L->>S3: Fetch originals (optional)
L->>FM: Prompt + citations
FM-->>L: Answer + sources
L-->>API: Response
Ingestion pipeline (batch and streaming)
- S3 Put events -> Lambda: chunk, clean, embed, index (a handler sketch follows this list)
- Re-embed on content updates; invalidate by metadata prefix
- Tag docs (owner, region, product, retention)
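A minimal sketch of that Put-event handler, assuming Titan text embeddings and an OpenSearch Serverless collection reached over its `aoss` endpoint; the index name, host, and naive chunker are placeholders (the token-aware chunking heuristic is covered below).
import json

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

REGION = "us-east-1"                                  # assumed region
INDEX = "support-docs"                                # assumed vector index
HOST = "example.us-east-1.aoss.amazonaws.com"         # placeholder collection endpoint

bedrock = boto3.client("bedrock-runtime", region_name=REGION)
s3 = boto3.client("s3")
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "aoss")
vec = OpenSearch(hosts=[{"host": HOST, "port": 443}], http_auth=auth,
                 use_ssl=True, connection_class=RequestsHttpConnection)

def embed(text: str) -> list[float]:
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",       # assumed embeddings model
        contentType="application/json",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def chunk_text(text: str, size: int = 2500) -> list[str]:
    # Naive character split; the token-aware heuristic appears below.
    return [text[i:i + size] for i in range(0, len(text), size)]

def handler(event, context):
    for rec in event["Records"]:                       # S3 Put notification records
        bucket, key = rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"]
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for piece in chunk_text(text):
            vec.index(index=INDEX, body={
                "content": piece,
                "embedding": embed(piece),
                "source_key": key,                     # metadata used for invalidation by prefix
            })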
Document prep heuristics
- Ticket macros and FAQs are chunked at ~650 tokens with a 100-token overlap so context fits in a single FM window while keeping policy paragraphs intact.
- We drop HTML boilerplate, normalize tables to markdown, and persist source SHA hashes; if content hasn’t changed, we skip expensive re-embeds (see the sketch after this list).
- Sensitive rows (HIPAA, PCI) carry a `compliance_scope` attribute so OpenSearch filters and prompt guards can automatically refuse cross-region access.
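A sketch of the prep heuristics above; whitespace splitting stands in for the real tokenizer, and the manifest file tracking SHA hashes is a hypothetical local stand-in for whatever store the pipeline actually uses.
import hashlib
import json
import pathlib

CHUNK_TOKENS, OVERLAP = 650, 100
MANIFEST = pathlib.Path("embed_manifest.json")        # hypothetical source-key -> SHA store

def chunk(text: str) -> list[str]:
    words = text.split()                              # crude stand-in for real tokenization
    step = CHUNK_TOKENS - OVERLAP
    return [" ".join(words[i:i + CHUNK_TOKENS]) for i in range(0, len(words), step)]

def needs_reembed(key: str, text: str) -> bool:
    sha = hashlib.sha256(text.encode("utf-8")).hexdigest()
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    if seen.get(key) == sha:
        return False                                  # unchanged content: skip the re-embed
    seen[key] = sha
    MANIFEST.write_text(json.dumps(seen))
    return True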
Evaluation harness (gold set)
- Built 60-question gold set from ticket archive; labels: correct, partially-correct, off-topic
- Offline metrics: groundedness, citation coverage, refusal quality; threshold to ship ≥85% correct (scoring sketch after this list)
- Online A/B: prompt version v5 vs v6; measured CSAT delta and cost/request
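A sketch of the offline scorer over the gold set; the JSONL layout and field names are assumptions that mirror the labels above.
import json
import pathlib

def score_gold_set(results_path: str = "evals/gold_results.jsonl") -> dict:
    rows = [json.loads(line) for line in pathlib.Path(results_path).read_text().splitlines() if line]
    correct = sum(r["label"] == "correct" for r in rows)
    cited = sum(bool(r.get("citations")) for r in rows)
    report = {
        "n": len(rows),
        "correct_rate": correct / len(rows),
        "citation_coverage": cited / len(rows),
    }
    report["ship"] = report["correct_rate"] >= 0.85   # ship threshold from the bullet above
    return report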
Evaluation automation
- Nightly Step Functions job replays the gold set, logs token counts, and publishes trend lines to the “Support AI Eval Trends” QuickSight dashboard, allowing product teams to spot regressions before agents complain.
- Retrieval hit rate is computed per intent, and anything below 70% automatically files a Jira ticket (automation rule `GENAI-RAG-coverage`) requesting additional corpus coverage (per-intent computation sketched after this list).
- Manual graders record rationales in a shared CSV stored under the `/evals/rationales/` S3 prefix that becomes few-shot supervision data for future hallucination detectors.
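A sketch of the per-intent coverage check; the record shape and the Jira stub are assumptions, only the 70% floor and the `GENAI-RAG-coverage` rule name come from the bullet above.
from collections import defaultdict

def flag_low_coverage(records: list[dict], floor: float = 0.70) -> list[str]:
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:                          # each record: {"intent": str, "retrieval_hit": bool}
        totals[r["intent"]] += 1
        hits[r["intent"]] += int(r["retrieval_hit"])
    flagged = sorted(i for i in totals if hits[i] / totals[i] < floor)
    for intent in flagged:
        file_coverage_ticket(intent)           # stands in for the GENAI-RAG-coverage automation
    return flagged

def file_coverage_ticket(intent: str) -> None:
    print(f"would file a corpus-coverage ticket for intent: {intent}")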
Outcome (Week 2)
- Helpfulness: 91% correct on gold set; citations present for 88%
- Median latency: 1.9s (RAG adds ~0.3s); Cost/request: $0.004–$0.007
- Gap: Occasional PII leakage in copy-paste tickets; need guardrails
Key metrics recap — Act II
| Metric | Value | Notes |
|---|---|---|
| Gold-set accuracy | 91% correct (n=60) | Threshold to ship was ≥85%. |
| Citation coverage | 88% | Each answer cites at least one doc reference. |
| Median latency | 1.9s | RAG adds ~0.3s over prototype. |
| Cost per request | $0.004–$0.007 | Higher due to embeddings + retrieval. |
RAG and evaluation discipline unlocked trustworthy answers, setting the stage for production-grade guardrails and operations.
Act III — Production hardening: safety, observability, CI/CD
Goal
Reduce risk, make issues diagnosable, and deploy safely across accounts.
Decision journal
- Context: PII exposure risk and jailbreak attempts
- Options: App-level regex + classifiers vs Bedrock Guardrails
- Decision: Bedrock Guardrails + app validators; reject or redact before generation and before response
Runtime architecture (prod)
flowchart TD
Client-->APIGW[API Gateway]
APIGW-->Auth[Cognito/JWT Authorizer]
APIGW-->Orch[Lambda Orchestrator]
Orch-->Bedrock
Orch-->Vec[OpenSearch Serverless]
Orch-->S3
Orch-->Obs[CloudWatch + X-Ray]
Orch-->Cfg["AppConfig (feature flags)"]
subgraph Async
Orch-->Q[SQS]
Q-->SFN[Step Functions]
SFN-->Ingest[Ingestion Lambdas]
end
classDef aws fill:#FF9900,stroke:#333,stroke-width:1px,color:#111
class APIGW,Auth,Orch,Bedrock,Vec,S3,Obs,Cfg,Q,SFN,Ingest aws
Production upgrades
- Safety: Guardrails policies + output validators; redaction filters; audit log of blocked content
- Observability: Request IDs, prompt version IDs, retrieval hit rate, token usage, refusal rate, traces to X-Ray
- Security: IAM least-privilege per path, VPC endpoints, KMS on S3/OpenSearch, signed URLs for citations
- CI/CD: Terraform modules for API, roles, OpenSearch, buckets; GitHub Actions → plan/apply; canary by header; quick rollback by alias
- Cost: Budgets + alarms; prompt compression; result cache for frequent intents (cache sketch after this list)
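A sketch of the result cache for frequent intents; the TTL, the normalization rule, and the in-process dict are assumptions (the production cache might sit in DynamoDB or ElastiCache instead).
import hashlib
import time
from typing import Callable

_CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 15 * 60                          # assumed freshness window

def _key(query: str) -> str:
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

def cached_answer(query: str, generate: Callable[[str], str]) -> str:
    k = _key(query)
    hit = _CACHE.get(k)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                          # cache hit: no retrieval or FM call
    answer = generate(query)                   # falls through to the RAG + FM pipeline
    _CACHE[k] = (time.time(), answer)
    return answer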
Guardrail configuration snapshot
{
"name": "support-assistant-prod",
"sensitiveInformationPolicy": {
"piiEntities": ["EMAIL", "SSN", "PHONE"],
"action": "REDACT"
},
"wordFilters": [
{"match": "credit card number", "action": "BLOCK"},
{"match": "share internal roadmap", "action": "CHALLENGE"}
],
"contextualGrounding": {
"relevanceThreshold": 0.68,
"responseAction": "REFUSE_WITH_TEMPLATE"
},
"deniedTopics": ["Self-harm instructions", "Malware creation"]
}
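Defining the guardrail is only half the job; it also has to be referenced on every model call. A sketch of that wiring, with placeholder identifier and version values:
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def guarded_invoke(body: dict) -> dict:
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        contentType="application/json",
        accept="application/json",
        body=json.dumps(body),
        guardrailIdentifier="gr-support-assistant",   # placeholder guardrail ID
        guardrailVersion="1",
        trace="ENABLED",                              # records which policy fired, feeding the audit log
    )
    return json.loads(resp["body"].read())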
Operations scorecard
- SLO: 99% of requests under 4s measured at API Gateway; Lambda emits custom metrics keyed by prompt version and OpenSearch collection alias (emission sketched after this list).
- Paging: on-call receives a single aggregated alarm that fans out if Bedrock throttling, OpenSearch write errors, or guardrail rejection spikes exceed three standard deviations.
- The runbook mandates synthetic queries every 5 minutes per region, and any mismatch between synthetic and production latency automatically redirects traffic to the previous alias.
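A sketch of the per-request metric emission behind the scorecard; the namespace and dimension names are assumptions chosen to match the prompt-version and collection-alias keys above.
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_request_metrics(latency_ms: float, prompt_version: str, collection_alias: str) -> None:
    cloudwatch.put_metric_data(
        Namespace="SupportAssistant",                 # assumed custom namespace
        MetricData=[{
            "MetricName": "RequestLatencyMs",
            "Value": latency_ms,
            "Unit": "Milliseconds",
            "Dimensions": [
                {"Name": "PromptVersion", "Value": prompt_version},
                {"Name": "CollectionAlias", "Value": collection_alias},
            ],
        }],
    )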
Outcome (Week 4)
- CSAT +18%, first-response down to 1h 45m
- p50 1.9s, p95 3.8s; cost/request -22% via shorter contexts and caching
- Zero PII incidents in logs; jailbreak tests pass
Key metrics recap — Act III
| Metric | Value | Notes |
|---|---|---|
| CSAT change | +18% | Relative to four-week baseline before rollout. |
| First-response time | 1h 45m | Down from 9h pre-project. |
| Latency | p50 1.9s / p95 3.8s | Enforced via 99% <4s SLO. |
| Cost per request | -22% vs prototype | Prompt compression + caching. |
| PII / jailbreak incidents | 0 | Guardrail and validator coverage. |
Appendix — Minimal Terraform sketch (conceptual)
resource "aws_opensearchserverless_collection" "vec" {
name = "support-assistant"
type = "VECTORSEARCH"
}
resource "aws_iam_role" "orchestrator" {
name = "support-assistant-orchestrator"
assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
}
resource "aws_iam_role_policy" "allow_services" {
role = aws_iam_role.orchestrator.id
policy = jsonencode({
Version = "2012-10-17",
Statement = [
{ Effect = "Allow", Action = ["bedrock:InvokeModel","bedrock:InvokeModelWithResponseStream"], Resource = "*"},
{ Effect = "Allow", Action = ["aoss:APIAccessAll"], Resource = "*"},
{ Effect = "Allow", Action = ["s3:GetObject","s3:PutObject"], Resource = ["arn:aws:s3:::support-assistant-*","arn:aws:s3:::support-assistant-*/*"]}
]
})
}
Ops runbook snippet (abridged)
alerts:
- name: bedrock-latency
trigger: p95_latency > 3.5s for 3m
action: scale_concurrency(lambda_orchestrator)
- name: vector-ingest-backlog
trigger: sqs_visible > 500 messages
action: rerun_step_function ingest-replay
playbooks:
guardrail-spike:
1: Inspect CloudWatch Logs Insights query `fields guardrailReason | stats count(*) by reason`.
2: If >30% are `contextual-grounding`, lower relevance threshold to 0.64 temporarily via AppConfig.
3: Announce mitigation in #support-ai channel.
Ship checklist
- [ ] Gold set ≥85% correct; online CSAT flat or up
- [ ] Guardrails + validators enabled; jailbreak suite green
- [ ] Traces + token/retrieval metrics; S3 transcripts with retention
- [ ] IAM scoped; VPC endpoints; KMS on data at rest
- [ ] Canary deploy + alias rollback tested; budgets/alerts in place
Further reading
- Amazon Bedrock product overview
- Amazon Bedrock Guardrails guide
- Amazon OpenSearch Serverless vector search
- AWS AppConfig feature flagging
Epilogue — What we’d do next
- Rerankers to shrink context; per-tenant vector filters; result caching
- Bedrock Agents for constrained tool use; human review for escalations
- Quarterly eval refresh; drift detection on intents and docs
