Terraform monorepo with hexagonal layout (environments/<env>-<stack>/ + modules/). Each AWS account hosts one S3 state bucket; each stack writes its own .tfstate key under that bucket; state locking comes from use_lockfile = true, which makes the S3 backend write a lock object next to the state rather than relying on a DynamoDB table. Modules cover api, service, background-service, rds, vpc, kms, sqs, lambda, common (GitHub OIDC), bastion-host, hosted-zone.
Role in the system: Provisions everything af-backend-go-api, af-map, af-targeting Lambdas, and (indirectly) af-frontend run on; account-isolated across AF-Shared / AF-Dev / AF-Staging / AF-Prod
Surfaces:
- common-shared (GitHub OIDC, Route 53 zones)
- common-analytics (Lambda targeting code bucket)
- common-backend (legacy)
- dev-shared / dev-backend / dev-analytics
- stg-shared / stg-backend / stg-analytics
- prod-* (planned)
User workflows
| Workflow | Outcome |
|---|---|
| Plan → Apply (manual gates) | Stack updated, state in S3, lock released |
| Bootstrap a new env | New env online |
| Lint gate | Drift-free docs |
| State lock recovery | Apply can resume |
| Module init for IDE | Local IntelliSense |
API endpoints
| Kind | Name | Purpose |
|---|---|---|
| INPUT | dev-backend.app_name | App name for naming + tags |
| INPUT | dev-backend.api_ecs_task_config | ECS min/max instances |
| INPUT | dev-backend.db_scaling_config | Aurora min/max ACU |
| INPUT | dev-backend.frontend_url | CORS whitelist for S3 file uploads |
| OUTPUT | dev-backend.api_dns_name | CloudFront FQDN for af-backend-go-api |
| OUTPUT | dev-backend.bastion_host_dns_name | EC2 bastion DNS for RDS tunnel |
| OUTPUT | dev-backend.bastion_host_private_key_openssh | Bastion SSH private key |
| OUTPUT | dev-backend.mapping_service_ecs_task_role_arn | Allows analytics to grant SQS SendMessage |
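As a rough illustration of how a few of these inputs and outputs might be declared, the sketch below uses the names from the list; the object shapes, the validation rule, and the `tls_private_key.bastion` resource are assumptions made for the example, not the repo's actual declarations.

```hcl
# Sketch only: names come from the list above; types and the key resource
# are assumptions, not copied from the repo.

variable "app_name" {
  description = "App name used for resource naming and tags."
  type        = string
}

variable "db_scaling_config" {
  description = "Aurora capacity bounds in ACUs (assumed shape)."
  type = object({
    min_acu = number
    max_acu = number
  })

  validation {
    condition     = var.db_scaling_config.min_acu <= var.db_scaling_config.max_acu
    error_message = "min_acu must not exceed max_acu."
  }
}

# Hypothetical stand-in for however the bastion key pair is generated.
resource "tls_private_key" "bastion" {
  algorithm = "ED25519"
}

output "bastion_host_private_key_openssh" {
  description = "Bastion SSH private key in OpenSSH format."
  value       = tls_private_key.bastion.private_key_openssh
  sensitive   = true # hides the value in CLI output; it is still stored in state
}
```

Marking the output `sensitive` only redacts it from plan and apply output, so the state file itself still needs to be treated as a secret.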
Third-party APIs
| API | Purpose |
|---|---|
| AWS APIs | Resource provisioning via hashicorp/aws ~> 6.14 |
| GitHub OIDC | Federated identity into AWS for CI |
| Terraform Registry | Provider + module downloads |
Service dependencies
| Dependency | Purpose |
|---|---|
| S3 state bucket per account | Terraform remote state |
| S3 lock file (use_lockfile = true; no DynamoDB table required) | State locking |
| AWS SSO | Operator authentication |
| GitHub OIDC | CI/CD authentication into AWS without long-lived keys |
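A backend block along the following lines would match the state layout described here: one bucket per account, one key per stack, and locking through `use_lockfile`. The bucket name is the one cited later in this review; the `key` path and the `required_version` constraint are assumptions for illustration.

```hcl
# environments/dev-backend/providers.tf (sketch; key and version are assumed)
terraform {
  required_version = ">= 1.10" # use_lockfile is available from Terraform 1.10

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 6.14"
    }
  }

  backend "s3" {
    bucket = "dev-hih-backend-048599825724-us-east-1"
    key    = "dev-backend/terraform.tfstate" # hypothetical per-stack key
    region = "us-east-1"

    # Writes a <key>.tflock object beside the state while an operation runs.
    use_lockfile = true
  }
}
```

Bucket-level settings such as versioning and encryption are not part of this block, which is why they cannot be confirmed from the repo alone.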
Analysis
af-infra — Prop-Build Analysis
Document Type: Critical Review & Analysis (companion to prop-build-template.md)
Scope: Per-Repo
Subject: aid-finder/af-infra (Terraform monorepo)
Reviewer(s): Claude (automated code review)
Date: 2026-04-09
Version: 0.1
Confidence Level: Medium
What would raise confidence: access to actual AWS state (drift check), CI run history, an interview with the platform engineer who owns the backend state bucket, and sight of the prod-* environment directories (which do not currently exist in the repo).
Inputs Reviewed:
- Prop-build doc: `/Users/andres/src/af/af-analysis/data/af-infra.yaml`
- Companion docs: `data/af-infra/{api-examples,data-flow,deployment,runbook}.md`
- Source tree: `/Users/andres/src/af/af-infra/` (HCL, ~4.4k LOC in `modules/` + `environments/`)
- CI: `.github/workflows/{plan,apply,lint}.yaml`, `.github/actions/*`
- Backend config: `BACKEND.md`, each `environments/*/providers.tf`
A.1 Executive Summary
- Overall health: Solid, conventional Terraform monorepo (clean module boundaries; WAF, KMS, and OIDC done right; lint pipeline in place), but the repo's promise of four environments (common/dev/stg/prod) is not yet kept: there is no `environments/prod-*` directory anywhere in the tree.
- Top risk: The `prod` environments promised by `af-infra.yaml` and by `.github/workflows/apply.yaml:13` do not exist on disk (`ls environments/` returns only common/dev/stg variants). Clicking "Run workflow → env: prod" will fail cold, and no one has prod state, prod tfvars, or prod backups. See A.4.1.
- Top win / thing worth preserving: The `modules/service` blue/green CodeDeploy + GitHub OIDC pattern (`modules/service/main.tf:372-483`) is exemplary: scoped IAM policy, no long-lived keys, and `ignore_changes` on task_definition so Terraform and CodeDeploy don't fight. Propagate to any future service module.
- Single recommended next action: Create `environments/prod-{backend,analytics,shared}` (copy from stg, with prod-sized ACUs, retention, deletion protection, and a distinct state bucket) before any claim of "Active prod" is made externally.
- Blocking unknowns: Whether the existing `dev-hih-backend-048599825724-us-east-1` state bucket has versioning + SSE + MFA-delete (cannot verify from code; only the `use_lockfile = true` line in `providers.tf` is visible); whether `terraform apply` has ever been run against a real prod account.
A.2 Health Scorecard
| # | Dimension | Score (1–5) | Justification |
|---|---|---|---|
| 1 | Module overview / clarity of intent | 4 | README + BACKEND.md + four companion docs; intent clear, though README.md is mostly a stub + PNG. |
| 2 | External dependencies | 4 | Pinned AWS provider ~> 6.14.0 and pinned community modules (rds-aurora 9.15.0, iam 5.59.0, ec2-instance 6.0.2). .terraform.lock.hcl per stack. |
| 3 | API endpoints | N/A | IaC repo exposes no HTTP APIs; modules/api provisions infra for the Go API. |
| 4 | Database schema | N/A | Aurora is provisioned (modules/rds/main.tf) but schema lives in af-backend-go-api. |
| 5 | Backend services | 4 | modules/service and modules/background-service are cleanly factored; ECS Fargate + ALB + autoscaling + CodeDeploy in one module. |
| 6 | WebSocket / real-time | N/A | Not in scope. |
| 7 | Frontend components | N/A | Not in scope. |
| 8 | Data flow clarity | 4 | data-flow.md companion present; cross-stack wiring via terraform_remote_state is explicit (environments/dev-backend/main.tf:1-17). |
| 9 | Error handling & resilience | 3 | Blue/green w/ auto-rollback on failure (modules/service/main.tf:378-381), DLQ on SQS (environments/dev-backend/main.tf:427-430). No CloudWatch alarms defined in IaC. |
| 10 | Configuration | 4 | Per-env terraform.tfvars + locals.tf; no secrets in tfvars; SSM PLACEHOLDER pattern (modules/service/main.tf:1-12) with ignore_changes = [value]. |
| 11 | Data refresh patterns | N/A | — |
| 12 | Performance | 3 | ACU 0.5–1.0 in dev is fine; no evidence prod has been sized. Containers hard-coded 256/512 (modules/service/main.tf:152-153). |
| 13 | Module interactions | 4 | Remote_state for cross-stack; VPC peering wired explicitly (environments/dev-backend/main.tf:33-61). |
| 14 | Troubleshooting / runbooks | 4 | runbook.md companion covers 8 scenarios per the yaml index. |
| 15 | Testing & QA | 2 | make lint runs fmt -check, validate, tflint, terraform-docs --output-check (Makefile:12-18). No terraform test, no Terratest, no tfsec/checkov. |
| 16 | Deployment & DevOps | 4 | Plan/apply split via GitHub Actions with artifact handoff (.github/workflows/apply.yaml:59-66), OIDC assume-role, per-env GitHub Environment gate. Missing prod directories. |
| 17 | Security & compliance | 3 | WAF managed rule groups + custom rate limit, KMS for RDS, private subnets, OIDC. BUT: bastion SSH open to 0.0.0.0/0 (modules/bastion-host/main.tf:20-23), DB password in Terraform random_password persisted in state + SSM, broad AWSCodeDeployRoleForECS attached, and state bucket shared across backend+analytics+shared. |
| 18 | Documentation & maintenance | 4 | terraform-docs --output-check enforced in lint; companion docs exist. |
| 19 | Roadmap clarity | 2 | No ROADMAP.md, no TODOs, no prod plan in-repo despite yaml listing prod as in scope. |
Overall score: 3.50 (14 applicable dimensions averaged, excluding N/A rows 3, 4, 6, 7, 11).
A.3 What's Working Well
- Strength: GitHub OIDC assume-role + narrowly-scoped `code_deployer` policy per service.
  - Location: `modules/service/main.tf:424-483`
  - Why it works: No long-lived AWS keys; `iam:PassRole` is restricted to just the two task roles of that service; `subjects = ["${var.github_repository_name}:*"]` binds trust to one repo.
  - Propagate to: Any other repo deploying to AWS.
- Strength: Blue/green ECS with CodeDeploy and `ignore_changes = [task_definition, load_balancer]`.
  - Location: `modules/service/main.tf:184-197, 297-316`
  - Why it works: Terraform owns the infra, CD owns the rollout; avoids drift fight. Auto-rollback on DEPLOYMENT_FAILURE wired in (`:378-381`).
- Strength: SSM `PLACEHOLDER` pattern with `ignore_changes = [value]`.
  - Location: `modules/service/main.tf:1-12`
  - Why it works: Lets Terraform own parameter name/type/ACL while leaving value management to humans/ops, which keeps secret values out of tfvars and HCL (a refresh can still copy the live value into state).
- Strength: Single lint target: `fmt`, `validate`, `tflint`, `terraform-docs --output-check`.
  - Location: `Makefile:12-18`
  - Why it works: Identical local + CI command; docs cannot drift from code.
- Strength: WAF stack in front of CloudFront: managed IP reputation / SQLi / KnownBadInputs / Common + custom rate-based rule + CustomBodySizeLimit.
  - Location: `environments/dev-backend/main.tf:116-259`
A.4 What to Improve
A.4.1 P0 — Prod environments do not exist
- Problem: `af-infra.yaml` meta.scope claims "four account-isolated environments (common, dev, stg, prod)" and `.github/workflows/apply.yaml:13` offers `prod` as a workflow_dispatch choice, but `environments/` contains only `common-*`, `dev-*`, `stg-*`.
- Evidence: `ls environments/` shows no `prod-*`; `.github/workflows/apply.yaml:10-14`; `af-infra.yaml` scope text.
- Suggested change: Either create `environments/prod-{backend,analytics,shared}` with prod-sized values (deletion_protection=true, higher retention, distinct state bucket under the prod AWS account), or remove `prod` from the workflow_dispatch enum and update the prop-build doc.
- Estimated effort: M
- Risk if ignored: The documented architecture is a fiction; on-call engineers will assume a prod exists and discover otherwise during an incident.
A.4.2 P0 — Bastion host SSH open to the entire internet
- Problem: The bastion security group allows TCP/22 ingress from `0.0.0.0/0` and `::/0`.
- Evidence: `modules/bastion-host/main.tf:16-23`
- Suggested change: Restrict `cidr_blocks` to an input variable defaulting to `[]`, or replace the bastion with SSM Session Manager.
- Estimated effort: S
- Risk if ignored: Constant SSH brute-force surface; the private key lives in Terraform state (`modules/bastion-host/main.tf:5`, `create_private_key = true`), so a state leak means a shell into the VPC with routes to RDS.
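The first option could look roughly like the sketch below. The variable name `allowed_ssh_cidrs`, the `vpc_id` input, and the resource names are hypothetical; the module's real security group definition was not reproduced in this review.

```hcl
# Sketch: SSH ingress driven by an explicit allow-list that defaults to closed.
variable "vpc_id" {
  type = string
}

variable "allowed_ssh_cidrs" {
  description = "IPv4 CIDRs allowed to reach the bastion on TCP/22. Empty means no SSH ingress."
  type        = list(string)
  default     = []

  validation {
    condition     = !contains(var.allowed_ssh_cidrs, "0.0.0.0/0")
    error_message = "0.0.0.0/0 is not allowed for bastion SSH ingress."
  }
}

resource "aws_security_group" "bastion" {
  name_prefix = "bastion-"
  vpc_id      = var.vpc_id
}

# One ingress rule per allowed CIDR; an empty list creates no rules at all.
resource "aws_vpc_security_group_ingress_rule" "ssh" {
  for_each = toset(var.allowed_ssh_cidrs)

  security_group_id = aws_security_group.bastion.id
  cidr_ipv4         = each.value
  ip_protocol       = "tcp"
  from_port         = 22
  to_port           = 22
  description       = "SSH from ${each.value}"
}
```

With the default of `[]`, each environment has to opt in to specific office or VPN ranges in its own tfvars, so an open bastion can no longer happen by omission.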
A.4.3 P1 — RDS master password generated by Terraform, stored in state and SSM
- Problem: `random_password` result is written to state and then to an SSM SecureString; `manage_master_user_password = false` explicitly opts out of AWS Secrets Manager integration.
- Evidence: `modules/rds/main.tf:5-8, 28-29, 83-93`
- Suggested change: Set `manage_master_user_password = true` and let RDS rotate through Secrets Manager.
- Estimated effort: M
- Risk if ignored: Credentials sprawl; manual rotation; state file becomes a secret.
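A minimal sketch of the suggested setting follows, written against the plain `aws_rds_cluster` resource rather than the community rds-aurora module the repo pins, to keep the example small. The identifier, engine, and username are placeholders, and the capacity range simply mirrors the dev ACU figures quoted in the scorecard.

```hcl
# Sketch: RDS creates and stores the master password in Secrets Manager,
# so no random_password result ever passes through Terraform state.
resource "aws_rds_cluster" "example" {
  cluster_identifier = "example-aurora"    # placeholder
  engine             = "aurora-postgresql" # assumed engine
  engine_mode        = "provisioned"
  master_username    = "app_admin"         # placeholder

  # Must not be combined with master_password.
  manage_master_user_password = true

  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 1.0
  }

  skip_final_snapshot = true # acceptable for a throwaway example only
}

# Consumers read the secret by ARN instead of receiving the password itself.
output "master_user_secret_arn" {
  value = aws_rds_cluster.example.master_user_secret[0].secret_arn
}
```

The application would then need permission to read that secret, which is the "App env var update" dependency noted in the recommendations table.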
A.4.4 P1 — No security/policy scanning in CI
- Problem: `make lint` runs `fmt`, `validate`, `tflint`, `terraform-docs`; there is no `tfsec`, `checkov`, `trivy`, or OPA/Conftest.
- Evidence: `Makefile:12-18`; `.github/workflows/lint.yaml`.
- Suggested change: Add a `checkov -d modules -d environments` (or `tfsec .`) step with a curated ignore list.
- Estimated effort: S
- Risk if ignored: Misconfigurations merge unnoticed.
A.4.5 P1 — State bucket is shared across stacks
- Problem: Every `dev-*` environment points its backend at `dev-hih-backend-048599825724-us-east-1` and reads siblings' state via `terraform_remote_state`. A compromised backend apply can read or clobber analytics state.
- Evidence: `environments/dev-backend/providers.tf:3-6`; `environments/dev-backend/main.tf:1-17`
- Suggested change: Per-stack bucket prefixes with IAM conditions, or per-account state buckets.
- Estimated effort: M
- Risk if ignored: Lateral blast radius within an environment during credential compromise.
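The prefix-scoped option might be expressed as an IAM policy like the one sketched here. It assumes each stack's state and lock file live under a `<stack>/` key prefix, which is a hypothetical layout; the variable and policy names are illustrative too.

```hcl
variable "state_bucket_name" {
  type = string
}

variable "stack_prefix" {
  description = "Key prefix owned by one stack, for example dev-backend (assumed layout)."
  type        = string
}

data "aws_iam_policy_document" "stack_state" {
  # Listing is limited to the stack's own prefix.
  statement {
    sid       = "ListOwnPrefix"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::${var.state_bucket_name}"]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["${var.stack_prefix}/*"]
    }
  }

  # Read/write covers the .tfstate object and the .tflock object beside it.
  statement {
    sid       = "ReadWriteOwnStateAndLock"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::${var.state_bucket_name}/${var.stack_prefix}/*"]
  }
}

resource "aws_iam_policy" "stack_state" {
  name   = "${var.stack_prefix}-tfstate-access"
  policy = data.aws_iam_policy_document.stack_state.json
}
```

A stack that legitimately reads a sibling through `terraform_remote_state` would need an extra read-only statement for that sibling's prefix, which makes the cross-stack dependency explicit in IAM.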
A.4.6 P2 — environments/dev-backend/main.tf is a 510-line copy of stg-backend/main.tf
- Problem: Parallel, structurally identical files: classic shotgun surgery. The analytics stacks follow the same pattern at 640 LOC each.
- Evidence: `environments/dev-backend/main.tf` and `environments/stg-backend/main.tf` are both 510 LOC.
- Suggested change: Extract `modules/stacks/backend` so env dirs become a thin `terraform.tfvars` + `module "backend"` call.
- Estimated effort: L
- Risk if ignored: Drift between environments; review burden; prod (when created) will be a fourth copy.
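After such an extraction, an environment directory might shrink to something like this sketch. The module path is the one proposed above and does not exist yet; the input and output names are taken from the stack's documented interface, and the `any` types stand in for shapes this review does not show.

```hcl
# environments/dev-backend/main.tf after extraction (sketch)

variable "app_name" {
  type = string
}

variable "frontend_url" {
  type = string
}

variable "api_ecs_task_config" {
  type = any # real object shape not shown in this review
}

variable "db_scaling_config" {
  type = any # real object shape not shown in this review
}

module "backend" {
  source = "../../modules/stacks/backend" # proposed module, not yet in the repo

  app_name            = var.app_name
  frontend_url        = var.frontend_url
  api_ecs_task_config = var.api_ecs_task_config
  db_scaling_config   = var.db_scaling_config
}

# Re-export what sibling stacks consume through terraform_remote_state.
output "api_dns_name" {
  value = module.backend.api_dns_name
}
```

Per-environment differences then live only in `terraform.tfvars`, so a diff between dev and stg shows values rather than 510 lines of structure.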
A.4.7 P2 — No CloudWatch alarms in IaC
- Problem: `containerInsights = enabled` and WAF emits metrics, but no `aws_cloudwatch_metric_alarm` resources exist anywhere.
- Evidence: `grep -r metric_alarm modules environments` returns nothing.
- Suggested change: Add `modules/alarms` with RDS CPU, ECS task count, ALB 5xx, SQS DLQ depth, WAF blocked request surge.
- Estimated effort: M
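As one concrete case from that list, a DLQ depth alarm could be sketched as follows. The variable names and the SNS topic wiring are assumptions; the threshold of zero reflects the usual view that any message in a dead-letter queue deserves attention.

```hcl
variable "dlq_name" {
  description = "Name of the dead-letter queue to watch."
  type        = string
}

variable "alarm_topic_arn" {
  description = "SNS topic that receives alarm notifications (hypothetical)."
  type        = string
}

resource "aws_cloudwatch_metric_alarm" "dlq_depth" {
  alarm_name        = "${var.dlq_name}-depth"
  alarm_description = "Messages are sitting in the dead-letter queue."

  namespace   = "AWS/SQS"
  metric_name = "ApproximateNumberOfMessagesVisible"
  dimensions  = { QueueName = var.dlq_name }

  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0

  # An idle queue may report no datapoints; treat that as healthy.
  treat_missing_data = "notBreaching"
  alarm_actions      = [var.alarm_topic_arn]
}
```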
A.5 Things That Don't Make Sense
- Observation: `environments/common-shared/main.tf` is 6 lines and `environments/dev-shared/main.tf` is 26 lines, beside 510-line peers.
  - Question for author: Is "shared" intentionally minimal, or a placeholder?
- Observation: Two CORS rules on the file submission bucket: one allowing only `frontend_url`, another (dev-only) allowing `*`.
  - Location: `environments/dev-backend/main.tf:300-318`
  - Question for author: Why not a per-env `allowed_origins` list variable?
- Observation: `default_tags` is set in every `providers.tf` but no tagging-compliance policy enforces it.
  - Location: `environments/dev-backend/providers.tf:19-25`
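The per-environment `allowed_origins` idea raised in the CORS question could be sketched like this. The bucket input, the allowed methods, and the resource name are assumptions, since the actual CORS rules were not reproduced in this review.

```hcl
variable "bucket_id" {
  description = "ID of the file submission bucket."
  type        = string
}

variable "allowed_origins" {
  description = "Origins allowed to upload; dev may pass [\"*\"], other envs pass the frontend URL."
  type        = list(string)

  validation {
    condition     = length(var.allowed_origins) > 0
    error_message = "At least one allowed origin is required."
  }
}

# A single rule replaces the frontend-only rule plus the dev-only wildcard rule.
resource "aws_s3_bucket_cors_configuration" "file_submission" {
  bucket = var.bucket_id

  cors_rule {
    allowed_methods = ["PUT", "POST"] # assumed upload methods
    allowed_origins = var.allowed_origins
    allowed_headers = ["*"]
    max_age_seconds = 3000
  }
}
```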
A.6 Anti-Patterns Detected
A.6.1 Code-level
- Copy-paste / duplication: env files duplicated across dev/stg.
- Magic numbers: WAF `priority = 0..5`; container `256/512`.
A.6.2 Architectural
- Shotgun surgery: any backend-stack change must be made to dev + stg (+ prod once added).
- Missing seams for testing: no `terraform test`, no Terratest.
- Leaky abstraction: `environments/dev-backend/main.tf` mixes stack glue with top-level resources rather than delegating to a stack module.
A.6.3 Data
- None observed in IaC repo.
A.6.4 Async / Ops
- Work queues without visibility: `aws_sqs_queue.mapping_service_dlq` (`environments/dev-backend/main.tf:427-430`) has no depth alarm.
A.6.5 Security
- Secrets in state: RDS master password (`modules/rds/main.tf:5-8, 89-93`); bastion private key (`modules/bastion-host/main.tf:5`).
- Overbroad IAM / SG: bastion SSH `0.0.0.0/0`; `AWSCodeDeployRoleForECS` attached wholesale; shared state bucket.
A.6.6 Detected Instances
| # | Anti-pattern | Location (file:line) | Severity | Recommendation |
|---|---|---|---|---|
| 1 | Bastion SSH open world | modules/bastion-host/main.tf:16-23 | P0 | Restrict CIDRs or SSM Session Manager. |
| 2 | DB password in TF state + SSM | modules/rds/main.tf:5-8, 89-93 | P1 | manage_master_user_password = true. |
| 3 | Shared state bucket across stacks | environments/dev-backend/providers.tf:3 | P1 | Per-stack/account buckets with key-prefix IAM. |
| 4 | Broad CodeDeploy managed-policy attach | modules/service/main.tf:362-365 | P2 | Replace with scoped custom policy. |
| 5 | Shotgun surgery across env dirs | environments/dev-backend/main.tf ↔ stg-backend/main.tf | P2 | Extract modules/stacks/backend. |
| 6 | SQS DLQ with no depth alarm | environments/dev-backend/main.tf:427-430 | P2 | Add aws_cloudwatch_metric_alarm. |
| 7 | Magic constants for task CPU/mem | modules/service/main.tf:152-153 | P2 | Promote to variables. |
| 8 | Dev-only * CORS branch in shared file | environments/dev-backend/main.tf:309-317 | P2 | Parameterize allowed_origins. |
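For row 7, promoting the hard-coded 256/512 pair to variables might look like the sketch below. The variable names, the placeholder task definition, and the container image input are hypothetical; the list of CPU values reflects the sizes Fargate accepts.

```hcl
variable "task_cpu" {
  description = "Fargate task CPU units."
  type        = number
  default     = 256 # matches the current hard-coded value

  validation {
    condition     = contains([256, 512, 1024, 2048, 4096, 8192, 16384], var.task_cpu)
    error_message = "task_cpu must be a CPU size supported by Fargate."
  }
}

variable "task_memory" {
  description = "Fargate task memory in MiB; must be a valid pairing for task_cpu."
  type        = number
  default     = 512 # matches the current hard-coded value
}

variable "container_image" {
  description = "Image URI for the example container (hypothetical input)."
  type        = string
}

resource "aws_ecs_task_definition" "example" {
  family                   = "example"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.task_cpu
  memory                   = var.task_memory

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = var.container_image
      essential = true
    }
  ])
}
```

Keeping the current values as defaults means dev and stg plans stay unchanged while a future prod stack can size up through tfvars.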
A.7 Open Questions
- Q: Does a `prod` AWS account exist yet, and if so why is there no `environments/prod-*` tree?
- Q: Is the state bucket versioned + SSE-encrypted + MFA-delete?
- Q: Is there a GitHub Environment protection rule requiring approval before `apply.yaml` runs against `prod`?
- Q: Why `manage_master_user_password = false` on Aurora (`modules/rds/main.tf:29`)?
A.8 Difficulties Encountered
- Difficulty: No running state visible; cannot see whether resources match HCL.
  - Impact: Drift and "claimed prod" cannot be verified.
  - Fix: Commit a `terraform state list` snapshot under `doc/`.
- Difficulty: Companion docs live in `af-analysis`, not next to the code.
  - Impact: Bus-factor on discoverability.
- Difficulty: No CHANGELOG; had to read parallel files side-by-side.
A.9 Risks & Unknowns
A.9.1 Known risks
| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Prod env absent but advertised | H | H | A.4.1 |
| 2 | Bastion SSH exposed | H | H | A.4.2 |
| 3 | DB password in state | M | H | A.4.3 |
| 4 | No tfsec/checkov | M | M | A.4.4 |
| 5 | State-bucket blast radius | L | H | A.4.5 |
| 6 | Env-directory drift | M | M | A.4.6 |
| 7 | No alarm wiring | M | M | A.4.7 |
A.9.2 Unknown unknowns
- CodeDeploy rollback runtime behavior (HCL-only review).
- State bucket versioning/encryption/replication (not in backend block).
- Analytics stack (`*-analytics/main.tf` @ 640 LOC each): time-boxed.
- `.github/actions/for-each-*` composite actions.
A.10 Technical Debt Register
| # | Debt item | Quadrant | Interest | Remediation |
|---|---|---|---|---|
| 1 | Parallel env directories (dev-backend/stg-backend @ 510 LOC) | Prudent & Inadvertent | 2–3× change time | A.4.6 |
| 2 | Missing prod-* env dirs | Reckless & Inadvertent | Broken on first use | A.4.1 |
| 3 | RDS password via random_password + SSM | Reckless & Deliberate | Manual rotation | A.4.3 |
| 4 | Bastion SSH 0.0.0.0/0 | Reckless & Inadvertent | Open brute-force | A.4.2 |
| 5 | No tfsec/checkov in CI | Prudent & Deliberate | Misconfigs slip in | A.4.4 |
| 6 | Shared state bucket across stacks | Prudent & Inadvertent | Lateral blast radius | A.4.5 |
| 7 | No alarms in IaC | Prudent & Inadvertent | Out-of-band ops | A.4.7 |
| 8 | Magic container CPU/mem constants | Prudent & Inadvertent | Per-service tuning hard | Promote to variables |
| 9 | No terraform test / Terratest | Prudent & Deliberate | Refactors are scary | Smoke tests on plan |
A.11 Security Posture (lightweight STRIDE)
| Category | Threat present? | Mitigated? | Gap |
|---|---|---|---|
| Spoofing (identity) | Yes (GH Actions → AWS) | Partial — OIDC with per-repo subjects (modules/service/main.tf:479) | Trust condition is :*; narrow to :ref:refs/heads/main or :environment:prod. |
| Tampering (integrity) | Yes (state, SSM) | Partial | State bucket config not verifiable from repo; RDS pw in state. |
| Repudiation | Yes | Partial | CloudTrail not in-repo; assumed org-level. |
| Information Disclosure | Yes | Partial | DB pw in SSM + state; bastion private key in state. |
| Denial of Service | Yes | Yes | WAF rate-based rule + managed rule groups. |
| Elevation of Privilege | Yes | Partial | Bastion open SSH; managed CodeDeploy role; shared state bucket. |
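The narrowing suggested in the Spoofing row can be illustrated with a hand-written trust policy. This sketch uses a plain `aws_iam_policy_document` rather than whatever module the repo uses to build the role; the role name and the `oidc_provider_arn` input are hypothetical.

```hcl
variable "github_repository_name" {
  description = "Repository in org/repo form."
  type        = string
}

variable "oidc_provider_arn" {
  description = "ARN of the GitHub OIDC provider in this account (hypothetical input)."
  type        = string
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only main-branch runs or jobs bound to the prod environment may assume the role.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:${var.github_repository_name}:ref:refs/heads/main",
        "repo:${var.github_repository_name}:environment:prod",
      ]
    }
  }
}

resource "aws_iam_role" "code_deployer" {
  name               = "example-code-deployer" # placeholder
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

Jobs that declare a GitHub Environment present an `environment:` subject instead of a `ref:` subject, so both forms are listed.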
A.12 Operational Readiness
| Capability | Present / Partial / Missing | Notes |
|---|---|---|
| Structured logs | Partial | aws_cloudwatch_log_group "this" created; shape owned by app. |
| Metrics | Partial | Container Insights + WAF metrics; no custom alarms. |
| Distributed tracing | Missing | No X-Ray / OTEL resources. |
| Actionable alerts | Missing | No aws_cloudwatch_metric_alarm. |
| Runbooks | Present | data/af-infra/runbook.md. |
| On-call ownership defined | Partial | CODEOWNERS present. |
| SLOs / SLIs | Missing | None. |
| Backup & restore tested | Partial | backup_retention_period = 7 in dev; no drill evidence. |
| Disaster recovery plan | Partial | Single region; no cross-region replication in IaC. |
| Chaos / failure testing | Missing | N/A for IaC. |
A.13 Test & Quality Signals
- Coverage: N/A (lint-only).
- Untested critical paths: `modules/service` blue/green; VPC peering rollback.
- Missing test types: [x] unit [x] integration [ ] e2e [ ] contract [ ] load [x] security/fuzz.
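A first `terraform test` smoke check for the blue/green path might look like the sketch below. The file path, the `name` input, and the `aws_ecs_service.this` address are all assumptions about the module; they would need to be replaced with its real variables and resource names.

```hcl
# modules/service/tests/plan_smoke.tftest.hcl (hypothetical path)

# Mock the AWS provider so the run needs no credentials (Terraform 1.7+).
mock_provider "aws" {}

variables {
  # Hypothetical input; supply whatever the module actually requires.
  name = "smoke"
}

run "blue_green_controller_is_codedeploy" {
  command = plan

  assert {
    # aws_ecs_service.this is an assumed resource address.
    condition     = aws_ecs_service.this.deployment_controller[0].type == "CODE_DEPLOY"
    error_message = "The ECS service must use the CODE_DEPLOY deployment controller."
  }
}
```

Plan-only runs like this are cheap enough to sit in the existing lint workflow and would catch an accidental change to the deployment controller before it reaches an apply.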
A.14 Performance & Cost Smells
- Hot paths: ECS Fargate tasks at 256/512 — fine for dev, likely undersized for prod.
- Suspected bottlenecks: Aurora max 1.0 ACU in dev; prod must raise.
- Oversized / idle: Bastion EC2 runs 24/7 (`modules/bastion-host/main.tf:50-64`); consider schedule-stop or SSM.
A.15 Bus-Factor & Knowledge Risk
- Only-person knowledge: State-bucket bootstrap (who originally ran `aws s3 mb`?); why `manage_master_user_password = false`.
- What breaks: Re-bootstrapping state into a new account; interpreting `common-shared` 6-LOC minimalism.
- Knowledge-transfer actions: `doc/bootstrap.md` describing state-bucket creation; ADR for the RDS password decision.
A.16 Compliance Gaps
| Regulation | Requirement | Status | Gap | Remediation |
|---|---|---|---|---|
| AWS Well-Architected (Security) | Least-privilege IAM | Partial | Bastion SG, shared state bucket, managed-policy attach | A.4.2, A.4.5 |
| AWS Well-Architected (Reliability) | Alarm coverage | Missing | No metric_alarm | A.4.7 |
| CIS AWS FSBP | Encrypted + versioned state bucket | Unknown | Not in backend block | Verify and document |
| Internal | Rate-limit + SQLi on public API | Present | — | — |
A.17 Recommendations Summary
| Priority | Action | Owner | Effort | Depends on |
|---|---|---|---|---|
| P0 | Create environments/prod-{backend,analytics,shared} OR drop prod from .github/workflows/apply.yaml enum. | Platform lead | M | Prod AWS account decision |
| P0 | Restrict bastion SSH (modules/bastion-host/main.tf:16-23) or migrate to SSM Session Manager. | SRE | S | — |
| P1 | Switch RDS to manage_master_user_password = true + Secrets Manager; drop aws_ssm_parameter.password. | Platform + af-backend-go-api | M | App env var update |
| P1 | Add tfsec or checkov step to .github/workflows/lint.yaml. | SRE | S | — |
| P1 | Per-stack (or per-account) state buckets with prefix-scoped IAM. | Platform lead | M | Prod dir decision |
| P1 | Add modules/alarms (RDS CPU, ALB 5xx, SQS DLQ depth, WAF blocked surge). | SRE | M | — |
| P2 | Extract modules/stacks/backend to eliminate env copy-paste. | Platform lead | L | — |
| P2 | Replace AWSCodeDeployRoleForECS attachment with a scoped custom policy. | SRE | S | — |
| P2 | Promote container CPU/mem and CORS allowed_origins to module variables. | Platform | S | — |
| P2 | Narrow GitHub OIDC trust subjects from "repo:*" to ref/environment scoped. | SRE | S | — |
| P2 | Add terraform test / Terratest smoke tests. | Platform | M | — |
| P2 | Add CloudWatch alarm on mapping_service_dlq depth. | SRE | S | modules/alarms |
Environment variables
| Name | Purpose |
|---|---|
| AWS_PROFILE* | AWS SSO profile (operator path) |
| AWS_REGION* | Hardcoded us-east-1 in locals.tf + GH env var |
| AWS_ACCOUNT_ID* | Per-GitHub-Environment; used to build OIDC role ARN |
| AWS_CICD_ROLE_NAME* | OIDC-assumable role name |
| TF_IN_AUTOMATION | Suppresses interactive prompts in CI |
| TF_PLUGIN_CACHE_DIR | Speeds up init |
| TF_VAR_* | Override any variable defined in variables.tf |
