AidFinder
af-infra

Aid Finder Infrastructure

Terraform monorepo provisioning all AWS resources for AidFinder across common/dev/stg/prod accounts with per-stack isolation.

Domain role: Infrastructure-as-code (IaC)
Last updated: 2026-04-07
Lines of code: 6,606
API style: Terraform IO

Terraform monorepo with hexagonal layout (environments/<env>-<stack>/ + modules/). Each AWS account hosts one S3 state bucket, and each stack writes its own .tfstate key under that bucket. State locking relies on use_lockfile = true, which is the S3 backend's native lock-file mechanism (a lock object written next to the state key), so no DynamoDB lock table is involved. Modules cover api, service, background-service, rds, vpc, kms, sqs, lambda, common (GitHub OIDC), bastion-host, hosted-zone.
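
The state layout described above can be sketched as a backend block. This is a minimal illustration, not the repo's actual providers.tf: the bucket name is the one quoted later in this review, while the key path and the encrypt setting are assumptions. With use_lockfile = true (Terraform 1.10 or later), the S3 backend keeps a lock object next to the state key instead of using a DynamoDB table.

```hcl
# Sketch of a per-stack backend block (e.g. environments/dev-backend/providers.tf).
terraform {
  required_version = ">= 1.10" # use_lockfile needs the S3-native locking support

  backend "s3" {
    bucket       = "dev-hih-backend-048599825724-us-east-1" # one bucket per AWS account
    key          = "dev-backend/terraform.tfstate"          # hypothetical key; one per stack
    region       = "us-east-1"
    encrypt      = true # assumption: not confirmed by the reviewed code
    use_lockfile = true # lock object stored in S3 next to the state key
  }
}
```

Each sibling stack would differ only in its key, which is what keeps the per-stack state files separate inside a shared bucket.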

Role in the system: Provisions everything that af-backend-go-api, af-map, the af-targeting Lambdas, and (indirectly) af-frontend run on, with account isolation across AF-Shared / AF-Dev / AF-Staging / AF-Prod.

Surfaces:

  • common-shared (GitHub OIDC, Route 53 zones)
  • common-analytics (Lambda targeting code bucket)
  • common-backend (legacy)
  • dev-shared / dev-backend / dev-analytics
  • stg-shared / stg-backend / stg-analytics
  • prod-* (planned)

User workflows

  • Plan → Apply (manual gates)

    Stack updated, state in S3, lock released

  • Bootstrap a new env

    New env online

  • Lint gate

    Drift-free docs

  • State lock recovery

    Apply can resume

  • Module init for IDE

    Local IntelliSense

API endpoints

  • INPUT dev-backend.app_name: App name for naming + tags
  • INPUT dev-backend.api_ecs_task_config: ECS min/max instances
  • INPUT dev-backend.db_scaling_config: Aurora min/max ACU
  • INPUT dev-backend.frontend_url: CORS whitelist for S3 file uploads
  • OUTPUT dev-backend.api_dns_name: CloudFront FQDN for af-backend-go-api
  • OUTPUT dev-backend.bastion_host_dns_name: EC2 bastion DNS for RDS tunnel
  • OUTPUT dev-backend.bastion_host_private_key_openssh: Bastion SSH private key
  • OUTPUT dev-backend.mapping_service_ecs_task_role_arn: Allows analytics to grant SQS SendMessage
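
As an illustration of how a sibling stack might consume one of these outputs, the sketch below reads mapping_service_ecs_task_role_arn through terraform_remote_state and grants that role sqs:SendMessage on a queue. The state key, the queue, and all resource names are hypothetical; only the output name and the bucket name come from this document.

```hcl
# Sketch: an analytics stack reading a dev-backend output (names are assumptions).
data "terraform_remote_state" "backend" {
  backend = "s3"
  config = {
    bucket = "dev-hih-backend-048599825724-us-east-1"
    key    = "dev-backend/terraform.tfstate" # hypothetical key
    region = "us-east-1"
  }
}

# Hypothetical queue owned by the analytics stack.
resource "aws_sqs_queue" "targeting_requests" {
  name = "dev-targeting-requests"
}

# Allow only the mapping service task role to send messages to the queue.
data "aws_iam_policy_document" "allow_mapping_service_send" {
  statement {
    effect    = "Allow"
    actions   = ["sqs:SendMessage"]
    resources = [aws_sqs_queue.targeting_requests.arn]

    principals {
      type        = "AWS"
      identifiers = [data.terraform_remote_state.backend.outputs.mapping_service_ecs_task_role_arn]
    }
  }
}

resource "aws_sqs_queue_policy" "targeting_requests" {
  queue_url = aws_sqs_queue.targeting_requests.id
  policy    = data.aws_iam_policy_document.allow_mapping_service_send.json
}
```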

Third-party APIs

  • AWS APIs

    Resource provisioning via hashicorp/aws ~> 6.14

  • GitHub OIDC

    Federated identity into AWS for CI

  • Terraform Registry

    Provider + module downloads

Service dependencies

  • S3 state bucket per account

    Terraform remote state

  • S3 lock file (use_lockfile = true)

    State locking via the S3 backend's native lock object; no DynamoDB table is involved

  • AWS SSO

    Operator authentication

  • GitHub OIDC

    CI/CD authentication into AWS without long-lived keys

Analysis

Overall health: 3.5 / 5 (strong)

  • Module overview / clarity of intent: 4
  • External dependencies: 4
  • Backend services: 4
  • Data flow clarity: 4
  • Error handling & resilience: 3
  • Configuration: 4
  • Performance: 3
  • Module interactions: 4
  • Troubleshooting / runbooks: 4
  • Testing & QA: 2
  • Deployment & DevOps: 4
  • Security & compliance: 3
  • Documentation & maintenance: 4
  • Roadmap clarity: 2

af-infra — Prop-Build Analysis

Document Type: Critical Review & Analysis (companion to prop-build-template.md)
Scope: Per-Repo
Subject: aid-finder/af-infra (Terraform monorepo)
Reviewer(s): Claude (automated code review)
Date: 2026-04-09
Version: 0.1
Confidence Level: Medium
What would raise confidence: access to actual AWS state (drift check), CI run history, an interview with the platform engineer who owns the backend state bucket, and sight of the prod-* environment directories (which do not currently exist in the repo).

Inputs Reviewed:

  • Prop-build doc: /Users/andres/src/af/af-analysis/data/af-infra.yaml
  • Companion docs: data/af-infra/{api-examples,data-flow,deployment,runbook}.md
  • Source tree: /Users/andres/src/af/af-infra/ (HCL, ~4.4k LOC in modules/ + environments/)
  • CI: .github/workflows/{plan,apply,lint}.yaml, .github/actions/*
  • Backend config: BACKEND.md, each environments/*/providers.tf

A.1 Executive Summary

  • Overall health: Solid, conventional Terraform monorepo — clean module boundaries, WAF + KMS + OIDC done right, lint pipeline in place — but the repo's promise of four environments (common/dev/stg/prod) is not yet kept: there is no environments/prod-* directory anywhere in the tree.
  • Top risk: The prod environments promised by af-infra.yaml and by .github/workflows/apply.yaml:13 do not exist on disk (ls environments/ returns only common/dev/stg variants). Clicking "Run workflow → env: prod" will fail cold, and no one has prod state, prod tfvars, or prod backups. See A.4.1.
  • Top win / thing worth preserving: The modules/service blue/green CodeDeploy + GitHub OIDC pattern (modules/service/main.tf:372-483) is exemplary — scoped IAM policy, no long-lived keys, ignore_changes on task_definition so Terraform and CodeDeploy don't fight. Propagate to any future service module.
  • Single recommended next action: Create environments/prod-{backend,analytics,shared} (copy from stg, with prod-sized ACUs, retention, deletion protection, and a distinct state bucket) before any claim of "Active prod" is made externally.
  • Blocking unknowns: Whether the existing dev-hih-backend-048599825724-us-east-1 state bucket has versioning + SSE + MFA-delete (cannot verify from code — only the use_lockfile = true line in providers.tf is visible); whether terraform apply has ever been run against a real prod account.

A.2 Health Scorecard

| # | Dimension | Score (1–5) | Justification |
|---|---|---|---|
| 1 | Module overview / clarity of intent | 4 | README + BACKEND.md + four companion docs; intent clear, though README.md is mostly a stub + PNG. |
| 2 | External dependencies | 4 | Pinned AWS provider ~> 6.14.0 and pinned community modules (rds-aurora 9.15.0, iam 5.59.0, ec2-instance 6.0.2). .terraform.lock.hcl per stack. |
| 3 | API endpoints | N/A | IaC repo exposes no HTTP APIs; modules/api provisions infra for the Go API. |
| 4 | Database schema | N/A | Aurora is provisioned (modules/rds/main.tf) but schema lives in af-backend-go-api. |
| 5 | Backend services | 4 | modules/service and modules/background-service are cleanly factored; ECS Fargate + ALB + autoscaling + CodeDeploy in one module. |
| 6 | WebSocket / real-time | N/A | Not in scope. |
| 7 | Frontend components | N/A | Not in scope. |
| 8 | Data flow clarity | 4 | data-flow.md companion present; cross-stack wiring via terraform_remote_state is explicit (environments/dev-backend/main.tf:1-17). |
| 9 | Error handling & resilience | 3 | Blue/green w/ auto-rollback on failure (modules/service/main.tf:378-381), DLQ on SQS (environments/dev-backend/main.tf:427-430). No CloudWatch alarms defined in IaC. |
| 10 | Configuration | 4 | Per-env terraform.tfvars + locals.tf; no secrets in tfvars; SSM PLACEHOLDER pattern (modules/service/main.tf:1-12) with ignore_changes = [value]. |
| 11 | Data refresh patterns | N/A | |
| 12 | Performance | 3 | ACU 0.5–1.0 in dev is fine; no evidence prod has been sized. Containers hard-coded 256/512 (modules/service/main.tf:152-153). |
| 13 | Module interactions | 4 | Remote_state for cross-stack; VPC peering wired explicitly (environments/dev-backend/main.tf:33-61). |
| 14 | Troubleshooting / runbooks | 4 | runbook.md companion covers 8 scenarios per the yaml index. |
| 15 | Testing & QA | 2 | make lint runs fmt -check, validate, tflint, terraform-docs --output-check (Makefile:12-18). No terraform test, no Terratest, no tfsec/checkov. |
| 16 | Deployment & DevOps | 4 | Plan/apply split via GitHub Actions with artifact handoff (.github/workflows/apply.yaml:59-66), OIDC assume-role, per-env GitHub Environment gate. Missing prod directories. |
| 17 | Security & compliance | 3 | WAF managed rule groups + custom rate limit, KMS for RDS, private subnets, OIDC. BUT: bastion SSH open to 0.0.0.0/0 (modules/bastion-host/main.tf:20-23), DB password in Terraform random_password persisted in state + SSM, broad AWSCodeDeployRoleForECS attached, and state bucket shared across backend+analytics+shared. |
| 18 | Documentation & maintenance | 4 | terraform-docs --output-check enforced in lint; companion docs exist. |
| 19 | Roadmap clarity | 2 | No ROADMAP.md, no TODOs, no prod plan in-repo despite yaml listing prod as in scope. |

Overall score: 3.50 (14 applicable dimensions averaged, excluding N/A rows 3, 4, 6, 7, 11).


A.3 What's Working Well

  • Strength: GitHub OIDC assume-role + narrowly-scoped code_deployer policy per service.

    • Location: modules/service/main.tf:424-483
    • Why it works: No long-lived AWS keys; iam:PassRole is restricted to just the two task roles of that service; subjects = ["${var.github_repository_name}:*"] binds trust to one repo.
    • Propagate to: Any other repo deploying to AWS.
  • Strength: Blue/green ECS with CodeDeploy and ignore_changes = [task_definition, load_balancer].

    • Location: modules/service/main.tf:184-197, 297-316
    • Why it works: Terraform owns the infra, CD owns the rollout; avoids drift fight. Auto-rollback on DEPLOYMENT_FAILURE wired in (:378-381).
  • Strength: SSM PLACEHOLDER pattern with ignore_changes = [value].

    • Location: modules/service/main.tf:1-12
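    • Why it works: Lets Terraform own parameter name/type/ACL while leaving value management to humans/ops, which keeps secret values out of tfvars and HCL. Caveat: Terraform may still record the live value in state on refresh, so the state file should be treated as sensitive.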
  • Strength: Single lint target — fmt, validate, tflint, terraform-docs --output-check.

    • Location: Makefile:12-18
    • Why it works: Identical local + CI command; docs cannot drift from code.
  • Strength: WAF stack in front of CloudFront: managed IP reputation / SQLi / KnownBadInputs / Common + custom rate-based rule + CustomBodySizeLimit.

    • Location: environments/dev-backend/main.tf:116-259

A.4 What to Improve

A.4.1 P0 — Prod environments do not exist

  • Problem: af-infra.yaml meta.scope claims "four account-isolated environments (common, dev, stg, prod)" and .github/workflows/apply.yaml:13 offers prod as a workflow_dispatch choice, but environments/ contains only common-*, dev-*, stg-*.
  • Evidence: ls environments/ shows no prod-*; .github/workflows/apply.yaml:10-14; af-infra.yaml scope text.
  • Suggested change: Either create environments/prod-{backend,analytics,shared} with prod-sized values (deletion_protection=true, higher retention, distinct state bucket under the prod AWS account), or remove prod from the workflow_dispatch enum and update the prop-build doc.
  • Estimated effort: M
  • Risk if ignored: The documented architecture is a fiction; on-call engineers will assume a prod exists and discover otherwise during an incident.

A.4.2 P0 — Bastion host SSH open to the entire internet

  • Problem: The bastion security group allows TCP/22 ingress from 0.0.0.0/0 and ::/0.
  • Evidence: modules/bastion-host/main.tf:16-23
  • Suggested change: Restrict cidr_blocks to an input variable defaulting to [], or replace the bastion with SSM Session Manager.
  • Estimated effort: S
  • Risk if ignored: Constant SSH brute-force surface; the private key lives in Terraform state (modules/bastion-host/main.tf:5, create_private_key = true) — state leak = shell into VPC with routes to RDS.
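
One way to express the suggested restriction is an allow-list variable that defaults to empty, so that SSH stays closed unless a caller opts in. The variable and resource names below are hypothetical and do not mirror the current module.

```hcl
# Sketch: bastion security group with SSH closed by default (names are assumptions).
variable "vpc_id" {
  type = string
}

variable "allowed_ssh_cidrs" {
  description = "IPv4 CIDR blocks allowed to reach the bastion on TCP/22."
  type        = list(string)
  default     = [] # no ingress unless explicitly granted

  validation {
    condition     = !contains(var.allowed_ssh_cidrs, "0.0.0.0/0")
    error_message = "Opening SSH to 0.0.0.0/0 is not allowed; list specific CIDR blocks."
  }
}

resource "aws_security_group" "bastion" {
  name_prefix = "bastion-"
  vpc_id      = var.vpc_id
}

# One ingress rule per allowed CIDR; an empty list creates no rules.
resource "aws_vpc_security_group_ingress_rule" "ssh" {
  for_each = toset(var.allowed_ssh_cidrs)

  security_group_id = aws_security_group.bastion.id
  cidr_ipv4         = each.value
  ip_protocol       = "tcp"
  from_port         = 22
  to_port           = 22
}
```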

A.4.3 P1 — RDS master password generated by Terraform, stored in state and SSM

  • Problem: random_password result is written to state and then to an SSM SecureString; manage_master_user_password = false explicitly opts out of AWS Secrets Manager integration.
  • Evidence: modules/rds/main.tf:5-8, 28-29, 83-93
  • Suggested change: Set manage_master_user_password = true and let RDS rotate through Secrets Manager.
  • Estimated effort: M
  • Risk if ignored: Credentials sprawl; manual rotation; state file becomes a secret.
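
To illustrate the suggested change, the sketch below uses the plain aws_rds_cluster resource rather than the community rds-aurora module the repo pins, so the module's own input names are not assumed here. The engine, identifier, and username are hypothetical; the 0.5 to 1.0 ACU range is the dev sizing quoted in the scorecard.

```hcl
# Sketch: let RDS create and rotate the master password in Secrets Manager.
resource "aws_rds_cluster" "this" {
  cluster_identifier          = "dev-af-backend"    # hypothetical
  engine                      = "aurora-postgresql" # assumption: engine is not stated in this review
  master_username             = "app_admin"         # hypothetical
  manage_master_user_password = true                # no random_password, no password in tfvars
  storage_encrypted           = true

  serverlessv2_scaling_configuration {
    min_capacity = 0.5
    max_capacity = 1.0
  }
}

# Consumers read the secret by ARN instead of receiving the password as an output.
output "db_master_secret_arn" {
  value = aws_rds_cluster.this.master_user_secret[0].secret_arn
}
```

With this arrangement the application would fetch credentials from Secrets Manager at runtime, which is the "App env var update" dependency noted in the recommendations.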

A.4.4 P1 — No security/policy scanning in CI

  • Problem: make lint runs fmt, validate, tflint, terraform-docs — no tfsec, checkov, trivy, or OPA/Conftest.
  • Evidence: Makefile:12-18; .github/workflows/lint.yaml.
  • Suggested change: Add a checkov -d modules -d environments (or tfsec .) step with a curated ignore list.
  • Estimated effort: S
  • Risk if ignored: Misconfigurations merge unnoticed.

A.4.5 P1 — State bucket is shared across stacks

  • Problem: Every dev-* environment points its backend at dev-hih-backend-048599825724-us-east-1 and reads siblings' state via terraform_remote_state. A compromised backend apply can read or clobber analytics state.
  • Evidence: environments/dev-backend/providers.tf:3-6; environments/dev-backend/main.tf:1-17
  • Suggested change: Scope IAM access to per-stack key prefixes within the existing bucket, or split the shared bucket into per-stack state buckets.
  • Estimated effort: M
  • Risk if ignored: Lateral blast radius within an environment during credential compromise.
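
A prefix-scoped policy for the role that applies one stack might be sketched as follows. The key layout (dev-backend/ and dev-shared/ prefixes) is an assumption; only the bucket name comes from this review.

```hcl
# Sketch: state access for the dev-backend apply role, limited to its own key prefix.
locals {
  state_bucket_arn = "arn:aws:s3:::dev-hih-backend-048599825724-us-east-1"
}

data "aws_iam_policy_document" "dev_backend_state" {
  statement {
    sid       = "ListOwnPrefix"
    actions   = ["s3:ListBucket"]
    resources = [local.state_bucket_arn]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["dev-backend/*"] # hypothetical prefix
    }
  }

  statement {
    sid = "ReadWriteOwnState"
    # DeleteObject is included so the lock object can be released after an apply.
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["${local.state_bucket_arn}/dev-backend/*"]
  }

  statement {
    sid       = "ReadSiblingOutputs"
    actions   = ["s3:GetObject"] # read-only, for terraform_remote_state lookups
    resources = ["${local.state_bucket_arn}/dev-shared/*"]
  }
}
```

Under these assumptions a compromised backend apply could still read the sibling state it depends on, but could no longer overwrite it.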

A.4.6 P2 — environments/dev-backend/main.tf is a 510-line copy of stg-backend/main.tf

  • Problem: The dev and stg backend environments are parallel, structurally identical files, a classic setup for shotgun surgery. The analytics environments follow the same pattern at 640 LOC each.
  • Evidence: environments/dev-backend/main.tf and environments/stg-backend/main.tf both 510 LOC.
  • Suggested change: Extract modules/stacks/backend so env dirs become thin terraform.tfvars + module "backend" call.
  • Estimated effort: L
  • Risk if ignored: Drift between environments; review burden; prod (when created) will be a fourth copy.
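
After the proposed extraction, an environment directory might shrink to something like the sketch below. modules/stacks/backend does not exist yet, so this only shows the intended shape; the variable shapes and the environment input are hypothetical, while the input and output names are the ones listed under API endpoints.

```hcl
# Sketch: environments/dev-backend/main.tf as a thin wrapper (proposed, not current).
variable "app_name" {
  type = string
}

variable "frontend_url" {
  type = string
}

variable "api_ecs_task_config" {
  type = object({ min_instances = number, max_instances = number }) # hypothetical shape
}

variable "db_scaling_config" {
  type = object({ min_acu = number, max_acu = number }) # hypothetical shape
}

module "backend" {
  source = "../../modules/stacks/backend" # proposed module path from this finding

  environment         = "dev" # hypothetical input
  app_name            = var.app_name
  frontend_url        = var.frontend_url
  api_ecs_task_config = var.api_ecs_task_config
  db_scaling_config   = var.db_scaling_config
}

output "api_dns_name" {
  value = module.backend.api_dns_name
}
```

A stg or prod directory would then differ only in its terraform.tfvars and backend key, which is what removes the parallel copies.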

A.4.7 P2 — No CloudWatch alarms in IaC

  • Problem: containerInsights = enabled and WAF emits metrics, but no aws_cloudwatch_metric_alarm resources exist anywhere.
  • Evidence: grep -r metric_alarm modules environments returns nothing.
  • Suggested change: Add modules/alarms with RDS CPU, ECS task count, ALB 5xx, SQS DLQ depth, WAF blocked request surge.
  • Estimated effort: M
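
As one concrete instance of the suggested alarms, the sketch below raises an alarm whenever the mapping service DLQ holds any message. The queue is declared here only as a stand-in for the existing aws_sqs_queue.mapping_service_dlq, and the queue name, SNS topic, and alarm name are hypothetical.

```hcl
# Stand-in for the DLQ defined in environments/dev-backend/main.tf.
resource "aws_sqs_queue" "mapping_service_dlq" {
  name = "dev-mapping-service-dlq" # hypothetical name
}

# Hypothetical notification target for the on-call rotation.
resource "aws_sns_topic" "ops_alerts" {
  name = "dev-ops-alerts"
}

resource "aws_cloudwatch_metric_alarm" "mapping_service_dlq_depth" {
  alarm_name          = "dev-mapping-service-dlq-not-empty"
  alarm_description   = "Messages are landing in the mapping service dead-letter queue."
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching" # an idle queue should not page anyone

  dimensions = {
    QueueName = aws_sqs_queue.mapping_service_dlq.name
  }

  alarm_actions = [aws_sns_topic.ops_alerts.arn]
}
```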

A.5 Things That Don't Make Sense

  1. Observation: environments/common-shared/main.tf is 6 lines and environments/dev-shared/main.tf is 26 lines — beside 510-line peers.

    • Question for author: Is "shared" intentionally minimal, or a placeholder?
  2. Observation: Two CORS rules on the file submission bucket — one allowing only frontend_url, another (dev-only) allowing *.

    • Location: environments/dev-backend/main.tf:300-318
    • Question for author: Why not a per-env allowed_origins list variable?
  3. Observation: default_tags is set in every providers.tf but no tagging-compliance policy enforces it.

    • Location: environments/dev-backend/providers.tf:19-25

A.6 Anti-Patterns Detected

A.6.1 Code-level

  • Copy-paste / duplication — env files duplicated across dev/stg.
  • Magic numbers — WAF priority = 0..5; container 256/512.

A.6.2 Architectural

  • Shotgun surgery — any backend-stack change must be made to dev + stg (+ prod once added).
  • Missing seams for testing — no terraform test, no Terratest.
  • Leaky abstraction — environments/dev-backend/main.tf mixes stack glue with top-level resources rather than delegating to a stack module.

A.6.3 Data

  • None observed in IaC repo.

A.6.4 Async / Ops

  • Work queues without visibility — aws_sqs_queue.mapping_service_dlq (environments/dev-backend/main.tf:427-430) has no depth alarm.

A.6.5 Security

  • Secrets in state — RDS master password (modules/rds/main.tf:5-8, 89-93); bastion private key (modules/bastion-host/main.tf:5).
  • Overbroad IAM / SG — bastion SSH 0.0.0.0/0; AWSCodeDeployRoleForECS attached wholesale; shared state bucket.

A.6.6 Detected Instances

| # | Anti-pattern | Location (file:line) | Severity | Recommendation |
|---|---|---|---|---|
| 1 | Bastion SSH open world | modules/bastion-host/main.tf:16-23 | P0 | Restrict CIDRs or SSM Session Manager. |
| 2 | DB password in TF state + SSM | modules/rds/main.tf:5-8, 89-93 | P1 | manage_master_user_password = true. |
| 3 | Shared state bucket across stacks | environments/dev-backend/providers.tf:3 | P1 | Per-stack/account buckets with key-prefix IAM. |
| 4 | Broad CodeDeploy managed-policy attach | modules/service/main.tf:362-365 | P2 | Replace with scoped custom policy. |
| 5 | Shotgun surgery across env dirs | environments/dev-backend/main.tf, stg-backend/main.tf | P2 | Extract modules/stacks/backend. |
| 6 | SQS DLQ with no depth alarm | environments/dev-backend/main.tf:427-430 | P2 | Add aws_cloudwatch_metric_alarm. |
| 7 | Magic constants for task CPU/mem | modules/service/main.tf:152-153 | P2 | Promote to variables. |
| 8 | Dev-only * CORS branch in shared file | environments/dev-backend/main.tf:309-317 | P2 | Parameterize allowed_origins. |

A.7 Open Questions

  1. Q: Does a prod AWS account exist yet, and if so why is there no environments/prod-* tree?
  2. Q: Is the state bucket versioned + SSE-encrypted + MFA-delete?
  3. Q: Is there a GitHub Environment protection rule requiring approval before apply.yaml runs against prod?
  4. Q: Why manage_master_user_password = false on Aurora (modules/rds/main.tf:29)?

A.8 Difficulties Encountered

  • Difficulty: No running state visible; cannot see whether resources match HCL.
    • Impact: Drift and "claimed prod" cannot be verified.
    • Fix: Commit a terraform state list snapshot under doc/.
  • Difficulty: Companion docs live in af-analysis, not next to the code.
    • Impact: Bus-factor on discoverability.
  • Difficulty: No CHANGELOG; had to read parallel files side-by-side.

A.9 Risks & Unknowns

A.9.1 Known risks

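| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Prod env absent but advertised | H | H | A.4.1 |
| 2 | Bastion SSH exposed | H | H | A.4.2 |
| 3 | DB password in state | M | H | A.4.3 |
| 4 | No tfsec/checkov | M | M | A.4.4 |
| 5 | State-bucket blast radius | L | H | A.4.5 |
| 6 | Env-directory drift | M | M | A.4.6 |
| 7 | No alarm wiring | M | M | A.4.7 |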

A.9.2 Unknown unknowns

  • CodeDeploy rollback runtime behavior (HCL-only review).
  • State bucket versioning/encryption/replication (not in backend block).
  • Analytics stack (*-analytics/main.tf @ 640 LOC each) — time-boxed.
  • .github/actions/for-each-* composite actions.

A.10 Technical Debt Register

| # | Debt item | Quadrant | Interest | Remediation |
|---|---|---|---|---|
| 1 | Parallel env directories (dev-backend/stg-backend @ 510 LOC) | Prudent & Inadvertent | 2–3× change time | A.4.6 |
| 2 | Missing prod-* env dirs | Reckless & Inadvertent | Broken on first use | A.4.1 |
| 3 | RDS password via random_password + SSM | Reckless & Deliberate | Manual rotation | A.4.3 |
| 4 | Bastion SSH 0.0.0.0/0 | Reckless & Inadvertent | Open brute-force | A.4.2 |
| 5 | No tfsec/checkov in CI | Prudent & Deliberate | Misconfigs slip in | A.4.4 |
| 6 | Shared state bucket across stacks | Prudent & Inadvertent | Lateral blast radius | A.4.5 |
| 7 | No alarms in IaC | Prudent & Inadvertent | Out-of-band ops | A.4.7 |
| 8 | Magic container CPU/mem constants | Prudent & Inadvertent | Per-service tuning hard | Promote to variables |
| 9 | No terraform test / Terratest | Prudent & Deliberate | Refactors are scary | Smoke tests on plan |

A.11 Security Posture (lightweight STRIDE)

| Category | Threat present? | Mitigated? | Gap |
|---|---|---|---|
| Spoofing (identity) | Yes (GH Actions → AWS) | Partial: OIDC with per-repo subjects (modules/service/main.tf:479) | Trust condition is :*; narrow to :ref:refs/heads/main or :environment:prod. |
| Tampering (integrity) | Yes (state, SSM) | Partial | State bucket config not verifiable from repo; RDS pw in state. |
| Repudiation | Yes | Partial | CloudTrail not in-repo; assumed org-level. |
| Information Disclosure | Yes | Partial | DB pw in SSM + state; bastion private key in state. |
| Denial of Service | Yes | Yes | WAF rate-based rule + managed rule groups. |
| Elevation of Privilege | Yes | Partial | Bastion open SSH; managed CodeDeploy role; shared state bucket. |
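
The Spoofing gap above (a :* subject) could be narrowed with a trust policy along these lines. This sketch writes the policy document directly rather than through the community IAM module the repo uses, so that module's inputs are not assumed; the variable names are hypothetical.

```hcl
# Sketch: GitHub OIDC trust limited to the main branch and the prod environment.
variable "github_repository_name" {
  description = "Repository in org/name form."
  type        = string
}

variable "github_oidc_provider_arn" {
  description = "ARN of the IAM OIDC provider for token.actions.githubusercontent.com."
  type        = string
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Exact subjects instead of a wildcard: one branch, one GitHub Environment.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:${var.github_repository_name}:ref:refs/heads/main",
        "repo:${var.github_repository_name}:environment:prod",
      ]
    }
  }
}
```

Jobs that declare a GitHub Environment present the environment form of the subject, so which of the two values is needed depends on how the workflows are written.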

A.12 Operational Readiness

| Capability | Present / Partial / Missing | Notes |
|---|---|---|
| Structured logs | Partial | aws_cloudwatch_log_group "this" created; shape owned by app. |
| Metrics | Partial | Container Insights + WAF metrics; no custom alarms. |
| Distributed tracing | Missing | No X-Ray / OTEL resources. |
| Actionable alerts | Missing | No aws_cloudwatch_metric_alarm. |
| Runbooks | Present | data/af-infra/runbook.md. |
| On-call ownership defined | Partial | CODEOWNERS present. |
| SLOs / SLIs | Missing | None. |
| Backup & restore tested | Partial | backup_retention_period = 7 in dev; no drill evidence. |
| Disaster recovery plan | Partial | Single region; no cross-region replication in IaC. |
| Chaos / failure testing | Missing | N/A for IaC. |

A.13 Test & Quality Signals

  • Coverage: N/A (lint-only).
  • Untested critical paths: modules/service blue/green; VPC peering rollback.
  • Missing test types: [x] unit [x] integration [ ] e2e [ ] contract [ ] load [x] security/fuzz.

A.14 Performance & Cost Smells

  • Hot paths: ECS Fargate tasks at 256/512 — fine for dev, likely undersized for prod.
  • Suspected bottlenecks: Aurora max 1.0 ACU in dev; prod must raise.
  • Oversized / idle: Bastion EC2 runs 24/7 (modules/bastion-host/main.tf:50-64); consider schedule-stop or SSM.

A.15 Bus-Factor & Knowledge Risk

  • Only-person knowledge: State-bucket bootstrap (who originally ran aws s3 mb?); why manage_master_user_password = false.
  • What breaks: Re-bootstrapping state into a new account; interpreting common-shared 6-LOC minimalism.
  • Knowledge-transfer actions: doc/bootstrap.md describing state-bucket creation; ADR for the RDS password decision.

A.16 Compliance Gaps

| Regulation | Requirement | Status | Gap | Remediation |
|---|---|---|---|---|
| AWS Well-Architected (Security) | Least-privilege IAM | Partial | Bastion SG, shared state bucket, managed-policy attach | A.4.2, A.4.5 |
| AWS Well-Architected (Reliability) | Alarm coverage | Missing | No metric_alarm | A.4.7 |
| CIS AWS FSBP | Encrypted + versioned state bucket | Unknown | Not in backend block | Verify and document |
| Internal | Rate-limit + SQLi on public API | Present | | |

A.17 Recommendations Summary

| Priority | Action | Owner | Effort | Depends on |
|---|---|---|---|---|
| P0 | Create environments/prod-{backend,analytics,shared} OR drop prod from .github/workflows/apply.yaml enum. | Platform lead | M | Prod AWS account decision |
| P0 | Restrict bastion SSH (modules/bastion-host/main.tf:16-23) or migrate to SSM Session Manager. | SRE | S | |
| P1 | Switch RDS to manage_master_user_password = true + Secrets Manager; drop aws_ssm_parameter.password. | Platform + af-backend-go-api | M | App env var update |
| P1 | Add tfsec or checkov step to .github/workflows/lint.yaml. | SRE | S | |
| P1 | Per-stack (or per-account) state buckets with prefix-scoped IAM. | Platform lead | M | Prod dir decision |
| P1 | Add modules/alarms (RDS CPU, ALB 5xx, SQS DLQ depth, WAF blocked surge). | SRE | M | |
| P2 | Extract modules/stacks/backend to eliminate env copy-paste. | Platform lead | L | |
| P2 | Replace AWSCodeDeployRoleForECS attachment with a scoped custom policy. | SRE | S | |
| P2 | Promote container CPU/mem and CORS allowed_origins to module variables. | Platform | S | |
| P2 | Narrow GitHub OIDC trust subjects from "repo:*" to ref/environment scoped. | SRE | S | |
| P2 | Add terraform test / Terratest smoke tests. | Platform | M | |
| P2 | Add CloudWatch alarm on mapping_service_dlq depth. | SRE | S | modules/alarms |

Environment variables

| Name | Purpose |
|---|---|
| AWS_PROFILE* | AWS SSO profile (operator path) |
| AWS_REGION* | Hardcoded us-east-1 in locals.tf + GH env var |
| AWS_ACCOUNT_ID* | Per-GitHub-Environment; used to build OIDC role ARN |
| AWS_CICD_ROLE_NAME* | OIDC-assumable role name |
| TF_IN_AUTOMATION | Suppresses interactive prompts in CI |
| TF_PLUGIN_CACHE_DIR | Speeds up init |
| TF_VAR_* | Override any variable defined in variables.tf |