Go service built on gqlgen v0.17.86 and the chi router, with a clean-architecture layout (domain / service / infrastructure). The main store is PostgreSQL via pgx v5; a separate read-only households DB serves eligibility lookups. Auth is Firebase (email-link OOB in dev, ID-token verification in prod). The Plaid SDK powers bank linking. AI agents run as ECS Fargate containers, orchestrated via a JWT-secured control plane plus a CloudFront signed-URL data plane.

Role in the system: Receives form submissions from af-frontend, persists to PostgreSQL, runs eligibility via the af-targeting analytics service, spawns per-survivor Fargate AI containers (af-disaster-assistance-gov-agent / af-fema-real-ai-agent) and exposes their VNC endpoints back to the user.

Surfaces:

GraphQL Query: form, forms, formSubmission(s), formFiles, formPrefill, activeDisasters, householdDetailedInfo, actualHouseholdInfo, me, accounts(admin)
GraphQL Mutation: signIn, updateMe, submitForm, finalizeFormSubmission, continueAgent, createPlaidLinkToken, linkBankAccount
Custom directives: @account, @strongAuth, @authorizeAPIKey, @validate, @goField
REST: POST /api/graphql, GET /deployment-badge, implicit /health

User workflows

Workflow	Outcome
Survivor sign-in	Bearer token usable on all @account-guarded operations
Form submission (two-step S3 upload)	FormSubmissionInfo with eligibility + expectedAmount; AI state machine begins
AI form automation (lazy polling)	FEMA application submitted on survivor's behalf; VNC remains available 48h for review
Household eligibility lookup	Returns ActualHouseholdInfo for the survivor's address
Plaid bank linking	Account/routing/bank metadata returned to frontend; survivor uses it to autofill financial form fields

API endpoints

Type	Operation	Description
QUERY	form(formId: UUID!)	Load form structure with sections, fields, conditions
QUERY	formSubmission(formSubmissionId!)	Load submission with state, eligibility, AI status
QUERY	formPrefill(formId!)	Latest submitted values for prefill
QUERY	activeDisasters	List active declared disasters
QUERY	actualHouseholdInfo(limit, offset)	Account's address-matched household + program eligibility
QUERY	me	Current account profile
QUERY	accounts(limit, offset)	Admin: list accounts
MUTATION	signIn	Verify Firebase token, upsert account
MUTATION	updateMe(input)	Update profile
MUTATION	submitForm(formId!, input)	Validate + persist submission, return presigned S3 URLs
MUTATION	finalizeFormSubmission(submissionId!)	Mark files uploaded, call analytics, fire AI container
MUTATION	continueAgent(submissionId!)	Resume Agent 2 after user Login.gov 2FA in VNC
MUTATION	createPlaidLinkToken	Mint Plaid Link token
MUTATION	linkBankAccount(publicToken!)	Exchange public token, return account details (not persisted)
POST	/api/graphql	GraphQL HTTP transport
GET	/deployment-badge	CloudFront-served deployment badge metadata

Third-party APIs

API	Used for
Firebase Auth	Email-link OOB sign-in + ID-token verification
Plaid	Bank account linking, account/routing retrieval
Mapbox (indirect)	Tileset namespace owner; no direct API calls from this service (af-map handles uploads)
AI Container API (control plane)	Start/stop/suspend per-survivor Fargate containers
AI Agent API (data plane)	Agent orchestration: /health, /survivor-info, /state, /agent1/run, /agent2/run
User Update Service (analytics)	Eligibility determination from form data

Service dependencies

Dependency	Role
PostgreSQL (main)	Accounts, forms, submissions, AI/analytics metadata
PostgreSQL (households RO)	Read-only households DB synced from af-targeting BigQuery; eligibility lookups
AWS S3	Form file submissions (presigned PUT URLs)
af-map	Downstream tile worker; planned SNS/SQS trigger on disaster updates (not yet implemented)
af-disaster-assistance-gov-agent / af-fema-real-ai-agent	Per-survivor Fargate containers spawned by this service
af-targeting (User Update Service)	Eligibility scoring + household sync

Analysis

Overall health: 3.6 / 5 (strong)

Dimension	Score
Module overview / clarity of intent	5
External dependencies	4
API endpoints	4
Database schema	4
Backend services	3
WebSocket / real-time	2
Data flow clarity	4
Error handling & resilience	3
Configuration	4
Data refresh patterns	2
Performance	3
Module interactions	4
Troubleshooting / runbooks	4
Testing & QA	3
Deployment & DevOps	4
Security & compliance	3
Documentation & maintenance	5
Roadmap clarity	4

af-backend-go-api — Prop-Build Analysis

Document Type: Critical Review & Analysis (companion to prop-build-template.md)
Scope: Per-Repo
Subject: af-backend-go-api (AidFinder GraphQL backend)
Reviewer(s): Claude (automated code review)
Date: 2026-04-09
Version: 0.1
Confidence Level: Medium
What would raise confidence: running tests locally, prod metrics/traces, interview with Jiri/Marek/Matúš, observing an actual AI-agent submission end-to-end.

Inputs Reviewed:

Prop-build doc: data/af-backend-go-api.yaml
Companion docs: data/af-backend-go-api/{api-examples,data-flow,deployment,runbook}.md, schema.graphql
Source tree: /Users/andres/src/af/af-backend-go-api/ (124 non-test .go files, 35 test files, 16 tern migrations, 6 GitHub Actions workflows)

Part A — Per-Repo Analysis

A.1 Executive Summary

Overall health: Solid, conventionally-structured Go service with clean-architecture layering, strong typing, and meaningful test coverage on hot domain logic; the biggest risks are operational (polling-based AI state, fire-and-forget goroutines) and a handful of latent correctness/security smells rather than structural rot.
Top risk: The AI-agent lifecycle is driven by an unsupervised goroutine spawned from FinalizeFormSubmission (service/application/formsubmission/formsubmission.go:203-215) using context.WithoutCancel, with no DLQ, no retry, no persistence of in-flight state, and no visibility metric — a crash mid-poll silently strands a submission (see A.4.1).
Top win / worth preserving: Clean domain/service/infrastructure split with hand-rolled SQL + scany (no ORM), dependency-injected services validated at construction (formsubmission.go:86-116), and a disciplined structured-error taxonomy (api/graphql/errors/error.go, README error codes).
Single recommended next action: Persist AI-lifecycle work as a durable job (DB outbox + a polling worker or SQS) and expose depth/age metrics. The only other P0 in A.17, the PII retention schedule, is gated on legal sign-off; everything else on the list is P1/P2.
Blocking unknowns: Prod metrics, actual coverage numbers, RDS index state for households RO, whether FIXED_COMPLEXITY_LIMIT=20 is tuned for real clients, and whether the planned SNS/SQS → af-map publisher exists on any branch.

A.2 Health Scorecard

#	Dimension	Score (1–5)	Justification
1	Module overview / clarity of intent	5	README + prop-build doc + companion files are thorough and match code.
2	External dependencies	4	Modern, pinned versions (`go.mod`); Firebase/Plaid/AWS SDKs wrapped behind interfaces.
3	API endpoints	4	Schema-first gqlgen with directives (`api/graphql/graph/schema.resolvers.go`, 675 LOC resolvers); introspection on in all envs is a minor smell.
4	Database schema	4	16 hand-written tern migrations, XOR field-type subtables, deferrable constraints; lacks explicit retention.
5	Backend services	3	`FinalizeFormSubmission` carries a `//nolint:funlen,cyclop,gocognit` tag at `formsubmission.go:146` — acknowledged complexity; AI orchestration leaks through application layer.
6	WebSocket / real-time	2	No subscriptions/WS; only client polling (`cmd/api/graphql/main.go:83-88` TODO comment).
7	Frontend components	N/A	Backend only.
8	Data flow clarity	4	`data-flow.md` matches the code; sequence is traceable through resolvers → application → domain.
9	Error handling & resilience	3	Good error taxonomy, but no circuit breakers, no retries on outbound HTTP, TODO comment at `formsubmission.go:171-172` ("find out how to handle errors").
10	Configuration	4	Env-var parsing via strv config in `cmd/api/graphql/setup/setup.go`; validated at startup.
11	Data refresh patterns	2	Lazy poll with 2-min cooldown (`aiRefreshInterval`, `formsubmission.go:27`) is fragile; no caching of hot reads.
12	Performance	3	gqlgen complexity cap + dataloader v7 for N+1 are good; `ReadFormPrefill` (`domain/form/postgres/form.go:646`) is the known O(n) hotspot.
13	Module interactions	4	Well-bounded; dependency boundaries crisp; only real coupling is downstream agents.
14	Troubleshooting / runbooks	4	`runbook.md` exists and is specific; missing alert thresholds.
15	Testing & QA	3	35 test files for 124 source files; unit coverage on form transform, JWT, auth middleware, agents; no coverage number published; no load/fuzz.
16	Deployment & DevOps	4	6 GH Actions workflows (`build`, `tests`, `lint`, `release`, `deploy-task-definition`, `vuln-scan`), multi-stage Dockerfile, tern-on-startup migrations.
17	Security & compliance	3	Firebase + directive-based authz is correct; HS256 JWT with shared secret to AI control plane (`service/infrastructure/ai/container/jwt.go:20`) and CORS default `*` are soft spots (see A.6.5).
18	Documentation & maintenance	5	README is exemplary; CHANGELOG present; CODEOWNERS file.
19	Roadmap clarity	4	Tech-debt + planned items enumerated in YAML §19 and README.

Overall score: 3.61 (average of 18 scored rows; Frontend N/A excluded). Weighted reading: the repo is comfortably above "acceptable" on code, docs, and deployment; it dips into "needs work" on real-time/refresh (rows 6 and 11) and on the resilience of its most valuable flow (row 9). In other words, the weakest links are exactly where a survivor's submission lives.

A.3 What's Working Well

Strength: Dependency injection with explicit nil-checks at construction.
- Location: service/application/formsubmission/formsubmission.go:86-116
- Why it works: Every collaborator is an interface; constructor refuses to return a half-wired service. Makes testing trivial and failures loud at startup instead of runtime.
- Propagate to: Any repo still using package-level globals for dependencies.
Strength: Strong ID types via code-gen (types/id/).
- Location: types/id/id.go, types/id/id_gen.go
- Why it works: Compile-time separation of id.Account / id.FormSubmission / id.Container / id.Form avoids primitive obsession; swapping a submissionID for an accountID is a type error, not a runtime bug.
- Propagate to: Other Go services that pass bare UUIDs around.
Strength: Structured, code-first error taxonomy surfaced through GraphQL extensions.
- Location: api/graphql/errors/error.go, api/graphql/graph/error.go, README error codes section
- Why it works: Frontend can pattern-match on stable codes (ERR_FORM_*, FIELD_*) instead of scraping strings. api/graphql/errors/error_test.go locks the contract.
- Propagate to: Any repo where the frontend currently greps errors by message.
Strength: Dataloader v7 wired as request-scoped middleware.
- Location: api/graphql/middleware/dataloader.go + dataloader/context.go
- Why it works: Prevents the classic GraphQL N+1 on nested resolvers without forcing batch-everywhere discipline on the resolver layer.
- Propagate to: Any GraphQL server still writing naive per-field fetches.
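
For repos adopting the first strength in this list (construction-time dependency validation), the pattern can be sketched as follows. This is a minimal illustration, not the code at formsubmission.go:86-116: the interfaces `SubmissionStore` and `AnalyticsClient`, the fakes, and the error messages are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical collaborator interfaces; the real service injects more of them.
type SubmissionStore interface {
	MarkFinalized(submissionID string) error
}

type AnalyticsClient interface {
	UserUpdate(submissionID string) error
}

// Service is only obtained through NewService, so its collaborators are
// known to be set once construction has succeeded.
type Service struct {
	store     SubmissionStore
	analytics AnalyticsClient
}

// NewService refuses to return a half-wired service. Every missing
// collaborator is reported, so a wiring mistake fails at startup.
func NewService(store SubmissionStore, analytics AnalyticsClient) (*Service, error) {
	var errs []error
	if store == nil {
		errs = append(errs, errors.New("submission store is nil"))
	}
	if analytics == nil {
		errs = append(errs, errors.New("analytics client is nil"))
	}
	if err := errors.Join(errs...); err != nil {
		return nil, fmt.Errorf("formsubmission: invalid dependencies: %w", err)
	}
	return &Service{store: store, analytics: analytics}, nil
}

// In-memory fakes used only by the demo and by tests.
type fakeStore struct{}

func (fakeStore) MarkFinalized(string) error { return nil }

type fakeAnalytics struct{}

func (fakeAnalytics) UserUpdate(string) error { return nil }

func main() {
	if _, err := NewService(fakeStore{}, nil); err != nil {
		fmt.Println("startup aborted:", err)
	}
	svc, err := NewService(fakeStore{}, fakeAnalytics{})
	fmt.Println(svc != nil, err)
}
```

Note that a plain `== nil` check does not catch a typed nil pointer wrapped in an interface; that limitation applies to this sketch as well.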

A.4 What to Improve

A.4.1 P0 — AI lifecycle is a fire-and-forget goroutine with no durability or visibility

Problem: The entire post-finalize AI flow (start container, poll ~5min, submit survivor info, run agent 1) runs in a detached goroutine spawned from FinalizeFormSubmission, using context.WithoutCancel(ctx). If the pod is killed between finalize and any of the three pollFor… loops, the submission sits in an intermediate state forever; only the client's next poll can nudge refreshAIAgentState back into life, and only if the container itself is still running. There is no DLQ, no retry budget, no visible queue depth, no metric for in-flight jobs.
Evidence:
- service/application/formsubmission/formsubmission.go:203-215 (detached goroutine, context.WithoutCancel, panic recovery that only logs).
- service/application/formsubmission/formsubmission.go:454-530 three pollFor… loops bounded only by aiAPITimeout = 5 * time.Minute (formsubmission.go:24).
- cmd/api/graphql/main.go:83-88 — BeforeShutdown hook is empty; the TODO comment explicitly notes "can be for example closing of websocket connections, or messages handler (SQS, pub/sub)" — there is no graceful-drain for in-flight AI jobs.
- formsubmission.go:171-172 — // TODO find out how to handle errors on the analytics path that precedes the AI kick-off.
Suggested change: Persist a row in a pending_ai_jobs table (or SQS message) inside the finalize transaction; have a worker (in-proc ticker or separate consumer) pick it up. Emit a pending_ai_jobs_depth and oldest_pending_age metric. Short-term, at minimum add a startup reaper that scans submissions in non-terminal AI states older than aiAPITimeout and transitions them to ERROR.
Estimated effort: M
Risk if ignored: Silent data loss of the platform's headline feature; survivors see an eternally "running" state after any pod restart.
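
The short-term startup reaper suggested above could look roughly like this. It is a sketch under stated assumptions: the state names, the `Submission` struct, and `reapStale` are hypothetical, and the slice stands in for a query over submissions in non-terminal AI states. Only the 5-minute timeout mirrors the `aiAPITimeout` value cited in the evidence.

```go
package main

import (
	"fmt"
	"time"
)

// AIState values are hypothetical stand-ins for the service's AI states.
type AIState string

const (
	StateStarting      AIState = "STARTING"
	StateAgent1Running AIState = "AGENT1_RUNNING"
	StateDone          AIState = "DONE"
	StateError         AIState = "ERROR"
)

// Submission is a simplified, hypothetical view of a form submission row.
type Submission struct {
	ID        string
	State     AIState
	UpdatedAt time.Time
}

func isTerminal(s AIState) bool { return s == StateDone || s == StateError }

// reapStale moves every non-terminal submission whose last update is older
// than timeout to ERROR (in place) and returns the IDs it changed.
func reapStale(subs []Submission, now time.Time, timeout time.Duration) []string {
	var reaped []string
	for i := range subs {
		s := &subs[i]
		if isTerminal(s.State) || now.Sub(s.UpdatedAt) <= timeout {
			continue
		}
		s.State = StateError
		s.UpdatedAt = now
		reaped = append(reaped, s.ID)
	}
	return reaped
}

func main() {
	now := time.Date(2026, 4, 9, 12, 0, 0, 0, time.UTC)
	subs := []Submission{
		{ID: "a", State: StateStarting, UpdatedAt: now.Add(-10 * time.Minute)},
		{ID: "b", State: StateAgent1Running, UpdatedAt: now.Add(-1 * time.Minute)},
		{ID: "c", State: StateDone, UpdatedAt: now.Add(-2 * time.Hour)},
	}
	fmt.Println(reapStale(subs, now, 5*time.Minute)) // prints [a]
}
```

Passing `now` in as a parameter keeps the function deterministic in tests. In a real deployment the same predicate would more likely live in a single `UPDATE ... WHERE` statement so that several replicas starting at once do not race.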

A.4.2 P1 — `FinalizeFormSubmission` is explicitly too large and mixes concerns

Problem: The finalize path carries //nolint:funlen,cyclop,gocognit — an author-acknowledged smell. It validates files, calls analytics, decides eligibility, launches the goroutine, and owns error translation in one function.
Evidence: service/application/formsubmission/formsubmission.go:146-221 (~75 lines, 4 external services, 7 error branches, 1 goroutine).
Suggested change: Split into (a) finalizeAndScore (files + analytics + DB update), (b) enqueueAIJob (the durable-queue piece from A.4.1), (c) shouldRunAI policy helper. The anti-pattern suppression comment should then be removable.
Estimated effort: S
Risk if ignored: Any change to finalize logic touches the same fragile 75-line block; regressions are likely.

A.4.3 P1 — Form prefill is a known O(n) scan with no index

Problem: ReadFormPrefill fans out four batch queries across every submission a user has ever made to recover the latest value per field. Documented as slow in the YAML roadmap but no fix in progress.
Evidence: domain/form/postgres/form.go:646-700; YAML section_19_roadmap.tech_debt[0].
Suggested change: Either (a) add a (account_id, field_id, created_at DESC) index plus DISTINCT ON SQL, or (b) denormalize into a form_field_latest_value table updated on submit.
Estimated effort: S
Risk if ignored: Linear growth in per-user submission history turns prefill (called on every form open) into a tail-latency spike.
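
Option (a) can be sketched as below. The table name `form_field_value`, its columns, and the index name are assumptions made for illustration; the real schema splits values across the XOR subtables and would need one query per subtable or a view. The `latestPerField` function is an in-memory reference for the result the `DISTINCT ON` query is expected to return, which is handy as a test oracle.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Hypothetical DDL and query; table and column names are illustrative only.
const createPrefillIndex = `
CREATE INDEX IF NOT EXISTS idx_form_field_value_latest
    ON form_field_value (account_id, field_id, created_at DESC)`

// DISTINCT ON keeps the first row per field_id in ORDER BY order,
// which is the newest row because created_at is sorted descending.
const latestPerFieldQuery = `
SELECT DISTINCT ON (field_id) field_id, value, created_at
FROM form_field_value
WHERE account_id = $1
ORDER BY field_id, created_at DESC`

// FieldValue is one submitted value for one field.
type FieldValue struct {
	FieldID   string
	Value     string
	CreatedAt time.Time
}

// latestPerField returns the newest value per field, ordered by field ID.
func latestPerField(rows []FieldValue) []FieldValue {
	latest := make(map[string]FieldValue)
	for _, r := range rows {
		cur, ok := latest[r.FieldID]
		if !ok || r.CreatedAt.After(cur.CreatedAt) {
			latest[r.FieldID] = r
		}
	}
	out := make([]FieldValue, 0, len(latest))
	for _, v := range latest {
		out = append(out, v)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].FieldID < out[j].FieldID })
	return out
}

func main() {
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	rows := []FieldValue{
		{FieldID: "phone", Value: "111", CreatedAt: t0},
		{FieldID: "email", Value: "old@example.org", CreatedAt: t0.Add(time.Hour)},
		{FieldID: "phone", Value: "222", CreatedAt: t0.Add(2 * time.Hour)},
	}
	for _, v := range latestPerField(rows) {
		fmt.Println(v.FieldID, v.Value)
	}
}
```

With the index in place, Postgres can satisfy the equality on `account_id` and the `field_id, created_at DESC` ordering from the index alone, so the cost should no longer grow with the number of historical submissions scanned. Confirm with `EXPLAIN ANALYZE` against production-shaped data before relying on it.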

A.4.4 P1 — HS256 shared-secret JWT for AI control plane

Problem: Control-plane JWT uses HS256 with a shared secret (AI_JWT_SECRET), 1h TTL, 5-min refresh window; if either side leaks the secret, the attacker can mint arbitrary tokens for the AI control plane.
Evidence: service/infrastructure/ai/container/jwt.go:13-76 (sub = "integration-team", issuer = "saas-platform" — hard-coded, generic, no kid, no rotation hook).
Suggested change: Move to RS256/ES256 with asymmetric keys rotated through SSM/KMS; include kid to support zero-downtime rotation.
Estimated effort: M
Risk if ignored: A single leaked credential compromises the container orchestrator.
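
To make the suggested change concrete, here is a standard-library-only sketch of RS256 signing with a `kid` header and verification against a key set. It is not the code in jwt.go: `signRS256`, `verifyRS256`, `KeySet`, `mustGenerateKey`, and the `kid` values are hypothetical, keys are generated in-process rather than loaded from SSM/KMS, and claim checks (`exp`, `iss`, `aud`) are omitted for brevity. A production change would more likely use an established JWT library.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

var b64 = base64.RawURLEncoding

// KeySet maps a key ID (kid) to the public key the verifier trusts for it.
type KeySet map[string]*rsa.PublicKey

// mustGenerateKey creates a throwaway 2048-bit key for the demo.
func mustGenerateKey() *rsa.PrivateKey {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	return key
}

// signRS256 builds a compact JWS (header.payload.signature) signed with RS256.
func signRS256(key *rsa.PrivateKey, kid string, claims map[string]any) (string, error) {
	header, err := json.Marshal(map[string]string{"alg": "RS256", "typ": "JWT", "kid": kid})
	if err != nil {
		return "", err
	}
	payload, err := json.Marshal(claims)
	if err != nil {
		return "", err
	}
	signingInput := b64.EncodeToString(header) + "." + b64.EncodeToString(payload)
	digest := sha256.Sum256([]byte(signingInput))
	sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
	if err != nil {
		return "", err
	}
	return signingInput + "." + b64.EncodeToString(sig), nil
}

// verifyRS256 selects the public key by the token's kid and checks the
// signature. It does not validate exp, iss, or aud.
func verifyRS256(token string, keys KeySet) error {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return errors.New("malformed token")
	}
	rawHeader, err := b64.DecodeString(parts[0])
	if err != nil {
		return err
	}
	var header struct {
		Alg string `json:"alg"`
		Kid string `json:"kid"`
	}
	if err := json.Unmarshal(rawHeader, &header); err != nil {
		return err
	}
	if header.Alg != "RS256" {
		return fmt.Errorf("unexpected alg %q", header.Alg)
	}
	pub, ok := keys[header.Kid]
	if !ok {
		return fmt.Errorf("unknown kid %q", header.Kid)
	}
	sig, err := b64.DecodeString(parts[2])
	if err != nil {
		return err
	}
	digest := sha256.Sum256([]byte(parts[0] + "." + parts[1]))
	return rsa.VerifyPKCS1v15(pub, crypto.SHA256, digest[:], sig)
}

func main() {
	oldKey, newKey := mustGenerateKey(), mustGenerateKey()
	// During rotation the verifier trusts both keys; the signer uses the new kid.
	trusted := KeySet{"2026-03": &oldKey.PublicKey, "2026-04": &newKey.PublicKey}
	token, err := signRS256(newKey, "2026-04", map[string]any{"sub": "af-backend-go-api"})
	if err != nil {
		panic(err)
	}
	fmt.Println(verifyRS256(token, trusted)) // prints <nil>
}
```

The point of the `kid` lookup is that the control plane can trust the old and new public keys at the same time, so the signer can switch keys without a coordinated restart. Pinning `alg` to a single expected value also closes the algorithm-confusion class of bugs.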

A.4.5 P2 — CORS allow-list accepts `*` by config

Problem: YAML shows CORS_ALLOWED_ORIGINS default *; middleware simply forwards the config (api/graphql/middleware/cors.go:26-31). In combination with Firebase Bearer auth there is no browser-side CSRF surface, but the signal is wrong and makes auditing harder.
Evidence: api/graphql/middleware/cors.go:26,31; YAML env_vars entry for CORS_ALLOWED_ORIGINS.
Suggested change: Enforce an explicit allow-list in non-local envs; reject * when ENVIRONMENT != local.
Estimated effort: S
Risk if ignored: Policy drift; a future cookie-based flow would instantly inherit the *.
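
A startup check along these lines would enforce the suggestion. This is a sketch: `validateCORSOrigins` is a hypothetical helper, and the only facts taken from the review are the `ENVIRONMENT` values (local, dev, stg, prod) and the rule that `*` is acceptable only in local. Requiring `https` origins is an extra assumption of the example.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateCORSOrigins rejects the wildcard and any non-https origin unless
// the service runs in the local environment.
func validateCORSOrigins(environment string, origins []string) error {
	if environment == "local" {
		return nil
	}
	for _, origin := range origins {
		if origin == "*" {
			return fmt.Errorf("CORS_ALLOWED_ORIGINS: wildcard is not allowed when ENVIRONMENT=%s", environment)
		}
		u, err := url.Parse(origin)
		if err != nil || u.Scheme != "https" || u.Host == "" {
			return fmt.Errorf("CORS_ALLOWED_ORIGINS: %q is not an absolute https origin", origin)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateCORSOrigins("prod", []string{"*"}))
	fmt.Println(validateCORSOrigins("prod", []string{"https://app.example.org"}))
}
```

Calling this next to the existing startup config validation would make a misconfigured deployment fail fast instead of silently serving a permissive policy.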

A.5 Things That Don't Make Sense

Observation: refreshAIAgentState returns early if the state is StateAgent1Complete (formsubmission.go:403-405), but that is precisely the state from which ContinueAgent is expected to move forward. The function also spawns yet another detached goroutine to restart stale agents (formsubmission.go:419-430), which re-introduces the A.4.1 durability problem on a read path.
- Location: service/application/formsubmission/formsubmission.go:399-436
- Hypotheses considered: (a) intentional freeze so polling doesn't race with user VNC action; (b) performance-motivated early-return to avoid calling the control plane on every query.
- Question for author: Is there a reason a read (formSubmission query) causes a side-effect goroutine that can restart remote agents? Should restart be an explicit mutation instead?
Observation: cmd/api/graphql/main.go:99-101 leaves a commented-out globalWaitGroup.Wait() TODO for short-lived goroutines. This is the same concern as A.4.1 but filed as an aspirational comment.
- Question for author: Was a wait-group attempted and abandoned, and if so, why?
Observation: sub = "integration-team" and issuer = "saas-platform" in the AI control-plane JWT (jwt.go:14-15) look like left-over template values.
- Question for author: Intentional or pre-refactor placeholder?

A.6 Anti-Patterns Detected

A.6.1 Code-level

A.6.2 Architectural

Big ball of mud
Distributed monolith
Chatty services
Leaky abstraction — refreshAIAgentState in the application layer concretely knows about container states and restart semantics (formsubmission.go:399-452); should live behind a single AILifecycle facade.
Golden hammer
Vendor lock-in without exit strategy
Stovepipe / reinvented wheel
Missing seams for testing (hard-coded clocks, network, filesystem) — time.After is used directly in three polling loops (formsubmission.go:462, 491, 513), so tests cannot advance the clock and must wait real seconds. timesource.TimeSource is injected for Now() but not for sleeps.
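
The missing sleep seam noted in the last item could be closed with a small interface. The sketch below is hypothetical throughout (`Sleeper`, `realSleeper`, `fakeSleeper`, `pollUntil`, `errPollTimeout`) and uses an attempt budget rather than the service's actual timeout logic; it only shows how a poll loop stops depending on wall-clock time.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Sleeper is the seam: production code sleeps for real, tests do not.
type Sleeper interface {
	Sleep(ctx context.Context, d time.Duration) error
}

// realSleeper waits for d or until the context is cancelled.
type realSleeper struct{}

func (realSleeper) Sleep(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// fakeSleeper returns immediately and records the requested durations.
type fakeSleeper struct{ slept []time.Duration }

func (f *fakeSleeper) Sleep(_ context.Context, d time.Duration) error {
	f.slept = append(f.slept, d)
	return nil
}

var (
	_ Sleeper = realSleeper{}
	_ Sleeper = (*fakeSleeper)(nil)
)

var errPollTimeout = errors.New("poll: attempts exhausted")

// pollUntil calls ready up to maxAttempts times, sleeping between attempts.
func pollUntil(ctx context.Context, s Sleeper, interval time.Duration, maxAttempts int, ready func() (bool, error)) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ok, err := ready()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		if err := s.Sleep(ctx, interval); err != nil {
			return err
		}
	}
	return errPollTimeout
}

func main() {
	calls := 0
	sleeper := &fakeSleeper{}
	err := pollUntil(context.Background(), sleeper, 10*time.Second, 5, func() (bool, error) {
		calls++
		return calls == 3, nil
	})
	fmt.Println(err, calls, len(sleeper.slept)) // prints <nil> 3 2
}
```

With the fake, a test that would otherwise wait 20 real seconds finishes instantly. As a side effect, `realSleeper` honours context cancellation, which a bare `time.After` in a loop does not do on its own.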

A.6.3 Data

God table
EAV abuse — the field-type XOR subtables (form_field_string_submission / _choice_submission / _file_submission) deliberately avoid EAV; good.
Missing indexes on hot queries — prefill scan (A.4.3) and households RO address lookup are both flagged in the YAML roadmap.
N+1 queries — prevented via dataloader v7.
Unbounded growth / no retention policy — account.deleted_at soft delete only, form_submission and form_field_*_submission have no retention (YAML §4 and §17 both say "retention TBD").
Nullable-everything schemas
Implicit coupling via shared database — exists between this service and af-targeting via households RO, but it is read-only by design.

A.6.4 Async / Ops

Poison messages with no dead-letter queue — detached goroutine has no DLQ (A.4.1, formsubmission.go:203-215).
Retry storms / no backoff — the three poll loops sleep with fixed intervals; good.
Missing idempotency keys on non-idempotent ops — finalizeFormSubmission has an ERR_FORM_SUBMISSION_ALREADY_COMPLETED guard but no client-supplied idempotency key; a retry from the frontend during the analyticsService.UserUpdate call window (formsubmission.go:169) could double-post to analytics before the DB state transitions.
Hidden coupling via shared state — the in-process goroutine keeps implicit state about the submission; no other replica can take over.
Work queues without visibility / depth metrics — no queue exists (A.4.1).
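
The idempotency-key item above can be illustrated with an in-memory sketch. `IdempotencyStore`, `hashKey`, and the key format are hypothetical; a real implementation would persist the hash in Postgres under a unique constraint so that every replica sees it, and would not hold a process-wide lock while calling analytics.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// IdempotencyStore remembers the result of an operation per hashed key.
// In-memory only: a stand-in for a table with a unique constraint.
type IdempotencyStore struct {
	mu   sync.Mutex
	seen map[string]string // key hash -> stored result
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{seen: make(map[string]string)}
}

// hashKey scopes the client-supplied key to one submission and avoids
// storing the raw key.
func hashKey(submissionID, key string) string {
	sum := sha256.Sum256([]byte(submissionID + "\x00" + key))
	return hex.EncodeToString(sum[:])
}

// Do runs op at most once per (submissionID, key) and replays the stored
// result on retries. Failed attempts are not stored, so the client may retry.
func (s *IdempotencyStore) Do(submissionID, key string, op func() (string, error)) (string, error) {
	h := hashKey(submissionID, key)
	s.mu.Lock()
	defer s.mu.Unlock()
	if res, ok := s.seen[h]; ok {
		return res, nil
	}
	res, err := op()
	if err != nil {
		return "", err
	}
	s.seen[h] = res
	return res, nil
}

func main() {
	store := NewIdempotencyStore()
	analyticsCalls := 0
	finalize := func() (string, error) {
		analyticsCalls++ // stands in for the analytics UserUpdate call
		return "finalized", nil
	}
	// The frontend retries with the same key; the analytics call runs once.
	store.Do("submission-1", "key-abc", finalize)
	store.Do("submission-1", "key-abc", finalize)
	fmt.Println(analyticsCalls) // prints 1
}
```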

A.6.5 Security

Secrets in code, .env committed, or logs
Missing authn/z on internal endpoints — directives enforce it (@account, @strongAuth, @authorizeAPIKey).
Overbroad IAM roles — cannot verify without af-infra.
Unvalidated input crossing a trust boundary — go-playground/validator tags + custom form validators look comprehensive.
PII/PHI in logs or error messages — not observed directly, but error-wrapping uses fmt.Errorf("%w", err) liberally; some wrapped errors include user-supplied values that could carry PII. No explicit redaction layer in util/logger/handler.go. Cannot rule out without a log audit.
Missing CSRF/XSS/SQLi/SSRF — CORS * permitted (A.4.5). SQLi is well-guarded (pgx named params). SSRF: service/infrastructure/ai/agent/agent.go blindly issues HTTP against status.AgentAPIBase returned by the control plane — a compromised control plane can point this service at arbitrary internal URLs. Not necessarily a bug but worth noting.
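
For the SSRF note in the last item, a host allow-list check might look like the sketch below. `validateAgentBase` and the domain `agents.example.org` are hypothetical; whether the control plane returns URLs under a fixed domain is still an open question (see A.7), so treat the allow-list contents as an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateAgentBase accepts only https URLs without userinfo whose host is
// an allow-listed domain or a subdomain of one.
func validateAgentBase(raw string, allowedDomains []string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("agent base: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("agent base %q: scheme must be https", raw)
	}
	if u.User != nil {
		return fmt.Errorf("agent base %q: userinfo is not allowed", raw)
	}
	host := strings.ToLower(u.Hostname())
	for _, domain := range allowedDomains {
		// The leading dot stops "notagents.example.org" from matching.
		if host == domain || strings.HasSuffix(host, "."+domain) {
			return nil
		}
	}
	return fmt.Errorf("agent base host %q is not allow-listed", host)
}

func main() {
	allowed := []string{"agents.example.org"}
	fmt.Println(validateAgentBase("https://a1.agents.example.org/state", allowed))
	fmt.Println(validateAgentBase("https://169.254.169.254/latest/meta-data", allowed))
}
```

The check belongs immediately before the first request built from the control-plane response. It does not protect against DNS rebinding or redirects, so disabling redirect following on the HTTP client would be a sensible companion change.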

A.6.6 Detected Instances

#	Anti-pattern	Location (file:line)	Severity	Recommendation
1	God function (`FinalizeFormSubmission`)	`service/application/formsubmission/formsubmission.go:146-221`	P1	Split into 3 helpers (A.4.2)
2	Magic constants (no rationale)	`service/application/formsubmission/formsubmission.go:23-29`	P2	Add doc comments or link to runbook thresholds
3	Long parameter list (7 args)	`service/application/formsubmission/formsubmission.go:78-86`	P2	Introduce a `Dependencies` struct
4	Leaky abstraction (app layer knows container state)	`service/application/formsubmission/formsubmission.go:399-452`	P1	Extract `AILifecycle` service
5	Missing test seam on `time.After`	`service/application/formsubmission/formsubmission.go:462,491,513`	P2	Inject a `sleeper` or wrap via `timesource`
6	Missing index on prefill scan	`domain/form/postgres/form.go:646-700`	P1	Add index + `DISTINCT ON` (A.4.3)
7	Unbounded retention on submission PII	`database/sql/migrations/003_form_updates.sql`, `015_repeated_groups.sql`	P1	Define retention schedule + nightly purge (A.10, A.16)
8	Fire-and-forget AI job w/o DLQ	`service/application/formsubmission/formsubmission.go:203-215`	P0	Durable outbox (A.4.1)
9	Missing client-supplied idempotency key on finalize	`service/application/formsubmission/formsubmission.go:147`	P2	Accept `Idempotency-Key` header, store hash
10	Hidden coupling via in-process goroutine state	`service/application/formsubmission/formsubmission.go:203-215, 419-430`	P1	Move to external queue
11	CORS permits `*`	`api/graphql/middleware/cors.go:26` + env default	P2	Enforce allow-list in dev/stg/prod (A.4.5)
12	HS256 shared-secret JWT to AI control plane	`service/infrastructure/ai/container/jwt.go:13-76`	P1	Move to RS256/ES256 + `kid` (A.4.4)
13	Implicit SSRF surface via control-plane-returned URL	`service/infrastructure/ai/agent/agent.go` (per YAML §2)	P2	Validate `AgentAPIBase` is in an allow-listed CloudFront domain

A.7 Open Questions

Q: Is there any plan for durable job handling of AI lifecycle (outbox, SQS, worker pool), or is the fire-and-forget pattern intentional?
- Blocks: A.4.1, A.6.4
- Who can answer: backend lead (Jiri / Marek)
Q: What is the retention policy for form_submission and form_field_*_submission rows containing SSN/DOB? YAML says "TBD" throughout §4 and §17.
- Blocks: A.6.3, A.10, A.16
- Who can answer: product + legal
Q: Is FIXED_COMPLEXITY_LIMIT=20 actually tuned against real client queries, or a placeholder?
- Blocks: A.12
- Who can answer: whoever maintains af-frontend query shapes
Q: Does the AI control plane return agent URLs from a fixed domain (e.g., *.cloudfront.net tenant bucket) so an allow-list is enforceable?
- Blocks: A.6.5 #13
- Who can answer: af-disaster-assistance-gov-agent owner

A.8 Difficulties Encountered

Difficulty: 18,371-line api/graphql/graph/generated.go dominates grep results for resolver-related queries.
- Impact on analysis: Harder to confirm which logic is generated vs. hand-written; relied on schema.resolvers.go plus domain layer.
- Fix that would help next reviewer: .gitattributes linguist-generated=true + explicit mention in README.
Difficulty: No ability to run tests (make test-run) or observe metrics in this review session.
- Impact on analysis: Coverage numbers, flake rate, p50/p99 latencies, queue depths — all marked "TBD" in A.2/A.13/A.14.
- Fix: Publish a coverage badge + Prometheus/CloudWatch dashboard link in README.
Difficulty: af-infra Terraform not in this repo, so IAM scope, SSM parameter layout, and ALB/CloudFront policies could not be verified.
- Impact on analysis: STRIDE gaps below (A.11) had to be inferred from config.

A.9 Risks & Unknowns

A.9.1 Known risks

#	Risk	Likelihood	Impact	Mitigation
1	AI job lost on pod restart	M	H	A.4.1 (durable outbox)
2	Prefill tail-latency under load	M	M	A.4.3 (index / denormalization)
3	HS256 shared secret leak	L	H	A.4.4 (asymmetric keys)
4	PII retention non-compliance (FEMA Privacy Act, GDPR)	M	H	Retention schedule + purge job
5	Double-finalize during transient frontend retry	L	M	Idempotency key
6	Control-plane URL redirection (SSRF-adjacent)	L	M	AgentAPIBase allow-list

A.9.2 Unknown unknowns

Area not reviewed: Plaid flow in detail. Best guess at risk level: L — the pattern is textbook (no persistence, one-shot exchange), but I did not walk service/infrastructure/plaid/plaid.go line by line.
Area not reviewed: Admin-only mutations and the @authorizeAPIKey directive implementation end-to-end. M — admin endpoints share the same route, so an auth regression would be high-blast-radius.
Area not reviewed: The 971-line domain/form/postgres/form.go; only skimmed for prefill. M — complex hand-written SQL.
Area not reviewed: util/logger/handler_dev.go and how structured logs are emitted in prod — PII scrubbing not verified. M.
Area not reviewed: CI workflow contents (build.yaml, tests.yaml, vuln-scan.yaml) — existence confirmed, content not read. L.

A.10 Technical Debt Register

#	Debt item	Quadrant	Estimated interest	Remediation
1	Fire-and-forget AI goroutine, no durability	Reckless & Inadvertent	High — invisible failures of the flagship feature	A.4.1
2	`FinalizeFormSubmission` oversize (author-nolinted)	Prudent & Deliberate	Medium — slows future changes	A.4.2
3	Prefill O(n) scan	Prudent & Deliberate (roadmap acknowledges)	Medium — grows with account lifetime	A.4.3
4	Polling-based AI state (no subscriptions)	Prudent & Deliberate	Low–Medium — extra DB QPS + UX latency	GraphQL subscriptions (YAML §19)
5	HS256 shared secret for AI JWT	Reckless & Inadvertent	Low prob, High blast	A.4.4
6	No app-level audit log (who submitted / accessed what PII)	Prudent & Deliberate	Medium — compliance risk	Dedicated audit table + async writer
7	Retention policy undefined for PII tables	Reckless & Inadvertent	Medium — regulatory	Retention schedule + purge
8	CORS allow-list defaults to `*`	Prudent & Inadvertent	Low	A.4.5
9	`time.After` prevents time-travel tests	Prudent & Inadvertent	Low — slows test suite	Inject sleeper
10	`generated.go` not marked `linguist-generated`	Prudent & Inadvertent	Very Low	Add `.gitattributes`

A.11 Security Posture (lightweight STRIDE)

Category	Threat present?	Mitigated?	Gap
Spoofing (identity)	Yes (Firebase tokens, admin key)	Mostly (`auth.go:42-56`, `@strongAuth` freshness)	Admin key in header; no mTLS between services
Tampering (integrity)	Yes	Partially	SQL parameterized (pgx named params); no row-level integrity hashes; JWT HS256 shared secret (A.4.4)
Repudiation	Yes (survivor actions on FEMA behalf)	No	No app-level audit log (A.10 #6)
Information Disclosure	Yes (PII everywhere in submissions)	Partially	TLS + soft-delete; no field-level encryption; PII in wrapped error strings not audited
Denial of Service	Yes	Partially	`FIXED_COMPLEXITY_LIMIT=20`, 4 MiB body limit (`bodylimit.go:12`), 25 MiB file cap; no per-account rate limiter; poll loops cannot be cancelled
Elevation of Privilege	Yes (admin directive)	Yes on surface	Cannot verify IAM scope of ECS task role

A.12 Operational Readiness

Capability	Present / Partial / Missing	Notes
Structured logs	Present	`util/logger/handler.go` slog-based
Metrics	Missing (unverified)	No Prometheus/OTel wiring seen in `cmd/api/graphql/main.go`
Distributed tracing	Missing (unverified)	No OTel SDK import visible
Actionable alerts	Partial	Runbook lists ECS task health + deployment-badge; thresholds "TBD" in YAML §14
Runbooks	Present	`data/af-backend-go-api/runbook.md`
On-call ownership defined	Present	`CODEOWNERS` at repo root
SLOs / SLIs	Missing	YAML §12 all "TBD"
Backup & restore tested	Unknown	RDS managed, not verified
Disaster recovery plan	Unknown	Cross-repo concern
Chaos / failure testing	Missing	No evidence

A.13 Test & Quality Signals

Coverage (line / branch): Not published (YAML §15 coverage_pct: null). File counts: 35 _test.go vs 124 source .go ≈ 28% file ratio — low but concentrated on the right places (domain/form, AI agent, JWT, auth middleware).
Trend: Unknown.
Flake rate: Unknown.
Slowest tests: Likely the three pollFor… loops if tested directly, since they use real time.After (A.6.2).
Untested critical paths: End-to-end AI lifecycle (no integration test walking finalize→poll→continue), admin API-key directive, FinalizeFormSubmission happy path (test file exists at service/application/formsubmission/formsubmission_test.go but cannot assess depth without running).
Missing test types: [ ] unit (have) [ ] integration (partial, LocalStack+emulator per YAML) [x] e2e (no dedicated repo-level e2e) [x] contract (no schema-diff/contract test between schema.graphqls and frontend) [x] load [x] security/fuzz

A.14 Performance & Cost Smells

Hot paths: form, formSubmission, formPrefill, actualHouseholdInfo queries; submitForm + finalizeFormSubmission mutations.
Suspected bottlenecks: (1) ReadFormPrefill full-scan (domain/form/postgres/form.go:646); (2) households RO address matching without documented indexes (YAML §12); (3) three 5-minute poll loops hold goroutines + DB connections per in-flight AI job (formsubmission.go:454-530).
Wasteful queries / loops: Poll loops write an AI-state update on every tick (formsubmission.go:469, 520) even when state has not changed — unnecessary row writes at ~10s cadence.
Oversized infra / idle resources: N/A — not reviewed.
Cache hit/miss surprises: No cache layer; YAML §19 lists "cache hot reads (ListActiveDisasters, form definitions)" as roadmap. Today every activeDisasters query hits Postgres.
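
The roadmap item on caching hot reads could start as small as a single-value TTL cache in front of the `activeDisasters` query. Everything in this sketch is hypothetical (`ttlCache`, `fakeClock`, the 30-second TTL, the placeholder payload); it only shows the shape of the idea with an injectable clock so expiry can be tested without sleeping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ttlCache holds one value and reloads it once the TTL has elapsed.
type ttlCache[T any] struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time
	load    func() (T, error)
	value   T
	expires time.Time
	loaded  bool
}

func newTTLCache[T any](ttl time.Duration, now func() time.Time, load func() (T, error)) *ttlCache[T] {
	return &ttlCache[T]{ttl: ttl, now: now, load: load}
}

// Get returns the cached value while it is fresh and calls load otherwise.
// Load errors are returned as-is and do not overwrite a previous value.
func (c *ttlCache[T]) Get() (T, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.loaded && c.now().Before(c.expires) {
		return c.value, nil
	}
	v, err := c.load()
	if err != nil {
		var zero T
		return zero, err
	}
	c.value, c.expires, c.loaded = v, c.now().Add(c.ttl), true
	return v, nil
}

// fakeClock is a manually advanced clock for demos and tests.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clock := &fakeClock{t: time.Date(2026, 4, 9, 12, 0, 0, 0, time.UTC)}
	queries := 0
	cache := newTTLCache(30*time.Second, clock.Now, func() ([]string, error) {
		queries++ // stands in for the Postgres query
		return []string{"placeholder-disaster"}, nil
	})
	cache.Get()
	cache.Get() // served from cache
	clock.Advance(31 * time.Second)
	cache.Get() // TTL elapsed, reloads
	fmt.Println(queries) // prints 2
}
```

A per-process cache like this trades up to one TTL of staleness per replica for fewer database round trips. That seems acceptable for slowly changing data such as declared disasters, but the TTL should be confirmed with whoever owns the disaster feed.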

A.15 Bus-Factor & Knowledge Risk

Who is the only person who understands X? Cannot infer from code alone; YAML lists three authors (Siroky, Burda, Bafrnec).
What breaks if they disappear tomorrow? The AI state machine (public vs internal mapping in formsubmission.go:546-586) and the refreshAIAgentState semantics (A.5 #1) are the highest-knowledge-density sections; no comments explain why each early return exists.
What is undocumented tribal knowledge? The reason the VNC endpoint validity is 1h while the container lifetime is 48h, and the interplay with aiRefreshInterval's 2-min cooldown.
Suggested knowledge-transfer actions: Add a state-machine diagram in service/application/formsubmission/ as a README, ideally with decision rationale embedded as comments on the state-mapping functions.

A.16 Compliance Gaps

Regulation	Requirement	Status	Gap	Remediation
FEMA Privacy Act	Defined retention + purge of survivor PII	Partial	Soft-delete only; no purge job; retention "TBD" (YAML §4)	Define schedule + nightly purge
GDPR / CCPA	Data export + hard delete on DSAR	Missing	YAML §19 acknowledges no export endpoint	Build DSAR pipeline
Login.gov IAL2/AAL2	Performed by agent in VNC	OK	Verification lives outside this service	Document trust boundary in runbook
NIST 800-53 (audit)	Application-level audit log	Missing	CloudTrail only — no per-user data-access trail	Add audit log table (A.10 #6)
Plaid Data Use Policy	Do not persist credentials	OK	Confirmed via `service/infrastructure/plaid/plaid.go` wrapping (YAML §17)	—

A.17 Recommendations Summary

Priority	Action	Owner (suggested)	Effort	Depends on
P0	Durable AI-job outbox + startup reaper + depth/age metrics (A.4.1, A.6.4, A.10 #1)	backend-go-api maintainers	M	af-infra (SQS or table migration)
P0	Define and enforce PII retention schedule (A.10 #7, A.16 FEMA/GDPR)	product + legal + backend	M	legal sign-off
P1	Split `FinalizeFormSubmission`; extract `AILifecycle` facade (A.4.2, A.6.2, A.6.6 #1,#4)	backend	S	—
P1	Index + `DISTINCT ON` (or denorm) for prefill (A.4.3, A.6.3)	backend	S	—
P1	Replace HS256 shared-secret AI JWT with RS256 + `kid` rotation (A.4.4, A.6.5)	backend + af-infra	M	control-plane team
P1	Add app-level audit log table + async writer (A.10 #6, A.16 NIST)	backend	M	—
P1	Idempotency key on `finalizeFormSubmission` (A.6.4 #3)	backend + af-frontend	S	frontend coordination
P2	CORS allow-list enforced per env (A.4.5)	backend	S	—
P2	Inject sleeper/ticker to make poll loops time-travel testable (A.6.2)	backend	S	—
P2	AgentAPIBase domain allow-list (SSRF hygiene, A.6.6 #13)	backend	S	control-plane team
P2	Add `.gitattributes linguist-generated=true` for `generated.go` (A.10 #10)	backend	S	—
P2	Cache hot reads: `activeDisasters`, form definitions (YAML §19)	backend	S	—
P2	Publish coverage badge + metrics dashboard link in README (A.8, A.12)	backend	S	—

Environment variables

Name	Purpose
`ENVIRONMENT`*	local|dev|stg|prod
`PORT`*	HTTP listen port
`DATABASE_HOST`*	Main DB host
`DATABASE_PORT`*	Main DB port
`DATABASE_USERNAME`*	Main DB user
`DATABASE_DB_NAME`*	Main DB name
`DATABASE_HOUSEHOLDS_RO_HOST`*	Households RO host
`DATABASE_HOUSEHOLDS_RO_PORT`*	Households RO port
`FIREBASE_CREDENTIALS`*	Firebase service-account JSON
`FIREBASE_AUTH_EMULATOR_HOST`	Dev-only emulator override
`ADMIN_API_KEY`*	Header value for @authorizeAPIKey directive
`MAPBOX_USERNAME`*	Mapbox tileset namespace
`AI_CONTAINER_API_HOST`*	AI control-plane base URL
`AI_JWT_SECRET`*	HS256 secret for control-plane JWT
`ANALYTICS_API_HOST`*	User Update Service base URL (af-targeting FastAPI; .env.common defaults to :8090, the af-targeting docker-compose canonical port is :8080 — local dev typically port-maps 8090→8080)
`S3_BUCKET_FILE_SUBMISSION_NAME`*	S3 bucket for form files
`S3_ENDPOINT_URL`	LocalStack S3 override (dev)
`PLAID_CLIENT_ID`*	Plaid client ID
`PLAID_SECRET`*	Plaid secret
`PLAID_ENVIRONMENT`*	sandbox|production
`MAX_FILE_SUBMISSION_SIZE_MIB`*	Max upload size
`ALLOWED_FILE_SUBMISSION_TYPES`*	MIME whitelist
`FIXED_COMPLEXITY_LIMIT`*	GraphQL complexity cap
`STRONG_AUTH_MAX_AGE`*	@strongAuth freshness
`LOG_LEVEL`	slog level
`CORS_ALLOWED_ORIGINS`	CORS list