Go service built on gqlgen v0.17.86 + chi router with a clean-architecture layout (domain / service / infrastructure). Main store is PostgreSQL via pgx v5; a separate read-only households DB serves eligibility lookups. Auth is Firebase (email-link OOB in dev, ID-token verification in prod). Plaid SDK powers bank linking. AI agents are run as ECS Fargate containers, orchestrated via a JWT-secured control plane plus CloudFront-signed-URL data plane.
Role in the system: Receives form submissions from af-frontend, persists to PostgreSQL, runs eligibility via the af-targeting analytics service, spawns per-survivor Fargate AI containers (af-disaster-assistance-gov-agent / af-fema-real-ai-agent) and exposes their VNC endpoints back to the user.
Surfaces:
- GraphQL Query: form, forms, formSubmission(s), formFiles, formPrefill, activeDisasters, householdDetailedInfo, actualHouseholdInfo, me, accounts(admin)
- GraphQL Mutation: signIn, updateMe, submitForm, finalizeFormSubmission, continueAgent, createPlaidLinkToken, linkBankAccount
- Custom directives: @account, @strongAuth, @authorizeAPIKey, @validate, @goField
- REST: POST /api/graphql, GET /deployment-badge, implicit /health
User workflows
Survivor sign-in
Bearer token usable on all @account-guarded operations
Form submission (two-step S3 upload)
FormSubmissionInfo with eligibility + expectedAmount; AI state machine begins
AI form automation (lazy polling)
FEMA application submitted on survivor's behalf; VNC remains available 48h for review
Household eligibility lookup
Returns ActualHouseholdInfo for the survivor's address
Plaid bank linking
Account/routing/bank metadata returned to frontend; survivor uses it to autofill financial form fields
API endpoints
- QUERY `form(formId: UUID!)`: Load form structure with sections, fields, conditions
- QUERY `formSubmission(formSubmissionId!)`: Load submission with state, eligibility, AI status
- QUERY `formPrefill(formId!)`: Latest submitted values for prefill
- QUERY `activeDisasters`: List active declared disasters
- QUERY `actualHouseholdInfo(limit, offset)`: Account's address-matched household + program eligibility
- QUERY `me`: Current account profile
- QUERY `accounts(limit, offset)`: Admin: list accounts
- MUTATION `signIn`: Verify Firebase token, upsert account
- MUTATION `updateMe(input)`: Update profile
- MUTATION `submitForm(formId!, input)`: Validate + persist submission, return presigned S3 URLs
- MUTATION `finalizeFormSubmission(submissionId!)`: Mark files uploaded, call analytics, fire AI container
- MUTATION `continueAgent(submissionId!)`: Resume Agent 2 after user Login.gov 2FA in VNC
- MUTATION `createPlaidLinkToken`: Mint Plaid Link token
- MUTATION `linkBankAccount(publicToken!)`: Exchange public token, return account details (not persisted)
- POST `/api/graphql`: GraphQL HTTP transport
- GET `/deployment-badge`: CloudFront-served deployment badge metadata
Third-party APIs
- Firebase Auth: Email-link OOB sign-in + ID-token verification
- Plaid: Bank account linking, account/routing retrieval
- Mapbox (indirect): Tileset namespace owner; no direct API calls from this service (af-map handles uploads)
- AI Container API (control plane): Start/stop/suspend per-survivor Fargate containers
- AI Agent API (data plane): Agent orchestration: `/health`, `/survivor-info`, `/state`, `/agent1/run`, `/agent2/run`
- User Update Service (analytics): Eligibility determination from form data
Service dependencies
- PostgreSQL (main): Accounts, forms, submissions, AI/analytics metadata
- PostgreSQL (households RO): Read-only households DB synced from af-targeting BigQuery, used for eligibility lookups
- AWS S3: Form file submissions (presigned PUT URLs)
- af-map: Downstream tile worker; planned SNS/SQS trigger on disaster updates (not yet implemented)
- af-disaster-assistance-gov-agent / af-fema-real-ai-agent: Per-survivor Fargate containers spawned by this service
- af-targeting (User Update Service): Eligibility scoring + household sync
Analysis
af-backend-go-api — Prop-Build Analysis
Document Type: Critical Review & Analysis (companion to prop-build-template.md)
Scope: Per-Repo
Subject: af-backend-go-api (AidFinder GraphQL backend)
Reviewer(s): Claude (automated code review)
Date: 2026-04-09
Version: 0.1
Confidence Level: Medium
What would raise confidence: running tests locally, prod metrics/traces, interview with Jiri/Marek/Matúš, observing an actual AI-agent submission end-to-end.
Inputs Reviewed:
- Prop-build doc: `data/af-backend-go-api.yaml`
- Companion docs: `data/af-backend-go-api/{api-examples,data-flow,deployment,runbook}.md`, `schema.graphql`
- Source tree: `/Users/andres/src/af/af-backend-go-api/` (124 non-test `.go` files, 35 test files, 16 tern migrations, 6 GitHub Actions workflows)
Part A — Per-Repo Analysis
A.1 Executive Summary
- Overall health: Solid, conventionally-structured Go service with clean-architecture layering, strong typing, and meaningful test coverage on hot domain logic; the biggest risks are operational (polling-based AI state, fire-and-forget goroutines) and a handful of latent correctness/security smells rather than structural rot.
- Top risk: The AI-agent lifecycle is driven by an unsupervised goroutine spawned from `FinalizeFormSubmission` (`service/application/formsubmission/formsubmission.go:203-215`) using `context.WithoutCancel`, with no DLQ, no retry, no persistence of in-flight state, and no visibility metric; a crash mid-poll silently strands a submission (see A.4.1).
- Top win / worth preserving: Clean domain/service/infrastructure split with hand-rolled SQL + scany (no ORM), dependency-injected services validated at construction (`formsubmission.go:86-116`), and a disciplined structured-error taxonomy (`api/graphql/errors/error.go`, README error codes).
- Single recommended next action: Persist AI-lifecycle work as a durable job (DB outbox + a polling worker or SQS) and expose depth/age metrics; everything else in A.4 is P1/P2.
- Blocking unknowns: Prod metrics, actual coverage numbers, RDS index state for households RO, whether `FIXED_COMPLEXITY_LIMIT=20` is tuned for real clients, and whether the planned SNS/SQS → af-map publisher exists on any branch.
A.2 Health Scorecard
| # | Dimension | Score (1–5) | Justification |
|---|---|---|---|
| 1 | Module overview / clarity of intent | 5 | README + prop-build doc + companion files are thorough and match code. |
| 2 | External dependencies | 4 | Modern, pinned versions (go.mod); Firebase/Plaid/AWS SDKs wrapped behind interfaces. |
| 3 | API endpoints | 4 | Schema-first gqlgen with directives (api/graphql/graph/schema.resolvers.go, 675 LOC resolvers); introspection on in all envs is a minor smell. |
| 4 | Database schema | 4 | 16 hand-written tern migrations, XOR field-type subtables, deferrable constraints; lacks explicit retention. |
| 5 | Backend services | 3 | FinalizeFormSubmission carries a //nolint:funlen,cyclop,gocognit tag at formsubmission.go:146 — acknowledged complexity; AI orchestration leaks through application layer. |
| 6 | WebSocket / real-time | 2 | No subscriptions/WS; only client polling (cmd/api/graphql/main.go:83-88 TODO comment). |
| 7 | Frontend components | N/A | Backend only. |
| 8 | Data flow clarity | 4 | data-flow.md matches the code; sequence is traceable through resolvers → application → domain. |
| 9 | Error handling & resilience | 3 | Good error taxonomy, but no circuit breakers, no retries on outbound HTTP, TODO comment at formsubmission.go:171-172 ("find out how to handle errors"). |
| 10 | Configuration | 4 | Env-var parsing via strv config in cmd/api/graphql/setup/setup.go; validated at startup. |
| 11 | Data refresh patterns | 2 | Lazy poll with 2-min cooldown (aiRefreshInterval, formsubmission.go:27) is fragile; no caching of hot reads. |
| 12 | Performance | 3 | gqlgen complexity cap + dataloader v7 for N+1 are good; ReadFormPrefill (domain/form/postgres/form.go:646) is the known O(n) hotspot. |
| 13 | Module interactions | 4 | Well-bounded; dependency boundaries crisp; only real coupling is downstream agents. |
| 14 | Troubleshooting / runbooks | 4 | runbook.md exists and is specific; missing alert thresholds. |
| 15 | Testing & QA | 3 | 35 test files for 124 source files; unit coverage on form transform, JWT, auth middleware, agents; no coverage number published; no load/fuzz. |
| 16 | Deployment & DevOps | 4 | 6 GH Actions workflows (build, tests, lint, release, deploy-task-definition, vuln-scan), multi-stage Dockerfile, tern-on-startup migrations. |
| 17 | Security & compliance | 3 | Firebase + directive-based authz is correct; HS256 JWT with shared secret to AI control plane (service/infrastructure/ai/container/jwt.go:20) and CORS default * are soft spots (see A.6.5). |
| 18 | Documentation & maintenance | 5 | README is exemplary; CHANGELOG present; CODEOWNERS file. |
| 19 | Roadmap clarity | 4 | Tech-debt + planned items enumerated in YAML §19 and README. |
Overall score: 3.61 (average of 18 scored rows; Frontend N/A excluded). Weighted reading: the repo is comfortably above "acceptable" on code, docs, and deployment; it dips into "needs work" on real-time/refresh (row 6, 11) and on the resilience of its most valuable flow (row 9) — i.e., the weakest links are exactly where a survivor's submission lives.
A.3 What's Working Well
- Strength: Dependency injection with explicit nil-checks at construction.
  - Location: `service/application/formsubmission/formsubmission.go:86-116`
  - Why it works: Every collaborator is an interface; the constructor refuses to return a half-wired service. Makes testing trivial and failures loud at startup instead of runtime.
  - Propagate to: Any repo still using package-level globals for dependencies.
- Strength: Strong ID types via code-gen (`types/id/`).
  - Location: `types/id/id.go`, `types/id/id_gen.go`
  - Why it works: Compile-time separation of `id.Account` / `id.FormSubmission` / `id.Container` / `id.Form` avoids primitive obsession; swapping a submissionID for an accountID is a type error, not a runtime bug.
  - Propagate to: Other Go services that pass bare UUIDs around.
- Strength: Structured, code-first error taxonomy surfaced through GraphQL extensions.
  - Location: `api/graphql/errors/error.go`, `api/graphql/graph/error.go`, README error codes section
  - Why it works: Frontend can pattern-match on stable codes (`ERR_FORM_*`, `FIELD_*`) instead of scraping strings. `api/graphql/errors/error_test.go` locks the contract.
  - Propagate to: Any repo where the frontend currently greps errors by message.
- Strength: Dataloader v7 wired as request-scoped middleware.
  - Location: `api/graphql/middleware/dataloader.go` + `dataloader/context.go`
  - Why it works: Prevents the classic GraphQL N+1 on nested resolvers without forcing batch-everywhere discipline on the resolver layer.
  - Propagate to: Any GraphQL server still writing naive per-field fetches.
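The first two strengths above (constructor-time nil checks and distinct ID types) can be illustrated with a minimal sketch. All names here (`AccountID`, `FormReader`, `Analytics`, `NewService`, `ErrNilDependency`) are hypothetical stand-ins, not the repository's actual types; the real service takes seven collaborators and generates its ID types.

```go
package main

import (
	"errors"
	"fmt"
)

// Distinct named types: passing a FormSubmissionID where an AccountID is
// expected is a compile error. Hypothetical stand-ins for types/id/.
type AccountID string
type FormSubmissionID string

// Hypothetical collaborator interfaces.
type FormReader interface {
	Owner(id FormSubmissionID) (AccountID, error)
}

type Analytics interface {
	UserUpdate(id FormSubmissionID) error
}

type Service struct {
	forms     FormReader
	analytics Analytics
}

var ErrNilDependency = errors.New("nil dependency")

// NewService refuses to return a half-wired service, so a missing
// collaborator fails at startup rather than on the first request.
func NewService(forms FormReader, analytics Analytics) (*Service, error) {
	if forms == nil {
		return nil, fmt.Errorf("forms: %w", ErrNilDependency)
	}
	if analytics == nil {
		return nil, fmt.Errorf("analytics: %w", ErrNilDependency)
	}
	return &Service{forms: forms, analytics: analytics}, nil
}

// Test doubles used by the demo below.
type fakeForms struct{}

func (fakeForms) Owner(FormSubmissionID) (AccountID, error) { return "acct-1", nil }

type fakeAnalytics struct{}

func (fakeAnalytics) UserUpdate(FormSubmissionID) error { return nil }

func main() {
	if _, err := NewService(fakeForms{}, nil); err != nil {
		fmt.Println("startup failed:", err)
	}
	svc, _ := NewService(fakeForms{}, fakeAnalytics{})
	owner, _ := svc.forms.Owner("sub-1")
	fmt.Println("owner:", owner)
}
```

Because every collaborator is an interface, the fakes above are all a unit test needs; no network or database is involved.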
A.4 What to Improve
A.4.1 P0 — AI lifecycle is a fire-and-forget goroutine with no durability or visibility
- Problem: The entire post-finalize AI flow (start container, poll for ~5 min, submit survivor info, run agent 1) runs in a detached goroutine spawned from `FinalizeFormSubmission`, using `context.WithoutCancel(ctx)`. If the pod is killed between finalize and any of the three `pollFor…` loops, the submission sits in an intermediate state forever; only the client's next poll can nudge `refreshAIAgentState` back into life, and only if the container itself is still running. There is no DLQ, no retry budget, no visible queue depth, no metric for in-flight jobs.
- Evidence:
  - `service/application/formsubmission/formsubmission.go:203-215` (detached goroutine, `context.WithoutCancel`, panic recovery that only logs).
  - `service/application/formsubmission/formsubmission.go:454-530`: three `pollFor…` loops bounded only by `aiAPITimeout = 5 * time.Minute` (`formsubmission.go:24`).
  - `cmd/api/graphql/main.go:83-88`: the `BeforeShutdown` hook is empty; the TODO comment explicitly notes "can be for example closing of websocket connections, or messages handler (SQS, pub/sub)". There is no graceful drain for in-flight AI jobs.
  - `formsubmission.go:171-172`: `// TODO find out how to handle errors` on the analytics path that precedes the AI kick-off.
- Suggested change: Persist a row in a `pending_ai_jobs` table (or SQS message) inside the finalize transaction; have a worker (in-proc ticker or separate consumer) pick it up. Emit a `pending_ai_jobs_depth` and `oldest_pending_age` metric. Short-term, at minimum add a startup reaper that scans submissions in non-terminal AI states older than `aiAPITimeout` and transitions them to `ERROR`.
- Estimated effort: M
- Risk if ignored: Silent data loss of the platform's headline feature; survivors see an eternally "running" state after any pod restart.
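The short-term reaper and the two queue metrics suggested above can be sketched in memory as follows. This is an illustration under assumptions, not the service's code: the `Job` struct stands in for a row of the proposed `pending_ai_jobs` table, the state names are invented, and a real version would run as SQL against PostgreSQL.

```go
package main

import (
	"fmt"
	"time"
)

type AIState string

// Hypothetical state names; the real state machine lives in formsubmission.go.
const (
	StateRunning  AIState = "RUNNING"
	StateComplete AIState = "COMPLETE"
	StateError    AIState = "ERROR"
)

// reapAfter mirrors aiAPITimeout (5 minutes) cited in the evidence above.
const reapAfter = 5 * time.Minute

// Job stands in for a row of the proposed pending_ai_jobs table.
type Job struct {
	SubmissionID string
	State        AIState
	UpdatedAt    time.Time
}

func terminal(s AIState) bool { return s == StateComplete || s == StateError }

// ReapStale moves non-terminal jobs untouched for longer than maxAge to
// ERROR and reports how many it changed.
func ReapStale(jobs []Job, now time.Time, maxAge time.Duration) int {
	reaped := 0
	for i := range jobs {
		if !terminal(jobs[i].State) && now.Sub(jobs[i].UpdatedAt) > maxAge {
			jobs[i].State = StateError
			jobs[i].UpdatedAt = now
			reaped++
		}
	}
	return reaped
}

// QueueMetrics returns the values behind pending_ai_jobs_depth and
// oldest_pending_age: the count and maximum age of non-terminal jobs.
func QueueMetrics(jobs []Job, now time.Time) (depth int, oldest time.Duration) {
	for _, j := range jobs {
		if terminal(j.State) {
			continue
		}
		depth++
		if age := now.Sub(j.UpdatedAt); age > oldest {
			oldest = age
		}
	}
	return depth, oldest
}

// sampleJobs returns one stale job, one fresh job, and one finished job.
func sampleJobs() ([]Job, time.Time) {
	now := time.Date(2026, 4, 9, 12, 0, 0, 0, time.UTC)
	return []Job{
		{SubmissionID: "a", State: StateRunning, UpdatedAt: now.Add(-10 * time.Minute)},
		{SubmissionID: "b", State: StateRunning, UpdatedAt: now.Add(-1 * time.Minute)},
		{SubmissionID: "c", State: StateComplete, UpdatedAt: now.Add(-time.Hour)},
	}, now
}

func main() {
	jobs, now := sampleJobs()
	depth, oldest := QueueMetrics(jobs, now)
	fmt.Println("depth:", depth, "oldest:", oldest)
	fmt.Println("reaped:", ReapStale(jobs, now, reapAfter))
}
```

Running the reaper once at startup bounds how long a stranded submission can appear "running"; it does not replace the durable queue, which is what lets another replica resume the work.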
A.4.2 P1 — FinalizeFormSubmission is explicitly too large and mixes concerns
- Problem: The finalize path carries `//nolint:funlen,cyclop,gocognit`, an author-acknowledged smell. It validates files, calls analytics, decides eligibility, launches the goroutine, and owns error translation in one function.
- Evidence: `service/application/formsubmission/formsubmission.go:146-221` (~75 lines, 4 external services, 7 error branches, 1 goroutine).
- Suggested change: Split into (a) `finalizeAndScore` (files + analytics + DB update), (b) `enqueueAIJob` (the durable-queue piece from A.4.1), (c) `shouldRunAI` policy helper. The anti-pattern suppression comment should then be removable.
- Estimated effort: S
- Risk if ignored: Any change to finalize logic touches the same fragile 75-line block; regressions are likely.
A.4.3 P1 — Form prefill is a known O(n) scan with no index
- Problem: `ReadFormPrefill` fans out four batch queries across every submission a user has ever made to recover the latest value per field. Documented as slow in the YAML roadmap, but no fix is in progress.
- Evidence: `domain/form/postgres/form.go:646-700`; YAML `section_19_roadmap.tech_debt[0]`.
- Suggested change: Either (a) add a `(account_id, field_id, created_at DESC)` index plus `DISTINCT ON` SQL, or (b) denormalize into a `form_field_latest_value` table updated on submit.
- Estimated effort: S
- Risk if ignored: Linear growth in per-user submission history turns prefill (called on every form open) into a tail-latency spike.
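Option (a) above relies on PostgreSQL's `DISTINCT ON` keeping only the first row per `field_id` under the given ordering. The sketch below shows a possible index and query as string constants, next to an in-memory function with the same "latest value per field" semantics. The single `field_value` table and its columns are assumptions for illustration; the real schema spreads values across per-type subtables.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical DDL and query for option (a); table and column names are
// illustrative only.
const (
	createIndexSQL = `CREATE INDEX field_value_latest_idx
    ON field_value (account_id, field_id, created_at DESC)`

	latestPerFieldSQL = `SELECT DISTINCT ON (field_id) field_id, value, created_at
    FROM field_value
    WHERE account_id = $1
    ORDER BY field_id, created_at DESC`
)

type FieldValue struct {
	FieldID   string
	Value     string
	CreatedAt time.Time
}

// LatestPerField is an in-memory reference for what the query returns:
// one row per field, the most recently created one.
func LatestPerField(rows []FieldValue) map[string]FieldValue {
	latest := make(map[string]FieldValue)
	for _, r := range rows {
		if cur, ok := latest[r.FieldID]; !ok || r.CreatedAt.After(cur.CreatedAt) {
			latest[r.FieldID] = r
		}
	}
	return latest
}

// sampleRows returns two submissions of first_name and one of zip.
func sampleRows() []FieldValue {
	t0 := time.Date(2026, 4, 1, 0, 0, 0, 0, time.UTC)
	return []FieldValue{
		{FieldID: "first_name", Value: "Anna", CreatedAt: t0},
		{FieldID: "zip", Value: "78701", CreatedAt: t0.Add(24 * time.Hour)},
		{FieldID: "first_name", Value: "Ana", CreatedAt: t0.Add(48 * time.Hour)},
	}
}

func main() {
	fmt.Println(createIndexSQL)
	fmt.Println(latestPerFieldSQL)
	for id, v := range LatestPerField(sampleRows()) {
		fmt.Println(id, "=", v.Value)
	}
}
```

With the index in place the database can read each field's newest row from the index order instead of scanning every historical submission for the account.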
A.4.4 P1 — HS256 shared-secret JWT for AI control plane
- Problem: Control-plane JWT uses HS256 with a shared secret (`AI_JWT_SECRET`), 1h TTL, 5-min refresh window; if either side leaks the secret, the attacker can mint arbitrary tokens for the AI control plane.
- Evidence: `service/infrastructure/ai/container/jwt.go:13-76` (`sub = "integration-team"`, `issuer = "saas-platform"`: hard-coded, generic, no kid, no rotation hook).
- Suggested change: Move to RS256/ES256 with asymmetric keys rotated through SSM/KMS; include `kid` to support zero-downtime rotation.
- Estimated effort: M
- Risk if ignored: A single leaked secret compromises the container orchestrator.
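To make the suggested change concrete, here is a minimal standard-library sketch of RS256 signing with a `kid` header and verification against a key set indexed by `kid`. It assumes nothing about the existing `jwt.go`; `SignRS256`, `VerifyRS256`, `KeySet`, and `newDemoKey` are hypothetical names. A production version would more likely use a maintained JWT library and must also validate `exp`, `iss`, and `aud`, which this sketch omits.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

var b64 = base64.RawURLEncoding

// KeySet maps key IDs to public keys; rotation adds a new kid before the
// old one is retired, so tokens signed with either key keep verifying.
type KeySet map[string]*rsa.PublicKey

// SignRS256 builds a compact JWT whose header carries the key ID.
func SignRS256(priv *rsa.PrivateKey, kid string, claims map[string]any) (string, error) {
	header, err := json.Marshal(map[string]string{"alg": "RS256", "typ": "JWT", "kid": kid})
	if err != nil {
		return "", err
	}
	payload, err := json.Marshal(claims)
	if err != nil {
		return "", err
	}
	signingInput := b64.EncodeToString(header) + "." + b64.EncodeToString(payload)
	digest := sha256.Sum256([]byte(signingInput))
	sig, err := rsa.SignPKCS1v15(rand.Reader, priv, crypto.SHA256, digest[:])
	if err != nil {
		return "", err
	}
	return signingInput + "." + b64.EncodeToString(sig), nil
}

// VerifyRS256 pins the algorithm, selects the public key by kid, and checks
// the signature. Claim validation (exp, iss, aud) is intentionally left out.
func VerifyRS256(token string, keys KeySet) error {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return errors.New("malformed token")
	}
	rawHeader, err := b64.DecodeString(parts[0])
	if err != nil {
		return err
	}
	var header struct {
		Alg string `json:"alg"`
		Kid string `json:"kid"`
	}
	if err := json.Unmarshal(rawHeader, &header); err != nil {
		return err
	}
	if header.Alg != "RS256" {
		return errors.New("unexpected alg")
	}
	pub, ok := keys[header.Kid]
	if !ok {
		return errors.New("unknown kid")
	}
	sig, err := b64.DecodeString(parts[2])
	if err != nil {
		return err
	}
	digest := sha256.Sum256([]byte(parts[0] + "." + parts[1]))
	return rsa.VerifyPKCS1v15(pub, crypto.SHA256, digest[:], sig)
}

// newDemoKey generates a throwaway key; real keys would come from SSM/KMS.
func newDemoKey() *rsa.PrivateKey {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	return key
}

func main() {
	key := newDemoKey()
	token, err := SignRS256(key, "2026-04", map[string]any{"sub": "af-backend-go-api"})
	if err != nil {
		panic(err)
	}
	fmt.Println("verify with known kid:", VerifyRS256(token, KeySet{"2026-04": &key.PublicKey}))
	fmt.Println("verify with unknown kid:", VerifyRS256(token, KeySet{"2025-10": &key.PublicKey}))
}
```

The design point is that only the signer holds the private key: a leak on the verifying side exposes public keys only, so it no longer allows minting tokens.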
A.4.5 P2 — CORS allow-list accepts * by config
- Problem: YAML shows `CORS_ALLOWED_ORIGINS` default `*`; middleware simply forwards the config (`api/graphql/middleware/cors.go:26-31`). In combination with Firebase Bearer auth there is no browser-side CSRF surface, but the signal is wrong and makes auditing harder.
- Evidence: `api/graphql/middleware/cors.go:26,31`; YAML env_vars entry for `CORS_ALLOWED_ORIGINS`.
- Suggested change: Enforce an explicit allow-list in non-local envs; reject `*` when `ENVIRONMENT != local`.
- Estimated effort: S
- Risk if ignored: Policy drift; a future cookie-based flow would instantly inherit the `*`.
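The startup check suggested above could look roughly like this. The function name `ValidateCORSOrigins` and the comma-separated format of `CORS_ALLOWED_ORIGINS` are assumptions; the review does not state how the value is parsed today.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ValidateCORSOrigins parses a comma-separated origin list (assumed format)
// and rejects the wildcard in every environment except "local".
func ValidateCORSOrigins(environment, raw string) ([]string, error) {
	var origins []string
	for _, o := range strings.Split(raw, ",") {
		o = strings.TrimSpace(o)
		if o == "" {
			continue
		}
		if o == "*" && environment != "local" {
			return nil, fmt.Errorf("CORS_ALLOWED_ORIGINS: wildcard not allowed in %q", environment)
		}
		origins = append(origins, o)
	}
	if len(origins) == 0 {
		return nil, errors.New("CORS_ALLOWED_ORIGINS: no origins configured")
	}
	return origins, nil
}

func main() {
	_, err := ValidateCORSOrigins("prod", "*")
	fmt.Println("prod with wildcard:", err)

	origins, err := ValidateCORSOrigins("prod", "https://app.example.org, https://admin.example.org")
	fmt.Println("prod with list:", origins, err)
}
```

Failing at startup keeps the policy in one place and makes a misconfigured deploy visible immediately, in line with the config validation the service already does.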
A.5 Things That Don't Make Sense
- Observation: `refreshAIAgentState` returns early if the state is `StateAgent1Complete` (`formsubmission.go:403-405`), but that is precisely the state from which `ContinueAgent` is expected to move forward. The function also spawns yet another detached goroutine to restart stale agents (`formsubmission.go:419-430`), which re-introduces the A.4.1 durability problem on a read path.
  - Location: `service/application/formsubmission/formsubmission.go:399-436`
  - Hypotheses considered: (a) intentional freeze so polling doesn't race with user VNC action; (b) performance-motivated early-return to avoid calling the control plane on every query.
  - Question for author: Is there a reason a read (`formSubmission` query) causes a side-effect goroutine that can restart remote agents? Should restart be an explicit mutation instead?
- Observation: `cmd/api/graphql/main.go:99-101` leaves a commented-out `globalWaitGroup.Wait()` TODO for short-lived goroutines. This is the same concern as A.4.1 but filed as an aspirational comment.
  - Question for author: Was a wait-group attempted and abandoned, and if so, why?
- Observation: `sub = "integration-team"` and `issuer = "saas-platform"` in the AI control-plane JWT (`jwt.go:14-15`) look like left-over template values.
  - Question for author: Intentional or pre-refactor placeholder?
A.6 Anti-Patterns Detected
A.6.1 Code-level
- God object / god function: `FinalizeFormSubmission` (see A.4.2)
- Shotgun surgery
- Feature envy
- Primitive obsession: prevented by `types/id/`
- Dead code
- Copy-paste / duplication
- Magic numbers / unexplained constants: `aiAPITimeout`, `aiContainerPollInterval`, `aiAgentReadyPollInterval`, `aiRefreshInterval`, `aiContainerLifetime` declared as a block but with no rationale comments (`formsubmission.go:23-29`)
- Deep nesting (>3 levels)
- Long parameter lists (>4): `NewService(timeSource, formService, analyticsService, storageService, survivorAnalyticsService, aiContainerService, aiAgentService)` = 7 params (`formsubmission.go:78-86`); acceptable as a DI seam but crosses the threshold.
- Boolean-flag parameters
A.6.2 Architectural
- Big ball of mud
- Distributed monolith
- Chatty services
- Leaky abstraction: `refreshAIAgentState` in the application layer concretely knows about container states and restart semantics (`formsubmission.go:399-452`); should live behind a single `AILifecycle` facade.
- Golden hammer
- Vendor lock-in without exit strategy
- Stovepipe / reinvented wheel
- Missing seams for testing (hard-coded clocks, network, filesystem): `time.After` is used directly in three polling loops (`formsubmission.go:462, 491, 513`), so tests cannot advance the clock and must wait real seconds. `timesource.TimeSource` is injected for `Now()` but not for sleeps.
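One way to add the missing seam on sleeps is to pass the wait function into the poll loop. The sketch below is an assumption about shape, not the existing `pollFor…` code: `Sleeper`, `PollUntil`, `FakeSleeper`, and `demoPoll` are hypothetical names, and the loop is bounded by attempt count for simplicity where the real loops are bounded by `aiAPITimeout`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Sleeper abstracts waiting so tests can substitute a fake.
type Sleeper func(ctx context.Context, d time.Duration) error

// RealSleep waits for d or until ctx is cancelled, whichever comes first.
func RealSleep(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

var ErrPollTimeout = errors.New("poll: attempts exhausted")

// PollUntil calls check up to maxAttempts times, sleeping interval between
// attempts via the injected Sleeper.
func PollUntil(ctx context.Context, sleep Sleeper, interval time.Duration, maxAttempts int, check func() (bool, error)) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		if err := sleep(ctx, interval); err != nil {
			return err
		}
	}
	return ErrPollTimeout
}

// FakeSleeper records requested sleeps and returns immediately.
type FakeSleeper struct{ Slept []time.Duration }

func (f *FakeSleeper) Sleep(_ context.Context, d time.Duration) error {
	f.Slept = append(f.Slept, d)
	return nil
}

// demoPoll polls a check that becomes ready on its readyAfter-th call and
// reports how many sleeps were requested. No real time passes.
func demoPoll(readyAfter, maxAttempts int) (sleeps int, err error) {
	fake := &FakeSleeper{}
	calls := 0
	check := func() (bool, error) {
		calls++
		return calls >= readyAfter, nil
	}
	err = PollUntil(context.Background(), fake.Sleep, 10*time.Second, maxAttempts, check)
	return len(fake.Slept), err
}

func main() {
	sleeps, err := demoPoll(3, 5)
	fmt.Println("ready on 3rd check: sleeps =", sleeps, "err =", err)
	sleeps, err = demoPoll(10, 5)
	fmt.Println("never ready: sleeps =", sleeps, "err =", err)
}
```

Production code would pass `RealSleep`, which also makes the loop cancellable through the context; tests pass `FakeSleeper.Sleep` and finish instantly.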
A.6.3 Data
- God table
- EAV abuse: the field-type XOR subtables (`form_field_string_submission` / `_choice_submission` / `_file_submission`) deliberately avoid EAV; good.
- Missing indexes on hot queries: prefill scan (A.4.3) and households RO address lookup are both flagged in the YAML roadmap.
- N+1 queries: prevented via dataloader v7.
- Unbounded growth / no retention policy: `account.deleted_at` is soft delete only; `form_submission` and `form_field_*_submission` have no retention (YAML §4 and §17 both say "retention TBD").
- Nullable-everything schemas
- Implicit coupling via shared database: exists between this service and af-targeting via households RO, but it is read-only by design.
A.6.4 Async / Ops
- Poison messages with no dead-letter queue: detached goroutine has no DLQ (A.4.1, `formsubmission.go:203-215`).
- Retry storms / no backoff: the three poll loops sleep with fixed intervals; good.
- Missing idempotency keys on non-idempotent ops: `finalizeFormSubmission` has an `ERR_FORM_SUBMISSION_ALREADY_COMPLETED` guard but no client-supplied idempotency key; a retry from the frontend during the `analyticsService.UserUpdate` call window (`formsubmission.go:169`) could double-post to analytics before the DB state transitions.
- Hidden coupling via shared state: the in-process goroutine keeps implicit state about the submission; no other replica can take over.
- Work queues without visibility / depth metrics: no queue exists (A.4.1).
A.6.5 Security
- Secrets in code, `.env` committed, or logs
- Missing authn/z on internal endpoints: directives enforce it (`@account`, `@strongAuth`, `@authorizeAPIKey`).
- Overbroad IAM roles: cannot verify without af-infra.
- Unvalidated input crossing a trust boundary: go-playground/validator tags + custom form validators look comprehensive.
- PII/PHI in logs or error messages: not observed directly, but error-wrapping uses `fmt.Errorf("%w", err)` liberally; some wrapped errors include user-supplied values that could carry PII. No explicit redaction layer in `util/logger/handler.go`. Cannot rule out without a log audit.
- Missing CSRF/XSS/SQLi/SSRF: CORS `*` permitted (A.4.5). SQLi is well-guarded (pgx named params). SSRF: `service/infrastructure/ai/agent/agent.go` blindly issues HTTP against `status.AgentAPIBase` returned by the control plane; a compromised control plane can point this service at arbitrary internal URLs. Not necessarily a bug but worth noting.
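The SSRF note above can be addressed by validating the control-plane-returned base URL before any request is issued. The sketch below is one possible shape: `ValidateAgentBase` and the `agents.example.net` domain are hypothetical, and which domains the control plane really returns is still an open question (see A.7).

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// ValidateAgentBase accepts only https URLs without embedded credentials
// whose host equals, or is a subdomain of, an allow-listed domain.
func ValidateAgentBase(raw string, allowedDomains []string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("parse agent base: %w", err)
	}
	if u.Scheme != "https" {
		return nil, errors.New("agent base must use https")
	}
	if u.User != nil {
		return nil, errors.New("agent base must not carry credentials")
	}
	host := strings.ToLower(u.Hostname())
	for _, domain := range allowedDomains {
		domain = strings.ToLower(strings.TrimPrefix(domain, "."))
		// The leading dot prevents "evilagents.example.net" from matching.
		if host == domain || strings.HasSuffix(host, "."+domain) {
			return u, nil
		}
	}
	return nil, fmt.Errorf("agent base host %q is not allow-listed", host)
}

func main() {
	allowed := []string{"agents.example.net"} // hypothetical domain
	for _, raw := range []string{
		"https://s1.agents.example.net/state",
		"http://s1.agents.example.net/state",
		"https://169.254.169.254/latest/meta-data",
		"https://evilagents.example.net",
	} {
		_, err := ValidateAgentBase(raw, allowed)
		fmt.Println(raw, "->", err)
	}
}
```

An allow-list entry should be as narrow as possible; a shared suffix such as a CDN's public domain would admit hosts the team does not control.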
A.6.6 Detected Instances
| # | Anti-pattern | Location (file:line) | Severity | Recommendation |
|---|---|---|---|---|
| 1 | God function (FinalizeFormSubmission) | service/application/formsubmission/formsubmission.go:146-221 | P1 | Split into 3 helpers (A.4.2) |
| 2 | Magic constants (no rationale) | service/application/formsubmission/formsubmission.go:23-29 | P2 | Add doc comments or link to runbook thresholds |
| 3 | Long parameter list (7 args) | service/application/formsubmission/formsubmission.go:78-86 | P2 | Introduce a Dependencies struct |
| 4 | Leaky abstraction (app layer knows container state) | service/application/formsubmission/formsubmission.go:399-452 | P1 | Extract AILifecycle service |
| 5 | Missing test seam on time.After | service/application/formsubmission/formsubmission.go:462,491,513 | P2 | Inject a sleeper or wrap via timesource |
| 6 | Missing index on prefill scan | domain/form/postgres/form.go:646-700 | P1 | Add index + DISTINCT ON (A.4.3) |
| 7 | Unbounded retention on submission PII | database/sql/migrations/003_form_updates.sql, 015_repeated_groups.sql | P1 | Define retention schedule + nightly purge (A.10, A.16) |
| 8 | Fire-and-forget AI job w/o DLQ | service/application/formsubmission/formsubmission.go:203-215 | P0 | Durable outbox (A.4.1) |
| 9 | Missing client-supplied idempotency key on finalize | service/application/formsubmission/formsubmission.go:147 | P2 | Accept Idempotency-Key header, store hash |
| 10 | Hidden coupling via in-process goroutine state | service/application/formsubmission/formsubmission.go:203-215, 419-430 | P1 | Move to external queue |
| 11 | CORS permits * | api/graphql/middleware/cors.go:26 + env default | P2 | Enforce allow-list in dev/stg/prod (A.4.5) |
| 12 | HS256 shared-secret JWT to AI control plane | service/infrastructure/ai/container/jwt.go:13-76 | P1 | Move to RS256/ES256 + kid (A.4.4) |
| 13 | Implicit SSRF surface via control-plane-returned URL | service/infrastructure/ai/agent/agent.go (per YAML §2) | P2 | Validate AgentAPIBase is in an allow-listed CloudFront domain |
A.7 Open Questions
- Q: Is there any plan for durable job handling of AI lifecycle (outbox, SQS, worker pool), or is the fire-and-forget pattern intentional?
  - Blocks: A.4.1, A.6.4
  - Who can answer: backend lead (Jiri / Marek)
- Q: What is the retention policy for `form_submission` and `form_field_*_submission` rows containing SSN/DOB? YAML says "TBD" throughout §4 and §17.
  - Blocks: A.6.3, A.10, A.16
  - Who can answer: product + legal
- Q: Is `FIXED_COMPLEXITY_LIMIT=20` actually tuned against real client queries, or a placeholder?
  - Blocks: A.12
  - Who can answer: whoever maintains af-frontend query shapes
- Q: Does the AI control plane return agent URLs from a fixed domain (e.g., `*.cloudfront.net` tenant bucket) so an allow-list is enforceable?
  - Blocks: A.6.5 (A.6.6 #13)
  - Who can answer: af-disaster-assistance-gov-agent owner
A.8 Difficulties Encountered
- Difficulty: 18,371-line `api/graphql/graph/generated.go` dominates grep results for resolver-related queries.
  - Impact on analysis: Harder to confirm which logic is generated vs. hand-written; relied on `schema.resolvers.go` plus domain layer.
  - Fix that would help next reviewer: `.gitattributes linguist-generated=true` + explicit mention in README.
- Difficulty: No ability to run tests (`make test-run`) or observe metrics in this review session.
  - Impact on analysis: Coverage numbers, flake rate, p50/p99 latencies, queue depths are all marked "TBD" in A.2/A.13/A.14.
  - Fix: Publish a coverage badge + Prometheus/CloudWatch dashboard link in README.
- Difficulty: af-infra Terraform not in this repo, so IAM scope, SSM parameter layout, and ALB/CloudFront policies could not be verified.
  - Impact on analysis: STRIDE gaps below (A.11) had to be inferred from config.
A.9 Risks & Unknowns
A.9.1 Known risks
| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | AI job lost on pod restart | M | H | A.4.1 (durable outbox) |
| 2 | Prefill tail-latency under load | M | M | A.4.3 (index / denormalization) |
| 3 | HS256 shared secret leak | L | H | A.4.4 (asymmetric keys) |
| 4 | PII retention non-compliance (FEMA Privacy Act, GDPR) | M | H | Retention schedule + purge job |
| 5 | Double-finalize during transient frontend retry | L | M | Idempotency key |
| 6 | Control-plane URL redirection (SSRF-adjacent) | L | M | AgentAPIBase allow-list |
A.9.2 Unknown unknowns
- Area not reviewed: Plaid flow in detail. Best guess at risk level: L. The pattern is textbook (no persistence, one-shot exchange), but I did not walk `service/infrastructure/plaid/plaid.go` line by line.
- Area not reviewed: Admin-only mutations and the `@authorizeAPIKey` directive implementation end-to-end. Best guess at risk level: M. Admin endpoints share the same route, so an auth regression would be high-blast-radius.
- Area not reviewed: The 971-line `domain/form/postgres/form.go`; only skimmed for prefill. Best guess at risk level: M. Complex hand-written SQL.
- Area not reviewed: `util/logger/handler_dev.go` and how structured logs are emitted in prod; PII scrubbing not verified. Best guess at risk level: M.
- Area not reviewed: CI workflow contents (`build.yaml`, `tests.yaml`, `vuln-scan.yaml`); existence confirmed, content not read. Best guess at risk level: L.
A.10 Technical Debt Register
| # | Debt item | Quadrant | Estimated interest | Remediation |
|---|---|---|---|---|
| 1 | Fire-and-forget AI goroutine, no durability | Reckless & Inadvertent | High — invisible failures of the flagship feature | A.4.1 |
| 2 | FinalizeFormSubmission oversize (author-nolinted) | Prudent & Deliberate | Medium — slows future changes | A.4.2 |
| 3 | Prefill O(n) scan | Prudent & Deliberate (roadmap acknowledges) | Medium — grows with account lifetime | A.4.3 |
| 4 | Polling-based AI state (no subscriptions) | Prudent & Deliberate | Low–Medium — extra DB QPS + UX latency | GraphQL subscriptions (YAML §19) |
| 5 | HS256 shared secret for AI JWT | Reckless & Inadvertent | Low prob, High blast | A.4.4 |
| 6 | No app-level audit log (who submitted / accessed what PII) | Prudent & Deliberate | Medium — compliance risk | Dedicated audit table + async writer |
| 7 | Retention policy undefined for PII tables | Reckless & Inadvertent | Medium — regulatory | Retention schedule + purge |
| 8 | CORS allow-list defaults to * | Prudent & Inadvertent | Low | A.4.5 |
| 9 | time.After prevents time-travel tests | Prudent & Inadvertent | Low — slows test suite | Inject sleeper |
| 10 | generated.go not marked linguist-generated | Prudent & Inadvertent | Very Low | Add .gitattributes |
A.11 Security Posture (lightweight STRIDE)
| Category | Threat present? | Mitigated? | Gap |
|---|---|---|---|
| Spoofing (identity) | Yes (Firebase tokens, admin key) | Mostly (auth.go:42-56, @strongAuth freshness) | Admin key in header; no mTLS between services |
| Tampering (integrity) | Yes | Partially | SQL parameterized (pgx named params); no row-level integrity hashes; JWT HS256 shared secret (A.4.4) |
| Repudiation | Yes (survivor actions on FEMA behalf) | No | No app-level audit log (A.10 #6) |
| Information Disclosure | Yes (PII everywhere in submissions) | Partially | TLS + soft-delete; no field-level encryption; PII in wrapped error strings not audited |
| Denial of Service | Yes | Partially | FIXED_COMPLEXITY_LIMIT=20, 4 MiB body limit (bodylimit.go:12), 25 MiB file cap; no per-account rate limiter; poll loops cannot be cancelled |
| Elevation of Privilege | Yes (admin directive) | Yes on surface | Cannot verify IAM scope of ECS task role |
A.12 Operational Readiness
| Capability | Present / Partial / Missing | Notes |
|---|---|---|
| Structured logs | Present | util/logger/handler.go slog-based |
| Metrics | Missing (unverified) | No Prometheus/OTel wiring seen in cmd/api/graphql/main.go |
| Distributed tracing | Missing (unverified) | No OTel SDK import visible |
| Actionable alerts | Partial | Runbook lists ECS task health + deployment-badge; thresholds "TBD" in YAML §14 |
| Runbooks | Present | data/af-backend-go-api/runbook.md |
| On-call ownership defined | Present | CODEOWNERS at repo root |
| SLOs / SLIs | Missing | YAML §12 all "TBD" |
| Backup & restore tested | Unknown | RDS managed, not verified |
| Disaster recovery plan | Unknown | Cross-repo concern |
| Chaos / failure testing | Missing | No evidence |
A.13 Test & Quality Signals
- Coverage (line / branch): Not published (YAML §15 `coverage_pct: null`). File counts: 35 `_test.go` vs 124 source `.go` ≈ 28% file ratio; low but concentrated on the right places (domain/form, AI agent, JWT, auth middleware).
- Trend: Unknown.
- Flake rate: Unknown.
- Slowest tests: Likely the three `pollFor…` loops if tested directly, since they use real `time.After` (A.6.2).
- Untested critical paths: End-to-end AI lifecycle (no integration test walking finalize→poll→continue), admin API-key directive, `FinalizeFormSubmission` happy path (test file exists at `service/application/formsubmission/formsubmission_test.go` but cannot assess depth without running).
- Missing test types: [ ] unit (have) [ ] integration (partial, LocalStack+emulator per YAML) [x] e2e (no dedicated repo-level e2e) [x] contract (no schema-diff/contract test between schema.graphqls and frontend) [x] load [x] security/fuzz
A.14 Performance & Cost Smells
- Hot paths: `form`, `formSubmission`, `formPrefill`, `actualHouseholdInfo` queries; `submitForm` + `finalizeFormSubmission` mutations.
- Suspected bottlenecks: (1) `ReadFormPrefill` full-scan (`domain/form/postgres/form.go:646`); (2) households RO address matching without documented indexes (YAML §12); (3) three 5-minute poll loops hold goroutines + DB connections per in-flight AI job (`formsubmission.go:454-530`).
- Wasteful queries / loops: Poll loops write an AI-state update on every tick (`formsubmission.go:469, 520`) even when state has not changed; these are unnecessary row writes at ~10s cadence.
- Oversized infra / idle resources: N/A (not reviewed).
- Cache hit/miss surprises: No cache layer; YAML §19 lists "cache hot reads (ListActiveDisasters, form definitions)" as roadmap. Today every `activeDisasters` query hits Postgres.
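The per-tick write noted above can be avoided by persisting only on change. The sketch below wraps a store with a last-written check; `StateStore`, `ChangeOnlyStore`, and the state names are hypothetical, and the wrapper is meant to be owned by a single poll loop (it is not safe for concurrent use). An equivalent SQL-side guard would be an `UPDATE` whose `WHERE` clause excludes rows already in the target state.

```go
package main

import "fmt"

// StateStore is a hypothetical persistence interface for AI state.
type StateStore interface {
	SaveState(submissionID, state string) error
}

// ChangeOnlyStore forwards a write only when the state differs from the
// last one it persisted for that submission.
type ChangeOnlyStore struct {
	next StateStore
	last map[string]string
}

func NewChangeOnlyStore(next StateStore) *ChangeOnlyStore {
	return &ChangeOnlyStore{next: next, last: make(map[string]string)}
}

func (s *ChangeOnlyStore) SaveState(submissionID, state string) error {
	if prev, ok := s.last[submissionID]; ok && prev == state {
		return nil // unchanged since the last successful write
	}
	if err := s.next.SaveState(submissionID, state); err != nil {
		return err // do not remember a state that was never persisted
	}
	s.last[submissionID] = state
	return nil
}

// CountingStore is a test double that counts writes.
type CountingStore struct{ Writes int }

func (c *CountingStore) SaveState(string, string) error { c.Writes++; return nil }

func main() {
	db := &CountingStore{}
	store := NewChangeOnlyStore(db)
	// Five poll ticks, two distinct states.
	for _, st := range []string{"STARTING", "STARTING", "STARTING", "READY", "READY"} {
		if err := store.SaveState("sub-1", st); err != nil {
			panic(err)
		}
	}
	fmt.Println("ticks: 5, writes:", db.Writes)
}
```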
A.15 Bus-Factor & Knowledge Risk
- Who is the only person who understands X? Cannot infer from code alone; YAML lists three authors (Siroky, Burda, Bafrnec).
- What breaks if they disappear tomorrow? The AI state machine (public vs internal mapping in `formsubmission.go:546-586`) and the `refreshAIAgentState` semantics (A.5 #1) are the highest-knowledge-density sections; no comments explain why each early-return exists.
- What is undocumented tribal knowledge? The reason the VNC endpoint validity is 1h while the container lifetime is 48h, and the interplay with `aiRefreshInterval`'s 2-min cooldown.
- Suggested knowledge-transfer actions: Add a state-machine diagram in `service/application/formsubmission/` as a README, ideally with decision rationale embedded as comments on the state-mapping functions.
A.16 Compliance Gaps
| Regulation | Requirement | Status | Gap | Remediation |
|---|---|---|---|---|
| FEMA Privacy Act | Defined retention + purge of survivor PII | Partial | Soft-delete only; no purge job; retention "TBD" (YAML §4) | Define schedule + nightly purge |
| GDPR / CCPA | Data export + hard delete on DSAR | Missing | YAML §19 acknowledges no export endpoint | Build DSAR pipeline |
| Login.gov IAL2/AAL2 | Performed by agent in VNC | OK | Verification lives outside this service | Document trust boundary in runbook |
| NIST 800-53 (audit) | Application-level audit log | Missing | CloudTrail only — no per-user data-access trail | Add audit log table (A.10 #6) |
| Plaid Data Use Policy | Do not persist credentials | OK | Confirmed via service/infrastructure/plaid/plaid.go wrapping (YAML §17) | — |
A.17 Recommendations Summary
| Priority | Action | Owner (suggested) | Effort | Depends on |
|---|---|---|---|---|
| P0 | Durable AI-job outbox + startup reaper + depth/age metrics (A.4.1, A.6.4, A.10 #1) | backend-go-api maintainers | M | af-infra (SQS or table migration) |
| P0 | Define and enforce PII retention schedule (A.10 #7, A.16 FEMA/GDPR) | product + legal + backend | M | legal sign-off |
| P1 | Split FinalizeFormSubmission; extract AILifecycle facade (A.4.2, A.6.2, A.6.6 #1,#4) | backend | S | — |
| P1 | Index + DISTINCT ON (or denorm) for prefill (A.4.3, A.6.3) | backend | S | — |
| P1 | Replace HS256 shared-secret AI JWT with RS256 + kid rotation (A.4.4, A.6.5) | backend + af-infra | M | control-plane team |
| P1 | Add app-level audit log table + async writer (A.10 #6, A.16 NIST) | backend | M | — |
| P1 | Idempotency key on finalizeFormSubmission (A.6.4 #3) | backend + af-frontend | S | frontend coordination |
| P2 | CORS allow-list enforced per env (A.4.5) | backend | S | — |
| P2 | Inject sleeper/ticker to make poll loops time-travel testable (A.6.2) | backend | S | — |
| P2 | AgentAPIBase domain allow-list (SSRF hygiene, A.6.6 #13) | backend | S | control-plane team |
| P2 | Add .gitattributes linguist-generated=true for generated.go (A.10 #10) | backend | S | — |
| P2 | Cache hot reads: activeDisasters, form definitions (YAML §19) | backend | S | — |
| P2 | Publish coverage badge + metrics dashboard link in README (A.8, A.12) | backend | S | — |
Environment variables
| Name | Purpose |
|---|---|
| ENVIRONMENT* | local / dev / stg / prod |
| PORT* | HTTP listen port |
| DATABASE_HOST* | Main DB host |
| DATABASE_PORT* | Main DB port |
| DATABASE_USERNAME* | Main DB user |
| DATABASE_DB_NAME* | Main DB name |
| DATABASE_HOUSEHOLDS_RO_HOST* | Households RO host |
| DATABASE_HOUSEHOLDS_RO_PORT* | Households RO port |
| FIREBASE_CREDENTIALS* | Firebase service-account JSON |
| FIREBASE_AUTH_EMULATOR_HOST | Dev-only emulator override |
| ADMIN_API_KEY* | Header value for @authorizeAPIKey directive |
| MAPBOX_USERNAME* | Mapbox tileset namespace |
| AI_CONTAINER_API_HOST* | AI control-plane base URL |
| AI_JWT_SECRET* | HS256 secret for control-plane JWT |
| ANALYTICS_API_HOST* | User Update Service base URL (af-targeting FastAPI; .env.common defaults to :8090, the af-targeting docker-compose canonical port is :8080; local dev typically port-maps 8090→8080) |
| S3_BUCKET_FILE_SUBMISSION_NAME* | S3 bucket for form files |
| S3_ENDPOINT_URL | LocalStack S3 override (dev) |
| PLAID_CLIENT_ID* | Plaid client ID |
| PLAID_SECRET* | Plaid secret |
| PLAID_ENVIRONMENT* | sandbox / production |
| MAX_FILE_SUBMISSION_SIZE_MIB* | Max upload size |
| ALLOWED_FILE_SUBMISSION_TYPES* | MIME whitelist |
| FIXED_COMPLEXITY_LIMIT* | GraphQL complexity cap |
| STRONG_AUTH_MAX_AGE* | @strongAuth freshness |
| LOG_LEVEL | slog level |
| CORS_ALLOWED_ORIGINS | CORS list |
