AidFinder

af-backend-go-api

Aid Finder Backend Go API

Go/gqlgen GraphQL backend orchestrating accounts, server-driven dynamic forms, FEMA submission flow, AI-agent container lifecycle, eligibility analytics, and Plaid bank linking.

Domain role: Backend GraphQL API, the central nervous system of the AidFinder platform
Last updated: 2026-04-02
Lines of code: 52,066
API style: GraphQL

Go service built on gqlgen v0.17.86 + chi router with a clean-architecture layout (domain / service / infrastructure). Main store is PostgreSQL via pgx v5; a separate read-only households DB serves eligibility lookups. Auth is Firebase (email-link OOB in dev, ID-token verification in prod). Plaid SDK powers bank linking. AI agents are run as ECS Fargate containers, orchestrated via a JWT-secured control plane plus CloudFront-signed-URL data plane.

Role in the system: Receives form submissions from af-frontend, persists to PostgreSQL, runs eligibility via the af-targeting analytics service, spawns per-survivor Fargate AI containers (af-disaster-assistance-gov-agent / af-fema-real-ai-agent) and exposes their VNC endpoints back to the user.

Surfaces:

  • GraphQL Query: form, forms, formSubmission(s), formFiles, formPrefill, activeDisasters, householdDetailedInfo, actualHouseholdInfo, me, accounts(admin)
  • GraphQL Mutation: signIn, updateMe, submitForm, finalizeFormSubmission, continueAgent, createPlaidLinkToken, linkBankAccount
  • Custom directives: @account, @strongAuth, @authorizeAPIKey, @validate, @goField
  • REST: POST /api/graphql, GET /deployment-badge, implicit /health

User workflows

  • Survivor sign-in

    Bearer token usable on all @account-guarded operations

  • Form submission (two-step S3 upload)

    FormSubmissionInfo with eligibility + expectedAmount; AI state machine begins

  • AI form automation (lazy polling)

    FEMA application submitted on survivor's behalf; VNC remains available 48h for review

  • Household eligibility lookup

    Returns ActualHouseholdInfo for the survivor's address

  • Plaid bank linking

    Account/routing/bank metadata returned to frontend; survivor uses it to autofill financial form fields

API endpoints

  • QUERY form(formId: UUID!): Load form structure with sections, fields, conditions
  • QUERY formSubmission(formSubmissionId!): Load submission with state, eligibility, AI status
  • QUERY formPrefill(formId!): Latest submitted values for prefill
  • QUERY activeDisasters: List active declared disasters
  • QUERY actualHouseholdInfo(limit, offset): Account's address-matched household + program eligibility
  • QUERY me: Current account profile
  • QUERY accounts(limit, offset): Admin: list accounts
  • MUTATION signIn: Verify Firebase token, upsert account
  • MUTATION updateMe(input): Update profile
  • MUTATION submitForm(formId!, input): Validate + persist submission, return presigned S3 URLs
  • MUTATION finalizeFormSubmission(submissionId!): Mark files uploaded, call analytics, fire AI container
  • MUTATION continueAgent(submissionId!): Resume Agent 2 after user Login.gov 2FA in VNC
  • MUTATION createPlaidLinkToken: Mint Plaid Link token
  • MUTATION linkBankAccount(publicToken!): Exchange public token, return account details (not persisted)
  • POST /api/graphql: GraphQL HTTP transport
  • GET /deployment-badge: CloudFront-served deployment badge metadata

Third-party APIs

  • Firebase Auth

    Email-link OOB sign-in + ID-token verification

  • Plaid

    Bank account linking, account/routing retrieval

  • Mapbox (indirect)

    Tileset namespace owner; no direct API calls from this service (af-map handles uploads)

  • AI Container API (control plane)

    Start/stop/suspend per-survivor Fargate containers

  • AI Agent API (data plane)

    Agent orchestration: /health, /survivor-info, /state, /agent1/run, /agent2/run

  • User Update Service (analytics)

    Eligibility determination from form data

Service dependencies

  • PostgreSQL (main)

    Accounts, forms, submissions, AI/analytics metadata

  • PostgreSQL (households RO)

    Read-only households DB synced from af-targeting BigQuery — eligibility lookups

  • AWS S3

    Form file submissions (presigned PUT URLs)

  • af-map

    Downstream tile worker; planned SNS/SQS trigger on disaster updates (not yet implemented)

  • af-disaster-assistance-gov-agent / af-fema-real-ai-agent

    Per-survivor Fargate containers spawned by this service

  • af-targeting (User Update Service)

    Eligibility scoring + household sync

Analysis

Overall health: 3.6 / 5 (strong)

  • Module overview / clarity of intent: 5
  • External dependencies: 4
  • API endpoints: 4
  • Database schema: 4
  • Backend services: 3
  • WebSocket / real-time: 2
  • Data flow clarity: 4
  • Error handling & resilience: 3
  • Configuration: 4
  • Data refresh patterns: 2
  • Performance: 3
  • Module interactions: 4
  • Troubleshooting / runbooks: 4
  • Testing & QA: 3
  • Deployment & DevOps: 4
  • Security & compliance: 3
  • Documentation & maintenance: 5
  • Roadmap clarity: 4

af-backend-go-api — Prop-Build Analysis

Document Type: Critical Review & Analysis (companion to prop-build-template.md)
Scope: Per-Repo
Subject: af-backend-go-api (AidFinder GraphQL backend)
Reviewer(s): Claude (automated code review)
Date: 2026-04-09
Version: 0.1
Confidence Level: Medium
What would raise confidence: running tests locally, prod metrics/traces, interview with Jiri/Marek/Matúš, observing an actual AI-agent submission end-to-end.

Inputs Reviewed:

  • Prop-build doc: data/af-backend-go-api.yaml
  • Companion docs: data/af-backend-go-api/{api-examples,data-flow,deployment,runbook}.md, schema.graphql
  • Source tree: /Users/andres/src/af/af-backend-go-api/ (124 non-test .go files, 35 test files, 16 tern migrations, 6 GitHub Actions workflows)

Part A — Per-Repo Analysis

A.1 Executive Summary

  • Overall health: Solid, conventionally-structured Go service with clean-architecture layering, strong typing, and meaningful test coverage on hot domain logic; the biggest risks are operational (polling-based AI state, fire-and-forget goroutines) and a handful of latent correctness/security smells rather than structural rot.
  • Top risk: The AI-agent lifecycle is driven by an unsupervised goroutine spawned from FinalizeFormSubmission (service/application/formsubmission/formsubmission.go:203-215) using context.WithoutCancel, with no DLQ, no retry, no persistence of in-flight state, and no visibility metric — a crash mid-poll silently strands a submission (see A.4.1).
  • Top win / worth preserving: Clean domain/service/infrastructure split with hand-rolled SQL + scany (no ORM), dependency-injected services validated at construction (formsubmission.go:86-116), and a disciplined structured-error taxonomy (api/graphql/errors/error.go, README error codes).
  • Single recommended next action: Persist AI-lifecycle work as a durable job (DB outbox + a polling worker or SQS) and expose depth/age metrics; everything else on the list is P1/P2.
  • Blocking unknowns: Prod metrics, actual coverage numbers, RDS index state for households RO, whether FIXED_COMPLEXITY_LIMIT=20 is tuned for real clients, and whether the planned SNS/SQS → af-map publisher exists anywhere in branch.

A.2 Health Scorecard

| # | Dimension | Score (1–5) | Justification |
|---|---|---|---|
| 1 | Module overview / clarity of intent | 5 | README + prop-build doc + companion files are thorough and match code. |
| 2 | External dependencies | 4 | Modern, pinned versions (go.mod); Firebase/Plaid/AWS SDKs wrapped behind interfaces. |
| 3 | API endpoints | 4 | Schema-first gqlgen with directives (api/graphql/graph/schema.resolvers.go, 675 LOC resolvers); introspection on in all envs is a minor smell. |
| 4 | Database schema | 4 | 16 hand-written tern migrations, XOR field-type subtables, deferrable constraints; lacks explicit retention. |
| 5 | Backend services | 3 | FinalizeFormSubmission carries a //nolint:funlen,cyclop,gocognit tag at formsubmission.go:146 (acknowledged complexity); AI orchestration leaks through application layer. |
| 6 | WebSocket / real-time | 2 | No subscriptions/WS; only client polling (cmd/api/graphql/main.go:83-88 TODO comment). |
| 7 | Frontend components | N/A | Backend only. |
| 8 | Data flow clarity | 4 | data-flow.md matches the code; sequence is traceable through resolvers → application → domain. |
| 9 | Error handling & resilience | 3 | Good error taxonomy, but no circuit breakers, no retries on outbound HTTP, TODO comment at formsubmission.go:171-172 ("find out how to handle errors"). |
| 10 | Configuration | 4 | Env-var parsing via strv config in cmd/api/graphql/setup/setup.go; validated at startup. |
| 11 | Data refresh patterns | 2 | Lazy poll with 2-min cooldown (aiRefreshInterval, formsubmission.go:27) is fragile; no caching of hot reads. |
| 12 | Performance | 3 | gqlgen complexity cap + dataloader v7 for N+1 are good; ReadFormPrefill (domain/form/postgres/form.go:646) is the known O(n) hotspot. |
| 13 | Module interactions | 4 | Well-bounded; dependency boundaries crisp; only real coupling is downstream agents. |
| 14 | Troubleshooting / runbooks | 4 | runbook.md exists and is specific; missing alert thresholds. |
| 15 | Testing & QA | 3 | 35 test files for 124 source files; unit coverage on form transform, JWT, auth middleware, agents; no coverage number published; no load/fuzz. |
| 16 | Deployment & DevOps | 4 | 6 GH Actions workflows (build, tests, lint, release, deploy-task-definition, vuln-scan), multi-stage Dockerfile, tern-on-startup migrations. |
| 17 | Security & compliance | 3 | Firebase + directive-based authz is correct; HS256 JWT with shared secret to AI control plane (service/infrastructure/ai/container/jwt.go:20) and CORS default * are soft spots (see A.6.5). |
| 18 | Documentation & maintenance | 5 | README is exemplary; CHANGELOG present; CODEOWNERS file. |
| 19 | Roadmap clarity | 4 | Tech-debt + planned items enumerated in YAML §19 and README. |

Overall score: 3.61 (65 / 18, the average of the 18 scored rows; Frontend N/A excluded). Weighted reading: the repo is comfortably above "acceptable" on code, docs, and deployment; it dips into "needs work" on real-time/refresh (rows 6 and 11) and on the resilience of its most valuable flow (row 9). In other words, the weakest links are exactly where a survivor's submission lives.


A.3 What's Working Well

  • Strength: Dependency injection with explicit nil-checks at construction.

    • Location: service/application/formsubmission/formsubmission.go:86-116
    • Why it works: Every collaborator is an interface; constructor refuses to return a half-wired service. Makes testing trivial and failures loud at startup instead of runtime.
    • Propagate to: Any repo still using package-level globals for dependencies.
  • Strength: Strong ID types via code-gen (types/id/).

    • Location: types/id/id.go, types/id/id_gen.go
    • Why it works: Compile-time separation of id.Account / id.FormSubmission / id.Container / id.Form avoids primitive obsession; swapping a submissionID for an accountID is a type error, not a runtime bug.
    • Propagate to: Other Go services that pass bare UUIDs around.
  • Strength: Structured, code-first error taxonomy surfaced through GraphQL extensions.

    • Location: api/graphql/errors/error.go, api/graphql/graph/error.go, README error codes section
    • Why it works: Frontend can pattern-match on stable codes (ERR_FORM_*, FIELD_*) instead of scraping strings. api/graphql/errors/error_test.go locks the contract.
    • Propagate to: Any repo where the frontend currently greps errors by message.
  • Strength: Dataloader v7 wired as request-scoped middleware.

    • Location: api/graphql/middleware/dataloader.go + dataloader/context.go
    • Why it works: Prevents the classic GraphQL N+1 on nested resolvers without forcing batch-everywhere discipline on the resolver layer.
    • Propagate to: Any GraphQL server still writing naive per-field fetches.

A.4 What to Improve

A.4.1 P0 — AI lifecycle is a fire-and-forget goroutine with no durability or visibility

  • Problem: The entire post-finalize AI flow (start container, poll ~5min, submit survivor info, run agent 1) runs in a detached goroutine spawned from FinalizeFormSubmission, using context.WithoutCancel(ctx). If the pod is killed between finalize and any of the three pollFor… loops, the submission sits in an intermediate state forever; only the client's next poll can nudge refreshAIAgentState back into life, and only if the container itself is still running. There is no DLQ, no retry budget, no visible queue depth, no metric for in-flight jobs.
  • Evidence:
    • service/application/formsubmission/formsubmission.go:203-215 (detached goroutine, context.WithoutCancel, panic recovery that only logs).
    • service/application/formsubmission/formsubmission.go:454-530 three pollFor… loops bounded only by aiAPITimeout = 5 * time.Minute (formsubmission.go:24).
    • cmd/api/graphql/main.go:83-88: the BeforeShutdown hook is empty; the TODO comment explicitly notes "can be for example closing of websocket connections, or messages handler (SQS, pub/sub)", so there is no graceful drain for in-flight AI jobs.
    • formsubmission.go:171-172: "// TODO find out how to handle errors" on the analytics path that precedes the AI kick-off.
  • Suggested change: Persist a row in a pending_ai_jobs table (or SQS message) inside the finalize transaction; have a worker (in-proc ticker or separate consumer) pick it up. Emit a pending_ai_jobs_depth and oldest_pending_age metric. Short-term, at minimum add a startup reaper that scans submissions in non-terminal AI states older than aiAPITimeout and transitions them to ERROR.
  • Estimated effort: M
  • Risk if ignored: Silent data loss of the platform's headline feature; survivors see an eternally "running" state after any pod restart.
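
The short-term reaper and the two suggested gauges can be sketched in memory as follows. This is a minimal illustration, not the service's code: the state names, the Submission shape, and the function names are all hypothetical, and a real version would read and update rows in PostgreSQL rather than a slice.

```go
package main

import "time"

// AIState is a hypothetical stand-in for the service's internal AI state type.
type AIState string

const (
	StateContainerStarting AIState = "CONTAINER_STARTING" // hypothetical non-terminal state
	StateAgent1Running     AIState = "AGENT1_RUNNING"     // hypothetical non-terminal state
	StateCompleted         AIState = "COMPLETED"          // hypothetical terminal state
	StateError             AIState = "ERROR"              // terminal state named in the suggestion above
)

// Submission carries only the fields the reaper needs (hypothetical shape).
type Submission struct {
	ID        string
	State     AIState
	UpdatedAt time.Time
}

func isTerminal(s AIState) bool {
	return s == StateCompleted || s == StateError
}

// ReapStale moves every non-terminal submission whose last update is older
// than timeout to ERROR and returns the IDs it changed.
func ReapStale(subs []Submission, now time.Time, timeout time.Duration) []string {
	var reaped []string
	for i := range subs {
		if isTerminal(subs[i].State) || now.Sub(subs[i].UpdatedAt) <= timeout {
			continue
		}
		subs[i].State = StateError
		subs[i].UpdatedAt = now
		reaped = append(reaped, subs[i].ID)
	}
	return reaped
}

// QueueStats returns the number of in-flight (non-terminal) submissions and
// the age of the oldest one, i.e. the depth and oldest-age gauges.
func QueueStats(subs []Submission, now time.Time) (depth int, oldest time.Duration) {
	for _, s := range subs {
		if isTerminal(s.State) {
			continue
		}
		depth++
		if age := now.Sub(s.UpdatedAt); age > oldest {
			oldest = age
		}
	}
	return depth, oldest
}

// exampleReap runs the reaper over three submissions with a 5-minute timeout.
func exampleReap() (reaped []string, depth int, oldest time.Duration) {
	now := time.Date(2026, 4, 9, 12, 0, 0, 0, time.UTC)
	subs := []Submission{
		{ID: "a", State: StateAgent1Running, UpdatedAt: now.Add(-10 * time.Minute)},    // stale
		{ID: "b", State: StateContainerStarting, UpdatedAt: now.Add(-1 * time.Minute)}, // still fresh
		{ID: "c", State: StateCompleted, UpdatedAt: now.Add(-time.Hour)},               // terminal, ignored
	}
	reaped = ReapStale(subs, now, 5*time.Minute)
	depth, oldest = QueueStats(subs, now)
	return reaped, depth, oldest
}
```

In exampleReap only submission "a" is reaped; afterwards one submission remains in flight and its age is one minute. Running the same scan at startup and on a ticker would bound how long a stranded submission can stay in an intermediate state.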

A.4.2 P1 — FinalizeFormSubmission is explicitly too large and mixes concerns

  • Problem: The finalize path carries //nolint:funlen,cyclop,gocognit — an author-acknowledged smell. It validates files, calls analytics, decides eligibility, launches the goroutine, and owns error translation in one function.
  • Evidence: service/application/formsubmission/formsubmission.go:146-221 (~75 lines, 4 external services, 7 error branches, 1 goroutine).
  • Suggested change: Split into (a) finalizeAndScore (files + analytics + DB update), (b) enqueueAIJob (the durable-queue piece from A.4.1), (c) shouldRunAI policy helper. The anti-pattern suppression comment should then be removable.
  • Estimated effort: S
  • Risk if ignored: Any change to finalize logic touches the same fragile 75-line block; regressions are likely.

A.4.3 P1 — Form prefill is a known O(n) scan with no index

  • Problem: ReadFormPrefill fans out four batch queries across every submission a user has ever made to recover the latest value per field. Documented as slow in the YAML roadmap but no fix in progress.
  • Evidence: domain/form/postgres/form.go:646-700; YAML section_19_roadmap.tech_debt[0].
  • Suggested change: Either (a) add a (account_id, field_id, created_at DESC) index plus DISTINCT ON SQL, or (b) denormalize into a form_field_latest_value table updated on submit.
  • Estimated effort: S
  • Risk if ignored: Linear growth in per-user submission history turns prefill (called on every form open) into a tail-latency spike.
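
Option (a) might look like the query below. The table and column names are assumptions, not taken from the repo's migrations; the Go function mirrors the query's "newest row per field" semantics in memory so the intent can be checked without a database.

```go
package main

import (
	"sort"
	"time"
)

// prefillSQL is a hypothetical query: form_field_values, account_id, field_id,
// value and created_at are assumed names. An index on
// (account_id, field_id, created_at DESC) lets PostgreSQL answer it without
// scanning every submission.
const prefillSQL = `
SELECT DISTINCT ON (field_id) field_id, value
FROM form_field_values
WHERE account_id = $1
ORDER BY field_id, created_at DESC`

// FieldValue is one submitted value for one field (hypothetical shape).
type FieldValue struct {
	FieldID   string
	Value     string
	CreatedAt time.Time
}

// LatestPerField keeps the newest value for each field, ordered by field ID,
// which is what DISTINCT ON (field_id) ... ORDER BY field_id, created_at DESC returns.
func LatestPerField(rows []FieldValue) []FieldValue {
	latest := make(map[string]FieldValue)
	for _, r := range rows {
		if cur, ok := latest[r.FieldID]; !ok || r.CreatedAt.After(cur.CreatedAt) {
			latest[r.FieldID] = r
		}
	}
	out := make([]FieldValue, 0, len(latest))
	for _, v := range latest {
		out = append(out, v)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].FieldID < out[j].FieldID })
	return out
}

// examplePrefill feeds two phone values and one city value through LatestPerField.
func examplePrefill() []FieldValue {
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	return LatestPerField([]FieldValue{
		{FieldID: "phone", Value: "555-0100", CreatedAt: t0},
		{FieldID: "phone", Value: "555-0199", CreatedAt: t0.Add(24 * time.Hour)},
		{FieldID: "city", Value: "Austin", CreatedAt: t0},
	})
}
```

examplePrefill returns the city value followed by the newer of the two phone values. Option (b) trades this query for a write on every submit, which may be preferable if prefill is read far more often than forms are submitted.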

A.4.4 P1 — HS256 shared-secret JWT for AI control plane

  • Problem: Control-plane JWT uses HS256 with a shared secret (AI_JWT_SECRET), 1h TTL, 5-min refresh window; if either side leaks the secret, the attacker can mint arbitrary tokens for the AI control plane.
  • Evidence: service/infrastructure/ai/container/jwt.go:13-76 (sub = "integration-team", issuer = "saas-platform" — hard-coded, generic, no kid, no rotation hook).
  • Suggested change: Move to RS256/ES256 with asymmetric keys rotated through SSM/KMS; include kid to support zero-downtime rotation.
  • Estimated effort: M
  • Risk if ignored: A single leaked credential compromises the container orchestrator.
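
To make the suggested change concrete, here is a standard-library-only sketch of RS256 signing with a kid header and verification against a kid-indexed key set. It is an illustration under assumptions: the type and function names, the key ID "2026-04", and the subject value are hypothetical, and a production service would more likely use a maintained JWT library and load keys from SSM/KMS.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"strings"
	"time"
)

type jwtHeader struct {
	Alg string `json:"alg"`
	Kid string `json:"kid"` // tells the verifier which public key to use
}

type jwtClaims struct {
	Sub string `json:"sub"`
	Exp int64  `json:"exp"`
}

var b64 = base64.RawURLEncoding // JWT parts are unpadded base64url

// encodePart JSON-encodes v; marshalling these plain structs cannot fail.
func encodePart(v any) string {
	raw, _ := json.Marshal(v)
	return b64.EncodeToString(raw)
}

func decodePart(s string, v any) error {
	raw, err := b64.DecodeString(s)
	if err != nil {
		return err
	}
	return json.Unmarshal(raw, v)
}

// SignRS256 returns a compact JWT signed with key; kid names the key pair.
func SignRS256(key *rsa.PrivateKey, kid string, c jwtClaims) (string, error) {
	input := encodePart(jwtHeader{Alg: "RS256", Kid: kid}) + "." + encodePart(c)
	digest := sha256.Sum256([]byte(input))
	sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
	if err != nil {
		return "", err
	}
	return input + "." + b64.EncodeToString(sig), nil
}

// VerifyRS256 picks the public key by kid, then checks signature and expiry.
func VerifyRS256(token string, keys map[string]*rsa.PublicKey, now time.Time) (jwtClaims, error) {
	var h jwtHeader
	var c jwtClaims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return c, errors.New("malformed token")
	}
	if err := decodePart(parts[0], &h); err != nil {
		return c, err
	}
	pub, ok := keys[h.Kid]
	if !ok || h.Alg != "RS256" {
		return c, errors.New("unknown kid or unsupported alg")
	}
	sig, err := b64.DecodeString(parts[2])
	if err != nil {
		return c, err
	}
	digest := sha256.Sum256([]byte(parts[0] + "." + parts[1]))
	if err := rsa.VerifyPKCS1v15(pub, crypto.SHA256, digest[:], sig); err != nil {
		return c, err
	}
	if err := decodePart(parts[1], &c); err != nil {
		return c, err
	}
	if now.Unix() >= c.Exp {
		return c, errors.New("token expired")
	}
	return c, nil
}

// exampleRotation signs with key "2026-04" and verifies against two key sets.
func exampleRotation() (sub string, unknownKidRejected bool, err error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return "", false, err
	}
	now := time.Now()
	tok, err := SignRS256(key, "2026-04", jwtClaims{Sub: "container-client", Exp: now.Add(time.Hour).Unix()})
	if err != nil {
		return "", false, err
	}
	c, err := VerifyRS256(tok, map[string]*rsa.PublicKey{"2026-04": &key.PublicKey}, now)
	_, unknownErr := VerifyRS256(tok, map[string]*rsa.PublicKey{}, now) // key retired
	return c.Sub, unknownErr != nil, err
}
```

The verifier holds only public keys, so a leak on the control-plane side no longer lets an attacker mint tokens. Rotation then amounts to publishing a new kid, signing with it, and removing the old entry once outstanding tokens have expired.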

A.4.5 P2 — CORS allow-list accepts * by config

  • Problem: YAML shows CORS_ALLOWED_ORIGINS default *; middleware simply forwards the config (api/graphql/middleware/cors.go:26-31). In combination with Firebase Bearer auth there is no browser-side CSRF surface, but the signal is wrong and makes auditing harder.
  • Evidence: api/graphql/middleware/cors.go:26,31; YAML env_vars entry for CORS_ALLOWED_ORIGINS.
  • Suggested change: Enforce an explicit allow-list in non-local envs; reject * when ENVIRONMENT != local.
  • Estimated effort: S
  • Risk if ignored: Policy drift; a future cookie-based flow would instantly inherit the *.
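
A startup check along these lines would implement the suggested rule. The function name and the exact error messages are hypothetical; the environment values follow the ENVIRONMENT variable documented in this report.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

var errWildcardOrigin = errors.New("wildcard CORS origin is only allowed when ENVIRONMENT=local")

// ValidateCORSOrigins rejects "*" (and an empty list) outside the local
// environment and requires every other entry to be a scheme://host origin.
func ValidateCORSOrigins(environment string, origins []string) error {
	if environment != "local" && len(origins) == 0 {
		return errors.New("an explicit CORS allow-list is required outside local")
	}
	for _, origin := range origins {
		if origin == "*" {
			if environment != "local" {
				return errWildcardOrigin
			}
			continue
		}
		u, err := url.Parse(origin)
		if err != nil || u.Scheme == "" || u.Host == "" {
			return fmt.Errorf("invalid CORS origin %q", origin)
		}
	}
	return nil
}
```

Calling it from the same place that parses the other environment variables would make a misconfigured allow-list fail at startup instead of surfacing later in an audit.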

A.5 Things That Don't Make Sense

  1. Observation: refreshAIAgentState returns early if the state is StateAgent1Complete (formsubmission.go:403-405), but that is precisely the state from which ContinueAgent is expected to move forward. The function also spawns yet another detached goroutine to restart stale agents (formsubmission.go:419-430), which re-introduces the A.4.1 durability problem on a read path.

    • Location: service/application/formsubmission/formsubmission.go:399-436
    • Hypotheses considered: (a) intentional freeze so polling doesn't race with user VNC action; (b) performance-motivated early-return to avoid calling the control plane on every query.
    • Question for author: Is there a reason a read (formSubmission query) causes a side-effect goroutine that can restart remote agents? Should restart be an explicit mutation instead?
  2. Observation: cmd/api/graphql/main.go:99-101 leaves a commented-out globalWaitGroup.Wait() TODO for short-lived goroutines. This is the same concern as A.4.1 but filed as an aspirational comment.

    • Question for author: Was a wait-group attempted and abandoned, and if so, why?
  3. Observation: sub = "integration-team" and issuer = "saas-platform" in the AI control-plane JWT (jwt.go:14-15) look like left-over template values.

    • Question for author: Intentional or pre-refactor placeholder?

A.6 Anti-Patterns Detected

A.6.1 Code-level

  • God object / god function — FinalizeFormSubmission (see A.4.2)
  • Shotgun surgery
  • Feature envy
  • Primitive obsession — prevented by types/id/
  • Dead code
  • Copy-paste / duplication
  • Magic numbers / unexplained constants — aiAPITimeout, aiContainerPollInterval, aiAgentReadyPollInterval, aiRefreshInterval, aiContainerLifetime declared as a block but with no rationale comments (formsubmission.go:23-29)
  • Deep nesting (>3 levels)
  • Long parameter lists (>4) — NewService(timeSource, formService, analyticsService, storageService, survivorAnalyticsService, aiContainerService, aiAgentService) = 7 params (formsubmission.go:78-86); acceptable as a DI seam but crosses the threshold.
  • Boolean-flag parameters

A.6.2 Architectural

  • Big ball of mud
  • Distributed monolith
  • Chatty services
  • Leaky abstraction — refreshAIAgentState in the application layer concretely knows about container states and restart semantics (formsubmission.go:399-452); should live behind a single AILifecycle facade.
  • Golden hammer
  • Vendor lock-in without exit strategy
  • Stovepipe / reinvented wheel
  • Missing seams for testing (hard-coded clocks, network, filesystem) — time.After is used directly in three polling loops (formsubmission.go:462, 491, 513), so tests cannot advance the clock and must wait real seconds. timesource.TimeSource is injected for Now() but not for sleeps.
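
One way to add the missing seam is to pass the wait itself in as a function. The sketch below is not the repo's polling code; Sleeper, PollUntil, and the attempt limit are hypothetical names and parameters chosen for illustration.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// Sleeper abstracts waiting so tests can substitute a clock that never blocks.
type Sleeper func(ctx context.Context, d time.Duration) error

// RealSleep waits for d or returns early when ctx is cancelled.
func RealSleep(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// ErrPollExhausted is returned when check never reported completion.
var ErrPollExhausted = errors.New("poll: attempts exhausted")

// PollUntil calls check up to maxAttempts times, sleeping interval between
// attempts through the injected Sleeper.
func PollUntil(ctx context.Context, sleep Sleeper, interval time.Duration, maxAttempts int,
	check func(context.Context) (bool, error)) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		done, err := check(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if err := sleep(ctx, interval); err != nil {
			return err
		}
	}
	return ErrPollExhausted
}

// examplePoll drives PollUntil with a fake sleeper that only records the
// requested durations, so the "30 x 10s" loop finishes instantly.
func examplePoll() (calls int, slept time.Duration, err error) {
	fakeSleep := func(_ context.Context, d time.Duration) error {
		slept += d
		return nil
	}
	check := func(context.Context) (bool, error) {
		calls++
		return calls == 3, nil // ready on the third probe
	}
	err = PollUntil(context.Background(), fakeSleep, 10*time.Second, 30, check)
	return calls, slept, err
}
```

In examplePoll the check succeeds on the third call after two recorded sleeps of 10 seconds each, with no real time elapsed. Production code would pass RealSleep, which also makes the loop cancellable through its context.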

A.6.3 Data

  • God table
  • EAV abuse — the field-type XOR subtables (form_field_string_submission / _choice_submission / _file_submission) deliberately avoid EAV; good.
  • Missing indexes on hot queries — prefill scan (A.4.3) and households RO address lookup are both flagged in the YAML roadmap.
  • N+1 queries — prevented via dataloader v7.
  • Unbounded growth / no retention policy — account.deleted_at soft delete only, form_submission and form_field_*_submission have no retention (YAML §4 and §17 both say "retention TBD").
  • Nullable-everything schemas
  • Implicit coupling via shared database — exists between this service and af-targeting via households RO, but it is read-only by design.

A.6.4 Async / Ops

  • Poison messages with no dead-letter queue — detached goroutine has no DLQ (A.4.1, formsubmission.go:203-215).
  • Retry storms / no backoff: not detected. The three poll loops sleep at fixed intervals and are bounded by aiAPITimeout, so they cannot escalate into a storm.
  • Missing idempotency keys on non-idempotent ops — finalizeFormSubmission has an ERR_FORM_SUBMISSION_ALREADY_COMPLETED guard but no client-supplied idempotency key; a retry from the frontend during the analyticsService.UserUpdate call window (formsubmission.go:169) could double-post to analytics before the DB state transitions.
  • Hidden coupling via shared state — the in-process goroutine keeps implicit state about the submission; no other replica can take over.
  • Work queues without visibility / depth metrics — no queue exists (A.4.1).
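
The idempotency gap noted in this list can be illustrated with an in-memory store keyed by a client-supplied value. Everything here is hypothetical (type name, key format, result string); a real implementation would persist keys in PostgreSQL under a unique constraint so that all replicas share them.

```go
package main

import "sync"

// IdempotencyStore remembers the result of the first successful call per key.
type IdempotencyStore struct {
	mu      sync.Mutex
	results map[string]string
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{results: make(map[string]string)}
}

// Do runs fn at most once per key. Repeated calls with the same key return the
// stored result without invoking fn again. Failures are not stored, so the
// client may retry. Holding the lock while fn runs serialises all callers,
// which keeps the sketch simple but would be too coarse for production.
func (s *IdempotencyStore) Do(key string, fn func() (string, error)) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if res, ok := s.results[key]; ok {
		return res, nil
	}
	res, err := fn()
	if err != nil {
		return "", err
	}
	s.results[key] = res
	return res, nil
}

// exampleFinalizeRetry simulates a frontend retrying finalize with the same key.
func exampleFinalizeRetry() (analyticsCalls int, first, second string) {
	store := NewIdempotencyStore()
	finalize := func() (string, error) {
		analyticsCalls++ // stands in for the analytics call made during finalize
		return "finalized", nil
	}
	first, _ = store.Do("client-key-123", finalize)
	second, _ = store.Do("client-key-123", finalize)
	return analyticsCalls, first, second
}
```

In exampleFinalizeRetry the second call returns the stored result and the simulated analytics call runs only once.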

A.6.5 Security

  • Secrets in code, .env committed, or logs
  • Missing authn/z on internal endpoints — directives enforce it (@account, @strongAuth, @authorizeAPIKey).
  • Overbroad IAM roles — cannot verify without af-infra.
  • Unvalidated input crossing a trust boundary — go-playground/validator tags + custom form validators look comprehensive.
  • PII/PHI in logs or error messages — not observed directly, but error-wrapping uses fmt.Errorf("%w", err) liberally; some wrapped errors include user-supplied values that could carry PII. No explicit redaction layer in util/logger/handler.go. Cannot rule out without a log audit.
  • Missing CSRF/XSS/SQLi/SSRF — CORS * permitted (A.4.5). SQLi is well-guarded (pgx named params). SSRF: service/infrastructure/ai/agent/agent.go blindly issues HTTP against status.AgentAPIBase returned by the control plane — a compromised control plane can point this service at arbitrary internal URLs. Not necessarily a bug but worth noting.
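
For the SSRF note above, a host allow-list check before any request to the control-plane-returned URL could look like this. The function name and the example host are hypothetical. The sketch matches exact hosts rather than a *.cloudfront.net suffix, since a suffix rule would accept any CloudFront distribution, not only the platform's own.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ValidateAgentAPIBase accepts raw only if it is an https URL whose host is
// exactly one of allowedHosts (case-insensitive).
func ValidateAgentAPIBase(raw string, allowedHosts []string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse agent API base: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("agent API base must use https, got %q", u.Scheme)
	}
	// Hostname strips port and userinfo, so "https://allowed@evil.example" is
	// judged by "evil.example".
	host := strings.ToLower(u.Hostname())
	for _, allowed := range allowedHosts {
		if host == strings.ToLower(allowed) {
			return nil
		}
	}
	return fmt.Errorf("agent API host %q is not allow-listed", host)
}
```

With an allow-list of one host, a plain-http metadata address, a look-alike domain that merely starts with the allowed name, and a URL that hides another host behind userinfo are all rejected.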

A.6.6 Detected Instances

| # | Anti-pattern | Location (file:line) | Severity | Recommendation |
|---|---|---|---|---|
| 1 | God function (FinalizeFormSubmission) | service/application/formsubmission/formsubmission.go:146-221 | P1 | Split into 3 helpers (A.4.2) |
| 2 | Magic constants (no rationale) | service/application/formsubmission/formsubmission.go:23-29 | P2 | Add doc comments or link to runbook thresholds |
| 3 | Long parameter list (7 args) | service/application/formsubmission/formsubmission.go:78-86 | P2 | Introduce a Dependencies struct |
| 4 | Leaky abstraction (app layer knows container state) | service/application/formsubmission/formsubmission.go:399-452 | P1 | Extract AILifecycle service |
| 5 | Missing test seam on time.After | service/application/formsubmission/formsubmission.go:462,491,513 | P2 | Inject a sleeper or wrap via timesource |
| 6 | Missing index on prefill scan | domain/form/postgres/form.go:646-700 | P1 | Add index + DISTINCT ON (A.4.3) |
| 7 | Unbounded retention on submission PII | database/sql/migrations/003_form_updates.sql, 015_repeated_groups.sql | P1 | Define retention schedule + nightly purge (A.10, A.16) |
| 8 | Fire-and-forget AI job w/o DLQ | service/application/formsubmission/formsubmission.go:203-215 | P0 | Durable outbox (A.4.1) |
| 9 | Missing client-supplied idempotency key on finalize | service/application/formsubmission/formsubmission.go:147 | P2 | Accept Idempotency-Key header, store hash |
| 10 | Hidden coupling via in-process goroutine state | service/application/formsubmission/formsubmission.go:203-215, 419-430 | P1 | Move to external queue |
| 11 | CORS permits * | api/graphql/middleware/cors.go:26 + env default | P2 | Enforce allow-list in dev/stg/prod (A.4.5) |
| 12 | HS256 shared-secret JWT to AI control plane | service/infrastructure/ai/container/jwt.go:13-76 | P1 | Move to RS256/ES256 + kid (A.4.4) |
| 13 | Implicit SSRF surface via control-plane-returned URL | service/infrastructure/ai/agent/agent.go (per YAML §2) | P2 | Validate AgentAPIBase is in an allow-listed CloudFront domain |

A.7 Open Questions

  1. Q: Is there any plan for durable job handling of AI lifecycle (outbox, SQS, worker pool), or is the fire-and-forget pattern intentional?

    • Blocks: A.4.1, A.6.4
    • Who can answer: backend lead (Jiri / Marek)
  2. Q: What is the retention policy for form_submission and form_field_*_submission rows containing SSN/DOB? YAML says "TBD" throughout §4 and §17.

    • Blocks: A.6.3, A.10, A.16
    • Who can answer: product + legal
  3. Q: Is FIXED_COMPLEXITY_LIMIT=20 actually tuned against real client queries, or a placeholder?

    • Blocks: A.12
    • Who can answer: whoever maintains af-frontend query shapes
  4. Q: Does the AI control plane return agent URLs from a fixed domain (e.g., *.cloudfront.net tenant bucket) so an allow-list is enforceable?

    • Blocks: A.6.5 #13
    • Who can answer: af-disaster-assistance-gov-agent owner

A.8 Difficulties Encountered

  • Difficulty: 18,371-line api/graphql/graph/generated.go dominates grep results for resolver-related queries.

    • Impact on analysis: Harder to confirm which logic is generated vs. hand-written; relied on schema.resolvers.go plus domain layer.
    • Fix that would help next reviewer: .gitattributes linguist-generated=true + explicit mention in README.
  • Difficulty: No ability to run tests (make test-run) or observe metrics in this review session.

    • Impact on analysis: Coverage numbers, flake rate, p50/p99 latencies, queue depths — all marked "TBD" in A.2/A.13/A.14.
    • Fix: Publish a coverage badge + Prometheus/CloudWatch dashboard link in README.
  • Difficulty: af-infra Terraform not in this repo, so IAM scope, SSM parameter layout, and ALB/CloudFront policies could not be verified.

    • Impact on analysis: STRIDE gaps below (A.11) had to be inferred from config.

A.9 Risks & Unknowns

A.9.1 Known risks

| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | AI job lost on pod restart | M | H | A.4.1 (durable outbox) |
| 2 | Prefill tail-latency under load | M | M | A.4.3 (index / denormalization) |
| 3 | HS256 shared secret leak | L | H | A.4.4 (asymmetric keys) |
| 4 | PII retention non-compliance (FEMA Privacy Act, GDPR) | M | H | Retention schedule + purge job |
| 5 | Double-finalize during transient frontend retry | L | M | Idempotency key |
| 6 | Control-plane URL redirection (SSRF-adjacent) | L | M | AgentAPIBase allow-list |

A.9.2 Unknown unknowns

  • Area not reviewed: Plaid flow in detail. Best guess at risk level: L — the pattern is textbook (no persistence, one-shot exchange), but I did not walk service/infrastructure/plaid/plaid.go line by line.
  • Area not reviewed: Admin-only mutations and the @authorizeAPIKey directive implementation end-to-end. M — admin endpoints share the same route, so an auth regression would be high-blast-radius.
  • Area not reviewed: The 971-line domain/form/postgres/form.go; only skimmed for prefill. M — complex hand-written SQL.
  • Area not reviewed: util/logger/handler_dev.go and how structured logs are emitted in prod — PII scrubbing not verified. M.
  • Area not reviewed: CI workflow contents (build.yaml, tests.yaml, vuln-scan.yaml) — existence confirmed, content not read. L.

A.10 Technical Debt Register

| # | Debt item | Quadrant | Estimated interest | Remediation |
|---|---|---|---|---|
| 1 | Fire-and-forget AI goroutine, no durability | Reckless & Inadvertent | High: invisible failures of the flagship feature | A.4.1 |
| 2 | FinalizeFormSubmission oversize (author-nolinted) | Prudent & Deliberate | Medium: slows future changes | A.4.2 |
| 3 | Prefill O(n) scan | Prudent & Deliberate (roadmap acknowledges) | Medium: grows with account lifetime | A.4.3 |
| 4 | Polling-based AI state (no subscriptions) | Prudent & Deliberate | Low–Medium: extra DB QPS + UX latency | GraphQL subscriptions (YAML §19) |
| 5 | HS256 shared secret for AI JWT | Reckless & Inadvertent | Low prob, High blast | A.4.4 |
| 6 | No app-level audit log (who submitted / accessed what PII) | Prudent & Deliberate | Medium: compliance risk | Dedicated audit table + async writer |
| 7 | Retention policy undefined for PII tables | Reckless & Inadvertent | Medium: regulatory | Retention schedule + purge |
| 8 | CORS allow-list defaults to * | Prudent & Inadvertent | Low | A.4.5 |
| 9 | time.After prevents time-travel tests | Prudent & Inadvertent | Low: slows test suite | Inject sleeper |
| 10 | generated.go not marked linguist-generated | Prudent & Inadvertent | Very Low | Add .gitattributes |

A.11 Security Posture (lightweight STRIDE)

| Category | Threat present? | Mitigated? | Gap |
|---|---|---|---|
| Spoofing (identity) | Yes (Firebase tokens, admin key) | Mostly (auth.go:42-56, @strongAuth freshness) | Admin key in header; no mTLS between services |
| Tampering (integrity) | Yes | Partially | SQL parameterized (pgx named params); no row-level integrity hashes; JWT HS256 shared secret (A.4.4) |
| Repudiation | Yes (survivor actions on FEMA behalf) | No | No app-level audit log (A.10 #6) |
| Information Disclosure | Yes (PII everywhere in submissions) | Partially | TLS + soft-delete; no field-level encryption; PII in wrapped error strings not audited |
| Denial of Service | Yes | Partially | FIXED_COMPLEXITY_LIMIT=20, 4 MiB body limit (bodylimit.go:12), 25 MiB file cap; no per-account rate limiter; poll loops cannot be cancelled |
| Elevation of Privilege | Yes (admin directive) | Yes on surface | Cannot verify IAM scope of ECS task role |

A.12 Operational Readiness

| Capability | Present / Partial / Missing | Notes |
|---|---|---|
| Structured logs | Present | util/logger/handler.go slog-based |
| Metrics | Missing (unverified) | No Prometheus/OTel wiring seen in cmd/api/graphql/main.go |
| Distributed tracing | Missing (unverified) | No OTel SDK import visible |
| Actionable alerts | Partial | Runbook lists ECS task health + deployment-badge; thresholds "TBD" in YAML §14 |
| Runbooks | Present | data/af-backend-go-api/runbook.md |
| On-call ownership defined | Present | CODEOWNERS at repo root |
| SLOs / SLIs | Missing | YAML §12 all "TBD" |
| Backup & restore tested | Unknown | RDS managed, not verified |
| Disaster recovery plan | Unknown | Cross-repo concern |
| Chaos / failure testing | Missing | No evidence |

A.13 Test & Quality Signals

  • Coverage (line / branch): Not published (YAML §15 coverage_pct: null). File counts: 35 _test.go vs 124 source .go ≈ 28% file ratio — low but concentrated on the right places (domain/form, AI agent, JWT, auth middleware).
  • Trend: Unknown.
  • Flake rate: Unknown.
  • Slowest tests: Likely the three pollFor… loops if tested directly, since they use real time.After (A.6.2).
  • Untested critical paths: End-to-end AI lifecycle (no integration test walking finalize→poll→continue), admin API-key directive, FinalizeFormSubmission happy path (test file exists at service/application/formsubmission/formsubmission_test.go but cannot assess depth without running).
  • Missing test types: e2e (no dedicated repo-level e2e), contract (no schema-diff/contract test between schema.graphqls and frontend), load, and security/fuzz. Unit tests are present; integration tests are partial (LocalStack + emulator per YAML).

A.14 Performance & Cost Smells

  • Hot paths: form, formSubmission, formPrefill, actualHouseholdInfo queries; submitForm + finalizeFormSubmission mutations.
  • Suspected bottlenecks: (1) ReadFormPrefill full-scan (domain/form/postgres/form.go:646); (2) households RO address matching without documented indexes (YAML §12); (3) three 5-minute poll loops hold goroutines + DB connections per in-flight AI job (formsubmission.go:454-530).
  • Wasteful queries / loops: Poll loops write an AI-state update on every tick (formsubmission.go:469, 520) even when state has not changed — unnecessary row writes at ~10s cadence.
  • Oversized infra / idle resources: N/A — not reviewed.
  • Cache hit/miss surprises: No cache layer; YAML §19 lists "cache hot reads (ListActiveDisasters, form definitions)" as roadmap. Today every activeDisasters query hits Postgres.
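
The redundant per-tick writes noted above could be avoided by wrapping the state writer so that it persists only on change. StateWriter and ChangeOnlyWriter are hypothetical names; the real repository call and state values are not shown in this report.

```go
package main

// StateWriter persists an AI state for a submission; it is a hypothetical
// stand-in for the repository call made inside the poll loops.
type StateWriter func(submissionID, state string) error

// ChangeOnlyWriter wraps w so that writing the same state twice in a row for
// the same submission results in a single call to w. It is not safe for
// concurrent use; the intent is one wrapper per poll loop.
func ChangeOnlyWriter(w StateWriter) StateWriter {
	last := make(map[string]string)
	return func(submissionID, state string) error {
		if prev, ok := last[submissionID]; ok && prev == state {
			return nil // unchanged since the previous tick, skip the write
		}
		if err := w(submissionID, state); err != nil {
			return err // not recorded, so the next tick retries the write
		}
		last[submissionID] = state
		return nil
	}
}

// exampleTicks simulates five poll ticks that observe only two distinct states.
func exampleTicks() (writes int) {
	write := ChangeOnlyWriter(func(string, string) error {
		writes++
		return nil
	})
	for _, state := range []string{"RUNNING", "RUNNING", "RUNNING", "DONE", "DONE"} {
		_ = write("submission-1", state)
	}
	return writes
}
```

In exampleTicks five ticks produce two writes, one per distinct state.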

A.15 Bus-Factor & Knowledge Risk

  • Who is the only person who understands X? Cannot infer from code alone; YAML lists three authors (Siroky, Burda, Bafrnec).
  • What breaks if they disappear tomorrow? The AI state machine (public vs internal mapping in formsubmission.go:546-586) and the refreshAIAgentState semantics (A.5 #1) are the highest-knowledge-density section; no comments explaining why each early-return exists.
  • What is undocumented tribal knowledge? The reason the VNC endpoint validity is 1h while the container lifetime is 48h, and the interplay with aiRefreshInterval's 2-min cooldown.
  • Suggested knowledge-transfer actions: Add a state-machine diagram in service/application/formsubmission/ as a README, ideally with decision rationale embedded as comments on the state-mapping functions.

A.16 Compliance Gaps

| Regulation | Requirement | Status | Gap | Remediation |
|---|---|---|---|---|
| FEMA Privacy Act | Defined retention + purge of survivor PII | Partial | Soft-delete only; no purge job; retention "TBD" (YAML §4) | Define schedule + nightly purge |
| GDPR / CCPA | Data export + hard delete on DSAR | Missing | YAML §19 acknowledges no export endpoint | Build DSAR pipeline |
| Login.gov IAL2/AAL2 | Performed by agent in VNC | OK | Verification lives outside this service | Document trust boundary in runbook |
| NIST 800-53 (audit) | Application-level audit log | Missing | CloudTrail only; no per-user data-access trail | Add audit log table (A.10 #6) |
| Plaid Data Use Policy | Do not persist credentials | OK | Confirmed via service/infrastructure/plaid/plaid.go wrapping (YAML §17) | |

A.17 Recommendations Summary

| Priority | Action | Owner (suggested) | Effort | Depends on |
|---|---|---|---|---|
| P0 | Durable AI-job outbox + startup reaper + depth/age metrics (A.4.1, A.6.4, A.10 #1) | backend-go-api maintainers | M | af-infra (SQS or table migration) |
| P0 | Define and enforce PII retention schedule (A.10 #7, A.16 FEMA/GDPR) | product + legal + backend | M | legal sign-off |
| P1 | Split FinalizeFormSubmission; extract AILifecycle facade (A.4.2, A.6.2, A.6.6 #1,#4) | backend | S | |
| P1 | Index + DISTINCT ON (or denorm) for prefill (A.4.3, A.6.3) | backend | S | |
| P1 | Replace HS256 shared-secret AI JWT with RS256 + kid rotation (A.4.4, A.6.5) | backend + af-infra | M | control-plane team |
| P1 | Add app-level audit log table + async writer (A.10 #6, A.16 NIST) | backend | M | |
| P1 | Idempotency key on finalizeFormSubmission (A.6.4 #3) | backend + af-frontend | S | frontend coordination |
| P2 | CORS allow-list enforced per env (A.4.5) | backend | S | |
| P2 | Inject sleeper/ticker to make poll loops time-travel testable (A.6.2) | backend | S | |
| P2 | AgentAPIBase domain allow-list (SSRF hygiene, A.6.6 #13) | backend | S | control-plane team |
| P2 | Add .gitattributes linguist-generated=true for generated.go (A.10 #10) | backend | S | |
| P2 | Cache hot reads: activeDisasters, form definitions (YAML §19) | backend | S | |
| P2 | Publish coverage badge + metrics dashboard link in README (A.8, A.12) | backend | S | |

Environment variables

| Name | Purpose |
|---|---|
| ENVIRONMENT* | One of local, dev, stg, prod |
| PORT* | HTTP listen port |
| DATABASE_HOST* | Main DB host |
| DATABASE_PORT* | Main DB port |
| DATABASE_USERNAME* | Main DB user |
| DATABASE_DB_NAME* | Main DB name |
| DATABASE_HOUSEHOLDS_RO_HOST* | Households RO host |
| DATABASE_HOUSEHOLDS_RO_PORT* | Households RO port |
| FIREBASE_CREDENTIALS* | Firebase service-account JSON |
| FIREBASE_AUTH_EMULATOR_HOST | Dev-only emulator override |
| ADMIN_API_KEY* | Header value for @authorizeAPIKey directive |
| MAPBOX_USERNAME* | Mapbox tileset namespace |
| AI_CONTAINER_API_HOST* | AI control-plane base URL |
| AI_JWT_SECRET* | HS256 secret for control-plane JWT |
| ANALYTICS_API_HOST* | User Update Service base URL (af-targeting FastAPI; .env.common defaults to :8090, the af-targeting docker-compose canonical port is :8080; local dev typically port-maps 8090→8080) |
| S3_BUCKET_FILE_SUBMISSION_NAME* | S3 bucket for form files |
| S3_ENDPOINT_URL | LocalStack S3 override (dev) |
| PLAID_CLIENT_ID* | Plaid client ID |
| PLAID_SECRET* | Plaid secret |
| PLAID_ENVIRONMENT* | One of sandbox, production |
| MAX_FILE_SUBMISSION_SIZE_MIB* | Max upload size |
| ALLOWED_FILE_SUBMISSION_TYPES* | MIME whitelist |
| FIXED_COMPLEXITY_LIMIT* | GraphQL complexity cap |
| STRONG_AUTH_MAX_AGE* | @strongAuth freshness |
| LOG_LEVEL | slog level |
| CORS_ALLOWED_ORIGINS | CORS list |