AidFinder

af-fema-real-ai-agent

FEMA Real AI Agent

Sister of af-disaster-assistance-gov-agent. Same Flask state machine + 2-agent split, but targets the REAL https://www.disasterassistance.gov/ with Firefox ESR and manual user Login.gov 2FA. Includes AWS CDK + Replit integration.

Domain role: AI agent runtime (real gov target)
Last updated: 2026-03-18
API style: REST

Same Dockerfile shape as the sister repo: computer-use Ubuntu 22.04 + Node.js 20 + Claude Code CLI v2.1.45 + Flask 3.0 + mcp-vnc MCP server + Firefox ESR with an EFS-backed profile.

The Flask state machine has 9 production states, the same shape as the sister: IDLE → SETTING_UP → DATA_SET → AGENT1_RUNNING → (AGENT1_COMPLETE | AGENT1_ERROR) → AGENT2_RUNNING → (AGENT2_COMPLETE | AGENT2_ERROR).

The mock FEMA HTTP server and Chrome have both been removed; Firefox ESR is the only browser. The Agent 1 script navigates to the live https://www.disasterassistance.gov/ and stops at the Login.gov page; the user completes 2FA inside the noVNC session; the user (or operator) then calls /agent2/run.

The repo includes aws_deployment/ CDK stacks (ECS, API Gateway, CloudFront, Secrets, Storage, Observability) and aws_demo_deploy/REPLIT_DEPLOY.md for Replit iframe integration. The /survivor-info/p1 and /survivor-info/p2 split endpoints are present; earlier documentation that described them as removed was incorrect.
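The 9-state machine described above can be sketched as a transition table. This is an illustration only, not the contents of state.py: the state names come from the description, while the exact set of allowed edges (in particular the restart edges out of the error states and the reset from SETTING_UP to IDLE) is an assumption.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    SETTING_UP = "SETTING_UP"
    DATA_SET = "DATA_SET"
    AGENT1_RUNNING = "AGENT1_RUNNING"
    AGENT1_COMPLETE = "AGENT1_COMPLETE"
    AGENT1_ERROR = "AGENT1_ERROR"
    AGENT2_RUNNING = "AGENT2_RUNNING"
    AGENT2_COMPLETE = "AGENT2_COMPLETE"
    AGENT2_ERROR = "AGENT2_ERROR"


# ASSUMPTION: illustrative edges; the real state.py may allow more or fewer.
ALLOWED = {
    State.IDLE: {State.SETTING_UP},
    State.SETTING_UP: {State.DATA_SET, State.IDLE},
    State.DATA_SET: {State.AGENT1_RUNNING},
    State.AGENT1_RUNNING: {State.AGENT1_COMPLETE, State.AGENT1_ERROR},
    State.AGENT1_COMPLETE: {State.AGENT2_RUNNING},  # after manual Login.gov 2FA
    State.AGENT1_ERROR: {State.AGENT1_RUNNING},     # restart edge (assumed)
    State.AGENT2_RUNNING: {State.AGENT2_COMPLETE, State.AGENT2_ERROR},
    State.AGENT2_COMPLETE: set(),
    State.AGENT2_ERROR: {State.AGENT2_RUNNING},     # restart edge (assumed)
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the edge is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

A table like this makes the "Agent 2 only after Agent 1 completes" rule checkable in one place: transition(State.DATA_SET, State.AGENT2_RUNNING) raises instead of silently skipping the Login.gov handoff.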

Role in the system: Production sibling of af-disaster-assistance-gov-agent — same Flask state machine, same per-survivor container, but with mock infrastructure removed and Firefox ESR as the only browser. Includes AWS CDK + Replit integration.

Surfaces:

  • Flask data-plane :5001 (same 8 endpoints as sister)
  • noVNC :6080 (vnc_embed.html for iframe)
  • VNC :5900
  • Firefox ESR with persisted profile in EFS
  • Claude Code CLI agent runtime
  • AWS CDK stacks under aws_deployment/
  • Replit integration docs (aws_demo_deploy/REPLIT_DEPLOY.md)

User workflows

  • Build + start

    /health = ok

  • Submit survivor

    Container ready

  • Run Agent 1

    State → AGENT1_COMPLETE; manual user step required

  • Manual Login.gov

    User authenticated

  • Run Agent 2

    State → AGENT2_COMPLETE

API endpoints

  • GET /health: Liveness
  • POST /agent_health: LLM smoke test
  • GET /state: Current orchestration state
  • POST /survivor-info: Submit full survivor JSON; generate scripts; launch Firefox to real FEMA URL
  • POST /survivor-info/p1: Partial submission (pages 1-N before Login.gov)
  • POST /survivor-info/p2: Partial submission (post-login pages)
  • GET /agent1/status_detailed: Verbose Agent 1 status (logs, sentinels, retry counts)
  • GET /agent2/status_detailed: Verbose Agent 2 status
  • POST /agent1/restart: Force-restart Agent 1 subprocess
  • POST /agent2/restart: Force-restart Agent 2 subprocess
  • POST /test/test_kup, /test/test_kup2, /test/force-status: Test-only endpoints for state-machine harness
  • POST /agent1/run: Spawn Agent 1 subprocess
  • GET /agent1/status: Poll Agent 1
  • POST /agent2/run: Spawn Agent 2 (only after manual login)
  • GET /agent2/status: Poll Agent 2
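A caller typically drives these endpoints by polling until an agent reaches a terminal state. The sketch below shows one way to do that with the standard library. The response shape is an assumption: the review does not document the JSON body of /state, so the "state" key, the helper names, and the polling defaults are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:5001"  # Flask data-plane port listed under Surfaces


def http_get_json(path):
    """GET a JSON body from the data plane."""
    with urllib.request.urlopen(BASE_URL + path, timeout=10) as resp:
        return json.load(resp)


def wait_for_state(targets, fetch=http_get_json, path="/state",
                   interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll until the reported state is one of `targets`, then return it."""
    for _ in range(max_polls):
        body = fetch(path)
        state = body.get("state")  # ASSUMPTION: key name in the JSON body
        if state in targets:
            return state
        sleep(interval)
    raise TimeoutError(f"state never reached any of {sorted(targets)}")
```

In the workflow above, a client would POST /agent1/run, call wait_for_state({"AGENT1_COMPLETE", "AGENT1_ERROR"}), hand control to the user for Login.gov 2FA, and only then POST /agent2/run. The fetch and sleep parameters are injectable so the loop can be exercised without a running container.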

Third-party APIs

  • Anthropic Bedrock (or direct Claude API)

    LLM backend for Claude Code CLI

  • https://www.disasterassistance.gov/

    REAL FEMA application portal (target)

  • Login.gov

    2FA between Agent 1 and Agent 2

Service dependencies

  • AWS ECS Fargate

    Per-survivor task launch

  • AWS EFS

    Firefox profile + work dir

  • AWS Secrets Manager

    Bedrock + JWT + VNC keys

  • AWS API Gateway + Lambda (control plane)

    Container manager (start/stop/list)

  • AWS CloudFront

    Origin restriction + signed URLs (mTLS optional)

  • AWS DynamoDB

    Container state + audit

  • Replit (frontend host)

    Embeds noVNC iframe + calls control plane

  • af-backend-go-api

    Alternative control-plane caller (production path)

Analysis

Overall health: 3.4 / 5 (acceptable)

| Score | Dimension |
|---|---|
| 4 | Module overview / clarity of intent |
| 3 | External dependencies |
| 3 | API endpoints |
| 3 | Database schema |
| 4 | Backend services |
| 3 | WebSocket / real-time |
| 3 | Frontend components |
| 4 | Data flow clarity |
| 4 | Error handling & resilience |
| 3 | Configuration |
| 3 | Data refresh patterns |
| 3 | Performance |
| 4 | Module interactions |
| 4 | Troubleshooting / runbooks |
| 4 | Testing & QA |
| 3 | Deployment & DevOps |
| 2 | Security & compliance |
| 4 | Documentation & maintenance |
| 3 | Roadmap clarity |

af-fema-real-ai-agent — Prop-Build Analysis

  • Document Type: Critical Review & Analysis (companion to prop-build-template.md)
  • Scope: Per-Repo / Per-Module
  • Subject: af-fema-real-ai-agent (FEMA Real AI Agent: per-survivor Claude-computer-use container driving production DisasterAssistance.gov)
  • Reviewer(s): Claude (automated code review)
  • Date: 2026-04-09
  • Version: 0.1
  • Confidence Level: Medium
  • What would raise confidence: running the container locally, observing a full Agent 1 → Login.gov → Agent 2 run, access to CloudWatch logs + DynamoDB audit events, an interview with Gordon, and a review of the container_work_dir/fema-apply-agent/agent1.md + agent2.md prompts and the CDK stacks under aws_deployment/.

Inputs Reviewed:

  • Prop-build doc: /Users/andres/src/af/af-analysis/data/af-fema-real-ai-agent.yaml
  • Companion docs: api-examples.md, data-flow.md, runbook.md, deployment.md
  • Source: /Users/andres/src/af/af-fema-real-ai-agent/fema_agent/src/web_service/{app.py,state.py,agent_runner.py} plus tree listing of survivor_api/, raw_builder/, intermediate_builder/, aws_deployment/, specs/021-mock-to-real-fema/.
  • Not executed; no prod metrics; agent prompt files (agent1.md/agent2.md) not inspected line-by-line.

Part A — Per-Repo / Per-Module Analysis

A.1 Executive Summary

  • Overall health: Functional, reasonably well-factored Python/Flask orchestrator around Claude Code CLI driving a real federal benefits site, but security posture is weak for something that handles FEMA IA PII and a live Login.gov session.
  • Top risk: No application-layer authentication on any Flask endpoint (app.py:110-564) combined with PII-laden survivor JSON flowing into prompt text that Claude then executes against disasterassistance.gov — a classic prompt-injection + PII-in-prompt surface. See A.6.5 and A.11.
  • Top win / thing worth preserving: Surgical Firefox profile lock cleanup that preserves cookies (app.py:42-84) and the resume-first / fresh-prompt retry escalation in agent_runner.restart_agent (agent_runner.py:164-184) — both are thoughtful, failure-mode-driven code worth propagating to the sister repo.
  • Single recommended next action: Put authenticated, signed-URL-only access in front of every Flask route (not just at CloudFront) and add an explicit survivor-data sanitizer before values are interpolated into as_agent1.txt / as_agent2.txt.
  • Blocking unknowns: The actual content of agent1.md / agent2.md (the prompts Claude receives) was not read; CDK auth/network posture was taken from the YAML summary only; no coverage/flake data.

A.2 Health Scorecard

| # | Dimension | Score (1–5) | Justification |
|---|---|---|---|
| 1 | Module overview / clarity of intent | 4 | README + YAML + specs/021-mock-to-real-fema/ make the purpose and Agent1/Agent2 split unambiguous. |
| 2 | External dependencies | 3 | Hard dependency on Bedrock, Login.gov, live FEMA site; no abstraction to swap providers even though LLM_PROVIDER env hints at it. |
| 3 | API endpoints | 3 | Clean REST shape and error handlers (app.py:92-106), but ~16 routes all unauthenticated at app layer; test endpoints (/test/test_kup*, /test/force-status) live in the same app (app.py:624,699,796 per YAML). |
| 4 | Database schema | 3 | DynamoDB tables (ContainerInstance, AuditEvent) per YAML; not inspected directly; in-container state is an in-memory dict in state.py, which is fine for a per-survivor lifetime but opaque. |
| 5 | Backend services | 4 | Clear separation: survivor_api/ (parse), raw_builder/ and intermediate_builder/ (script gen), web_service/agent_runner.py (process mgmt), state.py (FSM). |
| 6 | WebSocket / real-time | 3 | noVNC is stock; only real-time surface; 30s heartbeat is pragmatic. |
| 7 | Frontend components | 3 | No real FE in this repo beyond vnc_embed.html; scope is narrow and appropriate. |
| 8 | Data flow clarity | 4 | The 9-state FSM and sentinel-file protocol are explicit; companion data-flow.md traces it. |
| 9 | Error handling & resilience | 4 | check_inactivity_and_kill (agent_runner.py:241-283), SIGTERM→SIGKILL escalation (:210-238), stale-detection + retry count, resume-first strategy are all solid. |
| 10 | Configuration | 3 | Env vars documented; but WORK_DIR defaulted twice (app.py:26, agent_runner.py:16) and subprocess.run(..., env=os.environ.copy()) in /agent_health (app.py:146) passes the entire host env to the LLM CLI. |
| 11 | Data refresh patterns | 3 | EFS-backed Firefox profile persists across runs, with intentional lock cleanup; acceptable. |
| 12 | Performance | 3 | Per-section timeouts (100s A1, 300s A2) and 600s inactivity kill (agent_runner.py:19) are reasonable; no metrics to verify. |
| 13 | Module interactions | 4 | Clean boundary between Flask plane and Claude CLI subprocess. |
| 14 | Troubleshooting / runbooks | 4 | runbook.md + RUN_BOOK.md + inline docstrings cover stale locks, DOM drift, Login.gov expiry. |
| 15 | Testing & QA | 4 | YAML cites 874 passing tests, tests tree includes unit/integration/contract; test count is a strong signal even without coverage numbers. |
| 16 | Deployment & DevOps | 3 | CDK stacks present (not read); Dockerfile clean; no evidence of staged rollouts or canaries. |
| 17 | Security & compliance | 2 | Unauth Flask routes, shell=True subprocess with f-string interpolation, PII flowing into LLM prompts + EFS profile, broad env passthrough to LLM CLI. See A.6.5, A.11. |
| 18 | Documentation & maintenance | 4 | YAML + specs + companion md files are unusually thorough for a WIP-feeling repo. |
| 19 | Roadmap clarity | 3 | specs/021-mock-to-real-fema/ and 022-secure-demo-deploy/ suggest direction but no explicit roadmap doc was reviewed. |

Overall score: 3.37 average (64 points across 19 rows), which rounds to the 3.4 shown on the dashboard. Weighted reading: operationally thoughtful, but the security rating (2) is a load-bearing concern that should pull the effective score down for any prod-readiness decision.


A.3 What's Working Well

  • Strength: Surgical Firefox profile lock cleanup that deletes only lock and .parentlock and explicitly preserves cookies/storage.

    • Location: fema_agent/src/web_service/app.py:42-84
    • Why it works: The docstring names the exact failure mode (new container IP vs. old lock) and the code walks only one level so it cannot accidentally nuke profile state. This is the kind of narrow, well-justified hack that saves incidents.
    • Propagate to: af-disaster-assistance-gov-agent (sister repo) if not already there.
  • Strength: Resume-first retry strategy with graceful fallback to fresh-prompt.

    • Location: fema_agent/src/web_service/agent_runner.py:164-184 (restart_agent) + :120-161 (run_agent_resume, run_agent_fresh_with_resume).
    • Why it works: Encodes real knowledge (--continue is broken with --mcp-config, so --resume <sessionId> is used) and escalates to a fresh prompt that self-skips completed pages using temp/ logs. Debt-aware, not debt-blind.
    • Propagate to: Any other Claude Code CLI orchestrator in the org.
  • Strength: SIGTERM → grace → SIGKILL process-group termination.

    • Location: agent_runner.py:210-238 (kill_process_group) + :241-283 (check_inactivity_and_kill).
    • Why it works: start_new_session=True at spawn (:51, :137, :159) plus os.killpg ensures the shell wrapper and the claude-full child both die; inactivity is measured from max(start_time, latest_mtime) to avoid false-positive kills from stale temp/ files.
    • Propagate to: Sister repo and any future agent runner.
  • Strength: Explicit P1/P2 cross-validation with county match.

    • Location: app.py:359-370.
    • Why it works: Catches a whole class of "wrong disaster" data-entry errors before they hit Claude and the real gov site.
    • Propagate to: Any other form-automation agent.
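The SIGTERM → grace → SIGKILL pattern and the inactivity rule described above can be sketched as follows. This is a minimal POSIX-only illustration of the technique, not the code in agent_runner.py: the function signatures and the 5-second grace period are assumptions, while the 600-second inactivity limit and the max(start_time, latest_mtime) rule come from the review.

```python
import os
import signal
import subprocess


def kill_process_group(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Terminate proc's whole process group: SIGTERM, wait, then SIGKILL.

    Assumes proc was spawned with start_new_session=True, so its process
    group contains the shell wrapper and every child it started.
    """
    if proc.poll() is not None:
        return proc.returncode  # already exited, nothing to signal
    try:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()  # group vanished between poll() and the signal
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # grace period over: force kill
        return proc.wait()


def is_inactive(start_time: float, latest_mtime, now: float,
                limit_seconds: float = 600.0) -> bool:
    """Measure idleness from the later of spawn time and newest temp/ mtime,
    so files left over from an earlier run cannot trigger a false kill."""
    last_activity = max(start_time, latest_mtime or 0.0)
    return now - last_activity > limit_seconds
```

Signalling the group rather than the single PID is what keeps an orphaned child from outliving its wrapper and holding the Firefox profile open.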

A.4 What to Improve

A.4.1 P0 — Unauthenticated Flask data plane handling PII

  • Problem: Every route on the Flask app is defined without an @requires_auth wrapper. /survivor-info, /survivor-info/p1, /survivor-info/p2, /agent1/run, /agent2/run, /agent*/restart, and the /test/* endpoints all accept anonymous POSTs. The YAML argues "network isolation + CloudFront signed URLs" is sufficient, but nothing in this repo enforces that — any workload that lands in the VPC (misconfigured SG, sidecar compromise, SSRF from another service, or a misrouted ALB) can POST survivor PII directly.
  • Evidence: fema_agent/src/web_service/app.py:110, 114, 214, 239, 333, 418, 460, 560, 564 (and /test/* at :624,:699,:796 per YAML A.3). No before_request auth hook; no abort(401) anywhere in app.py.
  • Suggested change: Add a before_request that validates an HS256 JWT (the same AI_JWT_SECRET already used at the API Gateway layer per YAML §3.3) on every non-/health route; move test-only endpoints behind an env flag that is false in prod images.
  • Estimated effort: S
  • Risk if ignored: PII exfiltration; unauthorized hijack of a container mid-session after the user has completed Login.gov 2FA (attacker can ride the authenticated Firefox profile).

A.4.2 P0 — shell=True subprocess spawns with f-string interpolation

  • Problem: Three functions build command strings via f-string and pass them to subprocess.Popen(..., shell=True): run_agent, run_agent_resume, run_agent_fresh_with_resume. session_id and agent_file are interpolated into the shell string. Today session_id is a server-generated uuid.uuid4() (app.py:425, 471) and agent_file is a hard-coded constant, so there is no current injection path — but the pattern is fragile: any future refactor that lets a caller pass a session_id (e.g. for test harnesses, or resuming from DynamoDB) instantly becomes RCE-as-root-in-container.
  • Evidence: fema_agent/src/web_service/agent_runner.py:40-52, 127-139, 148-161.
  • Suggested change: Replace with list form: subprocess.Popen(["claude-full", "--session-id", session_id, "-p", prompt], shell=False, ...). Drop shell=True entirely.
  • Estimated effort: S
  • Risk if ignored: Latent command injection one refactor away; also makes it hard to reason about quoting of any prompt that ever contains a " or $.
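The list-form replacement can be sketched as below. The argv layout follows the suggested change above; the helper names and the strict UUID check are assumptions added for illustration, and any other flags the real command line carries (for example the MCP config) are omitted.

```python
import subprocess
import uuid
from typing import List


def build_agent_argv(session_id: str, prompt: str,
                     binary: str = "claude-full") -> List[str]:
    """Build an argv list; no shell ever parses these strings."""
    # Defense in depth: accept only a canonical UUID, e.g. str(uuid.uuid4()).
    if str(uuid.UUID(session_id)) != session_id:
        raise ValueError("session_id must be a canonical UUID string")
    return [binary, "--session-id", session_id, "-p", prompt]


def spawn_agent(argv: List[str], **popen_kwargs) -> subprocess.Popen:
    """Spawn without a shell, in a new session so the group can be killed."""
    return subprocess.Popen(
        argv,
        shell=False,
        start_new_session=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        **popen_kwargs,
    )
```

With an argv list, each element reaches the child as exactly one argument, so a prompt containing quotes, semicolons, or $VARS is passed through literally instead of being interpreted by /bin/sh.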

A.4.3 P1 — Survivor PII is interpolated into LLM prompts with no sanitization layer

  • Problem: SurvivorFemaApplication.generate_agent1_script_from_p1 / generate_agent2_script (called at app.py:296, 395) take attacker-controllable JSON and produce as_agent1.txt / as_agent2.txt files that Claude Code then reads verbatim and executes against the real gov site. There is no evidence of (a) stripping control tokens, (b) length caps, (c) detection of prompt-injection strings like "ignore previous instructions", or (d) a deny-list for URLs/domains. The entire survivor JSON is also kept in state.py in-memory and persisted to EFS as the script files. PII (address, disability info, income, household/deceased members) passes through Bedrock as prompt text.
  • Evidence: app.py:295-296, 387-395 (generator invocation); YAML §4 enumerates the PII fields.
  • Suggested change: Introduce a survivor_sanitizer module that (1) validates every field against a tight allow-list regex, (2) caps each field length, (3) rejects strings containing obvious injection markers, (4) logs a redacted copy only; redact PII from any stderr/stdout tails returned by /agent_health and /agent*/status (app.py:154-167, agent_runner.py:302-313). Also document the DPA/BAA status of the Bedrock route since PII leaves the VPC.
  • Estimated effort: M
  • Risk if ignored: Prompt injection drives Claude to submit bogus data to the real FEMA site, exfiltrate cookies, or click malicious links inside Firefox. PII exposure to any log sink that captures stdout/stderr.
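A sanitizer along the lines suggested above might look like the following sketch. Everything specific in it is hypothetical: the field names, patterns, length caps, and marker list are placeholders, since the real survivor schema lives in survivor_api/ and was not reviewed. A deny-list of injection phrases is also only a partial defense and is best paired with prompt design that quarantines user data.

```python
import re

# HYPOTHETICAL rules: field name -> (allow-list pattern, max length).
FIELD_RULES = {
    "first_name": (re.compile(r"[A-Za-z][A-Za-z .'-]*"), 50),
    "county": (re.compile(r"[A-Za-z][A-Za-z .'-]*"), 60),
    "zip_code": (re.compile(r"\d{5}(-\d{4})?"), 10),
}

# HYPOTHETICAL markers; a real list would be broader and reviewed regularly.
INJECTION_MARKERS = re.compile(
    r"ignore (all )?previous instructions|system prompt|https?://",
    re.IGNORECASE,
)


class SanitizationError(ValueError):
    """Raised with the field name only, never the (possibly PII) value."""


def sanitize_survivor(data: dict) -> dict:
    clean = {}
    for field, value in data.items():
        rule = FIELD_RULES.get(field)
        if rule is None:
            raise SanitizationError(f"unexpected field: {field}")
        pattern, max_len = rule
        if not isinstance(value, str):
            raise SanitizationError(f"non-string value: {field}")
        value = value.strip()
        if len(value) > max_len:
            raise SanitizationError(f"too long: {field}")
        if INJECTION_MARKERS.search(value):
            raise SanitizationError(f"injection marker: {field}")
        if not pattern.fullmatch(value):
            raise SanitizationError(f"disallowed characters: {field}")
        clean[field] = value
    return clean
```

The marker check matters even with tight patterns: "Ignore previous instructions" is made entirely of letters and spaces, so it would pass a name allow-list on its own.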

A.4.4 P1 — /agent_health hands the LLM subprocess the entire host environment

  • Problem: subprocess.run(["claude-full", "-p", prompt], env=os.environ.copy(), ...) passes every env var — including BEDROCK_API_KEY, AI_JWT_SECRET, AWS credentials injected by the task role — to a child that is explicitly meant to be a "quick smoke test."
  • Evidence: fema_agent/src/web_service/app.py:140-147.
  • Suggested change: Build a minimal allow-list env (PATH, HOME, LLM_PROVIDER, LLM_MODEL_NAME, AWS_REGION, provider key).
  • Estimated effort: S
  • Risk if ignored: Secret exfiltration via LLM output or future observability hooks.
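The allow-list environment can be sketched as a small pure function. The base key names come from the suggested change above; the provider-to-key mapping and the function name are assumptions.

```python
import os

BASE_KEYS = ("PATH", "HOME", "LLM_PROVIDER", "LLM_MODEL_NAME", "AWS_REGION")

# ASSUMPTION: one credential variable per provider value of LLM_PROVIDER.
PROVIDER_KEYS = {
    "bedrock": "BEDROCK_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def build_llm_env(source=None) -> dict:
    """Copy only the variables the LLM CLI needs; drop everything else."""
    source = os.environ if source is None else source
    env = {key: source[key] for key in BASE_KEYS if key in source}
    provider_key = PROVIDER_KEYS.get(source.get("LLM_PROVIDER", ""))
    if provider_key and provider_key in source:
        env[provider_key] = source[provider_key]
    return env
```

The call site would then pass env=build_llm_env() instead of env=os.environ.copy(). If the Bedrock route authenticates through the ECS task role rather than an API key, the container-credentials variable would need to be added to the allow-list as well; that depends on deployment details not covered in this review.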

A.4.5 P2 — Test-only routes live in the production app

  • Problem: /test/test_kup, /test/test_kup2, /test/force-status are in app.py and reachable in every built image with no env guard.
  • Evidence: YAML §3.2 cites app.py:624,699,796.
  • Suggested change: Gate with if os.environ.get("FEMA_ENABLE_TEST_ROUTES") == "1": inside create_app; fail-closed in prod.
  • Estimated effort: S
  • Risk if ignored: Unexpected state transitions from unauth traffic; widens blast radius of A.4.1.
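The fail-closed gate can be sketched independently of Flask. In the sketch below a plain dict stands in for the app's route table, and the register_routes helper is hypothetical; only the FEMA_ENABLE_TEST_ROUTES name and the == "1" comparison come from the suggested change.

```python
import os

TEST_ROUTES = ("/test/test_kup", "/test/test_kup2", "/test/force-status")


def should_register_test_routes(environ=None) -> bool:
    """Fail closed: only the exact value "1" enables the test routes."""
    environ = os.environ if environ is None else environ
    return environ.get("FEMA_ENABLE_TEST_ROUTES") == "1"


def register_routes(routes: dict, environ=None) -> dict:
    """Stand-in for create_app: production routes always, test routes gated."""
    routes["/health"] = "health"
    if should_register_test_routes(environ):
        for path in TEST_ROUTES:
            routes[path] = "test-only"
    return routes
```

Comparing against the exact string "1" means an unset variable, "0", "true", or a typo all leave the test routes unregistered, so they return 404 in a production image rather than relying on a runtime check inside each handler.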

A.5 Things That Don't Make Sense

  1. Observation: _clear_firefox_profile_locks walks the top-level dir, breaks after the first iteration, then does a second manual two-level walk.

    • Location: app.py:59-83.
    • Hypotheses considered: defensive belt-and-suspenders; or the author found os.walk descended too far and spliced in a second pass.
    • Question for author: Would a single glob.glob(os.path.join(firefox_profile_dir, "**/lock"), recursive=True) plus .parentlock glob be equivalent and cleaner?
  2. Observation: check_agent_status returns stderr = stdout_text when stderr is empty.

    • Location: agent_runner.py:305-308.
    • Hypotheses considered: back-compat with older callers that only inspect stderr.
    • Question for author: Is there still any caller that only looks at stderr? If not, drop the aliasing — it hides the real stream and complicates log grepping.
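For the first observation, a single-pass alternative with an explicit depth limit might look like the sketch below. It is not the code in app.py: it assumes the intent is to remove lock and .parentlock from the profile root and its immediate subdirectories only. A recursive "**" glob would descend further than that, so it would not be strictly equivalent.

```python
from pathlib import Path

LOCK_NAMES = ("lock", ".parentlock")


def clear_profile_locks(profile_root: str) -> list:
    """Remove Firefox lock files one level deep; leave everything else."""
    root = Path(profile_root)
    if not root.is_dir():
        return []
    # The root itself plus its immediate subdirectories, nothing deeper.
    candidates = [root] + [p for p in root.iterdir() if p.is_dir()]
    removed = []
    for directory in candidates:
        for name in LOCK_NAMES:
            target = directory / name
            # On Linux the Firefox "lock" entry is typically a dangling
            # symlink, so is_file() alone would miss it.
            if target.is_symlink() or target.is_file():
                target.unlink()
                removed.append(str(target))
    return sorted(removed)
```

Matching on exact file names keeps cookies and storage untouched, and the fixed depth preserves the "cannot accidentally nuke profile state" property noted in A.3.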

A.6 Anti-Patterns Detected

A.6.1 Code-level

  • God object / god function — app.py is 829 lines with FSM reset logic, validation, subprocess launching, lock cleanup, and ~16 routes in one create_app.
  • Shotgun surgery
  • Feature envy
  • Primitive obsession
  • Dead code
  • Copy-paste / duplication — DEFAULT_WORK_DIR defined twice with the same default (app.py:26, agent_runner.py:16); agent1/agent2 status/run/restart blocks are near-mirrors (app.py:418-502).
  • Magic numbers / unexplained constants
  • Deep nesting (>3 levels)
  • Long parameter lists (>4)
  • Boolean-flag parameters that change behavior

A.6.2 Architectural

  • Big ball of mud
  • Distributed monolith
  • Chatty services
  • Leaky abstraction / inappropriate intimacy between layers
  • Golden hammer
  • Vendor lock-in without exit strategy
  • Stovepipe / reinvented wheel
  • Missing seams for testing — subprocess spawning, os.environ reads, time.time(), and os.walk are all called directly with no injection point.

A.6.3 Data

  • God table / EAV / missing indexes / N+1 / unbounded growth / nullable-everything / shared DB — N/A (not enough visibility into DynamoDB schemas from code read).

A.6.4 Async / Ops

  • Poison messages with no dead-letter queue — N/A (no queue).
  • Retry storms / no backoff — mitigated by retry_count ceiling.
  • Missing idempotency keys on non-idempotent ops — /agent*/run guarded by FSM state, effectively a per-container idempotency key.
  • Hidden coupling via shared state — Flask state.py in-memory dict is the single source of truth; any Flask worker count >1 would silently corrupt it. Single-worker assumption is not asserted.
  • Work queues without visibility / depth metrics
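One way to make the single-worker assumption explicit is an exclusive, non-blocking file lock taken at worker startup, so a second worker fails fast instead of silently keeping its own copy of the FSM. This is a POSIX-only sketch of that technique; the function name and the idea of a lock file are assumptions, not something found in the repo.

```python
import fcntl
import os


def acquire_single_worker_lock(lock_path: str) -> int:
    """Take an exclusive lock or raise; keep the returned fd open for the
    lifetime of the process (closing it releases the lock)."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise RuntimeError(
            "another worker already owns the in-memory FSM; run with one worker"
        )
    return fd
```

The lock would need to be acquired in each worker after any fork, because a descriptor inherited from a parent shares the parent's lock. A container-local path (not the shared EFS mount) is the natural home for the lock file, since the constraint is per container.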

A.6.5 Security

  • Secrets in code, .env committed, or logs — stdout/stderr tails returned by /agent_health (app.py:154-167) and check_agent_status (agent_runner.py:302-313) can leak env-loaded secrets or PII echoed by the LLM.
  • Missing authn/z on internal endpoints — every route in app.py is unauthenticated (see A.4.1).
  • Overbroad IAM roles / least-privilege violations — not reviewed (no CDK inspection).
  • Unvalidated input crossing a trust boundary — survivor JSON is key-allow-listed but field-level values are fed into a script file that Claude then executes (prompt injection surface; see A.4.3).
  • PII/PHI in logs or error messages — in-memory survivor_data, EFS-persisted as_agent1.txt/as_agent2.txt, and process stderr returned via HTTP all carry PII. No redaction layer found.
  • Missing CSRF / XSS / SQLi / SSRF protections — Flask JSON API so CSRF N/A; no SQL; SSRF implicit in Firefox surface.
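A redaction pass for the stdout/stderr tails mentioned above could be sketched as follows. The secret variable names come from this review; the SSN pattern, the minimum secret length, the tail size, and the function name are illustrative assumptions, and a real redactor would cover more PII classes than one regex.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SECRET_ENV_NAMES = ("BEDROCK_API_KEY", "ANTHROPIC_API_KEY", "AI_JWT_SECRET")


def redact_tail(text: str, environ: dict, max_chars: int = 2000) -> str:
    """Redact known secrets and SSN-shaped strings, then keep the tail.

    Redaction runs before truncation so a secret straddling the cut point
    cannot leak as a partial value.
    """
    for name in SECRET_ENV_NAMES:
        value = environ.get(name)
        if value and len(value) >= 8:  # skip empty or trivially short values
            text = text.replace(value, f"[REDACTED:{name}]")
    text = SSN_RE.sub("[REDACTED:SSN]", text)
    return text[-max_chars:]
```

The HTTP response would carry redact_tail(stderr_text, os.environ), while the unredacted stream stays in access-controlled internal logs only.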

A.6.6 Detected Instances

| # | Anti-pattern | Location (file:line) | Severity (P0/P1/P2) | Recommendation |
|---|---|---|---|---|
| 1 | God function (create_app) | fema_agent/src/web_service/app.py:86-820 | P2 | Split into routes/health.py, routes/survivor.py, routes/agents.py blueprints. |
| 2 | Duplicated DEFAULT_WORK_DIR | app.py:26, agent_runner.py:16 | P2 | Extract to web_service/config.py. |
| 3 | Near-duplicate agent1/agent2 route handlers | app.py:418-502 | P2 | Parameterize on agent_key like _handle_restart already does at :522-549. |
| 4 | Missing seams for testing | agent_runner.py:45-52, 132-139, 156-161, 241-283 | P2 | Inject a ProcessLauncher + Clock. |
| 5 | Single-process in-memory FSM with no worker guard | state.py (module-level dict) | P1 | Assert workers == 1 at startup or move state out of process. |
| 6 | Stdout/stderr tails leaked in HTTP responses | app.py:154-167, agent_runner.py:302-313 | P1 | Redact before return; log full body internally only. |
| 7 | Unauth endpoints on PII-handling data plane | app.py:110,114,214,239,333,418,460,560,564 | P0 | JWT before_request hook. |
| 8 | Unvalidated survivor JSON values interpolated into LLM prompts | app.py:295-296, 387-395 (via SurvivorFemaApplication) | P0 | Sanitizer + length caps + prompt-injection deny list. |
| 9 | shell=True + f-string interpolation in Popen | agent_runner.py:40-52, 127-139, 148-161 | P0 (latent) | Use list form, drop shell=True. |
| 10 | env=os.environ.copy() passed to LLM subprocess | app.py:146 | P1 | Allow-list env. |
| 11 | Test routes reachable in prod image | app.py:624,699,796 (per YAML) | P2 | Env-gate. |

A.7 Open Questions

  1. Q: Is there any authentication at the container's Flask layer, or is the YAML's "VPC isolation + CloudFront signed URLs" the only control? If the latter, what stops an in-VPC service from posting directly?

    • Blocks: A.4.1, A.11.
    • Who can answer: Gordon / platform-sec.
  2. Q: Have the agent1.md / agent2.md prompts been reviewed for prompt-injection resilience against untrusted survivor input?

    • Blocks: A.4.3.
    • Who can answer: Gordon / AI safety reviewer.
  3. Q: Does the Bedrock path have a DPA/BAA in place for the PII being sent? FEMA IA data includes income, disability, deceased persons.

    • Blocks: A.16 (if compliance is claimed).
    • Who can answer: legal / compliance.
  4. Q: Is Flask run with workers=1? If not, state.py's in-memory dict is unsafe.

    • Blocks: A.6.4.
    • Who can answer: deployment doc / Dockerfile CMD.

A.8 Difficulties Encountered

  • Difficulty: agent1.md / agent2.md prompt templates live under container_work_dir/fema-apply-agent/ (per YAML) and were not read as part of this review.

    • Impact on analysis: Cannot concretely grade prompt-injection resilience — the "PII into prompts" finding (A.4.3) is inferred from the call sites, not from the prompt text itself.
    • Fix that would help next reviewer: Commit a sanitized sample prompt to the repo root or link from README.
  • Difficulty: CDK stacks under aws_deployment/ were not opened; security posture at the edge is taken on faith from the YAML.

    • Impact on analysis: Could not verify CloudFront signed-URL enforcement, mTLS, IAM least-privilege, or SG posture.
    • Fix that would help next reviewer: Short aws_deployment/README.md per stack would shortcut this.
  • Difficulty: No coverage or flake numbers; test directory count alone is a weak signal.

    • Impact on analysis: A.13 is mostly empty.
    • Fix that would help next reviewer: pytest --cov badge or a coverage.xml artifact.

A.9 Risks & Unknowns

A.9.1 Known risks

| # | Risk | Likelihood (L/M/H) | Impact (L/M/H) | Mitigation |
|---|---|---|---|---|
| 1 | Unauth Flask plane reachable from any VPC workload | M | H | JWT before_request; SG lockdown to API GW only. |
| 2 | Prompt injection via survivor JSON field values | M | H | Sanitizer + deny list; prompt design that quarantines user data. |
| 3 | DOM drift / CAPTCHA on real disasterassistance.gov | H | M | Runbook exists; add monitoring on agent retry count. |
| 4 | Login.gov session expiry mid-Agent-2 run | M | M | Detect 401/redirect to Login.gov; surface to user. |
| 5 | Multi-worker Flask corrupts in-memory FSM | L | H | Assert workers=1 or move state. |
| 6 | Secrets/PII leak via stderr tails in HTTP responses | M | H | Redact before return. |
| 7 | shell=True becomes injectable after a future refactor | L | H | Switch to list form now. |

A.9.2 Unknown unknowns

  • Area not reviewed: agent1.md / agent2.md prompt bodies. Reason: not in the paths I opened. Best guess at risk level: High — this is where prompt-injection defense either lives or doesn't.
  • Area not reviewed: aws_deployment/ CDK stacks. Reason: out of scope for time budget. Best guess at risk level: Medium — standard CDK patterns usually OK, but IAM scoping and CloudFront signed-URL enforcement need verification.
  • Area not reviewed: survivor_api/ field-level validators. Reason: only inspected the call site in app.py. Best guess at risk level: Medium — the sanitization verdict in A.4.3 hinges on what this module does or doesn't do.
  • Area not reviewed: DynamoDB ContainerInstance + AuditEvent schemas. Reason: in sibling container_manager service. Best guess at risk level: Low-Medium.
  • Area not reviewed: The 874-test suite content. Reason: time. Best guess at risk level: Low (tests existing is a strong signal).

A.10 Technical Debt Register

| # | Debt item | Quadrant | Estimated interest | Remediation |
|---|---|---|---|---|
| 1 | Unauth Flask data plane | Reckless & Deliberate | High (security incidents) | JWT before_request (S). |
| 2 | shell=True + f-string Popen | Prudent & Inadvertent | Medium (latent) | Switch to list form (S). |
| 3 | PII into LLM prompts with no sanitizer | Reckless & Inadvertent | High (compliance + injection) | Sanitizer layer + redaction (M). |
| 4 | env=os.environ.copy() for LLM subprocess | Reckless & Inadvertent | Medium | Allow-list env (S). |
| 5 | create_app is 700+ lines | Prudent & Deliberate | Low | Blueprint split (M). |
| 6 | Duplicated DEFAULT_WORK_DIR | Prudent & Inadvertent | Low | Extract config module (S). |
| 7 | In-memory FSM with no worker-count assertion | Reckless & Inadvertent | Medium | Assert workers=1 at startup (S). |
| 8 | Test routes in prod image | Prudent & Deliberate | Low-Medium | Env gate (S). |
| 9 | Stdout/stderr tails leaked over HTTP | Reckless & Inadvertent | Medium-High | Redaction layer (S). |

A.11 Security Posture (lightweight STRIDE)

| Category | Threat present? | Mitigated? | Gap |
|---|---|---|---|
| Spoofing (identity) | Yes: anyone in VPC can POST as "the control plane" | Partial (only at edge, per YAML) | No app-layer auth (A.4.1). |
| Tampering (integrity) | Yes: attacker can POST /survivor-info/p2 to mutate in-flight state | No | Same root cause as spoofing. |
| Repudiation | Partial: DynamoDB AuditEvent exists per YAML | Unknown | Not verified end-to-end; no signed audit log seen in app.py. |
| Information Disclosure | Yes: stdout/stderr tails, in-memory PII, EFS-persisted scripts, Bedrock prompt payloads | Weak | Needs redaction + DPA check (A.4.3, A.4.4). |
| Denial of Service | Yes: /agent_health spawns a subprocess per call, no rate limit | Partial (409 if agent running) | Add rate limit / auth. |
| Elevation of Privilege | Yes: latent via shell=True if session_id ever becomes tainted; compromised agent runs with container role IAM | Partial | List-form Popen + IAM scoping review. |

A.12 Operational Readiness

| Capability | Present / Partial / Missing | Notes |
|---|---|---|
| Structured logs | Partial | logger = logging.getLogger(__name__) in agent_runner.py but app.py uses print(..., file=sys.stderr) (:302-305). |
| Metrics | Unknown | Not visible in code read. |
| Distributed tracing | Missing | No OTel imports. |
| Actionable alerts | Unknown | Presumed in aws_deployment/observability stack. |
| Runbooks | Present | runbook.md + RUN_BOOK.md. |
| On-call ownership defined | Unknown | Single author per git (Gordon). |
| SLOs / SLIs | Missing | Not documented. |
| Backup & restore tested | Unknown | EFS is the only stateful store; snapshot policy not verified. |
| Disaster recovery plan | Unknown | Not seen. |
| Chaos / failure testing | Missing | No evidence. |

A.13 Test & Quality Signals

  • Coverage (line / branch): N/A — not reported.
  • Trend: N/A.
  • Flake rate: N/A.
  • Slowest tests: N/A.
  • Untested critical paths: Unknown; likely: prompt-injection robustness, multi-worker FSM safety, real-site DOM drift.
  • Missing test types: load and security/fuzz. Unit, integration, and contract tests are present per the YAML, and an e2e script (run_e2e_perf_test.sh) is present.

A.14 Performance & Cost Smells

  • Hot paths: /agent*/status polled by control plane.
  • Suspected bottlenecks: Cold Firefox start + Claude Code CLI boot per container.
  • Wasteful queries / loops: get_latest_temp_mtime walks the full temp/ tree on every inactivity check (agent_runner.py:187-207) — probably fine at current sizes.
  • Oversized infra / idle resources: Fargate per-survivor is inherently spiky; without TTL enforcement (NOVNC_HEARTBEAT_SECONDS only keeps a conn open, doesn't kill idle tasks) cost could drift.
  • Cache hit/miss surprises: N/A.

A.15 Bus-Factor & Knowledge Risk

  • Who is the only person who understands X? Gordon (the sole authors: entry in the YAML; gordon.zhg@gmail.com appears under two git identities).
  • What breaks if they disappear tomorrow? Real-site DOM fixes, Login.gov handoff tuning, prompt engineering for agent1.md/agent2.md.
  • What is undocumented tribal knowledge? Why --resume instead of --continue (partially captured in docstring at agent_runner.py:123-125); the per-section timeouts (100s/300s) rationale.
  • Suggested knowledge-transfer actions: Pair-review with a second engineer on the prompt files; ADR for the Agent 1/Agent 2 split and Login.gov handoff.

A.16 Compliance Gaps

N/A — the prop-build doc does not explicitly claim HIPAA/SOC 2/PCI compliance. That said, if FEMA IA data is being processed, a reasonable auditor would ask about: (a) BAA/DPA with AWS Bedrock, (b) PII retention in EFS-backed Firefox profiles, (c) access control to the unauth Flask plane, (d) audit log integrity in DynamoDB. These are flagged here even without an explicit claim, because the data class (federal benefits PII including disability and deceased persons) would typically trigger review.


A.17 Recommendations Summary

| Priority | Action | Owner (suggested) | Effort | Depends on |
|---|---|---|---|---|
| P0 | Add JWT before_request auth to every Flask route except /health; env-gate /test/* | Gordon | S | AI_JWT_SECRET already exists |
| P0 | Build a survivor-data sanitizer + prompt-injection deny list + length caps; wire into generate_agent1_script_from_p1 and generate_agent2_script | Gordon + AI-safety reviewer | M | Read of agent1.md/agent2.md |
| P0 | Replace shell=True + f-string Popen calls in agent_runner.py with list form | Gordon | S | |
| P0 | Redact PII/secrets from stdout/stderr tails before returning in /agent_health and /agent*/status | Gordon | S | |
| P1 | Build an allow-list env for claude-full subprocess in /agent_health (and future spawns) | Gordon | S | |
| P1 | Assert workers=1 at Flask startup or document the single-worker requirement and Dockerfile CMD | Gordon | S | |
| P1 | Document / verify Bedrock DPA coverage for FEMA IA PII | Compliance | S | legal |
| P1 | Read + security-review container_work_dir/fema-apply-agent/agent1.md and agent2.md | AI-safety reviewer | M | |
| P2 | Split create_app into Flask blueprints; extract DEFAULT_WORK_DIR to config.py | Gordon | M | |
| P2 | Parameterize agent1/agent2 route handlers like _handle_restart already does | Gordon | S | |
| P2 | Inject ProcessLauncher + Clock seams into agent_runner.py for testability | Gordon | M | |
| P2 | Add structured logging (swap print(..., file=sys.stderr) for logger.*) | Gordon | S | |

Environment variables

| Name | Purpose |
|---|---|
| LLM_PROVIDER | bedrock or anthropic |
| LLM_MODEL_NAME | Friendly alias |
| BEDROCK_API_KEY* | Bedrock auth |
| ANTHROPIC_API_KEY | Direct Anthropic API alt |
| AWS_REGION | Bedrock region |
| WORKSPACE_DIR | EFS mount point |
| FIREFOX_PROFILE_DIR | Persisted Firefox profile |
| FIREFOX_CACHE_DIR | Persisted Firefox cache |
| NOVNC_HEARTBEAT_SECONDS | Idle keepalive for CloudFront |