Same Dockerfile shape as the sister repo: computer-use Ubuntu 22.04 + Node.js 20 + Claude Code CLI v2.1.45 + Flask 3.0 + mcp-vnc MCP server + Firefox ESR with an EFS-backed profile. The Flask state machine has 9 production states (IDLE → SETTING_UP → DATA_SET → AGENT1_RUNNING → (AGENT1_COMPLETE | AGENT1_ERROR) → AGENT2_RUNNING → (AGENT2_COMPLETE | AGENT2_ERROR)), the same shape as the sister repo. The mock FEMA HTTP server and Chrome have both been removed; Firefox ESR is the only browser. The Agent 1 script navigates to the live https://www.disasterassistance.gov/ and stops at the Login.gov page; the user completes 2FA inside the noVNC session; then the user (or an operator) calls /agent2/run. The repo includes aws_deployment/ CDK stacks (ECS, API Gateway, CloudFront, Secrets, Storage, Observability) and aws_demo_deploy/REPLIT_DEPLOY.md for Replit iframe integration. The /survivor-info/p1 and /survivor-info/p2 split endpoints are present (not removed, as previously documented).
Role in the system: Production sibling of af-disaster-assistance-gov-agent — same Flask state machine, same per-survivor container, but with mock infrastructure removed and Firefox ESR as the only browser. Includes AWS CDK + Replit integration.
Surfaces:
- Flask data-plane :5001 (same 8 endpoints as sister)
- noVNC :6080 (vnc_embed.html for iframe)
- VNC :5900
- Firefox ESR with persisted profile in EFS
- Claude Code CLI agent runtime
- AWS CDK stacks under aws_deployment/
- Replit integration docs (aws_demo_deploy/REPLIT_DEPLOY.md)
User workflows
- Build + start: `/health` = ok
- Submit survivor: Container ready
- Run Agent 1: State → AGENT1_COMPLETE; manual user step required
- Manual Login.gov: User authenticated
- Run Agent 2: State → AGENT2_COMPLETE
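The workflow steps map onto the nine states named at the top of this document. The following is a minimal sketch of that state machine, not the contents of `state.py`: the `State` enum, the `ALLOWED` table, and `transition` are illustrative names, and the `*_ERROR → *_RUNNING` restart edges are an assumption inferred from the restart endpoints.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    SETTING_UP = "SETTING_UP"
    DATA_SET = "DATA_SET"
    AGENT1_RUNNING = "AGENT1_RUNNING"
    AGENT1_COMPLETE = "AGENT1_COMPLETE"
    AGENT1_ERROR = "AGENT1_ERROR"
    AGENT2_RUNNING = "AGENT2_RUNNING"
    AGENT2_COMPLETE = "AGENT2_COMPLETE"
    AGENT2_ERROR = "AGENT2_ERROR"


# Assumed transition table. The *_ERROR -> *_RUNNING edges model a restart
# and are a guess; they were not read from state.py.
ALLOWED = {
    State.IDLE: {State.SETTING_UP},
    State.SETTING_UP: {State.DATA_SET},
    State.DATA_SET: {State.AGENT1_RUNNING},
    State.AGENT1_RUNNING: {State.AGENT1_COMPLETE, State.AGENT1_ERROR},
    State.AGENT1_COMPLETE: {State.AGENT2_RUNNING},
    State.AGENT1_ERROR: {State.AGENT1_RUNNING},
    State.AGENT2_RUNNING: {State.AGENT2_COMPLETE, State.AGENT2_ERROR},
    State.AGENT2_COMPLETE: set(),
    State.AGENT2_ERROR: {State.AGENT2_RUNNING},
}


def transition(current: State, target: State) -> State:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

A guard of this kind is what makes a call such as `/agent2/run` rejectable until Agent 1 has completed and the manual Login.gov step has happened.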
API endpoints
- GET `/health`: Liveness
- POST `/agent_health`: LLM smoke test
- GET `/state`: Current orchestration state
- POST `/survivor-info`: Submit full survivor JSON; generate scripts; launch Firefox to real FEMA URL
- POST `/survivor-info/p1`: Partial submission (pages 1-N before Login.gov)
- POST `/survivor-info/p2`: Partial submission (post-login pages)
- GET `/agent1/status_detailed`: Verbose Agent 1 status (logs, sentinels, retry counts)
- GET `/agent2/status_detailed`: Verbose Agent 2 status
- POST `/agent1/restart`: Force-restart Agent 1 subprocess
- POST `/agent2/restart`: Force-restart Agent 2 subprocess
- POST `/test/test_kup`, `/test/test_kup2`, `/test/force-status`: Test-only endpoints for state-machine harness
- POST `/agent1/run`: Spawn Agent 1 subprocess
- GET `/agent1/status`: Poll Agent 1
- POST `/agent2/run`: Spawn Agent 2 (only after manual login)
- GET `/agent2/status`: Poll Agent 2
Third-party APIs
- Amazon Bedrock (Anthropic Claude models), or the direct Anthropic API: LLM backend for Claude Code CLI
- `https://www.disasterassistance.gov/`: real FEMA application portal (target)
- Login.gov: 2FA between Agent 1 and Agent 2
Service dependencies
- AWS ECS Fargate: Per-survivor task launch
- AWS EFS: Firefox profile + work dir
- AWS Secrets Manager: Bedrock + JWT + VNC keys
- AWS API Gateway + Lambda (control plane): Container manager (start/stop/list)
- AWS CloudFront: Origin restriction + signed URLs (mTLS optional)
- AWS DynamoDB: Container state + audit
- Replit (frontend host): Embeds noVNC iframe + calls control plane
- af-backend-go-api: Alternative control-plane caller (production path)
Analysis
af-fema-real-ai-agent — Prop-Build Analysis
Document Type: Critical Review & Analysis (companion to prop-build-template.md)
Scope: Per-Repo / Per-Module
Subject: af-fema-real-ai-agent (FEMA Real AI Agent — per-survivor Claude-computer-use container driving production DisasterAssistance.gov)
Reviewer(s): Claude (automated code review)
Date: 2026-04-09
Version: 0.1
Confidence Level: Medium
What would raise confidence: running container locally, observing a full Agent 1 → Login.gov → Agent 2 run, access to CloudWatch logs + DynamoDB audit events, interview with Gordon, review of container_work_dir/fema-apply-agent/agent1.md + agent2.md prompts, and the CDK stacks under aws_deployment/.
Inputs Reviewed:
- Prop-build doc: `/Users/andres/src/af/af-analysis/data/af-fema-real-ai-agent.yaml`
- Companion docs: `api-examples.md`, `data-flow.md`, `runbook.md`, `deployment.md`
- Source: `/Users/andres/src/af/af-fema-real-ai-agent/fema_agent/src/web_service/{app.py,state.py,agent_runner.py}` plus tree listing of `survivor_api/`, `raw_builder/`, `intermediate_builder/`, `aws_deployment/`, `specs/021-mock-to-real-fema/`.
- Not executed; no prod metrics; agent prompt files (`agent1.md` / `agent2.md`) not inspected line-by-line.
Part A — Per-Repo / Per-Module Analysis
A.1 Executive Summary
- Overall health: Functional, reasonably well-factored Python/Flask orchestrator around Claude Code CLI driving a real federal benefits site, but security posture is weak for something that handles FEMA IA PII and a live Login.gov session.
- Top risk: No application-layer authentication on any Flask endpoint (`app.py:110-564`) combined with PII-laden survivor JSON flowing into prompt text that Claude then executes against disasterassistance.gov: a classic prompt-injection + PII-in-prompt surface. See A.6.5 and A.11.
- Top win / thing worth preserving: Surgical Firefox profile lock cleanup that preserves cookies (`app.py:42-84`) and the resume-first / fresh-prompt retry escalation in `agent_runner.restart_agent` (`agent_runner.py:164-184`). Both are thoughtful, failure-mode-driven code worth propagating to the sister repo.
- Single recommended next action: Put authenticated, signed-URL-only access in front of every Flask route (not just at CloudFront) and add an explicit survivor-data sanitizer before values are interpolated into `as_agent1.txt` / `as_agent2.txt`.
- Blocking unknowns: The actual content of `agent1.md` / `agent2.md` (the prompts Claude receives) was not read; CDK auth/network posture was taken from the YAML summary only; no coverage/flake data.
A.2 Health Scorecard
| # | Dimension | Score (1–5) | Justification |
|---|---|---|---|
| 1 | Module overview / clarity of intent | 4 | README + YAML + specs/021-mock-to-real-fema/ make the purpose and Agent1/Agent2 split unambiguous. |
| 2 | External dependencies | 3 | Hard dependency on Bedrock, Login.gov, live FEMA site; no abstraction to swap providers even though LLM_PROVIDER env hints at it. |
| 3 | API endpoints | 3 | Clean REST shape and error handlers (app.py:92-106), but ~16 routes all unauthenticated at app layer; test endpoints (/test/test_kup*, /test/force-status) live in the same app (app.py:624,699,796 per YAML). |
| 4 | Database schema | 3 | DynamoDB tables (ContainerInstance, AuditEvent) per YAML; not inspected directly; in-container state is an in-memory dict in state.py — fine for per-survivor lifetime, but opaque. |
| 5 | Backend services | 4 | Clear separation: survivor_api/ (parse), raw_builder//intermediate_builder/ (script gen), web_service/agent_runner.py (process mgmt), state.py (FSM). |
| 6 | WebSocket / real-time | 3 | noVNC is stock; only real-time surface; 30s heartbeat is pragmatic. |
| 7 | Frontend components | 3 | No real FE in this repo beyond vnc_embed.html; scope is narrow and appropriate. |
| 8 | Data flow clarity | 4 | The 9-state FSM and sentinel-file protocol are explicit; companion data-flow.md traces it. |
| 9 | Error handling & resilience | 4 | check_inactivity_and_kill (agent_runner.py:241-283), SIGTERM→SIGKILL escalation (:210-238), stale-detection + retry count, resume-first strategy are all solid. |
| 10 | Configuration | 3 | Env vars documented; but WORK_DIR defaulted twice (app.py:26, agent_runner.py:16) and subprocess.run(..., env=os.environ.copy()) in /agent_health (app.py:146) passes the entire host env to the LLM CLI. |
| 11 | Data refresh patterns | 3 | EFS-backed Firefox profile persists across runs, with intentional lock cleanup; acceptable. |
| 12 | Performance | 3 | Per-section timeouts (100s A1, 300s A2) and 600s inactivity kill (agent_runner.py:19) are reasonable; no metrics to verify. |
| 13 | Module interactions | 4 | Clean boundary between Flask plane and Claude CLI subprocess. |
| 14 | Troubleshooting / runbooks | 4 | runbook.md + RUN_BOOK.md + inline docstrings cover stale locks, DOM drift, Login.gov expiry. |
| 15 | Testing & QA | 4 | YAML cites 874 passing tests, tests tree includes unit/integration/contract; test count is a strong signal even without coverage numbers. |
| 16 | Deployment & DevOps | 3 | CDK stacks present (not read); Dockerfile clean; no evidence of staged rollouts or canaries. |
| 17 | Security & compliance | 2 | Unauth Flask routes, shell=True subprocess with f-string interpolation, PII flowing into LLM prompts + EFS profile, broad env passthrough to LLM CLI. See A.6.5, A.11. |
| 18 | Documentation & maintenance | 4 | YAML + specs + companion md files are unusually thorough for a WIP-feeling repo. |
| 19 | Roadmap clarity | 3 | specs/021-mock-to-real-fema/ and 022-secure-demo-deploy/ suggest direction but no explicit roadmap doc was reviewed. |
Overall score: 3.37 average (64 points across 19 rows). Weighted reading: operationally thoughtful, but the security rating (2) is a load-bearing concern that should pull the effective score down for any prod-readiness decision.
A.3 What's Working Well
- Strength: Surgical Firefox profile lock cleanup that deletes only `lock` and `.parentlock` and explicitly preserves cookies/storage.
  - Location: `fema_agent/src/web_service/app.py:42-84`
  - Why it works: The docstring names the exact failure mode (new container IP vs. old lock) and the code walks only one level so it cannot accidentally nuke profile state. This is the kind of narrow, well-justified hack that saves incidents.
  - Propagate to: `af-disaster-assistance-gov-agent` (sister repo) if not already there.
- Strength: Resume-first retry strategy with graceful fallback to fresh-prompt.
  - Location: `fema_agent/src/web_service/agent_runner.py:164-184` (`restart_agent`) + `:120-161` (`run_agent_resume`, `run_agent_fresh_with_resume`).
  - Why it works: Encodes real knowledge (`--continue` is broken with `--mcp-config`, so `--resume <sessionId>` is used) and escalates to a fresh prompt that self-skips completed pages using `temp/` logs. Debt-aware, not debt-blind.
  - Propagate to: Any other Claude Code CLI orchestrator in the org.
- Strength: SIGTERM → grace → SIGKILL process-group termination.
  - Location: `agent_runner.py:210-238` (`kill_process_group`) + `:241-283` (`check_inactivity_and_kill`).
  - Why it works: `start_new_session=True` at spawn (`:51`, `:137`, `:159`) plus `os.killpg` ensures the shell wrapper and the `claude-full` child both die; inactivity is measured from `max(start_time, latest_mtime)` to avoid false-positive kills from stale `temp/` files.
  - Propagate to: Sister repo and any future agent runner.
- Strength: Explicit P1/P2 cross-validation with county match.
  - Location: `app.py:359-370`.
  - Why it works: Catches a whole class of "wrong disaster" data-entry errors before they hit Claude and the real gov site.
  - Propagate to: Any other form-automation agent.
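The SIGTERM → grace → SIGKILL pattern praised above can be sketched as follows. This is an illustration of the technique under stated assumptions, not the code in `agent_runner.py`: `spawn_in_own_group` and `terminate_process_group` are hypothetical names, the 5-second grace period is arbitrary, and the sketch is POSIX-only because it relies on `os.killpg`.

```python
import os
import signal
import subprocess
import sys


def spawn_in_own_group(argv: list) -> subprocess.Popen:
    # start_new_session=True makes the child a session and process-group
    # leader, so a later killpg reaches all of its descendants but not us.
    return subprocess.Popen(argv, start_new_session=True)


def terminate_process_group(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """SIGTERM the child's process group, wait, then SIGKILL if still alive."""
    if proc.poll() is not None:  # already exited, nothing to signal
        return proc.returncode
    try:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:  # exited between poll() and the signal
        return proc.wait()
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # grace period elapsed, force it
        return proc.wait()


if __name__ == "__main__":
    child = spawn_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", terminate_process_group(child))
```

Signalling the group rather than the single PID is what keeps a wrapper process from dying while its child keeps driving the browser.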
A.4 What to Improve
A.4.1 P0 — Unauthenticated Flask data plane handling PII
- Problem: Every route on the Flask app is defined without an `@requires_auth` wrapper. `/survivor-info`, `/survivor-info/p1`, `/survivor-info/p2`, `/agent1/run`, `/agent2/run`, `/agent*/restart`, and the `/test/*` endpoints all accept anonymous POSTs. The YAML argues "network isolation + CloudFront signed URLs" is sufficient, but nothing in this repo enforces that: any workload that lands in the VPC (misconfigured SG, sidecar compromise, SSRF from another service, or a misrouted ALB) can POST survivor PII directly.
- Evidence: `fema_agent/src/web_service/app.py:110, 114, 214, 239, 333, 418, 460, 560, 564` (and `/test/*` at `:624`, `:699`, `:796` per YAML A.3). No `before_request` auth hook; no `abort(401)` anywhere in `app.py`.
- Suggested change: Add a `before_request` that validates an HS256 JWT (the same `AI_JWT_SECRET` already used at the API Gateway layer per YAML §3.3) on every non-`/health` route; move test-only endpoints behind an env flag that is false in prod images.
- Estimated effort: S
- Risk if ignored: PII exfiltration; unauthorized hijack of a container mid-session after the user has completed Login.gov 2FA (attacker can ride the authenticated Firefox profile).
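The HS256 check suggested above could look roughly like this. It is a stdlib-only sketch so that it stays self-contained; a maintained library such as PyJWT would normally be preferable to a hand-rolled verifier. `verify_hs256` is a hypothetical helper, and requiring an `exp` claim is an assumption about how the control plane mints its tokens.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims of a valid, unexpired HS256 JWT or raise ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise ValueError("malformed token")
    if header.get("alg") != "HS256":  # rejects "none" and other algorithms
        raise ValueError("unexpected algorithm")
    signed = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signed, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):  # constant-time compare
        raise ValueError("bad signature")
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= current:
        raise ValueError("token expired or missing exp")
    return claims
```

In a Flask app, a function registered with `before_request` could call this on the bearer token for every path except `/health` and return 401 when it raises.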
A.4.2 P0 — shell=True subprocess spawns with f-string interpolation
- Problem: Three functions build command strings via f-string and pass them to `subprocess.Popen(..., shell=True)`: `run_agent`, `run_agent_resume`, `run_agent_fresh_with_resume`. `session_id` and `agent_file` are interpolated into the shell string. Today `session_id` is a server-generated `uuid.uuid4()` (`app.py:425, 471`) and `agent_file` is a hard-coded constant, so there is no current injection path. The pattern is fragile, though: any future refactor that lets a caller pass a session_id (e.g. for test harnesses, or resuming from DynamoDB) instantly becomes RCE-as-root-in-container.
- Evidence: `fema_agent/src/web_service/agent_runner.py:40-52, 127-139, 148-161`.
- Suggested change: Replace with list form: `subprocess.Popen(["claude-full", "--session-id", session_id, "-p", prompt], shell=False, ...)`. Drop `shell=True` entirely.
- Estimated effort: S
- Risk if ignored: Latent command injection one refactor away; also makes it hard to reason about quoting of any prompt that ever contains a `"` or `$`.
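A minimal sketch of the list-form spawn suggested above. The flag names are the ones quoted in the suggested change; the real invocation may carry additional flags that were not inspected. `build_agent_argv` and `spawn_agent` are hypothetical names, and the UUID check is an extra defensive assumption rather than something the current code is known to do.

```python
import subprocess
import uuid


def build_agent_argv(session_id: str, prompt: str) -> list:
    """Build the claude-full argv as a list so that no shell ever parses it."""
    # Defence in depth: accept only a canonical UUID string, even though
    # list form already keeps shell metacharacters inert.
    if str(uuid.UUID(session_id)) != session_id.lower():
        raise ValueError("session_id must be a canonical UUID")
    return ["claude-full", "--session-id", session_id, "-p", prompt]


def spawn_agent(session_id: str, prompt: str, cwd: str) -> subprocess.Popen:
    # With a list argv, shell=False is the default: each element reaches the
    # child as exactly one argument, whatever quotes or "$" it contains.
    return subprocess.Popen(
        build_agent_argv(session_id, prompt),
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # keeps process-group termination working
    )
```

Because the prompt travels as a single argv element, a value such as `say "hi"; echo $HOME` needs no escaping and cannot start a second command.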
A.4.3 P1 — Survivor PII is interpolated into LLM prompts with no sanitization layer
- Problem: `SurvivorFemaApplication.generate_agent1_script_from_p1` / `generate_agent2_script` (called at `app.py:296, 395`) take attacker-controllable JSON and produce `as_agent1.txt` / `as_agent2.txt` files that Claude Code then reads verbatim and executes against the real gov site. There is no evidence of (a) stripping control tokens, (b) length caps, (c) detection of prompt-injection strings like `"ignore previous instructions"`, or (d) a deny-list for URLs/domains. The entire survivor JSON is also kept in `state.py` in-memory and persisted to EFS as the script files. PII (address, disability info, income, household/deceased members) passes through Bedrock as prompt text.
- Evidence: `app.py:295-296, 387-395` (generator invocation); YAML §4 enumerates the PII fields.
- Suggested change: Introduce a `survivor_sanitizer` module that (1) validates every field against a tight allow-list regex, (2) caps each field length, (3) rejects strings containing obvious injection markers, (4) logs a redacted copy only; redact PII from any stderr/stdout tails returned by `/agent_health` and `/agent*/status` (`app.py:154-167`, `agent_runner.py:302-313`). Also document the DPA/BAA status of the Bedrock route since PII leaves the VPC.
- Estimated effort: M
- Risk if ignored: Prompt injection drives Claude to submit bogus data to the real FEMA site, exfiltrate cookies, or click malicious links inside Firefox. PII exposure to any log sink that captures stdout/stderr.
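One possible shape for the sanitizer described above. Everything here is an assumption for illustration: the field names, patterns, length caps, and marker list are hypothetical, since the real survivor schema lives in `survivor_api/` and was not inspected. A deny-list of markers is only a backstop; the allow-list regex does most of the work.

```python
import re

# Hypothetical field rules: (allowed pattern, maximum length).
FIELD_RULES = {
    "first_name": (re.compile(r"[A-Za-z][A-Za-z' .-]*"), 50),
    "county": (re.compile(r"[A-Za-z][A-Za-z' .-]*"), 60),
    "zip_code": (re.compile(r"\d{5}(-\d{4})?"), 10),
}

# Illustrative markers only; a real list would need security review.
INJECTION_MARKERS = ("ignore previous instructions", "system prompt",
                     "http://", "https://")


def sanitize_survivor(data: dict) -> dict:
    """Return a validated copy of data or raise ValueError naming the field.

    Error messages carry the field name but never the value, so they can be
    logged without leaking PII.
    """
    clean = {}
    for field, value in data.items():
        if field not in FIELD_RULES:
            raise ValueError(f"unexpected field: {field}")
        pattern, max_len = FIELD_RULES[field]
        if not isinstance(value, str) or len(value) > max_len:
            raise ValueError(f"invalid type or length: {field}")
        lowered = value.lower()
        if any(marker in lowered for marker in INJECTION_MARKERS):
            raise ValueError(f"suspicious content: {field}")
        if not pattern.fullmatch(value):
            raise ValueError(f"disallowed characters: {field}")
        clean[field] = value
    return clean
```

Free-text fields, if the form has any, would not fit a tight regex and would need a different treatment, such as quoting the value inside the prompt as data rather than instructions.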
A.4.4 P1 — /agent_health hands the LLM subprocess the entire host environment
- Problem: `subprocess.run(["claude-full", "-p", prompt], env=os.environ.copy(), ...)` passes every env var (including `BEDROCK_API_KEY`, `AI_JWT_SECRET`, and AWS credentials injected by the task role) to a child that is explicitly meant to be a "quick smoke test."
- Evidence: `fema_agent/src/web_service/app.py:140-147`.
- Suggested change: Build a minimal allow-list env (`PATH`, `HOME`, `LLM_PROVIDER`, `LLM_MODEL_NAME`, `AWS_REGION`, provider key).
- Estimated effort: S
- Risk if ignored: Secret exfiltration via LLM output or future observability hooks.
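A small sketch of the allow-list environment suggested above. The variable names come from this document's environment table and the suggested change; `build_llm_env` and the provider-to-key mapping are assumptions, and the CLI may need further variables that were not verified here.

```python
import os

ALLOWED_ENV = ("PATH", "HOME", "LLM_PROVIDER", "LLM_MODEL_NAME", "AWS_REGION")

# Assumed mapping from LLM_PROVIDER value to the one key that provider needs.
PROVIDER_KEYS = {"bedrock": "BEDROCK_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def build_llm_env(environ=None) -> dict:
    """Copy only allow-listed variables, plus the active provider's key."""
    source = os.environ if environ is None else environ
    env = {name: source[name] for name in ALLOWED_ENV if name in source}
    key_name = PROVIDER_KEYS.get(source.get("LLM_PROVIDER", ""))
    if key_name and key_name in source:
        env[key_name] = source[key_name]
    return env
```

The result would be passed as `env=build_llm_env()` in place of `env=os.environ.copy()`, so values such as `AI_JWT_SECRET` never reach the child process.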
A.4.5 P2 — Test-only routes live in the production app
- Problem: `/test/test_kup`, `/test/test_kup2`, `/test/force-status` are in `app.py` and reachable in every built image with no env guard.
- Evidence: YAML §3.2 cites `app.py:624,699,796`.
- Suggested change: Gate with `if os.environ.get("FEMA_ENABLE_TEST_ROUTES") == "1":` inside `create_app`; fail-closed in prod.
- Estimated effort: S
- Risk if ignored: Unexpected state transitions from unauth traffic; widens blast radius of A.4.1.
A.5 Things That Don't Make Sense
- Observation: `_clear_firefox_profile_locks` walks the top-level dir, `break`s after the first iteration, then does a second manual two-level walk.
  - Location: `app.py:59-83`.
  - Hypotheses considered: defensive belt-and-suspenders; or the author found `os.walk` descended too far and spliced in a second pass.
  - Question for author: Would a single `glob.glob(os.path.join(firefox_profile_dir, "**/lock"), recursive=True)` plus `.parentlock` glob be equivalent and cleaner?
- Observation: `check_agent_status` returns `stderr = stdout_text` when stderr is empty.
  - Location: `agent_runner.py:305-308`.
  - Hypotheses considered: back-compat with older callers that only inspect `stderr`.
  - Question for author: Is there still any caller that only looks at `stderr`? If not, drop the aliasing, since it hides the real stream and complicates log grepping.
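For the first observation, a single-pass alternative might look like the sketch below. It assumes the intended scope is the profile root plus its immediate subdirectories, which was inferred from the description and not confirmed against `app.py`. It lists directories explicitly instead of using a recursive `**` glob, which would descend further than the current code, and it checks `is_symlink()` because on Linux Firefox's `lock` entry is typically a symlink that may be dangling. `clear_profile_locks` is a hypothetical name.

```python
from pathlib import Path

LOCK_NAMES = ("lock", ".parentlock")


def clear_profile_locks(profile_dir: str) -> list:
    """Delete lock files in profile_dir and its immediate subdirectories.

    Nothing else is touched, so cookies and storage survive.
    """
    root = Path(profile_dir)
    if not root.is_dir():
        return []
    # Depth is bounded on purpose: the root itself plus one level below it.
    directories = [root] + [p for p in root.iterdir() if p.is_dir()]
    removed = []
    for directory in directories:
        for name in LOCK_NAMES:
            path = directory / name
            # is_symlink() catches a dangling "lock" symlink that is_file()
            # would report as missing.
            if path.is_symlink() or path.is_file():
                path.unlink()
                removed.append(path)
    return removed
```

Whether this is equivalent to the existing two-pass walk depends on how deep real profiles nest their lock files, which is the open question for the author.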
A.6 Anti-Patterns Detected
A.6.1 Code-level
- God object / god function: `app.py` is 829 lines with FSM reset logic, validation, subprocess launching, lock cleanup, and ~16 routes in one `create_app`.
- Shotgun surgery
- Feature envy
- Primitive obsession
- Dead code
- Copy-paste / duplication: `DEFAULT_WORK_DIR` defined twice with the same default (`app.py:26`, `agent_runner.py:16`); agent1/agent2 status/run/restart blocks are near-mirrors (`app.py:418-502`).
- Magic numbers / unexplained constants
- Deep nesting (>3 levels)
- Long parameter lists (>4)
- Boolean-flag parameters that change behavior
A.6.2 Architectural
- Big ball of mud
- Distributed monolith
- Chatty services
- Leaky abstraction / inappropriate intimacy between layers
- Golden hammer
- Vendor lock-in without exit strategy
- Stovepipe / reinvented wheel
- Missing seams for testing: subprocess spawning, `os.environ` reads, `time.time()`, and `os.walk` are all called directly with no injection point.
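To illustrate what a testing seam could look like for the inactivity check, here is a sketch with an injectable clock. `InactivityMonitor` is a hypothetical class, not something in `agent_runner.py`; the 600-second default mirrors the inactivity timeout cited in the scorecard, and the `max(start_time, latest_mtime)` rule follows the description in A.3.

```python
import time
from typing import Callable, Optional


class InactivityMonitor:
    """Decides whether an agent has been idle too long.

    The clock is injected so tests can control time instead of sleeping.
    """

    def __init__(self, start_time: float, timeout_seconds: float = 600.0,
                 clock: Callable[[], float] = time.time):
        self.start_time = start_time
        self.timeout_seconds = timeout_seconds
        self.clock = clock

    def is_inactive(self, latest_mtime: Optional[float]) -> bool:
        # Measure from the later of spawn time and last file activity, so
        # stale files left by an earlier run cannot trigger a kill.
        last_activity = max(self.start_time, latest_mtime or 0.0)
        return self.clock() - last_activity > self.timeout_seconds
```

A test can then pass `clock=lambda: 1700.0` and assert the outcome directly; the same approach extends to a process launcher object that a test replaces with a fake.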
A.6.3 Data
- God table / EAV / missing indexes / N+1 / unbounded growth / nullable-everything / shared DB — N/A (not enough visibility into DynamoDB schemas from code read).
A.6.4 Async / Ops
- Poison messages with no dead-letter queue — N/A (no queue).
- Retry storms / no backoff — mitigated by retry_count ceiling.
- Missing idempotency keys on non-idempotent ops: `/agent*/run` guarded by FSM state, effectively a per-container idempotency key.
- Hidden coupling via shared state: Flask `state.py` in-memory dict is the single source of truth; any Flask worker count >1 would silently corrupt it. Single-worker assumption is not asserted.
- Work queues without visibility / depth metrics
A.6.5 Security
- Secrets in code, `.env` committed, or logs: stdout/stderr tails returned by `/agent_health` (`app.py:154-167`) and `check_agent_status` (`agent_runner.py:302-313`) can leak env-loaded secrets or PII echoed by the LLM.
- Missing authn/z on internal endpoints: every route in `app.py` is unauthenticated (see A.4.1).
- Overbroad IAM roles / least-privilege violations: not reviewed (no CDK inspection).
- Unvalidated input crossing a trust boundary: survivor JSON is key-allow-listed but field-level values are fed into a script file that Claude then executes (prompt injection surface; see A.4.3).
- PII/PHI in logs or error messages: in-memory `survivor_data`, EFS-persisted `as_agent1.txt` / `as_agent2.txt`, and process stderr returned via HTTP all carry PII. No redaction layer found.
- Missing CSRF / XSS / SQLi / SSRF protections: Flask JSON API so CSRF N/A; no SQL; SSRF implicit in Firefox surface.
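A redaction pass over stdout/stderr tails, of the kind the first and fifth bullets call for, might start from something like this. It is deliberately simple and only a sketch: `redact_tail` is a hypothetical helper, the SSN pattern is one example of a PII shape, and real coverage (addresses, names, dates of birth) would need far more than regexes.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_tail(text: str, secrets: list, max_chars: int = 2000) -> str:
    """Mask known secret values and SSN-shaped strings, then keep the tail."""
    for secret in secrets:
        if len(secret) >= 8:  # skip short values that would over-match
            text = text.replace(secret, "[REDACTED]")
    text = SSN_PATTERN.sub("[REDACTED-SSN]", text)
    # Truncate after redaction so a secret cannot be cut in half and slip by.
    return text[-max_chars:]
```

The `secrets` list could be built from the values of sensitive environment variables at startup, with the unredacted output written only to an access-controlled internal log.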
A.6.6 Detected Instances
| # | Anti-pattern | Location (file:line) | Severity (P0/P1/P2) | Recommendation |
|---|---|---|---|---|
| 1 | God function (create_app) | fema_agent/src/web_service/app.py:86-820 | P2 | Split into routes/health.py, routes/survivor.py, routes/agents.py blueprints. |
| 2 | Duplicated DEFAULT_WORK_DIR | app.py:26, agent_runner.py:16 | P2 | Extract to web_service/config.py. |
| 3 | Near-duplicate agent1/agent2 route handlers | app.py:418-502 | P2 | Parameterize on agent_key like _handle_restart already does at :522-549. |
| 4 | Missing seams for testing | agent_runner.py:45-52, 132-139, 156-161, 241-283 | P2 | Inject a ProcessLauncher + Clock. |
| 5 | Single-process in-memory FSM with no worker guard | state.py (module-level dict) | P1 | Assert workers == 1 at startup or move state out of process. |
| 6 | Stdout/stderr tails leaked in HTTP responses | app.py:154-167, agent_runner.py:302-313 | P1 | Redact before return; log full body internally only. |
| 7 | Unauth endpoints on PII-handling data plane | app.py:110,114,214,239,333,418,460,560,564 | P0 | JWT before_request hook. |
| 8 | Unvalidated survivor JSON values interpolated into LLM prompts | app.py:295-296, 387-395 (via SurvivorFemaApplication) | P0 | Sanitizer + length caps + prompt-injection deny list. |
| 9 | shell=True + f-string interpolation in Popen | agent_runner.py:40-52, 127-139, 148-161 | P0 (latent) | Use list form, drop shell=True. |
| 10 | env=os.environ.copy() passed to LLM subprocess | app.py:146 | P1 | Allow-list env. |
| 11 | Test routes reachable in prod image | app.py:624,699,796 (per YAML) | P2 | Env-gate. |
A.7 Open Questions
- Q: Is there any authentication at the container's Flask layer, or is the YAML's "VPC isolation + CloudFront signed URLs" the only control? If the latter, what stops an in-VPC service from posting directly?
  - Blocks: A.4.1, A.11.
  - Who can answer: Gordon / platform-sec.
- Q: Have the `agent1.md` / `agent2.md` prompts been reviewed for prompt-injection resilience against untrusted survivor input?
  - Blocks: A.4.3.
  - Who can answer: Gordon / AI safety reviewer.
- Q: Does the Bedrock path have a DPA/BAA in place for the PII being sent? FEMA IA data includes income, disability, deceased persons.
  - Blocks: A.16 (if compliance is claimed).
  - Who can answer: legal / compliance.
- Q: Is Flask run with `workers=1`? If not, `state.py`'s in-memory dict is unsafe.
  - Blocks: A.6.4.
  - Who can answer: deployment doc / Dockerfile CMD.
A.8 Difficulties Encountered
- Difficulty: `agent1.md` / `agent2.md` prompt templates live under `container_work_dir/fema-apply-agent/` (per YAML) and were not read as part of this review.
  - Impact on analysis: Cannot concretely grade prompt-injection resilience. The "PII into prompts" finding (A.4.3) is inferred from the call sites, not from the prompt text itself.
  - Fix that would help next reviewer: Commit a sanitized sample prompt to the repo root or link from README.
- Difficulty: CDK stacks under `aws_deployment/` were not opened; security posture at the edge is taken on faith from the YAML.
  - Impact on analysis: Could not verify CloudFront signed-URL enforcement, mTLS, IAM least-privilege, or SG posture.
  - Fix that would help next reviewer: Short `aws_deployment/README.md` per stack would shortcut this.
- Difficulty: No coverage or flake numbers; test directory count alone is a weak signal.
  - Impact on analysis: A.13 is mostly empty.
  - Fix that would help next reviewer: `pytest --cov` badge or a `coverage.xml` artifact.
A.9 Risks & Unknowns
A.9.1 Known risks
| # | Risk | Likelihood (L/M/H) | Impact (L/M/H) | Mitigation |
|---|---|---|---|---|
| 1 | Unauth Flask plane reachable from any VPC workload | M | H | JWT before_request; SG lockdown to API GW only. |
| 2 | Prompt injection via survivor JSON field values | M | H | Sanitizer + deny list; prompt design that quarantines user data. |
| 3 | DOM drift / CAPTCHA on real disasterassistance.gov | H | M | Runbook exists; add monitoring on agent retry count. |
| 4 | Login.gov session expiry mid-Agent-2 run | M | M | Detect 401/redirect to Login.gov; surface to user. |
| 5 | Multi-worker Flask corrupts in-memory FSM | L | H | Assert workers=1 or move state. |
| 6 | Secrets/PII leak via stderr tails in HTTP responses | M | H | Redact before return. |
| 7 | shell=True becomes injectable after a future refactor | L | H | Switch to list form now. |
A.9.2 Unknown unknowns
- Area not reviewed: `agent1.md` / `agent2.md` prompt bodies. Reason: not in the paths I opened. Best guess at risk level: High, since this is where prompt-injection defense either lives or doesn't.
- Area not reviewed: `aws_deployment/` CDK stacks. Reason: out of scope for time budget. Best guess at risk level: Medium. Standard CDK patterns are usually OK, but IAM scoping and CloudFront signed-URL enforcement need verification.
- Area not reviewed: `survivor_api/` field-level validators. Reason: only inspected the call site in `app.py`. Best guess at risk level: Medium, because the sanitization verdict in A.4.3 hinges on what this module does or doesn't do.
- Area not reviewed: DynamoDB `ContainerInstance` + `AuditEvent` schemas. Reason: in sibling container_manager service. Best guess at risk level: Low-Medium.
- Area not reviewed: The 874-test suite content. Reason: time. Best guess at risk level: Low (tests existing is a strong signal).
A.10 Technical Debt Register
| # | Debt item | Quadrant | Estimated interest | Remediation |
|---|---|---|---|---|
| 1 | Unauth Flask data plane | Reckless & Deliberate | High (security incidents) | JWT before_request (S). |
| 2 | shell=True + f-string Popen | Prudent & Inadvertent | Medium (latent) | Switch to list form (S). |
| 3 | PII into LLM prompts with no sanitizer | Reckless & Inadvertent | High (compliance + injection) | Sanitizer layer + redaction (M). |
| 4 | env=os.environ.copy() for LLM subprocess | Reckless & Inadvertent | Medium | Allow-list env (S). |
| 5 | create_app is 700+ lines | Prudent & Deliberate | Low | Blueprint split (M). |
| 6 | Duplicated DEFAULT_WORK_DIR | Prudent & Inadvertent | Low | Extract config module (S). |
| 7 | In-memory FSM with no worker-count assertion | Reckless & Inadvertent | Medium | Assert workers=1 at startup (S). |
| 8 | Test routes in prod image | Prudent & Deliberate | Low-Medium | Env gate (S). |
| 9 | Stdout/stderr tails leaked over HTTP | Reckless & Inadvertent | Medium-High | Redaction layer (S). |
A.11 Security Posture (lightweight STRIDE)
| Category | Threat present? | Mitigated? | Gap |
|---|---|---|---|
| Spoofing (identity) | Yes — anyone in VPC can POST as "the control plane" | Partial (only at edge, per YAML) | No app-layer auth (A.4.1). |
| Tampering (integrity) | Yes — attacker can POST /survivor-info/p2 to mutate in-flight state | No | Same root cause as spoofing. |
| Repudiation | Partial — DynamoDB AuditEvent exists per YAML | Unknown | Not verified end-to-end; no signed audit log seen in app.py. |
| Information Disclosure | Yes — stdout/stderr tails, in-memory PII, EFS-persisted scripts, Bedrock prompt payloads | Weak | Needs redaction + DPA check (A.4.3, A.4.4). |
| Denial of Service | Yes — /agent_health spawns a subprocess per call, no rate limit | Partial (409 if agent running) | Add rate limit / auth. |
| Elevation of Privilege | Yes — latent via shell=True if session_id ever becomes tainted; compromised agent runs with container role IAM | Partial | List-form Popen + IAM scoping review. |
A.12 Operational Readiness
| Capability | Present / Partial / Missing | Notes |
|---|---|---|
| Structured logs | Partial | logger = logging.getLogger(__name__) in agent_runner.py but app.py uses print(..., file=sys.stderr) (:302-305). |
| Metrics | Unknown | Not visible in code read. |
| Distributed tracing | Missing | No OTel imports. |
| Actionable alerts | Unknown | Presumed in aws_deployment/observability stack. |
| Runbooks | Present | runbook.md + RUN_BOOK.md. |
| On-call ownership defined | Unknown | Single author per git (Gordon). |
| SLOs / SLIs | Missing | Not documented. |
| Backup & restore tested | Unknown | EFS is the only stateful store; snapshot policy not verified. |
| Disaster recovery plan | Unknown | Not seen. |
| Chaos / failure testing | Missing | No evidence. |
A.13 Test & Quality Signals
- Coverage (line / branch): N/A — not reported.
- Trend: N/A.
- Flake rate: N/A.
- Slowest tests: N/A.
- Untested critical paths: Unknown; likely: prompt-injection robustness, multi-worker FSM safety, real-site DOM drift.
- Missing test types: [ ] unit (present per YAML) [ ] integration (present) [ ] e2e (`run_e2e_perf_test.sh` present) [ ] contract (present) [x] load [x] security/fuzz.
A.14 Performance & Cost Smells
- Hot paths: `/agent*/status` polled by control plane.
- Suspected bottlenecks: Cold Firefox start + Claude Code CLI boot per container.
- Wasteful queries / loops: `get_latest_temp_mtime` walks the full `temp/` tree on every inactivity check (`agent_runner.py:187-207`); probably fine at current sizes.
- Oversized infra / idle resources: Fargate per-survivor is inherently spiky; without TTL enforcement (`NOVNC_HEARTBEAT_SECONDS` only keeps a connection open, it doesn't kill idle tasks) cost could drift.
- Cache hit/miss surprises: N/A.
A.15 Bus-Factor & Knowledge Risk
- Who is the only person who understands X? Gordon (sole `authors:` entry in YAML, `gordon.zhg@gmail.com` from two git identities).
- What breaks if they disappear tomorrow? Real-site DOM fixes, Login.gov handoff tuning, prompt engineering for `agent1.md` / `agent2.md`.
- What is undocumented tribal knowledge? Why `--resume` instead of `--continue` (partially captured in docstring at `agent_runner.py:123-125`); the per-section timeouts (100s/300s) rationale.
- Suggested knowledge-transfer actions: Pair-review with a second engineer on the prompt files; ADR for the Agent 1/Agent 2 split and Login.gov handoff.
A.16 Compliance Gaps
N/A — the prop-build doc does not explicitly claim HIPAA/SOC 2/PCI compliance. That said, if FEMA IA data is being processed, a reasonable auditor would ask about: (a) BAA/DPA with AWS Bedrock, (b) PII retention in EFS-backed Firefox profiles, (c) access control to the unauth Flask plane, (d) audit log integrity in DynamoDB. These are flagged here even without an explicit claim, because the data class (federal benefits PII including disability and deceased persons) would typically trigger review.
A.17 Recommendations Summary
| Priority | Action | Owner (suggested) | Effort | Depends on |
|---|---|---|---|---|
| P0 | Add JWT before_request auth to every Flask route except /health; env-gate /test/* | Gordon | S | AI_JWT_SECRET already exists |
| P0 | Build a survivor-data sanitizer + prompt-injection deny list + length caps; wire into generate_agent1_script_from_p1 and generate_agent2_script | Gordon + AI-safety reviewer | M | Read of agent1.md/agent2.md |
| P0 | Replace shell=True + f-string Popen calls in agent_runner.py with list form | Gordon | S | — |
| P0 | Redact PII/secrets from stdout/stderr tails before returning in /agent_health and /agent*/status | Gordon | S | — |
| P1 | Build an allow-list env for claude-full subprocess in /agent_health (and future spawns) | Gordon | S | — |
| P1 | Assert workers=1 at Flask startup or document the single-worker requirement and Dockerfile CMD | Gordon | S | — |
| P1 | Document / verify Bedrock DPA coverage for FEMA IA PII | Compliance | S | legal |
| P1 | Read + security-review container_work_dir/fema-apply-agent/agent1.md and agent2.md | AI-safety reviewer | M | — |
| P2 | Split create_app into Flask blueprints; extract DEFAULT_WORK_DIR to config.py | Gordon | M | — |
| P2 | Parameterize agent1/agent2 route handlers like _handle_restart already does | Gordon | S | — |
| P2 | Inject ProcessLauncher + Clock seams into agent_runner.py for testability | Gordon | M | — |
| P2 | Add structured logging (swap print(..., file=sys.stderr) for logger.*) | Gordon | S | — |
Environment variables
| Name | Purpose |
|---|---|
| LLM_PROVIDER | `bedrock` or `anthropic` |
| LLM_MODEL_NAME | Friendly alias |
| BEDROCK_API_KEY* | Bedrock auth |
| ANTHROPIC_API_KEY | Direct Anthropic API alt |
| AWS_REGION | Bedrock region |
| WORKSPACE_DIR | EFS mount point |
| FIREFOX_PROFILE_DIR | Persisted Firefox profile |
| FIREFOX_CACHE_DIR | Persisted Firefox cache |
| NOVNC_HEARTBEAT_SECONDS | Idle keepalive for CloudFront |
