Docker image based on ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest (Ubuntu 22.04 + X11 + noVNC). Adds Node.js 20, Claude Code CLI v2.1.45, Flask 3.0, MCP servers (chrome-devtools, mcp-vnc, playwright), and a fema_agent Python package implementing a survivor data parser, agent script generator, and an 8-state Flask state machine that spawns claude-full subprocesses for Agent 1 (pre-login) and Agent 2 (post-login). Browser is Chrome with remote debugging on 9222; profile + cookies persisted in EFS so the Login.gov session survives between Agent 1 and Agent 2.
Role in the system: Spawned per-survivor by af-backend-go-api (control plane in af-infra Lambda); exposed via Flask data-plane on :5001 and noVNC on :6080; container destroyed post-completion
Surfaces:
- Flask data-plane HTTP API on :5001 (10 documented endpoints)
- noVNC web on :6080 (vnc.html + vnc_embed.html for iframe)
- VNC server on :5900
- Mock FEMA HTTP server on :8020 (dev/training)
- Chrome DevTools Protocol on :9222
- Claude Code CLI as the agent runtime (claude-full wrapper)
- .claude/ MCP server config + agents/commands
User workflows
| Step | Outcome |
|---|---|
| Build image | Image ready |
| Start container | /health returns ok |
| Submit survivor | Container ready for Agent 1 |
| Run Agent 1 | State → AGENT1_COMPLETE / WAITING_FOR_LOGIN |
| Login.gov via VNC | Survivor authenticated; ready for Agent 2 |
| Run Agent 2 | State → AGENT2_COMPLETE; survivor reviews page 37 manually |
API endpoints
- GET /health: Liveness
- POST /agent_health: LLM connectivity smoke test
- GET /state: Current orchestration state + stale-agent flags
- POST /survivor-info: Submit full survivor JSON, generate scripts
- POST /survivor-info/p1: Partial submission (pages 1-16 only)
- POST /survivor-info/p2: Partial submission (pages 20-36 only)
- POST /agent1/run: Spawn Agent 1 subprocess
- GET /agent1/status: Poll Agent 1 subprocess
- POST /agent2/run: Spawn Agent 2 subprocess
- GET /agent2/status: Poll Agent 2 subprocess
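As an illustration of how a control-plane caller might drive these endpoints through the Agent 1 phase, here is a minimal sketch. It is not taken from af-backend-go-api. The transport function call(method, path, body), the "state" key in the GET /state response, and the helper names are all assumptions made for the example; only the paths and the two target state names come from this document.

```python
import time
from typing import Callable, Optional, Set

# Hypothetical transport: call(method, path, body) -> parsed JSON dict.
Call = Callable[[str, str, Optional[dict]], dict]


def wait_for_state(call: Call, targets: Set[str], timeout_s: float = 60.0,
                   interval_s: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep,
                   now: Callable[[], float] = time.monotonic) -> str:
    """Poll GET /state until the reported state is one of `targets`."""
    deadline = now() + timeout_s
    while True:
        # Assumption: the response body exposes the FSM state under "state".
        state = call("GET", "/state", None).get("state")
        if state in targets:
            return state
        if now() >= deadline:
            raise TimeoutError(f"state {state!r} not in {sorted(targets)}")
        sleep(interval_s)


def run_agent1(call: Call, survivor: dict, **poll_kwargs) -> str:
    """Submit the survivor, spawn Agent 1, and wait for the login gate."""
    call("POST", "/survivor-info", survivor)
    call("POST", "/agent1/run", None)
    return wait_for_state(
        call, {"AGENT1_COMPLETE", "WAITING_FOR_LOGIN"}, **poll_kwargs
    )
```

Injecting the transport, sleep, and clock keeps the sketch testable without a running container; a real caller would wrap an HTTP client and add whatever credentials the deployment requires.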
Third-party APIs
| Name | Purpose |
|---|---|
| Anthropic Bedrock (or direct Claude API) | LLM backend for Claude Code CLI |
| DisasterAssistance.gov | Real FEMA application portal (target) |
| Login.gov | 2FA gate between Agent 1 and Agent 2 |
| Chrome DevTools Protocol | Browser automation surface |
Service dependencies
| Name | Purpose |
|---|---|
| AWS ECS Fargate | Per-survivor task launch |
| AWS EFS | Persisted browser profile + work dir |
| AWS Secrets Manager | Bedrock API key, VNC signing key, JWT secret |
| AWS DynamoDB | Container state + audit log |
| AWS CloudWatch Logs | Audit trail |
| af-backend-go-api | Control plane caller (start/stop/poll) |
| af-infra (CDK or Terraform) | ECS task def, EFS, Secrets Manager, ALB, CloudFront |
Analysis
af-disaster-assistance-gov-agent — Prop-Build Analysis
Document Type: Critical Review & Analysis (companion to prop-build-template.md)
Scope: Per-Repo / Per-Module
Subject: af-disaster-assistance-gov-agent (DisasterAssistance.gov AI Automation Agent)
Reviewer(s): Claude (automated code review)
Date: 2026-04-09
Version: 0.1
Confidence Level: Medium
What would raise confidence: Running the container locally end-to-end against mock FEMA; access to CloudWatch metrics and ECS task telemetry; interview with Gordon Zheng on two-agent split rationale; execution of pytest + e2e perf harness.
Inputs Reviewed:
- Prop-build doc: /Users/andres/src/af/af-analysis/data/af-disaster-assistance-gov-agent.yaml
- Companion docs: /Users/andres/src/af/af-analysis/data/af-disaster-assistance-gov-agent/{api-examples,data-flow,runbook,deployment}.md
- Source tree: /Users/andres/src/af/af-disaster-assistance-gov-agent/ (fema_agent/src/web_service/{app.py:1271, state.py:388, agent_runner.py:317})
- Dashboards / metrics: not accessed
- ADRs / design docs: specs/001..020 directories (listing only)
- Interviews: none
A.1 Executive Summary
- Overall health: Functional and cleanly split into a Flask data-plane, FSM, agent runner, and survivor parser; architecture is coherent for a per-survivor ephemeral container, but the HTTP surface is oversized and unauthenticated at the app layer.
- Top risk: All 22 Flask endpoints, including destructive and PII-accepting routes, have auth: "none" and rely entirely on network isolation + CloudFront signed URLs; a single misconfiguration in ALB/SG removes all authz (see A.6.5, A.11).
- Top win / thing worth preserving: Sentinel-file completion pattern (temp/agent*.completed) decouples "agent finished successfully" from "subprocess exited cleanly", which makes it robust against unclean Claude Code exits (fema_agent/src/web_service/agent_runner.py).
- Single recommended next action: Gate the Flask data-plane behind a shared-secret header or mTLS verified by the container, even behind the ALB, so app-layer authz exists as defense-in-depth.
- Blocking unknowns: Actual test coverage %, CI pass rate, production error rates, concurrent-task ceiling, and whether specs/019-stuck-agent-watcher is deployed.
A.2 Health Scorecard
| # | Dimension | Score (1–5) | Justification |
|---|---|---|---|
| 1 | Module overview / clarity of intent | 4 | YAML + README articulate per-survivor container and two-agent split clearly; purpose unambiguous. |
| 2 | External dependencies | 3 | Well-enumerated but heavy: Bedrock, Login.gov, DisasterAssistance.gov DOM, 3 MCP servers, Chrome CDP — many failure surfaces. |
| 3 | API endpoints | 2 | 22 routes defined in app.py vs 10 documented; test/internal routes (/test/*, /agent*/restart, /test/force-status) ship in the same binary as prod (app.py:707–838). |
| 4 | Database schema | 3 | No local DB; state is in-memory + EFS + DynamoDB (control plane). Reasonable for ephemeral container. |
| 5 | Backend services | 4 | Clean separation: state.py FSM, agent_runner.py subprocess orchestration, survivor_api/ parsing. |
| 6 | WebSocket / real-time | 3 | noVNC proxy inherited from base image; heartbeat added — adequate. |
| 7 | Frontend components | 3 | Minimal — relies on noVNC; vnc_embed.html iframe variant is pragmatic. |
| 8 | Data flow clarity | 4 | data-flow.md plus YAML end-to-end block trace the request path explicitly. |
| 9 | Error handling & resilience | 3 | Good abort codes and sentinel fallback; no circuit breakers; stale-process handling exists but relies on operator retry. |
| 10 | Configuration | 3 | Env-var driven, documented; no feature-flag framework; some magic defaults (600s watchdog). |
| 11 | Data refresh patterns | 3 | On-demand only; appropriate for this runtime model. |
| 12 | Performance | 2 | Targets TBD; 1 vCPU-bound container per survivor implies linear cost scaling; no load numbers captured. |
| 13 | Module interactions | 4 | Explicitly documented with af-backend-go-api, Bedrock, EFS, DynamoDB. |
| 14 | Troubleshooting / runbooks | 4 | runbook.md covers top 6 failure modes. |
| 15 | Testing & QA | 2 | pytest referenced but coverage unknown/null in YAML; e2e is a bash script run_e2e_perf_test.sh; no contract tests executed in CI are verified. |
| 16 | Deployment & DevOps | 3 | GitHub Actions → ECR → CDK; rollback documented; no blue/green specifics. |
| 17 | Security & compliance | 2 | auth: "none" on 22 routes; survivor PII (SSN, DOB) in-memory + EFS; sensitive Login.gov cookies on EFS; STRIDE coverage incomplete (see A.11). |
| 18 | Documentation & maintenance | 4 | Strong README, AGENTS.md, 20 specs directories, runbook, api-examples. |
| 19 | Roadmap clarity | 3 | specs/001–020 imply active phased work but phase/owner mapping not captured here. |
Overall score: 3.1 — architecture and docs are above average; security posture and test signal are the drag.
A.3 What's Working Well
- Strength: Sentinel-file completion signal decoupled from subprocess exit code.
  - Location: fema_agent/src/web_service/agent_runner.py:23-91 and dual-location poll.
  - Why it works: Claude Code CLI subprocesses can exit uncleanly; relying on exit code would produce false negatives. The sentinel reifies "agent wrote its completion marker," a stronger invariant.
  - Propagate to: af-fema-real-ai-agent sister repo; any future agent-runtime containers.
- Strength: Two-agent split around the Login.gov human-in-the-loop boundary.
  - Location: State machine fema_agent/src/web_service/state.py:8-16 plus script generator in fema_agent/src/raw_builder/ + intermediate_builder/.
  - Why it works: Makes the unavoidable human 2FA gate an explicit state (WAITING_FOR_LOGIN) rather than a hidden pause, enabling the frontend to render the VNC iframe at the right moment.
  - Propagate to: Any other gov-portal automation repos encountering Login.gov / ID.me.
- Strength: Clear module decomposition under fema_agent/src/ (web_service, survivor_api, raw_builder, intermediate_builder, enums).
  - Location: fema_agent/src/ package layout.
  - Why it works: Parser, builder, and runtime concerns are separable; supports unit testing of pure-data layers without Flask/subprocess stack.
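The sentinel pattern described above can be sketched as follows. This is an illustration only, not the code in agent_runner.py: the function names, the per-agent file name (agent1.completed, inferred from the temp/agent*.completed glob), and the rule that a clean exit without a sentinel counts as a failure are assumptions.

```python
from pathlib import Path
from typing import Iterable, Optional


def agent_completed(sentinel_dirs: Iterable[Path], agent: str) -> bool:
    """True if the completion marker exists in any candidate directory.

    Checking more than one directory mirrors the dual-location poll
    (for example a live temp/ dir and an archive dir).
    """
    name = f"{agent}.completed"  # assumed naming, e.g. temp/agent1.completed
    return any((d / name).is_file() for d in sentinel_dirs)


def classify_run(exit_code: Optional[int], completed: bool) -> str:
    """Derive a run status where the sentinel outranks the exit code."""
    if completed:
        return "complete"   # marker written, even if the process exited uncleanly
    if exit_code is None:
        return "running"    # process still alive, no marker yet
    return "failed"         # process gone without writing the marker
```

The key property is that a non-zero exit code never overrides a written marker, which is what makes the signal robust against unclean CLI exits.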
A.4 What to Improve
A.4.1 P0 — Add app-layer authentication to the Flask data-plane
- Problem: Every documented and undocumented route is auth: "none", relying solely on SG + CloudFront signed URLs.
- Evidence: YAML section_3_api.endpoints[*].auth = "none"; fema_agent/src/web_service/app.py:188–838 shows no @requires_auth decorator on any route.
- Suggested change: Add a shared-secret header (HS256 JWT or per-task bearer token via Secrets Manager) in a Flask before_request hook.
- Estimated effort: M
- Risk if ignored: One SG misconfig → full PII breach + /test/force-status allows arbitrary state transitions.
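A minimal sketch of the per-task bearer-token variant of this suggestion, written as a framework-free function so it can be unit tested on its own. The function name, the OPEN_PATHS exemption for /health, and the choice of 401 for both failure cases are assumptions, not existing code in app.py.

```python
import hmac
from typing import Mapping, Optional, Tuple

# Assumption: the liveness probe stays unauthenticated for the load balancer.
OPEN_PATHS = frozenset({"/health"})


def check_bearer(path: str, headers: Mapping[str, str],
                 expected_token: str) -> Optional[Tuple[str, int]]:
    """Return None to allow the request, or (message, status) to reject it."""
    if path in OPEN_PATHS:
        return None
    scheme, _, presented = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return ("missing bearer token", 401)
    # Constant-time comparison avoids leaking the token via timing.
    if not hmac.compare_digest(presented.encode(), expected_token.encode()):
        return ("invalid bearer token", 401)
    return None
```

In Flask, a before_request function that returns a non-None value short-circuits the request, so a hook could return this function's result directly, passing request.path, request.headers, and a token loaded once at startup. Note that a plain dict, unlike Flask's header object, is case-sensitive on the header name.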
A.4.2 P1 — Split test/internal routes out of the production Flask app
- Problem: /test/agent-auth, /test/agent-complete, /test/force-status, /test/setup-mock, /test/e2e-mock, /agent1/restart, /agent2/restart, /agent*/status_detailed, and /setup-fema ship in the same binary as prod.
- Evidence: fema_agent/src/web_service/app.py:707–838; YAML notes "~12 additional internal/test-only endpoints."
- Suggested change: Gate behind ENABLE_TEST_ROUTES env var defaulted off, or separate Blueprint loaded only in dev/stg.
- Estimated effort: S
- Risk if ignored: Prod blast radius includes arbitrary state manipulation.
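The env-var gate could look roughly like the sketch below. The helper names and the set of accepted truthy strings are assumptions; only ENABLE_TEST_ROUTES and the /test/ prefix come from this review. In an actual Flask app the same boolean would decide whether a test Blueprint is registered at all.

```python
from typing import Iterable, List, Mapping

# Prefixes treated as test-only. Restart and setup routes could be added here.
TEST_ONLY_PREFIXES = ("/test/",)


def is_test_routes_enabled(environ: Mapping[str, str]) -> bool:
    """Default off: only an explicit truthy value enables test routes."""
    value = environ.get("ENABLE_TEST_ROUTES", "")
    return value.strip().lower() in {"1", "true", "yes"}


def routes_to_register(all_routes: Iterable[str],
                       environ: Mapping[str, str]) -> List[str]:
    """Filter out test-only routes unless the gate is explicitly open."""
    routes = list(all_routes)
    if is_test_routes_enabled(environ):
        return routes
    return [r for r in routes if not r.startswith(TEST_ONLY_PREFIXES)]
```

Defaulting to off means a missing or misspelled variable in a production task definition fails closed.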
A.4.3 P1 — Capture and publish test coverage
- Problem: YAML lists coverage_pct: null; unknown whether fema_agent/tests/ actually exercises FSM, parser validators, and stale-detection.
- Evidence: section_15_testing.unit.coverage_pct: null.
- Suggested change: Wire pytest --cov=fema_agent --cov-report=xml into CI; fail under threshold.
- Estimated effort: S
- Risk if ignored: FSM regressions invisible until production.
A.4.4 P2 — Replace print(..., file=sys.stderr) with structured logging
- Evidence: fema_agent/src/web_service/app.py:357-360, 472-474.
- Suggested change: Python logging with JSON formatter; include request/survivor correlation id.
- Estimated effort: S
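One way to realize this with only the standard library is sketched below. The logger name, the field names in the JSON payload, and the use of the extra= mechanism to carry the correlation id are illustrative choices, not the repo's existing conventions.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.error(..., extra={"correlation_id": ...}).
            "correlation_id": getattr(record, "correlation_id", None),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "fema_agent.web_service", stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines (to stderr when stream is None)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as log.error("script generation failed", extra={"correlation_id": "req-123"}) would then replace a print(..., file=sys.stderr) site. Messages should carry identifiers rather than survivor field values, in line with the PII findings in A.6.6.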
A.5 Things That Don't Make Sense
- Observation: Two agents spawn claude-full subprocesses, but the state machine defines both production (9) and test-only (6) states in the same enum.
  - Location: fema_agent/src/web_service/state.py:8-49.
  - Question for author: Why not a separate TestState enum so the prod FSM's valid-transition matrix stays minimal?
- Observation: Sentinel file is polled at two locations (pre-archive and post-archive).
  - Location: fema_agent/src/web_service/agent_runner.py:94+.
  - Question for author: Is the polling order deterministic by design, or is the second location genuinely defensive against a concurrent archive?
A.6 Anti-Patterns Detected
A.6.1 Code-level
- God object / god function: app.py:164–838 (1271 LOC file, single create_app)
- Shotgun surgery
- Feature envy
- Primitive obsession
- Dead code
- Copy-paste / duplication: /survivor-info, /p1, /p2 near-identical handlers
- Magic numbers: AGENT_INACTIVITY_TIMEOUT_SECONDS=600, NOVNC_HEARTBEAT_SECONDS=30
- Deep nesting
- Long parameter lists
- Boolean-flag parameters
A.6.2 Architectural
- Big ball of mud
- Distributed monolith
- Chatty services
- Leaky abstraction: Flask state dict holds Popen handles alongside survivor JSON
- Golden hammer
- Vendor lock-in: Bedrock + Claude Code CLI + chrome-devtools-mcp, no abstraction layer
- Stovepipe
- Missing seams for testing: hard-coded filesystem paths, direct subprocess.Popen, direct Chrome CDP URLs
A.6.3 Data
N/A — no persistent DB in this repo.
A.6.4 Async / Ops
- Poison messages
- Retry storms
- Missing idempotency keys
- Hidden coupling via shared state: global Flask _state mutated by multiple routes + polling thread
- Work queues without visibility
A.6.5 Security
- Secrets in code / .env committed
- Missing authn/z on internal endpoints: 22 routes, zero auth decorators
- Overbroad IAM roles
- Unvalidated input crossing trust boundary: /test/force-status accepts arbitrary state strings
- PII/PHI in logs or error messages (suspected): validation errors embed field values into 400 descriptions
A.6.6 Detected Instances
| # | Anti-pattern | Location (file:line) | Severity | Recommendation |
|---|---|---|---|---|
| 1 | God object / god function | app.py:164–838 (1271 LOC, single create_app, 22 routes) | P1 | Split into Blueprints: health_bp, survivor_bp, agent_bp, test_bp. |
| 2 | Copy-paste / duplication | app.py:303–590 (/survivor-info, /p1, /p2 share near-identical validation + env-setup) | P1 | Extract _ingest_survivor(payload, variant) helper. |
| 3 | Magic numbers | section_10_configuration.thresholds — 600s, 30s defaults | P2 | Centralize in config.py with rationale. |
| 4 | Leaky abstraction | state.py:76-104 — Flask state dict holds Popen handles + survivor JSON + state enum | P1 | Wrap in AgentProcess class. |
| 5 | Vendor lock-in | Bedrock + Claude Code CLI + chrome-devtools-mcp, no abstraction | P2 | Document exit cost; keep LLM_PROVIDER switch exercised. |
| 6 | Missing testing seams | app.py:44-57, agent_runner.py — hard-coded paths, direct subprocess | P1 | Inject Clock, FileSystem, ProcessSpawner. |
| 7 | Hidden coupling | Global _state mutated by multiple routes + polling thread | P1 | Serialize mutations behind a locked StateStore. |
| 8 | Missing authn/z | app.py:188–838 — 22 routes, zero auth | P0 | A.4.1. |
| 9 | Unvalidated input | /test/force-status (app.py:763-772) accepts arbitrary state strings | P0 | Disable in prod (A.4.2); validate enum. |
| 10 | PII in error messages | app.py:383,388,422,427,432,438,498,502,520,525,530,583,586 — validation errors embed field values | P1 | Scrub PII from 400 descriptions. |
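To make row 7's recommendation concrete, here is a minimal sketch of a locked StateStore. The class shape and method names are assumptions. It also assumes that Popen handles have been moved out of the dict (for example into the AgentProcess wrapper from row 4), because snapshot() deep-copies the data and process handles are not copyable.

```python
import threading
from copy import deepcopy
from typing import Any, Callable, Dict, Optional


class StateStore:
    """Serialize all reads and writes of shared state behind one lock."""

    def __init__(self, initial: Optional[Dict[str, Any]] = None) -> None:
        # RLock lets a mutate() callback call back into the store safely.
        self._lock = threading.RLock()
        self._data: Dict[str, Any] = dict(initial or {})

    def snapshot(self) -> Dict[str, Any]:
        """Return a deep copy so callers cannot mutate shared state."""
        with self._lock:
            return deepcopy(self._data)

    def update(self, mutate: Callable[[Dict[str, Any]], Any]) -> Any:
        """Run mutate(data) while holding the lock; return its result."""
        with self._lock:
            return mutate(self._data)
```

Route handlers and the polling thread would then go through update() for every read-modify-write, which removes the interleaving risk of a bare module-level dict.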
A.7 Open Questions
- Q: Are /test/* routes compiled out or disabled by env in production deploys?
  - Blocks: A.4.2, A.6 #9
  - Who can answer: Gordon Zheng / platform SRE
- Q: What is the actual pytest coverage today?
  - Blocks: A.2 row 15, A.4.3
  - Who can answer: CI logs
- Q: Is specs/019-stuck-agent-watcher merged and running in prod?
  - Blocks: Confidence in stale-agent recovery
  - Who can answer: Repo maintainer
A.8 Difficulties Encountered
- Difficulty: No access to CI output, coverage, or runtime metrics.
  - Impact: Scored row 15 conservatively.
  - Fix: Publish CI summary (coverage %, test count, flake rate) into a badge or ci-summary.json.
- Difficulty: This repo and its sister repo (af-fema-real-ai-agent) have "near-identical" contents but no shared library.
  - Impact: Unable to tell which is canonical; duplication risk unmeasured.
  - Fix: Extract shared Python package or document divergence explicitly.
A.9 Risks & Unknowns
A.9.1 Known risks
| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Data-plane exposure if SG/CloudFront misconfigures | M | H | A.4.1 add app-layer auth |
| 2 | DisasterAssistance.gov DOM change breaks Agent scripts | M | H | Script regeneration + monitoring; manual override via VNC |
| 3 | Bedrock quota / auth failure mid-submission | M | M | Runbook exists; stale-detection; retry from operator |
| 4 | Login.gov session cookie leak via EFS snapshot | L | H | EFS encryption at rest; scope IAM read to task role |
| 5 | 1:1 container-to-user scaling hits Fargate limits at volume | M | M | Spot + quota raises; capacity planning TBD |
A.9.2 Unknown unknowns
- Area not reviewed: CDK stack under aws_deployment/ (IAM policies, SG rules, CloudFront rules).
  - Reason: Out of scope for per-repo Part A.
  - Best guess: M; the entire security model relies on these.
- Area not reviewed: fema_agent/tests/ contents.
  - Reason: Time.
  - Best guess: M.
- Area not reviewed: specs/001–020 (20 spec directories).
  - Reason: Time.
  - Best guess: L; docs only.
A.10 Technical Debt Register
| # | Debt item | Quadrant | Interest | Remediation |
|---|---|---|---|---|
| 1 | Unauthenticated Flask data-plane | Reckless & Deliberate | High — one misconfig = PII breach | Per-task bearer token (A.4.1) |
| 2 | 22 routes in one 1271-line app.py | Prudent & Inadvertent | Medium — slows review | Blueprints (A.6 #1) |
| 3 | Test routes ship in prod binary | Reckless & Deliberate | Medium — /test/force-status blast radius | Env-gated Blueprint (A.4.2) |
| 4 | Duplication across /survivor-info handlers | Prudent & Inadvertent | Low — 3× change cost | Extract helper |
| 5 | Near-duplicate sister repo af-fema-real-ai-agent | Prudent & Deliberate | Medium — 2× maintenance | Extract shared package |
| 6 | Unknown test coverage (null) | Prudent & Inadvertent | Medium | Wire coverage (A.4.3) |
| 7 | Global _state dict mutated from multiple sites | Prudent & Inadvertent | Medium — race risk | Locked StateStore |
| 8 | print(..., file=sys.stderr) for errors | Prudent & Inadvertent | Low | Logger (A.4.4) |
A.11 Security Posture (lightweight STRIDE)
| Category | Threat present? | Mitigated? | Gap |
|---|---|---|---|
| Spoofing (identity) | Yes — any caller reaching :5001 is trusted | Partial (network only) | No app-layer auth (A.4.1) |
| Tampering (integrity) | Yes — /test/force-status can overwrite FSM | Partial if test routes disabled (unverified) | Confirm prod gating |
| Repudiation | Partial — DynamoDB AuditEvent at control plane | Partial | Data-plane actions don't emit per-request audit |
| Information Disclosure | Yes — PII in _state, EFS, possibly 400 descriptions | Partial | Scrub errors; verify EFS encryption |
| Denial of Service | Yes — unauthenticated endpoints + subprocess spawning | Partial | No rate limit at app layer |
| Elevation of Privilege | Low at app layer | Partial | Task IAM role scope not reviewed |
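For the Information Disclosure gap, one option is to build validation messages from the field name and the violated rule only, never from the submitted value. The sketch below is hypothetical: the field name ssn, the accepted SSN formats, and both helper names are assumptions rather than the parser's actual rules.

```python
import re
from typing import Optional

# Assumed format: nine digits, optionally written as NNN-NN-NNNN.
SSN_RE = re.compile(r"\d{3}-?\d{2}-?\d{4}")


def validation_error(field: str, rule: str) -> str:
    """Describe what was wrong without echoing what was submitted."""
    return f"Invalid value for '{field}': {rule}"


def validate_ssn(value: Optional[str]) -> Optional[str]:
    """Return None when valid, else a PII-free error description."""
    if not SSN_RE.fullmatch(value or ""):
        return validation_error("ssn", "expected nine digits, optionally as NNN-NN-NNNN")
    return None
```

The rejected value can still be correlated through a request id in server-side logs if debugging requires it, while the 400 body stays free of survivor data.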
A.12 Operational Readiness
| Capability | Present / Partial / Missing | Notes |
|---|---|---|
| Structured logs | Partial | CloudWatch stdout; some print paths |
| Metrics | Partial | AgentExecutionTime, ErrorRate listed but runbook links TBD |
| Distributed tracing | Missing | No trace id propagation |
| Actionable alerts | Partial | Metrics defined, runbook links TBD |
| Runbooks | Present | runbook.md covers top 6 issues |
| On-call ownership defined | Missing | — |
| SLOs / SLIs | Missing | Performance targets TBD |
| Backup & restore tested | N/A | Ephemeral containers |
| Disaster recovery plan | Missing | — |
| Chaos / failure testing | Missing | — |
A.13 Test & Quality Signals
- Coverage: N/A (coverage_pct: null)
- Untested critical paths (suspected): FSM stale-detection, dual-location sentinel poll, _clear_chrome_session_restore
- Missing test types: [ ] unit [x] integration [ ] e2e [x] contract [x] load [x] security/fuzz
A.14 Performance & Cost Smells
- Hot paths: Agent subprocess poll loop; sentinel file stat.
- Suspected bottlenecks: Chrome rendering under X11; Bedrock latency dominates per-page time.
- Oversized infra: 2 vCPU / 4 GB per survivor — verify with telemetry.
A.15 Bus-Factor & Knowledge Risk
- Only-person: Single author (Gordon Zheng). Script generator + raw/intermediate builder entirely author-originated.
- What breaks: DOM-selector regressions when DisasterAssistance.gov changes; claude-full wrapper tuning.
- Tribal knowledge: Mapping between page numbering and script sections; rationale for 2-location sentinel polling.
- Actions: Architecture walkthrough recording; rationale comments in raw_builder/; pair-review script regeneration with second engineer.
A.16 Compliance Gaps
N/A — YAML does not claim HIPAA/SOC 2/state-insurance compliance. FEMA/PII handling regime implied but not asserted. Formal compliance review belongs in af-infra.
A.17 Recommendations Summary
| Priority | Action | Owner | Effort | Depends on |
|---|---|---|---|---|
| P0 | Add app-layer auth (bearer token) to all Flask routes | Repo maintainer + SRE | M | Secrets Manager provisioning |
| P0 | Disable /test/* and /agent*/restart in prod | Repo maintainer | S | — |
| P1 | Split create_app into Blueprints; extract _ingest_survivor helper | Repo maintainer | M | — |
| P1 | Wrap global _state in locked StateStore; wrap Popen in AgentProcess | Repo maintainer | M | — |
| P1 | Wire pytest coverage into CI with threshold | CI owner | S | CI access |
| P1 | Inject filesystem/clock/process seams for FSM unit tests | Repo maintainer | M | Blueprint split |
| P1 | Scrub PII from 4xx descriptions; structured-log instead | Repo maintainer | S | Logger replacement |
| P2 | Replace print(stderr) with Python logger | Repo maintainer | S | — |
| P2 | Centralize magic-number thresholds with rationale | Repo maintainer | S | — |
| P2 | Document Bedrock / Claude Code exit strategy or extract shared package with af-fema-real-ai-agent | Architect | L | Cross-repo coordination |
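As an example of the clock seam recommended in the table above, stale-agent detection can take its time source as a parameter so unit tests never sleep. The StaleDetector class is hypothetical; the 600-second default mirrors the documented AGENT_INACTIVITY_TIMEOUT_SECONDS default.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StaleDetector:
    """Flag an agent as stale after timeout_s without recorded activity."""

    timeout_s: float = 600.0
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    _last_activity: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self._last_activity = self.clock()

    def touch(self) -> None:
        """Record activity, e.g. when new subprocess output is observed."""
        self._last_activity = self.clock()

    def is_stale(self) -> bool:
        return self.clock() - self._last_activity >= self.timeout_s
```

A test can pass a lambda over a mutable value as the clock and advance it directly. The same injection approach would apply to the filesystem and process-spawning seams.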
Environment variables
| Name | Purpose |
|---|---|
| LLM_PROVIDER | bedrock or anthropic |
| LLM_MODEL_NAME | Friendly alias mapped to Bedrock inference profile |
| BEDROCK_API_KEY* | Aliased to AWS_BEARER_TOKEN_BEDROCK |
| AWS_REGION | Bedrock region |
| WORK_DIR | Container work directory |
| CLAUDE_CONFIG_DIR | Claude credentials path inside container |
| DISPLAY | X11 display |
| NOVNC_HEARTBEAT_SECONDS | Idle keepalive for CloudFront |
| AGENT_INACTIVITY_TIMEOUT_SECONDS | Watchdog for stale agent subprocess |
| PORT_5001 / PORT_5900 / PORT_6080 | Host port mappings (set in .env for start_container.sh) |
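The two threshold variables in this table are the ones flagged as magic numbers in A.6.6. A centralized loader might look like the sketch below; the Thresholds dataclass, the load_thresholds function, and the positive-integer check are assumptions, while the variable names and the 600 s / 30 s defaults come from this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Thresholds:
    # Watchdog window before an agent subprocess is considered stale.
    agent_inactivity_timeout_s: int = 600
    # Idle keepalive interval for the noVNC connection behind CloudFront.
    novnc_heartbeat_s: int = 30


def _positive_int(environ: Mapping[str, str], name: str, default: int) -> int:
    """Read a positive integer from the environment, or fall back to default."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value


def load_thresholds(environ: Mapping[str, str] = os.environ) -> Thresholds:
    """Build Thresholds in one place so defaults and rationale live together."""
    return Thresholds(
        agent_inactivity_timeout_s=_positive_int(
            environ, "AGENT_INACTIVITY_TIMEOUT_SECONDS", 600),
        novnc_heartbeat_s=_positive_int(
            environ, "NOVNC_HEARTBEAT_SECONDS", 30),
    )
```

Failing fast on a zero or negative value at startup is a design choice made here; the current code may instead accept such values silently.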
