URL: https://mkdocs.justinsforge.com/memory/investigations/coord-stall-be-2026-05-04/

Coord-Stall Fix B + Fix E, Status

TL;DR

  • Fix B is already in the code. All three sub-changes from the handoff are present in scripts/forge_telegram_brain.py. Nothing to do unless we want to tighten the wall-clock budget from 240s to 180s.
  • Fix E is still needed. The Google Calendar refresh token in ~/.forge-secrets/google-calendar.env returns invalid_grant: Token has been expired or revoked as of 2026-05-04 16:1x CDT. The fix is a manual re-auth, not a code change.
  • One-PR scope shrinks to zero required code changes. The "PR" reduces to: (a) optionally retune handle_budget_s 240→180, (b) re-auth Calendar (one-shot CLI). No coordinator-bot edits.

Fix B status: shipped

Handoff asked for three changes in forge_telegram_brain.py. Current code (HEAD, untracked working copy):

| Sub-fix | Handoff target | Current code | File:line |
|---|---|---|---|
| Coordinator timeout 150 → 90 | 90s | 90s | forge_telegram_brain.py:591 (90 if persona == "coordinator" else 150) |
| Drop iteration-retry on timeout | only retry parse-error/error | already split via _COORD_RETRYABLE = {"parse error", "error"} | forge_telegram_brain.py:642, used at line 650 |
| Per-handle() wall-clock budget | 180s | 240s (named handle_budget_s) | forge_telegram_brain.py:690, break at :694-699 |
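For orientation, the three sub-fixes can be sketched together as below. This is a minimal illustration, not the code in forge_telegram_brain.py: only the timeout expression, _COORD_RETRYABLE, and handle_budget_s come from the table. The function names, the (status, reply) return shape, the single-retry policy, and the injectable clock are assumptions made for the sketch.

```python
import time

# Quoted from the table above; everything else in this block is hypothetical.
_COORD_RETRYABLE = {"parse error", "error"}


def call_timeout_s(persona):
    """Per-call timeout: 90s for the coordinator, 150s for other personas."""
    return 90 if persona == "coordinator" else 150


def call_with_retry(persona, call_model):
    """Call once, retrying a single time only on a retryable status.

    call_model(timeout_s) is a hypothetical callable returning (status, reply).
    A "timeout" status is returned immediately and never retried.
    """
    timeout_s = call_timeout_s(persona)
    status, reply = call_model(timeout_s)
    if status in _COORD_RETRYABLE:
        status, reply = call_model(timeout_s)
    return status, reply


def run_with_budget(persona, steps, clock=time.monotonic):
    """Run steps in order until they finish or the wall-clock budget is spent.

    Returns (results, budget_exceeded). The budget is checked before each step,
    so a step that is already running is never interrupted.
    """
    handle_budget_s = 240 if persona == "coordinator" else 600
    started = clock()
    results = []
    for step in steps:
        if clock() - started > handle_budget_s:
            return results, True
        results.append(step())
    return results, False
```

Because the budget is only checked between steps, it acts as a hard cap on starting new work rather than a guarantee on total elapsed time; a step that begins just under the cap can still run for its full per-call timeout.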

The 240s value differs from the handoff's suggested 180s. Two readings:

  1. Leave at 240. The budget is a hard cap, not a target. Worst case is now 90s primary + (parse-error retry path) ≤ ~92s per iter, so 240s allows ~2.6 iters before the forced break, which matches typical 1-3 iter coordinator flows.
  2. Tighten to 180. Cuts worst-case user-visible latency by 60s on a degraded API. Low-risk one-line change. Mild regression possible if a legit 3-iter conversation gets cut at iter 2, since 180s covers just under two worst-case iterations.

Recommendation: leave 240s for now; revisit only if a future stall actually rides the budget to the cap. There is no log evidence today of the "handle() budget Ns exceeded" message firing.
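The trade-off between the two budgets comes down to simple division. The sketch below assumes the worst-case figure of roughly 92s per iteration quoted above; the function name is illustrative only.

```python
# Worst-case seconds per iteration: ~90s primary call plus the parse-error
# retry path, taken from the estimate in the text (an approximation).
PER_ITER_WORST_S = 92


def iterations_within(budget_s, per_iter_s=PER_ITER_WORST_S):
    """Return (whole iterations that fit, fractional iterations) for a budget."""
    return budget_s // per_iter_s, round(budget_s / per_iter_s, 2)


for budget in (240, 180):
    whole, fractional = iterations_within(budget)
    print(f"{budget}s budget: {whole} whole worst-case iterations ({fractional} fractional)")
```

Under that assumption, 240s fits 2 whole worst-case iterations (2.61 fractional) and 180s fits 1 (1.96 fractional), which is why the tighter budget risks cutting a three-iteration flow short only when the API is already degraded.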

Fix E status: not done, refresh token confirmed dead

Live test against Google's token endpoint with the current secret:

POST https://oauth2.googleapis.com/token  grant_type=refresh_token
→ HTTP 400  {"error":"invalid_grant","error_description":"Token has been expired or revoked."}

~/.forge-secrets/google-calendar.env was last modified Apr 30 10:09, which matches the date the token went stale per reference_google_calendar_client.md. Until re-auth, every coordinator brain iteration that touches calendar burns one cycle on an apology, exactly as the handoff predicted.
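The live test above can be reproduced with a short standard-library script. This is a sketch under assumptions: the helper names and the three result labels are invented here, and how client_id and client_secret are read from the env file is left to the caller because their variable names are not given in this note.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build the form-encoded POST that exchanges a refresh token."""
    body = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, method="POST")


def classify_response(status, payload):
    """Map an HTTP status and JSON body to a coarse result label."""
    if status == 200 and "access_token" in payload:
        return "ok"
    if payload.get("error") == "invalid_grant":
        return "reauth-needed"
    return "other-failure"


def check_refresh_token(client_id, client_secret, refresh_token):
    """Send the request and classify the answer; assumes a JSON response body."""
    request = build_refresh_request(client_id, client_secret, refresh_token)
    try:
        with urllib.request.urlopen(request, timeout=15) as response:
            return classify_response(response.status, json.load(response))
    except urllib.error.HTTPError as err:
        # Google returns HTTP 400 with a JSON error body for a dead token.
        return classify_response(err.code, json.load(err))
```

With the current secret, check_refresh_token would be expected to return "reauth-needed", matching the HTTP 400 invalid_grant shown above; after a successful re-auth it should return "ok".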

Re-auth path (canonical, per reference_google_calendar_client.md):

  1. Run scripts/forge_google_calendar.py --reauth (if that flag exists). If not, regenerate via the OAuth consent flow with client_id/client_secret from the env file, scope https://www.googleapis.com/auth/calendar, and overwrite GOOGLE_CALENDAR_REFRESH_TOKEN in place (chmod 600).
  2. Re-test with the same token endpoint POST above; expect HTTP 200 + access_token.
  3. Smoke-test with a coordinator command that invokes a calendar tool.

Note: --reauth flag not confirmed present in forge_google_calendar.py (greps showed _refresh, _persist_refresh_token, but no CLI re-auth subcommand). May need a one-shot OAuth helper or use an existing forge_google_*_oauth_bootstrap.sh if one exists. Worth a 2-minute check before committing the next worker to a path.
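If no existing helper turns up, the "overwrite in place (chmod 600)" part of step 1 could look roughly like this. It is a sketch that assumes the env file uses plain KEY=value lines with no quoting or export prefixes; replace_env_value is a hypothetical helper, not something known to exist in the repo.

```python
import os
import tempfile


def replace_env_value(path, key, new_value):
    """Rewrite KEY=... in an env file, leaving every other line untouched.

    Writes to a temp file in the same directory, sets mode 0600, then swaps it
    into place so a crash cannot leave a half-written secrets file behind.
    """
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
    prefix = key + "="
    found = False
    for index, line in enumerate(lines):
        if line.startswith(prefix):
            lines[index] = prefix + new_value
            found = True
    if not found:
        lines.append(prefix + new_value)  # key was missing: add it at the end
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write("\n".join(lines) + "\n")
        os.chmod(tmp_path, 0o600)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A call such as replace_env_value(os.path.expanduser("~/.forge-secrets/google-calendar.env"), "GOOGLE_CALENDAR_REFRESH_TOKEN", new_token) would then rotate only the refresh token, where new_token is whatever the consent flow returned.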

One-PR fix, exact scope

Given Fix B is already shipped, the "B+E one PR" collapses to:

Optional code change (1 line, defer unless requested)

scripts/forge_telegram_brain.py:690

-    handle_budget_s = 240 if persona == "coordinator" else 600
+    handle_budget_s = 180 if persona == "coordinator" else 600

Required ops change (no PR, secret-file edit)

~/.forge-secrets/google-calendar.env → rotate GOOGLE_CALENDAR_REFRESH_TOKEN via re-auth flow. Verify with token-endpoint POST; expect 200.

What still needs a real PR

Fix A (decouple brain.handle from poll loop via multiprocessing.Process) is unchanged and still the dominant structural fix. Schedule it as a feature-plan task per the handoff's suggested order, after Calendar re-auth so we have a clean baseline.
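As a rough picture of what Fix A implies, the poll loop would hand each message to a child process and stop waiting once a budget elapses. The sketch below is an assumption-laden illustration, not the handoff's design: handle_in_process and its arguments are hypothetical, and the "fork" start method is chosen only to keep the example short on Linux.

```python
import multiprocessing as mp
import queue

# Assumption: a Unix host where the "fork" start method is available.
_CTX = mp.get_context("fork")


def _child(handler, message, out):
    """Run the handler in the child and ship its reply back to the parent."""
    out.put(handler(message))


def handle_in_process(handler, message, budget_s):
    """Run handler(message) in a child process; return (finished, reply).

    If no reply arrives within budget_s seconds, the child is terminated and
    (False, None) is returned, so the caller's loop is never blocked for longer.
    """
    out = _CTX.Queue()
    proc = _CTX.Process(target=_child, args=(handler, message, out), daemon=True)
    proc.start()
    try:
        reply = out.get(timeout=budget_s)
        finished = True
    except queue.Empty:
        reply, finished = None, False
    if proc.is_alive():
        proc.terminate()
    proc.join()
    return finished, reply
```

One limitation of this shape: if the handler raises, the parent still waits out the full budget before reporting failure, so a real implementation would probably also watch for early child exit.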

Fix C (heartbeat + watchdog probe) and Fix D (PrivateTmp=no on coordinator unit) are also still pending and unaffected by today's findings.

[Claude Code, investigation worker]