URL: https://mkdocs.justinsforge.com/memory/investigations/coord-stall-be-2026-05-04/
Coord-Stall Fix B + Fix E, Status
TL;DR
- Fix B is already in the code. All three sub-changes from the handoff are present in scripts/forge_telegram_brain.py. Nothing to do unless we want to tighten the wall-clock budget from 240s → 180s.
- Fix E is still needed. The Google Calendar refresh token in ~/.forge-secrets/google-calendar.env returns "invalid_grant: Token has been expired or revoked" as of 2026-05-04 16:1x CDT. This is a manual re-auth, not a code change.
- One-PR scope shrinks to zero code changes. The "PR" reduces to: (a) optionally retune handle_budget_s 240 → 180, (b) re-auth Calendar (one-shot CLI). No coordinator-bot edits.
Fix B status: shipped
Handoff asked for three changes in forge_telegram_brain.py. Current code (HEAD, untracked working copy):
| Sub-fix | Handoff target | Current code | File:line |
|---|---|---|---|
| Coordinator timeout 150 → 90 | 90s | 90s | forge_telegram_brain.py:591 (90 if persona == "coordinator" else 150) |
| Drop iteration-retry on timeout | only retry parse-error/error | already split via _COORD_RETRYABLE = {"parse error", "error"} | forge_telegram_brain.py:642, used at line 650 |
| Per-handle() wall-clock budget | 180s | 240s (named handle_budget_s) | forge_telegram_brain.py:690, break at :694-699 |
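As a rough illustration of the first two rows, the persona-dependent timeout and the retry split might look like the sketch below. Only the 90/150 expression and the _COORD_RETRYABLE set come from the table; the function names and the "timeout" status string are assumptions made for illustration, not the contents of forge_telegram_brain.py.

```python
# Values taken from the table above; everything else is hypothetical.
_COORD_RETRYABLE = {"parse error", "error"}


def brain_timeout_s(persona: str) -> int:
    """Per-call timeout in seconds: the coordinator gets the shorter 90s limit."""
    return 90 if persona == "coordinator" else 150


def should_retry(status: str) -> bool:
    """Retry only parse errors and generic errors.

    A timed-out call (status string "timeout" is an assumed name) is NOT
    retried, so a degraded API cannot double the wait for one iteration.
    """
    return status in _COORD_RETRYABLE


if __name__ == "__main__":
    for status in ("parse error", "error", "timeout"):
        print(status, "->", "retry" if should_retry(status) else "give up")
```

The point of keeping the retryable statuses in a set is that dropping timeout retries becomes a membership check rather than a special case in the loop.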
The 240s value differs from the handoff's suggested 180s. Two readings:
- Leave at 240. The budget is a hard cap, not a target. Worst case is now a 90s primary call plus the parse-error retry path, ≤ ~92s per iteration, so 240s allows ~2.5 iterations before the forced break, which matches typical 1-3 iteration coordinator flows.
- Tighten to 180. Cuts worst-case user-visible latency by 60s on a degraded API. Low-risk one-line change. A mild regression is possible if a legitimate 3-iteration conversation gets cut at iteration 2.
Recommendation: leave 240s for now; revisit only if a future stall actually rides the budget to the cap. There is no log evidence today of the "handle() budget Ns exceeded" message firing.
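To make the 240s versus 180s trade-off concrete, here is a minimal sketch of a wall-clock budget loop. It assumes the budget is checked before each iteration starts and that every iteration takes the ~92s worst case; the real break logic at :694-699 was not reproduced here, and run_with_budget and FakeClock are hypothetical names.

```python
import time


def run_with_budget(step, budget_s, max_iters=3, clock=time.monotonic):
    """Call step(i) up to max_iters times, stopping once budget_s has elapsed.

    The budget is checked only between iterations, so an in-flight step is
    never interrupted and total wall time can overshoot by one iteration.
    """
    start = clock()
    results = []
    for i in range(max_iters):
        if clock() - start >= budget_s:
            break  # budget exhausted: refuse to start another iteration
        results.append(step(i))
    return results


class FakeClock:
    """Deterministic clock so the arithmetic can be checked without sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def simulate(budget_s, iter_cost_s=92, max_iters=3):
    """Return how many worst-case iterations fit under the given budget."""
    clock = FakeClock()

    def step(i):
        clock.advance(iter_cost_s)
        return i

    return len(run_with_budget(step, budget_s, max_iters, clock))


if __name__ == "__main__":
    print("240s budget:", simulate(240), "iterations")
    print("180s budget:", simulate(180), "iterations")
```

Under these assumptions a third worst-case iteration would start at 184s, which is inside a 240s budget but past a 180s one. That is the regression the second reading describes.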
Fix E status: not done, refresh token confirmed dead
Live test against Google's token endpoint with the current secret:
POST https://oauth2.googleapis.com/token grant_type=refresh_token
→ HTTP 400 {"error":"invalid_grant","error_description":"Token has been expired or revoked."}
~/.forge-secrets/google-calendar.env was last modified Apr 30 10:09, which matches the date the token went stale per reference_google_calendar_client.md. Until re-auth, every coordinator brain iteration that touches calendar burns one cycle on an apology, exactly as the handoff predicted.
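The live test above can be repeated with a small script, sketched here using only the Python standard library. The endpoint and the grant_type=refresh_token form fields follow Google's OAuth 2.0 token endpoint; the function names are hypothetical, and loading the client id, client secret, and refresh token from the env file is left to the caller because the exact variable names for the first two are not confirmed.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build the form-encoded POST that exchanges a refresh token."""
    body = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, method="POST")


def classify(status, payload):
    """Map an HTTP status and JSON body to a coarse verdict."""
    if status == 200 and "access_token" in payload:
        return "ok"
    if status == 400 and payload.get("error") == "invalid_grant":
        return "needs_reauth"  # expired or revoked refresh token
    return "unknown"


def check_refresh_token(client_id, client_secret, refresh_token):
    """Hit the token endpoint once and classify the answer (network call)."""
    req = build_refresh_request(client_id, client_secret, refresh_token)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return classify(resp.status, json.load(resp))
    except urllib.error.HTTPError as exc:
        # Non-2xx responses arrive as HTTPError; the body is still readable.
        return classify(exc.code, json.load(exc))
```

Run against the current secret, check_refresh_token would be expected to return "needs_reauth"; after a successful re-auth it should return "ok".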
Re-auth path (canonical, per reference_google_calendar_client.md):
- Run scripts/forge_google_calendar.py --reauth (if that flag exists). If not, regenerate via the OAuth consent flow with client_id/client_secret from the env file, scope https://www.googleapis.com/auth/calendar, and overwrite GOOGLE_CALENDAR_REFRESH_TOKEN in place (chmod 600).
- Re-test with the same token endpoint POST above; expect HTTP 200 + access_token.
- Smoke-test with a coordinator command that invokes a calendar tool.
Note: the --reauth flag is not confirmed present in forge_google_calendar.py (greps showed _refresh and _persist_refresh_token, but no CLI re-auth subcommand). We may need a one-shot OAuth helper, or an existing forge_google_*_oauth_bootstrap.sh if one exists. Worth a 2-minute check before committing the next worker to a path.
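If the token ends up being rotated by hand, the "overwrite in place (chmod 600)" step could be done with a helper along these lines. This is a sketch assuming the env file uses plain KEY=VALUE lines (no export prefix, no quoting rules); replace_env_value is a hypothetical name and is not the existing _persist_refresh_token.

```python
import os
import tempfile


def replace_env_value(path, key, new_value):
    """Rewrite KEY=... in an env file, keep other lines, leave mode 0600.

    Writes to a temp file in the same directory and renames it over the
    original, so a crash mid-write cannot leave a half-written secret file.
    """
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

    prefix = key + "="
    replaced = False
    out = []
    for line in lines:
        if line.startswith(prefix):
            out.append(prefix + new_value)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(prefix + new_value)  # key was absent: append it

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write("\n".join(out) + "\n")
        os.chmod(tmp_path, 0o600)  # owner read/write only
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Usage would be a single call such as replace_env_value(os.path.expanduser("~/.forge-secrets/google-calendar.env"), "GOOGLE_CALENDAR_REFRESH_TOKEN", new_token), followed by the token-endpoint re-test.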
One-PR fix, exact scope
Given Fix B is already shipped, the "B+E one PR" collapses to:
Optional code change (1 line, defer unless requested)
scripts/forge_telegram_brain.py:690
- handle_budget_s = 240 if persona == "coordinator" else 600
+ handle_budget_s = 180 if persona == "coordinator" else 600
Required ops change (no PR, secret-file edit)
~/.forge-secrets/google-calendar.env → rotate GOOGLE_CALENDAR_REFRESH_TOKEN via re-auth flow. Verify with token-endpoint POST; expect 200.
What still needs a real PR
Fix A (decouple brain.handle from poll loop via multiprocessing.Process) is unchanged and still the dominant structural fix. Schedule it as a feature-plan task per the handoff's suggested order, after Calendar re-auth so we have a clean baseline.
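For orientation only, the shape of Fix A might resemble the sketch below: the handler runs in a child process, and the caller waits on a queue with a timeout instead of blocking indefinitely. It assumes a Linux host where the "fork" start method is available, and handle_in_subprocess is a hypothetical name; the handoff's actual design may differ.

```python
import multiprocessing as mp
import queue

# Assumption: POSIX host. "fork" lets the child inherit the handler directly.
_CTX = mp.get_context("fork")


def _child(handle, message, out):
    """Child-process entry point: run the handler and ship the result back."""
    out.put(handle(message))


def handle_in_subprocess(handle, message, timeout_s):
    """Run handle(message) in a child process; return (status, result).

    If no result arrives within timeout_s the child is terminated, so a
    stuck handler cannot stall the caller. A handler that raises also
    surfaces as "timeout" in this simplified sketch.
    """
    out = _CTX.Queue()
    proc = _CTX.Process(target=_child, args=(handle, message, out), daemon=True)
    proc.start()
    try:
        result = out.get(timeout=timeout_s)
        status = "ok"
    except queue.Empty:
        result, status = None, "timeout"
        proc.terminate()
    proc.join(timeout=5)
    if proc.is_alive():
        proc.kill()  # last resort if the child ignored SIGTERM
        proc.join()
    return status, result
```

A poll loop built on this would call handle_in_subprocess once per incoming message and carry on polling whether the status is "ok" or "timeout".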
Fix C (heartbeat + watchdog probe) and Fix D (PrivateTmp=no on coordinator unit) are also still pending and unaffected by today's findings.
[Claude Code, investigation worker]