URL: https://mkdocs.justinsforge.com/memory/handoffs/coord-stall-debug-2026-05-04/
Coordinator Stall Debug, 2026-05-04¶
Investigation only. No code changes made. Continues console-instability-2026-05-03.
TL;DR¶
The bot is not crashing; it goes deaf for 1-5 minute stretches while a single
synchronous claude -p brain call runs to completion. systemd sees an active
process the whole time, the watchdog never fires, and Justin sees a "stalled" bot.
Today's KillMode=mixed + loosen-spawn.conf work was correct, but it addresses spawn
crashes, not the stall mode.
Evidence at a glance¶
| Signal | What we see |
|---|---|
| Service state | active (running) since 11:51, never restarted today, 24.9M RSS, 1 task. No process death. |
| Watchdog | Last actual restart of forge-lifeos-coordinator was Apr 30 (the forge_habits_callbacks ImportError storm). Since then: zero restarts. Watchdog timer fires every 5min but has nothing to do because the process is "alive". |
| Brain timeouts | 11 log lines of the form Sonnet attempt 1 returned timeout / error / parse error across Apr 29 → today. Today: 11:23 (parse error after read-only /tmp), 14:22:55 (timeout, almost certainly the stall Justin felt right before this debug session). |
| Today's tool failures | 11:23 EROFS on /tmp/tmux-1000/ from tool_spawn_remote_session. 14:16 Google Calendar invalid_grant (refresh token broken). Both contaminate brain context and trigger retries. |
Root cause, ranked¶
#1, structural: brain.handle() blocks the main poll loop¶
forge_telegram_coordinator_bot.py:200-330 is a single-threaded while running: loop
that calls brain.handle inline for each update.
brain.handle (forge_telegram_brain.py:654-700) iterates up to MAX_ITERS
brain calls. Each call is _call_sonnet_once, which shells out to claude -p
with timeout_s = 150 for the coordinator (forge_telegram_brain.py:591). On
timeout / parse error / error the brain auto-retries once at the iteration
level (_RETRYABLE_THOUGHTS, line 638). Net worst case per user message:
MAX_ITERS × 2 attempts × 150 s = MAX_ITERS × 300 s.
During this entire window:
- getUpdates is not called.
- New Telegram messages queue at Telegram side.
- systemd sees a healthy Python process.
- Watchdog has no signal to act on.
This is the dominant stall mode. Today's 14:22:55 retry alone budgets 2-5 minutes of silence even on a single-iteration message.
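The worst-case window follows directly from the retry arithmetic described above. A minimal sketch of that calculation is below; the function name is hypothetical, and the MAX_ITERS value of 4 is an illustrative assumption because this note does not state the real constant.

```python
def worst_case_stall_s(max_iters: int,
                       timeout_s: float = 150.0,
                       retries_per_iter: int = 1) -> float:
    """Upper bound on how long one handle() call can block the poll loop.

    Each iteration may time out once and then be retried once, so it can
    consume (1 + retries_per_iter) full timeouts.
    """
    attempts_per_iter = 1 + retries_per_iter
    return max_iters * attempts_per_iter * timeout_s


# One iteration that times out and is retried once: 300 s, i.e. 5 minutes.
single = worst_case_stall_s(max_iters=1)

# Hypothetical MAX_ITERS = 4 (assumed value, not taken from the code): 1200 s.
assumed = worst_case_stall_s(max_iters=4)

print(single, assumed)
```

The single-iteration figure matches the upper end of the silence window observed today.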
#2, proximate: Sonnet API instability amplified by tight timeout¶
Eleven brain failures in 5 days, spread across times of day, suggest intermittent
Anthropic-side latency rather than a deterministic bug. claude -p cold-start
plus inference within a 150s ceiling is tight under degraded API conditions.
Reducing iteration count or budgeting wall-clock per handle() call would
contain the blast radius without solving #1.
#3, contextual: tool failures inflate iteration count¶
Two tools today returned errors that Sonnet then tried to "explain" in follow-up iterations, burning iterations and time:
- /tmp EROFS, 11:23. tool_spawn_remote_session (forge_telegram_inbox_brain.py:2198-2225) shells out to forge_spawn_session.sh, which writes to /tmp/tmux-1000/. Even with today's loosen-spawn.conf (ProtectSystem=no, NoNewPrivileges=false), PrivateTmp is likely still on by default for this unit. Spawn worked earlier in the loop but a later iteration tripped EROFS, and Sonnet produced prose instead of JSON → No JSON in brain output → parse-error retry → 4-min wall-clock burn.
- Calendar invalid_grant, 14:16. Refresh token in ~/.forge-secrets/google-calendar.env is stale (per reference_google_calendar_client.md). Returns instantly, no stall, but forces follow-up explanation iterations.
Proposed fix, ordered by leverage¶
Do not implement until you confirm the plan.
Fix A, highest leverage: decouple brain.handle from the poll loop¶
File: forge_telegram_coordinator_bot.py:200-330 (main loop) and the
handle_message body around lines 220-270.
Change: spawn each incoming message into a child process, not a
thread. Per feedback_no_threads_with_subprocess_run.md, threads race with
the brain's subprocess.run(claude) and cause Errno 8. Use
multiprocessing.Process(target=_handle_message_in_child, args=(...)),
daemonize it, and serialize per-chat ordering with a small per-chat
multiprocessing.Queue if needed. Main loop:

    while running:
        updates = getUpdates(...)
        for u in updates:
            offset = u.update_id + 1
            proc_pool.submit(handle_one, u)  # returns instantly

Result: long brain calls no longer block getUpdates. Justin's next
message is acknowledged with 👀 within ~1s even while the prior message
is still cooking.
Fix B, defense in depth: per-message wall-clock budget¶
Files: forge_telegram_brain.py:591 and forge_telegram_brain.py:683
(the for iteration in range(max_iters): loop in handle).
Change:
1. Lower coordinator timeout_s from 150 → 90.
2. Drop the iteration-level retry on timeout (keep it on parse error
only). A timed-out call has already used its full 150-second budget;
retrying doubles the user-facing wait.
3. Add a t_budget = 180 wall-clock cap inside handle(). If elapsed
exceeds it, break the loop with whatever done=True reply we have so
far instead of running another full iteration.
This caps user-facing latency at ~3 min hard, ~90s soft, vs today's
worst case of MAX_ITERS × 300s.
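The three changes can be sketched together as follows. This is an illustration under assumptions, not the existing handle() body: call_once stands in for _call_sonnet_once, and the result dict with "done" and "thought" keys is an assumed shape. The clock parameter exists only to make the sketch testable.

```python
import time
from typing import Callable, Optional

# Only parse errors are retried; timeouts are not (item 2).
RETRYABLE = {"parse error"}


def handle_with_budget(
    call_once: Callable[[float], dict],
    max_iters: int,
    timeout_s: float = 90.0,
    t_budget: float = 180.0,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Run brain iterations until done or the wall-clock budget is spent."""
    start = clock()
    last = None

    def remaining() -> float:
        return t_budget - (clock() - start)

    for _ in range(max_iters):
        left = remaining()
        if left <= 0:
            break  # budget exhausted: return whatever we have so far
        # Clip the per-call timeout so no call outlives the budget.
        result = call_once(min(timeout_s, left))
        left = remaining()
        if result.get("thought") in RETRYABLE and left > 0:
            result = call_once(min(timeout_s, left))
        last = result
        if result.get("done"):
            break
    return last
```

Checking the budget only between iterations would let a call that starts at 179 s run for another 90 s. Clipping each call's timeout to the remaining budget is what makes the 180 s cap hard.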
Fix C, observability: in-bot heartbeat + watchdog probe¶
Files:
- forge_telegram_coordinator_bot.py, near top of main loop: write
/run/user/1000/forge-coordinator.heartbeat (mtime-only) every poll
iteration.
- scripts/forge_bot_watchdog.sh (or wherever
forge-bot-watchdog.service is wired): add a check. If the heartbeat
mtime is more than 300s old, run systemctl restart forge-lifeos-coordinator.
Catches a future stall mode the moment the poll loop goes quiet, even if the process is "alive" to systemd.
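A minimal sketch of both halves of the heartbeat, written in Python for illustration. The function names beat and is_stale are hypothetical. The real watchdog is a shell script, so the staleness check would need to be ported or invoked from it.

```python
import time
from pathlib import Path
from typing import Optional

HEARTBEAT = Path("/run/user/1000/forge-coordinator.heartbeat")
MAX_AGE_S = 300


def beat(path: Path = HEARTBEAT) -> None:
    """Call once per poll iteration; only the file's mtime matters."""
    # touch() creates the file if needed and bumps mtime if it exists.
    path.touch(exist_ok=True)


def is_stale(path: Path = HEARTBEAT,
             max_age_s: float = MAX_AGE_S,
             now: Optional[float] = None) -> bool:
    """True if the heartbeat is missing or older than max_age_s seconds."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return True
    current = time.time() if now is None else now
    return (current - mtime) > max_age_s
```

Treating a missing file as stale is a design choice. It catches a bot that never reached its first poll, but the watchdog may need a startup grace period to avoid restarting a service that is still booting.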
Fix D, environmental: clear /tmp EROFS for spawn tools¶
File: /etc/systemd/system/forge-lifeos-coordinator.service.d/loosen-spawn.conf
Change: add PrivateTmp=no. Verify with
systemd-analyze security forge-lifeos-coordinator.service after reload.
Without this, tool_spawn_remote_session will hit EROFS again the next
time tmux tries to recreate /tmp/tmux-1000/.
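Before and after the reload, it may be worth confirming which sandbox settings are actually in effect. systemd's documented default for PrivateTmp is off, so another read-only setting could also be involved. The sketch below reads the effective values via systemctl show; the helper names and the choice of properties to inspect are assumptions.

```python
import subprocess

# Properties that commonly decide whether /tmp is writable for a unit.
SANDBOX_PROPS = ("PrivateTmp", "ProtectSystem", "ReadOnlyPaths", "ReadWritePaths")


def parse_show_output(text: str) -> dict:
    """Parse `systemctl show` output, which is one Key=Value pair per line."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def effective_sandbox(unit: str = "forge-lifeos-coordinator.service") -> dict:
    """Return the unit's effective sandbox properties (needs systemd)."""
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=" + ",".join(SANDBOX_PROPS)],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_show_output(out)
```

Comparing the result of effective_sandbox() before and after adding the drop-in line shows whether the change took effect and whether some other read-only setting still applies.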
Fix E, out of scope here: re-auth Google Calendar¶
~/.forge-secrets/google-calendar.env refresh token is dead. Run
scripts/forge_google_calendar.py --reauth (or whatever the canonical
re-auth path is in reference_google_calendar_client.md). Until done,
every calendar tool call adds noise to Sonnet's context and uses
iterations on apologies.
Suggested order¶
- Fix E first, takes 30 seconds, removes one source of iteration burn.
- Fix B, single-file, low risk, immediately bounds worst case.
- Fix C, gives us a true liveness signal before the bigger refactor.
- Fix A, the actual structural fix. Bigger change, do it after C is in place so we have a fallback.
- Fix D, the next time the coordinator unit is touched.
Things explicitly ruled out¶
- Telegram rate limiting / wedged getUpdates. Long-poll uses requests.post(..., timeout=LONG_POLL_TIMEOUT + 10) and recovers cleanly. No Max retries exceeded in today's log.
- Memory / CPU pressure. 24.9M RSS, 43s CPU over 2.5h. Cgroup is cool.
- KillMode=mixed regression. Today's drop-in is healthy, no spawn-kill cascades since.
Next worker¶
Pick up Fix B + Fix E in one PR. They are 1-file each and independently
verifiable. Then schedule Fix A as a feature-plan task; it is a real
refactor and deserves a spec.
[Claude Code, debug worker]