URL: https://mkdocs.justinsforge.com/memory/handoffs/coord-stall-debug-2026-05-04/

Coordinator Stall Debug, 2026-05-04

Investigation only. No code changes made. Continues console-instability-2026-05-03.

TL;DR

The bot is not crashing; it is going deaf for 1-5 minute stretches while a single synchronous claude -p brain call runs to completion. systemd sees an active process the whole time, the watchdog never fires, and Justin sees a "stalled" bot. Today's KillMode=mixed + loosen-spawn.conf work was correct, but it addresses spawn crashes, not this stall mode.

Evidence at a glance

| Signal | What we see |
| --- | --- |
| Service state | active (running) since 11:51, never restarted today, 24.9M RSS, 1 task. No process death. |
| Watchdog | Last actual restart of forge-lifeos-coordinator was Apr 30 (the forge_habits_callbacks ImportError storm). Since then: zero restarts. The watchdog timer fires every 5 min but has nothing to do because the process is "alive". |
| Brain timeouts | 11 log lines of the form "Sonnet attempt 1 returned X", with X one of timeout, error, or parse error, across Apr 29 → today. Today: 11:23 (parse error after read-only /tmp) and 14:22:55 (timeout, almost certainly the stall Justin felt right before this debug session). |
| Today's tool failures | 11:23 EROFS on /tmp/tmux-1000/ from tool_spawn_remote_session. 14:16 Google Calendar invalid_grant (refresh token broken). Both contaminate brain context and trigger retries. |

Root cause, ranked

#1, structural: brain.handle() blocks the main poll loop

forge_telegram_coordinator_bot.py:200-330 is a single-threaded while running: loop:

getUpdates  →  brain.handle(text)  →  send reply  →  next getUpdates

brain.handle (forge_telegram_brain.py:654-700) iterates up to MAX_ITERS brain calls. Each call is _call_sonnet_once which shells out to claude -p with timeout_s = 150 for coordinator (forge_telegram_brain.py:591). On timeout / parse error / error the brain auto-retries once at the iteration level (_RETRYABLE_THOUGHTS, line 638). Net worst case per user message:

MAX_ITERS × (150s primary + 150s retry) = MAX_ITERS × 300s

During this entire window:

  • getUpdates is not called.
  • New Telegram messages queue on Telegram's side.
  • systemd sees a healthy Python process.
  • The watchdog has no signal to act on.

This is the dominant stall mode. Today's 14:22:55 timeout and its retry alone budget 2.5 to 5 minutes of silence (one 150s call, plus up to 150s more for the retry), even on a single-iteration message.
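The worst-case formula above can be worked through numerically. This is a small sketch only: worst_case_stall_s is a hypothetical helper, and because the value of MAX_ITERS is not stated in this handoff, the 4 used below is an illustrative assumption rather than the real setting.

```python
def worst_case_stall_s(max_iters, timeout_s=150, retries_per_iter=1):
    """Upper bound on seconds the poll loop is blocked by one user message.

    Each iteration costs one primary call plus `retries_per_iter` retries,
    and every call may run for the full `timeout_s`.
    """
    return max_iters * timeout_s * (1 + retries_per_iter)


# MAX_ITERS is not given in this handoff; 4 is a hypothetical value.
for iters in (1, 4):
    total = worst_case_stall_s(iters)
    print(f"MAX_ITERS={iters}: {total}s ({total / 60:.0f} min)")
```

Even the single-iteration case reaches the 5-minute end of the stall range reported in the TL;DR.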

#2, proximate: Sonnet API instability amplified by tight timeout

11 brain failures in 5 days, spread across times of day, suggests intermittent Anthropic-side latency rather than a deterministic bug. claude -p cold-start plus inference within a 150s ceiling is tight under degraded API conditions. Reducing iteration count or budgeting wall-clock per handle() call would contain the blast radius without solving #1.

#3, contextual: tool failures inflate iteration count

Two tools today returned errors that Sonnet then tried to "explain" in follow-up iterations, burning iterations and time:

  • /tmp EROFS, 11:23. tool_spawn_remote_session (forge_telegram_inbox_brain.py:2198-2225) shells out to forge_spawn_session.sh, which writes to /tmp/tmux-1000/. Today's loosen-spawn.conf sets ProtectSystem=no and NoNewPrivileges=false, but another sandboxing directive is probably still restricting /tmp. PrivateTmp is the first suspect, though systemd leaves it off unless a unit file or drop-in enables it, so this needs confirming against the unit. Spawn worked earlier in the loop, but a later iteration tripped EROFS and Sonnet produced prose instead of JSON → No JSON in brain output → parse-error retry → 4-min wall-clock burn.
  • Calendar invalid_grant, 14:16. Refresh token in ~/.forge-secrets/google-calendar.env is stale (per reference_google_calendar_client.md). Returns instantly, no stall, but forces follow-up explanation iterations.

Proposed fix, ordered by leverage

Do not implement until you confirm the plan.

Fix A, highest leverage: decouple brain.handle from the poll loop

File: forge_telegram_coordinator_bot.py:200-330 (main loop) and the handle_message body around lines 220-270.

Change: hand each incoming message to a child process rather than a thread. Per feedback_no_threads_with_subprocess_run.md, threads race with the brain's subprocess.run(claude) call and cause Errno 8. Use multiprocessing.Process(target=_handle_message_in_child, args=(...)) with daemon=True, and serialize per-chat ordering with a small per-chat multiprocessing.Queue if needed. Main loop:

while running:
    updates = getUpdates(...)             # long-poll for new Telegram updates
    for u in updates:
        offset = u.update_id + 1          # advance the offset before handing off
        proc_pool.submit(handle_one, u)   # returns instantly

Result: long brain calls no longer block getUpdates. Justin's next message is acknowledged with 👀 within ~1s even while the prior message is still cooking.
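A minimal sketch of the dispatch step, assuming a Linux host where the fork start method is available. The names dispatch and handle_one are hypothetical, updates are modeled as plain dicts, the sleep stands in for the brain call, and per-chat ordering is deliberately left out.

```python
import multiprocessing as mp
import time

# Assumption: Linux host (systemd), so the "fork" start method exists.
_ctx = mp.get_context("fork")


def handle_one(update):
    """Hypothetical stand-in for the brain call; sleeps to mimic a slow claude -p."""
    time.sleep(update.get("work_s", 0.0))


def dispatch(updates, offset):
    """Start one child process per update and return without waiting for any.

    Returns the next getUpdates offset and the list of started children.
    """
    children = []
    for u in updates:
        offset = u["update_id"] + 1  # advance past this update
        child = _ctx.Process(target=handle_one, args=(u,), daemon=True)
        child.start()                # does not block on the handler
        children.append(child)
    return offset, children
```

The caller keeps the returned children only so it can reap finished processes later; the poll loop itself never joins them.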

Fix B, defense in depth: per-message wall-clock budget

Files: forge_telegram_brain.py:591 and forge_telegram_brain.py:683 (the for iteration in range(max_iters): loop in handle).

Change:

  1. Lower coordinator timeout_s from 150 → 90.
  2. Drop the iteration-level retry on timeout (keep it on parse error only). A 150-second timeout has already used the full budget; retrying doubles the user-facing wait.
  3. Add a t_budget = 180 wall-clock cap inside handle(). If elapsed time exceeds it, break the loop with whatever done=True reply we have so far instead of running another full iteration.

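Caps user-facing latency at ~3 min hard, ~90s soft, versus today's worst case of MAX_ITERS × 300s. The 3-minute figure holds only if each call's timeout is also clamped to the remaining budget; if the cap is checked only between iterations, a call that starts just under 180s can still run its full 90s.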
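One way the wall-clock cap could look, as a sketch rather than the actual handle() body. handle_with_budget and call_once are hypothetical names; call_once(timeout_s) is assumed to return a (reply, done) pair, and the clock is injectable so the logic can be checked without waiting.

```python
import time


def handle_with_budget(call_once, max_iters, t_budget_s=180.0,
                       call_timeout_s=90.0, clock=time.monotonic):
    """Run up to max_iters brain calls without exceeding t_budget_s in total.

    call_once(timeout_s) is a hypothetical callable returning (reply, done).
    """
    start = clock()
    reply = None
    for _ in range(max_iters):
        remaining = t_budget_s - (clock() - start)
        if remaining <= 0:
            break  # budget spent: return whatever reply we have so far
        # Clamp the per-call timeout so one call cannot overrun the budget.
        reply, done = call_once(min(call_timeout_s, remaining))
        if done:
            break
    return reply
```

Clamping the per-call timeout to the remaining budget is what turns the 180s figure into a hard ceiling instead of a check that only runs between iterations.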

Fix C, observability: in-bot heartbeat + watchdog probe

Files:

  • forge_telegram_coordinator_bot.py, near the top of the main loop: write /run/user/1000/forge-coordinator.heartbeat (mtime-only) every poll iteration.
  • scripts/forge_bot_watchdog.sh (or wherever forge-bot-watchdog.service is wired): add a check, if heartbeat mtime is more than 300s old, systemctl restart forge-lifeos-coordinator.

Catches a future stall mode the moment the poll loop goes quiet, even if the process is "alive" to systemd.
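The heartbeat pair can be sketched as below. Both function names are hypothetical, and the staleness check is shown in Python purely for illustration, since the real watchdog is a shell script; a missing heartbeat file is treated as stale, which is an assumption about the desired behavior.

```python
import os
import time
from pathlib import Path


def touch_heartbeat(path):
    """Bot side: bump the heartbeat file's mtime once per poll iteration."""
    Path(path).touch()


def heartbeat_is_stale(path, max_age_s=300.0, now=None):
    """Watchdog side: True if the heartbeat is missing or older than max_age_s."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return True  # assumption: no heartbeat yet counts as stale
    now = time.time() if now is None else now
    return (now - mtime) > max_age_s
```

The watchdog would restart the unit only when heartbeat_is_stale returns True, so a process that is alive but no longer polling is no longer invisible.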

Fix D, environmental: clear /tmp EROFS for spawn tools

File: /etc/systemd/system/forge-lifeos-coordinator.service.d/loosen-spawn.conf

Change: add PrivateTmp=no. Verify with systemd-analyze security forge-lifeos-coordinator.service after reload. Without this, tool_spawn_remote_session will hit EROFS again the next time tmux tries to recreate /tmp/tmux-1000/.

Fix E, out of scope here: re-auth Google Calendar

~/.forge-secrets/google-calendar.env refresh token is dead. Run scripts/forge_google_calendar.py --reauth (or whatever the canonical re-auth path is in reference_google_calendar_client.md). Until done, every calendar tool call adds noise to Sonnet's context and uses iterations on apologies.

Suggested order

  1. Fix E first, takes 30 seconds, removes one source of iteration burn.
  2. Fix B, single-file, low risk, immediately bounds worst case.
  3. Fix C, gives us a true liveness signal before the bigger refactor.
  4. Fix A, the actual structural fix. Bigger change, do it after C is in place so we have a fallback.
  5. Fix D, the next time the coordinator unit is touched.

Things explicitly ruled out

  • Telegram rate limiting / wedged getUpdates. Long-poll uses requests.post(..., timeout=LONG_POLL_TIMEOUT + 10) and recovers cleanly. No Max retries exceeded in today's log.
  • Memory / CPU pressure. 24.9M RSS, 43s CPU over 2.5h. Cgroup is cool.
  • KillMode=mixed regression. Today's drop-in is healthy, no spawn-kill cascades since.

Next worker

Pick up Fix B + Fix E in one PR. They are 1-file each and independently verifiable. Then schedule Fix A as a feature-plan task; it is a real refactor and deserves a spec.

[Claude Code, debug worker]