Coord-Stall Fix A, Exec Summary
URL: https://mkdocs.justinsforge.com/memory/investigations/coord-stall-fixa-2026-05-04/
Date: 2026-05-04
Companion plan: coord-stall-fix-a-2026-05-04
Source handoff: coord-stall-debug-2026-05-04
Scope
Fix A is the structural fix for the coordinator-bot "deaf for 1-5 min" stall: forge_telegram_coordinator_bot.py calls brain.handle() synchronously on the poll thread, so a single Sonnet retry storm blocks getUpdates for up to MAX_ITERS x 300s. This plan offloads each Telegram update to a daemon multiprocessing.Process child. The poll loop returns to Telegram in milliseconds, the 👀 reaction fires from the child immediately on receipt, and a per-chat flock keeps capture order intact without forcing global serial behavior. Touches one bot file plus a smoke script; documentation lands in MEMORY.md + a topic file + LESSONS.md.
Out of scope: Fix B/C/D/E from the debug handoff, capture and general-purpose bot refactors, async/aiogram migration.
Risk
- Medium-low overall. Single file with a well-defined extension point; rollback is git revert plus systemctl --user restart forge-lifeos-coordinator.
- Fork hygiene is the main hazard: leaked fds, zombie children, or env contamination from claude -p subprocess inheritance. Mitigated by daemon=True, explicit .join() reaping every poll iteration, and a MAX_CONCURRENT_BRAINS=4 cap.
- Threads-vs-subprocess.run gotcha (per feedback_no_threads_with_subprocess_run.md) is explicitly avoided by using multiprocessing, not threading.
- Filelock fallback preserves liveness if /run/user/1000/ is unavailable (degrades to "messages may interleave" rather than "bot dies").
- No DB schema, secrets, or external API changes. systemd unit untouched.
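The mitigations listed above (non-blocking reaping on each poll iteration, the concurrency cap, and the lock-directory fallback) might look something like this. The helper names reap, try_spawn, and pick_lock_dir are hypothetical, and the fork start method is an assumption; only MAX_CONCURRENT_BRAINS=4, daemon=True, and the /run/user/1000/ path come from the plan.

```python
import multiprocessing as mp
import os
import time

MAX_CONCURRENT_BRAINS = 4  # cap named in the plan
_CTX = mp.get_context("fork")  # assumption: fork start method


def reap(children):
    """Join finished children without blocking; return those still running."""
    alive = []
    for child in children:
        child.join(timeout=0)  # returns at once; collects exit status if done
        if child.is_alive():
            alive.append(child)
    return alive


def try_spawn(children, target, args=()):
    """Start a daemon child unless the cap is reached.

    Returns the updated child list and whether a child was started.
    """
    children = reap(children)
    if len(children) >= MAX_CONCURRENT_BRAINS:
        return children, False
    child = _CTX.Process(target=target, args=args, daemon=True)
    child.start()
    children.append(child)
    return children, True


def pick_lock_dir(preferred="/run/user/1000"):
    """Return the lock directory if usable, else None (run without locks)."""
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    return None  # degraded mode: messages may interleave, bot stays alive
```

Calling reap() from the poll loop on every iteration is what keeps exited children from lingering as zombies; what the real bot does with an update that arrives while the cap is reached is not specified here.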
Effort estimate
- 8 tasks, each sized to 2-8 minutes per the writing-plans rule.
- Total focused implementation: 45-75 minutes including the smoke test and documentation.
- Plus 5-10 minutes of live phone testing (Task 7) to confirm 👀 latency on rapid messages.
- Each task is its own commit; eval harness runs in Task 8 to catch regressions.
Recommended sequence vs Fix B + Fix E
The debug handoff's "Suggested order" was: E → B → C → A → D. Fix A is the structural fix and the largest change; the smaller fixes contain blast radius first. Endorsing that order with one tweak:
- Fix E first (Calendar re-auth). 30-second secret rotation; eliminates one source of iteration burn that contaminates brain context. Independent of A. Do this today.
- Fix B (per-message wall-clock budget). 1-file change in forge_telegram_brain.py: lower coordinator timeout_s 150 → 90, drop iteration-level retry on timeout (keep on parse error), add a t_budget = 180 wall-clock cap inside handle(). Bounds worst-case latency without depending on A. Land in same PR as E per the handoff's "next worker" note.
- Fix C (heartbeat probe) before A so we have a true liveness signal that survives the refactor.
- Fix A (this plan). The actual structural fix. Land after B + C are in place: B caps the duration of any single brain call, C catches a future stall mode, and A removes the head-of-line block. With B already in place, A's correctness is easier to verify because the worst-case child runtime is bounded.
- Fix D (PrivateTmp=no) the next time the coordinator unit is touched.
Net recommendation: treat Fix A as a separate PR after E + B + C ship, not as a parallel track. The smaller fixes are the prerequisites that let A be evaluated cleanly: without B, an A-only world still lets a single brain call burn 5 minutes (it just doesn't block other messages); with B in place, A's per-chat lock is the right shape for the bounded brain calls B produces.
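To make the Fix B bound concrete, here is a rough sketch of a wall-clock budget wrapped around a retry loop. It is illustrative only: the handle signature, call_model, ParseError, and the MAX_ITERS value are assumptions, while the 90-second per-call timeout and 180-second budget are the figures proposed for Fix B.

```python
import time

TIMEOUT_S = 90   # per-call timeout proposed for Fix B
T_BUDGET = 180   # per-message wall-clock cap proposed for Fix B
MAX_ITERS = 5    # hypothetical; the source names MAX_ITERS but not its value


class ParseError(Exception):
    """Hypothetical stand-in for the brain's "parse error" condition."""


def handle(call_model, clock=time.monotonic):
    """Retry on parse errors only; stop once the wall-clock budget is spent."""
    deadline = clock() + T_BUDGET
    for _ in range(MAX_ITERS):
        remaining = deadline - clock()
        if remaining <= 0:
            break
        try:
            # Never give one call more time than the budget has left.
            return call_model(timeout=min(TIMEOUT_S, remaining))
        except ParseError:
            continue  # retry is kept for parse errors
        except TimeoutError:
            break  # no iteration-level retry on timeout
    return None
```

With this shape, a brain child started under Fix A should run for roughly T_BUDGET seconds at most, provided call_model honors its timeout.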
Success criteria
- Two Telegram messages sent 2 seconds apart both receive 👀 reactions within 1 second.
- The coordinator's main-loop heartbeat mtime stays under 60 seconds old continuously even while a brain child is running for 4+ minutes.
- journalctl --user -u forge-lifeos-coordinator shows child started / child reaped lines, zero Errno 8 Exec format error, zero dispatch crash.
- bash scripts/forge_eval_run.sh passes after Task 8.
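The heartbeat criterion above could be checked with a small probe along these lines. This is a sketch under assumptions: the heartbeat file's path is not given in this summary, so the path argument is whatever the coordinator actually touches, and the function names are hypothetical.

```python
import os
import time

MAX_AGE_S = 60  # threshold from the success criteria


def heartbeat_age(path, now=None):
    """Seconds since the heartbeat file was last modified."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime


def heartbeat_fresh(path, now=None):
    """True while the heartbeat is younger than MAX_AGE_S seconds."""
    return heartbeat_age(path, now) < MAX_AGE_S
```

Polling heartbeat_fresh() every few seconds while a long brain child runs would show whether the main loop keeps ticking.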