Coord-Stall Fix A, Exec Summary

URL: https://mkdocs.justinsforge.com/memory/investigations/coord-stall-fixa-2026-05-04/

Date: 2026-05-04
Companion plan: coord-stall-fix-a-2026-05-04
Source handoff: coord-stall-debug-2026-05-04

Scope

Fix A is the structural fix for the coordinator-bot "deaf for 1-5 min" stall: forge_telegram_coordinator_bot.py calls brain.handle() synchronously on the poll thread, so a single Sonnet retry storm blocks getUpdates for up to MAX_ITERS × 300 s. This plan offloads each Telegram update to a daemon multiprocessing.Process child. The poll loop returns to Telegram in milliseconds, the 👀 reaction fires from the child immediately on receipt, and a per-chat flock keeps capture order intact within each chat without serializing all chats globally. The change touches one bot file plus a smoke script; documentation lands in MEMORY.md, a topic file, and LESSONS.md.
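The offload described above can be sketched roughly as follows. This is a minimal illustration, not the code in forge_telegram_coordinator_bot.py: the names dispatch, _handle_in_child, react, brain, and lock_dir are hypothetical, the update is assumed to be a dict shaped like a Telegram Bot API Update, and the "fork" start method is an assumption chosen to match the fork-hygiene concerns in this plan.

```python
import fcntl
import multiprocessing
import os

# Assumption: children are forked, so plain callables can be passed without pickling.
_ctx = multiprocessing.get_context("fork")


def _handle_in_child(update, react, brain, lock_dir):
    """Runs in the child: acknowledge first, then serialize per chat."""
    chat_id = update["message"]["chat"]["id"]
    react(update)  # hypothetical callable that sends the reaction before any waiting
    lock_path = os.path.join(lock_dir, f"chat-{chat_id}.lock")
    with open(lock_path, "a") as lock_file:
        # Exclusive lock on a per-chat file: only children for the same chat wait here.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            brain(update)  # the slow call that used to run on the poll thread
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def dispatch(update, react, brain, lock_dir):
    """Called from the poll loop: start a daemon child and return immediately."""
    child = _ctx.Process(
        target=_handle_in_child,
        args=(update, react, brain, lock_dir),
        daemon=True,
    )
    child.start()
    return child
```

The poll loop would keep the returned Process objects so it can reap them later. Note that flock serializes handling within a chat but does not by itself guarantee first-in, first-out acquisition; in this sketch ordering relies on the earlier child reaching the lock first.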

Out of scope: Fix B/C/D/E from the debug handoff, capture and general-purpose bot refactors, async/aiogram migration.

Risk

Medium-low overall. Single file with a well-defined extension point; rollback is git revert plus systemctl --user restart forge-lifeos-coordinator.
Fork hygiene is the main hazard: leaked fds, zombie children, or env contamination from claude -p subprocess inheritance. Mitigated by daemon=True, explicit .join() reaping on every poll iteration, and a MAX_CONCURRENT_BRAINS=4 cap.
Threads-vs-subprocess.run gotcha (per feedback_no_threads_with_subprocess_run.md) is explicitly avoided by using multiprocessing, not threading.
Filelock fallback preserves liveness if /run/user/1000/ is unavailable (degrades to "messages may interleave" rather than "bot dies").
No DB schema, secrets, or external API changes. systemd unit untouched.
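The reaping, concurrency cap, and lock fallback listed above might look something like the sketch below. Only MAX_CONCURRENT_BRAINS and its value of 4 come from the plan; reap_finished, has_capacity, chat_lock, and the lock-file naming are hypothetical, and the fallback here covers only a failure to open the lock file.

```python
import contextlib
import fcntl
import os

MAX_CONCURRENT_BRAINS = 4  # cap named in the plan


def reap_finished(children):
    """Join exited children without blocking; return the ones still running."""
    still_running = []
    for child in children:
        child.join(timeout=0)  # non-blocking; collects the exit status if the child is done
        if child.exitcode is None:
            still_running.append(child)
    return still_running


def has_capacity(children, limit=MAX_CONCURRENT_BRAINS):
    """True if another brain child may be started."""
    return len(children) < limit


@contextlib.contextmanager
def chat_lock(chat_id, lock_dir):
    """Yield True while holding the per-chat lock, or False if it is unavailable."""
    try:
        lock_file = open(os.path.join(lock_dir, f"chat-{chat_id}.lock"), "a")
    except OSError:
        # Degraded mode: messages may interleave, but the bot keeps running.
        yield False
        return
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        yield True
    finally:
        lock_file.close()  # closing the file releases the flock
```

A poll iteration under these assumptions would call reap_finished on its list of children first, then start a new child only if has_capacity returns True.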

Effort estimate

8 tasks, each sized to 2-8 minutes per the writing-plans rule.
Total focused implementation: 45-75 minutes including the smoke test and documentation.
Plus 5-10 minutes of live phone testing (Task 7) to confirm 👀 latency on rapid messages.
Each task is its own commit; eval harness runs in Task 8 to catch regressions.

Recommended sequence vs Fix B + Fix E

The debug handoff's "Suggested order" was: E → B → C → A → D. Fix A is the structural fix and the largest change; the smaller fixes contain blast radius first. Endorsing that order with one tweak:

Fix E first (Calendar re-auth). 30-second secret rotation; eliminates one source of iteration burn that contaminates brain context. Independent of A. Do this today.
Fix B (per-message wall-clock budget). 1-file change in forge_telegram_brain.py: lower coordinator timeout_s 150 → 90, drop iteration-level retry on timeout (keep on parse error), add a t_budget = 180 wall-clock cap inside handle(). Bounds worst-case latency without depending on A. Land in same PR as E per the handoff's "next worker" note.
Fix C (heartbeat probe) before A so we have a true liveness signal that survives the refactor.
Fix A (this plan). The actual structural fix. Land after B + C are in place: B caps the duration of any single brain call, C catches a future stall mode, and A removes the head-of-line block. With B already in place, A's correctness is easier to verify because the worst-case child runtime is bounded.
Fix D (PrivateTmp=no) the next time the coordinator unit is touched.

Net recommendation: treat Fix A as a separate PR after E + B + C ship, not as a parallel track. The smaller fixes are the prerequisites that let A be evaluated cleanly: without B, an A-only world still lets a single brain call burn 5 minutes (it just doesn't block other messages); with B in place, A's per-chat lock is the right shape for the bounded brain calls B produces.
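To make the bound that Fix B provides concrete, here is a rough sketch of a wall-clock budget inside a handle()-style loop. It is not the code in forge_telegram_brain.py: call_model, BrainTimeout, BrainParseError, and the max_iters default are hypothetical placeholders; only the 90-second per-call timeout, the 180-second t_budget, and the retry policy (retry on parse error, not on timeout) come from the plan.

```python
import time


class BrainTimeout(Exception):
    """Hypothetical: a single model call exceeded its timeout."""


class BrainParseError(Exception):
    """Hypothetical: the model reply could not be parsed."""


def handle(message, call_model, max_iters=4, timeout_s=90, t_budget=180,
           clock=time.monotonic):
    """Run up to max_iters model calls, never exceeding t_budget seconds overall."""
    deadline = clock() + t_budget
    for _ in range(max_iters):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # wall-clock budget spent
        try:
            # Each call gets the smaller of the per-call timeout and what is left.
            return call_model(message, timeout=min(timeout_s, remaining))
        except BrainParseError:
            continue  # parse errors are still retried
        except BrainTimeout:
            break  # no iteration-level retry on timeout
    return None
```

Because every call is capped by the time remaining, the total runtime of one message stays near t_budget regardless of how many iterations are attempted, which is what makes a child's worst-case runtime under Fix A easy to reason about.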

Success criteria

Two Telegram messages sent 2 seconds apart both receive 👀 reactions within 1 second.
The coordinator's main-loop heartbeat mtime stays under 60 seconds old continuously even while a brain child is running for 4+ minutes.
journalctl --user -u forge-lifeos-coordinator shows child started / child reaped lines, with zero Errno 8 Exec format error entries and zero dispatch crash entries.
bash scripts/forge_eval_run.sh passes after Task 8.
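The heartbeat criterion could be checked with something like the following sketch. The function names and the idea of a single heartbeat file touched by the main loop are assumptions for illustration; the plan does not state the file's path or how the heartbeat is written, and only the 60-second threshold comes from the criteria above.

```python
import os
import pathlib
import time


def touch_heartbeat(path):
    """What the main loop might do each iteration: bump the file's mtime."""
    pathlib.Path(path).touch()


def heartbeat_age_s(path, now=None):
    """Seconds since the heartbeat file was last modified."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime


def is_live(path, max_age_s=60.0, now=None):
    """True if the heartbeat exists and is younger than max_age_s."""
    try:
        return heartbeat_age_s(path, now) < max_age_s
    except FileNotFoundError:
        return False
```

Polling is_live every few seconds while a long brain child is running would show whether the main loop keeps iterating independently of that child.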