Plan: Coordinator Stall Fix A, decouple brain.handle from the poll loop

URL: https://mkdocs.justinsforge.com/memory/plans/coord-stall-fix-a-2026-05-04/

Date: 2026-05-04

Approved spec: the main loop in forge_telegram_coordinator_bot.py currently calls brain.handle() synchronously, which can block getUpdates for up to MAX_ITERS x 300s while a single Sonnet brain call retries. New messages from Justin queue at Telegram and the bot looks "stalled" even though systemd sees a healthy process. Fix A hands each incoming message to a daemon child process so the poll loop returns to getUpdates in milliseconds. The 👀 reaction and typing indicator fire from the child immediately on message receipt; brain calls serialize per chat via a filesystem lock, so capture order is preserved without forcing global serial behavior.

Source spec: memory/handoffs/coord-stall-debug-2026-05-04.md, Fix A.

Out of scope:

  • Fix B (per-message wall-clock budget inside handle())
  • Fix C (in-bot heartbeat + watchdog probe)
  • Fix D (PrivateTmp=no on the coordinator unit)
  • Fix E (Google Calendar refresh-token re-auth)
  • Capture/general-purpose bot refactor (this plan only touches the coordinator)
  • Replacing multiprocessing with asyncio or aiogram (separate effort)

Task list

Task 1: Extract child-process worker function

  • Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
  • What: Lift the body of handle_message (lines ~180-294, react→download→brain→send→react) into a new module-level function _handle_message_in_child(token, authorized_id, update, persona). The function must construct its own brain instance inside the child rather than receive one from the parent (under the fork start method the child gets a copy of the parent's state, not a pickled argument, and handles inherited that way are not safe to share between processes). Keep the existing handle_message signature alive as a thin shim for now so the rest of the file still imports cleanly.
  • Verification: python -c "import forge_telegram_coordinator_bot as m; assert callable(m._handle_message_in_child)"
  • Commit: coord-bot: extract _handle_message_in_child for multiprocessing dispatch

Task 2: Add per-chat brain lock

  • Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
  • What: Inside _handle_message_in_child, wrap only the brain.handle(...) call in an fcntl.flock on /run/user/1000/forge-coordinator-chat-<chat_id>.lock. The 👀 react, typing indicator, voice download, and transcribe step run BEFORE acquiring the lock so the user sees the ack immediately; the lock only serializes Sonnet calls so that two rapid messages do not interleave Inbox writes. If opening the lock file raises OSError (for example because the lock directory is missing), fall back to running without a lock and log a warning.
  • Verification: python -c "import forge_telegram_coordinator_bot as m, inspect; assert 'flock' in inspect.getsource(m._handle_message_in_child)"
  • Commit: coord-bot: serialize brain.handle per chat via filelock, ack stays instant
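
The per-chat lock with its no-lock fallback might look like the sketch below. The helper name chat_lock and its lock_dir parameter are assumptions made for illustration; only the lock path pattern and the fallback behavior come from this plan.

```python
import contextlib
import fcntl
import logging
import os

log = logging.getLogger("coordinator")

LOCK_DIR = "/run/user/1000"  # directory named in Task 2


@contextlib.contextmanager
def chat_lock(chat_id, lock_dir=LOCK_DIR):
    """Hold an exclusive flock for one chat; yield False if no lock could be taken."""
    path = os.path.join(lock_dir, f"forge-coordinator-chat-{chat_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    except OSError as exc:
        log.warning("chat lock unavailable (%s); continuing without it", exc)
        fd = None
    if fd is None:
        yield False  # degraded mode: messages may interleave
        return
    try:
        # Blocks until earlier messages in the same chat have released the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield True
    finally:
        os.close(fd)  # closing the descriptor releases the flock
```

Inside the worker, only the brain call would sit under `with chat_lock(chat_id):`, so the reaction, typing indicator, download, and transcription still run before any waiting begins.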

Task 3: Replace synchronous handle_message call with mp.Process spawn

  • Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
  • What: In main() (lines ~322-355), where handle_message(token, authorized_id, msg) is currently called, build mp.Process(target=_handle_message_in_child, args=(token, authorized_id, update, PERSONA), daemon=True), call .start(), and append to a module-level active_children: list[mp.Process] = []. Import multiprocessing as mp at the top. Use mp.set_start_method("fork", force=True) once at module load (Linux-only is fine, this is Console).
  • Verification: systemctl --user restart forge-lifeos-coordinator && sleep 3 && systemctl --user status forge-lifeos-coordinator | head -20 (must show active running, no traceback)
  • Commit: coord-bot: dispatch each Telegram update to a daemon child process
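
A minimal sketch of the dispatch described in Task 3 follows. The dispatch helper, the PERSONA value, and the empty worker body are placeholders introduced here; the mp.Process arguments and the module-level active_children list follow the task text.

```python
import multiprocessing as mp

# Linux-only start method, set once at module load as Task 3 specifies.
mp.set_start_method("fork", force=True)

PERSONA = "coordinator"  # placeholder value; the real constant lives in the bot
active_children: list[mp.Process] = []


def _handle_message_in_child(token, authorized_id, update, persona):
    """Placeholder worker; the real one reacts, downloads, calls the brain, and replies."""


def dispatch(token, authorized_id, update):
    """Start one daemon child for this update and return without waiting for it."""
    child = mp.Process(
        target=_handle_message_in_child,
        args=(token, authorized_id, update, PERSONA),
        daemon=True,
    )
    child.start()
    active_children.append(child)
    return child
```

Because dispatch returns as soon as the child has started, the poll loop gets back to getUpdates regardless of how long the brain call takes.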

Task 4: Reap finished children and cap concurrency

  • Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
  • What: At the top of every poll iteration in main(), remove every child for which p.is_alive() is false from active_children, calling p.join(0) on each so it does not linger as a zombie. If len(active_children) >= MAX_CONCURRENT_BRAINS (constant = 4), block on the oldest child's .join(timeout=10) before pulling the next batch of updates. Log a warning when the cap is hit (it signals a brain backlog). On SIGTERM in stop(), call .terminate() on any still-running children, then .join(5).
  • Verification: python -c "import forge_telegram_coordinator_bot as m; assert m.MAX_CONCURRENT_BRAINS == 4"
  • Commit: coord-bot: reap child processes and cap concurrent brains at 4
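
The reap, cap, and shutdown steps from Task 4 can be sketched as three small helpers. The function names are hypothetical; the constant, the timeouts, and the terminate-then-join order are taken from the task text.

```python
import logging
import multiprocessing as mp

log = logging.getLogger("coordinator")

MAX_CONCURRENT_BRAINS = 4
active_children = []  # mp.Process objects appended by the dispatch step


def reap_children():
    """Join finished children so they do not linger as zombies; keep the live ones."""
    still_running = []
    for p in active_children:
        if p.is_alive():
            still_running.append(p)
        else:
            p.join(0)
    active_children[:] = still_running


def wait_for_capacity(timeout=10):
    """If the cap is reached, wait on the oldest child before polling again."""
    reap_children()
    if len(active_children) >= MAX_CONCURRENT_BRAINS:
        log.warning("brain cap %d reached; waiting on oldest child", MAX_CONCURRENT_BRAINS)
        active_children[0].join(timeout=timeout)
        reap_children()


def stop_children():
    """SIGTERM path: terminate whatever is still running, then join each briefly."""
    for p in active_children:
        if p.is_alive():
            p.terminate()
    for p in active_children:
        p.join(5)
    active_children.clear()
```

Calling wait_for_capacity() at the top of each poll iteration covers both the pruning and the cap; stop_children() would be called from stop().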

Task 5: Decouple heartbeat from brain

  • Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
  • What: Confirm heartbeat() is called once per poll-loop iteration (already on line 324) and remove any heartbeat call from _handle_message_in_child if Task 1 carried one over. Heartbeat must reflect main-loop liveness only, so a stuck brain child does not give a false-alive signal. Add a brief comment noting this contract.
  • Verification: grep -n "heartbeat()" /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py (heartbeat calls only inside main(), none inside child or handle_message shim)
  • Commit: coord-bot: heartbeat reflects main-loop liveness only, not brain children

Task 6: Smoke test concurrent dispatch

  • Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_smoke.sh (new)
  • What: Small bash script that prints the manual test recipe Justin runs from his phone: send two messages 2 seconds apart, expect 👀 reaction on both within ~1s of send, expect both 👌 replies within the brain budget, no 🤔 Empty message or duplicated work. Script also greps the last 200 lines of journalctl --user -u forge-lifeos-coordinator for child started, child reaped, and absence of dispatch crash. Make executable.
  • Verification: bash /home/justinwieb/forge/scripts/forge_telegram_coordinator_smoke.sh (prints the recipe + journal grep summary, exits 0)
  • Commit: coord-bot: add smoke-test script for concurrent message dispatch

Task 7: Restart unit and verify in production

  • Files: none (operational step)
  • What: systemctl --user daemon-reload && systemctl --user restart forge-lifeos-coordinator. Then send two real Telegram messages in rapid succession from Justin's phone. Confirm both 👀 reactions appear within 1 second. Confirm replies arrive in order. Tail journal for any Errno 8 Exec format error (the threading-vs-subprocess race the feedback_no_threads_with_subprocess_run.md rule warns about, which multiprocessing should avoid).
  • Verification: journalctl --user -u forge-lifeos-coordinator -n 100 --no-pager | grep -E "child started|brain.handle|Errno 8" (child started lines present, no Errno 8)
  • Commit: none (operational; results captured in next task's handoff)

Task 8: Register and document

  • Files:
      • /home/justinwieb/forge/memory/general/reference_coordinator_concurrent_dispatch.md (new)
      • /home/justinwieb/.claude/projects/-home-justinwieb-forge/memory/MEMORY.md (one-line index entry under Telegram Surface)
      • /home/justinwieb/forge/LESSONS.md (one-line entry: stall mode root cause + Fix A landed)
  • What: Topic file documents the per-chat filelock contract, MAX_CONCURRENT_BRAINS knob, child-reap protocol, and the _handle_message_in_child extension point for future bot tools. MEMORY.md gets a one-line pointer per feedback_register_tools.md. LESSONS.md captures the structural bug + the eval check we should add (assert no synchronous brain call in poll loop). Run bash /home/justinwieb/forge/scripts/forge_eval_run.sh to confirm no regressions.
  • Verification: bash /home/justinwieb/forge/scripts/forge_eval_run.sh && grep -c "reference_coordinator_concurrent_dispatch" /home/justinwieb/.claude/projects/-home-justinwieb-forge/memory/MEMORY.md (eval passes, grep returns 1)
  • Commit: coord-bot: document concurrent dispatch contract + register in MEMORY.md
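
The eval check mentioned in Task 8 (no synchronous brain call in the poll loop) could be approximated with a static scan. The sketch below is one possible heuristic, not the existing eval harness: the function name is hypothetical, and it only catches direct calls spelled handle_message(...) or brain.handle(...) inside main().

```python
import ast


def sync_brain_calls(source, func_name="main"):
    """Return line numbers inside func_name that call the brain synchronously."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.FunctionDef) and node.name == func_name):
            continue
        for call in ast.walk(node):
            if not isinstance(call, ast.Call):
                continue
            f = call.func
            if isinstance(f, ast.Name) and f.id == "handle_message":
                hits.append(call.lineno)
            elif (
                isinstance(f, ast.Attribute)
                and f.attr == "handle"
                and isinstance(f.value, ast.Name)
                and f.value.id == "brain"
            ):
                hits.append(call.lineno)
    return sorted(hits)
```

An eval could read the bot's source file and assert that the returned list is empty. A call reached indirectly through another helper would not be detected, so this complements the manual test in Task 7 rather than replacing it.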

Risk notes

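  • Forking inside a long-running daemon can leak fds. daemon=True + close-on-exec is fine for short-lived children, but verify in Task 7 that the fd count of the coordinator's main process does not climb. pidof -s python3 may return an unrelated Python process, so take the PID from systemd instead: ls /proc/$(systemctl --user show -p MainPID --value forge-lifeos-coordinator)/fd | wc -l, before and after 10 messages.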
  • mp.set_start_method("fork") is required because claude -p brain children inherit env. spawn would re-import the bot module from scratch and slow first response by ~1s.
  • Filelock fallback if /run/user/1000/ is missing (e.g. systemd user manager dropped XDG_RUNTIME_DIR): the no-lock fallback in Task 2 means we degrade to "messages may interleave" rather than "bot dies". Acceptable.
  • Per-chat serialization is single-user-correct. Justin is the only authorized user. If a future bot version supports multiple authorized users, MAX_CONCURRENT_BRAINS=4 already supports 4 parallel chats; the lock is keyed per chat_id.
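
For the fd-leak risk in the first note, the same count can be taken from Python, for example to log it from the poll loop. This is a small sketch assuming Linux with /proc mounted; the helper name is hypothetical.

```python
import os


def open_fd_count(pid="self"):
    """Count open file descriptors of a process by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Comparing the value before and after a batch of messages gives the same signal as the shell one-liner; a count that keeps growing across batches would point to descriptors surviving past the children that opened them.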