Plan: Coordinator Stall Fix A, decouple brain.handle from the poll loop
URL: https://mkdocs.justinsforge.com/memory/plans/coord-stall-fix-a-2026-05-04/
Date: 2026-05-04
Approved spec: The main loop in forge_telegram_coordinator_bot.py currently calls brain.handle() synchronously, which can block getUpdates for MAX_ITERS x 300s while a single Sonnet brain call retries. New messages from Justin queue at Telegram, and the bot looks "stalled" even though systemd sees a healthy process. Fix A dispatches each incoming message to a daemon child process so the poll loop returns to getUpdates in milliseconds. The 👀 reaction and typing indicator fire from the child immediately on message receipt; brain calls serialize per chat via a filesystem lock, so capture order is preserved without forcing global serial behavior. Source spec: memory/handoffs/coord-stall-debug-2026-05-04.md, Fix A.
Out of scope:
- Fix B (per-message wall-clock budget inside handle())
- Fix C (in-bot heartbeat + watchdog probe)
- Fix D (PrivateTmp=no on the coordinator unit)
- Fix E (Google Calendar refresh-token re-auth)
- Capture/general-purpose bot refactor (this plan only touches the coordinator)
- Replacing multiprocessing with asyncio or aiogram (separate effort)
Task list
Task 1: Extract child-process worker function
- Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
- What: Lift the body of handle_message (lines ~180-294, react→download→brain→send→react) into a new module-level function _handle_message_in_child(token, authorized_id, update, persona). The function must construct its own brain instance inside the child rather than receive one from the parent: under fork the child would otherwise inherit a copy of the parent's instance along with any open connections or locks it may hold, and under spawn the instance would have to be picklable. Keep the existing handle_message signature alive as a thin shim for now so the rest of the file still imports cleanly.
- Verification: python -c "import forge_telegram_coordinator_bot as m; assert callable(m._handle_message_in_child)"
- Commit: coord-bot: extract _handle_message_in_child for multiprocessing dispatch
Task 2: Add per-chat brain lock
- Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
- What: Inside _handle_message_in_child, wrap only the brain.handle(...) call in an fcntl.flock on /run/user/1000/forge-coordinator-chat-<chat_id>.lock. The 👀 react, typing indicator, voice download, and transcribe step run BEFORE acquiring the lock so the user sees the ack immediately; the lock only serializes Sonnet calls so two rapid messages do not interleave Inbox writes. On OSError when opening the lock dir, fall back to no lock and log a warning.
- Verification: python -c "import forge_telegram_coordinator_bot as m, inspect; assert 'flock' in inspect.getsource(m._handle_message_in_child)"
- Commit: coord-bot: serialize brain.handle per chat via filelock, ack stays instant
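A minimal sketch of the per-chat lock from Task 2, not the bot's actual code. The context manager name chat_lock, the lock_dir parameter, and the logger name are hypothetical; only the lock path pattern and the OSError fallback come from the plan.

```python
import contextlib
import fcntl
import logging
import os

log = logging.getLogger("coord-bot")  # hypothetical logger name

LOCK_DIR = "/run/user/1000"  # directory named in the plan


@contextlib.contextmanager
def chat_lock(chat_id, lock_dir=LOCK_DIR):
    """Hold an exclusive flock for one chat; yield False if no lock could be taken."""
    path = os.path.join(lock_dir, f"forge-coordinator-chat-{chat_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    except OSError as exc:
        # Degrade to "messages may interleave" instead of crashing the child.
        log.warning("chat lock unavailable (%s); running unlocked", exc)
        yield False
        return
    try:
        # Blocks until any earlier child for the same chat releases the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield True
    finally:
        # Closing the descriptor also releases the flock.
        os.close(fd)
```

In the child, the ack steps (react, typing indicator, download, transcribe) would run first, and only the brain call would sit inside the with chat_lock(chat_id): block.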
Task 3: Replace synchronous handle_message call with mp.Process spawn
- Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
- What: In main() (lines ~322-355), where handle_message(token, authorized_id, msg) is currently called, build mp.Process(target=_handle_message_in_child, args=(token, authorized_id, update, PERSONA), daemon=True), call .start(), and append it to a module-level active_children: list[mp.Process] = []. Import multiprocessing as mp at the top. Use mp.set_start_method("fork", force=True) once at module load (Linux-only is fine, this is Console).
- Verification: systemctl --user restart forge-lifeos-coordinator && sleep 3 && systemctl --user status forge-lifeos-coordinator | head -20 (must show active running, no traceback)
- Commit: coord-bot: dispatch each Telegram update to a daemon child process
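The dispatch step in Task 3 could look roughly like the sketch below. The helper name dispatch, the PERSONA value, and the body of the worker are hypothetical stand-ins; the mp.Process arguments, the active_children list, and the fork start method follow the plan.

```python
import multiprocessing as mp
import time

# The plan pins the start method once at module load (Linux only).
mp.set_start_method("fork", force=True)

PERSONA = "coordinator"  # hypothetical placeholder value
active_children: list[mp.Process] = []


def _handle_message_in_child(token, authorized_id, update, persona):
    """Stand-in for the Task 1 worker; the sleep simulates a slow brain call."""
    time.sleep(0.2)


def dispatch(token, authorized_id, update, persona=PERSONA):
    """Start one daemon child for an update and return without waiting for it."""
    proc = mp.Process(
        target=_handle_message_in_child,
        args=(token, authorized_id, update, persona),
        daemon=True,
    )
    proc.start()
    active_children.append(proc)
    return proc
```

Because dispatch returns as soon as the child has started, the poll loop can go straight back to getUpdates while the child works through the brain call.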
Task 4: Reap finished children and cap concurrency
- Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
- What: At the top of every poll iteration in main(), remove entries for which p.is_alive() is false from active_children, calling p.join(0) on each so they don't become zombies. If len(active_children) >= MAX_CONCURRENT_BRAINS (constant = 4), block on the oldest child's .join(timeout=10) before pulling the next batch of updates. Log a warning if the cap is hit (signals brain backlog). On SIGTERM in stop(), terminate any still-running children with .terminate() then .join(5).
- Verification: python -c "import forge_telegram_coordinator_bot as m; assert m.MAX_CONCURRENT_BRAINS == 4"
- Commit: coord-bot: reap child processes and cap concurrent brains at 4
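The reap, cap, and shutdown behavior in Task 4 can be sketched as three small helpers. The function names reap_children, wait_for_capacity, and stop_children are hypothetical, as is the logger name; the constant, the timeouts, and the terminate-then-join order come from the plan.

```python
import logging
import multiprocessing as mp

log = logging.getLogger("coord-bot")  # hypothetical logger name

MAX_CONCURRENT_BRAINS = 4


def reap_children(active_children):
    """Drop finished children in place, joining each so none linger as zombies."""
    still_running = []
    for proc in active_children:
        if proc.is_alive():
            still_running.append(proc)
        else:
            proc.join(0)
    active_children[:] = still_running


def wait_for_capacity(active_children, cap=MAX_CONCURRENT_BRAINS, timeout=10):
    """Reap, then wait on the oldest child if the cap is reached."""
    reap_children(active_children)
    if len(active_children) >= cap:
        log.warning("brain backlog: %d children running", len(active_children))
        active_children[0].join(timeout=timeout)
        reap_children(active_children)


def stop_children(active_children, grace=5):
    """Terminate children that are still running, then join each one."""
    for proc in active_children:
        if proc.is_alive():
            proc.terminate()
    for proc in active_children:
        proc.join(grace)
    active_children.clear()
```

Under these assumptions, main() would call wait_for_capacity(active_children) at the top of each poll iteration and stop() would call stop_children(active_children) on SIGTERM.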
Task 5: Decouple heartbeat from brain
- Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py
- What: Confirm heartbeat() is called once per poll-loop iteration (already on line 324) and remove any heartbeat call from _handle_message_in_child if Task 1 carried one over. Heartbeat must reflect main-loop liveness only, so a stuck brain child does not give a false-alive signal. Add a brief comment noting this contract.
- Verification: grep -n "heartbeat()" /home/justinwieb/forge/scripts/forge_telegram_coordinator_bot.py (heartbeat calls only inside main(), none inside child or handle_message shim)
- Commit: coord-bot: heartbeat reflects main-loop liveness only, not brain children
Task 6: Smoke test concurrent dispatch
- Files: /home/justinwieb/forge/scripts/forge_telegram_coordinator_smoke.sh (new)
- What: Small bash script that prints the manual test recipe Justin runs from his phone: send two messages 2 seconds apart, expect a 👀 reaction on both within ~1s of send, expect both 👌 replies within the brain budget, and no "🤔 Empty message" or duplicated work. The script also greps the last 200 lines of journalctl --user -u forge-lifeos-coordinator for "child started", "child reaped", and absence of "dispatch crash". Make executable.
- Verification: bash /home/justinwieb/forge/scripts/forge_telegram_coordinator_smoke.sh (prints the recipe + journal grep summary, exits 0)
- Commit: coord-bot: add smoke-test script for concurrent message dispatch
Task 7: Restart unit and verify in production
- Files: none (operational step)
- What: systemctl --user daemon-reload && systemctl --user restart forge-lifeos-coordinator. Then send two real Telegram messages in rapid succession from Justin's phone. Confirm both 👀 reactions appear within 1 second. Confirm replies arrive in order. Tail the journal for any "Errno 8 Exec format error" (the threading-vs-subprocess race the feedback_no_threads_with_subprocess_run.md rule warns about, which multiprocessing should avoid).
- Verification: journalctl --user -u forge-lifeos-coordinator -n 100 --no-pager | grep -E "child started|brain.handle|Errno 8" (child started lines present, no Errno 8)
- Commit: none (operational; results captured in next task's handoff)
Task 8: Register and document
- Files:
  - /home/justinwieb/forge/memory/general/reference_coordinator_concurrent_dispatch.md (new)
  - /home/justinwieb/.claude/projects/-home-justinwieb-forge/memory/MEMORY.md (one-line index entry under Telegram Surface)
  - /home/justinwieb/forge/LESSONS.md (one-line entry: stall mode root cause + Fix A landed)
- What: Topic file documents the per-chat filelock contract, MAX_CONCURRENT_BRAINS knob, child-reap protocol, and the _handle_message_in_child extension point for future bot tools. MEMORY.md gets a one-line pointer per feedback_register_tools.md. LESSONS.md captures the structural bug + the eval check we should add (assert no synchronous brain call in poll loop). Run bash /home/justinwieb/forge/scripts/forge_eval_run.sh to confirm no regressions.
- Verification: bash /home/justinwieb/forge/scripts/forge_eval_run.sh && grep -c "reference_coordinator_concurrent_dispatch" /home/justinwieb/.claude/projects/-home-justinwieb-forge/memory/MEMORY.md (eval passes, grep returns 1)
- Commit: coord-bot: document concurrent dispatch contract + register in MEMORY.md
Risk notes
- Forking inside a long-running daemon can leak fds. daemon=True + close-on-exec is fine for short-lived children, but verify in Task 7 that the fd count does not climb (ls /proc/$(pidof -s python3)/fd | wc -l before and after 10 messages).
- mp.set_start_method("fork") is required because claude -p brain children inherit env. spawn would re-import the bot module from scratch and slow first response by ~1s.
- Filelock fallback if /run/user/1000/ is missing (e.g. systemd user manager dropped XDG_RUNTIME_DIR): the no-lock fallback in Task 2 means we degrade to "messages may interleave" rather than "bot dies". Acceptable.
- Per-chat serialization is single-user-correct. Justin is the only authorized user. If a future bot version supports multiple authorized users, MAX_CONCURRENT_BRAINS=4 already supports 4 parallel chats; the lock is keyed per chat_id.