Coordinator concurrent dispatch (Fix A, 2026-05-04)¶
URL: https://mkdocs.justinsforge.com/memory/general/reference_coordinator_concurrent_dispatch/
What changed¶
Before: forge_telegram_coordinator_bot.py main loop called handle_message()
synchronously. A single brain.handle could block getUpdates for up to
MAX_ITERS x 300s, leaving the bot deaf to new messages while systemd saw
a healthy process. The "stall mode" Justin felt as 1-5 minute silences.
After: each Telegram update is dispatched to a daemon multiprocessing.Process
child. The main poll loop returns to getUpdates in milliseconds. The 👀
react and typing indicator fire from the child immediately on receipt. Only
brain.handle() itself is serialized, via a per-chat flock at
$XDG_RUNTIME_DIR/forge-coordinator-locks/chat-<chat_id>.lock.
File map¶
forge_telegram_coordinator_bot.py(only file changed for Fix A):_handle_message_in_child(token, authorized_id, msg): body of one message turn. Auth check → 👀 react + typing → voice/photo download → flock → brain.handle → release lock → send + final react → cleanup._acquire_chat_lock/_release_chat_lock: flock helpers; degrade to no-lock ifLOCK_DIRcan't be created._dispatch(token, authorized_id, msg, active): spawns daemon mp.Process, appends to active list, logschild started pid=...._reap_children(active): prunesnot is_alive()children, logs non-zero exit codes. Called once per poll iteration._wait_for_slot(active): blocks on oldest child for 10s if active count >=MAX_CONCURRENT_BRAINS. Logsbrain cap N reachedwhen triggered.MAX_CONCURRENT_BRAINS = 4: hard cap on concurrent message handlers.
Tunables¶
MAX_CONCURRENT_BRAINS(default 4): bump if Justin starts multi-chat (not applicable today, single authorized user).LOCK_DIR(default$XDG_RUNTIME_DIR/forge-coordinator-locks): location of per-chat flock files. Auto-created.mp.set_start_method("fork", force=True)in main(): "fork" inherits the already-imported brain module (~500ms saved per child startup). spawn would also work but adds first-ack latency.
Operational signals¶
- Healthy:
child started pid=...per message,child reaped rc=0shortly after, heartbeat mtime under 60s old. - Degraded:
brain cap 4 reached, waiting on oldest childlines (informational; means brain backlog), but heartbeat stays fresh and watchdog does NOT restart. - Broken:
dispatch crash,Errno 8 Exec format error, or heartbeat older than 360s. Watchdog auto-restarts on heartbeat staleness.
Smoke test¶
bash /home/justinwieb/forge/scripts/forge_telegram_coordinator_smoke.sh
Reads recent journal, counts child started / child exit / dispatch
crashes / Errno 8 / cap-reached lines, checks heartbeat freshness, exits
non-zero on red flags.
Why fork over threads¶
Per feedback_no_threads_with_subprocess_run.md, threading.Thread plus
requests.post races with the main thread's subprocess.run(claude) in
the brain and produces random [Errno 8] Exec format error. The earlier
heartbeat thread caused exactly this; it was removed and replaced with
in-loop heartbeat. multiprocessing sidesteps the race entirely because
each child has its own process space.
Future work¶
- Capture bot (
forge_telegram_inbox_capture_bot.py) has the same synchronousbrain.handleshape. Lower priority (capture brain is faster, single-iteration intent classifier most of the time) but the refactor is the same pattern. - General-purpose / remote-bridge bot: same.
- An eval check could assert "no synchronous brain call in poll loop" by static-analyzing the bot files.
Related¶
- coord-stall-debug-2026-05-04, source debug handoff.
- coord-stall-fix-a-2026-05-04, the implementation plan.
- No threads with subprocess.run, why mp not threads.
[Claude Code]