Skip to content

Coordinator concurrent dispatch (Fix A, 2026-05-04)

URL: https://mkdocs.justinsforge.com/memory/general/reference_coordinator_concurrent_dispatch/

What changed

Before: forge_telegram_coordinator_bot.py main loop called handle_message() synchronously. A single brain.handle could block getUpdates for up to MAX_ITERS x 300s, leaving the bot deaf to new messages while systemd saw a healthy process. The "stall mode" Justin felt as 1-5 minute silences.

After: each Telegram update is dispatched to a daemon multiprocessing.Process child. The main poll loop returns to getUpdates in milliseconds. The 👀 react and typing indicator fire from the child immediately on receipt. Only brain.handle() itself is serialized, via a per-chat flock at $XDG_RUNTIME_DIR/forge-coordinator-locks/chat-<chat_id>.lock.

File map

  • forge_telegram_coordinator_bot.py (only file changed for Fix A):
  • _handle_message_in_child(token, authorized_id, msg): body of one message turn. Auth check → 👀 react + typing → voice/photo download → flock → brain.handle → release lock → send + final react → cleanup.
  • _acquire_chat_lock / _release_chat_lock: flock helpers; degrade to no-lock if LOCK_DIR can't be created.
  • _dispatch(token, authorized_id, msg, active): spawns daemon mp.Process, appends to active list, logs child started pid=....
  • _reap_children(active): prunes not is_alive() children, logs non-zero exit codes. Called once per poll iteration.
  • _wait_for_slot(active): blocks on oldest child for 10s if active count >= MAX_CONCURRENT_BRAINS. Logs brain cap N reached when triggered.
  • MAX_CONCURRENT_BRAINS = 4: hard cap on concurrent message handlers.

Tunables

  • MAX_CONCURRENT_BRAINS (default 4): bump if Justin starts multi-chat (not applicable today, single authorized user).
  • LOCK_DIR (default $XDG_RUNTIME_DIR/forge-coordinator-locks): location of per-chat flock files. Auto-created.
  • mp.set_start_method("fork", force=True) in main(): "fork" inherits the already-imported brain module (~500ms saved per child startup). spawn would also work but adds first-ack latency.

Operational signals

  • Healthy: child started pid=... per message, child reaped rc=0 shortly after, heartbeat mtime under 60s old.
  • Degraded: brain cap 4 reached, waiting on oldest child lines (informational; means brain backlog), but heartbeat stays fresh and watchdog does NOT restart.
  • Broken: dispatch crash, Errno 8 Exec format error, or heartbeat older than 360s. Watchdog auto-restarts on heartbeat staleness.

Smoke test

bash /home/justinwieb/forge/scripts/forge_telegram_coordinator_smoke.sh

Reads recent journal, counts child started / child exit / dispatch crashes / Errno 8 / cap-reached lines, checks heartbeat freshness, exits non-zero on red flags.

Why fork over threads

Per feedback_no_threads_with_subprocess_run.md, threading.Thread plus requests.post races with the main thread's subprocess.run(claude) in the brain and produces random [Errno 8] Exec format error. The earlier heartbeat thread caused exactly this; it was removed and replaced with in-loop heartbeat. multiprocessing sidesteps the race entirely because each child has its own process space.

Future work

  • Capture bot (forge_telegram_inbox_capture_bot.py) has the same synchronous brain.handle shape. Lower priority (capture brain is faster, single-iteration intent classifier most of the time) but the refactor is the same pattern.
  • General-purpose / remote-bridge bot: same.
  • An eval check could assert "no synchronous brain call in poll loop" by static-analyzing the bot files.

[Claude Code]