---
title: Pure Phoenix Phase 4.9, Troubleshoot-to-Guardrail Pipeline
date: 2026-04-30
status: design, awaiting sign-off
owner: Justin
phase: 4.9 (addendum to Pure Phoenix plan at ~/.claude/plans/yes-lets-go-into-pure-phoenix.md)
---
# Pure Phoenix Phase 4.9: Troubleshoot-to-Guardrail Pipeline
URL: https://mkdocs.justinsforge.com/memory/handoffs/pure-phoenix-phase-4-9-troubleshoot-to-guardrail-2026-04-30/
## Why this exists
Forge already has reactive learning surfaces (LESSONS.md, feedback_*.md topic files, the eval harness, auto-dream). What it lacks is the loop from "issue surfaced in a live session" to "guardrail merged". Today, in-session bugs get fixed, sometimes a memory gets written, occasionally a hook gets added, but there is no contract that says "when X breaks twice, a guard ships." Lessons accumulate, prevention does not.
Phase 4.9 closes that loop with three small additions: a structured incident schema in LESSONS.md, a tiny script to write entries, and a weekly auto-dream nag that surfaces recurring lessons without guards.
## Design principle: seen-twice rule
A single bug does not justify a guardrail. Premature systematization creates abstraction sludge that is harder to evolve than the bug it prevents. The bar:
| Occurrences | Action |
|---|---|
| 1 | Log the lesson. No guard. |
| 2+ (or 1 with high blast radius: data loss, security, silent corruption) | Propose a concrete guard with file path, ship within one session. |
"High blast radius" is judgment-call territory. The doctrine clause below names the four trigger categories so it is not arbitrary.
## The incident loop
When something breaks mid-session, the bot runs four steps before moving on:
- Root-cause one line. Format: "cause was X because Y." Not "I fixed it." If the bot cannot state the cause cleanly, it has not understood the bug yet.
- Log the lesson. Call `forge_incident_log.py` (new, see schema below). Writes to `LESSONS.md` and increments `seen_count` if a matching `incident_id` already exists.
- Decide guard tier. `seen_count == 1` and not high-blast → stop here. Otherwise propose the guard inline (hook, eval check, doctrine clause, boundary sanitizer) with the exact file path. Ship it the same session.
- Link forward. Lesson entry includes a `guard:` field referencing the file/line of the guard. The guard's comment references the `incident_id`. Future audits trace the chain in either direction.
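
The guard-tier decision in the third step can be sketched as a small pure function. This is a minimal sketch, assuming each incident carries the `seen_count` and `blast_radius` values described in this doc; the `Incident` class and the `next_action` function are hypothetical names for illustration, not existing Forge code.

```python
from dataclasses import dataclass

# Hypothetical constant; the doc treats "high" as the first-occurrence trigger.
HIGH_BLAST = "high"


@dataclass
class Incident:
    """Illustrative container, not a real Forge type."""
    incident_id: str
    seen_count: int
    blast_radius: str  # "low" | "medium" | "high"


def needs_guard(incident: Incident) -> bool:
    """Seen-twice rule: guard on the second occurrence, or the first if high blast."""
    if incident.blast_radius == HIGH_BLAST:
        return True
    return incident.seen_count >= 2


def next_action(incident: Incident) -> str:
    """Map the rule to the two outcomes the loop describes."""
    return "propose-guard" if needs_guard(incident) else "log-only"
```

Keeping the rule in one function means the loop, the weekly nag, and any eval check could share a single definition of "guard required" instead of each re-deriving it.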
## LESSONS.md schema extension
Current entries are free-form prose with **Doctrine:** / **Decision:** / **Owner:** headings. Phase 4.9 adds a structured frontmatter block per incident, parseable by the eval harness:

    ## YYYY-MM-DDTHH:MM [incident_id] one-line title
    - **doctrine:** Section X (rule name) | n/a
    - **eval_check:** check-name | none
    - **incident_id:** kebab-case-stable-key
    - **seen_count:** N
    - **first_seen:** YYYY-MM-DD
    - **last_seen:** YYYY-MM-DD
    - **blast_radius:** low | medium | high (data-loss, security, silent-corruption, customer-visible)
    - **guard:** path/to/guard.py:LN | path/to/hook | doctrine:Section-X | none (single-occurrence)
    - **guard_status:** shipped | proposed | not-needed | overdue
    ### Root cause
    One line.
    ### Fix
    What changed in code (file paths, what flipped).
    ### Recurrence prevention
    If guard_status == shipped, what stops this from happening again. If guard_status == proposed or overdue, the proposed mechanism and target date.

Existing entries stay as-is; new entries follow the schema. The eval harness gets a new check lessons-md-schema-conformance that warns (does not block) when new entries omit fields.
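
To make "parseable by the eval harness" concrete, here is a rough sketch of how one entry could be parsed and checked for missing fields. The function names `parse_entry` and `missing_fields` and both regular expressions are assumptions for illustration; the real `lessons-md-schema-conformance` check may look quite different.

```python
import re

# Field names taken from the schema above.
REQUIRED_FIELDS = (
    "doctrine", "eval_check", "incident_id", "seen_count", "first_seen",
    "last_seen", "blast_radius", "guard", "guard_status",
)

# Assumed header shape: "## YYYY-MM-DDTHH:MM [incident_id] one-line title".
HEADER_RE = re.compile(r"^## (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}) \[([a-z0-9-]+)\] (.+)$")
# Assumed field shape: "- **name:** value".
FIELD_RE = re.compile(r"^- \*\*([a-z_]+):\*\* (.*)$")


def parse_entry(block: str) -> dict:
    """Parse one schema entry into a dict of strings."""
    lines = block.strip().splitlines()
    header = HEADER_RE.match(lines[0]) if lines else None
    if header is None:
        raise ValueError("entry header does not match the schema")
    entry = {
        "timestamp": header.group(1),
        "header_id": header.group(2),
        "title": header.group(3),
    }
    for line in lines[1:]:
        match = FIELD_RE.match(line)
        if match:
            entry[match.group(1)] = match.group(2).strip()
    return entry


def missing_fields(entry: dict) -> list:
    """Return schema fields the entry omits, in schema order."""
    return [name for name in REQUIRED_FIELDS if name not in entry]
```

A warn-only check could call `missing_fields` on each new entry and report the result without failing the run, which matches the "warns (does not block)" behavior described above.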
## New script: forge_incident_log.py
Path: forge/scripts/forge_incident_log.py. Single-purpose: append or upsert a lesson entry.

    forge_incident_log.py \
      --id "telegram-bot-subprocess-thread-race" \
      --title "Errno 8 from threading.Thread + subprocess.run race" \
      --doctrine "n/a" \
      --blast medium \
      --root-cause "threading.Thread + requests.post raced with main-thread subprocess.run, fork() in subprocess saw inconsistent fd state" \
      --fix "scripts/forge_telegram_*.py: replaced Thread heartbeat with multiprocessing.Process" \
      --guard "scripts/forge_text_sanitize.py" \
      --guard-status shipped

Behavior:
- If incident_id already exists in LESSONS.md, increment seen_count, update last_seen, append a sub-bullet under "Recurrence" with the new occurrence date. Do not duplicate the entry.
- If new, write a fresh block following the schema.
- All flags optional except --id and --title; missing fields render as tbd.
- Emits the entry id to stdout so the caller can reference it in commit messages.
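
The upsert behavior above can be sketched as a pure text transformation. This is an illustrative sketch under stated assumptions, not the shipped script: `render_entry` and `upsert_lesson` are hypothetical names, the `- recurred YYYY-MM-DD` line format for the recurrence sub-bullet is invented here, and entries are assumed to start at level-2 headings as in the schema.

```python
import re
from datetime import datetime

# Field order follows the schema; anything the caller omits renders as "tbd".
FIELD_ORDER = (
    "doctrine", "eval_check", "incident_id", "seen_count", "first_seen",
    "last_seen", "blast_radius", "guard", "guard_status",
)


def render_entry(incident_id: str, title: str, now: datetime, **fields: str) -> str:
    """Render a fresh entry for a first occurrence."""
    today = now.date().isoformat()
    values = dict.fromkeys(FIELD_ORDER, "tbd")
    values.update(fields)
    values.update(incident_id=incident_id, seen_count="1",
                  first_seen=today, last_seen=today)
    lines = [f"## {now.strftime('%Y-%m-%dT%H:%M')} [{incident_id}] {title}"]
    lines += [f"- **{name}:** {values[name]}" for name in FIELD_ORDER]
    lines += ["### Root cause", values.get("root_cause", "tbd"),
              "### Fix", values.get("fix", "tbd"),
              "### Recurrence prevention", "tbd"]
    return "\n".join(lines) + "\n"


def upsert_lesson(text: str, incident_id: str, title: str, now: datetime,
                  **fields: str) -> str:
    """Return the new LESSONS.md text after logging one occurrence."""
    today = now.date().isoformat()
    marker = f"- **incident_id:** {incident_id}"
    # Split before each level-2 heading so every part is one entry (or preamble).
    parts = re.split(r"(?m)^(?=## )", text)
    for index, part in enumerate(parts):
        if marker not in part.splitlines():
            continue
        # Existing incident: bump the counter and date, never duplicate the entry.
        part = re.sub(r"(?m)^(- \*\*seen_count:\*\* )(\d+)[ \t]*$",
                      lambda m: f"{m.group(1)}{int(m.group(2)) + 1}", part, count=1)
        part = re.sub(r"(?m)^(- \*\*last_seen:\*\* ).*$",
                      lambda m: m.group(1) + today, part, count=1)
        body = part.rstrip("\n")
        trailing = part[len(body):] or "\n"
        # Assumed recurrence format; it lands at the end of the entry.
        parts[index] = f"{body}\n- recurred {today}{trailing}"
        return "".join(parts)
    # New incident: append a fresh block, separated by one blank line.
    separator = "" if not text or text.endswith("\n\n") else "\n"
    if text and not text.endswith("\n"):
        separator = "\n\n"
    return text + separator + render_entry(incident_id, title, now, **fields)
```

Working on a string rather than the file keeps the logic easy to test; the caller would read `LESSONS.md`, call `upsert_lesson`, and write the result back in one step.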
## Auto-dream weekly nag
Auto-dream (Phase 4.4 nightly consolidation) gets a new pass: scan LESSONS.md for entries where seen_count >= 2 and guard_status in (proposed, none). Once per week (Sunday consolidation), compile a list and route it to coordinator chat as a single notify:
3 lessons recurring without guards: [incident-id-1] (seen 4x, last 2026-04-27), [incident-id-2] (seen 2x, last 2026-04-29), [incident-id-3] (seen 3x, last 2026-04-30). Want me to draft guards?
The bot does not auto-build guards. It surfaces the backlog so Justin (or a worker session) can act. Silent automated guard-building violates the seen-twice judgment requirement: the human keeps the call on whether the recurrence justifies prevention infrastructure.
Implementation: extend forge/scripts/forge_auto_dream.py (or wherever weekly consolidation lives, TBD by Phase 4.4 owner) with a lessons_recurrence_scan() function. Output channels through /notify warning.
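
A possible shape for `lessons_recurrence_scan()`, assuming entries have already been parsed into dicts with string values keyed by the schema field names. The input format, the `should_nag` helper, and the return convention (message string or `None`) are assumptions; only the function name, the filter, and the message wording come from this doc.

```python
from datetime import date
from typing import Optional

# Statuses treated as "no guard yet", copied from the rule as written above.
UNGUARDED_STATUSES = {"proposed", "none"}


def should_nag(today: date) -> bool:
    """Weekly gate: only the Sunday consolidation sends the notify."""
    return today.weekday() == 6  # Monday is 0, Sunday is 6


def lessons_recurrence_scan(entries: list) -> Optional[str]:
    """Build the single notify message, or None when nothing is recurring."""
    hits = [
        entry for entry in entries
        if int(entry["seen_count"]) >= 2
        and entry["guard_status"] in UNGUARDED_STATUSES
    ]
    if not hits:
        return None
    hits.sort(key=lambda entry: entry["last_seen"])  # ISO dates sort as text
    items = ", ".join(
        f"[{e['incident_id']}] (seen {e['seen_count']}x, last {e['last_seen']})"
        for e in hits
    )
    noun = "lesson" if len(hits) == 1 else "lessons"
    return f"{len(hits)} {noun} recurring without guards: {items}. Want me to draft guards?"
```

The function only formats a message. Routing it through `/notify warning` and deciding whether to act on it stay outside the scan, which keeps the human in the loop as described above.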
## Doctrine amendment: Section 10 addendum
Section 10 currently covers self-iteration via eval harness and LESSONS.md. Phase 4.9 adds a sub-clause:
10.4 Troubleshoot-to-Guardrail Loop. When an issue surfaces in a live session, the four-step loop runs before moving on: state the root cause in one line, log via `forge_incident_log.py`, decide guard tier per the seen-twice rule, link forward. High-blast-radius categories trigger a guard on first occurrence: data loss, security, silent corruption, customer-visible regression. Single low-blast occurrences log only. The eval harness check `lessons-md-schema-conformance` warns on schema drift but does not block commits.
The sub-clause goes into FORGE-DOCTRINE.md Section 10 alongside existing self-iteration protocol. This is the contract; without it, the loop becomes optional and decays.
## Eval harness check: orphan-lessons
New check, lands in Phase 4.9 alongside the script:
- name: `lessons-orphan-recurrence`
- rule: No entry in `LESSONS.md` with `seen_count >= 3` and `guard_status in (proposed, none)` older than 14 days.
- severity: warning (initial), tightens to error after a clean week per the same policy as `no-em-dashes`.
- rationale: Three recurrences over two weeks is a clear signal that a guard is overdue, regardless of blast radius. This is the eval-harness teeth behind the auto-dream nag.
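
The rule above could be evaluated roughly as follows. This sketch assumes parsed entries as dicts of strings and assumes "older than 14 days" is measured from `first_seen`, which the doc does not state. The function name `orphan_recurrences` is hypothetical, and both thresholds are parameters because the Decisions section of this doc records a different pair of values.

```python
from datetime import date, timedelta

# Statuses treated as "no guard yet", copied from the rule as written above.
UNGUARDED_STATUSES = {"proposed", "none"}


def orphan_recurrences(entries: list, today: date,
                       min_seen: int = 3, max_age_days: int = 14) -> list:
    """Return incident ids that violate the orphan-recurrence rule."""
    cutoff = today - timedelta(days=max_age_days)
    violations = []
    for entry in entries:
        if int(entry["seen_count"]) < min_seen:
            continue
        if entry["guard_status"] not in UNGUARDED_STATUSES:
            continue
        # Assumption: entry age is counted from first_seen.
        if date.fromisoformat(entry["first_seen"]) < cutoff:
            violations.append(entry["incident_id"])
    return violations
```

A warning-severity check would report a non-empty list without failing the run; tightening to error later would only change how the caller treats the same list.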
## Files this phase touches
| File | Change |
|---|---|
| `forge/FORGE-DOCTRINE.md` | Add Section 10.4 |
| `forge/LESSONS.md` | Schema applies to new entries only; existing entries grandfathered |
| `forge/scripts/forge_auto_memory.py` | Add `log_incident()` function (single owner of `LESSONS.md` writes) |
| `forge/scripts/forge_incident_log` | New thin CLI shim that calls `forge_auto_memory.log_incident()` |
| `forge/scripts/forge_auto_dream.py` | Add `lessons_recurrence_scan()` |
| `forge/scripts/forge_eval_check_lessons_orphan_recurrence.py` | New |
| `forge/scripts/forge_eval_check_lessons_md_schema.py` | New |
| `forge/eval.json` | Register two new checks |
| `forge/MEMORY.md` index | Add [Incident loop](memory/general/reference_incident_loop.md) entry |
| `forge/memory/general/reference_incident_loop.md` | New, topic file documenting the script + schema for future sessions |
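
The thin CLI shim listed in the table might look something like the sketch below. The flag names come from the example invocation earlier in this doc; everything else is assumed, including the `build_parser` and `main` names, the keyword arguments, and the local `log_incident` stand-in, which only mimics the real `forge_auto_memory.log_incident()` whose signature is not specified here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Only --id and --title are required; other fields default to "tbd"."""
    parser = argparse.ArgumentParser(prog="forge_incident_log")
    parser.add_argument("--id", required=True, help="stable kebab-case incident id")
    parser.add_argument("--title", required=True, help="one-line title")
    parser.add_argument("--doctrine", default="tbd")
    parser.add_argument("--blast", default="tbd",
                        choices=("low", "medium", "high", "tbd"))
    parser.add_argument("--root-cause", default="tbd")
    parser.add_argument("--fix", default="tbd")
    parser.add_argument("--guard", default="tbd")
    parser.add_argument("--guard-status", default="tbd",
                        choices=("shipped", "proposed", "not-needed", "overdue", "tbd"))
    return parser


def log_incident(**fields: str) -> str:
    """Stand-in for forge_auto_memory.log_incident(); the signature is assumed."""
    return fields["incident_id"]


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    entry_id = log_incident(
        incident_id=args.id, title=args.title, doctrine=args.doctrine,
        blast_radius=args.blast, root_cause=args.root_cause, fix=args.fix,
        guard=args.guard, guard_status=args.guard_status,
    )
    print(entry_id)  # emitted so callers can reference it in commit messages
    return 0
```

Keeping the shim to argument parsing plus one call is what lets `forge_auto_memory.py` remain the only writer of `LESSONS.md`.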
## Sequencing and gates
Phase 4.9 lands after Phase 4.5 (eval harness, already shipped) and Phase 4.4 (auto-memory + auto-dream, in progress). It does not depend on the bot redesign (Phase 4.2) or Drive redesign (Phase 4.6); those are orthogonal.
Sign-off gates:
1. Justin approves the schema and seen-twice rule (this doc).
2. forge_incident_log.py ships, three real lessons rewritten in the new schema as a smoke test.
3. Section 10.4 added to FORGE-DOCTRINE.md.
4. Eval checks register; first nightly run shows them passing or producing a sane backlog.
5. Auto-dream weekly nag fires once successfully (likely first Sunday after ship).
## What I am NOT proposing
- Auto-building guards from lessons. Premature codification risk; humans keep the judgment call.
- Migrating existing 793 lines of `LESSONS.md` to the new schema. Grandfather them; only new entries follow.
- A separate "incidents" database. `LESSONS.md` is the single store; the schema is structured prose, not a sqlite table. Easier to read, grep, and recover.
- Replacing `feedback_*.md` topic files. Those capture user-preference corrections (style, voice, workflow). Lessons capture system-failure recurrences. Different surfaces, different cadence.
## Risks
| Risk | Mitigation |
|---|---|
| Schema overhead discourages logging | Keep all fields except id + title optional; the script fills tbd for the rest. |
| Lessons backlog grows faster than guards | Auto-dream nag + orphan-recurrence eval check; the human gets surfaced backlog, not silent rot. |
| Bot logs every minor hiccup as an incident | Doctrine wording: "issue surfaced" means user-visible failure, exception, wrong output, or regression. Not "I tried two approaches and the first did not compile." |
| `seen_count` upserts go wrong | Script writes to a tempfile + atomic rename. Pre-commit eval check validates `LESSONS.md` parses cleanly. |
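
The tempfile-plus-atomic-rename mitigation in the last row can be sketched with the standard library alone. The function name `atomic_write_text` and the temp-file prefix are hypothetical; the pattern itself relies on `os.replace`, which swaps the file in a single step when source and destination are on the same filesystem.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text to path so readers see the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".lessons-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap completed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

This protects against a crash mid-write. It does not by itself prevent two concurrent writers from overwriting each other, which is why the doc also routes all writes through a single owner.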
## Decisions locked (2026-04-30)
Justin delegated the backend judgment calls; defaults below are final unless the implementing worker hits a blocker.
- High-blast "customer-visible regression" narrowed to "customer-visible regression with no quick rollback." Broad version was too inclusive.
- Incident logging folded into `forge_auto_memory.py` as a `log_incident()` sub-module plus a thin `forge_incident_log` CLI shim. Single owner of `LESSONS.md` writes prevents two-writer races. No standalone `forge_incident_log.py` as originally drafted.
- Orphan-recurrence threshold: 2 occurrences over 7 days (was 3 over 14). Matches the seen-twice doctrine; longer windows let recurrences stew without action.
- Implementation order: auto-memory `log_incident()` extension → CLI shim → Section 10.4 doctrine clause → eval checks → auto-dream nag.
[Claude Code]