building a multi-llm plan critique system for claude code

January 10, 2026·16 min read

Created Jan 10, 2026·Updated Apr 1, 2026

Version history

v3Replaced v1/v2 hooks with superpowers skills + codex review gate
v2.1Added auto-save to thoughts/research/, prompted save to plans/
v2Added council planning (parallel brainstorming)
v1Original sequential critique system

one model has blind spots. always. i've watched gemini catch architectural issues codex missed, and codex catch edge cases gemini glossed over. when they disagree, that's signal.

this is a deep dive into how i built a claude code hook that sends plans through multiple llms before implementation starts.

planning vibes

Updated Apr 1, 2026

v3: the custom hooks from v1 and v2 have been replaced by superpowers — a skill system that handles brainstorming, planning, verification, and code review natively inside claude code. it's better than what i built. on top of that, i've added the codex review gate as a stop-time quality check. skip to the current setup for what i actually use now.

the insight

i was using multiple llms to critique plans. claude writes a plan, then i send it to gemini and codex for review. it works.

but i realized i was doing it wrong.

the problem: reviewers only see the finished plan. they can't offer alternative approaches they might have discovered if they'd explored the problem themselves.

it's like asking someone to review your code after you've already written it. they can spot bugs, but they can't suggest a fundamentally different architecture. that ship has sailed.

post-critique = code review (valuable, but late) parallel planning = brainstorming session (early, diverse perspectives)

both have their place. here's how to build each.

v1: sequential critique

the original approach. claude plans, then others review.

the goal

when claude creates an implementation plan, i want:

test-first enforcement. block if no test files listed
gemini 3 flash critique. architectural review, coverage gaps
codex second opinion. what did gemini miss?
all feedback visible to claude before implementation starts

how claude code hooks work

hooks are commands that run at specific points in claude's workflow. the one i care about is PostToolUse. runs after a tool completes.

in ~/.claude/settings.json:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "ExitPlanMode",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/critique-plan.sh",
            "timeout": 600
          }
        ]
      }
    ]
  }
}

this triggers my script whenever claude exits plan mode (i.e., the plan is ready for approval).

the architecture

Multi-LLM plan critique workflow showing Claude, Gemini, and Codex phases

the hard parts

problem 1: blocking doesn't work with json

my first attempt returned:

{
  "decision": "block",
  "reason": "TEST-FIRST VIOLATION: Plan rejected..."
}

claude code said "hook succeeded" and kept going. the decision: block was ignored.

fix: use exit code 2 + stderr instead:

# this gets ignored
cat << EOF
{"decision": "block", "reason": "..."}
EOF
exit 0

# this actually blocks
cat >&2 << EOF
TEST-FIRST VIOLATION - Plan Rejected
...
EOF
exit 2

exit code 2 signals "hook blocked this action" and stderr content becomes the message claude sees.

problem 2: background processes are useless

i tried running the llm critiques in the background:

( ... gemini ... codex ... ) &

hook returned immediately. critique ran in the background. claude never saw it because the hook was already done.

fix: make it synchronous. yes, it takes 20-30 seconds. worth it.

# blocking - claude waits and sees the result
GEMINI_RESULT=$(opencode run -m "openrouter/google/gemini-3-flash-preview" "$PROMPT")
CODEX_RESULT=$(codex exec --full-auto "$PROMPT")

problem 3: plan freshness

the hook reads the most recent plan file from ~/.claude/plans/. but if the user takes too long reviewing, the plan gets "stale" and the hook silently exits:

Plan age: 288s (max 120s)
EXIT: Plan too old

claude sees "hook succeeded" and proceeds.

partial fix: increased timeout to 600s. better fix would be reading from tool_response.plan in the hook input instead of relying on file timestamps.

the code

here's the main hook script. it's messy but it works.

#!/bin/bash
set -euo pipefail

DEBUG_LOG="$HOME/.claude/logs/critique-hook.log"
mkdir -p "$(dirname "$DEBUG_LOG")"

log_debug() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$DEBUG_LOG"
}

# find most recent plan file
PLAN_FILE=$(ls -t ~/.claude/plans/*.md 2>/dev/null | head -1)

if [[ ! -f "$PLAN_FILE" ]]; then
    exit 0  # no plan, nothing to do
fi

# freshness check
PLAN_AGE=$(($(date +%s) - $(stat -f %m "$PLAN_FILE")))
if [[ $PLAN_AGE -gt 600 ]]; then
    exit 0  # too old, skip
fi

PLAN_CONTENT=$(cat "$PLAN_FILE")

# test-first check (separate script)
TEST_CHECK=$(echo "$PLAN_CONTENT" | ~/.claude/hooks/test-first-check.sh)
TEST_PASSED=$(echo "$TEST_CHECK" | jq -r '.passed')

if [[ "$TEST_PASSED" == "false" ]]; then
    ISSUES=$(echo "$TEST_CHECK" | jq -r '.issues | join("\n- ")')
    cat >&2 << EOF
TEST-FIRST VIOLATION - Plan Rejected

Issues:
- $ISSUES

Fix: Add test files BEFORE implementation files in your plan.
EOF
    exit 2
fi

# gemini critique
GEMINI_PROMPT="You are a senior architect reviewing an implementation plan...
$PLAN_CONTENT"

GEMINI_RESULT=$(opencode run -m "openrouter/google/gemini-3-flash-preview" "$GEMINI_PROMPT" 2>/dev/null || echo "[unavailable]")

# codex reviews gemini's critique
CODEX_PROMPT="Review this plan AND Gemini's critique. What did Gemini miss?
---
PLAN:
$PLAN_CONTENT
---
GEMINI'S CRITIQUE:
$GEMINI_RESULT"

CODEX_RESULT=$(codex exec --full-auto "$CODEX_PROMPT" 2>/dev/null || echo "[unavailable]")

# return to claude
FULL_CRITIQUE="PLAN CRITIQUE FROM GEMINI + CODEX

## Gemini 3 Flash
$GEMINI_RESULT

## Codex
$CODEX_RESULT"

jq -n --arg ctx "$FULL_CRITIQUE" '{
  "hookSpecificOutput": {
    "hookEventName": "PostToolUse",
    "additionalContext": $ctx
  }
}'

what the critiques look like

here's an actual critique from a recent plan:

gemini:

Test Coverage (PRIMARY CRITIQUE - INSUFFICIENT) While the plan includes collection-viewer.test.ts, it has significant gaps:

Missing Add Page Tests: You modify src/app/collection/[id]/add/page.tsx but provide no corresponding test file.

Null Safety: currentAndeeTag can now be null. Tests must verify sub-components don't crash.

codex:

Missed test scenario: ?viewer= present but useCurrentAndee reports authenticated; ensure viewer override wins

Test quality issue: page.test.tsx uses vi.mock inside tests; this won't rewire imports after module is loaded

they catch different things. that's the point.

v2: council planning

the problem with post-plan critique

v1 reviewers only see the finished plan. they can spot flaws in claude's logic, but they can't offer alternative approaches they might have discovered if they'd explored the problem themselves.

it's like asking someone to review your design doc after you've already committed to the architecture. they can poke holes, but the frame is set. they're optimizing within your solution space.

the insight

what if all three models explored the problem independently and in parallel?

each model has different:

training data
biases and blind spots
knowledge of libraries and patterns

when they converge on the same approach: high confidence. when they diverge: you've found the real design decisions.

the architecture

Council planning flow showing parallel execution of Claude, Gemini, and Codex

two ways to use it

1. explicit: /council command

/council implement OAuth2 authentication for the API

runs all three models in parallel. claude synthesizes.

2. automatic: hooks on plan mode

when claude enters plan mode naturally, hooks fire:

PreToolUse[EnterPlanMode] → start-council.sh
  - Extracts user's request from transcript
  - Launches Gemini + Codex in background
  - They plan while Claude researches

PostToolUse[ExitPlanMode] → collect-council.sh
  - Waits for background processes (2min timeout)
  - Injects their plans into Claude's context
  - Claude sees other perspectives before finalizing

you don't have to do anything. just use plan mode. council happens automatically.

key insight: non-blocking start, blocking collect. council starts immediately when planning begins. they work in parallel with claude. when claude finishes, we wait for results.

the hooks configuration

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "EnterPlanMode",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/start-council.sh",
            "timeout": 10
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "ExitPlanMode",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/collect-council.sh",
            "timeout": 180
          }
        ]
      }
    ]
  }
}

key details:

PreToolUse with matcher: "EnterPlanMode" fires when claude is about to enter plan mode
PostToolUse with matcher: "ExitPlanMode" fires when claude exits plan mode
the collect hook has 180s timeout because we're waiting for background processes

start-council.sh

#!/bin/bash
set -e

SCRATCH_DIR="$CLAUDE_PROJECT_DIR/thoughts/scratch/council"
mkdir -p "$SCRATCH_DIR"

# Read hook input from stdin (JSON)
INPUT=$(cat)

# Get transcript path to extract user's request
TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.transcript_path // empty')

if [ -z "$TRANSCRIPT_PATH" ] || [ ! -f "$TRANSCRIPT_PATH" ]; then
    exit 0
fi

# Extract last user message from transcript (JSONL format)
LAST_USER_MSG=$(tac "$TRANSCRIPT_PATH" | grep -m1 '"type":"human"' | jq -r '.message.content // empty')

if [ -z "$LAST_USER_MSG" ]; then
    exit 0
fi

# Create council request
cat > "$SCRATCH_DIR/request.md" << EOF
# Planning Request

## Task
$LAST_USER_MSG

## Instructions
Create a detailed implementation plan. Include:
1. Approach and rationale
2. Key files to modify/create
3. Step-by-step implementation
4. Potential risks or edge cases
5. Testing strategy
EOF

# Launch Gemini in background
(
    opencode run --model openrouter/google/gemini-2.5-pro \
        "$(cat "$SCRATCH_DIR/request.md")" \
        > "$SCRATCH_DIR/gemini.md" 2>&1
) &
echo $! > "$SCRATCH_DIR/gemini.pid"

# Launch Codex in background
(
    codex exec \
        "$(cat "$SCRATCH_DIR/request.md")" \
        > "$SCRATCH_DIR/codex.md" 2>&1
) &
echo $! > "$SCRATCH_DIR/codex.pid"

echo "Council started: Gemini and Codex planning in background"
exit 0

collect-council.sh

#!/bin/bash
set -e

SCRATCH_DIR="$CLAUDE_PROJECT_DIR/thoughts/scratch/council"
RESEARCH_DIR="$CLAUDE_PROJECT_DIR/thoughts/research"

# Check if council was started
if [ ! -f "$SCRATCH_DIR/gemini.pid" ]; then
    exit 0
fi

# Wait for background processes (with timeout)
TIMEOUT=120
WAITED=0

for MODEL in gemini codex; do
    PID_FILE="$SCRATCH_DIR/$MODEL.pid"
    if [ -f "$PID_FILE" ]; then
        PID=$(cat "$PID_FILE")
        while kill -0 "$PID" 2>/dev/null && [ $WAITED -lt $TIMEOUT ]; do
            sleep 2
            WAITED=$((WAITED + 2))
        done
        kill -9 "$PID" 2>/dev/null || true
    fi
done

# Auto-save raw perspectives to research/ for future reference
mkdir -p "$RESEARCH_DIR"
TIMESTAMP=$(date +%Y-%m-%d_%H-%M)
if [ -f "$SCRATCH_DIR/gemini.md" ] && [ -s "$SCRATCH_DIR/gemini.md" ]; then
    cp "$SCRATCH_DIR/gemini.md" "$RESEARCH_DIR/${TIMESTAMP}_gemini.md"
fi
if [ -f "$SCRATCH_DIR/codex.md" ] && [ -s "$SCRATCH_DIR/codex.md" ]; then
    cp "$SCRATCH_DIR/codex.md" "$RESEARCH_DIR/${TIMESTAMP}_codex.md"
fi

# Output council results to Claude's context
echo ""
echo "=== COUNCIL RESULTS ==="
echo ""

if [ -f "$SCRATCH_DIR/gemini.md" ] && [ -s "$SCRATCH_DIR/gemini.md" ]; then
    echo "### Gemini's Plan"
    echo ""
    cat "$SCRATCH_DIR/gemini.md"
    echo ""
fi

if [ -f "$SCRATCH_DIR/codex.md" ] && [ -s "$SCRATCH_DIR/codex.md" ]; then
    echo "### Codex's Plan"
    echo ""
    cat "$SCRATCH_DIR/codex.md"
    echo ""
fi

echo "=== END COUNCIL ==="
echo ""
echo "Consider the above perspectives when finalizing your plan."
echo "Save the synthesized plan to thoughts/plans/ for future reference."

# Cleanup
rm -f "$SCRATCH_DIR/gemini.pid" "$SCRATCH_DIR/codex.pid"

exit 0

the hook auto-saves raw council output to thoughts/research/ with timestamps. you can grep through old perspectives when you forget why you rejected an approach three weeks ago.

what the output looks like

when claude exits plan mode, it sees something like:

=== COUNCIL RESULTS ===

### Gemini's Plan

1. Use the existing auth middleware pattern from /src/app/core/security.py
2. Add OAuth provider config to settings.py
3. Create /src/app/services/oauth.py with provider implementations
...

### Codex's Plan

1. Implement OAuth flow in a new router /src/app/api/v1/endpoints/oauth.py
2. Store tokens in the existing Firebase setup
3. Add rate limiting to prevent abuse
...

=== END COUNCIL ===

Consider the above perspectives when finalizing your plan.

then claude synthesizes:

Agreement (all 3 models):

Use existing auth middleware pattern

Store tokens in Firebase

Add rate limiting

Gemini's unique insight:

Suggested abstracting providers for future additions

Codex's unique insight:

Flagged token refresh race condition

Conflict: Router location

Gemini: Add to existing auth router

Codex: New dedicated oauth router

Claude: Separate router for clarity

Recommendation: New router, but share middleware

real example: rate limiting

task: "add rate limiting to the API endpoints"

claude's approach: middleware-based, redis for distributed state, per-user and per-IP limits.

gemini's approach: also middleware, but suggested slowapi library and pointed out a recent FastAPI best practice.

codex's approach: emphasized different limits for authenticated vs anonymous users, with patterns from popular repos.

the synthesis:

all three agreed on middleware placement → high confidence
gemini's library suggestion saved reinventing the wheel
codex's auth/anon distinction was a blind spot for claude and gemini
combined: better than any single model

graceful degradation

the council is advisory, not blocking. if gemini times out, claude continues. if codex fails, we work with what we have.

cat "$SCRATCH_DIR/gemini.md" 2>/dev/null || echo "(No response)"

you don't want planning held hostage by an API hiccup.

why this matters

diverse perspectives early — catch bad architecture in planning (minutes) not implementation (days)
confidence calibration — three models converge independently = meaningful signal
unknown unknowns — each model has different training data, different code exposure. gemini might know a library update. codex might have seen this fail in a thousand repos.
zero extra effort — automatic hooks. just use plan mode.

v3: superpowers + codex review gate

why i stopped using my own hooks

v1 and v2 worked. but maintaining bash scripts that parse transcripts, manage background processes, and handle timeouts across three LLM APIs is fragile. every time an API changed or claude code updated its hook format, something broke.

then i found superpowers.

what superpowers replaced

superpowers is a skill system for claude code built by jesse vincent. it installs as a plugin and gives claude a structured workflow for brainstorming, planning, debugging, testing, and verification — all the things my custom hooks were trying to do, but better.

here's what it replaced:

v1 sequential critique → superpowers brainstorming + planning skills. instead of claude writing a plan and then sending it to other models for review, superpowers changes how claude plans in the first place. the brainstorming skill uses socratic questioning — it probes your intent, surfaces assumptions, and presents multiple approaches in structured sections before you commit to anything. it's not "here's my plan, what do you think?" it's "here are three approaches with tradeoffs, which direction do you want?"

v2 council → mostly replaced, with a gap. the brainstorming skill gives you the diverse-perspectives-early benefit that council provided. claude explores multiple approaches and presents them for your input before committing. the gap: it's still one model doing the exploration. i haven't brought back /council to run gemini and codex in parallel during the superpowers brainstorm phase — that's on my list.

plan mode itself → replaced. superpowers effectively replaces claude code's built-in plan mode for me. the writing-plans skill creates detailed implementation plans with bite-sized tasks (2-5 minutes each), exact file paths, and verification steps. the executing-plans skill works through them with review checkpoints. it's more structured than what i was getting from raw plan mode + hook critiques.

the key skills i use daily:

Skill	What it does
`brainstorming`	Socratic design exploration before any code
`writing-plans`	Detailed implementation plans with verification steps
`executing-plans`	Works through plans with review checkpoints
`test-driven-development`	RED-GREEN-REFACTOR enforcement
`systematic-debugging`	Root-cause analysis before proposing fixes
`verification-before-completion`	Evidence-based completion checks
`requesting-code-review`	Structured review with severity tracking

what superpowers didn't replace: the review gate

superpowers has a verification-before-completion skill that checks work before claiming done. it's good — it makes claude actually run tests and verify output before saying "done."

but it's still claude checking claude's own work. same model, same blind spots.

that's where the codex review gate comes in. it's a different model (codex/gpt) reviewing the actual diff after implementation. different training data, different biases, different things it notices.

the codex review gate

the review gate fires when you're about to mark work as done. codex reviews the actual changes — not the plan, the implementation.

you: /done
codex: [reviews diff, checks for drift, missed tests, security issues]
codex: "LGTM" → work marked done
codex: "ISSUES FOUND" → you see the issues before claiming completion

setup is one command:

/codex:setup --enable-review-gate

the codex plugin configures everything. no hooks to write, no scripts to maintain.

what the review gate catches

real examples from my recent sessions:

test drift — plan said "test the error path for expired tokens." implementation had a test file but the error path test was a TODO comment.
missing files — plan listed 4 files to create. only 3 existed. the 4th was referenced in imports but never written.
silent failures — tests passed but were testing the mock, not the actual implementation. codex flagged the vi.mock placement issue.

these are exactly the failure modes from my execution outcomes — the Q:0/5 and Q:1/5 scores were almost always implementation drift, not bad plans.

why codex specifically

codex is genuinely good as both a model and a harness. it runs against the actual diff — it sees what changed, not what you said would change. it has codebase context and can verify claims ("did you actually add that test?").

a review gate check takes 10-20 seconds. small price for catching the kind of failures that used to cost hours.

how the current pipeline works

[superpowers brainstorm] → [superpowers plan] → [implementation w/ TDD] → [codex review gate] → done

brainstorming skill — socratic exploration, multiple approaches presented
writing-plans skill — detailed plan with file paths, verification steps
executing-plans + TDD skills — implementation with review checkpoints
codex review gate — different-model verification of the actual diff

superpowers handles the planning lifecycle. codex handles the final quality gate. they complement each other because they solve different problems: superpowers makes the process better, codex makes the output verifiable.

the evolution

Version	Approach	Status
v1	Custom bash hooks, sequential LLM critique	Retired — replaced by superpowers
v2	Custom bash hooks, parallel council brainstorm	Mostly retired — superpowers brainstorming replaces this, but `/council` with parallel models is still on my list
v3	Superpowers skills + codex review gate	Current

the progression: hand-rolled bash scripts → structured skill system + purpose-built review tool. less maintenance, better results.

what's next

bring back /council for superpowers brainstorm — run gemini and codex in parallel during the brainstorming phase. superpowers gives you structured exploration from one model; council would add diverse perspectives from three.
close the feedback loop — feed review gate findings back into brainstorm prompts so the same class of drift doesn't recur
more codex integration — codex as a rescue agent when claude gets stuck, not just as a reviewer

superpowers — the skill system that replaced my custom hooks
my overall ai coding workflow
test-first enforcement. the validator script
handoff. managing context across sessions

the lesson from v1 → v2 → v3: don't build infrastructure you can install. my custom hooks taught me what i needed. superpowers delivers it better. codex fills the gap that no single-model system can — a second pair of eyes on the actual output.

the best multi-model workflow isn't "throw models at every stage." it's "use the right tool at the right stage." superpowers for process. codex for verification. done.

< EOF

< cd ../the-system cd ../test-first-enforcement >

the insight

v1: sequential critique

the goal

how claude code hooks work

the architecture

the hard parts

problem 1: blocking doesn't work with json

problem 2: background processes are useless

problem 3: plan freshness

the code

what the critiques look like

v2: council planning

the problem with post-plan critique

the insight

the architecture

two ways to use it

the hooks configuration

start-council.sh

collect-council.sh

what the output looks like

real example: rate limiting

graceful degradation

why this matters

v3: superpowers + codex review gate

why i stopped using my own hooks

what superpowers replaced

what superpowers didn't replace: the review gate

the codex review gate

what the review gate catches

why codex specifically

how the current pipeline works

the evolution

what's next

related