← Back to writing

building a multi-llm plan critique system for claude code

·13 min read
Created Jan 10, 2026·Updated Jan 28, 2026
Version history
  • v2.1Added auto-save to thoughts/research/, prompted save to plans/
  • v2Added council planning (parallel brainstorming)
  • v1Original sequential critique system

one model has blind spots. always. i've watched gemini catch architectural issues codex missed, and codex catch edge cases gemini glossed over. when they disagree, that's signal.

this is a deep dive into how i built a claude code hook that sends plans through multiple llms before implementation starts.

planning vibes

Updated Jan 28, 2026

v2: this post now covers two approaches. the original sequential critique (reviewers see Claude's finished plan) and the new council system (all models explore in parallel). skip to council planning for the new approach.

the insight

i was using multiple llms to critique plans. claude writes a plan, then i send it to gemini and codex for review. it works.

but i realized i was doing it wrong.

the problem: reviewers only see the finished plan. they can't offer alternative approaches they might have discovered if they'd explored the problem themselves.

it's like asking someone to review your code after you've already written it. they can spot bugs, but they can't suggest a fundamentally different architecture. that ship has sailed.

post-critique = code review (valuable, but late) parallel planning = brainstorming session (early, diverse perspectives)

both have their place. here's how to build each.

v1: sequential critique

the original approach. claude plans, then others review.

the goal

when claude creates an implementation plan, i want:

  1. test-first enforcement. block if no test files listed
  2. gemini 3 flash critique. architectural review, coverage gaps
  3. codex second opinion. what did gemini miss?
  4. all feedback visible to claude before implementation starts

how claude code hooks work

hooks are commands that run at specific points in claude's workflow. the one i care about is PostToolUse. runs after a tool completes.

in ~/.claude/settings.json:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "ExitPlanMode",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/critique-plan.sh",
            "timeout": 600
          }
        ]
      }
    ]
  }
}

this triggers my script whenever claude exits plan mode (i.e., the plan is ready for approval).

the architecture

Multi-LLM plan critique workflow showing Claude, Gemini, and Codex phases

the hard parts

problem 1: blocking doesn't work with json

my first attempt returned:

{
  "decision": "block",
  "reason": "TEST-FIRST VIOLATION: Plan rejected..."
}

claude code said "hook succeeded" and kept going. the decision: block was ignored.

fix: use exit code 2 + stderr instead:

# this gets ignored
cat << EOF
{"decision": "block", "reason": "..."}
EOF
exit 0

# this actually blocks
cat >&2 << EOF
TEST-FIRST VIOLATION - Plan Rejected
...
EOF
exit 2

exit code 2 signals "hook blocked this action" and stderr content becomes the message claude sees.

problem 2: background processes are useless

i tried running the llm critiques in the background:

( ... gemini ... codex ... ) &

hook returned immediately. critique ran in the background. claude never saw it because the hook was already done.

fix: make it synchronous. yes, it takes 20-30 seconds. worth it.

# blocking - claude waits and sees the result
GEMINI_RESULT=$(opencode run -m "openrouter/google/gemini-3-flash-preview" "$PROMPT")
CODEX_RESULT=$(codex exec --full-auto "$PROMPT")

problem 3: plan freshness

the hook reads the most recent plan file from ~/.claude/plans/. but if the user takes too long reviewing, the plan gets "stale" and the hook silently exits:

Plan age: 288s (max 120s)
EXIT: Plan too old

claude sees "hook succeeded" and proceeds.

partial fix: increased timeout to 600s. better fix would be reading from tool_response.plan in the hook input instead of relying on file timestamps.

the code

here's the main hook script. it's messy but it works.

#!/bin/bash
set -euo pipefail

DEBUG_LOG="$HOME/.claude/logs/critique-hook.log"
mkdir -p "$(dirname "$DEBUG_LOG")"

log_debug() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$DEBUG_LOG"
}

# find most recent plan file
PLAN_FILE=$(ls -t ~/.claude/plans/*.md 2>/dev/null | head -1)

if [[ ! -f "$PLAN_FILE" ]]; then
    exit 0  # no plan, nothing to do
fi

# freshness check
PLAN_AGE=$(($(date +%s) - $(stat -f %m "$PLAN_FILE")))
if [[ $PLAN_AGE -gt 600 ]]; then
    exit 0  # too old, skip
fi

PLAN_CONTENT=$(cat "$PLAN_FILE")

# test-first check (separate script)
TEST_CHECK=$(echo "$PLAN_CONTENT" | ~/.claude/hooks/test-first-check.sh)
TEST_PASSED=$(echo "$TEST_CHECK" | jq -r '.passed')

if [[ "$TEST_PASSED" == "false" ]]; then
    ISSUES=$(echo "$TEST_CHECK" | jq -r '.issues | join("\n- ")')
    cat >&2 << EOF
TEST-FIRST VIOLATION - Plan Rejected

Issues:
- $ISSUES

Fix: Add test files BEFORE implementation files in your plan.
EOF
    exit 2
fi

# gemini critique
GEMINI_PROMPT="You are a senior architect reviewing an implementation plan...
$PLAN_CONTENT"

GEMINI_RESULT=$(opencode run -m "openrouter/google/gemini-3-flash-preview" "$GEMINI_PROMPT" 2>/dev/null || echo "[unavailable]")

# codex reviews gemini's critique
CODEX_PROMPT="Review this plan AND Gemini's critique. What did Gemini miss?
---
PLAN:
$PLAN_CONTENT
---
GEMINI'S CRITIQUE:
$GEMINI_RESULT"

CODEX_RESULT=$(codex exec --full-auto "$CODEX_PROMPT" 2>/dev/null || echo "[unavailable]")

# return to claude
FULL_CRITIQUE="PLAN CRITIQUE FROM GEMINI + CODEX

## Gemini 3 Flash
$GEMINI_RESULT

## Codex
$CODEX_RESULT"

jq -n --arg ctx "$FULL_CRITIQUE" '{
  "hookSpecificOutput": {
    "hookEventName": "PostToolUse",
    "additionalContext": $ctx
  }
}'

what the critiques look like

here's an actual critique from a recent plan:

gemini:

Test Coverage (PRIMARY CRITIQUE - INSUFFICIENT) While the plan includes collection-viewer.test.ts, it has significant gaps:

  • Missing Add Page Tests: You modify src/app/collection/[id]/add/page.tsx but provide no corresponding test file.
  • Null Safety: currentAndeeTag can now be null. Tests must verify sub-components don't crash.

codex:

  • Missed test scenario: ?viewer= present but useCurrentAndee reports authenticated; ensure viewer override wins
  • Test quality issue: page.test.tsx uses vi.mock inside tests; this won't rewire imports after module is loaded

they catch different things. that's the point.

v2: council planning

the problem with post-plan critique

v1 reviewers only see the finished plan. they can spot flaws in claude's logic, but they can't offer alternative approaches they might have discovered if they'd explored the problem themselves.

it's like asking someone to review your design doc after you've already committed to the architecture. they can poke holes, but the frame is set. they're optimizing within your solution space.

the insight

what if all three models explored the problem independently and in parallel?

each model has different:

  • training data
  • biases and blind spots
  • knowledge of libraries and patterns

when they converge on the same approach: high confidence. when they diverge: you've found the real design decisions.

the architecture

Council planning flow showing parallel execution of Claude, Gemini, and Codex

two ways to use it

1. explicit: /council command

/council implement OAuth2 authentication for the API

runs all three models in parallel. claude synthesizes.

2. automatic: hooks on plan mode

when claude enters plan mode naturally, hooks fire:

PreToolUse[EnterPlanMode] → start-council.sh
  - Extracts user's request from transcript
  - Launches Gemini + Codex in background
  - They plan while Claude researches

PostToolUse[ExitPlanMode] → collect-council.sh
  - Waits for background processes (2min timeout)
  - Injects their plans into Claude's context
  - Claude sees other perspectives before finalizing

you don't have to do anything. just use plan mode. council happens automatically.

key insight: non-blocking start, blocking collect. council starts immediately when planning begins. they work in parallel with claude. when claude finishes, we wait for results.

the hooks configuration

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "EnterPlanMode",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/start-council.sh",
            "timeout": 10
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "ExitPlanMode",
        "hooks": [
          {
            "type": "command",
            "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/collect-council.sh",
            "timeout": 180
          }
        ]
      }
    ]
  }
}

key details:

  • PreToolUse with matcher: "EnterPlanMode" fires when claude is about to enter plan mode
  • PostToolUse with matcher: "ExitPlanMode" fires when claude exits plan mode
  • the collect hook has 180s timeout because we're waiting for background processes

start-council.sh

#!/bin/bash
set -e

SCRATCH_DIR="$CLAUDE_PROJECT_DIR/thoughts/scratch/council"
mkdir -p "$SCRATCH_DIR"

# Read hook input from stdin (JSON)
INPUT=$(cat)

# Get transcript path to extract user's request
TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.transcript_path // empty')

if [ -z "$TRANSCRIPT_PATH" ] || [ ! -f "$TRANSCRIPT_PATH" ]; then
    exit 0
fi

# Extract last user message from transcript (JSONL format)
LAST_USER_MSG=$(tac "$TRANSCRIPT_PATH" | grep -m1 '"type":"human"' | jq -r '.message.content // empty')

if [ -z "$LAST_USER_MSG" ]; then
    exit 0
fi

# Create council request
cat > "$SCRATCH_DIR/request.md" << EOF
# Planning Request

## Task
$LAST_USER_MSG

## Instructions
Create a detailed implementation plan. Include:
1. Approach and rationale
2. Key files to modify/create
3. Step-by-step implementation
4. Potential risks or edge cases
5. Testing strategy
EOF

# Launch Gemini in background
(
    opencode run --model openrouter/google/gemini-2.5-pro \
        "$(cat "$SCRATCH_DIR/request.md")" \
        > "$SCRATCH_DIR/gemini.md" 2>&1
) &
echo $! > "$SCRATCH_DIR/gemini.pid"

# Launch Codex in background
(
    codex exec \
        "$(cat "$SCRATCH_DIR/request.md")" \
        > "$SCRATCH_DIR/codex.md" 2>&1
) &
echo $! > "$SCRATCH_DIR/codex.pid"

echo "Council started: Gemini and Codex planning in background"
exit 0

collect-council.sh

#!/bin/bash
set -e

SCRATCH_DIR="$CLAUDE_PROJECT_DIR/thoughts/scratch/council"
RESEARCH_DIR="$CLAUDE_PROJECT_DIR/thoughts/research"

# Check if council was started
if [ ! -f "$SCRATCH_DIR/gemini.pid" ]; then
    exit 0
fi

# Wait for background processes (with timeout)
TIMEOUT=120
WAITED=0

for MODEL in gemini codex; do
    PID_FILE="$SCRATCH_DIR/$MODEL.pid"
    if [ -f "$PID_FILE" ]; then
        PID=$(cat "$PID_FILE")
        while kill -0 "$PID" 2>/dev/null && [ $WAITED -lt $TIMEOUT ]; do
            sleep 2
            WAITED=$((WAITED + 2))
        done
        kill -9 "$PID" 2>/dev/null || true
    fi
done

# Auto-save raw perspectives to research/ for future reference
mkdir -p "$RESEARCH_DIR"
TIMESTAMP=$(date +%Y-%m-%d_%H-%M)
if [ -f "$SCRATCH_DIR/gemini.md" ] && [ -s "$SCRATCH_DIR/gemini.md" ]; then
    cp "$SCRATCH_DIR/gemini.md" "$RESEARCH_DIR/${TIMESTAMP}_gemini.md"
fi
if [ -f "$SCRATCH_DIR/codex.md" ] && [ -s "$SCRATCH_DIR/codex.md" ]; then
    cp "$SCRATCH_DIR/codex.md" "$RESEARCH_DIR/${TIMESTAMP}_codex.md"
fi

# Output council results to Claude's context
echo ""
echo "=== COUNCIL RESULTS ==="
echo ""

if [ -f "$SCRATCH_DIR/gemini.md" ] && [ -s "$SCRATCH_DIR/gemini.md" ]; then
    echo "### Gemini's Plan"
    echo ""
    cat "$SCRATCH_DIR/gemini.md"
    echo ""
fi

if [ -f "$SCRATCH_DIR/codex.md" ] && [ -s "$SCRATCH_DIR/codex.md" ]; then
    echo "### Codex's Plan"
    echo ""
    cat "$SCRATCH_DIR/codex.md"
    echo ""
fi

echo "=== END COUNCIL ==="
echo ""
echo "Consider the above perspectives when finalizing your plan."
echo "Save the synthesized plan to thoughts/plans/ for future reference."

# Cleanup
rm -f "$SCRATCH_DIR/gemini.pid" "$SCRATCH_DIR/codex.pid"

exit 0

the hook auto-saves raw council output to thoughts/research/ with timestamps. you can grep through old perspectives when you forget why you rejected an approach three weeks ago.

what the output looks like

when claude exits plan mode, it sees something like:

=== COUNCIL RESULTS ===

### Gemini's Plan

1. Use the existing auth middleware pattern from /src/app/core/security.py
2. Add OAuth provider config to settings.py
3. Create /src/app/services/oauth.py with provider implementations
...

### Codex's Plan

1. Implement OAuth flow in a new router /src/app/api/v1/endpoints/oauth.py
2. Store tokens in the existing Firebase setup
3. Add rate limiting to prevent abuse
...

=== END COUNCIL ===

Consider the above perspectives when finalizing your plan.

then claude synthesizes:

Agreement (all 3 models):

  • Use existing auth middleware pattern
  • Store tokens in Firebase
  • Add rate limiting

Gemini's unique insight:

  • Suggested abstracting providers for future additions

Codex's unique insight:

  • Flagged token refresh race condition

Conflict: Router location

  • Gemini: Add to existing auth router
  • Codex: New dedicated oauth router
  • Claude: Separate router for clarity
  • Recommendation: New router, but share middleware

real example: rate limiting

task: "add rate limiting to the API endpoints"

claude's approach: middleware-based, redis for distributed state, per-user and per-IP limits.

gemini's approach: also middleware, but suggested slowapi library and pointed out a recent FastAPI best practice.

codex's approach: emphasized different limits for authenticated vs anonymous users, with patterns from popular repos.

the synthesis:

  • all three agreed on middleware placement → high confidence
  • gemini's library suggestion saved reinventing the wheel
  • codex's auth/anon distinction was a blind spot for claude and gemini
  • combined: better than any single model

graceful degradation

the council is advisory, not blocking. if gemini times out, claude continues. if codex fails, we work with what we have.

cat "$SCRATCH_DIR/gemini.md" 2>/dev/null || echo "(No response)"

you don't want planning held hostage by an API hiccup.

why this matters

  1. diverse perspectives early — catch bad architecture in planning (minutes) not implementation (days)
  2. confidence calibration — three models converge independently = meaningful signal
  3. unknown unknowns — each model has different training data, different code exposure. gemini might know a library update. codex might have seen this fail in a thousand repos.
  4. zero extra effort — automatic hooks. just use plan mode.

v1 vs v2: when to use each

Aspectv1 Sequentialv2 Council
TimingAfter Claude plansWhile Claude plans
PerspectiveReview modeExploration mode
Best forCatching bugs, gapsArchitecture decisions
Speed20-30sSame (parallel)
Cost per plan~$0.03~$0.03

use v1 for routine plans where you want quality checks.

use v2 for:

  • architecture decisions with multiple valid approaches
  • exploring unfamiliar territory
  • when you want diverse ideas, not just error-catching

you can run both. v2 during planning, v1 on exit for final validation.

when to skip council

not everything needs three perspectives:

  • simple changes — typos, config tweaks, obvious bugs
  • established patterns — "add another endpoint like the others"
  • time-sensitive — when you need to ship now

use /council explicitly for architecture decisions. let automatic hooks handle the rest, but know you can always proceed without waiting.

the tools

ModelCommandCost
ClaudeNative (already running)Session cost
Geminiopencode run --model openrouter/google/gemini-2.5-pro~$0.01/plan
Codexcodex exec~$0.02/plan

total overhead per plan: ~$0.03 and ~30 seconds of parallel execution.

what's missing

things i'd improve:

  1. read plan from hook input instead of filesystem (avoids staleness issues)
  2. add confidence weights. if both models agree on an issue, flag it higher
  3. auto-apply suggestions. generate a diff for test files mentioned in critiques
  4. more models. gpt-4o, claude itself reviewing its own plan

the shift from "critique after" to "brainstorm together" changed how i think about multi-model workflows. it's not about which model is best. it's about what each model knows that the others don't.

three minds, three perspectives, one plan.