root cause vs symptom
i asked claude to fix a ranking bug. it wrote a backfill script and declared victory in 5 minutes.
i pushed back: "did you find the root cause or the symptom?"
the next day, it made the same mistake again.
story 1: missing graph edges
ranking algorithm showing score=0 for favorited items.
database A: 9 favorites. database B: 1 graph edge.
claude investigated for 5 minutes:
found the issue. database A has 9 favorites but database B only has 1 edge. here's a backfill script to create the 9 missing edges. done. it's working now!
fast. confident. wrong.
the pushback
me: "did you find the root cause or the symptom?"
claude admits
you're right. i found the symptom (missing edges), not the root cause (why they weren't being created).
the backfill fixed historical data, but if the root cause isn't fixed, new favorites will still fail silently.
key moment. ai optimizes for speed. you have to force depth.
finding the root
me: "spin up research subagents to find the root"
four parallel agents launched:
- code path tracer - where could execution skip?
- git history - when was this feature added?
- configuration audit - what happens when env vars are missing?
- pattern detector - how many places have this anti-pattern?
fifteen minutes later:
```python
# deps.py - returns None if GRAPH_URI not set
def get_graph_provider() -> GraphProvider | None:
    if not settings.graph_uri:
        return None  # silent skip, no error

# service.py - graph writes skipped if provider is None
if self._graph:
    ...  # create edge
# no else, no logging, no error
# API returns 201 Created anyway
```
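the whole failure mode fits in a few lines. a minimal sketch of the silent-skip anti-pattern, assuming a favorites service shaped like the one above (class and method names are illustrative, not the real codebase):

```python
# sketch: the silent-skip anti-pattern (names are illustrative)
class FavoriteService:
    def __init__(self, graph=None):
        self._graph = graph  # None when GRAPH_URI is unset

    def create_favorite(self, user_id, item_id):
        favorite = {"user": user_id, "item": item_id}  # write to database A
        if self._graph:
            self._graph.create_edge(user_id, item_id)  # write to database B
        # no else branch: when graph is None, the edge write is skipped
        # silently and the caller still sees success
        return favorite

svc = FavoriteService(graph=None)   # server started without GRAPH_URI
svc.create_favorite("u1", "item1")  # "succeeds", edge never created
```

every call like this is one more favorite in database A with no matching edge in database B - exactly the 9-vs-1 gap the backfill papered over.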
timeline from git:
- jan 30: first 6 favorites created
- jan 31: graph integration deployed
- feb 3: 2 more favorites created
all 9 favorites were created when server was running without GRAPH_URI configured.
the pattern: "graceful degradation" by design. app could run without graph database.
but:
- no warning when graph disabled
- no error when operations skipped
- API returned success when graph sync failed
the fix
changed to fail-fast:
```python
def get_graph_provider() -> GraphProvider:
    if not settings.graph_uri:
        raise GraphNotConfiguredError(
            "GRAPH_URI required for ranking, recommendations"
        )
```
server won't start without graph config. no silent failures.
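here's the fail-fast shape end to end, as a self-contained sketch. the settings class and connect() stand-in are assumptions for illustration; only the raise mirrors the real fix:

```python
class GraphNotConfiguredError(RuntimeError):
    pass

class Settings:
    def __init__(self, graph_uri=None):
        self.graph_uri = graph_uri

def connect(uri):
    return f"graph-connection({uri})"  # stand-in for the real driver

def get_graph_provider(settings: Settings):
    if not settings.graph_uri:
        # fail fast: refuse to start instead of degrading silently
        raise GraphNotConfiguredError(
            "GRAPH_URI required for ranking, recommendations"
        )
    return connect(settings.graph_uri)
```

a misconfigured deploy now dies at startup with a clear error instead of serving 201s while quietly dropping graph writes.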
💡 ai optimizes for speed, not depth
left alone, ai will fix symptoms (5 min backfill) because it looks successful. you have to explicitly demand root cause investigation.
story 2: the next day
blend endpoint returning signals_used=0.
claude investigated:
- neo4j has edges with plain UUIDs: "abc-123"
- blend queries for prefixed keys: "place:abc-123"
- no matches found
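the mismatch is mechanical. a sketch with hypothetical edge data showing why the prefixed keys matched nothing:

```python
# hypothetical edges as stored in neo4j (plain UUIDs, post-migration)
stored_edges = {"abc-123": 0.9, "def-456": 0.7}

def signals_used(keys, edges):
    # count how many query keys actually match a stored edge
    return sum(1 for k in keys if k in edges)

prefixed_keys = ["place:abc-123", "place:def-456"]  # what blend.py built
plain_keys = ["abc-123", "def-456"]                 # what storage contains

signals_used(prefixed_keys, stored_edges)  # 0 - the bug
signals_used(plain_keys, stored_edges)     # 2 - what it should return
```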
the "fix": add place: prefixes to storage, delete old edges, recreate.
result: signals_used=2 ✅
the pushback
me: "that is false. place: is outdated. find root cause"
the actual root cause
the codebase migrated away from prefixes on feb 4 (the day i wrote this post).
the migration:
- removed place:, book: prefixes from all graph operations
- updated signal creation to use plain UUIDs
- cleaned up 1,160 duplicate prefixed nodes
- but never updated blend.py - it still had prefix logic from jan 28
claude had the timeline backwards. the "fix" was reinstating a deprecated pattern.
what went wrong
stale pattern recognition. saw prefix format, assumed it was current.
first answer bias. found mismatch, "fixed" it, stopped.
confirmation bias. test returned non-zero, must be correct.
the parallel research agents found the truth:
- all production code uses plain UUIDs since feb 4
- git history shows explicit migration with ADR
- only blend.py had old prefix logic
- storage was correct, the query was wrong
the pattern
ai finds a mismatch between A and B. it "fixes" the mismatch by making A match B. it didn't verify: is B actually correct?
assumption: "if code has prefixes somewhere, prefixes must be standard." reality: "those prefixes are legacy fossils."
the meta-lesson
even after documenting this failure mode, claude repeated it the next day.
why? because it "knew" the answer. debug output showed prefixed edges (from its own backfill), confirming its theory.
the fix wasn't to add prefixes to storage. it was to remove prefix logic from queries.
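in code, the difference between the symptom fix and the real fix is one direction. a sketch with a hypothetical key-builder helper (the real blend.py internals aren't shown in this post):

```python
# before: blend.py still built keys with the deprecated prefix
def graph_key_before(entity_type: str, uuid: str) -> str:
    return f"{entity_type}:{uuid}"  # "place:abc-123" - matches nothing

# after: queries use the plain UUID, matching post-migration storage
def graph_key_after(entity_type: str, uuid: str) -> str:
    return uuid  # "abc-123"
```

the symptom fix rewrote storage to match the stale query. the root-cause fix deletes the prefix from the query so it matches the storage that was already correct.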
🔁 even knowing the pattern doesn't prevent repeating it
wrote this post. made the same mistake the next day. the fix: automate the correct behavior with a skill.
stale code is poison
the real problem was stale code in blend.py that never got updated during migration.
that 9-line function adding prefixes was toxic:
- made deprecated pattern look intentional
- gave false confidence ("the code does this, so this must be right")
- caused regression (blend broke after migration)
- will confuse every future developer
stale code doesn't sit there harmlessly. it actively misleads.
when you migrate a pattern, grep for the old pattern and remove it everywhere. leaving fossils creates regressions.
⚠️ stale code is poison
after migrations, old code doesn't just sit there - it actively misleads by making deprecated patterns look intentional. grep and delete immediately.
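the sweep itself is trivial to automate. a sketch that scans a source tree for a deprecated pattern (the pattern string and `*.py` glob are illustrative):

```python
# sketch: sweep the tree for fossils of a deprecated pattern
from pathlib import Path

DEPRECATED = "place:"

def find_fossils(root: str) -> list[str]:
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if DEPRECATED in line:
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits
```

run it as the last step of every migration; an empty result is the definition of done. a plain `grep -rn "place:" src/` does the same job.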
how to actually prevent this
when ai declares victory, ask two questions:
"did you find the root cause or the symptom?"
"what assumptions are you making? how did you verify those?"
red flags that ai fixed the symptom:
- "this should work now"
- "i backfilled the missing data"
- "the code looks correct when i test it"
- no explanation of WHY the problem happened
signs they found the root cause:
- timeline correlation (when did bug appear vs code changes)
- pattern identified (design flaw, not data gap)
- systemic fix proposed (code change, not data fix)
- assumptions verified (checked git history, ADRs, migration dates)
the process
- ai suggests quick fix
- ask: "root cause or symptom?"
- ask: "what assumptions? how verified?"
- demand: "research the root"
- wait for: timeline, pattern, systemic fix
time investment
symptom fix: 5 minutes. root cause: 15 minutes. future debugging sessions prevented: infinite.
making it reusable
after repeating this mistake twice, i created a skill:
~/.claude/plugins/kevin-workflow-plugin/skills/root-cause/
now when ai fixes something too quickly, i run:
/kevin-workflow:root-cause
it automatically:
- acknowledges this is a symptom fix
- launches 4 parallel research agents
- applies "5 whys" technique
- verifies assumptions with evidence
- proposes systemic fix (not just data fix)
the skill is in my workflow plugin repo. install it and you'll never accept symptom fixes again.
the questions
"did you find the root cause or the symptom?"
"what assumptions are you making? how did you verify those assumptions?"
two questions that change how ai debugs.
try them next time your pair programmer declares victory too fast.