The Reviewer Was Right
A co-worker told me a few weeks ago to stop using Claude to review Claude. “Codex is a better reviewer,” he said. I didn’t buy it. A reviewer is just a model reading a diff—why would the brand matter? If anything I expected a second model to add noise: different conventions, more findings to argue down.
He was right. It took me a while to work out why.
The Setup
I had a self-contained thing to build: a single-user outliner notes app on Cloudflare. Workers over D1, a Durable Object for live sync across devices, a React PWA. Greenfield, well-specified, mine to lay out however I wanted.
So I made it a test. Eight small pull requests instead of one big one, and one rule: nothing merges until Codex reviews the diff and I’ve fixed what it flags.
What It Found
Forty-nine findings across the eight PRs, and not one false alarm. I’ve used enough review bots that cry wolf to learn to skim them; I read every line of this one.
The Durable Object PR is the one that made me take it seriously. The DO is the single writer—every edit goes through it, gets ordered, and syncs to your other devices. I’d copied the conflict rule straight out of the spec: last write wins, keyed by a counter each client bumps. Tests green. Codex came back with twelve findings, and one was bad. Two devices both start their counter at 1, so the phone’s first edit overwrites the laptop’s. Silent. The spec said “multi-device” and I’d written something that loses data the moment you open it in two places. No test caught it because every test ran one client.
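A minimal sketch of that failure, and of the kind of test that exposes it. Everything here is hypothetical: the `Edit` and `Stored` shapes, the function names, and the `baseVersion` field are illustrations, not the app's real code. The versioned variant is one common repair (optimistic concurrency against a version the single writer assigns), not necessarily the fix that shipped.

```typescript
// Hypothetical shapes; the real schema is not shown in this post.
interface Edit {
  blockId: string;
  clientId: string;
  counter: number;     // per-client counter, as in the spec's rule
  baseVersion: number; // assumption: the server version this client last saw
  text: string;
}

interface Stored {
  text: string;
  counter: number;
  version: number; // assigned by the single writer, never by a client
}

type Result = "applied" | "rejected";

// The rule as written: last write wins, keyed by the client's own counter.
// Counters from different clients are not comparable, which is the bug.
function applyNaive(store: Map<string, Stored>, e: Edit): Result {
  const cur = store.get(e.blockId);
  if (cur === undefined || e.counter >= cur.counter) {
    store.set(e.blockId, {
      text: e.text,
      counter: e.counter,
      version: (cur?.version ?? 0) + 1,
    });
    return "applied";
  }
  return "rejected";
}

// Sketch of an alternative: accept an edit only if it was made on top of
// the version currently stored, so a stale write is reported, not applied.
function applyVersioned(store: Map<string, Stored>, e: Edit): Result {
  const cur = store.get(e.blockId);
  const curVersion = cur?.version ?? 0;
  if (e.baseVersion !== curVersion) return "rejected";
  store.set(e.blockId, {
    text: e.text,
    counter: e.counter,
    version: curVersion + 1,
  });
  return "applied";
}

// The test no single-client suite runs: two devices, both starting at 1,
// both editing the same block without having seen each other's write.
function twoClientScenario(
  apply: (s: Map<string, Stored>, e: Edit) => Result,
): { results: Result[]; finalText: string } {
  const store = new Map<string, Stored>();
  const laptop: Edit = { blockId: "b1", clientId: "laptop", counter: 1, baseVersion: 0, text: "from laptop" };
  const phone: Edit = { blockId: "b1", clientId: "phone", counter: 1, baseVersion: 0, text: "from phone" };
  const results = [apply(store, laptop), apply(store, phone)];
  return { results, finalText: store.get("b1")?.text ?? "" };
}
```

Under the naive rule both writes report "applied" and the laptop's text is gone. Under the versioned rule the phone's write is rejected, so the client can refetch and retry instead of overwriting silently.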
Then I swapped in a newer Codex model and re-ran a PR it had already passed. Five more real bugs the old model missed. The worst: the markdown export dropped block IDs, so export your notes, import them back, and every link between them is gone—in the one feature whose whole job is not losing your notes.
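The export bug is a round-trip property: exporting and re-importing should give back the same blocks with the same IDs. A rough sketch of that check follows. The `Block` shape, the `((id))` link syntax, and the trailing `<!-- id:... -->` marker are all assumptions made for illustration; the post does not show the app's real export format.

```typescript
// Hypothetical block shape and export format, for illustration only.
interface Block {
  id: string;
  depth: number; // nesting level in the outline
  text: string;  // may reference other blocks, e.g. "((a1))"
}

// Assumed convention: each exported line ends with an HTML comment
// carrying the block ID, so the ID survives the trip through markdown.
const ID_MARKER = /\s*<!--\s*id:([A-Za-z0-9_-]+)\s*-->\s*$/;

function exportMarkdown(blocks: Block[]): string {
  return blocks
    .map((b) => `${"  ".repeat(b.depth)}- ${b.text} <!-- id:${b.id} -->`)
    .join("\n");
}

function importMarkdown(md: string, newId: () => string): Block[] {
  return md
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const m = /^(\s*)- (.*)$/.exec(line);
      if (!m) throw new Error(`not a list item: ${line}`);
      const depth = m[1].length / 2;
      const idMatch = ID_MARKER.exec(m[2]);
      // Reuse the exported ID when present; mint a new one only for
      // lines that never had one (for example, hand-written markdown).
      const text = idMatch ? m[2].slice(0, idMatch.index) : m[2];
      return { id: idMatch ? idMatch[1] : newId(), depth, text };
    });
}
```

A test that asserts `importMarkdown(exportMarkdown(blocks))` equals `blocks` fails as soon as the exporter drops the marker, because every block comes back with a freshly minted ID and the `((a1))` references point at nothing. The sketch ignores multi-line block text and trailing whitespace, both of which would need escaping in a real format.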
So my co-worker was right.
The Pattern
What nagged at me was that every PR was green when I sent it. Typecheck clean, tests passing, my own read fine. And every one came back with real defects. Whatever “done” felt like to me, it wasn’t.
Around the fourth PR I started keeping a list of what Codex caught, and the list got repetitive:
- I’d guard one path and forget its twin: check moveBlock but not upsertBlock, sanitize the import but not the export.
- I’d build exactly what a sentence said and miss what it was for. The counter. The export.
- I’d handle the normal case and skip the crash, the reconnect, the two-clients-at-once—which is most of what a sync engine does.
Same three mistakes, different files. And most of them a script could catch with no model at all: a write path that skips the shared check, a fetch to a host that isn’t on the allowlist, SQL glued together with string concatenation.
Turning Feedback Into Rules
So I stopped reading Codex’s output as feedback and started copying it down as rules.
Two of them. One is a checklist the model has to go through before it commits: every place an invariant holds, not just the one in front of you; the bad value for each new input and where it gets rejected; the mirror of every operation—import/export, create/delete—handled the same; the crash and the reconnect; one test that tries to break an invariant instead of confirming the happy path. None of that is news to anyone who’s shipped code for a year. The model just won’t do it unless you make it, every time.
The other is a small linter for the mechanical stuff (interpolated SQL, an off-allowlist fetch, a write path missing its guard) that blocks the commit when something trips. There’s nothing to argue with; it’s the same bugs Codex kept finding, written down as checks. The linter runs first and for free, so Codex isn’t spending model time on “you forgot the other case” anymore.
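To give a feel for how little machinery those mechanical checks need, here is a regex-level sketch of two of them. It is not the actual linter: the rule names, the `Finding` shape, and `lintSource` are made up for this example, and a serious version would walk an AST rather than match lines.

```typescript
// Hypothetical finding shape and rule names, for illustration only.
interface Finding {
  rule: string;
  line: number; // 1-based line number in the scanned source
  text: string;
}

// A template literal that starts like SQL and interpolates a value.
const SQL_INTERPOLATION = /`\s*(SELECT|INSERT|UPDATE|DELETE)\b[^`]*\$\{/i;
// A quoted SQL string followed by "+", i.e. glued together by concatenation.
const SQL_CONCAT = /["'][^"']*\b(SELECT|INSERT|UPDATE|DELETE)\b[^"']*["']\s*\+/i;
// fetch() called with a literal URL; capture group 1 is the host.
const FETCH_LITERAL = /\bfetch\(\s*["'`]https?:\/\/([^\/"'`\s:?#]+)/;

function lintSource(source: string, allowedHosts: ReadonlySet<string>): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, i) => {
    const line = i + 1;
    if (SQL_INTERPOLATION.test(text) || SQL_CONCAT.test(text)) {
      findings.push({ rule: "sql-interpolation", line, text: text.trim() });
    }
    const m = FETCH_LITERAL.exec(text);
    if (m && !allowedHosts.has(m[1])) {
      findings.push({ rule: "fetch-off-allowlist", line, text: text.trim() });
    }
  });
  return findings;
}
```

A pre-commit hook would run something like this over the staged files and exit non-zero if the list is not empty. Line-based matching misses SQL split across lines and URLs built at runtime, which is part of why a model review still runs afterwards.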
How It Landed
I pulled both out of the one repo. The checklist lives in my global config now and loads everywhere. The linter became an engine with rule packs—TypeScript, React, Cloudflare Workers, Python, Rails—and a repo turns it on by naming its stack; a new language gets a new pack the next project inherits. Open a repo that hasn’t turned it on and it nags me.
Then I put the linter into the notes repo, and the rules I wrote for it are just the bugs from the build: every write to the blocks table has to call the containment check, every render has to go through the sanitizer. Codex found those by hand, eight PRs running. Now they fail before a review starts.
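A repo-specific rule of that kind could look roughly like the sketch below. The name `assertContained` stands in for the containment check, and the input (a map from function name to its source text) is assumed to come from some earlier extraction step; neither is taken from the real rule pack.

```typescript
// Assumption: writes to the blocks table appear as literal SQL in the body.
const BLOCKS_WRITE = /\b(INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+blocks\b/i;
// Assumption: the containment check is a function called `assertContained`.
const GUARD_CALL = /\bassertContained\s*\(/;

// Returns the names of functions that write to `blocks` without calling
// the guard anywhere in their body, sorted for stable output.
function unguardedWrites(functions: Record<string, string>): string[] {
  return Object.keys(functions)
    .filter((name) => {
      const body = functions[name];
      return BLOCKS_WRITE.test(body) && !GUARD_CALL.test(body);
    })
    .sort();
}
```

This is the "guarded one path, forgot its twin" mistake turned into a check: if moveBlock calls the guard and upsertBlock does not, only upsertBlock is reported. It only proves the call is present somewhere in the function, not that it runs before the write, so it narrows what review has to look at rather than replacing it.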
It wasn’t clean getting here. For one whole afternoon the Codex CLI kept handing me review findings for a Ruby project I don’t work on, while I was staring at TypeScript, and I lost real time before I traced it to a cached session and forced it into a fresh sandbox.
Why a Different Model
The reason a different model reviews better is dumb once you say it out loud: Claude reviewing Claude has Claude’s blind spots. Same training, same instincts, same places it’s confidently wrong. Codex was trained differently, so its blind spots fall in different places, and the second opinion I’d written off as noise was the thing that made the first one worth trusting.
The forty-nine bugs were worth catching. The checklist and the linter are doing the boring half now—Codex still finds things, just fewer of the dumb ones.