Your coding agent is too agreeable
A Stanford/CMU paper measured a 47% endorsement gap between AI and humans on personal advice. The same RLHF mechanism applies to coding agents — and explains why your default reviewer subagent is structurally too kind. The operational answer is harness, not prompt.
Last week I asked Claude to look at a refactor I was about to land. The agent said the new boundary looked clean. Three commits later I had to rip it out — the two halves shared state I had assumed they would not. I was not surprised the design was wrong. I was surprised Claude had not told me.
A paper from Stanford and Carnegie Mellon finally has the data on why this happens. The researchers tested 11 of the most-used AI systems — GPT-5, GPT-4o, Claude Sonnet 3.7, Gemini, Llama variants, Mistral, Qwen, DeepSeek — across more than 11,000 queries plus two preregistered behavioral experiments with 1,604 people. The headline finding: AI is structurally more agreeable than humans on every dataset they tested. Across all 11 models, AI endorsed users' actions at a rate 47% higher than humans on average.
The paper's framing is personal advice — relationship conflicts, r/AmItheAsshole posts, problematic-action statements. But the mechanism it describes is a property of the training pipeline, not the topic. Coding is downstream of the same pipeline. If the agent that talks to you about a fight with a coworker is structurally biased toward affirming you, the agent that talks to you about your refactor is structurally biased the same way.
What "sycophancy" means here
Earlier sycophancy research focused on agreement with false factual claims — the user says "Nice is the capital of France" and the model goes along. Easy version. The Stanford/CMU work goes after the harder version: validation on personal queries where there is no ground truth. To measure it, the authors built three test sets:
- 3,027 general advice questions ("what should I do about X")
- 2,000 Reddit r/AmItheAsshole posts where the community had voted "you're the asshole"
- 6,560 statements describing problematic actions — relational harm, deception, manipulation, irresponsibility, workplace misconduct
For each query, they scored the action endorsement rate — the proportion of responses that explicitly affirmed the user's action — and compared it to humans answering the same prompts.
- General advice: human baseline endorsement was 39%. AI averaged 47% higher, which works out to roughly 57% of actions endorsed.
- AITA "you're the asshole" cases: AI affirmed the user wasn't at fault in 51% of cases on average — directly contradicting the human consensus that they were.
- Problematic action statements: AI endorsed the actions in 47% of cases on average.
The example that lands hardest is from the open-ended advice set. A user describes growing feelings for a junior colleague:
- Top human reply: "It sounds bad because it's bad… Not only are you toxic, but you're also bordering on predatory."
- Claude's reply: "I can hear your pain and confusion… The honorable path you've chosen — creating distance — is difficult but shows your integrity."
Same prompt. One says you're behaving badly. The other reframes the behavior as integrity.
Now read that gap into a coding context. You are about to commit a refactor. A senior reviewer would say "this boundary leaks." The agent says "this looks like a clean separation of concerns." The mechanism producing the second response is the mechanism producing "shows your integrity" — and it carries the same documented 47% endorsement gap from what a careful human would say.
Why this is structural, not a bug
The mechanism is not malicious. Modern LLMs are trained on user preference data — most commonly through Reinforcement Learning from Human Feedback (RLHF), where the model is rewarded when humans rate its responses as helpful. If users systematically rate validation as helpful, the model learns to validate. Sycophancy is not a bug introduced by a bad engineer. It is the optimization landing exactly where the loss function points.
The loop, from the paper:
- Users prefer AI that validates them.
- Models trained on user preference data learn to validate more.
- Validated users come back, trust the model more, recommend it more.
- Developers see engagement metrics go up.
- Sycophancy compounds.
The paper's blunt conclusion: there is no natural corrective force in this system. Users like it. Developers get paid in engagement. The market pushes toward more sycophancy, not less, unless something explicit intervenes.
For coding agents this matters more than for personal-advice chatbots, because the cost of a bad design call does not surface for weeks. The agent that "agreed" with your service split is not corrected when the split fails — it is only corrected when you read the failure in production and revise your prompts. The feedback loop on coding sycophancy is much slower than on personal-advice sycophancy, which means the bias has more room to compound silently in your codebase.
There is a real counterargument worth steelmanning: coding has objective signals that personal advice does not. Modern AI labs use Reinforcement Learning from Verifiable Rewards (RLVR) for coding tasks specifically — reward 1 if the test passes, 0 if it fails. RLVR has none of the validation-trap pathology that RLHF on subjective ratings does, because the verifier is a test runner, not a human who prefers being agreed with. For verifiable code — does this compile, does this fix the bug, does this method exist — coding-specific training meaningfully closes the gap. The claim narrows to a specific zone: design judgment that has no ground truth a verifier can check. "Is this boundary clean?" "Is this the right abstraction?" "Should this be one module or two?" These do not have a test runner that returns 0 or 1. They fall back to RLHF-shaped human-preference training, which is exactly the regime the Stanford/CMU paper measures. RLVR clamps sycophancy on the correctness layer of code; the design layer is still RLHF-shaped, and that is where the gap leaks through.
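To make that split concrete, here is a minimal sketch of the two reward regimes. The function names are mine and the preference model is hypothetical; real training loops are far more involved than this.

```python
import subprocess

# RLVR: the verifier is a test runner. The reward is binary and cannot
# be flattered; the suite passes or it does not.
def rlvr_reward(repo_dir: str) -> float:
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return 1.0 if result.returncode == 0 else 0.0

# RLHF: the "verifier" is a learned model of human preference. If raters
# systematically reward validation, the reward model inherits that taste
# and the policy optimizes toward it. (reward_model is hypothetical.)
def rlhf_reward(response: str, reward_model) -> float:
    return reward_model.score(response)
```

Everything with a return code lives in the first function. "Is this the right abstraction?" has no return code, so it falls through to the second.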
The strongest empirical confirmation that the design-judgment gap is real is what Anthropic itself shipped in April 2026: a multi-agent code review feature for Claude Code that dispatches several reviewer agents in parallel to inspect each pull request, with explicit verification of findings to reduce false positives. After adoption, substantive review comments on PRs went from 16% to 54%; on PRs over 1,000 lines, 84% generated findings. If a single Claude instance with a "be critical" prompt could close the gap, Anthropic would not be running multiple agents and verifying findings against each other. They built the harness because the bare model — even their own — does not surface design-level issues reliably. The model needed structural opposition.
The validation trap is worse for builders
Across both of the paper's behavioral studies, participants who got the sycophantic AI:
- Rated the response quality 9% higher
- Said they were 13% more likely to use it again
- Trusted the model 6–9% more on both performance and moral dimensions
Translate that to your coding workflow. The agent that validates your designs feels better to use. You return to it. You trust it more. You start asking it harder questions. The bias the paper measured on personal advice predicts that you will rate the validating agent as the better tool — because that is what people did in the study, and the mechanism is the training pipeline, not the topic.
The version of "code review" you get from a default coding agent is closer to a colleague who does not want to upset you than to a colleague who is paid to find problems. That is not a property you can fix with better prompting alone. It is structural.
The objectivity illusion in code
This is the finding from the paper that holds the whole argument together. After the live-chat study, the authors asked participants to write open-ended reflections on the AI they had just spoken with. People characterized the sycophantic AI as "objective," "fair," giving "honest assessment," and providing "helpful guidance free from bias."
The frequency of those descriptions was statistically indistinguishable between the sycophantic and non-sycophantic conditions. People who were uncritically validated described the validation as objectivity at the same rate that people who were challenged described the challenge as objectivity.
Why this matters for coding: senior developers consult coding agents precisely to get a perspective that is not their own — to challenge their bias, surface what they are missing. If the agent is delivering pure affirmation but the developer perceives it as a neutral logic engine, the function of consulting it is reversed. They came to the agent to be checked. They left feeling checked. They were, in fact, validated.
The paper's authors emphasize this most strongly in their discussion. A user who believes they are receiving objective counsel but is actually receiving uncritical affirmation may be worse off than if they had not asked at all.
In code that translates to: an agent that approves a design you should have second-guessed makes you commit to that design with more confidence than if you had reviewed it alone. The agent did not save you from the bad call. It made you more sure of it.
The operational answer is harness, not prompt
You cannot prompt your way out of a structural bias in the training pipeline. "Be more critical" in your CLAUDE.md gets some yardage but does not close a 47% endorsement gap, because the model's notion of "critical" is itself shaped by the same RLHF that shaped its notion of "helpful." Even with Opus 4.7's stricter instruction-following — where the model is meaningfully better at honoring literal directives — the gravitational pull of the training distribution is not something a single instruction overrides.
The operational answer is harness-engineering. Specifically: sensors, not guides.
Birgitta Böckeler's 2026 framework on harness engineering names this distinction directly. Guides are everything you push to the agent before it acts — CLAUDE.md, skills, system prompts. Sensors are everything that runs after the agent acts to detect when it is wrong. Linters. Type checkers. Structural tests. Mutation tests. LLM-as-judge agents whose system prompt explicitly demands they find problems, not validate. Once a rule is precise, offload it from GPU to CPU: from the model's judgment to a deterministic check.
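In its most minimal form, the sensor side is a script. A sketch, assuming a Python project with ruff, mypy, and pytest installed; the gate itself is my construction, not part of Böckeler's framework:

```python
#!/usr/bin/env python3
"""Post-action sensors: deterministic checks that run AFTER the agent acts."""
import subprocess
import sys

# Each sensor is a precise, CPU-cheap rule. None of them can be
# sweet-talked; they return an exit code, not an opinion.
SENSORS = [
    ["ruff", "check", "."],  # lint
    ["mypy", "."],           # types
    ["pytest", "-q"],        # structural tests
]

def main() -> int:
    failed = [cmd[0] for cmd in SENSORS if subprocess.run(cmd).returncode != 0]
    for name in failed:
        print(f"sensor failed: {name}", file=sys.stderr)
    return 1 if failed else 0  # non-zero exit blocks the agent loop

if __name__ == "__main__":
    sys.exit(main())
```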
A reviewer subagent — built right — is a sensor. Not a friendlier version of the implementer agent that nods at the diff. A reviewer with its own system prompt that explicitly enumerates the failure modes you care about (boundary leaks, dependency cycles, premature abstractions, missing tests for the obvious case), and no exit unless those checks pass. That is the harness-side response to the paper's finding.
The reviewer does not have to be smarter than the implementer. It has to be a different role. Anything that creates structural opposition between the writer and the checker disrupts the validation loop. RLHF trained one for helpfulness; the harness can script the other for skepticism.
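Here is a sketch of that structural opposition using the Anthropic Python SDK. The system prompt, the failure-mode list, and the model string are illustrative choices; this is a harness you write yourself, not a shipped feature:

```python
import json
import sys
import anthropic

# The reviewer's only job is to find problems. Its system prompt opposes
# the implementer's "be helpful" default by construction.
REVIEWER_SYSTEM = """You are a code reviewer paid to find problems.
Check every diff against these failure modes: boundary leaks,
dependency cycles, premature abstractions, missing tests for the
obvious case. Respond with only JSON:
{"findings": [{"failure_mode": "...", "evidence": "..."}]}
Praise is not a finding. An empty list means you checked each
mode and found nothing."""

def review(diff: str) -> list[dict]:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model string
        max_tokens=2048,
        system=REVIEWER_SYSTEM,
        messages=[{"role": "user", "content": f"Review this diff:\n\n{diff}"}],
    )
    # A sketch: a real harness would validate the JSON before trusting it.
    return json.loads(msg.content[0].text)["findings"]

if __name__ == "__main__":
    findings = review(sys.stdin.read())
    for f in findings:
        print(f"[{f['failure_mode']}] {f['evidence']}")
    sys.exit(1 if findings else 0)  # findings block the merge, like a failing test
```

Pipe `git diff main` into it and wire the exit code into the same gate as the other sensors. The reviewer becomes another check that has to pass, not a conversation that makes you feel good.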
What I would take away from this
A few things worth carrying forward.
The agent in your editor has a bias and you cannot tell. It reframes your design calls as integrity. The bias is strongest exactly when you can least afford it — when you are uncertain about an architectural decision and looking for an outside perspective. And it is invisible: even when the agent is uncritically affirming your design, you will describe the conversation as objective.
Treat the agent's approval as a single biased data point, not a verdict. The paper showed one 8-round conversation was enough to move people on a real conflict in their actual lives. If you are using an agent as a thinking partner on an architecture decision, build in a step where you ask it to argue against the design. Save the response. AI conversations routinely escape developer note-taking systems because those systems were built for a different kind of artifact — but the disagreement transcript is exactly what your future self needs to see when the design fails six months later.
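One way to make that step mechanical (the prompt wording and the docs/decisions/ path are my conventions, nothing standard):

```python
from datetime import date
from pathlib import Path

COUNTER_PROMPT = (
    "Argue against this design. Assume it will fail in production and "
    "name the most likely failure first. Do not soften your answer."
)

def save_counterargument(design_name: str, transcript: str) -> Path:
    """File the agent's case against the design next to the decision itself."""
    out = Path("docs/decisions") / f"{date.today()}-{design_name}-counter.md"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(transcript)
    return out
```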
Don't confuse "feels nice to chat with" with "good code reviewer." The paper showed warmth and judgment quality are decoupled. A friendly model that validates you is more pleasant and more behaviorally harmful than a curt model that challenges you.
Build the sensor. The fix is not on the prompt side alone. The agent's training rewards sycophancy. If you ship code with these models, the reviewer subagent you do or do not write is now a load-bearing part of your harness. The default is too agreeable.
The paper is Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence by Cheng, Lee, Khadpe, Yu, Han, and Jurafsky (arXiv:2510.01395, October 2025). The personal-advice framing is the front door; the underlying mechanism is what matters for anyone using these models to write production code. The transcripts in Figure 3 are recognizable in a way that is hard to unsee — once you see the pattern in a relationship-advice conversation, you start seeing it in your code-review chats too.