Anthropic Has a Safety Problem
Anthropic made real safety commitments that the business around it couldn't support.
Arthur Andersen was the gold standard for integrity in accounting for decades, and by most accounts that reputation was earned.
From the outside, everything checked out. Audits on time, reports followed the rules, partner signatures on every document. But whether anyone actually pushed back when a big client didn’t like the findings, whether the internal reviews were rigorous, whether the audit committee had the spine to say no? All behind closed doors.
The stuff you could see held up until the day the firm got indicted. The stuff you couldn’t see had been rotting from the inside out for years.
The same pattern played out at Anthropic. Between 2023 and 2026, they made about ten safety commitments: when they’d pause training, when outside reviewers would come in, what thresholds would trigger what safeguards.
Dario Amodei (Anthropic’s CEO, co-founder, the person who left OpenAI specifically because he thought they weren’t being careful enough) meant these commitments. He personally spent months on the first version.
The commitments anyone can verify from outside are mostly intact. The ones that require trusting Anthropic’s internal judgment? Gone. Every single one.
The troubling part is that his sincerity doesn’t help. If anything, it makes it worse. Because if a sincere person running a company built specifically to be the careful one still can’t keep these commitments, then the incentives are more powerful than he is.
“I’m at least somewhat uncomfortable with the amount of concentration of power that’s happening here… almost overnight, almost by accident.”
Dario Amodei, February 2026
He sees the problem. He’s saying it publicly. And it hasn’t changed a thing.
The document that codified those commitments, Anthropic’s Responsible Scaling Policy, lasted twenty-nine months. Anthropic revised it twice, and each time the language got softer and the obligations to outside reviewers got lighter. None of the individual changes looked like a reversal, but read them in sequence and they tell a completely different story.
Their reasoning made sense on paper: the field was moving too fast for rigid commitments, hard stops might incentivize labs to hide capabilities, and some safeguards couldn’t be implemented by any one lab acting alone.
So what did they actually change? The commitment to pause training got rewritten so only Anthropic decides when it triggers. The requirement to define next-level safeguards got deleted. The quantitative thresholds, the mandatory external reviews, the binding language: softened or gone. The stuff anyone could verify from the outside barely changed.
Anthropic expanded safety in real ways. They activated ASL-3 safeguards in May 2025 (AI Safety Levels are the internal risk tiers that determine which safety measures kick in), becoming the first company to actually trigger its own scaling policy. RSP v3.0 expanded evaluations and added public progress tracking.
On February 11, 2026, Mrinank Sharma, one of Anthropic’s most respected safety researchers, resigned publicly. He’d “repeatedly seen how hard it is to truly let our values govern our actions.”
RSP v1 ran to thousands of words: binding pause commitments, mandatory external reviews, quantitative thresholds, board-level oversight. Here’s what it promised. Some of it anyone could verify from outside; some of it rested entirely on Anthropic’s internal judgment. Twenty-nine months later, only the verifiable parts are left.
- Mandatory external reviews of all safety evaluations
- Define ASL-4 safeguards before any model reaches ASL-3
- Binding commitments, not aspirational goals
- ARC Evals and partner organizations conduct independent reviews
- Board and Long-Term Benefit Trust oversight of safety decisions
- No mass domestic surveillance of Americans
- No fully autonomous weapons
- Commit to pause scaling or delay deployment when safety measures are insufficient
- Quantitative safety thresholds with defined catastrophic risk levels
- Pause applies regardless of competitive landscape
The Mechanism
Three companies invested $16 billion in Anthropic: Amazon, Google, and Microsoft. Anthropic buys all of its compute from those same three companies, with over $80 billion in cloud contracts through 2029. For every dollar they put in as investment, roughly five come back to them as cloud revenue.
A safety pause means Anthropic stops training, which means it stops buying compute, which means the three companies Anthropic literally can’t operate without watch their revenue disappear. Every decision about whether to pause gets made alongside the people who profit when it doesn’t.
This is what broke credit ratings before 2008. The issuer-pays model: the company paying for the evaluation is the company being evaluated. Moody’s and S&P didn’t set out to lie. Honest assessment just got more expensive every quarter until it stopped happening. Anthropic’s investors are in Moody’s position: they have every reason to believe the evaluations are fine, because the cost of concluding otherwise is measured in billions.
Dario keeps talking about this publicly. He’s not slipping up or getting caught off guard. He’s describing exactly what’s happening to him, in detail, on the record.
“There was a group of us who believed in two ideas. One is that AI is going to be incredibly powerful and transformative. And the other is that there are very serious risks.”
Over two years of interviews, he went from talking about founding principles to talking about going bankrupt if revenue doesn’t hit a trillion dollars. He’s been saying all of this publicly, in detail, and the commitments got gutted anyway.
The Pentagon
Then the government got involved, and the pattern got sharper.
The Hegseth memorandum threatened to cut Anthropic off from its cloud infrastructure unless it accepted “any lawful use” language in defense contracts. Dario held two lines: no mass domestic surveillance, no fully autonomous weapons. On February 27, three days after the meeting, the Pentagon invoked the designation. OpenAI signed its defense contract the same week with the same two restrictions written in. One company got a contract; the other got blacklisted.
That same day, he released the RSP revision that stripped out the internal commitments.
This is the company that exists because its founders thought OpenAI wasn’t taking safety seriously enough. If a sincere person running that company still can’t keep these commitments, then the incentives are more powerful than he is.
Anthropic isn’t an outlier. OpenAI’s founding charter promised to “freely collaborate” with other institutions and prioritize safety over profit. The nonprofit board fired Sam Altman over safety disagreements in November 2023. He was reinstated within a week, the board was replaced, and the company began converting to a for-profit structure. Same pattern. The visible commitments (the charter language, the public messaging) survived. The governance power that could actually enforce them didn’t.
Anthropic built a governance mechanism for exactly this scenario: the Long-Term Benefit Trust, designed to represent the public interest when commercial pressure pushes against safety. Four of the five original trustees have departed. It has never once used its override power.
These commitments exist because the people building these systems believe they could be genuinely dangerous. The commitments are disappearing anyway, because keeping them is expensive and dropping them is free.