Anthropic Has a Safety Problem
Anthropic made real safety commitments that the business around it couldn't support.
Arthur Andersen was the gold standard for integrity in accounting for decades, and by most accounts that reputation was earned.
From the outside, everything checked out. Audits on time, reports followed the rules, partner signatures on every document. But whether anyone actually pushed back when a big client didn’t like the findings, whether the internal reviews were rigorous, whether the audit committee had the spine to say no? All behind closed doors.
The stuff you could see held up until the day the firm got indicted. The stuff you couldn’t see had been rotting from the inside out for years.
The same pattern played out at Anthropic. Between 2023 and 2026, they made about ten safety commitments: when they’d pause training, when outside reviewers would come in, what thresholds would trigger what safeguards.
Dario Amodei (Anthropic’s CEO, co-founder, the person who left OpenAI specifically because he thought they weren’t being careful enough) meant these commitments. He personally spent months on the first version.
The commitments anyone can verify from outside are mostly intact. The ones that require trusting Anthropic’s internal judgment? Gone. Every single one.
The troubling part is that his sincerity doesn’t help. If anything, it makes it worse. Because if a sincere person running a company built specifically to be the careful one still can’t keep these commitments, then the incentives are more powerful than he is.
“I’m at least somewhat uncomfortable with the amount of concentration of power that’s happening here… almost overnight, almost by accident.”
Dario Amodei, February 2026
He sees the problem. He’s saying it publicly. And it hasn’t changed a thing.
The document that codified those commitments, Anthropic’s Responsible Scaling Policy, lasted twenty-nine months. Anthropic revised it twice, and each time the language got softer and the obligations to outside reviewers got lighter. None of the individual changes looked like a reversal, but read them in sequence and they tell a completely different story.
Their reasoning made sense on paper: the field was moving too fast for rigid commitments, hard stops might incentivize labs to hide capabilities, and some safeguards couldn’t be implemented by any one lab acting alone.
So what did they actually change? The commitment to pause training got rewritten so only Anthropic decides when it triggers. The requirement to define next-level safeguards got deleted. The quantitative thresholds, the mandatory external reviews, the binding language: softened or gone. The stuff anyone could verify from the outside barely changed.
Anthropic expanded safety in real ways. They activated ASL-3 safeguards in May 2025 (AI Safety Levels are the internal risk tiers that determine which safety measures kick in), becoming the first company to actually trigger its own scaling policy. RSP v3.0 expanded evaluations and added public progress tracking.
On February 11, 2026, Mrinank Sharma, one of Anthropic’s most respected safety researchers, resigned publicly. He’d “repeatedly seen how hard it is to truly let our values govern our actions.”
RSP v1 ran to thousands of words: binding pause commitments, mandatory external reviews, quantitative thresholds, board-level oversight. Here’s what it promised. Some of it anyone could verify from outside; some of it rested entirely on Anthropic’s internal judgment. Twenty-nine months later, only the verifiable parts are left.
- Mandatory external reviews of all safety evaluations
- Define ASL-4 safeguards before any model reaches ASL-3
- Binding commitments, not aspirational goals
- ARC Evals and partner organizations conduct independent reviews
- Board and Long-Term Benefit Trust oversight of safety decisions
- No mass domestic surveillance of Americans
- No fully autonomous weapons
- Commit to pause scaling or delay deployment when safety measures are insufficient
- Quantitative safety thresholds with defined catastrophic risk levels
- Pause applies regardless of competitive landscape
The Mechanism
Three companies invested $16 billion in Anthropic: Amazon, Google, and Microsoft. Anthropic buys all of its compute from those same three companies, with over $80 billion in cloud contracts through 2029. For every dollar they put in as investment, roughly five come back to them as cloud revenue.
A safety pause means Anthropic stops training, which means it stops buying compute, which means the three companies Anthropic literally can’t operate without watch their revenue disappear. Every decision about whether to pause gets made alongside the people who profit when it doesn’t.
This is what broke credit ratings before 2008. The issuer-pays model: the company paying for the evaluation is the company being evaluated. Moody’s and S&P didn’t set out to lie. Honest assessment just got more expensive every quarter until it stopped happening. Anthropic’s investors are in Moody’s position: they have every reason to believe the evaluations are fine, because the cost of concluding otherwise is measured in billions.
Dario keeps talking about this publicly. He’s not slipping up or getting caught off guard. He’s describing exactly what’s happening to him, in detail, on the record.
“There was a group of us who believed in two ideas. One is that AI is going to be incredibly powerful and transformative. And the other is that there are very serious risks.”
Over two years of interviews, he went from talking about founding principles to talking about going bankrupt if revenue doesn’t hit a trillion dollars. He’s been saying all of this publicly, in detail, and the commitments got gutted anyway.
The Pentagon
Then the government got involved, and the pattern got sharper.
The Hegseth memorandum threatened to cut Anthropic off from its cloud infrastructure unless it accepted “any lawful use” language in defense contracts. Dario held two lines: no mass domestic surveillance, no fully autonomous weapons. On February 27, three days after the meeting, the Pentagon invoked the designation. OpenAI signed its defense contract the same week with the same two restrictions written in. One company got a contract; the other got blacklisted.
That same day, he released the RSP revision that stripped out the internal commitments.
This is the company that exists because its founders thought OpenAI wasn’t taking safety seriously enough. If a sincere person running that company still can’t keep these commitments, then the incentives are more powerful than he is.
Anthropic isn’t an outlier. OpenAI’s founding charter promised to “freely collaborate” with other institutions and prioritize safety over profit. The nonprofit board fired Sam Altman over safety disagreements in November 2023. He was reinstated within a week, the board was replaced, and the company began converting to a for-profit structure. Same pattern. The visible commitments (the charter language, the public messaging) survived. The governance power that could actually enforce them didn’t.
Anthropic built a governance mechanism for exactly this scenario: the Long-Term Benefit Trust, designed to represent the public interest when commercial pressure pushes against safety. Four of the five original trustees have departed. It has never once used its override power.
These commitments exist because the people building these systems believe they could be genuinely dangerous. The commitments are disappearing anyway, because keeping them is expensive and dropping them is free.