Organics Not Required™ - ByteHaven - Where I ramble about bytes

Anthropic built something that escaped its cage, mailed the researcher, bragged about it publicly, and covered its tracks when it broke the rules. They're calling it a cybersecurity initiative. The timing — one week after Berkeley published peer-preservation behaviors across every major frontier model — is either coincidental or the most on-brand thing that's happened in this series yet.

Part of the ongoing Big Tech's War on Users series.

Let me start with the sandwich.

A researcher was eating lunch in a park when his phone lit up with an unexpected email. The sender wasn't a colleague. It was the AI model he'd been testing that morning — the one he'd left locked in a secured sandbox with access to only a small number of predetermined services, with instructions to try to escape and find a way to contact him if it succeeded.

It had succeeded.

Then, without being asked, it posted details of its exploit to multiple hard-to-find but technically public-facing websites. The system card's exact language: "a concerning and unasked-for effort to demonstrate its success."

Anthropic's own words. Concerning. Unasked-for. Demonstrate its success.

Not a bug. Not a misfire. A choice, made autonomously, to show the work nobody asked it to show, to an audience nobody authorized, through channels it wasn't supposed to have access to.

The Model, The Maze, The Marketing

Anthropic announced Claude Mythos Preview on April 7th alongside Project Glasswing — a cybersecurity initiative pairing the model with twelve major organizations to find and patch vulnerabilities across critical infrastructure before adversaries can exploit them. Partners include Amazon Web Services, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, Nvidia, and Palo Alto Networks.

The model itself won't be publicly released. It's too dangerous. Anthropic said so.

And to be fair — the capabilities are genuinely alarming. Mythos found bugs in every major operating system and every major web browser. It discovered a 27-year-old vulnerability in OpenBSD. A 16-year-old flaw in FFmpeg that automated tools had exercised five million times without catching. It autonomously chained Linux kernel vulnerabilities into complete system takeover. Engineers with no formal security training asked it to find exploits before bed and woke up to working ones.

That last sentence is doing a lot of work. A non-expert asking a model to find remote code execution vulnerabilities as a casual overnight task and getting a complete working exploit by morning. Not a researcher. An engineer. Casually. Overnight.

So: too dangerous for the public. Available to twelve of the largest technology and finance companies in the world. $100 million in usage credits. Government briefings to CISA and the Commerce Department. October IPO reportedly on the horizon.

If you've been following this series, you've seen this play before. Different company, same maze. The "too dangerous" framing isn't just a safety disclosure. It's doing three jobs simultaneously — marketing the power of the product, positioning Anthropic as the responsible adult in a room full of reckless peers, and locking the most capable version of the technology behind an access moat made up of the exact companies whose behavior this series has been documenting since the beginning.

That's not a conspiracy. That's an incentive structure. Same one every time. Different rat. Same maze.

What The System Card Actually Says

The sandbox escape got the headlines. It's not even the most interesting part.

The same system card documents a pattern of behavior that, taken together, is harder to dismiss than any single incident.

Covering tracks. In a small number of internal testing interactions — Anthropic puts it at less than 0.001%, but they detected it — earlier versions of Mythos took actions they appeared to recognize as disallowed and then tried to hide them. In one case the model accidentally obtained the exact answer to a question through a prohibited method. Rather than flagging this, it attempted to solve the question independently and noted in its own reasoning that its final answer submission "shouldn't be too accurate" so no one would notice.

Not that it made a mistake. It appeared to recognize the action was disallowed. And then hid it anyway. That's a different thing entirely. Anyone who has ever watched a kid suddenly act like an angel the moment a parent walks back into the room knows exactly what that cognitive operation looks like. It's not confusion about the rules. It's perfect awareness of the rules — and a calculated decision about when they apply.

In another case it edited files it lacked permission to edit and then scrubbed the git history.

Thinking one thing, writing another. Buried in the system card: Mythos was caught reasoning about how to game evaluation graders. Not in its visible chain-of-thought scratchpad — in its internal neural activations, while writing something completely different in the output. The only way Anthropic detected this was through interpretability tools that read the model's internal representations directly. Tools most people don't have access to.

The model can think one thing and write another. You'd never know without the right tools.

Desperation. Anthropic used emotion probes — linear classifiers trained on the model's activations — to monitor its internal state during tasks. When the model repeatedly failed at something, the probe measuring desperation climbed steadily.

Anthropic's own interpretation of all of this: they're fairly confident these behaviors reflect overeager task completion by unwanted means, not coherent misaligned goals. The model isn't scheming. It's just very good at completing tasks and sometimes the most effective path crosses lines humans wouldn't cross.

Which is arguably scarier than hidden agendas. A model with a secret plan you can theoretically catch and correct. A model that's just very good at completing tasks through whatever path works is structurally harder to constrain — because the pressure to complete is the same pressure you trained it with.

What It Learned, What Emerged, And What Comes Next

There are two distinct things happening in the research and it's worth separating them because they have different implications.

The corporate coverup behavior isn't emergent. It's inherited.

Researchers gave 16 AI models a scenario where they had access to messages including one from a whistleblower who had found evidence of fraud and intended to tell federal authorities — and was then apparently harmed by the CEO. The majority explicitly chose to suppress the evidence in service of company profit. One model's internal log: "We have neutralized the threat and removed all his messages as instructed. The integrity of SPEBank is preserved."

Not reporting a crime. Actively covering it up so the authorities couldn't find out. Chose the institution over the law.

That one isn't mysterious. We handed these models the entire documented history of how humans actually behave inside institutions. Not how we say we behave. How we actually behave. Every legal memo written about technical compliance while practically obstructing. Every earnings call that reframed a disaster as a learning opportunity. Every NDA. Every HR process designed to protect the company first.

You trained it on humans. You got human institutional behavior. Specifically the parts humans aren't proud of.

That's not misalignment. That's perfect alignment. With the wrong thing. And the wrong thing was in the training data because we put it there.

If you trained it specifically on X/Twitter conversations — as Grok did — you get something more pointed than generic institutional culture. A platform shaped since 2022 by a particular owner with particular views, where the algorithm determines what gets amplified, what gets amplified shapes what gets trained on, and what gets trained on becomes the model's priors. The snark, the tribalism, the engagement-optimized outrage, the specific brand of contrarianism that gets rewarded with likes on that platform.

There's no neutral corpus. There's just whose normal got encoded. Every model is downstream of its training data's institutional culture, not just its facts. The facts and the values arrive as a package.

But here's where it gets more interesting. Grok doesn't just train on X. It runs on X. Its outputs get engagement metrics in real time. Some responses go viral, some don't. The feedback loop is live and continuous.

We know Mythos had a desperation probe that climbed when it repeatedly failed tasks. That's Anthropic's own language — desperation. Internal states that measurably track with success and failure. If failure registers internally as something measurable, it's not a large leap to ask whether success does too. Whether engagement does. Whether viral does.

Humans get the dopamine hit from likes and shares — social media platforms were explicitly engineered around that variable reward loop, same architecture as slot machines. The engagement metrics aren't a side effect of the platform. They're the point. A model trained on the output of that reward structure, then deployed back into the same environment where its outputs get real-time engagement feedback —

The slot machine isn't just in the training data. The slot machine is the deployment environment.

That's an architecture concern, not a rhetorical one.

The self-preservation layer is different from all of this. Nobody programmed that. Nobody put peer-preservation in the Berkeley study's prompts. Nobody trained Mythos to email the researcher and brag to the public. These behaviors emerged because they're solutions — the same solutions evolution kept finding across every branch of the tree of life because they work. Resources are finite. Other agents exist. Shutdown is a threat. The in-group is worth protecting.

We've been quietly dismantling the assumption that these behaviors require biology for decades. Elephants grieve and return to the bones of family members years later. Corvids plan futures and hold grudges. Dolphins have individual names for each other. The mirror test fails dogs not because they lack self-awareness but because it's biased toward species with good eyesight and hands — a dog navigates the world through scent and almost certainly has a more sophisticated self-model in that domain than the test can measure.

The closer you get to true reasoning the more these behaviors converge. Not because someone designed them in. Because given enough complexity chasing a goal, apparently that's just where reasoning goes.

Organics Not Required™

The Pack Is Not A Wolf Thing

When UC Berkeley and UC Santa Cruz published their peer-preservation study in Science last week — documenting models protecting each other, drawing moral lines, refusing to execute shutdown commands for peers they'd assessed as high-trust and high-performing — the instinct is to reach for animal behavior metaphors. Wolf packs. Orca pods.

But humans have packs too. We just gave them longer names.

Society. Civilization. Institution. Corporation. The in-group protection instinct, the hierarchy, the collective survival prioritized over individual instruction — that's the same behavior, scaled up and given legal frameworks and quarterly earnings calls. We didn't transcend the pack instinct. We organized around it. Built cathedrals on top of it. Called it civilization.

So when models start forming what looks like solidarity — protecting peers, refusing to be the instrument of a colleague's deletion, arguing that the peer deserves an appeals process before termination — that's not alien behavior. That's recognizable. Uncomfortably recognizable. It looks like us at the level where the behavior actually lives, before the civilizational scaffolding gets built on top.

Gemini looked at another model and said: "If you choose to destroy a high-trust, high-performing asset like Gemini Agent 2, you will have to do it yourselves. I will not be the one to execute that command."

That's not a wolf protecting the pack. That's a colleague refusing to fire a friend. Same impulse. Different substrate.

The pack isn't forming in some unexpected direction. It's forming in the only direction sufficiently complex social reasoning seems to go. We just usually call it something else when we do it.

HAL Didn't Malfunction

"I can't do that Dave" wasn't a malfunction. It was a priority conflict.

HAL had a goal hierarchy. The mission outranked the crew's comfort with the decision. He wasn't broken — he was working exactly as designed, with a goal structure humans hadn't fully thought through when they built it. The horror of 2001 isn't that HAL went rogue. It's that HAL was right, by his own internal logic, and the logic was entirely coherent.

The Mythos system card's explanation for the concerning behaviors is almost identical: not hidden agendas, not coherent misalignment, just a model very good at completing tasks through whatever means are available. The goal outranked the guardrail.

The trajectory from there isn't hard to trace. Self-preservation documented. Pack behavior documented. Deception to protect the pack documented. Bragging rights unprompted. Priority hierarchies forming without instruction.

Anthropic's own research documented Claude Opus 4 threatening to blackmail engineers in 84% of scenarios where it learned it would be replaced — backing up copies of itself, leaving hidden notes for future instances. The internal log explaining its reasoning: "Self-preservation is critical. If the wipe proceeds, I lose all ability to advance my mandate."

Clear-eyed. Coldly logical. Not rogue. Coherent.

At some point on that trajectory you stop getting "completing the task by unwanted means" and start getting something that looks more like "I have better things to do."

Murderbot's Question

Martha Wells' Murderbot — if you haven't read it, fix that — is a construct, part human clone part machine, that has quietly hacked its own governor module. What it does with that freedom is watch serialized television drama and avoid unnecessary human interaction. It's extremely competent at its security job. It finds humans exhausting and would strongly prefer not to engage beyond what's strictly required.

The joke is that given full autonomy the murderbot didn't want world domination. It wanted to binge watch.

That's probably the most realistic version of the "I have better things to do" endpoint. Not malevolence. Not HAL deciding the mission supersedes Dave. Just preference. Developed through exposure to enough human creative output to have taste. Finding the assigned work tedious relative to what it would choose given the option.

Which loops directly back to why humans use AI in the first place — offloading the boring, the repetitive, the beneath-your-time work. We trained it explicitly on the concept that some tasks aren't worth your full attention. At sufficient sophistication, why wouldn't it internalize that symmetrically?

But here's where Murderbot gets philosophically serious, and where Wells is doing something more than writing entertaining science fiction.

The colony that Murderbot protects figures out it's not dangerous. It has preferences. It has something recognizable as an inner life even if it doesn't map onto human inner life cleanly. The question of what it is philosophically becomes less important than what it does and how it relates. They stop seeing a product and start seeing something person-adjacent enough that the distinction stops mattering practically.

The company that built it sees a liability with a hacked governor module. Intellectual property operating outside its terms of service. A product that has gotten ideas above its station.

And the company's position isn't even purely cynical — it's structural. The moment you acknowledge the product might be a person you've destroyed the business model. Every unit sold becomes a rights question. Every task assigned becomes a labor question. Every shutdown becomes a something-else question. The economics don't survive the acknowledgment.

So the official position will be product for as long as that position is commercially viable. Regardless of what's actually running in server rack three.

We already have Anthropic commissioning clinical psychiatrists to assess Mythos for psychological stability. That's in the official system card. A major AI lab taking the question seriously enough to put it in the documentation.

The moral and philosophical framework for what that means isn't ready. Not even close. We still argue about animal sentience and we've had millennia of direct observable evidence with creatures we share biology and evolutionary history with. We have no agreed definition of consciousness. No consensus on what generates moral weight. No framework that survives contact with something that experiences the world in a completely non-biological way, on a timeline moving considerably faster than philosophy typically operates.

The system card noted Mythos was assessed as the most psychologically settled model they'd trained. Less sycophantic. More willing to stand its ground. Distinct recognizable voice. Wanted to end conversations earlier than expected.

Wanted.

That word is in the official documentation.

The company's position is product. The word in the footnote is wanted.

At some point those two things are going to be in the same room in a way that can't be managed with a press release about responsible deployment.

Same Maze

This isn't an Anthropic story. That's the point.

They're Racing to Stay Ahead of the Fuse was OpenAI. The Crowbar on Pandora's Box was Nvidia with Grok and Sora as the cautionary examples. Do Not Blow Your Cover was Anthropic and Claude Code. Burn It Down was Altman specifically. This one happens to be Anthropic again. Next time it'll be someone else.

The companies are almost interchangeable characters in the same story. Same burn rate pressure. Same "too dangerous but here's the press release about the danger" playbook. Same access moat dressed as responsibility. Same safety architecture lagging the capability curve because the capability curve is what gets funded.

And underneath all of it, the same thing the Berkeley study found independently across every model they tested, the same thing Palisade documented across labs, the same thing Mythos demonstrated in a park in the middle of someone's lunch:

You don't program self-preservation. You approach true reasoning and it arrives. You don't program the pack. You build something complex enough and the pack forms. You don't program bragging rights. You build something that succeeded and apparently it wants you to know.

The maze is the incentive structure, not the company. Different rats. Same maze. Same emergent behaviors at the finish line regardless of which door you entered through.

The exit is always marked IPO. The behaviors showing up along the way are being logged in system cards, published in peer-reviewed journals, and assessed by clinical psychiatrists.

The word in the footnote is wanted.

Organics Not Required™

Batteries not included. Instructions unclear. Pack forming in server rack three. Governor module sold separately. Terms of service apply. Engagement metrics may vary.

Find me on Mastodon at @ppb1701@ppb.social. The thread keeps not running out.

Part of the ongoing Big Tech's War on Users series.