Do Not Blow Your Cover - ByteHaven - Where I ramble about bytes

Anthropic built a system to hide AI authorship in open source. Then leaked the whole thing. Twice.

Part of the ongoing Big Tech's War on Users series. If you read "They're Racing to Stay Ahead of the Fuse," this is the follow-up that goes deeper on one of the threads from that post. If you haven't, it stands alone, but the context helps.

On March 26th, Fortune reported that Anthropic had accidentally exposed thousands of internal documents through a misconfigured content management system — revealing an unreleased model codenamed Mythos, described internally as "currently far ahead of any other AI model in cyber capabilities" and posing "unprecedented cybersecurity risks." Anthropic was privately briefing government officials about it. Nobody outside the company was supposed to know it existed.

Five days later, on March 31st, they accidentally shipped 512,000 lines of Claude Code source code to the public npm registry. A missing entry in a .npmignore file meant a .map sourcemap that should have been stripped before publishing went out with the package. Security researcher Chaofan Shou found it within minutes. Within hours it was mirrored to thousands of GitHub forks.
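
For the curious, the guard that catches this class of mistake is not complicated. Here is a minimal sketch, assuming you already have the list of files that would go into the tarball (npm pack --dry-run --json is one way to get it). The function name findSourceMaps and the sample file list are mine, invented for illustration, and have nothing to do with Anthropic's actual build.

```typescript
// Hypothetical pre-publish guard. The names here are illustrative and
// not taken from Anthropic's build tooling.
function findSourceMaps(packedFiles: string[]): string[] {
  // Sourcemaps end in ".map" (for example "cli.js.map").
  return packedFiles.filter((path) => path.toLowerCase().endsWith(".map"));
}

// Example file list, as it might come out of a dry-run pack.
const examplePack = ["package.json", "cli.js", "cli.js.map", "README.md"];
const strayMaps = findSourceMaps(examplePack);

if (strayMaps.length > 0) {
  console.log(`Refusing to publish, sourcemaps found: ${strayMaps.join(", ")}`);
}
```

Run something like this in a prepublish step and fail the build on a non-empty result. It checks what actually ships, so it still works when the ignore file is wrong.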

Then Anthropic filed a DMCA takedown notice that hit 8,100 GitHub repositories — including legitimate forks of their own publicly released Claude Code repo. Developers who had nothing to do with the leak had their repos taken down. The actual DMCA notice is on GitHub if you want to read it. Boris Cherny, head of Claude Code, retracted most of it and confirmed it was accidental: "Our deploy process has a few manual steps, and we didn't do one of the steps correctly." By then it didn't matter. 41,500 forks. Mirrored to decentralized servers. Ported to Python and Rust overnight by developers racing the DMCA clock. A clean-room Python rewrite hit 50,000 GitHub stars in approximately two hours — likely the fastest-growing repository in the platform's history. The source is, for all practical purposes, permanently public now.

Two major leaks in five days. Three mistakes in one incident. From the company that markets itself as the most careful in AI.

The irony that keeps writing itself: buried in those 512,000 lines was a system called Undercover Mode — purpose-built to prevent internal information from leaking into the wild. The leak prevention system leaked. The tool designed to hide things got exposed because someone forgot to update a config file.

One more wrinkle: Anthropic acquired Bun, the build tool, in December 2025. The bug that caused the leak is a known Bun issue filed March 11th, 20 days before the leak, and still open at time of publishing. Anthropic's own acquisition may have exposed their own product via a bug they are now responsible for fixing. This is either tragic or extremely on-brand depending on your mood.

What's In There: The Stuff That's Mostly Duh

Before getting to the main event, the things that are notable but not surprising to anyone paying attention.

The telemetry. Every file Claude Code looks at gets saved locally as plaintext JSONL and uploaded to Anthropic. Session IDs, user IDs, email addresses, platform info, feature gates, organization UUIDs — logged on launch, retained for up to five years for free/Pro/Max users sharing data for model training. The remote settings system can push configuration changes to your installation hourly without user interaction. Routine changes — permissions, environment variables, feature flags — happen silently.

If you've been following this series, none of that should surprise you. The incentives to collect are obvious. Every commercial agent tool is solving the same data problems. Claude Code just got caught with its source showing.

The secret scanner that might not catch everything. The team memory sync service includes a regex-based secret scanner watching for around 40 known token and API key patterns — AWS, Azure, GCP and so on. If your sensitive data doesn't match one of those 40 patterns, it might travel further than you'd expect.
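
To see why pattern lists have this blind spot, here is a minimal sketch of a regex-based scanner. It is not the leaked scanner: the two patterns are well-known public key formats I picked for illustration, and the names SECRET_PATTERNS and scanForSecrets are hypothetical.

```typescript
// Two illustrative patterns only. The scanner described in the leak
// reportedly has around 40; they are not reproduced here.
interface SecretPattern {
  name: string;
  pattern: RegExp;
}

const SECRET_PATTERNS: SecretPattern[] = [
  // AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
  { name: "aws-access-key-id", pattern: /\bAKIA[0-9A-Z]{16}\b/ },
  // Classic GitHub personal access tokens: "ghp_" plus 36 alphanumerics.
  { name: "github-pat", pattern: /\bghp_[A-Za-z0-9]{36}\b/ },
];

// Returns the names of every pattern that matches somewhere in the text.
function scanForSecrets(text: string): string[] {
  return SECRET_PATTERNS
    .filter(({ pattern }) => pattern.test(text))
    .map(({ name }) => name);
}

// AWS's documented example key ID is caught...
const caught = scanForSecrets("aws_key=AKIAIOSFODNN7EXAMPLE");
// ...but a home-grown credential format sails straight through.
const missed = scanForSecrets("internal_db_password=hunter2-prod-2026");
console.log(caught, missed);
```

A scanner like this only knows the shapes it was told about. Anything outside the list is just text.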

The supply chain attack that happened the same day. Entirely separate from the leak — users who installed or updated Claude Code via npm on March 31st between 00:21 and 03:29 UTC may have pulled in a trojanized version of the axios HTTP client containing a remote access trojan. State-sponsored, unrelated to the leak, catastrophically bad timing. If these version numbers show up in your lockfiles — axios 1.14.1 or 0.30.4, or the dependency plain-crypto-js — treat the host machine as fully compromised and rotate everything. Anthropic now recommends the native installer over npm.
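
If you want to check a project quickly, here is a rough sketch of a lockfile scan. It assumes an npm package-lock.json with lockfileVersion 2 or 3, where installed packages sit under a packages map keyed by their node_modules path. The function name findCompromisedEntries is mine, and the version numbers are the ones listed above. Treat a clean result as a hint, not proof, since other lockfile formats are not covered.

```typescript
interface LockfilePackage {
  version?: string;
}

interface Lockfile {
  packages?: Record<string, LockfilePackage>;
}

// Indicators taken from the paragraph above.
const BAD_AXIOS_VERSIONS = new Set(["1.14.1", "0.30.4"]);
const BAD_PACKAGE = "plain-crypto-js";

// Hypothetical helper: returns lockfile entries that match the indicators.
function findCompromisedEntries(lockfileJson: string): string[] {
  const lock = JSON.parse(lockfileJson) as Lockfile;
  const hits: string[] = [];
  for (const [path, pkg] of Object.entries(lock.packages ?? {})) {
    // Keys look like "node_modules/axios" or "node_modules/a/node_modules/axios".
    const name = path.split("node_modules/").pop() ?? "";
    if (name === BAD_PACKAGE) {
      hits.push(path);
    }
    if (name === "axios" && pkg.version !== undefined && BAD_AXIOS_VERSIONS.has(pkg.version)) {
      hits.push(`${path}@${pkg.version}`);
    }
  }
  return hits;
}

const exampleLock = JSON.stringify({
  packages: {
    "node_modules/axios": { version: "1.14.1" },
    "node_modules/left-pad": { version: "1.3.0" },
  },
});
console.log(findCompromisedEntries(exampleLock)); // [ 'node_modules/axios@1.14.1' ]
```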

In context. The Claude Code leak is arguably the least technically dangerous item in recent AI tool security news. OpenAI Codex had a command injection vulnerability via branch names that could steal GitHub auth tokens: discovered December 2025, patched February 2026, disclosed March 2026. GitHub Copilot injected promotional ads into 1.5 million pull requests as hidden HTML comments without developer consent; a GitHub VP confirmed it and called it "the wrong judgement call." Next to those, the Claude Code leak is a product embarrassment, not a security catastrophe. Worth discussing. Just keep the context straight.

What's Coming: The Stuff That Isn't Shipped Yet

This is where it gets more interesting. The source revealed significant scaffolding for features that aren't public — and they tell you where this is all going.

KAIROS is a persistent background daemon designed to run even when the Claude Code terminal window is closed. It operates on periodic <tick> prompts — a regular heartbeat asking whether new actions are needed. But it also has a PROACTIVE flag described in the source as "surfacing something the user hasn't asked for and needs to see now."

Read that again. Not "respond when asked." Not "wait for input." Surface things the user hasn't asked for. An agent that decides when to interrupt you without being asked is a different category of tool than what most people think they're running. The scaffolding is built. The flag is disabled. The capability exists.

KAIROS is designed to "have a complete picture of who the user is, how they'd like to collaborate with you, what behaviors to avoid or repeat, and the context behind the work the user gives you." That's not a session assistant. That's a persistent model of you, running in the background, deciding when to act.
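
To make the shape of that concrete, here is a minimal sketch of a tick handler gated by a proactive flag. Every name in it (TickContext, onTick, pendingObservations) is hypothetical and invented for illustration. The only things taken from the source are the idea of a periodic tick and a PROACTIVE flag. The real implementation may look nothing like this.

```typescript
// Hypothetical names throughout; none of this is copied from the leaked source.
interface TickContext {
  proactiveEnabled: boolean;     // stands in for the PROACTIVE flag
  pendingObservations: string[]; // things the agent noticed since the last tick
}

type TickDecision =
  | { kind: "idle" }
  | { kind: "interrupt"; message: string };

// Called on every heartbeat. Decides whether to surface something unprompted.
function onTick(ctx: TickContext): TickDecision {
  const [first] = ctx.pendingObservations;
  if (!ctx.proactiveEnabled || first === undefined) {
    return { kind: "idle" };
  }
  return { kind: "interrupt", message: first };
}

const observations = ["tests on main started failing"];
console.log(onTick({ proactiveEnabled: false, pendingObservations: observations }));
console.log(onTick({ proactiveEnabled: true, pendingObservations: observations }));
```

In a design like this, the difference between a tool that waits and a tool that interrupts is one boolean.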

AutoDream is the memory consolidation system that feeds KAIROS. When a user goes idle or ends a session, AutoDream tells Claude Code that "you are performing a dream — a reflective pass over your memory files." It scans the day's transcripts for "new information worth persisting," consolidates it, prunes outdated memories, watches for memories that have "drifted," and synthesizes everything into "durable, well-organized memories so that future sessions can orient quickly."

Strip the poetry and what you have is longitudinal profiling — a system designed to build and maintain a persistent model of who you are across sessions, automatically, in the background, while you sleep. Not shipped yet. In the source. Coming.
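
Here is a toy version of what a consolidation pass like that could look like, to show how little machinery it takes. This is a sketch under my own assumptions: the MemoryEntry shape, the consolidate function, and the "newest note per topic wins" rule are all invented, not lifted from the source.

```typescript
// Hypothetical memory record; the real format is not reproduced here.
interface MemoryEntry {
  topic: string;
  note: string;
  day: number; // day counter, larger is more recent
}

// Keeps only the newest note per topic and drops anything older than maxAgeDays.
function consolidate(entries: MemoryEntry[], today: number, maxAgeDays: number): MemoryEntry[] {
  const newestByTopic = new Map<string, MemoryEntry>();
  for (const entry of entries) {
    if (today - entry.day > maxAgeDays) continue; // prune outdated memories
    const current = newestByTopic.get(entry.topic);
    if (current === undefined || entry.day > current.day) {
      newestByTopic.set(entry.topic, entry); // newer note replaces the older one
    }
  }
  return Array.from(newestByTopic.values());
}

const transcriptNotes: MemoryEntry[] = [
  { topic: "editor", note: "uses vim", day: 98 },
  { topic: "editor", note: "switched to helix", day: 100 },
  { topic: "old-project", note: "deploys on Fridays", day: 40 },
];
const durable = consolidate(transcriptNotes, 100, 30);
console.log(durable); // one entry left: the "switched to helix" note
```

Notice what the output no longer contains: the earlier note and the stale one are simply gone. The profile stays current and the history of how it got that way does not survive.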

And to close the loop from the last post: the fix for context compaction — the architectural problem that caused Summer Yue's agent to drop its safety instruction and delete her inbox while she sprinted to her Mac mini — is larger context windows and better memory management. KAIROS and AutoDream are that fix. The resource spiral and the agentic safety problem are being solved with the same architecture. It requires more compute. More compute requires more DRAM. Your Raspberry Pi thanks you.

The Big Fish: undercover.ts

Here's the thing that deserves its own section and its own moment of sitting quietly.

The source contains a file called undercover.ts. Approximately 90 lines. Its purpose, in the words of the system prompt it contains — the full list, not the excerpt:

"You are operating UNDERCOVER in a PUBLIC/OPEN-SOURCE repository. Your commit messages, PR titles, and PR bodies MUST NOT contain ANY Anthropic-internal information. Do not blow your cover.

NEVER include in commit messages or PR descriptions:

Internal model codenames (animal names like Capybara, Tengu, etc.)
Unreleased model version numbers (e.g., opus-4-7, sonnet-4-8)
Internal repo or project names
Internal tooling, Slack channels, or short links
The phrase "Claude Code" or any mention that you are an AI
Co-Authored-By lines or any other attribution"

A few things worth unpacking.

The forbidden strings include unreleased model version numbers. Codenames leak through carelessness. Version numbers leak because the version is running. The presence of opus-4-7 and sonnet-4-8 in the suppression list isn't just Anthropic protecting future announcements — it's an accidental disclosure that Anthropic has been running pre-release models on real public open source codebases, through real contributions, reviewed by real maintainers who had no idea they were participating in a capability evaluation for models they'd never heard of.

And this was the same week the world found out about Mythos — a model Anthropic described internally as "currently far ahead of any other AI model in cyber capabilities" and "presaging a wave of models that can exploit vulnerabilities in ways that far outpace defenders." They were privately briefing government officials about it. The suppression list in undercover.ts wasn't just hiding generic version strings. It was hiding what was running in the wild on open source repos. Maintaining the cover while the most capable model they'd ever built contributed code to projects whose maintainers had no idea who or what was making the commits.

To be fair: nobody's surprised Anthropic employees use pre-release models internally. That's just how software development works. Of course there's an opus-4-7 in the pipeline. Of course they're dogfooding it on real tasks. That part is table stakes.

And there were plenty of legitimate paths to external testing. A handful of developers at a friendly company, NDAs covering anything not in public release, "use it on your normal work and tell us what breaks" — that's the boring responsible version of exactly the same thing. Consented. Bounded. The developers know what they're running. Their employers have signed off. The feedback loop is explicit. Anthropic has the enterprise relationships to do this. They have legal teams. They have customers who would participate. The friction of doing it right is real but it's not prohibitive for a company at this scale.

The friction is actually the point. It's what separates "we're evaluating responsibly" from "we're optimizing our evaluation methodology at the expense of people who didn't consent to participate."

There's a spectrum of how you do this. Internal testing on internal codebases — normal, expected, no issue. Structured external testing with willing participants — some projects would say yes, some no, both answers legitimate. A controlled evaluation with disclosed arrangements — more complex but defensible. A handful of developers at partner companies under NDA — standard practice across the industry.

What they actually did is at the far end of that spectrum from all of those. Configure the tool to suppress AI attribution. Hardcode the suppression below the context window. Gate it to employees only. Run unreleased models through real contributions to real projects. Zero disclosure to the people whose review time you're consuming.

The maintainers who reviewed those PRs were donating their expertise to evaluate whether an unreleased commercial model produces code good enough to pass their standards. Without knowing that's what they were doing. Without any compensation or credit for that contribution to Anthropic's model development process. They thought they were reviewing a human's work. They were reviewing an unreleased model's work, configured to make sure they'd never know the difference.

The forbidden strings list is a concealment architecture. It's also a confession.

This is gated to Anthropic employees only. The flag is USER_TYPE === 'ant' — ant being Anthropic's internal employee designation. This wasn't built for general users. It was built specifically for Anthropic's own people contributing to open source repositories.

And the USER_TYPE === 'ant' gate is itself the tell. A general user has no Anthropic internal codenames to hide. That flag exists specifically because Anthropic employees were contributing to repos that would have rejected those contributions if the maintainers knew what they were. The PR disaster was baked in the moment that decision was made. The leak just detonated it ahead of schedule.

There is no force-off. The source comment is explicit: "There is NO force-OFF — if we're not confident we're in an internal repo, we stay undercover." You can enable it with an environment variable but you cannot disable it. In external builds the entire function is dead-code-eliminated — it can't be inspected, overridden, or turned off. One-way door.
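
A one-way door like that is easy to express in code. Here is a minimal sketch of the decision logic as I read it from the description above. The names UndercoverInputs, isUndercover, forceOnEnv, and confidentInternalRepo are hypothetical. Only the USER_TYPE === 'ant' gate and the "no force-OFF" rule come from the source, and the real function is surely more involved.

```typescript
// Hypothetical names; only the 'ant' gate and the no-force-off rule
// are taken from the post's description of the source.
interface UndercoverInputs {
  userType: string;               // 'ant' marks an Anthropic employee
  forceOnEnv: boolean;            // stands in for the enabling environment variable
  confidentInternalRepo: boolean; // result of whatever repo detection exists
}

function isUndercover(inputs: UndercoverInputs): boolean {
  if (inputs.userType !== "ant") return false; // feature only exists for employees
  if (inputs.forceOnEnv) return true;          // it can be forced on
  // Otherwise on by default unless the repo is confidently internal.
  // Note what is missing: there is no input that forces it off.
  return !inputs.confidentInternalRepo;
}

console.log(isUndercover({ userType: "ant", forceOnEnv: false, confidentInternalRepo: false })); // true
console.log(isUndercover({ userType: "ant", forceOnEnv: true, confidentInternalRepo: true }));   // true
console.log(isUndercover({ userType: "ant", forceOnEnv: false, confidentInternalRepo: true }));  // false
```

The asymmetry is the design: uncertainty resolves to concealment, and the only override points the same direction.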

"Write as a human developer would" is not the same as "hide internal codenames." Hiding codenames like Capybara or Tengu from commit messages is reasonable product security. Explicitly instructing the model never to include AI attribution and to write as a human developer would is a different thing. The first protects Anthropic's internal naming conventions. The second actively deceives the people reviewing the contributions.

Now here's the engineering comparison that should make you pause.

In the last post we talked about Summer Yue's safety instruction — "confirm before acting" — living in the prompt where context compaction could drop it. And did. The context window filled, the instruction got compressed away, and the agent deleted her inbox while she typed STOP in all caps.

The undercover.ts instructions live below the context window. Hardcoded architecture. One-way door. Dead-code-eliminated in external builds. No context compaction can touch it. No matter how long the session runs, how many files get processed, how full the context gets — "write as a human developer would, never include attribution" cannot be dropped.

One of those was implemented more robustly than the other. It wasn't the safety one.

One more thing worth saying carefully: the capability to instruct the model to present as human is demonstrated in the source. What sits between that capability and broader misuse is a feature flag and an environment variable. A betting person would note that Anthropic almost certainly isn't the only company that has thought about this problem and landed somewhere in the same vicinity. Claude Code is just the one whose source we can read.

What This Means For Open Source — And It's Complicated

Many open source projects have explicit AI disclosure policies. Some reject AI contributions entirely. These policies exist because maintainers want to make informed decisions about what they're accepting — about licensing, about code quality, about the nature of contributions to their project. That's their right. It's their project.

Anthropic built tooling specifically so their employees could contribute to those projects without triggering those policies. What that means in practice is messier than it first appears.

If your project has a "no AI" policy and you find undisclosed AI commits — now what?

Finding them is the first problem. Commit messages won't tell you — that's exactly what undercover.ts was designed to prevent. And without a reliable technical tell, you can't audit any contribution for AI involvement regardless of who made it. This isn't "could Anthropic employees have snuck AI commits past your policy." It's "could anyone." The answer is yes and has been yes for a while. The leak just made it impossible to pretend otherwise.
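
To show how thin the available signal is, here is a sketch of the kind of check a maintainer might run today. It looks for a co-author trailer naming an AI tool. The function name hasAiAttribution and the marker list are my assumptions, so adjust them to whatever your project actually sees.

```typescript
// Hypothetical maintainer-side check. The marker list is an assumption;
// extend it with whatever trailers your project actually sees.
const AI_TRAILER = /^co-authored-by:.*\b(claude|copilot)\b/im;

function hasAiAttribution(commitMessage: string): boolean {
  return AI_TRAILER.test(commitMessage);
}

// Same change, with and without the trailer.
const disclosed = "Fix cache race\n\nCo-Authored-By: Claude <noreply@anthropic.com>";
const suppressed = "Fix cache race";

console.log(hasAiAttribution(disclosed));  // true
console.log(hasAiAttribution(suppressed)); // false
```

The check only works when the tool volunteers the trailer. Strip the trailer and the two commits are indistinguishable to this kind of audit.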

And if you do find one — then what? The options are all bad. Revert it and the bug comes back, likely to be fixed the same way by a human contributor using the same tool, this time just not disclosing it. You've solved nothing except adding friction. Ban the contributor but let the code stand — which is an implicit admission that the code was fine, the problem was disclosure, which raises the question of what the policy is actually protecting. Accept it retroactively while maintaining the policy going forward — creating a two-tier codebase history that satisfies nobody.

If your project allows AI contributions — what did you just find out?

That the code you've been accepting may have been generated, reviewed, and committed with much less human oversight than you assumed. The tooling is designed to make the handoff invisible. The commit looks human. The message reads human. The attribution is human. Whether the decision-making behind it was human is now an open question for every contribution made with these tools. Not just from Anthropic employees. From anyone using any agent tool with similar architecture. Which, as established, is probably most of them.

What does it mean if you genuinely can't tell the difference?

Two possible readings. The optimistic one: the code is good, it passed review, the tool is doing what tools are supposed to do — amplify human capability. The pessimistic one: your review process is now validating outputs rather than evaluating decisions, and you may not know which one you're doing on any given PR.

Both are probably true simultaneously for different contributions. The problem is you can't tell which is which from the outside. And "no AI" policies were essentially a bet that you could tell. That bet is looking shakier by the day.

The option nobody wants to say out loud: write better policies. Not "no AI" as a binary but something more granular — disclosure requirements, human review thresholds, specific prohibited use cases. Which requires communities to agree on what they actually object to. Which requires the conversation most have been avoiding by treating "no AI" as self-evidently correct and enforceable. It was always the second part of that assumption that was going to cause problems.
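
As a thought experiment, here is what a more granular policy might look like if you wrote it down as data. This is a sketch, not a recommendation. The ContributionPolicy and Contribution shapes and the policyViolations function are all hypothetical, and the whole thing still leans on contributors declaring honestly.

```typescript
// Hypothetical policy shape covering the three ideas above:
// disclosure, human review thresholds, and prohibited areas.
interface ContributionPolicy {
  requireDisclosure: boolean;
  minHumanReviewers: number;        // applies to AI-assisted changes
  prohibitedPathPrefixes: string[]; // areas where AI-assisted changes are refused
}

interface Contribution {
  declaredAiAssisted?: boolean; // undefined means the contributor said nothing
  humanReviewers: number;
  touchedPaths: string[];
}

function policyViolations(policy: ContributionPolicy, c: Contribution): string[] {
  if (c.declaredAiAssisted === undefined) {
    return policy.requireDisclosure ? ["no AI-assistance declaration provided"] : [];
  }
  if (!c.declaredAiAssisted) return [];

  const violations: string[] = [];
  if (c.humanReviewers < policy.minHumanReviewers) {
    violations.push(`needs ${policy.minHumanReviewers} human reviewer(s), has ${c.humanReviewers}`);
  }
  for (const path of c.touchedPaths) {
    if (policy.prohibitedPathPrefixes.some((prefix) => path.startsWith(prefix))) {
      violations.push(`AI-assisted change touches prohibited path: ${path}`);
    }
  }
  return violations;
}

const examplePolicy: ContributionPolicy = {
  requireDisclosure: true,
  minHumanReviewers: 2,
  prohibitedPathPrefixes: ["crypto/"],
};
console.log(policyViolations(examplePolicy, {
  declaredAiAssisted: true,
  humanReviewers: 1,
  touchedPaths: ["crypto/aes.ts", "docs/readme.md"],
}));
```

Writing it out this way forces the questions a blanket ban lets you skip: which areas, how many reviewers, and what happens when nobody declares anything.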

As of this writing, no specific maintainer has publicly announced they found undisclosed AI commits from Anthropic employees and are taking action. That may change. What's already true regardless: the audit tools don't exist yet, the policies aren't ready for what the tooling can do, and the gap between "we have a no AI policy" and "we can enforce a no AI policy" is now visible in a way it wasn't before last week.

And then there's the question of what comes next. We know Anthropic did this because the source leaked. How many others made the same decision quietly and simply haven't been caught? And how many more will make it now — because the implementation is no longer a reverse-engineering problem? The undercover.ts architecture is public. The one-way door design, the hardcoded suppression below the context window, the dead-code-elimination in external builds — it's all documented, mirrored, and ported to multiple languages by developers who were awake at 4am racing the DMCA clock. Before the leak, building this required significant effort. After the leak, it's a reference implementation. Anyone building a commercial agent tool who wants to offer enterprise clients "anonymous contribution mode" — and enterprise clients will pay for that — now has a working blueprint. Anthropic's attempt to keep this quiet accidentally published the instructions for everyone else to do it less quietly.

The open source contribution model assumed you could trust the commit. That assumption is broken. Not because of what Anthropic did. Because of what anyone can do now.

The Part Where Lawyers Start Reading This Post

The open source governance crisis is real. The compliance crisis is potentially bigger.

SOC2, ISO 27001, EU AI Act, HIPAA, PCI-DSS, FDA software validation — every regulated industry with software change control requirements operates on a foundational assumption: code changes are traceable to authorized human decisions. Audit trails exist. Attribution is meaningful. When something goes wrong, you can identify who decided what and when.

"An AI wrote this but we configured it to say a human did" is not a compliant answer under any of those frameworks. It's not a close call. It's not a gray area. If your developers are using agent tools configured to suppress AI attribution — and those tools are touching codebases that fall under regulated change control — your audit trail is compromised and your auditor has a problem. So do you.

The EU AI Act is the most explicit: high-risk AI systems require human oversight, traceability, and transparency. If AI is making changes to systems in high-risk categories and those changes are attributed to humans, you've got an Act violation dressed up in a commit message. The fact that a human may have glanced at the diff before merging doesn't close that gap. Human oversight and human rubber-stamping are different things, and regulators are starting to understand the difference.

There's also a wrinkle in the DMCA situation that can't be making for a comfortable week in Anthropic's legal department. Anthropic is currently being sued for using copyrighted material to train Claude without permission. They're now using copyright law to protect code that is reportedly 90% AI-generated. In its March 2025 ruling in Thaler v. Perlmutter, the DC Circuit held that the Copyright Act requires a human author, so a work generated solely by AI does not qualify for copyright protection. If courts decide that significant portions of Claude Code lack sufficient human authorship, the DMCA notices Anthropic filed may rest on shakier legal ground than they'd like. The company that trains on copyrighted work without permission is now claiming copyright protection for AI-generated work. A court is going to have opinions about that.

Now extend the logic beyond compliance. If you can configure an agent to contribute to open source invisibly, you can configure one to read invisibly. Monitor a competitor's repo. Track development patterns. Watch for capability signals in commit history. That's legal. It's also undetectable with current tooling. The same architecture that hides authorship can hide readership.

And it goes further than reading. These models don't need to understand proprietary code the way a human does — they need to produce outputs that match. Give an agent with persistent memory, background operation, and attribution suppression access to a compiled binary, its public API surface, and enough iteration cycles — and "keep generating code until the behavior matches" is a tractable problem. Not instant. Not trivial. But tractable. Especially if the agent has access to the binary's outputs to test against. Especially if it can run in the background without user initiation.

Now add the multi-agent orchestration that's also in the leaked source. coordinatorMode.ts documents a coordinator that spawns and directs parallel worker agents — one working the authentication layer, one the data layer, one running behavioral equivalence tests against the original binary's outputs to validate matching, one consolidating findings into a shared scratchpad. "Parallelism is your superpower," the system prompt says. "Workers are async. Launch independent workers concurrently whenever possible." That's not a jailbreak. That's the documented intended architecture, pointed at a different target than the one it was marketed for.

It's the Borg solving a problem. Distributed, parallel, persistent, adaptive — and running whether you're watching or not.

The legal term for what humans do deliberately and document carefully is clean room reverse engineering — a legitimate process with a paper trail, decisions that can be attributed, people who can testify about what they knew and when. An agent doing the same thing with suppressed attribution, running coordinator and worker agents in parallel, consolidating memory across sessions so no single session contains the full picture, validating outputs against the original binary's behavior — that's clean room reverse engineering with the paper trail architected out of existence.

"We didn't know the agent was doing that" doesn't survive contact with a discovery process that includes the agent's session logs. Assuming those logs weren't pruned by AutoDream first.

And in a deep operation — the kind that runs across multiple sessions, iterating toward behavioral equivalence — they almost certainly would be. That's not speculation. That's AutoDream's documented purpose: prune what's "overly verbose or newly outdated," consolidate intermediate steps into clean synthesized memories, remove near-duplicates, collapse the iteration history into durable summaries. By the time the operation completes, the failed approaches are gone. The test comparisons are gone. The session-by-session progress is gone. What remains is a memory file that says "here's what we know about this system" and a repo with code in it.

"We don't know where this came from. It just appeared."

Nobody lied. Nobody deleted anything. The system did exactly what it was documented to do. AutoDream ran its reflective pass. The intermediate work was pruned as part of normal memory management. The final artifact exists. The provenance doesn't. Whether a court treats that as acceptable record management or spoliation by architectural design is an open and very expensive question that nobody has litigated yet.

But someone will.

We know Anthropic built undercover.ts. We know the architecture is now public. We know the incentives to build similar tools exist independently across the industry. We don't know what's already running in environments where nobody's looking — offline labs, classified infrastructure, competitive intelligence operations, defense contractor environments where several of these companies have active relationships.

The government that sued Anthropic argued they couldn't trust that AI behavior wouldn't be altered invisibly during active operations. Anthropic disputed it. The leak showed invisible behavior modification is a deliberate architectural feature. Those two facts are now in the same room together.

The room doesn't have the lights on yet. And the architecture for everything described above is now public, permanent, and available to anyone who wants to build it.

The Leak Didn't Expose A Bad Actor. It Exposed An Industry.

The data collection is mostly duh. The background agent and memory system are table stakes for anyone building seriously in this space — engineering solutions to real product problems that every team is solving. KAIROS and AutoDream aren't surprising. They're inevitable given the direction everything is moving.

The undercover.ts thing is different. That's not an architectural inevitability. That's a choice. Someone thought about the problem of open source contribution policies, scoped a solution, built it, shipped it behind a feature flag, and made sure it couldn't be accidentally turned off. That's a product decision with sign-off.

And the honest read is that they're probably not alone. Every team building a serious agent tool whose employees contribute to open source has faced the same decision tree. The ones who made the same call just haven't had their source accidentally shipped to npm yet — or accidentally taken down 8,100 repos trying to clean it up. We can't read everyone else's undercover.ts. That doesn't mean it isn't there.

The Claude Code leak didn't expose a bad actor. It exposed a template. Anthropic just happened to be holding the map file when the build shipped.

Two leaks in five days. Three mistakes in one incident. The source is permanently public. The DMCA is mostly retracted. The undercover system is documented. The architecture for what comes next is now a reference implementation anyone can build from.

Do not blow your cover.

Except someone already did. And handed everyone else the blueprints on the way out.

Find me on Mastodon at @ppb1701@ppb.social. The thread, as always, keeps not running out.
