Smaller Batches Sharper Eyes - ByteHaven - Where I ramble about bytes

Part of the ongoing Big Tech's War on Users series.

Eric Brandwine, VP and distinguished engineer at Amazon Security, told The Register something this week (later picked up by The Next Web) that should be obvious and somehow still isn't in most boardrooms: humans are bad at watching things. Not bad at judgment. Bad at the specific, narrow task of staring at a stream of low-variance events for hours and reliably catching the one that matters.

He calls it "normalization of deviance" — a term he's been talking about since a 2017 re:Invent talk, long before agentic AI made it relevant again. The illustration is an ER nurse. First day on the job, every alarm gets a full response. Weeks later, after enough false alarms with no consequences, the discipline erodes. Eventually a real one gets missed, and it's not because anyone stopped caring. It's because vigilance isn't a renewable resource you can demand indefinitely from a person and expect full output every single time.

I want to push on that ER story for a second, because I've seen it from the inside. My wife worked as a nurse, and I spent time on the administrative side of a doctor's office, so I'll add the part that tends to get left out of the tidy "normalization of deviance" framing: most of these floors, and a lot of assisted living facilities too, are chronically understaffed and run ragged. The expectation isn't just "catch every emergency." It's catch every emergency, while also staying warm and patient with people who are in pain, scared, disoriented, and losing their own patience by the minute.

And it's nothing like what you see on TV. Doctors and nurses aren't standing around between dramatic moments — they're hopping from patient to patient, with a waiting room packed full of people still to be seen and most of their beds already full. They aren't making you wait for the hell of it. The reason it feels like a whirlwind when you're the one in the bed is because it genuinely is one. They are swamped, frequently understaffed, and still expected not to miss anything.

Discipline doesn't erode because people stop caring. It erodes because the job is demanding sustained, perfect vigilance from someone already stretched to their limit. Brandwine's alarm story is true. It's also only telling half of why it's true — the other half is staffing levels and workload, not just human psychology in a vacuum.

Brandwine's broader point still holds, though: this is exactly what happens when you ask a human to approve or reject AI agent actions all day. "They'll do a good job," he said. "And then they'll do an okay job, and pretty quickly they'll be doing a poor job." Human-in-the-loop isn't a safety mechanism. It's a box checked by people who haven't thought hard about what they're asking a human brain to actually sustain.

I've said versions of this online for a while now: AI can be genuinely useful, but it needs checking. So does a junior dev — "junior" being relative to the project and the task, not just a job title. And to be clear, this isn't a post about whether AI is good or bad — that's not what's relevant here. This is about human nature — the same human nature that's been causing these exact failures in ERs, behind the wheel, and in code reviews long before anyone trained a language model. Neither AI nor the junior dev is the problem. The problem is what we're asking the checker to do, and whether the structure of the work makes that checking possible at all.

Same Failure, Different Lane

This is the part that clicked for me immediately, because it's not a new problem dressed up in AI clothing — it's the exact same finding the NTSB has been issuing about partial self-driving systems for years.

The NTSB's investigation into a fatal Tesla Autopilot crash found that the system's "ineffective monitoring of driver engagement" was a contributing factor — not the only cause, but a consistent thread across multiple crashes the board has reviewed. The driver is told, explicitly, that they're responsible for paying attention at all times. The system does almost all the actual driving. That gap between "responsible for" and "actually doing" is precisely the vigilance decay Brandwine is describing, just with a steering wheel instead of an approve button.

And here's the thing — at least pilots are trained specifically for automation handoff scenarios. There are checklists, simulators, and entire FAA certification requirements built around the moment the autopilot hands control back to a human. The average person commuting on the highway has none of that. They're a regular driver, in a consumer car, being asked to snap from passenger-mode back to full alertness in a fraction of a second when something goes wrong. Partial automation is arguably the worst possible design point for exactly that reason. Full manual keeps you engaged because it has to. Full autonomy removes you from the loop entirely. The middle ground — handles itself 95% of the time, needs you instantly for the other 5% — is the hardest human-factors problem of the three, handed to the least prepared person to solve it.

And there's real-world evidence of exactly how that plays out. Videos online show it, and NHTSA crash investigations have documented it: a driver sits and watches a collision unfold without ever hitting the brake or turning the wheel. Not because they were unconscious. Because the car was "driving itself," and somewhere in their brain the message that they needed to take over never arrived. In several of those cases, the crash scene was visible for an average of eight seconds before impact. Eight seconds might not be enough to avoid a crash entirely, but it's enough to hit the brakes, scrub some speed, and turn what could be a fatal impact into something survivable. It's nothing, though, if your brain has already handed the wheel to the machine and never got the memo to take it back.

Same Failure, Different Diff

Now swap the driver's seat for a pull request.

You're reviewing a change that touches thirty files. Most of that diff is noise — someone ran a reformat across the codebase, Ctrl+K, D in half a dozen files, whitespace and brace placement shifting in ways that are visually identical to a hundred other harmless changes you've already approved this week. Mixed in, somewhere, is the actual logic edit. Maybe two lines. Maybe a renamed parameter that quietly changes behavior three call sites away.

Your eyes are doing exactly what Brandwine described in an ER. The first few diffs you read carefully. By file twenty, you're pattern-matching: green, red, looks fine, looks fine, looks fine. If it builds and the tests are green, it gets stamped. Not because you're careless — because the signal-to-noise ratio in that diff actively works against the kind of sustained, line-by-line attention a real review requires.

And same as the ER nurse, the reviewer isn't doing this in a vacuum with nothing else on their plate. They have their own ticket to get back to, a meeting in twenty minutes, and BAs, QA, other devs, and sometimes end users pinging them in the middle of it. The review is a context switch (stopping what you're doing, loading a completely different problem into your head, then trying to pick your original work back up afterward) wedged into a day already full of context switches, on a section of the codebase they may not have touched in months. That's not a focus problem you fix by asking them to concentrate harder. It's a workload and scheduling problem wearing a focus costume.

This is true whether the PR came from a human teammate, a junior dev still learning the codebase, or an AI agent. None of those sources are inherently more or less trustworthy in the abstract. Both an AI and a junior dev can also lean too hard on what they "learned" from Stack Overflow, Reddit, or a quick internet search — and that source might be outdated, missing context, or just flat wrong for your specific situation. What's different is whether they hand you a change you can actually evaluate, or a wall of text you're statistically guaranteed to skim. An AI agent or a junior dev can both miss that a change to method xyz quietly breaks three other call sites buried somewhere else in the UI layer — not from malice or stupidity, but because neither one held the whole system in their head at the moment of the edit. The reviewer's job is to catch exactly that. The reviewer's job also gets harder, not easier, every time the diff gets noisier.

And before someone says "well just let AI review it" — you're potentially trading one side of the problem for the other. AI context windows are real and have limits; throw a large enough codebase at it and things fall off the edges. It's also bounded by its training and by whatever it can find online or in the codebase — and only then if the prompt or logic path actually tells it to go look. It won't know about that architectural decision your team made in a meeting six months ago that never made it into a comment or a doc. Neither will the junior dev, for that matter. Which is exactly why the answer isn't a different set of eyes doing the same flawed review — it's making the change itself smaller and easier to reason about in the first place.

There's another angle that tends to get glossed over too. Technically, the person who wrote the prompt or submitted the AI's change should own that change — they kicked it off, it's on them. That's fair in principle. But in practice, that person is staring at the same wall of output as the reviewer. It builds, it looks roughly right, submit. Same problem, different seat. And that's before you factor in that some teams are now letting agents operate more independently — submitting their own commits, opening their own PRs, with a human somewhere upstream who technically "started it." Flipping the accountability entirely to whoever wrote the initial prompt doesn't fully solve anything when the volume and complexity of the output makes real review just as hard on that end too.

What Actually Works Is Boring

The fix isn't more discipline. Telling people to "just pay closer attention" is the same advice that has failed in every ER, every car with a half-baked autopilot, and every code review for as long as these problems have existed. What actually moves the needle is changing the shape of the task so it fits what human attention can sustain.

Smaller, scoped change sets. A PR with one logical change is reviewable in the way a PR with thirty files and six unrelated reformats never will be. Separate the mechanical changes from the substantive ones — if a tool reformatted a file, that's its own commit, not buried inside the commit that also changes behavior. You don't need superhuman vigilance to catch a two-line logic change in isolation. You need it to survive a thirty-file haystack, which is a different and much harder problem that no amount of "be more careful" solves.
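
To make "separate the mechanical from the substantive" a bit more concrete, here's a rough sketch of a pre-review check that sorts changed files into "formatting only" and "needs real eyes." Everything in it is my own illustration: the function names, the file names, and the sample contents are hypothetical, and the whitespace-stripping comparison is a crude heuristic. It will misjudge languages where whitespace is significant and edits inside string literals, so treat it as a starting point, not a tool.

```python
def _strip_whitespace(text: str) -> str:
    """Drop every whitespace character so only the tokens remain."""
    return "".join(text.split())


def is_formatting_only(old: str, new: str) -> bool:
    """Heuristic: the two versions differ only in whitespace or line breaks."""
    return old != new and _strip_whitespace(old) == _strip_whitespace(new)


def partition_changes(
    changes: dict[str, tuple[str, str]],
) -> tuple[list[str], list[str]]:
    """Split {path: (old_text, new_text)} into (mechanical, substantive) paths."""
    mechanical: list[str] = []
    substantive: list[str] = []
    for path, (old, new) in sorted(changes.items()):
        if old == new:
            continue  # unchanged file, nothing to review
        if is_formatting_only(old, new):
            mechanical.append(path)
        else:
            substantive.append(path)
    return mechanical, substantive


# Hypothetical change set: one brace-placement reformat, one real logic edit.
changes = {
    "Billing.cs": ("if (x) {\n    Pay();\n}", "if (x)\n{\n    Pay();\n}"),
    "Orders.cs": ("total = price * qty;", "total = price + qty;"),
}

mechanical, substantive = partition_changes(changes)
print(mechanical)   # ['Billing.cs']
print(substantive)  # ['Orders.cs']
```

The point isn't the script. It's that the mechanical pile can go in its own commit and get a quick glance, which leaves the reviewer's actual attention for the short list that changes behavior.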

Amazon's own answer points the same direction, even if they don't frame it that way. Instead of leaning on a human to catch every bad agent action in real time, they scope what the agent can do in the first place — static guardrails, a maximum privilege ceiling per agent, dynamically narrowed permissions for the specific task. Every agent action is also tied to an identity, so the activity log reads "this agent did this on behalf of Eric," not just "Eric did this." The human still owns the outcome. But the system isn't pretending a person watching a dashboard all day is the thing standing between a working production database and a deleted one.
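
For the code-minded, here's a minimal sketch of what that layering could look like. To be loud about it: this is not Amazon's implementation and not any real API. The permission strings, `AgentAction`, `effective_permissions`, and `authorize` are all names I made up to illustrate the idea of a static deny list, a per-agent ceiling, and a per-task narrowing, plus a log line that records both the agent and the human it acted for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentAction:
    agent_id: str       # which agent acted (hypothetical identifier)
    on_behalf_of: str   # the human who owns the outcome
    permission: str     # the permission the action needs


def effective_permissions(
    guardrail_denied: frozenset[str],
    agent_ceiling: frozenset[str],
    task_requested: frozenset[str],
) -> frozenset[str]:
    """What the task asked for, capped by the agent's ceiling, minus hard denials."""
    return (task_requested & agent_ceiling) - guardrail_denied


def authorize(action: AgentAction, allowed: frozenset[str]) -> str:
    """Return a log line attributing the decision to both agent and human."""
    verdict = "ALLOW" if action.permission in allowed else "DENY"
    return (
        f"{verdict}: agent {action.agent_id} on behalf of "
        f"{action.on_behalf_of} -> {action.permission}"
    )


# Hypothetical setup: dropping a database is never allowed, for any agent.
GUARDRAIL_DENIED = frozenset({"db:drop"})
AGENT_CEILING = frozenset({"db:read", "db:write", "db:drop"})
task_requested = frozenset({"db:read", "s3:read"})

allowed = effective_permissions(GUARDRAIL_DENIED, AGENT_CEILING, task_requested)
print(sorted(allowed))  # ['db:read']
print(authorize(AgentAction("agent-7", "Eric", "db:read"), allowed))
# ALLOW: agent agent-7 on behalf of Eric -> db:read
print(authorize(AgentAction("agent-7", "Eric", "db:drop"), allowed))
# DENY: agent agent-7 on behalf of Eric -> db:drop
```

Notice what's missing: a person clicking approve. The dangerous action is refused by the structure of the permissions, and the log still says who the agent was working for.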

That's the actual lesson underneath all three examples. Whether it's a self-driving system, an AI coding agent, or a junior dev's first big feature branch, the question isn't "did a human look at it." It's whether the human had a realistic chance of catching the problem given what they were actually shown, for how long, and how often. Ask someone to do that job in thirty-file batches with the noise mixed in, and you'll get a stamp, not a review. Ask them to do it in small, isolated, well-bounded pieces, and they'll actually catch the thing that matters.

Human-in-the-loop was never the safety net it was marketed as. It was always a question of whether the loop was designed for a human to actually function in.

Postscript: And if you want to know how deep this rabbit hole goes — New Scientist reported today that multiple whistleblowers have come forward to say the workers being paid to supply the high-quality human conversations used to train the next generation of AI models are just using chatbots to do it. Experts are calling it "AI inbreeding" and warning it could cause future models to "collapse." So to recap: AI generating code, AI reviewing code, and now AI generating the training data that's supposed to make AI better. The human in the loop didn't just stop paying attention. In some cases, they've been quietly replaced entirely or decided to outsource their own work — and nobody noticed because nobody was really watching.

Find me on Mastodon at @ppb1701@ppb.social
