
Vibe Coding Was Round One. Agentic Engineering Is What Ships.

Notes on Andrej Karpathy's Sequoia AI Ascent 2026 talk — December as the inflection, Software 3.0, jagged intelligence, and the discipline that replaces vibe coding when the code has to survive contact with production.

[Diagram: "The December Inflection." Before Dec 2025, the code split is mostly "mine," some "agent"; after Dec 2025 the proportions invert. Caption: "I can't remember the last time I corrected it."]

At Sequoia's AI Ascent 2026, Stephanie Zhan opened her interview with Andrej Karpathy by quoting something he'd said on X a few weeks earlier: that he'd never felt more behind as a programmer. The talk that followed walks through why — and lays out the discipline that's replacing the term he himself coined a year ago. It's the clearest framing of where AI-assisted software engineering actually is in 2026, so we recommend watching it before reading anything else (including this).

Andrej Karpathy in conversation with Stephanie Zhan, Sequoia AI Ascent 2026.

December was the inflection

Karpathy had been using agentic coding tools for about a year. They produced "chunks of code" that were sometimes good, sometimes wrong, and often needed editing. Useful, but not load-bearing. Then December 2025 hit. He had time off. He sat down with the latest models.

"The chunks just came out fine. Then I kept asking for more and they still came out fine. I can't remember the last time I corrected it. I just trusted the system more and more."

That's the sentence the entire talk pivots on. He doesn't claim a smooth ramp; he calls it a stark transition. The coherent agentic workflow, the thing many of us had been half-believing in, started actually working. He went from correcting agents to delegating to them, and his side-projects folder "bloated with random things" as a result.

He's clear that a lot of people missed it. Most people's mental model of AI is still "the ChatGPT thing from last year" — and that mental model is wrong as of December. If you haven't re-evaluated since, you're calibrated to a regime that no longer exists.

Software 3.0: programming the LLM

Karpathy's frame for what changed is the same one he's been sharpening for a couple of years: three eras of software.

  • Software 1.0: humans write explicit code.
  • Software 2.0: humans curate datasets and train neural-network weights — the "program" is the dataset and the architecture.
  • Software 3.0: the model is the interpreter, and the context window is the program. "Your programming now turns to prompting, and what's in the context window is your lever over the interpreter that is the LLM."
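
In code, the shift looks something like this. A minimal sketch, assuming a hypothetical `llm()` helper that stands in for whatever chat-completion call you use; nothing here is a real API:

```python
# Software 1.0: the behavior lives in explicit code a human wrote.
def summarize_v1(text: str) -> str:
    sentences = text.split(". ")
    return sentences[0]  # crude hand-written heuristic: first sentence wins

# Software 3.0: the behavior lives in the context window.
def summarize_v3(text: str) -> str:
    # The prompt is the program; the LLM is the interpreter that runs it.
    prompt = f"Summarize the following in one sentence:\n\n{text}"
    return llm(prompt)  # `llm` = hypothetical stand-in for any chat API
```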

He grounds it with two concrete examples. The first: installing OpenClaw. Historically that's a bash script — it has to balloon to handle every platform and every machine quirk. The Software 3.0 version is a block of text you copy-paste to your agent. The agent reads your environment, debugs in a loop, and gets it installed. The program isn't the script anymore; it's the prompt.

The second is sharper. He built MenuGen — an app that takes a photo of a restaurant menu, OCRs the items, runs an image generator, and renders a version of the menu with pictures. Then he tried the Software 3.0 version: hand the photo to Gemini, say "use Nano Banana to overlay the items onto the menu," and the model returns the rendered menu directly. "All of my menu gen is spurious. It's working in the old paradigm. That app shouldn't exist."
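
The collapse is easiest to see as two function signatures. A hedged sketch, with `ocr`, `generate_image`, `render_menu`, and `multimodal_llm` as hypothetical stand-ins for the real services:

```python
# The 2024 shape: MenuGen as a hand-built pipeline. Every stage is your code,
# your error handling, your glue.
def menugen_v1(menu_photo: bytes) -> bytes:
    items = ocr(menu_photo)                            # extract dish names
    images = [generate_image(name) for name in items]  # one picture per dish
    return render_menu(items, images)                  # compose the final page

# The Software 3.0 shape: one multimodal call. The model owns the pipeline.
def menugen_v3(menu_photo: bytes) -> bytes:
    return multimodal_llm(
        "Generate a picture of each dish and overlay it onto this menu.",
        image=menu_photo,
    )
```

Everything `menugen_v1` does by hand becomes latent behavior of the model in `menugen_v3`, which is exactly why Karpathy says the app shouldn't exist.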

The lesson isn't "everything is faster now." It's that whole categories of software you might have written in 2024 are now just a prompt to a model that does the work end-to-end. General information processing — not just code — is becoming automatable. If you're building, the more interesting question isn't "how do I speed up what already exists?" but "what couldn't exist before that can now?"

Jagged intelligence

The constraint that defines the era, in Karpathy's framing, is that frontier models are jagged entities. They excel in some domains and stumble badly in others, and the shape of the jaggedness is hard to predict from the outside.

His current favorite example:

"I want to go to a car wash to wash my car and it's 50 meters away. Should I drive or should I walk? And state-of-the-art models today will tell you to walk because it's so close."

The same model can refactor a 100,000-line codebase or find zero-day vulnerabilities, and then fail at this. "This is insane."

It's not random. The shape comes from how the labs train: huge reinforcement-learning environments around verifiable signals, with capability concentrating wherever the labs put attention and data. He gives the chess example — GPT-3.5 to GPT-4 saw a step change in chess because chess data made it into pre-training, not because of some general lift. "You're slightly at the mercy of whatever the labs are doing, whatever they happen to put into the mix."

Practically: every system you build sits on a tool with no manual. Some circuits fly; others struggle. You have to figure out which circuits your application is in. If you're out of distribution, you don't pray — you fine-tune, or you hold the bar yourself in the loop.

Verifiability is the law

The deepest line in the talk, and one Karpathy has been writing about on his blog:

"Traditional computers can easily automate what you can specify in code. LLMs can easily automate what you can verify."

That's why code, math, tests, benchmarks, and games are racing ahead — they're cheap to verify, so labs can build dense RL environments around them. It's also why writing, taste, and open-ended judgment lag: harder to verify automatically, fewer environments, less RL pressure.
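
To see why cheap verification turns into capability, here's the shape of such an environment. A minimal sketch using only Python's standard library; the training loop itself is schematic, not a real trainer API:

```python
import subprocess

def verify(candidate_code: str, tests: str) -> float:
    """Grade a coding attempt by running it against its tests.
    Cheap, automatic, objective: the properties an RL signal needs."""
    try:
        result = subprocess.run(
            ["python", "-c", candidate_code + "\n" + tests],
            capture_output=True,
            timeout=10,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hung attempts fail verification
    return 1.0 if result.returncode == 0 else 0.0

# Schematic RL loop: sample attempts, verify, reinforce what passes.
# No human grader anywhere, so it scales to millions of episodes.
#
# for prompt, tests in coding_environment:
#     attempt = model.sample(prompt)
#     model.update(attempt, reward=verify(attempt, tests))
```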

For founders, this is a lever, not a description. If you can construct a verifiable environment for a valuable problem the labs haven't absorbed, that environment is the moat — and the same fine-tuning machinery that lifted code can lift your domain. Karpathy was careful not to name examples on stage ("I don't want to vibe-post on the stage"), but he believes some of the most valuable verifiable environments are still unbuilt.

The corollary, when pushed by Zhan: "ultimately almost everything can be made verifiable to some extent" — even soft outputs like writing, via councils of LLM judges. It's a question of how hard the verification harness is to construct, not whether it can be constructed at all.
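
Mechanically, a judge council might look like this. A hedged sketch in which `ask()` and the judge names are hypothetical, and a real harness would need far more care about rubric design and judge bias:

```python
JUDGES = ["judge-model-a", "judge-model-b", "judge-model-c"]  # hypothetical names

def verify_soft_output(text: str, rubric: str) -> bool:
    """Turn an unverifiable output into a (noisy) verifiable one:
    a majority vote across independent LLM judges."""
    votes = 0
    for model in JUDGES:
        verdict = ask(  # `ask` = hypothetical chat-completion helper
            model,
            f"Rubric:\n{rubric}\n\nCandidate:\n{text}\n\n"
            "Does the candidate satisfy the rubric? Answer PASS or FAIL.",
        )
        votes += verdict.strip().upper().startswith("PASS")
    return votes > len(JUDGES) // 2
```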

Vibe coding vs. agentic engineering

This is where Karpathy retires his own term.

"Vibe coding is about raising the floor for everyone in terms of what they can do in software. The floor rises, everyone can vibe-code anything, and that's amazing, incredible."

A founder prototypes in a weekend. A non-engineer ships an internal tool. The barrier to producing something that runs is lower than it's ever been.

Agentic engineering is the discipline on the other side:

"Preserving the quality bar of what existed before in professional software. You're not allowed to introduce vulnerabilities due to vibe coding. You're still responsible for your software just as before, but can you go faster?"

The job, in his words, is "coordinating these spiky entities — a bit fallible, a little stochastic, but extremely powerful — to go faster without sacrificing your quality bar, and doing that well and correctly." He thinks the ceiling for an excellent agentic engineer is far above the old "10x engineer" framing. 10x is no longer the right unit for the speedup.

He also flags a hiring gap: most companies haven't updated their process for this. Brain-teaser puzzles measure the wrong thing. His suggestion: hand a candidate a real-sized project — "build a Twitter clone for agents, make it secure, then I'll point ten Codex 5.4x-high instances at it and try to break it." That's the assessment that maps to the actual work.

What stays human

Karpathy is direct about where the human is still load-bearing. It's not typing. It's the spec, the taste, and the understanding.

The spec story comes with a great example. In MenuGen, users sign up with Google but pay with Stripe. The Stripe email and the Google email can differ. Karpathy's agent, left to its own devices, tried to correlate purchases by matching email strings, and silently failed when they didn't match. "A human needs enough product and engineering judgment to insist on persistent user IDs." The agent is fast at fill-in-the-blanks; it doesn't catch that the blank was the wrong shape.
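
The bug and the fix fit in a few lines. A sketch with hypothetical record shapes; the only point is where identity lives:

```python
# What the agent wrote: identity keyed on email strings.
def purchases_for_buggy(google_email: str, stripe_records: list[dict]) -> list[dict]:
    # Silently returns [] whenever the Stripe email differs from the Google one.
    return [r for r in stripe_records if r["email"] == google_email]

# What the human has to insist on: mint a persistent ID at signup and thread
# it through checkout, so identity never depends on two emails matching.
def purchases_for(user_id: str, stripe_records: list[dict]) -> list[dict]:
    return [r for r in stripe_records if r.get("app_user_id") == user_id]
```

In a real Stripe integration, the checkout session's `client_reference_id` field is the natural place to thread that ID through.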

Taste is the ceiling on the work. He admits the code agents produce right now "sometimes gives me a heart attack — it's bloaty, copy-paste, awkward abstractions, brittle. It works but it's gross." Models will probably get better here once labs add the right reward signal, but for now you're the aesthetic judge, the one who knows when something is too clever, too coupled, or just wrong-feeling.

And then the line he closed on, attributing it to a tweet that had been bouncing around his head:

"You can outsource your thinking, but you can't outsource your understanding."

He framed his own bottleneck as exactly this — information has to make it into his brain, because he still has to direct the agents. You can't be a good director without understanding the thing you're directing. That's why he's investing in personal LLM-powered knowledge bases — they're tools to deepen his understanding, not replace it.