Dear Agent: Prove it.
Count the number of r's in strawberry. Actually no, prove it instead.
At some point it became possible to write the following prompt, and any frontier LLM will give you a correct answer.
You'll get a program like this, or some variation of it. We'll walk through it in just a minute.
/-- Count occurrences of a character in a list. -/
def countCharList (c : Char) : List Char → Nat
  | [] => 0
  | x :: xs => (if x = c then 1 else 0) + countCharList c xs

/-- Count occurrences of a character in a string. -/
def countChar (c : Char) (s : String) : Nat :=
  countCharList c s.toList
/-- Correctness proof: our function matches `List.count`. -/
theorem countChar_is_correct (c : Char) (s : String) :
    countChar c s = s.toList.count c := by
  unfold countChar
  induction s.toList with
  | nil =>
    simp [countCharList]
  | cons x xs ih =>
    simp [countCharList, ih]
    by_cases h : x = c <;> simp [h]
At first we laughed at LLMs that couldn't even count the letters in a word. Now we can reliably ask one to write a Python program that counts the occurrences of a character in a string, and it'll nail it.
We still have to check that the Python code looks good, maybe we'll write some tests.
Above, however, we have The Ultimate program for this task. It's not just a Python program we need to check for correctness. It's a program that is proven correct for all inputs it accepts.
Once and for all. No tests needed. Done.
We now know we can write (or generate) a correct program to count the number of characters in a string, and we don't need to do that ever again. How incredible is that?
Formal Verification: Not a Primer
Our character-counting program is written in Lean, a language for doing Formal Verification.
We'll walk through the main bits so that you, the working programmer, have a chance at the "ah-ha" moment. Know that the field is deep, and those who venture deep into Formal Verification mostly forget to come back and explain it to the rest of us.
In Lean you can write proofs for your programs with respect to specifications. Specifications describe how a program must behave, and good specifications obviate the need for tests, because they cover all possible inputs and guarantee behavior over all outputs.
Writing good specifications used to be very tedious and hard. Now they're just hard, thanks to LLMs.
Make Agents Prove
On to the example. Here's the character-counting function in Lean:
/-- Count occurrences of a character in a list. -/
def countCharList (c : Char) : List Char → Nat
  | [] => 0
  | x :: xs => (if x = c then 1 else 0) + countCharList c xs

/-- Count occurrences of character `c` in string `s`. -/
def countChar (c : Char) (s : String) : Nat :=
  countCharList c s.toList
Don't worry about the functional style here. It's not a big deal: we wrote a function to count characters, and you could do the same in Python.
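You can sanity-check the definition immediately with `#eval`, Lean's built-in evaluator (assuming the definitions above are in scope):

```lean
-- Evaluate the function on a concrete input.
#eval countChar 'r' "strawberry"  -- 3
```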
So we have a character c, a string s, and our function countChar. This function is the thing we are going to prove is correct, that it actually counts the number of times c is in s.
We'll put our specification and proof in the theorem part of Lean, and this is where things get interesting. You can think of theorem as a programmatic construct that lives next to your code, and says things about your code. We'll call our theorem countChar_is_correct.
There are multiple ways to express the claim that countChar counts characters. For simplicity, we'll use Lean's built-in s.toList.count c as our reference implementation: it ships with the standard library, and we're going to trust its definition of counting. The proof obligation is then to show that our countChar agrees with this reference on every input. Here is the proof setup expressed in Lean code, where theorem gets used:
theorem countChar_is_correct (c : Char) (s : String) :
    countChar c s = s.toList.count c := by ...
If you've used TypeScript, you'll know you can assign types like number and string to variables. Our theorem has a type too. Here it is, highlighted:
countChar c s = s.toList.count c
Something Super Not Normal is happening here. This whole thing is a type. That = demands you construct a proof that both sides are equivalent. And the type references concrete parameters c and s freely. It's just allowed to do that. The type can use equivalences, parameters, functions, etc. to express provable claims about your program.
So that's the ah-ha: Lean gives you this special language where you can express these specifications; a kind of sidecar to your implementation. Those specifications are mathematical propositions, and represented as types. If this idea is new to you, hopefully you're now realizing this is not normal. countChar c s = s.toList.count c has mathematical meaning that Lean understands, and rules it can follow to prove it true.
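To see propositions-as-types in miniature, here is the smallest possible instance, independent of our counting example:

```lean
-- The proposition `2 + 2 = 4` is a type. A proof is a term of that
-- type; `rfl` (reflexivity) is such a term, because both sides
-- compute to the same value.
example : 2 + 2 = 4 := rfl

-- A false proposition is a type with no terms: there is nothing you
-- can write after `:=` that Lean will accept.
-- example : 2 + 2 = 5 := rfl  -- rejected by Lean
```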
You can literally go and write down: Okay, so I have this function I wrote that takes s and c, and if I give that s and c to another function that exists, I want to prove that they do the same thing.
What! [2]
The proof itself is the part after := by ...
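For a taste of what goes after `:= by`, here's about the smallest induction proof there is, on naturals rather than lists (the theorem name is mine; our counting proof has the same skeleton):

```lean
-- `0 + n = n` is not true by mere computation (Nat addition recurses
-- on its second argument), so we prove it by induction on n.
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl                         -- base case: 0 + 0 = 0
  | succ k ih => rw [Nat.add_succ, ih]  -- step: reuse the claim for k
```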
I'm going to gloss a bit over this proof implementation, because it's less instructive (do ask your favorite agent to explain it to you). In essence, Lean provides constructs, like proof by induction, to reduce complex proofs into simpler steps that can be mechanically verified. The point is...
It is correct if the agent succeeds
Once we've written a specification (the required program behavior), it doesn't matter what an LLM generates: either it produces a program that provably meets the specification, or it fails.
We can ask an agent to try to prove something without caring about the proof language or mechanisms that Lean uses.
For example, you can express Fermat's Last Theorem in Lean, without knowing whether a proof exists (sorry is a placeholder until we can prove the theorem).
theorem fermats_last_theorem :
    ∀ n : Nat, n ≥ 3 → ∀ a b c : Nat, a > 0 → b > 0 → c > 0 → a ^ n + b ^ n ≠ c ^ n := by
  sorry
Mathematical intuition will go a long way toward steering specs, just as good software engineering intuition steers good software specifications. What's beautiful is that you can write a claim down, send it, steer it. That's what the example shows off: we are challenged to write a function that matches a known reference implementation, and Lean tells us whether our implementation is correct. That proof is independently verifiable.
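"Independently verifiable" is meant literally: Lean's kernel rechecks every proof, and you can ask what a proof ultimately rests on (assuming the theorem above compiled under the name `countChar_is_correct`):

```lean
-- Lists the axioms the proof depends on. A proof containing `sorry`
-- would report `sorryAx` here; a clean proof reports only standard
-- axioms, or none at all.
#print axioms countChar_is_correct
```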
Do you realize how crazy it is that LLMs can explore proof-writing sufficiently well for non-trivial properties now? Frontier models today one-shot the example counting program.
Of course, specifications can be wrong. Or underspecified and weak. For example, we could write a specification that the count of characters is always zero or more. That's provable, but not really useful. Specification quality is the hard part. [3] Writing specifications that are both correct and complete requires deep understanding of the problem domain; this is where expertise matters.
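Here's that weak specification spelled out (the theorem name is mine):

```lean
-- Provable, but vacuous: it holds for *any* function returning a
-- Nat, so it rules nothing out about counting behavior.
theorem countChar_nonneg (c : Char) (s : String) :
    0 ≤ countChar c s :=
  Nat.zero_le _
```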
You're already thinking it, so I don't really have to say it. What if agents became expert and wrote the specifications? Indeed. That's still difficult, but surely they'll get better? Indeed. And we can have them implement the functions as well, and prove it. Indeed, why not? And then...
The case for proofs
So you can see what I see: a path where a nondeterministic LLM produces not just operational code, but provably correct operational code. There is effectively no stronger guarantee we can impose on (decidable) software than demanding a proof alongside its implementation.
Formal verification is a hardcore approach. It is difficult, and it is the holy grail of software engineering. LLMs have made it much more accessible to humans and, more importantly, to non-humans. [1, 3]
If the specification is good, we don't need tests. It makes no sense to test that there are three r's in strawberry and not two when we have proved the answer correct for every possible character and string input.
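For contrast, here is what that single strawberry test looks like as a one-off Lean check, assuming the earlier definitions are in scope; the theorem subsumes infinitely many of these:

```lean
-- One concrete instance, verified by computation (`decide` evaluates
-- the decidable equality). The universal theorem makes it redundant.
example : countChar 'r' "strawberry" = 3 := by decide
```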
But also tests won't save you
There's a much more alarming trend that makes the case for agents that prove.
Tests are too optimistic for where we're headed.
Let's play it out. Agents write your code now; you just prompt. The junior devs too. You fend against the onslaught of code in the only reasonable way: prompt the agent to review the code as well. Because the anxiety of reviewing 3,000 LOC pull requests is too much. Tomorrow more is on the way. What's the anchor for these changes? How do we trust them? Tests! Yes, those will help.
Well, we don't have time to write those, so let the agent write tests as well. Hm.
But for humans to even judge whether a generated test suite is "good" is too much of a cognitive ask now. We're generating perfectly operational code too efficiently. At best we skim but mostly we just trust SuperLLM@20280208 did a good job on those tests.
It's wildly uncomfortable to think tests are pointless, but that's where we're headed. Tests will be pointless. There I said it.
Proof not optional
I think we end up here in 2 to 3 years:
(a) Agents introspect their own behavior and adapt "soft" software on-the-fly. Code is liquid. We have enough confidence that the LLM writes good code that doesn't need checking (e.g., static site generator). Tests are cruft.
(b) Crucial, core software is written by agents and requires a proof witness and a verifiable spec: the kind of software that controls private keys, authentication, encryption.
We know now that agents generate operational code at volumes humans never could. And when used for red teaming, they will relentlessly probe every gap, every edge case, every vulnerability. They only need to find one chink in the armor, and it's over. When code generation is relentless and attacks are relentless, formal verification is the only way to definitively say "this code is an impenetrable bastion".
The end game is: Prove or do not, there is no try.
Further Reading
- [1] The Code-Only Agent — What if we made agents always generate code, so that they always have behavior to prove?
- [2] Lambda Cube — If you got the sense that our proof specification feels like programming in 3D, you wouldn't be far off. Lean sits in the corner with the most expressive type system, enabling dependent types and proof construction.
- [3] Specifications Don't Exist and Claude Can Sometimes Prove It — The bottleneck in Formal Verification is obtaining good specifications. Claude Code shows surprising capability at interactive theorem proving in Lean, but still requires expert oversight.
- [4] Curry-Howard Correspondence — The deep connection between programs and proofs, propositions and types. Understanding this helps to understand why formal verification in languages like Lean is possible at all.
- [5] Other proof languages. Besides Lean there's F*, Rocq, Dafny, Isabelle/HOL, Agda. Each makes different tradeoffs between automation and expressiveness. Exploring these may lead you to rabbit hole territory, because they differ on what counts as a proof. Lean has the momentum right now: it's where the AI tooling, the math community, and the developer ergonomics are converging.
What's Next
I won't pretend we'll just end up in a world where all code has good specifications and is proved correct. It will take hard work in between. Performance can suffer: Lean programs can be slow, and extraction to efficient languages adds complexity. Still, there are rich opportunities to overcome these.
AI × Formal Verification is one of the most demanding things you could choose to work on right now, and the most demanding things are the ones with staying power. After thousands of hours of prompting and engineering with agents, I realized this deserves more exposure now: if you want to keep being relevant in software, pay attention here. It will absolutely need to be cracked in the end game. The payoff is huge.
Follow me on 𝕏 for more posts on building Agents and Formal Verification.
Timestamped 8 Feb 2026