Feedback Is All You Need
AI researchers are using their own tools to build the next generation — and reporting 100%+ productivity gains that accelerate with each release cycle.
The key ingredient isn’t intelligence. It’s verification: a way for the system to know whether its output got better or worse. The domains where AI is compounding fastest are the ones where that feedback signal exists.
This month, the AAA launched a new tool: you submit your arguments and evidence, and the “Resolution Simulator” gives you a simulated arbitral decision. Legal reasoning just acquired something it has historically lacked: a way to score the next draft.
The key to fast AI progress isn’t intelligence. It’s feedback.
Anthropic’s researchers use Claude Code to do their research. An internal survey of sixteen researchers found a median self-reported productivity gain of 100%. Mean: 152%. And those numbers are accelerating with each model release; releases now come every two to three months. OpenAI’s researchers use Codex. Same dynamic. The people building the tools are using the tools to build better tools.
That sounds like a story about intelligence. But the deeper story is about verification.
Why This Works
Recursive improvement is not magic. It depends on a feedback signal: some way for the system, or the human directing it, to tell whether the output got better or worse.
In AI research, that signal is often a benchmark. Run the new model against a test suite. Did the scores go up? Then you have something to build on.
Andrej Karpathy’s autoresearch project makes this logic especially clear. An AI agent receives a research direction and a codebase. It runs experiments autonomously, trains a small model, analyzes the results, adjusts parameters, commits improvements, and repeats. The agent makes a lot of mistakes and goes down a lot of blind alleys. But that doesn’t matter, because all it needs in order to improve is a way to tell whether each round made performance better or worse.
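To make the logic concrete, here is a toy loop in Python. It is not Karpathy’s code: the “agent” is just random perturbation of a single parameter, and the “benchmark” is a stand-in scoring function I made up. The point is that blind proposals plus a reliable score are enough to climb.

```python
import random

# Toy illustration, not Karpathy's actual code: the "agent" randomly
# perturbs one parameter, and the "benchmark" is a stand-in scoring
# function that peaks at param == 3.0.

def benchmark(param: float) -> float:
    return -(param - 3.0) ** 2  # the feedback signal: higher is better

param, best = 0.0, benchmark(0.0)
for _ in range(1000):
    candidate = param + random.gauss(0, 0.1)  # propose a blind change
    score = benchmark(candidate)              # verify it against the signal
    if score > best:                          # keep only verified improvements
        param, best = candidate, score

print(f"param converged to {param:.2f} (score {best:.4f})")
```

Most proposals get rejected. The loop converges anyway, because rejection is cheap and the verification is reliable.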
The same pattern appeared in software. Shopify CEO Tobi Lütke recently described an AI agent submitting a pull request to a framework he created twenty years ago and improving benchmark performance by 53 percent. Two decades of human development had left optimizations on the table. The agent found them because it could run repeated experiments against a clear metric.
Feedback for Legal Work
This month, the American Arbitration Association launched the Resolution Simulator. A party submits case materials — arguments, evidence, legal authority — and receives an AI-generated simulated decision. Nonbinding. Informational. Based on the same structured reasoning framework as the AAA’s live AI Arbitrator.
The AAA built it as a strategic preview tool. A way for legal teams to pressure-test their arguments before formal proceedings.
But look at what else they created: a feedback signal.
There is no technical reason (as far as I know) that an AI agent could not go full “Karpathy autoresearch”: (1) draft arguments, (2) submit them to the Resolution Simulator, (3) read the simulated decision and identify the weaknesses in the arguments, (4) revise and resubmit, and (5) repeat steps 1 through 4 until the arguments hold up. The same loop that already exists in coding and AI research could, in principle, start appearing in legal reasoning. (We touched on this two weeks ago — the idea that autonomous AI loops work in any domain where you can tell whether the output is correct. Legal work has traditionally been hard to verify at machine speed. That may be changing.)
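In code, the loop would be nothing exotic. Here is a minimal toy sketch in Python, hypothetical throughout: the Resolution Simulator has no public API that I am aware of, so `simulate_decision` and `refine` are invented stand-ins, and the simulated “decision” is reduced to a set of flagged weaknesses.

```python
# Hypothetical sketch only. The Resolution Simulator has no public API
# that I know of; simulate_decision() is a toy stand-in that returns
# the weaknesses a simulated ruling would flag in a draft.

def simulate_decision(draft: set[str], required: set[str]) -> set[str]:
    """Toy simulator: flag every required element the draft fails to address."""
    return required - draft

def refine(draft: set[str], required: set[str], max_rounds: int = 10) -> set[str]:
    for _ in range(max_rounds):
        weaknesses = simulate_decision(draft, required)  # submit, read the result
        if not weaknesses:
            break                           # the simulated decision came back clean
        draft = draft | {weaknesses.pop()}  # revise to address one flagged weakness
    return draft

required = {"standing", "breach", "causation", "damages"}
print(refine({"breach"}, required))  # converges on all four elements
```

A real version would swap every piece of this for an LLM drafting step and an actual simulator call. The shape of the loop would not change.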
Nobody has built this yet. The AAA almost certainly did not design the tool with this in mind. But the missing piece — the feedback — just appeared.
The bottleneck may not be that AI cannot think like a lawyer. The bottleneck may be that legal work has historically had very weak feedback.
Code has test suites. AI research has benchmarks. Legal argument now has, at minimum, a first approximation of a simulated result.
The important point is not that this loop is already here. It is that legal reasoning just acquired something it usually lacks: a way to tell whether the next draft got better.
We’ve already seen what this feedback signal can do for software. What can it do for legal work?
Capability Tracker
GPT-5.4 mini and nano released. OpenAI shipped smaller, cheaper versions of GPT-5.4 this week. Simon Willison estimated the nano model could process descriptions of every image in a 76,000-photo library for $52 total, or roughly $0.0007 per image. Frontier-level capability is moving downmarket fast.
Xiaomi’s MiMo-V2-Pro hits near-frontier intelligence. The Chinese electronics giant released a reasoning model scoring 49 on the Artificial Analysis Intelligence Index — placing it between Kimi K2.5 and GLM-5. Another non-US lab reaching competitive performance.
MiniMax-M2.7 delivers GLM-5 intelligence at one-third the cost. An 8-point jump over its predecessor released just one month ago. The price-performance curve continues to steepen.
Mistral Small 4: open-weights hybrid reasoning. Mistral released a 119B mixture-of-experts model with just 6.5B active parameters per token, supporting both reasoning and non-reasoning modes. Open-weights models continue closing the gap with proprietary ones.
Trying to make sense of all this? I help law firms and companies navigate the legal side of AI implementation. Email me at adam@lawsnap.com or click here to schedule a short call.


