The 70% "Breakthrough" That Isn’t: NVIDIA Just Re-Introduced Systems Engineering to AI
Every hype cycle needs a miracle, a headline result that suggests a course correction. If NVIDIA thought this was it for Agentic AI, they were wrong.
In the latest round of agentic AI, NVIDIA researchers argue that routing is the lever: instead of pushing every request through a single large model, route each step to the cheapest adequate capability, whether that's a small language model, a specialist model, or a tool. The payoff, they report, is dramatically lower cost and faster end-to-end performance.
The data and benchmarks are genuinely useful. But the underlying idea isn't new. It's simply the industry rediscovering a basic rule of systems engineering: you don't run a power station on a single monolith and a prayer, and you don't call it innovation when you've just reintroduced a forgotten operating discipline.
Rediscovering Systems Engineering in the Age of AI Hype
This article began with a stray LinkedIn post spotted while I was waiting for a delayed train. It sent me back to two NVIDIA Research papers, a remark from Andrew Ng, and a familiar pattern I keep seeing across modern AI.
One pattern keeps resurfacing in AI research and across the industry: practices we have relied on for decades (routing, scheduling, basic systems thinking) are being rediscovered and marketed as breakthroughs. I was working on mainframes in the 1980s, and plenty of these techniques were already mature then.
The surprise isn't that they work. The surprise is how easily institutional memory evaporates in a hype cycle, and how often "new" turns out to mean "new to this cohort". Not everything is searchable, indexed, and packaged as a Medium post. A lot of it lived in places like the IBM Redbooks: thousands of pages of hard-won operational knowledge that rarely shows up in today's AI discourse.
"Small Language Models are the Future of Agentic AI"
The real story beneath NVIDIA's results is that the industry is finally remembering what serious systems engineers never forgot: intelligence is not a monolith; it is a choreography. This is the old lesson of computing: do not burn your most expensive resource on trivial work.
Operating-system schedulers, database query planners, microservice routers, and even basic caching strategies all exist to prevent exactly that kind of waste. NVIDIA has simply translated that pattern into the current language model stack. The fact that this passes as a revelation says more about the last three years of AI than it does about the idea itself.
A Platform Reply That Turned Into a Thesis: Permission to Talk Architecture Again
I had clocked the NVIDIA research earlier, but only in passing. Then a post on this platform yesterday forced me to articulate what I actually think, and that reply became the spine of this piece.

The Numbers, Then the Reality: Routing Isn't New, It's Engineering
Let's deal with the numbers first. In ToolOrchestra, NVIDIA trains an 8B orchestrator that coordinates tools and other models and reports 37.1% on Humanity's Last Exam, compared with 35.1% for a monolithic GPT-5 baseline, while being 2.5× more efficient. They also show large cost advantages on FRAMES and τ2-Bench, with the abstract stating "about 30% of the cost", which is where the "roughly 70% saving" comes from.
Useful. But conceptually, what is happening? A controller evaluates the request, decomposes it, and routes sub-tasks to the cheapest adequate capability: a calculator, search, a specialist coding model, or a frontier model, only when needed. That is not new. It is how we have engineered efficient, resilient systems for decades.
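To make the shape of the idea concrete, here is a minimal sketch of cheapest-adequate routing in Python. The capability names, costs, and adequacy predicates are all hypothetical; a real system would learn or measure them rather than hard-code a string check.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical capability registry. Names, costs, and adequacy
# predicates are illustrative, not NVIDIA's actual configuration.
@dataclass
class Capability:
    name: str
    cost_per_call: float                 # relative cost units
    can_handle: Callable[[str], bool]    # adequacy check for a task

def route(task: str, registry: list[Capability]) -> Capability:
    """Return the cheapest capability that is adequate for the task."""
    for cap in sorted(registry, key=lambda c: c.cost_per_call):
        if cap.can_handle(task):
            return cap
    raise LookupError(f"no capability can handle: {task}")

registry = [
    Capability("calculator",  0.0001, lambda t: t.startswith("arith:")),
    Capability("small-lm-8b", 0.01,   lambda t: t.startswith("extract:")),
    Capability("frontier-lm", 1.0,    lambda t: True),   # catch-all, priciest
]

print(route("arith: 17 * 23", registry).name)   # -> calculator
print(route("plan a refactor", registry).name)  # -> frontier-lm
```

The entire argument of this piece lives in that sorted-by-cost loop: a scheduler picking the cheapest adequate resource, which is as old as time-sharing.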
The Real Advance: Trainable Routing, and the Fine Print Everyone Skips
So where is the novelty? Not the idea of routing, but the discipline of making routing measurable and trainable. ToolOrchestra formalises orchestration as a sequential decision process, then trains with reinforcement learning using rewards for outcome, cost, latency, and user preference. That matters because naïve prompting is not an orchestrator.
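To see why "trainable" matters, consider the shape of the objective. The sketch below is my own illustration of a scalarised reward over outcome, cost, latency, and preference; the weights and signals are assumptions, and the paper learns the trade-off with reinforcement learning rather than hand-tuning a linear combination.

```python
def orchestration_reward(outcome: float, cost_usd: float, latency_s: float,
                         preference: float,
                         w_outcome: float = 1.0, w_cost: float = 0.3,
                         w_latency: float = 0.05, w_pref: float = 0.2) -> float:
    """Illustrative scalarised reward for one orchestration episode.

    outcome:    1.0 if the final answer was correct, else 0.0
    cost_usd:   total spend for the episode
    latency_s:  end-to-end wall-clock seconds
    preference: [0, 1] score for matching the user's stated priorities

    The weights here are invented; the point of RL training is that the
    orchestrator internalises the trade-off instead of us hand-tuning it.
    """
    return (w_outcome * outcome
            - w_cost * cost_usd
            - w_latency * latency_s
            + w_pref * preference)

# Cheap-and-correct beats expensive-and-correct under this objective:
print(orchestration_reward(1.0, cost_usd=0.02, latency_s=3.0, preference=0.8))  # ~1.00
print(orchestration_reward(1.0, cost_usd=1.50, latency_s=9.0, preference=0.8))  # ~0.26
```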
NVIDIA shows prompting is brittle and biased: models over-delegate to "family members" (self-enhancement bias) or default to the strongest option regardless of price.
There are also caveats that the hype posts omit. Their HLE comparison is on a text-only subset, and they note reproduction differences versus previously reported GPT-5 results. And their "cost" is computed via third-party API pricing conversions, which is directionally useful but not a universal constant across vendors, discounts, or on-prem deployments.
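The pricing caveat is easy to see with a toy calculation. The per-million-token prices below are hypothetical stand-ins, as are the call traces; under a different price table, a volume discount, or on-prem amortisation, the same traces can yield a very different cost ratio.

```python
# Hypothetical per-million-token prices. Real prices vary by vendor,
# discount tier, and deployment (API vs on-prem): exactly the caveat above.
PRICES_USD_PER_MTOK = {
    "small-lm-8b": {"in": 0.10, "out": 0.40},
    "frontier-lm": {"in": 5.00, "out": 15.00},
}

def episode_cost(calls: list[tuple[str, int, int]]) -> float:
    """Total cost of a list of (model, input_tokens, output_tokens) calls."""
    total = 0.0
    for model, tok_in, tok_out in calls:
        p = PRICES_USD_PER_MTOK[model]
        total += (tok_in * p["in"] + tok_out * p["out"]) / 1_000_000
    return total

# Invented traces: an orchestrated episode vs a monolithic one.
orchestrated = [("small-lm-8b", 4000, 800), ("frontier-lm", 1200, 300)]
monolithic   = [("frontier-lm", 6000, 1500)]
print(f"orchestrated: ${episode_cost(orchestrated):.4f}")  # $0.0112
print(f"monolithic:   ${episode_cost(monolithic):.4f}")    # $0.0525
```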
SLMs, Workload Reality, and the Inertia of the ChatGPT Era
Now, the companion paper "Small Language Models are the Future of Agentic AI" argues that most agent calls are narrow and repetitive, and therefore small models are often more suitable and more economical for agentic systems.
Again, that is common sense if you have ever profiled a production workload. The paper's strongest value is that it names the inertia, which has three parts that any more-than-casual observer of OpenAI in particular will recognise:
➠ The LLM behemoths sank capital into centralised LLM inference
➠ The community built tooling around it, and
➠ Such was the pull of Altman and the ChatGPT hype engine that most (maybe all) alternatives were considered heresy even when they were the rational path.
Where the Work Really Is: Turning Routing Into Governable Infrastructure
Here is where I will critique my own instinct to call this "trivial". The hard part is not noticing that routing exists. The hard part is shipping routing safely: tool authentication, deterministic fallbacks, audit logs, policy gates, and continuous evaluation. That is governance, not a hackathon demo. ToolOrchestra's preference-aware training and its generalisation tests with unseen tools are steps towards that maturity.
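To give "shipping routing safely" some texture, here is a hedged sketch of a governed routing wrapper: a policy gate, a deterministic fallback, and an audit trail. Every name here is illustrative; a production version would plug into real authentication, policy, and logging infrastructure.

```python
import hashlib, json, time, uuid

AUDIT_LOG = "router_audit.jsonl"  # in production: append-only, access-controlled

def policy_gate(task: str, capability: str) -> bool:
    """Stand-in policy check, e.g. 'no sensitive data to external tools'."""
    return not ("ssn" in task.lower() and capability == "external_api")

def call_with_governance(task, primary, fallback):
    """Route via `primary`; on policy denial or failure, take the deterministic `fallback`."""
    record = {"id": str(uuid.uuid4()), "ts": time.time(),
              # log a hash, not the raw task: scrubbing applies to logs too
              "task_sha256": hashlib.sha256(task.encode()).hexdigest()}
    chosen = primary if policy_gate(task, primary.__name__) else fallback
    try:
        result = chosen(task)
        record.update(capability=chosen.__name__, status="ok")
    except Exception as exc:
        result = fallback(task)  # deterministic fallback path
        record.update(capability=fallback.__name__, status=f"fallback: {exc}")
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return result

def small_model(task):  return f"[small-lm] {task}"
def rules_engine(task): return f"[rules] {task}"

print(call_with_governance("classify ticket priority", small_model, rules_engine))
```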
The SLM paper outlines a migration path (step 3, clustering, is sketched after the list):
➠ Instrument and securely log agent calls
➠ Scrub sensitive data
➠ Cluster tasks
➠ Fine-tune specialists, and
➠ Iterate.
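Step 3 is where most write-ups hand-wave, so here is what "cluster tasks" can look like in miniature, assuming scikit-learn is available. TF-IDF plus k-means is a deliberately plain stand-in (a sentence-embedding model would cluster better), and the example calls and cluster count are invented.

```python
# Requires scikit-learn. TF-IDF + k-means is a deliberately plain
# stand-in; swap in sentence embeddings for better clusters.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Already-scrubbed agent calls (invented examples).
scrubbed_calls = [
    "extract invoice total from document",
    "extract purchase order number from document",
    "summarise support ticket thread",
    "summarise meeting transcript",
    "convert currency amount to EUR",
]

X = TfidfVectorizer().fit_transform(scrubbed_calls)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Each resulting cluster is a candidate specialism for an SLM fine-tune.
for call, label in sorted(zip(scrubbed_calls, labels), key=lambda p: p[1]):
    print(label, call)
```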
The Uncool Truth: Agents Live or Die on Architecture, Not Hype
This is the bit everyone skips because it looks like work. It is the work. And it is where most "agent" projects die.
So my position stands, with sharper edges. We should stop calling "route to the right tool" groundbreaking. It is baseline engineering.
But we should take the papers seriously as evidence that the parameter-count arms race was never the point, and that "agentic AI" is reverting to what computing has always been: composed systems with explicit interfaces, measurable objectives, and controllers that optimise for correctness, cost, and time under constraints.
Associated Reading
Original LinkedIn Post by Chorouk Malmoum that drew my attention to the subject in the first place
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration - GitHub repository (NVlabs/ToolOrchestra)
Small Language Models are the Future of Agentic AI - Abstract by Peter Belcak, Greg Heinrich, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, Pavlo Molchanov at NVIDIA Research
Small Language Models are the Future of Agentic AI - Full Paper
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration - arXiv Abstract
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration - Full Paper

