"We need to stop with LLMs"

Generative AI is everywhere, driven by the spectacular power of language models. Yet behind the glowing benchmarks, the reality on the ground reveals a dead end: that of a probabilistic technology we're desperately trying to make deterministic through prompts and sheer scale. In this op-ed, Olivier Bergeret, Thiga Director of the Tech and Agentic teams, calls for a break with the "fetish for size." For him, the future of AI no longer lies in a single, omniscient brain, but in precision engineering: that of Small Language Models (SLMs) and agentic architectures steered with industrial-grade rigor.

Presenting Large Language Models (LLMs) as an absolute breakthrough leaves me somewhat puzzled. Neural language models are twenty years old. The Transformer architecture, the foundation on which every major model today rests, is almost ten. What we're told is a revolution rests on foundations the industry has long known. That said, you have to admit that OpenAI got it right in 2022 with its massive "scaling" strategy, which propelled the market into a new era almost by force.

We all rushed into the breach, churning out POCs — often more spectacular than genuinely foundational, it has to be said — to explore this new playground. Today, the verdict is twofold: yes, the technology undeniably works. But there's no denying the technical and operational limits that come with it, and the often serious consequences that go along with them.

Benchmarks celebrate the models' progress in what they can do (code, mathematics, knowledge retrieval), but they say far less about what the models still can't guarantee: reliability and stability. And these are not isolated defects; they drag down a great many use cases.

The most visible of these problems remains hallucination. A model can invent a fact, a source, or an explanation without the slightest hesitation. What makes hallucination dangerous is the confidence with which it's delivered: a piece of false but fluent, well-constructed information quickly takes on the appearance of truth.

These problems can also be amplified by structural limits — context windows too small for dense tasks, reasoning that frays over long stretches — and by the models' tendency toward sycophancy, which pushes them to adopt their interlocutor's assumptions rather than question them.

These difficulties can take on an added dimension in the agentic domain. When a model — used by an agent — plans, chains steps together, and calls tools, an error is no longer confined to a single rough answer; it propagates down the execution chain. A misreading, a poorly tracked context, or a botched chain of steps can compromise an entire sequence.

Put plainly: LLMs excel at producing believable output, and that's exactly where the trap lies. Their fluency says nothing about their reliability, the soundness of their reasoning, or their ability to tell the true… from the merely plausible.

It's time to stop tinkering around these limits and to rethink the architecture from the ground up.

Treating the symptoms rather than the disease

Believing that refining prompts or backing models with document databases would be enough to fix their fundamental weaknesses was probably one of the great (dis)illusions of this first phase of adoption. For a long time, we tried to treat as mere usage flaws what are in fact structural limits. An LLM remains, by nature, a probabilistic system that excels at producing plausible strings of words. But predicting — even accurately — the most likely word at each step of a sentence does not amount to any real understanding of what it produces.

This is precisely why today's remedies quickly reach their limits. The system prompt can steer a model's behavior, but its effect fades as the context grows longer. RAG, for its part, improves access to relevant information and reduces certain forms of forgetting, yet without endowing the model with robust reasoning. Even when properly fed with accurate data, an LLM can keep producing a wrong answer the moment the task demands something other than a plausible recomposition of language. And because it generates token after token, never retracing its steps, an initial error tends to propagate through the answer rather than be corrected.

The ecosystem has naturally sought to compensate for these fragilities with grounding, GraphRAG, preference tuning such as DPO (Direct Preference Optimization), and fine-grained context segmentation, and it has obtained real results.

Even so, these approaches do no more than improve the models' observable behavior without changing the way they fundamentally work. In other words, they ease the symptoms without curing the disease.

The LLM dead end

There's another angle to this problem — less technical but just as concrete: cost. For several years, the industry's answer came down to a single move: making models bigger. BERT-Large had 340 million parameters in 2018. GPT-3 boasted 175 billion two years later. Qwen launched a 480-billion-parameter model in 2025. At every stage, innovation was conceived as a change of scale, as if raw power were, in itself, an answer to the limits of its predecessor.

The problem is that this race comes at a price: in compute capacity, energy consumption, infrastructure, and inference costs for every use case put into production… and without always producing a proportional gain in business value. We've gotten used to wheeling out behemoths for tasks that could have been handled with far less. A model with hundreds of billions of parameters to summarize a document, classify a support ticket, or generate a standard customer reply: the ratio between the power deployed and the real complexity of the task is often absurd. It's the classic image of using a bazooka to swat a fly. By conflating size with intelligence for so long, we've ended up normalizing a model that is expensive, power-hungry, and structurally inefficient for most of the uses we put it to.

The answer to these issues doesn't lie in piling on ever more parameters. The LLM isn't meant to do everything on its own! The moment a task calls for computation, fact-checking, formal logic, or executing actions, it makes far more sense to hand those operations off to specialized, deterministic, controllable tools. The agentic approach formalizes this shift: rather than expecting a single model to answer everything, you build a chain where each component does what it "knows" how to do.
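To make that hand-off concrete, here is a minimal sketch, not a reference implementation. It assumes a hypothetical `route` function that sends arithmetic to a deterministic calculator and everything else to a model. The names `calculate`, `fake_model`, and `route`, and the shape of the task dictionary, are illustrative assumptions; `fake_model` merely stands in for a real model call.

```python
import ast
import operator

# Hypothetical whitelist of arithmetic operators the deterministic tool accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculate(expression: str):
    """Evaluate a plain arithmetic expression deterministically, without eval()."""
    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)

def fake_model(prompt: str) -> str:
    """Stand-in for a language model call (assumption: no real API is used here)."""
    return f"[model draft for: {prompt}]"

def route(task: dict) -> str:
    """Send computation to the tool and open-ended text to the model."""
    if task["kind"] == "arithmetic":
        return str(calculate(task["payload"]))
    return fake_model(task["payload"])
```

The multiplication never passes through the model, so its result does not depend on sampling; the model is only asked for what a calculator cannot do.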

In practice, it looks like this: one agent forms a hypothesis, another generates code, a third evaluates it in a sandboxed environment. If the test fails, the fix is fed back into the loop. The point of this setup is less to "make the model intelligent" than to keep its weaknesses better in check.
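The loop described above can be sketched in a few lines, under heavy assumptions. Here the hypothesis and code-writing agents are collapsed into a single scripted stand-in (`scripted_generator`), and `exec()` in an empty namespace plays the role of the sandbox, which it is not in any security sense. All function names are hypothetical.

```python
from typing import Callable, Optional, Tuple

def run_checks(source: str) -> Optional[str]:
    """Run candidate code in a fresh namespace and return an error message, or None.

    Assumption: exec() in an empty dict stands in for a real sandbox
    (container, restricted subprocess); it provides no actual isolation.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        result = namespace["add"](2, 3)
        if result != 5:
            return f"add(2, 3) returned {result}, expected 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None

def repair_loop(generate: Callable[[Optional[str]], str],
                max_rounds: int = 3) -> Tuple[str, int]:
    """Ask for code, test it, and feed the failure back until it passes."""
    feedback = None
    for round_number in range(1, max_rounds + 1):
        candidate = generate(feedback)
        feedback = run_checks(candidate)
        if feedback is None:
            return candidate, round_number
    raise RuntimeError(f"no passing candidate after {max_rounds} rounds: {feedback}")

def scripted_generator(feedback: Optional[str]) -> str:
    """Hypothetical stand-in for the code-writing agent: wrong first, fixed after feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"
```

Calling `repair_loop(scripted_generator)` returns the corrected function on the second round. The round cap matters as much as the loop itself: without it, a generator that never converges would keep burning model calls.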

But this rigor comes at a cost. Multi-agent chains lengthen execution times and multiply calls to the models, with a rapid rise in latency and inference costs. Yet this breaking-down of tasks also brings something important into view: at each step, the case for wheeling out a very large model becomes less obvious.

Choosing efficiency

For years, the industry equated progress with a continuous increase in parameter count. But that equation is losing its relevance today. Not every task demands the depth of a massive generalist model; many call first and foremost for precision, speed, stability, and controlled cost. That realization forces us to give up our fetish for size when it comes to language models.

This is exactly what the rise of Small Language Models is about. Far from being watered-down versions of the large models, they often meet a large share of real needs more aptly. Recent work shows, in fact, that when a problem is properly broken down into well-defined sub-tasks, smaller models — better tuned and better orchestrated — can reach performance on targeted tasks that is comparable to, or even better than, that of far larger models. To put it more concretely, today's SLM tends to offer a level of useful intelligence comparable to that of an LLM from just a few months ago, while using 15 to 20 times fewer parameters.

Their appeal is economic first, but it quickly becomes operational. More compact models make it possible to keep costs down, reduce latency, and simplify deployment. They also enable local deployment, as close as possible to the data and to business environments — which answers growing demands for security and sovereignty. In other words, SLMs don't signal a retreat in ambition; on the contrary, they reflect a form of maturity: seeking not maximum power in every situation, but the right level of model for the right use.

The platform imperative

Beware, however, of trading one blind spot for another. Believing that deploying agents and SLMs is enough to solve the problem repeats the mistake of the magic prompt that fixes everything, this time under the banner of agentic hype. This technological rationalization demands relentless rigor, and going into it half-cocked is out of the question.

Deploying swarms of agents (networks of agents coordinated in parallel) requires a genuine industrial-grade orchestration platform, the kind that falls under LLMOps or AgentOps. Cobbling agents together without centralized infrastructure means courting operational disaster and building over-engineered contraptions that are impossible to maintain. This platform must also, without fail, provide full observability. It's no longer about reading a simple conversation log, but about tracing API calls in real time, monitoring token consumption, auditing reasoning loops, and measuring the latency, error rate, and hallucination level of each sub-agent.
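As an illustration of what per-agent measurement could look like at its simplest, here is a sketch of an in-memory recorder. The `AgentMetrics` class and its field names are assumptions made for this example; a production platform would export such counters to a tracing backend instead of keeping them in a dictionary. Hallucination level is deliberately absent, since measuring it requires a separate evaluation step that this sketch does not attempt.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class AgentMetrics:
    """In-memory counters per sub-agent (hypothetical; a real platform would export these)."""

    def __init__(self):
        self.stats = defaultdict(
            lambda: {"calls": 0, "errors": 0, "tokens": 0, "seconds": 0.0}
        )

    @contextmanager
    def trace(self, agent: str):
        """Time one agent call and count it, including calls that raise."""
        entry = self.stats[agent]
        entry["calls"] += 1
        started = time.perf_counter()
        try:
            yield entry
        except Exception:
            entry["errors"] += 1
            raise
        finally:
            entry["seconds"] += time.perf_counter() - started

    def error_rate(self, agent: str) -> float:
        """Share of traced calls that raised an exception."""
        entry = self.stats[agent]
        return entry["errors"] / entry["calls"] if entry["calls"] else 0.0

# Usage: wrap each sub-agent call and record the tokens it consumed.
metrics = AgentMetrics()
with metrics.trace("classifier") as span:
    span["tokens"] += 120  # assumed token count reported by the model call
```

Because the counters are keyed by agent name, the same recorder shows which step of a chain is slow, expensive, or failing, which a flat conversation log cannot do.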

More crucial still, this infrastructure has to guarantee security by design. Entrusting execution tools (such as database access) to autonomous agents requires building in strict safeguards (guardrails). These algorithmic firewalls, placed upstream and downstream of the LLMs, have to filter prompt injections, block off-topic outputs, prevent data leaks between agents, and ensure that any critical action always requires deterministic validation. Without this governance platform, agentic architecture shifts from being a solution to being a new source of risk.
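The requirement that critical actions pass deterministic validation can be sketched as a plain policy check that sits between the agent and its tools. Everything here is hypothetical: the tool names, the `TOOL_POLICY` table, and the `approved_by` field are assumptions for illustration, and the sketch covers only action authorization, not prompt-injection filtering or data-leak prevention.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical policy table: tool name -> whether the action is critical.
TOOL_POLICY = {
    "search_tickets": {"critical": False},
    "issue_refund": {"critical": True},
}

@dataclass(frozen=True)
class ProposedAction:
    """An action an agent wants to run; the agent itself never sets approved_by."""
    tool: str
    arguments: dict
    approved_by: Optional[str] = None  # filled in by a human or a rule engine

def authorize(action: ProposedAction) -> Tuple[bool, str]:
    """Deterministic gate: same input, same verdict, no model involved."""
    policy = TOOL_POLICY.get(action.tool)
    if policy is None:
        return False, f"unknown tool '{action.tool}'"
    if policy["critical"] and action.approved_by is None:
        return False, f"'{action.tool}' requires explicit approval"
    return True, "ok"
```

With this gate in place, `authorize(ProposedAction("issue_refund", {"amount": 500}))` is refused until an approver is attached, whatever the model wrote in its reasoning. The default is denial: a tool missing from the policy table cannot be called at all.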

Ultimately, the future of generative AI doesn't look like a single, all-powerful brain, but rather like a swarm of cognitive workers: fast, frugal, and ultra-specialized. The greatest victory of this new iteration is to have knocked the supposedly omniscient LLM off its pedestal, in favor of an approach where the language model goes back to being a mere engineering component in the service of operational efficiency. That's what it means to stop with LLMs: not abandoning them, but no longer asking them for what they cannot guarantee.