LLMs are reaching a plateau
by Claude Coulombe
Ph. D. / Entrepreneur / Applied AI Consultant / Deep Learning & NLP
published August 27, 2025

Big deal... Not a revolution, but an incremental improvement
After more than two years of anticipation [1], media hype, promises of artificial general intelligence (AGI), and even "superintelligence," the recent GPT-5 is a damp squib. This probably explains its release in the middle of summer, on August 7, 2025.
Yet expectations were high, fueled by wild promises. Recall that on May 21, 2024, at Microsoft Build 2024 [2], Kevin Scott, Microsoft's CTO, along with Sam Altman of OpenAI, stated, « We are nowhere near the point of diminishing marginal returns on how powerful we can make AI models as we increase the scale of compute », in front of a giant screen with a beautiful exponential curve for GPT models. Later, Kevin Scott compared the size of the computing infrastructure for GPT-5 to a blue whale, while a killer whale was enough to train GPT-4 and a great white shark for GPT-3. Not to be outdone, Sam Altman stated: « As the models get more powerful there will be many new things to figure out as we move towards AGI ».
Not to mention Sam Altman's many statements claiming that the improvement from GPT-4 to GPT-5 would be as significant as that from GPT-3 to GPT-4: « GPT-4 to GPT-5 will be a similar leap, and it's amazing! » [3], or this other one [4] where he suggested that OpenAI's models would continue to improve for "three or four" more generations. Finally, there were Sam Altman's dithyrambic, religiously tinged predictions on his blog: « Humanity is close to building digital superintelligence » [5].
The exponential evolution curve is rather a sigmoid
The quasi-religious hypothesis that artificial intelligence increases with scaling up (i.e., more data and more computation) has hit a wall! For large language models (LLMs), bigger does not mean smarter. The evolution curve of LLMs would not be the dreamed-of exponential but rather a sigmoid, well known to practitioners of neural networks.
I'm having a bit of fun here, since the actual shape of the LLMs' evolution curve is difficult to predict. What we do know is that the curve started out looking exponential and is showing clear signs of slowing down, or even leveling off toward a plateau, hence my hypothesis of a sigmoid. The points and the shape of the curve in the image in the header of this article are pure caricature.
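To make the caricature concrete, here is a purely illustrative Python sketch (the growth rate and the plateau value are invented, not fitted to any benchmark) showing why an early exponential and a sigmoid are hard to tell apart until the curve starts to bend:

```python
import numpy as np

# Purely illustrative: invented parameters, not fitted to any real benchmark.
t = np.linspace(0, 10, 11)                       # "time" in arbitrary units (e.g., model generations)
exponential = np.exp(0.8 * t)                    # the dreamed-of exponential
K = 1000.0                                       # carrying capacity, i.e., the plateau
sigmoid = K / (1 + (K - 1) * np.exp(-0.8 * t))   # logistic curve with the same initial growth rate

for ti, e, s in zip(t, exponential, sigmoid):
    print(f"t={ti:4.1f}  exponential={e:10.1f}  sigmoid={s:8.1f}")
# Early on the two curves are nearly indistinguishable; later the sigmoid
# bends toward its plateau while the exponential keeps exploding.
```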
« The most important thing to remember about GPT-5 is that LLMs are reaching a ceiling » - Claude Coulombe
One thing is for sure, the progress of GPT-5 over GPT-4 is not comparable to the performance progress of previous models in the GPT family, such as the improvements between GPT-3 in 2020 and GPT-4 in 2023.
My findings are consistent with those of Gary Marcus, an AI expert and notorious skeptic of the large language model approach to achieving general AI [6]. In a similar vein, I recommend the video by physicist and science popularizer Sabine Hossenfelder entitled « GPT-5: Have We Finally Hit the AI Scaling Wall? » [7].
« Nobody with intellectual integrity should still believe that pure scaling will get us to AGI » - Gary Marcus
Let's get back to the scientific method!
There was a hypothesis, which many (including me) doubted, that increasing the size of LLMs would lead to the emergence of new cognitive skills, to the correction of confabulations (the misnamed hallucinations), and then "magically" to artificial general intelligence (AGI). Some sort of language-based model of the world would emerge... The same was expected of images, which were supposed to lead to physical models of the world. To verify the hypothesis, GAFAM+ ran some very expen$ive experiments. With the disappointment of recent LLMs, including Llama 4 and GPT-5, we can say that this hypothesis was wrong. We get rather efficient generative models for textual data, more scholarly, but still shallow. In short, "idiots savants"... This remains useful, but it is not the path to AGI, and even less to superintelligence.
LLMs are useful but limited and unreliable
Large language models (LLMs) have their uses, particularly for homework assistance (a polite way of talking about plagiarism), writing assistance, and computer coding assistance, tasks in which humans are directly involved to ensure quality control.
That said, it is surprising to see what LLMs have managed to accomplish thanks to the properties of high-dimensional latent semantic spaces, the attention mechanism (building on the pioneering work of Yoshua Bengio's group), and a few tricks like good old tree search improved with Monte Carlo methods.
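For readers curious about what that attention mechanism boils down to, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of Transformer-type models; the matrices are random placeholders, not real model weights:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted mix of the values

# Toy example: 4 tokens with 8-dimensional embeddings (random placeholders).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```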
LLMs don't reason, they pretend
An initial scientific study from Apple's AI research group, « The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity » [8], challenged the idea that LLMs can reason and generalize at any level of depth, and showed that they fail on moderately complex versions of basic algorithmic puzzles like the Tower of Hanoi. Beyond a certain level of problem complexity, LLMs fail to follow the necessary basic logical rules. This casts serious doubt on the ability of LLMs to generalize, and certainly on their ability to ever achieve AGI.
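To appreciate why "moderately complex" already hurts, recall that the optimal Tower of Hanoi solution takes 2^n − 1 moves, every one of which must respect the rules. A short Python sketch of the classic recursive algorithm makes the growth explicit:

```python
def hanoi(n, source, target, spare, moves):
    """Classic recursive Tower of Hanoi: move n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # park the n-1 smaller disks
    moves.append((source, target))               # move the largest disk
    hanoi(n - 1, spare, target, source, moves)   # bring the smaller disks back on top

for n in (3, 7, 10, 15):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    print(f"{n:2d} disks -> {len(moves):6d} moves (2^n - 1 = {2**n - 1})")
# An LLM must get every one of these steps right, in order,
# which is exactly where the Apple study reports a collapse.
```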
A recent scientific study, « Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens » [9] by researchers at Arizona State University shows that the Chain of Thought, the supposed trace of reasoning that comes from LLM-based generative chatbots, is just “largely a brittle mirage.” LLMs don’t reason. They perform pattern searching in their latent space created from their training data. The researchers asked the models to solve logic puzzles that require “out-of-distribution” generalization. As soon as a problem deviates even slightly from its training data, the “pseudo-reasoning” breaks down completely.
Also, researchers at Anthropic, the company behind the generative chatbot Claude, discovered that the steps in the chain of thought (CoT) used by models do not actually match the outcome [10]. That is, the model may have correct problem-solving steps but end up with an incorrect result, or get a correct result from incorrect steps. Their research suggests that LLMs are not full-fledged reasoning engines, but rather sophisticated simulators / generators of reasoning text. Rather than demonstrating true understanding of the problem, the chain of thought applies patterns learned during training.
Confabulations (the misnamed hallucinations) are still unresolved
One thing is certain: the problem of confabulations persists and has still not been resolved after nearly three years and many promises. Within a few days of the launch of GPT-5, the web was overflowing with errors made by GPT-5, errors of the same type as those made by its predecessors [11]. This model, supposedly of « Ph.D. expert » level according to Sam Altman, is still not capable of reliably counting the number of "R" letters in the name of a fruit or of performing basic arithmetic operations.
Just thinking about it, some of these problems could come from the new model-switching module, which selects which underlying model handles a request. For example, GPT-5 chose a Python interpreter when you typed the words "calculate," "add," or "subtract," but not when you directly typed a mathematical expression. The model-switching problem also seemed to be related to the type of subscription: free, paid, or premium. Friends told me that the paid versions were free of these trivial errors. However, the erratic behavior of this switching module at the launch of GPT-5 speaks volumes about OpenAI's quality control. It looks like the work of an amateur or an intern.
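For illustration only, here is a hypothetical sketch of the kind of naive keyword routing that would produce exactly the behavior described above; this is my caricature of the failure mode, not OpenAI's actual code:

```python
def naive_router(prompt: str) -> str:
    """Hypothetical keyword-based router illustrating the failure mode described above.
    This is NOT OpenAI's implementation, just a caricature of naive routing."""
    math_keywords = ("calculate", "add", "subtract")
    if any(word in prompt.lower() for word in math_keywords):
        return "python_interpreter"        # routed to a tool that computes reliably
    return "plain_llm"                     # left to the LLM's unreliable "mental" arithmetic

print(naive_router("calculate 17 * 23"))   # python_interpreter
print(naive_router("17 * 23 = ?"))         # plain_llm  <- the raw expression slips through
```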
That said, other similar errors and confabulations are discovered every day using GPT-5 [11].
To think that some would like to entrust our economies, our courts, our security, and even our lives in medical applications [11] to powerful but unreliable LLMs. I teach it and I repeat it: for critical applications, the human must remain in the loop. By the very nature of LLM-based systems, we cannot guarantee their reliability, because of confabulations that we can reduce using RAG and other techniques but never completely eliminate.
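For readers unfamiliar with the acronym, retrieval-augmented generation (RAG) grounds the model's answer in retrieved documents. Here is a minimal sketch, with a toy word-overlap retriever and a placeholder prompt rather than any specific library's API:

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by crude word overlap with the query.
    A real system would use embeddings and a vector index."""
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(documents, key=overlap, reverse=True)[:k]

def rag_prompt(query: str, documents: list[str]) -> str:
    """Build a prompt that grounds the (hypothetical) LLM call in retrieved sources."""
    context = "\n".join(retrieve(query, documents))
    return (f"Answer using ONLY the sources below; say 'I don't know' otherwise.\n"
            f"Sources:\n{context}\n\nQuestion: {query}")

docs = ["The capital of Australia is Canberra.", "Sydney is Australia's largest city."]
print(rag_prompt("What is the capital of Australia?", docs))
# The retrieved context reduces confabulations, but cannot eliminate them:
# the model can still ignore or misread the sources.
```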
A paper titled « The wall confronting large language models » [12] was published a few weeks before the launch of GPT-5. It revisits the scaling laws of LLMs, which allegedly do not take into account the fact that error removal requires an extremely large amount of computation. According to the study, reducing the error rate of current LLMs enough to make them reliable for enterprise use would require roughly 10^20 (1.0E+20) times more computing power. This is physically impossible.
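A back-of-the-envelope calculation shows how such an astronomical factor can arise; the exponent below is an assumed, illustrative value, not the one derived in [12]. If the error falls off as a weak power law of compute, error ∝ C^(−α) with α around 0.05, then dividing the error by 10 requires multiplying compute by 10^(1/α) = 10^20:

```python
# Back-of-the-envelope illustration only; alpha is an assumed, typical-looking
# scaling exponent, not the value derived in Coveney & Succi [12].
alpha = 0.05                     # error ~ C**(-alpha): error shrinks very slowly with compute C
error_reduction = 10             # target: divide the error rate by 10
compute_factor = error_reduction ** (1 / alpha)
print(f"Compute must grow by a factor of {compute_factor:.1e}")   # 1.0e+20
```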
I'm curious to know an honest estimate of the percentage reduction in confabulations of GPT-5 compared to GPT-4.
Moreover, these models are black boxes that cannot explain their decisions. For "life and death" matters, humans must remain in the loop and take responsibility for the decisions.
Finally, I will only mention in passing the risks of addiction, worsening of mental disorders, and cognitive decline associated with the misuse of generative chatbots. Developing and maintaining complex cognitive skills requires active work and cannot rely solely on technological assistance. When it comes to natural neural networks, it's "USE THEM OR LOSE THEM" [13].
GPT-5, a predictable failure
According to the Silicon Valley outlet The Information [14], based on information obtained from internal sources at OpenAI and its partner Microsoft during the fall of 2024, Orion, which was the code name for GPT-5, was a major disappointment. It represented a qualitative leap less significant than the transition from GPT-3 to GPT-4. OpenAI then decided to launch it under the name GPT-4.5 because it did not represent a sufficient advance.
LLMs, by their objective (predicting the next word based on context) and especially by their dependence on training data, were probably destined to reach a plateau. Stakeholders and deep-pocketed VCs, OpenAI, Microsoft, and other techno-optimists hyped the myth of emergent spectacular properties and of an unlimited supply of training data, even if it meant synthesizing it. For my part, on the "emergent properties" side, I only observed a better ability to perform more varied tasks in context without any example (zero-shot) or with very few examples (few-shot). For now, GPT-5 has "Ph.D.-level intelligence" but only for problems within its training data distribution.
Not surprisingly, large generative models are good in the three areas where there are large data repositories on the web: text (including audio), images (including video), and programming code.
A major problem is the scarcity of high-quality data for training LLMs. Once all the web pages and digital books have been scraped [15], the contribution of texts provided by click workers, poorly paid even though many of them hold Ph.D. degrees, only manages to fill the most significant gaps in mathematics, problem solving, and programming. This is what I call the "stop-gap" LLM improvement technique, which consists of improving benchmark results by filling in the gaps. It's not easy to get to AGI with this method!
GPT-5: faster, more scholarly, and better at programming
GPT-5 is not the announced revolution, at most a good evolution, obtained at the cost of unparalleled spending of money, energy, GHG emissions, and resources. GPT-5 is certainly faster (thanks to more powerful chips and optimizations), better on the usual benchmarks (many of whose answers are probably found in the model's training set [16]), better at programming like most competing models (we will see why further on), and comes with an announced reduction in confabulations (the misnamed hallucinations) that remains to be proven. Finally, a price decrease, which is not a good sign for a supposedly improved and revolutionary product.
Let's face it, the only real flagship application for generative chatbots is student cheating, as the ChatGPT usage curve, which closely follows the school calendar, clearly illustrates. But there's not much money to be made with this clientele. Seeing the AGI target out of reach, OpenAI and other LLM providers are therefore banking on a potential flagship application for businesses. Unfortunately, a recent MIT study reports a spectacular failure rate of 95% for genAI pilot projects in companies [17]. In business, writing assistance (ideation, translation, summarization, proofreading, etc.) is useful but mainly concerns low-paid employees. On the other hand, computer programming is an activity likely to interest businesses because it concerns some of the highest-paid employees.
That said, all the GPT-5 coding demos I've seen have been pre-designed and reheated. This is my observation, not a scientific study: the dramatic productivity gains with genAI coding tools come from good programmers, who are already several times more productive than average. Average programmers with genAI tools sometimes reach the level of good programmers without genAI. And debugging is a real problem, even a nightmare, with genAI tools. Novice coders with genAI coding tools are able to produce software they don't understand and that is quite brittle. Worse, I suspect most of them will never learn to program well and will depend on genAI tools their entire lives. In addition, I observe a more or less rapid erosion of programming skills due to the abuse of genAI tools and the law of least effort.
Add vibe coding, and we risk explosive technical debt (poor code quality). This problem already existed with agile methods, when developers cut corners on consolidation, refactoring, and re-architecture steps. We'll quickly end up with a house of cards, or rather inextricable spaghetti code that neither generative AI nor humans will be able to maintain.
But GPT-5 supposedly made a “mathematical discovery”
On August 20, 2025, a post on X proclaimed that GPT-5 had proven a new mathematical result [18]. More precisely, it was supposedly an entirely new mathematical proof establishing a bound of 1.5/L in convex optimization. Social networks went wild, and the news became a "mathematical discovery" made by GPT-5.
This extraordinary claim came at the perfect time to sow confusion about GPT-5's true capabilities. Suspicion grows when one learns that this news about GPT-5 on X came from an OpenAI employee, Sébastien Bubeck.
Nice try, but as Carl Sagan said, "Extraordinary claims require extraordinary evidence." Rest assured, this is not a great mathematical discovery... The claim is overblown because GPT-5 uses well-known proof techniques (e.g., the Bregman divergence and the standard smoothness and coercivity inequalities often used in convex optimization), which are therefore likely present in many copies in the model's training data.
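For context, the standard smoothness inequality in question is textbook material: for an L-smooth convex function f, the following bound appears in virtually every convex optimization course, and therefore in countless documents the model was trained on:

```latex
f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\,\lVert y - x \rVert^{2}
\qquad \text{for all } x, y .
```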
An LLM-based generative chatbot is basically a "translator" that predicts the next word (actually, a text segment or token) in a text by taking the context into account. It does not reason or understand anything in depth. If generative chatbots sometimes give the impression of solving mathematical problems, it is because they master the language and mathematical notation; they then find problems and solutions by "pattern searching" in their high-dimensional latent space, created by training on huge corpora including thousands of mathematical proofs, and finally they perform a step-by-step heuristic search among many candidate solutions. Thus, a generative chatbot can construct a fragment of a proof, or even a proof not too far from those in its training dataset, but it will probably not be a great "mathematical discovery".
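To make the "next-token prediction" point concrete, here is a toy bigram model in Python, a drastic simplification of an LLM offered only to illustrate the objective of continuing a text with the statistically most likely next word:

```python
from collections import Counter, defaultdict

# Toy "language model": bigram counts stand in for billions of learned parameters.
corpus = "the proof uses the triangle inequality and the proof uses induction".split()
bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def predict_next(word: str) -> str:
    """Return the most frequent continuation seen in training (greedy decoding)."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("proof"))   # 'uses'  -- pattern recall, not understanding
print(predict_next("the"))     # 'proof'
```

An LLM does the same thing at an incomparably larger scale and in a learned latent space, but the objective remains the same: continue the text plausibly.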
At most, we have here a result that underlines the potential of AI to assist mathematical work, but also the importance of human experts in stating problems, putting them into context, and validating AI results in order to determine their true novelty and value.
Given the current state of the art, a major “mathematical discovery” by a generative chatbot is not impossible, but it remains very unlikely.
The future is elsewhere
Sorry, but the LLM path has failed. To achieve the Holy Grail of AGI, it is not enough to build a larger LLM; we will have to use our natural brains...
For strong AI, neurosymbolic approaches such as those proposed by François Chollet [19] and Gary Marcus [20], and/or causal approaches that integrate explicit models of the world, such as Yann LeCun's Joint Embedding Predictive Architecture (JEPA) [21] and Yoshua Bengio's generative flow networks (GFlowNets) [22], are more promising.
But GAFAM+ has invested very little in these approaches, preferring to sell dreams.
Notes and References
[1] GPT-4 was released on March 14, 2023
[2] A video excerpt from Microsoft CTO Kevin Scott's presentation at Microsoft Build 2024, May 21, 2024
[3] Sam Altman's statement in a video by Tsarathustra (@tsarnick) on X and reported on Reddit, early 2025
[4] Sam Altman's statement on X in 2024
[5] Statement by Sam Altman in a blog post titled "The Gentle Singularity", June 10, 2025.
[6] « GPT-5: Overdue, overhyped and underwhelming. And that’s not the worst of it », Gary Marcus, August 9, 2025.
[7] « GPT-5: Have We Finally Hit The AI Scaling Wall? », Sabine Hossenfelder, YouTube, August 21, 2025
[8] Shojaee, P. et al. (2025). « The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity », Apple, June 2025.
[9] Zhao, C. et al. (2025). « Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens », arXiv preprint server, August 13, 2025
[10] Lindsey, J. et al. (2025). « On the Biology of a Large Language Model », transformer-circuits.pub, Anthropic, March 27, 2025
[11] James O'Donnell (2025). « What you may have missed about GPT-5 » - MIT Technology Review, August 12, 2025
[12] Coveney, P. V., & Succi, S. (2025). « The wall confronting large language models », arXiv preprint server, July 30, 2025
[13] « Will generative AI make us brainless? », Claude Coulombe, LinkedIn article, November 21, 2023.
[14] Stephanie Palazzolo, Erin Woo, Amir Efrati (2024). « OpenAI Shifts Strategy as Rate of ‘GPT’ AI Improvements Slows » - The Information, November 9, 2024
[15] That is, all the text on the Web and digital books, without regard to copyright. As Elon Musk "candidly" admitted in an interview in November 2023, everything starts with the massive use of all available data in digital format, without regard for copyright, to train AI models. Corporate pirates have understood that, at worst, they would be caught by the courts and would have to pay ridiculous sums, after years of legal recourse and appeals.
[16] It is a well-known problem that many of the answers from common benchmarks have leaked onto the web. In the huge corpora harvested from the web, it is difficult to ensure that no benchmark answers end up in the training data.
[17] Sheryl Estrada (2025) « MIT report: 95% of generative AI pilots at companies are failing » - Fortune, August 18, 2025
[18] Post on X from Sébastien Bubeck, OpenAI employee, August 20, 2025.
[19] « It's Not About Scale, It's About Abstraction », François Chollet, YouTube, October 12, 2024
[20] « How o3 and Grok 4 Accidentally Vindicated Neurosymbolic AI », Gary Marcus, Marcus On AI blog, July 13, 2025
[21] « Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI » - Lex Fridman Podcast #416 - March 7, 2024
[22] « Are GFlowNets the future of AI? », Edward Hu, YouTube, March 13, 2024


