On Monday, OpenAI announced its newest family of large language models (LLMs), GPT-4.1. The model is aimed at developers and will not be available to consumers. And, to say the least, it has confused many of the people who follow OpenAI’s progress and build applications on top of its models.
Here is what we know (and don’t know) about GPT-4.1 and its peculiar release.
What is GPT-4.1?
GPT-4.1 is a family of LLMs that comes in three sizes: GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano. Unlike OpenAI’s o1 and o3 models, GPT-4.1 is not a reasoning model. The models are reportedly an upgraded version of OpenAI’s flagship GPT-4o. (There are no details on the size and architecture of the models, but from the pricing, we can assume that GPT-4.1 is a bit smaller than GPT-4o and that GPT-4.1 mini is around twice the size of GPT-4o mini. GPT-4.1 nano is the smallest, with pricing comparable to Gemini 2.0 Flash.)

The context window of GPT-4.1 is 1 million tokens, extended from 128K tokens in GPT-4o, bringing it on par with Google’s Gemini models. And the output token limit has been extended from 16,384 in GPT-4o to 32,768 in GPT-4.1 (still below Gemini 2.5 Pro’s 65,536 output token limit).
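With a window this large, it becomes the developer’s job to make sure a prompt actually fits before sending it. The sketch below shows one rough way to do that with the tiktoken library. Note the assumptions: OpenAI has not published GPT-4.1’s exact tokenizer, so this uses o200k_base (GPT-4o’s encoding) as an approximation, and the input file name is hypothetical.

```python
# Rough token-budget check before sending a long prompt to GPT-4.1.
# Assumption: GPT-4.1 tokenizes like GPT-4o (o200k_base); OpenAI has not
# confirmed the tokenizer, so treat these counts as an approximation.
import tiktoken

CONTEXT_WINDOW = 1_000_000   # GPT-4.1 context window (per OpenAI)
MAX_OUTPUT = 32_768          # GPT-4.1 output token limit (per OpenAI)

enc = tiktoken.get_encoding("o200k_base")

def fits_in_context(prompt: str, reserved_output: int = MAX_OUTPUT) -> bool:
    """True if the prompt plus reserved output tokens fit in the window."""
    n_input = len(enc.encode(prompt))
    return n_input + reserved_output <= CONTEXT_WINDOW

with open("big_codebase_dump.txt") as f:  # hypothetical long input file
    print(fits_in_context(f.read()))
```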
How well does the model perform on long-context tasks? According to OpenAI, the model shows excellent performance on the “needle-in-a-haystack” test across its entire context window, in which the model is given a very long sequence and asked to retrieve a very specific bit of information buried inside it.
To be clear, needle-in-a-haystack is not an accurate measure of a model’s performance on long-context tasks. So OpenAI introduced two more benchmarks, OpenAI-MRCR (Multi-Round Coreference) and Graphwalks. MRCR tests the model’s ability to find and disambiguate between multiple needles hidden in the context, while Graphwalks evaluates multi-hop reasoning over long-context sequences. According to OpenAI, GPT-4.1 outperforms GPT-4o on both benchmarks.
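To make the idea concrete, here is a toy needle-in-a-haystack harness. This is purely illustrative and not OpenAI’s actual benchmark: the filler text, needle, and scoring are made up, and the haystack here is far shorter than the 1M-token window. The model name matches OpenAI’s announcement, and the call uses the official openai Python SDK.

```python
# Toy needle-in-a-haystack trial: bury a fact in filler text and check
# whether the model can retrieve it. Illustrative only, not OpenAI's
# benchmark; filler, needle, and scoring are invented for this sketch.
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FILLER = "The quick brown fox jumps over the lazy dog. " * 5_000
NEEDLE = "The secret passphrase is 'cobalt-427'."

def run_trial() -> bool:
    # Insert the needle at a random position inside the filler text.
    pos = random.randint(0, len(FILLER))
    haystack = FILLER[:pos] + NEEDLE + FILLER[pos:]
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": haystack + "\n\nWhat is the secret passphrase?",
        }],
    )
    return "cobalt-427" in (resp.choices[0].message.content or "")

print(sum(run_trial() for _ in range(10)), "/ 10 retrievals succeeded")
```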
(Oddly, OpenAI did not release results on Nvidia’s RULER benchmark, which is something of a gold standard for long-context reasoning. More broadly, the frustrating part of OpenAI’s release is that it only compares GPT-4.1 to existing OpenAI models, and you have to do your own research to compare it with other leading non-reasoning models such as DeepSeek-V3.1, Claude 3.7 Sonnet, and Grok 3.)
OpenAI publicizes GPT-4.1 as a superior model for instruction-following and coding, and as useful for agentic applications. Experiments show that GPT-4.1 is very good at writing new code, modifying existing code, rewriting entire files, and producing good frontend web code. With the long context window, expanded output token limit, and good long-context reasoning abilities, GPT-4.1 can be a good choice for challenging coding tasks on large codebases spanning thousands of lines of code.
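The whole-file rewrite case is the one where the larger output limit matters most. Below is a minimal sketch of that workflow, again using the openai Python SDK: the file path, system prompt, and task are hypothetical, and max_tokens is simply set to the 32,768 ceiling mentioned above.

```python
# Sketch: send an entire source file to GPT-4.1 and ask for the complete
# rewritten file back. File path and prompts are hypothetical examples.
from openai import OpenAI

client = OpenAI()

with open("app/server.py") as f:  # hypothetical file in a large codebase
    source = f.read()

resp = client.chat.completions.create(
    model="gpt-4.1",
    max_tokens=32_768,  # GPT-4.1's expanded output token limit
    messages=[
        {"role": "system",
         "content": "You are a code assistant. Return the complete "
                    "rewritten file and nothing else."},
        {"role": "user",
         "content": f"Add type hints to every function in this file:\n\n{source}"},
    ],
)
print(resp.choices[0].message.content)
```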
A confusing release, to say the least
If the release of GPT-4.5 was disappointing, GPT-4.1 must be OpenAI’s most confusing model release. First, after the release of GPT-4.5, OpenAI CEO Sam Altman said that the company would not release non-reasoning models moving forward. Here we are, two months later, and OpenAI has not only released a non-reasoning model, but one with a lower version number than GPT-4.5.
And guess what: GPT-4.5 will be discontinued soon. According to OpenAI, it will begin “deprecating GPT‑4.5 Preview in the API, as GPT‑4.1 offers improved or similar performance on many key capabilities at much lower cost and latency.”
GPT‑4.5 Preview will be turned off on July 14, 2025. “GPT‑4.5 was introduced as a research preview to explore and experiment with a large, compute-intensive model, and we’ve learned a lot from developer feedback,” according to OpenAI. “We’ll continue to carry forward the creativity, writing quality, humor, and nuance you told us you appreciate in GPT‑4.5 into future API models.” That’s a long way of saying GPT-4.5 was burning a lot of cash and didn’t have a viable business model.
Things get even more confusing. OpenAI has announced that GPT-4.1 will only be accessible through the OpenAI API and will not be available through the ChatGPT application, which is the lowest-barrier means to experience new models. In addition to the API, the model has already been added to AI-powered coding applications and assistants like Windsurf, Cursor, and GitHub Copilot. (Windsurf is providing free access to GPT-4.1 until April 21, 2025.)
It is also not clear where GPT-4.1 stands in the lineup of leading models. Users who have compared GPT-4.1 against other models on key benchmarks (something OpenAI did not do, to the frustration of developers) report that it doesn’t move the Pareto frontier of cost vs. performance. “No one needs a not-so-smart model (GPT-4.1 mini), nor an overpriced dumb one (GPT-4.1),” one user posted on X. “You either want a super smart model, or the best cheap model. Gemini 2.5 Pro or Gemini 2.0 Flash.”
Nonetheless, this could still prove to be a significant model and release. At the very least, it is unconventional by LLM standards, and unconventional releases sometimes lead to surprising discoveries and progress. OpenAI has released a prompting guide explaining how to get the best results from GPT-4.1, and early users have reported promising outcomes. It will be interesting to see what comes out of this model.