
What to know about o3 and o4-mini, OpenAI’s new reasoning models

Image created with Microsoft Copilot

On the heels of the confusing GPT-4.1 launch, OpenAI has released o3 and o4-mini, its newest reasoning models. These models are smarter and more capable than their predecessors and represent a step change in the ability of language models to solve complex problems.

Here is what we know about o3 and o4-mini.

What are o3 and o4-mini?

OpenAI o3 and o4-mini are reasoning models, which means they have been designed to “think” before answering a prompt, making them suitable for tasks such as math and coding (and any task that requires logical reasoning and planning).

As with other OpenAI models, there are no details about the models' architecture or training data, aside from the fact that reinforcement learning played an important role in developing their reasoning abilities.

The key advantage of o3 is its ability to use tools, including executing code, retrieving files, and searching the web. Tool use is a critical part of modern LLM applications, especially agentic systems. One approach is to have an external process (e.g., another model or a rule-based system) determine when and how to invoke tools such as information retrieval. In contrast, modern reasoning models such as o3 have been trained to support tool use natively.

This means that during its reasoning process, the model organically determines when it needs to use a tool (e.g., run a web search or execute the code it generated) and generates the tokens required for the enclosing application to activate that tool. This enables these models to perform much more complex tasks, such as running multiple search queries as they gradually solve a problem and gather new information.
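To make the pattern concrete, here is a minimal sketch of the application side of that loop, using the Chat Completions interface of OpenAI's Python SDK. The get_weather tool and its schema are hypothetical stand-ins for whatever tools your application exposes, and the exact model name available in the API is an assumption:

import json
from openai import OpenAI

client = OpenAI()

# A hypothetical tool that the application exposes to the model.
def get_weather(city: str) -> str:
    return f"Sunny and 22°C in {city}"  # stubbed out for illustration

# JSON schema telling the model when and how the tool can be called.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "Should I pack an umbrella for Paris?"}]

# The loop: let the model reason, execute any tool call it emits,
# feed the result back, and repeat until it produces a final answer.
while True:
    response = client.chat.completions.create(
        model="o3",  # assumed API model name
        messages=messages,
        tools=tools,
    )
    message = response.choices[0].message
    if not message.tool_calls:
        print(message.content)  # final answer
        break
    messages.append(message)
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)  # the application, not the model, runs the tool
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

The key point is that the model only emits structured tokens requesting the tool; the enclosing application actually executes it and feeds the result back.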

You can see this in action when you look at o3's reasoning chain, though what is displayed is, unfortunately, a summary rather than the full reasoning tokens. o3 outperforms its predecessors on key coding and reasoning benchmarks, including Codeforces, SWE-bench, and MMMU.

o3 has also been enhanced in image reasoning, which makes it great at tasks that require multimodal processing, such as analyzing images, charts, and graphics. o3 now replaces o1 in ChatGPT, both because it is more capable and because it is 33% less expensive to run. o3-pro is coming soon.
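As a rough illustration of that multimodal capability through the API, the sketch below sends a chart alongside a text prompt; the image URL is a placeholder and the model name is an assumption:

from openai import OpenAI

client = OpenAI()

# Ask the model to reason over an image, e.g., a chart.
response = client.chat.completions.create(
    model="o3",  # assumed API model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            # Placeholder URL; point this at a real image.
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)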

o4-mini is OpenAI's low-cost, fast reasoning model. It outperforms o3-mini on key reasoning benchmarks (and hints that the full o4 will arrive in a few months).

According to OpenAI, “Compared to previous iterations of our reasoning models, these two models should also feel more natural and conversational, especially as they reference memory and past conversations to make responses more personalized and relevant.”

How good are o3 and o4-mini?

You can currently access o3 and o4-mini through ChatGPT or the OpenAI API. Users who have tested them have generally given very positive feedback, with some describing o3 as a genius-level AI.
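For reference, a minimal API call looks something like the sketch below. The reasoning_effort parameter that OpenAI exposes for its o-series reasoning models trades latency and cost against thinking depth; the exact model name is an assumption:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o4-mini",  # assumed API model name
    reasoning_effort="high",  # "low" | "medium" | "high"
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
)
print(response.choices[0].message.content)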

In my own experiments, I found o3 to be very good at reasoning, around the level of Gemini 2.5 Pro. It shows impressive results on image analysis, and it performs very well on my SVG challenge (I give it a technical article and ask it to create an SVG that depicts the technique described in the article). Although it doesn't get the best results on the first try, it is quite good at executing follow-up instructions and making corrections to the image.

But like all LLMs, it is not perfect. It can hallucinate tools and results, and it can make trivial mistakes even as it solves complicated problems. And for some weird reason, OpenAI decided to include the user's name in its reasoning chain.

One of the most impressive aspects of the release was the team's finding that scaling reinforcement learning continues to yield gains, which means we can expect reasoning models to keep improving for the time being.
