xAI’s new large language model (LLM) Grok-3 was released with much fanfare this week, topping other state-of-the-art (SOTA) models on key benchmarks and in public user evaluations. We don’t know much about the model beyond what was declared during the public presentation by xAI founder Elon Musk and the Grok-3 team, and a blog post published shortly after. But what we do know has important implications for the AI industry.
What is Grok-3?
Grok-3 is a family of LLMs and large reasoning models (LRMs). There is no detail yet on the size or architecture of the models or how they were trained (I will update this post if and when xAI releases the paper on Grok-3).
The base Grok-3 model is a general-purpose LLM that matches the capabilities of leading models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0. It can perform tasks such as writing articles and code. It comes in two sizes: the full model and Grok-3 Mini (similar to the GPT, Claude, and Gemini families, which also come in multiple sizes).

Grok-3 Reasoning Beta is the reasoning version of the model, similar to OpenAI o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. It uses test-time compute to generate chain-of-thought (CoT) tokens, the equivalent of “thinking” to solve problems. Grok-3 Reasoning also has a Mini version that is more compute-efficient (though again, we have no details yet about the size and architecture of the models). Both models have shown competitive performance on key reasoning benchmarks such as MATH and AIME 2024.
The model also performs well on AIME 2025, which was released so recently that it could not have leaked into the model’s training data.
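To make the “test-time compute” idea concrete, here is a toy sketch of chain-of-thought prompting: the model is asked to emit its reasoning tokens before the final answer, and the answer is then extracted from the output. The `generate` callable is a stand-in for any LLM call, not an actual Grok or xAI API, and the stub model below exists only so the sketch runs.

```python
# Toy sketch of chain-of-thought (CoT) reasoning: spend extra output
# tokens "thinking" step by step, then extract the final answer.
# `generate` is a placeholder for any LLM call (hypothetical, not a
# real Grok API).
def solve_with_cot(question: str, generate) -> tuple[str, str]:
    """Ask for step-by-step reasoning, then split off the final answer."""
    cot_prompt = (
        f"Question: {question}\n"
        "Think step by step, then give the final answer after 'Answer:'."
    )
    output = generate(cot_prompt)
    # Everything before the last "Answer:" marker is the reasoning trace.
    reasoning, _, answer = output.rpartition("Answer:")
    return reasoning.strip(), answer.strip()

def stub_model(prompt: str) -> str:
    """Deterministic stand-in so the sketch runs without an API."""
    return "2 + 2 is 4, and 4 * 3 is 12. Answer: 12"

reasoning, answer = solve_with_cot("What is (2 + 2) * 3?", stub_model)
```

A real LRM produces the reasoning trace itself; the point of the sketch is only the structure: reasoning tokens first, answer extracted last, with more tokens spent per query than a plain completion would use.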

Where can you access Grok-3?
xAI has wrapped all the models in a polished interface on X and in a standalone Grok app released shortly after the announcement. The Grok app looks like other chatbot applications such as ChatGPT and Perplexity. Grok uses data from the web and X to answer your prompts with the latest information (the deep integration with X can give it a slight advantage over other models when it comes to up-to-date information).
When prompting Grok, you can enable “Think” mode to access the LRM version of the model. The app also has a “Big Brain” mode (not available yet), which would enable the model to use more compute cycles to improve its accuracy (similar to the difference between OpenAI’s o3-mini and o3-mini-high models).

Grok-3 also supports a “DeepSearch” mode, which is similar to the Deep Research feature in Gemini, ChatGPT, and Perplexity (Google came up with the name first, by the way). DeepSearch uses reasoning techniques to plan its research step by step, gather information from the web, and come up with a detailed answer. In a few minutes, DeepSearch does the equivalent of research that would normally take several hours. I would advise against trusting its answers right away, though. As with everything LLM-related, you still have to carefully check the answers and sources. Even so, it saves you a lot of time compared to doing the research yourself.
Is Grok-3 open source?
Grok-3 is not open source. You can only use it through the application or the soon-to-be-launched API. According to Musk, xAI will always open source the next-to-best version of its models, so we can expect the weights of Grok-2 to be released in a few months, once Grok-3 is finalized. Grok-3 itself will be open sourced when the next generation of the model is released.
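Since the Grok-3 API had not launched at the time of writing, any usage sketch is speculative. xAI’s existing API follows the OpenAI-compatible chat completions format, so a call would plausibly look like the following; the `grok-3` model name and the exact endpoint are assumptions, not confirmed details.

```python
# Hypothetical sketch of calling Grok through xAI's OpenAI-compatible
# chat completions endpoint. The model name "grok-3" and the URL are
# assumptions; check xAI's API docs once the Grok-3 API launches.
import json
import os
import urllib.request

XAI_API_URL = "https://api.x.ai/v1/chat/completions"  # assumed endpoint

def build_request(prompt: str, model: str = "grok-3") -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
    }

def ask_grok(prompt: str, api_key: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    payload = build_request(prompt)
    req = urllib.request.Request(
        XAI_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    key = os.environ.get("XAI_API_KEY")
    if key:  # only hit the network when a key is configured
        print(ask_grok("Summarize Grok-3 in one sentence.", key))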
Grok-3 Reasoning also doesn’t reveal its full chain-of-thought (CoT) tokens to prevent competitors from copying the model. Instead, it shows a detailed summary of the thought process, which makes it useful to users while still concealing the secret sauce (this is similar to what OpenAI has recently done with o3-mini).
Is Grok-3 better than other models?
On paper, Grok-3 beats or matches other SOTA models on key benchmarks. And the team did some impressive live demos during their presentation. A more telling data point is the Chatbot Arena leaderboard, where an anonymized Grok-3 outperformed all other models by a wide margin, earning an Elo score of 1,400, a sign that its benchmark results are not just the product of memorized training data. (Chatbot Arena is an online platform where users evaluate the answers of different models on the same prompt side by side without knowing which model generated each answer.)
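For readers unfamiliar with how those Arena scores arise, here is a toy sketch of the standard Elo update for pairwise comparisons. It is illustrative only, not Chatbot Arena’s exact methodology (the Arena fits ratings over many battles rather than updating one at a time), and the k-factor is an arbitrary choice.

```python
# Minimal sketch of the Elo rating update behind pairwise leaderboards:
# each head-to-head "battle" nudges both models' ratings toward the
# observed outcome. Parameters are illustrative, not Chatbot Arena's.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Updated ratings after one battle (score_a: 1 win, 0 loss, 0.5 tie)."""
    e_a = expected_score(r_a, r_b)
    score_b, e_b = 1.0 - score_a, 1.0 - e_a
    return r_a + k * (score_a - e_a), r_b + k * (score_b - e_b)

# A 1,400-rated model is expected to beat a 1,300-rated one ~64% of the time.
new_a, new_b = elo_update(1400.0, 1300.0, score_a=1.0)
```

Note that the update is zero-sum: whatever rating the winner gains, the loser drops, so a 100-point gap on the leaderboard reflects a consistent win rate across many user votes.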
Public reaction, however, has been mixed. Some users have found it to be very impressive, with examples showing the model creating fully working games with a single prompt. Others have pointed out that the model underperforms in many tasks in comparison to OpenAI o3-mini and Claude 3.5 Sonnet.
My experience is that each model has a learning curve that you have to master. You have to learn to prompt it and use it in the best way possible. Also, I never judge models based on single-shot prompts. I usually expect them to be more like an assistant, working with me step by step to solve a problem or accomplish a task. With the small tests that I have done so far, I think Grok-3 matches other cutting-edge models.
What makes Grok-3 special?
xAI launched less than two years ago. With Grok-3, it has managed to bring itself into the conversation among the leading model makers. This is largely due to the work ethic of the team and its ability to speedily set up Colossus, the biggest compute cluster in the world, with 200,000 Nvidia GPUs.
When DeepSeek-R1 was launched, there was a lot of discussion about massive spending on AI accelerators being overrated and wasteful. xAI and Grok-3 made compute scaling popular again, showing that with a huge pile of cash and a good team of bright, hard-working engineers, you can ship one of the most advanced AI models in record time. (Musk has hinted that much more than scaling was behind the success of Grok-3; hopefully we’ll learn more when further details are published.)