
Alibaba’s QwQ-32B reasoning model matches DeepSeek-R1, outperforms OpenAI o1-mini

Image credit: 123RF (with modifications)

Alibaba’s Qwen team has just released QwQ-32B (Qwen with Questions), a large reasoning model (LRM) that matches the performance of leading models such as DeepSeek-R1 and OpenAI o1-mini.

Alibaba released QwQ-32B-preview in November. The new version has better performance on key benchmarks while being more memory and compute efficient than other open source contenders. This new release comes against the backdrop of an accelerating race between large tech companies and AI labs to dominate the market for LRMs.

What is QwQ-32B?

QwQ-32B is a reasoning model, which means it has been trained to perform better on tasks such as math and coding that require careful thinking and step-by-step problem solving. It has 32 billion parameters, which makes it more than an order of magnitude smaller than the main version of DeepSeek-R1, which has 671 billion parameters. (However, it is worth noting that R1 is a mixture-of-experts (MoE) model, which means only a subset of its parameters is activated at any given time.)

QwQ-32B-preview had a 32,000-token context window. According to the model card for QwQ-32B, its context length has been expanded to 131,072 tokens, which is similar to other reasoning models such as Claude 3.7 Sonnet and Gemini 2.0 Flash Thinking.
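
If you want to verify the expanded context window yourself, it can be read from the model configuration. The snippet below is a minimal sketch that assumes the Hugging Face model id Qwen/QwQ-32B and that the advertised limit is exposed through the config’s max_position_embeddings field:

```python
# Minimal sketch: read the advertised context length from the model config.
# Assumption: the model id is "Qwen/QwQ-32B" and the expanded window is
# reflected in max_position_embeddings (check the model card if it is not).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/QwQ-32B")
print(config.max_position_embeddings)  # expected to report the ~131K token limit
```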

On math and coding benchmarks, QwQ-32B nearly matches the performance of DeepSeek-R1-671B while beating o1-mini and distilled versions of R1 (though it underperforms o1, o3-mini, Claude 3.7 Sonnet, and Grok-3, which are not included in the chart published by the Qwen Team).

QwQ-32B’s training recipe

QwQ-32B is based on Qwen-2.5-32B, the Qwen Team’s frontier general-purpose large language model (LLM). The team trained the base model with reinforcement learning (RL) using “outcome-based rewards.” This means the model was left to reason by itself and produce a result, which was then checked with a verifier such as a code interpreter or a math solver. The model then reviewed and reformulated its response until it got the right answer.
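
The Qwen Team has not published its verifier code, but the idea of an outcome-based reward can be sketched as follows. The coding_reward and math_reward functions below are hypothetical stand-ins for a code interpreter and a math solver: the reward depends only on whether the final result is correct, not on the reasoning steps that produced it.

```python
import subprocess

def coding_reward(generated_code: str, test_script: str) -> float:
    """Outcome-based reward for a coding task: run the model's code against
    unit tests in a subprocess (a real pipeline would sandbox this) and
    reward only a passing result."""
    try:
        result = subprocess.run(
            ["python", "-c", generated_code + "\n" + test_script],
            capture_output=True,
            timeout=10,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Outcome-based reward for a math task: compare only the final answer
    with the reference solution; intermediate steps are never scored."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```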

This stage of the training was limited to math and coding tasks. It is the same approach used in DeepSeek-R1-Zero, which discarded the conventional approach of using reward models that score not only the result but also the reasoning process.

This is a counterintuitive approach because in reasoning tasks, rewards are “sparse”: the model can try countless ways to approach a problem, and very few of them lead to the right answer. Traditionally, models required human-designed examples to learn correct reasoning paths. However, given the strong knowledge that today’s models acquire through pre-training, the pure RL approach works. The model is able to use its internal knowledge to correct its reasoning and find the right solution. And with the reward signal from the verifier, it gradually converges not only on the correct solution but also on the optimal reasoning path.
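
Put schematically, the training loop relies on nothing but this sparse end-of-episode signal. The toy example below is a deliberately simplified, bandit-style analogy rather than the Qwen Team’s pipeline: the “policy” picks among hypothetical reasoning strategies, and only the one that leads to a verified correct answer ever earns a reward, yet the policy still converges on it.

```python
import random

# Toy analogy for sparse, outcome-only rewards (not the actual training code).
strategies = ["guess", "brute_force", "algebraic"]  # hypothetical reasoning paths
weights = {s: 1.0 for s in strategies}

def verifier(strategy: str) -> float:
    # Outcome check: only a correct final answer earns a reward.
    return 1.0 if strategy == "algebraic" else 0.0

for step in range(200):
    pick = random.choices(strategies, weights=[weights[s] for s in strategies])[0]
    reward = verifier(pick)        # sparse signal: 1.0 or 0.0, nothing in between
    weights[pick] += reward        # reinforce whole trajectories that ended correctly

print(max(weights, key=weights.get))  # converges to the strategy the verifier accepts
```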

According to the company’s blog, they have “integrated agent-related capabilities into the reasoning model, enabling it to think critically while utilizing tools and adapting its reasoning based on environmental feedback.” 

One of the interesting aspects of QwQ-32B is its two-stage RL training. After the first stage, the team performed another round of RL focused on general capabilities. In this stage, the model was trained with general reward models and rule-based verifiers that involved more manual engineering.
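
The blog post does not spell out how these signals are combined, but conceptually the second-stage reward blends a learned preference score with hand-engineered checks. The sketch below is an assumption about what such a combined reward could look like; combined_reward, preference_model, and rule_checks are hypothetical names rather than anything Qwen has published.

```python
from typing import Callable

def combined_reward(
    prompt: str,
    response: str,
    preference_model: Callable[[str, str], float],  # learned general reward model
    rule_checks: list[Callable[[str, str], bool]],  # rule-based verifiers (format, instruction following, ...)
    rule_weight: float = 0.5,
) -> float:
    """Hypothetical stage-two reward: mix a general preference score with
    rule-based verifiers, as the blog's high-level description suggests."""
    preference_score = preference_model(prompt, response)
    rule_score = sum(check(prompt, response) for check in rule_checks) / max(len(rule_checks), 1)
    return (1 - rule_weight) * preference_score + rule_weight * rule_score
```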

“We find that this stage of RL training with a small amount of steps can increase the performance of other general capabilities, such as instruction following, alignment with human preference, and agent performance, without significant performance drop in math and coding,” the researchers write. This is evident in the model’s performance on benchmarks such as LiveBench and IFEval.

This is important because it is a scalable process, enabling the model to learn reasoning mostly on its own and requiring very little human guidance to improve its performance and align its output with human preferences.

How to access QwQ-32B

QwQ-32B on Hugging Face Spaces

Unlike OpenAI o1 and o3, QwQ-32B is an open model, which means you can download it and run it on your own servers. It is available on Hugging Face and ModelScope (the Chinese equivalent of Hugging Face). It has also been released under the Apache 2.0 license, which means it can be used for commercial purposes. (It is worth noting that Qwen has not released the source code and data used to train the model, so it is not a fully open-source model.)
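
As a starting point for running the model locally, it can be loaded with the Hugging Face Transformers library. The snippet below is a minimal sketch that assumes the model id Qwen/QwQ-32B and enough GPU memory for a 32-billion-parameter checkpoint; check the model card for the recommended generation settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"  # assumed Hugging Face model id; see the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many r's are in the word 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit a long chain of thought before the final answer,
# so allow a generous token budget.
outputs = model.generate(inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```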

There is a hosted version of QwQ-32B on Hugging Face Spaces, where you can run experiments and test the model’s reasoning abilities. You can also access the model on Qwen Chat, the ChatGPT equivalent for Qwen models. (Be careful about entering sensitive information into hosted versions of the model.)
