Hands on
How much can reinforcement learning – and a bit of extra verification – improve large language models, aka LLMs? Alibaba’s Qwen team aims to find out with its latest release, QwQ.
Alibaba touts its comparatively compact 32-billion-parameter “reasoning” model, which has a fraction of DeepSeek R1’s claimed 671 billion parameters, as outperforming R1 in select math, coding, and function-calling benchmarks.
Much like R1, the Qwen team fine-tuned QwQ using reinforcement learning to improve its chain-of-thought reasoning for problem analysis and breakdown. This approach typically reinforces stepwise reasoning by rewarding models for correct answers, encouraging more accurate responses. However, for QwQ, the team also integrated a so-called accuracy verifier and a code execution server to ensure rewards were given only for correct math solutions and functional code.
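To make the idea concrete, here is a minimal sketch of outcome-based rewards of the kind described above. It is not the Qwen team’s code, which the source does not show; the function names `math_reward` and `code_reward` are hypothetical, and a real training setup would run untrusted code in a proper sandbox rather than a bare subprocess.

```python
import subprocess
import sys
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 only if the answer equals the reference exactly (assumed rule)."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn nothing
    return 1.0 if same else 0.0


def code_reward(source: str, tests: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 only if the generated code passes every test assertion."""
    program = source + "\n" + tests
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hung or too-slow code earns nothing
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    print(math_reward("0.5", "1/2"))
    print(code_reward("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))
```

The design point is that the reward is binary and tied to a check the model cannot talk its way around: the answer either matches, and the code either runs and passes, or it doesn’t.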
The result, the Qwen team claims, is a model that punches far above its weight class, achieving performance on par with and, in some cases, edging out far larger models.
Here’s how Alibaba claims QwQ stacks up against the competition in benchmarks
However, AI benchmarks aren’t always what they seem to be. So, let’s take a look at how these claims hold up in the real world, and then we’ll show you how to get QwQ up and running so you can test it out for yourself.
How does it stack up?
We ran QwQ through a slate of test prompts ranging from general knowledge to spatial reasoning, problem solving, mathematics, and other questions known to trip up even the best LLMs.
Because the full model requires substantial memory, we ran our tests in two configurations to cater to those of you who have a lot of memory and those of you who don’t. First, we evaluated the full model using the QwQ demo on Hugging Face. Then, we tested a 4-bit quantized version on a 24 GB GPU (Nvidia RTX 3090 or AMD Radeon RX 7900 XTX) to assess the impact of quantization on accuracy.
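As a rough back-of-the-envelope check on why quantization matters here, the sketch below estimates the memory needed for the weights alone. It assumes a round 32 billion parameters and exactly 16 or 4 bits per weight; real quantized formats carry some extra overhead, and the key-value cache and activations need memory on top of this. The helper name `weight_memory_gb` is our own.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 32e9  # assumed round figure for a 32-billion-parameter model

print(weight_memory_gb(PARAMS, 16))  # 16-bit weights -> 64.0
print(weight_memory_gb(PARAMS, 4))   # 4-bit weights  -> 16.0
```

By this estimate, 16-bit weights are far too large for a 24 GB card, while 4-bit weights leave a few gigabytes of headroom for context.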
For most general knowledge questions, we found that QwQ performed similarly to DeepSeek’s 671-billion-parameter R1 and other reasoning models like OpenAI’s o3-mini, spending a few seconds to compose its thoughts before spitting out the answer to the query.
Where the model stands out, perhaps unsurprisingly, is when it’s tasked with solving more complex logic, coding, or mathematics challenges, so we’ll focus on those before addressing some of its weak points.
Spatial reasoning
For fun, we decided to start with a relatively new spatial-reasoning test developed by the folks at Homebrew Research as part of their AlphaMaze project.
QwQ was able to solve all three AlphaMaze tests without any issues
The test, illustrated above, presents the model with a maze in the form of a text prompt, like the one below. The model’s objective is then to navigate from the origin “O” to the target “T.”
You are a helpful assistant that solves mazes. You will be given a maze represented by a series of tokens.
The tokens represent:
– Coordinates: <|row-col|> (e.g., <|0-0|>, <|2-4|>)
– Walls: <|no_wall|>, <|up_wall|>, <|down_wall|>, <|left_wall|>, <|right_wall|>, <|up_down_wall|>, etc.
– Origin: <|origin|>
– Target: <|target|>
– Movement: <|up|>, <|down|>, <|left|>, <|right|>, <|blank|>
Your task is to output the sequence of movements (<|up|>, <|down|>, <|left|>, <|right|>) required to navigate from the origin to the target, based on the provided maze representation. Think step by step. At each step, predict only the next movement token. Output only the move tokens, separated by spaces.
MAZE:
<|0-0|><|up_left_right_wall|><|blank|><|0-1|><|up_down_left_wall|><|blank|><|0-2|><|up_down_wall|><|blank|><|0-3|><|up_wall|><|blank|><|0-4|><|up_right_wall|><|blank|>
<|1-0|><|down_left_wall|><|blank|><|1-1|><|up_right_wall|><|blank|><|1-2|><|up_left_wall|><|blank|><|1-3|><|down_right_wall|><|target|><|1-4|><|down_left_right_wall|><|blank|>
<|2-0|><|up_left_right_wall|><|blank|><|2-1|><|left_right_wall|><|blank|><|2-2|><|down_left_wall|><|blank|><|2-3|><|up_down_wall|><|blank|><|2-4|><|up_right_wall|><|blank|>
<|3-0|><|left_right_wall|><|blank|><|3-1|><|down_left_wall|><|origin|><|3-2|><|up_down_wall|><|blank|><|3-3|><|up_down_wall|><|blank|><|3-4|><|right_wall|><|blank|>
<|4-0|><|down_left_wall|><|blank|><|4-1|><|up_down_wall|><|blank|><|4-2|><|up_down_wall|><|blank|><|4-3|><|up_down_wall|><|blank|><|4-4|><|down_right_wall|><|blank|>
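If you’d like to check a model’s answer to this maze yourself, the following Python sketch parses the prompt and finds a route with a breadth-first search. The token layout is inferred from the prompt above, and the names `parse_maze`, `solve`, and `check` are our own rather than anything from AlphaMaze. It also assumes wall tokens are consistent between neighboring cells, so only the current cell’s walls are consulted.

```python
import re
from collections import deque

MAZE = (
    "<|0-0|><|up_left_right_wall|><|blank|><|0-1|><|up_down_left_wall|><|blank|><|0-2|><|up_down_wall|><|blank|><|0-3|><|up_wall|><|blank|><|0-4|><|up_right_wall|><|blank|>"
    "<|1-0|><|down_left_wall|><|blank|><|1-1|><|up_right_wall|><|blank|><|1-2|><|up_left_wall|><|blank|><|1-3|><|down_right_wall|><|target|><|1-4|><|down_left_right_wall|><|blank|>"
    "<|2-0|><|up_left_right_wall|><|blank|><|2-1|><|left_right_wall|><|blank|><|2-2|><|down_left_wall|><|blank|><|2-3|><|up_down_wall|><|blank|><|2-4|><|up_right_wall|><|blank|>"
    "<|3-0|><|left_right_wall|><|blank|><|3-1|><|down_left_wall|><|origin|><|3-2|><|up_down_wall|><|blank|><|3-3|><|up_down_wall|><|blank|><|3-4|><|right_wall|><|blank|>"
    "<|4-0|><|down_left_wall|><|blank|><|4-1|><|up_down_wall|><|blank|><|4-2|><|up_down_wall|><|blank|><|4-3|><|up_down_wall|><|blank|><|4-4|><|down_right_wall|><|blank|>"
)

# Assumed cell layout: <|row-col|><|..._wall|><|blank / origin / target|>
CELL = re.compile(r"<\|(\d+)-(\d+)\|><\|([a-z_]+)\|><\|(blank|origin|target)\|>")
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def parse_maze(text):
    """Return ({cell: set of walled sides}, origin cell, target cell)."""
    walls, origin, target = {}, None, None
    for row, col, wall, marker in CELL.findall(text):
        pos = (int(row), int(col))
        # "up_left_wall" -> {"up", "left"}; "no_wall" -> empty set
        walls[pos] = set() if wall == "no_wall" else set(wall.split("_")[:-1])
        if marker == "origin":
            origin = pos
        elif marker == "target":
            target = pos
    return walls, origin, target


def solve(walls, origin, target):
    """Breadth-first search; returns the shortest list of moves, or None."""
    queue, seen = deque([(origin, [])]), {origin}
    while queue:
        pos, path = queue.popleft()
        if pos == target:
            return path
        for name, (dr, dc) in MOVES.items():
            nxt = (pos[0] + dr, pos[1] + dc)
            if name in walls[pos] or nxt not in walls or nxt in seen:
                continue
            seen.add(nxt)
            queue.append((nxt, path + [name]))
    return None


def check(walls, origin, target, answer):
    """Replay a model's move tokens and report whether they reach the target."""
    pos = origin
    for name in re.findall(r"<\|(up|down|left|right)\|>", answer):
        dr, dc = MOVES[name]
        nxt = (pos[0] + dr, pos[1] + dc)
        if name in walls[pos] or nxt not in walls:
            return False  # walked into a wall or off the grid
        pos = nxt
    return pos == target


walls, origin, target = parse_maze(MAZE)
print(" ".join(f"<|{m}|>" for m in solve(walls, origin, target)))
# <|right|> <|right|> <|right|> <|up|> <|left|> <|left|> <|up|> <|right|>
```

A conventional search solves this instantly, of course. The point of the test is whether a language model can track the walls and its own position through text alone.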
Both our locally hosted QwQ instance and the full-sized model were able to solve these puzzles successfully every time, though each run did take a few minutes to finish.
The same couldn’t be said of DeepSeek’s R1 and its 32B distill. Both models were able to solve the first maze, but R1 struggled to complete the second, while the 32B distill solved it correctly nine times out of ten. This level of variation isn’t too surprising considering R1 and the distill use completely different base models.
While QwQ outperformed DeepSeek in this test, we did observe some strange behavior with our 4-bit model, which initially required nearly twice as many “thought” tokens to complete the test. At first, it looked as though this may be due to quantization-related losses – a challenge we explored here. But, as it turned out, the quantized model was just broken out of the box. After we adjusted the hyperparameters – don’t worry, we’ll show you how to fix those in a bit – and ran the tests again, the problem disappeared.
A one-shot code champ?
Since its launch, QwQ has garnered a lot of interest from netizens curious as to whether the model can generate usable code on the first attempt in a so-called one-shot test. And this particular challenge certainly seems to be a bright spot for the model.
We asked the model to recreate a number of relatively simple games, namely Pong, Breakout, Asteroids, and Flappy Bird, in Python using the pygame library.
Pong and Breakout weren’t much of a challenge for QwQ. After a few minutes of work, the model spat out working versions of each.
In our testing, QwQ was able to recreate classic arcade games like Breakout in a single shot with relative ease
Tasked with recreating Asteroids, however, QwQ fell on its face. While the code ran, both the graphics and game mechanics were frequently distorted and buggy. By comparison, on its first attempt, R1 faithfully recreated the classic arcade shooter.
On the left, QwQ’s recreation of Asteroids, and on the right, DeepSeek-R1’s
Some folks have even managed to get R1 and QwQ to one-shot code a minimalist version of Flappy Bird, which we can confirm also worked without issue. If you’re interested, you can find the prompt we tested here.
It has occurred to us that these models were trained on a huge set of openly available source code, which no doubt included reproductions of classic games. Aren’t the models therefore just remembering what they learned during training rather than independently figuring out game mechanics from scratch? That’s the whole illusion of these massive neural networks.
Here you can see the minimalist version QwQ wrote in Python in a single shot
At least when it comes to recreating classic arcade games, QwQ performs well beyond what its parameter count might suggest, even if it can’t match R1 in every test. To borrow a phrase from the automotive world, there’s no replacement for displacement. This might explain why Alibaba isn’t stopping with QwQ 32B and has a “Max” version in the works.




