AI Haven

llama.cpp finally adds real reasoning budget control for hybrid models

llama.cpp's new sampler mechanism gives real control over reasoning tokens in hybrid models like Qwen3 and DeepSeek-R1, but benchmarking shows poorly tuned budgets can drop HumanEval scores from 94% to 78%.

March 12, 2026


The llama.cpp project has shipped a significant update to its reasoning budget system, giving users genuine control over how much "thinking" their hybrid reasoning models like Qwen3 and DeepSeek-R1 can do. Until now, the `--reasoning-budget` flag was essentially a stub that only disabled thinking when set to zero.

The new implementation, merged via commit acb7c790698fa28a0fbfc0468804926815b94de3, introduces a proper sampler mechanism that counts reasoning tokens during generation and forces termination when the specified budget is reached.
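The core idea can be sketched in a few lines. This is a minimal illustration of a budget-enforcing generation loop, not llama.cpp's actual C++ sampler: count tokens emitted inside the reasoning span and force the closing marker once the budget is spent. The `<think>`/`</think>` markers and the `model_step()` callback are illustrative assumptions.

```python
# Sketch of a budget-enforcing sampler loop (illustrative, not llama.cpp code).
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def generate_with_budget(model_step, budget, max_tokens):
    out, in_reasoning, spent = [], False, 0
    for _ in range(max_tokens):
        tok = model_step()           # hypothetical callback: next token as a string
        out.append(tok)
        if tok == THINK_OPEN:
            in_reasoning = True
        elif tok == THINK_CLOSE:
            in_reasoning = False
        elif in_reasoning:
            spent += 1               # only reasoning tokens count against the budget
            if spent >= budget:
                # Budget exhausted: force the reasoning span to terminate.
                # (In the real sampler the forced token is fed back into the
                # model's context so subsequent tokens form the final answer.)
                out.append(THINK_CLOSE)
                in_reasoning = False
    return "".join(out)
```

The key design point is that only tokens between the think markers are counted, so the budget constrains reasoning without limiting the length of the final answer.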

Why this matters

Hybrid reasoning models like DeepSeek-R1 and Qwen3 emit a span of tokens marked as internal reasoning, wrapped in special marker tokens, before producing their final answer. This chain-of-thought process improves performance on math and logic tasks but consumes significantly more compute.
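Concretely, a raw completion interleaves the reasoning span and the answer, and a client typically separates the two. The `<think>` markers below reflect the Qwen3/DeepSeek-R1 convention; the helper itself is a sketch, not part of llama.cpp.

```python
import re

def split_reasoning(completion):
    """Split a raw completion into (reasoning, answer) on <think> markers."""
    m = re.search(r"<think>(.*?)</think>(.*)", completion, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", completion.strip()  # no reasoning span: whole text is the answer

reasoning, answer = split_reasoning(
    "<think>12 * 7: 10*7=70, 2*7=14, total 84.</think>The answer is 84."
)
```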

Previously, users had no granular control. You could either have full reasoning or disable it entirely. The new system allows developers to tune the reasoning budget based on task complexity — a quick code snippet might need only 32 tokens of thinking, while a complex proof might benefit from 512.
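A per-task tuning policy could be as simple as a lookup table. The flag name below comes from the article; the tiers and token values are illustrative assumptions, not recommendations.

```python
# Hypothetical tuning table mapping task complexity to a reasoning budget.
BUDGET_TIERS = {"snippet": 32, "function": 128, "proof": 512}

def budget_flag(task, default=128):
    """Build the llama.cpp flag string for a given task type."""
    return f"--reasoning-budget {BUDGET_TIERS.get(task, default)}"
```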

Benchmark impact

The maintainer tested the feature on Qwen3 9B using HumanEval, a standard coding benchmark. The results reveal why this needs careful tuning: the reasoning-enabled version scored 94%, the non-thinking version scored 88%, but enforcing a tight reasoning budget without proper prompting cratered performance to just 78%.

To address this, the update includes `--reasoning-budget-message`, which inserts a transition message before reasoning terminates to help the model produce coherent outputs.
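The transition-message idea can be illustrated as follows. The behavior is inferred from the article's description, not taken from llama.cpp's implementation, and the default message text here is an assumption: when the budget runs out, a short wrap-up message is injected before the forced closing marker, so the model exits reasoning mode coherently instead of being cut off mid-thought.

```python
# Sketch of transition-message injection (illustrative; message text assumed).
DEFAULT_MESSAGE = "\nConsidering the limited time, I should answer now.\n"

def terminate_reasoning(context, message=DEFAULT_MESSAGE):
    # The message and the closing marker are appended to the model's context,
    # so the next tokens it generates belong to the final answer.
    return context + message + "</think>"
```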

Current limitations

The feature isn't perfect. Some users report that `--reasoning-budget 0` still fails to fully disable thinking on certain models like GLM-4.5-Air after a recent refactor (commit b7756). The reasoning budget also behaves differently depending on whether you're using Jinja chat templates.

For users running quantized models on consumer hardware, the ability to dial down reasoning tokens could make the difference between a usable local model and one that grinds to a halt.

Source: LocalLLaMA (Reddit)