I've been busy working on some new ranking/position methodologies and I'm excited to start sharing some results.
Plot legends:
- X = truncation rate (low = good)
- ? = confusion rate (low = good)
- blue bars = average completion tokens (low = good)
- black diamonds = CI-banded performance (high = good)
- cluster squares = models inside this group are equivalent
openai/gpt-oss-120b remains the king in all dimensions of interest: truncation rates, completion lengths and performance. If I had but one complaint, it's that reason_effort does not seem to actually work; more on this soon.
Second is a 3-way tie in performance between the Qwen3-235B-2507 we all know and love and an unexpected entrant: ByteDance-Seed/Seed-OSS-36B-Instruct.
This is a very capable model and its reasoning effort controls actually work, but you should absolutely not leave it on the default "unlimited": enable a sensible limit (4k works well for 8k context length).
Third place is another 3-way tie, this one between Seed-OSS-36B (it straddles the CI boundary between 2nd and 3rd place), Qwen/Qwen3-Next-80B-A3B-Instruct (demonstrating that full attention may be overrated after all and gated is the way to go) and the newly released zai-org/GLM-4.7, which offers excellent across-the-board performance with some of the shortest reasoning traces I've seen so far.
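For anyone wondering how a model can sit in two places at once, here is a rough sketch of one way to group models into tiers by overlapping confidence intervals. This is only an illustration and not necessarily how the clusters in these plots were computed; the `Result` type, the `tiers` function, and the model names and interval values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    lo: float  # lower bound of the confidence interval (hypothetical values)
    hi: float  # upper bound of the confidence interval


def overlaps(a: Result, b: Result) -> bool:
    """True when the two intervals share at least one point."""
    return a.lo <= b.hi and b.lo <= a.hi


def tiers(results: list[Result]) -> list[list[str]]:
    """Group models into tiers of statistically indistinguishable scores.

    Each tier is led by the best model not yet placed and contains every
    model whose interval overlaps the leader's. A model whose interval
    overlaps two leaders shows up in both tiers (it "straddles" them).
    """
    ranked = sorted(results, key=lambda r: (r.lo + r.hi) / 2, reverse=True)
    placed: set[str] = set()
    out: list[list[str]] = []
    for leader in ranked:
        if leader.name in placed:
            continue
        group = [r.name for r in ranked if overlaps(leader, r)]
        placed.update(group)
        out.append(group)
    return out


# Made-up intervals, chosen so that model "C" straddles two tiers.
demo = [
    Result("A", 0.80, 0.86),
    Result("B", 0.70, 0.76),
    Result("C", 0.66, 0.72),
    Result("D", 0.62, 0.67),
]
print(tiers(demo))
```

In this toy data "C" overlaps both "B" and "D", while "B" and "D" do not overlap each other, so "C" lands in two adjacent tiers, which is the kind of situation described for Seed-OSS-36B above.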

