001 · 2026-03-26 · open · Tapasya

Tapasya model bar: frontier references vs the local lane

llms · evals · local-models · product-decisions
Setup: rolling Tapasya model-bar note across local snapshots, reasoning-vs-no-think, scaling decisions, and future cohorts
Found: current evidence says the local lane improved, but frontier hosted references still set the product bar
Result: open. The best local lane still trails the frontier band; the next cohorts will test whether that gap is shrinking enough to matter

The question

Tapasya needs a model that can stay grounded in a difficult text and still produce something worth reading.

The question is not whether local models are interesting. They are. The question is whether they are good enough to sit behind the product.

This note replaces the earlier public Tapasya model logs. Future model cohorts should extend this note instead of spawning parallel writeups.

Current evidence

The current read comes from four March 2026 runs on the same 20-case Nietzsche passage benchmark, with Codex GPT-5.4 as the fixed answer baseline and Claude as the primary judge lane.

The reference band was not vague. It had two concrete poles:

Model          Role in the set               Why it matters
Codex GPT-5.4  fixed answer baseline         every local candidate was judged against this answer set
Claude Opus    frontier answering reference  the only answering lane that stayed near that bar
The four runs map to four decision slices:

Slice                  What it established
Local snapshot         Qwen3.5 9B was the first local model that felt properly strong without making runtime absurd
Reasoning test         Qwen3 8B with thinking beat /no_think on groundedness and usefulness, but paid a large latency cost
Qwen scaling decision  Qwen3 14B thinking became the best local quality lane; Qwen3.5 9B stayed the practical fallback
Desktop decision       the best local lane stayed visibly heavy in product terms, especially on time to first answer

The strongest compact evidence from the current set is still the local snapshot against the frontier reference band:

Model          Role                          Groundedness gap  Voice gap  Usefulness gap  Directness gap
Codex GPT-5.4  fixed baseline                anchor            anchor     anchor          anchor
Claude Opus    frontier answering reference   0.00             +0.50      +0.35           -0.35
Qwen3.5 9B     best local lane               -2.00             -0.15      -1.75           -1.55
Qwen3.5 4B     next local lane               -2.15             -0.55      -1.50           -1.95

That is the current boundary. Codex stayed the answer bar. Claude Opus was the only answering lane near it. The best local lane still sat roughly 1.5 to 2.0 points back on the metrics that mattered most.
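The gap columns above are deltas against the fixed Codex answer set. A minimal sketch of that computation, with hypothetical per-metric judge scores standing in for the real run data (the lane names and numbers here are illustrative, not the actual score sheets):

```python
# Sketch: report every lane's mean judge score relative to a fixed anchor
# lane. The score values below are made-up placeholders, chosen only so
# the resulting deltas resemble the table's scale.

ANCHOR = "codex-gpt-5.4"

# mean judge scores per lane per metric (hypothetical values)
scores = {
    "codex-gpt-5.4": {"groundedness": 4.60, "usefulness": 4.40},
    "claude-opus":   {"groundedness": 4.60, "usefulness": 4.75},
    "qwen3.5-9b":    {"groundedness": 2.60, "usefulness": 2.65},
}

def gaps(lane: str) -> dict:
    """Per-metric delta of `lane` against the fixed anchor lane."""
    return {
        metric: round(scores[lane][metric] - scores[ANCHOR][metric], 2)
        for metric in scores[ANCHOR]
    }

for lane in scores:
    print(lane, gaps(lane))
```

The anchor's own gap is 0.00 by construction, which is why its row reads "anchor" across the board.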

What the current runs ruled out

They ruled out two easier stories.

First, they ruled out the idea that a small, fast local lane was already good enough if I were just selective about prompts. The quality gap stayed too visible.

Second, they ruled out the idea that local reasoning would erase that gap cleanly. It helped, but it was not free:

Mode                Mean latency  Groundedness  Usefulness
Qwen3 8B thinking          6.45s          3.15        3.03
Qwen3 8B /no_think         1.97s          2.46        2.54
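One way to read that table is quality gained per extra second of latency. A small sketch using the table's own numbers (the framing, not the thresholds, comes from the note):

```python
# Thinking-vs-/no_think tradeoff, using the figures from the table above.
think = {"latency": 6.45, "groundedness": 3.15, "usefulness": 3.03}
fast  = {"latency": 1.97, "groundedness": 2.46, "usefulness": 2.54}

extra_s = round(think["latency"] - fast["latency"], 2)          # extra seconds paid
dg = round(think["groundedness"] - fast["groundedness"], 2)     # groundedness gained
du = round(think["usefulness"] - fast["usefulness"], 2)         # usefulness gained

print(f"+{dg} groundedness and +{du} usefulness for {extra_s}s extra latency")
```

Roughly 0.7 and 0.5 points for about 4.5 extra seconds: real gains, but clearly not free.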

The later Qwen comparison kept the same pattern. The better local lane was better for a reason. It was also heavier:

Model               Mean answer TTFT  Time to 150 chars  Mean full answer
Qwen3.5 9B                     1.11s              1.88s             5.66s
Qwen3 14B thinking            27.26s             30.21s            38.44s
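The three latency columns can all be taken from a single pass over a token stream. A sketch of that measurement, where `stream_tokens` is a hypothetical iterable of text chunks (this is an assumed harness, not the note's actual benchmark code):

```python
import time

def measure(stream_tokens):
    """Return (ttft, time_to_150_chars, full_answer_time) in seconds.

    ttft is time to the first chunk; time_to_150_chars is when the
    accumulated text first reaches 150 characters. Either may be None
    if the stream is empty or too short.
    """
    start = time.monotonic()
    ttft = t150 = None
    text = ""
    for chunk in stream_tokens:
        now = time.monotonic() - start
        if ttft is None:
            ttft = now            # first visible output
        text += chunk
        if t150 is None and len(text) >= 150:
            t150 = now            # enough text to start reading
    return ttft, t150, time.monotonic() - start

# usage with a fake instant stream:
ttft, t150, total = measure(["hello "] * 40)
```

The "time to 150 chars" column matters because for a reading product the user cares about when a usable opening appears, not only when the full answer lands.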

Interim read

Frontier hosted models still set the bar for Tapasya.

That is not a verdict against local models. It is the current product read. The local work is still useful because it makes the gap visible and sets the right targets. It just has not justified a local-first default yet.

In practice that means:

  • Codex GPT-5.4 stayed the fixed answer bar
  • Claude Opus was the only answering reference lane near that bar
  • the best local lanes were still clearly below both

Next models

The next cohorts should answer a narrower question: is the gap shrinking enough to matter on the current product task?

That means:

  • refresh the frontier reference band on the current product posture
  • retest the strongest open local lanes against the same benchmark discipline
  • promote a local lane only if it narrows the groundedness and usefulness gap without turning runtime into a benchmark artifact
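The promotion rule above can be stated as a predicate. The threshold values here are invented for illustration; the note deliberately does not fix exact numbers:

```python
# Sketch of the promotion rule as a predicate. Gaps are negative deltas
# against the anchor (see the gap table); max_gap and max_ttft_s are
# hypothetical thresholds, not values the note commits to.
def promote_local_lane(gap_groundedness, gap_usefulness, ttft_s,
                       max_gap=-1.0, max_ttft_s=3.0):
    """Promote only if both quality gaps have narrowed past max_gap
    and time-to-first-token stays product-usable."""
    quality_ok = gap_groundedness > max_gap and gap_usefulness > max_gap
    runtime_ok = ttft_s <= max_ttft_s
    return quality_ok and runtime_ok

# Under these illustrative thresholds, neither current local lane promotes:
promote_local_lane(-2.00, -1.75, 1.11)   # Qwen3.5 9B: fast, but too far back
promote_local_lane(-0.4, -0.3, 27.26)    # hypothetical lane: close, but too slow
```

The point of the two-sided check is the last bullet: a lane that closes the quality gap only by spending tens of seconds thinking has turned runtime into a benchmark artifact, not a product answer.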

Until that changes, the current read stays simple: frontier references still set the model bar for Tapasya.