Tapasya model bar: frontier references vs the local lane
- Setup: rolling Tapasya model-bar note covering local snapshots, reasoning vs no-think, scaling decisions, and future cohorts
- Found: current evidence says the local lane improved, but frontier hosted references still set the product bar
- Result: open; the best local lane still trails the frontier band, and the next cohorts will test whether that gap is shrinking enough to matter
The question
Tapasya needs a model that can stay grounded in a difficult text and still produce something worth reading.
The question is not whether local models are interesting. They are. The question is whether they are good enough to sit behind the product.
This note replaces the earlier public Tapasya model logs. Future model cohorts should extend this note instead of spawning parallel writeups.
Current evidence
The current read comes from four March 2026 runs on the same 20-case Nietzsche passage benchmark, with Codex GPT-5.4 as the fixed answer baseline and Claude as the primary judge lane.
The reference band was not vague. It had two concrete poles:
| Model | Role in the set | Why it matters |
|---|---|---|
| Codex GPT-5.4 | fixed answer baseline | every local candidate was judged against this answer set |
| Claude Opus | frontier answering reference | the only answering lane that stayed near that bar |
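The benchmark discipline described above (a fixed answer baseline, a judge lane scoring every candidate on the same cases) can be sketched roughly as follows. Everything here is illustrative: `judge_score` is a dummy stand-in for the real Claude judge lane, and the metric names are the ones used in the tables below.

```python
# Sketch of the run discipline: each candidate is judged per case, per
# metric, against the same fixed baseline answer set.
# judge_score is a deterministic dummy so the sketch runs end to end;
# the real setup used an LLM judge.
from statistics import mean

METRICS = ["groundedness", "voice", "usefulness", "directness"]

def judge_score(case, baseline_answer, candidate_answer, metric):
    """Stand-in for the judge lane (hypothetical scoring rule)."""
    return 3.0 if candidate_answer == baseline_answer else 2.0

def run_cohort(cases, baseline_answers, candidate_answers):
    """Per-metric mean judge scores for one candidate over all cases."""
    return {
        m: mean(
            judge_score(c, b, a, m)
            for c, b, a in zip(cases, baseline_answers, candidate_answers)
        )
        for m in METRICS
    }
```

With 20 cases and four candidate lanes, this is the shape of each March run: one fixed baseline answer set, one score table per candidate.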
Each of the four runs established one slice of that read:

| Slice | What it established |
|---|---|
| Local snapshot | Qwen3.5 9B was the first local model that felt properly strong without making runtime absurd |
| Reasoning test | Qwen3 8B with thinking beat /no_think on groundedness and usefulness, but paid a large latency cost |
| Qwen scaling decision | Qwen3 14B thinking became the best local quality lane; Qwen3.5 9B stayed the practical fallback |
| Desktop decision | the best local lane stayed visibly heavy in product terms, especially on time to first answer |
The strongest compact evidence from the current set is still the local snapshot against the frontier reference band:
| Model | Role | Groundedness gap | Voice gap | Usefulness gap | Directness gap |
|---|---|---|---|---|---|
| Codex GPT-5.4 | fixed baseline | anchor | anchor | anchor | anchor |
| Claude Opus | frontier answering reference | 0.00 | +0.50 | +0.35 | -0.35 |
| Qwen3.5 9B | best local lane | -2.00 | -0.15 | -1.75 | -1.55 |
| Qwen3.5 4B | next local lane | -2.15 | -0.55 | -1.50 | -1.95 |
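The gap columns above are candidate-minus-baseline deltas per metric, with Codex as the zero point. A minimal sketch of that arithmetic, using hypothetical raw scores (only the computation mirrors the table, not the numbers):

```python
# Each gap cell = candidate's mean judge score minus the fixed baseline's
# mean score on the same metric. Raw scores below are hypothetical.
def metric_gaps(baseline_means, candidate_means):
    return {m: round(candidate_means[m] - baseline_means[m], 2)
            for m in baseline_means}

baseline = {"groundedness": 4.0, "usefulness": 4.0}   # hypothetical anchor
local    = {"groundedness": 2.0, "usefulness": 2.25}  # hypothetical local lane

gaps = metric_gaps(baseline, local)
# A lane scoring like `local` would show -2.00 / -1.75 style gaps.
```

Negative means below the bar; a lane near zero on groundedness and usefulness is what the frontier reference band looks like.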
That is the current boundary. Codex stayed the answer bar. Claude Opus was the only answering lane near it. The best local lane still sat roughly 1.5 to 2.0 points back on the metrics that mattered most.
What the current runs ruled out
They ruled out two easier stories.
First, they ruled out the idea that a small, fast local lane was already good enough if I were just selective about prompts. The quality gap stayed too visible.
Second, they ruled out the idea that local reasoning would erase that gap cleanly. It helped, but it was not free:
| Mode | Mean latency | Groundedness | Usefulness |
|---|---|---|---|
| Qwen3 8B thinking | 6.45s | 3.15 | 3.03 |
| Qwen3 8B /no_think | 1.97s | 2.46 | 2.54 |
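The tradeoff in that table can be restated as quality gained per second of extra latency. The numbers below come straight from the table; framing them as deltas is my own reading, not something the runs reported.

```python
# Thinking vs /no_think on Qwen3 8B, restated as deltas (table numbers).
thinking = {"latency": 6.45, "groundedness": 3.15, "usefulness": 3.03}
no_think = {"latency": 1.97, "groundedness": 2.46, "usefulness": 2.54}

extra_latency = thinking["latency"] - no_think["latency"]            # ~4.48 s
ground_gain   = thinking["groundedness"] - no_think["groundedness"]  # ~0.69
useful_gain   = thinking["usefulness"] - no_think["usefulness"]      # ~0.49
```

Roughly 3x the latency bought well under a point of quality per metric, which is why "reasoning erases the gap" did not survive the run.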
The later Qwen comparison kept the same pattern. The better local lane was better for a reason. It was also heavier:
| Model | Mean answer TTFT | Time to 150 chars | Mean full answer |
|---|---|---|---|
| Qwen3.5 9B | 1.11s | 1.88s | 5.66s |
| Qwen3 14B thinking | 27.26s | 30.21s | 38.44s |
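The desktop decision above hinged on time to first answer. A simple budget gate captures it; the 3-second budget below is a hypothetical threshold for illustration, not a number from the runs.

```python
# Time-to-first-answer gate. The budget is a hypothetical product
# threshold; the TTFT values come from the table above.
FIRST_ANSWER_BUDGET_S = 3.0  # hypothetical

lanes = {
    "Qwen3.5 9B": 1.11,
    "Qwen3 14B thinking": 27.26,
}

within_budget = {name: ttft <= FIRST_ANSWER_BUDGET_S
                 for name, ttft in lanes.items()}
```

Under any budget in that neighborhood, the best-quality local lane fails on feel alone, regardless of where its answers land on the judge.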
Interim read
Frontier hosted models still set the bar for Tapasya.
That is not a verdict against local models. It is the current product read. The local work is still useful because it makes the gap visible and sets the right targets. It just has not justified a local-first default yet.
In practice that means:
- Codex GPT-5.4 stayed the fixed answer bar
- Claude Opus was the only answering reference lane near that bar
- the best local lanes were still clearly below both
Next models
The next cohorts should answer a narrower question: is the gap shrinking enough to matter on the current product task?
That means:
- refresh the frontier reference band on the current product posture
- retest the strongest open local lanes against the same benchmark discipline
- promote a local lane only if it narrows the groundedness and usefulness gap without turning runtime into a benchmark artifact
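The promotion rule in that list can be written as a single predicate. The thresholds here are hypothetical placeholders; only the structure (narrow both quality gaps, keep runtime sane) follows the note.

```python
# Promotion gate for a local lane. Gaps are candidate-minus-baseline,
# so less negative is better. max_gap and max_ttft_s are hypothetical
# product thresholds, not values from the runs.
def promote_local_lane(ground_gap, useful_gap, ttft_s,
                       max_gap=-1.0, max_ttft_s=5.0):
    return (ground_gap >= max_gap
            and useful_gap >= max_gap
            and ttft_s <= max_ttft_s)
```

Today's best local lane, at roughly -2.00 groundedness and -1.75 usefulness, fails a gate like this on quality before latency even enters; that is the shape of the gap the next cohorts need to close.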
Until that changes, the current read stays simple: frontier references still set the model bar for Tapasya.