Tapasya model bar: frontier references vs the local lane
- Setup: rolling Tapasya model-bar note covering local snapshots, reasoning vs no-think, scaling decisions, and future cohorts
- Found: current evidence says the local lane improved, but frontier hosted references still set the product bar
- Result: open; the best local lane still trails the frontier band, and the next cohorts will test whether that gap is shrinking enough to matter
The question
Tapasya needs a model that can stay grounded in a difficult text and still produce something worth reading.
The question is not whether local models are interesting. They are. The question is whether they are good enough to sit behind the product.
This note replaces the earlier public Tapasya model logs. Future model cohorts should extend this note instead of spawning parallel writeups.
Current evidence
The current read comes from four March 2026 runs on the same 20-case Nietzsche passage benchmark, with Codex GPT-5.4 as the fixed answer baseline and Claude as the primary judge lane.
The reference band was not vague. It had two concrete poles:
| Model | Role in the set | Why it matters |
|---|---|---|
| Codex GPT-5.4 | fixed answer baseline | every local candidate was judged against this answer set |
| Claude Opus | frontier answering reference | the only answering lane that stayed near that bar |
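The benchmark discipline described above (a fixed answer baseline, a judge lane scoring every candidate on the same cases) can be sketched roughly as follows. Everything here is illustrative: `judge_score` is a dummy stand-in for the real Claude judge lane, and the metric names are the ones used in the tables below.

```python
# Sketch of the run discipline: each candidate is judged per case, per
# metric, against the same fixed baseline answer set.
# judge_score is a deterministic dummy so the sketch runs end to end;
# the real setup used an LLM judge.
from statistics import mean

METRICS = ["groundedness", "voice", "usefulness", "directness"]

def judge_score(case, baseline_answer, candidate_answer, metric):
    """Stand-in for the judge lane (hypothetical scoring rule)."""
    return 3.0 if candidate_answer == baseline_answer else 2.0

def run_cohort(cases, baseline_answers, candidate_answers):
    """Per-metric mean judge scores for one candidate over all cases."""
    return {
        m: mean(
            judge_score(c, b, a, m)
            for c, b, a in zip(cases, baseline_answers, candidate_answers)
        )
        for m in METRICS
    }
```

With 20 cases and four candidate lanes, this is the shape of each March run: one fixed baseline answer set, one score table per candidate.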
Each of the four runs established one slice of that read:

| Slice | What it established |
|---|---|
| Local snapshot | Qwen3.5 9B was the first local model that felt properly strong without making runtime absurd |
| Reasoning test | Qwen3 8B with thinking beat /no_think on groundedness and usefulness, but paid a large latency cost |
| Qwen scaling decision | Qwen3 14B thinking became the best local quality lane; Qwen3.5 9B stayed the practical fallback |
| Desktop decision | the best local lane stayed visibly heavy in product terms, especially on time to first answer |
The strongest compact evidence from the current set is still the local snapshot against the frontier reference band:
| Model | Role | Groundedness gap | Voice gap | Usefulness gap | Directness gap |
|---|---|---|---|---|---|
| Codex GPT-5.4 | fixed baseline | anchor | anchor | anchor | anchor |
| Claude Opus | frontier answering reference | 0.00 | +0.50 | +0.35 | -0.35 |
| Qwen3.5 9B | best local lane | -2.00 | -0.15 | -1.75 | -1.55 |
| Qwen3.5 4B | next local lane | -2.15 | -0.55 | -1.50 | -1.95 |
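The gap columns above are candidate-minus-baseline deltas per metric, with Codex as the zero point. A minimal sketch of that arithmetic, using hypothetical raw scores (only the computation mirrors the table, not the numbers):

```python
# Each gap cell = candidate's mean judge score minus the fixed baseline's
# mean score on the same metric. Raw scores below are hypothetical.
def metric_gaps(baseline_means, candidate_means):
    return {m: round(candidate_means[m] - baseline_means[m], 2)
            for m in baseline_means}

baseline = {"groundedness": 4.0, "usefulness": 4.0}   # hypothetical anchor
local    = {"groundedness": 2.0, "usefulness": 2.25}  # hypothetical local lane

gaps = metric_gaps(baseline, local)
# A lane scoring like `local` would show -2.00 / -1.75 style gaps.
```

Negative means below the bar; a lane near zero on groundedness and usefulness is what the frontier reference band looks like.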
That is the current boundary. Codex stayed the answer bar. Claude Opus was the only answering lane near it. The best local lane still sat roughly 1.5 to 2.0 points back on the metrics that mattered most.
What the current runs ruled out
They ruled out two easier stories.
First, they ruled out the idea that a small, fast local lane was already good enough if I were just selective about prompts. The quality gap stayed too visible.
Second, they ruled out the idea that local reasoning would erase that gap cleanly. It helped, but it was not free:
| Mode | Mean latency | Groundedness | Usefulness |
|---|---|---|---|
| Qwen3 8B thinking | 6.45s | 3.15 | 3.03 |
| Qwen3 8B /no_think | 1.97s | 2.46 | 2.54 |
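The tradeoff in that table can be restated as quality gained per second of extra latency. The numbers below come straight from the table; framing them as deltas is my own reading, not something the runs reported.

```python
# Thinking vs /no_think on Qwen3 8B, restated as deltas (table numbers).
thinking = {"latency": 6.45, "groundedness": 3.15, "usefulness": 3.03}
no_think = {"latency": 1.97, "groundedness": 2.46, "usefulness": 2.54}

extra_latency = thinking["latency"] - no_think["latency"]            # ~4.48 s
ground_gain   = thinking["groundedness"] - no_think["groundedness"]  # ~0.69
useful_gain   = thinking["usefulness"] - no_think["usefulness"]      # ~0.49
```

Roughly 3x the latency bought well under a point of quality per metric, which is why "reasoning erases the gap" did not survive the run.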
The later Qwen comparison kept the same pattern. The better local lane was better for a reason. It was also heavier:
| Model | Mean answer TTFT | Time to 150 chars | Mean full answer |
|---|---|---|---|
| Qwen3.5 9B | 1.11s | 1.88s | 5.66s |
| Qwen3 14B thinking | 27.26s | 30.21s | 38.44s |
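The desktop decision above hinged on time to first answer. A simple budget gate captures it; the 3-second budget below is a hypothetical threshold for illustration, not a number from the runs.

```python
# Time-to-first-answer gate. The budget is a hypothetical product
# threshold; the TTFT values come from the table above.
FIRST_ANSWER_BUDGET_S = 3.0  # hypothetical

lanes = {
    "Qwen3.5 9B": 1.11,
    "Qwen3 14B thinking": 27.26,
}

within_budget = {name: ttft <= FIRST_ANSWER_BUDGET_S
                 for name, ttft in lanes.items()}
```

Under any budget in that neighborhood, the best-quality local lane fails on feel alone, regardless of where its answers land on the judge.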
Interim read
Frontier hosted models still set the bar for Tapasya.
That is not a verdict against local models. It is the current product read. The local work is still useful because it makes the gap visible and sets the right targets. It just has not justified a local-first default yet.
In practice that means:
- Codex GPT-5.4 stayed the fixed answer bar
- Claude Opus was the only answering reference lane near that bar
- the best local lanes were still clearly below both
Next models
The next cohorts should answer a narrower question: is the gap shrinking enough to matter on the current product task?
That means:
- refresh the frontier reference band on the current product posture
- retest the strongest open local lanes against the same benchmark discipline
- promote a local lane only if it narrows the groundedness and usefulness gap without turning runtime into a benchmark artifact
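The promotion rule in that list can be written as a single predicate. The thresholds here are hypothetical placeholders; only the structure (narrow both quality gaps, keep runtime sane) follows the note.

```python
# Promotion gate for a local lane. Gaps are candidate-minus-baseline,
# so less negative is better. max_gap and max_ttft_s are hypothetical
# product thresholds, not values from the runs.
def promote_local_lane(ground_gap, useful_gap, ttft_s,
                       max_gap=-1.0, max_ttft_s=5.0):
    return (ground_gap >= max_gap
            and useful_gap >= max_gap
            and ttft_s <= max_ttft_s)
```

Today's best local lane, at roughly -2.00 groundedness and -1.75 usefulness, fails a gate like this on quality before latency even enters; that is the shape of the gap the next cohorts need to close.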
Until that changes, the current read stays simple: frontier references still set the model bar for Tapasya.