terminal bench / completed runs only / generated 2026-04-15
jaca
evaluation dashboard
The run-by-run dashboard of just-another-coding-agent. Rows rank tasks by solve history. Columns count only the times a task was actually tried.
40 runs · 1187 trials · 89 tasks · 4 cohorts · 2026-03-27 → 2026-04-14
decode
Green fill marks a solve. Red bar marks a failed attempt. Blank means that task had not been retried that many times yet. Switch among outcome, latency, and tokens to reread the same matrix as result, duration, or transcript cost.
cohorts on record
- glm-5 / high
- gpt-5.4 / high
- gpt-5.4 / medium
- gpt-5.4 / xhigh
sheet
[n = 89 tasks · k = 1..14 attempts]
Rows are sorted by overall pass rate. Columns are each task's own attempt counter, not calendar days. Tap or hover any cell to inspect the run, model, slice, date, wall time, and token load. The task-name tint marks which fixed benchmark slice each row belongs to.
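The row and column rules above can be sketched in a few lines. This is only an illustration, not the dashboard's actual code: the `(task, timestamp, solved)` record shape, the `build_sheet` name, the descending sort direction, and the tie-break by task name are all assumptions.

```python
from collections import defaultdict

def build_sheet(trials):
    """Group trials into one row per task.

    trials: iterable of (task, timestamp, solved) tuples (hypothetical shape).
    Returns a list of (task, pass_rate, outcomes) rows.
    """
    by_task = defaultdict(list)
    for task, timestamp, solved in trials:
        by_task[task].append((timestamp, solved))

    rows = []
    for task, attempts in by_task.items():
        # Chronological order: position in the list is the task's own
        # attempt counter k, independent of calendar days.
        attempts.sort(key=lambda a: a[0])
        outcomes = [solved for _, solved in attempts]
        pass_rate = sum(outcomes) / len(outcomes)
        rows.append((task, pass_rate, outcomes))

    # Assumed: best pass rate first, task name as a stable tie-break.
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows
```

A task tried fewer times than another simply has a shorter `outcomes` list, which corresponds to the blank cells on the right of its row.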
Outcome keeps the binary read. Blank cells are real gaps: that task had not been retried yet.
Latency recolors every observed attempt by percentile-ranked wall time. Haloed cells begin around 42m 13s and mark the slowest 5% of the sheet.
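One way to read "percentile-ranked" and "slowest 5%" is sketched below. The function names, the use of whole seconds, and the tie handling (a value's rank counts every attempt at or below it) are assumptions made for illustration; the dashboard may rank differently.

```python
from bisect import bisect_right

def percentile_rank(wall_seconds):
    """Percent of observed attempts that took at most as long as each one."""
    ordered = sorted(wall_seconds)
    n = len(ordered)
    return [100 * bisect_right(ordered, v) / n for v in wall_seconds]

def slowest_flags(wall_seconds, top_percent=5):
    """True for attempts that fall in the slowest `top_percent` of the sheet."""
    ordered = sorted(wall_seconds)
    n = len(ordered)
    # Integer comparison avoids floating-point edge cases at the cutoff.
    return [
        bisect_right(ordered, v) * 100 > n * (100 - top_percent)
        for v in wall_seconds
    ]

def fmt_wall(seconds):
    """Format whole seconds in the dashboard's 'Xm YYs' style."""
    return f"{seconds // 60}m {seconds % 60:02d}s"
```

Under this reading, the halo threshold quoted above would be the smallest wall time whose flag is true; `fmt_wall(2533)` gives `"42m 13s"`.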
Tokens recolors attempts by total transcript load. Hatched cells mean the run finished but token telemetry was missing. Coverage is 264/1150 attempts (23.0%), with the top 5% beginning around 1379k tokens.
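The coverage figure is a plain ratio of attempts with token telemetry to all attempts. A minimal sketch, assuming missing telemetry is represented as `None` (a hypothetical encoding, as is the `token_coverage` name):

```python
def token_coverage(token_counts):
    """Return (observed, total, percent) for a list of token counts.

    token_counts: one entry per finished attempt; None marks an attempt
    whose token telemetry is missing (a hatched cell).
    """
    total = len(token_counts)
    observed = sum(1 for t in token_counts if t is not None)
    percent = 100 * observed / total if total else 0.0
    return observed, total, percent
```

With 264 observed attempts out of 1150, the percentage rounds to 23.0, matching the figure quoted above.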
observation register
[selected = 8 entries · newest first]
Only the runs with the strongest signal stay here: the newest anchor points, the largest slice flips, the cohort firsts, and the extrema worth re-reading. Each entry is still a diff against the previous run of the same slice, not a standalone card.
- 76.7% · gpt-5.4 / xhigh · slice b · 2026-04-14 · 23/30 passed · err 16.7% · 90m wall · pass 9 · +4 F→P · −1 P→F
  - highest solve rate so far
- 73.3% · gpt-5.4 / xhigh · slice a · 2026-04-14 · 22/30 passed · err 6.7% · 69m wall · pass 7 · +5 F→P
  - highest solve rate so far
- 66.7% · gpt-5.4 / xhigh · slice a · 2026-04-10 · 20/30 passed · err 13.3% · 82m wall · pass 1 · +7 F→P · −3 P→F
- 70.0% · gpt-5.4 / high · slice a · 2026-04-06 · 21/30 passed · err 13.3% · 84m wall · pass 1 · +14 F→P · −2 P→F
  - highest solve rate so far
- 63.3% · gpt-5.4 / high · slice b · 2026-04-06 · 19/30 passed · err 13.3% · 104m wall · pass 1 · +9 F→P · −1 P→F
  - first run on gpt-5.4 / high
  - highest solve rate so far
- 30.0% · glm-5 / high · slice a · 2026-04-04 · 9/30 passed · err 26.7% · 90m wall · pass 5 · +1 F→P · −6 P→F
  - lowest solve rate so far
- 51.7% · glm-5 / high · slice c · 2026-04-02 · 15/29 passed · err 34.5% · 595m wall · pass 4 · +7 F→P · −2 P→F
  - elevated error rate (34%)
- 34.5% · glm-5 / high · slice c · 2026-03-30 · 10/29 passed · err 31.0% · 124m wall · pass 3 · −6 P→F
  - lowest solve rate so far
  - elevated error rate (31%)
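Each register entry reports a solve rate plus the F→P (fail to pass) and P→F (pass to fail) flips against the previous run of the same slice. The sketch below shows one way such a diff could be computed. The `task -> solved` dictionary shape and the function names are assumptions, and only tasks present in both runs are compared.

```python
def solve_rate(results):
    """Percent of tasks solved in one run; results maps task -> bool."""
    return 100 * sum(results.values()) / len(results)

def diff_runs(previous, current):
    """Return (fail_to_pass, pass_to_fail) task lists between two runs.

    previous, current: dicts mapping task name -> solved (bool) for two
    consecutive runs of the same slice (hypothetical shape).
    """
    shared = previous.keys() & current.keys()
    fail_to_pass = sorted(t for t in shared if not previous[t] and current[t])
    pass_to_fail = sorted(t for t in shared if previous[t] and not current[t])
    return fail_to_pass, pass_to_fail
```

For a 30-task slice with 23 solves, `round(solve_rate(results), 1)` gives 76.7, the headline figure of the newest entry. The lengths of the two returned lists correspond to the `+N F→P` and `−N P→F` counts.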