terminal bench / completed runs only / generated 2026-04-15

jaca

evaluation dashboard

The run-by-run dashboard of just-another-coding-agent. Rows rank tasks by solve history. Columns count only the times a task was actually tried.

40 runs · 1187 trials · 89 tasks · 4 cohorts · 2026-03-27 → 2026-04-14

decode

Green fill marks a solve. Red bar marks a failed attempt. Blank means the task had not yet been tried that many times. Switch among outcome, latency, and tokens to reread the same matrix as result, duration, or transcript cost.

cohorts on record

  • glm-5 / high
  • gpt-5.4 / high
  • gpt-5.4 / medium
  • gpt-5.4 / xhigh
fig. 1

sheet

[n = 89 tasks · k = 1..14 attempts]

Rows are sorted by overall pass rate. Columns are each task's own attempt counter, not calendar days. Tap or hover any cell to inspect the run, model, slice, date, wall time, and token load. The task-name tint marks which fixed benchmark slice each row belongs to.
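The row ordering described above can be sketched in a few lines. This is an illustrative sketch only: the task names and outcome lists below are hypothetical stand-ins, not the dashboard's actual data model.

```python
# Illustrative sketch: rank tasks by overall pass rate, as the grid does.
# Each task keeps its own attempt counter, so rows may differ in length.
# Outcomes are hypothetical: True = solve, False = failed attempt.
tasks = {
    "headless-terminal": [True] * 14,                # 14/14
    "fix-ocaml-gc":      [True] * 10 + [False] * 4,  # 10/14
    "gcode-to-text":     [False] * 14,               # 0/14
}

def pass_rate(outcomes):
    """Fraction of observed attempts that solved the task."""
    return sum(outcomes) / len(outcomes)

# Sort descending by pass rate; Python's sort is stable, so ties
# keep their existing relative order.
ranked = sorted(tasks, key=lambda t: pass_rate(tasks[t]), reverse=True)
print(ranked)  # most-solved task first
```

Because each row's denominator is its own attempt count, a 13/13 task can outrank a 13/14 task even though both have thirteen solves.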

Outcome keeps the binary read. Blank cells are real gaps: that task had not yet been tried that many times.

Latency recolors every observed attempt by percentile-ranked wall time. Haloed cells begin around 42m 13s and mark the slowest 5% of the sheet.

Tokens recolors attempts by total transcript load. Hatched cells mean the run finished but token telemetry was missing. Coverage is 264/1150 attempts (23.0%), with the top 5% beginning around 1379k tokens.
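Both the latency and token views halo the top 5% of the sheet by percentile rank. A minimal sketch of that thresholding, assuming a flat list of per-attempt measurements (the wall times below are made up; the dashboard's actual cutoffs are the 42m 13s and 1379k figures quoted above):

```python
# Minimal sketch of the 95th-percentile "halo" cutoff used by the
# latency and token views. Values are hypothetical wall times in seconds.
walls = [120, 300, 450, 600, 900, 1500, 2100, 2533, 2700, 3100]

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100), no external libraries."""
    ordered = sorted(values)
    # Nearest rank: smallest value with at least q% of data at or below it.
    rank = max(1, -(-len(ordered) * q // 100))  # ceiling division
    return ordered[rank - 1]

cutoff = percentile(walls, 95)
haloed = [w for w in walls if w >= cutoff]  # cells drawn with a halo
print(cutoff, haloed)
```

The same thresholding applies to token totals; only the measurement changes.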

[view: outcome · cell tooltip: task, model, thinking, slice, run, date, wall, tokens · columns: attempt 01 (first) → 14 (latest)]
headless-terminal 14/14
log-summary-date-ranges 14/14
modernize-scientific-stack 14/14
distribution-search 13/13
pypi-server 13/13
pytorch-model-cli 13/13
regex-log 13/13
reshard-c4-data 13/13
vulnerable-secret 13/13
fix-git 12/12
hf-model-inference 12/12
prove-plus-comm 12/12
sqlite-db-truncate 11/11
git-leak-recovery 13/14
git-multibranch 13/14
kv-store-grpc 13/14
large-scale-text-editing 13/14
largest-eigenval 13/14
multi-source-data-merger 13/14
nginx-request-logging 13/14
code-from-image 12/13
constraints-scheduling 12/13
count-dataset-tokens 12/13
custom-memory-heap-crash 12/13
merge-diff-arc-agi-task 12/14
build-pmars 11/13
portfolio-optimization 11/13
sqlite-with-gcov 11/13
cobol-modernization 10/12
feal-differential-cryptanalysis 10/12
llm-inference-batching-scheduler 11/14
mailman 11/14
openssl-selfsigned-cert 11/14
password-recovery 11/14
bn-fit-modify 10/13
build-pov-ray 10/13
circuit-fibsqrt 10/13
financial-document-processor 9/12
sparql-university 9/12
sanitize-git-repo 8/11
fix-ocaml-gc 10/14
pytorch-model-recovery 7/10
cancel-async-tasks 9/13
fix-code-vulnerability 9/13
overfull-hbox 9/14
path-tracing-reverse 9/14
crack-7z-hash 7/11
break-filter-js-from-html 8/13
build-cython-ext 8/13
extract-elf 8/13
feal-linear-cryptanalysis 8/13
qemu-startup 8/13
write-compressor 8/13
rstan-to-pystan 7/13
schemelike-metacircular-eval 6/13
tune-mjcf 6/13
winning-avg-corewars 6/13
path-tracing 6/14
model-extraction-relu-logits 5/12
protein-assembly 5/13
chess-best-move 4/11
compile-compcert 4/13
torch-tensor-parallelism 4/13
mcmc-sampling-stan 4/14
mteb-leaderboard 3/11
query-optimize 3/11
adaptive-rejection-sampler 3/12
configure-git-webserver 3/12
dna-assembly 2/13
qemu-alpine-ssh 2/13
regex-chess 2/13
dna-insert 1/13
make-mips-interpreter 1/14
polyglot-c-py 1/14
gcode-to-text 0/14
gpt2-codegolf 0/14
install-windows-3.11 0/14
make-doom-for-mips 0/14
caffe-cifar-10 0/13
db-wal-recovery 0/13
extract-moves-from-video 0/13
polyglot-rust-c 0/13
raman-fitting 0/13
sam-cell-seg 0/13
train-fasttext 0/13
video-processing 0/13
torch-pipeline-parallelism 0/12
filter-js-from-html 0/11
mteb-retrieve 0/8
fig. 1 — every task, every observed attempt. The grid is ranked by solve history; each row keeps its own retry clock. Empty cells are true gaps, not zeroes.
fig. 2

observation register

[selected = 8 entries · newest first]

Only the runs with the strongest signal stay here: the newest anchor points, the largest slice flips, the cohort firsts, and the extrema worth re-reading. Each entry is still a diff against the previous run of the same slice, not a standalone card.
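The "+N F→P −M P→F" lines are per-slice diffs: each run is compared task by task against the previous run of the same slice. A sketch of that comparison, with hypothetical outcomes (True = pass, False = fail):

```python
# Sketch of the per-slice diff behind "+N F->P  -M P->F": compare a run
# to the previous run of the same slice, task by task. Data is made up.
prev = {"fix-git": True,  "dna-insert": False, "mailman": False}
curr = {"fix-git": False, "dna-insert": True,  "mailman": True}

def flips(prev_run, curr_run):
    """Count fail->pass and pass->fail transitions on shared tasks."""
    shared = prev_run.keys() & curr_run.keys()
    f_to_p = sum(1 for t in shared if not prev_run[t] and curr_run[t])
    p_to_f = sum(1 for t in shared if prev_run[t] and not curr_run[t])
    return f_to_p, p_to_f

print(flips(prev, curr))  # (2, 1): two new solves, one regression
```

A run can post a higher solve rate than its predecessor while still showing P→F regressions, as entry 1 below does.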

  1. 76.7% gpt-5.4 / xhigh slice b 2026-04-14
    23/30 passed · err 16.7% · 90m wall · pass 9
    +4 F→P −1 P→F
    • highest solve rate so far
  2. 73.3% gpt-5.4 / xhigh slice a 2026-04-14
    22/30 passed · err 6.7% · 69m wall · pass 7
    +5 F→P
    • highest solve rate so far
  3. 66.7% gpt-5.4 / xhigh slice a 2026-04-10
    20/30 passed · err 13.3% · 82m wall · pass 1
    +7 F→P −3 P→F
  4. 70.0% gpt-5.4 / high slice a 2026-04-06
    21/30 passed · err 13.3% · 84m wall · pass 1
    +14 F→P −2 P→F
    • highest solve rate so far
  5. 63.3% gpt-5.4 / high slice b 2026-04-06
    19/30 passed · err 13.3% · 104m wall · pass 1
    +9 F→P −1 P→F
    • first run on gpt-5.4 / high
    • highest solve rate so far
  6. 30.0% glm-5 / high slice a 2026-04-04
    9/30 passed · err 26.7% · 90m wall · pass 5
    +1 F→P −6 P→F
    • lowest solve rate so far
  7. 51.7% glm-5 / high slice c 2026-04-02
    15/29 passed · err 34.5% · 595m wall · pass 4
    +7 F→P −2 P→F
    • elevated error rate (34%)
  8. 34.5% glm-5 / high slice c 2026-03-30
    10/29 passed · err 31.0% · 124m wall · pass 3
    −6 P→F
    • lowest solve rate so far
    • elevated error rate (31%)