2026-04-28 · Research

Terminal Agent Training Data: How AB-Terminal Bench Improves Coding AI

Longtian Ye, Member of Technical Staff

To build effective training data for terminal-native agents, teams must create executable, verifiable, and complete coding environments with clear evaluation criteria. AB-Terminal Bench achieves this through containerized tasks, pytest-based verification, oracle solutions, and multi-stage agent pipelines. In practice, structured task design and evaluation rigor, not raw data scale, determine whether coding agents improve after training.

AB-Terminal Bench: Training Data for Terminal-Native Agents

AB-Terminal Bench is a post-training corpus of containerized terminal-coding tasks, produced through an agent-driven pipeline for training and evaluating agentic coding systems. A single round of supervised fine-tuning on 1.8k filtered samples moves Qwen3-32B from 3.37% to 17.24% pass@1 on a 120-task held-out eval slice under the Terminus-2 harness.

The field is moving fast enough that this kind of data is becoming the rate-limiter. On the current Vals AI leaderboard, Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.3 Codex sit at 68%, 67%, and 64% pass@1 on Terminal-Bench 2.0. Eighteen months ago, the same family of coding agents was below 15% on comparable tasks. METR's March 2025 analysis reported agent task-completion horizons doubling roughly every four months through 2024–2025. Most of the gain comes from post-training: more task variety, better tool-use trajectories, reinforcement learning on verifier-graded rollouts.

AB-Terminal Bench is independent of the official Terminal-Bench benchmark. It is produced internally at Abaka but follows the Terminal-Bench taxonomy and Harbor-compatible runtime conventions, because that stack has become the reference standard for terminal-agent work. Each task includes a self-contained Docker environment, a natural-language brief, a pytest-based verifier, and an oracle solution that proves the task is solvable. Where tasks derive from open-source materials such as GitHub Issues, pull requests, or Kaggle notebooks, we preserve attribution and apply source-specific license review before inclusion.

The corpus spans five categories of engineering work that together cover most of what a software engineer does day-to-day:

  • Bugfix — locate and patch a defect inside an unfamiliar repository
  • New Feature — understand existing architecture and extend it
  • Performance Optimization — find a bottleneck and improve a measurable target
  • ML / Data Science — end-to-end modeling against a metric threshold
  • ETL / Data Processing — build a deterministic pipeline with schema and value-level guarantees

Task example

Below is a task from the ML/DS category, derived from a Kaggle materials-science competition. The agent receives a dataset of material structures (a train.json with labeled rows and test.json without) and is asked to train a regression model predicting a numeric property called hform.

Task brief excerpt. Predict the target column hform. Use an 80/20 train/validation split with random_state=42. Feature pipeline must be deterministic and identical for train and test. Validation RMSE must be ≤ 0.20. Write submission.csv, a valid.txt containing the RMSE, and a main.py that runs the full pipeline end-to-end.
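For orientation, a minimal sketch of a main.py that would satisfy this brief is shown below. The flat-table feature extraction and the GradientBoostingRegressor are illustrative assumptions, not the oracle solution that ships with the task.

```python
# Hypothetical sketch of a main.py for this task; column names and the model
# choice are assumptions, not the task's actual oracle solution.
import json
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def load(path):
    # Both files are assumed to be JSON records convertible to a flat frame.
    with open(path) as f:
        return pd.json_normalize(json.load(f))

train = load("train.json")
test = load("test.json")

# Deterministic feature pipeline: the same numeric columns for train and test.
feature_cols = [c for c in train.columns
                if c not in ("id", "hform") and pd.api.types.is_numeric_dtype(train[c])]
X, y = train[feature_cols], train["hform"]

# 80/20 split with the fixed seed required by the brief.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_tr, y_tr)

rmse = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))
assert rmse <= 0.20, f"validation RMSE {rmse:.4f} exceeds the 0.20 bar"

# Emit the artifacts the verifier checks.
submission = pd.DataFrame({"id": test["id"], "hform": model.predict(test[feature_cols])})
submission.to_csv("submission.csv", index=False)
with open("valid.txt", "w") as f:
    f.write(f"{rmse:.6f}\n")
```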

The verifier is a pytest suite that encodes the brief's acceptance criteria. It runs 12 assertions on the agent's response, including these four:

  • Schema alignment. submission.csv must contain exactly [id, hform] in that column order, with the same row count and row ordering as test.json. Weak agents often produce outputs that look plausible but swap columns or misalign rows.
  • Numeric validity. Every prediction has to be a finite number. Catches agents whose model silently returns NaN on some test rows.
  • Accuracy threshold. The RMSE in valid.txt is ≤ 0.20. A fixed acceptance bar, not grade-on-a-curve.
  • Reproducibility. main.py must run standalone inside the container and produce a bit-identical submission on a second run. Catches notebook-style solutions with hidden environment dependencies or nondeterministic training.

Four categories of check, covering four distinct classes of failure. A strong model can write a plausible task brief in one shot. Writing the full rubric, and getting every assertion to actually fail on the agents it should fail on, is the harder part. Across the corpus, Hard-tier tasks like this one typically carry 10 or more independent checks per task; easier tasks carry fewer, but the shape is the same: multiple assertions, covering distinct failure modes.
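To make this concrete, here is a compressed sketch of how those four checks could be written as pytest assertions. File names follow the brief; everything else (helper logic, the repeated-run mechanics, the remaining eight assertions) is simplified or omitted for illustration.

```python
# Hypothetical, compressed sketch of the four verifier checks described above.
import hashlib
import json
import subprocess
import numpy as np
import pandas as pd

def test_schema_alignment():
    sub = pd.read_csv("submission.csv")
    test_rows = pd.json_normalize(json.load(open("test.json")))
    assert list(sub.columns) == ["id", "hform"]                 # exact columns, exact order
    assert len(sub) == len(test_rows)                           # same row count
    assert (sub["id"].values == test_rows["id"].values).all()   # same row ordering

def test_numeric_validity():
    sub = pd.read_csv("submission.csv")
    assert np.isfinite(sub["hform"].to_numpy(dtype=float)).all()  # no NaN or inf predictions

def test_accuracy_threshold():
    rmse = float(open("valid.txt").read().strip())
    assert rmse <= 0.20                                          # fixed acceptance bar

def test_reproducibility():
    # Run the agent's pipeline twice and require a bit-identical submission.
    digests = []
    for _ in range(2):
        subprocess.run(["python", "main.py"], check=True)
        digests.append(hashlib.sha256(open("submission.csv", "rb").read()).hexdigest())
    assert digests[0] == digests[1]
```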

Building a task

The pipeline has five stages, each producing one artifact of the final task.

Figure 1. Task construction flow.
Each of the first four stages produces an artifact built by a specialized agent chain; Evaluation runs adjudication followed by execution-based validation on the assembled candidate.

Environment parses the source material (a Kaggle notebook with its dataset, a GitHub Issue/PR pair, or an algorithm repository) and produces a minimal Dockerfile plus the data bundle the agent will need. Query turns the source material into the natural-language instruction and task metadata. Test produces the pytest rubric. Solution produces the oracle script. Evaluation validates that the completed candidate works as a coherent task.
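As a rough mental model, the assembled candidate can be pictured as a bundle of the four stage artifacts plus evaluation metadata. The field names below are illustrative, not the pipeline's actual schema.

```python
# Illustrative representation of a task candidate; field names are assumptions,
# not the pipeline's actual schema.
from dataclasses import dataclass, field

@dataclass
class TaskCandidate:
    task_id: str
    category: str           # Bugfix, New Feature, Performance, ML/DS, or ETL
    difficulty: str         # Easy / Medium / Hard
    dockerfile: str         # Environment: minimal image definition
    data_files: list[str]   # Environment: data bundle mounted into the container
    instruction: str        # Query: natural-language brief given to the agent
    test_suite: str         # Test: pytest rubric encoding the acceptance criteria
    oracle_solution: str    # Solution: script proving the task is solvable
    review_notes: dict = field(default_factory=dict)  # Evaluation: adjudication output
```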

Each of the first four stages runs a small chain of agents: a Builder that drafts the artifact, then a Reviewer that checks it against a rubric specific to that stage. Tests get an extra Planner stage because they are the most failure-prone output. Without a Planner to decompose what needs to be verifiable, a Builder tends to write assertions that match its own mental model of the task rather than the task's actual specification.
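A minimal sketch of that stage-level control flow is below, assuming a generic call_agent(role, prompt) wrapper; the real prompts, rubrics, and retry policy are not shown.

```python
# Hypothetical sketch of a stage's Builder -> Reviewer loop, with an extra
# Planner step for the Test stage. call_agent() is an assumed LLM wrapper.
MAX_REVISIONS = 3

def call_agent(role: str, prompt: str) -> str:
    # Assumed LLM wrapper; in the real pipeline each role has its own prompt and model.
    raise NotImplementedError

def run_stage(stage_name, source_material, rubric, needs_planner=False):
    plan = None
    if needs_planner:
        # Decompose what must be verifiable before any assertion is written.
        plan = call_agent("planner", f"List the checks required by:\n{source_material}")

    draft = call_agent("builder", f"Produce the {stage_name} artifact.\n"
                                  f"Source:\n{source_material}\nPlan:\n{plan}")
    for _ in range(MAX_REVISIONS):
        review = call_agent("reviewer", f"Check this {stage_name} artifact against the rubric.\n"
                                        f"Rubric:\n{rubric}\nArtifact:\n{draft}")
        if "APPROVED" in review:
            return draft
        draft = call_agent("builder", f"Revise the artifact.\nFeedback:\n{review}\n"
                                      f"Previous:\n{draft}")
    raise RuntimeError(f"{stage_name} did not pass review after {MAX_REVISIONS} revisions")
```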

Evaluation is where the pipeline becomes adversarial, and it runs in two stages. First, a multi-agent flow checks that Environment, Query, Test, and Solution agree with each other and with the source material's intent. It looks for structural mismatches (instructions that reference files not in the environment; tests that expect outputs the solution does not produce), ambiguity (tasks with two valid interpretations that would each produce different graded outputs), and intent drift (tasks internally consistent but subtly off from the source's intent). The Bar Raiser review in the figure is the adversarial final pass whose job is to catch cases earlier reviewers let through.

Then an execution-based check runs the oracle solution inside a clean environment and applies the full pytest rubric, proving the task is solvable end-to-end without relying on LLM judgment. Finally, the same task runs against several real agent stacks: tasks every model passes are flagged as likely too easy, tasks no model passes are re-inspected, and the middle of the distribution becomes a useful quality signal.
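A rough sketch of the execution-based check, assuming each candidate ships as a directory containing its Dockerfile, an oracle entry point, and a tests/ folder; image names, paths, and flags are illustrative.

```python
# Illustrative execution-based validation: build the environment, run the oracle,
# then apply the pytest rubric. Paths and entry points are assumptions.
import subprocess

def validate_candidate(task_dir: str, tag: str = "task-candidate") -> bool:
    # Build a clean image from the task's own Dockerfile.
    subprocess.run(["docker", "build", "-t", tag, task_dir], check=True)

    # Run the oracle solution, then the full pytest rubric, inside one container.
    result = subprocess.run(
        ["docker", "run", "--rm", tag,
         "bash", "-lc", "bash /task/oracle.sh && pytest -q /task/tests"],
        capture_output=True, text=True,
    )
    return result.returncode == 0
```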

Humans do not manually review every task. They refine the rubrics, audit sampled outputs, and inspect suspicious failures. The task-by-task reviewing itself is handled by specialized reviewer agents.

Signal quality: task-level discrimination

Before using the corpus for post-training, we checked whether it exhibited a reasonable difficulty profile. A useful training dataset should separate stronger agents from weaker ones: it is most informative when it avoids collapsing stronger systems into the same score and retains enough headroom that weaker systems do not already pass most tasks. To assess this, we ran four agent stacks (agentic CLI + model) on a 24-task diagnostic slice and measured pass@1:

Figure 2. Pass@1 across four agent stacks on a 24-task diagnostic slice.
Three frontier stacks landed at 37.5%. Claude Code + Opus 4.6, Codex + GPT-5.2 Codex, and OpenCode + Opus 4.5 each cleared nine of twenty-four tasks, although the overlap was not exact. OpenCode + Grok 4 fell to 20.8%, clearing five. Given the size of the slice, the exact percentages should not be overinterpreted. What matters here is the overall pattern: the stronger stacks cluster together, while the weaker stack separates clearly. That is roughly the shape we expect at this scale.

Grouping the 24 tasks by category shows where the discrimination comes from:

Figure 3. Pass rate by task category across the same four agent stacks.
*Pass rate = the fraction of stacks that cleared a task, averaged across the tasks in the category.

Some categories are clearly easy: tabular regression and classification tasks pass on most stacks. At the hard end, multi-file bugfixes in large unfamiliar codebases sit near zero across all four systems. Frontier agents are competitive on structured prediction tasks, but still struggle to localize and fix bugs without prior context. The middle band is where post-training signal lives: ETL tasks, smaller library changes, and some performance optimization problems are neither trivial nor uniformly unsolved. These are the tasks most likely to separate systems and show measurable gains from post-training.
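To make the footnote's definition concrete, here is a hypothetical computation of the per-category pass rate from a small, made-up task-by-stack results table.

```python
# Hypothetical computation of the per-category pass rate defined in the footnote:
# for each task, the fraction of stacks that cleared it; then the mean over
# the tasks in each category. The values below are made up.
import pandas as pd

results = pd.DataFrame({
    "task":     ["t1", "t1", "t2", "t2", "t3", "t3"],
    "category": ["ML/DS", "ML/DS", "Bugfix", "Bugfix", "ETL", "ETL"],
    "stack":    ["A", "B", "A", "B", "A", "B"],
    "passed":   [1, 1, 0, 0, 1, 0],
})

per_task = results.groupby(["category", "task"])["passed"].mean()   # fraction of stacks per task
per_category = per_task.groupby(level="category").mean()            # averaged over tasks
print(per_category)
```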

These diagnostics were run during early corpus calibration. We then fixed a 120-task held-out slice (24 per category across the five categories, disjoint from the training pool) as the evaluation set for every post-training number that follows.

Post-training on Qwen3-32B

We ran one round of supervised fine-tuning on Qwen3-32B, training on an oracle-pass filtered subset of AB-Terminal Bench (tasks whose oracle solution cleanly passes the full pytest rubric and whose generation cleared every pipeline stage).

Figure 4. Pass@1 on the 120-task held-out slice of AB-Terminal Bench, disjoint from the training subset.
The result is straightforward. A single SFT pass on fewer than 2,000 training examples raises pass@1 from 3.37% to 17.24% — about a 5× relative lift. This remains well below frontier systems, which have been through orders of magnitude more post-training and inference-time scaffolding than five epochs on 1.8k examples. But the result carries a real signal: the dataset is clean enough, the format compatible enough, and the filter doing enough work that one low-overhead SFT pass moves a 32B model meaningfully.

The gains are not uniform across categories.


Figure 5. Per-category pass@1 on the 120-task held-out eval (frontier reference: Claude Opus 4.6).
The post-training signal is strongest in the categories with the most metric-driven acceptance criteria. ML / Data Science rises from 5% to 22%, and ETL from 4% to 19%. Bugfix and New Feature also improve, though less sharply. Performance Optimization moves the least, which is expected as these problems often depend on multi-turn systems reasoning that a single SFT pass can't fully simulate.

The per-difficulty breakdown confirms the picture.

Figure 6. Per-difficulty pass@1 on the 120-task held-out eval.
Most of the lift comes from the Easy band, which moves from 5% to 27%. The Medium band also improves, from 2% to 10%. Hard tasks move only slightly, from 1% to 4%. That is a fairly intuitive result for one supervised pass: easier tasks respond first, medium tasks respond somewhat, and the hardest tasks—especially long-horizon debugging across real codebases—likely need more than imitation alone. Taken together, this does not show that a single SFT run closes the gap to frontier systems; it does not. What's most informative is that the corpus moves the model in the right places, and in a way that is visible on held-out tasks.

Closing

AB-Terminal Bench is ultimately an attempt to turn task quality from something anecdotal into something procedural. The hard part is not generating plausible instructions, but building tasks that are executable, verifiable, solvable, and discriminative at the same time. It takes a pipeline that can decompose, check, and reject at every stage before a task ever reaches training. As agent capability becomes bottlenecked by post-training data, manufacturing tasks at this standard is the infrastructure problem that matters.

Currently, we use the corpus internally for our own SFT and RL experiments. If this shape of data fits what you are building, contact Abaka.

  • AB-Terminal Bench is an independent Abaka dataset. Not affiliated with Terminal-Bench, Stanford University, or the Laude Institute.
  • All numbers are pass@1. Qwen3-32B pass@1 is averaged across seeds under the Terminus-2 harness.


FAQ

What is AB-Terminal Bench?

AB-Terminal Bench is a structured dataset of containerized coding tasks designed to train and evaluate terminal-based AI agents. Each task includes a runnable environment, instructions, verification tests, and a ground-truth solution.

Why are terminal-based datasets important for AI agents?

Terminal environments reflect real-world engineering workflows. They require agents to execute code, manage files, debug systems, and meet strict constraints, making them essential for training practical coding agents.

What makes a high-quality agent training dataset?

High-quality datasets must be executable, verifiable, and discriminative. This means tasks must run end-to-end, include objective evaluation criteria, and be difficult enough to differentiate between models.

How does AB-Terminal Bench improve model performance?

Even a single round of supervised fine-tuning on curated tasks can significantly improve agent performance, as the dataset provides structured reasoning, tool-use trajectories, and precise evaluation signals.

