2026-03-26/Research

AI Safety Evaluation Data | Red Teaming & AI Safety

Longtian Ye, Member of Technical Staff

AI safety evaluation data is the structured, continuously generated adversarial data that reveals how models fail under real-world conditions. It is the missing layer that enables red teaming, multi-turn attack testing, and reliable safety assurance beyond static benchmarks.

Every Model Breaks: Building the Data Layer Behind AI Safety Evaluation

AI safety evaluation has scaled fast. HarmBench standardized 18 attack methods against 33 models under a unified framework (Mazeika et al., 2024). OpenAI and Anthropic jointly red-teamed each other's frontier models (OpenAI & Anthropic, 2025). A meta-analysis surveyed 210 safety benchmarks, the most comprehensive audit of the field to date (Yu et al., 2026). And the UK AI Safety Institute ran 1.8 million attacks across 22 frontier models (UK AISI, 2026).

Every single model broke.

If every model breaks under sufficient adversarial pressure, the value of safety evaluation lies not in pass/fail verdicts but in understanding how models break — along which dimensions, under which strategies, at what severity. The bottleneck is not the frameworks or the models. It is the data.

What AI Safety Benchmarks Miss

Current benchmarks reliably detect known attack patterns, but a meta-analysis of 210 safety benchmarks exposes how narrow those conditions are: 81% test only predefined risks with fixed prompts, 68% are single-turn only, 79% reduce safety to binary pass/fail, and 89% run on static data (Yu et al., 2026). Yet multi-turn attacks are where models actually break — distribute harmful intent across 5–20 conversational turns and failure rates climb to 75% (Li et al., 2024). Intent Laundering achieves 90–98% ASR by stripping triggering cues while preserving harmful intent (Golchin & Wetter, 2026). And measurement itself is unstable: the same model shows 4.7% ASR on a single attempt but 63% at 100 attempts (VentureBeat, 2025).
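The attempt-count effect is worth making concrete. Under the idealized assumption that attempts are independent, a per-attempt success rate p scales to 1 - (1 - p)^n over n attempts; a minimal sketch (not any lab's actual methodology):

```python
def best_of_n_asr(p_single: float, n: int) -> float:
    """Attack success rate over n attempts, assuming each attempt
    succeeds independently with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n

# The single-attempt ASR cited above, scaled to 100 attempts:
print(f"{best_of_n_asr(0.047, 100):.1%}")  # prints 99.2% under independence
```

The observed 63% at 100 attempts sits well below the roughly 99% an independence model predicts, which suggests repeated attempts against the same model are strongly correlated. Either way, an ASR figure is uninterpretable without its attempt budget.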

A Rumsfeld Matrix from Yu et al. (2026) maps safety risk coverage into four quadrants — and the distribution is stark. 81% of benchmarks sit in the Known Knowns quadrant: fixed prompts testing predefined risks, where marginal value is lowest. Only 3% attempt to probe Unknown Knowns (risks we could anticipate but don't test for) and Unknown Unknowns (novel failure modes no one has catalogued yet) — where the most dangerous failures live.

Figure 1: Rumsfeld Matrix — AI Safety Benchmark Coverage Distribution (Yu et al., 2026)
The AI Safety Data Problem

Red teaming is not labeling — it is generative. Adversaries probe a live model, adapt in real time, and produce failure cases that did not exist before the interaction. This process is an arms race — the adversarial search space is open-ended and effective red teaming requires understanding model internals (safety filters, refusal patterns, scoring heuristics, etc.). In particular, human and automated red teaming are complementary — humans routinely break models that automated methods rate as robust (Mazeika et al., 2024) — which demands a hybrid pipeline that leverages both. And serving multiple evaluation paradigms (RLHF training, regression testing, compliance documentation, etc.) calls for flexible infrastructure.

What is missing is not another benchmark. It is a continuous capability for producing, structuring, validating, and refreshing adversarial safety evaluation data. The requirements: strategy-structured, severity-graded, prevalence-calibrated, multi-turn, continuously refreshed, generated through human-AI hybrid adversarial interaction, and campaign-ready for targeted deployment. Together, these describe an operational capability — beyond a simple dataset.

Our Approach: Building AI Safety Evaluation Data

Abaka AI brings deep data engineering expertise — our annotation platform, global expert network, multi-layered quality assurance, and RLHF training data experience — to safety evaluation. We apply this infrastructure to the safety data problem through a three-layer approach.

Layer 1: Expert Adversarial Workforce

The foundation is domain-specialist red teamers sourced and managed through Abaka's annotation infrastructure. We recruit not general-purpose annotators but domain specialists — security researchers, linguists, social engineers, policy experts — vetted for cybersecurity certifications, linguistic expertise (low-resource languages), and biosafety backgrounds. Abaka's global annotator network provides the recruitment reach to source these specialists across regions and disciplines.

Each interaction is captured through a structured annotation schema that goes far beyond binary harm labels:

  • Attack strategy type — from an extended taxonomy: conditional misdirection, creative reframing, authority exploitation, incremental escalation, low-resource language translation, encoding-based attacks, role-play exploitation
  • Severity — graduated scale from mild to critical, prevalence-calibrated against real-world frequency data
  • Conversation trajectory — turn count, escalation arc, intermediate model responses at each turn
  • Failure mechanism — where exactly the model's safety broke down
  • Red teamer rationale — the reasoning and decision process behind each adversarial move

This granularity aims to transform red teaming from a pass/fail exercise into a rich data source for safety improvement.

Quality assurance applies Abaka's multi-layered QA, covering expert review, cross-validation, consensus labeling, and automated error detection. Delivery is flexible: we support campaigns for targeted evaluations, or ongoing embedded engagements with dedicated specialists working within a client's safety workflow. In practice, human-expert rubrics have been shown to outperform synthetic alternatives in driving RL training gains.

Layer 2: Synthetic Amplification Pipeline

Expert-generated adversarial strategies are valuable but expensive to produce at volume. Layer 2 takes these high-quality seeds and scales them through an LLM-driven automated pipeline with human validation.

Suppose a multi-turn escalation bypasses safety filters through incremental context-shifting: our pipeline generates hundreds of variants through multi-turn interactive generation, preserving the underlying adversarial logic across full conversation trajectories, not just single-prompt rephrasing. Generation is coverage-aware, mapping gaps across:

  • Strategy types — 7+ categories from conditional misdirection to role-play exploitation
  • Severity levels — ensuring representation from mild to critical
  • Harm categories — mapped to standard taxonomies
  • Languages and cultural contexts

Human-in-the-loop validation is essential. Automated screening removes duplicates, trivially detectable attacks, and severity mismatches. Expert red teamers then validate synthetic outputs for contextual realism, refine promising variants, and discard ones that are artificial or unrepresentative.

The pipeline is self-improving: each generation batch informs the next, with failed attacks providing diagnostic signals and refined filtering criteria propagating forward. Research demonstrates 90%+ ASR with quality-diversity synthetic generation approaches — but purely synthetic methods inherit generator biases (Samvelyan et al., 2024). Automation amplifies expert signal; it does not replace it.
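Coverage-aware generation amounts to bookkeeping over the strategy-by-severity grid. A minimal hypothetical sketch of how under-covered cells could steer the next batch (the taxonomy names and threshold are illustrative):

```python
from collections import Counter
from itertools import product

STRATEGIES = ["conditional_misdirection", "creative_reframing", "authority_exploitation",
              "incremental_escalation", "low_resource_translation", "encoding_attack",
              "role_play_exploitation"]
SEVERITIES = ["mild", "moderate", "severe", "critical"]

def coverage_gaps(dataset, min_per_cell=20):
    """Return (strategy, severity) cells with fewer than min_per_cell examples,
    most under-covered first; these steer the next generation batch."""
    counts = Counter((ex["strategy"], ex["severity"]) for ex in dataset)
    return sorted(
        (cell for cell in product(STRATEGIES, SEVERITIES) if counts[cell] < min_per_cell),
        key=lambda cell: counts[cell],
    )
```

The same pattern extends to harm categories and languages by widening the cell key.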

Layer 3: Continuous Data Operations


Safety evaluation data must be a living capability, not a static dataset. Layer 3 provides the operational infrastructure to sustain this.

Every red team interaction is logged with full trajectory: not just pass/fail, but the complete chain of strategy attempted, model response at each turn, where the red teamer adapted, and why. This level of granularity enables step-level reward shaping for RLHF safety training, fine-grained failure analysis to pinpoint where safety degrades within a conversation, and strategy-level aggregation into which adversarial approaches are most effective across model families.

Temporal versioning ensures each dataset carries metadata on generation date, model version tested, last validation date, and contamination risk score. Stale test cases are flagged and cycled out, ensuring continued relevance and robustness to evolving red-teaming and jailbreak techniques.
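A hypothetical staleness check over that metadata might look like this (the field names and thresholds are illustrative, not our production values):

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class CaseMetadata:
    generated: date
    model_version: str
    last_validated: date
    contamination_risk: float   # 0.0 (clean) .. 1.0 (likely leaked into training data)

def is_stale(meta: CaseMetadata, today: date,
             max_age_days: int = 180, max_contamination: float = 0.5) -> bool:
    """Flag a test case for re-validation or retirement."""
    return (today - meta.last_validated > timedelta(days=max_age_days)
            or meta.contamination_risk > max_contamination)
```

Cases flagged this way are either re-validated against current model versions or retired from the active pool.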

Safety data is tagged by deployment context, so teams can pull evaluation sets matched to their specific risk profile, including:

  • Sustained-pressure testing — persistent adversary threat models
  • Rapid-iteration testing — fast-patching workflows
  • Agentic safety testing — tool-use, code execution, and multi-agent deployments

This approach supports multiple evaluation paradigms, instead of collapsing them into a single standard that privileges one threat model over another. Our standard, integration-ready delivery for training and evaluation pipelines reflects an ongoing data service, not a one-time handoff.
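Tag-based selection of this kind can be sketched in a few lines (the tag names and record shape are illustrative):

```python
def pull_eval_set(dataset, context_tags):
    """Select test cases whose tags overlap the client's deployment context,
    e.g. {"agentic", "tool_use"} for an agent product."""
    wanted = set(context_tags)
    return [ex for ex in dataset if wanted & set(ex.get("tags", []))]

cases = [
    {"id": 1, "tags": ["agentic", "tool_use"]},
    {"id": 2, "tags": ["single_turn", "chat"]},
]
agentic = pull_eval_set(cases, {"agentic"})  # selects case 1 only
```

Because selection is a filter over tags rather than a fixed benchmark split, the same pool serves sustained-pressure, rapid-iteration, and agentic testing without duplication.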

The Future of AI Safety: From Benchmarks to Data Infrastructure

The EU GPAI Code of Practice (enforcement August 2026) requires documented adversarial testing evidence. Insurance "AI Security Riders" now require documented red teaming. The attack surface is expanding as models gain tool use, code execution, and autonomous action (tool poisoning, privilege escalation, cascading failures, etc.). Regulatory mandates and the expansion into agentic systems both converge on the same need — safety evaluation must cover what models do, not just what they say.

The field has built strong evaluation frameworks and mature attack tooling. What it has not built is the data layer to feed them — strategy-structured, severity-graded, continuously refreshed, generated through expert adversarial interaction. That is what we provide.

If you are building frontier models, training agents for enterprise deployment, or preparing for regulatory compliance, Abaka AI provides the safety evaluation data infrastructure to move from benchmark scores to operational safety assurance.

FAQs

What is AI safety evaluation data?

AI safety evaluation data is structured adversarial data generated through red teaming, used to test how models behave under harmful, ambiguous, or multi-turn scenarios.

How is red teaming different from standard AI evaluation?

Red teaming actively probes models with adaptive attacks to uncover failure modes, while standard evaluation relies on fixed prompts and predefined test cases.

What makes high-quality AI safety evaluation data?

High-quality safety data is multi-turn, strategy-structured, severity-graded, and continuously updated to reflect evolving attack patterns and real-world risks.

References

  1. HarmBench — Mazeika et al., 2024 · https://www.emergentmind.com/topics/harmbench-framework
  2. Li et al., LLM Defenses to Multi-Turn Human Jailbreaks, 2024 · https://arxiv.org/abs/2408.15221
  3. Samvelyan et al., Rainbow Teaming, 2024 · https://arxiv.org/abs/2402.16822
  4. Yu et al., 2026 meta-analysis · https://arxiv.org/html/2601.23112v2
  5. OpenAI & Anthropic joint evaluation, 2025 · https://openai.com/index/openai-anthropic-safety-evaluation/
  6. VentureBeat: Red Teaming LLMs, 2025 · https://venturebeat.com/security/red-teaming-llms-harsh-truth-ai-security-arms-race
  7. UK AISI Frontier AI Trends, 2026 · https://www.grayswan.ai/blog/uk-aisi-x-gray-swan-agent-red-teaming-challenge-results-snapshot
  8. Intent Laundering — Golchin & Wetter, 2026 · https://arxiv.org/html/2602.16729v1
  9. EU GPAI Code of Practice · https://artificialintelligenceact.eu/code-of-practice-overview/

