2025-12-23 · Research

EditReward: Abaka AI’s Human-Aligned Reward Model for Image Editing

Hazel Gao, Marketing Manager

EditReward is Abaka AI’s new human-aligned reward model built to solve the biggest bottleneck in AI image editing: the lack of a reliable, interpretable, high-fidelity “judge.” Trained on 200K+ expert-annotated preference pairs and designed with multidimensional reasoning, EditReward outperforms GPT-5 and GPT-4o on GenAI-Bench and AURORA-Bench, enabling the entire open-source ecosystem to build higher-quality, instruction-faithful generative models.

Introducing EditReward: The AI Judge Bringing Human Sense to Image Editing

Have you ever wondered why open-source AI image editors, while impressive, often struggle to match the flawless performance of closed-source giants like Google's Nano Banana or OpenAI's GPT-Image? The secret isn't just about model architecture; it's about the quality of the data they're trained on.

The biggest bottleneck for open-source AI has been the lack of a reliable "judge" — an AI that can accurately tell a good image edit from a bad one. Without a good judge, you can't create the high-quality training data needed to build a great editor.

Today, we're pulling back the curtain on a project that tackles this problem head-on. Introducing EditReward, a human-aligned reward model designed to serve as a fair, consistent, and incredibly accurate critic for instruction-guided image editing.

The Problem: Why Is Judging Image Edits So Hard?

Imagine trying to train a chef using a food critic who only says "I like it" or "I don't." It's not very helpful, right? This is the problem facing AI image editing. Current "reward models" (the AI critics) are often unreliable:

  • Some are trained on noisy, inconsistent data from crowd-sourced platforms.
  • Others use labels generated by proprietary models, which can be biased and inaccurate.
  • Most provide a single, vague score, failing to capture why an edit is good or bad. Is it visually stunning but ignored the user's instructions? Or did it follow the prompt perfectly but create a distorted, unrealistic image?

This lack of a reliable critic has held the open-source community back. To build a truly state-of-the-art model, you first need a state-of-the-art judge.

EDITREWARD-DATA: The Foundation of Excellence

This is where the journey of EditReward begins, with a foundational contribution from data experts at 2077AI, an open-source research initiative led by Abaka AI. We knew that, to train a world-class AI judge, we first needed to build the world's best "rulebook" — a dataset that embodies what humans truly consider a high-quality edit.

This led to the creation of EditReward-data. This isn't just another dataset; it's a meticulously curated collection of over 200,000 human preference pairs.

Here’s what makes it different, and where the expertise of the 2077AI team shines:

  • Expert Annotation: Instead of noisy crowd-sourcing, every single data point was annotated by trained experts following a rigorous, standardized protocol.
  • Multi-Dimensional Scoring: We moved beyond a single score. Our experts rated each edit on two distinct axes: Instruction Following (did it do what you asked?) and Visual Quality (does it look good and realistic?).
  • Diversity: The data covers a massive range of edits from seven state-of-the-art models, ensuring our final judge is fair and unbiased.
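To make the two-axis idea concrete, here is a minimal, hypothetical sketch of what a single preference pair with dual expert scores could look like. The field names and 1–5 scale are illustrative assumptions, not the actual EditReward-Data schema:

```python
# Hypothetical sketch of a two-axis preference record; the real
# EditReward-Data schema may differ.
from dataclasses import dataclass


@dataclass
class EditRating:
    instruction_following: int  # e.g. 1-5: did the edit do what was asked?
    visual_quality: int         # e.g. 1-5: does it look clean and realistic?


@dataclass
class PreferencePair:
    instruction: str
    rating_a: EditRating  # expert scores for candidate edit A
    rating_b: EditRating  # expert scores for candidate edit B

    def preferred(self) -> str:
        """Pick the preferred edit by summed axis scores (equal sums -> 'tie')."""
        total_a = self.rating_a.instruction_following + self.rating_a.visual_quality
        total_b = self.rating_b.instruction_following + self.rating_b.visual_quality
        if total_a == total_b:
            return "tie"
        return "A" if total_a > total_b else "B"


pair = PreferencePair(
    instruction="Replace the sky with a sunset",
    rating_a=EditRating(instruction_following=5, visual_quality=4),
    rating_b=EditRating(instruction_following=3, visual_quality=5),
)
print(pair.preferred())  # A wins: 9 vs. 8
```

Separating the two axes is what lets an annotator record, for example, that an edit followed the prompt perfectly but degraded visual quality, rather than collapsing both judgments into one opaque number.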

Building this dataset was a monumental task, reflecting 2077AI's commitment to pioneering the high-fidelity data infrastructure that empowers the entire open-source ecosystem.

Training the Ultimate Image Critic

With this gold-standard dataset in hand, we trained EditReward. We used a powerful Vision-Language Model (VLM) as its backbone and taught it to think like our human experts.

We employed a sophisticated training strategy called Multi-Dimensional Uncertainty-Aware Ranking. It’s a mouthful, but the concept is intuitive: we taught the model to understand that a great edit is a balance of different factors and to weigh them accordingly. It learns not just to pick a winner between two images, but to understand the nuanced trade-offs between following instructions perfectly and achieving visual perfection.
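The exact training objective isn't reproduced here, but as an illustrative sketch of the general idea, one common way to make pairwise ranking uncertainty-aware is a Thurstone-style probit loss: the model predicts a mean score and a variance for each edit, and the probability that one edit beats the other falls out of comparing the two Gaussians:

```python
import math


def win_probability(mu_a, var_a, mu_b, var_b):
    """P(A beats B) under a Gaussian score model (Thurstone-style sketch).

    Each edit gets a predicted mean score and a variance expressing the
    model's uncertainty; the difference mu_a - mu_b is then Gaussian with
    variance var_a + var_b, and P(A > B) is its CDF evaluated at zero.
    """
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ranking_nll(mu_a, var_a, mu_b, var_b, a_preferred: bool):
    """Negative log-likelihood of the observed human preference."""
    p = win_probability(mu_a, var_a, mu_b, var_b)
    return -math.log(p if a_preferred else 1.0 - p)


# A confident, correct ranking incurs almost no loss...
low = ranking_nll(mu_a=0.9, var_a=0.01, mu_b=0.2, var_b=0.01, a_preferred=True)
# ...while high predicted variance pulls the win probability toward 0.5,
# softening the penalty when a comparison is genuinely ambiguous.
soft = ranking_nll(mu_a=0.9, var_a=1.0, mu_b=0.2, var_b=1.0, a_preferred=False)
print(low, soft)
```

In a formulation like this, predicting high variance on a genuinely ambiguous pair reduces the penalty for ranking it "wrong", which is one way a model can learn the nuanced trade-offs between instruction following and visual quality rather than forcing a hard verdict on every pair.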

Figure: EditReward in action, correctly assigning high scores to successful edits and low scores to failed ones, demonstrating strong alignment with human judgment.

The Results

So, does it work? The results speak for themselves.

EditReward doesn't just perform well; it sets a new standard for AI evaluation.

  • Outperforming Giants: On established benchmarks like GenAI-Bench and AURORA-Bench, EditReward achieves a higher correlation with human judgment than powerful proprietary models like GPT-5 and GPT-4o.
  • Massive Uplift: When we applied our training framework to a standard open-source VLM, its performance as a judge skyrocketed, improving by over 23 points on GenAI-Bench. This proves the power of our high-quality data and training methodology.


But the most exciting result is its practical application. We used EditReward to filter a large, noisy dataset of 46,000 images down to a high-quality subset of 20,000. When we trained a leading open-source model, Step1X-Edit, on this smaller, curated dataset, its performance significantly surpassed training on the full, noisy dataset.
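Schematically (this is an illustrative sketch, not Abaka AI's released pipeline), reward-based filtering amounts to scoring every example with the judge and keeping only the top-scoring subset:

```python
def filter_by_reward(examples, score_fn, keep: int):
    """Score each example with a reward model and keep only the `keep`
    highest-scoring ones -- quality over quantity."""
    scored = sorted(examples, key=score_fn, reverse=True)
    return scored[:keep]


# Toy stand-in for EditReward: here the "reward" is just a stored float;
# in practice score_fn would run the reward model on each edit.
dataset = [{"id": i, "reward": r} for i, r in enumerate([0.2, 0.9, 0.5, 0.7, 0.1])]
curated = filter_by_reward(dataset, score_fn=lambda ex: ex["reward"], keep=3)
print([ex["id"] for ex in curated])  # highest-reward examples first: [1, 3, 2]
```

Applied at scale, the same top-k selection takes a 46,000-image pool down to the 20,000 highest-reward examples, which is exactly the curation step that let the smaller dataset outperform the full noisy one.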

This is the key takeaway: quality over quantity. A powerful AI judge like EditReward is the key to unlocking the next generation of generative models.

Why This Matters for the Future of AI

EditReward represents more than a new model — it represents a shift in how the open-source community approaches alignment.

At Abaka AI, our mission is to build the data and evaluation infrastructure that makes AI trustworthy, human-aligned, and production-ready. EditReward is part of that mission:

  • A reliable, interpretable critic for generative models
  • A scalable, high-fidelity dataset created through expert human workflows
  • A benchmark suite (EditReward-Bench) that enables transparent progress tracking

By collaborating with top researchers and contributing the foundational data work, we’re helping open-source AI compete on equal terms with proprietary giants.

If you’re building:

  • Image editing models
  • Reward models
  • Evaluation pipelines
  • Human preference datasets
  • Foundation models needing alignment signals

We’d love to partner.

Contact us → Abaka AI can build the high-fidelity data engine behind your next breakthrough.

