High-quality pairwise datasets for DPO and RLHF

We compare model outputs and choose the better one, with clear, documented reasoning.
Trusted by leading U.S. companies.
Pairwise comparisons
Human raters compare two model outputs for the same prompt, choose the better response, and explain why, producing clean preference signals for training.
Reward model inputs
We provide structured human feedback ready for reward model training, including labels, short rationales, and quality tags that match your training schema.
Code-focused RLHF signals
We judge correctness, edge cases, naming, readability, and test quality to produce high-signal labels for RLHF on real-world coding problems.
DPO-ready formatting
We deliver preference data in clean schemas tailored to DPO pipelines, so you can plug it directly into your training stack.
Multi-language preference data
We collect preference signals across multiple programming languages so aligned behavior is consistent, not language-specific.
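The feature list above describes pairwise preference records: a prompt, a chosen and a rejected response, a short rationale, and quality tags. A minimal sketch of what one such record might look like, assuming a generic prompt/chosen/rejected layout common in DPO pipelines (all field names and values here are illustrative, not Revelo's actual delivery schema):

```python
import json

# Illustrative pairwise preference record (hypothetical schema).
record = {
    "prompt": "Write a function that reverses a string in Python.",
    "chosen": "def reverse(s):\n    return s[::-1]\n",
    "rejected": (
        "def reverse(s):\n"
        "    out = ''\n"
        "    for c in s:\n"
        "        out = c + out\n"
        "    return out\n"
    ),
    "rationale": "Chosen answer is idiomatic and avoids repeated string copies.",
    "language": "python",
    "quality_tags": ["correct", "idiomatic"],
}

def validate(rec):
    """Check for the fields a typical DPO trainer expects:
    a prompt plus a preferred and a dispreferred completion."""
    required = {"prompt", "chosen", "rejected"}
    missing = required - rec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return True

validate(record)
# Records like this are commonly shipped as one JSON object per line (JSONL).
line = json.dumps(record)
```

A JSONL file of such records can be loaded directly by most preference-tuning stacks; the `rationale` and `quality_tags` fields are extra human-feedback signal that reward-model training can consume or that you can drop before DPO.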
Case study

48-Hour Code Evaluation Audit at Scale for a Frontier AI Lab

A frontier AI lab required a 48-hour code evaluation audit comparing model outputs across qualitative dimensions. The urgency and custom formatting constraints exceeded the client’s internal tooling capacity. Revelo executed the audit on its platform, mobilized expert annotators, and delivered client-ready data on time.
End-to-end turnaround
48h
Delivered tasks
210
Qualitative evaluation dimensions
6
Programming languages
5
“Their pairwise data was consistent, clear, and actually usable.”
Lead Researcher at Meta

Frequently asked questions

Who performs the evaluations?
Our network of 400,000 experienced engineers includes experts across every major programming language.
How quickly can you scale?
Our technology allows us to scale up and down extremely quickly. We can typically spin up new evaluation capacity within 48 hours and onboard hundreds of new developers to your project every week.
How involved does our team need to be?
As much or as little as you want. Some clients are hands-off and just want the data; others want to collaborate deeply on evaluation criteria. We're flexible, though we do recommend initial calibration sessions to align on quality standards.
How do you ensure data quality?
Data quality is our top focus. For a detailed conversation about the processes and systems we use to ensure high-quality data, get in touch with our team.
How do engagements typically start?
Most clients start with a pilot project, usually a few hundred evaluations across different problem types. This lets us calibrate our process to your needs and demonstrate our quality before scaling.

Train better-aligned models with human preferences