Benchmark-based Dataset Generation
High-quality datasets for training and testing LLMs on real code
Human-curated tasks and full problem/solution traces from real repositories.
Trusted by leading U.S. companies.
SWE-bench full traces
Full debugging workflows from real repos: issue, tests, reasoning, patch, and validation logs, structured so models can learn realistic long-horizon code reasoning.
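For illustration only, here is a minimal sketch of what a single full-trace record could contain; the field names and example values below are hypothetical assumptions, not Revelo's actual delivery schema.

```python
# Hypothetical shape of one SWE-bench-style full-trace record.
# Field names and values are illustrative assumptions, not a real export format.
from dataclasses import dataclass


@dataclass
class FullTrace:
    repo: str                   # repository at a pinned commit, e.g. "org/project"
    issue: str                  # the original issue text the engineer worked from
    failing_tests: list[str]    # tests that reproduce the bug
    reasoning_steps: list[str]  # the engineer's step-by-step debugging notes
    patch: str                  # the final unified diff
    validation_log: str         # test-runner output confirming the fix


example = FullTrace(
    repo="example-org/example-repo",
    issue="TypeError raised when parsing empty config files",
    failing_tests=["tests/test_config.py::test_empty_file"],
    reasoning_steps=[
        "Reproduce the failure with the provided test",
        "Trace the crash to a missing empty-input guard in the loader",
        "Add the guard and re-run the test suite",
    ],
    patch="--- a/loader.py\n+++ b/loader.py\n...",
    validation_log="1 passed in 0.42s",
)
```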
TAU-style multi-turn tasks
Multi-step conversations that simulate real debugging or reasoning flows, including tool use, corrections, and branching outcomes grounded in real tasks.
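As a rough, hypothetical illustration of how such a multi-turn task might be serialized (the roles, tool names, and fields are assumptions for the sketch, not a real schema):

```python
# Hypothetical serialization of a TAU-style multi-turn debugging task.
# Roles, tool names, and fields are illustrative assumptions only.
multi_turn_task = {
    "task_id": "example-0001",
    "turns": [
        {"role": "user", "content": "The CI build fails on the lint step. Can you fix it?"},
        {"role": "assistant", "tool_call": {"name": "run_command", "args": {"cmd": "ruff check ."}}},
        {"role": "tool", "content": "app/utils.py:12:1: F401 'os' imported but unused"},
        {"role": "assistant", "content": "Removing the unused import.", "patch": "--- a/app/utils.py\n..."},
        {"role": "user", "content": "Lint passes now, but a unit test broke. Please correct it."},
        {"role": "assistant", "tool_call": {"name": "run_command", "args": {"cmd": "pytest -x"}}},
    ],
    "outcome": {"branch": "fix-accepted", "validated": True},
}
```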
Multi-language repos (Python, JS, Go, Java)
Tasks and traces from real repositories in multiple languages, helping you test and improve how models generalize across different ecosystems and stacks.
Infrastructure-as-Code tasks
Terraform, Pulumi, and CloudFormation tasks that expose models to real infra design, configuration, and validation patterns used in production systems.
UI generation tasks
Figma-to-code and UI component generation tasks drawn from real front-end work, so models learn to produce usable, coherent interfaces.
Case study
Scaling Agentic AI in Weeks with Full-Trace Data
A hyperscaler building a coding agent needed full-trace data to boost performance on benchmarks like SWE-bench. We built a custom workflow, brought in elite engineers with niche repo expertise, and delivered rapid, measurable gains beyond expectations.
Collection velocity
10x
Reduction in average handling time (AHT)
48%
Quality score improvement
20%
Experts activated
210+
“Revelo’s datasets helped us test our model on real software problems we couldn’t find anywhere else.”
Lead Researcher at Meta
Frequently asked questions
What makes your datasets different?
Our network of 400,000 experienced engineers includes experts in virtually every programming language in production use.
Can you extend benchmarks?
Yes. Our technology allows us to scale up and down extremely quickly: we can typically spin up new evaluation capacity within 48 hours and onboard hundreds of new developers to your project every week.
How large can projects be?
As large or as small as you need. Some clients are hands-off and just want the data; others collaborate deeply with us on evaluation criteria. We're flexible, though we do recommend initial calibration sessions to align on quality standards.
How do you ensure quality?
Data quality is our top priority. For a detailed conversation about the processes and systems we use to ensure high-quality data, get in touch with our team.
Can everything run on our infra?
Yes! Most clients start with a pilot project, usually a few hundred evaluations across different problem types. This lets us calibrate our process to your needs and demonstrate our quality before scaling up.
Train and evaluate LLMs with real engineering work.
