Human evaluation for LLMs on real software tasks
Structured evaluations using real code, repos, and rubrics you can trust.
Trusted by leading U.S. companies.

Execution-based evaluations
We run model-generated code in real or containerized environments to verify correctness, runtime behavior, and reproducibility across different tasks and repos.
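For a concrete sense of what "execution-based" means, here is a minimal sketch in Python: a model-generated change is checked by running the repo's test command inside a throwaway container and reading the exit code. The image name, paths, and test command are illustrative placeholders, not our production harness.

import subprocess

def run_in_container(repo_dir: str, test_cmd: str, image: str = "python:3.11", timeout: int = 600) -> bool:
    # Execute a repo's test command inside a disposable container and
    # report pass/fail from the exit code. All arguments here are
    # placeholders for illustration only.
    try:
        result = subprocess.run(
            [
                "docker", "run", "--rm",
                "-v", f"{repo_dir}:/workspace",  # mount the candidate repo
                "-w", "/workspace",              # run from the repo root
                image,
                "sh", "-c", test_cmd,            # e.g. "pip install -e . && pytest -q"
            ],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False                             # hung runs count as failures
    return result.returncode == 0

if __name__ == "__main__":
    print("PASS" if run_in_container("/tmp/candidate-repo", "pytest -q") else "FAIL")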
Rubric scoring
We score model outputs using clear rubrics for correctness, reasoning quality, clarity, efficiency, structure, and other axes that matter to your research goals.
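As a rough illustration of how a rubric can be structured (the axes and weights below are examples, not a fixed scheme), one simple form is a set of weighted axes with per-axis grades rolled up into an overall score:

from dataclasses import dataclass

@dataclass
class RubricAxis:
    name: str
    weight: float      # relative importance of this axis
    description: str   # what raters are asked to judge

# Illustrative rubric; real rubrics are tailored to each project's research goals.
RUBRIC = [
    RubricAxis("correctness", 0.40, "Does the code do what the task asked, verified by execution?"),
    RubricAxis("reasoning_quality", 0.25, "Is the explanation sound and free of logical gaps?"),
    RubricAxis("clarity", 0.15, "Is the code and write-up easy to follow?"),
    RubricAxis("efficiency", 0.10, "Are the chosen algorithms and data structures reasonable?"),
    RubricAxis("structure", 0.10, "Is the change well organized and idiomatic for the repo?"),
]

def score(grades: dict[str, float]) -> float:
    # Combine per-axis grades (each 0.0 to 1.0) into a weighted overall score.
    total_weight = sum(axis.weight for axis in RUBRIC)
    return sum(axis.weight * grades.get(axis.name, 0.0) for axis in RUBRIC) / total_weight

print(score({"correctness": 1.0, "reasoning_quality": 0.75, "clarity": 1.0, "efficiency": 0.5, "structure": 1.0}))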
Multi-turn reasoning evaluation
We evaluate how models handle multi-step debugging, refactoring, or planning tasks over several turns, including corrections and partial failures.
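One simplified way to picture this, with a made-up transcript standing in for a real repository task: each model turn is checked on its own, so a model can earn credit for recovering after a failed first attempt.

from dataclasses import dataclass

@dataclass
class Turn:
    role: str                    # "user" or "model"
    content: str
    passed: bool | None = None   # set for model turns once their output is checked

# Illustrative debugging exchange; real tasks come from real repos.
transcript = [
    Turn("user", "The parser crashes on empty input. Fix it."),
    Turn("model", "Patch v1: guard added, but the regression test still fails.", passed=False),
    Turn("user", "The regression test still fails. Try again."),
    Turn("model", "Patch v2: handles the empty case, all tests pass.", passed=True),
]

model_turns = [t for t in transcript if t.role == "model"]
recovered = (not model_turns[0].passed) and model_turns[-1].passed
print(f"final turn passed: {model_turns[-1].passed}, recovered after failure: {recovered}")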
Agent performance testing
We test agents on planning, tool use, code edits, and iterative refinement across realistic tasks, not just one-shot completions.
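Sketched below with a hypothetical trajectory format: an agent run can be recorded as the sequence of tool calls it made plus a final verdict, from which measures like the number of refinement loops fall out directly.

from dataclasses import dataclass, field

@dataclass
class AgentTrajectory:
    task_id: str
    tool_calls: list[str] = field(default_factory=list)   # e.g. "read_file", "edit_file", "run_tests"
    final_tests_passed: bool = False

    @property
    def refinement_steps(self) -> int:
        # Count how many times the agent went back to editing after running tests.
        steps = 0
        for prev, curr in zip(self.tool_calls, self.tool_calls[1:]):
            if prev == "run_tests" and curr == "edit_file":
                steps += 1
        return steps

# Illustrative trajectory: the agent reads, edits, tests, then refines once.
traj = AgentTrajectory(
    task_id="demo-001",
    tool_calls=["read_file", "edit_file", "run_tests", "edit_file", "run_tests"],
    final_tests_passed=True,
)
print(traj.refinement_steps, traj.final_tests_passed)   # -> 1 True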
Case study
Scaling multi-modal code evals with automated rubrics
A leading foundation model company needed a scalable way to evaluate code generation. In just 8 days, we delivered 1,000+ expert-reviewed tasks and built custom rubrics, enabling automated benchmarking with every new model release.
Tasks completed: 1,000
Days to full-scale: 8
Parallel queues: 25
Languages covered: 100+
“Their evaluations showed exactly where the model broke, and why.”
Lead Researcher at Meta
Frequently asked questions
Do you test in real environments?
Yes. We run model-generated code in real or containerized environments, so correctness, runtime behavior, and reproducibility are verified by execution rather than by inspection alone. Our network of 400,000 experienced engineers covers every major programming language.
Can you use our rubrics?
Yes. We can score against your existing rubrics, adapt them with your team, or build custom rubrics from scratch, as in the case study above, covering whichever axes matter to your research goals.
How do you keep raters consistent?
Every project starts with calibration sessions to align raters on your quality standards. From there, raters score against clear, shared rubrics, and outputs are expert-reviewed before delivery. You can be as hands-off or as involved in defining the criteria as you like.
Can you test agents?
Yes. Agent evaluation is one of our core offerings: we assess planning, tool use, code edits, and iterative refinement on realistic tasks, and track how agents recover from corrections and partial failures. To discuss a specific agent setup, get in touch with our team.
Can everything run on our infra?
Often, yes. Because our execution environments are containerized, evaluations can usually run on your infrastructure as well as ours. Most clients start with a pilot of a few hundred evaluations, which lets us calibrate our process to your environment before scaling; once calibrated, we can typically spin up additional capacity within 48 hours.
See how your model behaves on real tasks
