We are a small research team measuring how well AI coding agents can use real-world APIs. Senior engineers, public methodology, reproducible runs, open-source harness.
If a vendor or a journalist clones the repo and gets different numbers, that is a bug we treat as urgent. The harness is open source for this reason.
Every benchmark passes four gates before going on the leaderboard: an empty submission scores 0, the reference solution scores 100, the hidden EVAL_LEAK_CHECK canaries show no leak, and a second engineer completes an adversarial pass.
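As a rough illustration, the three automatable gates could be checked along these lines. This is a minimal sketch, not the harness's actual code: `check_gates`, the `grade` callable, and the idea that a canary is a marker string that must never appear in agent-visible text are all assumptions made for the example. The fourth gate, the adversarial pass, is a human review and is not modeled here.

```python
from typing import Any, Callable, Dict, Iterable

# Assumption: canaries are marker strings containing this prefix. The real
# EVAL_LEAK_CHECK format is not described in the text.
CANARY_MARKER = "EVAL_LEAK_CHECK"


def check_gates(
    grade: Callable[[Any], float],
    empty_submission: Any,
    reference_submission: Any,
    agent_visible_texts: Iterable[str],
) -> Dict[str, bool]:
    """Return gate name -> passed for the three gates that can be automated."""
    return {
        # Gate 1: a submission that does nothing must earn no credit.
        "empty_scores_0": grade(empty_submission) == 0,
        # Gate 2: the reference solution must earn full credit.
        "reference_scores_100": grade(reference_submission) == 100,
        # Gate 3: no canary marker may show up in anything the agent can read.
        "no_canary_leak": not any(CANARY_MARKER in t for t in agent_visible_texts),
    }


if __name__ == "__main__":
    # Hypothetical toy grader, used only to exercise the checks.
    def toy_grade(submission: dict) -> float:
        return 100 if submission.get("answer") == 42 else 0

    results = check_gates(toy_grade, {}, {"answer": 42}, ["README: build the client"])
    print(results)
```

A benchmark would go on the leaderboard only if every value in the returned dict is true and the human review has also signed off.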
We publish what the data shows, not what is convenient. Vendor disputes are reviewed in public; we change a score if and only if the methodology was wrong.
Small team. Senior engineers on every track. Most days are async: written specs, code review, and a weekly research sync. We meet in person twice a year.
Author end-to-end benchmarks against real third-party APIs. Each benchmark is a small repo + grader + reference solution that scores how well a frontier coding agent can ship an integration. You will pick the tasks, write the grader, hide it from the agent, and calibrate the difficulty so the run lands in the 40–70% pass@K range.
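To make the calibration target concrete, here is a minimal sketch of checking whether a benchmark lands in the band. It assumes pass@K is computed with the commonly used unbiased estimator (the probability that at least one of K samples drawn from n runs passes, given c passing runs) and that the 40–70 band is a percentage. The text specifies neither, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n runs of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n runs contains no passing run.
    """
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def in_target_band(n: int, c: int, k: int, low: float = 0.40, high: float = 0.70) -> bool:
    """True if the estimate falls inside the assumed 40-70% calibration band."""
    return low <= pass_at_k(n, c, k) <= high


if __name__ == "__main__":
    # Example: 20 runs, 6 passed, evaluated at k = 3.
    estimate = pass_at_k(20, 6, 3)
    print(f"pass@3 = {estimate:.3f}, in band: {in_target_band(20, 6, 3)}")
```

A benchmark that scores above the band would be a candidate for harder tasks or stricter assertions, and one below it for a narrower scope.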
Own the Goose-based harness that runs every benchmark, captures every tool call, and emits the JSON the leaderboard reads. Make it fast, reproducible, and friendly enough that any open-source contributor can clone the repo and run a vendor benchmark themselves in under five minutes.
Define what we measure and how we measure it. Own the assertion taxonomy (Setup / Implementation / Robustness today; more dimensions tomorrow), the scoring rubric, the per-vendor coverage model, and the public methodology doc that defends the Index against “your tests were biased.”
Build the commercial side of the Index. Sell continuous-eval CI to fintech / BaaS / bank-data devrel and platform teams. Bring the first ten paying vendor relationships in-house and the next fifty into the pipeline.
If you have shipped meaningful work against a developer API and want to build the Index, email us anyway. Include a short note on the API you know best and the benchmark you would want to author first.