Careers

Build the Agentic Readiness Index.

We are a small research team measuring how well AI coding agents can use real-world APIs. Senior engineers, public methodology, reproducible runs, open-source harness.

§ How we work
01

Reproducibility is the product

If a vendor or a journalist clones the repo and gets different numbers, that is a bug we treat as urgent. The harness is open source for this reason.

02

Adversarial review before publish

Every benchmark passes four gates before going on the leaderboard: an empty submission scores 0, the reference solution scores 100, the hidden EVAL_LEAK_CHECK canaries come back clean, and a second engineer completes an adversarial pass.
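
For readers who think in code, the four gates can be sketched roughly as below. This is an illustrative sketch and not the harness's actual implementation: `GateReport`, `check_gates`, the `grade` callable, and the idea that a canary "fails" when its marker string shows up in the agent transcript are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Marker prefix taken from the text above; how canaries are really planted
# and detected is an assumption in this sketch.
CANARY = "EVAL_LEAK_CHECK"


@dataclass
class GateReport:
    """Outcome of the four pre-publication gates (hypothetical structure)."""
    empty_scores_zero: bool
    reference_scores_full: bool
    canary_absent: bool
    second_review_done: bool

    @property
    def publishable(self) -> bool:
        # A benchmark goes on the leaderboard only if every gate holds.
        return all((
            self.empty_scores_zero,
            self.reference_scores_full,
            self.canary_absent,
            self.second_review_done,
        ))


def check_gates(
    grade: Callable[[str], float],  # hypothetical grader: submission -> score 0..100
    empty_submission: str,
    reference_solution: str,
    agent_transcript: str,
    second_review_done: bool,
) -> GateReport:
    """Evaluate the four gates for one benchmark."""
    return GateReport(
        empty_scores_zero=grade(empty_submission) == 0,
        reference_scores_full=grade(reference_solution) == 100,
        # Assumed leak signal: the hidden marker appears in what the agent saw or wrote.
        canary_absent=CANARY not in agent_transcript,
        second_review_done=second_review_done,
    )
```

Keeping each gate as its own boolean, instead of returning a single pass/fail, means a failed publish can say which gate tripped.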

03

Findings over framings

We publish what the data shows, not what is convenient. Vendor disputes are reviewed in public; we change a score if and only if the methodology was wrong.

04

Senior, small, async-first

Small team. Senior engineers on every track. Most days are async — written specs, async code review, weekly research sync. We meet in person twice a year.

§ Open roles

4 positions, all senior, all remote-friendly.

  • Research · Full-time

    Benchmark Authoring Engineer

    Remote · US time zones

    Author end-to-end benchmarks against real third-party APIs. Each benchmark is a small repo + grader + reference solution that scores how well a frontier coding agent can ship an integration. You will pick the tasks, write the grader, hide it from the agent, and calibrate the difficulty so the run lands in the 40–70 pass@K range.

    You will
    • Pick benchmark tasks based on real engineering work against the target API, drawing on your own production experience with the vendor.
    • Write reference solutions that score 100 and adversarial empty submissions that score 0.
    • Build hidden graders that query the vendor's data model for ground truth (no grading against agent HTTP responses).
    • Calibrate difficulty so the task discriminates between models without being unsolvable.
    • Author the methodology section that ships with the benchmark.
    Fit
    • Shipped meaningful production code against at least one fintech, BaaS, KYC, bank-data, or card-issuing API.
    • Strong opinions about API ergonomics and where agents tend to fail.
    • Comfortable with Python or TypeScript graders + Docker.
    • Cares about reproducibility. You sweat the details that keep a benchmark ungameable.
  • Platform · Full-time

    Harness Engineer

    Remote · US time zones

    Own the Goose-based harness that runs every benchmark, captures every tool call, and emits the JSON the leaderboard reads. Make it fast, reproducible, and friendly enough that any open-source contributor can clone the repo and run a vendor benchmark themselves in under five minutes.

    You will
    • Maintain the harness CLI, Docker images, and reference environments.
    • Add support for new model providers (Codex, Cursor Agents, Cline, etc.) behind a clean adapter interface.
    • Build the publish pipeline that turns harness JSON into the leaderboard rows and vendor pages.
    • Write the run-it-yourself docs that go in the public repo.
    Fit
    • Strong systems / infra background. Docker, GitHub Actions, subprocess management.
    • Bias toward boring, reproducible tooling over magic.
    • Cares about open-source ergonomics: your README is the marketing.
  • Research · Full-time

    Research Engineer, Methodology

    Remote · US time zones

    Define what we measure and how we measure it. Own the assertion taxonomy (Setup / Implementation / Robustness today; more dimensions tomorrow), the scoring rubric, the per-vendor coverage model, and the public methodology doc that defends the Index against “your tests were biased.”

    You will
    • Develop the assertion taxonomy and the rubric translating per-assertion pass rates into the published score and grade.
    • Run adversarial sweeps against new benchmarks before publication (prompt injection, doc-leakage canaries).
    • Author the methodology page and respond to vendor disputes in the public changelog.
    • Publish research notes when findings warrant.
    Fit
    • Background in AI eval design, statistics, or experimental software research.
    • Writes well. The methodology doc IS the company's defensibility.
    • Comfortable being publicly named on a research output.
  • GTM · Full-time

    Founding GTM

    NYC or SF preferred · open to remote for the right person

    Build the commercial side of the Index. Sell continuous-eval CI to fintech / BaaS / bank-data devrel and platform teams. Bring the first ten paying vendor relationships in-house and the next fifty into the pipeline.

    You will
    • Run outbound and inbound on fintech devrel + platform leadership.
    • Close the first paid continuous-eval CI contracts.
    • Coordinate with research on which APIs to prioritize for the public Index based on commercial pull.
    • Build the case studies and category reports that drive top-of-funnel.
    Fit
    • Has sold dev tools or API products to engineering buyers before.
    • Comfortable in a small, technical room. Reads APIs and methodology docs as part of the job.
    • Bias toward credible specificity over generic pitch.

Don't see your role?

If you have shipped meaningful work against a developer API and want to build the Index, email us anyway. Include a short note on the API you know best and the benchmark you would want to author first.


Nextdev

The Agentic Readiness Index.
Benchmarks to measure API agentic readiness.

© 2026 NextDev, Inc. All rights reserved.