Position Overview
AI Evaluation Engineer
We’re hiring an AI Evaluation Engineer to own the quality bar for every LLM-powered feature we ship. You will design, build, and scale the infrastructure that tells us, with evidence, whether a prompt change, model swap, or agent refactor made things better or worse.
Responsibilities
- Build evaluation infrastructure: Design and maintain eval suites using Promptfoo, LLM-as-judge methodologies, and custom harnesses for features such as our expert search system, natural language grants search, and AI SDR agents.
- Define what "good" means: Partner with product and domain experts to translate vague customer outcomes (for example, "does this surface the right principal investigator?") into precise, measurable rubrics.
- Own the feedback loop: Instrument production traffic, curate golden datasets from real customer interactions, and build pipelines that turn user behavior into regression tests.
- Ship quickly under uncertainty...