Position Overview
Type: Hourly contract
Compensation: $20–$30/hour
Location: Remote
Commitment: 10–40 hours/week
Role Responsibilities
- Evaluate outputs from large language models and autonomous agent systems using defined rubrics and quality standards.
- Review multi-step agent workflows, including screenshots and reasoning traces, to assess accuracy and completeness.
- Apply benchmarking criteria consistently while identifying edge cases and recurring failure patterns.
- Provide structured, actionable feedback to support model refinement and product improvements.
- Participate in calibration sessions to keep evaluations consistent across reviewers.
- Adapt to evolving guidelines and handle ambiguous scenarios with sound judgment.
- Document findings clearly and communicate insights to relevant stakeholders.