About the Job: Remote PhD Rater | Up to $100/hr
This is a specialised remote opportunity for experienced researchers and technical experts to support a frontier-model evaluation initiative focused on advanced STEM reasoning and agentic workflows.
The project centres on designing and validating complex benchmark tasks across domains such as data science, machine learning, finance, and coding. The role involves building real-world evaluation tasks, implementing them in Python-based environments, and analysing how advanced AI systems perform on complex technical problems.
Key Responsibilities
Design challenging real-world STEM problems for model evaluation
Implement benchmark tasks inside agentic development environments using Python
Create reproducible tasks with executable tests and clearly defined specifications
Analyse model and agent outputs to identify reasoning gaps and failure modes
Evaluate how AI systems perform on complex data science, machine learning, finance, and coding tasks
Document benchmark tasks, environments, and evaluation outcomes
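To make the "reproducible tasks with executable tests" responsibility concrete, a benchmark task of this kind might pair a clearly stated specification with a reference solution and a deterministic checker. This is a minimal illustrative sketch; the task format, function names, and example problem are hypothetical, not the project's actual format.

```python
# Illustrative sketch of a reproducible benchmark task: a written
# specification, a reference solution, and an executable checker.
# All names and the example problem are hypothetical.

TASK_SPEC = (
    "Given a list of daily prices, return the maximum profit from "
    "one buy followed by one later sell; return 0 if no profit is possible."
)

def reference_solution(prices: list[float]) -> float:
    """Reference implementation used to validate candidate outputs."""
    best, lowest = 0.0, float("inf")
    for p in prices:
        lowest = min(lowest, p)          # cheapest buy seen so far
        best = max(best, p - lowest)     # best sell at today's price
    return best

def check(candidate) -> bool:
    """Executable test: a candidate passes only if it matches the
    expected output on fixed, reproducible cases."""
    cases = [
        ([7, 1, 5, 3, 6, 4], 5.0),  # buy at 1, sell at 6
        ([7, 6, 4, 3, 1], 0.0),     # prices only fall
        ([], 0.0),                  # empty input
    ]
    return all(candidate(list(c)) == expected for c, expected in cases)
```

Because the spec, solution, and test cases live together and the checker is deterministic, any evaluator can rerun the task and get the same pass/fail verdict for a given model output.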
Ideal Profile
Strong candidates typically have:
Active or recently completed PhD from a top-tier U.S.-based university
Deep expertise in data science, machine learning, finance, and/or Python-based programming
Strong research background in advanced STEM domains
Experience designing complex technical problems or research benchmarks
Ability to analyse model reasoning traces and diagnose deeper system behaviour issues
Strong analytical and research documentation skills
Educational Background:
PhD in Computer Science, Data Science, Machine Learning, Finance, or related STEM fields
Nice to Have
Experience working with agentic frameworks or LLM tooling ecosystems
Familiarity with frameworks such as LangChain, AutoGen, MetaGPT, CrewAI, LlamaIndex, BabyAGI, or related systems
Contributions to open-source software or research projects
Experience analysing complex model behaviour or agent workflows
Why This Opportunity
Contribute directly to frontier AI model evaluation and benchmarking efforts
Work on advanced research challenges in agentic AI systems and STEM reasoning tasks
Collaborate with leading AI labs and technical researchers
Help identify and improve limitations in next-generation AI systems
Contract Details
Independent contractor role
Fully remote with flexible scheduling
Part-time research engagement with expected availability of 30+ hours per week
Competitive rates of $50–$100/hour depending on expertise
Weekly payments via Stripe or Wise
Projects may extend or adjust depending on scope and performance
Work must not involve confidential or proprietary information from employers or institutions
About the Platform
This opportunity is available through a leading AI-driven work platform.