Senior Software Engineer - AI Interaction Evaluator (Codex / Claude Code, up to $200/hr)

G2i

Miami; Edmonton; Ottawa; Vancouver; Washington, D.C.; Atlanta; Mississauga; Calgary; Montreal; Toronto; Winnipeg; Denver; Jacksonville; Savannah; Boise; Chicago; Indianapolis; Boston; Reno; Las Vegas; Columbus; Oklahoma City; Philadelphia; Nashville; Austin; Dallas; Fort Worth; Houston; San Antonio; Arlington; Chesapeake; Fairfax; Norfolk; Richmond; Virginia Beach; Seattle; Buenos Aires; Córdoba; La Plata; Mar del Plata; Rosario; Cochabamba; El Alto; La Paz; Oruro; Santa Cruz de la Sierra; Belém; Belo Horizonte; Brasília; Campinas; Curitiba; Fortaleza; Goiânia; Guarulhos; Manaus; Porto Alegre; Recife; Rio de Janeiro; Salvador; São Paulo; Santiago de Chile; Barranquilla; Bogotá; Cali; Cartagena; Medellín; Cuenca; Guayaquil; Quito; Santo Domingo; Asunción; Ciudad del Este; Arequipa; Lima; Montevideo; Casablanca; Fez; Marrakesh; Rabat; Tanger; Chattogram; Dhaka; Gazipur; Khulna; Narayanganj; Phnom Penh; Siem Reap; Agra; Amritsar; Aurangabad; Bangalore; Bhopal; Chennai; Coimbatore; Delhi; Dhanbad; Faridabad; Ghaziabad; Gwalior; Howrah; Hyderabad; Indore; Jabalpur; Jaipur; Jodhpur; Kanpur; Kolkata; Kota; Lucknow; Ludhiana; Madurai; Meerut; Patna; Prayagraj; Raipur; Ranchi; Srinagar; Thane; Varanasi; Vijayawada; Visakhapatnam; Bandar Lampung; Bandung; Batam; Jakarta; Semarang; Bogor; Bekasi; Depok; Makassar; Medan; Palembang; Pekanbaru; Surabaya; Tangerang; Tirana; Vienna; Brussels; Sarajevo; Sofia; Zagreb; Prague; Tallinn; Helsinki; Berlin; Athens; Budapest; Dublin; London; Rome; Prishtinë; Riga; Vilnius; Valletta; Podgorica; Skopje; Warsaw; Lisbon; Porto; Bucharest; Bratislava; Madrid; Belgrade; Ankara; Istanbul; İzmir

Remote | $50 - $200/hr | 1 month ago

Job Type

contract

Description

Senior AI Interaction Evaluator (Codex / Claude Code)

Contract | $50-200/hr | 10–20 hrs/week | Start ASAP (through early May)

Check out this Loom video for more details!

We’re looking for a highly experienced software engineer (senior level or above) to help evaluate the quality of interactions with modern coding agents such as OpenAI Codex and Claude Code.

This is not a traditional engineering role.

You won’t be writing production code.
You’ll be evaluating something harder: whether the model thinks like a great engineer.

What This Role Actually Is

You will assess how AI coding agents behave in real-world scenarios — focusing on:

  • Whether the response makes sense

  • Whether the preamble and reasoning are useful

  • Whether the output reflects strong engineering judgment

  • Whether the interaction feels right to an experienced developer

This role is about engineering taste — not syntax correctness.

What You’ll Be Doing

  • Evaluate AI-generated coding interactions end-to-end

  • Judge whether outputs are:

    • Useful

    • Correct (at a high level)

    • Aligned with how a strong engineer would think

  • Assess the quality of explanations and reasoning, not just code

  • Distinguish between different levels of response quality (e.g. what makes something a 2 vs 4)

  • Provide clear, opinionated feedback on:

    • What worked

    • What didn’t

    • What felt “off” or misleading

  • Help define what great looks like when interacting with tools like Cursor

What We Mean by “Taste”

We’re specifically looking for engineers who can answer questions like:

  • Does this feel like something a strong engineer would actually say?

  • Is this explanation helpful, or just technically correct?

  • Is the model guiding the user well, or just dumping output?

  • Would this interaction build or erode trust?

You should be comfortable making subjective but rigorous judgments.

Who You Are

  • Staff / Principal-level engineer (or equivalent experience)

  • Strong background in one of the following:

    • TypeScript / JavaScript

    • Python

  • Hands-on experience using:

    • OpenAI Codex

    • Claude Code

    • Cursor

  • Deep familiarity with modern AI-assisted dev workflows

  • Able to evaluate code without needing to fully execute or deeply review every line

  • Comfortable giving direct, opinionated feedback

  • High bar for what “good engineering” looks like

Nice to Have

  • Experience with tools like Cursor or similar AI-first IDEs

  • Prior exposure to prompt design or evaluation workflows

  • Experience mentoring senior engineers or defining engineering standards

Engagement Details

  • US and Canada up to $200/hr

  • EU and Latam up to $150/hr

  • Other locations up to $100/hr

  • Hours: ~10–20 hours/week

  • Duration: Through early May (with possible extension)

  • Start: ASAP

  • Process:

    • Take-home evaluation exercise

    • One behavioral interview

Skills

openai, typescript, javascript, python

About G2i

G2i is an AI engineering company that helps the world's leading frontier labs, enterprises, and high-growth startups hire, train, and ship AI systems.

software, ai