Full-time

Key Responsibilities:

Perform scoring and qualitative evaluations of LLM-generated responses across multiple use cases.
Develop and maintain scoring guidelines and rubrics to ensure consistency and objectivity.
Collaborate with data scientists, product managers, and engineering teams to align scoring with project goals.
Assist in the creation and labeling of high-quality evaluation datasets for prompt tuning or model fine-tuning.
Use NLP-based metrics and tools (e.g., ROUGE, BLEU, cosine similarity) to support automated scoring.
Document scoring patterns, common model errors, and improvement opportunities.
Contribute to prompt experimentation and help compare the effectiveness of different prompt strategies.

Qualifications:

Prior experience with LLMs (e.g., GPT, Claude, LLaMA) or AI/NLP projects is highly preferred.
Strong analytical skills and attention to detail, especially in assessing language quality.
Familiarity with prompt engineering, generative AI, or conversational AI tools is a plus.
Hands-on experience with Python, Jupyter, or evaluation libraries (optional but desirable).
Experience working with evaluation frameworks or annotation tools (e.g., Label Studio, Prodigy) is a bonus.
Excellent written and verbal communication skills.

This job is found at InterviewStack.io

llms, gpt, nlp, generative ai, python, prompt engineering, fine tuning, experimentation

Full-service recruitment agency specializing in executive search, permanent placements, and contingency staffing.

recruitment, staffing

AI Model Evaluation Specialist