llm judge

environment Mar 19, 2026 4 min read

some qualities are easy to check with code: did the answer match a string, did the tests pass, did the output parse? others aren’t: is the tone appropriate, is the explanation clear, does the response avoid unnecessary caveats? these are things a human could evaluate in seconds but are hard to capture in a regex or a function. llm judges fill that gap: they let you reward properties that you can describe in natural language but can’t easily operationalize in code.

a Rubric is a reusable LLM judge criterion. define it once, use it across reward functions. each rubric has a title, description, type (positive or negative), and an optional score map for discrete levels.

from benchmax.rubrics.rubric import Rubric, evaluate_single_rubric

correctness = Rubric(
    title="correctness",
    description="is the answer factually correct?",
    type="positive",
    score_map={0: "wrong", 0.5: "partially correct", 1: "fully correct"},
)
  • positive rubrics score for the presence of a good quality (higher = better)
  • negative rubrics score for the presence of a bad quality (higher = worse). the score itself is still in [0, 1]; apply the penalty by giving the component a negative weight in the training config, not by returning a negative reward
  • score_map is optional. when provided, the judge is constrained to those levels. when omitted, it returns a float in [0, 1].
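
to make the negative-weight convention concrete, here is a minimal sketch of how per-component scores could be folded into one scalar. the `combine` helper and the `weights` dict are hypothetical and not part of benchmax; in practice the weighting lives in your training config, and this only illustrates the arithmetic.

```python
def combine(scores: dict[str, float], weights: dict[str, float]) -> float:
    """weighted sum of per-rubric scores; every score is expected in [0, 1]."""
    total = 0.0
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
        # components with no configured weight contribute nothing
        total += weights.get(name, 0.0) * score
    return total


# hypothetical weights: the negative rubric is penalized via a negative weight
weights = {"correctness": 1.0, "unnecessary_caveats": -0.5}
# both scores stay in [0, 1], including the negative rubric's
scores = {"correctness": 1.0, "unnecessary_caveats": 0.5}
total = combine(scores, weights)  # 1.0 * 1.0 + (-0.5) * 0.5 = 0.75
```

keeping every component in [0, 1] means the sign of a penalty is decided in one place (the weight), so the same rubric can be reused with a different weight in another run.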

use evaluate_single_rubric to score a response. any OpenAI-compatible model works as a judge; gpt-5.4-mini is a good default for cost and quality.

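# this snippet runs inside your reward function's compute_reward;
# `task` is the task dict and `text` is the response being scored
result = await evaluate_single_rubric(
    rubric=correctness,
    question=(task or {}).get("prompt", ""),
    ground_truth=(task or {}).get("ground_truth", ""),
    response=text,
    model_name="gpt-5.4-mini",
    base_url="https://api.openai.com/v1",
    api_key=self._judge_api_key,
)
# clamp defensively in case the judge returns a value outside [0, 1]
return {"correctness": max(0.0, min(1.0, result.get("score", 0.0)))}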

the judge model is separate from the model being fine-tuned. you can use any model accessible via an OpenAI-compatible API (OpenAI, Anthropic via proxy, self-hosted, etc.). the model_name and base_url parameters control which endpoint is called.

compose multiple rubrics in a single compute_reward to score different dimensions independently.
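
one way to compose several rubrics is to issue the judge calls concurrently and return one key per rubric. the sketch below assumes a stand-in `judge` coroutine so it runs offline; `score_all`, `fake_judge`, and the canned scores are hypothetical, and in real code the stand-in would be replaced by a wrapper around evaluate_single_rubric.

```python
import asyncio
from typing import Awaitable, Callable

# hypothetical judge signature: (rubric name, response text) -> raw score
JudgeFn = Callable[[str, str], Awaitable[float]]


def clamp01(x: float) -> float:
    """keep a judge score inside [0, 1]."""
    return max(0.0, min(1.0, x))


async def score_all(
    rubric_names: list[str], response: str, judge: JudgeFn
) -> dict[str, float]:
    """score one response against several rubrics concurrently."""
    raw = await asyncio.gather(*(judge(name, response) for name in rubric_names))
    # gather preserves input order, so zip pairs each name with its own score
    return {name: clamp01(score) for name, score in zip(rubric_names, raw)}


async def fake_judge(rubric_name: str, response: str) -> float:
    # canned scores standing in for a real judge call; 1.7 is deliberately
    # out of range to show the clamp
    canned = {"correctness": 1.0, "clarity": 0.5, "unnecessary_caveats": 1.7}
    return canned[rubric_name]


result = asyncio.run(
    score_all(["correctness", "clarity", "unnecessary_caveats"], "some answer", fake_judge)
)
```

each rubric ends up as its own reward component, so a weak score on one dimension stays visible instead of being averaged away.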

ranking instead of absolute scoring

absolute scores from llm judges are noisy. a judge that gives 0.7 to one response and 0.8 to another is making a fine distinction it may not consistently reproduce. but the same judge is much more reliable when asked “which of these is better?”. evaluate_rubric_ranking takes a group of responses and asks the judge to rank them in a single call, then converts that ranking into per-response scores in [0, 1].

this pairs naturally with compute_group_reward, which already operates over a group of rollouts:

from benchmax.rubrics.rubric import Rubric, evaluate_rubric_ranking

clarity = Rubric(
    title="clarity",
    description="is the response clear and easy to follow?",
    type="positive",
)

async def compute_group_reward(
    self, rollout_ids, completions, ground_truths, **kwargs
) -> list[dict[str, float]]:
    # flatten each rollout into one string: its assistant messages, joined by newlines
    texts = [
        "\n".join(m["content"] for m in c if m.get("role") == "assistant" and m.get("content"))
        for c in completions
    ]

    result = await evaluate_rubric_ranking(
        rubric=clarity,
        question=kwargs.get("prompt", ""),
        responses=texts,
        model_name="gpt-5.4-mini",
        base_url="https://api.openai.com/v1",
        api_key=self._judge_api_key,
    )

    return [{"clarity": s} for s in result["scores"]]

one judge call ranks the whole group. empty responses automatically score 0 and are excluded from the ranking sent to the judge.
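
the conversion from a ranking to per-response scores can be sketched as follows. the exact formula evaluate_rubric_ranking uses is not stated here, so the linear spacing below is an assumption, and `ranking_to_scores` is a hypothetical helper rather than a benchmax function. it does mirror the two behaviours described above: scores land in [0, 1], and empty responses score 0 without being ranked.

```python
def ranking_to_scores(responses: list[str], ranked_indices: list[int]) -> list[float]:
    """map a best-first ranking of the non-empty responses to scores in [0, 1].

    ranked_indices holds indices into `responses`, best first, and must cover
    exactly the non-empty responses.
    """
    non_empty = [i for i, r in enumerate(responses) if r.strip()]
    if sorted(ranked_indices) != non_empty:
        raise ValueError("ranking must cover exactly the non-empty responses")

    scores = [0.0] * len(responses)  # empty responses keep the default 0
    n = len(ranked_indices)
    for position, idx in enumerate(ranked_indices):
        # assumed linear spacing: best gets 1.0, worst gets 0.0;
        # a single ranked response gets 1.0
        scores[idx] = 1.0 if n == 1 else 1.0 - position / (n - 1)
    return scores


responses = ["clear answer", "", "rambling answer", "okay answer"]
# the judge only saw indices 0, 2, 3 and ranked them best-first as 0, 3, 2
scores = ranking_to_scores(responses, ranked_indices=[0, 3, 2])
```

in this sketch the lowest-ranked non-empty response also scores 0.0, the same as an empty one; a real implementation may keep a small gap between the two.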

grounding against real production outputs

ranking responses against each other gives a relative signal, but it doesn’t tell the model what “good enough” means in absolute terms. you can pass a ground_truth to evaluate_rubric_ranking to anchor the ranking. when you do, the function adds the reference as an unlabeled entry in the ranking (the judge doesn’t know which one it is), then converts positions to scores relative to where the reference lands:

  • responses ranked above the reference score in [0.5, 1.0]
  • responses tied with the reference score 0.5
  • responses ranked below the reference score in [0.0, 0.3], with a deliberate discontinuity at the tie point so “worse than the reference” is meaningfully penalized
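
the three bands above can be sketched as a small scoring function. only the bands themselves come from the description; how scores are spread inside each band is an assumption (linear interpolation here), and `anchored_scores`, `REF`, and the tier representation are hypothetical names chosen for illustration.

```python
REF = "ref"  # hypothetical label for the reference entry in the ranking


def anchored_scores(tiers: list[list[str]]) -> dict[str, float]:
    """score labels from a best-first ranking that includes the reference.

    tiers is a best-first list of groups; labels in the same group are tied.
    exactly one group must contain REF.
    """
    ref_positions = [i for i, tier in enumerate(tiers) if REF in tier]
    if len(ref_positions) != 1:
        raise ValueError("exactly one tier must contain the reference")
    ref_tier = ref_positions[0]
    above, below = tiers[:ref_tier], tiers[ref_tier + 1:]

    scores: dict[str, float] = {}
    # tied with the reference: exactly 0.5
    for label in tiers[ref_tier]:
        if label != REF:
            scores[label] = 0.5
    # above the reference: best tier gets 1.0, later tiers step down toward 0.5
    for k, tier in enumerate(above):
        for label in tier:
            scores[label] = 1.0 - 0.5 * k / len(above)
    # below the reference: first tier gets 0.3, worst tier gets 0.0,
    # leaving the gap between 0.3 and 0.5 unused
    for k, tier in enumerate(below):
        frac = 1.0 if len(below) == 1 else 1.0 - k / (len(below) - 1)
        for label in tier:
            scores[label] = 0.3 * frac
    return scores


# a and b beat the reference, c ties with it, d and e fall short
result = anchored_scores([["a"], ["b"], [REF, "c"], ["d"], ["e"]])
# a -> 1.0, b -> 0.75, c -> 0.5, d -> 0.3, e -> 0.0
```

the jump from 0.3 to 0.5 is the discontinuity: no response that loses to the reference can score close to one that ties it.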

a good source for ground_truth is your real production model’s output on the same prompt. if you log responses from your current deployed model, you can feed those in as the reference:

async def compute_group_reward(
    self, rollout_ids, completions, ground_truths, **kwargs
) -> list[dict[str, float]]:
    texts = [
        "\n".join(m["content"] for m in c if m.get("role") == "assistant" and m.get("content"))
        for c in completions
    ]

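    # ground_truth here is your production model's response to the same prompt,
    # logged and stored alongside the training data. every rollout in the group
    # answers the same prompt, so the first entry serves as the shared reference
    prod_response = ground_truths[0] if ground_truths else None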

    result = await evaluate_rubric_ranking(
        rubric=clarity,
        question=kwargs.get("prompt", ""),
        responses=texts,
        ground_truth=prod_response,
        model_name="gpt-5.4-mini",
        base_url="https://api.openai.com/v1",
        api_key=self._judge_api_key,
    )

    return [{"clarity": s} for s in result["scores"]]

this sets the bar at what your production model actually does today. rollouts that beat it get a strong positive signal; rollouts that match it get a neutral signal (0.5); rollouts that fall short get penalized. as training progresses, the model is pushed to beat the production baseline more consistently. because the reference is fixed, improvement shows up directly as higher scores over time.
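
getting the logged production responses into the training data is a small join on the prompt. the sketch below assumes logs stored as JSON lines with `prompt` and `response` fields and tasks stored as dicts with a `prompt` key; those field names and the `attach_references` helper are hypothetical, so adapt them to however your logs are actually stored.

```python
import json


def attach_references(tasks: list[dict], log_lines: list[str]) -> list[dict]:
    """return copies of tasks with ground_truth set to the logged production response."""
    latest: dict[str, str] = {}
    for line in log_lines:
        record = json.loads(line)
        # later log lines overwrite earlier ones, so the newest response wins
        latest[record["prompt"]] = record["response"]

    anchored = []
    for task in tasks:
        reference = latest.get(task["prompt"])
        if reference is None:
            continue  # no production response logged for this prompt
        anchored.append({**task, "ground_truth": reference})
    return anchored


logs = [
    json.dumps({"prompt": "explain tcp", "response": "old answer"}),
    json.dumps({"prompt": "explain tcp", "response": "new answer"}),
]
tasks = [{"prompt": "explain tcp"}, {"prompt": "explain udp"}]
anchored = attach_references(tasks, logs)
```

dropping prompts with no logged response keeps every group on the same anchored score scale; if you would rather keep them, they can fall back to plain unanchored ranking.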