rewards

traces May 11, 2026 1 min read

with a dataset of per-turn examples in hand, the last thing to define is the reward: what the model optimizes for as it trains. for trace-based training, each example carries the original agent’s action at that turn, so the natural signal is how well the model reproduces, or improves on, what the agent actually did.

comparing against the original action

at each turn, the model’s output is compared to the original agent’s action. the reference action is available in task["completion_messages"], set by the trace processing pipeline.

an llm judge rubric that scores fidelity to that reference works well:

from benchmax.rubrics.rubric import Rubric

fidelity = Rubric(
    title="fidelity",
    description="does the model's response achieve the same outcome as the original agent action?",
    type="positive",
    score_map={0: "completely different approach", 0.5: "similar intent, different execution", 1: "equivalent or better"},
)

scoring on the outcome rather than the exact wording lets the model find better paths to the same result instead of memorizing the original agent’s phrasing.

next steps

see llm judge for the full rubric API, and group rewards for the reward functions that run the judge calls. then launch training.