the reward function defines what the model optimizes for during training. every design choice here directly shapes the model’s behavior.
defining rewards
compute_reward runs after each rollout. it receives the full conversation transcript as messages (a list of ChatMessage dicts) and per-example data as task (a dict containing ground_truth and any other fields from your dataset row). it returns a dict mapping reward component names to float scores.
extract_completion_text is a helper that pulls the assistant’s text out of a messages list. use it at the top of your reward function to get the text you’ll score.
import re
from benchmax.envs.reward_helpers import extract_completion_text

async def compute_reward(self, rollout_id, messages, task, **kwargs):
    # pull the assistant's text out of the transcript
    text = extract_completion_text(messages)
    # take the first <answer>...</answer> span; DOTALL lets it cross newlines
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    # task may be None, so fall back to an empty dict
    ground_truth = (task or {}).get("ground_truth", "")
    return {"correct": 1.0 if answer == ground_truth.strip() else 0.0}
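to check the parsing logic without a training run, the same extraction can be exercised on plain strings. this is a minimal sketch: score_answer is a hypothetical helper, not part of benchmax, and it takes the completion text directly instead of calling extract_completion_text.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def score_answer(text: str, ground_truth: str) -> dict[str, float]:
    """Hypothetical helper: exact-match scoring on the first <answer> tag."""
    match = ANSWER_RE.search(text)
    # A missing tag yields an empty answer, mirroring the function above.
    answer = match.group(1).strip() if match else ""
    return {"correct": 1.0 if answer == ground_truth.strip() else 0.0}


examples = [
    ("the result is <answer> 42 </answer>", "42"),  # tagged and correct
    ("the result is <answer>41</answer>", "42"),    # tagged but wrong
    ("the result is 42", "42"),                     # right value, missing tag
]
for text, truth in examples:
    print(score_answer(text, truth))
```

only the first example scores 1.0. the third shows that an untagged answer scores 0 even when the value is right, which is worth knowing before relying on this component alone.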
multiple reward components
return multiple keys to score different dimensions independently. each component is tracked separately on the platform, so you can monitor and debug them individually in the rollout inspector.
async def compute_reward(self, rollout_id, messages, task, **kwargs):
    text = extract_completion_text(messages)
    ground_truth = (task or {}).get("ground_truth", "")
    # each key in the returned dict is tracked as its own component
    correctness = await self._judge_correctness(text, ground_truth)
    format_ok = 1.0 if self._is_valid_format(text) else 0.0
    return {"correctness": correctness, "format": format_ok}
each component can have a different weight in the training config. a component with weight 2.0 contributes twice as much to the final reward as one with weight 1.0. the platform shows each component’s score over time as a separate chart, making it easy to see which dimension is improving and which is lagging.
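as a worked example of the weighting described above, the sketch below combines component scores with a weighted sum. the weighted-sum formula, the combine_rewards name, and the weights dict are assumptions for illustration; the platform's actual aggregation and config format are not shown here.

```python
def combine_rewards(components: dict[str, float], weights: dict[str, float]) -> float:
    """Hypothetical aggregation: weighted sum of component scores (an assumption)."""
    # Components without an explicit weight default to 1.0 in this sketch.
    return sum(score * weights.get(name, 1.0) for name, score in components.items())


weights = {"correctness": 2.0, "format": 1.0}
print(combine_rewards({"correctness": 1.0, "format": 0.0}, weights))  # 2.0
print(combine_rewards({"correctness": 0.0, "format": 1.0}, weights))  # 1.0
```

under this assumption, a correct but badly formatted answer earns twice what a well-formatted wrong answer does, which matches the 2.0 versus 1.0 weighting.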
going further
two patterns extend the basic per-rollout reward:
best practices
- keep reward components non-negative. each component should stay in [0, 1]: return 0 for failure, not a negative score. negative values complicate weighting and make the reward breakdown harder to read. to express a penalty, gate the component (return 0 when a condition fails) or give it a negative weight in the training config rather than returning a negative reward. (an llm-judge negative rubric still returns [0, 1]; the penalty comes from its weight.)
- gate secondary rewards on primary ones. for example, if you have a conciseness reward, gate it on correctness, so the model doesn’t learn to produce short, wrong answers.
- start simple. begin with a small set of reward components. add more only when you see specific behaviors to encourage or discourage.
- watch for reward hacking. if the model finds a shortcut that scores well but produces bad output, your reward function has a gap. inspect high-reward completions regularly.
- use deterministic gates before expensive judges. check cheap conditions first (is there an answer tag? does it parse?) and return 0 immediately for obvious failures. this saves LLM judge cost.
- log intermediate values. use logger.info() inside compute_reward to trace how scores were computed. see logging.
- test locally first. run your reward function on hand-crafted completions before launching. see testing.
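several of the practices above (non-negative scores, a cheap gate before an expensive check, and a secondary reward gated on the primary one) can be sketched in one function. everything here is illustrative: score_completion, the 200-character budget, and the exact-match check standing in for a judge are assumptions, not part of benchmax.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
MAX_CHARS = 200  # hypothetical length budget for the conciseness component


def score_completion(text: str, ground_truth: str) -> dict[str, float]:
    """Hypothetical scorer showing a cheap gate plus a gated secondary reward."""
    match = ANSWER_RE.search(text)
    if match is None:
        # Deterministic gate: no answer tag, so return zeros before any costly check.
        return {"correct": 0.0, "concise": 0.0}

    # Exact match stands in for an expensive judge call in this sketch.
    correct = 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

    # Conciseness falls linearly from 1.0 toward 0.0 as the text nears MAX_CHARS.
    brevity = max(0.0, 1.0 - len(text) / MAX_CHARS)

    # Gate the secondary reward on the primary one: short wrong answers earn nothing.
    return {"correct": correct, "concise": brevity * correct}


# Hand-crafted completions, in the spirit of testing locally first.
print(score_completion("<answer>42</answer>", "42"))  # correct and short
print(score_completion("<answer>41</answer>", "42"))  # short but wrong
print(score_completion("42", "42"))                   # fails the tag gate
```

multiplying by the correctness score keeps both components in [0, 1] and means the conciseness signal only appears once the answer is right.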