an environment is the specification of a task you want to train a model on. it defines three things:
- tools: what the model can do (search a corpus, call an API, execute code, modify a file)
- rewards: how the model’s output is scored (exact match, LLM judge, programmatic check)
- dataset preprocessing: how your raw data maps to prompts and ground truth
during training, the model interacts with your environment thousands of times. each interaction is a rollout: the model receives a prompt, generates a response (optionally calling tools), and your reward function scores the result. the model updates its weights to produce higher-scoring responses over time.
the BaseEnv contract
to define your environment, you will need to extend BaseEnv with the following methods:
from benchmax.envs.base_env import BaseEnv
from benchmax.envs.types import ChatMessage, Example, ToolDefinition
class MyEnv(BaseEnv):
system_prompt = "..."
async def list_tools(self) -> list[ToolDefinition]:
"""what tools can the model call? return [] for no tools."""
...
async def run_tool(self, rollout_id: str, tool_name: str, **tool_args):
"""execute a tool call. return a string result."""
...
async def compute_reward(
self, rollout_id: str, messages: list[ChatMessage], task: dict | None, **kwargs
) -> dict[str, float]:
"""score the model's output. return {"component_name": score}."""
...
@classmethod
def dataset_preprocess(cls, example, **kwargs) -> Example:
"""convert a raw dataset row into prompt + task."""
...
# optional: set up per-rollout state before the model sees the prompt
async def init_rollout(self, rollout_id: str, **rollout_args) -> None:
"""allocate rollout-scoped resources (workspace files, db seed, etc.)."""
...
# optional: clean up after compute_reward completes
async def release_rollout(self, rollout_id: str) -> None:
"""free anything allocated in init_rollout."""
...
list_tools, run_tool, and compute_reward must be implemented. for environments without tools, return an empty list from list_tools and an empty string from run_tool.
these are the full list of methods provided by BaseEnv:
| method | signature | description |
|---|---|---|
list_tools | async () -> list[ToolDefinition] | declare available tools. return [] for no tools |
run_tool | async (rollout_id, tool_name, **tool_args) -> str | execute a tool call, return the result |
compute_reward | async (rollout_id, messages, task, **kwargs) -> dict[str, float] | score a single rollout. messages is the full transcript, task is per-example data. see rewards |
compute_group_reward | async (rollout_ids, messages_list, tasks, **kwargs) -> list[dict[str, float]] | score a group of rollouts jointly (diversity, relative scoring). see group rewards |
dataset_preprocess | @classmethod (example, **kwargs) -> Example | convert a raw dataset row. see dataset |
load_dataset | @classmethod (dataset_name, **kwargs) -> (Dataset, path) | load a dataset from HuggingFace or local files |
init_rollout | async (rollout_id, **rollout_args) -> None | set up per-rollout state (e.g. copy files to workspace) |
release_rollout | async (rollout_id) -> None | clean up after a rollout |
shutdown | async () -> None | clean up when the environment is done |
how a rollout works
dataset_preprocessconstructs the prompt for the trainer from a raw example + system prompt. a single prompt can consist of multiple messages (the messages leading up to chat history)- (optional)
init_rolloutsets up per-rollout state (e.g. copies files to a workspace, seeds a database) - the model receives the prompt.
- the model generates a response, optionally calling tools via
list_tools/run_tool compute_rewardscores the final output- (optional)
release_rolloutcleans up any resources allocated in step 2 - the model updates to produce higher-scoring responses
this loop repeats across your dataset. the environment stays fixed while the model improves.
use Python’s standard logging module from any env method. the trainer attributes each log record to its rollout automatically, so logs appear alongside the completion and reward breakdown in the rollout inspector. see logging.