environment overview

environment Mar 19, 2026 4 min read

an environment is the specification of a task you want to train a model on. it defines three things:

  1. tools: what the model can do (search a corpus, call an API, execute code, modify a file)
  2. rewards: how the model’s output is scored (exact match, LLM judge, programmatic check)
  3. dataset preprocessing: how your raw data maps to prompts and ground truth

during training, the model interacts with your environment thousands of times. each interaction is a rollout: the model receives a prompt, generates a response (optionally calling tools), and your reward function scores the result. the model updates its weights to produce higher-scoring responses over time.

the BaseEnv contract

to define your environment, you will need to extend BaseEnv with the following methods:

from benchmax.envs.base_env import BaseEnv
from benchmax.envs.types import ChatMessage, Example, ToolDefinition

class MyEnv(BaseEnv):
    system_prompt = "..."

    async def list_tools(self) -> list[ToolDefinition]:
        """what tools can the model call? return [] for no tools."""
        ...

    async def run_tool(self, rollout_id: str, tool_name: str, **tool_args):
        """execute a tool call. return a string result."""
        ...

    async def compute_reward(
        self, rollout_id: str, messages: list[ChatMessage], task: dict | None, **kwargs
    ) -> dict[str, float]:
        """score the model's output. return {"component_name": score}."""
        ...

    @classmethod
    def dataset_preprocess(cls, example, **kwargs) -> Example:
        """convert a raw dataset row into prompt + task."""
        ...

    # optional: set up per-rollout state before the model sees the prompt
    async def init_rollout(self, rollout_id: str, **rollout_args) -> None:
        """allocate rollout-scoped resources (workspace files, db seed, etc.)."""
        ...

    # optional: clean up after compute_reward completes
    async def release_rollout(self, rollout_id: str) -> None:
        """free anything allocated in init_rollout."""
        ...

list_tools, run_tool, and compute_reward must be implemented. for environments without tools, return an empty list from list_tools and an empty string from run_tool.

these are the full list of methods provided by BaseEnv:

methodsignaturedescription
list_toolsasync () -> list[ToolDefinition]declare available tools. return [] for no tools
run_toolasync (rollout_id, tool_name, **tool_args) -> strexecute a tool call, return the result
compute_rewardasync (rollout_id, messages, task, **kwargs) -> dict[str, float]score a single rollout. messages is the full transcript, task is per-example data. see rewards
compute_group_rewardasync (rollout_ids, messages_list, tasks, **kwargs) -> list[dict[str, float]]score a group of rollouts jointly (diversity, relative scoring). see group rewards
dataset_preprocess@classmethod (example, **kwargs) -> Exampleconvert a raw dataset row. see dataset
load_dataset@classmethod (dataset_name, **kwargs) -> (Dataset, path)load a dataset from HuggingFace or local files
init_rolloutasync (rollout_id, **rollout_args) -> Noneset up per-rollout state (e.g. copy files to workspace)
release_rolloutasync (rollout_id) -> Noneclean up after a rollout
shutdownasync () -> Noneclean up when the environment is done

how a rollout works

  1. dataset_preprocess constructs the prompt for the trainer from a raw example + system prompt. a single prompt can consist of multiple messages (the messages leading up to chat history)
  2. (optional) init_rollout sets up per-rollout state (e.g. copies files to a workspace, seeds a database)
  3. the model receives the prompt.
  4. the model generates a response, optionally calling tools via list_tools / run_tool
  5. compute_reward scores the final output
  6. (optional) release_rollout cleans up any resources allocated in step 2
  7. the model updates to produce higher-scoring responses

this loop repeats across your dataset. the environment stays fixed while the model improves.

use Python’s standard logging module from any env method. the trainer attributes each log record to its rollout automatically, so logs appear alongside the completion and reward breakdown in the rollout inspector. see logging.

go deeper