environment overview | castform docs

an environment is the specification of a task you want to train a model on. it defines three things:

tools: what the model can do (search a corpus, call an API, execute code, modify a file)
rewards: how the model’s output is scored (exact match, LLM judge, programmatic check)
dataset preprocessing: how your raw data maps to prompts and ground truth

during training, the model interacts with your environment thousands of times. each interaction is a rollout: the model receives a prompt, generates a response (optionally calling tools), and your reward function scores the result. the model updates its weights to produce higher-scoring responses over time.

the BaseEnv contract

to define your environment, you will need to extend BaseEnv with the following methods:

from benchmax.envs.base_env import BaseEnv
from benchmax.envs.types import ChatMessage, Example, ToolDefinition

class MyEnv(BaseEnv):
    system_prompt = "..."

    async def list_tools(self) -> list[ToolDefinition]:
        """what tools can the model call? return [] for no tools."""
        ...

    async def run_tool(self, rollout_id: str, tool_name: str, **tool_args):
        """execute a tool call. return a string result."""
        ...

    async def compute_reward(
        self, rollout_id: str, messages: list[ChatMessage], task: dict | None, **kwargs
    ) -> dict[str, float]:
        """score the model's output. return {"component_name": score}."""
        ...

    @classmethod
    def dataset_preprocess(cls, example, **kwargs) -> Example:
        """convert a raw dataset row into prompt + task."""
        ...

    # optional: set up per-rollout state before the model sees the prompt
    async def init_rollout(self, rollout_id: str, **rollout_args) -> None:
        """allocate rollout-scoped resources (workspace files, db seed, etc.)."""
        ...

    # optional: clean up after compute_reward completes
    async def release_rollout(self, rollout_id: str) -> None:
        """free anything allocated in init_rollout."""
        ...

list_tools, run_tool, and compute_reward must be implemented. for environments without tools, return an empty list from list_tools and an empty string from run_tool.

these are the full list of methods provided by BaseEnv:

method	signature	description
`list_tools`	`async () -> list[ToolDefinition]`	declare available tools. return `[]` for no tools
`run_tool`	`async (rollout_id, tool_name, **tool_args) -> str`	execute a tool call, return the result
`compute_reward`	`async (rollout_id, messages, task, **kwargs) -> dict[str, float]`	score a single rollout. `messages` is the full transcript, `task` is per-example data. see rewards
`compute_group_reward`	`async (rollout_ids, messages_list, tasks, **kwargs) -> list[dict[str, float]]`	score a group of rollouts jointly (diversity, relative scoring). see group rewards
`dataset_preprocess`	`@classmethod (example, **kwargs) -> Example`	convert a raw dataset row. see dataset
`load_dataset`	`@classmethod (dataset_name, **kwargs) -> (Dataset, path)`	load a dataset from HuggingFace or local files
`init_rollout`	`async (rollout_id, **rollout_args) -> None`	set up per-rollout state (e.g. copy files to workspace)
`release_rollout`	`async (rollout_id) -> None`	clean up after a rollout
`shutdown`	`async () -> None`	clean up when the environment is done

how a rollout works

dataset_preprocess constructs the prompt for the trainer from a raw example + system prompt. a single prompt can consist of multiple messages (the messages leading up to chat history)
(optional) init_rollout sets up per-rollout state (e.g. copies files to a workspace, seeds a database)
the model receives the prompt.
the model generates a response, optionally calling tools via list_tools / run_tool
compute_reward scores the final output
(optional) release_rollout cleans up any resources allocated in step 2
the model updates to produce higher-scoring responses

this loop repeats across your dataset. the environment stays fixed while the model improves.

use Python’s standard logging module from any env method. the trainer attributes each log record to its rollout automatically, so logs appear alongside the completion and reward breakdown in the rollout inspector. see logging.

go deeper

tools define the model's action space with inline tools or MCP servers. rewards score the model's output with programmatic checks, LLM judges, or both. dataset preprocess and format your training data. testing validate your environment locally before launching. logging debug rollouts with logs that appear in the rollout inspector.