tools

Mar 19, 2026

tools give the model the ability to take actions during training: search a corpus, call an API, execute code, read a file. each tool has a schema that the model sees (name, description, input parameters) and an implementation that runs when the model calls it.

during training, the model learns when to use each tool and how to call it with the right arguments to maximize its reward.

defining a tool

a tool needs a ToolDefinition with a name, description, and JSON schema for its inputs:

from benchmax.envs.types import ToolDefinition

search_tool = ToolDefinition(
    name="search",
    description="Search the knowledge base.",
    input_schema={
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "the search query"},
            "limit": {"type": "integer", "description": "max results to return"},
        },
        "required": ["query"],
    },
)

the description and input_schema are what the model sees when deciding whether and how to call the tool. clear descriptions lead to better tool usage, so state what the tool does and what each parameter means.
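
to make the schema's contract concrete, here is a minimal sketch that checks a model-provided argument dict against the "required" and "type" fields of the search schema above. check_args and JSON_TYPES are hypothetical names, not part of benchmax, and the check covers only a small slice of JSON Schema; a full validator would handle far more.

```python
# hypothetical helper, not part of benchmax: checks only "required" and basic "type"
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

search_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "description": "the search query"},
        "limit": {"type": "integer", "description": "max results to return"},
    },
    "required": ["query"],
}


def check_args(schema: dict, args: dict) -> list[str]:
    """return a list of human-readable problems; empty means the args look valid."""
    problems = []
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # types this sketch does not know (array, object, ...) are skipped
        # bool is a subclass of int in Python, so reject it unless the schema asks for boolean
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            problems.append(f"{name} should be of type {spec['type']}")
    return problems


print(check_args(search_schema, {"query": "rust lifetimes", "limit": 5}))  # []
print(check_args(search_schema, {"limit": "five"}))
```

the second call reports both the missing query and the mistyped limit, which is the kind of message you could hand back to the model as a tool result.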

registering and implementing tools

implement list_tools to declare available tools and run_tool to handle calls:

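from benchmax.envs.base import BaseEnv

class MyEnv(BaseEnv):
    async def list_tools(self):
        # the schemas returned here are what the model sees
        return [search_tool]

    async def run_tool(self, rollout_id: str, tool_name: str, **tool_args):
        if tool_name == "search":
            # self._search is your own backend call; limit falls back to 10 when omitted
            return await self._search(tool_args["query"], tool_args.get("limit", 10))
        # surface unknown tool names instead of silently returning None
        raise ValueError(f"unknown tool: {tool_name}")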

run_tool receives the tool name and arguments exactly as the model provided them. return a string; this becomes the tool result the model sees in its context.
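
since the result must be a string, handlers that produce structured data need a serialization step. the sketch below is one possible approach, not a benchmax API: format_tool_result and its max_chars default are hypothetical, and the truncation is an assumption about keeping long outputs from crowding the model's context.

```python
import json


def format_tool_result(result, max_chars: int = 4000) -> str:
    """turn a handler's return value into the string the model will read."""
    if isinstance(result, str):
        text = result
    else:
        # default=str keeps non-JSON values (paths, dates) from raising
        text = json.dumps(result, ensure_ascii=False, default=str)
    if len(text) > max_chars:
        # the right-hand side is evaluated first, so len(text) is the untruncated length
        text = text[:max_chars] + f"\n[truncated {len(text) - max_chars} chars]"
    return text


hits = [{"title": "intro", "score": 0.92}, {"title": "faq", "score": 0.81}]
print(format_tool_result(hits))
print(format_tool_result("a" * 10, max_chars=4))
```

a handler can then end with `return format_tool_result(hits)` regardless of what shape its backend returns.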

multiple tools

when your environment has several tools, use a dispatch pattern:

async def list_tools(self):
    return [search_tool, execute_tool, summarize_tool]

async def run_tool(self, rollout_id: str, tool_name: str, **tool_args):
    handler = {
        "search": self._search,
        "execute": self._execute,
        "summarize": self._summarize,
    }[tool_name]
    return await handler(**tool_args)

this is how the built-in SearchEnv works: it registers a search tool over your corpus and dispatches calls to the search backend.
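
the dict lookup above raises KeyError on an unknown tool name, and `handler(**tool_args)` raises TypeError when the model's arguments do not match the handler's signature. a self-contained sketch of a more forgiving variant follows. DispatchSketch is a stand-in class rather than a real BaseEnv subclass, and returning the error as text (so the model can read it) is a design assumption; raising is the other reasonable option.

```python
import asyncio


class DispatchSketch:
    """stand-in for a BaseEnv subclass; only the dispatch logic is shown."""

    def __init__(self):
        self._handlers = {"search": self._search, "summarize": self._summarize}

    async def _search(self, query: str, limit: int = 10) -> str:
        return f"searched {query!r} (limit={limit})"

    async def _summarize(self, text: str) -> str:
        return text[:20]

    async def run_tool(self, rollout_id: str, tool_name: str, **tool_args) -> str:
        handler = self._handlers.get(tool_name)
        if handler is None:
            return f"unknown tool: {tool_name}"
        try:
            return await handler(**tool_args)
        except TypeError as exc:
            # raised when arguments do not bind to the handler's signature;
            # note this also catches a TypeError raised inside the handler body
            return f"bad arguments for {tool_name}: {exc}"


env = DispatchSketch()
print(asyncio.run(env.run_tool("r1", "search", query="cats")))
print(asyncio.run(env.run_tool("r1", "delete_everything")))
```

because handlers are called with `**tool_args`, their parameter names have to match the property names in each tool's input_schema.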

rollout lifecycle hooks

for tools that need per-example setup (e.g. copying a file into a workspace before the model can edit it), use the lifecycle hooks:

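async def init_rollout(self, rollout_id: str, **rollout_args):
    """called before each rollout. use rollout_args from init_rollout_args."""
    # setup_workspace is your own helper (not part of benchmax)
    self._workspace[rollout_id] = setup_workspace(rollout_args["file_path"])

async def release_rollout(self, rollout_id: str):
    """called after each rollout. clean up resources."""
    # pop with a default so a rollout that never initialized does not raise
    workspace = self._workspace.pop(rollout_id, None)
    if workspace is not None:
        cleanup(workspace)  # cleanup is likewise your own helper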

init_rollout_args is set on the StandardizedExample returned by dataset_preprocess and reaches init_rollout as rollout_args. see dataset.
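
the path from a dataset row to init_rollout can be sketched end to end. everything here is a stand-in: ExampleStandIn mimics only two fields of StandardizedExample, and drive_one_rollout is an assumed driver loop, since this page does not show how the trainer actually schedules the hooks.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ExampleStandIn:
    """stand-in for StandardizedExample with only the fields used here."""
    prompt: str
    init_rollout_args: dict = field(default_factory=dict)


class WorkspaceEnvSketch:
    def __init__(self):
        self._workspace: dict[str, str] = {}  # rollout_id -> per-rollout state

    async def dataset_preprocess(self, row: dict) -> ExampleStandIn:
        return ExampleStandIn(
            prompt=row["task"],
            init_rollout_args={"file_path": row["file_path"]},
        )

    async def init_rollout(self, rollout_id: str, **rollout_args):
        self._workspace[rollout_id] = rollout_args["file_path"]

    async def release_rollout(self, rollout_id: str):
        self._workspace.pop(rollout_id, None)


async def drive_one_rollout(env, row: dict, rollout_id: str) -> str:
    """assumed driver: preprocess, init, (tool calls and reward), release."""
    example = await env.dataset_preprocess(row)
    await env.init_rollout(rollout_id, **example.init_rollout_args)
    try:
        return env._workspace[rollout_id]  # tool calls would happen here
    finally:
        await env.release_rollout(rollout_id)


env = WorkspaceEnvSketch()
row = {"task": "fix the bug", "file_path": "src/app.py"}
print(asyncio.run(drive_one_rollout(env, row, "rollout-0")))
```

keying state by rollout_id keeps each rollout's workspace separate, and the try/finally mirrors the idea that release_rollout should run even when a rollout fails partway.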

full example

here’s a complete environment for a code-editing task. the model can read files, write patches, and run tests; after it finishes, compute_reward checks whether the tests pass.

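import shutil
import subprocess
import tempfile

from benchmax.envs.base import BaseEnv
from benchmax.envs.types import ToolDefinition, StandardizedExample


# illustrative helpers, not part of benchmax: swap in your own workspace management
def setup_workspace(repo_path: str) -> str:
    """copy the repo snapshot into a fresh tmp dir and return its path."""
    workspace = tempfile.mkdtemp(prefix="rollout-")
    shutil.copytree(repo_path, workspace, dirs_exist_ok=True)
    return workspace


def cleanup(workspace):
    """delete a workspace created by setup_workspace; a None workspace is a no-op."""
    if workspace is not None:
        shutil.rmtree(workspace, ignore_errors=True)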

read_tool = ToolDefinition(
    name="read_file",
    description="Read the contents of a file in the workspace.",
    input_schema={
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "relative path to the file"},
        },
        "required": ["path"],
    },
)

write_tool = ToolDefinition(
    name="write_file",
    description="Overwrite a file in the workspace with new content.",
    input_schema={
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "relative path to the file"},
            "content": {"type": "string", "description": "new file content"},
        },
        "required": ["path", "content"],
    },
)

run_tests_tool = ToolDefinition(
    name="run_tests",
    description="Run the test suite and return the exit code followed by the test output.",
    input_schema={"type": "object", "properties": {}},
)


class CodeEditEnv(BaseEnv):
    def __init__(self):
        self._workspaces: dict[str, str] = {}  # rollout_id -> tmp dir path

    async def list_tools(self):
        return [read_tool, write_tool, run_tests_tool]

    async def init_rollout(self, rollout_id: str, **rollout_args):
        # copy the repo snapshot for this example into an isolated tmp dir
        self._workspaces[rollout_id] = setup_workspace(rollout_args["repo_path"])

    async def release_rollout(self, rollout_id: str):
        cleanup(self._workspaces.pop(rollout_id, None))

    async def run_tool(self, rollout_id: str, tool_name: str, **tool_args):
        workspace = self._workspaces[rollout_id]
        if tool_name == "read_file":
            # note: the path comes from the model and is not sandboxed here
            full_path = f"{workspace}/{tool_args['path']}"
            with open(full_path) as f:
                return f.read()
        elif tool_name == "write_file":
            full_path = f"{workspace}/{tool_args['path']}"
            with open(full_path, "w") as f:
                f.write(tool_args["content"])
            return "ok"
        elif tool_name == "run_tests":
            # subprocess.run blocks the event loop until the tests finish
            result = subprocess.run(
                ["pytest", "--tb=short", "-q"],
                cwd=workspace,
                capture_output=True,
                text=True,
            )
            return f"exit {result.returncode}\n{result.stdout}{result.stderr}"
        # surface unknown tool names instead of silently returning None
        raise ValueError(f"unknown tool: {tool_name}")

    async def compute_reward(self, rollout_id: str, messages, **rollout_args):
        workspace = self._workspaces[rollout_id]
        result = subprocess.run(
            ["pytest", "-q"], cwd=workspace, capture_output=True, text=True
        )
        return 1.0 if result.returncode == 0 else 0.0

    async def dataset_preprocess(self, example: dict) -> StandardizedExample:
        return StandardizedExample(
            prompt=f"Fix the failing tests in this repo.\n\n{example['task']}",
            init_rollout_args={"repo_path": example["repo_path"]},
            metadata=example,
        )
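
the example joins the model-supplied path onto the workspace directly, so a path like `../../etc/passwd` would escape it. one way to guard against that is sketched below. resolve_in_workspace is a hypothetical helper, not part of benchmax, and it relies only on pathlib.

```python
import tempfile
from pathlib import Path


def resolve_in_workspace(workspace: str, relative_path: str) -> Path:
    """resolve a model-supplied path and refuse anything outside the workspace."""
    root = Path(workspace).resolve()
    # an absolute relative_path replaces root in the join, so it is rejected below too
    candidate = (root / relative_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {relative_path}")
    return candidate


workspace = tempfile.mkdtemp()
print(resolve_in_workspace(workspace, "src/main.py"))
try:
    resolve_in_workspace(workspace, "../../etc/passwd")
except ValueError as exc:
    print(exc)
```

in run_tool, the read_file and write_file branches could call this helper in place of the f-string join and return the ValueError message to the model as the tool result.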