sdk overview | castform docs

benchmax is castform’s open-source python sdk and the companion to the training platform. it’s the library the platform runs under the hood, and the one you reach for when you want full control: you define your dataset, environment, and rewards in code, then launch a run from your own machine.

prefer not to write it by hand? see our quickstart guide to scaffold a project with the castform cli and let your coding agent build it for you.

what benchmax offers

environments as code: extend BaseEnv to define an environment with rewards and tools for your task. see environments.
flexible rewards: the rewards and rubrics api, optimized for training, lets you define success criteria for your task. see rewards.
automated dataset generation: generate synthetic qa pairs from a corpus, or process production traces into training data.
managed training + eval: the sdk provides an interface to the castform platform for validating, uploading, and launching runs. runs execute on gpus that castform provisions, and you can then evaluate the result against baselines.

step-by-step overview

prerequisites

python 3.12 (3.13 is not supported)
a castform api key (app.castform.com/settings)

pip install benchmax

1. prepare your dataset

training data is a list of dicts. the simplest format uses prompt and ground_truth columns (for multi-turn conversations, use messages instead of prompt; see dataset):

dataset = generate_my_examples(n=400)  # your data generation logic
# each example looks like: {"prompt": "...", "ground_truth": "..."}

split = int(len(dataset) * 0.8)
train_data, eval_data = dataset[:split], dataset[split:]

see dataset for format details and data sources.
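
a slice-based split keeps the original order, so if your examples are grouped by topic or difficulty the eval set may not be representative. a minimal sketch of a seeded, shuffled split with basic format checks follows. it assumes every example carries ground_truth plus either prompt or messages, as described above. split_dataset, the seed, and the sample arithmetic prompts are illustrative and not part of benchmax.

```python
import random


def split_dataset(examples, train_fraction=0.8, seed=0):
    """Validate, shuffle, and split examples into (train, eval) lists.

    Hypothetical helper, not part of benchmax.
    """
    for i, example in enumerate(examples):
        # Assumption: each example needs a ground truth plus a prompt or messages.
        if "ground_truth" not in example:
            raise ValueError(f"example {i} is missing 'ground_truth'")
        if "prompt" not in example and "messages" not in example:
            raise ValueError(f"example {i} needs 'prompt' or 'messages'")

    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible

    split = int(len(shuffled) * train_fraction)
    train, held_out = shuffled[:split], shuffled[split:]
    if not train or not held_out:
        raise ValueError("dataset too small: one side of the split is empty")
    return train, held_out


# Illustrative data only.
examples = [{"prompt": f"what is {n}+{n}?", "ground_truth": str(2 * n)} for n in range(10)]
train_data, eval_data = split_dataset(examples)
print(len(train_data), len(eval_data))  # 8 2
```

the empty-side check matters for very small datasets: with a single example, int(1 * 0.8) is 0 and the training set would be empty.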

2. define your environment

an environment tells the platform how to score the model’s output. at minimum you need a system prompt and a reward function:

from benchmax.envs.base_env import BaseEnv

class MyEnv(BaseEnv):
    system_prompt = "..."

    async def list_tools(self):
        return []  # no tools here; return tool definitions if the model can call any

    async def run_tool(self, rollout_id, tool_name, **tool_args):
        ...  # execute a tool call

    async def compute_reward(self, rollout_id, messages, task, **kwargs):
        ...  # score the model's output, return {"name": score}

see environment overview for the full contract, rewards for reward patterns, and tools for adding tool use.
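
because a launched run takes a few minutes to warm up, it can help to keep scoring logic in a plain function that compute_reward delegates to, so you can check it locally first. a minimal sketch, assuming the model is prompted to wrap its answer in <answer> tags and that a case-insensitive exact match is the success criterion. score_answer is an illustrative helper, not part of benchmax.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def score_answer(completion_text, ground_truth):
    """Score one completion against its ground truth.

    Hypothetical helper, not part of benchmax. Returns {"correct": 1.0} on a
    case-insensitive exact match inside <answer> tags, else {"correct": 0.0}.
    """
    match = ANSWER_RE.search(completion_text)
    answer = match.group(1).strip().lower() if match else ""
    expected = ground_truth.strip().lower()
    # A missing or empty answer never scores, even if the ground truth is empty.
    return {"correct": 1.0 if answer and answer == expected else 0.0}


# Quick local checks; no platform or gpu needed.
print(score_answer("I think <answer> Paris </answer>", "paris"))  # {'correct': 1.0}
print(score_answer("Paris", "Paris"))  # {'correct': 0.0}
```

the second check shows that a correct answer without the tags scores 0.0, so the system prompt has to state the tag convention clearly.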

3. launch

import dataclasses

from benchmax.platform.client import TrainerClient
from benchmax.platform.training_run import upload_training_run

uploaded = upload_training_run(
    env_class=MyEnv, train_dataset=train_data,
    eval_dataset=eval_data, run_name="my-first-run", api_key="sk_...",
)
trainer = TrainerClient(api_key="sk_...")
run_id = trainer.launch_training_run(
    training_run_type="simple", **dataclasses.asdict(uploaded),
)

view your run at https://app.castform.com/experiments/{run_id}. see launching for the full api reference.
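
the launch snippet inlines the api key for brevity. a small sketch of reading it from the environment instead, so the key stays out of source control. it uses the CASTFORM_API_KEY variable name from the complete script below. load_api_key is an illustrative helper, not part of benchmax.

```python
import os


def load_api_key(env_var="CASTFORM_API_KEY"):
    """Read the api key from the environment, failing with a clear message.

    Hypothetical helper, not part of benchmax.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"set {env_var} before launching a run")
    return key


# Usage (left commented out so this snippet runs without a key set):
# api_key = load_api_key()
```

you would then pass the result as api_key to both upload_training_run and TrainerClient.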

complete runnable script

import dataclasses
import os
import re

from benchmax.envs.base_env import BaseEnv
from benchmax.envs.reward_helpers import extract_completion_text
from benchmax.platform.training_run import upload_training_run
from benchmax.platform.client import TrainerClient

API_KEY = os.environ["CASTFORM_API_KEY"]

# --- dataset ---
dataset = [
    {"prompt": "what is 2+2?", "ground_truth": "4"},
    {"prompt": "capital of France?", "ground_truth": "Paris"},
    # ... add 200+ examples for real training
]
split = int(len(dataset) * 0.8)
train_data, eval_data = dataset[:split], dataset[split:]

# --- environment ---
class MyEnv(BaseEnv):
    system_prompt = "answer the question concisely. put your answer in <answer> tags."

    async def list_tools(self):
        return []

    async def run_tool(self, rollout_id, tool_name, **tool_args):
        return ""

    async def compute_reward(self, rollout_id, messages, task, **kwargs):
        text = extract_completion_text(messages)
        match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
        answer = match.group(1).strip().lower() if match else ""
        expected = (task or {}).get("ground_truth", "").strip().lower()
        # require a non-empty answer so a missing tag never matches an empty ground truth
        return {"correct": 1.0 if answer and answer == expected else 0.0}

# --- upload and launch ---
uploaded = upload_training_run(
    env_class=MyEnv, train_dataset=train_data,
    eval_dataset=eval_data, run_name="my-first-run", api_key=API_KEY,
)
trainer = TrainerClient(api_key=API_KEY)
run_id = trainer.launch_training_run(
    training_run_type="simple", **dataclasses.asdict(uploaded),
)
print(f"https://app.castform.com/experiments/{run_id}")

what happens next

once your run launches:

gpus warm up (a few minutes). status shows “pending”.
metrics start flowing. reward curves & model responses appear on the train tab.
inspect completions. expand rollouts to see what the model is generating at each step.

don’t draw conclusions from the first dozen steps. rewards will fluctuate early as the model explores. see monitoring a run for how to interpret metrics and spot problems.
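
early fluctuation is easier to read through a smoothed curve. a minimal sketch of a trailing moving average over per-step mean rewards follows. the reward values are made up for illustration, the window size is an arbitrary choice, and the platform's own charts may smooth differently.

```python
from collections import deque


def moving_average(values, window=5):
    """Trailing moving average; early entries average over fewer points."""
    recent = deque(maxlen=window)  # holds at most the last `window` values
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Made-up per-step mean rewards, for illustration only.
rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
print([round(x, 2) for x in moving_average(rewards, window=4)])
# [0.0, 0.5, 0.33, 0.5, 0.75, 0.5, 0.75, 0.75]
```

the raw series jumps between 0.0 and 1.0 at almost every step, while the smoothed one settles around 0.5 to 0.75, which is the kind of trend worth judging a run on.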

once training converges, evaluate your model against baselines and test it in the playground.

learn more

what is rl finetuning: how rl training works and when to use it
how it works: how the full pipeline fits together
fine-tune for rag: train a search model over your documents
fine-tune from traces: train from existing agent logs