dataset | castform docs

the dataset is what your model trains on. during training, each example becomes one rollout: the model receives the example’s messages (a single user message, or a full conversation history), generates a response, and gets scored by your reward function.

format

training data is a list of dicts. the simplest format uses prompt and ground_truth columns:

train_data = [
    {"prompt": "what is 2+2?", "ground_truth": "4"},
    {"prompt": "capital of France?", "ground_truth": "Paris"},
]

for multi-turn conversations, use messages instead of prompt:

train_data = [
    {
        "messages": [
            {"role": "user", "content": "search for climate change"},
            {"role": "assistant", "content": "searching..."},
            {"role": "user", "content": "summarize the top result"},
        ],
        "ground_truth": "...",
    },
]

each message is a ChatMessage dict. the web UI accepts CSV and JSON uploads with the same structure.
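a quick local check can catch malformed rows before upload. the sketch below is not part of castform: `check_row` is a hypothetical helper, and the rule that a row carries exactly one of prompt or messages is an assumption drawn from the two formats shown above.

```python
def check_row(row: dict) -> list[str]:
    """Return a list of format problems for one training row (empty if none)."""
    problems = []
    has_prompt = "prompt" in row
    has_messages = "messages" in row
    # assumption: a row uses either `prompt` or `messages`, not both
    if has_prompt == has_messages:
        problems.append("expected exactly one of 'prompt' or 'messages'")
    if has_prompt and not isinstance(row["prompt"], str):
        problems.append("'prompt' should be a string")
    if has_messages:
        messages = row["messages"]
        if not isinstance(messages, list) or not messages:
            problems.append("'messages' should be a non-empty list")
        else:
            for i, msg in enumerate(messages):
                # `role` is the only field the ChatMessage table marks as required
                if not isinstance(msg, dict) or "role" not in msg:
                    problems.append(f"messages[{i}] is missing 'role'")
    return problems


train_data = [
    {"prompt": "what is 2+2?", "ground_truth": "4"},
    {
        "messages": [
            {"role": "user", "content": "search for climate change"},
            {"role": "assistant", "content": "searching..."},
            {"role": "user", "content": "summarize the top result"},
        ],
        "ground_truth": "...",
    },
]

# collect problems per row index, keeping only rows that have any
report = {i: p for i, row in enumerate(train_data) if (p := check_row(row))}
```

an empty `report` means every row matched one of the two shapes.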

ChatMessage

every message, whether in a messages column or in the resulting Example (see preprocessing), is a ChatMessage from benchmax.envs.types:

field	type	description
`role`	`str`	`"system"`, `"user"`, `"assistant"`, or `"tool"` (required)
`content`	`str`	the message text
`tool_calls`	`list[ToolCallDict]`	tool invocations (assistant messages). serialized in OpenAI nested format
`tool_call_id`	`str`	which tool call this message responds to (tool messages)
`name`	`str`	tool name on tool-result messages
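to show how these fields fit together, here is a sketch of a conversation with one tool call. the `get_weather` tool, the `call_1` id, and the message contents are made up for illustration, and the exact ToolCallDict fields are an assumption based on the OpenAI chat-completions shape (`id`, `type`, and a nested `function` with `name` and JSON-string `arguments`).

```python
import json

messages = [
    {"role": "user", "content": "what is the weather in Paris?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": "call_1",  # hypothetical id
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    # in the OpenAI format, arguments are a JSON string, not a dict
                    "arguments": json.dumps({"city": "Paris"}),
                },
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",  # points back at the assistant's call
        "name": "get_weather",
        "content": "18C, cloudy",
    },
]

# every tool message should answer a call the assistant actually made
call_ids = {call["id"] for m in messages for call in m.get("tool_calls", [])}
unmatched = [
    m for m in messages
    if m["role"] == "tool" and m["tool_call_id"] not in call_ids
]
```

an empty `unmatched` list means each tool result is paired with a tool call.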

where training data comes from

manual creation: write examples by hand. good for small, focused tasks.
synthetic generation: use the QA generation pipeline to create examples from a corpus. good for RAG tasks.
trace import: extract training examples from existing agent logs. see traces overview.

custom preprocessing

override dataset_preprocess when your column names differ from the defaults or you need to reshape the data:

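from benchmax.envs.example_id import make_example

# define this as a method on your environment class
@classmethod
def dataset_preprocess(cls, example, **kwargs):
    # map the dataset's "question" / "answer" columns onto the Example fields
    return make_example(
        prompt_messages=[{"role": "user", "content": example["question"]}],
        task={"ground_truth": example.get("answer")},
    )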

make_example returns an Example with these fields:

field	type	description
`id`	`str`	SHA-256 hash of prompt + task, auto-computed by `make_example`
`prompt_messages`	`list[ChatMessage]`	the conversation the model receives
`task`	`dict[str, Any] | None`	per-example data passed to `compute_reward` (e.g. `ground_truth`, scoring config). can be `None`
`init_rollout_args`	`dict[str, Any] | None`	per-example arguments passed to `init_rollout()` when the rollout starts (e.g. file paths, task metadata)
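because the id is a hash of the content, the same prompt and task always produce the same id. the sketch below illustrates that property only: `sketch_example_id` is a hypothetical stand-in, and the JSON serialization it hashes is an assumption, so its output will not necessarily match the ids that `make_example` computes.

```python
import hashlib
import json


def sketch_example_id(prompt_messages, task):
    """Hash prompt + task into a hex digest (illustrative, not benchmax's exact scheme)."""
    # sort_keys gives a stable serialization regardless of dict insertion order
    payload = json.dumps(
        {"prompt_messages": prompt_messages, "task": task},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


prompt = [{"role": "user", "content": "what is 2+2?"}]
id_a = sketch_example_id(prompt, {"ground_truth": "4"})
id_b = sketch_example_id(prompt, {"ground_truth": "4"})
id_c = sketch_example_id(prompt, {"ground_truth": "5"})
```

here `id_a` and `id_b` are equal, while `id_c` differs because the task changed.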

init_rollout_args is useful when each example needs a different setup. for instance, a spreadsheet environment might pass the path to the input file:

# inside dataset_preprocess
return make_example(
    prompt_messages=[{"role": "user", "content": "fix the formula in cell B5"}],
    task={"answer_position": "B5", "expected": "42"},
    init_rollout_args={"spreadsheet_path": "/data/input.xlsx"},
)

guidance

minimum size: the platform requires at least 16 examples. 200+ is recommended for meaningful training, 1000+ for best results.
validate before launching: run a few examples through dataset_preprocess locally to catch format issues. validate_env does this for you; see testing.
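a local dry run along these lines can be sketched without the platform. everything here is a stand-in: `preprocess_row` mimics a dataset_preprocess that returns a plain dict rather than a real Example, and `dry_run` is a hypothetical helper, not the validate_env API.

```python
MIN_EXAMPLES = 16  # platform minimum stated in the guidance above


def preprocess_row(row):
    """Stand-in for dataset_preprocess; returns a plain dict instead of an Example."""
    return {
        "prompt_messages": [{"role": "user", "content": row["question"]}],
        "task": {"ground_truth": row.get("answer")},
    }


def dry_run(rows, preprocess, sample_size=5):
    """Check dataset size and run the first few rows through preprocess."""
    problems = []
    if len(rows) < MIN_EXAMPLES:
        problems.append(f"only {len(rows)} examples, need at least {MIN_EXAMPLES}")
    for i, row in enumerate(rows[:sample_size]):
        try:
            example = preprocess(row)
        except Exception as exc:  # report any failure instead of stopping at the first
            problems.append(f"row {i}: {type(exc).__name__}: {exc}")
            continue
        if not example.get("prompt_messages"):
            problems.append(f"row {i}: empty prompt_messages")
    return problems


rows = [{"question": f"what is {i}+{i}?", "answer": str(2 * i)} for i in range(16)]
problems = dry_run(rows, preprocess_row)
```

an empty `problems` list means the sample preprocessed cleanly and the dataset meets the minimum size; a row missing the `question` column would show up as a `KeyError` entry.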