dataset

Mar 19, 2026

the dataset is what your model trains on. during training, each example becomes one rollout: the model receives the example’s messages (a single user message, or a full conversation history), generates a response, and gets scored by your reward function.

format

training data is a list of dicts. the simplest format uses prompt and ground_truth columns:

train_data = [
    {"prompt": "what is 2+2?", "ground_truth": "4"},
    {"prompt": "capital of France?", "ground_truth": "Paris"},
]

for multi-turn conversations, use messages instead of prompt:

train_data = [
    {
        "messages": [
            {"role": "user", "content": "search for climate change"},
            {"role": "assistant", "content": "searching..."},
            {"role": "user", "content": "summarize the top result"},
        ],
        "ground_truth": "...",
    },
]

each message is a ChatMessage dict. the web UI accepts CSV and JSON uploads with the same structure.
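the two layouts above describe the same thing: a prompt row is a conversation with a single user message. the sketch below makes that explicit with a small normalizer. to_messages is a hypothetical helper written for illustration, not a benchmax function, and the platform's own handling of these columns may differ.

```python
from typing import Any


def to_messages(row: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the chat messages for one training row.

    Hypothetical helper, not part of benchmax: a "prompt" row becomes a
    single user message, and a "messages" row is copied through as-is.
    """
    if "messages" in row:
        return list(row["messages"])
    if "prompt" in row:
        return [{"role": "user", "content": row["prompt"]}]
    raise ValueError("row needs either a 'prompt' or a 'messages' key")


train_data = [
    {"prompt": "what is 2+2?", "ground_truth": "4"},
    {
        "messages": [
            {"role": "user", "content": "search for climate change"},
            {"role": "assistant", "content": "searching..."},
            {"role": "user", "content": "summarize the top result"},
        ],
        "ground_truth": "...",
    },
]

# Every row now has the same shape, whichever column it started with.
normalized = [to_messages(row) for row in train_data]
```

rows with neither key raise a ValueError, which is usually easier to debug locally than a failed upload.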

ChatMessage

every message, whether in a messages column or in the resulting Example (see preprocessing), is a ChatMessage from benchmax.envs.types:

| field | type | description |
|---|---|---|
| role | str | "system", "user", "assistant", or "tool" (required) |
| content | str | the message text |
| tool_calls | list[ToolCallDict] | tool invocations (assistant messages). serialized in OpenAI nested format |
| tool_call_id | str | which tool call this message responds to (tool messages) |
| name | str | tool name on tool-result messages |
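to show how the tool-related fields fit together, here is a sketch of a conversation containing one tool call and its result. it assumes ToolCallDict follows the OpenAI chat-completions shape (the call nested under "function", with "arguments" as a JSON string); the id, the tool name, and the dangling_tool_results helper are all made up for illustration.

```python
import json

# Assumption: ToolCallDict mirrors the OpenAI nested shape, where the call
# sits under "function" and "arguments" is a JSON-encoded string.
tool_call = {
    "id": "call_001",  # hypothetical id
    "type": "function",
    "function": {
        "name": "search",  # hypothetical tool name
        "arguments": json.dumps({"query": "climate change"}),
    },
}

messages = [
    {"role": "user", "content": "search for climate change"},
    # The assistant message carries the tool invocation.
    {"role": "assistant", "content": "", "tool_calls": [tool_call]},
    # The tool message points back at the call it answers.
    {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "name": "search",
        "content": "top result: ...",
    },
]


def dangling_tool_results(msgs):
    """Return tool_call_ids on tool messages that no earlier assistant message issued."""
    issued = set()
    dangling = []
    for msg in msgs:
        for call in msg.get("tool_calls", []):
            issued.add(call["id"])
        if msg["role"] == "tool" and msg.get("tool_call_id") not in issued:
            dangling.append(msg.get("tool_call_id"))
    return dangling
```

an empty list from dangling_tool_results(messages) means every tool message answers a call that appeared earlier in the conversation.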

where training data comes from

  • manual creation: write examples by hand. good for small, focused tasks.
  • synthetic generation: use the QA generation pipeline to create examples from a corpus. good for RAG tasks.
  • trace import: extract training examples from existing agent logs. see traces overview.

custom preprocessing

override dataset_preprocess when your column names differ from the defaults or you need to reshape the data:

from benchmax.envs.example_id import make_example

# define this on your environment class
@classmethod
def dataset_preprocess(cls, example, **kwargs):
    # map the raw "question" / "answer" columns onto the Example fields
    return make_example(
        prompt_messages=[{"role": "user", "content": example["question"]}],
        task={"ground_truth": example.get("answer")},
    )

make_example returns an Example with these fields:

| field | type | description |
|---|---|---|
| id | str | SHA-256 hash of prompt + task, auto-computed by make_example |
| prompt_messages | list[ChatMessage] | the conversation the model receives |
| task | dict[str, Any] \| None | per-example data passed to compute_reward (e.g. ground_truth, scoring config). can be None |
| init_rollout_args | dict[str, Any] \| None | per-example arguments passed to init_rollout() when the rollout starts (e.g. file paths, task metadata) |
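because id is a hash of prompt + task, two examples with identical content share an id, and any change to the prompt or the task produces a new one. the sketch below illustrates that property with the standard library. the serialization here (sorted-key JSON) is an assumption, so sketch_example_id will not reproduce the ids that make_example actually computes.

```python
import hashlib
import json


def sketch_example_id(prompt_messages, task):
    """Hash prompt + task into a stable hex id (illustrative, not benchmax's code)."""
    # Assumption: sorted-key JSON is used so that dict key order does not
    # change the hash. make_example may serialize differently.
    payload = json.dumps(
        {"prompt_messages": prompt_messages, "task": task},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


prompt = [{"role": "user", "content": "what is 2+2?"}]
reordered = [{"content": "what is 2+2?", "role": "user"}]

a = sketch_example_id(prompt, {"ground_truth": "4"})
b = sketch_example_id(reordered, {"ground_truth": "4"})  # same content, same id
c = sketch_example_id(prompt, {"ground_truth": "5"})  # different task, different id
```

a content-derived id like this makes duplicate rows easy to spot: collect the ids into a set and compare its size with the dataset length.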

init_rollout_args is useful when each example needs a different setup. for instance, a spreadsheet environment might pass the path to the input file:

return make_example(
    prompt_messages=[{"role": "user", "content": "fix the formula in cell B5"}],
    task={"answer_position": "B5", "expected": "42"},
    init_rollout_args={"spreadsheet_path": "/data/input.xlsx"},
)

guidance

  • minimum size: the platform requires at least 16 examples. 200+ is recommended for meaningful training, 1000+ for best results.
  • validate before launching: run a few examples through dataset_preprocess locally to catch format issues. validate_env does this for you; see testing.
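the two checks above can be sketched as a small local script. this is not validate_env: preprocess is a stand-in for your own dataset_preprocess that returns a plain dict instead of an Example, and check_dataset is a hypothetical helper. only the 16-example minimum and the four role names come from this page.

```python
MIN_EXAMPLES = 16  # platform minimum stated above
VALID_ROLES = {"system", "user", "assistant", "tool"}


def preprocess(row):
    """Stand-in for your dataset_preprocess; returns a plain dict, not an Example."""
    return {
        "prompt_messages": [{"role": "user", "content": row["question"]}],
        "task": {"ground_truth": row.get("answer")},
    }


def check_dataset(rows, sample_size=5):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if len(rows) < MIN_EXAMPLES:
        problems.append(f"only {len(rows)} examples, need at least {MIN_EXAMPLES}")
    # Only the first few rows are preprocessed, mirroring "run a few examples".
    for index, row in enumerate(rows[:sample_size]):
        try:
            example = preprocess(row)
        except Exception as exc:  # surface any row that fails to convert
            problems.append(f"row {index}: preprocess raised {exc!r}")
            continue
        messages = example["prompt_messages"]
        if not messages:
            problems.append(f"row {index}: empty prompt_messages")
        for message in messages:
            if message.get("role") not in VALID_ROLES:
                problems.append(f"row {index}: bad role {message.get('role')!r}")
    return problems


rows = [{"question": f"what is {n}+{n}?", "answer": str(2 * n)} for n in range(20)]
problems = check_dataset(rows)
```

returning a list of problems rather than raising on the first one lets you see every issue in the sample in a single pass.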