the dataset is what your model trains on. during training, each example becomes one rollout: the model receives the example’s messages (a single user message, or a full conversation history), generates a response, and gets scored by your reward function.
format
training data is a list of dicts. the simplest format uses prompt and ground_truth columns:
train_data = [
    {"prompt": "what is 2+2?", "ground_truth": "4"},
    {"prompt": "capital of France?", "ground_truth": "Paris"},
]
for multi-turn conversations, use messages instead of prompt:
train_data = [
    {
        "messages": [
            {"role": "user", "content": "search for climate change"},
            {"role": "assistant", "content": "searching..."},
            {"role": "user", "content": "summarize the top result"},
        ],
        "ground_truth": "...",
    },
]
each message is a ChatMessage dict. the web UI accepts CSV and JSON uploads with the same structure.
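as a rough sketch of preparing an upload, the rows can be serialized with the standard library. the file names and the helpers `write_json` and `write_csv` are hypothetical, and the exact schema the web UI expects is assumed to mirror the list-of-dicts structure above. CSV is shown only for flat prompt/ground_truth rows, since how nested messages should be encoded in CSV is not specified here.

```python
import csv
import json
import tempfile
from pathlib import Path

train_data = [
    {"prompt": "what is 2+2?", "ground_truth": "4"},
    {"prompt": "capital of France?", "ground_truth": "Paris"},
]

def write_json(rows, path):
    """Write rows as a JSON array of objects."""
    text = json.dumps(rows, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")

def write_csv(rows, path):
    """Write flat rows (string values only) as CSV with a header row."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# hypothetical output location; any writable directory works
out_dir = Path(tempfile.mkdtemp())
write_json(train_data, out_dir / "train.json")
write_csv(train_data, out_dir / "train.csv")

# read both files back to confirm nothing was lost in serialization
json_rows = json.loads((out_dir / "train.json").read_text(encoding="utf-8"))
with open(out_dir / "train.csv", newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))
```

reading the files back before uploading is a cheap way to catch quoting or encoding problems early.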
ChatMessage
every message, whether in a messages column or in the resulting Example (see preprocessing), is a ChatMessage from benchmax.envs.types:
| field | type | description |
|---|---|---|
| role | str | "system", "user", "assistant", or "tool" (required) |
| content | str | the message text |
| tool_calls | list[ToolCallDict] | tool invocations (assistant messages). serialized in OpenAI nested format |
| tool_call_id | str | which tool call this message responds to (tool messages) |
| name | str | tool name on tool-result messages |
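to make the tool fields concrete, here is a sketch of a conversation with one tool call. it assumes ToolCallDict follows the OpenAI chat-completions shape (an id, a type of "function", and a nested function object whose arguments are a JSON-encoded string). the tool name `search`, the call id, and the tool output are made up for illustration.

```python
import json

# hypothetical tool call; shape assumed to match OpenAI's nested format
tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {
        "name": "search",
        # OpenAI's format carries arguments as a JSON-encoded string
        "arguments": json.dumps({"query": "climate change"}),
    },
}

messages = [
    {"role": "user", "content": "search for climate change"},
    # the assistant message carries the tool invocation
    {"role": "assistant", "content": "", "tool_calls": [tool_call]},
    # the tool message answers it via tool_call_id and names the tool
    {"role": "tool", "tool_call_id": "call_1", "name": "search", "content": "top result: ..."},
]

# every tool message should reference a tool call issued in the conversation
issued = {call["id"] for m in messages for call in m.get("tool_calls", [])}
answered = {m["tool_call_id"] for m in messages if m["role"] == "tool"}
unmatched = answered - issued
```

an empty `unmatched` set means each tool result is paired with a call. a check like this is handy when importing conversations from traces.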
where training data comes from
- manual creation: write examples by hand. good for small, focused tasks.
- synthetic generation: use the QA generation pipeline to create examples from a corpus. good for RAG tasks.
- trace import: extract training examples from existing agent logs. see traces overview.
custom preprocessing
override dataset_preprocess when your column names differ from the defaults or you need to reshape the data:
from benchmax.envs.example_id import make_example

# method on your environment subclass
@classmethod
def dataset_preprocess(cls, example, **kwargs):
    # map the custom "question" / "answer" columns onto the Example fields
    return make_example(
        prompt_messages=[{"role": "user", "content": example["question"]}],
        task={"ground_truth": example.get("answer")},
    )
make_example returns an Example with these fields:
| field | type | description |
|---|---|---|
| id | str | SHA-256 hash of prompt + task, auto-computed by make_example |
| prompt_messages | list[ChatMessage] | the conversation the model receives |
| task | dict[str, Any] \| None | per-example data passed to compute_reward (e.g. ground_truth, scoring config). can be None |
| init_rollout_args | dict[str, Any] \| None | per-example arguments passed to init_rollout() when the rollout starts (e.g. file paths, task metadata) |
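to illustrate what a content-derived id implies, here is a sketch of one way a SHA-256 over prompt + task could be computed. this is not make_example's actual implementation: the helper `content_id` and its canonical-JSON encoding are assumptions made for the example, and the real serialization may differ.

```python
import hashlib
import json

def content_id(prompt_messages, task):
    """Hypothetical stand-in: SHA-256 over a canonical JSON encoding."""
    payload = json.dumps(
        {"prompt_messages": prompt_messages, "task": task},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),     # fixed separators, no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

prompt = [{"role": "user", "content": "what is 2+2?"}]
reordered = [{"content": "what is 2+2?", "role": "user"}]

a = content_id(prompt, {"ground_truth": "4"})
b = content_id(reordered, {"ground_truth": "4"})   # same content, different key order
c = content_id(prompt, {"ground_truth": "5"})      # different task
```

in this sketch `a` and `b` match while `c` differs. if the real id behaves the same way, two rows with identical prompt and task would presumably share an id, so deduplicating rows before training is worth considering.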
init_rollout_args is useful when each example needs a different setup. for instance, a spreadsheet environment might pass the path to the input file:
return make_example(
    prompt_messages=[{"role": "user", "content": "fix the formula in cell B5"}],
    task={"answer_position": "B5", "expected": "42"},
    init_rollout_args={"spreadsheet_path": "/data/input.xlsx"},
)
guidance
- minimum size: the platform requires at least 16 examples. 200+ is recommended for meaningful training, 1000+ for best results.
- validate before launching: run a few examples through dataset_preprocess locally to catch format issues. validate_env does this for you; see testing.
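the size thresholds and format checks above can be sketched as a small pre-flight script over raw rows. this is separate from validate_env and not part of the library: `check_row` and `check_dataset` are hypothetical helpers, and they assume the default prompt / messages columns. with custom column names, adapt the checks to your own schema.

```python
MIN_EXAMPLES = 16    # platform minimum, per the guidance above
RECOMMENDED = 200    # recommended for meaningful training
VALID_ROLES = {"system", "user", "assistant", "tool"}

def check_row(row):
    """Return a list of problems found in one raw row (empty if none)."""
    problems = []
    prompt = row.get("prompt")
    has_prompt = isinstance(prompt, str) and prompt.strip() != ""
    messages = row.get("messages")
    if not has_prompt and not messages:
        problems.append("needs a non-empty 'prompt' or 'messages'")
    for i, msg in enumerate(messages or []):
        if not isinstance(msg, dict) or msg.get("role") not in VALID_ROLES:
            problems.append(f"messages[{i}] has a missing or unknown role")
    return problems

def check_dataset(rows):
    """Collect size and per-row problems as human-readable strings."""
    report = []
    if len(rows) < MIN_EXAMPLES:
        report.append(f"only {len(rows)} examples; at least {MIN_EXAMPLES} required")
    elif len(rows) < RECOMMENDED:
        report.append(f"{len(rows)} examples; {RECOMMENDED}+ recommended")
    for idx, row in enumerate(rows):
        report.extend(f"row {idx}: {p}" for p in check_row(row))
    return report

rows = [
    {"prompt": "what is 2+2?", "ground_truth": "4"},
    {"messages": [{"role": "bot", "content": "hi"}]},   # invalid role
]
for line in check_dataset(rows):
    print(line)
```

an empty report does not guarantee the data will train well. it only rules out the structural mistakes listed here.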