an environment defines the tools your model can use and the reward signals for training. for rag, this means a search tool over your corpus and a reward function that checks if the model retrieved the right information.
how it works
we provide a default environment, SearchEnv, that comes with a default search tool implementation, system prompt, and rewards. within each rollout,
- the model is given a prompt from the generated qa dataset, formatted with a system prompt that instructs it on how to use the search tool to answer the question and cite its sources.
- the model can use the search tool up to a certain number of times to query your corpus. it works across any of our supported search backends.
- then, the environment scores the answer on five components:
| component | what it measures |
|---|---|
| answer_correctness | an llm judge compares the answer to the ground truth |
| conciseness | an llm judge rewards brevity (counts only when the answer is correct) |
| citation_recall | did it cite the sources that support the answer? |
| citation_precision | of the sources it cited, how many were relevant? |
| search_efficiency | did it answer correctly without over-searching? |
correctness and conciseness rely on an llm judge, which is why SearchEnv requires a judge_model. each component is weighted and summed; you can retune the weights below.
quickstart
the only code you write is a small subclass that sets the system prompt for your corpus. then pass your corpus and judge through constructor_args and launch:
import dataclasses

from benchmax.rag.corpus.postgres.search import PostgresSearch
from benchmax.envs.postgres_search.search_env import SearchEnv
from benchmax.platform.training_run import upload_training_run
from benchmax.platform.client import TrainerClient

MAX_SEARCH_CALLS = 10

class MySearchEnv(SearchEnv):
    # the system prompt tells the model how to search, answer, and cite
    system_prompt = SearchEnv.render_system_prompt(
        corpus_description="acme's internal support docs",
        max_search_calls=MAX_SEARCH_CALLS,
    )

search = PostgresSearch(corpus_name="my-docs")  # the corpus to search over

# train_data and eval_data are your question/answer/reference_chunks datasets (see qa generation)
uploaded = upload_training_run(
    env_class=MySearchEnv,
    train_dataset=train_data,
    eval_dataset=eval_data,
    run_name="my-search-model",
    api_key="sk_...",
    constructor_args={
        "search": search,
        "judge_base_url": "https://api.openai.com/v1",
        "judge_model": "gpt-5.4-mini",
        "max_search_calls": MAX_SEARCH_CALLS,
    },
)

trainer = TrainerClient(api_key="sk_...")
run_id = trainer.launch_training_run(
    training_run_type="simple",
    **dataclasses.asdict(uploaded),
)
constructor_args is the dict castform passes to SearchEnv.__init__ when the run starts on the trainer; that’s how the environment receives your search client, judge, and settings.
SearchEnv expects your dataset rows to have question, answer, and reference_chunks columns, matching exactly what qa generation produces.
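if you build or edit rows by hand, a quick pre-flight check can catch missing columns before you upload. this is a minimal sketch: `missing_columns` and `example_row` are hypothetical names, and the shape shown for reference_chunks is an assumption, not the documented schema.

```python
REQUIRED_COLUMNS = ("question", "answer", "reference_chunks")

def missing_columns(row):
    """Return the required column names absent from one dataset row."""
    return [name for name in REQUIRED_COLUMNS if name not in row]

# hypothetical row; the exact shape of reference_chunks is an assumption
example_row = {
    "question": "how long do refunds take?",
    "answer": "5 business days",
    "reference_chunks": [
        {"content": "refunds take 5 business days", "metadata": {"file": "refunds.md"}},
    ],
}
```

rows produced by qa generation should pass this check unchanged.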
basic customization
you configure SearchEnv through constructor_args and the system prompt on your subclass, without rewriting the environment.
the search tool
the search tool runs against whatever search client you pass as search. swap the client to change backend or search modes; the prompt and rewards stay the same.
| client | import | modes |
|---|---|---|
| PostgresSearch | benchmax.rag.corpus.postgres.search | lexical |
| TpufSearch | benchmax.rag.corpus.turbopuffer.search | lexical, vector, hybrid |
| PineconeSearch | benchmax.rag.corpus.pinecone.search | vector |
| ChromaSearch | benchmax.rag.corpus.chroma.search | vector, lexical, hybrid |
when a backend supports more than one mode, the model chooses per query; with "auto" (the default) SearchEnv picks hybrid > lexical > vector. see corpus backends for setup.
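the "auto" preference order can be pictured with a small sketch. `resolve_mode` and `MODE_PREFERENCE` are hypothetical names, and raising on an unsupported explicit mode is an assumption; SearchEnv may handle that case differently.

```python
MODE_PREFERENCE = ("hybrid", "lexical", "vector")

def resolve_mode(requested, available_modes):
    """Pick the concrete search mode for one query (hypothetical helper)."""
    if requested != "auto":
        # assumption: an explicit mode the backend lacks is treated as an error
        if requested not in available_modes:
            raise ValueError(f"mode {requested!r} not supported; choose from {available_modes}")
        return requested
    # "auto": take the first preferred mode the backend reports
    for mode in MODE_PREFERENCE:
        if mode in available_modes:
            return mode
    raise ValueError("backend reports no supported search modes")
```

under this sketch a turbopuffer or chroma client resolves "auto" to hybrid, postgres to lexical, and pinecone to vector.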
reward weights
every component has a weight you can raise, lower, or zero out. set any of these in constructor_args:
| arg | default | component |
|---|---|---|
| w_correctness | 1.0 | answer correctness |
| w_conciseness | 0.5 | conciseness |
| w_citation_recall | 0.5 | citation recall |
| w_citation_precision | 0.5 | citation precision |
| w_search_efficiency | 0.1 | search efficiency |
set a weight to 0 to drop that component. for example, if citations don’t matter for your task, set both citation weights to 0.
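to make the weighting concrete, here is a minimal sketch of how the five component scores could combine into one reward. `combine_reward` is a hypothetical helper keyed by component name (so w_correctness corresponds to answer_correctness, and so on). it assumes every component score lies in [0, 1] and that conciseness is zeroed when the answer is not correct; the real SearchEnv may differ, for example by normalizing the sum.

```python
DEFAULT_WEIGHTS = {
    "answer_correctness": 1.0,
    "conciseness": 0.5,
    "citation_recall": 0.5,
    "citation_precision": 0.5,
    "search_efficiency": 0.1,
}

def combine_reward(scores, weights=None):
    """Weighted sum of component scores (illustrative, not the benchmax API)."""
    weights = {**DEFAULT_WEIGHTS, **(weights or {})}
    scores = dict(scores)  # copy so the caller's dict is not modified
    # conciseness counts only when the answer is correct
    if scores.get("answer_correctness", 0.0) <= 0.0:
        scores["conciseness"] = 0.0
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)
```

with the defaults, a rollout that scores 1.0 on every component totals 2.6; zeroing both citation weights lowers that ceiling to 1.6.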
search budget
max_search_calls (default 10) caps how many times the model may call search; a rollout that exceeds it scores 0. set it in both render_system_prompt (so the prompt states the right budget) and constructor_args (so the environment enforces it), as shown in the quickstart.
the system prompt
the system prompt is what makes the rest work: it instructs the model to reason in <think>, call search, and put its final answer in <answer>...</answer> with [Source: <id>] citations, both of which the reward parses. render_system_prompt fills your corpus_description and max_search_calls into a default template.
to change the wording, override SYSTEM_PROMPT_TEMPLATE on your subclass. keep the {corpus_description} and {max_search_calls} placeholders, and the <answer> / [Source: ...] conventions the reward depends on:
class MySearchEnv(SearchEnv):
    SYSTEM_PROMPT_TEMPLATE = """\
Answer questions about {corpus_description} using the search tool.
You may search up to {max_search_calls} times.
Put your final answer in <answer>...</answer> and cite sources as [Source: <id>].
"""
    system_prompt = SearchEnv.render_system_prompt(
        corpus_description="acme's support docs",
        max_search_calls=10,
    )
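to see why the <answer> and [Source: ...] conventions must survive any rewording, here is a rough sketch of the kind of parsing a reward could do on a completion. the regexes and `parse_completion` are assumptions for illustration, not the patterns SearchEnv actually uses; the sketch also assumes citations sit inside the answer block and that the last <answer> block wins.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
SOURCE_RE = re.compile(r"\[Source:\s*([^\]]+?)\s*\]")

def parse_completion(text):
    """Return (answer_text, cited_ids); answer_text is None if no <answer> block exists."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None, []
    answer = matches[-1].strip()  # assumption: use the last <answer> block
    cited = []
    for source_id in SOURCE_RE.findall(answer):
        if source_id not in cited:  # keep first-seen order, drop repeats
            cited.append(source_id)
    return answer, cited
```

a template that drops either convention would leave a parser like this with nothing to score, regardless of how good the answer is.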
advanced customization
custom search backend
the built-in clients implement a small SearchClient protocol. implement it yourself to search any store; you pass your client as search with no subclassing of SearchEnv needed.
# the protocol, as defined in benchmax.rag.corpus.search_client
from typing import Any, Protocol

class SearchClient(Protocol):
    def search(self, query: str, mode: str = "auto", top_k: int = 10) -> list[dict[str, Any]]: ...
    def embed(self, text: str) -> list[float] | None: ...
    @property
    def available_modes(self) -> list[str]: ...
    def get_params(self) -> dict[str, Any]: ...
- search() returns dicts with content, source, metadata, and score
- available_modes reports which of lexical / vector / hybrid the backend supports
- get_params() returns serializable connection parameters
the client must be pickle-safe: store connection parameters and reconstruct sdk clients lazily after unpickling, since it’s shipped to remote workers.
from typing import Any

from benchmax.rag.corpus.search_client import SearchClient

class MySearch:
    def __init__(self, endpoint: str, api_key: str):
        # store plain connection parameters only, so the object stays pickle-safe
        self._endpoint = endpoint
        self._api_key = api_key

    def search(self, query: str, mode: str = "auto", top_k: int = 10) -> list[dict[str, Any]]:
        # return dicts with content, source, metadata, score
        ...

    def embed(self, text: str) -> list[float] | None:
        return None  # optional

    @property
    def available_modes(self) -> list[str]:
        return ["lexical"]

    def get_params(self) -> dict[str, Any]:
        return {"endpoint": self._endpoint, "api_key": self._api_key}

# pass it as `search`; no SearchEnv subclass needed for the backend
constructor_args = {
    "search": MySearch("https://...", "key"),
    "judge_base_url": "https://api.openai.com/v1",
    "judge_model": "gpt-5.4-mini",
}
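the pickle-safety requirement can be met with a lazy-construction pattern like the sketch below. `LazySearch`, `_get_client`, and `roundtrip` are hypothetical names, and the dict standing in for an sdk client is a placeholder; only the `__getstate__` hook and the `pickle` calls are standard python.

```python
import pickle

class LazySearch:
    """Hypothetical client that keeps only plain parameters in its pickled state."""

    def __init__(self, endpoint, api_key):
        self._endpoint = endpoint
        self._api_key = api_key
        self._client = None  # live sdk handle, created on first use

    def _get_client(self):
        if self._client is None:
            # placeholder for constructing a real sdk client
            self._client = {"endpoint": self._endpoint, "connected": True}
        return self._client

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # never ship the live handle
        return state

    def get_params(self):
        return {"endpoint": self._endpoint, "api_key": self._api_key}

def roundtrip(obj):
    """Simulate shipping the client to a remote worker."""
    return pickle.loads(pickle.dumps(obj))
```

after a round trip the copy carries the same connection parameters but no live handle, and rebuilds one the first time `_get_client()` is called.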
corpus-specific citations
citations are matched by document id. by default SearchEnv reads the file (or file_path) metadata key from your reference chunks and from each [Source: <id>] tag. if your sources use a different id, override _extract_reference_ids or _canonicalize_id on your subclass.
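as a rough picture of what id matching feeds into, the sketch below computes citation precision and recall over canonicalized ids. `canonicalize_id` and `citation_scores` are standalone illustrations, not the signatures of `_canonicalize_id` or `_extract_reference_ids`, and the normalization rule is an assumption you would replace with your own.

```python
def canonicalize_id(raw_id):
    """Normalize an id so equivalent spellings compare equal (illustrative rule).

    Strips surrounding whitespace, lowercases, and drops leading '.' and '/' characters.
    """
    return raw_id.strip().lower().lstrip("./")

def citation_scores(cited_ids, reference_ids):
    """Return (precision, recall) over sets of canonical document ids."""
    cited = {canonicalize_id(i) for i in cited_ids}
    reference = {canonicalize_id(i) for i in reference_ids}
    hits = cited & reference
    precision = len(hits) / len(cited) if cited else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall
```

if cited ids and reference ids are spelled differently (for example a url versus a file path), both scores fall to zero even when the model cites the right document, which is the situation the override hooks are meant to fix.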
fully custom rewards
to change scoring beyond the weights, override compute_reward on your subclass, or build an environment from scratch. see writing your own environment.
next steps
- see qa generation to generate the question/answer/reference_chunks dataset over your corpus
- see launching a training run to launch a training job using your environment and dataset
- see corpus backends for setting up turbopuffer, pinecone, or chroma