search environment

rag · Mar 19, 2026 · 6 min read

an environment defines the tools your model can use and the reward signals for training. for rag, this means a search tool over your corpus and a reward function that checks if the model retrieved the right information.

how it works

we provide a default environment, SearchEnv, with a default search tool implementation, system prompt, and rewards. within each rollout:

  • the model is given a prompt from the generated qa dataset, formatted to include a system prompt that tells it how to use the search tool to answer the question and cite its sources.
  • the model can call the search tool up to max_search_calls times (default 10) to query your corpus. the tool works with any of our supported search backends.
  • the environment then scores the answer on five components:
| component | what it measures |
| --- | --- |
| answer_correctness | an llm judge compares the answer to the ground truth |
| conciseness | an llm judge rewards brevity (counts only when the answer is correct) |
| citation_recall | did it cite the sources that support the answer? |
| citation_precision | of the sources it cited, how many were relevant? |
| search_efficiency | did it answer correctly without over-searching? |

correctness and conciseness rely on an llm judge, which is why SearchEnv requires a judge_model. each component is weighted and summed; you can retune the weights below.
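as a rough illustration of "weighted and summed", the sketch below combines five component scores using the default weights listed under reward weights. the helper name combine_rewards, the 0-to-1 score range, and the absence of any normalization are assumptions made for illustration; this is not SearchEnv's actual implementation.

```python
# default weights, copied from the reward weights table
DEFAULT_WEIGHTS = {
    "answer_correctness": 1.0,
    "conciseness": 0.5,
    "citation_recall": 0.5,
    "citation_precision": 0.5,
    "search_efficiency": 0.1,
}


def combine_rewards(scores: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """hypothetical helper: weighted sum of component scores (each assumed to lie in [0, 1])."""
    scores = dict(scores)  # copy so the caller's dict is not mutated
    # conciseness counts only when the answer is correct
    if scores.get("answer_correctness", 0.0) <= 0.0:
        scores["conciseness"] = 0.0
    # components missing from `scores` contribute nothing
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)


perfect = combine_rewards({name: 1.0 for name in DEFAULT_WEIGHTS})
wrong_but_short = combine_rewards({"answer_correctness": 0.0, "conciseness": 1.0})
```

under these assumptions a short but wrong answer earns nothing from conciseness, which is the behaviour the table describes.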

quickstart

the only code you write is a small subclass that sets the system prompt for your corpus. then pass your corpus and judge through constructor_args and launch:

import dataclasses

from benchmax.envs.postgres_search.search_env import SearchEnv
from benchmax.platform.client import TrainerClient
from benchmax.platform.training_run import upload_training_run
from benchmax.rag.corpus.postgres.search import PostgresSearch

MAX_SEARCH_CALLS = 10

class MySearchEnv(SearchEnv):
    # the system prompt tells the model how to search, answer, and cite
    system_prompt = SearchEnv.render_system_prompt(
        corpus_description="acme's internal support docs",
        max_search_calls=MAX_SEARCH_CALLS,
    )

search = PostgresSearch(corpus_name="my-docs")  # the corpus to search over

# train_data / eval_data are your generated qa datasets (not defined in this snippet)
uploaded = upload_training_run(
    env_class=MySearchEnv,
    train_dataset=train_data,
    eval_dataset=eval_data,
    run_name="my-search-model",
    api_key="sk_...",
    constructor_args={
        "search": search,
        "judge_base_url": "https://api.openai.com/v1",
        "judge_model": "gpt-5.4-mini",
        "max_search_calls": MAX_SEARCH_CALLS,
    },
)

trainer = TrainerClient(api_key="sk_...")
run_id = trainer.launch_training_run(
    training_run_type="simple",
    **dataclasses.asdict(uploaded),
)

constructor_args is the dict castform passes to SearchEnv.__init__ when the run starts on the trainer; that’s how the environment receives your search client, judge, and settings.

SearchEnv expects your dataset rows to have question, answer, and reference_chunks columns, matching exactly what qa generation produces.
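a minimal sketch of what one row might look like, plus a quick check for the three required columns. only the column names come from the text above; the inner shape of reference_chunks, the sample values, and the missing_columns helper are assumptions for illustration.

```python
REQUIRED_COLUMNS = ("question", "answer", "reference_chunks")


def missing_columns(row: dict) -> list[str]:
    """hypothetical check: return the required columns that a row lacks."""
    return [col for col in REQUIRED_COLUMNS if col not in row]


example_row = {
    "question": "how do i reset my password?",
    "answer": "use the reset link on the login page.",
    # the chunk layout below is an assumption, not the documented schema
    "reference_chunks": [
        {"content": "reset your password from the login page", "metadata": {"file": "auth/reset.md"}},
    ],
}

problems = missing_columns(example_row)
```

running a check like this over a few rows before uploading is a cheap way to catch a renamed or dropped column early.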

basic customization

you configure SearchEnv through constructor_args and the system prompt on your subclass; you never have to rewrite the environment itself.

the search tool

the search tool runs against whatever search client you pass as search. swap the client to change backend or search modes; the prompt and rewards stay the same.

| client | import | modes |
| --- | --- | --- |
| PostgresSearch | benchmax.rag.corpus.postgres.search | lexical |
| TpufSearch | benchmax.rag.corpus.turbopuffer.search | lexical, vector, hybrid |
| PineconeSearch | benchmax.rag.corpus.pinecone.search | vector |
| ChromaSearch | benchmax.rag.corpus.chroma.search | vector, lexical, hybrid |

when a backend supports more than one mode, the model chooses a mode per query; with "auto" (the default), SearchEnv picks the first mode the backend supports, in the order hybrid, lexical, vector. see corpus backends for setup.
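the "auto" preference order can be sketched as a small lookup over a client's available_modes. resolve_mode is a hypothetical helper written for this example, and raising ValueError on an unsupported mode is an assumption; SearchEnv's own handling may differ.

```python
# preference order used when the requested mode is "auto"
AUTO_PREFERENCE = ("hybrid", "lexical", "vector")


def resolve_mode(requested: str, available_modes: list[str]) -> str:
    """hypothetical helper: map a requested mode to one the backend supports."""
    if requested == "auto":
        for mode in AUTO_PREFERENCE:
            if mode in available_modes:
                return mode
        raise ValueError("backend reports no supported search modes")
    if requested not in available_modes:
        raise ValueError(f"mode {requested!r} not supported; choose from {available_modes}")
    return requested


chroma_mode = resolve_mode("auto", ["vector", "lexical", "hybrid"])
pinecone_mode = resolve_mode("auto", ["vector"])
```

with the modes from the table above, a ChromaSearch-style backend resolves to hybrid and a PineconeSearch-style backend falls through to vector.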

reward weights

every component has a weight you can raise, lower, or zero out. set any of these in constructor_args:

| arg | default | component |
| --- | --- | --- |
| w_correctness | 1.0 | answer correctness |
| w_conciseness | 0.5 | conciseness |
| w_citation_recall | 0.5 | citation recall |
| w_citation_precision | 0.5 | citation precision |
| w_search_efficiency | 0.1 | search efficiency |

set a weight to 0 to drop that component. for example, if citations don’t matter for your task, set both citation weights to 0.
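for instance, a constructor_args dict that drops both citation components might look like the sketch below. the judge settings are copied from the quickstart, the search client is left out for brevity, and the raised w_search_efficiency value is an arbitrary illustration rather than a recommendation.

```python
# judge settings from the quickstart; add your "search" client before launching
base_args = {
    "judge_base_url": "https://api.openai.com/v1",
    "judge_model": "gpt-5.4-mini",
}

# zero out both citation weights and lean a little harder on search efficiency
constructor_args = {
    **base_args,
    "w_citation_recall": 0.0,
    "w_citation_precision": 0.0,
    "w_search_efficiency": 0.3,  # illustrative value
}
```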

search budget

max_search_calls (default 10) caps how many times the model may call search; a rollout that exceeds it scores 0. set it in both render_system_prompt (so the prompt states the right budget) and constructor_args (so the environment enforces it), as shown in the quickstart.
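the budget rule amounts to a single guard on the final score. apply_search_budget is a hypothetical helper used only to make the rule concrete; how SearchEnv counts calls internally is not shown in this page.

```python
def apply_search_budget(reward: float, search_calls: int, max_search_calls: int = 10) -> float:
    """hypothetical helper: a rollout that exceeds its search budget scores 0."""
    return 0.0 if search_calls > max_search_calls else reward


within_budget = apply_search_budget(1.8, search_calls=10)  # exactly at the cap is allowed
over_budget = apply_search_budget(1.8, search_calls=11)
```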

the system prompt

the system prompt is what makes the rest work: it instructs the model to reason in <think>, call search, and put its final answer in <answer>...</answer> with [Source: <id>] citations, both of which the reward parses. render_system_prompt fills your corpus_description and max_search_calls into a default template.

to change the wording, override SYSTEM_PROMPT_TEMPLATE on your subclass. keep the {corpus_description} and {max_search_calls} placeholders, and the <answer> / [Source: ...] conventions the reward depends on:

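class MySearchEnv(SearchEnv):
    SYSTEM_PROMPT_TEMPLATE = """\
Answer questions about {corpus_description} using the search tool.
You may search up to {max_search_calls} times.
Put your final answer in <answer>...</answer> and cite sources as [Source: <id>].
"""

# render on the subclass, after it is defined, so the overridden template is the one used
# (assumes render_system_prompt reads SYSTEM_PROMPT_TEMPLATE from the class it is called on)
MySearchEnv.system_prompt = MySearchEnv.render_system_prompt(
    corpus_description="acme's support docs",
    max_search_calls=10,
)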

advanced customization

custom search backend

the built-in clients implement a small SearchClient protocol. implement it yourself to search any store; you pass your client as search with no subclassing of SearchEnv needed.

# the protocol as exposed by benchmax.rag.corpus.search_client, shown for reference
from typing import Any, Protocol

class SearchClient(Protocol):
    def search(self, query: str, mode: str = "auto", top_k: int = 10) -> list[dict[str, Any]]: ...
    def embed(self, text: str) -> list[float] | None: ...

    @property
    def available_modes(self) -> list[str]: ...

    def get_params(self) -> dict[str, Any]: ...

  • search() returns dicts with content, source, metadata, and score
  • available_modes reports which of lexical / vector / hybrid the backend supports
  • get_params() returns serializable connection parameters

the client must be pickle-safe: store connection parameters and reconstruct sdk clients lazily after unpickling, since it’s shipped to remote workers.

from typing import Any

from benchmax.rag.corpus.search_client import SearchClient  # optional: only needed for type annotations

class MySearch:
    def __init__(self, endpoint: str, api_key: str):
        self._endpoint = endpoint
        self._api_key = api_key

    def search(self, query: str, mode: str = "auto", top_k: int = 10) -> list[dict[str, Any]]:
        # return dicts with content, source, metadata, score
        ...

    def embed(self, text: str) -> list[float] | None:
        return None  # optional

    @property
    def available_modes(self) -> list[str]:
        return ["lexical"]

    def get_params(self) -> dict[str, Any]:
        return {"endpoint": self._endpoint, "api_key": self._api_key}

# pass it as `search` — no SearchEnv subclass needed for the backend
constructor_args = {
    "search": MySearch("https://...", "key"),
    "judge_base_url": "https://api.openai.com/v1",
    "judge_model": "gpt-5.4-mini",
}
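the pickle-safety requirement can be met by keeping only connection parameters in the pickled state and building the sdk client on first use. the sketch below uses a FakeSdkClient stand-in (a hypothetical class holding a lock, so it cannot be pickled, much like a live connection) and a placeholder endpoint; swap in your real sdk client.

```python
import pickle
import threading
from typing import Any


class FakeSdkClient:
    """stand-in for a real sdk client; the lock makes it unpicklable."""

    def __init__(self, endpoint: str, api_key: str):
        self.endpoint = endpoint
        self._lock = threading.Lock()


class LazySearch:
    def __init__(self, endpoint: str, api_key: str):
        self._endpoint = endpoint
        self._api_key = api_key
        self._client = None  # built on first use

    @property
    def client(self) -> FakeSdkClient:
        if self._client is None:
            self._client = FakeSdkClient(self._endpoint, self._api_key)
        return self._client

    def __getstate__(self) -> dict[str, Any]:
        # ship only the connection parameters, never the live client
        state = self.__dict__.copy()
        state["_client"] = None
        return state


search = LazySearch("https://search.example.com", "key")
_ = search.client  # the client now exists locally
restored = pickle.loads(pickle.dumps(search))  # what a remote worker would receive
```

the restored copy starts without a client and rebuilds one from the stored parameters the first time client is accessed.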

corpus-specific citations

citations are matched by document id. by default SearchEnv reads the file (or file_path) metadata key from your reference chunks and from each [Source: <id>] tag. if your sources use a different id, override _extract_reference_ids or _canonicalize_id on your subclass.
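as a sketch of id-based matching, the helpers below canonicalize ids and then compute recall and precision as set overlaps. both functions are hypothetical and standalone: the signatures of _extract_reference_ids and _canonicalize_id are not shown in this page, so nothing here overrides them, and the normalization rule (lowercase, drop the extension) is only one example of what a corpus might need.

```python
from pathlib import PurePosixPath


def canonicalize_id(raw: str) -> str:
    """hypothetical normalizer: trim whitespace, lowercase, and drop the file extension."""
    return str(PurePosixPath(raw.strip().lower()).with_suffix(""))


def citation_scores(cited: list[str], reference: list[str]) -> tuple[float, float]:
    """return (recall, precision) over canonicalized document ids."""
    cited_ids = {canonicalize_id(c) for c in cited}
    ref_ids = {canonicalize_id(r) for r in reference}
    overlap = cited_ids & ref_ids
    # recall: share of supporting sources that were cited
    recall = len(overlap) / len(ref_ids) if ref_ids else 0.0
    # precision: share of cited sources that were relevant
    precision = len(overlap) / len(cited_ids) if cited_ids else 0.0
    return recall, precision


recall, precision = citation_scores(
    cited=["Auth/Reset.md", "billing/plans.md"],
    reference=["auth/reset.md"],
)
```

here the model cited the one supporting document plus an irrelevant one, so recall is full while precision is halved.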

fully custom rewards

to change scoring beyond the weights, override compute_reward on your subclass, or build an environment from scratch. see writing your own environment.
