qa generation | castform docs

training a rag agent with rl requires a dataset of question-answer pairs grounded in your corpus. we provide an automated pipeline that generates this dataset: point it at an indexed corpus and it produces questions, their answers, and the chunks that support each answer.

the output is a train/eval split in jsonl, ready to launch a training run.

quickstart

to generate qa pairs with the default pipeline settings, provide a corpus and a target sample count. expect roughly 5-10 questions per minute.

from benchmax.rag.qa_generation.pipeline_config import (
    PipelineConfig, PlatformConfig, CorpusConfig, TargetsConfig,
)
from benchmax.rag.qa_generation.pipeline import Pipeline

# default settings: only a corpus and a target sample count are required
cfg = PipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", corpus_id="..."),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

# run the pipeline; the result holds both splits
pipeline = Pipeline(cfg)
result = pipeline.run()

train_data = result["train_dataset"]
eval_data = result["eval_dataset"]

if you haven’t chunked anything yet, you can pass docs_path instead of corpus_id to chunk and index documents in one step. the run is resumable: rerun with the same output directory and it picks up from the last checkpoint.
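once a run finishes, the jsonl files it writes can be sanity-checked before training. the sketch below is a minimal reader, assuming the default output directory and filenames listed under output; the record fields are not specified on this page, so it only counts records and lists the top-level keys it finds. read_jsonl and summarize are illustrative helpers, not part of the library.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Return the record count and the sorted union of top-level keys."""
    keys = set()
    for record in records:
        keys.update(record)
    return len(records), sorted(keys)


if __name__ == "__main__":
    # Assumed locations: the documented defaults for output.dir and the filenames.
    for name in ("train.jsonl", "eval.jsonl"):
        path = Path("outputs/castform") / name
        if path.exists():
            print(name, summarize(read_jsonl(path)))
```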

how it works

the pipeline runs four stages:

profile: it first samples your corpus to learn the domain, summarizing the content and extracting the entities and jargon that recur in it. this grounds the generated questions in your actual terminology (it uses your corpus description and example queries here too, if you provide them).
generate: it samples chunks from your corpus and uses an llm to write questions answerable only from those chunks, each paired with its answer and the supporting chunks. you control the mix of question types, from simple lookups to multi-hop reasoning across documents, to match what you want the model to answer.
filter: every pair is quality-checked. pairs whose answer isn’t supported by the chunks, or that a plain keyword search already answers (so there’s nothing for the model to learn), are dropped or regenerated with feedback.
transform: questions are rewritten into the styles real users type (keyword, natural language, expert shorthand), so the model trains on realistic inputs rather than clean prose.

basic customization

corpus context

telling the pipeline about your domain can improve question quality. a description and a few example_queries are used during profiling to summarize your corpus and understand your terminology.

field	default	description
`description`	`""`	plain-text description of your corpus
`example_queries`	`[]`	example search queries users would ask

question mix

targets.primary_type_distribution sets the fraction of questions the pipeline generates for each type. weight it toward the kinds of questions your model needs to answer. the weights should sum to 1.

type	description
`lookup`	single-chunk fact lookup
`co_located_multi_hop`	multi-hop within the same document
`cross_document_multi_hop`	multi-hop across different documents
`sequential_reasoning`	step-by-step reasoning chains
`synthesis`	summarization across multiple sources (disabled by default)

lookup is the cheapest (one llm call); multi-hop types require chunk linking first, so they cost more.
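to see what a given mix means in absolute numbers, the weights can be turned into per-type counts. the sketch below is one way to do that, using largest-remainder rounding so the counts always add up to total_samples. allocate_counts is a hypothetical helper written for illustration; the pipeline's own rounding is not documented here and may differ.

```python
import math


def allocate_counts(total_samples, weights):
    """Split total_samples across question types in proportion to weights.

    Uses largest-remainder rounding so the counts always sum to total_samples.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    # Normalize so small rounding in the weights (e.g. 0.333) does not matter.
    exact = {name: total_samples * w / total_weight for name, w in weights.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = total_samples - sum(counts.values())
    # Hand the remaining samples to the types with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


mix = {
    "lookup": 0.4,
    "co_located_multi_hop": 0.2,
    "cross_document_multi_hop": 0.3,
    "sequential_reasoning": 0.1,
}
print(allocate_counts(200, mix))
```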

question styles

the transform stage rewrites each question into the styles real users type, controlled by transformation.style_distribution. adjust the weights to match how your users actually search.

style	default weight	example
`keyword`	33%	`k8s pod memory limits`
`natural`	34%	`how do I set memory limits on kubernetes pods?`
`expert`	33%	`configure resource requests and limits in pod spec`
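a weighted style distribution can be read as "each question draws its style at random with these probabilities". the sketch below illustrates that reading with the default weights from the table. assign_styles is a hypothetical helper, and whether the pipeline samples independently per question or allocates exact counts is an assumption.

```python
import random

# Default weights from the table above, written as fractions.
STYLE_WEIGHTS = {"keyword": 0.33, "natural": 0.34, "expert": 0.33}


def assign_styles(num_questions, weights=STYLE_WEIGHTS, seed=42):
    """Draw one style per question, independently, according to the weights."""
    rng = random.Random(seed)
    styles = list(weights)
    return rng.choices(styles, weights=[weights[s] for s in styles], k=num_questions)


styles = assign_styles(200)
print({style: styles.count(style) for style in STYLE_WEIGHTS})
```

with independent draws the realized counts only approximate the weights, which is usually fine at a few hundred samples.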

output

control the train/eval split and where files land with split and output.

field	default	description
`split.train_ratio`	`0.8`	fraction of data for training
`split.stratify_by`	`["qa_type", "style"]`	balanced splits across these columns
`output.dir`	`"outputs/castform"`	output directory
`output.train_jsonl`	`"train.jsonl"`	training data filename
`output.eval_jsonl`	`"eval.jsonl"`	eval data filename
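stratifying by qa_type and style means each combination of the two keeps roughly the same train/eval ratio. the sketch below shows the idea on plain dicts. stratified_split is an illustrative helper, and the record layout (a qa_type and a style key per record) is an assumption, not the pipeline's documented schema.

```python
import random
from collections import defaultdict


def stratified_split(records, train_ratio=0.8, stratify_by=("qa_type", "style"), seed=42):
    """Split records so each stratum keeps roughly the same train/eval ratio."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(column) for column in stratify_by)
        groups[key].append(record)

    rng = random.Random(seed)
    train, evaluation = [], []
    # A fixed group order keeps the split reproducible for a given seed.
    for key in sorted(groups, key=repr):
        members = groups[key][:]
        rng.shuffle(members)
        cut = round(len(members) * train_ratio)
        train.extend(members[:cut])
        evaluation.extend(members[cut:])
    return train, evaluation
```

very small strata can end up entirely in one split, so rare question types may need a larger total_samples to appear in eval.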

putting it together

a customized run combines the settings above into one config:

from benchmax.rag.qa_generation.pipeline_config import (
    PipelineConfig, PlatformConfig, CorpusConfig, CorpusContextConfig,
    TargetsConfig, SplitConfig, OutputConfig,
)
from benchmax.rag.qa_generation.pipeline import Pipeline

cfg = PipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./my-docs"),
    # corpus context: ground questions in your domain
    corpus_context=CorpusContextConfig(
        description="internal engineering docs for acme corp",
        example_queries=["how do I configure the auth middleware?"],
    ),

    # question mix: weight toward what your model needs to answer
    targets=TargetsConfig(
        total_samples=200,
        primary_type_distribution={
            "lookup": 0.4,
            "co_located_multi_hop": 0.2,
            "cross_document_multi_hop": 0.3,
            "sequential_reasoning": 0.1,
        },
    ),

    # output: train/eval split and where files land
    split=SplitConfig(train_ratio=0.8),
    output=OutputConfig(dir="outputs/my-docs"),
)
cfg.resolve_api_keys()

result = Pipeline(cfg).run()
train_data, eval_data = result["train_dataset"], result["eval_dataset"]

see launching a training run to start a training job with the result.

advanced customization

chunk linkers

linkers find related chunks for multi-hop questions.

structural (default) uses file-structure neighbors and BM25 enrichment. no LLM calls. good for well-structured docs.

field	default	description
`bm25_enrichment_queries`	`3`	BM25 queries per chunk
`max_related_refs`	`3`	max related chunks to link
`search_mode`	`"auto"`	`auto` / `lexical` / `hybrid` / `vector`

llm_guided has the LLM generate search queries to find semantically related chunks. better for unstructured corpora, more expensive.

adaptive starts structural and falls back to LLM-guided when enrichment signals are weak.

generators

llm_direct (default) makes a direct LLM call per QA pair. fast.

field	default	description
`model`	`"gpt-5.4"`	generation model
`max_concurrent`	`8`	parallel generation requests
`batch_enabled`	`true`	enable batch processing

llm_env generates QA through an RL environment rollout where the model uses tools to search interactively. more expensive, but produces higher-quality multi-hop questions.

tips:

chunk size: chunks of 1024-2048 chars work well. smaller chunks give too little context; larger ones produce noisy questions.
spread seeds: more chunks with fewer questions each beats fewer chunks with many questions.
start with llm_direct. only switch to llm_env if you need higher quality multi-hop.

filters

filters run in sequence, cheapest to most expensive. each marks items as passed, rejected, or needs_refinement; items that need refinement get regenerated with feedback.

1. deterministic guards catch format and length issues: empty answers, single-word questions, missing references.

field	default	description
`min_question_chars`	`12`	minimum question length
`min_answer_chars`	`24`	minimum answer length
`min_reference_chunks`	`1`	minimum reference chunks

2. grounding_llm uses an LLM judge to check whether the answer is actually supported by the reference chunks. the most important filter.

3. retrieval_too_easy_llm checks whether naive BM25 search can already find the answer. if it can, the question won’t teach the model anything during RL, so the pair is marked needs_refinement rather than rejected.

field	default	description
`overlap_threshold`	`0.5`	chunk overlap threshold for flagging
`too_easy_confidence_threshold`	`0.75`	confidence above this = too easy

4. env_rollout runs the QA pair through the actual RL environment. most expensive, most accurate. optional.
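the cheapest-first ordering above can be sketched as a short chain that stops at the first failure. this is a rough illustration under stated assumptions, not the library's implementation: the item fields (question, answer, reference_chunks) are hypothetical, guard failures are assumed to reject, the too-easy check is assumed to flag only when both thresholds are crossed, and the search results and judge confidence are passed in as plain values in place of real BM25 and LLM calls. the grounding and env_rollout filters are left out because they need a model.

```python
def check_guards(item, min_question_chars=12, min_answer_chars=24, min_reference_chunks=1):
    """Cheap format checks. Returns a failure reason, or None if the item is fine."""
    question = (item.get("question") or "").strip()
    answer = (item.get("answer") or "").strip()
    references = item.get("reference_chunks") or []
    if len(question) < min_question_chars or len(question.split()) < 2:
        return "question too short"
    if len(answer) < min_answer_chars:
        return "answer too short"
    if len(references) < min_reference_chunks:
        return "missing references"
    return None


def check_too_easy(item, retrieved_ids, judge_confidence,
                   overlap_threshold=0.5, too_easy_confidence_threshold=0.75):
    """Flag items whose reference chunks a naive search already surfaces.

    Assumption: an item is flagged only when both the chunk overlap and the
    judge confidence cross their thresholds.
    """
    reference = set(item.get("reference_chunks") or [])
    if not reference:
        return None
    overlap = len(reference & set(retrieved_ids)) / len(reference)
    if overlap >= overlap_threshold and judge_confidence > too_easy_confidence_threshold:
        return "a plain keyword search already finds the supporting chunks"
    return None


def run_filters(item, retrieved_ids, judge_confidence):
    """Run the cheap guard first and stop at the first failure."""
    reason = check_guards(item)
    if reason is not None:
        return "rejected", reason
    reason = check_too_easy(item, retrieved_ids, judge_confidence)
    if reason is not None:
        return "needs_refinement", reason
    return "passed", None
```

returning the reason alongside the status is what lets a regeneration step use the failure as feedback.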

refinement loop. failed items get regenerated with the failure reason as feedback:

filters run on all QA pairs
needs_refinement items get regenerated with feedback
regenerated items go through filters again
repeat until all pass or budget runs out

field	default	description
`max_refinements_per_item`	`2`	max fix attempts per pair
`max_same_seed_attempts_before_reanchor`	`3`	failures before switching to a different seed chunk
`max_rounds`	`4`	max filter-refine cycles
`max_total_regenerations`	`total_samples * 2`	global budget cap

if a seed chunk keeps producing bad questions, reanchoring to a different chunk is more productive than retrying.
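the interaction between the per-item, per-round, and global budgets can be sketched as a small loop. this is an illustration only: refine, run_filters, and regenerate are hypothetical names, the global cap falls back to twice the number of input items as a stand-in for total_samples * 2, and reanchoring to a new seed chunk is left out for brevity.

```python
def refine(items, run_filters, regenerate, max_refinements_per_item=2,
           max_rounds=4, max_total_regenerations=None):
    """Filter items, regenerate the ones that need refinement, and repeat.

    run_filters(item) returns (status, feedback) with status one of
    "passed", "rejected", "needs_refinement".
    regenerate(item, feedback) returns a new item.
    """
    if max_total_regenerations is None:
        max_total_regenerations = 2 * len(items)
    passed = []
    pending = [(item, 0) for item in items]  # (item, refinements used so far)
    regenerations = 0
    for round_index in range(max_rounds):
        if not pending:
            break
        next_pending = []
        for item, attempts in pending:
            status, feedback = run_filters(item)
            if status == "passed":
                passed.append(item)
            elif (status == "needs_refinement"
                  and attempts < max_refinements_per_item
                  and regenerations < max_total_regenerations
                  and round_index < max_rounds - 1):  # no round left to re-check otherwise
                next_pending.append((regenerate(item, feedback), attempts + 1))
                regenerations += 1
            # Rejected items and items that are out of budget are dropped.
        pending = next_pending
    return passed, regenerations
```

whichever budget runs out first ends the loop for an item, so raising max_rounds alone has no effect once max_refinements_per_item is exhausted.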

checkpointing. results are saved after each filter round. resume: true (the default) picks up from the last completed round on restart.
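resuming from the last completed round comes down to writing state after each round and reading it back on startup. the sketch below shows one way to do this with a single json file. the checkpoint.json filename and its layout are assumptions made for illustration; the pipeline's actual checkpoint format is not described here.

```python
import json
from pathlib import Path


def save_checkpoint(output_dir, round_index, items):
    """Write the state after a completed round (write to a temp file, then rename)."""
    directory = Path(output_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "checkpoint.json"
    temp = directory / "checkpoint.json.tmp"
    temp.write_text(json.dumps({"round": round_index, "items": items}), encoding="utf-8")
    # Renaming over the target avoids leaving a half-written checkpoint behind.
    temp.replace(target)


def load_checkpoint(output_dir):
    """Return (next_round, items), or (0, None) when there is nothing to resume."""
    target = Path(output_dir) / "checkpoint.json"
    if not target.exists():
        return 0, None
    state = json.loads(target.read_text(encoding="utf-8"))
    return state["round"] + 1, state["items"]
```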

transformation

beyond the question styles above, the transform stage can add realistic noise and validates that restyling didn’t change a question’s meaning.

noise levels simulate how users actually type:

level	behavior
`none`	no modification
`light`	minor typos, abbreviations, casual phrasing
`moderate`	dropped words, shorthand, spelling errors

start with light.

validation. an LLM checks that the transformed question still maps to the same answer. if the restyling changed the meaning, the original is kept.

email normalization. email_normalization: true strips names, dates, and email headers that would leak context not available in a real search query.

full config example

every option, with its default. you only need the handful shown in the quickstart; this is the exhaustive reference.

random_seed: 42
verbose: true
resume: true

platform:
    api_key: 'sk_...'
    base_url: 'https://app.castform.com'

corpus:
    docs_path: './my-docs'
    corpus_name: 'my-docs'
    min_chunk_chars: 400

corpus_context:
    enabled: true
    description: 'internal engineering documentation for acme corp'
    example_queries:
        - 'how do I configure the auth middleware?'
        - "what's the retry policy for failed jobs?"
    num_top_level_samples: 4
    num_random_samples: 4
    generate_entity_patterns: true

targets:
    total_samples: 200
    primary_type_distribution:
        lookup: 0.333
        co_located_multi_hop: 0.200
        cross_document_multi_hop: 0.333
        sequential_reasoning: 0.133
        synthesis: 0.0

linker:
    type: 'structural'
    structural:
        bm25_enrichment_queries: 3
        bm25_enrichment_top_k: 5
        max_related_refs: 3
        search_mode: 'auto'

generation:
    mode: 'llm_direct'
    llm_direct:
        model: 'gpt-5.4'
        max_completion_tokens: 4096
        max_concurrent: 8
        batch_enabled: true

filtering:
    deterministic_guards:
        enabled: true
        min_question_chars: 12
        min_answer_chars: 24
        min_reference_chunks: 1
    filters:
        - 'grounding_llm'
        - 'retrieval_too_easy_llm'
    grounding_llm:
        judge_model: 'gpt-5.4'
    retrieval_llm:
        judge_model: 'gpt-5.4'
        overlap_threshold: 0.5
        too_easy_confidence_threshold: 0.75

refinement:
    enabled: true
    max_refinements_per_item: 2
    max_same_seed_attempts_before_reanchor: 3
    max_rounds: 4

transformation:
    noise_level: 'light'
    style_distribution:
        keyword: 0.33
        natural: 0.34
        expert: 0.33
    validation_enabled: true
    preserve_original_in_metadata: true

split:
    train_ratio: 0.8
    stratify_by: ['qa_type', 'style']
    seed: 42

output:
    dir: 'outputs/castform'
    train_jsonl: 'train.jsonl'
    eval_jsonl: 'eval.jsonl'