corpus | castform docs

a corpus is an indexed, searchable collection of your chunks. once you’ve chunked your documents, you upload them to a corpus backend. the qa generation pipeline and the training environment both search against it.

already using a vector database? if your documents are already indexed in turbopuffer, pinecone, or chroma, you don’t need to chunk or upload anything. point castform at that backend instead; see corpus backends.

castform corpus api

PostgresChunkSource is the interface to the castform corpus, our hosted search backend. it uses BM25 keyword matching (no embeddings required) and is included with your castform account, so it’s the default backend.
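to make "keyword matching, no embeddings" concrete, here is a minimal sketch of the textbook okapi BM25 formula in plain python. it is not the castform implementation: the whitespace tokenizer, the k1 and b defaults, and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (real backends use richer analyzers)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs if n_docs else 0.0
    # document frequency: how many documents contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "kubernetes pod limits and resource requests",
    "scaling deployments with the horizontal pod autoscaler",
    "installing the command line client",
]
scores = bm25_scores("pod limits", docs)
# document indices ordered from best to worst match
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

the score depends only on term counts and document lengths, which is why a lexical backend needs no embedding model. rarer terms (here "limits") weigh more than common ones ("pod").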

from benchmax.rag.corpus.postgres.source import PostgresChunkSource

source = PostgresChunkSource(
    api_key="sk_...",
    corpus_name="my-docs",
    base_url="https://app.castform.com",
)

# load from a local folder (chunks + uploads in one step)
source.populate_from_folder("./docs/")

# or reuse an existing corpus (identified by the corpus_name passed above)
source.populate_from_existing_corpus_name()
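the example above hard-codes the api key. one common alternative is to read the constructor arguments from environment variables. the sketch below assumes variable names (CASTFORM_API_KEY, CASTFORM_CORPUS_NAME, CASTFORM_BASE_URL) and a helper (castform_settings) that are hypothetical, not something castform defines.

```python
import os
from typing import Mapping


def castform_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect constructor arguments from the environment (variable names are hypothetical)."""
    api_key = env.get("CASTFORM_API_KEY")
    if not api_key:
        raise RuntimeError("set CASTFORM_API_KEY before creating the source")
    return {
        "api_key": api_key,
        # fall back to the values used in the example above
        "corpus_name": env.get("CASTFORM_CORPUS_NAME", "my-docs"),
        "base_url": env.get("CASTFORM_BASE_URL", "https://app.castform.com"),
    }


# a plain dict stands in for os.environ so the sketch runs anywhere
settings = castform_settings({"CASTFORM_API_KEY": "sk_example"})
# the result can then be unpacked into the constructor:
# source = PostgresChunkSource(**settings)
```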

third-party backends

prefer vector or hybrid search, or already have your data in turbopuffer, pinecone, or chroma? use that database as your corpus backend instead. these backends implement the same ChunkSource interface, so the rest of the pipeline is unchanged. see corpus backends for setup.

ChunkSource interface

every corpus backend implements ChunkSource. this means all the methods below work identically regardless of which backend you’re using.
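a shared interface lets pipeline code depend on the methods rather than on a specific backend. the sketch below shows the idea with a toy in-memory backend. SearchableSource, InMemorySource, and first_hit are hypothetical names, and the list-of-strings return type is an assumption; the real ChunkSource may return richer result objects.

```python
from typing import Optional, Protocol


class SearchableSource(Protocol):
    """Hypothetical subset of ChunkSource: only the text-search method."""

    def search_text(self, query: str) -> list[str]: ...


class InMemorySource:
    """Toy backend: case-insensitive substring match over a list of strings."""

    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks

    def search_text(self, query: str) -> list[str]:
        terms = query.lower().split()
        return [c for c in self.chunks if any(t in c.lower() for t in terms)]


def first_hit(source: SearchableSource, query: str) -> Optional[str]:
    """Pipeline-style helper that relies on the interface, not the backend."""
    hits = source.search_text(query)
    return hits[0] if hits else None


toy = InMemorySource(["pod limits are set per container", "install the cli"])
answer = first_hit(toy, "pod limits")
```

any object with a compatible search_text method could be passed to first_hit without changing it, which is the property the paragraph above describes.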

uploading chunks

# from a folder of documents (they are chunked automatically)
source.populate_from_folder("./docs/")

# from an existing ChunkCollection
source.populate_from_chunks(chunks)

searching

# simple text search
results = source.search_text("kubernetes pod limits")

# structured search with a SearchSpec
results = source.search({
    "mode": "lexical",
    "text_query": "kubernetes pod limits",
    "top_k": 5,
})

# find chunks related to a source chunk
results = source.search_related(source_chunk, queries=["scaling", "limits"], top_k=5)

sampling and context

# random sample (useful for qa generation seed chunks)
sample = source.sample_chunks(n=10, min_chars=200)

# get a chunk with its neighbors for extra context
context = source.get_chunk_with_context(chunk)
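as a rough mental model of these two calls, the sketch below runs both over an in-memory list. the Chunk fields (doc_id, index, text), the window parameter, and the seed parameter are assumptions for illustration; how the hosted backend samples or defines "neighbors" is not specified here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record; the real chunk type may differ."""
    doc_id: str
    index: int  # position of the chunk within its document
    text: str


def sample_chunks(chunks, n, min_chars, seed=None):
    """Randomly pick up to n chunks whose text has at least min_chars characters."""
    eligible = [c for c in chunks if len(c.text) >= min_chars]
    rng = random.Random(seed)
    return rng.sample(eligible, min(n, len(eligible)))


def chunk_with_context(chunks, target, window=1):
    """Return the target plus up to `window` neighbors on each side, in document order."""
    same_doc = sorted(
        (c for c in chunks if c.doc_id == target.doc_id),
        key=lambda c: c.index,
    )
    return [c for c in same_doc if abs(c.index - target.index) <= window]


# five chunks of growing length: 10, 60, 110, 160, and 210 characters
corpus = [Chunk("guide", i, f"section {i} " + "x" * (50 * i)) for i in range(5)]
seeds = sample_chunks(corpus, n=2, min_chars=100, seed=0)
context = chunk_with_context(corpus, corpus[2])
```

a min_chars threshold keeps very short chunks (headings, stray lines) out of the seed set, since they rarely carry enough content to generate a question from.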

filtering

you can filter search results by chunk metadata:

from benchmax.rag.corpus.search_schema import SearchSpec

# restrict results to chunks that came from a single file
spec: SearchSpec = {
    "mode": "lexical",
    "text_query": "kubernetes pod limits",
    "top_k": 5,
    "filter": {"field": "file_path", "op": "eq", "value": "k8s-docs.md"},
}
results = source.search(spec)
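to illustrate what a filter of this shape expresses, here is a minimal sketch that evaluates a {"field", "op", "value"} dict against chunk metadata in plain python. only the "eq" operator appears in the example above; the "ne" and "in" entries, the OPS table, and the matches helper are hypothetical, and the real backend presumably applies filters server-side.

```python
import operator

# hypothetical operator table; only "eq" is taken from the docs above
OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "in": lambda actual, allowed: actual in allowed,
}


def matches(metadata: dict, flt: dict) -> bool:
    """Return True if the metadata satisfies a single-field filter."""
    op = OPS.get(flt["op"])
    if op is None:
        raise ValueError(f"unsupported filter op: {flt['op']!r}")
    if flt["field"] not in metadata:
        return False  # a missing field never matches
    return op(metadata[flt["field"]], flt["value"])


chunks = [
    {"text": "pod limits", "file_path": "k8s-docs.md"},
    {"text": "pod limits", "file_path": "faq.md"},
]
flt = {"field": "file_path", "op": "eq", "value": "k8s-docs.md"}
kept = [c for c in chunks if matches(c, flt)]
```

both toy chunks match the text query, but only the one from k8s-docs.md survives the filter.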