chunking is the first step in the rag pipeline. you take your raw documents and split them into chunks, smaller pieces that can be indexed, searched, and used for qa generation. if you’ve already chunked your data, skip ahead to corpus.
provided chunkers
we currently provide two chunkers: MarkdownChunker for docs and EmailChunker for email threads.
MarkdownChunker
a 3-stage pipeline for markdown files:
- split by headers: uses LangChain’s MarkdownHeaderTextSplitter to split on H1/H2/H3 boundaries. header hierarchy is preserved in each chunk’s metadata so you always know where a chunk came from.
- fuse short sections: adjacent sections below min_char get merged together. this prevents tiny chunks that lack enough context.
- split large sections: anything above max_char gets recursively character-split with overlap to maintain continuity across chunk boundaries.
from benchmax.rag.chunkers.markdown import MarkdownChunker
chunker = MarkdownChunker(min_char=1024, max_char=2048, chunk_overlap=128)
# chunk a single file
chunks = chunker.chunk_file("./docs/getting-started.md")
# chunk a folder (scans for .md files by default)
chunks = chunker.chunk_folder("./docs/", file_extensions=[".md", ".mdx"])
| parameter | default | description |
|---|---|---|
| min_char | 1024 | sections shorter than this get fused with neighbors |
| max_char | 2048 | sections longer than this get recursively split |
| chunk_overlap | 128 | character overlap between split chunks |
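
to make the three stages above concrete, here is a minimal stdlib-only sketch. it is not benchmax’s implementation: the function names (split_by_headers, fuse_short, split_large, chunk_markdown) are hypothetical, the header split is a plain regex rather than LangChain’s MarkdownHeaderTextSplitter, stage 3 uses fixed-size windows as a stand-in for recursive splitting, and the forward direction of fusing is an assumption.

```python
import re


def split_by_headers(text):
    """stage 1: start a new section at every H1/H2/H3 line."""
    sections, current = [], []
    for line in text.splitlines(keepends=True):
        if re.match(r"#{1,3} ", line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


def fuse_short(sections, min_char):
    """stage 2: keep appending to the previous section while it is below min_char."""
    fused = []
    for section in sections:
        if fused and len(fused[-1]) < min_char:
            fused[-1] += section
        else:
            fused.append(section)
    # a short trailing section has no successor, so fold it into its predecessor
    if len(fused) > 1 and len(fused[-1]) < min_char:
        tail = fused.pop()
        fused[-1] += tail
    return fused


def split_large(section, max_char, chunk_overlap):
    """stage 3: cut oversized sections into windows that share chunk_overlap characters."""
    if chunk_overlap >= max_char:
        raise ValueError("chunk_overlap must be smaller than max_char")
    if len(section) <= max_char:
        return [section]
    step = max_char - chunk_overlap
    return [section[i:i + max_char]
            for i in range(0, len(section) - chunk_overlap, step)]


def chunk_markdown(text, min_char=1024, max_char=2048, chunk_overlap=128):
    """run the three stages in order and return plain strings."""
    chunks = []
    for section in fuse_short(split_by_headers(text), min_char):
        chunks.extend(split_large(section, max_char, chunk_overlap))
    return chunks
```

note that fusing runs before splitting, so a fused section can itself exceed max_char and get split again. the sketch also ignores header metadata and would misread a `#` line inside a code fence as a header.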
EmailChunker
reply-graph aware chunking for email threads. instead of treating each email as a standalone document, it reconstructs conversation threads and creates overlapping sliding windows across them.
key behaviors:
- handles forked threads (where someone replies to an earlier message) by compacting shared prefixes
- sliding window approach preserves conversational context across chunk boundaries
from benchmax.rag.chunkers.email import EmailChunker
chunker = EmailChunker(max_emails_per_chunk=10, max_chars=2048, overlap_emails=2)
chunks = chunker.chunk_folder("./email-export/")
| parameter | default | description |
|---|---|---|
| max_emails_per_chunk | 10 | max emails in a single chunk window |
| max_chars | 2048 | max character length per chunk |
| overlap_emails | 2 | number of emails to overlap between adjacent chunks |
each chunk includes metadata: thread_id, chunk_id, parent_chunk_id, child_chunk_ids, subject, participants, and date ranges.
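
as a rough illustration of the two ideas above (reply-graph threads and overlapping windows), here is a small stdlib-only sketch. it is not how EmailChunker is implemented: thread_paths and sliding_windows are hypothetical names, the input shape (a dict mapping each message id to the id it replies to) is an assumption, and max_chars and prefix compaction are not modeled.

```python
def thread_paths(parents):
    """return every root-to-leaf path in a reply graph.

    parents maps message id -> id of the message it replies to (None for a root).
    a forked thread yields several paths that share a common prefix.
    """
    children = {}
    for msg, parent in parents.items():
        children.setdefault(parent, []).append(msg)

    paths = []

    def walk(msg, path):
        path = path + [msg]
        replies = children.get(msg, [])
        if not replies:
            paths.append(path)
        for reply in replies:
            walk(reply, path)

    for root in children.get(None, []):
        walk(root, [])
    return paths


def sliding_windows(thread, max_emails_per_chunk=10, overlap_emails=2):
    """cut an ordered thread into windows that share overlap_emails messages."""
    if overlap_emails >= max_emails_per_chunk:
        raise ValueError("overlap_emails must be smaller than max_emails_per_chunk")
    if not thread:
        return []
    step = max_emails_per_chunk - overlap_emails
    windows, start = [], 0
    while True:
        windows.append(thread[start:start + max_emails_per_chunk])
        if start + max_emails_per_chunk >= len(thread):
            break
        start += step
    return windows
```

with the defaults, a 25-email thread produces three windows starting at emails 0, 8, and 16, and each window repeats the last two emails of the one before it.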
chunk types
note: you shouldn’t need to use these unless you are building your own chunker or extending our qa-generation pipeline.
ChunkCollection is a list-like container of Chunk objects that stays aware of which file each chunk came from.
a Chunk is an immutable record of one piece of text. it holds three things:
- content: the chunk text.
- metadata: key/value pairs describing where the chunk came from, e.g. header hierarchy (h1/h2/h3), source file, and position index. the email chunker adds thread fields like thread_id and subject.
- hash: a SHA-256 of the content + metadata, computed automatically.
read content and metadata off a chunk without mutating it:
chunk.content # the text
chunk.metadata_dict # metadata as a dict
chunk.get_metadata("file") # one metadata value, with optional default
len(chunk) # length of the content
chunk.chunk_str(max_chars=200) # metadata + (optionally truncated) content, for printing
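
if you are building your own chunker, the “immutable record with an automatic hash” idea can be sketched with a frozen dataclass. this is an illustration only: SketchChunk and make_chunk are hypothetical names, and the exact way benchmax serializes content and metadata before hashing is not documented here, so the JSON payload below is an assumption.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchChunk:
    content: str
    # sorted (key, value) pairs; a tuple keeps the record immutable
    metadata: tuple = ()
    # derived from the other two fields, never passed in by the caller
    hash: str = field(init=False)

    def __post_init__(self):
        payload = json.dumps([self.content, list(self.metadata)], sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # frozen dataclasses block normal assignment, so set the field directly
        object.__setattr__(self, "hash", digest)


def make_chunk(content, **metadata):
    """build a chunk from keyword metadata, sorted so key order cannot change the hash."""
    return SketchChunk(content, tuple(sorted(metadata.items())))
```

because the metadata is part of the hashed payload, two chunks with identical text from different files get different hashes, which is what makes hash-based lookup unambiguous.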
a ChunkCollection wraps the full set of chunks and indexes them by source file and by hash. it behaves like a list (iterate, index, len()) and adds navigation that respects document structure:
| method | returns |
|---|---|
| .files | the unique source files in the collection |
| get_file_chunks(file) | every chunk from one file, in order |
| get_neighboring_chunks(chunk, before, after) | the chunks immediately before/after a chunk in its file |
| get_chunk_with_context(chunk) | a chunk plus previews of its neighbors |
| get_chunk_by_hash(hash) | look a chunk up by its hash |
| get_top_level_chunks() | chunks from the shallowest files in the directory tree |
this file-awareness is what lets downstream steps pull surrounding context for a chunk instead of treating each one in isolation.
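
a toy version of that file-aware indexing, for readers extending the pipeline. SketchCollection is a hypothetical class, chunks are reduced to (file, text) tuples, and the return shape of get_neighboring_chunks (a pair of lists) is an assumption rather than the documented behavior of ChunkCollection.

```python
from collections import defaultdict


class SketchCollection:
    """list-like container that also remembers which file each chunk came from."""

    def __init__(self, chunks):
        # chunks: (file, text) tuples in document order
        self._chunks = list(chunks)
        self._by_file = defaultdict(list)
        for position, (file, _text) in enumerate(self._chunks):
            self._by_file[file].append(position)

    def __len__(self):
        return len(self._chunks)

    def __getitem__(self, index):
        return self._chunks[index]

    @property
    def files(self):
        # dicts keep insertion order, so files appear in first-seen order
        return list(self._by_file)

    def get_file_chunks(self, file):
        return [self._chunks[i] for i in self._by_file.get(file, [])]

    def get_neighboring_chunks(self, chunk, before=1, after=1):
        # neighbors are looked up within the chunk's own file, not the whole list
        siblings = self.get_file_chunks(chunk[0])
        i = siblings.index(chunk)
        return siblings[max(0, i - before):i], siblings[i + 1:i + 1 + after]
```

the point of the per-file index is that a chunk’s neighbors are its siblings in the same document, even when chunks from other files sit between them in the flat list.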
persistence and inspection
save a collection to yaml so you don’t re-chunk every time, and load it back later:
from benchmax.rag.chunkers.storage import save_chunks, load_chunks
save_chunks(chunks, "chunks.yaml")
chunks = load_chunks("chunks.yaml")
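
one way to wire this into a script is a small “load if cached, otherwise build and save” helper. load_or_build is a hypothetical function, not part of benchmax; it takes the build, load, and save steps as callables so it makes no assumptions about the storage format.

```python
from pathlib import Path


def load_or_build(path, build, load, save):
    """return cached data from path if it exists, else build it and cache it."""
    if Path(path).exists():
        return load(path)
    result = build()
    # same argument order as save_chunks(chunks, "chunks.yaml")
    save(result, path)
    return result
```

with the functions shown above, the call would look like load_or_build("chunks.yaml", lambda: chunker.chunk_folder("./docs/"), load_chunks, save_chunks). delete the yaml file whenever you change chunker parameters, since the helper does not detect stale caches.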
ChunkInspector helps you sanity-check chunks before corpus upload or qa generation:
from benchmax.rag.chunkers.inspector import ChunkInspector
inspector = ChunkInspector(chunks)
print(inspector.summary()) # chunk count, size distribution, etc.
inspector.read_chunk(0) # print a specific chunk
inspector.sample_file() # randomly sample and display chunks from one file
use it to tune min_char / max_char: you want chunks long enough to stand on their own but short enough to stay focused on a single topic.
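
if you want the numbers behind that tuning decision in your own code, a size report is easy to compute from chunk lengths. size_report is a hypothetical helper and its output is not what ChunkInspector.summary() returns; it only assumes that len(chunk) gives the content length, as described above.

```python
import statistics


def size_report(lengths, min_char=1024, max_char=2048):
    """summarize chunk lengths and count how many fall outside the target range."""
    lengths = sorted(lengths)
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": lengths[0],
        "median": statistics.median(lengths),
        "max": lengths[-1],
        # chunks that may lack context
        "below_min_char": sum(n < min_char for n in lengths),
        # chunks that may cover more than one topic
        "above_max_char": sum(n > max_char for n in lengths),
    }
```

call it as size_report(len(c) for c in chunks). a large below_min_char count suggests raising min_char, and a large above_max_char count suggests the splitter is not being applied or max_char is set too high for your documents.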