what is rl finetuning

the core technique castform uses to train models is reinforcement learning (rl) finetuning. rl finetuning trains models to achieve outcomes: instead of showing the model the right answer, you let it try different approaches and reward it when it succeeds. the model explores different reasoning paths and gradually learns which strategies actually work. this is what separates today’s reasoning models from earlier generations: they’ve been trained to search for good solutions, not just predict plausible continuations.

when rl finetuning is the right choice

rl works best when you can grade the output. if you can write a function (or use a judge) that scores whether the model’s output is good, rl can train the model to get better at it.

good fits for rl:

search and retrieval: formulate better queries, retrieve the right documents, synthesize answers from search results
tool use: learn when and how to call APIs, execute code, query databases
structured output: produce valid json, sql, or domain-specific formats that parse correctly
constrained generation: follow specific rules (citation format, word limits, style guides, poem structures)
code generation: write code that actually runs, passes tests, and handles edge cases
roleplay and persona: maintain consistent character voices, follow persona guidelines, and stay in character across long conversations. these are scored with llm judge rubrics that evaluate tone, consistency, and guideline adherence
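
to make "you can grade the output" concrete, here is a minimal sketch of a programmatic grader for the structured output case. it is an illustration only, not castform's reward api: the function name `json_reward`, the `REQUIRED_KEYS` schema, and the partial-credit weights are all assumptions made for this example.

```python
import json

# hypothetical schema for illustration: keys the output object must contain
REQUIRED_KEYS = {"title", "year"}


def json_reward(output: str) -> float:
    """Score a model output that is supposed to be a JSON object, in [0, 1]."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # does not parse at all
    if not isinstance(parsed, dict):
        return 0.25  # valid JSON, but not an object
    present = REQUIRED_KEYS.intersection(parsed)
    # half credit for parsing as an object, the rest scaled by required keys present
    return 0.5 + 0.5 * len(present) / len(REQUIRED_KEYS)
```

partial credit is a design choice here: a graded score gives the model a signal to climb even before it produces fully correct output, whereas a strict pass/fail score would only reward complete successes.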

how it works on castform

you define an environment: the tools the model can use and a reward function that scores its output
you provide a dataset of prompts the model will train on
castform runs the training loop: the model generates responses, your reward function scores them, and the model updates to produce higher-scoring responses
you monitor and evaluate using the console

the model learns through trial and error within your environment, guided by your reward signal. over hundreds of training steps, it figures out which strategies produce high rewards.
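
the generate, score, update loop described above can be sketched with a toy example. this is not castform's training algorithm, which this page does not specify. it swaps in a simple policy-gradient (reinforce-style) update over three made-up "strategies", and every name in it (`train`, `toy_reward`, `softmax`) is hypothetical.

```python
import math
import random


def softmax(prefs):
    """Turn raw preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(reward_fn, num_strategies, steps=500, lr=0.1, seed=0):
    """Toy loop: sample a strategy, score it, shift probability toward high scores."""
    rng = random.Random(seed)
    prefs = [0.0] * num_strategies  # start with no preference between strategies
    baseline = 0.0                  # running average reward seen so far
    for step in range(1, steps + 1):
        probs = softmax(prefs)
        choice = rng.choices(range(num_strategies), weights=probs)[0]  # "generate"
        reward = reward_fn(choice)                                     # "score"
        advantage = reward - baseline
        baseline += (reward - baseline) / step
        for i in range(num_strategies):                                # "update"
            grad = (1.0 if i == choice else 0.0) - probs[i]
            prefs[i] += lr * advantage * grad
    return softmax(prefs)


def toy_reward(strategy):
    # assumed rewards for illustration: strategy 2 is the one that "works"
    return [0.1, 0.3, 1.0][strategy]


final_probs = train(toy_reward, num_strategies=3)
```

after training, most of the probability mass sits on the highest-reward strategy. a real language model has vastly more possible outputs than three, but the feedback loop has the same shape.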

key concepts

concept	what it means
environment	defines the task: tools, rewards, system prompt. see environment overview.
reward function	scores each model output; this score is the training signal. see rewards.
rollout	one attempt: the model receives a prompt, generates a response (possibly calling tools), and gets scored.
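
as a rough mental model, a rollout can be pictured as a small record that ties the prompt, the response, any tool calls, and the score together. the sketch below is an assumption for illustration: the `Rollout` and `ToolCall` classes, `run_rollout`, and the stub functions are hypothetical and do not reflect castform's actual data model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ToolCall:
    name: str        # which tool the model called
    arguments: dict  # arguments the model supplied
    result: str      # what the tool returned


@dataclass
class Rollout:
    prompt: str
    response: str
    tool_calls: List[ToolCall] = field(default_factory=list)
    reward: Optional[float] = None  # filled in once the attempt is scored


def run_rollout(
    prompt: str,
    generate: Callable[[str], Tuple[str, List[ToolCall]]],
    reward_fn: Callable[[Rollout], float],
) -> Rollout:
    """One attempt: generate a response (with any tool calls), then score it."""
    response, tool_calls = generate(prompt)
    rollout = Rollout(prompt=prompt, response=response, tool_calls=tool_calls)
    rollout.reward = reward_fn(rollout)
    return rollout


# stand-ins for a real model and a real reward function
def fake_generate(prompt):
    call = ToolCall(name="calculator", arguments={"expression": "2 + 2"}, result="4")
    return "the answer is 4", [call]


def contains_four(rollout):
    return 1.0 if "4" in rollout.response else 0.0


example = run_rollout("what is 2 + 2?", fake_generate, contains_four)
```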