what is rl finetuning

start · Mar 19, 2026 · 2 min read

the core technique castform uses to train models is reinforcement learning (rl) finetuning. rl finetuning trains models to achieve outcomes. instead of showing the model the right answer, you let it try different approaches and reward it when it succeeds. the model explores different reasoning paths and gradually learns which strategies actually work. this is a large part of what separates today’s reasoning models from earlier generations: they’ve been trained to search for good solutions, not just predict plausible continuations.

when rl finetuning is the right choice

rl works best when you can grade the output. if you can write a function (or use a judge) that scores whether the model’s output is good, rl can train the model to get better at it.
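as a rough illustration of "a function that scores the output", here is a minimal sketch of a grader for a structured-output task. the name `json_reward`, the required keys, and the partial-credit scheme are all assumptions made for this example; they are not part of castform's api.

```python
import json

def json_reward(output: str, required_keys: tuple[str, ...] = ("name", "age")) -> float:
    """hypothetical grader: 0.0 for unparseable output, up to 1.0 for a complete object."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not valid json at all
    if not isinstance(parsed, dict):
        return 0.0  # valid json, but not an object
    # half credit for parsing, the rest split across the required keys
    present = sum(1 for key in required_keys if key in parsed)
    return 0.5 + 0.5 * present / len(required_keys)

print(json_reward('{"name": "ada", "age": 36}'))
print(json_reward("not json"))
```

partial credit is a design choice here: a graded score gives the model something to climb toward, where a strict pass/fail score would treat "almost right" and "completely wrong" the same.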

good fits for rl:

  • search and retrieval: formulate better queries, retrieve the right documents, synthesize answers from search results
  • tool use: learn when and how to call APIs, execute code, query databases
  • structured output: produce valid json, sql, or domain-specific formats that parse correctly
  • constrained generation: follow specific rules (citation format, word limits, style guides, poem structures)
  • code generation: write code that actually runs, passes tests, and handles edge cases
  • roleplay and persona: maintain consistent character voices, follow persona guidelines, and stay in-character across long conversations. these are typically scored with llm judge rubrics that evaluate tone, consistency, and guideline adherence
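what the fits above have in common is that success can be checked mechanically or by a judge. for constrained generation, a sketch of such a check might look like the following. the word limit, the `[n]` citation pattern, and the function name are hypothetical choices for illustration only.

```python
import re

# assumed citation style for this example: bracketed numbers such as [1]
CITATION = re.compile(r"\[\d+\]")

def constrained_reward(output: str, max_words: int = 50) -> float:
    """hypothetical grader: half credit per satisfied rule."""
    within_limit = len(output.split()) <= max_words
    has_citation = bool(CITATION.search(output))
    return (within_limit + has_citation) / 2

print(constrained_reward("rl rewards outcomes, not imitation [1]"))
```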

how it works on castform

  1. you define an environment: the tools the model can use and a reward function that scores its output
  2. you provide a dataset of prompts the model will train on
  3. castform runs the training loop: the model generates responses, your reward function scores them, and the model updates to produce higher-scoring responses
  4. you monitor and evaluate using the console

the model learns through trial and error within your environment, guided by your reward signal. over hundreds of training steps, it figures out which strategies produce high rewards.
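the generate, score, update cycle can be sketched with a toy stand-in. this is not castform's training loop, and a real run updates model weights rather than three numbers. here the "model" is a softmax policy over three hypothetical strategies, and the update is a basic reinforce step with a running baseline. every name and hyperparameter below is an assumption chosen for the example.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(reward_fn, n_strategies=3, steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_strategies   # toy "model": one preference per strategy
    baseline = 0.0                  # running average of recent rewards
    for _ in range(steps):
        probs = softmax(logits)
        # generate: sample one strategy from the current policy
        choice = rng.choices(range(n_strategies), weights=probs)[0]
        # score: ask the reward function how well it did
        reward = reward_fn(choice)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # update: reinforce step, d log pi(choice) / d logit_i = 1[i == choice] - probs[i]
        for i in range(n_strategies):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)

# only strategy 2 "succeeds" in this toy environment
probs = train(lambda strategy: 1.0 if strategy == 2 else 0.0)
print([round(p, 3) for p in probs])
```

the policy starts out uniform and shifts probability toward the strategy that earns reward, which is the same trial-and-error dynamic at a much smaller scale.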

key concepts

concept | what it means
environment | defines the task: tools, rewards, system prompt. see environment overview.
reward function | scores each model output. the training signal. see rewards.
rollout | one attempt: the model receives a prompt, generates a response (possibly calling tools), and gets scored.
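to make the three concepts concrete, here is a sketch of how they could relate in code. the `Environment` and `Rollout` classes, `run_rollout`, and the fake model are hypothetical illustrations and do not reflect castform's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Environment:
    """hypothetical container for the task definition."""
    system_prompt: str
    reward_fn: Callable[[str, str], float]  # (prompt, response) -> score
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

@dataclass
class Rollout:
    """one attempt: the prompt, what the model produced, and its score."""
    prompt: str
    response: str
    reward: float

def run_rollout(env: Environment, model, prompt: str) -> Rollout:
    # the model sees the system prompt, the task prompt, and the available tools
    response = model(env.system_prompt, prompt, env.tools)
    return Rollout(prompt, response, env.reward_fn(prompt, response))

# stand-in for a real model: it always calls the "add" tool on fixed arguments
def fake_model(system_prompt, prompt, tools):
    return tools["add"]("2 3")

env = Environment(
    system_prompt="answer with a single number.",
    reward_fn=lambda prompt, response: 1.0 if response == "5" else 0.0,
    tools={"add": lambda args: str(sum(int(x) for x in args.split()))},
)
rollout = run_rollout(env, fake_model, "what is 2 + 3?")
print(rollout)
```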