prompt
roll a die. output in <answer></answer> tags.
generate
<answer>4</answer>
reward: 1
1 ≤ 4 ≤ 6
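
The reward check above can be sketched in Python. This is a minimal sketch, not the actual grader: the function name `reward` and the choice to require exactly one `<answer>` tag pair are assumptions made for illustration.

```python
import re

# Matches the contents of each <answer>...</answer> pair in the model output.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def reward(completion: str) -> int:
    """Return 1 if the completion holds a valid die roll, else 0.

    Assumption: exactly one <answer> pair is required; the source only
    shows the single-tag case.
    """
    matches = ANSWER_RE.findall(completion)
    if len(matches) != 1:
        return 0
    text = matches[0].strip()
    # Accept only plain ASCII digits so int() cannot fail on odd input.
    if not re.fullmatch(r"[0-9]+", text):
        return 0
    value = int(text)
    # The check from the outline: 1 <= value <= 6.
    return 1 if 1 <= value <= 6 else 0


if __name__ == "__main__":
    print(reward("<answer>4</answer>"))  # valid roll
    print(reward("<answer>7</answer>"))  # out of range
    print(reward("4"))                   # missing tags
```

Because the check is a pure function of the output string, it needs no learned reward model.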
update weights
valid paths → more likely
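
The update step can be illustrated with a toy policy. The outline does not say which algorithm is used; this sketch swaps in a plain REINFORCE update on a softmax over eight hypothetical outputs ("0" to "7"), standing in for a real model's weights. `OUTPUTS`, `valid_mass`, `train`, the learning rate, and the step count are all assumptions.

```python
import math
import random

# Hypothetical toy output space: "1".."6" are valid rolls, "0" and "7" are not.
OUTPUTS = ["0", "1", "2", "3", "4", "5", "6", "7"]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_valid(output: str) -> bool:
    return 1 <= int(output) <= 6


def valid_mass(logits) -> float:
    """Total probability the policy puts on valid rolls."""
    probs = softmax(logits)
    return sum(p for o, p in zip(OUTPUTS, probs) if is_valid(o))


def train(steps=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(OUTPUTS)  # uniform policy to start
    for _ in range(steps):
        probs = softmax(logits)
        # generate: sample one output from the current policy
        i = rng.choices(range(len(OUTPUTS)), weights=probs)[0]
        # reward: 1 for a valid roll, 0 otherwise
        r = 1 if is_valid(OUTPUTS[i]) else 0
        # update weights: REINFORCE, where the gradient of log p(i)
        # with respect to the logits is onehot(i) - probs
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * r * grad
    return logits


if __name__ == "__main__":
    print("valid mass before:", valid_mass([0.0] * len(OUTPUTS)))
    print("valid mass after: ", valid_mass(train()))
```

Only rewarded samples move the logits, so the invalid outputs can only lose probability. Without a baseline the policy may also concentrate on a single face, which this reward does not penalize.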