prompt

roll a die. output in <answer></answer> tags.

generate

<answer>4</answer>

repeat

reward: 1

1 ≤ 4 ≤ 6

update weights

valid outputs → more likely to be sampled again
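
The loop above (prompt, generate, reward, update weights, repeat) can be sketched as a toy example. This is only an illustration under stated assumptions: the "model" is a softmax over four hypothetical canned completions (`CANDIDATES`), the update is a plain REINFORCE step without a baseline, and the names `reward`, `softmax`, and `train` are made up for this sketch. A real setup would sample free-form text from a language model and update its parameters.

```python
import math
import random
import re

# Hypothetical canned completions standing in for model samples (an assumption).
CANDIDATES = [
    "<answer>4</answer>",  # valid: 1 <= 4 <= 6
    "<answer>7</answer>",  # well-formed tags, but out of range
    "four",                # no tags at all
    "<answer>1</answer>",  # valid: 1 <= 1 <= 6
]


def reward(completion: str) -> int:
    """Return 1 if the completion is exactly <answer>n</answer> with 1 <= n <= 6, else 0."""
    match = re.fullmatch(r"\s*<answer>(\d+)</answer>\s*", completion)
    if match is None:
        return 0
    return 1 if 1 <= int(match.group(1)) <= 6 else 0


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=500, lr=0.5, seed=0):
    """Run the generate -> reward -> update loop and return the final probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)  # the toy "weights": one logit per completion
    for _ in range(steps):
        probs = softmax(logits)
        # generate: sample one completion index from the current policy
        i = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        # reward: 1 only for a valid die roll in the required tags
        r = reward(CANDIDATES[i])
        # update weights: REINFORCE, where d log p(i) / d logit_j = 1[j == i] - probs[j]
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * r * grad
    return softmax(logits)


if __name__ == "__main__":
    for text, p in zip(CANDIDATES, train()):
        print(f"{p:.3f}  {text!r}")
```

Because only rewarded samples change the logits, the probability mass on the two valid completions grows while the invalid ones shrink, which is the "more likely" effect in the last step.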