Glory to BRPO
Classing up ML algorithms with cultist robes
“Bootstrapped Relative Policy Optimization” is a machine learning algorithm intended to work with difficult-to-specify learning targets, such as aesthetic website design.
A training round using the standard method, “group relative policy optimization”, will look something like:
The model being trained generates sixteen outputs for the task
(“make a beautiful website”),
and the reward model scores each of those outputs with a single scalar grade
(“5/10”, “7/10”).
Then the outputs with the highest scores are reinforced, gradually pushing the model towards making more things like that.
This works great for objective tasks that can be quantified with a score, such as a math test. But when you have a subjective target such as beauty, it tends to enshrine a lot of cruft you never asked for: any imprecision in your reward model gets locked in and amplified a millionfold. Perhaps the nearest proxy to beauty in your website-design training is “minimalism”, and the model will keep chasing that gradient until there is no website at all.
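For concreteness, here is a minimal sketch of how a vanilla GRPO round turns those sixteen grades into an update signal. It assumes the usual GRPO convention of centering and scaling each grade against the rest of its group; the sixteen scores are hypothetical, and the actual policy-gradient update is left out.

```python
import statistics

def grpo_advantages(scores):
    """Group-relative advantages: each output's grade minus the group mean,
    scaled by the group's standard deviation. Above-average outputs get a
    positive advantage (reinforced); below-average ones get a negative one."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0  # guard against all grades tying
    return [(s - mean) / std for s in scores]

# Sixteen hypothetical scalar grades from the reward model ("5/10", "7/10", ...):
scores = [5, 7, 6, 8, 4, 7, 5, 6, 9, 5, 6, 7, 4, 8, 6, 5]
advantages = grpo_advantages(scores)
# Each advantage then weights that generation's contribution to the policy update.
```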
So instead,
we use Bootstrapped Relative Policy Optimization: the reward model randomly picks one of the sixteen generations as the “champion” and compares each of the other generations against it in a pairwise battle. Generations that beat the champion score +1 and are reinforced; generations that lose score -1 and are gradually discarded.
The idea is to prevent brittleness and overtraining by never allowing a single, reward-hacking path to victory. It shouldn’t be the case that all sixteen generations are simultaneously chasing minimalism,
inshallah.
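As a companion sketch, here is roughly what that champion-versus-the-rest scoring looks like. The `beats_champion` judge is a hypothetical stand-in for the pairwise reward model, and the choice to give the champion itself a zero advantage is an assumption of mine, not something specified above.

```python
import random

def brpo_advantages(generations, beats_champion):
    """Bootstrapped relative advantages: pick one generation at random as the
    champion, then score every other generation +1 if it beats the champion
    in a pairwise comparison and -1 if it loses."""
    champ_idx = random.randrange(len(generations))
    champion = generations[champ_idx]
    advantages = []
    for i, gen in enumerate(generations):
        if i == champ_idx:
            advantages.append(0.0)  # assumption: the champion sits out this update
        else:
            advantages.append(1.0 if beats_champion(gen, champion) else -1.0)
    return advantages

# Coin-flip placeholder judge; a real one would ask the reward model which of
# the two websites is more beautiful.
def beats_champion(gen, champion):
    return random.random() < 0.5

websites = [f"<html>design {i}</html>" for i in range(16)]
advantages = brpo_advantages(websites, beats_champion)
```

The resulting +1/-1 scores then slot into the policy update in place of the centered grades from the GRPO sketch.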



