Reinforcement learning from human feedback typically optimizes against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground-truth performance, in accordance with Goodhart's law. Although this effect has been frequently observed, it has not been carefully measured, because gathering human preference data is expensive. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of a human and provides the labels used to train a proxy reward model. We study how the score of the gold reward model changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the optimization method, and that in both cases its coefficients scale smoothly with the number of parameters in the reward model. We also investigate how this relationship is affected by the size of the reward model dataset, the number of parameters in the reward model and policy, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
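To make the synthetic setup concrete, here is a minimal toy sketch (not the paper's code) of the best-of-n variant: a fixed "gold" reward defines ground truth, a deliberately low-capacity proxy is fit to gold scores on samples from the initial policy, and best-of-n selection then optimizes against the proxy while we track the gold score. The one-dimensional "policy", the function names, and the linear proxy are all illustrative assumptions.

```python
# Toy illustration of reward model overoptimization under best-of-n sampling.
# Assumed names: gold_reward (gold-standard RM), proxy_reward (learned proxy).
import numpy as np

rng = np.random.default_rng(0)

def gold_reward(x):
    """Ground-truth reward, peaked at x = 1 (stands in for the gold RM)."""
    return -(x - 1.0) ** 2

# "Train" the proxy: a linear least-squares fit to gold scores on samples
# drawn from the initial policy N(0, 1).  Being too simple, it tracks the
# gold reward on-distribution but extrapolates badly into the tail.
train_x = rng.normal(size=10_000)
slope, intercept = np.polyfit(train_x, gold_reward(train_x), deg=1)

def proxy_reward(x):
    """Learned proxy reward (here just the fitted line)."""
    return slope * x + intercept

def best_of_n_gold_score(n, trials=5_000):
    """Mean gold score of the candidate chosen to maximize the *proxy* reward."""
    samples = rng.normal(size=(trials, n))   # n candidates from the initial policy
    winners = samples[np.arange(trials), np.argmax(proxy_reward(samples), axis=1)]
    return gold_reward(winners).mean()

for n in (1, 2, 4, 16, 64, 256, 1024):
    # Standard analytic KL between the best-of-n distribution and the initial policy.
    kl = np.log(n) - (n - 1) / n
    print(f"n={n:5d}  KL={kl:5.2f}  mean gold score={best_of_n_gold_score(n):+.3f}")
```

Running this shows the qualitative Goodhart behavior the abstract describes: as n (and hence the KL divergence from the initial policy) grows, the gold score first improves and then degrades, even though the proxy score keeps increasing.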
Scaling laws for reward model overoptimization
by root