There are many methods for aligning LLMs with human preferences. Beyond reinforcement learning with human feedback (RLHF), which is often considered too resource-intensive to apply consistently to newly fine-tuned models, direct preference optimization (DPO) is one of the most popular alternatives for LLM alignment.
Although DPO is significantly cheaper than RLHF, it still requires a reference model in addition to the “policy” model (i.e., the one that is actively trained). This means both models must be loaded into GPU memory at the same time, which can be difficult in single-GPU configurations, especially for large models.
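To get a feel for the cost of holding both models at once, here is a rough back-of-the-envelope sketch. The 7B parameter count and the 16-bit storage are assumptions chosen for illustration, and `weights_gib` is a hypothetical helper. It counts weights only and ignores gradients, optimizer states, and activations, so real usage will be higher.

```python
def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: a hypothetical 7B-parameter model stored in 16-bit precision.
NUM_PARAMS = 7e9

policy = weights_gib(NUM_PARAMS)      # the model being trained
reference = weights_gib(NUM_PARAMS)   # the frozen reference copy
total = policy + reference

print(f"policy:    {policy:.1f} GiB")
print(f"reference: {reference:.1f} GiB")
print(f"total:     {total:.1f} GiB (weights only)")
```

Under these assumptions the two sets of weights alone come to about 26 GiB, which already exceeds the memory of a 24 GB consumer GPU before any training state is counted.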
A more memory-efficient approach is to use LoRA for DPO training. Instead of training the entire model, we freeze its parameters and train a small adapter. This becomes even more efficient if the policy model and the reference model share the same base model. In that case, we load the base model once, then load a frozen adapter for the reference model and a trainable adapter for the policy model, which significantly reduces memory requirements.
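The saving from sharing one base model can be estimated with a short sketch. Every number below is an assumption for illustration (hidden size, layer count, rank, four square target matrices per layer, a 7B base in 16-bit), and `lora_params` is a hypothetical helper. It relies only on the fact that a LoRA adapter on a `d_out x d_in` weight adds two low-rank matrices, `A` (`rank x d_in`) and `B` (`d_out x rank`).

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


# All values below are illustrative assumptions, not measurements.
HIDDEN = 4096             # hidden size; target matrices assumed square
LAYERS = 32               # number of transformer layers
TARGETS_PER_LAYER = 4     # adapted weight matrices per layer
RANK = 16                 # LoRA rank
BASE_PARAMS = 7_000_000_000
BYTES_PER_PARAM = 2       # 16-bit weights

GIB = 1024**3

adapter_params = LAYERS * TARGETS_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)

# Two separate full models: policy + reference.
two_full_models = 2 * BASE_PARAMS * BYTES_PER_PARAM

# One shared frozen base, plus a frozen reference adapter
# and a trainable policy adapter.
shared_base = (BASE_PARAMS + 2 * adapter_params) * BYTES_PER_PARAM

print(f"parameters per adapter: {adapter_params:,}")
print(f"two full models:        {two_full_models / GIB:.1f} GiB")
print(f"shared base + adapters: {shared_base / GIB:.1f} GiB")
```

This counts weights only. The trainable adapter also needs gradients and optimizer states, but since the adapter is a small fraction of the base model, that overhead stays small compared with the cost of a second full copy.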
However, in my opinion, the impact of LoRA on DPO performance is still not well studied. LoRA can come close to full training, but its performance…

