Abstract
Reinforcement Learning with Performance Feedback (RLPF) offers a new way to turn generic GenAI models into customized models fine-tuned for specific tasks. The approach is especially powerful when combined with in-house data and performance metrics. In this talk, we highlight the application of RLPF to the ad text generation system on the Facebook platform.
The presentation covers the core technical components required for production RLHF systems: preference data collection methodologies, reward model training approaches, and policy optimization techniques that maintain stability in production environments. We'll explore the theoretical foundations underlying these systems and how they translate into practical engineering solutions.