I mean DPO, PPO, and GRPO all use losses that are not what’s used with SFT for o...

		mountainriver 85 days ago \| parent \| context \| favorite \| on: CS234: Reinforcement Learning Winter 2025 I mean DPO, PPO, and GRPO all use losses that are not what’s used with SFT for one. They also force exploration as a part of the algorithm. They can be used for synthetic data generation once the reward model is good enough.