Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I mean DPO, PPO, and GRPO all use losses that are not what’s used with SFT for one.

They also force exploration as a part of the algorithm.

They can be used for synthetic data generation once the reward model is good enough.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: