
I like to think of RLHF as the technique I used as a student to score good marks in my exams. Once I started working, I realized that out-of-distribution generalization can't be achieved just by practicing in an environment with verifiable rewards.

