It used to be done that way, but newer multimodal LLMs train on a mix of image and text tokens, so they don’t need a separate image encoder. There is just one model that handles everything.
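
Roughly: image patches get projected into the same embedding space as text tokens, and the mixed sequence goes through one transformer. A minimal sketch of that idea in PyTorch (all names and sizes here are made up for illustration, not any particular model's code):

    import torch
    import torch.nn as nn

    D_MODEL = 512
    VOCAB_SIZE = 32000        # assumed text vocabulary size
    PATCH_DIM = 3 * 16 * 16   # assumed flattened 16x16 RGB patches

    text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)   # text tokens -> vectors
    patch_embed = nn.Linear(PATCH_DIM, D_MODEL)      # image patches -> vectors

    # One shared transformer sees both modalities as a single token sequence.
    layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
    backbone = nn.TransformerEncoder(layer, num_layers=4)

    text_ids = torch.randint(0, VOCAB_SIZE, (1, 10))  # fake prompt tokens
    patches = torch.randn(1, 196, PATCH_DIM)          # fake 14x14 patch grid

    # Concatenate image and text embeddings, then run the same model over the mix.
    tokens = torch.cat([patch_embed(patches), text_embed(text_ids)], dim=1)
    hidden = backbone(tokens)  # shape: (1, 206, D_MODEL)

The point is that nothing downstream of the embedding step cares which tokens came from pixels and which came from text.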

