It used to be done that way, but newer multimodal LLMs train on a mix of image and text tokens, so they don’t need a separate image encoder. There is just one model that handles everything.
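
Roughly: image patches get projected into the same embedding space as text tokens, and the mixed sequence goes through one transformer. A minimal sketch of that idea in PyTorch (all names and sizes here are made up for illustration, not any particular model's code):

    import torch
    import torch.nn as nn

    D_MODEL = 512
    VOCAB_SIZE = 32000        # assumed text vocabulary size
    PATCH_DIM = 3 * 16 * 16   # assumed flattened 16x16 RGB patches

    text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)   # text tokens -> vectors
    patch_embed = nn.Linear(PATCH_DIM, D_MODEL)      # image patches -> vectors

    # One shared transformer sees both modalities as a single token sequence.
    layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
    backbone = nn.TransformerEncoder(layer, num_layers=4)

    text_ids = torch.randint(0, VOCAB_SIZE, (1, 10))  # fake prompt tokens
    patches = torch.randn(1, 196, PATCH_DIM)          # fake 14x14 patch grid

    # Concatenate image and text embeddings, then run the same model over the mix.
    tokens = torch.cat([patch_embed(patches), text_embed(text_ids)], dim=1)
    hidden = backbone(tokens)  # shape: (1, 206, D_MODEL)

The point is that nothing downstream of the embedding step cares which tokens came from pixels and which came from text.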

