
Thanks for this post; it's inspiring. For a personal project I'm just trying to get bounding boxes from scanned PDF pages (around paragraphs/verses/headings, etc.), and so far I haven't gotten great results. (It seems to recognize the areas, but then the boxes are offset/translated by some amount.) I've only just started and haven't looked closely yet (I'm sure the results can be improved, judging by this post), but I can already see that there are a bunch of things to explore:

- Do you ask the multimodal LLM to return the image with boxes drawn on it (and then somehow extract coordinates), or simply ask it to return the coordinates? (Is the former even possible?)

- Does it work better or worse when you ask for [xmin, xmax, ymin, ymax] versus [x, y, width, height] (or various permutations thereof)?

- Do you ask for these coordinates as integer pixels (whose meaning can vary with dimensions of the original image), or normalized between 0.0 and 1.0 (or 0–1000 as in this post)?

- Is it worth doing it in two rounds: send it back its initial response with the boxes drawn on it, to give it another opportunity to "see" its previous answer and adjust its coordinates?
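For the two-round idea, here's a rough sketch of the annotation step (the draw_boxes helper is mine, and it assumes coordinates come back as [y0, x0, y1, x1] on a 0-1000 normalized scale, which is only an assumption here):

    from PIL import Image, ImageDraw

    def draw_boxes(image_path, boxes):
        # boxes: list of [y0, x0, y1, x1], normalized 0-1000 (an assumption)
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        w, h = img.size
        for y0, x0, y1, x1 in boxes:
            draw.rectangle(
                [x0 / 1000 * w, y0 / 1000 * h, x1 / 1000 * w, y1 / 1000 * h],
                outline="red", width=3)
        return img

    # Round 2: send the annotated page back with a prompt like
    # "Here are your boxes drawn on the page; adjust any that are offset."
    draw_boxes("page_001.png", [[120, 80, 340, 920]]).save("page_001_annotated.png")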

I ought to look into these things, but I'm wondering: as you (or others) work on something like this, how do you keep track of which prompts seem to be working better? Do you log all requests and responses / scores as you go? I didn't do that for my initial attempts, and it feels a bit like shooting in the dark / trying random things until something works.
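(One minimal way to do that logging, sketched below: append every attempt to a JSONL file. The field names are just one possible layout, and the IoU scoring is only a suggestion.)

    import json, time

    def log_attempt(path, prompt, response_text, score=None, notes=""):
        record = {
            "ts": time.time(),
            "prompt": prompt,
            "response": response_text,
            "score": score,   # e.g. mean IoU against a few hand-labeled boxes
            "notes": notes,
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")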



The model seems to be trained to pick up on the words "bounding box" or "segmentation mask", and if you ask it for JSON it returns Array<{ box_2d: [number, number, number, number], label: string, mask: "base64/png" }>, where box_2d is [y0, x0, y1, x1] for a bounding box.
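For example, a hedged sketch with the google-generativeai Python SDK (the model name, prompt wording, and fence handling are mine, not something the docs prescribe):

    import json
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    page = Image.open("page_001.png")
    prompt = ('Return bounding boxes for every paragraph on this page as JSON: '
              '[{"box_2d": [y0, x0, y1, x1], "label": "paragraph"}]')
    resp = model.generate_content([page, prompt])

    raw = resp.text.strip()
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()  # drop markdown fences if present
    boxes = json.loads(raw)  # each box_2d is [y0, x0, y1, x1] on a 0-1000 scale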

I recommend the Gemini docs here; they are explicit about some of these points.

Prompts matter too, less is more.

And you need to submit images to get good bounding boxes. You can somewhat infer this from the token counts, but the Gemini APIs do something to PDFs (OCR, I assume) that causes them to completely lose location context on the page. If you send the page in as an image, that context isn't lost and the boxes are great.

As an example of this, you can send a PDF page with text on only the top half, the bottom half empty. If you ask it to draw a bounding box around the last paragraph, it tends to return a result with a much higher number on the normalized scale (i.e., further down the page) than it should. In one experiment I did, it thought a footer text that was actually about 2/3 down the page was all the way at the end. When I sent the page as an image, it put the footer around the 660 mark on the normalized 0-1000 scale, exactly where you would expect it.
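The rasterization step is cheap, e.g. with PyMuPDF (a sketch; the DPI and filenames are arbitrary):

    import fitz  # PyMuPDF

    doc = fitz.open("scanned.pdf")
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=200)   # render the page as a raster image
        pix.save(f"page_{i:03d}.png")    # send these PNGs to the model instead of the PDF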


You've got to be careful with PDFs: we can't see how they are rendered internally for the LLM, so there may be differences in how it treats the margins/gutters/bleeds that we should account for (and cannot).



