
Thanks for this post; it's inspiring. For a personal project I'm just trying to get bounding boxes from scanned PDF pages (around paragraphs/verses/headings, etc.), and so far I haven't gotten great results. (It seems to recognize the areas, but then the boxes are offset/translated by some amount.) I've only just started and haven't looked closely yet (I'm sure the results can be improved, judging by this post), but I can already see that there are a bunch of things to explore:

- Do you ask the multimodal LLM to return the image with boxes drawn on it (and then somehow extract coordinates), or simply ask it to return the coordinates? (Is the former even possible?)

- Does it work better or worse when you ask for [xmin, xmax, ymin, ymax] versus [x, y, width, height] (or various permutations thereof)?

- Do you ask for these coordinates as integer pixels (whose meaning can vary with dimensions of the original image), or normalized between 0.0 and 1.0 (or 0–1000 as in this post)?

- Is it worth doing it in two rounds: send it back its initial response with the boxes drawn on it, to give it another opportunity to "see" its previous answer and adjust its coordinates?
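For the two-round idea, here's a rough sketch of the annotation step (the draw_boxes helper is mine, and it assumes coordinates come back as [y0, x0, y1, x1] on a 0-1000 normalized scale, which is only an assumption here):

    from PIL import Image, ImageDraw

    def draw_boxes(image_path, boxes):
        # boxes: list of [y0, x0, y1, x1], normalized 0-1000 (an assumption)
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        w, h = img.size
        for y0, x0, y1, x1 in boxes:
            draw.rectangle(
                [x0 / 1000 * w, y0 / 1000 * h, x1 / 1000 * w, y1 / 1000 * h],
                outline="red", width=3)
        return img

    # Round 2: send the annotated page back with a prompt like
    # "Here are your boxes drawn on the page; adjust any that are offset."
    draw_boxes("page_001.png", [[120, 80, 340, 920]]).save("page_001_annotated.png")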

I ought to look into these things, but I'm wondering: as you (or others) work on something like this, how do you keep track of which prompts seem to be working better? Do you log all requests and responses / scores as you go? I didn't do that for my initial attempts, and it feels a bit like shooting in the dark / trying random things until something works.
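(One minimal way to do that logging, sketched below: append every attempt to a JSONL file. The field names are just one possible layout, and the IoU scoring is only a suggestion.)

    import json, time

    def log_attempt(path, prompt, response_text, score=None, notes=""):
        record = {
            "ts": time.time(),
            "prompt": prompt,
            "response": response_text,
            "score": score,   # e.g. mean IoU against a few hand-labeled boxes
            "notes": notes,
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")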



The model seems to be trained to pick up on the words "bounding box" or "segmentation mask", and if you ask it for JSON it returns Array<{ box_2d: [number, number, number, number], label: string, mask: "base64/png" }>, where box_2d is [y0, x0, y1, x1] for a bounding box.
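For example, a hedged sketch with the google-generativeai Python SDK (the model name, prompt wording, and fence handling are mine, not something the docs prescribe):

    import json
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    page = Image.open("page_001.png")
    prompt = ('Return bounding boxes for every paragraph on this page as JSON: '
              '[{"box_2d": [y0, x0, y1, x1], "label": "paragraph"}]')
    resp = model.generate_content([page, prompt])

    raw = resp.text.strip()
    if raw.startswith("```"):
        raw = raw.strip("`").removeprefix("json").strip()  # drop markdown fences if present
    boxes = json.loads(raw)  # each box_2d is [y0, x0, y1, x1] on a 0-1000 scale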

I recommend the Gemini docs here; they are explicit about some of these points.

Prompts matter too, less is more.

And you need to submit images to get good bounding boxes. You can somewhat infer this from the token counts, but the Gemini APIs do something to PDFs (OCR, I assume) that causes them to completely lose location context on the page. If you send the page in as an image, that context isn't lost and the boxes are great.

As an example of this, you can send a PDF page with text on only the top half, the bottom half empty. If you ask it to draw a bounding box around the last paragraph, it tends to return a result with a much higher number on the normalized scale (i.e., further down the page) than it should. In one experiment I did, it thought a footer text that was actually about 2/3 down the page was all the way at the end. When I sent the page as an image, it put the footer around the 660 mark on the normalized 0-1000 scale, exactly where you would expect it.
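The rasterization step is cheap, e.g. with PyMuPDF (a sketch; the DPI and filenames are arbitrary):

    import fitz  # PyMuPDF

    doc = fitz.open("scanned.pdf")
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=200)   # render the page as a raster image
        pix.save(f"page_{i:03d}.png")    # send these PNGs to the model instead of the PDF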


You've got to be careful with PDFs: we can't see how they are rendered internally for the LLM, so there may be differences in how it treats the margins/gutters/bleeds that we should account for (and cannot).



