Hacker News

> We represent the robot actions as text strings as shown below. An example of such a string could be a sequence of robot action token numbers: “1 128 91 241 5 101 127 217”.

Training with numbers like this might be a little problematic; I have tried to fine-tune GPT-4o-mini this way with very little success (just me?).

On the other hand, I found [1] that Gemini and Molmo are able to locate elements on screen much better than 4o.

1. https://github.com/BandarLabs/clickclickclick



They do not turn the actions into text that is then tokenized; they generate tokens directly. So the action token 128 doesn't necessarily correspond to the tokenization of the number 128 when it appears in text input. (The exception is PaLI-X: there they exploit the fact that integers up to 1000 have unique tokens and use those for the actions. For PaLM-E, they instead hijack the 256 least frequently used tokens.)
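A minimal sketch of that scheme, assuming 256 bins per action dimension and actions clipped to [-1, 1] (the actual bin count, range, and reserved token IDs are model-specific and made up here):

```python
def action_to_bins(action, num_bins=256, low=-1.0, high=1.0):
    """Discretize each continuous action dimension into an integer bin index."""
    bins = []
    for a in action:
        a = min(max(a, low), high)  # clip to the valid range
        idx = int((a - low) / (high - low) * (num_bins - 1) + 0.5)  # round to nearest bin
        bins.append(idx)
    return bins

def bins_to_token_ids(bins, reserved_token_ids):
    """Map each bin index to a reserved vocabulary token ID.

    For PaLM-E-style overwriting, `reserved_token_ids` would be the 256
    least frequently used tokens in the vocabulary; for PaLI-X, the
    existing tokens for the integers themselves. Both are stand-ins here.
    """
    return [reserved_token_ids[b] for b in bins]

# Hypothetical vocabulary: pretend the last 256 slots are reserved for actions.
VOCAB_SIZE = 32000
RESERVED = list(range(VOCAB_SIZE - 256, VOCAB_SIZE))

bins = action_to_bins([0.0, 1.0, -1.0])   # middle, top, and bottom of the range
token_ids = bins_to_token_ids(bins, RESERVED)
print(bins)
print(token_ids)
```

The point is that the model's output distribution is over these token IDs directly, so no string like "128" is ever produced or re-tokenized.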




