
So the space character is part of the token?


Yup. Most common words have several distinct tokens: the word, the word with a capital letter, the word with a leading space, and sometimes the word in all caps too.

Try searching for different words using the search box here: https://observablehq.com/@simonw/gpt-tokenizer#cell-135
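If you want to poke at this outside the notebook, here's a minimal sketch using the tiktoken library (assuming it's installed locally); the exact IDs you get back depend on which encoding you load:

    # Minimal sketch using tiktoken (assumes `pip install tiktoken`);
    # "cl100k_base" is the encoding used by the GPT-3.5/GPT-4 family.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for variant in ["hello", " hello", "Hello", " Hello", "HELLO"]:
        # Each surface form gets its own token ID if it's in the vocabulary,
        # otherwise it falls back to multiple sub-word tokens.
        print(repr(variant), "->", enc.encode(variant))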


I wonder if the embeddings could be explicitly configured to account for these “symmetries”. E.g.: instead of storing separate full copies of the “variants”, maybe keep a reduced representation with a common prefix and only a small subset of the embedding vector that is allowed to be learned?

This could force the model to correctly learn how to capitalise, make all-caps, etc…
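As far as I know nothing ships this, but a rough sketch of the idea (the class name and the token-to-base-word mapping below are made up for illustration) could look like:

    # Hypothetical sketch: variants ("word", " word", "Word", "WORD") share one full
    # base vector; each variant only learns a small k-dimensional correction on top.
    import torch
    import torch.nn as nn

    class TiedVariantEmbedding(nn.Module):
        def __init__(self, n_base_words, n_tokens, d_model, k, base_id):
            super().__init__()
            self.base = nn.Embedding(n_base_words, d_model)  # one full vector per base word
            self.delta = nn.Embedding(n_tokens, k)           # tiny learnable part per variant
            self.register_buffer("base_id", base_id)         # token id -> base word id (precomputed)
            self.k = k

        def forward(self, token_ids):
            base = self.base(self.base_id[token_ids])
            delta = self.delta(token_ids)
            # Only the first k dimensions differ between variants of the same word.
            return torch.cat([base[..., : self.k] + delta, base[..., self.k :]], dim=-1)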


There was some discussion of doing this for RWKV, but I don't think it has actually been implemented yet.

The goal is simply to speed up training slightly; it wouldn't actually make a difference to the final performance of a model as big as GPT-4 (except maybe by decreasing the prevalence of glitch tokens).


> wouldn't actually make a difference to the final performance

Doesn't that assume that the embeddings learned are in some sense "perfect"? Is that actually the case in practice?

I would expect the learned embeddings to have some errors, especially for the rarer ones that have few examples available for the model to learn from.

I also thought that explicitly accounting for symmetries always improved model performance, because then it doesn't waste parameters learning things that aren't unique and interesting pieces of information.


Thing is, when you consider the tasks you actually want to optimize the models for, quite a few things mentioned in this discussion - learning to capitalise correctly, make all-caps, count syllables, act on specific counts of letters - fall into the category of uninteresting things you don't want to waste parameters on. Sure, they'd help with some trick questions that hinge on the peculiarities of how exactly we encode stuff in letters, but that's the whole thing we want to abstract away: going beyond textual encoding (or verbal encoding, or pictures as rectangles of pixels) towards what the utterance means. Not only do we want to abstract away from spelling mistakes or variations, but also from much larger changes to the text, like different grammar structures that say the same thing, or even saying the same thing in a different language in a different alphabet.


You have to represent spaces in some way (you want to make a distinction between "therapist" and "the rapist"), and different tokenizers do it differently. One option is to include the space as part of the token; another commonly used option is to include the lack of a space as part of the token, by adding a specific mark representing "the word goes on" at the end.
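For example, here's a quick way to compare the two conventions on real vocabularies (a sketch assuming the Hugging Face transformers library is installed; the exact splits depend on each model's vocab):

    # GPT-2's BPE folds the leading space into the token itself (shown as "Ġ"),
    # while BERT's WordPiece marks word-continuation pieces with a "##" prefix.
    from transformers import AutoTokenizer

    gpt2 = AutoTokenizer.from_pretrained("gpt2")
    bert = AutoTokenizer.from_pretrained("bert-base-uncased")

    for text in ["therapist", "the rapist"]:
        print(text, "->", gpt2.tokenize(text), "|", bert.tokenize(text))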


This can vary by BPE tokenizer. The original GPT-2/GPT-3 tokenizer was weirder about it.



