I really, really wish someone would try tokenizing off of a phonetic representation rather than a textual one. I think it would be interesting to compare the output.
I can see the theoretical advantages of such a concept, but I think a key limitation is that we don't have enough data with accurate phonetic representations.
The potential advantage of using a phonetic representation is that it can carry relevant information that written spelling does not. However, if you take the written spelling and pass it through rules that transform it into an approximate phonetic representation... that transformation can only destroy information, not add it; you'd be better off using the source data directly.
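To make the information-loss point concrete, here's a toy sketch (the rule table is entirely hypothetical, not a real grapheme-to-phoneme system): distinct spellings can collapse onto one phonetic form, so the mapping is not invertible.

```python
# Toy illustration (hypothetical entries, not a real G2P system):
# several distinct spellings map to the same phonetic string, so
# converting spelling -> phonemes discards information.
toy_g2p = {
    "there": "DH EH R",
    "their": "DH EH R",
    "they're": "DH EH R",
    "read": "R IY D",  # present tense; past-tense "read" is R EH D,
                       # but spelling alone can't tell us which
}
spellings = list(toy_g2p)
phonetic = [toy_g2p[w] for w in spellings]
print(len(set(spellings)), "spellings ->", len(set(phonetic)), "phonetic forms")
# -> 4 spellings -> 2 phonetic forms
```

Going the other way loses information too (homophones become indistinguishable), which is why a derived phonetic view can't beat the source text it was derived from.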
Now, if at some point we get to a place where most of the training data is audio (i.e., the quantity of spoken words in available audio data becomes larger than the current written data on the internet and in libraries), then a phonetic representation would make perfect sense, being closer to the source data.
But if we're talking about purely tokenization, I think your suggestion is effectively halfway towards morphologically based tokenization, splitting text into morphemes (which tend to map to semantics), and that is being explored. The problem is that for an apples-to-apples comparison you need equally sized models, and changes to tokenization require completely retraining the model; doing a comparison at GPT-3 or GPT-4 scale is very expensive (too expensive for "would be interesting" to justify it), and measuring the effect on small models won't necessarily indicate how it would affect large models.
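For intuition, morpheme-based tokenization can be sketched as a greedy longest-match segmenter over a morpheme lexicon. The lexicon below is a tiny hypothetical one for illustration only; real morphological tokenizers learn their inventory from data (e.g., Morfessor-style unsupervised segmentation).

```python
# Toy greedy morpheme segmenter (hypothetical lexicon, illustration only).
MORPHEMES = {"un", "happi", "ness", "ly", "walk", "ed", "ing", "re", "do"}

def segment(word):
    """Greedy longest-match segmentation; unknown spans fall back to chars."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in MORPHEMES:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # no morpheme matched: emit one character
            i += 1
    return pieces

print(segment("unhappiness"))  # -> ['un', 'happi', 'ness']
print(segment("redoing"))      # -> ['re', 'do', 'ing']
```

Greedy matching is the simplest possible strategy; it illustrates the idea but a real system would score segmentations probabilistically.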
It doesn't matter; the "bitter lesson", as coined by Rich Sutton, is that stacking more layers with more parameters, compute, and dataset size is going to swamp any kind of clever feature engineering, like trying to be clever about phonetic tokens. Karpathy, for example, just wants to go back to byte tokens.
If we don't fix the issues caused by tokenizers, then techniques that literally remove superfluous computation (e.g., by filtering the LLM's probability distribution) are useful as a stop-gap.
Yes, but how many extra layers and how much computing power do you need? Of course, phonetic tokens are an awkward idea, but there is a reason the word "human" is encoded as only one token.
I don't think that is intuitive at all. "Clever feature engineering" like deriving columns from calculations on tabular data, sure: you're not going to move the needle. But the basic representation of unstructured data like text could very plausibly change the required parameters, layers, and compute by orders of magnitude.
If you replace a tokenizer averaging 5 bytes per token with a byte-level representation, you now need 5 times as much memory and (depending on the specifics of the attention mechanism) 11 to 25 times as much compute.
At the scales we're talking about, that's quite a hefty price to pay, and it doesn't even take into account that you might need more layers to replace the processing that was implicitly done by the tokenizer.
> I think that’s part of the reason I would love to see somebody try it, because intuitively I think it would make a difference, but it may not.
I agree with you, and I'm SHOCKED at how little work there actually is on phonetics within the NLP community. Consider that most of the phonetic tools I am using to enforce rhyming or similar syntactic constraints in Constrained Text Generation Studio (https://github.com/Hellisotherpeople/Constrained-Text-Genera...) were built circa 2014, such as the CMU rhyming dictionary. In most cases, I could not find better modern implementations of these tools.
I did learn an awful lot about phonetic representations and matching algorithms. Things like "soundex" and "double metaphone" now make sense to me and are fascinating to read about.
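For anyone curious, American Soundex is small enough to sketch in a few lines: keep the first letter, map the rest to digit classes, drop vowels, and collapse adjacent repeats (H and W are skipped without breaking a run). This is a minimal sketch; tested implementations exist in libraries like `jellyfish`.

```python
# Minimal sketch of American Soundex (the classic phonetic hashing scheme).
CODES = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}

def _digit(c):
    for letters, d in CODES.items():
        if c in letters:
            return d
    return ""  # vowels, Y, H, W carry no digit

def soundex(word):
    word = word.upper()
    out, prev = word[0], _digit(word[0])
    for c in word[1:]:
        if c in "HW":        # H/W are skipped and do NOT break a run
            continue
        d = _digit(c)
        if d and d != prev:  # vowels reset prev, so repeats across vowels count
            out += d
        prev = d
    return (out + "000")[:4]

print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))
# -> R163 R163 A261
```

Note how "Robert" and "Rupert" hash to the same code: that collision-by-design is exactly what makes it useful for fuzzy name matching, and double metaphone refines the same idea.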
Probably better to skip that and go for characters or bytes, since the model can simply learn morphemes or phonemes from the smallest structure available. Alas, the context-size problem is the main pressure against this.