I really, really wish someone would try tokenizing off of a phonetic representation rather than a textual one. I think it would be interesting to compare the output.
I can see the theoretical advantages of such a concept, but I think a key limitation is that we don't have enough data with accurate phonetic representations.
The potential advantage of using a phonetic representation is that it can carry relevant information that written spelling does not. However, if you take the written spelling and pass it through rules that transform it into an approximate phonetic representation... that transformation can only destroy information, not add it; you'd be better off using the source data directly.
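To make the information-loss point concrete, here's a toy sketch (the rule table is entirely hypothetical, not a real grapheme-to-phoneme system): distinct spellings can collapse onto one phonetic form, so the mapping is not invertible.

```python
# Toy illustration (hypothetical entries, not a real G2P system):
# several distinct spellings map to the same phonetic string, so
# converting spelling -> phonemes discards information.
toy_g2p = {
    "there": "DH EH R",
    "their": "DH EH R",
    "they're": "DH EH R",
    "read": "R IY D",  # present tense; past-tense "read" is R EH D,
                       # but spelling alone can't tell us which
}
spellings = list(toy_g2p)
phonetic = [toy_g2p[w] for w in spellings]
print(len(set(spellings)), "spellings ->", len(set(phonetic)), "phonetic forms")
# -> 4 spellings -> 2 phonetic forms
```

Going the other way loses information too (homophones become indistinguishable), which is why a derived phonetic view can't beat the source text it was derived from.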
Now, if at some point we get to a place where most of the training data is audio (i.e., the quantity of spoken words in available audio data becomes larger than the current written data on the internet and in libraries), then a phonetic representation would make perfect sense, being closer to the source data.
But if we're talking about purely tokenization, I think your suggestion is effectively halfway towards morphologically based tokenization, splitting text into morphemes (which tend to map to semantics), and that is being explored. The problem is that for an apples-to-apples comparison you need equally sized models, and changes to tokenization require completely retraining the model; doing a comparison at GPT-3 or GPT-4 scale is very expensive (too expensive for "would be interesting" to justify it), and measuring the effect on small models won't necessarily indicate how it would affect large models.
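For intuition, morpheme-based tokenization can be sketched as a greedy longest-match segmenter over a morpheme lexicon. The lexicon below is a tiny hypothetical one for illustration only; real morphological tokenizers learn their inventory from data (e.g., Morfessor-style unsupervised segmentation).

```python
# Toy greedy morpheme segmenter (hypothetical lexicon, illustration only).
MORPHEMES = {"un", "happi", "ness", "ly", "walk", "ed", "ing", "re", "do"}

def segment(word):
    """Greedy longest-match segmentation; unknown spans fall back to chars."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in MORPHEMES:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # no morpheme matched: emit one character
            i += 1
    return pieces

print(segment("unhappiness"))  # -> ['un', 'happi', 'ness']
print(segment("redoing"))      # -> ['re', 'do', 'ing']
```

Greedy matching is the simplest possible strategy; it illustrates the idea but a real system would score segmentations probabilistically.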
It doesn't matter; the "bitter lesson", as coined by Rich Sutton, is that stacking more layers with more parameters, compute, and dataset size is going to swamp any kind of clever feature engineering, like trying to be clever about phonetic tokens. Karpathy, for example, just wants to go back to byte tokens.
If we don't fix the issues caused by tokenizers, then techniques that literally remove superfluous computation (e.g., by filtering the LLM's probability distribution) are useful as a stop-gap.
Yes, but how many extra layers and how much computing power do you need? Of course, phonetic tokens are an awkward idea, but there is a reason the word "human" is encoded as only one token.
I don't think that is intuitive at all. "Clever feature engineering" like deriving columns from calculations on tabular data, sure: you're not going to move the needle. But the basic representation of unstructured data like text could very plausibly change the required parameters, layers, and compute by orders of magnitude.
If you replace a tokenizer averaging 5 bytes per token with a byte-level representation, you now need 5 times as much memory and (depending on the specifics of the attention mechanism) 11 to 25 times as much compute.
At the scales we're talking about, that's quite a hefty price to pay, and it doesn't even take into account that you might need more layers to replace the processing that was implicitly done by the tokenizer.
> I think that’s part of the reason I would love to see somebody try it, because intuitively I think it would make a difference, but it may not.
I agree with you, and I'm SHOCKED at how little work there actually is on phonetics within the NLP community. Consider that most of the phonetic tools I am using to enforce rhyming or similar syntactic constraints in Constrained Text Generation Studio (https://github.com/Hellisotherpeople/Constrained-Text-Genera...) were built circa 2014, such as the CMU rhyming dictionary. In most cases, I could not find better modern implementations of these tools.
I did learn an awful lot about phonetic representations and matching algorithms. Things like "soundex" and "double metaphone" now make sense to me and are fascinating to read about.
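For anyone curious, American Soundex is small enough to sketch in a few lines: keep the first letter, map the rest to digit classes, drop vowels, and collapse adjacent repeats (H and W are skipped without breaking a run). This is a minimal sketch; tested implementations exist in libraries like `jellyfish`.

```python
# Minimal sketch of American Soundex (the classic phonetic hashing scheme).
CODES = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}

def _digit(c):
    for letters, d in CODES.items():
        if c in letters:
            return d
    return ""  # vowels, Y, H, W carry no digit

def soundex(word):
    word = word.upper()
    out, prev = word[0], _digit(word[0])
    for c in word[1:]:
        if c in "HW":        # H/W are skipped and do NOT break a run
            continue
        d = _digit(c)
        if d and d != prev:  # vowels reset prev, so repeats across vowels count
            out += d
        prev = d
    return (out + "000")[:4]

print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))
# -> R163 R163 A261
```

Note how "Robert" and "Rupert" hash to the same code: that collision-by-design is exactly what makes it useful for fuzzy name matching, and double metaphone refines the same idea.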
Probably better to skip that and go for characters or bytes, since the model can simply learn morphemes or phonemes from the smallest structure available. Alas, the context-size problem is the main pressure against this.