I don't understand. Why would much of the vocabulary be dedicated to rare Chinese characters? Wouldn't those need to show up in the training data first? And if they did, wouldn't they also show up as weird byte sequences? And aren't UTF-8 byte sequences kinda risky for everything other than ASCII, since only ASCII bytes and header bytes are unambiguous, whereas continuation bytes (10xxxxxx) are very ambiguous individually? I mean, sure, the LLM would notice that their meaning changes depending on the preceding continuation and header bytes, but it's still not clear to me why UTF-8 bytes are better for LLMs than characters (or even grapheme clusters). UTF-8 bytes seem like a very arbitrary choice to me. Why not do UTF-9 instead and get the most important Latin letters as single nine-bit bytes?
Yes, rare Chinese characters do show up in the training data (the rarest of them at least appear in lists of characters), and yes, they get tokenized as weird byte sequences, which makes the model work harder to process them. But it's better for that to happen to rare characters than to common words. It's a tradeoff.
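To make the tradeoff concrete, here's a quick sketch in Python (the character chosen is just an arbitrary example of a rare one):

```python
# Common ASCII text stays at one byte per character under UTF-8,
# while a rare CJK character expands to three bytes.
common = "the"
rare = "龘"  # a rare Chinese character, U+9F98

print(list(common.encode("utf-8")))  # [116, 104, 101] — one byte per letter
print(list(rare.encode("utf-8")))    # [233, 190, 152] — three bytes, one character
```

So a byte-level vocabulary pays a length penalty on rare characters, but that penalty lands exactly where it hurts the least: on text the model rarely sees.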
And of course UTF-8 is unlikely to be the single best encoding (e.g. Anthropic has a tokenizer that turns all-caps text into a special caps-lock symbol plus the regular-case equivalent), but much of the gap is papered over by byte-pair encoding. E.g. the most important Latin letters appear often enough that they get dedicated tokens anyway.
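The "papered over" part is just repeated pair-merging over the byte stream. A toy sketch of the idea (not any real tokenizer's implementation), assuming a tiny corpus:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with the merged `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from raw UTF-8 bytes; frequent pairs get fused into new tokens.
corpus = list("the theory of the thing".encode("utf-8"))
pair = most_frequent_pair(corpus)   # (116, 104), i.e. "th" — the most common pair here
corpus = merge(corpus, pair, "th")
```

The same process is what gives frequent multi-byte sequences, whether Latin letter combinations or common CJK characters, their own tokens: anything that occurs often enough in the training corpus gets merged, regardless of how many UTF-8 bytes it spans.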