I don't understand. Why would much of the vocabulary be dedicated to rare Chinese characters? Wouldn't those need to show up in the training data first? And if they did, wouldn't they also show up as weird byte sequences? And aren't UTF-8 byte sequences kinda risky for everything other than ASCII, since only ASCII bytes and header bytes are unambiguous, whereas continuation bytes (10xxxxxx) are very ambiguous individually? I mean, sure, the LLM would notice that their meaning changes depending on the preceding continuation and header bytes, but it's still not clear to me why UTF-8 bytes are better for LLMs than characters (or even grapheme clusters). UTF-8 bytes seem like a very arbitrary choice to me. Why not do UTF-9 instead and get the most important Latin letters as single nine-bit bytes?
Yes, rare Chinese characters do show up in the training data (the rarest of them at least appear in lists of characters), and yes, they get tokenized as weird byte sequences, which makes the model work harder to process them. But it's better for that to happen to rare characters than to common words. It's a tradeoff.
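To make the tradeoff concrete, here's a quick sketch in Python (the character chosen is just an arbitrary example of a rare one):

```python
# Common ASCII text stays at one byte per character under UTF-8,
# while a rare CJK character expands to three bytes.
common = "the"
rare = "龘"  # a rare Chinese character, U+9F98

print(list(common.encode("utf-8")))  # [116, 104, 101] — one byte per letter
print(list(rare.encode("utf-8")))    # [233, 190, 152] — three bytes, one character
```

So a byte-level vocabulary pays a length penalty on rare characters, but that penalty lands exactly where it hurts the least: on text the model rarely sees.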
And of course UTF-8 is unlikely to be the single best encoding (e.g. Anthropic has a tokenizer that turns all-caps text into a special caps-lock symbol plus the regular-case equivalent), but much of the gap is papered over by byte-pair encoding. E.g. the most important Latin letters appear often enough that they get dedicated tokens anyway.
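The "papered over" part is just repeated pair-merging over the byte stream. A toy sketch of the idea (not any real tokenizer's implementation), assuming a tiny corpus:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with the merged `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from raw UTF-8 bytes; frequent pairs get fused into new tokens.
corpus = list("the theory of the thing".encode("utf-8"))
pair = most_frequent_pair(corpus)   # (116, 104), i.e. "th" — the most common pair here
corpus = merge(corpus, pair, "th")
```

The same process is what gives frequent multi-byte sequences, whether Latin letter combinations or common CJK characters, their own tokens: anything that occurs often enough in the training corpus gets merged, regardless of how many UTF-8 bytes it spans.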