Representation of Byte-Level Byte Pair Encoding Tokens Ever wondered why your favourite language model tokenizers use G with dot above for tokens starting with spaces? 2026/01/08 comp @nlp tini[1] #tokenizer #unicode