Representation of Byte-Level Byte Pair Encoding Tokens Ever wondered why your favourite language model tokenizers use Ġ (G with dot above) for tokens starting with spaces? 2026/01/08 comp @nlp tini[1] #tokenizer #unicode
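A quick sketch of where that character comes from: GPT-2-style byte-level BPE maps every byte to a printable Unicode codepoint, and bytes outside the printable ASCII/Latin-1 ranges (including the space, 0x20) are shifted up by 256. This is the mapping from the original GPT-2 release; the snippet below reimplements it to show that space lands on U+0120, Ġ.

```python
def bytes_to_unicode():
    # Printable bytes map to themselves; all other bytes (control chars,
    # space, etc.) are remapped to chr(256 + n) so every byte gets a
    # visible, unambiguous character in the vocabulary.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAC + 1))
          + list(range(0xAE, 0xFF + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # → Ġ (U+0120): space is the 33rd shifted byte
```

Because 0x00 through 0x20 are all outside the printable ranges, the space byte is the 33rd remapped byte (n = 32), giving chr(256 + 32) = chr(0x120) = Ġ.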
Language Choice, Tasks, and Large Language Models Some large language models are trained mainly on English and others mainly on Chinese, so how do the bilingual speakers common in China choose which language to use when talking to them? 2025/11/09 comp @nlp @hci zh-CN #multilingual #lm #academia