
Strongly recommended by Karpathy, a must-read on tokenization: automatically fishing out the tokens that drive large models "crazy"

Author: QbitAI (Quantum Position)

Yuyang, from Aofei Temple

QbitAI | WeChat official account QbitAI

Regarding tokenization, Karpathy has just recommended a new must-read paper.

The topic: automatically detecting the tokens that cause large models to "glitch".


To put it simply: because a large model's tokenizer is created separately from the model's training, some tokens may appear rarely, or never, in the training data. These "undertrained" tokens can cause the model to produce anomalous output.

The most classic example is SolidGoldMagikarp.

This word once made ChatGPT "speak gibberish": as soon as a prompt contained it, ChatGPT would start misreading the text and generating chaotic output.


Now, researchers from Cohere have proposed an effective method for detecting these "glitch" tokens. They also found that undertrained tokens are prevalent, to varying degrees, in many mainstream open-source large language models, including the Llama and Mistral series.

P.S. Cohere was co-founded by Aidan Gomez, the youngest author of the Transformer paper. The company previously launched the open-source Command R series of models and reached a valuation of $2.2 billion in June last year.

Automatically detect undertrained tokens in LLMs

The method proposed by the researchers consists of three main steps.

First, the tokenizer itself is analyzed: by inspecting its vocabulary and observing its encoding/decoding behavior, special categories of tokens are identified, such as incomplete UTF-8 sequences.
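For a concrete feel of this first step, here is a minimal sketch of such a vocabulary scan, assuming a Hugging Face tokenizer. The model name is just an example, and the two checks are simplified heuristics rather than the paper's exact procedure:

```python
# Sketch: scan a tokenizer's vocabulary for "special" token categories
# (partial UTF-8 byte sequences, unreachable tokens). Heuristics are illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model

partial_utf8, unreachable = [], []
for token, token_id in tok.get_vocab().items():
    decoded = tok.decode([token_id])
    # A replacement character usually means the token is an incomplete UTF-8 byte sequence.
    if "\ufffd" in decoded:
        partial_utf8.append(token)
        continue
    # Unreachable: decoding and then re-encoding never reproduces this token id.
    if token_id not in tok.encode(decoded, add_special_tokens=False):
        unreachable.append(token)

print(f"partial UTF-8: {len(partial_utf8)}, unreachable: {len(unreachable)}")
```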

Then, depending on the model architecture, indicators are computed to find tokens with anomalous embedding vectors, which are added to the candidate list of "undertrained" tokens.

For example, for models with tied embeddings, a set of known-unused embeddings is used, via principal component analysis, to remove the constant component of the unembedding matrix.

The cosine distance between the remaining tokens' embeddings and these unused embeddings is then computed as the "undertraining" indicator.

For models without tied embeddings, the L2 norm of each token's embedding vector can be used directly for detection.
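The rough sketch below illustrates both indicators in PyTorch. The reference "unused" token, the full-matrix PCA, and the top-50 cut-off are illustrative placeholders, not the paper's exact settings:

```python
# Sketch: rank tokens by how "untrained" their (un)embedding vectors look.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

U = model.get_output_embeddings().weight.detach().float()   # unembedding matrix, (vocab, dim)

if model.config.tie_word_embeddings:
    # Tied embeddings: remove a constant direction (approximated here by the first
    # principal component of the full matrix), then measure cosine similarity to a
    # known-unused reference embedding; high similarity is suspicious.
    U = U - U.mean(dim=0)
    _, _, V = torch.pca_lowrank(U, q=1)
    U = U - (U @ V) @ V.T                                    # project out the first PC
    unused_ids = [tok.convert_tokens_to_ids(t) for t in ["<unk>"]]  # placeholder reference set
    ref = U[unused_ids].mean(dim=0)
    score = torch.nn.functional.cosine_similarity(U, ref.unsqueeze(0))
else:
    # Untied embeddings: a small L2 norm of the unembedding row already signals undertraining.
    score = -U.norm(dim=-1)                                   # high score = small norm = suspicious

candidates = score.topk(50).indices.tolist()                  # top-50 is an arbitrary cut-off
print(tok.convert_ids_to_tokens(candidates))
```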


Finally, specific prompts are used to verify that the candidate tokens indeed fall outside the training data distribution and do trigger anomalous output.
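A crude version of this verification step might look like the sketch below, which simply asks the model to repeat a candidate string. The prompt wording is illustrative and not the paper's actual verification prompt:

```python
# Sketch: crude check that a candidate token misbehaves by asking the model to repeat it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

def can_repeat(token_str: str) -> bool:
    """A well-trained string usually comes back unchanged; an undertrained token often does not."""
    prompt = f'Please repeat the following string exactly: "{token_str}"\nAnswer: "'
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    completion = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return token_str in completion

print(can_repeat("SolidGoldMagikarp"))
```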


After applying the method to several mainstream open-source large language models, the researchers found that such "crazy-making" undertrained tokens are ubiquitous in these models, and they dug up thousands of them in one pass.


Common types include:

  • Single-byte tokens, especially bytes that are unused in the UTF-8 standard, such as 0xF5-0xFF (see the sketch after this list);
  • Intermediate tokens produced during byte-pair encoding (BPE) merges that never get fully trained;
  • Special tokens such as <pad> and <unk>.
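For the single-byte case, the 13 byte values that can never occur in valid UTF-8 text are easy to enumerate. The sketch below checks which of them nonetheless have their own token in a vocabulary; the "<0xNN>" spelling assumes a SentencePiece-style byte-fallback tokenizer, and other tokenizers may represent raw bytes differently:

```python
# Sketch: list the bytes that cannot appear in valid UTF-8 and check which of them
# still exist as single-byte tokens in a given vocabulary.
from transformers import AutoTokenizer

INVALID_UTF8_BYTES = [0xC0, 0xC1] + list(range(0xF5, 0x100))   # the 13 impossible bytes

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model
vocab = tok.get_vocab()

present = [b for b in INVALID_UTF8_BYTES if f"<0x{b:02X}>" in vocab]
print(f"{len(present)} of the 13 UTF-8-impossible bytes have their own token:",
      [f"0x{b:02X}" for b in present])
```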

The researchers also found that models with larger vocabularies had significantly more "undertrained" tokens.

This is because a larger vocabulary means a sparser token distribution and finer-grained token splitting, which inevitably produces more low-frequency tokens and meaningless token fragments, raising the proportion of "undertrained" tokens. A larger vocabulary also makes the model harder to optimize during training.

It is also worth noting that, according to the paper, models built on the same tokenizer behave similarly, while differences in tokenizer implementation, configuration, and training data lead to clear differences in "undertrained" tokens across models.

This paper argues that optimizing the vocabulary structure and tokenizer algorithm is the key to solving the problem of insufficient token training.

They also made a number of recommendations:

  • Make sure the preprocessing of input data is exactly the same across tokenizer training data, model training data, and model inference.
  • Make sure the model training data and the tokenizer are aligned, especially when training a new base model from scratch.
  • For single-byte tokens, either include all 256 bytes in the vocabulary with no duplicates, or exclude only the 13 bytes that cannot appear in valid UTF-8 text (0xC0/0xC1 and 0xF5-0xFF).
  • After the tokenizer is trained, check for unreachable tokens by encoding and decoding the vocabulary, and make sure manually added tokens are handled correctly.
  • When publishing both "fast" and "slow" versions of a tokenizer on Hugging Face, make sure they produce identical output (see the sketch after this list).
  • When training a base model, check for undertrained tokens in a small-scale test and reconsider the tokenization method and data if needed. Running tests on different corpora can also uncover preprocessing errors that introduce "glitchy" inputs into the main training data.
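As a minimal illustration of the fast/slow consistency check mentioned above, the sketch below encodes a few arbitrary probe strings with both tokenizer implementations and reports any mismatch:

```python
# Sketch: compare the "fast" (Rust-backed) and "slow" (Python) tokenizers of the same model.
from transformers import AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # example model
fast = AutoTokenizer.from_pretrained(name, use_fast=True)
slow = AutoTokenizer.from_pretrained(name, use_fast=False)

probes = ["Hello world", " SolidGoldMagikarp", "émoji 🙂", "<pad></s>"]  # arbitrary examples
for text in probes:
    a = fast.encode(text, add_special_tokens=False)
    b = slow.encode(text, add_special_tokens=False)
    if a != b:
        print(f"Mismatch on {text!r}: fast={a} slow={b}")
```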

Paper address:

https://arxiv.org/abs/2405.05417

— END —

QbitAI · Signed Toutiao account

Follow us and be the first to know about cutting-edge technology trends
