
Strongly recommended by Karpathy, a must-read on tokenization: automatically fishing out the tokens that drive large models "crazy"

Author: QbitAI (Quantum Position)

Yuyang, from Aofei Temple

QbitAI | WeChat official account QbitAI

Regarding tokenization, Karpathy has just recommended a new must-read paper.

The topic: automatically detecting the tokens that cause large models to "glitch".


To put it simply: because a large model's tokenizer is created separately from the model's training, some tokens may appear rarely, or never, in the training data. These "undertrained" tokens can cause the model to produce anomalous output.

The most classic example is SolidGoldMagikarp.

This word once made ChatGPT "speak gibberish": as soon as a prompt contained it, ChatGPT would start misreading the text and generating chaotic output.


Now, researchers from Cohere have proposed an effective method for detecting these "glitch" tokens. They also found that undertrained tokens are prevalent, to varying degrees, in many mainstream open-source large language models, including the Llama and Mistral series.

P.S. Cohere was co-founded by Aidan Gomez, the youngest author of the Transformer paper. The company previously launched the open-source Command R series of models and reached a valuation of $2.2 billion in June last year.

Automatically detect undertrained tokens in LLMs

The method proposed by the researchers consists of three main steps.

First, the tokenizer itself is analyzed: by inspecting its vocabulary and observing its encoding/decoding behavior, special categories of tokens are identified, such as incomplete UTF-8 sequences.
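For a concrete feel of this first step, here is a minimal sketch of such a vocabulary scan, assuming a Hugging Face tokenizer. The model name is just an example, and the two checks are simplified heuristics rather than the paper's exact procedure:

```python
# Sketch: scan a tokenizer's vocabulary for "special" token categories
# (partial UTF-8 byte sequences, unreachable tokens). Heuristics are illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model

partial_utf8, unreachable = [], []
for token, token_id in tok.get_vocab().items():
    decoded = tok.decode([token_id])
    # A replacement character usually means the token is an incomplete UTF-8 byte sequence.
    if "\ufffd" in decoded:
        partial_utf8.append(token)
        continue
    # Unreachable: decoding and then re-encoding never reproduces this token id.
    if token_id not in tok.encode(decoded, add_special_tokens=False):
        unreachable.append(token)

print(f"partial UTF-8: {len(partial_utf8)}, unreachable: {len(unreachable)}")
```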

Then, depending on the model architecture, indicators are computed to find tokens with anomalous embedding vectors, which are added to the candidate list of "undertrained" tokens.

For example, for models with tied embeddings, a set of known-unused embeddings is used, via principal component analysis, to remove the constant component of the unembedding matrix.

The cosine distance between the remaining tokens' embeddings and these unused embeddings is then computed as the "undertraining" indicator.

For models without tied embeddings, the L2 norm of each token's embedding vector can be used directly for detection.
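The rough sketch below illustrates both indicators in PyTorch. The reference "unused" token, the full-matrix PCA, and the top-50 cut-off are illustrative placeholders, not the paper's exact settings:

```python
# Sketch: rank tokens by how "untrained" their (un)embedding vectors look.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

U = model.get_output_embeddings().weight.detach().float()   # unembedding matrix, (vocab, dim)

if model.config.tie_word_embeddings:
    # Tied embeddings: remove a constant direction (approximated here by the first
    # principal component of the full matrix), then measure cosine similarity to a
    # known-unused reference embedding; high similarity is suspicious.
    U = U - U.mean(dim=0)
    _, _, V = torch.pca_lowrank(U, q=1)
    U = U - (U @ V) @ V.T                                    # project out the first PC
    unused_ids = [tok.convert_tokens_to_ids(t) for t in ["<unk>"]]  # placeholder reference set
    ref = U[unused_ids].mean(dim=0)
    score = torch.nn.functional.cosine_similarity(U, ref.unsqueeze(0))
else:
    # Untied embeddings: a small L2 norm of the unembedding row already signals undertraining.
    score = -U.norm(dim=-1)                                   # high score = small norm = suspicious

candidates = score.topk(50).indices.tolist()                  # top-50 is an arbitrary cut-off
print(tok.convert_ids_to_tokens(candidates))
```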


Finally, specific prompts are used to verify that the candidate tokens indeed fall outside the training data distribution and do trigger anomalous output.
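A crude version of this verification step might look like the sketch below, which simply asks the model to repeat a candidate string. The prompt wording is illustrative and not the paper's actual verification prompt:

```python
# Sketch: crude check that a candidate token misbehaves by asking the model to repeat it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

def can_repeat(token_str: str) -> bool:
    """A well-trained string usually comes back unchanged; an undertrained token often does not."""
    prompt = f'Please repeat the following string exactly: "{token_str}"\nAnswer: "'
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    completion = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return token_str in completion

print(can_repeat("SolidGoldMagikarp"))
```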


After applying the method to several mainstream open-source large language models, the researchers found that such "crazy-making" undertrained tokens are ubiquitous in these models, and they dug up thousands of them in one pass.


Common types include:

  • Single-byte tokens, especially bytes that are unused in the UTF-8 standard, such as 0xF5-0xFF (see the sketch after this list);
  • Intermediate tokens produced during byte-pair encoding (BPE) merges that never get fully trained;
  • Special tokens such as <pad> and <unk>.
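For the single-byte case, the 13 byte values that can never occur in valid UTF-8 text are easy to enumerate. The sketch below checks which of them nonetheless have their own token in a vocabulary; the "<0xNN>" spelling assumes a SentencePiece-style byte-fallback tokenizer, and other tokenizers may represent raw bytes differently:

```python
# Sketch: list the bytes that cannot appear in valid UTF-8 and check which of them
# still exist as single-byte tokens in a given vocabulary.
from transformers import AutoTokenizer

INVALID_UTF8_BYTES = [0xC0, 0xC1] + list(range(0xF5, 0x100))   # the 13 impossible bytes

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model
vocab = tok.get_vocab()

present = [b for b in INVALID_UTF8_BYTES if f"<0x{b:02X}>" in vocab]
print(f"{len(present)} of the 13 UTF-8-impossible bytes have their own token:",
      [f"0x{b:02X}" for b in present])
```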

The researchers also found that models with larger vocabularies had significantly more "undertrained" tokens.

This is because a larger vocabulary means a sparser token distribution and finer-grained token splitting, which inevitably produces more low-frequency tokens and meaningless token fragments, raising the proportion of "undertrained" tokens. A larger vocabulary also makes the model harder to optimize during training.

It is also worth noting that, according to the paper, models built on the same tokenizer behave similarly, while differences in tokenizer implementation, configuration, and training data lead to clear differences in "undertrained" tokens across models.

This paper argues that optimizing the vocabulary structure and tokenizer algorithm is the key to solving the problem of insufficient token training.

They also made a number of recommendations:

  • Make sure the preprocessing of input data is exactly the same across tokenizer training data, model training data, and model inference.
  • Make sure the model training data and the tokenizer are aligned, especially when training a new base model from scratch.
  • For single-byte tokens, either include all 256 bytes in the vocabulary with no duplicates, or exclude only the 13 bytes that cannot appear in valid UTF-8 text (0xC0/0xC1 and 0xF5-0xFF).
  • After the tokenizer is trained, check for unreachable tokens by encoding and decoding the vocabulary, and make sure manually added tokens are handled correctly.
  • When publishing both "fast" and "slow" versions of a tokenizer on Hugging Face, make sure they produce identical output (see the sketch after this list).
  • When training a base model, check for undertrained tokens in a small-scale test and reconsider the tokenization method and data if needed. Running tests on different corpora can also uncover preprocessing errors that introduce "glitchy" inputs into the main training data.
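As a minimal illustration of the fast/slow consistency check mentioned above, the sketch below encodes a few arbitrary probe strings with both tokenizer implementations and reports any mismatch:

```python
# Sketch: compare the "fast" (Rust-backed) and "slow" (Python) tokenizers of the same model.
from transformers import AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # example model
fast = AutoTokenizer.from_pretrained(name, use_fast=True)
slow = AutoTokenizer.from_pretrained(name, use_fast=False)

probes = ["Hello world", " SolidGoldMagikarp", "émoji 🙂", "<pad></s>"]  # arbitrary examples
for text in probes:
    a = fast.encode(text, add_special_tokens=False)
    b = slow.encode(text, add_special_tokens=False)
    if a != b:
        print(f"Mismatch on {text!r}: fast={a} slow={b}")
```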

Paper address:

https://arxiv.org/abs/2405.05417

— END —

QbitAI · Signed Toutiao account

Follow us and be the first to know about cutting-edge technology trends
