
Andrew Ng's ChatGPT class goes viral: the AI gave up writing words backwards, but understood the whole world

Author: QbitAI (Quantum Bit)

Ming Min, Yang Jing | From Aofei Temple

Quantum Bit | Official account QbitAI

Who would have thought that, even today, ChatGPT still makes this kind of basic mistake?

Andrew Ng pointed this out in his newly launched course:

ChatGPT can't reverse words!

For example, ask it to reverse the word lollipop, and the output is pilollol, a complete jumble.


Oh, that's a bit of a surprise.

So much so that after a netizen who watched the class posted about it on Reddit, it immediately drew a crowd of onlookers, and the post quickly shot past 6k upvotes.


And this is not a one-off bug: netizens found that ChatGPT genuinely cannot complete this task, and our own tests confirmed it.


△ Tested on ChatGPT (GPT-3.5)

Nor can products such as Bard, Bing, or Wenxin Yiyan (ERNIE Bot).


△ Tested on Bard


△ Tested on Wenxin Yiyan

Others complained that ChatGPT is terrible at handling simple word tasks like these.

For example, playing the once-popular word game Wordle is a disaster for it: it never gets the answer right.


Huh? Why on earth is that?

The key is tokens

The root of this phenomenon lies in tokens. Tokens are the most common character sequences in text, and large models use tokens to process text.

A token can be an entire word or a fragment of a word. Large models learn the statistical relationships between these tokens and are good at generating the next token.

So when handling the small task of word reversal, the model may simply reverse the tokens rather than the letters.
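
A minimal sketch of the difference, in plain Python. The split ["l", "oll", "ipop"] is the GPT-3-style split cited later in this article; treat it as an illustrative assumption rather than a guaranteed tokenizer output.

```python
# Sketch: why reversing tokens is not the same as reversing letters.
word = "lollipop"
tokens = ["l", "oll", "ipop"]   # assumed, GPT-3-style split for illustration

letter_reversal = word[::-1]                 # what the user actually wants
token_reversal = "".join(reversed(tokens))   # what a token-level flip produces

print(letter_reversal)  # popillol
print(token_reversal)   # ipopolll
```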


This is even more evident in a Chinese context: sometimes a whole word is one token, and sometimes a single character is one token.


In response to the example at the beginning, someone tried to understand the reasoning process of ChatGPT.


For a more intuitive understanding, OpenAI even released a GPT-3 Tokenizer.


For example, GPT-3 understands the word lollipop as l, oll, and ipop.

From accumulated experience, a set of unwritten rules of thumb has emerged:

  • 1 token ≈ 4 English characters ≈ three-quarters of a word;
  • 100 tokens ≈ 75 words;
  • 1–2 sentences ≈ 30 tokens;
  • 1 paragraph ≈ 100 tokens; 1,500 words ≈ 2,048 tokens;

How words are divided also depends on the language. It has previously been estimated that Chinese requires 1.2 to 2.7 times as many tokens as English.


The higher the token-to-character ratio, the higher the processing cost. Tokenizing Chinese is therefore more expensive than tokenizing English.
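
A small sketch of how to check this yourself, using OpenAI's open-source tiktoken library. The cl100k_base encoding and the two sample sentences are illustrative choices; exact counts and ratios depend on the encoding and the text.

```python
# Sketch: comparing token counts for English vs. Chinese text with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Large models use tokens to process text."
chinese = "大模型使用token来处理文本。"

en_tokens = enc.encode(english)
zh_tokens = enc.encode(chinese)

print(len(english), "chars ->", len(en_tokens), "tokens (English)")
print(len(chinese), "chars ->", len(zh_tokens), "tokens (Chinese)")
print("token ratio (zh/en):", len(zh_tokens) / len(en_tokens))
```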

Tokens can be understood as the way large models perceive the real human world. The approach is very simple, and it greatly reduces memory and time complexity.

But tokenizing words causes a problem: it can make it hard for the model to learn meaningful input representations. The most intuitive symptom is that it fails to understand the meaning of words.

Transformer tokenizers have already been optimized for this, for example by splitting a complex, uncommon word into a meaningful token plus an independent token.

For example, annoyingly is split into "annoying" and "ly": the former retains its semantics, and the latter appears frequently.
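
You can inspect such subword splits yourself; here is a sketch using Hugging Face's transformers library. The bert-base-uncased tokenizer is just one assumed example, and the exact pieces depend on the vocabulary, so the output may differ from the "annoying" + "ly" split above.

```python
# Sketch: inspecting how a subword tokenizer splits an uncommon word.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("annoyingly"))   # e.g. ['annoying', '##ly'] with WordPiece
```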

This is part of what makes ChatGPT and other large model products so impressive today: they understand human language very well.

As for the small task of word reversal that it can't handle, there are naturally workarounds.

The simplest and most direct one is to split the word apart yourself first~
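
The idea is that inserting separators forces each letter to become its own token, which the model can then handle. A tiny sketch; the prompt wording is purely illustrative.

```python
# Sketch: spell the word out letter by letter before handing it to the model.
word = "lollipop"
spelled_out = " ".join(word)          # "l o l l i p o p"
prompt = f"Reverse the following letters: {spelled_out}"
print(prompt)
```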


Or you can ask ChatGPT to go step by step and tokenize each letter first.


Or ask it to write a program that reverses the letters; the program's output turns out to be correct. (doge)
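
That program is trivial; here is a sketch of the sort of one-liner ChatGPT can produce, which reverses the letters correctly even when the model itself cannot.

```python
# Sketch: reversing the letters in code rather than asking the model to do it.
def reverse_word(word: str) -> str:
    return word[::-1]

print(reverse_word("lollipop"))  # popillol
```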


Of course, you can also just use GPT-4: in our tests it did not have this problem.


△ Tested on GPT-4

In short, tokens are the cornerstone of AI's understanding of natural language.

As a bridge for AI to understand human natural language, the importance of tokens is becoming more and more obvious.

They have become a key determinant of AI model performance and also serve as the billing unit for large models.

There is even token literature

As mentioned earlier, tokens help the model capture finer-grained semantic information, such as word meaning, word order, and grammatical structure. Their order and position are critical in sequence modeling tasks such as language modeling, machine translation, and text generation.

Only by accurately understanding each token's position and context in the sequence can the model make correct predictions and give reasonable output.

Therefore, the quality and quantity of tokens have a direct impact on model performance.

Since the beginning of this year, more and more large models have emphasized token counts at release. For example, the leaked details of Google's PaLM 2 mention that it was trained on 3.6 trillion tokens.

And many industry bigwigs have also said that tokens are really key!

Andrej Karpathy, an AI scientist who jumped from Tesla to OpenAI this year, said in a speech:

More tokens make the model think better.

And he emphasized that the performance of a model is not determined by the scale of the parameters alone.

For example, LLaMA has a much smaller parameter count than GPT-3 (65B vs. 175B), but because it was trained on more tokens (1.4T vs. 300B), LLaMA is more powerful.


Given their direct impact on model performance, tokens also serve as the billing unit for AI models.

Take OpenAI's pricing as an example: it bills in units of 1K tokens, and prices differ across models and token types.
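
A sketch of how per-1K-token billing works in general. The prices below are placeholder assumptions, not OpenAI's actual rates; see the pricing page in the references for real numbers.

```python
# Sketch of per-1K-token billing. The rates are hypothetical placeholders.
PRICE_PER_1K_INPUT = 0.0015   # assumed $ per 1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.002   # assumed $ per 1K completion tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (prompt_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (completion_tokens / 1000) * PRICE_PER_1K_OUTPUT

print(estimate_cost(1200, 350))
```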


In short, once you step through the door into the world of large AI models, you will find that tokens are a concept you cannot get around.

Well, it has even spawned "token literature" ...


However, it is worth mentioning that the Chinese-speaking world has not yet fully settled on how "token" should be translated.

Literal translations of "token" always feel a bit odd.

GPT-4 thinks translations along the lines of "word unit" or "marker" would work better. What do you think?


Reference Links:

[1]https://www.reddit.com/r/ChatGPT/comments/13xxehx/chatgpt_is_unable_to_reverse_words/

[2]https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

[3]https://openai.com/pricing

— End —

QbitAI · Signed author on Toutiao

Follow us and be the first to know the latest scientific and technological trends
