
The AGI Era: From Technology Paradigm to Business Model

Author: Brother Bird's Notes

Source: Agent's subconscious

AGI is a revolution in productivity. If the large language model is the steam engine, then AGI is the Industrial Revolution. In this new wave of productive forces, technology is the driving engine, and a deep understanding of the technology puts you in a better position to claim a share of the business; just as a racing driver has to understand how the engine responds in order to overtake in the corners.

Let's talk about the technology paradigm first, and then the business model.

1. The real reason behind the scaling law

From all the exams we took growing up, common sense tells us that true/false questions are easy: blind guessing scores half the points. Picking one answer out of four is harder, and one out of ten harder still. The same logic holds in machine learning. In image classification, a ten-class problem is like a one-out-of-ten multiple-choice question, and ImageNet with its 1,000 categories is one out of 1,000. What about a large language model? It picks the most likely token from a vocabulary of more than 100,000 entries, so the number of classes jumps by orders of magnitude. From the viewpoint of probability theory, the larger the softmax output space, the exponentially more training samples are needed to train it fully: the model has to estimate the conditional distribution P(A_i | input text), where A_i ranges over the candidates {A_1, A_2, ..., A_100,000}. As the number of candidates grows, the number of (text, candidate) combinations explodes, and it takes a huge amount of data to estimate these probabilities one by one, rule out the alternatives, and land on the right A_i.
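
To make the choose-one-out-of-100,000 picture concrete, here is a minimal sketch (with made-up shapes, not any particular model) showing that a language model's output head is just a softmax classifier over the vocabulary:

```python
# Minimal sketch (assumed shapes, not any specific model): the last layer of a
# language model is a softmax classifier over the whole vocabulary.
import torch
import torch.nn.functional as F

vocab_size = 100_000   # roughly the vocabulary scale discussed above
d_model = 768          # hypothetical hidden size

# Hidden state at the final position for one input text, as produced by the transformer stack.
hidden = torch.randn(1, d_model)

# The output head maps the hidden state to one logit per vocabulary entry.
lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)
logits = lm_head(hidden)                      # shape: (1, 100_000)

# P(A_i | input text): a categorical distribution over 100,000 candidates.
probs = F.softmax(logits, dim=-1)
next_token = torch.argmax(probs, dim=-1)      # "choose one out of 100,000"
print(probs.shape, next_token.item())
```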

To cover a distribution of this scale, the mathematical structure being expressed is far more complex, so the model needs many parameters (the more parameters, the more structure it can express), and at the same time it needs a massive dataset that actually covers the full distribution, so that it can be trained to convergence. Searching for a way to get there, OpenAI and others found that for a Transformer there is no need to carefully tune the width/depth combination: as long as the total parameter count is roughly the same, the expressive capacity is roughly the same. So finding a suitable architecture is not the hard part. Simply scale up the depth and enlarge the dataset by brute force, and this enormous task, a multiple-choice question with 100,000-plus options, gets done.
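
A rough way to see that observation: the non-embedding parameter count of a decoder-only Transformer is commonly approximated as 12 * n_layer * d_model^2, so very different width/depth splits can land on the same budget. A small sketch under that approximation (the two configurations below are made up):

```python
# Back-of-envelope sketch: non-embedding parameters of a decoder-only transformer
# are roughly 12 * n_layer * d_model**2 (attention + MLP blocks), a commonly used
# approximation. Different width/depth splits with a similar total budget end up
# with similar capacity.
def approx_params(n_layer: int, d_model: int) -> int:
    return 12 * n_layer * d_model ** 2

configs = [
    ("deep & narrow", 48, 1600),
    ("shallow & wide", 24, 2262),   # hypothetical config chosen to match the budget
]
for name, n_layer, d_model in configs:
    print(f"{name:15s} layers={n_layer:3d} width={d_model:5d} "
          f"params≈{approx_params(n_layer, d_model) / 1e9:.2f}B")
```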

2. How far is Sora from a true text2video GPT-4 moment?

Let's make a rough estimate of how large a training set text2video would need. I analyzed this previously in:

SORA Technology 6: In-depth understanding of full-modal video generation by Google VideoPoet

In the classic image classification setting, the ImageNet dataset has 1,000 categories (you can think of the token codebook as having size 1,000) and about 1.28 million images in total, roughly 1,300 images per category. The point of the analogy: it takes about 1,300 examples per codebook entry to estimate that entry's full distribution.

GPT-1's vocabulary size is 40,478 and GPT-2's is 50,257, so the gap is not large; let's assume GPT-4's vocabulary is about 60,000. Its training set is roughly 13 trillion tokens, that is, on the order of 200 million examples per vocabulary entry to estimate the full distribution and reach GPT-4-level quality.

VideoPoet's codebook size is 270,000. An overly large vocabulary also means a huge embedding matrix, which is costly in both storage and compute. So in the short term, video generation cannot reach GPT-4's level simply because the codebook is too large. The analogy runs as follows:

At codebook size 1,000, about 1,300 examples per entry are needed to estimate the full distribution.

At codebook size 60,000, about 200 million examples per entry are needed, roughly 150,000 times the 1,300, while the codebook itself grew only 60-fold. That is, the data requirement expanded about 2,500 times faster than the codebook.

At codebook size 270,000, about 4.5 times the 60,000, how many examples per entry would be needed? Extrapolating at the same rate: roughly 4.5 × 2,500 × 200 million. A dataset of that size is simply not obtainable or computable.
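
Putting the back-of-envelope arithmetic above in one place (these are the rounded assumptions quoted in this section, not measurements):

```python
# The article's back-of-envelope, written out under its own rounded assumptions.
imagenet_codebook, imagenet_per_entry = 1_000, 1_300        # ImageNet: ~1.28M images / 1,000 classes
gpt4_codebook, gpt4_per_entry = 60_000, 200_000_000         # assumed vocab; ~13T tokens / 60k ≈ 2e8

per_entry_growth = gpt4_per_entry / imagenet_per_entry      # ≈ 150,000x more examples per entry
codebook_growth = gpt4_codebook / imagenet_codebook         # 60x larger codebook
expansion_ratio = per_entry_growth / codebook_growth        # ≈ 2,500: data need outruns codebook size

videopoet_codebook = 270_000
codebook_ratio = videopoet_codebook / gpt4_codebook         # ≈ 4.5x the assumed GPT-4 vocab
videopoet_per_entry = codebook_ratio * expansion_ratio * gpt4_per_entry
print(f"≈{videopoet_per_entry:.1e} examples per codebook entry")           # ≈ 2e12
print(f"≈{videopoet_per_entry * videopoet_codebook:.1e} tokens in total")  # multiplying back by the codebook
```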

So it is no exaggeration to say that the demos Sora has released come from a model trained into one small corner of the distribution, a local optimum (or saddle point); it has not covered the full distribution. In other words, Sora can synthesize excellent video only in a limited set of cases; if it were really opened up for the public to test at will, it would fall far short of ChatGPT-level capability.

To solve this, one route is the scaling law, the brute-force way; the other, more fundamental one, is to shrink the codebook. That is a crucial step toward AGI.

3. How hard is deployment? General vs. vertical: two different worlds

No matter how many benchmark leaderboards a general-purpose large model has topped, it is still a laboratory product, because it is trained on public datasets, and public datasets are themselves semantically noisy; it cannot walk into a serious work environment and solve real-world problems. A classification model trained on ImageNet cannot be used directly for defect detection in industrial vision: this spot is normal noise on the CPU, while that pit is a process defect. You have to rebuild a real dataset and retrain a classifier on it.
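
A minimal retraining sketch for that last point, assuming a hypothetical folder of labeled CPU images and reusing an ImageNet-pretrained backbone only as a starting point:

```python
# Minimal sketch (hypothetical paths and class names): reuse an ImageNet-pretrained
# backbone, but retrain the classifier head on a real defect dataset so the model
# learns the factory's own notion of "normal noise" vs. "process defect".
import torch
from torch import nn
from torchvision import datasets, models, transforms

tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# Hypothetical folder layout: cpu_defects/train/{normal_noise,process_defect}/*.jpg
train_set = datasets.ImageFolder("cpu_defects/train", transform=tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # new head for the real classes

opt = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)  # fine-tune only the new head in this sketch
loss_fn = nn.CrossEntropyLoss()
for images, labels in loader:
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```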

The general-purpose model is in the same position: it is still far from the last mile of deployment. For example, if you let a general model answer medical questions, I suspect people in the industry would not be comfortable with it. And that is exactly what real business scenarios demand: it is not a small-talk assistant; it has to be held to strict quality standards.

So the first problem a vertical model has to solve is giving convergent, reliable answers. The second is proactively asking questions: a real doctor actively observes and inquires, digging into the patient's condition, and today's large models cannot do that. A vertical model therefore has to be tightly integrated with the business and find its own path.

4. Why do you need to independently train vertical large models?

The base model is the full data distribution over the whole codebook, and public datasets contain plenty of dross. The iFLYTEK learning-machine incident, at its root, happened because the base training set contained a large amount of hostile ideological data. The data consumed during base training defines the full distribution over the codebook; if that base distribution is skewed, the applications built on top of it will, from time to time, produce strange outputs.

So we need to train a vertical-domain base model. How should it be trained?

First, shrink the codebook. If we are building medical consultation, we certainly do not need code tokens; they can be removed. Second, build a moderately sized dataset with enough vertical data: with only a vertical dataset we may fail to cover the full distribution, and with only public data we will not understand the vertical domain well enough, so the mix has to be balanced. Third, keep the model size modest. Building a vertical large model means training an excavator operator from Nanxiang Technical School who drives the machine fast and well, not a Peking University generalist who sits high in the halls of power, worrying about the people and carrying the whole world in his heart.
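
As a sketch of the first point, here is one way to build a compact vertical vocabulary: training a small BPE tokenizer on a medical corpus instead of inheriting a 100,000-plus general-purpose vocabulary. The corpus path and vocabulary size below are assumptions for illustration:

```python
# Minimal sketch (hypothetical corpus path and vocab size): instead of reusing a
# large general-purpose vocabulary full of code and multilingual tokens, train a
# compact BPE tokenizer on the vertical (e.g. medical) corpus only.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=16_000,                       # assumed: far smaller than a general LLM vocabulary
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["medical_corpus.txt"], trainer=trainer)   # hypothetical domain corpus
tokenizer.save("medical_bpe.json")

print(tokenizer.get_vocab_size())            # ~16k entries: a much smaller softmax to train fully
```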

5. Fitting LLMs onto a 1080 graphics card: welcoming a hundred flowers of vertical large models

Cost is the key to deployment. First, the model should not be too large, so the cost stays low; second, it has to stand up to high concurrency. Both bring down the cost of landing.

Most important of all, every corner of every industry needs a vertical large model dedicated to its own work. The real large model is not an operating system: it does not need to be big and all-encompassing; it should be small and refined, with precise knowledge of its field and the ability to close the loop on problems.

For example, in the smart car cockpit it can answer the car's operation guide very precisely: say, where the child lock is and how to switch it, which differs from car to car and which a general model cannot answer.

Then there is power consumption. Requiring a 4090 for inference is unreasonable: at nearly 500 W it simply burns too much electricity. Deployment also has to get cheaper. The day an old graphics card like the 1080 can run these models, the industry's spring will have arrived.
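
A quick back-of-envelope on what an old 8 GB card like the 1080 could hold, counting weights only and ignoring activations and the KV cache:

```python
# Back-of-envelope sketch: weight memory at different precisions, to see what
# could in principle fit into an 8 GB card like a GTX 1080 (activations and the
# KV cache are ignored, so real headroom is smaller).
def weight_gb(n_params_billion: float, bits_per_weight: int) -> float:
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for size_b in (1.5, 3, 7, 13):
    fp16, int8, int4 = (weight_gb(size_b, b) for b in (16, 8, 4))
    fits = "fits 8 GB" if int4 < 8 else "too big"
    print(f"{size_b:>4}B params: fp16={fp16:5.1f} GB  int8={int8:5.1f} GB  "
          f"int4={int4:4.1f} GB  ({fits} in int4)")
```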

Summary

A real business model has to dig deep into one vertical domain, train its own large model there, and close the loop on algorithm quality. Then drive down deployment cost, so that it can truly become a new productive force that makes money.

At present, text2video does not meet the bar for a commercial product; it is still hard to bring to the ground. The research workload in this direction remains heavy, and there will be no GPT-4-like product in the short term, within a year.

This is just one person's opinion.
