A code name sparked panic across the internet? What exactly is OpenAI's Q*?
Let's set aside the boardroom drama inside OpenAI's management and talk about the company's latest rumor: Q*.
On Nov. 22, OpenAI reportedly sent an internal letter to employees acknowledging Q* and describing the project as being about "autonomous systems beyond humans." Which sounds genuinely scary.
Although OpenAI has released nothing official about Q*, we can still piece together a rough picture.
First, the name. Q* is officially read as "Q-Star." Yes, you read that right: even though "*" usually means multiplication in deep learning code, in Q* it is an asterisk, not a multiplication sign. The letter "Q" refers to the expected reward of an action in reinforcement learning.
In artificial intelligence, almost anything involving a capital Q traces back to Q-learning. Q-learning is a form of reinforcement learning that evaluates actions against the rewards recorded so far: during training, the agent keeps a record of historical reward values and picks whichever next step matches the highest reward seen so far. Note, however, that the highest historical reward value does not necessarily equal the model's true maximum reward. It may or may not be, and the agent may even fail to reach it.
In other words, Q-learning is to the agent what an analyst is to a team's coach: the coach runs the team, and the analyst feeds the coach information.
In reinforcement learning, the agent's decisions are fed back into the environment, which returns reward values. Q-learning only records those reward values, so it never needs to model the environment itself; as long as the result is good, all is good.
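The record-the-reward, pick-the-best loop described above can be sketched in a few lines. Everything below, from the toy five-state chain environment to the learning-rate values, is invented for illustration; only the Q-learning update itself is the standard textbook form.

```python
import random

random.seed(0)

# Toy chain MDP: states 0..4, actions 0 (left) / 1 (right).
# Reaching state 4 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(state, action):
    """Environment feedback: next state and reward."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

# The "record of historical reward values": one entry per (state, action)
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly follow the best recorded value so far
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

# After training, "go right" should dominate in every non-terminal state.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Note that the agent never models `step` itself; it only accumulates reward records, which is exactly the "good results, all is good" point above.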
Looked at this way, though, Q-learning seems no match for the deep learning models that dominate AI today, especially large models. With parameter counts in the billions and tens of billions, plain Q-learning would not help such a model; it would only add complexity and thus reduce robustness.
Don't worry: the Q-learning described above is only the basic concept, which dates back to 1989.
In 2013, DeepMind improved on Q-learning with an algorithm called deep Q-learning. Its most distinctive feature is experience replay: sampling from many past results and then applying Q-learning to the samples, which stabilises the model and keeps training from being thrown off course by any single result.
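A minimal sketch of the experience-replay idea, assuming the conventional (state, action, reward, next_state, done) transition format; the buffer capacity and batch size are arbitrary illustrative values.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and hands back uniform random samples."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions, which is what stabilises training.
        return random.sample(self.buffer, batch_size)

# Fill the buffer with dummy transitions, then draw a training batch.
buf = ReplayBuffer()
for t in range(100):
    buf.push(t, t % 2, 0.0, t + 1, False)

batch = buf.sample(8)
print(len(batch))  # 8 transitions drawn from across the whole history
```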
Truth be told, there is a reason the concept did not catch on immediately; in practical terms, deep Q-learning's biggest contribution to the academic community was giving rise to DQN.
DQN stands for Deep Q-Network, and it grew directly out of deep Q-learning. The idea is the same as Q-learning's, except that the search for the maximum reward value is carried out by a neural network. That made it fashionable almost overnight.
DQN generates one node at a time, and maintains a priority queue in which the remaining nodes and their associated actions are stored. Obviously one node is never enough; if the whole process produced only one node, the final answer would be wildly wrong. So each time a node and an action are taken off the queue, the action is applied to the nodes already generated to produce a new node, and the cycle repeats.
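The core move named above, replacing the Q table with a trained function, can be sketched with a deliberately tiny stand-in: here Q(s, a) is a linear function of hand-made features, updated by the same TD error a real DQN minimises, on the same kind of toy chain task as before. A real DQN uses a deep network plus experience replay and a target network; none of that appears here.

```python
import random

random.seed(0)

N_STATES, ACTIONS = 5, (0, 1)
ALPHA, GAMMA, EPSILON = 0.05, 0.9, 0.2

# One weight per (feature, action) instead of one table cell per (state, action)
w = [[0.0, 0.0], [0.0, 0.0]]

def features(s):
    return [1.0, s / N_STATES]  # hand-made features standing in for a network

def q(s, a):
    """Approximate Q(s, a) as a linear function of the features."""
    return sum(wi[a] * f for wi, f in zip(w, features(s)))

def step(s, a):
    nxt = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

for episode in range(2000):
    s = 0
    while s != N_STATES - 1:
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: q(s, x))
        s2, r = step(s, a)
        # Same TD target as tabular Q-learning, but applied to the weights
        td = r + GAMMA * max(q(s2, b) for b in ACTIONS) - q(s, a)
        for i, f in enumerate(features(s)):
            w[i][a] += ALPHA * td * f

# "Go right" should be valued higher in every non-terminal state.
print(all(q(s, 1) > q(s, 0) for s in range(N_STATES - 1)))
```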
Anyone who knows a little AI history will find this more and more familiar the longer they look at it: isn't this just a souped-up version of Floyd's shortest-path algorithm?
Floyd's algorithm, a classic in modern computing, finds the shortest path between two points by comparing each candidate route against the historical best. The queueing pattern is equally familiar: memory stores pending computations in priority order, and each time the processor completes one, memory hands it the next.
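For reference, the Floyd-Warshall algorithm itself fits in a dozen lines: it repeatedly asks whether routing through an intermediate node k beats the best distance found so far. The four-node graph below is made up for illustration.

```python
# Adjacency matrix: dist[i][j] is the direct edge weight from i to j,
# INF where no direct edge exists. Made-up graph for illustration.
INF = float("inf")
dist = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]

n = len(dist)
for k in range(n):
    for i in range(n):
        for j in range(n):
            # keep whichever is shorter: the current best, or the detour via k
            if dist[i][k] + dist[k][j] < dist[i][j]:
                dist[i][j] = dist[i][k] + dist[k][j]

print(dist[0][3])  # 0 -> 1 -> 2 -> 3 costs 3 + 2 + 1 = 6, beating the direct 7
```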
DQN is essentially the same.
That covers roughly what the Q means. So what about the *?
Judging from the analysis of many industry insiders, * is likely to refer to the A* algorithm.
A* is a heuristic algorithm. Before rushing into what a heuristic is, let me tell you a joke:
A asks B: "Quick, what's 1928749189571*1982379176?" B immediately answers: "32." A protests that the product of two numbers that large cannot possibly be two digits. B replies: "But you did say quick."
Outrageous as it sounds, heuristics work the same way.
At heart, a heuristic is an estimate, and you can only pick one of efficiency and exactness. Either it is very fast but sometimes wrong, or it is very accurate but sometimes slow. The A* algorithm first uses a heuristic to estimate an approximate value, which may deviate considerably from the correct answer. Once the estimate is in place, it loops and traverses: whenever the current candidate leads nowhere, it re-estimates, until solutions start to appear, repeating until it converges on the best one.
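The estimate-then-search loop can be made concrete with the classic grid-pathfinding setup. The grid, start, and goal below are invented; the heuristic is plain Manhattan distance, which on a grid never overestimates, so this particular search does return the true shortest path.

```python
import heapq

# A* ranks candidates by f(n) = g(n) + h(n): exact cost so far plus a
# heuristic estimate of the cost still to come.
grid = [
    "S..#.",
    ".#.#.",
    ".#...",
    ".#.#.",
    "...#G",
]
rows, cols = len(grid), len(grid[0])
start, goal = (0, 0), (4, 4)

def h(pos):
    # Manhattan distance: the cheap "estimate" part of the algorithm
    return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

open_heap = [(h(start), 0, start)]   # entries are (f, g, position)
best_g = {start: 0}

while open_heap:
    f, g, pos = heapq.heappop(open_heap)
    if pos == goal:
        print(g)   # number of steps on the shortest path
        break
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = pos[0] + dr, pos[1] + dc
        nxt = (r, c)
        if 0 <= r < rows and 0 <= c < cols and grid[r][c] != "#":
            if g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt))
```

The priority queue is what keeps the traversal cheap: low-estimate candidates are expanded first, and dead ends simply sink in the heap.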
Although it does reach the best solution, A* falls into the second category above: accurate, but sometimes very slow. That is fine in a lab environment, but on a personal device the same algorithm can overflow memory and cause system problems, blue screens included.
This limitation is why A* has historically been applied to relatively simple models, the classic example being character pathfinding in online games. In some large games, the stutter you feel the moment a character starts pathfinding is the A* algorithm at work.
On the whole, the current consensus in AI circles is that the Q* mentioned in OpenAI's internal letter is probably a combination of Q-learning and A*: save compute, save memory, and still reach the best solution, because you cannot forever spend more compute and burn more memory just to get there.
And just as the ideas behind OpenAI's foundation models had been around for a long time, even ignored for a while, until OpenAI rediscovered their potential with concrete, inventive methods, people today have every reason to believe that with two long-standing algorithmic ideas like Q-learning and A*, OpenAI can pull off the same trick and work a miracle again. Of course, given the recent farce at OpenAI, more people are also worried about the harm such a miracle might bring to mankind.
Coming back to the algorithm itself: most likely, Q* uses Q-learning to quickly estimate a near-optimal solution, then runs A* within that narrow range, cutting out a mass of meaningless computation and arriving at the best answer quickly. What OpenAI is actually doing, though, will have to wait for a public paper (if one ever comes).
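Nobody outside OpenAI knows what Q* actually is, so the following is purely a sketch of the shape this guess describes: a best-first search whose heuristic comes from a learned value estimate rather than a hand-written one. Every detail, from the toy number-reaching task to the fake "learned" estimate, is invented for illustration.

```python
import heapq

GOAL = 20

def learned_estimate(state):
    # Stand-in for a Q-learning-style estimate of remaining cost.
    # A real system would train this; here it is just distance to goal.
    return abs(GOAL - state)

def neighbors(state):
    # Toy moves: add 1, add 2, or double the number (cost 1 each).
    return [state + 1, state + 2, state * 2]

def q_star_sketch(start):
    """Best-first search steered by the (fake) learned estimate."""
    heap = [(learned_estimate(start), 0, start)]
    best = {start: 0}
    while heap:
        _, cost, state = heapq.heappop(heap)
        if state == GOAL:
            return cost
        for nxt in neighbors(state):
            if nxt <= GOAL and cost + 1 < best.get(nxt, float("inf")):
                best[nxt] = cost + 1
                heapq.heappush(
                    heap, (cost + 1 + learned_estimate(nxt), cost + 1, nxt)
                )
    return None

print(q_star_sketch(1))  # returns 7, though a 4-step route (1,3,5,10,20) exists
```

Because a learned estimate carries no optimality guarantee, this sketch happily settles for a 7-step answer when a 4-step one exists: exactly the efficiency-versus-accuracy trade-off heuristics always face.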
The emergence of Q* actually points to something larger: leading AI companies have realised that, at the current stage of AI development, the process of solving matters more than the solution itself. Merely chasing correct answers can no longer satisfy what people demand of artificial intelligence. On OpenCompass, for example, even where average scores differ by 10 or 20 points, the comprehension accuracy of the best and worst models is not far apart.
Amid the speculation and panic, one claim about Q* is that it can solve very advanced mathematics. Andrew Rogoyski, a director at the University of Surrey's Institute for People-Centred AI, said: "We know that existing AI has been shown to be capable of doing maths at undergraduate level, but not of handling more advanced problems. But Q* is very likely aimed at exactly those difficult mathematical problems." Perhaps when Q* arrives, someone can test it on the Goldbach conjecture. Mathematics is held up as one of the great crystallisations of human wisdom, so it is little wonder that a mere code name set off panic across the internet.
Behind Q* lies OpenAI's stated mission: the pursuit of artificial general intelligence (AGI), and even superintelligence. OpenAI defines AGI as autonomous systems that surpass humans at the most economically valuable tasks, and Q* would be a step on OpenAI's road toward AGI.
For now, OpenAI has not commented on Q* or on the leaked internal letter, and I have mixed feelings. I would be delighted if Q* is as capable as claimed and pushes the development of AI further along. At the same time, I worry that the Q* hype is bigger than the reality, and that on release day the test results will turn out to be merely so-so, leaving everyone who speculated with egg on their face.