To surpass human memory, they gave ChatGPT a cheat sheet

Author: Titanium Media APP
By 追问nextquestion

"I have read more than 10,000 books, and my pen is like a god. This old saying has been more clearly explained in the training process of modern large models. With enough training data, AI has achieved human-like performance on many tasks. For the sake of understanding, one view personifies the large model, arguing that the large model also has the same memory as humans, and even has similar memory patterns and mechanisms. However, just as an airplane should not be simply compared to an iron bird, the process of human memory from generation to extraction is fundamentally different from that of a language model that predicts the next word based on context.

Still, the study of human memory provides a starting point for understanding the memory mechanisms of large models. Human memory is divided into long-term memory and short-term memory (also known as working memory). For large models, "long-term memory" is stored in the billions of parameters of the model, while short-term memory is reflected in the context window, the span of text the model can attend to within a conversation. For example, GPT-4's context window is 128k tokens, roughly equivalent to entering 100,000 Chinese characters at a time.

But is this analogy really valid? What are the similarities and differences between large language models and human memory, and how can human memory mechanisms help solve the problems encountered when applying large models?

01 The long-term memory of the large model is similar to that of humans

For any animal, the brain ultimately serves one purpose: to win in the relentless selection of evolution. Language as a communication tool is no exception. Complex features of language such as grammatical structure and recursive nesting exist, at bottom, to make communication more efficient and accurate. That being the case, language does not have to be perfect. Large models that have been fine-tuned with human feedback are likewise probabilistic and stochastic at their core. You can make the model's output look more creative by adjusting the temperature, a hyperparameter that controls the randomness and diversity of the generated text.
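To make the role of temperature concrete, here is a minimal sketch of temperature-scaled sampling; the toy logits and vocabulary size are invented for illustration and are not tied to any particular model.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from raw logits after temperature scaling.

    Lower temperature -> sharper distribution (more deterministic output);
    higher temperature -> flatter distribution (more diverse, "creative" output).
    Illustrative only: real decoders also combine this with top-k/top-p filtering.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

# Toy next-token logits over a 4-word vocabulary.
logits = [2.0, 1.0, 0.2, -1.0]
print(sample_with_temperature(logits, temperature=0.2))  # almost always index 0
print(sample_with_temperature(logits, temperature=1.5))  # indices spread more evenly
```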

In terms of memory, large models, like humans, exhibit both primacy and recency effects [1], especially when there are many facts to be memorized (Figure 1).

Primacy effect: items presented first leave a stronger impression and are remembered better; recency effect: items encountered most recently leave the deepest impression.

▷ Figure 1: The prediction accuracy of large models first decreases and then rises with an item's position in the list, similar to human memory. Source: Reference [1]

This property is an emergent feature that appears only once the model exceeds a certain size (Figure 2): when the model has only 70M parameters, it cannot accurately predict words that appeared early in the sequence, so no primacy effect is observed.

▷ Figure 2: Accuracy of models with different parameter counts in predicting words at different positions. Source: Reference [1]

During learning, humans can improve memory through repetition, and the same phenomenon occurs in large models (Figure 3). Moreover, repeating the material in a shuffled order improves the model more than repeating it verbatim.

▷ Figure 3: The model's prediction accuracy when faced with repeated content. Source: Reference [1]

When humans are confronted with conflicting facts, memory errors occur, which suggests that forgetting is caused not by memories decaying over time but by interference between memories during retrieval. Large models behave similarly when confronted with conflicting facts, and the more specific the conflict (for example, the conflicting statements concern the same person rather than different countries), the more pronounced the memory error becomes (Figure 4).

▷ Figure 4: After adding different types of interfering information, the prediction accuracy of the large model drops significantly. Source: Reference [1]

In addition, the Canadian cognitive psychologist Endel Tulving argued that memory storage and retrieval are two separate processes, which also holds for large models, whose training and inference rely on very different mechanisms. Tulving further divided long-term memory into declarative memory and procedural memory, where declarative memory includes semantic memory and episodic memory.

For large models, semantic memory corresponds to the knowledge the model accumulates through pre-training or fine-tuning, stored implicitly in its parameters. Episodic memory, by contrast, is reflected in the model's ability to rely on specific contextual information when processing or generating text. When generating entirely new content, however, what needs to be activated is a capability closer to procedural memory, which goes beyond mere episodic memory. [4]

Training a large model mainly involves explicit, episode-like memories of particular instances; procedural memory plays little part. During inference, the model uses the contextual information in its input to refer back to previous conversation turns or data relevant to the current context, a process that can be seen as a simulated invocation of episodic memory. This suggests that although large models are trained primarily on explicit information tied to specific instances, they can still exhibit human-like episodic memory by processing context from previous interactions. Furthermore, some researchers believe that when the model receives sufficiently detailed and specific contextual information, it can "activate" more complex behavior patterns similar to human procedural memory, thereby exhibiting advanced emergent abilities such as causal inference and mental simulation.

Although large models and the human brain show similar behavior, this does not mean the two process information in similar ways. In fact, there is no settled explanation for why large models exhibit these characteristics. For example, in the study above, it is not clear whether features such as the primacy effect would still appear if only the top layers of the model were considered, or whether the model's behavior would change if the scope of the context were restricted. Perhaps deliberately restricted models could be used to localize modules analogous to human memory, which would help explain the phenomenon.

02 Expanding the memory of large models with "plug-ins"

Understanding memory is essential for extending the capabilities of large models. Just as writing down intermediate steps on scratch paper enhances our working memory when solving math puzzles, equipping large models with "memory plug-in" techniques can significantly improve their working memory.

For example, in the TiM system [5], the large model performs several operations on an external memory store before answering each question, including insertion, forgetting, and merging (Figure 6). In this way, when responding across multiple rounds of conversation, the model can process and recall contextual information more effectively and retrieve exactly what it needs. A related approach is recursive summarization of dialogue [6]: after each round, the large model summarizes the preceding exchange and writes the summary into external memory, preventing it from forgetting earlier turns over a long conversation.
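A minimal sketch of what such an external memory might look like is given below. The class name, the keyword-overlap retrieval, and the simple forgetting and merging rules are assumptions made for illustration; the actual TiM implementation organizes and retrieves stored "thoughts" differently.

```python
from dataclasses import dataclass, field

@dataclass
class ExternalMemory:
    """Illustrative external memory with insert / forget / merge operations,
    loosely modeled on the workflow described above (not the paper's code)."""
    entries: list[str] = field(default_factory=list)
    capacity: int = 50

    def insert(self, thought: str) -> None:
        self.entries.append(thought)

    def forget(self) -> None:
        # Drop the oldest entries when over capacity: a simple stand-in
        # for a real forgetting strategy.
        while len(self.entries) > self.capacity:
            self.entries.pop(0)

    def merge(self) -> None:
        # Collapse exact duplicates; a real system might merge semantically
        # similar entries using an embedding model.
        self.entries = list(dict.fromkeys(self.entries))

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Crude keyword-overlap retrieval, for illustration only.
        def overlap(entry: str) -> int:
            return len(set(entry.lower().split()) & set(query.lower().split()))
        return sorted(self.entries, key=overlap, reverse=True)[:k]

memory = ExternalMemory()
memory.insert("The user's cat is named Mimi.")
memory.insert("The user prefers answers in Chinese.")
memory.merge()
memory.forget()
print(memory.recall("What is the name of my cat?"))  # retrieved before answering
```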

▷ Figure 6: Comparison of the large model's conventional memory with the newly proposed TiM when answering questions. Source: Reference [5]

To address the challenge of long text, a paper published at NeurIPS 2023 proposed a method called LongMem [7]. Rather than handling multi-round conversations, this technique processes a single long text at a time. The long text is split into multiple segments, each processed independently by a frozen large model; a trainable residual side network then fuses the information from the segments and selects the parts most relevant to the question being answered. In this way, LongMem lets large models extract information from long inputs more accurately.
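The sketch below illustrates only the simpler idea of splitting a long document into chunks and selecting the most question-relevant ones at answer time; LongMem itself caches attention key-value pairs and fuses them through a trainable residual side network, so the function names and the word-overlap scoring here are purely illustrative.

```python
def split_into_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split a long document into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def most_relevant_chunks(chunks: list[str], question: str, k: int = 2) -> list[str]:
    """Rank chunks by naive word overlap with the question (illustration only;
    LongMem instead retrieves cached key-value pairs via a side network)."""
    q_words = set(question.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q_words & set(c.lower().split())),
                  reverse=True)[:k]

long_document = "..."  # a document far longer than the model's context window
chunks = split_into_chunks(long_document)
context = "\n".join(most_relevant_chunks(chunks, "When was the contract signed?"))
# The selected chunks would then be prepended to the prompt sent to the frozen model.
```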

▷ Figure 7: Schematic of how the LongMem mechanism operates. Source: Reference [7]

In robot control, large models likewise need to be paired with memory modules [8]; this kind of system is called embodied AI. In a robot-control task, the embodied AI's "eyes" process input from vision sensors to produce a language description of the surroundings; its "nerves" then combine this with the robot's own actions to produce a first-person, egocentric description of its state. This information is encoded and stored in a high-level language processing system, the so-called "brain", which can also generate control instructions according to the navigation task.

This mode of operation lets the robot interact with humans directly through natural language, and it can draw on the vast common-sense knowledge stored in the large model to recognize and adapt to changes in the environment, for example, "this thing is alive and moving, so I need to avoid it." A robot built this way will "realize" that the cat in front of it, even one lying still, may dart away as the robot approaches. The foundation of this type of embodied AI is generating, storing, and updating a memory model of its own state.
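A rough sketch of this perceive-describe-decide loop is shown below. The function names, the canned scene description, and the hard-coded response stand in for a vision-language model and an LLM call; none of them reflect the actual LLM-Brain implementation.

```python
def describe_scene(camera_frame) -> str:
    """Placeholder for the 'eyes': a vision-language model that turns sensor
    input into a natural-language description of the surroundings."""
    return "A cat is lying on the floor two meters ahead; a doorway is to the left."

def llm_brain(state_description: str, task: str) -> str:
    """Placeholder for the 'brain': a large language model prompted with the
    egocentric state description and the navigation task, returning a command."""
    prompt = f"Task: {task}\nCurrent state: {state_description}\nNext action:"
    # In a real system this prompt would be sent to an LLM API; here we
    # return a canned answer to keep the sketch self-contained.
    return "Slow down and steer left to avoid the animal, then continue to the door."

def control_loop(task: str, camera_frame) -> None:
    state = describe_scene(camera_frame)   # vision -> language description
    action = llm_brain(state, task)        # language -> control instruction
    print(f"Executing: {action}")          # would be sent to the motor controller

control_loop("Navigate to the doorway without disturbing anything alive.", camera_frame=None)
```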

▷ Figure 8: Architecture of the LLM-Brain embodied AI. Source: Reference [8]

Another example of combining a large model with memory comes from search. Researchers proposed an architecture called CoPS [9] consisting of three parts: an external memory module stores the user's search history and behavior; a large model then infers the user's intent and background from that history; and the links returned by a traditional search engine are re-ranked according to the inferred profile, making the results more personalized. Because it uses a pre-trained large model, CoPS works zero-shot: there is no need to recruit test users or collect user data and feedback, and the knowledge embedded in the large model can be used directly to improve search accuracy.
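The sketch below captures the general shape of such a pipeline: profile inference from stored history followed by re-ranking. The helper names and the term-overlap scoring are hypothetical simplifications; CoPS itself prompts the large model to perform both steps.

```python
def infer_user_profile(search_history: list[str]) -> str:
    """Placeholder for the LLM step: summarize the user's likely intent and
    background from stored history (a real system would prompt an LLM here)."""
    return "The user is a Python developer interested in machine learning tooling."

def rerank(results: list[str], profile: str) -> list[str]:
    """Toy personalization: push results sharing terms with the profile to the
    top. A real system would ask the LLM to score each result against the profile."""
    profile_terms = set(profile.lower().split())
    return sorted(results,
                  key=lambda r: len(profile_terms & set(r.lower().split())),
                  reverse=True)

history = ["pytorch dataloader tutorial", "pip install faiss", "python type hints"]
results = ["Gardening tips for spring",
           "Machine learning tooling in Python",
           "Jaguar repair manual"]
print(rerank(results, infer_user_profile(history)))  # personalized ordering
```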

▷ Figure 9: The CoPS architecture. Source: Reference [9]

There are many more cases of expanding the application range of large models by attaching external memory. Studies have shown [10] that a language model with the Transformer architecture, once given a read-writable relational external memory, can be regarded as computationally equivalent to a universal Turing machine. This means such models are not only capable of processing input strings of finite length, but can in principle simulate any algorithm and handle input of arbitrary size.

03 The "illusion" of the large model does not need to be overcome

The psychologist Frederic Bartlett pointed out that "memory is not simply the reactivation of a myriad of fixed, lifeless, fragmented traces, but an imaginative reconstruction or construction." This description seems to apply to large models as well.

Knowing the imperfections of biological memory, we may no longer see the "hallucinations" of large models as a stubborn disease to be cured, but as an endogenous and inevitable emergent feature. As Jia Baoyu says in Dream of the Red Chamber, "the ancients fabricated so much; why should I alone not be allowed to make things up?" In fact, the author of Dream of the Red Chamber practiced what his character preached and fabricated many allusions in the book, which does not detract in the slightest from its greatness. Once we regard the hallucinations of large models as a by-product of the memory generation process, we should not try to eliminate them within the framework of the model itself, but should instead use external memory to solve the problems hallucinations cause in specific scenarios. One might even see "hallucination" as a valley that must be crossed on the way to AGI, and look for ways to first increase the model's hallucinations in order to boost its creativity.

Although we do not yet fully understand the mechanisms of memory in either large models or the human brain, neuroscience offers many ways of classifying memory, which may remind developers of large models not to rely on a single memory mode. By adding explicit memory outside the large model, performance on long texts and multi-round dialogue can be significantly improved and the range of applications expanded. This offers a more cost-effective, resource-saving optimization path than simply scaling up models to obtain better ones.

In neuroscience, memories compete with one another, and this dynamic means that the retrieval, updating, reinforcement, and forgetting of memories should be examined within a single framework. In today's large models, memory generation and reading are independent of each other: the model does not update a stored memory because it has been read repeatedly, whereas every time a human retrieves a long-term memory, it is a generative reconstruction of the past, and after repeated reading and rewriting the original memory may change. This is a difference researchers will need to attend to in the future.

References

[1] https://arxiv.org/abs/2311.03839

[2] https://arxiv.org/ftp/arxiv/papers/2309/2309.01660.pdf

[3] https://arxiv.org/abs/2402.15052

[4] https://arxiv.org/pdf/2401.02509.pdf

[5] https://arxiv.org/pdf/2311.08719.pdf

[6] https://arxiv.org/pdf/2308.15022.pdf

[7] https://arxiv.org/pdf/2306.07174.pdf

[8] https://arxiv.org/pdf/2304.09349v1.pdf

[9] https://arxiv.org/pdf/2402.10548.pdf

[10] https://arxiv.org/abs/2301.04589
