
MIT's Striking Evidence: Is the Large Language Model a World Model? LLMs Understand Space and Time

Is there a world model inside a large language model?

Do LLMs have a sense of space and time, and at multiple scales?

Recently, several researchers at MIT found that the answer is yes!


Paper address: https://arxiv.org/abs/2310.02207

They found that Llama-2-70B can draw a literal map of our real world.

For spatial representations, the researchers ran the Llama-2 models on the names of tens of thousands of cities, regions, and natural landmarks around the world.

They trained linear probes on the final-token activations and found that Llama-2's representations could be used to predict the true latitude and longitude of each place.


For temporal representations, the researchers ran the models on the names of famous people from the past 3,000 years, songs, movies, and books released since 1950, and New York Times headlines from the 2010s, and trained linear probes that successfully predicted the celebrities' years of death and the release dates of the songs, movies, books, and news stories.


In conclusion, everything suggests that LLMs are more than just stochastic parrots: Llama-2 contains a detailed model of the world, and it is no exaggeration to say that researchers have even found a "longitude neuron" inside a large language model!


The work drew an immediate response as soon as it was posted. The author shared a summary of the paper on Twitter, and it was viewed more than 1.4 million times in under 15 hours!


Netizens exclaimed: This work is amazing!


One commenter said it is intuitively reasonable: the brain distills our physical world and stores it in a biological network, and when we "see" things, they are really projections of the brain's internal processing.

It's incredible that you were able to model this!


Others took a similar view, saying that perhaps, by imitating the brain, we have managed to cheat the Creator.


LLMs are not stochastic parrots

Previously, many people speculated that the surprising power of large language models may come only from learning a huge collection of superficial statistics, rather than from learning a coherent model of the data-generating process (i.e., a world model).

In 2021, Emily M. Bender, a linguist at the University of Washington, published a paper arguing that large language models are just "stochastic parrots" that do not understand the real world: they only track the probability that a word appears and then, like parrots, string together plausible-sounding words.


Because neural networks are hard to interpret, academics cannot agree on whether language models are stochastic parrots, and opinions vary widely.

In the absence of a widely accepted test, whether models can "understand the world" has become a philosophical question rather than a scientific one.

However, the MIT researchers found that LLMs learn linear representations of space and time at multiple scales, and that these representations are robust to prompt variations and unified across different entity types, such as cities and landmarks.

They even found that LLMs have individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates.

That is, LLMs do not just learn superficial statistics; they acquire structured knowledge about fundamental dimensions such as space and time.

In short, large language models are able to understand the world.

LLM understands space and time

In this paper, the researchers ask whether an LLM forms a model of the world (and of time) from the contents of its training data.

They sought to answer this question by trying to extract a literal map of the real world from an LLM.

Specifically, the researchers constructed six datasets containing place or event names and corresponding spatial or temporal coordinates across multiple spatiotemporal dimensions:

The spatial coordinates cover places worldwide, within the United States, and within New York City.


The datasets also include different temporal coordinates:

1) the years of death of historical figures from the last 3,000 years

2) the release dates of works of art and entertainment since the 1950s

3) the publication dates of news headlines from 2010 to 2020

Using the Llama 2 family of models, the researchers trained linear regression probes on the internal activations of these place and event names at every layer of the model, to predict their real-world locations or times.

These probing experiments revealed evidence that the model builds spatial and temporal representations across its early layers, which then plateau around the middle of the model, and that larger models consistently outperform smaller ones.

Further, the researchers demonstrated that these representations are:

(1) linear, since nonlinear probes do not perform meaningfully better;

(2) fairly robust to changes in the prompt;

(3) unified across different types of concepts (for example, cities and natural landmarks).

The researchers note one possible alternative explanation for this result: the model may only have learned a mapping from each place to its country, while the probe is what actually learned how those groups relate geographically (or temporally) within the global structure.

To investigate this, the researchers performed a series of robustness checks to understand how probes generalize on different data distributions and how probes trained on PCA components perform.


The findings suggest that while the probe memorizes the "absolute position" of these concepts, the model itself does hold representations that reflect "relative position."

In other words, the probe learns the mapping from the model's internal coordinates to human-interpretable coordinates.

Finally, the researchers used the probes to look for individual neurons that activate as a function of space or time, providing strong evidence that the model does indeed use these features.

Preparations

To investigate this, the researchers constructed six datasets of entity names (people, places, events, and so on), each including the entity's location or time of occurrence, and each covering a different spatial or temporal scale.

For each dataset, the researchers included multiple types of entities, such as populated places like cities and natural landmarks like lakes, in order to study whether different object types share a unified representation.

In addition, the researchers enriched the metadata so that they could analyze the data with finer-grained splits and identify potential sources of train-test leakage.

Location information

The researchers constructed three place-name datasets for the world, the United States, and New York City. The world dataset was built from raw data queried from DBpedia (Lehmann et al.).

Further, the researchers included populated places, natural locations, and structures (such as buildings or infrastructure). They then matched these entries with Wikipedia articles and filtered out entities with fewer than 5,000 page views over three years.

The U.S. dataset includes the names of cities, counties, zip codes, universities, natural places, and structures; sparsely populated or rarely viewed locations were similarly filtered out.

The New York City dataset contains locations such as schools, churches, transportation, and public housing within the city.


Time information

The researchers' three temporal datasets include:

(1) the names and occupations of historical figures who died between 1000 BC and 2000 AD,

(2) the titles and creators of songs, movies, and books released between 1950 and 2020, built from DBpedia and filtered using Wikipedia page views;

(3) New York Times news headlines from 2010 to 2020, drawn from the news desks that cover current events.

Data preparation

All experiments were conducted with the base versions of the Llama 2 family of models, ranging from 7 billion to 70 billion parameters.

For each dataset, the researchers run each entity name through the model, possibly preceded by a short prompt, and save the hidden-state (residual stream) activations at the last entity token for every layer.

For a set of n entities, this produces an activation dataset A ∈ R^(n×d_model) for each layer.
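Below is a minimal sketch (not the authors' released code) of how such last-token activations could be collected with the Hugging Face transformers library; the checkpoint name and the toy entity list are illustrative assumptions.

```python
# Sketch: collect residual-stream activations at the last token of each entity name.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"          # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

entities = ["Paris", "Mount Kilimanjaro", "Lake Baikal"]   # toy examples

per_entity = []
with torch.no_grad():
    for name in entities:
        ids = tok(name, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        # out.hidden_states: one (1, seq_len, d_model) tensor per layer (plus embeddings)
        last_token = [h[0, -1, :] for h in out.hidden_states]
        per_entity.append(torch.stack(last_token))

# Shape (n_entities, n_layers + 1, d_model); slicing one layer gives A ∈ R^(n×d_model).
activations = torch.stack(per_entity)
print(activations.shape)
```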

Probes

To find evidence of spatial and temporal representations in LLMs, the researchers used standard probing techniques.

That is, they fit a simple model on network activations to predict a target label associated with each labeled input. In particular, given an activation dataset A ∈ R^(n×d_model) and targets Y containing either times or two-dimensional latitude-longitude coordinates, the researchers fit linear ridge regression probes:

Ŵ = argmin_W ‖AW − Y‖² + λ‖W‖²

This results in a linear probe:

ŷ = Ŵᵀ a

High predictive performance on held-out data indicates that the underlying model carries linearly decodable temporal and spatial information in its representations, although this alone does not show that the model actually uses these representations.

In all experiments, the researchers used efficient leave-one-out cross-validation on the probe training set to tune λ.
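As an illustration of this probing setup (a sketch, not the paper's code), the snippet below fits a ridge probe with scikit-learn's RidgeCV, which with its default settings selects λ (alpha) via efficient leave-one-out cross-validation; the activation matrix and coordinate targets are random placeholders standing in for real data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Placeholder data: A is (n, d_model) activations from one layer,
# Y is (n, 2) [latitude, longitude] targets.
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 4096))
Y = rng.uniform(low=[-90.0, -180.0], high=[90.0, 180.0], size=(1000, 2))

A_train, A_test, Y_train, Y_test = train_test_split(A, Y, test_size=0.2, random_state=0)

# RidgeCV with cv=None (the default) uses efficient leave-one-out CV to pick lambda.
probe = RidgeCV(alphas=np.logspace(-1, 5, 13))
probe.fit(A_train, Y_train)

print("chosen lambda:", probe.alpha_)
print("held-out R^2:", probe.score(A_test, Y_test))
```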

Linear models of space and time

Existence

The researchers first examine the empirical questions: do models represent time and space at all? If so, where inside the model? And does representation quality change significantly with model size?

In their first experiment, the researchers trained probes on every layer of Llama 2-{7B, 13B, 70B} for each spatial and temporal dataset.

The researchers' main results, shown in the figure below, show fairly consistent patterns across datasets. In particular, both spatial and temporal features can be recovered by linear probes.


As model size increases, these representations become more accurate, and representation quality improves steadily through the first half of the model's layers before reaching a plateau.

These observations are consistent with the factual-recall literature, which suggests that early-to-middle MLP layers are responsible for recalling information about factual subjects.

The worst-performing dataset is the New York City dataset. This is to be expected considering that most entities are relatively obscure compared to other datasets.

However, this is also the dataset where the largest model shows the biggest relative gain, with nearly double the R² of the smaller models, suggesting that a sufficiently large LLM could eventually form a detailed spatial model of an individual city.

Linear representations

In the interpretability literature, there is growing evidence for the linear representation hypothesis: that features in neural networks are represented linearly.

That is, the presence or intensity of a feature can be read out by projecting the relevant activations onto a feature vector. However, those results almost always concern binary or categorical features, unlike naturally continuous features such as space or time.

To test whether spatial and temporal features are represented linearly, the researchers compared the performance of linear ridge regression probes with that of more expressive nonlinear MLP probes.

The results, shown below, indicate that for every dataset and model the improvement in R² from using nonlinear probes is minimal.

The researchers take this as strong evidence that space and time, despite being continuous, are also represented linearly (or are at least linearly decodable).
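A rough sketch of that comparison, continuing with the placeholder arrays from the ridge-probe sketch above; the MLP width and training settings are illustrative, not taken from the paper.

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.neural_network import MLPRegressor

# Linear probe vs. a small nonlinear MLP probe on the same activations.
linear = Ridge(alpha=1.0).fit(A_train, Y_train)
mlp = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500, random_state=0).fit(A_train, Y_train)

print("linear probe R^2:", r2_score(Y_test, linear.predict(A_test)))
print("MLP probe R^2:   ", r2_score(Y_test, mlp.predict(A_test)))
```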


Sensitivity to prompts

Another obvious question is whether these spatial and temporal features are sensitive to the prompt, i.e., can context induce or suppress the recall of these facts?

Intuitively, for any entity token, the autoregressive model is incentivized to generate representations suitable for solving any possible future context or problem.

To investigate this, the researchers created new activation datasets in which each entity name is preceded by different prompts following several basic templates. In all cases, they included an "empty" prompt containing nothing but the entity tokens (and the beginning-of-sequence token).

They then added prompts asking the model to recall the relevant fact, such as "What is the latitude and longitude of <place>?" or "What is the release date of <book>?"

For the United States and New York City datasets, the researchers also included versions of these prompts stating that the location is in the United States or in New York City, to disambiguate common place names (such as City Hall).

As a baseline, the researchers included prompts of 10 random tokens (sampled separately for each entity). To see whether the model's sense of the subject could be disrupted, for some datasets they fully capitalized every entity name.

Finally, for the headline dataset, the researchers tried probing both the last headline token and a period token appended after the headline.


The upper panel shows results for the 70B model; the lower panel shows results for all models.


The researchers found that explicitly prompting the model to recall the fact, or giving disambiguating context such as being in the United States or in New York City, had little effect on performance. However, they were surprised by how much the random distractor tokens reduced performance.

Capitalizing entity names also degrades performance, though less severely, which is not too surprising, since it may interfere with "detokenization" of the entities.

One change that significantly improves performance is probing the period token after the headline, suggesting that the period carries summary information about the sentence it ends.

Robustness checks

The previous sections have shown that the true time or location of different types of events and places can be recovered linearly from the internal activations of an LLM's middle and later layers.

However, this does not tell us whether (or how) the model actually uses the feature directions learned by the probes, because a probe can itself learn some linear combination of simpler features that the model actually uses.

Validation by generalization

To illustrate the potential problems with the researchers' results, consider the task of representing a complete map of the world.

If the model has almost orthogonal binary features "in country X," as the researchers expect, then a high-quality latitude (longitude) probe can be constructed by adding these orthogonal feature vectors for each country with a coefficient equal to the latitude (longitude) of that country.

Assuming that a place is located in only one country, such a probe places each entity in its national centroid.

In that case, however, the model would not actually represent space, only country membership, and it would be the probe that learns the geometry of the different countries from explicit supervision.

To better distinguish between these cases, the researchers analyzed how probes generalize when a specific block of data is held out.

In particular, the researchers trained a series of probes, each with one block held out: a country, state, borough, century, decade, or year (for the world, United States, New York City, historical figures, entertainment, and headline datasets respectively).

The researchers then evaluate each probe on its held-out block, reporting the average error on a fully held-out block alongside the error on that block's test points under the default train-test split (averaged across all held-out blocks).

The researchers found that while generalization performance suffers, especially for the spatial datasets, it is still far better than chance. Plotting the projections for the held-out states or countries, as in the figure below, gives a clearer picture.

(Figure: probe projections for the world dataset.)

That is, the probes generalize by placing held-out points in roughly the correct relative position (measured by the angle between the true and predicted centroids) rather than in the correct absolute position.

The researchers see this as weak evidence that the probes are extracting features the model has explicitly learned, while memorizing the conversion from model coordinates to human coordinates.

However, this does not completely rule out the underlying binary-feature hypothesis, since there may be hierarchies of such features that do not follow country or decade boundaries.
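As a sketch of the block-holdout check (again continuing with the placeholder arrays above), one can hold out every entity from one country, fit the probe on the rest, and compare the held-out country's predicted centroid with its true centroid; the `countries` labels here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical country labels aligned with the rows of A and Y.
countries = np.array(["France"] * 100 + ["Japan"] * 900)

held_out = "France"                      # illustrative choice of held-out block
train_mask = countries != held_out

probe = Ridge(alpha=1.0).fit(A[train_mask], Y[train_mask])
pred = probe.predict(A[~train_mask])

# Compare the held-out country's predicted centroid with its true centroid;
# the paper's check looks at the angle between them.
print("true centroid:", Y[~train_mask].mean(axis=0))
print("predicted centroid:", pred.mean(axis=0))
```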

Cross-entity generalization

Implicit in the researchers' discussions so far is that the model represents the spatial or temporal coordinates of different types of entities, such as cities or natural landmarks, in a uniform way.

However, just as a latitude probe could be a weighted sum of country-membership features, it could also be the sum of distinct (orthogonal) directions for "latitude of a city" and "latitude of a natural landmark."

As before, the researchers distinguished between these hypotheses by training a series of probes whose train-test split holds out all points of a particular entity class, and then comparing the error on the held-out class with the error on those entities under the default test split, averaged across all such splits.

The results show that the probes largely generalize across entity types, with the exception of the entertainment dataset.


Spatial and temporal neurons

While the previous results are suggestive, they provide no direct evidence that the model uses the features learned by the probes.

To address this, the researchers searched for individual neurons whose input or output weights have high cosine similarity with the learned probe direction.

That is, they looked for neurons that read from or write to the residual stream along a direction similar to the one the probe learned.
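A minimal sketch of such a search, assuming the Llama-2 model loaded in the earlier sketch and a ridge probe trained on activations from that same model (so the dimensions match); for brevity it scans only the MLP output weights (down_proj), whereas the paper considers both input and output weights.

```python
import torch
import torch.nn.functional as F

# Take one probe direction, e.g. the latitude row of the ridge probe's weights.
probe_dir = torch.tensor(probe.coef_[0], dtype=torch.float32)

candidates = []
for layer_idx, layer in enumerate(model.model.layers):
    # down_proj.weight has shape (d_model, d_ffn): column j is the residual-stream
    # direction written by MLP neuron j in this layer.
    w_out = layer.mlp.down_proj.weight.detach().float()
    sims = F.cosine_similarity(w_out.T, probe_dir.unsqueeze(0), dim=1)
    j = int(sims.abs().argmax())
    candidates.append((layer_idx, j, float(sims[j])))

candidates.sort(key=lambda t: abs(t[2]), reverse=True)
print("most probe-aligned neurons (layer, neuron, cosine):", candidates[:5])
```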

They found that when projecting the activation dataset onto the weights of the most similar neurons, these neurons were indeed highly sensitive to the entity's true location in space or time.


That is, there are individual neurons in the model that are themselves feature probes with considerable predictive power.

In addition, these neurons are sensitive to all entity types in the dataset, which further indicates that these representations are uniform.

If probes trained with explicit supervision give an approximate upper bound on the extent to which the model represents these spatial and temporal features, then the performance of individual neurons gives a lower bound.

In particular, researchers generally believe that features are stored in superposition, which makes individual neurons the wrong level of analysis.

Still, the existence of these individual neurons (trained with no supervision other than next-token prediction) is strong evidence that the model learns and uses features of space and time.

Othello-GPT shows that LLMs understand the world, earning praise from Andrew Ng

The most immediate inspiration for MIT researchers is previous research on the extent to which deep learning systems form interpretable models of data generation processes.

The most powerful and clear demonstrations undoubtedly come from GPT models trained on chess and Othello games, which have clear representations of the board and the state of the game.

In February, researchers from Harvard University and the Massachusetts Institute of Technology co-published a new study, Othello-GPT, that verifies the validity of internal representations in a simple board game.

They argue that language models do build a model of the world internally, rather than relying on mere memorization or surface statistics, although the source of this capability is not fully understood.


Link to paper: https://arxiv.org/pdf/2210.13382.pdf

The experiment was simple: even though the model was given no prior knowledge of Othello's rules, the researchers found that it could predict legal moves with very high accuracy, capturing the state of the board.

Andrew Ng expressed strong approval of the study in his newsletter, saying that based on this research, there is reason to believe that large language models build sufficiently complex world models that, to some extent, they understand the world.


Blog link: https://www.deeplearning.ai/the-batch/does-ai-understand-the-world/

Board-game world model

If you treat the game board as a simple "world" and require the model to keep making decisions during play, you can begin to test whether a sequence model can learn a representation of that world.


The researchers chose Othello, a simple Reversi-style game, as the experimental platform. Its rules are as follows:

On an 8×8 board, four discs are placed in the centre to start, two black and two white. The two players then take turns placing discs; any of the opponent's discs that lie in a straight or diagonal line between the newly placed disc and another disc of the player's own colour (with no empty squares in between) are flipped to the player's colour, and every move must flip at least one disc. At the end, when the board is filled, the player with more discs wins.

Compared with chess, Othello's rules are much simpler; at the same time, the game's search space is large enough that the model cannot generate sequences by memorization alone, making it well suited to testing a model's ability to learn a world representation.

Othello language model

The researchers first trained a GPT-variant language model (Othello-GPT) by feeding it game transcripts (the sequences of moves made by the players); the model was given no prior knowledge of the game or its rules.

The model was also not explicitly trained to improve its strategy, win games, and so on, yet it generates legal Othello moves with relatively high accuracy.

Dataset

The researchers used two sets of training data:

The championship dataset emphasizes data quality: it consists of the more strategic moves played by professional human players in two Othello tournament collections, containing only 7,605 and 132,921 game records respectively; the two sources were combined and randomly split 8:2 into a training set and a validation set.


The synthetic dataset emphasizes scale: it consists of random but legal moves sampled uniformly from the Othello game tree, so its distribution differs from the championship data; 20 million games are used for training and 3.796 million for validation.

Each game is described as a string of tokens drawn from a vocabulary of size 60 (8×8 − 4).
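A small sketch of what that 60-token vocabulary could look like: every square except the four pre-filled centre squares gets its own token id (the square-naming convention and the example move sequence are illustrative).

```python
# Map Othello board squares to token ids, skipping the four centre squares.
ROWS = "ABCDEFGH"
CENTER = {"D4", "D5", "E4", "E5"}

squares = [f"{r}{c}" for r in ROWS for c in range(1, 9) if f"{r}{c}" not in CENTER]
square_to_token = {sq: i for i, sq in enumerate(squares)}

print(len(square_to_token))       # 60 == 8*8 - 4
print(square_to_token["C5"])      # token id for the move C5

# A game transcript is then just the sequence of token ids, one per move.
game = ["C5", "E3", "F4"]         # illustrative (not necessarily legal) moves
tokens = [square_to_token[m] for m in game]
print(tokens)
```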

Models and training

The model is an 8-layer GPT with 8 attention heads and a hidden dimension of 512.

The model's weights are initialized completely at random, including the word-embedding layer; although there are geometric relationships among the vocabulary items representing board positions (e.g., C4 is below B4), this inductive bias is not provided explicitly and is left for the model to learn.

Predict legal moves

The main evaluation metric is whether the moves predicted by the model comply with Othello's rules.

Othello-GPT trained on the synthetic dataset has an error rate of 0.01%, and on the championship dataset 5.17%, compared with 93.29% for an untrained Othello-GPT, which means that both datasets let the model learn the rules of the game to some extent.

One possible explanation is that the model simply memorized every move sequence in the Othello games it saw.

To test this hypothesis, the researchers synthesized a new dataset: Othello has four possible opening moves at the start of each game (C5, D6, E3, and F4), so they removed all games opening with C5 from the training set and used those C5-opening games as the test set, cutting away nearly a quarter of the game tree, and found that the model's error rate was still only 0.02%.

So Othello-GPT's high performance is not due to memorization, since the test data was completely unseen during training. What, then, enables the model to predict successfully?

Exploring internal representations

A commonly used tool for detecting a neural network's internal representations is the probe: a classifier or regressor whose input is the network's internal activations and which is trained to predict a feature of interest.

Here, to test whether Othello-GPT's internal activations contain a representation of the current board state, the internal activation vectors produced after a move sequence is fed in are used to predict the state of the board.

With linear probes, predictions from the trained Othello-GPT's internal representations were only slightly more accurate than random guessing.


With nonlinear probes (two-layer MLPs), the error rate drops dramatically, indicating that the board state is present in the network's activations, but not stored in a simple linear form.
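A minimal PyTorch sketch of such a nonlinear probe for a single board square: a two-layer MLP mapping one internal activation vector (dimension 512, matching the model size quoted above) to a three-way state, empty / black / white. The hidden width and other details are illustrative rather than the paper's exact configuration, and one such probe would be trained per board square.

```python
import torch
import torch.nn as nn

class BoardSquareProbe(nn.Module):
    """Two-layer MLP probe: activation vector -> {empty, black, white} logits."""
    def __init__(self, d_model: int = 512, hidden: int = 128, n_states: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_states),
        )

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        return self.net(activation)       # (batch, 3) logits for one square

probe = BoardSquareProbe()
logits = probe(torch.randn(4, 512))       # e.g. activations after 4 different moves
print(logits.shape)                        # torch.Size([4, 3])
```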


Intervention experiments

To determine whether there is a causal relationship between this emergent world representation and the model's predictions, i.e., whether the board state actually influences the network's predictions, the researchers ran a set of intervention experiments and measured the resulting impact.

Given a set of activations from Othello-GPT, they use the probe to predict the board state, record the associated move prediction, and then modify the activations so that the probe reports an updated board state.


Interventions include, for example, flipping the disc at a given position from white to black. The researchers found that after such small edits, the model's predictions update reliably to match the edited board state, i.e., the internal representation has a causal effect on the model's predictions.
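A rough sketch of the intervention idea, reusing the BoardSquareProbe above: the activation vector is nudged by gradient steps until the probe reports the edited state for that square (say, flipped to black), and the remaining transformer layers are then run from the edited activation. The optimizer, step count, and learning rate are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def intervene(activation: torch.Tensor, probe: BoardSquareProbe,
              target_state: int, steps: int = 50, lr: float = 0.1) -> torch.Tensor:
    """Edit an activation so the probe predicts `target_state` for its square."""
    x = activation.clone().detach().requires_grad_(True)
    target = torch.tensor([target_state])            # e.g. 1 == black
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(probe(x), target)
        loss.backward()
        opt.step()
    return x.detach()   # feed this back into the remaining layers of the model

edited = intervene(torch.randn(1, 512), probe, target_state=1)
print(edited.shape)
```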

Visualization

Beyond the intervention experiments that validate the internal representation, the researchers also visualize the predictions: for each square on the board, they can ask how the model's prediction would change if the disc on that square were flipped with the intervention technique, which yields a saliency value for the prediction.


Clear patterns appear in the latent saliency maps of the top-1 predictions for Othello-GPTs trained on both the synthetic and championship datasets.

In short, this Harvard and MIT study shows that large language models do understand the world to some degree, and it is no wonder the work earned Andrew Ng's appreciation.

GPT-4 is just a spark for AGI? LLM will eventually retire, and the world model is the future

Why is the "world model" so appealing?

Precisely because artificial general intelligence (AGI), the ultimate form and goal of AI development, is meant to be a "model that understands the world," not just "a model that describes the world."

In 1931, Kurt Gödel published his incompleteness theorems.

Gödel's theorems state that even mathematics cannot conclusively prove everything; there will always be facts that cannot be proved. Quantum theory, meanwhile, tells us that our world lacks the determinism needed to predict certain things, such as the velocity and position of an electron.


Although Einstein famously insisted that "God does not play dice with the universe," human limitations are already fully on display when we merely try to predict or understand things in physics.

In How We Learn, scholar Stanislas Dehaene defines learning as "the process of forming a model of the world."

In 2016, AlphaGo defeated Go world champion Lee Sedol by a decisive score of 4-1.

However, it lacked the human ability to recognize unusual tactics and adjust accordingly. It was therefore only a narrow, weak form of artificial intelligence.

The AGI we need is a world model that is consistent with experience and can make accurate predictions about the world.

On April 13, Microsoft released a paper titled "Sparks of Artificial General Intelligence: Early Experiments with GPT-4."


Address: https://arxiv.org/pdf/2303.12712

It mentions:

GPT-4 not only masters language, but also solves cutting-edge tasks covering mathematics, coding, vision, medicine, law, psychology, and other fields without the need for any special prompts.

In all of these tasks, GPT-4's performance is close to the human level. Based on the breadth and depth of GPT-4's capabilities, the researchers believe it can reasonably be regarded as an early, though still incomplete, version of artificial general intelligence.

However, as many experts have pointed out, this mistakenly equates performance with capability: the summary description of the world that GPT-4 generates is being treated as an understanding of the real world.

Most of today's models are trained only on text and do not have the ability to speak, hear, smell, and act in the real world.

As in Plato's allegory of the cave, the people living in the cave can see only the shadows on the wall and cannot perceive the true nature of things.


Both the Harvard-MIT study from February and today's paper show that large language models do understand the world to some extent, rather than merely producing grammatically correct text.

These possibilities alone are exciting enough.


Resources:

https://arxiv.org/abs/2310.02207

https://twitter.com/wesg52/status/1709551516577902782
