
Ilya's first move after leaving the company: he liked a paper, and netizens rushed to read it

Author: QbitAI

Xifeng, reporting from Aofei Temple
QbitAI | WeChat official account QbitAI

Since Ilya Sutskever officially announced his departure from OpenAI, his next move has become the focus of everyone's attention.

There are even people who pay close attention to his every move.

Sure enough, no sooner had Ilya liked ❤️ a new paper than netizens rushed over to check it out.

The paper is from MIT, and the authors propose a hypothesis that can be summarized in one sentence:

Neural networks trained with different objectives on different data and modalities are converging toward a shared statistical model of the real world in their representation spaces.

They named this conjecture the Platonic representation hypothesis, a reference to Plato's allegory of the cave and his notion of an ideal reality.


Ilya's taste is as reliable as ever; after reading it, some netizens called it the best paper they have seen this year:


One particularly witty netizen summed it up by riffing on the opening line of Anna Karenina: all happy language models are alike; each unhappy language model is unhappy in its own way.


To paraphrase Whitehead's famous quote: All machine learning is a footnote to Plato.


We also took a look; roughly, the paper says the following:

The authors analyze the representational convergence of AI systems, i.e., the way data points are represented in different neural network models is becoming increasingly similar, across model architectures, training objectives, and even data modalities.

What is driving this convergence? Will this trend continue? Where does it end up?

After a series of analyses and experiments, the researchers speculated that this convergence does have an endpoint and a driving principle: different models are striving to achieve an accurate representation of reality.

The paper illustrates this with a diagram in which images (X) and text (Y) are different projections of a common underlying reality (Z). The researchers speculate that representation learning algorithms will converge to a unified representation of Z, and that growth in model size and in the diversity of data and tasks are the key factors driving this convergence.

All we can say is that it lives up to the kind of question Ilya is interested in. It is rather profound, and we don't fully understand it ourselves, so we asked an AI to help interpret it and are sharing the result with you~


Evidence of representational convergence

First, the authors surveyed a large number of previous related studies and also ran experiments of their own, presenting a series of evidence for representational convergence: convergence across different models, the relationship between scale, performance, and alignment, and convergence across modalities.

PS: This work focuses on vector embedding representations: data is mapped into vectors, and a kernel function describes the similarity or distance between data points. "Representation alignment" in this paper means that if two different representations reveal similar data structure (i.e., induce similar kernels), the two representations are considered aligned.
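
To make this kernel-based notion of alignment concrete, here is a minimal sketch (our own illustration, not the paper's exact metric): it builds inner-product kernels from two models' embeddings of the same inputs and compares them with linear CKA, one common way to score kernel similarity. The function names and the toy data are assumptions of ours.

```python
import numpy as np

def kernel(embeddings: np.ndarray) -> np.ndarray:
    """Inner-product kernel K[i, j] = <f(x_i), f(x_j)> for one model's embeddings."""
    return embeddings @ embeddings.T

def center(K: np.ndarray) -> np.ndarray:
    """Double-center a kernel matrix (a standard step before comparing kernels)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def linear_cka(K1: np.ndarray, K2: np.ndarray) -> float:
    """Linear CKA: scores how similar two kernels are, between 0 and 1."""
    K1c, K2c = center(K1), center(K2)
    hsic = (K1c * K2c).sum()
    return hsic / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

# Toy usage: two hypothetical models embed the same 100 inputs.
rng = np.random.default_rng(0)
emb_model_a = rng.normal(size=(100, 64))                    # one model's embeddings
emb_model_b = emb_model_a @ rng.normal(size=(64, 32))       # a fake second model, correlated by construction
print(linear_cka(kernel(emb_model_a), kernel(emb_model_b)))
```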

1. Convergence across different models: models with different architectures and objectives tend to be consistent in their underlying representations.

The number of systems built on top of pretrained foundation models keeps growing, and a few models are becoming standard backbones for many tasks. This broad applicability across applications reflects a certain universality in how they represent data.

While this trend suggests that AI systems are converging toward a smaller set of foundation models, it does not by itself prove that different foundation models form the same representations.

However, recent model-stitching studies have found that the intermediate-layer representations of image classification models can be well aligned even when the models are trained on different datasets.

For example, some studies have found that the early layers of convolutional networks trained on ImageNet and on Places365 are interchangeable, indicating that they learn similar initial visual representations. A large number of "Rosetta neurons" have also been discovered, i.e., neurons that activate with highly similar patterns across different vision models.

2. The larger the model and the better its performance, the higher the degree of representation alignment.

The researchers measured the alignment of 78 models on the Places-365 dataset using a mutual nearest-neighbor method and evaluated their downstream performance on the Visual Task Adaptation Benchmark (VTAB).


The results show that representation alignment is significantly higher within the cluster of models with stronger generalization ability.

Previous studies have also observed higher CKA kernel alignment between larger models, and theoretical work has shown that models with similar output performance must also have similar internal activations.
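
As a rough illustration of the mutual nearest-neighbor style of metric mentioned above (our own simplified sketch, not the paper's exact implementation): for each sample, take its k nearest neighbors in each model's embedding space and score the average overlap of the two neighbor sets.

```python
import numpy as np

def knn_indices(embeddings: np.ndarray, k: int) -> np.ndarray:
    """For each row, return the indices of its k nearest neighbors (Euclidean, excluding itself)."""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def mutual_knn_alignment(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 10) -> float:
    """Average overlap of k-NN sets computed in two different representation spaces."""
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

# Toy usage: embeddings of the same 200 inputs from two hypothetical models.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(200, 128))
emb_b = emb_a[:, :64] + 0.1 * rng.normal(size=(200, 64))  # correlated by construction
print(mutual_knn_alignment(emb_a, emb_b, k=10))
```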

3. Model representations in different modalities are converging.

The researchers used the mutual nearest-neighbor method on WIT, a Wikipedia image-text dataset, to measure cross-modal alignment.

The results reveal a linear relationship between language-vision alignment and language modeling scores, with a general trend that more capable language models align better with more capable vision models.


4. Models also show a degree of alignment with brain representations, possibly because they face similar data and task constraints.

As early as 2014, researchers found that intermediate-layer activations of neural networks are highly correlated with activation patterns in the brain's visual areas, possibly because both face similar visual tasks and data constraints.

Since then, it has been further found that the use of different training data can affect the alignment of brain and model representations. Psychological research has also found that the way humans perceive visual similarity is highly consistent with neural network models.

5. Representation alignment is positively correlated with downstream task performance.

The researchers evaluated model performance on two downstream tasks: HellaSwag (commonsense reasoning) and GSM8K (math). The vision model DINOv2 was used as a reference, and the alignment of various language models to it was measured.

Experimental results show that language models that align more closely with the vision model also perform better on HellaSwag and GSM8K. The visualization shows a clear positive correlation between alignment and downstream performance.


We won't go into the prior studies in detail here; interested readers can check the original paper.

Reasons for convergence

Next, through theoretical analysis and experimental observation, the team proposed possible causes of representational convergence and discussed how these factors jointly drive different models to converge in how they represent the real world.

In machine learning, a model is trained to reduce its prediction error on the training data. To prevent overfitting, regularization terms are often added during training; regularization can be implicit or explicit.
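
In symbols, this is the standard regularized training objective (our notation; the paper's exact formulation may differ):

$$f^{*} \;=\; \arg\min_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim \mathrm{data}}\big[\mathcal{L}(f, x)\big] \;+\; R(f),$$

where $\mathcal{L}$ is the training loss and $R$ is the (explicit or implicit) regularizer.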

In this section, the researchers illustrate how each of the following factors might promote representational convergence during this optimization process.

1. Convergence via Task Generality

As models are trained to solve more tasks, they need to find representations that meet the needs of all tasks:

There are fewer representations capable of solving N tasks than representations capable of solving M tasks when M < N. As a result, as more general models are trained to solve more tasks simultaneously, fewer viable solutions remain.
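
Stated set-theoretically (our paraphrase, not the paper's exact notation), any representation that solves a larger set of tasks also solves every subset of them, so the feasible set only shrinks as tasks are added:

$$\{\, f : f \ \text{solves} \ T_1, \dots, T_N \,\} \;\subseteq\; \{\, f : f \ \text{solves} \ T_1, \dots, T_M \,\}, \qquad M \le N.$$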


A similar principle has been proposed before.

Moreover, easy tasks admit many solutions while difficult tasks admit fewer. Therefore, as task difficulty increases, models' representations tend to converge toward a smaller set of better solutions.

2. Convergence via Model Capacity

The researchers point to a capacity hypothesis: if a globally optimal representation exists, then given sufficient data, larger models are more likely to approximate it.

As a result, larger models trained with the same objective, regardless of architecture, tend to converge on this optimal solution. And when different training objectives have similar minimizers, larger models find those minimizers more efficiently and tend toward similar solutions across training tasks.


3. Convergence via Simplicity Bias

The researchers also put forward a simplicity-bias hypothesis: deep networks tend to seek simple fits to the data, and this inherent bias pushes large models to simplify their representations, leading to convergence.


That is, larger models cover a wider hypothesis space and can fit the same data in many possible ways, but the implicit simplicity preference of deep networks encourages them to find the simplest of these solutions.


The endpoint of convergence

After a series of analyses and experiments, as described at the beginning, the researchers proposed the Platonic representation hypothesis, speculating on the end point of this convergence.

That is, although different AI models are trained on different data and objectives, their representation spaces are converging toward a common statistical model of the real world that generates the data we observe.


They first construct an idealized model of a world of discrete events. The world consists of a sequence of discrete events Z, each sampled from an unknown distribution P(Z). Each event can be observed in different ways through observation functions obs, e.g., as pixels, sounds, or text.
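
A toy sketch of this idealized setup (entirely our own illustration; the event space, distribution, and observation functions below are made up): sample discrete events z from P(Z), then render each event through two different observation functions to get paired "image-like" and "text-like" data.

```python
import numpy as np

rng = np.random.default_rng(0)

# An unknown world distribution P(Z) over discrete events.
num_events = 5
p_z = rng.dirichlet(np.ones(num_events))

# Two fixed observation functions: each event is rendered differently per modality.
obs_image = {z: f"image_of_event_{z}" for z in range(num_events)}    # stand-in for pixels
obs_text = {z: f"caption_for_event_{z}" for z in range(num_events)}  # stand-in for text

def sample_paired_observations(n: int):
    """Sample events z ~ P(Z), then observe each one in both modalities."""
    zs = rng.choice(num_events, size=n, p=p_z)
    return [(obs_image[z], obs_text[z]) for z in zs]

# Both modalities are projections of the same underlying events Z.
print(sample_paired_observations(3))
```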

Next, the authors consider a class of contrastive learning algorithms that try to learn a representation fX such that the inner product of fX(xa) and fX(xb) approximates the log-odds of (xa, xb) being a positive pair (drawn from nearby observations) versus a negative pair (drawn at random).


Through mathematical derivation, the authors show that if the data are smooth enough, such an algorithm converges to a representation fX whose kernel is the pointwise mutual information (PMI) kernel of xa and xb.
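
In our notation (a paraphrase using the standard definition of PMI, not necessarily the paper's exact symbols), the claim is that the learned representation's inner product approaches the PMI kernel, up to constants:

$$\langle f_X(x_a),\, f_X(x_b) \rangle \;\approx\; K_{\mathrm{PMI}}(x_a, x_b) \;=\; \log \frac{P(x_a, x_b)}{P(x_a)\,P(x_b)}.$$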


Since the study considers an idealized discrete world in which the observation functions obs are bijective, the PMI kernel of xa and xb equals the PMI kernel of the corresponding events za and zb.


This means that learning representations from either visual data X or linguistic data Y eventually converges to the same kernel function representing P(Z), i.e., the PMI kernel between event pairs.


The researchers tested this theory through an empirical study of color. Whether color representations are learned from the pixel co-occurrence statistics of images or from the word co-occurrence statistics of text, the resulting color distances resemble human perceptual judgments, and the similarity grows as model size increases.


This matches the theoretical analysis: greater model capacity allows the statistics of the observed data to be modeled more accurately, yielding PMI kernels closer to the ideal event representation.
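
In the spirit of that color experiment, here is a minimal sketch (our own toy, not the paper's code) of how a PMI kernel can be estimated from a co-occurrence count matrix, whether the counts come from pixels co-occurring in images or from words co-occurring in text:

```python
import numpy as np

def pmi_kernel(cooccurrence: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """PMI(a, b) = log p(a, b) / (p(a) p(b)), estimated from raw co-occurrence counts."""
    joint = cooccurrence / cooccurrence.sum()   # p(a, b)
    marg = joint.sum(axis=1)                    # p(a) (assumes a symmetric count matrix)
    return np.log((joint + eps) / (np.outer(marg, marg) + eps))

# Toy usage: a symmetric co-occurrence matrix over 4 items (e.g., 4 color words or 4 pixel bins).
counts = np.array([
    [40, 10,  2,  1],
    [10, 30,  3,  2],
    [ 2,  3, 25,  9],
    [ 1,  2,  9, 20],
], dtype=float)
K = pmi_kernel(counts)
print(np.round(K, 2))  # entries act as similarities: high PMI = items that co-occur more than chance
```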

Some final thoughts

Finally, the authors summarize the potential implications of representational convergence for the field of AI and for future research, along with possible limitations of and exceptions to the Platonic representation hypothesis.

They note that as model size increases, possible effects of representational convergence include, but are not limited to:

  • While simply scaling up can improve performance, different approaches differ in how efficiently they scale.
  • If modality-agnostic Platonic representations exist, then data from different modalities should be trained on jointly to find this shared representation. This helps explain why adding visual data benefits language model training, and vice versa.
  • Translating between aligned representations should be relatively simple, which may explain why conditional generation is easier than unconditional generation and why cross-modal translation is possible even for unpaired data.
  • Scaling up may reduce language models' tendency to hallucinate as well as certain biases, so that they more faithfully reflect the biases in their training data rather than exacerbate them.

The authors emphasize that these effects are premised on future models being trained on data that is sufficiently diverse and lossless to truly converge to a representation that reflects the statistical structure of the real world.

At the same time, the authors note that different modalities may carry unique information, which could make complete representational convergence hard to achieve even as models scale up. In addition, not all representations are currently converging; in robotics, for example, there is still no standardized way to represent world state. Researcher and community preferences may also steer models toward human-like representations while neglecting other possible forms of intelligence.

And intelligent systems designed for narrow, specific tasks may not converge to the same representations as general-purpose intelligence.

The authors also highlight that how to measure representation alignment remains contested, and different measurement methods may lead to different conclusions. Even if different models' representations are similar, gaps remain to be explained, and it is not yet possible to tell whether those gaps matter.

For more details and the full arguments, the paper link is below for everyone~


Link to paper: https://arxiv.org/abs/2405.07987

Reference Links:

[1]https://x.com/phillip_isola/status/1790488966308769951

[2]https://x.com/bayeslord/status/1790868039224688998

— END —

QbitAI (量子位), signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends
