Meta fork GPT-3 "backstab" OpenAI, full model weights & training code published

Mengchen, Xiao Zhen | Reporting from Aofei Temple

Qubits | Official account QbitAI

A 100-billion-parameter AI model whose code you can actually get your hands on?!

Overnight, something sensational happened in the AI world:

Meta AI has opened up OPT-175B, a large language model "weighing in" at 175 billion parameters: a parameter count comparable to GPT-3's 175 billion, with performance that does not lose to GPT-3 at all.

This means AI researchers can finally "pry open" a GPT-3-class model and see what secrets lie inside.

GPT-3, for all its impressive results, was never open enough: its source code is exclusively licensed to Microsoft, and even Elon Musk has criticized OpenAI for its lack of openness.

The paper is out there, but anyone who wants to do further research on it has to reproduce the model first.

This time, Meta is opening everything, from the complete model to the training and deployment code.

Some people even found the GitHub repository before the official announcement, when nothing had been uploaded yet.

Someone even @-ed OpenAI in an attempt to "stir up a rivalry":

So what are the characteristics of Meta's model, how does it keep energy use low, and why open it to the public at all? Let's take a look.

You can run it on 16 V100s

OPT is short for Open Pre-trained Transformer Language Models.

Compared with GPT, the name simply swaps "Generative" for "Open", which says quite a lot by itself. (insert dog-head emoji here)

In the paper, Meta AI makes no secret that OPT-175B is benchmarked against GPT-3, and it also hints that its own model is the more environmentally friendly one:

Meta AI explained that OPT was built to be open from the start: to let more people study large models, the setup has to be as economical as possible.

Sure enough, the carbon footprint generated along the way is less than 1/7 of GPT-3's, genuinely energy-saving and efficient.

To let researchers "work within their means", Meta AI trained OPT models at a range of sizes, from 125 million parameters all the way up to 175 billion.

Among them, the 66-billion-parameter model is still being trained and will be available soon:

So how well does the largest model, OPT-175B, actually perform, and how was that achieved?

In terms of performance, Meta AI evaluated OPT-175B against GPT-3 on 14 NLP tasks.

The results show that OPT's average accuracy across these tasks is close to GPT-3's in both zero-shot and few-shot settings (dashed lines are GPT, solid lines are OPT):

△ Zero-shot learning on the left, few-shot learning on the right

Looking at specific tasks: on dialogue, OPT-175B trained with unsupervised methods performs about as well as several models trained with supervision:

On hate speech detection, it even outperforms the Davinci version of GPT-3 (the best of GPT-3's four variants):

As for training hardware, Meta AI used 992 NVIDIA A100 GPUs (80 GB) to train OPT, reaching an average compute efficiency of up to 147 TFLOP/s per GPU.

That is roughly 17% higher than the per-GPU efficiency NVIDIA's own researchers have reported.
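To put that number in perspective, here is a quick back-of-the-envelope estimate (my own arithmetic, not a figure from the paper), using the A100's published peak fp16/bf16 tensor-core throughput of about 312 TFLOP/s:

```python
# Rough per-GPU utilization estimate; the comparison is illustrative only.
achieved_tflops = 147    # average throughput per GPU reported for OPT-175B training
a100_peak_tflops = 312   # approximate A100 peak, fp16/bf16 tensor cores
print(f"~{achieved_tflops / a100_peak_tflops:.0%} of peak per GPU")  # -> ~47%
```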

Meta AI revealed that, on the one hand, it used its own GPU-memory-saving tool, FSDP (Fully Sharded Data Parallel), which makes large-scale training roughly 5x faster than traditional methods;

on the other hand, it borrowed the tensor parallelism of NVIDIA's Megatron-LM, which splits a single operation across multiple processors at once. A rough sketch of the data-parallel side appears below.
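For the FSDP side, a minimal sketch of the general pattern looks like the following. This is not Meta's actual training code (that lives in metaseq); the toy model, sizes, and launch assumptions are all illustrative.

```python
# Minimal FSDP sketch: shard parameters, gradients, and optimizer state across
# data-parallel workers to save GPU memory. Launch with `torchrun --nproc_per_node=N`.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")            # torchrun sets the env vars
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Stand-in model; the real OPT-175B is a much larger decoder-only Transformer.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=4,
    ).cuda()

    # Wrapping in FSDP shards the weights across ranks and gathers them on the
    # fly during forward/backward instead of keeping a full copy on every GPU.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 128, 1024, device="cuda")
    loss = model(x).pow(2).mean()                       # dummy loss to drive backward
    loss.backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```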

Meta AI even says that as few as 16 NVIDIA V100 GPUs are enough to train and deploy the OPT-175B model.
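At least the deployment half of that claim passes a rough memory check (my own estimate, assuming 32 GB V100s and fp16 weights; the actual serving setup may differ):

```python
# Do the fp16 weights of OPT-175B fit across 16 V100s?
params = 175e9                      # parameter count
weight_gb = params * 2 / 1e9        # fp16 = 2 bytes per parameter -> 350 GB
gpu_gb = 16 * 32                    # 16 V100s at 32 GB each       -> 512 GB
print(weight_gb, gpu_gb)            # 350.0 512
# The weights alone fit, leaving some headroom for activations; full training
# needs far more memory (gradients, optimizer state), hence the A100 cluster above.
```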

Some netizens can't wait to try it:

Of course, Meta AI does not shy away from the problems the OPT-175B model still faces, such as a tendency to generate "toxic language" (offensive wording, discriminatory language, and so on):

The researchers said they hope that, with the model open, more people will get involved in the research and genuinely solve these problems.

How to fork GPT-3, step by step

As mentioned above, in this OPT model series the versions with 30 billion parameters and below can be downloaded directly, and the 66-billion version is still on the way.

Only the full 175-billion version requires an additional application form, asking for your affiliation, intended use, related publications, and so on.
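If you just want to poke at one of the smaller, directly downloadable checkpoints, a minimal sketch with Hugging Face Transformers looks like this (it assumes the checkpoints are mirrored on the Hugging Face Hub under names like facebook/opt-125m; the official release route is the metaseq repository):

```python
# Load one of the smaller open OPT checkpoints and generate a continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-125m"                       # smallest of the released sizes
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Open pre-trained transformer language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```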

The training and deployment code is published on GitHub as the toolkit metaseq, along with tutorials and documentation.

An offshoot of the well-known Fairseq toolkit, metaseq focuses on the 175-billion-parameter model, stripping out the parts that are not needed for training and serving large models.

Many developers also prize a "hidden treasure" released alongside the model and code: the development logbook.

It records in detail the problems the Meta team ran into while developing the model, how they solved them, and the reasoning behind their decisions.

It offers first-hand answers to pain points and puzzles that have existed in machine learning research since the birth of PyTorch.

Openness on this scale is arguably unprecedented, and it has naturally drawn plenty of praise.

For example, from Thomas Wolf, chief scientist at Hugging Face, which is working on its own open-source large-model project.

Still, some have doubts about needing to apply for the 175-billion-parameter version.

I'm not an academic or a practitioner; will they even accept my application?

Some developers have suggested that Meta provide a few demos the way OpenAI does: if people could see the results, they would be more willing to join in the research and improvement, whereas just setting up a development environment is discouraging enough on its own.

Percy Liang, associate professor at Stanford and director of its Center for Research on Foundation Models, weighed in on this, summarizing the openness of large models into four levels; higher levels of openness let researchers focus on deeper problems:

At the first level, the paper is open, proving the feasibility of some ideas and providing a blueprint for building them.

At the second level, the API is open, allowing researchers to probe and evaluate the capabilities (e.g., reasoning) and limitations (e.g., bias) of existing models.

At the third level, the model weights and training data are open. This allows researchers to incrementally improve existing models, develop deeper interpretability techniques and more effective fine-tuning methods, and better understand the role the training data plays in model behavior.

At the fourth level, compute is open, allowing researchers to experiment with new architectures, training objectives and procedures, do data fusion, and develop entirely new models in different domains.

Percy Liang believes that higher levels of openness also bring more risks.

Maybe it's time to develop some community norms around this?

One More Thing

The paper lists three co-first authors, one of whom, Susan Zhang, came to Meta from OpenAI.

During her time at OpenAI, however, she was not involved in developing GPT-3; instead she worked on OpenAI Five, the reinforcement learning project that played Dota, as well as research on large multimodal models.
