Meta fork GPT-3 "backstab" OpenAI, full model weights & training code published

Mengchen, Xiao Zhen | Reporting from Aofei Temple

Qubits | Official account QbitAI

A 100-billion-parameter AI model whose code you can actually get your hands on?!

Overnight, something sensational happened in the AI world:

Meta AI has opened up OPT-175B, a large language model "weighing in" at 175 billion parameters: a parameter count comparable to GPT-3's 175 billion, with performance that does not lose to GPT-3 at all.

This means AI researchers can finally "pry open" a GPT-3-class model and see what secrets lie inside.

GPT-3, for all its impressive results, was never open enough: its source code is exclusively licensed to Microsoft, and even Elon Musk has criticized OpenAI for its lack of openness.

The paper is out there, but anyone who wants to do further research on it has to reproduce the model first.

This time, Meta is opening everything, from the complete model to the training and deployment code.

Some people even found the GitHub repository before the official announcement, when nothing had been uploaded yet.

Someone even @-ed OpenAI in an attempt to "stir up a rivalry":

So what are the characteristics of Meta's model, how does it keep energy use low, and why open it to the public at all? Let's take a look.

You can run it on 16 V100s

OPT is short for Open Pre-trained Transformer Language Models.

Compared with GPT, the name simply swaps "Generative" for "Open", which says quite a lot by itself. (insert dog-head emoji here)

In the paper, Meta AI makes no secret that OPT-175B is benchmarked against GPT-3, and it also hints that its own model is the more environmentally friendly one:

Meta AI explained that OPT was built to be open from the start: to let more people study large models, the setup has to be as economical as possible.

Sure enough, the carbon footprint generated along the way is less than 1/7 of GPT-3's, genuinely energy-saving and efficient.

To let researchers "work within their means", Meta AI trained OPT models at a range of sizes, from 125 million parameters all the way up to 175 billion.

Among them, the 66-billion-parameter model is still being trained and will be available soon:

So how well does the largest model, OPT-175B, actually perform, and how was that achieved?

In terms of performance, Meta AI evaluated OPT-175B against GPT-3 on 14 NLP tasks.

The results show that OPT's average accuracy across these tasks is close to GPT-3's in both zero-shot and few-shot settings (dashed lines are GPT, solid lines are OPT):

△ Zero-shot learning on the left, few-shot learning on the right

Looking at specific tasks: on dialogue, OPT-175B trained with unsupervised methods performs about as well as several models trained with supervision:

On hate speech detection, it even outperforms the Davinci version of GPT-3 (the best of GPT-3's four variants):

As for training hardware, Meta AI used 992 NVIDIA A100 GPUs (80 GB) to train OPT, reaching an average compute efficiency of up to 147 TFLOP/s per GPU.

That is roughly 17% higher than the per-GPU efficiency NVIDIA's own researchers have reported.
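To put that number in perspective, here is a quick back-of-the-envelope estimate (my own arithmetic, not a figure from the paper), using the A100's published peak fp16/bf16 tensor-core throughput of about 312 TFLOP/s:

```python
# Rough per-GPU utilization estimate; the comparison is illustrative only.
achieved_tflops = 147    # average throughput per GPU reported for OPT-175B training
a100_peak_tflops = 312   # approximate A100 peak, fp16/bf16 tensor cores
print(f"~{achieved_tflops / a100_peak_tflops:.0%} of peak per GPU")  # -> ~47%
```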

Meta AI revealed that, on the one hand, it used its own GPU-memory-saving tool, FSDP (Fully Sharded Data Parallel), which makes large-scale training roughly 5x faster than traditional methods;

on the other hand, it borrowed the tensor parallelism of NVIDIA's Megatron-LM, which splits a single operation across multiple processors at once. A rough sketch of the data-parallel side appears below.
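For the FSDP side, a minimal sketch of the general pattern looks like the following. This is not Meta's actual training code (that lives in metaseq); the toy model, sizes, and launch assumptions are all illustrative.

```python
# Minimal FSDP sketch: shard parameters, gradients, and optimizer state across
# data-parallel workers to save GPU memory. Launch with `torchrun --nproc_per_node=N`.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")            # torchrun sets the env vars
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Stand-in model; the real OPT-175B is a much larger decoder-only Transformer.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=4,
    ).cuda()

    # Wrapping in FSDP shards the weights across ranks and gathers them on the
    # fly during forward/backward instead of keeping a full copy on every GPU.
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 128, 1024, device="cuda")
    loss = model(x).pow(2).mean()                       # dummy loss to drive backward
    loss.backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```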

Meta AI even says that as few as 16 NVIDIA V100 GPUs are enough to train and deploy the OPT-175B model.
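At least the deployment half of that claim passes a rough memory check (my own estimate, assuming 32 GB V100s and fp16 weights; the actual serving setup may differ):

```python
# Do the fp16 weights of OPT-175B fit across 16 V100s?
params = 175e9                      # parameter count
weight_gb = params * 2 / 1e9        # fp16 = 2 bytes per parameter -> 350 GB
gpu_gb = 16 * 32                    # 16 V100s at 32 GB each       -> 512 GB
print(weight_gb, gpu_gb)            # 350.0 512
# The weights alone fit, leaving some headroom for activations; full training
# needs far more memory (gradients, optimizer state), hence the A100 cluster above.
```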

Some netizens can't wait to try it:

Of course, Meta AI does not shy away from the problems the OPT-175B model still faces, such as a tendency to generate "toxic language" (offensive wording, discriminatory language, and so on):

The researchers said they hope that, with the model open, more people will get involved in the research and genuinely solve these problems.

How to fork GPT-3, step by step

As mentioned above, in this OPT model series the versions with 30 billion parameters and below can be downloaded directly, and the 66-billion version is still on the way.

Only the full 175-billion version requires an additional application form, asking for your affiliation, intended use, related publications, and so on.
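If you just want to poke at one of the smaller, directly downloadable checkpoints, a minimal sketch with Hugging Face Transformers looks like this (it assumes the checkpoints are mirrored on the Hugging Face Hub under names like facebook/opt-125m; the official release route is the metaseq repository):

```python
# Load one of the smaller open OPT checkpoints and generate a continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-125m"                       # smallest of the released sizes
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Open pre-trained transformer language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```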

The training and deployment code is published on GitHub as the toolkit metaseq, along with tutorials and documentation.

An offshoot of the well-known Fairseq toolkit, metaseq focuses on the 175-billion-parameter model, stripping out the parts that are not needed for training and serving large models.

Many developers also prize a "hidden treasure" released alongside the model and code: the development logbook.

It records in detail the problems the Meta team ran into while developing the model, how they solved them, and the reasoning behind their decisions.

It offers first-hand answers to pain points and puzzles that have existed in machine learning research since the birth of PyTorch.

Openness on this scale is arguably unprecedented, and it has naturally drawn plenty of praise.

For example, from Thomas Wolf, chief scientist at Hugging Face, which is working on its own open-source large-model project.

Still, some have doubts about needing to apply for the 175-billion-parameter version.

I'm not an academic or a practitioner; will they even accept my application?

Some developers have suggested that Meta provide a few demos the way OpenAI does: if people could see the results, they would be more willing to join in the research and improvement, whereas just setting up a development environment is discouraging enough on its own.

Percy Liang, associate professor at Stanford and director of its Center for Research on Foundation Models, weighed in on this, summarizing the openness of large models into four levels; higher levels of openness let researchers focus on deeper problems:

At the first level, the paper is open, proving the feasibility of some ideas and providing a blueprint for building them.

At the second level, the API is open, allowing researchers to probe and evaluate the capabilities (e.g., reasoning) and limitations (e.g., bias) of existing models.

At the third level, the model weights and training data are open. This allows researchers to incrementally improve existing models, develop deeper interpretability techniques and more effective fine-tuning methods, and better understand the role the training data plays in model behavior.

At the fourth level, compute is open, allowing researchers to experiment with new architectures, training objectives and procedures, do data fusion, and develop entirely new models in different domains.

Percy Liang believes that higher levels of openness also bring more risks.

Maybe it's time to develop some community norms around this?

One More Thing

The paper lists three co-first authors, one of whom, Susan Zhang, came to Meta from OpenAI.

During her time at OpenAI, however, she was not involved in developing GPT-3; instead she worked on OpenAI Five, the reinforcement learning project that played Dota, as well as research on large multimodal models.
