
Pooling supercomputing to train a 176-billion-parameter AI model live, with 900 researchers working in the open

Report from Machine Heart

Machine Heart Editorial Department

If you had 1 million GPU hours, what kind of language model would you train?

As of yesterday, training of the big "BigScience" model was 5% complete.

The model's bf16 weights alone take up 329GB; it is being trained on 384 A100 GPUs with a per-GPU throughput of about 150 TFLOPS.

The good news: the training loss is falling.


Unlike the many companies that do not open-source their large models, the parameters of the BigScience training run are visible to everyone, and according to the project organizers there are still three months to go before the goal is reached.

AI has had a fundamental impact on human society, but unlike the rise of the internet, progress in AI depends heavily on training ever larger models on ever larger datasets. The resources driving this technological shift are therefore concentrated in the hands of a few large technology companies, a status quo that constrains AI research as well as its environmental, ethical, and social dimensions. For example, outsiders have no access to the training datasets or checkpoints, which makes it impossible for other researchers to rigorously analyze important aspects of a model such as its capabilities, limitations, potential improvements, and biases.

From May 2021 to May 2022, over an (expected) one-year span, 900 researchers from 60 countries and more than 250 institutions are working together to create a very large multilingual neural network model and a very large multilingual text dataset, training on the 28-petaflop Jean Zay supercomputer (hosted at IDRIS in France and powered largely by nuclear energy). The project is named BigScience.

Recently, the project's training run went live on Twitter.

What BigScience does

Open scientific collaboration is a successful model of research in other disciplines, and there are already several large shared research centers that benefit the world, such as CERN.

Similarly, the BigScience project aims to create, study, and share large language models within the AI/NLP research community in a new way, exploring a new model of collaboration around large models. The broad research community formed around the project will be able to investigate the many open questions about very large language models (capabilities, limitations, potential improvements, biases, general artificial intelligence, and so on) ahead of time and hold the academic discussions that push the technology forward.

What the BigScience model looks like

In simple terms, the BigScience model is a multilingual model with 176 billion parameters that has the following characteristics:

Similar to GPT, it is a decoder-only architecture with 176 billion parameters;

70 layers, with 112 attention heads per layer, a hidden dimension of 14336, and a sequence length of 2048 tokens;

ALiBi position embeddings and the GeLU activation function.
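
As a quick sanity check, the standard back-of-the-envelope estimate for a decoder-only transformer (roughly 12 · n_layer · d_model² weights plus embeddings) recovers the headline figure from the numbers above. The sketch below assumes a vocabulary size of around 250,000, which the article does not state.

```python
# Rough parameter-count check for the architecture described above.
# The 12 * n_layer * d_model**2 term is the usual decoder-only estimate
# (attention + MLP blocks); the vocabulary size is an assumption.

n_layer = 70
d_model = 14336
n_head = 112          # head dimension = 14336 / 112 = 128
vocab_size = 250_000  # assumed; use the tokenizer's actual size in practice

transformer_params = 12 * n_layer * d_model**2   # ~172.6B
embedding_params = vocab_size * d_model          # ~3.6B
total = transformer_params + embedding_params

print(f"{total / 1e9:.0f}B parameters")          # ~176B, matching the reported size
```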

How did BigScience come about?

Scaling laws

First, the researchers worked out the scaling laws and computed the upper bound on the "best" model the budget could support: roughly 392 billion parameters trained on roughly 165 billion tokens.

But scaling laws do not account for serving/inference cost, downstream task performance, and so on. In addition, the study needed to ensure that low-resource languages would still get enough tokens during pre-training; the researchers did not want the BigScience model to be left doing zero-shot learning on entire languages, so they decided to pre-train on at least 300-400 billion tokens.


Compute

Back to the budget: GENCI, France's national high-performance computing agency, granted the project 384 NVIDIA A100 80GB GPUs for 18 weeks on the Jean Zay supercomputer, a total of 1,161,261 A100-hours!

It is worth mentioning that Jean Zay is a French supercomputer commissioned in 2019, with hardware supplied by HPE; after an expansion in 2020, its peak performance reached 28 petaflops. Because it draws its power from the French grid, the machine runs largely on nuclear energy. To further reduce the environmental impact of training, the waste heat from the hardware is even used to heat buildings on the campus.

Before formal development began, the researchers assessed what model size was feasible to train and considered the safety aspects of the setup. The final assessment: a model of roughly 175 billion parameters, whose corresponding token count could reach or even exceed 400 billion.
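
A rough way to see how this conclusion follows from the budget is the common ~6 FLOPs-per-parameter-per-token rule of thumb for training. The sketch below applies that approximation to the figures reported in the article; it is not necessarily the team's own calculation.

```python
# Hedged sanity check: does ~175B parameters x ~400B tokens fit the GPU budget?
# Uses the common approximation of ~6 FLOPs per parameter per token.

gpu_hours = 1_161_261            # 384 A100s for 18 weeks, as reported
flops_per_gpu = 150e12           # sustained per-GPU throughput reported above
available_flops = gpu_hours * 3600 * flops_per_gpu   # ~6.3e23 FLOPs

params = 175e9
tokens = 400e9
needed_flops = 6 * params * tokens                   # ~4.2e23 FLOPs

print(f"budget: {available_flops:.2e} FLOPs, needed: {needed_flops:.2e} FLOPs")
# The budget comfortably covers ~400B tokens at ~175B parameters.
```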


Before training, the researchers also analyzed how other models with more than 100 billion parameters were built. There is likewise a good deal of prior work on how model quality changes with scale, in particular Kaplan et al. (2020) and Levine et al. (2020).


Speed

Finally, BigScience engineer Stas Bekman benchmarked hundreds of configurations to find the fastest one. You can read more about this on the project website. It all comes down to finding a set of "magic numbers" and avoiding effects such as tile/wave quantization.

The project ended up with three promising configurations. Configuration (1) was excluded first because its attention heads were too large, and configuration (3) was ultimately chosen because it was faster than (2). Speed matters: every increase in throughput means more total compute, which means more pre-training tokens and a better model.
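
For intuition, tile/wave quantization is about keeping matrix-multiply shapes aligned with the GPU's tile sizes and SM count. The sketch below only illustrates the kind of divisibility checks involved; the thresholds are plausible assumptions, not the team's actual benchmarking criteria, and the real selection was done by measuring hundreds of configurations.

```python
# Illustrative only: simple shape checks of the kind that help avoid tile/wave
# quantization penalties on A100s.

def looks_gemm_friendly(d_model: int, n_head: int) -> bool:
    if d_model % n_head != 0:
        return False                  # heads must divide the hidden size evenly
    head_dim = d_model // n_head
    return (
        head_dim % 8 == 0             # tensor-core friendly head dimension
        and d_model % 128 == 0        # aligns with typical GEMM tile sizes
        # Wave quantization would additionally consider how the resulting tile
        # count maps onto the A100's 108 SMs, which only benchmarking captures.
    )

print(looks_gemm_friendly(14336, 112))   # True: 112 heads of dimension 128
```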


As for checkpoints, the BigScience model's bf16 weights alone take up 329GB, and a full checkpoint with optimizer state comes to 2.3TB.
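
Those checkpoint sizes are consistent with simple byte arithmetic, shown below under the assumption of a mixed-precision Adam-style optimizer (bf16 weights plus an fp32 master copy and two fp32 moments); the exact optimizer layout is an assumption, not stated in the article.

```python
# Rough check of the reported checkpoint sizes.

params = 176e9
GiB = 2**30

weights_bytes = params * 2               # bf16 weights: 2 bytes per parameter
full_bytes = params * (2 + 4 + 4 + 4)    # + fp32 master copy + two Adam moments

print(f"weights: {weights_bytes / GiB:.0f} GiB")              # ~328 GiB vs. 329GB reported
print(f"full checkpoint: {full_bytes / GiB / 1024:.1f} TiB")  # ~2.2 TiB vs. 2.3TB reported
```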

Training of BigScience's 176-billion-parameter model began at 11:42 a.m. US West Coast time on March 11, 2022.

Dataset

The project uses a terabyte-scale multilingual dataset containing 1.5TB (350 billion tokens) of text. How much data is that? Printed on A4 paper, it would stack as high as 141 Eiffel Towers or 5 Mount Everests.
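
Data at this scale is typically consumed as a stream rather than loaded into memory. Below is a minimal sketch using the Hugging Face `datasets` library; the dataset identifier and the `text` column are placeholders, not the project's actual corpus (which is described in the blog post linked at the end of this section).

```python
# Minimal sketch: iterate over a TB-scale corpus without downloading it whole.
from datasets import load_dataset

stream = load_dataset(
    "some-org/multilingual-corpus",   # hypothetical ID; replace with a real dataset
    split="train",
    streaming=True,                   # stream examples instead of loading 1.5TB
)

for i, example in enumerate(stream):
    print(example["text"][:80])       # assumes the corpus exposes a "text" column
    if i >= 2:
        break
```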


To build this dataset, the project team members worked together to accomplish the following:

The Data Governance Group helped define the core values guiding the data effort and proposed a new international data governance structure, including a set of supporting technical and legal tools;

The Data Sources Group organized hackathons around the world, helping participants use local expertise to build a catalog of resources covering 246 languages and to prepare a list of 605 relevant websites;

The Privacy Working Group developed classifications and strategies for reducing privacy risks;

The Legal Academic Group produced a set of legal handbooks covering nine jurisdictions with differing privacy and data protection regulations, helping ML practitioners understand the legal context in which they work.

Because the data is so large, fully automated filtering of the entire corpus would have effects that are hard to control, while manually inspecting data samples for good insight is also a challenge. To address this and to make the data selection process more understandable and explainable, the team prioritized the following approaches in its work:

1) Build tools that support human decision-making at scale rather than full automation, striking a balance between manual and automated work.

2) Fewer languages, more language expertise: focus on languages and language groups to which sufficient resources can be devoted.


The following blog post explains more details about the dataset: https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling

Finally, readers following this project can track live updates from this account: https://twitter.com/BigScienceLLM

Reference links: https://www.reddit.com/r/MachineLearning/comments/tfm7zb/n_live_and_open_training_of_bigsciences_176b/

