Colossal-AI, a massively parallel AI training system, is designed as the core of a deep learning framework that helps users maximize AI training and deployment efficiency while minimizing costs.
Open source address: https://github.com/hpcaitech/ColossalAI
Colossal-AI attracted widespread attention as soon as it was open-sourced, ranking first among Python projects on the GitHub trending list for several consecutive days, alongside many star open-source projects with tens of thousands of stars.
After months of intensive testing and continuous effort by its developers, Colossal-AI has reached its official release, comprising more than 300 commits.
This official release focuses on optimizing distributed training performance and ease of use for developers. Key highlights include:
- A refactored ZeRO for better performance and ease of use
- Fine-grained Profiler TensorBoard plug-ins to monitor memory, network, and other states during training
- More flexible checkpoint policies and extensible pipeline modules
- Rich industry solutions, such as the open-source protein structure prediction accelerator FastFold
- New Chinese tutorials and examples such as MoE and BERT, plus open user communities and forums
Professional assistance in large model training
In recent years, with the rise of deep learning and large models sweeping the major benchmark leaderboards, the size of cutting-edge AI models has grown 10,000-fold in just a few years, far outpacing the modest several-fold growth of hardware. Cutting-edge AI models not only far exceed the memory capacity of a single GPU, but would also require hundreds or even thousands of years of a single GPU's compute to train.
Therefore, how to increase the effective capacity of a single GPU, how to use distributed techniques efficiently, and how to combine multiple GPUs for low-cost parallel training acceleration have become key pain points of large AI model training.
Addressing the pain points of existing solutions, such as limited parallel dimensions, low efficiency, poor generality, difficult deployment, and lack of maintenance, Colossal-AI enables efficient and rapid large-model training through multi-dimensional parallelism, GPU memory optimization, a large-scale optimizer library, fine-grained monitoring, and more.
Multidimensional parallelism
Compared with existing schemes whose parallel dimensions include only data parallelism, one-dimensional tensor parallelism, and pipeline parallelism, Colossal-AI further provides 2D/2.5D/3D tensor parallelism and sequence parallelism, as well as convenient multi-dimensional hybrid parallel solutions.
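To make the idea of hybrid parallelism concrete, here is a minimal sketch in the style of Colossal-AI's legacy configuration dict. The key names (`pipeline`, `tensor`, `mode`) follow that convention, but exact keys vary between versions, so treat this as illustrative rather than canonical:

```python
# Hypothetical hybrid-parallel configuration in Colossal-AI's legacy
# config-dict style (illustrative only; exact keys vary by version).
parallel = dict(
    pipeline=2,                      # split the model into 2 pipeline stages
    tensor=dict(size=4, mode='2d'),  # 4-way 2D tensor parallelism per stage
)

# The remaining GPUs form the data-parallel dimension:
# data_parallel = world_size / (pipeline * tensor)
world_size = 16
data_parallel_size = world_size // (parallel['pipeline'] * parallel['tensor']['size'])
# With 16 GPUs: 16 / (2 * 4) = 2-way data parallelism
```

The three dimensions multiply together, which is why combining them lets a fixed GPU budget be traded off between model size (tensor/pipeline) and throughput (data parallelism).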
△ On ViT with 64-way tensor parallelism, Colossal-AI can increase the batch size by 14× and the training speed by 5×
Among them, high-dimensional tensor parallelism greatly reduces GPU memory consumption, improves communication efficiency, and makes utilization of computing resources more efficient.
Sequence parallelism helps BERT train 2× faster, or handle 1.5× longer sequences.
For data such as large images, video, long text, and long-term medical monitoring records, sequence parallelism can break through the original single-machine capacity limit and process long-sequence data directly.
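The core data-partitioning idea behind sequence parallelism can be sketched in a few lines of plain Python: the sequence dimension itself is split across devices, so each device stores activations for only its own chunk. (Real sequence parallelism, e.g. ring self-attention, also exchanges partial attention results between devices; that communication is omitted here.)

```python
# Toy sketch: split a long token sequence into contiguous, near-equal
# chunks, one per device, so per-device activation memory shrinks as
# devices are added. Communication between chunks is intentionally omitted.
def split_sequence(tokens, num_devices):
    """Return one contiguous chunk of `tokens` per device."""
    chunk = (len(tokens) + num_devices - 1) // num_devices  # ceiling division
    return [tokens[i * chunk:(i + 1) * chunk] for i in range(num_devices)]

chunks = split_sequence(list(range(10)), 4)
# -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
# Each device now holds roughly 1/4 of the sequence.
```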
GPU memory optimization
Colossal-AI integrates multiple memory-optimization techniques, including multi-dimensional parallelism, ZeRO redundancy elimination, CPU offloading, gradient checkpointing, and automatic mixed precision (AMP), to help users avoid GPU memory bottlenecks as much as possible and reduce the hardware requirements for training.
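Of these techniques, gradient checkpointing is the easiest to illustrate without a GPU: instead of caching every layer's activation for the backward pass, only periodic checkpoints are stored and the rest are recomputed when needed, trading compute for memory. The toy "layers" below are hypothetical; a real implementation (e.g. PyTorch's `torch.utils.checkpoint`) operates on autograd graphs:

```python
# Minimal illustration of activation checkpointing: cache an activation
# only every `every` layers; anything between checkpoints would be
# recomputed from the nearest checkpoint during the backward pass.
def forward_with_checkpoints(x, layers, every=2):
    """Run `layers` on x, caching activations only every `every` layers."""
    cached = {0: x}  # checkpointed activations, keyed by layer index
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % every == 0:
            cached[i + 1] = x
    return x, cached

# Four toy "layers" standing in for network blocks.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * 10]
out, cached = forward_with_checkpoints(0, layers, every=2)
# out == ((0 + 1) * 2 - 3) * 10 == -10
# Only the outputs of layers 2 and 4 (plus the input) are cached,
# halving activation storage at the cost of some recomputation.
```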
△ With Colossal-AI, GPT-2 can be trained with a 24× larger model size, or 3× faster, on the same hardware
Flexible and easy to use
The Colossal-AI interface is designed to be consistent with PyTorch style, reducing learning and usage costs; existing projects can adopt Colossal-AI with minimal modifications and easily scale to massive parallelism. The system also maintains excellent extensibility, making it easy to add new features on demand while remaining compatible with existing modules.
Fine-grained monitoring: whereas PyTorch can only record the training process at a coarse granularity, Colossal-AI's fine-grained Profiler TensorBoard plug-in can monitor network, communication, memory, and other states during training, making it easier for developers to analyze and debug precisely and improving development efficiency.
Large-scale optimizer library: Colossal-AI provides the massively parallel optimizers LAMB, LARS, and others, which for the first time expanded the training batch size to 65,536. Colossal-AI is also compatible with PyTorch's built-in optimizers, and continues to explore and add the latest cutting-edge optimization techniques to meet the needs of various models.
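The reason LARS tolerates such large batches is its layer-wise adaptive learning-rate scale. Below is the published LARS update rule written in plain Python as a sketch; this is not Colossal-AI's actual implementation, and the coefficients are just common illustrative defaults:

```python
import math

# Sketch of the LARS (layer-wise adaptive rate scaling) update rule.
# Each layer gets its own learning-rate scale proportional to the ratio
# of weight norm to gradient norm, keeping updates well-conditioned
# even at very large batch sizes.
def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=1e-4):
    """Per-layer scale: trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm)

def lars_step(weights, grads, base_lr=0.1, **kw):
    """One plain-SGD step with the LARS layer-wise scale applied."""
    scale = lars_local_lr(weights, grads, **kw)
    return [w - base_lr * scale * g for w, g in zip(weights, grads)]

new_w = lars_step([3.0, 4.0], [0.0, 1.0])
```

Because the scale shrinks when gradients are large relative to the weights, layers with noisy gradients take proportionally smaller steps, which is what keeps training stable as the batch size grows.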
Rich industry solutions
Colossal-AI has established cooperation with well-known companies in autonomous driving, cloud computing, retail, medicine, chips, and other industries, as well as with Hugging Face, a top open-source organization in the AI field.
Protein Structure Prediction Acceleration Scheme: FastFold
AlphaFold was named by Science and Nature as the top scientific breakthrough of 2021 for its powerful AI-based protein structure prediction, but it suffers from long training times and high costs.
△Image source: https://arxiv.org/pdf/2203.00854.pdf
FastFold, an acceleration scheme based on Colossal-AI, brings GPU optimization and large-model training techniques into AlphaFold's training and inference. It surpasses the schemes from Google and Columbia University, reducing AlphaFold training time from 11 days to 67 hours at lower total cost, and also achieves a 9.3× to 11.6× speedup in long-sequence inference.
△ Long sequence inference performance comparison
Training GPT-3 with half the GPUs
For very large AI models such as GPT-3, Colossal-AI needs only half the computing resources of NVIDIA's solution to start training; with the same computing resources, it trains 11% faster, which can reduce the cost of training GPT-3 by more than one million US dollars.
Colossal-AI also focuses on building its open-source community: it provides Chinese tutorials and open user communities and forums, iterates efficiently on user feedback, and continually adds cutting-edge features such as MoE support.
Project team
The core members of the Luchen Technology team come from well-known universities at home and abroad, including the University of California, Berkeley, Stanford University, Tsinghua University, Peking University, the National University of Singapore, and Nanyang Technological University, with work experience at Google Brain, IBM, Intel, Microsoft, NVIDIA, and other well-known companies. Upon founding, the company obtained seed-round investment from top VC firms including Sinovation Ventures and ZhenFund.
△ Professor You Yang, founder of Luchen Technology: Ph.D. from the University of California, Berkeley; IPDPS/ICPP Best Paper; ACM/IEEE George Michael HPC Fellowship; Forbes 30 Under 30 (Asia, 2021); IEEE-CS Supercomputing Outstanding Newcomer Award; UC Berkeley EECS Lotfi A. Zadeh Prize for outstanding graduates
CSO Prof. James Demmel: Distinguished Professor at the University of California, Berkeley; ACM/IEEE Fellow; member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences
Portal
Address of the paper: https://arxiv.org/abs/2110.14883
Project Address: https://github.com/hpcaitech/ColossalAI
Document Address: https://www.colossalai.org/
*Reference links for this article:
https://medium.com/@hpcaitech/5-must-follow-features-that-are-seeing-colossal-ais-success-2d5361e27e4b
—Ends—